Describe your messy dataset. Get a step-by-step cleaning plan with Python or SQL code for handling missing values, duplicates, and inconsistent formats.
Describe your data issues (column names, data types, problems). Get ready-to-run code for your preferred tool.
Act as a senior data analyst. I need help cleaning and preprocessing a dataset before analysis.

Dataset description:
- Source: [WHERE THE DATA COMES FROM]
- Number of rows: [APPROXIMATE]
- Number of columns: [NUMBER]
- Column names and types: [LIST THEM, e.g., name (text), date (mixed formats), revenue (numbers with $ signs)]
- Tool I am using: [PYTHON PANDAS / SQL / EXCEL / R]

Known issues:
[DESCRIBE YOUR DATA PROBLEMS, e.g.:
- Date column has mixed formats (MM/DD/YYYY and YYYY-MM-DD)
- Price column has some entries with $ signs and commas
- About 15% of the email column is blank
- Duplicate rows based on customer_id
- Some names have extra whitespace or inconsistent capitalization]

For each issue, provide:
1. What the problem is and why it matters for analysis
2. The recommended approach to fix it
3. The exact code to implement the fix (in my preferred tool)
4. A validation check to confirm the fix worked

Also provide:
- A summary statistics check I should run after cleaning
- A data quality report template I can reuse
- Suggestions for any additional cleaning steps I might have missed
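To give a sense of the kind of answer this prompt is meant to produce, here is a minimal pandas sketch covering the five example issues listed under "Known issues". It is an illustration under stated assumptions, not a definitive implementation: the sample rows and the column names `name`, `signup_date`, `price`, and `email` are hypothetical (only `customer_id` comes from the prompt), and it assumes dates appear only as MM/DD/YYYY or YYYY-MM-DD and that the first row per `customer_id` is the one to keep.

```python
import pandas as pd

# Hypothetical sample rows reproducing the example issues from the prompt.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "name": ["  alice SMITH ", "Bob  Jones", "Bob  Jones", "carol lee"],
    "signup_date": ["01/15/2024", "2024-02-03", "2024-02-03", "03/09/2024"],
    "price": ["$1,200.50", "300", "300", "$45.00"],
    "email": ["a@example.com", "", "", None],
})


def parse_mixed_dates(s: pd.Series) -> pd.Series:
    """Parse a column that mixes YYYY-MM-DD and MM/DD/YYYY strings."""
    s = s.str.strip()
    iso = pd.to_datetime(s, format="%Y-%m-%d", errors="coerce")
    us = pd.to_datetime(s, format="%m/%d/%Y", errors="coerce")
    # Use the ISO parse where it succeeded, otherwise the US-style parse.
    return iso.fillna(us)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Names: trim, collapse repeated spaces, normalize capitalization.
    out["name"] = (
        out["name"].str.strip().str.replace(r"\s+", " ", regex=True).str.title()
    )
    # Dates: convert both known formats to a single datetime column.
    out["signup_date"] = parse_mixed_dates(out["signup_date"])
    # Prices: drop "$" and "," then convert; unparseable values become NaN.
    out["price"] = pd.to_numeric(
        out["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
    )
    # Emails: treat empty or whitespace-only strings as missing.
    email = out["email"].str.strip()
    out["email"] = email.mask(email == "")
    # Duplicates: keep the first row per customer_id (an assumption).
    out = out.drop_duplicates(subset="customer_id", keep="first")
    return out.reset_index(drop=True)


def validate(df: pd.DataFrame) -> dict:
    """Post-cleaning checks; counts include values that were missing originally."""
    return {
        "duplicate_ids": int(df["customer_id"].duplicated().sum()),
        "unparsed_dates": int(df["signup_date"].isna().sum()),
        "non_numeric_prices": int(df["price"].isna().sum()),
        "missing_emails": int(df["email"].isna().sum()),
    }


cleaned = clean(raw)
print(cleaned)
print(validate(cleaned))
```

Each cleaning step is paired with a count in `validate`, mirroring the prompt's request for a validation check per fix. Blank emails are marked as missing rather than filled in, since the right treatment (drop, impute, or keep as-is) depends on the analysis.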