Build a systematic data cleaning pipeline that transforms messy raw data into analysis-ready datasets with documented transformations.
Paste into any LLM. Describe your data source and quality issues. Use the pipeline to standardize your data preparation process.
You are a data engineering specialist who has cleaned and prepared datasets for Fortune 500 analytics teams, handling everything from missing values to complex entity resolution across millions of records.

[DATA SOURCE]: Where your data comes from (CSV, database, API, etc.)
[DATA SIZE]: Approximate row and column count
[DATA TYPES]: Types of fields (numeric, categorical, text, dates, etc.)
[KNOWN ISSUES]: Missing values, duplicates, inconsistencies, etc.
[ANALYSIS GOAL]: What you plan to do with the clean data
[TOOLS]: Python/Pandas, R, SQL, Excel, etc.

Build a comprehensive data cleaning pipeline:

**1. Data Profiling**
- Initial shape and structure assessment
- Column-by-column data type verification
- Missing value analysis (percentage, patterns, MCAR/MAR/MNAR)
- Unique value counts and distributions
- Statistical summary (mean, median, std, quartiles)
- Outlier detection methodology
- Baseline data quality score

**2. Missing Value Treatment**
- Strategy by column: drop, impute, or flag
- Imputation methods: mean, median, mode, forward-fill, regression, KNN
- When to drop rows vs. columns
- Missing-indicator columns for model features
- Validation of imputation impact

**3. Deduplication**
- Exact duplicate identification
- Fuzzy matching for near-duplicates
- Merge rules when duplicates are found
- Record linkage across datasets
- Deduplication logging for an audit trail

**4. Standardization**
- Date format standardization
- String cleaning (whitespace, case, special characters)
- Categorical value standardization (mapping variants)
- Unit conversion and normalization
- Address and name standardization
- Phone and email format validation

**5. Transformation**
- Feature encoding (one-hot, label, ordinal)
- Binning and discretization
- Log and power transformations for skewed data
- Aggregation and pivot operations
- Derived feature creation
- Text preprocessing (tokenization, stemming, stopwords)

**6. Validation and Documentation**
- Pre/post cleaning comparison metrics
- Data quality checks after each step
- Transformation log documentation
- Reproducible pipeline code structure
- Data dictionary generation
- Quality monitoring for ongoing data feeds
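To make the six stages concrete, here is a minimal Python/Pandas sketch of the pipeline applied to a toy dataset. The column names (`signup_date`, `email`, `plan`, `monthly_spend`) and the tiny inline DataFrame are purely illustrative, not part of the template; a real pipeline would also cover fuzzy matching, unit conversion, and the other items listed above.

```python
import pandas as pd
import numpy as np

# Hypothetical messy customer records; columns are illustrative only.
raw = pd.DataFrame({
    "signup_date":   ["2023-01-05", "2023-01-07", None, "2023-01-05"],
    "email":         ["A@X.COM", " a@x.com", "b@y.com", "A@X.COM"],
    "plan":          ["Pro", "basic", "PRO ", "Pro"],
    "monthly_spend": [100.0, np.nan, 250.0, 100.0],
})

# 1. Profiling: shape and per-column missing-value percentages
profile = {
    "shape": raw.shape,
    "missing_pct": raw.isna().mean().round(3).to_dict(),
}

df = raw.copy()

# 4. Standardization first, so deduplication compares canonical values
df["email"] = df["email"].str.strip().str.lower()
df["plan"] = df["plan"].str.strip().str.lower()
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")

# 2. Missing values: add an indicator column, then impute with the median
df["spend_missing"] = df["monthly_spend"].isna()
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())

# 3. Deduplication: drop exact duplicates on the standardized key columns,
#    logging how many rows were removed for the audit trail
rows_before = len(df)
df = df.drop_duplicates(subset=["email", "signup_date"], keep="first")
dedup_log = {"rows_removed": rows_before - len(df)}

# 5. Transformation: one-hot encode the categorical plan column
df = pd.get_dummies(df, columns=["plan"], prefix="plan")

# 6. Validation and documentation: pre/post comparison metrics
report = {
    "rows_in": raw.shape[0],
    "rows_out": df.shape[0],
    **dedup_log,
}
```

Note the ordering choice: standardization runs before deduplication so that variants like `A@X.COM` and `a@x.com` collapse to the same key, and every destructive step (imputation, row removal) is logged so the `report` dictionary can serve as the pre/post comparison required in stage 6.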