5 Steps in Data Cleansing: Complete Process Guide
Learn the 5 essential steps in data cleansing. From data profiling to validation, discover how to clean messy data systematically and prepare it for analysis.
If you're cleaning data without a structured process, you're missing errors, wasting time, and risking bad decisions. 88% of data analysts report that unstructured data cleaning leads to inaccurate results and rework.
By the end of this guide, you'll know the 5 essential steps in data cleansing and how to execute each step effectively to transform messy data into analysis-ready datasets.
Quick Summary
- Step 1: Data Profiling - Understand your data and identify issues
- Step 2: Data Standardization - Normalize formats and values
- Step 3: Data Validation - Check accuracy and completeness
- Step 4: Data Deduplication - Remove duplicate records
- Step 5: Data Enrichment - Fill gaps and enhance data quality
Common Problems Without Structured Data Cleansing
- Missed errors - Data quality issues slip through undetected
- Inconsistent cleaning - Different methods each time produce unpredictable results
- Wasted time - Hours go to manual fixes that could be automated
- Incomplete data - Missing values and gaps go unidentified
- Duplicate records - Undetected duplicates skew analysis
- Format inconsistencies - Mixed date formats, number formats, and text cases
- Invalid data - Values are never checked against business rules
- No documentation - Cleaning steps can't be reproduced
- Data loss - Valid data is accidentally deleted during cleaning
- Poor data quality - "Clean" data is still inaccurate
The 5 Steps in Data Cleansing
Step 1: Data Profiling
Before cleaning, understand what you're working with. Data profiling identifies data quality issues and patterns.
What to Profile:
Data Structure:
- Number of rows and columns
- Column names and data types
- Missing values count
- Unique values count
Data Quality Issues:
- Duplicate records
- Invalid formats
- Outliers and anomalies
- Inconsistent values
Excel Profiling Formulas:
Count Missing Values:
=COUNTBLANK(A2:A1000)
Count Unique Values:
=COUNTA(UNIQUE(A2:A1000))
Detect Invalid Formats:
=IF(ISNUMBER(A2), "Valid Number", "Invalid")
Find Outliers (using IQR method):
=IF(OR(A2<QUARTILE($A$2:$A$1000,1)-1.5*(QUARTILE($A$2:$A$1000,3)-QUARTILE($A$2:$A$1000,1)), A2>QUARTILE($A$2:$A$1000,3)+1.5*(QUARTILE($A$2:$A$1000,3)-QUARTILE($A$2:$A$1000,1))), "Outlier", "Normal")
Python Profiling (Optional):
import pandas as pd
# Load data
df = pd.read_excel('data.xlsx')
# Basic profiling
print("Shape:", df.shape)
print("\nData Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())
print("\nUnique Values:")
print(df.nunique())
print("\nDescriptive Statistics:")
print(df.describe())
# Detect duplicates
print("\nDuplicate Rows:", df.duplicated().sum())
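The pandas profile above does not flag outliers. As a minimal sketch, the same IQR rule used in the Excel formula can be written with Python's standard library. The `prices` list is a hypothetical sample column, and `method="inclusive"` is chosen because it should interpolate quartiles the same way Excel's QUARTILE does.

```python
import statistics

def find_outliers(values):
    """Return the values outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR."""
    # Inclusive quartiles, intended to mirror Excel's QUARTILE
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

prices = [10, 12, 11, 13, 12, 11, 95]  # hypothetical sample column
print(find_outliers(prices))
```

Flagged values are candidates for review, not automatic deletion. A genuine high-value order looks the same as a typo to this rule.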
Create Profiling Report:
| Column | Data Type | Missing | Unique | Duplicates | Issues |
|---|---|---|---|---|---|
| Product Name | Text | 5 | 450 | 12 | Mixed case |
| Price | Number | 0 | 380 | 0 | Some negative |
| Date | Date | 8 | 200 | 0 | Mixed formats |
| Category | Text | 15 | 25 | 0 | Inconsistent |
RowTidy Usage:
RowTidy automatically profiles your data:
- Detects data types
- Identifies missing values
- Finds duplicates
- Highlights format inconsistencies
- Generates data quality report
Time saved: 2 hours of manual profiling → 30 seconds
Step 2: Data Standardization
Standardize formats, values, and structures to ensure consistency.
Standardize Text Case
Wrong Examples:
laptop stand
LAPTOP STAND
Laptop Stand
Laptop stand
Right Example:
Laptop Stand
Excel Formula:
=PROPER(A2)
If you need sentence case instead (only the first letter capitalized, e.g. "Laptop stand"), use:
=UPPER(LEFT(A2,1))&LOWER(MID(A2,2,LEN(A2)))
Standardize Dates
Wrong Examples:
11/19/2025
2025-11-19
Nov 19, 2025
19-Nov-2025
Right Example:
2025-11-19
Excel Formula:
=TEXT(DATEVALUE(A2), "YYYY-MM-DD")
For mixed formats, use:
=IF(ISNUMBER(A2), TEXT(A2, "YYYY-MM-DD"), TEXT(DATEVALUE(A2), "YYYY-MM-DD"))
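For mixed date formats outside Excel, one possible approach is to try each known format in turn. The sketch below assumes the four formats from the examples above and treats slash dates as month-first (US style). `KNOWN_FORMATS` and `to_iso_date` are illustrative names.

```python
from datetime import datetime

# Formats taken from the examples above; extend the list for your own data.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%b %d, %Y", "%d-%b-%Y"]

def to_iso_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next format
    return None

for raw in ["11/19/2025", "2025-11-19", "Nov 19, 2025", "19-Nov-2025"]:
    print(raw, "->", to_iso_date(raw))
```

Returning `None` for unparseable values keeps them visible for manual review instead of silently guessing.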
Standardize Numbers
Remove currency symbols and format:
=VALUE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2,"$",""),",","")," ",""))
Format to 2 decimal places:
=ROUND(A2, 2)
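The same two steps (strip symbols, then round to 2 decimals) might look like this in Python. This is a sketch that assumes US-style prices with `$` and comma thousands separators. `parse_price` is an illustrative helper, and `Decimal` is used to avoid floating-point surprises when rounding money.

```python
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP

def parse_price(text):
    """Strip '$', commas and spaces, then round to 2 decimal places."""
    cleaned = text.replace("$", "").replace(",", "").replace(" ", "")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None  # not a number; leave for manual review
    # ROUND_HALF_UP rounds ties away from zero, like Excel's ROUND
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(parse_price("$1,299.505"))
print(parse_price("$30"))
```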
Standardize Text (Remove Extra Spaces)
Excel Formula:
=TRIM(CLEAN(A2))
TRIM() removes leading and trailing spaces and collapses repeated spaces between words into one
CLEAN() removes non-printable characters (ASCII codes 0-31)
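A rough Python counterpart to TRIM, CLEAN, and PROPER is sketched below. `clean_text` and `standardize_name` are illustrative names. One caveat: `str.title()` also capitalizes the letter after an apostrophe, so it may need adjusting for names like "Men's".

```python
def clean_text(text):
    """Collapse runs of whitespace and drop non-printable characters."""
    collapsed = " ".join(text.split())
    return "".join(ch for ch in collapsed if ch.isprintable())

def standardize_name(text):
    """Clean the text, then capitalize the first letter of each word."""
    return clean_text(text).title()

print(standardize_name("  LAPTOP\x07   stand "))
```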
Standardize Categories
Create lookup table for category normalization:
| Original | Standardized |
|---|---|
| Electronics | Electronics |
| Electronic | Electronics |
| Elec | Electronics |
| E-Products | Electronics |
Excel VLOOKUP:
=VLOOKUP(A2, Lookup_Table, 2, FALSE)
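The lookup-table idea translates directly to a dictionary. The sketch below uses the mappings from the table above. Lowercasing the key before lookup is an added assumption so that "ELEC" and "elec" resolve to the same entry.

```python
# Keys are lowercased variants; values are the standardized category.
CATEGORY_MAP = {
    "electronics": "Electronics",
    "electronic": "Electronics",
    "elec": "Electronics",
    "e-products": "Electronics",
}

def standardize_category(raw, default=None):
    """Return the standardized category, or `default` if the value is unmapped."""
    key = raw.strip().lower()
    return CATEGORY_MAP.get(key, default)

print(standardize_category(" Elec "))
print(standardize_category("Toys"))  # unmapped values come back as None
```

Unmapped values return `None` here so they can be reviewed and added to the table, much like an `#N/A` from VLOOKUP.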
Standardize Addresses
Street suffix standardization:
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2,"St.","Street"),"Ave.","Avenue"),"Rd.","Road"),"Blvd.","Boulevard"),"Dr.","Drive")
RowTidy Usage:
RowTidy standardizes data automatically:
- Normalizes text case
- Standardizes date formats
- Formats numbers and currencies
- Removes extra spaces
- Standardizes categories using AI
- Normalizes addresses
Time saved: 4 hours of manual standardization → 1 minute
Step 3: Data Validation
Validate data against business rules and constraints to ensure accuracy.
Validate Email Addresses
Excel Formula:
=IF(AND(ISNUMBER(SEARCH("@",A2)), ISNUMBER(SEARCH(".",A2,SEARCH("@",A2))), LEN(A2)-LEN(SUBSTITUTE(A2,"@",""))=1), "Valid", "Invalid")
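The same three checks (exactly one "@", a dot after it, no spaces) can be expressed as a single regular expression. This is a deliberately loose sketch and not a full RFC-compliant validator. `is_valid_email` is an illustrative name.

```python
import re

# One "@", no whitespace, and at least one dot in the domain part.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def is_valid_email(text):
    return EMAIL_PATTERN.fullmatch(text.strip()) is not None

for value in ["john@example.com", "john@example", "-"]:
    print(value, "->", "Valid" if is_valid_email(value) else "Invalid")
```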
Validate Phone Numbers
US Phone Format Validation:
=IF(LEN(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A2,"(",""),")",""),"-","")," ",""))=10, "Valid", "Invalid")
Validate Tax IDs
EIN Format (XX-XXXXXXX):
=IF(AND(LEN(SUBSTITUTE(A2,"-",""))=9, ISNUMBER(VALUE(SUBSTITUTE(A2,"-","")))), "Valid", "Invalid")
Validate ZIP Codes
US ZIP (5 digits or 5+4):
=IF(OR(LEN(SUBSTITUTE(A2,"-",""))=5, LEN(SUBSTITUTE(A2,"-",""))=9), "Valid", "Invalid")
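The phone, EIN, and ZIP formulas above check length only, so a value like "ABCDE" would pass the ZIP test. A sketch of stricter versions in Python, which also require digits, could look like this. The function names are illustrative.

```python
import re

def is_valid_us_phone(text):
    """10 digits after removing parentheses, hyphens and spaces."""
    digits = re.sub(r"[()\-\s]", "", text)
    return re.fullmatch(r"\d{10}", digits) is not None

def is_valid_ein(text):
    """EIN as XX-XXXXXXX (hyphen optional)."""
    return re.fullmatch(r"\d{2}-?\d{7}", text.strip()) is not None

def is_valid_zip(text):
    """US ZIP as 5 digits or ZIP+4."""
    return re.fullmatch(r"\d{5}(-\d{4})?", text.strip()) is not None

print(is_valid_us_phone("(555) 123-4567"), is_valid_ein("12-3456789"), is_valid_zip("12345-6789"))
```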
Validate Ranges
Price must be positive:
=IF(A2>0, "Valid", "Invalid")
Quantity must be between 1 and 1000:
=IF(AND(A2>=1, A2<=1000), "Valid", "Invalid")
Validate Dates
Date must be in the past:
=IF(A2<TODAY(), "Valid", "Invalid")
Date must be within last year:
=IF(AND(A2>=TODAY()-365, A2<=TODAY()), "Valid", "Invalid")
Validate Required Fields
Check if required fields are populated:
=IF(AND(A2<>"", B2<>"", C2<>""), "Complete", "Incomplete")
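The range, date, and required-field rules above can be combined into one function that reports every problem with a record. The sketch below assumes each record is a dictionary. The field names (`name`, `price`, `quantity`, `order_date`) are hypothetical and should be replaced with your own columns.

```python
from datetime import date

REQUIRED_FIELDS = ("name", "price", "quantity", "order_date")  # hypothetical columns

def validate_record(record, today=None):
    """Return a list of rule violations; an empty list means the record passed."""
    today = today or date.today()
    errors = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            errors.append(f"{field} is missing")
    price = record.get("price")
    if isinstance(price, (int, float)) and price <= 0:
        errors.append("price must be positive")
    quantity = record.get("quantity")
    if isinstance(quantity, int) and not 1 <= quantity <= 1000:
        errors.append("quantity must be between 1 and 1000")
    order_date = record.get("order_date")
    if isinstance(order_date, date) and order_date >= today:
        errors.append("order_date must be in the past")
    return errors

bad = {"name": "", "price": -5, "quantity": 0, "order_date": date(2999, 1, 1)}
print(validate_record(bad))
```

Collecting all errors, instead of stopping at the first, makes it easier to build the validation report that follows.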
Create Validation Report:
| Column | Validation Rule | Passed | Failed | Error Rate |
|---|---|---|---|---|
| Email | Valid format | 450 | 50 | 10% |
| Phone | 10 digits | 480 | 20 | 4% |
| Price | > 0 | 495 | 5 | 1% |
| Date | Valid date | 490 | 10 | 2% |
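Each row of a report like this is a pass/fail count for one rule. A minimal sketch of computing one row is shown below. `summarize` is an illustrative helper and the `prices` list is made-up sample data.

```python
def summarize(column, rule_name, values, rule):
    """Count how many values pass `rule` and compute the error rate."""
    passed = sum(1 for v in values if rule(v))
    failed = len(values) - passed
    error_rate = failed / len(values) if values else 0.0
    return {
        "column": column,
        "rule": rule_name,
        "passed": passed,
        "failed": failed,
        "error_rate": f"{error_rate:.0%}",
    }

prices = [29.99, 30.0, -5.0, 12.5]  # made-up sample data
row = summarize("Price", "> 0", prices, lambda p: p > 0)
print(row)
```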
RowTidy Usage:
RowTidy validates data automatically:
- Email format validation
- Phone number validation
- Tax ID validation
- Date range validation
- Business rule validation
- Generates validation report
Time saved: 3 hours of manual validation → 1 minute
Step 4: Data Deduplication
Remove duplicate records that skew analysis and reporting.
Identify Exact Duplicates
Excel Formula:
=COUNTIF($A$2:$A$1000, A2)>1
Returns TRUE for duplicate values.
Identify Duplicates Across Multiple Columns
Check if entire row is duplicate:
=COUNTIFS($A$2:$A$1000, A2, $B$2:$B$1000, B2, $C$2:$C$1000, C2)>1
Fuzzy Matching (Similar but Not Exact)
For similar names (first 5 characters):
=IF(COUNTIF($A$2:$A$1000, "*"&LEFT(A2,5)&"*")>1, "Possible Duplicate", "Unique")
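The prefix check above misses typos within the first five characters and flags unrelated names that happen to share a prefix. One alternative, sketched here with Python's built-in `difflib`, scores whole-string similarity for every pair of names. The 0.85 threshold is an assumption to tune on your own data, and the pairwise loop is only practical for small lists.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar_pairs(names, threshold=0.85):
    """Return (index_a, index_b, score) for each pair at or above the threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs

names = ["Laptop Stand", "Laptop Stnad", "Desk Lamp"]  # made-up sample names
print(similar_pairs(names))
```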
Remove Duplicates in Excel
Method 1: Remove Duplicates Tool
- Select data range
- Go to Data > Remove Duplicates
- Choose columns to check
- Click OK
Method 2: Advanced Filter
- Select data range
- Go to Data > Advanced Filter
- Check "Unique records only"
- Click OK
Python Deduplication (Optional):
import pandas as pd
# Load data
df = pd.read_excel('data.xlsx')
# Remove exact duplicates
df_unique = df.drop_duplicates()
# Fuzzy matching for similar records (requires: pip install fuzzywuzzy)
from fuzzywuzzy import fuzz
duplicates = []
for i, row1 in df.iterrows():
    for j, row2 in df.iterrows():
        if i < j:  # compare each pair only once
            similarity = fuzz.ratio(str(row1['Name']), str(row2['Name']))
            if similarity > 85:
                duplicates.append((i, j, similarity))
# Review and merge duplicates
Deduplication Strategy:
Which record to keep?
- Keep most recent record
- Keep record with most complete data
- Keep record with highest quality score
- Merge data from duplicates
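As an illustration of the "most complete data" strategy, the sketch below keeps, for each key, the record with the most filled-in fields. Treating `"-"` as a missing value is an assumption for this example. `keep_most_complete` and the sample records are hypothetical.

```python
def completeness(record):
    """Number of fields that hold a real value."""
    return sum(1 for value in record.values() if value not in (None, "", "-"))

def keep_most_complete(records, key_field):
    """For each distinct key, keep the record with the most filled fields."""
    best = {}
    for record in records:
        key = record[key_field]
        if key not in best or completeness(record) > completeness(best[key]):
            best[key] = record
    return list(best.values())

records = [
    {"name": "Laptop Stand", "email": "-"},
    {"name": "Laptop Stand", "email": "john@example.com"},
    {"name": "Desk Lamp", "email": ""},
]
print(keep_most_complete(records, "name"))
```

On a tie the first record seen wins. Swapping `completeness` for a date comparison would give the "most recent record" strategy instead.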
RowTidy Usage:
RowTidy deduplicates automatically:
- Detects exact duplicates
- Finds fuzzy duplicates (similar names)
- Suggests which records to keep
- Merges duplicate data intelligently
- Removes duplicates with one click
Time saved: 2 hours of manual deduplication → 30 seconds
Step 5: Data Enrichment
Fill gaps, enhance data quality, and add missing information.
Fill Missing Values
Strategy 1: Use Default Values
=IF(A2="", "N/A", A2)
Strategy 2: Use Previous Value (Forward Fill)
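=IF(A2="", B1, A2)
Enter this in B2 and fill down. Referencing B1 (the previous result) rather than A1 carries the last known value across several consecutive blanks.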
Strategy 3: Use Average/Median
=IF(A2="", AVERAGE($A$2:$A$1000), A2)
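The three strategies can be sketched in plain Python as follows, using `None` to represent a missing value. The function names are illustrative.

```python
import statistics

def fill_default(values, default="N/A"):
    """Strategy 1: replace missing values with a fixed default."""
    return [default if v is None else v for v in values]

def forward_fill(values):
    """Strategy 2: carry the last known value forward."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def fill_mean(values):
    """Strategy 3: replace missing values with the mean of the known ones."""
    mean = statistics.mean(v for v in values if v is not None)
    return [mean if v is None else v for v in values]

print(forward_fill([1, None, None, 4]))
print(fill_mean([10, None, 20]))
```

Forward fill leaves leading blanks as `None` because there is no earlier value to copy. For skewed columns, `statistics.median` is a drop-in alternative to the mean.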
Add Missing Categories
Infer category from product name:
=IF(ISNUMBER(SEARCH("laptop", LOWER(A2))), "Electronics", IF(ISNUMBER(SEARCH("desk", LOWER(A2))), "Furniture", "Other"))
Enrich with Lookup Data
Add region from ZIP code:
=VLOOKUP(LEFT(A2,3), ZIP_Region_Table, 2, FALSE)
Calculate Derived Fields
Calculate total from unit price and quantity:
=A2*B2
Calculate age from birth date:
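=DATEDIF(A2, TODAY(), "Y")
DATEDIF counts completed years, so the result stays correct when this year's birthday has not yet occurred. Subtracting years alone, as in YEAR(TODAY())-YEAR(A2), can overstate age by one.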
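Age in whole years depends on whether the birthday has already occurred this year. A small sketch of that check in Python is shown below. The `today` parameter is included only so the example is reproducible.

```python
from datetime import date

def age_in_years(birth_date, today=None):
    """Completed years between birth_date and today."""
    today = today or date.today()
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)

# The birthday is one day after the reference date, so the year is not yet complete.
print(age_in_years(date(1990, 11, 20), today=date(2025, 11, 19)))
```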
Data Quality Scoring
Create quality score:
=IF(AND(A2<>"", B2<>"", C2<>"", D2<>""), 100, IF(AND(A2<>"", B2<>""), 50, 0))
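The formula above yields only 0, 50, or 100. If a finer-grained score is useful, one option is the percentage of required fields that are filled, sketched below. `quality_score` and the field names are hypothetical.

```python
def quality_score(record, fields):
    """Percentage (0-100) of the given fields that hold a value."""
    filled = sum(1 for f in fields if record.get(f) not in (None, ""))
    return round(100 * filled / len(fields))

fields = ["name", "price", "date", "category"]  # hypothetical columns
record = {"name": "Laptop Stand", "price": 29.99, "date": "", "category": None}
print(quality_score(record, fields))
```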
RowTidy Usage:
RowTidy enriches data automatically:
- Fills missing values intelligently
- Infers categories from context
- Adds derived fields
- Calculates quality scores
- Enhances data with external sources (optional)
Time saved: 3 hours of manual enrichment → 1 minute
Real Example: Data Cleansing in Action
Before (Messy Data):
| Product Name | Price | Date | Category | Email |
|---|---|---|---|---|
| laptop stand | $29.99 | 11/19/2025 | Elec | john@example.com |
| LAPTOP STAND | 29.99 | 2025-11-19 | Electronic | john@example |
| Laptop Stand | $30 | Nov 19, 2025 | Elec | - |
| laptop stand | 29.99 | 11/19/2025 | Electronics | john@example.com |
After (Clean Data):
| Product Name | Price | Date | Category | Email | Status |
|---|---|---|---|---|---|
| Laptop Stand | 29.99 | 2025-11-19 | Electronics | john@example.com | Valid |
Changes Made:
- Step 1 (Profiling): Identified 4 records, 1 duplicate, 1 missing email
- Step 2 (Standardization): Normalized product name, price format, date format, category
- Step 3 (Validation): Validated email format, flagged invalid email
- Step 4 (Deduplication): Removed 3 duplicate records
- Step 5 (Enrichment): Kept record with complete data
Complete Data Cleansing Workflow
Workflow Summary:
- Profile → Understand data structure and issues
- Standardize → Normalize formats and values
- Validate → Check against business rules
- Deduplicate → Remove duplicate records
- Enrich → Fill gaps and enhance quality
Excel Template:
Create a data cleansing template with:
- Raw Data Sheet - Original data (never modify)
- Profiling Sheet - Data quality analysis
- Cleaned Data Sheet - Final clean data
- Validation Report - Error summary
- Log Sheet - Changes made
Automation with RowTidy:
RowTidy executes all 5 steps automatically:
- Auto-profiles your data
- Standardizes formats
- Validates against rules
- Deduplicates records
- Enriches missing data
Complete workflow time: 12 hours manually → 2 minutes with RowTidy
Mini Automation Using RowTidy
You can complete all 5 data cleansing steps in about 2 minutes using RowTidy's AI Recipes.
The Problem:
Manual data cleansing takes 10-15 hours for a typical dataset:
- Profiling and analysis
- Standardizing formats
- Validating data
- Removing duplicates
- Enriching missing values
The Solution:
RowTidy automates the entire data cleansing process:
- Upload your messy data (Excel, CSV, Google Sheets)
- AI profiles data - Identifies all issues automatically
- Auto-standardizes - Normalizes all formats
- Validates data - Checks against business rules
- Removes duplicates - Finds and merges duplicates
- Enriches data - Fills gaps intelligently
- Exports clean data - Ready for analysis
RowTidy Recipe for Data Cleansing:
- Upload messy dataset
- AI detects all data quality issues
- Automatically standardizes formats
- Validates data against rules
- Removes duplicates
- Enriches missing values
- Generates data quality report
- Exports clean dataset
Time saved: 12 hours of manual work → 2 minutes
Instead of spending days cleaning data manually, let RowTidy automate all 5 data cleansing steps. Try RowTidy's data cleansing automation.
FAQ
1. In what order should I perform the 5 data cleansing steps?
Follow this order: 1) Profile, 2) Standardize, 3) Validate, 4) Deduplicate, 5) Enrich. Profiling first helps you understand what needs cleaning.
2. How long does data cleansing take?
For 1,000 records: 2-4 hours manually, 2-5 minutes with automation. For 10,000+ records, automation is essential.
3. Should I clean data before or after merging datasets?
Clean individual datasets first, then merge. Cleaning merged data is harder because issues compound.
4. How do I handle missing data?
Options: 1) Remove records with missing critical fields, 2) Fill with default values, 3) Use statistical methods (mean, median), 4) Leave blank if acceptable.
5. What's the difference between data cleaning and data transformation?
Data cleaning fixes errors and inconsistencies. Data transformation changes data structure (pivoting, aggregating, joining). Clean first, then transform.
6. How do I validate data against business rules?
Create validation rules in Excel formulas or use tools like RowTidy. Rules can check: ranges, formats, relationships, completeness, accuracy.
7. Should I delete duplicates or merge them?
Merge duplicates when possible to preserve all information. Delete only if records are truly identical. Always back up your data before deleting.
8. How often should I clean my data?
Clean data: 1) Before major analysis, 2) When receiving new data, 3) Quarterly for ongoing datasets, 4) Before system migrations.
9. Can I automate data cleansing?
Yes. Use Excel macros, Power Query, Python scripts, or tools like RowTidy. Automation ensures consistency and saves time.
10. How do I document data cleansing steps?
Document: 1) Issues found, 2) Changes made, 3) Formulas/rules used, 4) Records removed/modified, 5) Final data quality metrics.
Conclusion
Following the 5 steps in data cleansing (profiling, standardization, validation, deduplication, and enrichment) ensures your data is accurate, consistent, and analysis-ready. A structured approach prevents errors, saves time, and produces reliable results.
Try RowTidy to automate all 5 data cleansing steps and transform messy data into clean datasets in minutes.