Introduction
Data cleaning in Python is a critical step in any data analysis workflow. Before diving into complex analysis or building machine learning models, it's essential to ensure that your data is clean, consistent, and ready for use. In this guide, we'll walk you through the key techniques and best practices for data cleaning in Pandas, one of the most powerful Python libraries for data manipulation. Whether you're dealing with missing values, duplicate entries, or incorrect data types, this guide will equip you with the tools you need to transform messy data into a valuable asset for your analysis.
Data Cleaning
Data cleaning in Python is the process of preparing raw data by fixing or removing incorrect, incomplete, or irrelevant parts to make it ready for analysis. Think of it as tidying up a messy spreadsheet so it's usable and reliable.
Why is Data Cleaning Important?
- Accuracy: Ensures your analysis or model is based on correct data.
- Efficiency: Makes workflows faster and smoother by removing junk data.
- Reproducibility: Clean data helps others understand and replicate your work.
Common Issues in Raw Data
- Missing Values: Blank cells, NaNs, or null values.
- Inconsistent Formatting: Mixed date formats, uppercase vs lowercase.
- Duplicate Entries: Repeated rows of data.
- Outliers: Extreme values that don't make sense (e.g., age = 999).
- Irrelevant Data: Unnecessary columns or rows.
- Typographical Errors: Spelling mistakes or incorrect codes.
Before getting further into exploratory data analysis (EDA), it's important to make sure the data is usable.
Loading the data
Unless you enter data manually, it's very likely that your data is stored in another file. Pandas can load data from a plethora of sources, but here are some of the most common:
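Here's a minimal sketch; the file names are hypothetical placeholders:

```python
import pandas as pd

# hypothetical file names, shown for illustration
df = pd.read_csv("data.csv")          # comma-separated values
df = pd.read_excel("data.xlsx")       # Excel spreadsheet
df = pd.read_json("data.json")        # JSON file
tables = pd.read_html("tables.html")  # returns a list of DataFrames
```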
Pandas loads data into a DataFrame. With HTML, Pandas loads each table into a separate DataFrame, so the output is a list of DataFrames. Before we go further, if you haven't already, check out this article on the types of data structures in Pandas.
Finding missing values
.isnull() – an introduction
.isnull() is the method we use to find missing values. It returns a DataFrame or Series where each value is replaced by a Boolean:
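For example, using a small hypothetical DataFrame (the column names here are made up for illustration and reused in the examples below):

```python
import pandas as pd
import numpy as np

# hypothetical data used throughout these examples
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Dana"],
    "Age": [25, np.nan, 30, 28],
    "City": ["Paris", "London", "Paris", None],
})

print(df.isnull())
#     Name    Age   City
# 0  False  False  False
# 1  False   True  False
# 2   True  False  False
# 3  False  False   True
```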
Filtering rows based on missing values
A good way to get started is to view all rows where any values are missing:
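Assuming the hypothetical df from above:

```python
# keep only rows that contain at least one missing value
df[df.isnull().any(axis=1)]
```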
This uses the .any() method, which, combined with axis=1 (the horizontal axis), checks whether any value in a row, across all columns, is missing.
To display all rows where there are null values in specific columns:
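For example, to check a single column or a subset of columns (column names here are hypothetical):

```python
# rows where "Age" is missing
df[df["Age"].isnull()]

# rows where any of several specific columns is missing
df[df[["Age", "City"]].isnull().any(axis=1)]
```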
Conversely, if you want to show all rows where there are no missing values:
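One way is to use .notnull(), the complement of .isnull():

```python
# rows with no missing values in any column
df[df.notnull().all(axis=1)]
```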
A fun, intuitive way to highlight missing values, if you're using Jupyter Notebook or another environment that supports HTML rendering:
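A sketch using the Styler API:

```python
# render the DataFrame with missing cells highlighted
df.style.highlight_null()
```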
Analytics
To find the total number of null values in each column, we use .isnull() together with .sum():
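For the hypothetical df above, this returns one count per column:

```python
df.isnull().sum()
# Name    1
# Age     1
# City    1
# dtype: int64
```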
To find the total number of null values in the entire DataFrame, we use .values.sum():
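This flattens the Boolean DataFrame into a NumPy array and sums it:

```python
df.isnull().values.sum()  # 3 for the example df above
```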
To calculate the percentage of missing values in each column:
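Divide the per-column counts by the number of rows, or, equivalently, take the mean of the Boolean values:

```python
df.isnull().sum() / len(df) * 100
# equivalent shortcut: df.isnull().mean() * 100
```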
To calculate the number of missing values in each row:
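Sum across the horizontal axis instead:

```python
df.isnull().sum(axis=1)
```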
To calculate the percentage of missing values in each row:
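Again using the Boolean mean, this time row-wise:

```python
df.isnull().mean(axis=1) * 100
```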
You can also use these two row-wise measures as filters for a DataFrame:
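For example (the thresholds here are arbitrary):

```python
# rows with fewer than 2 missing values
df[df.isnull().sum(axis=1) < 2]

# rows where less than half the values are missing
df[df.isnull().mean(axis=1) < 0.5]
```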
Handling missing values
There are numerous ways to handle missing values, but they fall into two categories:
- Remove rows (or even columns) with missing values
- Fill the missing value using a constant or calculated value
We won't go deep into the science of it here; for that, you can check out our article on data cleaning. Instead, we'll see how to apply these methods in Pandas.
Deletion
To delete a row with missing values:
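```python
df = df.dropna()  # keeps only rows without any missing values
```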
To delete a column with missing values:
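```python
df = df.dropna(axis=1)  # drops columns containing at least one missing value
```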
Imputation
Simple Imputation
Replacing missing values with the mean, median, or mode:
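A sketch using the hypothetical columns from earlier; the right statistic depends on the column's type and distribution:

```python
# numeric column: mean or median
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Age"] = df["Age"].fillna(df["Age"].median())

# categorical column: mode (most frequent value)
df["City"] = df["City"].fillna(df["City"].mode()[0])
```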
Constant
Replacing missing values with a constant:
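For example:

```python
df["Age"] = df["Age"].fillna(0)  # a single column
df = df.fillna("Unknown")        # or the whole DataFrame at once
```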
Forward fill (ffill)
Forward fill (often abbreviated as ffill) fills missing values by propagating the last valid (non-missing) value forward to replace the NaN values. It is typically used on sequential datasets, particularly time series.
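A sketch, assuming a hypothetical numeric "Value" column in a time-ordered DataFrame:

```python
# e.g. Value = [1.0, NaN, NaN, 4.0]  ->  [1.0, 1.0, 1.0, 4.0]
df["Value"] = df["Value"].ffill()
```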
Backward fill (bfill)
Backward fill (often abbreviated as bfill) fills missing values by propagating the next valid (non-missing) value backward to replace the NaN values. Like forward fill, it is typically used on sequential datasets.
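Using the same hypothetical column:

```python
# e.g. Value = [1.0, NaN, NaN, 4.0]  ->  [1.0, 4.0, 4.0, 4.0]
df["Value"] = df["Value"].bfill()
```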
Interpolation
.fillna() is not the only method for filling missing values in a dataset. Interpolation estimates values that fall between two known values in a continuous or ordered dataset. There are many interpolation methods, but here we'll cover two basic ones:
Linear interpolation
Linear interpolation estimates missing values by drawing straight lines between known points.
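A sketch on the same hypothetical column:

```python
# e.g. Value = [1.0, NaN, 3.0]  ->  [1.0, 2.0, 3.0]
df["Value"] = df["Value"].interpolate(method="linear")
```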
Polynomial interpolation
Polynomial interpolation estimates missing values by fitting a polynomial curve that passes through all the known data points.
Order refers to the degree of the polynomial used for interpolation:
- Linear (order=1): Often used when the data is expected to follow a roughly linear relationship.
- Quadratic (order=2): Useful when the data has some curvature.
- Cubic (order=3): A good balance for many cases, capturing moderate curves without too much complexity.
For polynomial interpolation, you must have at least order + 1 data points. For example, to use a polynomial of order 3 (cubic), you must have at least 4 data points. If you have fewer data points than the specified order + 1, Pandas will raise an error.
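A minimal sketch (note that method="polynomial" relies on SciPy under the hood):

```python
# fit a quadratic (order=2) through the known points; requires SciPy
df["Value"] = df["Value"].interpolate(method="polynomial", order=2)
```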
Advanced Imputation
Advanced imputations typically use other libraries. You can learn about some of these in detail in our advanced guide to data cleaning.
Wrong format (date)
Another common error encountered in datasets is dates entered in the wrong format, for example:
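Consider a hypothetical column mixing several string representations of dates:

```python
df = pd.DataFrame({"Date": ["2023-01-15", "2023/02/20", "15-03-2023"]})
```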
The pd.to_datetime() function converts dates stored as strings to Pandas' datetime format.
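A sketch; passing errors="coerce" turns unparseable entries into NaT instead of raising an error:

```python
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
print(df["Date"].dtype)  # datetime64[ns]
```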
Handling outliers and wrong values
There are many ways to detect outliers. Here are two common methods:
Using Z-Score
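A sketch, assuming a numeric "Value" column; points more than 3 standard deviations from the mean are treated as outliers (3 is a common but arbitrary cutoff):

```python
import numpy as np

# z-score: distance from the mean in units of standard deviation
z_scores = np.abs((df["Value"] - df["Value"].mean()) / df["Value"].std())
filtered = df[z_scores < 3]  # keep rows within 3 standard deviations
```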
Using Interquartile Range
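Using the conventional 1.5 × IQR fences on the same hypothetical column:

```python
Q1 = df["Value"].quantile(0.25)
Q3 = df["Value"].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
filtered = df[(df["Value"] >= lower) & (df["Value"] <= upper)]
```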
In the two instances above, the filtered DataFrame excludes rows where the 'Value' column contains an outlier. However, we can also handle outliers another way:
Capping outliers (Winsorizing)
Using the interquartile range, we can cap values at the lower and upper bounds:
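Reusing the lower and upper bounds computed above:

```python
import numpy as np

# replace anything beyond the fences with the fence value itself
df["Value"] = np.where(df["Value"] > upper, upper, df["Value"])
df["Value"] = np.where(df["Value"] < lower, lower, df["Value"])

# pandas' own shortcut does the same in one call:
# df["Value"] = df["Value"].clip(lower, upper)
```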
This uses NumPy's powerful .where() function, which is essentially the equivalent of "Find and Replace". It can also come in handy when replacing outliers with imputed values.
Replacing values using .loc and .iloc
If you want to replace values manually, .loc and .iloc are used to select rows, columns, or individual data points.
.loc uses labels, while .iloc uses integer positions; otherwise they serve the same purpose. Using them, you can isolate data points and change them directly.
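A sketch using the hypothetical df from earlier (the replacement values are made up):

```python
# .loc selects by label: row with index label 2, column "Age"
df.loc[2, "Age"] = 31

# .iloc selects by position: third row, second column
df.iloc[2, 1] = 31
```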
Handling duplicates
Pandas' .duplicated() method returns a Boolean Series, similar to the output of .isnull(), with True marking each row (or value) that repeats an earlier one.
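For example:

```python
df.duplicated()      # True for each row that repeats an earlier one
df[df.duplicated()]  # view only the duplicated rows
```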
The drop_duplicates() method has a keep parameter that allows you to specify which duplicate to keep. By default, the first occurrence is kept.
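For example:

```python
df = df.drop_duplicates()             # keep="first" is the default
df = df.drop_duplicates(keep="last")  # keep the last occurrence instead
df = df.drop_duplicates(keep=False)   # drop every row that has a duplicate
```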
Summing Up
In this guide, you've learned the essential data cleaning techniques in Python for dealing with:
- Missing Data
- Outliers
- Wrong formats
- Duplicates
You've also learned how to find and handle missing values, duplicates, and outliers through deletion or imputation. With these foundational skills, you can proceed to the next step: data analysis. If you want to get into the nitty-gritty of data cleaning, check out our Advanced Guide to Data Cleaning.