Introduction
Data cleaning in Python is a critical step in any data analysis workflow. Before diving into complex analysis or building machine learning models, it's essential to ensure that your data is clean, consistent, and ready for use. In this guide, we'll walk you through the key techniques and best practices for data cleaning in Pandas, one of the most powerful Python libraries for data manipulation. Whether you're dealing with missing values, duplicate entries, or incorrect data types, this guide will equip you with the tools you need to transform messy data into a valuable asset for your analysis.
Data Cleaning
Data cleaning in Python is the process of preparing raw data by fixing or removing incorrect, incomplete, or irrelevant parts to make it ready for analysis. Think of it as tidying up a messy spreadsheet so it's usable and reliable.
Why is Data Cleaning Important?
- Accuracy: Ensures your analysis or model is based on correct data.
- Efficiency: Makes workflows faster and smoother by removing junk data.
- Reproducibility: Clean data helps others understand and replicate your work.
Common Issues in Raw Data
- Missing Values: Blank cells, NaNs, or null values.
- Inconsistent Formatting: Mixed date formats, uppercase vs lowercase.
- Duplicate Entries: Repeated rows of data.
- Outliers: Extreme values that don't make sense (e.g., age = 999).
- Irrelevant Data: Unnecessary columns or rows.
- Typographical Errors: Spelling mistakes or incorrect codes.
Before getting further into exploratory data analysis (EDA), it's important to make sure the data is usable.
Loading the data
Unless you enter data manually, it's very likely that your data is stored in another file. Pandas can load data from a plethora of sources, but here are some of the most common:
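Here's a minimal sketch; the file names are hypothetical placeholders:

```python
import pandas as pd

# hypothetical file names, shown for illustration
df = pd.read_csv("data.csv")          # comma-separated values
df = pd.read_excel("data.xlsx")       # Excel spreadsheet
df = pd.read_json("data.json")        # JSON file
tables = pd.read_html("tables.html")  # returns a list of DataFrames
```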
Pandas loads data into a DataFrame. With HTML, Pandas loads each table into a separate DataFrame, so the output is a list of DataFrames. Before we go further, if you haven't already, check out this article on the types of data structures in Pandas.
Finding missing values
.isnull() – an introduction
.isnull() is the method we use to find missing values. It returns a DataFrame or Series where each value is replaced by a Boolean:
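For example, using a small hypothetical DataFrame (the column names here are made up for illustration and reused in the examples below):

```python
import pandas as pd
import numpy as np

# hypothetical data used throughout these examples
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Dana"],
    "Age": [25, np.nan, 30, 28],
    "City": ["Paris", "London", "Paris", None],
})

print(df.isnull())
#     Name    Age   City
# 0  False  False  False
# 1  False   True  False
# 2   True  False  False
# 3  False  False   True
```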
Filtering rows based on missing values
A good way to get started is to view all rows where any values are missing:
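Assuming the hypothetical df from above:

```python
# keep only rows that contain at least one missing value
df[df.isnull().any(axis=1)]
```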
This uses the .any() method, which, combined with axis=1 (the horizontal axis), checks whether any value in a row, across all columns, is missing.
To display all rows where there are null values in specific columns:
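For example, to check a single column or a subset of columns (column names here are hypothetical):

```python
# rows where "Age" is missing
df[df["Age"].isnull()]

# rows where any of several specific columns is missing
df[df[["Age", "City"]].isnull().any(axis=1)]
```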
Conversely, if you want to show all rows where there are no missing values:
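One way is to use .notnull(), the complement of .isnull():

```python
# rows with no missing values in any column
df[df.notnull().all(axis=1)]
```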
A fun, intuitive way to highlight missing values, if you're using Jupyter Notebook or another environment that supports HTML rendering:
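A sketch using the Styler API:

```python
# render the DataFrame with missing cells highlighted
df.style.highlight_null()
```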
Analytics
To find the total number of null values in each column, we use .isnull() together with .sum():
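For the hypothetical df above, this returns one count per column:

```python
df.isnull().sum()
# Name    1
# Age     1
# City    1
# dtype: int64
```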
To find the total number of null values in the entire DataFrame, we use .values.sum():
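This flattens the Boolean DataFrame into a NumPy array and sums it:

```python
df.isnull().values.sum()  # 3 for the example df above
```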
To calculate the percentage of missing values in each column:
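Divide the per-column counts by the number of rows, or, equivalently, take the mean of the Boolean values:

```python
df.isnull().sum() / len(df) * 100
# equivalent shortcut: df.isnull().mean() * 100
```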
To calculate the number of missing values in each row:
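Sum across the horizontal axis instead:

```python
df.isnull().sum(axis=1)
```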
To calculate the percentage of missing values in each row:
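Again using the Boolean mean, this time row-wise:

```python
df.isnull().mean(axis=1) * 100
```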
You can also use these two row-wise measures as filters for a DataFrame:
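For example (the thresholds here are arbitrary):

```python
# rows with fewer than 2 missing values
df[df.isnull().sum(axis=1) < 2]

# rows where less than half the values are missing
df[df.isnull().mean(axis=1) < 0.5]
```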
Handling missing values
There are numerous ways to handle missing values, but they fall into two categories:
- Remove rows (or even columns) with missing values
- Fill the missing value using a constant or calculated value
We won't go deep into the science of it here; for that, you can check out our article on data cleaning. Instead, we'll see how to apply these methods in Pandas.
Deletion
To delete a row with missing values:
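```python
df = df.dropna()  # keeps only rows without any missing values
```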
To delete a column with missing values:
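```python
df = df.dropna(axis=1)  # drops columns containing at least one missing value
```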
Imputation
Simple Imputation
Replacing missing values with the mean, median, or mode:
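A sketch using the hypothetical columns from earlier; the right statistic depends on the column's type and distribution:

```python
# numeric column: mean or median
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Age"] = df["Age"].fillna(df["Age"].median())

# categorical column: mode (most frequent value)
df["City"] = df["City"].fillna(df["City"].mode()[0])
```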
Constant
Replacing missing values with a constant:
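For example:

```python
df["Age"] = df["Age"].fillna(0)  # a single column
df = df.fillna("Unknown")        # or the whole DataFrame at once
```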
Forward fill (ffill)
Forward fill (often abbreviated as ffill) fills missing values by propagating the last valid (non-missing) value forward to replace the NaN values. It is typically used on sequential datasets, particularly time series.
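A sketch, assuming a hypothetical numeric "Value" column in a time-ordered DataFrame:

```python
# e.g. Value = [1.0, NaN, NaN, 4.0]  ->  [1.0, 1.0, 1.0, 4.0]
df["Value"] = df["Value"].ffill()
```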
Backward fill (bfill)
Backward fill (often abbreviated as bfill) fills missing values by propagating the next valid (non-missing) value backward to replace the NaN values. Like forward fill, it is typically used on sequential datasets.
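Using the same hypothetical column:

```python
# e.g. Value = [1.0, NaN, NaN, 4.0]  ->  [1.0, 4.0, 4.0, 4.0]
df["Value"] = df["Value"].bfill()
```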
Interpolation
.fillna() is not the only method for filling missing values in a dataset. Interpolation estimates values that fall between two known values in a continuous or ordered dataset. There are many interpolation methods, but here we'll cover two basic ones:
Linear interpolation
Linear interpolation estimates missing values by drawing straight lines between known points.
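A sketch on the same hypothetical column:

```python
# e.g. Value = [1.0, NaN, 3.0]  ->  [1.0, 2.0, 3.0]
df["Value"] = df["Value"].interpolate(method="linear")
```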
Polynomial interpolation
Polynomial interpolation estimates missing values by fitting a polynomial curve that passes through all the known data points.
Order refers to the degree of the polynomial used for interpolation:
- Linear (order=1): Often used when the data is expected to follow a roughly linear relationship.
- Quadratic (order=2): Useful when the data has some curvature.
- Cubic (order=3): A good balance for many cases, capturing moderate curves without too much complexity.
For polynomial interpolation, you must have at least order + 1 data points. For example, to use a polynomial of order 3 (cubic), you must have at least 4 data points. If you have fewer data points than the specified order + 1, Pandas will raise an error.
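A minimal sketch (note that method="polynomial" relies on SciPy under the hood):

```python
# fit a quadratic (order=2) through the known points; requires SciPy
df["Value"] = df["Value"].interpolate(method="polynomial", order=2)
```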
Advanced Imputation
Advanced imputations typically use other libraries. You can learn about some of these in detail in our advanced guide to data cleaning.
Wrong format (date)
Another common error encountered in datasets is dates entered in the wrong format, for example:
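Consider a hypothetical column mixing several string representations of dates:

```python
df = pd.DataFrame({"Date": ["2023-01-15", "2023/02/20", "15-03-2023"]})
```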
The pd.to_datetime() function converts dates stored as strings to Pandas' datetime format.
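A sketch; passing errors="coerce" turns unparseable entries into NaT instead of raising an error:

```python
df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
print(df["Date"].dtype)  # datetime64[ns]
```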
Handling outliers and wrong values
There are many ways to detect outliers. Here are two common methods:
Using Z-Score
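A sketch, assuming a numeric "Value" column; points more than 3 standard deviations from the mean are treated as outliers (3 is a common but arbitrary cutoff):

```python
import numpy as np

# z-score: distance from the mean in units of standard deviation
z_scores = np.abs((df["Value"] - df["Value"].mean()) / df["Value"].std())
filtered = df[z_scores < 3]  # keep rows within 3 standard deviations
```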
Using Interquartile Range
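Using the conventional 1.5 × IQR fences on the same hypothetical column:

```python
Q1 = df["Value"].quantile(0.25)
Q3 = df["Value"].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
filtered = df[(df["Value"] >= lower) & (df["Value"] <= upper)]
```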
In the two instances above, the filtered DataFrame excludes rows where the 'Value' column contains an outlier. However, we can also handle outliers another way:
Capping outliers (Winsorizing)
Using the interquartile range, we can cap values at the lower and upper bounds:
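Reusing the lower and upper bounds computed above:

```python
import numpy as np

# replace anything beyond the fences with the fence value itself
df["Value"] = np.where(df["Value"] > upper, upper, df["Value"])
df["Value"] = np.where(df["Value"] < lower, lower, df["Value"])

# pandas' own shortcut does the same in one call:
# df["Value"] = df["Value"].clip(lower, upper)
```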
This uses NumPy's powerful .where() function, which is essentially the equivalent of "Find and Replace". It can also come in handy when replacing outliers with imputed values.
Replacing values using .loc and .iloc
If you want to replace values manually, .loc and .iloc are used to select rows, columns, or individual data points.
.loc uses labels, while .iloc uses integer positions; otherwise they serve the same purpose. Using them, you can isolate data points and change them directly.
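A sketch using the hypothetical df from earlier (the replacement values are made up):

```python
# .loc selects by label: row with index label 2, column "Age"
df.loc[2, "Age"] = 31

# .iloc selects by position: third row, second column
df.iloc[2, 1] = 31
```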
Handling duplicates
Pandas' .duplicated() method returns a Boolean Series, similar to the output of .isnull(), with True marking each row (or value) that repeats an earlier one.
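For example:

```python
df.duplicated()      # True for each row that repeats an earlier one
df[df.duplicated()]  # view only the duplicated rows
```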
The drop_duplicates() method has a keep parameter that allows you to specify which duplicate to keep. By default, the first occurrence is kept.
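For example:

```python
df = df.drop_duplicates()             # keep="first" is the default
df = df.drop_duplicates(keep="last")  # keep the last occurrence instead
df = df.drop_duplicates(keep=False)   # drop every row that has a duplicate
```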
Summing Up
In this guide, you've learned the essential data cleaning techniques in Python for dealing with:
- Missing Data
- Outliers
- Wrong formats
- Duplicates
You've also learned how to find and handle missing values, duplicates, and outliers through deletion or imputation. With these foundational skills, you can proceed to the next step: data analysis. If you want to get into the nitty-gritty of data cleaning, check out our Advanced Guide to Data Cleaning.