Handling Missing Data in Pandas
import pandas as pd
import numpy as np
df=pd.read_csv('/content/drive/MyDrive/ML Course by ME/Python/Pandas/datsets_pandas/HOUSING_PRICE.csv')
df.head()
Handling missing values is a crucial step in data preprocessing. Pandas provides several methods for detecting, removing, and imputing missing values in DataFrames. Here are some common techniques:
1. Detecting Missing Values:
isnull() and notnull():
These methods return a boolean DataFrame indicating where the values are missing (NaN) or not missing, respectively.
# Detect missing values
df.isnull()
# Detect non-missing values
df.notnull()
2. Removing Missing Values:
dropna():
This method removes rows or columns with missing values.
# Remove rows with any missing values
df.dropna()
# Remove columns with any missing values
df.dropna(axis=1)
# Remove rows only if all values are missing
df.dropna(how='all')
# Remove rows with missing values in specific columns
df.dropna(subset=['column1', 'column2'])
3. Imputing Missing Values:
fillna():
This method fills missing values with a specified value or strategy.
# Fill missing values with a constant value
df.fillna(value)
# Fill missing values with the mean of the column
df.fillna(df.mean())
# Fill missing values with the median of the column
df.fillna(df.median())
# Fill missing values with the most frequent value in the column
df.fillna(df.mode().iloc[0])
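On a DataFrame that mixes numeric and non-numeric columns (like this housing dataset), df.mean() raises a TypeError in pandas 2.0+ unless numeric_only=True is passed, so mean/median imputation is usually restricted to the numeric columns. A minimal sketch of that pattern, assuming the default column dtypes:
# Impute only the numeric columns with their column-wise means
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())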
4. Interpolation:
interpolate():
This method fills missing values by interpolating between existing values.
# Linear interpolation
df.interpolate(method='linear')
# Time-based interpolation
df.interpolate(method='time')
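As a small illustration, method='time' requires a DatetimeIndex. A minimal sketch on a hypothetical toy Series (not the housing dataset):
import pandas as pd
import numpy as np
# Toy series with two missing values and a daily DatetimeIndex
s = pd.Series([1.0, np.nan, np.nan, 4.0],
              index=pd.date_range('2023-01-01', periods=4, freq='D'))
print(s.interpolate(method='linear'))  # fills 2.0 and 3.0
print(s.interpolate(method='time'))    # same result here, since the index is evenly spaced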
5. Handling Missing Values in Categorical Data:
Mode Imputation:
For categorical data, filling missing values with the mode (most frequent value) is common.
# Fill missing values in categorical column with mode
df['categorical_column'].fillna(df['categorical_column'].mode()[0], inplace=True)
6. Handling Missing Values in Time Series Data:
Forward Fill (ffill) or Backward Fill (bfill):
For time series data, forward fill or backward fill can be used to propagate the last observed value forward or backward.
# Forward fill missing values
df.fillna(method='ffill')
# Backward fill missing values
df.fillna(method='bfill')
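In recent pandas releases (2.1 and later), fillna(method=...) is deprecated in favour of the dedicated methods, so the equivalent calls are:
# Forward fill missing values (newer pandas style)
df.ffill()
# Backward fill missing values
df.bfill()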
Note:
- The choice of method for handling missing values depends on the nature of the data and the specific requirements of the analysis.
- It's essential to consider potential biases introduced by imputing missing values and to assess the impact on the analysis results.
- Always verify the effectiveness of the chosen method and its impact on downstream analysis.
df.isnull()
df.isnull().sum()
OUTPUT:
Id 0
MSSubClass 0
MSZoning 0
LotFrontage 259
LotArea 0
...
MoSold 0
YrSold 0
SaleType 0
SaleCondition 0
SalePrice 0
Length: 81, dtype: int64
df.columns[df.isnull().sum()>0]
OUTPUT:
Index(['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea','BsmtQual',
'BsmtCond','BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
'Electrical','FireplaceQu', 'GarageType', 'GarageYrBlt',
'GarageFinish','GarageQual', 'GarageCond', 'PoolQC', 'Fence',
'MiscFeature'],
dtype='object')
df.dropna(axis=1)
df.dropna(axis=0)
df.columns
df.fillna(0)
df[df.columns[df.isnull().sum()>0]]
df['MasVnrArea'].fillna(df['MasVnrArea'].mean(),inplace=True)
df['MasVnrArea'].isnull().sum()
df['LotFrontage'].fillna(df['LotFrontage'].mean(),inplace=True)
df['LotFrontage'].isnull().sum()
OUTPUT: 0
df['BsmtQual'].value_counts()
OUTPUT:
TA 649
Gd 618
Ex 121
Fa 35
Name: BsmtQual, dtype: int64
df['BsmtQual'].isnull().sum()
OUTPUT: 37
df['BsmtQual'].mode()
OUTPUT:
0 TA
Name: BsmtQual, dtype: object
print('no of null values before ',df['BsmtQual'].isnull().sum())
df['BsmtQual'].fillna(df['BsmtQual'].mode()[0],inplace=True)
df['BsmtQual'].isnull().sum()
OUTPUT:
no of null values before  37
0
Adding New Columns to a DataFrame
Adding new columns to a DataFrame in Pandas is a common operation and can be done using various methods. You can add a new column with constant values, computed values based on existing columns, or by applying a function to existing data. Here are some approaches:
1. Adding a New Column with Constant Values:
You can add a new column with constant values using either bracket notation or the assign() method.
Bracket Notation:
df['new_column'] = value
assign() Method:
df = df.assign(new_column=value)
2. Adding a New Column Based on Existing Columns:
You can create a new column based on computations or operations involving existing columns.
Using Arithmetic Operations:
df['new_column'] = df['column1'] + df['column2']
Using Functions:
def calculate_new_value(row):
    return row['column1'] * row['column2']
df['new_column'] = df.apply(calculate_new_value, axis=1)
3. Adding a New Column with Conditions:
You can add a new column based on conditions using boolean indexing or the np.where() function.
Using Boolean Indexing:
df.loc[df['column1'] > 0, 'new_column'] = 'Yes'
df.loc[df['column1'] <= 0, 'new_column'] = 'No'
Using np.where():
import numpy as np
df['new_column'] = np.where(df['column1'] > 0, 'Yes', 'No')
4. Adding a New Column with Values from a Series or List:
You can add a new column with values from a Series or list of values.
new_values = pd.Series([value1, value2, value3])
df['new_column'] = new_values
5. Adding a New Column with Index Alignment:
You can add a new column with index alignment, where the values are aligned based on the DataFrame's index.
new_values = pd.Series([value1, value2, value3], index=['index1', 'index2', 'index3'])
df['new_column'] = new_values
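A small sketch of the alignment behaviour on a hypothetical DataFrame: rows whose index label does not appear in the Series receive NaN.
import pandas as pd
df_small = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'y', 'z'])
new_values = pd.Series([10, 30], index=['x', 'z'])
df_small['new_column'] = new_values  # 'y' is not in the Series, so it becomes NaN
print(df_small)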
Note:
- When adding a new column, ensure that the length of the new data matches the length of the DataFrame.
- Always verify the correctness of the new column and its values after addition.
df['price_per_unit_area'] = df['SalePrice']/df['LotArea']
df.head()
df['sale_condition_normal'] = np.where(df['SaleCondition']=='Normal', 'Yes', 'No')
df.head()
mean_price=df['SalePrice'].mean()
def calculate_new_above_average_price(row):
return 'yes' if row['SalePrice']>mean_price else 'no'
df['sales_price_above_avg'] = df.apply(calculate_new_above_average_price, axis=1)
df.head()
df = df.assign(avg_price=mean_price)
df
df=df.assign(price_per_unit=df['SalePrice']/df['LotArea'])
df.head()
Adding New Rows to a DataFrame
Adding new rows in Pandas can be done using the append() method (deprecated in pandas 1.4 and removed in 2.0), the loc indexer, or by creating a new DataFrame and concatenating it with the original DataFrame. Here are several approaches:
1. Using append() Method:
You can use the append() method (available in pandas versions before 2.0) to add one or more rows to the end of the DataFrame.
# Create a new row as a dictionary
new_row = {'column1': value1, 'column2': value2, ...}
# Append the new row to the DataFrame
df = df.append(new_row, ignore_index=True)
2. Using loc Indexer:
You can use the loc indexer to add a new row by specifying the index label and assigning values.
# Add a new row at the end of the DataFrame
df.loc[len(df)] = [value1, value2, ...]
3. Using Concatenation:
You can create a new DataFrame with the new row(s) and concatenate it with the original DataFrame.
# Create a new DataFrame with the new row(s)
new_data = pd.DataFrame([[value1, value2, ...]], columns=df.columns)
# Concatenate the new DataFrame with the original DataFrame
df = pd.concat([df, new_data], ignore_index=True)
Note:
- Ensure that the new row(s) have the same number of columns and compatible data types as the existing DataFrame.
- The ignore_index=True parameter resets the index of the resulting DataFrame after appending rows.
- Because append() was removed in pandas 2.0, pd.concat() is the recommended approach. Adding many rows one at a time (with loc or append()) is slow, so collect new rows and concatenate them in a single call when possible.
- Always verify the correctness of the new rows and their values after addition.
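A minimal runnable sketch of both approaches on a small hypothetical DataFrame (not the housing data), using pd.concat() since append() is gone in pandas 2.0:
import pandas as pd
df_small = pd.DataFrame({'Name': ['John', 'Anna'], 'Salary': [50000, 60000]})
# Add a single row with .loc at the next integer position
df_small.loc[len(df_small)] = ['Peter', 55000]
# Add one or more rows by concatenating a new DataFrame
new_rows = pd.DataFrame([{'Name': 'Linda', 'Salary': 58000}])
df_small = pd.concat([df_small, new_rows], ignore_index=True)
print(df_small)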
This section introduces data type conversion and renaming of columns and indexes, with examples demonstrating how these operations are performed using Pandas.
Data Type Conversion
Data type conversion is essential when you want to ensure the data types of your columns are suitable for the operations you plan to perform on them. Pandas provides several methods to convert the data type of DataFrame columns.
The example below demonstrates how to convert the Age column from strings to integers and the Salary column to floats, using the .astype() method.
import pandas as pd
# Sample DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': ['28', '34', '29', '32'], # Age as string
'Salary': [50000, 60000, 55000, 58000]}
df = pd.DataFrame(data)
# Display the original data types
print("Original data types:")
print(df.dtypes)
# Convert 'Age' from object (string) to integer
df['Age'] = df['Age'].astype(int)
# Convert 'Salary' to float
df['Salary'] = df['Salary'].astype(float)
# Display the data types after conversion
print("\nData types after conversion:")
print(df.dtypes)
OUTPUT:
Original data types:
Name object
Age object
Salary int64
dtype: object
Data types after conversion:
Name object
Age int64
Salary float64
dtype: object
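Besides .astype(), pd.to_numeric() is another common conversion method; with errors='coerce' any value that cannot be parsed becomes NaN instead of raising an error (an alternative sketch, not used in the cell above):
# Coerce unparseable strings to NaN rather than raising an error
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')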
Renaming Columns and Indexes
Renaming columns and indexes can be important for making a DataFrame more readable or for preparing data for further analysis.
# Renaming columns
df.rename(columns={'Name': 'Employee Name', 'Age': 'Employee Age', 'Salary': 'Employee Salary'}, inplace=True)
print("DataFrame after renaming columns:")
print(df)
# Renaming indexes
df.index = ['Employee 1', 'Employee 2', 'Employee 3', 'Employee 4']
print("\nDataFrame after renaming indexes:")
print(df)
OUTPUT:
DataFrame after renaming columns:
Employee Name Employee Age Employee Salary
0 John 28 50000.0
1 Anna 34 60000.0
2 Peter 29 55000.0
3 Linda 32 58000.0
DataFrame after renaming indexes:
Employee Name Employee Age Employee Salary
Employee 1 John 28 50000.0
Employee 2 Anna 34 60000.0
Employee 3 Peter 29 55000.0
Employee 4 Linda 32 58000.0
In this example, .rename() is used with the columns parameter to rename specific columns. The inplace=True argument modifies the original DataFrame. Then, the DataFrame indexes are directly assigned new labels, effectively renaming them.
These operations are crucial for data cleaning and preparation, ensuring that your data is in the right format and is easily understandable.
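An alternative to reassigning df.index wholesale is .rename(index=...), which renames only the labels you specify and leaves the rest unchanged:
# Rename a single index label; all other labels stay as they are
df.rename(index={'Employee 1': 'Emp 1'}, inplace=True)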
Data Aggregation and Grouping
Data aggregation and grouping are powerful Pandas functionalities that allow you to summarize and analyze datasets in meaningful ways. This section demonstrates how to use GroupBy operations, perform aggregation, filtering, and transformation, apply functions, and create pivot tables and cross-tabulations.
GroupBy Operations
GroupBy operations are used to split the data into groups based on some criteria, apply a function to each group independently, and combine the results into a data structure.
import pandas as pd
# Sample DataFrame
data = {'Department': ['Sales', 'Tech', 'Tech', 'HR', 'Sales'],
'Employee': ['John', 'Anna', 'Peter', 'Linda', 'Steve'],
'Salary': [70000, 80000, 75000, 65000, 60000]}
df = pd.DataFrame(data)
# Grouping by 'Department' and calculating mean salary
grouped = df.groupby('Department')['Salary'].mean()
print(grouped)
OUTPUT:
Department
HR 65000.0
Sales 65000.0
Tech 77500.0
Name: Salary, dtype: float64
Common GroupBy Functions:
Aggregation Functions: sum(), mean(), median(), min(), max(), count(), size(), std(), var(), etc.
# Aggregation
agg = df.groupby('Department').agg({'Salary': ['mean', 'min', 'max']})
print(agg)
OUTPUT:
Salary
mean min max
Department
HR 65000.0 65000 65000
Sales 65000.0 60000 70000
Tech 77500.0 75000 80000
In Pandas, aggregate, filter, transform, and apply are powerful methods used in conjunction with GroupBy operations for data manipulation and analysis. These methods allow you to perform various computations on grouped data efficiently.
1. Aggregate:
The aggregate method allows you to apply one or more aggregation functions to grouped data, such as sum, mean, median, min, max, count, and custom aggregation functions.
Syntax:
grouped.aggregate(func)
func: Aggregation function or a list/dictionary of aggregation functions to apply.
Example:
# Group by 'Category' and calculate sum and mean of 'Value'
result = df.groupby('Category').aggregate({'Value': ['sum', 'mean']})
2. Filter:
The filter method allows you to apply a filtering function to each group and return the groups that satisfy the condition.
Syntax:
grouped.filter(func)
func: Filtering function that returns a boolean indicating whether the group should be included.
Example:
# Filter groups where the mean of 'Value' is greater than 10
result = df.groupby('Category').filter(lambda x: x['Value'].mean() > 10)
3. Transform:
The transform method applies a function to each group and returns a DataFrame with the same shape as the original DataFrame, where each value corresponds to the result of the function applied to the corresponding group.
Syntax:
grouped.transform(func)
func: Transformation function to apply to each group.
Example:
# Normalize 'Value' within each group by subtracting the group mean and dividing by the group standard deviation
result = df.groupby('Category').transform(lambda x: (x - x.mean()) / x.std())
4. Apply:
The apply method applies a function to each group and returns the combined result as a single DataFrame. It is more general-purpose than aggregate and transform, as it allows for more flexible computations.
Syntax:
grouped.apply(func)
func: Function to apply to each group.
Example:
# Apply a custom function to calculate the difference between the maximum and minimum values of 'Value' in each group
def custom_function(x):
return x['Value'].max() - x['Value'].min()
result = df.groupby('Category').apply(custom_function)
Key Points:
- aggregate, filter, transform, and apply are used in combination with GroupBy operations for data manipulation.
- aggregate is used to compute summary statistics for each group.
- filter is used to filter groups based on a condition.
- transform is used to perform element-wise transformations within each group.
- apply is a more general-purpose method that allows you to apply arbitrary functions to each group.
- These methods provide powerful capabilities for data analysis and manipulation in Pandas, enabling complex computations and transformations on grouped data efficiently.
# Filter
filter_func = lambda x: x['Salary'].mean() > 65000
filtered = df.groupby('Department').filter(filter_func)
print(filtered)
# Transformation
zscore = lambda x: (x - x.mean()) / x.std()
transformed = df.groupby('Department')['Salary'].transform(zscore)
df['Salary Z-Score'] = transformed
print(df)
# Apply
def update_salary(group):
return group['Salary'] * 1.1
applied = df.groupby('Department').apply(update_salary)
print(applied)
OUTPUT:
Department Employee Salary
1 Tech Anna 80000
2 Tech Peter 75000
Department Employee Salary Salary Z-Score
0 Sales John 70000 0.707107
1 Tech Anna 80000 0.707107
2 Tech Peter 75000 -0.707107
3 HR Linda 65000 NaN
4 Sales Steve 60000 -0.707107
Department
HR 3 71500.0
Sales 0 77000.0
4 66000.0
Tech 1 88000.0
2 82500.0
Name: Salary, dtype: float64
Case Study: Retail Sales Analysis
Background
A retail company operates multiple stores across different regions. Each store sells various products categorized into departments such as Electronics, Clothing, and Home & Garden. The company maintains a dataset of daily sales transactions that record the store ID, region, department, date of transaction, and sales amount. The management wants to analyze these sales data to understand performance patterns and make informed decisions.
Objective
To analyze the dataset for insights on:
Sales performance by region and department.
Trends in daily sales volumes.
The impact of seasons on sales figures.
The top-performing stores and departments.
import pandas as pd
import numpy as np
# Creating a sample dataset
np.random.seed(42) # For reproducibility
dates = pd.date_range('2023-01-01', periods=90, freq='D') # 90 days of sales
data = {
'StoreID': np.random.choice(['Store_1', 'Store_2', 'Store_3'], size=90),
'Region': np.random.choice(['North', 'South', 'East', 'West'], size=90),
'Department': np.random.choice(['Electronics', 'Clothing', 'Home & Garden'], size=90),
'Date': dates,
'Sales': np.random.randint(100, 5000, size=90) # Random sales between $100 and $5000
}
df = pd.DataFrame(data)
# Display the first few rows of the dataset
df.head()
# Analysis 1: Sales Performance by Region and Department
sales_performance = df.groupby(['Region', 'Department'])['Sales'].sum().unstack()
# Analysis 2: Trends in Daily Sales Volumes
daily_sales = df.groupby('Date')['Sales'].sum()
# Analysis 3: Seasonal Impact on Sales
df['Month'] = df['Date'].dt.month
seasonal_sales = df.groupby(['Month', 'Department'])['Sales'].sum().unstack()
# Analysis 4: Top-Performing Stores and Departments
top_performers = df.groupby(['StoreID', 'Department'])['Sales'].sum().sort_values(ascending=False).head(5)
sales_performance, daily_sales, seasonal_sales, top_performers
OUTPUT:
(Department Clothing Electronics Home & Garden
Region
East 25802 31770 27463
North 28276 25508 17001
South 10906 6892 13707
West 18123 11949 27218,
Date
2023-01-01 2086
2023-01-02 246
2023-01-03 3319
2023-01-04 3011
2023-01-05 1834
...
2023-03-27 1169
2023-03-28 3544
2023-03-29 2665
2023-03-30 1160
2023-03-31 2427
Name: Sales, Length: 90, dtype: int64,
Department Clothing Electronics Home & Garden
Month
1 35622 23683 21398
2 23782 30415 26470
3 23703 22021 37521,
StoreID Department
Store_2 Home & Garden 39114
Store_3 Clothing 33577
Store_2 Clothing 31452
Store_3 Home & Garden 31419
Store_2 Electronics 28537
Name: Sales, dtype: int64)
# Python program to find the area of a triangle
a = 5
b = 6
c = 7
# Uncomment below to take inputs from the user
# a = float(input('Enter first side: '))
# b = float(input('Enter second side: '))
# c = float(input('Enter third side: '))
# calculate the semi-perimeter
s = (a + b + c) / 2
# calculate the area
area = (s*(s-a)*(s-b)*(s-c)) ** 0.5
print('The area of the triangle is %0.2f' %area)