Handling Missing Data in Pandas

import pandas as pd
import numpy as np
df=pd.read_csv('/content/drive/MyDrive/ML Course by ME/Python/Pandas/datsets_pandas/HOUSING_PRICE.csv')
df.head()

Handling missing values is a crucial step in data preprocessing. Pandas provides several methods for detecting, removing, and imputing missing values in DataFrames. Here are some common techniques:

1. Detecting Missing Values:

isnull() and notnull():

These methods return a boolean DataFrame indicating where the values are missing (NaN) or not missing, respectively.

# Detect missing values
df.isnull()

# Detect non-missing values
df.notnull()
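As a quick, self-contained illustration (using a small made-up DataFrame rather than the housing dataset), isnull() returns a boolean mask, and chaining .sum() gives a per-column count of missing values:

import pandas as pd
import numpy as np

# Hypothetical two-column DataFrame with a few missing entries
demo = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                     'b': ['x', 'y', None]})

print(demo.isnull())        # True wherever a value is missing
print(demo.isnull().sum())  # number of missing values per column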

2. Removing Missing Values:

dropna():

This method removes rows or columns with missing values.

# Remove rows with any missing values
df.dropna()

# Remove columns with any missing values
df.dropna(axis=1)

# Remove rows only if all values are missing
df.dropna(how='all')

# Remove rows with missing values in specific columns
df.dropna(subset=['column1', 'column2'])
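Here is a minimal runnable sketch comparing these dropna() options on an invented DataFrame; it also shows the thresh parameter (not used above), which keeps rows having at least a given number of non-missing values:

import pandas as pd
import numpy as np

demo = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                     'b': [4.0, 5.0, np.nan],
                     'c': [7.0, 8.0, np.nan]})

print(demo.dropna())           # keeps only row 0 (the row with no NaNs)
print(demo.dropna(how='all'))  # drops only row 2 (every value is NaN)
print(demo.dropna(thresh=2))   # keeps rows with at least 2 non-NaN values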

3. Imputing Missing Values:

fillna():

This method fills missing values with a specified value or strategy.

# Fill missing values with a constant value
df.fillna(value)

# Fill missing values with the mean of each numeric column
# (numeric_only=True avoids errors on non-numeric columns in pandas >= 2.0)
df.fillna(df.mean(numeric_only=True))

# Fill missing values with the median of each numeric column
df.fillna(df.median(numeric_only=True))

# Fill missing values with the most frequent value in the column
df.fillna(df.mode().iloc[0])
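As a minimal sketch of these strategies side by side, on an invented numeric Series:

import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 3.0])

print(s.fillna(0))            # constant value
print(s.fillna(s.mean()))     # mean of the non-missing values (~2.33)
print(s.fillna(s.median()))   # median (3.0)
print(s.fillna(s.mode()[0]))  # most frequent value (3.0)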
4. Interpolation:

interpolate():

This method fills missing values by interpolating between existing values.

# Linear interpolation
df.interpolate(method='linear')

# Time-based interpolation
df.interpolate(method='time')
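Note that method='time' requires a DatetimeIndex. A small sketch with invented values (here the index is evenly spaced, so both methods give the same result):

import pandas as pd
import numpy as np

idx = pd.date_range('2023-01-01', periods=4, freq='D')
s = pd.Series([1.0, np.nan, np.nan, 4.0], index=idx)

print(s.interpolate(method='linear'))  # fills in 2.0 and 3.0
print(s.interpolate(method='time'))    # identical here because the days are evenly spaced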

5. Handling Missing Values in Categorical Data:

Mode Imputation:

For categorical data, filling missing values with the mode (most frequent value) is common.

# Fill missing values in a categorical column with its mode
# (assignment avoids the chained inplace pattern deprecated in pandas >= 2.0)
df['categorical_column'] = df['categorical_column'].fillna(df['categorical_column'].mode()[0])

6. Handling Missing Values in Time Series Data:

Forward Fill (ffill) or Backward Fill (bfill):

For time series data, forward fill or backward fill can be used to propagate the last observed value forward or backward.

# Forward fill missing values
# (fillna(method='ffill') is deprecated in pandas >= 2.1; call ffill() directly)
df.ffill()

# Backward fill missing values
df.bfill()
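A compact sketch of both fills on an invented daily series; note that ffill() leaves a leading NaN untouched and bfill() leaves a trailing one:

import pandas as pd
import numpy as np

idx = pd.date_range('2023-01-01', periods=5, freq='D')
s = pd.Series([10.0, np.nan, np.nan, 40.0, np.nan], index=idx)

print(s.ffill())  # 10, 10, 10, 40, 40
print(s.bfill())  # 10, 40, 40, 40, NaN (nothing after the last value to fill from)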

Note:

  • The choice of method for handling missing values depends on the nature of the data and the specific requirements of the analysis.
  • It's essential to consider potential biases introduced by imputing missing values and to assess the impact on the analysis results.
  • Always verify the effectiveness of the chosen method and its impact on downstream analysis.

# Boolean mask of missing values
df.isnull()
# Number of missing values per column
df.isnull().sum()

OUTPUT:

Id                0
MSSubClass        0
MSZoning          0
LotFrontage       259
LotArea           0
               ... 
MoSold            0
YrSold            0
SaleType          0
SaleCondition     0
SalePrice         0
Length: 81, dtype: int64

# Columns that contain at least one missing value
df.columns[df.isnull().sum()>0]

OUTPUT:

Index(['LotFrontage', 'Alley', 'MasVnrType', 'MasVnrArea','BsmtQual',
      'BsmtCond','BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
      'Electrical','FireplaceQu', 'GarageType', 'GarageYrBlt',
      'GarageFinish','GarageQual', 'GarageCond', 'PoolQC', 'Fence',
      'MiscFeature'],
     dtype='object')
# Drop columns, then rows, that contain any missing values (each returns a new DataFrame)
df.dropna(axis=1)
df.dropna(axis=0)
df.columns
# Replace every missing value with 0 (returns a new DataFrame; df is unchanged)
df.fillna(0)
# Inspect only the columns that contain missing values
df[df.columns[df.isnull().sum()>0]]
# Impute the numeric columns with their means
df['MasVnrArea'] = df['MasVnrArea'].fillna(df['MasVnrArea'].mean())
df['MasVnrArea'].isnull().sum()
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mean())
df['LotFrontage'].isnull().sum()

OUTPUT: 0

df['BsmtQual'].value_counts()

OUTPUT:

TA    649
Gd    618
Ex    121
Fa     35
Name: BsmtQual, dtype: int64
df['BsmtQual'].isnull().sum()

OUTPUT: 37

df['BsmtQual'].mode()

OUTPUT:

0    TA
Name: BsmtQual, dtype: object

print('no of null values before  ', df['BsmtQual'].isnull().sum())
df['BsmtQual'] = df['BsmtQual'].fillna(df['BsmtQual'].mode()[0])
df['BsmtQual'].isnull().sum()

OUTPUT:

no of null values before   37
0

 

Adding New Columns to a DataFrame

Adding new columns to a DataFrame in Pandas is a common operation and can be done in several ways. You can add a new column with constant values, with values computed from existing columns, or by applying a function to existing data. Here are some approaches:

1. Adding a New Column with Constant Values:

You can add a new column with constant values using either bracket notation or the assign() method.

Bracket Notation:

df['new_column'] = value

assign() Method:

df = df.assign(new_column=value)

2. Adding a New Column Based on Existing Columns:

You can create a new column based on computations or operations involving existing columns.

Using Arithmetic Operations:

df['new_column'] = df['column1'] + df['column2']

Using Functions:

def calculate_new_value(row):
    return row['column1'] * row['column2']
df['new_column'] = df.apply(calculate_new_value, axis=1)


3. Adding a New Column with Conditions:

You can add a new column based on conditions using boolean indexing or the np.where() function.

Using Boolean Indexing with .loc:

df['new_column'] = 'No'
df.loc[df['column1'] > 0, 'new_column'] = 'Yes'

Using np.where():

import numpy as np
df['new_column'] = np.where(df['column1'] > 0, 'Yes', 'No')

4. Adding a New Column with Values from a Series or List:

You can add a new column with values from a Series or list of values.

new_values = pd.Series([value1, value2, value3])
df['new_column'] = new_values

5. Adding a New Column with Index Alignment:

You can add a new column with index alignment, where the values are aligned based on the DataFrame's index.

new_values = pd.Series([value1, value2, value3], index=['index1', 'index2', 'index3'])
df['new_column'] = new_values
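Alignment means values are matched by index label, not by position; any DataFrame row whose label is missing from the Series receives NaN. A small invented example:

import pandas as pd

df2 = pd.DataFrame({'a': [1, 2, 3]}, index=['r1', 'r2', 'r3'])
new_values = pd.Series([10, 30], index=['r1', 'r3'])  # no value for 'r2'

df2['new_column'] = new_values
print(df2)  # row 'r2' gets NaN because its label is absent from the Series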

Note:

  • When adding a new column, ensure that the length of the new data matches the length of the DataFrame.
  • Always verify the correctness of the new column and its values after addition.
df['price_per_unit_area'] = df['SalePrice'] / df['LotArea']
df.head()
df['sale_condition_normal'] = np.where(df['SaleCondition']=='Normal', 'Yes', 'No')
df.head()
mean_price=df['SalePrice'].mean()
def calculate_new_above_average_price(row):
    return 'yes' if row['SalePrice']>mean_price else 'no'

df['sales_price_above_avg'] = df.apply(calculate_new_above_average_price, axis=1)
df.head()
df = df.assign(avg_price=mean_price)
df
df=df.assign(price_per_unit=df['SalePrice']/df['LotArea'])
df.head()

Adding New Rows to a DataFrame

Adding new rows in Pandas can be done with the append() method (deprecated, and removed in pandas 2.0), the loc indexer, or by creating a new DataFrame and concatenating it with the original. Here are several approaches:

1. Using append() Method:

You can use the append() method to add one or more rows to the end of the DataFrame. Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0, so prefer pd.concat() (approach 3 below) in new code.

# Create a new row as a dictionary
new_row = {'column1': value1, 'column2': value2, ...}

# Append the new row to the DataFrame (works in pandas < 2.0 only)
df = df.append(new_row, ignore_index=True)

2. Using loc Indexer:

You can use the loc indexer to add a new row by specifying the index label and assigning values.

# Add a new row at the end of the DataFrame
df.loc[len(df)] = [value1, value2, ...]
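One caveat worth noting: df.loc[len(df)] only appends cleanly when the DataFrame has a default RangeIndex (0, 1, 2, ...); with any other index, the label len(df) might already exist and would be overwritten instead. A minimal sketch:

import pandas as pd

df2 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})  # default RangeIndex
df2.loc[len(df2)] = [5, 6]  # len(df2) == 2, so this adds a new row labeled 2
print(df2)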

3. Using Concatenation:

You can create a new DataFrame with the new row(s) and concatenate it with the original DataFrame.

# Create a new DataFrame with the new row(s)
new_data = pd.DataFrame([[value1, value2, ...]], columns=df.columns)

# Concatenate the new DataFrame with the original DataFrame
df = pd.concat([df, new_data], ignore_index=True)

Note:

  • Ensure that the new row(s) have the same number of columns and follow the same data types as the existing DataFrame.
  • The ignore_index=True parameter is used to reset the index of the resulting DataFrame after appending rows.
  • Repeatedly adding single rows with append() or loc is slow for large DataFrames; collecting the new rows first and calling pd.concat() once is generally more efficient.
  • Always verify the correctness of the new rows and their values after addition.


The next two sections cover data type conversion and renaming columns and indexes, with short examples of how each operation is performed in Pandas.

Data Type Conversion

Data type conversion is essential when you want to ensure the data types of your columns are suitable for the operations you plan to perform on them. Pandas provides several methods to convert the data type of DataFrame columns.

This example demonstrates how to convert the Age column from strings to integers and the Salary column to floats, using the .astype() method.


import pandas as pd

# Sample DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': ['28', '34', '29', '32'],  # Age as string
        'Salary': [50000, 60000, 55000, 58000]}
df = pd.DataFrame(data)

# Display the original data types
print("Original data types:")
print(df.dtypes)

# Convert 'Age' from object (string) to integer
df['Age'] = df['Age'].astype(int)

# Convert 'Salary' to float
df['Salary'] = df['Salary'].astype(float)

# Display the data types after conversion
print("\nData types after conversion:")
print(df.dtypes)

OUTPUT:

Original data types:
Name      object
Age       object
Salary     int64
dtype: object

Data types after conversion:
Name       object
Age         int64
Salary    float64
dtype: object

Renaming Columns and Indexes

Renaming columns and indexes can be important for making a DataFrame more readable or for preparing data for further analysis.

# Renaming columns
df.rename(columns={'Name': 'Employee Name', 'Age': 'Employee Age', 'Salary': 'Employee Salary'}, inplace=True)

print("DataFrame after renaming columns:")
print(df)

# Renaming indexes
df.index = ['Employee 1', 'Employee 2', 'Employee 3', 'Employee 4']

print("\nDataFrame after renaming indexes:")
print(df)


OUTPUT:

DataFrame after renaming columns:
  Employee Name  Employee Age  Employee Salary
0          John            28          50000.0
1          Anna            34          60000.0
2         Peter            29          55000.0
3         Linda            32          58000.0

DataFrame after renaming indexes:
           Employee Name  Employee Age  Employee Salary
Employee 1          John            28          50000.0
Employee 2          Anna            34          60000.0
Employee 3         Peter            29          55000.0
Employee 4         Linda            32          58000.0

In this example, .rename() is used with the columns parameter to rename specific columns. The inplace=True argument modifies the original DataFrame. Then, the DataFrame indexes are directly assigned new labels, effectively renaming them.

These operations are crucial for data cleaning and preparation, ensuring that your data is in the right format and is easily understandable.

Data Aggregation and Grouping

Data aggregation and grouping are powerful Pandas functionalities that allow you to summarize and analyze datasets in meaningful ways. This section demonstrates how to use groupby operations; perform aggregation, filtering, transformation, and apply steps; and create pivot tables and cross-tabulations.

GroupBy Operations

GroupBy operations are used to split the data into groups based on some criteria, apply a function to each group independently, and combine the results into a data structure.

import pandas as pd

# Sample DataFrame
data = {'Department': ['Sales', 'Tech', 'Tech', 'HR', 'Sales'],
        'Employee': ['John', 'Anna', 'Peter', 'Linda', 'Steve'],
        'Salary': [70000, 80000, 75000, 65000, 60000]}

df = pd.DataFrame(data)

# Grouping by 'Department' and calculating mean salary
grouped = df.groupby('Department')['Salary'].mean()
print(grouped)

OUTPUT:

Department
HR       65000.0
Sales    65000.0
Tech     77500.0
Name: Salary, dtype: float64

Common GroupBy Functions:

Aggregation Functions: sum(), mean(), median(), min(), max(), count(), size(), std(), var(), etc.

# Aggregation
agg = df.groupby('Department').agg({'Salary': ['mean', 'min', 'max']})
print(agg)

OUTPUT:

             Salary              
               mean    min    max
Department                       
HR          65000.0  65000  65000
Sales       65000.0  60000  70000
Tech        77500.0  75000  80000

In Pandas, aggregate, filter, transform, and apply are powerful methods used in conjunction with GroupBy operations for data manipulation and analysis. These methods allow you to perform various computations on grouped data efficiently.

1. Aggregate:

The aggregate method allows you to apply one or more aggregation functions to grouped data, such as sum, mean, median, min, max, count, and custom aggregation functions.

Syntax:

grouped.aggregate(func)

  • func: Aggregation function or a list/dictionary of aggregation functions to apply.

Example:

# Group by 'Category' and calculate sum and mean of 'Value'
result = df.groupby('Category').aggregate({'Value': ['sum', 'mean']})
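pandas also supports named aggregation, where keyword arguments to agg() name the output columns directly; a short sketch using the same invented 'Category'/'Value' columns:

import pandas as pd

df3 = pd.DataFrame({'Category': ['A', 'A', 'B'], 'Value': [1, 2, 3]})

result = df3.groupby('Category').agg(
    total=('Value', 'sum'),     # output column 'total'
    average=('Value', 'mean'),  # output column 'average'
)
print(result)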

2. Filter:

The filter method allows you to apply a filtering function to each group and return the groups that satisfy the condition.

Syntax:

grouped.filter(func)

  • func: Filtering function that returns a boolean indicating whether the group should be included.

Example:

# Filter groups where the mean of 'Value' is greater than 10
result = df.groupby('Category').filter(lambda x: x['Value'].mean() > 10)

3. Transform:

The transform method applies a function to each group and returns a DataFrame with the same shape as the original DataFrame, where each value corresponds to the result of the function applied to the corresponding group.

Syntax:

grouped.transform(func)

  • func: Transformation function to apply to each group.

Example:

# Normalize 'Value' within each group by subtracting the group mean and dividing by the group standard deviation
result = df.groupby('Category').transform(lambda x: (x - x.mean()) / x.std())
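A common practical use of transform, tying back to the missing-data section above, is filling NaNs with each group's own mean rather than the global mean; a sketch with invented data:

import pandas as pd
import numpy as np

df3 = pd.DataFrame({'Category': ['A', 'A', 'B', 'B'],
                    'Value': [1.0, np.nan, 10.0, 20.0]})

# The NaN in group 'A' becomes 1.0 (the mean of group 'A'), not the global mean
df3['Value'] = df3.groupby('Category')['Value'].transform(lambda x: x.fillna(x.mean()))
print(df3)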

4. Apply:

The apply method applies a function to each group and returns the combined result as a single DataFrame. It is more general-purpose than aggregate and transform as it allows for more flexible computations.

Syntax:

grouped.apply(func)

  • func: Function to apply to each group.

Example:

# Apply a custom function to calculate the difference between the maximum and minimum values of 'Value' in each group
def custom_function(x):
    return x['Value'].max() - x['Value'].min()

result = df.groupby('Category').apply(custom_function)

Key Points:

  • aggregate, filter, transform, and apply methods are used in combination with GroupBy operations for data manipulation.
  • aggregate is used to compute summary statistics for each group.
  • filter is used to filter groups based on a condition.
  • transform is used to perform element-wise transformations within each group.
  • apply is a more general-purpose method that allows you to apply arbitrary functions to each group.
  • These methods provide powerful capabilities for data analysis and manipulation in Pandas, enabling complex computations and transformations on grouped data efficiently.

# Filter
filter_func = lambda x: x['Salary'].mean() > 65000
filtered = df.groupby('Department').filter(filter_func)
print(filtered)

# Transformation
zscore = lambda x: (x - x.mean()) / x.std()
transformed = df.groupby('Department')['Salary'].transform(zscore)
df['Salary Z-Score'] = transformed
print(df)

# Apply
def update_salary(group):
    return group['Salary'] * 1.1

applied = df.groupby('Department').apply(update_salary)
print(applied)

OUTPUT:

  Department Employee  Salary
1       Tech     Anna   80000
2       Tech    Peter   75000

  Department Employee  Salary  Salary Z-Score
0      Sales     John   70000        0.707107
1       Tech     Anna   80000        0.707107
2       Tech    Peter   75000       -0.707107
3         HR    Linda   65000             NaN
4      Sales    Steve   60000       -0.707107

Department   
HR          3    71500.0
Sales       0    77000.0
            4    66000.0
Tech        1    88000.0
            2    82500.0
Name: Salary, dtype: float64
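The introduction to this section also mentions pivot tables and cross-tabulation; as a brief sketch using the Department/Employee/Salary DataFrame defined above:

# Pivot table: mean salary per department (equivalent to the groupby above)
pivot = pd.pivot_table(df, values='Salary', index='Department', aggfunc='mean')
print(pivot)

# Cross-tabulation: row/column counts, here departments vs. employees
ct = pd.crosstab(df['Department'], df['Employee'])
print(ct)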


Case Study: Retail Sales Analysis

Background

A retail company operates multiple stores across different regions. Each store sells various products categorized into departments such as Electronics, Clothing, and Home & Garden. The company maintains a dataset of daily sales transactions that record the store ID, region, department, date of transaction, and sales amount. The management wants to analyze these sales data to understand performance patterns and make informed decisions.

Objective

To analyze the dataset for insights on:

  • Sales performance by region and department
  • Trends in daily sales volumes
  • The impact of seasons on sales figures
  • The top-performing stores and departments

import pandas as pd
import numpy as np

# Creating a sample dataset
np.random.seed(42)  # For reproducibility
dates = pd.date_range('2023-01-01', periods=90, freq='D')  # 90 days of sales
data = {
    'StoreID': np.random.choice(['Store_1', 'Store_2', 'Store_3'], size=90),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], size=90),
    'Department': np.random.choice(['Electronics', 'Clothing', 'Home & Garden'], size=90),
    'Date': dates,
    'Sales': np.random.randint(100, 5000, size=90)  # Random sales between $100 and $5000
}

df = pd.DataFrame(data)

# Display the first few rows of the dataset
df.head()

# Analysis 1: Sales Performance by Region and Department
sales_performance = df.groupby(['Region', 'Department'])['Sales'].sum().unstack()

# Analysis 2: Trends in Daily Sales Volumes
daily_sales = df.groupby('Date')['Sales'].sum()

# Analysis 3: Seasonal Impact on Sales
df['Month'] = df['Date'].dt.month
seasonal_sales = df.groupby(['Month', 'Department'])['Sales'].sum().unstack()

# Analysis 4: Top-Performing Stores and Departments
top_performers = df.groupby(['StoreID', 'Department'])['Sales'].sum().sort_values(ascending=False).head(5)

sales_performance, daily_sales, seasonal_sales, top_performers


OUTPUT:

(Department  Clothing  Electronics  Home & Garden
Region                                          
East           25802        31770          27463
North          28276        25508          17001
South          10906         6892          13707
West           18123        11949          27218,
Date
2023-01-01    2086
2023-01-02     246
2023-01-03    3319
2023-01-04    3011
2023-01-05    1834
              ... 
2023-03-27    1169
2023-03-28    3544
2023-03-29    2665
2023-03-30    1160
2023-03-31    2427
Name: Sales, Length: 90, dtype: int64,
Department  Clothing  Electronics  Home & Garden
Month                                           
1              35622        23683          21398
2              23782        30415          26470
3              23703        22021          37521,
StoreID  Department   
Store_2  Home & Garden    39114
Store_3  Clothing         33577
Store_2  Clothing         31452
Store_3  Home & Garden    31419
Store_2  Electronics      28537
Name: Sales, dtype: int64)


Lesson Assignment

Challenge yourself with our lab assignment and put your skills to the test.

# Python program to find the area of a triangle

a = 5
b = 6
c = 7

# Uncomment below to take inputs from the user
# a = float(input('Enter first side: '))
# b = float(input('Enter second side: '))
# c = float(input('Enter third side: '))

# calculate the semi-perimeter
s = (a + b + c) / 2

# calculate the area
area = (s*(s-a)*(s-b)*(s-c)) ** 0.5
print('The area of the triangle is %0.2f' %area)