Pandas DataFrame apply() Method

The .apply() method in Pandas applies a function along an axis of a DataFrame or to each value of a Series. It handles both row-wise and column-wise operations and accepts anything from a built-in aggregation to a custom function. Using .apply() effectively makes many data manipulation and analysis tasks in Pandas simpler to express.

Basic Usage of .apply()

The basic syntax of .apply(), showing its most commonly used parameters, is:

DataFrame.apply(func, axis=0, args=(), **kwargs)

  • func: The function to apply to each column or row.
  • axis: Specifies the axis along which the function is applied:
    • axis=0: Apply a function to each column (default).
    • axis=1: Apply a function to each row.
  • args: A tuple of positional arguments to pass to the function after the row or column itself.
  • **kwargs: Additional keyword arguments to pass to the function.

Applying a Function to Each Column

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': range(1, 6),
    'B': range(10, 0, -2),
    'C': range(10, 15)
})

# Apply np.sum function to each column
print(df.apply(np.sum))

Applying a Function to Each Row

# Apply np.sum function to each row
print(df.apply(np.sum, axis=1))
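
For reference, here is a small sketch of what the two calls above return, next to the built-in df.sum(), which produces the same totals without going through .apply(). The DataFrame is rebuilt so the snippet runs on its own; the names col_totals and row_totals are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': range(1, 6),
    'B': range(10, 0, -2),
    'C': range(10, 15)
})

col_totals = df.apply(np.sum)          # one value per column
row_totals = df.apply(np.sum, axis=1)  # one value per row

print(col_totals.tolist())  # [15, 30, 60]
print(row_totals.tolist())  # [21, 21, 21, 21, 21]

# The built-in reductions give the same totals
print(df.sum().tolist())        # [15, 30, 60]
print(df.sum(axis=1).tolist())  # [21, 21, 21, 21, 21]
```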

Applying Custom Functions

.apply() becomes particularly powerful when you need to apply custom functions to your data.

Example: Subtracting the Minimum from the Maximum in Each Row

def custom_range(x):
    return x.max() - x.min()

print(df.apply(custom_range, axis=1))
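
As a quick check of the example above, the sketch below runs custom_range on the same df and compares it with an equivalent expression built from the row-wise max() and min() reductions. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'A': range(1, 6),
    'B': range(10, 0, -2),
    'C': range(10, 15)
})

def custom_range(x):
    # x is one row of df, passed in as a Series
    return x.max() - x.min()

row_ranges = df.apply(custom_range, axis=1)

# The same calculation using built-in row-wise reductions
vectorized_ranges = df.max(axis=1) - df.min(axis=1)

print(row_ranges.tolist())         # [9, 9, 9, 9, 12]
print(vectorized_ranges.tolist())  # [9, 9, 9, 9, 12]
```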

Using .apply() with Additional Arguments

You can pass additional arguments and keyword arguments to the function being applied.

Example: Adding a Constant Value to Each Element

def add_custom_value(x, add_value):
    return x + add_value

print(df.apply(add_custom_value, args=(5,)))
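
The example above passes the extra argument by position through args. A minimal sketch of the keyword form, which .apply() forwards to the function via **kwargs, might look like this (by_position and by_keyword are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    'A': range(1, 6),
    'B': range(10, 0, -2),
    'C': range(10, 15)
})

def add_custom_value(x, add_value):
    return x + add_value

# Extra argument passed by position
by_position = df.apply(add_custom_value, args=(5,))

# Same argument passed by keyword; apply() forwards it to the function
by_keyword = df.apply(add_custom_value, add_value=5)

print(by_position.equals(by_keyword))  # True
```

The keyword form tends to read more clearly when the function takes several extra parameters.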

.apply() on DataFrame vs. Series

While .apply() is available on both DataFrames and Series, its behavior differs slightly. On a DataFrame, the function receives each column (axis=0) or each row (axis=1) as a Series. On a Series, the function is applied to each element.

Example: Applying a Function to a Series

# Applying a function that increases each element by 10% on Series 'A'
print(df['A'].apply(lambda x: x * 1.1))
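
For simple arithmetic like this, the same result can usually be obtained by operating on the Series directly. The sketch below compares the two approaches on a standalone Series that stands in for df['A'].

```python
import numpy as np
import pandas as pd

s = pd.Series(range(1, 6), name='A')

applied = s.apply(lambda x: x * 1.1)  # calls the lambda once per element
vectorized = s * 1.1                  # one vectorized multiplication

print(np.allclose(applied, vectorized))  # True
```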

Considerations and Alternatives

  • Performance: While .apply() is very flexible, it may not always be the most performant option, especially for large datasets. Vectorized operations with Pandas or NumPy functions are often faster.
  • Alternatives: For specific use cases such as arithmetic and string or datetime manipulation, Pandas provides vectorized operators and accessors (.str, .dt) that can be more efficient.
  • Element-wise operations on a DataFrame: To apply a function to every element of a DataFrame, use DataFrame.map() in pandas 2.1 or later; it replaces .applymap(), which is deprecated as of that release.
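
To illustrate the element-wise case, here is a minimal sketch that works on both newer and older pandas versions. It assumes DataFrame.map is available from pandas 2.1 and falls back to .applymap() otherwise; label_parity is a hypothetical helper written for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

def label_parity(value):
    # Hypothetical helper: classify a single cell value
    return 'even' if value % 2 == 0 else 'odd'

# Prefer DataFrame.map (pandas 2.1+); fall back to applymap on older versions
elementwise = df.map if hasattr(df, 'map') else df.applymap
labels = elementwise(label_parity)

print(labels['A'].tolist())  # ['odd', 'even', 'odd']
print(labels['B'].tolist())  # ['even', 'even', 'even']
```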

Conclusion

The .apply() method is a cornerstone of Pandas' functionality, offering the flexibility to apply both predefined and custom functions across DataFrames and Series. Whether you're performing simple arithmetic operations, complex row-wise or column-wise transformations, or applying conditional logic, .apply() provides the means to execute these tasks in an intuitive and powerful manner.

Case Study: Analyzing Sales Performance with Pandas .apply() Method

Scenario

A retail company wants to analyze its sales performance over the past year. The dataset contains sales transactions across different stores, including the date of sale, store ID, product category, and sales amount. The goal is to identify top-performing categories, adjust strategies for underperforming ones, and understand seasonal trends.

Dataset Overview

The dataset, named sales_data.csv, includes the following columns:

  • Date: The date of the transaction.
  • StoreID: Identifier for the store.
  • Category: The category of the product sold (e.g., Electronics, Clothing, Furniture).
  • Amount: The sales amount in USD.

Objectives

  1. Calculate the total sales for each product category.
  2. Determine the month with the highest sales for each category.
  3. Identify the store with the highest sales in each category.
  4. Analyze seasonal sales trends and identify any outliers.

Analysis

Step 1: Load the Data

import pandas as pd

sales_data = pd.read_csv('sales_data.csv', parse_dates=['Date'])

Step 2: Total Sales by Category

total_sales_by_category = sales_data.groupby('Category')['Amount'].sum()
print(total_sales_by_category)

Step 3: Month with Highest Sales for Each Category

First, extract the month from the date and create a new column.

sales_data['Month'] = sales_data['Date'].dt.month

Then, use .apply() to find the month with the highest sales for each category.

def get_top_month(group):
    return group.groupby('Month')['Amount'].sum().idxmax()

top_month_by_category = sales_data.groupby('Category').apply(get_top_month)
print(top_month_by_category)
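
Recent pandas releases may emit a warning when a function passed to groupby(...).apply() receives the grouping column itself. One way to sidestep this, sketched below on a small hand-made dataset (not sales_data.csv), is to select only the columns the function needs before calling .apply().

```python
import pandas as pd

# Small illustrative dataset standing in for the real sales data
sales = pd.DataFrame({
    'Date': pd.to_datetime([
        '2023-01-05', '2023-01-20', '2023-02-10',
        '2023-02-15', '2023-03-01', '2023-03-12',
    ]),
    'Category': ['Clothing', 'Clothing', 'Clothing',
                 'Electronics', 'Electronics', 'Electronics'],
    'Amount': [100, 150, 300, 400, 200, 250],
})
sales['Month'] = sales['Date'].dt.month

def get_top_month(group):
    return group.groupby('Month')['Amount'].sum().idxmax()

# Pass only the columns the function uses, not the grouping column
top_month = sales.groupby('Category')[['Month', 'Amount']].apply(get_top_month)

print(top_month.to_dict())  # {'Clothing': 2, 'Electronics': 3}
```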

Step 4: Store with Highest Sales in Each Category

def get_top_store(group):
    return group.groupby('StoreID')['Amount'].sum().idxmax()

top_store_by_category = sales_data.groupby('Category').apply(get_top_store)
print(top_store_by_category)
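
The same question can also be answered without .apply(). The sketch below, using a small made-up dataset, sums sales per category and store, reshapes the result so stores become columns, and picks the column with the largest total in each row.

```python
import pandas as pd

# Small illustrative dataset standing in for the real sales data
sales = pd.DataFrame({
    'StoreID': ['Store1', 'Store2', 'Store1', 'Store2', 'Store3', 'Store3'],
    'Category': ['Clothing', 'Clothing', 'Clothing',
                 'Electronics', 'Electronics', 'Electronics'],
    'Amount': [100, 300, 150, 400, 200, 250],
})

# Total sales for every (category, store) pair
store_totals = sales.groupby(['Category', 'StoreID'])['Amount'].sum()

# One row per category, one column per store; idxmax picks the best store
top_store = store_totals.unstack('StoreID').idxmax(axis=1)

print(top_store.to_dict())  # {'Clothing': 'Store2', 'Electronics': 'Store3'}
```

Stores with no sales in a category show up as missing values after unstack(), and idxmax() skips them by default.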

Step 5: Analyzing Seasonal Sales Trends

First, categorize sales data into seasons.

def categorize_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'

sales_data['Season'] = sales_data['Month'].apply(categorize_season)
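
Because the season depends only on the month number, a dictionary lookup with Series.map() is a possible alternative to calling a function per element. The sketch below uses a standalone Series of months; season_by_month is an illustrative name.

```python
import pandas as pd

months = pd.Series([1, 4, 7, 10, 12])

# Lookup table equivalent to the if/elif chain in categorize_season
season_by_month = {
    12: 'Winter', 1: 'Winter', 2: 'Winter',
    3: 'Spring', 4: 'Spring', 5: 'Spring',
    6: 'Summer', 7: 'Summer', 8: 'Summer',
    9: 'Fall', 10: 'Fall', 11: 'Fall',
}

seasons = months.map(season_by_month)

print(seasons.tolist())  # ['Winter', 'Spring', 'Summer', 'Fall', 'Winter']
```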

Analyze sales trends by season for each category.

seasonal_trends = sales_data.groupby(['Category', 'Season'])['Amount'].sum().unstack()
print(seasonal_trends)
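
If a category has no sales in some season, unstack() leaves a missing value in that cell. A small sketch of filling those gaps with zero, on made-up data, is shown below.

```python
import pandas as pd

# Electronics has no Spring sales in this illustrative dataset
sales = pd.DataFrame({
    'Category': ['Clothing', 'Clothing', 'Electronics'],
    'Season': ['Winter', 'Spring', 'Winter'],
    'Amount': [100, 200, 300],
})

seasonal = (
    sales.groupby(['Category', 'Season'])['Amount']
    .sum()
    .unstack(fill_value=0)  # use 0 instead of NaN for missing combinations
)

print(seasonal.loc['Electronics', 'Spring'])  # 0
print(seasonal.loc['Clothing', 'Spring'])     # 200
```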

Step 6: Identifying Outliers

Use .apply() with axis=1 and a row-wise function to flag sales amounts that are more than 1.5 times the category average.

def identify_outliers(row):
    category_average = sales_data[sales_data['Category'] == row['Category']]['Amount'].mean()
    return 'Outlier' if row['Amount'] > category_average * 1.5 else 'Normal'

sales_data['SalesType'] = sales_data.apply(identify_outliers, axis=1)
print(sales_data[['Date', 'Category', 'Amount', 'SalesType']])
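
The row-wise version recomputes the category average for every row, which can become slow on large datasets. One possible alternative, sketched here on a small made-up dataset, computes each category's mean once with groupby(...).transform('mean') and compares all rows in a single step.

```python
import numpy as np
import pandas as pd

# Small illustrative dataset standing in for the real sales data
sales = pd.DataFrame({
    'Category': ['Clothing', 'Clothing', 'Clothing',
                 'Electronics', 'Electronics'],
    'Amount': [100, 100, 400, 500, 520],
})

# Category mean broadcast back to every row of that category
category_mean = sales.groupby('Category')['Amount'].transform('mean')

# Same 1.5x threshold as the row-wise function
sales['SalesType'] = np.where(
    sales['Amount'] > category_mean * 1.5, 'Outlier', 'Normal'
)

print(sales['SalesType'].tolist())
# ['Normal', 'Normal', 'Outlier', 'Normal', 'Normal']
```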

Conclusion

This analysis provided valuable insights into the sales performance of different product categories, highlighting top-performing months and stores for each category. Additionally, the seasonal trends analysis helped identify key periods for sales, while outlier detection pointed out transactions that significantly deviated from the norm. This information can guide strategic decisions to boost sales and improve inventory management.

The complete script below runs the same analysis on randomly generated example data in place of sales_data.csv.

import pandas as pd
import numpy as np

# Creating example data
np.random.seed(0)
dates = pd.date_range('2023-01-01', periods=120, freq='D')
data = {
    'Date': dates,
    'StoreID': np.random.choice(['Store1', 'Store2', 'Store3'], size=120),
    'Category': np.random.choice(['Electronics', 'Clothing', 'Furniture'], size=120),
    'Amount': np.random.randint(100, 2000, size=120)
}
sales_data = pd.DataFrame(data)

# Calculate total sales by category
total_sales_by_category = sales_data.groupby('Category')['Amount'].sum()

# Extract month from date
sales_data['Month'] = sales_data['Date'].dt.month

# Define function to get top month for each category
def get_top_month(group):
    return group.groupby('Month')['Amount'].sum().idxmax()

# Apply function to get top month by category
top_month_by_category = sales_data.groupby('Category').apply(get_top_month)

# Define function to get top store for each category
def get_top_store(group):
    return group.groupby('StoreID')['Amount'].sum().idxmax()

# Apply function to get top store by category
top_store_by_category = sales_data.groupby('Category').apply(get_top_store)

# Categorize sales data into seasons and analyze trends
def categorize_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'

sales_data['Season'] = sales_data['Month'].apply(categorize_season)
seasonal_trends = sales_data.groupby(['Category', 'Season'])['Amount'].sum().unstack()

# Identify outliers
def identify_outliers(row):
    category_average = sales_data[sales_data['Category'] == row['Category']]['Amount'].mean()
    return 'Outlier' if row['Amount'] > category_average * 1.5 else 'Normal'

sales_data['SalesType'] = sales_data.apply(identify_outliers, axis=1)

# Display results
total_sales_by_category, top_month_by_category, top_store_by_category, seasonal_trends, sales_data.head()

OUTPUT:

(Category
Clothing       40430
Electronics    39073
Furniture      35911
Name: Amount, dtype: int64,
Category
Clothing       4
Electronics    2
Furniture      4
dtype: int64,
Category
Clothing       Store3
Electronics    Store1
Furniture      Store3
dtype: object,
Season       Spring  Winter
Category
Clothing      23599   16831
Electronics   15953   23120
Furniture     19377   16534,
        Date StoreID     Category  Amount  Month  Season  SalesType
0 2023-01-01  Store1     Clothing    1212      1  Winter     Normal
1 2023-01-02  Store2    Furniture     133      1  Winter     Normal
2 2023-01-03  Store1     Clothing     745      1  Winter     Normal
3 2023-01-04  Store2  Electronics     332      1  Winter     Normal
4 2023-01-05  Store2    Furniture     867      1  Winter     Normal)
The analysis using the example data produced the following insights:

Total Sales by Category

Clothing: $40,430

Electronics: $39,073

Furniture: $35,911

Top Month by Category

Clothing and Furniture: April (Month 4)

Electronics: February (Month 2)

Top Store by Category

Clothing and Furniture: Store3

Electronics: Store1

Seasonal Trends

Spring saw the highest sales for Clothing ($23,599) and Furniture ($19,377), while Winter was the top season for Electronics ($23,120).
