Python Libraries For Data Science

Time Series Data Analysis Using Pandas

Creating date ranges and understanding DatetimeIndex are foundational skills for handling time series data in pandas. Below is a detailed explanation with examples that you can run in a Jupyter Notebook.

Creating Date Ranges with pd.date_range()

The pd.date_range() function creates a sequence of dates. It is flexible, letting you specify the start date, end date, number of periods, and frequency of the dates.

Basic Usage

import pandas as pd

# Creating a date range with a daily frequency
daily_dates = pd.date_range(start='2024-01-01', end='2024-01-07')
print(daily_dates)

This will generate a DatetimeIndex containing dates from January 1, 2024, to January 7, 2024, inclusive.
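For reference, the printed result for these arguments looks like this:

OUTPUT:

DatetimeIndex(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04',
               '2024-01-05', '2024-01-06', '2024-01-07'],
              dtype='datetime64[ns]', freq='D')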

Specifying Frequency

The freq parameter allows you to specify the frequency of the generated dates: for example, 'M' for month-end, 'W' for weekly, and 'H' for hourly. (Pandas 2.2 and later prefer the aliases 'ME' and 'h'; the older 'M' and 'H' still work but raise deprecation warnings.)

# Creating monthly dates
monthly_dates = pd.date_range(start='2024-01-01', periods=6, freq='M')
print(monthly_dates)

# Creating hourly dates
hourly_dates = pd.date_range(start='2024-01-01', periods=24, freq='H')
print(hourly_dates)

Understanding DatetimeIndex

A DatetimeIndex is a type of index used in pandas to handle datetime objects. It provides powerful methods for time series data manipulation, such as time-based indexing, slicing, and resampling.

Indexing with DatetimeIndex

You can use a DatetimeIndex to index pandas Series or DataFrame objects, making it easy to perform time-based operations.

# Create a simple time series data
data = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('2024-01-01', periods=6, freq='D'))
print(data)

# Access data for a specific date
print(data['2024-01-03'])

Slicing with DatetimeIndex

DatetimeIndex allows for easy slicing of time series data, enabling you to extract data for specific time periods. Unlike positional slicing, slices with date strings include both endpoints.

# Slice data for a range of dates
print(data['2024-01-02':'2024-01-04'])

Exercise

  1. Create a DatetimeIndex that represents all the Sundays in January 2024.
  2. Generate a pandas Series with this DatetimeIndex as the index and random values as the data.
  3. Extract all values in the Series that fall in the second half of January 2024.

This exercise gives you hands-on experience with creating and manipulating time series data, reinforcing the concepts introduced in this section.
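If you want to check your work, one possible solution looks like this (a sketch, not the only valid approach; the 'W-SUN' alias anchors weekly dates on Sundays):

import pandas as pd
import numpy as np

# 1. All Sundays in January 2024
sundays = pd.date_range(start='2024-01-01', end='2024-01-31', freq='W-SUN')

# 2. A Series of random values indexed by those Sundays
sunday_series = pd.Series(np.random.randn(len(sundays)), index=sundays)

# 3. Values falling in the second half of January 2024
second_half = sunday_series['2024-01-16':'2024-01-31']
print(second_half)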

Time Series Data Operations

In this section, you will learn about performing key operations on time series data, including indexing and slicing, resampling for different time periods, and handling missing values.

Indexing and Slicing Time Series Data

Indexing and slicing are fundamental for accessing and modifying subsets of time series data. With pandas, you can use date/time strings directly for indexing and slicing operations.

Example:

import pandas as pd
import numpy as np

# 'data' is a pandas Series with a DatetimeIndex
data = pd.Series(np.random.randn(365), index=pd.date_range('2023-01-01', periods=365))

# Indexing to get the value at a specific date
print(data['2023-04-01'])

# Slicing to get data for a specific month
april_data = data['2023-04']
print(april_data)

Resampling Time Series Data

Resampling involves changing the frequency of your time series observations. The two main types are:

  • Upsampling: Increasing the frequency (e.g., from days to hours).
  • Downsampling: Decreasing the frequency (e.g., from days to months).

You use the resample() method to group the data, then an aggregation function (like mean(), sum(), etc.) to calculate the new values.

Example:

# Downsampling - monthly averages from daily data
monthly_resample = data.resample('M').mean()
print(monthly_resample)

# Upsampling - forward fill missing daily values from monthly data
upsampled = monthly_resample.resample('D').ffill()
print(upsampled)

Handling Missing Values in Time Series

Missing values can be problematic for time series analysis and forecasting. Pandas provides methods like fillna() and dropna() to handle missing data.

Filling Missing Values

You can fill missing values with a specific value, or use forward fill (ffill) to propagate the previous value and backward fill (bfill) to propagate the next value.

# Forward fill (data.fillna(method='ffill') is deprecated in newer pandas)
filled_data = data.ffill()

# Fill with a specific value (e.g., 0)
filled_data_zero = data.fillna(0)

Dropping Missing Values

If filling missing values is not appropriate, you can choose to drop them.

# Drop all rows with missing values
clean_data = data.dropna()

Exercise

  1. Given a time series dataset, perform downsampling to convert daily data into weekly averages.
  2. Upsample the weekly data back to daily, using forward fill to handle missing days.
  3. Identify and fill missing values with the mean of the dataset.

This section combines code snippets and exercises designed to strengthen your understanding and ability to manipulate time series data effectively with pandas.
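One way the exercise could be approached (a sketch using synthetic data with artificial gaps; your dataset will differ):

import pandas as pd
import numpy as np

# Synthetic daily series with a few artificial gaps
daily = pd.Series(np.random.randn(60), index=pd.date_range('2023-01-01', periods=60, freq='D'))
daily.iloc[[5, 20, 33]] = np.nan

# 1. Downsample daily data to weekly averages
weekly = daily.resample('W').mean()

# 2. Upsample the weekly data back to daily, forward-filling missing days
daily_again = weekly.resample('D').ffill()

# 3. Fill the remaining missing values with the mean of the dataset
filled = daily.fillna(daily.mean())

The cell below gathers the snippets from this section into one runnable block.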

import pandas as pd
import numpy as np

# Creating a pandas Series with a DatetimeIndex
data = pd.Series(np.random.randn(365), index=pd.date_range('2023-01-01', periods=365))

# Indexing to get the value at a specific date
value_specific_date = data['2023-04-01']

# Slicing to get data for a specific month (April)
april_data = data['2023-04']

# Downsampling - Calculate monthly averages from daily data
monthly_resample = data.resample('M').mean()

# Upsampling - Forward fill missing daily values from monthly data
upsampled = monthly_resample.resample('D').ffill()

# Forward fill to handle missing values (fillna(method='ffill') is deprecated in newer pandas)
filled_data = data.ffill()

# Dropping rows with missing values (if any)
clean_data = data.dropna()

# Display results for key operations
value_specific_date, april_data.head(), monthly_resample.head(), upsampled.head(), filled_data.head(), clean_data.head()

OUTPUT:

(2.010105150822318,
2023-04-01    2.010105
2023-04-02    1.477491
2023-04-03   -0.214738
2023-04-04    0.846469
2023-04-05   -0.011529
Freq: D, dtype: float64,
2023-01-31   -0.074265
2023-02-28   -0.271045
2023-03-31   -0.136266
2023-04-30    0.121715
2023-05-31   -0.055408
Freq: M, dtype: float64,
2023-01-31   -0.074265
2023-02-01   -0.074265
2023-02-02   -0.074265
2023-02-03   -0.074265
2023-02-04   -0.074265
Freq: D, dtype: float64,
2023-01-01   -1.752798
2023-01-02   -0.171658
2023-01-03   -0.218524
2023-01-04   -0.135137
2023-01-05   -0.405207
Freq: D, dtype: float64,
2023-01-01   -1.752798
2023-01-02   -0.171658
2023-01-03   -0.218524
2023-01-04   -0.135137
2023-01-05   -0.405207
Freq: D, dtype: float64)


Rolling Window Calculations

Rolling window calculations are used to apply a function to a fixed-size moving window of data points in a time series. It's useful for smoothing out short-term fluctuations and highlighting longer-term trends or cycles.

Example:

# Using the 'data' Series from before, calculate a 7-day rolling mean
rolling_mean = data.rolling(window=7).mean()

# Calculate a 7-day rolling standard deviation
rolling_std = data.rolling(window=7).std()

Expanding Window Calculations

Expanding window calculations are similar to rolling window calculations, but the size of the window expands over time. This means that every point is calculated using all prior data points.

Example:

# Calculate the expanding mean of the data
expanding_mean = data.expanding(min_periods=1).mean()

# Calculate the expanding standard deviation
expanding_std = data.expanding(min_periods=1).std()

These methods provide insights into the underlying trends and variability of the data, with the rolling method offering a local view (within the window) and the expanding method providing a cumulative view from the start to each point in the series.

Let's apply these calculations to our synthetic data and examine the results.

# Calculate a 7-day rolling mean and standard deviation
rolling_mean = data.rolling(window=7).mean()
rolling_std = data.rolling(window=7).std()

# Calculate the expanding mean and standard deviation
expanding_mean = data.expanding(min_periods=1).mean()
expanding_std = data.expanding(min_periods=1).std()

# Display the first few results to illustrate
rolling_mean.head(10), rolling_std.head(10), expanding_mean.head(), expanding_std.head()

OUTPUT:

(2023-01-01        NaN
2023-01-02         NaN
2023-01-03         NaN
2023-01-04         NaN
2023-01-05         NaN
2023-01-06         NaN
2023-01-07   -0.302494
2023-01-08   -0.183628
2023-01-09    0.095844
2023-01-10   -0.077958
Freq: D, dtype: float64,
2023-01-01         NaN
2023-01-02         NaN
2023-01-03         NaN
2023-01-04         NaN
2023-01-05         NaN
2023-01-06         NaN
2023-01-07    0.694145
2023-01-08    0.422484
2023-01-09    0.856172
2023-01-10    1.035359
Freq: D, dtype: float64,
2023-01-01   -1.752798
2023-01-02   -0.962228
2023-01-03   -0.714327
2023-01-04   -0.569529
2023-01-05   -0.536665
Freq: D, dtype: float64,
2023-01-01         NaN
2023-01-02    1.118035
2023-01-03    0.899648
2023-01-04    0.789584
2023-01-05    0.687737
Freq: D, dtype: float64)


Here are the results from the rolling and expanding window calculations:

  • Rolling Mean (7-day window): The first 6 days are NaN because the window size is 7 days. From the 7th day onward, each value is the average of the previous 7 days. For instance, on January 7th the rolling mean is approximately -0.302, the average value of the data over the past week.
  • Rolling Standard Deviation (7-day window): Similar to the rolling mean, the first 6 days are NaN. From the 7th day onward, the standard deviation of the past 7 days is calculated. On January 7th the rolling standard deviation is approximately 0.694, showing the variability of the data over the week.
  • Expanding Mean: The expanding mean starts from the first data point and includes all subsequent points. Initially it equals the first data point itself, approximately -1.753. As more data points are included, the mean updates to reflect the cumulative average; by January 5th it is approximately -0.537.
  • Expanding Standard Deviation: The expanding standard deviation starts from the second data point, since at least two points are needed to compute a standard deviation. It reflects the variability of the data from the start up to the current point. On January 2nd it is approximately 1.118, and it adjusts as more data points are included, showing how the data's variability evolves over time.

These calculations demonstrate how rolling and expanding window operations can provide insights into the time series data's behavior, trend, and variability.


Shifting and Lagging

Shifting and lagging are important concepts in time series analysis, allowing you to move data backward or forward in time. This is particularly useful for creating features for machine learning models, analyzing relationships between time series, or comparing changes over time.

Shifting Time Series Data

The shift() method in pandas moves the data up or down along the time index. This is useful for calculating changes over time or for aligning time series data for comparison.

Example:

# Shift the data by 1 day forward
data_shifted_forward = data.shift(periods=1)

# Shift the data by 2 days backward
data_shifted_backward = data.shift(periods=-2)
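One common application of shift() is computing period-over-period changes; a minimal sketch:

# Day-over-day change: today's value minus yesterday's
daily_change = data - data.shift(periods=1)

# data.diff() is the built-in shortcut for the same calculation
daily_change_alt = data.diff()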

Lagging Time Series Data

"Lagging" is another term for shifting data backward in time. It's often used in the context of creating lag features for predictive modeling, where you want to use previous time steps to predict future time steps.

The shift() method can be used to create lag features by shifting the data backward. The term "lag" is just a specific case of shifting.

Example:

# Create a 1-day lag feature
data_lagged_1_day = data.shift(periods=1)

# Create a 7-day lag feature
data_lagged_7_days = data.shift(periods=7)

Let's apply both shifting and lagging to our synthetic data to illustrate these concepts.

# Shift the data by 1 day forward
data_shifted_forward = data.shift(periods=1)

# Shift the data by 2 days backward
data_shifted_backward = data.shift(periods=-2)

# Create a 1-day lag feature
data_lagged_1_day = data.shift(periods=1)

# Create a 7-day lag feature
data_lagged_7_days = data.shift(periods=7)

# Display the first few results to illustrate shifting and lagging
data.head(), data_shifted_forward.head(), data_shifted_backward.head(), data_lagged_1_day.head(), data_lagged_7_days.head()

OUTPUT:

(2023-01-01   -1.752798
2023-01-02   -0.171658
2023-01-03   -0.218524
2023-01-04   -0.135137
2023-01-05   -0.405207
Freq: D, dtype: float64,
2023-01-01         NaN
2023-01-02   -1.752798
2023-01-03   -0.171658
2023-01-04   -0.218524
2023-01-05   -0.135137
Freq: D, dtype: float64,
2023-01-01   -0.218524
2023-01-02   -0.135137
2023-01-03   -0.405207
2023-01-04    0.423415
2023-01-05    0.142450
Freq: D, dtype: float64,
2023-01-01         NaN
2023-01-02   -1.752798
2023-01-03   -0.171658
2023-01-04   -0.218524
2023-01-05   -0.135137
Freq: D, dtype: float64,
2023-01-01   NaN
2023-01-02   NaN
2023-01-03   NaN
2023-01-04   NaN
2023-01-05   NaN
Freq: D, dtype: float64)


Here's how the shifting and lagging operations have transformed our synthetic data:

  • Original Data (First 5 Days): Shows the initial values in the series, starting from approximately -1.753 on January 1st, 2023.
  • Data Shifted Forward by 1 Day: The entire series has been shifted forward by one day, introducing a NaN at the start and moving the first value (-1.753) to January 2nd, 2023.
  • Data Shifted Backward by 2 Days: This operation shifts the data backward, so the value from January 3rd, 2023 (approximately -0.219) moves to January 1st, 2023. This effectively moves data points earlier in the series, introducing NaN values at the end.
  • Data Lagged by 1 Day: Similar to shifting forward, lagging by one day moves each data point one day into the future. This is useful for comparing each day's value to its previous day.
  • Data Lagged by 7 Days: Here, each value is moved 7 days into the future. The first 7 days in the resulting series are NaN because there's no data for the days preceding the start of the series.

Shifting and lagging are powerful tools for time series analysis, allowing for the comparison of data across different times and the creation of features based on past values. These operations are essential for tasks such as forecasting, where past data points are used to predict future values.
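To make the feature-engineering use case concrete, here is a sketch of assembling lag features into a DataFrame; the column names are illustrative:

# Assemble lag features for predictive modeling (column names are illustrative)
features = pd.DataFrame({
    'value': data,
    'lag_1': data.shift(periods=1),
    'lag_7': data.shift(periods=7),
})

# Drop the initial rows whose lags reach back before the series starts
features = features.dropna()
print(features.head())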

The .dt accessor in pandas

The .dt accessor in pandas is a powerful tool for handling and manipulating datetime objects within a Series. This accessor provides access to the attributes and methods of the datetime object, making it easier to perform operations on date and time data without having to loop through each value. It's particularly useful when working with time series data, allowing you to extract specific components of the date/time (such as year, month, day, hour, etc.) or perform operations like shifting dates, finding day of the week, and more.

Common .dt Accessor Uses

Here are some examples of how you can use the .dt accessor with a pandas Series of datetime objects:

Extracting Date Components

# Assuming 'dates' is a pandas Series with datetime objects
dates = pd.Series(pd.date_range("2024-01-01", periods=5, freq="D"))

# Extract year, month, and day
dates.dt.year
dates.dt.month
dates.dt.day

Accessing Time Components

# Assuming 'datetimes' includes both dates and times
datetimes = pd.Series(pd.date_range("2024-01-01 00:00", periods=5, freq="H"))

# Extract hour
datetimes.dt.hour

Day of the Week, Week of the Year

# Day of the week (Monday=0, Sunday=6)
dates.dt.dayofweek

# Week of the year
dates.dt.isocalendar().week

Boolean Conditions Based on Date

# Find weekends
weekends = dates.dt.dayofweek >= 5

Let's apply some of these operations to demonstrate the .dt accessor's functionality using our date range series.
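Here is a sketch of that demonstration, reusing the dates series defined above:

# Extract and compute datetime components from the 'dates' series
years = dates.dt.year
months = dates.dt.month
days = dates.dt.day
day_of_week = dates.dt.dayofweek
week_of_year = dates.dt.isocalendar().week
weekends = dates.dt.dayofweek >= 5

print(years.tolist(), months.tolist(), days.tolist())
print(day_of_week.tolist(), week_of_year.tolist(), weekends.tolist())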

Using the .dt accessor on our series of dates from January 1, 2024, to January 5, 2024, we extracted and computed the following:

  • Years: All dates are in the year 2024.
  • Months: All dates are in January (month 1).
  • Days: Sequential days starting from 1 to 5.
  • Day of the Week: Ranging from 0 (Monday) to 4 (Friday), indicating that our range only includes weekdays.
  • Week of the Year: All dates fall in the first week of the year (week 1).
  • Weekends: The boolean series indicates that none of the dates in our range fall on a weekend (Saturday or Sunday).

This demonstrates how you can leverage the .dt accessor to efficiently extract and manipulate datetime components within a pandas Series.

The pd.Grouper function in pandas

The pd.Grouper function in pandas is a versatile tool for grouping data in a DataFrame or Series based on a particular column or index level. It's especially useful for time series data, allowing for flexible and powerful grouping by date and time intervals. You can use pd.Grouper to group your data by various frequencies (like day, month, year) or other criteria, and then perform aggregate operations on these groups.

Key Features of pd.Grouper

  • Time-based Grouping: Easily group time series data by time intervals (e.g., by month, quarter, year).
  • Multiple Grouping Keys: Combine pd.Grouper with other keys for more complex groupings (see the sketch after the monthly-averages example below).
  • Flexible Aggregations: After grouping, perform aggregation operations such as sum, mean, max, etc.

Basic Syntax

DataFrame.groupby(pd.Grouper(key='DateTimeColumnName', freq='Frequency'))

  • key: Specifies the column name to group by.
  • freq: Defines the frequency for grouping (e.g., 'D' for day, 'M' for month, 'Y' for year).

Example: Monthly Averages

Assume you have a DataFrame df with a datetime column 'date' and a numeric column 'value'. To calculate the monthly average of 'value', you can use:

monthly_avg = df.groupby(pd.Grouper(key='date', freq='M')).mean()
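Combining pd.Grouper with other keys, as mentioned above, might look like the following sketch; the 'category' column is hypothetical:

# Hypothetical 'category' column combined with a monthly time grouping
grouped = df.groupby(['category', pd.Grouper(key='date', freq='M')]).mean()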

Example with Our Data

Let's use the pd.Grouper function to group our synthetic time series data by month and calculate the average for each month. We'll create a simple DataFrame to illustrate this.

# Create a DataFrame with synthetic time series data
df = pd.DataFrame({
    'date': pd.date_range(start='2023-01-01', periods=120, freq='D'),
    'value': np.random.randn(120)
})
print(df.head())
# Group by month and calculate the average for each month
monthly_avg = df.groupby(pd.Grouper(key='date', freq='M')).mean()

monthly_avg

OUTPUT:

       date     value
0 2023-01-01 -1.086244
1 2023-01-02  0.743343
2 2023-01-03  1.881496
3 2023-01-04 -0.114628
4 2023-01-05 -0.782868
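For comparison, the same monthly averages can be obtained with resample() once the dates are moved into the index; a sketch of the equivalence:

# Equivalent result using resample() on a DatetimeIndex
monthly_avg_alt = df.set_index('date').resample('M').mean()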
