Data Visualization with Python

Introduction to Seaborn

Getting Started with Seaborn

Seaborn is a Python library for creating statistical graphics. It is built on top of Matplotlib, another widely used Python visualization library, and integrates closely with pandas data structures, which makes it well suited to data analysis and exploration. This section introduces Seaborn and its role in the Python visualization ecosystem.

Why Seaborn?

Seaborn simplifies the process of creating beautiful and informative statistical graphics. Here are some key reasons why it has become a go-to library for data scientists and analysts:

  • Statistical plotting: Seaborn comes with a wide range of plotting functions that focus on statistical aggregation, summarization, and visualization, making it easier to explore complex datasets.
  • Beautiful and customizable: It produces aesthetically pleasing visualizations by default, with a high level of customization available for fine-tuning.
  • Integration with pandas: Seaborn works directly with pandas DataFrames, so data loaded from CSV files, Excel files, or databases can be plotted without conversion.
  • Facilitates comparison: Its plotting functions are designed to compare distributions between different groups in the data, making it an excellent tool for exploratory data analysis.
  • Built on Matplotlib: Leveraging Matplotlib's capabilities means that while Seaborn abstracts away much of the complexity for ease of use, users can still access Matplotlib's extensive customization options when needed.

Understanding Seaborn's data structures is key to effectively using the library for data visualization. Seaborn is designed to work well with pandas DataFrames, which are the most common data structure used for storing and manipulating tabular data in Python. By leveraging pandas DataFrames, Seaborn allows for efficient and intuitive plotting of data.

pandas DataFrames and Seaborn

A pandas DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). DataFrames are ideal for representing real data in Python, especially when dealing with complex datasets that include various data types.

Seaborn's functions are optimized to work with DataFrames, providing a seamless experience for data visualization:

  • Automatic extraction of variable names: When you pass a DataFrame to Seaborn, it uses the DataFrame column names as the default labels for axes, legends, and annotations, making the plots informative and reducing the amount of manual labeling required.
  • Flexible data filtering: You can leverage pandas' powerful data manipulation capabilities to preprocess or filter your data before passing it to Seaborn for visualization.
  • Direct support for categorical data: pandas DataFrames support categorical data natively, and Seaborn's plotting functions are designed to handle categorical variables effectively, allowing for easy plotting of data grouped by categories.
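
As a minimal sketch of the first and third points, the snippet below checks that Seaborn takes its axis labels from the column names and that a pandas categorical column carries its category order with it. The orders DataFrame and its column names are invented for illustration.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical data, invented for this example
orders = pd.DataFrame({
    "size": ["small", "large", "medium", "small", "large", "medium"],
    "price": [4.0, 9.5, 6.5, 4.5, 10.0, 7.0],
})

# Store 'size' as an ordered categorical so plots can follow this order
orders["size"] = pd.Categorical(
    orders["size"], categories=["small", "medium", "large"], ordered=True
)

fig, ax = plt.subplots()
sns.barplot(data=orders, x="size", y="price", ax=ax)

# The axis labels come straight from the column names
print(ax.get_xlabel(), ax.get_ylabel())
# plt.show()  # uncomment when running interactively
```

Without the categorical conversion, Seaborn would order the bars by first appearance in the data (small, large, medium).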

Long-Form vs. Wide-Form Data

Seaborn distinguishes between two types of data structures for plotting: long-form (or "tidy") data and wide-form data.

  • Long-form (Tidy) Data: Each row is a single observation, and each column is a variable. Tidy data is preferred in Seaborn because it suits statistical modeling and plotting. Functions like sns.relplot, sns.catplot, and sns.lmplot are designed to work intuitively with tidy data, allowing you to easily map variables to different aspects of a plot (like the x and y axes, hues, sizes, and styles).
  • Wide-form Data: Each row is an observation, but each observation can have multiple columns representing different variables, measurements, or time points. Some Seaborn functions can work with wide-form data directly, automatically treating each column as a separate variable or series. This can be particularly handy for quickly comparing multiple series or variables without having to reshape your DataFrame.
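
The difference is easiest to see by reshaping a small table. The sketch below uses a made-up wide table (the column names are assumptions) and converts it to long form with pandas' melt:

```python
import pandas as pd

# Hypothetical wide-form table: one row per year, one column per product
wide = pd.DataFrame({
    "Year": [2021, 2022, 2023],
    "Product A": [10, 12, 15],
    "Product B": [7, 9, 8],
})

# Reshape to long form: one row per (Year, Product) observation
long = wide.melt(id_vars="Year", var_name="Product", value_name="Sales")

print(long.shape)          # (6, 3)
print(list(long.columns))  # ['Year', 'Product', 'Sales']
```

The long table can then be plotted with explicit mappings, for example sns.lineplot(data=long, x="Year", y="Sales", hue="Product"), while the wide table can be passed as is after setting the index, as in sns.lineplot(data=wide.set_index("Year")).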

Working with Seaborn

When working with Seaborn, the general workflow involves:

  1. Preparing your data: Ensure your data is in a pandas DataFrame, cleaning and possibly reshaping it into long-form if necessary.
  2. Choosing a plot type: Select the Seaborn function that best matches the kind of visualization you want to create, considering what aspects of the data you're interested in exploring.
  3. Mapping DataFrame columns to plot elements: Use the parameters of the Seaborn function to specify which columns of your DataFrame should be used for different parts of the plot (e.g., which column to use for the x-axis, which for the y-axis, which to use for color coding, etc.).
  4. Customizing your plot: Further customize the appearance and behavior of your plot using additional arguments and leveraging the underlying Matplotlib objects for even more control.

By understanding and effectively utilizing pandas DataFrames in conjunction with Seaborn's plotting capabilities, you can create insightful and beautiful statistical visualizations with relatively little code.
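
The four steps above can be sketched end to end as follows. The data is invented, and the choice of a scatter plot is only one reasonable option for two numeric columns.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Prepare: hypothetical measurements, with the incomplete row dropped
raw = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 58, None, 71, 80, 86],
    "group": ["A", "B", "A", "B", "A", "B"],
})
clean = raw.dropna(subset=["score"])

# 2. Choose a plot type: a scatter plot for two numeric variables
# 3. Map columns to plot elements: x, y, and color (hue)
fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(data=clean, x="hours", y="score", hue="group", ax=ax)

# 4. Customize through the Matplotlib Axes the plot was drawn on
ax.set_title("Score by Hours Studied")
ax.set_xlabel("Hours studied")
# plt.show()  # uncomment when running interactively
```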

Basic Plots in Seaborn - Line Plots, Bar Charts, Histograms, and Distributions

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Set the aesthetic style of the plots
sns.set_theme()
# Sample DataFrame
data = pd.DataFrame({
    'Year': [2010, 2011, 2012, 2013, 2014],
    'Sales': [12, 17, 22, 29, 37]
})

# Create a line plot
sns.lineplot(data=data, x='Year', y='Sales')
plt.show()

# Sample DataFrame
data = pd.DataFrame({
    'Product': ['A', 'B', 'C', 'D'],
    'Sales': [23, 17, 35, 29]
})

# Create a bar chart
sns.barplot(data=data, x='Product', y='Sales')
plt.show()

# Sample data
data = pd.DataFrame({
    'Age': [22, 55, 62, 45, 21, 22, 34, 42, 42, 4, 2, 102, 95, 85, 55, 110, 120]
})

# Create a histogram
sns.histplot(data=data, x='Age', bins=10, kde=True)
plt.show()

Customizing Seaborn Plots (Aesthetics, Labels, Themes)

Seaborn excels at creating beautiful, informative statistical graphics in Python with minimal code. However, the true power of Seaborn lies in its extensive customization capabilities, allowing you to tailor your plots for various contexts and audiences. This lesson will guide you through customizing Seaborn plots, focusing on aesthetics, labels, and themes.

Setting the Aesthetic Style

Seaborn provides several built-in themes and styles to quickly change the appearance of plots. These styles can be applied globally using the sns.set_style() function.

Example: Changing the Plot Style

sns.set_style("whitegrid")  # Options include: "dark", "white", "darkgrid", "whitegrid", "ticks"
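
sns.set_style() affects every plot drawn afterwards. For a one-off change, sns.axes_style() can be used as a context manager instead. The sketch below also inspects the settings a style name stands for. The plotted numbers are arbitrary.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# axes_style() returns the Matplotlib settings behind a named style
grid_on = sns.axes_style("whitegrid")["axes.grid"]
grid_off = sns.axes_style("white")["axes.grid"]
print(grid_on, grid_off)

# Used as a context manager, the style applies only inside the block
with sns.axes_style("ticks"):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [2, 4, 3])
# plt.show()  # uncomment when running interactively
```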

Customizing Plot Labels and Titles

Adding informative labels and titles is crucial for making your plots understandable. Seaborn integrates with Matplotlib, allowing you to use Matplotlib's labeling functions to add and customize labels and titles.

Example: Adding Titles and Labels

tips = sns.load_dataset("tips")  # example dataset provided by Seaborn (downloaded on first use)
ax = sns.barplot(x="day", y="total_bill", data=tips)
ax.set_title("Total Bill by Day")
ax.set_xlabel("Day of the Week")
ax.set_ylabel("Average Total Bill")

Customizing with Matplotlib

For more advanced customizations, you can directly use Matplotlib's functions. This is particularly useful for adjusting figure sizes, adding text, or fine-tuning the layout.

Example: Adjusting Figure Size and Adding Annotations

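# Sample signal that peaks at time 5
data = pd.DataFrame({"time": range(10), "signal": [1, 3, 6, 8, 9, 10, 9, 7, 4, 2]})

plt.figure(figsize=(10, 6))  # Adjust the figure size
ax = sns.lineplot(x="time", y="signal", data=data)
ax.set_title("Signal over Time")
# Point an arrow at the peak; xytext is where the label itself is placed
ax.annotate("Peak", xy=(5, 10), xytext=(2, 9.5),
            arrowprops=dict(facecolor='black', shrink=0.05))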

Seaborn Themes

Seaborn's set_theme() function allows you to customize the appearance of your plots globally. This includes setting the color palette, font scale, and the aforementioned styles for a consistent look across all your plots.

Example: Setting a Theme

sns.set_theme(style="darkgrid", palette="muted", font_scale=1.2)
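
set_theme() also accepts an rc dictionary of Matplotlib settings, which is one way to fix defaults such as the figure size together with the theme. A minimal sketch, with an arbitrary size chosen for illustration:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# rc takes a dict of Matplotlib rcParams applied together with the theme
sns.set_theme(style="whitegrid", font_scale=1.1,
              rc={"figure.figsize": (8, 4)})

# Figures created afterwards pick up the new default size
fig, ax = plt.subplots()
print(fig.get_size_inches())
```

Because these settings are global, they stay in effect for later plots until the theme is set again.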

Color Palettes

Choosing the right color palette can enhance the readability and aesthetic appeal of your plot. Seaborn offers a variety of color palettes, which can be set globally or specified for individual plots.

Example: Using Color Palettes

sns.set_palette("pastel")

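# Specify palette for a single plot (assign hue so the palette maps to a variable;
# recent Seaborn versions warn when palette is passed without hue)
sns.barplot(x="day", y="total_bill", hue="day", data=tips, palette="Blues_d", legend=False)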
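
To see what a palette name stands for before applying it, sns.color_palette() can be called directly. This sketch assumes nothing beyond Seaborn itself. The palette name and the number of colors are arbitrary choices.

```python
import seaborn as sns

# color_palette() returns a list-like object of RGB tuples
palette = sns.color_palette("pastel", n_colors=4)
print(len(palette))

# as_hex() converts them to hex strings, handy for reuse outside Seaborn
hex_codes = palette.as_hex()
print(hex_codes)
```

The resulting object can be passed to the palette parameter of a plotting function in place of a name.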

Conclusion

Customizing your Seaborn plots is straightforward yet powerful, with a range of options from simple style changes to detailed aesthetic adjustments. By combining Seaborn's statistical plotting capabilities with Matplotlib's customization features, you can create visually appealing, informative visualizations that effectively communicate your data's story. Experiment with different styles, themes, and customizations to discover the best ways to present your unique data insights.


import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Sample data
data = pd.DataFrame({
    'Month': ['January', 'February', 'March', 'April', 'May', 'June'],
    'Sales': [200, 220, 250, 275, 300, 320]
})

# Set the aesthetic style
sns.set_theme(style="darkgrid")

# Create the line chart
plt.figure(figsize=(10, 6))
line_chart = sns.lineplot(data=data, x='Month', y='Sales', marker='o', color='blue', linewidth=2.5)

# Customizing the plot
line_chart.set_title('Monthly Sales', fontsize=16)
line_chart.set_xlabel('Month', fontsize=12)
line_chart.set_ylabel('Sales', fontsize=12)
line_chart.tick_params(axis='x', rotation=45)  # rotate the existing tick labels

plt.show()

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Sample data
np.random.seed(10)
data = pd.DataFrame({
    'City': ['City A', 'City B', 'City C', 'City D', 'City E', 'City F', 'City G', 'City H'],
    'Temperature': np.random.randint(20, 35, size=8),
    'Humidity': np.random.randint(40, 80, size=8),
    'Pollution Index': np.random.randint(1, 100, size=8)
})

highlight = 'City C'  # City to highlight
# Set the theme
sns.set_theme(style="whitegrid", palette="muted")

# Create the scatter plot
plt.figure(figsize=(10, 6))
scatter = sns.scatterplot(data=data, x='Temperature', y='Humidity',
                          size='Pollution Index', sizes=(50, 200),
                          hue='Pollution Index', style='City',
                          palette='coolwarm', legend="full")

# Customizing the legend
scatter.legend(title='Pollution Index',bbox_to_anchor=(1.05, 1), loc=2)

# Adding annotations
for i in range(data.shape[0]):
    if data.iloc[i]['City'] == highlight:
        plt.text(x=data.iloc[i]['Temperature']+0.5,
                 y=data.iloc[i]['Humidity'],
                 s=highlight,
                 fontdict=dict(color='red',size=10),
                 bbox=dict(facecolor='yellow',alpha=0.5))

# Adding titles and labels
plt.title('City Climate Characteristics', fontsize=16)
plt.xlabel('Average Temperature (°C)', fontsize=12)
plt.ylabel('Average Humidity (%)', fontsize=12)

# Show the plot
plt.show()

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Sample data
data = pd.DataFrame({
    'Product': ['Product A', 'Product B', 'Product C', 'Product D'] * 2,
    'Region': ['North', 'North', 'North', 'North', 'South', 'South', 'South', 'South'],
    'Sales': [123, 432, 234, 321, 143, 423, 223, 312]
})
# Set the aesthetic style of the plots
sns.set_theme(style="whitegrid")

# Create the bar chart
plt.figure(figsize=(10, 6))
bar_chart = sns.barplot(data=data, x='Product', y='Sales', hue='Region', palette='viridis')

# Customizing the legend
bar_chart.legend(title='Region', bbox_to_anchor=(1.05, 1), loc=2)

# Adding a value label above each bar (one container per hue level)
for container in bar_chart.containers:
    bar_chart.bar_label(container, fmt='%.1f', padding=3)

# Adding titles and labels
plt.title('Sales by Product and Region', fontsize=16)
plt.xlabel('Product', fontsize=12)
plt.ylabel('Sales', fontsize=12)

# Show the plot
plt.tight_layout()  # Adjusts the plot to ensure everything fits without overlap
plt.show()

Box Plot

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Sample data
data = pd.DataFrame({
    'Math': [88, 92, 80, 89, 100, 80, 60, 70, 55, 77, 88, 82],
    'Science': [72, 95, 78, 76, 88, 82, 92, 89, 95, 80, 82, 85],
    'English': [90, 85, 88, 95, 60, 78, 82, 80, 80, 85, 90, 92]
})

# Melting the DataFrame to work well with sns.boxplot
data_melted = data.melt(var_name='Subject', value_name='Score')
# Set the aesthetic style of the plots
sns.set_style("whitegrid")

# Create the box plot
plt.figure(figsize=(10, 6))
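# hue is assigned so the palette maps to a variable (recent Seaborn versions warn otherwise)
box_plot = sns.boxplot(x='Subject', y='Score', hue='Subject', data=data_melted, palette='Set2', legend=False)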

# Adding titles and labels
plt.title('Distribution of Student Scores by Subject', fontsize=16)
plt.xlabel('Subject', fontsize=12)
plt.ylabel('Scores', fontsize=12)

# Make the boxes translucent so the outlier labels stand out
# (newer Matplotlib versions keep the boxes in ax.patches rather than ax.artists)
for patch in box_plot.patches:
    r, g, b, a = patch.get_facecolor()
    patch.set_facecolor((r, g, b, .3))

# Label points lying more than 1.5 * IQR beyond the quartiles
for position, subject in enumerate(data.columns):
    q1, q3 = data[subject].quantile([0.25, 0.75])
    iqr = q3 - q1
    is_outlier = (data[subject] < q1 - 1.5 * iqr) | (data[subject] > q3 + 1.5 * iqr)
    for score in data.loc[is_outlier, subject]:
        plt.text(x=position, y=score, s=f"{score}", color='red')

plt.show()

Violin Plots

Violin plots are particularly useful for comparing the distribution of data across categories, as they show both summary statistics (similar to box plots) and the probability density (similar to KDE plots). This makes them a good choice for visualizing and comparing distributions that may have multiple peaks or varying spreads.

# Sample data
data = pd.DataFrame({
    'Score': [88, 72, 90, 95, 78, 85, 82, 92, 88, 76, 88, 82, 100, 60, 70, 95, 80, 60, 78, 82],
    'Subject': ['Math', 'Science', 'English', 'Math', 'Science', 'English', 'Math', 'Science', 'English', 'Math', 'Math', 'Science', 'English', 'Math', 'Science', 'English', 'Math', 'Science', 'English', 'Math'],
    'Level': ['Advanced', 'Advanced', 'Advanced', 'Intermediate', 'Intermediate', 'Intermediate', 'Beginner', 'Beginner', 'Beginner', 'Advanced', 'Intermediate', 'Beginner', 'Advanced', 'Intermediate', 'Advanced', 'Beginner', 'Beginner', 'Intermediate', 'Advanced', 'Intermediate']
})
# Set the aesthetic style of the plots
sns.set_style("whitegrid")

# Create the violin plot
plt.figure(figsize=(12, 6))
violin = sns.violinplot(x='Subject', y='Score', data=data, palette='muted')

# Customizing the plot
plt.title('Score Distribution by Subject', fontsize=16)
plt.xlabel('Subject', fontsize=12)
plt.ylabel('Score', fontsize=12)
#plt.legend(title='Student Level', loc='upper left')

# Show the plot
plt.show()

sns.set_style("whitegrid")

# Create the violin plot
plt.figure(figsize=(12, 6))
violin = sns.violinplot(x='Subject', y='Score', data=data, palette='muted',split=True)

# Customizing the plot
plt.title('Score Distribution by Subject', fontsize=16)
plt.xlabel('Subject', fontsize=12)
plt.ylabel('Score', fontsize=12)
#plt.legend(title='Student Level', loc='upper left')

# Show the plot
plt.show()

Reg Plot

The sns.regplot function in Seaborn is used to plot data and a linear regression model fit. It's a great tool for exploring the relationship between two variables, providing a quick and easy way to visualize whether there's a linear relationship and how strong that relationship might be. Here's a basic example to demonstrate how to use sns.regplot to create a scatter plot with a linear regression line.

# Generating sample data
np.random.seed(0)
hours_studied = np.random.rand(100) * 5  # Random data for hours studied
exam_scores = 50 + (hours_studied * 10) + np.random.normal(0, 5, 100)  # Scores with added noise

# Creating a DataFrame
data = pd.DataFrame({'Hours Studied': hours_studied, 'Exam Score': exam_scores})
# Set the aesthetic style of the plots
sns.set_theme(style="whitegrid")

# Create the regression plot
plt.figure(figsize=(10, 6))
sns.regplot(x='Hours Studied', y='Exam Score', data=data, scatter_kws={'color': 'blue'}, line_kws={'color': 'red'})

# Customizing the plot
plt.title('Relationship Between Hours Studied and Exam Score', fontsize=16)
plt.xlabel('Hours Studied', fontsize=12)
plt.ylabel('Exam Score', fontsize=12)

# Show the plot
plt.show()
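
sns.regplot draws the fitted line but does not return its coefficients. When the numbers are needed, one option is to fit the same kind of straight line separately. The sketch below uses NumPy's polyfit as a stand-in (an assumption of this example, not something regplot exposes) on the same synthetic data recipe:

```python
import numpy as np

# Same synthetic data recipe as the example above
np.random.seed(0)
hours_studied = np.random.rand(100) * 5
exam_scores = 50 + (hours_studied * 10) + np.random.normal(0, 5, 100)

# Degree-1 least-squares fit: returns (slope, intercept)
slope, intercept = np.polyfit(hours_studied, exam_scores, 1)

# Pearson correlation as a rough measure of how linear the relationship is
r = np.corrcoef(hours_studied, exam_scores)[0, 1]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```

The slope and intercept should land close to the 10 and 50 used to generate the data, with the gap coming from the added noise.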

Cat Plot

The sns.catplot function in Seaborn is a versatile tool for visualizing the distribution of values within categorical data. It can create several types of plots, including box plots, violin plots, bar plots, and more, making it highly useful for comparing levels within a categorical variable or across several categories. Here's a basic example to demonstrate how to use sns.catplot to create a categorical plot that can show, for instance, the distribution of exam scores across different exam subjects.

sns.catplot is powerful for its versatility and ability to visualize complex relationships within categorical data. By adjusting its parameters and the plot type (kind), you can tailor the visualization to your specific analysis needs, whether you're exploring distributions, comparing groups, or highlighting trends within categories.

# Sample data
data = pd.DataFrame({
    'Subject': ['Math', 'Science', 'English', 'Math', 'Science', 'English', 'Math', 'Science', 'English'],
    'Score': [88, 72, 90, 95, 78, 85, 82, 92, 88],
    'Gender': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female']
})
# Set the aesthetic style of the plots
sns.set_theme(style="whitegrid")

# Create the categorical plot
g = sns.catplot(x='Subject', y='Score', hue='Gender', data=data, kind='box', height=5, aspect=1.5, palette='pastel')

# Customizing the plot
g.fig.suptitle('Exam Scores by Subject and Gender', fontsize=16)
g.set_axis_labels("Subject", "Score")
g.legend.set_title("Gender")

# Adjust the title position
plt.subplots_adjust(top=0.92)

# Show the plot
plt.show()

g = sns.catplot(x='Subject', y='Score', hue='Gender', data=data, kind='violin', height=5, aspect=1.5, palette='pastel')

# Customizing the plot
g.fig.suptitle('Exam Scores by Subject and Gender', fontsize=16)
g.set_axis_labels("Subject", "Score")
g.legend.set_title("Gender")

# Adjust the title position
plt.subplots_adjust(top=0.92)

# Show the plot
plt.show()

Heat Maps

# Generating sample data
data = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])
rows = ['Row 1', 'Row 2', 'Row 3']
columns = ['Column 1', 'Column 2', 'Column 3']

# Creating a DataFrame to label rows and columns for the heatmap
df = pd.DataFrame(data, index=rows, columns=columns)

# Creating the heatmap
plt.figure(figsize=(10, 8))
heatmap = sns.heatmap(df, annot=True, cmap='coolwarm', linewidths=.5, cbar_kws={'shrink': .5})

# Customizing the plot
plt.title('Sample Heatmap', fontsize=20)
plt.xticks(rotation=45)
plt.yticks(rotation=45)
plt.show()

KDE Plot

# Generating sample data
np.random.seed(10)
data1 = np.random.normal(loc=0, scale=1, size=400)
data2 = np.random.normal(loc=5, scale=2, size=400)
df = pd.DataFrame({'Data1': data1, 'Data2': data2})

# Creating the KDE plot
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df, fill=True, common_norm=False, alpha=.5, linewidth=0)

# Customizing the plot
plt.title('KDE Plot of Data1 and Data2', fontsize=16)
plt.xlabel('Value', fontsize=12)
plt.ylabel('Density', fontsize=12)
plt.show()

Boxen Plot

The boxen plot, also known as a letter-value plot, is an enhanced version of the box plot designed for more complex distributions. It is particularly useful for larger datasets, as it gives more insight into the shape of the distribution, especially in the tails. Where a traditional box plot summarizes the data with the median, quartiles, and outliers, a boxen plot draws a series of nested boxes at successively more extreme quantiles.

Key Features of Boxen Plots:

  • More Quantiles: Unlike traditional box plots that show the median, the interquartile range, and outliers, boxen plots display data through many more quantiles. This makes them excellent at representing both the central tendency and the tails of the distribution.
  • Handling Large Datasets: Boxen plots are particularly suited for large datasets. They can provide more detailed insights into the distribution of the data, making it easier to identify nuances and patterns that might not be visible with simpler plotting techniques.
  • Visualizing Distribution Shape: By displaying a wider range of data quantiles, boxen plots offer a clearer view of the distribution's shape, including the presence of heavy tails, multimodality, and skewness.
  • Outlier Visualization: Traditional box plots reduce everything beyond the whiskers to individual outlier points, which can hide the shape of the tails. Boxen plots draw more of the tails as nested boxes and still mark the most extreme points, so the overall shape of the distribution stays visible.

Usage:

Boxen plots are particularly useful when you need to understand more about the data beyond the central tendency and variability. They are a powerful tool for exploratory data analysis, especially when dealing with large datasets or when interested in the finer details of the data distribution. Seaborn and other data visualization libraries provide functions to easily create boxen plots, making them accessible for data analysts and scientists to incorporate into their analytical workflows.
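
The quantiles behind the nested boxes can be sketched with NumPy. This illustrates the letter-value idea only: each level halves the tail area. It is not a reproduction of how Seaborn decides how many boxes to draw, and the sample data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=0, scale=1, size=10_000)

# Each level halves the tail area: 1/4 (quartiles), 1/8, 1/16, 1/32
levels = []
for depth in range(2, 6):
    tail = 0.5 ** depth
    lower, upper = np.quantile(values, [tail, 1 - tail])
    levels.append((tail, lower, upper))
    print(f"tail={tail:.4f}  lower={lower:6.2f}  upper={upper:6.2f}")
```

Each successive pair of quantiles reaches further into the tails, which is why the boxes in a boxen plot extend progressively outward from the median.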

import seaborn as sns
sns.set_theme(style="whitegrid")

diamonds = sns.load_dataset("diamonds")
clarity_ranking = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

sns.boxenplot(
    diamonds, x="clarity", y="carat",
    color="b", order=clarity_ranking, width_method="linear",
)

# Sample data
np.random.seed(0)
data = pd.DataFrame({
    'Group': np.repeat(['A', 'B', 'C'], 100),
    'Value': (np.random.randn(300) * 100).astype(int)
})

# Creating the boxen plot
plt.figure(figsize=(10, 6))
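# hue is assigned so the palette maps to a variable (recent Seaborn versions warn otherwise)
sns.boxenplot(x='Group', y='Value', hue='Group', data=data, palette='coolwarm', legend=False)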

# Customizing the plot
plt.title('Boxen Plot of Values by Group', fontsize=16)
plt.xlabel('Group', fontsize=12)
plt.ylabel('Value', fontsize=12)

plt.show()

LM Plot

# Generating sample data
np.random.seed(0)
data = pd.DataFrame({
    'X': np.random.rand(100) * 100,
    'Y': np.random.rand(100) * 50 + 50 * np.random.rand(100),
    'Category': np.random.choice(['Category 1', 'Category 2', 'Category 3'], 100)
})

# Creating the lmplot
sns.lmplot(x='X', y='Y', data=data, hue='Category', palette='Set1', aspect=1.5, height=7)

# Customizing the plot
plt.title('Linear Regression with lmplot by Category', fontsize=16)
plt.xlabel('X Value', fontsize=12)
plt.ylabel('Y Value', fontsize=12)

plt.show()

# Generating sample data
np.random.seed(10)
x = np.random.rand(150) * 50
y = 2 * x + np.random.normal(0, 8, 150)
category = np.random.choice(['Group A', 'Group B', 'Group C'], 150)

# Combining into a DataFrame
data = pd.DataFrame({'X': x, 'Y': y, 'Category': category})
# Set the aesthetic style of the plots
sns.set_style("whitegrid")

# Creating the lmplot with more details
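# sharex/sharey are passed through facet_kws; passing them directly is deprecated in lmplot
g = sns.lmplot(x='X', y='Y', col='Category', hue='Category', data=data,
               aspect=1.2, height=5, col_wrap=2, palette='Set1',
               scatter_kws={'s': 50, 'alpha': 0.5}, line_kws={'lw': 2},
               markers='o', facet_kws={'sharex': False, 'sharey': False})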

# Customizing the plot
g.fig.suptitle('Linear Regression Analysis by Category', fontsize=20, y=1.05)
g.set_axis_labels("X Variable", "Y Variable")
g.add_legend(title="Category")
g.set(xlim=(0, 50), ylim=(0, 120))

# Title each subplot with the category it actually shows
# (a hard-coded list of titles can mislabel facets, since facet order follows the data)
g.set_titles("{col_name} - Linear Fit")

plt.tight_layout()
plt.show()

Rel Plot

sns.relplot is a figure-level function in Seaborn designed for visualizing statistical relationships between data. It can create both scatter plots and line plots, making it versatile for exploring how two or more quantitative variables relate across a dataset. This function is especially powerful due to its ability to facet the data across multiple subplots with the row and col parameters, allowing for the examination of complex interactions and trends within subsets of the data.

Use Case:

A common use case for sns.relplot is when you want to understand the relationship between two variables while also considering the impact of one or two additional categorical variables. For instance, in a dataset containing information about cars, you might use sns.relplot to explore how the weight of cars affects their fuel efficiency (miles per gallon), with the plot points colored by the number of cylinders and faceted by the country of origin. This kind of visualization can quickly reveal patterns and anomalies, such as whether heavier cars consistently have lower fuel efficiency or if this trend varies significantly by the number of cylinders or the country of origin.

sns.relplot shines in its flexibility and ease of use for creating complex, multi-faceted visualizations that can accommodate various aspects of a dataset, making it an invaluable tool for exploratory data analysis.
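
A faceted version of the use case described above might look like the following sketch. The cars table is a small invented stand-in for a real car dataset, so the values carry no meaning beyond illustrating the hue and col mappings.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for a car dataset; all values are invented
cars = pd.DataFrame({
    "weight": [2100, 2500, 3200, 3900, 2000, 2400, 2300, 2800, 2200],
    "mpg": [33, 28, 20, 15, 35, 30, 31, 25, 32],
    "cylinders": [4, 4, 6, 8, 4, 4, 4, 6, 4],
    "origin": ["usa", "usa", "usa", "usa", "japan", "japan",
               "europe", "europe", "europe"],
})

# hue colors the points by cylinders; col creates one subplot per origin
g = sns.relplot(data=cars, x="weight", y="mpg",
                hue="cylinders", col="origin",
                kind="scatter", height=3, aspect=1)
g.set_axis_labels("Weight", "Miles Per Gallon (MPG)")
```

Adding row= with a second categorical column would extend the grid in the other direction.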

import seaborn as sns
sns.set_theme(style="white")

# Load the example mpg dataset
mpg = sns.load_dataset("mpg")

# Plot miles per gallon against horsepower with other semantics
sns.relplot(x="horsepower", y="mpg", hue="origin", size="weight",
            sizes=(40, 400), alpha=.5, palette="muted",
            height=6, data=mpg)
plt.show()

# Set the aesthetic style of the plots
sns.set_style("whitegrid")

# Creating the relational plot
g = sns.relplot(x="weight", y="mpg", hue="cylinders", data=mpg, kind="scatter", size="cylinders",
                palette="viridis", sizes=(40, 400), aspect=1.5, height=5, alpha=.7)

# Customizing the plot
g.fig.suptitle('Fuel Efficiency vs. Car Weight by Number of Cylinders', fontsize=16)
g.set_axis_labels("Weight", "Miles Per Gallon (MPG)")
g.legend.set_title("Cylinders")

plt.show()

Swarm Plot

A swarmplot is a categorical scatterplot in Seaborn, designed to show all data points without overlapping. This plot type is particularly useful for visualizing the distribution of data across categories while keeping each point's individual value visible, making it easy to assess the density and spread of data points within categories.

Key Features:

  • Avoids Overlapping: Points are adjusted along the categorical axis so that they don't overlap, providing a better sense of the distribution.
  • Reveals Data Density: By spreading out the points, a swarm plot can give a sense of the density of the data points within each category.
  • Highlights Outliers: Individual data points are visible, making it easy to spot outliers or anomalies within each category.

Use Case:

A common use case for a swarmplot is when you need to compare the distribution of a variable across different categories, especially when the dataset is not too large. For example, in the mpg dataset, a swarmplot could be used to visualize the distribution of fuel efficiency (measured in miles per gallon) across cars with different numbers of cylinders. This can help identify not only the central tendency and variability within each category but also how individual vehicles compare across different cylinder counts.

swarmplot is particularly valuable in exploratory data analysis when you're interested in understanding the spread of your data across categories and looking for patterns or outliers that warrant further investigation.

# Sample data
data = pd.DataFrame({
    'Group': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Value': [23, 45, 56, 78, 12, 30, 52, 44, 33, 67, 89, 36]
})

# Creating the swarm plot
plt.figure(figsize=(8, 6))
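# hue is assigned so the palette maps to a variable (recent Seaborn versions warn otherwise)
swarm_plot = sns.swarmplot(x='Group', y='Value', hue='Group', data=data, palette='viridis', legend=False)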

# Customizing the plot
plt.title('Swarm Plot of Values by Group', fontsize=16)
plt.xlabel('Group', fontsize=12)
plt.ylabel('Value', fontsize=12)

plt.show()

# Load the mpg dataset
mpg = sns.load_dataset('mpg')

# Creating the swarm plot
plt.figure(figsize=(10, 8))
swarm_plot = sns.swarmplot(x='cylinders', y='mpg', hue='cylinders', data=mpg, palette='Set2', legend=False)

# Customizing the plot
plt.title('Swarm Plot of MPG by Number of Cylinders', fontsize=16)
plt.xlabel('Number of Cylinders', fontsize=12)
plt.ylabel('Miles Per Gallon (MPG)', fontsize=12)

plt.show()

Pair Plot

A pairplot is a matrix of scatter plots that enables quick visualization of relationships between multiple pairwise combinations of variables in a dataset. It's particularly useful for exploring correlations, distributions, and trends among several quantitative variables. Each plot in the matrix represents the relationship between two variables, allowing for the detection of patterns, outliers, and insights that might not be apparent from looking at single variables in isolation.

Key Features:

  • Multiple Variable Analysis: Simultaneously examines every pairwise combination of variables.
  • Histograms and KDEs: The diagonal plots are typically histograms or Kernel Density Estimates (KDEs) showing the distribution of a single variable.
  • Hue Differentiation: Can color-code data points by a categorical variable to further dissect the data across subgroups.
  • Flexibility: Offers customization options for plot types, aesthetics, and which variables to include.

Use Cases:

  • Exploratory Data Analysis (EDA): Quickly understand the pairwise relationships and distributions of multiple variables in a dataset.
  • Feature Selection: Identify promising variables for further analysis or modeling, based on their relationships and distributions.
  • Detecting Correlations and Anomalies: Spot potential correlations between variables or unusual patterns that might indicate outliers, data errors, or interesting phenomena for deeper investigation.

In practical terms, a pairplot might be used in a dataset like mpg (miles per gallon) to explore how different vehicle characteristics (like weight, horsepower, and engine displacement) relate to fuel efficiency and each other. This can help automotive researchers, for instance, to identify trends and factors influencing fuel efficiency, or data scientists to select features for machine learning models predicting vehicle performance.

import seaborn as sns
import matplotlib.pyplot as plt

# Load the mpg dataset
mpg = sns.load_dataset('mpg')

# Creating the pairplot
pairplot = sns.pairplot(mpg, vars=['mpg', 'displacement', 'horsepower', 'weight'], hue='cylinders', palette='viridis')

# Customizing the plot
pairplot.fig.suptitle('Pairplot of MPG, Displacement, Horsepower, and Weight by Number of Cylinders', y=1.02)

plt.show()

Joint Plot

A jointplot in Seaborn is a versatile tool for visualizing the relationship between two variables by displaying their bivariate (joint) distribution and their univariate (marginal) distributions simultaneously. Essentially, it combines a scatter plot or a hexbin plot of two variables with histograms or Kernel Density Estimates (KDEs) of each variable on the axes. This allows for a detailed exploration of the relationship between the two variables, including their individual distributions.

Key Features:

  • Versatile Plot Types: Supports scatter plots, hexbin plots, KDE plots, and more for the joint distribution, allowing for flexibility in how data is represented based on its nature and size.
  • Marginal Distributions: Displays the distribution of each variable on the plot's margins, usually as histograms or KDEs, providing a comprehensive view of the data.
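  • Annotations: Recent Seaborn versions do not add correlation coefficients or p-values to the plot automatically. To quantify the relationship, compute these statistics separately (for example with scipy.stats.pearsonr) and add them to the plot as text.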
  • Customization: Offers extensive customization options for colors, bin sizes, aspect ratios, and more, making it adaptable to various data visualization needs.

Use Cases:

  • Exploratory Data Analysis (EDA): jointplot is particularly useful in the early stages of data analysis to understand the relationships between pairs of variables in a dataset and to identify patterns, trends, and potential outliers.
  • Scientific Research: In fields like biology, economics, and psychology, jointplot can help in examining the relationships between two measured phenomena, such as the effect of a drug on health outcomes or the relationship between economic indicators.
  • Data Preprocessing for Machine Learning: Before building predictive models, jointplot can be used to identify relationships between features and target variables or among features themselves, assisting in feature selection and engineering.

For instance, using the mpg dataset, a jointplot can reveal how a car's horsepower relates to its fuel efficiency (mpg). This visualization could provide insights into how engine power influences fuel consumption, useful for automotive engineers, policymakers, and consumers interested in vehicle performance and environmental impact.

# Load the mpg dataset
mpg = sns.load_dataset('mpg')

# Creating a detailed joint plot of 'mpg' vs 'horsepower' with kind='scatter'
jointplot = sns.jointplot(x='horsepower', y='mpg', data=mpg, kind='scatter', color='blue', space=0, marginal_kws=dict(bins=30, fill=True))

# Customizing the plot with titles and labels
jointplot.set_axis_labels('Horsepower', 'MPG', fontsize=12)
jointplot.fig.suptitle('MPG vs Horsepower', fontsize=15, y=1.03)

plt.show()

from scipy.stats import pearsonr

# Drop rows with missing values first; filling them with 0 would distort the correlation
clean = mpg[['horsepower', 'mpg']].dropna()
corr_coeff, p_value = pearsonr(clean['horsepower'], clean['mpg'])

# Creating the joint plot
jointplot = sns.jointplot(x='horsepower', y='mpg', data=mpg, kind='reg', color='blue')

# Annotating the joint axes with the correlation coefficient and p-value
jointplot.ax_joint.text(x=0.5, y=0.9, s=f'Corr: {corr_coeff:.2f}, p-value: {p_value:.2e}',
                        ha='center', va='center', transform=jointplot.ax_joint.transAxes)

# Show the plot
plt.show()

penguins = sns.load_dataset("penguins")
sns.jointplot(data=penguins, x="flipper_length_mm", y="bill_length_mm", hue="species")

Dist Plot

# Creating a displot for the distribution of mpg values
sns.displot(mpg['mpg'], kde=True,bins=20)

plt.title('Distribution of MPG in Cars')
plt.xlabel('Miles Per Gallon (MPG)')
plt.ylabel('Count')  # the histogram shows counts by default, not densities

plt.show()

sns.set_theme(style="whitegrid")

# Load the diamonds dataset
diamonds = sns.load_dataset("diamonds")

# Plot the distribution of cut grades, conditional on carat
sns.displot(
    data=diamonds,
    x="carat", hue="cut",
    kind="kde", height=6,
    multiple="fill", clip=(0, None),
    palette="ch:rot=-.25,hue=1,light=.75",
)

Lesson Assignment
Challenge yourself with our lab assignment and put your skills to the test.
# Python Program to find the area of triangle

a = 5
b = 6
c = 7

# Uncomment below to take inputs from the user
# a = float(input('Enter first side: '))
# b = float(input('Enter second side: '))
# c = float(input('Enter third side: '))

# calculate the semi-perimeter
s = (a + b + c) / 2

# calculate the area
area = (s*(s-a)*(s-b)*(s-c)) ** 0.5
print('The area of the triangle is %0.2f' %area)