A Complete Guide to Scatter Plots

Learn what scatterplots are and how to create, interpret, and use them for data analysis, with expert tips and real-life examples.
Jan 20, 2025
12 min read

Discover everything you need to know about scatter plots in this comprehensive guide.

Introduction 

A scatter plot is a simple graph that plots values as dots on a chart to show the relationship (or “correlation”) between variables, identify outliers, and reveal hidden insights in complex datasets.

I. When to use scatter plots

In a nutshell, scatter plots are used for the following reasons:

  • Identifying the relationship between two variables: Whether it’s correlation or causation, the shape of a scatter plot gives valuable insight into the relationship between them.
  • Correlation: Use scatter plots to determine the level of correlation.
  • Causation: When investigating whether one variable directly influences the other, the shape of a scatter plot can suggest how strong that influence might be, although a plot alone cannot prove causation.
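  • Understanding spread and distribution: The plot shows how the points are spread across the range of each variable, which reveals clusters, gaps, and outliers.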

II. Key Features

1. X-Axis

The X-Axis (Horizontal Axis) represents the independent variable, or the predictor variable. This axis shows the variable that is presumed to influence the other.

2. Y-Axis

The Y-Axis (Vertical Axis) represents the dependent variable, or the response variable. This variable represents the outcome or result that depends on the independent variable.

3. Scaling

Both axes should be appropriately scaled and adjusted to the data range to avoid misleading patterns.

4. Data Points

Each data point in a scatter plot is an individual dot that represents one observation. Its horizontal position gives the value of the x-axis variable and its vertical position gives the value of the y-axis variable.

5. Trend Line

A trend line shows the general direction or pattern in the data points. This helps to visually identify the type of correlation, if any, and to make predictions on one variable based on another.
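As a rough illustration of how a straight trend line can be computed and then used for prediction, the sketch below fits a line by ordinary least squares, which is one common choice among several. The function name `fit_trend_line` and the sample values are made up for this example.

```python
def fit_trend_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: hours studied (x) and test score (y).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = fit_trend_line(hours, scores)
predicted_at_6 = slope * 6 + intercept  # prediction for a new x value
print(f"score = {slope:.2f} * hours + {intercept:.2f}")
```

A prediction like `predicted_at_6` is only as trustworthy as the fit itself, so it is worth checking how closely the points follow the line before relying on it.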

III. Describing a Scatter Plot

There are four aspects to consider when describing a scatter plot:

  1. Correlation (direction)
    • Positive correlation: When one variable increases, the other also tends to increase.
    • Negative correlation: When one variable increases, the other tends to decrease.
    • No correlation: There’s no clear relationship between the two variables.
  2. Strength
    • How closely the points follow the overall pattern. Tightly clustered points indicate a strong relationship, while widely scattered points indicate a weak one.
  3. Form
    • Linear: When the relationship between variables can be illustrated as a straight line.
    • Nonlinear: When the relationship between variables cannot be illustrated as a straight line.
  4. Outliers
    • Data points that lie unusually far away from the general pattern.

Correlation Coefficient 

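A correlation coefficient is a number that describes the strength and direction of the relationship between two variables. The most widely used is the Pearson correlation coefficient (r), which ranges from -1 (perfect negative) to +1 (perfect positive) and measures how closely the relationship follows a straight line.

Pearson’s coefficient isn’t suitable when the data is skewed or the relationship is non-linear. In these cases, Spearman’s rank correlation coefficient (ρ) is often used instead, because it only assumes that one variable consistently rises or falls as the other increases.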
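To make the difference between the two coefficients concrete, here is a minimal sketch in plain Python. The helper names (`pearson`, `ranks`, `spearman`) and the data are made up for illustration. Spearman's coefficient is computed here as Pearson's coefficient applied to the ranks of the values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson's r computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up data with a curved but always-increasing relationship (y = x cubed).
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

r = pearson(x, y)
rho = spearman(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Because y always increases with x, the ranks line up perfectly and Spearman's coefficient reaches 1, while Pearson's coefficient stays below 1 because the points do not fall on a straight line.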


IV. Scatter Plot Options

1. Categorical third variable

You can show a third, categorical variable in a scatter plot, such as region or gender. This can be done through colour or shapes.

Scatterplot of tree heights and diameters colored by type of tree
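One way to draw a categorical third variable is sketched below using matplotlib, assuming it is installed. The tree measurements are invented for the example. Each category gets its own `scatter` call so that it receives a distinct colour, marker shape, and legend entry.

```python
import matplotlib.pyplot as plt

# Made-up measurements: (diameter in cm, height in m) for each tree type.
trees = {
    "Oak": [(30, 12), (45, 16), (60, 20)],
    "Pine": [(20, 14), (35, 21), (50, 27)],
    "Birch": [(15, 9), (25, 13), (35, 16)],
}
markers = {"Oak": "o", "Pine": "^", "Birch": "s"}

fig, ax = plt.subplots()
for tree_type, points in trees.items():
    diameters = [d for d, _ in points]
    heights = [h for _, h in points]
    # One scatter call per category: matplotlib cycles to a new colour each time.
    ax.scatter(diameters, heights, marker=markers[tree_type], label=tree_type)

ax.set_xlabel("Diameter (cm)")
ax.set_ylabel("Height (m)")
ax.set_title("Tree height vs. diameter by type")
ax.legend(title="Tree type")
# To view the result: fig.savefig("trees_by_type.png") or plt.show()
```

Using both colour and shape for the same category is redundant on purpose: the groups stay distinguishable in greyscale and for colourblind readers.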

2. Numeric third variable

Similarly, scatter plots can also show an additional numeric variable. There are two common ways to do this: through hue and through size.

A scatter plot that sizes its points according to the third variable is called a bubble chart.

Generic bubble chart where a moderate positive relationship is shown, but larger bubbles also tend to have higher positions.

As with categorical variables, colour can also be used to add numeric variables. However, we want to use a continuous sequence of colours for numeric variables. A legend is important for interpretation.

Scatter plot with points colored by a third variable, equivalent to above bubble chart.
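Both encodings can be sketched side by side with matplotlib, assuming it is installed. The values of `x`, `y`, and `z` below are invented, and the scaling factor of 10 for the bubble size is an arbitrary choice for readability.

```python
import matplotlib.pyplot as plt

# Made-up data: x, y, and a third numeric variable z for each point.
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 2.8, 3.9, 4.1, 5.6, 6.2]
z = [5, 12, 18, 25, 40, 55]

fig, (ax_size, ax_hue) = plt.subplots(1, 2, figsize=(10, 4))

# Bubble chart: marker area (s, in points squared) scales with z.
ax_size.scatter(x, y, s=[10 * v for v in z], alpha=0.6)
ax_size.set_title("Size encodes z (bubble chart)")

# Hue: a sequential colormap maps z onto a continuous sequence of colours.
points = ax_hue.scatter(x, y, c=z, cmap="viridis")
ax_hue.set_title("Colour encodes z")
fig.colorbar(points, ax=ax_hue, label="z")  # the colorbar acts as the legend
# To view the result: fig.savefig("third_variable.png") or plt.show()
```

Note that `s` sets the marker area rather than its radius, so a point with twice the value of z covers twice the area.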

3. 3D Scatter Plot

Rather than use the methods listed above, a 3D Scatter Plot adds a third axis (Z-Axis). This is often impractical because the plot is still drawn on a flat screen or page, which makes it hard to judge where each point sits along the depth axis.

V. Mistakes to Avoid

1. Correlation vs Causation

A scatter plot in itself can show correlation, but not necessarily causation. Imagine this: every summer, as ice cream sales soar, the number of shark attacks also goes up. Does that mean eating more ice cream causes shark attacks? Should we ban rocky road to save swimmers?

Not at all! This is a classic case of correlation, not causation. Both ice cream sales and shark attacks increase in the summer because it’s hot, and people are flocking to the beach. The real culprit here is summer weather, not your double-scoop cone.
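The effect of a hidden common cause can be imitated with a small simulation. The sketch below uses entirely synthetic numbers (the coefficients and noise levels are arbitrary assumptions) in which temperature drives both series and neither affects the other. It then compares the raw correlation with the partial correlation after controlling for temperature. The helper names are made up for this example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(xs, ys, zs):
    """Correlation between xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Synthetic days: temperature drives both series; neither affects the other.
temperature = [rng.uniform(10, 35) for _ in range(500)]
ice_cream_sales = [20 * t + rng.gauss(0, 50) for t in temperature]
shark_attacks = [0.3 * t + rng.gauss(0, 1) for t in temperature]

raw = pearson(ice_cream_sales, shark_attacks)
controlled = partial_correlation(ice_cream_sales, shark_attacks, temperature)
print(f"raw correlation: {raw:.2f}, controlling for temperature: {controlled:.2f}")
```

In this simulated setup the raw correlation is strong, but it largely disappears once temperature is accounted for, which is the pattern you would expect when a third variable explains both.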

2. Overcrowding the Plot

Too many data points in a scatter plot can cause visual confusion. There are a few ways to address this:

Examples of overplotting resolved due to sampling, transparency, or a different chart type
  • Sampled Data: A random selection of points should still give a general idea of the patterns in the full data.
  • Transparency: This makes overlaps visible, because areas with many overlapping points appear darker.
  • Reducing Point Size: This also reduces the likelihood of overlaps.
  • Heatmaps: Sometimes, you have to consider a different chart altogether. A 2-D histogram is a type of heatmap that creates bins across the chart. Colour indicates the number of points in each bin.
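The remedies listed above can be compared on the same data. The sketch below assumes matplotlib is installed and uses synthetic points. The sample size, `alpha`, marker size, and bin count are arbitrary starting values that would normally be tuned to the dataset.

```python
import random

import matplotlib.pyplot as plt

rng = random.Random(42)
# Synthetic dense data: 20,000 points with a loose positive relationship.
x = [rng.gauss(0, 1) for _ in range(20_000)]
y = [xi + rng.gauss(0, 0.8) for xi in x]

fig, (ax_sample, ax_alpha, ax_bins) = plt.subplots(1, 3, figsize=(12, 4))

# 1. Sampling: plot a random subset of the points.
keep = rng.sample(range(len(x)), k=1_000)
ax_sample.scatter([x[i] for i in keep], [y[i] for i in keep])
ax_sample.set_title("Random sample of 1,000")

# 2. Transparency and smaller points: overlaps build up into darker areas.
ax_alpha.scatter(x, y, alpha=0.05, s=4)
ax_alpha.set_title("alpha=0.05, s=4")

# 3. 2-D histogram: colour shows how many points fall in each bin.
counts, x_edges, y_edges, image = ax_bins.hist2d(x, y, bins=40)
ax_bins.set_title("2-D histogram")
fig.colorbar(image, ax=ax_bins, label="Points per bin")
# To view the result: fig.savefig("overplotting.png") or plt.show()
```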

3. Overreliance on trend lines

Trend lines are great for highlighting trends, as the name suggests. However, they aren’t always suitable.

  • Non-linear relationships: Imagine a scatter plot of hours studied vs. test scores. At first, the relationship might be linear (more studying, higher scores). But beyond a point, extra hours might not improve scores—or even hurt them if exhaustion sets in.
  • Masking data variability: If data points are widely scattered around the trendline, the line might suggest a strong pattern where there’s actually a weak or inconsistent one.
  • Hiding important details: If you’re analyzing sales data and focus only on the upward trendline, you might miss seasonal dips or outliers (like a sudden spike due to a one-time event).

4. Ignoring Outliers 

Outliers in a scatter plot are those data points that stand out from the rest—they're far away from the main cluster. While it's tempting to dismiss them as "noise" or "errors," ignoring outliers can lead to significant issues in your analysis. Here's why they matter and how to handle them.

  • Outliers can reveal valuable insights: In a scatter plot of marketing spend vs. sales revenue, one outlier shows unusually high sales for a low marketing spend. This could indicate an exceptionally successful campaign or a unique product feature worth studying. 
  • Outliers might indicate errors: If one data point in a plot of employee ages is "300," it's likely a typo or data entry error. Ignoring such errors can skew your analysis, leading to incorrect conclusions.

The best approach is to investigate the outliers and find out their cause. If they are errors, they should be excluded. However, in other cases, they should remain and possibly even be highlighted, with an explanation.
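Before investigating outliers you first have to find them. One widely used heuristic, shown here only as a starting point, flags values that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The function name `iqr_outliers` and the list of ages are made up for this sketch, which reuses the "300" entry from the example above.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up employee ages, including one data entry error.
ages = [23, 25, 27, 29, 31, 34, 36, 38, 41, 45, 300]

flagged = iqr_outliers(ages)
print(flagged)
```

With these values only the 300 entry is flagged. The rule looks at one variable at a time. To find points that are far from the overall pattern of a scatter plot, the same check could be applied to the distances between each point and a fitted trend line.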

5. Wrong Scale

Choosing the right scale for the axes is imperative when creating a scatter plot. A wrong scale can significantly distort the shape and distribution of the data, hide important details, or exaggerate minor differences.

6. Lack of context

When creating or interpreting a scatter plot, ignoring the context of your data is like trying to solve a puzzle with half the pieces missing. Without context, even the most visually stunning plot can lead to confusion or misinterpretation. Here's why context matters and how to avoid leaving it out: 

  • Data without context is misleading: A scatter plot shows a strong correlation between coffee sales and car accidents. Without context, you might think coffee is dangerous. In reality, both may increase during early morning commutes—context reveals the true connection.
  • Adds relevance to insights: A scatter plot of housing prices vs. square footage could vary wildly depending on location. Context like city, neighborhood, or year gives the plot meaning.
  • Explains outliers: If an outlier shows a house priced far below market value, it might look odd until you learn that it’s a foreclosure sale.

a) Common Pitfalls When Context is Ignored

  1. Overgeneralizing Patterns
    • Assuming a trend applies universally without considering specific circumstances.
    • Example: Thinking more hours worked always equals higher productivity, ignoring factors like burnout.
  2. Forgetting External Influences
    • Leaving out key factors that shape the data.
    • Example: A plot of sales vs. time might not account for a holiday season boost.
  3. Ignoring Timeframes
    • Comparing data from different time periods without noting changes.
    • Example: A plot showing higher gas prices in 2024 vs. 2019 without acknowledging inflation.
  4. Neglecting Sample Limitations
    • Assuming the data represents a larger population when it doesn’t.
    • Example: Plotting survey results from one city and treating them as a national trend.

b) How to Include Data Context

  1. Provide Background Information
    • Explain what the data represents, how it was collected, and any relevant conditions.
    • Add a brief caption or title that sets the scene.
  2. Use Annotations
    • Highlight key points or events directly on the scatter plot.
    • Example: Label outliers or add notes for significant dates or changes.
  3. Segment the Data
    • Break down the data by relevant categories (e.g., by region, time period, or demographic) for clearer analysis.
  4. Ask the Right Questions
    • Why does this data look the way it does?
    • Are there external factors that could be influencing it?
  5. Combine Visuals with Explanations
    • Pair your scatter plot with text or commentary to ensure viewers understand the full story.

VI. Tips for Effective Scatter Plots

  1. Define the Purpose: Ensure you know what you want to communicate—whether it’s identifying correlations, clusters, outliers, or trends.
  2. Choose the right variables: Use quantitative and continuous variables for both axes. The variables should be meaningful and relevant to the message.
  3. Label clearly:
    1. Clearly label the X and Y axes with descriptive titles, including units if applicable.
    2. Provide a concise title that explains the plot's purpose.
    3. Highlight important data points, trends, or outliers if needed. This helps provide vital context.
  4. Scale and limits: Use appropriate scales that match the data range and avoid distorting patterns (e.g., logarithmic for exponential data). Set axis limits to prevent truncating or stretching the data unnaturally.
  5. Add context: Use gridlines sparingly to guide the viewer without cluttering the plot. Include a legend if the scatter plot uses multiple groups or categories.
  6. Use Effective Visual Markers:
    1. Point Style: Use shapes and colours to differentiate groups or categories.
    2. Size: Adjust marker size to avoid overlap while ensuring visibility.
    3. Colour: Use a consistent and intuitive colour palette, ensuring accessibility (e.g., colourblind-friendly).
  7. Show Statistical Information: Add trendlines or regression lines to indicate patterns. Include a correlation coefficient or summary statistics if relevant.
  8. Maintain simplicity: Avoid unnecessary elements like heavy gridlines or 3D effects that can distract from the data. Keep the design clean and focused.
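Several of the tips above can be combined in one short sketch, assuming matplotlib is installed. The data is invented, and the annotation text describes a hypothetical event. The example uses labelled axes with units, a concise title, a logarithmic scale for data that grows roughly exponentially, an annotation for the point that breaks the pattern, and light gridlines.

```python
import matplotlib.pyplot as plt

# Made-up data that grows roughly exponentially: years since launch vs. users.
years = [1, 2, 3, 4, 5, 6, 7, 8]
users = [120, 260, 480, 1_100, 1_900, 4_200, 30_000, 16_500]

fig, ax = plt.subplots()
ax.scatter(years, users, color="tab:blue", label="Observed")

# Descriptive axis labels with units, plus a concise title.
ax.set_xlabel("Years since launch")
ax.set_ylabel("Active users (count, log scale)")
ax.set_title("User growth by year")

# A logarithmic y-axis keeps exponential growth from squashing the early points.
ax.set_yscale("log")

# Annotate the point that breaks the pattern so readers get the context.
ax.annotate(
    "One-off promotion (hypothetical)",
    xy=(7, 30_000),
    xytext=(3.5, 20_000),
    arrowprops={"arrowstyle": "->"},
)

ax.grid(True, which="major", alpha=0.3)  # light gridlines only
ax.legend()
# To view the result: fig.savefig("user_growth.png") or plt.show()
```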

VII. Conclusion

Congratulations! You’ve officially mastered the basics of scatter plots. By now, you should feel confident about what they are, why they’re useful, and how to use them effectively. Scatter plots take messy, raw data and make it look good—giving you a visual snapshot of trends, patterns, and relationships. So, whether you’re analysing business metrics or the relationship between how much coffee you drink and your productivity (spoiler: there’s usually a correlation), scatter plots have got your back.
