
The importance of visualizing data

2026-01-02
by Andri Ólafsson
Data Science, Statistics, Visualization, Anscombe's Quartet, Python

The Ghost in the Machine

The "Perfect" Summary

Imagine you are looking at four different datasets. You run the numbers, and the results match to two or three decimal places:

  • Mean of x: 9.0
  • Variance of x: 11.0
  • Correlation: 0.816
  • Linear Regression: y = 3.00 + 0.500x

To a computer, or a hurried analyst, these datasets are indistinguishable. They are siblings. But once you plot them, you realize they aren't even from the same planet. This is Anscombe’s Quartet, published by the statistician Francis Anscombe in 1973, and it is one of the best-known cautionary tales in data science.
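The claim is easy to check yourself. Below is a minimal sketch, assuming NumPy is installed; the values are the quartet as Anscombe published it, while the names `QUARTET`, `summarize`, and `SUMMARIES` are illustrative choices for this example, not part of any library.

```python
import numpy as np

# Anscombe's quartet as published (Anscombe, 1973). Datasets I-III share x.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary statistics quoted in the text for one dataset."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # ordinary least-squares line
    return {
        "mean_x": x.mean(),
        "var_x": x.var(ddof=1),  # sample variance (divides by n - 1)
        "mean_y": y.mean(),
        "corr": np.corrcoef(x, y)[0, 1],  # Pearson correlation
        "slope": slope,
        "intercept": intercept,
    }


SUMMARIES = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

for name, s in SUMMARIES.items():
    print(
        f"{name:>3}: mean_x={s['mean_x']:.1f} var_x={s['var_x']:.1f} "
        f"mean_y={s['mean_y']:.2f} r={s['corr']:.4f} "
        f"fit: y = {s['intercept']:.2f} + {s['slope']:.3f}x"
    )
```

The four printed rows agree with the figures quoted above to about two decimal places; the correlations differ from each other only in the fourth decimal.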

[Interactive chart: Anscombe's Quartet Explorer (x axis 4 to 20, y axis 4 to 14)]

  • Mean X: 9.0
  • Mean Y: ~7.50
  • Variance X: 11.0
  • Correlation: 0.816

* All statistics are identical across all four datasets.


1. Plot First, Ask Questions Later

We have a tendency to trust the "cleanliness" of a single number. An average feels solid. A correlation coefficient feels like a verdict.

But data has a shape.

2. Drawing the Story: The Quartet Explained

When you toggle through the chart above, you see four distinct realities:

  • The Linear: The "ideal" world where our models actually work.
  • The Curve: A non-linear relationship where a straight line is fundamentally the wrong tool.
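  • The Outlier: A dataset where the relationship is perfectly linear except for one point, and that single outlier drags the fitted slope away from the true line.
  • The Isolated: A dataset where every x value is the same except one, so the points stack in a vertical line and x tells you nothing about y. It is "saved" only by one extreme point that creates a fake correlation.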
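If you want to reproduce the four panels as a static figure, here is a minimal sketch, assuming NumPy and Matplotlib are installed. The helper name `plot_quartet` and the styling are illustrative; the data are Anscombe's published values, and each panel overlays the same kind of least-squares line so the mismatch between line and shape is visible.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Anscombe's quartet as published (Anscombe, 1973). Datasets I-III share x.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def plot_quartet(quartet):
    """Draw one scatter panel per dataset, each with its least-squares line."""
    fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
    line_x = np.array([3.0, 20.0])  # span used to draw the fitted line
    for ax, (name, (x, y)) in zip(axes.flat, quartet.items()):
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        slope, intercept = np.polyfit(x, y, deg=1)
        ax.scatter(x, y)
        ax.plot(line_x, intercept + slope * line_x, color="tab:red")
        ax.set_title(f"Dataset {name}")
        ax.set_xlabel("x")
        ax.set_ylabel("y")
    fig.tight_layout()
    return fig


FIG = plot_quartet(QUARTET)
# FIG.savefig("anscombe_quartet.png", dpi=150)  # or plt.show() interactively
```

The red line is nearly the same in every panel, which is exactly the point: the fit does not change, only the data underneath it.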

Coming from a Mechatronics background, I view these as physical states. A "flat" average signal might hide a high-frequency vibration that is actually tearing a motor apart in an unmanned vehicle. Without the waveform (the visual), the machine looks fine on paper until it explodes.
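The vibration case can be sketched with a synthetic signal. Everything here is an illustrative assumption, not measured data: a 1 kHz sample rate, a steady level of 5.0, and a 50 Hz oscillation of amplitude 2.0 in arbitrary units.

```python
import numpy as np

SAMPLE_RATE_HZ = 1000  # assumed sample rate, for illustration only
t = np.arange(SAMPLE_RATE_HZ) / SAMPLE_RATE_HZ  # one second of timestamps

# Two hypothetical sensor traces with the same average level.
steady = np.full_like(t, 5.0)
vibrating = 5.0 + 2.0 * np.sin(2 * np.pi * 50.0 * t)

MEAN_STEADY = steady.mean()
MEAN_VIBRATING = vibrating.mean()
PTP_STEADY = np.ptp(steady)        # peak-to-peak swing of the steady trace
PTP_VIBRATING = np.ptp(vibrating)  # peak-to-peak swing of the vibrating trace

print(f"mean: steady={MEAN_STEADY:.3f} vibrating={MEAN_VIBRATING:.3f}")
print(f"peak-to-peak: steady={PTP_STEADY:.3f} vibrating={PTP_VIBRATING:.3f}")
```

Both traces report the same mean, because the oscillation averages out over whole periods. Only a spread measure, or the plotted waveform, shows that one of them is swinging by about four units.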

3. In Defense of the Outlier

There is a dangerous habit in data cleaning: the "Delete Key."

If a point sits three standard deviations away from the mean, the instinct is to call it "noise" and move on. But an outlier isn't always a "mistake."

  • The Hardware Perspective: Sometimes an outlier is a "Ghost in the Machine"—a loose wire, a power surge, or a voltage spike in a drone's IMU.
  • The Research Perspective: In medical research, the outlier is often the most interesting patient. It’s the one person who didn't react to a drug, or the one sensor that detected a rare precursor to a disease.
  • The Discovery: A rogue semicolon in a CSV can create an outlier, but so can a biological breakthrough. Before you delete the "ghost," you have to find out if the machine is haunted or if it’s just trying to tell you something new.
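One way to investigate a suspicious point before deleting it is to measure how much the fit depends on it. The sketch below, assuming NumPy is installed, refits the least-squares line with each point left out in turn, using Anscombe's third dataset. The function name `leave_one_out_slopes` is illustrative, and this is one simple influence check among several, not a complete diagnostic.

```python
import numpy as np

# Dataset III of Anscombe's quartet (Anscombe, 1973).
X3 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
Y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])


def leave_one_out_slopes(x, y):
    """Refit the line n times, each time with one point left out."""
    slopes = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i  # boolean mask that drops point i
        slope, _intercept = np.polyfit(x[keep], y[keep], deg=1)
        slopes.append(slope)
    return np.array(slopes)


FULL_SLOPE = np.polyfit(X3, Y3, deg=1)[0]
LOO_SLOPES = leave_one_out_slopes(X3, Y3)
SHIFT = np.abs(LOO_SLOPES - FULL_SLOPE)   # how far each omission moves the slope
MOST_INFLUENTIAL = int(np.argmax(SHIFT))  # index of the point that matters most

print(f"slope with all points: {FULL_SLOPE:.3f}")
print(
    f"most influential point: index {MOST_INFLUENTIAL}, "
    f"(x={X3[MOST_INFLUENTIAL]:.0f}, y={Y3[MOST_INFLUENTIAL]:.2f}); "
    f"slope without it: {LOO_SLOPES[MOST_INFLUENTIAL]:.3f}"
)
```

Here the check singles out the point at x = 13, y = 12.74: leaving it out moves the slope from about 0.50 to about 0.35. That tells you where to look, not what to do. Whether the point is a typo or a discovery is still a question about where the data came from.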

4. Andri’s Rules for Visualizing Correctly

To avoid being lied to by your own summary statistics, I follow a strict checklist:

  1. Plot first, calculate second: Never look at a Correlation Coefficient before you've seen the Scatter Plot.
  2. Don't hide the mess: Summary plots like boxplots are great, but always show the raw points (jittered so they don't overlap) alongside them. The "mess" is where the truth lives.
  3. Context is King: Knowing why a point is far away is infinitely more important than knowing how far it is.
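Rule 2 can be sketched as a boxplot with the raw points drawn on top. This is a minimal example, assuming NumPy and Matplotlib are installed; the helper name `box_with_points` and the jitter width are illustrative choices, and the values are the y columns of Anscombe's four datasets.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# y values of Anscombe's four datasets (Anscombe, 1973).
Y_VALUES = {
    "I": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}


def box_with_points(groups, jitter=0.08, seed=0):
    """Draw one boxplot per group and overlay every raw observation."""
    rng = np.random.default_rng(seed)  # fixed seed so the jitter is repeatable
    names = list(groups)
    positions = list(range(1, len(names) + 1))
    fig, ax = plt.subplots(figsize=(7, 4))
    # Hide the boxplot's own outlier markers; the raw points already show them.
    ax.boxplot([groups[n] for n in names], positions=positions, showfliers=False)
    for pos, name in zip(positions, names):
        y = np.asarray(groups[name], dtype=float)
        x = pos + rng.uniform(-jitter, jitter, size=y.size)  # horizontal jitter
        ax.scatter(x, y, alpha=0.7, zorder=3)
    ax.set_xticks(positions)
    ax.set_xticklabels(names)
    ax.set_xlabel("dataset")
    ax.set_ylabel("y")
    fig.tight_layout()
    return fig


FIG = box_with_points(Y_VALUES)
# FIG.savefig("boxes_with_points.png", dpi=150)  # or plt.show() interactively
```

The jitter only shifts points sideways, so their y positions stay exact. The box gives the summary and the dots show what the summary is made of.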

Conclusion: Numerical descriptions are incomplete. As a Data Storyteller, my job isn't just to calculate the average—it's to find the shape of the truth.