Seeing in Curves: A Practical Guide to ROC, TPR, FPR, and the Four Outcomes of Classification

Why Model Evaluation Deserves More Love

Machine learning tutorials love to focus on the glamorous parts: training neural networks, tuning hyperparameters, and showing off accuracy scores that look suspiciously perfect.
But anyone who has deployed a real model knows the truth: accuracy is a liar, and understanding how your model fails is far more important than celebrating how often it succeeds.

That’s where the humble confusion matrix and the mighty ROC curve come in.

They’re not flashy. They don’t trend on social media.
But they are the difference between a model that works in production and one that quietly ruins your day.

The Four Outcomes of a Binary Classifier

Every prediction your model makes falls into one of four buckets. These four numbers form the backbone of almost every evaluation metric in machine learning.

True Positive (TP)

The model predicted positive, and the actual label was positive.
Example: The model says a transaction is fraudulent, and it really is.

False Positive (FP)

The model predicted positive, but the actual label was negative.
Example: The model flags a legitimate transaction as fraud.
This is also known as a Type I error.

False Negative (FN)

The model predicted negative, but the actual label was positive.
Example: The model misses a fraudulent transaction.
This is a Type II error, and in domains like fraud or medical screening it is often the costlier mistake.

True Negative (TN)

The model predicted negative, and the actual label was negative.
Example: The model correctly identifies a legitimate transaction.

The Confusion Matrix

A confusion matrix is a compact way to visualize all four outcomes:

\begin{array}{c|cc} & \text{Predicted Positive} & \text{Predicted Negative} \\\hline \text{Actual Positive} & TP & FN \\ \text{Actual Negative} & FP & TN \\ \end{array}

This tiny table powers nearly every evaluation metric you’ve ever heard of.
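To make the four buckets concrete, here is a minimal sketch that counts them with NumPy. The labels and predictions are made-up toy data (1 = positive, 0 = negative), not output from any real model.

```python
import numpy as np

# Hypothetical toy data: 1 = positive (e.g. fraud), 0 = negative
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # predicted positive, actually positive
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # predicted positive, actually negative
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # predicted negative, actually positive
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # predicted negative, actually negative

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=3 FP=1 FN=1 TN=3
```

If you use scikit-learn's confusion_matrix instead, keep in mind that with 0/1 labels it returns the layout [[TN, FP], [FN, TP]] (negative class first), which is mirrored relative to the table above.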

From Confusion to Clarity: TPR and FPR

Two critical metrics emerge from these four outcomes:

True Positive Rate (TPR)

Also called Recall or Sensitivity.

\text{TPR} = \frac{TP}{TP + FN}

This answers the question:
“Of all the actual positives, how many did we catch?”

False Positive Rate (FPR)

\text{FPR} = \frac{FP}{FP + TN}

This answers:
“Of all the actual negatives, how many did we incorrectly flag?”

These two metrics form the axes of the ROC curve.

What Is an ROC Curve?

The Receiver Operating Characteristic (ROC) curve visualizes how your model behaves as you adjust the classification threshold.

Most classifiers output a probability:

“This email has a 0.82 probability of being spam.”

But you choose the threshold that turns that probability into a decision:

Lower threshold → more positives → higher TPR and higher FPR
Higher threshold → fewer positives → lower FPR and lower TPR

The ROC curve plots:

x-axis: False Positive Rate (FPR)
y-axis: True Positive Rate (TPR)

Each point corresponds to a different threshold.

A perfect model hugs the top-left corner.
A random model draws a diagonal line.
A terrible model dips below the diagonal (and can be fixed by flipping its predictions).
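The threshold sweep can be sketched by hand in a few lines of NumPy. This is an illustrative version under simple assumptions: `roc_points` is a hypothetical helper, and the scores are invented toy values for four positives and four negatives.

```python
import numpy as np

def roc_points(y_true, scores, thresholds):
    """Return (fpr, tpr) arrays with one ROC point per threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr, tpr = [], []
    for t in thresholds:
        pred = scores >= t  # positive when score >= threshold
        tpr.append(np.sum(pred & (y_true == 1)) / n_pos)
        fpr.append(np.sum(pred & (y_true == 0)) / n_neg)
    return np.array(fpr), np.array(tpr)

# Hypothetical scores for 4 positives followed by 4 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

fpr, tpr = roc_points(y_true, scores, thresholds=[0.0, 0.25, 0.5, 0.75, 1.01])
for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
```

The first threshold lands on the top-right corner (1, 1) and the last on the origin (0, 0); the points in between trace the curve.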

AUC: The One Number to Rule Them All

The Area Under the Curve (AUC) compresses the ROC curve into a single value between 0 and 1.

1.0 → perfect
0.5 → random guessing
< 0.5 → worse than random (usually means labels are flipped)

AUC is especially useful when:

Classes are imbalanced enough that accuracy alone would mislead (when positives are very rare, pair it with a PR curve)
You want a threshold‑independent metric
You’re comparing multiple models fairly
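One way to build intuition for that single number: ROC AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below computes it that way on hypothetical toy scores. `auc_by_ranking` is an illustrative helper, not a library function, and it counts ties as half a win.

```python
import numpy as np

def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive against every negative
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

auc = auc_by_ranking(y_true, scores)
flipped = auc_by_ranking(y_true, [1 - s for s in scores])
print(auc, flipped)  # 0.8125 0.1875
```

Inverting the scores turns an AUC of a into 1 - a, which is why a value well below 0.5 usually points to flipped labels or scores rather than a hopeless model.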

A Concrete Example

Imagine a model that outputs the following confusion matrix:

| | Predicted Positive | Predicted Negative |
|---------------------|--------------------|--------------------|
| Actual Positive | 90 | 10 |
| Actual Negative | 20 | 880 |

From this we compute:

True Positive Rate

\text{TPR} = \frac{90}{90 + 10} = 0.90

False Positive Rate

\text{FPR} = \frac{20}{20 + 880} \approx 0.022

This single point becomes one coordinate on the ROC curve.
Change the threshold, and you get another.
Connect them all, and you get the full curve.
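The same arithmetic as a small sketch, using the counts from the example matrix; `rates` is a hypothetical helper name.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR) from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)  # share of actual positives that were caught
    fpr = fp / (fp + tn)  # share of actual negatives that were flagged
    return tpr, fpr

tpr, fpr = rates(tp=90, fn=10, fp=20, tn=880)
print(f"TPR = {tpr:.3f}, FPR = {fpr:.3f}")  # TPR = 0.900, FPR = 0.022
```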

Precision–Recall Curves: The Other Side of the Story

ROC curves are powerful, but they can be misleading when your dataset is highly imbalanced — which is common in fraud detection, medical diagnosis, and anomaly detection.

That’s where Precision–Recall (PR) curves shine.

Precision

\text{Precision} = \frac{TP}{TP + FP}

This answers:
“Of all the predicted positives, how many were actually positive?”

Recall

Recall is the same as TPR:

\text{Recall} = \frac{TP}{TP + FN}

The PR Curve

A PR curve plots:

x-axis: Recall
y-axis: Precision

Unlike ROC curves, PR curves focus only on the positive class — making them far more informative when positives are rare.
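A quick way to see what "focus only on the positive class" means: precision and recall never use the true negative count, while FPR does. The sketch below reuses TP, FN, and FP from the concrete example earlier in this post, then inflates TN by a hypothetical factor of 100 to mimic a far more imbalanced dataset.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall depend only on TP, FP, and FN."""
    return tp / (tp + fp), tp / (tp + fn)

def false_positive_rate(fp, tn):
    """FPR is the only one of the three that uses TN."""
    return fp / (fp + tn)

tp, fn, fp = 90, 10, 20
precision, recall = precision_recall(tp, fp, fn)

for tn in (880, 88_000):  # second value is a hypothetical, more imbalanced dataset
    print(f"TN={tn}: precision={precision:.3f}, recall={recall:.3f}, "
          f"FPR={false_positive_rate(fp, tn):.4f}")
```

Precision and recall stay at about 0.818 and 0.900 in both cases, while FPR shrinks by roughly two orders of magnitude simply because there are more negatives.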

When PR curves are more informative than ROC curves

Fraud detection
Disease screening
Rare event prediction
Any dataset where positives make up less than roughly 5–10% of examples

In these cases, a model can achieve a deceptively high ROC AUC while still performing poorly on the minority class.
PR curves expose this weakness.

Why ROC and PR Curves Matter in the Real World

Different domains care about different errors:

Healthcare: False negatives can be deadly.
Fraud detection: False positives annoy customers but false negatives cost money.
Spam filtering: False positives hide important emails; false negatives let spam through.

ROC curves help you understand threshold trade-offs.
PR curves help you understand performance on the positive class.

Together, they give you a complete picture.

Final Thoughts

Understanding ROC curves, PR curves, and classification errors isn’t optional — it’s foundational.
These tools help you:

Diagnose model weaknesses
Compare models fairly
Choose thresholds intelligently
Communicate performance clearly to stakeholders

Accuracy may be the headline metric, but ROC and PR curves tell the real story.


Interactive ROC Curve

[Interactive figure: ROC curve built with D3.js, with a threshold slider and a live confusion matrix. At threshold 0.50 the matrix reads TP = 22, FN = 5, FP = 3, TN = 20.]

Understanding the Classification Threshold

Most machine learning classifiers don’t directly output a “yes” or “no.” Instead, they produce a probability score between 0 and 1. For example:

“This transaction has a 0.82 probability of being fraudulent.”

To turn that probability into a decision, we apply a threshold:

If the score is greater than or equal to the threshold, we classify it as positive
If the score is below the threshold, we classify it as negative
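In code, the rule is a single comparison. A minimal sketch with invented scores; `classify` is a hypothetical helper name.

```python
import numpy as np

scores = np.array([0.82, 0.50, 0.49, 0.10])  # hypothetical model outputs

def classify(scores, threshold):
    """Positive (1) when score >= threshold, otherwise negative (0)."""
    return (scores >= threshold).astype(int)

print(classify(scores, 0.5))  # [1 1 0 0]
print(classify(scores, 0.3))  # [1 1 1 0]
```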

This threshold is one of the most important—and often overlooked—controls in a binary classifier.

How Changing the Threshold Affects Predictions

Adjusting the threshold changes the model’s behavior:

Lower threshold → more positives predicted
- Higher True Positive Rate (TPR)
- Higher False Positive Rate (FPR)
Higher threshold → fewer positives predicted
- Lower FPR
- Lower TPR

In other words:

Lowering the threshold makes the model more “aggressive.”
Raising the threshold makes it more “conservative.”

This trade-off is exactly what the ROC curve visualizes.

How the Threshold Shapes the ROC Curve

Every threshold produces a different pair of values:

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)

As you sweep the threshold from 0 → 1, you trace out the ROC curve:

At threshold 0, everything is predicted positive → TPR = 1, FPR = 1
At a threshold just above the highest score (effectively 1), everything is predicted negative → TPR = 0, FPR = 0
All other thresholds fall somewhere in between

The interactive ROC visualization lets you explore this relationship:

The slider changes the threshold
The point moves along the ROC curve
The tooltip shows the TPR/FPR at that threshold
The confusion matrix updates live to reflect the new predictions

This makes the threshold’s impact visible and intuitive.

Why the Threshold Matters in Real Applications

Different domains care about different types of errors:

Healthcare:
False negatives can be dangerous → use a lower threshold
Fraud detection:
Missing fraud is costly, but false alarms are tolerable → lower threshold
Spam filtering:
False positives hide important emails → higher threshold

There is no universally “correct” threshold.
The right choice depends on the cost of errors in your domain.

The Threshold as a Decision Lever

The threshold is not just a technical detail—it’s a policy decision:

How cautious should the model be
What risks are acceptable
What errors are more costly
How the model behaves in production

The ROC curve helps you visualize these trade-offs, but the threshold is the knob you turn to control them.

How to Choose the Optimal Threshold

Choosing the “optimal” threshold is one of the most important decisions you make when deploying a binary classifier. The model may output probabilities, but the threshold determines how those probabilities turn into real-world actions. There is no single universally correct threshold — the right choice depends on your domain, your tolerance for risk, and the cost of different types of errors.

Still, there are several principled ways to select a threshold that aligns with your goals.

1. Choosing a Threshold Based on Business or Domain Costs

In many real-world settings, the cost of a false negative is very different from the cost of a false positive.

Healthcare: Missing a disease (FN) is far worse than a false alarm (FP).
→ Use a lower threshold to maximize sensitivity.
Spam filtering: Marking a real email as spam (FP) is worse than letting spam through (FN).
→ Use a higher threshold to minimize false positives.
Fraud detection: Missing fraud (FN) is expensive, but false alarms (FP) are tolerable.
→ Use a lower threshold to catch more fraud.

This approach is grounded in the idea that the “optimal” threshold is the one that minimizes expected cost, not the one that maximizes accuracy.
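Here is a rough sketch of cost-based selection. Everything in it is an assumption for illustration: the helper name `min_cost_threshold`, the toy scores, the threshold grid, and the 10:1 cost ratios. Real costs would come from your domain.

```python
import numpy as np

def min_cost_threshold(y_true, scores, cost_fp, cost_fn, thresholds):
    """Return (threshold, total_cost) for the cheapest threshold on the grid."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    costs = []
    for t in thresholds:
        pred = scores >= t
        fp = np.sum(pred & (y_true == 0))   # false alarms at this threshold
        fn = np.sum(~pred & (y_true == 1))  # misses at this threshold
        costs.append(cost_fp * fp + cost_fn * fn)
    best = int(np.argmin(costs))
    return thresholds[best], int(costs[best])

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
grid = [0.1, 0.3, 0.5, 0.7, 0.9]

# A miss costs 10x a false alarm (fraud-like): a low threshold wins
fraud_like = min_cost_threshold(y_true, scores, cost_fp=1, cost_fn=10, thresholds=grid)
# A false alarm costs 10x a miss (spam-like): a high threshold wins
spam_like = min_cost_threshold(y_true, scores, cost_fp=10, cost_fn=1, thresholds=grid)
print(fraud_like, spam_like)  # (0.3, 2) (0.9, 3)
```

The same model and the same scores produce very different thresholds once the cost ratio changes, which is the whole point of this approach.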

2. Maximizing Youden’s J Statistic (ROC-Based)

A common ROC-based method is Youden’s J, defined as:

J = \text{TPR} - \text{FPR}

The threshold that maximizes J is the point on the ROC curve with the greatest vertical distance above the diagonal (random guessing). Because specificity is 1 − FPR, J equals sensitivity + specificity − 1, so maximizing it weights sensitivity and specificity equally.

This method is useful when:

You want a single, balanced threshold
Costs of FP and FN are roughly equal
You want a threshold derived directly from the ROC curve

3. Maximizing the F1 Score (PR-Based)

The F1 score is the harmonic mean of precision and recall:

F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

The threshold that maximizes F1 is ideal when:

The positive class is rare
You care about both precision and recall
You want to avoid extreme trade-offs

This is especially useful in imbalanced datasets where ROC curves can be misleading.

4. Using Precision–Recall Curves for Imbalanced Data

When the positive class is rare (fraud, disease detection, anomalies), ROC curves can look deceptively good. In these cases, PR curves give a clearer picture.

You can choose a threshold that:

Achieves a minimum acceptable precision
Achieves a minimum acceptable recall
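Gives the highest precision at the recall level you need (the area under the PR curve, AUPRC, summarizes the whole curve and is for comparing models, not for picking a threshold)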

This ensures your model performs well where it matters most: the minority class.

5. Thresholds Based on Operational Constraints

Sometimes the threshold is dictated by external requirements:

“We can only handle 50 false positives per day.”
“We must catch at least 95% of true positives.”
“Precision must be above 0.90 for regulatory reasons.”

In these cases, you choose the threshold that satisfies the constraint, even if it’s not mathematically optimal.
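A constraint like "catch at least 95% of true positives" can be turned into a threshold by walking the candidate thresholds from high to low and stopping at the first one that meets the target. The sketch below is one possible way to do it; the helper name and the toy data are assumptions.

```python
import numpy as np

def highest_threshold_with_recall(y_true, scores, min_recall):
    """Highest observed score usable as a threshold with recall >= min_recall."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n_pos = np.sum(y_true == 1)
    # np.unique returns sorted values; reverse to try the strictest threshold first
    for t in np.unique(scores)[::-1]:
        recall = np.sum((scores >= t) & (y_true == 1)) / n_pos
        if recall >= min_recall:
            return float(t)
    return None  # only reachable if there are no scores at all

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

threshold = highest_threshold_with_recall(y_true, scores, min_recall=0.95)
print(threshold)  # 0.3
```

Recall can only go up as the threshold drops, so the first threshold that satisfies the target is also the one with the fewest false positives among those that qualify.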

6. Visual Exploration with Interactive Tools

Interactive ROC and PR visualizations (like the one in this post) help you:

See how TPR and FPR change with the threshold
Understand the shape of the ROC curve
Inspect the confusion matrix at each threshold
Identify thresholds that align with your goals

This hands-on exploration often reveals insights that metrics alone can’t capture.

The Bottom Line

There is no single “best” threshold.
Instead, the optimal threshold is the one that:

Aligns with your domain’s cost structure
Balances the trade-offs you care about
Performs well on the metrics that matter
Meets operational or regulatory constraints

The ROC curve helps you understand the trade-offs.
The threshold is the lever you pull to control them.

Why ROC Curves Can Be Misleading for Imbalanced Data

ROC curves are powerful tools, but they have a blind spot: they can make a weak model look deceptively strong when the dataset is highly imbalanced. This is a common scenario in fraud detection, medical screening, anomaly detection, and many real-world classification problems where the positive class is rare.

The Core Problem: FPR Looks Small Even When Many Negatives Are Misclassified

The False Positive Rate (FPR) is defined as:

\text{FPR} = \frac{FP}{FP + TN}

In an imbalanced dataset, the number of true negatives (TN) is huge.
So even if the model produces a large number of false positives (FP), the denominator is so large that FPR barely moves.

This means:

A model can generate hundreds of false alarms,
Yet still show a very low FPR,
Which makes the ROC curve look artificially impressive.

Example: Why This Is Misleading

Imagine a fraud detection dataset where only 1% of transactions are fraudulent.

If the model incorrectly flags 200 legitimate transactions as fraud:

FP = 200
TN = 9,800

Then:

\text{FPR} = \frac{200}{200 + 9800} = 0.02

An FPR of 2% looks great on a ROC curve.
But in reality, 200 false alarms per day might be completely unacceptable.

The ROC curve hides this operational pain.
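To put numbers on that pain, the sketch below extends the example with two assumptions that are not in the original scenario: about 100 fraudulent transactions (roughly 1% of the total) and a model that catches 90 of them.

```python
fp, tn = 200, 9_800   # from the example above
n_fraud = 100         # assumption: roughly 1% of all transactions
tp = 90               # assumption: the model catches 90 of the 100 fraud cases
fn = n_fraud - tp

fpr = fp / (fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"FPR = {fpr:.3f}, recall = {recall:.3f}, precision = {precision:.3f}")
# FPR = 0.020, recall = 0.900, precision = 0.310
```

Under these assumptions, more than two out of three alerts are false alarms, even though the ROC point (0.02, 0.90) sits close to the top-left corner.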

ROC Curves Treat All Negatives as Equal — But Real Life Doesn’t

ROC curves implicitly assume:

False positives and true negatives are equally important
The negative class is as important as the positive class
The cost of errors is symmetric

In imbalanced datasets, none of these assumptions hold.

PR Curves Reveal What ROC Curves Hide

Precision–Recall (PR) curves focus only on the positive class:

Precision tells you how many predicted positives were correct
Recall tells you how many actual positives you caught

When positives are rare, PR curves give a much more honest picture of model performance.

A model with a great ROC curve can still have:

Terrible precision
Many false alarms
Poor usefulness in practice

PR curves expose this immediately.

When to Prefer PR Curves Over ROC Curves

Use PR curves when:

The positive class is rare (fraud, disease, anomalies)
False positives are costly
You care about the quality of positive predictions
You want to understand how the model behaves on the minority class

Use ROC curves when:

Classes are balanced
You care about overall discrimination ability
You want a threshold‑independent view of performance

The Bottom Line

ROC curves are not wrong — they’re just optimistic in imbalanced settings.
They can make a mediocre model look strong because FPR stays small even when the model produces many false alarms.

PR curves, on the other hand, highlight:

How many false positives the model produces
How reliable positive predictions are
How well the model handles the minority class

In imbalanced datasets, PR curves often tell the story that ROC curves hide.

How to Compute Thresholds in Python (scikit‑learn)

Once you understand how thresholds shape model behavior, the next step is learning how to compute them programmatically. Fortunately, scikit‑learn provides everything you need to extract ROC points, PR points, and optimal thresholds directly from your model’s predicted probabilities.

Below is a practical guide using real Python code.

Getting the Model Scores

Most scikit-learn classifiers expose a predict_proba method that returns one probability column per class; for a binary problem, column 1 holds the probability of the positive class:

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

# Probability of the positive class
scores = model.predict_proba(X_test)[:, 1]

These scores are what you will sweep across thresholds.
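If you want something runnable end to end, the following sketch builds X_train, y_train, X_test, and y_test from synthetic data. The dataset parameters (2,000 samples, about 10% positives, fixed random seeds) are arbitrary choices for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (about 10% positives); purely illustrative
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Column 1 of predict_proba is the probability of the positive class
scores = model.predict_proba(X_test)[:, 1]
print(scores.shape, scores.min(), scores.max())
```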

Computing ROC Curve Thresholds

scikit‑learn’s roc_curve function returns:

FPR (False Positive Rate)
TPR (True Positive Rate)
thresholds used to compute each point

from sklearn.metrics import roc_curve

fpr, tpr, roc_thresholds = roc_curve(y_test, scores)

You now have:

A full ROC curve
The exact threshold that produced each (FPR, TPR) pair
A way to visualize how the model behaves across all thresholds
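A tiny example makes the returned arrays easier to read. The labels and scores below are toy values. One detail worth knowing: the first threshold is a sentinel above every score, representing "predict nothing as positive". As far as I know, recent scikit-learn releases use infinity for it and older ones use the maximum score plus one.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, roc_thresholds = roc_curve(y_true, scores)

# One (FPR, TPR) point per threshold, from strictest to most permissive
for f, t, thr in zip(fpr, tpr, roc_thresholds):
    print(f"threshold={thr:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```

Because of that sentinel, it is usually safer not to feed the first ROC threshold into code that expects a real score value.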

Computing PR Curve Thresholds

For imbalanced datasets, the Precision–Recall curve is often more informative. scikit‑learn provides:

precision
recall
thresholds

from sklearn.metrics import precision_recall_curve

precision, recall, pr_thresholds = precision_recall_curve(y_test, scores)

You now have:

A full PR curve
The exact threshold that produced each (precision, recall) pair
A way to visualize how the model behaves across all thresholds

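Note: precision and recall each contain one more element than the thresholds array, because scikit-learn appends a final point (precision = 1, recall = 0) that has no corresponding threshold. This is expected.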
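Because the arrays differ in length by one, pairing each threshold with its precision and recall means dropping the final point first. A small sketch on toy data:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, pr_thresholds = precision_recall_curve(y_true, scores)

# Drop the final (precision=1, recall=0) point, which has no threshold
for p, r, thr in zip(precision[:-1], recall[:-1], pr_thresholds):
    print(f"threshold={thr:.2f}  precision={p:.2f}  recall={r:.2f}")
```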

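Choosing the Optimal Threshold with Youden’s J

Youden’s J statistic identifies the ROC point farthest from the diagonal:

import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, roc_thresholds = roc_curve(y_test, scores)

# J = TPR - FPR at each threshold; take the threshold where it peaks
J = tpr - fpr
best_idx = np.argmax(J)
best_threshold_roc = roc_thresholds[best_idx]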

This threshold balances sensitivity and specificity.

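Choosing the Optimal Threshold with F1 Score

To maximize the F1 score, evaluate F1 at each candidate threshold:

from sklearn.metrics import f1_score, precision_recall_curve

# Candidate thresholds: the distinct scores returned by the PR curve
precision, recall, pr_thresholds = precision_recall_curve(y_test, scores)

f1_scores = []
for t in pr_thresholds:
    preds = (scores >= t).astype(int)
    f1_scores.append(f1_score(y_test, preds))

best_idx = np.argmax(f1_scores)
best_threshold_f1 = pr_thresholds[best_idx]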

This is especially useful for imbalanced datasets.

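Thresholds Based on Precision or Recall Requirements

Sometimes you need to enforce a minimum precision or recall:

from sklearn.metrics import precision_recall_curve

precision, recall, pr_thresholds = precision_recall_curve(y_test, scores)

# Example: choose the lowest threshold that gives precision >= 0.90
target_precision = 0.90
# precision has one more entry than pr_thresholds, so skip its last point;
# this raises IndexError if no threshold reaches the target
idx = np.where(precision[:-1] >= target_precision)[0][0]
best_threshold_precision = pr_thresholds[idx]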

This is common in fraud detection, medical screening, and safety‑critical systems.