Seeing in Curves: A Practical Guide to ROC, TPR, FPR, and the Four Outcomes of Classification
Why Model Evaluation Deserves More Love
Machine learning tutorials love to focus on the glamorous parts: training neural networks, tuning hyperparameters, and showing off accuracy scores that look suspiciously perfect.
But anyone who has deployed a real model knows the truth: accuracy is a liar, and understanding how your model fails is far more important than celebrating how often it succeeds.
That’s where the humble confusion matrix and the mighty ROC curve come in.
They’re not flashy. They don’t trend on social media.
But they are the difference between a model that works in production and one that quietly ruins your day.
The Four Outcomes of a Binary Classifier
Every prediction your model makes falls into one of four buckets. These four numbers form the backbone of almost every evaluation metric in machine learning.
True Positive (TP)
The model predicted positive, and the actual label was positive.
Example: The model says a transaction is fraudulent, and it really is.
False Positive (FP)
The model predicted positive, but the actual label was negative.
Example: The model flags a legitimate transaction as fraud.
This is also known as a Type I error.
False Negative (FN)
The model predicted negative, but the actual label was positive.
Example: The model misses a fraudulent transaction.
This is a Type II error, and in domains such as fraud or medical screening it is often the more costly one.
True Negative (TN)
The model predicted negative, and the actual label was negative.
Example: The model correctly identifies a legitimate transaction.
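The Confusion Matrix
A confusion matrix is a compact way to visualize all four outcomes:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |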
This tiny table powers nearly every evaluation metric you’ve ever heard of.
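As a rough sketch, the four counts can be pulled out of scikit-learn's `confusion_matrix`. The labels and predictions below are made up for illustration. Note the ordering: with `labels=[0, 1]`, scikit-learn returns `[[TN, FP], [FN, TP]]`, so the negative class comes first.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and hard predictions (1 = positive, 0 = negative).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# With labels=[0, 1], the matrix is laid out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```

Passing `labels` explicitly keeps the layout stable even if one class happens to be missing from a small sample.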
From Confusion to Clarity: TPR and FPR
Two critical metrics emerge from these four outcomes:
True Positive Rate (TPR)
Also called Recall or Sensitivity.
$$\mathrm{TPR} = \frac{TP}{TP + FN}$$
This answers the question:
“Of all the actual positives, how many did we catch?”
False Positive Rate (FPR)
$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
This answers:
“Of all the actual negatives, how many did we incorrectly flag?”
These two metrics form the axes of the ROC curve.
What Is an ROC Curve?
The Receiver Operating Characteristic (ROC) curve visualizes how your model behaves as you adjust the classification threshold.
Most classifiers output a probability:
“This email has a 0.82 probability of being spam.”
But you choose the threshold that turns that probability into a decision:
- Lower threshold → more positives → higher TPR and higher FPR
- Higher threshold → fewer positives → lower FPR and lower TPR
The ROC curve plots:
- x-axis: False Positive Rate (FPR)
- y-axis: True Positive Rate (TPR)
Each point corresponds to a different threshold.
A perfect model hugs the top-left corner.
A random model draws a diagonal line.
A terrible model dips below the diagonal (and can be fixed by flipping its predictions).
AUC: The One Number to Rule Them All
The Area Under the Curve (AUC) compresses the ROC curve into a single value between 0 and 1.
- 1.0 → perfect
- 0.5 → random guessing
- < 0.5 → worse than random (usually means labels are flipped)
AUC is especially useful when:
- Classes are roughly balanced (when positives are rare, Precision–Recall curves are usually more informative)
- You want a threshold‑independent metric
- You’re comparing multiple models fairly
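One way to build intuition for AUC: it equals the fraction of (positive, negative) pairs in which the positive example gets the higher score. The sketch below checks that against scikit-learn's `roc_auc_score` on a handful of made-up scores, and shows that reversing the ranking turns an AUC of `a` into `1 - a`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores, chosen only for illustration (no ties).
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.75, 0.90])

auc = roc_auc_score(y_true, scores)

# Fraction of (positive, negative) pairs ranked correctly.
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])

# Negating the scores reverses the ranking, so AUC becomes 1 - AUC.
flipped_auc = roc_auc_score(y_true, -scores)

print(auc, pairwise, flipped_auc)
```

This is why an AUC well below 0.5 usually points to inverted labels or scores rather than a hopeless model.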
A Concrete Example
Imagine a model whose predictions produce the following confusion matrix:
| | Predicted Positive | Predicted Negative |
|---------------------|--------------------|--------------------|
| Actual Positive | 90 | 10 |
| Actual Negative | 20 | 880 |
From this we compute:
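True Positive Rate
$$\mathrm{TPR} = \frac{TP}{TP + FN} = \frac{90}{90 + 10} = 0.90$$
False Positive Rate
$$\mathrm{FPR} = \frac{FP}{FP + TN} = \frac{20}{20 + 880} \approx 0.022$$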
This single point becomes one coordinate on the ROC curve.
Change the threshold, and you get another.
Connect them all, and you get the full curve.
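The same arithmetic in a few lines of plain Python, using the counts from this example's confusion matrix. Accuracy is included only for comparison, and the variable names are illustrative.

```python
# Counts from the confusion matrix in this example.
tp, fn = 90, 10   # actual positives
fp, tn = 20, 880  # actual negatives

tpr = tp / (tp + fn)  # share of actual positives that were caught
fpr = fp / (fp + tn)  # share of actual negatives that were flagged
accuracy = (tp + tn) / (tp + fn + fp + tn)

# One ROC coordinate is the pair (FPR, TPR).
roc_point = (fpr, tpr)
print(f"TPR={tpr:.3f} FPR={fpr:.3f} accuracy={accuracy:.3f}")
```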
Precision–Recall Curves: The Other Side of the Story
ROC curves are powerful, but they can be misleading when your dataset is highly imbalanced — which is common in fraud detection, medical diagnosis, and anomaly detection.
That’s where Precision–Recall (PR) curves shine.
Precision
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
This answers:
“Of all the predicted positives, how many were actually positive?”
Recall
Recall is the same as TPR:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
The PR Curve
A PR curve plots:
- x-axis: Recall
- y-axis: Precision
Unlike ROC curves, PR curves focus only on the positive class — making them far more informative when positives are rare.
When PR curves are more informative than ROC curves
- Fraud detection
- Disease screening
- Rare event prediction
- Any dataset where the positive class makes up less than roughly 5–10% of examples
In these cases, a model can achieve a deceptively high ROC AUC while still performing poorly on the minority class.
PR curves expose this weakness.
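To make that gap concrete, here is a small synthetic sketch (the data is constructed, not real). Every positive is outscored by exactly 30 of the 1,000 negatives, which still leaves ROC AUC at 0.97. Average precision, scikit-learn's summary of the PR curve, comes out far lower because those 30 negatives sit at the very top of the ranking.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Synthetic, deterministic data: 1,000 negatives and 10 positives (about 1%).
neg_scores = np.linspace(0.0, 0.9, 1000)
step = neg_scores[1] - neg_scores[0]

# Place every positive just above the 970th negative, so exactly 30
# negatives outscore each positive.
pos_scores = neg_scores[969] + step * np.linspace(0.1, 0.9, 10)

y_true = np.concatenate([np.zeros(1000, dtype=int), np.ones(10, dtype=int)])
scores = np.concatenate([neg_scores, pos_scores])

roc_auc = roc_auc_score(y_true, scores)
avg_precision = average_precision_score(y_true, scores)

print(f"ROC AUC={roc_auc:.3f}  average precision={avg_precision:.3f}")
```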
Why ROC and PR Curves Matter in the Real World
Different domains care about different errors:
- Healthcare: False negatives can be deadly.
- Fraud detection: False positives annoy customers but false negatives cost money.
- Spam filtering: False positives hide important emails; false negatives let spam through.
ROC curves help you understand threshold trade-offs.
PR curves help you understand performance on the positive class.
Together, they give you a complete picture.
Final Thoughts
Understanding ROC curves, PR curves, and classification errors isn’t optional — it’s foundational.
These tools help you:
- Diagnose model weaknesses
- Compare models fairly
- Choose thresholds intelligently
- Communicate performance clearly to stakeholders
Accuracy may be the headline metric, but ROC and PR curves tell the real story.
Interactive ROC Curve
Confusion Matrix
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | 22 | 5 |
| Actual Negative | 3 | 20 |
ROC Curve Example
[Figure: interactive ROC curve built with D3.js, with a threshold slider that moves a point along the curve and updates the confusion matrix]
Understanding the Classification Threshold
Most machine learning classifiers don’t directly output a “yes” or “no.” Instead, they produce a probability score between 0 and 1. For example:
“This transaction has a 0.82 probability of being fraudulent.”
To turn that probability into a decision, we apply a threshold:
- If the score is greater than or equal to the threshold, we classify it as positive
- If the score is below the threshold, we classify it as negative
This threshold is one of the most important—and often overlooked—controls in a binary classifier.
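In code, applying a threshold is a single comparison. A minimal sketch with made-up scores; `classify` is a hypothetical helper name, not a library function.

```python
import numpy as np

# Hypothetical probability scores from a classifier.
scores = np.array([0.05, 0.40, 0.50, 0.82, 0.97])

def classify(scores, threshold):
    """Return 1 where score >= threshold, else 0."""
    return (scores >= threshold).astype(int)

print(classify(scores, 0.5))  # [0 0 1 1 1]
print(classify(scores, 0.9))  # [0 0 0 0 1]
```

A score exactly equal to the threshold counts as positive here, matching the "greater than or equal to" rule above.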
How Changing the Threshold Affects Predictions
Adjusting the threshold changes the model’s behavior:
- Lower threshold → more positives predicted
  - Higher True Positive Rate (TPR)
  - Higher False Positive Rate (FPR)
- Higher threshold → fewer positives predicted
  - Lower FPR
  - Lower TPR
In other words:
Lowering the threshold makes the model more “aggressive.”
Raising the threshold makes it more “conservative.”
This trade-off is exactly what the ROC curve visualizes.
How the Threshold Shapes the ROC Curve
Every threshold produces a different pair of values:
- TPR = TP / (TP + FN)
- FPR = FP / (FP + TN)
As you sweep the threshold from 0 → 1, you trace out the ROC curve:
- At threshold 0, everything is predicted positive → TPR = 1, FPR = 1
- At a threshold above the highest score (1 in practice), everything is predicted negative → TPR = 0, FPR = 0
- All other thresholds fall somewhere in between
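The sweep can be sketched by hand with NumPy. The labels and scores are invented, and `roc_point` is a hypothetical helper that applies the `score >= threshold` rule and returns one (FPR, TPR) pair.

```python
import numpy as np

def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) for predictions made with `score >= threshold`."""
    pred = scores >= threshold
    actual = y_true == 1
    tp = np.sum(pred & actual)
    fp = np.sum(pred & ~actual)
    fn = np.sum(~pred & actual)
    tn = np.sum(~pred & ~actual)
    return fp / (fp + tn), tp / (tp + fn)

# Hypothetical labels and scores.
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])

# 0.0 flags everything; 1.01 sits above every score and flags nothing.
for t in [0.0, 0.35, 0.65, 1.01]:
    fpr, tpr = roc_point(y_true, scores, t)
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The first and last thresholds land on the two corners of the ROC plot, (1, 1) and (0, 0), and the ones in between fill in the curve.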
The interactive ROC visualization lets you explore this relationship:
- The slider changes the threshold
- The point moves along the ROC curve
- The tooltip shows the TPR/FPR at that threshold
- The confusion matrix updates live to reflect the new predictions
This makes the threshold’s impact visible and intuitive.
Why the Threshold Matters in Real Applications
Different domains care about different types of errors:
- Healthcare: False negatives can be dangerous → use a lower threshold
- Fraud detection: Missing fraud is costly, but false alarms are tolerable → use a lower threshold
- Spam filtering: False positives hide important emails → use a higher threshold
There is no universally “correct” threshold.
The right choice depends on the cost of errors in your domain.
The Threshold as a Decision Lever
The threshold is not just a technical detail; it’s a policy decision about:
- How cautious the model should be
- Which risks are acceptable
- Which errors are more costly
- How the model behaves in production
The ROC curve helps you visualize these trade-offs, but the threshold is the knob you turn to control them.
How to Choose the Optimal Threshold
Choosing the “optimal” threshold is one of the most important decisions you make when deploying a binary classifier. The model may output probabilities, but the threshold determines how those probabilities turn into real-world actions. There is no single universally correct threshold — the right choice depends on your domain, your tolerance for risk, and the cost of different types of errors.
Still, there are several principled ways to select a threshold that aligns with your goals.
1. Choosing a Threshold Based on Business or Domain Costs
In many real-world settings, the cost of a false negative is very different from the cost of a false positive.
- Healthcare: Missing a disease (FN) is far worse than a false alarm (FP).
  → Use a lower threshold to maximize sensitivity.
- Spam filtering: Marking a real email as spam (FP) is worse than letting spam through (FN).
  → Use a higher threshold to minimize false positives.
- Fraud detection: Missing fraud (FN) is expensive, but false alarms (FP) are tolerable.
  → Use a lower threshold to catch more fraud.
This approach is grounded in the idea that the “optimal” threshold is the one that minimizes expected cost, not the one that maximizes accuracy.
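A minimal sketch of cost-based selection, assuming you can put a per-error cost on false positives and false negatives. The costs, data, and the helper `min_cost_threshold` are all hypothetical. It tries every distinct score as a threshold and keeps the cheapest.

```python
import numpy as np

def min_cost_threshold(y_true, scores, cost_fp, cost_fn):
    """Return (threshold, cost) for the candidate with the lowest total cost."""
    # Candidates: every distinct score, plus one value above the maximum
    # so that "predict nothing as positive" is also considered.
    candidates = np.append(np.unique(scores), scores.max() + 1.0)
    costs = []
    for t in candidates:
        pred = scores >= t
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        costs.append(cost_fp * fp + cost_fn * fn)
    best = int(np.argmin(costs))
    return candidates[best], costs[best]

# Hypothetical labels and scores.
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])

# Missed positives cost 10x a false alarm: a low threshold wins.
low_t, low_cost = min_cost_threshold(y_true, scores, cost_fp=1, cost_fn=10)
# False alarms cost 10x a missed positive: a higher threshold wins.
high_t, high_cost = min_cost_threshold(y_true, scores, cost_fp=10, cost_fn=1)

print(low_t, low_cost, high_t, high_cost)
```

On this toy data the chosen threshold moves from 0.4 to 0.7 when the cost ratio is reversed, which is the healthcare-versus-spam contrast in miniature.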
2. Maximizing Youden’s J Statistic (ROC-Based)
A common ROC-based method is Youden’s J, defined as:
$$J = \mathrm{TPR} - \mathrm{FPR} = \text{sensitivity} + \text{specificity} - 1$$
The threshold that maximizes J is the point on the ROC curve with the greatest vertical distance above the diagonal (random guessing). It gives equal weight to sensitivity and specificity.
This method is useful when:
- You want a single, balanced threshold
- Costs of FP and FN are roughly equal
- You want a threshold derived directly from the ROC curve
3. Maximizing the F1 Score (PR-Based)
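The F1 score is the harmonic mean of precision and recall:
$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
The threshold that maximizes F1 is ideal when: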
- The positive class is rare
- You care about both precision and recall
- You want to avoid extreme trade-offs
This is especially useful in imbalanced datasets where ROC curves can be misleading.
4. Using Precision–Recall Curves for Imbalanced Data
When the positive class is rare (fraud, disease detection, anomalies), ROC curves can look deceptively good. In these cases, PR curves give a clearer picture.
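You can choose a threshold that:
- Achieves a minimum acceptable precision
- Achieves a minimum acceptable recall
- Gives the best precision–recall trade-off for your use case (for example, the highest F1)
The area under the PR curve (AUPRC) summarizes the whole curve, so it is a tool for comparing models rather than for picking a threshold.
This keeps the focus where it matters most: the minority class.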
5. Thresholds Based on Operational Constraints
Sometimes the threshold is dictated by external requirements:
- “We can only handle 50 false positives per day.”
- “We must catch at least 95% of true positives.”
- “Precision must be above 0.90 for regulatory reasons.”
In these cases, you choose the threshold that satisfies the constraint, even if it’s not mathematically optimal.
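Constraints like these translate fairly directly into lookups on the ROC arrays. The sketch below uses toy data and a false-positive budget of 1 instead of 50. The variable names are illustrative, and `drop_intermediate=False` is passed so that `roc_curve` keeps every candidate threshold.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and scores.
y_true = np.array([0, 0, 1, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9])

# Keep every candidate threshold; the default drops some collinear points.
fpr, tpr, thresholds = roc_curve(y_true, scores, drop_intermediate=False)

# Constraint A: catch at least 95% of true positives.
# Thresholds are in decreasing order, so the first match is the highest one.
min_recall = 0.95
threshold_recall = thresholds[np.argmax(tpr >= min_recall)]

# Constraint B: allow at most 1 false positive.
# FP count = FPR * number of negatives; the last match is the lowest threshold.
max_fp = 1
n_neg = np.sum(y_true == 0)
fp_counts = np.round(fpr * n_neg)
threshold_fp = thresholds[np.where(fp_counts <= max_fp)[0][-1]]

print(threshold_recall, threshold_fp)
```

The two constraints pull in opposite directions here (0.3 versus 0.5), so when both apply you need to check whether any single threshold satisfies them together.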
6. Visual Exploration with Interactive Tools
Interactive ROC and PR visualizations (like the one in this post) help you:
- See how TPR and FPR change with the threshold
- Understand the shape of the ROC curve
- Inspect the confusion matrix at each threshold
- Identify thresholds that align with your goals
This hands-on exploration often reveals insights that metrics alone can’t capture.
The Bottom Line
There is no single “best” threshold.
Instead, the optimal threshold is the one that:
- Aligns with your domain’s cost structure
- Balances the trade-offs you care about
- Performs well on the metrics that matter
- Meets operational or regulatory constraints
The ROC curve helps you understand the trade-offs.
The threshold is the lever you pull to control them.
Why ROC Curves Can Be Misleading for Imbalanced Data
ROC curves are powerful tools, but they have a blind spot: they can make a weak model look deceptively strong when the dataset is highly imbalanced. This is a common scenario in fraud detection, medical screening, anomaly detection, and many real-world classification problems where the positive class is rare.
The Core Problem: FPR Looks Small Even When Many Negatives Are Misclassified
The False Positive Rate (FPR) is defined as:
$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
In an imbalanced dataset, the number of true negatives (TN) is huge.
So even if the model produces a large number of false positives (FP), the denominator is so large that FPR barely moves.
This means:
- A model can generate hundreds of false alarms,
- Yet still show a very low FPR,
- Which makes the ROC curve look artificially impressive.
Example: Why This Is Misleading
Imagine a fraud detection dataset where only 1% of transactions are fraudulent.
If the model incorrectly flags 200 legitimate transactions as fraud:
- FP = 200
- TN = 9,800
Then:
$$\mathrm{FPR} = \frac{FP}{FP + TN} = \frac{200}{200 + 9{,}800} = 0.02$$
An FPR of 2% looks great on a ROC curve.
But in reality, 200 false alarms per day might be completely unacceptable.
The ROC curve hides this operational pain.
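To see what the 2% hides, it helps to put precision next to it. The example above does not say how many fraud cases there are or how many the model catches. The sketch below assumes 100 fraud cases with 90 caught, purely for illustration.

```python
# Counts from the example above.
fp, tn = 200, 9_800

# Assumption for illustration only: the same data contains 100 fraud
# cases (about 1%), and the model catches 90 of them.
tp, fn = 90, 10

fpr = fp / (fp + tn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(f"FPR={fpr:.3f} recall={recall:.2f} precision={precision:.2f}")
```

Under those assumptions, fewer than one in three alerts is real fraud, even though FPR is 2% and recall is 90%.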
ROC Curves Treat All Negatives as Equal — But Real Life Doesn’t
Read at face value, ROC curves implicitly assume:
- A false positive costs as much as a true negative is worth
- The negative class is as important as the positive class
- The cost of errors is symmetric
In imbalanced datasets, none of these assumptions hold.
PR Curves Reveal What ROC Curves Hide
Precision–Recall (PR) curves focus only on the positive class:
- Precision tells you how many predicted positives were correct
- Recall tells you how many actual positives you caught
When positives are rare, PR curves give a much more honest picture of model performance.
A model with a great ROC curve can still have:
- Terrible precision
- Many false alarms
- Poor usefulness in practice
PR curves expose this immediately.
When to Prefer PR Curves Over ROC Curves
Use PR curves when:
- The positive class is rare (fraud, disease, anomalies)
- False positives are costly
- You care about the quality of positive predictions
- You want to understand how the model behaves on the minority class
Use ROC curves when:
- Classes are balanced
- You care about overall discrimination ability
- You want a threshold‑independent view of performance
The Bottom Line
ROC curves are not wrong — they’re just optimistic in imbalanced settings.
They can make a mediocre model look strong because FPR stays small even when the model produces many false alarms.
PR curves, on the other hand, highlight:
- How many false positives the model produces
- How reliable positive predictions are
- How well the model handles the minority class
In imbalanced datasets, PR curves often tell the story that ROC curves hide.
How to Compute Thresholds in Python (scikit‑learn)
Once you understand how thresholds shape model behavior, the next step is learning how to compute them programmatically. Fortunately, scikit‑learn provides everything you need to extract ROC points, PR points, and optimal thresholds directly from your model’s predicted probabilities.
Below is a practical guide using real Python code.
Getting the Model Scores
Most scikit‑learn classifiers expose a predict_proba method that returns one probability column per class; in a binary problem, column 1 is the positive class:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
# Probability of the positive class
scores = model.predict_proba(X_test)[:, 1]
These scores are what you will sweep across thresholds.
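The snippet above assumes `X_train`, `X_test`, `y_train`, and `y_test` already exist. If you want something to run end to end, here is one possible setup on synthetic data. The dataset parameters are arbitrary choices for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary dataset with roughly 10% positives (illustrative only).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Column 1 holds the probability of the positive class (label 1).
scores = model.predict_proba(X_test)[:, 1]
print(scores.shape, scores.min(), scores.max())
```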
Computing ROC Curve Thresholds
scikit‑learn’s roc_curve function returns:
- FPR (False Positive Rate)
- TPR (True Positive Rate)
- thresholds used to compute each point
from sklearn.metrics import roc_curve
# Keep the ROC thresholds under their own name; the PR curve has a separate set
fpr, tpr, roc_thresholds = roc_curve(y_test, scores)
You now have:
- A full ROC curve
- The exact threshold that produced each (FPR, TPR) pair
- A way to visualize how the model behaves across all thresholds
Computing PR Curve Thresholds
For imbalanced datasets, the Precision–Recall curve is often more informative. scikit‑learn provides:
- precision
- recall
- thresholds
from sklearn.metrics import precision_recall_curve
# A distinct name avoids overwriting the ROC thresholds
precision, recall, pr_thresholds = precision_recall_curve(y_test, scores)
You now have:
- A full PR curve
- The exact threshold that produced each (precision, recall) pair
- A way to visualize how the model behaves across all thresholds
Note: PR curves have one fewer threshold than precision/recall points — this is normal.
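A quick way to see that off-by-one, and to line thresholds up with their precision and recall values, is sketched below on made-up data. The final point that scikit-learn appends (precision 1, recall 0) has no threshold of its own, so it is dropped before pairing.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and scores.
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])

precision, recall, pr_thresholds = precision_recall_curve(y_true, scores)

# precision and recall have one more entry than pr_thresholds.
print(len(precision), len(recall), len(pr_thresholds))

# Pair each threshold with its own precision and recall.
for t, p, r in zip(pr_thresholds, precision[:-1], recall[:-1]):
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```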
Choosing the Optimal Threshold with Youden’s J
Youden’s J statistic identifies the ROC point farthest above the diagonal:
import numpy as np
# J = TPR - FPR at every ROC threshold; pick the largest
J = tpr - fpr
best_idx = np.argmax(J)
best_threshold_roc = roc_thresholds[best_idx]
This threshold balances sensitivity and specificity.
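Choosing the Optimal Threshold with F1 Score
To maximize the F1 score, evaluate F1 at each threshold:
from sklearn.metrics import f1_score, precision_recall_curve
# Use the PR-curve thresholds as candidates: each one is an actual score,
# whereas the first ROC threshold is a sentinel that predicts no positives
_, _, pr_thresholds = precision_recall_curve(y_test, scores)
f1_scores = []
for t in pr_thresholds:
    preds = (scores >= t).astype(int)
    f1_scores.append(f1_score(y_test, preds, zero_division=0))
best_idx = np.argmax(f1_scores)
best_threshold_f1 = pr_thresholds[best_idx]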
This is especially useful for imbalanced datasets.
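Thresholds Based on Precision or Recall Requirements
Sometimes you need to enforce a minimum precision or recall:
# Example: choose the lowest threshold that gives precision >= 0.90
target_precision = 0.90
# precision has one more entry than pr_thresholds, so ignore the final point
candidates = np.where(precision[:-1] >= target_precision)[0]
# None means no threshold reaches the target precision
best_threshold_precision = pr_thresholds[candidates[0]] if candidates.size else None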
This is common in fraud detection, medical screening, and safety‑critical systems.