How it works

A score predictor built from thousands of real r/Step2 score releases, validated on outcomes it had never seen, and run entirely in your browser.

±4 pts: half of all predictions land this close to the real score on a blind test
89% of predictions land within 10 points
2,200+ real score reports behind the model

How does this work?

You enter your practice scores, and the tool compares your pattern to thousands of real r/Step2 score reports, then predicts the Step 2 CK score that people with similar practice results actually got.

Under the hood it's an ensemble of 24 models trained on those reports. Each practice test you enter is first run through its own calibrator: a small model that knows how that specific form maps to real Step 2 outcomes, and how that mapping weakens the longer before your exam you took it. A test from the week before your exam counts for more than one from two months out, which is why adding timing (the clock icon) sharpens the estimate. The calibrated signals are then combined with your question-bank details, and the spread across the 24 models feeds the uncertainty range.
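
A rough sketch of that pipeline, written in Python for readability (the tool itself runs in your browser, and this is not its code). The straight-line calibrator, the 30-day half-life, and every function name here are assumptions made up for illustration; the real calibrators, and how the spread is turned into a range, may work differently.

```python
from statistics import mean, stdev

def calibrate(raw_score, slope, intercept):
    """Hypothetical per-form calibrator: a straight line onto the Step 2 scale."""
    return slope * raw_score + intercept

def recency_weight(days_before_exam, half_life_days=30.0):
    """Assumed decay: a test loses half its weight every `half_life_days`."""
    return 0.5 ** (days_before_exam / half_life_days)

def combine(calibrated_tests):
    """Recency-weighted average of (calibrated_score, days_before_exam) pairs."""
    weights = [recency_weight(days) for _, days in calibrated_tests]
    weighted_sum = sum(w * score for w, (score, _) in zip(weights, calibrated_tests))
    return weighted_sum / sum(weights)

def ensemble_summary(member_predictions):
    """Point estimate and spread across ensemble members (one possible spread measure)."""
    return mean(member_predictions), stdev(member_predictions)

# Made-up numbers: one test a week out, one two months out.
estimate = combine([(252.0, 7), (244.0, 60)])
```

With these invented numbers the estimate sits much closer to 252 than to 244, because the test from a week out carries more weight than the one from two months out.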

Everything runs in your browser; your scores are never uploaded anywhere.

How accurate is it?

Accuracy is measured with a blind test: the model was trained only on older reports, then asked to predict hundreds of newer exams it had never seen. On that test:

  • Half of all estimates landed within about 4 points of the real score.
  • ~89% landed within 10 points, and ~96% within 15.
  • Predictions correlated with actual scores at 0.80, and systematic bias was essentially zero (−0.6 points).
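
For readers who want to see how figures like these are computed, here is a minimal sketch. The function name, the sign convention (predicted minus actual, so a negative bias means underpredicting), and the four example scores are all assumptions for illustration, not the tool's evaluation code or data.

```python
from statistics import mean, median

def blind_test_metrics(predicted, actual):
    """Summarize how far predictions landed from real scores."""
    errors = [p - a for p, a in zip(predicted, actual)]  # signed misses
    abs_errors = [abs(e) for e in errors]
    n = len(errors)
    # Pearson correlation, written out by hand to avoid version-specific helpers.
    mp, ma = mean(predicted), mean(actual)
    cov = sum((p - mp) * (a - ma) for p, a in zip(predicted, actual))
    var_p = sum((p - mp) ** 2 for p in predicted)
    var_a = sum((a - ma) ** 2 for a in actual)
    return {
        "median_abs_error": median(abs_errors),  # half of misses are smaller than this
        "mean_abs_error": mean(abs_errors),      # pulled upward by outliers
        "within_10": sum(e <= 10 for e in abs_errors) / n,
        "bias": mean(errors),
        "correlation": cov / (var_p * var_a) ** 0.5,
    }

# Invented example: three close predictions and one 12-point miss.
metrics = blind_test_metrics([250, 245, 260, 238], [252, 240, 259, 250])
```

In this toy example the single 12-point miss lifts the mean absolute error to 5 while the median stays at 3.5, which is why both numbers are worth reporting.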

The ranges are calibrated too, not just the point estimate: on the same blind test, the 80% range contained the real score 88% of the time, the 90% range 93%, and the 95% range 96%.
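
Checking a range this way is simple to express in code. The sketch below is illustrative only: `coverage` is a made-up helper name and the intervals and scores are invented.

```python
def coverage(intervals, actual_scores):
    """Fraction of real scores that fell inside their predicted (low, high) range."""
    hits = sum(low <= score <= high
               for (low, high), score in zip(intervals, actual_scores))
    return hits / len(actual_scores)

# Invented example: four predicted ranges and the scores that followed.
ranges = [(240, 260), (230, 250), (245, 265), (235, 255)]
scores = [252, 251, 259, 250]
observed = coverage(ranges, scores)
```

A range is well calibrated when the observed fraction is close to its nominal level. An observed fraction above the nominal level, as with the 80% range covering 88%, means the range errs on the wide side.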

For context, the exam itself is noisy: USMLE reports a standard error of about 7 points, meaning the same student retaking Step 2 could easily score several points differently. That noise puts a floor of roughly 5–6 points on the average error any predictor could ever achieve. This tool performs at that limit.
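
One way to arrive at a floor in that range, under two assumptions that are mine rather than the tool's: the exam's measurement error is roughly normal with a standard deviation equal to the reported standard error, and the predictor knows your true ability perfectly. For a normal error with standard deviation σ, the expected absolute miss is σ·√(2/π).

```python
import math

def expected_abs_error(standard_error):
    """Mean absolute deviation of a zero-mean normal with the given SD."""
    return standard_error * math.sqrt(2 / math.pi)

# About 5.6 points when the standard error is 7.
noise_floor = expected_abs_error(7.0)
```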

Accuracy improves with more practice scores, and with scores taken closer to test day. It's a guide, not a guarantee.

How does it compare to other predictors?

Every row is measured the same way: each tool's predictions checked against real, self-reported outcomes from r/Step2 score releases.

Predictor | Median error | Average error | Within 5 | Within 10 | Tendency
This tool | ~4 pts | ~5.1 pts | 58% | 89% | unbiased (~0.6)
AMBOSS prediction‡ | ~5 pts | ~6.0 pts | 53% | 83% | underpredicts by ~4
r/Step2 community calculator† | ~5 pts | ~5.9 pts | 56% | 85% | underpredicts by ~3
PMSS and others* | ~6 pts | ~6.9 pts | 47% | 77% | underpredicts by ~5
usmlepredictor.com | ~9 pts | ~10 pts | 26% | 56% | underpredicts by ~9

* Most people post their predicted score without naming the tool that produced it. Many of those predictions likely come from PMSS, but the source can't be confirmed, so all unattributed predictions are pooled in this row.

† The community calculator was last updated in 2022 and accepts nothing newer than NBME 12. The row reflects predictions students posted; in 100 fresh runs of the live calculator against 2025–2026 exams, its typical error grows to ~7.5 points.

‡ In its own published validation, AMBOSS reports a typical error of 7.6 points for its score predictor. The row above reflects AMBOSS predictions that students actually posted alongside their results.

How each row was measured

  • This tool: a chronological blind test. The model was trained only on reports posted before 2025, then scored on 451 exams taken in 2025 and 2026 that it had never seen.
  • AMBOSS prediction: 245 reports whose prediction line explicitly names AMBOSS, scored against the actual result the same person posted.
  • r/Step2 community calculator: 336 posted predictions matching the calculator's output format, scored against the posters' actual results.
  • PMSS and others: 1,003 posted predictions with no named source.
  • usmlepredictor.com: its published in-browser algorithm, run on 341 outcomes from the same 2025–2026 pool used to test this tool.

Median error: half of predictions were closer than this. Average error: the mean absolute miss, which outliers pull upward. Last benchmarked June 2026.

Which practice tests are most predictive?

Ranked by typical prediction error: how far off an estimate from that test alone tends to be, using a cross-validated model that already accounts for how many days before the exam it was taken (recent reports weighted more). Lower = more predictive:

  • NBME 14 · UWSA 2: ± ~5.9–6.0
  • NBME 13, 15 · UWSA 3: ± ~6.0–6.1
  • NBME 10, 11, 12: ± ~6.3–6.5
  • New Free 120 · UWSA 1 · Old Free 120: ± ~6.6–6.7
  • NBME 9 · AMBOSS self-assessment: ± ~6.9–7.2

The gaps are small, about a point across the board, so no single test dominates. Combining several beats any one, which is why the tool asks for at least two. NBME 16 is too new to rank confidently (only ~90 reports so far); early data puts it in line with the other recent NBMEs rather than ahead.

Where does the data come from?

From public score-release posts on r/Step2: people who shared their practice scores and then their real result. Each post is parsed, cleaned, and kept only if the numbers are unambiguous; the model is trained on thousands of these reports from 2022 through 2026.

Every data point is a real score report: someone's actual practice scores, paired with the score they went on to get. The "similar test-takers" panel draws from the same pool, with each report reduced to numbers only: no usernames, no dates beyond the year, no text.

One honest caveat: people who post on r/Step2 aren't a random sample of all test-takers. That's why the percentile shown next to your prediction comes from the official USMLE score distribution (US MD first-takers, July 2022–June 2025, per the USMLE Score Interpretation Guidelines), not from this dataset.

Reading the results

The curve shows the calibrated probability distribution of your real score, where taller means more likely. The shaded tail matches the odds readout exactly; hover or drag across the curve to read the chance of scoring at or above any number.

The range selector (67 / 80 / 90 / 95) sets how confident the bracket should be. A 67% range means: among people with inputs like yours, about two in three landed inside it. Wider = more certain to contain your score, narrower = more precise but riskier.
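
To make the curve and the range selector concrete, here is a sketch that assumes the predictive distribution is a normal curve with a given mean and SD. That shape is an assumption for illustration (the tool's calibrated curve need not be normal), and the function names and numbers are hypothetical.

```python
from statistics import NormalDist

def chance_at_or_above(mean, sd, threshold):
    """Probability mass in the upper tail, starting at `threshold`."""
    return 1.0 - NormalDist(mean, sd).cdf(threshold)

def central_range(mean, sd, level):
    """Symmetric bracket expected to contain the real score `level` of the time."""
    dist = NormalDist(mean, sd)
    tail = (1.0 - level) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)

# Invented prediction: centered on 250 with an SD of 6 points.
odds_of_255_plus = chance_at_or_above(250, 6, 255)
range_67 = central_range(250, 6, 0.67)
range_95 = central_range(250, 6, 0.95)
```

The 95% bracket comes out wider than the 67% one on both sides, which is the trade-off the selector exposes.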

The "What's driving your prediction" panel re-runs the model with each of your inputs removed in turn and shows how much the estimate moves when that input is left out.
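
The idea of dropping one input at a time can be sketched as follows. The `predict` argument stands in for the real model; here it is replaced by a plain average purely so the example runs, and the scores are invented.

```python
def leave_one_out_effects(predict, inputs):
    """For each input, how far the estimate moves when that input is removed."""
    full_estimate = predict(inputs)
    effects = {}
    for name in inputs:
        reduced = {k: v for k, v in inputs.items() if k != name}
        effects[name] = full_estimate - predict(reduced)
    return effects

def average_predictor(inputs):
    """Stand-in model (an assumption): the mean of whatever scores are present."""
    return sum(inputs.values()) / len(inputs)

# Invented scores, already on a common scale.
effects = leave_one_out_effects(
    average_predictor, {"NBME 14": 250, "UWSA 2": 260, "NBME 13": 240}
)
```

A positive effect means that input is pulling the estimate up, a negative one that it is pulling it down, and a value near zero that the estimate barely depends on it.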

The "Your trajectory" panel converts each test you added timing for into the Step 2 score it pointed to when you took it, with differences between forms factored out, then fits a trend. Raw score jumps often shrink here, because part of most climbs comes from the forms differing rather than from the person changing.
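
As an illustration of the trend-fitting step, the sketch below fits an ordinary least-squares line through form-adjusted scores. Using a straight line is an assumption (the kind of trend the tool fits is not specified here), and the function name and numbers are hypothetical.

```python
def fit_trend(days_before_exam, step2_equivalents):
    """Least-squares line through (time, score); returns (points_per_day, exam_day_value)."""
    # Flip the sign so time runs forward and exam day sits at x = 0.
    xs = [-d for d in days_before_exam]
    ys = step2_equivalents
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line's value on exam day
    return slope, intercept

# Invented example: three tests at 60, 30, and 0 days out, already form-adjusted.
slope, exam_day_value = fit_trend([60, 30, 0], [240, 245, 250])
```

The function needs at least two tests taken on different days; with a single date the slope is undefined.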

The "Similar real test-takers" panel finds the reports closest to your inputs (after putting every test on the same scale) and shows what those people actually scored.
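
A nearest-neighbor lookup of this kind could look like the sketch below. The distance measure (root-mean-square difference over the tests both people took), the record layout, and all values are assumptions for illustration; the panel's actual matching rule may differ.

```python
import math

def nearest_reports(user_scores, reports, k=3):
    """Return the k reports whose shared, same-scale test scores are closest to the user's."""
    def distance(report):
        shared = [t for t in user_scores if t in report["tests"]]
        if not shared:
            return math.inf  # nothing to compare on
        squared = sum((user_scores[t] - report["tests"][t]) ** 2 for t in shared)
        return math.sqrt(squared / len(shared))
    return sorted(reports, key=distance)[:k]

# Invented reports; "A" and "B" are placeholder test names.
reports = [
    {"tests": {"A": 250, "B": 255}, "actual": 256},
    {"tests": {"A": 230, "B": 235}, "actual": 238},
    {"tests": {"A": 248, "B": 252}, "actual": 251},
]
closest = nearest_reports({"A": 249, "B": 253}, reports, k=2)
outcomes = [r["actual"] for r in closest]
```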

Specialty data

Specialty means are for matched applicants: NRMP "Charting Outcomes in the Match" 2024 for most specialties, with Ophthalmology from SF Match data, Urology from the AUA match, and Thoracic Surgery from a program-director survey. Figures are read from published charts and are approximate; the spread uses an SD of ~13 points, in line with the overall matched distribution. Step 2 is one of many factors in matching, so treat the table as context rather than a verdict.

Privacy

The model runs entirely in your browser; your scores never leave your device. Saved inputs live in your browser's local storage. Share links encode your inputs in the link itself, so share them only with people you would show your scores to. The site collects anonymous page-view counts and nothing else.
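
To see why a share link is as sensitive as the scores themselves, consider one possible encoding, sketched below. The `s` query parameter, the JSON-plus-base64 scheme, and the example.com address are all hypothetical; the site's real link format is not documented here.

```python
import base64
import json
from urllib.parse import parse_qs, urlencode, urlparse

def make_share_link(base_url, inputs):
    """Pack the inputs into the link itself; nothing is stored on a server."""
    payload = json.dumps(inputs, sort_keys=True).encode("utf-8")
    token = base64.urlsafe_b64encode(payload).decode("ascii")
    return f"{base_url}?{urlencode({'s': token})}"

def read_share_link(link):
    """Anyone holding the link can recover the inputs exactly."""
    token = parse_qs(urlparse(link).query)["s"][0]
    return json.loads(base64.urlsafe_b64decode(token))

# Invented inputs and a placeholder address.
link = make_share_link("https://example.com/predict", {"NBME 14": 250, "UWSA 2": 258})
recovered = read_share_link(link)
```

Encoding is not encryption: the round trip recovers every score, so the link should be treated like the scores it contains.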

Model and data last updated June 2026.
Not affiliated with the USMLE®, NBME®, NRMP®, UWorld, or AMBOSS. Predictions are statistical estimates from self-reported data. Use them as one signal among many.