Calibration drift
/recruiting/assessments/calibration-drift
This audit-only dashboard surfaces score discrepancies between the primary LLM scorer and a human calibration reviewer. The goal: catch model drift early, as soon as the LLM starts to disagree with human reviewers more than it should.
Filters
- Window — 7 / 30 / 90 / 180 days
- Flagged only — shows only events whose delta exceeded the disagreement threshold
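The two filters can be sketched as a single predicate over events. This is an illustrative sketch only: the Event shape, the apply_filters name, and the inclusive window boundary are assumptions, not the dashboard's actual query.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Window options listed in the Filters section.
ALLOWED_WINDOWS_DAYS = (7, 30, 90, 180)


@dataclass(frozen=True)
class Event:
    # Hypothetical record shape; the real event has more fields.
    sampled_at: datetime
    flagged: bool


def apply_filters(events, window_days, flagged_only, now):
    """Keep events sampled within the window, optionally only flagged ones."""
    if window_days not in ALLOWED_WINDOWS_DAYS:
        raise ValueError(f"unsupported window: {window_days}")
    cutoff = now - timedelta(days=window_days)  # assumed inclusive boundary
    return [
        e for e in events
        if e.sampled_at >= cutoff and (e.flagged or not flagged_only)
    ]
```

Passing now explicitly (for example datetime.now(timezone.utc)) keeps the function deterministic and easy to test.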
Summary card
- Total samples in window
- Flagged count
- Average absolute delta — magnitude of disagreement irrespective of direction
- Average signed delta — used to detect bias (LLM systematically lower or higher than reviewers)
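To make the difference between the two averages concrete, here is a minimal sketch of how the four summary values could be computed. It assumes delta = primary score minus calibration score and a strict "greater than" flag rule; the sign convention, the summarize name, and the threshold value are assumptions, not confirmed details of the dashboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    total: int
    flagged: int
    avg_abs_delta: float
    avg_signed_delta: float


def summarize(pairs, threshold):
    """pairs: iterable of (primary_score, calibration_score) tuples."""
    # Assumed sign convention: positive means the LLM scored higher.
    deltas = [primary - calibration for primary, calibration in pairs]
    if not deltas:
        return Summary(0, 0, 0.0, 0.0)
    n = len(deltas)
    return Summary(
        total=n,
        flagged=sum(1 for d in deltas if abs(d) > threshold),
        avg_abs_delta=sum(abs(d) for d in deltas) / n,
        avg_signed_delta=sum(deltas) / n,
    )


# Deltas here are +1, -1, 0, -3 with a hypothetical threshold of 2.
print(summarize([(4, 3), (2, 3), (5, 5), (1, 4)], threshold=2))
# Summary(total=4, flagged=1, avg_abs_delta=1.25, avg_signed_delta=-0.75)
```

The +1 and -1 deltas cancel in the signed average but not in the absolute average, which is why the card reports both.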
Event table
Per row: question stem, template, primary (LLM) score, calibration (human) score, delta (red if it exceeds the disagreement threshold), flagged status, sampling timestamp.
How sampling works
A worker job (handleAssessmentCalibrationDrift) periodically samples
already-scored responses and re-routes them to a human reviewer for blind
re-scoring. The two scores are compared; the result lands in this
dashboard.
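The flow above can be sketched roughly as follows. This is not the code of handleAssessmentCalibrationDrift, whose internals are not described here; the record shapes, the sample_rate parameter, the review_blind callback, and the sign of the delta are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ScoredResponse:
    response_id: str
    answer_text: str
    primary_score: float  # LLM score, never shown to the reviewer


@dataclass(frozen=True)
class DriftEvent:
    response_id: str
    primary_score: float
    calibration_score: float
    delta: float
    flagged: bool


def run_calibration_sample(
    scored: list[ScoredResponse],
    sample_rate: float,
    threshold: float,
    review_blind: Callable[[str], float],
    rng: random.Random,
) -> list[DriftEvent]:
    """Sample scored responses, collect a blind human score, compare."""
    events = []
    for r in scored:
        if rng.random() >= sample_rate:
            continue  # not sampled this run
        # Blind re-scoring: the reviewer receives only the answer text.
        calibration = review_blind(r.answer_text)
        delta = r.primary_score - calibration  # assumed sign convention
        events.append(
            DriftEvent(r.response_id, r.primary_score, calibration,
                       delta, abs(delta) > threshold)
        )
    return events
```

Passing only answer_text to the reviewer callback is what keeps the re-scoring blind in this sketch: the primary score cannot influence the human score.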
What to do with the data
- Average signed delta drifting positive or negative → the LLM is biased upward or downward. Time to recalibrate the prompt or model.
- Average absolute delta climbing → disagreement is growing in magnitude, even if the signed delta stays near zero because over- and under-scoring cancel out. May warrant updating the rubric or the few-shot examples.
- Spike of flags on a specific question → that question's rubric is ambiguous; review and tighten.
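As a rough aid for reading these signals outside the dashboard, the checks above could be sketched like this. The function names, the tolerance, and the min_flags cutoff are hypothetical, the bias labels assume delta = primary minus calibration, and a plain count threshold is a simplification of "spike" (it ignores how the count changes over time).

```python
from collections import Counter


def bias_direction(avg_signed_delta, tolerance):
    """Classify the signed average; tolerance is a hypothetical dead band."""
    if avg_signed_delta > tolerance:
        return "llm_higher"
    if avg_signed_delta < -tolerance:
        return "llm_lower"
    return "balanced"


def questions_with_many_flags(flagged_question_ids, min_flags):
    """Return (question_id, flag_count) pairs at or above min_flags,
    most-flagged first."""
    counts = Counter(flagged_question_ids)
    return [(q, n) for q, n in counts.most_common() if n >= min_flags]


print(bias_direction(-0.75, tolerance=0.25))
# llm_lower
print(questions_with_many_flags(["q1", "q2", "q1", "q3", "q1", "q2"], 2))
# [('q1', 3), ('q2', 2)]
```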
The dashboard does not modify scores or sampling decisions — it's a read-only window into model behaviour over time.