Calibration drift

/recruiting/assessments/calibration-drift

This audit-only dashboard surfaces score discrepancies between the primary LLM scorer and a human calibration reviewer. The goal is to catch model drift early, as soon as the LLM starts to disagree with human reviewers more than it should.

Filters

Window — 7 / 30 / 90 / 180 days
Flagged only — limits the view to events whose delta exceeded the disagreement threshold

Summary card

Total samples in window
Flagged count
Average absolute delta — magnitude of disagreement irrespective of direction
Average signed delta — used to detect bias (LLM systematically lower or higher than reviewers)
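
As a rough illustration of how the four summary figures relate to each other, here is a minimal sketch. The names CalibrationEvent and summarize, the threshold parameter, and the sign convention (delta = primary score minus calibration score, so a positive value means the LLM scored higher) are assumptions made for this example; they are not taken from the dashboard's actual implementation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class CalibrationEvent:
    """One sampled response with both scores (hypothetical shape)."""
    primary_score: float      # score from the LLM scorer
    calibration_score: float  # score from the human reviewer

    @property
    def delta(self) -> float:
        # Assumed convention: positive means the LLM scored higher.
        return self.primary_score - self.calibration_score


def summarize(events, threshold):
    """Compute the summary-card figures for a window of events."""
    deltas = [e.delta for e in events]
    if not deltas:
        return {"total": 0, "flagged": 0,
                "avg_abs_delta": 0.0, "avg_signed_delta": 0.0}
    return {
        "total": len(deltas),
        # An event counts as flagged when its magnitude exceeds the threshold.
        "flagged": sum(1 for d in deltas if abs(d) > threshold),
        # Magnitude of disagreement, irrespective of direction.
        "avg_abs_delta": mean(abs(d) for d in deltas),
        # Direction of disagreement; opposite errors cancel out here.
        "avg_signed_delta": mean(deltas),
    }
```

Note that the two averages can diverge: deltas of +1 and -1 give an average signed delta of 0 but an average absolute delta of 1, which is why both appear on the card.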

Event table

Per row: question stem, template, primary (LLM) score, calibration (human) score, delta (red if it exceeds the disagreement threshold), flagged status, sampling timestamp.

How sampling works

A worker job (handleAssessmentCalibrationDrift) periodically samples responses that the LLM has already scored and routes them to a human reviewer for blind re-scoring. The two scores are compared, and the result appears on this dashboard as an event.
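
The sampling step can be pictured with the sketch below. It is not the code of handleAssessmentCalibrationDrift: the types ScoredResponse and BlindReviewTask, the function sample_for_review, and the sample_rate parameter are all hypothetical, and the real job may select responses differently. The one point it tries to show is what "blind" means here: the review task carries the response but not the LLM's score.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredResponse:
    """A response the LLM has already scored (hypothetical shape)."""
    response_id: str
    answer_text: str
    primary_score: float


@dataclass(frozen=True)
class BlindReviewTask:
    """What the human reviewer sees; the LLM score is deliberately absent."""
    response_id: str
    answer_text: str


def sample_for_review(responses, sample_rate, rng):
    """Pick a random fraction of scored responses for blind re-scoring."""
    k = min(len(responses), round(len(responses) * sample_rate))
    picked = rng.sample(responses, k)
    # Copy only the fields the reviewer needs, dropping primary_score.
    return [BlindReviewTask(r.response_id, r.answer_text) for r in picked]
```

For example, sample_for_review(responses, 0.2, random.Random()) on ten responses yields two review tasks. Passing the random generator in explicitly keeps the selection reproducible in tests.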

What to do with the data

Average signed delta drifting positive or negative → the LLM is biased upward or downward. Time to recalibrate the prompt or model.
Average absolute delta climbing → noisy disagreement, even if balanced. May warrant updating the rubric or the few-shot examples.
Spike of flags on a specific question → that question's rubric is ambiguous; review and tighten.
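
The three rules above could be expressed as a simple triage helper, sketched below. The function name triage and the limits bias_limit, noise_limit, and spike_limit are invented for illustration, and their default values are arbitrary. The sketch also compares a single window against fixed limits, whereas the guidance above is about trends over time, so treat it as a starting point rather than a decision rule. It assumes a positive signed delta means the LLM scored higher than the reviewer.

```python
from collections import Counter


def triage(avg_signed_delta, avg_abs_delta, flagged_question_ids, *,
           bias_limit=0.5, noise_limit=1.0, spike_limit=3):
    """Map one window's summary figures to suggested follow-ups."""
    findings = []
    # Rule 1: a signed average far from zero suggests systematic bias.
    if abs(avg_signed_delta) > bias_limit:
        direction = "upward" if avg_signed_delta > 0 else "downward"
        findings.append(f"bias {direction}: recalibrate the prompt or model")
    # Rule 2: a large absolute average means noisy disagreement.
    if avg_abs_delta > noise_limit:
        findings.append(
            "noisy disagreement: revisit the rubric or few-shot examples")
    # Rule 3: many flags on one question point at that question's rubric.
    for question_id, count in Counter(flagged_question_ids).most_common():
        if count >= spike_limit:
            findings.append(
                f"flag spike on {question_id}: review that question's rubric")
    return findings
```

Because the dashboard itself is read-only, a helper like this would live outside it, for example in a notebook or a periodic report built from the same figures.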

The dashboard does not modify scores or sampling decisions — it's a read-only window into model behaviour over time.
