Calibration drift

/recruiting/assessments/calibration-drift

This audit-only dashboard surfaces score discrepancies between the primary LLM scorer and a human calibration reviewer. The goal is to catch model drift early, when the LLM starts to disagree with human reviewers more than it should.

Filters

  • Window — 7 / 30 / 90 / 180 days
  • Flagged only — shows only events whose delta exceeded the disagreement threshold

Summary card

  • Total samples in window
  • Flagged count
  • Average absolute delta — magnitude of disagreement irrespective of direction
  • Average signed delta — used to detect bias (LLM systematically lower or higher than reviewers)
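
The four summary values can be sketched as follows. This is a minimal illustration, not the dashboard's actual implementation: the `CalibrationEvent` record, its field names, and the sign convention (delta = primary score minus calibration score) are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class CalibrationEvent:
    # Hypothetical record shape; field names are illustrative only.
    primary_score: float      # score from the primary LLM scorer
    calibration_score: float  # score from the human calibration reviewer
    flagged: bool             # True if the disagreement threshold was exceeded

    @property
    def delta(self) -> float:
        # Assumed sign convention: positive means the LLM scored higher.
        return self.primary_score - self.calibration_score


def summarize(events):
    """Return the four summary-card values for the events in one window."""
    if not events:
        # No samples: averages are undefined, so report None instead of 0.
        return {"total": 0, "flagged": 0,
                "avg_abs_delta": None, "avg_signed_delta": None}
    deltas = [e.delta for e in events]
    return {
        "total": len(events),
        "flagged": sum(1 for e in events if e.flagged),
        "avg_abs_delta": fmean([abs(d) for d in deltas]),
        "avg_signed_delta": fmean(deltas),
    }
```

Under these assumptions, two events scored (4, 3) and (3, 4) give an average absolute delta of 1.0 but an average signed delta of 0.0: the scorers disagree, yet neither is systematically higher. That is why the card reports both values.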

Event table

Per row: question stem, template, primary (LLM) score, calibration (human) score, delta (red if it exceeds the disagreement threshold), flagged status, sampling timestamp.

How sampling works

A worker job (handleAssessmentCalibrationDrift) periodically samples already-scored responses and re-routes them to a human reviewer for blind re-scoring. The two scores are compared; the result lands in this dashboard.
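
The sample, blind re-score, and compare steps might look roughly like the sketch below. It does not reproduce handleAssessmentCalibrationDrift: the function names, the dictionary keys, uniform random sampling, and the strict "greater than" threshold comparison are assumptions chosen for illustration.

```python
import random


def pick_samples(scored_responses, sample_size, seed=None):
    """Pick up to sample_size already-scored responses uniformly at random."""
    rng = random.Random(seed)
    k = min(sample_size, len(scored_responses))
    return rng.sample(scored_responses, k)


def build_review_task(response):
    """Build the reviewer's task. The primary score is left out on purpose
    so that the human re-scores blind."""
    return {
        "response_id": response["id"],
        "question_id": response["question_id"],
        "answer": response["answer"],
    }


def compare_scores(primary_score, calibration_score, threshold):
    """Compare the two scores; flag when the gap exceeds the threshold."""
    # Assumed sign convention: positive delta means the LLM scored higher.
    delta = primary_score - calibration_score
    return {"delta": delta, "flagged": abs(delta) > threshold}
```

The key design point is in `build_review_task`: if the reviewer could see the LLM's score, the human score would be anchored to it and the measured disagreement would be understated.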

What to do with the data

  • Average signed delta drifting positive or negative → the LLM is biased upward or downward. Time to recalibrate the prompt or model.
  • Average absolute delta climbing → noisy disagreement, even if balanced. May warrant updating the rubric or the few-shot examples.
  • Spike of flags on a specific question → that question's rubric is likely ambiguous; review and tighten it.
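
The checks above can be approximated offline once event rows are available as plain records. The sketch below is hypothetical throughout: the dashboard itself is read-only and offers no such functions, and the key names, the `tolerance` parameter, and the returned labels are invented for the example.

```python
from collections import Counter


def bias_direction(avg_signed_delta, tolerance):
    """Classify the average signed delta, treating small values as noise.

    Assumes delta = primary (LLM) score minus calibration (human) score.
    """
    if avg_signed_delta > tolerance:
        return "llm_higher"
    if avg_signed_delta < -tolerance:
        return "llm_lower"
    return "balanced"


def flags_by_question(events):
    """Count flagged events per question, most-flagged first."""
    counts = Counter(e["question_id"] for e in events if e["flagged"])
    return counts.most_common()
```

A question that sits well above the rest in the `flags_by_question` output is the natural first candidate for a rubric review. What counts as "well above" depends on sample volume and is left to the reader.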

The dashboard does not modify scores or sampling decisions — it's a read-only window into model behaviour over time.