Inter-rater reliability (IRR) measures the consistency of scores assigned by different raters evaluating the same writing. In program assessment, it answers the question: if two trained faculty scorers read the same essay against the same rubric, how often do they agree?
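To make this concrete, here is a minimal sketch (hypothetical scores, plain Python) of the two most common IRR statistics for a double-scored sample: the exact agreement rate and Cohen's kappa, which corrects agreement for what two raters would reach by chance alone.

```python
from collections import Counter

def exact_agreement(rater_a, rater_b):
    """Fraction of essays where both raters assigned the same score."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement,
    estimated from each rater's marginal score distribution."""
    n = len(rater_a)
    observed = exact_agreement(rater_a, rater_b)
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same score at random.
    expected = sum(freq_a[s] * freq_b[s] for s in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical scores from two faculty raters on a 1-4 rubric.
rater_a = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
rater_b = [3, 2, 3, 3, 1, 2, 4, 4, 2, 3]
print(f"Exact agreement: {exact_agreement(rater_a, rater_b):.2f}")
print(f"Cohen's kappa:   {cohens_kappa(rater_a, rater_b):.2f}")
```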
Even experienced faculty scorers face systematic challenges: rubric interpretation drift, fatigue effects, halo effects from vocabulary or mechanics, order effects from adjacent essays, and norming decay over time. Research has consistently documented these patterns in large-scale essay scoring contexts.
Traditional strategies for improving IRR include double scoring with reconciliation, anchor essays and norming packets, periodic check-in scoring, and highly specific rubric criteria. Each approach improves reliability, but at a significant cost in faculty time and coordination.
AI scoring has perfect internal consistency — the same essay scores identically every time. When calibrated with representative samples, AI systems achieve weighted kappa values of 0.70–0.85 with expert human scorers, comparable to human-to-human IRR. The key variable is calibration quality, which plays the same role as norming sessions for human scorers.
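The weighted kappa statistic cited above can be computed directly from paired scores. Below is a minimal sketch, assuming ordinal rubric scores and quadratic weights, with hypothetical data; scikit-learn's `cohen_kappa_score(..., weights="quadratic")` should produce the same value on labeled data.

```python
import numpy as np

def quadratic_weighted_kappa(scores_ai, scores_human, min_score, max_score):
    """Quadratic weighted kappa between two score vectors on the same ordinal
    scale; near-miss disagreements are penalized less than large ones."""
    k = max_score - min_score + 1
    # Observed joint distribution of (AI score, human score).
    observed = np.zeros((k, k))
    for a, h in zip(scores_ai, scores_human):
        observed[a - min_score, h - min_score] += 1
    observed /= observed.sum()
    # Expected joint distribution under independence (outer product of marginals).
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Quadratic disagreement weights: 0 on the diagonal, 1 at maximum distance.
    i, j = np.indices((k, k))
    weights = ((i - j) ** 2) / ((k - 1) ** 2)
    return 1 - (weights * observed).sum() / (weights * expected).sum()

# Hypothetical AI and faculty scores on a 1-6 scale.
ai = [4, 3, 5, 2, 4, 6, 3, 4, 5, 2, 3, 4]
faculty = [4, 3, 4, 2, 5, 6, 3, 4, 5, 3, 3, 4]
print(f"Weighted kappa: {quadratic_weighted_kappa(ai, faculty, 1, 6):.2f}")
```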
Use AI scoring for the initial pass, have faculty review low-confidence cases plus a random validation sample, and calculate kappa between AI and faculty scores. Track override rates as a calibration health indicator: 5–10% suggests good calibration; 30% or more means recalibration is needed. Document the process for accreditation.
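One way to operationalize the override-rate check is a small calibration-health report run after each faculty review cycle. The sketch below uses hypothetical record shapes and the thresholds from the ranges above; adapt the field names and cutoffs to your own workflow.

```python
def calibration_health(records, good_max=0.10, recalibrate_min=0.30):
    """Summarize a review cycle: records are (ai_score, faculty_score) pairs
    for every essay a faculty member reviewed."""
    overrides = sum(1 for ai, faculty in records if ai != faculty)
    rate = overrides / len(records)
    if rate <= good_max:
        status = "good calibration"
    elif rate >= recalibrate_min:
        status = "recalibration needed"
    else:
        status = "monitor; consider expanding the validation sample"
    return {"reviewed": len(records), "override_rate": round(rate, 3), "status": status}

# Hypothetical review cycle: 20 essays double-checked by faculty.
reviewed = [(4, 4), (3, 3), (5, 4), (2, 2), (4, 4), (6, 6), (3, 3), (4, 4),
            (5, 5), (2, 3), (3, 3), (4, 4), (4, 4), (3, 3), (5, 5), (2, 2),
            (4, 4), (3, 3), (6, 6), (4, 4)]
print(calibration_health(reviewed))
```

Logging this summary each cycle, alongside the AI-to-faculty kappa, gives an audit trail that maps directly onto the documentation accreditors expect.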
See our guides on AI essay scoring for program assessment and streamlining writing outcome assessment for accreditation.