Inter-Rater Reliability: Evaluating Alternatives to Cohen’s Kappa
Sciences and Mathematics, College of
Psychological Science, Department of
SURS Faculty Advisor
Adam Smiley and Will Best
Objective: Determining how similarly multiple raters evaluate behavior is an important component of observational research. Multiple interrater agreement statistics have been proposed, including Cohen’s Kappa, Krippendorff’s Alpha, Fleiss’ generalized Pi, and Gwet’s AC1. Many of these statistics attempt to address the paradox of kappa, in which Cohen’s Kappa can be low despite high observed agreement when category base rates are skewed. However, the sizes (type I error rates) of these competing statistics have not been investigated in depth, specifically in the case where the paradox of kappa is present. Method: We performed Monte Carlo simulations to evaluate the performance (e.g., type I error rates and p values) of the four statistics when two raters make binary (e.g., yes/no) evaluations. The simulations investigated the size of these statistics under conditions that are common in clinical research, with varying base rates of categorization and sample sizes. Results: When the simulated likelihood of identifying an effect was 0.5 for each rater and the sample size was large, each statistic performed similarly. However, when the sample size was small, the observed interrater reliability was low, or both, all statistics had large sizes. In particular, Gwet’s AC1 had high type I error rates even with moderate sample sizes and moderate observed interrater agreement. Conclusions: The results suggest that Gwet’s AC1 statistic is a viable alternative only when the rates of interrater agreement are close to 0.5. Cohen’s Kappa performed the best across all sample sizes and rates of observed interrater agreement, suggesting that even in the presence of the paradox of kappa, a viable alternative may not yet exist.
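As an illustration of the setting described above, the following is a minimal Python sketch, not the authors' simulation code. It computes Cohen’s Kappa and Gwet’s AC1 for two raters making binary ratings, using the standard two-category formulas, and averages both statistics over simulated data. The function names, the 2x2 cell labels, and the choice to simulate two statistically independent raters sharing one base rate are assumptions made for illustration only; the abstract does not specify the simulation design or the test used to obtain type I error rates.

```python
import random


def agreement_stats(a, b, c, d):
    """Return (kappa, ac1) for a 2x2 table of two raters' binary ratings.

    Hypothetical cell labels: a = both "yes", b = rater 1 "yes" / rater 2 "no",
    c = rater 1 "no" / rater 2 "yes", d = both "no".
    """
    n = a + b + c + d
    po = (a + d) / n                      # observed agreement
    p1 = (a + b) / n                      # rater 1 proportion of "yes"
    p2 = (a + c) / n                      # rater 2 proportion of "yes"

    # Cohen's Kappa: chance agreement from the product of the marginals.
    pe_kappa = p1 * p2 + (1 - p1) * (1 - p2)
    # Kappa is undefined when both raters use a single category throughout.
    kappa = (po - pe_kappa) / (1 - pe_kappa) if pe_kappa < 1 else float("nan")

    # Gwet's AC1 (two categories): chance agreement from the mean marginal.
    pi = (p1 + p2) / 2
    pe_ac1 = 2 * pi * (1 - pi)            # at most 0.5, so no division by zero
    ac1 = (po - pe_ac1) / (1 - pe_ac1)
    return kappa, ac1


def simulate_independent(n, base_rate, reps, seed=0):
    """Average kappa and AC1 over `reps` samples of `n` subjects.

    Assumption: the two raters are independent, each saying "yes" with
    probability `base_rate`, so any agreement is due to chance alone.
    """
    rng = random.Random(seed)
    kappas, ac1s = [], []
    for _ in range(reps):
        a = b = c = d = 0
        for _ in range(n):
            r1 = rng.random() < base_rate
            r2 = rng.random() < base_rate
            if r1 and r2:
                a += 1
            elif r1:
                b += 1
            elif r2:
                c += 1
            else:
                d += 1
        kappa, ac1 = agreement_stats(a, b, c, d)
        if kappa == kappa:                # skip undefined (NaN) kappa values
            kappas.append(kappa)
        ac1s.append(ac1)
    return sum(kappas) / len(kappas), sum(ac1s) / len(ac1s)


if __name__ == "__main__":
    # Skewed table: 90% observed agreement, but almost every rating is "yes".
    kappa, ac1 = agreement_stats(90, 5, 5, 0)
    print(f"kappa = {kappa:.3f}, AC1 = {ac1:.3f}")

    mean_kappa, mean_ac1 = simulate_independent(n=100, base_rate=0.9, reps=500)
    print(f"independent raters: mean kappa = {mean_kappa:.3f}, mean AC1 = {mean_ac1:.3f}")
```

For the skewed table in the usage example, observed agreement is 0.90, yet Cohen’s Kappa is about -0.053 while Gwet’s AC1 is about 0.890, which is the paradox of kappa in miniature. Under the independence assumption with a base rate of 0.9, the average Kappa stays near zero while the average AC1 remains well above zero, which gives some intuition for why the two statistics can behave differently when no true agreement beyond chance exists.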
Jones, Wyatt; Smiley, Adam; Best, Will; and Shoda, Yuichi, "Inter-Rater Reliability: Evaluating Alternatives to Cohen’s Kappa" (2023). Science University Research Symposium (SURS). 97.