Measuring labeling consistency
So far, we have discussed a range of tools and techniques for creating consistent and high-quality annotations. While these elements create the foundation for good datasets, we also want to be able to measure whether our annotators are performing consistently.
To gauge annotator consistency, we recommend two measures called intra- and interobserver variability. These are standard terms in clinical research and refer to the degree of agreement among different measurements or evaluations made by the same observer (intra-) or by different observers (inter-). To simplify the explanation, consider “observer” to be interchangeable with “labeler,” “annotator,” “rater,” “data collector,” and any other similar term we have used throughout this chapter.
While both intra- and interobserver variability relate to measurement consistency, they address different questions: intraobserver variability asks whether a single annotator agrees with their own earlier labels when relabeling the same data, while interobserver variability asks whether different annotators agree with one another on the same data.
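As a minimal sketch of what this looks like in practice, the snippet below computes both forms of agreement on a small set of hypothetical labels, using Cohen's kappa from scikit-learn as the agreement statistic. The annotator names, label lists, and the choice of kappa are illustrative assumptions, not prescriptions; other agreement measures could be substituted.

# A sketch of measuring intra- and interobserver agreement with Cohen's kappa.
# The labels below are hypothetical examples for eight items.
from sklearn.metrics import cohen_kappa_score

# Annotator A labels the same items twice (e.g., in sessions several days apart);
# annotator B labels them once.
annotator_a_round1 = ["cat", "dog", "dog", "cat", "bird", "dog", "cat", "bird"]
annotator_a_round2 = ["cat", "dog", "cat", "cat", "bird", "dog", "cat", "bird"]
annotator_b        = ["cat", "dog", "dog", "cat", "dog", "dog", "cat", "bird"]

# Intraobserver agreement: annotator A compared against their own earlier labels.
intra_kappa = cohen_kappa_score(annotator_a_round1, annotator_a_round2)

# Interobserver agreement: annotator A compared against annotator B.
inter_kappa = cohen_kappa_score(annotator_a_round1, annotator_b)

print(f"Intraobserver (A vs. A) kappa: {intra_kappa:.2f}")
print(f"Interobserver (A vs. B) kappa: {inter_kappa:.2f}")

In both cases, a value near 1 indicates strong agreement and a value near 0 indicates agreement no better than chance, so the two scores give a quick read on whether inconsistency stems from individual annotators drifting over time or from disagreement between annotators.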