Abstract:
Interval scales are assumed by several basic descriptive statistics, such as the mean and variance, and by many statistical significance tests that are routinely used in Information Retrieval (IR) to compare systems. Unfortunately, there has so far been no systematic and formal study of the actual scale properties of IR measures. Therefore, in this paper, we develop a theory of IR evaluation measures, based on the representational theory of measurement, to determine whether and when IR measures are interval scales. We find that common set-based retrieval measures (namely Precision, Recall, and F-measure) are always interval scales in the case of binary relevance, whereas in the case of multi-graded relevance they are interval scales only when the relevance degrees themselves are on a ratio scale and we define a specific partial order among systems. Among rank-based retrieval measures (namely AP, gRBP, DCG, and ERR), only gRBP is an interval scale, and only when we choose a specific value of the parameter p and define a specific total order among systems; the other rank-based measures are not interval scales. Besides the formal framework itself and the proofs of the scale properties of several commonly used IR measures, the paper also defines some brand new set-based and rank-based IR evaluation measures that are guaranteed to be interval scales.