Evaluation Validity in Information Retrieval

2026 International ACM SIGIR Conference on Research and Development in Information Retrieval |

Published by ACM


Information retrieval has long relied on evaluations that measure system performance. Improvements on standard evaluation protocols are interpreted as progress in system effectiveness, on the understanding that better metrics indicate a better user experience. However, most evaluations are a drastic abstraction and simplification of that experience. It is therefore reasonable to inquire after the validity of our evaluations: the degree to which they do in fact represent the phenomena we care about. If a metric improves, can we be sure there is a corresponding improvement in real-world effectiveness?

We present practical ways to discuss, measure, and improve the validity of evaluations in a range of settings. By attending to validity, we can make better choices in evaluation protocols; we have a chance to make progress if and when evaluating and retrieving collapse into each other entirely, e.g., with LLM-as-judge; and we can optimise towards systems that people actually want.