Understanding Evaluation Bias
Evaluation bias occurs when the population in the benchmark dataset does not represent the population the model will actually serve. While the model is trained and optimised on its training data, its performance is assessed on a separate test or benchmark dataset during the evaluation phase.
Machine learning models are often tested against standardised benchmarks to enable objective comparisons. However, if the benchmark dataset does not accurately reflect the diversity or characteristics of the intended user population, it can lead to biased evaluations. This means models that perform well for only a specific subset of the population might be preferred over models that perform better across the full population.
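As a rough illustration of this effect, the toy calculation below (a hypothetical sketch with invented accuracy figures and group proportions, not taken from any real study) shows how a skewed benchmark can rank a model that only serves the majority group above one that performs more evenly:

```python
# Hypothetical per-group accuracies for two candidate models (illustrative numbers only).
per_group_accuracy = {
    "model_A": {"majority": 0.95, "minority": 0.60},  # strong on the majority group only
    "model_B": {"majority": 0.88, "minority": 0.86},  # more even across both groups
}

def benchmark_accuracy(model, group_weights):
    """Overall accuracy as a weighted average of per-group accuracies."""
    return sum(per_group_accuracy[model][g] * w for g, w in group_weights.items())

# Group proportions in two different benchmarks (again, invented).
skewed = {"majority": 0.95, "minority": 0.05}          # minority underrepresented
representative = {"majority": 0.60, "minority": 0.40}  # closer to the deployment population

for name, weights in [("skewed", skewed), ("representative", representative)]:
    print(name, {m: round(benchmark_accuracy(m, weights), 3) for m in per_group_accuracy})
# On the skewed benchmark model_A scores higher (about 0.93 vs 0.88);
# on the representative benchmark model_B comes out ahead (about 0.87 vs 0.81).
```

Neither model changes between the two rows; only the composition of the benchmark does, yet the model selected would differ.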
Example of Evaluation Bias
“Illustration: Choosing the wrong benchmark set can lead to overlooking a potential bias. For example, if a facial recognition algorithm is trained on a dataset with underrepresented dark-skinned females and is tested on a similarly unbalanced benchmark, the bias will remain unrecognized” (Fahse et al., 2021).
Causes of Evaluation Bias in Machine Learning
There are several underlying assumptions in the performance metrics used to evaluate machine learning models. These assumptions can contribute to evaluation bias when they fail to align with real-world complexities:
Independence of Decisions
Many performance metrics assume that decisions are independent, reflecting a utilitarian approach where overall utility is calculated as the sum of individual utilities. In reality, decisions can influence one another. For example, denying a loan to one family member might negatively impact another family member’s ability to repay their own loan, demonstrating interdependent effects.
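As a purely numerical sketch of that assumption (all figures invented for illustration), the snippet below contrasts the total utility an additive metric would assign with a case where one decision lowers the realised utility of another:

```python
# Hypothetical utilities of approving each family member's loan, scored in isolation.
individual_utilities = {"applicant_1": 1.0, "applicant_2": 1.0}

# Scenario: applicant_1 is denied, applicant_2 is approved.
# Under the independence assumption, total utility is the sum of the decisions
# taken in isolation: 0.0 (denied) + 1.0 (approved).
assumed_total = 0.0 + individual_utilities["applicant_2"]

# In reality, denying applicant_1 may reduce applicant_2's ability to repay,
# so the realised utility of the approved loan is lower (an invented figure).
realised_total = 0.0 + 0.4

print(assumed_total, realised_total)  # 1.0 vs 0.4: additivity overstates the outcome
```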
Symmetry of Decision Impact
Metrics often assume that decisions have equal consequences across all instances. However, outcomes can have drastically different impacts depending on the context. For example, rejecting a job application can severely affect an unemployed individual but have a lesser impact on someone already employed.
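One hedged way to reflect this in an evaluation is to weight each mistake by a context-dependent impact rather than counting all errors equally. The sketch below uses entirely invented records and weights:

```python
# Hypothetical evaluation records: (decision_was_wrong, applicant_context).
decisions = [
    (True,  "unemployed"),  # wrongly rejected, high real-world impact
    (True,  "unemployed"),
    (False, "employed"),
    (False, "employed"),
]

# An unweighted error rate treats every mistake as equally costly.
unweighted_error = sum(wrong for wrong, _ in decisions) / len(decisions)

# Context-dependent impact weights (invented) reflect that some mistakes hurt more.
impact_weight = {"unemployed": 3.0, "employed": 1.0}
weighted_error = (sum(impact_weight[ctx] for wrong, ctx in decisions if wrong)
                  / sum(impact_weight[ctx] for _, ctx in decisions))

print(unweighted_error, weighted_error)  # 0.5 vs 0.75: the same mistakes look worse
                                         # once their asymmetric impact is counted
```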
Generalisations from Benchmark Comparisons
Evaluation bias often arises from the desire to compare models quantitatively on benchmark datasets. While applying different models to the same external datasets aids comparison, generalising those results into broader claims about model quality is frequently statistically invalid. Models can also overfit to specific benchmarks, particularly when the benchmark data itself suffers from historical, representation, or measurement biases.
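One way to sanity-check whether a difference on a single benchmark supports any broader claim is to quantify its uncertainty before generalising. The sketch below (hypothetical per-example correctness arrays, simulated here for illustration) bootstraps a confidence interval for the accuracy gap between two models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-example correctness (1 = correct) for two models on one benchmark.
n = 500
correct_a = rng.binomial(1, 0.83, size=n)
correct_b = rng.binomial(1, 0.80, size=n)

# Bootstrap the accuracy difference by resampling benchmark examples with replacement.
diffs = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)
    diffs.append(correct_a[idx].mean() - correct_b[idx].mean())

low, high = np.percentile(diffs, [2.5, 97.5])
print(f"95% bootstrap CI for the accuracy gap: [{low:.3f}, {high:.3f}]")
# If the interval straddles 0, the observed gap is weak evidence that one model is
# genuinely better on this benchmark, let alone on other populations.
```

Even a statistically solid gap only speaks to the population the benchmark represents; it says nothing about subgroups the benchmark underrepresents.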
Choice of Performance Metrics
The metrics selected to evaluate models can exacerbate bias. Aggregate measures, such as overall accuracy, can obscure poor performance on subgroups; they are favoured precisely because they simplify comparisons and make it easy to judge which model appears “better”. Focusing on a single metric such as accuracy can also mask disparities in other types of error, such as differences in false positive or false negative rates between groups.
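A practical counter-measure is to disaggregate the evaluation: report accuracy and error rates per subgroup alongside the overall figure. A minimal sketch (with hypothetical labels, predictions and group assignments) might look like this:

```python
import numpy as np

def disaggregated_report(y_true, y_pred, groups):
    """Print overall accuracy plus per-group accuracy, FPR and FNR for a binary task."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    print(f"overall accuracy: {(y_true == y_pred).mean():.3f}")
    for g in np.unique(groups):
        t, p = y_true[groups == g], y_pred[groups == g]
        acc = (t == p).mean()
        fpr = ((p == 1) & (t == 0)).sum() / max((t == 0).sum(), 1)
        fnr = ((p == 0) & (t == 1)).sum() / max((t == 1).sum(), 1)
        print(f"group {g}: accuracy={acc:.3f}, FPR={fpr:.3f}, FNR={fnr:.3f}")

# Hypothetical example: a reasonable-looking overall accuracy hides a much higher
# false negative rate for group "b".
y_true = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 1, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b", "b", "b"]
disaggregated_report(y_true, y_pred, groups)
```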
Underrepresentation in Benchmark Datasets
Historical underrepresentation in benchmark datasets amplifies evaluation bias. For example, Suresh and Guttag (2021) highlighted that commercial facial analysis tools performed poorly on images of dark-skinned women, who made up only 7.4% of the Adience benchmark and 4.4% of IJB-A. This underrepresentation resulted in algorithms that failed to perform adequately for this subgroup. Subsequently, benchmarking on more balanced datasets led to adjustments in development processes to improve performance across diverse groups.
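A lightweight check before relying on a benchmark is to compare its subgroup composition against the population the model is meant to serve. The sketch below (hypothetical group labels, counts and target shares) flags groups that fall well short of their expected share:

```python
from collections import Counter

def flag_underrepresented(benchmark_groups, target_shares, tolerance=0.5):
    """Flag groups whose benchmark share falls below `tolerance` x their target share."""
    counts = Counter(benchmark_groups)
    total = sum(counts.values())
    for group, target in target_shares.items():
        share = counts.get(group, 0) / total
        if share < tolerance * target:
            print(f"{group}: {share:.1%} of benchmark vs ~{target:.0%} of target population")

# Hypothetical benchmark labels and an assumed target population breakdown.
benchmark_groups = ["light_male"] * 600 + ["light_female"] * 250 + \
                   ["dark_male"] * 110 + ["dark_female"] * 40
target_shares = {"light_male": 0.30, "light_female": 0.30,
                 "dark_male": 0.20, "dark_female": 0.20}
flag_underrepresented(benchmark_groups, target_shares)  # flags dark_female (4.0% vs ~20%)
```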
Designing Mitigations for Evaluation Bias in Machine Learning
Evaluation bias can be effectively mitigated by making targeted, systematic adjustments to evaluation metrics, datasets, and deployment processes.
Tackling evaluation bias requires a systematic, proactive approach. You can get started with these resources:
Free Resources for Evaluation Bias Mitigation
Best Practices for Evaluation Bias, covering problem definition through to model deployment (free download).
Bias Design Cards – £399
Empower your team to drive Responsible AI with interactive design card workshops that build alignment on bias across design, development and monitoring.

AI Bias Mitigation Package – £999
The ultimate resource for organisations ready to tackle bias at scale, from problem definition through to model monitoring, to drive responsible AI practices.

Customised AI Bias Mitigation Package – £2499

Sources
Fahse, T., Huber, V. and van Giffen, B., 2021. Managing bias in machine learning projects. In Innovation Through Information Systems: Volume II: A Collection of Latest Research on Technology Issues (pp. 94-109). Springer International Publishing.
Suresh, H. and Guttag, J., 2021, October. A framework for understanding sources of harm throughout the machine learning life cycle. In Proceedings of the 1st ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (pp. 1-9).