As organisations continue to adopt and implement AI technologies, one of the key challenges they face is ensuring that their systems perform well across all user groups. In an increasingly diverse world, achieving fair and reliable outcomes requires more than just a broad, overarching performance metric. It requires disaggregated performance — the practice of evaluating AI and predictive models separately for different subpopulations. This approach highlights the importance of understanding how your system performs for distinct groups within your user base, whether they are identified by demographic factors like age or gender, or specific needs such as healthcare conditions. In this blog post, we will explore why disaggregated performance is essential, the complications it presents, and how it can lead to more equitable and effective decision-making.
Why Disaggregated Performance Is Crucial
In theory, performance metrics like accuracy, precision, and recall give us a broad understanding of how well our models perform. However, these aggregate numbers can obscure significant disparities across different groups. For example, imagine a predictive model used in healthcare that is trained on a large dataset. On the surface, the model may show impressive overall accuracy. But if you dig deeper, you might find that it performs poorly for certain demographics — elderly patients, those with chronic conditions, or specific ethnic groups. The average performance score masks these failures, and without disaggregating the data, such disparities may go unnoticed until they have real-world consequences.
In practical terms, disaggregated performance allows us to uncover hidden biases or weaknesses in a model. By evaluating the model’s performance for different groups, we can identify which subpopulations are being underserved or, worse, harmed. This is particularly crucial in fields like healthcare, finance, criminal justice, and hiring, where biased algorithms can perpetuate inequality and lead to unintended consequences.
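To make the healthcare example above concrete, here is a minimal sketch with entirely made-up numbers (the group sizes and error rates are illustrative assumptions, not data from any study): a model that is correct for 96% of a 900-patient majority group but only 60% of a 100-patient elderly group still reports an overall accuracy above 92%.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Illustrative numbers only: 900 majority-group patients (864 predictions correct)
# and 100 elderly patients (60 predictions correct). The true label is 1 for everyone,
# so a correct prediction is simply a predicted 1.
y_true = np.ones(1000, dtype=int)
y_pred = np.concatenate([
    np.repeat([1, 0], [864, 36]),   # majority group: 96% accuracy
    np.repeat([1, 0], [60, 40]),    # elderly group: 60% accuracy
])
group = np.repeat(["majority", "elderly"], [900, 100])

print("Overall accuracy:", accuracy_score(y_true, y_pred))   # 0.924
for g in ("majority", "elderly"):
    mask = group == g
    print(f"{g} accuracy:", accuracy_score(y_true[mask], y_pred[mask]))  # 0.96, 0.60
```

The headline figure of 92.4% looks strong, yet the group that arguably needs the model most is served far worse; only the per-group breakdown reveals it.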
The Complications of Disaggregated Evaluation
While the need for disaggregated performance is clear, implementing this approach is not without its complications. The first challenge lies in data representation. Datasets are often unevenly distributed across subpopulations, leaving certain groups underrepresented. This skews both training and evaluation: the model tends to fit the larger, dominant groups, while small subgroups contribute so few examples that their results are noisy or overlooked. In healthcare, for instance, a dataset consisting predominantly of younger patients might produce a model that fails to account for the distinct needs of older patients.
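As a minimal sketch of how this might be checked, assuming the training data records the relevant subgroup label (the file name, the age_band column, and the population shares below are all hypothetical), one can compare each group's share of the training set against its share of the population the model will serve:

```python
import pandas as pd

# Hypothetical training data containing an 'age_band' column.
train = pd.read_csv("train.csv")

# Share of each subgroup in the training data...
train_share = train["age_band"].value_counts(normalize=True).rename("train")

# ...versus assumed shares in the population the model will serve.
population_share = pd.Series({"18-39": 0.35, "40-64": 0.40, "65+": 0.25}, name="population")

# Groups whose 'train' share sits well below their 'population' share are underrepresented.
print(pd.concat([train_share, population_share], axis=1))
```

Spotting the gap is only the first step; closing it may require targeted data collection, reweighting, or at minimum flagging the affected groups for closer evaluation.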
Another complication is the complexity of interpreting disaggregated results. When evaluating a model’s performance across multiple subgroups, leaders and managers must be prepared to deal with a broader range of metrics. For example, a model that performs very well on one group but poorly on another might still be deemed a success if viewed in aggregate. However, disaggregated performance metrics reveal the nuanced details of its failures, which can be uncomfortable but necessary for meaningful improvements.
Lastly, disaggregated evaluation often requires a shift in mindset and organisational culture. Traditionally, many organisations rely on broad metrics to gauge success. But as we move towards a more ethically aware and socially conscious approach to AI, leaders must embrace the responsibility of ensuring that models serve all groups equitably. This requires not only technical adjustments but also a willingness to scrutinise models and processes in a way that many are not accustomed to.
Insights from Research and Case Studies
Numerous studies and papers have highlighted the importance of disaggregated performance and the challenges of integrating it into real-world applications. One such paper, “A Comparison of Approaches to Improve Worst-Case Predictive Model Performance Over Patient Subpopulations”, demonstrates how predictive models can fail to account for underperforming subgroups. In healthcare, for instance, models trained on general population data may achieve high accuracy, but they often miss the mark when it comes to the most vulnerable patients. By evaluating performance on specific patient subgroups, such as those with multiple comorbidities or older populations, organisations can identify areas where the model’s predictions fall short.
Similarly, the paper “The Disagreement Deconvolution: Bringing Machine Learning Performance Metrics in Line With Reality” addresses the gap between reported performance metrics and what affected populations actually experience. It shows that metrics computed against a single aggregated “ground truth” label can substantially overstate how well a model serves the people whose judgements differ from that label. The lesson for leaders is the same: a single aggregate number can hide real-world failures, and reporting performance at a finer grain gives a clearer understanding of how models will impact different groups in practice, helping to mitigate potential harms.
These studies show that disaggregated performance is not just a technical requirement but a moral and ethical obligation. If we continue to rely on aggregate metrics that overlook vulnerable populations, we risk perpetuating existing inequalities and creating new ones.
Overcoming the Challenges: Best Practices for Disaggregated Performance
Fortunately, there are several best practices and strategies that organisations can adopt to better evaluate their models and ensure fairer outcomes for all groups:
- Ensure Diverse and Representative Data: The first step in achieving disaggregated performance is ensuring that your training data is diverse and representative of the populations your model will serve. This may involve actively seeking out data from underrepresented groups and ensuring that the model is trained on a variety of scenarios, behaviours, and conditions.
- Evaluate Across Multiple Subgroups: Regularly evaluate your model’s performance on different subpopulations. This should include demographic factors such as age, gender, ethnicity, and socio-economic status, as well as more specific categories like disease types or geographic locations. By disaggregating performance data, you can spot areas of weakness and take targeted action; a minimal code sketch of this step, together with a fairness-aware metric, follows this list.
- Use Fairness-Aware Metrics: Traditional metrics may not capture the full picture when it comes to fairness. Implement fairness-aware metrics that measure the model’s performance not just in terms of accuracy but also in terms of equity across different groups. This could include metrics like equality of opportunity, demographic parity, or disparate impact.
- Incorporate Feedback from Affected Groups: One of the best ways to understand how your model is performing for different groups is to engage directly with those groups. If possible, get feedback from diverse users and stakeholders to understand their experiences and identify potential pain points. This can inform further model adjustments.
- Continuous Monitoring and Adjustment: Disaggregated performance evaluation is not a one-time task; it requires continuous monitoring and adjustment. As societal conditions evolve, so too will the needs and behaviours of different subpopulations. Regularly reassessing your models ensures they remain effective and fair over time.
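As a minimal sketch of the evaluation and fairness-metric practices above, the function below computes per-subgroup recall and positive-prediction rates from a table of predictions, then reports the gap between the best- and worst-served groups. The column names y_true, y_pred, and age_band are illustrative assumptions rather than a prescribed schema.

```python
import pandas as pd
from sklearn.metrics import recall_score

def disaggregated_report(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Per-subgroup recall and positive-prediction rate, plus the largest gaps between groups."""
    rows = []
    for group, sub in df.groupby(group_col):
        rows.append({
            group_col: group,
            "n": len(sub),
            "recall": recall_score(sub["y_true"], sub["y_pred"]),
            "positive_rate": sub["y_pred"].mean(),   # basis for demographic parity
        })
    report = pd.DataFrame(rows).set_index(group_col)

    # Recall gaps relate to equality of opportunity; positive-rate gaps to demographic parity.
    print("Recall gap:        ", report["recall"].max() - report["recall"].min())
    print("Positive-rate gap: ", report["positive_rate"].max() - report["positive_rate"].min())
    return report

# Hypothetical usage, assuming a predictions table with y_true, y_pred, and age_band columns:
# predictions = pd.read_csv("predictions.csv")
# print(disaggregated_report(predictions, group_col="age_band"))
```

Dedicated tooling such as Fairlearn’s MetricFrame packages the same group-and-compare pattern; the plain pandas version here simply makes the idea explicit.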
Summary
Disaggregated performance is a crucial aspect of AI evaluation that ensures models work for everyone, not just the majority. By embracing this approach, leaders and organisations can create more equitable, effective systems that serve all people, regardless of their background or circumstances. While the process of disaggregating performance data comes with its challenges, it is an investment in building trust, reducing bias, and ultimately improving the quality of decision-making across industries.
As AI continues to play an ever more central role in shaping decisions in healthcare, finance, hiring, and beyond, it’s up to us as leaders to ensure that these systems are serving everyone fairly and equitably. To make this happen, we must adopt best practices for disaggregated performance, confront the complications head-on, and lead with a focus on inclusivity and fairness.
Next Steps
- If you’re interested in bespoke training or design solutions on AI fairness, feel free to reach out for a consultation.
- Check out the following resources and upcoming workshops to equip your teams with the tools and knowledge to implement fair AI systems.
Free Resources for Individual Fairness Design Considerations
- Sampling Bias in Machine Learning
- Social Bias in Machine Learning
- Representation Bias in Machine Learning

Disaggregated Performance for Fairness – £99
Empower your team to drive Responsible AI by fostering alignment with compliance needs and best practices.
- Practical, easy-to-use guidance from problem definition to model monitoring
- Checklists for every phase in the AI/ML pipeline

AI Fairness Mitigation Package – £999
The ultimate resource for organisations ready to tackle bias at scale, covering everything from problem definition through to model monitoring to drive responsible AI practices.
Customised AI Fairness Mitigation Package – £2499
Sources
Gordon, M.L., Zhou, K., Patel, K., Hashimoto, T. and Bernstein, M.S., 2021, May. The disagreement deconvolution: Bringing machine learning performance metrics in line with reality. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1-14).
Hall, M., Chern, B., Gustafson, L., Ventura, D., Kulkarni, H., Ross, C. and Usunier, N., 2023. Towards reliable assessments of demographic disparities in multi-label image classifiers. arXiv preprint arXiv:2302.08572.
Kraft, A. and Usbeck, R., 2022. The Ethical Risks of Analyzing Crisis Events on Social Media with Machine Learning. arXiv preprint arXiv:2210.03352.
Lu, J.H., Callahan, A., Patel, B.S., Morse, K.E., Dash, D., Pfeffer, M.A. and Shah, N.H., 2022. Assessment of adherence to reporting guidelines by commonly used clinical prediction models from a single vendor: a systematic review. JAMA network open, 5(8), pp.e2227779-e2227779.
Pfohl, S.R., Zhang, H., Xu, Y., Foryciarz, A., Ghassemi, M. and Shah, N.H., 2022. A comparison of approaches to improve worst-case predictive model performance over patient subpopulations. Scientific reports, 12(1), p.3254.