Disaggregated Evaluation in Machine Learning

As AI systems are increasingly used to make decisions in areas such as hiring, healthcare, finance, and law enforcement, there are substantial risks if these systems inadvertently perpetuate or exacerbate existing biases. The social and ethical implications of these biases are profound—discriminatory algorithms could lead to unfair outcomes, erode trust, and damage the reputation of organisations.

In this post, we’ll explore the nuances of disaggregated evaluation and the challenges leaders and managers face when implementing it, and I’ll share insights from my experience working with organisations striving to achieve fairness in their AI systems. Whether you’re a leader, a manager, or a key stakeholder in your organisation, you’ll find practical examples, actionable recommendations, and a fresh perspective on navigating the intricacies of fair AI evaluation.

What is Disaggregated Evaluation?

Disaggregated evaluation is an approach to assessing the performance of AI systems by segmenting the data into specific demographic groups (e.g., gender, race, age) or stakeholder categories (e.g., customers, suppliers, employees). Rather than looking at a system’s overall accuracy or performance, disaggregated evaluation involves analysing how it performs for each group, uncovering potential biases or inequities that may otherwise be hidden in aggregated metrics.
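To make this concrete, here is a minimal sketch of a disaggregated evaluation using the open-source Fairlearn library (listed in the sources below). The data, column names, and the choice of accuracy as the metric are purely illustrative.

```python
# A minimal sketch of a disaggregated evaluation with Fairlearn's MetricFrame.
# The data and the "gender" attribute are illustrative placeholders.
import pandas as pd
from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame

# Hypothetical evaluation data: true labels, model predictions,
# and a sensitive attribute for each example.
df = pd.DataFrame({
    "y_true": [1, 0, 1, 1, 0, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 0, 1, 1, 0],
    "gender": ["F", "F", "F", "M", "M", "M", "M", "F"],
})

mf = MetricFrame(
    metrics=accuracy_score,
    y_true=df["y_true"],
    y_pred=df["y_pred"],
    sensitive_features=df["gender"],
)

print(mf.overall)       # the single aggregate number most teams report
print(mf.by_group)      # the same metric broken out per group
print(mf.difference())  # the largest gap between groups
```

The single `overall` figure is what typically appears on a dashboard; the `by_group` breakdown is where disparities become visible.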

In AI, this is critical for ensuring that systems do not inadvertently harm or disadvantage specific groups. For instance, a facial recognition system may perform well in general, but it could show significant bias when evaluated on specific demographic groups, such as people of different races or genders. Disaggregated evaluation ensures that such disparities are identified and addressed before deployment.

The Need for Disaggregated Evaluation – Examples 

A prominent example of the importance of disaggregated evaluation comes from studies of AI systems used to predict recidivism in criminal justice. In one such case, a widely used risk assessment tool was evaluated on its overall accuracy in predicting which defendants were likely to re-offend. On the surface, the system appeared to be quite accurate, boasting a high overall success rate. However, when the evaluation was disaggregated by demographic groups—specifically race and gender—the results revealed significant disparities in how the system performed for different groups.

The system performed well for white defendants but disproportionately predicted higher recidivism rates for Black defendants, even when controlling for prior criminal history and other factors. This was a classic example of how aggregate performance metrics can mask underlying biases. The disaggregated evaluation exposed that the system was unfairly penalising certain demographic groups, a crucial insight that would have been missed if the evaluation had relied solely on aggregate performance metrics.
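The pattern is easy to reproduce with synthetic data. The sketch below (illustrative numbers only, not the actual study data) shows how a model can post a respectable overall accuracy while its false positive rate, the rate at which people who did not re-offend are flagged as high risk, differs sharply between groups.

```python
# Illustrative sketch with synthetic data: aggregate accuracy can look fine
# while the false positive rate differs sharply between groups.
import pandas as pd

df = pd.DataFrame({
    "reoffended":          [0, 0, 1, 1, 0, 0, 1, 1, 0, 0],
    "predicted_high_risk": [0, 0, 1, 1, 0, 1, 1, 1, 1, 0],
    "group":               ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
})

def false_positive_rate(g):
    # Among people who did not re-offend, how many were flagged high risk?
    negatives = g[g["reoffended"] == 0]
    return (negatives["predicted_high_risk"] == 1).mean()

overall_accuracy = (df["reoffended"] == df["predicted_high_risk"]).mean()
fpr_by_group = df.groupby("group").apply(false_positive_rate)

print(f"Overall accuracy: {overall_accuracy:.0%}")  # 80% in this toy example
print(fpr_by_group)  # group B is wrongly flagged high risk far more often
```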

This case underscores the critical importance of disaggregated evaluation, particularly in high-stakes applications like criminal justice, healthcare, or finance, where biased AI systems can have life-altering consequences. Without disaggregation, AI practitioners may overlook harmful biases that could reinforce systemic inequalities. The research highlights the need for AI systems to be evaluated not just in terms of overall performance but also in how they impact various demographic groups, especially those that have historically been marginalised.

By disaggregating performance across different groups, researchers and practitioners can ensure that AI systems work equitably for everyone, not just the majority. This approach helps identify and correct biases, ensuring that AI technologies are both effective and fair.

In another example, consider a recruitment AI tool that assesses job candidates based on historical hiring data. If the data it has been trained on reflects historical biases against women or people from certain ethnic backgrounds, the system could unintentionally favour male candidates or candidates from more privileged backgrounds. Disaggregated evaluation allows organisations to identify these biases, assess their impact, and take corrective action before the system is implemented at scale.
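A quick way to surface this kind of issue is to compare selection (shortlisting) rates across groups before the tool goes live. The sketch below uses hypothetical screening outcomes; in practice you would also disaggregate error rates against ground-truth hiring outcomes where they exist.

```python
# Hedged sketch: compare shortlisting rates per group for a screening model.
# The data and column names are illustrative only.
import pandas as pd

candidates = pd.DataFrame({
    "shortlisted": [1, 0, 1, 1, 0, 0, 1, 0],   # model's screening decision
    "gender":      ["M", "F", "M", "M", "F", "F", "M", "F"],
})

# Selection rate per group: a large gap is a signal to audit the training
# data and features before scaling the system.
selection_rates = candidates.groupby("gender")["shortlisted"].mean()
print(selection_rates)
print("Gap:", selection_rates.max() - selection_rates.min())
```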

The Complications and Challenges of Disaggregated Evaluation

While the need for disaggregated evaluation is clear, the process is far from simple. In my experience working with teams across various sectors, I’ve encountered several challenges that can complicate the implementation of effective disaggregated evaluations.

  1. Balancing Business Imperatives with Fairness Goals

One of the most significant challenges organisations face when conducting disaggregated evaluations is the tension between business imperatives and fairness goals. In many cases, businesses are eager to launch AI products quickly and with minimal disruption. However, fairness work—especially disaggregated evaluation—requires a deliberate, thoughtful approach that can slow down development cycles.

This conflict was evident in a project I worked on for a large financial institution. The company wanted to roll out an AI-driven loan approval system that could help reduce human bias in decision-making. However, during the evaluation process, we found that the model performed disproportionately poorly for female applicants compared to male applicants, particularly in lower-income brackets. Addressing this required further analysis, reworking training data, and fine-tuning the algorithm—all of which delayed the product launch. While the delay was frustrating, it was essential for ensuring the system would not inadvertently reinforce gender bias.
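The disparity in that project only became visible when we broke performance down by gender and income bracket together, a reminder that disaggregation is often intersectional. A sketch of that kind of breakdown (with placeholder column names and toy data, not the client's) might look like this:

```python
# Intersectional disaggregation sketch: performance by gender AND income bracket.
# All values here are toy data for illustration.
import pandas as pd

loans = pd.DataFrame({
    "repaid":         [1, 1, 0, 1, 1, 0, 1, 0],
    "approved":       [1, 1, 0, 1, 1, 0, 0, 0],
    "gender":         ["M", "M", "M", "M", "F", "F", "F", "F"],
    "income_bracket": ["high", "low", "low", "high", "high", "low", "low", "low"],
})

# Treat a decision as "correct" when approval matches eventual repayment.
correct = loans["approved"] == loans["repaid"]
accuracy_by_segment = correct.groupby(
    [loans["gender"], loans["income_bracket"]]
).mean()
print(accuracy_by_segment)  # lowest for (F, low) in this toy data
```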

  2. Data Availability and Quality

Another challenge is the availability and quality of data. Disaggregated evaluations require high-quality, granular data that segments performance across diverse demographic groups. However, not all organisations have access to such data, or the data may be incomplete, outdated, or biased itself.

In a healthcare AI project I led, we faced this issue head-on. The goal was to develop an AI system capable of predicting patient outcomes, but the data we had was insufficiently diverse. There were few data points from underrepresented ethnic groups, and we realised that this lack of diversity could skew the model’s predictions, potentially leading to poor outcomes for these groups. To address this, we worked closely with healthcare providers to gather more diverse datasets and ensure the system would be robust across all demographic groups.
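One simple check worth building into any disaggregated evaluation is to count how many examples each subgroup actually contributes, since metrics computed on a handful of cases can be misleading. A sketch, with a hypothetical attribute, group labels, and threshold:

```python
# Sanity-check subgroup sample sizes before trusting disaggregated metrics.
# The attribute, group labels, and threshold below are hypothetical.
import pandas as pd

MIN_GROUP_SIZE = 30  # illustrative; choose a threshold suited to your setting

# Hypothetical evaluation set; in practice this would be your held-out data.
eval_data = pd.DataFrame({
    "ethnicity": ["Group 1"] * 400 + ["Group 2"] * 350 + ["Group 3"] * 12,
})

group_sizes = eval_data["ethnicity"].value_counts()
too_small = group_sizes[group_sizes < MIN_GROUP_SIZE]

if not too_small.empty:
    print("Disaggregated metrics for these groups may be unreliable:")
    print(too_small)  # e.g., Group 3 contributes only 12 examples
```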

  3. Ethical Considerations and Stakeholder Engagement

Disaggregated evaluation doesn’t just require technical expertise; it also demands an ethical approach. Who are the stakeholders that should be involved in the evaluation process? How do we ensure that marginalised groups are adequately represented and their voices heard? Simply relying on the experiences of those within the development team may inadvertently perpetuate biases, especially if teams lack diversity.

In one of my AI governance projects, we identified a critical flaw in the way fairness was being assessed: the team had relied heavily on their own experiences, which were shaped by their particular demographic backgrounds. We realised that stakeholders from diverse backgrounds—especially those from historically marginalised communities—needed to be part of the evaluation process. This engagement allowed us to uncover biases that had been invisible to the original team.

  4. Complexity in Measuring Fairness

Fairness itself is subjective, and there are multiple ways to measure it. Should we prioritise equal outcomes for all groups, or should we focus on equal opportunity? These are difficult questions that require careful consideration of the context and the potential trade-offs involved.

In another example, I worked with a team developing an AI system for healthcare triage. The system was designed to prioritise patients based on the severity of their conditions, but when evaluating fairness, we had to consider whether prioritising patients based on medical need might inadvertently disadvantage certain groups, such as those with chronic conditions. This required a nuanced discussion of fairness and how to balance equitable healthcare delivery with AI-driven efficiency.
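Questions like "equal outcomes or equal opportunity?" can be made concrete by computing more than one fairness metric side by side. The sketch below uses Fairlearn's metric helpers on synthetic triage-style data; the numbers and group labels are illustrative only.

```python
# Compare two fairness notions on the same predictions (synthetic data).
import pandas as pd
from fairlearn.metrics import (
    demographic_parity_difference,  # gap in rates of being prioritised
    equalized_odds_difference,      # gap in error rates given true need
)

df = pd.DataFrame({
    "needs_urgent_care": [1, 1, 0, 0, 1, 1, 0, 0],
    "prioritised":       [1, 1, 1, 0, 1, 0, 0, 0],
    "group":             ["A", "A", "A", "A", "B", "B", "B", "B"],
})

dp_gap = demographic_parity_difference(
    df["needs_urgent_care"], df["prioritised"], sensitive_features=df["group"]
)
eo_gap = equalized_odds_difference(
    df["needs_urgent_care"], df["prioritised"], sensitive_features=df["group"]
)

# Demographic parity asks for equal rates of being prioritised ("equal outcomes");
# equalized odds asks for equal error rates given true need (equal opportunity is
# its one-sided, true-positive-rate version). Which gap matters most is a
# context-dependent judgement, not a purely technical choice.
print(f"Demographic parity difference: {dp_gap:.2f}")
print(f"Equalized odds difference:     {eo_gap:.2f}")
```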

Practical Recommendations for Leaders and Stakeholders

  1. Prioritise Inclusivity from the Start

To avoid the pitfalls of bias, it’s essential to build inclusivity into AI systems from the very beginning. This means engaging diverse stakeholders—both internal and external—to ensure that all groups are represented in data collection, model design, and evaluation processes.

  2. Ensure Transparency in Evaluation Methods

Disaggregated evaluation should not be a hidden or secondary part of the process. Leaders should ensure that evaluation methods, including how stakeholder groups are defined and performance metrics are chosen, are transparent. This openness fosters accountability and trust.

  3. Embrace a Holistic Approach to Fairness

Fairness isn’t just about mitigating harm for underrepresented groups; it’s about understanding and addressing the broader social context in which the AI system will operate. Leaders should be prepared to ask tough questions about the potential ethical and social impact of AI systems and ensure that fairness is evaluated from multiple angles, including both outcomes and opportunities.

  4. Invest in Training and Awareness

Effective disaggregated evaluation requires technical and ethical expertise. Leaders should invest in training programmes and workshops that help teams understand fairness, how to measure it, and how to design systems that are equitable from the ground up.

Summary

Disaggregated evaluation is an essential tool for ensuring fairness in AI systems, but it’s not without its challenges. Leaders, managers, and stakeholders must navigate the complexities of business imperatives, data quality, stakeholder engagement, and ethical considerations. By prioritising inclusivity, transparency, and a holistic approach to fairness, organisations can ensure that their AI systems serve all users fairly and equitably.

As AI continues to reshape industries, the need for responsible, fair systems is more critical than ever. If you’re looking for more guidance on implementing disaggregated evaluation or need help navigating the complexities of fairness in AI, I invite you to explore training options, consult with experts, or reach out to discuss tailored solutions for your organisation. Together, we can ensure that AI serves everyone, not just a select few.

Next Steps

  • If you’re interested in bespoke training or design solutions on AI fairness, feel free to reach out for a consultation.

  • Check out the following resources and upcoming workshops to equip your teams with the tools and knowledge to implement fair AI systems.

 

Free Resources for Individual Fairness Design Considerations

Data Bias

  • Sampling Bias in Machine Learning
  • Social Bias in Machine Learning
  • Representation Bias in Machine Learning

 

Disaggregated Evaluation for Fairness – £99

Empower your team to drive Responsible AI by fostering alignment with compliance needs and best practices.

  • Practical, easy-to-use guidance from problem definition to model monitoring
  • Checklists for every phase of the AI/ML pipeline

 
 
AI Fairness Mitigation Package – £999

The ultimate resource for organisations ready to tackle bias at scale, with guidance from problem definition through to model monitoring to drive responsible AI practices.

  • Mitigate and resolve 15 Types of Fairness specific to your project with detailed guidance from problem definition to model monitoring.
  • Packed with practical methods, research-based strategies, and critical questions to guide your team.
  • Comprehensive checklists for every phase of the AI/ML pipeline

Get the Fairness Mitigation Package (delivery within 2-3 days)
 
Customised AI Fairness Mitigation Package – £2499

We’ll customise the design cards and checklists to meet your specific use case and compliance requirements—ensuring the toolkit aligns perfectly with your goals and industry standards.

  • Mitigate and resolve 15 Types of Fairness specific to your project with detailed guidance from problem definition to model monitoring.
  • Packed with practical methods, research-based strategies, and critical questions specific to your use case.
  • Customised checklists for every phase of the AI/ML pipeline

 

Sources

Barocas, S., Guo, A., Kamar, E., Krones, J., Morris, M.R., Vaughan, J.W., Wadsworth, W.D. and Wallach, H., 2021, July. Designing disaggregated evaluations of AI systems: Choices, considerations, and tradeoffs. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (pp. 368-378).

Hutchinson, B., Rostamzadeh, N., Greer, C., Heller, K. and Prabhakaran, V., 2022, June. Evaluation gaps in machine learning practice. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (pp. 1859-1876).

Madaio, M., Egede, L., Subramonyam, H., Wortman Vaughan, J. and Wallach, H., 2022. Assessing the fairness of AI systems: AI practitioners’ processes, challenges, and needs for support. Proceedings of the ACM on Human-Computer Interaction, 6(CSCW1), pp.1-26.

Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I.D. and Gebru, T., 2019, January. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (pp. 220-229).

Weerts, H., Dudík, M., Edgar, R., Jalali, A., Lutz, R. and Madaio, M., 2023. Fairlearn: Assessing and improving fairness of AI systems. Journal of Machine Learning Research, 24(257), pp.1-8.


