Sampling Bias in Machine Learning

Understanding Sampling Bias

Sampling bias is closely related to representation bias: it arises from non-random sampling of subgroups. Because of sampling bias, the trends estimated for one population may not generalise to data collected from a new population (Mehrabi et al., 2021).

 

Example of Sampling Bias in Machine Learning

Sampling bias occurs when a correlation exists between features (such as skills) and labels (such as interview invitations) in the training data, but this correlation does not hold in the broader population. Consider a hiring example from Gu and Oelke (2019): a model was trained on applicants with both statistical knowledge and Python skills, which were in high demand by the IT department. These applicants came mostly from specific universities that taught both skills, leading the model to associate high interview chances not only with the skills but also with the universities themselves.

However, as more universities start offering similar programmes, the model should prioritise candidates' actual skills over their university, yet it does not. Testing the model with new data exposed the bias: it recommended interviews primarily for applicants from the universities seen in the training data (University10, University9, University3) rather than for those with the strongest skills. Even when new candidates had high skill scores but came from different universities, the model still preferred the original ones.
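To make the mechanism concrete, here is a minimal sketch in Python with synthetic data (a hypothetical reconstruction, not Gu and Oelke's actual experiment). Because past invitation decisions tracked true skill, which the recorded score captures only noisily, and high skills co-occur with a few universities in the sample, a logistic regression learns to treat the university itself as a proxy for skill:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Historical sample: universities 0-2 taught both statistics and Python, so
# their applicants tended to be genuinely skilled; past interview decisions
# tracked that true skill, which the recorded score measures only noisily.
uni = rng.integers(0, 10, size=n)
true_skill = np.where(uni < 3, rng.normal(1.0, 0.3, n), rng.normal(-1.0, 0.3, n))
recorded_skill = true_skill + rng.normal(0, 0.8, size=n)   # noisy CV screening
invited = (true_skill + rng.normal(0, 0.3, size=n) > 0).astype(int)

def features(unis, skills, n_unis=10):
    """Stack the recorded skill score with a one-hot encoding of the university."""
    return np.column_stack([skills, np.eye(n_unis)[unis]])

model = LogisticRegression(max_iter=1000).fit(features(uni, recorded_skill), invited)

# Two equally skilled new candidates who differ only in university.
probs = model.predict_proba(features(np.array([0, 5]), np.array([1.5, 1.5])))[:, 1]
print(f"University0 candidate: {probs[0]:.2f}")   # scores high
print(f"University5 candidate: {probs[1]:.2f}")   # penalised despite same skill
```

Running this sketch, the University5 candidate receives a visibly lower interview probability than an identical candidate from University0, mirroring the behaviour described above.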

This example highlights how unintended biases in the training data influence the model's decisions. It also shows that even when a model learns strong, relevant patterns, such as the importance of skills, it can still be swayed by irrelevant factors like the university, leading to biased decisions. Exploring the model's behaviour in this way helps uncover such hidden patterns and guides improvements toward more reliable outcomes.

 

Why Does Sampling Bias Happen?

Sampling bias can occur for several reasons, including:

  • Historical Discrimination in Data: Past inequalities reflected in the data get baked into the model.
  • Skewed Data Collection: The data collected may over-represent some groups while under-representing or excluding others.
  • Data Preparation Overlooking Diversity: Failing, even unintentionally, to account for diverse perspectives during the data-cleaning and preprocessing stages.
  • Shifts in Real-World Data: The world changes, and if the data used to train the model does not keep up, the AI quickly becomes outdated (see the drift-check sketch below).
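For the last point, one lightweight guard is to compare feature distributions between the training data and live traffic. Below is a minimal sketch using a two-sample Kolmogorov-Smirnov test from SciPy; the feature, the simulated shift, and the significance threshold are all illustrative assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)

# Skill scores seen at training time versus (simulated) scores arriving in
# production after the applicant pool has shifted.
train_scores = rng.normal(loc=1.0, scale=0.5, size=1000)
live_scores = rng.normal(loc=0.6, scale=0.5, size=1000)

stat, p_value = ks_2samp(train_scores, live_scores)
if p_value < 0.01:  # illustrative threshold, tune to your monitoring needs
    print(f"Possible drift (KS={stat:.3f}, p={p_value:.4f}); "
          "consider re-sampling or retraining.")
```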

 

Designing Mitigations for Sampling Bias in Machine Learning

Product teams often focus on tackling sampling bias during the data collection phase. However, addressing it requires a much broader approach: sampling bias starts creeping in at the problem definition stage, and mitigating it effectively means taking thoughtful action at every step of the AI/ML pipeline, all the way through to ongoing monitoring and refinement.

Tackling sampling bias requires a systematic, proactive approach.
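As a single concrete illustration at the data-preparation stage (one step among many, not a complete fix), a common technique is to reweight training examples so that an under-sampled group contributes as though the sample matched the target population. The group labels and population shares below are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Hypothetical sample: group B is heavily under-sampled relative to the
# (assumed) real-world population, where A and B are equally common.
group = rng.choice(["A", "B"], size=1000, p=[0.9, 0.1])
X = rng.normal(size=(1000, 3))                  # stand-in feature matrix
y = rng.integers(0, 2, size=1000)               # stand-in labels

population_share = {"A": 0.5, "B": 0.5}         # assumed target proportions
sample_share = {g: np.mean(group == g) for g in ("A", "B")}

# Weight each example by (population share / sample share) of its group, so
# group B's examples count more during training.
weights = np.array([population_share[g] / sample_share[g] for g in group])

model = LogisticRegression().fit(X, y, sample_weight=weights)
```

The same idea underlies stratified resampling; reweighting simply avoids duplicating or discarding rows.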

You can get started with these resources:

Free Resources for Sampling Bias Mitigation

Best practices and design mitigations for sampling bias, from problem definition to model deployment. (Coming soon)

AI Bias Mitigation Package – £999

The ultimate resource for organisations ready to tackle bias at scale, from problem definition through to model monitoring, and to drive responsible AI practices.

  • Mitigate and resolve 15 types of bias specific to your project, with detailed guidance from problem definition to model monitoring.
  • Packed with practical methods, research-based strategies, and critical questions to guide your team.
  • Comprehensive checklists with 75+ design cards for every phase of the AI/ML pipeline.

Get the Bias Mitigation Package (delivery within 2-3 days)
Customised AI Bias Mitigation Package – £2499

We'll customise the design cards and checklists to meet your specific use case and compliance requirements, ensuring the toolkit aligns with your goals and industry standards.

  • Mitigate and resolve 15 types of bias specific to your project, with detailed guidance from problem definition to model monitoring.
  • Packed with practical methods, research-based strategies, and critical questions specific to your use case.
  • Customised checklists and 75+ design cards for every phase of the AI/ML pipeline.

Get the Customised AI Bias Mitigation Package (delivery within 7 days)

 

 

Conclusion

Sampling bias is a longstanding challenge that extends far beyond data collection, demanding nuanced solutions throughout the AI/ML pipeline. Even with perfect sampling, models may still struggle to represent underrepresented groups robustly. Mitigating this issue requires a multidisciplinary approach, bringing together data scientists, ethicists, social scientists, and affected communities to address the interconnectedness of technical decisions, ethical considerations, and societal impacts. By adopting a systematic and intentional framework, we can work toward building AI systems that are not only fairer but also more reflective of the diverse realities they aim to serve.

 

 

Sources

Catania, B., Guerrini, G. and Janpih, Z., 2023, December. Mitigating Representation Bias in Data Transformations: A Constraint-based Optimization Approach. In 2023 IEEE International Conference on Big Data (BigData) (pp. 4127-4136). IEEE.

Fahse, T., Huber, V. and van Giffen, B., 2021. Managing bias in machine learning projects. In Innovation Through Information Systems: Volume II: A Collection of Latest Research on Technology Issues (pp. 94-109). Springer International Publishing.

Gu, J. and Oelke, D., 2019. Understanding bias in machine learning. arXiv preprint arXiv:1909.01866.

Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K. and Galstyan, A., 2021. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6), pp. 1-35.

Mousavi, M., Shahbazi, N. and Asudeh, A., 2023. Data coverage for detecting representation bias in image datasets: A crowdsourcing approach. arXiv preprint arXiv:2306.13868.

Shahbazi, N., Lin, Y., Asudeh, A. and Jagadish, H.V., 2022. A survey on techniques for identifying and resolving representation bias in data. CoRR, abs/2203.11852.

Suresh, H. and Guttag, J.V., 2019. A framework for understanding unintended consequences of machine learning. arXiv preprint arXiv:1901.10002.

Suresh, H. and Guttag, J., 2021. Understanding potential sources of harm throughout the machine learning life cycle. MIT Case Studies in Social and Ethical Responsibilities of Computing.
