Understanding Sampling Bias
Sampling bias is closely related to representation bias: it arises from the non-random sampling of subgroups. Because of sampling bias, the trends estimated for one population may not generalise to data collected from a new population (Mehrabi et al., 2021).
Example of Sampling Bias in Machine Learning
Sampling bias occurs when a correlation exists between features (such as skills) and labels (such as interview invitations) in the training data, but this correlation does not hold in the broader population. In this example, the model was trained on applicants with both statistical knowledge and Python skills, which were in high demand by the IT department. These applicants came mostly from a few specific universities that taught both skills, leading the model to associate a high chance of an interview not only with the skills but also with the university (Gu & Oelke, 2019).
However, as more universities start offering similar programmes, this learned correlation breaks down: the model should prioritise the candidates' actual skills rather than the university. Testing the model with new data exposed the bias: it recommended interviews primarily for applicants from the universities seen in the training data (University10, University9, University3) rather than for the most skilled candidates. Even when new candidates had high skill scores but came from different universities, the model still preferred the original ones.
This example highlights how unintended biases in the training data influence the model's decisions. It also shows that even when a model learns genuinely useful patterns, such as the importance of skills, it can still be swayed by irrelevant factors like the university, leading to biased decisions. Exploring the model to understand these biases can uncover hidden patterns and guide improvements towards more reliable outcomes.
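To make this concrete, here is a minimal, self-contained sketch of the scenario above. All data, feature names, and thresholds are illustrative assumptions rather than the setup from Gu & Oelke (2019): in the synthetic training sample, high skill scores occur almost exclusively at a few universities, so a logistic-regression model absorbs the university as a proxy for skill and then underrates equally skilled candidates from other universities.

```python
# Minimal sketch of sampling bias in a hiring model (all values hypothetical).
# In the training sample, high skill scores occur almost only at universities
# 0-2, so the model can absorb "university" as a proxy for skill.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 2000

# Biased sample: applicants from universities 0-2 have higher skill scores.
uni = rng.integers(0, 5, size=n)
skill = rng.normal(np.where(uni < 3, 0.8, 0.3), 0.1).clip(0, 1)
invited = (skill + rng.normal(0, 0.05, n) > 0.6).astype(int)  # label driven by skill alone

# Features: skill score plus a one-hot university indicator (10 known universities).
enc = OneHotEncoder(categories=[list(range(10))], sparse_output=False)
X = np.hstack([skill[:, None], enc.fit_transform(uni[:, None])])
model = LogisticRegression(max_iter=1000).fit(X, invited)

# New population: two equally skilled candidates, one from a training-era
# university (1) and one from a university unseen in training (7).
for u in (1, 7):
    x = np.hstack([[0.9], enc.transform([[u]])[0]])[None, :]
    print(f"university {u}, skill 0.9 -> P(invite) = {model.predict_proba(x)[0, 1]:.2f}")
```

On a typical run, the candidate from university 1 receives a noticeably higher invitation probability than the equally skilled candidate from university 7, purely because of which universities dominated the training sample.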
Why Does Sampling Bias Happen?
Sampling bias can occur for several reasons, including:
- Historical Discrimination in Data: Past inequalities reflected in the data get baked into the model.
- Skewed Data Collection: The sampled data may over-represent certain groups while under-representing or excluding others.
- Data Preparation Overlooking Diversity: Failing, even unintentionally, to account for diverse groups during the data-cleaning and preprocessing stages.
- Shifts in Real-World Data: The world changes, and if the data used to train the model does not keep up, the AI quickly becomes outdated; a simple drift check is sketched after this list.
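For the last point, a lightweight safeguard is to compare each feature's distribution in the training data against fresh production data. Below is a minimal sketch using a two-sample Kolmogorov–Smirnov test from scipy; the feature names, synthetic data, and the 0.05 significance threshold are illustrative assumptions.

```python
# Sketch: flag features whose live distribution has drifted from training,
# using a two-sample Kolmogorov-Smirnov test. The 0.05 threshold is illustrative.
import numpy as np
from scipy.stats import ks_2samp

def drift_report(train: np.ndarray, live: np.ndarray, names: list[str], alpha: float = 0.05) -> None:
    """Print a per-feature drift flag for numeric feature matrices."""
    for i, name in enumerate(names):
        stat, p_value = ks_2samp(train[:, i], live[:, i])
        flag = "DRIFT" if p_value < alpha else "ok"
        print(f"{name:>12}: KS={stat:.3f}, p={p_value:.3g} [{flag}]")

# Toy usage: the skill-score distribution has shifted upward in production.
rng = np.random.default_rng(1)
train = np.column_stack([rng.normal(0.5, 0.1, 1000), rng.normal(35, 5, 1000)])
live = np.column_stack([rng.normal(0.7, 0.1, 1000), rng.normal(35, 5, 1000)])
drift_report(train, live, ["skill_score", "age"])
```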
Designing Mitigations for Sampling Bias in Machine Learning
Product teams often focus on tackling sampling bias during the data-collection phase, but addressing it effectively requires a much broader approach. Sampling bias starts creeping in at the problem-definition stage, so mitigating it means taking deliberate action at every step of the AI/ML pipeline, through to ongoing monitoring and refinement.
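As one illustration of a data-stage mitigation, the sketch below reweights training rows by the inverse frequency of a group attribute, so under-sampled groups are not drowned out during fitting. The group labels, estimator, and data are assumptions chosen for illustration; reweighting is one technique among many, not a complete fix.

```python
# Sketch: inverse-frequency sample weights so under-sampled groups contribute
# proportionally to the fit. Group labels are assumed to be known per row.
import numpy as np
from sklearn.linear_model import LogisticRegression

def inverse_frequency_weights(group: np.ndarray) -> np.ndarray:
    """Weight each row by 1 / (its group's share of the data), normalised to mean 1."""
    _, inverse, counts = np.unique(group, return_inverse=True, return_counts=True)
    return len(group) / (len(counts) * counts[inverse])

# Toy usage: group "B" is heavily under-sampled relative to group "A".
rng = np.random.default_rng(2)
group = np.array(["A"] * 900 + ["B"] * 100)
X = rng.normal(size=(1000, 3))
y = rng.integers(0, 2, size=1000)

weights = inverse_frequency_weights(group)  # A-rows ~0.56, B-rows ~5.0
model = LogisticRegression().fit(X, y, sample_weight=weights)
```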
Tackling sampling bias requires a systematic, proactive approach.
You can get started with these resources:
Free Resources for Sampling Bias Mitigation
AI Bias Mitigation Package – £999
The ultimate resource for organisations ready to tackle bias at scale, from problem definition through to model monitoring, in support of responsible AI practices.
Customised AI Bias Mitigation Package – £2499
Conclusion
Sampling bias is a longstanding challenge that extends far beyond data collection, demanding nuanced solutions throughout the AI/ML pipeline. Even with perfect sampling, models may still struggle to represent underrepresented groups robustly. Mitigating this issue requires a multidisciplinary approach, bringing together data scientists, ethicists, social scientists, and affected communities to address the interconnectedness of technical decisions, ethical considerations, and societal impacts. By adopting a systematic and intentional framework, we can work toward building AI systems that are not only fairer but also more reflective of the diverse realities they aim to serve.
Sources
Catania, B., Guerrini, G. and Janpih, Z., 2023, December. Mitigating Representation Bias in Data Transformations: A Constraint-based Optimization Approach. In 2023 IEEE International Conference on Big Data (BigData) (pp. 4127-4136). IEEE.
Fahse, T., Huber, V. and van Giffen, B., 2021. Managing bias in machine learning projects. In Innovation Through Information Systems: Volume II: A Collection of Latest Research on Technology Issues (pp. 94-109). Springer International Publishing.
Gu, J. and Oelke, D., 2019. Understanding bias in machine learning. arXiv preprint arXiv:1909.01866.
Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K. and Galstyan, A., 2021. A survey on bias and fairness in machine learning. ACM computing surveys (CSUR), 54(6), pp.1-35.
Mousavi, M., Shahbazi, N. and Asudeh, A., 2023. Data coverage for detecting representation bias in image datasets: A crowdsourcing approach. arXiv preprint arXiv:2306.13868.
Shahbazi, N., Lin, Y., Asudeh, A. and Jagadish, H.V., 2022. A survey on techniques for identifying and resolving representation bias in data. arXiv preprint arXiv:2203.11852.
Suresh, H. and Guttag, J.V., 2019. A framework for understanding unintended consequences of machine learning. arXiv preprint arXiv:1901.10002, 2(8), p.73.
Suresh, H. and Guttag, J., 2021. Understanding potential sources of harm throughout the machine learning life cycle. MIT Case Studies in Social and Ethical Responsibilities of Computing.