Artificial Intelligence (AI) systems can inadvertently replicate and amplify societal biases, especially when trained on skewed or unrepresentative datasets. Addressing fairness in AI begins with how we sample data.
For example:
- A recruitment algorithm trained on predominantly male candidate profiles might favor men over women.
- A facial recognition system with underrepresented samples from specific ethnic groups may fail for those populations.
This article explores sampling strategies that enhance fairness, helping make AI models more equitable and less biased.
What is Fair Sampling in AI?
Fair sampling aims to reduce biases related to sensitive attributes (e.g., gender, race) in training datasets. By carefully selecting and augmenting data, we can improve prediction accuracy for underrepresented groups while maintaining overall model performance.
Key Strategies for Fair Sampling
1. Active Sampling for Fairness
Inspired by Active Learning (AL), fair active sampling selects data points that reduce biases in model predictions. Key steps include:
- Fair Seed Set Construction: Begin with a seed dataset that has minimal distribution and hardness biases.
- Query Strategy Selection: Enrich the dataset using techniques like:
  - Query By Committee (QBC): Identify contentious data points by comparing predictions from multiple models.
  - Least Confident Classification (LCC): Prioritize ambiguous samples where predictions lack certainty.
- Dynamic and Fair Sampling: Iteratively sample data, adjusting ratios to improve prediction accuracy for protected groups (a minimal sketch combining LCC with per-group quotas follows this list).
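To make this concrete, here is a minimal sketch of fair active sampling, assuming a scikit-learn-style classifier with `predict_proba`. The function names (`least_confident_scores`, `fair_active_query`) and the per-group quota are illustrative assumptions, not the exact procedure from the cited papers.

```python
import numpy as np

def least_confident_scores(model, X_pool):
    # Least Confident Classification (LCC): score = 1 - max predicted probability.
    # Higher scores mean the model is less certain about the sample.
    proba = model.predict_proba(X_pool)
    return 1.0 - proba.max(axis=1)

def fair_active_query(model, X_pool, groups_pool, n_queries, group_quota):
    # Select the most uncertain samples while enforcing a per-group quota,
    # so the queried batch does not over-represent the majority group.
    scores = least_confident_scores(model, X_pool)
    order = np.argsort(-scores)          # most uncertain first
    selected = []
    counts = {g: 0 for g in group_quota}
    for idx in order:
        g = groups_pool[idx]
        if counts.get(g, 0) < group_quota.get(g, 0):
            selected.append(idx)
            counts[g] += 1
        if len(selected) == n_queries:
            break
    return np.array(selected)
```

A QBC variant would replace the uncertainty score with a disagreement measure (e.g., vote entropy) computed across a committee of models trained on different subsets of the data.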
2. Fair Representation Learning
Ensures embeddings (numerical representations of data) are neutral with respect to sensitive attributes through:
- Adversarial Frameworks: Remove sensitive attribute information via adversarial loss functions.
- Projection Methods: Project embeddings onto a subspace that carries no linear information about the sensitive attributes (a minimal sketch follows this list).
- Heterogeneous Information Networks (HINs): Balance representation across nodes and edges for datasets with complex relationships.
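As a rough illustration of the projection idea, the sketch below removes, from each embedding, the linear direction that separates two sensitive groups. The helper name `debias_by_projection` and the difference-of-means direction are assumptions for this example; adversarial frameworks instead learn what to remove jointly with the downstream task.

```python
import numpy as np

def debias_by_projection(embeddings, sensitive):
    # Estimate the direction encoding a binary sensitive attribute as the
    # difference of group means, then remove each embedding's component
    # along that direction so the result carries less group information.
    emb = np.asarray(embeddings, dtype=float)
    s = np.asarray(sensitive)
    direction = emb[s == 1].mean(axis=0) - emb[s == 0].mean(axis=0)
    direction /= np.linalg.norm(direction)
    return emb - np.outer(emb @ direction, direction)
```

Note that this only removes linearly encoded group information; nonlinear leakage is what adversarial approaches are designed to address.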
3. Balanced Data Sampling
Equalizes representation of protected groups in training data using techniques like:
- Oversampling: Generate synthetic data points for underrepresented groups, e.g., via SMOTE (a simple random-oversampling sketch follows this list).
- Undersampling: Reduce overrepresented group samples, ensuring balance without sacrificing key data.
- Cluster Sampling: Use clusters to maintain diversity in sampled data.
- Adaptive Sampling: Adjust ratios dynamically during training based on model performance.
- Stratified Sampling: Ensure proportional representation across strata defined by sensitive attributes.
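A minimal sketch of balanced data sampling, assuming NumPy arrays and a single group column: the function below randomly oversamples each protected group up to the size of the largest group. The name `oversample_to_balance` is illustrative; SMOTE-style methods would synthesise new points by interpolating between neighbours within a group rather than repeating existing ones.

```python
import numpy as np

def oversample_to_balance(X, y, group, seed=0):
    # Randomly resample (with replacement) every protected group up to the
    # size of the largest group, so all groups are equally represented.
    rng = np.random.default_rng(seed)
    X, y, group = np.asarray(X), np.asarray(y), np.asarray(group)
    labels, counts = np.unique(group, return_counts=True)
    target = counts.max()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(group == g), size=target, replace=True)
        for g in labels
    ])
    rng.shuffle(idx)
    return X[idx], y[idx], group[idx]
```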
4. FAL-CUR: Fair Active Learning Using Uncertainty and Representativeness
A tailored approach combining:
- Uncertainty Sampling: Focus on ambiguous or uncertain data points.
- Representativeness: Maintain overall population diversity to prevent overemphasis on edge cases (a sketch of a combined score follows this list).
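The sketch below shows the general idea of blending the two criteria into one acquisition score. The entropy-based uncertainty, mean-cosine-similarity representativeness, and the `alpha` weighting are assumptions made for this example rather than the exact formulation of FAL-CUR, which also relies on fair clustering before scoring.

```python
import numpy as np

def combined_acquisition_scores(proba, X_pool, alpha=0.5):
    # Uncertainty: entropy of the predicted class probabilities.
    proba = np.clip(np.asarray(proba, dtype=float), 1e-12, 1.0)
    uncertainty = -(proba * np.log(proba)).sum(axis=1)

    # Representativeness: mean cosine similarity of each candidate to the pool.
    X = np.asarray(X_pool, dtype=float)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    representativeness = (Xn @ Xn.T).mean(axis=1)

    # Normalise both terms to [0, 1] and mix them with weight alpha.
    def norm01(v):
        return (v - v.min()) / (v.max() - v.min() + 1e-12)

    return alpha * norm01(uncertainty) + (1 - alpha) * norm01(representativeness)
```

Querying the highest-scoring points then balances informativeness against coverage of the underlying population.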
Summary
Fair sampling transcends technical boundaries—it is a cornerstone of ethical AI development. While strategies like active learning and representation balancing are accessible, their real-world implementation requires expertise, continual evaluation, and adaptation.
Sources
Fajri, R., Saxena, A., Pei, Y. and Pechenizkiy, M., 2022. Fal-cur: fair active learning using uncertainty and representativeness on fair clustering. arXiv preprint arXiv:2209.12756.
Ferrara, E., 2023. Fairness and bias in artificial intelligence: A brief survey of sources, impacts, and mitigation strategies. Sci, 6(1), p.3.
Roh, Y., Lee, K., Whang, S. and Suh, C., 2021. Sample selection for fair and robust training. Advances in Neural Information Processing Systems, 34, pp.815-827.
Sha, L., Li, Y., Gasevic, D. and Chen, G., 2022. Bigger data or fairer data?: augmenting BERT via active sampling for educational text classification. In International Conference on Computational Linguistics 2022 (pp. 1275-1285). Association for Computational Linguistics (ACL).
Zeng, Z., Islam, R., Keya, K.N., Foulds, J., Song, Y. and Pan, S., 2021, May. Fair representation learning for heterogeneous information networks. In Proceedings of the International AAAI Conference on Web and Social Media (Vol. 15, pp. 877-887).