Evaluation Gaps in Machine Learning Practice

by Ben Hutchinson, et al.

Forming a reliable judgement of a machine learning (ML) model's appropriateness for an application ecosystem is critical for its responsible use, and requires considering a broad range of factors including harms, benefits, and responsibilities. In practice, however, evaluations of ML models frequently focus on only a narrow range of decontextualized predictive behaviours. We examine the evaluation gaps between the idealized breadth of evaluation concerns and the observed narrow focus of actual evaluations. Through an empirical study of papers from recent high-profile conferences in the Computer Vision and Natural Language Processing communities, we demonstrate a general focus on a handful of evaluation methods. By considering the metrics and test data distributions used in these methods, we draw attention to which properties of models are centered in the field, revealing the properties that are frequently neglected or sidelined during evaluation. By studying these properties, we demonstrate the machine learning discipline's implicit assumption of a range of commitments which have normative impacts; these include commitments to consequentialism, abstractability from context, the quantifiability of impacts, the limited role of model inputs in evaluation, and the equivalence of different failure modes. Shedding light on these assumptions enables us to question their appropriateness for ML system contexts, pointing the way towards more contextualized evaluation methodologies for robustly examining the trustworthiness of ML models.




Counting Carbon: A Survey of Factors Influencing the Emissions of Machine Learning

Machine learning (ML) requires using energy to carry out computations du...

On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods

Machine Learning (ML) models now inform a wide range of human decisions,...

The Dark Side: Security Concerns in Machine Learning for EDA

The growing IC complexity has led to a compelling need for design effici...

Towards Quantification of Bias in Machine Learning for Healthcare: A Case Study of Renal Failure Prediction

As machine learning (ML) models, trained on real-world datasets, become ...

The Benchmark Lottery

The world of empirical machine learning (ML) strongly relies on benchmar...

Designing Evaluations of Machine Learning Models for Subjective Inference: The Case of Sentence Toxicity

Machine Learning (ML) is increasingly applied in real-life scenarios, ra...

Thinking Beyond Distributions in Testing Machine Learned Models

Testing practices within the machine learning (ML) community have center...
