When is Memorization of Irrelevant Training Data Necessary for High-Accuracy Learning?

12/11/2020
by Gavin Brown, et al.

Modern machine learning models are complex and frequently encode surprising amounts of information about individual inputs. In extreme cases, complex models appear to memorize entire input examples, including seemingly irrelevant information (social security numbers from text, for example). In this paper, we aim to understand whether this sort of memorization is necessary for accurate learning. We describe natural prediction problems in which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples. This remains true even when the examples are high-dimensional and have entropy much higher than the sample size, and even when most of that information is ultimately irrelevant to the task at hand. Further, our results do not depend on the training algorithm or the class of models used for learning. Our problems are simple and fairly natural variants of the next-symbol prediction and the cluster labeling tasks. These tasks can be seen as abstractions of image- and text-related prediction problems. To establish our results, we reduce from a family of one-way communication problems for which we prove new information complexity lower bounds.
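
To make the memorization phenomenon concrete, here is a minimal toy sketch in Python. It is not the paper's construction, and all names and parameters are illustrative: it sets up a cluster-labeling-style task with high-entropy examples and uses a 1-nearest-neighbor predictor, whose "model" literally contains its training examples in full, even though most coordinates are irrelevant to the label.

import numpy as np

rng = np.random.default_rng(0)

# Toy cluster-labeling setup (illustrative, not the paper's construction):
# cluster centers are high-entropy random vectors; each training example is
# a noisy copy of one center, labeled by its cluster index.
n_clusters, dim, noise = 20, 1000, 0.1
centers = rng.standard_normal((n_clusters, dim))
X_train = centers + noise * rng.standard_normal((n_clusters, dim))
y_train = np.arange(n_clusters)

def predict(x):
    # A 1-nearest-neighbor "model": its parameters are the training examples
    # themselves, so the model encodes essentially all of their information,
    # including the many coordinates irrelevant to the labeling task.
    dists = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(dists)]

x_test = centers[3] + noise * rng.standard_normal(dim)
print(predict(x_test))   # 3: accurate prediction
print(X_train.nbytes)    # model size scales with the full entropy of the data

The paper's claim is stronger than what this sketch shows: for its problem variants, every sufficiently accurate training algorithm, not just nearest neighbor, must encode essentially all the information about a large subset of its training examples.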


Related research

Careful Data Curation Stabilizes In-context Learning (12/20/2022)
In-context learning (ICL) enables large language models (LLMs) to perfor...

Information Planning for Text Data (02/09/2018)
Information planning enables faster learning with fewer training example...

Accurate ADMET Prediction with XGBoost (04/15/2022)
The absorption, distribution, metabolism, excretion, and toxicity (ADMET...

Strong Memory Lower Bounds for Learning Natural Models (06/09/2022)
We give lower bounds on the amount of memory required by one-pass stream...

How Many and Which Training Points Would Need to be Removed to Flip this Prediction? (02/04/2023)
We consider the problem of identifying a minimal subset of training data...

Adversarially Robust Generalization Requires More Data (04/30/2018)
Machine learning models are often susceptible to adversarial perturbatio...

What indeed can GPT models do in chemistry? A comprehensive benchmark on eight tasks (05/27/2023)
Large Language Models (LLMs) with strong abilities in natural language p...
