Learning inducing points and uncertainty on molecular data
Uncertainty control and scalability to large datasets are the two main issues for the deployment of Gaussian process models into the autonomous material and chemical space exploration pipelines. One way to address both of these issues is by introducing the latent inducing variables and choosing the right approximation for the marginal log-likelihood objective. Here, we show that variational learning of the inducing points in the high-dimensional molecular descriptor space significantly improves both the prediction quality and uncertainty estimates on test configurations from a sample molecular dynamics dataset. Additionally, we show that inducing points can learn to represent the configurations of the molecules of different types that were not present within the initialization set of inducing points. Among several evaluated approximate marginal log-likelihood objectives, we show that the predictive log-likelihood provides both the predictive quality comparable to the exact Gaussian process model and excellent uncertainty control. Finally, we comment on whether a machine learning model makes predictions by interpolating the molecular configurations in high-dimensional descriptor space. We show that despite our intuition, and even for densely sampled molecular dynamics datasets, most of the predictions are done in the extrapolation regime.
READ FULL TEXT