Mining Patents with Large Language Models Demonstrates Congruence of Functional Labels and Chemical Structures

09/15/2023
by Clayton W. Kosonocky, et al.

Predicting chemical function from structure is a major goal of the chemical sciences, from the discovery and repurposing of novel drugs to the creation of new materials. Recent machine learning algorithms are opening up the possibility of general predictive models spanning many different chemical functions. Here, we consider the challenge of applying large language models to chemical patents in order to consolidate and leverage the information about chemical functionality captured by these resources. Chemical patents contain vast knowledge on chemical function, but their usefulness as a dataset has historically been neglected due to the impracticality of extracting high-quality functional labels. Using a scalable ChatGPT-assisted patent summarization and word-embedding label cleaning pipeline, we derive a Chemical Function (CheF) dataset containing 100K molecules and their patent-derived functional labels. The functional labels were validated to be of high quality, allowing us to detect a strong relationship between the functional label space and the chemical structure space. Further, we find that the co-occurrence graph of the functional labels contains a robust semantic structure, which in turn allowed us to examine functional relatedness among the compounds. We then trained a model on the CheF dataset, allowing us to assign new functional labels to compounds. Using this model, we were able to retrodict approved Hepatitis C antivirals, uncover an antiviral mechanism undisclosed in the patent, and identify plausible serotonin-related drugs. The CheF dataset and associated model offer a promising new approach to predicting chemical functionality.
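
The label co-occurrence graph mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the molecule identifiers, the labels, and the helper name `label_cooccurrence` are hypothetical and invented for illustration. The sketch only shows one simple way to count how often two functional labels are assigned to the same molecule, which gives the edge weights of such a graph.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: molecule identifiers mapped to functional labels.
# These entries are invented for illustration and are not taken from CheF.
toy_labels = {
    "mol_a": {"antiviral", "protease inhibitor"},
    "mol_b": {"antiviral", "hcv"},
    "mol_c": {"antiviral", "hcv", "protease inhibitor"},
    "mol_d": {"serotonin", "antidepressant"},
}


def label_cooccurrence(labels_by_molecule):
    """Count how often each unordered pair of labels appears on the same molecule."""
    counts = Counter()
    for labels in labels_by_molecule.values():
        # Sort the labels so that each pair has a single canonical ordering.
        for pair in combinations(sorted(labels), 2):
            counts[pair] += 1
    return counts


# Each key is a (label, label) edge; each value is its co-occurrence weight.
cooccurrence = label_cooccurrence(toy_labels)
```

In a graph built this way, labels that frequently share molecules end up connected by heavy edges, which is the kind of structure the abstract describes examining for semantic relatedness.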


research
06/17/2021

Do Large Scale Molecular Language Representations Capture Important Structural Information?

Predicting chemical properties from the structure of a molecule is of gr...
research
07/24/2020

Named entity recognition in chemical patents using ensemble of contextual language models

Chemical patent documents describe a broad range of applications holding...
research
06/21/2023

Interactive Molecular Discovery with Natural Language

Natural language is expected to be a key medium for various human-machin...
research
05/09/2023

Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files

Language models are powerful tools for molecular design. Currently, the ...
research
02/16/2018

Algorithmic Complexity and Reprogrammability of Chemical Structure Networks

Here we address the challenge of profiling causal properties and trackin...
research
09/29/2022

polyBERT: A chemical language model to enable fully machine-driven ultrafast polymer informatics

Polymers are a vital part of everyday life. Their chemical universe is s...
research
04/20/2023

Censoring chemical data to mitigate dual use risk

The dual use of machine learning applications, where models can be used ...
