The Hidden Language of AI in Chemistry: Unifying Graph and Language Models of Chemistry
Natural language models learn meaning by organizing words into high-dimensional spaces where relationships appear as directions: “King” = “Queen” - “Woman” + “Man.” Graph neural networks do the same for atoms. When trained on molecular data, they construct a latent space where reaction patterns emerge as geometric relationships, for example redox: ‘reduced’ - ‘dihydrogen’ = ‘oxidized.’
Fig 1 - t-SNE projection of the 128-D AFV space from the previous post. The clusters are not randomly arranged in 128-D but follow a reaction syntax.
Average Reaction Vectors and Latent Space State Functions
If the latent space is chemically coherent for a given reaction class, the displacement from reactant to product should be nearly the same for every reaction i in that class. That is,
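In symbols (a sketch of the notation; the exact symbols in the full article may differ), if $\mathbf{v}_i^{\text{react}}$ and $\mathbf{v}_i^{\text{prod}}$ denote the reaction-center AFVs of the $i$-th reactant and product,

$$\mathbf{v}_i^{\text{prod}} - \mathbf{v}_i^{\text{react}} \;\approx\; \overline{\Delta\mathbf{v}} \quad \text{for all } i,$$

where $\overline{\Delta\mathbf{v}}$ is the average of the product-minus-reactant displacements over the class.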
In practice, this is only approximate. Real reactions vary in ways the model must reflect. Different molecules have different electronic environments, steric effects, substituent patterns, and local functional groups. These influence how the representation shifts from reactant to product. Because of this molecular diversity, the reaction vectors are similar but not identical.
In the language analogy, a transformation often follows a consistent pattern without being perfectly uniform. For example, making a noun plural usually involves adding “s,” yet not every word behaves the same way. Cat becomes cats but child becomes children. The underlying idea is stable, but the transformation must adapt to the specific word. In the latent space, this shows up as similar vectors rather than one exact displacement.
Chemistry behaves the same way. Oxidizing an alcohol follows a general pattern, but oxidizing an aldehyde gives a carboxylic acid, which introduces an oxygen along with the double bond, whereas oxidizing an alkane creates a double bond only. Different molecules respond differently depending on their electronic environment and substituents.
cat - one + many = cats
child - one + many = children
child + avg[- one + many] = childs
Well-known examples from language models show how meaning becomes geometry. Natural language models arrange word representations in a learned vector space such that relationships between them align along simple linear directions. For example, changes of gender, tense, or number appear as consistent vector displacements.
Gender Transformations
King – Man + Woman ≈ Queen
Rooster – Man + Woman ≈ Hen
He – Man + Woman ≈ She
Other Transformations
Paris – France + Italy ≈ Rome
ice – cold + hot ≈ steam
Mozart – music + physics ≈ Newton
Verb and Tense
run – running + fly ≈ flying
walk – walked + play ≈ played
write – writes + read ≈ reads
In the gender examples, ‘- Man + Woman’ is a vector transformation that takes any male-gendered word and turns it into its female counterpart. These are not hand-crafted rules; this is how the model learns to represent linguistic structure geometrically. Ordered relationships between words give rise to consistent geometric directions in a model’s embedding space. Changes such as gender, tense, or number appear as stable vectors: the surrounding context stays the same, but the word shifts to a different form and meaning. This behavior is not unique to language models. It is a general consequence of learning from any structured domain. Chemistry is no exception.
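To make the arithmetic concrete, here is a minimal sketch of how such an analogy query is evaluated: add and subtract embedding vectors, then look up the nearest neighbor by cosine similarity. The tiny 3-D vectors below are hypothetical toy values, not real trained embeddings.

```python
import numpy as np

def nearest(query, vocab):
    """Return the vocabulary word whose embedding is most cosine-similar to the query."""
    best_word, best_sim = None, -np.inf
    for word, vec in vocab.items():
        sim = vec @ query / (np.linalg.norm(vec) * np.linalg.norm(query))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Hypothetical toy embeddings, just to illustrate the mechanics.
vocab = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

# "King - Man + Woman ≈ ?"
query = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(query, vocab))  # -> "queen" (in practice the input words are often excluded)
```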
Graph models of molecules operate under the same principles. In a previous post we worked out the mathematical definition of a graph neural network model and tested it on molecules from the QM9 dataset. From the trained network we extracted the atom feature vectors (AFVs), a crucial part of the model’s latent space: its compressed information about each atom.
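As a rough illustration of this kind of extraction (the real architecture and layer names from the previous post differ), intermediate atom feature vectors can be captured from a trained PyTorch model with a forward hook. The toy module below stands in for the trained GNN.

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained GNN whose final "interaction" layer emits per-atom features.
class ToyGNN(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.embed = nn.Linear(8, dim)
        self.final_interaction = nn.Linear(dim, dim)

    def forward(self, atom_features):
        return self.final_interaction(self.embed(atom_features))

model = ToyGNN()
captured = {}

def save_afv(module, inputs, output):
    # Store the per-atom feature vectors produced by this layer.
    captured["afv"] = output.detach()

handle = model.final_interaction.register_forward_hook(save_afv)

with torch.no_grad():
    model(torch.randn(9, 8))        # a fake 9-atom "molecule"
handle.remove()

afv = captured["afv"]               # shape: (9, 128)
reaction_center_afv = afv[0]        # AFV of, say, the reaction-center carbon
```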
We saw that graph neural networks (GNNs) organize atom feature vectors into a structured latent space that reflects their functional groups. Here we extend this claim: the functional-group clusters are not randomly arranged in the AFV space, but arranged in a way that respects the language of reaction formulas.
Exploring the GNN latent space via reaction formulas
We begin to explore this idea using the key reactions shown in Fig 5. Each reaction was passed through the trained GNN model (a whole set of trained models, rather). For details on the exact architecture and dataset, see the previous post.
As each molecule propagated through the network, we extracted the atom feature vector of the reaction-center carbon from the final interaction layer. We then computed the difference between the product and reactant AFVs for every reaction in the class and averaged these differences. This average displacement was then used to transform reactants into products.
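A minimal sketch of this step is shown below; the random arrays are placeholders for the real reaction-center AFVs, so this is not the exact code behind the figures.

```python
import numpy as np

# Illustrative stand-ins: reaction-center AFVs for one reaction class
# (in practice these come from the trained GNN, not random numbers).
rng = np.random.default_rng(0)
reactant_afvs = rng.normal(size=(20, 128))
product_afvs = reactant_afvs + rng.normal(loc=1.0, scale=0.1, size=(20, 128))

def average_reaction_vector(reactants, products):
    """Average product-minus-reactant displacement for a reaction class."""
    return (products - reactants).mean(axis=0)

avg_delta = average_reaction_vector(reactant_afvs, product_afvs)

# Approximate product embedding for a reactant of the same class:
predicted_product = reactant_afvs[0] + avg_delta
```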
The AFVs live in a 128-dimensional space, so the raw results lie in 128-D. For visualization, we projected the reactant, product, and reaction-difference vectors into two dimensions using linear PCA. Details on PCA can also be found in the previous post. The resulting plots for each reactant appear below, shown in the same order as the reactions in Fig 5.
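A sketch of that projection, continuing with the placeholder arrays from the previous snippet (the real analysis uses the extracted AFVs):

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit a linear PCA on the stacked reactant and product AFVs.
points = np.vstack([reactant_afvs, product_afvs])    # (2 * n_reactions, 128)
pca = PCA(n_components=2).fit(points)

reactants_2d = pca.transform(reactant_afvs)
products_2d = pca.transform(product_afvs)

# Project the class-averaged reaction vector as a direction (no mean offset needed):
avg_delta_2d = pca.components_ @ avg_delta
```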
Fig 5 - Oxidation reaction formulas explored in this article.
Fig 6 - Oxidation direction in the carbon feature vector space, projected onto the first two principal components.
It is visually clear from the first two principal components that the difference vectors are nearly parallel within each reaction class and even across different classes. This indicates that the latent space learned by the model is structured according to reaction formulae. If the space were arranged randomly, we would not observe such consistent directions. However, in two dimensions it is difficult to fully appreciate the underlying directionality of these vectors.
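A quick sanity check of that parallelism in the full 128-D space is the cosine similarity of each individual difference vector to its class average. This is a sketch using the placeholder arrays from above, not necessarily the exact measure used in the article.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Individual product-minus-reactant vectors for one reaction class.
deltas = product_afvs - reactant_afvs
avg_delta = deltas.mean(axis=0)

within_class = np.array([cosine(d, avg_delta) for d in deltas])
print(within_class.mean())   # close to 1 when the difference vectors are nearly parallel
```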
To quantify this structure in the original 128-D space, we use several measures. The first is to compute how often the approximate transformation (average difference + reactant = product) maps to a product embedding that is closest to the true product embedding produced by a full GNN pass. The table below summarizes these results. For each reaction, a specific force field was used to optimize the geometry before inference. The table also reports how many of each product are found in the QM9 dataset, how many new products are produced by the reactions, and the density (in units of 1/total), which is a measure of how populated the class of products is. The %GNN column records the percentage of product embeddings for which the approximate transformation yields a nearest neighbor that is the correct product.
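A minimal sketch of this nearest-neighbor test, again on the placeholder arrays (the article's candidate pool and distance metric may differ):

```python
import numpy as np

def nearest_index(query, candidates):
    """Index of the candidate embedding closest to the query (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(candidates - query, axis=1)))

# Candidate pool of product embeddings from full GNN passes; here we reuse the
# placeholder product_afvs, so reaction i's true product sits at index i.
candidate_afvs = product_afvs

hits = 0
for i in range(len(reactant_afvs)):
    predicted = reactant_afvs[i] + avg_delta         # approximate transformation
    if nearest_index(predicted, candidate_afvs) == i:
        hits += 1

percent_gnn = 100.0 * hits / len(reactant_afvs)       # a "%GNN"-style score
```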
We also compare these results with Morgan fingerprints (MF), an alternative molecular representation. With Morgan fingerprints, the nearest neighbor is often the reactant itself, because the fingerprint encodes global molecular similarity rather than local changes around a reaction center. After removing the reactant from the candidate set, a similar pattern emerges in Morgan fingerprint space, but the structure is weaker and less coherent than what we observe with GNN embeddings. This reflects the local, atom-based nature of GNN representations, which capture the chemistry of reaction centers more effectively than global descriptors do.
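Morgan fingerprints are readily computed with RDKit; the radius and bit count below are common defaults, not necessarily the settings used for the table.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_vector(smiles, radius=2, n_bits=2048):
    """Morgan fingerprint of a molecule as a NumPy array (a global, bit-based descriptor)."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    arr = np.zeros(n_bits)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# The fingerprint-space analogue of a reaction vector, e.g. ethanol -> acetaldehyde:
delta_mf = morgan_vector("CC=O") - morgan_vector("CCO")
```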
Additionally, we compute the cosine similarity between the average reaction vectors of different classes and find that they encode syntactical relationships, as the table below shows.
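A sketch of how such a similarity matrix is assembled (the class names and random vectors here are illustrative stand-ins for the real average reaction vectors):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Average reaction vector per class; random stand-ins for illustration only.
rng = np.random.default_rng(1)
avg_vectors = {name: rng.normal(size=128) for name in ["oxidation", "elimination", "diels-alder"]}

names = list(avg_vectors)
matrix = cosine_similarity(np.vstack([avg_vectors[n] for n in names]))

# matrix[i, j] near +1: classes i and j shift embeddings in the same direction;
# near -1: opposite directions (one transformation roughly undoes the other).
```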
In the lower left of the matrix we show the similarity between reaction vectors in the GNN space, and in the upper right the similarity between Morgan fingerprints. The GNN space shows stronger geometric relationships overall than the Morgan fingerprints. Upon closer examination, the relationships make syntactical sense: for example, the similarities between oxidations, elimination, and tautomerization, and the reverse alignment between Diels-Alder and oxidation (due to the loss of the double bond at the reaction-center carbon). Other interesting arrangements include the reverse alignment of the nucleophilic addition step of amide hydrolysis.
Conclusions
Singular is to plural as reduced is to oxidized.
References
For those interested in diving deeper into the details, all connecting citations and results discussed in this post can be found in our full article: A. M. El-Samman and S. De Baerdemacker, “amide - amine + alcohol = carboxylic acid. Chemical reactions as linear algebraic analogies in graph neural networks,” ChemRxiv. 2024; doi:10.26434/chemrxiv-2024-fmck4. This reference provides a comprehensive exploration of how chemical reactions are modeled as linear algebraic analogies within the GNN decision-making framework. Be sure to check it out for the full breakdown!
Reaction classes explored in this post: redox, Diels-Alder, epoxide ring-opening, and multi-step transformations.
As shown in the examples above, a reaction formula is analogous to a sentence in natural language. Reactants and reagents act as the “words,” the reaction arrow plays the role of the “verb,” and the products form the resulting “object.” Just as a sentence transforms meaning while preserving its structure, a chemical equation transforms matter under consistent chemical rules.
In the latent space of a trained graph neural network, this analogy appears again. Although the model is trained only on individual molecules and never on reactions, the embeddings of reactants and products still align to form clear, consistent reaction directions. These shifts resemble the way state functions depend only on initial and final states, not on the mechanistic path. The latent space has quietly organized chemistry so that many reaction types (oxidation, hydrolysis, cycloadditions, and multi-step sequences) map to well-defined geometric movements. In this sense, the model begins to “speak chemistry.”
To explore this structure, we extract latent vectors for reactants and products and analyze their differences, much like comparing state variables. This perspective offers a foundation for future tasks such as synthesis planning, reaction prediction, and generative molecular design, all supported by a chemically coherent representation.
If the organization of the latent space is reminiscent of state functions, meaning that the final result is determined only by the initial and final states regardless of the reaction pathway, then we can treat reactions in the latent space the same way. The mechanistic path does not appear in the representation; what matters is how the model positions the reactant and how it positions the product.
This lets us describe a reaction as a simple displacement in latent space. For a given reaction class, for example alcohol oxidation, we embed each reactant and product, extract their latent vectors, and take the average of their differences:
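In the same notation as earlier (illustrative symbols, not necessarily those of the original article), with $N$ reactions in the class,

$$\overline{\Delta\mathbf{v}} \;=\; \frac{1}{N}\sum_{i=1}^{N}\left(\mathbf{v}_i^{\text{prod}} - \mathbf{v}_i^{\text{react}}\right),$$

the average reaction vector used to move a reactant embedding to its predicted product.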