Transfer Learning to Molecular Properties: Designing Size-Extensive Neural Network Models of Chemistry

In a previous post, we showed how graph convolutional neural networks such as SchNet [1-4] build atomistic models of molecules. These models are captured in "embedding" vectors which, when analyzed, turn out to represent the chemical functional groups around each atom (see post here). We then showed how to use this atomistic model of chemistry to predict new local properties of an atom, such as its NMR chemical shift or its pKa dissociation constant (see post here). But how can we now leverage this atomistic modelling to predict properties of molecules of any size, such as the total energy or, even more interesting, the solubility?

We can do this with a special neural network design that sends each atom in the molecule (its pre-trained atom-embedding, to be exact) through the same atomwise neural net to predict a contribution to the total property. Because every atom goes through one shared network, the size of the molecule is irrelevant: we simply add up the contributions from all the atoms. Figure 1 illustrates this design, and a minimal code sketch of the idea follows the figure. We will use this design to train on the logS solubilities of molecules. If the atom-embeddings give accurate solubility predictions via this size-extensive model, then transfer learning will have been achieved: from pre-trained SchNet embeddings to a brand-new molecular property.

Fig 1 - Illustration of how to leverage pre-trained atomistic representations to build neural networks that can learn a molecule's properties, regardless of how many atoms are in the molecule. Each atom goes through the same atomwise neural net, which predicts that atom's contribution to the total property; the network is trained on the total property only. Because each atom contributes additively to the total logS, this simple design preserves the permutational symmetry of the atom indices: addition is commutative, so switching the order of the atoms gives the same result.
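To make the design concrete, here is a minimal PyTorch sketch of such a size-extensive readout, assuming the per-atom SchNet embeddings have already been extracted. The class name `AtomwiseReadout`, the embedding dimension, and the layer sizes are illustrative placeholders rather than the exact setup used in this post.

```python
import torch
import torch.nn as nn

class AtomwiseReadout(nn.Module):
    """Size-extensive readout: every atom embedding goes through the *same*
    MLP, and the per-atom scalar outputs are summed to give the molecular
    property (e.g. logS)."""

    def __init__(self, embedding_dim: int = 128, hidden: int = 200):
        super().__init__()
        self.atom_net = nn.Sequential(
            nn.Linear(embedding_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),   # one scalar contribution per atom
        )

    def forward(self, atom_embeddings: torch.Tensor) -> torch.Tensor:
        # atom_embeddings: (n_atoms, embedding_dim) for one molecule of any size
        contributions = self.atom_net(atom_embeddings)   # (n_atoms, 1)
        return contributions.sum()                       # total property

# Because summation is commutative, permuting the atoms changes nothing:
model = AtomwiseReadout()
x = torch.randn(12, 128)              # a hypothetical 12-atom molecule
perm = torch.randperm(12)
print(torch.allclose(model(x), model(x[perm]), atol=1e-5))  # True
```

The same module handles a 5-atom and a 50-atom molecule alike, since the sum simply runs over however many rows the embedding matrix has.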

Data and Methods

The Natural Products Magnetic Resonance Database (NP-MRD) is a large database containing more than 100,000 molecules together with associated properties such as solubility.[5] The solubility is given as logS, which suits our problem well: working on a logarithmic scale lets each atom contribute an additive portion of the total logS. To test the power of transferability on small datasets, we curated our own subset of only 800 molecules as a start. These molecules were chosen so that they contain only the elements SchNet was trained on (those in the QM9 database: H, C, N, O, and F); a filtering sketch along these lines is shown below.
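As a sketch of this curation step, assuming the molecules are available as SMILES strings with their logS values, one could filter with RDKit as follows. The example records and their logS numbers are placeholders for illustration only.

```python
from rdkit import Chem

QM9_ELEMENTS = {"H", "C", "N", "O", "F"}

def contains_only_qm9_elements(smiles: str) -> bool:
    """Keep only molecules whose atoms all belong to the QM9 element set."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    mol = Chem.AddHs(mol)  # make hydrogens explicit before checking
    return all(atom.GetSymbol() in QM9_ELEMENTS for atom in mol.GetAtoms())

# Toy (SMILES, logS) records standing in for NP-MRD entries; the logS values
# here are placeholders, and real records would be read from the database.
records = [("CCO", 0.0), ("c1ccccc1O", -1.0), ("CS(=O)C", -2.0)]
curated = [(smi, logs) for smi, logs in records if contains_only_qm9_elements(smi)]
print(curated)  # the sulfur-containing molecule is filtered out
```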

After curating our database of molecules with their logS values, we extracted the energy-trained SchNet embedding vector associated with each atom in each molecule. These embeddings serve as input to an atomwise neural net that is trained to predict each atom's contribution to the logS. The atomwise net consists of only 3 layers with 200 nodes per layer and ReLU activations in between. Notably, it took only 100 epochs for this simple architecture to converge to a good fit on both the training and testing data, showing how easily transfer learning can happen from graph convolutional representations of molecules (embeddings).
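A minimal training sketch of this setup is shown below. The 3-layer, 200-node ReLU architecture and the 100 epochs follow the text, while the random toy data, the Adam optimizer, and the learning rate are assumptions; in practice the embeddings and targets come from SchNet and the curated NP-MRD subset.

```python
import torch
import torch.nn as nn

embedding_dim = 128  # dimensionality of the pre-trained SchNet embeddings (assumed)

# Toy stand-in for the curated data: (atom_embeddings, logS) pairs with random
# values, just to keep the sketch runnable.
dataset = [
    (torch.randn(int(torch.randint(5, 30, (1,))), embedding_dim), torch.randn(()))
    for _ in range(16)
]

# Atomwise net: 3 layers of 200 nodes with ReLU activations, one scalar output per atom.
atom_net = nn.Sequential(
    nn.Linear(embedding_dim, 200), nn.ReLU(),
    nn.Linear(200, 200), nn.ReLU(),
    nn.Linear(200, 1),
)
optimizer = torch.optim.Adam(atom_net.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

for epoch in range(100):
    for atom_embeddings, logS in dataset:
        optimizer.zero_grad()
        # Same atomwise net for every atom; contributions are summed to the molecular logS.
        pred = atom_net(atom_embeddings).sum()
        loss = loss_fn(pred, logS)
        loss.backward()
        optimizer.step()
```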

The results of training and testing this network are shown in Figure 2. The root mean square error (RMSE) on the training set is 0.02, whereas the RMSE on the test set is 0.67. These results are promising and show that transfer learning can be valuable when handling small datasets: the inputs (the embeddings) carry a great deal of pre-trained information that transfers easily to a new model.

Fig 2 - Truth vs prediction for the transfer learning model that learns solubility, logS, from embeddings. The right panel shows the training results (700 molecules), whereas the left panel shows the testing results (100 molecules).

Does the Model Pick Up on Solubility Trends?

Fig 3 - Atomistic logS contributions vs (principal) component 1 of the embedding, which helps differentiate the chemical environments.
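As a rough sketch of how a plot like Figure 3 can be produced, assuming the per-atom embeddings and the trained net's per-atom logS contributions have been collected into arrays (the random stand-ins below exist only to make the snippet run):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Placeholders for the pooled per-atom SchNet embeddings and the atomwise
# net's predicted logS contributions over the curated molecules.
rng = np.random.default_rng(0)
all_atom_embeddings = rng.normal(size=(500, 128))   # toy stand-in
atom_contributions = rng.normal(size=500)           # toy stand-in

# Project the embeddings onto their first principal component.
pc1 = PCA(n_components=1).fit_transform(all_atom_embeddings).ravel()

plt.scatter(pc1, atom_contributions, s=8)
plt.xlabel("Embedding principal component 1")
plt.ylabel("Predicted atomic logS contribution")
plt.show()
```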

Some interesting trends can be noted from Figure 3. In general, as expected, groups with less symmetry contribute more to the total solubility prediction. For instance, carbons with more hydrogens around them appear generally more soluble than their unsaturated counterparts, possibly due to the asymmetry the hydrogens introduce. In addition, groups with only one heteroatom around the carbon seem to be more soluble than groups with two or three heteroatoms around the carbon. This could be because a single heteroatom maximizes the polarity of the group, whereas adding more introduces opposing polarity vectors that bring the solubility down. This is most evident from the placement of the tri-amine-substituted carbon in the figure, which is apparently the least soluble group.

Conclusions

This post shows how we can leverage a size-extensive transfer learning methodology to predict a molecular property. Each atom-embedding in a molecule (of whatever size) goes through the same neural network to predict its own contribution to the total logS. Transfer learning shows promising results in this direction: even with such a minimal dataset, we reached reasonable errors. There is also evidence that the model picks up on trends in solubility, such as symmetry and asymmetry in polarity and the distribution of heteroatoms.

References

[1] K. T. Schütt, P.-J. Kindermans, H. E. Sauceda, S. Chmiela, A. Tkatchenko and K.-R. Müller, arXiv preprint arXiv:1706.08566, 2017.

[2] K. T. Schütt, F. Arbabzadah, S. Chmiela, K.-R. Müller and A. Tkatchenko, Nature Communications, 2017, 8, 1–8.

[3] K. T. Schütt, P. Kessel, M. Gastegger, K. Nicoli, A. Tkatchenko and K.-R. Müller, Journal of Chemical Theory and Computation, 2018, 15, 448–455.

[4] K. T. Schütt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko and K.-R. Müller, The Journal of Chemical Physics, 2018, 148, 241722.

[5] D. S. Wishart, Z. Sayeeda, Z. Budinski, A. Guo, B. L. Lee, M. Berjanskii, M. Rout, H. Peters, R. Dizon, R. Mah et al., Nucleic Acids Research, 2022, 50, 665.