05 Quantitative Structure-Property Relationships
Author
Joshua Schrier
Title
05 Quantitative Structure-Property Relationships
Description
https://chem.libretexts.org/@api/deki/files/244501/BP.CSV?revision=1
Category
Educational Materials
Keywords
cheminformatics, Chemoinformatics, chemical information, PubChem, quantitative structure, property relationships, QSPR, machine learning, computer-aided drug design, chemistry
URL
http://www.notebookarchive.org/2020-10-ebo4gxs/
DOI
https://notebookarchive.org/2020-10-ebo4gxs
Date Added
2020-10-31
Date Last Modified
2020-10-31
File Size
448.75 kilobytes
Supplements
Rights
CC BY-NC-SA 4.0
Download
Open in Wolfram Cloud
Quantitative Structure-Property Relationships
Quantitative Structure-Property Relationships
These files contain the boiling point example data used in this exercise. You may download the files to your local machine or else we will Import them directly:
Version notes: Updated for Mathematica 13.0 — certain functionality pertaining to removing hydrogen atoms from molecule structures has changed. This notebook will not work with Mathematica 12.
Objectives
Objectives
◼
◼
Use molecular descriptors to create a predictive model for boiling points of alkanes.
◼
Visualize data and models.
Quantitative Structure-Property Relationships
Quantitative Structure-Property Relationships
Quantitative Structure-Property Relationships (QSPR) and Quantitative Structure-Activity Relationships (QSAR) use statistical models to relate a set of predictor values to a response variable. Molecules are described using a set of descriptors, and then mathematical relationships can be developed to explain observed properties. In QSPR and QSAR, physico-chemical properties of theoretical descriptors of chemicals are used to predict either a physical property or a biological outcome.
Molecular Descriptors
Molecular Descriptors
A molecular descriptor is “final result of a logical and mathematical procedure, which transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment” (Todeschini, R.; Consonni, V. Molecular descriptors for chemoinformatics 2009 Wiley‑VCH, Weinheim). You are already familiar with descriptors such as molecular weight or number of heavy atoms and we have queried PubChem for data such as XLogP. We’ll examine just a few simple descriptors, but thousands have been developed for applications in QSPR.
Calculating descriptors in Mathematica
Calculating descriptors in Mathematica
In the readings for this Chapter, we have seen how to use algorithms to calculate descriptors such as the Wiener and Zagreb indices. Calculating these by hand would be time consuming, but a short function can be used to perform calculations like these. In this Assignment we will develop a few but you should know that many other common descriptors are available in Mathematica 12:
In[]:=
Dataset@Map[Length]@MoleculeValue["Properties"]
Out[]=
|
Wiener Index
Wiener Index
In Section 5.3 of the reading, you calculated the Wiener index for n-pentane and 2-methylpentane. Now let’s implement that calculation as a function. Recall that the Wiener index is defined as the sum of distance between any two (non-carbon) atoms in the molecule,where G represents the total atoms in the molecule, u and v are individual non-hydrogen atoms and d(u, v) is the shortest distance (in number of bonds) between atoms u and v. Let’s construct a function to do this in a step by step fashion. First, construct a Molecule that does not include explicit hydrogen atoms :
In[]:=
mol=Molecule["pentane"]
Out[]=
Molecule
|
Properties of a Molecule can be accessed in an abbreviated way:
In[]:=
mol["HydrogenCount"]
Out[]=
{3,2,2,2,3,0,0,0,0,0,0,0,0,0,0,0,0}
A complete list of available properties can be found in the documentation for the MoleculeValue function.
To obtain the number of bonds between the non-hydrogen atoms of this Molecule:
In[]:=
distances=mol["GraphDistanceMatrix",IncludeHydrogens->False]
Out[]=
{{0.,1.,2.,3.,4.},{1.,0.,1.,2.,3.},{2.,1.,0.,1.,2.},{3.,2.,1.,0.,1.},{4.,3.,2.,1.,0.}}
This can be read as the first list being the distances from the first atom to each of the other atoms, the second list as the distances from the second atom to each of the other atoms, and so on. The Total function adds these values together:
In[]:=
Total[distances]
Out[]=
{10.,7.,6.,7.,10.}
In[]:=
Total@Total[distances]
Out[]=
40.
This can be written more concisely as the Total over the first two (2) levels of the list-of-lists, which can be specified as the second argument:
In[]:=
Total[distances,2]
Out[]=
40.
Let’s define this entire process as a function:
In[]:=
WienerIndex[m_Molecule]:=With[{distances=m["GraphDistanceMatrix",IncludeHydrogens->False]},(Total[distances,2]/2)]WienerIndex[smiles_String]:=WienerIndex@Molecule[smiles]
where the second form is a convenience function that takes a SMILES string and converts it to a Molecule before sending it to the first function (that is, the second function gets used when the argument is a String, rather than a Molecule ). As a test, we can confirm this by applying to pentane and 2-methylpentane to confirm that it agrees with the version you calculated by hand:
In[]:=
Print["The Wiener index for n-pentane is: ",WienerIndex["CCCCC"]]Print["The Wiener index for 2-methylpentane is: ",WienerIndex["CCCC(C)C"]]
The Wiener index for n-pentane is: 20.
The Wiener index for 2-methylpentane is: 32.
Zagreb Indices
Zagreb Indices
In Section 5.3 of the reading, you also calculated the Zagreb indices for n-pentane and 2-methylpentane. Now let’s implement this in Mathematica. Recall that the Zagreb indices are defined in terms of the degrees (number of incoming bonds) of the non-hydrogen vertices. This property is defined as the “HeavyAtomCoordinationNumber” property (“CoordinationNumber” will include all atoms, including hydrogens):
In[]:=
degreeVector=mol["HeavyAtomCoordinationNumber",IncludeHydrogens->False]
Out[]=
{1,2,2,2,1}
The first Zagreb index, =, is equivalent to the dot-product of this degree vector with itself:
M
1
∑
i
2
δ
i
In[]:=
degreeVector.degreeVector
Out[]=
14
The second Zagreb index, =, is is equal to the sum of the products of the degrees of pairs of adjacent vertices. This can be computed efficiently by obtaining the AdjacencyMatrix (a matrix where 1 indicates a bond between two adjacent atoms, and zero indicates no bond):
M
1
∑
i,j
δ
i
δ
j
In[]:=
adjacencyMatrix=mol["AdjacencyMatrix",IncludeHydrogens->False]MatrixForm[%]
Out[]=
SparseArray
|
Out[]//MatrixForm=
0 | 1 | 0 | 0 | 0 |
1 | 0 | 1 | 0 | 0 |
0 | 1 | 0 | 1 | 0 |
0 | 0 | 1 | 0 | 1 |
0 | 0 | 0 | 1 | 0 |
And then to calculate the second Zagreb index, we take the vector-matrix-vector product of the degree vectors with this matrix:
In[]:=
degreeVector.adjacencyMatrix.degreeVector
Out[]=
24
This over counts the number of neighbors by a factor of two, so we should divide by 2 to correct for this. (In retrospect, it becomes clear that the first Zagreb index is the vector - matrix - vector product with the Identity matrix.) . Having explored the necessary components, let us define a function that takes a Molecule as input and returns a list containing the first and second Zagreb index values:
In[]:=
ZagrebIndices[m_Molecule]:=With[{degreeVector=m["HeavyAtomCoordinationNumber",IncludeHydrogens->False],adj=m["AdjacencyMatrix",IncludeHydrogens->False]},{degreeVector.degreeVector,degreeVector.adj.degreeVector/2}]ZagrebIndices[smiles_String]:=ZagrebIndices@Molecule[smiles,IncludeHydrogensFalse]
This is in agreement with the result from Section 5.3 for n-pentane and 2-methylpentane:
In[]:=
Print["The Zagreb indices for n-pentane are: ",ZagrebIndices["CCCCC"]]Print["The Zagreb indices for 2-methylpentane are: ",ZagrebIndices["CCCC(C)C"]]
The Zagreb indices for n-pentane are: {14,12}
The Zagreb indices for 2-methylpentane are: {20,18}
Looping through a list of molecules
Looping through a list of molecules
Having defined the descriptors, let’s apply it to a list of SMILES strings. One way is to perform an AssociationMap that will take each value in the list, and generate an Association whose keys are the original SMILES strings and whose values are the descriptor function values:
In[]:=
smiles={"CCC","CCCC","CCCCC","CCCC(C)C","CC(C)C(C)C"};AssociationMap[WienerIndex,smiles]
Out[]=
CCC4.,CCCC10.,CCCCC20.,CCCC(C)C32.,CC(C)C(C)C29.
In[]:=
Dataset@Map[<|"SMILES"#,"WienerIndex"WienerIndex[#]|>&]@smiles
Out[]=
|
Using descriptors to predict molecular properties
Using descriptors to predict molecular properties
For this exercise we will take a series of alkanes and create an equation that will allow us to predict boiling points. We will start with a 30 molecule alkane training set. We will obtain various descriptors and see how they can predict the physical property boiling point.
Boiling Point data
Boiling Point data
WARNING: The BP.csv file at the link above is misformatted; it treats the molecular weight information as a string, whereas it should be a number. (Download the file and open it in a text editor to see how this is the case.) Often when working with datasets, you will need to do type conversions, so it is useful to inspect these types of occurrences. One could download this file edit it to remove the quotes and extra return character at the end of each line. Alternatively, we will Import it and then modify the state of the last column:
In[]:=
data=Import["https://chem.libretexts.org/@api/deki/files/244501/BP.CSV?revision=1","Dataset","HeaderLines"1];(*importfromweb*)data=data[All,{"MW"->ToExpression}](*convertstringsinMWcolumnasnumbers*)
Out[]=
|
Graphing the data
Graphing the data
Now we can graph the data to show the boiling point (in Kelvin) as a function of molecular weight; these are stored in the “BP_K” and “MW” columns, respectively:
In[]:=
ListPlot[data[All,{"MW","BP_K"}],FrameTrue,FrameLabel{"Molecular Weight","Boiling Point in Kelvin"}]
Out[]=
This figure makes it apparent that multiple molecules have the same molecular weight but different boiling points. Molecular weight is therefore not the best predictor of boiling point. We can see if there are other descriptors that we can use such as Weiner or Zagreb, that will allow us to distinguish between different isomers when predicting the boiling point.
Adding descriptors to the dataset
Adding descriptors to the dataset
To add descriptors, we will Append an association of descriptor values computed using the “SMILES” column in the Dataset define above:
In[]:=
extended=data[All,Append[#,<|"Wiener"WienerIndex[#SMILES],"Z1"First[ZagrebIndices[#SMILES]],"Z2"Last[ZagrebIndices[#SMILES]]|>]&]
Out[]=
|
Now we can see how each of these descriptors are related to the boiling points of their respective compounds. It may be convenient to define a helper function that extracts the relevant column names and sets up the plot accordingly:
In[]:=
display[d_Dataset,{xLong_StringxShort_String},{yLong_StringyShort_String}]:=ListPlot[d[All,{xShort,yShort}],FrameTrue,FrameLabel{xLong,yLong}]
The boiling point as a function of the Wiener Index has the form:
In[]:=
display[extended,{"Wiener Index""Wiener"},{"Boiling Point in Kelvin"->"BP_K"}]
Out[]=
And similarly for the two Zagreb indices:
In[]:=
display[extended,{"Zagreb M1""Z1"},{"Boiling Point in Kelvin"->"BP_K"}]display[extended,{"Zabgre M2""Z2"},{"Boiling Point in Kelvin"->"BP_K"}]
Out[]=
Out[]=
Clearly molecular weight was somewhat predictive, but problematic. It looks like using the other indicators we have have some other ways to predict boiling point.
One option is write this data to a new CSV file and work in Microsoft Excel to perform a regression analysis. I don’t recommend you do this, but exporting the data is straightforward and your instructor may provide instructions on how to analyze the data using Excel.
One option is write this data to a new CSV file and work in Microsoft Excel to perform a regression analysis. I don’t recommend you do this, but exporting the data is straightforward and your instructor may provide instructions on how to analyze the data using Excel.
In[]:=
Export["bp_Descriptor_data.csv",extended]
Out[]=
bp_Descriptor_data.csv
However, rather than perform this analysis in a separate program, we will take advantage of model fitting tools provided in Mathematica.
Multiple regression analysis
Multiple regression analysis
LinearModelFit provides numerous tools for performing a multiple linear regression using all of our descriptors (molecular weight, Wiener index, Zagreb indices) to help predict boiling point. Let’s begin by extracting the data into an appropriate for where each example is a list of input values followed by the output value (BP_K) to be predicted:
In[]:=
d=extended[All,{"MW","Wiener","Z1","Z2","BP_K"}]//Normal//Values
Out[]=
{{16.043,0.,0,0,110.95},{30.07,1.,2,1,184.55},{44.1,4.,6,4,230.95},{58.12,10.,10,8,273.05},{58.12,9.,12,9,261.95},{72.15,20.,14,12,309.25},{72.15,18.,16,14,300.15},{72.15,16.,20,16,282.65},{86.18,35.,18,16,341.95},{86.18,32.,20,18,334.05},{86.18,31.,20,19,336.45},{86.18,28.,24,22,322.95},{86.18,29.,22,21,331.25},{100.2,56.,22,20,371.65},{100.2,48.,24,24,366.65},{100.2,46.,28,26,352.35},{100.2,46.,26,26,362.95},{100.2,48.,26,24,353.75},{100.205,52.,24,22,363.25},{100.2,50.,24,23,364.95},{114.23,84.,26,24,398.75},{114.232,76.,28,27,392.05},{114.23,58.,38,40,379.65},{114.23,62.,34,36,387.85},{114.23,65.,32,33,386.85},{114.23,66.,34,32,372.45},{128.25,120.,30,28,423.85},{128.259,114.,32,30,416.15},{142.28,165.,34,32,447.35},{142.28,158.,36,34,440.05}}
Having defined the data, we then provide it to LinearModelFit, along with lists of the input variables names to be used. It is a good practice to Clear these before proceeding, in case they have already been defined.
In[]:=
Clear[MW,Wiener,Z1,Z2];model=LinearModelFit[d,{MW,Wiener,Z1,Z2},{MW,Wiener,Z1,Z2}]
Out[]=
FittedModel
55.5695+4.43248MW-0.641108Wiener-4.392Z1+0.298153Z2
The output model allows us to generate predictions of the boiling point, given input values for the different variables. The display shows the coefficients for an equation that can be used to predict boiling points for any molecule (including those not in our dataset). For example, 2-methylheptane has MW = 114.232, Wiener Index = 79, Z1 = 28, Z2 = 26, and an observed Boiling Point = 390.6 K. What does our model predict?
In[]:=
model[114.232,79,28,26]
Out[]=
396.029
Model summary and analysis using partial regression plots
Model summary and analysis using partial regression plots
A quick look at the results summary shows that the model has an excellent R-squared value:
In[]:=
model["RSquared"]
Out[]=
0.99447
Upon more careful examination, you may notice that one of our descriptors has a very large P value:
In[]:=
model["ParameterTable"]
Out[]=
Estimate | Standard Error | t-Statistic | P-Value | |
1 | 55.5695 | 6.74538 | 8.23816 | 1.37071× -8 10 |
MW | 4.43248 | 0.202827 | 21.8535 | 8.28143× -18 10 |
Wiener | -0.641108 | 0.0643667 | -9.96024 | 3.47466× -10 10 |
Z1 | -4.392 | 1.2376 | -3.54881 | 0.00156178 |
Z2 | 0.298153 | 0.942943 | 0.316194 | 0.75448 |
This would indicate that perhaps the Z2 descriptor is not working well in this case. We can generate a more graphical interpretation that will make this more obvious. One possible approach is to plot each of the input variables against each other in a grid using the PairwiseScatterPlot repository function:
In[]:=
[◼] | PairwiseScatterPlot |
Out[]=
Perfectly correlated variables appear as a straight line. Of course the each variable is perfectly correlated with itself (hence the diagonal lines in the diagonal entries in this grid). In comparison of Z1 and Z2 shows that these are more closely related values than the others, consistent with our hypothesis that they are correlated. More quantitatively, we can construct a Correlation matrix of these data and observe that Z1 and Z2 are very highly correlated.
In[]:=
Correlation[d[[All,1;;4]]]//MatrixForm
Out[]//MatrixForm=
1. | 0.881836 | 0.954495 | 0.931148 |
0.881836 | 1. | 0.768933 | 0.739707 |
0.954495 | 0.768933 | 1. | 0.992088 |
0.931148 | 0.739707 | 0.992088 | 1. |
Howgoodisourmodel?
If we look at a plot of actual versus predicted boiling points:
In[]:=
predictionVersusActual=Transpose@model[{"PredictedResponse","Response"}];Show[ListPlot[predictionVersusActual,FrameTrue,AspectRatio1,FrameLabel{"Predicted boiling point (K)","Observed boiling point (K)"}],Plot[x,{x,110,450},PlotStyle{Red,Dotted}]]
Out[]=
The performance can also be quantifying by constructing a model between the predicted and actual data.
In[]:=
m=LinearModelFit[predictionVersusActual,x,x]
Out[]=
FittedModel
-5.41879×+1.x
-12
10
As seen above, the slope is 1 and the intercept is essentially zero. The model appears to have very good predictability within the original 30 molecule data set based on its R-squared value:
In[]:=
m["RSquared"]
Out[]=
0.99447
Another way to quantify the error is by computing the percent error for each point in our dataset:
In[]:=
m["SinglePredictionErrors"]/m["Response"]*100
Out[]=
{5.77259,3.29973,2.55011,2.10677,2.20417,1.83577,1.89455,2.02155,1.65381,1.69307,1.68095,1.75312,1.70792,1.52682,1.54689,1.60611,1.56084,1.60101,1.56049,1.55364,1.43451,1.45757,1.49366,1.46664,1.47293,1.52581,1.36293,1.38525,1.30329,1.32208}
Although the first point (methane) has a large error of 5.7%, the error decreases with increasing molecule size. Indeed, a Histogram of the errors can be informative:
In[]:=
Histogram[%]
Out[]=
We had mentioned earlier that Z2 may not be very predictive in this model. We can remove the variable an rerun the analysis to see if we can improve the predictability of the model:
In[]:=
d2=extended[All,{"MW","Wiener","Z1","BP_K"}]//Normal//Values;model2=LinearModelFit[d2,{MW,Wiener,Z1},{MW,Wiener,Z1}]
Out[]=
FittedModel
55.4979+4.41144MW-0.639724Wiener-4.02596Z1
This more parsimonious model appears to make as good of predictions as the previous model that included Z2:
In[]:=
model2["RSquared"]
Out[]=
0.994448
Let’s reexamine 2-methylheptane: MW = 114.232, Wiener Index = 79, Z1 = 28, observed Boiling Point = 390.6 K:
In[]:=
model2[114.232,79,28]
Out[]=
396.16
This corresponds to a 1.4% error in the predicted boiling point:
In[]:=
(%-390.6)/390.6*100
Out[]=
1.42356
A global comparison of the two models can be made by evaluating the percent errors for each individual data point, and plotting them together to see if there are systematic improvements or degradations in model performance:
In[]:=
percentErrorComparison=(#["SinglePredictionErrors"]/#["Response"]*100)&/@{model,model2}//Transpose
Out[]=
{{6.61583,6.39852},{3.58629,3.49446},{2.72021,2.67268},{2.26187,2.22237},{2.39025,2.29316},{1.9952,1.96031},{2.02433,1.98523},{2.44824,2.14901},{1.80902,1.77736},{1.81825,1.78336},{1.81347,1.77353},{1.9311,1.86185},{1.82502,1.79247},{1.66243,1.63338},{1.714,1.63341},{1.76446,1.70072},{1.69357,1.63668},{1.71948,1.67532},{1.67043,1.63816},{1.67198,1.63539},{1.54891,1.52187},{1.55491,1.52177},{1.84419,1.78051},{1.71061,1.59084},{1.63603,1.55431},{1.74481,1.65827},{1.49125,1.46506},{1.4966,1.4658},{1.53929,1.51167},{1.54255,1.50939}}
In[]:=
Show[ListPlot[percentErrorComparison,AspectRatio1,AxesLabel{"with Z2","without Z2"}],Plot[x,{x,1.49,6},PlotStyle{Red,Dotted}]]
Out[]=
All the points are below the line, and therefore the percent error of the model without Z2 is slightly lower than the error of the original model that included Z2. Excluding Z2 not only results in a simpler model, but in this case also slightly improves the model performance. Therefore it is probably best to leave this variable out of the model. Keep in mind that we have completed this analysis with only a training set of 30 molecules. If the training set had more molecules, you should be able to develop a better model.
Assignment
Assignment
You originally ran this analysis on a 30 molecule data set (BP.CSV). You also have available to you a 102 molecule data set at “https://chem.libretexts.org/@api/deki/files/244500/102BP.csv?revision=1” . Conveniently, this does not have the same problem with the molecular weight being represented as a String as in the shorter example used above, so it is sufficient to run the following to Import it for subsequent analysis:
In[]:=
Import["https://chem.libretexts.org/@api/deki/files/244500/102BP.csv?revision=1","Dataset","HeaderLines"1];
◼
Complete the above analysis using the expanded data set to determine if a better predictive model can be obtained with a larger training set. Note that 2-methylheptane is in this new dataset so you will need to choose a new test molecule.
When you have completed the analysis, you will create a new analysis:
◼
Choose four new topological and geometric descriptors available in Mathematica (see the details tab of the MoleculeValue documentation page)
◼
Complete simple linear analysis for each of your new descriptors.
◼
Complete a multiple linear regression to create an equation that best represents the data boiling point data and your descriptors.
◼
Create a separate sheet that has your regression data.
◼
Make a plot of Actual vs Predicted BP for your regression.
◼
Choose a new molecule not in the dataset (not 2-methylheptane, be creative and use chemical intuition).
◼
Use your multiple linear equation to predict this molecule’s BP and look of the literature value.
◼
Write a short one-two page paper that includes:
◼
What do your new chosen descriptors mean?
◼
Which new chosen descriptors correlate with each other and with the boiling point?
◼
What is the overall equation that you determined?
◼
How did you choose the molecule to test? Why is it a good choice?
◼
How close this multiple linear regression predicts your boiling point of your molecule?
Attributions
Attributions
Adapted from the corresponding OLCC 2019 Python Exercise:
https://chem.libretexts.org/Courses/Intercollegiate_Courses/Cheminformatics_OLCC_ (2019)/5._Quantitative _Structure _Property _Relationships/5.4%3 A_Python _Assignment
https://chem.libretexts.org/Courses/Intercollegiate_Courses/Cheminformatics_OLCC_ (2019)/5._Quantitative _Structure _Property _Relationships/5.4%3 A_Python _Assignment
Cite this as: Joshua Schrier, "05 Quantitative Structure-Property Relationships" from the Notebook Archive (2020), https://notebookarchive.org/2020-10-ebo4gxs
Download