06 Molecular Similarity
Author
Joshua Schrier
Title
06 Molecular Similarity
Description
Objectives
Category
Educational Materials
Keywords
cheminformatics, Chemoinformatics, chemical information, PubChem, quantitative structure, property relationships, QSPR, machine learning, computer-aided drug design, chemistry
URL
http://www.notebookarchive.org/2020-10-ebo6m3k/
DOI
https://notebookarchive.org/2020-10-ebo6m3k
Date Added
2020-10-31
Date Last Modified
2020-10-31
File Size
1.18 megabytes
Supplements
Rights
CC BY-NC-SA 4.0
Download
Open in Wolfram Cloud
Molecular Similarity
Molecular Similarity
Note: This is a 2 week assignment.
Objectives
Objectives
◼
Generate molecular fingerprints for a given molecule.
◼
Evaluate structural similarity between molecules using different molecular fingerprints and similarity metrics.
Many useful documents/papers describe various aspects of molecular similarity, including molecular fingerprints and similarity measures. Please read these if you need more details.
◼
◼
Fingerprint Generation, GraphSim Toolkit 2.4.2 (https://docs.eyesopen.com/toolkits/python/graphsimtk/fingerprint.html)
◼
Fingerprint Generation
Fingerprint Generation
Begin by defining an example molecule:
In[]:=
mol=Molecule["CC(C)C1=C(C(=C(N1CC[C@H](C[C@H](CC(=O)O)O)O)C2=CC=C(C=C2)F)C3=CC=CC=C3)C(=O)NC4=CC=CC=C4"]
Out[]=
Molecule
|
MACCS keys
MACCS keys
The MACCS key is a binary fingerprint (a string of 0’s and 1’s). Each bit position represents the presence (=1) or absence (=0) of a pre-defined structural feature. The feature definitions for the MACCS keys are available at:
https://github.com/rdkit/rdkit/blob/master/rdkit/Chem/MACCSkeys.py
https://github.com/rdkit/rdkit/blob/master/rdkit/Chem/MACCSkeys.py
In[]:=
fp=
[mol]
[◼] | MACCSKeys |
Out[]=
SparseArray
|
Because most of the entries are zeros, the default returned data is in the form of a SparseArray. This can be converted into an ordinary vector of ones and zeros using Normal:
In[]:=
Normal[fp]
Out[]=
{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,1,0,1,0,1,0,1,1,1,1,0,0,0,1,1,0,0,1,0,0,0,1,0,0,1,1,0,1,1,0,0,0,0,0,1,1,0,0,1,1,1,0,1,0,0,1,1,0,1,1,1,1,1,1,1,0,1,1,0,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0}
The MACCS key is 166 bits long:
In[]:=
Length[fp]
Out[]=
166
The MACCSKeys function provides additional ways to describe the matched pattern. The “OnBits” option returns a list of indices that are “on”:
In[]:=
[◼] | MACCSKeys |
Out[]=
{42,53,62,65,74,75,80,83,85,87,89,90,91,92,96,97,100,104,107,108,110,111,117,118,121,122,123,125,128,129,131,132,133,134,135,136,137,139,140,142,144,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165}
And the “MoleculePlot” option shows what molecular features are being matched (here we’ll just take two representatives):
In[]:=
[◼] | MACCSKeys |
Out[]=
53
,62
Aside: The commonly used Python implementation of MACCS Keys in RDKit represents the result with a 167-bit vector (a consequence of Python’s array-indexing-by-zero convention); consequently all of the indices are shifted by one from the “correct” value implemented above and there is an extra zero present in the zeroth entry in the list. This has no consequence for any of the fingerprint similarity measurements.
Exercise
Exercise
Exercise 1a: Generate the MACCS keys for the molecules represented by the following SMILES, and get the positions of the bits set to ON (one) in each of the three fingerprints. What fragments do these bit positions correspond to?
smiles={"C1=CC=CC=C1",(*Benzene(Kekule)*),"c1ccccc1",(*Benzene("Aromatized"carbons)*)"C1CCCCC1"(*Cyclohexane*)};(*Writeyourcodeinthiscell.*)
In prose: Write the fragment definition of the “ON” bits by looking at the MACCSKeys function source code.
Circular Fingerprints
Circular Fingerprints
Circular fingerprints are hashed fingerprints. They are generated by exhaustively enumerating “circular” fragments (containing all atoms within a given radius from each heavy atom of the molecule) and then hashing these fragments into a fixed-length bitstring. (Here, the “radius” from an atom is measured by the number of bonds that separates two atoms).
Examples of circular fingerprints are the extended-connectivity fingerprint (ECFPs) and their variant called FCFPs (Functional-Class Fingerprints), originally described in a paper by Rogers and Hahn (https://doi.org/10.1021/ci100050t). The RDKit implementation of these fingerprints are called “Morgan Fingerprints” (https://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints).
As of Mathematica 12.1 (09/2020), these are available as an undocumented function:
Examples of circular fingerprints are the extended-connectivity fingerprint (ECFPs) and their variant called FCFPs (Functional-Class Fingerprints), originally described in a paper by Rogers and Hahn (https://doi.org/10.1021/ci100050t). The RDKit implementation of these fingerprints are called “Morgan Fingerprints” (https://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints).
As of Mathematica 12.1 (09/2020), these are available as an undocumented function:
In[]:=
Chemistry`MoleculeFingerprint[mol,FingerprintType"MorganConnectivity"]
Out[]=
SparseArray
|
This uses the default settings in RDKit of a path length of 3 bonds and a 2048 bit fingerprint.
Path-Based Fingerprints
Path-Based Fingerprints
Path-based fingerprints are also hashed fingerprints. They are generated by enumerating linear fragments of a given length and hashing them into a fixed-length bit string. An example is the RDKit’s topological fingerprint, which is available in Mathematica 12.1 (10/2020), as the default setting for the undocumented function; it uses the RDKit implementation and default parameters (minimum path size= 1 bond, maximum path size= 7 bonds, fingerprint size = 2048 bits, number of bits set per hash = 2 , minimum fingerprint size = 64 bits, target on-bit density = 0.0):
In[]:=
Chemistry`MoleculeFingerprint[mol]
Out[]=
SparseArray
|
Please note that as this is an undocumented function, it is subject to change in future versions. This fingerprint type can also be set explicitly by setting the FingerprintType option to RDKit:
In[]:=
Chemistry`MoleculeFingerprint[mol,FingerprintType"RDKit"]
Out[]=
SparseArray
|
PubChem Fingerprint
PubChem Fingerprint
The PubChem Fingerprint is a 881-bit-long binary fingerprint (ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.pdf). Similar to the MACCS keys, it uses a pre-defined fragment dictionary. The PubChem fingerprint for each compound in PubChem can be downloaded from PubChem. However, because they are base64-encoded, they should be decoded into binary bitstrings or bitvectors.
Details about how to decode base64 - encoded PubChem fingerprints is described on the last page of the PubChem Fingerprint specification. Below is a function that decodes a PubChem fingerprint into a SparseArray like used above, using the ByteArrayToBitList repository function:
In[]:=
PubChemFingerprintToVector[pcfpBase64_String]:=WithbitList=
@ByteArray[pcfpBase64],SparseArray@Take[bitList,{33,913}]
[◼] | ByteArrayToBitList |
Here is a usage example:
In[]:=
pcfps="AAADcYBgAAAAAAAAAAAAAAAAAAAAAAAAAAAwAAAAAAAAAAABAAAAGAAAAAAACACAEAAwAIAAAACAACBCAAACAAAgAAAIiAAAAIgIICKAERCAIAAggAAIiAcAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==";(*demomolecule*)PubChemFingerprintToVector[pcfps](*applythefunctiontothedemo*)
Out[]=
SparseArray
|
Computation of similarity scores
Computation of similarity scores
Consider the following compounds:
In[]:=
cids=<|54454"Simvastatin (Zocor)",54687"Pravastatin (Pravachol)",60823"Atorvastatin (Lipitor)",446155"Fluvastatin (Lescol)",446157"Rosuvastatin (Crestor)",5282452"Pitavastatin (Livalo)",97938126"Lovastatin (Altoprev)"|>;
Let’s get the SMILES strings from PubChem:
In[]:=
smilesData=ServiceExecute["PubChem","CompoundProperties",{"CompoundID"Keys[cids],"Property"{"IsomericSMILES"}}]
Out[]=
|
In[]:=
moleculeData=smilesDataAll,With{mol=Molecule[#IsomericSMILES]},<|"CompoundID"#CompoundID,"mol"mol,"img"MoleculePlot[mol],"MACCS"
[mol]|>&
[◼] | MACCSKeys |
Out[]=
|
Now let’s compute the pair-wise similarity scores among the MACCS using the “Tanimoto” dissimilarity measurement. As mentioned in the reading, this related to the JaccardDissimilarity (but distinct from the RogersTanimotoDissimilarity):
In[]:=
DistanceMatrix[moleculeData[All,"MACCS"],DistanceFunctionJaccardDissimilarity]
Out[]=
0,,,,,,,,0,,,,,,,,0,,,,,,,,0,,,,,,,,0,,,,,,,,0,,,,,,,,0
9
47
17
26
41
65
30
43
36
59
3
22
9
47
23
37
19
31
61
85
33
56
9
44
17
26
23
37
11
32
13
28
1
2
47
73
41
65
19
31
11
32
35
74
13
48
37
60
30
43
61
85
13
28
35
74
19
36
29
41
36
59
33
56
1
2
13
48
19
36
30
53
3
22
9
44
47
73
37
60
29
41
30
53
This is easier to see if we represent the fractions as decimals (to three digits) and in the form of a table, whose headings are the CompoundID numbers, and truncating the numbers to two digits of precision:
In[]:=
With[{headers=Normal@moleculeData[All,"CompoundID"]},TableForm[SetPrecision[1-%,2],TableHeadings{headers,headers}]]
Out[]//TableForm=
54454 | 54687 | 60823 | 446155 | 446157 | 5282452 | 97938126 | |
54454 | 1.0 | 0.81 | 0.35 | 0.37 | 0.30 | 0.39 | 0.86 |
54687 | 0.81 | 1.0 | 0.38 | 0.39 | 0.28 | 0.41 | 0.80 |
60823 | 0.35 | 0.38 | 1.0 | 0.66 | 0.54 | 0.50 | 0.36 |
446155 | 0.37 | 0.39 | 0.66 | 1.0 | 0.53 | 0.73 | 0.38 |
446157 | 0.30 | 0.28 | 0.54 | 0.53 | 1.0 | 0.47 | 0.29 |
5282452 | 0.39 | 0.41 | 0.50 | 0.73 | 0.47 | 1.0 | 0.43 |
97938126 | 0.86 | 0.80 | 0.36 | 0.38 | 0.29 | 0.43 | 1.0 |
By default, the similarity score is generated using the Tanimoto equation. Mathematica supports many other distance and similarity metrics, including Dice, Cosine, Sokal, Russel, Kulczynski, McConnaughey, and Tversky. The definition of these metrics is available at the LibreTexts page (https://bit.ly/2kx9NCd). Let’s compare a few of these, subtracting one to turn the dissimilarity and distance measures into the corresponding similarity measures:
In[]:=
With[{mols=Normal@moleculeData[{1,2},"MACCS"]},{#,1-#@@mols}&/@{JaccardDissimilarity,DiceDissimilarity,CosineDistance,SokalSneathDissimilarity}]//N//TableForm
Out[]//TableForm=
JaccardDissimilarity | 0.808511 |
DiceDissimilarity | 0.894118 |
CosineDistance | 0.894675 |
SokalSneathDissimilarity | 0.678571 |
The Tversky score is an asymmetric similarity measure, and its computation requires the weightings of the two molecules being compared. It is described on page 3 of this slide deck (https://www.rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf ) and can be implemented in the following way:
In[]:=
TverskySimilarity[v1_,v2_,a_,b_]:=With[{l12=v1.v2,l1=Total[v1],l2=Total[v2]},l2/(a*l1+b*l2+(1-a-b)*l12)]
As a demonstration, apply this to the first two entries in the moleculeData, as observe how the similarity changes as a function of the two values:
In[]:=
{v1,v2}=moleculeData[{1,2},"MACCS"]//Normal;Table[{alpha,1-alpha,TverskySimilarity[v1,v2,alpha,1-alpha]//N},{alpha,0,1,0.1}]//TableForm[#,TableHeadings{None,{"α","β","Similarity"}}]&
Out[]//TableForm=
α | β | Similarity |
0. | 1. | 1. |
0.1 | 0.9 | 0.992736 |
0.2 | 0.8 | 0.985577 |
0.3 | 0.7 | 0.97852 |
0.4 | 0.6 | 0.971564 |
0.5 | 0.5 | 0.964706 |
0.6 | 0.4 | 0.957944 |
0.7 | 0.3 | 0.951276 |
0.8 | 0.2 | 0.9447 |
0.9 | 0.1 | 0.938215 |
1. | 0. | 0.931818 |
Simplifying Fingerprint Comparison
Simplifying Fingerprint Comparison
The function repository contains a MoleculeFingerprintSimilarity function that can facilitate some common similarity calculations. Let’s use the example statin drugs defined in the smilesData variable above. We will start by converting these into a list of Molecule representations:
In[]:=
examples=Molecule/@(Normal@smilesData[All,"IsomericSMILES"])
Out[]=
Molecule
,Molecule
,Molecule
,Molecule
,Molecule
,Molecule
,Molecule
|
|
|
|
|
|
|
The MoleculeFingerprintSimilarity function takes two Molecules as input, and has optional parameters for specifying the FingerprintType and SimilarityMeasure to use. To follow the example discussed earlier in this section, we will use the “MACCSKeys” and “Tanimoto” similarity, respectively. The below result should mimic the second element in the table of results presented earlier in this section:
In[]:=
|
Out[]=
0.8125
We can reproduce the entire table by defining an appropriate similarity function and then creating a DistanceMatrix of results using that function, reproducing the results shown earlier in this section:
maccsDistance[m1_,m2_]:=
[m1,m2,"FingerprintType""MACCSKeys","SimilarityMeasure"->"Tanimoto"]DistanceMatrix[examples,DistanceFunctionmaccsDistance]//TableForm
|
Out[]//TableForm=
1. | 0.8125 | 0.35443 | 0.378788 | 0.306818 | 0.4 | 0.866667 |
0.8125 | 1. | 0.386667 | 0.396825 | 0.287356 | 0.421053 | 0.8 |
0.35443 | 0.386667 | 1. | 0.661538 | 0.534884 | 0.507463 | 0.364865 |
0.378788 | 0.396825 | 0.661538 | 1. | 0.526316 | 0.734694 | 0.393443 |
0.306818 | 0.287356 | 0.534884 | 0.526316 | 1. | 0.472973 | 0.297619 |
0.4 | 0.421053 | 0.507463 | 0.734694 | 0.472973 | 1. | 0.444444 |
0.866667 | 0.8 | 0.364865 | 0.393443 | 0.297619 | 0.444444 | 1. |
Other fingerprint types (including the RDKit and Morgan fingerprints) and similarity measures are supported; See the MoleculeFingerprintSimilarity documentation for more details. On the other hand, as this does not include the PubChem fingerprints, so for the exercise below you will have to do it yourself, as described in the first part of this section.
Exercise
Exercise
Exercise 2a: Compute the Tanimoto similarity scores between the seven statin compounds used in this section, using the PubChem fingerprints.
◼
Download the PubChem Fingerprint for the seven CIDs.
◼
Convert the downloaded fingerprints into bit vectors.
◼
Compute the pair-wise Tanimoto scores using the bit vectors.
(*Writeyourcodeinthiscell*)
Interpretation of similarity scores
Interpretation of similarity scores
Using molecular fingerprints. we can compute the similarity scores between molecules. However, how should these scores be interpreted? For example, the Tanimoto score between CID 60823 and CID 446155 is computed to be 0.66, but does it mean that the two compounds are similar? How similar is similar? The following analysis would help answer these questions.
Step 1. Randomly select 1,000 compounds from PubChem and download their SMILES strings:
Step 1. Randomly select 1,000 compounds from PubChem and download their SMILES strings:
In[]:=
RandomSeed[1841];cidMax=138962044;(*ThemaximumCIDinPubChemasofSeptember2019*)cids=RandomInteger[{1,cidMax},1000];
Then define a function to request these CompoundID values from PubChem, storing the results:
request[cids_List,chunkSize_Integer:200]:=With{cidsBatches=Partition[cids,UpTo[chunkSize]],lookupSMILES=ServiceExecute["PubChem","CompoundProperties",{"CompoundID"#,"Property""IsomericSMILES"}]&},Join@@
[lookupSMILES,cidsBatches,Pause[0.2],1](*pauseevery1item*)smiles=request[cids]
|
Out[]=
|
Step 2. Generate the MACCSKeys for each compound:
In[]:=
maccs=ParallelMap
,Normal@smiles[All,"IsomericSMILES"];
[◼] | MACCSKeys |
(Here we use ParallelMap to use more than one processor on your computer to accelerate the calculation; but this could be replaced with an “ordinary” Map.)
Step 3. Compute the Tanimoto scores between compounds. It is only necessary to compute the similarity between each unique pair once:
In[]:=
similarity[u_,v_]:=Map[1.-JaccardDissimilarity[u,#]&,v]scores=ParallelMap[similarity[maccs[[#]],maccs[[#+1;;-1]]]&,Range[999]]//Flatten;
Step 4. Generate a histogram that shows the distribution of the pair-wise scores:
In[]:=
Histogram[scores,Automatic,#,PlotLabel#]&/@{"PDF","CDF"}
Out[]=
,
We can also define a CDF using the empirical distribution of observed values:
In[]:=
cdf=CDF@EmpiricalDistribution[scores]
Out[]=
Function{x.$},CDFDataDistribution
,x.$,{Listable}
|
The resulting function can be used like any other function:
In[]:=
Plot[cdf[x],{x,0,1}]
Out[]=
We can use this to compute the percent of observed samples that are above a threshold value of similarity:
In[]:=
TableForm[Table[{SetPrecision[similarity,2],SetPrecision[1-cdf[similarity],4]},{similarity,0,1,0.05}],TableHeadings{None,{"Similarity","% above"}}]
Out[]//TableForm=
Similarity | % above |
0 | 0.9997 |
0.050 | 0.9962 |
0.10 | 0.9816 |
0.15 | 0.9458 |
0.20 | 0.8798 |
0.25 | 0.7804 |
0.30 | 0.6529 |
0.35 | 0.5013 |
0.40 | 0.3443 |
0.45 | 0.2143 |
0.50 | 0.1121 |
0.55 | 0.05606 |
0.60 | 0.02201 |
0.65 | 0.007736 |
0.70 | 0.002120 |
0.75 | 0.0004505 |
0.80 | 0.00008408 |
0.85 | 0.00001802 |
0.90 | 4.004× -6 10 |
0.95 | 0 |
1.0 | 0 |
What is the average similarity score?
In[]:=
Mean[scores]
Out[]=
0.350967
From the distribution of the similarity scores among 1,000 compounds, we observe the following:
◼
If you randomly select two compounds from PubChem, the similarity score between them (computed using the Tanimoto equation and MACCS keys) is ~0.35 on average.
◼
About 5% of randomly selected compound pairs have a similarity score greater than 0.55.
◼
About 1% of randomly selected compound pairs have a similarity score greater than 0.65.
If two compounds have a Tanimoto score of 0.35, it is close to the average Tanimoto score between randomly selected compounds and there is < 50% chance that you will get a score of 0.35 or greater just by selecting two compounds randomly from PubChem. Therefore, it is reasonable to consider the two compounds are not similar.
The Tanimoto index may have a value ranging from 0 (for no similarity) to 1 (for identical molecules) and the midpoint of this value range is 0.5. Because of this, a Tanimoto score of 0.55 may not sound great enough to consider two compounds to be similar. However, according to the score distribution curve generated here, less than 5% of randomly selected compound pairs will have a score greater than this.
In the previous section, we computed the similarity scores between some cholesterol-lowering drugs, and CID 60823 and CID 446155 had a Tanimoto score of 0.66. Based on the score distribution curve generated in the second section, we can say that the probabiity of two randomly selected compounds from PubChem having a Tanimoto score greater than 0.66 is less than 1%.
You have probably taken standardized tests where you have been told your ranking was in the X-th percentile. What that means is that your score was greater than X% of the other scores, or (stated another way) that you received one of the top (100-X)% of scores. We can generate arbitrary X-iles from the list of scores using the Quantile function. For example, to find out the similarity score corresponding to the 97-th percentile (top 3% of similarity scores):
The Tanimoto index may have a value ranging from 0 (for no similarity) to 1 (for identical molecules) and the midpoint of this value range is 0.5. Because of this, a Tanimoto score of 0.55 may not sound great enough to consider two compounds to be similar. However, according to the score distribution curve generated here, less than 5% of randomly selected compound pairs will have a score greater than this.
In the previous section, we computed the similarity scores between some cholesterol-lowering drugs, and CID 60823 and CID 446155 had a Tanimoto score of 0.66. Based on the score distribution curve generated in the second section, we can say that the probabiity of two randomly selected compounds from PubChem having a Tanimoto score greater than 0.66 is less than 1%.
You have probably taken standardized tests where you have been told your ranking was in the X-th percentile. What that means is that your score was greater than X% of the other scores, or (stated another way) that you received one of the top (100-X)% of scores. We can generate arbitrary X-iles from the list of scores using the Quantile function. For example, to find out the similarity score corresponding to the 97-th percentile (top 3% of similarity scores):
In[]:=
Quantile[scores,0.97]
Out[]=
0.585714
Exercise
Exercise
Exercise 3a: In this exercise, we want to generate the distribution of the similarity scores among 1,000 compounds randomly selected from PubChem, using different molecular fingerprints and similarity metrics.
For molecular fingerprints, use the following:
For molecular fingerprints, use the following:
◼
PubChem Fingerprint
◼
MACCS keys
◼
RDKit Fingerprints
For similarity metrics, use the following:
◼
Tanimoto similarity (i.e., one minus JaccardDissimilarity)
◼
Dice similarity
◼
Cosine similarity
Generate six (6) distribution curves for these pairwise combinations .
◼
For each distribution curve, determine the similarity score threshold so that 1% of the compound pairs have a similarity score greater than or equal to this threshold.
◼
Use ResourceFunction["MACCSKeys"] to generate the MACCS keys, use Chemistry`MoleculeFingerprint[mol] for the RDKit fingerprints, and download the PubChem fingerprints from PubChem.
◼
You may use the ResourceFunction["MoleculeFingerprintSimilarity"] function for the MACCS and RDKit fingerprints if you wish, but note that it does not support the PubChem fingerprints.
◼
Step 1: Generate 1,000 random CIDs, download the isomeric SMILES for them, and create Molecule objects from the downloaded SMILES strings.
(*Writeyourcodeinthiscell*)
Step 2: Generate the fingerprints, compute the similarity scores, determine similarity thresholds, and make histograms.
(*Writeyourcodeinthiscell*)
Attributions
Attributions
Adapted from the corresponding OLCC 2019 Python Assignment:
https://chem.libretexts.org/Courses/Intercollegiate_Courses/Cheminformatics_OLCC_ (2019)/6%3 A_Molecular _Similarity/6.4%3 A_Python _Assignment
https://chem.libretexts.org/Courses/Intercollegiate_Courses/Cheminformatics_OLCC_ (2019)/6%3 A_Molecular _Similarity/6.4%3 A_Python _Assignment
Cite this as: Joshua Schrier, "06 Molecular Similarity" from the Notebook Archive (2020), https://notebookarchive.org/2020-10-ebo6m3k
Download