Educational Materials

06 Molecular Similarity

Joshua Schrier

Author

Joshua Schrier

Title

06 Molecular Similarity

Description

Objectives

Molecular Similarity

Note: This is a 2 week assignment.

Objectives

◼

Generate molecular fingerprints for a given molecule.

◼

Evaluate structural similarity between molecules using different molecular fingerprints and similarity metrics.

Many useful documents/papers describe various aspects of molecular similarity, including molecular fingerprints and similarity measures. Please read these if you need more details.

◼

Chemical Fingerprints (https://docs.chemaxon.com/display/docs/Chemical+Fingerprints)

◼

Fingerprint Generation, GraphSim Toolkit 2.4.2 (https://docs.eyesopen.com/toolkits/python/graphsimtk/fingerprint.html)

◼

Extended-Connectivity Fingerprints (https://doi.org/10.1021/ci100050t)

Fingerprint Generation

Begin by defining an example molecule:

In[]:=

mol=Molecule["CC(C)C1=C(C(=C(N1CC[C@H](C[C@H](CC(=O)O)O)O)C2=CC=C(C=C2)F)C3=CC=CC=C3)C(=O)NC4=CC=CC=C4"]

Out[]=

Molecule

Formula:

Atoms: 41(76) Bonds: 44(79)



MACCS keys

The MACCS key is a binary fingerprint (a string of 0’s and 1’s). Each bit position represents the presence (=1) or absence (=0) of a pre-defined structural feature. The feature definitions for the MACCS keys are available at:
https://github.com/rdkit/rdkit/blob/master/rdkit/Chem/MACCSkeys.py

A Mathematica implementation, MACCSKeys is available on the Wolfram Function Repository:

In[]:=

fp=

[◼]

MACCSKeys

[mol]

Out[]=

SparseArray

Specified elements: 61

Dimensions: {166}



Because most of the entries are zeros, the default returned data is in the form of a SparseArray. This can be converted into an ordinary vector of ones and zeros using Normal:

In[]:=

Normal[fp]

Out[]=

{0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0,0,1,0,1,0,1,0,1,1,1,1,0,0,0,1,1,0,0,1,0,0,0,1,0,0,1,1,0,1,1,0,0,0,0,0,1,1,0,0,1,1,1,0,1,0,0,1,1,0,1,1,1,1,1,1,1,0,1,1,0,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0}

The MACCS key is 166 bits long:

In[]:=

Length[fp]

Out[]=

166

The MACCSKeys function provides additional ways to describe the matched pattern. The “OnBits” option returns a list of indices that are “on”:

In[]:=

[◼]

MACCSKeys

[mol,"OutputStyle""OnBits"]

Out[]=

{42,53,62,65,74,75,80,83,85,87,89,90,91,92,96,97,100,104,107,108,110,111,117,118,121,122,123,125,128,129,131,132,133,134,135,136,137,139,140,142,144,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165}

And the “MoleculePlot” option shows what molecular features are being matched (here we’ll just take two representatives):

In[]:=

[◼]

MACCSKeys

[mol,"OutputStyle""MoleculePlot"][[2;;3]]

Out[]=

53

,62



Aside: The commonly used Python implementation of MACCS Keys in RDKit represents the result with a 167-bit vector (a consequence of Python’s array-indexing-by-zero convention); consequently all of the indices are shifted by one from the “correct” value implemented above and there is an extra zero present in the zeroth entry in the list. This has no consequence for any of the fingerprint similarity measurements.

Exercise

Exercise 1a: Generate the MACCS keys for the molecules represented by the following SMILES, and get the positions of the bits set to ON (one) in each of the three fingerprints. What fragments do these bit positions correspond to?

smiles={"C1=CC=CC=C1",(*Benzene(Kekule)*),"c1ccccc1",(*Benzene("Aromatized"carbons)*)"C1CCCCC1"(*Cyclohexane*)};(*Writeyourcodeinthiscell.*)

In prose: Write the fragment definition of the “ON” bits by looking at the MACCSKeys function source code.

Circular Fingerprints

Circular fingerprints are hashed fingerprints. They are generated by exhaustively enumerating “circular” fragments (containing all atoms within a given radius from each heavy atom of the molecule) and then hashing these fragments into a fixed-length bitstring. (Here, the “radius” from an atom is measured by the number of bonds that separates two atoms).

Examples of circular fingerprints are the extended-connectivity fingerprint (ECFPs) and their variant called FCFPs (Functional-Class Fingerprints), originally described in a paper by Rogers and Hahn (https://doi.org/10.1021/ci100050t). The RDKit implementation of these fingerprints are called “Morgan Fingerprints” (https://www.rdkit.org/docs/GettingStartedInPython.html#morgan-fingerprints-circular-fingerprints).

As of Mathematica 12.1 (09/2020), these are available as an undocumented function:

In[]:=

Chemistry`MoleculeFingerprint[mol,FingerprintType"MorganConnectivity"]

Out[]=

SparseArray

Specified elements: 57

Dimensions: {2048}



This uses the default settings in RDKit of a path length of 3 bonds and a 2048 bit fingerprint.

Path-Based Fingerprints

Path-based fingerprints are also hashed fingerprints. They are generated by enumerating linear fragments of a given length and hashing them into a fixed-length bit string. An example is the RDKit’s topological fingerprint, which is available in Mathematica 12.1 (10/2020), as the default setting for the undocumented function; it uses the RDKit implementation and default parameters (minimum path size= 1 bond, maximum path size= 7 bonds, fingerprint size = 2048 bits, number of bits set per hash = 2 , minimum fingerprint size = 64 bits, target on-bit density = 0.0):

In[]:=

Chemistry`MoleculeFingerprint[mol]

Out[]=

SparseArray

Specified elements: 1491

Dimensions: {2048}



Please note that as this is an undocumented function, it is subject to change in future versions. This fingerprint type can also be set explicitly by setting the FingerprintType option to RDKit:

In[]:=

Chemistry`MoleculeFingerprint[mol,FingerprintType"RDKit"]

Out[]=

SparseArray

Specified elements: 1491

Dimensions: {2048}



PubChem Fingerprint

The PubChem Fingerprint is a 881-bit-long binary fingerprint (ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.pdf). Similar to the MACCS keys, it uses a pre-defined fragment dictionary. The PubChem fingerprint for each compound in PubChem can be downloaded from PubChem. However, because they are base64-encoded, they should be decoded into binary bitstrings or bitvectors.

Details about how to decode base64 - encoded PubChem fingerprints is described on the last page of the PubChem Fingerprint specification. Below is a function that decodes a PubChem fingerprint into a SparseArray like used above, using the ByteArrayToBitList repository function:

In[]:=

PubChemFingerprintToVector[pcfpBase64_String]:=WithbitList=

[◼]	ByteArrayToBitList

@ByteArray[pcfpBase64],SparseArray@Take[bitList,{33,913}]

Here is a usage example:

In[]:=

pcfps="AAADcYBgAAAAAAAAAAAAAAAAAAAAAAAAAAAwAAAAAAAAAAABAAAAGAAAAAAACACAEAAwAIAAAACAACBCAAACAAAgAAAIiAAAAIgIICKAERCAIAAggAAIiAcAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA==";(*demomolecule*)PubChemFingerprintToVector[pcfps](*applythefunctiontothedemo*)

Out[]=

SparseArray

Specified elements: 43

Dimensions: {881}



Computation of similarity scores

Consider the following compounds:

In[]:=

cids=<|54454"Simvastatin (Zocor)",54687"Pravastatin (Pravachol)",60823"Atorvastatin (Lipitor)",446155"Fluvastatin (Lescol)",446157"Rosuvastatin (Crestor)",5282452"Pitavastatin (Livalo)",97938126"Lovastatin (Altoprev)"|>;

Let’s get the SMILES strings from PubChem:

In[]:=

smilesData=ServiceExecute["PubChem","CompoundProperties",{"CompoundID"Keys[cids],"Property"{"IsomericSMILES"}}]

Out[]=

CompoundID	IsomericSMILES
54454	CCC(C)(C)C(=O)O[C@H]1C[C@H](C=C2[C@H]1[C@H]([C@H](C=C2)C)CC[C@@H]3C[C@H](CC(=O)O3)O)C
54687	CC[C@H](C)C(=O)O[C@H]1C[C@@H](C=C2[C@H]1[C@H]([C@H](C=C2)C)CC[C@H](C[C@H](CC(=O)O)O)O)O
60823	CC(C)C1=C(C(=C(N1CC[C@H](C[C@H](CC(=O)O)O)O)C2=CC=C(C=C2)F)C3=CC=CC=C3)C(=O)NC4=CC=CC=C4
446155	CC(C)N1C2=CC=CC=C2C(=C1/C=C/[C@H](C[C@H](CC(=O)O)O)O)C3=CC=C(C=C3)F
446157	CC(C)C1=NC(=NC(=C1/C=C/[C@H](C[C@H](CC(=O)O)O)O)C2=CC=C(C=C2)F)N(C)S(=O)(=O)C
5282452	C1CC1C2=NC3=CC=CC=C3C(=C2/C=C/[C@H](C[C@H](CC(=O)O)O)O)C4=CC=C(C=C4)F
97938126	CC[C@H](C)C(=O)O[C@@H]1C[C@@H](C[C@@H]2[C@H]1[C@H]([C@H](C=C2)C)CC[C@@H]3C[C@H](CC(=O)O3)O)C

Now generate a new Dataset, containing a Molecule representation, the image of the molecule, and the corresponding MACCSKeys:

In[]:=

moleculeData=smilesDataAll,With{mol=Molecule[#IsomericSMILES]},<|"CompoundID"#CompoundID,"mol"mol,"img"MoleculePlot[mol],"MACCS"

[◼]

MACCSKeys

[mol]|>&

Out[]=

CompoundID

mol

img

MACCS

54454

Molecule

Formula:

Atoms: 30(68) Bonds: 32(70)



SparseArray

Specified elements: 44

Dimensions: {166}



54687

Molecule

Formula:

Atoms: 30(66) Bonds: 31(67)



SparseArray

Specified elements: 41

Dimensions: {166}



60823

Molecule

Formula:

Atoms: 41(76) Bonds: 44(79)



SparseArray

Specified elements: 61

Dimensions: {166}



446155

Molecule

Formula:

Atoms: 30(56) Bonds: 32(58)



SparseArray

Specified elements: 45

Dimensions: {166}



446157

Molecule

Formula:

Atoms: 33(61) Bonds: 34(62)



SparseArray

Specified elements: 68

Dimensions: {166}



5282452

Molecule

Formula:

Atoms: 31(55) Bonds: 34(58)



SparseArray

Specified elements: 38

Dimensions: {166}



97938126

Molecule

Formula:

Atoms: 29(67) Bonds: 31(69)



SparseArray

Specified elements: 38

Dimensions: {166}



Now let’s compute the pair-wise similarity scores among the MACCS using the “Tanimoto” dissimilarity measurement. As mentioned in the reading, this related to the JaccardDissimilarity (but distinct from the RogersTanimotoDissimilarity):

In[]:=

DistanceMatrix[moleculeData[All,"MACCS"],DistanceFunctionJaccardDissimilarity]

Out[]=

0,

,

,0,

,

,0,

,

,0,

,

,0,

,

,0,

,

,0

This is easier to see if we represent the fractions as decimals (to three digits) and in the form of a table, whose headings are the CompoundID numbers, and truncating the numbers to two digits of precision:

In[]:=

With[{headers=Normal@moleculeData[All,"CompoundID"]},TableForm[SetPrecision[1-%,2],TableHeadings{headers,headers}]]

Out[]//TableForm=

	54454	54687	60823	446155	446157	5282452	97938126
54454	1.0	0.81	0.35	0.37	0.30	0.39	0.86
54687	0.81	1.0	0.38	0.39	0.28	0.41	0.80
60823	0.35	0.38	1.0	0.66	0.54	0.50	0.36
446155	0.37	0.39	0.66	1.0	0.53	0.73	0.38
446157	0.30	0.28	0.54	0.53	1.0	0.47	0.29
5282452	0.39	0.41	0.50	0.73	0.47	1.0	0.43
97938126	0.86	0.80	0.36	0.38	0.29	0.43	1.0

By default, the similarity score is generated using the Tanimoto equation. Mathematica supports many other distance and similarity metrics, including Dice, Cosine, Sokal, Russel, Kulczynski, McConnaughey, and Tversky. The definition of these metrics is available at the LibreTexts page (https://bit.ly/2kx9NCd). Let’s compare a few of these, subtracting one to turn the dissimilarity and distance measures into the corresponding similarity measures:

In[]:=

With[{mols=Normal@moleculeData[{1,2},"MACCS"]},{#,1-#@@mols}&/@{JaccardDissimilarity,DiceDissimilarity,CosineDistance,SokalSneathDissimilarity}]//N//TableForm

Out[]//TableForm=

JaccardDissimilarity	0.808511
DiceDissimilarity	0.894118
CosineDistance	0.894675
SokalSneathDissimilarity	0.678571

The Tversky score is an asymmetric similarity measure, and its computation requires the weightings of the two molecules being compared. It is described on page 3 of this slide deck (https://www.rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf ) and can be implemented in the following way:

In[]:=

TverskySimilarity[v1_,v2_,a_,b_]:=With[{l12=v1.v2,l1=Total[v1],l2=Total[v2]},l2/(a*l1+b*l2+(1-a-b)*l12)]

As a demonstration, apply this to the first two entries in the moleculeData, as observe how the similarity changes as a function of the two values:

In[]:=

{v1,v2}=moleculeData[{1,2},"MACCS"]//Normal;Table[{alpha,1-alpha,TverskySimilarity[v1,v2,alpha,1-alpha]//N},{alpha,0,1,0.1}]//TableForm[#,TableHeadings{None,{"α","β","Similarity"}}]&

Out[]//TableForm=

α	β	Similarity
0.	1.	1.
0.1	0.9	0.992736
0.2	0.8	0.985577
0.3	0.7	0.97852
0.4	0.6	0.971564
0.5	0.5	0.964706
0.6	0.4	0.957944
0.7	0.3	0.951276
0.8	0.2	0.9447
0.9	0.1	0.938215
1.	0.	0.931818

Simplifying Fingerprint Comparison

The function repository contains a MoleculeFingerprintSimilarity function that can facilitate some common similarity calculations. Let’s use the example statin drugs defined in the smilesData variable above. We will start by converting these into a list of Molecule representations:

In[]:=

examples=Molecule/@(Normal@smilesData[All,"IsomericSMILES"])

Out[]=

Molecule

Formula:

Atoms: 30(68) Bonds: 32(70)

,Molecule

Formula:

Atoms: 30(66) Bonds: 31(67)

,Molecule

Formula:

Atoms: 41(76) Bonds: 44(79)

,Molecule

Formula:

Atoms: 30(56) Bonds: 32(58)

,Molecule

Formula:

Atoms: 33(61) Bonds: 34(62)

,Molecule

Formula:

Atoms: 31(55) Bonds: 34(58)

,Molecule

Formula:

Atoms: 29(67) Bonds: 31(69)



The MoleculeFingerprintSimilarity function takes two Molecules as input, and has optional parameters for specifying the FingerprintType and SimilarityMeasure to use. To follow the example discussed earlier in this section, we will use the “MACCSKeys” and “Tanimoto” similarity, respectively. The below result should mimic the second element in the table of results presented earlier in this section:

In[]:=

[◼]	MoleculeFingerprintSimilarity

[examples[[1]],examples[[2]],"FingerprintType""MACCSKeys","SimilarityMeasure"->"Tanimoto"]

Out[]=

0.8125

We can reproduce the entire table by defining an appropriate similarity function and then creating a DistanceMatrix of results using that function, reproducing the results shown earlier in this section:

maccsDistance[m1_,m2_]:=

[◼]	MoleculeFingerprintSimilarity

[m1,m2,"FingerprintType""MACCSKeys","SimilarityMeasure"->"Tanimoto"]DistanceMatrix[examples,DistanceFunctionmaccsDistance]//TableForm

Out[]//TableForm=

1.	0.8125	0.35443	0.378788	0.306818	0.4	0.866667
0.8125	1.	0.386667	0.396825	0.287356	0.421053	0.8
0.35443	0.386667	1.	0.661538	0.534884	0.507463	0.364865
0.378788	0.396825	0.661538	1.	0.526316	0.734694	0.393443
0.306818	0.287356	0.534884	0.526316	1.	0.472973	0.297619
0.4	0.421053	0.507463	0.734694	0.472973	1.	0.444444
0.866667	0.8	0.364865	0.393443	0.297619	0.444444	1.

Other fingerprint types (including the RDKit and Morgan fingerprints) and similarity measures are supported; See the MoleculeFingerprintSimilarity documentation for more details. On the other hand, as this does not include the PubChem fingerprints, so for the exercise below you will have to do it yourself, as described in the first part of this section.

Exercise

Exercise 2a: Compute the Tanimoto similarity scores between the seven statin compounds used in this section, using the PubChem fingerprints.

◼

Download the PubChem Fingerprint for the seven CIDs.

◼

Convert the downloaded fingerprints into bit vectors.

◼

Compute the pair-wise Tanimoto scores using the bit vectors.

(*Writeyourcodeinthiscell*)

Interpretation of similarity scores

Using molecular fingerprints. we can compute the similarity scores between molecules. However, how should these scores be interpreted? For example, the Tanimoto score between CID 60823 and CID 446155 is computed to be 0.66, but does it mean that the two compounds are similar? How similar is similar? The following analysis would help answer these questions.

Step 1. Randomly select 1,000 compounds from PubChem and download their SMILES strings:

In[]:=

RandomSeed[1841];cidMax=138962044;(*ThemaximumCIDinPubChemasofSeptember2019*)cids=RandomInteger[{1,cidMax},1000];

Then define a function to request these CompoundID values from PubChem, storing the results:

request[cids_List,chunkSize_Integer:200]:=With{cidsBatches=Partition[cids,UpTo[chunkSize]],lookupSMILES=ServiceExecute["PubChem","CompoundProperties",{"CompoundID"#,"Property""IsomericSMILES"}]&},Join@@

[◼]	MapBatched

[lookupSMILES,cidsBatches,Pause[0.2],1](*pauseevery1item*)smiles=request[cids]

Out[]=

CompoundID	IsomericSMILES
78613761	C1=CC=C2C(=C1)C=CC=C2C3=NC(=CC4=C(C=C(C=C4)[N+](=O)[O-])[N+](=O)[O-])C(=O)O3
7225364	CC(C)(C)C1=CC=C(C=C1)CSC2=NC(=CN2N)C3=CC=CC=C3
51397103	C[C@@H]1C[C@H]2[C@@H]3CCC4=CC(=O)C=C[C@@]4([C@@]3([C@@H](C[C@@]2([C@@]1(C(=O)CCl)OC(=O)C5=CC=CO5)C)O)Cl)C
102692256	CC(C)CN1CCN(CC1)S(=O)(=O)C2=CC=NN2
121605591	C1=CC(=CC=C1[C@@H]2[C@H]([C@]2(CO)N)C(F)(F)F)O
43407166	COC(=O)C(CO)NC(=O)C1CCN(CC1)C(=O)C2CC2
110035414	CC(C1=CC=CC=C1)NC(=NCC(=O)N(C)C)NCCCF.I
130750249	CC1=CSC(=C1)C(C2CC2)O
91852757	CC(=O)N[C@@H]1[C@H]([C@@H]([C@H](O[C@H]1O)CO)O[C@H]2[C@@H]([C@H]([C@H]([C@H](O2)CO)O[C@H]3[C@@H]([C@H]([C@H]([C@H](O3)CO)O)O)O)O)NC(=O)C)O
35751898	CC1=C(C=C(C=C1)NC2=NC(=CS2)C[C@H](C)N)C
49325579	C1=CC=C2C(=C1)C3=C(O2)C=C(C=C3)NC(=O)/C=C/C4=CSC=N4
17649403	CC(C(=O)NC1=CC=CC=C1C(=O)NCCC2=CC=CC=C2)OC3=CC=CC4=CC=CC=C43
11448279	C1=CC=C(C(=C1)C(=O)/C=C/C2=CC=CS2)C(=O)O
122959629	CC1=C(C=CC=C1Cl)N2C(=O)C(=C(C2=O)SC3=CC=CC=C3)NC4=CC=C(C=C4)N5CCCC5
117942558	CC(CC1=CC=C(C=C1)OCCCN2C=C(C=N2)OC)C3=C(C=C(C=C3)F)C4=NC(=CC=C4)N5C(=C(C=N5)C(=O)O)C(F)(F)F
6302930	CCOCCN\1C2=C(C=C(C=C2)S(=O)(=O)N)S/C1=N\C(=O)C3=CC=CC=C3F
73622998	CC1=CC(=CC(=C1)OCC2=CC=C(O2)C=CC(=O)C3=CC(=CC=C3)N4C=CC=C4)C
50968555	CC1=NC=C(N1)C(=O)N[C@@H]2C[C@H](N(C2)C)CO
128382972	CCC1=NSC(=N1)N2CCCC(C2)F
104354527	CC12CC3CC(C1)CC(C3)(C2)C(=O)NCCCl
rows 1–20 of 1000

Step 2. Generate the MACCSKeys for each compound:

In[]:=

maccs=ParallelMap

[◼]

MACCSKeys

,Normal@smiles[All,"IsomericSMILES"];

(Here we use ParallelMap to use more than one processor on your computer to accelerate the calculation; but this could be replaced with an “ordinary” Map.)

Step 3. Compute the Tanimoto scores between compounds. It is only necessary to compute the similarity between each unique pair once:

In[]:=

similarity[u_,v_]:=Map[1.-JaccardDissimilarity[u,#]&,v]scores=ParallelMap[similarity[maccs[[#]],maccs[[#+1;;-1]]]&,Range[999]]//Flatten;

Step 4. Generate a histogram that shows the distribution of the pair-wise scores:

In[]:=

Histogram[scores,Automatic,#,PlotLabel#]&/@{"PDF","CDF"}

Out[]=





We can also define a CDF using the empirical distribution of observed values:

In[]:=

cdf=CDF@EmpiricalDistribution[scores]

Out[]=

Function{x.$},CDFDataDistribution

Type: Empirical

Data points: 499500

,x.$,{Listable}

The resulting function can be used like any other function:

In[]:=

Plot[cdf[x],{x,0,1}]

Out[]=

We can use this to compute the percent of observed samples that are above a threshold value of similarity:

In[]:=

TableForm[Table[{SetPrecision[similarity,2],SetPrecision[1-cdf[similarity],4]},{similarity,0,1,0.05}],TableHeadings{None,{"Similarity","% above"}}]

Out[]//TableForm=

Similarity	% above
0	0.9997
0.050	0.9962
0.10	0.9816
0.15	0.9458
0.20	0.8798
0.25	0.7804
0.30	0.6529
0.35	0.5013
0.40	0.3443
0.45	0.2143
0.50	0.1121
0.55	0.05606
0.60	0.02201
0.65	0.007736
0.70	0.002120
0.75	0.0004505
0.80	0.00008408
0.85	0.00001802
0.90	4.004× -6 10
0.95	0
1.0	0

What is the average similarity score?

In[]:=

Mean[scores]

Out[]=

0.350967

From the distribution of the similarity scores among 1,000 compounds, we observe the following:

◼

If you randomly select two compounds from PubChem, the similarity score between them (computed using the Tanimoto equation and MACCS keys) is ~0.35 on average.

◼

About 5% of randomly selected compound pairs have a similarity score greater than 0.55.

◼

About 1% of randomly selected compound pairs have a similarity score greater than 0.65.

If two compounds have a Tanimoto score of 0.35, it is close to the average Tanimoto score between randomly selected compounds and there is < 50% chance that you will get a score of 0.35 or greater just by selecting two compounds randomly from PubChem. Therefore, it is reasonable to consider the two compounds are not similar.

The Tanimoto index may have a value ranging from 0 (for no similarity) to 1 (for identical molecules) and the midpoint of this value range is 0.5. Because of this, a Tanimoto score of 0.55 may not sound great enough to consider two compounds to be similar. However, according to the score distribution curve generated here, less than 5% of randomly selected compound pairs will have a score greater than this.

In the previous section, we computed the similarity scores between some cholesterol-lowering drugs, and CID 60823 and CID 446155 had a Tanimoto score of 0.66. Based on the score distribution curve generated in the second section, we can say that the probabiity of two randomly selected compounds from PubChem having a Tanimoto score greater than 0.66 is less than 1%.

You have probably taken standardized tests where you have been told your ranking was in the X-th percentile. What that means is that your score was greater than X% of the other scores, or (stated another way) that you received one of the top (100-X)% of scores. We can generate arbitrary X-iles from the list of scores using the Quantile function. For example, to find out the similarity score corresponding to the 97-th percentile (top 3% of similarity scores):

In[]:=

Quantile[scores,0.97]

Out[]=

0.585714

Exercise

Exercise 3a: In this exercise, we want to generate the distribution of the similarity scores among 1,000 compounds randomly selected from PubChem, using different molecular fingerprints and similarity metrics.
For molecular fingerprints, use the following:

◼

PubChem Fingerprint

◼

MACCS keys

◼

RDKit Fingerprints

For similarity metrics, use the following:

◼

Tanimoto similarity (i.e., one minus JaccardDissimilarity)

◼

Dice similarity

◼

Cosine similarity

Generate six (6) distribution curves for these pairwise combinations .

◼

For each distribution curve, determine the similarity score threshold so that 1% of the compound pairs have a similarity score greater than or equal to this threshold.

◼

Use ResourceFunction["MACCSKeys"] to generate the MACCS keys, use Chemistry`MoleculeFingerprint[mol] for the RDKit fingerprints, and download the PubChem fingerprints from PubChem.

◼

You may use the ResourceFunction["MoleculeFingerprintSimilarity"] function for the MACCS and RDKit fingerprints if you wish, but note that it does not support the PubChem fingerprints.

◼

For reproducibility, use SeedRandom[2019] before you generate random CIDs.

Step 1: Generate 1,000 random CIDs, download the isomeric SMILES for them, and create Molecule objects from the downloaded SMILES strings.

(*Writeyourcodeinthiscell*)

Step 2: Generate the fingerprints, compute the similarity scores, determine similarity thresholds, and make histograms.

(*Writeyourcodeinthiscell*)

Attributions

Adapted from the corresponding OLCC 2019 Python Assignment:
https://chem.libretexts.org/Courses/Intercollegiate_Courses/Cheminformatics_OLCC_ (2019)/6%3 A_Molecular _Similarity/6.4%3 A_Python _Assignment

Cite this as: Joshua Schrier, "06 Molecular Similarity" from the Notebook Archive (2020), https://notebookarchive.org/2020-10-ebo6m3k