04 Structure Search
Author: Joshua Schrier
Category: Educational Materials
Keywords: cheminformatics, Chemoinformatics, chemical information, PubChem, quantitative structure, property relationships, QSPR, machine learning, computer-aided drug design, chemistry
URL: http://www.notebookarchive.org/2020-10-ebo2bs0/
DOI: https://notebookarchive.org/2020-10-ebo2bs0
Date Added: 2020-10-31
Date Last Modified: 2020-10-31
File Size: 0.94 megabytes
Rights: CC BY-NC-SA 4.0
Structure Search
NOTE: This is a 2-week assignment.
Objectives
◼ Learn various types of structure searches, including identity search, similarity search, and substructure and superstructure searches.
◼ Learn the optional parameters available for each search type.
Using PUG-REST, one can perform various types of structure searches (https://bit.ly/2lPznCo), including:
◼ identity search
◼ similarity search
◼ super/substructure search
◼ molecular formula search
As explained in a PubChem paper (https://bit.ly/2kirxky), a structure search can be performed either asynchronously or synchronously; the synchronous approach is highly recommended.
The synchronous searches are invoked by using the keywords prefixed with ‘fast’, such as fastidentity, fastsimilarity_2d, fastsimilarity_3d, fastsubstructure, fastsuperstructure, and fastformula.
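As a rough sketch of how these keywords fit into a request, the helper below assembles the PUG-REST URL for any of the SMILES-based fast search types. The names fastSearchTypes and fastSearchURL are illustrative assumptions, not part of PUG-REST or of this assignment; fastformula is left out because the formula goes into the URL path itself rather than being sent as a SMILES input.

```wolfram
(* Hypothetical helper: build the URL for a synchronous ("fast") structure search. *)
(* The base URL and the path layout follow the requests used in this notebook. *)
fastSearchTypes = {"fastidentity", "fastsimilarity_2d", "fastsimilarity_3d",
   "fastsubstructure", "fastsuperstructure"};

fastSearchURL[searchType_String /; MemberQ[fastSearchTypes, searchType],
  params_List : {}] :=
 URLBuild[
  {"https://pubchem.ncbi.nlm.nih.gov/rest/pug",
   "compound/" <> searchType <> "/smiles/cids/txt"},
  params]

(* Example: URL for a 2-D similarity search with a threshold of 95 *)
fastSearchURL["fastsimilarity_2d", {"Threshold" -> 95}]
```

The query structure itself is not part of the URL; it is sent in the body of a POST request, as in the cells that follow.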
Identity Search
Using PUG-REST
PUG-REST allows you to search the PubChem Compound database for molecules identical to the query molecule. PubChem’s identity search supports different contexts of chemical identity, which the user can specify using the optional parameter, “identity_type”. Here are some commonly-used chemical identity contexts.
◼ same_connectivity: returns compounds with the same atom connectivity as the query molecule, ignoring stereochemistry and isotope information.
◼ same_isotope: returns compounds with the same isotopes (as well as the same atom connectivity) as the query molecule. Stereochemistry will be ignored.
◼ same_stereo: returns compounds with the same stereochemistry (as well as the same atom connectivity) as the query molecule. Isotope information will be ignored.
◼ same_stereo_isotope: returns compounds with the same stereochemistry AND isotope information (as well as the same atom connectivity). This is the default.
The following code cells demonstrate how these different contexts of chemical sameness affect identity search in PubChem.
We begin by defining a function to perform a sameness comparison given one of these types and a SMILES string to search; the function is defined in a general way, but with default values of the arguments corresponding to an example query about the molecule “C(/C=C/Cl)Cl”:
In[]:=
samenessComparison[CIDType_String:"same_stereo_isotope",SMILES_String:"C(/C=C/Cl)Cl"]:=
 With[{url=URLBuild[
     {"https://pubchem.ncbi.nlm.nih.gov/rest/pug","compound/fastidentity/smiles/property/isomericsmiles/csv"},
     {"identity_type"->CIDType}]},
  (* POST the SMILES string and import the CSV response *)
  URLExecute[HTTPRequest[url,<|"Method"->"POST","Body"->{"smiles"->SMILES}|>],"CSV"]]
The function returns a list of lists of the CIDs and SMILES strings:
In[]:=
samenessComparison["same_stereo"]
Out[]=
{{CID,IsomericSMILES},{24726,C(/C=C/Cl)Cl},{102602172,[2H]/C(=C(/[2H])\Cl)/C([2H])([2H])Cl}}
It is often convenient to work with this in the form of a Dataset; the function repository provides a DatasetWithHeaders function that interprets the first row as a set of headers and constructs the relevant Dataset. Begin by retrieving the relevant function:
In[]:=
ResourceFunction["DatasetWithHeaders"]
Apply this function to our results:
In[]:=
ResourceFunction["DatasetWithHeaders"][samenessComparison["same_stereo"]]
Next, we want to append images of the molecules generated using the SMILES strings contained in the IsomericSMILES column. To do this, we define a function:
In[]:=
appendImage[data_Dataset]:=With[{graphicsFunction=MoleculePlot@Molecule[#["IsomericSMILES"]]&},data[All,Append[#,"image"->graphicsFunction[#]]&]]
(* example *)
appendImage@ResourceFunction["DatasetWithHeaders"]@samenessComparison["same_stereo"]
Let’s take a look at the results for each of the possible query types by constructing an Association from the query name to the Dataset of results generated as above:
In[]:=
AssociationMap[appendImage@ResourceFunction["DatasetWithHeaders"]@samenessComparison[#]&,{"same_stereo_isotope","same_stereo","same_isotope","same_connectivity"}]//Dataset
Using the PubChem Service Connection
As introduced in the previous exercises, the PubChem Service Connection provides a convenient wrapper for dealing with PubChem that hides many of the details of the PUG-REST API. Recall that we can request a compound ID by providing the SMILES strings:
In[]:=
cids=ServiceExecute["PubChem","CompoundCID",{"SMILES"->"C(/C=C/Cl)Cl"}]
The “Requests” section of the documentation describes the various options. The type of CID returned is specified by the “CIDType” option. For example, to return the “same_stereo” matches shown above:
In[]:=
ServiceExecute["PubChem","CompoundCID",{"SMILES"->"C(/C=C/Cl)Cl","CIDType"->"SameStereo"}]
Just like the manual PUG-REST calls, matches can be restricted to compounds with the same stereochemistry, isotopes or connectivity. The PUG-REST names are written in the typical Mathematica CamelCase string style, without underscores. The only exception is the PUG-REST “same_stereo_isotope”, which is specified as “Original” in Mathematica (this is reasonable, as “same_stereo_isotope” does in fact restrict the search to the originally specified SMILES string, and is the default setting if no “CIDType” option is specified). Below is the parallel example to the PUG-REST example above (for brevity, we will not bother to draw the structures):
In[]:=
AssociationMap[ServiceExecute["PubChem","CompoundCID",{"SMILES"->"C(/C=C/Cl)Cl","CIDType"->#}]&,{"Original","SameStereo","SameIsotopes","SameConnectivity"}]//Dataset
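The naming correspondence described above can be written down as a small lookup table. This is only a sketch: the names cidTypeFromIdentityType and serviceIdentitySearch are hypothetical and not part of the PubChem Service Connection; the option values themselves are the ones used in this notebook.

```wolfram
(* PUG-REST identity_type values mapped to the Service Connection "CIDType" values *)
cidTypeFromIdentityType = <|
   "same_stereo_isotope" -> "Original",
   "same_stereo" -> "SameStereo",
   "same_isotope" -> "SameIsotopes",
   "same_connectivity" -> "SameConnectivity"|>;

(* Hypothetical wrapper: run the Service Connection search using a PUG-REST style name *)
serviceIdentitySearch[smiles_String, identityType_String : "same_stereo_isotope"] :=
 ServiceExecute["PubChem", "CompoundCID",
  {"SMILES" -> smiles, "CIDType" -> cidTypeFromIdentityType[identityType]}]
```

With this in place, serviceIdentitySearch["C(/C=C/Cl)Cl", "same_stereo"] would issue the same request as the “SameStereo” call shown earlier.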
Exercise:
Exercise 1a: Find compounds that have the same atom connectivity and isotope information as the query molecule:
In[]:=
query="CC1=CN=C(C(=C1OC)C)C[S@](=O)C2=NC3=C(N2)C=C(C=C3)OC";
For each compound returned from the search, retrieve the following information:
◼ CID
◼ Isomeric SMILES string
◼ chemical synonyms (for simplicity, print only the first five synonyms in the name list retrieved for each compound)
◼ Structure image (you may either retrieve this from PUG-REST or use MoleculePlot to compute it from the Isomeric SMILES string)
In[]:=
(* Write your code in this cell. *)
Similarity search
2D- and 3D- Similarity Searching using PUG-REST
PubChem supports 2-dimensional (2-D) and 3-dimensional (3-D) similarity searches. Because molecular similarity is not a measurable physical observable but a subjective concept, many approaches have been developed to evaluate it. For a detailed discussion of how PubChem quantifies molecular similarity, read the following LibreTexts page: Searching PubChem Using a Non-Textual Query (https://bit.ly/2lPznCo).
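For 2-D similarity, PubChem scores pairs of molecules with the Tanimoto coefficient computed over binary substructure fingerprints. The sketch below shows only that arithmetic on two short made-up bit vectors. The function name tanimoto and the example vectors are illustrative assumptions, and real PubChem fingerprints are much longer.

```wolfram
(* Tanimoto coefficient of two equal-length binary fingerprints: *)
(* (number of bits set in both) / (number of bits set in either) *)
tanimoto[a : {(0 | 1) ..}, b : {(0 | 1) ..}] /; Length[a] == Length[b] :=
 Total[BitAnd[a, b]]/Total[BitOr[a, b]]

(* Made-up 8-bit fingerprints, for illustration only *)
fpA = {1, 1, 0, 1, 0, 0, 1, 0};
fpB = {1, 0, 0, 1, 0, 1, 1, 0};

tanimoto[fpA, fpB]  (* 3 bits set in both, 5 bits set in either: 3/5 *)
```

The sketch does not guard against two all-zero fingerprints, for which the ratio is undefined.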
The code cell below demonstrates how to perform 2-D and 3-D similarity searches:
In[]:=
prolog="https://pubchem.ncbi.nlm.nih.gov/rest/pug";
myData=<|"smiles"->"C1COCC(=O)N1C2=CC=C(C=C2)N3C[C@@H](OC3=O)CNC(=O)C4=CC=C(S4)Cl"|>;
url=URLBuild[{prolog,"/compound/fastsimilarity_2d/smiles/cids/txt"},{"Threshold"->99}]
cids=URLExecute[HTTPRequest[url,<|"Method"->"POST","Body"->myData|>],"CSV"]
Print["# Number of CIDs: ",Length[cids]]
Out[]=
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_2d/smiles/cids/txt?Threshold=99
Out[]=
{{9875401},{6433119},{11524901},{68152323},{25255944},{25190310},{25164166},{145624246},{145624236},{144489887},{143433422},{123868009},{56598114},{56589668},{11994745},{25190129},{25190130},{25190186},{25190187},{25190188},{25190189},{25190190},{25190248},{25190249},{25190250},{25190251},{25190252},{25190311},{25255845},{25255945},{25255946},{49849874},{133687098}}
# Number of CIDs: 33
It is worth mentioning that the parameter name “Threshold” is case-sensitive. If “threshold” is used (rather than “Threshold”), it will be ignored and the default value (0.90) will be used for the parameter. As a matter of fact, all optional parameter names in PUG-REST are case-sensitive, as demonstrated below:
In[]:=
(* Construct URLs for both cases *)
url=AssociationMap[URLBuild[{prolog,"/compound/fastsimilarity_2d/smiles/cids/txt"},{#->95}]&,{"Threshold","threshold"}]
(* Construct the request, execute it, and compute the number of results for each URL *)
Map[Length@URLExecute[HTTPRequest[#,<|"Method"->"POST","Body"->myData|>],"CSV"]&,url]
Out[]=
<|Threshold->https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_2d/smiles/cids/txt?Threshold=95,threshold->https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_2d/smiles/cids/txt?threshold=95|>
Out[]=
<|Threshold->227,threshold->1054|>
It is possible to run 3-D similarity search using PUG-REST. However, because 3-D similarity search takes much longer than 2-D similarity search, it often exceeds the 30-second time limit and returns a time-out error, especially when the query molecule is big.
In addition, for 3-D similarity search, it is not possible to adjust the similarity threshold (that is, the optional “Threshold” parameter does not work). 3-D similarity search uses a shape-Tanimoto (ST) of >=0.80 and a color-Tanimoto (CT) of >=0.50 as a similarity threshold. Read the LibreTexts page for more details (https://bit.ly/2lPznCo).
In[]:=
myData=<|"smiles"->"CC(=O)OC1=CC=CC=C1C(=O)O"|>;
url=URLBuild[{prolog,"compound/fastsimilarity_3d/smiles/cids/txt"}];
Length@URLExecute[HTTPRequest[url,<|"Method"->"POST","Body"->myData|>],"CSV"]
Out[]=
2496
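The fixed 3-D criterion quoted above (ST >= 0.80 and CT >= 0.50) can be stated as a simple predicate. This is only an illustration of the rule: the function name similar3DQ and the example scores are made up, and the request above returns CIDs only, not the ST and CT scores themselves.

```wolfram
(* 3-D similarity criterion: shape-Tanimoto >= 0.80 AND color-Tanimoto >= 0.50 *)
similar3DQ[st_?NumericQ, ct_?NumericQ] := st >= 0.80 && ct >= 0.50

(* Hypothetical score pairs, for illustration only *)
similar3DQ[0.85, 0.55]  (* True: both thresholds are met *)
similar3DQ[0.85, 0.40]  (* False: color-Tanimoto is below 0.50 *)
```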
2D- and 3D- Similarity Searching using the PubChem Service Connection
The PubChem Service Connection provides a convenience function for 2D similarity searching. Repeating the query from above:
In[]:=
example="C1COCC(=O)N1C2=CC=C(C=C2)N3C[C@@H](OC3=O)CNC(=O)C4=CC=C(S4)Cl";
ServiceExecute["PubChem","CompoundCID",{"SMILES"->example,Method->"Similarity2DSearch","Threshold"->99}]
This returns the same number of compounds found above:
In[]:=
%["CompoundID"]//Length
Out[]=
33
Similarly, it supports 3D similarity searching (repeating the example above):
In[]:=
example="CC(=O)OC1=CC=CC=C1C(=O)O";
result=ServiceExecute["PubChem","CompoundCID",{"SMILES"->example,Method->"Similarity3DSearch"}];
result["CompoundID"]//Length
Out[]=
2496
Exercise:
Exercise 2a: Perform 2-D similarity search with the following query, using a threshold of 0.80 and find the macromolecule targets of the assays in which the returned compounds were tested. You will need to take these steps:
◼ Run 2-D similarity search using the SMILES string as a query (with Threshold=80).
◼ Retrieve the AIDs in which any of the returned CIDs was tested “active”.
◼ Retrieve the gene symbols of the targets for the returned AIDs.
In[]:=
(* Write your code in this cell. *)
(* Example query to use *)
query="[C@@H]23C(=O)[C@H](N)C(C)[C@H](CCC1=COC=C1)[C@@]2(C)CCCC3(C)C";
Substructure/Superstructure search
When a chemical structure occurs as part of a bigger chemical structure, the former is called a substructure and the latter a superstructure (https://bit.ly/2lPznCo). PUG-REST supports both substructure and superstructure searches. The example below runs a substructure search using the core structure of the cephalosporin class of antibiotics as the query (https://en.wikipedia.org/wiki/Cephalosporin). As usual, we construct a URL defining the REST call for the search, and perform a POST operation to retrieve the results:
In[]:=
query="C12(SCC(=C(N1C([C@H]2NC(=O)[*])=O)C(=O)O[H])[*])[H]";
myData=<|"smiles"->query|>;
url=URLBuild[{prolog,"compound/fastsubstructure/smiles/cids/txt"},{"Stereo"->"exact"}]
URLExecute[HTTPRequest[url,<|"Method"->"POST","Body"->myData|>],"CSV"]//Length (* Number of CIDs *)
Out[]=
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsubstructure/smiles/cids/txt?Stereo=exact
Out[]=
25573
In[]:=
ServiceExecute["PubChem","CompoundCID",{"SMILES"->query,Method->"SubstructureSearch","Stereo"->"Exact"}]
It is important to remember that if the query structure is not specific enough (for example, too small a fragment), a substructure search will return more matches than the PubChem server can handle. For example, a substructure search using “C-C” as a query will return an error, because PubChem has ~96 million (organic) compounds with more than two carbon atoms and most of them contain the “C-C” unit. Therefore, if you get a “time-out” error during a substructure search, consider providing a more specific structure as the input query.
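One way to notice such failures in code is to inspect the HTTP status before parsing the response. The following is a minimal sketch under stated assumptions: parseCIDs and runSearch are hypothetical helper names, the response is assumed to be the plain-text list of CIDs (one per line) requested with cids/txt, and any status other than 200 is simply treated as a failed search.

```wolfram
(* Hypothetical helper: turn a plain-text CID listing (one CID per line) into integers *)
parseCIDs[body_String] := FromDigits /@ StringSplit[body]

(* Hypothetical helper: execute a request and return the CIDs, or a Failure object *)
(* when the server does not answer with HTTP status 200 (for example, on a time-out) *)
runSearch[request_HTTPRequest] :=
 Replace[URLRead[request], {
   response_HTTPResponse /; response["StatusCode"] === 200 :>
    parseCIDs[response["Body"]],
   other_ :> Failure["PubChemRequestFailed", <|"Response" -> other|>]}]

(* The parsing step can be checked without any network access *)
parseCIDs["24726\n102602172\n"]  (* {24726, 102602172} *)
```

A call such as runSearch[HTTPRequest[url,<|"Method"->"POST","Body"->myData|>]] then yields either a list of CIDs or a Failure that can be tested with FailureQ.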
Exercise
Exercise 3a: Below is the SMILES string for an HCV (Hepatitis C Virus) drug (Sovaldi). Perform a substructure search using this SMILES string as a query, identify compounds that are mentioned in patent documents, and create a list of the patent documents that mention them.
◼ Use the default options for substructure search.
◼ Use the “XRefs” operation to retrieve Patent IDs associated with the returned compounds.
◼ For simplicity, ignore the CID-Patent ID mapping. (That is, no need to track which CID is associated with which patent document.)
In[]:=
(* Write your code in this cell. *)
(* Example query to use *)
query="C[C@@H](C(=O)OC(C)C)N[P@](=O)(OC[C@@H]1[C@H]([C@@]([C@@H](O1)N2C=CC(=O)NC2=O)(C)F)O)OC3=CC=CC=C3";
Molecular formula search
Strictly speaking, molecular formula search is not structure search, but its PUG-REST request URL is constructed in a similar way to structure searches like identity, similarity, and substructure/superstructure searches:
In[]:=
query="C22H28FN3O6S"; (* Molecular formula for Crestor (Rosuvastatin: CID 446157) *)
url=URLBuild[{prolog,"compound/fastformula",query,"cids/txt"}]
URLExecute[url,"CSV"]//Length (* Number of CIDs *)
Out[]=
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastformula/C22H28FN3O6S/cids/txt
Out[]=
197
It is possible to allow other elements to be present in addition to those specified by the query formula, by setting the AllowOtherElements option:
In[]:=
url=URLBuild[{prolog,"compound/fastformula",query,"cids/txt"},{"AllowOtherElements"->"true"}]
URLExecute[url,"CSV"]//Length (* Number of CIDs *)
Out[]=
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastformula/C22H28FN3O6S/cids/txt?AllowOtherElements=true
Out[]=
220
Both types of queries can also be performed using Mathematica’s PubChem Service Connection, by specifying a “Formula” and setting the Method to “FormulaSearch”:
In[]:=
ServiceExecute["PubChem","CompoundCID",{"Formula"->query,Method->"FormulaSearch"}]
In[]:=
ServiceExecute["PubChem","CompoundCID",{"Formula"->query,Method->"FormulaSearch","AllowOtherElements"->True}]
Exercise
Exercise 4a: The general molecular formula for alcohols is CnH(2n+2)O [for example, CH4O (methanol), C2H6O (ethanol), C3H8O (propanol), etc]. Run molecular formula search using this general formula for n=1 through 20 and retrieve the XLogP values of the returned compounds for each value of n. Print the minimum and maximum XLogP values for each n value.
In[]:=
(* Write your code in this cell. *)
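As a possible starting point for Exercise 4a, the general formula CnH(2n+2)O can be turned into query strings as sketched below. The function name alcoholFormula is an illustrative assumption. Writing a single carbon as “C” rather than “C1” simply follows the examples given in the exercise, and the search and XLogP retrieval steps are deliberately left out.

```wolfram
(* Hypothetical helper: molecular formula string CnH(2n+2)O for a positive integer n *)
alcoholFormula[n_Integer?Positive] :=
 "C" <> If[n == 1, "", ToString[n]] <> "H" <> ToString[2 n + 2] <> "O"

alcoholFormula /@ {1, 2, 3}  (* {"CH4O", "C2H6O", "C3H8O"} *)
```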
Attributions
Adapted from the corresponding OLCC 2019 Python Assignment:
https://chem.libretexts.org/Courses/Intercollegiate_Courses/Cheminformatics_OLCC_(2019)/4._Searching_Databases_for_Chemical_Information/4.6%3A_Python_Assignments
Cite this as: Joshua Schrier, "04 Structure Search" from the Notebook Archive (2020), https://notebookarchive.org/2020-10-ebo2bs0