02B Interconversion Between PubChem Records
Author: Joshua Schrier
Title: 02B Interconversion Between PubChem Records
Description: Depositor-provided records (i.e., substances) that are standardized to a given compound.
Category: Educational Materials
Keywords: cheminformatics, Chemoinformatics, chemical information, PubChem, quantitative structure, property relationships, QSPR, machine learning, computer-aided drug design, chemistry
URL: http://www.notebookarchive.org/2020-10-ebnvwao/
DOI: https://notebookarchive.org/2020-10-ebnvwao
Date Added: 2020-10-31
Date Last Modified: 2020-10-31
File Size: 0.92 megabytes
Rights: CC BY-NC-SA 4.0



Interconversion Between PubChem Records
PUG-REST can be used to retrieve PubChem records related to other PubChem records. PUG-REST takes an input list of records in one of the three PubChem databases (Compound, Substance, and BioAssay) and returns a list of the related records in the same or a different database. The meaning of the relationship between the input and output records may be specified using an optional parameter. This makes various tasks possible, including (but not limited to) retrieving:
◼ Depositor-provided records (i.e., substances) that are standardized to a given compound.
◼ Mixture compounds that contain a given component compound.
◼ Stereoisomers/isotopomers of a given compound.
◼ Compounds that are tested to be active in a given assay.
◼ Compounds that have similar structures to a given compound.
Getting depositor-provided records for a given compound
The code snippet below retrieves the substance record associated with a given CID (CID 129825914):
In[]:=
prolog="https://pubchem.ncbi.nlm.nih.gov/rest/pug";
prInput="compound/cid/129825914";
prOper="sids";
prOutput="txt";
url=URLBuild[{prolog,prInput,prOper,prOutput}]
URLExecute[url]
Out[]=
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/129825914/sids/txt
Out[]=
341669951
It is also possible to provide a comma separated list of CIDs as input identifiers:
In[]:=
pugin="compound/cid/129825914,129742624,129783988";
pugoper="sids";
pugout="txt";
url=URLBuild[{prolog,pugin,pugoper,pugout}]
URLExecute[url]
Out[]=
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/129825914%2C129742624%2C129783988/sids/txt
Out[]=
{{341669951},{341492923},{341577059},{345261280},{368769438}}
Observe that Mathematica separates this into a list of lists, which can be Flatten-ed into a single list if desired:
In[]:=
Flatten[%]
Out[]=
{341669951,341492923,341577059,345261280,368769438}
In the example above, the input list has three CIDs, but the PUG-REST request returned five SIDs. This means that at least one CID must be associated with multiple SIDs, but it is hard to see which CID corresponds to which SIDs. Therefore, we want the SIDs grouped by the corresponding CIDs. This can be done using the optional parameter “list_return=grouped” and changing the output format to json.
In[]:=
pugin="compound/cid/129825914,129742624,129783988";
pugoper="sids";
pugout="json";
url=URLBuild[{prolog,pugin,pugoper,pugout},{"list_return"->"grouped"}]
URLExecute[url]
Out[]=
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/129825914%2C129742624%2C129783988/sids/json?list_return=grouped
Out[]=
{InformationList->{Information->{{CID->129825914,SID->{341669951}},{CID->129742624,SID->{341492923}},{CID->129783988,SID->{341577059,345261280,368769438}}}}}
Mathematica automatically recognizes that the returned data is a JSON string and converts it into nested rules and lists. This conversion can be controlled by the third argument to URLExecute. For example, “RawJSON” reads the JSON object into associations (rather than rules):
In[]:=
result=URLExecute[url,{},"RawJSON"]
Out[]=
<|InformationList-><|Information->{<|CID->129825914,SID->{341669951}|>,<|CID->129742624,SID->{341492923}|>,<|CID->129783988,SID->{341577059,345261280,368769438}|>}|>|>
In[]:=
result["InformationList"]
Out[]=
<|Information->{<|CID->129825914,SID->{341669951}|>,<|CID->129742624,SID->{341492923}|>,<|CID->129783988,SID->{341577059,345261280,368769438}|>}|>
In[]:=
result["InformationList","Information"]
Out[]=
{<|CID->129825914,SID->{341669951}|>,<|CID->129742624,SID->{341492923}|>,<|CID->129783988,SID->{341577059,345261280,368769438}|>}
Alternatively, one can get the literal text returned by the server:
In[]:=
URLExecute[url,{},"Text"]
Out[]=
{ "InformationList": { "Information": [ { "CID": 129825914, "SID": [ 341669951 ] }, { "CID": 129742624, "SID": [ 341492923 ] }, { "CID": 129783988, "SID": [ 341577059, 345261280, 368769438 ] } ] }}
Note that the json output format is used in the above request. The "txt" output format in PUG-REST returns data as a single column, so it cannot represent results that are grouped by input CID.
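As a minimal sketch of working with the grouped result, the association returned by “RawJSON” can be reshaped into a lookup table from each CID to its SIDs. The names sample and cidToSids are illustrative assumptions, and the data is copied by hand from the output shown above so that the snippet runs without contacting PubChem:

```wolfram
(* Sample data: hand-copied from the "RawJSON" output shown above (not a live query) *)
sample = <|"InformationList" -> <|"Information" -> {
      <|"CID" -> 129825914, "SID" -> {341669951}|>,
      <|"CID" -> 129742624, "SID" -> {341492923}|>,
      <|"CID" -> 129783988, "SID" -> {341577059, 345261280, 368769438}|>}|>|>;

(* Build a lookup table of the form CID -> list of SIDs *)
cidToSids = Association[#["CID"] -> #["SID"] & /@ sample["InformationList", "Information"]];

(* Look up the SIDs for a single CID *)
cidToSids[129783988]

(* Count the SIDs associated with each CID *)
Length /@ cidToSids
```

With a live query, replacing sample by the result of URLExecute[url,{},"RawJSON"] should behave the same way, assuming the server returns the same structure.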
If you want output records to be “flattened”, rather than being grouped by the input identifiers, use “list_return=flat”:
In[]:=
URLExecute@URLBuild[{prolog,pugin,pugoper,pugout},{"list_return"->"flat"}]
Out[]=
{IdentifierList->{SID->{341492923,341577059,341669951,345261280,368769438}}}
The default value for the “list_return” parameter is:
◼ “flat” when the output format is TXT
◼ “grouped” when the output format is JSON or XML
It is also possible to specify the input list implicitly, rather than providing the input identifiers explicitly. For example, the following example uses a chemical name to specify the input list. By forcing Mathematica to interpret the data returned from the server as a “CSV” format, we get a list of data as the result. Note the difference between the second and third URLs:
In[]:=
(* Input CIDs are provided using a chemical name *)
cids=URLExecute["https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/lactose/cids/txt",{},"CSV"];
Print["# CIDs returned: ",Length[cids]]
(* Input CIDs are provided using the name, then converted to SIDs *)
sids=URLExecute["https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/lactose/sids/txt",{},"CSV"];
Print["# SIDs returned (method 1): ",Length[sids]]
(* Input SIDs are provided using the name, and those same SIDs are returned *)
Print["# SIDS returned (method 2): ",#]&@Length@URLExecute["https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/name/lactose/sids/txt",{},"CSV"];
# CIDs returned: 1
# SIDs returned (method 1): 165
# SIDS returned (method 2): 125
COMMENT: New SIDs are frequently added to PubChem, and so the number of SIDs returned when you execute this code may be greater than the number output in older versions of this tutorial.
The above example illustrates how the list conversion works:
◼ In the first request, the name “lactose” is searched for against the Compound database and the matching CIDs are returned (1 CID in the output above).
◼ If you change the operation part from “cids” to “sids” (as in the second request), the same name search is done first against the Compound database, followed by the list conversion from the resulting CIDs to the associated SIDs (165 in the output above).
◼ In the third request, the name search is performed against the Substance database, and the resulting 125 SIDs are returned.
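The three requests differ only in the domain segment (compound or substance) and the operation segment (cids or sids) of the URL. The sketch below makes that explicit with a small helper. Note that pugURL is a hypothetical convenience function introduced here for illustration; it is not part of Mathematica or PUG-REST. It only builds URL strings with URLBuild and does not contact the server:

```wolfram
prolog = "https://pubchem.ncbi.nlm.nih.gov/rest/pug";

(* pugURL is a hypothetical helper: domain / namespace / identifier / operation / output format *)
pugURL[domain_String, namespace_String, id_String, oper_String, out_String : "txt"] :=
  URLBuild[{prolog, domain, namespace, id, oper, out}]

(* Same helper with optional query parameters such as "list_return" *)
pugURL[domain_String, namespace_String, id_String, oper_String, out_String, params_List] :=
  URLBuild[{prolog, domain, namespace, id, oper, out}, params]

(* The three lactose requests discussed above *)
pugURL["compound", "name", "lactose", "cids"]   (* name -> CIDs *)
pugURL["compound", "name", "lactose", "sids"]   (* name -> CIDs -> SIDs *)
pugURL["substance", "name", "lactose", "sids"]  (* name -> SIDs directly *)

(* The grouped CID-to-SID request from earlier in this section *)
pugURL["compound", "cid", "129825914,129742624,129783988", "sids", "json", {"list_return" -> "grouped"}]
```

Each resulting string can be passed to URLExecute exactly as in the cells above.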
Exercise:
Exercise 1a: Statins are a class of drugs that lower cholesterol levels in the blood. Retrieve in JSON the substance records associated with the compounds whose names contain the string "statin".
◼ Make only one PUG-REST request.
◼ For partial name matching, set the name_type parameter to “word” (see the PUG-REST documentation for an example).
◼ Group the substances by the corresponding compound records.
◼ Print the JSON output.
In[]:=
(* Write your code here *)
Getting mixture/component molecules for a given molecule.
The list interconversion may be used to retrieve mixtures that contain a given molecule as a component. To do this, the input molecule should be a single-component compound (that is, with only one covalently-bound unit), and the optional parameter “cids_type=component” should be provided:
In[]:=
prolog="https://pubchem.ncbi.nlm.nih.gov/rest/pug";
url=URLBuild[{prolog,"compound/name/tylenol/cids/txt"}]
cids=URLExecute[url,{"cids_type"->"component"},"CSV"];
Length[cids]
Out[]=
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/tylenol/cids/txt
Out[]=
385
Comment: Remember, new mixtures are added every day to PubChem, and so the number you retrieve may be greater than the number shown in this archived output.
It should be noted that, if the input molecule is a multi-component compound, the option “cids_type=component” returns the components of that compound. For example, the following example shows how to get all components of the first molecule in the “cids” list generated in the previous example:
In[]:=
url=URLBuild[{prolog,"compound/cid",First[cids],"cids/txt"}];
Print["CIDS: ",First[cids]];
componentCids=URLExecute[url,{"cids_type"->"component"},"CSV"];
Print["Number of components: ",Length[componentCids]]
CIDS: {145397607}
Number of components: 3
Exercise:
Exercise 2a: Many over-the-counter drugs contain more than one active ingredient. In this exercise, we want to find component molecules that occur with three common painkillers (aspirin, tylenol, advil) as a mixture.
Step 1. Define a list that contains three drug names (aspirin, tylenol, advil).
In[]:=
(* Write your code in this cell. *)
Step 2. Write a function that retrieves the PubChem CIDs corresponding to an input name. Then Map your function over the list of names to generate a new list. In order not to overload the PubChem servers, insert a Pause into your function, or use MapBatched.
In[]:=
(* Write your code in this cell. *)
Step 3. Write functions (and then apply them using a Map or Table operation) to do the following things for each drug:
◼ Get the PubChem CIDs of the mixture compounds that contain each drug and store them in a list (for that particular drug).
◼ Get the PubChem CIDs of the components that occur in any of the returned mixtures, by setting the “list_return” parameter to “flat”. This can be done with a single request.
◼ Display all the components for each CID.
◼ Pause the code for 0.2 seconds each time a PUG-REST request is made.
In[]:=
(* Write your code in this cell. *)
Getting compounds tested in a given assay
PUG-REST may be used to retrieve compounds tested in a given assay. For example, the following code cell shows how to get all compounds tested in AID 1207599. Notice how the number is provided in quotes as a string, rather than as an integer. Only a Short form of the total output is displayed, as otherwise it would be too long:
In[]:=
url=URLBuild[{prolog,"/assay/aid","1207599","cids/txt"}]
cids=URLExecute[url,{},"CSV"];
Length[cids]
cids//Short
Out[]=
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/aid/1207599/cids/txt
Out[]=
791
If you are interested in only the compounds that are tested “active” in a given assay, set the “cids_type” parameter to “active”, as shown in the code below:
In[]:=
cids=URLExecute[url,{"cids_type"->"active"},"CSV"];
Length[cids]
Short[cids]
Out[]=
435
Out[]//Short=
{{6197},{10219},{14169},{17558},<<427>>,{135925147},{135925159},{135925164},{136032603}}
It is also possible to specify the input assay list implicitly. For example, the following code cell retrieves compounds tested in any assay targeting human carbonic anhydrase 2 (CA2), whose accession number is P00918. Again, for brevity, only a Short portion of this list is displayed:
In[]:=
url=URLBuild[{prolog,"assay/target/accession","P00918","cids/txt"}]
cids=URLExecute[url,{},"CSV"];
Length[cids]
Short[cids]
Out[]=
https://pubchem.ncbi.nlm.nih.gov/rest/pug/assay/target/accession/P00918/cids/txt
Out[]=
23978
Out[]//Short=
{{448890},{126154},{447682},{3074858},<<23970>>,{1986},{134142877},{71515731},{71515605}}
Exercises
Exercise 3a: Find compounds that are tested to be active against human acetylcholinesterase (accession: P08173) and retrieve SMILES strings for those compounds.
◼ Split the CID list into smaller chunks (with a chunk size of 100).
◼ Print the retrieved data in a CSV format (CID and SMILES strings in the first and second columns, respectively).
In[]:=
(* Write your code in this cell. *)
Using the PubChem Service
In the above section, we evaluated a few common questions:
◼ What are the Substance ID (SID) records associated with one or more Compound IDs (CIDs)?
◼ What are the Compound IDs (CIDs) associated with a name?
◼ What are the Substance IDs (SIDs) associated with a name?
◼ What are the Substance IDs (SIDs) associated with compounds having a given name?
◼ What mixtures contain a molecule (specified in some way) as a component?
◼ What compounds are tested (or are active) in a given assay?
As discussed in the first assignment, many of the queries to PubChem have a convenient “wrapper” in the form of Mathematica’s PubChem service. Take a few minutes to review the ‘Requests’ section of the PubChem Service documentation and try to determine what Requests best match these types of queries.
Retrieving the substance records associated with a given CID
The “CompoundSID” query can be used to look up SubstanceIDs for a given specification (which might be a CompoundID, SMILES, or Formula), returning the results as a Dataset:
In[]:=
ServiceExecute["PubChem","CompoundSID",{"CompoundID"->"129825914"}]
It is also possible to provide a list of CIDs as input specifiers; these are automatically grouped by CID:
In[]:=
result=ServiceExecute["PubChem","CompoundSID",{"CompoundID"->{129825914,129742624,129783988}}]
In[]:=
result[All,"SubstanceID",Length]
Alternatively, one can apply a function to a particular named column:
In[]:=
result[All,{"SubstanceID"->Length}]
Dataset results can also be converted into JSON strings using the ExportString function; this shows that the Dataset is merely a list of associations:
In[]:=
ExportString[result,"JSON"]
Out[]=
[ {"CompoundID": 129825914, "SubstanceID": [341669951]}, {"CompoundID": 129742624, "SubstanceID": [341492923]}, {"CompoundID": 129783988, "SubstanceID": [341577059, 345261280, 368769438]}]
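Because the Dataset is a list of associations, it can also be rebuilt and queried by hand. The following is a minimal sketch under that assumption: the rows are copied from the JSON string above rather than fetched from PubChem, and the names data and multi are illustrative only:

```wolfram
(* Rows hand-copied from the JSON export shown above (not a live query) *)
data = Dataset[{
    <|"CompoundID" -> 129825914, "SubstanceID" -> {341669951}|>,
    <|"CompoundID" -> 129742624, "SubstanceID" -> {341492923}|>,
    <|"CompoundID" -> 129783988, "SubstanceID" -> {341577059, 345261280, 368769438}|>}];

(* CIDs that are linked to more than one SID *)
multi = data[Select[Length[#SubstanceID] > 1 &], "CompoundID"];
Normal[multi]

(* All SIDs as one flat list, analogous to list_return=flat *)
Flatten[Normal[data[All, "SubstanceID"]]]
```

Normal converts a Dataset back to ordinary lists and associations, which is convenient when the values are passed on to other functions.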
Retrieving CID and SID records by name
Just as in the manual construction of the REST queries, we can compare looking up a CID using a name:
In[]:=
ServiceExecute["PubChem","CompoundCID",{"Name"->"lactose"}]
...to looking up a CID by name then converting to SID:
In[]:=
results=ServiceExecute["PubChem","CompoundSID",{"Name"->"lactose"}]
results[1,"SubstanceID",Length]
Out[]=
165
...to looking up the SubstanceIDs associated directly with a name:
In[]:=
ServiceExecute["PubChem","SubstanceSID",{"Name"->"lactose"}]
%["SubstanceID",Length]
Out[]=
125
Exercise:
Exercise 1a-revisited: Retrieve the substance records associated with the compounds whose names contain the string “statin” using the PubChem service.
In[]:=
(* Write your code in this cell. *)
Getting mixture/component molecules for a given molecule.
As demonstrated above, the REST API provides a “cids_type=component” option. It can be supplied in the parameter list; in Mathematica this option is named “CIDType”:
In[]:=
ServiceExecute["PubChem","CompoundCID",{"Name"->"tylenol","CIDType"->"Component"}]
%["CompoundID",Length]
Out[]=
385
Exercise:
Exercise 2a-revisited: Retrieve PubChem CIDs corresponding to the three drugs: aspirin, tylenol, advil. Then do the following things for each drug:
◼ Get the PubChem CIDs of the mixture compounds that contain each drug.
◼ Get the PubChem CIDs of the components that occur in any of the returned mixtures.
In[]:=
(* Write your code in this cell. *)
Getting compounds tested in a given assay
Mathematica 12.0 did not provide the ability to retrieve compounds tested in a given assay, and the PubChem Service documentation for Mathematica 12.1 does not explicitly describe this capability. Given how easy it is to do it by a URL request (demonstrated above), this is not a great problem. However, there is an undocumented capability to do this as of Mathematica 12.1. This also provides an opportunity to mention how one can see possible request types programmatically, by initializing a service connection and then querying the possible “Requests” that can be made:
In[]:=
f=ServiceConnect["PubChem"]
Out[]=
ServiceObject
In[]:=
f["Requests"]
Out[]=
{AssayAID,AssayCID,AssaySID,Authentication,CompoundAID,CompoundAssaySummary,CompoundCID,CompoundCrossReferences,CompoundDescription,CompoundFullRecords,CompoundImage,CompoundMolecule,CompoundProperties,CompoundSDF,CompoundSID,CompoundSynonyms,ID,Information,Name,RawRequests,SubstanceAID,SubstanceAssaySummary,SubstanceCID,SubstanceCrossReferences,SubstanceFullRecords,SubstanceImage,SubstanceMolecule,SubstanceSDF,SubstanceSID,SubstanceSynonyms}
The “AssayCID” request allows us to look up compounds tested in a given assay. Let’s look up AID 1207599 as we did in the earlier PUG-REST example:
In[]:=
ServiceExecute["PubChem","AssayCID",{"AssayID"->1207599}]
What about returning only compounds that are active for this assay? Unfortunately, there is a bug in how this is implemented (as of 12 Oct 2020)—which is understandable because it is undocumented. The bug occurs in version 12.0.81 of the PubChem service, and should be fixed in later versions. To check which version is currently deployed in your installation:
In[]:=
PacletFind["ServiceConnection_PubChem"]
Out[]=
PacletObject
Interlude: A Bug Fix for PubChem 12.0.81
It is relatively easy to fix this bug by modifying the function definitions. Executing the code below will modify the PubChem 12.0.81 file installed on your computer to fix this problem. It makes a backup of the file contents, and only modifies version 12.0.81. You will only need to do this once. Note that this will become obsolete once a new version of the Mathematica PubChem service is released, so this fix is only needed short term:
Module[{file=FileNameJoin[{PacletFind["ServiceConnection_PubChem"][[1,1,"Location"]],"Kernel","PubChem.m"}],
pubChemVersion=PacletFind["ServiceConnection_PubChem"][[1,1,"Version"]],
contents},
If[pubChemVersion=="12.0.81",
(* make sure that it is a compatible version *)
CopyFile[file,file<>".backup"];
contents=Import[file,"Lines"];
(* change the contents *)
contents[[1266]]="If[!StringMatchQ[ToString[\"CIDType\"/.newparams], \"All\"|\"Active\"|\"Inactive\"],";
contents[[1271]]="params=Append[params,Rule[\"cids_type\",ToString[\"CIDType\"/.newparams]/.{\"All\"->\"all\",\"Active\"->\"active\",\"Inactive\"->\"inactive\"}]]";
Export[file,Append[""]@contents,"Lines"]; (* append a blank line to the end *)
Get[file]; (* reload definitions for the current session *)
Print["Modified file; reloaded definitions"]; (* tell the humans what we did *)
,
Print["PubChem Paclet Version is newer than 12.0.81---no modification made"];
]]
Modified file; reloaded definitions
Returning to our regularly-scheduled PubChem...
Once you have a correct version installed, we can proceed to restrict our search to Active or Inactive compounds for the given assay, analogous to the PUG-REST call constructed earlier. Observe that the type specifications begin with capital letters (following the Mathematica convention), as opposed to the all-lowercase expression used in the underlying PUG-REST call:
In[]:=
ServiceExecute["PubChem","AssayCID",{"AssayID"->1207599,"CIDType"->"Active"}]
Exercises
Exercise 3a-revisited: Find compounds that are tested to be active against human acetylcholinesterase (accession: P08173) and retrieve SMILES strings for those compounds using the PubChem service connection:
◼ Split the CID list into smaller chunks (with a chunk size of 100) and retrieve SMILES strings for each chunk using the functions described in Assignment 1.
◼ Concatenate the chunks, display the data in a browsable Dataset form, and store the retrieved results in a CSV format (CID and SMILES strings in the first and second columns, respectively).
In[]:=
(* Write your code in this cell. *)
Attributions
Adapted from the corresponding OLCC 2019 Python Assignment:
https://chem.libretexts.org/Courses/Intercollegiate_Courses/Cheminformatics_OLCC_(2019)/2._Representing_Small_Molecules_on_Computers/2.7%3A_Python_Assignment/Python_Assignment_2B


Cite this as: Joshua Schrier, "02B Interconversion Between PubChem Records" from the Notebook Archive (2020), https://notebookarchive.org/2020-10-ebnvwao


