03 Compound vs Substance

Joshua Schrier

Compound vs Substance

Objectives

◼ Understand the difference between compounds and substances in PubChem’s terminology.

◼ Learn how chemical structures are represented in the real world.

◼ Understand the ambiguities of name-structure associations.

◼ Learn how to draw chemical structures programmatically.

Structure Standardization

PubChem contains more than 200 million chemical records submitted by hundreds of data contributors. These depositor-provided records are archived in a database called “Substance”, and each record in this database is called a substance. The records in the Substance database are highly redundant, because different data contributors may submit information on the same chemical independently of each other. Therefore, PubChem extracts unique chemical structures from the Substance database through a process called standardization (https://doi.org/10.1186/s13321-018-0293-8). These unique structures are stored in the Compound database, and individual records in this database are called “compounds”. To learn more about PubChem compounds and substances, please read this PubChem Blog post (https://go.usa.gov/xVXct).

The code cells below demonstrate the effects of chemical structure standardization.

Step 1. Download a list of the SIDs associated with a given CID.

First, let's get a list of the SIDs that are associated with CID 1174 (uracil), either by manually constructing the PUG-REST call:

In[]:=

cid=1174;
url=URLBuild[{"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/",ToString[cid],(*note: conversion of integer CID to string type*)"sids/txt"}]
sids=URLExecute[url,{},"CSV"]//Flatten;
Length[sids]

Out[]=

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1174/sids/txt

Out[]=

396

Or alternatively, by using ServiceExecute:

In[]:=

sids=ServiceExecute["PubChem","CompoundSID",{"CompoundID"->1174}][First,"SubstanceID"]//Normal;
Length[sids]

Out[]=

396

Both requests return 390+ substances, all of which are standardized to the same structure (CID 1174). The postfix operations (Flatten and Normal) convert the results into an ordinary list.

Step 2. Download the structure data for the SIDs

Now retrieve the depositor-provided structures for the returned substances. We'll begin by defining a function that retrieves the SDF records given a list of SID values:

In[]:=

retrieveSDF[sids_List]:=With[{url=URLBuild[{"https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sid/",StringRiffle[sids,","],(*generate a string of comma-separated SIDs*)"/record/sdf"}]},URLExecute[url,{},"Text"]]

Next: Partition the list of SIDs into chunks of UpTo 50 elements each, and use a MapBatched operation to retrieve each of these batches, Pause-ing for 0.2 seconds between each batch. This will take a minute or two to execute:

In[]:=

chunks=Partition[sids,UpTo[50]];
sdfs=ResourceFunction["MapBatched"][retrieveSDF,chunks,Pause[0.2],1];
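To see what the chunking does before making any network calls, here is a small sketch on a toy list of seven items split into chunks of up to three:

In[]:=

Partition[Range[7],UpTo[3]]

Out[]=

{{1,2,3},{4,5,6},{7}}

If the MapBatched resource function is not available, one possible alternative uses only built-in functions. This is a sketch that assumes the retrieveSDF and chunks definitions above; the name sdfsAlt is introduced here only for illustration:

In[]:=

(*pause 0.2 seconds, then retrieve one chunk; repeat for every chunk*)
sdfsAlt=Table[(Pause[0.2];retrieveSDF[chunk]),{chunk,chunks}];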

At the end, we Export a file that has joined all of the chunks together. Because the data received from PubChem is in the form of text files, we force the Export command to write this as “Text”, rather than trying to interpret the input as an SDF file. We also insert a return (“\n”) after each block of text using StringRiffle, before writing the file to disk:

In[]:=

Export["cid2sids-uracil.sdf",StringRiffle[sdfs,"\n"],"Text"]

Out[]=

cid2sids-uracil.sdf

Take a moment to open this file in a text editor to get a sense of its contents.

Step 3. Convert the structures in the SDF file into the SMILES strings and identify unique SMILES and their frequencies.

Begin by Import-ing the structures from the SDF file that was written to your local disk in the previous step; Import looks at the file suffix (.sdf, .sd) and interprets it as containing molecular data on Import, returning a list of Molecule representations (introduced at the end of Assignment 1):

In[]:=

mols=Import["cid2sids-uracil.sdf"];(*390+ Molecule[]s!*)

Let’s look at the first of these:

In[]:=

example=First[mols]

Out[]=

Molecule[Atoms: 8 (12), Bonds: 8 (12)]

Many properties of a Molecule can be queried with MoleculeProperty, such as its SMILES string:

In[]:=

MoleculeProperty["SMILES"]@example

Out[]=

c1c[nH]c(=O)[nH]c1=O

These properties also include any metadata contained in the SDF entry:

In[]:=

MoleculeProperty["MetaInformation"]@example

Out[]=

<|PUBCHEM_CID_ASSOCIATIONS→{1174,1},
PUBCHEM_COMPOUND_ID_TYPE→0,
PUBCHEM_COORDINATE_TYPE→{1,3},
PUBCHEM_EXT_DATASOURCE_NAME→BioCyc,
PUBCHEM_EXT_DATASOURCE_REGID→URACIL,
PUBCHEM_EXT_DATASOURCE_URL→https://biocyc.org/,
PUBCHEM_EXT_SUBSTANCE_URL→https://biocyc.org/compound?orgid=META&id=URACIL,
PUBCHEM_GENERIC_REGISTRY_NAME→66-22-8,
PUBCHEM_SUBSTANCE_COMMENT→{TYPES: Pyrimidine-Bases, Pyrimidines, Is a product of enzyme EC 3.2.2.27, Is a product of enzyme EC 3.2.2.28, Is a product of enzyme EC 2.4.2.9, Is a product of enzyme EC 2.4.2.2, Is a product of enzyme EC 3.5.4.1, Is a product of enzyme EC 2.4.2.3, Is a reactant of enzyme EC 4.2.1.70, Is a reactant of enzyme EC 1.17.99.4, Is a product of enzyme EC 1.14.11.10, Is a reactant of enzyme EC 1.14.99.46, Is a product of enzyme EC 1.3.3.7, Is a reactant of enzyme EC 2.5.1.53, Is a product of enzyme EC 3.13.1.M4, Is a product of enzyme EC 2.4.2.57, Is a product of enzyme EC 3.2.2.3, Is a product of enzyme EC 1.3.1.1, Is a product of enzyme EC 4.1.1.66, Is a product of enzyme EC 1.3.1.2},
PUBCHEM_SUBSTANCE_ID→3272,
PUBCHEM_SUBSTANCE_SYNONYM→{uracil,66-22-8},
PUBCHEM_SUBSTANCE_VERSION→25,
PUBCHEM_TOTAL_CHARGE→0,
PUBCHEM_XREF_EXT_ID→URACIL|>
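Because every record carries this metadata, it can also show where the redundancy in the Substance database comes from. The following is a minimal sketch, assuming the mols list imported above and that each record's metadata contains the PUBCHEM_EXT_DATASOURCE_NAME key seen in the example (a record lacking that key would be tallied under a Missing[...] entry). The name sourceCounts is introduced here only for illustration:

In[]:=

(*tally the depositor name recorded in each substance's metadata*)
sourceCounts=Counts[MoleculeProperty["MetaInformation"][#]["PUBCHEM_EXT_DATASOURCE_NAME"]&/@mols];
Sort[sourceCounts,Greater](*depositors with the most records first*)

The result depends on the records currently in PubChem, so no output is shown here.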

To identify unique SMILES and their frequencies, we’ll first obtain the SMILES property for each of the molecules in the mols list, and then obtain an Association of Counts of the number of times each string appears:

In[]:=

uniqueSmilesFreq=Counts[MoleculeProperty["SMILES"]/@mols]

Out[]=

c1c[nH]c(=O)[nH]c1=O17,O=c1[nH]c(=O)cc[nH]1169,O=c1n([H])c(=O)ccn1[H]21,O(c1nc(O[H])nc([H])c1[H])[H]32,O=c1n([H])c(=O)c([H])c([H])n1[H]30,[nH]1c(=O)[nH]c(=O)cc19,Oc1nc(O)ccn149,O=c1nc(O)cc[nH]12,O=c1[nH]ccc(=O)[nH]11,15,[nH]1c(=O)[nH]ccc1=O6,n1c(O)nccc1O2,O=c1n([H])c(=O)cc[nH]12,O=c1n([H])ccc(=O)[nH]12,[nH]1c(=O)cc[nH]c1=O4,O=c1n([H])ccc(O)n12,O=c1n([H])c(O)ccn11,Oc1n([H])c(=O)ccn11,O=c1ccn([H])c(=O)n1[H]2,Oc1nc(=O)cc[nH]13,O=c1[nH]c(O)ccn13,Oc1[nH]c(=O)ccn14,O=c1[nH]c(=O)[nH]cc11,c1(=O)[nH]c(=O)cc[nH]12,c1cnc(O)nc1O2,Oc1nccc(O)n12,c1(O)[nH]c(=O)ccn11,[nH]1c(=O)nccc1O1,n1c(O)[nH]c(=O)cc11,Oc1[nH]ccc(=O)n11,[nH]1c(=O)nc(O)cc11,n1([H])c(=O)n([H])ccc1=O1,n1([H])c(=O)n([H])c(=O)c([H])c1[H]1,n1c(O)nc(O)cc11,c1([H])c([H])nc(O[H])nc1O[H]1,[H]n1ccc(=O)n([H])c1=O1,O=c1cc[nH]c(=O)[nH]12

If you look closely at the above list, you will notice that many of these structures vary only in the order in which the atoms are included. We can remove this ambiguity (while still retaining real differences in the specified protonation state, tautomerization, etc.) by using the “CanonicalSMILES” string:

In[]:=

uniqueSmilesFreq=Counts[MoleculeProperty["CanonicalSMILES"]/@mols]

Out[]=

O=c1cc[nH]c(=O)[nH]1271,Oc1ccnc(O)n189,O=c1nc(O)cc[nH]15,15,O=c1nccc(O)[nH]15,O=c1ccnc(O)[nH]17,O=c1cc[nH]c(O)n14

Converting this result to a Dataset makes it look nicer:

In[]:=

Dataset[%]

Out[]=

O=c1cc[nH]c(=O)[nH]1	271
Oc1ccnc(O)n1	89
O=c1nc(O)cc[nH]1	5
	15
O=c1nccc(O)[nH]1	5
O=c1ccnc(O)[nH]1	7
O=c1cc[nH]c(O)n1	4

(As noted in earlier exercises: Do not be alarmed if the numbers that you observe when you run this example are greater than the numbers shown above. New records are added to PubChem every day.)

The above output shows that the 390+ SIDs associated with CID 1174 are represented by six different SMILES strings. In addition, 15 substance records resulted in an “empty” SMILES string, implying that the depositors of these substance records did not provide structural information.
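As a quick check of these two points, the following sketch reuses values copied from the outputs shown earlier (your own numbers may differ). The first cell takes two of the differently ordered strings from the first tally and tests whether their canonical forms agree, which they should if, as the counts above suggest, both describe the same diketo form:

In[]:=

SameQ@@(MoleculeProperty["CanonicalSMILES"]@Molecule[#]&/@{"O=c1[nH]c(=O)cc[nH]1","[nH]1c(=O)[nH]c(=O)cc1"})

The second cell re-enters the canonical tally by hand, under the hypothetical name freqExample, and reports the total number of records, the number of distinct non-empty SMILES strings, and the number of records without a structure:

In[]:=

(*values copied from the canonical SMILES output shown earlier*)
freqExample=<|"O=c1cc[nH]c(=O)[nH]1"->271,"Oc1ccnc(O)n1"->89,"O=c1nc(O)cc[nH]1"->5,""->15,"O=c1nccc(O)[nH]1"->5,"O=c1ccnc(O)[nH]1"->7,"O=c1cc[nH]c(O)n1"->4|>;
{Total[freqExample],Length[KeyDrop[freqExample,""]],freqExample[""]}

Out[]=

{396,6,15}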

You may want to know what these 15 substances without SMILES strings are and how their structures were generated. Begin by Select-ing the particular cases where the SMILES string is empty (because there are no atoms). In this example we will use Select in the function/argument style, where the list to be tested (“mols”) is the first argument, and the test to be applied is the pure function defined as the second argument:

In[]:=

noStructure=Select[mols,StringMatchQ[""]@MoleculeProperty["SMILES"]@#&];

Then define a function that takes a single Molecule as input and returns the information of interest as a list:

In[]:=

reportMissing[mol_Molecule]:=With[{metadata=MoleculeProperty["MetaInformation"]@mol},{metadata["PUBCHEM_SUBSTANCE_ID"],metadata["PUBCHEM_SUBS_AUTO_STRUCTURE"]}]

(A complete list of the available SD tags is documented at ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_sdtags.txt )

Finally, Map this function over the list of structure-less molecules determined above, and display the result in TableForm for aesthetic purposes:

In[]:=

TableForm[reportMissing/@noStructure]

Out[]//TableForm=

50608295	Deposited Substance chemical structure was generated via Synonym "CID1174" to be CID 1174
76715622	Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
313082517	Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
319449334	Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
329735657	Deposited Substance chemical structure was generated via Synonym(s) "66-22-8" and Synonym Consistency to be CID 1174
330000149	Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
376019281	Deposited Substance chemical structure was generated via Synonym(s) "2,4(1H,3H)-PYRIMIDINEDIONE", "66-22-8" and Synonym Consistency to be CID 1174
381002398	Deposited Substance chemical structure was generated via Synonym(s) "66-22-8" and Synonym Consistency to be CID 1174
381013941	Deposited Substance chemical structure was generated via Synonym(s) "66-22-8" and Synonym Consistency to be CID 1174
381360788	Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
384257697	Deposited Substance chemical structure was generated via Synonym(s) "uracil" and MeSH to be CID 1174
384995482	Deposited Substance chemical structure was generated via Synonym(s) "66-22-8" and Synonym Consistency to be CID 1174
402318513	Deposited Substance chemical structure was generated via Synonym(s) "66-22-8" and Synonym Consistency to be CID 1174
402318514	Deposited Substance chemical structure was generated via Synonym(s) "66-22-8" and Synonym Consistency to be CID 1174
402318515	Deposited Substance chemical structure was generated via Synonym(s) "66-22-8" and Synonym Consistency to be CID 1174
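
The table can also be summarized by how each structure was assigned. The following is a rough sketch, assuming the reportMissing and noStructure definitions above and that the second element of each row is a string worded like the ones shown; the helper name classifySource and its three labels are hypothetical and exist only for this illustration:

In[]:=

(*label each explanation by the evidence it mentions*)
classifySource[s_String]:=Which[StringContainsQ[s,"MeSH"],"MeSH",StringContainsQ[s,"Synonym Consistency"],"Synonym Consistency",True,"Other"]
classifySource[_]:="Other"(*fallback if the tag is missing*)
Counts[classifySource/@(Last/@(reportMissing/@noStructure))]

For the fifteen rows shown above, this tally would come out as six structures assigned via MeSH, eight via Synonym Consistency, and one (the record whose only synonym is "CID1174") in the Other group.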

Step 4. Generate the structure images from the SMILES

Now we want to see what these SMILES strings look like, by drawing molecular structures from them. In the previous section, we defined an association with canonical SMILES strings as keys and the number of occurrences as the corresponding values. Let’s print it out again to recall what it looks like:

In[]:=

uniqueSmilesFreq

Out[]=

O=c1cc[nH]c(=O)[nH]1271,Oc1ccnc(O)n189,O=c1nc(O)cc[nH]15,15,O=c1nccc(O)[nH]15,O=c1ccnc(O)[nH]17,O=c1cc[nH]c(O)n14

We’ll Select from the Association the Keys whose StringLength is greater than zero (i.e., removing the cases with missing structures). In this example, we will use Select in the operator form, defining the test to be applied as its argument, and then applying it to the Keys (i.e., the SMILES strings):

In[]:=

mySmiles=Select[StringLength[#]>0&]@Keys@uniqueSmilesFreq

Out[]=

{O=c1cc[nH]c(=O)[nH]1,Oc1ccnc(O)n1,O=c1nc(O)cc[nH]1,O=c1nccc(O)[nH]1,O=c1ccnc(O)[nH]1,O=c1cc[nH]c(O)n1}

Using this list of non-blank SMILES strings, create a new Association whose values are the corresponding 2D structures:

In[]:=

AssociationMap[MoleculePlot@Molecule[#]&,(*function used to generate the values*)mySmiles(*keys used*)]

Out[]=

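<|O=c1cc[nH]c(=O)[nH]1→(structure image), Oc1ccnc(O)n1→(structure image), O=c1nc(O)cc[nH]1→(structure image), O=c1nccc(O)[nH]1→(structure image), O=c1ccnc(O)[nH]1→(structure image), O=c1cc[nH]c(O)n1→(structure image)|>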

You may want to write these molecule images to files, rather than displaying them in this Mathematica notebook:

In[]:=

Export["image_"<>#<>".png",MoleculePlot@Molecule[#]]&/@mySmiles

Out[]=

{image_O=c1cc[nH]c(=O)[nH]1.png,image_Oc1ccnc(O)n1.png,image_O=c1nc(O)cc[nH]1.png,image_O=c1nccc(O)[nH]1.png,image_O=c1ccnc(O)[nH]1.png,image_O=c1cc[nH]c(O)n1.png}
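
SMILES strings contain characters such as "=", "[" and "(", and for other molecules "/" and "\", which can be awkward or invalid in file names on some systems. One possible workaround, sketched here under the assumption that mySmiles is defined as above, is to name the files by position in the list instead:

In[]:=

(*#1 is the SMILES string, #2 is its position, e.g. {3}*)
MapIndexed[Export["image_"<>IntegerString[First[#2]]<>".png",MoleculePlot@Molecule[#1]]&,mySmiles]

This writes image_1.png, image_2.png, and so on, in the same order as mySmiles, so the list itself serves as the lookup table between file names and structures.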

You may also want to display all the images in a single figure:

In[]:=

figures=MoleculePlot@Molecule[#]&/@mySmiles;GraphicsRow[figures]

Out[]=

...and then save the resulting image to a file:

In[]:=

Export["image_grid.png",%]

Out[]=

image_grid.png
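
It can also be informative to show how often each form occurs next to its drawing. The following is a minimal sketch, assuming uniqueSmilesFreq still holds the canonical SMILES tally from Step 3; the name labeled is introduced here only for illustration:

In[]:=

(*drop the empty-SMILES entry, then label each plot with its count*)
labeled=KeyValueMap[Labeled[MoleculePlot@Molecule[#1],#2]&,KeyDrop[uniqueSmilesFreq,""]];
Grid[Partition[labeled,UpTo[3]]]

Grid with rows of up to three plots is used here instead of GraphicsRow simply so that the labeled items wrap onto a second row; either layout would work.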

As shown in these chemical images, the 390+ substances associated with CID 1174 (uracil) correspond to six tautomeric forms of uracil, which differ from each other in the position of “movable” hydrogen atoms. Compare these structures with their standardized structure (CID 1174):

In[]:=

MoleculePlot@Molecule@URLExecute["https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1174/property/isomericsmiles/txt"]

Out[]=

Alternatively, you can get the structure image of CID 1174 directly from PubChem, either by constructing the PUG-REST call yourself:

In[]:=

URLExecute["https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/1174/record/PNG?image_size=300x300"]

Out[]=

Or by using Mathematica’s PubChem service wrapper to retrieve the images:

In[]:=

ServiceExecute["PubChem","CompoundImage",{"CompoundID"->1174}]

Out[]=

Exercises:

Comment: Exercises 1a and 1b in the Python version of this Assignment have been omitted because they are specific to RDKit.

Exercise 1c

Retrieve the substance records associated with guanine (CID 135398634) and display unique structures generated from them, by following these steps:

◼ Retrieve the SIDs associated with CID 135398634.

◼ Download the structure data for the retrieved SIDs (in SDF).

◼ Generate canonical SMILES strings from the structure data in the SDF file and identify unique canonical SMILES strings.

◼ Draw the structures represented by the unique canonical SMILES strings in a single figure.

In[]:=

(*Write your code in this cell.*)

Exercise 1d

Retrieve the substance records whose synonym is “glucose” and display unique structures generated from them, by following these steps:

◼ Retrieve the SIDs whose synonym is “glucose”.

◼ Download the structure data for the retrieved SIDs (in SDF).

◼ Generate canonical SMILES strings from the structure data in the SDF file and identify unique canonical SMILES strings.

◼ Draw the structures represented by the unique canonical SMILES strings in a single figure.

In[]:=

(*Write your code in this cell.*)

Exercise 1e

Retrieve the compound records associated with the SIDs retrieved in Exercise 1d and display unique structures generated from them, by following these steps:

◼ Retrieve the CIDs associated with the SIDs whose name is “glucose”, using a single PUG-REST request (as discussed in Assignment 2).

◼ Identify unique CIDs from the returned CIDs using the DeleteDuplicates function in Mathematica.

◼ Retrieve the isomeric SMILES for the unique CIDs through PUG-REST.

◼ Draw the structures represented by the returned SMILES strings in a single figure.

In[]:=

(*Write your code in this cell.*)

Attributions

Adapted from the corresponding OLCC 2019 Python Assignment:
https://chem.libretexts.org/Courses/Intercollegiate_Courses/Cheminformatics_OLCC_(2019)/3._Database_Resources_in_Cheminformatics/3.7%3A_Python_Assignment

Cite this as: Joshua Schrier, "03 Compound vs Substance" from the Notebook Archive (2020), https://notebookarchive.org/2020-10-ebny1gg