08 Machine Learning Basics
Author
Joshua Schrier
Title
08 Machine Learning Basics
Description
Build binary classification models that predict activity/inactivity of small molecules against human aromatase using supervised learning methods.
Category
Educational Materials
Keywords
cheminformatics, Chemoinformatics, chemical information, PubChem, quantitative structure, property relationships, QSPR, machine learning, computer-aided drug design, chemistry
URL
http://www.notebookarchive.org/2020-10-ebo8r9c/
DOI
https://notebookarchive.org/2020-10-ebo8r9c
Date Added
2020-10-31
Date Last Modified
2020-10-31
File Size
2.73 megabytes
Supplements
Rights
CC BY-NC-SA 4.0
Machine Learning Basics
Objectives
◼
Build binary classification models that predict activity/inactivity of small molecules against human aromatase using supervised learning methods.
◼
Evaluate the performance of the developed models using performance measures.
Import bioactivity data from PubChem
In this notebook, we will develop a prediction model for small molecule’s activity against human aromatase (https://pubchem.ncbi.nlm.nih.gov/protein/EAW77416), which is encoded by the CYP19A1 gene (https://pubchem.ncbi.nlm.nih.gov/gene/1588). The model will predict the activity of a molecule based on the structure of the molecule (represented with molecular fingerprints).
For model development, we will use the Tox21 bioassay data for human aromatase, archived in PubChem (https://pubchem.ncbi.nlm.nih.gov/bioassay/743139). The bioactivity data presented on that page can be downloaded by clicking the “Download” button and then read into a Dataset. Alternatively, you can load the data directly into a Dataset, as shown in the cell below:
In[]:=
url="https://pubchem.ncbi.nlm.nih.gov/assay/pcget.cgi?query=download&record_type=datatable&actvty=all&response_type=save&aid=743139";dfRaw=Import[url,"Dataset","HeaderLines"1]
Out[]=
Note: The first three rows after the header provide descriptions for each column (data type, description, units, etc.). These rows need to be removed; one can either drop them after the import or exclude them on import. SemanticImport is a variant of the Import function that adds additional interpretive elements, and it also has an option for excluding specific lines from the file:
In[]:=
dfRaw = SemanticImport[url, "HeaderLines" -> 1, ExcludedLines -> {2, 3, 4}]
Out[]=
What are the column names?
In[]:=
Keys@First@dfRaw
Out[]=
Check the number of compounds for each activity group
First, we need to understand what our data look like. In particular, we are interested in the activity class of the tested compounds, because we are developing a model that classifies small molecules according to their activities against the target. This information is available in the “PUBCHEM_ACTIVITY_OUTCOME” and “Activity Summary” columns:
In[]:=
dfRaw[CountsBy["PUBCHEM_ACTIVITY_OUTCOME"]]
Out[]=
Based on the activity outcome column, there are 379 actives, 7562 inactives, and 2545 inconclusives. (As noted in several previous exercises, these numbers are subject to increase in the future, as additional experimental results are added to the PubChem database.) One way to make sense of these results is to group the entries by activity outcome (GroupBy) and then tabulate the counts of each activity summary description (CountsBy):
In[]:=
dfRaw[GroupBy["PUBCHEM_ACTIVITY_OUTCOME"],CountsBy["Activity Summary"]]
Out[]=
Now we can see that, in the “Activity Summary” column, the inconclusive compounds are further classified into subclasses, which include:
◼
active agonist
◼
inconclusive
◼
inconclusive agonist
◼
inconclusive antagonist
◼
inconclusive agonist (cytotoxic)
◼
inconclusive antagonist (cytotoxic)
As implied in the title of this assay record (https://pubchem.ncbi.nlm.nih.gov/bioassay/743139), this assay aims to identify aromatase inhibitors. Therefore, all active antagonists (in the activity summary column) were declared to be active compounds (in the activity outcome column).
On the other hand, the assay also identified 612 active agonists (in the activity summary column), and they are declared to be inconclusive (in the activity outcome column).
With that said, “inactive” compounds in this assay are those that are neither active agonists nor active antagonists.
It is important to understand that the criteria used for determining whether a compound is active in a given assay are chosen by the data source that submitted the assay data to PubChem. For the purpose of this assignment (which aims to develop a binary classifier that tells whether a molecule is active or inactive against the target), we should clarify what we mean by “active” and “inactive”:
◼
active: any compounds that can change (either increase or decrease) the activity of the target. This is equivalent to either active antagonists or active agonists in the activity summary column.
◼
inactive: any compounds that do not change the activity of the target. This is equivalent to inactive compounds in the activity summary column.
Select active/inactive compounds for model building
Now we want to select only the active and inactive compounds from the dataset (that is, the active agonists, active antagonists, and inactives, based on the “Activity Summary” column).
In[]:=
activeInactive = Query[Select[ContainsAny[{"active agonist", "active antagonist", "inactive"}]@{#["Activity Summary"]} &]];
df = activeInactive@dfRaw
Out[]=
In[]:=
df[DeleteDuplicatesBy[#["PUBCHEM_SID"] &]] // Length (* unique SIDs *)
df[DeleteDuplicatesBy[#["PUBCHEM_CID"] &]] // Length (* unique CIDs *)
Out[]=
8553
Out[]=
6864
Note that the number of CIDs is not the same as the number of SIDs. There are two important potential reasons for this observation.
First, not all substances (SIDs) in PubChem have associated compounds (CIDs) because some substances failed during structure standardization. Remember that, in PubChem, substances are depositor-provided structures and compounds are unique structures extracted from substances through structure standardization. Because our model will use structural information of molecules to predict their bioactivity, we need to remove substances without associated CIDs (i.e., no standardized structures).
Second, some compounds are associated with more than one substance. In the context of this assay, it means that a compound may have been tested multiple times in different samples (which are designated as different substances). It is not uncommon for different samples of the same chemical to give conflicting activities (e.g., active agonist in one sample but inactive in another). In this assignment, we remove such compounds with conflicting activities.
Drop substances without associated CIDs.
First, check whether there are substances without associated CIDs. Let’s define a function that counts the missing values in each column:
In[]:=
anyMissing[df_Dataset] := Transpose[df][All, Count[_Missing]]
anyMissing@df (* apply function *)
Out[]=
There are 138 records missing a “PUBCHEM_CID” entry. We want to remove those records:
In[]:=
removeMissingCID = Query[Select[! MissingQ@#["PUBCHEM_CID"] &]]; (* define query *)
df = removeMissingCID@df; (* apply query to dataset *)
anyMissing@df (* check again if the missing values have disappeared *)
Out[]=
Remove CIDs with conflicting activities
Now identify compounds with conflicting activities and remove them. The first step is to group the entries by CID, and determine how many distinct values of the Activity Summary are present. We will extract the entries where there is more than one distinct value:
In[]:=
CIDsWithConflictingActivity[df_Dataset]:=df[GroupBy["PUBCHEM_CID"],CountDistinct,"Activity Summary"][Select[GreaterThan[1]]]
In[]:=
CIDsWithConflictingActivity[df]
Out[]=
Let’s turn this into a list:
In[]:=
cidConflicts=(CIDsWithConflictingActivity@df)[[Keys]]//Normal
Out[]=
{16043,443939,2170,2554,3397,2998,4911,5333,54682468,2145,8569,4632,4114,5362,54454,4764,6167,5280450,667493,12358480,2950,5757,22571,15106,6917728,6238,135242,161803,14982,5391,7057995,135398739,28871,6917655,91744,9270,25750,33344,60196338,9833444,392622,6526396,4837,11443,8348,53316406,10458,8183,9352,7447,10332,637511,6001,7092,38479,11860,70837,9076,8560,8174,6478,15294,6944,7381,10868}
What do the data for these conflicting results look like (146 rows in total)?
In[]:=
df[Select[ContainsAny[cidConflicts]@{#["PUBCHEM_CID"]}&]][SortBy["PUBCHEM_CID"]]
Out[]=
Let’s remove these conflicting entries:
In[]:=
removeCIDs[cids_List] := Query[Select[ContainsNone[cids]@{#["PUBCHEM_CID"]} &]];
df = removeCIDs[cidConflicts]@df;
df[GroupBy["Activity Summary"], Length]
Out[]=
How many unique SIDs and CIDs are there?
In[]:=
df[CountDistinct,#]&/@{"PUBCHEM_SID","PUBCHEM_CID"}
Out[]=
{8269,6798}
Remove redundant data
The above code cells do not remove compounds tested multiple times when the testing results are consistent [e.g., active agonist in all samples (substances)]. The rows corresponding to these compounds are redundant, so we want to remove them, keeping only one row per compound.
In[]:=
df = df[DeleteDuplicatesBy["PUBCHEM_CID"]]
df[CountDistinct, #] & /@ {"PUBCHEM_SID", "PUBCHEM_CID"}
Out[]=
Out[]=
{6798,6798}
Adding “numeric” activity classes
In general, the inputs and outputs to machine learning algorithms need to have numerical forms.
In this exercise, the input (molecular structure) will be represented with binary fingerprints, which already have numerical forms (0 or 1). However, the output (activity) is currently in a string format (e.g., ‘active agonist’, ‘active antagonist’). Therefore, we want to add an additional, ‘activity’ column, which contains numeric codes representing the active and inactive compounds:
◼
1 for actives (either active agonists or active antagonists)
◼
0 for inactives
Note that we are merging the two classes "active agonist" and "active antagonist", because we are going to build a binary classifier that distinguishes actives from inactives. We’ll also take this opportunity to create a smaller Dataset that contains only the CIDs and these binarized activity values. The Boole function converts True and False to 1 and 0:
In[]:=
dfActivity = df[All,
   <|"CompoundID" -> #["PUBCHEM_CID"],
    "activity" -> Boole@Not@StringMatchQ["inactive"]@#["Activity Summary"]|> &]
Out[]=
In[]:=
df[CountsBy["Activity Summary"]]
Out[]=
In[]:=
dfActivity[CountsBy["activity"]]
Out[]=
Download structure information for each compound from PubChem
Now we want to retrieve structure information for the compounds from PubChem (as isomeric SMILES), using the requestProperties function originally defined in Assignment 7 (copied below):
In[]:=
requestProperties[cids_List, chunkSize_Integer: 200, properties_List: {"IsomericSMILES"}] := With[
  {cidsBatches = Partition[cids, UpTo[chunkSize]]},
  Join @@ Map[
    (Pause[0.2];
      ServiceExecute["PubChem", "CompoundProperties",
       {"CompoundID" -> #, "Property" -> properties}]) &,
    cidsBatches]]
requestProperties[cids_Dataset, chunkSize_Integer: 200, properties_List: {"IsomericSMILES"}] := With[
  {cidList = cids[All, "CompoundID"] // Normal},
  requestProperties[cidList, chunkSize, properties]]
In[]:=
dfSmiles=requestProperties@dfActivity
Out[]=
Generate MACCS keys from SMILES.
Account for possible failures (e.g., the presence of “invalid” SMILES strings—in fact, two are present in this dataset). This will take a few minutes to generate:
In[]:=
dfFPS = dfSmiles[All,
    Join[
      <|"CompoundID" -> #CompoundID|>,
      Association@MapIndexed[StringTemplate["maccs``"]@First@#2 -> #1 &,
        ResourceFunction["MACCSKeys"]@#IsomericSMILES] (* MACCS fingerprint generator; displayed as an inline resource-function object in the original notebook *)
      ] &,
    FailureAction -> "Drop"];
Merge activity data and fingerprint information
In[]:=
dfData=JoinAcross[dfActivity,dfFPS,"CompoundID"];
Save dfData in CSV for future use:
In[]:=
Export["dfData.csv",dfData]
Out[]=
dfData.csv
In[]:=
(* dfData = SemanticImport["dfData.csv"]; *) (* in case of a restart, uncomment and evaluate this line *)
Preparation for model building
Removing zero variance features
Are there columns where the MACCS bits do not vary? Let’s find out:
In[]:=
Transpose[dfData][All,Variance][TakeSmallest[10]]
Out[]=
Indeed, three of these have no variation whatsoever. So let’s define a function that applies a variance-threshold criterion, dropping columns that have zero variance, and apply it to the dataset:
In[]:=
removeZeroVarianceColumns[df_Dataset] := With[
  {retainedColumns = Normal@Keys@Select[GreaterThan[0]]@Transpose[df][All, Variance]},
  df[All, retainedColumns]]
dfData = removeZeroVarianceColumns@dfData;
Creating a Train-Test Split (9:1 ratio)
Now split the data set into a training set (90%) and a test set (10%). The training set will be used to train the model; the developed model will then be tested against the test set.
We will define our own function here, for educational purposes and so that we have control over the process. In practice, one can use the TrainTestSplit repository function to perform these types of operations:
In[]:=
trainTestSplit[fraction_][df_Dataset] := With[
  {nTrainingItems = UpTo@Round@(fraction*Length[df])},
  TakeDrop[
    RandomSample[df], (* randomly shuffle the rows *)
    nTrainingItems] (* take up to 90% of the data for training *)
  ]
Now apply this function:
In[]:=
SeedRandom[1841]; (* to force this to be reproducible every time we run this notebook *)
{train, test} = trainTestSplit[0.9]@dfData;
Length /@ {train, test}
Out[]=
{6116,680}
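As noted above, the TrainTestSplit resource function can perform the same job as our hand-rolled helper. A minimal sketch, assuming its default call pattern returns a {training, test} pair (the variable names trainAlt and testAlt are illustrative; check the Function Repository page for the split-fraction options):

(* Assumption: default usage returns a {training, test} pair; see the repository documentation for options *)
{trainAlt, testAlt} = ResourceFunction["TrainTestSplit"][dfData];
Length /@ {trainAlt, testAlt}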
Balance the training set through downsampling
Check the dimension of the training data set:
In[]:=
Length[train]
Out[]=
6116
Check the number of active and inactive compounds:
In[]:=
train[Counts,"activity"]
Out[]=
The data set is highly imbalanced (5451 inactives to 665 actives, a ratio of roughly 8:1). To address this issue, let’s downsample the majority class (inactive compounds) to balance the data set.
In[]:=
downsample[column_String][df_Dataset] := With[
  {groups = df[GroupBy[column]]},
  With[
   {actives = groups[Key[1]], (* assume that the column is either 0 or 1 encoded *)
    inactives = groups[Key[0]]},
   Join[
    actives, (* the active examples; take them all *)
    RandomSample[inactives, Length[actives]] (* sample the inactives *)
    ]]]
In[]:=
SeedRandom[1841]; (* set a random seed value to make this reproducible *)
downsampledTrain = downsample["activity"]@train;
Confirm that the downsampled data set has the correct number of representatives from each class.
In[]:=
downsampledTrain[CountsBy["activity"]]
Out[]=
Separate input and output variables
We now want to separate the “inputs” (the MACCS keys) from the “outputs” (the activity). The Classify function takes a variety of different input styles, but one of the more convenient ones is an Association whose keys are the class labels (outputs) and whose values are lists of examples (inputs). We want to make sure that the “inputs” do not include the activity or CompoundID values.
In[]:=
prepareInput[df_Dataset] := df[GroupBy["activity"], All, KeyDrop[{"activity", "CompoundID"}]] // Normal
{trainingSet, testSet} = prepareInput /@ {downsampledTrain, test};
Build a model using the training set
Now we are ready to build predictive models using machine learning algorithms available in the Classify function. This notebook will use Naive Bayes and decision tree models, because they are relatively fast and simple, but there are many other options described in the documentation.
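Any of the built-in Classify methods can be swapped in by changing the Method option. For example, a minimal sketch using the built-in "LogisticRegression" method (not otherwise used in this notebook; the variable name lrModel is illustrative):

(* Illustrative only: substitute another built-in Classify method *)
lrModel = Classify[trainingSet, Method -> "LogisticRegression"];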
Naive Bayes
In[]:=
nbModel = Classify[ (* train the model *)
  trainingSet, (* using this data as the training set *)
  Method -> "NaiveBayes"] (* specify the model type *)
Out[]=
ClassifierFunction
How well does this model describe the training data? One way to quantify this is by the accuracy:
In[]:=
ClassifierMeasurements[nbModel,trainingSet,"Accuracy"]
Out[]=
0.657895
If we are interested in multiple properties, it is better to generate a general ClassifierMeasurementsObject that can be used to assess different performance metrics of the model:
In[]:=
nbPerformanceOnTrainingData = ClassifierMeasurements[nbModel, trainingSet]
nbPerformanceOnTrainingData["Report"]
Out[]=
ClassifierMeasurementsObject
Out[]=
Classifier Measurements
Desired performance metrics can be extracted from the ClassifierMeasurementsObject and used to construct summary tables of the results:
In[]:=
testSummaryTable[cm_ClassifierMeasurementsObject] :=
  {#, cm[#]} & /@ {"Accuracy", "Recall", "MCC", "Sensitivity", "Specificity", "AreaUnderROCCurve"} // TableForm
testSummaryTable[nbPerformanceOnTrainingData]
Out[]//TableForm=
Accuracy | 0.657895 |
Recall | <|0 -> 0.536842, 1 -> 0.778947|>
MCC | <|0 -> 0.325472, 1 -> 0.325472|>
Sensitivity | <|0 -> 0.536842, 1 -> 0.778947|>
Specificity | <|0 -> 0.778947, 1 -> 0.536842|>
AreaUnderROCCurve | <|0 -> 0.720391, 1 -> 0.720382|>
When applied to predict the activity of the training compounds, the NB classifier achieved an accuracy of 0.66 and an AUC-ROC of 0.72. (Note that your values may vary slightly depending on the training/test split that is used and on details of how the underlying hyperparameters are fitted; we will address these variations in the section on cross-validation below.) However, the real performance of the model should be evaluated with the test set data, which were not used for model training. This tests the ability of the model to generalize to unseen examples. It too can be performed with the ClassifierMeasurements function, only now we provide the trained model (nbModel) together with the held-out testSet for it to evaluate on:
In[]:=
nbPerformanceOnTestData = ClassifierMeasurements[nbModel, testSet];
nbPerformanceOnTestData["Report"]
testSummaryTable[nbPerformanceOnTestData]
Out[]=
Classifier Measurements
Out[]//TableForm=
Accuracy | 0.557353 |
Recall | <|0 -> 0.529022, 1 -> 0.779221|>
MCC | <|0 -> 0.195365, 1 -> 0.195365|>
Sensitivity | <|0 -> 0.529022, 1 -> 0.779221|>
Specificity | <|0 -> 0.779221, 1 -> 0.529022|>
AreaUnderROCCurve | <|0 -> 0.697207, 1 -> 0.697207|>
When evaluated on the test set, the accuracy is somewhat lower than for the training set. This is to be expected; moreover, the test-set results provide a better estimate of the performance on new data.
Note: The precise values of the model performance metrics you obtain will vary if you change the training and test sets, or change various hyperparameters associated with the models constructed by Classify. The latter can be important with some model types, such as the Decision Trees studied in the next section, and the default settings (and the extent to which they are optimized) is subject to change in future versions of Mathematica. Your results should be comparable, even if they are not numerically identical.
Decision Tree
In[]:=
decisionTree = Classify[trainingSet, Method -> "DecisionTree"];
How well does this model perform on the test set?
In[]:=
dtPerformance = ClassifierMeasurements[decisionTree, testSet];
dtPerformance["Report"]
testSummaryTable[dtPerformance]
Out[]=
Classifier Measurements
Out[]//TableForm=
Accuracy | 0.752941 |
Recall | <|0 -> 0.761194, 1 -> 0.688312|>
MCC | <|0 -> 0.314002, 1 -> 0.314002|>
Sensitivity | <|0 -> 0.761194, 1 -> 0.688312|>
Specificity | <|0 -> 0.688312, 1 -> 0.761194|>
AreaUnderROCCurve | <|0 -> 0.706748, 1 -> 0.833603|>
Notice the improvement relative to the Naive Bayes model when evaluating on the test set. We have considerably fewer false positives (predicted to be activity = 1, but truly activity = 0), which improves most of the metrics.
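To make the comparison concrete, the two ClassifierMeasurementsObjects computed above can be tabulated side by side; a small convenience sketch:

(* Side-by-side test-set comparison of the Naive Bayes and decision tree models *)
TableForm[
 {#, nbPerformanceOnTestData[#], dtPerformance[#]} & /@ {"Accuracy", "MCC"},
 TableHeadings -> {None, {"Metric", "Naive Bayes", "Decision Tree"}}]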
How would the model perform without down sampling?
It is natural to ask how the results would have changed had we not downsampled the training data. Let’s do a numerical experiment to find out:
In[]:=
dtNoDownsamplePerformance = ClassifierMeasurements[
   Classify[
    prepareInput[train], (* the entire training set, without downsampling *)
    Method -> "DecisionTree"],
   testSet];
dtNoDownsamplePerformance["Report"]
testSummaryTable[dtNoDownsamplePerformance]
Out[]=
Classifier Measurements
Out[]//TableForm=
Accuracy | 0.889706 |
Recall | <|0 -> 0.973466, 1 -> 0.233766|>
MCC | <|0 -> 0.301304, 1 -> 0.301304|>
Sensitivity | <|0 -> 0.973466, 1 -> 0.233766|>
Specificity | <|0 -> 0.233766, 1 -> 0.973466|>
AreaUnderROCCurve | <|0 -> 0.653594, 1 -> 0.880619|>
From the Accuracy score, it appears that skipping the downsampling improves overall performance. However, accuracy can be a misleading metric, and it is helpful to examine how each model behaves by comparing the two confusion matrices:
In[]:=
TableForm[
 {#["ConfusionMatrixPlot"] & /@ {dtPerformance, dtNoDownsamplePerformance}},
 TableHeadings -> {None, {"Downsampled", "Not Downsampled"}}]
Out[]//TableForm=
Downsampled | Not Downsampled |
Comparing the two, the model trained without downsampling produces fewer false positives (upper right corner). This makes sense—it sees many more inactive compounds during training, so it has a better idea of what inactive molecules “look like”. However, its number of false negatives (lower left corner) is much larger, which indicates that it is biased towards predicting that a molecule is inactive, even when it is truly active. The downsampled model trades some of those false negatives for additional false positives.
The latter kind of bias is problematic in a drug discovery application. If active molecules are rare, we don’t want to miss any of them by spuriously predicting them to be inactive; testing a few extra false-positive candidates, on the other hand, may not be a problem. This goal is better captured by the Recall score for the active class (activity = 1), which measures how many of the actual positives are predicted to be positive by the model.
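For instance, comparing the active-class Recall of the two models, using the ClassifierMeasurementsObjects computed above, makes this tradeoff explicit:

(* Recall for the active class (1): downsampled vs. non-downsampled training *)
#["Recall"][1] & /@ <|"Downsampled" -> dtPerformance, "Not downsampled" -> dtNoDownsamplePerformance|>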
Predicting error by cross-validation
So far, we have evaluated the performance of a model trained on one specific subset and evaluated on another specific subset. This raises the question: how much would the results change if we had chosen a different training and test set? To study this, let’s define a function that performs the entire process, returning a ClassifierMeasurementsObject:
In[]:=
modelConstruction[method_][df_Dataset] := With[
  {trainTestData = MapThread[
     prepareInput@#1@#2 &, (* process into trainable form *)
     {{downsample["activity"], Identity}, (* downsample the training set, leave the test set alone *)
      trainTestSplit[0.9][df]} (* results in a list of {training, testing} data *)
     ]},
  ClassifierMeasurements[
   Classify[trainTestData[[1]], Method -> method, TrainingProgressReporting -> None],
   trainTestData[[2]]]]
Then generate ten examples—i.e., ten random choices of training/test set splits:
In[]:=
results=Table[modelConstruction["DecisionTree"]@dfData,{10}];
How do the results vary? We can interrogate each of the ClassifierMeasurementsObject entries in the list of results to determine the accuracy:
In[]:=
#["Accuracy"]&/@results
Out[]=
{0.783824,0.723529,0.638235,0.716176,0.592647,0.717647,0.713235,0.575,0.698529,0.697059}
These tend to fluctuate around 0.68, although there are some outliers:
In[]:=
%//MeanAround
Out[]=
0.686 ± 0.020
Properties that return an association (e.g., separate results for the 0 and 1 classes) can also be examined. The mean Recall score indicates that we can expect to catch about 67% of the actual active compounds based on our predictions:
In[]:=
MeanAround[#["Recall"][1]&/@results]
Out[]=
0.67 ± 0.04
Model building through cross-validation
Classify attempts to hide many of the details of model construction from the user; these come in the form of unspecified method options that set the default values for hyperparameters which cannot be learned by the training algorithm.
In[]:=
Information[decisionTree,"MethodOption"]
Out[]=
Method -> {"DecisionTree", "DistributionSmoothing" -> 1, "FeatureFraction" -> 1}
Referring to the DecisionTree documentation, larger values of the DistributionSmoothing option typically reduce the size of the tree, and the FeatureFraction option limits the selection of features used by the tree. In general, Classify tries to optimize these hyperparameters by performing a cross-validation “behind the scenes” on the training data that is provided. But we can also use cross-validation to tune these parameters manually. The modelConstruction function defined above allows us to specify the hyperparameters:
In[]:=
modelConstruction[{"DecisionTree", "DistributionSmoothing" -> 10}][dfData]
Out[]=
ClassifierMeasurementsObject
| |||||
Data not saved. Save now |
We don’t need the full report, so perhaps we will just examine the recall for the active class:
In[]:=
modelConstruction[{"DecisionTree", "DistributionSmoothing" -> 10}][dfData]["Recall"][1]
Out[]=
0.768293
This appears to be an improvement upon the average recall value computed at the end of the previous section. To see if this is just lucky or meaningful, we can perform cross validation to repeat the train-test split 10 times for each value of the parameter and get the mean and error. We’ll define a function to perform the evaluation, returning the mean performance over the ten trials. Because we can divide up the work evenly, we’ll use ParallelTable to run each calculation on a separate processor:
In[]:=
distributionTune[parameter_?NumericQ] := Table[
   modelConstruction[{"DecisionTree", "DistributionSmoothing" -> parameter}][dfData]["Recall"][1],
   {10}] // MeanAround
In[]:=
results=ParallelTable[{p,distributionTune[p]},{p,{0.1,1,5,10,50}}]
Out[]=
{{0.1, 0.663 ± 0.022}, {1, 0.671 ± 0.019}, {5, 0.688 ± 0.021}, {10, 0.701 ± 0.023}, {50, 0.658 ± 0.028}}
Examining this performance, we see that the model is largely robust to this hyperparameter, but some improvement can be achieved by increasing the DistributionSmoothing option to around 5–10.
In[]:=
ListLinePlot[results, AxesLabel -> {"Smoothing", "Recall"}]
Out[]=
In real life, one would want to perform experiments like these to determine the best hyperparameter values. Here we have only optimized the DistributionSmoothing parameter, but there is also the FeatureFraction parameter. When there are only two hyperparameters, we might define a grid of values and evaluate the model on each of these choices, but such an enumerative (grid) search becomes computationally intractable as the number of parameters increases. In practice, a random sampling of hyperparameter choices is often sufficient (Bergstra & Bengio, J. Machine Learning Res. 13, 281-305 (2012)), as sketched below.
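A random search over both hyperparameters can reuse the modelConstruction function defined above. A minimal sketch (the helper name randomSearch and the sampling ranges are illustrative choices, not recommendations):

(* Randomly sample hyperparameter settings and record the active-class recall for each *)
randomSearch[n_Integer] := Table[
  With[{smooth = RandomReal[{0.1, 50}], frac = RandomReal[{0.2, 1}]},
   <|"DistributionSmoothing" -> smooth, "FeatureFraction" -> frac,
    "Recall" -> modelConstruction[{"DecisionTree",
        "DistributionSmoothing" -> smooth, "FeatureFraction" -> frac}][dfData]["Recall"][1]|>],
  {n}]
(* e.g., Dataset[randomSearch[10]][ReverseSortBy["Recall"]] ranks the sampled settings *)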
Cross validation in practice
We have illustrated how to construct a cross validation by manually dividing the dataset and building machine learning models for pedagogical purposes. In practice, one can use pre-existing functions, such as the CrossValidateModel repository function to facilitate cross-validation of models over a dataset.
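For readers who prefer to stay with the helper functions defined in this notebook, the same idea can also be written as an explicit k-fold loop. A minimal sketch (k = 5 assumed; the helper name kFoldRecall is illustrative), reusing prepareInput, downsample, and the "DecisionTree" settings from above:

(* Explicit k-fold cross-validation, reporting the active-class recall on each fold *)
kFoldRecall[df_Dataset, k_Integer: 5] := With[
  {folds = Partition[RandomSample[Normal@df], UpTo[Ceiling[Length[df]/k]]]},
  Table[
   With[{testFold = Dataset[folds[[i]]], trainFolds = Dataset[Join @@ Delete[folds, i]]},
    ClassifierMeasurements[
      Classify[prepareInput[downsample["activity"][trainFolds]],
       Method -> "DecisionTree", TrainingProgressReporting -> None],
      prepareInput[testFold]]["Recall"][1]],
   {i, k}]]
(* e.g., MeanAround@kFoldRecall[dfData, 5] *)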
Exercises
In this assignment, we will build predictive models using the same aromatase data.
Step 1: Show the following information to make sure that the activity data in the dfActivity Dataset is still available (if not, reconstruct it)
◼
The first five lines of dfActivity
In[]:=
(* Write your code in this cell. *)
◼
The counts of active/inactive compounds in dfActivity
In[]:=
(* Write your code in this cell. *)
Step 2: Show the following information to make sure the structure data is still available:
◼
The first five lines of dfSmiles
In[]:=
(* Write your code in this cell. *)
◼
The number of rows of dfSmiles
In[]:=
(* Write your code in this cell. *)
Step 3: Retrieve the PubChem fingerprints using the CompoundIDs, storing the result in a variable named dfFPS:
In[]:=
(* Write your code in this cell. *)
Step 4: Merge the dfActivity and dfFPS Datasets into a new Dataset called dfData
◼
Join the two data frames using the CID column as keys.
◼
Remove the rows that have any NULL values (i.e., compounds for which the fingerprints couldn’t be generated).
◼
Print the dimensions.
◼
Show the first five entries.
In[]:=
(* Write your code in this cell. *)
Step 5: Prepare input and output data for model building
◼
Remove zero-variance features from the inputs (if any).
◼
Split the data set into training and test sets (90% vs 10%) (using SeedRandom[3100]).
◼
Balance the training data set through downsampling.
◼
Prepare the data into input and output forms.
In[]:=
(* Write your code in this cell. *)
Step 6: Build a Random Forest (RF) model using the balanced training data set.
In[]:=
(* Write your code in this cell. *)
Step 7: Apply the developed RF model to predict the activity of the training set compounds.
◼
Report the confusion matrix, accuracy, recall, sensitivity, specificity, and AUC-ROC.
In[]:=
(* Write your code in this cell. *)
Step 8: Apply the developed RF model to predict the activity of the test set compounds.
◼
Report the confusion matrix, accuracy, recall, sensitivity, specificity, and AUC-ROC.
In[]:=
(* Write your code in this cell. *)
Step 9: Read a recent paper published in Chem. Res. Toxicol. (https://doi.org/10.1021/acs.chemrestox.7b00037) and answer the following questions (in no more than five sentences for each question).
◼
What different approaches did the paper take to develop prediction models (compared to those used in this notebook)?
◼
How different are the models reported in the paper from those constructed in this notebook (in terms of the performance measures)?
◼
What would you do to develop models with improved performance?
Write your answers in this cell... (using words)
Attributions
Adapted from the corresponding OLCC 2019 Python Exercise:
https://chem.libretexts.org/Courses/Intercollegiate_Courses/Cheminformatics_OLCC_(2019)/8%3A_Machine-learning_Basics/8.2%3A_Python_Assignment
Cite this as: Joshua Schrier, "08 Machine Learning Basics" from the Notebook Archive (2020), https://notebookarchive.org/2020-10-ebo8r9c