08 Machine Learning Basics
Author
Joshua Schrier
Title
08 Machine Learning Basics
Description
Build binary classification models that predict activity/inactivity of small molecules against human aromatase using supervised learning methods.
Category
Educational Materials
Keywords
cheminformatics, Chemoinformatics, chemical information, PubChem, quantitative structure, property relationships, QSPR, machine learning, computer-aided drug design, chemistry
URL
http://www.notebookarchive.org/2020-10-ebo8r9c/
DOI
https://notebookarchive.org/2020-10-ebo8r9c
Date Added
2020-10-31
Date Last Modified
2020-10-31
File Size
2.73 megabytes
Supplements
Rights
CC BY-NC-SA 4.0
Machine Learning Basics
Objectives
◼
Build binary classification models that predict activity/inactivity of small molecules against human aromatase using supervised learning methods.
◼
Evaluate the performance of the developed models using performance measures.
Import bioactivity data from PubChem
In this notebook, we will develop a prediction model for small molecule’s activity against human aromatase (https://pubchem.ncbi.nlm.nih.gov/protein/EAW77416), which is encoded by the CYP19A1 gene (https://pubchem.ncbi.nlm.nih.gov/gene/1588). The model will predict the activity of a molecule based on the structure of the molecule (represented with molecular fingerprints).
For model development, we will use the Tox21 bioassay data for human aromatase, archived in PubChem (https://pubchem.ncbi.nlm.nih.gov/bioassay/743139). The bioactivity data presented on that page can be downloaded by clicking the “Download” button and then read into a Dataset. Alternatively, you can load the data directly into a Dataset, as shown in the cell below:
In[]:=
url="https://pubchem.ncbi.nlm.nih.gov/assay/pcget.cgi?query=download&record_type=datatable&actvty=all&response_type=save&aid=743139";dfRaw=Import[url,"Dataset","HeaderLines"1]
Out[]=
Note: The first three rows after the header provide descriptions for each column (data type, description, units, etc.). These rows need to be removed; one can either drop them after the import or exclude them on import. SemanticImport is a variant of the Import function that adds additional interpretive elements, and it also has an option for excluding specific lines from the file:
In[]:=
dfRaw = SemanticImport[url, "HeaderLines" -> 1, ExcludedLines -> {2, 3, 4}]
Out[]=
What are the column names?
In[]:=
Keys@First@dfRaw
Out[]=
Check the number of compounds for each activity group
First, we need to understand what our data look like. In particular, we are interested in the activity class of the tested compounds, because we are developing a model that classifies small molecules according to their activities against the target. This information is available in the “PUBCHEM_ACTIVITY_OUTCOME” and “Activity Summary” columns:
In[]:=
dfRaw[CountsBy["PUBCHEM_ACTIVITY_OUTCOME"]]
Out[]=
Based on the activity outcome column, there are 379 actives, 7562 inactives, and 2545 inconclusives. (As noted in several previous exercises, these numbers are subject to increase in the future, as additional experimental results are added to the PubChem database.) One way to make sense of these results is to group the entries by activity outcome (GroupBy) and then tabulate the counts of each activity summary description (CountsBy):
In[]:=
dfRaw[GroupBy["PUBCHEM_ACTIVITY_OUTCOME"],CountsBy["Activity Summary"]]
Out[]=
Now we can see that, in the “Activity Summary” column, the inconclusive compounds are further classified into subclasses, which include:
◼
active agonist
◼
inconclusive
◼
inconclusive agonist
◼
inconclusive antagonist
◼
inconclusive agonist (cytotoxic)
◼
inconclusive antagonist (cytotoxic)
As implied in the title of this assay record (https://pubchem.ncbi.nlm.nih.gov/bioassay/743139), this assay aims to identify aromatase inhibitors. Therefore, all active antagonists (in the activity summary column) were declared to be active compounds (in the activity outcome column).
On the other hand, the assay also identified 612 active agonists (in the activity summary column), and they are declared to be inconclusive (in the activity outcome column).
With that said, “inactive” compounds in this assay are those that are neither active agonists nor active antagonists.
It is important to understand that the criteria used for determining whether a compound is active in a given assay are chosen by the data source that submitted the assay data to PubChem. For the purpose of this assignment (which aims to develop a binary classifier that tells whether a molecule is active or inactive against the target), we should clarify what we mean by “active” and “inactive”:
◼
active: any compounds that can change (either increase or decrease) the activity of the target. This is equivalent to either active antagonists or active agonists in the activity summary column.
◼
inactive: any compounds that do not change the activity of the target. This is equivalent to inactive compounds in the activity summary column.
Select active/inactive compounds for model building
Now we want to select only the active and inactive compounds from the dataset (that is, the active agonists, active antagonists, and inactives, based on the “Activity Summary” column).
In[]:=
activeInactive = Query[Select[ContainsAny[{"active agonist", "active antagonist", "inactive"}]@{#["Activity Summary"]} &]];
df = activeInactive@dfRaw
Out[]=
In[]:=
df[DeleteDuplicatesBy[#["PUBCHEM_SID"] &]] // Length (* unique SIDs *)
df[DeleteDuplicatesBy[#["PUBCHEM_CID"] &]] // Length (* unique CIDs *)
Out[]=
8553
Out[]=
6864
Note that the number of CIDs is not the same as the number of SIDs. There are two important potential reasons for this observation.
First, not all substances (SIDs) in PubChem have associated compounds (CIDs) because some substances failed during structure standardization. Remember that, in PubChem, substances are depositor-provided structures and compounds are unique structures extracted from substances through structure standardization. Because our model will use structural information of molecules to predict their bioactivity, we need to remove substances without associated CIDs (i.e., no standardized structures).
Second, some compounds are associated with more than one substance. In the context of this assay, it means that a compound may have been tested multiple times in different samples (which are designated as different substances). It is not uncommon for different samples of the same chemical to give conflicting activities (e.g., active agonist in one sample but inactive in another). In this assignment, we remove such compounds with conflicting activities.
Drop substances without associated CIDs.
First, check whether there are substances without associated CIDs. Let’s define a function that counts the missing values in each column:
In[]:=
anyMissing[df_Dataset] := Transpose[df][All, Count[_Missing]]
anyMissing@df (* apply function *)
Out[]=
There are 138 records missing a “PUBCHEM_CID” entry. We want to remove those records:
In[]:=
removeMissingCID = Query[Select[! MissingQ@#["PUBCHEM_CID"] &]]; (* define query *)
df = removeMissingCID@df; (* apply query to dataset *)
anyMissing@df (* check again if the missing values have disappeared *)
Out[]=
Remove CIDs with conflicting activities
Now identify compounds with conflicting activities and remove them. The first step is to group the entries by CID, and determine how many distinct values of the Activity Summary are present. We will extract the entries where there is more than one distinct value:
In[]:=
CIDsWithConflictingActivity[df_Dataset]:=df[GroupBy["PUBCHEM_CID"],CountDistinct,"Activity Summary"][Select[GreaterThan[1]]]
In[]:=
CIDsWithConflictingActivity[df]
Out[]=
Let’s turn this into a list:
In[]:=
cidConflicts=(CIDsWithConflictingActivity@df)[[Keys]]//Normal
Out[]=
{16043,443939,2170,2554,3397,2998,4911,5333,54682468,2145,8569,4632,4114,5362,54454,4764,6167,5280450,667493,12358480,2950,5757,22571,15106,6917728,6238,135242,161803,14982,5391,7057995,135398739,28871,6917655,91744,9270,25750,33344,60196338,9833444,392622,6526396,4837,11443,8348,53316406,10458,8183,9352,7447,10332,637511,6001,7092,38479,11860,70837,9076,8560,8174,6478,15294,6944,7381,10868}
What do the data for these conflicting results look like (146 rows in total)?
In[]:=
df[Select[ContainsAny[cidConflicts]@{#["PUBCHEM_CID"]}&]][SortBy["PUBCHEM_CID"]]
Out[]=
Let’s remove these conflicting entries:
In[]:=
removeCIDs[cids_List] := Query[Select[ContainsNone[cids]@{#["PUBCHEM_CID"]} &]];
df = removeCIDs[cidConflicts]@df;
df[GroupBy["Activity Summary"], Length]
Out[]=
How many unique SIDs and CIDs are there?
In[]:=
df[CountDistinct,#]&/@{"PUBCHEM_SID","PUBCHEM_CID"}
Out[]=
{8269,6798}
Remove redundant data
The above code cells do not remove compounds tested multiple times when the testing results are consistent [e.g., active agonist in all samples (substances)]. The rows corresponding to these compounds are redundant, so we want to remove them, keeping only one row per compound.
In[]:=
df = df[DeleteDuplicatesBy["PUBCHEM_CID"]]
df[CountDistinct, #] & /@ {"PUBCHEM_SID", "PUBCHEM_CID"}
Out[]=
Out[]=
{6798,6798}
Adding “numeric” activity classes
In general, the inputs and outputs to machine learning algorithms need to have numerical forms.
In this exercise, the input (molecular structure) will be represented with binary fingerprints, which already have numerical forms (0 or 1). However, the output (activity) is currently in a string format (e.g., ‘active agonist’, ‘active antagonist’). Therefore, we want to add an additional, ‘activity’ column, which contains numeric codes representing the active and inactive compounds:
◼
1 for actives (either active agonists or active antagonists)
◼
0 for inactives
Note that we are merging the two classes "active agonist" and "active antagonist", because we are going to build a binary classifier that distinguishes actives from inactives. We’ll also take this opportunity to create a smaller Dataset that contains only the CIDs and these binarized activity values. The Boole function converts True and False to 1 and 0:
In[]:=
dfActivity = df[All,
   <|"CompoundID" -> #["PUBCHEM_CID"],
    "activity" -> Boole@Not@StringMatchQ["inactive"]@#["Activity Summary"]|> &]
Out[]=
In[]:=
df[CountsBy["Activity Summary"]]
Out[]=
In[]:=
dfActivity[CountsBy["activity"]]
Out[]=
Download structure information for each compound from PubChem
Now we want to retrieve structure information for the compounds from PubChem (as isomeric SMILES), using the requestProperties function originally defined in Assignment 7 (copied below):
In[]:=
requestProperties[cids_List, chunkSize_Integer: 200, properties_List: {"IsomericSMILES"}] := With[
  {cidsBatches = Partition[cids, UpTo[chunkSize]]},
  Join @@ Map[
    (Pause[0.2];
      ServiceExecute["PubChem", "CompoundProperties",
       {"CompoundID" -> #, "Property" -> properties}]) &,
    cidsBatches]]
requestProperties[cids_Dataset, chunkSize_Integer: 200, properties_List: {"IsomericSMILES"}] := With[
  {cidList = cids[All, "CompoundID"] // Normal},
  requestProperties[cidList, chunkSize, properties]]
In[]:=
dfSmiles=requestProperties@dfActivity
Out[]=
Generate MACCS keys from SMILES.
Account for possible failures (e.g., the presence of “invalid” SMILES strings—in fact, two are present in this dataset). This will take a few minutes to generate:
In[]:=
dfFPS = dfSmiles[All,
    Join[
      <|"CompoundID" -> #CompoundID|>,
      Association@MapIndexed[StringTemplate["maccs``"]@First@#2 -> #1 &,
        ResourceFunction["MACCSKeys"]@#IsomericSMILES] (* MACCS fingerprint generator; displayed as an inline resource-function object in the original notebook *)
      ] &,
    FailureAction -> "Drop"];
Merge activity data and fingerprint information
In[]:=
dfData=JoinAcross[dfActivity,dfFPS,"CompoundID"];
Save dfData in CSV for future use:
In[]:=
Export["dfData.csv",dfData]
Out[]=
dfData.csv
In[]:=
(* dfData = SemanticImport["dfData.csv"]; *) (* in case of a restart, uncomment and evaluate this line *)
Preparation for model building
Removing zero variance features
Are there columns where the MACCS bits do not vary? Let’s find out:
In[]:=
Transpose[dfData][All,Variance][TakeSmallest[10]]
Out[]=
Indeed, three of these have no variation whatsoever. So let’s define a function that applies a variance-threshold criterion, dropping columns that have zero variance, and apply it to the dataset:
In[]:=
removeZeroVarianceColumns[df_Dataset] := With[
  {retainedColumns = Normal@Keys@Select[GreaterThan[0]]@Transpose[df][All, Variance]},
  df[All, retainedColumns]]
dfData = removeZeroVarianceColumns@dfData;
Creating a Train-Test Split (9:1 ratio)
Now split the data set into a training set (90%) and a test set (10%). The training set will be used to train the model; the developed model will then be tested against the test set.
We will define our own function here, for educational purposes and so that we have control over the process. In practice, one can use the TrainTestSplit repository function to perform these types of operations:
In[]:=
trainTestSplit[fraction_][df_Dataset] := With[
  {nTrainingItems = UpTo@Round@(fraction*Length[df])},
  TakeDrop[
    RandomSample[df], (* randomly shuffle the rows *)
    nTrainingItems] (* take up to 90% of the data for training *)
  ]
Now apply this function:
In[]:=
SeedRandom[1841]; (* to force this to be reproducible every time we run this notebook *)
{train, test} = trainTestSplit[0.9]@dfData;
Length /@ {train, test}
Out[]=
{6116,680}
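As noted above, the TrainTestSplit resource function can perform the same job as our hand-rolled helper. A minimal sketch, assuming its default call pattern returns a {training, test} pair (the variable names trainAlt and testAlt are illustrative; check the Function Repository page for the split-fraction options):

(* Assumption: default usage returns a {training, test} pair; see the repository documentation for options *)
{trainAlt, testAlt} = ResourceFunction["TrainTestSplit"][dfData];
Length /@ {trainAlt, testAlt}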
Balance the training set through downsampling
Check the dimension of the training data set:
In[]:=
Length[train]
Out[]=
6116
Check the number of active and inactive compounds:
In[]:=
train[Counts,"activity"]
Out[]=
The data set is highly imbalanced (5451 inactives to 665 actives, a ratio of roughly 8:1). To address this issue, let’s downsample the majority class (inactive compounds) to balance the data set.
In[]:=
downsample[column_String][df_Dataset] := With[
  {groups = df[GroupBy[column]]},
  With[
   {actives = groups[Key[1]], (* assume that the column is either 0 or 1 encoded *)
    inactives = groups[Key[0]]},
   Join[
    actives, (* the active examples; take them all *)
    RandomSample[inactives, Length[actives]] (* sample the inactives *)
    ]]]
In[]:=
SeedRandom[1841]; (* set a random seed value to make this reproducible *)
downsampledTrain = downsample["activity"]@train;
Confirm that the downsampled data set has the correct number of representatives from each class.
In[]:=
downsampledTrain[CountsBy["activity"]]
Out[]=
Separate input and output variables
We now want to separate the “inputs” (the MACCS keys) from the “outputs” (the activity). The Classify function takes a variety of different input styles, but one of the more convenient ones is an Association whose keys are the class labels (outputs) and whose values are lists of examples (inputs). We want to make sure that the “inputs” do not include the activity or CompoundID values.
In[]:=
prepareInput[df_Dataset] := df[GroupBy["activity"], All, KeyDrop[{"activity", "CompoundID"}]] // Normal
{trainingSet, testSet} = prepareInput /@ {downsampledTrain, test};
Build a model using the training set
Now we are ready to build predictive models using machine learning algorithms available in the Classify function. This notebook will use Naive Bayes and decision tree models, because they are relatively fast and simple, but there are many other options described in the documentation.
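Any of the built-in Classify methods can be swapped in by changing the Method option. For example, a minimal sketch using the built-in "LogisticRegression" method (not otherwise used in this notebook; the variable name lrModel is illustrative):

(* Illustrative only: substitute another built-in Classify method *)
lrModel = Classify[trainingSet, Method -> "LogisticRegression"];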
Naive Bayes
In[]:=
nbModel = Classify[ (* train the model *)
  trainingSet, (* using this data as the training set *)
  Method -> "NaiveBayes"] (* specify the model type *)
Out[]=
ClassifierFunction
How well does this model describe the training data? One way to quantify this is by the accuracy:
In[]:=
ClassifierMeasurements[nbModel,trainingSet,"Accuracy"]
Out[]=
0.657895
If we are interested in multiple properties, it is better to generate a general ClassifierMeasurementsObject that can be used to assess different performance metrics of the model:
In[]:=
nbPerformanceOnTrainingData = ClassifierMeasurements[nbModel, trainingSet]
nbPerformanceOnTrainingData["Report"]
Out[]=
ClassifierMeasurementsObject
Out[]=
Classifier Measurements
Desired performance metrics can be extracted from the ClassifierMeasurementsObject and used to construct summary tables of the results:
In[]:=
testSummaryTable[cm_ClassifierMeasurementsObject] :=
  {#, cm[#]} & /@ {"Accuracy", "Recall", "MCC", "Sensitivity", "Specificity", "AreaUnderROCCurve"} // TableForm
testSummaryTable[nbPerformanceOnTrainingData]
Out[]//TableForm=
Accuracy | 0.657895 |
Recall | <|0 -> 0.536842, 1 -> 0.778947|>
MCC | <|0 -> 0.325472, 1 -> 0.325472|>
Sensitivity | <|0 -> 0.536842, 1 -> 0.778947|>
Specificity | <|0 -> 0.778947, 1 -> 0.536842|>
AreaUnderROCCurve | <|0 -> 0.720391, 1 -> 0.720382|>
When applied to predict the activity of the training compounds, the NB classifier achieved an accuracy of 0.66 and an AUC-ROC of 0.72. (Note that your values may vary slightly depending on the training/test split that is used and on details of how the underlying hyperparameters are fitted; we will address these variations in the section on cross-validation below.) However, the real performance of the model should be evaluated with the test set data, which were not used for model training. This tests the ability of the model to generalize to unseen examples. It too can be performed with the ClassifierMeasurements function, only now we provide the trained model (nbModel) together with the held-out testSet for it to evaluate on:
In[]:=
nbPerformanceOnTestData = ClassifierMeasurements[nbModel, testSet];
nbPerformanceOnTestData["Report"]
testSummaryTable[nbPerformanceOnTestData]
Out[]=
Classifier Measurements
Out[]//TableForm=
Accuracy | 0.557353 |
Recall | <|0 -> 0.529022, 1 -> 0.779221|>
MCC | <|0 -> 0.195365, 1 -> 0.195365|>
Sensitivity | <|0 -> 0.529022, 1 -> 0.779221|>
Specificity | <|0 -> 0.779221, 1 -> 0.529022|>
AreaUnderROCCurve | <|0 -> 0.697207, 1 -> 0.697207|>
When evaluated on the test set, the accuracy is somewhat lower than for the training set. This is to be expected; moreover, the test-set results provide a better estimate of the performance on new data.
Note: The precise values of the model performance metrics you obtain will vary if you change the training and test sets, or change various hyperparameters associated with the models constructed by Classify. The latter can be important with some model types, such as the Decision Trees studied in the next section, and the default settings (and the extent to which they are optimized) is subject to change in future versions of Mathematica. Your results should be comparable, even if they are not numerically identical.
Decision Tree
In[]:=
decisionTree = Classify[trainingSet, Method -> "DecisionTree"];
How well does this model perform on the test set?
In[]:=
dtPerformance = ClassifierMeasurements[decisionTree, testSet];
dtPerformance["Report"]
testSummaryTable[dtPerformance]
Out[]=
Classifier Measurements
Out[]//TableForm=
Accuracy | 0.752941 |
Recall | <|0 -> 0.761194, 1 -> 0.688312|>
MCC | <|0 -> 0.314002, 1 -> 0.314002|>
Sensitivity | <|0 -> 0.761194, 1 -> 0.688312|>
Specificity | <|0 -> 0.688312, 1 -> 0.761194|>
AreaUnderROCCurve | <|0 -> 0.706748, 1 -> 0.833603|>
Notice the improvement relative to the Naive Bayes model when evaluating on the test set. We have considerably fewer false positives (predicted to be activity = 1, but truly activity = 0), which improves most of the metrics.
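To make the comparison concrete, the two ClassifierMeasurementsObjects computed above can be tabulated side by side; a small convenience sketch:

(* Side-by-side test-set comparison of the Naive Bayes and decision tree models *)
TableForm[
 {#, nbPerformanceOnTestData[#], dtPerformance[#]} & /@ {"Accuracy", "MCC"},
 TableHeadings -> {None, {"Metric", "Naive Bayes", "Decision Tree"}}]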
How would the model perform without down sampling?
It is natural to ask how the results would have changed had we not downsampled the training data. Let’s do a numerical experiment to find out:
In[]:=
dtNoDownsamplePerformance = ClassifierMeasurements[
   Classify[
    prepareInput[train], (* the entire training set, without downsampling *)
    Method -> "DecisionTree"],
   testSet];
dtNoDownsamplePerformance["Report"]
testSummaryTable[dtNoDownsamplePerformance]
Out[]=
Classifier Measurements
Out[]//TableForm=
Accuracy | 0.889706 |
Recall | <|0 -> 0.973466, 1 -> 0.233766|>
MCC | <|0 -> 0.301304, 1 -> 0.301304|>
Sensitivity | <|0 -> 0.973466, 1 -> 0.233766|>
Specificity | <|0 -> 0.233766, 1 -> 0.973466|>
AreaUnderROCCurve | <|0 -> 0.653594, 1 -> 0.880619|>
From the Accuracy score, it appears that skipping the downsampling improves overall performance. However, accuracy can be a misleading metric, and it is helpful to examine how each model behaves by comparing the two confusion matrices:
In[]:=
TableForm[
 {#["ConfusionMatrixPlot"] & /@ {dtPerformance, dtNoDownsamplePerformance}},
 TableHeadings -> {None, {"Downsampled", "Not Downsampled"}}]
Out[]//TableForm=
Downsampled | Not Downsampled |
Comparing the two, the model trained without downsampling produces fewer false positives (upper right corner). This makes sense—it sees many more inactive compounds during training, so it has a better idea of what inactive molecules “look like”. However, its number of false negatives (lower left corner) is much larger, which indicates that it is biased towards predicting that a molecule is inactive, even when it is truly active. The downsampled model trades some of those false negatives for additional false positives.
The latter kind of bias is problematic in a drug discovery application. If active molecules are rare, we don’t want to miss any of them by spuriously predicting them to be inactive; testing a few extra false-positive candidates, on the other hand, may not be a problem. This goal is better captured by the Recall score for the active class (activity = 1), which measures how many of the actual positives are predicted to be positive by the model.
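For instance, comparing the active-class Recall of the two models, using the ClassifierMeasurementsObjects computed above, makes this tradeoff explicit:

(* Recall for the active class (1): downsampled vs. non-downsampled training *)
#["Recall"][1] & /@ <|"Downsampled" -> dtPerformance, "Not downsampled" -> dtNoDownsamplePerformance|>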
Predicting error by cross-validation
So far, we have evaluated the performance of a model trained on one specific subset and evaluated on another specific subset. This raises the question: how much would the results change if we had chosen a different training and test set? To study this, let’s define a function that performs the entire process, returning a ClassifierMeasurementsObject:
In[]:=
modelConstruction[method_][df_Dataset] := With[
  {trainTestData = MapThread[
     prepareInput@#1@#2 &, (* process into trainable form *)
     {{downsample["activity"], Identity}, (* downsample the training set, leave the test set alone *)
      trainTestSplit[0.9][df]} (* results in a list of {training, testing} data *)
     ]},
  ClassifierMeasurements[
   Classify[trainTestData[[1]], Method -> method, TrainingProgressReporting -> None],
   trainTestData[[2]]]]
Then generate ten examples—i.e., ten random choices of training/test set splits:
In[]:=
results=Table[modelConstruction["DecisionTree"]@dfData,{10}];
How do the results vary? We can interrogate each of the ClassifierMeasurementsObject entries in the list of results to determine the accuracy:
In[]:=
#["Accuracy"]&/@results
Out[]=
{0.783824,0.723529,0.638235,0.716176,0.592647,0.717647,0.713235,0.575,0.698529,0.697059}
These tend to fluctuate around 0.68, although there are some outliers:
In[]:=
%//MeanAround
Out[]=
0.686 ± 0.020
Properties that return an association (e.g., separate results for the 0 and 1 classes) can also be examined. The mean Recall score indicates that we can expect to catch about 67% of the actual active compounds based on our predictions:
In[]:=
MeanAround[#["Recall"][1]&/@results]
Out[]=
0.67 ± 0.04
Model building through cross-validation
Classify attempts to hide many of the details of model construction from the user; these come in the form of unspecified method options that set the default values for hyperparameters which cannot be learned by the training algorithm.
In[]:=
Information[decisionTree,"MethodOption"]
Out[]=
Method -> {"DecisionTree", "DistributionSmoothing" -> 1, "FeatureFraction" -> 1}
Referring to the DecisionTree documentation, larger values of the DistributionSmoothing option typically reduce the size of the tree, and the FeatureFraction option limits the selection of features used by the tree. In general, Classify tries to optimize these hyperparameters by performing a cross-validation “behind the scenes” on the training data that is provided. But we can also use cross-validation to tune these parameters manually. The modelConstruction function defined above allows us to specify the hyperparameters:
In[]:=
modelConstruction[{"DecisionTree", "DistributionSmoothing" -> 10}][dfData]
Out[]=
ClassifierMeasurementsObject
| |||||
Data not saved. Save now |
We don’t need the full report, so perhaps we will just examine the recall for the active class:
In[]:=
modelConstruction[{"DecisionTree", "DistributionSmoothing" -> 10}][dfData]["Recall"][1]
Out[]=
0.768293
This appears to be an improvement upon the average recall value computed at the end of the previous section. To see if this is just lucky or meaningful, we can perform cross validation to repeat the train-test split 10 times for each value of the parameter and get the mean and error. We’ll define a function to perform the evaluation, returning the mean performance over the ten trials. Because we can divide up the work evenly, we’ll use ParallelTable to run each calculation on a separate processor:
In[]:=
distributionTune[parameter_?NumericQ] := Table[
   modelConstruction[{"DecisionTree", "DistributionSmoothing" -> parameter}][dfData]["Recall"][1],
   {10}] // MeanAround
In[]:=
results=ParallelTable[{p,distributionTune[p]},{p,{0.1,1,5,10,50}}]
Out[]=
{{0.1, 0.663 ± 0.022}, {1, 0.671 ± 0.019}, {5, 0.688 ± 0.021}, {10, 0.701 ± 0.023}, {50, 0.658 ± 0.028}}
Examining this performance, we see that the model is largely robust to this hyperparameter, but some improvement can be achieved by increasing the DistributionSmoothing option to around 5–10.
In[]:=
ListLinePlot[results, AxesLabel -> {"Smoothing", "Recall"}]
Out[]=
In real life, one would want to perform experiments like these to determine the best hyperparameter values. Here we have only optimized the DistributionSmoothing parameter, but there is also the FeatureFraction parameter. When there are only two hyperparameters, we might define a grid of values and evaluate the model on each of these choices, but such an enumerative (grid) search becomes computationally intractable as the number of parameters increases. In practice, a random sampling of hyperparameter choices is often sufficient (Bergstra & Bengio, J. Machine Learning Res. 13, 281-305 (2012)), as sketched below.
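A random search over both hyperparameters can reuse the modelConstruction function defined above. A minimal sketch (the helper name randomSearch and the sampling ranges are illustrative choices, not recommendations):

(* Randomly sample hyperparameter settings and record the active-class recall for each *)
randomSearch[n_Integer] := Table[
  With[{smooth = RandomReal[{0.1, 50}], frac = RandomReal[{0.2, 1}]},
   <|"DistributionSmoothing" -> smooth, "FeatureFraction" -> frac,
    "Recall" -> modelConstruction[{"DecisionTree",
        "DistributionSmoothing" -> smooth, "FeatureFraction" -> frac}][dfData]["Recall"][1]|>],
  {n}]
(* e.g., Dataset[randomSearch[10]][ReverseSortBy["Recall"]] ranks the sampled settings *)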
Cross validation in practice
We have illustrated how to construct a cross validation by manually dividing the dataset and building machine learning models for pedagogical purposes. In practice, one can use pre-existing functions, such as the CrossValidateModel repository function to facilitate cross-validation of models over a dataset.
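For readers who prefer to stay with the helper functions defined in this notebook, the same idea can also be written as an explicit k-fold loop. A minimal sketch (k = 5 assumed; the helper name kFoldRecall is illustrative), reusing prepareInput, downsample, and the "DecisionTree" settings from above:

(* Explicit k-fold cross-validation, reporting the active-class recall on each fold *)
kFoldRecall[df_Dataset, k_Integer: 5] := With[
  {folds = Partition[RandomSample[Normal@df], UpTo[Ceiling[Length[df]/k]]]},
  Table[
   With[{testFold = Dataset[folds[[i]]], trainFolds = Dataset[Join @@ Delete[folds, i]]},
    ClassifierMeasurements[
      Classify[prepareInput[downsample["activity"][trainFolds]],
       Method -> "DecisionTree", TrainingProgressReporting -> None],
      prepareInput[testFold]]["Recall"][1]],
   {i, k}]]
(* e.g., MeanAround@kFoldRecall[dfData, 5] *)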
Exercises
In this assignment, we will build predictive models using the same aromatase data.
Step 1: Show the following information to make sure that the activity data in the dfActivity Dataset is still available (if not, reconstruct it)
◼
The first five lines of dfActivity
In[]:=
(* Write your code in this cell. *)
◼
The counts of active/inactive compounds in dfActivity
In[]:=
(* Write your code in this cell. *)
Step 2: Show the following information to make sure the structure data is still available:
◼
The first five lines of dfSmiles
In[]:=
(* Write your code in this cell. *)
◼
The number of rows of dfSmiles
In[]:=
(* Write your code in this cell. *)
Step 3: Retrieve the PubChem fingerprints using the CompoundIDs, storing the result in a variable named dfFPS:
In[]:=
(* Write your code in this cell. *)
Step 4: Merge the dfActivity and dfFPS Datasets into a new Dataset called dfData
◼
Join the two data frames using the CID column as keys.
◼
Remove the rows that have any NULL values (i.e., compounds for which the fingerprints couldn’t be generated).
◼
Print the dimensions.
◼
Show the first five entries.
In[]:=
(* Write your code in this cell. *)
Step 5: Prepare input and output data for model building
◼
Remove zero-variance features from the inputs (if any).
◼
Split the data set into training and test sets (90% vs 10%) (using SeedRandom[3100]).
◼
Balance the training data set through downsampling.
◼
Prepare the data into input and output forms.
In[]:=
(* Write your code in this cell. *)
Step 6: Build a Random Forest (RF) model using the balanced training data set.
In[]:=
(* Write your code in this cell. *)
Step 7: Apply the developed RF model to predict the activity of the training set compounds.
◼
Report the confusion matrix, accuracy, recall, sensitivity, specificity, and AUC-ROC.
In[]:=
(* Write your code in this cell. *)
Step 8: Apply the developed RF model to predict the activity of the test set compounds.
◼
Report the confusion matrix, accuracy, recall, sensitivity, specificity, and AUC-ROC.
In[]:=
(* Write your code in this cell. *)
Step 9: Read a recent paper published in Chem. Res. Toxicol. (https://doi.org/10.1021/acs.chemrestox.7b00037) and answer the following questions (in no more than five sentences for each question).
◼
What different approaches did the paper take to develop prediction models (compared to those used in this notebook)?
◼
How different are the models reported in the paper from those constructed in this notebook (in terms of the performance measures)?
◼
What would you do to develop models with improved performance?
Write your answers in this cell... (using words)
Attributions
Adapted from the corresponding OLCC 2019 Python Exercise:
https://chem.libretexts.org/Courses/Intercollegiate_Courses/Cheminformatics_OLCC_(2019)/8%3A_Machine-learning_Basics/8.2%3A_Python_Assignment
Cite this as: Joshua Schrier, "08 Machine Learning Basics" from the Notebook Archive (2020), https://notebookarchive.org/2020-10-ebo8r9c