Data Science with Andreas Lauschke (#6)
Author
Andreas Lauschke
Title
Data Science with Andreas Lauschke (#6)
Description
Outlier Detection Methods
Category
Educational Materials
Keywords
URL
http://www.notebookarchive.org/2020-09-4lmehh8/
DOI
https://notebookarchive.org/2020-09-4lmehh8
Date Added
2020-09-10
Date Last Modified
2020-09-10
File Size
220.62 kilobytes
Supplements
Rights
Redistribution rights reserved



Outlier Detection, Part 1
Andreas Lauschke, Aug 8 2019
today’s session: outlier detection methods, part 1
◼ some quick methods with simple tools
◼ k-nearest neighbor, clustering
◼ diagnostic visualizations
next session: outlier detection methods, part 2 (FindAnomalies)
◼ prob theory reminders
◼ FindAnomalies mathematical intro
◼ simple applications
◼ advanced applications
◼ problems
In[]:=
(* I always like thin gray gridlines on all my plots, so most of my programs start with this line: *)
SetOptions[#, GridLines -> Automatic] & /@ {Plot, ListPlot, ListLinePlot, PolarPlot, AudioPlot, BoxWhiskerChart, DistributionChart, Histogram}; (* also put in your init file! *)
Warning / Caveat
A warning: today’s session is about outlier detection *methods*. It is *not* the case that you should always *remove* an outlier. Oftentimes the outlier carries the most meaningful information. What to *do* with an outlier is always a judgment call of the model builder. You *always* need a *scientific reason* to remove an outlier. “It’s so far out there” is not a valid reason. The removal decision has to be the judgment call of the scientist!
◼
a seemingly outlying observation may not really be an outlier. You *think* it is. But you don’t know the underlying model yet, you haven’t built it yet. What if nature functions according to a process that you just haven’t understood yet, and it contains highly localized behavior?
In[]:=
Plot[{Sin@x + 1/(4 - x)^2}, {x, -4 Pi, 6 Pi}, PlotRange -> {-1, 6}, Epilog -> {PointSize@Large, Red, Point[{#, Sin@# + 1/(4 - #)^2}] & /@ Range[-12.098, 19, Pi/4]}]
◼
a lot of manipulation is done with outlier removal. Some companies fire the employees who “don’t fit in with the crowd”, and then those employees don’t get exit interviews. But *those* people may be the ones with the most relevant information about what the company is doing wrong. Especially *they* should be heard. This is how echo chambers are created (intentionally and unintentionally).
◼
strictly speaking, an outlier should be classified as such based on scientific reasons; the fact that it is “out there” is merely a manifestation of an underlying reason that disqualifies the observation from being in the “group” to begin with. And that reason should be a factor exogenous to the modelling environment.
◼
your measuring equipment gives you different readings because today the power company pumps 200V instead of 110V into the grid. It is fine to remove these measurements, not because the measured readings are different, but because these measurements are invalid to begin with.
◼
your measuring equipment gives you different readings because today is a super-hot day and the air-conditioning is broken, so all your metal sticks are 1 mm longer. NOT ok to remove these measurements, because you now know that your model needs to take temperature into account. If you proceed with the temperature-free model and remove the temperature-caused outliers, you’re building a bad model. ==> build a better model with an additional dimension / predictor.
◼
an outlier may have the most salient information. Oftentimes it’s actually the outlier itself that points you to where the problem is. Sometimes you are *actively looking* for outliers, finding outliers *is the goal*!
◼
room with the hottest temperature ==> that’s probably the room that is on fire
◼
the fuel gauge for tank 5 is coming down rapidly ==> that’s probably where your plane is leaking fuel.
◼
fraud detection ==> you want to find the guy whose financial transactions are sufficiently *dissimilar* to what “everyone else” does.
◼
remember: never remove an outlier without a scientific reason. Don’t remove outliers for being outliers. You need a *scientific reason* that justifies the removal, and that depends on the *purpose* of the model (“find problems” vs. “build a model with homogeneous data”). Sometimes you want only homogeneous data, and sometimes you want to diagnose problems.
◼
but there are many good scientific reasons, and you don’t want outliers that can *justifiably* be removed to distort a good model.
◼
this question may help: is the suspected outlier really *abnormal* and *unnatural*?
Part 1: Quick, with simple Tools
Some of the following are heuristic approaches to *detect* an outlier -- not precise science. Nearest can be useful for 1-dim data; CentralFeature and FindClusters work better with 2-dim (and higher-dim) data.
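As a tiny 2-dim illustration of that last point (my own example, not from the original session): CentralFeature returns an actual data point near the middle of a cloud, and a single planted outlier barely moves it:
In[]:=
(* 50 standard-normal points plus one planted outlier at {8,8}; CentralFeature stays near the origin *)
pts = Join[RandomVariate[NormalDistribution[0, 1], {50, 2}], {{8, 8}}]; CentralFeature[pts]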
an outlier should have a large distance to its nearest neighbor
In[]:=
list = Prepend[RandomInteger[{1, 10}, 20], 32]; {#, Nearest[DeleteCases[list, #] -> "Distance", #]} & /@ list // Sort
If we add just one more outlier with the same distance (1) to the other outlier, the detection falls apart:
In[]:=
list = Join[RandomInteger[{1, 10}, 20], {32, 33}]; {#, Nearest[DeleteCases[list, #] -> "Distance", #]} & /@ list // Sort
but if we increase the number of neighbors, we can still detect outliers (we’re looking at 2 neighbors here):
In[]:=
list = Join[RandomInteger[{1, 10}, 20], {32, 33}]; {#, Nearest[DeleteCases[list, #] -> "Distance", #, 2 (* 20 *)]} & /@ list // Sort
So this means: With a simple nearest-neighbor approach:
◼
we’re looking for ONE outlier! ONE! Two outliers close together can already ruin this simple approach: they’re “short-distance neighbors” for each other.
◼
if it’s more than one outlier, we have to increase the number of neighbors to avoid such traps. That’s why it’s called the “k-nearest neighbor” method (see the sketch right after this list).
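Here is a minimal sketch that packages this idea into a reusable score; knnScore and the 3-MAD cutoff are my own choices, not built-ins or any standard:
In[]:=
(* mean distance to the k nearest neighbors, per point; like the DeleteCases approach above, this removes all copies of the point itself *)
knnScore[data_List, k_Integer] := Mean[Nearest[DeleteCases[data, #] -> "Distance", #, k]] & /@ data
list = Join[RandomInteger[{1, 10}, 20], {32, 33}];
scores = knnScore[list, 2];
(* flag points whose score exceeds the median score by more than 3 median absolute deviations *)
Pick[list, # > Median[scores] + 3 MedianDeviation[scores] & /@ scores]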
In a way, we're bordering clustering here. We're looking for clusters of outliers and inliers. Outliers ought to form clusters (even if it's just a 1-element cluster!). And so ought inliers. If outliers don't form clusters, we can't really consider them outliers. In a future session, I should cover clustering theory, it's a *key concept* the modern-day data scientist needs to be familiar with, and it is very applicable to outlier detection!
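A quick illustration of the clustering view (my own one-liner, not from the original session): the two planted outliers from above come back as their own small cluster:
In[]:=
FindClusters[Join[RandomInteger[{1, 10}, 20], {32, 33}]]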
CentralFeature and FindClusters work better with the larger examples below. I find them not-so-good with single outliers, but they can be *very* powerful for *several* (presumed / candidate) outliers.
Built-in graphical functions:
Box Plot: in descriptive statistics, the box plot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot or box-and-whisker diagram. They are non-parametric and make no distributional assumptions. The width of the bands, and what is displayed with the whiskers, is not standardized, and there are a lot of variations. But typically the box spans the first to the third quartile, the band inside the box marks the second quartile (the median), and the ends of the whiskers can represent several possible alternative values. More on the box plot on the Wikipedia page https://en.wikipedia.org/wiki/Box_plot
In[]:=
GraphicsRow[{BoxWhiskerChart@list, BoxWhiskerChart[list, "Outliers", GridLines -> None]}]
We can provide several specifications. From the help browser:
◼
The following box-and-whisker specifications (bwspec) can be given:
"Notched"    median confidence interval notch
"Outliers"    outlier markers
"Median"    median marker
"Basic"    box-and-whisker only
"Mean"    mean marker
"Diamond"    mean confidence interval diamond
{{elem1 -> val11, ...}, ...}    box-and-whisker element specification
{"name", {elem1 -> val11, ...}}    named bwspec with element specifications
and many more styling/display options to customize the display (read help browser, too much to show here).
and note the hover!
In[]:=
Table[BoxWhiskerChart[list, i, PlotLabel -> Text[i], GridLines -> None], {i, {"Basic", "Outliers", "Notched", "Median", "Mean", "Diamond"}}]
the outlier specification, from the help browser:
Outliers and far outliers are defined using the quartiles and interquartile range:
In[]:=
data={1,2,3,4,5,6,7,8,9,10,20,30};
In[]:=
{q1,median,q3}=N[Quartiles[data]]
In[]:=
iqr = q3 - q1 (* inter-quartile range *)
In[]:=
BoxWhiskerChart[data, {"Outliers", {"Outliers", "●"}, {"FarOutliers", "○"}}, AspectRatio -> 1/10, ImageSize -> 500, BarOrigin -> Left, GridLines -> {{{q3 + 1.5 iqr, Dashed}, {q3 + 3 iqr, Dashed}}, None}, FrameTicks -> {{None, None}, {data, {{q1, "q1"}, {q3, "q3"}, {q3 + 1.5 iqr, "near"}, {q3 + 3 iqr, "far"}}}}]
so we see:
◼
near outlier: above q3 + 1.5 iqr (or, symmetrically, below q1 - 1.5 iqr)
◼
far outlier: above q3 + 3 iqr (or below q1 - 3 iqr) -- a small classifier sketch follows below
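Those fence definitions are easy to code up directly. A minimal sketch (classifyOutliers is my own helper name, not a built-in); it includes the lower fences for completeness, even though this sample only has upper outliers:
In[]:=
classifyOutliers[data_List] := Module[{q1, q3, iqr},
  {q1, q3} = Quartiles[data][[{1, 3}]];
  iqr = q3 - q1;
  <|"near" -> Select[data, (q3 + 1.5 iqr < # <= q3 + 3 iqr) || (q1 - 3 iqr <= # < q1 - 1.5 iqr) &],
    "far" -> Select[data, # > q3 + 3 iqr || # < q1 - 3 iqr &]|>]
classifyOutliers[{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30}]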
Caveat: the box plot should not be used on skewed data.
With skewed distributions, meeting those “outlier” thresholds happens easily. Example: the Dem % from the 2016 election data:
In[]:=
data=Rest@Import[FileNameJoin[{$HomeDirectory,"2016 US Presidential Election Results by County.csv"}]];
In[]:=
data=data[[All,5]];
In[]:=
{q1,median,q3}=N[Quartiles[data]]
In[]:=
iqr=q3-q1
In[]:=
BoxWhiskerChart[data, {"Outliers", {"Outliers", "●"}, {"FarOutliers", "○"}}, AspectRatio -> 1/10, ImageSize -> 500, BarOrigin -> Left, GridLines -> {{{q3 + 1.5 iqr, Dashed}, {q3 + 3 iqr, Dashed}}, None}, FrameTicks -> {{None, None}, {data, {{q1, "q1"}, {q3, "q3"}, {q3 + 1.5 iqr, "near"}, {q3 + 3 iqr, "far"}}}}]
yet it’s simply a skewed distribution:
In[]:=
Histogram@data
it is statistical malpractice to use the box plot on skewed data!
in my opinion the box plot can be thought of as a sub-function of the histogram. It’s like a “data summary” with most of the “middle information” removed. It is designed to accentuate extreme data, according to definitions based on quantiles / quartiles. BUT: if you *know* your data has a lot of observations that would exceed these quantile-based definitions, then the box plot is pointless to use -- in fact, misleading.
my recommendation: always look at the histogram first (a quick at-a-glance non-parametric overview)! If the data is not skewed, the box plot may be a suitable next step (a “quartile summary”). If the data is skewed, don’t base any decisions on box-plot-derived outliers!
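If you want a number to go with that visual check, the sample skewness is a quick complement (my aside; cutoffs like |skewness| > 1 are rules of thumb, not standards):
In[]:=
(* strongly right-skewed toy data vs. a symmetric normal sample: far from 0 vs. near 0 *)
{Skewness[{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30}], Skewness[RandomVariate[NormalDistribution[], 1000]]}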
DistributionChart: displays a distribution chart with a distribution symbol for each data[i].
In[]:=
DistributionChart@list
ugly, but it did its job: *detect* the outlier(s).
some examples from the companion website of Hastie, Tibshirani, Friedman: The Elements of Statistical Learning (Data Mining, Inference, and Prediction), Springer -- great book for advanced data scientists!
Los Angeles Ozone Data:
In[]:=
data=Import["http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/LAozone.data"];h=First@data;d=Rest@data;
In[]:=
h
Detailed variable names:
ozone: Upland Maximum Ozone
vh: Vandenberg 500 mb Height (note: 500 mb is about 18,000 ft, 700 mb is about 10,000 ft., 850 mb is about 5,000 ft.)
wind: Wind Speed (mph)
humidity: Humidity (%)
temp: Sandburg AFB Temperature
ibh: Inversion Base Height
dpg: Daggot Pressure Gradient
ibt: Inversion Base Temperature
vis: Visibility (miles)
doy: Day of the Year
In[]:=
Transpose@{h, ListPlot[d[[All, {#, 1}]], PlotRange -> All] & /@ Range[1, 10]} /. {a_, b_} :> (a -> b) // TabView
In[]:=
BoxWhiskerChart@Transpose@d
outstanding methods for styling:
In[]:=
BoxWhiskerChart[Transpose@d, "Outliers", ChartLegends -> h, ChartStyle -> 20, GridLines -> None]
problem: often the datasets you are trying to analyse are not on the same scale (above). Look at this drone hovering up there in column 2, and the long bars in column 6, squashing everything else. We have columns whose numbers (wind, <= 22) are smaller than the whole range of others (e.g. vh, 630).
In[]:=
d[[All, 3]] // Max
d[[All, 2]] // MinMax // #[[2]] - #[[1]] &
d[[All, 6]] // MinMax // #[[2]] - #[[1]] &
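A side note of mine (not the approach taken below): if you do want all columns in a single chart, Standardize first puts each column on a zero-mean, unit-variance scale:
In[]:=
(* Standardize each column, then one chart can hold all of them *)
BoxWhiskerChart[Standardize /@ Transpose@d, "Outliers", ChartLegends -> h, GridLines -> None]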
otherwise, we have to plot / analyse them individually:
In[]:=
c = 0; (c++; GraphicsRow[{Histogram[#, PlotLabel -> h[[c]]], BoxWhiskerChart[#, "Outliers", PlotLabel -> h[[c]]]}]) & /@ Transpose@d
In[]:=
SetOptions[DistributionChart, ImageSize -> 400];
Grid[{{DistributionChart[Transpose@d, ChartLegends -> h, ChartStyle -> 20],
    DistributionChart[Transpose@d, ChartLegends -> h, ChartStyle -> 20, ChartElementFunction -> "HistogramDensity", PlotLabel -> "ChartElementFunction -> \"HistogramDensity\""]},
   {DistributionChart[Transpose@d, ChartLegends -> h, ChartStyle -> 14, ChartElementFunction -> ChartElementData["GlassQuantile", "Quantile" -> 9, "QuantileShading" -> True], PlotLabel -> "ChartElementFunction -> ChartElementData[\"GlassQuantile\", \"Quantile\" -> 9, \"QuantileShading\" -> True]"], Null}}]
SetOptions[DistributionChart, ImageSize -> Automatic];
note: super-precise hover!
but here we have the same problem: data on different scales, hence, larger scales crush the smaller ones. So here as well: look at them individually:
In[]:=
c = 0; (c++; DistributionChart[#, ChartStyle -> 20, PlotLabel -> h[[c]]]) & /@ Most@Transpose@d
In[]:=
c = 0; (c++; DistributionChart[#, ChartStyle -> 66, ChartElementFunction -> "HistogramDensity", PlotLabel -> h[[c]]]) & /@ Most@Transpose@d
In[]:=
c = 0; (c++; DistributionChart[#, ChartStyle -> 14, ChartElementFunction -> ChartElementData["GlassQuantile", "Quantile" -> 9, "QuantileShading" -> True], PlotLabel -> h[[c]]]) & /@ Most@Transpose@d
you’ll remember from the last sessions: Query, operator form, and function composition -- here all in one. Isn’t it easy to get five numbers with one query:
In[]:=
d[[All,6]]//Query[{Length,Max,Median/*N,Mean/*N,Select[#>4999&]/*Length}]
footnote: Query is a great way to show a list of items that are of entirely different natures (types)! (e.g. in this case: an integer-based list length vs. actual data -- Max!)
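As a cross-check of what Query is doing there (my own restatement, same five numbers): Through applies each operator form directly to the column:
In[]:=
Through[{Length, Max, Median /* N, Mean /* N, Select[# > 4999 &] /* Length}[d[[All, 6]]]]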
next idea: does the center change if we remove the outlier(s):
In[]:=
CentralFeature[d[[All,6]]]
In[]:=
CentralFeature[Select[d[[All,6]],#<5000&]]
what does FindAnomalies say (more on this later):
In[]:=
d[[All, {1, 6}]] // FindAnomalies[#, TargetDevice -> "GPU"] &
I think they’re large but not too outlier-y, for where they are. Large measurements are not always abnormal!
In[]:=
d[[All,6]]//FindClusters
FindClusters gave us a cluster for the “5000 group” (it also contains others, e.g. the “previous outlier” 3795).
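To grab the suspect cluster programmatically (my own one-liner, nothing canonical): take the cluster that contains the maximum value:
In[]:=
SelectFirst[FindClusters[d[[All, 6]]], MemberQ[#, Max[d[[All, 6]]]] &]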
nearest-neighbor approach:
In[]:=
sc = Union@d[[All, 3]]; ({#, Nearest[DeleteCases[sc, #] -> "Distance", #]} & /@ sc)[[All, 1]] // FindAnomalies
In[]:=
FindAnomalies@sc
In[]:=
cf = CentralFeature[d[[All, 3]]]; Print@cf;
Abs[cf - sc]
cf + Abs[cf - sc]
Select[d[[All, 3]], # ≥ Max[cf + Abs[cf - sc]] &]
Prostate Cancer Data:
In[]:=
data=Import["https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data"];h=First@data;d=Rest@data;
In[]:=
h
Predictors (columns 1 -- 8) -- not further documented:
lcavol: log cancer volume
lweight: log prostate weight
age: age
lbph: log of the amount of benign prostatic hyperplasia
svi: seminal vesicle invasion
lcp: log of capsular penetration
gleason: Gleason score
pgg45: percent of Gleason scores 4 or 5
outcome (column 9): lpsa
train/test indicator (column 10)
In[]:=
Transpose@{h, ListPlot[d[[All, {#, -2}]], PlotRange -> All] & /@ Range[1, 11]} /. {a_, b_} :> (a -> b) // TabView
In[]:=
c = 0; (c++; GraphicsRow[{Histogram[#, PlotLabel -> h[[c]]], BoxWhiskerChart[#, "Outliers", PlotLabel -> h[[c]]]}]) & /@ Transpose@d
In[]:=
SetOptions[DistributionChart, ImageSize -> 400];
Grid[{{DistributionChart[Transpose@d, ChartLegends -> h, ChartStyle -> 20],
    DistributionChart[Transpose@d, ChartLegends -> h, ChartStyle -> 20, ChartElementFunction -> "HistogramDensity", PlotLabel -> "ChartElementFunction -> \"HistogramDensity\""]},
   {DistributionChart[Transpose@d, ChartLegends -> h, ChartStyle -> 14, ChartElementFunction -> ChartElementData["GlassQuantile", "Quantile" -> 9, "QuantileShading" -> True], PlotLabel -> "ChartElementFunction -> ChartElementData[\"GlassQuantile\", \"Quantile\" -> 9, \"QuantileShading\" -> True]"], Null}}]
SetOptions[DistributionChart, ImageSize -> Automatic];
In[]:=
c = 0; (c++; DistributionChart[#, ChartStyle -> 20, PlotLabel -> h[[c]]]) & /@ Most@Transpose@d
In[]:=
c = 0; (c++; DistributionChart[#, ChartStyle -> 66, ChartElementFunction -> "HistogramDensity", PlotLabel -> h[[c]]]) & /@ Most@Transpose@d
In[]:=
c = 0; (c++; DistributionChart[#, ChartStyle -> 14, ChartElementFunction -> ChartElementData["GlassQuantile", "Quantile" -> 9, "QuantileShading" -> True], PlotLabel -> h[[c]]]) & /@ Most@Transpose@d
In[]:=
d[[All,5]]//MinMax
In[]:=
d[[All,5]]//Query[{Length,Min,Select[#<-1.3862&]/*Length}]
so again, we have 43 out of 97 at or below the lower limit! Not good!
This is the log of a measurement, what does Exp tell us?
In[]:=
Exp[-1.38629436`]
so I claim: amounts of benign prostatic hyperplasia less than 1/4 couldn’t be measured, so they simply set everything less than 1/4 to 1/4!
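A quick numeric check of that claim (my own addition): count how many observations sit exactly at the minimum, and look at the smallest distinct values:
In[]:=
Count[d[[All, 5]], Min[d[[All, 5]]]]
TakeSmallest[Union@d[[All, 5]], 3]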
In[]:=
sc = Union@d[[All, 5]]; ({#, Nearest[DeleteCases[sc, #] -> "Distance", #]} & /@ sc)[[All, 2]] // Flatten // FindAnomalies
how much does the CentralFeature change if we drop them?
In[]:=
CentralFeature[d[[All,5]]]
In[]:=
CentralFeature[Select[d[[All,5]],#>-1.3862&]]
and FindClusters:
In[]:=
d[[All,5]]//FindClusters
Heart Disease Data, South Africa
In[]:=
data=Import["http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data","CSV"];h=First@data;d=Rest@data;
In[]:=
h
sbp: systolic blood pressure
tobacco: cumulative tobacco (kg)
ldl: low density lipoprotein cholesterol
adiposity
famhist: family history of heart disease (Present, Absent)
typea: type-A behavior
obesity
alcohol: current alcohol consumption
age: age at onset
chd: response, coronary heart disease
In[]:=
ListPlot[d[[All, #]], PlotRange -> All, PlotLabel -> h[[#]]] & /@ Range[11]
here we have so much skewed data, I won’t even show the box plot.
each on its own:
In[]:=
c = 0; (c++; Histogram[#, PlotLabel -> h[[c]]]) & /@ Transpose@d
2016 US Presidential Election Results by County
In[]:=
data=Import[FileNameJoin[{$HomeDirectory,"2016 US Presidential Election Results by County.csv"}]];
In[]:=
data[[1]]
In[]:=
d=Rest@data;RandomSample[d,5]//TableForm
a lot of far outliers (gray):
In[]:=
c = 1; (c++; Histogram[#, PlotLabel -> data[[1, c]]]) & /@ Transpose@d[[All, 2 ;; 7]]
similar for DistributionChart: with large numbers you get distortions due to outliers, and you’re trying to compare apples and oranges in the same chart.
In[]:=
DistributionChart[Transpose@d[[All, 2 ;; 7]], ImageSize -> 800]
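One common workaround for this kind of heavy right skew (my aside, not used in this session): chart the logs of the counts instead of the raw counts. The 1 + # guards against counties with zero votes in a column:
In[]:=
DistributionChart[Map[Log10[1 + #] &, Transpose@d[[All, 2 ;; 4]], {2}], ChartLegends -> {"Dem", "GOP", "Total"}]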
In[]:=
Select[d,#[[2]]>10^6||#[[3]]>10^6||#[[4]]>10^6&]//TableForm
In[]:=
c = 1; (c++; DistributionChart[#, PlotLabel -> data[[1, c]]]) & /@ Transpose@d[[All, 2 ;; 7]]
but even for a single column, before tweaking options:
In[]:=
DistributionChart[d[[All, 4]], ImageSize -> 800]
In[]:=
Select[d,#[[4]]>10^6&]//SortBy[#[[4]]&]//Reverse//TableForm
so with some options we get a decent vis:
In[]:=
DistributionChart[Transpose@d[[All, 2 ;; 4]], ChartLegends -> {"Dem", "GOP", "Total"}, ChartStyle -> "AvocadoColors", ChartElementFunction -> "HistogramDensity", PlotLabel -> "ChartElementFunction -> \"HistogramDensity\""]
In[]:=
DistributionChart[Transpose@d[[All, 5 ;; 6]], ChartLegends -> {"Dem %", "GOP %"}, ChartStyle -> "AvocadoColors", ChartElementFunction -> "HistogramDensity", PlotLabel -> "ChartElementFunction -> \"HistogramDensity\""]
In[]:=
Query[{MinMax,Median}]/@Transpose@d[[All,2;;6]]//Grid
In[]:=
CentralFeature[d[[All,#]]]&/@Range[2,6]
side note on CentralFeature:
In[]:=
d//Length
from the help browser: for data of odd length, CentralFeature coincides with the Median. For even data length, the Median is the mean of the two middle elements, (x_(n/2) + x_(n/2+1))/2, and is generally not a member of the data, while CentralFeature always returns an actual data point.
In[]:=
Query[{CentralFeature,Median}]/@{Range@100,Range@101}
Out: {{50, 101/2}, {51, 51}} -- for Range@100 (even length) the Median is the fraction 101/2 = 50.5 while CentralFeature returns the data point 50; for Range@101 (odd length) both agree at 51.
In[]:=
data//First
top 10 tightest races by totals
In[]:=
With[{st=#},dem=Select[d,#[[9]]==st&][[All,2]]//Total;rep=Select[d,#[[9]]==st&][[All,3]]//Total;{st,dem,rep,Abs[dem-rep]}]&/@Union@d[[All,9]]//SortBy@Last//TableForm
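A more idiomatic variant of the same tally (my sketch): GroupBy the counties by state once, then take the 10 smallest margins, matching the “top 10” in the label above:
In[]:=
TakeSmallestBy[KeyValueMap[{#1, Total@#2[[All, 2]], Total@#2[[All, 3]], Abs[Total@#2[[All, 2]] - Total@#2[[All, 3]]]} &, GroupBy[d, #[[9]] &]], Last, 10] // TableForm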
Also look at other “nearest” methods (the “nearest neighbor” principle is fairly generic):
In[]:=
??*Nearest*
There are *many* more methods for outlier detection -- but those are very specific statistics implementations, and my goal in these sessions is the opposite: to explain *general* principles, so you can do things on your own, using ready-built vis tools that eliminate the need to wrestle with statistics functions yourself (BoxWhiskerChart, DistributionChart, etc.). Remember the paradigm I mentioned in an earlier session: tell the software *what* to do, don’t let the software tell you *how* to do it! We want to learn how to *harness* algorithms, not necessarily create algorithms.
I can cover more specific outlier detection methods if I get such audience requests. For now, simple jack-knife methods as above and BoxWhiskerChart and DistributionChart and part 2, below, should get you started.
There is much more to nearest-neighbor methods. MUCH more! They’re used in optimization, clustering, regression, neural networks, outlier detection, ... (a long list would follow). And they fill important gaps that other methods (such as linear least-squares regression) don’t. For the theory, see Hastie, Tibshirani, Friedman: The Elements of Statistical Learning (Data Mining, Inference, and Prediction) -- which I highly recommend if you want to advance from “beginner” to “advanced” in terms of data science.
reminder: outlier detection can be approached from a regression / nearest-neighbor / clustering perspective.
I’ll be glad to discuss regression and NNs (that means both nearest-neighbors and neural networks, with neural networks oftentimes depending on nearest-neighbors) in detail in future sessions.


Cite this as: Andreas Lauschke, "Data Science with Andreas Lauschke (#6)" from the Notebook Archive (2020), https://notebookarchive.org/2020-09-4lmehh8


