Data Science with Andreas Lauschke (#7)
Author
Andreas Lauschke
Title
Data Science with Andreas Lauschke (#7)
Description
Outlier Detection Methods
Category
Educational Materials
Keywords
URL
http://www.notebookarchive.org/2020-09-4lmka0p/
DOI
https://notebookarchive.org/2020-09-4lmka0p
Date Added
2020-09-10
Date Last Modified
2020-09-10
File Size
2.01 megabytes
Supplements
Rights
Redistribution rights reserved



Outlier Detection, Part 2
Andreas Lauschke, Aug 4 2019
today’s session: outlier detection methods, part 2 (FindAnomalies)
◼ prob theory reminders
◼ FindAnomalies mathematical intro
◼ simple applications
◼ advanced applications
◼ problems
next session: date / time / calendar functionality
put in an AI context: anomaly detection falls into the category of unsupervised learning: you’re trying to find hidden structures in the unlabeled data (similar to clustering). It doesn’t require training or labeling of data.
Andy Jassy, CEO of AWS, at re:Invent 2018 in Las Vegas, NV:
Part 2: FindAnomalies, based on robust Probability Theory
Not that the box plot is *not* based on robust probability theory :) Reasoning on quantiles / quartiles can be *very* powerful. But it’s not the whole story.
first part: some math to understand its inner workings more deeply
second part: FindAnomalies applications, from simple to more complex
first a quick recap of Probability and Probability Density! (using Wikipedia definitions)
Definition Probability: The probability that an event will occur is the fraction of times you expect to see that event in many trials.
Definition Probability Density: the value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample.
In a more precise sense, the PDF is used to specify the probability of the random variable falling within a particular range of values, as opposed to taking on any one value. This probability is given by the integral of this variable's PDF over that range -- that is, by the area under the density function but above the horizontal axis and between the lowest and greatest values of the range. The probability density function is nonnegative everywhere, and its integral over the entire space is equal to one.
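to make the area-under-the-curve reading concrete, here is a quick sanity check (my addition, not part of the original session): the integral of the standard normal PDF over [0, 1] against a crude Monte Carlo estimate from 10^5 samples -- both should come out near 0.3413.
In[]:=
(* probability as area under the PDF over [0, 1] ... *)
Integrate[PDF[NormalDistribution[], x], {x, 0, 1}] // N
(* ... vs. the fraction of random samples that land in [0, 1] *)
Count[RandomVariate[NormalDistribution[], 10^5], u_ /; 0 <= u <= 1]/10^5 // N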
standard normal distribution:
In[]:=
pdf=PDF[NormalDistribution[]]
it is a special case of the (general) normal distribution (where the mean μ need not be 0 and the standard deviation σ need not be 1):
In[]:=
pdf=PDF[NormalDistribution[μ,σ]]
reminder: if μ is 0 and σ is 1, it’s called the *standard* normal distribution (because there are non-standard normal distributions).
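a one-line check I'm adding: substituting μ -> 0, σ -> 1 into the general PDF recovers the standard normal PDF.
In[]:=
(* the general normal PDF collapses to the standard normal PDF for μ = 0, σ = 1 *)
Simplify[PDF[NormalDistribution[μ, σ], x] /. {μ -> 0, σ -> 1}] == PDF[NormalDistribution[], x]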
In[]:=
Manipulate[
 p1 = Plot[E^(-(x - m)^2/(2 sigma^2))/(Sqrt[2 π] sigma), {x, -5, x1},
   Filling -> Axis, FillingStyle -> Cyan, PlotRange -> {Automatic, All}];
 p2 = Plot[
   Callout[E^(-(x - m)^2/(2 sigma^2))/(Sqrt[2 π] sigma), "prob dens", {x1, Above}], {x, -5, 5},
   PlotLabel -> "Probability = Area under Curve = " <>
     ToString[NIntegrate[E^(-(x - m)^2/(2 sigma^2))/(Sqrt[2 π] sigma), {x, -Infinity, x1}], TraditionalForm],
   Epilog -> {PointSize@Large, Red, Point[{x1, E^(-(x1 - m)^2/(2 sigma^2))/(Sqrt[2 π] sigma)}]},
   PlotRange -> {Automatic, All}];
 Show[{p2, p1}],
 {{x1, 1}, -5, 5}, {{m, 0}, -4.5, 4.5}, {{sigma, 1}, 1/2, 3}]
so let’s pick an example and walk through it.
In[]:=
pdf=PDF[NormalDistribution[],x]
verify it’s a distribution (integral from -∞ to ∞ must be 1):
In[]:=
Integrate[pdf,{x,-Infinity,Infinity}]
and pdf must always be non-negative
In[]:=
Refine[pdf>=0,x∈Reals]
criteria 3 and 4 for a pdf: must also be real and integrable. That is the case here with pdf.
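criterion 3 can be checked in the same spirit (a minimal sketch I'm adding, reusing the pdf defined above):
In[]:=
(* criterion 3: pdf takes real values for real x *)
Simplify[pdf ∈ Reals, x ∈ Reals]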
from help browser:
◼ FindAnomalies flags examples whose RarerProbability falls below the AcceptanceThreshold
so we need to look at RarerProbability:
◼ RarerProbability: the Probability of generating an example whose PDF value is smaller than the PDF value at the given point
help browser example:
In[]:=
RarerProbability[NormalDistribution[], 3]
N@%
so let’s try to dig in and understand this result based on RarerProbability.
rarer prob for general x:
In[]:=
rp = Probability[PDF[NormalDistribution[], y] < PDF[NormalDistribution[], x], y \[Distributed] NormalDistribution[]] // N
rp /. x -> 3
rarer prob for specific x (3, from the helpbrowser example):
In[]:=
Probability[PDF[NormalDistribution[], y] < PDF[NormalDistribution[], 3], y \[Distributed] NormalDistribution[]]
N@%
now let’s use the exponentials for the standard normal distribution directly:
In[]:=
PDF[NormalDistribution[],y]
In[]:=
Probability[E^(-y^2/2)/Sqrt[2 π] < E^(-x^2/2)/Sqrt[2 π], y \[Distributed] NormalDistribution[]]
N@%
In[]:=
Probability[E^(-y^2/2)/Sqrt[2 π] < E^(-3^2/2)/Sqrt[2 π], y \[Distributed] NormalDistribution[]]
N@%
now I’d like to simplify this "<" business, i.e. E^(-y^2/2)/Sqrt[2 π] < E^(-3^2/2)/Sqrt[2 π], and I want to apply the integral for probability directly (remember the definition from above: a probability is an integral over {-Infinity, Infinity}). Let’s dig deeper.
now let’s look closer at Probability. It is defined as:
◼
For a continuous distribution dist, the probability of pred is given by ∫ Boole[pred] f[x] ⅆx, where f[x] is the probability density function of dist and the integral is taken over the domain of dist.
ok, show me:
In[]:=
Integrate[Boole[E^(-y^2/2)/Sqrt[2 π] < E^(-3^2/2)/Sqrt[2 π]] E^(-y^2/2)/Sqrt[2 π], {y, -Infinity, Infinity}]
N@%
ok, so we like the outcome, but what’s that business of integrating over Boole?
In[]:=
Plot3D[Boole[E^(-y^2/2)/Sqrt[2 π] < E^(-x^2/2)/Sqrt[2 π]], {x, -5, 5}, {y, -5, 5}]
so a 0/1-flipper (Boole) imposes additional constraints on the support of the function. At 0 it “annihilates” everything, and at 1 it “lets the function live”.
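a quick aside I'm adding: Boole just maps True/False to 1/0, and the Expectation of a Boole is exactly a Probability -- which is why integrating Boole[...] times the density computes the rarer probability.
In[]:=
(* Boole turns predicates into 0/1 *)
Boole /@ {2 < 3, 2 > 3}
(* Expectation of a Boole = Probability of the predicate, ≈ 0.158655 both ways *)
Expectation[Boole[y > 1], y \[Distributed] NormalDistribution[]] // N
Probability[y > 1, y \[Distributed] NormalDistribution[]] // N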
in our case of x==3:
In[]:=
Plot3D[Boole[E^(-y^2/2)/Sqrt[2 π] < E^(-3^2/2)/Sqrt[2 π]], {x, -5, 5}, {y, -5, 5}]
but now E^(-3^2/2)/Sqrt[2 π] is just a constant, so that’s really only a 1dim function!
In[]:=
E^(-3^2/2)/Sqrt[2 π]
so it’s really a 1dim problem:
In[]:=
Plot[Boole[E^(-y^2/2)/Sqrt[2 π] < E^(-3^2/2)/Sqrt[2 π]], {y, -5, 5}]
now let’s plot the whole thing, so we only need to multiply by the PDF for y. In general (i.e. x still variable):
In[]:=
Plot3D[Boole[E^(-y^2/2)/Sqrt[2 π] < E^(-x^2/2)/Sqrt[2 π]] E^(-y^2/2)/Sqrt[2 π], {x, -5, 5}, {y, -5, 5}, PlotRange -> All]
you see how Boole enforces additional support restrictions: it “collapses” or “annihilates” half of the entire 2d plane.
but that’s just the support of the function. To actually get the prob, we have to *integrate* over this from -infty to infty:
In[]:=
int = Integrate[Boole[E^(-y^2/2)/Sqrt[2 π] < E^(-x^2/2)/Sqrt[2 π]] E^(-y^2/2)/Sqrt[2 π], {y, -Infinity, Infinity}]
that took a while, didn’t it? These PDFs are still too complex (hold on for a moment). Let’s check:
In[]:=
int /. x -> 3 // N
so, Boole[E^(-y^2/2)/Sqrt[2 π] < E^(-x^2/2)/Sqrt[2 π]] is actually equivalent to Boole[y^2 > x^2] (homework), so we can simplify this to:
In[]:=
int2 = Integrate[Boole[y^2 > x^2] E^(-y^2/2)/Sqrt[2 π], {y, -Infinity, Infinity}]
and now we get a much simpler result. And it checks out:
In[]:=
int2 /. x -> 3 // N
let’s look at the support:
In[]:=
Plot3D[Boole[y^2>x^2],{x,-5,5},{y,-5,5}]
same!
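before moving on, the homework claim can be settled symbolically (my addition): Reduce resolves the pdf inequality over the reals and returns exactly the Abs[y] > Abs[x] region, i.e. y^2 > x^2.
In[]:=
(* the pdf inequality reduces to y^2 > x^2 over the reals *)
Reduce[E^(-y^2/2)/Sqrt[2 π] < E^(-x^2/2)/Sqrt[2 π], y, Reals]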
and now let’s plot the resulting prob integral:
In[]:=
Plot[int, {x, -5, 5}]
here is the prob integral of the simpler version -- same curve:
In[]:=
Plot[Erfc[Abs[x]/Sqrt[2]], {x, -5, 5}, PlotRange -> All]
we should now verify that they’re really the same. Check whether the diffs are zero (ok, this is not entirely scientific, but decent). Physicist’s answer: yeah, looks perfect, confirms the proposition, the diffs are basically null. Mathematician’s answer: I’ll believe it for now -- not done yet.
In[]:=
Plot[Evaluate[int - Erfc[Abs[x]/Sqrt[2]]], {x, -20, 20}, PlotRange -> All]
more refined and mathematically accurate:
In[]:=
Refine[int - Erfc[Abs[x]/Sqrt[2]], x ∈ Reals]
and let’s look at it directly (“ground truth”):
In[]:=
Plot[Evaluate@RarerProbability[NormalDistribution[],x],{x,-5,5}]
now let’s take diffs against ground truth (sloppy method):
In[]:=
Plot[Evaluate[Erfc[Abs[x]/Sqrt[2]] - RarerProbability[NormalDistribution[], x]], {x, -20, 20}, PlotRange -> All]
refined:
In[]:=
Refine[Erfc[Abs[x]/Sqrt[2]] - RarerProbability[NormalDistribution[], x], x ∈ Reals]
and to show you a type of diagnostic plot that I like to use when studying approximation quality (create poles for 0-error approximations, to make them “stick out”):
In[]:=
Plot[Evaluate[Log@Abs[1 - Erfc[Abs[x]/Sqrt[2]]/RarerProbability[NormalDistribution[], x]]], {x, -20, 20}, PlotRange -> All]
In[]:=
Refine[Log@Abs[1 - Erfc[Abs[x]/Sqrt[2]]/RarerProbability[NormalDistribution[], x]], x ∈ Reals]
homework: think about the usefulness of log|1 - approx/truth| to show the closeness of an approximation. Idea: create poles where the curves intersect.
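to seed the homework, here is the same diagnostic applied to a case everyone knows (my addition): the degree-5 Taylor polynomial of Sin. The deep downward spike marks where approximation and truth agree.
In[]:=
(* log|1 - approx/truth| for the degree-5 Taylor polynomial of Sin *)
Plot[Log@Abs[1 - (x - x^3/6 + x^5/120)/Sin[x]], {x, -2 π, 2 π}, PlotRange -> All]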
another example: I claim 3 x^2 is a probability density function on (0,1). Let’s check:
In[]:=
Plot[Piecewise[{{3 x^2, 0 < x < 1}}], {x, -1, 2}, PlotRange -> All, PlotLabel -> "PDF"]
check if the integral is 1 from -infty to infty:
In[]:=
Integrate[Piecewise[{{3 x^2, 0 < x < 1}}], {x, -Infinity, Infinity}]
In[]:=
Refine[Piecewise[{{3 x^2, 0 < x < 1}}] >= 0, x ∈ Reals] // Simplify
now let’s “declare” it to be a prob dens function:
In[]:=
dist=ProbabilityDistribution[3x^2,{x,0,1}]
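once declared, dist behaves like any built-in distribution (my addition; easy to verify by hand: the mean is ∫ 3x^3 ⅆx over (0,1) = 3/4).
In[]:=
(* moments of the declared distribution: {3/4, 3/80} *)
{Mean[dist], Variance[dist]}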
what does its corresponding cumul dist func look like?
In[]:=
cdf = Integrate[Piecewise[{{3 tau^2, 0 < tau < 1}}], {tau, -Infinity, x}, Assumptions -> x ∈ Reals]
plot it:
In[]:=
Plot[cdf, {x, -1, 2}, PlotRange -> All, PlotLabel -> "CDF"]
and plot it using the built-in CDF function:
In[]:=
Plot[CDF[dist, x], {x, -1, 2}, PlotRange -> All, PlotLabel -> "CDF"]
built-in PDF :
In[]:=
PDF[dist,x]
built-in CDF :
In[]:=
CDF[dist,x]
example values: 0.4, 0.7, 0.9
In[]:=
RarerProbability[dist,#]&/@{0.4,0.7,0.9}//N
In[]:=
Plot[Evaluate@RarerProbability[dist, x], {x, -1, 2}, PlotRange -> All]
from definition of rarer prob:
In[]:=
Probability[PDF[dist, y] < PDF[dist, x], y \[Distributed] dist]
now let’s integrate from -infty to infty, because a prob is an integral:
In[]:=
int = Integrate[Boole[PDF[dist, y] < PDF[dist, x]] PDF[dist, y], {y, -Infinity, Infinity}]
but that’s just like in the plot above!
sloppy check:
In[]:=
Plot[Evaluate[RarerProbability[dist, x] - int], {x, -1, 2}, PlotRange -> All]
In[]:=
Refine[RarerProbability[dist,x]-int]
and insert our values 0.4, 0.7, 0.9:
In[]:=
int /. x -> {0.4, 0.7, 0.9}
checks out
second part: simple applications
In[]:=
SetOptions[FindAnomalies, PerformanceGoal -> "Quality"];
numbers:
In[]:=
FindAnomalies[{1.2,2.5,3.2,100,4.6,5,5.1}]
strings:
In[]:=
FindAnomalies[{"cat","dog","cow","bird"},{"Anomalies","AnomalyCount","AnomalyBooleanList","AnomalyPositions","AnomalyRarerProbabilities","NonAnomalies","RarerProbabilities"}]
the strings alone have no meaning. So FindAnomalies can’t really compute anything sensible.
In[]:=
FindAnomalies[StringLength/@{"cat","dog","cow","bird"},{"Anomalies","AnomalyCount","AnomalyBooleanList","AnomalyPositions","AnomalyRarerProbabilities","NonAnomalies","RarerProbabilities"}]
In[]:=
data=Import[FileNameJoin[{$HomeDirectory,"2016 US Presidential Election Results by County.csv"}]];
In[]:=
data[[1]]
In[]:=
d = Rest@data; (* ds = AssociationThread[data[[1]] -> #] & /@ d // Dataset; *)
In[]:=
FindAnomalies@d // TableForm
Select[d, #[[2]] > 10^6 || #[[3]] > 10^6 || #[[4]] > 10^6 &] // TableForm
In[]:=
FindAnomalies[d, {"AnomalyCount", "Anomalies", "AnomalyRarerProbabilities"}, Method -> "KernelDensityEstimation"] // TableForm
In[]:=
FindAnomalies[d, {"AnomalyCount", "Anomalies", "AnomalyRarerProbabilities"}, Method -> "Multinormal"] // TableForm
recall from last session: the box plot was “detecting” several dozen “outliers” for the Dem % numbers. It was based on quartile thresholds (quartile 3 plus 1.5 times the inter-quartile range). It’s malpractice to use the box plot on skewed data! FindAnomalies is a better approach and returns a reasonable number of outliers, as it’s based on rarer probabilities, not on quartiles of skewed data:
In[]:=
d=Rest@data[[All,5]];
how many by the # > q3 + 1.5 iqr & rule?
In[]:=
{q1,median,q3}=N[Quartiles[d]];iqr=q3-q1;Select[d,#>q3+1.5iqr&]//Length
and with FindAnomalies?
In[]:=
(res = FindAnomalies[d, {"AnomalyCount", "Anomalies", "AnomalyRarerProbabilities"}, Method -> "Multinormal"]) // TableForm
reasonable!
who are they?
In[]:=
Select[Rest@data,#[[5]]>=Min@res[[2]]&][[All,{5,9,10}]]//TableForm
third part: a bit more advanced
In[]:=
data=FinancialData["EUR/USD",{{2018,1,1},{2019,8,30}}]["Path"];
pdata=Partition[data,4,3];
In[]:=
anomalies = FindAnomalies[pdata];
p1 = DateListPlot[data];
p2 = ListPlot[Flatten[anomalies, 1], PlotStyle -> Red];
Show[{p1, p2}, ImageSize -> 800]
4 and 3 above define what I call a “look-back period”. 4-3 gives a 1-element look-back period. Partition[..., a, b] would make it an (a-b)-element look-back period.
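to make the length/offset mechanics concrete (my addition), here is Partition on a small range; with {4, 3}, consecutive windows overlap in exactly 4 - 3 = 1 element.
In[]:=
(* windows of length 4 at offset 3: {{1,2,3,4},{4,5,6,7},{7,8,9,10}} *)
Partition[Range[10], 4, 3]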
let’s streamline this:
In[]:=
func[data_List, a_Integer, b_Integer] := With[{pdata = Partition[data, a, b]},
   With[{anomalies = FindAnomalies[pdata, TrainingProgressReporting -> None]},
    Show[{DateListPlot[data], ListPlot[Flatten[anomalies, 1], PlotStyle -> Red]}, ImageSize -> 800]]];
In[]:=
func[a_Integer,b_Integer]:=func[data,a,b];
we can clearly see that different values for length and offset (indirectly defining the look-back period) show different indicative patterns for sharp movements:
In[]:=
TabView[{{2, 1} -> func[2, 1], {4, 3} -> func[4, 3], {5, 4} -> func[5, 4], {8, 7} -> func[8, 7], {8, 2} -> func[8, 2], {8, 4} -> func[8, 4], {20, 19} -> func[20, 19], {20, 10} -> func[20, 10]}]
beyond scope for this session, but look-back periods are the key elements in moving-average-based technical trading systems, which are (in my opinion) the most successful technical trading systems (based on backtesting and actual trading results).
remember when the SNB (Swiss National Bank) unpegged the CHF (Swiss Franc) from the EUR (Euro) in mid-January 2015? The Swiss Franc changed by 21% in a few seconds. 21% is enormous for a stock; consider 21% for a currency! (dozens of hedge funds were bankrupted in a few seconds, and billions in windfall gains were made by others).
In[]:=
data=FinancialData["CHF/EUR",{{2014,1,1},{2015,12,31}}]["Path"];
In[]:=
pdata=Partition[data,4,3];
In[]:=
anomalies = FindAnomalies[pdata];
p1 = DateListPlot[data];
p2 = ListPlot[Flatten[anomalies, 1], PlotStyle -> Red];
Show[{p1, p2}, ImageSize -> 800]
In[]:=
TabView[{{2, 1} -> func[2, 1], {4, 3} -> func[4, 3], {5, 4} -> func[5, 4], {8, 7} -> func[8, 7], {8, 2} -> func[8, 2], {8, 4} -> func[8, 4], {20, 19} -> func[20, 19], {20, 10} -> func[20, 10]}]
note: compare {20,19} with {20,10}.
triad recognition: FindAnomalies to distinguish major from minor
In[]:=
Audio["/home/video/al/opendiapasonsmall8great.wav"]
In[]:=
Audio["/home/video/al/clarinet8swell.wav"]
In[]:=
ranks = {(* great *)
   {"opendiapasonsmall8great", 0.1}, {"dulciana8great", 0.25}, (* too quiet, not much louder than the blower. Can't get a good threshold *)
   {"opendiapasonlarge8great", 0.1}, {"lieblichgedackt8great", 0.15}, {"principal4great", 0.1}, {"clearflute4great", 0.1},
   "fifteenth", "mixtureIV", (* as expected, all over the place, don't use *)
   "trumpet8great", {"clarion4great", 0.1},
   (* swell *)
   {"rohrfloete8swell", 0.35}, {"lieblichbourdon16swell", 0.15}, "flautomagico4swell", {"voxcelestis8swell", 0.18}, (* first one is missing, don't use *)
   {"geigenprincipal8swell", 0.25}, "piccolo2swell", (* too high, heuristics don't work *)
   "violadagamba8swell", (* first one is missing, don't use *)
   "salicit4swell", {"contrafagotto16swell", 0.15}, "mixtureIIIswell", (* doesn't work on the 7th *)
   {"clarinet8swell", 0.3}, "trumpet8swell", "oboe8swell"};
ranks = If[Head@# === String, {#, 0.2}, #] & /@ ranks;
ranks = AssociationThread[Range@Length@ranks, ranks];
In[]:=
n[n_Integer] := Switch[n, 1, 1, 2, 2^(2/12), 3, 2^(4/12), 4, 2^(5/12), 5, 2^(7/12), 6, 2^(9/12), 7, 2^(10/12)];
start[level_Integer, note_Integer] := Switch[level, 8, 400, 4, 800, 16, 200, 2, 1600] n@note;
In[]:=
freq[note_] := 440.*2^((note - 49)/12);
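a couple of spot checks for freq that I'm adding, using standard 88-key piano numbering (key 40 is middle C, key 49 is concert A):
In[]:=
(* C4, A4, C5: ≈ {261.626, 440., 523.251} Hz *)
freq /@ {40, 49, 52}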
In[]:=
func[int_Integer] := Module[{t, min, s},
   s = t = Transpose@{Range@3100, Take[Abs@Fourier@labels@int, 3100]};
   t = Take[s, {Floor[start[level, int]], Ceiling[start[level, int] 1.55 + 50]}];
   t = MaximalBy[t, #[[2]] &, 20] // SortBy[#, First] &;
   t = Split[t, Abs[#2[[1]] - #1[[1]]] ≤ 14 &];
   t = MaximalBy[#, #[[2]] &] & /@ t // Flatten[#, 1] &;
   t = t[[All, 1]];
   {t[[2]]/t[[1]], t[[3]]/t[[2]], t[[3]]/t[[1]]} // N];
In[]:=
run[rank_Integer] := With[{tt = 10000},
   Module[{pick, file, threshold, chords, chordsmono, chordsdata, noteStart, notePos, pos},
    pick = rank;
    file = ranks[pick][[1]];
    threshold = ranks[pick][[2]];
    level = StringCases[file, DigitCharacter ..] // First // ToExpression;
    chords = Audio["/home/video/al/" <> file <> ".wav"] // AudioNormalize;
    chordsmono = AudioChannelMix[chords, "Mono"];
    chordsdata = AudioData[chordsmono];
    noteStart = If[# > threshold, 1, 0] & /@ Abs[chordsdata[[1]]];
    pos = Position[noteStart, 1];
    notePos = First /@ Split[Flatten[pos], #2 - #1 < 10000 &];
    c1 = Take[chordsdata[[1]], {notePos[[1]] + 1000, notePos[[1]] + 7 tt}];
    c2 = Take[chordsdata[[1]], {notePos[[2]] + 1000, notePos[[2]] + 7 tt}];
    c3 = Take[chordsdata[[1]], {notePos[[3]] + 1000, notePos[[3]] + 7 tt}];
    c4 = Take[chordsdata[[1]], {notePos[[4]] + 1000, notePos[[4]] + 7 tt}];
    c5 = Take[chordsdata[[1]], {notePos[[5]] + 1000, notePos[[5]] + 7 tt}];
    c6 = Take[chordsdata[[1]], {notePos[[6]] + 1000, notePos[[6]] + 7 tt}];
    c7 = Take[chordsdata[[1]], {notePos[[7]] + 1000, notePos[[7]] + 7 tt}];
    labels = AssociationThread[Range@7 -> {c1, c2, c3, c4, c5, c6, c7}];
    Print["Rank: " <> ranks[[rank, 1]]];
    Print[# -> (t = Fourier@labels@#;
         ListLogLogPlot[Take[Abs@t, 3100]^2, PlotRange -> {{150, All}, {0.1, All}},
          PlotLabel -> "Power Spectrum " <> ranks[[rank, 1]],
          DataRange -> {0, 3099} 44100/(Length@labels@#),
          AxesLabel -> {"Hz", None}, GridLines -> {Range[0, 3100, 100], Automatic},
          ImageSize -> 600, Filling -> Axis, FillingStyle -> {Thick, Red}]) & /@ Keys@labels // TabView];
    Print[func /@ Range@7 // TableForm // Panel];
    Print[func /@ Range@7 // FindAnomalies[#, AcceptanceThreshold -> 0.0001, TrainingProgressReporting -> None] & // TableForm // Panel];]];
In[]:=
run/@{1,2,3,4,5,6,9,10,11,12,13,15,18,19,21,22,23};
observe:
◼ the anomaly detected is usually right
◼ it never misses a true anomaly (a.k.a.: no false negatives)
◼ it’s only wrong when it also flags “good” ones (a.k.a.: false positives)
◼ errors increase with higher frequencies
of course, we could have done this without FindAnomalies. The chord ratios for major and minor are known (and, in fact, I used them, see above). But consider this for chords where the ratios are not known ahead of time -- music you only have as a sound file, where you want to understand a particular chord you happen to love (or hate). --> need Fourier. And then try to find what doesn’t fit in the picture (a.k.a.: chord sequence).
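for reference, the ratios func should return in equal temperament (my addition): a major triad stacks 4+3 semitones, a minor triad 3+4, and both span a fifth (7 semitones).
In[]:=
(* {third/root, fifth/third, fifth/root} *)
N[{2^(4/12), 2^(3/12), 2^(7/12)}] (* major: ≈ {1.26, 1.19, 1.50} *)
N[{2^(3/12), 2^(4/12), 2^(7/12)}] (* minor: ≈ {1.19, 1.26, 1.50} *)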
Part 3: Problems
note: it’s easy to create “anomalies”; that doesn’t necessarily mean something is wrong with your data. Remember from the previous session: always ask yourself if an outlier candidate is really *abnormal*.
There is obviously nothing wrong with the HilbertMatrix:
In[]:=
FindAnomalies@HilbertMatrix@5
In[]:=
FindAnomalies@HilbertMatrix@15
In[]:=
FindAnomalies[HilbertMatrix@5,{"AnomalyCount","AnomalyRarerProbabilities","RarerProbabilities"}]//TableForm
In[]:=
FindAnomalies[HilbertMatrix@15,{"AnomalyCount","AnomalyRarerProbabilities","RarerProbabilities"}]//TableForm
FindAnomalies produces different outputs for different runs. So there *is* some random element in it, which I can neither explain nor justify. Method -> "KernelDensityEstimation" seems to be the most consistent.
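a quick way to see that run-to-run variation (my addition): evaluate the same call a few times in one session; the three results may or may not agree.
In[]:=
(* same input, three runs *)
Table[FindAnomalies[{1., 100., 1., 100.}], {3}]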
In[]:=
FindAnomalies[{1,100,1,100}]
In[]:=
FindAnomalies[{1,1000,1,1000}]
In[]:=
FindAnomalies[{1000,1000,1000,1000}]
In[]:=
FindAnomalies[{Pi,Pi,Pi,Pi}]
In[]:=
FindAnomalies[{E,E,E,E}]
In[]:=
FindAnomalies[{a,a,a,a}]
In[]:=
FindAnomalies[{Pi,Pi,Pi,Pi},Method->"Multinormal"]
In[]:=
FindAnomalies[{Pi,Pi,Pi,Pi},Method->"KernelDensityEstimation"]
In[]:=
rndnew=FromCharacterCode@RandomChoice[Join@@Range[{48,65},{57,90}],#]&;
In[]:=
FindAnomalies[rndnew[{20,10}],{"Anomalies","AnomalyCount","AnomalyBooleanList","AnomalyPositions","AnomalyRarerProbabilities","NonAnomalies","RarerProbabilities"}]


Cite this as: Andreas Lauschke, "Data Science with Andreas Lauschke (#7)" from the Notebook Archive (2020), https://notebookarchive.org/2020-09-4lmka0p


