Data Science with Andreas Lauschke (#7)
Author
Andreas Lauschke
Title
Data Science with Andreas Lauschke (#7)
Description
Outlier Detection Methods
Category
Educational Materials
Keywords
URL
http://www.notebookarchive.org/2020-09-4lmka0p/
DOI
https://notebookarchive.org/2020-09-4lmka0p
Date Added
2020-09-10
Date Last Modified
2020-09-10
File Size
2.01 megabytes
Supplements
Rights
Redistribution rights reserved



Outlier Detection, Part 2
Andreas Lauschke, Aug 4 2019
today’s session: outlier detection methods, part 2 (FindAnomalies)
◼ prob theory reminders
◼ FindAnomalies mathematical intro
◼ simple applications
◼ advanced applications
◼ problems
next session: date / time / calendar functionality
put in an AI context: anomaly detection falls into the category of unsupervised learning: you’re trying to find hidden structures in the unlabeled data (similar to clustering). It doesn’t require training or labeling of data.
Andy Jassy, CEO of AWS, at re:Invent 2018 in Las Vegas, NV:
Part 2: FindAnomalies, based on robust Probability Theory
Not that the box plot is *not* based on robust probability theory :) Reasoning on quantiles / quartiles can be *very* powerful. But it’s not the whole story.
first part: some math to understand its inner workings more deeply
second part: FindAnomalies applications, from simple to more complex
first a quick recap of Probability and Probability Density! (using Wikipedia definitions)
Definition Probability: The probability that an event will occur is the fraction of times you expect to see that event in many trials.
Definition Probability Density: the value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample.
In a more precise sense, the PDF is used to specify the probability of the random variable falling within a particular range of values, as opposed to taking on any one value. This probability is given by the integral of this variable's PDF over that range -- that is, by the area under the density function but above the horizontal axis and between the lowest and greatest values of the range. The probability density function is nonnegative everywhere, and its integral over the entire space is equal to one.
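to make the area-under-the-curve reading concrete, here is a quick sanity check (my addition, not part of the original session): the integral of the standard normal PDF over [0, 1] against a crude Monte Carlo estimate from 10^5 samples -- both should come out near 0.3413.
In[]:=
(* probability as area under the PDF over [0, 1] ... *)
Integrate[PDF[NormalDistribution[], x], {x, 0, 1}] // N
(* ... vs. the fraction of random samples that land in [0, 1] *)
Count[RandomVariate[NormalDistribution[], 10^5], u_ /; 0 <= u <= 1]/10^5 // N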
standard normal distribution:
In[]:=
pdf=PDF[NormalDistribution[]]
it is a special case of the (general) normal distribution (where the mean μ need not be 0 and the standard deviation σ need not be 1):
In[]:=
pdf=PDF[NormalDistribution[μ,σ]]
reminder: if μ is 0 and σ is 1, it’s called the *standard* normal distribution (because there are non-standard normal distributions).
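a one-line check I'm adding: substituting μ -> 0, σ -> 1 into the general PDF recovers the standard normal PDF.
In[]:=
(* the general normal PDF collapses to the standard normal PDF for μ = 0, σ = 1 *)
Simplify[PDF[NormalDistribution[μ, σ], x] /. {μ -> 0, σ -> 1}] == PDF[NormalDistribution[], x]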
In[]:=
Manipulate[
 p1 = Plot[E^(-(x - m)^2/(2 sigma^2))/(Sqrt[2 π] sigma), {x, -5, x1},
   Filling -> Axis, FillingStyle -> Cyan, PlotRange -> {Automatic, All}];
 p2 = Plot[
   Callout[E^(-(x - m)^2/(2 sigma^2))/(Sqrt[2 π] sigma), "prob dens", {x1, Above}], {x, -5, 5},
   PlotLabel -> "Probability = Area under Curve = " <>
     ToString[NIntegrate[E^(-(x - m)^2/(2 sigma^2))/(Sqrt[2 π] sigma), {x, -Infinity, x1}], TraditionalForm],
   Epilog -> {PointSize@Large, Red, Point[{x1, E^(-(x1 - m)^2/(2 sigma^2))/(Sqrt[2 π] sigma)}]},
   PlotRange -> {Automatic, All}];
 Show[{p2, p1}],
 {{x1, 1}, -5, 5}, {{m, 0}, -4.5, 4.5}, {{sigma, 1}, 1/2, 3}]
so let’s pick an example and walk through it.
In[]:=
pdf=PDF[NormalDistribution[],x]
verify it’s a distribution (integral from -∞ to ∞ must be 1):
In[]:=
Integrate[pdf,{x,-Infinity,Infinity}]
and pdf must always be non-negative
In[]:=
Refine[pdf>=0,x∈Reals]
criteria 3 and 4 for a pdf: must also be real and integrable. That is the case here with pdf.
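criterion 3 can be checked in the same spirit (a minimal sketch I'm adding, reusing the pdf defined above):
In[]:=
(* criterion 3: pdf takes real values for real x *)
Simplify[pdf ∈ Reals, x ∈ Reals]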
from help browser:
◼ FindAnomalies flags examples whose RarerProbability falls below the AcceptanceThreshold
so we need to look at RarerProbability:
◼ RarerProbability: the Probability of generating an example whose PDF value is smaller than the PDF value at the given point
help browser example:
In[]:=
RarerProbability[NormalDistribution[], 3]
N@%
so let’s try to dig in and understand this result based on RarerProbability.
rarer prob for general x:
In[]:=
rp = Probability[PDF[NormalDistribution[], y] < PDF[NormalDistribution[], x], y \[Distributed] NormalDistribution[]] // N
rp /. x -> 3
rarer prob for specific x (3, from the helpbrowser example):
In[]:=
Probability[PDF[NormalDistribution[], y] < PDF[NormalDistribution[], 3], y \[Distributed] NormalDistribution[]]
N@%
now let’s use the exponentials for the standard normal distribution directly:
In[]:=
PDF[NormalDistribution[],y]
In[]:=
Probability[E^(-y^2/2)/Sqrt[2 π] < E^(-x^2/2)/Sqrt[2 π], y \[Distributed] NormalDistribution[]]
N@%
In[]:=
Probability[E^(-y^2/2)/Sqrt[2 π] < E^(-3^2/2)/Sqrt[2 π], y \[Distributed] NormalDistribution[]]
N@%
now I’d like to simplify this "<" business, i.e. E^(-y^2/2)/Sqrt[2 π] < E^(-3^2/2)/Sqrt[2 π], and I want to apply the integral for probability directly (remember the definition from above: a probability is an integral over {-Infinity, Infinity}). Let’s dig deeper.
now let’s look closer at Probability. It is defined as:
◼
For a continuous distribution dist, the probability of pred is given by ∫ Boole[pred] f[x] ⅆx, where f[x] is the probability density function of dist and the integral is taken over the domain of dist.
ok, show me:
In[]:=
Integrate[Boole[E^(-y^2/2)/Sqrt[2 π] < E^(-3^2/2)/Sqrt[2 π]] E^(-y^2/2)/Sqrt[2 π], {y, -Infinity, Infinity}]
N@%
ok, so we like the outcome, but what’s that business of integrating over Boole?
In[]:=
Plot3D[Boole[E^(-y^2/2)/Sqrt[2 π] < E^(-x^2/2)/Sqrt[2 π]], {x, -5, 5}, {y, -5, 5}]
so a 0/1-flipper (Boole) imposes additional constraints on the support of the function. At 0 it “annihilates” everything, and at 1 it “lets the function live”.
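a quick aside I'm adding: Boole just maps True/False to 1/0, and the Expectation of a Boole is exactly a Probability -- which is why integrating Boole[...] times the density computes the rarer probability.
In[]:=
(* Boole turns predicates into 0/1 *)
Boole /@ {2 < 3, 2 > 3}
(* Expectation of a Boole = Probability of the predicate, ≈ 0.158655 both ways *)
Expectation[Boole[y > 1], y \[Distributed] NormalDistribution[]] // N
Probability[y > 1, y \[Distributed] NormalDistribution[]] // N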
in our case of x==3:
In[]:=
Plot3D[Boole[E^(-y^2/2)/Sqrt[2 π] < E^(-3^2/2)/Sqrt[2 π]], {x, -5, 5}, {y, -5, 5}]
but now E^(-3^2/2)/Sqrt[2 π] is just a constant, so that’s really only a 1dim function!
In[]:=
E^(-3^2/2)/Sqrt[2 π]
so it’s really a 1dim problem:
In[]:=
Plot[Boole[E^(-y^2/2)/Sqrt[2 π] < E^(-3^2/2)/Sqrt[2 π]], {y, -5, 5}]
now let’s plot the whole thing, so we only need to multiply by the PDF for y. In general (i.e. x still variable):
In[]:=
Plot3D[Boole[E^(-y^2/2)/Sqrt[2 π] < E^(-x^2/2)/Sqrt[2 π]] E^(-y^2/2)/Sqrt[2 π], {x, -5, 5}, {y, -5, 5}, PlotRange -> All]
you see how Boole enforces additional support restrictions: it “collapses” or “annihilates” half of the entire 2d plane.
but that’s just the support of the function. To actually get the prob, we have to *integrate* over this from -infty to infty:
In[]:=
int = Integrate[Boole[E^(-y^2/2)/Sqrt[2 π] < E^(-x^2/2)/Sqrt[2 π]] E^(-y^2/2)/Sqrt[2 π], {y, -Infinity, Infinity}]
that took a while, didn’t it? These PDFs are still too complex (hold on for a moment). Let’s check:
In[]:=
int /. x -> 3 // N
so, Boole[E^(-y^2/2)/Sqrt[2 π] < E^(-x^2/2)/Sqrt[2 π]] is actually equivalent to Boole[y^2 > x^2] (homework), so we can simplify this to:
In[]:=
int2 = Integrate[Boole[y^2 > x^2] E^(-y^2/2)/Sqrt[2 π], {y, -Infinity, Infinity}]
and now we get a much simpler result. And it checks out:
In[]:=
int2 /. x -> 3 // N
let’s look at the support:
In[]:=
Plot3D[Boole[y^2>x^2],{x,-5,5},{y,-5,5}]
same!
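before moving on, the homework claim can be settled symbolically (my addition): Reduce resolves the pdf inequality over the reals and returns exactly the Abs[y] > Abs[x] region, i.e. y^2 > x^2.
In[]:=
(* the pdf inequality reduces to y^2 > x^2 over the reals *)
Reduce[E^(-y^2/2)/Sqrt[2 π] < E^(-x^2/2)/Sqrt[2 π], y, Reals]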
and now let’s plot the resulting prob integral:
In[]:=
Plot[int, {x, -5, 5}]
here is the prob integral of the simpler version -- same curve:
In[]:=
Plot[Erfc[Abs[x]/Sqrt[2]], {x, -5, 5}, PlotRange -> All]
we should now verify that they’re really the same. Check whether the diffs are zero (ok, this is not entirely scientific, but decent). Physicist’s answer: yeah, looks perfect, confirms the proposition, the diffs are basically null. Mathematician’s answer: I’ll believe it for now -- not done yet.
In[]:=
Plot[Evaluate[int - Erfc[Abs[x]/Sqrt[2]]], {x, -20, 20}, PlotRange -> All]
more refined and mathematically accurate:
In[]:=
Refine[int - Erfc[Abs[x]/Sqrt[2]], x ∈ Reals]
and let’s look at it directly (“ground truth”):
In[]:=
Plot[Evaluate@RarerProbability[NormalDistribution[],x],{x,-5,5}]
now let’s take diffs against ground truth (sloppy method):
In[]:=
Plot[Evaluate[Erfc[Abs[x]/Sqrt[2]] - RarerProbability[NormalDistribution[], x]], {x, -20, 20}, PlotRange -> All]
refined:
In[]:=
Refine[Erfc[Abs[x]/Sqrt[2]] - RarerProbability[NormalDistribution[], x], x ∈ Reals]
and to show you a type of diagnostic plot that I like to use when studying approximation quality (create poles for 0-error approximations, to make them “stick out”):
In[]:=
Plot[Evaluate[Log@Abs[1 - Erfc[Abs[x]/Sqrt[2]]/RarerProbability[NormalDistribution[], x]]], {x, -20, 20}, PlotRange -> All]
In[]:=
Refine[Log@Abs[1 - Erfc[Abs[x]/Sqrt[2]]/RarerProbability[NormalDistribution[], x]], x ∈ Reals]
homework: think about the usefulness of log|1 - approx/truth| to show the closeness of an approximation. Idea: create poles where the curves intersect.
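to seed the homework, here is the same diagnostic applied to a case everyone knows (my addition): the degree-5 Taylor polynomial of Sin. The deep downward spike marks where approximation and truth agree.
In[]:=
(* log|1 - approx/truth| for the degree-5 Taylor polynomial of Sin *)
Plot[Log@Abs[1 - (x - x^3/6 + x^5/120)/Sin[x]], {x, -2 π, 2 π}, PlotRange -> All]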
another example: I claim 3 x^2 is a probability density function on (0,1). Let’s check:
In[]:=
Plot[Piecewise[{{3 x^2, 0 < x < 1}}], {x, -1, 2}, PlotRange -> All, PlotLabel -> "PDF"]
check if the integral is 1 from -infty to infty:
In[]:=
Integrate[Piecewise[{{3 x^2, 0 < x < 1}}], {x, -Infinity, Infinity}]
In[]:=
Refine[Piecewise[{{3 x^2, 0 < x < 1}}] >= 0, x ∈ Reals] // Simplify
now let’s “declare” it to be a prob dens function:
In[]:=
dist=ProbabilityDistribution[3x^2,{x,0,1}]
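once declared, dist behaves like any built-in distribution (my addition; easy to verify by hand: the mean is ∫ 3x^3 ⅆx over (0,1) = 3/4).
In[]:=
(* moments of the declared distribution: {3/4, 3/80} *)
{Mean[dist], Variance[dist]}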
what does its corresponding cumul dist func look like?
In[]:=
cdf = Integrate[Piecewise[{{3 tau^2, 0 < tau < 1}}], {tau, -Infinity, x}, Assumptions -> x ∈ Reals]
plot it:
In[]:=
Plot[cdf, {x, -1, 2}, PlotRange -> All, PlotLabel -> "CDF"]
and plot it using the built-in CDF function:
In[]:=
Plot[CDF[dist, x], {x, -1, 2}, PlotRange -> All, PlotLabel -> "CDF"]
built-in PDF :
In[]:=
PDF[dist,x]
built-in CDF :
In[]:=
CDF[dist,x]
example values: 0.4, 0.7, 0.9
In[]:=
RarerProbability[dist,#]&/@{0.4,0.7,0.9}//N
In[]:=
Plot[Evaluate@RarerProbability[dist, x], {x, -1, 2}, PlotRange -> All]
from definition of rarer prob:
In[]:=
Probability[PDF[dist, y] < PDF[dist, x], y \[Distributed] dist]
now let’s integrate from -infty to infty, because a prob is an integral:
In[]:=
int = Integrate[Boole[PDF[dist, y] < PDF[dist, x]] PDF[dist, y], {y, -Infinity, Infinity}]
but that’s just like in the plot above!
sloppy check:
In[]:=
Plot[Evaluate[RarerProbability[dist, x] - int], {x, -1, 2}, PlotRange -> All]
In[]:=
Refine[RarerProbability[dist,x]-int]
and insert our values 0.4, 0.7, 0.9:
In[]:=
int /. x -> {0.4, 0.7, 0.9}
checks out
second part: simple applications
In[]:=
SetOptions[FindAnomalies, PerformanceGoal -> "Quality"];
numbers:
In[]:=
FindAnomalies[{1.2,2.5,3.2,100,4.6,5,5.1}]
strings:
In[]:=
FindAnomalies[{"cat","dog","cow","bird"},{"Anomalies","AnomalyCount","AnomalyBooleanList","AnomalyPositions","AnomalyRarerProbabilities","NonAnomalies","RarerProbabilities"}]
the strings alone have no meaning. So FindAnomalies can’t really compute anything sensible.
In[]:=
FindAnomalies[StringLength/@{"cat","dog","cow","bird"},{"Anomalies","AnomalyCount","AnomalyBooleanList","AnomalyPositions","AnomalyRarerProbabilities","NonAnomalies","RarerProbabilities"}]
In[]:=
data=Import[FileNameJoin[{$HomeDirectory,"2016 US Presidential Election Results by County.csv"}]];
In[]:=
data[[1]]
In[]:=
d = Rest@data; (* ds = AssociationThread[data[[1]] -> #] & /@ d // Dataset; *)
In[]:=
FindAnomalies@d // TableForm
Select[d, #[[2]] > 10^6 || #[[3]] > 10^6 || #[[4]] > 10^6 &] // TableForm
In[]:=
FindAnomalies[d, {"AnomalyCount", "Anomalies", "AnomalyRarerProbabilities"}, Method -> "KernelDensityEstimation"] // TableForm
In[]:=
FindAnomalies[d, {"AnomalyCount", "Anomalies", "AnomalyRarerProbabilities"}, Method -> "Multinormal"] // TableForm
recall from last session: the box plot was “detecting” several dozen “outliers” for the Dem % numbers. It was based on quartile thresholds (quartile 3 plus 1.5 times the inter-quartile range). It’s malpractice to use the box plot on skewed data! FindAnomalies is a better approach and returns a reasonable number of outliers, as it’s based on rarer probabilities, not on quartiles of skewed data:
In[]:=
d=Rest@data[[All,5]];
how many by the # > q3 + 1.5 iqr & rule?
In[]:=
{q1,median,q3}=N[Quartiles[d]];iqr=q3-q1;Select[d,#>q3+1.5iqr&]//Length
and with FindAnomalies?
In[]:=
(res = FindAnomalies[d, {"AnomalyCount", "Anomalies", "AnomalyRarerProbabilities"}, Method -> "Multinormal"]) // TableForm
reasonable!
who are they?
In[]:=
Select[Rest@data,#[[5]]>=Min@res[[2]]&][[All,{5,9,10}]]//TableForm
third part: a bit more advanced
In[]:=
data=FinancialData["EUR/USD",{{2018,1,1},{2019,8,30}}]["Path"];
pdata=Partition[data,4,3];
In[]:=
anomalies = FindAnomalies[pdata];
p1 = DateListPlot[data];
p2 = ListPlot[Flatten[anomalies, 1], PlotStyle -> Red];
Show[{p1, p2}, ImageSize -> 800]
4 and 3 above define what I call a “look-back period”. 4-3 gives a 1-element look-back period. Partition[..., a, b] would make it an (a-b)-element look-back period.
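to make the length/offset mechanics concrete (my addition), here is Partition on a small range; with {4, 3}, consecutive windows overlap in exactly 4 - 3 = 1 element.
In[]:=
(* windows of length 4 at offset 3: {{1,2,3,4},{4,5,6,7},{7,8,9,10}} *)
Partition[Range[10], 4, 3]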
let’s streamline this:
In[]:=
func[data_List, a_Integer, b_Integer] := With[{pdata = Partition[data, a, b]},
   With[{anomalies = FindAnomalies[pdata, TrainingProgressReporting -> None]},
    Show[{DateListPlot[data], ListPlot[Flatten[anomalies, 1], PlotStyle -> Red]}, ImageSize -> 800]]];
In[]:=
func[a_Integer,b_Integer]:=func[data,a,b];
we can clearly see that different values for length and offset (indirectly defining the look-back period) show different indicative patterns for sharp movements:
In[]:=
TabView[{{2, 1} -> func[2, 1], {4, 3} -> func[4, 3], {5, 4} -> func[5, 4], {8, 7} -> func[8, 7], {8, 2} -> func[8, 2], {8, 4} -> func[8, 4], {20, 19} -> func[20, 19], {20, 10} -> func[20, 10]}]
beyond scope for this session, but look-back periods are the key elements in moving-average-based technical trading systems, which are (in my opinion) the most successful technical trading systems (based on backtesting and actual trading results).
remember when the SNB (Swiss National Bank) unpegged the CHF (Swiss Franc) from the EUR (Euro) in mid-January 2015? The Swiss Franc changed by 21% in a few seconds. 21% is enormous for a stock; consider 21% for a currency! (dozens of hedge funds were bankrupted in a few seconds, and billions in windfall gains were made by others).
In[]:=
data=FinancialData["CHF/EUR",{{2014,1,1},{2015,12,31}}]["Path"];
In[]:=
pdata=Partition[data,4,3];
In[]:=
anomalies = FindAnomalies[pdata];
p1 = DateListPlot[data];
p2 = ListPlot[Flatten[anomalies, 1], PlotStyle -> Red];
Show[{p1, p2}, ImageSize -> 800]
In[]:=
TabView[{{2, 1} -> func[2, 1], {4, 3} -> func[4, 3], {5, 4} -> func[5, 4], {8, 7} -> func[8, 7], {8, 2} -> func[8, 2], {8, 4} -> func[8, 4], {20, 19} -> func[20, 19], {20, 10} -> func[20, 10]}]
note: compare {20,19} with {20,10}.
triad recognition: FindAnomalies to distinguish major from minor
In[]:=
Audio["/home/video/al/opendiapasonsmall8great.wav"]
In[]:=
Audio["/home/video/al/clarinet8swell.wav"]
In[]:=
ranks = {(* great *)
   {"opendiapasonsmall8great", 0.1}, {"dulciana8great", 0.25}, (* too quiet, not much louder than the blower. Can't get a good threshold *)
   {"opendiapasonlarge8great", 0.1}, {"lieblichgedackt8great", 0.15}, {"principal4great", 0.1}, {"clearflute4great", 0.1},
   "fifteenth", "mixtureIV", (* as expected, all over the place, don't use *)
   "trumpet8great", {"clarion4great", 0.1},
   (* swell *)
   {"rohrfloete8swell", 0.35}, {"lieblichbourdon16swell", 0.15}, "flautomagico4swell", {"voxcelestis8swell", 0.18}, (* first one is missing, don't use *)
   {"geigenprincipal8swell", 0.25}, "piccolo2swell", (* too high, heuristics don't work *)
   "violadagamba8swell", (* first one is missing, don't use *)
   "salicit4swell", {"contrafagotto16swell", 0.15}, "mixtureIIIswell", (* doesn't work on the 7th *)
   {"clarinet8swell", 0.3}, "trumpet8swell", "oboe8swell"};
ranks = If[Head@# === String, {#, 0.2}, #] & /@ ranks;
ranks = AssociationThread[Range@Length@ranks, ranks];
In[]:=
n[n_Integer] := Switch[n, 1, 1, 2, 2^(2/12), 3, 2^(4/12), 4, 2^(5/12), 5, 2^(7/12), 6, 2^(9/12), 7, 2^(10/12)];
start[level_Integer, note_Integer] := Switch[level, 8, 400, 4, 800, 16, 200, 2, 1600] n@note;
In[]:=
freq[note_] := 440.*2^((note - 49)/12);
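a couple of spot checks for freq that I'm adding, using standard 88-key piano numbering (key 40 is middle C, key 49 is concert A):
In[]:=
(* C4, A4, C5: ≈ {261.626, 440., 523.251} Hz *)
freq /@ {40, 49, 52}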
In[]:=
func[int_Integer] := Module[{t, min, s},
   s = t = Transpose@{Range@3100, Take[Abs@Fourier@labels@int, 3100]};
   t = Take[s, {Floor[start[level, int]], Ceiling[start[level, int] 1.55 + 50]}];
   t = MaximalBy[t, #[[2]] &, 20] // SortBy[#, First] &;
   t = Split[t, Abs[#2[[1]] - #1[[1]]] ≤ 14 &];
   t = MaximalBy[#, #[[2]] &] & /@ t // Flatten[#, 1] &;
   t = t[[All, 1]];
   {t[[2]]/t[[1]], t[[3]]/t[[2]], t[[3]]/t[[1]]} // N];
In[]:=
run[rank_Integer] := With[{tt = 10000},
   Module[{pick, file, threshold, chords, chordsmono, chordsdata, noteStart, notePos, pos},
    pick = rank;
    file = ranks[pick][[1]];
    threshold = ranks[pick][[2]];
    level = StringCases[file, DigitCharacter ..] // First // ToExpression;
    chords = Audio["/home/video/al/" <> file <> ".wav"] // AudioNormalize;
    chordsmono = AudioChannelMix[chords, "Mono"];
    chordsdata = AudioData[chordsmono];
    noteStart = If[# > threshold, 1, 0] & /@ Abs[chordsdata[[1]]];
    pos = Position[noteStart, 1];
    notePos = First /@ Split[Flatten[pos], #2 - #1 < 10000 &];
    c1 = Take[chordsdata[[1]], {notePos[[1]] + 1000, notePos[[1]] + 7 tt}];
    c2 = Take[chordsdata[[1]], {notePos[[2]] + 1000, notePos[[2]] + 7 tt}];
    c3 = Take[chordsdata[[1]], {notePos[[3]] + 1000, notePos[[3]] + 7 tt}];
    c4 = Take[chordsdata[[1]], {notePos[[4]] + 1000, notePos[[4]] + 7 tt}];
    c5 = Take[chordsdata[[1]], {notePos[[5]] + 1000, notePos[[5]] + 7 tt}];
    c6 = Take[chordsdata[[1]], {notePos[[6]] + 1000, notePos[[6]] + 7 tt}];
    c7 = Take[chordsdata[[1]], {notePos[[7]] + 1000, notePos[[7]] + 7 tt}];
    labels = AssociationThread[Range@7 -> {c1, c2, c3, c4, c5, c6, c7}];
    Print["Rank: " <> ranks[[rank, 1]]];
    Print[# -> (t = Fourier@labels@#;
         ListLogLogPlot[Take[Abs@t, 3100]^2, PlotRange -> {{150, All}, {0.1, All}},
          PlotLabel -> "Power Spectrum " <> ranks[[rank, 1]],
          DataRange -> {0, 3099} 44100/(Length@labels@#),
          AxesLabel -> {"Hz", None}, GridLines -> {Range[0, 3100, 100], Automatic},
          ImageSize -> 600, Filling -> Axis, FillingStyle -> {Thick, Red}]) & /@ Keys@labels // TabView];
    Print[func /@ Range@7 // TableForm // Panel];
    Print[func /@ Range@7 // FindAnomalies[#, AcceptanceThreshold -> 0.0001, TrainingProgressReporting -> None] & // TableForm // Panel];]];
In[]:=
run/@{1,2,3,4,5,6,9,10,11,12,13,15,18,19,21,22,23};
observe:
◼ the anomaly detected is usually right
◼ it never misses a true anomaly (a.k.a.: no false negatives)
◼ it’s only wrong when it also flags “good” ones (a.k.a.: false positives)
◼ errors increase with higher frequencies
of course, we could have done this without FindAnomalies. The chord ratios for major and minor are known (and, in fact, I used them, see above). But consider this for chords where the ratios are not known ahead of time -- music you only have as a sound file, where you want to understand a particular chord you happen to love (or hate). --> need Fourier. And then try to find what doesn’t fit in the picture (a.k.a.: chord sequence).
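for reference, the ratios func should return in equal temperament (my addition): a major triad stacks 4+3 semitones, a minor triad 3+4, and both span a fifth (7 semitones).
In[]:=
(* {third/root, fifth/third, fifth/root} *)
N[{2^(4/12), 2^(3/12), 2^(7/12)}] (* major: ≈ {1.26, 1.19, 1.50} *)
N[{2^(3/12), 2^(4/12), 2^(7/12)}] (* minor: ≈ {1.19, 1.26, 1.50} *)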
Part 3: Problems
note: it’s easy to create “anomalies”; that doesn’t necessarily mean something is wrong with your data. Remember from the previous session: always ask yourself if an outlier candidate is really *abnormal*.
There is obviously nothing wrong with the HilbertMatrix:
In[]:=
FindAnomalies@HilbertMatrix@5
In[]:=
FindAnomalies@HilbertMatrix@15
In[]:=
FindAnomalies[HilbertMatrix@5,{"AnomalyCount","AnomalyRarerProbabilities","RarerProbabilities"}]//TableForm
In[]:=
FindAnomalies[HilbertMatrix@15,{"AnomalyCount","AnomalyRarerProbabilities","RarerProbabilities"}]//TableForm
FindAnomalies produces different outputs for different runs. So there *is* some random element in it, which I can neither explain nor justify. Method -> "KernelDensityEstimation" seems to be the most consistent.
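a quick way to see that run-to-run variation (my addition): evaluate the same call a few times in one session; the three results may or may not agree.
In[]:=
(* same input, three runs *)
Table[FindAnomalies[{1., 100., 1., 100.}], {3}]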
In[]:=
FindAnomalies[{1,100,1,100}]
In[]:=
FindAnomalies[{1,1000,1,1000}]
In[]:=
FindAnomalies[{1000,1000,1000,1000}]
In[]:=
FindAnomalies[{Pi,Pi,Pi,Pi}]
In[]:=
FindAnomalies[{E,E,E,E}]
In[]:=
FindAnomalies[{a,a,a,a}]
In[]:=
FindAnomalies[{Pi,Pi,Pi,Pi},Method->"Multinormal"]
In[]:=
FindAnomalies[{Pi,Pi,Pi,Pi},Method->"KernelDensityEstimation"]
In[]:=
rndnew=FromCharacterCode@RandomChoice[Join@@Range[{48,65},{57,90}],#]&;
In[]:=
FindAnomalies[rndnew[{20,10}],{"Anomalies","AnomalyCount","AnomalyBooleanList","AnomalyPositions","AnomalyRarerProbabilities","NonAnomalies","RarerProbabilities"}]


Cite this as: Andreas Lauschke, "Data Science with Andreas Lauschke (#7)" from the Notebook Archive (2020), https://notebookarchive.org/2020-09-4lmka0p


