Data Science with Andreas Lauschke (#8)
Author
Andreas Lauschke
Title
Data Science with Andreas Lauschke (#8)
Description
RarerProbability
Category
Educational Materials
Keywords
URL
http://www.notebookarchive.org/2020-09-4lmmcr3/
DOI
https://notebookarchive.org/2020-09-4lmmcr3
Date Added
2020-09-10
Date Last Modified
2020-09-10
File Size
87.37 kilobytes
Supplements
Rights
Redistribution rights reserved
data:image/s3,"s3://crabby-images/4079d/4079d57633b5f88bf9a49688684d35628eb2c6bf" alt=""
data:image/s3,"s3://crabby-images/56607/56607cca9c3f8f5e959237fb5ea16950a488c5ec" alt=""
data:image/s3,"s3://crabby-images/97e21/97e21d941045101921bcfd57c45c820c8eed2b93" alt=""
RarerProbability
RarerProbability
Andreas Lauschke, Sep 20, 2019
Andreas Lauschke, Sep 20, 2019
today’s session: more details on RarerProbability exploiting knowledge of the distribution for faster computation in particular symmetry warning: not all distributions are symmetric, not all have tails, and not all are unimodal! next session: date / time / calendar functionality
RarerProbability
RarerProbability
Reminder:
◼
RarerProbability: [dist,y]<[dist,x],ydist
Probability
PDF
PDF
◼
Probability (continuous case):
∫[pred]f[x]x
Boole
◼
Probability (discrete case):
[pred]f[x]
Boole
that means RarerProbability[dist,x] is in its general form:
In[]:=
Integrate[Boole[PDF[dist,y]<PDF[dist,x]]PDF[dist,y],{y,-Infinity,Infinity}]
That computes the points of the pdf that are less than some arbitrary value x. It requires the re-calculation of the pdf for every element of the (sub)domain. This is oftentimes inefficient. The same thing can be accomplished by calculating the integration limits for the (sub)domain. To obtain all the elements of the (sub)domain, say, of a normal distribution outside the range {-3, 3}, all you need to do is, figuratively speaking, Select[domain, ! -3 < # < 3 &].
In particular, symmetry can be exploited in a symmetric distribution. I’ll show several simplifications.
In particular, symmetry can be exploited in a symmetric distribution. I’ll show several simplifications.
Exploitation of Symmetry
Exploitation of Symmetry
for the first example, let’s use the standard normal distribution again.
In[]:=
ndist=NormalDistribution[];pdf=PDF[ndist,t];
we need cut-offs for the integration limits:
In[]:=
Manipulate[p1=Plot[pdf,{t,-4,4}];p2=Plot[pdf,{t,-4,4},RegionFunctionFunction[{x,y},!-cutoff<x<cutoff],FillingAxis,FillingStyleRed];Show[p1,p2,ImageSize600],{{cutoff,1},1,3,1/2}]
In[]:=
1-Integrate[pdf,{t,-3,3}]N@%
same numbers as in the previous session.
uses the error function! Last session we had used the complementary error function:
In[]:=
Erfc//N
3
2
it’s the same:
In[]:=
Erfc==1-Erf//FullSimplify
3
2
3
2
result: as we’ve seen before:
In[]:=
Plot[Evaluate[1-Integrate[pdf,{t,-Abs@x,Abs@x}]],{x,-5,5}]
note that things already got a bit complicated for this simple plot. For a symmetric distribution like the standard normal distribution, we have a right and left tail for *one* number x, and x can be positive or negative. For x==3 (positive number), for example, we need to integrate from -3 to 3. For x==-4 (negative number), we need to integrate from -4 to 4. So we have to write it as -Abs@x and Abs@x to guarantee that the lower integration limit is the “negative version” of that number and the upper integration limit is the “positive version”.
but the two plots also show us something else: the plots are symmetric around 0.
In[]:=
GraphicsRow[{p1=Plot[Evaluate[pdf],{t,-5,5},RegionFunctionFunction[{x,y},-3<x<0],FillingAxis,FillingStyleYellow];p2=Plot[Evaluate[pdf],{t,-5,5},RegionFunctionFunction[{x,y},0<x<3],FillingAxis,FillingStyleCyan];Show[p1,p2,PlotLabel->"Yellow == Cyan"],p3=Plot[Evaluate[1-Integrate[pdf,{t,-Abs@x,Abs@x}]],{x,-5,5},RegionFunctionFunction[{x,y},-3<x<0],FillingAxis,FillingStyleYellow];p4=Plot[Evaluate[1-Integrate[pdf,{t,-Abs@x,Abs@x}]],{x,-5,5},RegionFunctionFunction[{x,y},0<x<3],FillingAxis,FillingStyleCyan];Show[p3,p4,PlotLabel->"Yellow == Cyan"]},FrameAll]
And we can also show that mathematically:
In[]:=
pdf===(pdf/.(t-t))===
-
2
t
2
2π
-
2
(-t)
2
2π
footnote: wouldn’t need the parens, pdf===pdf/.t-t would yield the same, but for a beginner it’s easier to understand evaluation order by seeing the grouping.
even true in the complex case, we don’t even need Refine.
given that integration can be very time-consuming (definite integrals and indefinite integrals, symbolic as well as numeric), we can “split it up” in the middle, perform only a one-sided integration, and multiply the result with 2:
In[]:=
1-2Integrate[pdf,{t,-3,0}]1-2Integrate[pdf,{t,0,3}]N@%N@%%
this is my favorite version, because we also avoid indefinite integrals: we don’t have to compute integrals involving + or - infty.
and in words: “RarerProbability is everything, except for the stuff that is not rare”. Well, that’s 100% (remember: a pdf must have an integral from -infty to infty that equals exactly 1), minus the left or right half of the “normal” (sub)domain, which is easier to compute, as it doesn’t involve +/- infty. Later on today we’ll see it’s also the fastest way.
and in words: “RarerProbability is everything, except for the stuff that is not rare”. Well, that’s 100% (remember: a pdf must have an integral from -infty to infty that equals exactly 1), minus the left or right half of the “normal” (sub)domain, which is easier to compute, as it doesn’t involve +/- infty. Later on today we’ll see it’s also the fastest way.
In[]:=
1-2Integrate[pdf,{t,0,3}]//Timing1-2Integrate[pdf,{t,-3,0}]//Timing1-2NIntegrate[pdf,{t,0,3}]//Timing1-2NIntegrate[pdf,{t,-3,0}]//Timing
equivalent computation: add two parts, integrated separately: from -infty to lower limit, plus upper limit to infty (this now involves two integrals, instead of one, and both integrals are indefinite integrals):
In[]:=
Integrate[pdf,{t,-Infinity,-3}]+Integrate[pdf,{t,3,Infinity}]N@%
result: as we’ve seen before:
In[]:=
Plot[Evaluate[Integrate[pdf,{t,-Infinity,-Abs@x}]+Integrate[pdf,{t,Abs@x,Infinity}]],{x,-5,5}]
same problems as before: we need Abs simply to plot it, and we know that the left integration must have the same value as the right integration, so we can really do it as:
In[]:=
2Integrate[pdf,{t,-Infinity,-3}]2Integrate[pdf,{t,3,Infinity}]
but that takes longer as it involves two indefinite integrals:
In[]:=
2Integrate[pdf,{t,-Infinity,-3}]//Timing2Integrate[pdf,{t,3,Infinity}]//Timing2NIntegrate[pdf,{t,-Infinity,-3}]//Timing2NIntegrate[pdf,{t,3,Infinity}]//Timing
NIntegrate is smart, the numeric version takes about the same amount of time as in the first example. There is a lot of algorithmic smartness under the hood, but in general: try to avoid indefinite integrals if you can, even though NIntegrate can usually tackle them easily.
third way: like in RarerProbabilities, make the integrand the product of Boole[pred] * pdf, by excluding the “inner” interval in the predicate:
In[]:=
Integrate[Boole[!-3<x<3]PDF[ndist,x],{x,-Infinity,Infinity}]N@%
result: as we've seen before:
In[]:=
Plot[Evaluate[Integrate[Boole[!-Abs@t<x<Abs@t]PDF[ndist,x],{x,-Infinity,Infinity}]],{t,-5,5}]
same problems as above: we need Abs for a simple plot, and we don’t exploit the symmetry. And here we even have both + and - infty in *one* integral!
Instead: split it up:
Instead: split it up:
In[]:=
Integrate[Boole[!-3<x<3]PDF[ndist,x],{x,-Infinity,Infinity}]//Timing2Integrate[Boole[!-3<x<0]PDF[ndist,x],{x,-Infinity,0}]//Timing2Integrate[Boole[!0<x<3]PDF[ndist,x],{x,0,Infinity}]//TimingNIntegrate[Boole[!-3<x<3]PDF[ndist,x],{x,-Infinity,Infinity}]//Timing2NIntegrate[Boole[!-3<x<0]PDF[ndist,x],{x,-Infinity,0}]//Timing2NIntegrate[Boole[!0<x<3]PDF[ndist,x],{x,0,Infinity}]//Timing
now even the NIntegrate versions take a bit longer. It’s because the integrand of Boole[pred] * pdf is more involved.
and finally let’s look at “ground truth”, RarerProbability:
In[]:=
RarerProbability[NormalDistribution[],3]//Timing
now I’d like to see a competition with 1000 random variates from a standard normal distribution:
In[]:=
rv=RandomVariate[NormalDistribution[],1000];
In[]:=
ndist=NormalDistribution[];pdf=PDF[ndist,t];
In[]:=
Timing[1-2NIntegrate[pdf,{t,0,#}]&/@rv]//FirstTiming[2NIntegrate[pdf,{t,#,Infinity}]&/@rv]//FirstTiming[2NIntegrate[Boole[!0<x<#]PDF[ndist,x],{x,0,Infinity}]&/@rv]//First
and with RarerProbability directly (have to go down from 1000 to something much smaller, like 10):
In[]:=
rv=RandomVariate[NormalDistribution[],10];
In[]:=
Timing[RarerProbability[NormalDistribution[],#]&/@rv]//First
we have a clear winner. Nothing beats “numerically integrate from 0 to threshold, double it, and subtract from 1”. We could perhaps further improve this speed result by using C compilation on (not tested). And even more for massively parallel applications when using CUDA, and on other distributions. Simulations are a great application of samples generated/evaluated massively in parallel.
-
2
t
2
2π
However, the version RarerProbability uses with Boole[pred] * pdf is more flexible and can be used for more general problems to select wanted events. But it’s *much* slower (due to its ability to handle much more general problems).
Warning: not all Distributions are symmetric, not all have tails, and not all are unimodal!
Warning: not all Distributions are symmetric, not all have tails, and not all are unimodal!
RarerProbabilies works on *both* sides! That includes smaller parts of non-symmetric distributions!
footnote: remember when last session I said that the curves of the RarerProbability would look different for different distributions or parameters. The following examples illustrate that.
we’ll use the FrechetDistribution for the next example (unimodal, non-symmetric).
In[]:=
dist=FrechetDistribution[3,1];
In[]:=
p1=Plot[PDF[dist,x],{x,0,5}]
let’s cut off the right tail at 3 (this is a point, not an integral, nor a probability):
In[]:=
rtail=PDF[dist,3]%//N
we’ll want rarer prob, but everywhere it occurs, so here in both tails (this is a prob == integral == area, not a point):
In[]:=
bothtails=RarerProbability[dist,3]%//N
In[]:=
(*thiswilltakeawhile,bepatient!*)Plot[Evaluate@RarerProbability[dist,x],{x,-2,6},PlotRangeAll]
let's solve for the other threshold (3 we already knew, we had *set* it at 3 -- but what’s the corresponding threshold on the left tail? It must attain the same value on the pdf):
In[]:=
Solve[PDF[dist,x]PDF[dist,3],x]
let’s get the bothtails rarer prob by integration, using those two limits:
In[]:=
rp=IntegratePDF[dist,x],x,-Infinity,
+Integrate[PDF[dist,x],{x,3,Infinity}]%//N
0.522
…
and we see that rarer prob equals our bothtails prob == integral == area under curve.
In[]:=
rpbothtails//Simplify
as a Plot:
In[]:=
p2=Plot[PDF[dist,x],{x,0,5},RegionFunctionFunction[{x,y},!0.5219603012030648`<x<3],FillingAxis,FillingStyleRed];p3=Plot[Callout[rtail,"don't\nforget\nme",{0.975,0.125},{0.5219603012030648`,rtail},CalloutMarker"Arrow"],{x,-9,9},PlotStyleDashed];Show[p1,p2,p3,ImageSize600]
so: values less than 0.5219603012030648`would be an outlier! (because integration begins at -infty and goes up to 0.5219603012030648`!) Hard to believe when looking at the shape! Never let shapes fool you!
With extreme value problems, like volatility or insurance problems, you will generally want only one tail, usually the right, as insurance problems generally look at the loss as the positive cost of the event. And when talking billions of dollars, == 0.0008830390156378181` can be a huge amount of money, and also consider the potential for lawsuits over lack of contractual clarity about “one-sidedness” vs. “two-sidedness”.
IntegratePDF[dist,x],x,-Infinity,
//N
0.522
…
people often forget that RarerProbabilties and FindAnomalies are concepts that do *not* require symmetry. Tails can have different shapes, and, in fact, rarer prob can include areas == integrals that are not on tails, but "in the middle". Here we’ll use a bimodal distribution, constructed simply from two normal distributions.
In[]:=
pdf=(PDF[NormalDistribution[-3,1],x]+PDF[NormalDistribution[3,1],x])/2
In[]:=
dist=ProbabilityDistribution[pdf,{x,-Infinity,Infinity}]
In[]:=
p1=Plot[PDF[dist,x],{x,-9,9},PlotRangeAll,GridLinesAutomatic]
let’s check if it’s really a distribution:
In[]:=
Integrate[PDF[dist,x],{x,-Infinity,Infinity}]Refine[PDF[dist,x]>=0,x∈Reals]
looks good, and we’ll waive the check if pdf is integrable and real (a sum of two normal distributions is integrable and real). That completes the four distribution checks.
In[]:=
RarerProbability[dist,5]N@%
what’s the value at the cut-off?
In[]:=
limvalue=PDF[dist,5]N@%
In[]:=
p2=Plot[PDF[dist,x],{x,-9,9},RegionFunctionFunction[{x,y},y<limvalue],FillingAxis,FillingStyleRed];p3=Plot[limvalue,{x,-9,9},PlotStyleDashed];Show[p1,p2,p3]
to avoid misunderstandings, I emphasize that “computes the probability for distribution to generate a sample that has a lower than ”in the definition of rarer prob mean “... that has a lower PDF than example *has*”. So “lower” refers to the y-axis, not the x-axis. So it’s not “anything to the left of x”. We could just as well have taken any other of the four points that attain the same value. We know 5 and -5, let’s compute the other two:
dist
PDF
example
In[]:=
sol=FindRoot[PDF[dist,x]-limvalue,{x,1},WorkingPrecision100][[1,2]]
and check for all four:
In[]:=
{N@#,RarerProbability[dist,#],PDF[dist,#],limvalue}&/@{5,-5,sol,-sol}//N[#,30]&//TableForm
we can do more. How about trimodal (and we don’t have to center it on 0)?
In[]:=
pdf=(PDF[NormalDistribution[-5,1],x]+PDF[NormalDistribution[0,1],x]+PDF[NormalDistribution[5,1],x])/3
In[]:=
dist=ProbabilityDistribution[pdf,{x,-Infinity,Infinity}]
In[]:=
p1=Plot[PDF[dist,x],{x,-9,+9},PlotRangeAll,GridLinesAutomatic]
distribution check:
In[]:=
Integrate[PDF[dist,x],{x,-Infinity,Infinity}]Refine[PDF[dist,x]>=0,x∈Reals]
In[]:=
RarerProbability[dist,7]N@%
In[]:=
limvalue=PDF[dist,7]N@%
In[]:=
p2=Plot[PDF[dist,x],{x,-9,9},RegionFunctionFunction[{x,y},y<limvalue],FillingAxis,FillingStyleRed];p3=Plot[limvalue,{x,-9,9},PlotStyleDashed];Show[p1,p2,p3]
In[]:=
sol1=FindRoot[PDF[dist,x]-limvalue,{x,2},WorkingPrecision100][[1,2]]sol2=FindRoot[PDF[dist,x]-limvalue,{x,3},WorkingPrecision100][[1,2]]
and check for all six:
In[]:=
{N@#,RarerProbability[dist,#],PDF[dist,#],limvalue}&/@{7,-7,sol1,-sol1,sol2,-sol2}//N[#,30]&//TableForm
so please note that RarerProbability and FindAnomalies don’t inherently rely on tails or symmetry. But, if the distribution *is* symmetric, computations can be sped up significantly (part 1 of this session).
it also gives us a way to test FindAnomalies. Generate some 10000 random variates from this particular distribution:
In[]:=
s=RandomVariate[dist,10000];
In[]:=
FindAnomalies[s,"Anomalies",AcceptanceThreshold1/20];Histogram[%,100]
data:image/s3,"s3://crabby-images/4079d/4079d57633b5f88bf9a49688684d35628eb2c6bf" alt=""
data:image/s3,"s3://crabby-images/56607/56607cca9c3f8f5e959237fb5ea16950a488c5ec" alt=""
Cite this as: Andreas Lauschke, "Data Science with Andreas Lauschke (#8)" from the Notebook Archive (2020), https://notebookarchive.org/2020-09-4lmmcr3
data:image/s3,"s3://crabby-images/afa7e/afa7e751d718eac7e65669706b85c714b1d1becc" alt=""
Download
data:image/s3,"s3://crabby-images/c9374/c9374a157002afb9ce03cd482ea9bc6b4ee16fc0" alt=""
data:image/s3,"s3://crabby-images/7630b/7630b01d225114cfa2bafc392f9b6df93ec5f7bb" alt=""