Data Science with Andreas Lauschke (#1)
Author
Andreas Lauschke
Title
Data Science with Andreas Lauschke (#1)
Description
Operator notation, pattern matching and function composition.
Category
Educational Materials
Keywords
URL
http://www.notebookarchive.org/2020-09-4lm0sva/
DOI
https://notebookarchive.org/2020-09-4lm0sva
Date Added
2020-09-10
Date Last Modified
2020-09-10
File Size
125.94 kilobytes
Supplements
Rights
Redistribution rights reserved
Operator Notation and Application: Data Extraction
Andreas Lauschke, Apr 23 2019
◼
will present about weekly
◼
will incorporate feedback over time
◼
target audience: beginner, intermediate, advanced. *Not* expert!
◼
will start with data science applications, but after a while I’ll branch out to other areas of the Mathematica system (let us know what you’d like to see demo’ed)
today: operator notation, some pattern matching, function composition
next two sessions:
Associations and Dataset, part 1 (rectangular data)
Associations and Dataset, part 2 (hierarchical data)
Operator Notation
Motivation:
The short version:
For brevity and to avoid temp / dummy variables: many functions (Map, Select, Cases, MemberQ, FreeQ, ContainsAll and brethren, ...) have a form that doesn’t require the input as an explicit argument, so we can get rid of those variables.
A somewhat longer version:
◼
very efficient notation: concise, succinct.
◼
higher-level programming means you tell the system WHAT to do, not HOW to do it. Handling variables / symbols is a way of telling the system HOW to do things, we only want to tell the system WHAT to do (cognitive advantage).
◼
familiar to Linux users: "pipe" operator: op 1 | op 2 | op 3 | op 4 | ...
◼
can help eliminate # and & and Function (slot, Map, pure function) -- but this is not always better, see Caveat at the end.
◼
advantages of operator form accumulate as operators can be daisy-chained
◼
can be used to improve efficiency, not just through a reduced LeafCount but also through lower algorithmic complexity
◼
can mimic natural language patterns
◼
extremely useful for Dataset (see next session), which requires functions provided as arguments
◼
operator notation is like post-fix ("left to right"):
In[]:=
a//b//c//d
Let's start with a simple example. We're all familiar with Select. It picks out all elements of a list for which a particular condition is True. This is the signature we've been using all the time:
In[]:=
"traditional" ("inside out"):
In[]:=
Select[Range@20,EvenQ]
But since v10 we have the new operator notation. Note the third signature of Select:
In[]:=
post-fix (“left to right”):
In[]:=
Range@20//Select[EvenQ]
pre-fix (“right to left”):
In[]:=
Select[EvenQ][Range@20]
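a quick sanity check (a minimal sketch): the three spellings are the same computation, so comparing them with SameQ should give True:
In[]:=
SameQ[Select[Range@20,EvenQ],Range@20//Select[EvenQ],Select[EvenQ][Range@20]]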
In[]:=
Tuples[{Red,Green,Blue},3]
In[]:=
Tuples[{Red,Green,Blue,Yellow,Black,White},6]//Length
find all palindromes:
In[]:=
Tuples[{Red,Green,Blue,Yellow,Black,White},6]//Select[#===Reverse@#&]
another “Swiss army knife” is Cases. Use Select if you want to pick list elements based on a criterion (True / False) and Cases if you want to pick list elements based on a structural match (pattern).
In[]:=
mylist={641,65,gb[a],2.3,7.3,Red,y,g[89],9/4,g[13],Pi/2,Blue,Round@7.6,{1,2},{2},{3,4,1},{5,4},{3,3}};
show me all Integers
In[]:=
mylist//Cases[_Integer]
show me all Colors
In[]:=
mylist//Cases[_RGBColor]
show me all Rationals
In[]:=
mylist//Cases[_Rational]
show me all Reals
In[]:=
mylist//Cases[_Real]
show me all 2-element sublists:
In[]:=
mylist//Cases[{_,_}]
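Cases can do a bit more than match on the head. A minimal sketch on a small made-up list (demo and the head h are placeholders for illustration, not part of mylist): a pattern test (?) adds a True / False criterion to a structural match, and a rule returns the named part of each match instead of the whole element.
In[]:=
demo={641,65,h[89],9/4,h[13],8,{1,2}};
only the even Integers (should give {8}):
In[]:=
demo//Cases[_Integer?EvenQ]
only the arguments of h (should give {89,13}):
In[]:=
demo//Cases[h[n_]:>n]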
are there any polynomials in y?
In[]:=
{x^2,y^3,x^5,x^6}//FreeQ[y^_]
In[]:=
{x^2,y^3,x^5,x^6}//MemberQ[y^_]
at what position(s) is b?
In[]:=
{f,h,b,d,a,b,c,v,j,b,u,3,Blue,Pi/2}//Position[b]
at what positions are polynomials in x?
In[]:=
{7+x^2,5.4,x^4.2,c+(4+x^2)^4}//Position[x^_]
I should explain the multiple nesting levels: FullForm is your friend:
In[]:=
{7+x^2,5.4,x^4.2,c+(4+x^2)^4}//FullForm
note that these position lists show you exactly where you find Power[x,_].
at what positions are exponential expressions (in general)?
In[]:=
{7+x^2,5.4,x^4.2,c+(4+x^2)^4}//Position[_^_]
note that here we have another one, in position {4, 2}, because (4+x^2)^4 itself is of the form _^_!
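if you want to double-check what sits at those positions, Extract takes a position list straight from Position. A small sketch, assuming x and c have no values assigned (expr is just a name picked for this illustration):
In[]:=
expr={7+x^2,5.4,x^4.2,c+(4+x^2)^4};
every part this returns should be of the form x^_:
In[]:=
Extract[expr,Position[expr,x^_]]
and this single position should give back (4+x^2)^4:
In[]:=
Extract[expr,{4,2}]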
operator notation is not possible for functions that take several arguments, for example Position[expr, pattern, levelspec]:
up to first level (and now we have to use pre-fix):
In[]:=
Position[{7+x^2,5.4,x^4.2,c+(4+x^2)^4},_^_,1]
up to second level:
In[]:=
Position[{7+x^2,5.4,x^4.2,c+(4+x^2)^4},_^_,2]
ONLY second level:
In[]:=
Position[{7+x^2,5.4,x^4.2,c+(4+x^2)^4},_^_,{2}]
you see, I needed to provide the input as a first argument.
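if you still want to stay “left to right” here, one possible workaround (a sketch, not the only way) is a pure function, with # standing in for the data. This brings back # and &, so whether it reads better is a matter of taste:
In[]:=
{7+x^2,5.4,x^4.2,c+(4+x^2)^4}//Position[#,_^_,{2}]&
this should return the same positions as the three-argument call with {2} above.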
however, many functions that have only one argument (univariate functions) *behave* as if it were operator notation:
In[]:=
{a,3,f,6,j,3,b}//Length
In[]:=
{1,4,32,6,9,6}//Sort
In[]:=
{{a,b},{y,z}}//Transpose
In[]:=
HilbertMatrix@5//Det
how many bs do we have in the list?
In[]:=
{f,h,b,d,a,b,c,v,j,b,u,3,Blue,Pi/2}//Count[b]
how many list elements do we have that are not b?
In[]:=
{f,h,b,d,a,b,c,v,j,b,u,3,Blue,Pi/2}//Count[Except[b]]
does the list contain any of these?
In[]:=
{b,a,b}//ContainsAny[{a,b,c}]
In[]:=
{m,n,p}//ContainsAny[{a,b,c}]
Does the list contain all of these?
In[]:=
{b,a,b,c}//ContainsAll[{a,b}]
In[]:=
{b,a,b,c}//ContainsAll[{a,b,d}]
can also define our own criterion:
In[]:=
f=(Total[#]<100)&;(* list sum is < 100 *)
create a 100 x 3 matrix of random integers between 0 and 100
In[]:=
RandomInteger[100,{100,3}]
and now show me all sublists with Total <100:
In[]:=
RandomInteger[100,{100,3}]//Select@f
keep pipelining / daisy-chaining:
factor all even integers between 56 and 100:
In[]:=
Range@100//Select[EvenQ]//Select[#>55&]//FactorInteger
equivalent:
In[]:=
Range@100//Select[EvenQ@#&&#>55&]//FactorInteger
can also use “right to left”, aka pre-fix:
In[]:=
FactorInteger@Select[EvenQ@#&&#>55&]@Range@100
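why can FactorInteger sit at the end of the pipe and take a whole list? Because it has the attribute Listable, so it is applied to every element automatically. A minimal check on a two-element list:
In[]:=
MemberQ[Attributes@FactorInteger,Listable]
In[]:=
{56,58}//FactorInteger
the second cell should give {{{2,3},{7,1}},{{2,1},{29,1}}}, i.e. 56 = 2^3 * 7 and 58 = 2 * 29.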
Function Composition. The operator notation is *particularly* useful when applied in function composition. We’ll see more of this during the next session, when I’ll introduce the Dataset. The Dataset makes heavy use of operators (so does Query, and Dataset uses Query internally). I strongly recommend you get familiar with operators and operator chaining (and operator notation / operator form). Composable functions are one of the main pillars of the functional programming paradigm.
"left to right” (first select even, then select >55, then factor), use /*:
In[]:=
sel=Select[EvenQ]/*Select[#>55&]/*FactorInteger
In[]:=
sel//FullForm
note that sel is just a definition of a function. sel itself is not applied to any data *yet*!
NOW we apply it on data:
In[]:=
sel@Range@100
"right to left": (first select larger than 55, then select even, then factor), use @*:
In[]:=
sel=FactorInteger@*Select[EvenQ]@*Select[#>55&]
In[]:=
sel//FullForm
In[]:=
sel@Range@100
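to see the direction of the two composition operators without any real work being done, apply them to inert symbols (step1, step2, step3 and data0 are undefined placeholders chosen for this sketch):
In[]:=
(step1/*step2/*step3)[data0]
In[]:=
(step1@*step2@*step3)[data0]
the first should give step3[step2[step1[data0]]] (step1 is applied first), the second step1[step2[step3[data0]]] (step3 is applied first).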
Brief Summary: “inside out” without operator notation means:
◼
you may have to do “bracket surgery” during debugging
◼
you may have to change the order of operators, or add/remove operations
◼
more error-prone, unreliable, can be unpleasant
◼
hard to see where to start
Application: Data Extraction / Parsing
Deutsche Börse provides all XETRA (electronic stock trades) and EUREX (electronic options and futures trades) transactions, hosted by AWS on S3:
https://registry.opendata.aws/deutsche-boerse-pds/
documentation is at https://github.com/Deutsche-Boerse/dbg-pds.
Another Swiss army knife is pattern matching. We’ll make extensive use of that here. The following section will also be a good refresher for you in terms of pattern matching.
First, create a new dir for yesterday, and then in a second step we’ll download all files into that new dir:
In[]:=
datestring="2019-04-18";
In[]:=
dir=CreateDirectory["/mnt/seconddrive/pds/"<>datestring]
I personally don’t like dir permissions of 777, so we fix that:
In[]:=
Run["chmod 700 "<>dir]
Alternatively, we can get down to the bare bones and do everything together:
In[]:=
dir="/mnt/seconddrive/pds/"<>datestring;Run["mkdir "<>dir<>";chmod 700 /mnt/seconddrive/pds/"<>datestring<>";"];
Now we’re ready to d/l all the files. The XEUR files are the options trades and the XETR files are the XETRA trades:
In[]:=
URLDownload["https://s3.eu-central-1.amazonaws.com/deutsche-boerse-eurex-pds/"<>datestring<>"/"<>datestring<>"_BINS_XEUR"<>#<>".csv",dir<>"/"<>datestring<>"_BINS_XEUR"<>#<>".csv"]&/@(If[#<10,"0"<>ToString@#,ToString@#]&/@Range[0,23]);
In[]:=
URLDownload["https://s3.eu-central-1.amazonaws.com/deutsche-boerse-xetra-pds/"<>datestring<>"/"<>datestring<>"_BINS_XETR"<>#<>".csv",dir<>"/"<>datestring<>"_BINS_XETR"<>#<>".csv"]&/@(If[#<10,"0"<>ToString@#,ToString@#]&/@Range[0,23]);
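the two download cells differ only in the bucket name and the XEUR / XETR tag, so the string building can be factored out. A sketch under that assumption; hours, fileName and fileURL are helper names made up here, and nothing is downloaded by these cells. IntegerString[n,10,2] pads the hour to two digits:
In[]:=
hours=IntegerString[#,10,2]&/@Range[0,23];
In[]:=
fileName[date_String,tag_String,hh_String]:=date<>"_BINS_"<>tag<>hh<>".csv";
In[]:=
fileURL[bucket_String,date_String,tag_String,hh_String]:="https://s3.eu-central-1.amazonaws.com/"<>bucket<>"/"<>date<>"/"<>fileName[date,tag,hh];
In[]:=
fileURL["deutsche-boerse-xetra-pds","2019-04-18","XETR",First@hours]
the last cell should build the same URL string as the first XETR download above (hour 00). Mapping URLDownload over hours with fileURL for the source and dir<>"/"<>fileName[...] for the target would then do the same job as the long cells.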
now let’s load yesterday’s stock trade files:
In[]:=
data=Import["/mnt/seconddrive/pds/"<>datestring<>"/*XETR*.csv"];
all in one list for the whole day:
In[]:=
alldata=Flatten[data,1]//DeleteCases[{"ISIN",__}];
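what this step does, on a tiny made-up stand-in for data (toy is hypothetical: two “files”, each with its own header row, and only three invented columns): Flatten at level 1 joins the per-file tables into one list of rows, and DeleteCases throws out every row that starts with "ISIN", i.e. the repeated headers.
In[]:=
toy={{{"ISIN","Mnemonic","EndPrice"},{"XX01","AAA",10.2}},{{"ISIN","Mnemonic","EndPrice"},{"XX01","AAA",10.3},{"XX02","BBB",20.}}};
In[]:=
Flatten[toy,1]//DeleteCases[{"ISIN",__}]
only the three data rows should be left.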
the headers:
In[]:=
data[[1,1]]
show the Lufthansa activity between 8 and 9 o’clock
In[]:=
data[[9]]//Cases[{_,"LHA",__}]//TableForm
price chart of Lufthansa between 8 and 9:
In[]:=
data[[9]]//Cases[{_,"LHA",__,a_,_,_}:>a]//ListLinePlot
price chart of Lufthansa for the whole day:
In[]:=
alldata//Cases[{_,"LHA",__,a_,_,_}:>a]//ListLinePlot
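how the pattern pins down the price column: __ soaks up any number of columns in the middle, and the blanks after a_ count columns from the end, so a_ followed by _,_ is the third-to-last entry. A sketch on a single made-up row (row and its values are invented, with fewer columns than the real files):
In[]:=
row={"XX01","LHA","name","type",10.,10.5,9.9,10.2,500,7};
In[]:=
Replace[row,{_,"LHA",__,a_,_,_}:>a]
In[]:=
row[[-3]]
both should return 10.2. The pattern version additionally filters on "LHA", which plain Part cannot do.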
min / max prices for Lufthansa
In[]:=
alldata//Cases[{_,"LHA",__,a_,_,_}:>a]//Min
alldata//Cases[{_,"LHA",__,a_,_,_}:>a]//Max
all traded instruments, and how many are they:
In[]:=
alldata[[All,3]]//Union//Short
alldata[[All,3]]//Union//Length
shortest / longest company names:
In[]:=
alldata[[All,3]]//Union//MinimalBy[StringLength]
In[]:=
alldata[[All,3]]//Union//MaximalBy[StringLength]
Lufthansa volume and number of trades for the day
In[]:=
alldata//Cases[{_,"LHA",__,a_,_}:>a]//Total
alldata//Cases[{_,"LHA",__,a_}:>a]//Total
which companies are Kommanditgesellschaften auf Aktien (have KGAA in the name)?
In[]:=
alldata[[All,3]]//Union//Select[StringContainsQ["KGAA"]]
in the above, note that StringContainsQ itself is in operator notation!
no recorded price change during the minute. Compare:
In[]:=
alldata//Select[SameQ@@#[[{-3,-4,-5,-6}]]&]//Length
alldata//Length
as percent of all:
In[]:=
%%/%//N
what security types traded on XETRA yesterday:
In[]:=
alldata[[All,4]]//Union//Short
alldata[[All,4]]//Union//Length
alldata[[All,4]]//Total
alldata[[All,4]]//Counts
In[]:=
apple+apple+apple+orange+orange
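the apple / orange sum is the reason Total works as a poor man’s counter on a column of strings: Plus collects identical terms, strings included. A minimal sketch with made-up values:
In[]:=
{"EUR","USD","EUR"}//Total
In[]:=
{"EUR","USD","EUR"}//Counts
Total should give 2 "EUR"+"USD", while Counts returns the same information as an Association, <|"EUR"->2,"USD"->1|>, which is easier to process further.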
now let’s look at yesterday’s futures and options trades:
In[]:=
data=Import["/mnt/seconddrive/pds/"<>datestring<>"/*XEUR*.csv"];
all in one list for the whole day:
In[]:=
alldata=Flatten[data,1]//DeleteCases[{"ISIN",__}];
the headers:
In[]:=
data[[1,1]]
show the DAX activity between 8 and 9 o’clock
In[]:=
data[[9]]//Cases[{_,_,"DAX",__}]//Length
the DAX futures activity between 8 and 9
In[]:=
data[[9]]//Cases[{_,_,"DAX",__,"FUT",__}]//Length
the DAX options activity between 8 and 9
In[]:=
data[[9]]//Cases[{_,_,"DAX",__,"OPT",__}]//Length
price chart of the DAX future June expiration from 8 to 9
In[]:=
data[[9]]//Cases[{_,_,"DAX",__,"FUT",20190621,__,a_,_,_}:>a]//ListLinePlot
and for the whole day
In[]:=
alldata//Cases[{_,_,"DAX",__,"FUT",20190621,__,a_,_,_}:>a]//ListLinePlot
how many DAX put contracts traded during the day?
In[]:=
alldata//Cases[{_,_,"DAX",__,"OPT",__,"Put",__,a_,_}:>a]//Total
alldata//Cases[{_,_,"DAX",__,"Put",__,a_,_}:>a]//Total
how many DAX put contract trades during the day?
In[]:=
alldata//Cases[{_,_,"DAX",__,"OPT",__,"Put",__,a_}:>a]//Total
alldata//Cases[{_,_,"DAX",__,"Put",__,a_}:>a]//Total
how many strikes are there in the DAX June expiration?
In[]:=
alldata//Cases[{_,_,"DAX",__,"OPT",20190621,a_,"Call"|"Put",__}:>a]//Union//Length
alldata//Cases[{_,_,"DAX",__,"OPT",20190621,a_,"Call"|"Put",__}:>a]//Union
alldata//Cases[{_,_,"DAX",__,"OPT",20190621,a_,"Call"|"Put",__}:>a]//Counts
we don’t need to filter for “OPT” anymore, because they can only be calls or puts:
In[]:=
alldata//Cases[{_,_,"DAX",__,20190621,a_,"Call"|"Put",__}:>a]//Union//Length
how many DAX future expirations traded?
In[]:=
alldata//Cases[{_,_,"DAX",__,"FUT",a_,__}:>a]//Union
alldata//Cases[{_,_,"DAX",__,"FUT",a_,__}:>a]//Counts
all traded instruments are in which currencies?
In[]:=
alldata[[All,5]]//Union
alldata[[All,5]]//Total
alldata[[All,5]]//Counts
which currency does the DAX future trade in?
In[]:=
alldata//Cases[{__,"DAX",_,a_,__}:>a]//Union
what are the market segments?
In[]:=
alldata[[All,2]]//Union//Short
alldata[[All,2]]//Counts//Short
alldata[[All,2]]//Total//Short
% of same price in their respective minute brackets:
In[]:=
alldata//Select[SameQ@@#[[{-3,-4,-5,-6}]]&]//Length
alldata//Length
In[]:=
%%/%//N
and I just *love* Part -- very useful if you know the column. The security type is in the sixth column; if you know that, Part is your friend:
what security types traded yesterday at the EUREX, and how many in each:
In[]:=
alldata[[All,6]]//Union
alldata[[All,6]]//Total
alldata[[All,6]]//Counts
Caveat
Operator notation is not *always* more succinct. Especially when you can use Part, or its [[...]] short form (bracket notation), that is often more succinct. For the currency example above you don’t even need the rule replacement. But you’re now back to pre-fix (not that this is a bad thing; operator notation is not a “holy grail” or “king solution” for everything):
In[]:=
Cases[alldata,{__,"DAX",__}][[All,5]]//Union
Cases[expr, pattern] is not operator notation, and Part has no operator notation. You *could* force the use of operator notation and “left to right” by changing the evaluation order with parentheses, but then it looks quite unnatural:
In[]:=
(alldata//Cases[{__,"DAX",__}])[[All,5]]//Union
Using the bracketing operation is more work than not using operator notation. The problem here is the operator precedence of Part, which is higher in the precedence table than // (post-fix). Still, keep Part in your mind as another Swiss army knife, and oftentimes its high precedence is extremely useful. It’s *important* that Part binds with high precedence. But that can be very irritating when used with // (post-fix), which has a very low precedence. And I’d be glad to say more about operator precedence at a future session, and am requesting your feedback on this (table Operator Precedence in the Operator Input Forms tutorial in the helpbrowser). For now, just three lines from the table:
expr1〚expr2〛 | Part[expr1, expr2] | (e〚e〛)〚e〛 |
expr1 @ expr2 | expr1[expr2] | e@(e@e) |
expr1 // expr2 | expr2[expr1] | (e//e)//e |
That also means pre-fix (@) binds more tightly than post-fix (//), so the post-fix function is applied *after* the pre-fix one:
In[]:=
a@b//c
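to see how the parser groups such inputs, wrap them in Hold and look at the FullForm (a small sketch; data0 and ff are inert placeholder symbols, not anything defined in this notebook):
In[]:=
Hold[a@b//c]//FullForm
In[]:=
Hold[data0//ff[[1]]]//FullForm
the first should show c[a[b]] inside the Hold. The second should show Part[ff,1][data0]: Part grabs ff before // gets to apply anything, which is why the parentheses were needed in the Cases example above.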
So, there is nothing *inherently* wrong with pre-fix (“right to left” / “inside out”). For many people it still reads quite naturally, and it is also “closer” to mathematics itself, where we typically write function application as
f(g(h(x))),
which is pre-fix, aka “right to left” and “inside out” at the same time. And the front-end makes interactive “bracketing” easy by “clicking your way out” (see in live demo)
In[]:=
Union@Cases[alldata,{__,"DAX",__}][[All,5]]
Also, operator notation can appear very nerdy. We *could* write
In[]:=
100//Range//FactorInteger
but many people will find that a bizarre nerd appearance, as we can write much more naturally
In[]:=
FactorInteger@Range@100
although it’s pre-fix (“right to left”). But many people will find *this* much easier to read/write/understand than the post-fix operator notation. I decree: post-fix is not the end of the world, and operator notation has its limitations!
At the end of my script on post-fix now a post-script:
Post-Script
kinda nerd warning!
Here is a hack to find all functions that have a documented operator form:
In[]:=
Select[Names@"*",StringMatchQ[ToString@ToExpression[#<>"::usage"],"*operator form*"]&]
You’ll find that Take and Drop are *not* in that list! Vararg functions cannot possibly have operator forms, but for Take and Drop it would in theory be possible. But here’s a hack for that too: Extract[{;;n}].
In[]:=
50//Range//Extract[{;;20}]
obviously the same as
In[]:=
Take[Range@50,20]
does not work:
In[]:=
50//Range//Take[20]
you could now define your own take to use in operator notation:
In[]:=
take[n_Integer]:=Extract[{;;n}];
In[]:=
50//Range//take[20]
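a more general way to give any function an operator form, without relying on Extract, is a definition with two bracket pairs: the first holds the parameters, the second the data. A sketch (take2 is a made-up name, chosen so it doesn’t clobber take from above):
In[]:=
take2[n_Integer][list_List]:=Take[list,n];
In[]:=
50//Range//take2[20]
In[]:=
(50//Range//take2[20])===Take[Range@50,20]
the last cell should return True. The same two-bracket shape works for Drop or for your own multi-argument functions.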
Cite this as: Andreas Lauschke, "Data Science with Andreas Lauschke (#1)" from the Notebook Archive (2020), https://notebookarchive.org/2020-09-4lm0sva