Data Science with Andreas Lauschke (#2)
Author
Andreas Lauschke
Title
Data Science with Andreas Lauschke (#2)
Description
Association and the Dataset functions
Category
Educational Materials
Keywords
URL
http://www.notebookarchive.org/2020-09-4lm3iy6/
DOI
https://notebookarchive.org/2020-09-4lm3iy6
Date Added
2020-09-10
Date Last Modified
2020-09-10
File Size
105.84 kilobytes
Supplements
Rights
Redistribution rights reserved
data:image/s3,"s3://crabby-images/4079d/4079d57633b5f88bf9a49688684d35628eb2c6bf" alt=""
data:image/s3,"s3://crabby-images/56607/56607cca9c3f8f5e959237fb5ea16950a488c5ec" alt=""
data:image/s3,"s3://crabby-images/97e21/97e21d941045101921bcfd57c45c820c8eed2b93" alt=""
Associations and Dataset, Part 1
Associations and Dataset, Part 1
Andreas Lauschke, May 2 2019
Andreas Lauschke, May 2 2019
We’ll start with a gentle introduction to Associations and Dataset and will go deeper in part 2 during the next session.
today’s session : Associations and Dataset, part1 (rectangular data)next session: Associations and Dataset, part 2 (hierarchical data)
Association
Association
Motivation:
◼
called associative arrays, dictionaries, hashes, or maps in other languages
◼
like in a dictionary, you look up an entry based on a keyword. First find the keyword, then you can read the entry associated with that keyword
◼
associates a key with a value
Here are some of the basics of Associations:
◼
set of bindings between keys and values
◼
efficient for looking up the value, given the key. Designed for several millions of elements
◼
keys and values can be arbitrary expressions
◼
entries in Associations are ordered
◼
Associations are convenient in creating hierarchical data structures
Simple example, using rules:
In[]:=
assoc=<|133,2antr,55,ab,bc,cd,3Pi/2,"Frankfurt""Hessen",7^2"hello",blah3,"Hessen""Germany",123424321,k-3x-7,plotPlot[Sin@x,{x,0,2Pi}]|>
look up values:
In[]:=
assoc[3]assoc[7^2]assoc@49assoc[[4]]
Associations can be turned into Lists with Normal.
In[]:=
assoc//Normal
note that this is now a List!
very useful, for example, to visualize the association structure :
In[]:=
TreePlot[Normal[assoc],VertexLabelsAutomatic,DirectedEdgesTrue,ImageSize600]
look-up tables / key-value structures are really nothing new, conceptually. Examples:
mortage["Kathy"]=1765.23balance["Michelle"]=982.34days[April]=30days[November]=30capital["Italy"]="Rome"population["USA"]=326population["Germany"]=83tires["Honda"]=4tires["747"]=18
But with M10 the Associations were implemented with a focus on performance, and they also provide a certain *context* to the surrounding function.
In[]:=
days[November]//FullForm
In[]:=
days=<|"November"30|>
In[]:=
FullForm@%
In[]:=
days["November"]days[["November"]]
Here the fact that we’re not just talking about a *binding* of a key (November) to the array (days), but have a new “thing” (Association), makes it possible for the surrounding function to get some context. For example, you can ListPlot associations, that will also give you legends automatically, like so:
In[]:=
ListPlot[<|"Foo"Range@10,"Bar"Range[10]+5|>]
For that to work, ListPlot needed to “detect” that an Association was provided!
We will see more of that when we start looking at the Dataset.
Construction: use the Association symbol (sometimes easier programmatically) or enter them directly:
In[]:=
g=Association[a55,b66]g=<|a55,b66|>
In[]:=
??g
Very important are AssociationThread and AssociationMap
AssociationThread allows you to create associations when you have keys and values in separate lists:
as two lists (a key list and a values list):
In[]:=
AssociationThread[{1,d,Pi/2,<|RedYellow,BlueGreen|>},{y,z,"hello world",<|233,7788|>}]
as a keys list pointing to a values list:
In[]:=
AssociationThread[{1,2,3}{"eins","zwei","drei"}]
In[]:=
AssociationThread[Range@100Range[101,200]]//Short
AssociationMap is an extremely powerful way to combine function application with the creation of the association. It applies a function to a list of elements that are assumed to be the keys of the new association:
from the helpbrowser:
AssociationMap
key
1
key
2
f[],f[],…
key
1
key
1
key
2
key
2
In[]:=
AssociationMap[StringLength,{"cat","dog","shark","dinosaur","pterodactylus"}]
it can be symbolic:
In[]:=
AssociationMap[foo,Range[10]]
it can use pure functions:
In[]:=
AssociationMap[#^2&,Range@20]AssociationMap[1/#^3&,Range@20]AssociationMap[Binomial[#+1,2]&,Range[0,20]]AssociationMap[Sum[k,{k,0,#}]&,Range[0,20]]AssociationMap[Plot[Csc[#x],{x,-Pi/2,Pi/2}]&,Range[0,20]]
Note that this is NOT a list of plots. It’s an association, the plots are values, the integers are the keys!
In[]:=
AssociationMap[Plot[Tan@x,{x,0,#Pi},GridLinesAutomatic]&,Range@10]
Reverse reverses the roles of keys and values:
In[]:=
AssociationMap[Reverse,<|a1,b2|>]
You can also operate on the values directly, that leaves the keys unchanged. AssociationMap applies the keys (first argument) to the second argument. Map applies a *function* (not the keys) to the values:
In[]:=
Map[#^2&,<|aa1,bb1,cc1|>]
In[]:=
Map[Total,<|a{a1,a2},b{b1,b2},c{c1,c2}|>]
In[]:=
Map[Det,<|a{{a1,a2},{a3,a4}},b{{b1,b2},{b3,b4}},c{{c1,c2},{c3,c4}}|>]
we obviously need a simple way to get a list of the keys and the values:
In[]:=
assoc=AssociationThread[{1,2,3}{"eins","zwei","drei"}]
In[]:=
Keys@assocValues@assoc
A typical use of associations (if not *the* most prominent use) is to set up the association in a way that the keys have an interpretation of row and column specifications (as a table / spreadsheet would have). For that you set up a list of associations, so that the rows are the associations, and the keys are the column labels:
In[]:=
mytable={<|"Name""Bob","Age"30,"Sex""male"|>,<|"Name""Bill","Age"34,"Sex""male"|>,<|"Name""Stacey","Age"27,"Sex""female"|>}
In[]:=
InputForm@mytable
In[]:=
FullForm@mytable
In[]:=
TabView@mytable
And in this form we are ready for the next section, the Dataset. As a list of associations, where every association represents a row, and every key represents a column, we can start using the Dataset:
In[]:=
Dataset@mytable
(now show helpbrowser, section “Dataset Structure” and below)
Dataset
Dataset
seriously cool stuff!
◼
concise query syntax for powerful transformations -- shorter and faster than SQL queries
◼
sometimes “raw data” can be unwieldy to work with
◼
too large to display
◼
nested structure can be difficult to discern
◼
hard to write complex analyses / transformations of the data
◼
hard to debug / trace. If something goes wrong, hard to find out why
◼
Datasets solve these problems
◼
displays large data in a convenient format, displays nested structures in an intelligent way
◼
automatic column headers, number of nesting levels, and total number of elements, formatted left-aligned
◼
scrollbars, jump-to-first, jump-to-last, automatic grid frame, ...
◼
typically used to manage objects that have the same properties:
{<|"Name""Bob","Age"30,"Sex""male"|>,<|"Name""Bill","Age"34,"Sex""male"|>,<|"Name""Stacey","Age"27,"Sex""female"|>}
◼
this is like rows / columns in a table!
◼
we can keep nesting these lists and associations (see the moons in the planets example, further down)
◼
allows easy mapping and selecting
◼
extract individual rows / columns to get single records
◼
Caveat on Dataset: some people find the Dataset confusing, I get verbiage as “not intuitive”, “hairy”, “never use it except for visualization, prefer lists of associations or associations of associations” and “all these operators, I get confused”. I will admit it takes some getting used to and some experimentation to harness its full power.
Let's load the Titanic example data and look at its raw format: we find a list of Associations:
In[]:=
titanic=ExampleData[{"Dataset","Titanic"}];titanic//Normal
note the format, it’s important to understand: this is *one* 1dim list of (several) Associations, in which every element has a “column label” as key, and a value corresponding to that key. That is like rows/columns in a table!
In[]:=
titanic//FullForm
convenient format, displays nesting as rows and columns, automatic column headers, scrollbar, total number of elements:
In[]:=
titanic
concise query syntax for powerful queries and transformations
how many per class?
In[]:=
titanic[Counts,"class"]titanic[All,"class"]@Counts
how many men / women?
In[]:=
titanic[Counts,"sex"]titanic[All,"sex"]@Counts
how many survived / died?
In[]:=
titanic[Counts,"survived"]titanic[All,"survived"]@Counts
In[]:=
titanic[Counts,"age"]
In[]:=
titanic[Count[_Missing],"age"]
Note that the first argument here is the query expression, the second is the “column selector”. This is, however, “right to left”. As Count is higher in the level spec, this use of the titanic dataset goes “age” first, and *then* Count[...]. We *can* force “left to right”, but then we need the function application, ...[...] or ...@..., which itself is “inside out” (which usually means “right to left”), and you need to specify an additional All:
In[]:=
titanic[All,"age"][Count[_Missing]]titanic[All,"age"]@Count@_Missing
mean age please
In[]:=
titanic[Mean,"age"]//Ntitanic[All,"age"]@Mean//Ntitanic[All,"age"]@Mean@*N
Note that the last example has Mean@*N as the argument to the dataset, I use parens to demonstrate:
In[]:=
titanic[All,"age"]@(Mean@*N)
This does not bind as
In[]:=
(titanic[All,"age"]@Mean)@*N
due to the higher precedence of @* over @ in the precedence table.
So, indeed, perhaps a bit tricky, because you need to evaluate Mean@*N first, which would be “right to left”, although you wanted “left to right”. You see that in order to force LTR you need to use RTL to make it possible, and you need to apply function application (...[...] or ...@...), which is why I personally prefer the direct use of RTL in many cases:
In[]:=
titanic[Mean,"age"]//N(*LTR:readas:takethemeanoftheagesandthenturnintoareal*)titanic[N@*Mean,"age"](*RTL:readas:startwiththeages,thentakethemeanofthem,andthenturnintoareal*)
min / mean / max age please. If you want *multiple* functions applied to data at the same stage, just use braces.
In[]:=
titanic[Min,"age"]titanic[Mean,"age"]titanic[Max,"age"]titanic[{Min,Mean,Max},"age"]titanic[All,"age"]@{Min,Mean,Max}
you see that Rationals in a Dataset are automatically turned into Reals when the fraction gets too tall, for display purposes.
robust error-handling:
In[]:=
titanic[All,"nationality"]
the sky is the limit as to the query / computation / visualization function we can use:
In[]:=
titanic[Histogram,"age"]
In the previous examples you’ll have noticed a bit of a pattern: the operation was always an operation that works on a 1dim list (Mean, Min, Max, Histogram). That’s because we specified “age” in the second argument, and age is a “column” in the underlying structure. Think about the arguments as “right to left”: you start with “age”, that should give you a “column”, and then you apply the operation on the “column”.
age total (keep in mind 263 are missing):
In[]:=
titanic[Total,"age"]
are there some ages missing (other than missing ages)?
In[]:=
titanic[Complement[Range@80,#]&,"age"]
Let’s now add another layer. Again think “right to left”, we start with the class, then for every class we count, and then we group by sex as the grouping key:
In[]:=
titanic[GroupBy["sex"],Counts,"class"]
we can also group by survival status:
In[]:=
titanic[GroupBy["survived"],Counts,"class"]
how about age histograms, grouped by class:
In[]:=
titanic[GroupBy["class"],Histogram[#,{0,80,4}]&,"age"]
we can “go further” to the left. Reminder: right to left! Note that in the above diagram, we grouped by class, which was in rows. Now we go one further, by sex, and now class is in columns, and sex is in rows!
In[]:=
titanic[GroupBy["sex"],GroupBy["class"],Histogram[#,{0,80,4}]&,"age"]
In[]:=
titanic[GroupBy["sex"],Counts,"survived"]
We talked about function composition during our first session. You’ll remember /*, which is “left to right” function composition, and @*, which is “right to left” function composition, and mind you, function composition is the construction of functions when *not* applied to data (yet). So, keeping in line with “right to left” mental model, we want to compute the survival ratio per sex, grouped by class and sex. We start with “survived” at the very right. That should give use a 1dim list (can think of as row or column, doesn’t matter, it’s a 1dim list). Next, we want to get true/false, so we apply Boole to that list. For the survival ratio we compute the mean, and we need to force it into a Real with N. Then we group by class, then by sex. You see, we went from right to left, step by step: survived, Boole, Mean, N, class grouping, sex grouping:
In[]:=
titanic[GroupBy["sex"],GroupBy["class"],N@*Mean@*Boole,"survived"]
So if you were on the Titanic in 1912, you better have been a woman!
We can also change the grouping order, we can group by sex first and then put class into rows, simply by swapping the first two:
In[]:=
titanic[GroupBy["class"],GroupBy["sex"],Boole/*Mean/*N,"survived"]
overall, most died:
In[]:=
titanic[N@*Mean@*Boole,"survived"]
In[]:=
titanic[GroupBy["age"],N@*Mean@*Boole,"survived"]//ListPlot
it seems the survival ratio was lowest for the “middle aged” people. Older and younger people had somewhat higher likelihoods of survival.
can also define your own function to outsource longer expressions:
In[]:=
getmean=N@*Mean@*Boole;
In[]:=
titanic[GroupBy["sex"],GroupBy["class"],getmean,"survived"]titanic[GroupBy["age"],getmean,"survived"]//ListPlot
Let’s look at the Deutsche Boerse data again and use Dataset for some queries and comparisons.
In[]:=
datestring="2019-04-30";data=Import["/mnt/seconddrive/pds/"<>datestring<>"/*XETR*.csv"];alldata=Flatten[data,1]//DeleteCases[{"ISIN",__}];data[[1,1]]
In[]:=
db=AssociationThread[data[[1,1]]#]&/@alldata//Dataset;
In[]:=
db[Counts,"Mnemonic"]
let’s sort this by the first column:
In[]:=
db[KeySort@*Counts,"Mnemonic"]
many equivalent ways (pick whatever you like best, I prefer the one above, not these three):
In[]:=
{db[Counts,"Mnemonic"][KeySort],db[All,"Mnemonic"][KeySort@*Counts],db[All,"Mnemonic"][Counts/*KeySort]}
how many different symbols?
In[]:=
db[Length@*Union,"Mnemonic"]alldata[[All,2]]//Union//Length
sort by the symbol counts:
In[]:=
{db[Counts,"Mnemonic"]//Sort,db[Sort@*Counts,"Mnemonic"],db[Counts,"Mnemonic"][Sort],db[All,"Mnemonic"][Counts/*Sort],db[All,"Mnemonic"][Sort@*Counts]}
show the security descriptions that have KGAA in the name:
In[]:=
db[Select[StringContainsQ["KGAA"]]@*Union,"SecurityDesc"]
same as a List:
In[]:=
alldata[[All,3]]//Union//Select[StringContainsQ["KGAA"]]
what currencies, and how many entries per currency?
In[]:=
db[Counts,"Currency"]alldata[[All,5]]//Counts
what security types, and how many entries per security type?
In[]:=
db[Counts,"SecurityType"]alldata[[All,4]]//Counts
total traded volume for the day:
In[]:=
db[Total,"TradedVolume"]alldata[[All,-2]]//Totalalldata//Cases[{__,a_,_}a]//Total
total number of trades for the day:
In[]:=
db[Total,"NumberOfTrades"]alldata[[All,-1]]//Totalalldata//Cases[{__,a_}a]//Total
shown together:
In[]:=
db[Total,{"TradedVolume","NumberOfTrades"}]alldata//Cases[{__,a_,b_}{a,b}]//Totalalldata[[All,{-1,-2}]]//Total
average number of instruments per trade:
In[]:=
db[N@*(#[[1]]/#[[2]]&)@*Total,{"TradedVolume","NumberOfTrades"}]db[Total/*(#[[1]]/#[[2]]&)/*N,{"TradedVolume","NumberOfTrades"}]alldata//Cases[{__,a_,b_}{a,b}]//Total@#[[All,1]]/Total@#[[All,2]]&//N
max price for the day:
In[]:=
db[Max,"MaxPrice"]alldata//Cases[{__,a_,_,_,_,_}a]//Maxalldata[[All,-5]]//Max
min price for the day:
In[]:=
db[Min,"MinPrice"]alldata//Cases[{__,a_,_,_,_}a]//Minalldata[[All,-4]]//Min
volume per symbol, sorted by symbol
In[]:=
db[GroupBy["Mnemonic"]/*KeySort,Total,"TradedVolume"]
Lufthansa volume for the day (show in helpbrowser):
In[]:=
db[Select[#Mnemonic"LHA"&],"TradedVolume"]//Totaldb[Select[#Mnemonic"LHA"&]/*Total,"TradedVolume"]alldata//Cases[{__,"LHA",__,a_,_}a]//Total
all Lufthansa entries:
In[]:=
db[Select[#Mnemonic"LHA"&]]
Note how convenient you can now scroll through this long list, imagine you had to do this without the Dataset!
Lufthansa min prices per 1minute bucket:
In[]:=
db[Select[#Mnemonic"LHA"&],{"Time","MinPrice"}]
Lufthansa min price day plot:
In[]:=
db[Select[#Mnemonic"LHA"&],{"Time","MinPrice"}]//DateListPlot
Caveat
Caveat
Dataset has the potential to cause “infinite” computations when wrongly used. Try with small data first, and get experience with the query functions. Then carefully “scale into” the full size of your data.
data:image/s3,"s3://crabby-images/4079d/4079d57633b5f88bf9a49688684d35628eb2c6bf" alt=""
data:image/s3,"s3://crabby-images/56607/56607cca9c3f8f5e959237fb5ea16950a488c5ec" alt=""
Cite this as: Andreas Lauschke, "Data Science with Andreas Lauschke (#2)" from the Notebook Archive (2020), https://notebookarchive.org/2020-09-4lm3iy6
data:image/s3,"s3://crabby-images/afa7e/afa7e751d718eac7e65669706b85c714b1d1becc" alt=""
Download
data:image/s3,"s3://crabby-images/c9374/c9374a157002afb9ce03cd482ea9bc6b4ee16fc0" alt=""
data:image/s3,"s3://crabby-images/7630b/7630b01d225114cfa2bafc392f9b6df93ec5f7bb" alt=""