Educational Materials

Data Science with Andreas Lauschke (#2)

Andreas Lauschke

Author

Andreas Lauschke

Title

Data Science with Andreas Lauschke (#2)

Description

Association and the Dataset functions

Associations and Dataset, Part 1

Andreas Lauschke, May 2 2019

We’ll start with a gentle introduction to Associations and Dataset and will go deeper in part 2 during the next session.

today’s session : Associations and Dataset, part1 (rectangular data)next session: Associations and Dataset, part 2 (hierarchical data)

Association

Motivation:

◼

called associative arrays, dictionaries, hashes, or maps in other languages

◼

like in a dictionary, you look up an entry based on a keyword. First find the keyword, then you can read the entry associated with that keyword

◼

associates a key with a value

Here are some of the basics of Associations:

◼

set of bindings between keys and values

◼

efficient for looking up the value, given the key. Designed for several millions of elements

◼

keys and values can be arbitrary expressions

◼

entries in Associations are ordered

◼

Associations are convenient in creating hierarchical data structures

Simple example, using rules:

In[]:=

assoc=<|133,2antr,55,ab,bc,cd,3Pi/2,"Frankfurt""Hessen",7^2"hello",blah3,"Hessen""Germany",123424321,k-3x-7,plotPlot[Sin@x,{x,0,2Pi}]|>

look up values:

In[]:=

assoc[3]assoc[7^2]assoc@49assoc[[4]]

Associations can be turned into Lists with Normal.

In[]:=

assoc//Normal

note that this is now a List!

very useful, for example, to visualize the association structure :

In[]:=

TreePlot[Normal[assoc],VertexLabelsAutomatic,DirectedEdgesTrue,ImageSize600]

look-up tables / key-value structures are really nothing new, conceptually. Examples:

mortage["Kathy"]=1765.23balance["Michelle"]=982.34days[April]=30days[November]=30capital["Italy"]="Rome"population["USA"]=326population["Germany"]=83tires["Honda"]=4tires["747"]=18

But with M10 the Associations were implemented with a focus on performance, and they also provide a certain *context* to the surrounding function.

In[]:=

days[November]//FullForm

In[]:=

days=<|"November"30|>

In[]:=

FullForm@%

In[]:=

days["November"]days[["November"]]

Here the fact that we’re not just talking about a *binding* of a key (November) to the array (days), but have a new “thing” (Association), makes it possible for the surrounding function to get some context. For example, you can ListPlot associations, that will also give you legends automatically, like so:

In[]:=

ListPlot[<|"Foo"Range@10,"Bar"Range[10]+5|>]

For that to work, ListPlot needed to “detect” that an Association was provided!

We will see more of that when we start looking at the Dataset.

Construction: use the Association symbol (sometimes easier programmatically) or enter them directly:

In[]:=

g=Association[a55,b66]g=<|a55,b66|>

In[]:=

??g

Very important are AssociationThread and AssociationMap

AssociationThread allows you to create associations when you have keys and values in separate lists:

as two lists (a key list and a values list):

In[]:=

AssociationThread[{1,d,Pi/2,<|RedYellow,BlueGreen|>},{y,z,"hello world",<|233,7788|>}]

as a keys list pointing to a values list:

In[]:=

AssociationThread[{1,2,3}{"eins","zwei","drei"}]

In[]:=

AssociationThread[Range@100Range[101,200]]//Short

AssociationMap is an extremely powerful way to combine function application with the creation of the association. It applies a function to a list of elements that are assumed to be the keys of the new association:

from the helpbrowser:

AssociationMap

[f,{

key

,…}]

creates the association



key

f[

key

f[

key

],…

In[]:=

AssociationMap[StringLength,{"cat","dog","shark","dinosaur","pterodactylus"}]

it can be symbolic:

In[]:=

AssociationMap[foo,Range[10]]

it can use pure functions:

In[]:=

AssociationMap[#^2&,Range@20]AssociationMap[1/#^3&,Range@20]AssociationMap[Binomial[#+1,2]&,Range[0,20]]AssociationMap[Sum[k,{k,0,#}]&,Range[0,20]]AssociationMap[Plot[Csc[#x],{x,-Pi/2,Pi/2}]&,Range[0,20]]

Note that this is NOT a list of plots. It’s an association, the plots are values, the integers are the keys!

In[]:=

AssociationMap[Plot[Tan@x,{x,0,#Pi},GridLinesAutomatic]&,Range@10]

Reverse reverses the roles of keys and values:

In[]:=

AssociationMap[Reverse,<|a1,b2|>]

You can also operate on the values directly, that leaves the keys unchanged. AssociationMap applies the keys (first argument) to the second argument. Map applies a *function* (not the keys) to the values:

In[]:=

Map[#^2&,<|aa1,bb1,cc1|>]

In[]:=

Map[Total,<|a{a1,a2},b{b1,b2},c{c1,c2}|>]

In[]:=

Map[Det,<|a{{a1,a2},{a3,a4}},b{{b1,b2},{b3,b4}},c{{c1,c2},{c3,c4}}|>]

we obviously need a simple way to get a list of the keys and the values:

In[]:=

assoc=AssociationThread[{1,2,3}{"eins","zwei","drei"}]

In[]:=

Keys@assocValues@assoc

A typical use of associations (if not *the* most prominent use) is to set up the association in a way that the keys have an interpretation of row and column specifications (as a table / spreadsheet would have). For that you set up a list of associations, so that the rows are the associations, and the keys are the column labels:

In[]:=

mytable={<|"Name""Bob","Age"30,"Sex""male"|>,<|"Name""Bill","Age"34,"Sex""male"|>,<|"Name""Stacey","Age"27,"Sex""female"|>}

In[]:=

InputForm@mytable

In[]:=

FullForm@mytable

In[]:=

TabView@mytable

And in this form we are ready for the next section, the Dataset. As a list of associations, where every association represents a row, and every key represents a column, we can start using the Dataset:

In[]:=

Dataset@mytable

(now show helpbrowser, section “Dataset Structure” and below)

Dataset

seriously cool stuff!

◼

concise query syntax for powerful transformations -- shorter and faster than SQL queries

◼

sometimes “raw data” can be unwieldy to work with

◼

too large to display

◼

nested structure can be difficult to discern

◼

hard to write complex analyses / transformations of the data

◼

hard to debug / trace. If something goes wrong, hard to find out why

◼

Datasets solve these problems

◼

displays large data in a convenient format, displays nested structures in an intelligent way

◼

automatic column headers, number of nesting levels, and total number of elements, formatted left-aligned

◼

scrollbars, jump-to-first, jump-to-last, automatic grid frame, ...

◼

typically used to manage objects that have the same properties:

{<|"Name""Bob","Age"30,"Sex""male"|>,<|"Name""Bill","Age"34,"Sex""male"|>,<|"Name""Stacey","Age"27,"Sex""female"|>}

◼

this is like rows / columns in a table!

◼

we can keep nesting these lists and associations (see the moons in the planets example, further down)

◼

allows easy mapping and selecting

◼

extract individual rows / columns to get single records

◼

Caveat on Dataset: some people find the Dataset confusing, I get verbiage as “not intuitive”, “hairy”, “never use it except for visualization, prefer lists of associations or associations of associations” and “all these operators, I get confused”. I will admit it takes some getting used to and some experimentation to harness its full power.

Let's load the Titanic example data and look at its raw format: we find a list of Associations:

In[]:=

titanic=ExampleData[{"Dataset","Titanic"}];titanic//Normal

note the format, it’s important to understand: this is *one* 1dim list of (several) Associations, in which every element has a “column label” as key, and a value corresponding to that key. That is like rows/columns in a table!

In[]:=

titanic//FullForm

convenient format, displays nesting as rows and columns, automatic column headers, scrollbar, total number of elements:

In[]:=

titanic

concise query syntax for powerful queries and transformations

how many per class?

In[]:=

titanic[Counts,"class"]titanic[All,"class"]@Counts

how many men / women?

In[]:=

titanic[Counts,"sex"]titanic[All,"sex"]@Counts

how many survived / died?

In[]:=

titanic[Counts,"survived"]titanic[All,"survived"]@Counts

In[]:=

titanic[Counts,"age"]

In[]:=

titanic[Count[_Missing],"age"]

Note that the first argument here is the query expression, the second is the “column selector”. This is, however, “right to left”. As Count is higher in the level spec, this use of the titanic dataset goes “age” first, and *then* Count[...]. We *can* force “left to right”, but then we need the function application, ...[...] or ...@..., which itself is “inside out” (which usually means “right to left”), and you need to specify an additional All:

In[]:=

titanic[All,"age"][Count[_Missing]]titanic[All,"age"]@Count@_Missing

mean age please

In[]:=

titanic[Mean,"age"]//Ntitanic[All,"age"]@Mean//Ntitanic[All,"age"]@Mean@*N

Note that the last example has Mean@*N as the argument to the dataset, I use parens to demonstrate:

In[]:=

titanic[All,"age"]@(Mean@*N)

This does not bind as

In[]:=

(titanic[All,"age"]@Mean)@*N

due to the higher precedence of @* over @ in the precedence table.

So, indeed, perhaps a bit tricky, because you need to evaluate Mean@*N first, which would be “right to left”, although you wanted “left to right”. You see that in order to force LTR you need to use RTL to make it possible, and you need to apply function application (...[...] or ...@...), which is why I personally prefer the direct use of RTL in many cases:

In[]:=

titanic[Mean,"age"]//N(*LTR:readas:takethemeanoftheagesandthenturnintoareal*)titanic[N@*Mean,"age"](*RTL:readas:startwiththeages,thentakethemeanofthem,andthenturnintoareal*)

min / mean / max age please. If you want *multiple* functions applied to data at the same stage, just use braces.

In[]:=

titanic[Min,"age"]titanic[Mean,"age"]titanic[Max,"age"]titanic[{Min,Mean,Max},"age"]titanic[All,"age"]@{Min,Mean,Max}

you see that Rationals in a Dataset are automatically turned into Reals when the fraction gets too tall, for display purposes.

robust error-handling:

In[]:=

titanic[All,"nationality"]

the sky is the limit as to the query / computation / visualization function we can use:

In[]:=

titanic[Histogram,"age"]

In the previous examples you’ll have noticed a bit of a pattern: the operation was always an operation that works on a 1dim list (Mean, Min, Max, Histogram). That’s because we specified “age” in the second argument, and age is a “column” in the underlying structure. Think about the arguments as “right to left”: you start with “age”, that should give you a “column”, and then you apply the operation on the “column”.

age total (keep in mind 263 are missing):

In[]:=

titanic[Total,"age"]

are there some ages missing (other than missing ages)?

In[]:=

titanic[Complement[Range@80,#]&,"age"]

Let’s now add another layer. Again think “right to left”, we start with the class, then for every class we count, and then we group by sex as the grouping key:

In[]:=

titanic[GroupBy["sex"],Counts,"class"]

we can also group by survival status:

In[]:=

titanic[GroupBy["survived"],Counts,"class"]

how about age histograms, grouped by class:

In[]:=

titanic[GroupBy["class"],Histogram[#,{0,80,4}]&,"age"]

we can “go further” to the left. Reminder: right to left! Note that in the above diagram, we grouped by class, which was in rows. Now we go one further, by sex, and now class is in columns, and sex is in rows!

In[]:=

titanic[GroupBy["sex"],GroupBy["class"],Histogram[#,{0,80,4}]&,"age"]

In[]:=

titanic[GroupBy["sex"],Counts,"survived"]

We talked about function composition during our first session. You’ll remember /*, which is “left to right” function composition, and @*, which is “right to left” function composition, and mind you, function composition is the construction of functions when *not* applied to data (yet). So, keeping in line with “right to left” mental model, we want to compute the survival ratio per sex, grouped by class and sex. We start with “survived” at the very right. That should give use a 1dim list (can think of as row or column, doesn’t matter, it’s a 1dim list). Next, we want to get true/false, so we apply Boole to that list. For the survival ratio we compute the mean, and we need to force it into a Real with N. Then we group by class, then by sex. You see, we went from right to left, step by step: survived, Boole, Mean, N, class grouping, sex grouping:

In[]:=

titanic[GroupBy["sex"],GroupBy["class"],N@*Mean@*Boole,"survived"]

So if you were on the Titanic in 1912, you better have been a woman!

We can also change the grouping order, we can group by sex first and then put class into rows, simply by swapping the first two:

In[]:=

titanic[GroupBy["class"],GroupBy["sex"],Boole/*Mean/*N,"survived"]

overall, most died:

In[]:=

titanic[N@*Mean@*Boole,"survived"]

In[]:=

titanic[GroupBy["age"],N@*Mean@*Boole,"survived"]//ListPlot

it seems the survival ratio was lowest for the “middle aged” people. Older and younger people had somewhat higher likelihoods of survival.

can also define your own function to outsource longer expressions:

In[]:=

getmean=N@*Mean@*Boole;

In[]:=

titanic[GroupBy["sex"],GroupBy["class"],getmean,"survived"]titanic[GroupBy["age"],getmean,"survived"]//ListPlot

Let’s look at the Deutsche Boerse data again and use Dataset for some queries and comparisons.

In[]:=

datestring="2019-04-30";data=Import["/mnt/seconddrive/pds/"<>datestring<>"/*XETR*.csv"];alldata=Flatten[data,1]//DeleteCases[{"ISIN",__}];data[[1,1]]

In[]:=

db=AssociationThread[data[[1,1]]#]&/@alldata//Dataset;

In[]:=

db[Counts,"Mnemonic"]

let’s sort this by the first column:

In[]:=

db[KeySort@*Counts,"Mnemonic"]

many equivalent ways (pick whatever you like best, I prefer the one above, not these three):

In[]:=

{db[Counts,"Mnemonic"][KeySort],db[All,"Mnemonic"][KeySort@*Counts],db[All,"Mnemonic"][Counts/*KeySort]}

how many different symbols?

In[]:=

db[Length@*Union,"Mnemonic"]alldata[[All,2]]//Union//Length

sort by the symbol counts:

In[]:=

{db[Counts,"Mnemonic"]//Sort,db[Sort@*Counts,"Mnemonic"],db[Counts,"Mnemonic"][Sort],db[All,"Mnemonic"][Counts/*Sort],db[All,"Mnemonic"][Sort@*Counts]}

show the security descriptions that have KGAA in the name:

In[]:=

db[Select[StringContainsQ["KGAA"]]@*Union,"SecurityDesc"]

same as a List:

In[]:=

alldata[[All,3]]//Union//Select[StringContainsQ["KGAA"]]

what currencies, and how many entries per currency?

In[]:=

db[Counts,"Currency"]alldata[[All,5]]//Counts

what security types, and how many entries per security type?

In[]:=

db[Counts,"SecurityType"]alldata[[All,4]]//Counts

total traded volume for the day:

In[]:=

db[Total,"TradedVolume"]alldata[[All,-2]]//Totalalldata//Cases[{__,a_,_}a]//Total

total number of trades for the day:

In[]:=

db[Total,"NumberOfTrades"]alldata[[All,-1]]//Totalalldata//Cases[{__,a_}a]//Total

shown together:

In[]:=

db[Total,{"TradedVolume","NumberOfTrades"}]alldata//Cases[{__,a_,b_}{a,b}]//Totalalldata[[All,{-1,-2}]]//Total

average number of instruments per trade:

In[]:=

db[N@*(#[[1]]/#[[2]]&)@*Total,{"TradedVolume","NumberOfTrades"}]db[Total/*(#[[1]]/#[[2]]&)/*N,{"TradedVolume","NumberOfTrades"}]alldata//Cases[{__,a_,b_}{a,b}]//Total@#[[All,1]]/Total@#[[All,2]]&//N

max price for the day:

In[]:=

db[Max,"MaxPrice"]alldata//Cases[{__,a_,_,_,_,_}a]//Maxalldata[[All,-5]]//Max

min price for the day:

In[]:=

db[Min,"MinPrice"]alldata//Cases[{__,a_,_,_,_}a]//Minalldata[[All,-4]]//Min

volume per symbol, sorted by symbol

In[]:=

db[GroupBy["Mnemonic"]/*KeySort,Total,"TradedVolume"]

Lufthansa volume for the day (show in helpbrowser):

In[]:=

db[Select[#Mnemonic"LHA"&],"TradedVolume"]//Totaldb[Select[#Mnemonic"LHA"&]/*Total,"TradedVolume"]alldata//Cases[{__,"LHA",__,a_,_}a]//Total

all Lufthansa entries:

In[]:=

db[Select[#Mnemonic"LHA"&]]

Note how convenient you can now scroll through this long list, imagine you had to do this without the Dataset!

Lufthansa min prices per 1minute bucket:

In[]:=

db[Select[#Mnemonic"LHA"&],{"Time","MinPrice"}]

Lufthansa min price day plot:

In[]:=

db[Select[#Mnemonic"LHA"&],{"Time","MinPrice"}]//DateListPlot

Caveat

Dataset has the potential to cause “infinite” computations when wrongly used. Try with small data first, and get experience with the query functions. Then carefully “scale into” the full size of your data.

Cite this as: Andreas Lauschke, "Data Science with Andreas Lauschke (#2)" from the Notebook Archive (2020), https://notebookarchive.org/2020-09-4lm3iy6