Data Science with Andreas Lauschke (#3)
Author: Andreas Lauschke
Title: Data Science with Andreas Lauschke (#3)
Description: Associations and Dataset
Category: Educational Materials
URL: http://www.notebookarchive.org/2020-09-4lm63c0/
DOI: https://notebookarchive.org/2020-09-4lm63c0
Date Added: 2020-09-10
Date Last Modified: 2020-09-10
File Size: 105.95 kilobytes
Rights: Redistribution rights reserved



Associations and Dataset, Part 2
Andreas Lauschke, May 21 2019
We started with a gentle introduction to Associations and Dataset and will go deeper in part 2 today and finish up in part 3 during the next session.
today's session: Associations and Dataset, part 2 (hierarchical data)
next session: finishing up on Associations, Dataset, and introduction of Query
session after: free data: web scraping ("traditional", XML, and with the new WebExecute from M12)
Modeling Trees with Associations
Associations are a great way to represent trees, as key-value mappings are very efficient building blocks for hierarchical data. (Technically, every map implies a one-branch tree.)
In[]:=
assoc=<|"Mike"->"Sue","Stan"->"Sue","Oscar"->"Chris","Andy"->"Chris","Sue"->"Peggy","Chris"->"Peggy","Carol"->"Sue","Tina"->"Kathy","Kathy"->"Peggy","James"->"Jenny","Kim"->"Jenny","Stacey"->"Tina","Richard"->"Tina","Jim"->"Tina","Nick"->"Kathy","Jerome"->"Kathy","Mabel"->"Chris","Harry"->"Jenny"|>;
TreePlot[Normal[assoc],ImageSize->400,DirectedEdges->True,VertexLabels->Automatic]
Quick check: the number of edges in a forest is the number of nodes of the forest minus the number of connected components (trees): n - k. With k == 1 you get a tree (a forest of one tree is a tree, or a one-tree forest). Here we have 2 trees (connected components), i.e. a 2-tree forest with 20 nodes, so we expect 20 - 2 = 18 edges.
In[]:=
Length@assoc
Length/@{EdgeList@Graph@Normal@assoc,VertexList@Graph@Normal@assoc}
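The n - k rule can also be checked programmatically. The following is a minimal sketch on a hypothetical five-node association (the names small, g, n and k are made up for illustration); it assumes WeaklyConnectedComponents is the appropriate component function, because the rules produce a directed graph.

```wolfram
(* hypothetical mini org chart: two separate trees, a<-{b,c} and d<-e *)
small = <|"b" -> "a", "c" -> "a", "e" -> "d"|>;
g = Graph[Normal[small]];

n = VertexCount[g];                        (* number of nodes *)
k = Length[WeaklyConnectedComponents[g]];  (* number of trees in the forest *)

(* a forest has exactly n - k edges *)
EdgeCount[g] == n - k
```

The same three lines should work unchanged on a larger association, as long as it really describes a forest (no cycles).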
But now let’s assume that Carol will no longer work for Sue, but for Kathy, and that Jenny’s group gets integrated and will report to Peggy. All we need to do: repointer!
In[]:=
assoc["Jenny"]="Peggy";
assoc["Carol"]="Kathy";
TreePlot[Normal[assoc],ImageSize->600,DirectedEdges->True,VertexLabels->Automatic]
We see that later assignments override earlier assignments. Note that we never removed "Carol"->"Sue"; we just repointered "Carol"->"Kathy". In short, the last assignment prevails:
In[]:=
assoc2=<|1->2,1->3,1->4,1->55|>;
assoc2
TreePlot[Normal[assoc2],ImageSize->20,DirectedEdges->True,VertexLabels->Automatic]
So we have: an easy "prune and graft" just by repointering Association elements.
And let’s do the edge/vertices check again:
In[]:=
Length@assoc
Length/@{EdgeList@Graph@Normal@assoc,VertexList@Graph@Normal@assoc}
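Because every key points to its parent, the tree can also be walked without plotting it. The sketch below uses a small hypothetical association (org, chain and reports are illustrative names, not part of the session) to follow the pointers upward and to list direct reports.

```wolfram
(* hypothetical reporting lines: key reports to value *)
org = <|"Mike" -> "Sue", "Stan" -> "Sue", "Sue" -> "Peggy"|>;

(* chain of command: keep applying the association while the current name is still a key *)
chain = NestWhileList[org, "Mike", KeyExistsQ[org, #] &]
(* {"Mike", "Sue", "Peggy"} *)

(* direct reports of Sue: select by value, then take the keys *)
reports = Keys[Select[org, # == "Sue" &]]
(* {"Mike", "Stan"} *)
```

Repointering one entry, for example org["Stan"] = "Peggy", changes both results without any other bookkeeping.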
In[]:=
Unset[assoc["Kathy"]]; (* Unset *)
assoc["Sue"]=.; (* short form of Unset: assigning . (dot) with Set (=) *)
assoc=Delete[assoc,Key@"Chris"]; (* use Delete on the Association, but you have to assign! Delete doesn't reassign! *)
TreePlot[Normal[assoc],ImageSize->400,DirectedEdges->True,VertexLabels->Automatic]
In[]:=
Length@assoc
Length/@{EdgeList@Graph@Normal@assoc,VertexList@Graph@Normal@assoc}
On a related note: you may prefer dealing with Graphs directly instead of using TreePlot.
In[]:=
gr=Graph[Normal@assoc,ImageSize->300,VertexLabels->Automatic]
In[]:=
EdgeCount@gr
VertexCount@gr
Reminder: Associations are ordered pointers with unique keys.
Duplicates are removed:
In[]:=
<|1->2,3->4,3->4,6->8|>
Looking up by value will (obviously) not work (it cannot: what would you do if multiple keys have the same value?).
In[]:=
<|1->2|>[2]
But notice that this isn’t just “doesn’t work”, but that you get the head Missing, which will allow you to handle such cases subsequently. It’s not an error message, but a head!
In[]:=
<|1->2|>[2]//FullForm
<|1->2|>[2]//Head
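Since a failed lookup returns an expression with head Missing rather than raising an error, it can be tested and handled downstream. A minimal sketch (res is a hypothetical variable name; MissingQ is assumed to be available, which it is in recent versions):

```wolfram
res = <|1 -> 2|>[2];   (* 2 is a value, not a key, so this is Missing[...] *)

MissingQ[res]                           (* True *)
If[MissingQ[res], "no such key", res]   (* branch on the Missing head *)
res /. _Missing -> 0                    (* or replace it by a fallback value: 0 *)
```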
This is the same as looking up a key that doesn't exist (well, trying to look up a value is the same, unless it happens to *also* be a key):
In[]:=
<|a->b|>[c]
In[]:=
<|1->2|>[2]
<|1->2,2->4|>[2]
Be sure to pick the right type for the key: the symbol a and the string "a" are different keys:
In[]:=
<|a->b,"a"->c|>["a"]
<|a->b,"a"->c|>[a]
If the association has a name (symbol), you can look up from that symbol:
In[]:=
blah=<|1->2,<|a->b|>->"Kathy",Red->Green,"Elvis"->"Presley"|>;
blah[1]
blah[7-6]
blah[<|a->b|>]
blah[Red]
But keep in mind that associations are *ordered*, so you can look up by position:
In[]:=
blah[[1]]
blah[[2]]
blah[[3]]
blah[[4]]
You can also look up with Lookup:
In[]:=
Lookup[blah,1]
Lookup[blah,<|a->b|>]
Lookup[blah,Red]
Lookup[blah,"Elvis"]
Lookup has a third argument for error handling:
In[]:=
Lookup[blah,"Poland",$Failed]
Lookup[blah,"Poland",Print@"Error"]
Lookup[blah,"Poland",4-77]
You can look up multiple keys at once:
In[]:=
Lookup[blah,{1,"Elvis","something else"}]
Lookup[blah,{1,"Elvis","something else"},$Failed]
Lookup[blah,{1,"Elvis","something else"},66+77]
You can also look up a key in multiple associations (note this is a list of associations):
In[]:=
Lookup[{blah,<|1->5,6->7|>,<|1->"Germany",66->"Poland",k->"Botswana"|>},{1,Red},$Failed]
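From the help browser:
Lookup[{assoc1,assoc2,…},key]: looks up key in each association associ
Lookup[assoc,{key1,key2,…}]: looks up each of the keys keyi
Lookup[assoc,key,default]: gives default if key is not present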
Part notation: [["key as a string"]] gives the same result as ["key as a string"]:
In[]:=
<|2->5,"river"->"Delaware"|>["river"]
<|2->5,"river"->"Delaware"|>[["river"]]
Here we want the first part, not the value for the key 1:
In[]:=
Part[<|"Jan"->"Feb","Mar"->"Apr","May"->"Jun","Jul"->"Aug","Sep"->"Oct","Nov"->"Dec",1->66|>,1]
<|"Jan"->"Feb","Mar"->"Apr","May"->"Jun","Jul"->"Aug","Sep"->"Oct","Nov"->"Dec",1->66|>[1]
With Part we can also take from the end:
In[]:=
Part[<|"Jan"->"Feb","Mar"->"Apr","May"->"Jun","Jul"->"Aug","Sep"->"Oct","Nov"->"Dec",1->66|>,-1]
or use any other Part specification:
In[]:=
Part[<|"Jan"->"Feb","Mar"->"Apr","May"->"Jun","Jul"->"Aug","Sep"->"Oct","Nov"->"Dec",1->66|>,1;;-1;;2]
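The position-versus-key distinction is easiest to see side by side. A short sketch on a hypothetical two-element association (months is an illustrative name); wrapping the key in Key is the way to force a key lookup inside Part:

```wolfram
months = <|"Jan" -> "Feb", 1 -> 66|>;

months[[1]]         (* Part with an integer: first position, "Feb" *)
months[1]           (* function-style lookup: key 1, 66 *)
months[[Key[1]]]    (* Part with Key: key 1 again, 66 *)
Keys[months][[1]]   (* the key sitting at position 1: "Jan" *)
```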
As the keys in the association are unique (possibly after overwriting previous keys), you end up with the keys forming a set. The definition of a set is that no element can occur twice. And we can use that to “build up” an association incrementally, but you have to assign to an *expression* that represents that association! (big gotcha!)
Start empty, then insert the first element:
In[]:=
set=<||>;
set[a]=1;
set
Insert the second element:
In[]:=
set[b]=2;
set
Overwrite the value of an existing key:
In[]:=
set[a]=7;
set
Note: we did *not* insert a new key-value pair for a (with value 7). That would give us a second key a, which is illegal for a set. Instead, we changed the value for the existing key a.
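Incremental build-up combined with overwriting gives a simple counting idiom. The sketch below is an illustration only (counts and the word list are made up); it uses Lookup with a default of 0 so that the first occurrence of a key inserts it and later occurrences overwrite its value.

```wolfram
counts = <||>;   (* start empty *)

(* insert on first sight, overwrite afterwards *)
Do[
  counts[w] = Lookup[counts, w, 0] + 1,
  {w, {"a", "b", "a", "c", "a"}}
];

counts   (* <|"a" -> 3, "b" -> 1, "c" -> 1|> *)
```

For real tallies the built-in Counts does the same job in one call; the loop is only meant to show the assignment mechanics.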
Another way to change the value of an existing key, or to insert new key-value pairs is to use AssociateTo:
In[]:=
AssociateTo[set,a->88]
In[]:=
AssociateTo[set,c->345]
To insert new key-value pairs we can also use AppendTo:
In[]:=
AppendTo[set,d->123]
But there is also PrependTo (remember, an association is *ordered*!):
In[]:=
PrependTo[set,first->767676]
we can also use Insert to specify the exact position of the insertion (c is in the fourth position, so this will insert m->66 in the fourth position):
In[]:=
Insert[set,m->66,Key@c]
set
But this didn't *store* m->66 in set (Insert is not a ...To function!). So to keep it, we have to assign it back to set:
In[]:=
set=Insert[set,m->66,Key@c]
set
or we specify the position directly (insert p->77 in the third position):
In[]:=
Insert[set,p->77,3]
set
Again, this didn't store anything; we have to assign the result of Insert to set:
In[]:=
set=Insert[set,p->77,3]
set
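The same store-or-not distinction applies to every function pair with and without the To suffix. A minimal sketch on a hypothetical association s, using string keys so that no symbol values can interfere:

```wolfram
s = <|"x" -> 1|>;

Append[s, "y" -> 2];     (* returns a new association; s itself is untouched *)
before = Keys[s]         (* {"x"} *)

AppendTo[s, "y" -> 2];   (* modifies s in place *)
after = Keys[s]          (* {"x", "y"} *)
```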
Insert q->88 in the second-last position (and keep it):
In[]:=
set=Insert[set,q->88,-2]
Removing elements: use Unset. Remove a:
In[]:=
set[a]=.
set
or use Delete. Remove the third element (that’s b->2), and store it:
In[]:=
set=Delete[set,3]
but if we don’t know which position: Key is key!
In[]:=
set=Delete[set,Key@q]
can use for multiple deletions at once (remove m by key, and 1 by position):
In[]:=
set=Delete[set,{{Key@m},{1}}]
You get sensible error messages. If you try to delete a position that does not exist, you get an error message; if you try to delete a key that does not exist, nothing happens:
In[]:=
Delete[set,{{Key@r},{18}}]
Delete[set,{{Key@r},{Key@n}}]
Delete[set,Key@w]
set remained unchanged:
In[]:=
set
Some subtleties: if you remove the key p, then c->345 moves to position 1 and d->123 to position 2. Removing the second element in the second step therefore removes d->123, which is *now* the second element, although it was in position 3 of the input. Only c->345 remains, even though it was in the second position of the input and we asked to delete the second element (yes, but only *after* deleting the element with key p in the first step!). This may read a tad confusing!
In[]:=
set=Delete[set,{{Key[p]},{2}}]
careful with the nesting: let’s first add an element that has a list as values, and then remove a particular element of that list:
In[]:=
set[l]=Range@8
set
In[]:=
set=Delete[set,{Key@l,3}]
Note the difference between Delete[...,{{...},{...}}] (several separate deletions) and Delete[...,{...,...}] (one nested position)!
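The two position forms can be compared on a small hypothetical association (demo, flat and nested are illustrative names). The first form lists two independent top-level positions, the second form is a single path into a nested value:

```wolfram
demo = <|"p" -> {10, 20, 30}, "q" -> 5|>;

(* list of positions: delete key "p" AND key "q" *)
flat = Delete[demo, {{Key["p"]}, {Key["q"]}}]
(* <||> *)

(* one nested position: inside the value of "p", delete element 2 *)
nested = Delete[demo, {Key["p"], 2}]
(* <|"p" -> {10, 30}, "q" -> 5|> *)
```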
keep removing the third element:
In[]:=
set=Delete[set,{Key@l,3}]
In[]:=
set[h]=HilbertMatrix@8;
set
we can go further in the nesting levels: remove the sixth element in the third element of the value of h:
In[]:=
set=Delete[set,{Key@h,3,6}]
keep deleting:
In[]:=
set=Delete[set,{Key@h,3,6}]
we have three elements in the association:
In[]:=
Length@set
it's a 3 x 1:
In[]:=
Dimensions@set
First, Last, Rest, Most work like on Lists:
In[]:=
First@set (* first element *)
Last@set (* last element *)
Rest@set (* everything but first *)
Most@set (* everything but last *)
Dataset
more on levels, with colors, to help better understand. Back to titanic:
In[]:=
titanic=ExampleData[{"Dataset","Titanic"}]
let’s only pick the first ten entries, to keep things smaller:
In[]:=
t=titanic[[1;;10]]
In[]:=
t//Normal
Dataset uses the concept of *levels*:
In[]:=
{<|"class"->"1st","age"->29,"sex"->"female","survived"->True|>,
<|"class"->"1st","age"->1,"sex"->"male","survived"->True|>,
<|"class"->"1st","age"->2,"sex"->"female","survived"->False|>,
<|"class"->"1st","age"->30,"sex"->"male","survived"->False|>,
<|"class"->"1st","age"->25,"sex"->"female","survived"->False|>,
<|"class"->"1st","age"->48,"sex"->"male","survived"->True|>,
<|"class"->"1st","age"->63,"sex"->"female","survived"->True|>,
<|"class"->"1st","age"->39,"sex"->"male","survived"->False|>,
<|"class"->"1st","age"->53,"sex"->"female","survived"->True|>,
<|"class"->"1st","age"->71,"sex"->"male","survived"->False|>}
level 1, level 2, level 3
all ages:
t[All,"age"]
all ages, rounded to the nearest multiple of 2:
t[All,"age",Round[#,2]&]
ages of all passengers who survived:
In[]:=
t[Select[#survived==True&],"age"]
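The level-by-level reading of these queries can be tried on a tiny hand-made Dataset. In this sketch people is a hypothetical two-row Dataset: the first argument acts on the list of rows (level 1), the second on each row (level 2).

```wolfram
people = Dataset[{
   <|"name" -> "A", "age" -> 31|>,
   <|"name" -> "B", "age" -> 18|>}];

ages = Normal[people[All, "age"]]                       (* every row, then the "age" field *)
adults = Normal[people[Select[#age > 20 &], "name"]]    (* filter rows first, then take "name" *)
oldest = people[Max, "age"]                             (* extract "age" per row, aggregate with Max *)
```

Normal is only used here to turn the results back into plain lists for inspection.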
Now let’s use the planets data, my main point here will be on interactivity through nested Associations!
a convenient top-level view with interactive drill-downs:
In[]:=
planets=ExampleData[{"Dataset","Planets"}]
In[]:=
Normal@planets
Now we can do simple queries that behave like simple look-ups:
What is the mass of Earth?
In[]:=
planets["Earth","Mass"]
BarChart of the planet radii please!
In[]:=
planets[BarChart,"Radius"]
Show me the moons of Neptune:
In[]:=
planets["Neptune","Moons"]
planet masses please
In[]:=
planets[All,"Mass"]
planet radii please
In[]:=
planets[All,"Radius"]
How many moons does every planet have?
In[]:=
planets[All,"Moons",Length]
We can further process the Dataset as if it were a “traditional” list:
In[]:=
PieChart[%,ChartLegendsAutomatic]
Which moons have masses larger than the mass of Earth's moon?
In[]:=
mm=planets["Earth","Moons","Moon","Mass"];
planets[All,"Moons",Select[#Mass>mm&]/*Keys]
Which moons have radii larger than Mercury's radius?
In[]:=
r=planets["Mercury","Radius"];
planets[All,"Moons",Select[#Radius>r&]/*Keys]
Give me a new Dataset of all moons:
In[]:=
moons=planets[Catenate,"Moons"]
Now plot the masses of all moons against their radii:
In[]:=
moons[ListLogLogPlot,{#Mass,#Radius}&]
equivalent (as Datasets and Associations are supposed to “feel” like Lists):
In[]:=
ListLogLogPlot@moons
Queries on Associations of a Dataset
In[]:=
data={{"Symbol","Description","Quantity","Price","Security Type"},
{"MMM","3M CO",300,185.22`,"Equity"},
{"T","A T & T INC",550,30.7`,"Equity"},
{"AFL","AFLAC INC",200,50.49`,"Equity"},
{"AMAT","APPLIED MATERIALS",500,43.96`,"Equity"},
{"ADP","AUTO DATA PROCESSING",800,160.19`,"Equity"},
{"BA","BOEING CO",500,376.46`,"Equity"},
{"CAT","CATERPILLAR INC",800,139.06`,"Equity"},
{"SCHW","CHARLES SCHWAB CORP",900,46.24`,"Equity"},
{"CME","CME GROUP INC CLASS A",100,173.91`,"Equity"},
{"CMI","CUMMINS INC",500,169.19`,"Equity"},
{"D","DOMINION ENERGY INC",200,76.79`,"Equity"},
{"EMR","EMERSON ELECTRIC CO",300,71.1`,"Equity"},
{"XOM","EXXON MOBIL CORP",100,77.47`,"Equity"},
{"GIS","GENERAL MILLS INC",100,51.18`,"Equity"},
{"HP","HELMERICH & PAYNE",400,57.55`,"Equity"},
{"HD","HOME DEPOT INC",300,200.56`,"Equity"},
{"HON","HONEYWELL INTL INC",600,173.54`,"Equity"},
{"INTC","INTEL CORP",500,51.75`,"Equity"},
{"IP","INTERNTNL PAPER",700,47.1`,"Equity"},
{"JPM","J P MORGAN CHASE & CO",400,116.12`,"Equity"},
{"JNJ","JOHNSON & JOHNSON",200,142.01`,"Equity"},
{"KLAC","K L A TENCOR CORP",800,128.47`,"Equity"},
{"LLY","LILLY ELI & CO",100,116.91`,"Equity"},
{"MDT","MEDTRONIC PLC F",200,89.58`,"Equity"},
{"MCHP","MICROCHIP TECHNOLOGY",500,100.75`,"Equity"},
{"MSFT","MICROSOFT CORP",400,128.9`,"Equity"},
{"NKE","NIKE INC CLASS B",400,85.7`,"Equity"},
{"OKE","ONEOK INC",690,66.89`,"Equity"},
{"PG","PROCTER & GAMBLE",925,106.08`,"Equity"},
{"QCOM","QUALCOMM INC",500,89.29`,"Equity"},
{"ROP","ROPER TECHNOLOGIES",800,360.19`,"Equity"},
{"SWK","STANLEY BLACK & DECK",300,153.08`,"Equity"},
{"KO","THE COCA-COLA CO",800,48.72`,"Equity"},
{"TRV","TRAVELERS COMPANIES",210,143.35`,"Equity"},
{"UNP","UNION PACIFIC CORP",600,179.2`,"Equity"},
{"UTX","UNITED TECHNOLOGIES",200,141.63`,"Equity"},
{"WMT","WALMART INC",800,102.08`,"Equity"},
{"912796VA4","US TREASURY BILL19 U S T BILL DUE 05/07/19",2000,99.99`,"Fixed Income"},
{"912796VB2","US TREASURY BILL19 U S T BILL DUE 05/14/19",8000,99.95`,"Fixed Income"},
{"912796VC0","US TREASURY BILL19 U S T BILL DUE 05/21/19",6000,99.9`,"Fixed Income"},
{"912796VD8","US TREASURY BILL19 U S T BILL DUE 05/28/19",10000,99.86`,"Fixed Income"},
{"912796RU5","US TREASURY BILL19 U S T BILL DUE 06/13/19",6000,99.75`,"Fixed Income"},
{"912796VJ5","US TREASURY BILL19 U S T BILL DUE 07/02/19",5000,99.63`,"Fixed Income"},
{"912796RA9","US TREASURY BILL19 U S T BILL DUE 09/12/19",6000,99.15`,"Fixed Income"},
{"912796RF8","US TREASURY BILL19 U S T BILL DUE 10/10/19",6000,98.96`,"Fixed Income"},
{"912796RT8","US TREASURY BILL20 U S T BILL DUE 01/02/20",6000,98.43`,"Fixed Income"}};
headers:
In[]:=
data[[1]]
turn into Dataset:
In[]:=
ds=AssociationThread[data[[1]]->#]&/@Rest[data]//Dataset
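What AssociationThread does to each row is easier to see on a small table. The sketch below uses a hypothetical three-column table (table, rows and total are illustrative names) and stops before the Dataset wrapper, so the intermediate list of Associations stays visible.

```wolfram
table = {{"Symbol", "Quantity", "Price"},
         {"AAA", 10, 2.5},
         {"BBB", 4, 10.}};

(* thread the header row over every data row *)
rows = AssociationThread[First[table] -> #] & /@ Rest[table];

rows[[1]]   (* <|"Symbol" -> "AAA", "Quantity" -> 10, "Price" -> 2.5|> *)

(* named slots now work on each row *)
total = Total[#Quantity*#Price & /@ rows]   (* 10*2.5 + 4*10. = 65. *)
```

Wrapping rows in Dataset afterwards gives the same kind of object as ds above, with clickable column headers.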
Using Association functions to create new data sets from an existing data set. It works without applying Normal to the Dataset or converting anything back!
In[]:=
{ds[All,<|"Symbol"->#["Symbol"],"Quantity"->#["Quantity"],"Price"->#["Price"],"Total"->#["Quantity"]*#["Price"]*If[#["Security Type"]=="Fixed Income",1/100.,1.]|>&],
ds[All,<|"Symbol"->#"Symbol","Quantity"->#"Quantity","Price"->#"Price","Total"->#"Quantity"*#"Price"*If[#"Security Type"=="Fixed Income",1/100.,1.]|>&]}
Caveat
You have to be careful how to construct a Dataset. In principle, every matrix can be turned into a Dataset:
In[]:=
hm=Prepend[HilbertMatrix@5,{"one","two","three","four","five"}]//Dataset
It’s clearly a Dataset:
In[]:=
Head@hm
But it’s not a list of Associations, the header is no longer threaded across a list of Associations. When you normalize it, you get a List of Lists, not a List of Associations:
In[]:=
Normal@hm
The header row is interpreted as part of the list. Notice also that the header row is no longer shaded and not clickable. You *have* to make sure that you provide a list of Associations before you apply Dataset:
In[]:=
AssociationThread[{"one","two","three","four","five"}->#]&/@HilbertMatrix@5//Dataset
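A quick way to tell the two constructions apart is to normalize the Dataset and test whether every row is an Association. A minimal sketch on a hypothetical 2 x 2 matrix (m, raw and good are illustrative names):

```wolfram
m = {{1, 2}, {3, 4}};

raw = Dataset[Prepend[m, {"one", "two"}]];                      (* header is just another row *)
good = Dataset[AssociationThread[{"one", "two"} -> #] & /@ m];  (* header threaded into each row *)

AllTrue[Normal[raw], AssociationQ]    (* False: a list of lists *)
AllTrue[Normal[good], AssociationQ]   (* True: a list of Associations *)
```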


Cite this as: Andreas Lauschke, "Data Science with Andreas Lauschke (#3)" from the Notebook Archive (2020), https://notebookarchive.org/2020-09-4lm63c0


