Data Science with Andreas Lauschke (#5)
Author
Andreas Lauschke
Title
Data Science with Andreas Lauschke (#5)
Description
Dataset, Query, and Web Scraping of Free Data
Category
Educational Materials
Keywords
URL
http://www.notebookarchive.org/2020-09-4lmceo8/
DOI
https://notebookarchive.org/2020-09-4lmceo8
Date Added
2020-09-10
Date Last Modified
2020-09-10
File Size
0.54 megabytes
Supplements
Rights
Redistribution rights reserved



Dataset, Query, and Web Scraping of Free Data
Dataset, Query, and Web Scraping of Free Data
Andreas Lauschke, June 18 2019
Andreas Lauschke, June 18 2019
We started with a gentle introduction to Associations and Dataset and will finish up the Dataset today, along with an intro of the Query.
today’s session: finishing up on Dataset introduction of Query free data: web scraping (“traditional”, XML, and with the new WebExecute from M12)next session: outlier detection methods
Most examples of today follow a particular 3-element pattern:{ 1 dataset[[ some query ]], 2 the same display done with the corresponding Query [ some query ]@dataset, 3 the same Query [ some query ]@rawdata}I want you to see how the Query is constructed, it is the same as applied directly to the Dataset, and then the same query on your raw data (to underscore the point: Query doesn’t need Dataset, it can operate on your raw data).1 and 2 look identical, 2 and 3 use the very same query construction: Query [ some query ]We’ll follow all syntax examples of the Details section of the helpbrowser page for Dataset, listed under Part Operations, Dataset Queries, common forms and more specialized forms, and apply them on the planets data.our program for today:
In[]:=
pl=ExampleData[{"Dataset","Planets"}];plraw=Normal@pl;
◼
The following part operations are commonly used to extract rows from tabular datasets:
dataset[["name"]] | extract a named row (if applicable) |
In[]:=
{pl[["Mars"]],Query["Mars"]@pl,Query["Mars"]@plraw}
dataset[[{" name 1 | extract a set of named rows |
In[]:=
{pl[[{"Earth","Mars"}]],Query[{"Earth","Mars"}]@pl,Query[{"Earth","Mars"}]@plraw}
dataset[[1]] | extract the first row |
In[]:=
{pl[[1]],Query[1]@pl,Query[1]@plraw}
dataset[[n]] | extract the n th |
In[]:=
{pl[[4]],Query[4]@pl,Query[4]@plraw}
dataset[[-1]] | extract the last row |
In[]:=
{pl[[-1]],Query[-1]@pl,Query[-1]@plraw}
dataset[[m;;n]] | extract rows m n |
In[]:=
{pl[[2;;4]],Query[2;;4]@pl,Query[2;;4]@plraw}
dataset[[{ n 1 n 2 | extract a set of numbered rows |
In[]:=
{pl[[{1,3,5,6,8}]],Query[{1,3,5,6,8}]@pl,Query[{1,3,5,6,8}]@plraw//Short}
◼
The following part operations are commonly used to extract columns from tabular datasets:
dataset All | extract a named column (if applicable) |
In[]:=
{pl[[All,"Mass"]],Query[All,"Mass"]@pl,Query[All,"Mass"]@plraw}
dataset All name 1 | extract a set of named columns |
In[]:=
{pl[[All,{"Mass","Radius"}]],Query[All,{"Mass","Radius"}]@pl,Query[All,{"Mass","Radius"}]@plraw}
dataset All | extract the first column |
In[]:=
{pl[[All,1]],Query[All,1]@pl,Query[All,1]@plraw}
dataset All | extract the n th |
In[]:=
{pl[[All,2]],Query[All,2]@pl,Query[All,2]@plraw}
dataset All | extract the last column |
In[]:=
{pl[[All,-1]],Query[All,-1]@pl,Query[All,-1]@plraw}
dataset All | extract columns m n |
In[]:=
{pl[[All,2;;3]],Query[All,2;;3]@pl,Query[All,2;;3]@plraw}
more concise if we pick 1 and 2, not the moons:
In[]:=
{pl[[All,1;;2]],Query[All,1;;2]@pl,Query[All,1;;2]@plraw}
dataset All n 1 n 2 | extract a subset of the columns |
In[]:=
{pl[[All,{1,2}]],Query[All,{1,2}]@pl,Query[All,{1,2}]@plraw}
◼
Like , row and column operations can be combined. Some examples include:
Part
In[]:=
dataset[[n,m]] | take the cell at the n th m th |
In[]:=
{pl[[3,2]],Query[3,2]@pl,Query[3,2]@plraw}
dataset[[n,"colname"]] | extract the value of the named column in the n th |
In[]:=
{pl[[3,"Radius"]],Query[3,"Radius"]@pl,Query[3,"Radius"]@pl}
In[]:=
dataset[["rowname","colname"]] | take the cell at the named row and column |
In[]:=
{pl[["Earth","Radius"]],Query["Earth","Radius"]@pl,Query["Earth","Radius"]@plraw}
◼
The following operations can be used to remove the labels from rows and columns, effectively turning associations into lists:
dataset Values | remove labels from rows |
In[]:=
{pl[[Values]],Query[Values]@pl,Query[Values]@plraw//Short}
dataset All Values | remove labels from columns |
In[]:=
{pl[[All,Values]],Query[All,Values]@pl,Query[All,Values]@plraw//Short}
dataset Values Values | remove labels from rows and columns |
In[]:=
{pl[[Values,Values]],Query[Values,Values]@pl,Query[Values,Values]@plraw//Short}
◼
The query syntax can be thought of as an extension of syntax to allow aggregations and transformations to be applied, as well as taking subsets of data.
dataset[,,…]
op
1
op
2
Part
◼
Some common forms of query include:
dataset[f] | apply f |
round all numbers:
{pl[Round],Query[Round]@pl,Query[Round]@plraw//Short}
dataset All | apply f |
how many elements does every row == planet have? 3! Mass, radius, and asso of moons!
{pl[All,Length],Query[All,Length]@pl,Query[All,Length]@plraw}
dataset All All | apply f |
what length does every cell have? 2! They are Quantities! Except for the moons asso, because these are not quantities, but assos of quantities!
{pl[All,All,Length],Query[All,All,Length]@pl,Query[All,All,Length]@plraw}
so let' s go one level deeper: the lengths of the cells in the moon assos: the mass and radius lengths are now zero, and the lengths of the cells in the moon assos are now 2 -- those are now the quantities of the moons' masses and radii
{pl[All,All,All,Length],Query[All,All,All,Length]@pl,Query[All,All,All,Length]@plraw//Short}
dataset[f,n] | extract the n th f |
this basically counts the number of row: we have 8 planets:
{pl[Length,3],Query[Length,3]@pl,Query[Length,3]@plraw}
sort the planets by radius:
{pl[Sort,2],Query[Sort,2]@pl,Query[Sort,2]@plraw}
sort the planets by mass:
{pl[Sort,1],Query[Sort,1]@pl,Query[Sort,1]@plraw}
dataset[f,"name"] | extract the named column, then apply f |
same as above, but with named column, instead of column index:
{pl[Sort,"Mass"],Query[Sort,"Mass"]@pl,Query[Sort,"Mass"]@plraw}
dataset[n,f] | extract the n th f |
how many elements does Mars have? 3! Mass, radius, asso of moons!
{pl[4,Length],Query[4,Length]@pl,Query[4,Length]@plraw}
dataset["name",f] | extract the named f |
same as above, but with named row, instead of indexed row:
{pl["Mars",Length],Query["Mars",Length]@pl,Query["Mars",Length]@plraw}
dataset[{nf}] | selectively map f n th |
how many elements does Mars have? 3! Mass, radius, and asso of moons:
{pl[{4Length}],Query[{4Length}]@pl,Query[{4Length}]@plraw//Short}
dataset All | selectively map f n th |
for the third column, give me the lengths (aka: number of moons):
{pl[All,{3Length}],Query[All,{3Length}]@pl,Query[All,{3Length}]@plraw}
give me the volumes of the spheres, based on radius (2nd column) -- but only for planets, not their moons:
{t1=pl[All,{2(4/3Pi#^3&)}],t2=Query[All,{2(4/3Pi#^3&)}]@pl,t3=Query[All,{2(4/3Pi#^3&)}]@plraw};{t1[All,KeyMap[Replace["Radius""Volume"]]],t2[All,KeyMap[Replace["Radius""Volume"]]],KeyMap[Replace["Radius""Volume"],#]&/@t3}
◼
Some more specialized forms of query include:
dataset Counts | give counts of different values in the named column |
how many times does each radius occur? only once!
{pl[Counts,"Radius"],Query[Counts,"Radius"]@pl,Query[Counts,"Radius"]@plraw}
Mean radius:
{pl[Mean,"Radius"],Query[Mean,"Radius"]@pl,Query[Mean,"Radius"]@plraw}
Mean mass:
{pl[Mean,"Mass"],Query[Mean,"Mass"]@pl,Query[Mean,"Mass"]@plraw}
dataset CountDistinct | count the number of distinct values in the named column |
how many distinct moon entries do we have? 7! Six planets have moons, and the other 2 have no moons, so <||> occurs twice, and for counting distinct that makes it 7:
{pl[CountDistinct,"Moons"],Query[CountDistinct,"Moons"]@pl,Query[CountDistinct,"Moons"]@plraw}
remember:
Query[All,{3Length}]@pl
dataset MinMax | give minimum and maximum values in the named column |
give min/max of radius:
{pl[MinMax,"Radius"],Query[MinMax,"Radius"]@pl,Query[MinMax,"Radius"]@plraw}
give min/max of mass:
{pl[MinMax,"Mass"],Query[MinMax,"Mass"]@pl,Query[MinMax,"Mass"]@plraw}
dataset Mean | give the mean value of the named column |
mean radius:
{pl[Mean,"Radius"],Query[Mean,"Radius"]@pl,Query[Mean,"Radius"]@plraw}
dataset Total | give the total value of the named column |
total all radii:
{pl[Total,"Radius"],Query[Total,"Radius"]@pl,Query[Total,"Radius"]@plraw}
dataset Select | extract those rows that satisfy condition h |
show me all planets that have a radius less than 4000 km:
{pl[Select[#Radius<Quantity[4000,"Kilometers"]&]],Query[Select[#Radius<Quantity[4000,"Kilometers"]&]]@pl,Query[Select[#Radius<Quantity[4000,"Kilometers"]&]]@plraw}
interactively:
Manipulate[pl[Select[#Radius<Quantity[a,"Kilometers"]&]],{{a,2440},1,100000}]
dataset Select Length | count the number of rows that satisfy condition h |
how many planets have a radius less than 7000 km?
{pl[Select[#Radius<Quantity[7000,"Kilometers"]&]/*Length],Query[Select[#Radius<Quantity[7000,"Kilometers"]&]/*Length]@pl,Query[Select[#Radius<Quantity[7000,"Kilometers"]&]/*Length]@plraw}
interactively:
Manipulate[pl[Select[#Radius<Quantity[a,"Kilometers"]&]/*Length],{{a,2440},1,100000}]
dataset Select Length | count the number of rows that satisfy condition h |
count the number planets that have radius less than 7000 kilometers:
{pl[Select[#Radius<Quantity[7000,"Kilometers"]&]/*Length],Query[Select[#Radius<Quantity[7000,"Kilometers"]&]/*Length]@pl,Query[Select[#Radius<Quantity[7000,"Kilometers"]&]/*Length]@plraw}
dataset Select | select rows, then extract the named column from the result |
show me the masses of the planets that have radius less than 7000 kilometers:
{pl[Select[#Radius<Quantity[7000,"Kilometers"]&],"Mass"],Query[Select[#Radius<Quantity[7000,"Kilometers"]&],"Mass"]@pl,Query[Select[#Radius<Quantity[7000,"Kilometers"]&],"Mass"]@plraw}
dataset Select | select rows, extract the named column, then apply f |
total the masses of all planets that have radius less than 7000 kilometers:
{pl[Select[#Radius<Quantity[7000,"Kilometers"]&]/*Total,"Mass"],Query[Select[#Radius<Quantity[7000,"Kilometers"]&]/*Total,"Mass"]@pl,Query[Select[#Radius<Quantity[7000,"Kilometers"]&]/*Total,"Mass"]@plraw}
dataset TakeLargestBy | give the n |
planets with the largest three radii:
{pl[TakeLargestBy["Radius",3]],Query[TakeLargestBy["Radius",3]]@pl,Query[TakeLargestBy["Radius",3]]@plraw}
planets with the largest three masses:
{pl[TakeLargestBy["Mass",3]],Query[TakeLargestBy["Mass",3]]@pl,Query[TakeLargestBy["Mass",3]]@plraw}
top four masses, top four radii, separate tables:
{Column[{pl[TakeLargest[4],"Mass"],pl[TakeLargest[4],"Radius"]}],Column[{Query[TakeLargest[4],"Mass"]@pl,Query[TakeLargest[4],"Radius"]@pl}],Column[{Query[TakeLargest[4],"Mass"]@plraw,Query[TakeLargest[4],"Radius"]@plraw}]}
back to the portfolio data. In the following I’ll only use Query:
In[]:=
data={{"Symbol","Description","Quantity","Price","Security Type"},{"MMM","3M CO",300,185.22`,"Equity"},{"T","A T & T INC",550,30.7`,"Equity"},{"AFL","AFLAC INC",200,50.49`,"Equity"},{"AMAT","APPLIED MATERIALS",500,43.96`,"Equity"},{"ADP","AUTO DATA PROCESSING",800,160.19`,"Equity"},{"BA","BOEING CO",500,376.46`,"Equity"},{"CAT","CATERPILLAR INC",800,139.06`,"Equity"},{"SCHW","CHARLES SCHWAB CORP",900,46.24`,"Equity"},{"CME","CME GROUP INC CLASS A",100,173.91`,"Equity"},{"CMI","CUMMINS INC",500,169.19`,"Equity"},{"D","DOMINION ENERGY INC",200,76.79`,"Equity"},{"EMR","EMERSON ELECTRIC CO",300,71.1`,"Equity"},{"XOM","EXXON MOBIL CORP",100,77.47`,"Equity"},{"GIS","GENERAL MILLS INC",100,51.18`,"Equity"},{"HP","HELMERICH & PAYNE",400,57.55`,"Equity"},{"HD","HOME DEPOT INC",300,200.56`,"Equity"},{"HON","HONEYWELL INTL INC",600,173.54`,"Equity"},{"INTC","INTEL CORP",500,51.75`,"Equity"},{"IP","INTERNTNL PAPER",700,47.1`,"Equity"},{"JPM","J P MORGAN CHASE & CO",400,116.12`,"Equity"},{"JNJ","JOHNSON & JOHNSON",200,142.01`,"Equity"},{"KLAC","K L A TENCOR CORP",800,128.47`,"Equity"},{"LLY","LILLY ELI & CO",100,116.91`,"Equity"},{"MDT","MEDTRONIC PLC F",200,89.58`,"Equity"},{"MCHP","MICROCHIP TECHNOLOGY",500,100.75`,"Equity"},{"MSFT","MICROSOFT CORP",400,128.9`,"Equity"},{"NKE","NIKE INC CLASS B",400,85.7`,"Equity"},{"OKE","ONEOK INC",690,66.89`,"Equity"},{"PG","PROCTER & GAMBLE",925,106.08`,"Equity"},{"QCOM","QUALCOMM INC",500,89.29`,"Equity"},{"ROP","ROPER TECHNOLOGIES",800,360.19`,"Equity"},{"SWK","STANLEY BLACK & DECK",300,153.08`,"Equity"},{"KO","THE COCA-COLA CO",800,48.72`,"Equity"},{"TRV","TRAVELERS COMPANIES",210,143.35`,"Equity"},{"UNP","UNION PACIFIC CORP",600,179.2`,"Equity"},{"UTX","UNITED TECHNOLOGIES",200,141.63`,"Equity"},{"WMT","WALMART INC",800,102.08`,"Equity"},{"912796VA4","US TREASURY BILL19 U S T BILL DUE 05/07/19",2000,99.99`,"Fixed Income"},{"912796VB2","US TREASURY BILL19 U S T BILL DUE 05/14/19",8000,99.95`,"Fixed Income"},{"912796VC0","US TREASURY BILL19 U S T BILL DUE 05/21/19",6000,99.9`,"Fixed Income"},{"912796VD8","US TREASURY BILL19 U S T BILL DUE 05/28/19",10000,99.86`,"Fixed Income"},{"912796RU5","US TREASURY BILL19 U S T BILL DUE 06/13/19",6000,99.75`,"Fixed Income"},{"912796VJ5","US TREASURY BILL19 U S T BILL DUE 07/02/19",5000,99.63`,"Fixed Income"},{"912796RA9","US TREASURY BILL19 U S T BILL DUE 09/12/19",6000,99.15`,"Fixed Income"},{"912796RF8","US TREASURY BILL19 U S T BILL DUE 10/10/19",6000,98.96`,"Fixed Income"},{"912796RT8","US TREASURY BILL20 U S T BILL DUE 01/02/20",6000,98.43`,"Fixed Income"}};
headers:
In[]:=
data[[1]]
turn into a List of Associations (not: Dataset!):
In[]:=
port=AssociationThread[data[[1]]#]&/@Rest[data];
max and min position values. Remember: position value is quantity times price times type-dependent multiplier:
port//Query[MaximalBy[#["Quantity"]#["Price"]If[#["Security Type"]"Fixed Income",1/100.,1.]&]]port//Query[MinimalBy[#["Quantity"]#["Price"]If[#["Security Type"]"Fixed Income",1/100.,1.]&]]
sort them by position value:
port//Query[SortBy[#["Quantity"]#["Price"]If[#["Security Type"]"Fixed Income",1/100.,1.]&]]
here the Dataset display may be more convenient again:
Dataset@%
in what positions do I have more than 5000, and please sort the output by price in ascending order:
port//Query[Select[#"Quantity">5000&]/*SortBy["Price"]]
we can peek directly at the query that the kernel uses internally. We see a right-composition!
Query[Select[#"Quantity">5000&]/*SortBy["Price"]]//Normal
show me all positions with more than 5000 in them, and display them in column order 3, 4, and last:
port//Query[Select[#"Quantity">5000&],{3,4,-1}]
a Part-style query, the kernel does a right-composition with a Slice function in GeneralUtilities:
Query[Select[#"Quantity">5000&],{3,4,-1}]//Normal
how many total shares do we have?
port//Query[Select[#"Security Type"=="Equity"&],#Quantity&]/*Totalport//Query[Select[#"Security Type""Equity"&]/*Total,#Quantity&]
again, peek at the internal representations of the queries:
Query[Select[#"Security Type"=="Equity"&],#Quantity&]/*Total//NormalQuery[Select[#"Security Type""Equity"&]/*Total,#Quantity&]//Normal
min/max quantities please:
port//Query[MinMax,"Quantity"]
query peek : another case of right-composition and Slice in GeneralUtilities, here with a MinMax:
Query[MinMax,"Quantity"]//Normal
easy to modify: insert an exchange rate. Value in EUR.USD. Here I think Dataset is better to visualize again:
er=FinancialData["EUR/USD"];port//Query[All,<|"Symbol"#["Symbol"],"Quantity"#["Quantity"],"Price"#["Price"],"Exchange Rate"er,"Total"#["Quantity"]#["Price"]/erIf[#["Security Type"]"Fixed Income",1/100.,1.]|>&]//Dataset
pie-chart of position strengths. We only need the symbol column (for the legend) and the position value column (for the relative angles):
pv=port//Query[All,<|"Symbol"#["Symbol"],"Total"#["Quantity"]#["Price"]If[#["Security Type"]"Fixed Income",1/100.,1.]|>&];PieChart[Association[Rule@@@Values@pv],ChartStyle"DarkRainbow",ChartLabelspv[[All,1]],ImageSize800]
Summary: the query mechanism underlying Dataset is Query we can use Query on raw data, without Dataset Query is extremely powerful, you can use function composition in it
Web Scraping of Data
Web Scraping of Data
◼
oftentimes you just need to Import the web location and parse the plain html
◼
since M12 we have WebExecute which lets you access the data directly
◼
access the elements / tags directly and process
◼
run JavaScript code directly from your front-end session
Remember: you still need to understand the structure of the returned data, there is no “standard recipe” for web pages unless they conform to a standard. So your retrieval method and data parser will most likely always depend on the *particular* web page / web service you are using.
Import and Parse: Wiki Page about Unemployment by Country
Import and Parse: Wiki Page about Unemployment by Country
In[]:=
data=Import["http://en.wikipedia.org/wiki/Unemployment_by_country","Data"];
not much to do, just look at the import to understand the data structure, and then extract / parse out what you want. Oftentimes it’s better to use rule replacements or XML, here it’s so simple we can use Part and are done:
In[]:=
Rest@data[[1,1]][[All,{1,2}]]//Short//TableForm
so simple, can do in a one-liner:
In[]:=
Rest@Import["http://en.wikipedia.org/wiki/Unemployment_by_country","Data"][[1,1]][[All,{1,2}]]//TableForm
Straight html / tag Parsing (no JavaScript)
Straight html / tag Parsing (no JavaScript)
here we use the new WebExecute, but still need no JavaScript. We use WebExecute only to get to the tags / “ElementText” of the page.
A METAR is a current weather observation at an airport, a TAF is a forecast (up to 36 hours) at a larger airport. Here we put several US airport identifiers in the URL and extract all the METARS and TAFs from the returned page.
A METAR is a current weather observation at an airport, a TAF is a forecast (up to 36 hours) at a larger airport. Here we put several US airport identifiers in the URL and extract all the METARS and TAFs from the returned page.
In[]:=
session=StartWebSession["Firefox"];
In[]:=
WebExecute[session,{"OpenPage""https://aviationweather.gov/taf/data?ids=KEWR+KMMU+KJFK+KLGA+KDXR+KHVN+KTEB+kabe+KRDG+KLNS+KPNE+KPHL+KALB+KFDK+KBWI+KAVP+KPIT+KDCA+KOXC+KBTV+KORD+KSAN+KSEA+KLAX+KSLC&format=raw&metars=on&submit=Get+TAF+data"}]
In[]:=
code=WebExecute["LocateElements""Tag""code"];
In[]:=
(res=WebExecute["ElementText"code])//TableForm
In[]:=
Head/@%
these strings contain full formatting, e. g. they have \n in them for the line breaks and extra spaces for indentation in the TAFs:
In[]:=
res[[2]]//FullForm
that makes the display somewhat cluttered, I sometimes like the throw a Panel around, that also makes the font a bit smaller:
In[]:=
Panel/@res//TableForm
quite a lot, so how about a Dataset, so we can scroll?
In[]:=
wx=With[{ident=StringTake[#[[1]],4]},wx=If[Length@#2,<|ident<|METARPanel@#[[1]],TAFPanel@#[[2]]|>|>,<|ident<|METARPanel@#[[1]]|>|>(*mostclassDairportshavenowxforecast!*)]]&/@SplitBy[res,StringTake[#,4]&];Dataset@wx
looks like a table that got too long (Dataset formatting is a tricky issue, it has to make educated guesses based on data types and anticipated display outcomes), but we can get a little hacky and use
In[]:=
Dataset`$DatasetTargetRowCount=8
In[]:=
Dataset@wx
In[]:=
DeleteObject[session]
remember: you can always look up details with ?? (shortcut for Information). Such as:
In[]:=
??Dataset`*
JavaScript examples
JavaScript examples
dividend scraper from investors.com
Here we need some simple JavaScript and a While loop to iterate over all the tables. We then append all table results to an accumulator. Then we convert the data into XML and parse out the dividend information with an XML-based pattern matcher in a Cases:
Here we need some simple JavaScript and a While loop to iterate over all the tables. We then append all table results to an accumulator. Then we convert the data into XML and parse out the dividend information with an XML-based pattern matcher in a Cases:
In[]:=
session=StartWebSession["Firefox"];
In[]:=
WebExecute[session,{"OpenPage""https://www.investors.com/data-tables/dividends-jun-14-2019/"}]
In[]:=
str="";i=0;strList={};While[Head[str]=!=Failure,str=WebExecute["JavascriptExecute""return document.getElementsByTagName('table')["<>ToString[i]<>"].outerHTML;"];If[Head[str]===String,AppendTo[strList,str]];i++];
In[]:=
divlist=ImportString[#,"XML"]&/@strList;divtable=Cases[divlist,XMLElement["tr",{},{XMLElement["td",{},{a_}],XMLElement["td",{},{b_}]}]{a,ToExpression@b},∞];divtable//TableForm
stock options on yahoo finance
Similar to the above, we use some simple JavaScript and a While loop to iterate over all tables found with the JavaScript snippet, accumulate the tables, convert to XML and then make separate extractions for the call and put data with an XML-based pattern matcher in Cases.
Similar to the above, we use some simple JavaScript and a While loop to iterate over all tables found with the JavaScript snippet, accumulate the tables, convert to XML and then make separate extractions for the call and put data with an XML-based pattern matcher in Cases.
In[]:=
WebExecute[session,{"OpenPage""https://finance.yahoo.com/quote/MSFT/options"}]
In[]:=
str="";i=0;strList={};While[Head[str]=!=Failure,str=WebExecute["JavascriptExecute""return document.getElementsByTagName('table')["<>ToString[i]<>"].outerHTML;"];If[Head[str]===String,AppendTo[strList,str]];i++];
In[]:=
optionslist=ImportString[#,"XML"]&/@strList;
In[]:=
calltable=Cases[optionslist[[1]],XMLElement["tr",___,{XMLElement["td",___,{XMLElement["a",___,{id_}]}],XMLElement["td",___,{date_}],XMLElement["td",___,{XMLElement["a",___,{strike_}]}],XMLElement["td",___,{last_}],XMLElement["td",___,{bid_}],XMLElement["td",___,{ask_}],XMLElement["td",___,{___}],XMLElement["td",___,{XMLElement["span",___,{___}]}],XMLElement["td",___,{___}],XMLElement["td",___,{oi_}],XMLElement["td",___,{iv_}]}]{id,date,ToExpression@strike,ToExpression@last,ToExpression@bid,ToExpression@ask,ToExpression@StringReplace[oi,","""],ToExpression@StringReplace[iv,"%"""]},∞];calltable//TableForm
In[]:=
puttable=Cases[optionslist[[2]],XMLElement["tr",___,{XMLElement["td",___,{XMLElement["a",___,{id_}]}],XMLElement["td",___,{date_}],XMLElement["td",___,{XMLElement["a",___,{strike_}]}],XMLElement["td",___,{last_}],XMLElement["td",___,{bid_}],XMLElement["td",___,{ask_}],XMLElement["td",___,{___}],XMLElement["td",___,{XMLElement["span",___,{___}]}],XMLElement["td",___,{___}],XMLElement["td",___,{oi_}],XMLElement["td",___,{iv_}]}]{id,date,ToExpression@strike,ToExpression@last,ToExpression@bid,ToExpression@ask,ToExpression@StringReplace[oi,","""],ToExpression@StringReplace[iv,"%"""]},∞];puttable//TableForm
scraping from a webM page
The return of a .jsp page is usually plain html. We use WebExecute and a simple JavaScript snippet, to find the tags to tell us TypeElement and ClickElement tags. We need those to fill the input fields, and then click the “Submit” button. The rest is like above (iterate over tables, accumulate, ...), except we now have html and not XML. That’s not a problem, html is easy to parse, and we can convert the html to XML and parse out from that.
The return of a .jsp page is usually plain html. We use WebExecute and a simple JavaScript snippet, to find the tags to tell us TypeElement and ClickElement tags. We need those to fill the input fields, and then click the “Submit” button. The rest is like above (iterate over tables, accumulate, ...), except we now have html and not XML. That’s not a problem, html is easy to parse, and we can convert the html to XML and parse out from that.
In[]:=
session=StartWebSession["Firefox"];
In[]:=
WebExecute[session,{"OpenPage""http://andreaslauschke.net:49050/webMathematica/LC/consumerfinance/creditcard.jsp"}]
In[]:=
input=WebExecute["LocateElements""Tag""input"];
In[]:=
WebExecute["TypeElement"{input[[1]](*Principal*),"2000"}];WebExecute["TypeElement"{input[[2]](*APRdecimal*),"0.163"}];WebExecute["TypeElement"{input[[3]](*minpaydecimal*),"0.03"}];WebExecute["TypeElement"{input[[4]](*minpayamt*),"15"}];
In[]:=
WebExecute["ClickElement"input[[5]]]
In[]:=
str="";i=0;strList={};While[Head[str]=!=Failure,str=WebExecute["JavascriptExecute""return document.getElementsByTagName('table')["<>ToString[i]<>"].outerHTML;"];If[Head[str]===String,AppendTo[strList,str]];i++];
In[]:=
Length@strList
you can parse it out any way you want. Here I chose CSV for this example (no XML), so I can show you parsing with StringCases. This needs a bit more cleaning up than if you use XML with replacement rules, but you’ll see that the pattern matcher string in StringCases is short and similar to a rule-based pattern matcher when using XML in Cases:
In[]:=
csv=Flatten@ImportString[strList[[4]],"CSV"]
In[]:=
Flatten[StringCases[#,__~~"name="~~a__~~" "~~__~~"value="~~b__->{a,b}]&/@(StringReplace[#,{"\"""",">"""}]&/@Select[csv,StringContainsQ[#,"input"]&]),1]//TableForm
incidentally, as this is code for a specific data segment, we wouldn’t have needed to parse out the row labels, we could have “pre-hardcoded” as we know them ahead of time, and they never change.
however, *usually* XML conversion and then pattern-matching with Cases seems *much* easier. For the comparison table:
In[]:=
xml=ImportString[strList[[5]],"XML"];
In[]:=
Partition[Cases[xml,XMLElement["td",{"align""right"},{a_}]a,Infinity],4]//TableForm
and for the payment schedule:
In[]:=
xml=ImportString[strList[[6]],"XML"];
In[]:=
(schedule=Partition[Cases[xml,XMLElement["td",{"align""right"},{a_}]a,Infinity],3])//TableForm
or we turn it into a Dataset for longer repayment schedules:
In[]:=
Dataset`$DatasetTargetRowCount=20
In[]:=
h=First@schedule;d=Rest@schedule;ds=AssociationThread[h#]&/@d//Dataset
In[]:=
DeleteObject[session]


Cite this as: Andreas Lauschke, "Data Science with Andreas Lauschke (#5)" from the Notebook Archive (2020), https://notebookarchive.org/2020-09-4lmceo8

Download

