Educational Materials

Data Science with Andreas Lauschke (#5)

Andreas Lauschke

Author

Andreas Lauschke

Title

Data Science with Andreas Lauschke (#5)

Description

Dataset, Query, and Web Scraping of Free Data

Andreas Lauschke, June 18 2019

We started with a gentle introduction to Associations and Dataset and will finish up the Dataset today, along with an intro of the Query.

today’s session: finishing up on Dataset introduction of Query free data: web scraping (“traditional”, XML, and with the new WebExecute from M12)next session: outlier detection methods

Most examples of today follow a particular 3-element pattern:{ 1 dataset[[ some query ]], 2 the same display done with the corresponding Query [ some query ]@dataset, 3 the same Query [ some query ]@rawdata}I want you to see how the Query is constructed, it is the same as applied directly to the Dataset, and then the same query on your raw data (to underscore the point: Query doesn’t need Dataset, it can operate on your raw data).1 and 2 look identical, 2 and 3 use the very same query construction: Query [ some query ]We’ll follow all syntax examples of the Details section of the helpbrowser page for Dataset, listed under Part Operations, Dataset Queries, common forms and more specialized forms, and apply them on the planets data.our program for today:

In[]:=

pl=ExampleData[{"Dataset","Planets"}];plraw=Normal@pl;

◼

The following part operations are commonly used to extract rows from tabular datasets:

dataset[["name"]]

extract a named row (if applicable)

In[]:=

{pl[["Mars"]],Query["Mars"]@pl,Query["Mars"]@plraw}

dataset[[{"

name

",…}]]

extract a set of named rows

In[]:=

{pl[[{"Earth","Mars"}]],Query[{"Earth","Mars"}]@pl,Query[{"Earth","Mars"}]@plraw}

dataset[[1]]

extract the first row

In[]:=

{pl[[1]],Query[1]@pl,Query[1]@plraw}

dataset[[n]]

extract the

row

In[]:=

{pl[[4]],Query[4]@pl,Query[4]@plraw}

dataset[[-1]]

extract the last row

In[]:=

{pl[[-1]],Query[-1]@pl,Query[-1]@plraw}

dataset[[m;;n]]

extract rows

through

In[]:=

{pl[[2;;4]],Query[2;;4]@pl,Query[2;;4]@plraw}

dataset[[{

,…}]]

extract a set of numbered rows

In[]:=

{pl[[{1,3,5,6,8}]],Query[{1,3,5,6,8}]@pl,Query[{1,3,5,6,8}]@plraw//Short}

◼

The following part operations are commonly used to extract columns from tabular datasets:

dataset

All

,"name"

extract a named column (if applicable)

In[]:=

{pl[[All,"Mass"]],Query[All,"Mass"]@pl,Query[All,"Mass"]@plraw}

dataset

All

,{"

name

",…}

extract a set of named columns

In[]:=

{pl[[All,{"Mass","Radius"}]],Query[All,{"Mass","Radius"}]@pl,Query[All,{"Mass","Radius"}]@plraw}

dataset

All

,1

extract the first column

In[]:=

{pl[[All,1]],Query[All,1]@pl,Query[All,1]@plraw}

dataset

All

,n

extract the

column

In[]:=

{pl[[All,2]],Query[All,2]@pl,Query[All,2]@plraw}

dataset

All

,-1

extract the last column

In[]:=

{pl[[All,-1]],Query[All,-1]@pl,Query[All,-1]@plraw}

dataset

All

,m;;n

extract columns

through

In[]:=

{pl[[All,2;;3]],Query[All,2;;3]@pl,Query[All,2;;3]@plraw}

more concise if we pick 1 and 2, not the moons:

In[]:=

{pl[[All,1;;2]],Query[All,1;;2]@pl,Query[All,1;;2]@plraw}

dataset

All

,…}

extract a subset of the columns

In[]:=

{pl[[All,{1,2}]],Query[All,{1,2}]@pl,Query[All,{1,2}]@plraw}

◼

Part

, row and column operations can be combined. Some examples include:

In[]:=

dataset[[n,m]]

take the cell at the

row and

column

In[]:=

{pl[[3,2]],Query[3,2]@pl,Query[3,2]@plraw}

dataset[[n,"colname"]]

extract the value of the named column in the

row

In[]:=

{pl[[3,"Radius"]],Query[3,"Radius"]@pl,Query[3,"Radius"]@pl}

In[]:=

dataset[["rowname","colname"]]

take the cell at the named row and column

In[]:=

{pl[["Earth","Radius"]],Query["Earth","Radius"]@pl,Query["Earth","Radius"]@plraw}

◼

The following operations can be used to remove the labels from rows and columns, effectively turning associations into lists:

dataset

Values



remove labels from rows

In[]:=

{pl[[Values]],Query[Values]@pl,Query[Values]@plraw//Short}

dataset

All

Values



remove labels from columns

In[]:=

{pl[[All,Values]],Query[All,Values]@pl,Query[All,Values]@plraw//Short}

dataset

Values



remove labels from rows and columns

In[]:=

{pl[[Values,Values]],Query[Values,Values]@pl,Query[Values,Values]@plraw//Short}

◼

The query syntax

dataset[

,…]

can be thought of as an extension of

Part

syntax to allow aggregations and transformations to be applied, as well as taking subsets of data.

◼

Some common forms of query include:

dataset[f]

apply

to the entire table

round all numbers:

{pl[Round],Query[Round]@pl,Query[Round]@plraw//Short}

dataset

All

,f

apply

to every row in the table

how many elements does every row == planet have? 3! Mass, radius, and asso of moons!

{pl[All,Length],Query[All,Length]@pl,Query[All,Length]@plraw}

dataset

All

,f

apply

to every cell in the table

what length does every cell have? 2! They are Quantities! Except for the moons asso, because these are not quantities, but assos of quantities!

{pl[All,All,Length],Query[All,All,Length]@pl,Query[All,All,Length]@plraw}

so let' s go one level deeper: the lengths of the cells in the moon assos: the mass and radius lengths are now zero, and the lengths of the cells in the moon assos are now 2 -- those are now the quantities of the moons' masses and radii

{pl[All,All,All,Length],Query[All,All,All,Length]@pl,Query[All,All,All,Length]@plraw//Short}

dataset[f,n]

extract the

column, then apply

to it

this basically counts the number of row: we have 8 planets:

{pl[Length,3],Query[Length,3]@pl,Query[Length,3]@plraw}

sort the planets by radius:

{pl[Sort,2],Query[Sort,2]@pl,Query[Sort,2]@plraw}

sort the planets by mass:

{pl[Sort,1],Query[Sort,1]@pl,Query[Sort,1]@plraw}

dataset[f,"name"]

extract the named column, then apply

to it

same as above, but with named column, instead of column index:

{pl[Sort,"Mass"],Query[Sort,"Mass"]@pl,Query[Sort,"Mass"]@plraw}

dataset[n,f]

extract the

row, then apply

to it

how many elements does Mars have? 3! Mass, radius, asso of moons!

{pl[4,Length],Query[4,Length]@pl,Query[4,Length]@plraw}

dataset["name",f]

extract the

named

row, then apply

to it

same as above, but with named row, instead of indexed row:

{pl["Mars",Length],Query["Mars",Length]@pl,Query["Mars",Length]@plraw}

dataset[{nf}]

selectively map

onto the

row

how many elements does Mars have? 3! Mass, radius, and asso of moons:

{pl[{4Length}],Query[{4Length}]@pl,Query[{4Length}]@plraw//Short}

dataset

All

,{nf}

selectively map

onto the

column

for the third column, give me the lengths (aka: number of moons):

{pl[All,{3Length}],Query[All,{3Length}]@pl,Query[All,{3Length}]@plraw}

give me the volumes of the spheres, based on radius (2nd column) -- but only for planets, not their moons:

{t1=pl[All,{2(4/3Pi#^3&)}],t2=Query[All,{2(4/3Pi#^3&)}]@pl,t3=Query[All,{2(4/3Pi#^3&)}]@plraw};{t1[All,KeyMap[Replace["Radius""Volume"]]],t2[All,KeyMap[Replace["Radius""Volume"]]],KeyMap[Replace["Radius""Volume"],#]&/@t3}

◼

Some more specialized forms of query include:

dataset

Counts

,"name"

give counts of different values in the named column

how many times does each radius occur? only once!

{pl[Counts,"Radius"],Query[Counts,"Radius"]@pl,Query[Counts,"Radius"]@plraw}

Mean radius:

{pl[Mean,"Radius"],Query[Mean,"Radius"]@pl,Query[Mean,"Radius"]@plraw}

Mean mass:

{pl[Mean,"Mass"],Query[Mean,"Mass"]@pl,Query[Mean,"Mass"]@plraw}

dataset

CountDistinct

,"name"

count the number of distinct values in the named column

how many distinct moon entries do we have? 7! Six planets have moons, and the other 2 have no moons, so <||> occurs twice, and for counting distinct that makes it 7:

{pl[CountDistinct,"Moons"],Query[CountDistinct,"Moons"]@pl,Query[CountDistinct,"Moons"]@plraw}

remember:

Query[All,{3Length}]@pl

dataset

MinMax

,"name"

give minimum and maximum values in the named column

give min/max of radius:

{pl[MinMax,"Radius"],Query[MinMax,"Radius"]@pl,Query[MinMax,"Radius"]@plraw}

give min/max of mass:

{pl[MinMax,"Mass"],Query[MinMax,"Mass"]@pl,Query[MinMax,"Mass"]@plraw}

dataset

Mean

,"name"

give the mean value of the named column

mean radius:

{pl[Mean,"Radius"],Query[Mean,"Radius"]@pl,Query[Mean,"Radius"]@plraw}

dataset

Total

,"name"

give the total value of the named column

total all radii:

{pl[Total,"Radius"],Query[Total,"Radius"]@pl,Query[Total,"Radius"]@plraw}

dataset

Select

[h]

extract those rows that satisfy condition

show me all planets that have a radius less than 4000 km:

{pl[Select[#Radius<Quantity[4000,"Kilometers"]&]],Query[Select[#Radius<Quantity[4000,"Kilometers"]&]]@pl,Query[Select[#Radius<Quantity[4000,"Kilometers"]&]]@plraw}

interactively:

Manipulate[pl[Select[#Radius<Quantity[a,"Kilometers"]&]],{{a,2440},1,100000}]

dataset

Select

[h]/*

Length



count the number of rows that satisfy condition

how many planets have a radius less than 7000 km?

{pl[Select[#Radius<Quantity[7000,"Kilometers"]&]/*Length],Query[Select[#Radius<Quantity[7000,"Kilometers"]&]/*Length]@pl,Query[Select[#Radius<Quantity[7000,"Kilometers"]&]/*Length]@plraw}

interactively:

Manipulate[pl[Select[#Radius<Quantity[a,"Kilometers"]&]/*Length],{{a,2440},1,100000}]

dataset

Select

[h]/*

Length



count the number of rows that satisfy condition

count the number planets that have radius less than 7000 kilometers:

{pl[Select[#Radius<Quantity[7000,"Kilometers"]&]/*Length],Query[Select[#Radius<Quantity[7000,"Kilometers"]&]/*Length]@pl,Query[Select[#Radius<Quantity[7000,"Kilometers"]&]/*Length]@plraw}

dataset

Select

[h],"name"

select rows, then extract the named column from the result

show me the masses of the planets that have radius less than 7000 kilometers:

{pl[Select[#Radius<Quantity[7000,"Kilometers"]&],"Mass"],Query[Select[#Radius<Quantity[7000,"Kilometers"]&],"Mass"]@pl,Query[Select[#Radius<Quantity[7000,"Kilometers"]&],"Mass"]@plraw}

dataset

Select

[h]/*f,"name"

select rows, extract the named column, then apply

to it

total the masses of all planets that have radius less than 7000 kilometers:

{pl[Select[#Radius<Quantity[7000,"Kilometers"]&]/*Total,"Mass"],Query[Select[#Radius<Quantity[7000,"Kilometers"]&]/*Total,"Mass"]@pl,Query[Select[#Radius<Quantity[7000,"Kilometers"]&]/*Total,"Mass"]@plraw}

dataset

TakeLargestBy

["name",n]

give the

rows for which the named column is largest

planets with the largest three radii:

{pl[TakeLargestBy["Radius",3]],Query[TakeLargestBy["Radius",3]]@pl,Query[TakeLargestBy["Radius",3]]@plraw}

planets with the largest three masses:

{pl[TakeLargestBy["Mass",3]],Query[TakeLargestBy["Mass",3]]@pl,Query[TakeLargestBy["Mass",3]]@plraw}

top four masses, top four radii, separate tables:

{Column[{pl[TakeLargest[4],"Mass"],pl[TakeLargest[4],"Radius"]}],Column[{Query[TakeLargest[4],"Mass"]@pl,Query[TakeLargest[4],"Radius"]@pl}],Column[{Query[TakeLargest[4],"Mass"]@plraw,Query[TakeLargest[4],"Radius"]@plraw}]}

back to the portfolio data. In the following I’ll only use Query:

In[]:=

data={{"Symbol","Description","Quantity","Price","Security Type"},{"MMM","3M CO",300,185.22`,"Equity"},{"T","A T & T INC",550,30.7`,"Equity"},{"AFL","AFLAC INC",200,50.49`,"Equity"},{"AMAT","APPLIED MATERIALS",500,43.96`,"Equity"},{"ADP","AUTO DATA PROCESSING",800,160.19`,"Equity"},{"BA","BOEING CO",500,376.46`,"Equity"},{"CAT","CATERPILLAR INC",800,139.06`,"Equity"},{"SCHW","CHARLES SCHWAB CORP",900,46.24`,"Equity"},{"CME","CME GROUP INC CLASS A",100,173.91`,"Equity"},{"CMI","CUMMINS INC",500,169.19`,"Equity"},{"D","DOMINION ENERGY INC",200,76.79`,"Equity"},{"EMR","EMERSON ELECTRIC CO",300,71.1`,"Equity"},{"XOM","EXXON MOBIL CORP",100,77.47`,"Equity"},{"GIS","GENERAL MILLS INC",100,51.18`,"Equity"},{"HP","HELMERICH & PAYNE",400,57.55`,"Equity"},{"HD","HOME DEPOT INC",300,200.56`,"Equity"},{"HON","HONEYWELL INTL INC",600,173.54`,"Equity"},{"INTC","INTEL CORP",500,51.75`,"Equity"},{"IP","INTERNTNL PAPER",700,47.1`,"Equity"},{"JPM","J P MORGAN CHASE & CO",400,116.12`,"Equity"},{"JNJ","JOHNSON & JOHNSON",200,142.01`,"Equity"},{"KLAC","K L A TENCOR CORP",800,128.47`,"Equity"},{"LLY","LILLY ELI & CO",100,116.91`,"Equity"},{"MDT","MEDTRONIC PLC F",200,89.58`,"Equity"},{"MCHP","MICROCHIP TECHNOLOGY",500,100.75`,"Equity"},{"MSFT","MICROSOFT CORP",400,128.9`,"Equity"},{"NKE","NIKE INC CLASS B",400,85.7`,"Equity"},{"OKE","ONEOK INC",690,66.89`,"Equity"},{"PG","PROCTER & GAMBLE",925,106.08`,"Equity"},{"QCOM","QUALCOMM INC",500,89.29`,"Equity"},{"ROP","ROPER TECHNOLOGIES",800,360.19`,"Equity"},{"SWK","STANLEY BLACK & DECK",300,153.08`,"Equity"},{"KO","THE COCA-COLA CO",800,48.72`,"Equity"},{"TRV","TRAVELERS COMPANIES",210,143.35`,"Equity"},{"UNP","UNION PACIFIC CORP",600,179.2`,"Equity"},{"UTX","UNITED TECHNOLOGIES",200,141.63`,"Equity"},{"WMT","WALMART INC",800,102.08`,"Equity"},{"912796VA4","US TREASURY BILL19 U S T BILL DUE 05/07/19",2000,99.99`,"Fixed Income"},{"912796VB2","US TREASURY BILL19 U S T BILL DUE 05/14/19",8000,99.95`,"Fixed Income"},{"912796VC0","US TREASURY BILL19 U S T BILL DUE 05/21/19",6000,99.9`,"Fixed Income"},{"912796VD8","US TREASURY BILL19 U S T BILL DUE 05/28/19",10000,99.86`,"Fixed Income"},{"912796RU5","US TREASURY BILL19 U S T BILL DUE 06/13/19",6000,99.75`,"Fixed Income"},{"912796VJ5","US TREASURY BILL19 U S T BILL DUE 07/02/19",5000,99.63`,"Fixed Income"},{"912796RA9","US TREASURY BILL19 U S T BILL DUE 09/12/19",6000,99.15`,"Fixed Income"},{"912796RF8","US TREASURY BILL19 U S T BILL DUE 10/10/19",6000,98.96`,"Fixed Income"},{"912796RT8","US TREASURY BILL20 U S T BILL DUE 01/02/20",6000,98.43`,"Fixed Income"}};

headers:

In[]:=

data[[1]]

turn into a List of Associations (not: Dataset!):

In[]:=

port=AssociationThread[data[[1]]#]&/@Rest[data];

max and min position values. Remember: position value is quantity times price times type-dependent multiplier:

port//Query[MaximalBy[#["Quantity"]#["Price"]If[#["Security Type"]"Fixed Income",1/100.,1.]&]]port//Query[MinimalBy[#["Quantity"]#["Price"]If[#["Security Type"]"Fixed Income",1/100.,1.]&]]

sort them by position value:

port//Query[SortBy[#["Quantity"]#["Price"]If[#["Security Type"]"Fixed Income",1/100.,1.]&]]

here the Dataset display may be more convenient again:

Dataset@%

in what positions do I have more than 5000, and please sort the output by price in ascending order:

port//Query[Select[#"Quantity">5000&]/*SortBy["Price"]]

we can peek directly at the query that the kernel uses internally. We see a right-composition!

Query[Select[#"Quantity">5000&]/*SortBy["Price"]]//Normal

show me all positions with more than 5000 in them, and display them in column order 3, 4, and last:

port//Query[Select[#"Quantity">5000&],{3,4,-1}]

a Part-style query, the kernel does a right-composition with a Slice function in GeneralUtilities:

Query[Select[#"Quantity">5000&],{3,4,-1}]//Normal

how many total shares do we have?

port//Query[Select[#"Security Type"=="Equity"&],#Quantity&]/*Totalport//Query[Select[#"Security Type""Equity"&]/*Total,#Quantity&]

again, peek at the internal representations of the queries:

Query[Select[#"Security Type"=="Equity"&],#Quantity&]/*Total//NormalQuery[Select[#"Security Type""Equity"&]/*Total,#Quantity&]//Normal

min/max quantities please:

port//Query[MinMax,"Quantity"]

query peek : another case of right-composition and Slice in GeneralUtilities, here with a MinMax:

Query[MinMax,"Quantity"]//Normal

easy to modify: insert an exchange rate. Value in EUR.USD. Here I think Dataset is better to visualize again:

er=FinancialData["EUR/USD"];port//Query[All,<|"Symbol"#["Symbol"],"Quantity"#["Quantity"],"Price"#["Price"],"Exchange Rate"er,"Total"#["Quantity"]#["Price"]/erIf[#["Security Type"]"Fixed Income",1/100.,1.]|>&]//Dataset

pie-chart of position strengths. We only need the symbol column (for the legend) and the position value column (for the relative angles):

pv=port//Query[All,<|"Symbol"#["Symbol"],"Total"#["Quantity"]#["Price"]If[#["Security Type"]"Fixed Income",1/100.,1.]|>&];PieChart[Association[Rule@@@Values@pv],ChartStyle"DarkRainbow",ChartLabelspv[[All,1]],ImageSize800]

Summary: the query mechanism underlying Dataset is Query we can use Query on raw data, without Dataset Query is extremely powerful, you can use function composition in it

Web Scraping of Data

◼

oftentimes you just need to Import the web location and parse the plain html

◼

since M12 we have WebExecute which lets you access the data directly

◼

access the elements / tags directly and process

◼

run JavaScript code directly from your front-end session

Remember: you still need to understand the structure of the returned data, there is no “standard recipe” for web pages unless they conform to a standard. So your retrieval method and data parser will most likely always depend on the *particular* web page / web service you are using.

Import and Parse: Wiki Page about Unemployment by Country

In[]:=

data=Import["http://en.wikipedia.org/wiki/Unemployment_by_country","Data"];

not much to do, just look at the import to understand the data structure, and then extract / parse out what you want. Oftentimes it’s better to use rule replacements or XML, here it’s so simple we can use Part and are done:

In[]:=

Rest@data[[1,1]][[All,{1,2}]]//Short//TableForm

so simple, can do in a one-liner:

In[]:=

Rest@Import["http://en.wikipedia.org/wiki/Unemployment_by_country","Data"][[1,1]][[All,{1,2}]]//TableForm

Straight html / tag Parsing (no JavaScript)

here we use the new WebExecute, but still need no JavaScript. We use WebExecute only to get to the tags / “ElementText” of the page.
A METAR is a current weather observation at an airport, a TAF is a forecast (up to 36 hours) at a larger airport. Here we put several US airport identifiers in the URL and extract all the METARS and TAFs from the returned page.

In[]:=

session=StartWebSession["Firefox"];

In[]:=

WebExecute[session,{"OpenPage""https://aviationweather.gov/taf/data?ids=KEWR+KMMU+KJFK+KLGA+KDXR+KHVN+KTEB+kabe+KRDG+KLNS+KPNE+KPHL+KALB+KFDK+KBWI+KAVP+KPIT+KDCA+KOXC+KBTV+KORD+KSAN+KSEA+KLAX+KSLC&format=raw&metars=on&submit=Get+TAF+data"}]

In[]:=

code=WebExecute["LocateElements""Tag""code"];

In[]:=

(res=WebExecute["ElementText"code])//TableForm

In[]:=

Head/@%

these strings contain full formatting, e. g. they have \n in them for the line breaks and extra spaces for indentation in the TAFs:

In[]:=

res[[2]]//FullForm

that makes the display somewhat cluttered, I sometimes like the throw a Panel around, that also makes the font a bit smaller:

In[]:=

Panel/@res//TableForm

quite a lot, so how about a Dataset, so we can scroll?

In[]:=

wx=With[{ident=StringTake[#[[1]],4]},wx=If[Length@#2,<|ident<|METARPanel@#[[1]],TAFPanel@#[[2]]|>|>,<|ident<|METARPanel@#[[1]]|>|>(*mostclassDairportshavenowxforecast!*)]]&/@SplitBy[res,StringTake[#,4]&];Dataset@wx

looks like a table that got too long (Dataset formatting is a tricky issue, it has to make educated guesses based on data types and anticipated display outcomes), but we can get a little hacky and use

In[]:=

Dataset`$DatasetTargetRowCount=8

In[]:=

Dataset@wx

In[]:=

DeleteObject[session]

remember: you can always look up details with ?? (shortcut for Information). Such as:

In[]:=

??Dataset`*

JavaScript examples

dividend scraper from investors.com
Here we need some simple JavaScript and a While loop to iterate over all the tables. We then append all table results to an accumulator. Then we convert the data into XML and parse out the dividend information with an XML-based pattern matcher in a Cases:

In[]:=

session=StartWebSession["Firefox"];

In[]:=

WebExecute[session,{"OpenPage""https://www.investors.com/data-tables/dividends-jun-14-2019/"}]

In[]:=

str="";i=0;strList={};While[Head[str]=!=Failure,str=WebExecute["JavascriptExecute""return document.getElementsByTagName('table')["<>ToString[i]<>"].outerHTML;"];If[Head[str]===String,AppendTo[strList,str]];i++];

In[]:=

divlist=ImportString[#,"XML"]&/@strList;divtable=Cases[divlist,XMLElement["tr",{},{XMLElement["td",{},{a_}],XMLElement["td",{},{b_}]}]{a,ToExpression@b},∞];divtable//TableForm

stock options on yahoo finance
Similar to the above, we use some simple JavaScript and a While loop to iterate over all tables found with the JavaScript snippet, accumulate the tables, convert to XML and then make separate extractions for the call and put data with an XML-based pattern matcher in Cases.

In[]:=

WebExecute[session,{"OpenPage""https://finance.yahoo.com/quote/MSFT/options"}]

In[]:=

optionslist=ImportString[#,"XML"]&/@strList;

In[]:=

calltable=Cases[optionslist[[1]],XMLElement["tr",___,{XMLElement["td",___,{XMLElement["a",___,{id_}]}],XMLElement["td",___,{date_}],XMLElement["td",___,{XMLElement["a",___,{strike_}]}],XMLElement["td",___,{last_}],XMLElement["td",___,{bid_}],XMLElement["td",___,{ask_}],XMLElement["td",___,{___}],XMLElement["td",___,{XMLElement["span",___,{___}]}],XMLElement["td",___,{___}],XMLElement["td",___,{oi_}],XMLElement["td",___,{iv_}]}]{id,date,ToExpression@strike,ToExpression@last,ToExpression@bid,ToExpression@ask,ToExpression@StringReplace[oi,","""],ToExpression@StringReplace[iv,"%"""]},∞];calltable//TableForm

In[]:=

puttable=Cases[optionslist[[2]],XMLElement["tr",___,{XMLElement["td",___,{XMLElement["a",___,{id_}]}],XMLElement["td",___,{date_}],XMLElement["td",___,{XMLElement["a",___,{strike_}]}],XMLElement["td",___,{last_}],XMLElement["td",___,{bid_}],XMLElement["td",___,{ask_}],XMLElement["td",___,{___}],XMLElement["td",___,{XMLElement["span",___,{___}]}],XMLElement["td",___,{___}],XMLElement["td",___,{oi_}],XMLElement["td",___,{iv_}]}]{id,date,ToExpression@strike,ToExpression@last,ToExpression@bid,ToExpression@ask,ToExpression@StringReplace[oi,","""],ToExpression@StringReplace[iv,"%"""]},∞];puttable//TableForm

scraping from a webM page
The return of a .jsp page is usually plain html. We use WebExecute and a simple JavaScript snippet, to find the tags to tell us TypeElement and ClickElement tags. We need those to fill the input fields, and then click the “Submit” button. The rest is like above (iterate over tables, accumulate, ...), except we now have html and not XML. That’s not a problem, html is easy to parse, and we can convert the html to XML and parse out from that.

http://andreaslauschke.net:49050/webMathematica/LC/consumerfinance/creditcard.jsp

In[]:=

session=StartWebSession["Firefox"];

In[]:=

WebExecute[session,{"OpenPage""http://andreaslauschke.net:49050/webMathematica/LC/consumerfinance/creditcard.jsp"}]

In[]:=

input=WebExecute["LocateElements""Tag""input"];

In[]:=

WebExecute["TypeElement"{input[[1]](*Principal*),"2000"}];WebExecute["TypeElement"{input[[2]](*APRdecimal*),"0.163"}];WebExecute["TypeElement"{input[[3]](*minpaydecimal*),"0.03"}];WebExecute["TypeElement"{input[[4]](*minpayamt*),"15"}];

In[]:=

WebExecute["ClickElement"input[[5]]]

In[]:=

Length@strList

you can parse it out any way you want. Here I chose CSV for this example (no XML), so I can show you parsing with StringCases. This needs a bit more cleaning up than if you use XML with replacement rules, but you’ll see that the pattern matcher string in StringCases is short and similar to a rule-based pattern matcher when using XML in Cases:

In[]:=

csv=Flatten@ImportString[strList[[4]],"CSV"]

In[]:=

Flatten[StringCases[#,__~~"name="~~a__~~" "~~__~~"value="~~b__->{a,b}]&/@(StringReplace[#,{"\"""",">"""}]&/@Select[csv,StringContainsQ[#,"input"]&]),1]//TableForm

incidentally, as this is code for a specific data segment, we wouldn’t have needed to parse out the row labels, we could have “pre-hardcoded” as we know them ahead of time, and they never change.

however, *usually* XML conversion and then pattern-matching with Cases seems *much* easier. For the comparison table:

In[]:=

xml=ImportString[strList[[5]],"XML"];

In[]:=

Partition[Cases[xml,XMLElement["td",{"align""right"},{a_}]a,Infinity],4]//TableForm

and for the payment schedule:

In[]:=

xml=ImportString[strList[[6]],"XML"];

In[]:=

(schedule=Partition[Cases[xml,XMLElement["td",{"align""right"},{a_}]a,Infinity],3])//TableForm

or we turn it into a Dataset for longer repayment schedules:

In[]:=

Dataset`$DatasetTargetRowCount=20

In[]:=

h=First@schedule;d=Rest@schedule;ds=AssociationThread[h#]&/@d//Dataset

In[]:=

DeleteObject[session]

Cite this as: Andreas Lauschke, "Data Science with Andreas Lauschke (#5)" from the Notebook Archive (2020), https://notebookarchive.org/2020-09-4lmceo8