Data Science with Andreas Lauschke (#11)
Author
Andreas Lauschke
Title
Data Science with Andreas Lauschke (#11)
Description
Managing and Analyzing Large Data
Category
Educational Materials
Keywords
URL
http://www.notebookarchive.org/2020-09-4lmsiah/
DOI
https://notebookarchive.org/2020-09-4lmsiah
Date Added
2020-09-10
Date Last Modified
2020-09-10
File Size
82.98 kilobytes
Supplements
Rights
Redistribution rights reserved



Handling Large Data
Andreas Lauschke, Dec 19 2019
today’s session: analysing large data
next sessions 2 - 3: parallelism
some introductory comments to make your life easier when working with large data:
◼
you want the representation of the data you are *actively* working with to be as parsimonious as possible. Include everything that is essential, and exclude everything that is not needed.
◼
try to get a good grasp of what data you will need for your analysis. The earlier you understand your data needs, the better.
◼
don’t need it --> drop it. Think of data as a “load”: you don’t want to carry loads that give you no benefit.
◼
inclusion of every data element needs a justification. If you can’t justify the inclusion of the data in your analysis, drop it!
◼
remember: RAM is expensive memory. Most desktop computers have 32 GB RAM, most high-end computers have 128 - 256 GB RAM. You can’t fit 1 TB of data in there. (a quick way to check the in-memory footprint of a representation is ByteCount; see the sketch after this list)
◼
if possible, try to integrate time-consuming tasks in your daily work routine
◼
if it takes an hour to process, submit just before going to lunch
◼
if it takes several hours to process, submit just before going home in the evening
◼
remember: data processing time does not need your brain, so it can run unsupervised. Use your non-working time for such unsupervised tasks.
◼
you don’t *have to* submit command-line tasks from the f/e; you can run them directly on the command line, outside of the f/e session. But if you make the f/e your primary work environment, it is very practical to read the textual results directly into the kernel (ReadList is my preference). Know your command-line tools! Some produce a lot of output that you don’t want in the kernel or in an output cell in the f/e (for example diagnostic tools for pre-processing), and some produce the computed/parsed/filtered output (solutions) that you *do* want in the kernel! Read it in directly, otherwise it is tedious to *get* it into the kernel. Always think ahead and decide whether or not you want the command-line output in the session.
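a minimal sketch of that distinction (not part of the original session; it assumes a Unix-like shell and uses a generic ls command as a stand-in): Run executes a command and returns only its exit code, so the output stays outside the kernel, whereas ReadList with a “!” pipe reads the command’s output directly into the kernel, one string per line:
Run["ls -la ~/Downloads > /dev/null"] (* output is discarded; only the exit code comes back into the session *)
ReadList["!ls -la ~/Downloads",String] (* each line of output becomes a string in the kernel *)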
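and, referring back to the RAM point above: a quick sketch (again not from the original session; the sample array is purely illustrative) of checking the in-memory footprint of a candidate data representation with ByteCount before committing to it:
sample=RandomReal[1,{10^6,5}]; (* hypothetical sample: one million rows, five numeric columns *)
ByteCount[sample] (* bytes the kernel uses to store this representation *)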
let’s look at the Amazon customer review data, 75 GB.
First, what’s in the inventory?
In[]:=
(inventory=ReadList["!aws2 s3 ls s3://amazon-reviews-pds/tsv/","String"])//TableForm
how much data are we going to d/l? (this is before decompression)
In[]:=
ToExpression[StringSplit[inventory][[All,3]]]//Total
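that total is in bytes; a small sketch (not part of the original session; totalBytes is just an illustrative name) of converting it to a more readable unit with Quantity and UnitConvert:
totalBytes=Total[ToExpression[StringSplit[inventory][[All,3]]]];
UnitConvert[Quantity[totalBytes,"Bytes"],"Gigabytes"]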
Let’s keep only the filenames, as only those have to be submitted for the d/l query:
In[]:=
(files=Select[inventory,StringContainsQ[#,"amazon_reviews_us"]&]//StringSplit[#]&//#[[All,-1]]&)//TableForm
get’em:
(this will take a while, but we can submit all of them at once; the actual speed now depends on your internet connection speed and, generally speaking, the server speed (although the latter is not an issue in this particular case, as AWS S3 is extremely fast). Even though this will d/l the files in parallel, it’s a good time to prepare dinner, as the d/l of 35 GB of data takes about 20 minutes).
In[]:=
Run["aws2 s3 cp s3://amazon-reviews-pds/tsv/"<>#<>" ~/Downloads/reviews-raw"]&/@files
now we have to decompress them, as they are .gz files. Again, combine it with a productive break, do something else for 20 minutes, as this can run unsupervised. No need for you to babysit file decompression:
In[]:=
Run["gunzip ~/Downloads/reviews-raw/*"]
how large is the data now after decompression?
In[]:=
Read["!du -sbh ~/Downloads/reviews-raw/",String]//StringTake[#,3]&
here are the decompressed .tsv files (but display only the file names):
In[]:=
(allfiles=ReadList["!ls -la ~/Downloads/reviews-raw/amazon_reviews_us_*","String"]//StringSplit[#,"/"]&//#[[All,-1]]&)//TableForm
equivalent, using a built-in function (again, only keeping the filenames):
In[]:=
FileNames["~/Downloads/reviews-raw/amazon_reviews_us_*"]//StringSplit[#,"/"]&//#[[All,-1]]&//TableForm
only the categories from the file names (and pick only US, as the international files have a different format):
In[]:=
categories=StringCases[allfiles,"amazon_reviews_us_"~~a__~~"_v1_"~~__:>a]//Flatten
we see that some occur more than once, Amazon has v1_00, v1_01, v1_02 for some, they split larger files into smaller files:
In[]:=
Counts@categories
Select[%,#>1&]
so we’ll keep the “file counter” with the category, because some of the operations we’ll do later may be parallelizable (search for something, totalize columns, etc.). Thus, we’ll read in the categories again, but this time with the version number / file counter
In[]:=
categories=StringCases[allfiles,"amazon_reviews_us_"~~a__~~".tsv":>a]//Flatten
Length@%
what are the headers?
In[]:=
Read["!head -1 ~/Downloads/reviews-raw/amazon_reviews_us_Camera_v1_00.tsv",String]
To save space, we don’t want all columns. For example, we don’t want any column that has natural-language text in it: that would be a lot of data we’d have to load into memory. Once we look at AI functions in the kernel in this series, we could also analyse the review headline or review text for sentiment, for example; for now, we don’t want any natural-language text. Also, the 1st column, marketplace, can be skipped, because we want to drop the non-US marketplaces anyway. product_parent (5th column) is of no interest to us, and neither is the category column (7th), because that information is already in the filename.
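a small sketch (not from the original session; headerNames and keptColumns are just illustrative names) that pairs the original 1-based column numbers with the header names read above, which makes the cut -f selection below easier to follow:
headerNames=StringSplit[Read["!head -1 ~/Downloads/reviews-raw/amazon_reviews_us_Camera_v1_00.tsv",String],"\t"];
keptColumns={2,3,4,8,9,10,11,12,15}; (* the columns passed to cut below *)
Transpose[{keptColumns,headerNames[[keptColumns]]}]//TableForm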
short excursion: you might think that submitting cut in parallel (one per category) would speed things up significantly, because each cut runs in its own process. It will not speed things up a lot, because the bottleneck is not CPU resources (cores) but i/o. Use tools like vmstat 1 30, mpstat 2, iostat, etc. to analyse your computer’s i/o performance. The following gives only a small speed increase:
sequential:
In[]:=
Do[Run["cut -f2,3,4,8,9,10,11,12,15 ~/Downloads/reviews-raw/amazon_reviews_us_"<>categories[[i]]<>".tsv > ~/Downloads/reviews-selected/"<>categories[[i]]<>".tsv"],{i,Length@categories}];//AbsoluteTiming
Out[]=
{486.868,Null}
parallel:
In[]:=
ParallelDo[Run["cut -f2,3,4,8,9,10,11,12,15 ~/Downloads/reviews-raw/amazon_reviews_us_"<>categories[[i]]<>".tsv > ~/Downloads/reviews-selected/"<>categories[[i]]<>".tsv"],{i,Length@categories}];//AbsoluteTiming
Out[]=
{471.814,Null}
Bottleneck here is i/o, not CPU power (the computer has 12 physical cores, i.e. 24 available threads with hyperthreading -- it didn’t help much).
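a minimal sketch (not from the original session; it assumes vmstat is installed) of reading a few seconds of vmstat output straight into the kernel, to check whether a job like the one above is i/o-bound rather than CPU-bound:
ReadList["!vmstat 1 5",String]//TableForm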
how much data is it now, after column selection?
In[]:=
Read["!du -sbh ~/Downloads/reviews-selected/",String]//StringTake[#,3]&
quite a reduction! from 75 GB down to 7.8 GB!
and *now* we are ready to import a data file for analysis. Let’s pick a category; somewhat at random, I pick cameras.
What are the headers now?
In[]:=
Read["!head -1 ~/Downloads/reviews-selected/Camera_v1_00.tsv",String]
In[]:=
data=Import["~/Downloads/reviews-selected/Camera_v1_00.tsv"];//AbsoluteTiming (* and look how fast this was by using pre-filtered columns on 1.8 million items in Camera *)
headers:
In[]:=
headers=First@data
c=1;{c++,#}&/@%//TableForm
MapIndexed[{#2,#1}&,headers]//TableForm
Transpose@{Range@Length@headers,headers}//TableForm
il@headers
don’t want the headers for the data section:
In[]:=
d=Rest@data;
how many are there?
In[]:=
Length@d
what do they look like? what are the data types?
In[]:=
Head/@First@RandomSample[d,1]
RandomSample[d,10]//TableForm
RandomSample[d,10]//TableView
a histo of the star ratings please (column 4):
In[]:=
Histogram[d[[All,4]],PlotLabel->"Star Rating in Camera",ImageSize->500]
what are min/max, median, mean, and standard deviation for the number of helpful votes and the number of total votes?
In[]:=
d[[All,5]]//Query[{MinMax,Median,Mean/*N,StandardDeviation/*N}]
d[[All,6]]//Query[{MinMax,Median,Mean/*N,StandardDeviation/*N}]
a median of 0 indicates that a lot of reviews received no votes at all. Let’s see how many received votes and how many didn’t, what percentage that is, and histo the number of total votes:
In[]:=
CountsBy[t=d[[All,6]],#>0&]
%[True]/Length@d[[All,6]]//N//PercentForm
Histogram[d[[All,6]],{1},PlotLabel->"Number of Reviews receiving Votes in Camera ("<>ToString@Total@t<>" total)",ImageSize->500]
a date plot of the number of daily reviews over time:
In[]:=
DateListPlot[CountsBy[d,Last],PlotLabel->"Number of daily Reviews in Camera",PlotRange->All,ImageSize->800]//AbsoluteTiming
distribution of the number of votes given, and reminder on Query to produce data of different types in one line:
In[]:=
(t=CountsBy[d,#[[3]]&])//Values//Query[{MinMax,Median,Mean/*N,StandardDeviation/*N}]
Histogram[t,{1},PlotLabel->"how many times were the indicated number of reviews per product given in Camera?",ImageSize->800]
In[]:=
CountsBy[d,#[[3]]&]//Values//CountsBy[#>1&]
In[]:=
(t=CountsBy[d,First])//Values//Query[{MinMax,Median,Mean/*N,StandardDeviation/*N}]
Histogram[t,{1},PlotLabel->"how many times were the indicated number of reviews per customer given in Camera?",ImageSize->800]
In[]:=
CountsBy[d,First]//Values//CountsBy[#>1&]
how many reviews in Camera came from vine customers, and what percentage is that?
In[]:=
Counts@d[[All,7]]
%["Y"]/Length@d[[All,7]]//N//PercentForm
how many reviews in Camera were for verified purchases, and what percentage is that?
In[]:=
Counts@d[[All,8]]
%["Y"]/Length@d[[All,8]]//N//PercentForm
let’s look at some “helpful” votes numbers of the verified purchases:
In[]:=
Off[Part::partw];verified=Select[d,#[[8]]==="Y"&];unverified=Complement[d,verified];
of all verified purchases with more than 200 (300) helpful votes, who are the top ten?
In[]:=
If[#[[5]]>200,{#[[1]],#[[5]],#[[6]],N[#[[5]]/#[[6]]]}]&/@verified//DeleteCases[Null]//SortBy[#[[4]]&]//Reverse//Take[#,10]&//TableForm
If[#[[5]]>300,{#[[1]],#[[5]],#[[6]],N[#[[5]]/#[[6]]]}]&/@verified//DeleteCases[Null]//SortBy[#[[4]]&]//Reverse//Take[#,10]&//TableForm
what is the largest number of votes cast in Camera, and who is it, and what product? (separate for verified / unverified purchases)
In[]:=
verified[[All,{1,3,4,5,6}]]//MaximalBy[#,Last]&
unverified[[All,{1,3,4,5,6}]]//MaximalBy[#,Last]&
large numbers in the helpful_votes and total_votes columns, but he isn’t in the lists above, because his *ratio* is only
In[]:=
5132/5287//N
for the following, let’s pick customer 50894528. He wrote a review that got 927 votes, 922 of them voted positive. He seems to know what he’s talking about, at least for that one product.
let’s get all of his reviews, across all categories (this goes through 7.7 GB of data in 45 files):
In[]:=
allreviewsfor50894528=ReadList["!grep 50894528 ~/Downloads/reviews-selected/*",String];//AbsoluteTiming (* look how fast, given that we now have only selected columns *)
we have to do some reformatting again: a) extract the category from the filename in each grep hit, and b) turn the numbers in the strings into actual numbers.
In[]:=
allreviewsfor50894528={First@StringCases[#[[1]],__~~"/Downloads/reviews-selected/"~~a__~~".tsv"~~__:>a],#[[2]],#[[3]],ToExpression@#[[4]],ToExpression@#[[5]],ToExpression@#[[6]],#[[7]],#[[8]],#[[9]]}&/@StringSplit[allreviewsfor50894528,"\t"]
Grid has a plethora of formatting options; I format many tables like this or similarly:
In[]:=
Text@Grid[Prepend[allreviewsfor50894528,Prepend[Rest@headers,"category"]],Background->{None,{Lighter[Yellow,.9],{White,Lighter[Blend[{Blue,Green}],.8]}}},Alignment->{{Left,Left,Left,".",".",".",Right,Right,Right}},Frame->All]
what were the products he bought?
In[]:=
allreviewsfor50894528[[All,3]]
how many reviews did he leave?
In[]:=
Length@allreviewsfor50894528[[All,2]]
histo of the star reviews of the products he bought:
In[]:=
Histogram[allreviewsfor50894528[[All,4]]]
in what categories did he buy?
In[]:=
c=Counts@allreviewsfor50894528[[All,1]]
Row[{{Normal@c/.(a_->b_):>{a,b}//Text@Grid[#,Background->{None,{{White,Lighter[Blend[{Blue,Green}],.8]}}},Alignment->{{Left,"."}},Frame->All]&,Normal@c/.(a_->b_):>{a,b}//Text@Grid[#,Background->{None,{{White,Cyan}}},Alignment->{{Left,"."}},Frame->All]&,Normal@c/.(a_->b_):>{a,b}//Text@Grid[#,Background->{None,{{Yellow,Cyan}}},Alignment->{{Left,"."}},Frame->All]&,Normal@c/.(a_->b_):>{a,b}//Text@Grid[#,Background->{None,{{White,LightGray}}},Alignment->{{Left,"."}},Frame->All]&}}]
PieChart[Callout@#&/@c,ChartStyle->"TemperatureMap",ChartLegends->Keys@c,ImageSize->500]
BarChart[c,ChartStyle->"TemperatureMap",ChartLegends->Keys@c,ImageSize->500]
how helpful were all his other reviews?
In[]:=
rev=allreviewsfor50894528//#[[All,{5,6}]]&
BarChart[t=First@#/Last@#&/@DeleteCases[rev,{_,0}],ImageSize->500]
how many of his reviews were at least 50% considered helpful?
In[]:=
CountsBy[t,#>1/2&]
%[True]/Length@rev//N//PercentForm
how many of his purchases were verified purchases?
In[]:=
Counts@allreviewsfor50894528[[All,8]]
%["Y"]/Length@allreviewsfor50894528[[All,8]]//N//PercentForm
what was his review activity over time?
In[]:=
DateHistogram[allreviewsfor50894528[[All,-1]],PlotLabel->"Review Activity over Time for Customer 50894528",ImageSize->500]


Cite this as: Andreas Lauschke, "Data Science with Andreas Lauschke (#11)" from the Notebook Archive (2020), https://notebookarchive.org/2020-09-4lmsiah


