What Fraction of Possible Letter Sequences Are Words?
Author
Stephen Wolfram
Title
What Fraction of Possible Letter Sequences Are Words?
Description
Get the lengths of words in the word list for English:
Category
Educational Materials
URL
http://www.notebookarchive.org/2019-08-98zxh94/
DOI
https://notebookarchive.org/2019-08-98zxh94
Date Added
2019-08-20
Date Last Modified
2019-08-20
File Size
49.77 kilobytes
Rights
Redistribution rights reserved
Get the lengths of words in the word list for English:
In[]:=
StringLength[WordList[]]
Out[]=
{1,3,8,5,6,5,7,7,9,11,5,9,5,7,9,5,9,8,4,6,5,5,10,11,12,8,10,7,9,6,9,9,8,5,4,8,10,4,7,8,5,10,9,8,5,7,7,6,9,8,10,6,7,6,7,8,8,6,4,6,8,4,8,10,8,11,10,6,5, ⋯39989⋯
Make a histogram of their lengths:
In[]:=
Histogram[%]
Out[]=
[Histogram of word lengths in the English word list]
There is only one word of length 1 in the word list:
In[]:=
Select[WordList[],StringLength[#]==1&]
Out[]=
{a}
Here are the words of length 2:
In[]:=
Select[WordList[],StringLength[#]==2&]
Out[]=
{ad,ah,an,as,at,ax,be,by,dB,do,eh,em,en,er,ex,fa,go,ha,he,hi,hm,ho,id,if,in,it,kc,kW,la,lo,ma,me,mi,ms,mu,my,no,nu,of,oh,on,ow,ox,pa,pH,pi,re,sh,so,ta,ti,to,um,up,us,we,xi,ya,ye}
(Some of them are slightly questionable words...)
Here is a count of the number of length-2 words in the word list:
In[]:=
Length[Select[WordList[],StringLength[#]==2&]]
Out[]=
59
Here is a list of all possible pairs of letters:
In[]:=
StringJoin/@Tuples[Alphabet[],2]//Short
Out[]//Short=
{aa,ab,ac,ad,ae,af,ag,ah,«660»,zs,zt,zu,zv,zw,zx,zy,zz}
Find the number of possible pairs:
In[]:=
Length[%]
Out[]=
676
One can compute the number of possible pairs like this:
In[]:=
26^2
Out[]=
676
Note that the number of actual 2-letter words (59) is much smaller than the number of possible 2-letter combinations (676).
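As a rough check, the fraction of letter pairs that are words can be computed directly. With the counts quoted above it is 59/676, or about 0.087; the exact value may differ with other versions of the word list:
In[]:=
N[Length[Select[WordList[],StringLength[#]==2&]]/26^2]
The word list is case-sensitive: dB, kW and pH in the list above contain capital letters, so they are not among the 676 lowercase pairs built from Alphabet[]. The following sketch, which is not part of the original notebook, counts only the 2-letter words that do appear among those pairs:
In[]:=
Length[Intersection[WordList[],StringJoin/@Tuples[Alphabet[],2]]]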
The list of possible 3-letter combinations:
In[]:=
StringJoin/@Tuples[Alphabet[],3]
Out[]=
{aaa,aab,aac,aad,aae,aaf,aag,aah,aai,aaj,aak,aal,aam,aan,aao,aap,aaq,aar,aas,aat,aau,aav,aaw,aax, ⋯17528⋯
A count of them:
In[]:=
26^3
Out[]=
17576
Here is the number of actual 3-letter words in the word list:
In[]:=
Length[Select[WordList[],StringLength[#]==3&]]
Out[]=
594
The count of possible 4-letter sequences:
In[]:=
26^4
Out[]=
456976
The actual number of 4-letter words:
In[]:=
Length[Select[WordList[],StringLength[#]==4&]]
Out[]=
1951
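The counts so far can be gathered into one table of length, number of words, number of possible sequences, and their ratio. This is a minimal sketch, not part of the original notebook; the name counts is introduced here only for illustration. With the values quoted above, the ratios are roughly 0.038 (1/26), 0.087 (59/676), 0.034 (594/17576) and 0.0043 (1951/456976):
In[]:=
counts=Counts[StringLength[WordList[]]];
In[]:=
Table[{n,counts[n],26^n,N[counts[n]/26^n]},{n,1,4}]//TableForm
As in the rest of the notebook, the word counts here include the few entries with capital letters, so the ratios are slight overestimates of the fraction of lowercase sequences that are words.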
Things to do
Find the Shannon entropy by fitting the number of words of length n to the form 26^(h n). Compare the results for different human languages.
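One possible way to start on the fit, offered as a sketch under stated assumptions rather than as the intended method: take the base-26 logarithm of the number of words of each length, so that a count of the form 26^(h n) becomes the straight line h n, and fit a line through the origin. The name counts and the choice of lengths 2 through 4 are assumptions made here for illustration:
In[]:=
counts=Counts[StringLength[WordList[]]];
In[]:=
Fit[Table[{n,N[Log[26,counts[n]]]},{n,2,4}],{n},n]
The coefficient of n in the result is the estimate of h. The fit only makes sense for short words, since the number of words in the list eventually falls with length while 26^(h n) keeps growing. For other languages, WordList accepts a Language option (for example WordList[Language->"Spanish"]), though coverage varies by language and the base 26 would need adjusting for other alphabets.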
Cite this as: Stephen Wolfram, "What Fraction of Possible Letter Sequences Are Words?" from the Notebook Archive (2019), https://notebookarchive.org/2019-08-98zxh94