Singularize: Depluralize English Words
Author: Tuseeta Banerjee
Description: A function to depluralize English words
Category: Essays, Posts & Presentations
Keywords: Singularize, Pluralize, WordData, Regular Expression, String Matching, Grammar Rules
URL: http://www.notebookarchive.org/2021-07-62jrbtr/
DOI: https://notebookarchive.org/2021-07-62jrbtr
Date Added: 2021-07-13
Date Last Modified: 2021-07-13
File Size: 280.62 kilobytes
Rights: Redistribution rights reserved



WOLFRAM SUMMER SCHOOL 2021
Singularize: Depluralize English Words
Tuseeta Banerjee
Wolfram Research
This project aims to create a Singularize function that takes a “plural” word and makes it singular. Grammar rules for singularizing or pluralizing words are not “strict”: they have many exceptions, overlapping rules, and exceptions to exceptions, so coming up with heuristics for pluralizing or singularizing is a challenging task. We tackle this problem with several approaches. The first uses the rich data that already exists in Wolfram Language, specifically the InflectedForms property of WordData, to build a Singularize function. We test this method exhaustively on all the words, identify potential problems, and discuss some “mild” limitations of this approach.
Next, instead of reinventing the wheel, we use Wolfram Language’s already implemented Pluralize function: a Singularize function can be obtained by simply reversing the rules of that function, which we show with a simple proof of concept.
Perhaps the most challenging question is whether the 2437 “exceptions” in the Pluralize function are truly exceptions. A deeper dive shows that there are patterns in the exceptions that can be used to create new rules or modify existing ones. The steps are: identify the pattern; create a new rule or modify an existing one; find the exceptions to that rule; and add those exceptions back to the dictionary of exceptions. However, there’s a catch: an exception identified for a newly created or modified rule might not be a true exception either, i.e. it might actually follow another existing rule or a new rule you create (now or in the future). That’s why this section of the project is titled “Inception of Exceptions”.
After thorough testing of each rule, we finally create the new rules with about 50% fewer exceptions (1189 instead of 2437) and see them work!
Section 1: Data Method
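A minimal sketch of the data approach: it inverts a table of inflected forms so that each plural points back to its base word. The association toyInflections is hypothetical toy data standing in for what WordData would supply, and the names toyPluralToSingular and toySingularize are illustrative only, not the notebook's implementation.

```wolfram
(* Hypothetical toy data: base word -> inflected (plural) forms. Not real WordData output. *)
toyInflections = <|
   "mouse" -> {"mice"},
   "leaf" -> {"leaves"},
   "cactus" -> {"cacti", "cactuses"}|>;

(* Invert the table: every inflected form becomes a key pointing to its base word *)
toyPluralToSingular = Association @ Flatten @ KeyValueMap[
    Function[{base, forms}, Map[# -> base &, forms]],
    toyInflections];

(* Look the lowercased word up; return the input unchanged when it is unknown *)
toySingularize[word_String] := Lookup[toyPluralToSingular, ToLowerCase[word], word]

toySingularize /@ {"Mice", "cactuses", "sheep"}
(* {"mouse", "cactus", "sheep"} *)
```

A lookup of this kind can only cover words that appear in the underlying data, which is one reason a rule-based fallback is still useful.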
Section 2: Rule Method (Reverse Pluralize Rules)
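As a small illustration of what reversing a rule means, the sketch below pairs one pluralization pattern with its reversed counterpart. The plural rule toyPluralRule is an assumption written for this example (it is not taken from the Pluralize source); the reversed pattern is the same “ies” rule that appears in the rule list of Section 4.

```wolfram
(* Assumed pluralization rule: consonant (or "qu") + "y" becomes "ies" *)
toyPluralRule = RegularExpression["([^aeiouy]|qu)y$"] -> "$1ies";

(* Reversed rule: consonant (or "qu") + "ies" goes back to "y" *)
toySingularRule = RegularExpression["([^aeiouy]|qu)ies$"] -> "$1y";

toyPlural = StringReplace["baby", toyPluralRule]
(* "babies" *)

StringReplace[toyPlural, toySingularRule]
(* "baby" *)
```

Not every rule reverses this cleanly: several singular endings can share one plural ending, which is where the exception dictionary comes in.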
Section 3: Organize the existing IrregularRules
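The iteration described in the abstract (identify a pattern, write a rule, then find that rule's own exceptions) can be sketched on toy data. The association toyKnown and the candidate rule below are hypothetical examples, not entries taken from the real exception dictionary.

```wolfram
(* Hypothetical plural -> singular pairs *)
toyKnown = <|"wolves" -> "wolf", "calves" -> "calf",
   "valves" -> "valve", "knives" -> "knife"|>;

(* Candidate rule suggested by the pattern in the data *)
candidate = RegularExpression["([lr])ves$"] -> "$1f";
ruleOutput[w_String] := StringReplace[w, candidate]

(* Words the rule reproduces correctly: these can leave the dictionary *)
covered = Select[Keys[toyKnown], ruleOutput[#] === toyKnown[#] &]
(* {"wolves", "calves"} *)

(* Words the rule changes but gets wrong: exceptions to the new rule *)
newExceptions = Select[Keys[toyKnown],
  ruleOutput[#] =!= # && ruleOutput[#] =!= toyKnown[#] &]
(* {"valves"} *)

(* Words the rule does not touch: they stay where they were *)
untouched = Select[Keys[toyKnown], ruleOutput[#] === # &]
(* {"knives"} *)
```

Here “valves” is the kind of entry that must be added back to the dictionary, and it may itself turn out to follow some other rule later.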
3.1 “ves” -> “f”|“fe”
3.2 “ies” -> “y”
3.3 “(x|ch|ss|sh)es” -> “(x|ch|ss|sh)”
3.4 “ae” -> “a”
3.5 “teeth” -> “tooth”
3.6 “men” -> “man”
3.7 “oes” -> “o”
3.8 “i” -> “us”
3.9 “a” -> “um”
3.10 “m(l)ice” -> “m(l)ouse”
3.11 “ices” -> “ix”
3.12 “ices” -> “ex”
3.13 “ses” -> “sis”
Deal with the first iterations
Section 4: Create the new rules, finally!
Throw the exceptions back:
In[]:=
$NewIrregularsingularRules=Join[undeciphered,vesexcept,iesexcept,esexcept,aeexcept,menexcept,newoesexcept,us2except,umexcept,miceexcept,ixexcept,exexcept,sesexcept];
Length@$NewIrregularsingularRules
Out[]=
1189
In[]:=
$newSingularizationRules={
"(ax|test)es$"->"$1is",
"(abac|abacul|acanth|acar|acin|acule|alumn|alve|alveol|anaptich|anim|annul|articul|asc|ascococc|bacchi|bacill|cact|calam|canalicul|canth|carol|carp|cathet|chorag|chorepiscop|choriamb|cipp|cirr|clype|cocc|coloss|cumulonimb|denari|diplococc|disc|discobol|domin|dracuncul|echin|electrophor|embol|emerit|encrin|fascicul|fatu|faun|flocc|floccul|foc|fuc|fung|funicul|ginglym|gladi|gladiol|glomerul|gyr|hectocotyl|hippopotam|homuncul|humer|hydrocaul|hyporadi|hypotars|iamb|ichthyosaur|incub|liencul|literat|litu|loc|locul|malle|malleol|medi|metatars|micrococc|minim|mod|modi|modiol|modul|naupli|nautil|nid|nimb|nucell|nucle|nucleol|nunci|obel|obol|ocell|octop|ocul|omnib|ovococc|pal|palp|palul|papyr|paragnath|paxill|pedicul|pedipalp|pessul|phacell|phosphor|pic|pile|plesiosaur|plute|polyp|polypor|prothall|pull|pulvill|pulvinul|pylor|radi|ranuncul|retrovir|rhomb|rhonch|sacc|saccul|sarcophag|scamill|scirrh|scyph|sor|splencul|stimul|strateg|streptococc|succub|sulc|syllab|tal|tars|tarsometatars|termin|thalam|thall|thesaur|thromb|thyrs|toph|tor|triungul|troch|trochil|trochisc|troil|tumul|tylar|unc|uncin|urceol|uter|ventricul|vill)i$"->"$1us",
"(carbonar|cicisbe|concett|glissand|graffit|librett|mafios|paparazz|pizzicat|prosciutt|scud|sold|temp|timpan|tympan|vetturin|virtuos|zingar)i$"->"$1o",
"(alias|status)es$"->"$1",
"(bu)ses$"->"$1s",
"(o)es$"->"$1",
"([ti])a$"->"$1um",
"(addend|agend|alabastr|ambulacr|animalcul|antr|arcan|athenae|caec|candelabr|cec|centr|claustr|coagul|coll|conid|continu|curricul|diverticul|duoden|ecthore|elytr|entostern|epistern|flagell|fraenul|fren|glabell|haustell|hypocleid|hypoge|hypoptil|hypostern|ile|incunabul|infundibul|intervall|involucell|involucr|jejun|jug|jugul|labar|labell|labr|latibul|lustr|mal|memorand|menstru|mutand|notand|oblong|observand|ommate|opercul|optim|opuscul|oscul|ossicul|ov|ovul|paradactyl|parapter|pericul|perine|periostrac|petal|phyl|pistillid|plectr|plethr|podarthr|propagul|propylae|pseudov|pudend|receptacul|residu|reticul|retinacul|retine|rostell|rostrul|sacell|sacr|sag|scutell|septul|sequestr|ser|sigill|spectr|spicul|stichid|stragul|succedane|tentacul|terg|triquetr|tritov|tubercul|vall|vascul|vel|vestibul|vexill|vibracul|vincul|xiphistern|zygantr)a$"->"$1um",
"$ses"->"sis", (* as written, "$ses" cannot match; "ses$" may have been intended *)
"(teeth)$"->"tooth",
"(men)$"->"man",
"(?:([^f])fe|([lr])f)ves$"->"$1$2",
"(hive)s$"->"$1",
"([^aeiouy]|qu)ies$"->"$1y",
"(x|ch|ss|sh)es$"->"$1",
"(a)e$"->"$1",
"(appendi|aviatri|cervi|cicatri|dominatri|executri|forni|generatri|heli|matri|radi|sali|spadi|testatri|vari)ces$"->"$1x",
"(ap|arusp|cim|cod|cort|ib|ind|mur|poll|pontif|proscol|vert|vort)ices$"->"$1ex",
"([m|l])ice$"->"$1ouse",
"^(ox)en"->"$1",
"^(oxen)$"->"$1",
"(quiz)zes$"->"$1"};
Test and make it work!
In[]:=
makesingularizationRule[rule_->replacement_]:=StringReplace[#,RegularExpression[rule]->replacement]&
newsingularRules=Map[makesingularizationRule,$newSingularizationRules];
In[]:=
capitalFormQ[x_]:=And[UpperCaseQ[StringTake[x,1]],LowerCaseQ[StringDrop[x,1]]]
toCapitalForm[x_]:=StringJoin[ToUpperCase[StringTake[x,1]],StringDrop[x,1]]
(* True if WordData lists the word, as given or capitalized, as a noun *)
nounQ[str_]:=Module[{iPart,part,iPart1,part1},
  iPart=WordData[str,"PartsOfSpeech"];
  iPart1=WordData[toCapitalForm[str],"PartsOfSpeech"];
  part=If[SameQ[Head[iPart],WordData],{},iPart];
  part1=If[SameQ[Head[iPart1],WordData],{},iPart1];
  MemberQ[part,"Noun"]||MemberQ[part1,"Noun"]]
In[]:=
irregularsingularForm[str_]:=$NewIrregularsingularRules[str]
◼ Lowercases the input to avoid any confusion from casing
◼ Checks whether the string is a noun; only then does it singularize
◼ Checks whether the word is in the IrregularRules dictionary; if it is, it does the replacement
◼ If the word is not in the IrregularRules dictionary, it goes through the actual rules sequentially and does the first replacement that applies
◼ Finally, if the word is still not singularized and ends in “s”, it removes the last “s” (this accounts for a lot of words ending in -s, but needs further testing for corner cases; it should also remove many IrregularRules entries that do the same thing)
In[]:=
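iSingularize[str_]:=Module[{singular=str,strl,str2}, (* non-nouns are returned unchanged *)
  str2=ToLowerCase@str;
  If[nounQ[str2],
    singular=irregularsingularForm[str2]; (* exception dictionary first *)
    If[MissingQ[singular],
      strl=str2;
      (* try the rules in order; stop at the first one that changes the word *)
      Catch[Table[singular=replacement[strl];If[strl=!=singular,Throw[Null]],{replacement,newsingularRules}]];
      (* fallback: no rule fired, so drop a trailing "s" *)
      If[singular===str2&&StringTake[singular,-1]=="s",singular=StringDrop[singular,-1]]]];
  singular];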
In[]:=
iSingularize/@{"women","children","Chateaux","Mice","Loaves","buffaloes","axes","shelves","statuses","mice","Lives","babies","living"}
Out[]=
{woman,children,chateau,mouse,loaf,buffalo,axis,shelf,status,mouse,life,baby,living}
Concluding remarks
While the first two sections reach a definitive end point, the third is an iterative process of matching existing patterns and rules, and will at best remain a work in progress for the foreseeable future. However, the third approach has a major benefit over the existing practice of simply adding words to the ever-growing list of exceptions (currently at 2437). As new rules are identified and created from the existing exception patterns, the probability that WL users keep finding niche cases should keep decreasing, which could potentially lead to a more robust system.
Currently, in just this phase, we have reduced the exceptions from 2437 to 1189.
Additionally, we have added previously nonexistent exceptions by carefully testing each of the rules and exception patterns and matching them against the built-in dictionary.
Keywords
◼ Singularize
◼ Pluralize
◼ Word Data
◼ Regular Expression
◼ String Matching
◼ Grammar
Acknowledgment
I would like to thank Dr. Wolfram for the idea of the function and the developers of Pluralize for the initial ideas behind it. I would also like to thank my students for being so cooperative and providing an amicable environment in which I could do my own mini-explorations. Last, but definitely not least, I would like to thank my co-worker Jofre Espigule Pons for some very serious motivation.
References
◼ Paper: https://users.monash.edu/~damian/papers/HTML/Plurals.html
◼ Grammarly: https://www.grammarly.com/blog/plural-nouns/


Cite this as: Tuseeta Banerjee, "Singularize: Depluralize English Words" from the Notebook Archive (2021), https://notebookarchive.org/2021-07-62jrbtr


