Automatic Metrical Scansion of Latin Poetry in Dactylic Hexameter

Laney Moy

Mentor: Christian Pasquel

Description and Method

Many significant works of Latin poetry are written in dactylic hexameter, meaning that each line is composed of some combination of six metrical feet, each of which is either a spondee (two long syllables) or a dactyl (one long syllable followed by two short syllables). The length of each syllable depends on the word itself, its position in the line, and the surrounding words and metrical feet. Scansion is the process of identifying the pattern of metrical feet in a line of Latin poetry. Using machine learning, I automatically scanned lines of Latin poetry in dactylic hexameter and produced color-coded text with notation of the syllable lengths.

After finding and parsing data, I experimented with four different input/output combinations when developing my neural network and adjusted the layer sizes and layer choices for each. I ended up using a sequence-to-sequence network whose input is a line of text and whose output is, for each character, a list of three probabilities: that the character is a long vowel; a short vowel; or a consonant, punctuation mark, or ignored vowel. From there, I converted the neural network output into a color-coded line with metrical markings.

Code

Note: Download “aenscan.txt” (training data) on GitHub before running the code.

Parsing training data

Parsing a single scanned line of poetry

First, I removed metrical foot breaks and adjusted the spacing.

The function uses StringReplace to remove foot breaks and then StringSplit and StringRiffle to remove extra spacing between words.

removeFootMarks[line_]:=StringRiffle[StringSplit[StringReplace[line,{"|"->"","|| "->"","‖ "->""}]]]
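For readers who do not use the Wolfram Language, the same step can be sketched in Python. This is a paraphrase, not the notebook's code: the name `remove_foot_marks` and the sample line are made up for illustration, and the sketch simply deletes every foot-break symbol before normalizing whitespace, which may differ from the original in edge cases.

```python
def remove_foot_marks(line):
    """Drop foot-break symbols, then collapse whitespace runs to single spaces."""
    for mark in ("‖", "|"):
        line = line.replace(mark, "")
    # split() with no argument splits on any whitespace run and drops empty
    # pieces, playing the role of the StringSplit/StringRiffle pair.
    return " ".join(line.split())


# Hypothetical marked-up fragment; the real file layout may differ.
print(remove_foot_marks("Ārmă vĭ|rūmquĕ că|nō"))  # Ārmă vĭrūmquĕ cănō
```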

I removed diacritics and then replaced “Y” with lowercase “y.” In the data file “Ŷ” indicates a long “y,” so removing its diacritic leaves a capital “Y”; no words in my training set began with a capital “y,” so lowercasing every “Y” is safe.

I removed diacritics and used StringReplace to replace uppercase “Y” with lowercase “y.”

removeDiacritics[line_]:=StringReplace[RemoveDiacritics[line],{"Y"->"y"}]
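A rough Python equivalent, assuming that Unicode decomposition is an acceptable stand-in for RemoveDiacritics (the function name `remove_diacritics` is hypothetical):

```python
import unicodedata


def remove_diacritics(line):
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", line)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # "Ŷ" (long y) has now become "Y"; fold it to lowercase as described above.
    return stripped.replace("Y", "y")


print(remove_diacritics("Ārmă vĭrūmquĕ cănō"))  # Arma virumque cano
print(remove_diacritics("Ŷ ŷ"))                 # y y
```

Note that the capital “A” survives: only “Y” is lowercased at this stage.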

I replaced the scanned line with the plain text line.

I removed foot marks and removed diacritics from the scanned line.

plainLine[line_]:=removeDiacritics[removeFootMarks[line]]

I made two lists containing the characters used for long vowels and the characters used for short vowels so that I could parse the data.

longs includes vowels with long marks above them in uppercase and lowercase and the letter “Ŷ,” which is used to indicate a long “y” in the data file. shorts includes vowels with short marks above them in uppercase and lowercase and the letter “ŷ,” which is used to indicate a short “y” in the data file.

longs={"ā","ē","ī","ō","ū"};longs=Flatten[{longs,ToUpperCase[longs],"Ŷ"}];shorts={"ă","ĕ","ĭ","ŏ","ŭ"};shorts=Flatten[{shorts,ToUpperCase[shorts],"ŷ"}];

I found the “length” of a character with diacritics if applicable. The integer 1 corresponds to a long vowel; 2 to a short vowel; and 3 to a consonant, punctuation mark, or ignored vowel.

If the current character is in longs, return 1. Else, if the current character is in shorts, return 2. Else, return 3.

charLength[ch_]:=If[MemberQ[longs,ch],1,If[MemberQ[shorts,ch],2,3]]

I calculated the character length for each character in a line.

Find the charLength for each character in the line.

allCharLengths[line_]:=charLength[#]&/@Characters[line]
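The labeling scheme can be sketched in Python as follows. The names are hypothetical, and the NFC normalization is an added safeguard (so that a vowel typed as letter plus combining mark still matches), not something the notebook does.

```python
import unicodedata

LONGS = set(unicodedata.normalize("NFC", "āēīōūĀĒĪŌŪŶ"))
SHORTS = set(unicodedata.normalize("NFC", "ăĕĭŏŭĂĔĬŎŬŷ"))


def char_length(ch):
    """1 = long vowel, 2 = short vowel, 3 = everything else."""
    if ch in LONGS:
        return 1
    if ch in SHORTS:
        return 2
    return 3


def all_char_lengths(line):
    return [char_length(ch) for ch in unicodedata.normalize("NFC", line)]


print(all_char_lengths("cănō,"))  # [3, 2, 3, 1, 3]
```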

Parsing a file of scanned lines of poetry

I retrieved all lines from a file and removed line numbers and blank strings.

I imported the lines from a file, used StringReplace to remove all numbers from each line, and dropped each line that was left as an empty string or a single space.

getLines[file_]:=(lineData=Import[file,"Lines"];lineData=Table[StringReplace[x,NumberString->""],{x,lineData}];lineData=Table[If[x==""||x==" ",Nothing,x],{x,lineData}])
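A minimal Python sketch of the same cleanup, working on a string rather than a file so that it is self-contained. The function name is hypothetical, and the pattern `\d+` is a simplification: NumberString also matches signs and decimal points.

```python
import re


def get_lines(text):
    # Strip every run of digits from each line (line numbers in the data file).
    cleaned = (re.sub(r"\d+", "", raw) for raw in text.splitlines())
    # Drop lines that are now empty or a lone space.
    return [line for line in cleaned if line not in ("", " ")]


sample = "1 Arma\n\n \n2 cano"
print(get_lines(sample))  # [' Arma', ' cano']
```

The leading spaces that remain are harmless because the spacing is normalized later, when foot marks are removed.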

From a list of scanned lines of poetry, I made a list of rules going from the plain text line to the sequence of integers representing the length of each character in the line.

For each line in lines, I made a rule from the plain line to the character lengths of the line without metrical foot markings.

inputOutput[lines_]:=Table[plainLine[x]->allCharLengths[removeFootMarks[x]],{x,lines}]

I retrieved the lines from the training data file. For training data, I used the scanned lines of The Aeneid available in the AP Latin resources on www.hands-up-education.org. On my computer, filePath is the path of the “aenscan.txt” file I’ve uploaded to GitHub.

I called getLines on the training data file.

lines=getLines[filePath];

From the scanned lines, I made a list of rules for my neural network training set.

I called inputOutput on the lines from the training data file.

trainingSet=inputOutput[lines];

Training the neural network

Of the 844 lines I found, I took 20 lines for testing and used the rest for training.

I used TakeDrop and RandomSample to split trainingSet into testData (20 lines) and trainingData (824 lines).

{testData,trainingData}=TakeDrop[RandomSample[trainingSet],20];
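The shuffle-then-split idea looks like this in Python. The helper name and the fixed seed are assumptions added for reproducibility; the notebook's RandomSample call is unseeded.

```python
import random


def take_drop_shuffled(items, n, seed=None):
    """Shuffle a copy of items, then return (first n, remainder)."""
    rng = random.Random(seed)
    shuffled = rng.sample(items, len(items))  # shuffled copy; input untouched
    return shuffled[:n], shuffled[n:]


pairs = list(range(844))  # stand-in for the 844 input/output rules
test_data, training_data = take_drop_shuffled(pairs, 20, seed=0)
print(len(test_data), len(training_data))  # 20 824
```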

I made a sequence-to-sequence neural network for my data.

I used NetChain to create a network from a list of layers. EmbeddingLayer[12] maps each character in the input to a vector of 12 numbers. NetBidirectionalOperator[LongShortTermMemoryLayer[32]] creates a recurrent layer that reads the input in both directions, so the prediction for a character can depend on the text before and after it. NetMapOperator[LinearLayer[3]] produces three scores for each character, and SoftmaxLayer[] normalizes those scores into the three probabilities. The input net encoder accepts all the listed characters and ignores differences between upper and lower case.

net=NetChain[{EmbeddingLayer[12],NetBidirectionalOperator[LongShortTermMemoryLayer[32]],NetMapOperator[LinearLayer[3]],SoftmaxLayer[]},"Input"->NetEncoder[{"Characters",{{"!","(",")",".",",",";","?",":","”","“","-"}->1," ",CharacterRange["a","z"],_},"IgnoreCase"->True}]]
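To make the encoder's role concrete, here is a Python sketch of a character-to-index table with the same shape: all listed punctuation marks share one index, the space and the letters a to z get their own, and anything else falls into a catch-all. The exact index numbers are an assumption for illustration and are not taken from the Wolfram documentation.

```python
import string

PUNCTUATION = set('!().,;?:”“-')


def build_encoder():
    table = {ch: 1 for ch in PUNCTUATION}      # every listed mark shares index 1
    table[" "] = 2
    for offset, ch in enumerate(string.ascii_lowercase):
        table[ch] = 3 + offset                  # a -> 3, ..., z -> 28
    other = 3 + len(string.ascii_lowercase)     # catch-all index (29)
    # Lowercasing per character keeps exactly one index per input character.
    return lambda text: [table.get(ch.lower(), other) for ch in text]


encode = build_encoder()
print(encode("Arma, 7?"))  # [3, 20, 15, 3, 1, 2, 29, 1]
```

Collapsing punctuation into a single index keeps the vocabulary small, which seems reasonable given that punctuation never carries a vowel length.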

I trained my neural network on trainingData.

I used NetTrain to train net on trainingData with testData as the validation set. I saved the network as trainedNet.

result=NetTrain[net,trainingData,All,LossFunction->CrossEntropyLossLayer["Index"],ValidationSet->testData,MaxTrainingRounds->100];
trainedNet=result["TrainedNet"]
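The "Index" form of cross-entropy compares each predicted probability vector with the integer label of the correct class. A small Python sketch of that calculation follows; averaging over the characters of a line is an assumption about how the per-character losses are combined, and the function name is hypothetical.

```python
import math


def cross_entropy_index(probs, targets):
    """Mean negative log-probability of the correct class (targets are 1-based)."""
    losses = [-math.log(p[t - 1]) for p, t in zip(probs, targets)]
    return sum(losses) / len(losses)


# Two characters: the first labeled long (1), the second "other" (3).
probs = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]
print(cross_entropy_index(probs, [1, 3]))
```

The loss is small when the network puts high probability on the correct label and grows without bound as that probability approaches zero.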

The neural network outputs a list of probabilities that each character will be a long vowel; short vowel; or consonant, punctuation mark, or ignored vowel.

Out[]=

{{0.911013,0.00387983,0.0851076},{0.00286743,0.000125398,0.997007},{0.0000929661,7.50044×10^-6,0.9999},{0.996867,0.000253586,0.00287991},{0.000925767,7.28794×10^-6,0.999067},{0.0000307552,6.03498×10^-6,0.999963},{0.976602,0.0101038,0.0132937},{0.00062027,0.0000198048,0.99936},{0.998598,0.000032841,0.00136875},{0.000189847,3.10921×10^-7,0.99981},{0.00039262,6.66146×10^-6,0.999601},{0.40642,0.0547649,0.538815},{0.980407,0.0129127,0.00668045},{0.0000893308,0.0000336337,0.999877},{0.000114328,0.000429966,0.999456},{0.592774,0.405725,0.00150085},{0.0000355084,0.000400376,0.999564},{0.980523,0.0176164,0.00186094},{0.000457323,0.0000210991,0.999522},{0.0000765484,0.000134851,0.999789},{3.79155×10^-6,0.000315587,0.999681},{1.91275×10^-6,0.00413318,0.995865},{0.000253485,0.938962,0.0607847},{0.00393455,0.84484,0.151226},{0.315217,0.0264762,0.658306},{0.851321,0.0277576,0.120922},{0.0000299706,2.98281×10^-6,0.999967},{0.0000686256,0.0000681379,0.999863},{0.0391653,0.0929969,0.867838},{0.97785,0.0144166,0.00773378},{0.0000334678,0.0000141414,0.999952},{4.99697×10^-6,0.000116391,0.999879},{4.69657×10^-6,0.000907054,0.999088},{0.0745563,0.922621,0.00282246},{1.50023×10^-6,0.0000111714,0.999987},{0.0320256,0.871196,0.0967786},{0.0000108925,0.0000640128,0.999925},{0.000061851,0.0000282191,0.99991},{0.98976,0.00823098,0.00200885},{0.000446167,0.0000445491,0.999509},{0.000475882,0.0000176752,0.999506},{0.967382,0.00674791,0.0258704},{0.00454222,0.000294338,0.995163},{0.992125,0.00297951,0.0048956},{0.0031128,0.000107191,0.99678}}

Formatting output

I converted the probabilities into the most likely list of integers.

For each list of probabilities for a character, I found the highest probability and selected the index at which it occurred. I returned a list of the appropriate integers corresponding to each character.

calcOutput[probs_]:=Table[If[Max[x]==x[[1]],1,If[Max[x]==x[[2]],2,3]],{x,probs}]
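In Python the same selection is a per-row argmax. This sketch uses a hypothetical name, `calc_output`, and three rows copied from the network output shown earlier.

```python
def calc_output(probs):
    # index() returns the first position of the maximum, so a tie goes to the
    # lower class number, just as the nested If does.
    return [row.index(max(row)) + 1 for row in probs]


rows = [
    [0.911013, 0.00387983, 0.0851076],
    [0.00286743, 0.000125398, 0.997007],
    [0.000253485, 0.938962, 0.0607847],
]
print(calc_output(rows))  # [1, 3, 2]
```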

calcOutput returns integers from the probabilities.

In[]:=

calcOutput



Out[]=

{1,3,3,1,3,3,1,3,1,3,3,3,1,3,3,1,3,1,3,3,3,3,2,2,3,1,3,3,3,1,3,3,3,2,3,2,3,3,1,3,3,1,3,1,3}

I added overscripts for the vowel lengths of each character.

markings[[1]] is a bar, the mark for a long syllable. markings[[2]] is a cup, the mark for a short syllable. overscripts creates a list of the formatted characters--either the normal String if the associated integer is 3 or an Overscript object if the associated integer is a 1 or 2.

markings={Style["-",Bold],Style["u",Smaller]};overscripts[line_,probs_]:=Table[If[probs[[x]]==3,StringPart[line,x],Overscript[StringPart[line,x],markings[[probs[[x]]]]]],{x,Range[Length[probs]]}]
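Outside a notebook there is no Overscript, but a comparable plain-text result can be had with Unicode combining marks. The following Python sketch is a substitute technique, not the notebook's method: it attaches a macron to long vowels and a breve to short ones, and it does no coloring. The name `mark_line` is hypothetical.

```python
import unicodedata

# Combining macron for long (1), combining breve for short (2).
MARKS = {1: "\u0304", 2: "\u0306"}


def mark_line(line, lengths):
    marked = "".join(ch + MARKS.get(n, "") for ch, n in zip(line, lengths))
    # NFC merges each letter with its mark into one precomposed character
    # where such a character exists.
    return unicodedata.normalize("NFC", marked)


print(mark_line("cano", [3, 2, 3, 1]))  # cănō
```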

I added colors for the long and short vowels.

colors takes a list of Strings and Overscript objects. For each item in the list, if the associated integer is a 1 or a 2, I used Style to make the color Blue or Red respectively. I then used Style again to put the Strings and Overscripts in a row and make it bigger.

colors[line_,probs_]:=(newLine=Table[If[probs[[x]]==3,line[[x]],Style[line[[x]],If[probs[[x]]==1,Blue,Red]]],{x,Range[Length[probs]]}];Style[Row[newLine],Large])

Final product

I combined all of my functions to scan and format the line.

The program calculates the appropriate integer list for a line and formats the result.

scan[line_]:=(out=calcOutput[trainedNet[line]];colors[overscripts[line,out],out])

Conclusion and future work

Through this project, I was able to create a neural network that successfully determines and formats the metrical pattern of lines of Latin poetry in dactylic hexameter with an error of 2.9%. The program takes a plain line of poetry and outputs the resulting color-coded line with metrical markings.
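The text does not say how the 2.9% figure was measured. One plausible reading is a per-character error rate on held-out lines, which could be computed along these lines in Python (the function name and the toy data are illustrative only):

```python
def char_error_rate(predicted, expected):
    """Fraction of characters whose predicted label differs from the true one."""
    pairs = [
        (p, e)
        for pred_line, true_line in zip(predicted, expected)
        for p, e in zip(pred_line, true_line)
    ]
    wrong = sum(1 for p, e in pairs if p != e)
    return wrong / len(pairs)


# Toy example: one wrong label out of five characters.
print(char_error_rate([[1, 3, 2], [3, 3]], [[1, 3, 1], [3, 3]]))  # 0.2
```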

In the future, markings between metrical feet could be added to the displayed lines. The program could also be extended to other types of meter or to poetry in other languages.

Acknowledgements

I would like to thank my wonderful mentor, Christian Pasquel, for his endless support throughout the process. I would also like to thank the other Wolfram Summer Camp mentors and the Wolfram Summer School mentors and students.

Cite this as: Laney Moy, "Automatic Metrical Scansion of Latin Poetry in Dactylic Hexameter" from the Notebook Archive (2019), https://notebookarchive.org/2019-08-0gjg0rc
