Data Curation Tools: MorphAdorner from the MONK Project
Posted on: 20 Dec '10

I have been playing around with tools from the MONK project. I have written before about the MONK Workbench, which I found to be a brilliant tool for introducing people to the possibilities of linguistic analysis and even, within the limits of its data-set, for doing some pretty nifty research. However, I found MorphAdorner to be a much more exciting tool because of the control it can give individual projects over their data-sets.

For any large-scale data mining project, it is the quality of the data that largely determines the quality and usefulness of the information that can be extracted from it. Just throwing a mass of data at a machine learning algorithm is unlikely to be very useful. And for especially complex data like culturally produced and transmitted texts, proper curation and preparation of the data for analysis becomes even more important. MorphAdorner, paired with the Abbot tool (which converts TEI and some other related XML/SGML formats into the TEI-Analytic, or TEI-A, format - more about Abbot in a later post), not only makes the task of text curation a lot easier but also opens up many avenues of analysis that would otherwise have been nearly impossible or, at best, so error-ridden as to be useless.

Leaving the difficulties of standardizing large digital corpora aside for a later post, let me describe briefly what exactly MorphAdorner does to a text through an example. I will use the beginning of the Prologue from Middleton and Dekker's play, The Roaring Girl. The prologue in the quarto text looks like this:

Prologue from the 1611 Quarto of Thomas Dekker and Thomas Middleton's play The Roaring Girl

Evidently, this will be a nightmare for most OCR software, even though it is actually a pretty clean sample for the period. So, once this has been encoded by hand by the EEBO-TCP project, what does this text look like?

<HEAD>Prologus.</HEAD>
   <LG>
      <L>A Play (expected long) makes the Audience looke</L>
      <L>For wonders:—that each Scoene should be a booke,</L>
      <L>Compos'd to all perfection; each one comes</L>
      <L>And brings a play in's head with him: vp he summes,</L>
      <L>What he would of a Roaring Girle haue writ;</L>
      <L>If that he findes not here, he m<GAP DESC="illegible" EXTENT="1 letter" DISP="•"/>wes at it.</L>

This is the TEI-encoded file. The <HEAD> tag identifies the heading; the <LG> tag marks the beginning of a line-group, or, in other words, indicates that what follows is verse; and the individual <L> tags mark each line. The digital scribe did quite well in this bit of text, apart from the last line, where a letter of the word "mews" was too obscure in the quarto and was thus marked illegible. The <GAP> tag recommends that renderings of this file show a dot in place of the missing letter. If one were to nit-pick, one might argue that the diphthong in the second line should have been encoded, but that is not important for our present purpose of generating an analyzable digital text.

Now that we have this digitized text encoded with the basic textual features, what can we do with it? The answer, unfortunately, is "not much" in terms of large-scale analysis. Apart from the vagaries of early modern spelling and punctuation, the very complexities of natural language make standardized processing a challenge. To an accustomed reader it might be patently obvious that "looke" can be modernized as "look" and "Scoene" as "scene", but how do we let a machine know that when faced with a corpus of over a hundred thousand texts containing billions of words? Even with regularized and modernized spelling - say, if one were to digitize modern critical editions instead of original printings - our problem doesn't quite go away. Fortunately, linguistics has solutions to most of these problems and, since linguists have used computers to analyze and understand language for much longer than literary scholars, they have pretty standardized ways of performing these linguistic transformations. MorphAdorner encodes the incoming TEI file with this linguistic data added in the form of tag attributes. Mainly, it lemmatizes each word (i.e. gives the standard uninflected form of the word, as it would be found in a dictionary) and adds part-of-speech data. The resulting file looks somewhat like this:

<LG>
  <L>
    <w eos="0" lem="a" pos="dt" reg="A" spe="A" tok="A" xml:id="roaring-girl-04740" ord="449" part="N">A</w>
    <c> </c>
    <w eos="0" lem="play" pos="n1" reg="Play" spe="Play" tok="Play" xml:id="roaring-girl-04750" ord="450" part="N">Play</w>
    <c> </c>
    <w eos="0" lem="(" pos="(" reg="(" spe="(" tok="(" xml:id="roaring-girl-04760" ord="451" part="N">(</w>
    <w eos="0" lem="expect" pos="vvd" reg="expected" spe="expected" tok="expected" xml:id="roaring-girl-04770" ord="452" part="N">expected</w>
    <c> </c>
    <w eos="0" lem="long" pos="av-j" reg="long" spe="long" tok="long" xml:id="roaring-girl-04780" ord="453" part="N">long</w>
    <w eos="0" lem=")" pos=")" reg=")" spe=")" tok=")" xml:id="roaring-girl-04790" ord="454" part="N">)</w>
    <c> </c>
    <w eos="0" lem="make" pos="vvz" reg="makes" spe="makes" tok="makes" xml:id="roaring-girl-04800" ord="455" part="N">makes</w>
    <c> </c>
    <w eos="0" lem="the" pos="dt" reg="the" spe="the" tok="the" xml:id="roaring-girl-04810" ord="456" part="N">the</w>
    <c> </c>
    <w eos="0" lem="audience" pos="n1" reg="Audience" spe="Audience" tok="Audience" xml:id="roaring-girl-04820" ord="457" part="N">Audience</w>
    <c> </c>
    <w eos="0" lem="look" pos="vvb" reg="look" spe="looke" tok="looke" xml:id="roaring-girl-04830" ord="458" part="N">looke</w>
  </L>

You will notice that this is just the first line of the prologue. In other words, all of this added data dramatically increases the file size. But apart from the increased processing power needed to generate and handle these larger files (and how confusing all this looks to the human eye), not much changes in terms of the TEI encoding. The tags, although they might be a little harder to find, are exactly the same, with the addition of <w> tags for words and <c> tags for the space characters. Each <w> element records the original token (tok), its regularized spelling (reg), its lemma (lem) and its part of speech (pos), along with a unique id. Since the basic encoding doesn't change, any TEI processor, XML parser or XSL transformation will still process the file smoothly, only now it will carry a plethora of new information that can be used for analysis at the word level. For example, it is now an easy step to generate a modernized-spelling version by combining the lemma and POS information - not a trivial feat if you are looking at a database of 150,000-odd texts full of funky Tudor spelling.
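To make that last point concrete, here is a minimal sketch (my own, not part of MorphAdorner) of what working with the adorned output looks like downstream. It parses a simplified fragment of the adorned XML with Python's standard library and emits a modernized-spelling version of the text by preferring each word's reg attribute over the original token; the fragment itself is abridged from the sample above.

```python
# Sketch: extract a modernized-spelling text from MorphAdorner-style
# adorned XML by reading the "reg" attribute of each <w> element.
import xml.etree.ElementTree as ET

# Abridged, hypothetical fragment of adorned output (attributes trimmed).
adorned = """<LG>
  <L>
    <w lem="a" pos="dt" reg="A">A</w>
    <c> </c>
    <w lem="play" pos="n1" reg="Play">Play</w>
    <c> </c>
    <w lem="look" pos="vvb" reg="look">looke</w>
  </L>
</LG>"""

root = ET.fromstring(adorned)
# Walk the tree: keep regularized spellings for <w>, literal text for <c>.
modernized = "".join(
    el.get("reg", el.text) if el.tag == "w" else el.text
    for el in root.iter()
    if el.tag in ("w", "c")
)
print(modernized)  # -> A Play look
```

The same traversal could just as easily collect lemmata or POS tags, which is exactly why word-level adornment makes corpus-scale queries tractable.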

Before I end this post, let me include a screen dump of what MorphAdorner looks like in action. It is basically a Java program that you can run from the command line of any computer. It contains some quite large databases of word lists for English print from different eras. What it does in the background is conceptually a set of uncomplicated but intensely repetitive tasks: read each word, match it against lists of tens of thousands of word-lemma-POS tri-grams, and insert the matching results into the appropriate TEI-compatible tags. Essentially, it is bringing to our literary texts the decades of linguistic research and knowledge that have made such analysis possible, and it is doing it extremely efficiently, not to mention blindingly fast. Notice the times taken for each step in the screen dump below, from my middle-of-the-road MacBook with 4 GB of RAM.

- MorphAdorner version 1.0
- Initializing, please wait...
- Using Trigram tagger.
- Using I retagger.
- Loaded word lexicon with 151,124 entries in 4 seconds.
- Loaded suffix lexicon with 209,656 entries in 5 seconds.
- Loaded 305,855 Latin words in 3 seconds.
- Loaded 4,383 abbreviations in 1 second.
- Loaded transition matrix in 4 seconds.
- Loaded 162,248 standard spellings in 1 second.
- Loaded 358,590 alternative spellings in 6 seconds.
- Loaded 349 more alternative spellings in 14 word classes in 1 second.
- Loaded 2 names into name standardizer in < 1 second.
- 1 file to process.
- Before processing input texts: Free memory: 244,915,208, total memory: 637,349,888
- Processing file './ab-input/roaring-girl-short.xml' .
- Input file ./ab-input/roaring-girl-short.xml split into 5 segments.
- Processing segment 'text00001' (3 of 5).
- Extracted 842 words in 48 sentences in 1 second.
- lines: 49; words: 919
- Part of speech adornment completed in 1 second. 2,954 words adorned per second.
- Generating other adornments.
- Adornments generated in 1 second.
- Inserting adornments into XML text.
- Inserted adornments into XML text in 1 second.
- Processing segment 'text00002' (4 of 5).
- Extracted 1,410 words in 80 sentences in 1 second.
- lines: 130; words: 2,469
- Part of speech adornment completed in 1 second. 5,402 words adorned per second.
- Generating other adornments.
- Adornments generated in 1 second.
- Inserting adornments into XML text.
- Inserted adornments into XML text in 1 second.
- Processing segment 'text00003' (5 of 5).
- Extracted 390 words in 7 sentences in 1 second.
- lines: 138; words: 2,867
- Part of speech adornment completed in 1 second. 8,863 words adorned per second.
- Generating other adornments.
- Adornments generated in 1 second.
- Inserting adornments into XML text.
- Inserted adornments into XML text in 1 second.
- Merging adorned XML segments.
- Writing final XML to ./ab-output/roaring-girl.xml.
- Adorned XML written to ./ab-output/roaring-girl-short.xml in 6 seconds.
- After completing ./ab-input/roaring-girl.xml: Free memory: 253,047,736, total memory: 637,349,888
- All files adorned in 32 seconds.
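The repetitive lookup at the heart of this run can be caricatured in a few lines. This is a toy sketch of my own, not MorphAdorner's actual code or fallback logic: each token is matched against a word/lemma/POS table and wrapped in a TEI-style <w> element. The real lexicons, as the log shows, run to hundreds of thousands of entries; this table has three, and unknown words here just pass through with a placeholder tag.

```python
# Toy word -> (lemma, POS, regularized spelling) table, standing in for
# MorphAdorner's large lexicons.
lexicon = {
    "looke": ("look", "vvb", "look"),
    "makes": ("make", "vvz", "makes"),
    "play":  ("play", "n1",  "Play"),
}

def adorn(token):
    """Wrap a token in a TEI-style <w> element using the lexicon;
    unknown words fall back to the token itself and a placeholder POS."""
    lem, pos, reg = lexicon.get(token.lower(), (token, "unknown", token))
    return f'<w lem="{lem}" pos="{pos}" reg="{reg}">{token}</w>'

print(adorn("looke"))
# -> <w lem="look" pos="vvb" reg="look">looke</w>
```

Multiply that lookup by a few thousand words per text and a few thousand texts per corpus, and the appeal of a tool that does it in seconds becomes obvious.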

32 seconds to go from a mass of bytes that only a Renaissance scholar can read with ease to a properly curated database of machine-readable information!