I have been playing around with tools from the MONK project. I have written before about the MONK Workbench, which I found to be a brilliant tool for introducing people to the possibilities of linguistic analysis and even, within the limits of its data-set, for doing some pretty nifty research. However, I found the Morphadorner to be a much more exciting tool because of the control it can give individual projects over their data-sets.
For any large-scale data mining project, the quality of the data largely determines the quality and usefulness of the information that can be extracted from it. Just throwing a mass of data at a machine learning algorithm is unlikely to be very useful. And for especially complex data like culturally produced and transmitted texts, proper curation and preparation of the data for analysis becomes even more important. Morphadorner, paired with the Abbot tool (which converts TEI and some related XML/SGML formats into the TEI-Analytics, or TEI-A, format; more about Abbot in a later post), not only makes the task of text curation a lot easier, it also opens up many avenues of analysis that would otherwise have been nearly impossible or, at best, so error-ridden as to be useless.
Leaving the difficulties of standardizing large digital corpora aside for a later post, let me briefly describe what exactly Morphadorner does to a text. I will use as an example the beginning of the Prologue from Middleton and Dekker's play, The Roaring Girl. The prologue in the quarto text looks like this:
Evidently, this will be a nightmare for most OCR software, and this is actually a pretty nice sample for the period. So, once this has been encoded by hand by the EEBO project, or the EEBO-TCP project, what does this text look like?
<HEAD>Prologus.</HEAD>
<LG>
<L>A Play (expected long) makes the Audience looke</L>
<L>For wonders:—that each Scoene should be a booke,</L>
<L>Compos'd to all perfection; each one comes</L>
<L>And brings a play in's head with him: vp he summes,</L>
<L>What he would of a Roaring Girle haue writ;</L>
<L>If that he findes not here, he m<GAP DESC="illegible" EXTENT="1 letter" DISP="•"/>wes at it.</L>
This is the TEI-encoded file. The first tag identifies the heading, the <LG> tag marks a line group, and each <L> tag marks a single line of verse.
<LG>
<L>
<w eos="0" lem="a" pos="dt" reg="A" spe="A" tok="A" xml:id="roaring-girl-04740" ord="449" part="N">A</w>
<c> </c>
<w eos="0" lem="play" pos="n1" reg="Play" spe="Play" tok="Play" xml:id="roaring-girl-04750" ord="450" part="N">Play</w>
<c> </c>
<w eos="0" lem="(" pos="(" reg="(" spe="(" tok="(" xml:id="roaring-girl-04760" ord="451" part="N">(</w>
<w eos="0" lem="expect" pos="vvd" reg="expected" spe="expected" tok="expected" xml:id="roaring-girl-04770" ord="452" part="N">expected</w>
<c> </c>
<w eos="0" lem="long" pos="av-j" reg="long" spe="long" tok="long" xml:id="roaring-girl-04780" ord="453" part="N">long</w>
<w eos="0" lem=")" pos=")" reg=")" spe=")" tok=")" xml:id="roaring-girl-04790" ord="454" part="N">)</w>
<c> </c>
<w eos="0" lem="make" pos="vvz" reg="makes" spe="makes" tok="makes" xml:id="roaring-girl-04800" ord="455" part="N">makes</w>
<c> </c>
<w eos="0" lem="the" pos="dt" reg="the" spe="the" tok="the" xml:id="roaring-girl-04810" ord="456" part="N">the</w>
<c> </c>
<w eos="0" lem="audience" pos="n1" reg="Audience" spe="Audience" tok="Audience" xml:id="roaring-girl-04820" ord="457" part="N">Audience</w>
<c> </c>
<w eos="0" lem="look" pos="vvb" reg="look" spe="looke" tok="looke" xml:id="roaring-girl-04830" ord="458" part="N">looke</w>
</L>
You will notice that this is just the first line. In other words, all of this added data increases the file size many times over. But apart from the extra processing power needed to generate and handle these larger files (and how confusing all this looks to the human eye), not much changes in terms of the TEI encoding. The tags, although they might be a little hard to find, are exactly the same, with the addition of the <w> and <c> tags that wrap each word and each space.
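To give a sense of how the adorned attributes can be put to work, here is a minimal Python sketch that reads a few of the <w> elements from the excerpt above and lists their values. It assumes the attribute names mean what they appear to mean (tok for the token as printed, reg for the regularized spelling, lem for the lemma, pos for the part-of-speech tag), and that the elements carry no default XML namespace; a complete TEI-A file may well declare one, in which case the tag lookup would need adjusting. The function name read_tokens is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# A few adorned tokens copied from the excerpt above, wrapped in one <L>.
ADORNED = """<L>
<w eos="0" lem="make" pos="vvz" reg="makes" spe="makes" tok="makes" xml:id="roaring-girl-04800" ord="455" part="N">makes</w>
<c> </c>
<w eos="0" lem="the" pos="dt" reg="the" spe="the" tok="the" xml:id="roaring-girl-04810" ord="456" part="N">the</w>
<c> </c>
<w eos="0" lem="audience" pos="n1" reg="Audience" spe="Audience" tok="Audience" xml:id="roaring-girl-04820" ord="457" part="N">Audience</w>
<c> </c>
<w eos="0" lem="look" pos="vvb" reg="look" spe="looke" tok="looke" xml:id="roaring-girl-04830" ord="458" part="N">looke</w>
</L>"""


def read_tokens(xml_text):
    """Return one dict per <w> element (hypothetical helper, not part of Morphadorner)."""
    root = ET.fromstring(xml_text)
    return [
        {
            "tok": w.get("tok"),  # assumed: token as it appears in the text
            "reg": w.get("reg"),  # assumed: regularized spelling
            "lem": w.get("lem"),  # assumed: lemma
            "pos": w.get("pos"),  # assumed: part-of-speech tag
        }
        for w in root.iter("w")   # <c> elements (spaces) are skipped
    ]


tokens = read_tokens(ADORNED)
for t in tokens:
    print(f"{t['tok']:<10} {t['reg']:<10} {t['lem']:<10} {t['pos']}")
```

Once the tokens are in this form, searching by lemma rather than by spelling becomes a simple filter: "looke" and "look" both turn up under the lemma "look".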
Before I end this post, let me include a screen dump of what Morphadorner looks like in action. It is basically a Java program that you can run from the command line of any computer. It contains some quite large databases of word lists for English print from different eras. What it does in the background is conceptually a set of uncomplicated but intensely repetitive tasks: it reads each word, matches it against lists of tens of thousands of trigrams (word-lemma-POS), and inserts the matching results into appropriate TEI-compatible tags. Essentially, it is bringing to our literary texts the decades of linguistic research and knowledge that have made such analysis possible, and it is doing so extremely efficiently, not to mention blindingly fast. Notice the time taken for each step in the screen dump below, from my middle-of-the-road MacBook with 4 GB of RAM.
- MorphAdorner version 1.0
- Initializing, please wait...
- Using Trigram tagger.
- Using I retagger.
- Loaded word lexicon with 151,124 entries in 4 seconds.
- Loaded suffix lexicon with 209,656 entries in 5 seconds.
- Loaded 305,855 Latin words in 3 seconds.
- Loaded 4,383 abbreviations in 1 second.
- Loaded transition matrix in 4 seconds.
- Loaded 162,248 standard spellings in 1 second.
- Loaded 358,590 alternative spellings in 6 seconds.
- Loaded 349 more alternative spellings in 14 word classes in 1 second.
- Loaded 2 names into name standardizer in < 1 second.
- 1 file to process.
- Before processing input texts: Free memory: 244,915,208, total memory: 637,349,888
- Processing file './ab-input/roaring-girl-short.xml' .
- Input file ./ab-input/roaring-girl-short.xml split into 5 segments.
- Processing segment 'text00001' (3 of 5).
- Extracted 842 words in 48 sentences in 1 second.
- lines: 49; words: 919
- Part of speech adornment completed in 1 second. 2,954 words adorned per second.
- Generating other adornments.
- Adornments generated in 1 second.
- Inserting adornments into XML text.
- Inserted adornments into XML text in 1 second.
- Processing segment 'text00002' (4 of 5).
- Extracted 1,410 words in 80 sentences in 1 second.
- lines: 130; words: 2,469
- Part of speech adornment completed in 1 second. 5,402 words adorned per second.
- Generating other adornments.
- Adornments generated in 1 second.
- Inserting adornments into XML text.
- Inserted adornments into XML text in 1 second.
- Processing segment 'text00003' (5 of 5).
- Extracted 390 words in 7 sentences in 1 second.
- lines: 138; words: 2,867
- Part of speech adornment completed in 1 second. 8,863 words adorned per second.
- Generating other adornments.
- Adornments generated in 1 second.
- Inserting adornments into XML text.
- Inserted adornments into XML text in 1 second.
- Merging adorned XML segments.
- Writing final XML to ./ab-output/roaring-girl.xml.
- Adorned XML written to ./ab-output/roaring-girl-short.xml in 6 seconds.
- After completing ./ab-input/roaring-girl.xml: Free memory: 253,047,736, total memory: 637,349,888
- All files adorned in 32 seconds.
32 seconds to go from a mass of bytes that only a Renaissance scholar can read with ease to a properly curated database of machine-readable information!
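For readers who want a feel for the "uncomplicated but intensely repetitive" lookup described earlier, here is a toy Python sketch. It is not Morphadorner's code and it is much simpler than the trigram tagger named in the log: it looks each word up in isolation, whereas a trigram tagger also weighs the tags of neighbouring words. The two dictionaries and the adorn function are hypothetical stand-ins for the large lexicons the program loads at startup, filled by hand with a few words from the Prologue.

```python
# Hypothetical miniature lexicons; the real ones hold hundreds of thousands of entries.
# Maps an old spelling (lower-cased) to an assumed standard spelling.
STANDARD_SPELLINGS = {"looke": "look", "haue": "have", "vp": "up", "girle": "girl"}

# Maps a standard spelling (lower-cased) to an assumed (lemma, part-of-speech) pair.
LEXICON = {
    "look": ("look", "vvb"),
    "makes": ("make", "vvz"),
    "the": ("the", "dt"),
    "audience": ("audience", "n1"),
}


def adorn(token):
    """Look up one token in isolation and return its adornments (toy version)."""
    # Regularize the spelling if the word is a known variant, else keep the token as is.
    reg = STANDARD_SPELLINGS.get(token.lower(), token)
    # Fall back to the word itself as lemma, with an 'unknown' tag, if it is not listed.
    lem, pos = LEXICON.get(reg.lower(), (reg.lower(), "unknown"))
    return {"tok": token, "reg": reg, "lem": lem, "pos": pos}


for word in ["makes", "the", "Audience", "looke"]:
    print(adorn(word))
```

Even this crude version shows why the quality of the word lists matters so much: any spelling missing from them falls straight through to "unknown".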