Silico Sapiens: Learning Machines
Posted on: 22 May '11

This is a follow-up to my previous post: Rogue-Mining: Can We Close Read 150,000 texts?

In my last blog entry, I described some tentative approaches that might help us integrate the large-scale analysis of textual corpora with the traditional practices of literary research. The main problem seemed to be one of simulating "learning," but more importantly one of modeling the kind of negotiations and revisions - reading and re-reading - that mark literary research. How could we communicate these shifting emphases to algorithms in ways that could focus the data-set on particular aspects of interest to the project? Meta-data provided a partial solution, but the quality of meta-data that can be expected even from the most carefully curated databases, and the relative coarseness of the controls meta-data allows, cannot compare to the nuances of actual reading by a competent reader.

Re-Search: Literary Analysis as Recursive Reading

How does this imagined reader read? How does he collect "data" about the text before it crystallizes into particular meanings or interpretations? One cannot hope to simulate the competence of a good reader through an algorithm. But if we can begin to approximate the processes through which literary critics usually form broad research topics, and direct the mining of literary data by nudging it in particular directions, that would be a major step forward in terms of what we are trying to achieve.

If we step back from the direct act of close reading and think of how we conceptualize an area of research, I think we can agree that we often start with a broad set of associated ideas. As the research progresses, these ideas evolve and change. Some drop out while new ones appear within the research framework. In the case of my own research, I started with a very diffuse interest in broad "clusters" of ideas - "crime," "vagrancy," "early capitalism" etc. - that were only loosely correlated. Soon enough, some of them were eliminated and others developed into a more complex and nuanced set of categories. That cycle of gentle flux and revision continued as the project developed and reached completion. But how was I associating ideas with texts and with each other? Partly, my choices were down to the accidents of academic training, blind chance, or the preferences and advice of my directors. Not being able to process a few billion words' worth of early modern print proved to be a drawback in terms of absolute and comprehensive coverage, but it didn't fundamentally affect the recursive pattern of learning, reflection, revision and re-learning.

To continue the example from my last post, if I had an undergraduate as a research assistant for this project, I would take considerable time to give her a sense of its evolving directions. She wouldn't necessarily be up on all the research or the nuances of argument and available data, but perhaps I'd ask her to read a couple of texts to get a sense of what we're interested in. Admittedly, it is unlikely that my research assistant would be able to process a hundred and fifty thousand texts in a few minutes. But for most literary or historical projects, this sense of focus - rather than mere broadness of scope - is the defining factor. In other words, better data is often more important than simply more data. With computers as our research assistants, it is necessary to find ways of telling them about our research interests - of asking them to "read up" a bit on what we're doing before sending them off to the library.

Teaching Machines to Read

Machine-learning techniques can be used to roughly recreate this process of recursive reading on the typically non-human scales that computers are capable of, but with a significant amount of hand-holding and feedback from the researcher. Such "training" is a standard component of many machine-learning approaches, though the one I shall suggest is an especially time-consuming and involved process. Still, given the technology we now have at our disposal, and the scale of digital corpora, it can help us put together highly focused databases in a matter of weeks. Compared to the several years that go into a dissertation or a book project, this is a minimal investment that allows one to immeasurably broaden the range of materials and analysis.

Machine-learning algorithms can be highly sophisticated, but to make my essential point about the need for recurring communication with the researcher, let me take as an example a simple variation of a search-based algorithm - albeit one that can also look for various metrics of word association. (Variations of this approach are common in language processing - for example, the WordNet database as used by the Python Natural Language Toolkit.) As is common practice with many "training" approaches, we can start with a core group of terms or associations that are of particular interest to us within our research. In the case of my work, I could take a sample of representative sixteenth-century "rogue-pamphlets" such as Thomas Harman's highly influential A Caveat for Common Cursitors, Vulgarly Called Vagabonds, or a selection of pamphlets by Robert Greene or Thomas Dekker, and through a careful reading of such core texts (a task any researcher engages in anyway), compile a core vocabulary of terms related to crime and criminals. This would give me a general but still considerably focused starting point.

With the training vocabulary set, it would then be possible to scan the entire corpus for occurrences of the words or word-clusters while noting words that occur in frequent proximity or association with them. A simple algorithm can assign weights to found words depending on distance or other association criteria. Using a properly curated, lemmatized corpus would make the job much more efficient and avoid unnecessary repetition. After a complete scan, obvious noise words can be eliminated automatically and linguistic or frequency thresholds can be set for acceptance. The user would then be presented with the remaining vocabulary. This, for sure, will contain many associated terms related to the research topic that did not occur in the training texts, but it will also contain many unrelated words, which one can easily eliminate as not of interest.
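To make the procedure concrete, here is a minimal sketch of such a proximity-weighted scan. Everything in it is invented for illustration - the seed vocabulary, the toy tokenized corpus, and the simple inverse-distance weighting; a real implementation would run over a lemmatized corpus and could substitute any association measure it liked.

```python
from collections import Counter

def scan_corpus(documents, seed_terms, window=5):
    """Scan tokenized documents for words co-occurring near seed terms.

    Each co-occurring word receives a weight inversely proportional
    to its distance from the seed term, so near neighbours count more.
    """
    weights = Counter()
    seeds = set(seed_terms)
    for tokens in documents:
        for i, tok in enumerate(tokens):
            if tok in seeds:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i and tokens[j] not in seeds:
                        weights[tokens[j]] += 1.0 / abs(j - i)
    return weights

# Hypothetical seed vocabulary and toy corpus
seeds = ["rogue", "vagabond", "cony"]
docs = [["the", "rogue", "and", "his", "cant", "deceived", "the", "cony"],
        ["a", "sturdy", "vagabond", "walked", "the", "highway"]]
associations = scan_corpus(docs, seeds)
```

The output is a ranked candidate vocabulary; the noise-word elimination and frequency thresholds described above would then be applied to it before anything is shown to the researcher.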

While such repeated and recursive training and re-training can be tedious, it will contribute to the quality of the data and allow for fine-grained adjustments. Chances are, the selection process itself will bring to light certain unforeseen associations and nudge the research in new directions. Terms can be added, removed, or weighted for importance. Once the revised training data is formulated, it can be used to crawl the corpus for associations yet again, and the more iterations one runs, the higher the quality of the resultant database will be (albeit with diminishing returns after a few runs).
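The loop itself might be sketched along these lines - again with an invented toy corpus and a deliberately crude co-occurrence count. The `review` callable stands in for the researcher accepting or rejecting candidate terms between passes, which is exactly the recurring communication argued for above:

```python
from collections import Counter

def cooccurrences(documents, vocabulary, window=5):
    """Count words appearing within `window` tokens of any vocabulary term."""
    counts = Counter()
    for tokens in documents:
        hits = [i for i, t in enumerate(tokens) if t in vocabulary]
        for i in hits:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if tokens[j] not in vocabulary:
                    counts[tokens[j]] += 1
    return counts

def retrain(documents, seeds, review, iterations=3, top_n=5):
    """One pass per iteration: scan, present candidates to the researcher,
    fold the accepted terms back into the vocabulary for the next scan."""
    vocab = set(seeds)
    for _ in range(iterations):
        candidates = [w for w, _ in cooccurrences(documents, vocab).most_common(top_n)]
        vocab.update(review(candidates))  # the researcher accepts/rejects here
    return vocab

# Toy example: the review step simply drops obvious stopwords
stopwords = {"the", "a", "and", "his"}
docs = [["the", "rogue", "and", "his", "cant", "deceived", "the", "cony"],
        ["a", "sturdy", "vagabond", "walked", "the", "highway"]]
vocab = retrain(docs, ["rogue", "vagabond"],
                lambda cs: [c for c in cs if c not in stopwords])
```

In practice the review step is interactive and slow, which is the "tedious" part; but each pass lets terms discovered in one iteration pull in further associations on the next.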

What can we do with this database that we couldn't with traditional scholarship or with our original approach to corpus-mining? The most obvious benefits are those of scale. Computers excel at repetitive tasks like searching through massive amounts of data. They are not, however, very good at making interpretive decisions about literary texts. This approach tries to harness the first and compensate for the second. Even before running the data through analytic software and clustering programs, the benefits of scale should be obvious. No scholar working on any topic can read more than a minuscule portion of an existing corpus. Thus, simply letting a "trained" algorithm pick out texts that might be good candidates for inclusion has its advantages. A quick routine can bring up the texts within the corpus with the highest relative frequencies of the core terms, potentially identifying little-known texts of relevance to the project.
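Such a routine could be as simple as ranking each text by the relative frequency of core-vocabulary terms. The mini-corpus and its short titles below are, of course, hypothetical stand-ins for a real lemmatized collection:

```python
def rank_texts(documents, core_terms):
    """Rank documents by the relative frequency of core vocabulary terms,
    surfacing candidate texts for closer reading."""
    core = set(core_terms)
    scores = []
    for name, tokens in documents.items():
        hits = sum(1 for t in tokens if t in core)
        scores.append((name, hits / len(tokens)))
    # Highest relative frequency first
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Hypothetical mini-corpus keyed by invented short titles
corpus = {
    "caveat": ["rogue", "vagabond", "cant", "cozenage", "highway"],
    "sermon": ["grace", "faith", "charity", "rogue", "prayer", "psalm"],
}
ranking = rank_texts(corpus, ["rogue", "vagabond", "cant"])
```

Normalizing by document length matters here: without it, a very long text with a few scattered hits would outrank a short pamphlet saturated with the core vocabulary.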

In terms of further computational analysis, the possibilities opened up by a highly focused subset of the original corpus are immense. Existing techniques such as those implemented in the current MONK workbench can be applied more effectively. Other traditional methods, like supervised clustering that depends on meta-data, can be deployed more effectively with such highly focused data. Statistically, the expectation of better results would be justified by the simple elimination of "noise" from the mined data by the repeated "training" sessions. However, in terms of literary scholarship, I believe the main benefit of such an approach would come from the way it integrates itself with the normal processes of literary analysis. It would allow us to avoid sending our digital research assistant off into the data mines with a pickaxe and a shovel where an archeologist's brush and trowel might be more appropriate.