Rogue-Mining: Can We Close-Read 150,000 Texts?
Posted on: 17 May '11

Formulating the Problem

The Digital Inquiry Group at UW-Madison was recently discussing ways in which a large digital corpus of literary texts can be used to supplement traditional humanities research approaches. I volunteered my own research as a test case: I am interested in how representations of crime and criminals are used in early modern print culture and drama. How would I go about doing this with the existing corpus of EEBO-TCP texts and with tools like DocuScope or the MONK workbench (assuming the entire TCP corpus were made available there)? Both DocuScope and MONK would let me run individual texts through them and generate linguistic "signatures" weighted in particular ways. These might be generated according to rhetorical categories, as in DocuScope, or the bag-of-words techniques deployed in the MONK workbench. Essentially, these "signatures" would be a series of numerical attributes in an array giving the characteristics of a particular text. So a text X might be represented by an array of n values: X = {x1, x2, x3, ..., xn-1, xn}. Once I have a number of texts processed (say z), I shall have z such arrays (X1 through Xz) with n attributes each. This is my data-set for any numerical analysis.
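To make the shape of this data concrete, here is a minimal sketch of what such a signature matrix might look like in R before it is written out as a CSV; the feature names (first_person, legal_terms, narrative) and the TCP-style ids are hypothetical stand-ins for whatever categories the tools actually report.

```r
# A toy signature matrix: z = 4 texts, n = 3 hypothetical linguistic features.
# Each row is one text's "signature"; each column is one numerical attribute.
signatures <- data.frame(
  text_id      = c("A10802", "A23456", "B01923", "B14007"),  # hypothetical ids
  first_person = c(0.042, 0.017, 0.031, 0.009),
  legal_terms  = c(0.003, 0.021, 0.005, 0.018),
  narrative    = c(0.120, 0.087, 0.143, 0.065)
)

# Written out, this becomes the portable CSV file that the later analysis starts from.
write.csv(signatures, "signatures.csv", row.names = FALSE)
```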

While this is fine, a couple of caveats seem in order if we're not to lose sight of our original goal - tracing representations of criminality. To be able to do what I want, rather than what DocuScope or MONK is programmed to do, I need to be very meticulous about the parameters that go into this data-set. It is reasonably easy to tweak the data to emphasize rhetorical or linguistic features that I think are important, but to do that I need a very precise idea of what exactly the parameters x1 through xn are measuring.

A Clustering Conundrum

Now that I have my raw data, what might I do with it? Well, it can be processed through any number of programs (it starts life as an extremely portable CSV file), from vanilla Excel, to a statistical programming language like R, to a range of specialized visualization packages. Since I am not exactly sure that my data is taking me in the direction I want to go, I will avoid packages that make it easy to generate attractive visualizations and deal with the nuts and bolts myself. So, importing the data into R, I prepare it for clustering. Each item in my dataset can be thought of as a point existing in n dimensions, but I need to reduce this complexity for the cluster analysis algorithm. I can do this through a process called principal component analysis (PCA), which reduces the dimensionality of the data while preserving as much of the original variation as possible. Finally, I run it through a clustering algorithm, usually an unsupervised form of machine learning - i.e., the algorithm tries to find closely proximate "groups" of objects without the addition of any external data.
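A minimal sketch of what this might look like in R, assuming a CSV of signatures like the one sketched above but covering the full set of texts; prcomp for the PCA, kmeans with an arbitrary five clusters, and the jpeg output are all stand-ins for whatever the actual analysis would use.

```r
# Read the signature matrix exported earlier (the file name is hypothetical).
signatures <- read.csv("signatures.csv", stringsAsFactors = FALSE)
features   <- signatures[, -1]            # drop text_id, keep the numeric attributes

# Principal component analysis: reduce n dimensions while keeping as much variance as possible.
pca    <- prcomp(features, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]                    # first two components, for plotting

# Unsupervised clustering on the reduced data; k = 5 is an arbitrary choice.
set.seed(42)
clusters <- kmeans(scores, centers = 5)

# Save a scatterplot of the components, colored by cluster membership.
jpeg("clusters.jpg", width = 800, height = 600)
plot(scores, col = clusters$cluster, pch = 19, xlab = "PC1", ylab = "PC2",
     main = "EEBO-TCP signatures: k-means clusters in PCA space")
dev.off()
```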

R spews out my jpegs and I study them to see if the linguistic analysis threw up any significant patterns. I might be surprised to find that jest-books and ballads indeed cluster very differently from plays, or that texts that talk about crime are radically different from ones that don't. This is unlikely, though, and even if I lucked out like this I'd be left with the task of figuring out what exactly caused this clustering. I don't mean why the numerical data clustered in this way, but rather what features in the underlying texts - features that imprint themselves subtly on the traditional "slow reader" - caused a particular kind of clustering. At this point, I might feel that while I am doing dazzling and perhaps even productive work, I am not being productive quite in the way I want to be. Rather, I am functioning within the parameters of the computing system I employ.

One of the chief causes of this is that, since I started out, I haven't been able to tell the machine what my research interest is - what it is I am looking for, what it is I am hoping to learn. All I have communicated is a series of tangential inputs, tweaking parameters that I assume are correlated with particular aspects of the texts. To be sure, MONK lets me supply a word or even sets of words and/or look for part-of-speech categories, but is that how a perceptive slow reader works? I did not start out saying "I want to trace early modern adjectives with the root 'rogue'"; I started out wanting to trace representations of criminality, a broad, evolving and diffuse category with no strict pre-determined boundaries. In fact, I am hoping to learn how boundaries - between honesty and crime, inside and outside, self and other - are constructed.

So, I remain stuck with a conundrum. If I were to hire an undergraduate as a research assistant, I would take some time to outline my broad research interests to her before sending her off to the library. Yet I am forced to send this program into a library of 150,000 texts with barely more guidance than what it is already designed to find. If every reading of literature is a reading from a particular perspective, looking for particular things, how can we design digital humanities tools that will let me communicate the nuances of that perspective? In other words, while digital tools are great at finding broad linguistic patterns, how can I harness them to the service of traditional literary research?

Meta-Data: Supervised Machine Learning

Even if we have properly curated, lemmatized and tagged data, how do we convey the research focus to the data-mining program? No amount of pre-programming will do, because every reading and every research project has different interests and goals. Nor would a set of words, such as the MONK workbench allows, be of much use, because an area of interest is rarely defined by a definite set of words. To continue the example of my own work on criminality, I would struggle just to enumerate the words which denote whoredom, a trope frequently invoked in early modern discussions of female crime - and any list would be incomplete because of the layers of innuendo, puns and metaphors that were often deployed.

One approach might be to get some crucial data across in the form of metadata. This concept is explored in MONK and at the EEBO-TCP Linguistic Analysis project at UW-Madison. We found metadata to be indispensable there, even though that project is not seeking to conduct the kind of guided research on a highly focused topic, as a complement to traditional reading, that I am trying to implement here. What benefit would meta-data offer? It would allow us to pass crucial bits of information - genre, date, sex of the author, etc. - to the clustering algorithm. This allows us to establish certain relationships within the data-set. We can, for example, communicate whether we are looking to compare tragedies against comedies, or Elizabethan plays against Jacobean plays. Our clustering algorithm can easily accommodate the new dimension and generate scatterplots with circles helpfully color-coded or marked with different shapes.
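As a rough sketch of what this might look like, continuing the earlier R example: a hypothetical metadata.csv keyed by text_id supplies a genre column, which is then used to vary the plotting symbols while cluster membership controls the colors.

```r
# Load a hypothetical metadata table (text_id, genre, date, author_sex, ...).
meta  <- read.csv("metadata.csv", stringsAsFactors = FALSE)
genre <- factor(meta$genre[match(signatures$text_id, meta$text_id)])  # align to signature rows

# Color by cluster, shape by genre: the metadata adds a dimension to the plot.
plot(scores, col = clusters$cluster, pch = as.integer(genre),
     xlab = "PC1", ylab = "PC2",
     main = "Clusters marked by genre metadata")
legend("topright", legend = levels(genre), pch = seq_along(levels(genre)))
```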

But meta-data is still heavily biased towards broad categories. It is coded into the headers of TEI files, and of necessity it can only describe a limited number of aspects. Even for fairly general categories, things that interest a particular researcher might not be indexed in the meta-data. Not all data-sets, for instance, note the gender of the author. What about social status - one might be interested in texts written by members of the nobility, gentry, citizenry, and so on. Can metadata possibly address the entire gamut of approaches to a set of texts, or does it predetermine what approaches are possible at all?

Meta-data has its uses and is a huge improvement over raw data alone, but it is unlikely that meta-data by itself can serve as a satisfactory platform for a wide range of research demands on any data-set. One approach might be to formulate quick ways of letting the user add provisional meta-data, and this is indeed the way we chose to go in the current phase of analysis within the Digital Inquiry Group. We are developing interfaces that let us insert metadata into several hundred files relatively quickly. But manually curating a really large body of texts like the whole EEBO corpus, most of which will not relate to the project at hand, might not be the most practical approach. Moreover, each project would need its own specialized meta-data. This approach is therefore best suited to smaller-scale projects focusing on a limited number of texts.
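In practice, the provisional meta-data might live in nothing more elaborate than a small hand-curated table joined onto the signature matrix. A sketch, with invented file and column names (provisional_metadata.csv, relates_to_crime):

```r
# A small hand-curated table of provisional tags for a few hundred texts;
# the file and column names here are invented for illustration.
provisional <- read.csv("provisional_metadata.csv", stringsAsFactors = FALSE)

# Left-join onto the signature matrix: untagged texts simply get NA,
# so the whole corpus never has to be curated up front.
tagged <- merge(signatures, provisional, by = "text_id", all.x = TRUE)

# The provisional tags can then drive comparisons, e.g. crime-related texts only.
crime_texts <- subset(tagged, relates_to_crime == TRUE)
```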

We are a little further ahead in our quest to harness digital tools for custom research projects, but meta-data, while useful, is not quite the solution we are hoping for. In the next post I'll try to outline an approach to corpus analysis that might allow us to tell, indeed teach, machines to read not along pre-programmed paths but for the particular things we are interested in.

A follow-up to this post: Silico-Sapiens: Learning Machines