Is that a Bat or a Baboon? On the Peculiar Characteristics of Literary Data
Posted on: 15 May '11

A duck and a porcupine (grammar be damned!)
Into the body of a "duckupine" crammed!

-From the Nonsense Poems of Sukumar Ray, my translation.

Ray's gallery of "hybrid" animals, c. 1923.

Recently Mike Witmore, Jonathan Hope and Mike Gleicher wrote an exciting and provocative note on using computational models developed for phylogenetics for the analysis and classification of literary texts. While outlining the ways in which the methods used to identify genetic transmission in bioinformatics might be applicable to a large corpus of literary data, they point out an interesting caveat through a metaphor developed more fully in Prof. Hope's essay on the evolution of standard English:

In biology, traits (or the genes that produce them) have to be passed down in a closed, continuous way.[...]

None of this holds for language though. If we’re writing a ‘history,’ we’ll want to borrow some traits from other histories – but we don’t have to take everything from other histories, and we can take traits from pretty much any histories we happen to have read: old, recent, famous, unknown. So the status of generic traits is very different to genetic ones.

In addition, we are not confined to our own linguistic species. If we want, we can introduce traits from a completely different species to produce something new. In language, if you want a bat, you can cross rats with sparrows. In biology, you have to wait for one to evolve.

What, then, do we gain by using a phylogenetic model for mapping the development of genres? Evidently, we acquire a set of polished tools evolved over decades to meet stringent standards. For a new field like digital humanities, where scholars are just starting to explore the radical possibilities offered by technology and where lacunae and knowledge gaps are still significant, such a tried and tested toolbox can be extremely useful. But I'd like to propose that we might have more to lose in the long run than we stand to gain. It is precisely because literary informatics is a young field that we need to be very careful indeed about the seduction of metaphor, more so when it presents itself with the weight and authority that scientific discourse can muster. It would indeed be a missed opportunity if literary scholars did not look to other fields to gain any insights they can, but they would do well to build their own data and analytic tools, as much as possible, from the ground up if the particular characteristics of literary texts - the idiosyncratic "grammar" that makes them unique - are to be accounted for.

In this sense, I would argue, it is crucial that digital literary studies strive to maintain continuity rather than position itself in a relation of radical discontinuity to traditional literary scholarship. In thinking about the implications of the entire corpus of early modern literature as one massively addressable object, I have often returned to an important caveat that Prof. Witmore has repeatedly emphasized in his talks - that the mass of data spewed out by DocuScope (the program being used in the data-mining project at UW-Madison) and then rendered as PCA distributions by clustering algorithms will mean very little to someone who is not a scholar of early modern literature. While this persistence of the human factor - the value of human expertise in the face of a massive assault by "reading machines" - might reassure us, it is also important to interrogate the reasons behind it.

What is the peculiar characteristic of literary data that can be mapped by machines but can only be meaningfully grasped by competent human readers? An introductory textbook on genetics I was recently reading pointed out that biologists often deal with "laws" that many physicists would call mere "probabilities." But when compared to the fluidity of literary data, the approximations involved in gene mutations might seem like exercises in perfect precision. To be sure, scientific knowledge is not a static monolith. It is constantly debated, revised, updated. But to a very large extent there can be agreement on the possible outcomes of processes. Even if one cannot combine a rat and a sparrow to create a bat, one is at least likely to agree on what constitutes a bat. In a literary text, on the other hand, reasonable people might disagree on whether a given object is a bat or a baboon! It might serve as either, or both, depending on the particular conditions of reading. Recently at the Digital Development Division of the UW-Madison Libraries, while discussing the development of software that would allow users to quickly tag a large number of texts with metadata on genre, we tried to come up with a possible standardized list of genres for early modern texts. We soon realized that there were hardly any texts on whose classification all scholars would agree!

This is the unique complexity, the challenge, and the richness of the literary object. It exists primarily within the fragile and fluid realms of the social, and its peculiar "grammar" is one that allows strange hybridities that would be unacceptable within other disciplines. But it is this very hybridity that we must preserve - to banish it would be to reduce immeasurably the complexity and nuance of literary analysis. Any model that we use must preserve that grammar and not lure us with an algorithmic order imposed on what is essentially chaotic and fluid. So, even as we explore the tools and methods used in various disciplines from bioinformatics to market analytics, I think we should be very careful about how well they fit our data and our methods. There is some merit in building our own wheels without actually reinventing the wheel. The mathematics and statistics that underlie large-scale clustering operations in a range of disciplines are essentially the same, but individual tools are adapted to the quirks of particular disciplines. If biologists expect "mistakes" in copying genes, market analysts expect rapidly changing information. We might need to go to these sources, take more time to understand the various techniques available, and then collaborate with mathematicians, statisticians, and computer scientists to construct models of analysis that fit our particular needs, our particular "grammar."