Data Curation: From Access to Analysis
Posted on: 14 Sep '10

While digital humanities research has made rapid advances in recent years, it is still in its infancy as a field of inquiry. It is only in the last decade or so that computing platforms powerful enough for such research have come into regular use in humanities departments. For all the advances, this first decade in the life of the young discipline has mostly been about ground-laying and exploration. On the one hand, scholars in the humanities have embraced new media technologies to facilitate teaching in the classroom. On the other, they have looked to other disciplines for tools, models and metaphors for thinking about technology. Their willingness to embrace such cross-disciplinary collaboration is evident in the speed with which techniques ranging from geo-spatial analysis to a plethora of visualization methods have been adapted for representing literary data, whether as maps or as networks of human interaction.

Digital Access

But the most important groundwork, the work that has defined the tenor of the last decade, has been more mundane - the tedious process of digitizing the corpus of texts that forms the foundation of literary scholarship. From Google's trillion-word corpus, to specialized databases such as EEBO and ECCO, to individual projects focused on particular areas or authors, there has been steady progress in the digitization of texts, to the point that for many areas of literary studies all but the most esoteric texts are now available digitally in one form or another. However, even though the TEI standard - which, in its latest P5 incarnation, has matured into a formidable encoding tool - has provided some measure of stability and standardization, digitization efforts remain scattered across a wide range of institutions and evolving practices. Institutional barriers, as much as technological ones, have produced pockets of rich information that are not properly curated or that cannot easily talk to each other. As a result, many a digital edition that has been painstakingly encoded in TEI has been confined to a simplistic interface, its interactivity reduced to little more than a "search" box, serving only to ease access - instead of visiting a library to look at an old edition or a manuscript, you can now call it up at the click of a mouse.

Digital Analysis

Even after the painstaking work that has gone into digitization, many corpora need massive curation efforts before they can be used for large-scale analysis. Much of the technology for developing a standardized corpus - one that will let the fragmented pieces of the puzzle fit together across platforms and disciplines - is already being developed. I have been working with tools from the MONK project, which takes a giant step toward such standardization. The Bamboo Project, which got under way this fall, promises a shared infrastructure that will allow common access to a set of major databases and let users apply a shared set of tools for the analysis of these texts. It will be exciting to see how these efforts turn out and whether the next decade sees a decisive turn toward a shared infrastructure for the curation and analysis of texts.

The production of a carefully curated and maintained corpus of core data would be essential for the development of analytical tools that can penetrate beyond the mere surface of texts and transform our approach to literary scholarship in fundamental ways. Digitization alone effectively solves the problem of access, but we will need to curate these collections so that they work together and so that large-scale analysis becomes possible.
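
To make the point a little more concrete: before analysis can run across collections, each text typically has to be reduced to some common, minimal form. The sketch below (Python, standard library only) shows one way a TEI file might be flattened into a plain record of metadata plus running text. The header paths follow the usual TEI P5 conventions, but everything else - the function name, the record shape - is illustrative rather than a description of any existing tool.

    import re
    import xml.etree.ElementTree as ET

    # TEI P5 documents live in this namespace.
    TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

    def curate(path):
        """Flatten one TEI P5 file into a minimal, analysis-ready record."""
        tree = ET.parse(path)
        title = tree.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
        author = tree.findtext(".//tei:titleStmt/tei:author", default="", namespaces=TEI_NS)
        body = tree.find(".//tei:text", TEI_NS)
        # Strip the markup but keep the running text for large-scale analysis.
        text = " ".join(body.itertext()) if body is not None else ""
        return {
            "title": title.strip(),
            "author": author.strip(),
            "text": re.sub(r"\s+", " ", text).strip(),
        }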

To this end, I want to mention a very interesting trend I have been noticing both in new digital humanities projects and in discussions on the TEI-L mailing list. We're increasingly seeing digital texts of higher editorial quality, carefully encoded in TEI to preserve the full diversity of manuscript or print features and offering variant readings and cross-references. I believe we will eventually see something akin to the evolution of variorum editions within digital scholarship - a move from the hastily printed quarto to the meticulously compiled scholarly tome. There are signs that this is already under way, with ever more detailed and innovative encoding projects being announced, and if scholars interested in large-scale text analytics can develop an open standard that allows resources to be shared across projects and platforms, then the rich metadata embedded in such editions could prove a productive site for future research.
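
As a rough illustration of what that embedded metadata makes possible, here is a minimal sketch (again Python, standard library only) of pulling variant readings out of a TEI P5 critical apparatus. The file name and the exact encoding conventions are assumptions on my part - real editions differ in how, and whether, they record their variants.

    import xml.etree.ElementTree as ET

    # TEI P5 documents live in this namespace.
    TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

    def extract_variants(path):
        """Collect lemma/reading pairs from a TEI critical apparatus.

        Assumes variants are recorded with <app>, <lem> and <rdg> elements.
        """
        tree = ET.parse(path)
        variants = []
        for app in tree.iterfind(".//tei:app", TEI_NS):
            lemma = app.findtext("tei:lem", default="", namespaces=TEI_NS)
            for rdg in app.iterfind("tei:rdg", TEI_NS):
                variants.append({
                    "lemma": lemma.strip(),
                    "reading": (rdg.text or "").strip(),
                    # Sigla of the witnesses attesting this reading, if encoded.
                    "witnesses": rdg.get("wit", ""),
                })
        return variants

    # Hypothetical usage, with an invented file name:
    # for v in extract_variants("satire_iv.xml"):
    #     print(v["lemma"], "->", v["reading"], v["witnesses"])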

Edit: Soon after posting this, I noticed a couple of sophisticated online variorum projects, including the Donne Variorum hosted at Texas A&M. Even more exciting is the upcoming Dynamic Variorum project - part of the "Digging into Data Challenge" - which seeks to create variorum editions for the entire corpus of Greco-Roman texts in the Perseus Digital Library. This should be interesting!