I have been keen on finding ways to visualize plot structure as a network. Here I parse character relations from the Shakespeare corpus and visualize the results as a weighted interactive network that the user can adjust to explore different aspects of plot.
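The core of such a network is simple: characters are nodes, and an edge between two characters is weighted by how often they appear together. Here is a minimal stdlib sketch of that weighting step, assuming speaker lists have already been extracted per scene (the scene data below is hypothetical, standing in for what a parser would pull from the corpus):

```python
from collections import Counter
from itertools import combinations

# Hypothetical speakers grouped by scene; in practice these would be
# parsed from the speech tags of a Shakespeare corpus.
scenes = [
    ["Hamlet", "Horatio", "Marcellus"],
    ["Hamlet", "Gertrude", "Polonius"],
    ["Hamlet", "Horatio"],
]

# Weight an edge by the number of scenes two characters share.
edges = Counter()
for speakers in scenes:
    for a, b in combinations(sorted(set(speakers)), 2):
        edges[(a, b)] += 1

for (a, b), w in sorted(edges.items()):
    print(f"{a} -- {b}: {w}")
```

The resulting weighted edge list can then be handed to any network visualization layer; letting the user filter edges by weight is one way to make the network "adjustable."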
Thomas Heywood is said to have claimed in 1633 that he had had "either an entire hand, or at least a maine finger" in the composition of 220 plays. Even if exaggerated, his claim points to the intensely collaborative nature of theatrical production in early modern London, a process that was as much economic division of labor as artistic collaboration. But any student of early modern drama who has pored over editorial notes on the style and habits of individual contributors will recognize that even in this strangely ad-hoc conveyor belt of cultural production that churned out plays to meet the insatiable demands of the London audience, there were genuine spaces of artistic collaboration.
What features among the categories generated by DocuScope contribute most to genre? In other words, if we can think of genre as the effect of underlying linguistic features, how can we begin to analyze what combinations define particular genre categories?
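One crude but instructive first pass at this question is to rank features by how strongly their mean frequencies separate two genre groups. The sketch below uses hypothetical DocuScope-style category frequencies (the category names and numbers are invented for illustration, not real DocuScope output):

```python
# Hypothetical per-text category frequencies, labelled by genre.
texts = [
    ("comedy",  {"FirstPerson": 3.1, "Negativity": 0.8, "Description": 1.2}),
    ("comedy",  {"FirstPerson": 2.9, "Negativity": 0.7, "Description": 1.4}),
    ("tragedy", {"FirstPerson": 1.6, "Negativity": 2.4, "Description": 1.3}),
    ("tragedy", {"FirstPerson": 1.4, "Negativity": 2.6, "Description": 1.1}),
]

def genre_mean(genre, feature):
    vals = [feats[feature] for g, feats in texts if g == genre]
    return sum(vals) / len(vals)

features = texts[0][1].keys()
# Rank features by the gap between the two genres' mean frequencies.
gaps = {f: abs(genre_mean("comedy", f) - genre_mean("tragedy", f))
        for f in features}
for f, gap in sorted(gaps.items(), key=lambda kv: -kv[1]):
    print(f, round(gap, 2))
```

A real analysis would of course want a proper classifier or factor analysis rather than raw mean differences, but even this toy ranking makes the underlying question concrete: which categories carry the genre signal?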
The King James Bible just turned 400. And what better way to pay tribute to "the greatest book ever written by committee" than to mine it and visualize the results? I was building tools for mining a large corpus of sixteenth and seventeenth century texts and decided to test some of those techniques by extracting patterns from the Gutenberg text of the KJV and visualizing the results as a network.
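One simple pattern to extract from a plain-text Bible is name co-occurrence within verses. The sketch below shows the idea on a few stand-in verses; the capitalized-word heuristic and the stopword set are deliberately crude assumptions, and a real pass over the Gutenberg text would need a proper name gazetteer:

```python
import re
from collections import Counter
from itertools import combinations

# A few short stand-in verses in place of the full Gutenberg KJV text.
verses = [
    "And Abraham said unto Isaac, my son, God will provide.",
    "And Isaac spake unto Abraham his father.",
    "And Jacob went out from Beersheba.",
]

edges = Counter()
for verse in verses:
    # Crude heuristic: capitalized words, minus obvious non-names.
    names = set(re.findall(r"[A-Z][a-z]+", verse)) - {"And", "God"}
    for a, b in combinations(sorted(names), 2):
        edges[(a, b)] += 1
```

Verse-level windows are narrow; widening the window to the chapter, or counting names linked by "begat," would yield quite different networks from the same text.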
I wrote a program called LATtice that lets us explore and compare texts across entire corpora but also allows us to "drill down" to the level of individual LATs to ask exactly what rhetorical categories make texts similar or different. To visualize relations between texts or genres, we have to find ways to reduce the dimensionality of the vectors, to represent the entire "gene" of the text within a logical space in relation to other texts.
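Before reducing dimensionality, the basic move is to treat each text as a vector of LAT frequencies and measure distances between those vectors. The sketch below is not LATtice's actual implementation, just a generic cosine-similarity comparison over invented frequency vectors, to make the "logical space" idea concrete:

```python
import math

# Hypothetical LAT frequency vectors, one value per LAT, for three texts.
vectors = {
    "text_A": [3.0, 0.5, 1.2, 0.1],
    "text_B": [2.8, 0.6, 1.1, 0.2],
    "text_C": [0.4, 2.9, 0.3, 1.8],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means identical rhetorical profiles."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(vectors["text_A"], vectors["text_B"]))  # similar profiles
print(cosine(vectors["text_A"], vectors["text_C"]))  # dissimilar profiles
```

A pairwise similarity matrix like this is the usual input to dimensionality-reduction techniques such as PCA or multidimensional scaling, which then place each text as a point in a two-dimensional plot.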
I have been thinking about ways to account for structural development within texts in the computational analysis of genres, and this poses a very interesting problem. Can we track the process of comic resolution or the buildup to tragic conflict within particular plot-lines through the analysis of pronoun clusters? And might such intra-textual tracking provide further clues to the linguistic signature of genres? To answer this question, I wrote a program to extract frequency counts from TEI-encoded files in a way that would let us account for changes within texts.
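The key step is to count within equal-sized segments of a text rather than over the whole, so that a frequency becomes a trajectory. A minimal sketch of that windowing, with a toy token list standing in for words pulled from a TEI file (the pronoun set here is an illustrative assumption):

```python
# Divide a text into equal segments and count first-person pronouns in
# each, so the count becomes a trajectory across the text. In practice
# the tokens would come from parsing TEI-encoded files.
tokens = ("i will not stay he said and i went forth "
          "they came and they saw the city and it fell").split()

PRONOUNS = {"i", "we", "me", "us"}

def windowed_counts(tokens, n_segments=4):
    size = max(1, len(tokens) // n_segments)
    segments = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    return [sum(1 for t in seg if t in PRONOUNS)
            for seg in segments[:n_segments]]

print(windowed_counts(tokens))
```

Plotting such per-segment counts for different pronoun clusters, act by act or segment by segment, is one way to watch a play's structure unfold.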
Before setting out to create a digital map of early modern London, we might ask what information scholars in the humanities would like to find when they look at such a map. Of course, the geographical features of early modern London, especially as represented on historical maps, are in themselves of considerable interest. But most likely, scholars are just as interested in the social relations that early modern spaces represent as part of the socio-economic and cultural milieu of early modern London.
In my last blog entry, I described some tentative approaches that might help us integrate the large scale analysis of textual corpora with the traditional practices of literary research. The main problem seemed to be one of simulating "learning," but more importantly it was one of modeling the kind of negotiations and revisions - reading and re-reading - that mark literary research. How could we communicate these shifting emphases to algorithms in ways that could focus the data-set on particular aspects of interest to the project?
The Digital Inquiry Group at UW-Madison was recently discussing ways in which a large digital corpus of literary texts can be used to supplement traditional humanities research approaches. I volunteered my own research as a test case. I am interested in the ways in which representations of crime and criminals are used in early modern print culture and drama. How would I go about exploring this with the existing corpus of EEBO-TCP texts and with tools like DocuScope or the MONK workbench (assuming the entire TCP corpus were made available there)?
Recently Mike Witmore, Jonathan Hope and Mike Gleicher wrote an exciting and provocative note on using computational models developed for phylogenetics for the analysis and classification of literary texts. While outlining the ways in which the methods used to identify genetic transmission in bioinformatics might be applicable to a large corpus of literary data, they point out an interesting caveat.
I have been playing around with tools from the MONK project. I have written before about the MONK Workbench, which I found to be a brilliant tool for introducing people to the possibilities of linguistic analysis and even, within the limits of its data-set, for doing some pretty nifty research. However, I found MorphAdorner to be a much more exciting tool because of the control it can give individual projects over their data-sets.
In spite of all the advances that have been made in the digital humanities, the first decade in the life of this young discipline has mostly been about ground-laying and exploration. On the one hand, scholars have embraced new media technologies to facilitate teaching in the classroom. On the other, they have looked to different disciplines for tools, models, or metaphors for thinking about technology. But the most important groundwork that has defined the tenor of the last decade has been more mundane: the tedious process of digitizing the corpus of texts that forms the foundation of literary scholarship.