In an intriguing post on the Wine Dark Sea blog, Jonathan Hope uses the Wordhoard tool from the MONK project to compare the relative occurrences of certain pronoun classes across Shakespeare's Tragedies and Comedies. The results confirm the general trend Hope and Witmore had noted in this article about Shakespeare's use of rapid exchanges or negotiations between individuals to drive towards the comic resolution of conflicts. The frequency of I/you and my/your strings turns out to be higher in the comedies and, conversely, we/our clusters occur more often in the tragedies. This might be surprising because we think of comedies in terms of festive communal settings that affirm shared values, while tragedies are marked by lone protagonists and the dismantling of any shared sense of community. But the finding draws our attention to the structural development of the plays: the linguistic devices playwrights use to build up to the climax of comic resolution or tragic catastrophe. What struck me was the importance of the sense of linear progress of plot:
Shakespearean comedies typically involve people arguing about things, striving to arrive at a ‘we’ of agreement, but not being able to until the final scene.
I have been thinking about ways that we can account for structural development within texts for the computational analysis of genres, and this poses a very interesting problem. Can we track the process of comic resolution or the buildup to tragic conflict within particular plot-lines through the analysis of such pronoun clusters? And might such intra-textual tracking provide further clues to the linguistic signature of genres? To answer this question, I wrote a program to extract frequency counts from TEI-encoded files in a way that would let us account for changes within texts. I chose a rather arbitrary set of plays based on personal preferences (hence the prevalence of city-comedy) and on the availability of suitable quarto texts from the EEBO-TCP corpus. I regularized these texts with the Morphadorner tool so I could access lemmas, and used a few Unix system utilities to convert everything to UTF-8 encoding.
The program itself is a bit of a hurriedly written hack, but it works well for my purposes. Developed in Ruby, the script essentially breaks each play down into a specified number of equal-sized chunks and counts the frequencies of "I," "you," and "we" clusters. So if a particular play has, say, 4000 words and I use 40 as the number of chunks (this can be thought of as granularity), the script breaks the play into 40 sections of 100 words each, counts word frequencies within them, and writes all the data to a CSV file. This allows us to trace how occurrences change within the plot, in addition to comparing texts and genres. Since I am primarily interested in patterns within particular texts, I don't need relative frequencies and used straight word counts, but it would be easy to convert this data into relative frequencies. I ran about 30 early modern plays drawn from the works of Marlowe, Shakespeare, Jonson, Dekker, Middleton, Ford and Webster through the program, and the resulting data throws up some interesting possibilities and challenges.
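To make the chunk-and-count step concrete, here is a minimal sketch of the idea in Ruby. It is not the original script: the word lists in CLUSTERS, the naive tokenizer, and the method names are all assumptions made for illustration, and it works on plain text rather than on lemmas pulled from TEI files.

```ruby
# Hypothetical pronoun clusters; the exact word forms counted in the
# original script are not given in the post.
CLUSTERS = {
  "I"   => %w[i me my mine myself],
  "you" => %w[you your yours yourself],
  "we"  => %w[we us our ours ourselves]
}.freeze

# Naive tokenizer (an assumption): lowercase, keep runs of letters and apostrophes.
def tokenize(text)
  text.downcase.scan(/[a-z']+/)
end

# Split tokens into exactly n consecutive chunks of (nearly) equal size.
# Any leftover tokens are folded into the final chunk.
def chunk_tokens(tokens, n)
  size = tokens.length / n # integer division gives the base chunk size
  raise ArgumentError, "more chunks than tokens" if size.zero?

  chunks = tokens.each_slice(size).to_a
  chunks[n - 1] = chunks[(n - 1)..-1].flatten(1) if chunks.length > n
  chunks.first(n)
end

# Count how many tokens in one chunk belong to each cluster.
def cluster_counts(chunk)
  CLUSTERS.transform_values { |forms| chunk.count { |w| forms.include?(w) } }
end

# One CSV row per chunk: chunk number, then the raw count for each cluster.
def counts_to_csv(tokens, n)
  rows = [["chunk", *CLUSTERS.keys]]
  chunk_tokens(tokens, n).each_with_index do |chunk, i|
    rows << [i + 1, *cluster_counts(chunk).values]
  end
  rows.map { |row| row.join(",") }.join("\n")
end

sample = "I say you owe me. You say I owe you. We agree at last, and our quarrel ends."
puts counts_to_csv(tokenize(sample), 2)
```

With a 4000-word play and 40 chunks, chunk_tokens returns 40 lists of 100 tokens each, matching the example above. Dividing each count by the chunk length would turn the raw counts into relative frequencies.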
As is to be expected with literary data, no uniform pattern holds across all texts. However, there are certain intriguing patterns that begin to emerge. Here, for example, is the cumulative graph for the "I/my" group in comedies. As plots move towards resolution, these occurrences seem to gradually fall off.
As with all data generated by the use of computational techniques on literary texts, we must ask the central question - what does this mean? Are we seeing the success of the language of comic negotiation that Hope and Witmore traced in the quick exchange of I/you patterns? If so, is "I" replaced by the "we" of shared community?
The evidence is not quite as clear here, but there seems to be an emerging pattern where "we" occurs more frequently toward the end of plays. But literary data is nothing if not quirky. It thrives not on conformity but on uniqueness and often the individual stamp of an author or the peculiarity of a plot-line will throw off an emerging pattern within a group. In this case a few plays like Satiromastix and The Devil is an Ass, among others, produce some outliers that contradict the general trend and throw off the graph. We can, however, confirm certain broad trends within texts and the entire data-set offers some enticing conjectures to explore.
But the pivotal critical purchase to be gained from this kind of analysis lies in the way it enables us to look within individual texts. Do these patterns hold over particular plots? Can genres be distinguished with a degree of certainty depending on whether usage of these key word-groups increases or decreases over the course of the play? Here, for example, is a polynomial regression line fitted to a scatterplot that indicates the use of "we" in Doctor Faustus.
If the disintegration of Faustus' world is mapped in the negative slope of the regression line, the following regression line for "I" in the play charts Faustus' increasing struggle with himself.
To a large degree the regression lines for "we" in tragedies have negative slopes while the lines mapping "I" have positive slopes. As tragic plots progress, in other words, their protagonists become more and more individualistic as shared notions of community disintegrate around them. Does the expected opposite hold true for comedies? Here is the graph for "we" in The Shoemaker's Holiday, which affirms the triumph of the shared communal spirit that is the hallmark of comedy.
Conversely, the progression of the comic plot transcends the individual, producing the expected negative slope for the regression line for "I".
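The slopes discussed above can be checked numerically from the per-chunk counts. The sketch below fits a straight line by ordinary least squares, which is a simplification: the Faustus plot above is described as a polynomial fit, and the post does not say which tool produced the graphs. The method name trend_slope and the sample counts are invented for illustration and are not taken from the data-set.

```ruby
# Ordinary least-squares slope of per-chunk counts against chunk
# position (1, 2, ..., n). A positive result means the word group
# becomes more frequent as the play progresses; a negative one, less.
def trend_slope(counts)
  n = counts.length
  raise ArgumentError, "need at least two chunks" if n < 2

  xs = (1..n).to_a
  mean_x = xs.sum.to_f / n
  mean_y = counts.sum.to_f / n
  covariance = xs.zip(counts).sum { |x, y| (x - mean_x) * (y - mean_y) }
  variance = xs.sum { |x| (x - mean_x)**2 }
  covariance / variance
end

# Invented counts, for illustration only.
rising_we = [0, 1, 1, 2, 3]
falling_i = [9, 7, 8, 5, 4]

puts trend_slope(rising_we).positive? # true: "we" increases across chunks
puts trend_slope(falling_i).negative? # true: "I" decreases across chunks
```

Comparing only the sign of the slope per play is a crude summary, but it is enough to ask whether a "we" line rises in a comedy and falls in a tragedy.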
I will explore the data in more detail and post further analysis of individual plays, but from my observations thus far, these patterns of regression slopes seem to hold within genres to a remarkable degree. Even when the complete data-set might not reveal a readily discernible pattern, the basic movements of tragic and comic plot progressions seem to be preserved to a large extent. As readers, our sense of genre is tied to a degree to the notion of the linear progression of plot. This sense of the linearity of narrative can be difficult to capture with computational approaches that treat the entire text as one unit. By splitting up the text and comparing the development of linguistic trends within the resulting chunks, we can at least partially imitate this readerly sense of genre and capture a quantitative approximation of it.
In closing I want to note that these results were obtained with a relatively crude data-set - mostly early seventeenth century quarto texts where the TEI markup did not allow us any degree of control over selecting individual speakers etc. As better curated texts become available and we can draw on a larger set of meta-data, we will be able to embark on far more nuanced analyses of the early modern corpus. Even so, at this early stage and with not much more than a quickly cobbled together program, these results seem to hold out exciting promises for the future.