Decompiling Shakespeare: A Semi-Supervised Classification Approach to Genre
Posted on: 02 Jan '12

A network representing the five "nearest neighbors" of each Shakespeare play.

The quantification of certain features of literary texts for digital analysis can present unique interpretive challenges. A text analysis program like DocuScope can provide us with a rich set of features - in the case of DocuScope relative frequency counts of a 101 rhetorical categories, to be exact - but going from this mass of numbers to a particular interpretation of a text or genre remains challenging, if not a fundamentally alienating experience. Of course, certain elements might stand out immediately when quantified and can lead to useful insights. For example, one might readily notice the prevalence of the "SenseObject" category in A Midsummer Night's Dream and reread the play in the light of this observation. But most significant aspects of texts or genres that a literary scholar might be interested in are far more difficult to extrapolate from quantitative data. When we use dimension reduction techniques such as principal component analysis (PCA), or a variety of clustering techniques to look at the data, we are trying to get from the numbers to the underlying patterns that might interest us. To be sure, this problem is not unique to literary or humanistic data. Scientists from a variety of fields dealing with far less subjective data struggle to extract the parts that actually provide significant information for the problem at hand from the mass of data that can be available. Feature selection (along with dimension reduction) forms perhaps the most crucial part of any machine-learning algorithm that tries to extract "sense" from a mass of data.

### "Genre" as Feature-Set

What features among the categories generated by DocuScope contribute most to genre? In other words, if we can think of genre as the effect of underlying linguistic features, how can we begin to analyze what combinations define particular genre categories? Prof. Witmore proposed an interesting version of this question when he asked how, given the clustering techniques we have at our disposal, might we think of the "ideal" of a genre and how such a designation would affect other texts in the corpus? Say, we decide that Twelfth Night is the "ideal" Shakespearean Comedy. How would such a designation rearrange the other texts in the corpus?

To approach this problem, I want to begin with a simple force-field network generated from DocuScope data for the Shakespeare corpus that represent the five "nearest-neighbors" of each play. The nodes are color-coded by genre and the edges - the links between nodes - are based on the "distance" between the vectors that represent the plays. The thickness of edges represent this distance and their colors, the average of the nodes they connect, throw inter-genre edges into quick relief. While networks are better suited to representing relationships between "nodes" than for clustering applications, weighing the "edges" in terms of a distance metric produces a pseudo-clustering effect, as is evident here.

The idea of "distance" between literary texts might seem odd but it is simply the Euclidean distance between the numeric vectors that represent each text:

$$d(x,y)=\sqrt{\sum_{i=1}^{n}{(x_i-y_i)^2}}$$

Essentially, this is a quantitative indicator of what we might think of as similarities or differences between two texts. We must note, of course, that giving equal weight to every feature doesn't allow us to distinguish between texts at the multiple nuanced levels that a good close reader might. For example, one might say of King Lear that while it is generically similar to the other tragedies, there are elements of humor in it that recall the Henry IV plays. And in a sense, it is precisely this kind of nuance that we are trying to introduce into our digital analysis by differentiating among the features that DocuScope generates. Once we calculate the distance between texts, we can reduce the 101-dimensional vector-space into two dimensions by using "multi-dimensional scaling," a technique that tries to get the best representation of distances between nodes until the distance between a node-pair can't be adjusted without adversely affecting the distance between another pair of nodes.

Even with this relatively crude method of dimension-reduction, it is obvious that a very distinctive clustering pattern exists in this network representation in spite of significant anomalies and exceptions. There is much that is interesting and suggestive here, both things that seem to reaffirm our expectations and the striking anomalies that warrant further investigation. For example, it might be interesting that Cymbeline and Othello(see Hope and Witmore on Othello's comic strain) end up closer to the comedies, that the very late Henry VIII shows affinities to other late plays. Just as interesting and perhaps even more productive and provocative might be the anomalies - Titus Andronicus aligning with the histories or the comedies that remain stubbornly uncomic in this force-directed universe. However, informative and challenging as such a network might be, it remains to a large extent a "black-box" process, both because of the rather forced reduction of a 101 dimensions to 2 based on a single distance metric and because there is no easy way to go back from the observation of a node or an edge on the graph to the array of numerical patterns behind it and, ultimately, to the aspects of the texts that produced those numbers.

### An Approach Using K-Means Clustering

It is obvious that the features that DocuScope is picking up contain a strong "signal" for genre, aspects that are obviously contributing to the clustering effect that tends to group genres together. However, there is enough "noise" that the partitioning is not neatly demarcated and more importantly it is difficult to determine which features contribute the most to genre distinctions. These problems are not merely artifacts of the network representation, but part of the difficulty of dealing with such a high-dimensional data-set.

To verify this, I tried to separate the Shakespeare corpus using k-means clustering, a widely used clustering algorithm that takes a set of points and tries to separate it into k groups. While there are many variations to this approach, what the k-means algorithm essentially does is group a given set of points into k clusters based on distance. To do this the algorithm begins by choosing k random points within the vector space and then calculates the distance of each node from these starting points. A node is assigned to the point it is closest to, thus forming k clusters. After each iteration, however, the algorithm computes the centroid of all the points assigned to a cluster and, taking these k centroids as the key points, repeats the clustering process. The process continues until further iterations no longer change cluster assignments (or until a given number of iterations are complete). K-means demonstration.

Comparing k-means clustering on the Iris dataset and the Shakespeare dataset.

While k-means is a very useful algorithm for unsupervised clustering of a variety of data, its effectiveness is diminished for data-points that are not clearly separated. For example, I used a sample data-set that comes with the R statistical package to perform multiple k-means clustering with randomized starting points. This data-set represents sepal and petal lengths for iris species and is clearly separated into distinctive groups when we perform PCA on it and plot the first two components. As expected, multiple runs of the k-means algorithm produces the same clustering each time. Compare this, however, to the "fuzzy" Shakespeare data on the bottom row where the same k-means clustring trying to separate the points into two groups results in different clusters on each run.

The Shakespeare data-set is characteristic of literary data in that it has a relatively high number of dimensions and yet not many of them are highly correlated. We cannot expect distinct separation from such data and even techniques like PCA need several principal components to capture most of the variation (dimension reduction is much easier on highly correlated data where the first couple of principal components are often enough to explain most of the variation). In fact, increasing the number of clusters results in even more irregular clusterings. This can be demonstrated with a single line or R code. I ran k-means 5 times on the same Shakespeare data - contained in a matrix named "mat" - trying to separate out 3 clusters and printed out the resulting cluster assignments:

> for(i in 1:5){print(as.numeric(kmeans(mat, 3)\$cluster))}
[1] 2 1 1 1 2 1 2 1 1 2 1 1 1 2 1 1 1 2 3 2 2 1 2 1 3 3 3 1 1 3 3 3 3 3 1 3
[1] 3 1 3 1 2 3 3 3 1 2 1 3 1 2 1 3 1 2 1 2 2 1 3 1 2 2 2 3 1 2 1 2 2 2 3 2
[1] 2 3 3 3 2 1 2 1 3 2 3 1 3 2 3 3 3 2 1 2 2 1 2 1 2 1 2 1 1 1 1 1 1 2 1 3
[1] 2 1 2 1 3 2 2 2 1 3 1 2 1 3 1 1 1 3 1 3 3 1 2 1 3 3 3 1 1 3 1 1 3 3 2 3
[1] 3 1 1 2 2 1 3 1 1 3 1 1 1 3 1 1 1 3 2 3 3 1 3 1 2 2 2 1 1 2 2 2 2 2 1 2


As we can see, the clusterings change significantly each time the algorithm runs. While k-means clustering has the advantage of working on the actual high-dimensional data-set rather than simply using the distance-matrix as the force-field algorithm did, it proves equally limited in distinguishing genre clusters for our Shakespeare data-set.

### From Unsupervised Clustering to Semi-Supervised Classification

Of course, speaking of the "ideal" of genre as if it were a specific, definable set of expectations is absurd. After all, genre only exists in its particular manifestations, or "utterances" as actual literary works. Any notion of an ideal or and essence that defines a genre must be provisional, an extrapolated langue to the parole that is a literary text. Mindful of this caveat, it is still useful to ask what it means to think of the "ideal" of a genre in terms of a quantitative vector-space. If we imagine texts as defined by a set of vectors, then each vector is a point in an N-dimensional space. Say, the 14 Shakespearean comedies are points in such a 101 dimensional DocuScope space. We might reasonably expect that the "ideal" of this set of points would lie in the centroid of this point cloud, where the centroid can be determined by simply calculating the mean of each dimension.

In fact, this is precisely what k-means clustering does at each iteration for each group it identifies. So, if we could eliminate the effects of the random starting seeds might it be possible to use the k-means algorithm to identify the ideal of a genre in a way that aligns with our categorizations? In other words, could we guide the k-means algorithm to find three plays which best explain the three Shakespearean genres? If we can, these plays would be the closest to the "ideals" of their genre within the Shakespere corpus - the comedy, tragedy or history closest to the centroid of the point clouds that make up their genre cluster. To do this, I wrote a script in R that uses a modified version of k-means clustering often called seeded k-means. There are several other ways of influencing how k-means chooses seeds or sorts particular nodes, but seeding provides the most convenient way of controlling the algorithm for our purposes. Instead of allowing the algorithm to choose a set of k random starting points within the vector space, I forced it to use a set of plays as the starting points of the algorithm. What this effectively does is designate that play as the "center" or "ideal" of a genre. The culstering algorithm then proceeds as normal - calculating distances and assigning each play to the closest centroid and reitrating this process till a stable grouping emerges.

The three plays chosen as seeds serve the purpose of Prof. Witmore's genre ideal. By designating a play as the ideal comedy we skew the genre-space to see what other plays are categorized with it. This approach can obviously result in many absurd classifications, allowing us to ask questions such as "if Macbeth were a comedy, what other plays would be classified as comic?" But setting aside such wild skewings of the genre-space we can also look for combinations of plays that best help explain our subjective human categorizations better.

Essentially, we are using our meta-data - the genre tags about the play that we already have - to turn the unsupervised clustering algorithm into a semi-supervised classification process. At this point the process is similar to the training phase of a machine-learning algorithm where a data-set might be parsed in various ways to extract the features that contribute most to the correct classification. However, if we want 3 genres out of 36 plays, we must try out all possible seed combinations. How many combinations of seeds are possible? There are nCk ways to choose 3 items from 36 which calculates to 7140 possible combinations. The program runs through all 7140 possible seed combinations and calculates how many correct classifications are made for each iteration. Here is the screen dump from the program. It takes two arguments, the first is the DocuScope datafile itself and the second is the heading of the metadata column on which we want to try our classification. In this case it is "Genre" and the program can figure out that the column lists 3 different genres and so we are looking for 3 clusters:

> k-means.r -i shakespeare.csv -m Genre

Performing K-means for  36  items with  3  seeds.
A total of  7140  combinations....................................
Building model for best fit of k-means classifications with metadata.
Best fit is  32  out of  36.

Results:
Comedy                    History                   Tragedy
Comedy of Errors          2 KING HENRY VI           Titus Andronicus
Comedy of Errors          2 KING HENRY VI           Romeo and Juliet
Comedy of Errors          2 KING HENRY VI           Julius Caesar
Comedy of Errors          2 KING HENRY VI           Troilus and Cressida
Comedy of Errors          2 KING HENRY VI           Macbeth
Comedy of Errors          3 KING HENRY VI           Antony And Cleopatra
Comedy of Errors          3 KING HENRY VI           Coriolanus
Comedy of Errors          3 KING HENRY VI           Timon Of Athens
Comedy of Errors          3 KING HENRY VI           Cymbeline
...
...
Misclassified Texts:
Love's Labours Lost ( History )
A Midsummer Night's Dream ( History )
Tempest ( History )
KING RICHARD III ( Comedy )


Before moving on to the very interesting classifications the program generates, let's look at the working of the program. It begins by calculating all 7140 possible seeded k-means classifications and notes the output of each. This process can take several minutes and after it is complete, the program turns to the meta-data and checks how many classifications each iteration has guessed correctly. It turn out (as one would expect with literary data) that no combination explains all plays. The best fit is 32 right classifications out of 36 and even that occurs in only about .5% of all possible combinations. Finally the program prints out a list of the plays which can serve as the centroids or the "ideals" of these best fit cases and a set of plays that remain misclassified. The Comedy of Errors, it turns out, is the most "comic" play that Shakespeare wrote. Other genres, and especially tragedy, produce far less decisive results. While both 2 and 3 Henry VI can serve as the ideal history play, a whole set of tragedies can more or less equally exemplify tragedy. Finally the plays that are persistently misclassified include the three comedies that went off on their own in our force-field network, along with Richard III which seems to be on the borderline between history and comedy.

### What Next? Mapping Shakespearean Genres

To a large extent this result is unsurprising for anyone who has studied the Shakespeare data generated by DocuScope. All of these patterns can be glimpsed by carefully exploring the data with something like LATtice. This process, however, takes a more rigorous approach to separate out certain patterns and anomalies that other visualizations have persistently hinted at. For example, as the network diagram and LATtice indicates, tragedy shows much more linguistic variation in Shakespeare's hands rather than some coherent generic unity - an underlyting "type" that finds various manifestations. Tragedies, in a sense, stand alone, much like their protagonists, while comedies (leaving aside the very interesting aberrations) form a tight stylistic group.

How might we build on these results to move towards a more informed digital analysis of genre? Ultimately this might lead to a machine-learning implementation that can accept a literary text and classify it by genre with a high degree of accuracy - we manged to hit almost 89% here with the Shakespeare data-set. But while this might be an exciting algorithm to develop, it is more important to be able to analyze what characterizes genre. To that end, we might begin by studying the plays that typify the genres, as well as the ones that are misclassified. Looking at ideal plays might help us identify the combinations of rhetorical features that most distinguish genres from each other.

The neat separation of genres has proved difficult to achieve from our data even as all our visualizations have indicated strong tendencies for generic differentiation. Here, for example are the results of principal component analysis on the Shakespeare data coded by genre. As I mentioned above, more than two principal components are often needed to explain a large chunk of the variation in literary data-sets. So I plotted all possible combinations of the first four principal components. As we can see, all show tendencies toward separation - genres generally pull apart in different directions - but all have significant overlap and noise problems akin to what we observed in our network diagram: PCA plot mapping K-means demonstration.

Plots of the first four principal components from the Shakespeare dataset against each other.

Now here, again, is the plot of PC1 against PC3 (the second panel above) with what we learned from our R script integrated:

Separating Shakespearean genres: Plot of PC1 against PC3 with results of the supervised classification factored in.

The seeds of each genre for this particular iteration are plotted in larger size, but the most important transformation happens when we remove the four plays that are misclassified. Plotting those in black reveals a clear separation of genres with no overlap or ambiguity. Similar plots across other principal components show improvements in separation as well, although this combination is the most distinct.

Data-mining literary texts can provide exciting insights and challenges. It can confirm in novel ways what we already know, and it can open up new ways of thinking about texts. Using our meta-data categories to, as it were, "reverse-engineer" Shakespeare's language of genre will hopefully provide us with a better map as we navigate familiar texts within an unfamilar maze of quantitative data.