O London, [...] thou hast all things in thee to make thee fairest, and all things in thee to make thee foulest: for thou art attired like a bride, drawing all that look upon thee, to be in love with thee, but there is much harlot in thine eyes.
- Thomas Dekker, The Seven Deadly Sins of London, 1606
In my last blog post, I described an approach to mining data from large literary corpora that combines machine learning techniques with repeated mediation and reiterative scanning. Here I want to outline one possible application of such data-mining: a project that seeks to create an interactive map of early modern London. Such a project would try to glean information about places from the corpus of early modern texts and integrate this data with a digital map of London such as the one at the University of Victoria.
Mapping Spatial Data
Recent years have seen rapid advances in the use of spatial data in the humanities. Several easily accessible tools now allow the mapping of various kinds of spatial data extracted from humanities corpora. For example, the Mapping the Republic of Letters project at Stanford has used meta-data in innovative ways to trace the correspondence between intellectuals across early modern Europe. In short, very sophisticated tools for visualizing spatial data already exist in the form of GIS systems. There are even tools such as HyperCities that allow us to visualize spatial data on historical maps that have been stretched or "geo-rectified."

[Image: Section of Visscher's Panoramic View, 1616]
Before setting out to create a digital map of early modern London, we might ask what information scholars in the humanities would like to find when they look at such a map. Of course, the geographical features of early modern London, especially as represented on historical maps, are in themselves of considerable interest. But most likely, scholars are just as interested in the social relations that early modern spaces represent as part of the socio-economic and cultural milieu of early modern London. Within the framework proposed by Henri Lefebvre in The Production of Space, we might say that the map is a "representation of space," but what we try to understand are the "representational spaces" and the "spatial practices" that lie beyond it. In other words, early modern maps are only a part of the larger matrix of social valences, norms and practices that defined life in Shakespeare's metropolis. We need to put space together with its meanings to produce a truly interactive map. How can data-mining help us achieve this?
Mining Place Names: A Hybrid Approach
The early modern corpus presents challenges that are slightly different from those involved in mining more recent texts for spatial data. First, we must account for the vagaries of spelling and the fact that many early modern place names do not have clear modern counterparts. Thus, it is no surprise that our standard techniques for automatically modernizing spelling in the EEBO database won't work as well in the case of place names (this technique consists of applying lemma and POS tags with a linguistic analysis tool such as Morphadorner from the MONK project - see my earlier post for a discussion of this tool).
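Where lemmatization falls short, simple fuzzy string matching is one plausible fallback for catching variant spellings of place names. Here is a minimal sketch in Python, assuming a small invented list of canonical names (a real gazetteer would of course be far larger):

```python
import difflib

# Hypothetical canonical names - a real list would come from a curated
# gazetteer, not these four examples.
canonical_places = ["Cheapside", "Smithfield", "Bishopsgate", "Shoreditch"]

def match_place(token, cutoff=0.8):
    """Map a variant early modern spelling onto a canonical place name,
    or return None if nothing in the list is close enough."""
    matches = difflib.get_close_matches(token, canonical_places, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(match_place("Smythfielde"))  # -> Smithfield
print(match_place("Algiers"))      # -> None (not in our list)
```

The cutoff would need tuning against real corpus data: too low and unrelated names collide, too high and genuine variants slip through.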
So how do we build a "training database" to initialize the mining of place names? One way would be to use a key text and collect a list of all the place names in it. For early modern London, the best candidate for such a text is almost certainly John Stow's masterpiece The Survey of London, first published in 1598 and subsequently reprinted and expanded many times. Stow's book is quite a tome, but fortunately our task of harvesting place names from it is made easy by the fact that almost all scholarly editions have a complete list of place names as an appendix.
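An index harvested from a scholarly edition could seed a simple gazetteer mapping canonical names to attested variant spellings. A sketch, with all spellings invented purely for illustration:

```python
# A tiny hypothetical gazetteer seeded from an index such as the appendix
# to a scholarly edition of Stow's Survey of London. Keys are canonical
# names; values are variant spellings (all invented here for illustration).
gazetteer = {
    "Cheapside":   {"Cheapside", "Cheape side", "Chepeside"},
    "Smithfield":  {"Smithfield", "Smythfield", "Smithfeild"},
    "Bishopsgate": {"Bishopsgate", "Byshopsgate"},
}

# Invert the gazetteer so any attested spelling resolves to its canonical form.
variant_to_canonical = {
    variant.lower(): canonical
    for canonical, variants in gazetteer.items()
    for variant in variants
}

print(variant_to_canonical["smythfield"])  # -> Smithfield
```

New variants discovered later (by crowd-sourcing or fuzzy matching) could simply be appended to the sets, and the inverted lookup rebuilt.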
Of course, our problem is not solved by one text alone. Stow's antiquarian account of London, though amazing in the ward-by-ward detail it provides, is unabashedly nostalgic. Even in the 1590s, Stow is taken aback by the rapid growth and change in the city and longs for a simpler time in London's past. My other favorite interlocutor of London, Thomas Dekker, for example, gives a picture of the city as fast-changing, volatile, and essentially "modern," with a dark underbelly that is not quite present in Stow. In the original "training database" idea that I outlined in the last post, this would not be a problem. We would expect associated themes to be picked up by the data-mining algorithm in its recursive runs through the corpus. However, unlike thematic keywords, place names might not work well with proximity association algorithms. I am reminded of a brilliant section of Franco Moretti's Atlas of the European Novel where he mapped the cases of Sherlock Holmes against the Jack the Ripper murders to demonstrate that they operated in quite different spatial worlds. So how do we ensure that the database continues to grow and "learn" new places beyond the initial "training" set?
Crowd-sourcing might be an approach to tackle this problem for a long-term project. Martin Mueller has discussed various advantages and drawbacks of crowd-sourcing information on the DATA blog, so I will leave them aside for the moment, noting only that it might provide a great way to build a comprehensive database of early modern place-names and spelling variations.
Once we have a working list of place names, we can start the process of mining data from the entire corpus. The question we are trying to answer through our algorithm is "what ideas come up in frequent association with particular place names?" Every time a certain place name occurs in a sixteenth- or seventeenth-century text, we can scan the text around it, weight the results by their proximity to the place name, and add them to a database. I estimate that once the entire EEBO corpus is tagged and lemmatized (a process still likely to take a few years, although enough of it is available now to get this project substantially under way), it will amount to close to a terabyte of data, and a relatively well-equipped server might have to hum away at it for a few weeks to create the complete database. But that is exactly the kind of repetition on a mind-boggling scale that computers excel at. Our job is to design and perfect the algorithms, write the code, and let 'em have at it.
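The scan-and-weight step described above might be sketched as follows, assuming a tokenized text and a set of known place names; inverse-distance weighting is just one plausible scheme among many:

```python
from collections import defaultdict

def associate(tokens, place_names, window=10):
    """Collect words occurring near each place name, weighting each
    co-occurrence by inverse distance from the place name."""
    scores = defaultdict(lambda: defaultdict(float))
    for i, tok in enumerate(tokens):
        if tok in place_names:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    scores[tok][tokens[j]] += 1.0 / abs(i - j)
    return scores

tokens = "a riot broke out near Smithfield market yesterday".split()
scores = associate(tokens, {"Smithfield"})
# "near" and "market" (distance 1) carry more weight than "riot" (distance 4)
```

Run over the whole corpus, the accumulated scores for each place name form exactly the kind of association database described above, and the meta-data of each text (author, genre, decade) could be carried along as additional keys.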
Implementing Interactivity: Bringing it Together
Okay. We've ended up with a list of place names and a giant database of words associated with each of them. How do we visually present this in ways that can enhance research and teaching in the humanities? Consider even the simplest level. Wouldn't it be useful in a class on early modern city-comedy to bring up a map, hover over a particular place on it, and have a word cloud pop up showing the words frequently associated with it? Imagine teaching Jonson's Bartholomew Fair, set in Smithfield, and seeing at a glance the social valences associated with Smithfield through the decades from 1500 to 1700. The possibilities are endless - meta-data can easily be used to restrict searches to authors, genres or time periods. Conversely, it would be a great research tool if one could type in a set of words - say "crime," or "debt," or "market" - and see a visual representation of which places in early modern London were frequently associated with them. There is an argument that the rise of banking drove the brothels out of the City walls. With such a database, a decade-by-decade visual map that actually represents the dynamics of this retreat might become possible.
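The reverse query - from keyword to places, optionally restricted by decade - could be a straightforward lookup over the same association database. A sketch, with every name and number invented purely for illustration:

```python
# Hypothetical association database keyed by (decade, place); every
# figure here is invented for illustration only.
associations = {
    (1590, "Cheapside"):  {"market": 4.2, "gold": 2.1},
    (1590, "Shoreditch"): {"playhouse": 3.8, "crime": 1.9},
    (1610, "Shoreditch"): {"crime": 5.0, "debt": 2.2},
}

def places_for(word, decade=None):
    """Rank places by how strongly a word is associated with them,
    optionally restricted to one decade."""
    hits = [
        ((dec, place), words[word])
        for (dec, place), words in associations.items()
        if word in words and (decade is None or dec == decade)
    ]
    return sorted(hits, key=lambda h: h[1], reverse=True)

print(places_for("crime"))                # Shoreditch in both decades
print(places_for("crime", decade=1590))   # only the 1590s entry
```

The decade key is what would make the hypothesized banking-and-brothels query possible: running the same lookup across successive decades yields the sequence of maps described above.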
The visual tools needed for such a high level of interactivity might seem daunting, but all of this technology already exists and is being used by digital humanists for various purposes. What does not yet exist is a complete and curated corpus, the machine-learning tools to mine that corpus, and the API-based approach that would let the various tools talk to each other to make such a massive project possible.