Final Project: Digital Textual Analysis


My name is Ben Smith, and I am a first-year PhD student in English at Brown. This is a final project for Digital Storytelling, AMST 2699, Spring Semester 2017. It focuses on digital textual analysis and extends a project on Voyant that I conducted earlier in the semester.

Debates, Major Issues, and Points of Contention for Digital Textual Interpretation

In an article for the LA Review of Books, David Golumbia spells out what seems to me to be the basic point of contention surrounding digital textual interpretation in particular, and Digital Humanities in general:

Advocates position Digital Humanities as a corrective to the “traditional” and outmoded approaches to literary study that supposedly plague English departments. Like much of the rhetoric surrounding Silicon Valley today, this discourse sees technological innovation as an end in itself and equates the development of disruptive business models with political progress. Yet despite the aggressive promotion of Digital Humanities as a radical insurgency, its institutional success has for the most part involved the displacement of politically progressive humanities scholarship and activism in favor of the manufacture of digital tools and archives.

The question is whether Digital Humanities is politically progressive or precisely the opposite. Golumbia raises the important point that DH is in the odd position of being both new and cutting edge, perhaps unsettling the ‘conservative’ (as in old) loci of power in humanities departments, while simultaneously displacing the (at times) politically radical practice of close reading and critique that, especially since the poststructuralist movement of the 1980s, has focused on re-interpreting the relationship between truth and power.

In The Death of a Discipline, Golumbia argues that DH “as a politics has overtaken (though by no means displaced) another, to my mind, much more radical politics, one that promised a remarkable, thoroughgoing, and productive reconsideration of the foundations of scholarly research but that, crucially, emerged quite directly out of the research practices and protocols that had been developing in literary studies until then.” He argues that DH is politically regressive because it displaces the progressive politics that accompany close reading. Digital interpretation favors formalization and modeling over the more politically inflected concepts that inform contemporary literary interpretation: that the text does not have a single, authoritative meaning, that the reader and author are equally responsible for creating meaning, that discourse is the locus of power, and that interpretation is the means by which power is re-directed and re-organized.

Golumbia argues that digital textual interpretation is politically regressive, furthermore, because it discourages critics from analyzing a wide range of literature from different cultures and languages: “Rather than seeking out the marginalized languages of the Southern Hemisphere and elsewhere and engaging the literature produced by their speakers, DH has doubled down on the institutional investment in the world’s majority languages (often seeming not even to understand the kinds of problems Spivak raises about these); in important ways, it embraces the idea that literary scholars should be monolingual.”
It is for these reasons, Golumbia argues, that Digital Humanities has been a point of such contention in English departments, as opposed to linguistics departments, for example, where digital textual analysis has been just as prevalent. Linguistics, Golumbia notes, “as a discipline, has not been a hotbed for the kinds of politically inflected interpretive practice that drives right-wing opposition.” In Golumbia’s formulation, and I would agree with him, literature departments are a last frontier for politically inflected, left-wing interpretive practices. Literature departments should therefore be wary, the argument goes, of digital infiltration, and should protect their historical roots in close reading and solitary, human-centered interpretation methods. This argument is compelling. It is hard not to associate digital tools with a right-wing, neoliberal politics, in which everything must be quantifiable, all conclusions demonstrable, and all work accessible and easily graspable.

In a Differences article from 2015 entitled Digital Humanities and National Security, Brian Lennon goes so far as to say that DH is an infiltration of the military-industrial complex into the sanctified space of academia:

Can a ballyhooed turn in the humanities, especially in literary studies, that promotes a putatively novel computational textual analytics including textual and other data “visualization” possibly be or remain isolated from the cultural-analytic and specifically textual-analytic activities of the security and military intelligence organizations that are the university’s neighbors—especially when such a turn is represented as a historic opportunity made possible by historic advances in information technology? It seems unlikely.
This strikes me as a comical association, but of course it’s there to be made. Lennon even provides an historical narrative of how the FBI deployed agents into humanities departments post-WW2 in order to draft reports of what kind of work they were doing. Lennon’s analysis draws on a simple line of associations – from digital to mechanical to industrial to military to government. I think his essay speaks to the understandable anxiety about why digital methods are cropping up in the humanities and whom these methods serve.
Accompanying these questions surrounding digital interpretation is the question of funding. As Golumbia notes in his LARB article, the neoliberal university favors methods that garner federal funding, and digital textual studies has a knack for doing so. He writes that “from the viewpoint of the neoliberal university, the best kind of research (and the kind to be most handsomely rewarded) is the kind that brings in the most external funding. This is one of the main reasons why the digitization of archives and the development of software tools — conspicuously expensive activities that influential funding bodies have enthusiastically supported — can exert such powerful attraction…” Large-scale digital textual projects involve a lot of people and a lot of time, which makes them perfect for justifying funding. When done on a large scale, the work is also collaborative, so funding can be justified for groups of people, not just individuals. Golumbia is wary of the ease with which external funding sources can be convinced to fund digital humanities projects. This doesn’t seem to be a bad thing to me, but for Golumbia, it is further evidence of the link between DH and an insidious neoliberal politics infiltrating the university.

Why Are Digital Textual Analysis Projects Interesting?

So what are DH’ers actually doing with digital textual analysis? What are these projects, and what is so interesting about them? A term to characterize the basic premise underlying many of them is ‘distant reading.’ Franco Moretti coined this term to denote the practice of feeding a large number of texts into a digital tool of some kind, and then synthesizing and interpreting the output. The advantage of digital tools for textual analysis is that they can read lots of texts; more in a few minutes than a human could read in a lifetime. And that’s exciting.
Part of what is so interesting to me about this work is that the computers don’t actually understand a word of text. This is one of those concepts that is so obvious that it is easy to overlook, but I think it’s actually quite significant. Willard McCarty, in a great essay called Humanities Computing, describes this phenomenon. He describes a thought experiment conducted by UC Berkeley philosopher John Searle to demonstrate the concept. To paraphrase the experiment: a man is in a cell. He is given a detailed set of instructions, in English, for manipulating Chinese characters. Then he is given a piece of paper with Chinese writing on it. By following the instructions he has been given, he can produce a fitting response to the Chinese text without understanding a word of Chinese. McCarty cites this thought experiment to demonstrate how computers interpret: “Computers are good for following long, tedious, and complex instructions carefully; they can manipulate symbols without over-interpreting them. The instructions in English can change how Searle acts, but the Chinese script cannot. This is an important difference between computers and us.” In short, computers can’t read. But they can process words in certain ways by following instructions written by a programmer.
This is fascinating because it is the most disinterested, anti-interpretive form of reading possible. Computers can look at text with a blankness and absence of interpretation that is impossible for humans. As McCarty notes: try reading the words ‘Hello World’ without interpreting them; it is impossible for humans. So in this sense, computer interpretation is a new frontier for reading texts. And this is precisely what makes it so interesting.
In contemporary literary studies, scholars like Rita Felski, Tim Bewes, Sharon Marcus, and Heather Love are arguing that critics should interpret literature by reading ‘on the surface,’ or ‘with the grain,’ or ‘generously.’ There is a trend moving away from ‘deep’ interpretation of hidden meanings towards ‘mere description,’ and pointing out what’s immediately visible in a text. It’s fascinating to think about how this trend in literary theory maps on to the trend to digitally interpret texts, which is in a way the purest possible surface reading. Text to a machine becomes pure pattern, a pure presence of markings that, for the human mind, is only theoretically possible. So it’s possible to see, by looking at these trends together, a general movement in literary scholarship in the direction of the surface, towards the distant, backing away from meaning and from the hidden and repressed ‘signified’ of the text. That this move is taking place both in literary theory and in practice in DH indicates a certain zeitgeist that deserves attention, not dismissal.

So What Are These Projects?

One of the most interesting textual analysis projects I found was Matthew Jockers’ project Syuzhet. Jockers states that Syuzhet “is designed to extract sentiment and plot information from prose. Methods for text import, sentiment extraction, and plot arc modeling are described in the documentation and in the package vignette.” Basically, Jockers uses this program to create graphs that formalize the emotional trajectory of a novel.

The program, in truth brilliantly simple, measures the emotional arc of novels by tracking the frequency of words that it dubs positive and negative. ‘Happy, trust, joy, love,’ for example, would be positive words, and ‘death, disgust, hate’ would be negative words. Jockers’ graphs show the frequency of these words. He then boils the results down from a corpus of over 50,000 novels, and concludes that there are six ‘archetypal’ plot shapes.
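To make the word-counting idea concrete, here is a minimal sketch in Python. It is not Syuzhet itself (which is an R package with much larger dictionaries); the tiny word lists, the function name sentiment_arc, and the choice to split the text into equal chunks are all my own assumptions for illustration.

```python
import re

# Hypothetical toy lexicon built from the example words above;
# a real sentiment dictionary is far larger.
POSITIVE = {"happy", "trust", "joy", "love"}
NEGATIVE = {"death", "disgust", "hate"}


def sentiment_arc(text, n_chunks=10):
    """Split text into n_chunks equal runs of words and score each run.

    A chunk's score is (positive word count) - (negative word count).
    """
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return []
    n_chunks = min(n_chunks, len(words))
    arc = []
    for i in range(n_chunks):
        start = i * len(words) // n_chunks
        end = (i + 1) * len(words) // n_chunks
        chunk = words[start:end]
        score = sum(w in POSITIVE for w in chunk) - sum(w in NEGATIVE for w in chunk)
        arc.append(score)
    return arc


story = (
    "Love and joy filled the house. Then came death and hate. "
    "At last, trust and love returned."
)
print(sentiment_arc(story, n_chunks=3))  # [2, -2, 2]
```

Even in this toy version the arc rises, falls, and rises again, which is the kind of shape the graphs are meant to capture. Note that the program never needs to know what ‘love’ means; it only checks membership in a list.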
In his article, Jockers cites a lecture by Kurt Vonnegut that outlines archetypal plot patterns in a similar way to Syuzhet. Vonnegut’s lecture is a joy to watch. It certainly does provide corroborating evidence, moreover, for Jockers’ findings. However, Jockers gets into hot water, I think, when he (intentionally or not) asserts that his findings are somehow more valid than Vonnegut’s because his sample size was bigger: “This is more or less the method Vonnegut employed, excepting, of course, that he probably only read a few hundred stories and probably only sketched out a few dozen on his chalk board.” This kind of statement asserts the priority of a software program over the expertise of Vonnegut, and I think that is an unproductive argument for Jockers to make. He doesn’t need to assert his program’s capacity to read more than Vonnegut; rather, he should attempt to explore the difficult terrain dividing Vonnegut’s ‘conjecture’ and his own ‘scientifically proven’ statement. I don’t think that the value in Jockers’ work lies in the fact that it is more objectively true than Vonnegut’s lecture. The value is that there is software underlying it. And the point worth exploring is how that software relates to Vonnegut’s thought process. We must take Vonnegut at his word when he tells us about universal plot types; with Jockers, however, we can ask to see the code. The question to ask is: what are the logical similarities between Jockers’ machine-based conclusions and Vonnegut’s conjecture?
In a counter-argument to Jockers’ findings, Annie Swafford argues that Jockers feeds his data through a series of “low-pass filters,” thus simplifying his findings to an excessive degree. Essentially, Swafford accuses Jockers of manipulating the data for the sake of simplicity. This raises another important challenge to digital textual analysis, echoed by Golumbia: the divide between coders and humanists leaves humanists in a position in which they don’t understand the code that they use to interpret. This is indeed a problem, but one that I will leave aside for now, given that it is of course possible for humanists to learn code, and the barrier is thus merely one of inconvenience; if humanists take the time to learn the code, it becomes accessible.
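For readers who have not met the term, a low-pass filter keeps the slow, large-scale movement in a series of numbers and discards the rapid swings. The sketch below uses a centered moving average, which is one of the simplest filters of this kind. I am using it as a stand-in to show the general effect, not as the specific filter that Syuzhet applies; the function name and the sample numbers are made up.

```python
def moving_average(values, window=3):
    """Centered moving average: a simple low-pass filter.

    Each point is replaced by the mean of itself and its neighbors.
    Near the edges the window is truncated to the available values.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        neighborhood = values[lo:hi]
        smoothed.append(sum(neighborhood) / len(neighborhood))
    return smoothed


# Made-up chunk scores: rapid swings early on, then a sustained high.
raw = [2, -2, 2, -2, 2, -2, 8, 8, 8]
smooth = moving_average(raw, window=3)
print(smooth)
```

The early back-and-forth between 2 and -2 is flattened to values near zero, while the sustained run of 8s survives. Widen the window and still more detail disappears, which is the heart of the worry: how much smoothing is too much is a judgment call made in the code, not a property of the novel.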
A second digital interpretation project, Ben Schmidt’s Sapping Attention, charts the presence of certain words and phrases, such as “If” and “I love you,” across narrative time, to show where they generally appear in narratives.
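The underlying calculation can be sketched in a few lines of Python. This is my own illustration of the idea of ‘narrative time,’ not Schmidt’s code: the function name phrase_positions is hypothetical, position is measured in characters rather than words or minutes, and the match is a raw substring search (so ‘if’ would also match inside ‘life’; a real tool would split the text into words first).

```python
def phrase_positions(text, phrase, n_bins=4):
    """Count occurrences of phrase in each of n_bins equal slices of text.

    Bin 0 is the opening of the narrative, bin n_bins - 1 is the ending.
    """
    text, phrase = text.lower(), phrase.lower()
    counts = [0] * n_bins
    if not text or not phrase:
        return counts
    start = text.find(phrase)
    while start != -1:
        # Relative position of the match, mapped to a bin index.
        bin_index = min(start * n_bins // len(text), n_bins - 1)
        counts[bin_index] += 1
        start = text.find(phrase, start + 1)
    return counts


# A made-up 'narrative': filler text with the phrase placed near the end.
narrative = "They argued. " * 9 + "I love you, she said. " + "The end."
print(phrase_positions(narrative, "I love you", n_bins=4))
```

Run over thousands of narratives and summed bin by bin, counts like these would give the curve of where a phrase tends to fall between opening and ending.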


The key to these graphs, I would argue, is simplicity. They are great when they can be read in an instant, but quickly turn awful when they have to be carefully deciphered. The art of creating these visuals is to make something that is immediately comprehensible. For an example of a terrible graph, see below:


Stanford Literary Lab

By far the most interesting work I found in digital textual analysis is being done at the Stanford Literary Lab. These guys have a team of people working on different digital analysis projects – ranging from analysis of dramatic narrative, time, emotions, place names, and canonicity – in huge corpuses of text. Furthermore, they’re creating beautiful essays to present their findings. They are reminiscent of scientific storyboards that outline experimental results, except they’re extremely readable and almost entirely . They call them “pamphlets.” They break their work down into stages, use figures, carefully lay out the stakes, and present their findings conclusively.

One pamphlet analyzes 50 years’ worth of World Bank documents to show changes in language use over time. Another, headed by Franco Moretti, tracks references to London place names in novels published between 1700 and 1900, along with the emotions associated with those references. Another, headed by Mark Algee-Hewitt, compares different lists of the best novels of the 20th century to find patterns linking the books chosen to the people who compiled each list.

Not only are these projects interesting, but they’re based in statistics. Some humanities people cringe at the word statistics. I find it fascinating in this context. It’s not that the statistics can necessarily allow us to arrive at any groundbreaking conclusions; rather, it’s groundbreaking because it’s statistical. Franco Moretti makes this argument excellently in an introduction to his Emotions of London pamphlet:

There comes a moment, in digital humanities talks, when someone raises the hand and says: “Ok. Interesting. But is it really new?” Digital humanities have presented themselves as a radical break with the past, and must therefore produce evidence of such a break. And the evidence, let’s be frank, is not strong. What is there, moreover, comes in a variety of forms, beginning with the slightly paradoxical fact that, in a new approach, not everything has to be new. When “Network Theory, Plot Analysis” pointed out, in passing, that a network of Hamlet had Hamlet at its center (P2/4), the New York Times gleefully mentioned the passage as an unmistakable sign of stupidity.

Here’s a link to the Times article he mentions:

It’s funny how accurate a critique Kathryn Schulz’s Times article offers of Moretti’s work. Hamlet, the central character in Hamlet! You don’t say. Moretti’s response to Schulz is that the conclusion is not the point of the study; the point is that the conclusion is corroborated. That’s revolutionary in the humanities. As is the very practice of modeling and graphing. Moretti’s network plot of Hamlet is a new way to look at the play. I side with Moretti on this point. I love to see Hamlet graphed into a network of relations with other characters. I have a visceral reaction to that work. It’s genius because it’s simple. And because it introduces the idea of backing up claims in the humanities rigorously. It’s a whole different model from citation and quotation, which is how traditional scholarship backs up its claims. Moretti and the Stanford Literary Lab team introduce a new level of rigor. It takes months, even years, to show, with evidence, that Hamlet is the central character of Hamlet. I love this as a counterpoint to poststructuralist theory, which habitually makes sweeping claims across philosophy, from Plato to Heidegger and up to the present moment.
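To show what ‘corroborated’ can mean here, the sketch below computes a basic measure of centrality for a character network. The edge list is a small, hand-picked set of pairings I wrote down for illustration; it is not Moretti’s data, and degree centrality (the share of other characters someone is directly linked to) is just one simple measure, chosen by me rather than taken from the pamphlet.

```python
from collections import defaultdict

# Hypothetical, hand-picked links between characters who speak to each other.
EDGES = [
    ("Hamlet", "Horatio"),
    ("Hamlet", "Claudius"),
    ("Hamlet", "Gertrude"),
    ("Hamlet", "Ophelia"),
    ("Hamlet", "Polonius"),
    ("Hamlet", "Laertes"),
    ("Claudius", "Gertrude"),
    ("Claudius", "Polonius"),
    ("Claudius", "Laertes"),
    ("Polonius", "Ophelia"),
    ("Laertes", "Ophelia"),
    ("Horatio", "Marcellus"),
]


def degree_centrality(edges):
    """Return each character's share of direct links to all other characters."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {}
    return {name: len(adj) / (n - 1) for name, adj in neighbors.items()}


scores = degree_centrality(EDGES)
most_central = max(scores, key=scores.get)
print(most_central, round(scores[most_central], 2))
```

The answer is unsurprising, which is exactly the point of the exchange above: the claim is banal, but it now rests on a procedure that anyone can inspect, rerun, and dispute.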


Digital textual interpretation is flawed. It has accessibility issues. Some tools, like Pliny, are hard to use. Writing software to crunch text is hard to do. Although I enjoyed reading about these critical projects at the Stanford Literary Lab, for example, I would not be able to carry them out myself. I don’t think text analysis serves to make reading easier, or to become a shortcut to taking notes, as a program like Pliny aims to be. When I used Voyant as a means of comparing five different texts, it helped, but only a little bit; not enough to justify taking the time to do it on my own, if I hadn’t had designated time within this course to explore the tool.

And it does have the problem, as Kathryn Schulz pointed out and Franco Moretti conceded, of often arriving at rather banal conclusions. In that sense it can be hard to really get excited about the work; the NYT published a dismissal of Moretti’s conclusions, rather than an endorsement.
But the fact that the NYT published an article on Moretti’s work at all is reason enough to pay attention. The humanities are so mired in their 19th- and 20th-century roots at present. The obscenely long, convoluted journal articles. The deep-thinking, well-steeped, entirely independent humanist scholar is a relic of the past. I shake my head as I write it, but of course it’s true. The computer is the tool of the future. If we’re not using it, we’re missing out. It is, to quote Steve Jobs, the bicycle of the mind. Humanist scholars need to be riding the bike, not walking slowly with their noses turned up at ‘technology.’

With that said, digital interpretation methods should be adopted with close attention to, and respect for, poststructuralism. The work done in the humanities in the 60s, 70s, and 80s was beautiful and revolutionary, and it shouldn’t be overshadowed by DH. Interpretive methods pioneered by Jacques Derrida, Paul de Man, Barbara Johnson, Roland Barthes, and Jonathan Culler disrupt and re-think 2,000+ years of philosophical tradition. For me, their structuralist and deconstructive methods of interpretation will continue to be the cornerstone of literary criticism, until someone re-thinks and improves upon their methods.
I think that DH’s accessibility, its use of platforms like Twitter and WordPress, and its use of blogs and websites are a significant asset. Humanities scholars across all fields should borrow that collaborative spirit from DH. Looking ahead, I would like to see more research done on the parallels between structuralist criticism and the digital modeling and formalizing of texts. The two disciplines share an assumption: that interpretation is best conducted by looking at formal, general structural principles that underlie all texts, rather than by looking at one text in particular. While structuralism approaches this method via linguistics, DH approaches it through corpus analysis. They take different angles, but they do have common ground. The object of interpretation is language in general, or 19th-century British texts in general. I think that the ability to digitize and represent patterns across such a wide range of texts can only be beneficial, if for no reason other than that it allows a text to reach our eyes via a different visual organization. That is bound to spark new ideas.
