February 21 – Text Analysis and Topic Modeling

2017-01-23T18:38:27-05:00

Class Plan

  • Voyant Workshop


Bradley, John. “Text Tools.” In Companion to Digital Humanities (Blackwell Companions to Literature and Culture), edited by Susan Schreibman, Ray Siemens, and John Unsworth. Oxford: Blackwell Publishing Professional, 2004.

Burrows, John. “Textual Analysis.” In Companion to Digital Humanities (Blackwell Companions to Literature and Culture), edited by Susan Schreibman, Ray Siemens, and John Unsworth. Oxford: Blackwell Publishing Professional, 2004.

Meeks, Elijah. “The Digital Humanities Contribution to Topic Modeling.” Journal of Digital Humanities, April 9, 2013.

Blei, David. “Probabilistic Topic Models.” Communications of the ACM 55, no. 4 (2012): 77–84. (this is hard but skim for some of the origins of Topic Modeling)


Come to class with a web site or digital version of a text for us to play around with in Voyant. Try to make it a sizable text to provide a larger pool of data for analysis. Use this site as a helpful resource: http://guides.library.ucla.edu/text.


Voyant – http://voyant-tools.org

Wordle – http://www.wordle.net/

Topic Modeling Tool – https://code.google.com/p/topic-modeling-tool/

Google Ngram Viewer – https://books.google.com/ngrams


  1. Whitney February 20, 2017 at 3:34 pm

    1.) (This is a technical question): if you were to remove the most common lexical words from the analyses and the personal pronouns – like John Burrows suggests is widely understood and accepted – would you then be able to go back and look at the most commonly used words and their phrases with those previously removed lexical words? Or would the phrases also be missing the unregistered words previously designated?

    2.) Elijah Meeks says in his article “The Digital Humanities Contribution to Topic Modeling” that “methods like topic modeling are reflections of movements”. If the current movement is trending toward distant reading/topic modeling, what might come next? Or, what are the potential benefits/damages that could be seeded in a movement toward distant reading?

    3.) In the article “Probabilistic Topic Models”, David Blei states that, “the utility of topic models stems from the property that the inferred hidden structure resembles the thematic structure of the collection.” Through machine learning, is the computer able to thematically determine patterns and make differentiations between literary movements or across genres? I.e., would the computer understand how to distinguish themes among works drawn primarily from the postmodernist movement, Renaissance love poetry, and scientific articles about bacteria?

  2. Alfo February 20, 2017 at 11:21 pm

    1. Burrows states that when a group of texts is analyzed for word variation, we could “yield complex but intelligible patterns of affinity and disaffinity”, but shortly afterward states that “the principal disadvantage of cluster analysis is that the underlying word-patterns are not made visible”. Which are these underlying patterns that we still won’t be able to discover with text analysis?

    2. Meeks defends a critical use of topic modeling tools. Should we understand topic modeling as a tool for gathering ‘new’ raw material that, like the text itself, will not give scholars any reliable conclusions, but rather new paths for further/better interpretation?

    3. I find the potential for new ways of visualizing topics, as suggested by Blei at the end of his article, very interesting. Are there any tools that not only point out the commonality, probability, or usage of topics, but also locate the topics in the text? I can imagine some topics that are more recurrent in the openings or the introductions of texts, while other topics may be distributed more uniformly.

  3. Hannah February 21, 2017 at 10:27 am

    Burrows explains, “To begin directly with the first of several case studies, let us bring a number of poems together, employ a statistical procedure that allows each of them to show its affinity for such others as it most resembles, and draw whatever inferences the results admit.” I’m wondering how to preserve the integrity of the humanities as we bring in tools from the social sciences – something we’ve spoken about before.

    Also, how do you choose your sample? This is something I’ve been wondering about for my own research, and perhaps that’s where the classic humanities component comes in. What determines the sources you use and what bias is involved? More to say on this in class I guess – with regard to my own research. We are pulling social science tools, or a better way to say it may be modes of analysis, but we don’t have the infrastructure that the social sciences have in terms of things like random sampling.

    I just generally thought that this was an interesting view of what we’ve been talking about from Meeks – “Because topic modeling transforms or compresses free data (raw narrative text) into structured data (topics as a ratio of word tokens and their strength of representation in documents) it is tempting to think of it as “solving” text.” He doesn’t go into it in depth, but I think the statement speaks to the concern about creating a humanities that isn’t so much interpretative.

  4. Shoshanah February 21, 2017 at 2:55 pm

    John Burrows’ ‘Textual Analysis’ begins with an analogy comparing text analysis to a handwoven rug, “the principal point of interest is neither a single stitch, a single thread, nor even a single color but the overall effect.” Based on this, could it be said that the text analysis tools discussed by Burrows vary from the text tools described by Bradley, in that analysis tools attempt to account for impact, exigency, and affect? “Effects like these, we may suppose, impinge upon the minds of good readers as part of their overall response.”

    I’m confused about the definition or parameters of ‘gray literature.’ Is the issue that self/web publishing is blurring the lines between gray literature and everything else? (Also, I’m interested in the label ‘gray literature.’ Where did it come from? Does it have anything to do with ‘purple prose’?)

    Meeks and Weingart state, “traditional Humanities scholarship often equates digital humanities with technological optimism.” Are digital humanists and media theorists more or less optimistic about the world-making/world-breaking potentialities of technology?

    Is what David Blei is proposing in ‘Probabilistic Topic Models’ a rhizomatic rather than hierarchical approach to internet research? Is his article a call for digital-dramaturgy?

  5. phyllis plitch February 21, 2017 at 3:59 pm

    1. The first two readings, Text Tools and Textual Analysis, are more than a decade old. Are there certain key points we should take away from these two readings? (Perhaps I missed that these were more recently updated.)
    2. In something of a variation on a theme, the next reading, “The Digital Humanities Contribution to Topic Modeling,” is several years old as well. Does this edition of the Journal of Digital Humanities represent the most current thinking on Topic modeling?
    3. Is there a current topic modeling exemplar? I am trying to connect the dots about how this type of analysis enriches our understanding of textual documents the way Marion Thain did in class last week. In Topic Modeling: A Basic Introduction in the Journal of Digital Humanities, Megan R. Brett describes the text mining and topic modeling work of Cameron Blevins. As she explains, Blevins used these tools to analyze the diary of Martha Ballard and notes that he compared his results to Laurel Thatcher Ulrich’s work, which was done by hand, that the two result sets generally align, and that “the results of the topic modeling help to uncover evidence already in the text.” In last week’s TEI Workshop, Marion Thain showed the connection between textual encoding and individual documents and how this work pushed the boundaries of literary knowledge and deepened our understanding of the Michael Field diaries, for example. With Topic Modeling I am still having a hard time seeing how the output goes beyond a statistical analysis using algorithms that somehow remains outside (and not equal to) the underlying corpus. I would like to!

  6. Cat February 21, 2017 at 4:23 pm

    1. John Bradley spends a lot of time describing the specific communities that are usually most comfortable with or drawn to Perl or TuStep. I am curious if these communities have changed at all over time, or even since this article was published in 2004?

    2. Entering this reading, I agreed with topic modeling critics that the tools “focused on corpora and not individual texts, treating the works themselves as unceremonious ‘buckets of words,’ and providing seductive but obscure results in the forms of easily interpreted (and manipulated) ‘topics’” (Elijah Meeks). I have to admit, Meeks didn’t quite convince me otherwise. I still feel like topic modeling is so much more wrapped up in the aesthetic of the graphic produced than the criticality he talks about. What do we really get from word clouds?

    3. David Blei’s discussion of topic modeling was much more interesting to me. I liked his point, “Topic modeling not only reveals the trajectory of tangible themes (housework, births, gardening, etc.), but also begins to quantify and visualize abstract themes by charting Ballard’s emotional state of being.” I understand that topic modeling can help us find previously unidentifiable or imperceptible patterns across huge texts, but how accurate can it really be at identifying emotions? I suppose this goes back to what we were talking about in class last week: we still need an expert to guide the tools. A scholar of this particular writer could help the computer identify certain words that the writer usually uses to convey certain emotions. But how does topic modeling help us learn something new? It seems that it is just using the scholar’s knowledge base and new technology to get the job (previously possible, but imposingly long) done.

  7. Jane Excell February 21, 2017 at 5:19 pm

    I appreciated Meeks’ statement in “The Digital Humanities Contribution to Topic Modeling” that the scholars who use topic-modelling conduct their work, “with as much of a focus on what the computational techniques obscure as reveal” but I wonder what an example of this would be. How do we know what is obscured by a topic-model analysis of a text?

    What Blei describes as the “themes” or “topics” of a document reminds me very much of the subject classifications that I use to catalog materials for the library. As part of its metadata, each item is assigned as many subject headings as necessary, in order of relevance. These subject headings have a controlled vocabulary that can be narrowed down through the use of extensive sub-categorization. Subject headings are searchable and allow one to jump between items with the same subjects in the library catalog. The Online Computer Library Center (OCLC) contains millions of records submitted by institutions all over the world that contain this subject-metadata. Reading this article, it wasn’t entirely clear to me how the themes pulled out by topic-modelling are much different or better than those assigned by libraries, and I disagreed somewhat with his statement that, “we do not interact with electronic archives in this way,” when I see what he seems to be describing in library databases now. I realize that the means by which topic-modelling and subject classification achieve their results are different, but what major differences between these two methods of determining the themes of a work am I missing?

    In his examples of textual analyses, John Burrows uses long poems (specified as over 2,000 words) and correspondence as his test materials, which made me wonder: is there an agreed minimum word length for these kinds of studies? I imagine that the shorter the piece, the more difficult to extract meaningful data, but how does one determine whether or not a piece is too short to be analyzed in this way?

  8. Lauren February 21, 2017 at 6:27 pm

    With all these complex tools, is the field left in the hands of tools whose underlying layers people are unable to understand? Is this hurting understanding?

    The use of regex is fascinating. How close does it get to finding the right context rather than just the patterns? Can you combine it with TEI to do complex pattern matching that is conditional on the TEI encoding?

    Topic modeling only gives us more information with which to do the more advanced work of understanding; the computer can only do what it is told and can’t do the analysis. How far can things go? Are AI, machine learning, and genetic algorithms going to change the way we do work? What can a computer give us an answer on, and what does it leave beyond its reach?
