TWiC Progress Update #4 – Day of DH 2015 – Text Cluster highlighting and the Day Comes to a Close

So you began the day after the national holiday at a car dealership getting your car repaired, and spent the rest of it programming in JavaScript. You arrive home and realize that your house keys are with your car keys and you’re locked out. What do you do?  Well, you head over to Dieu du Ciel until your roommate gets home and anoint the summer with a late evening coding session and Hibiscus beer.

TWiC at Dieu du Ciel
And then you get to work. To finish off this day of digital humanities, I thought I might try to tackle a brand-new feature. This is a challenge day, after all. I wanted to bring these miniaturized texts a bit more into alignment with the nature of the TWiC bullseyes. To do so, I combined the functionality of both objects in order to wrap these text rectangles in topic colors – a sort of rectangular bullseye.

TWiC New Text Rectangle highlighting

The thought here is that each text will have a different set of top N topics, and as with the corpus-level clusters of the view above this one, it could be useful to understand which texts at this level share prevalent topics. As you mouse over the bordering rectangles, texts with those topics will be highlighted. The code for this is coming along, and there are just a few more bugs to iron out. Once readied, this mouseover effect will also be useful elsewhere in TWiC. One aspect of panel communication that I slightly elided earlier is that all of the mouseover highlighting effects I’ve showcased today in graphical panels are also linkable. Take a look at what happens when I connect the corpus and corpus cluster views, for instance.

With views of varying scales of information in sight, it becomes possible to understand hierarchical connections in the topic model. And as it’s just past midnight and Day of DH 2015 is officially over, I thought I’d close with a look at the final and lowest level of TWiC: the individual text with its topic words, thankfully, now in context. So to all of you secret close-reading fans out there in DH land, I bid you a fond goodnight – I’m one of you, after all.

TWiC individual text

If you have any questions or suggestions about TWiC, feel free to reach out to me on Twitter at @jonathangrams, or if you’d like to contribute to the development of the project, message me on GitHub: github.com/jarmoza. For more reading on the early development of TWiC and its relation to Emily Dickinson’s fascicles, take a look at my regular blog at McGill.

TWiC Progress Update #3 – Day of DH 2015 – Simple Implementations, Large Rewards

Refueled on lunch and coffee at some reputable downtown Montréal institutions, and some recreational reading (re: Silicon Valley & a16z). And back at it.

I suspected the task I set out in my last post would not be too difficult to accomplish, and thankfully TWiC’s JavaScript architecture allowed just that. (In fact, I’m certain this post will take me longer to write than my solution did.) TWiC’s underlying code objects consist of levels (the entire TWiC screen), panels (graphical and information views like I’ve been showing), and datashapes (e.g. the TWiC bullseye seen thus far). Levels contain panels, and panels contain datashapes. Fairly straightforward. Through the magic of object inheritance, they also have a consistent set of methods for being initialized, drawn, and updated. Panels and datashapes communicate with each other when there is some change of state in the visualization. When a user mouses over a TWiC bullseye, for instance, not only is there code executing to highlight and de-highlight the proper circles, but any panels that have been linked to the panel the user is mousing over will be updated as well.
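
That hierarchy and its update-propagation pattern can be sketched in plain JavaScript. The class and method names below are illustrative stand-ins, not TWiC’s actual code:

```javascript
// Minimal sketch of the level/panel/datashape hierarchy described above.
// Names and fields are hypothetical; TWiC's real objects differ.
class DataShape {
  constructor(topicId) { this.topicId = topicId; this.highlighted = false; }
  Update(state) { this.highlighted = state.topicIds.includes(this.topicId); }
}

class Panel {
  constructor(name) {
    this.name = name;
    this.shapes = [];        // datashapes drawn inside this panel
    this.linkedPanels = [];  // panels notified on a change of state
  }
  AddShape(shape) { this.shapes.push(shape); }
  // Links are one-directional in this sketch, to avoid update cycles
  LinkPanel(panel) { this.linkedPanels.push(panel); }
  // A state change (e.g. mouseover) updates this panel's own shapes,
  // then propagates the new state to every linked panel
  Update(state) {
    this.shapes.forEach(s => s.Update(state));
    this.linkedPanels.forEach(p => p.Update(state));
  }
}

class Level {
  constructor() { this.panels = []; }
  AddPanel(panel) { this.panels.push(panel); }
}
```

Mousing over a bullseye in one panel then amounts to calling `Update` with the hovered topic ids, and any linked topic bar re-highlights itself for free.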

In this case, I wanted a click on a TWiC bullseye to freeze its panel’s highlighting of corresponding topic circles – so that I might more easily see which of that bullseye’s topics exist in other clusters. It also might be nice to freeze related views, such as the topic bar below, so I can still read the words of that topic. And there we are:

D3, the magical JavaScript data visualization library that makes much of the visible part of TWiC happen, also contains its own event system. Through D3, programmers can dynamically tell the browser to add new elements to the already-loaded web page (specifically a set of tags known as SVG, or “Scalable Vector Graphics”). By grouping a bunch of those graphical tags together under one group tag (<g>…</g>) with D3, you can move, scale, and change just about any applicable attribute of the graphical components within. You can also assign code to typical events for those components or groups, like mouseovers, clicks, and double-clicks.

So aside from a few lines to allow panels to have a “paused” state to prevent updating, what does it take to keep all of this information in view with a click, and then with another to allow business as usual to resume?

All the D3 it takes to pause

A small amount of work for a large amount of investigatory gain. In fact, I had to reassign my previous click code to do this, and move some other behavior into the double-click event. Clicking on a TWiC bullseye used to generate the next view I’m previewing for Day of DH. What’s underneath one of these bullseyes anyway? The actual stuff of life, of course: the human-authored texts we want to read in order to understand this topic modeling business – those texts that all share their “top” topic. But we still want to be able to relate those texts’ topic data before reading them directly, and there’s only so much screen space to share (especially on my 13″ laptop). One level down in TWiC is where users can find the “Text Cluster” view.
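
The pause-on-click behavior described above really is only a few lines of state logic. Here is a hedged sketch (again with hypothetical names, not TWiC’s actual implementation): a panel that ignores incoming updates while paused, with click toggling the pause so double-click stays free for other behavior, like descending a level.

```javascript
// Sketch of the click-to-pause behavior: while paused, the panel
// keeps its current highlights and ignores mouseover-driven updates.
// Names are illustrative, not TWiC's real code.
class PausablePanel {
  constructor() { this.paused = false; this.lastState = null; }
  Update(state) {
    if (this.paused) { return; }  // frozen: keep what is on screen
    this.lastState = state;
  }
  TogglePause() { this.paused = !this.paused; }
}

// With D3, the toggle would be wired to the bullseye's <g> group,
// roughly: selection.on("click", () => panel.TogglePause());
```

One click freezes the highlighted circles and topic bar for study; a second click resumes business as usual.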

TWiC's text cluster view

In this view are the beginnings of the idea behind “Topic Words in Context.” MALLET outputs what is known as a topic “state” file. This file tells us which topic each word in the texts is derived from – or is assigned to, depending on your point of view. As it also turns out, not all words are given the same consideration, beyond the obvious ones that are typically ignored in lexical analysis, like stopwords. Nevertheless, TWiC’s view above this cluster of texts shows words as pixels, and words from topics are given the appropriate coloring. The corpus of poetry helps this first use case, as it’s not necessary to truncate the amount of text shown, as would be the case with novels. Surrounding each miniature representation of Dickinson’s poems is also a colored rectangle, a reminder of the highest measured topic in each text. Mousing over topic words (i.e. pixels) produces a similar effect on a linked topic bar below.
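
For the curious, MALLET’s state file (once gunzipped) is plain text: comment lines starting with `#`, then one line per word token giving, roughly, the document index, source file, token position, type index, the word itself, and its topic assignment. A minimal parsing sketch, assuming that column layout:

```javascript
// Parse lines of a MALLET topic-state file (already decompressed).
// Each non-comment line has the rough form:
//   <doc> <source> <pos> <typeindex> <type> <topic>
// i.e. a topic assignment for every word token in the corpus.
function parseStateLines(lines) {
  return lines
    .filter(line => line.length > 0 && !line.startsWith("#"))
    .map(line => {
      const [doc, source, pos, typeIndex, type, topic] = line.split(" ");
      return { doc: +doc, source, pos: +pos, word: type, topic: +topic };
    });
}
```

From rows like these, coloring each word-pixel by its assigned topic is a straightforward lookup.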

But how else could a researcher relate these texts in this particular level of viewing, somewhat similar to the highlighting/frozen view of the corpus and corpus cluster views? That’s the next task as this day of DH continues…


TWiC Progress Update #2 – Day of DH 2015 – Corpus Clusters, A Task and a Solution

Having managed to extract myself from the land of car dealerships, I find myself programming away at McGill’s Burney Centre, where some fellow graduate students have allowed me to perch. I have spent some time this morning prepping and parameterizing text tags for the geometric forms that TWiC uses to represent texts and clusters of texts – so that I might better illustrate today’s work. On to the next task of the DH day, and some further explanation of the TWiC view I’ll be attacking next. (For those interested, my code and current progress/issue tracking can be found at my GitHub repository for TWiC.)

One level down from the topics for our corpus is a set of information that many will not be familiar with. Using MALLET’s output files, we can figure out a lot about our topic model beyond just topic word lists and their prevalence in the corpus. Each text is similarly assigned a list of prevalent topics and their probabilities of occurrence. This provides a few helpful points with which to navigate the topic model.

(1) We know which topics most likely occur in which texts – or, to rephrase (if you prefer), which are the “top” topics of each text.
(2) We can compare the topic probability distribution of two texts in some basic ways. We could group the texts by prevalent topic, or we could average a set of those distributions, for instance.
(3) If we think of the given distribution of topic probabilities as a vector, we can also make some spatial interpretations of that distribution. Using a common method for comparing probability distributions, the Jensen-Shannon divergence, we can produce a single value – a sort of distance, if you will.
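
As a concrete illustration of point (3), here is the Jensen-Shannon divergence in a few lines of JavaScript (a generic textbook formulation, not TWiC’s own code): JSD(P, Q) = ½·KL(P‖M) + ½·KL(Q‖M), where M is the element-wise average of P and Q. Unlike raw KL divergence, it is symmetric, always finite, and bounded by ln 2 with natural logs, which is what makes it usable as a distance-like value.

```javascript
// Kullback-Leibler divergence KL(p || q) for discrete distributions
// (arrays of equal length summing to 1); zero-probability terms are skipped.
function klDivergence(p, q) {
  return p.reduce((sum, pi, i) =>
    pi > 0 ? sum + pi * Math.log(pi / q[i]) : sum, 0);
}

// Jensen-Shannon divergence: symmetric, finite, bounded by ln(2)
function jensenShannon(p, q) {
  const m = p.map((pi, i) => (pi + q[i]) / 2);  // mixture distribution M
  return 0.5 * klDivergence(p, m) + 0.5 * klDivergence(q, m);
}
```

Two identical topic distributions come out at 0; two distributions with no overlap at all come out at ln 2, and everything else lands in between.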

Alright. Now hold onto your eyeballs. This is going to get a little psychedelic, but after this brief video intermission I’ll explain where I’m at and where I’m going to go today on this day of digital humanities. For your viewing pleasure, TWiC’s corpus cluster view.

What I’ve created to represent this “middle” level of topic model information is a network graph where each node represents the top N topics of texts clustered together in a very particular way. At the center of the graph is a representation – as with the corpus view in my last post – of all texts whose top topic is the top topic of the corpus. Similarly, every other bullseye (or node) extending out from the center represents a cluster of texts whose top topic is a particular topic (a text tag with the topic number is added for ease of viewing).

Now here’s where things get a little more complex, as we’re still in the land of big data and probability distributions. The colored topic circles in each bullseye represent the top N topics of those clustered texts after their topic distributions have been averaged together. And the distance of each bullseye from the center represents the “distance” (via Jensen-Shannon divergence) between those averaged distributions.
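
The averaging step is simple in principle: take the element-wise mean of each clustered text’s topic distribution, then pick off the N highest-probability topics for the bullseye’s rings. A sketch, with illustrative function names:

```javascript
// Element-wise mean of a set of topic distributions
// (each an array of probabilities over the same topics).
function averageDistributions(dists) {
  const n = dists.length;
  return dists[0].map((_, t) =>
    dists.reduce((sum, d) => sum + d[t], 0) / n);
}

// Indices of the top N topics of a distribution, most prevalent first
function topNTopics(dist, n) {
  return dist
    .map((p, topic) => ({ topic, p }))
    .sort((a, b) => b.p - a.p)
    .slice(0, n)
    .map(entry => entry.topic);
}
```

Feeding two averaged distributions into a Jensen-Shannon calculation then yields the single value that sets each bullseye’s distance from the center.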

As the mouse moves over the clusters, opacity and coloring are used to denote the shared top topics of each cluster of texts, and the corresponding topic word list is made available in the topic bar information panel below.

In essence, what is represented in this view is the distribution of topics across the entire corpus being modeled, but now those distribution numbers have been converted into spatial form for easy navigation of the sort of “data topography” of the model.

What would be nice, though, is if a click while mousing over these bullseyes allowed the highlighted topic circles and topic word list to be “paused” and studied. Perhaps a second click would “unpause” the view and allow mouseovers to continue. Up next: discussion of that work, D3 events, TWiC’s internals, and the next level down the topic modeling rabbit hole – the “Text Cluster” view, where we can start to see “topics” as they exist within texts themselves.

TWiC Progress Update #1 – Day of DH 2015 – Work Continues

As I briefly mentioned in the last post, TWiC allows multiple levels of access into a topic-modeled corpus. I spent yesterday developing what is perhaps the simplest but most comprehensive of views: looking at the top N topics of a corpus. One of the key motivations for the development of TWiC was the need to link the output of topic modelers like MALLET (i.e. the loosely associated words of topic word lists) with the texts of the examined corpus, as well as the language of each individual text. TWiC provides a mini research environment composed of two different types of panels: graphical views and information views. The screen in my last post showed a corpus graphical view – a colored, bullseye-like geometric abstraction representing the top 10 topics – and below it an information view, the topic bar, which lists and colors all of the topics of the model. If those panels are linked up, we can see those topics for the poems of Emily Dickinson, moving from the 10th most prevalent topic on the outside of the bullseye to the most prevalent topic at its center.

As I mentioned, this is the simplest of TWiC’s views, but its geometric abstraction and coloring represent a spatial device that is employed throughout as one dives deeper into the corpus.

TWiC @ Day of DH 2015

The morning begins where most wish it would the day after a national holiday: a car dealership in the depths of Montréal’s Plateau neighborhood. Proof that digital humanities can be practiced anywhere. And as I fire up Python’s SimpleHTTPServer, I find the fruits of yesterday’s afternoon-evening-late-night coding session.

TWiC Corpus View

Welcome to TWiC, or “Topic Words in Context”, a highly interactive D3 visualization for topic modeling that I have been building for my Master’s thesis on Emily Dickinson here at McGill University. It allows researchers to locate topic distributions at multiple levels of the corpus being modeled – from the top, with the corpus’s topics and topic proportions as pictured above, all the way down to the level of the individual text, with its own topic words and topic proportions viewable within the context of the original text from which the entire model’s topics are derived. More on the development of this visualization can be seen at my blog at McGill, but stay tuned here for a preview of TWiC before I demonstrate it at the Canadian Society for Digital Humanities in Ottawa this June.