Either…? Or…? … Both of them!

When two become one...


The traditional way of dealing with literature? Reading “Le roman au XVIIe siècle” (The novel in the 17th century), highlighting important passages, writing down some notes on the most important aspects – that’s where I come from.
But then – there it was: Digital Humanities. Computers, tools, programming; new perspectives, new possibilities, new chances – but also so much to learn. And this is what I am still doing: learning! A new language (Python => “Hello world!”), new approaches to literary questions, new ways of thinking. And I quite like that: It is not “either Literary Studies OR using computational methods and tools” – it is rather both of them. That’s what really fascinates me and that’s what I am doing right now: writing simple Python scripts and still reading my old-school paperback on the French novel.

How to grasp ‘our’ genres?


In our research group, we will all work on the question of whether and how literary genres and subgenres can be defined and delimited in terms of style, using a computational approach.

Biweekly, we meet for a “reading seminar” where we talk about texts that concern all of us and think about common methodological grounds in the research group. A few weeks ago, we came up with a working definition of “the novel” to help us build our corpora:


The parentheses, pipes and question marks indicate alternatives and optional characteristics. We quickly noticed that we need a flexible working definition, not just because of the novel’s general tendency towards variety but also because of the different periods and language areas we will be working with. The history of the words functioning as genre labels and the literary and cultural contexts will certainly be important for us to consider, even though, and especially when, we are using computational and quantitative methods. Just to make sure that ‘our genres’ don’t escape from us…

Balsa Ibérica (picture taken from <http://raquel-collardembar.blogspot.de/2011/02/la-verdad-de-las-mentiras.html>)

Just some thoughts coming up when reading about the 19th century Latin American novel.


One of the advantages of being an academic is that you spend a lot of time travelling by train, which is an excellent way of not being able to do administrative paperwork, so that you can get some reading and writing done instead. So, on my way to Göttingen, I have just spent an hour quietly revising the draft of an article that aims to introduce the TEI to a wide (but German-speaking) audience of literary scholars. It discusses a variety of “digital representation formats” for text, with their advantages and drawbacks. Not surprisingly, TEI turns out to be the most adequate among the currently available formats. Some basic uses of TEI for creating textual resources and a whole list of uses of TEI markup in text analysis are then presented and illustrated with small examples from the literature, including some of my own previous work with TEI. The article is coming along nicely and I hope to be able to submit it soon. By the way, this article was an occasion for me to read (also on the train, last week) large parts of Lou Burnard’s “What is the Text Encoding Initiative?”, a short book which I recommend you all recommend to anyone interested, as a first encounter with the TEI.

Decoding HTML entities automatically

I have to admit that I have learned more about Digital Humanities in non-academic environments than in academic ones (at least until today). So I often don’t know whether my everyday ways of working are well known in DH, old-fashioned and already improved upon, or whether I have found a very good way of doing something. Recently I realised that one of the things I do is not entirely wrong, and I would like to share it.

So, we all know how ugly HTML entities are and how annoying it is to find (when reading or analyzing) things like espa&ntilde;ol. One can decide whether to ignore this and work with the entities or to convert them to beautiful and intuitive UTF-8 characters: español. But of course, that means searching and replacing (in Spanish!) five accented vowels in their lower-case and capital versions, plus üÜ, plus ñÑ… And you never know whether other characters in the text are encoded as entities too. Tons of work and a lousy result.

Since some important text libraries in Spanish publish with non-standard versions of HTML (such as the Biblioteca Virtual Miguel de Cervantes), this can be a big problem. The best solution that I have found until now is a plugin for Notepad++ called HTML Tag. I was just using this plugin and thought it might be useful for other Digital Humanists.

HTML Tag for Notepad++

Once it is installed, you can select all the text that contains entities. Then you click “Decode entities”, and voilà!

With ugly entities
With beautiful utf8 characters

You can probably do this with Oxygen as well, but I haven’t found out how yet! Any ideas?
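
For anyone who would rather script this than click through an editor, here is a minimal Python sketch of the same idea. It is not how the Notepad++ plugin works internally; it only assumes Python 3.4 or later, where the standard library offers `html.unescape`. The function name `decode_entities` and the sample string are made up for illustration.

```python
import html


def decode_entities(text):
    """Replace HTML entities such as &ntilde; with the characters they stand for."""
    return html.unescape(text)


# Made-up sample text, not taken from any real library page.
sample = "espa&ntilde;ol, ping&uuml;ino, &Aacute;vila"
decoded = decode_entities(sample)
print(decoded)
```

The nice thing about this approach is that you do not need to know in advance which characters are encoded: named and numeric entities are handled in one go, not just the Spanish ones.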

Doing loops with Python


Like my colleagues Ulrike and Stefanie, I found myself doing some exercises with Python on this Day of Digital Humanities. Specifically, I am trying to understand how loops work and managing to get them to do what I need, for the moment with toy examples using numbers.

Looping pythons

I am glad that I have already worked with loops (not in Python but in PHP), so I am already familiar with their logic. Still, I find them slippery from time to time. Probably, as a humanist, I am too used to doing one different thing at a time: now I search for this word, then I read its meaning, after that I look at examples…

Probably that is the kind of thing that the new paradigm of Digital Humanities has brought to our way of working: not understanding the units as unique things (or unique tasks), but trying to see (or trying to do) the same thing with every unit. It will be better for me if I get used to looping, because doing things by hand with more than 200 Spanish novels would mean too much effort! Loops (among other things) allow us to do distant reading, to work with Big Data.
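
To make “the same thing with every unit” a bit more concrete, here is a small sketch of the kind of loop I mean. It is only a toy example: the dictionary `novels`, its titles and its short texts are placeholders, and `count_words` is a hypothetical helper. With a real corpus, each text would be read from a file instead.

```python
# Placeholder mini-corpus: titles and texts are stand-ins for real novels.
novels = {
    "novela_1": "En un lugar de la Mancha",
    "novela_2": "Vetusta dormía la siesta",
}


def count_words(texts):
    """Return a dict mapping each title to its number of whitespace-separated words."""
    counts = {}
    for title, text in texts.items():
        # The same operation is applied to every text, one after the other.
        counts[title] = len(text.split())
    return counts


word_counts = count_words(novels)
for title, number_of_words in word_counts.items():
    print(title, number_of_words)
```

Whether the dictionary holds two texts or two hundred, the loop itself stays exactly the same.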

So, I decided that I have to stop this post right now and go back to my Python exercises.

The Tristram Shandy conundrum

#Christof: Later today, I will give a talk at the Göttingen Center for Digital Humanities on topic modeling used in the analysis of literary subgenres. The more I work on this paper, the longer the “future work” section gets! This makes me feel like Tristram Shandy, who famously noted that while writing down the story of his life, with every page he wrote, his delay relative to the unfolding of events kept growing and growing, and that he would likely never be able to finish.

So, caught up in the Tristram Shandy conundrum, I spent some time this morning reading (in great haste, obviously) about possible methodological enhancements to my method, and one that kind of imposes itself more and more is to develop a smarter way of correlating per-segment topic scores with per-document genre labels. Currently, I’m simply taking average topic scores across subgenres and then looking at those topics with the largest standard deviation across subgenres to find topics distinctive of subgenres. In addition, I’m looking at loadings in a Principal Components Analysis of plays with their topic scores as features, also with quite interesting (and similar) results. However, there are at least two more ways to relate topic scores and genres: one is logistic regression (which Allen Riddell and I used to model function word frequencies and subgenres at DH last year, in a paper called “Progress through regression”), the other is “Supervised Topic Modeling”, which David Blei and Jon McAuliffe have shown to outperform regression in several topic-based label prediction tasks.
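
For readers who want to see the simple “average, then standard deviation” step spelled out, here is a rough sketch with toy numbers. It is not the actual analysis code: the scores, the genre labels and the function names are all made up, each row stands for one document rather than one segment, and the choice of the population standard deviation (`pstdev`) is an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Made-up toy data: one row of topic scores per document, plus a genre label.
topic_scores = [
    [0.5, 0.2, 0.3],
    [0.6, 0.2, 0.2],
    [0.1, 0.3, 0.6],
    [0.2, 0.3, 0.5],
]
genres = ["comedy", "comedy", "tragedy", "tragedy"]


def mean_scores_by_genre(scores, labels):
    """Average each topic's score over all documents sharing a genre label."""
    grouped = defaultdict(list)
    for row, label in zip(scores, labels):
        grouped[label].append(row)
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def rank_topics_by_spread(genre_means):
    """Order topic indices by the standard deviation of their genre means, largest first."""
    spreads = [pstdev(values) for values in zip(*genre_means.values())]
    return sorted(range(len(spreads)), key=lambda topic: spreads[topic], reverse=True)


genre_means = mean_scores_by_genre(topic_scores, genres)
ranking = rank_topics_by_spread(genre_means)
print(ranking)
```

Topics at the front of the ranking are the ones whose average scores differ most between the subgenres, which is all this simple approach can tell us; it says nothing about how reliable those differences are.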

I really hope the junior research group will help us move faster from writing future work sections to the actual future work so that we can escape from the Tristram Shandy conundrum.

Li[fv]e (in|from) the office

#Stefanie #Ulrike

Just arrived at the office. No one there except us. We’re located on Würzburg’s “Campus Nord”, on the site of the former Leighton Barracks. In the picture below you can see the view from one of our office windows. This is where we would like to host a welcome barbecue once the office is fully furnished and entirely equipped.

View from the office

Today, the first two tasks on our to-do list are:

  1. apply for the European Summer University in Leipzig
  2. do some Python exercises for Thorsten Vitt’s introductory course, which we are taking so that we can “do” things with our texts

Good Morning Day of DH 2015!

Hello there, and good morning! This Day of DH is a bit special for me, or should I say us, because we’re blogging here as a team! Over the course of the day, the members of the junior research group on “Computational Literary Genre Stylistics” (CLiGS for short) will report on their DH work on the Day of DH. The team members are Ulrike Henny, Stefanie Popp, Daniel Schlör, José Calvo and myself (Christof Schöch). Some of us have done this before, like Ulrike and me, and others are discovering it this year. We’ll indicate who’s writing which post and I think we will have fun!