The Lexicon

One of the tasks I work on is the development of “The Lexicon”, which, as the name might imply, is a dictionary-like component of the corpus. It has an entry for each word that occurs in the Oxford Corpus of Old Japanese (OCOJ), and each entry contains information about part of speech, conjugation class, semantic categories for some verbs (currently under development), derivation, related words, etc.

The XML version for the entry for yuk- ‘go’ looks like this:


But not many people want to read an entry that looks like that. So the XML file gets converted to a user-friendly HTML file and goes online looking like this:


It’s then possible to click the links to get more information. Clicking the link for “Conjugation class: quadrigrade” directs the user to a list of all verbs in that class. Clicking “Verb classification: motion” brings up all motion verbs. Clicking “collocations” gives a list of all the nouns that head noun phrases that are arguments (e.g., subjects, objects, goals, sources) occurring with the verb yuk- and also all nouns that are modified by the verb. Here are the top collocations:


Clicking the link for any of the nouns takes the user to its lexicon entry, where they can see the definition: kimi ‘you’; miti ‘road’; wa ‘I’; midu ‘water’; tabi ‘trip, journey’; ware ‘I’; pito ‘person’; yama ‘mountain’; pi ‘sun, day’; kapa ‘river’; pye ‘layer’; yworu ‘night’; miyakwo ‘capital’; pune ‘boat’. This list could, of course, be changed to show only subjects, only objects, only modified nouns, or whatever else people want to see. It’s just a matter of tweaking the search conditions and generating new results, which can be done in just a few minutes.
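
To give a rough idea of what “tweaking the search conditions” can mean, here is a minimal Python sketch. It is not the actual OCOJ search code: the records, the relation labels (“subject”, “object”, “goal”, “modified”), the counts, and the function name collocations are all assumptions invented for illustration, and a real query would run against the parsed XML rather than a hand-written list.

```python
from collections import Counter

# Hypothetical records: (verb, relation of the noun to the verb, noun).
# These rows are made up for illustration; they are NOT real OCOJ counts.
RECORDS = [
    ("yuk-", "subject", "kimi"),
    ("yuk-", "subject", "kimi"),
    ("yuk-", "object", "miti"),
    ("yuk-", "goal", "miyakwo"),
    ("yuk-", "modified", "pito"),
    ("yuk-", "subject", "wa"),
    ("some-other-verb", "object", "yama"),
]


def collocations(records, verb, relations=None):
    """Count the nouns that occur with `verb`.

    If `relations` is given, only nouns standing in one of those
    relations to the verb are counted; otherwise all are counted.
    Returns (noun, count) pairs, most frequent first.
    """
    counts = Counter()
    for rec_verb, relation, noun in records:
        if rec_verb != verb:
            continue
        if relations is not None and relation not in relations:
            continue
        counts[noun] += 1
    return counts.most_common()


# Every noun that occurs with yuk-, whatever its relation to the verb.
all_nouns = collocations(RECORDS, "yuk-")

# The same search narrowed to subjects only: one changed condition.
subjects_only = collocations(RECORDS, "yuk-", relations={"subject"})

print(all_nouns)
print(subjects_only)
```

Switching from “all collocations” to “subjects only” changes a single argument, which is why a new list can be generated so quickly once the corpus is parsed.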

There is no way this type of information could be extracted using a dictionary and a collection of books. This is just one example of how Digital Humanities improves the way research can be conducted.

A Digital Humanist – “So, what is it that you do exactly?”

I like the term “Digital Humanist”. Sure, it confuses people when you introduce yourself as a digital humanist if they’ve never heard the term before (and many people I talk to haven’t), and I usually get the question: “So, what is it that you do exactly?”

It’s a great question. Not one that’s easily answered, of course, and I certainly don’t want to put people into a deep slumber if I try.

The short answer is that I use technology to access a dead language.

A longer answer is that I’ve spent the past six and a half years working on the design and development of the Oxford Corpus of Old Japanese (OCOJ; http://vsarpj.orinst.ox.ac.uk/corpus/), which is a syntactically parsed corpus of all extant Old Japanese texts. The corpus is tagged in XML following the guidelines of the Text Encoding Initiative (TEI). Having a corpus like this drastically improves the way data can be accessed and analysed. There have already been a few dissertations and several articles written using this resource. I’m also involved with a few research groups that want to incorporate data from the OCOJ in their diachronic projects.

I should also mention that being able to do research in this way is really fun.

I’ll post several examples of the kinds of things that can be quickly examined using a corpus during the Day of Digital Humanities.

