One of the tasks I work on is the development of “The Lexicon”, which, as the name might imply, is a dictionary-like component of the corpus. It consists of entries for each word that occurs in the OCOJ, and each entry contains information about part of speech, conjugation class, semantic categories for some verbs (currently under development), derivation, related words, etc.

The XML version for the entry for yuk- ‘go’ looks like this:


But not many people want to read an entry that looks like that. So the XML file gets converted to a user-friendly HTML file and goes online looking like this:


It’s then possible to click the links to get more information. Clicking the link for “Conjugation class: quadrigrade” directs the user to a list of all verbs in that class. Clicking “Verb classification: motion” brings up all motion verbs. Clicking “collocations” gives a list of all the nouns that head noun phrases that are arguments (e.g., subjects, objects, goals, sources) occurring with the verb yuk- and also all nouns that are modified by the verb. Here are the top collocations:


Clicking on the link for any of the nouns brings the user to the lexicon, where they can see the definition: kimi ‘you’; miti ‘road’; wa ‘I’; midu ‘water’; tabi ‘trip, journey’; ware ‘I’; pito ‘person’; yama ‘mountain’; pi ‘sun, day’; kapa ‘river’; pye ‘layer’; yworu ‘night’; miyakwo ‘capital’; pune ‘boat’. This list could, of course, be changed to show only subjects, only objects, only modified nouns, or whatever else people want to see. It’s just a matter of tweaking the search conditions and generating new results, which can be done in just a few minutes.
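To give a rough idea of what changing the search conditions involves, here is a minimal Python sketch. It is not the actual OCOJ query or markup: the element and attribute names (`clause`, `np`, `role`, `n`, `v`, `lemma`), the three sample clauses, and the `collocations` function are all invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical toy markup and data: NOT the real OCOJ schema or corpus text.
SAMPLE = """
<corpus>
  <clause>
    <np role="subject"><n lemma="kimi"/></np>
    <np role="goal"><n lemma="miyakwo"/></np>
    <v lemma="yuk-"/>
  </clause>
  <clause>
    <np role="subject"><n lemma="kimi"/></np>
    <np role="object"><n lemma="miti"/></np>
    <v lemma="yuk-"/>
  </clause>
  <clause>
    <np role="subject"><n lemma="pito"/></np>
    <v lemma="ko-"/>
  </clause>
</corpus>
"""


def collocations(root, verb, roles=None):
    """Count head nouns of noun phrases in clauses containing `verb`.

    If `roles` is given, only noun phrases with one of those roles count.
    """
    counts = Counter()
    for clause in root.iter("clause"):
        # Skip clauses whose verb is not the one we are looking for.
        if clause.find(f"v[@lemma='{verb}']") is None:
            continue
        for np in clause.findall("np"):
            if roles is not None and np.get("role") not in roles:
                continue
            head = np.find("n")
            if head is not None:
                counts[head.get("lemma")] += 1
    return counts


root = ET.fromstring(SAMPLE)

# All arguments of yuk-, then subjects only: the same search, one condition changed.
all_nouns = collocations(root, "yuk-")
subjects_only = collocations(root, "yuk-", roles={"subject"})

print(all_nouns.most_common())
print(subjects_only.most_common())
```

The only difference between the two searches is the `roles` argument, which is the sense in which a new list is just a small change to the search conditions.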

There is no way this type of information could be extracted using a dictionary and a collection of books. This is just one example of how Digital Humanities improves the way research can be conducted.

Like most academics, I spend part of my day in meetings or workshops. This morning I had a meeting with the Japanese and Korean faculty, and next up is a crowdsourcing workshop. I have some ideas for using crowdsourcing to add to the OCOJ, and I’m here to get more.

Today for the international Day in Digital Humanities event, digital humanists have been asked to document what we do in a day. It’s a great idea, and I expect the comments will be as varied as our research.

My training is as a historical linguist. I am mainly interested in the development of the Japonic language family, which consists of Japanese and the Ryukyuan languages. I came to Oxford to work on the AHRC-funded Verb semantics and argument realization in pre-modern Japanese (VSARPJ) project, and we soon realized that in order to do the research for the project we needed to build a corpus. I attended the Digital Humanities Summer School in Oxford in the summer of 2009, and that’s where I learned many of the skills necessary for this kind of work, including how to use XML, XPath, XSLT, etc. (As it happens, I’ll be co-teaching the Text to Tech workshop in the Digital Humanities Summer School this year, and registration is still open.)

When I wrote my dissertation, a few years before coming here, I had to rely on dictionaries and the few available indices to get information about words attested in Old Japanese, the oldest stage of the Japanese language (8th century). Now that we have a corpus, the Oxford Corpus of Old Japanese (OCOJ), it’s so much easier, faster, and more accurate to get data about any given word.

In addition to the corpus, I am also working on the development of a bidirectional Old Japanese – English dictionary, making it possible to group words together by their meaning. It’s also possible to jump from the dictionary to examples in the corpus, and from the corpus to the dictionary. But more on that later.


Nothing like getting in to work and realising that they will be filming outside the office today. Every now and then something is filmed around here, usually Lewis these days. That’s Oxford for you.

At least they will be filming safely today.


