The Lexicon

One of the tasks I work on is the development of “The Lexicon”, which, as the name might imply, is a dictionary-like component of the corpus. It consists of an entry for each word that occurs in the OCOJ, and each entry contains information about part of speech, conjugation class, semantic categories for some verbs (currently under development), derivation, related words, and so on.

The XML version of the entry for yuk- ‘go’ looks like this:


But not many people want to read an entry that looks like that. So the XML file gets converted to a user-friendly HTML file and goes online looking like this:
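As a rough illustration of that conversion step, here is a minimal Python sketch. It is not the project's actual code: the element names (`form`, `gloss`, `pos`, `conjugation`, `classification`), the URL prefixes, and the `entry_to_html` function are all assumptions made up for the example, since the real schema is not reproduced here.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical entry: every element name below is invented for illustration
# and is not the OCOJ schema.
ENTRY_XML = """
<entry>
  <form>yuk-</form>
  <gloss>go</gloss>
  <pos>verb</pos>
  <conjugation>quadrigrade</conjugation>
  <classification>motion</classification>
</entry>
"""

# (element name, label shown to the reader, hypothetical URL prefix or None)
FIELDS = [
    ("pos", "Part of speech", None),
    ("conjugation", "Conjugation class", "/conjugation/"),
    ("classification", "Verb classification", "/classification/"),
]


def entry_to_html(xml_text: str) -> str:
    """Render one lexicon entry as a small HTML fragment."""
    entry = ET.fromstring(xml_text.strip())
    form = escape(entry.findtext("form", default=""))
    gloss = escape(entry.findtext("gloss", default=""))
    lines = [f"<h2>{form} &lsquo;{gloss}&rsquo;</h2>", "<ul>"]
    for tag, label, url_prefix in FIELDS:
        value = entry.findtext(tag)
        if value is None:
            continue  # skip fields this entry does not have
        text = escape(value)
        if url_prefix is not None:
            # Linked fields lead to a list of all words sharing the value.
            text = f'<a href="{url_prefix}{text}">{text}</a>'
        lines.append(f"  <li>{label}: {text}</li>")
    lines.append("</ul>")
    return "\n".join(lines)


print(entry_to_html(ENTRY_XML))
```

Keeping the field list in one table means a new kind of information can be shown by adding a single row, without touching the rendering logic.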


It’s then possible to click the links to get more information. Clicking the link for “Conjugation class: quadrigrade” directs the user to a list of all verbs in that class. Clicking “Verb classification: motion” brings up all motion verbs. Clicking “collocations” lists all the nouns that head noun phrases serving as arguments of the verb yuk- (e.g., subjects, objects, goals, sources), as well as all nouns that are modified by the verb. Here are the top collocations:


Clicking the link for any of the nouns takes the user to its lexicon entry, where they can see the definition: kimi ‘you’; miti ‘road’; wa ‘I’; midu ‘water’; tabi ‘trip, journey’; ware ‘I’; pito ‘person’; yama ‘mountain’; pi ‘sun, day’; kapa ‘river’; pye ‘layer’; yworu ‘night’; miyakwo ‘capital’; pune ‘boat’. This list could, of course, be changed to show only subjects, only objects, only modified nouns, or whatever else people want to see. It’s just a matter of tweaking the search conditions and generating new results, which can be done in just a few minutes.
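To give a sense of what “tweaking the search conditions” can amount to, here is a minimal Python sketch of a collocation count with an optional role filter. It is only an assumption about how such a query might be written, not the corpus's own search tool: the `Relation` record, the role labels, and the `top_collocations` function are hypothetical, and the sample records are invented toy data rather than real corpus counts.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Optional, Set, Tuple


class Relation(NamedTuple):
    """One verb-noun relation; the role labels are hypothetical."""
    verb: str
    noun: str
    role: str  # e.g. "subject", "object", "goal", "source", "modified"


# Invented toy records for illustration only, NOT counts from the corpus.
SAMPLE = [
    Relation("yuk-", "kimi", "subject"),
    Relation("yuk-", "kimi", "subject"),
    Relation("yuk-", "miti", "object"),
    Relation("yuk-", "kimi", "modified"),
    Relation("yuk-", "pito", "modified"),
    Relation("mi-", "pito", "subject"),
]


def top_collocations(
    records: Iterable[Relation],
    verb: str,
    roles: Optional[Set[str]] = None,
    n: int = 10,
) -> List[Tuple[str, int]]:
    """Count the nouns occurring with `verb`, optionally restricted to `roles`."""
    counts = Counter(
        r.noun
        for r in records
        if r.verb == verb and (roles is None or r.role in roles)
    )
    return counts.most_common(n)


# All relations, then subjects only: the only change is the `roles` argument.
print(top_collocations(SAMPLE, "yuk-"))
print(top_collocations(SAMPLE, "yuk-", roles={"subject"}))
```

Switching from all collocations to subjects only, objects only, or modified nouns only is a one-argument change, which is why new result lists can be generated so quickly.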

This type of information could not realistically be extracted by hand using a dictionary and a collection of books. This is just one example of how Digital Humanities improves the way research can be conducted.
