Wrapping up

I’m about to head home from the coffeehouse, having made some progress on the core of the new TEI export routine.  Because the logical units of a text are orthogonal to the pages–pages being the fundamental units of transcription in FromThePage–I will need to take a run-off style approach to TEI rendering.  This requires a container to hold the state of the rendering process as it moves from page to page, and that’s what I built tonight.
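To make the idea of a state container concrete, here is a minimal sketch in Python. It is not FromThePage's actual code: the `RenderState` class, the `render` function and the `("section" | "para", text)` page format are all hypothetical names invented for illustration. The point it tries to show is that a logical `<div>` opened on one page can stay open across the page boundary, with each page reduced to a `<pb/>` milestone.

```python
from dataclasses import dataclass, field
from xml.sax.saxutils import escape


@dataclass
class RenderState:
    """Hypothetical container carrying rendering state from page to page."""
    out: list = field(default_factory=list)  # TEI fragments emitted so far
    div_open: bool = False                   # is a logical <div> still open?

    def start_section(self, title):
        # A new logical unit closes the previous one, wherever that one began.
        self.close_section()
        self.out.append(f"<div><head>{escape(title)}</head>")
        self.div_open = True

    def close_section(self):
        if self.div_open:
            self.out.append("</div>")
            self.div_open = False


def render(pages):
    """pages: list of pages, each a list of ("section" | "para", text) tuples."""
    state = RenderState()
    for number, page in enumerate(pages, start=1):
        # Pages become milestones inside the text, not containers around it.
        state.out.append(f'<pb n="{number}"/>')
        for kind, text in page:
            if kind == "section":
                state.start_section(text)
            else:
                state.out.append(f"<p>{escape(text)}</p>")
    state.close_section()
    return "".join(state.out)
```

In this sketch a section that starts on page 1 and ends halfway down page 2 produces a single `<div>` with a `<pb n="2"/>` inside it, which is the behavior the page-by-page renderer needs.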

I was joined by Darcy Myers, a friend who was recently laid off from a position doing bioinformatics data science and has been brushing up her tech skills while applying for grad school and fending off recruiters.  She’s been exploring D3.js, and brought her genomics visualization project along to the Cafe Bedouins meeting.  We compared notes, and I was left with a healthy respect for the “hard” mathematical skills you don’t usually need doing web development.  It will be interesting to see how broadly D3 is taken up among the DH community.

Time to head home.  See you next year!

TEI-XML exports at a Coffeehouse

As was the case last year, I’m wrapping up my night at Houndstooth Coffee with Cafe Bedouins, an informal group of people who meet to work on projects that aren’t strictly part of their day jobs.  The group can be quiet–mostly folks huddled over their laptops–but is always helpful if you have a problem with your Rails environment, need to discuss an idea, or are willing to exchange beer for copy-editing.

Tonight I plan to start work on a major revision to FromThePage’s TEI-XML export.  The current exporter generates valid P5-compliant TEI, but the translation features I’ve developed for Fordham’s Center for Medieval Studies need to be integrated.  Furthermore, conversations with Susanna Allés Torrent–a scholar working on encoding a bilingual Nahuatl/English edition of the Codex Aubin–have raised a number of structural challenges for the TEI exporter.  Addressing them requires a fairly major rewrite: the exporter needs to use DIVs to indicate texts, sections and entries rather than pages, and to generate corresp attributes linking paragraphs within the transcript to paragraphs within the translation.
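The corresp linking might be sketched as follows, again in Python and again as an illustration rather than the exporter itself. The sketch assumes the transcript and translation align one-to-one by paragraph position, which real editions may not; the `tx-`/`tl-` id scheme, the `type` values and the function names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def linked_div(div_type, prefix, other_prefix, paragraphs):
    """Build a <div> whose paragraphs point at their counterparts."""
    div = ET.Element("div", {"type": div_type})
    for n, text in enumerate(paragraphs, start=1):
        p = ET.SubElement(div, "p", {
            XML_ID: f"{prefix}-p{n}",            # hypothetical id scheme
            "corresp": f"#{other_prefix}-p{n}",  # pointer to the other text
        })
        p.text = text
    return div


def link_translation(transcript, translation):
    """Return (transcription div, translation div) with mutual corresp links."""
    if len(transcript) != len(translation):
        raise ValueError("this sketch assumes one-to-one paragraph alignment")
    return (
        linked_div("transcription", "tx", "tl", transcript),
        linked_div("translation", "tl", "tx", translation),
    )
```

Calling `ET.tostring(div, encoding="unicode")` on either result yields paragraphs of the form `<p xml:id="tx-p1" corresp="#tl-p1">`, so each paragraph in one text points at its counterpart in the other.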

Fortunately, the work for Codex Aubin should be generally useful, as I’ll also be able to test it with the sister project at Fordham, a bilingual Old French/English edition of the Assizes of the crusader kingdom of Jerusalem.  That project is working with an 1841 edition as its source, from which FromThePage has ingested the OCR.  In that case, the page–the fundamental unit of transcription–is an artifact of 19th-century typesetting, and has no semantic meaning whatsoever.  Stripping page-specific DIVs from the TEI will be beneficial to both projects.

How do you make a feature developed for a specific client useful for everyone?  It’s a tough problem, since development without real, live users is often wasted effort, while “contract-driven development” has doomed many software firms trying to build a product.  The approach I’m attempting is to take the specific OCR-ingestion requirements from the Assizes project and the specific TEI-XML requirements of the Codex Aubin project and combine them to produce an OCR-to-TEI workflow.  Several scholars volunteered to advise the project when I asked on TEI-L, and they’ve given me some excellent feedback on the old exporter.  After the new exporter is deployed for Fordham, I’ll be making another pass on the sample documents contributed by the advisors to get their perspectives.

Now that the dishes are done, it’s off to the coffeehouse!

Good morning!

Day of DH marks the third year of my work as an independent software developer.  As I mentioned last year, I’ve spent about half of that period supplementing DH projects with a part-time industry contract, two weeks per month.  While that was a distraction at first, it has really paid off by forcing me to schedule my work months in advance, allowing me to turn down projects with marginal returns, and funding work on FromThePage by front-end specialists.

Today is one of those days at the non-DH gig, however, so my DH work will be limited to a 5:30-6:30 AM shift and a 7:30-9:30 PM shift.

I’ll spend the remaining minutes of the morning on two projects:

  • FromThePage2 will launch on September 1, adding collaborative translation and OCR correction to the open-source manuscript transcription tool.  (Check out the preview site and tell me what you think.)  I woke up to emails from the front-end developer I’ve hired to refresh the UI describing his implementation of a four-mode transcription screen: over-under/side-by-side, and two other options.  What are these two other options?  I’ll have to pull the code and review his work to find out.
  • FreeREG2 is a client project that launched just after Easter this year.  It’s an online database of volunteer-contributed parish register entries.  One of the features I take the most pride in is the search engine’s ability to search across abbreviated or Latinized names in the system.  This is a feature I created that was heavily influenced by the Guide to Documentary Editing: the ingestion into the search engine applies emendations to the transcripts that are used for matching search queries, but not for presenting the actual record.  But something went wrong when we deployed the site, and the emendations were not applied.  I spent all of Friday and part of the weekend writing code to apply emendations retrospectively to the database, and need to discuss how to deploy it on the 32 million records in the system without causing major I/O issues for the other sites running on those servers.
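The emendation idea in the FreeREG2 item can be illustrated with a small Python sketch. This is not FreeREG2's code: the `EMENDATIONS` table holds only a few illustrative abbreviations and Latin forms, and the record layout and function names are assumptions. What it shows is the split between the emended key used for matching and the verbatim record that is stored and displayed.

```python
# Illustrative emendation table (an assumption, not FreeREG2's real data):
# abbreviated or Latinized forms mapped to a common search form.
EMENDATIONS = {
    "wm": "william",
    "gulielmus": "william",
    "jno": "john",
    "johannes": "john",
}


def search_key(name):
    """Emended form used only for matching, never for display."""
    words = name.lower().replace(".", "").split()
    return " ".join(EMENDATIONS.get(word, word) for word in words)


def ingest(records):
    """Index records by emended key; the records themselves stay verbatim."""
    index = {}
    for record in records:
        index.setdefault(search_key(record["forename"]), []).append(record)
    return index


def search(index, query):
    # The query is emended the same way, so "William" finds "Wm." as well.
    return index.get(search_key(query), [])
```

Because the emendation happens only when the key is built, a search for "William" can return a record that still reads "Wm." exactly as the volunteer transcribed it.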