My afternoon was packed with lots of small tasks and thought experiments:
Happy Birthday, Barbara!
Because the staff in the IDHMC are part of the English Department staff, we participate in all gatherings with those terrific ladies and gentlemen. Today, we celebrated our Undergraduate Administrative Coordinator’s birthday with some chocolate cake.
- Our project coordinator, Nancy Sumpter, and I investigated a suspect email from DreamHost (where we have some IDHMC project domains registered), which turned out to be just bad form on DreamHost’s part and not actually a scam.
- I took a close look at the text documents produced by Matt Christy to enable the transformation from MARC subject headings to ARC RDF specifications for Early English Books Online (EEBO). Because our intention is to 1) identify subject headings in the MARC records, then 2) successfully crosswalk those distinct values to ARC RDF values, Matt has split the subject headings into several files based on MARC code and subfield. So, we have several documents to work with, each many thousands of lines long, that look like the following:
| A24571 R23575 7708785 40100
msplit1.txt-=650 $aExecutions and executioners$xEngland.
Philosophy, Ancient$vEarly works to 1800
Sermons, English$y17th century
Sermons, English$y17th century
An (initial, failed) attempt to parse some MARC data in TXT format.
After taking a look at the files with parsed subject headings, I loaded several of these files into OpenRefine to see if/how this tool could help with crosswalking the data from MARC to ARC RDF. There were a few missteps here, but I know the direction we need to go in now.
- I made sure all my notes from the Great Lakes Aggregation project meeting today were stored in our project management software, Basecamp.
- Now, I’m off to finish my list of accomplishments and revised CV. Staff performance evaluations are this month at TAMU, and my performance evaluation meeting is scheduled for tomorrow morning, first thing.
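For the first step described in the list above (identifying the distinct subject headings before crosswalking them), a short Python sketch can tally how often each heading appears, much as a text facet would. The sample lines are copied from the excerpt shown earlier; the function name `count_headings` is an assumption for illustration and not part of our actual workflow.

```python
from collections import Counter


def count_headings(lines):
    """Tally distinct subject headings, ignoring blank lines."""
    counts = Counter()
    for line in lines:
        heading = line.strip()
        if heading:
            counts[heading] += 1
    return counts


# Sample values taken from the excerpt of parsed subject headings.
sample = [
    "Philosophy, Ancient$vEarly works to 1800",
    "Sermons, English$y17th century",
    "Sermons, English$y17th century",
]

counts = count_headings(sample)
for heading, n in counts.most_common():
    print(n, heading)
```

A tally like this only shows which values are worth mapping first; the crosswalk decisions themselves still need a human eye.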
Just another DH day.
I am Timothy Duguid, Postdoctoral Research Associate for the IDHMC. I am also the PI for an upcoming NEH-funded project, “MuSO: Data Aggregation and Peer Review in Music.” My current work for the IDHMC mainly involves data visualization. I manage our Humanities Visualization Space and help to develop and test software for that space, including BigDIVA (Big Data Infrastructure Visualization Application).
My DayofDH began by checking emails and catching up on Twitter and Facebook for both the IDHMC (@IDHMC_nexus) and the Advanced Research Consortium (@ARCtamu). Of particular interest today are the goings-on at the Music Encoding Conference in Florence, Italy. After catching up with that, I resumed my research on visualized search engines. I have been reading about visual search, with the intent of using the latest findings in that field to help us with our BigDIVA design.
I also started looking into Google’s latest search algorithm changes (for instance, see this article on searchengineland.com). Dubbed the “Quality Update”, this has had a significant impact on the results that appear at the top of the page of search results. In fact, it seems that commercial sites now tend to appear at the top of the list, which is potentially bad news for researchers. This could be an interesting case study into the benefit of using services like BigDIVA. By eschewing proprietary search algorithms, BigDIVA as a research tool provides all contributing digital resources with equal footing, and its searches are reproducible.
I spent a little time working out how to parse the text file I’m creating (via a bash script) of EEBO docs with associated ESTC subject headings. Liz Grumbach wanted to be able to see subject headings broken up by MARC code and subfield into separate files so she could determine which ones would be most useful, and start developing a crosswalk for those headings with ARC’s subject headings. She’s going to use Google Refine to do this work–very cool.
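As a rough illustration of breaking headings up by MARC code and subfield, here is a minimal Python sketch. It is not the bash script mentioned above. It assumes each input line contains a field of the form `=650 $aExecutions and executioners$xEngland.`, and the function name `split_by_code_and_subfield` and both regular expressions are my own assumptions.

```python
import re
from collections import defaultdict

# Assumed line shape: "=<3-digit MARC tag> $<code><value>$<code><value>..."
FIELD_RE = re.compile(r"=(\d{3})\s+(.*)")
SUBFIELD_RE = re.compile(r"\$([a-z0-9])([^$]*)")


def split_by_code_and_subfield(lines):
    """Group subfield values under (MARC tag, subfield code) keys."""
    buckets = defaultdict(list)
    for line in lines:
        match = FIELD_RE.search(line)
        if not match:
            continue  # skip lines that carry no recognizable field
        tag, body = match.groups()
        for code, value in SUBFIELD_RE.findall(body):
            buckets[(tag, code)].append(value.strip())
    return dict(buckets)


sample = [
    "=650 $aExecutions and executioners$xEngland.",
    "=650 $aSermons, English$y17th century",  # assumed full form of a heading
]

for (tag, code), values in sorted(split_by_code_and_subfield(sample).items()):
    print(tag, code, values)
```

Each `(tag, code)` bucket could then be written to its own file, one value per line, which is roughly the shape that loads easily into a tool like Refine.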
Had lunch with some eMOP collaborators in the Texas A&M Evans Library to talk about some final coding work.
After taking a break for a doctor’s visit and lunch, I returned to the office for a meeting with the project manager of the Great Lakes Aggregation (GLA) project, one of ARC’s new regional nodes.
Corinne Vieracker and I met to discuss next steps for GLA development; they are currently working on testing a development environment of their new Collex software and beginning to transform metadata from contributors to meet the ARC RDF specifications for ingestion into our federated Solr index.
We got to have some great discussions about how to use OpenRefine as a tool to generate or transform metadata formats, especially from Dublin Core records to the ARC RDF schema. Corinne and I have both been working with an RDF extension for OpenRefine, written by DERI, that allows users to construct an RDF skeleton and export RDF documents as XML.
We also considered how to use OpenRefine to map and/or crosswalk subject terms from MARC or Dublin Core to the genre, discipline, and type formats that ARC requires for ingestion into Solr. Then, we concluded our metadata chat by considering how URIs (uniform resource identifiers) are used as unique record ids in the ARC system, as compared to how they are published for the semantic web in an RDF triple-store.
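To make the crosswalk idea concrete, here is a minimal Python sketch of a lookup-table mapping from a subject heading's main term to genre and discipline values. Everything in the table is a hypothetical placeholder: the target values are not taken from the actual ARC vocabulary, and in practice we would build the mapping in OpenRefine rather than in code.

```python
# Hypothetical crosswalk table: the target values are placeholders,
# not the actual ARC genre/discipline vocabulary.
CROSSWALK = {
    "Sermons, English": {"genre": "Sermon", "discipline": "Religious Studies"},
    "Philosophy, Ancient": {"genre": "Nonfiction", "discipline": "Philosophy"},
}


def crosswalk(heading):
    """Return mapped values for a heading's main term, or None if unmapped."""
    # Keep only the text before the first "$" subdivision marker.
    main_term = heading.split("$", 1)[0].strip().rstrip(".")
    return CROSSWALK.get(main_term)


headings = [
    "Sermons, English$y17th century",
    "Executions and executioners$xEngland.",
]
for heading in headings:
    print(heading, "->", crosswalk(heading))
```

Returning `None` for unmapped headings keeps the gaps visible, so the terms that still need a decision can be collected and reviewed.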
Thanks to Corinne and GLA for kicking off my afternoon in a great way!
I spent some time this morning going through my inbox to clear out some old emails and address stuff that I’d let slip. Then I spent a little time updating the TxDHC website. It’s a great organization and I’m proud to be their webmaster.
Finally, I spent some time tracking down a bug in some Solr indexing software for ARC.
Back to emails.
Today I will be returning from the Scale and Value Conference, which was held last week at the University of Washington. I stayed on to help lead a discussion about the relationship between Surface Reading and Data Mining in Jeffery Todd Knight’s Colloquium in Digital Culture and Digital Humanities, held at the beautiful Simpson Center, which is led by the wonderful Kathleen Woodward.
Before flying home, I will be meeting shortly with Sarah Kremen Hicks to discuss a digital poetry project: she and Brian Gutierrez lead the Simpson-Center pilot program called Demystifying Digital Humanities.
Hello! I’m Director of the Initiative for Digital Humanities, Media, and Culture at Texas A&M. I lead a dedicated team of people working on the Early Modern OCR Project and ARC, which oversees NINES, 18thConnect, and MESA. We have just put up a wall of screens in order to make our Humanities Visualization Space, and here you can see the ARC Project Manager Liz Grumbach working on our new BigDIVA tool:
Also, I have just received the advance copy of my book, Breaking the Book: Print Humanities in the Digital Age.
Every day begins with emails. After confirming a few meetings for the week, I responded to an ongoing conversation we’ve had with our programming support for ARC (Performant Software) regarding wonky error messages received during the RDF indexing process.
After reading up a bit, I’m about to chat with our Lead Developer to try some possible solutions and tweaks to the code to optimize our indexing process for larger sets of data.
The goal: fully index the Early English Books Online RDF into the ARC Solr Index, including the indexing of all EEBO-TCP Phase 1 texts to provide full text searching through the 18thConnect interface. A secondary, but just as important, goal: to have efficient Ruby rake tasks for our indexing process.
I’m thrilled to be blogging this year for DayofDH! At the IDHMC, my main duties include the management of two of our signature projects: the Advanced Research Consortium (ARC) and 18thConnect. 18thConnect, a virtual research environment for 18th-Century humanities scholars, is one of the nodes of ARC, which is an organization that oversees the social and technical aspects of our other period-specific and thematic nodes, like NINES, MESA, and SiRO. 18thConnect offers peer review to digital projects produced by humanities scholars, and the ARC federated Solr index aggregates content from peer-reviewed projects and other vetted sources to provide search and exhibit-building functionality for our users.
As a Research Associate on staff at the IDHMC, my “other duties as needed” include supporting graduate student and faculty projects, as well as participating in the planning and support of other IDHMC signature projects (like eMOP and BigDIVA).
I started today with emails, of course, and then set up the IDHMC’s Day of DH website. I also checked the progress of some scripts I am running both on my computer and on a server. One is going through some ESTC metadata files we have to pull out subject headings for each EEBO document we have in our eMOP DB. The other takes a list of EEBO documents that we think are missing from the DB and searches our folder of EEBO docs for those “missing” documents that we actually have. They are both going very slowly due to the amount of data and files involved, but it’ll be great once they are done and we can use this data to improve our eMOP DB and the EEBO metadata we have uploaded to 18thConnect.
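The second script's job, checking a list of supposedly missing documents against a folder tree, can be sketched in a few lines of Python. This is an illustration under assumptions rather than the script I am actually running: it assumes each file is named after its document ID (for example, a hypothetical `A24571.xml`), and the function name `find_present` is made up for this example.

```python
import os


def find_present(missing_ids, root):
    """Return {doc_id: path} for 'missing' IDs that turn up under root."""
    wanted = set(missing_ids)
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Assumption: the file name minus its extension is the document ID.
            doc_id = os.path.splitext(name)[0]
            if doc_id in wanted and doc_id not in found:
                found[doc_id] = os.path.join(dirpath, name)
    return found
```

Calling `find_present(ids, "/path/to/eebo")` walks the tree once and checks each file name against a set, which matters when the folder holds a very large number of files.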