Scriptio Continua
Tuesday, October 27, 2009
Object Artefact Script
A couple of weeks ago, I attended a workshop at the Edinburgh eScience Institute on the relation of text in ancient (and other) documents to its context, on the problems of reading difficult texts on difficult objects, and on ways in which technology can aid the process of interpretation and dissemination without getting in the way of it. The meeting was well summarized by Alejandro Giacometti in his blog, and the presentations are posted on the eSI wiki.

Kathryn Piquette discussed what would be required to digitally represent Egyptian hieroglyphic texts without divorcing them from their contexts as an integral part of monumental architecture. For example, the interpretation of the meaning of texts should be able to take into account the times of day (and/or year) when they could have been read, their relationship to their surroundings, and so on. The established epigraphical practice of divorcing the transcribed text from its context, while often necessary, does some violence to its meaning, and this must be recognized and accounted for. At the same time, digital 3D reconstructions are themselves an interpretation, and it is important to disclose the evidence on which that interpretation is based.

Ségolène Tarte talked about the process of scholarly interpretation in reading the Vindolanda tablets and similar texts. As part of analysing the scholarly reading process, the eSAD project observed two experts reading a previously-published tablet. During the course of their work, they came up with a new reading that completely changed their understanding of the text. The previous reading hinged on the identification of a single word, which led to the (mistaken) recognition of the document as recording the sale of an ox. The new reading hinged on the recognition of a particular letterform as an 'a'. The ways in which readings of difficult texts are produced (skipping around looking for recognizable pieces of text, constructing multiple partial mental models of the text upon them, and then somehow resolving those models into a reading) mean that an Interpretation Support System (such as the one eSAD proposes to develop) must be sensitive to the different ways of reading scholars use and must be careful not to impose "spurious exactitude" on them.

Dot Porter gave an overview of a variety of projects that focus on representing text, transcription, and annotation alongside one another as a way into discussing the relationship between digital text and physical text. She cautioned against attempts to digitally replicate the experience of the codex, since there is a great deal of (necessary) data interpolation that goes on in any detailed digital reconstruction, and this elides the physical reality of the text. Digital representations may improve (or even make possible) the reading of difficult texts, such as the Vindolanda tablets or the Archimedes Palimpsest, so for purposes of interpretation, they may be superior to the physical reality. They can combine data, metadata, and other contextual information in ways that help a reader to work with documents. But they cannot satisfactorily replicate the physicality of the document, and it may be a bit dishonest to try.

I talked about the img2xml project I'm working on with colleagues from UNC Chapel Hill. I've got a post or two about that in the pipeline, so I won't say much here. It involves the generation of SVG tracings of text in manuscript documents as a foundation for linking and annotation. Since the technique involves linking to an XML-based representation of the text, it may prove superior to methods that rely simply on pointing at pixel coordinates in images of text.

Ryan Bauman talked about the use of digital images as scholarly evidence. He gave a fascinating overview of sophisticated techniques for imaging very difficult documents (e.g. carbonized, rolled up scrolls from Herculaneum) and talked about the need for documentation of the techniques used in generating the images. This is especially important because the images produced will not resemble the way the document looks in visible light. Ryan also talked about the difficulties involved in linking views of the document that may have been produced at different times, when the document was in different states, or may have used different techniques. The Archimedes Palimpsest project is a good example of what's involved in referencing all of the images so that they can be linked to the transcription.

Finally, Leif Isaksen talked about how some of the techniques discussed in the earlier presentations might be used in crowdsourcing the gathering of data about inscriptions. Inscriptions (both published and unpublished) are frequently encountered (both in museums and out in the open) by tourists who may be curious about their meaning, but lack the ability to interpret them. They may well, however, have sophisticated tools available for image capture, geo-referencing, and internet access (via digital cameras, smartphones, etc.). Can they be employed, in exchange for information about the texts they encounter, as data gatherers?

Some themes that emerged from the discussion included:

This was a terrific workshop, and I hope to see followup on it. eSAD is holding a workshop next month on "Understanding image-based evidence," which I'm sorry I can't attend, though I look forward to seeing its output.

 
Friday, October 16, 2009
Stomping on Innovation Killers
@foundhistory has a nice post on objections one might hear on a grant review panel that would unjustly torpedo an innovative proposal. I thought it might be a good idea to take a sideways look at these as advice to grant writers.





So, some ideas for countering these when you're working on your proposal:


  1. Have you looked at work that's been done in this area (this might entail some real digging)? If there are projects and/or literature that deal with the same areas as your proposal, then you should take them into account. You need to be able to show you've done your homework and that your project is different from what's come before.

  2. Who is your audience? Have you talked to them? If you can get letters of support from one or more of them, that will help silence the stakeholders objection.

  3. You ought to have some sort of story about sustainability and/or the future beyond the project, to show that you've thought about what comes next. Even if your project is an experiment, you should talk about how you're going to disseminate the results so that those who come after will be able to build on your work.



I agree with Tom that these criticisms can be deployed to stifle creative work. In technology, sometimes wheels need to be reinvented, sometimes the conventional wisdom is flat wrong, and sometimes worrying overmuch about the future paralyses you. But if you're writing a proposal, assume these objections will be thrown at it, and do some prior thinking so you can spike them before they kill your innovative idea.
 
Monday, August 10, 2009
Upgrade Notes
During my recent work on moving the Papyrological Navigator from Columbia to NYU, I ran into some issues that bear noting. It's a bit hard to know whether these are generalizable, but they seem to me to be good examples of the kinds of things that can happen when you're upgrading a complex system, and I don't want to forget about them.

Issue #1
Search results in the PN are supposed to return with KWIC snippets, highlighting the search terms. As part of the move, I upgraded Lucene to the latest release (2.4.1). The Lucene in the PN was 2.3.x, but the developer at Columbia had worked hard to eke as much indexing speed out of it as possible, and had imported code from the 2.4 branch, with some modifications. Since this code was really close to 2.4, I'd had reason to hope the upgrade would be smooth, and it mostly was. Highlighting wasn't working for Greek though, even though the search itself was...

Debugging this was really hard, because as it turned out, there was no failure in any of the running code. It just wasn't running the right code. A couple of the slightly modified Lucene classes in the PN codebase were being stepped on by the new Lucene, because instead of living in a jar named "ddbdp.jar", they now lived in jars named after the project in which they resided (so, "pn-ddbdp-indexers.jar"). As a result, they were getting loaded after Lucene instead of before. Not the first time I'd seen this kind of problem, but always a bit baffling. In the end I moved the PN Lucene classes out of the way by changing their names and how they were called.
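To make the failure mode concrete, here is a toy model of "first definition on the classpath wins". It is a sketch in Python, not the PN's or the servlet container's actual loading code. The alphabetical load order, the Lucene jar filename, and the class names are all assumptions made for illustration.

```python
from typing import List, Optional, Set, Tuple

Jar = Tuple[str, Set[str]]  # (jar filename, names of the classes it contains)

def resolve(class_name: str, classpath: List[Jar]) -> Optional[str]:
    """Return the first jar on the classpath that defines class_name."""
    for jar_name, classes in classpath:
        if class_name in classes:
            return jar_name
    return None

# Placeholder class name: it stands in for one of the modified Lucene classes.
PATCHED = "org.apache.lucene.SomePatchedClass"

lucene: Jar = ("lucene-core-2.4.1.jar", {PATCHED, "org.apache.lucene.OtherClass"})
old_pn: Jar = ("ddbdp.jar", {PATCHED})
new_pn: Jar = ("pn-ddbdp-indexers.jar", {PATCHED})

# Assumption: jars are loaded in alphabetical order by filename.
before_move = sorted([lucene, old_pn], key=lambda jar: jar[0])
after_move = sorted([lucene, new_pn], key=lambda jar: jar[0])

print(resolve(PATCHED, before_move))  # ddbdp.jar
print(resolve(PATCHED, after_move))   # lucene-core-2.4.1.jar
```

Nothing in either jar changed except a filename, yet a different copy of the class ends up running. That is why no amount of staring at stack traces helped.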

Issue #2

This one was utterly baffling as well. Lemmatized search (that is, searching for dictionary headwords and getting hits on all the forms of the word—very useful for inflected languages, like Greek) was working at Columbia, and not at NYU. Bizarre. I hadn't done anything to the code. Of course, it was my fault. It almost always is the programmer's fault. A few months before, in response to a bug report (and before I started working for NYU), I had updated the transcoder software (which converts between various encodings for Ancient Greek) to conform to the recommended practice for choosing which precomposed (letter + accent) character to use when the same one (e.g. alpha + acute accent) occurs in both the Greek (Modern) and Greek Extended (Ancient) blocks in Unicode. Best practice is to choose the character from the Greek block, so \u03AC instead of \u1F71 for ά. Transcoder used to use the Greek Extended character, but since late 2008 it has followed the new recommendation and used characters from the Greek block, where available. Unfortunately this change happened after transcoder had been used to build the lemma database that the PN uses to expand lemmatized queries. So it had the wrong characters in it, and a search for any lemma containing an acute accent would fail. Again, all the code was executing perfectly; some of the data was bad. It didn't help that when I pasted lemmas into Oxygen, it normalized the encoding, or I might have realized sooner that there were differences.
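The two code points are easy to tell apart programmatically, even though they look identical on screen. The following is a minimal sketch using Python's standard unicodedata module, not the transcoder's code. The sample lemma and the function names are my own choices for illustration.

```python
import unicodedata

TONOS = "\u03AC"  # alpha with acute from the Greek block
OXIA = "\u1F71"   # alpha with acute from the Greek Extended block

def describe(text: str) -> list:
    """List each character's code point and Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Illustrative lemma, spelled once with each kind of accented omicron.
stored = "\u03BB\u1F79\u03B3\u03BF\u03C2"   # Greek Extended omicron with oxia
queried = "\u03BB\u03CC\u03B3\u03BF\u03C2"  # Greek block omicron with tonos

print(describe(TONOS))                  # ['U+03AC GREEK SMALL LETTER ALPHA WITH TONOS']
print(describe(OXIA))                   # ['U+1F71 GREEK SMALL LETTER ALPHA WITH OXIA']
print(stored == queried)                # False: a raw string match fails
print(same_after_nfc(stored, queried))  # True: NFC maps both to the Greek block
```

NFC maps the Greek Extended character onto its Greek block equivalent, so normalizing both the stored lemmas and the incoming query would have made the comparison succeed. It also explains why a tool that silently normalizes pasted text hides the difference.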

Issue #3

Last, but not least, was a bug which manifested as a failure in certain types of search. "A followed by B within n places" searches worked, but "A and B (not in order) within n places" and "A but not B within n places" both failed. Again, no apparent errors in the PN code. The NullPointerException that was being thrown came from within the Lucene code! After a lot of messing about, I was able to determine that the failure was due to a Lucene change that the PN code wasn't implementing against. Once I'd found that, all it took to fix it was to override a method from the Lucene code. This was actually a Lucene bug (https://issues.apache.org/jira/browse/LUCENE-1748) which I reported. In trying to maintain backward compatibility, they had kept compile-time compatibility with pre-2.4 code, but broken it in execution. I have to say, I was really impressed with how fast the Lucene team, particularly Mark Miller, responded. The bug is already fixed.
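For readers who haven't used proximity searches, the three query types can be modelled on a plain list of tokens. This is a deliberately simplified sketch: it is not Lucene's implementation, and "within n places" is assumed here to mean a position difference of at most n, which may not match Lucene's own definition of slop.

```python
def positions(tokens, term):
    """Indexes at which term occurs in tokens."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def ordered_near(tokens, a, b, n):
    """A followed by B within n places."""
    return any(0 < j - i <= n
               for i in positions(tokens, a)
               for j in positions(tokens, b))

def unordered_near(tokens, a, b, n):
    """A and B, in either order, within n places."""
    return any(0 < abs(j - i) <= n
               for i in positions(tokens, a)
               for j in positions(tokens, b))

def near_without(tokens, a, b, n):
    """A but not B within n places: some A has no B within n positions."""
    b_positions = positions(tokens, b)
    return any(all(abs(j - i) > n for j in b_positions)
               for i in positions(tokens, a))

tokens = "b x a y z a".split()
print(ordered_near(tokens, "a", "b", 2))    # False: no b follows an a
print(unordered_near(tokens, "a", "b", 2))  # True: b is two places before the first a
print(near_without(tokens, "a", "b", 2))    # True: the second a has no b nearby
```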

So, lessons learned:


  1. Tests are good. I didn't have any working tests for the project that contained all of the bugs listed here. Tests exist (though coverage is spotty), but they have dependencies that are tricky to resolve, and I had decided to defer getting them to work in favor of getting the PN online. Not having tests ate into the time I'd saved by deferring them.

  2. In both cases #1 and #3, I had to find the problem by reading the code and stepping through it in my head. Practice this basic skill.

  3. Look for ways your architecture may have changed during the upgrade. Anything may be significant, including filenames.

  4. Greek character encoding is the Devil (but I already knew that).

  5. It's probably your fault, but it might not be. Look closely at API changes in libraries you upgrade. Go look at the source if anything looks fishy. I didn't expect to find anything wrong with something as robust as Lucene, but I did.

 
Friday, January 23, 2009
Endings and Beginnings
It's been that sort of a week. Great beginning with the inauguration on Tuesday and the start of a new Obama presidency. My wife was in tears. Growing up in a small southern town, she never imagined she'd see a black president, and now our youngest daughter will never know a world in which there hasn't been one. Sometimes things do change for the better.

On a personal note, I gave my notice to UNC on Tuesday. My position was partially funded with soft money, and one-time money is one of the primary ways they're trying to address the budget crisis, in order not to lay off permanent employees (as is right and proper). I'm rather sad about leaving, but I will be starting a job with the NYU digital library team in February, working on digital papyrology. This has the look of a job where I can unite both the Classics geek and the tech geek sides of my personality. I may become unbearable.
 
Wednesday, December 31, 2008
OpenLayers and Djatoka
For the last few weeks, I've been playing around with the new JPEG2000 image server released by the Los Alamos National Labs (http://african.lanl.gov/aDORe/projects/djatoka/). I never could get the image viewer released along with it to work, and I immediately thought of OpenLayers (http://openlayers.org/), a JavaScript API for embedding maps. OpenLayers is like Google Maps in many ways, but Free. Although its tools were developed for mapping, they work very well for displaying and working with any large image. I wanted to use OpenLayers' support for tiled images in conjunction with Djatoka's ability to render arbitrary sections of an image at a number of zoom levels (the number of levels available depends on how the image was compressed).

After a lot of messing around and some false starts, I've developed a JavaScript class that supports Djatoka's OpenURL API. I've been testing it on JPEG2000 images created with ContentDM in the UNC Library's digital collections, with a good deal of success. The results are not yet available online, because I don't have a public-facing server I can host them on, but the source code is up on GitHub here.

Instructions:

Install Djatoka. Incidentally, in order to get this in the queue for installation on our systems, I had to make Djatoka work on Tomcat 6. The binary doesn't work out of the box, but when I rebuilt it on my system (RHEL 5), it worked fine.

Copy the adore-djatoka WAR into your Tomcat webapps directory. Follow the instructions on the Djatoka site to start the webapp.

Grab a copy of OpenLayers. Put the OpenURL.js file in lib/OpenLayers/Layer/ and run the build.py script.

To just run the demo, copy the djatoka.html, the OpenLayers.js you just built, and the .css files from OpenLayers/theme/ and from the examples/ directory, as well as the OpenLayers control images from OpenLayers/img into the adore-djatoka directory in webapps. You should then be able to access the djatoka.html file and see the demo.

This all comes with no guarantees, of course. It seems to work quite well with the JPEG2000 images I've tested, and the tiling means that each request to Djatoka consumes an equal amount of resources. I've run into OutOfMemoryErrors when requesting full-size images, but this method loads them without any problem.
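The arithmetic behind the tiling is simple enough to sketch. This is Python rather than the JavaScript in OpenURL.js, and it is not a port of that code. The 256-pixel tile size and the assumption that each zoom level below the maximum halves both dimensions are mine, chosen for illustration.

```python
import math

def level_size(full_width, full_height, max_level, level):
    """Image size at a zoom level, assuming each level down halves both dimensions."""
    scale = 2 ** (max_level - level)
    return math.ceil(full_width / scale), math.ceil(full_height / scale)

def tile_grid(width, height, tile=256):
    """Yield (x, y, w, h) for every tile needed to cover a width x height image."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            # Tiles on the right and bottom edges may be smaller than `tile`.
            yield x, y, min(tile, width - x), min(tile, height - y)

# Hypothetical 4000 x 3000 image with zoom levels 0 through 4.
width, height = level_size(4000, 3000, max_level=4, level=2)
tiles = list(tile_grid(width, height))

print(width, height)  # 1000 750
print(len(tiles))     # 12
print(tiles[-1])      # (768, 512, 232, 238)
```

No single request ever asks the server for more than one tile's worth of pixels, which is why this approach avoids the memory problems I hit with full-size requests.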

Update (2009-01-05 14:37): I've posted a fix to the OpenURL.js script for a bug pointed out to me by John Fereira on the djatoka-devel list. If you grabbed a copy before now, you should update.

 
Wednesday, October 29, 2008
Thoughts on crosswalking
For the second Integrating Digital Papyrology project, we need to develop a method for crosswalking between EpiDoc (which is a dialect of TEI) and various database formats. We've thought about this quite a bit in the past, and we don't want to write just a one-off conversion, because (a) there will be more than one such conversion and (b) we want to be able to document the mappings between data sources in a stable format that isn't just code (script, XSLT, etc.).
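As a rough illustration of what "mappings that aren't just code" might look like, here is a sketch in which the mapping is plain data and a small generic function applies it. This is not an existing standard or the tool we eventually want. The column names, the element paths, and the sample document are all hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical mapping: database column -> path into the TEI document.
# Because it is data, it could equally be stored as XML or JSON and documented there.
MAPPING = [
    {"column": "title", "path": ".//tei:titleStmt/tei:title"},
    {"column": "provenance", "path": ".//tei:origPlace"},
    {"column": "date", "path": ".//tei:origDate"},
]

def crosswalk(xml_text, mapping=MAPPING):
    """Apply a declarative mapping to one document and return one database row."""
    root = ET.fromstring(xml_text)
    row = {}
    for rule in mapping:
        value = root.findtext(rule["path"], default=None, namespaces=TEI_NS)
        row[rule["column"]] = value.strip() if value else None
    return row

# Made-up sample document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Example papyrus</title></titleStmt>
      <sourceDesc><msDesc><history><origin>
        <origPlace>Oxyrhynchus</origPlace>
      </origin></history></msDesc></sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""

print(crosswalk(SAMPLE))
# {'title': 'Example papyrus', 'provenance': 'Oxyrhynchus', 'date': None}
```

The point of the design is that adding a second conversion means writing a second mapping, not a second program, and the mapping itself serves as the documentation.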

Some of the requirements for this notional tool are:


So far, my questions to various lists have turned up favorable responses (i.e. "yes, that would be a good thing") but no existing standards....
 
Monday, October 20, 2008
On Bamboo the 2nd
I spent Thursday through Saturday last week at the second Bamboo workshop in San Francisco. So, some reactions:

1) The organizers are well-intentioned and are sincerely trying to wrestle with the problem of cyberinfrastructure for Digital Humanities.

2) That said, it isn't clear that the Bamboo approach is workable. The team is very IT focused, and while they seem to have a solid grasp of large-scale software architecture, the ways in which that might be applied to the Humanities with any success aren't obvious. There was a lot of misdirected effort between B1 and B2 by some very smart people, who I must say had the good grace to admit it was a nonstarter. Their attempt to factor the practices of scholars into implementable activities resulted in something that lacked enough context and specificity to be useful. A refocusing on context and on the processes that contain and help define the activities happened at the workshop and seems likely to go forward.

3) The workshops themselves seem to have been quite useful. I wasn't at any of the round one workshops, and I doubt I'll be at any of the others (I represented the UNC Library because the usual candidates weren't available), but everyone I talked to was very engaged (if often skeptical). The connections and discussion that seem to have emerged so far probably make the investment worthwhile, even if "Bamboo" as conceived doesn't work.

4) The best idea I heard came (not surprisingly) from Martin Mueller, who suggested Bamboo become a way to focus Mellon funding on projects that conform to certain criteria (such as reusable components and standards) for a defined period (say five years). The actual outcome of the current Bamboo would be the criteria for the RFP. Simple, encourages institutions to think along the right lines, might actually do some good, and might allow participation by smaller groups as well.

5) There was a lot of talk about the people who are both researchers and technologists (guilty). These were variously defined as "hybrids," "translators," and, most offensively, "the white stuff inside the Oreo." None of this was meant to be offensive, but in the end, it is. People who can operate comfortably in both the worlds of scholarship and IT can certainly be useful go-betweens for those who can't, but that is not our sole raison d'être. Until recently there haven't been many jobs for us, but that seems to be changing, and I hope it continues to. See Lisa Spiro's excellent recent post on Digital Humanities Jobs, and Sean Gillies, who, without having been there, manages to capture some of the reservations I feel about the current enterprise and to pick up on the educational aspect. One possible useful future for Bamboo would be simply to foster the development of more "hybrids."

6) The Bamboo folks have set themselves a truly difficult task. They are making a real effort to tackle it in an open way, and should be commended for it. But it is a very hard problem, and one for which there is still not a clear definition. The software engineer part of my hybrid brain wants problems defined before it will even consider solutions. The classicist part believes some things are just hard, and you can't expect technology to make them easy for you.
 
Thoughts on software development, Digital Humanities, the ancient world, and whatever else crosses my radar.

Name: Hugh Cayless
Location: Chapel Hill, North Carolina, United States