Wednesday, October 29, 2008

Thoughts on crosswalking

For the second Integrating Digital Papyrology project, we need to develop a method for crosswalking between EpiDoc (a dialect of TEI) and various database formats. We've thought about this quite a bit in the past, and we don't want to just write a one-off conversion, because (a) there will be more than one such conversion and (b) we want to be able to document the mappings between data sources in a stable format that isn't just code (a script, XSLT, etc.).

Some of the requirements for this notional tool are:

  • It should document mappings between data formats in a declarative fashion.

  • Certain fields will require complex transformations. For example, the document text will likely be encoded in some variant of Leiden in the database, and will need to be converted to EpiDoc XML. This is currently accomplished by a fairly complex Python script, so it should be possible to define categories of transformation that signal a call to an external process.

  • Some mappings will involve combining several database fields into a single EpiDoc element; others will involve dividing a single field into multiple EpiDoc elements.

  • Context-specific information (not included in the database) will need to be inserted into the EpiDoc document, so some sort of templating mechanism should be supported.

  • The mapping should be bidirectional. We aren't just talking about exporting from a database to EpiDoc, but also about importing from EpiDoc, which is envisioned as an interchange format as well as a publication format. This is why a single mapping document, rather than a separate set of instructions for each direction, would be nice.
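
To make the requirements above a little more concrete, here is a minimal sketch in Python of what a single declarative, bidirectional mapping might look like. Everything in it is an assumption made for illustration: the field names, the target labels (which are not real EpiDoc paths), the `MAPPING` structure, and the transform names are all hypothetical, and the Leiden conversion and templating requirements are left out entirely. The point is only that the mapping is plain data, and that each named transform carries both a forward and a reverse function, so one document can drive export and import.

```python
# Hypothetical transforms. Each takes or returns a list of field values so the
# same shape works for one-to-one, many-to-one, and one-to-many mappings.
def identity_forward(values):
    return values[0]

def identity_reverse(text):
    return [text]

def join_date(values):
    # Combine several database fields into one value, e.g. ["0150", "03"].
    return "-".join(values)

def split_date(text):
    # Reverse of join_date: divide one value back into its fields.
    return text.split("-")

# Registry of named transformation categories: name -> (forward, reverse).
# A real tool might let an entry here call out to an external process.
TRANSFORMS = {
    "identity": (identity_forward, identity_reverse),
    "join_date": (join_date, split_date),
}

# The mapping itself is data, not code. Field and target names are made up.
MAPPING = [
    {"fields": ["title"], "target": "title", "transform": "identity"},
    {"fields": ["year", "month"], "target": "origDate", "transform": "join_date"},
]

def export_record(record, mapping=MAPPING):
    """Database row (dict) -> dict keyed by target label."""
    doc = {}
    for rule in mapping:
        forward, _ = TRANSFORMS[rule["transform"]]
        doc[rule["target"]] = forward([record[f] for f in rule["fields"]])
    return doc

def import_record(doc, mapping=MAPPING):
    """Dict keyed by target label -> database row (dict)."""
    record = {}
    for rule in mapping:
        _, reverse = TRANSFORMS[rule["transform"]]
        values = reverse(doc[rule["target"]])
        record.update(zip(rule["fields"], values))
    return record
```

Calling `export_record` on a row and then `import_record` on the result should give back the original row, which is the round-trip property a bidirectional mapping needs. Because `MAPPING` is just a list of dictionaries, it could equally be stored as XML or JSON and serve as the documentation of the crosswalk.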

So far, my questions to various lists have turned up favorable responses (i.e. "yes, that would be a good thing") but no existing standards....
