Thursday, November 10, 2011

TEI in other formats; part the first: HTML

There has been a fair amount of discussion of late about TEI either having a standard HTML5 representation or even moving entirely to an HTML5 format. I want to do a little thinking "out loud" about how that might work.

Let's start with a fairly standard EpiDoc document (EpiDoc being a set of guidelines for using TEI to mark up ancient documents).;2;74/source (see;2;74/ for an HTML version with more info) is a fairly typical example of EpiDoc used to mark up a papyrus document. The document structure is fairly flat, but with a number of editorial interventions, all marked up. Line 12, below, shows supplied, unclear, and gap tags

<lb n="12"/><gap reason="lost" quantity="16" unit="character"/> <supplied reason="lost"> Π</supplied><unclear>ε</unclear>ρὶ Θή<supplied reason="lost">βας καὶ Ἑ</supplied><unclear>ρ</unclear>μωνθ<supplied reason="lost">ίτ </supplied><gap reason="lost" extent="unknown" unit="character"/>

So how might we take this line and translate it to HTML? First, we have an <lb> tag, which at first glance would seem to map quite readily onto the HTML <br> tag, but if we look at the TEI Guidelines page for lb (, we see a large number of possible attributes that don't necessarily convert well. In practice, all I usually see on a line break tag in TEI is an @n and maybe an @xml:id attribute. HTML doesn't really have a general-purpose attribute like @n, but @class or @title might serve. On <lb>, @n is often used to provide line numbers, so @title seems logical.

Now <gap reason="lost" quantity="16" unit="character"/> is a bit more of a puzzler. First, HTML's semantics don't extend at all to the recording of attributes of a text being transcribed, so nothing like the gap element exists. We'll have to use a general-purpose inline element (span seems obvious) and figure out how to represent the attribute values. TEI has no lack of attributes, and these don't naturally map to HTML at all in most cases. If we're going to keep TEI's attributes, we'll have to represent them as child elements.  We'll want to identify both the original TEI element and wrap its attributes and maybe its content too, so let's assume we'll use the @class attribute with a couple of fake "namespaces", "teie-" for TEI element names, "teia-" for attribute names, and "teig-" to identify attributes and wrap element contents (the latter might be overkill, but seems sensible as a way to control whitespace). We can assume a stylesheet with a span.teig-attribute selector that sets display:none.

<span class="tei-gap">
  <span class="teig-attribute teia-reason">lost</span>
  <span class="teig-attribute teia-quantity">16</span>
  <span class="teig-attribute teia-unit">character</span>

Like HTML, TEI has three structural models for elements: block, inline, and milestone. Block elements assume a "block" of text, that is, they create a visually distinct chunk of text. Divs, paragraphs, tables, and similar elements are block level. Inline elements contain content, but don't create a separate block. Examples are span in HTML, or hi in TEI. Milestones are empty elements like lb or br. TEI has several of these, and HTML, which has "generic" elements of the block and inline varieties (div and span) lacks a generic empty element. Hence the need to represent tei:gap as a span.

tei:supplied is clearly an inline element, and we can do something similar to the example above, using span:

<span class="tei-supplied">
  <span class="teig-attribute teia-reason">lost</span>
  <span class="teig-content">Π</span>

and likewise with unclear:

<span class="tei-unclear">
  <span class="teig-content">ε</span>

Now, doing this using generic HTML elements and styling/hiding them with CSS could be considered bad behavior. It's certainly frowned upon in the CSS 2.1 spec (see the note at I don't honestly see another way to do it though, because, although RDFa has been suggested as a vehicle for porting TEI to HTML, there is no ontology for TEI, so no good way to say "HTML element p is the same as TEI element p, here". Even granting the possibility of saying that, it doesn't help with the attribute problem. And we're still left with the problem of presentation: what will my HTML look like in a browser? It must be said that my messing about above won't produce anything like the desired effect, which for line 12 is something like:
[- ca.16 - Π]ε̣ρὶ Θή[βας καὶ Ἑ]ρ̣μωνθ[ίτ -ca.?- ] 
I could certainly make it so, probably with a combination of CSS and JavaScript, but what have I gained by doing so? I'll have traded one paradigm, XML + XSLT, for another, HTML + CSS + JavaScript. I'll have lost the ability to validate my markup, though I'll still be able to transform it to other formats. I should be able to round-trip it to TEI and back, so perhaps I could solve the validation problem that way. But is anything about this better than TEI XML? I don't think so…

I suspect I'm missing the point here, and that what the proponents of TEI in HTML are really after is a radically curtailed (or re-thought) version of TEI that does map more comfortably to HTML. The somewhat Baroque complexity of TEI leads the casual observer to wish for something simpler immediately, and can provoke occasional dismay even in experienced users. I certainly sympathize with the wish for a simpler architecture, but text modeling is a complex problem, and simple solutions to complex problems are hard to engineer.


Con said...

Very interesting, Hugh.

I had a play last night with a stylesheet for converting TEI XML into HTML5, while retaining all the TEI metadata (original element names and attribute names and values) in the form of HTML5 microdata.

It's certainly possible to do. I'm left wondering about the pragmatic value in it, though. One thing is that, if used for publication online, it would have the desirable effect of exposing the TEI which is often these days hidden away where people can't get to it.

Con said...

Another plus for a standard HTML5 microdata serialization of TEI is that it can provide standard hooks for CSS. e.g. your CSS selectors can look something this:


... rather than having to refer to some purely conventional representation of the TEI semantics in the form of HTML class attributes.

Unknown said...

Thanks Con, I too wonder about the pragmatic value of it. It would have the advantage of making your model clear in the HTML, but on the other hand, I'm one of those people who believes in always making the XML version available. I think if you aren't, you are in effect not showing your work, and being a bit intellectually dishonest. I know there are people (whom I respect) who hold the opposing view, but they're wrong.

You definitely can do this using microdata, and perhaps that's a little better. But you still have the stilted HTML problem.