Scriptio Continua: TEI

Showing posts with label TEI. Show all posts

Monday, November 14, 2011

TEI in other formats; part the second: Theory

In my first post on this subject, I poked a bit at how one might represent TEI in HTML without discarding the text model from the TEI document. Now I want to talk a bit more about that model, and the theory behind it. I may at the end say irritable things about Theory as well, if you can stand it until the end. I humbly beg the reader's pardon.

Let's look at the same document I talked about last time: http://papyri.info/ddbdp/p.ryl;2;74/source (see also http://papyri.info/ddbdp/p.ryl;2;74/). We can visualize the document structure using Graphviz and a spot of XSLT (click for high-res):

tree structure of a TEI document

It's a fairly flat tree. As an XML document, it has to be a tree, of course, and TEI leverages this built-in "tree-ness" to express concepts like "this text is part of a paragraph" (i.e. it has a tei:p element as its ancestor). In line 1, for example, we find

<supplied reason="lost">Μάρ</supplied>κος

meaning the first three letters of the name Markos have been lost due to damage suffered by the papyrus the text was written on, and the editor of the text has supplied them. The fact that the letters "Μάρ" are contained by the supplied element, or, more properly that the text node containing those letters is a child of the supplied element, means that those letters have been supplied. In other words, the parent-child relationship is given additional semantics by TEI. We already have some problems here: the child of supplied is itself part of a word, "Markos", and that word is broken up by the supplied element. Only the fact that no white space intervenes between the end of the supplied element and the following text lets us know that this is a word. It's even worse if you look at the tree version, which is, incidentally, how the document will be interpreted by a computer after it has been parsed:

There's no obvious connection here between the first and second halves of the name. And in fact, if we hadn't taken steps to prevent it, any program the processed the document might reformat it so that "Mar" and "kos" were no longer connected. We could solve this problem by adding more elements. As the joke goes, "XML is like violence. If it isn't working, you're not using it enough." We could explicitly mark all the words, using a "w" element, thus:

<w><supplied reason="lost">Μάρ</supplied>κος</w>

which would solve any potential problems with words getting split up, because we could always fix the problem—we would know what all the words are. We could even attach useful metadata, like the lemma (the dictionary headword) of the word in question. We don't do this for a couple of reasons. First, because we don't need to. We can work around the splitting-up of words by markup. Second, because it complicates the document and makes it harder for human editors to deal with, and third, because it introduces new chances for overlap. Overlap is the Enemy, as far as XML is concerned. The more containers you have, the greater the chances one container will need to start outside another, but finish inside (or vice versa). Consider that there's no reason at all a region of supplied text shouldn't start in the middle of one word and end in the middle of another. Look at lines 5-6 for example:

... οὐ̣[χ ἱκανὸν εἶ-]
[ναι εἰ]ς

A supplied section begins in the middle of the third word from the end of line five, and continues for the rest of the line. The last word is itself broken and continues on the following line, the beginning of which is also supplied, that section ending in the middle of the second word on line six. This is a mess that would only be compounded if we wanted to mark off words.

This may all seem like a numbing level of detail, but it is on these details that theories of text are tested. The text model here cares about editorial observations on and interventions in the text, and those are what it attempts to capture. It cares much less about the structure of the text itself—note that the text is contained in a tei:ab, an element designed for delineating a block of text without saying anything about its nature as a block (unlike tei:p, for example). Visible features like columns, or text continued on multiple papyri, or multiple texts on the same papyrus would be marked by tei:divs. This is in keeping with papyrology's focus on the materiality of the text. What the editor sees, and what they make of it is more important than the construction of a coherent narrative from the text—something that is often impossible in any case. Making that set of tasks as easy as possible is therefore the focus of the text model we use.

What I'm trying to get at here is that there is Theory at work here (a host of theories in fact), having to do with a way to model texts, and that that set of theories are mapped onto data structures (TEI, XML, the tree) using a set of conventions, and taking advantage of some of the strengths of the data structures available. Those data structures have weaknesses too, and where we hit those, we have to make choices about how to serve our theories best with the tools we have. There is no end of work to be done at this level, of joining theory to practice, and a great deal of that work involves hacking, experimenting with code and data. It is from this realization, I think, that the "more hack, less yack" ethic of THATCamp emerged. And it is at this level, this intersection, this interface, that scholar-programmer types (like me) spend a lot of our time. And we do get a bit impatient with people who don't, or can't, or won't engage at the same level, especially if they then produce critiques of what we're doing.

As it happens, I do think that DH tends to be under-theorized, but by that I don't mean it needs more Foucault. Because it is largely project-driven, and the people who are able to reason about the lower-level modeling and interface questions are mostly paid only to get code into production, important decisions and theories are left implicit in code and in the shapes of the data, and aren't brought out into the light of day and yacked about as they should be.

Thursday, November 10, 2011

TEI in other formats; part the first: HTML

There has been a fair amount of discussion of late about TEI either having a standard HTML5 representation or even moving entirely to an HTML5 format. I want to do a little thinking "out loud" about how that might work.

Let's start with a fairly standard EpiDoc document (EpiDoc being a set of guidelines for using TEI to mark up ancient documents). http://papyri.info/ddbdp/p.ryl;2;74/source (see http://papyri.info/ddbdp/p.ryl;2;74/ for an HTML version with more info) is a fairly typical example of EpiDoc used to mark up a papyrus document. The document structure is fairly flat, but with a number of editorial interventions, all marked up. Line 12, below, shows supplied, unclear, and gap tags

<lb n="12"/><gap reason="lost" quantity="16" unit="character"/> <supplied reason="lost"> Π</supplied><unclear>ε</unclear>ρὶ Θή<supplied reason="lost">βας καὶ Ἑ</supplied><unclear>ρ</unclear>μωνθ<supplied reason="lost">ίτ </supplied><gap reason="lost" extent="unknown" unit="character"/>

So how might we take this line and translate it to HTML? First, we have an <lb> tag, which at first glance would seem to map quite readily onto the HTML tag, but if we look at the TEI Guidelines page for lb (http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-lb.html), we see a large number of possible attributes that don't necessarily convert well. In practice, all I usually see on a line break tag in TEI is an @n and maybe an @xml:id attribute. HTML doesn't really have a general-purpose attribute like @n, but @class or @title might serve. On <lb>, @n is often used to provide line numbers, so @title seems logical.

Now <gap reason="lost" quantity="16" unit="character"/> is a bit more of a puzzler. First, HTML's semantics don't extend at all to the recording of attributes of a text being transcribed, so nothing like the gap element exists. We'll have to use a general-purpose inline element (span seems obvious) and figure out how to represent the attribute values. TEI has no lack of attributes, and these don't naturally map to HTML at all in most cases. If we're going to keep TEI's attributes, we'll have to represent them as child elements. We'll want to identify both the original TEI element and wrap its attributes and maybe its content too, so let's assume we'll use the @class attribute with a couple of fake "namespaces", "teie-" for TEI element names, "teia-" for attribute names, and "teig-" to identify attributes and wrap element contents (the latter might be overkill, but seems sensible as a way to control whitespace). We can assume a stylesheet with a span.teig-attribute selector that sets display:none.


lost
16
character


Like HTML, TEI has three structural models for elements: block, inline, and milestone. Block elements assume a "block" of text, that is, they create a visually distinct chunk of text. Divs, paragraphs, tables, and similar elements are block level. Inline elements contain content, but don't create a separate block. Examples are span in HTML, or hi in TEI. Milestones are empty elements like lb or br. TEI has several of these, and HTML, which has "generic" elements of the block and inline varieties (div and span) lacks a generic empty element. Hence the need to represent tei:gap as a span.

tei:supplied is clearly an inline element, and we can do something similar to the example above, using span:


lost
Π


and likewise with unclear:


ε


Now, doing this using generic HTML elements and styling/hiding them with CSS could be considered bad behavior. It's certainly frowned upon in the CSS 2.1 spec (see the note at http://www.w3.org/TR/CSS2/selector.html#class-html). I don't honestly see another way to do it though, because, although RDFa has been suggested as a vehicle for porting TEI to HTML, there is no ontology for TEI, so no good way to say "HTML element p is the same as TEI element p, here". Even granting the possibility of saying that, it doesn't help with the attribute problem. And we're still left with the problem of presentation: what will my HTML look like in a browser? It must be said that my messing about above won't produce anything like the desired effect, which for line 12 is something like:

[- ca.16 - Π]ε̣ρὶ Θή[βας καὶ Ἑ]ρ̣μωνθ[ίτ -ca.?- ]

I could certainly make it so, probably with a combination of CSS and JavaScript, but what have I gained by doing so? I'll have traded one paradigm, XML + XSLT, for another, HTML + CSS + JavaScript. I'll have lost the ability to validate my markup, though I'll still be able to transform it to other formats. I should be able to round-trip it to TEI and back, so perhaps I could solve the validation problem that way. But is anything about this better than TEI XML? I don't think so…

I suspect I'm missing the point here, and that what the proponents of TEI in HTML are really after is a radically curtailed (or re-thought) version of TEI that does map more comfortably to HTML. The somewhat Baroque complexity of TEI leads the casual observer to wish for something simpler immediately, and can provoke occasional dismay even in experienced users. I certainly sympathize with the wish for a simpler architecture, but text modeling is a complex problem, and simple solutions to complex problems are hard to engineer.

Thursday, January 20, 2011

Interfaces and Models

In my last post, I argued that TEI is a text modelling language, and in the prior post, I discussed a frequently-expressed request for TEI editors that hide the tags. Here, I'm going to assert that your editing interface (implicitly) expresses a model too, and because it does, generic, tag-hiding editors are a losing proposition.

Everything to do with human-computer interfaces uses models, abstractions, and metaphors. Your computer "desktop" is a metaphor that treats the primary, default interface like the surface of a desk, where you can leave stuff laying around that you want to have close at hand. "Folders" are like physical file folders. Word processors make it look like you're editing a printed page; HTML editors can make it look as though you're directly editing the page as it appears in a browser. These metaphors work by projecting an image that looks like something you (probably) already have a mental model of. The underlying model used by the program or operating system is something else again. Folders don't actually represent any physical containment on the system's local storage, for example. The WYSIWYG text you edit might be a stream of text and formatting instructions, or a Document Object Model (DOM) consisting of Nodes that model HTML elements and text.

If you're lucky, there isn't a big mismatch between your mental model and the computer's. But sometimes there is: we've all seen weirdly mis-formatted documents, where instead of using a header style for header text, the writer just made it bold, with a bigger font, and maybe put a couple of newlines after it. Maybe you've done this yourself, when you couldn't figure out the "right" way to do it. This kind of thing only bites you, after all, when you want to do something like change the font for all headers in a document.

And how do we cope if there's a mismatch between the human interface and the underlying model? If the interface is much simpler than the model, then you will only be able to create simple instances with it; you won't be able to use the model to its full capabilities. We see this with word processor-to-TEI converters, for example. The word processor can do structural markup, like headers and paragraphs, but it can't so easily do more complex markup. You could, in theory, have a tagless TEI editor capable of expressing the full range of TEI, but it would have to be as complex as the TEI is. You could hide the angle brackets, but you'd have to replace them with something else.

Because TEI is a language for producing models of texts, it is probably impossible to build a generic tagless TEI editor. In order for the metaphor to work, there must be a mapping from each TEI structure to a visual feature in the editor. But in TEI, there are always multiple ways of expressing the same information. The one you choose is dictated by your goals, by what you want to model, and by what you'll want the model to do. There's nothing to map to on the TEI side until you've chosen your model. Thus, while it's perfectly possible (and is useful,* and has been done, repeatedly) to come up with a "tagless" interface that works well for a particular model of text, I will assert that developing a generic TEI editor that hides the markup would be hard task.

This doesn't mean you couldn't build a tool to generate model-specific TEI editors, or build a highly-customizable tagless editor. But the customization will be a fairly hefty intellectual effort. And there's a potential disadvantage here too: creating such a customization implies that you know exactly how you want your model to work, and at the start of a project, you probably don't. You might find, for example, that for 1% of your texts, your initial assumptions about your text model are completely inadequate, and so it has to be refined to account for them. This sort of thing happens all the time.

My advice is to think hard before deciding to "protect" people from the markup. Text modeling is a skill that any scholar of literature could stand to learn.

UPDATE: a comment on another site by Joe Wicentowski makes me think I wasn't completely clear above. There's NOTHING wrong with building "padded cell" editors that allow users to make only limited changes to data. But you need to be clear about what you want to accomplish with them before you implement one.

*Michael C. M. Sperberg-McQueen has a nice bit on "padded cell editors" at http://www.blackmesatech.com/view/?p=11