Tuesday, January 11, 2011

TEI is a text modelling language

I'm teaching a TEI class this weekend, so I've been pondering it a bit. I've come to the conclusion that calling what we do with TEI "text encoding" is misleading. I think what we're really doing is text modeling.

TEI provides an XML vocabulary that lets you produce models of texts that can be used for a variety of purposes. Not a Model of Text, mind you, but models (lowercase) of texts (also lowercase).

TEI has made the (interesting, significant) decision to piggyback its semantics on the structure of XML, which is tree-based. So XML structure implies semantics for a lot of TEI. For example, paragraph text appears inside <p> tags; to mark a personal name, I surround the name with a <persName> tag, and so on. This arrangement is extremely convenient for processing purposes: it is trivial to transform the TEI <p> into an HTML <p>*, for example, or the <persName> into an HTML hyperlink, which points to more information about the person. It means, however, that TEI's modeling capabilities are to a large extent XML's own. This approach has opened TEI up to criticism. Buzzetti (2002) has argued that its tree structure simply isn't expressive enough to represent the complexities of text, and Schmidt (2010) criticizes TEI for (among other problems) being a bad model of text, because it imposes editorial interpretation on the text itself.
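To make the "trivial to transform" claim concrete, here is a minimal sketch in Python using only the standard library. The sample sentence, the ref value, and the fallback of turning unknown elements into an HTML <span> are my own assumptions for illustration, not anything the TEI Guidelines prescribe; a real project would more likely use XSLT.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample: one TEI paragraph with a tagged personal name.
SAMPLE = (
    f'<p xmlns="{TEI_NS}">A letter from '
    '<persName ref="people.html#cicero">Cicero</persName> to Atticus.</p>'
)


def tei_to_html(elem):
    """Recursively copy a TEI element into an (un-namespaced) HTML element."""
    tag = elem.tag.split("}")[-1]  # drop the "{namespace}" prefix
    if tag == "p":
        html = ET.Element("p")
    elif tag == "persName":
        # Assumption: @ref holds a URL we can link to directly.
        html = ET.Element("a", {"href": elem.get("ref", "#")})
    else:
        # Assumption: anything else becomes a <span> carrying the TEI name.
        html = ET.Element("span", {"class": tag})
    html.text = elem.text
    for child in elem:
        converted = tei_to_html(child)
        converted.tail = child.tail  # keep the text that follows the child
        html.append(converted)
    return html


def render(tei_xml):
    """Parse a TEI fragment and serialize its HTML counterpart."""
    return ET.tostring(tei_to_html(ET.fromstring(tei_xml)), encoding="unicode")


print(render(SAMPLE))
```

The tree structure does all the work here: each TEI element maps to one HTML element, and nesting carries over unchanged. That is exactly the convenience, and exactly the dependence on XML's own modeling capabilities, described above.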

The main disagreement I have with Schmidt's argument is the assumption that there is a text independent of the editorial apparatus. Maybe there is sometimes, but I can point at many examples where there is no text, as such, only readings. And a reading is, must be, an interpretive exercise. So I'd argue that TEI is at least honest in that it puts the editorial interventions front and center where they are obvious.

As for the argument that TEI's structure is inadequate to model certain aspects of text, I can only agree. But TEI has proved good enough to do a lot of serious scholarly work. That, and the fact that its choice of structure means it can bring powerful XML tools to bear on the problems it confronts, means that TEI represents a "worse is better" solution. It works a lot of the time, doesn't claim to be perfect, and incrementally improves. Where TEI isn't adequate to model a text in the way you want to use it, then you either shouldn't use it, or should figure out how to extend it.

One should bear in mind that any digital representation of a text is ipso facto a model. It's impossible to do anything digital without a model (whether you realize it's there or not). Even if you're just transcribing text from a printed page to a text editor you're making editorial decisions, like what character encoding to use, how to represent typographic features in that encoding, how to represent whitespace, and what to do with things you can't easily type (inline figures or symbols without a Unicode representation, for example).

So why argue that TEI is a language for modeling texts, rather than a language for "encoding" texts? The simple answer is that this is a better way of explaining what people use TEI for. TEI provides a lot of tags to choose from. No-one uses them all. Some are arguably incompatible with one another. We tag the things in a text that we care about and want to use. In other words, we build models of the source text, models that reflect what we think is going on structurally, semantically, or linguistically in the text, and/or models that we hope to exploit in some way.

For example, EpiDoc is designed to produce critical editions of inscribed or handwritten ancient texts. It is concerned with producing an edition (a reading) of the source text that records the editor's observations of and ideas about that text. It does not at this point concern itself with marking personal or geographic names in the text. An EpiDoc document is a particular model of the text that focuses on the editor's reading of that text. As a counterexample, I might want to use TEI to produce a graph of the interactions of characters in Hamlet. If I wanted to do that, I would produce a TEI document that marked people and whom they were addressing when they spoke. This would be a completely different model of the text than a critical edition of Hamlet might be. I could even try to do both at the same time, but that might be a mess—models are easier to deal with when they focus on one thing.
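As a rough sketch of the Hamlet model, suppose every speech is a TEI <sp> element whose who attribute points at the speaker and whose toWhom attribute points at the addressee. That markup convention, the identifiers, and the tiny sample are assumptions made for illustration; a real edition would need an editor to decide (and interpret) who is being addressed in each speech.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample: speeches tagged with speaker (@who) and addressee (@toWhom).
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0">
  <sp who="#hamlet" toWhom="#horatio"><l>Alas, poor Yorick! I knew him, Horatio</l></sp>
  <sp who="#hamlet" toWhom="#ophelia"><l>Get thee to a nunnery</l></sp>
  <sp who="#hamlet" toWhom="#horatio"><l>There are more things in heaven and earth, Horatio</l></sp>
  <sp who="#horatio" toWhom="#hamlet"><l>Good night, sweet prince</l></sp>
</div>"""


def interaction_edges(tei_xml):
    """Count directed (speaker, addressee) pairs across all <sp> elements."""
    root = ET.fromstring(tei_xml)
    edges = Counter()
    for sp in root.iter(TEI + "sp"):
        # Both attributes may hold several space-separated pointers.
        speakers = sp.get("who", "").split()
        addressees = sp.get("toWhom", "").split()
        for speaker in speakers:
            for addressee in addressees:
                edges[(speaker.lstrip("#"), addressee.lstrip("#"))] += 1
    return edges


for (speaker, addressee), count in sorted(interaction_edges(SAMPLE).items()):
    print(f"{speaker} -> {addressee}: {count}")
```

Nothing in this model records line breaks, variant readings, or editorial conjectures, and nothing in a critical edition would need toWhom. The two documents would both be valid TEI and still be different models of the same play.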

This way of understanding TEI makes clear a problem that arises whenever one tries to merge collections of TEI documents: that of compatibility. Just because two documents are marked up in TEI, that does not mean they are interoperable. This is because each document represents the editor's model of that text. Compatibility is certainly achievable if both documents follow the same set of conventions, but we shouldn't expect it any more than we'd expect to be able to merge any two models that follow different ground rules.

* with the caveat that the semantics of TEI <p> and HTML <p> are different, and there may be problems. TEI's <p> can contain lists, for example, whereas HTML's cannot.

Yes, I wrote a blog post with endnotes and bibliography. Sue me.
  1. Buzzetti, D. "Digital Representation and the Text Model." New Literary History 2002; 33.1:61-88.
  2. Schmidt, D. "The Inadequacy of Embedded Markup for Cultural Heritage Texts." Literary and Linguistic Computing 2010; 25.3:337-356.


desmond said...

I realise this post is really cold now, but I must disagree with your assessment of TEI interoperability: "Compatibility is certainly achievable if both documents follow the same set of conventions" is, I think, a tad ambitious. Documents, after all, don't follow anything. It is people that follow conventions or not, and therein lies the problem. Patrick Durusau points out that there are more than 4 million ways to transcribe a single sentence taken from a printed book using the TEI Guidelines. Given that kind of variation, the chance that any two people would encode the same features in the same way is just about zero. As Alan Renear pointed out, a TEI tag added to describe an analog document has a completely different illocutionary force to exactly the same tag created by the author of an electronic document as part of his text. The first tag is pure interpretation, the second is pure fact. So in my view TEI texts cannot ever be interoperable. It's funny that a lot of people think they ought to be, or act as if they were. If you look at projects like TextGrid or the British version TextGrid VRE, or the TAPAS project, you'll see this assumption underlies the whole project. But it's a big mistake. You only have to look at Project Bamboo to see what happens when you try to make TEI texts interoperate. Or the original grant proposal of the TEI where they make interoperability one of their goals, and then abandon it 25 years later. Or the chilling assessment of TEI's uselessness by the chairman of the TEI board.
As for your objection to my supposed "assumption that there is a text independent of the editorial apparatus. Maybe there is sometimes, but I can point at many examples where there is no text, as such, only readings. And a reading is, must be, an interpretive exercise." I guess you are thinking of fragmentary texts - but they are the exception not the rule. I didn't ever say that you couldn't annotate a text containing computed variation. What I said was that although computerised comparison could never be perfect it was still far more accurate than manually created variant recording using embedded markup. The reason is that recording variants using XML may seem to give you more freedom but in fact it powerfully restricts the kinds of variation you can record. What about texts with 100 versions? Or major transpositions? Or variations in the markup itself? Or non-hierarchical variation? You soon get tied up in knots trying to represent that in XML so that the automatic collation works out to be much better. Are you really saying that automatic comparison serves no purpose? I think you mean that you want to annotate variation, and you can, but you don't need embedded markup to do it. A reading that is literally different from some other reading is just different. There's no interpretation required to see that, and having that information to hand has got to be useful.

Unknown said...

Hi Desmond, really nice to have your input, even a couple of years later! You seem to say that TEI compatibility is next to impossible, but I don't think that's the case. I do think it entails either agreement to work to the same standards or crosswalking between projects, and I certainly agree that it's not something you get for free. I'd forgotten about that Bamboo paper, but I look at it and giggle a little hysterically at the idea anyone would think TEI structured bibliography is that simple an animal.

I think too that it depends what you mean by "interoperable". I do agree that projects which seem to have the idea that you can take all the TEI texts in the world, throw them into some sort of bin, and then by some automatic process distill something useful from that are being naïve. I don't really see the point of that sort of project though...

Given your background, you'll understand that the majority of the texts I deal with (papyri and inscriptions) are your exceptions :-).

If I understand your last argument, it's that TEI is useless because you can't use it to produce some sort of critical hyperedition where every textual variation is recorded and aligned. I don't think TEI would be very useful for that sort of project. I can imagine having a TEI base text and recording variants as standoff annotations on that base—and in fact I've played around with doing that sort of thing to model manuscript collation—but I don't think doing the whole thing in TEI would be sensible. Actually, I think it would be nutty. And doomed.

But that's not how critical editions work. An edition is a single reading derived from 1-n sources, where the editor chooses to surface only a subset of the source variations, their own conjectures, and those of earlier editors (and only those they consider relevant) in the apparatus, not the text. TEI is perfectly capable of representing that sort of edition, or an edition of a single manuscript.

So I think the variance argument is a red herring. TEI has its flaws, and indeed there are serious problems with some of its encoding recommendations. You don't get free interoperability with it and it's not suitable for every text-based project. None of that makes it useless though.

Thanks again for taking the time to reply to an old blog post! I enjoy thinking about this stuff and would love to continue the discussion.