Scriptio Continua<br />
Thoughts on software development, Digital Humanities, the ancient world, and whatever else crosses my radar. All original content herein is licensed under a <a href="http://creativecommons.org/licenses/by/3.0/us/">Creative Commons Attribution</a> license.<br />
<br />
Reminder (2017-06-02)<br />
<br />
In the midst of the ongoing disaster that has befallen the country, I had a reminder recently that healthcare in the USA is still a wreck. When I had my episode of food poisoning (or whatever it was) in Michigan recently, my concerned wife took me to an urgent care clinic. We of course had to pay out of pocket for the service (about $100), as we were way outside our network (the group of providers who have agreements with our insurance company). I submitted the paperwork to our insurance company when we got home (Duke uses Aetna), to see if they would reimburse some of that amount. Nope. Rejected, because we didn't call them first to get approval—not something you think of at a time like that. Thank God I waved off the 911 responders when my daughter called them after I first got sick and almost passed out. We might have been out thousands of dollars. And this is with really first-class insurance, mind you. I have great insurance through Duke. You can't get much better in this country.<br />
<br />
People from countries with real healthcare systems find this kind of thing shocking, but it's par for the course here. And our government is actively trying to make it worse. It's just one more bit of dreadful in a sea's worth, but it's worth remembering that the disastrous state of healthcare in the US affects all of us, even the lucky ones with insurance through our jobs. And again, our government is trying its best to make it worse. You can be quite sure it will be worse for everyone.<br />
<br />
Experiencing Technical Difficulties (2017-05-01)<br />
<br />
I've been struggling with a case of burnout for a while now. It's a common problem in programming, where we have to maintain a fairly high level of creative energy all the time, and unlike my colleagues in academia or the library, I'm not eligible for research leave or sabbaticals. Vacation is the only opportunity for recharging my creative batteries, but that's hard too when there are a lot of tasks that can't wait. I have taken the day off to work before, but that just seems stupid. So I grind away, hoping the fog will lift.<br />
<br />
A few weeks ago, the kids and I joined my wife on a work trip to Michigan. It was supposed to be a mini-vacation for us, but I got violently ill after lunch one day—during a UMich campus tour. It sucked about as much as it possibly could. My marvelous elder daughter dealt with the situation handily, but of course we ended up missing most of the tour, and I ended up in bed the rest of the day, barring the occasional run to the bathroom. My world narrowed down to a point. I was quite happy to lie there, not thinking. I could have read or watched television, but I didn't want to. Trying the occasional sip of Gatorade was as much as I felt like. For someone who normally craves input all the time, it was very peaceful. It revealed to me again what a knife-edge my consciousness really sits on. It would take very little to knock it off the shelf to shatter on the ground.<br />
<br />
My father has Alzheimer's Disease, and this has already happened to him. Where once there was an acutely perceptive and inquiring mind, there remains only his personality, which seems in his case to be the last thing to go. I try to spend time with him at least once or twice a week, both to take a little pressure off my mother and to check on their general well-being. We take walks. Physically, he's in great shape for a man in his 80s. And there are still flashes of the person he was. He can't really hold a conversation, and will ask the same questions over and over again, my answers slipping away as soon as they're heard, but as we walked the other day, accompanied by loud birdsong, he piped up "We hear you!" to the birds, his sense of humor suddenly back on the surface. We are lucky that my parents have fantastic insurance and a good retirement plan, courtesy of an employer, the Episcopal Church, that cares about its people beyond the period of their usefulness.<br />
<br />
Burnout is a species of depression, really. It is the same sort of thing as writer's block. Your motivation simply falls out from under you. You know what needs to be done, but it's hard to summon the energy to do it. The current political climate doesn't help, as we careen towards the cliff's edge like the last ride of Thelma and Louise, having (I hope metaphorically, but probably not for many of us) chosen death over a constrained future, for the sake of poking authority in the eye. My children will suffer because the Baby Boomers have decided to try to take it all with them, because as a society we've fallen in love with Death. All we can do really is try to arm the kids against the hard times to come, their country having chosen war, terror, and oppression in preference to the idea that someone undeserving might receive any benefit from society. We Gen-Xers at least had some opportunity to get a foot on the ladder. Their generation will face a much more tightly constrained set of choices, with a much bigger downside if they make the wrong ones. I don't write much about my children online, because we want to keep them as much as possible out of the view of the social media Panopticon until they're mature enough to make their own decisions about confronting it. At least they may have a chance to start their lives without the neoliberal machine knowing everything about them. They won't have anything like the support I had, and when we've dismantled our brief gesture towards health care as a human right and insurance decisions are made by AIs that know everything about you going back to your childhood, things are going to be quite difficult.<br />
<br />
A symptom, I think, of my burnout is my addiction to science fiction and urban fantasy novels. They give me a chance to check out from the real world for a while, but I think it's become a real addiction rather than an escape valve. Our society rolls ever forward toward what promises to be an actual dystopia with all the trappings: oppressed, perhaps enslaved underclasses, policed by unaccountable quasi-military forces, hyper-wealthy elites living in walled gardens with the latest technology, violent and unpredictable weather, massive unemployment and social unrest, food and water shortages, and ubiquitous surveillance. Escapism increasingly seems unwise. Some of that future can be averted if we choose not to be selfish and paranoid, to stop oppressing our fellow citizens and to stop demonizing immigrants, to put technology at the service of bettering society and surviving the now-inevitable changes to our climate. But we are not making good choices. Massive unemployment is a few technological innovations away. It doesn't have to be a disaster, indeed it could lead to a renaissance, but I think we're too set in our thinking to avoid the disaster scenario. The unemployed are lazy after all, our culture tells us, they must deserve the bad things that have happened to them. Our institutions are set up to push them back towards work by curtailing their benefits. But <i>it could never happen to me</i>, could it?<br />
<br />
And that comes back around to why I try to grind my way through burnout rather than taking time to recover from it. I live in an "at will" state. I could, in theory, be fired because my boss saw an ugly dog on the way in to work. That wouldn't happen, I hasten to say—I work with wonderful, supportive people. But there are no guarantees to be had. People can be relied on, but institutions that have not been explicitly set up to support us cannot, and institutional structures and rules tend to win in the end. Best to keep at it and hope the spark comes back. It usually does.<br />
<br />
Thank You (2016-02-22)<br />
<br />
Back in the day, Joel Spolsky had a very influential tech blog, and one of the pieces he wrote described the kind of software developer he liked to hire, one who was "Smart, and gets things done." He later turned it into a book (<a href="http://www.amazon.com/Smart-Gets-Things-Done-Technical/dp/1590598385">http://www.amazon.com/Smart-Gets-Things-Done-Technical/dp/1590598385</a>). Steve Yegge, who was also a very influential blogger in the oughties, wrote a followup, in which he tackled the problem of how you find and hire developers who are smarter than you. Given the handicaps of human psychology, how do you even recognize what you're looking at? His rubric for identifying these people (flipping Spolsky's) was "<a href="http://steve-yegge.blogspot.com/2008/06/done-and-gets-things-smart.html">Done, and gets things smart</a>". That is, this legendary "10X" developer was the sort who wouldn't just get done the stuff that needed to be done, but would actually anticipate what needed to be done.
When you asked them to add a new feature, they'd respond that it was already done, or that they'd just need a few minutes, because they'd built things in such a way that adding the feature you just thought of would be trivial. They wouldn't just finish projects, they'd make everything better—they'd create code that other developers could easily build upon. Essentially, they'd make everyone around them more effective as well.<br />
<br />
I've been thinking a lot about this over the last few months, as I've worked on finishing a project started by Sebastian Rahtz: integrating support for the new "Pure ODD" syntax into the <a href="http://www.tei-c.org/index.xml">TEI</a> Stylesheets. The idea is to have a TEI syntax for describing the content an element can have, rather than falling back on embedded RelaxNG. Lou Burnard has written about it here: <a href="https://jtei.revues.org/842">https://jtei.revues.org/842</a>. Sebastian wrote the XSLT Stylesheets and the supporting infrastructure which are both the reference implementation for publishing TEI and the primary mechanism by which the TEI Guidelines themselves are published. <strong>And</strong> they are the basis of TEI schema generation as well. So if you use TEI at all, you have Sebastian to thank.<br />
<br />
Picking up after Sebastian's retirement last year has been a tough job. It was immediately obvious to me just how much he had done, and had been doing for the TEI all along. When Gabriel Bodard described to me how the TEI Council worked, after I was elected for the first time, he said something like: "There'll be a bunch of people arguing about how to implement a feature, or even whether it can be done, and then Sebastian will pipe up from the corner and say 'Oh, I just did it while you were talking.'" You only have to look at the contributors pages for both the <a href="https://github.com/TEIC/TEI/graphs/contributors">TEI</a> and the <a href="https://github.com/TEIC/Stylesheets/graphs/contributors">Stylesheets</a> to see that Sebastian was indeed operating at a 10X level. Quietly, without making any fuss about it, he's been making the TEI work for many years.<br />
<br />
The contributions of software developers are often easily overlooked. We only notice when things don't work, not when everything goes smoothly, because that's what's supposed to happen, isn't it? Even in Digital Humanities, which you'd expect to be self-aware about this sort of thing, the intellectual contributions of software developers can often be swept under the rug. So I want to go on record, shouting a loud THANK YOU to Sebastian for doing so much and for making the TEI infrastructure smart.<br />
<br />
*****<br />
UPDATE 2016-03-16<br />
<br />
I heard the sad news last night that Sebastian passed away yesterday, on the Ides of March. We are much diminished by his loss.<br />
<br />
DH Data Talk (2013-10-25)<br />
<br />
Last night I was on a <a href="http://sites.duke.edu/digital/training-events/doing-dh/#data">panel</a> organized by Duke Libraries' <a href="http://sites.duke.edu/digital/about/">Digital Scholarship</a> group. The panelists each gave some brief remarks and then we had what I thought was a really productive and interesting discussion. The following are my own remarks, with links to my <a href="http://philomousos.com/presentations/DH-Data.html" target="slides">slides</a> (opens a new tab). In my notes, //slide// means click forward (not always to a new slide, maybe just a fragment).
<blockquote>This is me, and I work //<a href="http://philomousos.com/presentations/DH-Data.html#/1" target="slides">slide</a>// for this outfit. I'm going to talk just a little about an old project and a new one, and not really give any details about either, but surface a couple of problems that I hope will be fodder for discussion. //<a href="http://philomousos.com/presentations/DH-Data.html#/2" target="slides">slide</a>// The old project, Papyri.info, publishes all kinds of data about ancient documents, mostly written in ink on papyrus. The new one, Integrating Digital Epigraphies (IDEs), is about doing much the same thing for ancient documents mostly incised on stone.
</blockquote>
<blockquote>If I had to characterize (most of) the work I'm doing right now, I'd say I'm working on detecting and making machine-actionable the scholarly links and networks embedded in a variety of related projects, with data sources including plain text, XML, Relational Databases, web services, and images. These encompass critical editions of texts (often in large corpora), bibliography, citations in books and articles, images posted on Flickr, and databases of texts. You could think of what I'm doing as recognizing patterns and then converting those into actual links; building a scaffold for the digital representation of networks of scholarship. This is hard work. //<a href="http://philomousos.com/presentations/DH-Data.html#/3" target="slides">slide</a>// It's hard because while superficial patterns are easy to detect, //slide// without access to the system of thought underlying those patterns (and computers can't do that yet—maybe never), those patterns are really just proxies kicked up by the underlying system. They don't themselves have meaning, but they're all you have to hold on to. //<a href="http://philomousos.com/presentations/DH-Data.html#/4" target="slides">slide</a>// Our brains (with some prior training) are very good at navigating this kind of mess, but digital systems require explicit instructions //slide// —though granted, you can sometimes use machine learning techniques to generate those.
</blockquote>
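To make the pattern-detection idea concrete, here is a toy sketch (in Python, not the project's actual code) of turning one superficial pattern, a papyrological citation like "P.Oxy. 4 744", into a stable link. The citation grammar and the URI scheme below are simplified illustrations, not papyri.info's real resolution logic, which has to cope with far messier input.

```python
import re

# Toy citation grammar: series abbreviation ("P.Oxy."), volume, number.
# Real papyrological citations are far messier; this is illustrative only.
CITATION = re.compile(r"(?P<series>P\.[A-Za-z]+\.)\s+(?P<vol>\d+)\s+(?P<num>\d+)")

def to_uris(text):
    """Convert recognized citation patterns into link-style URIs."""
    uris = []
    for m in CITATION.finditer(text):
        series = m.group("series").lower().rstrip(".")
        uris.append(f"http://papyri.info/ddbdp/{series};{m.group('vol')};{m.group('num')}")
    return uris
```

The regex finds the proxy pattern; everything it cannot see (the system of thought that makes "P.Oxy. 4 744" meaningful to a papyrologist) has to be supplied by explicit rules like the lowercasing and delimiter conventions here.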
<blockquote>When I say I'm working on making scholarly networks machine actionable, I'm talking about encoding as digital relations the graph of references embedded in these books, articles and corpora, and in the metadata of digital images. There are various ways one might do this, and the one we're most deeply into right now is called //<a href="http://philomousos.com/presentations/DH-Data.html#/5">slide</a>// RDF. RDF models knowledge as a set of simple statements in the form Subject, Predicate, Object. //slide// So A cites B, for example. RDF is a web technology, so all three of these elements may be URIs that you could open in a web browser, //slide// and if you use URIs in RDF, then the object of one statement can be the subject of another, and so on. //slide// So you can use it to model logical chains of knowledge. Now notice that these statements are axioms. You can't qualify them, at least not in a fine-grained way. So this works great in a closed system (papyri.info), where we get to decide what the facts are; it's going to be much more problematic in IDEs, where we'll be coordinating data from at least half a dozen partners. Partners who may not agree on everything. //slide// What I've got is the same problem from a different angle—I need to model a big pile of opinion but all I have to do it with are facts.
</blockquote>
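The chaining idea, that the object of one statement can be the subject of another, can be sketched without any RDF library at all: a minimal, illustrative model that treats triples as Python tuples. The example.org URIs are invented for the example; cito:cites is a real predicate from the CiTO ontology, used here just for flavor.

```python
# Each statement is a (subject, predicate, object) triple of URIs.
# Because the object of one triple can be the subject of another,
# a set of triples forms a graph you can walk.
CITES = "http://purl.org/spar/cito/cites"  # real CiTO property, illustrative use

triples = {
    ("http://example.org/article/1", CITES, "http://example.org/edition/42"),
    ("http://example.org/edition/42", CITES, "http://example.org/edition/7"),
}

def cited_chain(start, triples):
    """Follow 'cites' links transitively from a starting resource."""
    found, frontier = set(), {start}
    while frontier:
        nxt = {o for (s, p, o) in triples if p == CITES and s in frontier}
        nxt -= found
        found |= nxt
        frontier = nxt
    return found
```

Note what the sketch cannot do: every triple in the set is simply true. There is no slot for "partner A asserts this but partner B disputes it," which is exactly the closed-world limitation described above.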
<blockquote>Part of the solution to these problems has to be about learning how to make the insertion of machine-actionable links and facts (or at least assertions) part of—that is, a side-effect of—the normal processes of resource creation and curation. But it also has to be about building systems that can cope with ambiguity and opinion.</blockquote>
<br />
Outside the tent (2013-09-11)<br />
<br />
<p>Yesterday was a bad day. I’m chasing a messed-up software problem whose main symptom is the application consuming all available memory and then falling over without leaving a useful stacktrace. Steve Ramsay quit Twitter. A colleague I have huge respect for is leaving a foundational project, which is going to be parked as a result (that and the lack of funding). This all sucks. As I said on Twitter, it feels like we’ve hit a tipping point. I think DH has moved on and left a bunch of us behind. I have to start this off by saying that I really have nothing to complain about, even if some of this sounds like whining. I love my job and my colleagues, and I’m doing my best to get over being a member of a Carolina family working at Duke :-). I’m also thinking about these things a lot in the run-up to <a href="http://codespeak.scholarslab.org/">Speaking in Code</a>.</p>
<p>For some time now I’ve been feeling uneasy about how I should present myself and my work. A few years ago, I’d have confidently said I work on Digital Humanities projects. Before that, I was into Humanities Computing. But now? I’m not sure what I do is really DH any more. I suspect the DH community is no longer interested in the same things as people like me, who write software to enable humanistic inquiry and also like to think (and when possible write and teach) about how that software instantiates ideas about the data involved in humanistic inquiry. On one level, this is fine. Time, and academic fashion, marches on. It <i>is</i> a little embarrassing though given that I’m a “Senior Digital Humanities Programmer”.</p>
<p>Moreover, the field of “programming” daily spews forth fresh examples of unbelievable, poisonous misogyny and seems largely incapable of recognizing what a shitty situation it’s in because of it.</p>
<blockquote>
<p>The tech industry is in moral crisis. We live in a dystopian, panoptic
geek revenge fantasy infested by absurd beliefs in meritocracy, full
of entrenched inequalities, focused on white upper-class problems,
inherently hostile to minorities, rife with blatant sexism and generally
incapable of reaching anyone beyond early adopter audiences of
people just like us.
(from <a href="https://medium.com/about-work/f6ccd5a6c197">https://medium.com/about-work/f6ccd5a6c197</a>)</p>
</blockquote>
<p>I think communities who fight against this kind of oppression, like #DHPoco, for example, are where DH is going. But while I completely support them and think they’re doing good, important work, I feel a great lack of confidence that I can participate in any meaningful way in those conversations, both because of the professional baggage I bring with me and because they’re doing a different kind of DH. I don’t really see a category for the kinds of things I write about on DHThis or DHNow, for example.</p>
<blockquote class="twitter-tweet"><p>If you want to be part of a community that HELPS DEFINE <a href="https://twitter.com/search?q=%23digitalhumanities&src=hash">#digitalhumanities</a> please join and promote <a href="https://twitter.com/search?q=%23DHThis&src=hash">#DHThis</a> today! <a href="http://t.co/VTWjtGQbgr">http://t.co/VTWjtGQbgr</a></p>— Adeline Koh (@adelinekoh) <a href="https://twitter.com/adelinekoh/statuses/377447795084382208">September 10, 2013</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>This is great stuff, but it’s also not going to be a venue for me wittering on about Digital Classics or text encoding. It could be my impostor syndrome kicking in, but I really doubt they’re interested.</p>
<p>It does seem that a side-effect of the shift toward a more theoretical DH is an environment less welcoming to participation by “staff”. It’s paradoxical that the opening up of DH also comes with a reversion to the old academic hierarchies. I’m constantly amazed at how resilient human institutions are.</p>
<p>If Digital Humanities isn’t really what I do, and if Programmer comes with a load of toxic slime attached to it, perhaps “Senior” is all I have left. Of course, in programmer terms, “senior” doesn’t really mean “has many years of experience”, it’s code for “actually knows how to program”. You see ads for senior programmers with 2-3 years of experience all the time. By that standard, I’m not Senior, I’m Ancient. Job titles are something that come attached to staff, and they are terrible, constricting things.</p>
<p>I don’t think that what I and many of my colleagues do has become useless, even if we no longer fit the DH label. It still seems important to do that work. Maybe we’re back to doing Humanities Computing. I do think we’re mostly better off because Digital Humanities happened, but maybe we have to say goodbye to it as it heads off to new horizons and get back to doing the hard work that needs to be done in a Humanities that’s at least more open to digital approaches than it used to be. What I’m left wondering is where the place of the developer (and, for that matter, other DH collaborators) is in DH if DH is now the establishment and looks structurally pretty much like the old establishment did.</p>
<p>Is digital humanities development a commodity? Are DH developers interchangeable? Should we be? Programming in industry <em>is</em> typically regarded as a commodity. Programmers are in a weird position, both providers of indispensable value and held at arm’s length. The problem businesses have is how to harness a resource that is essentially creative and therefore very subject to human inconsistency. It’s hard to find good programmers, and hard to filter for programming talent. Programmers get burned out, bored, pissed off, distracted. Best to keep a big pool of them and rotate them out when they become unreliable or too expensive, or replace them when they leave. Comparisons to graduate students and adjunct faculty may not escape the reader, though at least programmers are usually better-compensated. Academia has a slightly different programmer problem: it’s <em>really</em> hard to find good DH programmers, and staffing up just for a project may be completely impossible. The only solution I see is to treat it as analogous to hiring faculty: you have to identify good people and recruit them, and train people you’d want to hire. You also have to give them a fair amount of autonomy—to deal with them as people rather than commodities. What you can’t count on doing is retaining them as contingent labor on soft money. But here we’re back around to the faculty/staff problem: the institutions mostly only deal with tenure-track faculty in this way. Libraries seem to be the only academic institutions capable of addressing the problem at all. But they’re also the institutions most likely to come under financial pressure, and they have other things to worry about. It’s not fair to expect them to come riding over the hill.</p>
<p>The ideal situation would be if there existed positions to which experts could be recruited who had sufficient autonomy to deal with faculty on their own level (this essentially means being able to say ‘no’), who might or might not have advanced degrees, who might teach and/or publish, but wouldn’t have either as their primary focus. They might be librarians, or research faculty, or something else we haven’t named yet. All of this would cost money though. What’s the alternative? Outsourcing? Be prepared to spend all your grant money paying industry rates. Grad students? Many are very talented and have the right skills, but will they be willing to risk sacrificing the chance of a faculty career by dedicating themselves to your project? Will your project be maintainable when they move on? Mia Ridge, in her Twitter feed, reminds me that in England there exist people called “Research Software Engineers”.</p>
<blockquote class="twitter-tweet"><p>Notes from <a href="https://twitter.com/search?q=%23rse2013&src=hash">#rse2013</a> breakout discussions appearing at <a href="https://t.co/PD0ItLBb8t">https://t.co/PD0ItLBb8t</a> - lots of resonances with <a href="https://twitter.com/search?q=%23musetech&src=hash">#musetech</a> <a href="https://twitter.com/search?q=%23codespeak&src=hash">#codespeak</a></p>— Mia (@mia_out) <a href="https://twitter.com/mia_out/statuses/377801642919604225">September 11, 2013</a></blockquote>
<script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>There are worse labels, but it sounds like they have exactly the same set of problems I’m talking about here.</p>
<br />
Missing DH (2013-07-15)<br />
<br />
I'm watching the tweets from <a href="http://dh2013.unl.edu/" target="_blank">#dh2013</a> starting to roll in and feeling kind of sad (and, let's be honest, left out) not to be there. Conference attendance has been hard the last few years because I didn't have any travel funding in my old job. So I've tended only to go to conferences close to home or where I could get grant funding to pay for them.<br />
<br />
It's also quite hard to decide which conferences to go to: on a self-funded basis, I can manage about one a year, so the choice matters. I'm a technologist working in a library, on digital humanities projects, with a focus on markup technologies and on ancient studies. So my list is something like:<br />
<ul>
<li>DH</li>
<li>JCDL</li>
<li>One of many language-focused conferences</li>
<li>The TEI annual meeting</li>
<li>Balisage</li>
</ul>
<div>
I could also make a case for conferences in my home discipline, Classics, but I haven't been to the <a href="http://apaclassics.org/" target="_blank">APA</a> annual meeting in over a decade. Now that the <a href="http://dca.drupalgardens.com/" target="_blank">Digital Classics Association</a> exists, that might change.<br />
<br />
I tend to cycle through the list above. Last year I went to the <a href="http://idhmc.tamu.edu/teiconference/" target="_blank">TEI meeting</a>, the year before, I went to <a href="https://github.com/relevance/clojure-conj/tree/master/2011-slides" target="_blank">Clojure/conj</a> <i>and</i> <a href="https://dh2011.stanford.edu/" target="_blank">DH</a> (because a grant paid). The year before that, I went to <a href="http://www.balisage.net/" target="_blank">Balisage</a>, which is an absolutely fabulous conference if you're a markup geek like me (seriously, go if you get the chance).<br />
<br />
DH is a nice compromise though, because you get a bit of everything. It's also attended by a whole bunch of my friends, and people I'd very much like to become friends with. I didn't bother submitting a proposal for this year, because my job situation was very much up in the air at the time, and indeed, I started working at <a href="http://blogs.library.duke.edu/dcthree" target="_blank">DC3</a> just a couple of weeks ago. <a href="http://dh2013.unl.edu/" target="_blank">DH 2013</a> would have been unfeasible for all kinds of reasons, but I'm still bummed out not to be there. Have a great time y'all. I'll be following from a distance.</div>
First Contact (2013-02-06)<br />
<br />
It seems like I've had many versions of this conversation in the last few months, as new projects begin to ramp up:<br />
<div style="margin:2em;">
<b>Client</b>: I want to do something cool to publish my work.<br />
<br />
<b>Developer</b>: OK. Tell me what you'd like to do.<br />
<br />
<b>Client</b>: Um. I need you to tell me what's possible, so I can tell you what I want.<br />
<br />
<b>Developer</b>: We can do pretty much anything. I need you to tell me what you want so I can figure out how to make it.
</div>
Almost every introductory meeting with a client/customer starts out this way. There's a kind of negotiation period where we figure out how to speak each other's language, often by drawing crude pictures. We look at things and decide how to describe them in a way we both understand. We wave our hands in the air and sometimes get annoyed that the other person is being so dense.<br />
<br />
It's crucially important not to short-circuit this process though. You and your client likely have vastly different understandings of what can be done, how hard it is to do what needs to be done, and even whether it's worth doing. The initial negotiation sets the tone for the rest of the relationship. If you hurry through it, and let things progress while there are still major misunderstandings in the air, Bad Things will certainly happen. Like:<br />
<div style="margin:2em;">
<b>Client</b>: This isn't what I wanted at all!<br />
<br />
<b>Developer</b>: But I built exactly what you asked for!</div>
<br />
You did <i>what</i>? Form-based XML editing (2012-04-26)<br />
<br />
My most recent project has been building an editing system for APIS (the Advanced Papyrological Information System—I know, I know) data as part of the <a href="http://papyri.info/editor" rel="nofollow" target="_blank">Papyrological Editor</a> (PE). APIS records are now TEI files. They started out using a text-based format that was MARC-inspired (fields, not structure), but I converted them to TEI a while back to bring them in line with the other <a href="http://papyri.info/">papyri.info</a> data. HGV records (also TEI) already had an editing system in the PE, so it seemed like the task of adding an APIS editor, which would be a simpler beast, wouldn't be that hard, just a matter of extending the existing one. In fact, it was an awful struggle.<br />
<br />
Partly this is my fault. The PE is a Rails application, running on JRuby, and my Rails knowledge was pretty rusty. I loved it when I was working on it full time a few years ago, but coming back to it now, and working on an application someone else built, I found it opaque and mentally exhausting. I had a hard time keeping in my head the different places business logic might be found, and figuring out where to graft on the pieces of the new editor was a continual fight against coder's block. The HGV editor relies on a YAML-based mapping file that documents where the editable information in an HGV document is to be found. This mapping is used to build a data structure that the View uses to populate an editing form; upon form submission, the process is reversed.<br />
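The mapping-driven approach is easy to sketch outside Rails. Here is a hedged Python approximation: the real HGV mapping is YAML and much richer, real TEI is namespaced, and the element paths below are invented for illustration. The point is just that one dictionary of form fields to element paths can drive both directions of the transformation.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: form field name -> path to the editable element.
# (Invented paths; TEI namespaces omitted for brevity.)
MAPPING = {
    "title": "./teiHeader/fileDesc/titleStmt/title",
    "material": "./text/body/div/p",
}

def xml_to_form(xml_string, mapping):
    """Pull mapped values out of the document to populate a form."""
    root = ET.fromstring(xml_string)
    return {field: root.findtext(path) for field, path in mapping.items()}

def form_to_xml(xml_string, mapping, form):
    """Write the (possibly edited) form values back into the document."""
    root = ET.fromstring(xml_string)
    for field, path in mapping.items():
        root.find(path).text = form[field]
    return ET.tostring(root, encoding="unicode")
```

Nothing about the form needs to know TEI; all the document-specific knowledge lives in the mapping, which is what makes the approach attractive and also what makes it laborious to extend.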
<br />
It's not at all unlike what <a href="http://www.w3.org/MarkUp/Forms/" target="_blank">XForms</a> does, and in fact I was repeatedly saying to myself "Why didn't they just use XForms?" I got annoyed enough that I took some time to look at what it would take to just plug XForms into the application and use that for the APIS editor. The reluctant conclusion I came to was that there just aren't any XForms frameworks out there that I could do this with. And the XForms tutorials I was looking at didn't deal with data that looked like mine at all. TEI is flexible, but not all that regular, and every example I saw dealt with very regular XML data. Moreover, the only implementation I could find that wasn't a server-side framework (and I wasn't going to install another web service just to handle one form) is <a href="http://www.agencexml.com/xsltforms" target="_blank">XSLTForms</a>. XSLTForms is impressive, but relies on in-browser XSLT transforms, which is fine if you have control of the whole page, but inconvenient for me, because I've already got a framework producing the shell of my pages. I just wanted something that would plug into what I already had. A bit sadder but wiser, I decided the team who built the HGV editor had done the best they could with the tools available.<br />
<br />
Then, about a month ago, I got sick. Like bedridden sick. I actually took 3 days off work, which is probably the most sick leave I've taken since my children were born. While I was lying in bed waiting for the antibiotics to kick in, I got bored and thought I'd have a crack at rewriting the APIS editor. In JavaScript. There was already a perfectly good pure-XML editing view, where you can open your document, make changes, and save it back (having it validated etc. on the server), so why not build something that could load the XML into a form in the browser, and then rebuild the XML with the form's contents and post it back? Doesn't sound that hard. And so that's what I did. I should say, this seemed like a potentially very silly thing to do, adding another, non-standard, editing method that duplicates the functionality of an existing standard to a system that already has a working method. I'd spent a lot of time figuring out how to refactor the existing editor because that seemed like the Right Thing To Do. Being sick, bored, and off the clock gave me the scope to play around a bit.<br />
<br />
Let me talk a bit more about the constraints of the project. Data is stored in XML files in a Git repository. This means that not only do I want my edits to send back good XML, I want it to look as much as possible like the XML I got in the first place, with the same formatting, indentation, etc., so that Git can make sensible judgements about what has actually changed. I might want some data extracted from the XML to be processed a bit before it's put into the form, and vice versa. For example, I have dates with @when or @notBefore/@notAfter attributes that have values in the form YYYY-MM-DD, but I want my form to have separate Year, Month, and Day fields. Mixed (though only lightly mixed) content is possible in some elements. I need my form to be able to insert elements that may not be there at all in the source. I need it to deal with repeating data. I need it to deal with structured data (i.e. an XML fragment containing several pieces of data should be treated as a unit). And of course, it needs to be able to put data into the XML document in such a way that it will validate, so the order of inserted elements matters. Moreover, I need to build the form the way the framework expects it to be built, as a Rails View.<br />
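To make the date case concrete, here's roughly what the pair of conversion functions looks like. This is a minimal sketch, with my own function names and my own tolerance for partial and BCE dates—not the PE's actual code:

```javascript
// Split a TEI date attribute value (YYYY, YYYY-MM, or YYYY-MM-DD,
// possibly with a leading minus for BCE years) into form-field values.
function dateToFields(when) {
  var m = /^(-?\d+)(?:-(\d+))?(?:-(\d+))?$/.exec(when || "");
  if (!m) { return { year: "", month: "", day: "" }; }
  return { year: m[1], month: m[2] || "", day: m[3] || "" };
}

// Rebuild the attribute value from the form fields, zero-padding and
// dropping trailing empty components, so {year: "2012"} round-trips to "2012".
function fieldsToDate(f) {
  function pad(s, n) { s = String(s); while (s.length < n) { s = "0" + s; } return s; }
  var out = [];
  if (f.year) { out.push(f.year); }
  if (f.year && f.month) { out.push(pad(f.month, 2)); }
  if (f.year && f.month && f.day) { out.push(pad(f.day, 2)); }
  return out.join("-");
}
```

On submission, fieldsToDate's result would be written back into @when (or @notBefore/@notAfter); the regex deliberately refuses anything it couldn't round-trip, so malformed values fall through to hand editing.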
<br />
So the tool needs to know about an instance document:<br />
<br />
<ol>
<li>how to get data out of the XML, maybe running it through a conversion function on the way out</li>
<li>how to map that data to one or a group of form elements, so there needs to be a naming convention</li>
<li>how to deal with repeating elements or groups of elements</li>
<li>how to structure the data coming out of the form, possibly running it through a function, and being able to deal with optional sub-structures</li>
<li>how to put the form data into the correct place in the XML, so some knowledge of the content model is necessary</li>
<li>how to add missing ancestor elements (maybe with attributes) for new form data</li>
</ol>
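Requirements 1, 2, and 4 point toward a declarative mapping, one entry per editable datum. Here's a sketch of the shape such a mapping might take; the key names, the XPath, and the converters are all invented for illustration, not the PE's real configuration:

```javascript
// Each entry: where the datum lives in the XML, which form inputs it
// feeds (named by convention from the key), and a converter each way.
var mapping = {
  date: {
    xpath: "//origDate/@when",                        // read/write location in the TEI
    fields: ["year", "month", "day"],                 // inputs named date-year, date-month, ...
    fromXML: function (v) { return v.split("-"); },   // "2012-04-26" -> ["2012", "04", "26"]
    toXML: function (f) { return f.join("-"); }       // ...and back again on submit
  }
};

// The naming convention: derive the form input names for one entry.
function fieldNames(key) {
  return mapping[key].fields.map(function (f) { return key + "-" + f; });
}
```

Repeating data (requirement 3) would add an index to the convention (date-0-year, date-1-year, ...), and requirements 5 and 6 need a second table describing the content model—which is where most of the real complexity lives.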
<div>
Basically, it needs to be able to deal with XML as it occurs in the wild: nasty, brutish, and (one hopes) short.</div>
<div>
<br /></div>
<div>
Form-based XML editing is one of those things that sounds fairly easy, but is in fact fraught with complications. It's easy to get data out of XML (XPath!), and it's easy to manipulate an XML DOM, removing and adding elements, attributes, and text nodes. But it's actually quite hard to get everything right, and to make the XML's formatting stay consistent. In my next post, I'll talk about how I did it.</div>Anonymoushttp://www.blogger.com/profile/05582281819129069158noreply@blogger.com3tag:blogger.com,1999:blog-8725958.post-5819592977660117172012-03-20T22:12:00.001-04:002012-03-21T09:32:25.858-04:00How to ApologizeThe latest regrettable spasm of sexism in the programming world played out this afternoon, as a company called Sqoot's announcement of a hackathon caused said event to implode before it ever began by including the infuriating and insensitive line under "perks" [Update: just to be clear, in the context of the original page, it was clear that the presence of women serving beer was one of the perks for attendees]:<br />
<blockquote class="tr_bq">
Women: Need another beer? Let one of our friendly (female) event staff get that for you.</blockquote>
Gag. Sqoot fairly quickly realized they had walked into a buzzsaw, as lots of people called them on it, and their sponsors started pulling their support. It's rather nice to see that kind of quick, public reaction. Cloudmine's <a href="http://blog.cloudmine.me/post/19639692371/sexism-in-tech" target="_blank">blog post about it</a> particularly impressed me. Sqoot issued an apology fairly swiftly, which I quote below:<br />
<blockquote class="tr_bq">
<b id="internal-source-marker_0.0038802814669907093"><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">Sqoot is hosting an </span><a href="http://apijam-boston.eventbrite.com/"><span style="color: #1155cc; font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">API Jam in Boston </span></a><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">at the end of March. One of the perks we (not our sponsors) listed on the event page was:</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">“Women: Need another beer? Let one of our friendly (female) event staff get that for you.”</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">While we thought this was a fun, harmless comment poking fun at the fact that hack-a-thons are typically male-dominated, others were offended. 
That was not our intention and thus we changed it.</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">We’re really sorry,</span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"></span><br /><span style="font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;">Avand & Mo</span></b></blockquote>
This didn't do much for a lot of people, but it got me thinking about apologies in tech in general, since they are actually crucial moments in the interaction between you and your customers/audience. When I worked at <a href="http://www.lulu.com/" target="_blank">Lulu</a>, <a href="http://caretakerbob.tumblr.com/" target="_blank">Bob Young</a> used to say that whenever you screw up, it's actually a tremendous opportunity to win a customer's loyalty by making it right. This applies both to small screwups (a customer's order never made it) and large ones (you did something that made lots of people mad). It strikes me that in this day and age, when the "<a href="http://en.wikipedia.org/wiki/Non-apology_apology" target="_blank">non-apology</a>" has become so frequent, people may actually not realize when it isn't appropriate to use conditional or evasive language in apologies. It's one thing if you're worried about being sued and can't admit culpability, or if you're someone like <a href="http://www.forbes.com/sites/johnmcquaid/2012/03/03/rush-limbaughs-non-apology-apology/" target="_blank">Rush Limbaugh</a>, who's presumably concerned about appearing to back down in front of his audience. But if you're actually intent on repairing the damage done to your relationship with your customer or your audience, you need to be able to apologize properly.<br />
<br />
So what are the elements of a good apology?<br />
<br />
<ol>
<li>I hear you.</li>
<li>I am truly sorry.</li>
<li>(semi-optional, depending on what happened) This is what went wrong.</li>
<li>I am doing <i>x</i> to make sure this doesn't happen again and <i>y</i> to make it right with you.</li>
<li>Thank you. I appreciate the feedback.</li>
</ol>
<div>
#1 is crucial. The person or group you're addressing has to know that you've heard their complaint and understand it. Apologies that lack this element sound cold and disconnected. And this is the main problem with Sqoot's "others were offended." They aren't speaking to the people they offended. This is just guaranteed to further piss people off.</div>
<div>
<br /></div>
<div>
#2 should be unconditional. Not "I'm sorry if you were offended." Indeed, if you find yourself pushing the focus onto the people whom you pissed off at all, you may be sliding into non-apology territory. This isn't about them—they're mad because you made them mad. Note that a good apology is not defensive, and does not attempt to shift the blame, even if that blame belongs to an employee whom you've just fired. If you did that, it's part of #4, the "how I'm fixing it" part, not the "I'm sorry" part. Don't try to save face in a genuine apology. Indicating that you meant no harm is fine, but if you're apologizing, it means you <i>caused</i> harm regardless of your intent.</div>
<div>
<br /></div>
<div>
#3 is a bit more tricky. People want to know how this could have happened, but it doesn't do to dwell on it too much, and this is another mistake Sqoot makes. They probably shouldn't quote the line that made everyone mad (it will make the readers mad all over again). It would have been enough to say they put something stupid and sexist into an event page which they now regret. On the other hand, you do have to acknowledge what happened and not look like you're trying to dodge it. So don't go into excruciating detail about what went wrong with a customer's order, for example. "I'm afraid you found a bug in our shopping cart" is probably enough detail. Sqoot's apology does this really badly: they explain exactly what they did, how it happened (we thought it was funny, because we're aware that these things are mostly male), and then contrast the "others" (who lack their sense of humor) who were offended. Explaining how you messed up does not mean defending yourself, and defending yourself in an apology must be handled delicately, or you look like an ass.</div>
<div>
<br /></div>
<div>
#4 Fix it, if you can. "We're refunding your order immediately and giving you a coupon", "I shall be entering rehab tomorrow morning", "We're donating $$ to <i>x</i> charitable organization".</div>
<div>
<br /></div>
<div>
#5 Reconnect. When you screw up, people are paying very close attention to you, and it's an opportunity to show that you're a stand-up company/organization/person. You stand to win greater loyalty and affection by handling the problem effectively. The people who are complaining (assuming they are correct, of course) are <i>helping</i> you by showing you where you're wrong, or at least showing you a different perspective. Sqoot "signing" their apology is actually good, in this case, because it indicates the founders (I presume that's who they are) are taking responsibility. It's too bad they flubbed the middle bit.</div>
<br />
I'm a middle-aged, white, male programmer, so that's where I'm coming from. I can't help any of that, but doubtless it colors my perspective.<br />
<br />
First, I have to say that the idea of coding being associated with prestige (as it seems now to be in DH) is rather foreign to my experience, but the rise of the <a href="http://www.quora.com/Brogramming/How-does-a-programmer-become-a-brogrammer" target="_blank">brogrammer</a> is probably a sign that in general it's not such a marginal activity anymore. These guys would probably have gone and gotten MBAs instead in years past.<br />
<br />
DH is slightly uncomfortable territory for programmers, as I've written in the <a href="http://mediacommons.futureofthebook.org/alt-ac/pieces/building-digital-classics" target="_blank">past</a>, at least it is for people like me who mostly program rather than academics who write code to accomplish their ends. I speak in generalities, and there are good local exceptions, but we don't get adequate (or often any) professional development funding, we don't get research time, we don't get credit for academic stuff we may do (like writing papers, presenting at conferences, etc.), we don't get to lead projects, and our jobs are very often contingent on funding. All this in exchange for about a 30% pay cut over what we could be earning in industry. There are compensations of course: incredibly interesting work, really great people to work with, and often more flexible working conditions. That's worth putting up with a lot. I have a wife and young kids, and I'm rather fond of them, so being able to work from home and having decent working hours and vacation time is a major bonus for me.<br />
<br />
None of that in any way accounts for the gender imbalance that Miriam is highlighting, though it does perhaps work as a general disincentive (in academia) to learn to code <i>too well</i>. I'd also say that there is nothing that can make you feel abysmally stupid quite like the discipline of programming. Errors as small as a single character (added or omitted) can make a program fail in any number of weird ways. I am frequently faced with problems that I don't know ahead of time I'll be able to solve. Ostensibly hard problems may be dead simple to fix, while simple tasks may end up taking days. It keeps you humble. But I would say that if you're the lone woman sitting in a class/lecture/lab, and you're feeling lost, you're not the only one, and it has nothing at all to do with your gender.<br />
<br />
As Bethany cautions, please, please don't circle the wagons. It's my contention that most of the offensive things about programmer culture are neither intentional nor deeply ingrained, but are actually artifacts of the gender/race imbalance.<br />
<br />
[As an aside, I was interested in Miriam's remarks about codeacademy. I started working through it with my daughter a couple of weeks ago, and she was finding it incredibly frustrating. It does not fit her learning style at all. She, like me, needs to know <i>why</i> she has to type this thing. She finds being told to just type <span style="font-family: 'Courier New', Courier, monospace;">var myName="Grace"</span><span style="font-family: Times, 'Times New Roman', serif;"> </span><span style="font-family: inherit;">without any context stupid</span>. In the end we gave up and started learning how to build web pages, and I'll reintroduce Javascript in that context.]<br />
<br />
Programmer culture <b>is</b> exclusionary though. Undergraduate CS programs have "weed out" courses. I've actually taken a couple of these, and the first one did weed me out—it managed to be hard and extremely boring at the same time. I only came back to it years later. This gets at an aspect of programmer culture, a sort of "are you good enough?" ethic. It's not without foundation—a lot of people who self-identify as programmers <a href="http://imranontech.com/2007/01/24/using-fizzbuzz-to-find-developers-who-grok-coding/" target="_blank">can't</a> <a href="http://www.codinghorror.com/blog/2007/02/why-cant-programmers-program.html" target="_blank">program</a>—but it also means that when you start to work with a new group, there's often a kind of ritual pissing contest where you have to prove your worth before you're treated as an equal. This kind of thing is irritating enough on its own, and it's easy to imagine it taking on sexist or racist overtones.<br />
<br />
Programming also tends to squeeze out older folks. Actual age discrimination does happen, but it's also because staying current means almost totally rebooting your skillset every few years. The <a href="http://pragprog.com/the-pragmatic-programmer" target="_blank">Pragmatic Programmer</a> book recommends learning a new language every year, and this is crucial advice (my own record is more like one every 18 months or so, but that's been enough so far). If you let yourself get comfortable and coast, or go into management, your skills are going to be close to worthless in about 5 years. And, while your experience definitely gives you an edge, you're not guaranteed to have the best solution to any given problem. The 22-year-old who read something on Hacker News last night might have found an answer that totally blows your received wisdom out of the water.<br />
<br />
[As another aside, the speed at which skills go stale means that any organization that doesn't invest in professional development for their programmers is being seriously stupid. Or they expect their devs to leave and be replaced every few years.]<br />
<br />
The upshot is that the population of programmers not only skews male, it skews young. Put a bunch of young men together, particularly in small groups that are under a lot of pressure (in a startup, for example), and you get the sorts of tribal behaviors that make women and anyone over 35 uncomfortable. There's not just misogyny, there's hostility towards people who might want to have a life outside of work (e.g. people who have spouses and kids and like spending time with them). And this is both a cause of sexual/racial/age imbalance and a result. It's a self-reinforcing cycle.<br />
<br />
But there isn't really one monolithic "coder culture", there are lots of them. Every company and institution has its own culture. Teams have cultures. Basically, any grouping of human beings is going to develop its own set of values and ways of doing things. It's what people do naturally. The leaders of these groups have a lot to do with how those cultures develop, but they aren't necessarily in any position to remedy imbalances.<br />
<br />
Once you're in a position to hire people, you realize that hiring good developers is <b>hard</b>. In any pool of applicants, you're going to have a bunch of totally unsuitable people, a few maybes, and if you're lucky, one or two gems (or people you think might become a gem with a little polishing). Are any of these going to be women? Not unless you're <b>really</b> lucky, because the weeding out has already done its work. So once you're in a position to decide who gets hired, it's too late to redress any imbalance. The imbalance isn't there because leaders don't want to hire women; it's baked in long before anyone reaches the hiring stage.<br />
<br />
The only answer I can see is to get a lot more women into programming. If the CS curricula can't do it, maybe new modes like DH can. From what I've seen, the gender balance in DH, while still not great, is a lot less ruinous than in programming in general, and a CS degree is far from the only road into a programming career (I don't have one). I think the cycle can be broken. I don't think there's a lot of deeply ingrained misogyny or racism in coding culture. Rather, it's a boys club because it happens to be mostly boys. If there were more women it would adjust. And I don't think that (mostly) it would put up much resistance to an influx of women. So circling the wagons is the exact opposite of what needs to be done.Anonymoushttp://www.blogger.com/profile/05582281819129069158noreply@blogger.com6tag:blogger.com,1999:blog-8725958.post-4365604489432491362011-11-14T19:23:00.001-05:002011-11-14T22:05:14.578-05:00TEI in other formats; part the second: TheoryIn my <a href="http://philomousos.blogspot.com/2011/11/tei-in-other-formats-part-first-html.html" target="_blank">first</a> post on this subject, I poked a bit at how one might represent TEI in HTML without discarding the text model from the TEI document. Now I want to talk a bit more about that model, and the theory behind it. I may say irritable things about Theory as well, if you can stand it until the end. I humbly beg the reader's pardon.<br />
<br />
Let's look at the same document I talked about last time: <a href="http://papyri.info/ddbdp/p.ryl;2;74/source">http://papyri.info/ddbdp/p.ryl;2;74/source</a> (see also <a href="http://papyri.info/ddbdp/p.ryl;2;74/">http://papyri.info/ddbdp/p.ryl;2;74/</a>). We can visualize the document structure using <a href="http://www.graphviz.org/" target="_blank">Graphviz</a> and a spot of XSLT (click for high-res):<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://philomousos.com/images/p.ryl.2.74.dot.png" style="margin-left: auto; margin-right: auto;" target="_blank"><img border="0" height="51" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIuvJ8HgY9lEAfpNfowAJFR0hrNomSlCxEAMrF8zrHVf4v38yonTdAA_PZgahyphenhyphenFZGL9Jam_RD-U9AM8Ji1676-cC0s9YL164YOpQx6XpSFM-ZaYSHetD0C2nswhi-mG8srObo7/s640/p.ryl.2.74.dot.png" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">tree structure of a TEI document</td></tr>
</tbody></table>
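I won't reproduce the XSLT here, but the idea is simple: walk the tree, emitting one DOT node per element and one edge per parent-child link. Here is the same walk in JavaScript, over a toy object tree standing in for the parsed XML (the node shape is my own stand-in, not the real stylesheet's input):

```javascript
// Emit Graphviz DOT for a toy element tree: one node per element,
// one edge per parent-child relationship.
var nextId = 0;
function toDot(node, lines, parentId) {
  var id = "n" + (nextId++);
  lines.push(id + ' [label="' + node.name + '"];');
  if (parentId) { lines.push(parentId + " -> " + id + ";"); }
  (node.children || []).forEach(function (child) { toDot(child, lines, id); });
  return lines;
}

// A much-simplified fragment of the P.Ryl. 2 74 structure.
var tree = { name: "ab", children: [{ name: "lb" }, { name: "supplied", children: [{ name: "#text" }] }] };
console.log("digraph tei {\n  " + toDot(tree, []).join("\n  ") + "\n}");
```

Feed the output to dot -Tpng and you get a picture like the one above.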
<br />
<div>
It's a fairly flat tree. As an XML document, it has to be a tree, of course, and TEI leverages this built-in "tree-ness" to express concepts like "this text is part of a paragraph" (i.e. it has a tei:p element as its ancestor). In line 1, for example, we find</div>
<div>
<br /></div>
<blockquote class="tr_bq">
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><span class="webkit-html-tag" style="white-space: pre-wrap;"><supplied reason="lost">Μάρ</supplied>κος</span></span></blockquote>
meaning the first three letters of the name Markos have been lost due to damage suffered by the papyrus the text was written on, and the editor of the text has supplied them. The fact that the letters "Μάρ" are contained by the supplied element, or, more properly that the text node containing those letters is a child of the supplied element, means that those letters have been supplied. In other words, the parent-child relationship is given additional semantics by TEI. We already have some problems here: the child of supplied is itself part of a word, "Markos", and that word is broken up by the supplied element. Only the fact that no white space intervenes between the end of the supplied element and the following text lets us know that this is a word. It's even worse if you look at the tree version, which is, incidentally, how the document will be interpreted by a computer after it has been parsed:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYG8LZlWvRI8GU7mRE1ubm72SjMhaNPJOLO3KepEBcj4-ej6_i6lJd6F35Xrs77OlZKtMwg1jCJH6WK_cYV2LMSJMAgV64ipZdQVaWeDmM9XO5AvOY8LXkuVHl41tdgN0cT32r/s1600/p.ryl.2.74.dot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYG8LZlWvRI8GU7mRE1ubm72SjMhaNPJOLO3KepEBcj4-ej6_i6lJd6F35Xrs77OlZKtMwg1jCJH6WK_cYV2LMSJMAgV64ipZdQVaWeDmM9XO5AvOY8LXkuVHl41tdgN0cT32r/s1600/p.ryl.2.74.dot.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
There's no obvious connection here between the first and second halves of the name. And in fact, if we hadn't taken steps to prevent it, any program that processed the document might reformat it so that "Mar" and "kos" were no longer connected. We could solve this problem by adding more elements. As the joke goes, "XML is like violence. If it isn't working, you're not using it enough." We could explicitly mark all the words, using a "w" element, thus: </div>
<blockquote class="tr_bq">
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; white-space: pre-wrap;"><w><supplied reason="lost">Μάρ</supplied>κος</w></span></blockquote>
or<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsrg90_6NP9fZm6EQLhCRI31mtUuy2Wc6eKzvFsh8-KisgpFrx-TYwHgLfQZYBvqJUKLdrsrgNsskoAJ66LqLnDlUf5FJTOofeG5-x5FYuIakdaKyORuDyQKihpn4NMz-jzEx_/s1600/p.ryl.2.74.dot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsrg90_6NP9fZm6EQLhCRI31mtUuy2Wc6eKzvFsh8-KisgpFrx-TYwHgLfQZYBvqJUKLdrsrgNsskoAJ66LqLnDlUf5FJTOofeG5-x5FYuIakdaKyORuDyQKihpn4NMz-jzEx_/s1600/p.ryl.2.74.dot.png" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
which would solve any potential problems with words getting split up, because we could always fix the problem—we would know what all the words are. We could even attach useful metadata, like the lemma (the dictionary headword) of the word in question. We don't do this for a few reasons. First, because we don't need to. We can work around words being split up by markup. Second, because it complicates the document and makes it harder for human editors to deal with, and third, because it introduces new chances for overlap. Overlap is the Enemy, as far as XML is concerned. The more containers you have, the greater the chances one container will need to start outside another, but finish inside (or vice versa). Consider that there's no reason at all a region of supplied text shouldn't start in the middle of one word and end in the middle of another. Look at lines 5-6 for example:</div>
<blockquote class="tr_bq">
<span class="Apple-style-span" style="background-color: #f8f6f4; color: #3c2217; font-family: 'Lucida Grande', Cardo, 'Arial Unicode MS', 'Galilee Unicode Gk', 'New Athena Unicode', 'Athena Unicode', 'Palatino Linotype', 'Titus Cyberbit Basic', 'Vusillus Old Face', Alphabetum, 'Galatia SIL', 'Code 2000', GentiumAlt, Gentium, 'Minion Pro', GeorgiaGreek, 'Vusillus Old Face Italic', 'Everson Mono', Aristarcoj, Porson, Legendum, 'Aisa Unicode', 'Hindsight Unicode', Caslon, Verdana, Tahoma; font-size: 16px; line-height: 22px;"> ... οὐ̣[χ ἱκανὸν εἶ-]</span><br id="al6" style="background-color: #f8f6f4; color: #3c2217; font-family: 'Lucida Grande', Cardo, 'Arial Unicode MS', 'Galilee Unicode Gk', 'New Athena Unicode', 'Athena Unicode', 'Palatino Linotype', 'Titus Cyberbit Basic', 'Vusillus Old Face', Alphabetum, 'Galatia SIL', 'Code 2000', GentiumAlt, Gentium, 'Minion Pro', GeorgiaGreek, 'Vusillus Old Face Italic', 'Everson Mono', Aristarcoj, Porson, Legendum, 'Aisa Unicode', 'Hindsight Unicode', Caslon, Verdana, Tahoma; font-size: 16px; line-height: 22px; text-align: left;" /><span class="Apple-style-span" style="background-color: #f8f6f4; color: #3c2217; font-family: 'Lucida Grande', Cardo, 'Arial Unicode MS', 'Galilee Unicode Gk', 'New Athena Unicode', 'Athena Unicode', 'Palatino Linotype', 'Titus Cyberbit Basic', 'Vusillus Old Face', Alphabetum, 'Galatia SIL', 'Code 2000', GentiumAlt, Gentium, 'Minion Pro', GeorgiaGreek, 'Vusillus Old Face Italic', 'Everson Mono', Aristarcoj, Porson, Legendum, 'Aisa Unicode', 'Hindsight Unicode', Caslon, Verdana, Tahoma; font-size: 16px; line-height: 22px; text-align: left;">[ναι εἰ]ς</span></blockquote>
A supplied section begins in the middle of the third word from the end of line five, and continues for the rest of the line. The last word is itself broken and continues on the following line, the beginning of which is also supplied, that section ending in the middle of the second word on line six. This is a mess that would only be compounded if we wanted to mark off words.<br />
<br />
This may all seem like a numbing level of detail, but it is on these details that theories of text are tested. The text model here cares about editorial observations on and interventions in the text, and those are what it attempts to capture. It cares much less about the structure of the text itself—note that the text is contained in a tei:ab, an element designed for delineating a block of text without saying anything about its nature as a block (unlike tei:p, for example). Visible features like columns, or text continued on multiple papyri, or multiple texts on the same papyrus would be marked by tei:divs. This is in keeping with papyrology's focus on the materiality of the text. What the editor sees, and what they make of it, is more important than the construction of a coherent narrative from the text—something that is often impossible in any case. Making that set of tasks as easy as possible is therefore the focus of the text model we use.<br />
<br />
What I'm trying to get at here is that there is Theory at work here (a host of theories in fact), having to do with a way to model texts, and that that set of theories is mapped onto data structures (TEI, XML, the tree) using a set of conventions, and taking advantage of some of the strengths of the data structures available. Those data structures have weaknesses too, and where we hit those, we have to make choices about how to serve our theories best with the tools we have. There is no end of work to be done at this level, of joining theory to practice, and a great deal of that work involves hacking, experimenting with code and data. It is from this realization, I think, that the "more hack, less yack" ethic of THATCamp emerged. And it is at this level, this intersection, this interface, that scholar-programmer types (like me) spend a lot of our time. And we do get a bit impatient with people who don't, or can't, or won't engage at the same level, especially if they then produce critiques of what we're doing.<br />
<br />
As it happens, I do think that DH tends to be under-theorized, but by that I don't mean it needs more Foucault. Because it is largely project-driven, and the people who are able to reason about the lower-level modeling and interface questions are mostly paid only to get code into production, important decisions and theories are left implicit in code and in the shapes of the data, and aren't brought out into the light of day and yacked about as they should be.<br />Anonymoushttp://www.blogger.com/profile/05582281819129069158noreply@blogger.com0tag:blogger.com,1999:blog-8725958.post-80463314461845061242011-11-10T11:17:00.000-05:002011-11-10T12:06:59.536-05:00TEI in other formats; part the first: HTML<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;">There has been a fair amount of discussion of late about TEI either having a standard HTML5 representation or even moving entirely to an HTML5 format. I want to do a little thinking "out loud" about how that might work.</span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;">Let's start with a fairly standard EpiDoc document (EpiDoc being a set of guidelines for using TEI to mark up ancient documents). <a href="http://papyri.info/ddbdp/p.ryl;2;74/source">http://papyri.info/ddbdp/p.ryl;2;74/source</a> (see <a href="http://papyri.info/ddbdp/p.ryl;2;74/">http://papyri.info/ddbdp/p.ryl;2;74/</a> for an HTML version with more info) is a typical example of EpiDoc used to mark up a papyrus document. The document structure is fairly flat, but with a number of editorial interventions, all marked up. Line 12, below, shows supplied, unclear, and gap tags.</span><br />
<br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><lb n="12"/><gap reason="lost" quantity="16" unit="character"/> <supplied reason="lost"> Π</supplied><unclear>ε</unclear>ρὶ Θή<supplied reason="lost">βας καὶ Ἑ</supplied><unclear>ρ</unclear>μωνθ<supplied reason="lost">ίτ </supplied><gap reason="lost" extent="unknown" unit="character"/></span><br />
<br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;">So how might we take this line and translate it to HTML? First, we have an <lb> tag, which at first glance would seem to map quite readily onto the HTML <br> tag, but if we look at the TEI Guidelines page for lb (<a href="http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-lb.html">http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-lb.html</a>), we see a large number of possible attributes that don't necessarily convert well. In practice, all I usually see on a line break tag in TEI is an @n and maybe an @xml:id attribute. HTML doesn't really have a general-purpose attribute like @n, but @class or @title might serve. On <lb>, @n is often used to provide line numbers, so @title seems logical.</span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;">Now </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><gap reason="lost" quantity="16" unit="character"/></span><span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;"> is a bit more of a puzzler. First, HTML's semantics don't extend at all to the recording of attributes of a text being transcribed, so nothing like the gap element exists. We'll have to use a general-purpose inline element (span seems obvious) and figure out how to represent the attribute values. TEI has no lack of attributes, and these don't naturally map to HTML at all in most cases. If we're going to keep TEI's attributes, we'll have to represent them as child elements. We'll want to identify both the original TEI element and wrap its attributes and maybe its content too, so let's assume we'll use the @class attribute with a few fake "namespaces": "teie-" for TEI element names, "teia-" for attribute names, and "teig-" to identify attributes and wrap element contents (the latter might be overkill, but seems sensible as a way to control whitespace). We can assume a stylesheet with a span.teig-</span><span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;">attribute selector that sets display:none.</span><br />
<span class="Apple-style-span" style="font-family: Times, 'Times New Roman', serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><span class="teie-gap"></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> <span class="teig-attribute teia-</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">reason</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">">lost</span></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> <span class="</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">teig-</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">attribute teia-quantity">16</span></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> <span class="</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">teig-</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">attribute teia-unit">character</span></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"></span></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;">Like HTML, TEI has three structural models for elements: block, inline, and milestone. Block elements assume a "block" of text; that is, they create a visually distinct chunk of text. Divs, paragraphs, tables, and similar elements are block level. Inline elements contain content, but don't create a separate block. Examples are span in HTML, or hi in TEI. Milestones are empty elements like lb or br. TEI has several of these, and HTML, which has "generic" elements of the block and inline varieties (div and span), lacks a generic empty element. Hence the need to represent tei:gap as a span.</span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;">tei:supplied is clearly an inline element, and we can do something similar to the example above, using span:</span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><span class="teie-supplied"></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> <span class="teig-attribute teia-reason">lost</span></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> <span class="teig-content"></span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">Π</span></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"></span></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;">and likewise with unclear:</span><br />
<span class="Apple-style-span" style="font-family: Georgia, 'Times New Roman', serif;"><br /></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><span class="teie-unclear"></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> <span class="teig-content"></span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">ε</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"></span></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"></span></span><br />
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
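To make the mapping concrete, here is a rough sketch of it as code, using Python's standard-library ElementTree. The "teie-"/"teia-"/"teig-" class prefixes are the ad hoc convention proposed in this post, not any standard, and the function handles only the handful of cases discussed here (element tails and mixed content are glossed over):

```python
# Sketch: map a TEI fragment to the span-based HTML convention described
# above. The "teie-"/"teia-"/"teig-" prefixes are this post's ad hoc
# invention, not a standard.
import xml.etree.ElementTree as ET

def tei_to_html(elem):
    if elem.tag == "lb":
        # milestone: map to <br>, carrying @n on @title
        html = ET.Element("br", {"class": "teie-lb"})
        if "n" in elem.attrib:
            html.set("title", elem.get("n"))
        return html
    # everything else becomes a span named for the TEI element...
    html = ET.Element("span", {"class": "teie-" + elem.tag})
    # ...with each TEI attribute wrapped in a (CSS-hidden) child span
    for name, value in elem.attrib.items():
        attr = ET.SubElement(html, "span",
                             {"class": "teig-attribute teia-" + name})
        attr.text = value
    # ...and any text content wrapped in a teig-content span
    if elem.text and elem.text.strip():
        content = ET.SubElement(html, "span", {"class": "teig-content"})
        content.text = elem.text.strip()
    for child in elem:
        html.append(tei_to_html(child))
    return html

gap = ET.fromstring('<gap reason="lost" quantity="16" unit="character"/>')
print(ET.tostring(tei_to_html(gap), encoding="unicode"))
```

Run on the gap element from line 12, this emits the same nest of spans as the hand-written example above, which is the point: the convention is mechanical enough to automate, even if the output is verbose.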
Now, doing this using generic HTML elements and styling/hiding them with CSS could be considered bad behavior. It's certainly frowned upon in the CSS 2.1 spec (see the note at <a href="http://www.w3.org/TR/CSS2/selector.html#class-html">http://www.w3.org/TR/CSS2/selector.html#class-html</a>). I don't honestly see another way to do it though, because, although RDFa has been suggested as a vehicle for porting TEI to HTML, there is no ontology for TEI, so no good way to say "HTML element p is the same as TEI element p, here". Even granting the possibility of saying that, it doesn't help with the attribute problem. And we're still left with the problem of presentation: what will my HTML look like in a browser? It must be said that my messing about above won't produce anything like the desired effect, which for line 12 is something like:<br />
<blockquote class="tr_bq">
<span class="Apple-style-span" style="background-color: white; color: #3c2217; font-family: 'Lucida Grande', Cardo, 'Arial Unicode MS', 'Galilee Unicode Gk', 'New Athena Unicode', 'Athena Unicode', 'Palatino Linotype', 'Titus Cyberbit Basic', 'Vusillus Old Face', Alphabetum, 'Galatia SIL', 'Code 2000', GentiumAlt, Gentium, 'Minion Pro', GeorgiaGreek, 'Vusillus Old Face Italic', 'Everson Mono', Aristarcoj, Porson, Legendum, 'Aisa Unicode', 'Hindsight Unicode', Caslon, Verdana, Tahoma; font-size: 16px; line-height: 22px;">[- ca.16 - Π]ε̣ρὶ Θή[βας καὶ Ἑ]ρ̣μωνθ[ίτ -ca.?- ] </span></blockquote>
I could certainly make it so, probably with a combination of CSS and JavaScript, but what have I gained by doing so? I'll have traded one paradigm, XML + XSLT, for another, HTML + CSS + JavaScript. I'll have lost the ability to validate my markup, though I'll still be able to transform it to other formats. I should be able to round-trip it to TEI and back, so perhaps I could solve the validation problem that way. But is anything about this better than TEI XML? I don't think so…<br />
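Going the other way, at least for simple cases like these, is equally mechanical: the class names carry enough information to rebuild the TEI. A sketch of the inverse mapping (again assuming the ad hoc "teie-"/"teia-"/"teig-" convention proposed in this post, and only the simple cases shown here):

```python
# Sketch of the reverse mapping: rebuild a TEI element from an HTML span
# that follows the ad hoc "teie-"/"teia-"/"teig-" class convention
# proposed here. Handles only flat, simple cases, not full TEI.
import xml.etree.ElementTree as ET

def html_to_tei(span):
    classes = span.get("class", "").split()
    # the teie- class names the original TEI element
    tei_name = next(c[len("teie-"):] for c in classes
                    if c.startswith("teie-"))
    tei = ET.Element(tei_name)
    for child in span:
        child_classes = child.get("class", "").split()
        if "teig-attribute" in child_classes:
            # recover the attribute name from the teia- class...
            name = next(c[len("teia-"):] for c in child_classes
                        if c.startswith("teia-"))
            tei.set(name, child.text or "")
        elif "teig-content" in child_classes:
            tei.text = child.text  # ...and the element content
    return tei

html = ET.fromstring(
    '<span class="teie-supplied">'
    '<span class="teig-attribute teia-reason">lost</span>'
    '<span class="teig-content">Π</span>'
    '</span>')
print(ET.tostring(html_to_tei(html), encoding="unicode"))
```

That such a round trip is possible is really just a restatement of the problem: the HTML is carrying the TEI model piggyback, not expressing it natively.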
<br />
I suspect I'm missing the point here, and that what the proponents of TEI in HTML are really after is a radically curtailed (or re-thought) version of TEI that does map more comfortably to HTML. The somewhat Baroque complexity of TEI leads the casual observer to wish for something simpler immediately, and can provoke occasional dismay even in experienced users. I certainly sympathize with the wish for a simpler architecture, but text modeling is a complex problem, and simple solutions to complex problems are hard to engineer.<br />
<div class="p1">
<span class="s1">
</span></div>Anonymoushttp://www.blogger.com/profile/05582281819129069158noreply@blogger.com3tag:blogger.com,1999:blog-8725958.post-37290126143997186542011-06-28T11:00:00.001-04:002011-06-28T11:01:32.298-04:00Humanities Data CurationLast Thursday, I attended the excellent <a href="http://cirssweb.lis.illinois.edu/paloalto/">Humanities Data Curation Summit</a>, organized by Allen Renear, Trevor Muñoz, Katherine L. Walter, and Julia Flanders. I'm still processing the day, which included a breakout session with Allen, Elli Mylonas, and Michael Sperberg-McQueen, who are some of my favorite people in DH.<br />
<br />
What I started thinking about today was that we'd skipped definitions at the beginning—there was a joke that Allen, as a philosopher, could have spent all day on that task. But in doing so, we elided the question of what data is in the humanities, and how it differs from data in the sciences or social sciences.<br />
<br />
Humanities data are not usually static, collected data like instrument readings or survey results. They are things like marked up texts, uncorrected OCR, images in need of annotation, etc. Humanities datasets can almost always be improved upon. "Curation" for them is not simply preservation, access, and forward migration. It means enabling interested communities to work with the data and make it better. Community interaction needs to be factored into the data's curation lifecycle.<br />
<br />
I feel a blog post coming on about how the Integrating Digital Papyrology / <a href="http://papyri.info">papyri.info</a> project does this...Anonymoushttp://www.blogger.com/profile/05582281819129069158noreply@blogger.com0tag:blogger.com,1999:blog-8725958.post-51649753678421895332011-01-20T09:09:00.004-05:002011-01-23T15:30:44.242-05:00Interfaces and ModelsIn my <a href="http://philomousos.blogspot.com/2011/01/tei-is-text-modelling-language.html">last post</a>, I argued that TEI is a text modelling language, and in the <a href="http://philomousos.blogspot.com/2011/01/i-will-never-not-ever-type-angle.html">prior post</a>, I discussed a frequently-expressed request for TEI editors that hide the tags. Here, I'm going to assert that your editing interface (implicitly) expresses a model too, and because it does, generic, tag-hiding editors are a losing proposition.<br /><br />Everything to do with human-computer interfaces uses models, abstractions, and metaphors. Your computer "desktop" is a metaphor that treats the primary, default interface like the surface of a desk, where you can leave stuff laying around that you want to have close at hand. "Folders" are like physical file folders. Word processors make it look like you're editing a printed page; HTML editors can make it look as though you're directly editing the page as it appears in a browser. These metaphors work by projecting an image that looks like something you (probably) already have a mental model of. The underlying model used by the program or operating system is something else again. Folders don't actually represent any physical containment on the system's local storage, for example. The WYSIWYG text you edit might be a stream of text and formatting instructions, or a Document Object Model (DOM) consisting of Nodes that model HTML elements and text. <br /><br />If you're lucky, there isn't a big mismatch between your mental model and the computer's. 
But sometimes there is: we've all seen weirdly mis-formatted documents, where instead of using a header style for header text, the writer just made it bold, with a bigger font, and maybe put a couple of newlines after it. Maybe you've done this yourself, when you couldn't figure out the "right" way to do it. This kind of thing only bites you, after all, when you want to do something like change the font for all headers in a document.<br /><br />And how do we cope if there's a mismatch between the human interface and the underlying model? If the interface is much simpler than the model, then you will only be able to create simple instances with it; you won't be able to use the model to its full capabilities. We see this with word processor-to-TEI converters, for example. The word processor can do structural markup, like headers and paragraphs, but it can't so easily do more complex markup. You could, in theory, have a tagless TEI editor capable of expressing the full range of TEI, but it would have to be as complex as the TEI is. You could hide the angle brackets, but you'd have to replace them with something else.<br /><br />Because TEI is a language for producing models of texts, it is probably impossible to build a generic tagless TEI editor. In order for the metaphor to work, there must be a mapping from each TEI structure to a visual feature in the editor. But in TEI, there are always multiple ways of expressing the same information. The one you choose is dictated by your goals, by what you want to model, and by what you'll want the model to do. There's nothing to map to on the TEI side until you've chosen your model. 
Thus, while it's perfectly possible (and is useful,<a href="#note1">*</a> and has been done, repeatedly) to come up with a "tagless" interface that works well for a particular model of text, I will assert that developing a generic TEI editor that hides the markup would be a <span style="font-weight:bold;"><a href="http://fishbowl.pastiche.org/2007/07/17/understanding_engineers_feasibility/">hard</a></span> task.<br /><br />This doesn't mean you couldn't build a tool to generate model-specific TEI editors, or build a highly customizable tagless editor. But the customization will be a fairly hefty intellectual effort. And there's a potential disadvantage here too: creating such a customization implies that you know exactly how you want your model to work, and at the start of a project, you probably don't. You might find, for example, that for 1% of your texts, your initial assumptions about your text model are completely inadequate, and so the model has to be refined to account for them. This sort of thing happens all the time.<br /><br />My advice is to think hard before deciding to "protect" people from the markup. Text modeling is a skill that any scholar of literature could stand to learn.<br /><br />UPDATE: a comment on another site by Joe Wicentowski makes me think I wasn't completely clear above. There's NOTHING wrong with building "padded cell" editors that allow users to make only limited changes to data. But you need to be clear about what you want to accomplish with them before you implement one.<br /><br /><a name="note1">*</a>Michael C. M. 
Sperberg-McQueen has a nice bit on "padded cell editors" at <a href="http://www.blackmesatech.com/view/?p=11">http://www.blackmesatech.com/view/?p=11</a>Anonymoushttp://www.blogger.com/profile/05582281819129069158noreply@blogger.com0tag:blogger.com,1999:blog-8725958.post-54607198974487342652011-01-11T07:47:00.008-05:002011-01-11T09:35:54.916-05:00TEI is a text modelling language<div>I'm teaching a <a href="http://www.tei-c.org/">TEI</a> class this weekend, so I've been pondering it a bit. I've come to the conclusion that calling what we do with TEI "text encoding" is misleading. I think what we're really doing is text <b><i>modeling</i></b>.</div><div><br /></div><div>TEI provides an XML vocabulary that lets you produce models of texts that can be used for a variety of purposes. Not a Model of Text, mind you, but models (lowercase) of texts (also lowercase). </div><div><br /></div><div>TEI has made the (interesting, significant) decision to piggyback its semantics on the structure of XML, which is tree-based. So XML structure implies semantics for a lot of TEI. For example, paragraph text appears inside <p> tags; to mark a personal name, I surround the name with a <persName> tag, and so on. This arrangement is extremely convenient for processing purposes: it is trivial to transform the TEI <p> into an HTML <p><a href="#note1">*</a>, for example, or the <persName> into an HTML hyperlink, which points to more information about the person. It means, however, that TEI's modeling capabilities are to a large extent XML's own. This approach has opened TEI up to criticism. Buzzetti (2002) has argued that its tree structure simply isn't expressive enough to represent the complexities of text, and Schmidt (2010) criticizes TEI for (among other problems) being a bad model of text, because it imposes editorial interpretation on the text itself. 
</div><div><br /></div><div>The main disagreement I have with Schmidt's argument is the assumption that there is a text independent of the editorial apparatus. Maybe there is sometimes, but I can point at many examples where there is no text, as such, only readings. And a reading is, must be, an interpretive exercise. So I'd argue that TEI is at least honest in that it puts the editorial interventions front and center where they are obvious.</div><div><br /></div><div>As for the argument that TEI's structure is inadequate to model certain aspects of text, I can only agree. But TEI has proved good enough to do a lot of serious scholarly work. That, and the fact that its choice of structure means it can bring powerful XML tools to bear on the problems it confronts, means that TEI represents a "worse is better" solution.<a href="#note2">†</a> It works a lot of the time, doesn't claim to be perfect, and incrementally improves. Where TEI isn't adequate to model a text in the way you want to use it, you either shouldn't use it, or should figure out how to extend it.</div><div><br /></div><div>One should bear in mind that any digital representation of a text is <i>ipso facto</i> a model. It's impossible to do anything digital without a model (whether you realize it's there or not). Even if you're just transcribing text from a printed page to a text editor, you're making editorial decisions, like what character encoding to use, how to represent typographic features in that encoding, how to represent whitespace, and what to do with things you can't easily type (inline figures or symbols without a Unicode representation, for example).</div><div><br /></div><div>So why argue that TEI is a language for modeling texts, rather than a language for "encoding" texts? The simple answer is that this is a better way of explaining what people use TEI for. TEI provides a lot of tags to choose from. No-one uses them all. Some are arguably incompatible with one another. 
We tag the things in a text that we care about and want to use. In other words, we build models of the source text, models that reflect what we think is going on structurally, semantically, or linguistically in the text, and/or models that we hope to exploit in some way.</div><div><br /></div><div> For example, <a href="http://epidoc.sf.net/">EpiDoc</a> is designed to produce critical editions of inscribed or handwritten ancient texts. It is concerned with producing an edition (a reading) of the source text that records the editor's observations of and ideas about that text. It does not at this point concern itself with marking personal or geographic names in the text. An EpiDoc document is a particular model of the text that focuses on the editor's reading of that text. As a counterexample, I might want to use TEI to produce a graph of the interactions of characters in Hamlet. If I wanted to do that, I would produce a TEI document that marked people and whom they were addressing when they spoke. This would be a completely different model of the text than a critical edition of Hamlet might be. I could even try to do both at the same time, but that might be a mess—models are easier to deal with when they focus on one thing.</div><div><br /></div><div>This way of understanding TEI makes clear a problem that arises whenever one tries to merge collections of TEI documents: that of compatibility. Just because two documents are marked up in TEI, that does not mean they are interoperable. This is because each document represents the editor's <b><i>model</i></b> of that text. Compatibility is certainly achievable if both documents follow the same set of conventions, but we shouldn't <b><i>expect</i></b> it any more than we'd expect to be able to merge any two models that follow different ground rules.</div><div><br /></div><div><b>Notes</b></div><div><a name="note1">*</a> with the caveat that the semantics of TEI <p> and HTML <p> are different, and there may be problems. 
TEI's <p> can contain lists, for example, whereas HTML's cannot.</div><div><br /></div><div><a name="note2">†</a> See <a href="http://www.dreamsongs.com/RiseOfWorseIsBetter.html">http://www.dreamsongs.com/RiseOfWorseIsBetter.html</a></div><div><br /></div><div>Yes, I wrote a blog post with endnotes and bibliography. Sue me.</div><div><ol><li><span class="cit-name-surname">Buzzetti</span>, <span class="cit-name-given-names">D</span>. "<span class="cit-article-title">Digital Representation and the Text Model</span>." <abbr class="cit-jnl-abbrev"><i>New Literary History</i></abbr> <span class="cit-pub-date">2002</span>; <span class="cit-vol">33.1</span>:<span class="cit-fpage">61</span>-<span class="cit-lpage">88</span>.</li><li>Schmidt, D. "The Inadequacy of Embedded Markup for Cultural Heritage Texts." <i>Literary and Linguistic Computing</i> 2010; 25.3:337-356.</li></ol></div>Anonymoushttp://www.blogger.com/profile/05582281819129069158noreply@blogger.com2tag:blogger.com,1999:blog-8725958.post-40079944429740322172011-01-06T23:14:00.006-05:002011-01-06T23:31:44.502-05:00I Will Never NOT EVER Type an Angle Bracket (or IWNNETAAB for short)<div>From time to time, I hear an argument that goes something like this: "Our users won't deal with angle brackets, therefore we can't use TEI, or if we do, it has to be hidden from them." It's an assumption I've encountered again quite recently. Since it's such a common trope, I wonder how true it is. Of course, I can't speak for anyone's user communities other than the ones I serve. And mine are perhaps not the usual run of scholars. But they haven't tended to throw their hands up in horror at the sight of angle brackets. Indeed, some of them have become quite expert at editing documents in TEI.</div><div><br /></div><div>The problems with TEI (and XML in general) are manifold, but its shortcomings often center on its not being expressive *enough* to easily deal with certain classes of problem. And the TEI evolves. 
You can get involved and change it for the better. </div><div><br /></div><div>The IWNNETAAB objection seems grounded in fear. But fear of what? As I mentioned at the start, IWNNETAAB isn't usually an expression of personal revulsion, it's not just Luddism, it's IWNNETAAB by proxy: my users/clients/stakeholders won't stand for it. Or they'll mess it up. TEI is hard. It has *hundreds* of elements. How can they/why should they learn something so complex just to be able to digitize texts?! What we want to do is simple, can't we have something simple that produces TEI in the end?</div><div><br /></div><div>The problem with simplified editing interfaces is easy to understand: they are simple. Complexities have been removed, and along with them, the ability to express complex things. To put it another way, if you aren't dealing with the tags, you're dealing with something in which a bunch of decisions have already been made for you. My argument in the recent discussion was that in fact, these decisions tend to be extremely project-specific. You can't set it up once and expect it to work again in different circumstances; you (or someone) will have to do it over and over again. So, for a single project, the cost/benefit equation may look like it leans toward the "simpler" option. But taken over many projects, you're looking either at learning a reasonably complex thing or building a long series of tools that each produce a different version of that thing. Seen in this light, I think learning TEI makes a lot of sense. On the learning TEI side, the costs go down over time; on the GUI interface side, they keep going up.</div><div><br /></div><div>Moreover, knowing TEI means that you (or your stakeholders) aren't shackled to an interface that imposes decisions that were made before you ever looked at the text you're encoding; instead, you are actually engaging with the text, in the form in which it will be used. You're seeing behind the curtain. 
I can't really fathom why that would be a bad thing.</div><div><br /></div><div>(Inspiration for the title comes from a book my 2-year-old is very fond of)</div><div><br /></div><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://en.wikipedia.org/wiki/Charlie_and_Lola"><img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;width: 220px; height: 244px;" src="http://upload.wikimedia.org/wikipedia/en/thumb/9/97/Charlieandlolatomato.PNG/220px-Charlieandlolatomato.PNG" border="0" alt="" /></a><div><br /></div><div><br /></div>Anonymoushttp://www.blogger.com/profile/05582281819129069158noreply@blogger.com0tag:blogger.com,1999:blog-8725958.post-84456971060137341052010-12-28T07:10:00.001-05:002010-12-28T07:12:22.083-05:00DH Tea Leaves<div>From reading my (possibly) representative sample of <a href="https://dh2011.stanford.edu/">DH</a> proposals, I'd say the main theme of the conference will not be "Big Tent Digital Humanities" but "data integration". Of the 8 proposals I read, more than half were concerned with problems of connecting data across projects, disciplines, and different systems. My proposal was too (making 9), so perhaps I did have a representative sample. </div><div><br /></div><div>Data integration is a meaty problem, resistant to generalized solutions. To my mind, the answers, such as they are, will rely on the same set of practices that good data curation techniques use: open formats and open source code, and good documentation that covers the "why" of decisions made for projects as well as the "how." Data integration is a process that involves gaining an understanding of the sources and the semantics of their structures before you can connect them together. So, while there are tools out there that can enable successful data integration, there are (as usual) no silver bullets. 
Grasping the meanings and assumptions embodied in each project's data structures has to be the first step, and this is only possible when those structures have been explained.</div>Anonymoushttp://www.blogger.com/profile/05582281819129069158noreply@blogger.com1tag:blogger.com,1999:blog-8725958.post-68954044282391786182010-12-05T16:29:00.002-05:002010-12-06T09:32:11.938-05:00That Bug Bit Me<div>"I had this problem and I fixed it" stories are boring to anyone except those intimately concerned with the problem, so I'm not going to tell that story. Instead, I'm going to talk about projects in the Digital Humanities that rely on 3rd party software, and about the value of expertise in programming and software architecture. From the outside, modern software development can look like building a castle out of Lego pieces: you take existing components and pop them together. Need search? Grab Apache Solr and plug it in. Need a data store? Grab a (No)SQL database and put your data in it. Need to do web development fast? Grab a framework, like Rails or Django. Doesn't sound that hard.</div><div><br /></div><div>This is, more or less, what <a href="http://papyri.info/">papyri.info</a> looks like internally. 
There's a <a href="http://lucene.apache.org/solr/">Solr</a> install that handles search, a <a href="http://mulgara.org/">Mulgara</a> triple store that keeps track of document relationships, small bits of code that handle populating the former two and displaying the web interface, and a <a href="http://jruby.org/">JRuby</a> on <a href="http://rubyonrails.org/">Rails</a> application that provides crowdsourced editing capabilities.</div><div><br /></div><div>Upgrading components in this architecture should range from trivially easy to moderately complex (the latter only if some interface has changed between versions, for example).</div><div><br /></div><div>So why did I find myself sitting in a hotel lobby in Rome a few weeks ago having to roll back to an older version of Mulgara so the editor application would work for a presentation the next day? A bunch of our queries had stopped working, meaning the editor couldn't load texts to edit. Oops.</div><div><br /></div><div>And why did I spend the last week fighting to keep the application standing up after a new release of the editor was deployed?</div><div><br /></div><div>The answer to both questions was that our Lego blocks didn't function the way they were supposed to. They aren't Lego blocks after all—they're complex pieces of software that may have bugs. The fact that our components are open source and have responsive developers behind them is a help, but we can't necessarily expect those developers to jump to solve our problems. After all, the project's tests must have passed in order for the new release to be pushed, and unless there's a chorus of complaint, our problem isn't necessarily going to be high on their list of things to fix.</div><div><br /></div><div>No, the whole point of using open source components is that you don't have to depend solely on other people to fix your problems. In the case of Mulgara, I was able to track down and fix the bug myself, with some pointers from the lead developer. 
The fix (or a better version of it) will go into the next release, and in the meantime we can use my patched version. In the case of the Rails issue, there seems to be a bug in the ActiveSupport file caching under JRuby that causes it to go nuts: the request never returns and something continually creates resources that have to be garbage collected. The symptom I was seeing was constant GC and a gradual ramping up of CPU usage to the point where the app became unstable. Tracing back from that symptom took a lot of work, but once I identified it, we were able to switch away from file store caching, and so far things look good.</div><div><br /></div><div>My takeaway from this is that even when you're constructing your application from prebuilt blocks, it really helps to have the expertise to dig into the architecture of the blocks themselves. Software components aren't Lego blocks, and although you'll want to use them (because you don't have the time or money to write your own search engine from scratch) you do need to be able to understand them in a pinch. It also really pays to work with open source components. I didn't have to spend weeks feeding bug reports to a vendor to help them fix our Mulgara problem. A handful of emails and about a day's worth of work (spread over the course of a week) were enough to get me to the source of the problem and a fix for it (a one-liner, incidentally).</div>Anonymoushttp://www.blogger.com/profile/05582281819129069158noreply@blogger.com0tag:blogger.com,1999:blog-8725958.post-68743703504628159532010-05-10T10:42:00.003-04:002010-05-10T12:20:00.234-04:00#alt-ac Careers: Digital Humanities Developer(part 1 of a series)<br /><br />I've been a digital humanities developer, that is, someone who writes code, does interface and system design, and rescues data from archaic formats in support of DH projects in a few contexts during the course of my career. 
I'll be writing a piece for Bethany Nowviskie's #alt-ac (<a href="http://nowviskie.org/2010/alt-ac/">http://nowviskie.org/2010/alt-ac/</a>) volume this year. This is (mostly) not that piece, though portions of it may appear there, so it's maybe part of a draft. This is an attempt to dredge up bits of my perspective as someone who has had an alternate-academic career (on and off) for the last decade. It's fairly narrowly aimed at people like me, who've done advanced graduate work in the Humanities, have an interest in Digital Humanities, and who have or are thinking about jobs that employ their technical skills instead of pursuing a traditional academic career.<br /><br />A couple of future installments I have in mind are "What Skills Does a DH Developer Need?" and "What's Up with Digital Classics?"<br /><br />In this installment, I'm going to talk about some of the environments I've worked in. I'm not going to pull punches, and this might get me into trouble, but I think that if there is to be a future for people like me, these things deserve an airing. If your institution is more enlightened than what I describe, please accept my congratulations and don't take offense.<br /><br /><h3>Working for Libraries</h3><br />In general, libraries are a really good place to work as a programmer, especially doing DH projects. I've spent the last three years working in digital library programming groups. There are some downsides to be aware of: Libraries are very hierarchical organizations, and if you are not a librarian then you are probably in a lower "caste". You will likely not get consistent (or perhaps any) support for professional development, conference attendance, etc. Librarians, as faculty, have professional development requirements as part of their jobs. 
You, whose professional development is not mandated by the organization (merely something you have to do if you want to stay current and advance your career), will not get the same level of support and probably won't get any credit for publishing articles, giving papers, etc. This is infuriating, and in my opinion self-defeating on the part of the institution, but it is an unfortunate fact. <br /><br />[Note: Bethany Nowviskie informs me that this is not the case at UVA, where librarians and staff are funded at the same level for professional development. I hope that signals a trend. And by the way, I do realize I'm being inflammatory, talking of castes. This should make you uncomfortable.]<br /><br />Another downside is that as a member of a lower caste, you may not be able to initiate projects on your own. At many institutions, only faculty (including librarians) and higher-level administrators can make grant proposals, so if you come up with a grant-worthy project idea, someone will have to front for you (and get the credit). <br /><br />There do exist librarian/developer jobs, and this would be a substantially better situation from a professional standpoint, but since librarian jobs typically require a Master's degree in Library and/or Information Science, libraries may make the calculation that they would be excluding perfectly good programmers from the job pool by putting that sort of requirement in. These are not terribly onerous programs on the whole, should you want to get an MLIS degree, but it does mean obtaining another credential. For what it's worth, I have one, but have never held a librarian position. <br /><br />It's not all bad though: you will typically have a lot of freedom, loose deadlines, shorter-than-average work-weeks, and the opportunity to apply your skills to really interesting and hard problems. If you want to continue to pursue your academic interests, however, you'll be doing it as a hobby. 
They don't want your research agenda unless you're a librarian. In a lot of ways, being a DH developer in a library is a DH developer's nirvana. I rant because it's <span style="font-weight:bold;">so</span> close to ideal. <br /><br /><h3>Working for a .edu IT Organization</h3><br />My first full time, permanent position post-Ph.D. was working for an IT organization that supports the College of Arts and Sciences at UNC Chapel Hill. I was one of a handful of programmers who did various kinds of administrative and faculty project support. It was a really good environment to work in. I got to try out new technologies, learned Java, really understood XSLT for the first time, got good at web development and had a lot of fun. I also learned to fear unfunded mandates, that projects without institutional support are doomed, and that if you're the last line of support for a web application, you'd better get good at making it scale.<br /><br />IT organizations typically pay a bit better than, say, libraries and since it's an IT organization they actually understand technology and what it takes to build systems. There's less sense of being the odd man out in the organization. That said, if you're <span style="font-weight:bold;">the</span> academic/DH applications developer it's really easy to get overextended, and I did a bad job of avoiding that fate, "learning by suffering" as Aeschylus said.<br /><br /><h3>Working in Industry</h3><br />Working outside academia as a developer is a whole other world. Again, DH work is likely to have to be a hobby, but depending on where you work, it may be a relevant hobby. You will be paid (much) more, will probably have a budget for professional development, and may be able to use it for things such as attending DH conferences. Downsides are that you'll probably work longer hours and you'll have less freedom to choose what you do and how you do it, because you're working for an organization that has to make money. 
The capitalist imperative may strike you as distasteful if you've spent years in academia, but in fact it can be a wonderful feedback mechanism. Doing things the right way (in general) makes the organization money, and doing them wrong (again, in general) doesn't. It can make decision-making wonderfully straightforward. Companies, particularly small ones, can make decisions with a speed that seems bewildering compared to libraries, which thrive on committees and meetings and change direction with all the flexibility of a supertanker.<br /><br />Another advantage of working in industry is that you are more likely to be part of a team working on the same stuff as you. In DH we tend to only be able to assign one or two developers to a job. You will likely be the lone wolf on a project at some point in your career. Companies have money, and they want to get stuff done, so they hire <span style="font-weight:bold;">teams</span> of developers. Being on a team like this is nice, and I often miss it.<br /><br />There are lots of companies that work in areas you may be interested in as someone with a DH background, including the semantic web, text mining, linked data, and digital publishing. In my opinion, working on DH projects is great preparation for a career outside academia.<br /><br /><h3>Funding</h3><br />As a DH developer, you will more likely than not end up working on grant-funded projects, where your salary is paid with "soft money". What this means in practical terms is that your funding will expire at a certain date. This can be good. It's not uncommon for programmers to change jobs every couple of years anyway, so a time-limited position gives you a free pass at job-switching without being accused of job-hopping. If you work for an organization that's good at attracting funding, then it's quite possible to string projects together and/or combine them. 
There can be institutional impedance-mismatch problems here, in that it might be hard to renew a time-limited position, to convert it to a permanent job without re-opening it for new applicants, or to fill in the gaps between funding cycles. So some institutions have a hard time mapping funding streams onto people efficiently. These aren't too hard to spot, because they go through "boom and bust" cycles, staffing up to meet demand and then losing everybody when the funding is gone. This doesn't mean "don't apply for this job," just do it with your eyes open. Don't go in with the expectation (or even much hope) that it will turn into a permanent position. Learn what you can and move on. The upside is that these are often great learning opportunities.<br /><br />In sum, being a DH developer is very rewarding. But I'm not sure it's a stable career path in most cases, which is a shame for DH as a "discipline" if nothing else. It would be nice if there were more senior positions for DH "makers" as well as "thinkers" (not that those categories are mutually exclusive). I suspect that the institutions that have figured this out will win the lion's share of DH funding in the future, because their brain trusts will just get better and better. The ideal situation (and what you should look for when you're looking to settle down) is a place<br /> <ul><br /> <li>that has a good track record of getting funded,</li><br /> <li>where developers are first-class members of the organization (i.e. have "researcher" or similar status),</li><br /> <li>where there's a team in place and it's not just you, and</li><br /> <li>where there's some evidence of long-range planning.</li><br /></ul><br />For the most part, though, DH development may be the kind of thing you do for a few years while you're young before you go and do something else. I often wonder whether my DH developer expiration date is approaching. 
Grant funding often won't pay enough for an experienced programmer, unless those who wrote the budget knew what they were doing [Note: I've read too many grant proposals where the developer salary is < $50K (entry-level) but they have the title "Lead Developer" vel sim.— for what it's worth, this positively screams "We don't know what we're doing!"]. It may soon be time to go back to working for lots more money in industry, or to try to get another administrative DH job. For now, I still have about a year's worth of grant funding left. Better get back to work.Anonymoushttp://www.blogger.com/profile/05582281819129069158noreply@blogger.com2tag:blogger.com,1999:blog-8725958.post-67093609172205609472010-05-04T20:54:00.004-04:002010-05-04T21:50:46.561-04:00Addenda et CorrigendaThe proceedings of the <a href="http://www.library.upenn.edu/exhibits/lectures/ljs_symposium.html">2009 Lawrence J. Schoenberg Symposium</a> on Manuscript Studies in the Digital Age, at which I was a panelist, were recently published at <a href="http://repository.upenn.edu/ljsproceedings/">http://repository.upenn.edu/ljsproceedings/</a>. I contributed a short <a href="http://repository.upenn.edu/ljsproceedings/vol2/iss1/7/">piece</a> arguing for the open licensing of content related to the study of medieval manuscripts. <br /><br />Peter Hirtle, Senior Policy Advisor at the Cornell University Library, wrote me a message commenting on the piece and raising a point that I had elided (reproduced with permission):<br /><blockquote><br />Dear Dr. Cayless:<br /> <br />I read with great interest your article on “Digitized Manuscripts and Open Licensing.” Your arguments in favor of a CC-BY license for medieval scholarship are unusual, important, and convincing.<br /> <br />I was troubled, however, to see your comments on reproductions of medieval manuscripts. 
For example, you note that if you use a CC-BY license, “an entrepreneur can print t-shirts using your digital photograph of a nice initial from a manuscript page.” Later you add:<br /> <br />"Should reproductions of cultural objects that have never been subject to copyright (and that would no longer be, even if they once had) themselves be subject to copyright? The fact is that they are, and some uses of the copyright on photographs may be laudable, for example a museum or library funding its ongoing maintenance costs by selling digital or physical images of objects in its collection, but the existence of such examples does not provide an answer to the question: as an individual copyright owner, do you wish to exert control over how other people use a photograph of something hundreds or thousands of years old?"<br /> <br />There is a fundamental mistaken concept here. While some reproductions of cultural objects are subject to copyright, most aren’t. Ever since the Bridgeman decision, the law in the New York circuit at least (and we believe in most other US courts) is that a “slavish” reproduction does not have enough originality to warrant its own copyright protection. If it is an image of a three-dimensional object, there would be copyright, but if it is just a reproduction of a manuscript page, there would be no copyright. It may take great skill to reproduce well a medieval manuscript, but it does not take originality. To claim that it does encourages what has been labeled as “copyfraud.”<br /> <br />You can read more about Bridgeman on pp. 34-35 in my book on Copyright and Cultural Institutions: Guidelines for Digitization for U.S. 
Libraries, Archives, and Museums, available for sale from Amazon and as a free download from SSRN at <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1495365">http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1495365</a>.<br /> <br />Sincerely,<br />Peter Hirtle<br /></blockquote><br /><br />Peter is quite right that the copyright situation in the US, at least insofar as faithful digital reproductions of manuscript pages are concerned, is (with high probability) governed by the <a href="http://en.wikipedia.org/wiki/Bridgeman_Art_Library_v._Corel_Corp.">1999 Bridgeman vs. Corel</a> decision. So it is arguably best practice for a scholar who has photographed manuscripts to publish them and assert that they are in the public domain.<br /><br />I skimmed over this in my article because, for one thing, much of the scholarly content of the symposium dealt with European institutions and their online publication (some behind a paywall—sometimes a crazily expensive one) of manuscripts, so US copyright law doesn't necessarily pertain. For another, I had in mind not just manuscripts but inscriptions, to which—to the extent that they are three-dimensional objects—Bridgeman doesn't pertain. Finally, while it is common practice to produce faithful digital reproductions of manuscript texts, it is also common to enhance those images for the sake of readability, at which point they are (may be?) no longer "slavish reproductions" and thus subject to copyright. The problem of course, as so often in copyright, is that there's no case law to back this up. If I crank up the contrast or run a Photoshop (or Gimp) filter on an image, have I altered it sufficiently to make it copyrightable? I don't know for certain, and I'm not sure anyone does. 
So on balance I'd still argue for doing the simple thing and putting a <a href="http://creativecommons.org/">Creative Commons</a> license (<a href="http://creativecommons.org/licenses/by/3.0/">CC-BY</a> is my recommendation) on everything. This is what the <a href="http://www.archimedespalimpsest.org/">Archimedes Palimpsest</a> project does, <a href="http://www.archimedespalimpsest.org/imagebank_intro.html">for example</a>; they are arguably in just this situation with their publication of multispectral imaging of the palimpsest. And they are to be commended for doing so.<br /><br />Anyway, I'd like to thank Peter, first for reading my article and second for prodding me into usefully complicating something I had oversimplified for the sake of argument.Anonymoushttp://www.blogger.com/profile/05582281819129069158noreply@blogger.com1tag:blogger.com,1999:blog-8725958.post-45157734032729076722010-03-02T10:05:00.007-05:002010-03-04T15:52:47.586-05:00Making a new Numbers Server for papyri.info[UPDATE: added relationships diagram]<br /><br />One of the components of <a href="http://papyri.info/">papyri.info</a> (a fairly well-hidden one) is a service that provides lookups of identifiers and correlates them with related records in other collections. Over the last few weeks, I've been working on replacing the old, Lucene-based numbers server with a new, triplestore-based one. 
One of the problems with the old version (though not the one that initially sent me on this quest, which was that I hated its identifiers) was that its structure didn't match the multidimensional nature of the data.<br /><h4>Dimensions:</h4><ul><li>Collections in the PN (there are four, so far) are hierarchical: for the Duke Databank of Documentary Papyri (DDbDP) and the Heidelberger Gesamtverzeichnis der griechischen Papyrusurkunden Ägyptens (HGV—which has two collections, of metadata and translations), there are series, volumes, and items, and for the Advanced Papyrological Information System (APIS) there are institutions and items.</li><li>FRBR: there's a Work (the ancient document itself), which has expression in a scholarly publication, from which the DDbDP transcription, HGV records and translations, and APIS records and translations are derived; these may be made manifest in a variety of ways, including EpiDoc XML, an HTML view, etc. The scholarly work has bibliography, which is surfaced in the HGV records. There is the possibility of attaching bibliography at the volume level as well (since these are actual books, sitting in libraries). Libraries may have series-level catalog records too.</li><li>There are relationships between items that describe the same thing. DDbDP and HGV usually have a 1::1 relationship (but not always). APIS has some overlap with both.</li><li>There are internal relationships as well. HGV has the idea of a "principal edition," the canonical publication of a document (there are also "andere publikationen"—other publications). DDbDP does as well, but expresses it slightly differently: older versions that have been superseded have stub records with a pointer to the replacement. 
The replacements point backward as well, and these can form sometimes complex chains (imagine two fragments published separately, but later recognized as belonging to the same document and republished together).</li></ul><div><br />Relationships:<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXcOJf7ckCDXAwzFtFJiyAVu08eFUWNu2HcwcHmhDHORefbQwvLt1YggOlVF23ijgR3L-AcchXtuD5BiNxZLMfhtyhFbG9Rr2TEmEyiuiL46F7YWCSmHg4MitWrdjywlhWMftL/s1600-h/relationships.jpg"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 243px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXcOJf7ckCDXAwzFtFJiyAVu08eFUWNu2HcwcHmhDHORefbQwvLt1YggOlVF23ijgR3L-AcchXtuD5BiNxZLMfhtyhFbG9Rr2TEmEyiuiL46F7YWCSmHg4MitWrdjywlhWMftL/s320/relationships.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5444812532004778258" /></a><br /></div> All of this is really hard to represent in a relational or document-oriented fashion. It turns out though, that a graph database does really well. I experimented with <a href="http://www.mulgara.org/">Mulgara</a> and found that it does the job perfectly. I can write SPARQL queries that retrieve the data I need from the point of view of any component. Then I can map these to nice URLs so that they are easy to retrieve, using a servlet that does some URL rewriting. 
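A component-centric lookup against a graph like this is just a pattern match. As a hedged illustration (the predicate here is generic Dublin Core dcterms:relation, not necessarily the vocabulary papyri.info actually uses), a query for everything related to a single HGV record might look like:

```sparql
# Illustrative sketch only -- dct:relation stands in for whatever
# predicates the actual papyri.info triplestore uses.
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?related
WHERE {
  <http://papyri.info/hgv/8875> dct:relation ?related .
}
```

Swapping the subject URI gives the same query from the point of view of a DDbDP or APIS record, which is exactly the flexibility that relational and document stores make awkward.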
Some examples:<br /><br />An HGV record:<br /><a href="http://papyri.info/hgv/8875/rdf">http://papyri.info/hgv/8875/rdf</a><br />An HGV translation:<br /><a href="http://papyri.info/hgvtrans/8875/rdf">http://papyri.info/hgvtrans/8875/rdf</a><br />An HGV record's principal edition and andere pub.:<br /><a href="http://papyri.info/hgv/249/frbr:Work/rdf">http://papyri.info/hgv/249/frbr:Work/rdf</a><br />An APIS record:<br /><a href="http://papyri.info/apis/berenike.apis.17/rdf">http://papyri.info/apis/berenike.apis.17/rdf</a><br />A (corresponding) DDb record:<br /><a href="http://papyri.info/ddbdp/o.berenike;1;17/rdf"> http://papyri.info/ddbdp/o.berenike;1;17/rdf</a><br />A DDb series listing (with "human-readable" citation):<br /><a href="http://papyri.info/ddbdp/chla/rdf"> http://papyri.info/ddbdp/chla/rdf</a><br />A DDb volume listing:<br /><a href="http://papyri.info/ddbdp/chla;1/rdf">http://papyri.info/ddbdp/chla;1/rdf</a><br />The DDb collection listing:<br /><a href="http://papyri.info/ddbdp/rdf">http://papyri.info/ddbdp/rdf</a><br />The HGV collection listing:<br /><a href="http://papyri.info/hgv/rdf">http://papyri.info/hgv/rdf</a><br /><br />Results are also available in Notation3 or JSON formats (available by substituting "n3" or "json" for "rdf" in the URLs above). All of this makes for a nice machine interface to the relationships between papyri.info data. One that can be generated purely from the data files themselves, plus an RDF file that contains the abbreviated series citations from which DDbDP derives its identifiers. The new Papyrological Editor that will allow scholars to propose emendations to existing documents and to add new ones will use it to determine what files to pull for editing. 
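The URL patterns above follow a simple scheme: collection, an optional collection-specific identifier, and an optional trailing format segment. A rough sketch of that dispatch logic (illustrative only; the function name and details are assumptions, not the production servlet code):

```python
# Sketch of the papyri.info URL scheme described in this post -- not the
# actual rewriting servlet. A trailing rdf/n3/json segment selects the
# serialization; without one, the HTML view is assumed.
FORMATS = {"rdf", "n3", "json"}

def parse_pn_path(path):
    """Split a papyri.info path into (collection, identifier, format)."""
    parts = [p for p in path.strip("/").split("/") if p]
    fmt = "html"  # default representation when no format segment is given
    if parts and parts[-1] in FORMATS:
        fmt = parts.pop()
    collection = parts[0]
    identifier = "/".join(parts[1:]) or None  # None means a collection listing
    return collection, identifier, fmt

print(parse_pn_path("/ddbdp/chla;1/rdf"))  # ('ddbdp', 'chla;1', 'rdf')
print(parse_pn_path("/hgv/rdf"))           # ('hgv', None, 'rdf')
```

Note that identifiers can themselves contain slashes (as in the frbr:Work example), so everything between the collection and the format segment is treated as the identifier.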
I also plan to drive the new Solr-based search indexing (which is necessarily document-oriented) using it, since it provides a clear view of which documents should be aggregated.<br /><br />The URL schemes above illustrate what I plan to do with the new version of papyri.info. Content URLs will be of the form <code>http://papyri.info/ <collection name> / <collection specific identifier> [/ <format>]</code>. Leaving off a format will give you a standard HTML view of the document + associated documents; <code>/source</code> will give you the EpiDoc source document by itself; <code>/atom</code> will give you an ATOM-based representation. I'm also thinking of <code>/rdfa</code> for an HTML-based view of the numbers server data, with embedded RDFa.<div><h4>What's Next</h4>I haven't done anything really sophisticated with this yet. I'd like to experiment with extending the DCTERMS vocabulary to deal with (e.g.) typed identifiers. Importing other vocabularies (like <a href="http://vocab.org/frbr/core.html">FRBR</a> or <a href="http://bibliontology.com/">BIBO</a>) may make sense as well. We're talking about hooking this up to bibliography (via records in <a href="http://www.zotero.org/groups/papyrology">Zotero</a>) and ancient places (via <a href="http://pleiades.stoa.org/">Pleiades</a>). It all works well with my design philosophy for papyri.info, which is that it should consist of data (in the form of EpiDoc source files and representations of those files), retrievable via sensible URLs, with modular services surrounding the data to make it discoverable and usable.<br /><br />I made a couple of changes to Mulgara during the course of this:<br /><ol><li>turned off its strange and repugnant habit of representing namespaces and other URIs by declaring them as entities in an internal DTD in returned RDF results. Please don't do this. It's 2010. For another thing it breaks if you have any URLEncoded characters (i.e. 
%something) in a URI, because your XML parser will think they are malformed parameter entities.</li><li>made the servlet return a 404 not found for queries with no hits (which seems more RESTfully correct)</li></ol>Anyway, I need to revisit the Mulgara changes I made and try to either get them committed to the Mulgara codebase, or refactor them so that I'm not actually messing with Mulgara's internals. I guess trying another triplestore is a third option. Mulgara is fast, easy to use, and it solved my problem, so I went with it. But there still might be better alternatives out there.</div>Anonymoushttp://www.blogger.com/profile/05582281819129069158noreply@blogger.com0tag:blogger.com,1999:blog-8725958.post-86549744679383081852009-12-31T09:38:00.002-05:002009-12-31T10:01:06.885-05:00#APA2010In between shortening my lifespan by doing a crazy yardwork project this week, I've been following with interest the tweets from <a href="http://twitter.com/#search?q=%23MLA09">#MLA09</a>. A couple of items of interest were that Digital Humanities has become an <a href="http://chronicle.com/blogPost/The-MLAthe-Digital/19468/">overnight success</a> (only decades in the making), the job market (still) <a href="http://www.briancroxall.net/2009/12/28/the-absent-presence-todays-faculty/">reeks</a>, and there are <a href="http://nowviskie.org/2009/monopolies-of-invention/">serious inequities</a> in the status of non-faculty collaborators in DH projects. None of this is new, of course, but it's good to see it so well stated in a highly-visible venue.<br /><br />I'm more than ever convinced that, despite the occasional feelings of regret, I made the right decision to stop seeking faculty employment after I got my Ph.D. DH was not then, and perhaps still isn't now, a hot topic in Classics. 
It is odd, because some of the most innovative DH work comes out of Classics, but, as I've said on a number of occasions, DH pickup in the field is concentrated in a few folks who are 20 years ahead of everyone else. It's interesting to speculate why this may be so. Classics is hard: you have to master a couple of ancient languages (Latin and Greek, at least), plus a couple of modern ones (French and German are the most likely suspects, but maybe Italian, Spanish, Modern Greek, etc. also, depending on your specialization), then a body of literature, history, and art before you can do serious work. Ph.D.s from other disciplines sometimes quail when I describe the comps we had to go through (two 3-hour translation exams, two 4-hour written exams, and an oral—and that's before you got to do your proposal defense). It may be that there's no room for anything else in this mix, and it's something you have to add later on. Virtually all the "digital classicists" I know are either tenured or are not faculty (and aren't going to be—at least not in Classics). It's all a bit grim, really. A decade ago, if you were a grad student in Classics with an interest in DH, you were doomed unless you were willing to suppress that interest until you had tenure. I don't know whether that's changed at all. I hope it has.<br /><br />The good news, of course, is that digital skills are highly portable (and better paid). The one on-campus interview I had (for which I wasn't offered the job) would have paid several thousand less (for a tenure-track job!) than the (academic!) programming job I ended up taking. And as fate would have it, I ended up doing <a href="http://papyri.info">digital classics</a> anyway, at least until the grant money runs out.<br /><br />So I wonder what the twitter traffic from <a href="http://www.apaclassics.org/AnnualMeeting/10mtg/10meeting.html">APA10</a> will be like next week. 
Maybe DH will be the next big thing there too, but a scan of the program doesn't leave me optimistic.Anonymoushttp://www.blogger.com/profile/05582281819129069158noreply@blogger.com2tag:blogger.com,1999:blog-8725958.post-50494917252007281882009-12-16T07:23:00.009-05:002009-12-17T13:18:55.786-05:00Converting APISOn Monday, I finished converting the APIS (Advanced Papyrological Information System) intake files to EpiDoc XML. I thought I'd write it up, since I tried some new things to do it. The APIS intake files employ a MARC-inspired text format that looks like:<br /><pre><br />cu001 | 1 | duke.apis.31254916<br />cu035 | 1 | (NcD)31254916<br />cu965 | 1 | APIS<br /><br />status | 1 | 1<br />cu300 | 1 | 1 item : papyrus, two joining fragments mounted in <br />glass, incomplete ; 19 x 8 cm<br />cuDateSchema | 1 | b<br />cuDateType | 1 | o<br />cuDateRange | 1 | b<br />cuDateValue | 1 | 199<br />cuDateRange | 2 | e<br />cuDateSchema | 2 | b<br />cuDateType | 2 | o<br />cuDateValue | 2 | 100<br />cuLCODE | 1 | egy<br />cu090 | 1 | P.Duk.inv. 723 R<br />cu500 | 1 | Actual dimensions of item are 18.5 x 7.7 cm<br />cu500 | 2 | 12 lines<br />cu500 | 3 | Written along the fibers on the recto; written <br />across the fibers on the verso in a different hand and <br />inverse to the text on the recto<br />cu500 | 4 | P.Duk.inv. 723 R was formerly P.Duk.inv. MF79 69 R<br />cu510_m | 5 | http://scriptorium.lib.duke.edu/papyrus/records/723r.html<br />cu520 | 6 | Papyrus account of wheat from the Arsinoites (modern <br />name: Fayyum), Egypt. Mentions the bank of Pakrouris(?)<br />cu546 | 7 | In Demotic<br />cu655 | 1 | Documentary papyri Egypt Fayyum 332-30 B.C<br />cu655 | 2 | Accounts Egypt Fayyum 332-30 B.C<br />cu655 | 3 | Papyri<br /><br />cu653 | 1 | Accounting -- Egypt -- Fayyum -- 332-30 B.C.<br />cu653 | 2 | Banks and banking -- Egypt -- Fayyum -- 332-30 B.C.<br />cu653 | 3 | Wheat -- Egypt -- Fayyum -- 332-30 B.C.<br />cu245ab | 1 | Account of wheat [2nd cent. 
B.C.]<br />cuPart_no | 1 | 1<br />cuPart_caption | 1 | Recto<br />cuPresentation_no | 1 | 1 | 1<br />cuPresentation_display_res | 1 | 1 | thumbnail<br />cuPresentation_url | 1 | 1 | http://scriptorium.lib.duke.edu/papyrus/images/thumbnails/723r-thumb.gif<br />cuPresentation_format | 1 | 1 | image/gif<br />cuPresentation_no | 1 | 2 | 2<br />cuPresentation_display_res | 1 | 2 | 72dpi<br />cuPresentation_url | 1 | 2 | http://scriptorium.lib.duke.edu/papyrus/images/72dpi/723r-at72.gif<br />cuPresentation_format | 1 | 2 | image/gif<br />cuPresentation_no | 1 | 3 | 3<br />cuPresentation_display_res | 1 | 3 | 150dpi<br />cuPresentation_url | 1 | 3 | http://scriptorium.lib.duke.edu/papyrus/images/150dpi/723r-at150.gif<br />cuPresentation_format | 1 | 3 | image/gif<br />perm_group | 1 | w<br /><br />cu090_orgcode | 1 | NcD<br />cuOrgcode | 1 | NcD<br /></pre><br /><br />Some of the element names come from, and have the semantics of MARC, while others don't. Fields are delimited with pipe characters '|' and are sometimes 3 columns, sometimes 4. The second column is meant to express order, e.g. cu500 (general note) 1, 2, 3, and 4. If there are 4 columns, the third is used to link related fields, e.g. an image with its label. The last column is the field data, which can wrap to multiple lines. This has to be converted to EpiDoc like:<br /><br /><pre><br /><?xml version="1.0" encoding="UTF-8"?><br /><TEI xmlns="http://www.tei-c.org/ns/1.0"><br /> <teiHeader><br /> <fileDesc><br /> <titleStmt><br /> <title>Account of wheat [2nd cent. B.C.]</title><br /> </titleStmt><br /> <publicationStmt><br /> <authority>APIS</authority><br /> <idno type="apisid">duke.apis.31254916</idno><br /> <idno type="controlno">(NcD)31254916</idno><br /> </publicationStmt><br /> <sourceDesc><br /> <msDesc><br /> <msIdentifier><br /> <idno type="invno">P.Duk.inv. 723 R</idno><br /> </msIdentifier><br /> <msContents><br /> <summary>Papyrus account of wheat from the Arsinoites (modern name: Fayyum), Egypt. 
<br /> Mentions the bank of Pakrouris(?)</summary><br /> <msItem><br /> <note type="general">Actual dimensions of item are 18.5 x 7.7 cm</note><br /> <note type="general">12 lines</note><br /> <note type="general">Written along the fibers on the recto; written across the fibers on <br /> the verso in a different hand and inverse to the text on the recto</note><br /> <note type="general">P.Duk.inv. 723 R was formerly P.Duk.inv. MF79 69 R</note><br /> <textLang mainLang="egy">In Demotic</textLang><br /> </msItem><br /> </msContents><br /> <physDesc><br /> <p>1 item : papyrus, two joining fragments mounted in glass, incomplete ; 19 x 8 cm</p><br /> </physDesc><br /> <history><br /> <origin><br /> <origDate notBefore="-0199" notAfter="-0100"/><br /> </origin><br /> </history><br /> </msDesc><br /> </sourceDesc><br /> </fileDesc><br /> <profileDesc><br /> <langUsage><br /> <language ident="en">English</language><br /> <language ident="egy-Egyd">In Demotic</language><br /> </langUsage><br /> <textClass><br /> <keywords scheme="#apis"><br /> <term>Accounting -- Egypt -- Fayyum -- 332-30 B.C.</term><br /> <term>Banks and banking -- Egypt -- Fayyum -- 332-30 B.C.</term><br /> <term>Wheat -- Egypt -- Fayyum -- 332-30 B.C.</term><br /> <term><br /> <rs type="genre_form">Documentary papyri Egypt Fayyum 332-30 B.C</rs><br /> </term><br /> <term><br /> <rs type="genre_form">Accounts Egypt Fayyum 332-30 B.C</rs><br /> </term><br /> <term><br /> <rs type="genre_form">Papyri</rs><br /> </term><br /> </keywords><br /> </textClass><br /> </profileDesc><br /> </teiHeader><br /> <text><br /> <body><br /> <div type="bibliography" subtype="citations"><br /> <p><br /> <ref target="http://scriptorium.lib.duke.edu/papyrus/records/723r.html">Original record</ref>.</p><br /> </div><br /> <div type="figure"><br /> <figure><br /> <head>Recto</head><br /> <figDesc> thumbnail</figDesc><br /> <graphic url="http://scriptorium.lib.duke.edu/papyrus/images/thumbnails/723r-thumb.gif"/><br /> 
</figure><br /> <figure><br /> <head>Recto</head><br /> <figDesc> 72dpi</figDesc><br /> <graphic url="http://scriptorium.lib.duke.edu/papyrus/images/72dpi/723r-at72.gif"/><br /> </figure><br /> <figure><br /> <head>Recto</head><br /> <figDesc> 150dpi</figDesc><br /> <graphic url="http://scriptorium.lib.duke.edu/papyrus/images/150dpi/723r-at150.gif"/><br /> </figure><br /> </div><br /> </body><br /> </text><br /></TEI><br /></pre> <br /><br />I started learning Clojure this summer. Clojure is a Lisp implementation on top of the Java Virtual Machine. So I thought I'd have a go at writing an APIS converter in it. The result is probably thoroughly un-idiomatic Clojure, but it converts the 30,000-plus APIS records to EpiDoc in about 2.5 minutes, so I'm fairly happy with it as a baby step. The script works by reading the intake file line by line and issuing SAX events that are handled by a Saxon XSLT TransformerHandler, which in turn converts to EpiDoc. So in effect, the intake file is treated as though it were an XML file and transformed with a stylesheet.<br /><br />Most of the processing is done with three functions:<br /><br /><span style="font-weight:bold;">generate-xml</span> takes a File, instantiates a transforming SAX handler using a Templates object drawn from a pool, starts calling SAX events, and then hands off to the process-file function.<br /><br /><pre><br />(defn generate-xml<br /> [file-var]<br /> (let [xslt (.poll @templates)<br /> handler (.newTransformerHandler (TransformerFactoryImpl.) xslt)]<br /> (try<br /> (doto handler<br /> (.setResult (StreamResult. (File. 
(.replace <br /> (.replace (str file-var) "intake_files" "xml") ".if" ".xml"))))<br /> (.startDocument)<br /> (.startElement "" "apis" "apis" (AttributesImpl.)))<br /> (process-file (read-file file-var) "" handler)<br /> (doto handler<br /> (.endElement "" "apis" "apis")<br /> (.endDocument))<br /> (catch Exception e <br /> (.println *err* (str (.getMessage e) " processing file " file-var))))<br /> (.add @templates xslt)))<br /></pre><br /><br /><span style="font-weight:bold;">process-file</span> recursively processes a sequence of lines from the file. If lines is empty, we're at the end of the file, and we can end the last element and exit; otherwise, it splits the current line on pipe characters, calls handle-line, then calls itself on the remainder of the line sequence.<br /><br /><pre><br />(defn process-file<br /> [lines, elt-name, handler]<br /> (if (empty? lines)<br /> (.endElement handler "" elt-name elt-name)<br /> (if (not (.startsWith (first lines) "#")) ; comments start with '#' and can be ignored<br /> (let [line (.split (first lines) "\\s+\\|\\s+")<br /> ename (if (.contains (first lines) "|") (aget line 0) elt-name)]<br /> (handle-line line elt-name handler)<br /> (process-file (rest lines) ename handler)))))<br /></pre> <br /><br /><span style="font-weight:bold;">handle-line</span> does most of the XML-producing work. The field name is emitted as an element, columns 2 (and 3 if it's a 4-column field) are emitted as @n and @m attributes, and the last column is emitted as character content. 
If the line is a continuation of the preceding line, then it will be emitted as character data.<br /><br /><pre><br />(defn handle-line<br /> [line, elt-name, handler]<br /> (if (> (alength line) 2) ; lines with fewer than 3 columns are either continuations or empty fields<br /> (do (let [atts (AttributesImpl.)]<br /> (doto atts<br /> (.addAttribute "" "n" "n" "CDATA" (.trim (aget line 1))))<br /> (if (> (alength line) 3)<br /> (doto atts<br /> (.addAttribute "" "m" "m" "CDATA" (.trim (aget line 2)))))<br /> (if (false? (.equals elt-name ""))<br /> (.endElement handler "" elt-name elt-name))<br /> (.startElement handler "" (aget line 0) (aget line 0) atts))<br /> (let [content (aget line (- (alength line) 1))]<br /> (.characters handler (.toCharArray (.trim content)) 0 (.length (.trim content)))))<br /> (do <br /> (if (== (alength line) 1)<br /> (.characters handler (.toCharArray (aget line 0)) 0 (.length (aget line 0)))))))<br /></pre><br /><br />The <span style="font-weight:bold;">-main</span> function starts everything by calling <span style="font-weight:bold;">init-templates</span> to load up a ConcurrentLinkedQueue with compiled Templates objects capable of generating an XSLT handler, and then kicks off a thread pool, mapping the <span style="font-weight:bold;">generate-xml</span> function over a sequence of files with the ".if" suffix. -main takes 3 arguments: the directory to look for intake files in, the XSLT to use for transformation, and the number of worker threads to use. I've been kicking it off with 20 threads. Speed depends on how much work my machine (3 GHz Intel Core 2 Duo MacBook Pro) is doing at the moment, but is quite zippy.<br /><br /><pre><br />(defn init-templates<br /> [xslt, nthreads]<br /> (dosync (ref-set templates (ConcurrentLinkedQueue.) ))<br /> (dotimes [n nthreads]<br /> (let [xsl-src (StreamSource. (FileInputStream. 
xslt))<br /> configuration (Configuration.)<br /> compiler-info (CompilerInfo.)]<br /> (doto xsl-src <br /> (.setSystemId xslt))<br /> (doto compiler-info<br /> (.setErrorListener (StandardErrorListener.))<br /> (.setURIResolver (StandardURIResolver. configuration)))<br /> (dosync (.add @templates (.newTemplates (TransformerFactoryImpl.) xsl-src compiler-info))))))<br /> <br />(defn -main<br /> [dir-name, xsl, nthreads]<br /> (def xslt xsl)<br /> (def dirs (file-seq (File. dir-name)))<br /> (init-templates xslt nthreads)<br /> (let [pool (Executors/newFixedThreadPool nthreads)<br /> tasks (map (fn [x]<br /> (fn []<br /> (generate-xml x)))<br /> (filter #(.endsWith (.getName %) ".if") dirs))]<br /> (doseq [future (.invokeAll pool tasks)]<br /> (.get future))<br /> (.shutdown pool)))<br /></pre><br /><br />I had some conceptual difficulties figuring out how best to associate Templates with the threads that execute them. The easy thing to do would be to put the Template creation in the function that is mapped to the file sequence, but that bogs down fairly quickly, presumably because a new Template is created for each file and memory usage balloons. So that doesn't work. In Java, I'd either a) write a custom thread that spun up its own Template or b) create a pool of Templates. After some messing around, I went with b) because I couldn't see how to do such an object-oriented thing in a functional way. b) was a bit hard too, because I couldn't see how to store Templates in a Clojure collection, access them, and use them without wrapping the whole process in a transaction, which seems like it would lock the collection far more than necessary. So I used a threadsafe Java collection, ConcurrentLinkedQueue, which manages concurrent access to its members on its own. <br /><br />I've no doubt there are better ways to do this, and I expect I'll learn them in time, but for now, I'm quite pleased with my first effort. 
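The pool amounts to a simple borrow/return pattern. Here is a minimal standalone sketch of that pattern (my illustration, not code from the converter itself; keywords stand in for compiled Templates objects):

```clojure
(import 'java.util.concurrent.ConcurrentLinkedQueue)

;; A tiny stand-in pool: keywords play the role of compiled Templates.
(def pool (ConcurrentLinkedQueue. [:template-a :template-b]))

(defn with-pooled
  "Borrow the head of the pool, apply f to it, and always return it to
  the tail. Note that .poll returns nil if the pool is empty, so the
  pool must hold at least as many objects as there are worker threads."
  [f]
  (let [obj (.poll pool)]
    (try
      (f obj)
      (finally (.add pool obj)))))
```

Because ConcurrentLinkedQueue handles concurrent access internally, no dosync transaction is needed around the poll/add pair, which is the whole point of preferring it over wrapping a Clojure collection in a transaction.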
Next step will probably be to add some Schematron validation for the APIS files. My impression of Clojure is that it's really powerful, and a good way to write concurrent programs. To do it really well, I think you'd need a fairly deep knowledge of both Lisp-style functional programming and the underlying Java/JVM aspects, but that seems doable.Anonymoushttp://www.blogger.com/profile/05582281819129069158noreply@blogger.com0tag:blogger.com,1999:blog-8725958.post-51400879417501173582009-10-27T10:19:00.006-04:002009-10-27T12:32:59.883-04:00Object Artefact ScriptA couple of weeks ago, I attended a workshop at the Edinburgh eScience Institute on the relation of text in ancient (and other) documents to its context, on the problems of reading difficult texts on difficult objects, and on ways in which technology can aid the process of interpretation and dissemination without getting in the way of it. The meeting was well summarized by Alejandro Giacometti in his <a href="http://giacometti.tumblr.com/post/213059488/object-artefact-script">blog</a>, and the presentations are posted on the <a href="http://www.nesc.ac.uk/action/esi/contribution.cfm?Title=1014">eSI wiki</a>. <br /><br />Kathryn Piquette discussed what would be required to digitally represent Egyptian hieroglyphic texts without divorcing them from their contexts as an integral part of monumental architecture. For example, the interpretation of the meaning of texts should be able to take into account the times of day (and/or year) when they could have been read, their relationship to their surroundings, and so on. The established epigraphical practice of divorcing the transcribed text from its context, while often necessary, does some violence to its meaning, and this must be recognized and accounted for. At the same time, digital 3D reconstructions are themselves an interpretation, and it is important to disclose the evidence on which that interpretation is based. 
<br /><br />Ségolène Tarte talked about the process of scholarly interpretation in reading the <a href="http://vindolanda.csad.ox.ac.uk/">Vindolanda tablets</a> and similar texts. As part of analysing the scholarly reading process, the <a href="http://esad.classics.ox.ac.uk/">eSAD project</a> observed two experts reading a previously-published tablet. During the course of their work, they came up with a new reading that completely changed their understanding of the text. The previous reading hinged on the identification of a single word, which led to the (mistaken) recognition of the document as recording the sale of an ox. The new reading hinged on the recognition of a particular letterform as an 'a'. The ways in which readings of difficult texts are produced—involving skipping around looking for recognizable pieces of text upon which (multiple) partial mental models of the texts are constructed, which must then be resolved somehow into a reading—means that an Interpretation Support System (such as the one eSAD proposes to develop) must be sensitive to the different ways of reading scholars use and must be careful not to impose "spurious exactitude" on them.<br /><br />Dot Porter gave an overview of a variety of projects that focus on representing text, transcription, and annotation alongside one another as a way into discussing the relationship between digital text and physical text. She cautioned against attempts to digitally replicate the experience of the codex, since there is a great deal of (necessary) data interpolation that goes on in any detailed digital reconstruction, and this elides the physical reality of the text. Digital representations may improve (or even make possible) the reading of difficult texts, such as the Vindolanda tablets or the <a href="http://www.archimedespalimpsest.org/">Archimedes Palimpsest</a>, so for purposes of interpretation, they may be superior to the physical reality. 
They can combine data, metadata, and other contextual information in ways that help a reader to work with documents. But they cannot satisfactorily replicate the physicality of the document, and it may be a bit dishonest to try.<br /><br />I talked about the <a href="http://github.com/hcayless/img2xml">img2xml</a> project I'm working on with colleagues from UNC Chapel Hill. I've got a post or two about that in the pipeline, so I won't say much here. It involves the generation of SVG tracings of text in manuscript documents as a foundation for linking and annotation. Since the technique involves linking to an XML-based representation of the text, it may prove superior to methods that rely simply on pointing at pixel coordinates in images of text.<br /><br />Ryan Bauman talked about the use of digital images as scholarly evidence. He gave a fascinating overview of sophisticated techniques for imaging very difficult documents (e.g. carbonized, rolled up scrolls from Herculaneum) and talked about the need for documentation of the techniques used in generating the images. This is especially important because the images produced will not resemble the way the document looks in visible light. Ryan also talked about the difficulties involved in linking views of the document that may have been produced at different times, when the document was in different states, or may have used different techniques. The Archimedes Palimpsest project is a good example of what's involved in referencing all of the images so that they can be linked to the transcription.<br /><br />Finally, Leif Isaksen talked about how some of the techniques discussed in the earlier presentations might be used in crowdsourcing the gathering of data about inscriptions. Inscriptions (both published and unpublished) are frequently encountered (both in museums and out in the open) by tourists who may be curious about their meaning, but lack the ability to interpret them. 
They may well, however, have sophisticated tools available for image capture, geo-referencing, and internet access (via digital cameras, smartphones, etc.). Can they be employed, in exchange for information about the texts they encounter, as data gatherers? <br /><br />Some themes that emerged from the discussion included: <br /><ul><br /><li>the importance of communicating the processes involved in generating digital representations of texts and their contexts (i.e. showing your work)</li> <br /><li>the need for standard ways of linking together image and textual data</li><br /><li>the importance of disseminating data and code, not just results</li><br /></ul><br />This was a terrific workshop, and I hope to see follow-up on it. eSAD is holding a workshop next month on "Understanding image-based evidence," which I'm sorry I can't attend but from which I look forward to seeing the output.Anonymoushttp://www.blogger.com/profile/05582281819129069158noreply@blogger.com0tag:blogger.com,1999:blog-8725958.post-13827278866317727462009-10-16T15:31:00.002-04:002009-10-16T16:01:39.954-04:00Stomping on Innovation Killers<a href="http://twitter.com/foundhistory">@foundhistory</a> has a nice <a href="http://www.foundhistory.org/2009/10/16/3-innovation-killers-in-digital-humanities/">post</a> on objections one might hear on a grant review panel that would unjustly torpedo an innovative proposal. I thought it might be a good idea to take a sideways look at these as advice to grant writers.<br /><br /><span style="font-size:36pt">“</span><br /><ul><br /><li>Haven’t X, Y, and Z already done this? We shouldn’t be supporting duplication of effort.</li><br /><li>Are all of the stakeholders on board? 
(Hat tip to @patrickgmj for this gem.)</li><br /><li>What about sustainability?</li><br /></ul><br /><span style="font-size:36pt">”</span><br /><br />So, some ideas for countering these when you're working on your proposal: <br /><br /><ol><br /><li>Have you looked at work that's been done in this area (this might entail some real digging)? If there are projects and/or literature that deal with the same areas as your proposal, then you should take them into account. You need to be able to show you've done your homework and that your project is different from what's come before.</li><br /><li>Who is your audience? Have you talked to them? If you can get letters of support from one or more of them, that will help silence the stakeholder objection.</li><br /><li>You ought to have some sort of story about sustainability and/or the future beyond the project, to show that you've thought about what comes next. Even if your project is an experiment, you should talk about how you're going to disseminate the results so that those who come after will be able to build on your work.</li><br /></ol><br /><br />I agree with Tom that these criticisms can be deployed to stifle creative work. In technology, sometimes wheels need to be reinvented, sometimes the conventional wisdom is flat wrong, and sometimes worrying overmuch about the future paralyses you. But if you're writing a proposal, assume these objections will be thrown at it, and do some prior thinking so you can spike them before they kill your innovative idea.Anonymoushttp://www.blogger.com/profile/05582281819129069158noreply@blogger.com0