In between shortening my lifespan by doing a crazy yardwork project this week, I've been following with interest the tweets from #MLA09. A few items of note were that Digital Humanities has become an overnight success (only decades in the making), the job market (still) reeks, and there are serious inequities in the status of non-faculty collaborators in DH projects. None of this is new, of course, but it's good to see it so well stated in a highly visible venue.
I'm more than ever convinced that, despite the occasional feelings of regret, I made the right decision to stop seeking faculty employment after I got my Ph.D. DH was not then, and perhaps still isn't now, a hot topic in Classics. It is odd, because some of the most innovative DH work comes out of Classics, but, as I've said on a number of occasions, DH pickup in the field is concentrated in a few folks who are 20 years ahead of everyone else. It's interesting to speculate why this may be so. Classics is hard: you have to master at least a couple of ancient languages (Latin and Greek), plus a couple of modern ones (French and German are the most likely suspects, but maybe Italian, Spanish, Modern Greek, etc. also, depending on your specialization), then a body of literature, history, and art before you can do serious work. Ph.D.s from other disciplines sometimes quail when I describe the comps we had to go through (two 3-hour translation exams, two 4-hour written exams, and an oral, and that's before you got to do your proposal defense). It may be that there's no room for anything else in this mix, and it's something you have to add later on. Virtually all the "digital classicists" I know are either tenured or are not faculty (and aren't going to be, at least not in Classics). It's all a bit grim really. A decade ago, if you were a grad student in Classics with an interest in DH, you were doomed unless you were willing to suppress that interest until you had tenure. I don't know whether that's changed at all. I hope it has.
The good news, of course, is that digital skills are highly portable (and better-paid). The one on-campus interview I had (for which I wasn't offered the job) was for a tenure-track position that would have paid several thousand less than the (academic!) programming job I ended up taking. And as fate would have it, I ended up doing digital classics anyway, at least until the grant money runs out.
So I wonder what the Twitter traffic from APA10 will be like next week. Maybe DH will be the next big thing there too, but a scan of the program doesn't leave me optimistic.
Thoughts on software development, Digital Humanities, the ancient world, and whatever else crosses my radar. All original content herein is licensed under a Creative Commons Attribution license.
Thursday, December 31, 2009
Wednesday, December 16, 2009
Converting APIS
On Monday, I finished converting the APIS (Advanced Papyrological Information System) intake files to EpiDoc XML. I thought I'd write it up, since I tried some new things to do it. The APIS intake files employ a MARC-inspired text format that looks like:
cu001 | 1 | duke.apis.31254916
cu035 | 1 | (NcD)31254916
cu965 | 1 | APIS
status | 1 | 1
cu300 | 1 | 1 item : papyrus, two joining fragments mounted in
glass, incomplete ; 19 x 8 cm
cuDateSchema | 1 | b
cuDateType | 1 | o
cuDateRange | 1 | b
cuDateValue | 1 | 199
cuDateRange | 2 | e
cuDateSchema | 2 | b
cuDateType | 2 | o
cuDateValue | 2 | 100
cuLCODE | 1 | egy
cu090 | 1 | P.Duk.inv. 723 R
cu500 | 1 | Actual dimensions of item are 18.5 x 7.7 cm
cu500 | 2 | 12 lines
cu500 | 3 | Written along the fibers on the recto; written
across the fibers on the verso in a different hand and
inverse to the text on the recto
cu500 | 4 | P.Duk.inv. 723 R was formerly P.Duk.inv. MF79 69 R
cu510_m | 5 | http://scriptorium.lib.duke.edu/papyrus/records/723r.html
cu520 | 6 | Papyrus account of wheat from the Arsinoites (modern
name: Fayyum), Egypt. Mentions the bank of Pakrouris(?)
cu546 | 7 | In Demotic
cu655 | 1 | Documentary papyri Egypt Fayyum 332-30 B.C
cu655 | 2 | Accounts Egypt Fayyum 332-30 B.C
cu655 | 3 | Papyri
cu653 | 1 | Accounting -- Egypt -- Fayyum -- 332-30 B.C.
cu653 | 2 | Banks and banking -- Egypt -- Fayyum -- 332-30 B.C.
cu653 | 3 | Wheat -- Egypt -- Fayyum -- 332-30 B.C.
cu245ab | 1 | Account of wheat [2nd cent. B.C.]
cuPart_no | 1 | 1
cuPart_caption | 1 | Recto
cuPresentation_no | 1 | 1 | 1
cuPresentation_display_res | 1 | 1 | thumbnail
cuPresentation_url | 1 | 1 | http://scriptorium.lib.duke.edu/papyrus/images/thumbnails/723r-thumb.gif
cuPresentation_format | 1 | 1 | image/gif
cuPresentation_no | 1 | 2 | 2
cuPresentation_display_res | 1 | 2 | 72dpi
cuPresentation_url | 1 | 2 | http://scriptorium.lib.duke.edu/papyrus/images/72dpi/723r-at72.gif
cuPresentation_format | 1 | 2 | image/gif
cuPresentation_no | 1 | 3 | 3
cuPresentation_display_res | 1 | 3 | 150dpi
cuPresentation_url | 1 | 3 | http://scriptorium.lib.duke.edu/papyrus/images/150dpi/723r-at150.gif
cuPresentation_format | 1 | 3 | image/gif
perm_group | 1 | w
cu090_orgcode | 1 | NcD
cuOrgcode | 1 | NcD
Some of the element names come from, and have the semantics of, MARC, while others don't. Fields are delimited with pipe characters '|' and have sometimes 3 columns, sometimes 4. The first column is the field name. The second column is meant to express order, e.g. cu500 (general note) 1, 2, 3, and 4. If there are 4 columns, the third is used to link related fields, e.g. an image with its label. The last column is the field data, which can wrap to multiple lines. This has to be converted to EpiDoc like:
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Account of wheat [2nd cent. B.C.]</title>
</titleStmt>
<publicationStmt>
<authority>APIS</authority>
<idno type="apisid">duke.apis.31254916</idno>
<idno type="controlno">(NcD)31254916</idno>
</publicationStmt>
<sourceDesc>
<msDesc>
<msIdentifier>
<idno type="invno">P.Duk.inv. 723 R</idno>
</msIdentifier>
<msContents>
<summary>Papyrus account of wheat from the Arsinoites (modern name: Fayyum), Egypt.
Mentions the bank of Pakrouris(?)</summary>
<msItem>
<note type="general">Actual dimensions of item are 18.5 x 7.7 cm</note>
<note type="general">12 lines</note>
<note type="general">Written along the fibers on the recto; written across the fibers on
the verso in a different hand and inverse to the text on the recto</note>
<note type="general">P.Duk.inv. 723 R was formerly P.Duk.inv. MF79 69 R</note>
<textLang mainLang="egy">In Demotic</textLang>
</msItem>
</msContents>
<physDesc>
<p>1 item : papyrus, two joining fragments mounted in glass, incomplete ; 19 x 8 cm</p>
</physDesc>
<history>
<origin>
<origDate notBefore="-0199" notAfter="-0100"/>
</origin>
</history>
</msDesc>
</sourceDesc>
</fileDesc>
<profileDesc>
<langUsage>
<language ident="en">English</language>
<language ident="egy-Egyd">In Demotic</language>
</langUsage>
<textClass>
<keywords scheme="#apis">
<term>Accounting -- Egypt -- Fayyum -- 332-30 B.C.</term>
<term>Banks and banking -- Egypt -- Fayyum -- 332-30 B.C.</term>
<term>Wheat -- Egypt -- Fayyum -- 332-30 B.C.</term>
<term>
<rs type="genre_form">Documentary papyri Egypt Fayyum 332-30 B.C</rs>
</term>
<term>
<rs type="genre_form">Accounts Egypt Fayyum 332-30 B.C</rs>
</term>
<term>
<rs type="genre_form">Papyri</rs>
</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<text>
<body>
<div type="bibliography" subtype="citations">
<p>
<ref target="http://scriptorium.lib.duke.edu/papyrus/records/723r.html">Original record</ref>.</p>
</div>
<div type="figure">
<figure>
<head>Recto</head>
<figDesc> thumbnail</figDesc>
<graphic url="http://scriptorium.lib.duke.edu/papyrus/images/thumbnails/723r-thumb.gif"/>
</figure>
<figure>
<head>Recto</head>
<figDesc> 72dpi</figDesc>
<graphic url="http://scriptorium.lib.duke.edu/papyrus/images/72dpi/723r-at72.gif"/>
</figure>
<figure>
<head>Recto</head>
<figDesc> 150dpi</figDesc>
<graphic url="http://scriptorium.lib.duke.edu/papyrus/images/150dpi/723r-at150.gif"/>
</figure>
</div>
</body>
</text>
</TEI>
I started learning Clojure this summer. Clojure is a Lisp implementation on top of the Java Virtual Machine, so I thought I'd have a go at writing an APIS converter in it. The result is probably thoroughly un-idiomatic Clojure, but it converts the 30,000-plus APIS records to EpiDoc in about 2.5 minutes, so I'm fairly happy with it as a baby step. The script works by reading the intake file line by line and issuing SAX events that are handled by a Saxon XSLT TransformerHandler, which in turn converts to EpiDoc. So in effect, the intake file is treated as though it were an XML file and transformed with a stylesheet.
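To make the "treat it as XML" idea concrete, here is a minimal sketch of pushing hand-made SAX events into a TransformerHandler. It is not the converter itself: it assumes whatever JAXP implementation `TransformerFactory/newInstance` returns (the JDK default rather than Saxon), uses an identity transform instead of the real EpiDoc stylesheet, and the function name `sax-events-demo` is made up for illustration.

```clojure
(import '(javax.xml.transform OutputKeys TransformerFactory)
        '(javax.xml.transform.sax SAXTransformerFactory TransformerHandler)
        '(javax.xml.transform.stream StreamResult)
        '(org.xml.sax.helpers AttributesImpl)
        '(java.io StringWriter))

(defn sax-events-demo
  "Push SAX events for a single field into an identity TransformerHandler
  and return the serialized XML as a string."
  []
  (let [;; assumption: the default factory supports SAX (the JDK's built-in one does)
        ^SAXTransformerFactory factory (TransformerFactory/newInstance)
        ;; no stylesheet argument means an identity transform
        ^TransformerHandler handler (.newTransformerHandler factory)
        out (StringWriter.)
        atts (doto (AttributesImpl.)
               (.addAttribute "" "n" "n" "CDATA" "2"))
        text "12 lines"]
    (.setOutputProperty (.getTransformer handler)
                        OutputKeys/OMIT_XML_DECLARATION "yes")
    (doto handler
      (.setResult (StreamResult. out))
      (.startDocument)
      (.startElement "" "apis" "apis" (AttributesImpl.))   ; root wrapper
      (.startElement "" "cu500" "cu500" atts)              ; field name becomes the element
      (.characters (.toCharArray text) 0 (.length text))   ; last column becomes the content
      (.endElement "" "cu500" "cu500")
      (.endElement "" "apis" "apis")
      (.endDocument))
    (str out)))
```

Calling `(sax-events-demo)` should return a small `apis` document containing one `cu500` element. Swapping the identity handler for one built from a compiled stylesheet is what turns this event stream into a transformation.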
Most of the processing is done with three functions:
generate-xml takes a File, borrows a compiled stylesheet from a pool of Templates objects, instantiates a transforming SAX handler from it, starts calling SAX events, and then hands off to the process-file function. When the file is done, the Templates object goes back into the pool.
(defn generate-xml
  [file-var]
  (let [xslt (.poll @templates)
        handler (.newTransformerHandler (TransformerFactoryImpl.) xslt)]
    (try
      (doto handler
        (.setResult (StreamResult. (File. (.replace
          (.replace (str file-var) "intake_files" "xml") ".if" ".xml"))))
        (.startDocument)
        (.startElement "" "apis" "apis" (AttributesImpl.)))
      (process-file (read-file file-var) "" handler)
      (doto handler
        (.endElement "" "apis" "apis")
        (.endDocument))
      (catch Exception e
        (.println *err* (str (.getMessage e) " processing file " file-var))))
    (.add @templates xslt)))
process-file recursively processes a sequence of lines from the file. If lines is empty, we're at the end of the file, and we can end the last element and exit. Otherwise, it splits the current line on pipe characters, calls handle-line, then calls itself on the remainder of the line sequence.
(defn process-file
  [lines, elt-name, handler]
  (if (empty? lines)
    (.endElement handler "" elt-name elt-name)
    (if (not (.startsWith (first lines) "#")) ; comments start with '#' and can be ignored
      (let [line (.split (first lines) "\\s+\\|\\s+")
            ename (if (.contains (first lines) "|") (aget line 0) elt-name)]
        (handle-line line elt-name handler)
        (recur (rest lines) ename handler)) ; recur keeps the stack flat on long files
      ; skip a comment line but keep going; without this branch,
      ; processing would stop at the first comment
      (recur (rest lines) elt-name handler))))
handle-line does most of the XML-producing work. The field name is emitted as an element, columns 2 (and 3 if it's a 4-column field) are emitted as @n and @m attributes, and the last column is emitted as character content. If the line is a continuation of the preceding line, then it will be emitted as character data.
(defn handle-line
  [line, elt-name, handler]
  (if (> (alength line) 2) ; lines with fewer than 3 columns are either continuations or empty fields
    (do (let [atts (AttributesImpl.)]
          (doto atts
            (.addAttribute "" "n" "n" "CDATA" (.trim (aget line 1))))
          (if (> (alength line) 3)
            (doto atts
              (.addAttribute "" "m" "m" "CDATA" (.trim (aget line 2)))))
          (if (false? (.equals elt-name "")) ; close the previous field, if there was one
            (.endElement handler "" elt-name elt-name))
          (.startElement handler "" (aget line 0) (aget line 0) atts))
        (let [content (aget line (- (alength line) 1))]
          (.characters handler (.toCharArray (.trim content)) 0 (.length (.trim content)))))
    (do
      (if (== (alength line) 1) ; continuation line: append to the open element
        (.characters handler (.toCharArray (aget line 0)) 0 (.length (aget line 0)))))))
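As a rough illustration of how the columns map onto element name, @n, @m, and content, the sketch below splits a line with the same regular expression and labels the pieces in a map. The function `parse-field` and its keys are hypothetical and not part of the converter, which emits SAX events directly instead of building maps.

```clojure
(require '[clojure.string :as str])

(defn parse-field
  "Split one intake line on pipes (same regex as process-file) and label
  the columns. Returns nil for lines with fewer than 3 columns, which are
  continuations or empty fields."
  [^String line]
  (let [cols (vec (.split line "\\s+\\|\\s+"))]
    (when (> (count cols) 2)
      (cond-> {:name    (first cols)            ; becomes the element name
               :n       (str/trim (nth cols 1)) ; becomes @n
               :content (str/trim (peek cols))} ; last column is the field data
        ;; 4-column fields carry a linking value that becomes @m
        (> (count cols) 3) (assoc :m (str/trim (nth cols 2)))))))

(parse-field "cu500 | 2 | 12 lines")
;; => {:name "cu500", :n "2", :content "12 lines"}

(parse-field "cuPresentation_format | 1 | 2 | image/gif")
;; => {:name "cuPresentation_format", :n "1", :m "2", :content "image/gif"}

(parse-field "glass, incomplete ; 19 x 8 cm")
;; => nil (a continuation line)
```

Continuation lines come back as nil here. In the converter they are appended as character data to whichever element is currently open.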
The -main function kicks everything off by calling init-templates to load up a ConcurrentLinkedQueue with new Templates objects, each capable of generating an XSLT handler. It then starts a thread pool and maps the generate-xml function to a sequence of files with the ".if" suffix. -main takes 3 arguments: the directory to look for intake files in, the XSLT to use for transformation, and the number of worker threads to use. I've been kicking it off with 20 threads. Speed depends on how much work my machine (3 GHz Intel Core 2 Duo MacBook Pro) is doing at the moment, but is quite zippy.
(defn init-templates
  [xslt, nthreads]
  (dosync (ref-set templates (ConcurrentLinkedQueue.) ))
  (dotimes [n nthreads]
    (let [xsl-src (StreamSource. (FileInputStream. xslt))
          configuration (Configuration.)
          compiler-info (CompilerInfo.)]
      (doto xsl-src
        (.setSystemId xslt))
      (doto compiler-info
        (.setErrorListener (StandardErrorListener.))
        (.setURIResolver (StandardURIResolver. configuration)))
      (dosync (.add @templates (.newTemplates (TransformerFactoryImpl.) xsl-src compiler-info))))))
(defn -main
  [dir-name, xsl, nthreads]
  (def xslt xsl)
  (def dirs (file-seq (File. dir-name)))
  (init-templates xslt nthreads)
  (let [pool (Executors/newFixedThreadPool nthreads)
        tasks (map (fn [x]
                     (fn []
                       (generate-xml x)))
                   (filter #(.endsWith (.getName %) ".if") dirs))]
    (doseq [future (.invokeAll pool tasks)]
      (.get future))
    (.shutdown pool)))
I had some conceptual difficulties figuring out how best to associate Templates with the threads that execute them. The easy thing to do would be to put the Templates creation in the function that is mapped to the file sequence, but that bogs down fairly quickly, presumably because a new Templates object is being created for each file and memory usage balloons. So that doesn't work. In Java, I'd either a) write a custom thread that spun up its own Templates or b) create a pool of Templates. After some messing around, I went with b) because I couldn't see how to do such an object-oriented thing as a) in a functional way. b) was a bit hard too, because I couldn't see how to store Templates in a Clojure collection, access them, and use them without wrapping the whole process in a transaction, which seems like it would hold the collection locked for far too long. So I used a threadsafe Java collection, ConcurrentLinkedQueue, which manages concurrent access to its members on its own.
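The borrow-and-return pattern behind option b) can be sketched on its own, separate from XSLT. The version below is an illustration under assumptions, not the code above: `make-pool`, `with-pooled`, and `run-all` are made-up names, and the pooled resource can be any object (the usage example uses StringBuilder as a stand-in for Templates). Returning the resource in a `finally` means it goes back to the queue even when the work throws.

```clojure
(import '(java.util.concurrent ConcurrentLinkedQueue Executors
                               ExecutorService Future))

(defn make-pool
  "Fill a thread-safe queue with n resources built by calling (make-resource)."
  [n make-resource]
  (let [q (ConcurrentLinkedQueue.)]
    (dotimes [_ n] (.add q (make-resource)))
    q))

(defn with-pooled
  "Borrow a resource, call (f resource), and always hand the resource back."
  [^ConcurrentLinkedQueue pool f]
  (let [res (.poll pool)] ; poll returns nil when the queue is empty
    (when (nil? res)
      (throw (IllegalStateException. "pool exhausted")))
    (try
      (f res)
      (finally (.add pool res)))))

(defn run-all
  "Run (f resource item) for every item on nthreads worker threads.
  With at least nthreads resources in the pool, a worker always finds one."
  [pool nthreads items f]
  (let [^ExecutorService exec (Executors/newFixedThreadPool nthreads)
        tasks (map (fn [item] (fn [] (with-pooled pool #(f % item)))) items)]
    (try
      ;; invokeAll blocks until every task has finished
      (mapv (fn [^Future fut] (.get fut)) (.invokeAll exec tasks))
      (finally (.shutdown exec)))))

;; Usage: four stand-in resources shared by four workers over twenty items.
(def demo-pool (make-pool 4 #(StringBuilder.)))
(run-all demo-pool 4 (range 20) (fn [_resource item] (* item item)))
```

The JAXP documentation for javax.xml.transform.Templates describes instances as safe to use from multiple threads concurrently, so sharing one compiled stylesheet across all workers may be a simpler alternative. That is a different design from the pool used here and would need testing against the Saxon version in use.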
I've no doubt there are better ways to do this, and I expect I'll learn them in time, but for now, I'm quite pleased with my first effort. Next step will probably be to add some Schematron validation for the APIS files. My impression of Clojure is that it's really powerful, and a good way to write concurrent programs. To do it really well, I think you'd need a fairly deep knowledge of both Lisp-style functional programming and the underlying Java/JVM aspects, but that seems doable.