Wednesday, December 16, 2009

Converting APIS

On Monday, I finished converting the APIS (Advanced Papyrological Information System) intake files to EpiDoc XML. I thought I'd write it up, since I tried some new things to do it. The APIS intake files employ a MARC-inspired text format that looks like:

cu001 | 1 | duke.apis.31254916
cu035 | 1 | (NcD)31254916
cu965 | 1 | APIS

status | 1 | 1
cu300 | 1 | 1 item : papyrus, two joining fragments mounted in
glass, incomplete ; 19 x 8 cm
cuDateSchema | 1 | b
cuDateType | 1 | o
cuDateRange | 1 | b
cuDateValue | 1 | 199
cuDateRange | 2 | e
cuDateSchema | 2 | b
cuDateType | 2 | o
cuDateValue | 2 | 100
cuLCODE | 1 | egy
cu090 | 1 | P.Duk.inv. 723 R
cu500 | 1 | Actual dimensions of item are 18.5 x 7.7 cm
cu500 | 2 | 12 lines
cu500 | 3 | Written along the fibers on the recto; written
across the fibers on the verso in a different hand and
inverse to the text on the recto
cu500 | 4 | P.Duk.inv. 723 R was formerly P.Duk.inv. MF79 69 R
cu510_m | 5 |
cu520 | 6 | Papyrus account of wheat from the Arsinoites (modern
name: Fayyum), Egypt. Mentions the bank of Pakrouris(?)
cu546 | 7 | In Demotic
cu655 | 1 | Documentary papyri Egypt Fayyum 332-30 B.C
cu655 | 2 | Accounts Egypt Fayyum 332-30 B.C
cu655 | 3 | Papyri

cu653 | 1 | Accounting -- Egypt -- Fayyum -- 332-30 B.C.
cu653 | 2 | Banks and banking -- Egypt -- Fayyum -- 332-30 B.C.
cu653 | 3 | Wheat -- Egypt -- Fayyum -- 332-30 B.C.
cu245ab | 1 | Account of wheat [2nd cent. B.C.]
cuPart_no | 1 | 1
cuPart_caption | 1 | Recto
cuPresentation_no | 1 | 1 | 1
cuPresentation_display_res | 1 | 1 | thumbnail
cuPresentation_url | 1 | 1 |
cuPresentation_format | 1 | 1 | image/gif
cuPresentation_no | 1 | 2 | 2
cuPresentation_display_res | 1 | 2 | 72dpi
cuPresentation_url | 1 | 2 |
cuPresentation_format | 1 | 2 | image/gif
cuPresentation_no | 1 | 3 | 3
cuPresentation_display_res | 1 | 3 | 150dpi
cuPresentation_url | 1 | 3 |
cuPresentation_format | 1 | 3 | image/gif
perm_group | 1 | w

cu090_orgcode | 1 | NcD
cuOrgcode | 1 | NcD

Some of the element names come from, and have the semantics of MARC, while others don't. Fields are delimited with pipe characters '|' and are sometimes 3 columns, sometimes 4. The second column is meant to express order, e.g. cu500 (general note) 1, 2, 3, and 4. If there are 4 columns, the third is used to link related fields, e.g. an image with its label. The last column is the field data, which can wrap to multiple lines. This has to be converted to EpiDoc like:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="">
<title>Account of wheat [2nd cent. B.C.]</title>
<idno type="apisid">duke.apis.31254916</idno>
<idno type="controlno">(NcD)31254916</idno>
<idno type="invno">P.Duk.inv. 723 R</idno>
<summary>Papyrus account of wheat from the Arsinoites (modern name: Fayyum), Egypt.
Mentions the bank of Pakrouris(?)</summary>
<note type="general">Actual dimensions of item are 18.5 x 7.7 cm</note>
<note type="general">12 lines</note>
<note type="general">Written along the fibers on the recto; written across the fibers on
the verso in a different hand and inverse to the text on the recto</note>
<note type="general">P.Duk.inv. 723 R was formerly P.Duk.inv. MF79 69 R</note>
<textLang mainLang="egy">In Demotic</textLang>
<p>1 item : papyrus, two joining fragments mounted in glass, incomplete ; 19 x 8 cm</p>
<origDate notBefore="-0199" notAfter="-0100"/>
<language ident="en">English</language>
<language ident="egy-Egyd">In Demotic</language>
<keywords scheme="#apis">
<term>Accounting -- Egypt -- Fayyum -- 332-30 B.C.</term>
<term>Banks and banking -- Egypt -- Fayyum -- 332-30 B.C.</term>
<term>Wheat -- Egypt -- Fayyum -- 332-30 B.C.</term>
<rs type="genre_form">Documentary papyri Egypt Fayyum 332-30 B.C</rs>
<rs type="genre_form">Accounts Egypt Fayyum 332-30 B.C</rs>
<rs type="genre_form">Papyri</rs>
<div type="bibliography" subtype="citations">
<ref target="">Original record</ref>.</p>
<div type="figure">
<figDesc> thumbnail</figDesc>
<graphic url=""/>
<figDesc> 72dpi</figDesc>
<graphic url=""/>
<figDesc> 150dpi</figDesc>
<graphic url=""/>

I started learning Clojure this summer. Clojure is a Lisp implementation on top of the Java Virtual Machine. So I thought I'd have a go at writing an APIS converter in it. The result is probably thoroughly un-idiomatic Clojure, but it converts the 30,000 plus APIS records to EpiDoc in about 2.5 minutes, so I'm fairly happy with it as a baby-step. The script works by reading the intake file line by line and issuing SAX events that are handled by a Saxon XSLT TRansformerHandler, which in turn converts to EpiDoc. So in effect, the intake file is treated as though it were an XML file and transformed with a stylesheet.

Most of the processing is done with three functions:

generate-xml takes a File, instantiates a transforming SAX handler from a pool of TransformerFactory objects, starts calling SAX events, and then hands off to the process-file function.

(defn generate-xml
(let [xslt (.poll @templates)
handler (.newTransformerHandler (TransformerFactoryImpl.) xslt)]
(doto handler
(.setResult (StreamResult. (File. (.replace
(.replace (str file-var) "intake_files" "xml") ".if" ".xml"))))
(.startElement "" "apis" "apis" (AttributesImpl.)))
(process-file (read-file file-var) "" handler)
(doto handler
(.endElement "" "apis" "apis")
(catch Exception e
(.println *err* (str (.getMessage e) " processing file " file-var))))
(.add @templates xslt)))

recursively processes a sequence of lines from the file. If lines is empty, we're at the end of the file, and we can end the last element and exit, otherwise, it splits the current line on pipe characters, calls handle line, then calls itself on the remainder of the line sequence.

(defn process-file
[lines, elt-name, handler]
(if (empty? lines)
(.endElement handler "" elt-name elt-name)
(if (not (.startsWith (first lines) "#")) ; comments start with '#' and can be ignored
(let [line (.split (first lines) "\\s+\\|\\s+")
ename (if (.contains (first lines) "|") (aget line 0) elt-name)]
(handle-line line elt-name handler)
(process-file (rest lines) ename handler)))))

handle-line does most of the XML-producing work. The field name is emitted as an element, columns 2 (and 3 if it's a 4-column field) are emitted as @n and @m attributes, and the last column is emitted as character cont If the line is a continuation of the preceding line, then it will be emitted as character data.

(defn handle-line
[line, elt-name, handler]
(if (> (alength line) 2) ; lines < 2 columns long are either continuations or empty fields
(do (let [atts (AttributesImpl.)]
(doto atts
(.addAttribute "" "n" "n" "CDATA" (.trim (aget line 1))))
(if (> (alength line) 3)
(doto atts
(.addAttribute "" "m" "m" "CDATA" (.trim (aget line 2)))))
(if (false? (.equals elt-name ""))
(.endElement handler "" elt-name elt-name))
(.startElement handler "" (aget line 0) (aget line 0) atts))
(let [content (aget line (- (alength line) 1))]
(.characters handler (.toCharArray (.trim content)) 0 (.length (.trim content)))))
(if (== (alength line) 1)
(.characters handler (.toCharArray (aget line 0)) 0 (.length (aget line 0)))))))

The -main function kicks everything off by calling init-templates to load up a ConcurrentLinkedQueue with new Template objects capable of generating an XSLT handler and then kicking off a thread pool and mapping the generate-xml function to a sequence of files with the ".if" suffix. -main takes 3 arguments, the directory to look for intake files in, the XSLT to use for transformation, and the number of worker threads to use. I've been kicking it off with 20 threads. Speed depends on how much work my machine (3 GHc Intel Core 2 Duo Macbook Pro) is doing at the moment, but is quite zippy.

(defn init-templates
[xslt, nthreads]
(dosync (ref-set templates (ConcurrentLinkedQueue.) ))
(dotimes [n nthreads]
(let [xsl-src (StreamSource. (FileInputStream. xslt))
configuration (Configuration.)
compiler-info (CompilerInfo.)]
(doto xsl-src
(.setSystemId xslt))
(doto compiler-info
(.setErrorListener (StandardErrorListener.))
(.setURIResolver (StandardURIResolver. configuration)))
(dosync (.add @templates (.newTemplates (TransformerFactoryImpl.) xsl-src compiler-info))))))

(defn -main
[dir-name, xsl, nthreads]
(def xslt xsl)
(def dirs (file-seq (File. dir-name)))
(init-templates xslt nthreads)
(let [pool (Executors/newFixedThreadPool nthreads)
tasks (map (fn [x]
(fn []
(generate-xml x)))
(filter #(.endsWith (.getName %) ".if") dirs))]
(doseq [future (.invokeAll pool tasks)]
(.get future))
(.shutdown pool)))

I had some conceptual difficulties figuring out how best to associate Templates with the threads that execute them. The easy thing to do would be to put the Template creation in the function that is mapped to the file sequence, but that bogs down fairly quickly, presumably because a new Template is being created for each file and memory usage balloons pretty quickly. So that doesn't work. In Java, I'd either a) write a custom thread that spun up its own Template or b) create a pool of Templates. After some messing around, I went with b) because I couldn't see how to do such an object-oriented thing in a functional way. b) was a bit hard too, because I couldn't see how to store Templates in a Clojure collection, access them, and use them without wrapping the whole process in a transaction, which seems like it would lock the collection much too much. So I used a threadsafe Java collection, ConcurrentLinkedQueue, which manages concurrent access to its members on its own.

I've no doubt there are better ways to do this, and I expect I'll learn them in time, but for now, I'm quite pleased with my first effort. Next step will probably be to add some Schematron validation for the APIS files. My impression of Clojure is that it's really powerful, and a good way to write concurrent programs. To do it really well, I think you'd need a fairly deep knowledge of both Lisp-style functional programming and the underlying Java/JVM aspects, but that seems doable.

No comments: