Monday, October 22, 2007

I've been at the Chicago Colloquium on Digital Humanities and Computer Science since yesterday, presenting on the Colonial and State Records project (available soon at http://docsouth.unc.edu).

Interesting themes that have emerged:
  • The importance of Not Reading, i.e. how to use computational tools to investigate textual spaces when there is more text than you can digest by reading cover-to-cover.
  • Going beyond search: Discovery is an important task, but it's one we do quite well now. How do we go beyond just finding stuff and start to explore the data spaces that digital methods make available? Visualization tools are going to be an important component of this exploration. Digitization and search haven't changed the nature of research. They have improved the speed with which research is done (nobody spends years producing concordances anymore), but they haven't changed the questions we ask.
  • The dawn of Eurasian scholarship (this from Lewis Lancaster's talk): the divide between Occidental and Oriental scholarship no longer makes any sense (well, it never really did) and is probably over.

Monday, May 21, 2007

Note to job seekers

When applying for a programming job, listing Dreamweaver as a skill is an automatic 50 demerits.

Thursday, March 01, 2007

I'm going to be a digital librarian!

As of March 15th, I will be working for the UNC Library as a digital library programmer. I'm going to miss Lulu a lot. It's been a wonderful environment to work in, with people I'm going to find hard to leave. But working with collections like Documenting the American South is a text geek's Nirvana, so it was far too good an opportunity to pass up...

Tuesday, February 06, 2007

How has Ruby blown your mind?

...asks Pat Eyler

I had the opportunity to learn Ruby as part of a work project last year and was immediately impressed by its object-orientation, its use of blocks, the straightforward way it handles multiple inheritance with modules, and just the elegance and speed with which I could work in it. The moment that really changed the way I saw the language came when I had to generate previews of Word and OpenDocument (ODT) documents uploaded to the site I was working on. Converting Word to ODT seemed like the way to go, since ODT is a zipped XML format and can therefore be transformed to XHTML. I have a lot of experience using XSLT to transform XML from one vocabulary to another, so this seemed like well-explored territory to me, even if it would take a fair amount of work to accomplish. As usual, I did some web-trolling to see who had dealt with this issue before me, in case the problem was already solved. Google pointed me at J. David Eisenberg's ruby_odt_to_xhtml, which looked like a good start. It didn't do everything I wanted (in particular, it didn't handle footnotes adequately), but I didn't expect it would be too hard to modify. The surprises came when I looked at the code...

The first surprise was the utter lack of XSLT. Not a huge surprise, perhaps. I'd already gathered that Rubyists viewed XML with a somewhat jaundiced eye. Tim Bray has lamented the state of XML support in Ruby as well. Tim is quite right about the relative weakness of XML support in Ruby, even though I absolutely agree with the practice of avoiding XML configuration files. There is a perfectly good Ruby frontend to libxslt, however, so its use is not out of the question. But there it was: for whatever reason, the author had decided not to use the technology I was familiar with...why would he do that, and could I still use his tool?

The mind expansion came about when I started figuring out how to extend odt_to_xhtml to handle notes, which it was basically ignoring. I wanted to turn ODT footnotes into endnotes with named anchors at the bottom of the page, links in the text to the corresponding anchor, and backlinks from the note to its link in the text. Before describing what I found, I should give a little background on XSLT:

At its most basic, XSLT expects input in the form of an XML document, and produces either XML or text output. In XSLT, the functions are called templates. Templates respond either to calls (as do functions in most languages) or, more often, to matches on the input XML document. So a template like


<xsl:template match="text:p">
<p><xsl:apply-templates/></p>
</xsl:template>


would be triggered every time a paragraph element in an OpenDocument content.xml is encountered and would output a <p> tag, yield to any other matching templates, and then close the <p> tag.

As I looked at JDE's code, I saw lots of methods like this:


def process_text_list_item( element, output_node )
  style_name = register_style( element )
  item = emit_element( output_node, "li", {"class" => style_name} )
  process_children( element, item )
end


emit_element does what it sounds like it does: it adds a child element, with a hash of attribute name/value pairs, to the element passed in to the method. It's process_children that really interests me:


# Process an element's children
# node: the context node
# output_node: the node to which to add the children
# xpath_expr: which children to process (default is all)
#
# Algorithm:
# If the node is a text node, output to the destination.
# If it's an element, munge its name into
# <tt>process_prefix_elementname</tt>. If that
# method exists, call it to handle the element. Otherwise,
# process this node's children recursively.
#
def process_children( node, output_node, xpath_expr="node()" )
  REXML::XPath.each( node, xpath_expr ) do |item|
    if (item.kind_of?(REXML::Element)) then
      str = "process_" + @namespace_urn[item.namespace] + "_" + item.name.tr_s(":-", "__")
      if ODT_to_XHTML.method_defined?( str ) then
        self.send( str, item, output_node )
      else
        process_children(item, output_node)
      end
    elsif (item.kind_of?(REXML::Text) && !item.value.match(/^\s*$/))
      output_node.add_text(item.value)
    end
  end
  #
  # If it's empty, add a null string to force a begin and end
  # tag to be generated
  if (!output_node.has_elements? && !output_node.has_text?) then
    output_node.add_text("")
  end
end


Mind expansion ensued. This Ruby class was doing exactly the same thing that I'd expect an XSLT stylesheet to do, with the help of a few lines of code to keep it going! process_text_list_item is a template! Coming from Java and then PHP, I'd have no hesitation switching to XSLT to accomplish a bit of XML processing like this, but in Ruby, there really wasn't any need. I could write XSLT-like code perfectly naturally without ever leaving Ruby!
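
To see the pattern in isolation, here is a minimal sketch of the same idea. It is not JDE's code: TinyTransformer, its process_* methods, and the para/emph element names are all made up for illustration, and I've dropped namespaces and style handling entirely. The only real library used is REXML.

```ruby
require "rexml/document"

# Hypothetical miniature of the dispatch pattern; the class, method, and
# element names are illustrative and not part of odt_to_xhtml.
class TinyTransformer
  # "Template" for <para>: emit a <p>, then keep walking the children.
  def process_para(element, output_node)
    para = output_node.add_element("p")
    process_children(element, para)
  end

  # "Template" for <emph>: emit an <em>.
  def process_emph(element, output_node)
    em = output_node.add_element("em")
    process_children(element, em)
  end

  # Dispatcher: build a method name from the element name. If a method with
  # that name exists, call it; otherwise fall through to the children, much
  # as XSLT's built-in template rules do.
  def process_children(node, output_node)
    node.each_child do |item|
      case item
      when REXML::Element
        handler = "process_#{item.name.tr('-', '_')}"
        if respond_to?(handler)
          send(handler, item, output_node)
        else
          process_children(item, output_node)
        end
      when REXML::Text
        # Skip whitespace-only text nodes, as the original code does.
        output_node.add_text(item.value) unless item.value.strip.empty?
      end
    end
  end

  def transform(xml)
    input = REXML::Document.new(xml)
    body = REXML::Element.new("body")
    process_children(input, body)
    body.to_s
  end
end

result = TinyTransformer.new.transform(
  "<doc><para>Hello <emph>world</emph></para></doc>"
)
puts result
```

There is no process_doc method, so the doc wrapper is simply walked through, while para and emph each hit their own "template".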

Now, I still like XSLT, and I'd still use it in many cases like this, because it's portable across different languages and platforms. But here, where there are other considerations, it's wonderful that I'm not forced to step outside the language I'm working in to accomplish what I want. In order to extend the code to handle notes, I just added some new template-like methods to match on notes and note-citations, e.g.:


def process_text_note( element, output_node )
  process_children(element, output_node, "#{@text_ns}:note-citation")
end


In OpenDocument, notes are inline structures. The note is embedded within the text at the point where the citation occurs, so to create endnotes, you need to split the note into a citation link and a note that is placed at the end of the output document. To add the endnotes, I borrowed a trick from XSLT: modes. If an XSL template has a mode="something" attribute, then that template will not match on an input node unless it was dispatched with an <apply-templates mode="something"/>. So I did the same thing, e.g.:


def process_text_note_mode_endnote( element, output_node )
  p = emit_element(output_node, "p", {"class" => "footnote"})
  process_children(element, p, "#{@text_ns}:note-citation", "endnote")
  process_text_s(element, p)
  process_children(element, p, "#{@text_ns}:note-body/#{@text_ns}:p[1]/node()")
  process_children(element, p, "#{@text_ns}:note-body/#{@text_ns}:p[1]/following-sibling::*")
end


The method that controls the processing flow in JDE's code is called analyze_content_xml. I just added a call to my moded methods in analyze_content_xml and modified process_children to take a mode parameter.


def process_children( node, output_node, xpath_expr="node()", mode=nil )
  if xpath_expr.nil?
    xpath_expr = "node()"
  end
  REXML::XPath.each( node, xpath_expr ) do |item|
    if (item.kind_of?(REXML::Element)) then
      str = "process_" + @namespace_urn[item.namespace] + "_" + item.name.tr_s(":-", "__")
      if mode
        str += "_mode_#{mode}"
      end
      if ODT_to_XHTML.method_defined?( str ) then
        self.send( str, item, output_node )
      else
        process_children(item, output_node)
      end
    elsif (item.kind_of?(REXML::Text) && !item.value.match(/^\s*$/))
      output_node.add_text(item.value)
    end
  end
  #
  # If it's empty, add a null string to force a begin and end
  # tag to be generated
  if (!output_node.has_elements? && !output_node.has_text?) then
    output_node.add_text("")
  end
end
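
The mode trick can also be shown on its own. What follows is a rough sketch under simplifying assumptions, not the code from odt_to_xhtml: ModedTransformer, dispatch, and the para/note element names are hypothetical, there are no namespaces, and the second pass picks out notes with a plain //note XPath instead of going through analyze_content_xml.

```ruby
require "rexml/document"

# Hypothetical sketch of XSLT-style modes; all class, method, and element
# names here are illustrative and not taken from odt_to_xhtml.
class ModedTransformer
  def process_para(element, output_node)
    para = output_node.add_element("p")
    process_children(element, para)
  end

  # Default mode: a note in running text becomes a numbered marker.
  def process_note(element, output_node)
    @count += 1
    output_node.add_element("sup").add_text(@count.to_s)
  end

  # "endnote" mode: the same element becomes a paragraph of its own.
  def process_note_mode_endnote(element, output_node)
    para = output_node.add_element("p", { "class" => "footnote" })
    process_children(element, para)
  end

  # Build the handler name, adding a _mode_<mode> suffix when a mode is
  # given. Unhandled elements fall through to their children in the default
  # mode, mirroring what the code in this post does.
  def dispatch(item, output_node, mode = nil)
    handler = "process_#{item.name}"
    handler += "_mode_#{mode}" if mode
    if respond_to?(handler)
      send(handler, item, output_node)
    else
      process_children(item, output_node)
    end
  end

  def process_children(node, output_node, mode = nil)
    node.each_child do |item|
      case item
      when REXML::Element
        dispatch(item, output_node, mode)
      when REXML::Text
        output_node.add_text(item.value) unless item.value.strip.empty?
      end
    end
  end

  def transform(xml)
    input = REXML::Document.new(xml)
    body = REXML::Element.new("body")
    @count = 0
    # First pass: default mode, notes become markers in the text.
    process_children(input.root, body)
    # Second pass: visit only the notes, dispatched in "endnote" mode.
    REXML::XPath.each(input, "//note") do |note|
      dispatch(note, body, "endnote")
    end
    body
  end
end

body = ModedTransformer.new.transform(
  "<doc><para>Text<note>A note</note> more</para></doc>"
)
puts body
```

The same note element is handled twice by two different methods, and the only thing that selects between them is the suffix on the method name.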

Done. Easy. Blew my mind.

Saturday, January 20, 2007

Prototype grows up

http://prototypejs.org is the new site for Prototype 1.5. As the Ajaxian blog noted: Now with Documentation! Of course, Prototype always had some documentation; quite good documentation at that, even though there were substantial pieces missing and you had to go digging sometimes.

Prototype played a big part in reawakening my interest in JavaScript as a programming language. I was rather anti-JavaScript for a while, having fought many bloody battles with cross-browser incompatibilities in the early 2000s (UNC Chapel Hill, my then employer, had standardized, somewhat foolishly, on Netscape 4.7, but of course we had to support IE too -- nightmare). I got back into it seriously when I started to notice all of the AJAXy and Web 2.0-ish stuff going on. I've learned a lot from digging around in the Prototype source code, so the spotty documentation actually did me some good.

Kudos to the Prototype development team and the contributors to the documentation effort. You've done us all a great service! I look forward to using 1.5...