Sunday, September 28, 2008

Go Zotero!

The Thomson Reuters lawsuit against the developers of Zotero is getting a lot of notice, which is good.

I've noticed that in the library world, when people mention getting sued, it's with fear and the implication that this represents the end of the world. It's an interesting contrast, coming from a startup (albeit a pretty well-funded one), where a lawsuit a) means publicity and is not to be feared (perhaps even to be provoked), and/or b) is a signal that you've scared your competitors enough to send them running to Daddy, thus unequivocally validating your business model.

This is an act of sheer desperation on the part of Thomson Reuters. They're hoping GMU will crumble and shut the project down. I do hope Dan has contacted the EFF (donate!) and that the GMU administration will take this for what it is: fantastic publicity for one of their most important departments and an indicator that they are doing something truly great.

Friday, August 15, 2008

Back from Balisage

I never made it to Extreme, Balisage's predecessor, despite wanting to very badly, so I'm very glad I did go to its new incarnation. I'm still processing the week's very rich diet of information, but it was very, very cool.

Simon St. Laurent, who wrote one of the first XML books I bought back in 1999, Inside XML DTDs, has a photo of one of the slides from my presentation in his Balisage roundup post. This is the kind of κλέος I can appreciate!

Thursday, August 14, 2008

Balisage Presentation online

I just rsynced up the presentation I gave this morning at Balisage, on linking manuscript images to transcriptions using SVG. It's at http://www.unc.edu/~hcayless/img2xml/presentation.html. The image viewer embedded in the presentation is at http://www.unc.edu/~hcayless/img2xml/viewer.html. Text paths are still busted at the highest resolution, as you'll see if you zoom all the way in, but apart from that it seems to work.

Balisage has been a really great conference so far. I highly recommend it.

Saturday, May 31, 2008

New TransCoder release

This is something I've been meaning to wrap up and write up for a while now: thanks to the Duke Integrating Digital Papyrology grant from the Andrew W. Mellon Foundation, I've been able to make a bunch of updates to the Transcoder, a piece of software I originally wrote for the EpiDoc project. Transcoder is a Java program that converts Greek text between encodings, for example from Beta Code to Unicode (or back again). It's used in initiatives like Perseus and Demos. I've been modifying it to work with the Duke Databank of Documentary Papyri's XML files (which are TEI-based). Besides a variety of bug fixes, Transcoder now includes a fully functional SAX ContentHandler that allows XML files containing Greek text to be processed and transcoded.

There are a lot of complex edge cases in this sort of work. For example, Beta Code (or at least the DDbDP's Beta) doesn't distinguish between medial (σ) and final (ς) sigmas. That's an easy conversion in the abstract (just look for 's' at the end of a word, and it's final), but when your text is embedded in XML, and there may be, for example, an expansion (<expan>) tag in the middle of a word, it becomes a lot harder. You can't just convert the contents of a particular element--you have to be able to look ahead. The problem with SAX, of course, is that it's stream-based, so no lookahead is possible unless you do some buffering. In the end what I did was buffer SAX events when an element (say a paragraph) marked as Greek begins, and keep track of all the text therein. That gave me the lookahead I needed, since I had a buffer containing the whole textual content of the <p> tag. When the end of the element comes, I flush the buffer, and all the queued-up SAX events fire, with the transcoded text in them.
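
To make the idea concrete, here is a minimal Java sketch of the buffering approach. This is not the actual Transcoder handler: the class name, the xml:lang="grc" trigger, and the finalizeSigmas helper are all made up for illustration, the queuing and replay of element events is only indicated in comments, and it assumes the buffered text has already been converted to Unicode using medial sigmas throughout.


import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: while inside a Greek-flagged element, stop emitting and buffer the
// text; when the element ends, fix word-final sigmas (possible only because
// the whole word is now visible, even across tags like <expan>) and flush.
public class GreekBufferingHandler extends DefaultHandler {

    private final StringBuilder buffer = new StringBuilder();
    private int greekDepth = 0; // > 0 while inside a Greek-flagged element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Hypothetical trigger; the real code keys off the DDbDP markup.
        if (greekDepth > 0 || "grc".equals(atts.getValue("xml:lang"))) {
            greekDepth++;
            // The real handler also queues this start-element event for replay.
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (greekDepth > 0) {
            buffer.append(ch, start, length); // buffer instead of emitting
        }
        // else: pass the text straight through to the downstream handler
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (greekDepth > 0 && --greekDepth == 0) {
            String converted = finalizeSigmas(buffer.toString());
            buffer.setLength(0);
            // ...replay the queued events here, substituting the converted text...
        }
    }

    // A sigma not followed by another letter ends its word, so make it final.
    static String finalizeSigmas(String text) {
        return text.replaceAll("σ(?!\\p{L})", "ς");
    }
}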

That's a lot of work for one letter, but I'm happy to say that it functions well now, and is being used to process the whole DDbDP. Another edge case that I chose not to solve in the Transcoder program is the problem of the TEI critical apparatus and its contents. An <app> element can contain a <lem> (lemma) and one or more <rdg> (readings). The problem is that the lemma and readings are conceptually parallel in the text. For example:

The quick brown <lem>fox</lem> jumped over the lazy dog.
                <rdg>cat</rdg>


The TEI would be:

The quick brown <app><lem>fox</lem><rdg>cat</rdg></app> jumped over the lazy dog

So "cat" follows immediately after "fox" in the text stream, but both words occupy the same space as far as the markup is concerned. In other words, I couldn't rely only on my fancy new lookahead scheme, because it broke down in edge cases like this. The solution I went with is dumb, but effective: format the apparatus so that there is a newline after the lemma (and the reading, if there are multiple readings). That way my code will still be able to figure out what's going on. The whitespace so introduced really needs to be flagged as significant, so that it doesn't get clobbered by other XML processes though. That has already happened to us once. It caused a bug for me too, because I wasn't buffering ignorable whitespace.

All that trouble over one little letter. Lunate sigmas would have made life so much easier...

Sunday, March 16, 2008

D·M·S· Allen Ross Scaife 1960-2008

On Saturday afternoon, March 15th, I learned that my friend Ross had died that morning after a long and hard-fought struggle with cancer. He was at his home in Lexington, Kentucky, surrounded by his family.

Ross was one of the giants of the Digital Classics community. He was the guiding force behind the Stoa, and the founder of many of its projects. Ross was always generous with his time and resources and has been responsible for incubating many fledgling Digital Humanities initiatives. His loss leaves a gap that will be impossible to fill.

Ross was also a good friend, easy to talk to, and always ready to encourage me to experiment with new ideas. I miss him very much.

What he began will continue without him, and though we cannot ever replace Ross, we can honour his memory by carrying on his good work.

update (March 21, 21:04)

Dot posted a lovely obituary of Ross at the Stoa. Tom and several others have posted nice memorials as well.

On a happier note: my daughter, Caroline Emma Ross Cayless was born at 11:52 pm, March 19th.

Wednesday, January 23, 2008

Catching up

My New Year's resolution was to write more, and specifically to blog more, but so far all of my writing has been internal stuff for my job. So I shall have another go...

Speaking of New Year's, I spent a chunk of New Year's Eve getting The Colonial and State Records of North Carolina off the ground.  It's driven by the eXist XML database, of which I've grown rather fond.  XQuery has a lot of promise as a tool for digital humanists with large collections of XML.

Monday, October 22, 2007

I've been at the Chicago Colloquium on Digital Humanities and Computer Science since yesterday, presenting on the Colonial and State Records project (available soon at http://docsouth.unc.edu).

Interesting themes that have emerged:
  • The importance of Not Reading, i.e. how to use computational tools to investigate textual spaces when there is more text than you can digest by reading cover-to-cover.
  • Going beyond search: discovery is an important task, but it's one we do quite well now. How do we go beyond just finding stuff and start to explore the data spaces that digital methods make available? Visualization tools are going to be an important component of this exploration. Digitization and search haven't changed the nature of research. They have improved the speed with which research is done (nobody spends years producing concordances anymore), but they haven't changed the questions we ask.
  • The dawn of Eurasian scholarship (this from Lewis Lancaster's talk): the divide between Occidental and Oriental scholarship no longer makes any sense (well, it never really did) and is probably over.

Monday, May 21, 2007

Note to job seekers

When applying for a programming job, listing Dreamweaver as a skill is an automatic 50 demerits.

Thursday, March 01, 2007

I'm going to be a digital librarian!

As of March 15th, I will be working for the UNC Library as a digital library programmer. I'm going to miss Lulu a lot. It's been a wonderful environment to work in, with people I'm going to find hard to leave. But working with collections like Documenting the American South is a text geek's Nirvana, so it was far too good an opportunity to pass up...

Tuesday, February 06, 2007

How has Ruby blown your mind?

...asks Pat Eyler

I had the opportunity to learn Ruby as part of a work project last year and was immediately impressed by its object-orientation, its use of blocks, the straightforward way it handles multiple inheritance with modules, and just the elegance and speed with which I could work in it. The moment that really changed the way I saw the language came when I had to generate previews of Word and OpenDocument (ODT) documents uploaded to the site I was working on. Converting Word to ODT seemed like the way to go, since ODT has a zipped XML format, and can therefore be transformed to XHTML. I have a lot of experience using XSLT to transform XML from one vocabulary to another, so this seemed like well explored territory to me, even if it would take a fair amount of work to accomplish. As usual, I did some web-trolling to see who had dealt with this issue before me, in case the problem was already solved. Google pointed me at J. David Eisenberg's ruby_odt_to_xhtml, which looked like a good start. It didn't do everything I wanted, in particular it didn't handle footnotes adequately, but I didn't expect it would be too hard to modify. The surprises came when I looked at the code...

The first surprise was the utter lack of XSLT. Not a huge surprise, perhaps. I'd already gathered that Rubyists viewed XML with a somewhat jaundiced eye. Tim Bray has lamented the state of XML support in Ruby as well. Tim is quite right about the relative weakness of XML support in Ruby, even though I absolutely agree with the practice of avoiding XML configuration files. There is a perfectly good Ruby frontend to libxslt, however, so its use is not out of the question. But there it was: for whatever reason, the author had decided not to use the technology I was familiar with...why would he do that, and could I still use his tool?

The mind expansion came about when I started figuring out how to extend odt_to_xhtml to handle notes, which it was basically ignoring. I wanted to turn ODT footnotes into endnotes with named anchors at the bottom of the page, links in the text to the corresponding anchor, and backlinks from the note to its link in the text. Before describing what I found, I should give a little background on XSLT:

At its most basic, XSLT expects input in the form of an XML document and produces either XML or text output. In XSLT, the equivalent of functions are called templates. Templates respond either to calls (as functions do in most languages) or, more often, to matches on the input XML document. So a template like


<xsl:template match="text:p">
  <p><xsl:apply-templates/></p>
</xsl:template>


would be triggered every time a paragraph element in an OpenDocument content.xml is encountered and would output a <p> tag, yield to any other matching templates, and then close the <p> tag.

As I looked at JDE's code, I saw lots of methods like this:


def process_text_list_item( element, output_node )
  style_name = register_style( element )
  item = emit_element( output_node, "li", {"class" => style_name} )
  process_children( element, item )
end


emit_element does what it sounds like: it adds a child element to the element passed into the method, with a hash of attribute name/value pairs. It's process_children that really interests me:


# Process an element's children
# node: the context node
# output_node: the node to which to add the children
# xpath_expr: which children to process (default is all)
#
# Algorithm:
# If the node is a text node, output to the destination.
# If it's an element, munge its name into
# <tt>process_prefix_elementname</tt>. If that
# method exists, call it to handle the element. Otherwise,
# process this node's children recursively.
#
def process_children( node, output_node, xpath_expr="node()" )
  REXML::XPath.each( node, xpath_expr ) do |item|
    if (item.kind_of?(REXML::Element)) then
      str = "process_" + @namespace_urn[item.namespace] + "_" + item.name.tr_s(":-", "__")
      if ODT_to_XHTML.method_defined?( str ) then
        self.send( str, item, output_node )
      else
        process_children(item, output_node)
      end
    elsif (item.kind_of?(REXML::Text) && !item.value.match(/^\s*$/))
      output_node.add_text(item.value)
    end
  end
  #
  # If it's empty, add a null string to force a begin and end
  # tag to be generated
  if (!output_node.has_elements? && !output_node.has_text?) then
    output_node.add_text("")
  end
end


Mind expansion ensued. This Ruby class was doing exactly the same thing that I'd expect an XSLT stylesheet to do, with the help of a few lines of code to keep it going! process_text_list_item is a template! Coming from Java and then PHP, I'd have no hesitation switching to XSLT to accomplish a bit of XML processing like this, but in Ruby, there really wasn't any need. I could write XSLT-like code perfectly naturally without ever leaving Ruby!

Now, I still like XSLT, and I'd still use it in many cases like this, because it's portable across different languages and platforms. But here, where there are other considerations, it's wonderful that I'm not forced to step outside the language I'm working in to accomplish what I want. In order to extend the code to handle notes, I just added some new template-like methods to match on notes and note-citations, e.g.:


def process_text_note( element, output_node )
  process_children(element, output_node, "#{@text_ns}:note-citation")
end


In OpenDocument, notes are inline structures. The note is embedded within the text at the point where the citation occurs, so to create endnotes, you need to split the note into a citation link and a note that is placed at the end of the output document. To add the endnotes, I borrowed a trick from XSLT: modes. If an XSL template has a mode="something" attribute, then that template will not match on an input node unless it was dispatched with an <xsl:apply-templates mode="something"/>. So I did the same thing, e.g.:


def process_text_note_mode_endnote( element, output_node )
  p = emit_element(output_node, "p", {"class" => "footnote"})
  process_children(element, p, "#{@text_ns}:note-citation", "endnote")
  process_text_s(element, p)
  process_children(element, p, "#{@text_ns}:note-body/#{@text_ns}:p[1]/node()")
  process_children(element, p, "#{@text_ns}:note-body/#{@text_ns}:p[1]/following-sibling::*")
end


The method that controls the processing flow in JDE's code is called analyze_content_xml. I just added a call to my moded methods in analyze_content_xml and modified process_children to take a mode parameter.


def process_children( node, output_node, xpath_expr="node()", mode=nil )
  if xpath_expr.nil?
    xpath_expr = "node()"
  end
  REXML::XPath.each( node, xpath_expr ) do |item|
    if (item.kind_of?(REXML::Element)) then
      str = "process_" + @namespace_urn[item.namespace] + "_" + item.name.tr_s(":-", "__")
      if mode
        str += "_mode_#{mode}"
      end
      if ODT_to_XHTML.method_defined?( str ) then
        self.send( str, item, output_node )
      else
        process_children(item, output_node)
      end
    elsif (item.kind_of?(REXML::Text) && !item.value.match(/^\s*$/))
      output_node.add_text(item.value)
    end
  end
  #
  # If it's empty, add a null string to force a begin and end
  # tag to be generated
  if (!output_node.has_elements? && !output_node.has_text?) then
    output_node.add_text("")
  end
end

Done. Easy. Blew my mind.

Saturday, January 20, 2007

Prototype grows up

http://prototypejs.org is the new site for Prototype 1.5. As the Ajaxian blog noted: Now with Documentation! Of course, Prototype always had some documentation; quite good documentation at that, even though there were substantial pieces missing and you had to go digging sometimes.

Prototype played a big part in reawakening my interest in Javascript as a programming language. I was rather anti-Javascript for a while, having fought many bloody battles with cross-browser incompatibilities in the early 2000's (UNC Chapel Hill, my then employer, had standardized, somewhat foolishly, on Netscape 4.7, but of course we had to support IE too -- nightmare). I got back into it seriously when I started to notice all of the AJAXy and Web 2.0-ish stuff going on. I've learned a lot from digging around in the Prototype source code, so the spotty documentation actually did me some good.

Kudos to the Prototype development team and the contributors to the documentation effort. You've done us all a great service! I look forward to using 1.5...

Thursday, August 24, 2006

XSL-FO 2.0 Workshop 2006

International Workshop on the future of the Extensible Stylesheet Language (XSL-FO) Version 2.0

I have two suggestions:
  • use CSS instead of weird attribute-based style declarations.

  • for God's sake, have a reference implementation.

Wednesday, August 02, 2006

Boycott Blackboard!

I knew Blackboard had a patent application for their LMS, but apparently it has been granted, and their first act was to file a lawsuit against one of their competitors. This is terrible on many levels, not least that such a stupid patent should never have been granted. Of course, the USPTO would probably let me patent my nose hair.

I certainly won't be using Blackboard for my XML class in the Fall, and I'd encourage other instructors to drop it too.

Friday, October 21, 2005

Google Library

I just read John Battelle's post on the AAP's lawsuit. The comments are particularly interesting, with a couple of very strident ones criticising Google. I have a theory about how Google plans to justify their actions:
  1. Libraries are allowed, under copyright law, to make a single copy of any work in their possession. This is called the Library Exemption. There is a nice outline of the terms here. The libraries themselves can't get in trouble for contracting with Google to do this for them, because they are receiving no commercial advantage from it. Google clearly is receiving a competitive advantage from it, BUT:
  2. They may be able to make a good case for Fair Use, depending on the nature of what they keep from the book. There are four aspects to be weighed in any Fair Use defense (see Wikipedia):
    1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
    2. the nature of the copyrighted work;
    3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
    4. the effect of the use upon the potential market for or value of the copyrighted work.
Clearly Google hopes for commercial advantage from the use of the scanned books, so they might fail the first test. The second doesn't really apply: these are clearly books subject fully to copyright law. It's the third and fourth aspects that I think are the center of Google's defense. A copy is a copy, but a searchable index created from a scanned copy is arguably a transformative use of the book. A human being can neither read the index, nor reconstruct the original from it, so Google may be able to successfully defend themselves on aspect #3. Their main weakness is the existence of page images from the original scan. These may or may not be stored and accessible in such a way that a whole copy of the original could be reconstructed and read. Aspect #4 is another winner for Google. The clear effect of this system will be to sell more copies of the publishers' books. The only (theoretical) commercial harm caused to the publishers is that they are effectively prevented from rolling a Google Print of their own, which might bring them in more money than simply selling their books. So Google wins on at least two of the four counts, and the act of copying itself is protected under the Library Exemption.

I suspect the AAP would face an uphill battle in winning this one. I wouldn't be surprised if they wanted Google to license their books for the index at some fairly exorbitant rate, and Google refused to pay because they're doing the publishers a favor. That would make the lawsuit a negotiating tactic.

Thursday, April 28, 2005

When SEOs Attack

Search Engine Foo: iUniverse Book Publishing: Book Publisher for Self Publishing and Print on Demand. Care to guess what terms they're optimizing for? It does seem to work: they show up #1 in a Google search for "self publishing," so clearly this sort of spamming pays off. But it leads to pretty hilarious prose:

iUniverse, the leading online book publisher, offers the most comprehensive book publishing services in the self-publishing industry—awarded the Editor's Choice award by PC Magazine and chosen by thousands of satisfied authors as the leading print-on-demand book publisher.

We help authors to prepare a manuscript, design and self-publish a book of professional quality, publicize and market their book, and print copies of their book for sale online and in bookstores around the world.

As an innovative book publisher, we also offer exclusive services such as our acclaimed Editorial Review and our revolutionary Star Program, designed to discover and nurture exceptional new talent within our growing author community.

Don't wait any longer to get that manuscript off your desk and into the marketplace. With iUniverse as your book publisher, you can become a published author in a matter of weeks. Why not get started today?


Yes, indeed. Publish your book with a publishing publisher and be published. Ouch. Not sure I'd pay them an exorbitant fee to edit my book.

Friday, March 18, 2005

Writing Code

I'm coming to the conclusion that writing code, as an activity, really is like writing prose. I find myself treating code projects just like writing projects:
  1. I spend the first part of the project thinking about it and being (apparently) very unproductive. (25-30%)
  2. After I reach some sort of critical mass in my thinking, I very quickly pour out everything into code/onto the page. The project is 80% done as far as volume goes at this point. (10-20%)
  3. I spend the rest of the time editing, bugfixing, refining, etc. (50-60%)
For larger projects, this cycle gets repeated for each component of the project. This is precisely the pattern I followed when writing my dissertation. I don't know if this kind of working method is in any way typical, but it does seem to produce the desired results. It makes giving project completion estimates next to impossible though, because I really have no idea how long the project will take until I enter the hyper-productive phase, and when that's complete, I often still have a lot of work to do, even though the bulk of the code/writing is done.

This is why it's best for me if I've got some variety in a job. The hyper-productive phase really can't overlap with anything else: if I'm interrupted then, I'll get off track and it may blow the whole day, but in phase 1 or 3 I'm better off not spending all my time focusing on the project, because I'll just end up web surfing. Or blogging.

Wednesday, February 23, 2005

Tagging Notes

From a conversation this afternoon: the tags used in folksonomies are deliberately stupid. They are atomic units of information. So a tag can be any atomic unit--it doesn't have to be a word; it could be a URL, a zip code, or anything else you can think of that isn't reducible.
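
To put the same point in code: here is a toy, hypothetical sketch (in Java, with made-up names) of a tag store that never interprets its tags. Because a tag is just an opaque key, a word, a zip code, and a URL all index items equally well.


import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy sketch: tags are opaque atoms mapped to the set of items they label.
public class TagStore {

    private final Map<String, Set<String>> itemsByTag = new HashMap<String, Set<String>>();

    public void tag(String item, String tag) {
        Set<String> items = itemsByTag.get(tag);
        if (items == null) {
            items = new HashSet<String>();
            itemsByTag.put(tag, items);
        }
        items.add(item);
    }

    public Set<String> itemsFor(String tag) {
        Set<String> items = itemsByTag.get(tag);
        return items == null ? new HashSet<String>() : items;
    }
}


So store.tag("http://example.org/photo/42", "27514") treats the zip code exactly like the word "sunset" or a URL used as a tag: the store never needs to know what kind of thing the tag is.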

adaptive path » ajax: a new approach to web applications

Web application development is starting to get really exciting again. The funny thing is that a lot of this technology has been around for a while, and even though IE supported it, you didn't see tools like these. I wonder whether the development of Firefox is really what's pushed it. Certainly nearly all the developers I know shun IE. So perhaps having a capable Free browser was what sparked all this innovation.

Tuesday, February 22, 2005

Blog-binding

Recently, I've been seeing a number of companies and projects springing up around the idea of publishing blogs as books. The examples I'm aware of are Blogbinders, qoop, LJBook, and most recently, book this blog, but I'd bet there are more. What I'm wondering about is how useful the blog-directly-to-book pathway is. Wouldn't an application that aggregates your blog posts into an editing environment (like Word or OpenOffice) be more useful? Can we really smooth over the formatting differences between web and print well enough to produce (automatically) a nice-looking book 100% of the time? I'm a little skeptical.

From my (admittedly cursory) browsing, it looks like Blogbinders has a human in the middle of the process, and qoop certainly did for their only title so far, John Battelle's SearchBlog. Requiring a human being in the loop raises costs and introduces scaling issues.

There's a fine tradition of publishing diaries, going back at least to Caesar, but unless you happen to be famous, or have a blog that's truly interesting a high percentage of the time, the way to monetize blogs is more likely to be on the Hardball Times model, where you build up an audience, and then sell them work they're interested in.

But if blogs really are the ultimate vanity presses, then there may indeed be money in printing them, if you charge the blogger enough up front. It will be interesting to see how all this shakes out.

Tuesday, October 26, 2004

Reflexive link

Just noted on my work blog the Wired article by Hilary Rosen on Creative Commons. Always nice to see a new convert.