---
layout: post
status: publish
published: true
title: Exposing XML data for Orbit
wordpress_id: 2654
wordpress_url: https://www.martineve.com/?p=2654
date: '2013-03-29 19:59:51 +0100'
date_gmt: '2013-03-29 19:59:51 +0100'
categories:
- Technology
- Open Access
tags:
- XML
- XSLT
comments: []
---
<p>Although, for now, this will be of limited interest or use to most readers of the journal, today I undertook the necessary work (by which I mean: cleaning up for compliance!) to expose the XML files that power the typesetting behind my journal of Pynchon studies, <a href="https://www.pynchon.net">Orbit: Writing Around Pynchon</a>.</p>

<p>As you can see if you visit <a href="https://www.pynchon.net/owap/issue/view/1">Issue 1</a>, all the articles now have their XML available for download. I intend to work sequentially through the rest of our published articles to expose this data. Let me explain what this means and then why I think it's important.</p>

<p>The way that I construct articles for the journal is, after successful peer review, to transcribe the Word documents that we are sent into Extensible Markup Language (XML) under a document type definition provided by the National Library of Medicine. This <a href="http://dtd.nlm.nih.gov/publishing/tag-library/3.0/index.html">Journal Publishing Tag Set</a> specifies how such documents should be formed for compliance, and the NLM also provides some sample tools to produce output (a minimal skeleton of one of these files appears at the end of this post).</p>

<p>Once I've got the XML file together -- and this can be no small job in the case of complex citations -- I run it through my custom galley production suite, <a href="https://github.com/MartinPaulEve/MEXMLGalley">meXml</a>. Running the <a href="https://github.com/MartinPaulEve/MEXMLGalley/blob/master/meXml/tools/gengalleys.sh">tools/gengalleys.sh</a> script produces PDF and XHTML output from the same file, so I know that they are synchronised. I can then also (<a href="http://help.crossref.org/#nlm-to-crossref-conversion">with any luck, in the near future</a>) run a transform on the XML to produce the deposit documents that I need to send to CrossRef.</p>

<p>Why do this, though? Why not just botch together a PDF and an HTML file exported from Word? A few important reasons:</p>

<ul>
<li>The XHTML that is produced should be 100% W3C standards compliant, which increases accessibility for those with, for instance, visual impairments.</li>
<li>The documents will always be synchronised, even if another version is required.</li>
<li>Data mining. If people want to work out what scholars are doing in our field, they now have the raw data at their disposal, under a CC-BY license.</li>
<li>Digital preservation. If I get hit by a bus tomorrow and, over the coming years, the PDF becomes obsolete while XHTML is no longer the way that we primarily consume texts, having the XML available will allow others to forward-migrate the content, should they so wish. This is just another of the strategies that we are employing, alongside CLOCKSS and LOCKSS preservation, to ensure the persistence of the journal beyond our lifespans.</li>
</ul>
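<p>For anyone curious about what the underlying markup actually looks like, here is a minimal sketch of an article file under the Journal Publishing Tag Set (version 3.0). The element structure follows the NLM tag library linked above, but every value -- identifiers, ISSN, author, title, citation -- is a placeholder for illustration rather than content taken from a real Orbit file.</p>

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN"
  "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
  <front>
    <!-- Metadata about the journal itself -->
    <journal-meta>
      <journal-id journal-id-type="publisher-id">orbit</journal-id>
      <journal-title-group>
        <journal-title>Orbit: Writing Around Pynchon</journal-title>
      </journal-title-group>
      <issn pub-type="epub">XXXX-XXXX</issn>
    </journal-meta>
    <!-- Metadata about this particular article -->
    <article-meta>
      <article-id pub-id-type="doi">10.XXXX/placeholder</article-id>
      <title-group>
        <article-title>A Placeholder Article Title</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name>
            <surname>Surname</surname>
            <given-names>Given Names</given-names>
          </name>
        </contrib>
      </contrib-group>
      <pub-date pub-type="epub">
        <year>2013</year>
      </pub-date>
      <permissions>
        <license license-type="open-access"
                 xlink:href="http://creativecommons.org/licenses/by/3.0/">
          <license-p>This article is distributed under a CC-BY licence.</license-p>
        </license>
      </permissions>
    </article-meta>
  </front>
  <!-- The article text itself, marked up into sections and paragraphs -->
  <body>
    <sec>
      <title>Introduction</title>
      <p>Body text, with pointers to the reference list such as
        <xref ref-type="bibr" rid="B1">(Pynchon 1973)</xref>.</p>
    </sec>
  </body>
  <!-- Back matter: the reference list that the galley tools render into PDF and XHTML -->
  <back>
    <ref-list>
      <ref id="B1">
        <mixed-citation>Pynchon, Thomas. <italic>Gravity's Rainbow</italic>.
          New York: Viking, 1973.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>
```

<p>In a real file it is, of course, the reference list that balloons: that is where the "no small job" of transcribing complex citations comes in.</p>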