---
layout: post
status: publish
published: true
title: Exposing XML data for Orbit
wordpress_id: 2654
wordpress_url: https://www.martineve.com/?p=2654
date: !binary |-
  MjAxMy0wMy0yOSAxOTo1OTo1MSArMDEwMA==
date_gmt: !binary |-
  MjAxMy0wMy0yOSAxOTo1OTo1MSArMDEwMA==
categories:
- Technology
- Open Access
tags:
- XML
- XSLT
comments: []
---
Although, for now, this will probably be of limited interest or use to most readers of the journal, I today undertook the necessary work (by which I mean: cleaning up for compliance!) to expose the XML files that power the typesetting behind my journal of Pynchon studies, Orbit: Writing Around Pynchon.
As you can see if you visit Issue 1, all the articles now have their XML available for download. I intend to work sequentially through the rest of our published articles to expose this data. Let me explain what this means and then why I think it's important.
The way that I construct articles for the journal is, after successful peer review, to transcribe the Word documents that we are sent into Extensible Markup Language (XML) under a document type definition provided by the National Library of Medicine. This Journal Publishing Tag Set specifies how the document should be structured for compliance, and the NLM also provides some sample tools to produce output.
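For readers unfamiliar with the format, a minimal article under the NLM Journal Publishing DTD looks something like this. This is a hand-written sketch for illustration, not one of Orbit's actual files; the element names are drawn from version 3.0 of the tag set:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN"
  "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd">
<article article-type="research-article">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Orbit: Writing Around Pynchon</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An Example Article Title</article-title>
      </title-group>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Introduction</title>
      <p>The article text goes here, marked up semantically.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="r1"><mixed-citation>An example citation.</mixed-citation></ref>
    </ref-list>
  </back>
</article>
```

The point is that everything — journal metadata, article metadata, body text, references — lives in one semantically tagged file, rather than being baked into a particular presentation format.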
Once I've got the XML file together -- and this can be no small job in the case of complex citations -- I run it through my custom galley production suite, meXml. Running the tools/gengalleys.sh script produces PDF and XHTML output from the same file, so I know that they are synchronised. With any luck, in the near future I will also be able to run a transform on the XML to produce the documents that I need to send to CrossRef.
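To give a flavour of the kind of transform involved — this is a toy example, not the meXml code itself — a small XSLT stylesheet can pull a piece of metadata out of the article XML. Here, a hypothetical extract-title.xsl grabs the article title:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- extract-title.xsl: an illustrative sketch, not part of meXml -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- Match the document root and emit only the article title as plain text -->
  <xsl:template match="/">
    <xsl:value-of
      select="/article/front/article-meta/title-group/article-title"/>
  </xsl:template>
</xsl:stylesheet>
```

Run with a standard processor such as `xsltproc extract-title.xsl article.xml`. The real galley pipeline applies much larger stylesheets of this kind to generate the XHTML output, and a similar transform can reshape the same metadata into the deposit format that CrossRef expects.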
Why do this, though? Why not just botch together a PDF and HTML exported from Word? A few important reasons: