---
layout: post
status: publish
published: true
title: Metadata handling for Open Access Journal PDFs
wordpress_id: 2114
wordpress_url: https://www.martineve.com/?p=2114
date: !binary |-
  MjAxMi0wNi0wNiAwODo1OToxNyArMDIwMA==
date_gmt: !binary |-
  MjAxMi0wNi0wNiAwODo1OToxNyArMDIwMA==
categories:
- Technology
- Open Access
- Academia
- Mendeley
tags:
- Technology
- OA
- metadata
comments:
---
As I count down to the launch of Orbit: Writing around Pynchon, I've been thinking carefully about the mechanisms through which the articles will be consumed. In short: what metadata should be in the PDFs, and where should it go?
Obviously, I want the metadata to be visible to the human eye, but what about embedding it within the PDF's proper metadata mechanism? Apache FOP, which I'm using to do the transforms, has the facility to do this. However, do other journals bother?
Here's a metadata dump using pdftk on a top-rank Taylor and Francis journal in English literature:
InfoValue: iText 2.1.4 (by lowagie.com)
That's not especially descriptive!
By contrast, my XSL transform is producing the following:
InfoValue: meXml: Martin Eve's XML Generator. https://www.martineve.com/
InfoValue: Generating PDFs from OJS
InfoValue: Apache FOP Version 1.0
InfoValue: Martin Paul Eve
InfoValue: It has long been desirable to create PDF files from a standard XML base. This plugin allows that to happen using a combination of OJS, Saxon and FOP.
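For anyone wanting to compare dumps like these across several PDFs, here is a minimal Python sketch that turns pdftk's `dump_data` text into a dictionary. It assumes the usual layout in which each `InfoKey:` line is followed by its `InfoValue:` line. Only the values are shown above, so the key names in the sample (`Creator`, `Title`, `Producer`, `Author`) are my assumption about which field each value belongs to.

```python
def parse_pdftk_info(dump: str) -> dict:
    """Pair each InfoKey line with the InfoValue line that follows it."""
    info = {}
    key = None
    for line in dump.splitlines():
        if line.startswith("InfoKey: "):
            key = line[len("InfoKey: "):].strip()
        elif line.startswith("InfoValue: ") and key is not None:
            info[key] = line[len("InfoValue: "):].strip()
            key = None  # a value without a preceding key is ignored
    return info


# Sample input: values are from the dump above, key names are assumed.
SAMPLE_DUMP = """InfoBegin
InfoKey: Creator
InfoValue: meXml: Martin Eve's XML Generator. https://www.martineve.com/
InfoBegin
InfoKey: Title
InfoValue: Generating PDFs from OJS
InfoBegin
InfoKey: Producer
InfoValue: Apache FOP Version 1.0
InfoBegin
InfoKey: Author
InfoValue: Martin Paul Eve
"""

if __name__ == "__main__":
    for field, value in parse_pdftk_info(SAMPLE_DUMP).items():
        print(f"{field}: {value}")
```

In practice the input would come from something like `pdftk article.pdf dump_data`, where `article.pdf` is a placeholder filename.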
Interestingly, though, Zotero detects the Taylor and Francis PDF perfectly. So where is it getting its information?
The great JISC document on PDF metadata extraction mechanisms has the following for Zotero:
Zotero uses "Google Scholar Results as well as DOIs on the first page to get metadata and that works in a large majority of cases". This implies that metadata extraction relies on converting the PDF to text at the client, using Regular Expressions to detect the DOI string, and submitting that string to Google Scholar or doi.org to retrieve the matching record.
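The DOI detection step described there can be sketched roughly as follows. This is not Zotero's actual regular expression: the pattern is an assumed, fairly permissive one based on the common `10.<registrant>/<suffix>` shape of DOIs, and the DOI in the usage example is made up.

```python
import re

# Assumed pattern, not Zotero's: "10.", a 4-9 digit registrant code,
# a slash, then any run of characters that are not whitespace or quotes.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_first_doi(text):
    """Return the first DOI-like string in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the sentence, not the DOI.
    return match.group(0).rstrip(".,;)")


if __name__ == "__main__":
    # Hypothetical first-page text with a made-up DOI.
    page = "Orbit: Writing Around Pynchon\nhttp://dx.doi.org/10.1234/example.5678."
    print(find_first_doi(page))
```

The string this returns is what would then be submitted to a resolver such as doi.org to fetch the matching record.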
That all sounds good. So, as a test, I changed the details in my test document to match a second article that I knew worked: I changed the author, title and DOI to match that article, and I even put in a URL pointer to dx.doi.org/.....
However, Zotero still wouldn't pick it up; it completely misidentified the document. So I decided to dive into the mechanics.
The first step in Zotero's recognition code runs pdftotext on the file. The command it assembles looks somewhat like this: pdftotext -enc UTF-8 -nopgbrk -l '3' new.pdf /your/zotero/directory/recognizePDFcache.txt.
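To reproduce that step outside Zotero, the same command can be assembled in Python. This is a sketch rather than Zotero's own code: the function name is mine, and the paths are the placeholders from the command above. The flags are standard pdftotext options (`-enc` sets the output encoding, `-nopgbrk` suppresses page-break characters, `-l` sets the last page to convert).

```python
import shlex


def build_pdftotext_command(pdf_path, cache_path, last_page=3):
    """Build the argument list for a first-pages-only text extraction."""
    return [
        "pdftotext",
        "-enc", "UTF-8",       # output encoding
        "-nopgbrk",            # no form feeds between pages
        "-l", str(last_page),  # stop after this page
        pdf_path,
        cache_path,
    ]


cmd = build_pdftotext_command(
    "new.pdf", "/your/zotero/directory/recognizePDFcache.txt"
)

if __name__ == "__main__":
    # Show the command as it would be typed in a shell.
    print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would actually run the extraction, provided pdftotext is installed and the paths exist.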
So far so good. The output I got looked a little like this:
Orbit: Writing Around Pynchon
Author Name Redacted
University of Sussex
28 September 2011
It has long been desirable to create PDF files from a standard XML base.
I can confirm that my DOI passes the match test here.
The next step Zotero takes is to work out how many lines are in the document. If there are fewer than 20 lines, it assumes that the document doesn't contain OCRed text and returns a fail.
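A rough Python equivalent of that line-count check is below. The threshold of 20 comes from the description above; whether Zotero counts blank lines is something I have not verified, so ignoring them here is an assumption.

```python
MIN_LINES = 20  # threshold described above


def has_enough_text(extracted, min_lines=MIN_LINES):
    """Return True if the extracted text has at least min_lines lines.

    Blank lines are ignored, which is an assumption rather than
    confirmed Zotero behaviour.
    """
    lines = [line for line in extracted.splitlines() if line.strip()]
    return len(lines) >= min_lines


if __name__ == "__main__":
    short_text = "Orbit: Writing Around Pynchon\nUniversity of Sussex\n"
    print(has_enough_text(short_text))
```

A short test document like the output shown earlier would fail this check on its own, so a test PDF needs enough body text to get past it.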
Zotero also has a debug function, though, so I enabled that at this point. When I looked in the log, the DOI number was not being picked up by Zotero's internal pdftotext. In fact, Zotero's version of pdftotext seems to disregard anything inside a table.
The second I put the DOI number in a non-table area, it was detected.
tl;dr: make sure your DOI numbers are somewhere that Zotero's version of pdftotext can read them.
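One way to catch this before publishing is to run pdftotext over each generated PDF and confirm that the DOI you expect actually appears in the output. A minimal sketch of the comparison, with a hypothetical function name and a made-up DOI:

```python
def doi_survives_extraction(expected_doi, extracted_text):
    """Return True if expected_doi appears in the extracted text.

    DOIs are case-insensitive, so both sides are lowercased first.
    """
    return expected_doi.lower() in extracted_text.lower()


if __name__ == "__main__":
    # Made-up DOI and text, standing in for real pdftotext output.
    text = "Orbit: Writing Around Pynchon\nDOI: 10.1234/Example.5678\n"
    print(doi_survives_extraction("10.1234/example.5678", text))
```

Note that a system copy of pdftotext may not behave identically to the one Zotero ships, so a pass here is a good sign rather than a guarantee.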
Featured image by TJOwens under a CC-BY license.