Cataloguing-in-Publication for ePubs

There has been a whole swathe of developments on the eBook front recently, mainly a slew of new devices: hardware devices such as the Kindle and Nook, and software devices such as Stanza and the Firefox plugin. These are mainly built around the ePub format, a standards-based format which is essentially a zip file stuffed full of XHTML files for the content and an XML file for metadata and navigation. ePub metadata is stored in a file usually called ‘content.opf’, which has a metadata tag holding the document-level metadata (or item-level metadata if you’re an archivist).
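As a minimal sketch of how little machinery is needed to get at this metadata, the Python below opens an ePub as a zip file, follows ‘META-INF/container.xml’ to the package file, and collects the Dublin Core elements. The function names and the tiny in-memory sample are assumptions made for illustration (the sample is not a complete, valid ePub), and this is not the NZETC's own tooling.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the container file and the OPF package document.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_epub_metadata(epub_file):
    """Return {element name: [text values]} for the Dublin Core metadata."""
    with zipfile.ZipFile(epub_file) as zf:
        # container.xml records where the package file (often content.opf) lives.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).get("full-path")
        package = ET.fromstring(zf.read(opf_path))

    metadata = {}
    for element in package.find("opf:metadata", NS):
        # Keep only Dublin Core elements; <meta> elements are skipped.
        if element.tag.startswith("{" + NS["dc"] + "}"):
            name = element.tag.split("}", 1)[1]
            metadata.setdefault(name, []).append((element.text or "").strip())
    return metadata


def make_sample_epub():
    """Build a tiny in-memory zip for demonstration (hypothetical sample)."""
    container = (
        '<container version="1.0" '
        'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="OEBPS/content.opf" '
        'media-type="application/oebps-package+xml"/></rootfiles>'
        '</container>'
    )
    opf = (
        '<package xmlns="http://www.idpf.org/2007/opf" version="2.0" '
        'unique-identifier="dcidid">'
        '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" '
        'xmlns:opf="http://www.idpf.org/2007/opf">'
        '<dc:title>Moko; or Maori Tattooing</dc:title>'
        '<dc:language>en</dc:language>'
        '<dc:identifier opf:scheme="URI" id="dcidid">'
        'http://www.nzetc.org/tm/scholarly/tei-RobMoko.html</dc:identifier>'
        '<dc:creator>Major-General Robley</dc:creator>'
        '<dc:creator>T. Dunbabin</dc:creator>'
        '<meta content="cover" name="cover"/>'
        '</metadata></package>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("META-INF/container.xml", container)
        zf.writestr("OEBPS/content.opf", opf)
    buffer.seek(0)
    return buffer


sample_metadata = read_epub_metadata(make_sample_epub())
for name, values in sample_metadata.items():
    print(name, values)
```

To try it on a real file, pass a path instead of the sample buffer, for example `read_epub_metadata("some-book.epub")` (a hypothetical filename).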

For ePubs, Cataloguing-in-Publication information has a significantly larger role than in print books, because sufficiently smart readers can use the information to build browse structures, insert links and create indexes. Indeed, there is in theory no reason why a reader couldn’t run a full OpenURL resolver to link local copies of references. ePubs also differ because the removal of the print-run overhead means that the ePub can be updated with better bibliographic information at any point.
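To make the browse-structure idea concrete, here is a rough sketch of a subject index built purely from metadata already held on the device, with no network access. The dictionary layout (element name mapped to a list of values) and the second library entry are assumptions for illustration, not real catalogue data.

```python
from collections import defaultdict


def build_subject_index(library):
    """Map each subject heading to a sorted list of titles filed under it."""
    index = defaultdict(list)
    for book in library:
        for subject in book.get("subject", []):
            index[subject].extend(book.get("title", []))
    # Sort subjects and titles so the browse list is stable between runs.
    return {subject: sorted(titles) for subject, titles in sorted(index.items())}


# Metadata as a reader might have extracted it from locally stored ePubs.
library = [
    {
        "title": ["Moko; or Maori Tattooing"],
        "subject": ["Historical Māori and Pacific Islands"],
    },
    {
        # Hypothetical second book, invented purely for this example.
        "title": ["Example Work B"],
        "subject": ["Historical Māori and Pacific Islands", "Example Subject"],
    },
]

for subject, titles in build_subject_index(library).items():
    print(subject)
    for title in titles:
        print("   ", title)
```

The better the subject metadata shipped inside each ePub, the more useful a structure like this becomes.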

When we (the NZETC) digitally publish works, we generate metadata tags that look like:

  <metadata xmlns:opf="http://www.idpf.org/2007/opf" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dcterms="http://purl.org/dc/terms/">
    <dc:title>Moko; or Maori Tattooing</dc:title>
    <dc:language xsi:type="dcterms:RFC3066">en</dc:language>
    <dc:language>en</dc:language>
    <dc:identifier opf:scheme="URI" id="dcidid">http://www.nzetc.org/tm/scholarly/tei-RobMoko.html</dc:identifier>
    <dc:subject>Historical Māori and Pacific Islands</dc:subject>
    <dc:description/>
    <dc:creator>Major-General Robley</dc:creator>
    <dc:creator>T. Dunbabin</dc:creator>
    <dc:creator>H. Fildes</dc:creator>
    <dc:creator>Sir George Grey</dc:creator>
    <dc:creator>George Grey</dc:creator>
    <dc:creator>Alexander McLeay</dc:creator>
    <dc:publisher>New Zealand Electronic Text Centre</dc:publisher>
    <dc:date xsi:type="dcterms:W3CDTF">2007-08-07T21:18:20</dc:date>
    <dc:rights>Creative Commons (see front page)</dc:rights>
    <meta content="cover" name="cover"/>
    <meta content="http://www.nzetc.org/tm/scholarly/tei-RobMoko.html" name="DC.relation.isFormatOf"/>
  </metadata>

There’s no ISBN associated with the original work (it’s far too early). We have several times considered adding ISBNs to our ePubs and decided against it. The problem is that ISBNs identify editions, and we regularly (sometimes weekly) issue new editions of all our ePubs. For example, I’ve noticed while writing this that there should be a tag indicating the year of publication in the metadata tag. If I update the scripts to add the date and regenerate all our ePubs, then my reading of the ISBN rules says that those are new editions and should have new ISBNs assigned to them. This doesn’t seem to me like a sane approach to the identification of works. We have 1300 works, each of which has an ePub; I can’t imagine that our ISBN issuing authority would be very happy if we churned through 1300 ISBNs every time I found a bug in my ePub generation scripts.

What is needed, I believe, is a way to label works rather than editions.

Referencing works rather than editions is not a panacea, of course: there are some differences between works and editions, particularly when it comes to educational performances (such as classes reading aloud together) and deep links (references to ‘page 678’ make little sense if the edition to hand is a video rather than a textual edition).

What systems are there for identifying works? Well, the most widely used one appears to be that used by LibraryThing. For example, the edition documented above is http://www.librarything.com/work/4901325. [The more numerous authors recorded by us are annotators and the authors of material cut’n’pasted into the version of the work we have chosen to scan.] Of course, if we’re matching against LibraryThing, we can also leverage their tags and common knowledge; they seem to be doing a much better job of genre description than the core library classifications. What might it look like? An addition at the end:

  <metadata xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dcterms="http://purl.org/dc/terms/">
    ....
    <dc:identifier opf:scheme="URI" id="librarything">http://www.librarything.com/work/4901325</dc:identifier>
    <meta content="anthropology Maori New Zealand" name="keywords"/>
  </metadata>
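Since the ePubs are regenerated by script, an addition like this would presumably be made programmatically. The sketch below appends a work-level dc:identifier to an existing metadata element and does nothing if that identifier is already present, so re-running the generator does not pile up duplicates. The function name and the id value ‘librarything’ are assumptions for illustration; this is not the NZETC's generation code.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OPF = "http://www.idpf.org/2007/opf"

# Keep the familiar prefixes when the tree is serialised again.
ET.register_namespace("dc", DC)
ET.register_namespace("opf", OPF)


def add_work_identifier(metadata, work_uri, element_id):
    """Append a dc:identifier for a work-level URI unless it already exists."""
    for existing in metadata.findall("{%s}identifier" % DC):
        if (existing.text or "").strip() == work_uri:
            return existing
    identifier = ET.SubElement(metadata, "{%s}identifier" % DC)
    identifier.set("{%s}scheme" % OPF, "URI")
    # Hypothetical id value; it needs to be a plain XML name, not a URL.
    identifier.set("id", element_id)
    identifier.text = work_uri
    return identifier


# Cut-down version of the metadata shown earlier in the post.
sample = (
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" '
    'xmlns:opf="http://www.idpf.org/2007/opf">'
    '<dc:title>Moko; or Maori Tattooing</dc:title>'
    '<dc:identifier opf:scheme="URI" id="dcidid">'
    'http://www.nzetc.org/tm/scholarly/tei-RobMoko.html</dc:identifier>'
    '</metadata>'
)

metadata = ET.fromstring(sample)
work_uri = "http://www.librarything.com/work/4901325"
add_work_identifier(metadata, work_uri, "librarything")
add_work_identifier(metadata, work_uri, "librarything")  # second call is a no-op

print(ET.tostring(metadata, encoding="unicode"))
```

Keeping the existing edition-level identifier alongside the new one leaves it to the reading application to choose whichever identifier it understands.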

Of course, if LibraryThing has reviews, ratings, descriptions and other metadata, we can add them too. But for most of the NZETC works, that won’t be an issue, since most of the works we’ve republished have been out of print for more than 100 years and so aren’t well represented in LibraryThing, which, for all its depth and breadth, is heavy on recently published works held in smaller libraries.

Lastly, there is the issue of licence. The LibraryThing for Libraries terms of use are clearly written with a very specific use in mind and really don’t work here. We’d have to negotiate some kind of agreement, I suspect. Since we’d be cross-linking with LibraryThing and giving them the full text, I’d hope they’d be willing to let us have (and update) the metadata.

See also http://blog.threepress.org/2009/11/18/whats-in-an-identifier/

Does this sound like a sane idea? Is there anyone else out there doing any good work at the work level rather than the edition level?

It is tempting to put a chunk of linked data/RDF into the ePubs as Cataloguing-in-Publication information, but this information needs to be useful (a) without a network connection and (b) without computational heavy lifting, both of which militate against linked data/RDF.

[Update: it’s been pointed out to me that the NZETC could do a better job of using local identifiers in ePubs, which would solve the problem of identifying multiple editions of the same work by the same publisher.]

[Update: yes, we should be using Ngā Ūpoko Tukutuku / Māori Subject Headings; my excuses are (a) that no one is doing anything useful with them yet anyway and (b) that I’ve had no training on how to use them (and they’re a tool designed by librarians, not computer scientists, so they’re not intuitive to me; I _need_ training).]

2 Responses to “Cataloguing-in-Publication for ePubs”

  1. Douglas Campbell Says:

    We live in a world where everything has multiple identifiers, but we only have control over identifiers we create ourselves.

    So I think the first thing to do is establish your own identifiers; you’re already doing that for your names I think? So maybe eg. http://www.nzetc.org/work/7384 and http://www.nzetc.org/edition/7384.22 ? Then later on you can match those to other identifiers for crosswalking, linked data, etc. So there might be multiple dc:identifier elements – other apps can choose which identifier works for them.

    You can re-use other people’s identifiers instead of minting your own, but you must be confident they are good identifiers and persistent, otherwise you risk building your identification system on ‘sand’.

  2. stuartyeates Says:

    Thanks for your comment Douglas.

    We do actually have URLs for what we call ‘abstract texts’ which are ‘works’ in FRBR. For example the URL for this abstract text is: http://www.nzetc.org/tm/scholarly/name-102939.html

    As for control over identifiers, I’d much rather have authority than control for my identifiers.

    cheers
    stuart