10 things to do with $10k from DigitalNZ

November 23, 2009

Yesterday at NDF, DigitalNZ announced that they’re giving away two parcels of $10k seed funding. Apparently, though, it has to be used for actual digitisation rather than digitisation tools. I’m not really a content person, but it occurs to me that there are lots of small geek jobs that could be done for similar amounts of money. Here’s my list of ten cool DigitalNZ things that could be done for about $10k (or thereabouts):

  1. a linked-data view of DigitalNZ. Get approval from a subset of contributors to map the DigitalNZ records to linked-data. (I’m sure NZETC wouldn’t have a problem with this.)
  2. a DigitalNZ widget that contributors could add to their content pages to allow visitors to tag content with Māori Subject Headings. The heading would be added as a URL to DigitalNZ, and then mapped to a textual string for display once we know the language we want to display it in (which also means that we don’t have to worry about macrons and encoding issues). Repatriation of the metadata at the contributor’s convenience.
  3. as above except for the Iwi/Hapu list.
  4. as above except for a placename list (any placename list).
  5. a DigitalNZ widget that allows users to say that two DigitalNZ items have the same subject.
  6. a DigitalNZ widget that allows users to normalise the orthography of Māori text (add macrons on vowels; combine double vowels to macron’d vowels, etc.) in DigitalNZ records.
  7. a DigitalNZ search widget that returns (in addition to normally ranked results) the poorest metadata record that matches the search, presented in a way that invites the user to enrich the metadata.
  8. a collection of a dozen examples that use the API in different ways.
  9. a tool to normalise DigitalNZ records for terms which are more indicative of the contributor than of the record (like ‘record’ from library contributors and ‘item’ from archive contributors).
  10. a DigitalNZ search widget that uses Māori-English and English-Māori dictionaries and a bi-lingual placename list to automatically translate a mono-lingual query into a bi-lingual one.
  11. expose the Māori Subject Headings and the Iwi/Hapu list as linked data.
  12. find a way to expose DigitalNZ to http://www.data.govt.nz/ which aggregates data from non-cultural sectors. Maybe slice it by contributor?
  13. a tool for finding references to items elsewhere on the web.
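
To make idea 6 in the list above a little more concrete, here is a minimal Python sketch of the mechanical part of orthography normalisation. The function names are made up for illustration and this is not DigitalNZ code. Note the big assumption: a doubled vowel is not always a long vowel (it can simply span a morpheme boundary, as in "whakaako"), so the output should be treated as a suggestion for a person to confirm, which is exactly why the idea is a widget rather than a batch job.

```python
import re
import unicodedata

# Lower-case vowel -> precomposed macron vowel.
MACRON = {"a": "ā", "e": "ē", "i": "ī", "o": "ō", "u": "ū"}


def compose_macrons(text):
    """Turn 'vowel + combining macron' sequences into single precomposed characters."""
    return unicodedata.normalize("NFC", text)


def double_vowels_to_macrons(text):
    """Naively replace doubled vowels (aa, ee, ...) with a macron vowel.

    ASSUMPTION: every doubled vowel is a long vowel. That is not always
    true, so treat the result as a candidate for human review.
    """
    def replace(match):
        first = match.group(0)[0]
        macron = MACRON[first.lower()]
        # Keep the case of the first letter of the pair.
        return macron.upper() if first.isupper() else macron

    return re.sub(r"([aeiou])\1", replace, text, flags=re.IGNORECASE)


def suggest_normalised(text):
    """Apply both steps: compose existing macrons, then collapse doubled vowels."""
    return double_vowels_to_macrons(compose_macrons(text))
```

A widget would show the original and the suggestion side by side and only store the change once the user accepts it.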

[updated x2]

Cataloguing-in-Publication for ePubs

November 20, 2009

There’s been a whole swathe of developments on the eBook front recently, mainly a slew of new devices: both hardware devices such as the Kindle and Nook, and software readers such as Stanza and the Firefox plugin. These are mainly built around the ePub format, a standards-based format which is essentially a ubiquitous zip file stuffed full of XHTML files for the content and an XML file for metadata and navigation. ePub metadata is stored in a file conventionally called ‘content.opf’, which has a metadata tag that holds the document-level metadata (or item-level metadata if you’re an archivist).
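
Since an ePub really is just a zip file, pulling the metadata out takes only the standard library. The following is a rough Python sketch, not NZETC code; the function name is invented for illustration. It does not assume the file is called ‘content.opf’: it looks the package file up in META-INF/container.xml, which is where an ePub records its location.

```python
import zipfile
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"


def read_epub_metadata(epub_file):
    """Return {Dublin Core element name: [values]} for an ePub.

    epub_file may be a path or a file-like object.
    """
    with zipfile.ZipFile(epub_file) as archive:
        # container.xml points at the package (.opf) file.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find(f".//{{{CONTAINER_NS}}}rootfile")
        opf_path = rootfile.get("full-path")
        package = ET.fromstring(archive.read(opf_path))

    metadata = {}
    for element in package.iter():
        # Keep only elements in the Dublin Core namespace (dc:title etc.).
        if element.tag.startswith("{" + DC + "}"):
            name = element.tag.split("}", 1)[1]
            metadata.setdefault(name, []).append((element.text or "").strip())
    return metadata
```

Values are kept as lists because elements such as dc:creator can legitimately repeat.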

For ePubs, Cataloguing-in-Publication information has a significantly larger role than in print books, because of the possibility that sufficiently smart readers can use the information to build browse structures, insert links and create indexes. Indeed, there is in theory no reason why a reader couldn’t run a full OpenURL resolver to link local copies of references. ePubs also differ because the removal of the print-run overhead means that the ePub can be updated with better bibliographic information at any point.
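
As a toy illustration of what "building a browse structure" could mean, here is a hedged Python sketch. It assumes each book's metadata has already been read into a dict mapping Dublin Core element names to lists of values; the function name and the second book are invented for the example.

```python
from collections import defaultdict


def build_browse_index(library, field):
    """Group book titles under each value of one metadata field.

    library: list of dicts such as {"title": [...], "creator": [...]}.
    field:   the element to browse by, e.g. "creator" or "subject".
    """
    index = defaultdict(list)
    for book in library:
        title = book.get("title", ["(untitled)"])[0]
        for value in book.get(field, []):
            index[value].append(title)
    # Sort headings and the titles under them for a stable display order.
    return {heading: sorted(titles) for heading, titles in sorted(index.items())}


# Example data: the first record echoes a real NZETC work, the second is made up.
library = [
    {
        "title": ["Moko; or Maori Tattooing"],
        "creator": ["Major-General Robley", "Sir George Grey"],
        "subject": ["Historical Māori and Pacific Islands"],
    },
    {
        "title": ["A Made-Up Second Book"],
        "creator": ["Sir George Grey"],
        "subject": ["Historical Māori and Pacific Islands"],
    },
]

by_creator = build_browse_index(library, "creator")
```

The better the Cataloguing-in-Publication data inside each file, the more useful a structure like this gets, with no network connection needed.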

When we (the NZETC) digitally publish works, we generate metadata tags that look like:

  <metadata xmlns:opf="http://www.idpf.org/2007/opf" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/">
    <dc:title>Moko; or Maori Tattooing</dc:title>
    <dc:language xsi:type="dcterms:RFC3066">en</dc:language>
    <dc:identifier opf:scheme="URI" id="dcidid">http://www.nzetc.org/tm/scholarly/tei-RobMoko.html</dc:identifier>
    <dc:subject>Historical Māori and Pacific Islands</dc:subject>
    <dc:creator>Major-General Robley</dc:creator>
    <dc:creator>T. Dunbabin</dc:creator>
    <dc:creator>H. Fildes</dc:creator>
    <dc:creator>Sir George Grey</dc:creator>
    <dc:creator>George Grey</dc:creator>
    <dc:creator>Alexander McLeay</dc:creator>
    <dc:publisher>New Zealand Electronic Text Centre</dc:publisher>
    <dc:date xsi:type="dcterms:W3CDTF">2007-08-07T21:18:20</dc:date>
    <dc:rights>Creative Commons (see front page)</dc:rights>
    <meta content="cover" name="cover"/>
    <meta content="http://www.nzetc.org/tm/scholarly/tei-RobMoko.html" name="DC.relation.isFormatOf"/>
  </metadata>

There’s no ISBN associated with the original work (it’s far too early). We have several times considered adding ISBNs to our ePubs and decided against it. The problem is that ISBNs identify editions, and we regularly (sometimes weekly) reissue new editions of all our ePubs. For example, I’ve noticed while writing this that there should be a tag indicating the year of publication in the metadata tag. If I update the scripts to add the date and regenerate all our ePubs, then my reading of the ISBN rules says that those are new editions and should have new ISBNs assigned to them. This doesn’t seem to me like a sane approach to the identification of works. We have 1300 works, each of which has an ePub; I can’t imagine that our ISBN issuing authority would be very happy if we churned through 1300 ISBNs every time I found a bug in my ePub generation scripts.

What is needed, I believe, is a way to label works rather than editions.

Referencing works rather than editions is not a panacea, of course. There are real differences between works and editions, particularly when it comes to educational performances (such as classes reading aloud together) and deep links (references to ‘page 678’ make little sense if the edition to hand is a video rather than a textual edition).

What systems are there for identifying works? Well, the most widely used one appears to be that used by LibraryThing. For example, the work documented above is http://www.librarything.com/work/4901325. [The more numerous authors recorded by us are annotators and the authors of material cut’n’pasted into the version of the work we have chosen to scan.] Of course, if we’re matching against LibraryThing, we can also leverage their tags and common knowledge; they seem to be doing a much better job of genre description than the core library classifications. What might it look like? An addition at the end:

  <metadata xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/">
    <!-- existing elements as above -->
    <dc:identifier opf:scheme="URI" id="librarything">http://www.librarything.com/work/4901325</dc:identifier>
    <meta name="keywords" content="anthropology Maori New Zealand"/>
  </metadata>
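
If this were scripted rather than hand-edited, the addition might look something like the following Python sketch. This is an illustration only, not the NZETC generation scripts: the function name and the default id value are assumptions, and the work URL is simply the one quoted above.

```python
import xml.etree.ElementTree as ET

OPF = "http://www.idpf.org/2007/opf"
DC = "http://purl.org/dc/elements/1.1/"

# Ask ElementTree to write the familiar prefixes instead of ns0, ns1, ...
ET.register_namespace("opf", OPF)
ET.register_namespace("dc", DC)


def add_work_identifier(opf_xml, work_url, element_id="work-id"):
    """Return the OPF document with a work-level dc:identifier appended.

    element_id is a hypothetical id; it must be a valid XML ID,
    so it cannot itself be a URL.
    """
    root = ET.fromstring(opf_xml)
    # Find <metadata> whatever namespace prefix the document uses.
    metadata = next(
        el for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "metadata"
    )

    # Do nothing if the identifier is already present, so regenerating
    # the ePub repeatedly does not pile up duplicates.
    for existing in metadata.findall(f"{{{DC}}}identifier"):
        if (existing.text or "").strip() == work_url:
            return ET.tostring(root, encoding="unicode")

    identifier = ET.SubElement(
        metadata,
        f"{{{DC}}}identifier",
        {f"{{{OPF}}}scheme": "URI", "id": element_id},
    )
    identifier.text = work_url
    return ET.tostring(root, encoding="unicode")
```

ElementTree will write the package elements with an explicit opf: prefix rather than a default namespace; that is equivalent XML, though it would be worth checking how real reading software copes with it.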

Of course, if LibraryThing has reviews, ratings, descriptions and other metadata, we can add them too. But for most of the NZETC works, that won’t be an issue, since most of the works we’ve republished have been out of print for more than 100 years and so aren’t well represented in LibraryThing, which for all its depth and breadth is heavy on recently published works held in smaller libraries.

Lastly there is the issue of licence. The LibraryThing for Libraries terms of use are clearly written with a very specific use in mind and really don’t work here. We’d have to negotiate some kind of agreement, I suspect. Since we’d be cross-linking with LibraryThing and giving them the full text, I’d hope they’d be willing to let us have (and update) the metadata.

See also http://blog.threepress.org/2009/11/18/whats-in-an-identifier/

Does this sound like a sane idea? Is there anyone else out there doing any good work at the work level rather than the edition level?

It is tempting to put a chunk of linkeddata/RDF into the ePubs as Cataloguing-in-Publication information, but this information needs to be useful (a) without a network connection and (b) without computational heavy lifting, both of which militate against linkeddata/RDF.

[Update: it’s been pointed out to me that the NZETC could do a better job of using local identifiers in ePubs, which would solve the problem of identifying multiple editions of the same work by the same publisher.]

[Update: yes, we should be using Ngā Ūpoko Tukutuku / Māori Subject Headings, my excuses are: (a) that no one is doing anything useful with them yet anyway; (b) I’ve had no training on how to use them (and they’re a tool designed by librarians not computer scientists, so they’re not intuitive to me, I _need_ training)].

Giving saxon-xslt large amounts of memory

August 4, 2009

I was running the following XSLT 2.0 script:

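<?xml version="1.0" encoding="UTF-8"?>
<!-- Namespace URIs assumed: the standard XSLT, Saxon extension and TEI P5 namespaces -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

    <!-- Every XML file in ./tei/ -->
    <xsl:variable name="files" select="collection('./tei/?select=*.xml')"/>

    <xsl:template match="/">
        <!-- discard-document lets Saxon drop each file once it has been processed -->
        <xsl:for-each select="for $x in $files return saxon:discard-document($x)">
            <xsl:for-each select=".//tei:name">
                <xsl:value-of select=".//text()"/>
            </xsl:for-each>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>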

It just iterates over every file and every name within each file, and outputs a list of the text of the names, with lots of whitespace (I detest XML all scrunched up without whitespace).
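
For comparison only, roughly the same extraction can be written in Python with the standard library. This is a different technique from the Saxon script above, not a translation of it: it parses one file at a time, so memory use is bounded by the largest single file. The TEI namespace URI is an assumption (TEI P5), the function name is invented, and whitespace inside each name is normalised, which the XSLT does not do.

```python
import glob
import os
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"  # assumed: TEI P5 namespace


def iter_names(directory):
    """Yield the text of every tei:name in every *.xml file in directory."""
    for path in sorted(glob.glob(os.path.join(directory, "*.xml"))):
        # Only this one document is held in memory at a time.
        tree = ET.parse(path)
        for name in tree.iter(f"{{{TEI}}}name"):
            text = "".join(name.itertext())
            # Collapse runs of whitespace and line breaks to single spaces.
            yield " ".join(text.split())


# Example usage:
#   for name in iter_names("./tei"):
#       print(name)
```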

Alas it was giving me “Exception in thread “main” java.lang.OutOfMemoryError: Java heap space” errors, so I knew I had to give it more memory. The trick turned out to be changing the command line from:

saxonb-xslt -ext:on get-all-names.xsl get-all-names.xsl

to:

java -Xmx8000M -jar /usr/share/java/openoffice/saxon9.jar -ext:on get-all-names.xsl get-all-names.xsl

This, of course, is on a 64-bit machine with 8 GB of RAM, so Java can stretch out its legs into that space. Worked a treat.
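
If the same job has to run on machines with different amounts of memory, the command line can be built rather than typed. A small Python sketch follows; the function name is invented, and the jar path is just the one from the command above, so adjust it for your system.

```python
import subprocess

# Jar location copied from the command above; it will differ between systems.
SAXON_JAR = "/usr/share/java/openoffice/saxon9.jar"


def saxon_command(source, stylesheet, heap_mb, jar=SAXON_JAR):
    """Build the argument list for running Saxon with a given maximum heap size."""
    return [
        "java",
        f"-Xmx{heap_mb}M",  # maximum Java heap, in megabytes
        "-jar", jar,
        "-ext:on",          # allow Saxon extension functions
        source,
        stylesheet,
    ]


def run_saxon(source, stylesheet, heap_mb):
    """Run the transformation and return whatever it wrote to standard output."""
    command = saxon_command(source, stylesheet, heap_mb)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


# Example usage, equivalent to the command line above:
#   names = run_saxon("get-all-names.xsl", "get-all-names.xsl", 8000)
```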