Releasing the collection on GitHub

Late last week we released the Cooper-Hewitt’s collection metadata as a downloadable file. And in a first for the Smithsonian, we dedicated the release to the public domain, using Creative Commons Zero.

I’m often asked why releasing collection metadata is important. My teams did similar things at the Powerhouse Museum when I was there, and I still believe this is the direction that museums and other collecting institutions need to go. With the growing Digital Humanities field, there is increasing value in scholars being able to ‘see’ a collection at a macro, zoomed-out level – something that just isn’t possible with search interfaces. Likewise, the release of such data under liberal licenses or into the public domain brings closer a future in which cross-institutional discovery is the norm.

Philosophically, too, the public release of collection metadata asserts, clearly, that such metadata is the raw material on which interpretation – through exhibitions, catalogues, public programmes, and experiences – is built. On its own, unrefined, it is of minimal ‘value’ except as a tool for discovery. It also helps remind us that collection metadata is not the collection itself.

Of course it is more complex than that.

There are plenty of reasons why museums are hesitant to release their metadata.

Collection metadata is often of low quality. Sometimes it is purposely unrefined, especially in art museums, where historical circumstance and scholarly norms have meant that so-called ‘tombstone data’ has sometimes been kept to a bare minimum so as not to ‘bring opinion’ to objects. Other times it has simply been kept to a minimum because of a lack of staff resources. Often, too, internal workflows still keep exhibition label and catalogue publishing separate from collection documentation, meaning that obvious improvements, such as adding ‘label copy’ and catalogue narrative to object records, are not automatic.

But I digress.

We released our metadata through GitHub, and that needs some additional explanation.

GitHub is a source-code repository of the kind traditionally used by coders. Since we lacked a robust public endpoint of our own that could track changes and produce diff files as we uploaded new versions of the collection data, GitHub was the ideal candidate. Not only that, the type of ‘earlyvangelists’ we are targeting with the data release hang out there in quantity.

The idea of using GitHub to host collection datasets had actually been bouncing around since April 2009. Aaron Straup Cope and I were hanging out in between sessions at Museums and the Web in Indianapolis, talking about Solr, collection data, and APIs. Aaron suggested that GitHub would be the perfect place for museums to dump their collections – as giant text blobs – and certainly better than putting them on their own sites. Then 2010 happened, and the early-mover museums suddenly all had APIs for their collections. Making a text dump was off the agenda, but the idea of using GitHub still played on my mind.

Now, Cooper-Hewitt is not yet in a suitable position, infrastructurally, to develop an API for its collection. So when the time came to release the dataset, that conversation from 2009 suddenly became a reality.

And, fittingly, Aaron has been the first to fork the collection – creating an individual JSON file for each object record.
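
For the curious, here is a minimal sketch of how such a split might work: reading the CSV dump and writing one JSON file per object, sharded into directories the way Aaron’s fork is (e.g. objects/100/01/10001.json). The file name and the “id” column below are assumptions, not the dump’s actual schema.

    import csv
    import json
    import os

    # Shard by object id: first three digits, then the next two, then
    # the full id, e.g. objects/100/01/10001.json.
    def shard_path(object_id, root="objects"):
        return os.path.join(root, object_id[:3], object_id[3:5], object_id + ".json")

    with open("objects.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            path = shard_path(row["id"])
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w", encoding="utf-8") as out:
                json.dump(row, out, indent=2, ensure_ascii=False)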

Could GitHub become not just a source code repository but a repository for ‘cultural source code’?

(But read the data info first!)

15 thoughts on “Releasing the collection on GitHub”

  1. Nate Solas

    All around awesome, from Aaron’s brainstorming to Cooper-Hewitt getting on board with the idea. I did notice the other day when I went to have a look that I couldn’t get the data without doing a full git clone — the CSV is way too big to display online, so GitHub throws an error. I could obviously do a clone and get the file that way, but I guess my point is … I didn’t. (Yet.) I just wanted to see what the data looked like to decide what we might be able to do with it, and somehow cloning the whole repo seemed like a big deal at the time. Maybe something to think about? Even just a snippet of the file I could view online before I grabbed the whole thing?
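
    (For what it’s worth, a minimal sketch of one way to peek at the data without a clone, by streaming just the first few lines of the raw file over HTTP; the repo and file names below are assumptions.)

        import itertools
        import urllib.request

        # Stream the raw CSV and print only the first five lines, rather
        # than cloning the whole repository. The URL is an assumption.
        URL = "https://raw.github.com/cooperhewitt/collection/master/objects.csv"

        with urllib.request.urlopen(URL) as resp:
            for raw in itertools.islice(resp, 5):
                print(raw.decode("utf-8").rstrip())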

    Still, amazing stuff, and pretty brilliant use of GitHub. Well done, everyone.

  2. Aaron Straup Cope

    That is one reason for exporting the CSV dump as individual records. Well, actually there are three reasons:

    * One, to simply be able to browse the records one at a time, for example: https://github.com/straup/collection/blob/master/objects/100/01/10001.json

    * Two, to be able to edit those documents inline using the handy “Edit this page” link on GitHub

    * Three, in the hope that some day GitHub will set/allow CORS headers on the “raw” version of JSON docs in repos, so that they can be addressed and used by clients around the Internet, for example: https://raw.github.com/straup/collection/master/objects/100/01/10001.json
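
    (Until then, CORS only restricts in-browser JavaScript; a server-side or command-line client can already fetch those raw records. A minimal sketch, using the example record above:)

        import json
        import urllib.request

        # Fetch one raw per-object JSON record directly; no CORS headers
        # are needed outside the browser.
        URL = "https://raw.github.com/straup/collection/master/objects/100/01/10001.json"

        with urllib.request.urlopen(URL) as resp:
            record = json.load(resp)

        # The field names depend on the dump, so just list the keys.
        print(sorted(record.keys()))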

    That said, there are still some very real problems with exporting/creating lots of tiny files in a GitHub repo. Namely: the indexing is very slow. This appears to be a known git-ism and something the Facebook kids are struggling with: http://thread.gmane.org/gmane.comp.version-control.git/189776.

  3. Glen Barnes

    I don’t know if you’ve seen it, but one of the latest Wired articles has been released on GitHub. They are actively promoting people forking it and submitting pull requests.

    Glen 

    PS: Did you realise that you haven’t actually linked to your own repo, the one you talked about in your article (you linked to Aaron’s only)?

    1. cooperhewitt

      Oops! Well, it was linked in the first sentence, but it’s now added at the end too.

      Yes, the Wired article was serendipitously timed, eh?

  4. Mia

    It’s a great idea. I wish I’d done that for the Science Museum/NMSI .csv data, because almost immediately a few people sent back variations on cleaned data or converted it into other formats, and it would have made merging everything back so much easier.

    My only question is how usable GitHub is for novices. For example, on the download page it says ‘Sorry, there aren’t any downloads for this repository’, and that’s more prominent on the page than the .zip or .gz download link… Quite a few people managed to play around with the .csv files in Excel or whatever, but I don’t know if they would have survived GitHub.

    1. cooperhewitt

      Agreed. It’s always a trade-off, but I’m hoping that with the DH movement happening at a reasonable pace, more scholars of the type that might actually do something with the data will be ‘GitHub aware’, or will know someone in their faculty to ask.

      I expect we will add direct download links from our site in time – especially if good improvements are made.

      I’m counting down the hours before someone runs it through Google Refine and cleans it all up and resubmits it.
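
      (In the same spirit, a minimal sketch of the sort of cleanup pass a tool like Google Refine might do, here just collapsing stray whitespace in every field; the file names are assumptions.)

          import csv

          # Read the dump, normalise whitespace in every field, and write
          # a cleaned copy alongside it.
          with open("objects.csv", newline="", encoding="utf-8") as src, \
               open("objects-clean.csv", "w", newline="", encoding="utf-8") as dst:
              reader = csv.DictReader(src)
              writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
              writer.writeheader()
              for row in reader:
                  writer.writerow({k: " ".join(v.split()) if v else v
                                   for k, v in row.items()})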

    2. Micah Walter

      Mia,
      Apparently, I had to upload the file once manually to make it available on the Downloads page. It should be there now for all users, including people who aren’t logged in. There is also a “tags” tab where you can download .zip or .tar versions.

      GitHub also offers a really nice GUI tool for Mac users, so you don’t need to install the command-line version of Git. It’s available here: http://mac.github.com/, and you just need a free GitHub account to play.

      -micah

  5. Rob Landry

    Great work, Seb!  Even better might be for museums to develop a common standard for collections metadata and then an API that would broaden access to the materials easily.  I’m guessing somebody, somewhere already has that in the works…

    1. cooperhewitt

      Institutions do, on the whole, store and collect metadata according to common standards. There are also common standards – Dublin Core, CDWA Lite, and LIDO – for interoperability and sharing. But adoption of sharing standards is always slow, and technical interoperability is always much easier than cultivating an organisational desire to interoperate and share in the first place. Zorich, Waibel and Erway’s 2008 report, Beyond the Silos of the LAMs, is good reading on the challenges.
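
      (To make ‘interoperability standard’ concrete, a minimal sketch of crosswalking a record into Dublin Core elements; the source field names are hypothetical, and real crosswalks to CDWA Lite or LIDO are far richer.)

          # Map a hypothetical collection record onto simple Dublin Core
          # elements; dc:title, dc:creator and dc:date are real DC terms.
          record = {"title": "Side chair", "maker": "Unknown", "year": "1890"}

          dublin_core = {
              "dc:title": record["title"],
              "dc:creator": record["maker"],
              "dc:date": record["year"],
          }
          print(dublin_core)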

      APIs are unlikely to be a good solution, although almost all the museums that have them (Brooklyn, Powerhouse, V&A, Digital NZ, Europeana, etc.) have developed them with a RESTful interface informed heavily by Flickr’s API. This means that if you have developed against the Flickr API, then developing for the museum APIs is reasonably straightforward.
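
      (For illustration, a minimal sketch of what calling such a Flickr-style REST interface tends to look like; the endpoint, method name and parameters are invented, not any museum’s real API.)

          import json
          import urllib.parse
          import urllib.request

          # A hypothetical Flickr-style call: one REST endpoint, a "method"
          # parameter, and a JSON response format.
          BASE = "https://api.example-museum.org/rest/"
          params = urllib.parse.urlencode({
              "method": "collection.objects.search",
              "query": "chair",
              "format": "json",
          })

          with urllib.request.urlopen(BASE + "?" + params) as resp:
              results = json.load(resp)
          print(results)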

      I’m hopeful, though, that the national collections approach of the UK, Europe, Canada, Australia and NZ might finally eventuate here too. The idea that people might be able to navigate many institutions by time, geography or creator/artist, regardless of the institution, has been around since the 1960s and the first use of computers in museums. That it has taken 50+ years to get anything really usable in this space for the general public (some scholarly attempts have been somewhat successful at providing raw text) is testament to the philosophical difficulties more than the technical ones.

  6. Pingback: Links for 10 March 2012 - Chris Unitt's blog

  7. Jonathan Dahan

    Excellent job! The more examples there are of other museums pushing data out, the easier my job is in convincing the Met that it’s the best thing for us to do.

  8. Pete Forde

    This is really cool! I am curious to see what comes of your publishing efforts.

    That said, I’m curious whether you considered posting this data on BuzzData. We designed BuzzData to be the perfect place for people working with datasets to comment, fork and annotate what they are working on.

    Any comments or suggestions you have are certainly appreciated.

    Pete

  9. Pingback: Do you git it?: Open educational resources/practices meets software version control #ukoer – MASHe

  10. Pingback: Cooper-Hewitt Museum releases its collection metadata | Information Futures

  11. Pingback: cc0 and git for data | inkdroid
