Tag Archives: github

Releasing the collection on GitHub

Late last week we released the Cooper-Hewitt’s collection metadata as a downloadable file. And in a first for the Smithsonian, we dedicated the release to the public domain, using Creative Commons Zero.

I’m often asked why releasing collection metadata is important. My teams did similar things at the Powerhouse Museum when I was there, and I still believe that this is the direction that museums and other collecting institutions need to go. With the growing Digital Humanities field, there is increasing value in scholars being able to ‘see’ a collection at a macro, zoomed out level – something which just isn’t possible with search interfaces. Likewise the release of such data under liberal licenses or to the public domain brings closer a future in which cross-institutional discovery is the norm.

Philosophically, too, the public release of collection metadata asserts, clearly, that such metadata is the raw material on which interpretation through exhibitions, catalogues, public programmes, and experiences are built. On its own, unrefined, it is of minimal ‘value’ except as a tool for discovery. It also helps remind us that collection metadata is not the collection itself.

Of course it is more complex than that.

There are plenty of reasons why museums are hesitant to release their metadata.

Collection metadata is often in a low quality state. Sometimes it is purposely unrefined, especially in art museums where historical circumstance and scholarly norms have meant that so called ‘tombstone data’ has sometimes been kept to a bare minimum so as to not ‘bring opinion’ to objects. Other times it has simply been kept at a minimum because of a lack of staff resources. Often, too, internal workflows still keep exhibition label and catalogue publishing separate from collection documentation meaning that obvious improvements such as the rendering of ‘label copy’ and catalogue narrative to object records is not automatic.

But I digress.

We released our metadata through GitHub, and that needs some additional explanation.

GitHub is a source repository of the kind traditionally used by coders. And, lacking a robust public endpoint of our own which could track changes and produce diff files as we uploaded new versions of the collection data, GitHub was the ideal candidate. Not only that, the type of ‘earlyvangelists’ we are targetting with the data release, hang out there in quantity.

The idea for using GitHub to host collection datasets had actually been bouncing around since April 2009. Aaron Straup-Cope and I were hanging out in-between sessions at Museums and the Web in Indianapolis talking about Solr, collection data, and APIs. Aaron suggested that GitHub would be the perfect place for museums to dump their collections – as giant text blobs – and certainly better than putting it on their own sites. Then 2010 happened and the early-mover museums all suddenly had built APIs for their collections. Making a text dump was suddenly off the agenda, but that idea of using GitHub still played on my mind.

Now, Cooper-Hewitt is not yet in a suitable position infrastructurally to develop an API for its collection. So when the time came to make release the dataset, that conversation from 2009 suddenly became a reality.

And, fittingly, Aaron has been the first to fork the collection – creating individual JSON for each object record.

Could GitHub become not just a source code repository but a repository for ‘cultural source code’?

(But read the data info first!)