A proposal: Glossaries (dot json)

Early in the development cycle of the new Cooper-Hewitt collections website we decided that we wanted to define as many concordances as possible between our collection and the stuff belonging to other institutions.

We’re going to save an in-depth discussion of the specifics behind those efforts for another blog post except to say that the first step in building those concordances was harvesting other people’s data (through public data dumps or their APIs).

Defining equivalencies is still more of an art than a science and so having a local copy of someone else’s dataset is important for testing things out. A big part of that work is looking at someone else’s data and asking yourself: What does this field mean? It’s not a problem specific to APIs or datasets, either. Every time two people exchange a spreadsheet the same question is bound to pop up.

It has long been something of a holy grail for museums and cultural heritage institutions to imagine that we can define a common set of metadata standards that we will all use, and so unlock a magic (pony) world of cross-institutional search and understanding. The shortest possible retort to this idea is: Yes, but no.

We can (and should) try to standardize on those things that are common between institutions. However, it is the differences – differences in responsibilities; in bias and interpretation; in technical infrastructure – that distinguish institutions from one another. One need look no further than the myriad ways in which museums encode their collection data in API responses to see this reality made manifest.

I choose to believe that there are good and valid, and very specific, reasons why every institution does things a little differently and I don’t want to change that. What I would like, however, is some guidance.

In the process of working with everyone else’s data I wrote myself a little tool that iterates over a directory of files and generates a “glossary” file. Something that I can use as a quick reference listing all the possible keys that a given API or data dump might define.

The glossary files began life as a tool to make my life a little easier and they have three simple rules:

They are meant to be written by humans, in human-speak.
They are meant to be read by humans, in human-speak.
They are meant to be updated as time and circumstances permit.

That’s it.

They are not meant to facilitate the autonomous robot-readable world, at least not on their own. They are meant to be used in concert with humans, be they researchers investigating another institution’s collection or developers trying to make their collections hold hands with someone else’s.

So, here’s the proposal: What if we all published our own glossary files?

What is a glossary file?

Glossary files are just dictionaries of dictionaries, encoded as JSON. There is nothing special about JSON other than that it currently does the best job of minimizing the markup that machines need to make sense of things, and that it is supported by just about every programming language out there. If someone comes up with something better it stands to reason that glossary files would use that instead.

You can see a copy of the Cooper-Hewitt’s glossary file for objects in our collection repository over on Github. And yes, you would be correct in noting that it doesn’t actually have any useful descriptive data in it yet. One thing at a time.

The parent dictionary contains keys which are the names of the properties in the data structure used by an institution. Nested properties are collapsed into a single string, using a dot notation. For example: 'images' : { 'b' : { 'width' : '715' } } would become 'images.b.width'.
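To make the dot-notation rule concrete, here is a minimal Python sketch. The function names (flatten_keys, glossary_skeleton) are invented for this example and are not the tool described above. The post does not say how lists should be handled, so the sketch assumes that anything other than a non-empty dictionary is a leaf value.

```python
def flatten_keys(record, prefix=""):
    """Return the set of dot-notation keys for a nested dictionary."""
    keys = set()
    for name, value in record.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict) and value:
            # Descend into nested dictionaries, carrying the path so far.
            keys |= flatten_keys(value, path)
        else:
            # Assumption: lists and empty dictionaries are treated as leaves.
            keys.add(path)
    return keys


def glossary_skeleton(records):
    """Build an empty glossary with one entry per key seen in any record."""
    all_keys = set()
    for record in records:
        all_keys |= flatten_keys(record)
    # Descriptions start out blank, to be filled in by a human.
    return {key: {"description": ""} for key in sorted(all_keys)}


example = {"images": {"b": {"width": "715"}}, "date": "1912"}
print(sorted(flatten_keys(example)))  # ['date', 'images.b.width']
```

Running glossary_skeleton over every record in a data dump would give a file with all the keys and no descriptions, which is roughly the state of our own glossary today.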

The values are another dictionary with one required and two optional keys. They are:

description

This is a short text describing the key and how its values should be approached.

There is no expectation of any markup in text fields in a glossary file. Nothing is stopping you from adding markup but the explicit goal of glossary files is to be simpler than simple and to be the sort of thing that could be updated using nothing fancier than a text editor. It’s probably better to rely on the magic of language rather than semantics.

notes

This is an optional list of short texts describing gotchas, edge cases, special considerations or anything else that doesn’t need to go in the description field but is still relevant.

sameas

This is an optional list of pointers asserting that you (the person maintaining a glossary file) believe that your key is the same as someone else’s. For example, the Cooper-Hewitt might say that our date field is the same as the Indianapolis Museum of Art’s creation_date field.
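Put together, a single glossary entry might look something like the following. This is only a sketch: the description and note texts are placeholders invented for the example, and check_entry is a hypothetical helper rather than part of any existing tool. It assumes that description is the one required key and that notes and sameas, when present, are lists.

```python
import json

REQUIRED_KEYS = {"description"}
OPTIONAL_KEYS = {"notes", "sameas"}

# Placeholder entry; a real description would be written by hand.
GLOSSARY_JSON = """
{
    "date": {
        "description": "Placeholder: a short text describing the date key.",
        "notes": ["Placeholder: a gotcha or edge case worth knowing about."],
        "sameas": ["ima:creation_date"]
    }
}
"""


def check_entry(entry):
    """Return a list of human-readable problems with one glossary entry."""
    present = set(entry)
    problems = []
    for key in sorted(REQUIRED_KEYS - present):
        problems.append(f"missing required key: {key}")
    for key in sorted(present - REQUIRED_KEYS - OPTIONAL_KEYS):
        problems.append(f"unexpected key: {key}")
    for key in sorted(OPTIONAL_KEYS & present):
        if not isinstance(entry[key], list):
            problems.append(f"{key} should be a list")
    return problems


glossary = json.loads(GLOSSARY_JSON)
for name, entry in glossary.items():
    print(name, check_entry(entry) or "ok")
```

A check like this is optional. The point of the format is that a person with a text editor can keep it correct without one.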

There are two important things to remember about the sameas field:

You (as the author) are only asserting things that you believe to be true. There is absolutely no requirement that you define sameas properties for all the fields in your glossary file. Think of these pointers as the icing on the cake, rather than the cake itself.
There is currently no standard for how pointers should be encoded other than the stated goal of being “easy” for both humans and robots alike. The hope is that this will evolve through consensus – and working code.

For example, we might say our date field is the same as:

ima:creation_date
x-urn:indianapolismuseumofart:creation_date
https://www.imamuseum.org#creation_date

My personal preference would be for the first notation (ima:creation_date) but that does mean we, as a community, need to agree on namespaces (the ima: prefix) or simply that we pledge to list them someplace where they can be looked up. Again, my personal preference is to start simple and see what happens.
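For what it's worth, the first notation is also about the easiest thing for a robot to pull apart. Here is a small Python sketch, assuming a hand-maintained lookup table of prefixes. The NAMESPACES table and both function names are hypothetical, since no such registry exists yet, and the sketch only understands the prefix:key form, not the URN or URL forms.

```python
# Hypothetical registry of namespace prefixes. Only "ima" comes from the
# example above; a shared list would need to be agreed on by the community.
NAMESPACES = {
    "ima": "Indianapolis Museum of Art",
}


def parse_pointer(pointer):
    """Split a 'prefix:key' pointer into a (prefix, key) pair."""
    prefix, sep, key = pointer.partition(":")
    if not sep or not prefix or not key:
        raise ValueError(f"not a prefix:key pointer: {pointer!r}")
    return prefix, key


def describe_pointer(pointer, namespaces=NAMESPACES):
    """Return a human-readable label for a sameas pointer."""
    prefix, key = parse_pointer(pointer)
    owner = namespaces.get(prefix, "unknown namespace")
    return f"{key} ({owner})"


print(describe_pointer("ima:creation_date"))
# creation_date (Indianapolis Museum of Art)
```

An unknown prefix is reported rather than rejected, on the theory that a pointer nobody can resolve yet is still a statement worth keeping.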

The unstated rule with glossaries is that they also be easy enough to change without requiring a lot of time or heavy lifting. If something fails that criterion that’s probably a good indication it’s best saved for a different project.

It’s easy to consider both the possibilities and the pitfalls afforded by sameas pointers. They are not going to solve every problem out there, by any means. On the other hand, they feel like they might be a better-than-80/20 solution (or at least forward motion) towards addressing the problem of equivalencies. It’s really just about ensuring a separation of concerns. If we each commit to stating the things we believe to be true and to publishing those statements somewhere they can be found then, over time, we can use that data to tell us new and interesting things about our collections.

More importantly, between now and then – whatever “then” ends up looking like – we can still just get on with the work at hand.

Git(hub)

I mentioned that our glossary files are part of the Cooper-Hewitt’s collections metadata repository on Github. Those files will always be included with the collections metadata but I am considering putting the canonical versions in their own repository.

This would allow other people, out there in the community, to contribute suggestions and fixes (“pull requests” in Git-speak) without having to download the entirety of our collection. As always with Git(hub) it offers a way for institutions to preserve the authority over the meaning of their collections and to give other institutions some degree of confidence in the things we are saying.

It also means that institutions will need to commit to taking seriously any pull requests that are submitted and tracking down the relevant people (inside the building) to approve or reject the changes. This is maybe not something we’re all used to doing. We are not really wired, intellectually or institutionally, for dealing with the public pushing back on the things we publish.

But if we’re being honest everyone knows that it’s not only a thing seen distantly on the horizon but a thing approaching with a daunting (and sometimes terrifying) speed. We’re going to have to figure out how to address that reality even if that just means better articulating the reasons why it’s not something a given institution wants to do.

Which means that, in addition to being a useful tool for researchers, developers and other people with directed goals, glossaries can also act as a simple and safe device for putting some of these ideas to the test and, hopefully, understanding where the remaining bumpy bits lie.

Discuss!

Michael Lascarides December 11, 2012 at 4:33 pm

Yes. Thinking a lot about this since NDF. A couple of quick thoughts (and some code will even be following). First, agreed JSON is a great choice for now, but in terms of the writable and readable by humans, the proper markup is probably eventually something even more concise. Second, the idea of a standard place for a glossary.json begs the question what other simple, minimal but insanely useful info could be easily put out by organizations? I’d be curious about what a standard “about.json” (who is the institution or individual, mainly for attribution) or “license.json” (what can I do with the stuff herein?) would look like. “About” would be particularly easy, just put it in the root next to the robots.txt. License would be trickier, but maybe a minimum-permissible license that can be superseded with more restrictions in subdirectories and individual records? Github (or a github-like system) would manage a mess of revisions and who did what, but does not address the question of “how am I allowed to use this particular piece of information?” There are others, I’m sure.

As you say, though, one step at a time.

As for the pitfalls of “sameas”, the problem of “not quite the same as” feels like a much, much better problem to have than “had no idea that that existed”. Match items up first, then learn about how “sameas” is insufficient, then create standards later for “mostlythesameas”, “wassameasuntil1973”, “isonlythesameasonalternatingthursdays”, etc. when they become necessary.

Wholehearted agreement with the bottom line: when the number of links between items is exponentially larger than an already huge number of items, the process of making said links has to be as easy as possible for humans and machines alike.


Cooper Hewitt Labs

Technology + Media + Experience