Tag Archives: alpha

Who's on first?

Houston Jet Shoes, 2013. Photo by Martin Kalfatovic.

We made a new thing. It is a nascent thing. It is an experimental thing. It is a thing we hope other people will help us “kick the tires” around.

It’s called “Who’s on first?” or, more accurately, “solr-whosonfirst“. solr-whosonfirst is an experimental Solr 4 core for mapping person names between institutions using a number of tokenizers and analyzers.

How does it work?

The core contains the minimum viable set of data fields for doing concordances between people from a variety of institutions: collection, collection_id, name and, when available, year_birth and year_death.

The value of name is then meant to be copied (literally, using Solr copyField definitions) to a variety of specialized field definitions. For example, the name field is copied to a name_phonetic field so that you can query the entire corpus for names that sound alike.

Right now there are only two such fields, both of which are part of the default Solr schema: name_general and name_phonetic.
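To make that concrete, here is a rough sketch of what the relevant schema.xml entries might look like. The field attributes and the type names (text_general and phonetic, both found in the stock Solr 4 example schema) are illustrative guesses rather than excerpts from the actual solr-whosonfirst configuration:

<field name="name" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="name_general" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="name_phonetic" type="phonetic" indexed="true" stored="false" multiValued="true"/>

<copyField source="name" dest="name_general"/>
<copyField source="name" dest="name_phonetic"/>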

The idea is to compile a broad collection of specialized fields to offer a variety of ways to compare data sets. The point is not to presume that any one tokenizer / analyzer will be able to meet everyone’s needs but to provide a common playground in which we might try things out and share tricks and lessons learned.

Frankly, just comparing the people in our collections using Solr’s built-in spellchecker might work as well as anything else.

For example:

$> curl 'https://localhost:8983/solr/select?q=name_general:moggridge&wt=json&indent=on&fq=name_general:bill'

{"response":{"numFound":2, "start":0,"docs":[
    {
        "collection_id":"18062553" ,
        "concordances":[
            "wikipedia:id= 1600591",
            "freebase:id=/m/05fpg1"],
        "uri":"x-urn:ch:id=18062553" ,
        "collection":"cooperhewitt" ,
        "name":["Bill Moggridge"],
        "_version_":1423275305600024577},
    {
        "collection_id":"OL3253093A" ,
        "uri":"x-urn:ol:id=OL3253093A" ,
        "collection":"openlibrary" ,
        "name":["Bill Moggridge"],
        "_version_":1423278698929324032}]
    }
}

Now, we’ve established a concordance between our record for Bill Moggridge and Bill’s author page at the Open Library. Yay!

Here’s another example:

$> curl 'https://localhost:8983/solr/whosonfirst/select?q=name_general:dreyfuss&wt=json&indent=on'

{"response":{"numFound":3,"start":0,"docs":[
    {
        "concordances":["ulan:id=500059346"],
        "name":["Dreyfuss, Henry"],
        "uri":"x-urn:imamuseum:id=656174",
        "collection":"imamuseum",
        "collection_id":"656174",
        "year_death":[1972],
        "year_birth":[1904],
        "_version_":1423872453083398149},
    {
        "concordances":["ulan:id=500059346",
                "wikipedia:id=1697559",
                "freebase:id=/m/05p6rp",
                "viaf:id=8198939",
                "ima:id=656174"],
        "name":["Henry Dreyfuss"],
        "uri":"x-urn:ch:id=18041501",
        "collection":"cooperhewitt",
        "collection_id":"18041501",
        "_version_":1423872563648397315},
    {
        "concordances":["wikipedia:id=1697559",
                "moma:id=1619"],
        "name":["Henry Dreyfuss Associates"],
        "uri":"x-urn:ch:id=18041029",
        "collection":"cooperhewitt",
        "collection_id":"18041029",
        "_version_":1423872563567656970}]
    }
}

See the way the two records for Henry Dreyfuss, from the Cooper-Hewitt, have the same concordance in Wikipedia? That's an interesting wrinkle that we should probably take a look at. In the meantime, we've managed to glean some new information from the IMA (Henry Dreyfuss' years of birth and death) and they, in turn, can glean some from us (concordances with Wikipedia, Freebase and VIAF).

The goal is to start building out some of the smarts around entity (that's fancy-talk for people and things) disambiguation that we tend to gloss over.

None of what's being proposed here is all that sophisticated. Okay, it's a little clever, and my hunch tells me it will be a good general-purpose spelunking tool and something for sanity-checking data more than it will be an all-knowing magic pony. The important part, for me, is that it's an effort to stand something up in public and to share it and to invite comments and suggestions and improvements and gentle cluebats.

Concordances (and machine tags)

There are also some even more experimental, and very much optional, hooks that allow you to store known concordances as machine tags and to query them using the same wildcard syntax that Flickr uses, as well as to generate hierarchical facets.
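For example, you might ask for every record that has some kind of Wikipedia concordance. Treat this strictly as a sketch: the machinetags field name, and how far its wildcard matching can be pushed, are assumptions on my part rather than details lifted from the actual schema:

$> curl 'https://localhost:8983/solr/whosonfirst/select?q=machinetags:wikipedia*&wt=json&indent=on'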

I put the machine tags reading list that I prepared for Museums and the Web in 2010 on Github. It's a good place to start if you're unfamiliar with the subject.

Download

There are two separate repositories that you can download to get started. The first is the actual Solr core and config files. The second is a set of sample data files and import scripts that you can use to pre-seed your instance of Solr. Sample data files are currently available for a handful of sources, including our own collection, the Open Library and the Indianapolis Museum of Art.

The data in these files is not standardized. There are source-specific tools in the bin directory for importing each dataset. In some cases the data here is a subset of the data that the source itself publishes. For example, the Open Library dataset only contains authors and IDs since there are so many of them (approximately 7 million).
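If you just want to poke at the core without running any of those scripts, a plain-vanilla Solr indexing call will do. Here's a sketch that posts a single document, shaped like the records shown earlier, to the JSON update handler; nothing about this command is specific to the bundled import tools:

$> curl 'https://localhost:8983/solr/whosonfirst/update?commit=true' -H 'Content-Type: application/json' -d '[{"collection": "openlibrary", "collection_id": "OL3253093A", "uri": "x-urn:ol:id=OL3253093A", "name": ["Bill Moggridge"]}]'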

Additional datasets will be added as time and circumstances (and pull requests) permit.

Enjoy!

'Discordances' – or the big to-do list

Yesterday Aaron rolled out a minor data update to the online collection. Along with this he also whipped up an ‘anti-concordances’ view for our people/company records. (Yes, for various reasons we are conflating people and companies/organisations.) This allows us to show all the records we have that don’t have exact matches in Wikipedia (and other sources).

Along with revealing that we don’t yet have a good automated way of getting a ‘best guess’ of matches (Marimekko Oy is the same as Marimekko) without also getting matches for Canadian hockey players who happen to share a name with a designer, the list of ‘discordances’ is a good, finite problem that can be solved with more human eyes.

People | Wikipedia | Collection of Smithsonian Cooper-Hewitt, National Design Museum

We are currently inviting people (you, for instance) to parse the list of non-matches and then tell us which ones should link to existing but differently-spelled Wikipedia pages, and which ones need to be created as new stubs in Wikipedia itself.

If you’re feeling really bold you could even start those stubs yourself! You can even use our easy-to-insert Wikipedia citation snippet to reference anything in our collection from your newly created articles. You’ll find the snippet tool at the bottom of each object and person/company record.

A proposal: Glossaries (dot json)


Early in the development cycle of the new Cooper-Hewitt collections website we decided that we wanted to define as many concordances as possible between our collection and the stuff belonging to other institutions.

We’re going to save an in-depth discussion of the specifics behind those efforts for another blog post except to say that the first step in building those concordances was harvesting other people’s data (through public data dumps or their APIs).

Defining equivalencies is still more of an art than a science and so having a local copy of someone else’s dataset is important for testing things out. A big part of that work is looking at someone else’s data and asking yourself: What does this field mean? It’s not a problem specific to APIs or datasets, either. Every time two people exchange a spreadsheet the same question is bound to pop up.

It’s long been something of a holy grail of museums, and cultural heritage institutions, to imagine that we can define a common set of metadata standards that we will all use and unlock a magic (pony) world of cross-institutional search and understanding. The shortest possible retort to this idea is: Yes, but no.

We can (and should) try to standardize on those things that are common between institutions. However, it is the differences – differences in responsibilities, in bias and interpretation, in technical infrastructure – that distinguish institutions from one another. One need look no further than the myriad ways in which museums encode their collection data in API responses to see this reality made manifest.

I choose to believe that there are good and valid, and very specific, reasons why every institution does things a little differently and I don’t want to change that. What I would like, however, is some guidance.

In the process of working with everyone else’s data I wrote myself a little tool that iterates over a directory of files and generates a “glossary” file. Something that I can use as a quick reference listing all the possible keys that a given API or data dump might define.

The glossary files began life as a tool to make my life a little easier and they have three simple rules:

  • They are meant to be written by humans, in human-speak.

  • They are meant to be read by humans, in human-speak.

  • They are meant to be updated as time and circumstances permit.

That’s it.

They are not meant to facilitate the autonomous robot-readable world, at least not on their own. They are meant to be used in concert with humans, be they researchers investigating another institution’s collection or developers trying to make their collections hold hands with someone else’s.

So, here’s the proposal: What if we all published our own glossary files?

What is a glossary file?

Glossary files are just dictionaries of dictionaries, encoded as JSON. There is nothing special about JSON other than that it currently does the best job of keeping to a minimum the markup that machines need to make sense of things, and it is supported by just about every programming language out there. If someone comes up with something better it stands to reason that glossary files would use that instead.

You can see a copy of the Cooper-Hewitt’s glossary file for objects in our collection repository over on Github. And yes, you would be correct in noting that it doesn’t actually have any useful descriptive data in it yet. One thing at a time.

The parent dictionary contains keys which are the names of the properties in the data structure used by an institution. Nested properties are collapsed into a string, using a dot notation. For example: 'images' : { 'b' : { 'width' : '715' } } would become 'images.b.width'.

Each value is another dictionary with one required and two optional keys. They are:

description

This is a short text describing the key and how its values should be approached.

There is no expectation of any markup in text fields in a glossary file. Nothing is stopping you from adding markup but the explicit goal of glossary files is to be simpler than simple and to be the sort of thing that could be updated using nothing fancier than a text editor. It’s probably better to rely on the magic of language rather than semantics.

notes

This is an optional list of short texts describing gotchas, edge cases, special considerations or anything else that doesn’t need to go in the description field but is still relevant.

sameas

This is an optional list of pointers asserting that you (the person maintaining a glossary file) believe that your key is the same as someone else’s. For example, the Cooper-Hewitt might say that our date field is the same as the Indianapolis Museum of Art’s creation_date field.

There are two important things to remember about the sameas field:

  • You (as the author) are only asserting things that you believe to be true. There is absolutely no requirement that you define sameas properties for all the fields in your glossary file. Think of these pointers as the icing on the cake, rather than the cake itself.

  • There is currently no standard for how pointers should be encoded other than the stated goal of being “easy” for both humans and robots alike. The hope is that this will evolve through consensus – and working code.

For example, we might say our date field is the same as:

  • ima:creation_date

  • x-urn:indianapolismuseumofart:creation_date

  • https://www.imamuseum.org#creation_date

My personal preference would be for the first notation (ima:creation_date) but that does mean we, as a community, need to agree on namespaces (the ima: prefix) or simply that we pledge to list them someplace where they can be looked up. Again, my personal preference is to start simple and see what happens.
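Putting those pieces together, a single glossary entry might look something like this. The wording of the description and the note below is invented purely for illustration; only the shape of the thing (description, notes, sameas) is what's actually being proposed:

{
    "date": {
        "description": "A human-readable display date for the object.",
        "notes": [
            "Free text; do not assume it can be parsed as a year or a date range."
        ],
        "sameas": [
            "ima:creation_date"
        ]
    }
}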

The unstated rule with glossaries is that they should also be easy enough to change without requiring a lot of time or heavy lifting. If something fails that criterion, that’s probably a good indication it’s best saved for a different project.

It’s easy to consider both the possibilities and the pitfalls afforded by sameas pointers. They are not going to solve every problem out there, by any means. On the other hand they feel like they might be a better-than-80/20 solution (or at least forward motion) towards addressing the problem of equivalencies. It’s really just about ensuring a separation of concerns. If we each commit to stating the things we believe to be true and to publishing those statements somewhere they can be found then, over time, we can use that data to tell us new and interesting things about our collections.

More importantly, between now and then – whatever “then” ends up looking like – we can still just get on with the work at hand.

Git(hub)

I mentioned that our glossary files are part of the Cooper-Hewitt’s collections metadata repository on Github. Those files will always be included with the collections metadata but I am considering putting the canonical versions in their own repository.

This would allow other people, out there in the community, to contribute suggestions and fixes (“pull requests” in Git-speak) without having to download the entirety of our collection. As always with Git(hub), it offers a way for institutions to preserve authority over the meaning of their collections while giving other institutions some degree of confidence in the things being said.

It also means that institutions will need to commit to taking seriously any pull requests that are submitted and tracking down the relevant people (inside the building) to approve or reject the changes. This is maybe not something we’re all used to doing. We are not really wired, intellectually or institutionally, for dealing with the public pushing back on the things we publish.

But if we’re being honest everyone knows that it’s not only a thing seen distantly on the horizon but a thing approaching with a daunting (and sometimes terrifying) speed. We’re going to have to figure out how to address that reality even if that just means better articulating the reasons why it’s not something a given institution wants to do.

Which means that, in addition to being a useful tool for researchers, developers and other people with directed goals, glossaries can also act as a simple and safe device for putting some of these ideas to the test and, hopefully, understanding where the remaining bumpy bits lie.

Discuss!

Getting lost in the collection (alpha)

Last week marked a pretty significant moment in my career here at Cooper-Hewitt.

As I’m sure most of you already know, we launched our Alpha collections website. The irony of this being an “alpha” of course is that it is by leaps and bounds better than our previous offering built on eMuseum.

As you’ll notice in the screengrab below, eMuseum was pretty bland. The homepage, which is still available, allowed you to engage by selecting one of four museum-oriented departments. You could also search. Right…

Upon entering the site, either via search or by browsing one of the four departments, several things struck me as huge problems.

Above is a search for "lennon" and you get the idea. Note the crazy long URLs with all kinds of session-specific data. This was for years a huge issue, as people would typically copy and paste that URL to use in blog posts and tweets. Trying to come back to the same URL twice never worked, so at some point we added a little "permalink" link at the bottom, but users rarely found it. You’ll also note the six search options under the menu item "Search." OK, it’s just confusing.

Finally, landing on an object page, you have the object data, but where does it all lead you?

For me, the key to a great, deep user experience is to allow users to get lost within the site. It sounds odd at first. I mean, if you are spending your precious time doing research on our site, you wouldn’t really want to "get lost," but in practice it’s how we make connections, discover the oddities we never knew existed, and actually allow ourselves to uncover our own original thought. Trust me, getting lost is essential. (And it is exactly what we know visitors enjoy doing inside exhibitions.)

As you can probably tell by now, getting lost on our old eMuseum was pretty tough to do. It was designed to do just the opposite. It was designed to help you find “an object.”

Here’s what happens when you try to leave the object page in the old site.

So we ditched all that. Yes, I said that right, we ditched eMuseum! To tell you the truth, this has been something I have been waiting to do since I started here over two years ago.

When I started here, we had about 1000 objects in eMuseum. Later we upped that to 10,000, and when Seb began (at the end of 2011) we quickly upped that number to about 123,000.

I really love doing things in orders of magnitude.

I noticed that adding more and more objects to eMuseum didn’t really slow it down; it just made it really tough to browse and get lost. There was just too much stuff in one place. There were no other entry points aside from searching and clicking through lists of the four departments.

Introducing our new collections website.

We decided to start from scratch, pulling data from our existing TMS database. This is the same data we were exporting to eMuseum, and the same data we released as CC0 on GitHub back in February. The difference would be presenting the data in new ways.

Note the many menu options. These will change over time, but immediately you can browse the collection by some high level categories.

Note the random button: refreshing your browser displays three random objects from our collection. This is super-fun and, to our surprise, has been one of the most talked about features of the whole alpha release. We probably could have added a random button to eMuseum and called it a day, but we like to aim a little higher.

Search is still there, where it should be. So let’s do a similar search for “lennon” and see what happens.

Here, I’ve skipped the search results page, but I will mention that there were a few more results. Our new object page for this same poster by Richard Avedon is located at the nice, friendly and persistent URL https://collection.cooperhewitt.org/objects/18618175/. It has its own unique ID (more on this in a post next week by Aaron) and, I have to say, it looks pretty simple. We have lots of other kinds of URLs with things in them like "people", "places" and "periods". URLs are important, and we intend to ensure that these live on forever. So, go ahead and link away.

The basic page layout is pretty similar to eMuseum, at first. You have the essential vital stats in the gray box on the right: Object ID, Accession Number, tombstone data, etc. You also have a map, and some helpful hints.

But then things get a little more exciting. We pull in loads of TMS data, crunch it and link it up to all sorts of things. Take a scroll down this page and you’ll see lots of text, with lots of links and a few fun bits at the end, like our “machine tag” field.

Each object has been assigned a machine tag based on its unique ID number. This is simple and straightforward future-proofing. If you’re on a site like Flickr and you come across a photo (or have one of your own) of the same object we have in our collection, you can add the machine tag.
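For the Avedon poster mentioned above, that tag would look something like the line below. The exact string is whatever appears in the machine tag field on the object page itself; the cooperhewitt namespace here is just my shorthand for it:

cooperhewitt:id=18618175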

Some day in the near future we will write code to pull in content tagged in this way from a variety of sources. This is where the collection site will really begin to take shape, as we’ll not only be displaying the "thing we have" but also its relevance in the world.

It puts a whole new spin on the concept of “collecting” and it’s something we are all very excited to see happen. So start tagging!

Moving on, I mentioned that you can get lost. This is nice.

From the Avedon poster page I clicked on the decade called "1960's." This brings me to a place where I can browse based on the decade. You can jump ahead in time and easily get lost. It’s so interesting to connect the object you are currently looking at to others from the same time period. You immediately get the sense that design happens all over the world in a wide variety of ways. It’s pretty addictive.

Navigating to the "person" page for Richard Avedon, we begin to see how these connections can extend beyond our own institutional research. We begin by pointing out what kinds of things we have by Avedon. This is pretty straightforward, but in the gray box on the right you can see we have also linked up Avedon’s person record in our own TMS database with a wide variety of external datasets. For Avedon we have concordances with Freebase, MoMA, the V&A, and of course Wikipedia. In fact, we are pulling in Wikipedia text directly to the page.

In future releases we will connect with more and more on the web. I mean, that’s the whole point of the web, right? If you want to help us make these connections (we can’t do everything in code) feel free to fork our concordances repository on GitHub and submit a pull request.

We have many categorical methods of browsing our collection now. But I have to say, my favorite is still the Random images on the home page. In fact I started a Pinterest board called “Random Button” to capture a few of my favorites. There are fun things, famous things, odd things and downright ugly things, formerly hidden away from sight, but now easy to discover, serendipitously via the random button!

There is much more to talk about, but I’ll stop here for now. Aaron, the lead developer on the project, is working on a much more in-depth and technical post about how he engineered the site and was able to bring us to a public alpha in less than three months!

So stay tuned… and get exploring.