Who's on first?

Houston Jet Shoes, 2013

photo by Martin Kalfatovic

We made a new thing. It is a nascent thing. It is an experimental thing. It is a thing we hope other people will help us “kick the tires” around.

It’s called “Who’s on first?” or, more accurately, “solr-whosonfirst”. solr-whosonfirst is an experimental Solr 4 core for mapping person names between institutions using a number of tokenizers and analyzers.

How does it work?

The core contains the minimum viable set of data fields for doing concordances between people from a variety of institutions: collection, collection_id and name, plus year_birth and year_death when available.
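As a rough sketch of what one such record looks like, here is the field set written out as a small Python data class. The field names come from the core; the class name PersonRecord and the choice of integers for the years are assumptions made for illustration, not something the Solr schema dictates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PersonRecord:
    """One person as held by one institution (hypothetical helper class)."""
    collection: str                    # which institution the record came from
    collection_id: str                 # that institution's own identifier
    name: str                          # the name as the source gives it
    year_birth: Optional[int] = None   # only when the source has it (int is an assumption)
    year_death: Optional[int] = None   # only when the source has it (int is an assumption)


# The two Bill Moggridge records that show up in the query example in this post.
records = [
    PersonRecord("cooperhewitt", "18062553", "Bill Moggridge"),
    PersonRecord("openlibrary", "OL3253093A", "Bill Moggridge"),
]
```

Everything beyond collection, collection_id and name is optional, which is why the year fields default to None.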

The value of name is then meant to be copied (literally, using Solr copyField definitions) to a variety of specialized field definitions. For example, the name field is copied to a name_phonetic field so that you can query the entire corpus for names that sound alike.

Right now there are only two such fields, both of which are part of the default Solr schema: name_general and name_phonetic.
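To make the fan-out idea concrete, here is a small Python imitation of it: one name value goes in, and each specialized field gets its own analyzed copy. Solr does this itself at index time, so this is only a sketch. Classic Soundex is used as a simple stand-in for whatever phonetic filter the schema actually configures, and the functions index, name_general and name_phonetic are illustrative helpers rather than anything from the core.

```python
import re


def soundex(word: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for letter in letters:
            codes[letter] = digit

    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return ""

    result = word[0].upper()
    previous = codes.get(word[0], "")
    for letter in word[1:]:
        digit = codes.get(letter, "")
        if digit and digit != previous:
            result += digit
        if letter not in "hw":      # h and w do not separate equal codes
            previous = digit        # vowels reset this to ""
    return (result + "000")[:4]


def name_general(name: str) -> list[str]:
    """Stand-in for a general text field: lowercase ASCII word tokens."""
    return re.findall(r"[a-z]+", name.lower())


def name_phonetic(name: str) -> list[str]:
    """Stand-in for a phonetic field: one Soundex code per token."""
    return [soundex(token) for token in name_general(name)]


def index(name: str) -> dict[str, list[str]]:
    """Imitate copyField: one source value copied to every specialized field."""
    return {
        "name": [name],
        "name_general": name_general(name),
        "name_phonetic": name_phonetic(name),
    }
```

With this stand-in, "Moggridge" and a misspelled "Mogridge" end up with the same phonetic code, and "Dreyfuss, Henry" and "Henry Dreyfuss" produce the same set of general tokens in a different order.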

The idea is to compile a broad collection of specialized fields to offer a variety of ways to compare data sets. The point is not to presume that any one tokenizer / analyzer will be able to meet everyone’s needs but to provide a common playground in which we might try things out and share tricks and lessons learned.

Frankly, just comparing the people in our collections using Solr’s built-in spellchecker might work as well as anything else.

For example:

$> curl 'https://localhost:8983/solr/select?q=name_general:moggridge&wt=json&indent=on&fq=name_general:bill'

{"response":{"numFound":2,"start":0,"docs":[
        "collection_id":"18062553",
            "wikipedia:id=1600591",
        "uri":"x-urn:ch:id=18062553",
        "collection":"cooperhewitt",
        "name":["Bill Moggridge"],
        "collection_id":"OL3253093A",
        "uri":"x-urn:ol:id=OL3253093A",
        "collection":"openlibrary",
        "name":["Bill Moggridge"],

Now, we’ve established a concordance between our record for Bill Moggridge and Bill’s author page at the Open Library. Yay!
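If you would rather script that lookup than type it, the same request can be put together in Python. This is a minimal sketch: the base URL is copied from the first curl call above, the helper names select_url and fetch_docs are made up for illustration, and fetch_docs assumes a running Solr instance and the response shape shown in the output above. In Solr, q is the main query and fq is a filter query that narrows the result set.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL, taken from the first curl example in this post.
SOLR_SELECT = "https://localhost:8983/solr/select"


def select_url(field: str, term: str, filters: tuple[str, ...] = ()) -> str:
    """Build a select URL with one main query and optional filter queries."""
    params = [("q", f"{field}:{term}"), ("wt", "json"), ("indent", "on")]
    params += [("fq", f"{field}:{value}") for value in filters]
    # safe=":" keeps field:term readable instead of percent-encoding the colon
    return f"{SOLR_SELECT}?{urlencode(params, safe=':')}"


def fetch_docs(url: str) -> list:
    """Fetch the URL and return the matching documents (needs a live Solr)."""
    with urlopen(url) as response:
        return json.load(response)["response"]["docs"]


url = select_url("name_general", "moggridge", filters=("bill",))
```

The resulting url is the same string that the curl example passes on the command line.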

Here’s another example:

$> curl 'https://localhost:8983/solr/whosonfirst/select?q=name_general:dreyfuss&wt=json&indent=on'

        "name":["Dreyfuss, Henry"],
        "name":["Henry Dreyfuss"],
        "name":["Henry Dreyfuss Associates"],

See the way the two records for Henry Dreyfuss, from the Cooper-Hewitt, have the same concordance in Wikipedia? That’s an interesting wrinkle that we should probably take a look at. In the meantime, we’ve managed to glean some new information from the IMA (Henry Dreyfuss’ year of birth and death) and they from us (concordances with Wikipedia and Freebase and VIAF).
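One way to go looking for that kind of wrinkle is to group records from the same collection by the concordances they carry and flag any concordance that shows up more than once. The sketch below is only an illustration of the idea: the function shared_concordances is hypothetical, and the identifiers and tag values in sample are placeholders invented for the example, not real data.

```python
from collections import defaultdict


def shared_concordances(records):
    """Map (collection, tag) to the ids of records that share that tag."""
    seen = defaultdict(list)
    for collection, collection_id, tags in records:
        for tag in tags:
            seen[(collection, tag)].append(collection_id)
    # Keep only tags carried by more than one record in the same collection.
    return {key: ids for key, ids in seen.items() if len(ids) > 1}


# Placeholder identifiers and tag values, invented purely for illustration.
sample = [
    ("cooperhewitt", "id-A", ["wikipedia:id=EXAMPLE"]),
    ("cooperhewitt", "id-B", ["wikipedia:id=EXAMPLE"]),
    ("cooperhewitt", "id-C", ["wikipedia:id=OTHER"]),
]
duplicates = shared_concordances(sample)
```

Anything that lands in duplicates is a candidate for a closer look, either as a true duplicate or as two distinct records that were matched to the same external page.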

The goal is to start building out some of the smarts around entity (that’s fancy-talk for people and things) disambiguation that we tend to gloss over.

None of what’s being proposed here is all that sophisticated or clever. It’s a little clever and my hunch tells me it will be a good general-purpose spelunking tool and something for sanity checking data more than it will be an all-knowing magic pony. The important part, for me, is that it’s an effort to stand something up in public and to share it and to invite comments and suggestions and improvements and gentle cluebats.

Concordances (and machine tags)

There are also some even-more experimental and very much optional hooks for allowing you to store known concordances as machine tags and to query them using the same wildcard syntax that Flickr uses, as well as generating hierarchical facets.
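For anyone who has not met machine tags before, they take the form namespace:predicate=value, as in the wikipedia:id=1600591 concordance in the output earlier in this post. Here is a hedged sketch of parsing one and matching it against a wildcard pattern. The pattern handling is loosely modelled on Flickr's syntax (a "*" or an empty part matches anything), the regular expression is a simplification, and none of this is meant to describe how the Solr core itself implements the hooks.

```python
import re
from typing import NamedTuple, Optional

# Simplified shape of a machine tag: namespace:predicate=value
MACHINE_TAG = re.compile(r"^([a-z][a-z0-9_]*):([a-z][a-z0-9_]*)=(.+)$", re.I)


class MachineTag(NamedTuple):
    namespace: str
    predicate: str
    value: str


def parse(tag: str) -> Optional[MachineTag]:
    """Split a machine tag into its three parts, or return None if malformed."""
    match = MACHINE_TAG.match(tag)
    return MachineTag(*match.groups()) if match else None


def matches(tag: MachineTag, pattern: str) -> bool:
    """Wildcard match: a '*' or an empty part of the pattern matches anything."""
    head, _, value = pattern.partition("=")
    namespace, _, predicate = head.partition(":")
    wanted = (namespace, predicate, value)
    return all(w in ("", "*") or w == have for w, have in zip(wanted, tag))


tag = parse("wikipedia:id=1600591")
```

With that in place, matches(tag, "wikipedia:*=") asks for any Wikipedia concordance at all, while matches(tag, "*:id=1600591") asks for that identifier in any namespace.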

I put the machine tags reading list that I prepared for Museums and the Web in 2010 on GitHub. It’s a good place to start if you’re unfamiliar with the subject.


There are two separate repositories that you can download to get started. They are:

The first is the actual Solr core and config files. The second is a set of sample data files and import scripts that you can use to pre-seed your instance of Solr. Sample data files are available from the following sources:

The data in these files is not standardized. There are source-specific tools for importing each dataset in the bin directory. In some cases the data here is a subset of the data that the source itself publishes. For example, the Open Library dataset only contains authors and IDs since there are so many of them (approximately 7M).

Additional datasets will be added as time and circumstances (and pull requests) permit.

