Tag Archives: data

Interactive timeline design: seeking feedback!

Exploring the Cooper Hewitt collection with timelines and tags: guest post by Olivia Vane

‘Black & white’ timeline detail, Cooper Hewitt data

“A physical museum is itself a sort of data set — an aggregation of the micro in order to glimpse the macro. One vase means little on its own, beyond perhaps illustrating a scene from daily life. But together with its contemporaries, it means the contours of a civilization. And when juxtaposed against all vases, it helps create a first-hand account of the history of the world.”
From ‘An Excavation Of One Of The World’s Greatest Art Collections’

“The ability to draw on historic examples from various cultures, to access forgotten techniques and ideas and juxtapose them with contemporary works, creates provocative dialogues and amplifies the historic continuum. This range is an asset few museums have or utilize and provides a continuing source of inspiration to contemporary viewers and designers.”
From ‘Making Design: Cooper Hewitt, Smithsonian Design Museum Collection’ p.28

Guest post by Olivia Vane

I’m 4 months into a 5-month fellowship at Cooper Hewitt working with their digitised collection. I’m normally based in London where I’m a PhD student in Innovation Design Engineering at the Royal College of Art supervised by Stephen Boyd Davis, Professor of Design Research. My PhD topic is designing and building interactive timelines for exploring cultural data (digitised museum, archive and library collections). And, in London, I have been working with partners at the V&A, the Wellcome Library and the Science Museum.

The key issue in my PhD is how we ‘make sense’ of history using interactive diagrams. This is partly about visualisation of things we already know in order to communicate them to others. But it is also about visual analytics – using visuals for knowledge discovery. I’m particularly interested in what connects objects to one another, across time and through time.

I am very fortunate to be spending time at Cooper Hewitt as they have digitised their entire collection, more than 200,000 objects, and it is publicly available through an API. The museum is also known for its pioneering work in digital engagement with visitors and technical innovations in the galleries. It is a privilege to be able to draw on the curatorial, historical and digital expertise of the staff around me here for developing and evaluating my designs.

As I began exploring the collection API, I noticed many of the object records had ‘tags’ applied to them (like ‘birds’, ‘black & white’, ‘coffee and tea drinking’, ‘architecture’, ‘symmetry’ or ‘overlapping’). These tags connect diverse objects from across the collection: they represent themes that extend over time and across the different museum departments. This tagging interested me because it seemed to offer different paths through the data around shape, form, style, texture, motif, colour, function or environment. (It’s similar to the way users on platforms like Pinterest group images into ‘boards’ around different ideas). An object can have many tags applied to it suggesting different ways to look at it, and different contexts to place it in.

Where do these tags come from? Here, the tags are chosen and applied by the museum when objects are included in an exhibition. They provide a variety of ways to think about an object, highlighting different characteristics, and purposely offer a contrasting approach to more scholarly descriptive information. The tags are used to power recommendation systems on the museum collection website and applications in the galleries. They constitute both personal and institutional interpretation of the collection, and situate each item in a multi-dimensional set of context.

Some examples of tags and tagged objects in the Cooper Hewitt collection

I was interested to trace these themes over the collection and, since objects often have multiple tags, to explore what it would be like to situate or view each object through these various lenses.

The temporal dimension is important for identifying meaningful connections between items in cultural collections, so my first thoughts were to map tagged objects by date.

I’m working on a prototype interface that allows users to browse in a visually rich way through the collection by tags. A user starts with one object image and a list of the tags that apply to that object. They may be interested to see what other objects in the collection share a given tag and how the starting image sits in each of those contexts. When they click a tag, a timeline visualisation is generated of images of the other objects sharing that tag – arranged by date. The user can then click on further tags, to generate new timeline visualisations around the same starting image, viewing that image against contrasting historical narratives. And if they see a different image that interests them in one of these timelines, they can click on that image making it the new central image with a new list of tags through which to generate timelines and further dig into the collection. By skipping from image to image and tag to tag, it’s easy to get absorbed in exploring the dataset this way; the browsing can be undirected and doesn’t require a familiarity with the dataset.

‘Coffee and tea drinking’ timeline: designs in the collection stretch from 1700 to the present with a great diversity of forms and styles, elaborate and minimal.

‘Water’ timeline. Here there are many different ways of thinking about water: images of garden plans with fountains and lakes from the 16^th–18^th Century, or modern interventions for accessing and cleaning water in developing countries. Contrasting representations (landscape painting to abstracted pattern) and functions (drinking to boating) stretch between.

‘Water’ timeline, detail

‘Space’ timeline: 1960s ‘space age’ souvenirs (Soviet and American) precede modern telescope imaging. And a 19^th Century telescope reminds us of the long history of mankind’s interest in space.

I’m plotting the object images themselves as data points so users can easily make visual connections between them and observe trends over time (for instance in how an idea is visually represented or embodied in objects, or the types of objects present at different points in time). The images are arranged without overlapping, but in an irregular way. I hoped to emulate a densely packed art gallery wall or mood board to encourage these visual connections. Since the tags are subjective and haven’t been applied across the whole collection, I also felt this layout would encourage users to view the data in a more qualitative way.

Yale Center for British Art: Long Gallery, image credit Richard Caspole, YCBA & Elizabeth Felicella, Esto

Moodboard, image credit ERRE

Dealing with dates

How to work with curatorial dating?

While most of the post-1800 objects in the dataset have a date/date span expressed numerically, pre-1800 objects often only have date information as it would appear on a label: for example ‘Created before 1870s’, ‘late 19th–early 20th century’, ‘ca. 1850’ or ‘2012–present’. My colleagues at the Royal College of Art have previously written about the challenges of visualising temporal data from cultural collections (Davis, S.B. and Kräutli, F., 2015. The Idea and Image of Historical Time: Interactions between Design and Digital Humanities. Visible Language, 49(3), p.101).

In order to process this data computationally, I translated the label date text to numbers using the yearrange library (which is written for working with curatorial date language). This library works by converting, for example, ‘late 18th century’ to ‘start: 1775, end: 1799’ For my purposes, this seems to work well, though I am unsure how to deal with some cases:

How should I deal with objects whose date is ‘circa X’ or ‘ca. X’ etc.? At the moment I’m just crudely extending the date span by ±20 years.
How should I deal with ‘before X’? How much ‘before’ does that mean? I’m currently just using X as the date in this case.
The library does not translate BC dates (though I could make adjustments to the code to enable this…) I am just excluding these at the moment.
There are some very old objects in the Cooper Hewitt collection for example ‘1.85 million years old’, ‘ca. 2000-1595 BCE’ and ‘300,000 years old’. These will create problems if I want to include them on a uniformly scaled timeline! Since these are rare cases, I’m excluding them at the moment.

Skewing the timeline scale

The Cooper Hewitt collection is skewed towards objects dating post-1800 so to even out image distribution over the timeline I am using a power scale. Some tags, however, – such as ‘neoclassical’ or ‘art nouveau’ – have a strong temporal component and the power scale fails to even out image distribution in these cases.

How are the images arranged?

My layout algorithm aims to separate images so that they are not overlapping, but still fairly closely packed. I am using a rule that images can be shifted horizontally to avoid overlaps so long as there is still some part of the image within its date span. Since images are large data markers, it is already not possible to read dates precisely from this timeline. And the aim here is for users to observe trends and relationships, rather than read off exact dates, so I felt it was not productive to worry too much about exact placement horizontally. (Also, this is perhaps an appropriate design feature here since dating cultural objects is often imprecise and/or uncertain anyway). This way the images are quite tightly packed, but don’t stray too far from their dates.

‘Personal environmental control’ timeline: a dry juxtaposition of these decorated fans against modern Nest thermostats.

‘Foliate’ timeline, detail

‘Squares’ timeline

I’ve also tried to spread images out within date spans, rather than just use the central point, to avoid misleading shapes forming (such as a group of objects dating 18^th century forming a column at the midpoint, 1750).

Things to think about

Interface design

The layout algorithm slows when there are many (100 or more) images visualised. Is there a more efficient way to do this?
I’m considering rotating the design 90° for web-use; I anticipate users will be interested to scroll along time, and scrolling vertically may improve usability with a mouse.
Would a user be interested to see different timeline visualisations next to each other, to compare them?
It could be interesting to apply this interface to other ways of grouping objects such as type, colour, country or other descriptor.
I need to build in a back button, or some way to return to previously selected images. Maybe a search option for tags? Or a way to save images to return to?

The Visit Statistics Page

Leave a reply

Hi Labs blog readers! My name is Belle and I’m this summer’s Peter A. Krueger intern in the Digital and Emerging Media Department. At the end of my time here I’ll be starting my senior year at Wellesley College, where I’m majoring in Media Arts and Sciences, focusing on Human-Computer Interaction and Printmaking. My project for this summer, the “visit statistics page” augments Cooper Hewitt’s distinctive post-visit experience.

On my first day at Cooper Hewitt I learned that I had the freedom to determine my own internship-long project. This kind of freedom was daunting. Where should I even start? I scoured Cooper Hewitt’s website, GitHub issue logs, Labs blog posts, and talked to friends and museum staff. I determined that my project should focus on experience and data relating to the Pen, specifically the “post-visit experience”, since it’s so unique to Cooper Hewitt. After visiting the physical museum, a user can go to the website and find everything they collected with their Pen during their visit. Viewing their collected objects is often the first time a user is accessing the website after their visit, so a good user experience is essential to keep the user interested in what else the collections website has to offer.

I looked for areas of expansion within the post-visit experience and zeroed in on the way collected objects are shown to a user when they view their visit—a grid layout with object thumbnails, titles, and the time collected with the Pen. The interconnection of object data throughout the collections site makes it effortless and enjoyable to sort through object fields and click links to explore deep into the collection. The UI for objects and data on the post-visit pages, however, did not yet capitalize upon the opportunity to cross-link visit data and collections data. This is what the post-visit experience was lacking. I wanted to add a dimension of exploration to a user’s visit and gently launch them into the depths of the collections website. A key goal for my project was for it to be visually based but still accessible to the blind. A completely graphics-based page would be inaccessible to those using screen readers to browse their collection. At the end of all of my brainstorming, I developed the solution to accomplish my goal of accessibility: a primarily text based statistics page, which had the potential to add non-essential visualizations.

Early user flow sketch of image-based object filtering.

User flow sketch of visit filtering with tab UI.

As I developed a design strategy, I browsed through Cooper Hewitt’s GitHub and tried to familiarize myself with the codebase. To build the visit stats page I learned PHP and Emacs, Labs’ text editor of choice. I slowly navigated foreign PHP files and, bit by bit, built up my page with PHP and Smarty, a PHP templating engine.

The first iteration of my page took about three days to complete and was not interactive. I focused on calculating specific values from PHP and sending them to the template. And while I was very proud of myself for finally figuring out how for loops work in PHP, I knew that the main goal for my page was to encourage exploration of Cooper Hewitt’s vast collection, with a user’s visit only acting as a starting point. So next I added links to everything that already had a linkable page: countries, people, colors, and tags. While searching the existing Cooper Hewitt codebase, I serendipitously found a PHP file that collected all colors in all of a user’s visits, along with the objects associated with each color. With a little massaging, I was able to use that code to make a page that only shows the colors associated with a specific visit. I linked this to the visit statistics page so then the user could see a nice tag-cloud-like page of colors and the number of objects associated with each color.

Screenshot of the first version of the visit statistics page

The beginnings of the visit statistics page.

The first public version of the visit statistics page.

The visit colors page of a small but colorful visit.

Next, I deployed the second iteration of the visit stats page—text-based presentation of visit data and links to the collection—to the production website and shared it with my friends and family. And to my surprise, it actually worked! Or so I thought, but I’ll explain that later.

My next step was to add some visualization. Chart.js makes it easy to create beautiful interactive graphs that are perfect for the data I wanted to display. Building on the implementation of Chart.js on some internal administrative pages on the collections website, I created a graph that shows how a user interacted with objects in the gallery.

A screenshot showing a graph of a user's objects collected over time.

A graph showing the collecting activity of a visit. The y-axis is number of objects and the x-axis is time. This visit has 164 items collected over 2 hours.

Memory Issues
To test this iteration of the statistics page, which included links and a Chart.js graph, I used a larger visit with almost 200 objects. I clicked the link to the statistics page and… it didn’t load. I quickly realized that I naively only tested the statistics page on smaller visits (fewer than 60 objects) and the stats page had too much data crunching for visits over 150 objects. After some digging in the logs, Micah and I discovered there was an issue with the amount of memory the page was using. We discovered a useful built-in PHP function called memory_get_usage(), which tells you the number of bytes used in a file. With this, we discovered that for a visit of over 160 objects, the stats page totaled almost 60 MB of memory usage. A typical page (for example, the visit page for a single visit) is around 1-2 MB of memory.

I first thought that the many for loops and multi-dimensional arrays utilized in my code for gathering data were the culprit of this memory issue. After some debugging, I discovered the function which inflates each object in a user’s visit (a.k.a. gathers every single bit of metadata available about that object from MySQL) was taking up close to 95% of the page’s total memory. This function returns a giant multi-dimensional array, which the page iterates over many many times to pick out the necessary data. This gathering of information from MySQL, along with the many for loops going through it, caused my page to be slowed down by getting and storing a ton of information that it didn’t even need.

So, I tried cutting down that array by adding conditionals to the function that inflates each object. I pared down each object’s entry in the array to get the minimum amount of information from MySQL. This way, I was able to decrease the amount of memory used by the page by a few MB. But this still didn’t resolve the memory issue. While the page itself wasn’t using as much memory, the memory used up by helper functions and other functions within functions called on my page were still contributing to the memory used. So Micah and I brainstormed and brainstormed until we thought of two possible options: 1. Create a task for the offline server, which would require a lot of difficult coding and setting up which we both didn’t quite know how to do or 2. Use Elasticsearch to get exactly the data needed for the stats page—which I kind of knew how to do after minor exposure to Elasticsearch—but I would need to go through trial and error to fully figure out. I would also need to edit all of the existing code for the stats page to work with the results from Elasticsearch. I chose option two, because from a user experience standpoint it would be disappointing to go to the stats page and just see a message saying “Oops, give us a minute, we are working on getting the information for you”.

To work with Elasticsearch I had to figure out how it builds queries and aggregates responses. After a lot of trial and error, I figured out what terms Elasticsearch accepts and what it doesn’t. This newly built query returned the minimum amount of information I needed for the statistics page and was way faster than getting a huge multi-dimensional array from MySQL. Then, all I had to do was alter a few lines in my existing code to navigate through the new Elasticsearch response. A few bug fixes later and the statistics page was working better than it was pre-Elasticsearch. After this implementation of Elasticsearch the total memory used (as found by memory_get_usage()) was down to 4 MB! Roughly a 90% decrease in memory usage simply by changing the way I got the necessary data. And, for the moment of truth, when I tested the stats page on the visit that consists of almost 200 objects—it loaded! And it loaded quickly! Now for comparison, before Elasticsearch, a short visit would take 2-3 seconds to load, which isn’t terrible but definitely not ideal. A larger visit could take upward of 10 seconds, which is absolutely too slow. At that point, the server timed out and the page would not load at all. Now, on first load, the stats page for a 200 object visit is just over a second, and is under a second when accessed again with a cache.

What’s Next?
Although the page is at a good resting point currently, I could see some improvements being made in the future, such as the addition of more links, graphs, data, and fine tuning my code.

One of the biggest things relating to this project that I could see in the future is another page to view the statistics of a user’s shoebox, that is, the combination of all of a user’s visits. A statistics page that spans all visits would be more powerful for someone who wants to see overall trends in their collections.

A simpler addition would be to make more pages, similar to the separate “Colors Of Your Visit” page, that show a complete list of the countries, people, and tags a user collected.

Another potential feature would be a world map showing one’s collected objects. It would be a great visual addition to a text-based page showing all of the countries from which a user collected objects. Something like this could be added to the separate page reserved for listing all of the countries in a visit.

It could be a cool experiment to give a little feedback to a user about their collecting habits, to make museum go-ers more aware of how they interacted with the pen and the items in the galleries. Perhaps a user who collected a lot of stuff at the beginning and then tapered off would get a short message below their graph: “It looks like you stopped collecting items toward the end of your visit!” Maybe someone who collected only five objects in a two hour period would get “You didn’t use your pen much during your visit,” or someone who collected 300 objects in one hour would see “Wow! You’re quite the treasure hunter!” This idea also demonstrates the fine line between responsive and creepy, which I struggled with while deciding on phrasing throughout my page.

Cooper Hewitt’s Pen gives us so much information about a user’s visit and their journey through the galleries. Combined with the collection’s breadth and depth of metadata, it’s astounding how much information a user can receive about their visit and their viewing habits. My statistics page is a move toward condensing the vast amount of information available about a visit and makes it more accessible for anyone interested in learning about their interests, taste in design, or museum-going habits.

The visit statistics page is currently on the website and I encourage you to check it out for yourself by going to one of your visits and clicking the link on the right that says “View your visit stats”. Here is a link to some of my visits and here are some cool examples using visits from people at Cooper Hewitt Labs:

A screenshot of a visit statistics page from my visit.

A statistics page from one of my own visits.

A screenshot of the visit statistics page

A colorful visit from Lisa.

A screenshot from a visit statistics page.

A visit from Rachel, the Labs’ new interactive media developer.

Mia Ridge explores the shape of Cooper-Hewitt collections

16 Replies

Or, “what can you learn about 270,000 records in a week?”

Guest post by Mia Ridge.

I’ve just finished a weeks’ residency at the Cooper-Hewitt, where Seb had asked me to look at ‘the shape of their collection‘. Before I started a PhD in Digital Humanities I’d spent a lot of time poking around collections databases for various museums, but I didn’t know much about the Cooper-Hewitt’s collections so this was a nice juicy challenge.

What I hoped to do

Museum collections are often accidents of history, the result of the personalities, trends and politics that shaped an institution over its history. I wanted to go looking for stories, to find things that piqued my curiosity and see where they lead me. How did the collection grow over time? What would happen if I visualised materials by date, or object type by country? Would showing the most and least exhibited objects be interesting? What relationships could I find between the people listed in the Artist and Makers tables, or between the collections data and the library? Could I find a pattern in changing sizes of different types of objects over time – which objects get bigger and which get smaller over time? Which periods have the most colourful or patterned objects?

I was planning to use records from the main collections database, which for large collections usually means some cleaning is required. Most museum collections management systems date back several decades and there’s often a backlog of un-digitised records that need entering and older records that need enhancing to modern standards. I thought I’d iterate through stages of cleaning the data, trying it in different visualisations, then going back to clean up more precisely as necessary.

I wanted to get the easy visualisations like timelines and maps out of the way early with tools like IBM’s ManyEyes and Google Fusion Tables so I could start to look for patterns in the who, what, where, when and why of the collections. I hoped to find combinations of tools and data that would let a visitor go looking for potential stories in the patterns revealed, then dive into the detail to find out what lay behind it or pull back to view it in context of the whole collection.

What I encountered

Well, that was a great plan, but that’s not how it worked in reality. Overall I spent about a day of my time dealing with the sheer size of the dataset: it’s tricky to load 60 meg worth of 270,000 rows into tools that are limited by the number of rows (Excel), rows/columns (Google Docs) or size of file (Google Refine, ManyEyes), and any search-and-replace cleaning takes a long time.

However, the unexpectedly messy data was the real issue – for whatever reason, the Cooper-Hewitt’s collections records were messier than I expected and I spent most of my time trying to get the data into a workable state. There were also lots of missing fields, and lots of uncertainty and fuzziness but again, that’s quite common in large collections – sometimes it’s the backlog in research and enhancing records, sometimes an object is unexpectedly complex (e.g. ‘Begun in Kiryu, Japan, finished in France‘) and sometimes it’s just not possible to be certain about when or where an object was from (e.g. ‘Bali? Java? Mexico?’). On a technical note, some of the fields contained ‘hard returns’ which cause problems when exporting data into different formats. But the main issue was the variation and inconsistency in data entry standards over time. For example, sometimes fields contained additional comments – this certainly livened up the Dimensions fields but also made it impossible for a computer to parse them.

In some ways, computers are dumb. They don’t do common sense, and they get all ‘who moved my cheese’ if things aren’t as they expect them to be. Let me show you what I mean – here are some of the different ways an object was listed as coming from the USA:

U.S.
U.S.A
U.S.A.
USA
United States of America
United States (case)

We know they all mean exactly the same place, but most computers are completely baffled by variations in punctuation and spacing, let alone acronyms versus full words. The same inconsistencies were evident when uncertainties were expressed: it might have been interesting to look at the sets of objects that were made in ‘U.S.A. or England’ but there were so many variations like ‘U.S.A./England ?’ and ‘England & U.S.A.’ that it wasn’t feasible in the time I had. This is what happens when tools encounter messy data when they expect something neat:

Map with mislabelled location and number of records

3 objects from ‘Denmark or Germany’? No! Messy data confuses geocoding software.

Data cleaning for fun and profit

I used Google Refine to clean up the records then upload them to Google Fusion or Google Docs for test visualisations. Using tools that let me move data between them was the nearest I could get to a workflow that made it easy to tidy records iteratively without being able to tidy the records at source.

Refine is an amazing tool, and I would have struggled to get anywhere without it. There are some great videos on how to use it at freeyourmetadata.org, but in short, it helps you ‘cluster‘ potentially similar values and update them so they’re all consistent. The screenshot below shows Refine in action.

Google Refine in action

One issue is that museums tend to use question marks to record when a value is uncertain, but Refine strips out all punctuation, so you have to be careful about preserving the distinction between certain and uncertain records (if that’s what you want). The suitability of general tools for cultural heritage data is a wider issue – a generic timeline generator doesn’t know what year to map ‘early 17th century’ to so it can be displayed, but date ranges are often present in museum data, and flattening it to 1600 or 1640 or even 1620 is a false level of precision that has the appearance of accuracy.

When were objects collected?

Having lost so much time to data cleaning without resolving all the issues, I eventually threw nuance, detail and accuracy out the window so I could concentrate on the overall shape of the collection. Working from the assumption that object accession numbers reflected the year of accession and probably the year of acquisition, I processed the data to extract just the year, then plotted it as accessions by department and total accessions by year. I don’t know the history of the Cooper Hewitt well enough to understand why certain years have huge peaks, but I can get a sense of the possible stories hidden behind the graph – changes of staff, the effect of World War II? Why were 1938 and 1969 such important years for the Textiles Department, or 1991 for the Product Design and Decorative Arts Department?

Accessions by Year for all Departments

Or try the interactive version available at ManyEyes.

I also tried visualising the Textiles data as a bubble chart to show the years when lots of objects were collected in a different way:

Accessions for Textiles Department by year

Where are objects from?

I also made a map which shows which countries have been collected from most intensively. To get this display, I had to remove out any rows that had values that didn’t exactly match the name of just one country, etc, so it doesn’t represent the entire collection. But you can get a sense of the shape of the collection – for example, there’s a strong focus on the US and Western Europe objects.

Object sources by country

The interactive version is available at https://bit.ly/Ls572u.

This also demonstrates the impact of the different tools – I’m sure the Cooper-Hewitt has more than 43 objects from the countries (England, Scotland, Wales and Northern Ireland) that make up the United Kingdom but Google’s map has only picked up references to ‘United Kingdom’, effectively masking the geo-political complexities of the region and hiding tens of thousands of records.

Linking Makers to the rest of the web

Using Refine’s Reconciliation tool, I automatically ‘reconciled’ or matched 9000 names in the Makers table to records in Freebase. For example, the Cooper-Hewitt records about Gianni Versace were linked to the Freebase page about him, providing further context for objects related to him. By linking them to a URL that identifies the subject of a record, those records can now be part of the web, not just on the web. However, as might be expected with a table that contains a mixture of famous, notable and ordinary people, Refine couldn’t match everything with a high level of certainty so 66453 records are left as an exercise for the reader.

I also had a quick go at graphing the different roles that occurred in the Makers table.

The benefit of hindsight, and thoughts for the future

With hindsight, I would have stuck with a proper database for data manipulation because trying to clean really large datasets with consumer tools is cumbersome. I also would have been less precious about protecting the detail and nuance of the data and been more pragmatic and ruthless about splitting up files into manageable sizes and tidying up inconsistencies and uncertainties from the start. I possibly should have given up on the big dataset and concentrated on seeing what could be done with the more complete, higher quality records.

The quality of collections data has a profound impact of the value of visualisations and mashups. The collections records would be more usable in future visualisations if they were tidied in the source database. A tool like Google Refine can help create a list of values to be applied and provide some quick wins for cleaning date and places fields. Uncertainty in large datasets is often unavoidable, but with some tweaking Refine could also be used to provide suggestions for representing uncertainty more consistently. I’m biased as crowdsourcing is the subject of my PhD, but asking people who use the collections to suggest corrections to records or help work through the records that can’t be cleaned automatically could help deal with the backlog. Crowdsourcing could also be used to help match more names from the various People fields to pages on sites like Freebase and Wikipedia.

If this has whetted your appetite and you want to have a play with some of Cooper-Hewitt’s data, check out Collection Data Access & Download.

Finally, a big thank you to the staff of the Cooper-Hewitt for hosting me for a week.

People playing with collections #14: collection data on Many Eyes

Building the wall

2 Replies

Last month we released our collection data on Github.com. It was a pretty monumental occasion for the museum and we all worked very hard to make it happen. In an attempt to build a small example of what one might do with all of this data, we decided to build a new visualization of our collection in the form of the “Collection Wall Alpha.”

The collection wall, Alpha

The idea behind the collection wall was simple enough–create a visual display of the objects in our collection that is fun and interactive. I thought about how we might accomplish this, what it would look like, and how much work it would be to get it done in a short amount of time. I thought about using our own .csv data, I tinkered, and played, and extracted, and extracted, and played some more. I realized quickly that the very data we were about to release required some thought to make it useful in practice. I probably over-thought.

Isotope

After a short time, we found this lovely JQuery plugin called Isotope. Designed by David DeSandro, Isotope offers “an exquisite Jquery plugin of magical layouts.” And it does! I quickly realized we should just use this plugin to display a never-ending waterfall of collection objects, each with a thumbnail, and linked back to the records in our online collection database. Sounds easy enough, right?

Getting Isotope to work was pretty straight-forward. You simply create each item you want on the page, and add class identifiers to control how things are sorted and displayed. It has many options, and I picked the ones I thought would make the wall work.

Next I needed a way to reference the data, and I needed to produce the right subset of the data–the objects that actually have images! For this I decided to turn to Amazon’s SimpleDB. SimpleDB is pretty much exactly what it sounds like. It’s a super-simple to implement, scalable, non-relational database which requires no setup, configuration, or maintenance. I figured it would be the ideal place to store the data for this little project.

Once I had the data I was after, I used a tool called RazorSQL to upload the records to our SimpleDB domain. I then downloaded the AWS PHP SDK and used a few basic commands to query the data and populate the collection wall with images and data. Initially things were looking good, but I ran into a few problems. First, the data I was querying was over 16K rows tall. Thats allot of data to store in memory. Fortunately, SimpleDB is already designed with this issue in mind. By default, a call to SimpleDB only returns the first 100 rows ( you can override this up to 2500 rows ). The last element in the returned data is a special token key which you can then use to call the next 100 rows.

Using this in a loop one could easily see how to grab all 16K rows, but that sort of defeats the purpose as it still fills up the memory with the full 16K records. My next thought was to use paging, and essentially grab 100 rows at a time, per page. Isotope offers a pretty nifty “Infinite Scroll” configuration. I thought this would be ideal, allowing viewers to scroll through all 16K images. Once I got the infinite scroll feature to work, I realized that it is an issue once you page down 30 or 40 pages. So, I’m going to have to figure out a way to dump out the buffer, or something along those lines in a future release.

After about a month online, I noticed that SimpleDB charges were starting to add up. I haven’t really been able to figure out why. According to the docs, AWS only charges for “compute hours” which in my thinking should be much less than what I am seeing here. I’ll have to do some more digging on this one so we don’t break the bank!

SimpleDB charges

Another issue I noticed was that we were going to be calling lots of thumbnail images directly from our collection servers. This didn’t seem like such a great idea, so I decided to upload them all to an Amazon S3 bucket. To make sure I got the correct images, I created simple php script that went through the 16K referenced images and automatically downloaded the correct resolution. It also auto-renamed each file to correspond with the record ID. Lastly, I set up an Amazon CloudFront CDN for the bucket, in hopes that this would speed up access to the images for users far and wide.

Overall I think this demonstrates just one possible outcome of our releasing of the collection meta-data. I have plans to add more features such as sorting and filtering in the near future, but it’s a start!

Check out the code after the jump ( a little rough, I know ).

Continue reading →

Cooper Hewitt Labs

Technology + Media + Experience