Author Archives: Micah Walter

About Micah Walter

Micah is the Director of Digital & Emerging Media.

Building the wall

Last month we released our collection data on GitHub. It was a pretty monumental occasion for the museum and we all worked very hard to make it happen. As a small example of what one might do with all of this data, we decided to build a new visualization of our collection in the form of the “Collection Wall Alpha.”

The collection wall, Alpha

The idea behind the collection wall was simple enough: create a visual display of the objects in our collection that is fun and interactive. I thought about how we might accomplish this, what it would look like, and how much work it would be to get it done in a short amount of time. I started with our own .csv data; I tinkered, played, extracted, and played some more. I quickly realized that the very data we were about to release required some thought to make it useful in practice. I probably overthought it.

Isotope

After a short time, we found a lovely jQuery plugin called Isotope. Designed by David DeSandro, Isotope is billed as “an exquisite jQuery plugin of magical layouts.” And it is! I quickly realized we should just use this plugin to display a never-ending waterfall of collection objects, each with a thumbnail and linked back to its record in our online collection database. Sounds easy enough, right?

Getting Isotope to work was pretty straightforward. You simply create each item you want on the page and add class identifiers to control how things are sorted and displayed. It has many options, and I picked the ones I thought would make the wall work.
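
To make that concrete, here is a rough sketch of the kind of markup and the one-line initialization involved. The container ID, class names, and item fields are hypothetical stand-ins rather than the wall’s actual code, and on the real wall the items are generated from the collection data instead of being hard-coded.

<?php
// Hypothetical sample items; on the real wall these come from the collection data.
$items = array(
    array( 'id' => '123', 'title' => 'Sample poster', 'thumb' => 'https://example.com/thumbs/123.jpg', 'url' => 'https://example.com/objects/123' ),
    array( 'id' => '456', 'title' => 'Sample textile', 'thumb' => 'https://example.com/thumbs/456.jpg', 'url' => 'https://example.com/objects/456' ),
);
?>
<div id="wall">
  <?php foreach ( $items as $item ): ?>
    <div class="item">
      <a href="<?php echo $item['url']; ?>">
        <img src="<?php echo $item['thumb']; ?>" alt="<?php echo htmlspecialchars( $item['title'] ); ?>" />
      </a>
    </div>
  <?php endforeach; ?>
</div>
<script>
  // Tell Isotope which children of the container to lay out.
  jQuery(function($){
    $('#wall').isotope({ itemSelector: '.item' });
  });
</script>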

Next I needed a way to reference the data, and I needed to produce the right subset of it: the objects that actually have images! For this I decided to turn to Amazon’s SimpleDB. SimpleDB is pretty much exactly what it sounds like: a scalable, non-relational database that is super simple to implement and requires no setup, configuration, or maintenance. I figured it would be the ideal place to store the data for this little project.

Once I had the data I was after, I used a tool called RazorSQL to upload the records to our SimpleDB domain. I then downloaded the AWS PHP SDK and used a few basic commands to query the data and populate the collection wall with images and data. Initially things were looking good, but I ran into a few problems. First, the data I was querying was over 16K rows tall. That’s a lot of data to store in memory. Fortunately, SimpleDB is already designed with this issue in mind. By default, a call to SimpleDB only returns the first 100 rows ( you can override this up to 2500 rows ). The last element in the returned data is a special token key which you can then use to call the next 100 rows.
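
For reference, a query like that looks roughly as follows with the first-generation AWS SDK for PHP ( the sdk.class.php flavor ). The domain name below is a placeholder and the response parsing is a sketch, so treat it as illustrative rather than the wall’s actual code.

<?php
require_once 'sdk.class.php'; // first-generation AWS SDK for PHP

$sdb = new AmazonSDB();

// SimpleDB hands back at most 100 rows by default ( up to 2500 with a LIMIT clause ).
$response = $sdb->select( 'SELECT * FROM `collection_objects` LIMIT 100' );

if ( $response->isOK() ) {
    foreach ( $response->body->SelectResult->Item as $item ) {
        echo (string) $item->Name . "\n"; // the item name, e.g. a record ID
    }

    // Token for fetching the next batch of rows, if there are more.
    $next_token = (string) $response->body->SelectResult->NextToken;
}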

Using this in a loop one could easily grab all 16K rows, but that sort of defeats the purpose as it still fills up the memory with the full 16K records. My next thought was to use paging, and essentially grab 100 rows at a time, per page. Isotope offers a pretty nifty “Infinite Scroll” configuration. I thought this would be ideal, allowing viewers to scroll through all 16K images. Once I got the infinite scroll feature to work, I realized that memory becomes an issue once you page down 30 or 40 pages, since all of the previously loaded items stay on the page. So, I’m going to have to figure out a way to dump out the buffer, or something along those lines, in a future release.
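
For what it’s worth, a sketch of that paged approach looks something like this: each request from the infinite scroll fetches a single batch of 100 rows and hands the NextToken back to the browser, which sends it along with the next request. The domain, attribute names, and parameter handling here are assumptions for the sake of the example.

<?php
require_once 'sdk.class.php';

$sdb = new AmazonSDB();

// The browser sends back the token it received with the previous page, if any.
$opt = array();
if ( ! empty( $_GET['next_token'] ) ) {
    $opt['NextToken'] = $_GET['next_token'];
}

// Only one page ( 100 rows ) is ever held in memory at a time.
$response = $sdb->select( 'SELECT id, title, thumb FROM `collection_objects` LIMIT 100', $opt );

$items = array();
foreach ( $response->body->SelectResult->Item as $item ) {
    $row = array();
    foreach ( $item->Attribute as $attr ) {
        $row[ (string) $attr->Name ] = (string) $attr->Value;
    }
    $items[] = $row;
}

header( 'Content-Type: application/json' );
echo json_encode( array(
    'items'      => $items,
    'next_token' => (string) $response->body->SelectResult->NextToken,
) );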

After about a month online, I noticed that SimpleDB charges were starting to add up. I haven’t really been able to figure out why. According to the docs, AWS only charges for “compute hours” which in my thinking should be much less than what I am seeing here. I’ll have to do some more digging on this one so we don’t break the bank!

SimpleDB charges

Another issue I noticed was that we were going to be calling lots of thumbnail images directly from our collection servers. This didn’t seem like such a great idea, so I decided to upload them all to an Amazon S3 bucket. To make sure I got the correct images, I created a simple PHP script that went through the 16K referenced images and automatically downloaded the correct resolution. It also auto-renamed each file to correspond with its record ID. Lastly, I set up an Amazon CloudFront CDN for the bucket, in hopes that this would speed up access to the images for users far and wide.
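
Something along these lines would do the job with the old SDK’s AmazonS3 class; the bucket name, image URL pattern, and sample records below are placeholders rather than the script we actually ran.

<?php
require_once 'sdk.class.php';

$s3     = new AmazonS3();
$bucket = 'example-collection-thumbs'; // placeholder bucket name

// $records maps record IDs to source image URLs ( placeholder data here ).
$records = array(
    '123' => 'https://example.com/media/123_medium.jpg',
    '456' => 'https://example.com/media/456_medium.jpg',
);

foreach ( $records as $id => $image_url ) {

    // Pull down the correct resolution from the collection server.
    $tmp = '/tmp/' . $id . '.jpg';
    file_put_contents( $tmp, file_get_contents( $image_url ) );

    // Re-upload to S3, renamed to match the record ID and publicly readable.
    $s3->create_object( $bucket, $id . '.jpg', array(
        'fileUpload'  => $tmp,
        'acl'         => AmazonS3::ACL_PUBLIC,
        'contentType' => 'image/jpeg',
    ) );

    unlink( $tmp );
}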

Overall I think this demonstrates just one possible outcome of our release of the collection metadata. I have plans to add more features such as sorting and filtering in the near future, but it’s a start!

Check out the code after the jump ( a little rough, I know ).

Media servers and some open sourceness

We use Amazon S3 for a good portion of our media hosting. It’s a simple and cost-effective solution for serving up assets big and small. When we initially moved to Drupal 6.x ( about a year ago ) I wanted to be sure that we would use S3 for as many of our assets as possible. This tactic was partly inspired by wanting to keep the Drupal codebase nice and clean, and also to allow us to scale horizontally if needed ( multiple app servers behind a load balancer ).

Horizontal Scaling

So in an attempt to streamline workflows, we modified this amazon_s3 Drupal module a little. The idea was to allow authors to easily use the Drupal node editor to upload their images and PDFs directly to our S3 bucket. The module also rewrites the URLs to pull the content from our CloudFront CDN, and sorts your images into folders based on the date ( a la WordPress ).

Our fork of amazon_s3 rewrites the URLs for our CDN, and sorts uploads into folders by date.

I’ve open sourced that code now; it is simply a fork of the amazon_s3 module, and it works pretty well on Drupal 6.x. It does have an issue where it uploads assets with some incorrect metadata. It’s really only a problem for uploaded PDFs, where the files will download but won’t open in your browser. This has to do with S3 storing the object’s Content-Type metadata as application/octet-stream instead of application/pdf. All in all I think it’s a pretty useful module.
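
The fix amounts to setting the correct Content-Type when the object is pushed to S3 instead of letting it default to application/octet-stream. Here is a hedged sketch of the idea using the old AWS SDK’s create_object() rather than the module’s own bundled S3 code; the bucket and key are placeholders, and the date-based key prefix mirrors the folder sorting mentioned above.

<?php
require_once 'sdk.class.php';

// Map file extensions to the Content-Type S3 should store with the object.
function guess_content_type( $filename ) {
    $types = array(
        'pdf' => 'application/pdf',
        'jpg' => 'image/jpeg',
        'png' => 'image/png',
        'gif' => 'image/gif',
    );
    $ext = strtolower( pathinfo( $filename, PATHINFO_EXTENSION ) );
    return isset( $types[ $ext ] ) ? $types[ $ext ] : 'application/octet-stream';
}

$s3 = new AmazonS3();

// Key is prefixed by date, a la WordPress, so uploads sort into folders.
$s3->create_object( 'example-media-bucket', '2012/01/annual-report.pdf', array(
    'fileUpload'  => '/tmp/annual-report.pdf',
    'acl'         => AmazonS3::ACL_PUBLIC,
    'contentType' => guess_content_type( 'annual-report.pdf' ),
) );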

As we move towards migrating to Drupal 7, I have been doing some more research about serving assets via S3 and CloudFront. Additionally, it seems that the Drupal community has developed some new modules which should help streamline a few things.

Create a CloudFront distribution for your whole site using a custom origin

As of a couple of years ago, Amazon’s CloudFront CDN allows you to use a custom origin. This is really great, as you can simply tell it to pull from your own domain rather than an S3 bucket.

So, for example, I set this blog up with a CloudFront distribution that pulls directly from https://www.cooperhewitt.org. The resulting distribution is at https://d2y3kexd1yg34t.cloudfront.net. If you go to that URL you should see a mirror of this site. Then all we have to do is install a plugin for WordPress to replace static asset URLs with the CloudFront URL. You might notice this in action if you inspect the URL of any images on the site. You can of course add a CNAME to make the CloudFront URL prettier, but it isn’t required.

On the Drupal end of things, there is a simple module called CDN that does the same thing we are doing here via the WordPress W3TC plugin: it simply replaces static asset URLs with your CloudFront domain. Additionally, I see there is now a new Drupal module called amazons3 ( note the lack of the underscore ). This module is designed to allow Drupal to replace its default file system with your S3 bucket. So, when a user uploads files through the Drupal admin interface ( which normally sends files to sites/default/files on your local server ), the files automatically wind up in your S3 bucket.
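
Conceptually, the rewriting that both the W3TC plugin and the Drupal CDN module perform boils down to something like the following deliberately naive sketch, using the distribution mentioned above; the real modules are much smarter about which URLs they touch.

<?php
// Swap the origin's static file URLs for the CloudFront distribution.
// A real CDN module only rewrites static assets ( css, js, images ), not every link.
function cdn_rewrite( $html ) {
    return str_replace(
        'https://www.cooperhewitt.org/sites/default/files/',
        'https://d2y3kexd1yg34t.cloudfront.net/sites/default/files/',
        $html
    );
}

echo cdn_rewrite( '<img src="https://www.cooperhewitt.org/sites/default/files/logo.png" />' );
// <img src="https://d2y3kexd1yg34t.cloudfront.net/sites/default/files/logo.png" />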

I haven’t gotten this to work as of yet, but I think it’s a promising approach. Using this setup, you could maintain a clean and scalable Drupal codebase, keeping all of your user uploaded assets on an S3 bucket without much change to the standard workflow within the Drupal backend. NICE!

 

Moving to the Fog

When people have asked me where we host our website, I have usually replied with “it’s complicated.”

Last week we made some serious changes to our web infrastructure. Up until now we have been running most of our web properties on servers we have managed ourselves at Rackspace. These have included dedicated physical servers as well as a few cloud based instances. We also have a couple of instances running on Amazon EC2, as well as a few properties running at the Smithsonian Mothership in Washington DC.

For a long time, I had been looking for a more seamless and easier-to-manage solution. This was partially achieved when I moved the main site from our old dedicated server to a cloud-based set of instances behind a Rackspace load balancer. It seemed to perform pretty well, but I was still mostly responsible for it on my own.

PHPFog can be used to easily scale your web app by adding multiple app servers

Eventually I discovered a service built on top of Amazon EC2 known as PHPFog. This Platform as a Service (PaaS) is designed to allow people like myself to easily develop and deploy PHP-based web apps in the cloud. Essentially, what PHPFog does is set up an EC2 instance, configured and optimized by their own design. This is placed behind their own set of load balancers, Varnish cache servers and other goodies, and connected up with an Amazon RDS MySQL server. They also give you a hosted Git repository, and in fact, Git becomes your only connection to the file system. At first this seemed very unorthodox. No SSH, no FTP, nothing… just Git and phpMyAdmin to deal with the database. However, I spent a good deal of time experimenting with PHPFog and after a while I found the workflow to be really simple and easy to manage. Deployment is as easy as doing a git push, and the whole thing works in a similar fashion to Heroku.com, the popular Ruby on Rails PaaS.

What’s more, PHPFog, being built on EC2, is fairly extensible. If I wanted to, I could easily add an ElastiCache server, or my own dedicated RDS server. Basically, by setting up security groups which allow communication with PHPFog’s instances, I am able to connect to just about anything that Amazon AWS has to offer.

I continued to experiment with PHPFog and found some additional highlights. Each paid account comes with a free NewRelic monitoring account. NewRelic is really great as it offers a much more comprehensive monitoring system than many of the typical server alerting and monitoring apps available today. You can really get a nice picture of where the different bottlenecks are happening on your app, and what the real “end user” experience is like. In short, NewRelic was the icing on the cake.

Our NewRelic Dashboard

So, last week, we made the switch and are now running our main cooperhewitt.org site on “The Fog.” We have also been running this blog on the same instance. In fact, if you are really interested, you can check out our NewRelic stats for the last three hours under the “Performance” menu tab! It took a little tweaking to get our NewRelic alerts properly configured, but they seem to be working pretty seamlessly now.

Here’s a nice video explaining how AppFog/PHPFog works.

As you can see, we’ve got a nice little stack running here, all easily managed with minimal staff resources.

And here’s a somewhat different Fog altogether.

(Yes we are a little John Carpenter obsessed here)

Archiving Websites

I would imagine that just about any organization out there will eventually amass a collection of legacy web properties. I know we have! Be it a microsite from 1998 or some fantastic ( at the time ) forum that has now been declared “dead,” it’s a problem. The big question is what to do with them.

There are a few technical problems at work here. First, there is a feeling of permanence on the Internet that is hard to ignore. You want these legacy sites to live on in some form. Archive.org is a pretty good system for looking back at your main website, but it’s a moving target, constantly being updated with each iteration of your site. I’m talking more about preserving the old web outliers: the exhibition micro-sites and one-off contest sites you might have produced years ago.

The next issue is that in order for these sites to live on, you need to provide some level of maintenance for them. Nearly every website these days has a database running the show, so in order for these sites to work, they need to have an open connection to that database. This means you need to continually update the application code, and do crazy things like upgrade to MySQL 5, 6, 7 and so on. What a drag!

Scrape The Site

One option we have been using here at Cooper-Hewitt is called web scraping. This is a pretty common technique that essentially creates a non-dynamic, static version of any website. There are several ways of scraping a site, one of the simplest being the wget program.

wget is a pretty simple program that comes installed on most Linux distributions. You can also install it on your Mac using Homebrew. Here is a sample command line call using wget.

https://gist.github.com/1528607

wget works pretty well, but it’s not really the ideal tool for the job. All it does is download web content. It’s great for downloading files to your Linux server ( a nice way to install WordPress on a new Linux box ) but it doesn’t do much else.

The httrack homepage

For scraping our sites, we chose to go with a pretty simple tool called httrack ( thanks to Geoff Barker at Powerhouse ). This program ( available as a command line tool for Mac ) does the same thing wget does, with some added bells and whistles. The main bell is that it rewrites all of the internal hyperlinks in the site so that the archived copy can be hosted on just about any domain name.

Here is httrack running in my Mac’s terminal.

Hosting A Scraped Site

Once you have scraped a site, it probably makes sense to move it somewhere for safekeeping. We had lots of sites on lots of domains, which didn’t make much sense after years of producing these sites with different methodologies. So, we decided to create archive.cooperhewitt.org and place each scraped site in a sub-folder of this domain.

Initially I thought it would be really nice to host these static sites on Amazon’s S3. I know it’s possible to do this, but I found that many of the pages wouldn’t load correctly. I’m still interested in S3 as an option for this, as it’s sort of the perfect hardware for the job ( is it really hardware? ), but instead I chose to spin up a micro instance on EC2 and host the sites there.

Here’s an example of one of our scraped sites — https://archive.cooperhewitt.org/campana

301s

It’s pretty standard practice on the web to create 301 redirects for sites you are moving to a new domain. I was able to do this pretty easily using an .htaccess file and the following commands.

https://gist.github.com/1571750

This allows the site to still be reached at its original URL, https://campana.cooperhewitt.org, or any of its permalinks, like https://campana.cooperhewitt.org/about.html.

The Downsides

As with anything, there are downsides to using this technique. The main one is that there is no more interactivity. If your website had a commenting feature built in, it won’t work anymore. If it ran off a CMS like WordPress, you won’t be able to log in and make edits to your content. Everything is now static HTML, forever. Also, httrack won’t do it all. It hiccups on some types of URLs depending on the underlying structure/technology. I found this to be a small problem with things like rollover images and dynamic hyperlinks ( especially links with question marks in them ). But most of these issues can be resolved with a little cleanup.

One Final Step

Since you are scraping the site and turning it into static HTML, it does make sense to make a real archive of your original site files and any attached database. I simply copied all the files in our /var/www directory to an external hard drive and did a mass MySQL dump to the same drive. If I ever really need to resurrect one of the sites, I have everything I need sitting on a shelf in cold storage.

Designing Search

Some might say this is really our homepage

When I first started working at the Cooper-Hewitt in June of 2010, I was really interested in making our web presence more searchable. This is no simple task, and it is one that I am still working on today. So, I thought I might document some of the projects we have going on with regard to “search” at the Cooper-Hewitt.

The Google Search Appliance

One of the nice things about being part of the larger Smithsonian Institution is that we have some amazing resources. When I first started working here I took a day trip down to Washington DC to meet with the Smithsonian’s CTO and some of the folks who run the Smithsonian data center in Herndon, Virginia. They took me on a great tour of the facilities and talked about many of the unique resources I had at my fingertips. One such resource was our clustered Google Search Appliance. It’s no big surprise that Google, one of the biggest names in the search business, offers its technology to the enterprise.

Our setup provides the entire Smithsonian Institution with an enterprise-class web crawler and index database. Shown in the picture below, it consists of two separate clusters of five 2U Google Search Appliances. One cluster of five is simply a hot spare in case the other goes down.

Smithsonian’s Google Search Appliance — The one on the right is a backup for the one on the left.

The Google Search Appliance works in a similar way to Google itself. It constantly crawls across all of Smithsonian’s web properties, updating its index and providing a front-end for search. In fact, you can try it out yourself at https://search.si.edu where you can search across our entire network, or simply select the Smithsonian unit you are interested in. As you can see, results are displayed by relevance and follow a format similar to Google.com.

The Google Search Appliance is a great tool and we now have it integrated with our main website. If you want to try it out, go to https://cooperhewitt.org/search/gsa. This should give you results from the Google Search Appliance that span most of our web properties.

Of course the GSA comes with a few strings attached. First, it’s a web crawler, so it indexes web properties by crawling in an automated way. This means you tell it where to start and it automatically finds pages by following each link from one page to another. This works fine in most cases, but you really don’t have much control over how it crawls and what it finds. It’s also a device that is shared by the entire Smithsonian Institution, so it comes with some restrictions as to how it can be utilized and customized. As well, it’s a completely off-the-shelf solution based on proprietary code. In other words, it’s not open source, and thus can’t be hacked or altered or customized to do fun stuff! Lastly, these devices are pretty expensive. We are lucky to have them in our datacenter, but many of you reading this post will need a less costly option.

Google Custom Search

One such option might be Google’s Custom Search. This solution simply leverages Google’s existing infrastructure to create a site-specific search. Of course you have little control over how and when your site gets indexed, but it’s free and easy to set up. I’ve used Google Custom Search for a few personal sites that needed a hosted search solution.

You can try out our custom search site here. This one searches our main site as well as a number of other web properties we have. One nice thing about the GCS is that it can be hosted on Google’s site, or embedded on your own site through a number of layout options.

Drupal and WordPress

Nearly all Content Management Systems come with some type of search feature. The index is simply built for you automatically when you author a new post; there is no web crawler involved because it’s not necessary. However, this type of search comes with many caveats. First, you can only index the site managed by the CMS, so unlike the Google Search Appliance, you can’t use this to search multiple web properties on multiple domains. Also, the search baked into many popular CMS systems tends to be pretty limited in how it functions on the front end. WordPress is notoriously bad at displaying search results out of the box. It simply displays them in alphabetical order, which is not too helpful!

Drupal is a little better, with a decent advanced search page and the ability to sort results by relevance, but it is still pretty basic.

On the other hand, this type of manual indexing is very important, as you can ensure that every page you author gets indexed, as opposed to just hoping a web crawler finds your pages. Another advantage with Drupal is that the search back-end is fairly extensible and can be modified with modules. For example, our Google Search Appliance is easily integrated with our main website through this contributed module for Drupal.

Open-Source

While thinking about ways to improve search, we realized that we really needed to do our own thing. We sort of needed both modes of indexing ( web crawling and manual indexing ). We also needed something we had full control over, something open source and enterprise class.

Enter Solr and Nutch. These two applications are part of the larger family of Apache projects which power a giant portion of the web. Solr ( pronounced solar ) is:

the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world’s largest internet sites.

And Nutch is a complementary project to Solr that provides an enterprise-class web crawler. Nutch can help us crawl across multiple domains and web properties. It can also scale nicely using another Apache project known as Hadoop.

With Solr and Nutch ( and maybe a little Hadoop ) we should be able to build a pretty sophisticated platform for search. In fact, this has already been done at the Smithsonian in their Collections Search project, where you can search across nearly every Smithsonian unit.
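
To give a flavor of what querying Solr looks like from PHP, here is a minimal request against its HTTP API. The host, core name, and field names are assumptions for the sake of the example rather than our actual setup.

<?php
// Query a ( hypothetical ) local Solr core over HTTP and decode the JSON response.
$params = http_build_query( array(
    'q'    => 'radio',
    'rows' => 10,
    'wt'   => 'json',
) );

$raw     = file_get_contents( 'http://localhost:8983/solr/collection/select?' . $params );
$results = json_decode( $raw, true );

echo $results['response']['numFound'] . " results\n";

foreach ( $results['response']['docs'] as $doc ) {
    echo $doc['title'] . "\n"; // 'title' is a placeholder field name
}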

In my next post I will dig in deep and show you how we get these things up and running.