Author Archives: Micah Walter

About Micah Walter

Micah is the Director of Digital & Emerging Media.

Building the wall

Last month we released our collection data on GitHub. It was a pretty monumental occasion for the museum and we all worked very hard to make it happen. As a small example of what one might do with all of this data, we decided to build a new visualization of our collection in the form of the “Collection Wall Alpha.”

The collection wall, Alpha

The idea behind the collection wall was simple enough: create a visual display of the objects in our collection that is fun and interactive. I thought about how we might accomplish this, what it would look like, and how much work it would be to get it done in a short amount of time. I started with our own .csv data; I tinkered, played, extracted, and played some more. I quickly realized that the very data we were about to release required some thought to make it useful in practice. I probably overthought it.

Isotope

After a short time, we found a lovely jQuery plugin called Isotope. Designed by David DeSandro, Isotope is billed as “an exquisite jQuery plugin of magical layouts.” And it is! I quickly realized we should just use this plugin to display a never-ending waterfall of collection objects, each with a thumbnail and linked back to its record in our online collection database. Sounds easy enough, right?

Getting Isotope to work was pretty straightforward. You simply create each item you want on the page and add class identifiers to control how things are sorted and displayed. It has many options, and I picked the ones I thought would make the wall work.
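
To make that concrete, here is a rough sketch of the kind of markup and the one-line initialization involved. The container ID, class names, and item fields are hypothetical stand-ins rather than the wall’s actual code, and on the real wall the items are generated from the collection data instead of being hard-coded.

<?php
// Hypothetical sample items; on the real wall these come from the collection data.
$items = array(
    array( 'id' => '123', 'title' => 'Sample poster', 'thumb' => 'https://example.com/thumbs/123.jpg', 'url' => 'https://example.com/objects/123' ),
    array( 'id' => '456', 'title' => 'Sample textile', 'thumb' => 'https://example.com/thumbs/456.jpg', 'url' => 'https://example.com/objects/456' ),
);
?>
<div id="wall">
  <?php foreach ( $items as $item ): ?>
    <div class="item">
      <a href="<?php echo $item['url']; ?>">
        <img src="<?php echo $item['thumb']; ?>" alt="<?php echo htmlspecialchars( $item['title'] ); ?>" />
      </a>
    </div>
  <?php endforeach; ?>
</div>
<script>
  // Tell Isotope which children of the container to lay out.
  jQuery(function($){
    $('#wall').isotope({ itemSelector: '.item' });
  });
</script>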

Next I needed a way to reference the data, and I needed to produce the right subset of it: the objects that actually have images! For this I decided to turn to Amazon’s SimpleDB. SimpleDB is pretty much exactly what it sounds like: a scalable, non-relational database that is super simple to implement and requires no setup, configuration, or maintenance. I figured it would be the ideal place to store the data for this little project.

Once I had the data I was after, I used a tool called RazorSQL to upload the records to our SimpleDB domain. I then downloaded the AWS PHP SDK and used a few basic commands to query the data and populate the collection wall with images and data. Initially things were looking good, but I ran into a few problems. First, the data I was querying was over 16K rows tall. That’s a lot of data to store in memory. Fortunately, SimpleDB is already designed with this issue in mind. By default, a call to SimpleDB only returns the first 100 rows ( you can override this up to 2500 rows ). The last element in the returned data is a special token key which you can then use to call the next 100 rows.
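
For reference, a query like that looks roughly as follows with the first-generation AWS SDK for PHP ( the sdk.class.php flavor ). The domain name below is a placeholder and the response parsing is a sketch, so treat it as illustrative rather than the wall’s actual code.

<?php
require_once 'sdk.class.php'; // first-generation AWS SDK for PHP

$sdb = new AmazonSDB();

// SimpleDB hands back at most 100 rows by default ( up to 2500 with a LIMIT clause ).
$response = $sdb->select( 'SELECT * FROM `collection_objects` LIMIT 100' );

if ( $response->isOK() ) {
    foreach ( $response->body->SelectResult->Item as $item ) {
        echo (string) $item->Name . "\n"; // the item name, e.g. a record ID
    }

    // Token for fetching the next batch of rows, if there are more.
    $next_token = (string) $response->body->SelectResult->NextToken;
}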

Using this in a loop one could easily grab all 16K rows, but that sort of defeats the purpose as it still fills up the memory with the full 16K records. My next thought was to use paging, and essentially grab 100 rows at a time, per page. Isotope offers a pretty nifty “Infinite Scroll” configuration. I thought this would be ideal, allowing viewers to scroll through all 16K images. Once I got the infinite scroll feature to work, I realized that memory becomes an issue once you page down 30 or 40 pages, since all of the previously loaded items stay on the page. So, I’m going to have to figure out a way to dump out the buffer, or something along those lines, in a future release.
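
For what it’s worth, a sketch of that paged approach looks something like this: each request from the infinite scroll fetches a single batch of 100 rows and hands the NextToken back to the browser, which sends it along with the next request. The domain, attribute names, and parameter handling here are assumptions for the sake of the example.

<?php
require_once 'sdk.class.php';

$sdb = new AmazonSDB();

// The browser sends back the token it received with the previous page, if any.
$opt = array();
if ( ! empty( $_GET['next_token'] ) ) {
    $opt['NextToken'] = $_GET['next_token'];
}

// Only one page ( 100 rows ) is ever held in memory at a time.
$response = $sdb->select( 'SELECT id, title, thumb FROM `collection_objects` LIMIT 100', $opt );

$items = array();
foreach ( $response->body->SelectResult->Item as $item ) {
    $row = array();
    foreach ( $item->Attribute as $attr ) {
        $row[ (string) $attr->Name ] = (string) $attr->Value;
    }
    $items[] = $row;
}

header( 'Content-Type: application/json' );
echo json_encode( array(
    'items'      => $items,
    'next_token' => (string) $response->body->SelectResult->NextToken,
) );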

After about a month online, I noticed that SimpleDB charges were starting to add up. I haven’t really been able to figure out why. According to the docs, AWS only charges for “compute hours” which in my thinking should be much less than what I am seeing here. I’ll have to do some more digging on this one so we don’t break the bank!

SimpleDB charges

Another issue I noticed was that we were going to be calling lots of thumbnail images directly from our collection servers. This didn’t seem like such a great idea, so I decided to upload them all to an Amazon S3 bucket. To make sure I got the correct images, I created a simple PHP script that went through the 16K referenced images and automatically downloaded the correct resolution. It also auto-renamed each file to correspond with its record ID. Lastly, I set up an Amazon CloudFront CDN for the bucket, in hopes that this would speed up access to the images for users far and wide.
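
Something along these lines would do the job with the old SDK’s AmazonS3 class; the bucket name, image URL pattern, and sample records below are placeholders rather than the script we actually ran.

<?php
require_once 'sdk.class.php';

$s3     = new AmazonS3();
$bucket = 'example-collection-thumbs'; // placeholder bucket name

// $records maps record IDs to source image URLs ( placeholder data here ).
$records = array(
    '123' => 'https://example.com/media/123_medium.jpg',
    '456' => 'https://example.com/media/456_medium.jpg',
);

foreach ( $records as $id => $image_url ) {

    // Pull down the correct resolution from the collection server.
    $tmp = '/tmp/' . $id . '.jpg';
    file_put_contents( $tmp, file_get_contents( $image_url ) );

    // Re-upload to S3, renamed to match the record ID and publicly readable.
    $s3->create_object( $bucket, $id . '.jpg', array(
        'fileUpload'  => $tmp,
        'acl'         => AmazonS3::ACL_PUBLIC,
        'contentType' => 'image/jpeg',
    ) );

    unlink( $tmp );
}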

Overall I think this demonstrates just one possible outcome of our release of the collection metadata. I have plans to add more features such as sorting and filtering in the near future, but it’s a start!

Check out the code after the jump ( a little rough, I know ).

Media servers and some open sourceness

We use Amazon S3 for a good portion of our media hosting. It’s a simple and cost-effective solution for serving up assets big and small. When we initially moved to Drupal 6.x ( about a year ago ) I wanted to be sure that we would use S3 for as many of our assets as possible. This tactic was partly inspired by wanting to keep the Drupal codebase nice and clean, and also to allow us to scale horizontally if needed ( multiple app servers behind a load balancer ).

Horizontal Scaling

So in an attempt to streamline workflows, we modified this amazon_s3 Drupal module a little. The idea was to allow authors to easily use the Drupal node editor to upload their images and PDFs directly to our S3 bucket. The module also rewrites the URLs to pull the content from our CloudFront CDN, and sorts your images into folders based on the date ( a la WordPress ).

Our fork of amazon_s3 rewrites the URLs for our CDN, and sorts uploads into folders by date.

I’ve open sourced that code now; it is simply a fork of the amazon_s3 module, and it works pretty well on Drupal 6.x. It does have an issue where it uploads assets with some incorrect metadata. It’s really only a problem for uploaded PDFs, where the files will download but won’t open in your browser. This has to do with S3 storing the object’s Content-Type metadata as application/octet-stream instead of application/pdf. All in all I think it’s a pretty useful module.
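
The fix amounts to setting the correct Content-Type when the object is pushed to S3 instead of letting it default to application/octet-stream. Here is a hedged sketch of the idea using the old AWS SDK’s create_object() rather than the module’s own bundled S3 code; the bucket and key are placeholders, and the date-based key prefix mirrors the folder sorting mentioned above.

<?php
require_once 'sdk.class.php';

// Map file extensions to the Content-Type S3 should store with the object.
function guess_content_type( $filename ) {
    $types = array(
        'pdf' => 'application/pdf',
        'jpg' => 'image/jpeg',
        'png' => 'image/png',
        'gif' => 'image/gif',
    );
    $ext = strtolower( pathinfo( $filename, PATHINFO_EXTENSION ) );
    return isset( $types[ $ext ] ) ? $types[ $ext ] : 'application/octet-stream';
}

$s3 = new AmazonS3();

// Key is prefixed by date, a la WordPress, so uploads sort into folders.
$s3->create_object( 'example-media-bucket', '2012/01/annual-report.pdf', array(
    'fileUpload'  => '/tmp/annual-report.pdf',
    'acl'         => AmazonS3::ACL_PUBLIC,
    'contentType' => guess_content_type( 'annual-report.pdf' ),
) );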

As we move towards migrating to Drupal 7, I have been doing some more research about serving assets via S3 and CloudFront. Additionally, it seems that the Drupal community has developed some new modules which should help streamline a few things.

Create a CloudFront distribution for your whole site using a custom origin

As of a couple of years ago, Amazon’s CloudFront CDN allows you to use a custom origin. This is really great, as you can simply tell it to pull from your own domain rather than an S3 bucket.

So, for example, I set this blog up with a CloudFront distribution that pulls directly from https://www.cooperhewitt.org. The resulting distribution is at https://d2y3kexd1yg34t.cloudfront.net. If you go to that URL you should see a mirror of this site. Then all we have to do is install a plugin for WordPress to replace static asset URLs with the CloudFront URL. You might notice this in action if you inspect the URL of any images on the site. You can of course add a CNAME to make the CloudFront URL prettier, but it isn’t required.

On the Drupal end of things, there is a simple module called CDN that does the same thing we are doing here via the WordPress W3TC plugin: it simply replaces static asset URLs with your CloudFront domain. Additionally, I see there is now a new Drupal module called amazons3 ( note the lack of the underscore ). This module is designed to allow Drupal to replace its default file system with your S3 bucket. So, when a user uploads files through the Drupal admin interface ( which normally sends files to sites/default/files on your local server ), the files automatically wind up in your S3 bucket.
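
Conceptually, the rewriting that both the W3TC plugin and the Drupal CDN module perform boils down to something like the following deliberately naive sketch, using the distribution mentioned above; the real modules are much smarter about which URLs they touch.

<?php
// Swap the origin's static file URLs for the CloudFront distribution.
// A real CDN module only rewrites static assets ( css, js, images ), not every link.
function cdn_rewrite( $html ) {
    return str_replace(
        'https://www.cooperhewitt.org/sites/default/files/',
        'https://d2y3kexd1yg34t.cloudfront.net/sites/default/files/',
        $html
    );
}

echo cdn_rewrite( '<img src="https://www.cooperhewitt.org/sites/default/files/logo.png" />' );
// <img src="https://d2y3kexd1yg34t.cloudfront.net/sites/default/files/logo.png" />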

I haven’t gotten this to work as of yet, but I think it’s a promising approach. Using this setup, you could maintain a clean and scalable Drupal codebase, keeping all of your user uploaded assets on an S3 bucket without much change to the standard workflow within the Drupal backend. NICE!

 

Moving to the Fog

When people have asked me where we host our website, I have usually replied with “it’s complicated.”

Last week we made some serious changes to our web infrastructure. Up until now we have been running most of our web properties on servers we have managed ourselves at Rackspace. These have included dedicated physical servers as well as a few cloud based instances. We also have a couple of instances running on Amazon EC2, as well as a few properties running at the Smithsonian Mothership in Washington DC.

For a long time, I had been looking for a more seamless and easier-to-manage solution. This was partially achieved when I moved the main site from our old dedicated server to a cloud-based set of instances behind a Rackspace load balancer. It seemed to perform pretty well, but I was still mostly responsible for it on my own.

PHPFog can be used to easily scale your web app by adding multiple app servers

Eventually I discovered a service built on top of Amazon EC2 known as PHPFog. This Platform as a Service (PaaS) is designed to allow people like myself to easily develop and deploy PHP-based web apps in the cloud. Essentially, what PHPFog does is set up an EC2 instance, configured and optimized by their own design. This is placed behind their own set of load balancers, Varnish cache servers and other goodies, and connected up with an Amazon RDS MySQL server. They also give you a hosted Git repository, and in fact, Git becomes your only connection to the file system. At first this seemed very unorthodox. No SSH, no FTP, nothing… just Git and phpMyAdmin to deal with the database. However, I spent a good deal of time experimenting with PHPFog and after a while I found the workflow to be really simple and easy to manage. Deployment is as easy as doing a git push, and the whole thing works in a similar fashion to Heroku.com, the popular Ruby on Rails PaaS.

What’s more, PHPFog, being built on EC2, is fairly extensible. If I wanted to, I could easily add an ElastiCache server, or my own dedicated RDS server. Basically, by setting up security groups which allow communication with PHPFog’s instances, I am able to connect to just about anything that Amazon AWS has to offer.

I continued to experiment with PHPFog and found some additional highlights. Each paid account comes with a free NewRelic monitoring account. NewRelic is really great as it offers a much more comprehensive monitoring system than many of the typical server alerting and monitoring apps available today. You can really get a nice picture of where the different bottlenecks are happening on your app, and what the real “end user” experience is like. In short, NewRelic was the icing on the cake.

Our NewRelic Dashboard

So, last week, we made the switch and are now running our main cooperhewitt.org site on “The Fog.” We have also been running this blog on the same instance. In fact, if you are really interested, you can check out our NewRelic stats for the last three hours under the “Performance” menu tab! It took a little tweaking to get our NewRelic alerts properly configured, but they seem to be working pretty seamlessly now.

Here’s a nice video explaining how AppFog/PHPFog works.

As you can see, we’ve got a nice little stack running here, all easily managed with minimal staff resources.

And here’s a somewhat different Fog altogether.

(Yes we are a little John Carpenter obsessed here)

Archiving Websites

I would imagine that just about any organization out there will eventually amass a collection of legacy web properties. I know we have! Be it a microsite from 1998 or some fantastic ( at the time ) forum that has now been declared “dead,” it’s a problem. The big question is what to do with them.

There are a few technical problems at work here. First, there is a feeling of permanence on the Internet that is hard to ignore. You want these legacy sites to live on in some form. Archive.org is a pretty good system for looking back at your main website, but it’s a moving target, constantly being updated with each iteration of your site. I’m talking more about preserving the old web outliers: the exhibition micro-sites and one-off contest sites you might have produced years ago.

The next issue is that in order for these sites to live on, you need to provide some level of maintenance for them. Nearly every website these days has a database running the show, so in order for these sites to work, they need to have an open connection to that database. This means you need to continually update the application code, and do crazy things like upgrade to MySQL 5, 6, 7 and so on. What a drag!

Scrape The Site

One option we have been using here at Cooper-Hewitt is called web scraping. This is a pretty common technique that essentially creates a non-dynamic, static version of any website. There are several ways of scraping a site, one of the simplest being the wget program.

wget is a pretty simple program that comes installed on most Linux distributions. You can also install it on your Mac using Homebrew. Here is a sample command line call using wget.

https://gist.github.com/1528607

wget works pretty well, but it’s not really the ideal tool for the job. All it does is download web content. It’s great for downloading files to your Linux server ( a nice way to install WordPress on a new Linux box ) but it doesn’t do much else.

The httrack homepage

For scraping our sites, we chose to go with a pretty simple tool called httrack ( thanks to Geoff Barker at Powerhouse ). This program ( available as a command line tool for Mac ) does the same thing wget does, with some added bells and whistles. The main bell is that it rewrites all of the internal hyperlinks in the site so that the archived copy can be hosted on just about any domain name.

Here is httrack running in my Mac’s terminal.

Hosting A Scraped Site

Once you have scraped a site, it probably makes sense to move it somewhere for safekeeping. We had lots of sites on lots of domains, which didn’t make much sense after years of producing these sites with different methodologies. So, we decided to create archive.cooperhewitt.org and place each scraped site in a sub-folder of this domain.

Initially I thought it would be really nice to host these static sites on Amazon’s S3. I know it’s possible to do this, but I found that many of the pages wouldn’t load correctly. I’m still interested in S3 as an option for this, as it’s sort of the perfect hardware for the job ( is it really hardware? ), but instead I chose to spin up a micro instance on EC2 and host the sites there.

Here’s an example of one of our scraped sites — https://archive.cooperhewitt.org/campana

301s

It’s pretty standard practice on the web to create 301 redirects for sites you are moving to a new domain. I was able to do this pretty easily using an .htaccess file and the following commands.

https://gist.github.com/1571750

This allows the site to still be reached at its original URL, https://campana.cooperhewitt.org, or any of its permalinks, like https://campana.cooperhewitt.org/about.html.

The Downsides

As with anything, there are downsides to using this technique. The main one is that there is no more interactivity. If your website had a commenting feature built in, it won’t work anymore. If it ran off a CMS like WordPress, you won’t be able to log in and make edits to your content. Everything is now static HTML, forever. Also, httrack won’t do it all. It hiccups on some types of URLs depending on the underlying structure/technology. I found this to be a small problem with things like rollover images and dynamic hyperlinks ( especially links with question marks in them ). But most of these issues can be resolved with a little cleanup.

One Final Step

Since you are scraping the site and turning it into static HTML, it does make sense to make a real archive of your original site files and any attached database. I simply copied all the files in our /var/www directory to an external hard drive and did a mass MySQL dump to the same drive. If I ever really need to resurrect one of the sites, I have everything I need sitting on a shelf in cold storage.

Designing Search

Some might say this is really our homepage

When I first started working at the Cooper-Hewitt in June of 2010, I was really interested in making our web presence more searchable. This is no simple task, and it is one that I am still working on today. So, I thought I might document some of the projects we have going on with regard to “search” at the Cooper-Hewitt.

The Google Search Appliance

One of the nice things about being part of the larger Smithsonian Institution is that we have some amazing resources. When I first started working here I took a day trip down to Washington DC to meet with the Smithsonian’s CTO and some of the folks who run the Smithsonian data center in Herndon, Virginia. They took me on a great tour of the facilities and talked about many of the unique resources I had at my fingertips. One such resource was our clustered Google Search Appliance. It’s no big surprise that Google, one of the biggest names in the search business, offers its technology to the enterprise.

Our setup provides the entire Smithsonian Institution with an enterprise-class web crawler and index database. Shown in the picture below, it consists of two separate clusters of five 2U Google Search Appliances. One cluster of five is simply a hot spare in case the other goes down.

Smithsonian’s Google Search Appliance — The one on the right is a backup for the one on the left.

The Google Search Appliance works in a similar way to Google itself. It constantly crawls across all of Smithsonian’s web properties, updating its index and providing a front-end for search. In fact, you can try it out yourself at https://search.si.edu where you can search across our entire network, or simply select the Smithsonian unit you are interested in. As you can see, results are displayed by relevance and follow a format similar to Google.com.

The Google Search Appliance is a great tool and we now have it integrated with our main website. If you want to try it out, go to https://cooperhewitt.org/search/gsa. This should give you results from the Google Search Appliance that span most of our web properties.

Of course the GSA comes with a few strings attached. First, it’s a web crawler, so it indexes web properties by crawling in an automated way. This means you tell it where to start and it automatically finds pages by following each link from one page to another. This works fine in most cases, but you really don’t have much control over how it crawls and what it finds. It’s also a device that is shared by the entire Smithsonian Institution, so it comes with some restrictions as to how it can be utilized and customized. As well, it’s a completely off-the-shelf solution based on proprietary code. In other words, it’s not open source, and thus can’t be hacked or altered or customized to do fun stuff! Lastly, these devices are pretty expensive. We are lucky to have them in our datacenter, but many of you reading this post will need a less costly option.

Google Custom Search

One such option might be Google’s Custom Search. This solution simply leverages Google’s existing infrastructure to create a site-specific search. Of course you have little control over how and when your site gets indexed, but it’s free and easy to set up. I’ve used Google Custom Search for a few personal sites that needed a hosted search solution.

You can try out our custom search site here. This one searches our main site as well as a number of other web properties we have. One nice thing about the GCS is that it can be hosted on Google’s site, or embedded on your own site through a number of layout options.

Drupal and WordPress

Nearly all Content Management Systems come with some type of search feature. The index is simply built for you automatically when you author a new post; there is no web crawler involved because it’s not necessary. However, this type of search comes with many caveats. First, you can only index the site managed by the CMS, so unlike the Google Search Appliance, you can’t use this to search multiple web properties on multiple domains. Also, the search baked into many popular CMS systems tends to be pretty limited in how it functions on the front end. WordPress is notoriously bad at displaying search results out of the box. It simply displays them in alphabetical order, which is not too helpful!

Drupal is a little better, with a decent advanced search page and the ability to sort results by relevance, but it is still pretty basic.

On the other hand, this type of manual indexing is very important, as you can ensure that every page you author gets indexed, as opposed to just hoping a web crawler finds your pages. Another advantage with Drupal is that the search back-end is fairly extensible and can be modified with modules. For example, our Google Search Appliance is easily integrated with our main website through this contributed module for Drupal.

Open-Source

While thinking about ways to improve search, we realized that we really needed to do our own thing. We sort of needed both modes of indexing ( web crawling and manual indexing ). We also needed something we had full control over, something open source and enterprise class.

Enter Solr and Nutch. These two applications are part of the larger family of Apache projects which power a giant portion of the web. Solr ( pronounced solar ) is:

the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world’s largest internet sites.

And Nutch is a complementary project to Solr that provides an enterprise-class web crawler. Nutch can help us crawl across multiple domains and web properties. It can also scale nicely using another Apache project known as Hadoop.

With Solr and Nutch ( and maybe a little Hadoop ) we should be able to build a pretty sophisticated platform for search. In fact, this has already been done at the Smithsonian in their Collections Search project, where you can search across nearly every Smithsonian unit.
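
To give a flavor of what querying Solr looks like from PHP, here is a minimal request against its HTTP API. The host, core name, and field names are assumptions for the sake of the example rather than our actual setup.

<?php
// Query a ( hypothetical ) local Solr core over HTTP and decode the JSON response.
$params = http_build_query( array(
    'q'    => 'radio',
    'rows' => 10,
    'wt'   => 'json',
) );

$raw     = file_get_contents( 'http://localhost:8983/solr/collection/select?' . $params );
$results = json_decode( $raw, true );

echo $results['response']['numFound'] . " results\n";

foreach ( $results['response']['docs'] as $doc ) {
    echo $doc['title'] . "\n"; // 'title' is a placeholder field name
}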

In my next post I will dig in deep and show you how we get these things up and running.