Tag Archives: s3

Building the wall

Last month we released our collection data on Github.com. It was a pretty monumental occasion for the museum and we all worked very hard to make it happen. In an attempt to build a small example of what one might do with all of this data, we decided to build a new visualization of our collection in the form of the “Collection Wall Alpha.”

The collection wall, Alpha

The collection wall, Alpha

The idea behind the collection wall was simple enough–create a visual display of the objects in our collection that is fun and interactive. I thought about how we might accomplish this, what it would look like, and how much work it would be to get it done in a short amount of time. I thought about using our own .csv data, I tinkered, and played, and extracted, and extracted, and played some more. I realized quickly that the very data we were about to release required some thought to make it useful in practice. I probably over-thought.

Isotope

Isotope

After a short time, we found this lovely JQuery plugin called Isotope. Designed by David DeSandro, Isotope offers “an exquisite Jquery plugin of magical layouts.” And it does! I quickly realized we should just use this plugin to display a never-ending waterfall of collection objects, each with a thumbnail, and linked back to the records in our online collection database. Sounds easy enough, right?

Getting Isotope to work was pretty straight-forward. You simply create each item you want on the page, and add class identifiers to control how things are sorted and displayed. It has many options, and I picked the ones I thought would make the wall work.

Next I needed a way to reference the data, and I needed to produce the right subset of the data–the objects that actually have images! For this I decided to turn to Amazon’s SimpleDB. SimpleDB is pretty much exactly what it sounds like. It’s a super-simple to implement, scalable, non-relational database which requires no setup, configuration, or maintenance. I figured it would be the ideal place to store the data for this little project.

Once I had the data I was after, I used a tool called RazorSQL to upload the records to our SimpleDB domain. I then downloaded the AWS PHP SDK and used a few basic commands to query the data and populate the collection wall with images and data. Initially things were looking good, but I ran into a few problems. First, the data I was querying was over 16K rows tall. Thats allot of data to store in memory. Fortunately, SimpleDB is already designed with this issue in mind. By default, a call to SimpleDB only returns the first 100 rows ( you can override this up to 2500 rows ). The last element in the returned data is a special token key which you can then use to call the next 100 rows.

Using this in a loop one could easily see how to grab all 16K rows, but that sort of defeats the purpose as it still fills up the memory with the full 16K records. My next thought was to use paging, and essentially grab 100 rows at a time, per page. Isotope offers a pretty nifty “Infinite Scroll” configuration. I thought this would be ideal, allowing viewers to scroll through all 16K images. Once I got the infinite scroll feature to work, I realized that it is an issue once you page down 30 or 40 pages. So, I’m going to have to figure out a way to dump out the buffer, or something along those lines in a future release.

After about a month online, I noticed that SimpleDB charges were starting to add up. I haven’t really been able to figure out why. According to the docs, AWS only charges for “compute hours” which in my thinking should be much less than what I am seeing here. I’ll have to do some more digging on this one so we don’t break the bank!

SimpleDB charges

SimpleDB charges

Another issue I noticed was that we were going to be calling lots of thumbnail images directly from our collection servers. This didn’t seem like such a great idea, so I decided to upload them all to an Amazon S3 bucket. To make sure I got the correct images, I created simple php script that went through the 16K referenced images and automatically downloaded the correct resolution. It also auto-renamed each file to correspond with the record ID. Lastly, I set up an Amazon CloudFront CDN for the bucket, in hopes that this would speed up access to the images for users far and wide.

Overall I think this demonstrates just one possible outcome of our releasing of the collection meta-data. I have plans to add more features such as sorting and filtering in the near future, but it’s a start!

Check out the code after the jump ( a little rough, I know ).

Continue reading

Media servers and some open sourceness

We use Amazon S3 for a good portion of our media hosting. It’s a simple and cost effective solution for serving up assets big and small. When we moved initially to Drupal 6.x ( about a year ago ) I wanted to be sure that we would use S3 for as many of our assets as possible. This tactic was partly inspired by wanting to keep the Drupal codebase nice and clean, and also to allow us to scale horizontally if needed ( multiple app servers behind a load balancer ).

Horizontal Scaling

Horizontal Scaling

So in an attempt to streamline workflows, we modified this amazon_s3 Drupal module a little. The idea was to allow authors to easily use the Drupal node editor to upload their images and PDFs directly to our S3 bucket. It would also rewrite the URLs to pull the content from our CloudFront CDN. It also sorts your images into folders based on the date ( a-la-Wordpress).

amazon_s3

Our fork of amazon_s3 rewrite the URL for our CDN, and sorts into folders by date.

I’ve opened sourced that code now which is simply a fork of the amazon_s3 module. It works pretty well on Drupal 6.x. It has an issue where it uploads assets with some incorrect meta-data. It’s really only a problem for uploaded PDFs where the files will download but won’t open in your browser. This has to do with the S3 metadata tag of application/octet-stream vs. application/pdf. All in all I think its a pretty useful module.

As we move towards migrating to Drupal 7, I have been doing some more research about serving assets via S3 and CloudFront. Additionally, it seems that the Drupal community have developed some new modules which should help streamline a few things

Custom Origin

Create a CloudFront distribution for you whole site using a custom origin

As of a couple years ago Amazon’s CloudFront CDN allows you to use a custom origin. This is really great as you can simply tell it to pull from your own domain rather than an S3 bucket.

So for example, I set this blog up with a CloudFront distribution that pulls direct from http://labs.cooperhewitt.org. The resultant distribution is at http://d2y3kexd1yg34t.cloudfront.net. If you go to that URL you should see a mirror of this site. Then all we have to do is install a plugin for WordPress to replace static asset URLs with the CloudFront URL. You might notice this in action if you inspect the URL of any images on the site. You can of course add a CNAME to make the CloudFront URL prettier, but it isn’t required.

On the Drupal end of things, there is a simple module called CDN that does the same thing as we are doing here via the WordPress W3TC plugin. It simply replaces static asset files with your CloudFront domain. Additionally, I see there is now a new Drupal module called amazons3 ( note the lack of the underscore ). This module is designed to allow Drupal to replace it’s default file system with your S3 bucket. So, when a user uploads files through the Drupal admin interface ( which normally sends files to sites/default/files on your local server ) files automatically wind up in your S3 bucket.

I haven’t gotten this to work as of yet, but I think it’s a promising approach. Using this setup, you could maintain a clean and scalable Drupal codebase, keeping all of your user uploaded assets on an S3 bucket without much change to the standard workflow within the Drupal backend. NICE!