Tag Archives: amazon

Totally cached out

We do a good deal of caching on our web properties here at Cooper-Hewitt.

Our web host, PHPFog, adds a layer of caching for free in the form of Varnish Cache. Varnish Cache sits in front of our web servers and performs what is known as reverse proxy caching. This type of caching is incredibly important, as it lets us quickly serve cached copies of pages to users instead of continually regenerating dynamic web pages with calls to the database.

For static assets such as images, JavaScript, and CSS files, we turn to Amazon’s CloudFront CDN. This type of technology ( which I’ve mentioned in a number of other posts here ) places these static assets on a distributed network of “edge” locations around the world, so users fetch them from a location geographically closer to them. It also takes a good deal of load off our application servers.

However, to go a bit further, we thought of utilizing memcache. Memcache is an in-memory key-value caching application. It speeds things up by keeping the results of database calls in memory, so that repeat requests can skip the database altogether. This approach has proven extremely effective on many gigantic, database-intensive websites like Facebook, Twitter, Tumblr, and Pinterest ( to name just a few ). Check this interesting post on scaling memcached at Facebook.
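To make the idea concrete, here is a minimal sketch of the "check the cache first, fall back to the database" pattern. It is not how W3TC or any particular memcache client is implemented; `TinyCache`, `slow_database_query`, and `get_post` are hypothetical names, and a plain in-process dictionary stands in for a real memcache server.

```python
import time


class TinyCache:
    """In-process stand-in for a memcache client (hypothetical, illustration only)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        # Treat expired entries as misses and drop them.
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl=None):
        expires_at = time.monotonic() + ttl if ttl else None
        self._store[key] = (value, expires_at)


def slow_database_query(post_id, counter):
    """Pretend database call; `counter` records how often the database is hit."""
    counter["queries"] += 1
    return {"id": post_id, "title": f"Post {post_id}"}


def get_post(cache, post_id, counter):
    """Return a post, using the cache when possible."""
    key = f"post:{post_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached  # cache hit: no database call
    post = slow_database_query(post_id, counter)
    cache.set(key, post, ttl=300)  # keep it for five minutes (arbitrary choice)
    return post
```

Calling `get_post` twice with the same `post_id` should only touch the "database" once; the second call is answered from memory.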

To get started with memcache I turned to Amazon’s Elasticache offering. Elasticache is essentially a managed memcache server. It allows you to spin up a memcache cluster in a minute or two, and is super easy to use. In fact, you could easily provision a terabyte of memcache in the same amount of time. There is no installation, configuration or maintenance to worry about. Once your memcache cluster is up and running you can easily add or remove nodes, scaling as your needs change on a nearly real-time basis.

Check this video for a more in-depth explanation.

Elasticache also works very nicely with our servers at PHPFog as they are all built on Amazon EC2, and are in fact in the same data center. To get the whole thing working with our labs.cooperhewitt.org blog, I had to do the following.

  1. Create a security group. In order for PHPFog to talk to your own Elasticache cluster, you have to create a security group that contains PHPFog’s AWS ID. There is documentation on the PHPFog website on how to do this for use with an Amazon RDS server, and the same steps apply for Elasticache.
  2. Provision an Elasticache cluster. I chose to start with a single-node cluster on an m1.large instance, which gives me about 7.5 GB of RAM to work with at $0.36 an hour per node. I can always add more nodes in the future if I want, and I can even move down to a smaller instance size by simply creating another cluster.
  3. Let things simmer for a minute. It takes a minute or two for your cluster to initialize.
  4. On WordPress, install the W3TC plugin. This plugin lets you connect to your Elasticache server, and it also offers tons of configurable options for use with things like a CloudFront CDN and more. It’s a must-have! If you are on Drupal or some other CMS, there are similar modules that achieve the same result.
  5. In W3TC, enable whatever types of caching you wish to use and set the cache type to memcache. In my case, I chose page cache, minify cache, database cache, and object cache, all of which work with memcache. Additionally, I set up our CloudFront CDN from within this same plugin.
  6. In each cache type’s config page, set your memcache endpoint to the one given in your AWS control panel. If you have multiple nodes, you will have to copy and paste all of their endpoints into each of these fields. There is a test button you can hit to make sure your installation is communicating with your memcache server.

That last bit is interesting. You can have multiple clusters with multiple nodes serving as cache servers for a number of different purposes. You can also use the same cache cluster for multiple sites, so long as they are all reachable via your security group settings.
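Why does every node endpoint need to be listed? With memcache, it is generally the client that decides which node holds a given key, usually by hashing the key. The sketch below shows the simplest version of that idea. The endpoint names are made up, and `pick_node` is a hypothetical helper rather than anything W3TC exposes; real clients often use consistent hashing instead of a plain modulo so that adding a node does not remap most keys.

```python
import hashlib

# Hypothetical node endpoints; yours come from the AWS control panel.
# 11211 is the default memcached port.
NODES = [
    "cache-node-1.example.com:11211",
    "cache-node-2.example.com:11211",
    "cache-node-3.example.com:11211",
]


def pick_node(key, nodes):
    """Map a cache key to one node by hashing the key (simple modulo scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


if __name__ == "__main__":
    for key in ("post:7", "post:8", "page:/about/"):
        print(key, "->", pick_node(key, NODES))
```

The same key always lands on the same node, which is what lets several sites or several app servers share one cluster, as long as they all use the same node list.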

Once everything is configured and working you can log out and let the caching begin. It helps to click through the site to allow the cache to build up, but this will happen automatically if your site gets a decent amount of traffic. In the AWS control panel you can check on your cache cluster in the CloudWatch tab, where you can keep track of how much memory and CPU is being utilized at any given time. You can also set up alerts so that if you run out of cache, you get notified and can easily add some nodes.

We hope to employ this same caching cluster on our main cooperhewitt.org website, as well as a number of our other web properties, in the near future.

Media servers and some open sourceness

We use Amazon S3 for a good portion of our media hosting. It’s a simple and cost-effective solution for serving up assets big and small. When we initially moved to Drupal 6.x ( about a year ago ) I wanted to be sure that we would use S3 for as many of our assets as possible. This tactic was partly inspired by wanting to keep the Drupal codebase nice and clean, and also to allow us to scale horizontally if needed ( multiple app servers behind a load balancer ).

Horizontal Scaling

So in an attempt to streamline workflows, we modified this amazon_s3 Drupal module a little. The idea was to allow authors to easily use the Drupal node editor to upload their images and PDFs directly to our S3 bucket. The module also rewrites the URLs to pull the content from our CloudFront CDN, and it sorts your images into folders based on the date ( à la WordPress ).
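The two tweaks can be sketched roughly as follows. This is an illustration in Python rather than the module's actual PHP, and the details are assumptions: the year/month folder layout mirrors what WordPress does with uploads, `cdn.example.org` is a placeholder for a CloudFront domain, and `s3_key_for_upload` and `cdn_url` are hypothetical helper names.

```python
from datetime import date
from urllib.parse import quote

# Placeholder CDN host; a real setup would use its own CloudFront domain or CNAME.
CDN_HOST = "cdn.example.org"


def s3_key_for_upload(filename, uploaded_on):
    """Build a bucket key that files the upload under year/month folders."""
    return f"{uploaded_on:%Y/%m}/{filename}"


def cdn_url(key):
    """Turn a bucket key into a URL served from the CDN instead of S3."""
    return f"http://{CDN_HOST}/{quote(key)}"


if __name__ == "__main__":
    key = s3_key_for_upload("exhibition poster.pdf", date(2012, 3, 9))
    print(key)
    print(cdn_url(key))
```

Keeping the key-building and the URL-building separate means the folder scheme can change without touching how links are written into the page.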

amazon_s3

Our fork of amazon_s3 rewrites the URL for our CDN, and sorts uploads into folders by date.

I’ve now open sourced that code, which is simply a fork of the amazon_s3 module. It works pretty well on Drupal 6.x. It does have an issue where it uploads assets with some incorrect metadata. This is really only a problem for uploaded PDFs, where the files will download but won’t open in your browser. It has to do with the file’s S3 content type being set to application/octet-stream rather than application/pdf. All in all I think it’s a pretty useful module.
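One way to avoid that kind of problem is to work out the content type from the filename before uploading, and only fall back to the generic type when nothing better is known. The sketch below uses Python's standard `mimetypes` module to show the idea; it is not a patch for the module, and `content_type_for` is a hypothetical helper whose result you would pass along as the object's Content-Type when uploading.

```python
import mimetypes

# Generic "just download this" type; browsers will not try to display it inline.
FALLBACK_TYPE = "application/octet-stream"


def content_type_for(filename):
    """Guess a Content-Type from the file extension, with a generic fallback."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or FALLBACK_TYPE


if __name__ == "__main__":
    for name in ("annual-report.pdf", "gallery.jpg", "mystery.unknownext123"):
        print(name, "->", content_type_for(name))
```

With the right type attached, a browser knows a PDF is a PDF and can open it in place instead of forcing a download.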

As we move towards migrating to Drupal 7, I have been doing some more research about serving assets via S3 and CloudFront. Additionally, it seems that the Drupal community has developed some new modules which should help streamline a few things.

Custom Origin

Create a CloudFront distribution for your whole site using a custom origin

As of a couple years ago Amazon’s CloudFront CDN allows you to use a custom origin. This is really great as you can simply tell it to pull from your own domain rather than an S3 bucket.

So for example, I set this blog up with a CloudFront distribution that pulls directly from http://labs.cooperhewitt.org. The resultant distribution is at http://d2y3kexd1yg34t.cloudfront.net. If you go to that URL you should see a mirror of this site. Then all we have to do is install a plugin for WordPress to replace static asset URLs with the CloudFront URL. You might notice this in action if you inspect the URL of any images on the site. You can of course add a CNAME to make the CloudFront URL prettier, but it isn’t required.
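The URL replacement itself is conceptually simple. Here is a rough sketch of the kind of rewrite such a plugin performs, using the two hostnames above; the list of extensions and the `rewrite_static_urls` function are assumptions for illustration, not W3TC's actual rules or code.

```python
import re

ORIGIN = "http://labs.cooperhewitt.org"
CDN = "http://d2y3kexd1yg34t.cloudfront.net"

# Assumed set of "static" file types; a real plugin makes this configurable.
STATIC_EXTENSIONS = ("png", "jpg", "jpeg", "gif", "css", "js")

# Match origin URLs whose path ends in one of the static extensions.
_STATIC_URL = re.compile(
    re.escape(ORIGIN)
    + r"(/[^\s\"'<>]+\.(?:" + "|".join(STATIC_EXTENSIONS) + r"))"
    + r"(?=[\"'\s?>]|$)",
    re.IGNORECASE,
)


def rewrite_static_urls(html):
    """Point static asset URLs at the CDN; leave ordinary page links alone."""
    return _STATIC_URL.sub(lambda match: CDN + match.group(1), html)


if __name__ == "__main__":
    page = (
        '<img src="http://labs.cooperhewitt.org/wp-content/uploads/a.png">'
        '<a href="http://labs.cooperhewitt.org/about/">About</a>'
    )
    print(rewrite_static_urls(page))
```

Only the asset URLs change, so visitors still browse the site on its normal domain while images, stylesheets, and scripts come from the nearest edge location.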

On the Drupal end of things, there is a simple module called CDN that does the same thing we are doing here via the WordPress W3TC plugin. It simply replaces the domain in static asset URLs with your CloudFront domain. Additionally, I see there is now a new Drupal module called amazons3 ( note the lack of the underscore ). This module is designed to allow Drupal to replace its default file system with your S3 bucket. So, when a user uploads files through the Drupal admin interface ( which normally sends files to sites/default/files on your local server ), the files automatically wind up in your S3 bucket.

I haven’t gotten this to work as of yet, but I think it’s a promising approach. Using this setup, you could maintain a clean and scalable Drupal codebase, keeping all of your user uploaded assets on an S3 bucket without much change to the standard workflow within the Drupal backend. NICE!