Author Archives: Micah Walter

About Micah Walter

Micah Walter is the Webmaster. He likes tacos and Drupal and you'll be hearing a lot about web technologies from him. Before joining the Cooper-Hewitt he received his MFA in Photographic and Electronic Media from MICA (no relation). Before that he lived on an island in the West Indies, and before that he claims to have been a photojournalist in the Middle East, but says all the photographic evidence is on some hard drive in his closet.

Welcome to object phone. Your call has been placed in a queue.

I made another small thing. Again, another way for me to experiment with the Collection API, and again, another way to experiment with new ways of accessing the collection. This time, there aren’t many screen shots to display–there is no website to look at. This time, it’s “Welcome to object phone!”

(718) 213-4915

Object Phone” is ( presently ) a very, very simple implementation of a way to explore our collection by dialing a telephone, or sending a text message. I had been thinking of a few of the more popular museum oriented audio tour products, and how they all seem to be very CMS style in their design, and wondering if we could just use our own API.

For example, TourML and TAP ( which offer the web programmer a very powerful framework for programming a mobile guide using the Drupal CMS ) are very nice, but they are still very dependent on content production. The developer or content manager has to build and curate all of the content for the “tour.” This might be a good way to go about things, especially if you are leaning on an existing Drupal installation for a good deal of your content, but I was looking for a way to access existing data, and specifically the data in our collection website.

In the beginning of developing our collection website, we went through the process of assigning EVERYTHING a unique “bigint” in the form of what we are referring to as an “artisinal integer.” This means that each object record, each person record and each, well, everything else has a unique integer which no other thing can have. This is not in place of accession numbers–we will probably always have accession numbers The nice thing about unique integers is that they’re really easy to deal with on a programmatic level.

For example, if you text 18704235 to 718-213-4915 you should get a response that looks like the screenshot below. In fact you can text any object id number from our collection and get a similar response.

2013-04-18 10.15.18

You can also dial that same number and use your keypad to either search the collection by object ID, or ask for a random object. The application will respond to you using a text to speech converter, which is usually pretty good.

Presently, the app is not replying with a whole lot of information. You essentially get the object’s title and medium field if it has one. In many cases, asking for a random object may just result in something like “Drawing.” Many of our object records don’t have much more useful information than this, and also, I am trying to wrangle with the idea of how much information is useful in a voice and text message ( with a 160 character limit per SMS).

The whole system is leveraging the Twilio service and API. Twilio offers quite a range of possibilities, and I am very excited to experiment with more. For example, instead of text to speech, Twilio can play back .wav files. Additionally, Twilio can do things like dial another phone number, forward calls and record the caller’s voice. There are so many possibilities here that I wont even begin to list them, but for example, I could easily see us using this to capture user feedback in our galleries by phone and text.

I’m very interested in figuring out a way to search by voice. I’m sort of dreaming of programming the thing to go “Why don’t you just tell me the object number!” as in this great episode of Seinfeld which you can watch by clicking the image below.

Screen Shot 2013-04-18 at 10.35.01 AMIf you are interested, I have also made the code public on this Gist. It’s pretty messy and redundant right now, but you’ll get the idea.

One of the more complicated aspects of this project will be designing the phone interface so it makes sense. Currently, once you hear an object play back, the system just hangs up on you. It would be nice to offer the user a better way to manipulate the system which is still pleasant and easy to understand. By that same token, there is a completely different approach that is needed for the SMS end of things as you don’t really have a menu tree, but instead of list of possible commands the user need to learn. Fortunately, there is a ton of great work that has already been accomplished in this arena, specifically by the Walker Art Center’s very long running and very yellow website Art on Call.

More to come & code after the jump

Continue reading

Introducing the Albers API method

Screen Shot 2013-02-06 at 12.11.51 PM

We recently added a method to our Collection API which allows you to get any object’s “Albers” color codes. This is a pretty straightforward method where you pass the API an object ID, and it returns to you a triple of color values in hex format.

As an experiment, I thought it would be fun to write a short script which uses our API to grab a random object, grab its Albers colors, and then use that info to build an Albers inspired image. So here goes.

For this project I chose to work in Python as I already have some experience with it, and I know it has a decent imaging library. I started by using pycurl to authenticate with our API storing the result in a buffer and then using simplejson to parse the results. This first step grabs a random object using the getRandom API method.

api_token = 'YOUR-COOPER-HEWITT-TOKEN'

buf = cStringIO.StringIO()

c = pycurl.Curl()
c.setopt(c.URL, 'https://api.collection.cooperhewitt.org/rest')
d = {'method':'cooperhewitt.objects.getRandom','access_token':api_token}

c.setopt(c.WRITEFUNCTION, buf.write)

c.setopt(c.POSTFIELDS, urllib.urlencode(d) )
c.perform()

random = json.loads(buf.getvalue())

buf.reset()
buf.truncate()

object_id = random.get('object', [])
object_id = object_id.get('id', [])

print object_id

I then use the object ID I got back to ask for the Albers color codes. The getAlbers API method returns the hex color value and ID number for each “ring.” This is kind of interesting because not only do I know the color value, but I know what it refers to in our collection ( period_id, type_id, and department_id ).

d = {'method':'cooperhewitt.objects.getAlbers','id':object_id ,'access_token':api_token}

c.setopt(c.POSTFIELDS, urllib.urlencode(d) )
c.perform()

albers = json.loads(buf.getvalue())

rings = albers.get('rings',[])
ring1color = rings[0]['hex_color']
ring2color = rings[1]['hex_color']
ring3color = rings[2]['hex_color']

print ring1color, ring2color, ring3color

buf.close()

Now that I have the ring colors I can build my image. To do this, I chose to follow the same pattern of concentric rings that Aaron talks about in this post, introducing the Albers images as a visual language on our collections website. However, to make things a little interesting, I chose to add some randomness to the the size and position of each ring. Building the image in python was pretty easy using the ImageDraw module

size = (1000,1000)             
im = Image.new('RGB', size, ring1color) 
draw = ImageDraw.Draw(im)

ring2coordinates = ( randint(50,100), randint(50,100) , randint(900, 950), randint(900,950))

print ring2coordinates

ring3coordinates = ( randint(ring2coordinates[0]+50, ring2coordinates[0]+100) , randint(ring2coordinates[1]+50, ring2coordinates[1]+100) ,  randint(ring2coordinates[2]-200, ring2coordinates[2]-50) , randint(ring2coordinates[3]-200, ring2coordinates[3]-50) )

print ring3coordinates

draw.rectangle(ring2coordinates, fill=ring2color)
draw.rectangle(ring3coordinates, fill=ring3color)

del draw 

im.save('file.png', 'PNG')

The result are images that look like the one below, saved to my local disk. If you’d like to grab a copy of the full working python script for this, please check out this Gist.

Albersify

A bunch of Albers images

So, what can you humanities hackers do with it?

Curatorial Poetry

Curatorial Poetry

I made a fun thing…yesterday. In my 20% time, downtime, er.. 2am time, I decided to build a simple tumblr blog called Curatorial Poetry. I was inspired by Aaron’s take on our collection data and how he chose to present objects in our collection that have no image, but have a “description.” In the office we often have fun reading these aloud, or better, with Apple’s screen reader.

But, I thought it would be fun to “reblog” these in another form. So, I built a simple python script to do just that. To do this, I forked our collection data and wrote a short tool to convert our JSON objects into an sqlite3 database. I chose sqlite3 because, well, it’s light, and doesn’t require me to set up a DB server or anything.

Next, I spent most of my time trying to learn how oAuth2 works. It took me a good bit of time googling around before I realized that the python-oath2 library includes oAuth1 ( which tumblr uses ). All I really needed to do with the tumblr API was to create the post. Once I had my keys worked out and authenticating, it was just one line of code.

Once the post is published, the script updates my sqlite3 db so it makes sure to not post the same thing twice. Thats all!

I’d like to expand the code for this a bit to add some error checking, build in connections to our own API ( instead of using the data dump ) and connect with Twitter. I’m also interested in adding other museum’s data. We have the IMA data available on GitHub, but they don’t include “description” text, so well see… In the meantime, follow it and you’ll receive a new “poem” in your tumblr feed every two hours, for the next 8 years!

Getting lost in the collection (alpha)

Last week marked a pretty significant moment in my career here at Cooper-Hewitt.

As I’m sure most of you already know, we launched our Alpha collections website. The irony of this being an “alpha” of course is that it is by leaps and bounds better than our previous offering built on eMuseum.

If you notice in the screengrab below, eMuseum was pretty bland. The homepage, which is still available allowed you to engage by selecting from one of 4 museum oriented departments. You could also search. Right…

Upon entering the site, either via search or through browsing one of the four departments several things were in my mind huge problems.

Above is a search for “lennon” and you get the idea. Note the crazy long URLs with all kinds of session specific data. This was for years a huge issue as people would typically copy and paste that URL to use in blog posts and tweets. Trying to come back to the same URL twice never worked, so at some point we added a little “permlink” link at the bottom, but users rarely found it. You’ll also note the six search options under the menu item “Search.” OK, it’s just confusing.

Finally landing on an object page and you have the object data, but where does it all lead you to?

For me the key to a great, deep, user experience is to allow users to get lost within the site. It sounds odd at first. I mean if you are spending your precious time doing research on our site, you wouldn’t really want to “get lost” but in practice, it’s how we make connections, discover the oddities we never knew existed, and actually allow ourselves to uncover our own original thought. Trust me, getting lost is essential. (And it is exactly what we know visitors enjoy doing inside exhibitions.)

As you can probably tell by now, getting lost on our old eMuseum was pretty tough to do. It was designed to do just the opposite. It was designed to help you find “an object.”

Here’s what happens when you try to leave the object page in the old site.

So we ditched all that. Yes, I said that right, we ditched eMuseum! To tell you the truth, this has been something I have been waiting to do since I started here over two years ago.

When I started here, we had about 1000 objects in eMuseum. Later we upped that to 10,000, and when Seb began (at the end of 2011) we quickly upped that number to about 123,000.

I really love doing things in orders of magnitude.

I noticed that adding more and more objects to eMuseum didn’t really slow it down, it just made it really tough to browse and get lost. There was just too much stuff in one place. There were no other entry points aside from searching and clicking lists of the four departments.

Introducing our new collections website.

We decided to start from scratch, pulling data from our existing TMS database. This is the same data we were exporting to eMuseum, and the same data we released as CC0 on GitHub back in February. The difference would be, presenting the data in new ways.

Note the many menu options. These will change over time, but immediately you can browse the collection by some high level categories.

Note the random button–refreshing your browser displays three random objects form our collection. This is super-fun and to our surprise has been one of the most talked about features of the whole alpha release. We probably could have added a random button to eMuseum and called it a day, but we like to aim a little higher.

Search is still there, where it should be. So let’s do a similar search for “lennon” and see what happens.

Here, I’ve skipped the search results page, but I will mention, there were a few more results. Our new object page for this same poster by Richard Avedon is located at the nice, friendly and persistent URL http://collection.cooperhewitt.org/objects/18618175/. It has its own unique ID (more on this in a post next week by Aaron) and I have to say, looks pretty simple. We have lots of other kinds of URLs with things in them like “people” , “places” and “periods.” URLs are important, and we intend to ensure that these live on forever. So, go ahead and link away.

The basic page layout is pretty similar to eMuseum, at first. You have essential viatal stats in the gray box on the right. Object ID, Accession Number, Tombstone data, etc, etc. You also have a map, and some helpful hints.

But then things get a little more exciting. We pull in loads of TMS data, crunch it and link it up to all sorts of things. Take a scroll down this page and you’ll see lots of text, with lots of links and a few fun bits at the end, like our “machine tag” field.

Each object has been assigned a machine tag based on its unique ID number. This is simple and straightforward future-proofing. If you’re on a site like Flickr and you come across (or have your own) photo of the same object we have in our collection, you can add the machine tag.

Some day in the near future we will write code to pull in content tagged in this way from a variety of sources. This is where the collection site will really begin to take shape as well not only be displaying the “thing we have” but its relevance in the world.

It puts a whole new spin on the concept of “collecting” and it’s something we are all very excited to see happen. So start tagging!

Moving on, I mentioned that you can get lost. This is nice.

From the Avedon poster page I clicked on the decade called 1960′s This brings me to a place where I can browse based on the decade. You can jump ahead in time and easily get lost. It’s so interesting to connect the object you are currently looking at to others from the same time period. You immediately get the sense that design happens all over the world in a wide variety of ways. It’s pretty addictive.

Navigating to the “person” page for Richard Avedon, we begin to see how these connections can extend beyond our own institutional research. We begin by pointing out what kinds of things we have by Avedon. This is pretty straight-forward, but in the gray box on the right you can see we have also linked up Avedon’s person record in our own TMS database with a wide variety of external datasets. For Avedon we have concordances with Freebase, MoMA, the V & A, and of course Wikipedia. In fact, we are pulling in Wikipedia text directly to the page.

In future releases we will connect with more and more on the web. I mean, thats the whole point of the web, right? If you want to help us make these connections ( we cant do everything in code ) feel free to fork our concordances repository on GitHub and submit a pull request.

We have many categorical methods of browsing our collection now. But I have to say, my favorite is still the Random images on the home page. In fact I started a Pinterest board called “Random Button” to capture a few of my favorites. There are fun things, famous things, odd things and downright ugly things, formerly hidden away from sight, but now easy to discover, serendipitously via the random button!

There is much more to talk about, but I’ll stop here for now. Aaron, the lead developer on the project is working on a much more in depth and technical post about how he engineered the site and was able to bring us to a public alpha in less than three months!

So stay tuned…. and get exploring.

People playing with collections #14: collection data on Many Eyes

Many Eyes Website

Many Eyes Website

I love seeing examples of uses of our collection metadata in the wild. bartdavis has uploaded our data to Many Eyes and created a few visualizations.

I found it interesting to see how many “matchsafes” we have in the collection, as you can easily see in the “color blindness test” inspired bubble chart! Here are a few screen grabs, but check them out for yourself at http://www-958.ibm.com.

Of interest to us, too, is that these visualisations are only possible because we released the collection data as a single dump. If we had, like many museums, only provided an API, this would not have been possible (or at least been much more difficult) to do.

Bubble chart of object types

Bubble chart of object types

Number of objects by century

Number of objects by century

Word cloud of object types

Word cloud of object types

Totally cached out

We do a good deal of cacheing on our web properties here at Cooper-Hewitt.

Our web host, PHPFog adds a layer of cacheing for free known as Varnish cache. Varnish Cache sits in front of our web servers and performs what is known as reverse proxy cacheing. This type of cacheing is incredibly important as it adds the ability to quickly serve cached files to users on the Internet vs. continually recreating dynamic web-pages by making calls into the database.

For static assets such as images, javascripts, and css files, we turn to Amazon’s CloudFront CDN. This type of technology ( which I’ve mentioned in a number of other posts here ) places these static assets on a distributed network of “edge” locations around the world, allowing quicker access to these assets geographically speaking, and as well, it removes a good deal of burden from our application servers.

However, to go a bit further, we thought of utilizing memcache. Memcache is an in-memory database key-value type cacheing application. It helps to speed up calls to the database by storing as much of that information in memory as possible. This has been proven to be extremely effective across many gigantic, database intensive websites like Facebook, Twitter, Tumblr, and Pinterest ( to name just a few ). Check this interesting post on scaling memcached at Facebook.

To get started with memcache I turned to Amazon’s Elasticache offering. Elasticache is essentially a managed memcache server. It allows you to spin up a memcache cluster in a minute or two, and is super easy to use. In fact, you could easily provision a terabyte of memcache in the same amount of time. There is no installation, configuration or maintenance to worry about. Once your memcache cluster is up and running you can easily add or remove nodes, scaling as your needs change on a nearly real-time basis.

Check this video for a more in-depth explanation.

Elasticache also works very nicely with our servers at PHPFog as they are all built on Amazon EC2, and are in fact in the same data center. To get the whole thing working with our labs.cooperhewitt.org blog, I had to do the following.

  1. Create a security group. In order for PHPFog to talk to your own Elasticache cluster, you have to create a security group that contains PHPFog’s AWS ID. There is documentation on the PHPFog website on how to do this for use with an Amazon RDS server, and the same steps apply for Elasticache.
  2. Provision an Elasticache cluster. I chose to start with a single node, m1.large instance which gives me about 7.5 Gig of RAM to work with at $0.36 an hour per node. I can always add more nodes in the future if I want, and I can even roll down to a smaller instance size by simply creating another cluster.
  3. Let things simmer for a minute. It takes a minute or two for your cluster to initialize.
  4. On WordPress install the W3TC plugin. This plugin allows you to connect up your Elasticache server, and as well offers tons of configurable options for use with things like a CloudFront CDN and more. Its a must have! If you are on Drupal or some other CMS. there are similar modules that achieve the same result.
  5. In W3TC enable whatever types of cacheing you wish to do and set the cache type to memcache. In my case, I chose page cache, minify cache, database cache, and object cache, all of which work with memcache. Additionally I set up our CloudFront CDN from within this same plugin.
  6. In each cache types config page, set your memcache endpoint to the one given by your AWS control panel. If you have multiple nodes, you will have to copy and paste them all into each of these spaces. There is a test button you can hit to make sure your installation is communicating with your memcache server.

That last bit is interesting. You can have multiple clusters with multiple nodes serving as cache servers for a number of different purposes. You can also use the same cache cluster for multiple sites, so long as they are all reachable via your security group settings.

Once everything is configured and working you can log out and let the cacheing being. It helps to click through the site to allow the cache to build up, but this will happen automatically if your site gets a decent amount of traffic. In the AWS control panel you can check on your cache cluster in the CloudWatch tab where you can keep track of how much memory and cpu is being utilized at any given time. You can also set up alerts so that if you run out of cache, you get notified so you can easily add some nodes.

We hope to employ this same cacheing cluster on our main cooperhewitt.org website, as well as a number of our other web properties in the near future.

Building the wall

Last month we released our collection data on Github.com. It was a pretty monumental occasion for the museum and we all worked very hard to make it happen. In an attempt to build a small example of what one might do with all of this data, we decided to build a new visualization of our collection in the form of the “Collection Wall Alpha.”

The collection wall, Alpha

The collection wall, Alpha

The idea behind the collection wall was simple enough–create a visual display of the objects in our collection that is fun and interactive. I thought about how we might accomplish this, what it would look like, and how much work it would be to get it done in a short amount of time. I thought about using our own .csv data, I tinkered, and played, and extracted, and extracted, and played some more. I realized quickly that the very data we were about to release required some thought to make it useful in practice. I probably over-thought.

Isotope

Isotope

After a short time, we found this lovely JQuery plugin called Isotope. Designed by David DeSandro, Isotope offers “an exquisite Jquery plugin of magical layouts.” And it does! I quickly realized we should just use this plugin to display a never-ending waterfall of collection objects, each with a thumbnail, and linked back to the records in our online collection database. Sounds easy enough, right?

Getting Isotope to work was pretty straight-forward. You simply create each item you want on the page, and add class identifiers to control how things are sorted and displayed. It has many options, and I picked the ones I thought would make the wall work.

Next I needed a way to reference the data, and I needed to produce the right subset of the data–the objects that actually have images! For this I decided to turn to Amazon’s SimpleDB. SimpleDB is pretty much exactly what it sounds like. It’s a super-simple to implement, scalable, non-relational database which requires no setup, configuration, or maintenance. I figured it would be the ideal place to store the data for this little project.

Once I had the data I was after, I used a tool called RazorSQL to upload the records to our SimpleDB domain. I then downloaded the AWS PHP SDK and used a few basic commands to query the data and populate the collection wall with images and data. Initially things were looking good, but I ran into a few problems. First, the data I was querying was over 16K rows tall. Thats allot of data to store in memory. Fortunately, SimpleDB is already designed with this issue in mind. By default, a call to SimpleDB only returns the first 100 rows ( you can override this up to 2500 rows ). The last element in the returned data is a special token key which you can then use to call the next 100 rows.

Using this in a loop one could easily see how to grab all 16K rows, but that sort of defeats the purpose as it still fills up the memory with the full 16K records. My next thought was to use paging, and essentially grab 100 rows at a time, per page. Isotope offers a pretty nifty “Infinite Scroll” configuration. I thought this would be ideal, allowing viewers to scroll through all 16K images. Once I got the infinite scroll feature to work, I realized that it is an issue once you page down 30 or 40 pages. So, I’m going to have to figure out a way to dump out the buffer, or something along those lines in a future release.

After about a month online, I noticed that SimpleDB charges were starting to add up. I haven’t really been able to figure out why. According to the docs, AWS only charges for “compute hours” which in my thinking should be much less than what I am seeing here. I’ll have to do some more digging on this one so we don’t break the bank!

SimpleDB charges

SimpleDB charges

Another issue I noticed was that we were going to be calling lots of thumbnail images directly from our collection servers. This didn’t seem like such a great idea, so I decided to upload them all to an Amazon S3 bucket. To make sure I got the correct images, I created simple php script that went through the 16K referenced images and automatically downloaded the correct resolution. It also auto-renamed each file to correspond with the record ID. Lastly, I set up an Amazon CloudFront CDN for the bucket, in hopes that this would speed up access to the images for users far and wide.

Overall I think this demonstrates just one possible outcome of our releasing of the collection meta-data. I have plans to add more features such as sorting and filtering in the near future, but it’s a start!

Check out the code after the jump ( a little rough, I know ).

Continue reading

Media servers and some open sourceness

We use Amazon S3 for a good portion of our media hosting. It’s a simple and cost effective solution for serving up assets big and small. When we moved initially to Drupal 6.x ( about a year ago ) I wanted to be sure that we would use S3 for as many of our assets as possible. This tactic was partly inspired by wanting to keep the Drupal codebase nice and clean, and also to allow us to scale horizontally if needed ( multiple app servers behind a load balancer ).

Horizontal Scaling

Horizontal Scaling

So in an attempt to streamline workflows, we modified this amazon_s3 Drupal module a little. The idea was to allow authors to easily use the Drupal node editor to upload their images and PDFs directly to our S3 bucket. It would also rewrite the URLs to pull the content from our CloudFront CDN. It also sorts your images into folders based on the date ( a-la-Wordpress).

amazon_s3

Our fork of amazon_s3 rewrite the URL for our CDN, and sorts into folders by date.

I’ve opened sourced that code now which is simply a fork of the amazon_s3 module. It works pretty well on Drupal 6.x. It has an issue where it uploads assets with some incorrect meta-data. It’s really only a problem for uploaded PDFs where the files will download but won’t open in your browser. This has to do with the S3 metadata tag of application/octet-stream vs. application/pdf. All in all I think its a pretty useful module.

As we move towards migrating to Drupal 7, I have been doing some more research about serving assets via S3 and CloudFront. Additionally, it seems that the Drupal community have developed some new modules which should help streamline a few things

Custom Origin

Create a CloudFront distribution for you whole site using a custom origin

As of a couple years ago Amazon’s CloudFront CDN allows you to use a custom origin. This is really great as you can simply tell it to pull from your own domain rather than an S3 bucket.

So for example, I set this blog up with a CloudFront distribution that pulls direct from http://labs.cooperhewitt.org. The resultant distribution is at http://d2y3kexd1yg34t.cloudfront.net. If you go to that URL you should see a mirror of this site. Then all we have to do is install a plugin for WordPress to replace static asset URLs with the CloudFront URL. You might notice this in action if you inspect the URL of any images on the site. You can of course add a CNAME to make the CloudFront URL prettier, but it isn’t required.

On the Drupal end of things, there is a simple module called CDN that does the same thing as we are doing here via the WordPress W3TC plugin. It simply replaces static asset files with your CloudFront domain. Additionally, I see there is now a new Drupal module called amazons3 ( note the lack of the underscore ). This module is designed to allow Drupal to replace it’s default file system with your S3 bucket. So, when a user uploads files through the Drupal admin interface ( which normally sends files to sites/default/files on your local server ) files automatically wind up in your S3 bucket.

I haven’t gotten this to work as of yet, but I think it’s a promising approach. Using this setup, you could maintain a clean and scalable Drupal codebase, keeping all of your user uploaded assets on an S3 bucket without much change to the standard workflow within the Drupal backend. NICE!

 

Moving to the Fog

When people have asked me where we host our website, I have usually replied with “it’s complicated.”

Last week we made some serious changes to our web infrastructure. Up until now we have been running most of our web properties on servers we have managed ourselves at Rackspace. These have included dedicated physical servers as well as a few cloud based instances. We also have a couple of instances running on Amazon EC2, as well as a few properties running at the Smithsonian Mothership in Washington DC.

For a long time, I had been looking for a more seamless and easier to manage solution. This was partially achieved when I moved the main site from our old dedicated server to a cloud-based set of instances behind a Rackspace load balancer. It seemed to perform pretty well, but still I was mostly responsible for it on my own.

PHPFog

PHPFog can be used to easily scale your web-app by adding multiple app servers

Eventually I discovered a service built on top of Amazon EC2 known as PHPFog. This Platform as a Service (PaaS) is designed to allow people like myself to easily develop and deploy PHP based web apps in the Cloud. Essentially, what PHPFog does is set up an EC2 instance, configured and optimized by their own design. This is placed behind their own set of load balancers, Varnish Cache servers and other goodies, and connected up with an Amazon RDS MySQL server. They also give you a hosted Git repository, and in fact, Git becomes your only connection to the file system. At first this seemed very un-orthrodox. No SSH, no FTP, nothing… just Git and PHPMyAdmin to deal with the database. However, I spent a good deal of time experimenting with PHPFog and after a while I found the workflow to be really simple and easy to manage. Deployment is as easy as doing a Git Push, and the whole thing worked in a similar fashion to Heroku.com, the popular Ruby on Rails PaaS.

What’s more is that PHPFog, being built on EC2 was fairly extensible. If I wanted to, I could easily add an ElastiCache server, or my own dedicated RDS server. Basically, through setting up security groups which allow communication to PHPFog’s instances, I am able to connect to just about anything that Amazon AWS has to offer.

I continued to experiment with PHPFog and found some additional highlights. Each paid account comes with a free NewRelic monitoring account. NewRelic is really great as it offers a much more comprehensive monitoring system than many of the typical server alerting and monitoring apps available today. You can really get a nice picture of where the different bottlenecks are happening on your app, and what the real “end user” experience is like. In short, NewRelic was the icing on the cake.

NewRelic

Our NewRelic Dashboard

So, last week, we made the switch and are now running our main cooperhewitt.org site on “The Fog.” We have also been running this blog on the same instance. In fact, if you are really interested, you can check out our NewRelic stats for the last three hours in the “Performance menu tab!” It took a little tweaking to get our NewRelic alerts properly configured, but they seem to be working pretty seamlessly now.

Here’s a nice video explaining how AppFog/PHPFog works.

As you can see, we’ve got a nice little stack running here and all easily managed with minimal staff resource.

And here’s a somewhat different Fog altogether.

(Yes we are a little John Carpenter obsessed here)

Archiving Websites

I would imagine that just about any organization out there will eventually amass a collection of legacy web properties. I know we have! Be it a microsite from 1998 or some fantastic ( at the time ) forum that has now been declared “dead” — it’s a problem. The big question being, what to do with them.

There are a few technical problems at work here. First, there is a feeling of permanence on the Internet that is hard to ignore. You want these legacy sites to live on in some form. archive.org is a pretty good system for looking back at your main website, but its a moving target, constantly being updated with each iteration of your site. I’m talking more about preserving old web outliers. Those exhibition micro-sites, and one-off contest sites you might have produced years ago.

The next issue is that in order for these sites to live on, you need to provide some level of maintenance for them. Nearly every website these days has a database running the show, so in order for these sites to work, they need to have an open connection to that database. This means you need to continually update the application code, and do crazy things like upgrade to MySQL 5, 6, 7 and so on. What a drag!

Scrape The Site

One option we have been using here at Cooper-Hewitt is called web scraping. This is a pretty common technique that essentially creates a non-dynamic, static version of any website. There are several ways of scraping a site, one of the simplest being the wget program.

wget is a pretty simple program that comes installed on most linux distributions. You can also install it on your Mac using Homebrew. Here is a sample command line call using wget.

wget works pretty well but its not really the ideal tool for the job. All it does is download web content. It’s great for downloading files to your linux server ( nice way to install WordPress on a new linux box ) but it doesn’t do much else.

httrack

The httrack homepage

For scraping our sites, we chose to go with a pretty simple tool called httrack ( thanks to Geoff Barker at Powerhouse ). This program ( available as a command line tool for Mac ) does the same thing wget does, with some added bells and whistles. The main bell being that it re-writes all of the internal hyperlinks in the site so that the archived site can be hosted on just about any domain name.

httrack

Here is httrack running in my Mac’s terminal.

Hosting A Scraped Site

Once you have scraped a site, it probably makes sense to move it somewhere for safe keeping. We had lots of sites on lots of domains. It didn’t really make sense after years of producing these sites with different methodologies. So, we decided to create archive.cooperhewitt.org and place each scraped site as a sub-folder of this domain.

Initially I thought it would be really nice to host these static sites on Amazon’s S3. I know it’s possible to do this, but I found that many of the pages wouldn’t load correctly. I’m still interested in S3 as an option for this as it’s sort of the perfect hardware for the job ( is it really hardware? ) but instead I chose to spin up a micro instance on EC2 and host the sites there.

Here’s an example of one of our scraped sites — http://archive.cooperhewitt.org/campana

301s

It’s pretty standard practice on the web to create 301 redirects for sites you are moving to a new domain. I was able to do this pretty easily using an .htaccess file and the following commands.

This allows you to browse the site by going to the original URL at http://campana.cooperhewitt.org or any of its permalinks like http://campana.cooperhewitt.org/about.html

The Downsides

As with anything, there are downsides to using this technique. The main one being no more interactivity. If your website had a commenting feature built in, it won’t work anymore. If it ran off a CMS like WordPress, you won’t be able to log in and make edits to your content. Everything is now static HTML, forever. Also, httrack won’t do it all. It hiccups on some types of URLs depending on the underlying structure/technology. I found this to be a small problem with things like roll over images and dynamic hyperlinks ( especially links with ? marks in them ). But most of these issues can be resolved with a little cleanup.

One Final Step

Since you are scraping the site and turning it into static html, it does make sense to make a real archive of your original site files and any attached database. I simply copied all the files in our /var/www directory to an external hard drive and did a mass MySQL dump to the same drive. If I ever really need to resurrect one of the sites, I have everything I need sitting on a shelf in cold storage.