The Labs is a place where you can discover what is going on behind the scenes in digital & emerging media at the Cooper-Hewitt. Now you might well be asking why the Cooper-Hewitt needs another blog – we’ve got the popular design blog on our site already – so here’s why:
It is going to get noisy here. It might even get a bit messy. And that’s the point.
(There’s a reason we have a tanuki as an unofficial mascot for the Labs.)
There’s going to be a lot of technical posts in amongst others that teardown how experiments have performed. And some speculations, ideas, and things we’d like to do, too.
The Cooper-Hewitt is undergoing a transformation with a major rebuilding project. As a result digital content development, digital outreach, and the integration of digital into the fabric and visitor experience of the new building, are all top of mind for us here. This means that there is a lot of experimentation and rapid change, and in the spirit of the broader Smithsonian New Media & Web Strategy, we are going to be doing a lot of ‘thinking aloud’.
I would imagine that just about any organization out there will eventually amass a collection of legacy web properties. I know we have! Be it a microsite from 1998 or some fantastic ( at the time ) forum that has now been declared “dead” — it’s a problem. The big question being, what to do with them.
There are a few technical problems at work here. First, there is a feeling of permanence on the Internet that is hard to ignore. You want these legacy sites to live on in some form. archive.org is a pretty good system for looking back at your main website, but its a moving target, constantly being updated with each iteration of your site. I’m talking more about preserving old web outliers. Those exhibition micro-sites, and one-off contest sites you might have produced years ago.
The next issue is that in order for these sites to live on, you need to provide some level of maintenance for them. Nearly every website these days has a database running the show, so in order for these sites to work, they need to have an open connection to that database. This means you need to continually update the application code, and do crazy things like upgrade to MySQL 5, 6, 7 and so on. What a drag!
Scrape The Site
One option we have been using here at Cooper-Hewitt is called web scraping. This is a pretty common technique that essentially creates a non-dynamic, static version of any website. There are several ways of scraping a site, one of the simplest being the wget program.
wget is a pretty simple program that comes installed on most linux distributions. You can also install it on your Mac using Homebrew. Here is a sample command line call using wget.
wget works pretty well but its not really the ideal tool for the job. All it does is download web content. It’s great for downloading files to your linux server ( nice way to install WordPress on a new linux box ) but it doesn’t do much else.
The httrack homepage
For scraping our sites, we chose to go with a pretty simple tool called httrack ( thanks to Geoff Barker at Powerhouse ). This program ( available as a command line tool for Mac ) does the same thing wget does, with some added bells and whistles. The main bell being that it re-writes all of the internal hyperlinks in the site so that the archived site can be hosted on just about any domain name.
Here is httrack running in my Mac’s terminal.
Hosting A Scraped Site
Once you have scraped a site, it probably makes sense to move it somewhere for safe keeping. We had lots of sites on lots of domains. It didn’t really make sense after years of producing these sites with different methodologies. So, we decided to create archive.cooperhewitt.org and place each scraped site as a sub-folder of this domain.
Initially I thought it would be really nice to host these static sites on Amazon’s S3. I know it’s possible to do this, but I found that many of the pages wouldn’t load correctly. I’m still interested in S3 as an option for this as it’s sort of the perfect hardware for the job ( is it really hardware? ) but instead I chose to spin up a micro instance on EC2 and host the sites there.
As with anything, there are downsides to using this technique. The main one being no more interactivity. If your website had a commenting feature built in, it won’t work anymore. If it ran off a CMS like WordPress, you won’t be able to log in and make edits to your content. Everything is now static HTML, forever. Also, httrack won’t do it all. It hiccups on some types of URLs depending on the underlying structure/technology. I found this to be a small problem with things like roll over images and dynamic hyperlinks ( especially links with ? marks in them ). But most of these issues can be resolved with a little cleanup.
One Final Step
Since you are scraping the site and turning it into static html, it does make sense to make a real archive of your original site files and any attached database. I simply copied all the files in our /var/www directory to an external hard drive and did a mass MySQL dump to the same drive. If I ever really need to resurrect one of the sites, I have everything I need sitting on a shelf in cold storage.