Archiving Websites

I would imagine that just about any organization out there will eventually amass a collection of legacy web properties. I know we have! Be it a microsite from 1998 or some fantastic ( at the time ) forum that has now been declared “dead”, it’s a problem. The big question is what to do with them.

There are a few technical problems at work here. First, there is a feeling of permanence on the Internet that is hard to ignore. You want these legacy sites to live on in some form. archive.org is a pretty good system for looking back at your main website, but it’s a moving target, constantly being updated with each iteration of your site. I’m talking more about preserving old web outliers: those exhibition micro-sites and one-off contest sites you might have produced years ago.

The next issue is that in order for these sites to live on, you need to provide some level of maintenance for them. Nearly every website these days has a database running the show, so in order for these sites to work, they need to have an open connection to that database. This means you need to continually update the application code, and do crazy things like upgrade to MySQL 5, 6, 7 and so on. What a drag!

Scrape The Site

One option we have been using here at Cooper-Hewitt is called web scraping. This is a pretty common technique that essentially creates a non-dynamic, static version of any website. There are several ways of scraping a site, one of the simplest being the wget program.

wget is a pretty simple program that comes installed on most Linux distributions. You can also install it on your Mac using Homebrew. Here is a sample command line call using wget.

https://gist.github.com/1528607
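
The gist above holds the command we actually ran. As a rough sketch of what a mirroring call can look like, the Python snippet below assembles a wget command line from flags documented in the wget manual. The function name, the URL and the output folder are placeholders for illustration, not our real settings.

```python
import shlex


def build_wget_command(url: str, dest_dir: str) -> list[str]:
    """Assemble a wget call that mirrors a site for offline viewing."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--page-requisites",   # also fetch the images, CSS and scripts each page needs
        "--convert-links",     # rewrite links so the local copy can be browsed offline
        "--adjust-extension",  # save pages with an .html extension
        "--no-parent",         # never climb above the starting URL
        "--directory-prefix", dest_dir,
        url,
    ]


# Placeholder values for illustration only.
command = build_wget_command("http://example.org/microsite/", "scraped")
print(shlex.join(command))
```

Printing the command rather than running it makes it easy to review the flags before pointing wget at a real site.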

wget works pretty well but it’s not really the ideal tool for the job. All it does is download web content. It’s great for downloading files to your Linux server ( nice way to install WordPress on a new Linux box ) but it doesn’t do much else.

httrack

The httrack homepage

For scraping our sites, we chose to go with a pretty simple tool called httrack ( thanks to Geoff Barker at Powerhouse ). This program ( available as a command line tool for Mac ) does the same thing wget does, with some added bells and whistles. The main bell is that, by default, it re-writes all of the internal hyperlinks in the site so that the archived site can be hosted on just about any domain name.

Here is httrack running in my Mac’s terminal.
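
For readers who have not used httrack, here is a minimal sketch of a command line call, again assembled in Python. It assumes the documented `-O` ( output path ) and `-v` ( verbose ) options and a `+host/*` scan rule to keep the crawl on one host. The site and folder names are made up, and this is not necessarily the exact call we used.

```python
import shlex
from urllib.parse import urlsplit


def build_httrack_command(start_url: str, output_dir: str) -> list[str]:
    """Assemble an httrack call that stays on the starting host."""
    host = urlsplit(start_url).hostname
    return [
        "httrack",
        start_url,
        "-O", output_dir,  # where the mirror and its logs are written
        f"+{host}/*",      # scan rule: accept only URLs on the same host
        "-v",              # verbose progress in the terminal
    ]


# Placeholder site and folder, not one of our real microsites.
command = build_httrack_command("http://microsite.example.org/", "archive/microsite")
print(shlex.join(command))
```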

Hosting A Scraped Site

Once you have scraped a site, it probably makes sense to move it somewhere for safekeeping. We had lots of sites on lots of domains, which didn’t really make sense after years of producing these sites with different methodologies. So, we decided to create archive.cooperhewitt.org and place each scraped site as a sub-folder of this domain.

Initially I thought it would be really nice to host these static sites on Amazon’s S3. I know it’s possible to do this, but I found that many of the pages wouldn’t load correctly. I’m still interested in S3 as an option for this as it’s sort of the perfect hardware for the job ( is it really hardware? ) but instead I chose to spin up a micro instance on EC2 and host the sites there.

Here’s an example of one of our scraped sites — https://archive.cooperhewitt.org/campana

301s

It’s pretty standard practice on the web to create 301 redirects for sites you are moving to a new domain. I was able to do this pretty easily using an .htaccess file and the following directives.

https://gist.github.com/1571750
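
The gist is not reproduced here, so the following is only a sketch of one common way to express this kind of redirect, assuming Apache with mod_rewrite enabled. The helper function is hypothetical. It simply prints rules that send every path on the old host to the same path under the archive folder with a 301 status.

```python
import re


def htaccess_redirect(old_host: str, new_base: str) -> str:
    """Return mod_rewrite rules that 301-redirect every path on old_host."""
    return "\n".join([
        "RewriteEngine On",
        # Only act on requests that arrive for the old hostname.
        f"RewriteCond %{{HTTP_HOST}} ^{re.escape(old_host)}$ [NC]",
        # Append the requested path to the new base and redirect permanently.
        f"RewriteRule ^(.*)$ {new_base.rstrip('/')}/$1 [R=301,L]",
    ])


print(htaccess_redirect("campana.cooperhewitt.org",
                        "https://archive.cooperhewitt.org/campana"))
```

With rules like these in place, a request for /about.html on the old host would land on /campana/about.html on the archive domain.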

This allows you to browse the site by going to the original URL at https://campana.cooperhewitt.org or any of its permalinks like https://campana.cooperhewitt.org/about.html

The Downsides

As with anything, there are downsides to using this technique. The main one is the loss of interactivity. If your website had a commenting feature built in, it won’t work anymore. If it ran off a CMS like WordPress, you won’t be able to log in and make edits to your content. Everything is now static HTML, forever. Also, httrack won’t do it all. It hiccups on some types of URLs depending on the underlying structure/technology. I found this to be a small problem with things like rollover images and dynamic hyperlinks ( especially links with question marks in them, i.e. query strings ). But most of these issues can be resolved with a little cleanup.
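
To find the pages that need cleanup, a small script can help. The sketch below is an assumption about how one might do it, not part of httrack: it walks a mirrored folder and reports every href or src that still points at the original host or still carries a query string. The function and class names are my own.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect every href and src value found in one HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def find_suspect_links(mirror_dir: str, original_host: str) -> list[tuple[str, str]]:
    """Return (file, link) pairs that point at the live site or carry a query string."""
    suspects = []
    for page in sorted(Path(mirror_dir).rglob("*.html")):
        parser = LinkCollector()
        parser.feed(page.read_text(encoding="utf-8", errors="replace"))
        for link in parser.links:
            parts = urlsplit(link)
            if parts.hostname == original_host or parts.query:
                suspects.append((str(page), link))
    return suspects
```

Calling `find_suspect_links("archive/campana", "campana.cooperhewitt.org")` on a scraped copy would give a list of files and links to review by hand.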

One Final Step

Since you are scraping the site and turning it into static HTML, it does make sense to make a real archive of your original site files and any attached database. I simply copied all the files in our /var/www directory to an external hard drive and did a mass MySQL dump to the same drive. If I ever really need to resurrect one of the sites, I have everything I need sitting on a shelf in cold storage.
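
I did this step by hand, but it is easy to script. Here is a minimal sketch, assuming Python's standard tarfile module for the site files and mysqldump's `--all-databases` and `--result-file` options for the database. The mount point is hypothetical, and credentials are assumed to come from your MySQL option file.

```python
import shlex
import tarfile
from pathlib import Path


def archive_site_files(source_dir: str, archive_path: str) -> None:
    """Pack a directory tree into a gzip-compressed tarball."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store paths relative to the folder name, e.g. www/index.html.
        tar.add(source_dir, arcname=Path(source_dir).name)


def mysqldump_command(output_file: str) -> list[str]:
    """Build a mysqldump call that writes every database to one SQL file."""
    return ["mysqldump", "--all-databases", f"--result-file={output_file}"]


# Hypothetical mount point for the external drive.
print(shlex.join(mysqldump_command("/mnt/coldstorage/all-databases.sql")))
```

Something like `archive_site_files("/var/www", "/mnt/coldstorage/www.tar.gz")` would then cover the files side of the backup.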

Designing Search

Our Homepage?

Some might say this is really our homepage

When I first started working at the Cooper-Hewitt in June of 2010 I was really interested in making our web presence more searchable. This is no simple task, and it is one that I am still working on today. So, I thought I might document some of the projects we have going on with regard to “search” at the Cooper-Hewitt.

The Google Search Appliance

One of the nice things about being part of the larger Smithsonian Institution is that we have some amazing resources. When I first started working here I took a day trip down to Washington, DC to meet with Smithsonian’s CTO and some of the folks who run the Smithsonian data center in Herndon, Virginia. They took me on a great tour of the facilities and talked about many of the unique resources I had at my fingertips. One such resource was our clustered Google Search Appliance. It’s no big surprise that Google, one of the biggest names in the search business, offers its technology to the enterprise.

Our setup provides the entire Smithsonian Institution with an enterprise-class web crawler and index database. Shown in the picture below, it consists of two separate clusters of five 2U Google Search Appliances. One rack of five is simply a hot spare in case the other goes down.

Smithsonian’s Google Search Appliance — The one on the right is a backup for the one on the left.

The Google Search Appliance works in a similar way to Google itself. It constantly crawls across all of Smithsonian’s web properties, updating its index and providing a front-end for search. In fact, you can try it out yourself at https://search.si.edu where you can search across our entire network, or simply select the Smithsonian unit you are interested in. As you can see, results are displayed by relevance and follow a format similar to Google.com.

The Google Search Appliance is a great tool and we now have it integrated with our main website. If you want to try it out, go to https://cooperhewitt.org/search/gsa. This should give you results from the Google Search Appliance that span most of our web properties.

Of course the GSA comes with a few strings attached. First, it’s a web crawler, so it is indexing web properties by crawling in an automated way. This means you tell it where to start and it automatically finds pages by crawling each link from one page to another. This works fine in most cases, but you really don’t have much control over how it crawls and what it finds. It’s also a device that is shared by the entire Smithsonian Institution, so it comes with some restrictions as to how it can be utilized and customized. It’s also a completely off-the-shelf solution based on proprietary code. In other words, it’s not open source, and thus can’t be hacked or altered or customized to do fun stuff! Lastly, these devices are pretty expensive. We are lucky to have them in our datacenter, but many of you reading this post will need a less costly option.
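
The crawling idea is easy to picture with a toy example. The sketch below is not how the GSA is implemented. It is just a breadth-first walk over a made-up link graph, which shows why a page that nothing links to never gets found.

```python
from collections import deque


def crawl(start: str, links: dict[str, list[str]]) -> list[str]:
    """Breadth-first crawl: visit the start page, then every page reachable by links."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


# A made-up site: "/orphan" exists but nothing links to it.
site = {
    "/": ["/about", "/exhibitions"],
    "/about": ["/"],
    "/exhibitions": ["/exhibitions/campana"],
    "/orphan": [],
}
print(crawl("/", site))
# prints ['/', '/about', '/exhibitions', '/exhibitions/campana']
```

The orphan page is missing from the result, which is exactly the kind of gap you cannot easily control with a crawler-only approach.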

Google Custom Search

One such option might be Google’s Custom Search. This solution simply leverages Google’s existing infrastructure to create a site-specific search. Of course you have little control over how and when your site gets indexed, but it’s free and easy to set up. I’ve used Google Custom Search for a few personal sites that needed a hosted search solution.

You can try out our custom search site here. This one searches our main site as well as a number of other web properties we have. One nice thing about the GCS is that it can be hosted on Google’s site, or embedded on your own site through a number of layout options.

Drupal and WordPress

Nearly all Content Management Systems come with some type of search feature. With this type of search, the index is simply created for you automatically when you author a new post. There is no web-crawler involved because it’s not necessary. However, this type of search comes with many caveats. First, you can only index the site managed by the CMS. So unlike the Google Search Appliance, you can’t use this to search multiple web properties on multiple domains. Also, the search baked into many popular CMSs tends to be pretty limited in how it functions on the front end. WordPress is notoriously bad at displaying search results out of the box. It simply lists them by date rather than by relevance, which is not too helpful!

Drupal is a little better, with a decent advanced search page and the ability to sort results by relevance. But it is still pretty basic.

On the other hand, this type of manual indexing is very important as you can ensure that every page you author gets indexed as opposed to just hoping a web-crawler finds your pages. Another advantage with Drupal is that the search back-end is fairly extensible and can be modified with modules. For example, our Google Search Appliance is easily integrated with our main website through this contributed module for Drupal.
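
To make the contrast with crawling concrete, here is a toy sketch of indexing at authoring time. It is not how Drupal or WordPress actually store their search data. The class and method names are invented, and ranking is a plain term count.

```python
import re
from collections import Counter, defaultdict


class PostIndex:
    """A toy inverted index that is updated whenever a post is saved."""

    def __init__(self) -> None:
        # term -> Counter of post ids and how often the term occurs in each
        self.term_counts: dict[str, Counter] = defaultdict(Counter)

    def save_post(self, post_id: str, text: str) -> None:
        """Index the post at authoring time, so no crawler is needed."""
        # Saving the same post twice would double count; a real index
        # would replace the old entry first.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.term_counts[term][post_id] += 1

    def search(self, query: str) -> list[str]:
        """Return post ids ranked by how often the query terms occur."""
        scores = Counter()
        for term in re.findall(r"[a-z0-9]+", query.lower()):
            scores.update(self.term_counts.get(term, {}))
        return [post_id for post_id, _ in scores.most_common()]


index = PostIndex()
index.save_post("chairs", "A chair exhibition about chair design")
index.save_post("lamps", "Lamp design")
print(index.search("chair design"))
# prints ['chairs', 'lamps']
```

Every saved post is guaranteed to be in the index, and even this crude count gives a relevance order rather than an arbitrary one.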

Open-Source

While thinking about ways to improve search we realized that we really needed to do our own thing. We sort of needed both modes of indexing ( web crawling and manual indexing ). We also needed something we had full control over, and something that was open-source and enterprise-class.

Enter Solr and Nutch. These two applications are projects of the Apache Software Foundation, whose software powers a giant portion of the web. Solr ( pronounced solar ) is:

the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world’s largest internet sites.

And Nutch is a complementary project to Solr that provides an enterprise-class web-crawler. Nutch can help us crawl across multiple domains and web properties. It can also scale nicely using another Apache project known as Hadoop.

With Solr and Nutch ( and maybe a little Hadoop ) we should be able to build a pretty sophisticated platform for search. In fact, this has already been done at the Smithsonian in their Collections Search project, where you can search across nearly every Smithsonian unit.
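
As a small taste of what talking to Solr looks like, the sketch below builds a query URL for Solr's select handler with hit highlighting and one facet switched on, using the standard `q`, `wt`, `hl`, `facet` and `facet.field` parameters. The host, the core name "websites" and the field name "site" are assumptions for illustration, not our actual setup.

```python
from urllib.parse import urlencode


def solr_select_url(base_url: str, core: str, query: str, facet_field: str) -> str:
    """Build a Solr /select URL asking for highlighting and one facet."""
    params = [
        ("q", query),                  # the full-text query
        ("wt", "json"),                # response format
        ("hl", "true"),                # turn on hit highlighting
        ("facet", "true"),             # turn on faceting
        ("facet.field", facet_field),  # field to count facet values for
    ]
    return f"{base_url.rstrip('/')}/solr/{core}/select?{urlencode(params)}"


# Hypothetical local Solr instance, core and field.
print(solr_select_url("http://localhost:8983", "websites", "chair design", "site"))
```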

In my next post I will dig in deep and show you how we get these things up and running.