High Availability Services
Our second case study looks at providing highly available services to the internet. This includes customer-targeted services such as web sites, as well as redundancy for services like e-mail and VPN access. We revisit the same Houston-based oil and gas company in this study. It’s 2004: our e-mail is provided by a single server, and we have two internet connections. However, incoming and outgoing e-mail travels over only one of those connections, so if the primary connection goes down or is disrupted, our e-mail is disrupted too. The backup connection can still provide general internet access, but the e-mail server must use the primary service.
Our first proposition is to install two new e-mail servers, one for each connection. This provides redundancy, because if either server fails the other continues to accept e-mail, and improved throughput, because both servers share the load of processing incoming e-mail.
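One common way to get this behavior is to publish both servers as equal-preference MX records, so that sending servers pick either one and fall back to the other on failure. The case study does not say which mechanism was used, so the following is only a minimal sketch under that assumption; the host names, the preference value, and the `is_up` check are hypothetical.

```python
import random

# Hypothetical MX records as (preference, host) pairs. Equal preference
# means a sender may choose either host, which spreads the load.
MX_RECORDS = [(10, "mx1.example.com"), (10, "mx2.example.com")]


def delivery_order(records, rng=random):
    """Order hosts as a sender would try them: lowest preference first,
    with hosts of equal preference shuffled."""
    by_pref = {}
    for pref, host in records:
        by_pref.setdefault(pref, []).append(host)
    order = []
    for pref in sorted(by_pref):
        hosts = by_pref[pref][:]
        rng.shuffle(hosts)
        order.extend(hosts)
    return order


def deliver(records, is_up, rng=random):
    """Return the first reachable host, or None if every server is down."""
    for host in delivery_order(records, rng):
        if is_up(host):
            return host
    return None
```

With both servers up, `deliver` returns either host; if `is_up` reports one of them as down, the message always lands on the other.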
Our second proposition is to install a second internet connection at the headquarters. This not only provides a backup in the event of any interruption of service on the primary connection, but also allows critical services to fail over from the primary internet provider (Timewarner Telecom) to a totally different provider (Abovenet). This is accomplished through the use of an industry-standard routing protocol, BGP. The addresses allocated to the e-mail server are much like a street address, and e-mail reaches the server by following a route provided by backbone devices called routers. With two internet connections in place and BGP configured on both, it is possible to advertise two routes to the internet, that is, two ways to reach the same address. This plan was implemented and thoroughly tested during installation. Because routes are distributed across the internet, it takes some time for a route to be removed when its path is no longer available, but we were able to minimize this time through testing and coordination with the providers involved.
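The idea of two routes to the same address can be illustrated with a toy routing table. This is a sketch, not a BGP implementation: it chooses the best route by shortest AS path only, whereas real BGP applies a longer list of tie-breakers. The prefix and AS numbers are documentation values chosen for illustration, and the path lengths are assumptions, not details from the deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str      # address block being advertised
    provider: str    # upstream provider the route was learned through
    as_path: tuple   # AS numbers traversed to reach the prefix


class RouteTable:
    """Toy view of the routes a remote router holds for each prefix."""

    def __init__(self):
        self._routes = {}  # prefix -> {provider: Route}

    def announce(self, route):
        self._routes.setdefault(route.prefix, {})[route.provider] = route

    def withdraw(self, prefix, provider):
        self._routes.get(prefix, {}).pop(provider, None)

    def best(self, prefix):
        # Simplification: prefer the shortest AS path; None if no route is left.
        candidates = self._routes.get(prefix, {}).values()
        return min(candidates, key=lambda r: len(r.as_path), default=None)


# Hypothetical prefix and AS numbers (reserved documentation ranges).
PREFIX = "203.0.113.0/24"
table = RouteTable()
table.announce(Route(PREFIX, "Timewarner Telecom", (64500, 64496)))
table.announce(Route(PREFIX, "Abovenet", (64501, 64502, 64496)))
```

In this sketch, calling `table.withdraw(PREFIX, "Timewarner Telecom")` leaves the Abovenet route as the only candidate, so `table.best(PREFIX)` switches to it. The delay described above corresponds to the time it takes for that withdrawal to reach routers across the internet.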
The net result of the implementation matched the initial goal: a system that can survive a failure at any level, whether an internet provider failure, an e-mail server failure, a power failure at a site, or any number of other catastrophes.