No technology process in the world can eliminate all future outages, defective code, or random human foolishness, but you can hedge your bets. Of course, you could spend thousands on a fully redundant infrastructure, but short of that budget-busting scenario, a few small steps can greatly simplify recovery from all sorts of problems.
Tip 1: keep cold spares of everything. Ideally you’ve already standardized on network and server components. Sure, there may be a few odd parts here and there, but your closet switches should all be the same brand, if not the same model. Your servers should be homogeneous, or at least homogeneous within a given role (such as HP ProLiant DL360s for one major infrastructure component and Dell PowerEdge R415s for another). These servers aren’t that expensive, especially if they’re purchased in their minimum configuration. In a pinch, you can replace a failed server with the cold spare, moving the functional parts over to the spare in an instant. In some cases you’ll even be able to simply swap the disks and have the new box up in no time.
For routers and switches, the same is true. With a tool like RANCID automatically downloading and archiving switch and router configurations, and backups of the device images on hand, in the event of a failure you can load the failed router’s or switch’s configuration and the backed-up image onto the cold spare and save the day. Firewalls work the same way. In many cases, you can even pull your cold spares from a supplier and get them cheap: you don’t care about support on these units, so you can forgo that expense and still cover your needs. Even if you’re running Cisco ASAs, you can probably find an end-of-life Cisco PIX with a similar configuration for a few hundred pounds that can at least bring critical services back up if you experience a failure.
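RANCID handles the fetch-and-version step for you; the restore side is little more than copying a saved config back onto the spare. As a minimal sketch of the archiving idea (the paths and the hostname below are invented for illustration, not taken from a real RANCID setup):

```shell
#!/bin/sh
# Sketch of the archive half of the workflow only; RANCID would normally
# do the fetching. Paths and the hostname are hypothetical.
SRC=/tmp/demo-configs
ARCHIVE=/tmp/config-archive
mkdir -p "$SRC" "$ARCHIVE"

# Stand-in for a config RANCID has already pulled from a real switch:
printf 'hostname closet-sw1\nversion 12.2\n' > "$SRC/closet-sw1.cfg"

# Keep a dated copy of each config so a cold spare can be rolled back
# to any given day's state.
for cfg in "$SRC"/*.cfg; do
    host=$(basename "$cfg" .cfg)
    cp "$cfg" "$ARCHIVE/${host}-$(date +%Y%m%d).cfg"
done
ls "$ARCHIVE"
```

Pair the dated configs with a copy of each device’s firmware image and the cold spare can be brought to an exact match of the failed unit.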
Naturally, you don’t want to buy cold spares for big-ticket items like core switches, but if you do a little legwork, you can cover the rest without putting a major dent in your budget. You can also keep some legacy equipment on hand to cope with a core switch failure while waiting for advance-replacement parts. Your old hardware may not be good enough for day-to-day operation, but if it means the difference between no productivity and being operational, it’s worth keeping.
Tip 2: go wiki.
What was the serial number of that remote-office switch anyway? What version of IOS was that router running before the power supply blew? An easy way to collect this data in one easily searched place is a wiki. Toss CentOS on a virtual machine, install MediaWiki, and start compiling data on your infrastructure. I paste the output of show version on a Cisco device straight into a wiki page, and I write up synopses of each switch’s functions and responsibilities; in the event that something does go awry, I can quickly dig up those ever-so-necessary bits of information that can turn a three-hour recovery into 30 minutes.
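The two fields you’ll want most in a failure are buried in that show version dump, and the standard Unix text tools will pull them out for you. A hedged sketch, where the sample text only approximates Cisco output and is not a real capture:

```shell
#!/bin/sh
# Extract the version and serial number from a saved 'show version' dump.
# The sample below approximates Cisco output; it is not a real capture.
cat > /tmp/shver.txt <<'EOF'
Cisco IOS Software, C2960 Software (C2960-LANBASEK9-M), Version 12.2(55)SE
System serial number: FOC1234X0AB
EOF

# 'Version x.y' runs to the next comma or end of line in this format.
ver=$(grep -o 'Version [^,]*' /tmp/shver.txt | head -n1)
serial=$(awk -F': ' '/serial number/ {print $2}' /tmp/shver.txt)
echo "IOS: $ver"
echo "Serial: $serial"
```

Run something like this across your saved dumps and you have the skeleton of a wiki inventory page with almost no typing.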
I don’t go so far as to put passwords in wiki documents, but anything short of that is fair game: lists of serial console server ports and what they’re connected to, switch port assignments and VLAN blocks for DMZ and public switches, as well as each server, its brand, model, serial number, role, storage and RAM configuration, and so forth. If it exists in your infrastructure, it should have an entry in the wiki.
Starting this project from scratch is a real pain, but maintaining the information on an ongoing basis is easy. The next time you have an immediate need to know the serial number of a failed remote switch, you’ll have it right at your fingertips.
Tip 3: establish backup links wherever and whenever possible.
If at all possible, there should be multiple paths to every data centre and remote office. Back in the day, this was very expensive, but now you can probably get a business-class DSL or cable connection to most of your locations. For less than £100 a month in many cases, you have an alternate access method to that site for use in emergencies – or for sensitive remote configurations of the production routers and firewalls. It might even be feasible to split your traffic in those sites, pushing business traffic over leased lines and internet browsing traffic over the DSL or cable circuit.
If cost is the ultimate issue, you can take a page from the first item in this list and procure a used firewall from a supplier for this circuit.
Tip 4: bet on a big box.
This one really applies to virtualised infrastructures only. Say you have a virtualisation farm of a dozen 1U servers running a few hundred virtual machines. If something goes wrong with the production system, you can probably get away with running some subset of those VMs to maintain critical line-of-business applications. If that’s the case, you don’t need to maintain a duplicate virtualisation farm. Instead, you can invest in a single four-CPU server with a bunch of RAM that can take the production load for some length of time.
This server wouldn’t necessarily play in the farm itself (though it could), but would instead be installed and ready to handle the load if the situation calls for it. In some cases, you may even be able to game the virtualisation vendor’s evaluation period to avoid paying for licences on a dormant server, but your mileage may vary.
The size of this emergency server should correspond to your infrastructure needs and the number and weight of the virtual machines you expect it to run. Generally speaking, you can get an awful lot of emergency processing power in a virtualised environment for under £10,000. Is that too much for peace of mind?
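The sizing itself is back-of-the-envelope arithmetic. A sketch with invented numbers (the VM count, per-VM allocation, and overhead figure are assumptions for illustration, not recommendations):

```shell
#!/bin/sh
# Rough RAM sizing for the emergency box. All figures are hypothetical:
CRITICAL_VMS=40      # say 40 of the few hundred VMs are line-of-business
RAM_PER_VM_GB=4      # assumed average allocation per critical VM
OVERHEAD_GB=16       # hypervisor overhead plus headroom
NEEDED_GB=$((CRITICAL_VMS * RAM_PER_VM_GB + OVERHEAD_GB))
echo "Emergency server needs roughly ${NEEDED_GB} GB of RAM"
```

Do the same sum for CPU cores and disk, then size the box to the larger of your estimates rather than the average.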
Tip 5: learn Linux.
Even if you’re a Windows shop, learning enough Linux to run a few supporting tools can open up a huge number of valuable, low-cost options. You may not feel comfortable running critical business applications on Linux or Unix without in-depth knowledge of the OS, but these are incredibly stable platforms. Windows versions of many of these tools exist, but they are natively Unix-based. The benefit of learning Linux and running these tools is twofold: you gain Linux skills, and you enrich your network with a raft of supporting players that make everyone’s life simpler.
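A taste of what the Unix toolbox buys you: one pipeline to summarise hits per client from a web-style access log, the kind of ad-hoc question that takes minutes with standard tools. The log lines here are fabricated for illustration:

```shell
#!/bin/sh
# Fabricated web-style log entries, purely for demonstration:
cat > /tmp/demo.log <<'EOF'
10.0.0.1 GET /index.html
10.0.0.2 GET /about.html
10.0.0.1 GET /contact.html
EOF

# Column 1 is the client address; count and rank the unique values.
awk '{print $1}' /tmp/demo.log | sort | uniq -c | sort -rn
```

The same awk/sort/uniq pattern answers most "what's talking to what, and how often" questions against firewall, DHCP, or mail logs without any special software.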
It’s easier to preach about being proactive than to actually make these measures happen in the topsy-turvy, break-fix world of IT. But to paraphrase a well-known saying: if you’re too busy mopping the floor to turn off the tap, you probably need to rethink your approach.