Jay Butler, Senior Technical Consultant
What is your financial institution’s tolerance for downtime? Is it eight hours, two days, a week? A sensible answer requires at least a basic risk analysis that weighs the likelihood of an event and its consequences against the cost of possible recovery contingencies. While relatively rare, the failure of a server, router, firewall, Ethernet switch, data circuit, or core software application can cause significant, widespread disruption to the institution. Still, the exorbitant cost of duplicating all of these systems across the enterprise necessitates a compromise between expense and tolerance. Alternative solutions include next-day hardware warranties, support service level agreements, and single backup servers intended to cover multiple primary servers. The last of these frequently causes disappointment and surprise during server failures, when the tolerance for downtime is understandably very low.
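The comparison described above can be put into rough numbers. The following is a minimal sketch, not a substitute for a formal risk analysis: the function names and every figure in it (failure frequency, outage length, hourly cost, contingency price) are hypothetical placeholders chosen only to show the arithmetic.

```python
def expected_annual_loss(failures_per_year, hours_down, cost_per_hour):
    """Likelihood times consequence: the average yearly cost of an outage."""
    return failures_per_year * hours_down * cost_per_hour


def contingency_is_justified(loss_without, loss_with, annual_contingency_cost):
    """A contingency pays for itself when the loss it prevents exceeds its cost."""
    return (loss_without - loss_with) > annual_contingency_cost


# Hypothetical figures: one failure every four years, $2,000 per hour of downtime.
loss_without = expected_annual_loss(0.25, hours_down=24, cost_per_hour=2000)
loss_with = expected_annual_loss(0.25, hours_down=1, cost_per_hour=2000)

# Hypothetical contingency costing $6,000 per year.
print(loss_without, loss_with)
print(contingency_is_justified(loss_without, loss_with, 6000))
```

With these assumed inputs, the contingency removes more expected loss than it costs each year. Raising its assumed price or shortening the unprotected outage changes the answer, which is the point of running the numbers.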
Continuous or daily backups protect server data, and this critical aspect of server recovery rightly receives close attention. Unfortunately, comprehensive data backups do not equal fast server recovery, because each server has a unique configuration that must also be restored. Providing a “flip the switch,” or high availability, server recovery solution requires more sophisticated technology at a far greater expense than a single backup server. Servers must be replicated using real-time synchronization solutions such as Double-Take®. In their simplest form, replication solutions require a target server for each production server to provide real-time backup synchronization and instantaneous failover. More refined deployments incorporate virtual server technology and Storage Area Networks (SANs) to avoid server sprawl and provide additional functionality. Without some kind of high availability solution, server recovery in under a day is the exception rather than the rule.
A compromise between tolerance and expense must be reached, so deploying a high availability configuration for only the most critical servers can be a good option. This logic can be extended to the other core components as well. Rather than maintaining a preconfigured spare router at every location, maintain one at the main office that can be moved where needed. Save money with a spare firewall that lacks some features of the primary but can safeguard the network and maintain Internet communications. Many of our clients keep a single spare Ethernet switch because the likelihood of multiple switch failures at once is very remote. The same idea may be applied to network workstations.
A single workstation failure usually interrupts the work of only one user, so overall business services continue unabated, but the tolerance may still be very low if the employee cannot resume work at another open workstation. Compared to a high availability server solution, keeping a spare workstation on hand is relatively inexpensive. Assuming around $1500 or more for the machine, including ongoing configuration and maintenance, the expense may be a good deal because it avoids the potential costs associated with hours of lost productivity. To make the expenditure more worthwhile, adopt a policy of replacing a problem workstation with the spare once an hour has been spent troubleshooting any malfunction. That way, the cost of extended troubleshooting is avoided along with the cost of the failure itself.
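The spare workstation trade-off can be checked with simple arithmetic. This is a minimal sketch under stated assumptions: the $1500 figure comes from the paragraph above, while the $50 hourly cost of lost productivity and the function names are hypothetical.

```python
def break_even_hours(spare_cost, hourly_productivity_cost):
    """Hours of avoided downtime needed for the spare to pay for itself."""
    return spare_cost / hourly_productivity_cost


def should_swap(minutes_troubleshooting, limit_minutes=60):
    """Apply the one-hour rule: swap in the spare once the limit is reached."""
    return minutes_troubleshooting >= limit_minutes


# $1500 spare, assumed $50 per hour of lost productivity.
print(break_even_hours(1500, 50))
print(should_swap(45), should_swap(60))
```

Under these assumptions the spare pays for itself after 30 hours of avoided downtime across its service life. A higher hourly cost for the affected employee shortens that figure.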
These basic examples cannot replace the comprehensive risk analysis required to determine the most judicious appropriation of limited business resources, but hopefully they can spark interest in an area often glossed over until a major failure happens. It may help to start with a spreadsheet that maps business functions to software applications, including the network components each application depends on. The spreadsheet should at least highlight the most critical business areas. How long can the business tolerate a failure for each? How long would it actually take to resume operation?
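The same mapping can be expressed in a few lines of code for anyone who prefers a script to a spreadsheet. The sketch below is illustrative only: the business functions, applications, components, and hour figures are invented examples, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BusinessFunction:
    name: str
    applications: List[str]          # software the function relies on
    components: List[str]            # network components those applications need
    tolerance_hours: float           # how long the business can tolerate an outage
    estimated_recovery_hours: float  # how long recovery would actually take


def recovery_gaps(functions):
    """Return functions whose recovery time exceeds tolerance, worst gap first."""
    gaps = [f for f in functions if f.estimated_recovery_hours > f.tolerance_hours]
    return sorted(
        gaps,
        key=lambda f: f.estimated_recovery_hours - f.tolerance_hours,
        reverse=True,
    )


# Invented example rows; replace with the institution's own inventory.
inventory = [
    BusinessFunction("Teller transactions", ["Core banking"],
                     ["Core server", "Branch router"], 4, 24),
    BusinessFunction("Online banking", ["Web portal"],
                     ["Firewall", "Data circuit"], 8, 6),
    BusinessFunction("Loan origination", ["Loan platform"],
                     ["Application server", "Ethernet switch"], 24, 48),
]

for f in recovery_gaps(inventory):
    print(f.name, f.estimated_recovery_hours - f.tolerance_hours)
```

Each function listed is one where the answer to the second question above exceeds the answer to the first, which makes it a candidate for a spare or a high availability configuration.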