Tuesday, June 18, 2013

Recovery Strategies

Besides cost, the key business continuity drivers for a recovery solution are the Recovery Point Objective and the Recovery Time Objective.

Recovery Point Objective

The Recovery Point Objective (RPO) defines the point in time to which data must be recoverable. Another way to think of this is that the RPO specifies the maximum allowable delay between a data commit on the production side and the replication of that data to the recovery site.

It is probably easiest to think of RPO in terms of the amount of allowable data loss. The RPO is frequently expressed in terms of its relation to the time at which replication stops, as in “less than 5 minutes of data loss.”
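Framed as allowable data loss, RPO compliance reduces to a simple comparison of timestamps. A minimal sketch, assuming the replicator exposes the time of the last production commit and the last successfully replicated commit (the function name and timestamps here are invented for illustration):

```python
from datetime import datetime, timedelta

RPO = timedelta(minutes=5)  # "less than 5 minutes of data loss"

def rpo_exposure(last_commit: datetime, last_replicated: datetime) -> timedelta:
    """Worst-case data loss if the production site failed right now."""
    return last_commit - last_replicated

# Example values: replication is running 3 minutes behind production.
last_commit = datetime(2013, 6, 18, 12, 0, 0)
last_replicated = datetime(2013, 6, 18, 11, 57, 0)

exposure = rpo_exposure(last_commit, last_replicated)
print(exposure <= RPO)  # True: 3 minutes of exposure is within the 5-minute RPO
```

Monitoring this gap continuously is what lets you know whether the replication solution is actually delivering the RPO the business signed off on.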

Recovery Time Objective

The second major business driver is the Recovery Time Objective (RTO). This is the amount of time it will take us to recover from a disaster. Depending on the context, this may refer only to the technical steps required to bring up services on the recovery system. Usually, however, it refers to the amount of time that the service will be unavailable, including time to discover that an outage has occurred, the time required to decide to fail over, the time to get staff in place to perform the recovery, and then the amount of time to bring up services at the recovery site.
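The components listed above add up quickly, and the human steps usually dominate the technical ones. A back-of-the-envelope sketch (all component times are invented example values):

```python
# RTO covers more than the technical failover itself.
# All durations are hypothetical, in minutes.
downtime_components = {
    "discover the outage": 15,
    "decide to fail over": 30,
    "get staff in place": 45,
    "bring up services at the recovery site": 60,
}

RTO = 120  # minutes -- assumed business requirement

total = sum(downtime_components.values())
print(f"projected downtime: {total} min against an RTO of {RTO} min")
print("meets RTO" if total <= RTO else "misses RTO")  # misses RTO
```

In this example the service misses its RTO even though the technical recovery step alone (60 minutes) fits comfortably, which is why detection, escalation, and staffing belong in the RTO conversation.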

The costs associated with different RPO and RTO values will be determined by the type of application and its business purpose. Some applications may be able to tolerate unplanned outages of up to days without incurring substantial costs. Other applications may cause significant business-side problems with even minor amounts of unscheduled downtime.

Different applications and environments have different tolerances for RPO and RTO. Some applications might be able to tolerate a potential data loss of days or even weeks; some may not be able to tolerate any data loss at all. Some applications can remain unavailable long enough for us to purchase a new system and restore from tape; some cannot.

Recovery Strategies

There are several different strategies for recovering an application. Choosing a strategy will almost always involve an investment in hardware, software, and implementation time. If a strategy is chosen that does not support the business RPO and RTO requirements, an expensive re-tooling may be necessary.

Many types of replication solutions can be implemented at a server, disk storage, or storage network level. Each has unique advantages and disadvantages. Server replication tends to be cheapest, but also involves using server cycles to manage the replication. Storage network replication is extremely flexible, but can be more difficult to configure. Disk storage replication tends to be rock solid, but is usually limited in terms of supported hardware for the replication target.

Regardless of where we choose to implement our data replication solution, we will still face many of the same issues. One issue that needs to be addressed is re-silvering of a replication solution that has been partitioned for some amount of time. Ideally, only the changed sections of the disks will need to be re-replicated. Some less sophisticated solutions require a re-silvering of the entire storage area, which can take a long time and soak up a lot of bandwidth. Re-silvering is an issue that needs to be investigated during the product evaluation.
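The difference between the two re-silvering behaviors is easy to quantify. A sketch under the simplifying assumption that the replicator keeps a dirty bitmap of changed blocks while the link is partitioned (block size and change rate are invented):

```python
BLOCK_SIZE_MB = 1

def resilver_cost(dirty_bitmap: list) -> int:
    """MB that must be re-sent when only dirty blocks are re-replicated."""
    return sum(dirty_bitmap) * BLOCK_SIZE_MB

total_blocks = 1_000_000              # a 1 TB volume in 1 MB blocks
dirty = [False] * total_blocks
for i in range(0, total_blocks, 100): # assume 1% of blocks changed while partitioned
    dirty[i] = True

print(resilver_cost(dirty))           # 10000 MB with change tracking
print(total_blocks * BLOCK_SIZE_MB)   # 1000000 MB for a naive full re-silver
```

A two-orders-of-magnitude difference in data to move translates directly into re-silvering time and WAN bandwidth, which is why this belongs on the product evaluation checklist.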

Continuity Planning

Continuity planning should be done during the initial architecture and design phases for each service. If the service is not designed to accommodate a natural recovery, it will be expensive and difficult to retrofit a recovery mechanism.

The type of recovery that is appropriate for each service will depend on the importance of the service and what the tolerance for downtime is for that service.

There are five generally recognized approaches to recovery architecture:

  • Server Replacement: Some services are run on standard server images with very little local customization. Such servers may most easily be recovered by replacing them with standard hardware and standard server images.
  • Backup and Restore: Where there is a fair amount of tolerance for downtime on a service, it may be acceptable to rely on hardware replacement combined with restores from backups.
  • Shared Nothing Failover: Some services are largely data-independent and do not require frequent data replication. In such cases, it might make sense to have an appropriately configured replacement at a recovery site. (One example may be an application server that pulls its data from a database. Aside from copying configuration changes, replication of the main server may not be necessary.)
  • Replication and Failover: Several different replication technologies exist, each with different strengths and weaknesses. Array-based, SAN-based, file system-based or file-based technologies allow replication of data on a targeted basis. Synchronous replication techniques prevent data loss at the cost of performance and geographic dispersion. Asynchronous replication techniques permit relatively small amounts of data loss in order to preserve performance or allow replication across large distances. Failover techniques range from nearly instantaneous automated solutions to administrator-invoked scripts to involved manual checklists.
  • Live Active-Active Stretch Clusters: Some services can be provided by active servers in multiple locations, where failover is handled by client configuration. Some examples include DNS services (failover by resolv.conf lists), SMTP gateway servers (failover by MX record), web servers (failover by DNS load balancing), and some market data services (failover by client configuration). Such services should almost never be down. (Stretch clusters are clusters where the members are located at geographically dispersed locations.)
Which of these recovery approaches is appropriate to a given situation will depend on the cost of downtime on the service, as well as the particular characteristics of the service's architecture.
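For the failover-based approaches above, the trigger logic matters as much as the replication. A minimal sketch of an automated failover trigger; the health-check history and threshold are hypothetical, and a real deployment also needs fencing to avoid split-brain when the primary is merely slow rather than dead:

```python
FAILURE_THRESHOLD = 3  # consecutive failed health checks before failing over

def should_fail_over(check_results: list) -> bool:
    """True once the primary has failed FAILURE_THRESHOLD checks in a row.

    check_results is the health-check history, oldest first;
    True means the check passed, False means it failed.
    """
    if len(check_results) < FAILURE_THRESHOLD:
        return False
    return not any(check_results[-FAILURE_THRESHOLD:])

print(should_fail_over([True, True, False, False]))   # False: only 2 consecutive failures
print(should_fail_over([True, False, False, False]))  # True: threshold reached
```

Requiring several consecutive failures is a common way to keep a transient network blip from triggering an expensive and possibly disruptive failover; administrator-invoked solutions effectively put a human in place of this threshold.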

Causes of Recovery Failure

Janco released a study outlining the most frequent causes of a recovery failure:
  • Failure of the backup or replication solution. If a copy of the data is not available, we will not be able to recover.
  • Unidentified failure modes. The recovery plan does not cover the type of failure that actually occurs.
  • Failure to train staff in recovery procedure. If people don't know how to carry out the plan, the work is wasted.
  • Lack of a communication plan. How do you communicate when your usual infrastructure is not available?
  • Insufficient backup power. Do you have enough capacity? How long will it run?
  • Failure to prioritize. What needs to be restored first? If you don't lay that out in advance, you will waste valuable time on recovering less critical services.
  • Unavailable disaster documentation. If your documentation is only available on the systems that have failed, you are stuck. Keep physical copies available in recovery locations.
  • Inadequate testing. Tests reveal weaknesses in the plan and also train staff to deal with a recovery situation in a timely way.
  • Unavailable passwords or access. If the recovery team does not have the permissions necessary to carry out the recovery, it will fail.
  • Plan is out of date. If the plan is not updated to reflect changes in the environment, the recovery will not succeed.

Recovery Business Practices

Janco also suggested several key business practices to improve the likelihood that you will survive a recovery:
  • Eliminate single points of failure.
  • Regularly update staff contact information, including assigned responsibilities.
  • Stay abreast of current events, such as weather and other emergency situations.
  • Plan for the worst case.
  • Document your plans and keep updated copies available in well-known, available locations.
  • Script what you can, and test your scripts.
  • Define priorities and thresholds.
  • Perform regular tests and make sure you can meet your RTO and RPO requirements.