The catastrophic March 10 fire at the OVH datacenter campus in Strasbourg, France, which caused irreparable damage to multiple datacenters and servers, was a devastating reminder that disasters can and will happen anywhere, at any time, and to any type of company. OVH founder Octave Klaba warned it could take weeks for the firm to fully recover and urged clients to implement their own data recovery plans.
Apparently, many companies didn’t have such a plan (or had one that was seriously flawed). Financial institutions, government offices and commercial entities incurred downtime and – best case – were able to restore services from another location. Others, which fully trusted their hosted infrastructure and assumed that their data was backed up, lost everything.
What’s so regrettable is that this colossal disaster could have been avoided. Companies that had planned and implemented disaster recovery (DR) best practices prior to the blaze were unaffected and experienced no downtime or data loss.
We can only hope this horrific fire serves as a wake-up call for companies to re-evaluate their disaster readiness strategy. To get the ball rolling, here’s a quick primer covering the various options for ensuring data availability in the event of a datacenter outage.
The Hierarchy of Data Availability
Each cloud provider organizes its data infrastructure into availability zones (i.e., datacenters) and regions. Each region comprises multiple availability zones that are geographically close to one another to ensure a low-latency connection, while the regions themselves are far apart (e.g., East Coast, West Coast and Europe).
Leveraging this hierarchical structure, organizations can choose to manage their data in different ways to attain different levels of data availability (ordered from lowest to highest availability):
- Single availability zone. Since any single datacenter is susceptible to failure at some point, relying on the same availability zone for both primary data and backup is a serious gamble from a business continuity perspective. While this may sound obvious, the companies that lost all their data in the OVH blaze did not heed this warning.
- Two availability zones with synchronous replication between them. These zones should be at least a few miles apart (and not on the same campus like at OVH). By replicating data across multiple zones in the same region, you can quickly failover from one zone to another in the case of an outage without losing any data. Synchronous replication means that every “write” operation to storage is written to two locations before your application receives an OK. Synchronous replication works well when locations are close and latency is low – if not, application performance will suffer.
- High availability between two availability zones (as above), plus background replication to another region. This method addresses the potential for disaster on a larger scale (e.g., massive earthquake, flood, etc.) affecting an entire region. On top of having two nearby copies of your data always in sync, this additional layer uses asynchronous replication to create a third copy in the background which is then stored in another region. In the case of a major disaster, you can still failover to another region and only lose a few seconds or minutes of data (depending on the replication lag) while quickly getting operations back online.
- High availability between two availability zones, plus background replication to a second cloud provider. This is the “holy grail” of data availability, addressing the rare but business-critical scenario that your cloud provider suffers a multi-hour outage across multiple regions. This type of cascading failure has happened at major providers due to software bugs, technical malfunctions and human error. Beyond data availability, there are many advantages to embracing the multi-cloud mindset.
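The difference between the synchronous and asynchronous tiers above can be sketched in a few lines of code. This is a minimal illustration, not any provider's actual replication protocol: the "zones" are in-memory dictionaries, a write is acknowledged only after both nearby zones are updated, and a background thread drains a log into a distant "region" with some lag.

```python
import queue
import threading

# In-memory stand-ins for storage in each location (hypothetical setup).
zone_a = {}          # primary availability zone
zone_b = {}          # nearby zone, kept in sync synchronously
remote_region = {}   # distant region, replicated asynchronously

_replication_log = queue.Queue()

def write(key, value):
    """Synchronous replication: commit to both zones before acknowledging."""
    zone_a[key] = value
    zone_b[key] = value                 # both copies exist before we say "OK"
    _replication_log.put((key, value))  # queued for background replication
    return "OK"

def _replicate_to_region():
    """Background worker: drains the log into the remote region (async)."""
    while True:
        item = _replication_log.get()
        if item is None:
            break
        key, value = item
        remote_region[key] = value      # lags the primary by the queue depth
        _replication_log.task_done()

worker = threading.Thread(target=_replicate_to_region, daemon=True)
worker.start()

write("invoice-42", b"...")      # returns "OK" once both zones are written
_replication_log.join()          # in reality, this copy lags seconds/minutes
```

A zone failure after `write` returns loses nothing, because the second zone already holds the data; a whole-region failure loses only whatever was still sitting in the replication queue, which is exactly the "few seconds or minutes" of exposure described above.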
Don’t Overlook Data Availability at the Edge
If a datacenter with state-of-the-art protection can fail, so can one of your branch office servers. A sound DR strategy means being ready for this scenario as well. Unlike the replication methods described above, many edge locations use traditional backup solutions that require a restore operation to recover the data following a disaster. Such an operation typically takes hours or even days to complete – seriously compromising business continuity.
To overcome this problem, enterprises are deploying global file systems, such as CTERA’s, that use smart caching filers at the edge to ensure that data is always available locally, while the master copy is stored in the cloud. In the event of a failure at the edge, you can failover to the cloud as your DR site, allowing users and apps to stay online. Once an edge filer is back up, metadata is restored first, and then data is restored in the background. Users can resume working with their files almost instantly, waiting only for the metadata download rather than the full data restore that traditional backup requires.
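The metadata-first recovery pattern can be illustrated with a short sketch. This is a hypothetical toy model, not CTERA's actual implementation: the cloud master copy is a dictionary, restoring metadata rebuilds only the namespace (so it is fast), and file contents are pulled from the cloud lazily on first access.

```python
# Toy model of a caching edge filer (hypothetical; not any vendor's API).
cloud_master = {
    "/reports/q1.xlsx": b"q1 data",
    "/reports/q2.xlsx": b"q2 data",
}

class EdgeFiler:
    def __init__(self, cloud):
        self.cloud = cloud
        self.metadata = {}   # path -> size; restored quickly after a failure
        self.cache = {}      # path -> bytes; repopulated lazily on access

    def restore_metadata(self):
        # Fast step: rebuild the namespace only, no file contents moved.
        self.metadata = {path: len(data) for path, data in self.cloud.items()}

    def read(self, path):
        if path not in self.metadata:
            raise FileNotFoundError(path)
        if path not in self.cache:            # cache miss: fetch from cloud
            self.cache[path] = self.cloud[path]
        return self.cache[path]

filer = EdgeFiler(cloud_master)
filer.restore_metadata()   # users can browse the full namespace immediately
data = filer.read("/reports/q1.xlsx")   # contents stream in on demand
```

The key design choice is that availability is restored as soon as the (small) metadata is back, while the (large) data transfer happens on demand or in the background.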
Multi-Cloud Disaster Recovery
In light of the OVH datacenter disaster, it’s time for companies to rethink their DR strategy and avoid placing all their applications and data in a single cloud environment. Organizations moving their IT operation to the cloud need to understand the various DR alternatives as they build their business continuity plan. A multi-cloud strategy not only offers the most robust data availability option, but also enables organizations to maximize business value from the cloud.