Data Resiliency in the Public Cloud

Written by: Roey Libfeld

In the world of enterprise IT, there is no word scarier than downtime. When you are responsible for business continuity, nothing can compare to the sinking feeling when you realize that you’ve just lost a large chunk of critical data that cannot be recovered.

For companies that provide client-facing services under strict Service Level Agreements (SLAs), data loss can have severe business consequences that are often difficult to recover from.

For an eCommerce business, for example, this could mean minutes or even hours of lost customer transactions, a direct hit to revenue that can quickly add up to a sizable loss. A 63-minute Amazon outage on its “Prime” sales day in 2018 cost the company nearly $100 million. For financial services that handle payments and transfers of financial data, the consequences of lost transactions can be even more severe due to compliance and regulatory risks.

When every bit of data counts, minimizing the cost of failures by meeting recovery point objectives (RPOs) and recovery time objectives (RTOs) has become a critical KPI for DevOps teams the world over.

What is Zero RPO?

A Recovery Point Objective (RPO) marks how much data loss can be tolerated when a failure occurs. A non-zero RPO means that transactions that occurred between the last recovery point and the point of failure could be lost.
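
To make the definition concrete, here is a minimal sketch of how the exposure window could be measured against an RPO target. The function names and the timestamps are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point: datetime, failure_time: datetime) -> timedelta:
    """Return the span of time whose transactions are at risk of being lost."""
    if failure_time < last_recovery_point:
        raise ValueError("failure_time precedes last_recovery_point")
    return failure_time - last_recovery_point


def meets_rpo(window: timedelta, rpo: timedelta) -> bool:
    """True when the exposed window is no larger than the tolerated loss."""
    return window <= rpo


# Hypothetical values: the last good replica was taken at 12:00,
# and the primary site failed ten minutes later.
last_replica = datetime(2024, 1, 1, 12, 0, 0)
failure = datetime(2024, 1, 1, 12, 10, 0)
window = data_loss_window(last_replica, failure)

print(meets_rpo(window, timedelta(minutes=15)))  # a 15-minute RPO target
print(meets_rpo(window, timedelta(0)))           # a Zero RPO target
```

Under a Zero RPO target, any window larger than zero fails the check, which is why point-in-time copies alone cannot satisfy it.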

But when your business continuity goals are measured in seconds, often no amount of data loss, no matter how small, can be tolerated. While meeting Zero RPO with your own data centers is possible, it is very complex and requires a sizable investment in infrastructure and maintenance.

In the public cloud, Zero RPO is virtually unheard of, due to both the lack of control over geographical locations and the inability to deploy specialty hardware. So far, cloud-native companies have had no choice but to accept the inevitability of some data loss in case of a critical failure and plan accordingly.

How can my business benefit from Zero RPO?

Today’s organizations understand that mobility, flexibility, and level of service are not just capabilities — they are a critical competitive advantage. Not losing a single transaction even in the event of a disaster can mean the difference between a business that survives and a business that doesn’t.

The challenges of achieving Zero RPO in the public cloud

One way to deal with data loss is to mirror your data across availability zones and cloud providers. Remote replication is an essential part of business continuity strategies and regulation compliance, and has been a part of enterprise IT for decades.

However, when it comes to the public cloud, replication still can’t meet Zero RPO objectives due to geographical distance issues and limitations such as:

  • Latency: the greater the distance between the locations, the greater the latency. In the cloud, the distance between locations is out of your control.
  • Bandwidth: rapid data transfer requires a high-capacity link such as optical fiber.
  • Data rate: to avoid overloading the network, the data rate must stay below the available bandwidth. For high-volume data, this can be problematic.
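
The data rate constraint can be illustrated with a small calculation. This is a simplified sketch that assumes a constant write rate and a constant link bandwidth, both in megabits per second; the function name and the figures are hypothetical.

```python
def backlog_after(write_rate_mbps: float, bandwidth_mbps: float, seconds: float) -> float:
    """Megabits still waiting to be replicated after `seconds` of sustained writes.

    Assumes constant rates; real links and workloads fluctuate.
    """
    surplus = max(0.0, write_rate_mbps - bandwidth_mbps)
    return surplus * seconds


# Hypothetical figures for a 1,000 Mbps replication link over one minute.
print(backlog_after(800, 1000, 60))   # writes below bandwidth: the link keeps up
print(backlog_after(1200, 1000, 60))  # writes above bandwidth: a backlog builds
```

Any backlog that has not yet reached the remote site is data that would be lost if the primary failed at that moment.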

The geographical distance barrier is probably the biggest challenge for synchronous replication. Fiber links, a primary transport for enterprise storage, are limited by the speed of light, a physical constraint that can't be bypassed. Also, in public clouds, you have no control over the distances between locations, and hence over the latency.

With static data, latency becomes a problem because propagation delays increase with distance. The system must wait for confirmation that each storage operation has completed before it can continue, so the delay is paid on every write. This means that the practical distance for synchronous replication with static data is very limited: about 100 to 200 km, depending on the application's response-time tolerance and other factors.
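
A rough back-of-the-envelope sketch shows why distance matters so much. It assumes light travels through fiber at the vacuum speed of light divided by a refractive index of about 1.47 (a typical figure for silica fiber, used here as an assumption), and it ignores switching, routing, and protocol overhead, so real latencies will be higher.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum
FIBER_REFRACTIVE_INDEX = 1.47      # assumption: typical value for silica fiber


def round_trip_ms(distance_km: float) -> float:
    """Minimum round-trip propagation delay over fiber, in milliseconds."""
    speed_in_fiber = SPEED_OF_LIGHT_KM_S / FIBER_REFRACTIVE_INDEX
    return 2 * distance_km / speed_in_fiber * 1000


def max_serial_writes_per_second(distance_km: float) -> float:
    """Upper bound on writes per second if each write waits for one round trip."""
    return 1000 / round_trip_ms(distance_km)


for distance in (100, 200, 1000, 4000):
    print(f"{distance} km: {round_trip_ms(distance):.2f} ms round trip, "
          f"at most {max_serial_writes_per_second(distance):.0f} serial writes/s")
```

At 100 km the propagation delay alone is roughly one millisecond per acknowledged write, and it grows linearly with distance, which is consistent with the 100 to 200 km practical limit described above.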

An even bigger challenge is that enterprise-grade replication relies on specialty hardware. Cost considerations aside, you simply can’t deploy specialty hardware in the public cloud.

Live vs Passive Data

When it comes to replication, businesses have two options.

Synchronous replication: data is replicated to a secondary remote location at the same time as new data is created or updated in the primary datacenter.

Asynchronous replication: data is replicated only at predetermined intervals (for example hourly, daily, or weekly). The replica can be stored in a remote DR location, because it does not have to be synchronized with the primary site in real time.
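
The difference between the two modes can be sketched with a toy in-memory model. This is an illustration only, not how any particular product works: the `Primary` and `Replica` classes and the `lost_on_failure` helper are hypothetical names, and network transfer is reduced to appending to a list.

```python
class Replica:
    """Secondary site: holds whatever records have reached it."""

    def __init__(self):
        self.log = []


class Primary:
    """Primary site that replicates either synchronously or at intervals."""

    def __init__(self, replica: Replica, synchronous: bool):
        self.replica = replica
        self.synchronous = synchronous
        self.log = []
        self.pending = []  # records not yet shipped (asynchronous mode only)

    def write(self, record: str) -> None:
        self.log.append(record)
        if self.synchronous:
            # The write is acknowledged only once the replica has the record.
            self.replica.log.append(record)
        else:
            # The write is acknowledged immediately; shipping happens later.
            self.pending.append(record)

    def flush(self) -> None:
        """Ship queued records, as a scheduled replication run would."""
        self.replica.log.extend(self.pending)
        self.pending.clear()


def lost_on_failure(primary: Primary) -> int:
    """Records that exist only on the primary and vanish if it fails now."""
    return len(primary.log) - len(primary.replica.log)


sync_primary = Primary(Replica(), synchronous=True)
for i in range(5):
    sync_primary.write(f"txn-{i}")

async_primary = Primary(Replica(), synchronous=False)
for i in range(3):
    async_primary.write(f"txn-{i}")
async_primary.flush()            # scheduled replication run
for i in range(3, 5):
    async_primary.write(f"txn-{i}")  # written after the last run

print(lost_on_failure(sync_primary))
print(lost_on_failure(async_primary))
```

In this model the synchronous primary never has records the replica lacks, while the asynchronous primary is exposed to losing everything written since the last replication run.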

Obviously, for high-end transactional applications that need instant failover if the primary node fails, synchronous replication is preferable. When you are dealing with backups and snapshots, you are always dealing with data from the past. Most backup and replication solutions are in essence point-in-time solutions.

Multi-region replication today, however, is limited to asynchronous replication. All cross-region solutions in the cloud today transfer your data at intervals of fifteen minutes or more.

At the same time, we live in a world where the rapid pace of data creation, and the dependency of business outcomes on the ability to manage this data effectively, make point-in-time solutions a thing of the past. Static data carries a high cost when it comes to RPO.

When you replicate across regions, you are left with data that is frozen at that point in time.

After a disaster, how up to date is the data I’m using?

Until recently, in the public cloud, some amount of data loss was inevitable, and businesses needed to settle for minimizing the Recovery Point Objective and getting it as close to zero as they possibly could.

Replix operates a mesh of relays that bridges the distance between regions, clouds, and locations to support higher availability with lower latency. Replix Data Fabric ensures that your data is replicated in real time, making the distance limitations and infrastructure concerns irrelevant. With Replix, Zero RPO replication in the public cloud is finally within reach.
