Report: Cloud services can be made more resilient but at a premium

Uptime Institute says enterprises should beware that some cloud service architectures may appear more resilient but actually provide few guarantees of better availability.


Enterprises have options for making cloud services more resilient but at a price premium of up to 111% over the base price of services that offer no protection, according to a study by Uptime Institute.

The extra cost can mean faster recovery times, better compensation from service-level agreements when there are outages, and improved “implied reliability,” according to Uptime’s report “Public cloud costs versus resiliency: stateless applications”.

The institute modeled three scenarios for improving the resiliency of a simple WordPress website that was, at peak, required to deliver webpages within three seconds of requests. The researchers generated a Python simulation that varied bandwidth and virtual-machine demands to analyze their effects on costs.

The study was based on Amazon Web Services rates, but the report says the results are indicative of what to expect in general. “Other public cloud services have similar pricing models, services and architectural principles; the fundamental analysis in this report applies to other cloud providers, too,” it says.

The institute looked at improving resiliency of the WordPress app in three different architectures: by providing backup of just the VM hosting it in the same availability zone; by providing backup of the VM in a separate cloud availability zone in the same region; and by providing backup in separate cloud-provider regions.

A cloud-provider availability zone is a virtual data center, and a collection of them in the same geographic location makes up a region. “Single resources, such as a VM, are likely to become unresponsive from time to time,” the report says. “It is also likely that whole availability zones will go down occasionally, rendering many resources unresponsive. A regional outage is rarer but will bring down multiple availability zones.”

The charge for the baseline service being studied, with no protection, consisted of the cost of using the VM plus the cost of outbound bandwidth, and that totaled $217.38 per month. If the VM were to fail and there were no backup of the app, the recovery time would be determined by how long it took the customer to replace it. “While AWS says its data control plane for this architecture is designed to deliver 99.95% availability, it will only compensate if availability drops below 99.5%,” according to the report, and Uptime calculates that compensation for an outage lasting longer than a day and a half would be 29% of the monthly application cost.
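To put those percentages in terms of hours, the arithmetic can be sketched in a few lines of Python. This is an illustration only, not the institute's simulation: it assumes a 30-day billing month, and the function names are hypothetical.

```python
# Assumption: a 30-day billing month (720 hours). Real SLAs may define the period differently.
HOURS_PER_MONTH = 30 * 24


def allowed_downtime_hours(availability_pct: float, hours: float = HOURS_PER_MONTH) -> float:
    """Hours of downtime in one month that still meet the given availability."""
    return hours * (1 - availability_pct / 100)


def availability_after_outage(outage_hours: float, hours: float = HOURS_PER_MONTH) -> float:
    """Monthly availability (in percent) after a single outage of the given length."""
    return 100 * (1 - outage_hours / hours)


design_budget = allowed_downtime_hours(99.95)  # downtime allowed by the design target
sla_budget = allowed_downtime_hours(99.5)      # downtime allowed before compensation starts
day_and_a_half = availability_after_outage(36) # availability after a 1.5-day outage

print(f"99.95% allows about {design_budget:.2f} hours of downtime per month")
print(f"99.5% allows about {sla_budget:.2f} hours of downtime per month")
print(f"A 36-hour outage leaves availability at {day_and_a_half:.1f}%")
```

Under that assumption, the design target tolerates roughly 22 minutes of downtime a month, while compensation only begins after about 3.6 hours, and a day-and-a-half outage pulls monthly availability down to about 95%.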

Same-zone active backup of a VM

Using a load balancer and backing up the VM with a separate, active VM in the same availability zone would provide zero downtime if the VM failed and would result in the same implied availability of 99.95%. Compensation for outages lasting longer than a day and a half would increase to 44% of monthly cost. Because this architecture calls for an extra VM and a load balancer, it also costs more: $311 per month, up 43% over the baseline.

Active backup in two zones in the same region

Backing up the VM with a separate active VM in a different availability zone within the same region also costs $311 per month. It costs nothing more to put the second VM in a separate zone, but the implied availability improves to 99.99%. The recovery time remains zero and the 44% compensation rate remains the same.

Active backup in separate regions

Setting up the app in two different regions, with each region hosting two active instances of the app in two different zones, offers “arguably the most resilient method,” according to the report.

In this model, there would be four active virtual machines hosting the app, two in each region, with a virtual load balancer in each region directing traffic between VMs situated in two separate zones. “The load balancers provide simple balancing and resiliency in the event of a VM or availability-zone outage,” the report says, and “externally, the failure wouldn’t even be noticed by an end user or need to be managed by their devices.”

Traffic to these virtual load balancers would be directed there by the Domain Name System (DNS). The DNS could be configured to choose the better load balancer based on its physical proximity, the delay on the path to the load balancer, or weighting policies. The DNS could also run health checks to detect when a load balancer becomes unavailable and, when one does, direct traffic to the other.
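The routing choices described above can be illustrated with a minimal Python sketch. It is not the report's code or any provider's API: the `Endpoint` class, the endpoint names, and the latency and weight values are all hypothetical, and the health flag stands in for whatever health check the DNS service runs.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str             # hypothetical label for a regional load balancer
    latency_ms: float     # assumed delay on the path to this load balancer
    weight: int           # relative share of traffic under a weighted policy
    healthy: bool = True  # outcome of the most recent (simulated) health check


def resolve_by_latency(endpoints):
    """Return the healthy endpoint with the lowest latency, or None if all are down."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.latency_ms, default=None)


def resolve_by_weight(endpoints, roll):
    """Pick a healthy endpoint in proportion to its weight; roll is a number in [0, 1)."""
    healthy = [e for e in endpoints if e.healthy]
    total = sum(e.weight for e in healthy)
    if total == 0:
        return None
    threshold = roll * total
    cumulative = 0
    for e in healthy:
        cumulative += e.weight
        if threshold < cumulative:
            return e
    return healthy[-1]


# Hypothetical example: two regional load balancers.
region_a = Endpoint("lb-region-a", latency_ms=20, weight=70)
region_b = Endpoint("lb-region-b", latency_ms=85, weight=30)
balancers = [region_a, region_b]

print(resolve_by_latency(balancers).name)  # the closer load balancer while both are healthy

region_a.healthy = False                   # simulate a failed health check
print(resolve_by_latency(balancers).name)  # traffic now goes to the other region
```

In practice `roll` would come from a random number generator; it is passed in here so the weighted choice is easy to reason about.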

While this may be the most resilient option, it has downsides. “DNS as a balancing mechanism is imperfect because … users’ devices that access the web application … will have a stored record of the IP address of the application,” the report says. “If this address becomes unavailable, [users’ devices] will be unable to access the application until it has updated its local cache with the IP address from the DNS system.” So end users could experience unavailability of uncertain length in the event of a failure.
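A rough way to reason about that caching window is sketched below in Python. It rests on idealized assumptions that do not come from the report: the device honors the record's time-to-live (TTL) exactly, it looks the name up again the moment the cached entry expires, and the DNS service starts returning the surviving address immediately after the failure. Real devices and resolvers may cache for longer, which is why the actual length is uncertain. The function name is hypothetical.

```python
def stale_window_seconds(first_lookup_at: float, ttl: float, failure_at: float) -> float:
    """Seconds a device keeps using the failed address after the failure.

    Assumes the device refreshes its cache every `ttl` seconds starting at
    `first_lookup_at`. A lookup at the exact instant of failure is treated
    as still returning the old address.
    """
    if ttl <= 0:
        raise ValueError("ttl must be positive")
    if failure_at < first_lookup_at:
        raise ValueError("failure must occur after the first lookup")
    # Age of the current cache entry at the moment of failure.
    age = (failure_at - first_lookup_at) % ttl
    return ttl - age


# Hypothetical example: a 300-second TTL, first lookup at t=0.
print(stale_window_seconds(0, 300, 120))  # failure 120 s into the cache entry's life
print(stale_window_seconds(0, 300, 400))  # failure during the second cache entry
```

Under these assumptions the window can never exceed the TTL, so a shorter TTL shrinks the worst case at the price of more frequent lookups.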

In this scenario, implied availability rises to 99.9999%, and the costs rise to a 111% premium over the baseline, reaching $457.80. If one of the regions goes down, that means a load balancer and two VMs are unavailable, entitling the customer to compensation amounting to 62% of cost for the service if the outage lasts longer than 1.5 days, the report says.
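The figures reported for the four architectures can be lined up with a short Python sketch. The costs, availability figures and compensation rates are taken from the article; the conversion of availability into minutes of downtime per year is simple arithmetic that assumes a 365-day year, and the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year
BASELINE_COST = 217.38            # monthly cost of the unprotected service, in dollars

# (architecture, monthly cost in dollars, implied availability %, compensation %)
scenarios = [
    ("No protection", 217.38, 99.95, 29),
    ("Same-zone active backup", 311.00, 99.95, 44),
    ("Two zones, same region", 311.00, 99.99, 44),
    ("Two regions", 457.80, 99.9999, 62),
]


def premium_pct(cost: float, baseline: float = BASELINE_COST) -> float:
    """Percentage price premium over the baseline."""
    return (cost / baseline - 1) * 100


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of downtime per year implied by an availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for name, cost, availability, compensation in scenarios:
    print(
        f"{name:26s} ${cost:7.2f}  premium {premium_pct(cost):5.1f}%  "
        f"~{downtime_minutes_per_year(availability):7.2f} min/yr  "
        f"compensation {compensation}%"
    )
```

Laid out this way, the move from one zone to two cuts the implied downtime from roughly 263 minutes a year to about 53 at no extra cost, while the two-region design brings it under a minute at more than double the baseline price.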

Uptime says that cloud providers often offer resiliency across availability zones as a standard part of many of their services, and that this resiliency provides higher availability at a relatively small cost premium.

The report also issues a warning: “Users should be aware that designs that appear more resilient may offer few meaningful guarantees regarding availability or outage compensation.”


Copyright © 2022 IDG Communications, Inc.
