One of the more important metrics to monitor in a consumer application is availability. Availability is a huge competitive advantage in the marketplace, since customers want the applications we build to be available all (or at least most) of the time. This post will focus on some high-level ideas needed to maintain high availability.

What is Availability?

A great book on availability is Architecting for Scale, by Lee Atchison. Atchison defines availability in the following way:

“The ability of your system to be operational when needed to perform those [sic] operations”

Atchison also distinguishes availability from another important metric, reliability, which is the system’s ability to operate without making a mistake. Both metrics matter when building and maintaining an application, but for a consumer product, returning a result (even if incorrect) is typically more encouraging for customers than being unresponsive.

Measuring Availability

The availability of a system only makes sense when measured within the context of a given time period. For instance, a system that runs uninterrupted for 10 minutes looks highly available over those 10 minutes. However, if that system then falls over for the next 10 minutes, its availability over the full 20 minutes is only 50%. Availability is therefore most useful when expressed as a percentage of uptime over a given period.

For measuring an availability percentage within a given time period, we need the following variables:

Total seconds (or other temporal value) in the time period. Let’s call this variable T.
Number of seconds the system is down. Let’s call this variable D.

We can assign a variable for site availability, P. The formula below gives P as a fraction, which we multiply by 100 to get a percentage. Therefore, to calculate P:

P = [T - D]/T

For example, let’s say we’re measuring availability over a 30-day month, which is 2,592,000 seconds. Our example system was down for 30 minutes, or 1,800 seconds. Therefore our site availability percentage is calculated as follows:

P = [2592000 - 1800]/2592000 ≈ 0.999306 ≈ 99.93%
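
The same calculation can be sketched in a few lines of Python. This is only an illustration of the formula above; the function name availability and its input checks are my own choices rather than anything prescribed by the book.

```python
def availability(total_seconds: float, downtime_seconds: float) -> float:
    """Return availability as a fraction between 0 and 1: P = (T - D) / T."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    if not 0 <= downtime_seconds <= total_seconds:
        raise ValueError("downtime_seconds must be between 0 and total_seconds")
    return (total_seconds - downtime_seconds) / total_seconds


# The example from the text: a 30-day month with 30 minutes of downtime.
T = 30 * 24 * 60 * 60  # 2,592,000 seconds
D = 30 * 60            # 1,800 seconds
P = availability(T, D)
print(f"{P * 100:.2f}%")  # prints "99.93%"
```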

How Many Significant Figures?

When calculating an availability percentage, the result will likely be a long or repeating decimal. The question then becomes: how many decimal places should we keep? The answer lies in what are typically called the nines classes. These classes can be illustrated with the following table:

Nines	Percentage	Allotted Outage (per 30-day month)
2 Nines	99%	432 Minutes
3 Nines	99.9%	43 Minutes
4 Nines	99.99%	4 Minutes
5 Nines	99.999%	26 Seconds

When building an application it can be important to define the targeted nines class, which is typically defined in a Service Level Agreement (SLA).
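
The outage budgets in the table above can be reproduced with a small calculation, assuming the figures refer to a 30-day month (the numbers are consistent with that period). The helper name allowed_downtime_seconds is hypothetical and only serves this sketch.

```python
def allowed_downtime_seconds(nines: int, period_seconds: float) -> float:
    """Downtime budget for a target of `nines` nines over the given period."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return period_seconds * unavailability


# Assumption: a 30-day month, matching the earlier worked example.
MONTH_SECONDS = 30 * 24 * 60 * 60

for nines in range(2, 6):
    budget = allowed_downtime_seconds(nines, MONTH_SECONDS)
    decimals = nines - 2  # 99%, 99.9%, 99.99%, 99.999%
    target = 100 * (1 - 10 ** -nines)
    print(f"{nines} nines ({target:.{decimals}f}%): "
          f"{budget / 60:.2f} minutes ({budget:.0f} seconds)")
```

Changing MONTH_SECONDS to a year or a week gives the budget for an SLA that is measured over a different period.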

Tools for Measuring

A tool for measuring our application’s availability can certainly be built. A simple strategy could be to have a standalone service which periodically pings our application and calculates availability using successful and unsuccessful pings as data points.
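
Here is a minimal sketch of that strategy, using only the Python standard library. The class name AvailabilityMonitor, the http_probe helper, and the health-check URL in the usage comment are all assumptions made for illustration. Note that this approximates availability as the ratio of successful probes to total probes, which only matches the time-based formula when probes are evenly spaced.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # URLError, HTTPError, and socket timeouts are OSError subclasses
        return False


class AvailabilityMonitor:
    """Counts successful and failed probes and reports the success ratio."""

    def __init__(self, probe: Callable[[], bool]) -> None:
        self.probe = probe
        self.successes = 0
        self.failures = 0

    def check_once(self) -> bool:
        ok = self.probe()
        if ok:
            self.successes += 1
        else:
            self.failures += 1
        return ok

    def availability(self) -> float:
        total = self.successes + self.failures
        if total == 0:
            raise ValueError("no probes recorded yet")
        return self.successes / total

    def run(self, checks: int, interval_seconds: float) -> float:
        """Probe `checks` times, sleeping between probes, and return the ratio."""
        for i in range(checks):
            self.check_once()
            if i < checks - 1:
                time.sleep(interval_seconds)
        return self.availability()


# Example usage (hypothetical URL; uncomment to probe once a minute for an hour):
# monitor = AvailabilityMonitor(lambda: http_probe("https://example.com/health"))
# print(f"{monitor.run(checks=60, interval_seconds=60) * 100:.2f}%")
```

Passing the probe in as a function keeps the counting logic independent of HTTP, so the same monitor could wrap a TCP check or a database query instead.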

However, as our application grows, and as the team which maintains the application grows, a need for more robust tools becomes apparent. Some of the most popular tools for measuring availability are Nagios, Datadog, Uptime, and New Relic. Personally I really like Datadog’s user interface.

Building Available Systems

There has been much research in the field of available systems. This section will outline one of the main strategies, which is to leverage microservices.

Microservices

Because of technologies like Docker and Kubernetes, the microservices architecture has been extremely popular in recent years. Some of the properties of a microservices architecture are the following:

Single code repository: Each service has its own repository in version control.
Data isolation: Each service manages its own data. Typically this means each service will have a separate database.
Provides capabilities to other services: A microservice will typically provide an API for other services to consume.
Defined owners: A huge advantage of using microservices from an organizational point of view is that each service can be maintained by a single owner, or a group of owners. Care should be taken when assigning owners to microservices, since if something goes wrong, the owners should be able to fix the errors quickly (or as quickly as possible).

Tracing Errors

We want our availability infrastructure to be implemented such that each service has its availability measured separately. This helps the overall availability of our system, since when something breaks we only need to look at the specific services which are failing. Furthermore, we can ensure that our system fails as early as possible, since a failed upstream service will become apparent if downstream services are still available.

Of course, as our application grows in scope and more team members are needed, more microservices will be created. So how can we trace errors from an upstream service to a downstream service? One strategy could be to send a unique identifier as part of each request to downstream services (this is typically done with a UUID).

As an example, let’s say a service which needs to grab user info must send requests to the Auth Service before forwarding requests downstream to the User Info Service (see Image 1). In order to track each request to the User Info Service, we can provide a unique identifier. If the identifier never reaches the User Info Service, it’s likely that there was a problem upstream, possibly in the Auth Service. This hypothesis can be confirmed by verifying that the identifier reaches the Auth Service.

Image 1. We can trace requests from Service A to the User Info Service using a unique identifier.
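
The flow in Image 1 can be sketched as follows. This is an in-process simulation, not real network code: the three functions stand in for Service A, the Auth Service, and the User Info Service, and the header name X-Request-ID is an assumed (though common) convention rather than something this post prescribes.

```python
import logging
import uuid
from typing import Optional

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("tracing-sketch")
log.setLevel(logging.INFO)

REQUEST_ID_HEADER = "X-Request-ID"  # assumed header name, a common convention


def ensure_request_id(headers: dict) -> dict:
    """Return a copy of headers that carries a request ID, creating one if absent."""
    traced = dict(headers)
    traced.setdefault(REQUEST_ID_HEADER, str(uuid.uuid4()))
    return traced


def auth_service(headers: dict) -> bool:
    """Stand-in for the Auth Service: logs the ID and always authenticates."""
    log.info("auth-service request_id=%s", headers[REQUEST_ID_HEADER])
    return True


def user_info_service(headers: dict, user_id: str) -> dict:
    """Stand-in for the User Info Service: logs the ID and returns a placeholder."""
    log.info("user-info-service request_id=%s", headers[REQUEST_ID_HEADER])
    return {"user_id": user_id, "name": "example"}


def service_a(user_id: str, incoming_headers: Optional[dict] = None) -> dict:
    """Stand-in for Service A: attaches the ID, then calls auth and user info."""
    headers = ensure_request_id(incoming_headers or {})
    log.info("service-a request_id=%s", headers[REQUEST_ID_HEADER])
    if not auth_service(headers):
        raise PermissionError("authentication failed")
    return user_info_service(headers, user_id)
```

Calling service_a("42") logs the same request_id three times, once per service. If the ID shows up in the service-a and auth-service logs but never in the user-info-service log, the search for the failure can start at the Auth Service.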

These UUIDs are useful as part of a logging infrastructure, in which we can trace the requests and events in a microservices architecture. Similar to availability monitoring, there are many options out there (including building our own logging infrastructure). I really like Loggly for this, but there are many others. If the application architecture is built exclusively on AWS, Amazon CloudWatch could be a great fit.

Closing Thoughts

Availability is a widely researched topic in distributed systems. An application growing to the point where availability matters is a good problem to have; it means we have a growing customer base. One of the best architectural patterns for preemptively dealing with availability is the microservices pattern. However, jumping into the pattern without further research can cause more problems than using a monolithic application. When building a new application, I recommend starting with a monolith, and only splitting the monolith into microservices once there are enough team members to monitor the individual services.

Are You Available?