SLA and SLO fundamentals and how to calculate SLA (2024)

Soufiane Bouchaara

Mar 5, 2022

An SLA (Service-Level Agreement) is an agreement you make with your clients or users. It is expressed as a measured metric that can be time-based (uptime) or aggregate-based (share of successful requests).

We can calculate the tolerable duration of downtime to reach a given number of nines of availability, using the following formula:

Tolerated downtime = (1 − Availability) × Period duration

where Availability is expressed as a fraction (0.9995 for 99.95%).

For example, a web application with an availability of 99.95% can be down for at most 4.38 hours per year (0.05% of 8,760 hours).

The following table explains the maximum duration of tolerated downtime per year/month/week/day/hour.

[Image: table of maximum tolerated downtime per year, month, week, day and hour]
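The downtime figures can be recomputed with a few lines of Python. This is a minimal sketch, not the author's tooling: the function names are hypothetical, and the 365-day year and 30-day month are assumptions.

```python
from decimal import Decimal

# Period lengths in hours. The 365-day year and 30-day month are assumptions.
PERIOD_HOURS = {
    "year": Decimal(365 * 24),
    "month": Decimal(30 * 24),
    "week": Decimal(7 * 24),
    "day": Decimal(24),
    "hour": Decimal(1),
}


def allowed_downtime_hours(availability_percent: str, period: str) -> Decimal:
    """Return the maximum tolerated downtime, in hours, for one period."""
    unavailable_fraction = (Decimal(100) - Decimal(availability_percent)) / Decimal(100)
    return unavailable_fraction * PERIOD_HOURS[period]


def downtime_table(levels):
    """Map each availability level to its tolerated downtime per period."""
    return {
        level: {period: allowed_downtime_hours(level, period) for period in PERIOD_HOURS}
        for level in levels
    }


if __name__ == "__main__":
    for level, row in downtime_table(["99", "99.9", "99.95", "99.99"]).items():
        cells = ", ".join(f"{period}: {float(hours):.4f} h" for period, hours in row.items())
        print(f"{level}% -> {cells}")
```

Availability levels are passed as strings so that Decimal keeps them exact; with floats, values such as 99.95 pick up small rounding errors.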

Let’s imagine a backend API that serves 250M requests per day under an aggregate-based SLA of 99.99%. It can fail at most 0.01% of those requests, that is 25k errors per day.

Aggregate-based SLA (%) = (Total requests − Failed requests) / Total requests × 100

Note that a request counts as failed only if it returns an internal server error (HTTP 5XX).
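The error budget in this example can be checked with a short calculation. The sketch below is illustrative only; `error_budget` and `is_server_error` are hypothetical helper names.

```python
from decimal import Decimal


def error_budget(total_requests: int, sla_percent: str) -> int:
    """Return how many failed requests the SLA tolerates for this volume."""
    allowed_fraction = (Decimal(100) - Decimal(sla_percent)) / Decimal(100)
    return int(Decimal(total_requests) * allowed_fraction)


def is_server_error(status: int) -> bool:
    """Only HTTP 5XX responses count against the SLA."""
    return 500 <= status <= 599


if __name__ == "__main__":
    # 250M requests per day at 99.99% leaves a budget of 25,000 errors.
    print(error_budget(250_000_000, "99.99"))
```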

An SLO (Service-Level Objective) is a target, set in relation to an SLA, that defines the expectations and goals the company should achieve during a defined period of time.

Example:

Imagine a company whose current uptime-based SLA is 95%, which means it tolerates up to 1.5 days of downtime per month (which is very bad).

We will define our next objective: raise the SLA to 99%.

For that, we have to take several actions. Here are some examples:

Review the hardware infrastructure
Add preventive monitoring
Find the root causes of the downtimes
Review the network configurations
Review for any single point of failure
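To put numbers on the objective above, the sketch below compares the downtime tolerated per month at the current 95% and at the 99% target. The function names and the 30-day month are assumptions made for illustration.

```python
def monthly_downtime_days(availability_percent: float, days_in_month: int = 30) -> float:
    """Return the downtime, in days, tolerated over one month."""
    return round((100 - availability_percent) / 100 * days_in_month, 4)


def meets_objective(measured_percent: float, objective_percent: float) -> bool:
    """True when the measured availability reaches the objective."""
    return measured_percent >= objective_percent


if __name__ == "__main__":
    current = monthly_downtime_days(95)  # 1.5 days, as in the example above
    target = monthly_downtime_days(99)   # 0.3 days
    hours_to_eliminate = round((current - target) * 24, 1)
    print(f"Downtime to eliminate each month: {hours_to_eliminate} hours")
    print(meets_objective(95, 99))  # False: the objective is not met yet
```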

In this example, we are going to collect Nginx logs with Vector and ship them to Elasticsearch, then visualize them in Kibana to build an SLA dashboard.

[Image: Nginx logs collected by Vector, stored in Elasticsearch and visualized in Kibana]

Why Vector as the log collector? Because it is a lightweight, ultra-fast tool for building observability pipelines. We will use it to collect, transform, and route the Nginx logs to Elasticsearch.

Step 1: Configure your Nginx to provide more detailed logs
Edit /etc/nginx/nginx.conf and, in the http block, add the following:

log_format apm '"$time_local" client=$remote_addr '
 'method=$request_method request="$request" '
 'request_length=$request_length '
 'status=$status bytes_sent=$bytes_sent '
 'body_bytes_sent=$body_bytes_sent '
 'referer=$http_referer '
 'user_agent="$http_user_agent" '
 'upstream_addr=$upstream_addr '
 'upstream_status=$upstream_status '
 'request_time=$request_time '
 'upstream_response_time=$upstream_response_time '
 'upstream_connect_time=$upstream_connect_time '
 'upstream_header_time=$upstream_header_time';

In your Nginx server definition, edit the log configuration as follows:

server {
....
 access_log /var/log/nginx/access.log apm;
 error_log /var/log/nginx/error.log;
....
}
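Before wiring up a collector, it can help to confirm that a line written in the apm format is easy to parse. The sketch below does this in Python. The sample line is made up for illustration (it is not real traffic), and `parse_apm_line` is a hypothetical helper, not part of Nginx or Vector.

```python
import re
from datetime import datetime

# Hypothetical sample line following the apm log_format defined above.
SAMPLE_LINE = (
    '"05/Mar/2022:10:15:32 +0000" client=203.0.113.7 method=GET '
    'request="GET /api/users HTTP/1.1" request_length=120 status=200 '
    'bytes_sent=512 body_bytes_sent=300 referer=- user_agent="curl/7.68.0" '
    'upstream_addr=10.0.0.5:8080 upstream_status=200 request_time=0.012 '
    'upstream_response_time=0.011 upstream_connect_time=0.001 '
    'upstream_header_time=0.010'
)

APM_LINE = re.compile(
    r'^"(?P<timestamp>[^"]*)" client=(?P<client>\S+) method=(?P<method>\S+) '
    r'request="(?P<request>[^"]*)" request_length=(?P<request_length>\d+) '
    r'status=(?P<status>\d{3}) bytes_sent=(?P<bytes_sent>\d+) '
    r'body_bytes_sent=(?P<body_bytes_sent>\d+) referer=(?P<referer>\S+) '
    r'user_agent="(?P<user_agent>[^"]*)" (?P<upstream_and_timings>.*)$'
)


def parse_apm_line(line: str) -> dict:
    """Parse one apm-format access log line into a dict of fields."""
    match = APM_LINE.match(line)
    if match is None:
        raise ValueError("line does not follow the apm log format")
    fields = match.groupdict()
    # The remaining fields are plain key=value pairs separated by spaces.
    fields.update(pair.split("=", 1) for pair in fields.pop("upstream_and_timings").split())
    # $time_local looks like 05/Mar/2022:10:15:32 +0000
    fields["timestamp"] = datetime.strptime(fields["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    fields["request_time"] = float(fields["request_time"])
    return fields


if __name__ == "__main__":
    parsed = parse_apm_line(SAMPLE_LINE)
    print(parsed["timestamp"].isoformat(), parsed["status"], parsed["request_time"])
```

The upstream fields are left as strings here because Nginx writes "-" for them when a request is not proxied, so converting them to numbers needs a guard.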

Step 2: Install Vector to collect, parse and ship your Nginx Logs.

curl --proto '=https' --tlsv1.2 -sSf https://sh.vector.dev | bash

See the Vector installation documentation for other installation methods.

Using VRL (Vector Remap Language) to parse the Nginx logs:

[sources.nginx_source]
type = "file"
ignore_older_secs = 600
# Read the access log written with the apm format from Step 1
include = [ "/var/log/nginx/access.log" ]
read_from = "beginning"
max_line_bytes = 102_400
max_read_bytes = 2_048

[transforms.modify_logs]
type = "remap"
inputs = ["nginx_source"]
# Parse each line with a VRL regex
source = """
. = parse_regex!(.message, r'^\"(?P<timestamp>.*)\" client=(?P<client>.*) method=(?P<method>.*) request=\"(?P<method_type>.*) (?P<request_path>.*) (?P<http_version>.*)\" request_length=(?P<request_length>.*) status=(?P<status>.*) bytes_sent=(?P<bytes_sent>.*) body_bytes_sent=(?P<body_bytes_sent>.*) referer=(?P<referer>.*) user_agent=\"(?P<user_agent>.*)\" upstream_addr=(?P<upstream_addr>.*) upstream_status=(?P<upstream_status>.*) request_time=(?P<request_time>.*) upstream_response_time=(?P<upstream_response_time>.*) upstream_connect_time=(?P<upstream_connect_time>.*) upstream_header_time=(?P<upstream_header_time>.*)$')

# Convert metrics to proper types
.timestamp = to_timestamp!(.timestamp)
.request_length = to_int!(.request_length)
.status = to_int!(.status)
.bytes_sent = to_int!(.bytes_sent)
.body_bytes_sent = to_int!(.body_bytes_sent)
.upstream_status = to_int!(.upstream_status)
.request_time = to_float!(.request_time)
.upstream_response_time = to_float!(.upstream_response_time)
.upstream_connect_time = to_float!(.upstream_connect_time)
.upstream_header_time = to_float!(.upstream_header_time)
.host = "YOUR_HOSTNAME_HERE"
"""

# For debug mode only - output to console
#[sinks.debug_sink]
#type = "console"
#inputs = ["modify_logs"]
#target = "stdout"
#encoding = "json"

# OUTPUT 1 : Elasticsearch
[sinks.send_to_elastic]
type = "elasticsearch"
# Must match the name of the transform defined above
inputs = [ "modify_logs" ]
endpoint = "https://elasticsearch-endpoint.com"
index = "websitelogs-%F"
mode = "bulk"
auth.user = "xxxxxxxxxxxxxxxxxxxxxxx"
auth.password = "xxxxxxxxxxxxxxxxxxx"
auth.strategy = "basic"

Restart Vector to apply the configuration:

systemctl restart vector.service

Check the logs of the Vector systemd unit:

journalctl -u vector.service

It should look like the following:

[Image: output of journalctl for the Vector service]

For demo purposes, I have deployed an Elasticsearch instance on Elastic Cloud.

In Kibana, we need to create an index pattern to read from the Elasticsearch indices.

[Image: creating the index pattern in Kibana]

Note: the time field should be the timestamp parsed from the logs, not the ingest time (the Elasticsearch default).

Our logs are successfully being shipped into Elasticsearch, so the next step is to create a dashboard in Kibana with some graphs in order to calculate our current SLA.

Let’s explore the logs:

[Image: Kibana view of the ingested logs]

It looks like we got around 8,300 hits in the last 15 minutes.

Every hit is parsed into fields and looks as follows:

[Image: a single parsed log entry in Kibana]

Let’s create our first graph, the count of hits:

[Images: Kibana visualization of the hit count]
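The same counts can also be requested from Elasticsearch with a search body. The sketch below only builds that body and reads a response-shaped dictionary; it makes no network call. The field names `timestamp` and `status` come from the Vector transform above, while the function names and the assumption that `timestamp` is mapped as a date and `status` as a number are mine.

```python
import json


def hit_count_query(window: str = "now-15m") -> dict:
    """Build a search body counting all hits and 5XX hits since `window`."""
    return {
        "size": 0,                 # only counts are needed, not documents
        "track_total_hits": True,  # ask for an exact total
        "query": {"range": {"timestamp": {"gte": window}}},
        "aggs": {
            "failed": {"filter": {"range": {"status": {"gte": 500, "lte": 599}}}}
        },
    }


def counts_from_response(response: dict) -> tuple[int, int]:
    """Extract (total hits, failed hits) from a search response dictionary."""
    total = response["hits"]["total"]["value"]
    failed = response["aggregations"]["failed"]["doc_count"]
    return total, failed


if __name__ == "__main__":
    # Print the body so it can be pasted into a search request.
    print(json.dumps(hit_count_query(), indent=2))
```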

As agreed in the formula above, we count only 5XX responses as failed requests; every other status code counts as successful (4XX responses are considered client behavior).

So our aggregate-based SLA is 87.04% over the last 15 minutes and 98.56% over the last 30 days:

[Image: Kibana SLA figures for the last 15 minutes and the last 30 days]

The gap between the last 15 minutes and the last 30 days suggests that an incident is in progress.
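The calculation behind these figures can be sketched as follows. The function names and the 1-point gap threshold are assumptions for illustration; only the two percentages come from the dashboard above.

```python
def aggregate_sla(total_requests: int, failed_requests: int) -> float:
    """Return the aggregate-based SLA in percent; only 5XX count as failed."""
    if total_requests <= 0:
        raise ValueError("no requests in the window")
    return (total_requests - failed_requests) / total_requests * 100


def incident_suspected(short_window_sla: float, long_window_sla: float,
                       max_gap: float = 1.0) -> bool:
    """Flag a short window that falls well below the long-term figure."""
    return long_window_sla - short_window_sla > max_gap


if __name__ == "__main__":
    # Values read from the dashboard: last 15 minutes vs last 30 days.
    print(incident_suspected(87.04, 98.56))  # True
```

The threshold is a tuning choice: a small gap keeps the check sensitive, while a larger one avoids flagging normal fluctuation on low-traffic windows.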

Our dashboard will start to look like this:

[Image: Kibana SLA dashboard]

Metrics are more powerful than logs for this use case: they make real-time dashboards possible.
In this tutorial, we used Nginx logs with Elasticsearch as a document store.
For better performance and real-time dashboards, I highly recommend using metrics instead of logs.

For that, we can use time-series databases such as InfluxDB or Warp10 to store our metrics and use Grafana as a visualization tool.

In Vector, configure an output sink to InfluxDB by adding the following block:

# OUTPUT 2 : InfluxDB Database
[sinks.influxdb_output]
type = "influxdb_logs"
inputs = [ "modify_logs" ]
bucket = "vector-bucket"
consistency = "any"
database = "xxxxxxxxxxx"
endpoint = "https://your-endpoint.com"
password = "your-password-here"
username = "username"
batch.max_events=1000
batch.timeout_secs=60
namespace = "service"

Then restart the Vector service.
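To illustrate the difference between logs and metrics discussed above, the sketch below folds individual log events into per-minute counters, which is the kind of compact series a time-series database stores. It is a hypothetical illustration, not a Vector or InfluxDB feature, and `minute_counters` is a made-up name.

```python
from collections import defaultdict
from datetime import datetime


def minute_counters(events):
    """Fold (timestamp, status) pairs into per-minute total/failed counters."""
    counters = defaultdict(lambda: {"total": 0, "failed": 0})
    for timestamp, status in events:
        # Truncate the timestamp to the start of its minute.
        bucket = timestamp.replace(second=0, microsecond=0)
        counters[bucket]["total"] += 1
        if 500 <= status <= 599:  # only 5XX count as failed
            counters[bucket]["failed"] += 1
    return dict(counters)


if __name__ == "__main__":
    sample = [
        (datetime(2022, 3, 5, 10, 15, 2), 200),
        (datetime(2022, 3, 5, 10, 15, 40), 502),
        (datetime(2022, 3, 5, 10, 16, 1), 404),
    ]
    for minute, counts in sorted(minute_counters(sample).items()):
        print(minute.isoformat(), counts)
```

However many requests arrive, each minute produces two numbers, so a dashboard query reads a handful of points instead of scanning every log document.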

In this article, we covered what SLAs and SLOs are and how an aggregate-based SLA can be calculated from Nginx logs.
In the next article, I will cover how to calculate a time-based SLA and how to improve the SLA by finding root causes.
