SRE Journey — Starting SLO Implementation (Part 2 — SLO & Error Budget)

https://successive.cloud/sre-fundamentals-sla-slo-sli/

Prerequisite:

SRE Journey — Starting SLO Implementation (Part 1 — Start with SLI)

Part 1 covered the SLI specification, the SLI implementation, the time window, and the general formula for calculating an SLI. Now it’s time to build your SLO.

A. SLO

SLO basic rules

Before we start building our SLOs, there are a few rules to follow if you want an SLO culture to work.

  • Accept that a 100% SLO is the wrong target.
  • The people responsible for ensuring that the service meets its SLO have agreed that it is possible to meet this SLO under normal circumstances.
  • The organization has committed to using the error budget for decision making and prioritizing. This commitment is formalized in an error budget policy.
  • There is a process in place for refining the SLO.

Create SLO Starter from SLI

To better understand how to derive an SLO from an SLI, let’s take an example measured over a one-month time window:

  • Total requests: 4,789,101
  • Total successful requests (returning 200 or 201 status code): 4,567,891
  • 90th percentile latency: 318 ms
  • 99th percentile latency: 779 ms

There are a few SLIs that we can create from that example:

  • Latency
  • Successful Request

You can derive the latency SLIs directly, because the metrics already report percentiles. But how about successful requests?

Remember our general formula to calculate an SLI?
good events / total events

So in this case our Successful Request SLI is

4,567,891 / 4,789,101 * 100% = 95.38 %
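The same calculation can be sketched in a few lines of Python. The function name `sli_percent` is only an illustrative choice; the inputs are the request counts from the example above.

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """Return the SLI as a percentage: good events / total events * 100."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events * 100


# Request counts from the one-month example above.
success_sli = sli_percent(good_events=4_567_891, total_events=4_789_101)
print(f"Successful request SLI = {success_sli:.2f} %")
```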

That gives us a few SLIs:

Successful request = 95.38 %
Latency < 318ms = 90%
Latency < 779ms = 99%

Following the Google SRE Workbook, we can round these SLIs to manageable numbers to obtain our starting SLOs:

  • round the availability number down to the nearest whole percent
  • round each latency SLI up to the next multiple of 50 ms (50 ms, 100 ms, 150 ms, …, n × 50 ms)

So our starting SLOs will look like this:

Successful Request = 95%
Latency < 350ms = 90%
Latency < 800ms = 99%
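The two rounding rules can be written as small helper functions. This is a minimal sketch; the function names and the `step_ms` parameter are hypothetical, chosen only for this illustration.

```python
import math


def starting_availability_slo(sli_percent: float) -> int:
    """Round an availability SLI down to a whole percent."""
    return math.floor(sli_percent)


def starting_latency_slo_ms(latency_ms: float, step_ms: int = 50) -> int:
    """Round a latency SLI up to the next multiple of step_ms."""
    return math.ceil(latency_ms / step_ms) * step_ms


print(starting_availability_slo(95.38))  # 95
print(starting_latency_slo_ms(318))      # 350
print(starting_latency_slo_ms(779))      # 800
```

Rounding availability down and latency up both loosen the target slightly, which is why the current system starts out within its SLOs.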

Because these targets are slightly looser than what we measured, our existing system already meets its SLOs from the start.

B. Error Budget

Definition

In the SLO context, an error is any event that does not count as a good event under your SLI definition.

Back to our example, we can define what an error means for each SLO:

success: Successful Request (returning a 200 or 201 status code)
error: Unsuccessful Request (returning any status code other than 200 or 201)

success: Latency < 350ms
error: Latency ≥ 350ms or timeout

success: Latency < 800ms
error: Latency ≥ 800ms or timeout

By definition, an error budget is the number of errors that a particular service or system is allowed to have.

Remember that an SLO is the target for an SLI under certain circumstances? The error budget follows from that definition.

If the SLO is the SLI target that you should achieve, then the error budget is the maximum number of errors that can occur while your SLI stays on target.

In mathematical terms:

error budget = 1 - SLO

How to Calculate it

Back to our example, we have these SLOs:

Successful Request = 95%
Latency < 350ms = 90%
Latency < 800ms = 99%

Using the equation above, each error budget as a percentage is:

Successful Request = 1-95% = 5%
Latency < 350ms = 1-90% = 10%
Latency < 800ms = 1-99% = 1%

We can then convert each percentage into a number of requests:

Successful Request = 5% * 4,789,101 = ~ 239,456
Latency < 350ms = 10% * 4,789,101 = 478,910.1
Latency < 800ms = 1% * 4,789,101 = 47,891.01

These are the numbers of errors allowed within one month for each of our SLOs.
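A minimal Python sketch of the two steps (budget as a percentage, then as a number of requests) is shown here. The dictionary layout and function name are assumptions made for this example; the SLO targets and request total come from the article. The exact value for successful requests is 239,455.05, close to the rounded figure quoted above.

```python
TOTAL_REQUESTS = 4_789_101  # total requests in the one-month window

# Starting SLO targets from the example, in percent.
slo_targets_percent = {
    "Successful Request": 95,
    "Latency < 350ms": 90,
    "Latency < 800ms": 99,
}


def error_budget_events(slo_percent: float, total_events: int) -> float:
    """Error budget = (1 - SLO) * total events, with the SLO given in percent."""
    return (100 - slo_percent) * total_events / 100


for name, slo in slo_targets_percent.items():
    budget = error_budget_events(slo, TOTAL_REQUESTS)
    print(f"{name}: budget {100 - slo}% = {budget:,.2f} requests")
```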

Error Cost

Error cost can be defined as error budget consumption grouped by error type or event. To calculate error cost, your monitoring system must collect the error metrics that correspond to your error budget.

This concept helps you and your team decide which errors to prioritize.

For example, remember that our error budget for successful requests was 239,456? Let’s say that under certain conditions the whole budget was consumed by errors with this breakdown:

status 500 = 150,000 = 62.64 %
status 499 = 70,000 = 29.23 %
status 502 = 19,456 = 8.13 %
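The breakdown above can be reproduced with a short script. This is only a sketch: the `error_cost` function and the dictionary of counts per status code are illustrative, and in practice the counts would come from your monitoring system.

```python
ERROR_BUDGET = 239_456  # successful-request error budget used in the example

# Errors observed per HTTP status code (figures from the example above).
errors_by_status = {500: 150_000, 499: 70_000, 502: 19_456}


def error_cost(errors: dict[int, int], budget: int) -> list[tuple[int, int, float]]:
    """Return (status, count, share of budget in %) sorted by largest consumer."""
    rows = [(status, count, count / budget * 100) for status, count in errors.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


costs = error_cost(errors_by_status, ERROR_BUDGET)
for status, count, share in costs:
    print(f"status {status} = {count:,} = {share:.2f} %")
```

Sorting by count puts the largest budget consumer first, which is the error type to look at first when prioritizing.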

It’s even better to classify these errors further based on log criteria, so that you and your team directly know the root cause of the errors to prioritize. That requires a more comprehensive monitoring design, which will be discussed in another topic.

Error Budget Policy

An error budget policy is the list of actions to take when error budget consumption gets close to the limit, reaches the limit, or exceeds it.

The policy should contain detailed, clear, and actionable items. It often needs an escalation path within your organization.
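As a rough illustration of the three stages named above (close to the limit, at the limit, over the limit), the sketch below maps budget consumption to a policy stage. The 75% warning threshold and the action texts are hypothetical placeholders, not values from this article or the SRE Workbook; your own policy would define them.

```python
# Hypothetical actions per stage; replace with your organization's own policy.
POLICY_ACTIONS = {
    "ok": "continue normal feature work",
    "close": "review recent errors and slow down risky releases",
    "reached": "prioritize reliability work over new features",
    "exceeded": "freeze non-critical releases and escalate",
}


def budget_stage(consumed_errors: int, budget_errors: int, warn_ratio: float = 0.75) -> str:
    """Classify error budget consumption into a policy stage."""
    if budget_errors <= 0:
        raise ValueError("budget_errors must be positive")
    if consumed_errors > budget_errors:
        return "exceeded"
    if consumed_errors == budget_errors:
        return "reached"
    if consumed_errors >= warn_ratio * budget_errors:
        return "close"  # assumed warning threshold, 75% by default
    return "ok"


stage = budget_stage(consumed_errors=200_000, budget_errors=239_456)
print(stage, "->", POLICY_ACTIONS[stage])
```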

C. Documentation

SLO Documentation

According to the Google SRE Workbook, good SLO documentation includes:

  • The authors of the SLO, the reviewers (who checked it for technical accuracy), and the approvers (who made the business decision about whether it is the right SLO).
  • The date on which it was approved, and the date when it should next be reviewed.
  • A brief description of the service to give the reader context.
  • The details of the SLO: the objectives and the SLI implementations.
  • The details of how the error budget is calculated and consumed.
  • The rationale behind the numbers, and whether they were derived from experimental or observational data. Even if the SLOs are totally ad hoc, this fact should be documented so that future engineers reading the document don’t make bad decisions based upon ad hoc data.

You can refer to SLO example.

Error Budget Policy Documentation

Your error budget policy documentation should include:

  • The policy authors, reviewers, and approvers
  • The date on which it was approved, and the date when it should next be reviewed
  • A brief description of the service to give the reader context
  • The actions to be taken in response to budget exhaustion
  • A clear escalation path to follow if there is disagreement on the calculation or whether the agreed-upon actions are appropriate in the circumstances
  • Depending upon the audience’s level of error budget experience and expertise, it may be beneficial to include an overview of error budgets.

See Error Budget Policy documentation example

D. Effectiveness & Improvements

Measure the Effectiveness

Finally, after you have defined your SLOs, calculated the error budgets, created an error budget policy, and documented all of it, it’s time to measure what you’ve built.

You can start by correlating outage events (retrieved from tickets, the ops team, or directly from metrics) with your SLI metric at the time each outage happened. If your SLI doesn’t reflect your outage events, you should probably change the SLI implementation.

This correlation can be scored with Spearman’s rank correlation between the outage tickets and the amount of error budget consumed.

An example can be found in the Google SRE Workbook (https://sre.google/workbook/implementing-slos/, Improve your SLO Quality).
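To make the Spearman scoring concrete, here is a self-contained sketch that ranks both series (averaging tied ranks) and then computes the Pearson correlation of the ranks. The daily ticket counts and budget consumption figures are invented sample data for illustration only; a statistics library could be used instead of the hand-written functions.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (norm_x * norm_y)


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    return pearson(average_ranks(x), average_ranks(y))


# Invented sample data: outage tickets and error budget consumed per day.
outage_tickets = [0, 1, 0, 3, 2, 0, 5]
budget_consumed = [120, 900, 300, 4000, 2500, 80, 9000]
rho = spearman(outage_tickets, budget_consumed)
print(f"Spearman correlation = {rho:.2f}")
```

A value close to 1 suggests that budget consumption rises when outages are reported, so the SLI tracks what users experience. A value near 0 suggests the SLI implementation may need to change.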

Always Refine and Improve

If your SLO culture is going well and all stakeholders are used to it, you can refine it and extend it to other areas.

  1. User Experience SLO
    You can create SLOs for your user experience, depending on your business process. For example, if you are an e-commerce company, you can create SLOs for
    – Duration of search engine data retrieval
    – Transaction success
  2. Bucket Tier SLO
    You can create SLOs and group them by tier, for example for an availability SLO
    Premium tier = 99 %
    Standard tier = 90 %
  3. Dependency Modelling
    This is useful if your system integrates with many tools or external systems. You can also calculate your system’s SLO in relation to the SLOs of its external dependencies, but this topic will be covered in a different article.

Closing Statement

You have now implemented your own SLOs, so it’s time to build effective alerting and incident response. Those topics will be covered in another series.

These SLO and error budget concepts will help you monitor your current system and speed up recovery when something goes wrong.

Thanks!
