Prerequisite:
SRE Journey — Starting SLO Implementation (Part 1 — Start with SLI)
Now that we have learned about the SLI specification, SLI implementation, time window, and the general formula to calculate an SLI in Part 1, it’s time to build your SLO.
A. SLO
SLO basic rules
Before we start building our SLOs, there are a few rules to follow if you want an SLO culture to work.
- Believe that a 100% SLO is the wrong target.
- The people responsible for ensuring that the service meets its SLO have agreed that it is possible to meet this SLO under normal circumstances.
- The organization has committed to using the error budget for decision making and prioritizing. This commitment is formalized in an error budget policy.
- There is a process in place for refining the SLO.
Create SLO Starter from SLI
To give you a better understanding of how to generate an SLO from an SLI, let’s take an example extracted from a 1-month time window:
- Total requests: 4,789,101
- Total successful requests (returning 200 or 201 status code): 4,567,891
- 90th percentile latency: 318 ms
- 99th percentile latency: 779 ms
There are a few SLIs that we can create from this example:
- Latency
- Successful Request
You can derive the latency SLIs directly, because the metrics already expose the percentiles. But what about successful requests?
Remember our general formula to calculate an SLI?
good events / total events
So in this case our Successful Request SLI is
4,567,891 / 4,789,101 * 100% = 95.38 %
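The calculation above can be sketched in a few lines of Python. This is only a minimal illustration of the good events / total events formula; the function name `sli_percent` is an assumption made for this example, not part of any monitoring tool.

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """Return the SLI as a percentage: good events / total events * 100."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events * 100


# Numbers from the 1-month example above.
total_requests = 4_789_101
successful_requests = 4_567_891  # responses with status 200 or 201

success_sli = sli_percent(successful_requests, total_requests)
print(f"Successful Request SLI = {success_sli:.2f} %")
```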
Then we get a few SLIs:
Successful request = 95.38 %
Latency < 318ms = 90%
Latency < 779ms = 99%
Based on the Google SRE Workbook, we can obtain our starting SLOs by rounding these SLIs to manageable numbers:
- round the availability number down to a whole number
- round the latency threshold up to the next multiple of 50ms (50ms, 100ms, 150ms, … n x 50ms)
So our starting SLOs will look something like this:
Successful Request = 95%
Latency < 350ms = 90%
Latency < 800ms = 99%
Because each starting SLO is no stricter than what the system already achieves, our existing system does not breach these SLOs by default.
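The two rounding rules can be written as small helper functions. This is a rough sketch; the function names and the `step_ms` parameter are assumptions chosen for this example.

```python
import math


def starter_availability_slo(sli_percent: float) -> int:
    """Round an availability SLI down to a whole percentage."""
    return math.floor(sli_percent)


def starter_latency_slo_ms(latency_ms: float, step_ms: int = 50) -> int:
    """Round a latency threshold up to the next multiple of step_ms."""
    return math.ceil(latency_ms / step_ms) * step_ms


# SLIs measured in the example above.
availability_slo = starter_availability_slo(95.38)
p90_slo_ms = starter_latency_slo_ms(318)
p99_slo_ms = starter_latency_slo_ms(779)

print(f"Successful Request = {availability_slo}%")
print(f"Latency < {p90_slo_ms}ms = 90%")
print(f"Latency < {p99_slo_ms}ms = 99%")
```

Rounding availability down and latency up both loosen the target slightly, which is why the current system starts out inside its SLOs.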
B. Error Budget
Definition
In the SLO context, an error is any event that does not count as a good event under your SLI definition.
Back to our example, we can define what an error means for each of our SLOs:
success: Successful Request (returning 200 or 201 status code)
error: Unsuccessful Request (returning any code except 200 or 201)

success: Latency < 350ms
error: Latency ≥ 350ms or timeout

success: Latency < 800ms
error: Latency ≥ 800ms or timeout
By definition, an error budget is the number of errors that is allowed for a particular service or system.
Recall that an SLO is an SLI target under certain conditions. The error budget follows directly from that definition.
If the SLO is the SLI target that you should achieve, then the error budget is the maximum number of errors that can happen while your SLI stays on target.
In mathematical terms, we can write it like this:
error budget = 1 - SLO
How to Calculate it
Back to our SLO example, we have SLOs
Successful Request = 95%
Latency < 350ms = 90%
Latency < 800ms = 99%
Then we can derive the error budgets.
Based on the equation above, each error budget as a percentage is:
Successful Request = 1-95% = 5%
Latency < 350ms = 1-90% = 10%
Latency < 800ms = 1-99% = 1%
We can then calculate each error budget as a number of events:
Successful Request = 5% * 4,789,101 = 239,455.05 (rounded up to ~239,456)
Latency < 350ms = 10% * 4,789,101 = 478,910.1
Latency < 800ms = 1% * 4,789,101 = 47,891.01
So these are the numbers of errors allowed within 1 month for each of our SLOs.
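The same arithmetic can be sketched in Python, applying error budget = 1 - SLO to the monthly request count. The function and variable names are assumptions made for this example. Note that the exact product for the first SLO is 239,455.05, which the article treats as roughly 239,456.

```python
def error_budget_events(slo_percent: float, total_events: int) -> float:
    """Allowed number of bad events: (1 - SLO) * total events."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return (100 - slo_percent) / 100 * total_events


total_requests = 4_789_101  # total events in the 1-month window

# Starting SLOs from the example, in percent.
slos = {
    "Successful Request": 95,
    "Latency < 350ms": 90,
    "Latency < 800ms": 99,
}

budgets = {
    name: error_budget_events(slo, total_requests) for name, slo in slos.items()
}

for name, budget in budgets.items():
    print(f"{name}: {100 - slos[name]}% of requests = {budget:,.2f} events")
```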
Error Cost
Error cost can be defined as error budget consumption grouped by error type or event. To calculate it, your monitoring system must collect the error metrics that correspond to your error budget.
This concept will help you and your team decide which errors should be prioritized.
For example, recall that our error budget for successful requests was 239,456. Let’s say that under certain conditions the whole budget was consumed by errors with this breakdown:
status 500 = 150,000 = 62.64 %
status 499 = 70,000 = 29.23 %
status 502 = 19,456 = 8.13 %
It is even better if you classify these errors further based on log criteria, so you and your team will directly know which root cause should be prioritized. That requires a more comprehensively designed monitoring system, which will be discussed in another topic.
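The status-code breakdown above can be reproduced with a short sketch that computes each error type's share of the consumed budget and sorts the biggest consumer first. The function name `error_cost` and the dictionary layout are assumptions for this example; in practice the counts would come from your monitoring system.

```python
def error_cost(errors_by_type: dict[str, int]) -> dict[str, float]:
    """Return each error type's share of consumed budget, largest first."""
    total = sum(errors_by_type.values())
    if total == 0:
        return {}
    ranked = sorted(errors_by_type.items(), key=lambda item: item[1], reverse=True)
    return {name: count / total * 100 for name, count in ranked}


# Error counts from the example above, keyed by HTTP status code.
consumed = {"502": 19_456, "500": 150_000, "499": 70_000}

costs = error_cost(consumed)
for status, share in costs.items():
    print(f"status {status} = {consumed[status]:,} = {share:.2f} %")
```

Sorting by share puts the error type that burns the most budget at the top, which is the one to look at first.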
Error Budget Policy
An error budget policy is the list of actions to take when error budget consumption gets close to the limit, reaches the limit, or exceeds it.
The policy should contain detailed, clear, and actionable items, and it often needs an escalation path within your organization.
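As a rough sketch, the three situations a policy covers (close to the limit, at the limit, over the limit) can be expressed as a small classifier. The 80% warning ratio, the state names, and the example actions are all assumptions for illustration; your own policy should define them together with your stakeholders.

```python
def budget_state(consumed: float, budget: float, warn_ratio: float = 0.8) -> str:
    """Classify error budget consumption against the policy thresholds."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    if consumed > budget:
        return "exceeded"
    if consumed == budget:
        return "reached"
    if consumed >= warn_ratio * budget:
        return "close"
    return "healthy"


# Hypothetical actions per state; replace them with your agreed policy items.
ACTIONS = {
    "healthy": "continue normal feature work",
    "close": "notify the service owners and review recent changes",
    "reached": "prioritize reliability work over new features",
    "exceeded": "escalate according to the policy's escalation path",
}

budget = 239_456  # Successful Request error budget from the example
for consumed in (100_000, 200_000, 239_456, 250_000):
    state = budget_state(consumed, budget)
    print(f"{consumed:,} errors -> {state}: {ACTIONS[state]}")
```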
C. Documentation
SLO Documentation
Based on the Google SRE Workbook, good SLO documentation has the following characteristics:
- The authors of the SLO, the reviewers (who checked it for technical accuracy), and the approvers (who made the business decision about whether it is the right SLO).
- The date on which it was approved, and the date when it should next be reviewed.
- A brief description of the service to give the reader context.
- The details of the SLO: the objectives and the SLI implementations.
- The details of how the error budget is calculated and consumed.
- The rationale behind the numbers, and whether they were derived from experimental or observational data. Even if the SLOs are totally ad hoc, this fact should be documented so that future engineers reading the document don’t make bad decisions based upon ad hoc data.
You can refer to the SLO example.
Error Budget Policy Documentation
Your error budget policy document should include the following:
- The policy authors, reviewers, and approvers
- The date on which it was approved, and the date when it should next be reviewed
- A brief description of the service to give the reader context
- The actions to be taken in response to budget exhaustion
- A clear escalation path to follow if there is disagreement on the calculation or whether the agreed-upon actions are appropriate in the circumstances
- Depending upon the audience’s level of error budget experience and expertise, it may be beneficial to include an overview of error budgets.
D. Effectiveness & Improvements
Measure the Effectiveness
Finally, after you have defined your SLOs, calculated the error budgets, created an error budget policy, and documented all of it, it’s time to measure what you’ve built.
You can start by correlating outage events (retrieved from tickets, the ops team, or directly from metrics) with your SLI metric at the time of each outage. If your SLI does not reflect your outage events, you should probably change the SLI implementation.
This correlation can be scored with Spearman’s rank correlation between the number of outage tickets and the amount of error budget consumed.
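Spearman’s correlation is the Pearson correlation computed on the ranks of the two series. Here is a minimal standard-library sketch of it. The weekly ticket counts and budget consumption figures are made-up sample data, and the function names are assumptions for this example.

```python
from math import sqrt


def average_ranks(values: list[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation of the ranked values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Hypothetical weekly data: outage tickets vs. error budget consumed.
outage_tickets = [0, 1, 3, 2, 5, 1]
budget_consumed = [1_200, 4_000, 30_000, 9_000, 52_000, 2_500]

rho = spearman(outage_tickets, budget_consumed)
print(f"Spearman correlation = {rho:.2f}")
```

A value close to 1 suggests that weeks with more outage tickets also burn more error budget, so the SLI tracks real outages. A value near 0 is a hint that the SLI implementation may need to change.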
Always Refine and Improve
If your SLO culture is going well and all of the stakeholders are used to it, you can refine it and extend it to other areas.
- User Experience SLO
You can create SLOs for your user experience, depending on your business process. For example, if you are an e-commerce company, you can create SLOs for
– Duration of search engine data retrieval
– Transaction Success
- Bucket Tier SLO
You can create SLOs and group them by tier, for example for an Availability SLO
Premium tier = 99 %
Standard tier = 90 %
- Dependency Modelling
This is useful if your system is integrated with many tools or external systems. You can also calculate the SLO for your system in relation to the SLOs of its external dependencies, but that topic will be covered in a different article.
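To illustrate the bucket tier idea from the list above, here is a small sketch that checks a measured availability SLI against a per-tier target. The tier targets come from the example; the request counts and names are hypothetical.

```python
# Availability targets per tier from the example above, in percent.
TIER_SLO_PERCENT = {"premium": 99.0, "standard": 90.0}


def meets_tier_slo(tier: str, good_events: int, total_events: int) -> bool:
    """Return True if the measured availability SLI meets the tier's SLO."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    sli = good_events / total_events * 100
    return sli >= TIER_SLO_PERCENT[tier]


# Hypothetical request counts, for illustration only (98.5% availability).
premium_ok = meets_tier_slo("premium", 9_850, 10_000)
standard_ok = meets_tier_slo("standard", 9_850, 10_000)

print(f"premium tier meets SLO: {premium_ok}")
print(f"standard tier meets SLO: {standard_ok}")
```

The same measured availability can breach one tier and satisfy another, which is the reason to track the tiers separately.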
Closing Statement
You have now implemented your own SLOs, so it’s time to build effective alerting and incident response. Those topics will be covered in another series.
These SLO and error budget concepts will help you monitor your current system and speed up recovery when something goes wrong.
Thanks!
