What SLOs Mean for Your Team and Your Customers
Over the years at Float, we've used several methods to track and monitor the performance of our services, both internally and externally. If you don't know anything about SLOs (service level objectives), the simplest explanation is that they are targets for the performance or health of a service, measured over a set period.
You can use SLOs in production environments to ensure released code stays within a specific error budget. We started looking at implementing SLOs for our production services and systems a while back—both to test the waters on how SLOs might benefit our organization and as (the beginning of) a path to further improve the quality and design of our systems.
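To make the idea of an error budget concrete, here is a minimal sketch in Python. It assumes the SLO is expressed as a success-rate target over a count of events; the function names and the numbers in the usage note are illustrative, not taken from Float's tooling.

```python
def error_budget(slo_target: float, total_events: int) -> int:
    """Number of events that may fail while the SLO is still met."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    # Round before truncating to absorb floating-point noise
    # (for example, (1 - 0.99) * 1000 is not exactly 10.0).
    return int(round((1.0 - slo_target) * total_events, 6))


def budget_remaining(slo_target: float, total_events: int, failed_events: int) -> int:
    """Failures still allowed before the SLO is breached.

    A negative result means the budget is already overspent.
    """
    return error_budget(slo_target, total_events) - failed_events
```

With a 99% target and 1,000 events, `error_budget(0.99, 1000)` allows 10 failures; after 4 failures, `budget_remaining(0.99, 1000, 4)` leaves 6.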
After more than a year of using SLOs and delivering a number of significant new features for our customers, we're now ready to optimize our use of service level objectives and expand them across Float.
Three areas of focus for SLO optimization
Whether you are in the process of selecting, implementing, reporting, or even breaking an SLO, the decision impacts the whole organization. As such, it's important to get your team's buy-in for each step of the process. Without buy-in, there's no protocol for when things go south.
Lack of education across a team is problematic. Responses to SLOs can be wide-ranging—from very positive to confused. You can't always expect everyone to be highly engaged, but you do need to have everyone on your team understand why SLOs are important and how they work for your product.
One of the biggest problems we see with SLOs is accuracy. It's a potential engineering cliché—everything is rosy when it comes to planning, but when the rubber finally hits the road, that's when things get complicated.
Aiming for accuracy has its complexities
The problem with some SLOs is that they span multiple events, functions, and services. Tracking that data cleanly and efficiently often turns out to be more complex than you anticipated. As an example, here's an SLO for a contact form on our website:
99% of all messages from contact forms are successfully received by the Float team across 30 days
And here is the diagram which reveals the complexity behind such a simple form:
Tracking the data for that SLO relies on multiple pieces of information spread across various stacks and systems. This is often where accuracy can begin to degrade. There are also questions of definition: what counts as "email received"? Which actions should "submit form on website" include? Just as a system of components in series becomes less reliable with each component you add, measurement can become less accurate with each stack or system the data has to cross.
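The series analogy can be put into numbers. The sketch below assumes each step in the chain succeeds independently of the others, which is a simplification, and the four steps and their 99.5% rates are hypothetical rather than read off the diagram.

```python
import math
from typing import Iterable


def series_success_rate(step_rates: Iterable[float]) -> float:
    """End-to-end success rate of steps that must all succeed in sequence.

    Assumes the steps fail independently, so the rates multiply.
    """
    return math.prod(step_rates)


# Hypothetical chain: form submit -> API -> email provider -> team inbox
steps = [0.995, 0.995, 0.995, 0.995]
end_to_end = series_success_rate(steps)  # roughly 0.98
```

Each step on its own clears a 99% bar, yet the chain as a whole lands at roughly 98%, below the 99% target in the contact form SLO.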
Accuracy can be particularly tricky when first implementing SLOs. As a result, our focus is to start by measuring something of value, then increase accuracy over time based on the criticality of the SLO and our confidence in its data.
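As a starting point for "measuring something of value," the contact form SLO could be checked with something as small as the sketch below. The `Submission` record and its `received` flag are assumptions made for illustration; in practice, deciding how that flag gets set is exactly the definitional question raised above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class Submission:
    submitted_at: datetime
    received: bool  # True if the message reached the team (hypothetical field)


def success_ratio(
    submissions: Iterable[Submission],
    now: datetime,
    window: timedelta = timedelta(days=30),
) -> Optional[float]:
    """Share of submissions in the trailing window that were received.

    Returns None when the window holds no submissions, since the SLO
    is then neither met nor breached.
    """
    start = now - window
    in_window = [s for s in submissions if start < s.submitted_at <= now]
    if not in_window:
        return None
    return sum(s.received for s in in_window) / len(in_window)
```

Comparing the result against the 0.99 target gives a first, rough answer; the inputs can then be refined as confidence in the data grows.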
Learnings from year one of implementing SLOs
There were several things that we approached well during our first batch of SLOs:
- We implemented slowly
It's often best to go slow when you're trying to introduce a complex set of procedures or changes. Doing so gives you time to make mistakes (and fix them) before you've piled too many changes on top of each other. A slow burn also gives people time to get used to changes that might impact their workflows.
- We added incrementally
At Float, there was no widespread mandate or change overnight when we first introduced SLOs. They simply acted as another level of monitoring or reporting alongside the usual channels. This gave the team plenty of time to be curious about them and react to them independently without feeling like they suddenly had another mouth to feed.
- We reported regularly
We also reported on the SLOs every month, and this report was sent out to all team members. It wasn't required reading, but consistent reporting on the measures is important. Everyone was at least aware of how our services were operating each month and could easily find out historically what those levels of service were.
- We collected the data
By running these streamlined SLOs for a year, we got some excellent historical data for how services were operating. Although our engineering team has a sixth sense for which services are linchpins or generally receive the most pressure, we can now easily see the more definitive data on how certain services are operating and prove or disprove those ideas.
Why SLOs are necessary for your team and customers
Once you adopt SLOs internally, you have essentially defined what your customers can expect from you in terms of reliable and consistent service. Using SLOs has many positive outcomes for an organization:
- Improve software quality: SLOs can highlight issues you potentially didn't know existed ahead of a major incident.
- Help make decisions: Once you've defined your SLOs, decisions about technical debt and reliability are much easier to make.
- Balance speed vs. stability: SLOs can help you gauge how much time to invest in making a system reliable vs. shipping more features.
One of the fundamental tenets of service level objectives is consistent optimization. That means regularly reviewing and updating how we create and use SLOs.
As we continue to expand, tweak, and tighten our internal SLOs at Float, everyone in the organization, from product and marketing to customer success and engineering, is agreeing that our customers demand a level of quality that we are willing and able to deliver consistently.