Starting an SRE team? Unsure where to start?
Here are some tips based on our research and conversations with leading SRE practitioners.
You have taken up a leadership role in a new organization and you are trying to make operations more efficient and effective. You have heard stories about how transformational SRE has been in several large organizations. You want to bring those practices into your organization as well, but how do you get started? Do you have to go through the same learning curve that everybody else has gone through?
Starting an SRE team is not like creating a product or a regular operations team. An SRE team’s responsibilities can vary widely depending on organizational requirements.
We have spoken to several SRE practitioners and leaders to understand their journey. While SRE really is what you make of it, there are some fundamental steps that any new leader building a team can follow.
Understanding how day-2 operations are currently run helps determine which areas of your organization need SRE transformation. The first step is to assess the overall health of an application stack or ecosystem. Start at a high level and look at the key indicators of health listed below, without muddying the waters with too many complex metrics:
- How many major outages have we had (quarterly incident report)?
- Quarterly downtime/availability report
- Performance metrics (Time to Interact/Page Speed/Service response time)
- Customer satisfaction report
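As a minimal sketch of how the availability figure in such a quarterly report could be derived from recorded outage durations (the 90-day period and the three outages are made-up values, and `availability_percent` is a hypothetical helper, not part of any tool):

```python
from datetime import timedelta


def availability_percent(period: timedelta, downtime: timedelta) -> float:
    """Return availability as a percentage of the reporting period."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    if downtime < timedelta(0) or downtime > period:
        raise ValueError("downtime must be between zero and the period")
    # Dividing one timedelta by another yields a plain float ratio.
    return 100.0 * (1 - downtime / period)


# Hypothetical quarter: 90 days with three recorded outages.
quarter = timedelta(days=90)
outages = [timedelta(minutes=42), timedelta(hours=2), timedelta(minutes=18)]
total_downtime = sum(outages, timedelta(0))

print(f"Downtime: {total_downtime}, availability: "
      f"{availability_percent(quarter, total_downtime):.3f}%")
```

Keeping the calculation this simple at the start matches the advice above: one number per application per quarter is enough to spot outliers before investing in finer-grained metrics.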
Typically, patterns will start emerging once these metrics are gathered, and you should be able to assess gaps quite easily. The following questions can be used when interviewing first-line managers:
- Are the incidents trending up or down?
- Have SLAs been met during issue identification and resolution?
- Are RCAs being conducted effectively to prevent recurring problems?
- Is monitoring coverage sufficient to reliably produce these metrics?
- Are all applications trending similarly or are there outliers?
- Do we have the right tools?
Once you have gone through the exercise above, you will have a list of issues that you can brainstorm on. Instead of trying to solve all of them, we recommend performing a Pareto analysis and narrowing the list down to the two major problems that give you the best bang for the buck.
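A Pareto analysis of this kind can be sketched in a few lines. The example below is only an illustration: the problem areas and incident counts are invented, and `pareto` is a hypothetical helper. It ranks issues by incident count and shows the cumulative share of incidents that the top two account for.

```python
from collections import Counter


def pareto(issue_counts: dict[str, int], top_n: int = 2) -> list[tuple[str, int, float]]:
    """Rank issues by incident count and report each one's cumulative share."""
    total = sum(issue_counts.values())
    if total == 0:
        return []
    rows, running = [], 0
    # most_common() sorts issues from the highest count to the lowest.
    for issue, count in Counter(issue_counts).most_common():
        running += count
        rows.append((issue, count, round(100 * running / total, 1)))
    return rows[:top_n]


# Hypothetical incident counts per suspected problem area.
incidents = {
    "monitoring gaps": 40,
    "deployment regressions": 30,
    "capacity shortfalls": 15,
    "expired certificates": 5,
    "other": 10,
}

for issue, count, cumulative in pareto(incidents):
    print(f"{issue}: {count} incidents, cumulative {cumulative}%")
```

With these made-up numbers, the top two problem areas cover 70% of incidents, which is the kind of concentration that justifies focusing on just two problems first.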
Once you have these two issues, it is important to understand their potential root causes. We highly recommend using the 5 Whys technique for this study. The outcome of this exercise will be observations such as the following:
- Low monitoring coverage across all components that must be observed
- Incident management and RCA process is broken
- Direct correlation between frequency of deployments and outages
- Poor capacity planning causing frequent response time degradation during peak loads
Identify skill sets
This is where things finally start taking shape. There is clarity on what problems you want to solve; all you need now is to assemble your A-Team. Based on your problem statements, you may need several shared services capabilities within your team.
While it is ideal to have multi-skilled people on the team, it is not always easy to find those skills on the market.
A typical SRE team could consist of the following capabilities:
- Developers (with scripting knowledge)
- Network/Server/Cloud engineers
- Database Administrators
- DevOps engineers
- Monitoring experts
- L2 support experts
Get leadership buy-in
Great! You have the problems you want to solve, you have the timeline, you have the budget estimates, you have the right team. Time to go fishing! Well, not yet. Most of these shared services capabilities typically do not roll up under the same management, and there are clear boundaries established (especially in large enterprises).
Even if you decide to hire externally for all or some of these roles, it is important to have buy-in from the leaders responsible for those functions. We cannot overstate the importance of this step. It will also help build long-term trust and partnership if you decide to scale the SRE team.
You have answered the WIIFM (What’s In It For Me) questions successfully, everybody is on board to help you with your SRE journey, and now it is time for action! Nope, not yet.
We like to set measurable goals before we start the experiment. The OKR (Objectives and Key Results) framework is very useful and can help your team work towards a common vision.
Here is an example:
Objective
- Improve customer satisfaction
Key result (KR1)
- Reach 100% monitoring coverage of all digital assets by end of Q1
Key result (KR2)
- Reduce incidents by 10% by Q2
Key result (KR3)
- Improve page load time to under 3 seconds across the US and Latin America
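One way to keep key results like these measurable is to track each one as a baseline, a target, and a current value. The sketch below is an illustration only: the `KeyResult` class is hypothetical, the targets loosely mirror the example above, and all baselines and current values are assumed numbers.

```python
from dataclasses import dataclass


@dataclass
class KeyResult:
    name: str
    baseline: float
    target: float
    current: float

    def progress(self) -> float:
        """Fraction of the way from baseline to target, clamped to [0, 1]."""
        span = self.target - self.baseline
        if span == 0:
            return 1.0
        # Works for both "increase" and "decrease" goals because the sign
        # of span matches the direction of improvement.
        return max(0.0, min(1.0, (self.current - self.baseline) / span))


# Baselines and current values are assumptions for illustration.
key_results = [
    KeyResult("Monitoring coverage (%)", baseline=60, target=100, current=80),
    KeyResult("Incidents per quarter", baseline=50, target=45, current=48),
    KeyResult("Page load time (s)", baseline=5.0, target=3.0, current=4.0),
]

for kr in key_results:
    print(f"{kr.name}: {kr.progress():.0%}")

# A simple unweighted average; real OKR scoring may weight key results.
objective_progress = sum(kr.progress() for kr in key_results) / len(key_results)
print(f"Objective: {objective_progress:.0%}")
```

Recording the baseline explicitly matters here: "reduce incidents by 10%" only becomes checkable once everyone agrees what the starting count was.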
Invest in Training
You have assembled a dream team; they are skilled and understand your objectives clearly. You surely have enough to finally start your SRE journey! Almost, one final step (patience is a virtue).
Every new team, especially one with different skill sets, takes time to come together and start delivering as a unit. This is where training can play a massive role in reducing lead times and learning curves for the new team. Learning the tenets of SRE and equipping the team with the right set of tools and processes helps it get started in the right direction once the rubber meets the road.
Teams must be familiar with core SRE concepts such as chaos engineering, the four golden signals, error budgets, blameless RCAs, and monitoring best practices.
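To make one of these concepts concrete, here is a minimal sketch of an error budget for a request-based availability SLO. The SLO, request volume, and failure count are assumed values, and `error_budget` is a hypothetical helper rather than an API from any monitoring tool.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget consumption for a request-based availability SLO."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    # The budget is the number of requests allowed to fail under the SLO.
    allowed_failures = (1 - slo) * total_requests
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": consumed,
    }


# Hypothetical month: 99.9% SLO, 2,000,000 requests, 1,200 failed requests.
report = error_budget(slo=0.999, total_requests=2_000_000, failed_requests=1_200)
print(f"Budget consumed: {report['budget_consumed']:.0%}")
```

A team that has consumed most of its budget would typically slow down risky changes; one with plenty left has room to ship faster.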
At NorthUp we provide corporate and individual training plans completely customized to your needs. Below is the broad approach that we take:
- Evaluate skill sets
- Consider culture
- Build learner profile
- Design content
- Set up hands-on labs
- Create objectives
- Feedback loop to evaluate training outcome
Find the right training partner to help you succeed.
Ooch before you leap
Finally, you have a well-trained team of highly motivated engineers who are ready for action! One of the biggest reasons new SRE teams fail is that they try to boil the ocean. You can build a 30-member team for $5 million, but that may not be the best approach if you are starting a new practice.
We recommend ooching before leaping. Ooching, as defined by the Heath brothers in their popular book ‘Decisive’, is the practice of conducting “small experiments to test one’s hypothesis.”
You could start small by solving the problems you have identified for a smaller portion of the organization rather than across the enterprise. This will help you iron out problems and pivot before scaling out.
North Up is an online learning company that enables individuals to succeed in the digital economy. Our vision is to create the next generation of cutting-edge SRE engineers by providing immersive, hands-on training on all practical aspects of Site Reliability Engineering.