In most IT organizations, there is an inherent gap between product development teams and operations teams. Product teams want to get more features into production to see how customers and users react to them, while the operations team's job is to ensure the uptime and stability of the site. In the agile world, CI/CD and daily deployments are the norm. As much as Ops teams hate entropy, product teams need to get new features and improvements deployed to win in a competitive market.
This is where SRE comes in. This team bridges the gap between product teams and traditional operations.
So, what is SRE?
The only right answer to this question is: nobody knows for sure! SRE, like DevOps, does not have a fixed job description or set of requirements. It is extremely flexible, and you can tailor it to your organization's requirements. We know of at least 7 MNCs that have decent-sized SRE teams, and they all have different charters, goals, tools and approaches. The only common factor is that all of them agree that SRE teams have transformed operations and production engineering in general.
“In Ambiguity Lies Great Opportunity”
That said, there are certain things that all SRE engineers/teams have to do really well in order to be successful:
- An SRE engineer is responsible for the overall uptime and performance of the product or website
- She strives to reduce toil in the ecosystem by automating recurring manual tasks
- She works with product owners, engineering leaders and other teams to set goals or service level agreements for product uptime and performance, and holds them to those goals. If a goal is breached, the SRE engineer is empowered to enforce consequences such as limiting deployments
- She has a holistic view of how the entire ecosystem works instead of modules within applications
- She is responsible for ensuring all digital assets are observable, which results in saving crucial minutes while restoring service
These principles might look simple at a macro level, but at a micro level they require a solid understanding of large-scale distributed IT systems and the ability to wear several hats at the same time.
SRE engineers typically possess a wide array of skills, but it isn't always practical to expect one SRE engineer to have them all. Most organizations instead assemble these skills across a team, which is a quicker and better approach.
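The goal-setting bullet above, where a breached goal can limit deployments, can be sketched as a simple check. This is a minimal illustration rather than a prescribed policy: the "error budget" framing, the names `ServiceWindow`, `error_budget_remaining` and `deployments_allowed`, and the 99.9% target used in the usage note are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Request counts for one service over an agreed measurement window."""
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: ServiceWindow, target: float) -> float:
    """Return the unspent fraction of the failure allowance.

    `target` is the agreed success ratio (for example 0.999).
    The result is negative once the goal has been breached.
    """
    allowed_failures = window.total_requests * (1.0 - target)
    if allowed_failures == 0:
        # No traffic or a 100% target: there is no allowance to spend.
        return 0.0
    return 1.0 - window.failed_requests / allowed_failures


def deployments_allowed(window: ServiceWindow, target: float) -> bool:
    """Hypothetical policy: pause feature deployments once the allowance is spent."""
    return error_budget_remaining(window, target) > 0.0
```

With one million requests and a 0.999 target, roughly 1,000 failures are tolerated, so `deployments_allowed(ServiceWindow(1_000_000, 400), 0.999)` returns `True`, while the same call with 1,500 failures returns `False`.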
Let’s look at a typical day in the life of an SRE engineering team to understand this better:
You start your day by looking at the amazing visualization you have built around your critical KPIs and conclude your systems are healthy.
While you are having coffee with a teammate, your business partner calls to inform you of a significant conversion drop in the ANZ region: many customers are dropping off at checkout.
After gathering details of the problem, you go back to your trusted monitoring tools to check if something sticks out, but your ecosystem continues to show green for both infrastructure and application.
Your real-time user monitoring is robust, so you have details of the customer sessions that were affected.
After analyzing the sessions, you conclude that a certain type of payment that is widely used in ANZ accounts for most of the drop-offs.
You engage the vendor and bring them on a call to check what is going on. The payment vendor confirms there is an outage on their side and quickly resolves the problem.
You meet with your team and measure the following metrics:
- Total downtime
- Revenue impact
- Mean time to identify and restore service
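As a rough sketch of how these three numbers could be derived from incident timestamps: the `Incident` fields, the `summarize` function and the flat `revenue_per_minute` figure are hypothetical simplifications, and a real revenue estimate would likely compare against expected conversion for that region and time of day. The "mean" in mean time to identify and restore comes from averaging these per-incident values across incidents.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    started: datetime     # when customers were first affected
    identified: datetime  # when the cause was pinpointed
    restored: datetime    # when service was back to normal


def _minutes(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def summarize(incident: Incident, revenue_per_minute: float) -> dict:
    """Return downtime, identification/restoration times and a revenue estimate."""
    downtime = _minutes(incident.started, incident.restored)
    return {
        "downtime_minutes": downtime,
        "time_to_identify_minutes": _minutes(incident.started, incident.identified),
        "time_to_restore_minutes": _minutes(incident.identified, incident.restored),
        # Assumption: revenue is lost at a constant rate while the service is down.
        "estimated_revenue_impact": downtime * revenue_per_minute,
    }
```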
You inform your executive leadership team about the downtime and revenue impact.
You reach out to the vendor to inform them of the revenue impact and of the goal breach from their perspective. You tell them that their service is already at 99.7% availability and that they cannot afford to be down for more than 5 minutes for the rest of the year without breaching the SLA contract they signed with your organization.
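The arithmetic behind a statement like this can be sketched as follows. The article does not state the vendor's contracted SLA target or how much downtime they had already accumulated, so the figures in the usage note are purely illustrative, and the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def allowed_downtime_minutes(sla_target: float,
                             period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Total downtime a service may accumulate in the period and still meet the SLA."""
    return period_minutes * (1.0 - sla_target)


def remaining_downtime_minutes(sla_target: float,
                               downtime_so_far_minutes: float,
                               period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Downtime still available before the SLA is breached (negative if already breached)."""
    return allowed_downtime_minutes(sla_target, period_minutes) - downtime_so_far_minutes
```

For example, an annual target of 99.7% allows about 1,576.8 minutes of downtime per year, so a vendor that had already been down for 1,571.8 minutes would have roughly 5 minutes left.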
You then go back to the drawing board with the team and look for opportunities to enhance monitoring that could have helped you discover this issue before the business informed you.
You find that a certain error logged by your own payments application was not part of your dashboards or alerts. You add it, to reduce downtime if this issue recurs.
As you can see, to get through this business day successfully, you need the following skills on the team:
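In practice this kind of alert would be configured in your log aggregation or monitoring tool, but the underlying logic is simple enough to sketch. The class below fires when a given error string appears a certain number of times within a sliding time window. The class name, the threshold, the window size and the error string `PAYMENT_GATEWAY_TIMEOUT` are all hypothetical; the article does not say which error was being logged.

```python
from collections import deque
from datetime import datetime, timedelta


class ErrorRateAlert:
    """Fire when `pattern` is seen at least `threshold` times within `window`."""

    def __init__(self, pattern: str, threshold: int, window: timedelta):
        self.pattern = pattern
        self.threshold = threshold
        self.window = window
        self._hits = deque()  # timestamps of matching log lines

    def observe(self, timestamp: datetime, log_line: str) -> bool:
        """Record one log line and return True if the alert should fire."""
        if self.pattern in log_line:
            self._hits.append(timestamp)
        # Drop matches that have fallen out of the sliding window.
        while self._hits and timestamp - self._hits[0] > self.window:
            self._hits.popleft()
        return len(self._hits) >= self.threshold
```

Feeding each payments-application log line through `observe` lets the team be notified after a handful of matching errors, rather than waiting for the business to report a conversion drop.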
- Deep understanding of E2E ecosystem
- Knowledge of log aggregation tools
- Knowledge of front-end real-time user monitoring tools
- Effective incident management process to perform quick root cause analysis
- Effective executive communication
- Defining and implementing SLAs with vendors
- Post-mortem analysis process to fix gaps internally
This is just an example of one day in the life of one SRE. This is a unique role where you can create an amazing impact in any organization. Regardless of your current IT skills, with the right training and mentorship you will be able to play an expanded role on your current team, even with a basic understanding of SRE principles and toolsets. This will help you stand out on any team, whether QA, Development or Operations. It will also help you decide if you want to become a full-fledged reliability engineer, a role with one of the highest job satisfaction rates in the IT industry (3.8 according to Payscale).