The Rise of the Site Reliability Engineer
Cloud, SRE

July 13, 2021
Written by Firas Sozan
2 minute read

The meaning of Site Reliability Engineering and the role of Site Reliability Engineers have evolved over the last 15 years, incorporating new aspects as technology and consumer demands advanced. Back in 2004, however, SRE, as described by its creator Benjamin Treynor Sloss, was “what happens when you treat operations as a software problem and you staff it with software engineers.”

Coming from a background in software engineering, Treynor Sloss designed and managed the first-ever team of SREs to work the way he would have worked himself as an SRE. His premise was that engineers who develop a system could also be SREs on that system. He thus set the foundational principle of Site Reliability Engineering: apply a software engineering mindset to system administration.

Fundamentally, Site Reliability Engineering is a set of principles and practices. It brought a new way of thinking about and approaching traditional software production. At first, SRE was not a role; it described something you do rather than something you are. Even today, there is no all-encompassing definition of SRE.

Yet, the methodology boils down to one simple tenet: automation. 

Site Reliability Engineer and Automation

Four years after the practice got its start at Google, the DevOps philosophy emerged from the same foundational principles of Site Reliability Engineering. Some might say that DevOps is a proliferation of some of the core SRE principles to a larger spectrum of organizations, while others might view SRE as a specific implementation of the DevOps methodology. 

Regardless of your viewpoint, Site Reliability Engineering bridges the gap between development and operations by applying software engineering best practices to system administration.

As Benjamin Treynor Sloss pointed out, “In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.” 

From a team of seven engineers at Google, SRE has evolved over the last decade into a global community. The initial success of Treynor Sloss’s team prompted other big names in the tech world, such as Amazon and Netflix, to adopt Google’s practices and add their own to the mix.

The current state of the method is the result of the innovative and creative minds that made up those first SRE teams. By pushing the limits of what was possible, Site Reliability Engineering advanced the state of the art and evolved from a homegrown approach into a globally embraced part of computer science.

Today, there are over 90,000 SRE job openings across the United States. Since 2004, Site Reliability Engineering has earned its place as a leading practice for service reliability. So, why has SRE garnered so much attention, and why does it continue to do so? In an interview for the DevOps Institute, Jennifer Petoff argues that “SRE is really gaining in popularity because an SRE practice is really built on some simple foundational principles that are, at their core, [...] very rational. [...] SRE is providing a way to align incentives of different functions, so development, operations, and the business. Instead of working across purposes, SRE uses service level objectives tied to the customer experience as a common goal, and a way to determine how priorities need to shift based on various circumstances.”
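
To make the idea of a service level objective as a shared goal more concrete, here is a minimal Python sketch. It is an illustration only, not something taken from the interview: the names `SloReport` and `evaluate_slo`, the request-based availability measure, and the notion of an "error budget" (the share of failures the objective still tolerates) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    """Hypothetical summary of how a service is doing against its objective."""
    availability: float      # fraction of requests that succeeded
    budget_remaining: float  # fraction of the allowed failures not yet used
    slo_met: bool            # True if availability is at or above the target


def evaluate_slo(total_requests: int, failed_requests: int,
                 slo_target: float) -> SloReport:
    """Compare observed request failures against an availability target.

    slo_target is a fraction strictly between 0 and 1, e.g. 0.999.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    availability = (total_requests - failed_requests) / total_requests
    # Number of failures the objective tolerates over this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    # Negative values mean the budget has been overspent.
    budget_remaining = 1.0 - failed_requests / allowed_failures

    return SloReport(
        availability=availability,
        budget_remaining=budget_remaining,
        slo_met=availability >= slo_target,
    )


if __name__ == "__main__":
    report = evaluate_slo(total_requests=1_000_000,
                          failed_requests=250,
                          slo_target=0.999)
    print(report)
```

Under these assumptions, a team could read a healthy remaining budget as room to ship new features, and an overspent one as a signal to shift priorities toward reliability work.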

Site Reliability Engineering was born out of the necessity for more reliable, efficient, and scalable solutions, and it delivered on its promise. In today’s technological environment, applications are provided as a service; therefore, the service has now become the product. This shift has naturally accentuated the need for reliability.

In addition, business executives understand that the ability to deliver new and reliable features to the customer faster has a direct impact on profitability. We live in a technology-driven world, where consumer dependency on, and expectations of, digital services have increased significantly. For companies to maintain a competitive position, the implementation of SRE practices is no longer optional – it is essential.

The Future of Site Reliability Engineering

Some argue that the role of Site Reliability Engineers will continue to grow in close connection with DevOps practices, in that it will better describe the work to be done. Benjamin Treynor Sloss says there are countless opportunities for growth and that every innovative mind in SRE will shape its future. However, to allow this innovation to happen, organizations should embrace the SRE mindset at a cultural and organizational level.

