DevOps & SRE: competitors or companions?
Not long ago, automation in IT was a fairly new concept. Most IT operations, especially administrative jobs, were carried out manually. Automation started gaining momentum in the middle of the last decade and eventually led to the idea of continuous everything. This improved collaboration between dev and ops, which was aimed at continuous integration and continuous deployment backed by agile practices. So, to achieve seamless product deployment and delivery, DevOps gradually became an integral part of IT organizations.
While DevOps tries to close the gap between dev and ops, it is important to understand that the main motives of each side remain the same. The dev team focuses on creating value in the product and pushing it to production faster. The ops team focuses on the stability of the environments through better management of software and hardware. Dev wants faster releases while ops wants stable production, and this is where Site Reliability Engineering (SRE) came into the picture. The term was coined by Benjamin Treynor Sloss, who founded Google’s Site Reliability team. According to Google, the SRE team is not only responsible for the stability of the production environment but is also committed to new application features and operational improvement at the same time. At first, different ratios of dev and ops members were tried when forming SRE teams. While the exact proportion of dev and ops in an SRE team can be debated, I believe it should vary based on the project's needs.
DevOps and SRE dominate the software development world, but they also tend to confuse people because they overlap in some aspects. Both focus on automation and monitoring to reduce the time between a developer committing a change and that change being deployed to production. According to Google, SRE and DevOps are not so different from each other. Just like DevOps, SRE is about bringing dev and ops teams together to increase production speed while also increasing the visibility of the entire application life cycle. If you see DevOps as a philosophy, then SRE is a way of accomplishing that philosophy. They are not two methods competing against each other; rather, one is a cultural change while the other is a practice that complements that change in one way or another. You can look at them as teams that work toward breaking down organizational barriers to deliver superior software faster.
So, what exactly is SRE?
To start with its composition, the SRE team is also made up of dev and ops. However, in SRE the ops side has more of a coding core, so it naturally tries to codify every aspect of operations. Once the codified monitoring measures are in place, a consolidated method is developed to calculate reliability at every stage. SREs are responsible for measuring SLIs and SLOs. SLI stands for Service Level Indicator, a quantitative measure of some aspect of the level of service that is provided. SLO stands for Service Level Objective, a target value or range for that indicator. In simple terms, SRE is a software engineering approach to IT operations that takes up the tasks traditionally managed by operations teams. Manual tasks once carried out by operations teams are assigned to engineers who use software and automation to resolve issues. These engineers are also responsible for managing production systems. SRE teams use software as a tool to maximize their problem-solving ability, manage their systems, and automate operational tasks.
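To make the SLI and SLO distinction concrete, here is a minimal Python sketch. It assumes an availability SLI defined as the fraction of successful requests in a measurement window. The `RequestWindow` structure, the function names, the sample counts, and the 99.9% target are all hypothetical and chosen only for illustration; a real team would pick indicators and targets that fit its own service.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts over one measurement window (hypothetical structure)."""
    total: int
    successful: int


def availability_sli(window: RequestWindow) -> float:
    """SLI: the measured fraction of requests that succeeded."""
    if window.total == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return window.successful / window.total


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO check: is the measured indicator at or above the target?"""
    return sli >= slo_target


# Illustrative numbers only: 9,985 good requests out of 10,000.
window = RequestWindow(total=10_000, successful=9_985)
sli = availability_sli(window)
print(f"SLI = {sli:.4f}, SLO met: {meets_slo(sli, slo_target=0.999)}")
```

With these sample numbers the measured SLI falls short of the 99.9% target, so the check reports that the SLO is not met. The point of the sketch is only that the SLI is what you measure and the SLO is the threshold you compare it against.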
More reliability means more immunity against failures. However, SRE is more about accepting these failures. By accepting failures and measuring everything in the software development life cycle, I think SRE makes organizations focus more on reliability. SREs measure SLIs and SLOs, whereas DevOps measures failure and success rates with the help of various tools and methods. Reliability relates not only to the infrastructure but also to the quality, performance, and security of your application. To measure this reliability, having reliable data is very important. That data can include the implementation stack and bytecode, the full variable state across the source code, JVM state comprising threads and environment variables, relevant DEBUG- and TRACE-level log statements from production, and event analytics in terms of frequency, failure rate, deployment, and application. You can also use methods such as alerts for various scenarios, peer code review, and unit tests to make your data reliable as well as actionable.
Let’s see how DevOps and SRE complement or differ from each other, and how SRE can work within the DevOps paradigm.
DevOps and SRE
DevOps is about “what” needs to be done; SRE is about “how” to do it. The similarities and differences between DevOps and SRE can be explained based on the top five pillars in DevOps.