Site Reliability Engineering for Cloud-Native Operations

Developers want to ship changes as quickly as they can, while operations teams worry that those changes will break things. To reconcile these two drives, Google pioneered site reliability engineering (SRE), an emerging practice for maintaining complex computing systems that need to run with high reliability. As Ben Treynor, the founder of Google’s SRE team, put it: SRE is “what happens when a software engineer is tasked with what used to be called operations.”

SRE dates back to 2003, when Treynor joined Google to manage a team of engineers running a production environment. The practice proved to be a success, and the company now has 1,500 engineers working in SRE. Apple, Oracle, Microsoft, Twitter, Dropbox, IBM, and Amazon have all implemented their own SRE teams as well.

Read more at The New Stack