Larger teams typically have Site Reliability Engineers (SREs) to guide them toward best practices in operations, monitoring, scaling, and related areas. But what do you do when your team isn’t big enough for a full-time SRE? How do you manage operations without becoming a site reliability expert yourself?
In this talk, you’ll learn best practices for site reliability engineering, along with specific tools and methods for applying them.
This talk is intended for Python developers with limited or no experience in site reliability, operations, or DevOps.
I’ll share lessons and recommendations from the dataquest.io team, where we manage complex infrastructure and maintain 99.99% uptime, all with a small team and no SRE!
Talk Outline
Introduction
Best practices
Specific tools and methods
After this talk, you’ll be able to bring site reliability best practices to your small team, helping you keep your infrastructure maintainable and setting your team up for future success, all without becoming a site reliability expert yourself!