Site reliability engineering when your team is too small for a full-time SRE

Larger teams typically have Site Reliability Engineers who guide the team toward best practices for operations, monitoring, scaling, and more. But what do you do when your team isn’t big enough for a full-time SRE? How do you manage operations without becoming a site reliability expert yourself?

In this talk, you’ll learn best practices for site reliability engineering, along with specific tools and methods for applying them.

This talk is intended for Python developers with limited or no experience in site reliability or operations/DevOps.

I’ll share lessons and recommendations from the dataquest.io team, where we manage complex infrastructure and maintain 99.99% uptime, all with a small team and no SRE!

Talk Outline

Introduction

Who am I, and why should you listen to me on this topic?
What is SRE and why should you care?
What’s the difference between SRE and DevOps?
When should you think about hiring a full-time SRE?

Best practices

Infrastructure as code
Delivery & release automation (and why this matters for reliability)
Scaling & capacity planning
Monitoring & alerting
Incident response

Specific tools and methods

Dive into specific tools and methods my team uses
When should you just write a Python script vs. learn a tool designed for the job?
What should you automate vs. just do once?
Should you use a microservice architecture?

After this talk, you’ll be able to bring site reliability best practices to your small team, keeping your infrastructure maintainable and setting you up for future success, without having to become an expert on site reliability!

Friday, May 26

16:45 - 17:15

Darla Magdalene Shockley
