Talk

Site reliability engineering when your team is too small for a full-time SRE

Friday, May 26

16:45 - 17:15
Room: Panino
Language: English
Audience level: Beginner
Elevator pitch

Larger teams typically have Site Reliability Engineers to guide them toward best practices in operations, monitoring, scaling, and more. But what do you do when your team isn’t big enough for a full-time SRE? How do you manage operations without becoming a site reliability expert yourself?

Abstract

In this talk, you’ll learn best practices for site reliability engineering, along with specific tools and methods for applying them.

This talk is intended for Python developers with limited or no experience in site reliability or operations/DevOps.

I’ll share lessons and recommendations from the dataquest.io team, where we manage complex infrastructure and maintain 99.99% uptime, all with a small team and no SRE!

Talk Outline

Introduction

  • Who am I, and why should you listen to me on this topic?
  • What is SRE and why should you care?
  • What’s the difference between SRE and DevOps?
  • When should you think about hiring a full-time SRE?

Best practices

  • Infrastructure as code
  • Delivery & release automation (and why this matters for reliability)
  • Scaling & capacity planning
  • Monitoring & alerting
  • Incident response

Specific tools and methods

  • Dive into specific tools and methods my team uses
  • When should you just write a Python script, vs. figuring out how to use a tool designed for the job?
  • What should you automate, vs. just doing it once?
  • Should you use a microservice architecture?

After this talk, you’ll be able to bring site reliability best practices to your small team, keeping your infrastructure maintainable and setting yourself up for future success, all without having to become an expert on site reliability!

Tags: Best Practice, Infrastructure, Operations, DevOps
Darla Magdalene Shockley

Darla Shockley has worked in software engineering (and with Python) for over a decade, in a variety of roles: full-stack, backend, data engineering, site reliability engineering, and most recently, management. She currently works as an engineering manager at fly.io.