Operating a Large, Distributed System in a Reliable Way


"This article is a collection of the practices I've found useful for reliably operating a large system at Uber. My experience is not unique - people working on similarly sized systems go through a similar journey. I've talked with engineers at Google, Facebook, and Netflix, who shared similar experiences and solutions. Many of the ideas and processes listed here should apply to systems of similar scale, whether they run in their own data centers (as Uber mostly does) or in the cloud (where Uber sometimes scales out). However, the practices might be overkill for smaller or less mission-critical systems."


There's much ground to cover:

  • Monitoring
  • Oncall, Anomaly Detection & Alerting
  • Outages & Incident Management Processes
  • Postmortems, Incident Reviews & a Culture of Ongoing Improvements
  • Failover Drills, Capacity Planning & Blackbox Testing
  • SLOs, SLAs & Reporting on Them
  • SRE as an Independent Team
  • Reliability as an Ongoing Investment
  • Further Recommended Reading
