Operating a Large, Distributed System in a Reliable Way

"The article is a collection of the practices I've found useful for reliably operating a large system at Uber, while working here. My experience is not unique - people working on similarly sized systems go through a similar journey. I've talked with engineers at Google, Facebook, and Netflix, who shared similar experiences and solutions. Many of the ideas and processes listed here should apply to systems of similar scale, regardless of whether they run in their own data centers (like Uber mostly does) or in the cloud (where Uber sometimes scales to). However, the practices might be overkill for smaller or less mission-critical systems."

There's much ground to cover:

Monitoring
Oncall, Anomaly Detection & Alerting
Outages & Incident Management Processes
Postmortems, Incident Reviews & a Culture of Ongoing Improvements
Failover Drills, Capacity Planning & Blackbox Testing
SLOs, SLAs & Reporting on Them
SRE as an Independent Team
Reliability as an Ongoing Investment
Further Recommended Reading

