This is primarily a notebook to register the content I find worth saving and sharing.
This is not a blog full of my well-structured clever thoughts. However, if you find any, let me know =)
Operating a Large, Distributed System in a Reliable Way
"The article is the collection of the practices I've found useful to reliably operate a large system at Uber, while working here. My experience is not unique - people working on similar sized systems go through a similar journey. I've talked with engineers at Google, Facebook, and Netflix, who shared similar experiences and solutions. Many of the ideas and processes listed here should apply to systems of similar scale, regardless of running on own data centers (like Uber mostly does) or on the cloud (where Uber sometimes scales to). However, the practices might be an overkill for smaller or less mission-critical systems."
There's much ground to cover:
Monitoring
Oncall, Anomaly Detection & Alerting
Outages & Incident Management Processes
Postmortems, Incident Reviews & a Culture of Ongoing Improvements