A curated list of awesome Site Reliability and Production Engineering resources.
Contributions are always welcome! Please read the contribution guidelines before submitting.
- Culture
- Education
- Books
- Hiring
- Reliability
- Alerting
- Monitoring
- On-Call
- Post-Mortem
- Capacity Planning
- Service Level Agreement
- Performance
- Articles
- Blogs
- Conferences
- What is Site Reliability Engineering?
- Keys To SRE by Ben Treynor
- Google SRE Resources
- Notes from Production Engineering by Pedro Canahuati
- PostOps: Recovery from Operations
- Love DevOps? Wait 'till you meet SRE
- How Google Does Planet-Scale Engineering for Planet-Scale Infra
- Site Reliability Engineering at Facebook
- A History of Site Reliability Engineering at Uber
- Case Study: Adopting SRE Principles at StackOverflow
- Site Reliability Engineering at Dropbox
- Site Reliability Engineers — Keeping Google up and running 24/7
- Site Reliability Engineering at Salesforce
- From Sys Admin to Netflix SRE
- SRE@Google: Thousands of DevOps Since 2004
- Transactional System Administration Is Killing Us and Must be Stopped
- Maslow's hierarchy of SRE needs
- PostOps: A Non-Surgical Tale of Software, Fragility, and Reliability
- Engineering Reliability into Web Sites: Google SRE
- From SysAdmin to Netflix SRE
- SRE: An incomplete guide to cultural Narnia
- Panel: Educating SRE
- From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams
- New to an SRE team?
- Site Reliability Engineering: How Google Runs Production Systems
- The Practice Of Cloud System Administration: Designing and Operating Large Distributed Systems
- Fail at Scale by Ben Maurer
- Embracing Failure: Fault-Injection and Service Reliability
- 10 Years of Crashing Google
- How we break things at Twitter: failure testing
- Reliable Cron across the Planet
- Push our limits - reliability testing at Twitter
- The Verification of a Distributed System by Caitie McCaffrey
- Weathering the Unexpected
- The Remediation Ballet
- A Working Theory-of-Monitoring
- The Evolution of Monitoring Systems at Google - Tony Rippy
- Monitoring without Infrastructure @ Airbnb
- Being an On-Call Engineer: A Google SRE Perspective
- Inside Atlassian: how our site reliability engineers do incident management
- Inside Atlassian: how IT & SRE use ChatOps to run incident management
- Incident Response at Heroku
- Add your favorite resources
- SLA Aware Maintenance for Operators - Joe Smith
- If It's in the Cloud, Get It on Paper: Cloud Computing Contract Issues
- Service Level Agreements in the Cloud: Who cares?