On-Call In Action: Site Reliability Engineering Best Practices for Building Resilient Systems

by Huynh, Quan

$16.05

List Price: $19.99
Save: $3.94 (19%)
  • In Stock - Ships in 24 hours with free online tracking.
  • FREE DELIVERY by Wednesday, July 23, 2025

Description

In today's "always-on" world, downtime is not an option. Your users expect seamless service, 24/7. Your business depends on it. But how do you guarantee that reliability when complex systems inevitably encounter turbulence? The answer lies in a world-class on-call capability.

"On-Call In Action" is your practical playbook for building just that. This isn't just another theoretical tome; it's a hands-on guide to navigating the high-stakes reality of modern on-call. We'll equip you with the SRE principles, incident management lifecycles, and effective alerting strategies (leveraging the Versus Incident project as our real-world example) that form the backbone of resilient operations.

It is also a friendly guide to making on-call work better. We'll show you:

  • Why being on-call is so important.
  • What to do when a problem (we call it an "incident") happens.
  • How to set up good alerts so you only get called for big problems. We'll even show you how with a free tool called "Versus Incident."
  • How to check if your services are running well (using simple goals).
  • How to learn from mistakes without blaming anyone, so things get better.
  • How to make good on-call schedules so people don't get too tired.
  • How to create a supportive team for on-call work.

Stop just reacting to problems and start engineering reliability. Whether you're a tech person who is on-call, a manager, or just curious, this book will give you clear advice and real examples. We want to help you build an on-call system that keeps your services running and your team feeling good.

This book contains 11 chapters:

  • Chapter 1: Foundations: Why On-Call Matters & SRE Principles
  • Chapter 2: Anatomy of an Incident: The Management Lifecycle
  • Chapter 3: Effective Alerting: Strategy and Routing Using Versus Incident
  • Chapter 4: Integrating Monitoring Sources and Escalation Policies: A Case Study
  • Chapter 5: Measuring Reliability: SLIs, SLOs, and Error Budgets
  • Chapter 6: Putting It All Together: Practical Examples of Unified Alerting & Templating
  • Chapter 7: Learning from Failure: Blameless Postmortems
  • Chapter 8: Sustainable On-Call: Scheduling and Managing Burnout
  • Chapter 9: Effective Incident
  • Chapter 10: The On-Call Ecosystem: Tooling and Future Trends
  • Chapter 11: On-Call in Action: Digital Customer Onboarding in Banking

Product Details

  • Pub Date: May 14, 2025
  • ISBN-10: 9798283556314
  • ISBN-13: 9798283556314
  • Language: English