What is Cloud Resiliency, Really?
Summary
Carl and Brandon break down the core concepts of cloud resiliency, availability, reliability, and redundancy: how they relate, where they differ, and why those distinctions matter. Just because a service is “always on” doesn’t mean it’s resilient. They explore the difference between planned and unplanned outages, how graceful degradation works in practice, and why resiliency is measured by recovery, not just uptime.
It’s not just about uptime. It’s about what breaks, how you recover, and what keeps going when everything else doesn’t.
They also cover the architectural side: distributed systems, zone-aware deployments, chaos testing, and recovery strategies that go beyond documentation. With real-world failure scenarios and practical planning advice, this episode helps cloud teams build for failure — before it happens.
Links
- AWS | Failover with AWS
- AWS | Well-Architected Framework: Reliability Pillar
- Azure | Reliability design principles
- Azure | Resiliency Overview
- Azure | Well-Architected Framework: Reliability Pillar
- Google Cloud | Architecture Framework: Reliability Pillar
- Google Cloud | Patterns for scalable and resilient apps
- Google Cloud | Site Reliability Engineering (SRE) Book
- principlesofchaos.org | Principles of Chaos Engineering