Disaster Recovery Drills: Simulating Major Provider Outages in Preprod Without Breaking CI
A technical playbook to simulate CDN and cloud provider outages safely in preprod—DNS failover, traffic shaping, canary rollbacks, and CI-safe patterns.
Stop hoping your staging is lucky — run provider outage simulations safely in preprod
If a major CDN, cloud region, or SaaS provider goes down on a Friday, you want your team to be calm — not scrambling. The fastest way to build that calm is to practice realistic provider outage scenarios inside preprod, without breaking CI or impacting production traffic. This playbook gives a step-by-step, safe, and repeatable approach for running provider outage simulations in 2026: DNS failover, traffic shaping, canary rollbacks, and the safety gates you need to keep risk zero-to-minimal. For architecture patterns and design guidance to survive multi-provider failures, see Building Resilient Architectures: Design Patterns to Survive Multi-Provider Failures.
Executive summary — what you can achieve in a single drill
The recommended drill flow below is optimized for staging environments that mirror production networking and deployment topology. In 60–180 minutes you can:
- Simulate a CDN/edge outage (e.g., Cloudflare-like failure) by bypassing the CDN for a controlled percentage of traffic. Techniques for edge-level testing and image delivery are discussed in resources like Serving Responsive JPEGs for Edge CDN and Cloud Gaming.
- Simulate a cloud provider region or availability zone outage (AWS-style) via DNS weight shifts and Route53 failover records in preprod.
- Validate canary rollback behavior using traffic shifting (Istio/Envoy/Service Mesh) and automated rollback triggers in your CD system (ArgoCD/Flux/Spinnaker). These CI/CD governance patterns are covered in depth in From Micro-App to Production: CI/CD and Governance for LLM-Built Tools.
- Keep CI safe with pipeline gating, ephemeral preprod instances, and blast-radius controls. Developer productivity and multisite governance topics intersect here — see Developer Productivity and Cost Signals in 2026.
Why this matters in 2026
Incidents in late 2025 and early 2026 (widespread CDN and social platform outages) reinforced two trends: 1) outages are more often multi-provider and cross-layer (edge + control plane), and 2) resilience testing must be automated and frequent. Teams are adopting chaos-as-code, eBPF-based observability, and AI-driven anomaly detection for faster recovery. This playbook reflects those trends: it’s automated, auditable, and designed to be integrated into modern CI/CD pipelines. For observability patterns and subscription health practices, consult Observability in 2026.
Preconditions: What your environment must have
Before you start drilling, ensure the following:
- Isolated preprod: clear separation from production networking, secrets, and databases (sanitized data only).
- DNS control in preprod: a domain or subdomain you own and can change records for (e.g., preprod.example.com) and programmatic access to your DNS provider API (Route53, Cloudflare DNS, NS1, etc.).
- Service mesh or traffic-router: Istio/Envoy, Linkerd, AWS App Mesh, or a load balancer that supports weighted routing for canaries.
- CD tool with rollback automation: ArgoCD, Flux, Spinnaker, or GitHub Actions + scripts that can rollback on metric thresholds. Governance for these flows is discussed in CI/CD governance guides.
- Observability & alerting: Prometheus, Grafana, Datadog, or New Relic; synthetic tests (k6); and SLA/SLO definitions for the drill. See deeper observability patterns in Observability in 2026.
High-level playbook (safe-by-design)
The playbook is arranged into four stages. Each stage contains safety checks so you never accidentally touch production.
1) Planning and guardrails (15–30 minutes)
- Define the objective: e.g., "Validate our CDN bypass strategy and origin scalability for 20% of preprod traffic."
- Set blast-radius: limit test to IP CIDR ranges and a percentage of traffic (start 1–5%).
- Schedule a maintenance window and notify stakeholders and automation (Slack, PagerDuty). Use an auditable ticket and store the run metadata in Git.
- Ensure production isolation by asserting DNS target is a preprod-only domain and that any API keys used are preprod-scoped.
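The isolation assertion in the last step is easy to automate as a hard gate. A minimal sketch, assuming the (hypothetical) convention that every drill target lives under preprod.example.com; adapt the suffix to your own naming scheme:

```shell
#!/usr/bin/env bash
# Guard: refuse to run a drill step against anything that is not preprod.
set -euo pipefail

assert_preprod_target() {
  local target="$1"
  case "$target" in
    preprod.example.com|*.preprod.example.com)
      echo "OK: $target is preprod-scoped"
      ;;
    *)
      echo "ABORT: $target is not a preprod domain" >&2
      return 1
      ;;
  esac
}

# Call this before any DNS or load-balancer change in the drill pipeline.
assert_preprod_target "origin.preprod.example.com"
```

Run it as the first step of every drill job so a mistyped production hostname fails fast instead of mutating real records.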
2) Environment readiness checks (10–20 minutes)
- Run smoke tests against preprod endpoints (health endpoints, DB connectivity, feature flags loaded).
- Verify observability: dashboards render, alerting rules ready, and synthetic traffic script (k6) is available.
- Verify CD rollback hook registered: e.g., ArgoCD has a health check metric and a rollback job that can be invoked or auto-triggered by Metrics Adapter.
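The readiness checks above can be collected into one gate script that CI runs before any drill step. A sketch; the commented paths (scripts/rollback_cdn_bypass.sh, tests/k6/cdn_bypass.js, thresholds.json) reuse the asset names from this playbook and are illustrative:

```shell
#!/usr/bin/env bash
# Pre-drill readiness gate (sketch): each check prints PASS or FAIL, and the
# script exits nonzero if anything is missing so CI can block the drill.
set -euo pipefail

fail=0
check() {
  local desc="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $desc"
  else
    echo "FAIL: $desc"
    fail=1
  fi
}

# Demo check that passes anywhere; swap in your real drill assets, e.g.:
#   check "rollback script executable" test -x scripts/rollback_cdn_bypass.sh
#   check "k6 scenario present"        test -f tests/k6/cdn_bypass.js
#   check "thresholds defined"         test -f thresholds.json
#   check "preprod health endpoint"    curl -fsS https://preprod.example.com/health
check "working directory readable" test -r .

[ "$fail" -eq 0 ] || exit 1
```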
3) Controlled outage simulation (30–90 minutes)
This is the heart of the playbook. We give two common scenarios and the exact controls to use.
Scenario A — CDN/edge outage (Cloudflare-style)
Goal: Ensure origin can handle a bypass and that security controls and WAF rules still work. Approach: route a percentage of traffic around the CDN and to the origin.
- Use a preprod subdomain, e.g., origin.preprod.example.com, that resolves directly to origin IPs.
- Create a weighted split at DNS or edge: use Cloudflare Load Balancer or your DNS provider’s traffic steering to send 80% of traffic down the CDN path and 20% directly to origin. For providers without DNS-level weighting, place a traffic-splitting reverse proxy (Envoy) in front of the CDN endpoint for preprod only.
- Start synthetic traffic with k6 to the duplicate origin route. Run at ramped rates: 1 minute at 1 rps, 5 minutes at 10 rps, etc.
- Monitor origin latency, error rate (5xx), CPU, and request queue sizes. Alert thresholds must be pre-configured to trigger rollback or stop the test. Caching strategies and cache ops matter greatly in origin-scale scenarios.
- If origin metrics breach thresholds, trigger the rollback plan: abort traffic bypass and revert DNS/weights to use CDN-only paths. Log the timeline and root cause artifacts.
# Example: Cloudflare API call to create a load balancer (simplified)
# Note: weighted distribution is expressed via "random" steering, which
# honors the weights configured on each pool.
curl -X POST "https://api.cloudflare.com/client/v4/zones/{zone_id}/load_balancers" \
  -H "Authorization: Bearer $CF_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{
    "description": "preprod cdn bypass",
    "name": "lb-origin-bypass.preprod.example.com",
    "enabled": true,
    "default_pools": ["pool-cdn", "pool-origin-bypass"],
    "fallback_pool": "pool-origin-bypass",
    "pop_pools": {},
    "steering_policy": "random"
  }'
Scenario B — Cloud provider region outage (AWS-style)
Goal: Validate Route53 failover and cross-region origin behavior without touching prod. Approach: shift DNS weights or use secondary region endpoints under preprod domain.
- Set up two isolated preprod clusters in different regions: preprod-us-east-1 and preprod-us-west-2 using identical manifests and sanitized data.
- Create Route53 weighted records for preprod.example.com with weight 90 -> us-east, 10 -> us-west. For failover tests, you’ll move 100% to us-west or mark the us-east record as "unhealthy" in Route53 health checks.
- Run traffic generator to apply load against preprod.example.com while gradually shifting weight: 90/10 -> 50/50 -> 0/100. Monitor application behavior under failover. These multi-region patterns are described alongside governance practices in developer productivity and governance analyses.
- Verify session handling, sticky sessions if any, cache invalidation, cross-region replication, and RDS/DB failover mechanics (if any) using mocks or read replicas only in preprod.
# Example: AWS CLI to change weights (simplified)
aws route53 change-resource-record-sets --hosted-zone-id Z12345 \
  --change-batch '{
    "Comment": "Shift preprod traffic",
    "Changes": [
      {"Action":"UPSERT","ResourceRecordSet":{
        "Name":"preprod.example.com.","Type":"A","SetIdentifier":"us-east",
        "Weight":0,"TTL":60,"ResourceRecords":[{"Value":"203.0.113.10"}]
      }},
      {"Action":"UPSERT","ResourceRecordSet":{
        "Name":"preprod.example.com.","Type":"A","SetIdentifier":"us-west",
        "Weight":100,"TTL":60,"ResourceRecords":[{"Value":"198.51.100.20"}]
      }}
    ]
  }'
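The 90/10 -> 50/50 -> 0/100 ramp can be scripted around the same command. A sketch that reuses the placeholder zone ID and IPs; it only generates the change batches so they can be reviewed (for example in a GitOps PR), with the live aws call left commented:

```shell
#!/usr/bin/env bash
# Sketch: ramp preprod traffic 90/10 -> 50/50 -> 0/100 in reviewable steps.
set -euo pipefail

# Emit a Route53 change batch setting the east/west record weights.
weight_batch() {
  local east="$1" west="$2"
  cat <<EOF
{
  "Comment": "Shift preprod traffic east=${east} west=${west}",
  "Changes": [
    {"Action":"UPSERT","ResourceRecordSet":{
      "Name":"preprod.example.com.","Type":"A","SetIdentifier":"us-east",
      "Weight":${east},"TTL":60,"ResourceRecords":[{"Value":"203.0.113.10"}]
    }},
    {"Action":"UPSERT","ResourceRecordSet":{
      "Name":"preprod.example.com.","Type":"A","SetIdentifier":"us-west",
      "Weight":${west},"TTL":60,"ResourceRecords":[{"Value":"198.51.100.20"}]
    }}
  ]
}
EOF
}

for step in "90 10" "50 50" "0 100"; do
  set -- $step
  weight_batch "$1" "$2" > "batch-$1-$2.json"
  # In a live drill, apply each batch, then pause to watch dashboards:
  # aws route53 change-resource-record-sets --hosted-zone-id Z12345 \
  #   --change-batch "file://batch-$1-$2.json"
done
```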
4) Post-drill analysis and CI-safe learnings (30–60 minutes)
- Gather logs, APM traces, spike charts, synthetic test results, and CD pipeline events. Store artifacts in a runbook for compliance/audit. Use structured runbooks and operations playbooks like Operations Playbook: Scaling Capture Ops as inspiration for storing artifacts and backlogs.
- Execute a blameless postmortem focusing on process gaps, runbook improvements, and fixes prioritized in the backlog.
- Automate frequent, smaller drills in CI: add a scheduled job in your pipeline to run a lightweight CDN-bypass test nightly against ephemeral preprod clusters. This approach pairs well with modern policy engines and observability platforms.
How to keep CI safe: gating and automation patterns
The top concern we hear from engineering teams: "How do we run these drills without impacting our CI or making developers wait?" The answer is a combination of isolation, feature branching, ephemeral environments, and policy gates. For governance detail on micro-app release flows see CI/CD and governance for micro-apps.
Policy & pipeline patterns
- Ephemeral preprod per branch: spin a mini preprod with Terraform and Kubernetes for each release branch. Use short TTLs (1–24 hours) and automated destroy-on-merge to avoid resource creep.
- Nonblocking canary jobs: run chaos drills as a separate pipeline that only uses preprod scopes. Do not attach disaster-drill jobs to production pipelines.
- Approval gates for larger drills: CI triggers larger drills only after an approver signs off. Use GitOps PRs to change DNS weights and require code review.
- Feature flags and tenant isolation: ensure real user traffic cannot reach preprod by using header-based routing, auth tokens, or IP whitelisting.
- Automatic rollback scripts: place rollback as the first recovery control in your pipeline. Make rollback a reversible git commit so changes are auditable and immediate.
Example CI step: start a safe CDN bypass
# GitHub Actions job (simplified)
jobs:
  cdn-bypass-drill:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Run preprod smoke tests
        run: ./scripts/smoke.sh --env preprod
      - name: Create preprod load balancer rule (Cloudflare)
        run: ./scripts/cloudflare/create_lb.sh --env preprod --weight-origin 20
      - name: Start k6 synthetic traffic
        run: k6 run --vus 50 --duration 5m tests/k6/cdn_bypass.js
      - name: Watch metrics and auto-rollback
        run: ./scripts/monitor_and_rollback.sh --thresholds thresholds.json
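A monitor-and-rollback script like the one invoked in the last step might look like the following sketch. The metric query and rollback command are placeholders to wire into your own Prometheus endpoint and rollback script:

```shell
#!/usr/bin/env bash
# Sketch of a monitor_and_rollback.sh decision step.
set -euo pipefail

ERROR_RATE_MAX="1.0"   # percent of 5xx responses that triggers rollback

# Pure numeric comparison so the decision logic is testable without a
# live metrics backend.
breached() {
  local current="$1" max="$2"
  awk -v c="$current" -v m="$max" 'BEGIN { exit (c > m) ? 0 : 1 }'
}

fetch_error_rate() {
  # Placeholder: replace with a real query, e.g. PromQL via curl + jq.
  echo "0.4"
}

rate="$(fetch_error_rate)"
if breached "$rate" "$ERROR_RATE_MAX"; then
  echo "5xx rate ${rate}% exceeds ${ERROR_RATE_MAX}%: rolling back"
  # ./scripts/rollback_cdn_bypass.sh
else
  echo "5xx rate ${rate}% within budget: drill continues"
fi
```

In practice, run this in a loop for the drill duration and exit nonzero on rollback so the pipeline records the abort.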
Traffic shaping techniques (practical recipes)
Traffic shaping lets you test degradations without fully failing a provider.
- DNS weighted routing — Good for cross-region tests. Pros: provider-agnostic. Cons: DNS TTLs and caching complicate timing.
- Edge load-balancer weights (Cloudflare Load Balancer, Fastly, NS1) — Best for CDN bypasses because these systems can steer at the HTTP level.
- Service mesh traffic-splitting — Best for canary rollbacks. Use Istio/Envoy to instantly shift live traffic weights with VirtualService manifests.
- Application header routing — Easiest for preprod-only scenarios: send test traffic with a header X-Drill: true and route it to the test path via ingress rules.
# Istio VirtualService weight example
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: webapp
spec:
  hosts: ["webapp.preprod.svc.cluster.local"]
  http:
    - route:
        - destination:
            host: webapp
            subset: stable
          weight: 90
        - destination:
            host: webapp
            subset: canary
          weight: 10
Monitoring & automated rollback — rules you must have
Manual drills are useful, but modern DR drills require automation. Define these core rules:
- Error rate threshold: e.g., >1% 5xx sustained for 2 minutes → immediate rollback.
- Latency threshold: p95 latency above 2x baseline sustained for 5 minutes → scale or rollback. Reducing latency and improving observability are part of a holistic resilience approach — see live stream conversion & latency guides for practical latency-reduction techniques.
- Infrastructure alarms: CPU > 90% or container OOM rate > 0.5% → stop test and revert weights.
- Health check gating: fail health probe → traffic re-route via DNS health checks or CD rollback hooks.
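The latency rule can be implemented with nothing more than sort and awk. A sketch using the nearest-rank p95 method and the 2x-baseline abort condition used in the runbook template; sample values are synthetic:

```shell
#!/usr/bin/env bash
# Sketch: nearest-rank p95 over a file of latency samples (ms, one per line).
set -euo pipefail

p95() {
  sort -n "$1" | awk '{ a[NR] = $1 } END { i = int(NR * 0.95); if (i < 1) i = 1; print a[i] }'
}

# Succeeds (exit 0) when current p95 exceeds baseline * multiple.
latency_breached() {
  local current="$1" baseline="$2" multiple="${3:-2}"
  awk -v c="$current" -v b="$baseline" -v m="$multiple" 'BEGIN { exit (c > b * m) ? 0 : 1 }'
}

# Demo on synthetic samples 1..100 ms.
samples="$(mktemp)"
seq 1 100 > "$samples"
echo "observed p95: $(p95 "$samples") ms"
```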
Chaos engineering tools and 2026 trends
In 2026, teams moved from ad-hoc chaos to declarative, policy-driven resilience tests. Consider these tools and patterns:
- Chaos-as-code frameworks (LitmusChaos, Chaos Mesh, Gremlin) integrated with GitOps for reproducible drills.
- eBPF-based fault injection for low-level network failure simulations without changing app code (use for latency, packet loss experiments in preprod only).
- AI-assisted anomaly detection that triggers automatic rollback earlier than static thresholds based on behavioral baselines. For autonomous agents and experimental orchestration, review work on autonomous orchestration and agent patterns: Benchmarking Autonomous Agents.
- Policy engines (Open Policy Agent) to restrict what a drill can change: e.g., disallow any DNS changes that point to production records. Indexing and policy manuals for edge operations are emerging — see Indexing Manuals for the Edge Era.
Runbook template (copy-paste friendly)
Use this minimal runbook for every drill and store it in Git.
Title: CDN bypass drill - preprod
Objective: Validate origin capacity when CDN is bypassed for 20% traffic
Start: 2026-01-20 15:00 UTC
Owner: SRE Team
Pre-conditions:
- preprod domain: preprod.example.com
- Observable dashboards ready
- Rollback script: scripts/rollback_cdn_bypass.sh
Steps:
1. Run smoke tests
2. Create load balancer weight: origin 20%
3. Start k6 script: tests/k6/cdn_bypass.js
4. Monitor for 15 minutes
Abort conditions:
- 5xx rate > 1% sustained 2m
- p95 latency > 2x baseline for 5m
Post-mortem: upload logs to /runs/cdn-bypass-YYYYMMDD
Common failure modes and mitigations
- DNS cache issues: use low TTLs in preprod and client-side cache-busting headers for synthetic traffic.
- Blocked by WAF/ACL: ensure preprod IP ranges are whitelisted or use preprod-only WAF rules.
- Stateful session problems: avoid stateful session patterns in preprod or use sticky session emulation with session replicas.
- Data leakage: always use sanitized data and encrypt secrets with Vault or AWS Secrets Manager scoped to preprod only.
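The cache-busting mitigation above can be as simple as a unique query token per synthetic request, sent alongside no-cache headers. A tiny sketch; the URL is illustrative:

```shell
#!/usr/bin/env bash
# Cache-busting helper for synthetic drill traffic (sketch): append a unique
# token so neither the CDN nor a client cache serves a stale response.
set -euo pipefail

bust() { printf '%s?drill=%s-%s\n' "$1" "$(date +%s)" "$RANDOM"; }

url="$(bust "https://preprod.example.com/health")"
echo "$url"
# Send with no-cache headers as well:
# curl -fsS -H 'Cache-Control: no-cache' "$url"
```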
"Do not run provider outage simulations against production DNS or using production credentials. The safest systems are auditable, reversible, and fully isolated."
Case study (concise): How one SaaS team avoided a Friday outage
In December 2025, a mid-market SaaS vendor ran a quarterly DR drill in preprod using the above approach. They simulated a Cloudflare edge failure for 10% of traffic and a region failover for their US cluster. The automated rollback triggered when origin CPU spiked during the third minute, preventing visible failures. Postmortem revealed a poorly configured connection pool on the origin; a 2-line fix improved failover capacity by 3x. The drill reduced their mean time to recover (MTTR) in outages from 48 minutes to under 12 minutes.
Actionable checklist to start your first safe drill (15 minutes to decision)
- Confirm you have a preprod domain and API access to DNS provider.
- Create an isolated preprod cluster(s) in 2 regions with sanitized data.
- Implement traffic-splitting in your ingress/service-mesh for canary traffic.
- Write a tiny rollback script and hook it into your CD tool.
- Schedule a 1-hour drill, notify stakeholders, and run a small (5–10%) CDN bypass test.
Final thoughts: make drills routine and blameless
Resilience is a muscle. In 2026 the most resilient teams are the ones that codified provider outage simulations into their release trains, combined them with AI-signal-based rollback, and kept drills auditable via GitOps. Start small, keep it reversible, and treat everything as code — your future on-call self will thank you.
Takeaways — what to implement this week
- Set up a preprod DNS and automate weighted records.
- Integrate a canary traffic split in your service mesh and add automated rollback hooks.
- Automate a minimal CDN-bypass drill in your CI with strict blast-radius controls.
- Store runbooks and results in Git for audit and continuous improvement.
Call to action
Ready to run your first provider outage drill without impacting CI? Download our preprod drill templates (DNS scripts, Istio manifests, k6 scenarios, and rollback playbooks) or schedule a 1:1 workshop with our SRE architects to instrument safe, automated resilience testing inside your preprod environment.
Related Reading
- Building Resilient Architectures: Design Patterns to Survive Multi-Provider Failures
- From Micro-App to Production: CI/CD and Governance for LLM-Built Tools
- Observability in 2026: Subscription Health, ETL, and Real‑Time SLOs for Cloud Teams
- Advanced Strategies: Serving Responsive JPEGs for Edge CDN and Cloud Gaming