Build S3 Failover Plans: Lessons from the Cloudflare and AWS Outages
Turn outage lessons into a concrete S3 + CDN failover playbook with Terraform templates, edge fallbacks, and a test plan for 2026.
If your website or web app depends on a single S3 origin and a single CDN, one upstream incident, like the simultaneous Cloudflare and AWS outages in early 2026, can turn an SLA commitment into a full outage. This playbook turns those painful lessons into a repeatable S3 + CDN failover plan you can implement, test, and automate with Terraform.
Why this matters in 2026
Two trends sharpen the urgency in 2026: the rise of edge-first architectures and the growing adoption of S3-compatible object stores beyond AWS (R2, GCS, DO Spaces). At the same time, multi-supplier complexity and tighter budgets push teams to consolidate without sacrificing resilience. Recent simultaneous outages exposed brittle single-vendor designs — the exact failure modes this guide addresses.
Executive summary — immediate actions
- Identify your critical assets: static site buckets, presigned URL flows, and API endpoints that depend on S3 object access.
- Deploy a low-cost secondary object store (S3-compatible) and configure automated sync from primary to secondary.
- Put your CDN in front as the arbiter: configure multi-origin load balancing and CDN fallback rules.
- Automate DNS failover and implement health checks. Prefer CDN-native load balancing over DNS-based failover where it's available.
- Run a documented test plan and weekly health exercises. Treat each failover test as a tabletop and a live drill.
Step-by-step S3 + CDN failover playbook
1) Discovery: inventory and risk profiling
Start by classifying every object and endpoint that matters for user experience and revenue. For each asset determine:
- RTO/RPO requirements
- Access pattern (static public, private presigned, high-churn uploads)
- Cacheability (cache-control, TTLs)
- Dependencies (Lambda@Edge, signed cookies, origin authentication)
Export S3 inventory (AWS S3 Inventory or object listing) and map it to CDN distributions and DNS records.
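If the buckets are managed with Terraform, the export itself can be automated. A minimal sketch using S3 Inventory, assuming the primary_site bucket from the Terraform section below and a hypothetical aws_s3_bucket.inventory_reports bucket to receive the reports:
resource "aws_s3_bucket_inventory" "primary_weekly" {
  bucket = aws_s3_bucket.primary_site.id
  name   = "weekly-inventory"

  # "Current" lists only the latest version of each key.
  included_object_versions = "Current"

  schedule {
    frequency = "Weekly"
  }

  destination {
    bucket {
      format     = "CSV"
      bucket_arn = aws_s3_bucket.inventory_reports.arn # assumed reports bucket
    }
  }
}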
2) Choose a secondary object target and replication method
Options in 2026:
- AWS cross-region replication for intra-AWS resiliency (fastest but single-cloud).
- Cross-cloud replication: tools such as rclone, MinIO's mc client, or vendor-managed replication services that sync to R2, GCS, or DO Spaces.
- Push artifacts into CI/CD pipelines so builds publish to multiple targets simultaneously (best for static sites generated at build time).
Recommendation: for web apps and static sites, combine CI-publish + continuous sync. CI-publish minimizes eventual consistency issues for new releases; continuous sync covers retroactive changes.
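For the first option above (intra-AWS cross-region replication), a minimal Terraform sketch looks like this. It assumes versioning is already enabled on both buckets, that a replica bucket aws_s3_bucket.dr_replica exists in another region, and that aws_iam_role.replication carries the standard S3 replication permissions:
resource "aws_s3_bucket_replication_configuration" "primary_to_replica" {
  # Versioning must already be enabled on source and destination buckets.
  bucket = aws_s3_bucket.primary_site.id
  role   = aws_iam_role.replication.arn # assumed replication IAM role

  rule {
    id     = "replicate-everything"
    status = "Enabled"

    filter {}

    delete_marker_replication {
      status = "Disabled"
    }

    destination {
      bucket        = aws_s3_bucket.dr_replica.arn # assumed replica bucket in another region
      storage_class = "STANDARD"
    }
  }
}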
3) Put the CDN in charge: multi-origin and intelligent failover
Modern CDNs (Cloudflare, Fastly, Akamai) can accept multiple origins and perform active health checks. This is the most reliable failover mechanism because it avoids DNS propagation delays and lets the CDN short-circuit origin failures.
Deployment pattern:
- Primary origin: AWS S3 bucket (or CloudFront origin pointing at S3).
- Secondary origin: S3-compatible endpoint (R2, GCS, DO Spaces) with the same object key structure.
- Configure origin priorities and health checks in the CDN. Set short health-check intervals during drills and longer TTLs in steady state.
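The health-check piece can be codified too. A sketch using the Cloudflare Terraform provider (v4.x argument names), assuming a lightweight /healthz.html object published to both origins; attach the monitor to each pool via the pool's monitor argument:
resource "cloudflare_load_balancer_monitor" "origin_health" {
  account_id     = var.cf_account_id
  type           = "https"
  method         = "GET"
  path           = "/healthz.html" # assumed health object present on both origins
  expected_codes = "200"
  interval       = 60 # shorter intervals may require a higher Cloudflare plan
  timeout        = 5
  retries        = 2
}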
4) CDN fallback logic: nuanced rules
Failover is not just a switch. Handle different responses differently:
- Origin returns a 5xx -> fail over to the secondary origin.
- Object not found (404) -> do not fail over unless a specific mapping exists (avoid masking content deletion).
- Presigned URL failures -> regenerate with a short TTL and push via edge middleware or a rewriter service.
Practical rule: prefer CDN-level failover for availability, and application-level reconciliation for correctness.
5) DNS and global traffic management
When CDN-native failover isn't available or you run multi-CDN, use DNS failover as a fallback. Implement health checks with Route53 / NS1 / GCP Cloud DNS and weighted routing to split traffic during recovery. Be mindful of DNS TTLs and resolver caching.
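A minimal Route 53 failover sketch under those caveats, assuming a hosted zone in var.route53_zone_id and a public /healthz.html object on the primary origin. The hostnames reuse the example buckets from the Terraform section below; in practice, serving directly from S3 also requires the bucket and hostname naming to line up, so treat this as the failover mechanics only:
resource "aws_route53_health_check" "primary_origin" {
  fqdn              = "acme-site-primary-prod.s3.amazonaws.com"
  type              = "HTTPS"
  resource_path     = "/healthz.html"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "www_primary" {
  zone_id         = var.route53_zone_id
  name            = "www.example.com"
  type            = "CNAME"
  ttl             = 60
  set_identifier  = "primary"
  records         = ["acme-site-primary-prod.s3.amazonaws.com"]
  health_check_id = aws_route53_health_check.primary_origin.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

resource "aws_route53_record" "www_secondary" {
  zone_id        = var.route53_zone_id
  name           = "www.example.com"
  type           = "CNAME"
  ttl            = 60
  set_identifier = "secondary"
  records        = ["acme-site-secondary-prod.nyc3.digitaloceanspaces.com"]

  failover_routing_policy {
    type = "SECONDARY"
  }
}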
6) Security and signed URL considerations
If you use signed URLs or cookies, ensure the signing mechanism is duplicated across regions and clouds. Options:
- Central signing service replicated via Kubernetes across regions.
- Edge signing using Workers (Cloudflare) or edge functions so tokens can be regenerated without hitting origin.
7) Observability and runbooks
Instrument origin and edge metrics, and correlate S3 API error rates, CDN origin timeouts, and DNS query failures. Maintain a concise runbook for each failure mode:
- Primary S3 read errors -> enable CDN origin failover, evacuate in-flight uploads.
- Primary S3 write errors -> queue writes to secondary via a temporary queue and surface degraded state to clients.
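The "error spike" alert can be codified as well. A sketch, assuming the primary_site bucket from the Terraform section below and a hypothetical on-call SNS topic in var.oncall_sns_topic_arn:
# Request metrics must be enabled on the bucket for per-bucket 5xx counts to exist.
resource "aws_s3_bucket_metric" "primary_requests" {
  bucket = aws_s3_bucket.primary_site.id
  name   = "EntireBucket"
}

resource "aws_cloudwatch_metric_alarm" "s3_5xx_spike" {
  alarm_name          = "primary-s3-5xx-spike"
  namespace           = "AWS/S3"
  metric_name         = "5xxErrors"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 3
  threshold           = 50
  comparison_operator = "GreaterThanThreshold"

  dimensions = {
    BucketName = aws_s3_bucket.primary_site.id
    FilterId   = aws_s3_bucket_metric.primary_requests.name
  }

  alarm_actions = [var.oncall_sns_topic_arn] # hypothetical on-call SNS topic
}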
Terraform templates (practical snippets)
Below are short, practical Terraform examples to bootstrap this architecture. They are intentionally compact — treat them as templates to adapt to your org policies.
A) AWS S3 primary bucket (static site) + Policy
resource "aws_s3_bucket" "primary_site" {
  bucket = "acme-site-primary-prod"

  tags = {
    environment = "prod"
  }
}

resource "aws_s3_bucket_website_configuration" "primary_site" {
  bucket = aws_s3_bucket.primary_site.id

  index_document {
    suffix = "index.html"
  }

  error_document {
    key = "404.html"
  }
}

# Public reads are granted through the bucket policy below, so Block Public
# Access must be configured to allow a public bucket policy.
resource "aws_s3_bucket_public_access_block" "primary_site" {
  bucket = aws_s3_bucket.primary_site.id

  block_public_acls       = true
  ignore_public_acls      = true
  block_public_policy     = false
  restrict_public_buckets = false
}

resource "aws_s3_bucket_policy" "primary_policy" {
  bucket = aws_s3_bucket.primary_site.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid       = "PublicReadGetObject"
        Effect    = "Allow"
        Principal = "*"
        Action    = ["s3:GetObject"]
        Resource  = ["${aws_s3_bucket.primary_site.arn}/*"]
      }
    ]
  })
}
B) DigitalOcean Spaces secondary (S3-compatible) — example
Note: DigitalOcean Spaces is S3-compatible. For other providers use provider-specific resources or the S3-compatible API.
provider "digitalocean" {
  token = var.do_token

  # Managing Spaces requires a Spaces access key pair in addition to the API token.
  spaces_access_id  = var.do_spaces_access_id
  spaces_secret_key = var.do_spaces_secret_key
}

resource "digitalocean_spaces_bucket" "secondary_site" {
  name   = "acme-site-secondary-prod"
  region = "nyc3"
}
C) Cloudflare Load Balancer with two origins
Use Cloudflare when you want CDN-level failover. This example uses the Terraform Cloudflare provider.
# Argument names below follow the v4.x Cloudflare Terraform provider.
provider "cloudflare" {
  # A scoped api_token is preferred over the legacy email + global API key pair.
  email   = var.cf_email
  api_key = var.cf_api_key
}

resource "cloudflare_load_balancer" "site_lb" {
  zone_id       = var.cf_zone_id
  name          = "www.example.com"
  fallback_pool = cloudflare_load_balancer_pool.secondary_pool.id
  default_pools = [cloudflare_load_balancer_pool.primary_pool.id]
  ttl           = 30 # applies only when the hostname is not proxied
}

# Pools are account-scoped, so they take account_id rather than zone_id.
# Attach a cloudflare_load_balancer_monitor via the monitor argument for
# active health checks.
resource "cloudflare_load_balancer_pool" "primary_pool" {
  account_id = var.cf_account_id
  name       = "primary-s3"

  origins {
    name    = "aws-s3"
    address = "acme-site-primary-prod.s3.amazonaws.com"
    enabled = true
    weight  = 1
  }
}

resource "cloudflare_load_balancer_pool" "secondary_pool" {
  account_id = var.cf_account_id
  name       = "secondary-space"

  origins {
    name    = "do-space"
    address = "acme-site-secondary-prod.nyc3.digitaloceanspaces.com"
    enabled = true
    weight  = 1
  }
}
D) Cloudflare Worker: fetch primary, fallback to secondary on 5xx
Deploy a small Worker to add another layer of resilience at the edge. It tries the primary origin first and falls back to the secondary if the primary returns a 5xx or is unreachable.
addEventListener('fetch', event => {
  event.respondWith(handle(event.request))
})

async function handle(req) {
  // Rewrite the incoming URL to each origin, keeping path and query intact.
  const primary = new URL(req.url)
  primary.hostname = 'acme-site-primary-prod.s3.amazonaws.com'
  const secondary = new URL(req.url)
  secondary.hostname = 'acme-site-secondary-prod.nyc3.digitaloceanspaces.com'

  let res
  try {
    res = await fetch(primary, { cf: { cacheTtl: 60 } })
  } catch (err) {
    // Network-level failure (primary unreachable): treat it like a 5xx.
    res = null
  }

  if (!res || res.status >= 500) {
    // Primary failed, so try the secondary origin.
    res = await fetch(secondary, { cf: { cacheTtl: 60 } })
  }

  return res
}
Deploy this with Wrangler or the Cloudflare Terraform provider's worker script resources.
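For the Terraform route, a sketch using the v4.x Cloudflare provider resource names, assuming the Worker source above is saved at a hypothetical workers/origin-failover.js path:
resource "cloudflare_worker_script" "origin_failover" {
  account_id = var.cf_account_id
  name       = "origin-failover"
  content    = file("${path.module}/workers/origin-failover.js") # assumed local path to the Worker above
}

resource "cloudflare_worker_route" "site_route" {
  zone_id     = var.cf_zone_id
  pattern     = "www.example.com/*"
  script_name = cloudflare_worker_script.origin_failover.name
}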
Automated sync patterns
Choose the sync approach that suits your workflow:
- Build-time publish: CI (GitHub Actions / GitLab) writes to both primary and secondary simultaneously.
- Continuous sync job: containerized rclone/minio-mc job that watches for object changes.
- Event-driven replication: SQS -> Lambda -> POST to secondary API for new objects (good for uploads).
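A minimal Terraform wiring for the event-driven option above. The consumer that copies each new key to the secondary store (a Lambda function or small container) is left out, and the bucket reference assumes the primary_site bucket from the Terraform section:
resource "aws_sqs_queue" "replication_events" {
  name = "primary-object-created"
}

# S3 must be allowed to publish events to the queue.
resource "aws_sqs_queue_policy" "allow_s3_events" {
  queue_url = aws_sqs_queue.replication_events.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect    = "Allow"
        Principal = { Service = "s3.amazonaws.com" }
        Action    = "sqs:SendMessage"
        Resource  = aws_sqs_queue.replication_events.arn
        Condition = {
          ArnEquals = { "aws:SourceArn" = aws_s3_bucket.primary_site.arn }
        }
      }
    ]
  })
}

resource "aws_s3_bucket_notification" "primary_events" {
  bucket = aws_s3_bucket.primary_site.id

  queue {
    queue_arn = aws_sqs_queue.replication_events.arn
    events    = ["s3:ObjectCreated:*"]
  }
}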
Testing and validation plan (test scripts & drill checklist)
Failover works only if you test it. This is a practical test plan you can run weekly and during major releases.
Test categories
- Unit tests — CI ensures objects are published to both origins on each release.
- Integration tests — synthetic requests through the CDN exercise headers, caching, and signed URL flows.
- Chaos and resilience tests — simulate origin failures from the CDN perspective and validate behavior.
- Performance benchmarks — measure latency and throughput from the user's location in both normal and failover modes.
- Security checks — verify IAM/ACL parity and that signed URLs continue to validate across origins.
Example live drill
- Announce the drill window and scope to stakeholders.
- Break the primary origin in a controlled way, for example by denying s3:GetObject via a temporary bucket-policy change (a Terraform drill toggle is sketched after this list).
- Run a load test (k6 or Vegeta) against the public URL and confirm the CDN fails over to the secondary origin.
- Validate that 99% of requests are served from the secondary origin and monitor for increased 4xx/5xx errors.
- Reinstate primary and confirm traffic returns to normal. Document the exact steps and any issues encountered.
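One way to script the "break the primary" step is a drill toggle on the bucket policy from Terraform section A; a sketch, assuming a hypothetical boolean variable var.drill_mode. Note that an explicit Deny surfaces as 403s rather than 5xx, so in this scenario it is the health checks (expected_codes = "200") that trigger failover, not the Worker's 5xx rule:
# Extends the primary_policy resource from section A with a drill toggle.
# Explicit Deny wins over Allow in S3 policy evaluation, so public reads
# fail while var.drill_mode is true.
resource "aws_s3_bucket_policy" "primary_policy" {
  bucket = aws_s3_bucket.primary_site.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = concat(
      [
        {
          Sid       = "PublicReadGetObject"
          Effect    = "Allow"
          Principal = "*"
          Action    = ["s3:GetObject"]
          Resource  = ["${aws_s3_bucket.primary_site.arn}/*"]
        }
      ],
      var.drill_mode ? [
        {
          Sid       = "DrillDenyGetObject"
          Effect    = "Deny"
          Principal = "*"
          Action    = ["s3:GetObject"]
          Resource  = ["${aws_s3_bucket.primary_site.arn}/*"]
        }
      ] : []
    )
  })
}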
Test checklist (automatable)
- Health checks: CDN origin health shows green for secondary after failover.
- Cache headers: Content retains expected cache-control and ETag semantics.
- Signed tokens: Short-lived presigned URL generation is replicated and valid.
- Metrics: Alert fires for origin error spike and recovery event.
Operational playbook and runbook snippets
Keep concise runbooks for common failure modes. Here are two examples:
Runbook: primary S3 read outage
- Step 1: Confirm errors via CDN metrics and S3 CloudWatch error logs.
- Step 2: Force CDN origin failover (if automatic failover isn't triggered): change Load Balancer default_pools to secondary.
- Step 3: Notify stakeholders and update status page.
- Step 4: If writes are impacted, enable write-queue to secondary. Pause batch jobs that write to primary.
- Step 5: Run integrity checks on secondary objects (hash comparisons) and note any replication lag.
Runbook: CDN provider outage
- Step 1: Detect via synthetic monitoring and CDN status page.
- Step 2: If using multi-CDN, switch DNS to secondary CDN using short TTL or DNS provider API.
- Step 3: If you use only one CDN, fall back to serving directly from the S3 origins via DNS, but only if the objects are public and CORS allows it.
- Step 4: Monitor for caching warm-up issues and client-side behavior.
Cost, performance, and compliance trade-offs
Adding a secondary store and CDN failover has costs: storage, replication bandwidth, and possibly egress. But for customer-facing sites, the cost of downtime typically exceeds the cost of multi-cloud redundancy. Balance these factors:
- For infrequently changed archives, rely on asynchronous replication (lower cost).
- For high-traffic static sites, CI-publish to both origins to avoid per-request sync costs.
- Use lifecycle policies to archive older objects and reduce cross-cloud transfer costs.
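A small sketch of the lifecycle idea on the primary bucket, assuming rarely-read objects live under a hypothetical archive/ prefix:
resource "aws_s3_bucket_lifecycle_configuration" "primary_archive" {
  bucket = aws_s3_bucket.primary_site.id

  rule {
    id     = "archive-old-assets"
    status = "Enabled"

    filter {
      prefix = "archive/" # hypothetical prefix for cold objects
    }

    transition {
      days          = 90
      storage_class = "GLACIER_IR" # instant-retrieval archive tier
    }
  }
}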
2026 trends and future-proofing
In 2026, expect these trends to influence failover design:
- Edge compute expansion: running auth and URL signing at the edge reduces dependence on the origin during failover.
- Multi-cloud object replication services: Third-party managed replication will become more common, simplifying cross-cloud consistency.
- Standardization on S3 APIs: More providers offering S3-compatible APIs (including R2 and object stores at the edge) reduces lock-in.
- AI-assisted incident playbooks: Tools that suggest runbook steps during an outage will speed recovery, but human validation remains essential.
Real-world example: what the Cloudflare + AWS outage taught us
The simultaneous outage event in early 2026 highlighted these failure modes:
- CDN provider-level failure that took out edge logic and DNS for many domains simultaneously.
- S3 API throttling in one region increased 5xx rates and caused CDN origins to flap between failover and recovery.
- Presigned URLs issued by a central service became invalid when that service lost connectivity to the signing keys.
Lessons applied in the playbook above: decentralize signing, use CDN-level failover, and ensure a decoupled replication path to a secondary store so user reads remain available even if the primary S3 or CDN is impaired.
Actionable takeaways (checklist)
- Inventory and classify all S3 objects and their RTO/RPO.
- Implement CI-publish to at least two object stores for static sites.
- Configure CDN multi-origin with active health checks and short failover TTLs.
- Deploy an edge fallback (Cloudflare Worker or equivalent) to attempt secondary fetches on 5xx responses.
- Automate failover drills and integrate them into your incident response cadence.
Final recommendations
Don’t wait for the next simultaneous outage to reveal brittle designs. Start with a low-friction pilot: pick one high-value static site, add a secondary S3-compatible storage target, and route through your CDN with multi-origin failover. Run one full drill and iterate on the runbook.
Remember: availability is a system property. It’s not enough to replicate objects — you must replicate identity, signing, monitoring, and the playbooks teams use when the alerts start firing.
Call to action
Ready to harden your S3-backed sites with a practical multi-cloud failover? Download our Terraform templates and a ready-made test harness for k6 and GitHub Actions from storages.cloud. Run a failover drill this week and reduce your outage risk before the next cloud-wide incident.