Analysis
When the Big Clouds Blink: AWS, Azure, Fly.io and the Edge of AI Infrastructure
Recent AWS and Azure outages show how centralised cloud infrastructure is straining, why smaller providers like Fly.io hit the same walls with less slack, and what Discord’s data strategy says about AI-era storage.
October 30, 2025
In the last two weeks, the two biggest public clouds on earth have both had very public bad days. On 19–20 October, AWS’s us-east-1 region suffered a multi-service disruption that rippled into consumer apps, SaaS backends and even popular games, with DNS resolution failures and degraded internal load balancers at the core of the problem (The Verge, ThousandEyes, AWS health). A week later, on 29–30 October, the same region had “another bad day”, with EC2, container tasks and case management affected again, prompting coverage that sounded uncomfortably familiar (The Register, Times of India). This is the region that powers large chunks of the commercial internet; when it coughs, dashboards everywhere go yellow.
Microsoft did not escape either. On 9–10 October, Azure Front Door and internal network paths saw packet loss that knocked Microsoft 365, Outlook and other workloads into timeouts (ThousandEyes, Economic Times). A month earlier, on 10 September, Azure’s own status history shows a long East US 2 control-plane degradation caused by failures in the allocator service for two availability zones (Azure status history). Add the September fibre cuts in the Red Sea, which forced Azure to reroute Europe–Asia traffic and added significant latency to a lot of seemingly local workloads (Reuters, Tom’s Hardware), and you get a picture that is not “cloud is down” but “cloud is under strain in the parts that everybody depends on.”
Recent shocks, same root
Look closely and the AWS and Azure incidents rhyme. Both were control-plane or front-door-style events, not “a box died”: AWS’s was about internal DNS and load-balancer behaviour in us-east-1, which is exactly the region where customers put their “global” services because of history and cost (AWS analysis on Intelligent CIO). Azure’s was about a central piece of the delivery edge (Front Door) and later a central piece of VM orchestration (the allocator). In both cases, retries and backlogs made the recovery longer than the core fix. And in both cases, the blast radius was large not because every rack was broken, but because the dependency graph was too top-heavy: one noisy layer near the top, and a lot of unrelated services looked sick.
That is the first lesson. The more we consolidate around a small number of global entry points, the more every “just a few AZs” event feels like internet weather.
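To make the retry point concrete, here is a minimal sketch (pure illustration, not any provider’s SDK) of the client-side discipline that keeps a control-plane blip from becoming a retry storm. The naive version of this loop, retrying immediately and forever, is exactly the behaviour that stretches a recovery: every client hammers the recovering service at once. Capped exponential backoff with jitter spreads the same demand out so the backlog can drain.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a throttle or timeout from an overloaded control plane."""

def call_with_backoff(op, max_attempts=6, base=0.2, cap=8.0):
    """Retry `op` with capped exponential backoff and full jitter.

    Bounded attempts and spread-out sleeps are the whole trick: the caller
    stops contributing to the pile-up that makes the incident longer.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the capped exponential step.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_lookup():
        # Hypothetical stand-in for an internal DNS or load-balancer lookup
        # that keeps failing while the service behind it recovers.
        state["calls"] += 1
        if state["calls"] < 4:
            raise TransientError("still recovering")
        return "resolved"

    print(call_with_backoff(flaky_lookup))   # succeeds on the fourth attempt
```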
Fly.io: the small-provider version of the same pain
If the hyperscalers struggle to keep globally visible control planes perfectly smooth, it should not be surprising that smaller, more opinionated providers have to fight the same class of problems with fewer people and less redundancy. Fly.io has been very transparent about this. In May 2024 it wrote about “first-of-their-kind outages in the control plane for our global WireGuard mesh,” which led to “several days of involuntary chaos testing” while they re-established orchestration across regions (Fly infra log, May 5 2024). In September 2024 they documented a request-routing disruption in Johannesburg caused by a reverted deployment, noting that the fastest practical fix was not to fail the region out of Anycast, but to redeploy the right config (Fly infra log, Sep 7 2024). And in October 2024 they reported “a fleetwide severe orchestration outage” that blocked deploys and machine updates for nearly two hours across the whole fleet (Fly infra log, Oct 26 2024).
What Fly.io is trying to do — run an Anycast-heavy, global, low-latency platform on top of a mesh of their own boxes, while offering customers databases, machines and per-app networking — is exactly the kind of thing the big clouds were supposed to make unnecessary. Their openness is useful because it shows the hard part: it is not buying servers, it is keeping the control plane for “where is this app right now and how do I reach it” alive while you are constantly deploying new code. The similarity with the AWS and Azure outages is uncomfortable: even the small guys are discovering that once you become a transit point, every control-plane hiccup is a customer-visible incident.
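None of this is Fly.io’s actual code, but the pattern their posts keep circling is easy to sketch: the data path should keep answering “where is this app right now” from the last routing state it successfully learned, even while the control plane that produces that state is having a bad day. A minimal, hypothetical version:

```python
import time
from typing import Callable, Dict, Optional

class RoutingCache:
    """Serve app -> backend routing from the last successfully fetched state.

    `fetch_routes` stands in for a call to a hypothetical control plane.
    When that call fails, the proxy keeps using the stale-but-recent snapshot
    instead of returning errors: freshness is traded for availability.
    """

    def __init__(self, fetch_routes: Callable[[], Dict[str, str]], ttl: float = 10.0):
        self._fetch = fetch_routes
        self._ttl = ttl
        self._routes: Dict[str, str] = {}
        self._fetched_at = float("-inf")   # force a fetch on first lookup

    def lookup(self, app: str) -> Optional[str]:
        if time.monotonic() - self._fetched_at > self._ttl:
            try:
                self._routes = self._fetch()
                self._fetched_at = time.monotonic()
            except Exception:
                # Control plane unreachable: keep serving the last good snapshot.
                pass
        return self._routes.get(app)

# Usage sketch: the fetch function is whatever your orchestrator actually exposes.
routes = RoutingCache(lambda: {"my-app": "fra-worker-3:8080"})
print(routes.lookup("my-app"))
```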
Are we pushing AI toward the edge of what centralised infra can do?
This is where 2025’s AI workload pattern starts to matter. Running a big model is not new; running thousands of small or mid-size inference jobs, with heavy context windows, coming from multiple geographies and often from inside other clouds, is. When a us-east-1 DNS or NLB issue occurs, it does not only block a web app; it can interrupt an entire chain of AI calls: retrieval in us-east-1, vector store in the same region, model in a private VPC endpoint, callback to a third-party tool, and then back to the user. The more agentic the workload, the longer the chain, and the more sensitive it is to a single-region event.
This is why you are seeing more talk of “bring your own GPU microcloud” and “local inference first, centralised training later.” The outages do not prove that AWS or Azure cannot run AI; they do show that the internet’s current habit of putting every control hop in Virginia or East US 2 is brittle. If a Discord-style product had wanted to run its embedding or search inference close to users in, say, South Asia on 6 September, Azure’s cable reroutes alone would have degraded that plan. That is what “AI at the edge of infrastructure” looks like in practice: not a romantic return to on-prem, but a need to spread the failure domains horizontally so that an agent chain can succeed even during a cloud’s bad hour.
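Concretely, spreading the failure domains can be as unglamorous as giving every hop in the chain a second home. The endpoints below are made up; the shape is the point: each step tries its nearest region first and fails over, instead of assuming us-east-1 is up.

```python
from typing import Callable, Dict, List, Optional

class RegionDown(Exception):
    """Raised by a step when its regional dependency is unavailable."""

# Hypothetical per-step endpoint preferences; the names are illustrative only.
ENDPOINTS: Dict[str, List[str]] = {
    "retrieve":  ["vector.ap-south-1.example.com", "vector.eu-west-1.example.com"],
    "infer":     ["models.ap-south-1.example.com", "models.us-east-1.example.com"],
    "call_tool": ["tools.eu-west-1.example.com",   "tools.us-east-1.example.com"],
}

def with_failover(step: str, call: Callable[[str], str]) -> str:
    """Try each candidate region for a step; fail only if all of them do."""
    last_error: Optional[Exception] = None
    for endpoint in ENDPOINTS[step]:
        try:
            return call(endpoint)   # a real call would raise RegionDown on failure
        except RegionDown as exc:
            last_error = exc        # note it and move on to the next region
    raise RuntimeError(f"all regions down for step {step!r}") from last_error

def run_agent(query: str) -> str:
    # Each hop is independently failover-able, so one region's bad hour
    # does not take out the whole chain.
    context = with_failover("retrieve", lambda ep: f"context({query}) via {ep}")
    answer = with_failover("infer", lambda ep: f"answer({context}) via {ep}")
    return with_failover("call_tool", lambda ep: f"posted({answer}) via {ep}")

print(run_agent("summarise this thread"))
```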
The data problem is even harder than the compute problem
Discord’s public posts over the last two years are a good lens for this. In 2017 they had 12 Cassandra nodes; by early 2022 they had 177 nodes storing trillions of messages, and the main issues were toil, unpredictable latency and maintenance so expensive at that scale that it had to be throttled (Discord, 2023). In April 2025 they described a search layer that indexes trillions of messages, runs 40 Elasticsearch clusters, and still serves sub-500ms p99 queries thanks to better ingestion and automatic rolling restarts (Discord, 2025). ByteByteGo and others later noted that Discord has been experimenting with ScyllaDB and Rust-based data services to shield the databases and improve concurrency (ByteByteGo).
Two things jump out. First, Discord keeps essentially everything: all chat history, forever. That is the extreme version of what AI products in 2025 also want — huge, ever-growing corpora from which to retrieve, fine-tune or personalise. Second, Discord treats storage and indexing as separate, resilient subsystems, each shardable, each operable during upgrades. That is exactly the kind of discipline AI agent platforms do not yet have. Many of them still use a single-region vector store and a single-region object store as if they were just another microservice. When AWS us-east-1 stumbles, everything stumbles.
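Discord has not published this code, and the toy below is nowhere near their scale, but the discipline their posts describe is easy to state: writes land in sharded storage and are acknowledged immediately, while indexing is a separate consumer that can lag, restart or be rebuilt without blocking the primary path.

```python
import queue
import threading
from collections import defaultdict

class MessageStore:
    """Toy version of "storage and indexing as separate subsystems".

    Writes go to a sharded message store and return straight away; a
    background indexer consumes them into a searchable structure. If the
    indexer is restarting or being upgraded, writes and primary reads continue.
    """

    def __init__(self, shards: int = 4):
        self._shards = [dict() for _ in range(shards)]
        self._index = defaultdict(set)                 # word -> message ids
        self._to_index: "queue.Queue" = queue.Queue()
        threading.Thread(target=self._index_loop, daemon=True).start()

    def write(self, msg_id: int, text: str) -> None:
        self._shards[msg_id % len(self._shards)][msg_id] = text
        self._to_index.put((msg_id, text))             # indexed asynchronously

    def read(self, msg_id: int) -> str:
        return self._shards[msg_id % len(self._shards)][msg_id]

    def search(self, word: str) -> set:
        return set(self._index[word])                  # may lag behind writes

    def _index_loop(self) -> None:
        while True:
            msg_id, text = self._to_index.get()
            for word in text.lower().split():
                self._index[word].add(msg_id)

store = MessageStore()
store.write(1, "us-east-1 is having another bad day")
print(store.read(1))   # the primary path does not wait for the indexer
```

The point is not the data structures; it is that the index can be behind, restarted or rebuilt while the write path stays boring.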
So the real bottleneck might not be “we are running out of GPUs”; it might be “we cannot keep writing, indexing and moving the data fast enough, reliably enough, across the network topologies we currently have.” Azure’s cable incidents are a reminder that the physical network is a constraint, not an abstraction.
What this means for builders
The pattern across AWS, Azure and Fly.io is that the most painful incidents are not about raw capacity, they are about orchestration and metadata: DNS, allocators, control planes, global meshes. Those are exactly the parts AI workloads hit hardest, because AI is chatty. It fetches context, hits tools, ships embeddings, posts results. Every extra thousand users multiplies those control-plane calls.
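One cheap mitigation comes before any re-architecture: stop paying a control-plane round trip on every user request. A short-TTL cache in front of endpoint-style lookups (a sketch, not any SDK’s API) turns a thousand extra users into one resolution and a pile of cache hits, rather than a thousand more calls to the thing that is already struggling.

```python
import time
from typing import Callable, Dict, Tuple

class TTLCache:
    """Memoise control-plane-style lookups (endpoint resolution, tool discovery,
    model routing) for a short TTL, so that traffic to the control plane does
    not scale linearly with user traffic."""

    def __init__(self, resolve: Callable[[str], str], ttl: float = 30.0):
        self._resolve = resolve
        self._ttl = ttl
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = time.monotonic()
        hit = self._cache.get(key)
        if hit and now - hit[0] < self._ttl:
            return hit[1]
        value = self._resolve(key)        # the only place the control plane is touched
        self._cache[key] = (now, value)
        return value

# Hypothetical resolver; in practice this is whatever lookup your agent runtime
# performs before each tool or model call.
endpoints = TTLCache(lambda name: f"{name}.internal.example.com:443")
for _ in range(1000):
    endpoints.get("embedding-service")    # one resolution, 999 cache hits
```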
The answer is boring but clear. More multi-region-by-default, even for AI inference. More headless or self-hosted options for teams that do not want to trust a single cloud’s front door — you see this already in products like bolt.diy, and you will see it in AI agent runtimes too. More Discord-style separation between storage, indexing and serving, so that you can patch or move one without taking users down. And more honesty, à la Fly.io, about how hard it actually is to run your own global substrate.
The outages of October 2025 do not mean public cloud is failing. They do mean we are operating too close to the same choke points, at the very moment when AI wants to generate, store, vectorise and search more data than ever. If we keep that shape, every bad hour in Virginia becomes a bad hour for AI everywhere.