Data Infrastructure

I called the Netflix Tyson fight failure. Amazon has the same problem every Prime Day

I told my team Netflix was going to crash before the Tyson fight even started. Not because I'm clairvoyant. I'd lived the same problem at Pluto TV for years. And Amazon has it every Prime Day.

Tony
June 11, 2026
5 min read
Amazon Practical

I told my team Netflix was going to crash.

This was right after Netflix announced the Paul vs Tyson fight would stream live. Their biggest live event to date. Their infrastructure had never been tested at that scale. I'd been at Pluto TV for years, and the one problem we could never put to bed (the one that lived as an epic on our roadmap from week one to the day I left) was: get the live streams to load faster, and keep them in sync.

It's the same problem. And Netflix's solution had not been built yet.

I wasn't surprised when the boxing match started buffering for millions of people watching. I was surprised it didn't crash worse.

This is the same shape of problem Amazon has every Prime Day.

What live-event scale actually looks like

There are two different scale problems people lump together.

The first one: a million people watching a million different things. Different shows, different movies, different listings. You can shard that. Spin up servers, route requests, isolate failure domains. Each user is sandboxed. If one CDN edge melts, the users on every other edge don't notice. This is the scale problem most engineers have actually solved.

The second one: a million people watching the same thing at the same moment. One global stream. One coordinated state. Everyone needs the same frame, the same caption, the same ad break, within a window measured in milliseconds. You can't shard this. You have to orchestrate it.

At Pluto TV we lived inside that second problem. We had a video player baked into the Chromium browser, a JavaScript stack rendering it, and a server-side stitching system (Stitcher) that spliced commercial spots into the live feed. The hard part was never throughput. It was keeping the feed, the captions, the analytics, and the ad insertions all in sync for every viewer simultaneously. That's what live-event scale actually means. Not throughput. Coordination.

Most engineers haven't built systems facing that problem. They've built shardable systems. When they get put on a live-event team, they reach for the patterns they know. Those patterns don't apply.

The parallel to Amazon Prime Day

Prime Day is Amazon's live event.

Not literally a video stream, but the same simultaneity problem. For about 48 hours, every operator on Amazon hits every API at the same moments. Sponsored Ads spend reports. Search Query Performance (SQP) data refreshes. Attribution events. Listing changes. Bid updates. Inventory feeds. The entire data plane is under coordinated stress, and Amazon's own infrastructure responds by degrading in ways it doesn't on a regular Tuesday.

That degradation cascades. SP-API throttling profiles shift during peak windows. Endpoints that respond in 200 milliseconds on a normal day stretch to 4 seconds, then start returning 429s. Any tool that assumed the normal-day latency profile, and any business that runs on top of that tool, inherits the slowdown.
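The first defense against those 429s is not retrying in lockstep with everyone else. Here's a minimal sketch of capped exponential backoff with jitter. `Throttled` and `call_with_backoff` are names I made up for illustration, and I'm assuming your HTTP client raises something like `Throttled` when it sees a 429. The delays are placeholders, not tuned values.

```python
import random
import time


class Throttled(Exception):
    """Hypothetical error: raised by your client when the API answers HTTP 429."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on Throttled with capped exponential backoff.

    Each wait is a random fraction of the capped exponential delay
    ("full jitter"), so many clients don't all retry at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Throttled:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller decide what to do
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * ceiling)
```

The jitter is the part that matters here. Without it, every throttled client comes back at the same moment, which is the coordinated-load problem all over again.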

Your reporting goes stale. Your bid optimizer skips cycles. Your dashboards lag the actual events by hours instead of minutes. Your client calls the next morning asking why their flagship ASIN got hammered, and you don't have answers yet because the data is still settling.

The pattern is identical to the streaming problem. Coordinated load, not random load. Same shape. Different domain.

What people get wrong about preparing for scale

The wrong mental model is "provision more capacity." More servers, more workers, bigger instances. This works for the first kind of scale problem. It doesn't work for this one.

The right mental model is "decouple your system from live upstream calls so peak doesn't break it." That means specific patterns:

Cache aggressively. Anything you can serve from cache during the peak window, you serve from cache. Stale-but-available beats fresh-but-timing-out, especially during a 48-hour event.
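A rough sketch of what "stale-but-available" can look like in code. `StaleOkCache` is a hypothetical name, the store is an in-memory dict, and `fetch` stands in for whatever call you make upstream. A real deployment would likely want a shared store and a cap on how stale is too stale.

```python
import time


class StaleOkCache:
    """Serve fresh data when possible, last-known-good data when not."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (value, fetched_at)

    def get(self, key, fetch):
        """Return (value, is_stale). `fetch` is a zero-argument callable."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0], False  # fresh hit: no upstream call at all
        try:
            value = fetch()
        except Exception:
            if entry is None:
                raise  # nothing cached yet, so nothing to fall back to
            return entry[0], True  # upstream failed: serve the stale copy
        self._store[key] = (value, now)
        return value, False
```

Returning the `is_stale` flag alongside the value means the caller can never mistake old numbers for fresh ones.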

Queue everything you can defer. Not every API call needs to happen at the millisecond it was triggered. Bid changes, report exports, listing updates: route them through a queue and process at sustainable rates.
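To make "process at sustainable rates" concrete, here's a small sketch of a queue that holds deferred work and releases it at a fixed pace. `PacedQueue` is an illustrative name, it runs in a single process, and the rate you pick is up to you. I'm not quoting any real SP-API limit here.

```python
import time
from collections import deque


class PacedQueue:
    """Hold deferred work and release it at a fixed maximum rate."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self._jobs = deque()

    def submit(self, job):
        """Enqueue a zero-argument callable instead of running it now."""
        self._jobs.append(job)

    def drain(self):
        """Run queued jobs in FIFO order, at least `interval` apart."""
        results = []
        next_start = self.clock()
        while self._jobs:
            wait = next_start - self.clock()
            if wait > 0:
                self.sleep(wait)
            started = self.clock()
            results.append(self._jobs.popleft()())
            next_start = started + self.interval
        return results
```

In practice you'd back this with a durable queue so a restart doesn't lose pending bid changes. The shape is the same: producers enqueue, one consumer sets the pace.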

Graceful degradation by default. When the upstream is slow, your system should keep working with reduced capability rather than failing completely. Read-only fallbacks. Last-known-good data. Explicit "data is delayed" badges on the UI instead of silent staleness.
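The "data is delayed" badge is the cheapest of these to build. A sketch, assuming you store a timezone-aware timestamp with every dataset you fetch. `freshness_label` and the 15-minute threshold are illustrative choices, not recommendations.

```python
from datetime import datetime, timedelta, timezone


def freshness_label(fetched_at, now=None, max_age=timedelta(minutes=15)):
    """Return a UI label that makes staleness explicit instead of silent.

    `fetched_at` must be a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    age = now - fetched_at
    if age <= max_age:
        return "Live"
    minutes = int(age.total_seconds() // 60)
    if minutes < 120:
        return f"Data delayed: last updated {minutes} min ago"
    return f"Data delayed: last updated {minutes // 60} h ago"
```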

Idempotency on every write. When throttling hits and retries fire, idempotency keeps you from double-spending, double-counting, or double-listing.
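One way to sketch this: generate a key once per logical change, reuse it on every retry, and have the writer remember which keys it has already applied. `IdempotentWriter` and `new_key` are made-up names, and the in-memory dict is only for illustration. You'd want a durable store in front of anything that spends money.

```python
import uuid


def new_key():
    """Create a key once per logical write; reuse it for every retry."""
    return str(uuid.uuid4())


class IdempotentWriter:
    """Apply each logical write at most once, even if it is retried."""

    def __init__(self, apply):
        self._apply = apply   # function that performs the real write
        self._results = {}    # idempotency key -> result of the first apply

    def write(self, key, payload):
        if key in self._results:
            return self._results[key]  # retry: replay the earlier result
        result = self._apply(payload)  # if this raises, nothing is recorded
        self._results[key] = result
        return result
```

Note that the key is supplied by the caller rather than derived from the payload. Setting a bid to 1.25 twice on purpose is two writes. Retrying one write is still one.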

Pre-rendered, not live-calculated. Anything you can compute ahead of the peak and store, you compute ahead.

The brutal lesson from streaming: you cannot buy your way out of a coordination problem with more servers. Netflix has effectively infinite money and one of the best infrastructure teams in the industry. They still buffered the Tyson fight. Because the problem wasn't capacity. The problem was coordination at a millisecond grain, and they hadn't built for it.

The principle for Amazon operators

If you're running an Amazon business and you don't have a Prime Day fallback plan, you're not running it. You're hoping.

The plan doesn't have to be perfect. It has to exist. Five concrete items:

  1. Know which API endpoints your business depends on. List them. Note their normal-day latency. You can't plan around what you haven't named.
  2. Pre-export the reports you'll need before peak. Don't try to pull a 30-day Sponsored Ads report during the event window. Pull it the day before.
  3. Decide what your tools do when data is stale. Read-only mode? Reduced cadence? Explicit warnings? Default behavior should not be "show numbers from 6 hours ago without saying so."
  4. Cache aggressively on your side. Anything the SP-API serves you, cache it for the duration. Refresh on a schedule, not on demand.
  5. Run a Prime Day fire drill in May. Simulate degraded SP-API responses against your production stack. Find the failure modes before Amazon finds them for you.
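For the fire drill in item 5, you don't need anything fancy to start. Here's a minimal sketch of a wrapper that makes any client function slower and occasionally rejects the call, roughly mimicking the 4-second responses and 429s described earlier. `degrade` and `SimulatedThrottle` are hypothetical names, and the 30% rejection rate is an arbitrary starting point.

```python
import random
import time


class SimulatedThrottle(Exception):
    """Stands in for an HTTP 429 from the upstream API during a drill."""


def degrade(fn, extra_latency=4.0, throttle_rate=0.3,
            sleep=time.sleep, rng=random.random):
    """Wrap fn so each call is either rejected or delayed before running."""
    def wrapper(*args, **kwargs):
        if rng() < throttle_rate:
            raise SimulatedThrottle("simulated 429")
        sleep(extra_latency)  # add the peak-day latency on top of the call
        return fn(*args, **kwargs)
    return wrapper
```

Wrap the functions that talk to the endpoints you listed in item 1, run your normal jobs, and watch what falls over.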

The operators who survive Prime Day every year have one thing in common. They've already lived through one that broke them. They know which assumption fell over last time, and they built around it. The ones who haven't lived through one are about to learn.