diff --git a/public/blog/webrtc-is-the-problem/bleach.png b/public/blog/webrtc-is-the-problem/bleach.png new file mode 100644 index 0000000..fe79196 Binary files /dev/null and b/public/blog/webrtc-is-the-problem/bleach.png differ diff --git a/public/blog/webrtc-is-the-problem/car.png b/public/blog/webrtc-is-the-problem/car.png new file mode 100644 index 0000000..c6c3327 Binary files /dev/null and b/public/blog/webrtc-is-the-problem/car.png differ diff --git a/public/blog/webrtc-is-the-problem/chad.png b/public/blog/webrtc-is-the-problem/chad.png new file mode 100644 index 0000000..d288369 Binary files /dev/null and b/public/blog/webrtc-is-the-problem/chad.png differ diff --git a/public/blog/webrtc-is-the-problem/claude.png b/public/blog/webrtc-is-the-problem/claude.png new file mode 100644 index 0000000..1294466 Binary files /dev/null and b/public/blog/webrtc-is-the-problem/claude.png differ diff --git a/public/blog/webrtc-is-the-problem/rtt.png b/public/blog/webrtc-is-the-problem/rtt.png new file mode 100644 index 0000000..1fee4fd Binary files /dev/null and b/public/blog/webrtc-is-the-problem/rtt.png differ diff --git a/public/blog/webrtc-is-the-problem/trenchcoat.png b/public/blog/webrtc-is-the-problem/trenchcoat.png new file mode 100644 index 0000000..f068227 Binary files /dev/null and b/public/blog/webrtc-is-the-problem/trenchcoat.png differ diff --git a/src/pages/blog/webrtc-is-the-problem.mdx b/src/pages/blog/webrtc-is-the-problem.mdx new file mode 100644 index 0000000..62fc9dd --- /dev/null +++ b/src/pages/blog/webrtc-is-the-problem.mdx @@ -0,0 +1,383 @@ +--- +layout: "@/layouts/global.astro" +title: OpenAI's WebRTC Problem +author: kixelated +description: There are ways to do voice AI without being traumatized by WebRTC. +cover: "/blog/webrtc-is-the-problem/rtt.png" +date: 2026-05-06 +--- + +# OpenAI's WebRTC Problem + +OpenAI [posted a technical blog](https://openai.com/index/delivering-low-latency-voice-ai-at-scale/) a few days ago. +This blog post triggered me more than it should have. +I urge to slap my meaty fingers on the keyboard. + +**You should NOT copy OpenAI.** + +I don't think you should use WebRTC for voice AI. +WebRTC is the problem. + +## Me +Like 6 years ago I wrote a WebRTC SFU at Twitch. +Originally we used [Pion](https://github.com/pion) (Go) just like OpenAI, but forked after benchmarking revealed that it was too slow. +I ended up rewriting every protocol, because of course I did! + +Just a year ago, I was at Discord and I rewrote the WebRTC SFU in Rust. +Because of course I did! +You're probably noticing a trend. + +**Fun Fact**: WebRTC consists of ~45 RFCs dating back to the early 2000s. +And some de-facto standards that are technically drafts (ex. TWCC, REMB). +Not a fun fact when you have to implement them all. + +You should consider me a **Certified WebRTC Expert**. +Which is why I never, never want to use WebRTC again. + +## Product Fit +I'm going to cheat a little bit and start with the hot takes before they get cold. +Don't worry, we'll get right back to talking about the OpenAI blog post and load balancing, I promise. + +**WebRTC is a poor fit for Voice AI.** + +But that seems counter-intuitive? +WebRTC is for conferencing, and that involves speaking? +And robots can speak, right? + +## WebRTC is too aggressive +Let's say I pull up my OpenAI app on my phone. +I say hi to ~~Scarlett Johansson~~ Sky and then I utter: +> should I walk or drive to the car wash? + +WebRTC is designed to **degrade and drop my prompt** during poor network conditions. + +wtf my dude + +WebRTC aggressively drops audio packets to keep latency low. +If you've ever heard distorted audio on a conference call, that's WebRTC baybee. +The idea is that conference calls depend on rapid back-and-forth, so pausing to wait for audio is unacceptable. + +...but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate. +After all, I'm paying good money to boil the ocean, and a garbage prompt means a garbage response. +It's not like LLMs are particularly responsive anyway. + +**But I'm not allowed to wait**. +It's *impossible* to even retransmit a WebRTC audio packet within a browser; we tried at Discord. +The *implementation* is hard-coded for real-time latency **or else**. + +Yes, Voice AI agents will eventually get the latency down to the conversational range. +But **reducing latency has trade-offs**. +I'm not even sure that purposely degrading audio prompts will ever be worth it. + +
+ ![car or wash](/blog/webrtc-is-the-problem/car.png) +
Two roads diverged in a yellow wood. And sorry I could not travel both. And be one traveler, long I stood. And looked down one as far as I could. Until I ran out of tokens.
+
+ + + +## TTS is faster than real-time +You speak into the microphone, it gets sent to one of OpenAI's billion servers, and then a GPU pretends to talk to you via text-to-speech. +Neato. + +Let's say it takes 2s of GPUs to generate 8s of audio. +In an ideal world, we would stream the audio as it's being generated (over 2s) and the client would start playing it back (over 8s). +That way, if there's a network blip, some audio is buffered locally. +The user might not even notice the network blip. + +But nope, WebRTC has **no buffering** and renders based on **arrival time**. +Like seriously, timestamps are just suggestions. +It's even more annoying when video enters the picture. + +To compensate for this, OpenAI has to make sure packets arrive *exactly* when they should be rendered. +They need to **add a sleep** in front of every audio packet **before sending it**. +But if there's network congestion, *oops* we lost that audio packet and it'll never be retransmitted. + +OpenAI is literally introducing artificial latency, and then aggressively dropping packets to "keep latency low". +It's the equivalent of screen sharing a YouTube video instead of buffering it. +**The quality will be degraded**. + +
+ ![You should not drink bleach](/blog/webrtc-is-the-problem/bleach.png) +
Thank for Mr Robot friend for the unambiguous advice
+
+ +**Fun fact**: WebRTC actually adds latency. +It's not much, but WebRTC has a dynamic jitter buffer that can be sized anywhere from 20ms to 200ms (for audio). +This is meant to smooth out network jitter, but none of this is needed if you transfer faster than real-time. + + +## Ports Ports Ports +Okay but let's talk about the *technical meat* of the OpenAI article. +We're no longer [on a boat](/blog/on-a-boat), but let's talk about ports. + +When you host a TCP server, you open a port (ex. 443 for HTTPS) and listen for incoming connections. +The TCP client will randomly select an ephemeral port to use, and the connection is identified by the source/destination IP/ports. +For example, a connection might be identified as `123.45.67.89:54321 -> 192.168.1.2:443`. + +But there's a minor problem... client addresses can change. +When your phone switches from WiFi to cellular, oops your IP changes. +NATs can also arbitrarily change your source IP/port because of course they can. + +Whenever this happens, **bye bye connection**, it's time to dial a new one. +And that means an expensive TCP + TLS handshake which takes at least 2-3 RTTs. +The users definitely notice the network hiccup when you're live streaming. + +WebRTC tried to solve this issue but made things worse. **Seriously**. + +A WebRTC implementation is *supposed* to allocate an ephemeral port for each connection. +That way, a WebRTC session can identified by the destination IP/port only; the source is irrelevant. +If the source IP/port changes, oh hey that's still Bob because the destination port is the same. + +But as OpenAI corroborates, this causes issues at scale because... +- Servers only have a limited number of ports available. +- Firewalls love to block ephemeral ports. +- Kubernetes lul + +You could probably abuse IPv6 to work around this, but IDK I never tried. +Twitch didn't even support IPv6... + +## Hacks by Necessity +So most services end up ignoring the WebRTC specifications. +Because of course they do. +We mux multiple connections onto a single port instead. + +At Twitch I literally hosted my WebRTC server on `UDP:443`. +That's *supposed* to be the HTTPS/QUIC port, but lying meant we could get past more firewalls. +Like the Amazon corporate network, which blocked all but ~30 ports. + +Discord uses ports 50000-50032, one for each CPU core. +As a result it gets blocked on more corporate networks. +But like, if you're on a Discord voice call on the Amazon corporate network, you probably won't be there much longer anyway. + +**HOWEVER, HUGE PROBLEM**. + +WebRTC is actually a bunch of standards in a trenchcoat, and 5 of those go over UDP directly. +It's [not hard](https://datatracker.ietf.org/doc/html/rfc5764) to figure out which protocol a packet is using, but we need to figure out how to route each packet. + +- **STUN**: We can choose a unique `ufrag` and route on it. +- **SRTP/SRTCP**: The browser chooses a random `ssrc` (u32)... which we can *usually* route based on. +- **DTLS**: Uh oh. We pray that [RFC9146](https://www.rfc-editor.org/rfc/rfc9146.html) gets widespread support. +- **TURN**: IDK I've never implemented it. + +So OpenAI only uses STUN: +> No protocol termination: Relay parses only STUN headers/ufrag; it uses cached state for subsequent DTLS, RTP, and RTCP, keeping packets opaque. + +It's a positive way of saying: +> We really hope the user's source IP/port never changes, because we broke that functionality. + +While it's impressive load balancing *anything* at OpenAI scale, their custom load balancing is a hack. +But a necessary hack, because the core protocol is at fault. + +
+ ![trenchcoat](/blog/webrtc-is-the-problem/trenchcoat.png) +
Personally, I would prefer 3 raccoons.
+
+ + +**Fun fact**: Browsers can randomly generate the same `ssrc`. +If there is a collision, and no source IP/port mapping is available, Discord attempts to decrypt the packet with each possible decryption key. +If the key worked, hey we identified the connection! + +## Round Trips and U +The OpenAI blog post starts with 3 requirements, one of them is: +> - Fast connection setup so a user can start speaking as soon as a session begins + +lol + +It takes a minimum of 8\* round trips (RTT) to establish a WebRTC connection. +While we *try* to run CDN edge nodes close enough to every user to minimize RTT, it adds up. + +Signaling server (ex. [WHIP](https://www.rfc-editor.org/rfc/rfc9725.html)): +- 1 for TCP +- 1 for TLS 1.3 +- 1 for HTTP + +Media server: +- 1 for ICE (with server) +- 2 for DTLS 1.2 +- 2 for SCTP + +\* It's complicated to compute, because some protocols can be pipelined to avoid 0.5 RTT. +Kinda like [half an A-Press](https://www.youtube.com/watch?v=kpk2tdsPh0A). + +
+ ![8 RTTS ahead](/blog/webrtc-is-the-problem/rtt.png) +
an [obscure reference](https://www.youtube.com/watch?v=catXIV3zAGY) to an obscure reference
+
+ +All of this nonsense is because WebRTC needs to support P2P. +It doesn't matter if you have a server with a static IP address, you still need to do this dance. + +It's extra depressing when the signaling and media server are running on the same host/process. +You end up doing two redundant and expensive handshakes. +It's like walking AND driving your car to the car wash. + + +## Forking the Protocol +**Fun Fact**: This was originally going to be a **Fun Fact**, but it gets its own section now. + +WebRTC practically encourages you to fork the protocol. +There's so many limitations that I've barely scratched the surface. +The browser implementation is owned by Google and tailor made for Google Meet, so it's also an existential threat for conferencing apps. + +**Sad Fact**: That's why every conferencing app (except Google Meet) tries to shove a native app down your throat. +**It's the only way to avoid using WebRTC**. + +OpenAI definitely has the ~~debt~~ funding to do this. +But I think they should also throw the baby out with the bath water. +Don't fork WebRTC, replace it with something that has browser support. + +**Fun Fact**: Discord has forked WebRTC *so hard* that native clients only implement a tiny fraction of the protocol. +No more SDP/ICE/STUN/TURN/DTLS/SCTP/SRTP/etc. +But we still have to implement everything for web clients. + +## But What Instead? +If not WebRTC, then what should you use for Voice AI? + +Honestly, if I was working at OpenAI, I'd just stream audio over WebSockets. +You can leverage existing TCP/HTTP infrastructure instead of inventing a custom WebRTC load balancer. +It makes for a boring blog post, but it's simple, works with Kubernetes, and SCALES. + +The day will come when it's desirable to drop audio packets. +Or you want to transmit video over the same connection. +Then you *should* switch to QUIC/WebTransport, because... + +## QUIC Fixes This +Look I've spent a long time dunking on WebRTC. +It's time to be positive and gush about QUIC. + +Remember the round trip discussion? +Good times. +Here's how many RTTs it takes to establish a QUIC connection: + +- 1 for QUIC+TLS + +But that was an easy one. +Let's dive into the deeper details of QUIC that you wouldn't know about unless you're a turbo QUIC nerd (it me). + +## Connection ID +Remember that link to [RFC9146](https://www.rfc-editor.org/rfc/rfc9146.html)? +In the DTLS section? +That you didn't click? +Good times. +The idea is literally copied from QUIC. + +QUIC ditches source IP/port based routing. +Instead, every packet contains a `CONNECTION_ID`, which can be 0-20 bytes long. +And most importantly for us: **it's chosen by the receiver**. + +So our QUIC server generates a unique `CONNECTION_ID` for each connection. +Now we can use a single port and still figure out when the source IP/port changes. +When it does, QUIC automatically switches to the new address instead of severing the connection like TCP. + +But if your gut reaction is: "how dare they! this is a waste of bytes!" +These bytes are *very* important, keep reading u nerd. + +## Stateless Load Balancing +I glossed over this, but OpenAI's load balancers (like most) depend on *shared state*. +Even if you have a sticky packet router, load balancers can still restart/crash. +Something has to store the mapping from source IP/port -> backend server. + +They're using a Redis instance to store the mapping of source IP/port to backend server. +Simple and easy, I approve. + +But do you know what is even simpler and easier? +Not having a database. +Here's how [QUIC-LB](https://www.ietf.org/archive/id/draft-ietf-quic-load-balancers-21.html) does it: + +When a client initiates a QUIC connection, the load balancer forwards the packet to healthy backend server. +The backend server completes the handshake and **encodes its own ID** into the `CONNECTION_ID`. +That way **every subsequent QUIC packet** contains the ID of the backend server. + +Now packets become trivial for load balancers to forward. +They don't need encryption keys or a routing table, just decode the first few bytes and forward it to that guy. +It doesn't even matter if the server reboots. + +**Zero state** also means **zero global state**. +These load balancers could listen on a *global* anycast address and forward packets *globally* to the indicated backend server. +Cloudflare uses this extensively; no need for a global Redis cluster. + +**Unpaid Shill**: [AWS NLB](https://aws.amazon.com/about-aws/whats-new/2025/11/aws-network-load-balancer-quic-passthrough-mode/) offers QUIC load balancing using QUIC-LB. +Other cloud providers need to step up their game and offer it too. + +## Anycast + Unicast +Based on the OpenAI blog, it sounds like they assign connections to regional load balancers. +Functional but lame. +[Anycast](https://en.wikipedia.org/wiki/Anycast) is way cooler. + +I brought this up in my ancient [Quic Powers](https://moq.dev/blog/quic-powers/) blog post, but I'll excuse you for not reading it (yet). +QUIC has something called `preferred_address` that is a game changer for load balancing. + +Let's say we have thousands of backend servers around the world that could accept a new connection. +We have them all advertise the same anycast address, ex. `1.2.3.4`. +When a client tries to connect to `1.2.3.4`, the magic internet routers forward the packet to one of the servers. + +Now, we could just use QUIC-LB and route traffic to the indicated backend. +But that would be boring. + +Instead, we can give each QUIC server a unique unicast address, ex. `5.6.7.8`. +The idea is that we use anycast for handshakes and unicast for stateful connections. + +- **Server**: Listen for QUIC packets on `1.2.3.4` and `5.6.7.8`. +- **Client**: Sends a QUIC handshake packet to `1.2.3.4`. +- **Server**: Establishes the QUIC connection, indicating `preferred_address=5.6.7.8`. +- **Client**: Sends future packets to `5.6.7.8`. + +When the server is overloaded and doesn't want more connections, it stops advertising `1.2.3.4`. +We won't drop existing connections because they're safe on unicast. + +Just like that, no load balancers needed! +The anycast address is basically a health check! + +Holy shit I wish I actually had the scale to build this. +Reach out if you work for the orange butthole company. + +
+ ![claude](/blog/webrtc-is-the-problem/claude.png) +
Looks something like this but orange.
+
+ +## Summary +**WebRTC** +1. hurts your product +2. hurts your load balancing +3. hurts your dog, maybe + +**QUIC** +1. loves your product +2. loves your load balancing +3. loves your dog, definitely + +
+ ![QUIC is chad](/blog/webrtc-is-the-problem/chad.png) +
I have labeled QUIC as the chad, therefore it is the superior protocol.
+
+ +## To Be Fair +I know many engineers at OpenAI and they are extremely bright. +They're dealing with unprecedented levels of stress. +They **MUST** scale and they **MUST** scale now. + +I'm just some guy who quit my job to work on a passion project. +I literally spend my time tracing memes. +It's easy for me to judge from my lofty position, like a movie critic ranting about how *they casted Jared Leto again*? + +I just don't think the obvious solution is a good fit for Voice AI. +And the obvious solution is very difficult to scale. +WebRTC is Jared Leto. +There I said it. + +And I'll be honest, MoQ isn't a perfect fit for Voice AI either. +It'll work, but a lot of the cache/fanout semantics are useless for 1:1 audio. +You should definitely use QUIC though. + +## Me +Anyway, hit me up if you want to chat: `meself@kixel.me` + +I'm cool. +You won't regret it. +Probably. + +Written by [@kixelated](https://github.com/kixelated). +![@kixelated](/blog/avatar.png)