Hacker News • 82 days ago

OpenAI's WebRTC Problem

IMP

6/10

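Key Summary

A WebRTC expert strongly criticizes OpenAI's use of WebRTC for voice AI. To keep latency low, WebRTC aggressively drops audio packets and offers no way to buffer, which can corrupt expensive LLM prompts. In particular, the author points out a structural contradiction: even though TTS audio is generated faster than real time, the design adds unnecessary waiting and leaves playback vulnerable to network fluctuations.

Translated Text

Published May 6, 2026

OpenAI's WebRTC Problem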

OpenAI posted a technical blog post a few days ago. It triggered me more than it should have, and I feel the urge to slap my meaty fingers on the keyboard.

You should NOT copy OpenAI. I don't think you should use WebRTC for voice AI. WebRTC is the problem.

Me

About six years ago, I wrote a WebRTC SFU at Twitch. Originally we used Pion (Go), just like OpenAI, but we forked it after benchmarking revealed that it was too slow. I ended up rewriting every protocol, because of course I did!

Just a year ago, I was at Discord and rewrote the WebRTC SFU in Rust. Because of course I did! You're probably noticing a trend.

Fun fact: WebRTC consists of roughly 45 RFCs dating back to the early 2000s, plus some de facto standards that are technically still drafts (e.g. TWCC, REMB). It's not a fun fact when you have to implement them all. You should consider me a Certified WebRTC Expert, which is why I never, ever want to use WebRTC again.

Product Fit

I'm going to cheat a little and start with the hot takes before they get cold. Don't worry, we'll get right back to the OpenAI blog post and load balancing.

WebRTC is a poor fit for voice AI. But that seems counterintuitive? WebRTC is for conferencing, and that involves speaking? And robots can speak, right?

WebRTC is too aggressive

Let's say I pull up the OpenAI app on my phone, say hi to the Scarlett Johansson-voiced AI, and then ask: "Should I walk or drive to the car wash?"

WebRTC is designed to degrade and drop my prompt during poor network conditions in order to keep latency low. What on earth?

WebRTC aggressively drops audio packets to keep latency low. If you've ever heard choppy audio on a conference call, that's WebRTC at work. The reasoning is that conference calls depend on rapid back-and-forth, so pausing to wait for audio is unacceptable.

...but as a user, I would much rather wait an extra 200 ms for my slow, expensive prompt to be processed accurately. After all, I'm paying good money to boil the ocean (that is, to burn an enormous amount of compute), and a garbage prompt means a garbage response. It's not like LLMs are particularly responsive anyway. But I'm not even allowed to wait.

It's impossible to even retransmit a WebRTC audio packet within a browser; we tried at Discord. The implementation is hard-coded for real-time latency.

Yes, voice AI agents will eventually get latency down to the conversational range. But reducing latency has trade-offs. I'm not even sure that purposely degrading audio prompts will ever be worth it.

TTS is faster than real-time

You speak into the microphone, the audio gets sent to one of OpenAI's billion servers, and then a GPU pretends to talk to you via text-to-speech (TTS). Neato.

Let's say it takes 2 s of GPU time to generate 8 s of audio. In an ideal world, we would stream the audio as it's being generated (over 2 s) and the client would start playing it back (over 8 s). That way, if there's a network blip, some audio is already buffered locally, and the user might not even notice.
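The 2 s / 8 s example can be worked through numerically. The sketch below is only an illustration under simplifying assumptions that are not in the article: generation runs at a constant rate, network delay is zero, and playback starts at t = 0. The function names `buffered_seconds` and `survives_blip` are hypothetical.

```python
def buffered_seconds(t, audio_s=8.0, gen_s=2.0):
    """Seconds of audio received but not yet played at wall-clock time t."""
    gen_rate = audio_s / gen_s             # 4x real-time in the article's example
    received = min(audio_s, gen_rate * t)  # streamed as fast as it is generated
    played = min(audio_s, t)               # playback runs at 1x from t = 0 (assumption)
    return received - played


def survives_blip(start, duration, audio_s=8.0, gen_s=2.0):
    """True if a network stall of `duration` seconds beginning at `start`
    would go unnoticed because the local buffer covers it."""
    received_all = (audio_s / gen_s) * start >= audio_s
    # Once everything has arrived, the network no longer matters.
    return received_all or buffered_seconds(start, audio_s, gen_s) >= duration


if __name__ == "__main__":
    for t in (0.5, 1.0, 2.0, 4.0):
        print(f"t={t:>3}s  buffered={buffered_seconds(t):.1f}s")
```

Under these assumptions the client holds 3 s of audio after one second of streaming and 6 s once generation finishes, so a stall of a second or two in the middle of the response would not be audible.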

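But nope, WebRTC has no buffering and renders based on arrival time. Seriously, timestamps are just suggestions. It's even more annoying when video enters the picture.

To compensate, OpenAI has to make sure packets arrive exactly when they should be rendered. They need to add a sleep in front of every audio packet before sending it. But if there's network congestion, oops, that audio packet is lost and will never be retransmitted.

OpenAI is literally introducing artificial latency and then aggressively dropping packets to "keep latency low". It's the equivalent of screen sharing a YouTube video instead of buffering it. The quality will be degraded.

Fun fact: WebRTC actually adds latency. It's not much, but WebRTC has a dynamic jitter buffer that can be sized anywhere from 20 ms to 200 ms (for audio). This is meant to smooth out network jitter, but none of it is needed if you transfer faster than real-time.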
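To make the cost of real-time pacing concrete, here is a toy model, not a description of how any real WebRTC stack schedules or discards audio. It assumes 20 ms audio frames, zero base network delay, a fixed jitter buffer, and a network stall that holds packets until it ends; all of these are assumptions for illustration, and `late_packets` is a hypothetical name.

```python
def late_packets(stall_start_ms, stall_end_ms, total_ms=8000,
                 frame_ms=20, jitter_buffer_ms=200):
    """Count packets that miss their render deadline under real-time pacing."""
    late = 0
    for i in range(total_ms // frame_ms):
        # The sender sleeps so that packet i leaves exactly when it is "due".
        send_at = i * frame_ms
        # Assumption: packets sent during the stall are held until it ends.
        in_stall = stall_start_ms <= send_at < stall_end_ms
        arrive_at = stall_end_ms if in_stall else send_at
        # The jitter buffer is the only slack the receiver has.
        deadline = send_at + jitter_buffer_ms
        if arrive_at > deadline:
            late += 1  # too late to render: dropped, never retransmitted
    return late


if __name__ == "__main__":
    dropped = late_packets(1000, 1500)
    print(f"{dropped} packets ({dropped * 20} ms of audio) missed their deadline")
```

In this model a 500 ms stall costs 300 ms of audio even with the largest 200 ms jitter buffer, whereas a client that had already received the audio faster than real-time would have lost nothing.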

Ports Ports Ports

Okay, now let's talk about the technical meat of the OpenAI article. We're no longer on a boat, but let's talk about ports.

Original Text (English)

published 5/6/2026

OpenAI’s WebRTC Problem

OpenAI posted a technical blog a few days ago. This blog post triggered me more than it should have. I have the urge to slap my meaty fingers on the keyboard.

You should NOT copy OpenAI. I don’t think you should use WebRTC for voice AI. WebRTC is the problem.

Me

Like 6 years ago I wrote a WebRTC SFU at Twitch. Originally we used Pion (Go) just like OpenAI, but forked after benchmarking revealed that it was too slow. I ended up rewriting every protocol, because of course I did!

Just a year ago, I was at Discord and I rewrote the WebRTC SFU in Rust. Because of course I did! You’re probably noticing a trend.

Fun Fact: WebRTC consists of ~45 RFCs dating back to the early 2000s. And some de-facto standards that are technically drafts (ex. TWCC, REMB). Not a fun fact when you have to implement them all. You should consider me a Certified WebRTC Expert. Which is why I never, never want to use WebRTC again.

Product Fit

I’m going to cheat a little bit and start with the hot takes before they get cold. Don’t worry, we’ll get right back to talking about the OpenAI blog post and load balancing, I promise.

WebRTC is a poor fit for Voice AI. But that seems counter-intuitive? WebRTC is for conferencing, and that involves speaking? And robots can speak, right?

WebRTC is too aggressive

Let’s say I pull up my OpenAI app on my phone. I say hi to Scarlett Johansson Sky and then I utter: should I walk or drive to the car wash?

WebRTC is designed to degrade and drop my prompt during poor network conditions.

wtf my dude

WebRTC aggressively drops audio packets to keep latency low. If you’ve ever heard distorted audio on a conference call, that’s WebRTC baybee. The idea is that conference calls depend on rapid back-and-forth, so pausing to wait for audio is unacceptable.

…but as a user, I would much rather wait an extra 200ms for my slow/expensive prompt to be accurate.
After all, I’m paying good money to boil the ocean, and a garbage prompt means a garbage response. It’s not like LLMs are particularly responsive anyway. But I’m not allowed to wait.

It’s impossible to even retransmit a WebRTC audio packet within a browser; we tried at Discord. The implementation is hard-coded for real-time latency or else.

Yes, Voice AI agents will eventually get the latency down to the conversational range. But reducing latency has trade-offs. I’m not even sure that purposely degrading audio prompts will ever be worth it.

TTS is faster than real-time

You speak into the microphone, it gets sent to one of OpenAI’s billion servers, and then a GPU pretends to talk to you via text-to-speech. Neato.

Let’s say it takes 2s of GPUs to generate 8s of audio. In an ideal world, we would stream the audio as it’s being generated (over 2s) and the client would start playing it back (over 8s). That way, if there’s a network blip, some audio is buffered locally. The user might not even notice the network blip.

But nope, WebRTC has no buffering and renders based on arrival time. Like seriously, timestamps are just suggestions. It’s even more annoying when video enters the picture.

To compensate for this, OpenAI has to make sure packets arrive exactly when they should be rendered. They need to add a sleep in front of every audio packet before sending it. But if there’s network congestion, oops we lost that audio packet and it’ll never be retransmitted.

OpenAI is literally introducing artificial latency, and then aggressively dropping packets to “keep latency low”. It’s the equivalent of screen sharing a YouTube video instead of buffering it. The quality will be degraded.

Fun fact: WebRTC actually adds latency. It’s not much, but WebRTC has a dynamic jitter buffer that can be sized anywhere from 20ms to 200ms (for audio). This is meant to smooth out network jitter, but none of this is needed if you transfer faster than real-time.
Ports Ports Ports

Okay but let’s talk about the technical meat of the OpenAI article. We’re no longer on a boat, but let’s talk about ports.

When you host a TCP server, you open a port (ex. 443 for HTTPS) and listen for incoming connections. The TCP client will randomly select an ephemeral port to use, and the connection is identified by the source/destination IP/ports. For example, a connection might be identified as 123.45.67.89:54321 -> 192.168.1.2:443.

But there’s a minor problem… client addresses can change. When your phone switches from WiFi to cellular, oops your IP changes. NATs can also arbitrarily change your source IP/port because of course they can.

Whenever this happens, bye bye connection, it’s time to dial a new one. And that means an expensive TCP + TLS handshake which takes at least 2-3 RTTs. The users definitely notice the network hiccup when you’re live streaming.

WebRTC tried to solve this issue but made things worse. Seriously.

A WebRTC implementation is supposed to allocate an ephemeral port for each connection. That way, a WebRTC session can be identified by the destination IP/port only; the source is irrelevant. If the source IP/port changes, oh hey that’s still Bob because the destination port is the same.

But as OpenAI corroborates, this causes issues at scale because…

- Servers only have a limited number of ports available.
- Firewalls love to block ephemeral ports.
- Kubernetes lul

You could probably abuse IPv6 to work around this, but IDK I never tried. Twitch didn’t even support IPv6…

Hacks by Necessity

So most services end up ignoring the WebRTC specifications. Because of course they do.

We mux multiple connections onto a single port instead. At Twitch I literally hosted my WebRTC server on UDP:443. That’s supposed to be the HTTPS/QUIC port, but lying meant we could get past more firewalls. Like the Amazon corporate network, which blocked all but ~30 ports.

Discord uses ports 50000-50032, one for each CPU core.
As a result it gets blocked on more corporate networks. But like, if you’re on a Discord voice call on the Amazon corporate network, you probably won’t be there much longer anyway.

HOWEVER, HUGE PROBLEM. WebRTC is actually a bunch of standards in a trenchcoat, and 5 of those go over UDP directly. It’s not hard to figure out which protocol a packet is using, but we need to figure out how to route each packet.

- STUN: We can choose a unique ufrag and route on it.
- SRTP/SRTCP: The browser chooses a random ssrc (u32)… which we can usually route based on.
- DTLS: Uh oh. We pray that RFC9146 gets widespread support.
- TURN: IDK I’ve never implemented it.

So OpenAI only uses STUN:

“No protocol termination: Relay parses only STUN headers/ufrag; it uses cached state for subsequent DTLS, RTP, and RTCP, keeping packets opaque.”

It’s a positive way of saying:

“We really hope the user’s source IP/port never changes, because we broke that functionality.”

While it’s impressive load balancing anything at OpenAI scale, their custom load balancing is a hack. But a necessary hack, because the core protocol is at fault.

Fun fact: Browsers can randomly generate the same ssrc. If there is a collision, and no source IP/port mapping is available, Discord attempts to decrypt the packet with each possible decryption key. If the key worked, hey we identified the connection!

Round Trips and U

The OpenAI blog post starts with 3 requirements, one of them is:

“Fast connection setup so a user can start speaking as soon as a session begins”

lol

It takes a minimum of 8* round trips (RTT) to establish a WebRTC connection. While we try to run CDN edge nodes close enough to every user to minimize RTT, it adds up.

Signaling server (ex. WHIP):
- 1 for TCP
- 1 for TLS 1.3
- 1 for HTTP

Media server:
- 1 for ICE (with server)
- 2 for DTLS 1.2
- 2 for SCTP

* It’s complicated to compute, because some protocols can be pipelined to avoid 0.5 RTT. Kinda like half an A-Press.
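The round-trip tally above is easy to turn into a rough setup-time estimate. This is a back-of-the-envelope sketch only: the per-protocol counts come from the list above, while the RTT values passed in are hypothetical, and the result ignores the pipelining savings mentioned in the footnote, so it is an upper bound for those counts rather than a measurement.

```python
# Round trips per protocol, as listed in the article.
SIGNALING_RTTS = {"TCP": 1, "TLS 1.3": 1, "HTTP": 1}
MEDIA_RTTS = {"ICE": 1, "DTLS 1.2": 2, "SCTP": 2}


def setup_round_trips():
    """Total round trips before media can flow (no pipelining assumed)."""
    return sum(SIGNALING_RTTS.values()) + sum(MEDIA_RTTS.values())


def setup_latency_ms(signaling_rtt_ms, media_rtt_ms):
    """Rough connection setup time, given an RTT to each server."""
    return (sum(SIGNALING_RTTS.values()) * signaling_rtt_ms
            + sum(MEDIA_RTTS.values()) * media_rtt_ms)


if __name__ == "__main__":
    print(setup_round_trips(), "round trips")
    # Hypothetical RTTs: 30 ms to the signaling server, 50 ms to the media server.
    print(setup_latency_ms(30, 50), "ms before the user can start speaking")
```

Even with an edge node only 50 ms away, eight sequential round trips add up to several hundred milliseconds before the first word can be sent.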
All of this nonsense is because WebRTC needs to support P2P. It doesn’t matter if you have a server with a static IP address, you still need to do this dance. It’s extra depressing when the signali

WebRTC, Voice AI, Realtime API, Network Architecture, OpenAI