
How Browser Calls Work: WebRTC, SIP, and the Tech Behind VoIP Apps

April 16, 2026 · 14 min read

You hit "Call" in your browser, hear a couple of rings, and your mom on the other side of the world picks up. The audio is crisp, no lag. Other times it goes the other way: echo, robotic voice, second-long pauses after which you both start talking at once. Why does this happen — and what's actually going on between the moment you click that button and the moment a phone rings thousands of miles away? Let's break it down.

The Three Pillars: WebRTC, SIP, PSTN

Getting a call from a browser to an ordinary phone takes three technologies working in concert.

WebRTC (Web Real-Time Communication) is a real-time audio and video engine built into every modern browser. There's nothing to install: Chrome, Firefox, Safari, and Edge support it out of the box. WebRTC handles capturing audio from the microphone, compressing it with a codec, encrypting it, and transmitting it. It is also responsible for acoustic echo cancellation (AEC). Without it, a call through laptop speakers would turn into a feedback loop: your microphone would pick up the other person's voice from your speakers and send it straight back, so they would hear themselves with a noticeable delay. The built-in AEC algorithm recognizes the sound coming from the speaker and subtracts it from the microphone signal in real time.
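To make "subtracts it from the microphone signal" concrete, here is a minimal sketch of one classic approach, a normalized LMS (NLMS) adaptive filter. This is not the browser's actual AEC, which is considerably more elaborate. The echo path, signal lengths, and parameter values are assumptions chosen for illustration, and the microphone signal contains only echo (no near-end speech).

```python
import random

def nlms_echo_cancel(far_end, mic, taps=8, step=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the speaker echo from the mic signal."""
    weights = [0.0] * taps    # running estimate of the speaker-to-mic echo path
    history = [0.0] * taps    # most recent far-end (speaker) samples, newest first
    cleaned = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate               # mic signal with estimated echo removed
        norm = sum(h * h for h in history) + eps
        weights = [w + step * error * h / norm for w, h in zip(weights, history)]
        cleaned.append(error)
    return cleaned

random.seed(0)
far_end = [random.uniform(-1.0, 1.0) for _ in range(2000)]   # stand-in for speaker audio
echo_path = [0.0, 0.6, 0.3, -0.1]                            # hypothetical room response
mic = [
    sum(echo_path[k] * far_end[n - k] for k in range(len(echo_path)) if n - k >= 0)
    for n in range(len(far_end))
]

cleaned = nlms_echo_cancel(far_end, mic)
echo_energy = sum(s * s for s in mic[-200:])
residual_energy = sum(s * s for s in cleaned[-200:])
print(f"residual / echo energy after adaptation: {residual_energy / echo_energy:.2e}")
```

Once the filter has adapted, almost none of the echo is left in the output. A production canceller also has to cope with both parties talking at once, which this sketch ignores.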

SIP (Session Initiation Protocol) is the signaling protocol. It doesn't carry voice — it manages the call: setting up the connection, negotiating parameters (which codec to use, where to send the packets), and tearing down the session. Think of SIP as a dispatcher: it doesn't haul the cargo, but it tells the trucks where to go.

PSTN (Public Switched Telephone Network) is the regular phone network that every mobile and landline number in the world connects to. This is the final destination, because grandma in another country isn't using WebRTC.

What Happens When You Hit "Call"

Let's walk through the entire path of a call, step by step.

Step 1: Capture, Compress, and Clean the Audio

The browser requests microphone access through the getUserMedia API. Once permission is granted, audio runs through a processing chain (noise suppression, automatic gain control, echo cancellation) and only then reaches the codec. The most widely used codec today is Opus. It's adaptive: on a good connection it delivers HD quality (sampling at up to 48 kHz), and on a weak one it automatically lowers the bitrate while keeping speech intelligible.

For comparison, traditional phone networks use the G.711 codec, which takes 64 kbps and sounds like a call from the '90s. Opus in wideband mode at the same 64 kbps sounds dramatically better.
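The 64 kbps figure follows directly from how G.711 samples audio, and a few lines of arithmetic show where it comes from. The sketch also estimates what the stream costs on the wire, assuming 20 ms packets and plain IPv4, UDP, and RTP headers with no extensions.

```python
SAMPLE_RATE_HZ = 8000        # G.711 sampling rate
BITS_PER_SAMPLE = 8          # one companded byte per sample
FRAME_MS = 20                # assumed packetization interval
HEADER_BYTES = 20 + 8 + 12   # IPv4 + UDP + RTP headers, no options or extensions

codec_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE
payload_bytes = SAMPLE_RATE_HZ * FRAME_MS // 1000 * BITS_PER_SAMPLE // 8
packets_per_second = 1000 // FRAME_MS
wire_bps = (payload_bytes + HEADER_BYTES) * 8 * packets_per_second

print(codec_bps)      # 64000
print(payload_bytes)  # 160
print(wire_bps)       # 80000
```

Under these assumptions, headers add a quarter on top of the codec bitrate, which is why per-packet overhead matters so much for low-bitrate voice.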

Step 2: Setting Up the Connection (Signaling)

While the audio is being compressed, the browser app sends call data (destination number, media parameters, authentication) over WebSocket to the VoIP app's server. Important nuance: WebRTC doesn't define a signaling protocol — the browser typically sends JSON or an SDP description (SDP is essentially a technical spec sheet for your device: which codecs it supports, which ports it's ready to receive audio on), not "real" SIP.
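As a rough illustration of what that "spec sheet" contains, the sketch below pulls the audio codec list out of an SDP offer. The SDP text is a trimmed, hypothetical example rather than a capture from a real browser, and the parser handles only the two line types it needs.

```python
SAMPLE_OFFER = """v=0
o=- 4611731400430051336 2 IN IP4 127.0.0.1
s=-
t=0 0
m=audio 9 UDP/TLS/RTP/SAVPF 111 0 8
c=IN IP4 0.0.0.0
a=rtpmap:111 opus/48000/2
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=fmtp:111 minptime=10;useinbandfec=1
"""

def parse_audio_codecs(sdp: str) -> list[tuple[int, str, int]]:
    """Return (payload_type, codec_name, clock_rate) in the offer's preference order."""
    preferred: list[int] = []
    rtpmap: dict[int, tuple[str, int]] = {}
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("m=audio "):
            # m=audio <port> <transport> <payload types in preference order>
            preferred = [int(pt) for pt in line.split()[3:]]
        elif line.startswith("a=rtpmap:"):
            pt_text, encoding = line[len("a=rtpmap:"):].split(" ", 1)
            name, clock_rate = encoding.split("/")[:2]
            rtpmap[int(pt_text)] = (name, int(clock_rate))
    return [(pt, *rtpmap[pt]) for pt in preferred if pt in rtpmap]

for payload_type, name, clock_rate in parse_audio_codecs(SAMPLE_OFFER):
    print(payload_type, name, clock_rate)
```

In this sample the offer lists Opus first, followed by the two G.711 variants (PCMU is mu-law, PCMA is A-law).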

The app server then converts this data into a SIP request and sends it to the VoIP provider (Twilio, Telnyx, etc.). The provider checks the user's balance, picks a route to the destination country, and establishes the SIP session.
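A minimal sketch of what such a SIP request might look like is shown below. The host names, user name, and phone number are hypothetical, and a real server would use a SIP library and add authentication, routing, and retransmission logic. Only the header layout follows the SIP specification.

```python
import uuid

def build_invite(caller: str, callee_number: str, provider_host: str,
                 app_host: str, sdp: str) -> str:
    """Assemble a minimal SIP INVITE that carries an SDP offer as its body."""
    body = sdp.replace("\r\n", "\n").replace("\n", "\r\n")   # SIP uses CRLF line endings
    headers = [
        f"INVITE sip:{callee_number}@{provider_host} SIP/2.0",
        # The branch parameter must start with the magic cookie z9hG4bK.
        f"Via: SIP/2.0/TLS {app_host};branch=z9hG4bK{uuid.uuid4().hex}",
        "Max-Forwards: 70",
        f"From: <sip:{caller}@{app_host}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{callee_number}@{provider_host}>",
        f"Call-ID: {uuid.uuid4()}@{app_host}",
        "CSeq: 1 INVITE",
        f"Contact: <sip:{caller}@{app_host};transport=tls>",
        "Content-Type: application/sdp",
        f"Content-Length: {len(body.encode('utf-8'))}",
    ]
    return "\r\n".join(headers) + "\r\n\r\n" + body

# Hypothetical values for illustration only.
invite = build_invite("alice", "+15550100", "sip.provider.example",
                      "app.example.com", "v=0\n")
print(invite)
```

The request line names the destination number, and the SDP body tells the provider which codecs and addresses the caller's side proposes for the media.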

Step 3: NAT Traversal — Getting Through Firewalls

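Here WebRTC solves one of the hardest problems in network communications. Most devices sit behind NAT: they have only a private address, and the router presents a shared public one to the outside world. To help audio packets find their way, WebRTC uses ICE, a framework that gathers candidate addresses and tests them, together with two helper protocols, STUN and TURN.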

A STUN server helps the device discover its external address. If a direct connection isn't possible (in corporate networks with strict firewall rules, for example), traffic flows through a TURN server that acts as a relay. Setting up and maintaining TURN infrastructure is one of the engineering tasks that VoIP apps like Calloza take on, so calls go through even from office networks with paranoid firewalls.
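The preference for direct paths over relays shows up in how ICE ranks candidate addresses. The sketch below applies the candidate priority formula and the recommended type preferences from the ICE specification. The addresses are documentation examples, and real ICE goes on to test pairs of local and remote candidates, which is omitted here.

```python
# Recommended type preferences: direct (host) first, relayed (TURN) last.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}

def candidate_priority(cand_type: str, local_pref: int = 65535, component_id: int = 1) -> int:
    """priority = 2^24 * type preference + 2^8 * local preference + (256 - component ID)."""
    return (TYPE_PREFERENCE[cand_type] << 24) + (local_pref << 8) + (256 - component_id)

candidates = [
    ("relay", "198.51.100.20:49152"),   # address allocated on a TURN server
    ("host", "192.168.1.10:54321"),     # the device's own private address
    ("srflx", "203.0.113.7:54321"),     # public address discovered via STUN
]

ranked = sorted(candidates, key=lambda c: candidate_priority(c[0]), reverse=True)
for cand_type, address in ranked:
    print(cand_type, address, candidate_priority(cand_type))
```

The relay candidate ends up last, so TURN is used only when the cheaper direct options fail their connectivity checks.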

Step 4: The Media Bridge — From Internet to Phone Network

The VoIP provider receives the encrypted audio stream from the browser and routes it to a media gateway — the point where the internet meets the PSTN. Here transcoding happens: audio in Opus format is re-encoded to G.711 (the phone network standard), and IP packets are converted into the digital phone-line signal (TDM — Time Division Multiplexing).

This is the quality bottleneck. Opus can sample audio at up to 48 kHz, which covers the full audible range up to about 20 kHz; G.711 samples at 8 kHz and carries only the narrow telephone band, roughly 300 Hz to 3.4 kHz. Conversion inevitably sacrifices audio fidelity by cutting higher frequencies. That's why a browser-to-phone call will usually sound worse than a call between two apps (Telegram → Telegram, for example), where Opus runs the entire route without transcoding.
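On the G.711 side of that transcoding step, each sample is squeezed into 8 bits using logarithmic companding. The sketch below uses the continuous mu-law curve to show the idea. It is not the bit-exact segmented encoding that G.711 defines, and the helper names are illustrative.

```python
import math

MU = 255   # companding constant used by mu-law

def mulaw_compress(x: float) -> float:
    """Map a sample in [-1, 1] onto the logarithmic mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)

def to_8bit(x: float) -> int:
    """Quantize a sample to one of 256 codes after companding."""
    return round((mulaw_compress(x) + 1) / 2 * 255)

def from_8bit(code: int) -> float:
    return mulaw_expand(code / 255 * 2 - 1)

for sample in (0.01, 0.1, 0.9):
    decoded = from_8bit(to_8bit(sample))
    print(f"{sample:.2f} -> code {to_8bit(sample):3d} -> {decoded:.4f}")
```

Because the curve is logarithmic, quiet samples get finer quantization steps than loud ones, which is how 8 bits per sample stay usable for speech.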

Step 5: The Phone on the Other End Rings

The gateway connects to the nearest phone switch in the destination country. The signal travels through the local carrier, reaches a cell tower (for mobile) or a local exchange (for landlines) — and the person on the other end hears the ring. The whole process from button press to first ring takes 1–3 seconds.

Why Call Quality "Fluctuates"

Voice data on the internet travels in UDP packets, each carrying a small chunk of audio, typically 20 milliseconds. Unlike TCP, UDP doesn't guarantee delivery or order. This is a deliberate choice: for voice, it's better to lose a packet than to wait for it to be resent and end up with lag.
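Since UDP itself carries no ordering information, each audio chunk is wrapped in an RTP header with a sequence number and a timestamp. Here is a minimal sketch of that 12-byte header. The payload type 111 and the SSRC value are assumptions, the payload is a placeholder rather than real Opus data, and in WebRTC the packet would additionally be protected with SRTP.

```python
import struct

RTP_VERSION = 2
OPUS_PAYLOAD_TYPE = 111   # assumption: a dynamic payload type agreed on in SDP
OPUS_CLOCK_HZ = 48000     # RTP clock rate used for Opus
FRAME_MS = 20

def build_rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int) -> bytes:
    """Prefix one audio frame with a 12-byte RTP header (no CSRCs or extensions)."""
    first_byte = RTP_VERSION << 6            # version=2, padding=0, extension=0, CC=0
    second_byte = OPUS_PAYLOAD_TYPE & 0x7F   # marker bit left clear
    header = struct.pack("!BBHII", first_byte, second_byte,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)
    return header + payload

def parse_rtp_header(packet: bytes) -> dict:
    first_byte, second_byte, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {"version": first_byte >> 6, "payload_type": second_byte & 0x7F,
            "seq": seq, "timestamp": timestamp, "ssrc": ssrc}

samples_per_frame = OPUS_CLOCK_HZ * FRAME_MS // 1000   # timestamp ticks per 20 ms packet
packets = [
    build_rtp_packet(b"\x00" * 80, seq=i, timestamp=i * samples_per_frame, ssrc=0x1234ABCD)
    for i in range(3)
]
for packet in packets:
    print(parse_rtp_header(packet))
```

The receiver uses the sequence number to detect loss and reordering, and the timestamp to schedule playout.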

Three key metrics determine call quality:

  • Latency — the time it takes for a packet to reach its destination. Up to 150 ms — the conversation feels natural. 150–300 ms — noticeable pauses creep in. 300+ ms — the call turns into an ordeal of constant interruptions.
  • Jitter — the variation in packet arrival times. Even if average latency is low, if packets arrive sometimes after 20 ms and sometimes after 200 ms, the audio starts to "swim" and stutter.
  • Packet loss — the percentage of packets that don't arrive at all. Up to 1% — usually unnoticeable, the codec fills the gaps. 3–5% — robotic distortions appear. Above 5% — the call is essentially unusable.
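These metrics are straightforward to compute from packet timestamps and sequence numbers. The sketch below uses a running jitter estimate in the style of the RTP specification and applies the thresholds listed above. The sample timings are made up, and the 1–3% loss range, which the list does not cover, is lumped in with "audible distortion" as an assumption.

```python
def interarrival_jitter(send_ms: list[float], recv_ms: list[float]) -> float:
    """Running jitter estimate: J += (|D| - J) / 16 for each consecutive packet pair."""
    jitter = 0.0
    for i in range(1, len(send_ms)):
        # D: how much the spacing between two packets changed in transit.
        transit_delta = (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        jitter += (abs(transit_delta) - jitter) / 16
    return jitter

def packet_loss_percent(received_seqs: list[int], expected_count: int) -> float:
    return 100.0 * (expected_count - len(set(received_seqs))) / expected_count

def describe_latency(latency_ms: float) -> str:
    if latency_ms <= 150:
        return "natural"
    if latency_ms <= 300:
        return "noticeable pauses"
    return "constant interruptions"

def describe_loss(loss_percent: float) -> str:
    if loss_percent <= 1:
        return "usually unnoticeable"
    if loss_percent <= 5:
        return "audible distortion"   # assumption: covers the unlisted 1-3% range too
    return "essentially unusable"

# Hypothetical measurements: packets sent every 20 ms, received unevenly.
send_ms = [0, 20, 40, 60, 80]
recv_ms = [50, 70, 110, 112, 130]
loss = packet_loss_percent([1, 2, 4, 5], expected_count=5)
print(f"jitter estimate: {interarrival_jitter(send_ms, recv_ms):.2f} ms")
print(f"loss: {loss:.0f}% ({describe_loss(loss)})")
print(f"latency 200 ms: {describe_latency(200)}")
```

Measuring true one-way latency requires synchronized clocks on both ends, so real systems usually work from round-trip time instead.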

Modern VoIP apps fight this in several ways. A jitter buffer accumulates packets and puts them back in order, smoothing out network unevenness. Packet Loss Concealment (PLC) is an algorithm that "guesses" missing fragments based on neighboring packets. Adaptive bitrate lets the Opus encoder automatically lower quality to maintain stability as the connection degrades.
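A toy version of the first two techniques fits in a few lines. The sketch below is a fixed-depth jitter buffer that reorders packets by sequence number and conceals a missing packet by repeating the previous frame. Real implementations adapt the buffer depth, handle sequence number wraparound, and use far better concealment; the class name and depth are illustrative.

```python
import heapq

class JitterBuffer:
    def __init__(self, depth: int = 3):
        self.depth = depth       # packets to collect before playout starts
        self.heap = []           # (seq, frame) pairs, smallest sequence number first
        self.next_seq = None     # sequence number due for playout next
        self.last_frame = b""    # reused for naive concealment

    def push(self, seq: int, frame: bytes) -> None:
        if self.next_seq is not None and seq < self.next_seq:
            return               # too late: playout has already moved past it
        heapq.heappush(self.heap, (seq, frame))

    def pop(self):
        """Return the next frame in order, or None while still filling the buffer."""
        if self.next_seq is None:
            if len(self.heap) < self.depth:
                return None
            self.next_seq = self.heap[0][0]
        # Discard duplicates or stale entries left at the head of the queue.
        while self.heap and self.heap[0][0] < self.next_seq:
            heapq.heappop(self.heap)
        if self.heap and self.heap[0][0] == self.next_seq:
            _, frame = heapq.heappop(self.heap)
            self.last_frame = frame
        else:
            frame = self.last_frame   # packet missing: repeat the previous frame
        self.next_seq += 1
        return frame

buf = JitterBuffer(depth=3)
for seq in (1, 3, 2):                       # packets arrive out of order
    buf.push(seq, f"frame{seq}".encode())
in_order = [buf.pop() for _ in range(3)]
buf.push(5, b"frame5")                      # packet 4 never arrives
concealed = buf.pop()
after_gap = buf.pop()
print(in_order, concealed, after_gap)
```

The trade-off is visible in the `depth` parameter: a deeper buffer absorbs more jitter but adds that much more delay to the conversation.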

Encryption: Who Can Listen to Your Call?

WebRTC always encrypts the media stream: encryption is a requirement of the spec, not an option. The mechanism has two stages. First, DTLS (Datagram Transport Layer Security) handles the handshake and secure key exchange; then the voice stream itself is encrypted via SRTP (Secure Real-time Transport Protocol).

This means that on the leg from browser to VoIP provider, the call is encrypted and practically impossible to intercept. However, on the leg from provider to PSTN, encryption depends on the phone network — and most often, there isn't any. This isn't a VoIP vulnerability — it's a limitation of phone infrastructure that's decades old.

It's important to understand the distinction: this is transport encryption, not end-to-end. The VoIP provider technically has access to the decrypted stream on its servers. For most users this is an acceptable level of security — comparable to how a regular phone call works.

Why a Call to India Costs $0.02 and a Call to Cuba Costs $0.80

International call pricing isn't determined by distance — it's determined by route and local market. VoIP providers buy wholesale "termination routes" from local carriers in each country. The price depends on several factors: competition among local carriers, government regulation (some countries impose high fees on incoming international calls), the type of destination number (mobile is usually pricier than landline), and route quality.
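To see how quickly the per-minute difference adds up, here is a small worked example using the two rates from the heading above. Billing per started minute is an assumption made for the sketch; actual billing increments vary by provider.

```python
import math
from decimal import Decimal

# Per-minute rates taken from the section heading.
RATES_PER_MINUTE = {"India": Decimal("0.02"), "Cuba": Decimal("0.80")}

def call_cost(destination: str, duration_seconds: int) -> Decimal:
    billed_minutes = math.ceil(duration_seconds / 60)   # assumption: per-started-minute billing
    return RATES_PER_MINUTE[destination] * billed_minutes

for destination in RATES_PER_MINUTE:
    print(destination, call_cost(destination, 605))     # a call of 10 minutes 5 seconds
```

Under that assumption the same call costs $0.22 to India and $8.80 to Cuba, a 40x difference that has nothing to do with distance.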

Routes come in different "qualities." Premium (CLI) routes correctly pass Caller ID, have stable connections and low latency, but cost more. Budget (non-CLI) routes are cheaper but Caller ID may show as "Unknown" or get spoofed. At Calloza we only use premium routes — the cents you'd save aren't worth the frustration when mom doesn't pick up an unknown number.

Exotic destinations are a separate story. Calls to satellite phones (Iridium, Thuraya) cost $3–5+ per minute because the signal literally travels through space. Some island nations set monopolistic rates on incoming calls, turning telephony into a source of state revenue. And some destinations are expensive because of telecom fraud: scammers generate mass calls to premium numbers, and operators raise rates for everyone to cover the losses.

The Whole Journey in 3 Seconds

To wrap it up — here's what happens between hitting the button and a phone ringing on the other side of the world:

  1. Microphone → WebRTC captures audio, removes echo and noise
  2. Opus codec → compresses the audio, adapting to connection speed
  3. DTLS + SRTP → encrypts the stream
  4. ICE/STUN/TURN → finds a path through NAT and firewalls
  5. Internet → UDP packets fly to the VoIP provider's server
  6. SIP routing → the provider picks the optimal route
  7. Media gateway → Opus → G.711, IP → TDM (from internet to phone network)
  8. PSTN → local carrier delivers the call
  9. The phone rings

Twenty years ago an international call required a phone card and a payphone. Today all you need is to open a browser tab.