This content originally appeared on DEV Community and was authored by SIP GAMES
“SDP made the rules, RTP plays the game.”
In the previous episode of SIP GAMES, we peeked inside the SDP invite that tells your opponent how you’d like to play: what codecs, what ports, and what IPs. But who actually carries the media?
Enter RTP: the Real-time Transport Protocol.
What is RTP?
Think of RTP as the courier that carries your voice across the network, broken into little time-stamped, sequence-numbered packages.
- SIP sets up the call
- SDP describes the media setup
- RTP sends the actual media (voice/video)
RTP usually runs on top of UDP (User Datagram Protocol): UDP is fast, and real-time audio can tolerate an occasional lost packet, just like a real conversation.
RTP Packet Structure
Here’s the basic layout of an RTP packet:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       Sequence Number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Synchronization Source (SSRC) Identifier            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|       Contributing Source (CSRC) Identifiers (optional)       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                     Payload (audio/video)                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Let’s decode this header:
RTP Header Fields
Field | What It Means |
---|---|
V | Version (always 2) |
P | Padding (if extra bytes added) |
X | Extension header present |
CC | CSRC count (used in conferencing) |
M | Marker bit (e.g. start of a talkspurt) |
PT | Payload type (codec, e.g., 0 = PCMU, 96 = dynamic) |
Sequence Number | Increments by 1 per packet, used to detect loss |
Timestamp | Used for media playback timing |
SSRC | Sender’s unique ID |
CSRC | IDs of other contributing streams (optional) |
Payload | Actual audio or video data |
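To make the field table concrete, here is a minimal sketch of decoding the 12-byte fixed header (plus any CSRC entries) with Python's `struct` module. The `RtpHeader` type and `parse_rtp_header` function are names made up for this post, not part of any library, and the sketch does not skip over an extension header when the X bit is set.

```python
import struct
from typing import NamedTuple, Tuple


class RtpHeader(NamedTuple):
    """Hypothetical container for the decoded header fields."""
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int
    csrcs: Tuple[int, ...]


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the fixed RTP header and the optional CSRC list."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # Network byte order: two single bytes, 16-bit seq, 32-bit timestamp, 32-bit SSRC.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F                      # low 4 bits of the first byte
    end = 12 + 4 * cc                   # each CSRC entry is 4 bytes
    if len(packet) < end:
        raise ValueError("packet truncated inside the CSRC list")
    csrcs = struct.unpack("!%dI" % cc, packet[12:end])
    return RtpHeader(
        version=b0 >> 6,                # top 2 bits
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=cc,
        marker=bool(b1 & 0x80),         # top bit of the second byte
        payload_type=b1 & 0x7F,         # remaining 7 bits
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
        csrcs=csrcs,
    )


# Illustrative packet: V=2, PT=0, seq=10567, timestamp=160000, SSRC=0x789ABC,
# followed by 160 bytes of placeholder payload.
sample = bytes.fromhex("80002947" "00027100" "00789abc") + b"\xff" * 160
header = parse_rtp_header(sample)
print(header)
```

The payload starts right after the CSRC list only when X is 0; a real parser would also need to handle the extension header and padding.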
Packetization Time (a.k.a. ptime)
What is packetization time?
It’s the duration of audio in each RTP packet, often advertised in SDP using a=ptime:20 (meaning 20 ms per packet).
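As a rough sketch of how you might pull that value out of an SDP body, the snippet below scans for the `a=ptime:` attribute. `parse_ptime` is a hypothetical helper, the SDP text is a made-up example, and the 20 ms fallback is an assumption for illustration (the real default depends on the codec).

```python
def parse_ptime(sdp: str, default_ms: float = 20.0) -> float:
    """Return the first a=ptime value found in an SDP body, in milliseconds.

    Falls back to default_ms (an assumed value) when the attribute is absent.
    """
    prefix = "a=ptime:"
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            return float(line[len(prefix):])
    return default_ms


# Made-up SDP fragment for illustration only.
example_sdp = """v=0
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
a=ptime:20
"""

ptime_ms = parse_ptime(example_sdp)
print(ptime_ms)
```

This takes the first `a=ptime` it sees; an SDP with several media sections would need the lookup scoped to each `m=` line.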
Common values:
Codec | Typical ptime | Result |
---|---|---|
PCMU | 20 ms | 50 packets/sec |
Opus | Variable | Can do 20–60 ms |
G.729 | 20 ms | Small, compressed |
Frequency of RTP Transmission
The number of RTP packets per second depends on the codec’s ptime.
Example:
- If ptime is 20 ms, that’s 50 packets/sec
- If it’s 30 ms, ~33.3 packets/sec
- Higher ptime = fewer packets = less overhead
- Lower ptime = lower latency and less audio lost per dropped packet, but more packets
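The trade-off in the list above can be put into numbers. This is a back-of-the-envelope sketch, assuming 40 bytes of per-packet overhead (20 IPv4 + 8 UDP + 12 RTP, no CSRCs or extensions) and a 64 kbit/s codec such as G.711; the function names are invented for this example.

```python
def packets_per_second(ptime_ms: float) -> float:
    """One packet is sent every ptime_ms milliseconds."""
    return 1000.0 / ptime_ms


def ip_bandwidth_kbps(codec_bitrate_bps: float, ptime_ms: float,
                      overhead_bytes: int = 40) -> float:
    """Approximate on-the-wire bandwidth at the IP layer, in kbit/s.

    overhead_bytes=40 is an assumption: IPv4 (20) + UDP (8) + RTP (12).
    """
    payload_bytes = codec_bitrate_bps / 8 * ptime_ms / 1000
    return packets_per_second(ptime_ms) * (payload_bytes + overhead_bytes) * 8 / 1000


for ptime in (10, 20, 30):
    print(ptime, packets_per_second(ptime), ip_bandwidth_kbps(64000, ptime))
```

For a 64 kbit/s codec at 20 ms this works out to 80 kbit/s at the IP layer, so a quarter of what you would see on the wire is headers, and shorter ptimes push that share higher.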
Why Do I Care?
If you’re implementing RTP or trying to debug call quality:
- Jitter? Check packet arrival times and timestamps
- Audio out of sync? Sequence or timestamp mismatch
- Silence or gaps? Packets lost or arriving too late
- Wrong codec? Check the Payload Type (PT) field
RTP is everywhere in VoIP, and understanding this header lets you trace, debug, and build your own media streamers.
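Two of those checks are easy to sketch in code: counting gaps in the sequence numbers, and estimating jitter from arrival times and timestamps. The jitter function follows the shape of the interarrival jitter estimator in RFC 3550 (a running average with gain 1/16). Both function names are hypothetical, and the sketch assumes an 8000 Hz clock and ignores timestamp wraparound.

```python
def count_lost(seqs):
    """Count missing packets from received 16-bit sequence numbers.

    seqs is in arrival order. Duplicates and late (reordered) packets are
    ignored rather than subtracted, so this is only a rough estimate.
    """
    lost = 0
    for prev, cur in zip(seqs, seqs[1:]):
        gap = (cur - prev) & 0xFFFF      # handles the 65535 -> 0 wrap
        if 1 < gap < 0x8000:             # forward jump of more than one
            lost += gap - 1
    return lost


def interarrival_jitter_ms(packets, clock_rate=8000):
    """Estimate jitter in milliseconds.

    packets: iterable of (arrival_time_seconds, rtp_timestamp) pairs.
    clock_rate=8000 is an assumption that fits G.711; other codecs differ.
    """
    jitter = 0.0
    prev_transit = None
    for arrival, rtp_ts in packets:
        # Relative transit time, measured in RTP timestamp units.
        transit = arrival * clock_rate - rtp_ts
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter / clock_rate * 1000.0


# Made-up capture: the third packet arrives 10 ms late.
capture = [(0.00, 0), (0.02, 160), (0.05, 320)]
print(count_lost([65534, 65535, 0, 1, 4]))
print(interarrival_jitter_ms(capture))
```

In practice you would feed these from a packet capture; Wireshark's RTP stream analysis reports similar loss and jitter figures.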
Example: A Real RTP Packet (with G.711)
Let’s say we’re using G.711 with 20 ms ptime.
- Payload Type: 0 (PCMU)
- Sequence Number: 10567
- Timestamp: 160000
- SSRC: 0x789ABC
- Payload: 160 bytes of G.711 data (8-bit PCM at 8000 Hz)
That’s 8,000 samples/sec × 0.020 sec = 160 samples per packet, and at 1 byte per sample, 160 bytes of payload.
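Here is a small sketch that assembles this example packet and the one that would follow it. `build_rtp_packet` is a name invented for this post, and the payload is just placeholder mu-law silence (0xFF bytes) rather than real audio.

```python
import struct

SAMPLE_RATE_HZ = 8000
PTIME_MS = 20
# 8000 samples/sec * 0.020 sec = 160 samples, 1 byte each for G.711.
SAMPLES_PER_PACKET = SAMPLE_RATE_HZ * PTIME_MS // 1000


def build_rtp_packet(seq, timestamp, ssrc, payload, payload_type=0, marker=False):
    """Build an RTP packet with no padding, extension, or CSRC entries."""
    byte0 = 2 << 6                                   # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = struct.pack("!BBHII", byte0, byte1,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)
    return header + payload


payload = b"\xff" * SAMPLES_PER_PACKET               # placeholder silence
packet = build_rtp_packet(10567, 160000, 0x789ABC, payload)

# The next packet: sequence number +1, timestamp +160 (one ptime of samples).
next_packet = build_rtp_packet(10568, 160000 + SAMPLES_PER_PACKET, 0x789ABC, payload)

print(len(packet))           # 12-byte header + 160-byte payload
print(packet[:12].hex())
```

Note that the timestamp advances by the number of samples in the packet, not by milliseconds, which is why it steps by 160 here.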
TL;DR
- RTP carries media after SIP/SDP sets things up
- Each RTP packet has headers: version, PT, seq, timestamp, etc.
- Ptime defines how much media is in each packet
- Frequency of packets is based on ptime
- Use RTP headers to debug and analyze VoIP issues
Up Next in SIP GAMES:
“Spy Tools for VoIP Agents”
We’ll break down the best open-source tools like Wireshark, sipp, and rtpengine, and show you how to capture, simulate, and troubleshoot your VoIP calls like a pro.
Follow @sip_games to keep leveling up your VoIP game.