VoIP Technology and Glossary

Although VoIP systems are capable of some unique functions (for example, video conferencing, instant messaging, and multicasting), this appendix concentrates on the ways in which VoIP can be used to replicate the voice conversation functionality of the public switched telephone network (PSTN).

There are several competing approaches to implementing VoIP. Each makes use of a variety of protocols to handle signaling, data transfer, and other tasks. To help describe the similarities and differences between these approaches, consider the following simplified description of a telephone call under VoIP:

  1. Caller picks up the phone (his terminal), hears a dial tone and dials a destination number.
  2. Destination number is mapped to a destination IP address.
  3. Call setup routines are invoked, handled by signaling protocols. Depending on the VoIP standard in use, this may involve a device (or function) known as a Gateway, and may also involve a Gatekeeper.
  4. Destination phone generates a ring, the called party picks up the phone, and a two-way conversation is established.
  5. Voice data is carried between the two endpoints by a media protocol, the Real-time Transport Protocol (RTP). A codec (coder/decoder) converts the sound of each caller’s voice to digital data, and the codec at the other end converts it back to analog audio signals.
  6. Conversation ends and the call is torn down. Again, this involves the signaling protocols appropriate to the particular implementation of VoIP, along with any Gateway or Gatekeeper functions.

Note that the instructions governing the call (the call setup and call teardown) are handled separately from the transmission of the call's actual content, that is, the encoding and packetization of the voice media.
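
The six steps above can be modeled as a small state machine. This is a hypothetical sketch, not the design of any particular VoIP stack: the `CallState` names and the `CallSession` class are invented for illustration. It tracks only signaling states and keeps a separate flag for the media (RTP) stream, mirroring the separation between call control and voice media described in the note.

```python
from enum import Enum, auto


class CallState(Enum):
    """Hypothetical signaling states for the simplified call flow."""
    IDLE = auto()
    DIALING = auto()     # step 1: off-hook, dial tone, digits collected
    RESOLVING = auto()   # step 2: number mapped to an IP address
    SETUP = auto()       # step 3: signaling protocol sets up the call
    RINGING = auto()     # step 4: far end is alerting
    CONNECTED = auto()   # step 5: media (RTP) flowing
    TORN_DOWN = auto()   # step 6: call released


# Allowed signaling transitions; any other transition is rejected.
TRANSITIONS = {
    CallState.IDLE: {CallState.DIALING},
    CallState.DIALING: {CallState.RESOLVING, CallState.TORN_DOWN},
    CallState.RESOLVING: {CallState.SETUP, CallState.TORN_DOWN},
    CallState.SETUP: {CallState.RINGING, CallState.TORN_DOWN},
    CallState.RINGING: {CallState.CONNECTED, CallState.TORN_DOWN},
    CallState.CONNECTED: {CallState.TORN_DOWN},
    CallState.TORN_DOWN: set(),
}


class CallSession:
    """Tracks signaling state separately from the media stream."""

    def __init__(self) -> None:
        self.state = CallState.IDLE
        self.media_active = False  # RTP is handled apart from signaling

    def advance(self, new_state: CallState) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state
        # Media flows only while the call is connected; teardown stops it.
        self.media_active = new_state is CallState.CONNECTED
```

Walking a session through DIALING, RESOLVING, SETUP, RINGING, and CONNECTED turns `media_active` on only at the last step; any out-of-order signaling event raises an error instead of silently changing state.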

VoIP Network Hardware

VoIP systems make use of specialized hardware such as terminals (VoIP phones), and may include Gateways, Gatekeepers, or Multipoint Control Units (MCUs).

Terminal  A device, such as a VoIP phone, that provides communications services and the user interface.

Gateway  A translation device that provides real-time, bi-directional communication between terminals on different types of networks, such as an IP network and the PSTN.

Gatekeeper  An H.323 device that performs call control duties for terminals.

MCU  Multipoint Control Unit used to coordinate between three or more terminals.

Figure J.1 Gateway and Gatekeeper (GK) in H.323 VoIP call

A Gateway acts as the interface between the packet switched network (IP) and the circuit switched network (PSTN), translating formats between the two. It is responsible for call setup and teardown, compression/decompression and packetization of the voice or other media, and conversion between signaling and media types. A Gateway is sometimes a dedicated device but, more commonly, routers with “voice modules” act as gateways. The software in the router handles call setup/teardown, voice encoding, and so forth, with LAN connectivity provided through the regular router ports.

There are several different types of gateways. The Media Gateway (MG) terminates voice calls from the PSTN, packetizes and compresses voice data into data packets, and delivers the data packets to the IP network. The Media Gateway Controller controls registration and manages resources for Gateways. It communicates with the Central Office Switch via Signaling Gateways. A Signaling Gateway provides transparent connections between IP networks and switched networks (including SS7 termination), and may provide additional translation.

A Gatekeeper provides management for a group of H.323 devices known as a zone. There is typically only one Gatekeeper per zone, but an installation may have one or more alternates for backup and load balancing. A Gatekeeper provides address translation, admission control, and bandwidth control for its zone. It may also provide call authorization, call management, and directory services.

Gatekeepers are optional (Microsoft NetMeeting, for example, does not use a Gatekeeper by default). A Gatekeeper is most often a software application, but it can also be integrated into a Gateway or terminal. If Gatekeepers are not used, Gateways must be configured to talk directly to one another.
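
The core Gatekeeper duties for a zone (address translation, admission control, and bandwidth control) can be sketched in a few lines. This is a hypothetical in-memory illustration: the `Gatekeeper` class, its method names, the alias table, and the single per-zone bandwidth budget are assumptions made for the example, and none of it reflects actual H.225.0 RAS message formats.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Gatekeeper:
    """Hypothetical zone manager: alias table plus a shared bandwidth budget."""
    zone_bandwidth_kbps: float
    registrations: Dict[str, str] = field(default_factory=dict)  # alias -> IP
    in_use_kbps: float = 0.0

    def register(self, alias: str, ip_address: str) -> None:
        """Registration: an endpoint announces its alias (e.g. a phone number)."""
        self.registrations[alias] = ip_address

    def translate(self, alias: str) -> Optional[str]:
        """Address translation: dialed number -> destination IP address."""
        return self.registrations.get(alias)

    def admit(self, called_alias: str, call_kbps: float) -> Optional[str]:
        """Admission and bandwidth control: return the callee's IP, or None to reject."""
        destination = self.translate(called_alias)
        if destination is None:
            return None  # unknown number
        if self.in_use_kbps + call_kbps > self.zone_bandwidth_kbps:
            return None  # zone bandwidth budget would be exceeded
        self.in_use_kbps += call_kbps
        return destination

    def release(self, call_kbps: float) -> None:
        """Return a finished call's bandwidth to the zone."""
        self.in_use_kbps = max(0.0, self.in_use_kbps - call_kbps)
```

For example, `Gatekeeper(zone_bandwidth_kbps=200)` would admit two G.711 calls at the 87.2 kbps per-call figure from Table J.1 and reject a third until one of them is released.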

A Multipoint Control Unit (MCU) typically supports conferences between three or more stations. It can be a stand-alone device or integrated into a Gateway, Gatekeeper, or terminal. The MCU consists of two functional entities: the Multipoint Controller (MC) and the Multipoint Processor (MP). The MC handles control and signaling for conference support. The MP receives and processes streams in the conference.
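
One job of the Multipoint Processor is to combine the participants' audio. A common technique, shown here as a sketch rather than as the method of any specific MCU, is "mix-minus": each participant receives the sum of everyone else's decoded samples, clipped to the 16-bit PCM range, so no one hears their own voice echoed back. The function name and the assumption of time-aligned, equal-length 16-bit frames are ours.

```python
from typing import Dict, List

PCM_MIN, PCM_MAX = -32768, 32767  # 16-bit linear PCM range


def mix_minus(frames: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """For each participant, mix every *other* participant's audio frame.

    `frames` maps a participant id to one frame of decoded 16-bit PCM
    samples. All frames are assumed to be time-aligned and equal in length.
    """
    if not frames:
        return {}
    length = len(next(iter(frames.values())))
    # Sum all participants once, then subtract each listener's own samples.
    total = [sum(frame[i] for frame in frames.values()) for i in range(length)]
    return {
        pid: [max(PCM_MIN, min(PCM_MAX, total[i] - own[i])) for i in range(length)]
        for pid, own in frames.items()
    }
```

Computing the full sum once and subtracting each listener's own frame keeps the work linear in the number of participants instead of re-mixing N-1 streams for every listener.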

VoIP Protocols

Like every other aspect of Internet communications, VoIP has evolved rapidly since its introduction in 1995, and continues to evolve today. The standards show the influence of their creators: the traditional telecommunications players, the Internet community, and the communications equipment manufacturers such as Cisco and 3Com.

In roughly chronological order of introduction, the most widely used VoIP systems are:

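H.323  Developed by the International Telecommunication Union (ITU-T)

MGCP and Megaco (H.248)  Gateway control protocols offered as alternatives to H.323; MGCP grew out of earlier work by Cisco and others, and Megaco was later standardized jointly by the IETF and the ITU-T as H.248

SIP  Developed by the Internet Engineering Task Force (IETF) as an alternative to H.323, and promoted early by vendors such as 3Com

SKINNY  A Cisco proprietary system allowing skinny clients to communicate with H.323 systems by off-loading some functions to a Call Manager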

Each of these approaches involves the use of multiple protocols. In the sections below, we split these software tools into three groups: Signaling protocols, Media protocols, and Codecs. The media protocols (RTP and RTCP) are common to all types of VoIP, and the codecs are also widely used. The principal distinction between one VoIP setup and another is the signaling protocols it uses, along with related devices or functions such as Gateways and Gatekeepers.

Signaling protocols

In VoIP communication, the signaling that controls the conversation is distinct from the actual stream of data carrying the voice content of the conversation. The principal families of VoIP signaling protocols are described briefly below.

Note:   The data streams of VoIP are carried in connectionless UDP packets. Many setups also use UDP for signaling, but some require connection-oriented TCP instead, and a few permit either TCP or UDP.

H.323 protocol suite

H.323 is an ITU-T standard that provides multimedia video conferencing, voice, and data capability for use over packet-switched networks. It is the most widely deployed VoIP protocol in enterprise and carrier markets.

  • H.225.0 defines call signaling between endpoints, and the RAS (Registration, Admission, and Status) signaling between endpoints and the Gatekeeper
  • H.225.0 Annex G and H.501 define the procedures and protocol for communication within and between Peer Elements
  • H.245 is the protocol used to control establishment and closure of media channels within the context of a call and to perform conference control
  • H.460.x is a series of version-independent extensions to the base H.323 protocol
  • T.120 specifies how to do data conferencing
  • T.38 defines how to relay fax signals
  • V.150.1 defines how to relay modem signals
  • H.235 defines security within H.323 systems
  • X.680 defines the ASN.1 syntax used by the Recommendations
  • X.691 defines the Packed Encoding Rules (PER) used to encode messages for transmission on the network


MGCP

Media Gateway Control Protocol (MGCP) is used for controlling telephony gateways from external call control elements called media gateway controllers or call agents. A telephony gateway is a network element that provides conversion between the audio signals carried on telephone circuits and data packets carried over the Internet or over other packet networks.
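
MGCP messages are plain text: a command line gives the verb, a transaction identifier, the endpoint name, and the protocol version, followed by "Name: value" parameter lines. The sketch below formats a CreateConnection (CRCX) command in the style of the examples in the MGCP 1.0 specification (RFC 3435). The helper function, the example endpoint, and the call identifier are illustrative assumptions; a real call agent would also send an SDP body and handle responses and retransmissions.

```python
from typing import Dict


def mgcp_command(verb: str, transaction_id: int, endpoint: str,
                 params: Dict[str, str]) -> str:
    """Format an MGCP 1.0 command line followed by 'Name: value' parameters."""
    # RFC 3435 limits transaction identifiers to 1..999999999.
    if not 1 <= transaction_id <= 999_999_999:
        raise ValueError("transaction id must be in 1..999999999")
    lines = [f"{verb} {transaction_id} {endpoint} MGCP 1.0"]
    lines += [f"{name}: {value}" for name, value in params.items()]
    return "\r\n".join(lines) + "\r\n"


print(mgcp_command("CRCX", 1204, "aaln/1@rgw.example.net",
                   {"C": "A3C47F21456789F0",   # call identifier
                    "L": "p:10, a:PCMU",       # 10 ms packets, G.711 u-law
                    "M": "recvonly"}))         # connection mode
```

The call agent sends commands like this to the gateway; the gateway then does the actual media work of terminating the circuit and packetizing the voice.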

MEGACO (H.248)

Media Gateway Control protocol (H.248) is used between elements of a physically decomposed multimedia gateway. This protocol creates a general framework suitable for gateways, multipoint control units (MCUs) and interactive voice response units (IVRs).


SGCP

Simple Gateway Control Protocol (SGCP) is used to control telephony gateways from external call control elements. It was one of the predecessors of MGCP.


SIP

Session Initiation Protocol (SIP) is used to initiate VoIP connections. SIP provides the protocol mechanisms that let end-user systems and proxy servers offer services such as call forwarding, called and calling number identification, and caller and called party authentication. See IETF RFC 2543, since obsoleted by RFC 3261.
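
SIP messages are text-based, with a syntax modeled on HTTP. The sketch below assembles a minimal INVITE carrying the headers RFC 3261 requires in every request (Via, Max-Forwards, To, From, Call-ID, CSeq), plus Contact and Content-Length. The function, the example.com addresses, and the random tag/branch generation are illustrative assumptions; a real user agent would normally attach an SDP body describing the media (codec and RTP port) and run the SIP transaction state machine.

```python
import secrets


def sip_invite(caller: str, callee: str, local_host: str,
               local_port: int = 5060) -> str:
    """Build a minimal SIP INVITE request with no message body.

    `caller` and `callee` are addresses such as "alice@example.com".
    """
    branch = "z9hG4bK" + secrets.token_hex(8)    # RFC 3261 branch prefix
    from_tag = secrets.token_hex(4)              # caller's half of the dialog ID
    call_id = f"{secrets.token_hex(8)}@{local_host}"
    user = caller.split("@")[0]
    lines = [
        f"INVITE sip:{callee} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host}:{local_port};branch={branch}",
        "Max-Forwards: 70",
        f"To: <sip:{callee}>",
        f"From: <sip:{caller}>;tag={from_tag}",
        f"Call-ID: {call_id}",
        "CSeq: 1 INVITE",
        f"Contact: <sip:{user}@{local_host}:{local_port}>",
        "Content-Length: 0",
    ]
    # Header lines end in CRLF; an empty line separates headers from the body.
    return "\r\n".join(lines) + "\r\n\r\n"


print(sip_invite("alice@example.com", "bob@example.com", "192.0.2.10"))
```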


SKINNY

As a generic computing term, “skinny” refers to a device with fewer features or functions than the common or “fat” version of the same device. In VoIP, SKINNY is a proprietary Cisco system intended to allow skinny clients to communicate with H.323 VoIP systems, by placing most of the required H.323 processing capabilities in an intervening device called a Call Manager. The skinny client and the Call Manager use a simple messaging set called Skinny Client Control Protocol (SCCP) to communicate with each other over TCP/IP. SKINNY systems use a proxy for the H.225 and H.245 signaling and use RTP/UDP/IP for audio.

Media protocols

RTP and RTCP (RFC 3550) are used to transmit media such as audio and video over IP networks. In VoIP, RTP and RTCP are normally carried in UDP packets.


RTP

The Real-time Transport Protocol (RTP) provides end-to-end network transport functions suitable for applications transmitting real-time data such as audio, video or simulation data, over multicast or unicast network services. RTP does not address resource reservation and does not guarantee quality-of-service for real-time services. The data transport is augmented by a control protocol (RTCP) to allow monitoring of the data delivery in a manner scalable to large multicast networks, and to provide minimal control and identification functionality. RTP and RTCP are designed to be independent of the underlying transport and network layers. The protocol supports the use of RTP-level translators and mixers.
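
Every RTP packet begins with a 12-byte fixed header (RFC 3550, section 5.1) carrying the version, marker bit, payload type, sequence number, timestamp, and SSRC. The sketch below decodes that header. The function name and the returned dictionary are our own choices; CSRC lists and header extensions are not parsed, and the payload-type table covers only a few of the static audio assignments from the RTP audio/video profile (RFC 3551).

```python
import struct

# A few static audio payload types from RFC 3551 (many codecs use dynamic types).
STATIC_PAYLOAD_TYPES = {0: "PCMU (G.711 u-law)", 3: "GSM", 4: "G723",
                        8: "PCMA (G.711 a-law)", 9: "G722", 15: "G728", 18: "G729"}


def parse_rtp_header(packet: bytes) -> dict:
    """Decode the 12-byte fixed RTP header (CSRCs and extensions not parsed)."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the 12-byte fixed RTP header")
    # Network byte order: 2 single bytes, 16-bit sequence, 32-bit timestamp, 32-bit SSRC.
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    payload_type = b1 & 0x7F
    return {
        "version": b0 >> 6,              # 2 for current RTP
        "padding": bool(b0 & 0x20),
        "extension": bool(b0 & 0x10),
        "csrc_count": b0 & 0x0F,
        "marker": bool(b1 & 0x80),
        "payload_type": payload_type,
        "codec": STATIC_PAYLOAD_TYPES.get(payload_type, "dynamic/unknown"),
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }
```

The `payload_type` field is where an analyzer finds the codec ("encoding method") for static assignments; dynamic types (96 to 127) have to be matched against the signaling, typically SDP.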


RTCP

The RTP Control Protocol (RTCP) is based on the periodic transmission of control packets to all participants in the session, using the same distribution mechanism as the data packets. The underlying protocol must provide multiplexing of the data and control packets, for example using separate port numbers with UDP.
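
With UDP, the usual way to keep the data and control packets apart is by port number: RFC 3550 recommends that RTP use an even destination port and that the corresponding RTCP stream use the next higher (odd) port. A minimal helper applying that convention (the function name is hypothetical):

```python
def rtcp_port_for(rtp_port: int) -> int:
    """Return the RTCP port paired with an RTP port (RTP even, RTCP = RTP + 1)."""
    if not 0 < rtp_port < 65535:
        raise ValueError("port out of range for an RTP/RTCP pair")
    if rtp_port % 2:
        raise ValueError("RTP port should be even under the RFC 3550 convention")
    return rtp_port + 1
```

Signaling may also announce a different RTCP port explicitly, so an analyzer should treat this pairing as a default rather than a rule.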


Codecs

A codec (coder/decoder) handles the conversion of analog signals to digital form, and back again. VoIP systems may use any of a wide variety of codecs for voice, video, or both. In VoIP, the codec used is often referred to as the encoding method or the payload type of the RTP packet. Codec designers seek to balance three primary factors: the speed of the encoding/decoding operations (packetization delay), the quality and fidelity of the sound and/or video signal, and the size of the resulting encoded data stream. In Table J.1, note that the Data Rate column refers to the compressed (encoded) data produced by the codec, while the Bandwidth column gives the approximate network bandwidth per call once IP, UDP, RTP, and Ethernet header overhead is added to each packet.

Table J.1 VoIP codec comparison*

Codec            Data Rate   Packetization Delay   Bandwidth
G.711u           64.0 kbps    1.0 msec             87.2 kbps
G.711a           64.0 kbps    1.0 msec             87.2 kbps
G.726            32.0 kbps    1.0 msec             55.2 kbps
G.729             8.0 kbps   25.0 msec             31.2 kbps
G.723.1 MP-MLQ    6.3 kbps   67.5 msec             21.9 kbps
G.723.1 ACELP     5.3 kbps   67.5 msec             20.8 kbps
* From “Taking Charge of Your VoIP Project,” Cisco Press 2004
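
The Bandwidth figures in Table J.1 can be reproduced by adding per-packet header overhead to the codec payload. The sketch below assumes what appear to be the table's underlying parameters: 40 bytes of IP/UDP/RTP headers and 18 bytes of Ethernet framing per packet, 20 ms of speech per packet for G.711, G.726, and G.729, and one 30 ms frame per packet for G.723.1. Other packetization intervals, header compression, or different link types change the result. The function and constant names are ours.

```python
IP_UDP_RTP_HEADER_BYTES = 40  # 20 (IPv4) + 8 (UDP) + 12 (RTP)
ETHERNET_HEADER_BYTES = 18    # assumed Layer 2 overhead per frame


def per_call_bandwidth_kbps(payload_bytes: int, packet_interval_ms: float,
                            l2_overhead: int = ETHERNET_HEADER_BYTES) -> float:
    """One-direction network bandwidth for a single voice stream, in kbps."""
    packets_per_second = 1000.0 / packet_interval_ms
    bytes_per_packet = payload_bytes + IP_UDP_RTP_HEADER_BYTES + l2_overhead
    return bytes_per_packet * 8 * packets_per_second / 1000.0


# (payload bytes per packet, packet interval in ms)
CODECS = {
    "G.711": (160, 20),          # 64 kbps x 20 ms
    "G.726": (80, 20),           # 32 kbps x 20 ms
    "G.729": (20, 20),           # 8 kbps x 20 ms
    "G.723.1 MP-MLQ": (24, 30),  # 6.3 kbps, one 24-byte frame per 30 ms
    "G.723.1 ACELP": (20, 30),   # 5.3 kbps, one 20-byte frame per 30 ms
}

for name, (payload, interval) in CODECS.items():
    print(f"{name:16s} {per_call_bandwidth_kbps(payload, interval):5.1f} kbps")
```

This prints 87.2, 55.2, 31.2, 21.9, and 20.8 kbps, matching the table. It also shows why low-rate codecs save less than their data rates suggest: for G.729, the 58 bytes of headers are nearly three times the 20-byte voice payload.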

OmniPeek can correctly identify and perform analysis based on a wide range of VoIP codecs. It can also play back and perform passive MOS (Mean Opinion Score) analysis on the most commonly used voice codecs, as shown in Table J.3.

OmniPeek Analysis
Passive MOS
G.711 u-law
G.711 a-law
G.723.1 5.3K
G.723.1 6.3K
G.726 16kb
G.726 24kb
G.726 32kb
G.726 40kb
GSM (Full Rate)
G.728, 16k
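
To make the codec's digital conversion concrete, here is a sketch of G.711 u-law companding, the first codec in the list above. It follows the widely used segment-based reference approach (add a bias of 0x84, clip at 32635, encode a 3-bit segment and a 4-bit step, then invert the bits); the function names are ours, and the input is assumed to be 16-bit linear PCM that has already been sampled from the analog signal.

```python
BIAS = 0x84   # 132; ensures every biased magnitude falls into a segment
CLIP = 32635  # largest magnitude that still fits in 15 bits after adding BIAS


def linear_to_ulaw(sample: int) -> int:
    """Compress one 16-bit linear PCM sample to an 8-bit u-law code."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Segment (exponent) = position of the highest set bit above bit 7.
    exponent = magnitude.bit_length() - 8
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # u-law transmits the bitwise complement of sign|segment|step.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def ulaw_to_linear(code: int) -> int:
    """Expand an 8-bit u-law code back to a 16-bit linear PCM sample."""
    code = ~code & 0xFF
    sign = code & 0x80
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude
```

Because step sizes double from one segment to the next, quiet sounds are coded with fine resolution and loud ones coarsely, which is how G.711 fits telephone-quality speech into 8 bits per sample (64 kbps at 8000 samples per second).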


Glossary

Bias Bias is a measure of cumulative jitter. The bias at any given point in the RTP packet stream indicates the amount by which packet arrival times are deviating from the expected packet arrival times. Optimal jitter buffer settings can be made based on the maximum bias for a stream.
Codec Coder/Decoder. Converts voice, video, and other analog signals to a digital form acceptable to modern digital PBXs and digital transmission systems. It then converts the digital information back to analog signals so that you can hear and understand the other party.
FEP Front-End Processor. A dedicated communications system that intercepts and handles activity for the host. It is designed to offload from the host computer all or most of its data communication functions.
Gatekeeper The Gatekeeper is an optional component in the H.323 system which is primarily used for admission control and address resolution. The Gatekeeper may allow calls to be placed directly or it may route the call signaling through itself to perform functions such as follow-me/find-me and forward on busy.
Gateway An actual protocol translation computer or logical boundary area within a network computer that serves to interconnect data communications networks with different protocols. The Gateway is composed of a Media Gateway Controller (MGC) and a Media Gateway (MG). The MGC handles call signaling and other non-media-related functions.
GoS Grade of Service.
IETF Internet Engineering Task Force
ITU-T International Telecommunication Union, Telecommunication Standardization Sector
Jitter The slight movement of a transmission signal in time or phase that can introduce errors and loss of synchronization for high-speed synchronous communications. In VoIP, jitter is the absolute value of the difference between actual packet arrival time and the expected packet arrival time. See also Packet Delay Variation.
LIM Link Interface Module.
Media In the context of telecommunications, media is most often the conduit or link that carries transmissions. In the context of VoIP, it is the encoded voice or video, sometimes extended to include the packets that carry this information.
MEGACO Media Gateway Control protocol, standardized by the ITU-T as H.248. Despite the similar name, it is a different protocol from MGCP.
MGCP Media Gateway Control Protocol designed to bridge between circuit-based public switched telephone networks (PSTN) and Internet Protocol (IP) technology-based networks.
MOS Mean Opinion Score. Because of the inherently subjective nature of voice quality testing, one method to quantify quality is to have relatively large numbers of human listeners rate voice quality as part of a controlled and well-defined test process. The advantage of this method is that clarity evaluations are derived directly from the individuals who will experience the voice call. Another advantage is the statistical validity provided by numerous evaluators. This, in fact, has been the method used for many years and is defined as Mean Opinion Score: a scale of 1-5 in which 5 is best.
Packet Delay Variation Deviation from the expected arrival time of a media packet. In VoIP, media packets (those containing the encoded voice or video, for example) are sent in a continuous stream. If you know the encoding method in use and the timestamp of a previous packet, you can predict an expected arrival time for subsequent packets in the stream. Packet delay variation is the difference between the actual arrival time and this expected arrival time.
PAMS Perceptual Analysis Measurement System. PAMS is a model for objectively measuring voice quality. PAMS provides a Listening Quality Score and a Listening Effort Score, both of which correlate to MOS scores and are on the same 1 to 5 scale.
PDH Plesiochronous Digital Hierarchy. Developed to carry digitized voice over twisted pair cabling more efficiently. Includes familiar standards such as T1, T3, E1, E3, and so forth.
PESQ Perceptual Evaluation of Speech Quality. An objective speech quality measurement applicable to both speech codecs and end-to-end measurements. Listening quality rating: 1-5. PESQ combines the strongest parts of PAMS and PSQM+ algorithms, and is designed to predict subjective opinion scores of a degraded audio sample.
RAS Registration, Admission, and Status signaling function, part of the H.225.0 call signaling protocol. Governs registration, admission, and bandwidth functions between endpoints and gatekeepers (not used if no gatekeeper is present).
RFC Request For Comments, a basic document describing a proposed standard or other approaches to Internet usage, prepared under the auspices of the IETF.
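RTCP RTP Control Protocol. A companion to RTP, defined in the same specification (RFC 3550), that periodically sends control packets to all session participants to monitor data delivery and provide minimal control and identification.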
RTP Real-time Transport Protocol used for streaming real-time audio or audio-visual media over IP in packets. Supports transport of real-time data like interactive voice and video over packet-switched networks.
SCCP Skinny Client Control Protocol. A Cisco protocol with a limited message set used to communicate between a skinny VoIP client and a Cisco Call Manager, over TCP/IP.
SIP Session Initiation Protocol. A signaling protocol developed to set up, modify, and tear down multimedia sessions over the Internet.
SKINNY A Cisco VoIP system intended to allow skinny clients to communicate with H.323 VoIP systems, by placing most of the required H.323 processing capabilities in a separate device called a Call Manager. The skinny client and the Call Manager use a simple messaging set called Skinny Client Control Protocol (SCCP) to communicate with each other over TCP/IP. SKINNY systems use a proxy for the H.225 and H.245 signaling and use RTP/UDP/IP for audio. As a generic computing term, a “skinny” device is one that is slimmed down by eliminating some functions or capabilities found in the common or “fat” versions of the same device.
SS7 Signaling System Seven. An architecture for performing out-of-band signaling in support of the call-establishment, billing, routing, and information-exchange functions of the public switched telephone network (PSTN). It identifies functions to be performed by a signaling-system network and a protocol to enable their performance.
SSRC Synchronization Source identifier. This 32-bit number is chosen randomly, with the intent that no two synchronization sources within the same RTP session have the same SSRC identifier.
VoIP Voice over Internet Protocol. The technologies used to transmit voice conversations over a data network using the Internet Protocol.
VQMS Voice Quality Measurement System.
WAV WAVEform audio format. The Microsoft and IBM audio file format for storing audio on PCs.
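
The Jitter and Packet Delay Variation entries above both compare actual and expected packet arrival times. The sketch below computes two such measures from a list of (RTP timestamp, arrival time) pairs: the per-packet delay variation relative to the first packet, where the expected arrival is predicted from the RTP timestamps and the codec clock rate, and the smoothed interarrival jitter estimator defined in RFC 3550 (section 6.4.1), which averages the absolute transit-time difference with a gain of 1/16. The function names, the 8000 Hz default clock rate (typical for G.711 and G.729), and the choice of the first packet as the reference are assumptions; OmniPeek's own Bias and jitter calculations may differ.

```python
from typing import List, Tuple


def delay_variation_ms(packets: List[Tuple[int, float]],
                       clock_rate: int = 8000) -> List[float]:
    """Actual minus expected arrival time (ms) for each packet.

    `packets` holds (rtp_timestamp, arrival_time_seconds) in arrival order.
    Expected arrival = first arrival + media time elapsed per the RTP timestamps.
    Timestamp wraparound is ignored for brevity.
    """
    ts0, t0 = packets[0]
    result = []
    for ts, arrival in packets:
        expected = t0 + (ts - ts0) / clock_rate
        result.append((arrival - expected) * 1000.0)
    return result


def rfc3550_jitter_ms(packets: List[Tuple[int, float]],
                      clock_rate: int = 8000) -> float:
    """Interarrival jitter estimate per RFC 3550 section 6.4.1, in ms."""
    jitter = 0.0  # kept in RTP timestamp units, as in the RFC
    for (ts_prev, t_prev), (ts, t) in zip(packets, packets[1:]):
        # D = (R_j - R_i) - (S_j - S_i), arrival times converted to timestamp units.
        d = (t - t_prev) * clock_rate - (ts - ts_prev)
        jitter += (abs(d) - jitter) / 16.0
    return jitter * 1000.0 / clock_rate
```

With 20 ms G.711 packets where the second packet arrives 5 ms late, `delay_variation_ms` reports about 5 ms for that packet, while the RFC 3550 estimate rises only gradually because of its 1/16 smoothing. The largest positive delay variation in a stream is one figure that can inform jitter buffer sizing.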