A look at perhaps the most important standard in communications convergence today:
SIP-Session Initiation Protocol.
Session Initiation Protocol (SIP) is a relatively new protocol designed to maximize the potential of relatively new networks built on IP (Internet Protocol). IP, of course, is the packet-switching protocol developed in the 1970s in support of data communications over the Internet.
It was a much different Internet in those days, having been developed exclusively for interactive computer-to-computer data communications between a relatively few trusted users in government and academia. I hope it comes as no surprise to you that the Internet now supports millions of users, many of whom can hardly be characterized as trusted.
Further, the Internet now supports all sorts of multimedia traffic, including real-time voice. Aside from the Internet, there are a number of proprietary IP-based networks owned and operated by commercial carriers and service providers; these support all variety of exotic features and applications that weren't even imagined just a few short years ago.
So the Internet and IP networks, in general, have evolved to the point that they are quickly becoming all-purpose in nature and are gradually replacing the circuit-switched PSTN (Public Switched Telephone Network). As it became clear that the old PSTN might fade away in favor of the Internet, it also became clear that a new set of protocols would be necessary to supplement the basic IP protocol suite.
Enter H.323
The ITU-T developed the H.323 Recommendation to support multimedia communications over LANs, which do not support QoS (Quality of Service). Subsequently, H.323 was found to be useful over IP-based networks, which also don't offer true QoS.
As real-time voice and video benefit greatly from QoS, H.323 or some similar mechanism is important, and the ITU-T certainly was in a position to take the lead. While the ITU-T is the rock-solid standards foundation in the voice world of the PSTN, it often is characterized as slow and inflexible. And so it is with many of the ITU standards, including H.323.
H.323 queries for the address of the destination device(s), then establishes a session and then negotiates the features and capabilities of the session before finally connecting the call. As a result of this complex signaling process, call setup time can be considerably longer than the average time required to set up a voice call over the PSTN.
Terminal devices can include intelligent workstations with voice and video capabilities. The transmit and receive devices can be intelligent workstations running H.323 client software. Alternatively, and as illustrated in Figure 1 below, they can be less capable devices interconnecting across the network via optional gateways that serve as protocol converters. Gatekeepers are optional elements that serve as central points in H.323 zones of control. If a gatekeeper is present, all devices must register with it. Gatekeepers serve to translate LAN addresses into IP or IPX addresses and to route H.323 calls.

If H.323 seems a bit complex, it's because it is. Among other things, H.323 requires a separate numbering scheme, which mucks up intercarrier internetworking. And that level of complexity doesn't scale well.
Make Way for SIP
SIP was defined by the IETF (Internet Engineering Task Force) in 1999 (RFC 2543), specifically for IP networks, as an application layer signaling protocol for establishing, modifying and terminating multimedia sessions or calls.
SIP identifies clients through a hierarchical URI (Uniform Resource Identifier) much like a URL (Uniform Resource Locator) used as the basis for an e-mail address, e.g., SIPmain@mainlinecom.com. (Don't try it, as we are not SIP-compliant at the moment.)
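To make the e-mail-like shape of a SIP address concrete, here is a minimal Python sketch that splits one into its parts. It assumes the address is written with the "sip:" scheme prefix and ignores IPv6 hosts, passwords, URI parameters and headers. The names SipUri and parse_sip_uri are made up for illustration and do not come from any SIP library.

```python
from typing import NamedTuple, Optional


class SipUri(NamedTuple):
    scheme: str
    user: Optional[str]
    host: str
    port: Optional[int]


def parse_sip_uri(uri: str) -> SipUri:
    """Split a simple sip:/sips: URI into scheme, user, host and port."""
    scheme, sep, rest = uri.partition(":")
    if not sep or scheme.lower() not in ("sip", "sips"):
        raise ValueError(f"not a SIP URI: {uri!r}")
    # This sketch drops URI parameters (;transport=udp) and headers (?subject=x).
    rest = rest.split(";", 1)[0].split("?", 1)[0]
    # The user part is optional; everything after the last "@" is host[:port].
    user, at, hostport = rest.rpartition("@")
    host, colon, port = hostport.partition(":")
    if not host:
        raise ValueError(f"missing host in SIP URI: {uri!r}")
    return SipUri(
        scheme=scheme.lower(),
        user=user if at else None,
        host=host.lower(),
        port=int(port) if colon else None,
    )


if __name__ == "__main__":
    print(parse_sip_uri("sip:SIPmain@mainlinecom.com"))
```

The host portion is an ordinary domain name, which is what lets SIP lean on the existing DNS rather than a separate numbering scheme.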
The calling client invites the called client to join in a session, providing it with enough information to do so. The information might include an invitation to join in a videoconference, perhaps employing H.261 video, G.728 audio, with Japanese as the preferred language. An optional SIP server on the receiving end of the communication might determine the location of each of the invited parties and connect the call.
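An invitation of this kind is just a block of text: SIP headers followed by a session description (SDP) listing the media on offer. The Python sketch below assembles one for the videoconference example. The addresses, IP, ports, tag, branch and Call-ID values are placeholders invented for illustration, and build_invite is a hypothetical helper, not part of any SIP stack. The RTP payload numbers 15 (G.728) and 31 (H.261) are the standard static assignments, and the a=lang line is one way SDP can signal a language preference.

```python
def build_invite(caller: str, callee: str, call_id: str, local_ip: str) -> str:
    """Assemble a minimal SIP INVITE carrying an SDP offer (illustrative only)."""
    sdp_lines = [
        "v=0",
        f"o=- 0 0 IN IP4 {local_ip}",
        "s=Videoconference",
        f"c=IN IP4 {local_ip}",
        "t=0 0",
        "a=lang:ja",                 # preferred session language: Japanese
        "m=audio 49170 RTP/AVP 15",  # payload type 15 = G.728 audio
        "m=video 51372 RTP/AVP 31",  # payload type 31 = H.261 video
    ]
    body = "\r\n".join(sdp_lines) + "\r\n"
    headers = [
        f"INVITE {callee} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_ip}:5060;branch=z9hG4bK-example",
        "Max-Forwards: 70",
        f"To: <{callee}>",
        f"From: <{caller}>;tag=1234",
        f"Call-ID: {call_id}",
        "CSeq: 1 INVITE",
        f"Contact: <sip:{local_ip}:5060>",
        "Content-Type: application/sdp",
        f"Content-Length: {len(body.encode('utf-8'))}",
    ]
    # A blank line separates the SIP headers from the SDP body.
    return "\r\n".join(headers) + "\r\n\r\n" + body


if __name__ == "__main__":
    print(build_invite("sip:alice@example.com", "sip:bob@example.com",
                       "abc123@192.0.2.10", "192.0.2.10"))
```

Everything the called client needs in order to decide whether and how to join is in that one message.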
Once the called client (e.g., multimedia workstation, PDA or cell phone) receives the invitation to join the session, it can accept the call, or perhaps forward it to a voice messaging system or even a unified messaging (UM) system, or perhaps to another Japanese-speaking user. If the call is multimedia in nature, but the receiving client is not, it may elect to accept only the media types it can support.
For example, it may accept the voice component of the call while rejecting the video component. If appropriate, multiple clients can be rung at once through a process known as call forking or splitting, and each client can deal with the call in the manner most appropriate. During the course of the call, additional functions such as whiteboarding or data conferencing can be added through additional invite requests from any of the clients.
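The accept-some, reject-some behavior can be sketched in a few lines of Python. In the SDP offer/answer model used with SIP, a client declines an offered media stream by answering with port 0 for that stream. The client names, capability sets and port numbers below are hypothetical, and fork_invite only imitates forking by handing the same offer to every client; a real proxy would also decide which answer wins.

```python
from typing import Dict, List, Set, Tuple

Media = Tuple[str, int]  # (media type, RTP port), as in an SDP "m=" line


def answer_offer(offer: List[Media], supported: Set[str],
                 local_ports: Dict[str, int]) -> List[Media]:
    """Accept the media types this client supports; reject the rest with port 0."""
    answer = []
    for media, _offered_port in offer:
        if media in supported:
            answer.append((media, local_ports[media]))
        else:
            answer.append((media, 0))  # port 0 means "stream rejected"
    return answer


def fork_invite(offer: List[Media],
                clients: Dict[str, Tuple[Set[str], Dict[str, int]]]
                ) -> Dict[str, List[Media]]:
    """Hand the same offer to every client and collect each one's own answer."""
    return {name: answer_offer(offer, caps, ports)
            for name, (caps, ports) in clients.items()}


if __name__ == "__main__":
    offer = [("audio", 49170), ("video", 51372)]
    clients = {
        "desk-phone": ({"audio"}, {"audio": 40000}),
        "workstation": ({"audio", "video"}, {"audio": 40002, "video": 40004}),
    }
    for name, answer in fork_invite(offer, clients).items():
        print(name, answer)
```

Here the desk phone takes the voice stream and declines the video, while the workstation accepts both.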
SIP makes use of UDP (User Datagram Protocol) for performance reasons. This puts the responsibility for error control on the receiving device, but is much faster than TCP (Transmission Control Protocol). This reliance on UDP is typical in real-time compressed VoIP and video over IP networking. TCP is available as an option, and it certainly is more reliable.
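Because UDP itself never retransmits a lost datagram, a SIP endpoint has to cope with loss on its own. As one illustration, the current SIP specification (RFC 3261) has a client resend an unanswered INVITE at intervals that start at a timer value T1 (500 ms by default) and double each time, giving up after 64 times T1. The Python sketch below computes that schedule. The function name and its parameters are invented for this example.

```python
from typing import List


def invite_retransmit_times(t1: float = 0.5, timeout_factor: int = 64) -> List[float]:
    """Return the times, in seconds after the first send, at which an
    unanswered INVITE would be resent over UDP before the transaction times out."""
    times = []
    interval = t1                    # first retransmission fires after T1
    elapsed = t1
    deadline = timeout_factor * t1   # the transaction gives up at 64 * T1
    while elapsed < deadline:
        times.append(elapsed)
        interval *= 2                # the interval doubles after every resend
        elapsed += interval
    return times


if __name__ == "__main__":
    print(invite_retransmit_times())
```

With the defaults, that works out to six retransmissions within a 32-second window, which is the price paid for skipping TCP's connection setup.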
If this sounds a lot cleaner and faster than the H.323 approach, it's because it is. SIP also scales much better than H.323, which is due in part to its reliance on a variation of the existing addressing scheme employed in the Internet. SIP also is a lot more flexible, as it allows individual receiving clients to tailor the incoming call to their own capabilities. Further, the nature of the call can change while it is in progress.
By the way, SIP's reliance on the URL addressing scheme defined by the DNS (Domain Name System) of the Internet is an incredible advantage, as it ties into the effort of the IETF's ENUM Working Group. That group is working on the interconnection of DNS and the E.164 numbering scheme used in the PSTN. Ultimately, that work will enable the interoperability of the two numbering schemes in a full convergence scenario.
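The core of the ENUM idea is a simple string transformation: take the digits of an E.164 number, reverse them, separate them with dots and append the e164.arpa domain, which can then be looked up in DNS. The Python sketch below shows only that transformation, as described in RFC 2916 and its successors. The phone number is a made-up example, and the function name is an assumption for illustration; the DNS query that would return a SIP address for the number is not shown.

```python
def e164_to_enum_domain(number: str, suffix: str = "e164.arpa") -> str:
    """Map an E.164 telephone number to its ENUM domain name."""
    # Keep only the digits, dropping "+", spaces, dashes and parentheses.
    digits = [ch for ch in number if ch in "0123456789"]
    if not digits:
        raise ValueError(f"no digits in {number!r}")
    # Reverse the digits so the most specific part comes first, as in DNS.
    return ".".join(reversed(digits)) + "." + suffix


if __name__ == "__main__":
    print(e164_to_enum_domain("+1-202-555-0100"))
```

Reversing the digits puts the country code nearest the root of the DNS tree, so authority over number ranges can be delegated in the same way as ordinary domains.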
Which Way Did It Go? Which Way Did It Go?
Despite the obvious advantages, SIP had a long way to go to replace the incumbent H.323. (It's always tough to replace an incumbent.) The fact that the final standard wasn't approved until June 2002 didn't help. What certainly did help was Microsoft's decision (2001) to incorporate SIP in Windows XP, Windows CE 4.0, Windows .NET Server and other devices embedded with XP. Now it's not as though a lot of other major telecom and datacom vendors didn't get behind SIP, but Microsoft's support makes a big difference in the life of any standard these days.
Between the time SIP was first announced in 1999 and its finalization in 2002, a lot of pre-release products were developed and marketed. As is typical, the vendors didn't all take the same approach, exercising different options and maintaining different levels of currency with the developing standard.
That's created interoperability problems, which is not exactly why standards are developed. After all, standards are supposed to allow multiple manufacturers to build to the same core specifications, which ultimately allows carriers, service providers and end users to deploy networks comprising products of diverse origins without compromising interconnectivity and interoperability. These issues ultimately will be sorted out, but it may be a while.
The IETF regularly holds SIP Test Events to test product interoperability at increasing levels of difficulty. There are strong indications that at least one third-party organization will soon emerge to assume responsibility for SIP certification. In the meantime, IP carriers and service providers are rolling out networks and services based on SIP, dealing with the problems as they go and confident that it all will be sorted out.