
In contrast to a simple demultiplexing protocol like UDP, a more sophisticated transport protocol is one that offers a reliable, connection-oriented, byte-stream service. Such a service has proven useful to a wide assortment of applications because it frees the application from having to worry about missing or reordered data. The Internet’s Transmission Control Protocol is probably the most widely used protocol of this type; it is also the most carefully tuned. It is for these two reasons that this section studies TCP in detail, although we identify and discuss alternative design choices at the end of the section.

In terms of the properties of transport protocols given in the problem statement at the start of this chapter, TCP guarantees the reliable, in-order delivery of a stream of bytes. It is a full-duplex protocol, meaning that each TCP connection supports a pair of byte streams, one flowing in each direction. It also includes a flow-control mechanism for each of these byte streams that allows the receiver to limit how much data the sender can transmit at a given time. Finally, like UDP, TCP supports a demultiplexing mechanism that allows multiple application programs on any given host to simultaneously carry on a conversation with their peers.

In addition to the above features, TCP also implements a highly tuned congestion-control mechanism. The idea of this mechanism is to throttle how fast TCP sends data, not for the sake of keeping the sender from over-running the receiver, but so as to keep the sender from overloading the network. A description of TCP’s congestion-control mechanism is postponed until the next chapter, where we discuss it in the larger context of how network resources are fairly allocated.

Since many people confuse congestion control and flow control, we restate the difference. Flow control involves preventing senders from over-running the capacity of receivers. Congestion control involves preventing too much data from being injected into the network, thereby causing switches or links to become overloaded. Thus, flow control is an end-to-end issue, while congestion control is concerned with how hosts and networks interact.

5.2.1 End-to-End Issues

At the heart of TCP is the sliding window algorithm. Even though this is the same basic algorithm as is often used at the link level, because TCP runs over the Internet rather than a physical point-to-point link, there are many important differences. This subsection identifies these differences and explains how they complicate TCP. The following subsections then describe how TCP addresses these and other complications.

First, whereas the link-level sliding window algorithm presented runs over a single physical link that always connects the same two computers, TCP supports logical connections between processes that are running on any two computers in the Internet. This means that TCP needs an explicit connection establishment phase during which the two sides of the connection agree to exchange data with each other. This difference is analogous to having to dial up the other party, rather than having a dedicated phone line. TCP also has an explicit connection teardown phase. One of the things that happens during connection establishment is that the two parties establish some shared state to enable the sliding window algorithm to begin. Connection teardown is needed so each host knows it is OK to free this state.

Second, whereas a single physical link that always connects the same two computers has a fixed round-trip time (RTT), TCP connections are likely to have widely different round-trip times. For example, a TCP connection between a host in San Francisco and a host in Boston, which are separated by several thousand kilometers, might have an RTT of 100 ms, while a TCP connection between two hosts in the same room, only a few meters apart, might have an RTT of only 1 ms. The same TCP protocol must be able to support both of these connections. To make matters worse, the TCP connection between hosts in San Francisco and Boston might have an RTT of 100 ms at 3 a.m., but an RTT of 500 ms at 3 p.m. Variations in the RTT are even possible during a single TCP connection that lasts only a few minutes. What this means to the sliding window algorithm is that the timeout mechanism that triggers retransmissions must be adaptive. (Certainly, the timeout for a point-to-point link must be a settable parameter, but it is not necessary to adapt this timer for a particular pair of nodes.)

A third difference is that packets may be reordered as they cross the Internet, but this is not possible on a point-to-point link where the first packet put into one end of the link must be the first to appear at the other end. Packets that are slightly out of order do not cause a problem since the sliding window algorithm can reorder packets correctly using the sequence number. The real issue is how far out of order packets can get or, said another way, how late a packet can arrive at the destination. In the worst case, a packet can be delayed in the Internet until the IP time to live (TTL) field expires, at which time the packet is discarded (and hence there is no danger of it arriving late). Knowing that IP throws packets away after their TTL expires, TCP assumes that each packet has a maximum lifetime. The exact lifetime, known as the maximum segment lifetime (MSL), is an engineering choice. The current recommended setting is 120 seconds. Keep in mind that IP does not directly enforce this 120-second value; it is simply a conservative estimate that TCP makes of how long a packet might live in the Internet. The implication is significant—TCP has to be prepared for very old packets to suddenly show up at the receiver, potentially confusing the sliding window algorithm.

Fourth, the computers connected to a point-to-point link are generally engineered to support the link. For example, if a link’s delay × bandwidth product is computed to be 8 KB—meaning that a window size is selected to allow up to 8 KB of data to be unacknowledged at a given time—then it is likely that the computers at either end of the link have the ability to buffer up to 8 KB of data. Designing the system otherwise would be silly. On the other hand, almost any kind of computer can be connected to the Internet, making the amount of resources dedicated to any one TCP connection highly variable, especially considering that any one host can potentially support hundreds of TCP connections at the same time. This means that TCP must include a mechanism that each side uses to “learn” what resources (e.g., how much buffer space) the other side is able to apply to the connection. This is the flow control issue.

Fifth, because the transmitting side of a directly connected link cannot send any faster than the bandwidth of the link allows, and only one host is pumping data into the link, it is not possible to unknowingly congest the link. Said another way, the load on the link is visible in the form of a queue of packets at the sender. In contrast, the sending side of a TCP connection has no idea what links will be traversed to reach the destination. For example, the sending machine might be directly connected to a relatively fast Ethernet—and capable of sending data at a rate of 10 Gbps—but somewhere out in the middle of the network, a 1.5-Mbps link must be traversed. And, to make matters worse, data being generated by many different sources might be trying to traverse this same slow link. This leads to the problem of network congestion. Discussion of this topic is delayed until the next chapter.

We conclude this discussion of end-to-end issues by comparing TCP’s approach to providing a reliable/ordered delivery service with the approach used by virtual-circuit-based networks like the historically important X.25 network. In TCP, the underlying IP network is assumed to be unreliable and to deliver messages out of order; TCP uses the sliding window algorithm on an end-to-end basis to provide reliable/ordered delivery. In contrast, X.25 networks use the sliding window protocol within the network, on a hop-by-hop basis. The assumption behind this approach is that if messages are delivered reliably and in order between each pair of nodes along the path between the source host and the destination host, then the end-to-end service also guarantees reliable/ordered delivery.

The problem with this latter approach is that a sequence of hop-by-hop guarantees does not necessarily add up to an end-to-end guarantee. First, if a heterogeneous link (say, an Ethernet) is added to one end of the path, then there is no guarantee that this hop will preserve the same service as the other hops. Second, just because the sliding window protocol guarantees that messages are delivered correctly from node A to node B, and then from node B to node C, it does not guarantee that node B behaves perfectly. For example, network nodes have been known to introduce errors into messages while transferring them from an input buffer to an output buffer. They have also been known to accidentally reorder messages. As a consequence of these small windows of vulnerability, it is still necessary to provide true end-to-end checks to guarantee reliable/ordered service, even though the lower levels of the system also implement that functionality.

Key Takeaway

This discussion serves to illustrate one of the most important principles in system design—the end-to-end argument. In a nutshell, the end-to-end argument says that a function (in our example, providing reliable/ordered delivery) should not be provided in the lower levels of the system unless it can be completely and correctly implemented at that level. Therefore, this rule argues in favor of the TCP/IP approach. This rule is not absolute, however. It does allow for functions to be incompletely provided at a low level as a performance optimization. This is why it is perfectly consistent with the end-to-end argument to perform error detection (e.g., CRC) on a hop-by-hop basis; detecting and retransmitting a single corrupt packet across one hop is preferable to having to retransmit an entire file end-to-end.

5.2.2 Segment Format

TCP is a byte-oriented protocol, which means that the sender writes bytes into a TCP connection and the receiver reads bytes out of the TCP connection. Although “byte stream” describes the service TCP offers to application processes, TCP does not, itself, transmit individual bytes over the Internet. Instead, TCP on the source host buffers enough bytes from the sending process to fill a reasonably sized packet and then sends this packet to its peer on the destination host. TCP on the destination host then empties the contents of the packet into a receive buffer, and the receiving process reads from this buffer at its leisure. This situation is illustrated in Figure 127, which, for simplicity, shows data flowing in only one direction. Remember that, in general, a single TCP connection supports byte streams flowing in both directions.

The packets exchanged between TCP peers in Figure 127 are called segments, since each one carries a segment of the byte stream. Each TCP segment contains the header schematically depicted in Figure 128. The relevance of most of these fields will become apparent throughout this section. For now, we simply introduce them.

The SrcPort and DstPort fields identify the source and destination ports, respectively, just as in UDP. These two fields, plus the source and destination IP addresses, combine to uniquely identify each TCP connection. That is, TCP’s demux key is given by the 4-tuple

(SrcPort, SrcIPAddr, DstPort, DstIPAddr)
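
As an illustration only (the struct and function names are ours, not from any particular implementation), this demux key might be represented in C as follows:

#include <stdint.h>

/* Hypothetical demux key: the 4-tuple that uniquely identifies
 * a TCP connection (IPv4 addresses for simplicity). */
struct TcpDemuxKey {
    uint16_t src_port;   /* SrcPort   */
    uint32_t src_ip;     /* SrcIPAddr */
    uint16_t dst_port;   /* DstPort   */
    uint32_t dst_ip;     /* DstIPAddr */
};

/* Two segments belong to the same connection exactly when
 * their keys match field for field. */
int same_connection(const struct TcpDemuxKey *a, const struct TcpDemuxKey *b)
{
    return a->src_port == b->src_port && a->src_ip == b->src_ip &&
           a->dst_port == b->dst_port && a->dst_ip == b->dst_ip;
}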

Note that because TCP connections come and go, it is possible for a connection between a particular pair of ports to be established, used to send and receive data, and closed, and then at a later time for the same pair of ports to be involved in a second connection. We sometimes refer to this situation as two different incarnations of the same connection.

The Acknowledgement, SequenceNum, and AdvertisedWindow fields are all involved in TCP’s sliding window algorithm. Because TCP is a byte-oriented protocol, each byte of data has a sequence number. The SequenceNum field contains the sequence number for the first byte of data carried in that segment, and the Acknowledgement and AdvertisedWindow fields carry information about the flow of data going in the other direction. To simplify our discussion, we ignore the fact that data can flow in both directions, and we concentrate on data that has a particular SequenceNum flowing in one direction and Acknowledgement and AdvertisedWindow values flowing in the opposite direction, as illustrated in Figure 129. The use of these three fields is described more fully later in this chapter.

The 6-bit Flags field is used to relay control information between TCP peers. The possible flags include SYN, FIN, RESET, PUSH, URG, and ACK. The SYN and FIN flags are used when establishing and terminating a TCP connection, respectively. Their use is described in a later section. The ACK flag is set any time the Acknowledgement field is valid, implying that the receiver should pay attention to it. The URG flag signifies that this segment contains urgent data. When this flag is set, the UrgPtr field indicates where the nonurgent data contained in this segment begins. The urgent data is contained at the front of the segment body, up to and including a value of UrgPtr bytes into the segment. The PUSH flag signifies that the sender invoked the push operation, which indicates to the receiving side of TCP that it should notify the receiving process of this fact. We discuss these last two features more in a later section. Finally, the RESET flag signifies that the receiver has become confused—for example, because it received a segment it did not expect to receive—and so wants to abort the connection.
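
The flags occupy individual bit positions in the header; as a sketch, they can be written as C bit masks (the values below follow the standard TCP header layout):

/* Flag bits within the 6-bit Flags field */
#define TCP_FLAG_FIN 0x01
#define TCP_FLAG_SYN 0x02
#define TCP_FLAG_RST 0x04   /* RESET */
#define TCP_FLAG_PSH 0x08   /* PUSH  */
#define TCP_FLAG_ACK 0x10
#define TCP_FLAG_URG 0x20

/* Example: the second message of the three-way handshake
 * (next section) carries both SYN and ACK. */
int is_syn_ack(unsigned flags)
{
    return (flags & (TCP_FLAG_SYN | TCP_FLAG_ACK)) ==
           (TCP_FLAG_SYN | TCP_FLAG_ACK);
}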

Finally, the Checksum field is used in exactly the same way as for UDP—it is computed over the TCP header, the TCP data, and the pseudoheader, which is made up of the source address, destination address, and length fields from the IP header. The checksum is required for TCP in both IPv4 and IPv6. Also, since the TCP header is of variable length (options can be attached after the mandatory fields), a HdrLen field is included that gives the length of the header in 32-bit words. This field is also known as the Offset field, since it measures the offset from the start of the packet to the start of the data.

5.2.3 Connection Establishment and Termination

A TCP connection begins with a client (caller) doing an active open to a server (callee). Assuming that the server had earlier done a passive open, the two sides engage in an exchange of messages to establish the connection. (Recall from Chapter 1 that a party wanting to initiate a connection performs an active open, while a party willing to accept a connection does a passive open.[1]) Only after this connection establishment phase is over do the two sides begin sending data. Likewise, as soon as a participant is done sending data, it closes one direction of the connection, which causes TCP to initiate a round of connection termination messages. Notice that, while connection setup is an asymmetric activity (one side does a passive open and the other side does an active open), connection teardown is symmetric (each side has to close the connection independently). Therefore, it is possible for one side to have done a close, meaning that it can no longer send data, but for the other side to keep the other half of the bidirectional connection open and to continue sending data.

[1] To be more precise, TCP allows connection setup to be symmetric, with both sides trying to open the connection at the same time, but the common case is for one side to do an active open and the other side to do a passive open.

Three-Way Handshake

The algorithm used by TCP to establish and terminate a connection is called a three-way handshake. We first describe the basic algorithm and then show how it is used by TCP. The three-way handshake involves the exchange of three messages between the client and the server, as illustrated by the timeline given in Figure 130.

The idea is that two parties want to agree on a set of parameters, which, in the case of opening a TCP connection, are the starting sequence numbers the two sides plan to use for their respective byte streams. In general, the parameters might be any facts that each side wants the other to know about. First, the client (the active participant) sends a segment to the server (the passive participant) stating the initial sequence number it plans to use (Flags = SYN, SequenceNum = x). The server then responds with a single segment that both acknowledges the client’s sequence number (Flags = ACK, Ack = x + 1) and states its own beginning sequence number (Flags = SYN, SequenceNum = y). That is, both the SYN and ACK bits are set in the Flags field of this second message. Finally, the client responds with a third segment that acknowledges the server’s sequence number (Flags = ACK, Ack = y + 1). The reason why each side acknowledges a sequence number that is one larger than the one sent is that the Acknowledgement field actually identifies the “next sequence number expected,” thereby implicitly acknowledging all earlier sequence numbers. Although not shown in this timeline, a timer is scheduled for each of the first two segments, and if the expected response is not received the segment is retransmitted.
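
To make the exchange concrete, the following runnable C sketch of ours simply prints the three segments; the initial sequence numbers x and y are illustrative stand-ins for the randomly chosen values discussed below:

#include <stdint.h>
#include <stdio.h>

enum { SYN = 0x02, ACK = 0x10 };   /* flag bits */

struct Segment { unsigned flags; uint32_t seq; uint32_t ack; };

int main(void)
{
    uint32_t x = 4000, y = 9000;   /* illustrative ISNs; real ones are random */
    struct Segment handshake[3] = {
        { SYN,       x,     0     },   /* 1: client -> server */
        { SYN | ACK, y,     x + 1 },   /* 2: server -> client */
        { ACK,       x + 1, y + 1 },   /* 3: client -> server */
    };
    for (int i = 0; i < 3; i++)
        printf("msg %d: flags=0x%02x seq=%u ack=%u\n", i + 1,
               handshake[i].flags, (unsigned)handshake[i].seq,
               (unsigned)handshake[i].ack);
    return 0;
}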

You may be asking yourself why the client and server have to exchange starting sequence numbers with each other at connection setup time. It would be simpler if each side simply started at some “well-known” sequence number, such as 0. In fact, the TCP specification requires that each side of a connection select an initial starting sequence number at random. The reason for this is to protect against two incarnations of the same connection reusing the same sequence numbers too soon—that is, while there is still a chance that a segment from an earlier incarnation of a connection might interfere with a later incarnation of the connection.

State-Transition Diagram

TCP is complex enough that its specification includes a state-transition diagram. A copy of this diagram is given in Figure 131. This diagram shows only the states involved in opening a connection (everything above ESTABLISHED) and in closing a connection (everything below ESTABLISHED). Everything that goes on while a connection is open—that is, the operation of the sliding window algorithm—is hidden in the ESTABLISHED state.

TCP’s state-transition diagram is fairly easy to understand. Each box denotes a state that one end of a TCP connection can find itself in. All connections start in the CLOSED state. As the connection progresses, the connection moves from state to state according to the arcs. Each arc is labeled with a tag of the form event/action. Thus, if a connection is in the LISTEN state and a SYN segment arrives (i.e., a segment with the SYN flag set), the connection makes a transition to the SYN_RCVD state and takes the action of replying with an ACK+SYN segment.

Notice that two kinds of events trigger a state transition: (1) a segment arrives from the peer (e.g., the event on the arc from LISTEN to SYN_RCVD), or (2) the local application process invokes an operation on TCP (e.g., the active open event on the arc from CLOSED to SYN_SENT). In other words, TCP’s state-transition diagram effectively defines the semantics of both its peer-to-peer interface and its service interface. The syntax of these two interfaces is given by the segment format (as illustrated in Figure 128) and by some application programming interface, such as the socket API, respectively.

Now let’s trace the typical transitions taken through the diagram in Figure 131. Keep in mind that at each end of the connection, TCP makes different transitions from state to state. When opening a connection, the server first invokes a passive open operation on TCP, which causes TCP to move to the LISTEN state. At some later time, the client does an active open, which causes its end of the connection to send a SYN segment to the server and to move to the SYN_SENT state. When the SYN segment arrives at the server, it moves to the SYN_RCVD state and responds with a SYN+ACK segment. The arrival of this segment causes the client to move to the ESTABLISHED state and to send an ACK back to the server. When this ACK arrives, the server finally moves to the ESTABLISHED state. In other words, we have just traced the three-way handshake.
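
One way to think about this is as a transition function. The C sketch below is ours and covers just the handshake-related arcs traced above; a full implementation would also perform the associated actions (sending SYN, SYN+ACK, or ACK) and cover the teardown states:

enum TcpState { CLOSED, LISTEN, SYN_SENT, SYN_RCVD, ESTABLISHED };
enum TcpEvent { PASSIVE_OPEN, ACTIVE_OPEN, RCV_SYN, RCV_SYN_ACK, RCV_ACK };

enum TcpState transition(enum TcpState s, enum TcpEvent e)
{
    switch (s) {
    case CLOSED:
        if (e == PASSIVE_OPEN) return LISTEN;      /* server side */
        if (e == ACTIVE_OPEN)  return SYN_SENT;    /* client side; action: send SYN */
        break;
    case LISTEN:
        if (e == RCV_SYN)      return SYN_RCVD;    /* action: send SYN+ACK */
        break;
    case SYN_SENT:
        if (e == RCV_SYN_ACK)  return ESTABLISHED; /* action: send ACK */
        break;
    case SYN_RCVD:
        if (e == RCV_ACK)      return ESTABLISHED;
        break;
    default:
        break;
    }
    return s;   /* no arc for this event: stay in the current state */
}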

There are three things to notice about the connection establishment half of the state-transition diagram. First, if the client’s ACK to the server is lost, corresponding to the third leg of the three-way handshake, then the connection still functions correctly. This is because the client side is already in the ESTABLISHED state, so the local application process can start sending data to the other end. Each of these data segments will have the ACK flag set, and the correct value in the Acknowledgement field, so the server will move to the ESTABLISHED state when the first data segment arrives. This is actually an important point about TCP—every segment reports what sequence number the sender is expecting to see next, even if this repeats the same sequence number contained in one or more previous segments.

The second thing to notice about the state-transition diagram is that there is a funny transition out of the LISTEN state whenever the local process invokes a send operation on TCP. That is, it is possible for a passive participant to identify both ends of the connection (i.e., itself and the remote participant that it is willing to have connect to it), and then for it to change its mind about waiting for the other side and instead actively establish the connection. To the best of our knowledge, this is a feature of TCP that no application process actually takes advantage of.

The final thing to notice about the diagram is the arcs that are not shown. Specifically, most of the states that involve sending a segment to the other side also schedule a timeout that eventually causes the segment to be resent if the expected response does not happen. These retransmissions are not depicted in the state-transition diagram. If after several tries the expected response does not arrive, TCP gives up and returns to the CLOSED state.

Turning our attention now to the process of terminating a connection, the important thing to keep in mind is that the application process on both sides of the connection must independently close its half of the connection. If only one side closes the connection, then this means it has no more data to send, but it is still available to receive data from the other side. This complicates the state-transition diagram because it must account for the possibility that the two sides invoke the close operator at the same time, as well as the possibility that first one side invokes close and then, at some later time, the other side invokes close. Thus, on any one side there are three combinations of transitions that get a connection from the ESTABLISHED state to the CLOSED state:

  • This side closes first: ESTABLISHED → FIN_WAIT_1 → FIN_WAIT_2 → TIME_WAIT → CLOSED.

  • The other side closes first: ESTABLISHED → CLOSE_WAIT → LAST_ACK → CLOSED.

  • Both sides close at the same time: ESTABLISHED → FIN_WAIT_1 → CLOSING → TIME_WAIT → CLOSED.

There is actually a fourth, although rare, sequence of transitions that leads to the CLOSED state; it follows the arc from FIN_WAIT_1 to TIME_WAIT. We leave it as an exercise for you to figure out what combination of circumstances leads to this fourth possibility.

The main thing to recognize about connection teardown is that a connection in the TIME_WAIT state cannot move to the CLOSED state until it has waited for two times the maximum amount of time an IP datagram might live in the Internet (i.e., 120 seconds). The reason for this is that, while the local side of the connection has sent an ACK in response to the other side’s FIN segment, it does not know that the ACK was successfully delivered. As a consequence, the other side might retransmit its FIN segment, and this second FIN segment might be delayed in the network. If the connection were allowed to move directly to the CLOSED state, then another pair of application processes might come along and open the same connection (i.e., use the same pair of port numbers), and the delayed FIN segment from the earlier incarnation of the connection would immediately initiate the termination of the later incarnation of that connection.

5.2.4 Sliding Window Revisited

We are now ready to discuss TCP’s variant of the sliding window algorithm, which serves several purposes: (1) it guarantees the reliable delivery of data, (2) it ensures that data is delivered in order, and (3) it enforces flow control between the sender and the receiver. TCP’s use of the sliding window algorithm is the same as at the link level in the case of the first two of these three functions. Where TCP differs from the link-level algorithm is that it folds the flow-control function in as well. In particular, rather than having a fixed-size sliding window, the receiver advertises a window size to the sender. This is done using the AdvertisedWindow field in the TCP header. The sender is then limited to having no more than a value of AdvertisedWindow bytes of unacknowledged data at any given time. The receiver selects a suitable value for AdvertisedWindow based on the amount of memory allocated to the connection for the purpose of buffering data. The idea is to keep the sender from over-running the receiver’s buffer. We discuss this at greater length below.

Reliable and Ordered Delivery

To see how the sending and receiving sides of TCP interact with each other to implement reliable and ordered delivery, consider the situation illustrated in Figure 132. TCP on the sending side maintains a send buffer. This buffer is used to store data that has been sent but not yet acknowledged, as well as data that has been written by the sending application but not transmitted. On the receiving side, TCP maintains a receive buffer. This buffer holds data that arrives out of order, as well as data that is in the correct order (i.e., there are no missing bytes earlier in the stream) but that the application process has not yet had the chance to read.

To make the following discussion simpler to follow, we initially ignore the fact that both the buffers and the sequence numbers are of some finite size and hence will eventually wrap around. Also, we do not distinguish between a pointer into a buffer where a particular byte of data is stored and the sequence number for that byte.

Looking first at the sending side, three pointers are maintained into the send buffer, each with an obvious meaning: LastByteAcked, LastByteSent, and LastByteWritten. Clearly,

LastByteAcked <= LastByteSent

since the receiver cannot have acknowledged a byte that has not yet been sent, and

LastByteSent <= LastByteWritten

since TCP cannot send a byte that the application process has not yet written. Also note that none of the bytes to the left of LastByteAcked need to be saved in the buffer because they have already been acknowledged, and none of the bytes to the right of LastByteWritten need to be buffered because they have not yet been generated.

A similar set of pointers (sequence numbers) are maintained on the receiving side: LastByteRead, NextByteExpected, and LastByteRcvd. The inequalities are a little less intuitive, however, because of the problem of out-of-order delivery. The first relationship

LastByteRead < NextByteExpected

is true because a byte cannot be read by the application until it is received and all preceding bytes have also been received. NextByteExpected points to the byte immediately after the latest byte to meet this criterion. Second,

NextByteExpected <= LastByteRcvd + 1

since, if data has arrived in order, NextByteExpected points to the byte after LastByteRcvd, whereas if data has arrived out of order, then NextByteExpected points to the start of the first gap in the data, as in Figure 132. Note that bytes to the left of LastByteRead need not be buffered because they have already been read by the local application process, and bytes to the right of LastByteRcvd need not be buffered because they have not yet arrived.
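
These invariants translate directly into code. The following self-checking C sketch uses byte numbers of our own choosing, with a gap after byte 130 on the receiving side:

#include <assert.h>
#include <stdint.h>

int main(void)
{
    /* Sending side */
    uint64_t LastByteAcked = 100, LastByteSent = 150, LastByteWritten = 200;
    assert(LastByteAcked <= LastByteSent);     /* can't ack what wasn't sent */
    assert(LastByteSent <= LastByteWritten);   /* can't send what wasn't written */

    /* Receiving side: bytes up to 130 arrived in order, then a gap,
     * then more data up to byte 150. */
    uint64_t LastByteRead = 120, NextByteExpected = 131, LastByteRcvd = 150;
    assert(LastByteRead < NextByteExpected);
    assert(NextByteExpected <= LastByteRcvd + 1);
    return 0;
}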

Flow Control

Most of the above discussion is similar to that found in the standard sliding window algorithm; the only real difference is that this time we elaborated on the fact that the sending and receiving application processes are filling and emptying their local buffer, respectively. (The earlier discussion glossed over the fact that data arriving from an upstream node was filling the send buffer and data being transmitted to a downstream node was emptying the receive buffer.)

You should make sure you understand this much before proceeding because now comes the point where the two algorithms differ more significantly. In what follows, we reintroduce the fact that both buffers are of some finite size, denoted MaxSendBuffer and MaxRcvBuffer, although we don’t worry about the details of how they are implemented. In other words, we are only interested in the number of bytes being buffered, not in where those bytes are actually stored.

Recall that in a sliding window protocol, the size of the window sets the amount of data that can be sent without waiting for acknowledgment from the receiver. Thus, the receiver throttles the sender by advertising a window that is no larger than the amount of data that it can buffer. Observe that TCP on the receive side must keep

LastByteRcvd - LastByteRead <= MaxRcvBuffer

to avoid overflowing its buffer. It therefore advertises a window size of

AdvertisedWindow = MaxRcvBuffer - ((NextByteExpected - 1) - LastByteRead)

which represents the amount of free space remaining in its buffer. As data arrives, the receiver acknowledges it as long as all the preceding bytes have also arrived. In addition, LastByteRcvd moves to the right (is incremented), meaning that the advertised window potentially shrinks. Whether or not it shrinks depends on how fast the local application process is consuming data. If the local process is reading data just as fast as it arrives (causing LastByteRead to be incremented at the same rate as LastByteRcvd), then the advertised window stays open (i.e., AdvertisedWindow = MaxRcvBuffer). If, however, the receiving process falls behind, perhaps because it performs a very expensive operation on each byte of data that it reads, then the advertised window grows smaller with every segment that arrives, until it eventually goes to 0.

TCP on the send side must then adhere to the advertised window it gets from the receiver. This means that at any given time, it must ensure that

LastByteSent - LastByteAcked <= AdvertisedWindow

Said another way, the sender computes an effective window that limits how much data it can send:

EffectiveWindow = AdvertisedWindow - (LastByteSent - LastByteAcked)

Clearly, EffectiveWindow must be greater than 0 before the source can send more data. It is possible, therefore, that a segment arrives acknowledging x bytes, thereby allowing the sender to increment LastByteAcked by x, but because the receiving process was not reading any data, the advertised window is now x bytes smaller than the time before. In such a situation, the sender would be able to free buffer space, but not to send any more data.
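
Pulling the receive-side and send-side calculations together, here is a minimal C sketch (the function names are ours; this is not a complete TCP implementation):

#include <stdint.h>

/* Receive side: the free buffer space to advertise. */
uint32_t advertised_window(uint32_t MaxRcvBuffer,
                           uint64_t NextByteExpected, uint64_t LastByteRead)
{
    return MaxRcvBuffer - (uint32_t)((NextByteExpected - 1) - LastByteRead);
}

/* Send side: how much more data may be put in flight.
 * The sender transmits only when the result is greater than 0. */
uint32_t effective_window(uint32_t AdvertisedWindow,
                          uint64_t LastByteSent, uint64_t LastByteAcked)
{
    uint64_t in_flight = LastByteSent - LastByteAcked;
    return in_flight >= AdvertisedWindow
               ? 0
               : AdvertisedWindow - (uint32_t)in_flight;
}

/* Send side: writing y more bytes must not overflow the send buffer;
 * if it would, the sending process is blocked. */
int write_would_block(uint64_t LastByteWritten, uint64_t LastByteAcked,
                      uint32_t y, uint32_t MaxSendBuffer)
{
    return (LastByteWritten - LastByteAcked) + y > MaxSendBuffer;
}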

All the while this is going on, the send side must also make sure that the local application process does not overflow the send buffer—that is,

LastByteWritten - LastByteAcked <= MaxSendBuffer

If the sending process tries to write y bytes to TCP, but

(LastByteWritten - LastByteAcked) + y > MaxSendBuffer

then TCP blocks the sending process and does not allow it to generate more data.

It is now possible to understand how a slow receiving process ultimately stops a fast sending process. First, the receive buffer fills up, which means the advertised window shrinks to 0. An advertised window of 0 means that the sending side cannot transmit any data, even though data it has previously sent has been successfully acknowledged. Finally, not being able to transmit any data means that the send buffer fills up, which ultimately causes TCP to block the sending process. As soon as the receiving process starts to read data again, the receive-side TCP is able to open its window back up, which allows the send-side TCP to transmit data out of its buffer. When this data is eventually acknowledged, LastByteAcked is incremented, the buffer space holding this acknowledged data becomes free, and the sending process is unblocked and allowed to proceed.

There is only one remaining detail that must be resolved—how does the sending side know that the advertised window is no longer 0? As mentioned above, TCP always sends a segment in response to a received data segment, and this response contains the latest values for the Acknowledgement and AdvertisedWindow fields, even if these values have not changed since the last time they were sent. The problem is this. Once the receive side has advertised a window size of 0, the sender is not permitted to send any more data, which means it has no way to discover that the advertised window is no longer 0 at some time in the future. TCP on the receive side does not spontaneously send nondata segments; it only sends them in response to an arriving data segment.

TCP deals with this situation as follows. Whenever the other side advertises a window size of 0, the sending side persists in sending a segment with 1 byte of data every so often. It knows that this data will probably not be accepted, but it tries anyway, because each of these 1-byte segments triggers a response that contains the current advertised window. Eventually, one of these 1-byte probes triggers a response that reports a nonzero advertised window.

Note that these 1-byte messages are called Zero Window Probes and in practice they are sent every 5 to 60 seconds. As for what single byte of data to send in the probe: it’s the next byte of actual data just outside the window. (It has to be real data in case it’s accepted by the receiver.)

Key Takeaway

Note that the reason the sending side periodically sends this probe segment is that TCP is designed to make the receive side as simple as possible—it simply responds to segments from the sender, and it never initiates any activity on its own. This is an example of a well-recognized (although not universally applied) protocol design rule, which, for lack of a better name, we call the smart sender/dumb receiver rule. Recall that we saw another example of this rule when we discussed the use of NAKs in the sliding window algorithm.

Protecting Against Wraparound

This subsection and the next consider the size of the SequenceNum and AdvertisedWindow fields and the implications of their sizes on TCP’s correctness and performance. TCP’s SequenceNum field is 32 bits long, and its AdvertisedWindow field is 16 bits long, meaning that TCP has easily satisfied the requirement of the sliding window algorithm that the sequence number space be twice as big as the window size: 2^32 >> 2 × 2^16. However, this requirement is not the interesting thing about these two fields. Consider each field in turn.

The relevance of the 32-bit sequence number space is that the sequence number used on a given connection might wrap around—a byte with sequence number S could be sent at one time, and then at a later time a second byte with the same sequence number S might be sent. Once again, we assume that packets cannot survive in the Internet for longer than the recommended MSL. Thus, we currently need to make sure that the sequence number does not wrap around within a 120-second period of time. Whether or not this happens depends on how fast data can be transmitted over the Internet—that is, how fast the 32-bit sequence number space can be consumed. (This discussion assumes that we are trying to consume the sequence number space as fast as possible, but of course we will be if we are doing our job of keeping the pipe full.) Table 22 shows how long it takes for the sequence number to wrap around on networks with various bandwidths.

Table 22. Time Until 32-Bit Sequence Number Space Wraps Around.

    Bandwidth                  Time Until Wraparound
    T1 (1.5 Mbps)              6.4 hours
    T3 (45 Mbps)               13 minutes
    Fast Ethernet (100 Mbps)   6 minutes
    OC-3 (155 Mbps)            4 minutes
    OC-48 (2.5 Gbps)           14 seconds
    OC-192 (10 Gbps)           3 seconds
    10GigE (10 Gbps)           3 seconds
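
To see where one of these numbers comes from, take the T1 row:

Time until wraparound = 2^32 bytes ÷ bandwidth
                      = (4,294,967,296 × 8) bits ÷ (1.5 × 10^6 bits/sec)
                      ≈ 22,900 seconds ≈ 6.4 hours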

As you can see, the 32-bit sequence number space is adequate at modest bandwidths, but given that OC-192 links are now common in the Internet backbone, and that most servers now come with 10 Gig Ethernet (or 10 Gbps) interfaces, we’re now well past the point where 32 bits is too small. Fortunately, the IETF has worked out an extension to TCP that effectively extends the sequence number space to protect against the sequence number wrapping around. This and related extensions are described in a later section.

Keeping the Pipe Full

The relevance of the 16-bit AdvertisedWindow field is that it must be big enough to allow the sender to keep the pipe full. Clearly, the receiver is free to not open the window as large as the AdvertisedWindow field allows; we are interested in the situation in which the receiver has enough buffer space to handle as much data as the largest possible AdvertisedWindow allows.

In this case, it is not just the network bandwidth but the delay × bandwidth product that dictates how big the AdvertisedWindow field needs to be—the window needs to be opened far enough to allow a full delay × bandwidth product’s worth of data to be transmitted. Assuming an RTT of 100 ms (a typical number for a cross-country connection in the United States), Table 23 gives the delay × bandwidth product for several network technologies.

Table 23. Required Window Size for 100-ms RTT.

    Bandwidth                  Delay × Bandwidth Product
    T1 (1.5 Mbps)              18 KB
    T3 (45 Mbps)               549 KB
    Fast Ethernet (100 Mbps)   1.2 MB
    OC-3 (155 Mbps)            1.8 MB
    OC-48 (2.5 Gbps)           29.6 MB
    OC-192 (10 Gbps)           118.4 MB
    10GigE (10 Gbps)           118.4 MB
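
Again, the T1 row illustrates the arithmetic:

Delay × bandwidth = 100 ms × 1.5 Mbps
                  = 0.1 sec × 1.5 × 10^6 bits/sec
                  = 150,000 bits ≈ 18 KB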

As you can see, TCP’s AdvertisedWindow field is in even worse shape than its SequenceNum field—it is not big enough to handle even a T3 connection across the continental United States, since a 16-bit field allows us to advertise a window of only 64 KB. The very same TCP extension mentioned above provides a mechanism for effectively increasing the size of the advertised window.

5.2.5 Triggering Transmission

We next consider a surprisingly subtle issue: how TCP decides to transmit a segment. As described earlier, TCP supports a byte-stream abstraction; that is, application programs write bytes into the stream, and it is up to TCP to decide that it has enough bytes to send a segment. What factors govern this decision?

If we ignore the possibility of flow control—that is, we assume the window is wide open, as would be the case when a connection first starts—then TCP has three mechanisms to trigger the transmission of a segment. First, TCP maintains a variable, typically called the maximum segment size (MSS), and it sends a segment as soon as it has collected MSS bytes from the sending process. MSS is usually set to the size of the largest segment TCP can send without causing the local IP to fragment. That is, MSS is set to the maximum transmission unit (MTU) of the directly connected network, minus the size of the TCP and IP headers. The second thing that triggers TCP to transmit a segment is that the sending process has explicitly asked it to do so. Specifically, TCP supports a push operation, and the sending process invokes this operation to effectively flush the buffer of unsent bytes. The final trigger for transmitting a segment is that a timer fires; the resulting segment contains as many bytes as are currently buffered for transmission. However, as we will soon see, this “timer” isn’t exactly what you expect.
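
To make the first trigger concrete: on an Ethernet with a 1500-byte MTU, and assuming no IP or TCP options,

MSS = MTU - IP header - TCP header
    = 1500 - 20 - 20
    = 1460 bytes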

Silly Window Syndrome

Of course, we can’t just ignore flow control, which plays an obvious role in throttling the sender. If the sender has MSS bytes of data to send and the window is open at least that much, then the sender transmits a full segment. Suppose, however, that the sender is accumulating bytes to send, but the window is currently closed. Now suppose an ACK arrives that effectively opens the window enough for the sender to transmit, say, MSS/2 bytes. Should the sender transmit a half-full segment or wait for the window to open to a full MSS? The original specification was silent on this point, and early implementations of TCP decided to go ahead and transmit a half-full segment. After all, there is no telling how long it will be before the window opens further.

It turns out that the strategy of aggressively taking advantage of any available window leads to a situation now known as the silly window syndrome. Figure 133 helps visualize what happens. If you think of a TCP stream as a conveyor belt with “full” containers (data segments) going in one direction and empty containers (ACKs) going in the reverse direction, then MSS-sized segments correspond to large containers and 1-byte segments correspond to very small containers. As long as the sender is sending MSS-sized segments and the receiver ACKs at least one MSS of data at a time, everything is good (Figure 133(a)). But, what if the receiver has to reduce the window, so that at some time the sender can’t send a full MSS of data? If the sender aggressively fills a smaller-than-MSS empty container as soon as it arrives, then the receiver will ACK that smaller number of bytes, and hence the small container introduced into the system remains in the system indefinitely. That is, it is immediately filled and emptied at each end and is never coalesced with adjacent containers to create larger containers, as in Figure 133(b). This scenario was discovered when early implementations of TCP regularly found themselves filling the network with tiny segments.

Note that the silly window syndrome is only a problem when either the sender transmits a small segment or the receiver opens the window a small amount. If neither of these happens, then the small container is never introduced into the stream. It’s not possible to outlaw sending small segments; for example, the application might do a push after sending a single byte. It is possible, however, to keep the receiver from introducing a small container (i.e., a small open window). The rule is that after advertising a zero window the receiver must wait for space equal to an MSS before it advertises an open window.

Since we can’t eliminate the possibility of a small container being introduced into the stream, we also need mechanisms to coalesce them. The receiver can do this by delaying ACKs—sending one combined ACK rather than multiple smaller ones—but this is only a partial solution because the receiver has no way of knowing how long it is safe to delay waiting either for another segment to arrive or for the application to read more data (thus opening the window). The ultimate solution falls to the sender, which brings us back to our original issue: When does the TCP sender decide to transmit a segment?

Nagle’s Algorithm

Returning to the TCP sender, if there is data to send but the window is open less than MSS, then we may want to wait some amount of time before sending the available data, but the question is how long? If we wait too long, then we hurt interactive applications like Telnet. If we don’t wait long enough, then we risk sending a bunch of tiny packets and falling into the silly window syndrome. The answer is to introduce a timer and to transmit when the timer expires.

While we could use a clock-based timer—for example, one that fires every 100 ms—Nagle introduced an elegant self-clocking solution. The idea is that as long as TCP has any data in flight, the sender will eventually receive an ACK. This ACK can be treated like a timer firing, triggering the transmission of more data. Nagle’s algorithm provides a simple, unified rule for deciding when to transmit:

When the application produces data to send
    if both the available data and the window >= MSS
        send a full segment
    else
        if there is unACKed data in flight
            buffer the new data until an ACK arrives
        else
            send all the new data now

In other words, it’s always OK to send a full segment if the window allows. It’s also all right to immediately send a small amount of data if there are currently no segments in transit, but if there is anything in flight the sender must wait for an ACK before transmitting the next segment. Thus, an interactive application like Telnet that continually writes one byte at a time will send data at a rate of one segment per RTT. Some segments will contain a single byte, while others will contain as many bytes as the user was able to type in one round-trip time. Because some applications cannot afford such a delay for each write it does to a TCP connection, the socket interface allows the application to turn off Nagle’s algorithm by setting the TCP_NODELAY option. Setting this option means that data is transmitted as soon as possible.
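
For example, using the socket API, an application disables Nagle’s algorithm like this (error handling omitted for brevity):

#include <netinet/in.h>
#include <netinet/tcp.h>   /* defines TCP_NODELAY */
#include <sys/socket.h>

void disable_nagle(int sockfd)
{
    int one = 1;
    /* Transmit data as soon as possible, even in small segments. */
    setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}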

5.2.6 Adaptive Retransmission

Because TCP guarantees the reliable delivery of data, it retransmits each segment if an ACK is not received in a certain period of time. TCP sets this timeout as a function of the RTT it expects between the two ends of the connection. Unfortunately, given the range of possible RTTs between any pair of hosts in the Internet, as well as the variation in RTT between the same two hosts over time, choosing an appropriate timeout value is not that easy. To address this problem, TCP uses an adaptive retransmission mechanism. We now describe this mechanism and how it has evolved over time as the Internet community has gained more experience using TCP.

Original Algorithm

We begin with a simple algorithm for computing a timeout value between a pair of hosts. This is the algorithm that was originally described in the TCP specification—and the following description presents it in those terms—but it could be used by any end-to-end protocol.

The idea is to keep a running average of the RTT and then to compute the timeout as a function of this RTT. Specifically, every time TCP sends a data segment, it records the time. When an ACK for that segment arrives, TCP reads the time again, and then takes the difference between these two times as a SampleRTT. TCP then computes an EstimatedRTT as a weighted average between the previous estimate and this new sample. That is,

EstimatedRTT = alpha x EstimatedRTT + (1 - alpha) x SampleRTT

The parameter alpha is selected to smooth the EstimatedRTT. A small alpha tracks changes in the RTT but is perhaps too heavily influenced by temporary fluctuations. On the other hand, a large alpha is more stable but perhaps not quick enough to adapt to real changes. The original TCP specification recommended a setting of alpha between 0.8 and 0.9. TCP then uses EstimatedRTT to compute the timeout in a rather conservative way:

TimeOut = 2 x EstimatedRTT
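
In code, the original algorithm amounts to only a few lines. The C sketch below is ours, with times in milliseconds and alpha chosen from the recommended range:

static double EstimatedRTT;          /* 0 until the first sample arrives */
static const double alpha = 0.875;   /* within the recommended 0.8-0.9 */

/* Fold in a new measurement and return the retransmit timeout. */
double update_timeout(double SampleRTT)
{
    if (EstimatedRTT == 0.0)
        EstimatedRTT = SampleRTT;    /* first sample seeds the estimate */
    else
        EstimatedRTT = alpha * EstimatedRTT + (1 - alpha) * SampleRTT;
    return 2 * EstimatedRTT;         /* TimeOut = 2 x EstimatedRTT */
}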

Karn/Partridge Algorithm

After several years of use on the Internet, a rather obvious flaw was discovered in this simple algorithm. The problem was that an ACK does not really acknowledge a transmission; it actually acknowledges the receipt of data. In other words, whenever a segment is retransmitted and then an ACK arrives at the sender, it is impossible to determine if this ACK should be associated with the first or the second transmission of the segment for the purpose of measuring the sample RTT. It is necessary to know which transmission to associate it with so as to compute an accurate SampleRTT. As illustrated in Figure 134, if you assume that the ACK is for the original transmission but it was really for the second, then the SampleRTT is too large (a); if you assume that the ACK is for the second transmission but it was actually for the first, then the SampleRTT is too small (b).

The solution, which was proposed in 1987, is surprisingly simple. Whenever TCP retransmits a segment, it stops taking samples of the RTT; it only measures SampleRTT for segments that have been sent only once. This solution is known as the Karn/Partridge algorithm, after its inventors. Their proposed fix also includes a second small change to TCP’s timeout mechanism. Each time TCP retransmits, it sets the next timeout to be twice the last timeout, rather than basing it on the last EstimatedRTT. That is, Karn and Partridge proposed that TCP use exponential backoff, similar to what the Ethernet does. The motivation for using exponential backoff is simple: Congestion is the most likely cause of lost segments, meaning that the TCP source should not react too aggressively to a timeout. In fact, the more times the connection times out, the more cautious the source should become. We will see this idea again, embodied in a much more sophisticated mechanism, in the next chapter.
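
The backoff itself is a one-line change to how the timer is reloaded on each retransmission; a sketch (the 64-second cap is our own illustrative choice, not part of the original proposal):

/* On each retransmission, double the timeout rather than
 * recomputing it from EstimatedRTT. */
double backoff_timeout(double last_timeout_ms)
{
    double next = 2 * last_timeout_ms;
    return next < 64000.0 ? next : 64000.0;   /* illustrative cap */
}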

Jacobson/Karels Algorithm

The Karn/Partridge algorithm was introduced at a time when the Internet was suffering from high levels of network congestion. Their approach was designed to fix some of the causes of that congestion, but, although it was an improvement, the congestion was not eliminated. The following year (1988), two other researchers—Jacobson and Karels—proposed a more drastic change to TCP to battle congestion. The bulk of that proposed change is described in the next chapter. Here, we focus on the aspect of that proposal that is related to deciding when to time out and retransmit a segment.

As an aside, it should be clear how the timeout mechanism is related to congestion—if you time out too soon, you may unnecessarily retransmit a segment, which only adds to the load on the network. The other reason for needing an accurate timeout value is that a timeout is taken to imply congestion, which triggers a congestion-control mechanism. Finally, note that there is nothing about the Jacobson/Karels timeout computation that is specific to TCP. It could be used by any end-to-end protocol.

The main problem with the original computation is that it does not take the variance of the sample RTTs into account. Intuitively, if the variation among samples is small, then the EstimatedRTT can be better trusted and there is no reason for multiplying this estimate by 2 to compute the timeout. On the other hand, a large variance in the samples suggests that the timeout value should not be too tightly coupled to the EstimatedRTT.

In the new approach, the sender measures a new SampleRTT as before. It then folds this new sample into the timeout calculation as follows:

Difference   = SampleRTT - EstimatedRTT
EstimatedRTT = EstimatedRTT + (delta x Difference)
Deviation    = Deviation + delta x (|Difference| - Deviation)

where delta is between 0 and 1. That is, we calculate both the mean RTT and the variation in that mean.

TCP then computes the timeout value as a function of both EstimatedRTT and Deviation as follows:

TimeOut = mu x EstimatedRTT + phi x Deviation

where based on experience, mu is typically set to 1 and phi is set to 4. Thus, when the variance is small, TimeOut is close to EstimatedRTT; a large variance causes the Deviation term to dominate the calculation.
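
Written out directly, the calculation looks like the following floating-point sketch of ours (initialization of the estimates is elided); the scaled integer form actually used in practice appears in the Implementation subsection below:

static double EstimatedRTT, Deviation;

double jacobson_karels_timeout(double SampleRTT)
{
    double Difference = SampleRTT - EstimatedRTT;
    double AbsDiff = Difference < 0 ? -Difference : Difference;
    EstimatedRTT = EstimatedRTT + 0.125 * Difference;       /* delta = 1/8 */
    Deviation = Deviation + 0.125 * (AbsDiff - Deviation);
    return 1 * EstimatedRTT + 4 * Deviation;                /* mu = 1, phi = 4 */
}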

Implementation

There are two items of note regarding the implementation of timeouts in TCP. The first is that it is possible to implement the calculation for EstimatedRTT and Deviation without using floating-point arithmetic. Instead, the whole calculation is scaled by 2^n, with delta selected to be 1/2^n. This allows us to do integer arithmetic, implementing multiplication and division using shifts, thereby achieving higher performance. The resulting calculation is given by the following code fragment, where n = 3 (i.e., delta = 1/8). Note that EstimatedRTT and Deviation are stored in their scaled-up forms, while the value of SampleRTT at the start of the code and of TimeOut at the end are real, unscaled values. If you find the code hard to follow, you might want to try plugging some real numbers into it and verifying that it gives the same results as the equations above.

{
    SampleRTT -= (EstimatedRTT >> 3);   /* SampleRTT now holds Difference */
    EstimatedRTT += SampleRTT;          /* EstimatedRTT += delta x Difference */
    if (SampleRTT < 0)
        SampleRTT = -SampleRTT;         /* SampleRTT now holds |Difference| */
    SampleRTT -= (Deviation >> 3);
    Deviation += SampleRTT;             /* Deviation += delta x (|Difference| - Deviation) */
    TimeOut = (EstimatedRTT >> 3) + (Deviation >> 1);   /* mu x EstimatedRTT + phi x Deviation */
}

The second point of note is that the Jacobson/Karels algorithm is only as good as the clock used to read the current time. On typical Unix implementations at the time, the clock granularity was as large as 500 ms, which is significantly larger than the average cross-country RTT of somewhere between 100 and 200 ms. To make matters worse, the Unix implementation of TCP only checked to see if a timeout should happen every time this 500-ms clock ticked and would only take a sample of the round-trip time once per RTT. The combination of these two factors could mean that a timeout would happen 1 second after the segment was transmitted. Once again, the extensions to TCP include a mechanism that makes this RTT calculation a bit more precise.

All of the retransmission algorithms we have discussed are based on acknowledgment timeouts, which indicate that a segment has probably been lost. Note that a timeout does not, however, tell the sender whether any segments it sent after the lost segment were successfully received. This is because TCP acknowledgments are cumulative; they identify only the last segment that was received without any preceding gaps. The reception of segments that occur after a gap grows more frequent as faster networks lead to larger windows. If ACKs also told the sender which subsequent segments, if any, had been received, then the sender could be more intelligent about which segments it retransmits, draw better conclusions about the state of congestion, and make better RTT estimates. A TCP extension supporting this is described in a later section.

Key Takeaway

There is one other point to make about computing timeouts. It is a surprisingly tricky business, so much so that there is an entire RFC dedicated to the topic: RFC 6298. The takeaway is that sometimes fully specifying a protocol involves so much minutiae that the line between specification and implementation becomes blurred. That has happened more than once with TCP, causing some to argue that “the implementation is the specification.” But that’s not necessarily a bad thing as long as the reference implementation is available as open source software. More generally, the industry is seeing open source software grow in importance as open standards recede in importance.

5.2.7 Record Boundaries

Since TCP is a byte-stream protocol, the number of bytes written by the sender are not necessarily the same as the number of bytes read by the receiver. For example, the application might write 8 bytes, then 2 bytes, then 20 bytes to a TCP connection, while on the receiving side the application reads 5 bytes at a time inside a loop that iterates 6 times. TCP does not interject record boundaries between the 8th and 9th bytes, nor between the 10th and 11th bytes. This is in contrast to a message-oriented protocol, such as UDP, in which the message that is sent is exactly the same length as the message that is received.

Even though TCP is a byte-stream protocol, it has two different features that can be used by the sender to insert record boundaries into this byte stream, thereby informing the receiver how to break the stream of bytes into records. (Being able to mark record boundaries is useful, for example, in many database applications.) Both of these features were originally included in TCP for completely different reasons; they have only come to be used for this purpose over time.

The first mechanism is the urgent data feature, as implemented by the URG flag and the UrgPtr field in the TCP header. Originally, the urgent data mechanism was designed to allow the sending application to send out-of-band data to its peer. By “out of band” we mean data that is separate from the normal flow of data (e.g., a command to interrupt an operation already under way). This out-of-band data was identified in the segment using the UrgPtr field and was to be delivered to the receiving process as soon as it arrived, even if that meant delivering it before data with an earlier sequence number. Over time, however, this feature has not been used, so instead of signifying “urgent” data, it has come to be used to signify “special” data, such as a record marker. This use has developed because, as with the push operation, TCP on the receiving side must inform the application that urgent data has arrived. That is, the urgent data in itself is not important. It is the fact that the sending process can effectively send a signal to the receiver that is important.

The second mechanism for inserting end-of-record markers into a byte stream is the push operation. Originally, this mechanism was designed to allow the sending process to tell TCP that it should send (flush) whatever bytes it had collected to its peer. The push operation can be used to implement record boundaries because the specification says that TCP must send whatever data it has buffered at the source when the application says push, and, optionally, TCP at the destination notifies the application whenever an incoming segment has the PUSH flag set. If the receiving side supports this option (the socket interface does not), then the push operation can be used to break the TCP stream into records.

Of course, the application program is always free to insert record boundaries without any assistance from TCP. For example, it can send a field that indicates the length of a record that is to follow, or it can insert its own record boundary markers into the data stream.
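
The length-field approach is the most common in practice. A minimal sketch of such application-level framing follows (the helper names are our own): each record is preceded by a 4-byte length header, so the receiver can rebuild the boundaries that TCP does not preserve.

    import struct

    def send_record(sock, data: bytes):
        sock.sendall(struct.pack("!I", len(data)) + data)  # length header, then body

    def recv_exactly(sock, n: int) -> bytes:
        buf = b""
        while len(buf) < n:               # recv may return fewer bytes than asked
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed mid-record")
            buf += chunk
        return buf

    def recv_record(sock) -> bytes:
        (length,) = struct.unpack("!I", recv_exactly(sock, 4))
        return recv_exactly(sock, length)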

5.2.8 TCP Extensions

We have mentioned at four different points in this section that there are now extensions to TCP that help to mitigate some problem that TCP faced as the underlying network got faster. These extensions are designed to have as small an impact on TCP as possible. In particular, they are realized as options that can be added to the TCP header. (We glossed over this point earlier, but the reason why the TCP header has a HdrLen field is that the header can be of variable length; the variable part of the TCP header contains the options that have been added.) The significance of adding these extensions as options rather than changing the core of the TCP header is that hosts can still communicate using TCP even if they do not implement the options. Hosts that do implement the optional extensions, however, can take advantage of them. The two sides agree that they will use the options during TCP’s connection establishment phase.

The first extension helps to improve TCP’s timeout mechanism. Instead of measuring the RTT using a coarse-grained event, TCP can read the actual system clock when it is about to send a segment, and put this time—think of it as a 32-bit timestamp—in the segment’s header. The receiver then echoes this timestamp back to the sender in its acknowledgment, and the sender subtracts this timestamp from the current time to measure the RTT. In essence, the timestamp option provides a convenient place for TCP to store the record of when a segment was transmitted; it stores the time in the segment itself. Note that the endpoints in the connection do not need synchronized clocks, since the timestamp is written and read at the same end of the connection.
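
A minimal sketch of the idea, using dictionaries as stand-ins for segments and ACKs rather than real TCP headers: the sender stamps each segment, the receiver echoes the stamp, and the sender needs no per-segment bookkeeping to recover the RTT. The two clocks never need to agree, since the stamp is written and read at the same end.

    import time

    def make_segment(payload: bytes) -> dict:
        return {"payload": payload, "tsval": time.monotonic()}  # stamped at send time

    def make_ack(segment: dict) -> dict:
        return {"tsecr": segment["tsval"]}       # receiver just echoes the stamp

    def rtt_from_ack(ack: dict) -> float:
        return time.monotonic() - ack["tsecr"]   # written and read at one end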

The second extension addresses the problem of TCP’s 32-bit SequenceNum field wrapping around too soon on a high-speed network. Rather than define a new 64-bit sequence number field, TCP uses the 32-bit timestamp just described to effectively extend the sequence number space. In other words, TCP decides whether to accept or reject a segment based on a 64-bit identifier that has the SequenceNum field in the low-order 32 bits and the timestamp in the high-order 32 bits. Since the timestamp is always increasing, it serves to distinguish between two different incarnations of the same sequence number. Note that the timestamp is being used in this setting only to protect against wraparound; it is not treated as part of the sequence number for the purpose of ordering or acknowledging data.
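
As a small illustration of that 64-bit identifier (a sketch of the idea, not the full wraparound-protection machinery):

    def extended_seq(tsval: int, seqnum: int) -> int:
        return (tsval << 32) | seqnum   # timestamp in high 32 bits, seqnum in low

    old = extended_seq(tsval=7, seqnum=1000)   # first incarnation of seqnum 1000
    new = extended_seq(tsval=8, seqnum=1000)   # same seqnum after wraparound
    assert new > old                           # the timestamp tells them apart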

The third extension allows TCP to advertise a larger window, thereby allowing it to fill larger delay × bandwidth pipes that are made possible by high-speed networks. This extension involves an option that defines a scaling factor for the advertised window. That is, rather than interpreting the number that appears in the AdvertisedWindow field as indicating how many bytes the sender is allowed to have unacknowledged, this option allows the two sides of TCP to agree that the AdvertisedWindow field counts larger chunks (e.g., how many 16-byte units of data the sender can have unacknowledged). In other words, the window scaling option specifies how many bits each side should left-shift the AdvertisedWindow field before using its contents to compute an effective window.
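
The arithmetic is a single shift. For example, with a scale factor of 4 each unit of AdvertisedWindow counts 16 bytes, so the maximum window grows from 64 KB to just under 1 MB:

    def effective_window(advertised_window: int, scale: int) -> int:
        return advertised_window << scale   # scale negotiated at connection setup

    print(effective_window(0xFFFF, 4))      # -> 1048560 bytes, just under 1 MB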

The fourth extension allows TCP to augment its cumulative acknowledgment with selective acknowledgments of any additional segments that have been received but aren’t contiguous with all previously received segments. This is the selective acknowledgment, or SACK, option. When the SACK option is used, the receiver continues to acknowledge segments normally—the meaning of the Acknowledgment field does not change—but it also uses optional fields in the header to acknowledge any additional blocks of received data. This allows the sender to retransmit just the segments that are missing according to the selective acknowledgment.

Without SACK, there are only two reasonable strategies for a sender. The pessimistic strategy responds to a timeout by retransmitting not just the segment that timed out, but any segments transmitted subsequently. In effect, the pessimistic strategy assumes the worst: that all those segments were lost. The disadvantage of the pessimistic strategy is that it may unnecessarily retransmit segments that were successfully received the first time. The other strategy is the optimistic strategy, which responds to a timeout by retransmitting only the segment that timed out. In effect, the optimistic approach assumes the rosiest scenario: that only the one segment has been lost. The disadvantage of the optimistic strategy is that it is unnecessarily slow when a series of consecutive segments has been lost, as might happen when there is congestion. It is slow because each segment’s loss is not discovered until the sender receives an ACK for its retransmission of the previous segment. So it consumes one RTT per segment until it has retransmitted all the segments in the lost series. With the SACK option, a better strategy is available to the sender: retransmit just the segments that fill the gaps between the segments that have been selectively acknowledged.
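
The following sketch shows the gap computation this strategy rests on: given the cumulative ACK and the SACKed byte ranges, only the holes between them need to be retransmitted. (Bytes beyond the last SACK block are simply not yet acknowledged, which is a separate question.)

    def gaps_to_retransmit(cum_ack: int, sack_blocks, send_next: int):
        """sack_blocks: list of (start, end) byte ranges the receiver reports."""
        missing, expected = [], cum_ack
        for start, end in sorted(sack_blocks):
            if start > expected:
                missing.append((expected, start))   # a hole before this block
            expected = max(expected, end)
        if expected < send_next:
            missing.append((expected, send_next))   # tail not yet acknowledged
        return missing

    # Cumulative ACK covers up to byte 1000; blocks 2000-3000 and 4000-5000
    # were SACKed; we have sent up to byte 5000:
    print(gaps_to_retransmit(1000, [(2000, 3000), (4000, 5000)], 5000))
    # -> [(1000, 2000), (3000, 4000)]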

These extensions, by the way, are not the full story. We’ll see some more extensions in the next chapter when we look at how TCP handles congestion. The Internet Assigned Numbers Authority (IANA) keeps track of all the options that are defined for TCP (and for many other Internet protocols). See the references at the end of the chapter for a link to IANA’s protocol number registry.

5.2.9 Performance

Recall that Chapter 1 introduced the two quantitative metrics by which network performance is evaluated: latency and throughput. As mentioned in that discussion, these metrics are influenced not only by the underlying hardware (e.g., propagation delay and link bandwidth) but also by software overheads. Now that we have a complete software-based protocol graph available to us that includes alternative transport protocols, we can discuss how to meaningfully measure its performance. The importance of such measurements is that they represent the performance seen by application programs.

We begin, as any report of experimental results should, by describing our experimental method. This includes the apparatus used in the experiments; in this case, each workstation has a pair of dual-CPU 2.4-GHz Xeon processors running Linux. In order to enable speeds above 1 Gbps, a pair of Ethernet adaptors (labeled NIC, for network interface card) are used on each machine. The Ethernet spans a single machine room so propagation is not an issue, making this a measure of processor/software overheads. A test program running on top of the socket interface simply tries to transfer data as quickly as possible from one machine to the other. Figure 135 illustrates the setup.

You may notice that this experimental setup is not especially bleeding edge in terms of the hardware or link speed. The point of this section is not to show how fast a particular protocol can run, but to illustrate the general methodology for measuring and reporting protocol performance.

The throughput test is performed for a variety of message sizes using a standard benchmarking tool called TTCP. The results of the throughput test are given in Figure 136. The key thing to notice in this graph is that throughput improves as the messages get larger. This makes sense—each message involves a certain amount of overhead, so a larger message means that this overhead is amortized over more bytes. The throughput curve flattens off above 1 KB, at which point the per-message overhead becomes insignificant when compared to the large number of bytes that the protocol stack has to process.
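
To make the methodology concrete, here is a toy stand-in for such a test (not the TTCP tool itself): blast fixed-size messages over a loopback connection and report throughput as a function of message size. The same amortization effect should be visible, since smaller messages mean more per-message overhead for the same number of bytes.

    import socket, threading, time

    def sink(srv):
        conn, _ = srv.accept()
        while conn.recv(65536):       # discard everything until the peer closes
            pass

    def measure(msg_size, total_bytes=100_000_000):
        srv = socket.socket(); srv.bind(("127.0.0.1", 0)); srv.listen(1)
        threading.Thread(target=sink, args=(srv,), daemon=True).start()
        s = socket.create_connection(srv.getsockname())
        msg, start = b"x" * msg_size, time.monotonic()
        for _ in range(total_bytes // msg_size):
            s.sendall(msg)            # one "message" per send call
        s.close()
        gbps = total_bytes * 8 / (time.monotonic() - start) / 1e9
        print(f"{msg_size:>6}-byte messages: {gbps:.2f} Gbps")

    for size in (64, 256, 1024, 4096, 16384):
        measure(size)   # throughput should climb as per-message overhead amortizes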

It’s worth noting that the maximum throughput is less than 2 Gbps, the available link speed in this setup. Further testing and analysis of results would be needed to figure out where the bottleneck is (or if there is more than one). For example, looking at CPU load might give an indication of whether the CPU is the bottleneck or whether memory bandwidth, adaptor performance, or some other issue is to blame.

We also note that the network in this test is basically “perfect.” It has almost no delay or loss, so the only factors affecting performance are the TCP implementation and the workstation hardware and software. By contrast, most of the time we deal with networks that are far from perfect, notably our bandwidth-constrained, last-mile links and loss-prone wireless links. Before we can fully appreciate how these links affect TCP performance, we need to understand how TCP deals with congestion, which is the topic of the next chapter.

At various times in the history of networking, the steadily increasing speed of network links has threatened to run ahead of what could be delivered to applications. For example, a large research effort was begun in the United States in 1989 to build “gigabit networks,” where the goal was not only to build links and switches that could run at 1 Gbps or higher but also to deliver that throughput all the way to a single application process. There were some real problems (e.g., network adaptors, workstation architectures, and operating systems all had to be designed with network-to-application throughput in mind) and also some perceived problems that turned out to be not so serious. High on the list of such problems was the concern that existing transport protocols, TCP in particular, might not be up to the challenge of gigabit operation.

As it turns out, TCP has done well keeping up with the increasing demands of high-speed networks and applications. One of the most important factors was the introduction of window scaling to deal with larger bandwidth-delay products. However, there is often a big difference between the theoretical performance of TCP and what is achieved in practice. Relatively simple problems like copying the data more times than necessary as it passes from network adaptor to application can drive down performance, as can insufficient buffer memory when the bandwidth-delay product is large. And the dynamics of TCP are complex enough (as will become even more apparent in the next chapter) that subtle interactions among network behavior, application behavior, and the TCP protocol itself can dramatically alter performance.

For our purposes, it’s worth noting that TCP continues to perform very well as network speeds increase, and when it runs up against some limit (normally related to congestion, increasing bandwidth-delay products, or both), researchers rush in to find solutions. We’ve seen some of those in this chapter, and we’ll see some more in the next.

5.2.10 Alternative Design Choices (SCTP, QUIC)

Although TCP has proven to be a robust protocol that satisfies the needs of a wide range of applications, the design space for transport protocols is quite large. TCP is by no means the only valid point in that design space. We conclude our discussion of TCP by considering alternative design choices. While we offer an explanation for why TCP’s designers made the choices they did, we observe that other protocols exist that have made other choices, and more such protocols may appear in the future.

First, we have suggested from the very first chapter of this book that there are at least two interesting classes of transport protocols: stream-oriented protocols like TCP and request/reply protocols like RPC. In other words, we have implicitly divided the design space in half and placed TCP squarely in the stream-oriented half of the world. We could further divide the stream-oriented protocols into two groups—reliable and unreliable—with the former containing TCP and the latter being more suitable for interactive video applications that would rather drop a frame than incur the delay associated with a retransmission.

This exercise in building a transport protocol taxonomy is interesting and could be continued in greater and greater detail, but the world isn’t as black and white as we might like. Consider the suitability of TCP as a transport protocol for request/reply applications, for example. TCP is a full-duplex protocol, so it would be easy to open a TCP connection between the client and server, send the request message in one direction, and send the reply message in the other direction. There are two complications, however. The first is that TCP is a byte-oriented protocol rather than a message-oriented protocol, and request/reply applications always deal with messages. (We explore the issue of bytes versus messages in greater detail in a moment.) The second complication is that in those situations where both the request message and the reply message fit in a single network packet, a well-designed request/reply protocol needs only two packets to implement the exchange, whereas TCP would need at least nine: three to establish the connection, two for the message exchange, and four to tear down the connection. Of course, if the request or reply messages are large enough to require multiple network packets (e.g., it might take 100 packets to send a 100,000-byte reply message), then the overhead of setting up and tearing down the connection is inconsequential. In other words, it isn’t always the case that a particular protocol cannot support a certain functionality; it’s sometimes the case that one design is more efficient than another under particular circumstances.

Second, as just suggested, you might question why TCP chose to provide a reliable byte-stream service rather than a reliable message-stream service; messages would be the natural choice for a database application that wants to exchange records. There are two answers to this question. The first is that a message-oriented protocol must, by definition, establish an upper bound on message sizes. After all, an infinitely long message is a byte stream. For any message size that a protocol selects, there will be applications that want to send larger messages, rendering the transport protocol useless and forcing the application to implement its own transport-like services. The second reason is that, while message-oriented protocols are definitely more appropriate for applications that want to send records to each other, you can easily insert record boundaries into a byte stream to implement this functionality.

A third decision made in the design of TCP is that it delivers bytes in order to the application. This means that it may hold onto bytes that were received out of order from the network, awaiting some missing bytes to fill a hole. This is enormously helpful for many applications but turns out to be quite unhelpful if the application is capable of processing data out of order. As a simple example, a Web page containing multiple embedded objects doesn’t need all the objects to be delivered in order before starting to display the page. In fact, there is a class of applications that would prefer to handle out-of-order data at the application layer, in return for getting data sooner when packets are dropped or misordered within the network. The desire to support such applications led to the creation of not one but two IETF standard transport protocols. The first of these was SCTP, the Stream Control Transmission Protocol. SCTP provides a partially ordered delivery service, rather than the strictly ordered service of TCP. (SCTP also makes some other design decisions that differ from TCP, including message orientation and support of multiple IP addresses for a single session.) More recently, the IETF has been standardizing a protocol optimized for Web traffic, known as QUIC. More on QUIC in a moment.

Fourth, TCP chose to implement explicit setup/teardown phases, but this is not required. In the case of connection setup, it would be possible to send all necessary connection parameters along with the first data message. TCP elected to take a more conservative approach that gives the receiver the opportunity to reject the connection before any data arrives. In the case of teardown, we could quietly close a connection that has been inactive for a long period of time, but this would complicate applications like remote login that want to keep a connection alive for weeks at a time; such applications would be forced to send out-of-band “keep alive” messages to keep the connection state at the other end from disappearing.

Finally, TCP is a window-based protocol, but this is not the only possibility. The alternative is a rate-based design, in which the receiver tells the sender the rate—expressed in either bytes or packets per second—at which it is willing to accept incoming data. For example, the receiver might inform the sender that it can accommodate 100 packets a second. There is an interesting duality between windows and rate, since the number of packets (bytes) in the window, divided by the RTT, is exactly the rate. For example, a window size of 10 packets and a 100-ms RTT implies that the sender is allowed to transmit at a rate of 100 packets a second. It is by increasing or decreasing the advertised window size that the receiver is effectively raising or lowering the rate at which the sender can transmit. In TCP, this information is fed back to the sender in the AdvertisedWindow field of the ACK for every segment. One of the key issues in a rate-based protocol is how often the desired rate—which may change over time—is relayed back to the source: Is it for every packet, once per RTT, or only when the rate changes? While we have just now considered window versus rate in the context of flow control, it is an even more hotly contested issue in the context of congestion control, which we will discuss in the next chapter.
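
The duality is just the division described above, as this one-line sketch makes explicit:

    def window_to_rate(window_pkts: int, rtt_s: float) -> float:
        return window_pkts / rtt_s      # packets per second

    print(window_to_rate(10, 0.100))    # 10-packet window, 100-ms RTT -> 100.0 pkt/s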

QUIC

QUIC originated at Google in 2012 and was subsequently developed as a proposed standard at the IETF. Unlike many other efforts to add to the set of transport protocols in the Internet, QUIC has achieved widespread deployment. As discussed further in Chapter 9, QUIC was motivated in large part by the challenges of matching the request/response semantics of HTTP to the stream-oriented nature of TCP. These issues have become more noticeable over time, due to factors such as the rise of high-latency wireless networks, the availability of multiple networks for a single device (e.g., Wi-Fi and cellular), and the increasing use of encrypted, authenticated connections on the Web (as discussed in Chapter 8). While a full description of QUIC is beyond our scope, some of the key design decisions are worth discussing.

If network latency is high—say 100 milliseconds or more—then a few RTTs can quickly add up to a visible annoyance for an end user. Establishing an HTTP session over TCP with Transport Layer Security (Section 8.5) would typically take at least three round trips (one for TCP session establishment and two for setting up the encryption parameters) before the first HTTP message could be sent. The designers of QUIC recognized that this delay—the direct result of a layered approach to protocol design—could be dramatically reduced if connection setup and the required security handshakes were combined and optimized for minimal round trips.

Note also how the presence of multiple network interfaces might affect the design. If your mobile phone loses its Wi-Fi connection and needs to switch to a cellular connection, that would typically require both a TCP timeout on one connection and a new series of handshakes on the other. Making the connection something that can persist over different network layer connections was another design goal for QUIC.

Finally, as noted above, the reliable byte stream model for TCP is a poor match to a Web page request, when many objects need to be fetched and page rendering could begin before they have all arrived. While one workaround for this would be to open multiple TCP connections in parallel, this approach (which was used in the early days of the Web) has its own set of drawbacks, notably on congestion control (see Chapter 6). Specifically, each connection runs its own congestion control loop, so the experience of congestion on one connection is not apparent to the other connections, and each connection tries to figure out the appropriate amount of bandwidth to consume on its own. QUIC introduced the idea of streams within a connection, so that objects could be delivered out of order while maintaining a holistic view of congestion across all the streams.

Interestingly, by the time QUIC emerged, many design decisions had been made that presented challenges for the deployment of a new transport protocol. Notably, many “middleboxes” such as NATs and firewalls (see Section 8.5) have enough understanding of the existing widespread transport protocols (TCP and UDP) that they can’t be relied upon to pass a new transport protocol. As a result, QUIC actually rides on top of UDP. In other words, it is a transport protocol running on top of a transport protocol. This is not as uncommon as our focus on layering might suggest, as the next two subsections also illustrate. This choice was made in the interests of making QUIC deployable on the Internet as it exists, and has been quite successful.

QUIC implements fast connection establishment with encryption and authentication in the first RTT. It provides a connection identifier that persists across changes in the underlying network. It supports the multiplexing of several streams onto a single transport connection, to avoid the head-of-line blocking that may arise when a single packet is dropped while other useful data continues to arrive. And it preserves (and in some ways improves on) the congestion avoidance properties of TCP, an important aspect of transport protocols that we return to in Chapter 6.

HTTP went through a number of versions (1.0, 1.1, 2.0, discussed in Section 9.1) in an effort to map its requirements more cleanly onto the capabilities of TCP. With the arrival of QUIC, HTTP/3 is now able to leverage a transport layer that was explicitly designed to meet the application requirements of the Web.

QUIC is a most interesting development in the world of transport protocols. Many of the limitations of TCP have been known for decades, but QUIC represents one of the most successful efforts to date to stake out a different point in the design space. Because QUIC was inspired by experience with HTTP and the Web—which arose long after TCP was well established in the Internet—it presents a fascinating case study in the unforeseen consequences of layered designs and in the evolution of the Internet.
