From 4bfd864f10b68b71482b35c818559068ef8d5797 Mon Sep 17 00:00:00 2001
From: Thomas Voss
Date: Wed, 27 Nov 2024 20:54:24 +0100
Subject: doc: Add RFC documents

---
 doc/rfc/rfc817.txt | 1388 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1388 insertions(+)
 create mode 100644 doc/rfc/rfc817.txt

diff --git a/doc/rfc/rfc817.txt b/doc/rfc/rfc817.txt
new file mode 100644
index 0000000..dcdef8a
--- /dev/null
+++ b/doc/rfc/rfc817.txt
@@ -0,0 +1,1388 @@
+
+RFC:  817
+
+
+
+         MODULARITY AND EFFICIENCY IN PROTOCOL IMPLEMENTATION
+
+                          David D. Clark
+                 MIT Laboratory for Computer Science
+               Computer Systems and Communications Group
+                             July, 1982
+
+
+     1.  Introduction
+
+
+     Many protocol implementers have made the unpleasant discovery that
+
+their packages do not run quite as fast as they had hoped.  The blame
+
+for this widely observed problem has been attributed to a variety of
+
+causes, ranging from details in the design of the protocol to the
+
+underlying structure of the host operating system.  This RFC will
+
+discuss some of the commonly encountered reasons why protocol
+
+implementations seem to run slowly.
+
+
+     Experience suggests that one of the most important factors in
+
+determining the performance of an implementation is the manner in which
+
+that implementation is modularized and integrated into the host
+
+operating system.  For this reason, it is useful to discuss the question
+
+of how an implementation is structured at the same time that we consider
+
+how it will perform.  In fact, this RFC will argue that modularity is
+
+one of the chief villains in attempting to obtain good performance, so
+
+that the designer is faced with a delicate and inevitable tradeoff
+
+between good structure and good performance.  Further, the single factor
+
+which most strongly determines how well this conflict can be resolved is
+
+not the protocol but the operating system.
+
+                                   2
+
+
+     2.  Efficiency Considerations
+
+
+     There are many aspects to efficiency.  One aspect is sending data
+
+at minimum transmission cost, which is a critical aspect of common
+
+carrier communications, if not in local area network communications.
+
+Another aspect is sending data at a high rate, which may not be possible
+
+at all if the net is very slow, but which may be the one central design
+
+constraint when taking advantage of a local net with high raw bandwidth.
+
+The final consideration is doing the above with minimum expenditure of
+
+computer resources.  This last may be necessary to achieve high speed,
+
+but in the case of the slow net may be important only in that the
+
+resources used up, for example cpu cycles, are costly or otherwise
+
+needed.  It is worth pointing out that these different goals often
+
+conflict; for example it is often possible to trade off efficient use of
+
+the computer against efficient use of the network.  Thus, there may be
+
+no such thing as a successful general purpose protocol implementation.
+
+
+     The simplest measure of performance is throughput, measured in bits
+
+per second.  It is worth doing a few simple computations in order to get
+
+a feeling for the magnitude of the problems involved.  Assume that data
+
+is being sent from one machine to another in packets of 576 bytes, the
+
+maximum generally acceptable internet packet size.  Allowing for header
+
+overhead, this packet size permits 4288 bits in each packet.
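+
+
+     This arithmetic can be sketched in a few lines of C.  The 20 byte
+
+IP and 20 byte TCP headers, with no options, are an assumption made here
+
+so that the numbers come out as above; the little program simply
+
+reproduces the packet rates derived in the paragraph that follows.
+
+    #include <stdio.h>
+
+    int main(void)
+    {
+        int    packet = 576;             /* maximum internet packet, bytes */
+        int    header = 20 + 20;         /* assumed IP + TCP headers       */
+        int    bits   = (packet - header) * 8;  /* 4288 data bits/packet   */
+        double rate[] = { 10e3, 100e3, 1e6 };   /* bits per second         */
+        int    i;
+
+        for (i = 0; i < 3; i++)
+            printf("%8.0f bps: one packet every %6.1f ms\n",
+                   rate[i], 1000.0 * bits / rate[i]);
+        return 0;
+    }
+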
+If a useful throughput of 10,000 bits per second is desired, then a data
+
+bearing packet must leave the sending host about every 430 milliseconds,
+
+a little over two per second.  This is clearly not difficult to achieve.
+
+However, if one wishes to achieve 100 kilobits per second throughput,
+
+                                   3
+
+
+the packet must leave the host every 43 milliseconds, and to achieve one
+
+megabit per second, which is not at all unreasonable on a high-speed
+
+local net, the packets must be spaced no more than 4.3 milliseconds
+
+apart.
+
+
+     These latter numbers represent a slightly more alarming goal for
+
+which to set one's sights.  Many operating systems take a substantial
+
+fraction of a millisecond just to service an interrupt.  If the protocol
+
+has been structured as a process, it is necessary to go through a
+
+process scheduling operation before the protocol code can even begin to
+
+run.  If any piece of a protocol package or its data must be fetched
+
+from disk, real time delays of between 30 and 100 milliseconds can be
+
+expected.  If the protocol must compete for cpu resources with other
+
+processes of the system, it may be necessary to wait a scheduling
+
+quantum before the protocol can run.  Many systems have a scheduling
+
+quantum of 100 milliseconds or more.  Considering these sorts of
+
+numbers, it becomes immediately clear that the protocol must be fitted
+
+into the operating system in a thorough and effective manner if anything
+
+like reasonable throughput is to be achieved.
+
+
+     There is one obvious conclusion immediately suggested by even this
+
+simple analysis.  Except in very special circumstances, when many
+
+packets are being processed at once, the cost of processing a packet is
+
+dominated by factors, such as cpu scheduling, which are independent of
+
+the packet size.  This suggests two general rules which any
+
+implementation ought to obey.  First, send data in large packets.
+
+Obviously, if processing time per packet is a constant, then throughput
+
+will be directly proportional to the packet size.  Second, never send an
+
+                                   4
+
+
+unneeded packet.  Unneeded packets use up just as many resources as a
+
+packet full of data, but perform no useful function.  RFC 813, "Window
+
+and Acknowledgement Strategy in TCP", discusses one aspect of reducing
+
+the number of packets sent per useful data byte.  This document will
+
+mention other attacks on the same problem.
+
+
+     The above analysis suggests that there are two main parts to the
+
+problem of achieving good protocol performance.  The first has to do
+
+with how the protocol implementation is integrated into the host
+
+operating system.  The second has to do with how the protocol package
+
+itself is organized internally.  This document will consider each of
+
+these topics in turn.
+
+
+     3.  The Protocol vs. the Operating System
+
+
+     There are normally three reasonable ways in which to add a protocol
+
+to an operating system.  The protocol can be in a process that is
+
+provided by the operating system, or it can be part of the kernel of the
+
+operating system itself, or it can be put in a separate communications
+
+processor or front end machine.  This decision is strongly influenced by
+
+details of hardware architecture and operating system design; each of
+
+these three approaches has its own advantages and disadvantages.
+
+
+     The "process" is the abstraction which most operating systems use
+
+to provide the execution environment for user programs.
+A very simple path for implementing a protocol is to obtain a process
+
+from the operating system and implement the protocol to run in it.
+
+Superficially, this approach has a number of advantages.  Since
+
+                                   5
+
+
+modifications to the kernel are not required, the job can be done by
+
+someone who is not an expert in the kernel structure.  Since it is often
+
+impossible to find somebody who is experienced both in the structure of
+
+the operating system and the structure of the protocol, this path, from
+
+a management point of view, is often extremely appealing.  Unfortunately,
+
+putting a protocol in a process has a number of disadvantages, related
+
+to both structure and performance.  First, as was discussed above,
+
+process scheduling can be a significant source of real-time delay.
+
+There is not only the actual cost of going through the scheduler, but
+
+the problem that the operating system may not have the right sort of
+
+priority tools to bring the process into execution quickly whenever
+
+there is work to be done.
+
+
+     Structurally, the difficulty with putting a protocol in a process
+
+is that the protocol may be providing services, for example support of
+
+data streams, which are normally obtained by going to special kernel
+
+entry points.  Depending on the generality of the operating system, it
+
+may be impossible to take a program which is accustomed to reading
+
+through a kernel entry point, and redirect it so it is reading the data
+
+from a process.  The most extreme example of this problem occurs when
+
+implementing server telnet.  In almost all systems, the device handler
+
+for the locally attached teletypes is located inside the kernel, and
+
+programs read and write from their teletype by making kernel calls.  If
+
+server telnet is implemented in a process, it is then necessary to take
+
+the data streams provided by server telnet and somehow get them back
+
+down inside the kernel so that they mimic the interface provided by
+
+local teletypes.  It is usually the case that special kernel
+
+                                   6
+
+
+modification is necessary to achieve this structure, which somewhat
+
+defeats the benefit of having removed the protocol from the kernel in
+
+the first place.
+
+
+     Clearly, then, there are advantages to putting the protocol package
+
+in the kernel.  Structurally, it is reasonable to view the network as a
+
+device, and device drivers are traditionally contained in the kernel.
+
+Presumably, the problems associated with process scheduling can be
+
+sidestepped, at least to a certain extent, by placing the code inside
+
+the kernel.  And it is obviously easier to make the server telnet
+
+channels mimic the local teletype channels if they are both realized at
+
+the same level in the kernel.
+
+
+     However, implementation of protocols in the kernel has its own set
+
+of pitfalls.  First, network protocols have a characteristic which is
+
+shared by almost no other device:  they require rather complex actions
+
+to be performed as a result of a timeout.  The problem with this
+
+requirement is that the kernel often has no facility by which a program
+
+can be brought into execution as a result of the timer event.  What is
+
+really needed, of course, is a special sort of process inside the
+
+kernel.  Most systems lack this mechanism.  Failing that, the only
+
+execution mechanism available is to run at interrupt time.
+
+
+     There are substantial drawbacks to implementing a protocol to run
+
+at interrupt time.
First, the actions performed may be somewhat complex + +and time consuming, compared to the maximum amount of time that the + +operating system is prepared to spend servicing an interrupt. Problems + +can arise if interrupts are masked for too long. This is particularly + + 7 + + +bad when running as a result of a clock interrupt, which can imply that + +the clock interrupt is masked. Second, the environment provided by an + +interrupt handler is usually extremely primitive compared to the + +environment of a process. There are usually a variety of system + +facilities which are unavailable while running in an interrupt handler. + +The most important of these is the ability to suspend execution pending + +the arrival of some event or message. It is a cardinal rule of almost + +every known operating system that one must not invoke the scheduler + +while running in an interrupt handler. Thus, the programmer who is + +forced to implement all or part of his protocol package as an interrupt + +handler must be the best sort of expert in the operating system + +involved, and must be prepared for development sessions filled with + +obscure bugs which crash not just the protocol package but the entire + +operating system. + + + A final problem with processing at interrupt time is that the + +system scheduler has no control over the percentage of system time used + +by the protocol handler. If a large number of packets arrive, from a + +foreign host that is either malfunctioning or fast, all of the time may + +be spent in the interrupt handler, effectively killing the system. + + + There are other problems associated with putting protocols into an + +operating system kernel. The simplest problem often encountered is that + +the kernel address space is simply too small to hold the piece of code + +in question. This is a rather artificial sort of problem, but it is a + +severe problem none the less in many machines. It is an appallingly + +unpleasant experience to do an implementation with the knowledge that + + 8 + + +for every byte of new feature put in one must find some other byte of + +old feature to throw out. It is hopeless to expect an effective and + +general implementation under this kind of constraint. Another problem + +is that the protocol package, once it is thoroughly entwined in the + +operating system, may need to be redone every time the operating system + +changes. If the protocol and the operating system are not maintained by + +the same group, this makes maintenance of the protocol package a + +perpetual headache. + + + The third option for protocol implementation is to take the + +protocol package and move it outside the machine entirely, on to a + +separate processor dedicated to this kind of task. Such a machine is + +often described as a communications processor or a front-end processor. + +There are several advantages to this approach. First, the operating + +system on the communications processor can be tailored for precisely + +this kind of task. This makes the job of implementation much easier. + +Second, one does not need to redo the task for every machine to which + +the protocol is to be added. It may be possible to reuse the same + +front-end machine on different host computers. Since the task need not + +be done as many times, one might hope that more attention could be paid + +to doing it right. Given a careful implementation in an environment + +which is optimized for this kind of task, the resulting package should + +turn out to be very efficient. 
Unfortunately, there are also problems + +with this approach. There is, of course, a financial problem associated + +with buying an additional computer. In many cases, this is not a + +problem at all since the cost is negligible compared to what the + +programmer would cost to do the job in the mainframe itself. More + + 9 + + +fundamentally, the communications processor approach does not completely + +sidestep any of the problems raised above. The reason is that the + +communications processor, since it is a separate machine, must be + +attached to the mainframe by some mechanism. Whatever that mechanism, + +code is required in the mainframe to deal with it. It can be argued + +that the program to deal with the communications processor is simpler + +than the program to implement the entire protocol package. Even if that + +is so, the communications processor interface package is still a + +protocol in nature, with all of the same structural problems. Thus, all + +of the issues raised above must still be faced. In addition to those + +problems, there are some other, more subtle problems associated with an + +outboard implementation of a protocol. We will return to these problems + +later. + + + There is a way of attaching a communications processor to a + +mainframe host which sidesteps all of the mainframe implementation + +problems, which is to use some preexisting interface on the host machine + +as the port by which a communications processor is attached. This + +strategy is often used as a last stage of desperation when the software + +on the host computer is so intractable that it cannot be changed in any + +way. Unfortunately, it is almost inevitably the case that all of the + +available interfaces are totally unsuitable for this purpose, so the + +result is unsatisfactory at best. The most common way in which this + +form of attachment occurs is when a network connection is being used to + +mimic local teletypes. In this case, the front-end processor can be + +attached to the mainframe by simply providing a number of wires out of + +the front-end processor, each corresponding to a connection, which are + + 10 + + +plugged into teletype ports on the mainframe computer. (Because of the + +appearance of the physical configuration which results from this + +arrangement, Michael Padlipsky has described this as the "milking + +machine" approach to computer networking.) This strategy solves the + +immediate problem of providing remote access to a host, but it is + +extremely inflexible. The channels being provided to the host are + +restricted by the host software to one purpose only, remote login. It + +is impossible to use them for any other purpose, such as file transfer + +or sending mail, so the host is integrated into the network environment + +in an extremely limited and inflexible manner. If this is the best that + +can be done, then it should be tolerated. Otherwise, implementors + +should be strongly encouraged to take a more flexible approach. + + + 4. Protocol Layering + + + The previous discussion suggested that there was a decision to be + +made as to where a protocol ought to be implemented. In fact, the + +decision is much more complicated than that, for the goal is not to + +implement a single protocol, but to implement a whole family of protocol + +layers, starting with a device driver or local network driver at the + +bottom, then IP and TCP, and eventually reaching the application + +specific protocol, such as Telnet, FTP and SMTP on the top. 
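+
+
+     Pictured as code, such a family is a chain of modules behind a
+
+common interface.  The C fragment below is only an illustrative sketch
+
+(the structure and names are invented, not taken from any particular
+
+system), but it shows the shape of the thing:  each layer hands packets
+
+down for output and up for input.
+
+    /* Illustrative sketch of a protocol family as a chain of layers. */
+    struct layer {
+        const char   *name;               /* "driver", "ip", "tcp", ... */
+        struct layer *above, *below;
+        void (*input)(struct layer *, char *pkt, int len);  /* from below */
+        void (*output)(struct layer *, char *pkt, int len); /* from above */
+    };
+
+The question pursued below is where, along such a chain, the boundary
+
+of the kernel should fall.
+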
Clearly, + +the bottommost of these layers is somewhere within the kernel, since the + +physical device driver for the net is almost inevitably located there. + +Equally clearly, the top layers of this package, which provide the user + +his ability to perform the remote login function or to send mail, are + +not entirely contained within the kernel. Thus, the question is not + + 11 + + +whether the protocol family shall be inside or outside the kernel, but + +how it shall be sliced in two between that part inside and that part + +outside. + + + Since protocols come nicely layered, an obvious proposal is that + +one of the layer interfaces should be the point at which the inside and + +outside components are sliced apart. Most systems have been implemented + +in this way, and many have been made to work quite effectively. One + +obvious place to slice is at the upper interface of TCP. Since TCP + +provides a bidirectional byte stream, which is somewhat similar to the + +I/O facility provided by most operating systems, it is possible to make + +the interface to TCP almost mimic the interface to other existing + +devices. Except in the matter of opening a connection, and dealing with + +peculiar failures, the software using TCP need not know that it is a + +network connection, rather than a local I/O stream that is providing the + +communications function. This approach does put TCP inside the kernel, + +which raises all the problems addressed above. It also raises the + +problem that the interface to the IP layer can, if the programmer is not + +careful, become excessively buried inside the kernel. It must be + +remembered that things other than TCP are expected to run on top of IP. + +The IP interface must be made accessible, even if TCP sits on top of it + +inside the kernel. + + + Another obvious place to slice is above Telnet. The advantage of + +slicing above Telnet is that it solves the problem of having remote + +login channels emulate local teletype channels. The disadvantage of + +putting Telnet into the kernel is that the amount of code which has now + + 12 + + +been included there is getting remarkably large. In some early + +implementations, the size of the network package, when one includes + +protocols at the level of Telnet, rivals the size of the rest of the + +supervisor. This leads to vague feelings that all is not right. + + + Any attempt to slice through a lower layer boundary, for example + +between internet and TCP, reveals one fundamental problem. The TCP + +layer, as well as the IP layer, performs a demultiplexing function on + +incoming datagrams. Until the TCP header has been examined, it is not + +possible to know for which user the packet is ultimately destined. + +Therefore, if TCP, as a whole, is moved outside the kernel, it is + +necessary to create one separate process called the TCP process, which + +performs the TCP multiplexing function, and probably all of the rest of + +TCP processing as well. This means that incoming data destined for a + +user process involves not just a scheduling of the user process, but + +scheduling the TCP process first. + + + This suggests an alternative structuring strategy which slices + +through the protocols, not along an established layer boundary, but + +along a functional boundary having to do with demultiplexing. In this + +approach, certain parts of IP and certain parts of TCP are placed in the + +kernel. 
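+
+
+     A minimal sketch of that kernel-resident slice might look as
+
+follows.  The names are invented and all header validation is omitted;
+
+the point is only that a few comparisons against a connection table
+
+suffice to find the owning process, as the next paragraph describes.
+
+    /* Sketch, invented names: just enough of IP and TCP to
+       demultiplex an incoming datagram to its process.      */
+    struct conn {
+        unsigned long  raddr;            /* remote internet address  */
+        unsigned short rport, lport;     /* remote and local ports   */
+        int            pid;              /* process which runs the   */
+    };                                   /* rest of IP and TCP       */
+
+    extern struct conn tab[];
+    extern int         nconn;
+
+    int demux(unsigned char *p)          /* p: the incoming datagram */
+    {
+        int            hl    = (p[0] & 0xf) * 4;     /* IP header length */
+        unsigned long  raddr = (unsigned long)p[12] << 24 | p[13] << 16
+                             | p[14] << 8 | p[15];
+        unsigned short rport = p[hl]     << 8 | p[hl + 1];
+        unsigned short lport = p[hl + 2] << 8 | p[hl + 3];
+        int            i;
+
+        for (i = 0; i < nconn; i++)
+            if (tab[i].lport == lport && tab[i].rport == rport &&
+                tab[i].raddr == raddr)
+                return tab[i].pid;       /* wake this process */
+        return -1;                       /* no owner: discard */
+    }
+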
+The amount of code placed there is sufficient so that when an
+
+incoming datagram arrives, it is possible to know for which process that
+
+datagram is ultimately destined.  The datagram is then routed directly
+
+to the final process, where additional IP and TCP processing is
+
+performed on it.  This removes from the kernel any requirement for timer-
+
+based actions, since they can be done by the process provided by the
+
+                                  13
+
+
+user.  This structure has the additional advantage of reducing the
+
+amount of code required in the kernel, so that it is suitable for
+
+systems where kernel space is at a premium.  RFC 814, titled "Names,
+
+Addresses, Ports, and Routes", discusses this rather orthogonal slicing
+
+strategy in more detail.
+
+
+     A related discussion of protocol layering and multiplexing can be
+
+found in Cohen and Postel [1].
+
+
+     5.  Breaking Down the Barriers
+
+
+     In fact, the implementor should be sensitive to the possibility of
+
+even more peculiar slicing strategies in dividing up the various
+
+protocol layers between the kernel and the one or more user processes.
+
+The result of the strategy proposed above was that part of TCP should
+
+execute in the process of the user.  In other words, instead of having
+
+one TCP process for the system, there is one TCP process per connection.
+
+Given this architecture, it is no longer necessary to imagine that all
+
+of the TCPs are identical.  One TCP could be optimized for high
+
+throughput applications, such as file transfer.  Another TCP could be
+
+optimized for small, low delay applications such as Telnet.  In fact, it
+
+would be possible to produce a TCP which was somewhat integrated with
+
+the Telnet or FTP on top of it.  Such an integration is extremely
+
+important, for it can lead to a kind of efficiency which more
+
+traditional structures are incapable of producing.  Earlier, this paper
+
+pointed out that one of the important rules for achieving efficiency was
+
+to send the minimum number of packets for a given amount of data.  The
+
+idea of protocol layering interacts very strongly (and poorly) with this
+
+                                  14
+
+
+goal, because independent layers have independent ideas about when
+
+packets should be sent, and unless these layers can somehow be brought
+
+into cooperation, additional packets will flow.  The best example of
+
+this is the operation of server telnet in a character-at-a-time remote
+
+echo mode on top of TCP.  When a packet containing a character arrives
+
+at a server host, each layer has a different response to that packet.
+
+TCP has an obligation to acknowledge the packet.  Either server telnet
+
+or the application layer above has an obligation to echo the character
+
+received in the packet.  If the character is a Telnet control sequence,
+
+then Telnet has additional actions which it must perform in response to
+
+the packet.  The result of this, in most implementations, is that
+
+several packets are sent back in response to the one arriving packet.
+
+Combining all of these return messages into one packet is important for
+
+several reasons.  First, of course, it reduces the number of packets
+
+being sent over the net, which directly reduces the charges incurred for
+
+many common carrier tariff structures.  Second, it reduces the number of
+
+scheduling actions which will occur inside both hosts, which, as was
+
+discussed above, is extremely important in improving throughput.
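+
+
+     As a sketch of what such combining might look like in code (the
+
+interfaces here are invented for illustration):  when the character
+
+arrives, TCP notes that an acknowledgement is owed, lets the layer
+
+above produce its echo, and then sends one segment carrying both.
+
+    /* Sketch, invented interfaces: fold the ACK, the echo, and any
+       Telnet response into a single return packet.  struct tcpcb,
+       telnet_input(), tcp_output(), ack_timer_start() are assumed. */
+    void tcp_deliver_char(struct tcpcb *tp, char ch)
+    {
+        tp->ack_owed = 1;            /* TCP must acknowledge        */
+        telnet_input(tp, ch);        /* may queue echo or control   */
+                                     /* replies on the send queue   */
+        if (tp->send_queued > 0)
+            tcp_output(tp);          /* data with piggybacked ACK   */
+        else
+            ack_timer_start(tp, 5);  /* dally a few milliseconds    */
+    }                                /* before sending a bare ACK   */
+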
+ + + The way to achieve this goal of packet sharing is to break down the + +barrier between the layers of the protocols, in a very restrained and + +careful manner, so that a limited amount of information can leak across + +the barrier to enable one layer to optimize its behavior with respect to + +the desires of the layers above and below it. For example, it would + +represent an improvement if TCP, when it received a packet, could ask + +the layer above whether or not it would be worth pausing for a few + +milliseconds before sending an acknowledgement in order to see if the + + 15 + + +upper layer would have any outgoing data to send. Dallying before + +sending the acknowledgement produces precisely the right sort of + +optimization if the client of TCP is server Telnet. However, dallying + +before sending an acknowledgement is absolutely unacceptable if TCP is + +being used for file transfer, for in file transfer there is almost never + +data flowing in the reverse direction, and the delay in sending the + +acknowledgement probably translates directly into a delay in obtaining + +the next packets. Thus, TCP must know a little about the layers above + +it to adjust its performance as needed. + + + It would be possible to imagine a general purpose TCP which was + +equipped with all sorts of special mechanisms by which it would query + +the layer above and modify its behavior accordingly. In the structures + +suggested above, in which there is not one but several TCPs, the TCP can + +simply be modified so that it produces the correct behavior as a matter + +of course. This structure has the disadvantage that there will be + +several implementations of TCP existing on a single machine, which can + +mean more maintenance headaches if a problem is found where TCP needs to + +be changed. However, it is probably the case that each of the TCPs will + +be substantially simpler than the general purpose TCP which would + +otherwise have been built. There are some experimental projects + +currently under way which suggest that this approach may make designing + +of a TCP, or almost any other layer, substantially easier, so that the + +total effort involved in bringing up a complete package is actually less + +if this approach is followed. This approach is by no means generally + +accepted, but deserves some consideration. + + 16 + + + The general conclusion to be drawn from this sort of consideration + +is that a layer boundary has both a benefit and a penalty. A visible + +layer boundary, with a well specified interface, provides a form of + +isolation between two layers which allows one to be changed with the + +confidence that the other one will not stop working as a result. + +However, a firm layer boundary almost inevitably leads to inefficient + +operation. This can easily be seen by analogy with other aspects of + +operating systems. Consider, for example, file systems. A typical + +operating system provides a file system, which is a highly abstracted + +representation of a disk. The interface is highly formalized, and + +presumed to be highly stable. This makes it very easy for naive users + +to have access to disks without having to write a great deal of + +software. The existence of a file system is clearly beneficial. On the + +other hand, it is clear that the restricted interface to a file system + +almost inevitably leads to inefficiency. 
If the interface is organized + +as a sequential read and write of bytes, then there will be people who + +wish to do high throughput transfers who cannot achieve their goal. If + +the interface is a virtual memory interface, then other users will + +regret the necessity of building a byte stream interface on top of the + +memory mapped file. The most objectionable inefficiency results when a + +highly sophisticated package, such as a data base management package, + +must be built on top of an existing operating system. Almost + +inevitably, the implementors of the database system attempt to reject + +the file system and obtain direct access to the disks. They have + +sacrificed modularity for efficiency. + + + The same conflict appears in networking, in a rather extreme form. + + 17 + + +The concept of a protocol is still unknown and frightening to most naive + +programmers. The idea that they might have to implement a protocol, or + +even part of a protocol, as part of some application package, is a + +dreadful thought. And thus there is great pressure to hide the function + +of the net behind a very hard barrier. On the other hand, the kind of + +inefficiency which results from this is a particularly undesirable sort + +of inefficiency, for it shows up, among other things, in increasing the + +cost of the communications resource used up to achieve the application + +goal. In cases where one must pay for one's communications costs, they + +usually turn out to be the dominant cost within the system. Thus, doing + +an excessively good job of packaging up the protocols in an inflexible + +manner has a direct impact on increasing the cost of the critical + +resource within the system. This is a dilemma which will probably only + +be solved when programmers become somewhat less alarmed about protocols, + +so that they are willing to weave a certain amount of protocol structure + +into their application program, much as application programs today weave + +parts of database management systems into the structure of their + +application program. + + + An extreme example of putting the protocol package behind a firm + +layer boundary occurs when the protocol package is relegated to a front- + +end processor. In this case the interface to the protocol is some other + +protocol. It is difficult to imagine how to build close cooperation + +between layers when they are that far separated. Realistically, one of + +the prices which must be associated with an implementation so physically + +modularized is that the performance will suffer as a result. Of course, + +a separate processor for protocols could be very closely integrated into + + 18 + + +the mainframe architecture, with interprocessor co-ordination signals, + +shared memory, and similar features. Such a physical modularity might + +work very well, but there is little documented experience with this + +closely coupled architecture for protocol support. + + + 6. Efficiency of Protocol Processing + + + To this point, this document has considered how a protocol package + +should be broken into modules, and how those modules should be + +distributed between free standing machines, the operating system kernel, + +and one or more user processes. It is now time to consider the other + +half of the efficiency question, which is what can be done to speed the + +execution of those programs that actually implement the protocols. We + +will make some specific observations about TCP and IP, and then conclude + +with a few generalities. 
+ + + IP is a simple protocol, especially with respect to the processing + +of normal packets, so it should be easy to get it to perform + +efficiently. The only area of any complexity related to actual packet + +processing has to do with fragmentation and reassembly. The reader is + +referred to RFC 815, titled "IP Datagram Reassembly Algorithms", for + +specific consideration of this point. + + + Most costs in the IP layer come from table look up functions, as + +opposed to packet processing functions. An outgoing packet requires two + +translation functions to be performed. The internet address must be + +translated to a target gateway, and a gateway address must be translated + +to a local network number (if the host is attached to more than one + + 19 + + +network). It is easy to build a simple implementation of these table + +look up functions that in fact performs very poorly. The programmer + +should keep in mind that there may be as many as a thousand network + +numbers in a typical configuration. Linear searching of a thousand + +entry table on every packet is extremely unsuitable. In fact, it may be + +worth asking TCP to cache a hint for each connection, which can be + +handed down to IP each time a packet is sent, to try to avoid the + +overhead of a table look up. + + + TCP is a more complex protocol, and presents many more + +opportunities for getting things wrong. There is one area which is + +generally accepted as causing noticeable and substantial overhead as + +part of TCP processing. This is computation of the checksum. It would + +be nice if this cost could be avoided somehow, but the idea of an end- + +to-end checksum is absolutely central to the functioning of TCP. No + +host implementor should think of omitting the validation of a checksum + +on incoming data. + + + Various clever tricks have been used to try to minimize the cost of + +computing the checksum. If it is possible to add additional microcoded + +instructions to the machine, a checksum instruction is the most obvious + +candidate. Since computing the checksum involves picking up every byte + +of the segment and examining it, it is possible to combine the operation + +of computing the checksum with the operation of copying the segment from + +one location to another. Since a number of data copies are probably + +already required as part of the processing structure, this kind of + +sharing might conceivably pay off if it didn't cause too much trouble to + + 20 + + +the modularity of the program. Finally, computation of the checksum + +seems to be one place where careful attention to the details of the + +algorithm used can make a drastic difference in the throughput of the + +program. The Multics system provides one of the best case studies of + +this, since Multics is about as poorly organized to perform this + +function as any machine implementing TCP. Multics is a 36-bit word + +machine, with four 9-bit bytes per word. The eight-bit bytes of a TCP + +segment are laid down packed in memory, ignoring word boundaries. This + +means that when it is necessary to pick up the data as a set of 16-bit + +units for the purpose of adding them to compute checksums, horrible + +masking and shifting is required for each 16-bit value. An early + +version of a program using this strategy required 6 milliseconds to + +checksum a 576-byte segment. Obviously, at this point, checksum + +computation was becoming the central bottleneck to throughput. 
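+
+
+     The straightforward form of this computation, a 16-bit ones'
+
+complement sum with the carries folded back in (the TCP pseudo-header
+
+is omitted here), is only a few lines of C.  It is the per-word pickup
+
+in the inner loop that the Multics word layout made so expensive.
+
+    /* The Internet checksum in its straightforward form. */
+    unsigned short cksum(unsigned char *p, int len)
+    {
+        unsigned long sum = 0;
+
+        for (; len > 1; len -= 2, p += 2)
+            sum += (p[0] << 8) | p[1];          /* each 16-bit word  */
+        if (len == 1)
+            sum += p[0] << 8;                   /* odd trailing byte */
+        while (sum >> 16)                       /* fold the carries  */
+            sum = (sum & 0xffff) + (sum >> 16);
+        return ~sum & 0xffff;
+    }
+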
+A more careful recoding of this algorithm reduced the checksum
+
+processing time to less than one millisecond.  The strategy used was
+
+extremely dirty.  It involved adding up carefully selected words of the
+
+area in which the data lay, knowing that for those particular words, the
+
+16-bit values were properly aligned inside the words.  Only after the
+
+addition had been done were the various sums shifted, and finally added
+
+to produce the eventual checksum.  This kind of highly specialized
+
+programming is probably not acceptable if used everywhere within an
+
+operating system.  It is clearly appropriate for one highly localized
+
+function which can be clearly identified as an extreme performance
+
+bottleneck.
+
+
+     Another area of TCP processing which may cause performance problems
+
+is the overhead of examining all of the possible flags and options which
+
+                                  21
+
+
+occur in each incoming packet.  One paper, by Bunch and Day [2], asserts
+
+that the overhead of packet header processing is actually an important
+
+limiting factor in throughput computation.  Not all measurement
+
+experiments have tended to support this result.  To whatever extent it
+
+is true, however, there is an obvious strategy which the implementor
+
+ought to use in designing his program.  He should build his program to
+
+optimize the expected case.  It is easy, especially when first designing
+
+a program, to pay equal attention to all of the possible outcomes of
+
+every test.  In practice, however, few of these will ever happen.  A TCP
+
+should be built on the assumption that the next packet to arrive will
+
+have absolutely nothing special about it, and will be the next one
+
+expected in the sequence space.  One or two tests are sufficient to
+
+determine that the expected set of control flags is on.  (The ACK flag
+
+should be on; the Push flag may or may not be on.  No other flags should
+
+be on.)  One test is sufficient to determine that the sequence number of
+
+the incoming packet is one greater than the last sequence number
+
+received.  In almost every case, that will be the actual result.  Again,
+
+using the Multics system as an example, failure to optimize the case of
+
+receiving the expected sequence number had a detectable effect on the
+
+performance of the system.  The particular problem arose when a number
+
+of packets arrived at once.  TCP attempted to process all of these
+
+packets before awaking the user.  As a result, by the time the last
+
+packet arrived, there was a threaded list of packets which had several
+
+items on it.  When a new packet arrived, the list was searched to find
+
+the location into which the packet should be inserted.  Obviously, the
+
+list should be searched from highest sequence number to lowest sequence
+
+                                  22
+
+
+number, because one is expecting to receive a packet which comes after
+
+those already received.  By mistake, the list was searched from front to
+
+back, starting with the packets with the lowest sequence number.  The
+
+amount of time spent searching this list backwards was easily detectable
+
+in the metering measurements.
+
+
+     Other data structures can be organized to optimize the action which
+
+is normally taken on them.  For example, the retransmission queue is
+
+very seldom actually used for retransmission, so it should not be
+
+organized to optimize that action.  In fact, it should be organized to
+
+optimize the discarding of things from it when the acknowledgement
+
+arrives.
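+
+
+     For illustration (the names are invented, and the modular
+
+comparison of sequence numbers is elided):  if the queue is kept
+
+ordered by sequence number, the common event, an acknowledgement
+
+discarding segments, is one cheap scan from the front.
+
+    /* Sketch, invented names: a retransmission queue organized so
+       that discarding on receipt of an ACK is the cheap path.     */
+    struct rxseg {
+        unsigned long seq_end;    /* sequence number past this segment */
+        struct rxseg *next;       /* ... plus a saved copy of it ...   */
+    };
+
+    void rxq_ack(struct tcpcb *tp, unsigned long ack)
+    {
+        struct rxseg *s;
+
+        while ((s = tp->rxq) != 0 && s->seq_end <= ack) {
+            tp->rxq = s->next;    /* acknowledged: discard */
+            seg_free(s);          /* (assumed deallocator) */
+        }
+    }
+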
In many cases, the easiest way to do this is not to save the + +packet at all, but to reconstruct it only if it needs to be + +retransmitted, starting from the data as it was originally buffered by + +the user. + + + There is another generality, at least as important as optimizing + +the common case, which is to avoid copying data any more times than + +necessary. One more result from the Multics TCP may prove enlightening + +here. Multics takes between two and three milliseconds within the TCP + +layer to process an incoming packet, depending on its size. For a 576- + +byte packet, the three milliseconds is used up approximately as follows. + +One millisecond is used computing the checksum. Six hundred + +microseconds is spent copying the data. (The data is copied twice, at + +.3 milliseconds a copy.) One of those copy operations could correctly + +be included as part of the checksum cost, since it is done to get the + +data on a known word boundary to optimize the checksum algorithm. + + 23 + + +However, the copy also performs another necessary transfer at the same + +time. Header processing and packet resequencing takes .7 milliseconds. + +The rest of the time is used in miscellaneous processing, such as + +removing packets from the retransmission queue which are acknowledged by + +this packet. Data copying is the second most expensive single operation + +after data checksuming. Some implementations, often because of an + +excessively layered modularity, end up copying the data around a great + +deal. Other implementations end up copying the data because there is no + +shared memory between processes, and the data must be moved from process + +to process via a kernel operation. Unless the amount of this activity + +is kept strictly under control, it will quickly become the major + +performance bottleneck. + + + 7. Conclusions + + + This document has addressed two aspects of obtaining performance + +from a protocol implementation, the way in which the protocol is layered + +and integrated into the operating system, and the way in which the + +detailed handling of the packet is optimized. It would be nice if one + +or the other of these costs would completely dominate, so that all of + +one's attention could be concentrated there. Regrettably, this is not + +so. Depending on the particular sort of traffic one is getting, for + +example, whether Telnet one-byte packets or file transfer maximum size + +packets at maximum speed, one can expect to see one or the other cost + +being the major bottleneck to throughput. Most implementors who have + +studied their programs in an attempt to find out where the time was + +going have reached the unsatisfactory conclusion that it is going + + 24 + + +equally to all parts of their program. With the possible exception of + +checksum processing, very few people have ever found that their + +performance problems were due to a single, horrible bottleneck which + +they could fix by a single stroke of inventive programming. Rather, the + +performance was something which was improved by painstaking tuning of + +the entire program. + + + Most discussions of protocols begin by introducing the concept of + +layering, which tends to suggest that layering is a fundamentally + +wonderful idea which should be a part of every consideration of + +protocols. In fact, layering is a mixed blessing. Clearly, a layer + +interface is necessary whenever more than one client of a particular + +layer is to be allowed to use that same layer. 
+But an interface, precisely because it is fixed, inevitably leads to a
+
+lack of complete understanding as to what one layer wishes to obtain
+
+from another.  This has to lead to inefficiency.  Furthermore, layering
+
+is a potential snare in that one is tempted to think that a layer
+
+boundary, which was an artifact of the specification procedure, is in
+
+fact the proper boundary to use in modularizing the implementation.
+
+Again, in certain cases, an architected layer must correspond to an
+
+implemented layer, precisely so that several clients can have access to
+
+that layer in a reasonably straightforward manner.  In other cases,
+
+cunning rearrangement of the implemented module boundaries to match
+
+various functions, such as the demultiplexing of incoming packets, or
+
+the sending of asynchronous outgoing packets, can lead to unexpected
+
+performance improvements compared to more traditional implementation
+
+strategies.  Finally, good performance is something which is difficult
+
+to retrofit onto an existing
+
+                                  25
+
+
+program.  Since performance is influenced, not just by the fine detail,
+
+but by the gross structure, it is sometimes the case that in order to
+
+obtain a substantial performance improvement, it is necessary to
+
+completely redo the program from the bottom up.  This is a great
+
+disappointment to programmers, especially those doing a protocol
+
+implementation for the first time.  Programmers who are somewhat
+
+inexperienced and unfamiliar with protocols are sufficiently concerned
+
+with getting their program logically correct that they do not have the
+
+capacity to think at the same time about the performance of the
+
+structure they are building.  Only after they have achieved a logically
+
+correct program do they discover that they have done so in a way which
+
+has precluded real performance.  Clearly, it is more difficult to design
+
+a program thinking from the start about both logical correctness and
+
+performance.  With time, as implementors as a group learn more about the
+
+appropriate structures to use for building protocols, it will be
+
+possible to proceed with an implementation project having more
+
+confidence that the structure is rational, that the program will work,
+
+and that the program will work well.  Those of us now implementing
+
+protocols have the privilege of being on the forefront of this learning
+
+process.  It should be no surprise that our programs sometimes suffer
+
+from the uncertainty we bring to bear on them.
+
+                                  26
+
+
+Citations
+
+
+     [1]  Cohen and Postel, "On Protocol Multiplexing", Sixth Data
+
+Communications Symposium, ACM/IEEE, November 1979.
+
+
+     [2]  Bunch and Day, "Control Structure Overhead in TCP", Trends and
+
+Applications:  Computer Networking, NBS Symposium, May 1980.