diff --git a/doc/rfc/rfc817.txt b/doc/rfc/rfc817.txt
new file mode 100644
index 0000000..dcdef8a
--- /dev/null
+++ b/doc/rfc/rfc817.txt
@@ -0,0 +1,1388 @@
+
+RFC: 817
+
+
+
+ MODULARITY AND EFFICIENCY IN PROTOCOL IMPLEMENTATION
+
+ David D. Clark
+ MIT Laboratory for Computer Science
+ Computer Systems and Communications Group
+ July, 1982
+
+
+ 1. Introduction
+
+
+ Many protocol implementers have made the unpleasant discovery that
+
+their packages do not run quite as fast as they had hoped. The blame
+
+for this widely observed problem has been attributed to a variety of
+
+causes, ranging from details in the design of the protocol to the
+
+underlying structure of the host operating system. This RFC will
+
+discuss some of the commonly encountered reasons why protocol
+
+implementations seem to run slowly.
+
+
+ Experience suggests that one of the most important factors in
+
+determining the performance of an implementation is the manner in which
+
+that implementation is modularized and integrated into the host
+
+operating system. For this reason, it is useful to discuss the question
+
+of how an implementation is structured at the same time that we consider
+
+how it will perform. In fact, this RFC will argue that modularity is
+
+one of the chief villains in attempting to obtain good performance, so
+
+that the designer is faced with a delicate and inevitable tradeoff
+
+between good structure and good performance. Further, the single factor
+
+which most strongly determines how well this conflict can be resolved is
+
+not the protocol but the operating system.
+
+
+ 2. Efficiency Considerations
+
+
+ There are many aspects to efficiency. One aspect is sending data
+
+at minimum transmission cost, which is a critical aspect of common
+
+carrier communications, if not of local area network communications.
+
+Another aspect is sending data at a high rate, which may not be possible
+
+at all if the net is very slow, but which may be the one central design
+
+constraint when taking advantage of a local net with high raw bandwidth.
+
+The final consideration is doing the above with minimum expenditure of
+
+computer resources. This last may be necessary to achieve high speed,
+
+but in the case of the slow net may be important only in that the
+
+resources used up, for example cpu cycles, are costly or otherwise
+
+needed. It is worth pointing out that these different goals often
+
+conflict; for example it is often possible to trade off efficient use of
+
+the computer against efficient use of the network. Thus, there may be
+
+no such thing as a successful general purpose protocol implementation.
+
+
+ The simplest measure of performance is throughput, measured in bits
+
+per second. It is worth doing a few simple computations in order to get
+
+a feeling for the magnitude of the problems involved. Assume that data
+
+is being sent from one machine to another in packets of 576 bytes, the
+
+maximum generally acceptable internet packet size. Allowing for header
+
+overhead, this packet size permits 4288 bits in each packet. If a
+
+useful throughput of 10,000 bits per second is desired, then a data
+
+bearing packet must leave the sending host about every 430 milliseconds,
+
+a little over two per second. This is clearly not difficult to achieve.
+
+However, if one wishes to achieve 100 kilobits per second throughput,
+
+the packet must leave the host every 43 milliseconds, and to achieve one
+
+megabit per second, which is not at all unreasonable on a high-speed
+
+local net, the packets must be spaced no more than 4.3 milliseconds apart.
+
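+
+   The arithmetic is easily mechanized.  The following C fragment, a
+
+minimal sketch using the figures above, derives the required packet
+
+spacing from a target throughput:
+
+    #include <stdio.h>
+
+    #define PACKET_BYTES 576            /* maximum internet packet  */
+    #define HEADER_BYTES  40            /* typical TCP/IP headers   */
+    #define DATA_BITS ((PACKET_BYTES - HEADER_BYTES) * 8)  /* 4288  */
+
+    /* Milliseconds between departing packets needed to sustain a
+       given useful throughput, expressed in bits per second.       */
+    double spacing_ms(double bits_per_second)
+    {
+        return 1000.0 * DATA_BITS / bits_per_second;
+    }
+
+    int main(void)
+    {
+        printf("%6.1f ms\n", spacing_ms(10000.0));    /* ~430 ms   */
+        printf("%6.1f ms\n", spacing_ms(100000.0));   /* ~43 ms    */
+        printf("%6.1f ms\n", spacing_ms(1000000.0));  /* ~4.3 ms   */
+        return 0;
+    }
+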
+
+ These latter numbers are a slightly more alarming goal on which to
+
+set one's sights. Many operating systems take a substantial fraction of
+
+a millisecond just to service an interrupt. If the protocol has been
+
+structured as a process, it is necessary to go through a process
+
+scheduling before the protocol code can even begin to run. If any piece
+
+of a protocol package or its data must be fetched from disk, real time
+
+delays of between 30 and 100 milliseconds can be expected. If the
+
+protocol must compete for cpu resources with other processes of the
+
+system, it may be necessary to wait a scheduling quantum before the
+
+protocol can run. Many systems have a scheduling quantum of 100
+
+milliseconds or more. Considering these sorts of numbers, it becomes
+
+immediately clear that the protocol must be fitted into the operating
+
+system in a thorough and effective manner if anything like reasonable
+
+throughput is to be achieved.
+
+
+ There is one obvious conclusion immediately suggested by even this
+
+simple analysis. Except in very special circumstances, when many
+
+packets are being processed at once, the cost of processing a packet is
+
+dominated by factors, such as cpu scheduling, which are independent of
+
+the packet size. This suggests two general rules which any
+
+implementation ought to obey. First, send data in large packets.
+
+Obviously, if processing time per packet is a constant, then throughput
+
+will be directly proportional to the packet size. Second, never send an
+
+unneeded packet. Unneeded packets use up just as many resources as a
+
+packet full of data, but perform no useful function. RFC 813, "Window
+
+and Acknowledgement Strategy in TCP", discusses one aspect of reducing
+
+the number of packets sent per useful data byte. This document will
+
+mention other attacks on the same problem.
+
+
+ The above analysis suggests that there are two main parts to the
+
+problem of achieving good protocol performance. The first has to do
+
+with how the protocol implementation is integrated into the host
+
+operating system. The second has to do with how the protocol package
+
+itself is organized internally. This document will consider each of
+
+these topics in turn.
+
+
+ 3. The Protocol vs. the Operating System
+
+
+ There are normally three reasonable ways in which to add a protocol
+
+to an operating system. The protocol can be in a process that is
+
+provided by the operating system, or it can be part of the kernel of the
+
+operating system itself, or it can be put in a separate communications
+
+processor or front end machine. This decision is strongly influenced by
+
+details of hardware architecture and operating system design; each of
+
+these three approaches has its own advantages and disadvantages.
+
+
+ The "process" is the abstraction which most operating systems use
+
+to provide the execution environment for user programs. A very simple
+
+path for implementing a protocol is to obtain a process from the
+
+operating system and implement the protocol to run in it.
+
+Superficially, this approach has a number of advantages. Since
+
+modifications to the kernel are not required, the job can be done by
+
+someone who is not an expert in the kernel structure. Since it is often
+
+impossible to find somebody who is experienced both in the structure of
+
+the operating system and the structure of the protocol, this path, from
+
+a management point of view, is often extremely appealing. Unfortunately,
+
+putting a protocol in a process has a number of disadvantages, related
+
+to both structure and performance. First, as was discussed above,
+
+process scheduling can be a significant source of real-time delay.
+
+There is not only the actual cost of going through the scheduler, but
+
+the problem that the operating system may not have the right sort of
+
+priority tools to bring the process into execution quickly whenever
+
+there is work to be done.
+
+
+ Structurally, the difficulty with putting a protocol in a process
+
+is that the protocol may be providing services, for example support of
+
+data streams, which are normally obtained by going to special kernel
+
+entry points. Depending on the generality of the operating system, it
+
+may be impossible to take a program which is accustomed to reading
+
+through a kernel entry point, and redirect it so it is reading the data
+
+from a process. The most extreme example of this problem occurs when
+
+implementing server telnet. In almost all systems, the device handler
+
+for the locally attached teletypes is located inside the kernel, and
+
+programs read and write from their teletype by making kernel calls. If
+
+server telnet is implemented in a process, it is then necessary to take
+
+the data streams provided by server telnet and somehow get them back
+
+down inside the kernel so that they mimic the interface provided by
+
+local teletypes. It is usually the case that special kernel
+
+modification is necessary to achieve this structure, which somewhat
+
+defeats the benefit of having removed the protocol from the kernel in
+
+the first place.
+
+
+ Clearly, then, there are advantages to putting the protocol package
+
+in the kernel. Structurally, it is reasonable to view the network as a
+
+device, and device drivers are traditionally contained in the kernel.
+
+Presumably, the problems associated with process scheduling can be
+
+sidestepped, at least to a certain extent, by placing the code inside the
+
+kernel. And it is obviously easier to make the server telnet channels
+
+mimic the local teletype channels if they are both realized in the same
+
+level in the kernel.
+
+
+ However, implementation of protocols in the kernel has its own set
+
+of pitfalls. First, network protocols have a characteristic which is
+
+shared by almost no other device: they require rather complex actions
+
+to be performed as a result of a timeout. The problem with this
+
+requirement is that the kernel often has no facility by which a program
+
+can be brought into execution as a result of the timer event. What is
+
+really needed, of course, is a special sort of process inside the
+
+kernel. Most systems lack this mechanism. Failing that, the only
+
+execution mechanism available is to run at interrupt time.
+
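+
+   A common compromise is a small "callout" facility: the clock
+
+interrupt handler examines the head of a queue of pending timer events
+
+and invokes the routine associated with any that have expired.  The
+
+sketch below, in C with hypothetical names, stores differential tick
+
+counts so that only the head entry need ever be decremented:
+
+    /* A minimal kernel callout queue; a sketch, not a complete
+       facility.  Entries hold ticks relative to their predecessor. */
+    struct callout {
+        struct callout *next;
+        long            ticks;          /* delta from previous entry */
+        void          (*func)(void *);  /* the timeout action        */
+        void           *arg;
+    };
+
+    static struct callout *callout_head;
+
+    /* Called on every clock interrupt.  Note that func() runs at
+       interrupt level, precisely the situation discussed below, so
+       it must be brief and must never invoke the scheduler.        */
+    void clock_tick(void)
+    {
+        if (callout_head == 0)
+            return;
+        callout_head->ticks--;
+        while (callout_head != 0 && callout_head->ticks <= 0) {
+            struct callout *c = callout_head;
+            callout_head = c->next;
+            c->func(c->arg);
+        }
+    }
+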
+
+ There are substantial drawbacks to implementing a protocol to run
+
+at interrupt time. First, the actions performed may be somewhat complex
+
+and time consuming, compared to the maximum amount of time that the
+
+operating system is prepared to spend servicing an interrupt. Problems
+
+can arise if interrupts are masked for too long. This is particularly
+
+bad when running as a result of a clock interrupt, which can imply that
+
+the clock interrupt is masked. Second, the environment provided by an
+
+interrupt handler is usually extremely primitive compared to the
+
+environment of a process. There are usually a variety of system
+
+facilities which are unavailable while running in an interrupt handler.
+
+The most important of these is the ability to suspend execution pending
+
+the arrival of some event or message. It is a cardinal rule of almost
+
+every known operating system that one must not invoke the scheduler
+
+while running in an interrupt handler. Thus, the programmer who is
+
+forced to implement all or part of his protocol package as an interrupt
+
+handler must be the best sort of expert in the operating system
+
+involved, and must be prepared for development sessions filled with
+
+obscure bugs which crash not just the protocol package but the entire
+
+operating system.
+
+
+ A final problem with processing at interrupt time is that the
+
+system scheduler has no control over the percentage of system time used
+
+by the protocol handler. If a large number of packets arrive, from a
+
+foreign host that is either malfunctioning or fast, all of the time may
+
+be spent in the interrupt handler, effectively killing the system.
+
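+
+   One defensive measure, sketched below with hypothetical names, is
+
+to bound the work done per interrupt: the handler drains at most a
+
+fixed number of packets and defers the remainder, so that a babbling
+
+host cannot consume every cycle:
+
+    #define RX_BUDGET 8           /* packets handled per interrupt */
+
+    struct packet;                             /* defined elsewhere */
+    struct packet *dequeue_from_device(void);    /* hypothetical    */
+    void enqueue_for_protocol(struct packet *);  /* hypothetical    */
+
+    void net_rx_interrupt(void)
+    {
+        int n;
+
+        for (n = 0; n < RX_BUDGET; n++) {
+            struct packet *p = dequeue_from_device();
+            if (p == 0)
+                break;
+            enqueue_for_protocol(p);  /* wakes a scheduled process */
+        }
+        /* Anything left waits for the next interrupt, keeping the
+           fraction of time spent at interrupt level bounded.      */
+    }
+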
+
+ There are other problems associated with putting protocols into an
+
+operating system kernel. The simplest problem often encountered is that
+
+the kernel address space is simply too small to hold the piece of code
+
+in question. This is a rather artificial sort of problem, but it is a
+
+severe problem nonetheless in many machines. It is an appallingly
+
+unpleasant experience to do an implementation with the knowledge that
+
+for every byte of new feature put in one must find some other byte of
+
+old feature to throw out. It is hopeless to expect an effective and
+
+general implementation under this kind of constraint. Another problem
+
+is that the protocol package, once it is thoroughly entwined in the
+
+operating system, may need to be redone every time the operating system
+
+changes. If the protocol and the operating system are not maintained by
+
+the same group, this makes maintenance of the protocol package a
+
+perpetual headache.
+
+
+ The third option for protocol implementation is to take the
+
+protocol package and move it outside the machine entirely, on to a
+
+separate processor dedicated to this kind of task. Such a machine is
+
+often described as a communications processor or a front-end processor.
+
+There are several advantages to this approach. First, the operating
+
+system on the communications processor can be tailored for precisely
+
+this kind of task. This makes the job of implementation much easier.
+
+Second, one does not need to redo the task for every machine to which
+
+the protocol is to be added. It may be possible to reuse the same
+
+front-end machine on different host computers. Since the task need not
+
+be done as many times, one might hope that more attention could be paid
+
+to doing it right. Given a careful implementation in an environment
+
+which is optimized for this kind of task, the resulting package should
+
+turn out to be very efficient. Unfortunately, there are also problems
+
+with this approach. There is, of course, a financial problem associated
+
+with buying an additional computer. In many cases, this is not a
+
+problem at all since the cost is negligible compared to what the
+
+programmer would cost to do the job in the mainframe itself. More
+
+fundamentally, the communications processor approach does not completely
+
+sidestep any of the problems raised above. The reason is that the
+
+communications processor, since it is a separate machine, must be
+
+attached to the mainframe by some mechanism. Whatever that mechanism,
+
+code is required in the mainframe to deal with it. It can be argued
+
+that the program to deal with the communications processor is simpler
+
+than the program to implement the entire protocol package. Even if that
+
+is so, the communications processor interface package is still a
+
+protocol in nature, with all of the same structural problems. Thus, all
+
+of the issues raised above must still be faced. In addition to those
+
+problems, there are some other, more subtle problems associated with an
+
+outboard implementation of a protocol. We will return to these problems
+
+later.
+
+
+ There is a way of attaching a communications processor to a
+
+mainframe host which sidesteps all of the mainframe implementation
+
+problems, which is to use some preexisting interface on the host machine
+
+as the port by which a communications processor is attached. This
+
+strategy is often used as a last stage of desperation when the software
+
+on the host computer is so intractable that it cannot be changed in any
+
+way. Unfortunately, it is almost inevitably the case that all of the
+
+available interfaces are totally unsuitable for this purpose, so the
+
+result is unsatisfactory at best. The most common way in which this
+
+form of attachment occurs is when a network connection is being used to
+
+mimic local teletypes. In this case, the front-end processor can be
+
+attached to the mainframe by simply providing a number of wires out of
+
+the front-end processor, each corresponding to a connection, which are
+
+plugged into teletype ports on the mainframe computer. (Because of the
+
+appearance of the physical configuration which results from this
+
+arrangement, Michael Padlipsky has described this as the "milking
+
+machine" approach to computer networking.) This strategy solves the
+
+immediate problem of providing remote access to a host, but it is
+
+extremely inflexible. The channels being provided to the host are
+
+restricted by the host software to one purpose only, remote login. It
+
+is impossible to use them for any other purpose, such as file transfer
+
+or sending mail, so the host is integrated into the network environment
+
+in an extremely limited and inflexible manner. If this is the best that
+
+can be done, then it should be tolerated. Otherwise, implementors
+
+should be strongly encouraged to take a more flexible approach.
+
+
+ 4. Protocol Layering
+
+
+ The previous discussion suggested that there was a decision to be
+
+made as to where a protocol ought to be implemented. In fact, the
+
+decision is much more complicated than that, for the goal is not to
+
+implement a single protocol, but to implement a whole family of protocol
+
+layers, starting with a device driver or local network driver at the
+
+bottom, then IP and TCP, and eventually reaching the application
+
+specific protocol, such as Telnet, FTP and SMTP on the top. Clearly,
+
+the bottommost of these layers is somewhere within the kernel, since the
+
+physical device driver for the net is almost inevitably located there.
+
+Equally clearly, the top layers of this package, which provide the user
+
+his ability to perform the remote login function or to send mail, are
+
+not entirely contained within the kernel. Thus, the question is not
+
+whether the protocol family shall be inside or outside the kernel, but
+
+how it shall be sliced in two between that part inside and that part
+
+outside.
+
+
+ Since protocols come nicely layered, an obvious proposal is that
+
+one of the layer interfaces should be the point at which the inside and
+
+outside components are sliced apart. Most systems have been implemented
+
+in this way, and many have been made to work quite effectively. One
+
+obvious place to slice is at the upper interface of TCP. Since TCP
+
+provides a bidirectional byte stream, which is somewhat similar to the
+
+I/O facility provided by most operating systems, it is possible to make
+
+the interface to TCP almost mimic the interface to other existing
+
+devices. Except in the matter of opening a connection, and dealing with
+
+peculiar failures, the software using TCP need not know that it is a
+
+network connection, rather than a local I/O stream that is providing the
+
+communications function. This approach does put TCP inside the kernel,
+
+which raises all the problems addressed above. It also raises the
+
+problem that the interface to the IP layer can, if the programmer is not
+
+careful, become excessively buried inside the kernel. It must be
+
+remembered that things other than TCP are expected to run on top of IP.
+
+The IP interface must be made accessible, even if TCP sits on top of it
+
+inside the kernel.
+
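+
+   In many systems this mimicry can be arranged through the same
+
+dispatch table used for real devices.  The sketch below, with
+
+hypothetical names, shows a device-switch entry whose read and write
+
+routines happen to be backed by TCP rather than by hardware:
+
+    /* A device-switch entry of the sort found in many kernels.     */
+    struct devsw {
+        int (*d_open)(int unit);
+        int (*d_close)(int unit);
+        int (*d_read)(int unit, char *buf, int len);
+        int (*d_write)(int unit, const char *buf, int len);
+    };
+
+    /* TCP entry points with the same signatures; hypothetical.     */
+    int tcp_open(int unit);
+    int tcp_close(int unit);
+    int tcp_read(int unit, char *buf, int len);
+    int tcp_write(int unit, const char *buf, int len);
+
+    /* A program reading through the normal kernel entry points
+       need never know this "device" is a network connection.       */
+    struct devsw tcp_devsw = {
+        tcp_open, tcp_close, tcp_read, tcp_write
+    };
+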
+
+ Another obvious place to slice is above Telnet. The advantage of
+
+slicing above Telnet is that it solves the problem of having remote
+
+login channels emulate local teletype channels. The disadvantage of
+
+putting Telnet into the kernel is that the amount of code which has now
+
+been included there is getting remarkably large. In some early
+
+implementations, the size of the network package, when one includes
+
+protocols at the level of Telnet, rivals the size of the rest of the
+
+supervisor. This leads to vague feelings that all is not right.
+
+
+ Any attempt to slice through a lower layer boundary, for example
+
+between internet and TCP, reveals one fundamental problem. The TCP
+
+layer, as well as the IP layer, performs a demultiplexing function on
+
+incoming datagrams. Until the TCP header has been examined, it is not
+
+possible to know for which user the packet is ultimately destined.
+
+Therefore, if TCP, as a whole, is moved outside the kernel, it is
+
+necessary to create one separate process called the TCP process, which
+
+performs the TCP demultiplexing function, and probably all of the rest of
+
+TCP processing as well. This means that incoming data destined for a
+
+user process involves not just a scheduling of the user process, but
+
+scheduling the TCP process first.
+
+
+ This suggests an alternative structuring strategy which slices
+
+through the protocols, not along an established layer boundary, but
+
+along a functional boundary having to do with demultiplexing. In this
+
+approach, certain parts of IP and certain parts of TCP are placed in the
+
+kernel. The amount of code placed there is sufficient so that when an
+
+incoming datagram arrives, it is possible to know for which process that
+
+datagram is ultimately destined. The datagram is then routed directly
+
+to the final process, where additional IP and TCP processing is
+
+performed on it. This removes from the kernel any requirement for timer
+
+based actions, since they can be done by the process provided by the
+
+user. This structure has the additional advantage of reducing the
+
+amount of code required in the kernel, so that it is suitable for
+
+systems where kernel space is at a premium. RFC 814, titled "Names,
+
+Addresses, Ports, and Routes," discusses this rather orthogonal slicing
+
+strategy in more detail.
+
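+
+   A sketch of the kernel-resident demultiplexing step might look as
+
+follows (all names hypothetical): just enough of the IP and TCP
+
+headers are inspected to identify the owning process, and the datagram
+
+is handed over with all remaining processing left to that process:
+
+    struct conn {                    /* one entry per connection    */
+        unsigned short remote_port, local_port;
+        unsigned long  remote_addr;
+        int            owner_pid;    /* process running this TCP    */
+    };
+
+    struct conn *lookup_conn(unsigned short rport,         /* both  */
+                             unsigned short lport,         /* hypo- */
+                             unsigned long raddr);         /* thetical */
+    void deliver_to_process(int pid, const unsigned char *dgram);
+
+    static unsigned short get16(const unsigned char *p)
+    {
+        return (unsigned short)((p[0] << 8) | p[1]);
+    }
+
+    static unsigned long get32(const unsigned char *p)
+    {
+        return ((unsigned long)get16(p) << 16) | get16(p + 2);
+    }
+
+    int demux(const unsigned char *dgram)
+    {
+        const unsigned char *tcp;
+        struct conn *c;
+
+        if (dgram[9] != 6)            /* IP protocol field: not TCP */
+            return -1;
+        tcp = dgram + (dgram[0] & 0x0f) * 4;  /* step over IP header */
+        c = lookup_conn(get16(tcp), get16(tcp + 2), get32(dgram + 12));
+        if (c == 0)
+            return -1;                /* no such connection         */
+        deliver_to_process(c->owner_pid, dgram);
+        return 0;
+    }
+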
+
+ A related discussion of protocol layering and multiplexing can be
+
+found in Cohen and Postel [1].
+
+
+ 5. Breaking Down the Barriers
+
+
+ In fact, the implementor should be sensitive to the possibility of
+
+even more peculiar slicing strategies in dividing up the various
+
+protocol layers between the kernel and the one or more user processes.
+
+The result of the strategy proposed above was that part of TCP should
+
+execute in the process of the user. In other words, instead of having
+
+one TCP process for the system, there is one TCP process per connection.
+
+Given this architecture, it is no longer necessary to imagine that all
+
+of the TCPs are identical. One TCP could be optimized for high
+
+throughput applications, such as file transfer. Another TCP could be
+
+optimized for small, low-delay applications such as Telnet. In fact, it
+
+would be possible to produce a TCP which was somewhat integrated with
+
+the Telnet or FTP on top of it. Such an integration is extremely
+
+important, for it can lead to a kind of efficiency which more
+
+traditional structures are incapable of producing. Earlier, this paper
+
+pointed out that one of the important rules to achieving efficiency was
+
+to send the minimum number of packets for a given amount of data. The
+
+idea of protocol layering interacts very strongly (and poorly) with this
+
+goal, because independent layers have independent ideas about when
+
+packets should be sent, and unless these layers can somehow be brought
+
+into cooperation, additional packets will flow. The best example of
+
+this is the operation of server telnet in a character at a time remote
+
+echo mode on top of TCP. When a packet containing a character arrives
+
+at a server host, each layer has a different response to that packet.
+
+TCP has an obligation to acknowledge the packet. Either server telnet
+
+or the application layer above has an obligation to echo the character
+
+received in the packet. If the character is a Telnet control sequence,
+
+then Telnet has additional actions which it must perform in response to
+
+the packet. The result of this, in most implementations, is that
+
+several packets are sent back in response to the one arriving packet.
+
+Combining all of these return messages into one packet is important for
+
+several reasons. First, of course, it reduces the number of packets
+
+being sent over the net, which directly reduces the charges incurred for
+
+many common carrier tariff structures. Second, it reduces the number of
+
+scheduling actions which will occur inside both hosts, which, as was
+
+discussed above, is extremely important in improving throughput.
+
+
+ The way to achieve this goal of packet sharing is to break down the
+
+barrier between the layers of the protocols, in a very restrained and
+
+careful manner, so that a limited amount of information can leak across
+
+the barrier to enable one layer to optimize its behavior with respect to
+
+the desires of the layers above and below it. For example, it would
+
+represent an improvement if TCP, when it received a packet, could ask
+
+the layer above whether or not it would be worth pausing for a few
+
+milliseconds before sending an acknowledgement in order to see if the
+
+upper layer would have any outgoing data to send. Dallying before
+
+sending the acknowledgement produces precisely the right sort of
+
+optimization if the client of TCP is server Telnet. However, dallying
+
+before sending an acknowledgement is absolutely unacceptable if TCP is
+
+being used for file transfer, for in file transfer there is almost never
+
+data flowing in the reverse direction, and the delay in sending the
+
+acknowledgement probably translates directly into a delay in obtaining
+
+the next packets. Thus, TCP must know a little about the layers above
+
+it to adjust its performance as needed.
+
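+
+   A sketch of such a leak across the boundary (names hypothetical)
+
+is a single hint, supplied by the layer above when the connection is
+
+opened, and consulted whenever an acknowledgement comes due:
+
+    #define DALLY_MS 5          /* an arbitrary dallying interval  */
+
+    struct tcp_conn {
+        int  reply_expected;    /* hint: client usually answers    */
+        long ack_due;           /* time by which ACK must be sent  */
+        /* ... remainder of the connection state ...               */
+    };
+
+    void send_ack(struct tcp_conn *tc);            /* hypothetical */
+
+    /* Called when arriving data obliges us to acknowledge.        */
+    void ack_strategy(struct tcp_conn *tc, long now)
+    {
+        if (tc->reply_expected) {
+            /* Server telnet will probably echo within a few
+               milliseconds; dally so the ACK can ride along.      */
+            tc->ack_due = now + DALLY_MS;
+        } else {
+            /* File transfer: no reverse data is coming, and any
+               delay merely stalls the sender.  Acknowledge now.   */
+            send_ack(tc);
+        }
+    }
+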
+
+ It would be possible to imagine a general purpose TCP which was
+
+equipped with all sorts of special mechanisms by which it would query
+
+the layer above and modify its behavior accordingly. In the structures
+
+suggested above, in which there is not one but several TCPs, the TCP can
+
+simply be modified so that it produces the correct behavior as a matter
+
+of course. This structure has the disadvantage that there will be
+
+several implementations of TCP existing on a single machine, which can
+
+mean more maintenance headaches if a problem is found where TCP needs to
+
+be changed. However, it is probably the case that each of the TCPs will
+
+be substantially simpler than the general purpose TCP which would
+
+otherwise have been built. There are some experimental projects
+
+currently under way which suggest that this approach may make designing
+
+of a TCP, or almost any other layer, substantially easier, so that the
+
+total effort involved in bringing up a complete package is actually less
+
+if this approach is followed. This approach is by no means generally
+
+accepted, but deserves some consideration.
+
+
+ The general conclusion to be drawn from this sort of consideration
+
+is that a layer boundary has both a benefit and a penalty. A visible
+
+layer boundary, with a well specified interface, provides a form of
+
+isolation between two layers which allows one to be changed with the
+
+confidence that the other one will not stop working as a result.
+
+However, a firm layer boundary almost inevitably leads to inefficient
+
+operation. This can easily be seen by analogy with other aspects of
+
+operating systems. Consider, for example, file systems. A typical
+
+operating system provides a file system, which is a highly abstracted
+
+representation of a disk. The interface is highly formalized, and
+
+presumed to be highly stable. This makes it very easy for naive users
+
+to have access to disks without having to write a great deal of
+
+software. The existence of a file system is clearly beneficial. On the
+
+other hand, it is clear that the restricted interface to a file system
+
+almost inevitably leads to inefficiency. If the interface is organized
+
+as a sequential read and write of bytes, then there will be people who
+
+wish to do high throughput transfers who cannot achieve their goal. If
+
+the interface is a virtual memory interface, then other users will
+
+regret the necessity of building a byte stream interface on top of the
+
+memory mapped file. The most objectionable inefficiency results when a
+
+highly sophisticated package, such as a data base management package,
+
+must be built on top of an existing operating system. Almost
+
+inevitably, the implementors of the database system attempt to reject
+
+the file system and obtain direct access to the disks. They have
+
+sacrificed modularity for efficiency.
+
+
+ The same conflict appears in networking, in a rather extreme form.
+
+The concept of a protocol is still unknown and frightening to most naive
+
+programmers. The idea that they might have to implement a protocol, or
+
+even part of a protocol, as part of some application package, is a
+
+dreadful thought. And thus there is great pressure to hide the function
+
+of the net behind a very hard barrier. On the other hand, the kind of
+
+inefficiency which results from this is a particularly undesirable sort
+
+of inefficiency, for it shows up, among other things, in increasing the
+
+cost of the communications resource used up to achieve the application
+
+goal. In cases where one must pay for one's communications costs, they
+
+usually turn out to be the dominant cost within the system. Thus, doing
+
+an excessively good job of packaging up the protocols in an inflexible
+
+manner has a direct impact on increasing the cost of the critical
+
+resource within the system. This is a dilemma which will probably only
+
+be solved when programmers become somewhat less alarmed about protocols,
+
+so that they are willing to weave a certain amount of protocol structure
+
+into their application program, much as application programs today weave
+
+parts of database management systems into the structure of their
+
+application program.
+
+
+ An extreme example of putting the protocol package behind a firm
+
+layer boundary occurs when the protocol package is relegated to a front-
+
+end processor. In this case the interface to the protocol is some other
+
+protocol. It is difficult to imagine how to build close cooperation
+
+between layers when they are that far separated. Realistically, one of
+
+the prices which must be associated with an implementation so physically
+
+modularized is that the performance will suffer as a result. Of course,
+
+a separate processor for protocols could be very closely integrated into
+
+the mainframe architecture, with interprocessor co-ordination signals,
+
+shared memory, and similar features. Such a physical modularity might
+
+work very well, but there is little documented experience with this
+
+closely coupled architecture for protocol support.
+
+
+ 6. Efficiency of Protocol Processing
+
+
+ To this point, this document has considered how a protocol package
+
+should be broken into modules, and how those modules should be
+
+distributed between free standing machines, the operating system kernel,
+
+and one or more user processes. It is now time to consider the other
+
+half of the efficiency question, which is what can be done to speed the
+
+execution of those programs that actually implement the protocols. We
+
+will make some specific observations about TCP and IP, and then conclude
+
+with a few generalities.
+
+
+ IP is a simple protocol, especially with respect to the processing
+
+of normal packets, so it should be easy to get it to perform
+
+efficiently. The only area of any complexity related to actual packet
+
+processing has to do with fragmentation and reassembly. The reader is
+
+referred to RFC 815, titled "IP Datagram Reassembly Algorithms", for
+
+specific consideration of this point.
+
+
+ Most costs in the IP layer come from table look up functions, as
+
+opposed to packet processing functions. An outgoing packet requires two
+
+translation functions to be performed. The internet address must be
+
+translated to a target gateway, and a gateway address must be translated
+
+to a local network number (if the host is attached to more than one
+
+network). It is easy to build a simple implementation of these table
+
+look up functions that in fact performs very poorly. The programmer
+
+should keep in mind that there may be as many as a thousand network
+
+numbers in a typical configuration. Linear searching of a thousand-
+
+entry table on every packet is extremely unsuitable. In fact, it may be
+
+worth asking TCP to cache a hint for each connection, which can be
+
+handed down to IP each time a packet is sent, to try to avoid the
+
+overhead of a table look up.
+
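+
+   The hint need be no more than the result of the last successful
+
+look up, retained with the connection state and verified cheaply on
+
+each send; a sketch, with hypothetical names:
+
+    struct route_hint {
+        unsigned long gateway;      /* first-hop gateway address   */
+        int           ifindex;      /* which local network to use  */
+        int           valid;
+    };
+
+    struct packet;                                 /* hypothetical */
+    int  route_still_good(struct route_hint *, unsigned long dst);
+    void route_lookup(unsigned long dst, unsigned long *gw, int *ifx);
+    void transmit(struct packet *, unsigned long gw, int ifx);
+
+    void ip_send(struct packet *p, unsigned long dst,
+                 struct route_hint *hint)
+    {
+        if (!hint->valid || !route_still_good(hint, dst)) {
+            /* Slow path: full table look up, then refill the hint. */
+            route_lookup(dst, &hint->gateway, &hint->ifindex);
+            hint->valid = 1;
+        }
+        transmit(p, hint->gateway, hint->ifindex);
+    }
+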
+
+ TCP is a more complex protocol, and presents many more
+
+opportunities for getting things wrong. There is one area which is
+
+generally accepted as causing noticeable and substantial overhead as
+
+part of TCP processing. This is computation of the checksum. It would
+
+be nice if this cost could be avoided somehow, but the idea of an end-
+
+to-end checksum is absolutely central to the functioning of TCP. No
+
+host implementor should think of omitting the validation of a checksum
+
+on incoming data.
+
+
+ Various clever tricks have been used to try to minimize the cost of
+
+computing the checksum. If it is possible to add additional microcoded
+
+instructions to the machine, a checksum instruction is the most obvious
+
+candidate. Since computing the checksum involves picking up every byte
+
+of the segment and examining it, it is possible to combine the operation
+
+of computing the checksum with the operation of copying the segment from
+
+one location to another. Since a number of data copies are probably
+
+already required as part of the processing structure, this kind of
+
+sharing might conceivably pay off if it didn't cause too much trouble to
+
+the modularity of the program. Finally, computation of the checksum
+
+seems to be one place where careful attention to the details of the
+
+algorithm used can make a drastic difference in the throughput of the
+
+program. The Multics system provides one of the best case studies of
+
+this, since Multics is about as poorly organized to perform this
+
+function as any machine implementing TCP. Multics is a 36-bit word
+
+machine, with four 9-bit bytes per word. The eight-bit bytes of a TCP
+
+segment are laid down packed in memory, ignoring word boundaries. This
+
+means that when it is necessary to pick up the data as a set of 16-bit
+
+units for the purpose of adding them to compute checksums, horrible
+
+masking and shifting is required for each 16-bit value. An early
+
+version of a program using this strategy required 6 milliseconds to
+
+checksum a 576-byte segment. Obviously, at this point, checksum
+
+computation was becoming the central bottleneck to throughput. A more
+
+careful recoding of this algorithm reduced the checksum processing time
+
+to less than one millisecond. The strategy used was extremely dirty.
+
+It involved adding up carefully selected words of the area in which the
+
+data lay, knowing that for those particular words, the 16-bit values
+
+were properly aligned inside the words. Only after the addition had
+
+been done were the various sums shifted, and finally added to produce
+
+the eventual checksum. This kind of highly specialized programming is
+
+probably not acceptable if used everywhere within an operating system.
+
+It is clearly appropriate for one highly localized function which can be
+
+clearly identified as an extreme performance bottleneck.
+
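+
+   The TCP checksum is formed by one's complement arithmetic over the
+
+segment taken as 16-bit words.  A sketch of a combined copy-and-
+
+checksum loop follows, assuming for simplicity an aligned buffer and
+
+an even byte count; the folding of carries at the end is what permits
+
+the wider accumulator:
+
+    /* Copy a segment and compute its one's complement sum in a
+       single pass over the data; a sketch.                        */
+    unsigned short copy_and_cksum(unsigned short *dst,
+                                  const unsigned short *src,
+                                  int nwords)
+    {
+        unsigned long sum = 0;
+
+        while (nwords-- > 0) {
+            unsigned short w = *src++;
+            *dst++ = w;
+            sum += w;
+        }
+        while (sum >> 16)          /* fold carries back into the   */
+            sum = (sum & 0xffff) + (sum >> 16);   /* low 16 bits   */
+        return (unsigned short)~sum;
+    }
+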
+
+ Another area of TCP processing which may cause performance problems
+
+is the overhead of examining all of the possible flags and options which
+
+occur in each incoming packet. One paper, by Bunch and Day [2], asserts
+
+that the overhead of packet header processing is actually an important
+
+limiting factor in throughput computation. Not all measurement
+
+experiments have tended to support this result. To whatever extent it
+
+is true, however, there is an obvious strategy which the implementor
+
+ought to use in designing his program. He should build his program to
+
+optimize the expected case. It is easy, especially when first designing
+
+a program, to pay equal attention to all of the possible outcomes of
+
+every test. In practice, however, few of these will ever happen. A TCP
+
+should be built on the assumption that the next packet to arrive will
+
+have absolutely nothing special about it, and will be the next one
+
+expected in the sequence space. One or two tests are sufficient to
+
+determine that the expected set of control flags are on. (The ACK flag
+
+should be on; the Push flag may or may not be on. No other flags should
+
+be on.) One test is sufficient to determine that the sequence number of
+
+the incoming packet is one greater than the last sequence number
+
+received. In almost every case, that will be the actual result. Again,
+
+using the Multics system as an example, failure to optimize the case of
+
+receiving the expected sequence number had a detectable effect on the
+
+performance of the system. The particular problem arose when a number
+
+of packets arrived at once. TCP attempted to process all of these
+
+packets before awaking the user. As a result, by the time the last
+
+packet arrived, there was a threaded list of packets which had several
+
+items on it. When a new packet arrived, the list was searched to find
+
+the location into which the packet should be inserted. Obviously, the
+
+list should be searched from highest sequence number to lowest sequence
+
+number, because one is expecting to receive a packet which comes after
+
+those already received. By mistake, the list was searched from front to
+
+back, starting with the packets with the lowest sequence number. The
+
+amount of time spent searching this list backwards was easily detectable
+
+in the metering measurements.
+
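+
+   In code, the expected case reduces to a pair of comparisons; a
+
+sketch, with hypothetical structure names:
+
+    #define TH_ACK  0x10
+    #define TH_PUSH 0x08
+
+    struct tcp_hdr { unsigned char flags; unsigned long seq; };
+    struct tcb     { unsigned long rcv_next; /* ... */ };
+
+    int expected_case(struct tcb *tc, struct tcp_hdr *th)
+    {
+        /* ACK on, nothing but perhaps Push besides it ...         */
+        if ((th->flags & ~TH_PUSH) != TH_ACK)
+            return 0;
+        /* ... and exactly the sequence number we expect next.     */
+        if (th->seq != tc->rcv_next)
+            return 0;
+        return 1;     /* skip the option and flag checks entirely  */
+    }
+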
+
+ Other data structures can be organized to optimize the action which
+
+is normally taken on them. For example, the retransmission queue is
+
+very seldom actually used for retransmission, so it should not be
+
+organized to optimize that action. In fact, it should be organized to
+
+optimize the discarding of things from it when the acknowledgement
+
+arrives. In many cases, the easiest way to do this is not to save the
+
+packet at all, but to reconstruct it only if it needs to be
+
+retransmitted, starting from the data as it was originally buffered by
+
+the user.
+
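+
+   If, for instance, the queue is kept ordered by ascending sequence
+
+number, the arrival of an acknowledgement becomes a single pass that
+
+pops entries off the front; a sketch, with hypothetical names:
+
+    #include <stdlib.h>
+
+    struct rtx_entry {
+        struct rtx_entry *next;
+        unsigned long     end_seq;  /* final sequence number + 1   */
+        /* data, or a descriptor for rebuilding the packet         */
+    };
+
+    /* Less-or-equal modulo the sequence space; assumes a 32-bit
+       long, as was usual on the machines of interest here.        */
+    static int seq_leq(unsigned long a, unsigned long b)
+    {
+        return (long)(a - b) <= 0;
+    }
+
+    void acked_up_to(struct rtx_entry **head, unsigned long ack)
+    {
+        while (*head != 0 && seq_leq((*head)->end_seq, ack)) {
+            struct rtx_entry *e = *head;
+            *head = e->next;
+            free(e);                /* the common case: discard    */
+        }
+    }
+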
+
+ There is another generality, at least as important as optimizing
+
+the common case, which is to avoid copying data any more times than
+
+necessary. One more result from the Multics TCP may prove enlightening
+
+here. Multics takes between two and three milliseconds within the TCP
+
+layer to process an incoming packet, depending on its size. For a 576-
+
+byte packet, the three milliseconds is used up approximately as follows.
+
+One millisecond is used computing the checksum. Six hundred
+
+microseconds is spent copying the data. (The data is copied twice, at
+
+.3 milliseconds a copy.) One of those copy operations could correctly
+
+be included as part of the checksum cost, since it is done to get the
+
+data on a known word boundary to optimize the checksum algorithm.
+
+However, the copy also performs another necessary transfer at the same
+
+time. Header processing and packet resequencing takes .7 milliseconds.
+
+The rest of the time is used in miscellaneous processing, such as
+
+removing packets from the retransmission queue which are acknowledged by
+
+this packet. Data copying is the second most expensive single operation
+
+after data checksumming. Some implementations, often because of an
+
+excessively layered modularity, end up copying the data around a great
+
+deal. Other implementations end up copying the data because there is no
+
+shared memory between processes, and the data must be moved from process
+
+to process via a kernel operation. Unless the amount of this activity
+
+is kept strictly under control, it will quickly become the major
+
+performance bottleneck.
+
+
+ 7. Conclusions
+
+
+ This document has addressed two aspects of obtaining performance
+
+from a protocol implementation, the way in which the protocol is layered
+
+and integrated into the operating system, and the way in which the
+
+detailed handling of the packet is optimized. It would be nice if one
+
+or the other of these costs would completely dominate, so that all of
+
+one's attention could be concentrated there. Regrettably, this is not
+
+so. Depending on the particular sort of traffic one is getting, for
+
+example, whether Telnet one-byte packets or file transfer maximum size
+
+packets at maximum speed, one can expect to see one or the other cost
+
+being the major bottleneck to throughput. Most implementors who have
+
+studied their programs in an attempt to find out where the time was
+
+going have reached the unsatisfactory conclusion that it is going
+
+equally to all parts of their program. With the possible exception of
+
+checksum processing, very few people have ever found that their
+
+performance problems were due to a single, horrible bottleneck which
+
+they could fix by a single stroke of inventive programming. Rather, the
+
+performance was something which was improved by painstaking tuning of
+
+the entire program.
+
+
+ Most discussions of protocols begin by introducing the concept of
+
+layering, which tends to suggest that layering is a fundamentally
+
+wonderful idea which should be a part of every consideration of
+
+protocols. In fact, layering is a mixed blessing. Clearly, a layer
+
+interface is necessary whenever more than one client of a particular
+
+layer is to be allowed to use that same layer. But an interface,
+
+precisely because it is fixed, inevitably leads to a lack of complete
+
+understanding as to what one layer wishes to obtain from another. This
+
+has to lead to inefficiency. Furthermore, layering is a potential snare
+
+in that one is tempted to think that a layer boundary, which was an
+
+artifact of the specification procedure, is in fact the proper boundary
+
+to use in modularizing the implementation. Again, in certain cases, an
+
+architected layer must correspond to an implemented layer, precisely so
+
+that several clients can have access to that layer in a reasonably
+
+straightforward manner. In other cases, cunning rearrangement of the
+
+implemented module boundaries to match with various functions, such as
+
+the demultiplexing of incoming packets, or the sending of asynchronous
+
+outgoing packets, can lead to unexpected performance improvements
+
+compared to more traditional implementation strategies. Finally, good
+
+performance is something which is difficult to retrofit onto an existing
+
+program. Since performance is influenced, not just by the fine detail,
+
+but by the gross structure, it is sometimes the case that in order to
+
+obtain a substantial performance improvement, it is necessary to
+
+completely redo the program from the bottom up. This is a great
+
+disappointment to programmers, especially those doing a protocol
+
+implementation for the first time. Programmers who are somewhat
+
+inexperienced and unfamiliar with protocols are sufficiently concerned
+
+with getting their program logically correct that they do not have the
+
+capacity to think at the same time about the performance of the
+
+structure they are building. Only after they have achieved a logically
+
+correct program do they discover that they have done so in a way which
+
+has precluded real performance. Clearly, it is more difficult to design
+
+a program thinking from the start about both logical correctness and
+
+performance. With time, as implementors as a group learn more about the
+
+appropriate structures to use for building protocols, it will be
+
+possible to proceed with an implementation project having more
+
+confidence that the structure is rational, that the program will work,
+
+and that the program will work well. Those of us now implementing
+
+protocols have the privilege of being on the forefront of this learning
+
+process. It should be no surprise that our programs sometimes suffer
+
+from the uncertainty we bring to bear on them.
+
+
+Citations
+
+
+ [1] Cohen and Postel, "On Protocol Multiplexing", Sixth Data
+
+Communications Symposium, ACM/IEEE, November 1979.
+
+
+ [2] Bunch and Day, "Control Structure Overhead in TCP", Trends and
+
+Applications: Computer Networking, NBS Symposium, May 1980.
+
+