Diffstat (limited to 'doc/rfc/rfc528.txt')
-rw-r--r-- doc/rfc/rfc528.txt 507
1 files changed, 507 insertions, 0 deletions
diff --git a/doc/rfc/rfc528.txt b/doc/rfc/rfc528.txt
new file mode 100644
index 0000000..2aca792
--- /dev/null
+++ b/doc/rfc/rfc528.txt
@@ -0,0 +1,507 @@
+
+
+
+
+
+
+Network Working Group                                       J. McQuillan
+Request for Comments: 528                                        BBN-NET
+NIC: 17164                                                  20 June 1973
+
+
+        SOFTWARE CHECKSUMMING IN THE IMP AND NETWORK RELIABILITY
+
+ As the ARPA Network has developed over the last few years, and our
+ experience with operating the IMP subnetwork has grown, the issue of
+ reliability has assumed greater importance and greater complexity.
+ This note describes some modifications that have recently been made
+ to the IMP and TIP programs in this regard. These changes are
+ mechanically minor and do not affect Host operation at all, but they
+ are logically noteworthy, and for this reason we have explained the
+ workings of the new IMP and TIP programs in some detail. Host
+ personnel are advised to note particularly the modifications
+ described in sections 4 and 5, as they may wish to change their own
+ programs or operating procedures.
+
+1. A Changing View of Network Reliability
+
+ Our idea of the Network has evolved as the Network itself has grown.
+ Initially, it was thought that the only components in the network
+ design that were prone to errors were the communications circuits,
+ and the modem interfaces in the IMPs are equipped with a CRC checksum
+ to detect "almost all" such errors. The rest of the system,
+ including Host interfaces, IMP processors, memories, and interfaces,
+ was considered to be error-free. We have had to re-evaluate
+ this position in the light of our experience. In operating the
+ network we are faced with the problem of having to perform remote
+ diagnosis on failures which cannot easily be classified or
+ understood. Some examples of such problems include reports from Host
+ personnel of lost RFNMs and lost Host-Host protocol allocate
+ messages, inexplicable behavior in the IMP of a transient nature,
+ and, finally, the problem of crashes -- the total failure of an IMP,
+ perhaps affecting adjacent IMPs. These circumstances are infrequent
+ and are therefore difficult to correlate with other failures or with
+ particular attempted remedies. Indeed, it is often impossible to
+ distinguish a software failure from a hardware failure.
+
+ In post-mortem analysis of crashes, we have sometimes found incorrect
+ instructions in the IMP program, sometimes with just one or two
+ bits picked or dropped. Clearly, memory errors can account for
+ almost any failure, not only program crashes but also data errors
+ which can lead to many other syndromes. For instance, if the address
+ of a message is changed in transit, then one Host thinks the message
+ was lost, and another Host may receive an extra message. Errors of
+ this kind fall into two general classes: errors in Host messages,
+
+
+
+McQuillan [Page 1]
+
+RFC 528 SOFTWARE CHECKSUMMING IN THE IMP 20 June 1973
+
+
+ whether in the control information or the data, and errors in inter-
+ IMP messages, primarily routing update messages. In the course of
+ the last few years, it has become increasingly clear that such errors
+ were occurring, though it was difficult to speculate as to where,
+ why, and how often.
+
+ One of the earliest problems of this kind was discovered in 1971.
+ The Harvard IMP was sometimes crashing in an unknown manner so that
+ all the other IMPs were affected. It was finally determined that its
+ memory was faulty and sometimes the routing messages read out from
+ memory by the modem output interfaces were all zeroes. The adjacent
+ IMPs interpreted such an erroneous message as stating that the
+ Harvard IMP had zero delay to all destinations -- that it was the
+ best route to everywhere! Once this information propagated to the
+ other IMPs, the whole network was in a shambles. The solution to
+ this problem was to generate a software checksum for each routing
+ message before it was sent from one IMP, and to check it after it was
+ received at the other IMP. This software checksum, in addition to
+ the hardware checksum of the circuit, checks the modem interfaces and
+ memories at each IMP, and protects the IMPs from erroneous routing
+ information. The overhead in computing these checksums is not great
+ since the messages are only exchanged every 2/3 of a second.
+
+ In the first few months of 1973, we began to have a great deal of
+ trouble with the reliability of some IMPs, especially those in the
+ Washington area. The normal procedures of calling in and working
+ with Honeywell field engineers had not cleared up several of these
+ persistent failures, and it was felt that an escalation of BBN
+ involvement was needed to identify the exact causes of the problems.
+ Therefore, during much of February and March there were one or more
+ members of the staff at various sites in the network where hardware
+ problems were suspected. The first thing we found out was that the
+ operational IMP program did not give enough diagnostic information
+ about failures when they occurred, and that the available test
+ programs did not detect errors frequently enough to justify their
+ use. That is, the errors were appearing at rather low frequency,
+ from once every few hours to once every few days, compared to message
+ rates of once a second or faster. Therefore, we decided to try to
+ make the operational IMP program run when it could, and report more
+ information about detected hardware errors, rather than keep the
+ failing IMPs off the network for days at a time.
+
+ Modifications to the IMP program had two independent goals: we wanted
+ to make the software less vulnerable to hardware failures, and we
+ wanted the software to isolate the failures and report them to the
+ NCC. The technique we chose to use was generating a software
+ checksum on all packets as they are sent out over a line. We
+ suspected that the hardware failures in the Washington area were
+
+ happening between IMPs, that is, the packets were correct before they
+ were sent. Thus, a memory-to-memory software checksum, similar to
+ the technique installed two years before for routing messages only,
+ should be able to detect these errors. On March 13, a new version of
+ the IMP program was released with software checksum code. In this
+ program, when a packet is found to have an incorrect checksum it is
+ discarded, and a copy of the data is sent to the NCC. The previous
+ IMP retransmits the packet, since an acknowledgment is not returned.
+
+ A partial list of the hardware problems that were uncovered by
+ software checksums, and subsequently fixed, includes:
+
+ * One modem interface at the Aberdeen IMP dropped several bits
+ from several successive words in transferring data into memory.
+
+ * One modem interface at the Belvoir IMP picked one or two bits
+ in a single word in transferring data into memory.
+
+ * One modem interface at the ETAC TIP dropped the first word in
+ transferring data out of memory.
+
+ * A region in memory at the Utah IMP changed the low order two
+ bits in some words on an irregular basis.
+
+ Each of these problems resulted in two or three detected errors per
+ day. There were other problems that were not detected by the
+ software checksum, such as dropped interrupts. This set of problems
+ may be explained by the electronics of the high-speed DMC on 316
+ IMPs. The first three machines cited above are 316 IMPs with 3 modem
+ interfaces, and they are the only such machines in the network. The
+ third interface is in a separate drawer and the total bus length
+ seems to be too long for the driving electronics in the original
+ design. We are presently investigating various ways to fix these
+ problems, and have had some success already.
+
+2. An End-to-End Software Checksum on Packets
+
+ This last experience, and the earlier checksum on routing messages,
+ proved the value of a software checksum on all inter-IMP
+ transmissions. We have decided to extend the checksum to detect
+ intra-IMP failures as well, and make software checksums on all
+ network transmissions a permanent feature of the IMP system. We can
+ obtain an end-to-end software checksum on packets, without any time
+ gaps, as follows:
+
+         +--------+        +--------+        +---------+
+         |   IMP 2|--------|3 IMP  4|--------|5 IMP    |
+         |   1    |        |        |        |    6    |
+         +---|----+        +--------+        +----|----+
+             |                                    |
+         +---|----+                          +----|----+
+         |        |                          |         |
+         |  Host  |                          |  Host   |
+         +--------+                          +---------+
+
+ * A checksum is computed at the source IMP for each packet as it
+ is received from the source Host. (interface 1)
+
+ * The checksum is verified at each intermediate IMP as it is
+ received over the circuit from the previous IMP. (interfaces 3
+ and 5)
+
+ * If the checksum is in error, the packet is discarded, and the
+ previous IMP retransmits the packet when it does not receive an
+ acknowledgment. (interfaces 2 and 4)
+
+ * The previous IMP does not verify the checksum before the
+ original transmission, to cut the number of checks in half.
+ But when it must retransmit a packet it does verify the
+ checksum. If it finds an error, it has detected an intra-IMP
+ failure, and the packet is lost. If not, then the first
+ transmission was lost due to an inter-IMP failure, a circuit
+ error, or was simply refused by the adjacent IMP. The previous
+ IMP holds a good copy of the packet, which it then retransmits.
+ (interfaces 2 and 4)
+
+ * After the packet has successfully traversed several
+ intermediate IMPs, it arrives at the destination IMP. The
+ checksum is verified just before the packet is sent to the
+ Host. (interface 6)
+
+ This technique provides a checksum from the source IMP to the
+ destination IMP on each packet, with no gaps in time when the packet
+ is unchecked. Any errors are reported to the NCC in full, with a
+ copy of the packet in question. This method answers both
+ requirements stated above: it makes the IMPs more reliable and
+ fault-tolerant, and it provides a maximum of diagnostic information
+ for use in fault isolation. This expanded checksum logic was
+ installed in the network on June 19.
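+
+ The hop-by-hop procedure listed above can be sketched as follows.
+ This is a simplified model in Python under stated assumptions, not
+ the IMP program: the additive checksum, the retry limit, and the
+ names send_over_hop, link, and ncc_log are hypothetical.
+
```python
WORD_MASK = 0xFFFF  # assumption: 16-bit words with wrap-around addition


def checksum(words):
    # Assumed additive checksum: sum of the words plus the word count.
    return (sum(words) + len(words)) & WORD_MASK


def send_over_hop(words, check, link, ncc_log, max_tries=3):
    """Carry one packet across one hop; return the words received, or None.

    `link` is a hypothetical callable modelling the modem interfaces and
    the circuit: it takes the attempt number and a copy of the words and
    returns the words as they land in the next IMP's memory.
    """
    for attempt in range(max_tries):
        if attempt > 0 and checksum(words) != check:
            # Only a retransmission verifies the held copy first.
            ncc_log.append(("sender: held copy bad (intra-IMP failure)", list(words)))
            return None  # the packet is lost
        received = link(attempt, list(words))
        if checksum(received) == check:
            return received  # receiver accepts and acknowledges
        # Receiver discards the packet, reports it, and returns no ack.
        ncc_log.append(("receiver: bad checksum", received))
    return None


def flaky_link(attempt, words):
    # Drops the low bit of the first word on the first attempt only.
    if attempt == 0:
        words[0] &= ~1 & WORD_MASK
    return words


def clean_link(attempt, words):
    return words


ncc_log = []
packet = [0x0003, 0x0100, 0x7FFF]
check = checksum(packet)  # computed once, at the source IMP

hop1 = send_over_hop(packet, check, flaky_link, ncc_log)
hop2 = send_over_hop(hop1, check, clean_link, ncc_log)
# The destination IMP verifies once more just before handing to the Host.
delivered = hop2 is not None and checksum(hop2) == check

# A held copy that is already damaged is caught on the retransmission.
lost = send_over_hop([0x0002, 0x0100, 0x7FFF], check, clean_link, [])
```
+
+ In this model the first hop reports one bad packet to the log and
+ then succeeds on the retransmission, while a damaged held copy is
+ only noticed at the second attempt, which mirrors the decision to
+ skip the check on the original transmission.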
+
+ One of the major questions about such approaches is their efficiency.
+ We have been able to include the software checksum on all packets
+ without greatly increasing the processing overhead in the IMP. The
+ method described above involves one checksum calculation at each IMP
+ through which a packet travels. We developed a very fast checksum
+ technique, which takes only 2 microseconds per word. The program computes
+ the number of words in a packet and then jumps to the appropriate
+ entry in a chain of add instructions. This produces a simple sum of
+ the words in the packet, to which the number of words in the packet
+ is added to detect missing or extra words of zero. With the
+ inclusion of this code, the effective processor bandwidth of a 516
+ IMP is reduced by one-eighth for full-length store-and-forward
+ packets, from a megabit per second to 875 kilobits per second. That
+ is, the IMP now has the processing capability to connect to 17 full
+ duplex 50 kilobit per second lines, as compared to 20 such lines
+ without the checksum program. We are aware that this add checksum is
+ not a very good one in terms of its error-detecting capabilities, but
+ it is as much as the IMP can afford to do in software. Furthermore,
+ we emphasize that the primary goal of this modification is to assist
+ in the remote diagnosis of intermittent hardware failures.
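+
+ The add checksum and the bandwidth figures above can be sketched as
+ follows. This is an illustration in Python, not the IMP code: the
+ 16-bit word size, the wrap-around addition, and the names
+ add_checksum and lines_supported are assumptions made for the
+ sketch. It also shows one weakness of a simple sum: reordered words
+ go undetected.
+
```python
WORD_MASK = 0xFFFF  # assumption: 16-bit machine word, addition wraps


def add_checksum(words):
    """Sum the packet words, then add the word count, modulo 2**16."""
    total = 0
    for word in words:
        total = (total + word) & WORD_MASK
    # Adding the count makes a missing or extra all-zero word visible.
    return (total + len(words)) & WORD_MASK


def lines_supported(bandwidth_kbps, line_kbps=50):
    """Whole number of lines a given processor bandwidth can serve."""
    return bandwidth_kbps // line_kbps


packet = [0x1234, 0x0000, 0xBEEF]
reference = add_checksum(packet)

dropped_zero = [0x1234, 0xBEEF]        # a zero word lost in transfer
bit_picked = [0x1234, 0x0001, 0xBEEF]  # one bit picked in one word
swapped = [0xBEEF, 0x0000, 0x1234]     # same words, different order

detects_dropped_zero = add_checksum(dropped_zero) != reference
detects_bit_picked = add_checksum(bit_picked) != reference
detects_swap = add_checksum(swapped) != reference  # False: a known weakness

full_kbps = 1000                     # "a megabit per second"
reduced_kbps = full_kbps * 7 // 8    # one-eighth spent on checksumming
```
+
+ With these figures, lines_supported(reduced_kbps) gives 17 and
+ lines_supported(full_kbps) gives 20, matching the line counts quoted
+ in the text.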
+
+3. Checksumming to Improve the Reliability of Routing
+
+ We mentioned earlier the catastrophic effects that follow for the
+ Network as a whole when a single IMP begins to propagate incorrect
+ routing information. The experience described above involved a
+ specific memory failure which has not recurred in the last two years,
+ but the problem is easily understood to be of a general nature. In
+ fact, we recently had another network-wide failure that was traced to
+ a hardware error that resulted in erroneous routing messages, after
+ we had installed a software checksum on all inter-IMP transmissions.
+ The problem we had was due to a single broken instruction in the
+ part of the IMP program that builds the routing message. As a
+ result, the routing messages from that IMP were random data, and the
+ neighboring IMPs interpreted these messages as routing update
+ information. When this happened, traffic flow through the Network
+ was completely disrupted and no useful work could be done until the
+ failed IMP was halted.
+
+ This kind of problem, the introduction of incorrect routing
+ information into the Network, can happen in three ways:
+
+ * The routing message is changed in transmission. The inter-IMP
+ checksum should catch this. The bad routing messages we saw in
+ the Network had good checksums.
+
+ * The routing message is changed as it is constructed, say by a
+ memory or processor failure, or before it is transmitted. This
+ is what we termed above an intra-IMP failure.
+
+ * The routing program is incorrect for hardware or software
+ reasons.
+
+ We have attempted to solve the last two kinds of problems by
+ extending the concept of software checksums. The routing program has
+ been modified to build a software checksum for the routing message as
+ it builds the message, just as if it came from a Host. It is
+ important that this checksum refer to the intended contents of the
+ routing message, not the actual contents. That is, the program which
+ generates the routing message builds its own software checksum as it
+ proceeds, not by reading what has been stored in the routing message
+ area, but by adding up the intended contents for each entry as it
+ computes them. The process which sends out routing messages then
+ always verifies the checksum before transmitting them. This scheme
+ should detect all intra-IMP failures.
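+
+ The distinction between intended and actual contents can be
+ sketched as follows. This is a hedged illustration in Python: the
+ additive checksum, the 16-bit word size, and the names
+ build_routing_message, safe_to_send, and memory_fault are
+ assumptions, and the message layout is reduced to a list of delays.
+
```python
WORD_MASK = 0xFFFF  # assumption: 16-bit words with wrap-around addition


def build_routing_message(delays, memory_fault=None):
    """Store each entry and, separately, sum the value that was intended.

    `memory_fault` is a hypothetical hook that alters a word as it is
    stored, standing in for a failing memory.
    """
    stored = []
    intended_sum = 0
    for delay in delays:
        value = delay & WORD_MASK
        # The checksum uses the computed value, never a read-back.
        intended_sum = (intended_sum + value) & WORD_MASK
        stored.append(memory_fault(value) if memory_fault else value)
    check = (intended_sum + len(delays)) & WORD_MASK
    return stored, check


def safe_to_send(stored, check):
    """The sending process re-adds what is actually in the message area."""
    return ((sum(stored) + len(stored)) & WORD_MASK) == check


good_msg, good_check = build_routing_message([4, 9, 2, 7])

# A memory that returns all zeroes, as in the 1971 incident described
# in section 1, produces a message that fails verification.
zeroed_msg, zeroed_check = build_routing_message(
    [4, 9, 2, 7], memory_fault=lambda word: 0
)
```
+
+ Had the checksum been computed by reading the stored words back,
+ the all-zero message would have carried a matching checksum and
+ been sent.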
+
+ Finally, the routing program itself can be checksummed to detect any
+ changes in the code. The programs which copy in received routing
+ messages, compute new routing tables, and send out routing messages
+ each calculate the checksum of the code before executing it. If the
+ program finds a discrepancy in the checksum of the program it is
+ about to run, it immediately requests a program reload from an
+ adjacent IMP. These checksums include the checksum computation
+ itself, the routing program and any constants referenced. This
+ modification should prevent a hardware failure at one IMP from
+ affecting the Network at large by stopping the IMP before it does any
+ damage in terms of spreading bad routing. A version of the IMP
+ program with this added protection for routing was released on May
+ 22.
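+
+ The check-before-execute rule can be sketched as follows. This is
+ a minimal Python model under stated assumptions: the code region is
+ a list of placeholder words, the checksum is the additive one, and
+ run_checked and request_reload are hypothetical names. It does not
+ model the checksum routine checking itself.
+
```python
WORD_MASK = 0xFFFF  # assumption: 16-bit words with wrap-around addition


def region_checksum(words):
    # Assumed additive checksum over a region of program memory.
    return (sum(words) + len(words)) & WORD_MASK


def run_checked(code_words, expected, routine, request_reload):
    """Run `routine` only if the code region still matches its checksum."""
    if region_checksum(code_words) != expected:
        request_reload()  # ask an adjacent IMP for a fresh copy
        return None       # do not execute the damaged code
    return routine()


code_region = [0o010203, 0o040506, 0o120000]  # placeholders, not IMP code
expected = region_checksum(code_region)
events = []

ok = run_checked(
    code_region, expected,
    lambda: "routing updated",
    lambda: events.append("reload requested"),
)

code_region[1] ^= 0o000004  # one bit changed by a memory fault
blocked = run_checked(
    code_region, expected,
    lambda: "routing updated",
    lambda: events.append("reload requested"),
)
```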
+
+ In the first few months of 1973, there have been several other
+ efforts aimed at improving the reliability of the Network, in
+ addition to software checksumming in the IMPs. At the same time that
+ we were discovering inter-IMP failures with the software checksum on
+ packets, we began to notice a different kind of problem: intra-IMP
+ failures. In these cases we were primarily faced with memory
+ problems, and they often affected the IMP program itself, rather than
+ the packets flowing through the IMP. Our first attack on this
+ problem was to build a PDP-1 program to verify the running IMP and
+ TIP programs at a site against the correct core images held at the
+ PDP-1. The program interrogates the IMP with DDT messages, and
+ prints out a list of discrepancies. Using this program, we have
+ already found memory failures at one site.
+
+4. TIP Modifications
+
+ The hardware difficulties which we began to experience during the
+ first few months of 1973 had two effects on Host-to-Host
+ communication. First, the intermittent modem interface failures, of
+ the type seen at Belvoir, Aberdeen, and ETAC, meant that messages
+ were occasionally lost by the network. This loss is reported to the
+ transmitting Host by the "Incomplete Transmission" message generated
+ by the source IMP; the Host must then decide whether to retransmit or
+ to take some other action. Second, the higher than normal incidence
+ of machine failures meant that the network sometimes "partitioned" so
+ that there was no path between the two communicating Hosts. (It
+ should be noted that, contrary to the original design, two sites are
+ currently connected to the network by only a single path; other
+ similar connections are planned. For any such sites, any failure
+ along the single path will be seen as a partition.) Since a TIP acts
+ as a Host for its users, its resilience when these types of failures
+ occur has a major effect on user satisfaction.
+
+ Prior to this time the TIP program "aborted" the user's connection if
+ it received an Incomplete Transmission indication from the IMP
+ program. In March the TIP program (and the programs of several other
+ Hosts) was changed to retransmit messages for which the Incomplete
+ Transmission indication was returned; some Hosts (e.g. MULTICS) have
+ done this from the start. This modification has turned out to be
+ relatively simple, and we urge other Hosts to consider implementing
+ some sort of error recovery software. On the other hand, it has not
+ seemed reasonable to continue attempting to transmit when the program
+ receives a "Destination Unreachable" indication, since this could
+ arise either from a network partition or from a failure at the
+ destination site. The interactive user is, of course, free to try
+ again manually.
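+
+ For Hosts considering similar error recovery, the policy described
+ above can be sketched as follows. This is an illustration in
+ Python, not the TIP code: the transmit callable, the use of a RFNM
+ as the success indication, and the retry limit are assumptions.
+
```python
INCOMPLETE = "Incomplete Transmission"
UNREACHABLE = "Destination Unreachable"
RFNM = "RFNM"  # assumption: a RFNM marks successful delivery


def send_with_recovery(message, transmit, max_retries=5):
    """Return True once a RFNM comes back, False if the Host gives up.

    `transmit` is a hypothetical callable that hands the message to the
    IMP and returns the indication received in reply.
    """
    for _ in range(1 + max_retries):
        reply = transmit(message)
        if reply == RFNM:
            return True
        if reply == UNREACHABLE:
            # Partition or failure at the destination: do not retry.
            return False
        if reply != INCOMPLETE:
            raise ValueError(f"unexpected reply: {reply!r}")
        # Incomplete Transmission: loop around and retransmit.
    return False


replies = iter([INCOMPLETE, INCOMPLETE, RFNM])
delivered = send_with_recovery("data", lambda msg: next(replies))
gave_up = send_with_recovery("data", lambda msg: UNREACHABLE)
```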
+
+ A different situation pertains to tape transfers involving TIPs with
+ the magnetic tape option. In these cases, the user would like to
+ start the process and then ignore it until the transfer is finished.
+ Network partitions, even if infrequent, may occur when tape transfers
+ many hours in length are in progress. Therefore, we made a
+ significant modification to the TIP magnetic tape option to include a
+ sequencing mechanism in the tape transfer protocol which permits
+ automatic recovery and transmission continuation after most kinds of
+ network transients. With this mechanism in effect, and assuming a
+ tape is mounted at the "other end", the complete transfer of a tape
+ is possible with a single command given at either end. If the
+ connection goes dead in mid-transfer, the TIP magnetic tape software
+ will attempt to reopen the connection until successful and then
+ continue the transfer from where it was left off. In addition to
+ modifying the TIP magnetic tape option as specified above, we also
+ modified the TENEX program which is able to communicate with the TIP
+ magnetic tape option so that it remained compatible. These changes
+ were installed in April.
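+
+ The sequencing idea behind the tape recovery can be sketched as
+ follows. This is a hypothetical Python model, not the tape transfer
+ protocol itself: the record-level sequence numbers, the
+ open_connection and send callables, and the bound on reopen
+ attempts (the text says the TIP retries until successful) are all
+ assumptions.
+
```python
def transfer_tape(records, open_connection, max_reopens=10):
    """Deliver every record in order, resuming after a dropped connection.

    `open_connection` is a hypothetical callable returning a `send`
    function; `send(seq, record)` returns once the far end has confirmed
    the record and raises ConnectionError when the connection dies.
    """
    next_seq = 0  # first record not yet confirmed by the far end
    reopens = 0
    while next_seq < len(records):
        send = open_connection()
        try:
            while next_seq < len(records):
                send(next_seq, records[next_seq])
                next_seq += 1
        except ConnectionError:
            reopens += 1
            if reopens > max_reopens:
                raise
            # Otherwise reopen and continue from next_seq.
    return reopens


received = {}
calls = {"count": 0}


def open_connection():
    def send(seq, record):
        calls["count"] += 1
        if calls["count"] == 3:  # the connection dies on the third send
            raise ConnectionError("network partition")
        received[seq] = record   # a repeated record overwrites by number
    return send


tape = ["rec0", "rec1", "rec2", "rec3"]
reopens = transfer_tape(tape, open_connection)
```
+
+ Because each record carries its sequence number, the receiving end
+ can tell where the transfer left off, and a record sent twice
+ across a reopen does not appear twice on the tape.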
+
+5. Future Plans
+
+ We have been considering some of the issues of network reliability
+ discussed above in connection with the development of the new High
+ Speed Modular IMP. This design effort and the experiences with the
+ current IMP system are, of course, linked together, and we have
+ already decided on several approaches to be taken in the new line of
+ IMPs:
+
+ * The IMP will have a hardware CRC checksum generator which
+ returns the checksum on a specified range of memory.
+
+ * The IMP will use this facility to generate and check an end-
+ to-end checksum on messages. This checksum will therefore be
+ more comprehensive and better for error detection than the
+ current software checksum. It will ensure a high degree of
+ reliability for Host transmissions.
+
+ * In addition, the IMP will perform a verification of a packet
+ checksum at each hop to provide diagnostic information. This
+ check will be on an optional basis, whenever the system has
+ available resources for the check.
+
+ * The code for the new IMP system will be read-only (this is
+ impractical for the present 516 and 316 IMPs), and the program
+ will periodically checksum itself using the hardware CRC
+ generator. We hope to design the program so that it can be
+ reloaded in segments in the event of a detected error in the
+ code, with no service interruption.
+
+ * Finally, we are looking into the structure of an optional
+ IMP-Host/Host-IMP checksum to complete a Host/Host end-to-end
+ checksum. Under such an arrangement, the IMP and Host could
+ agree to verify the checksums on the messages transferred over
+ the interface between them, and the appropriate signalling
+ mechanisms would be provided to handle errors. With this
+ technique in effect, two Hosts could be certain that their
+ messages were delivered error-free or else they would be
+ notified of an error, and could then retransmit their message
+ if desired.
+
+ More details on any such modifications to the IMP and to the
+ IMP-Host interface will be published when appropriate.
+
+
+ [This RFC was put into machine readable form for entry]
+ [into the online RFC archives by Via Genie 12/1999]
+