RFC 789
Vulnerabilities of Network Control Protocols: An Example
Eric C. Rosen
Bolt Beranek and Newman Inc.
This paper has appeared in the January 1981 edition of the
SIGSOFT Software Engineering Notes, and will soon appear in the
SIGCOMM Computer Communications Review. It is being circulated
as an RFC because it is thought that it may be of interest to a
wider audience, particularly to the internet community. It is a
case study of a particular kind of problem that can arise in
large distributed systems, and of the approach used in the
ARPANET to deal with one such problem.
On October 27, 1980, there was an unusual occurrence on the
ARPANET. For a period of several hours, the network appeared to
be unusable, due to what was later diagnosed as a high priority
software process running out of control. Network-wide
disturbances are extremely unusual in the ARPANET (none has
occurred in several years), and as a result, many people have
expressed interest in learning more about the etiology of this
particular incident. The purpose of this note is to explain what
the symptoms of the problem were, what the underlying causes
were, and what lessons can be drawn. As we shall see, the
immediate cause of the problem was a rather freakish hardware
malfunction (which is not likely to recur) which caused a faulty
sequence of network control packets to be generated. This faulty
sequence of control packets in turn affected the apportionment of
software resources in the IMPs, causing one of the IMP processes
to use an excessive amount of resources, to the detriment of
other IMP processes. Restoring the network to operational
condition was a relatively straightforward task. There was no
damage other than the outage itself, and no residual problems
once the network was restored. Nevertheless, it is quite
interesting to see the way in which unusual (indeed, unique)
circumstances can bring out vulnerabilities in network control
protocols, and that shall be the focus of this paper.
The problem began suddenly when we discovered that, with
very few exceptions, no IMP was able to communicate reliably with
any other IMP. Attempts to go from a TIP to a host on some other
IMP only brought forth the "net trouble" error message,
indicating that no physical path existed between the pair of
IMPs. Connections which already existed were summarily broken.
A flood of phone calls to the Network Control Center (NCC) from
all around the country indicated that the problem was not
localized, but rather seemed to be affecting virtually every IMP.
As a first step towards trying to find out what the state of
the network actually was, we dialed up a number of TIPs around
the country. What we generally found was that the TIPs were up,
but that their lines were down. That is, the TIPs were
communicating properly with the user over the dial-up line, but
no connections to other IMPs were possible.
We tried manually restarting a number of IMPs which are in
our own building (after taking dumps, of course). This procedure
initializes all of the IMPs' dynamic data structures, and will
often clear up problems which arise when, as sometimes happens in
most complex software systems, the IMPs' software gets into a
"funny" state. The IMPs which were restarted worked well until
they were connected to the rest of the net, after which they
exhibited the same complex of symptoms as the IMPs which had not
been restarted.
From the facts so far presented, we were able to draw a
number of conclusions. Any problem which affects all IMPs
throughout the network is usually a routing problem. Restarting
an IMP re-initializes the routing data structures, so the fact
that restarting an IMP did not alleviate the problem in that IMP
suggested that the problem was due to one or more "bad" routing
updates circulating in the network. IMPs which were restarted
would just receive the bad updates from those of their neighbors
which were not restarted. The fact that IMPs seemed unable to
keep their lines up was also a significant clue as to the nature
of the problem. Each pair of neighboring IMPs runs a line
up/down protocol to determine whether the line connecting them is
of sufficient quality to be put into operation. This protocol
involves the sending of HELLO and I-HEARD-YOU messages. We have
noted in the past that under conditions of extremely heavy CPU
utilization, so many buffers can pile up waiting to be served by
the bottleneck CPU process, that the IMPs are unable to acquire
the buffers needed for receiving the HELLO or I-HEARD-YOU
messages. If a condition like this lasts for any length of time,
the IMPs may not be able to run the line up/down protocol, and
lines will be declared down by the IMPs' software. On the basis
of all these facts, our tentative conclusion was that some
malformed update was causing the routing process in the IMPs to
use an excessive amount of CPU time, possibly even to be running
in an infinite loop. (This would be quite a surprise though,
since we tried very hard to protect ourselves against malformed
updates when we designed the routing process.) As we shall see,
this tentative conclusion, although on the right track, was not
quite correct, and the actual situation turned out to be much
more complex.
When we examined core dumps from several IMPs, we noted that
most, in some cases all, of the IMPs' buffers contained routing
updates waiting to be processed. Before describing this
situation further, it is necessary to explain some of the details
of the routing algorithm's updating scheme. (The following
explanation will of course be very brief and incomplete. Readers
with a greater level of interest are urged to consult the
references.) Every so often, each IMP generates a routing update
indicating which other IMPs are its immediate neighbors over
operational lines, and the average per-packet delay (in
milliseconds) on the line to each such neighbor. Every IMP is required to generate
such an update at least once per minute, and no IMP is permitted
to generate more than a dozen such updates over the course of a
minute. Each update has a 6-bit sequence number which is
advanced by 1 (modulo 64) for each successive update generated by
a particular IMP. If two updates generated by the same IMP have
sequence numbers n and m, update n is considered to be LATER
(i.e., more recently generated) than update m if and only if one
of the following two conditions holds:
(a) n > m, and n - m <= 32
(b) n < m, and m - n > 32
(where the comparisons and subtractions treat n and m as unsigned
6-bit numbers, with no modulus). When an IMP generates an
update, it sends a copy of the update to each neighbor. When an
IMP A receives an update u1 which was generated by a different
IMP B, it first compares the sequence number of u1 with the
sequence number of the last update, u2, that it accepted from B.
If this comparison indicates that u2 is LATER than u1, u1 is
simply discarded. If, on the other hand, u1 appears to be the
LATER update, IMP A will send u1 to all its neighbors (including
the one from which it was received). The sequence number of u1
will be retained in A's tables as the LATEST received update from
B. Of course, u1 is always accepted if A has seen no previous
update from B. Note that this procedure is designed to ensure
that an update generated by a particular IMP is received,
unchanged, by all other IMPs in the network, IN THE PROPER
SEQUENCE. Each routing update is broadcast (or flooded) to all
IMPs, not just to immediate neighbors of the IMP which generated
the update (as in some other routing algorithms). The purpose of
the sequence numbers is to ensure that all IMPs will agree as to
which update from a given IMP is the most recently generated
update from that IMP.
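The LATER relation can be written out compactly. The following Python sketch (a modern illustration for the reader, not the IMPs' actual code) implements the two clauses exactly as defined above:

```python
# Sketch of the LATER comparison for 6-bit update sequence numbers,
# transcribed from the definition in the text. Both arguments are
# treated as unsigned 6-bit values (0..63).

def later(n: int, m: int) -> bool:
    """True if update number n is LATER (more recently generated)
    than update number m."""
    return (n > m and n - m <= 32) or (n < m and m - n > 32)
```

For example, `later(1, 63)` is true: after wrap-around, 1 is two updates past 63, so it is correctly judged more recent.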
For reliability, there is a protocol for retransmitting
updates over individual links. Let X and Y be neighboring IMPs,
and let A be a third IMP. Suppose X receives an update which was
generated by A, and transmits it to Y. Now if in the next 100 ms
or so, X does not receive from Y an update which originated at A
and whose sequence number is at least as recent as that of the
update X sent to Y, X concludes that its transmission of the
update did not get through to Y, and that a retransmission is
required. (This conclusion is warranted, since an update which
is received and adjudged to be the most recent from its
originating IMP is sent to all neighbors, including the one from
which it was received.) The IMPs do not keep the original update
packets buffered pending retransmission. Rather, all the
information in the update packet is kept in tables, and the
packet is re-created from the tables if necessary for a
retransmission.
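As a rough illustration of this rule, a per-link bookkeeping structure might look like the sketch below. The class and method names are invented for the sketch; the 100 ms figure and the "echo acknowledges transmission" logic are from the text:

```python
# Hypothetical sketch of the per-link retransmission rule. An update
# sent to a neighbor is "acknowledged" only by hearing back an update
# from the same originating IMP with a sequence number at least as
# recent; otherwise it is re-created from the tables and re-sent.

def later_or_equal(n, m):
    # 6-bit LATER comparison from the updating protocol, or equality
    return n == m or (n > m and n - m <= 32) or (n < m and m - n > 32)

class Link:
    """Tracks updates sent to one neighbor that still await an echo."""
    def __init__(self):
        self.pending = {}   # originating IMP -> (seq, time_sent)

    def sent(self, origin, seq, now):
        self.pending[origin] = (seq, now)

    def heard_from_neighbor(self, origin, seq):
        # An echoed update at least as recent as the one we sent
        # acknowledges our transmission.
        entry = self.pending.get(origin)
        if entry and later_or_equal(seq, entry[0]):
            del self.pending[origin]

    def due_for_retransmission(self, now, timeout=0.1):
        # Updates not echoed within ~100 ms must be re-created from
        # the routing tables and retransmitted.
        return [o for o, (s, t) in self.pending.items() if now - t >= timeout]
```

Note that the pending entry stores only the sequence number and timestamp, mirroring the fact that the IMPs keep tabled information rather than buffered packets.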
This transmission protocol ("flooding") distributes the
routing updates in a very rapid and reliable manner. Once
generated by an IMP, an update will almost always reach all other
IMPs in a time period on the order of 100 ms. Since an IMP can
generate no more than a dozen updates per minute, and there are
64 possible sequence numbers, sequence number wrap-around is not
a problem. There is only one exception to this. Suppose two
IMPs A and B are out of communication for a period of time
because there is no physical path between them. (This may be due
either to a network partition, or to a more mundane occurrence,
such as one of the IMPs being down.) When communication is
re-established, A and B have no way of knowing how long they have
been out of communication, or how many times the other's sequence
numbers may have wrapped around. Comparing the sequence number
of a newly received update with the sequence number of an update
received before the outage may give an incorrect result. To deal
with this problem, the following scheme is adopted. Let t0 be
the time at which IMP A receives update number n generated by IMP
B. Let t1 be t0 plus 1 minute. If by t1, A receives no update
generated by B with a LATER sequence number than n, A will accept
any update from B as being more recent than n. So if two IMPs
are out of communication for a period of time which is long
enough for the sequence numbers to have wrapped around, this
procedure ensures that proper resynchronization of sequence
numbers is effected when communication is re-established.
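The resynchronization rule can be sketched as follows (illustrative Python only; the one-minute timeout is the figure given above, and the class name is invented for the sketch):

```python
# Sketch of per-neighbor update acceptance: normally an update must be
# LATER than the last accepted one, but if a full minute passes with no
# LATER update, the next update is accepted unconditionally. This
# resynchronizes sequence numbers after a long outage.

def later(n, m):
    return (n > m and n - m <= 32) or (n < m and m - n > 32)

class NeighborState:
    TIMEOUT = 60.0  # seconds: one minute, per the text

    def __init__(self):
        self.last_seq = None
        self.last_time = None

    def accept(self, seq, now):
        """Return True if this update is accepted as the latest from B."""
        if self.last_seq is None:                   # first update ever seen
            ok = True
        elif now - self.last_time >= self.TIMEOUT:  # stale: accept anything
            ok = True
        else:
            ok = later(seq, self.last_seq)
        if ok:
            self.last_seq, self.last_time = seq, now
        return ok
```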
There is just one more facet of the updating process which
needs to be discussed. Because of the way the line up/down
protocol works, a line cannot be brought up until 60 seconds
after its performance becomes good enough to warrant operational
use. (Roughly speaking, this is the time it takes to determine
that the line's performance is good enough.) During this
60-second period, no data is sent over the line, but routing
updates are transmitted. Remember that every node is required to
generate a routing update at least once per minute. Therefore,
this procedure ensures that if two IMPs are out of communication
because of the failure of some line, each has the most recent
update from the other by the time communication is
re-established.
This very short introduction to the routing algorithm's
updating protocol should provide enough background to enable the
reader to understand the particular problem under discussion;
further justification and detail can be found in the references.
Let us return now to the discussion of the network outage.
I have already mentioned that the core dumps showed almost all
buffers holding routing updates which were waiting to be
processed. Close inspection showed that all the updates were
from a single IMP, IMP 50. By a strange "coincidence," IMP 50
had been malfunctioning just before the network-wide outage
occurred, and was off the net during the period of the outage.
Hence it was not generating any updates during the period of the
outage. In addition, IMP 29, an immediate neighbor of IMP 50,
was also suffering hardware malfunctions (in particular, dropping
bits), but was up (though somewhat flaky) while the network was
in bad shape. Furthermore, the malfunction in IMP 50 had to do
with its ability to communicate properly with the neighboring IMP
29. Although we did not yet understand how it was possible for
so many updates from one IMP to be extant simultaneously, we did
understand enough to be able to get the network to recover. All
that was necessary was to patch the IMPs to disregard any updates
from IMP 50, which after all was down anyway. When the network
is operating normally, broadcasting a patch to all IMPs can be
done in a matter of minutes. With the network operating as it
was during the period of the outage, this can take as much as 3
or 4 hours. (Remember that the IMPs are generally unmanned, and
that the only way of controlling them from the NCC is via the
network itself. This is perfectly satisfactory when an outage
affects only a small group of IMPs, but is an obvious problem
when the outage has network-wide effects.) This procedure was
fully successful in bringing the network back up.
When we looked closely at the dumps, we saw that not only
were all the updates on the queue from IMP 50, but they all had
one of three sequence numbers (either 8, 40, or 44), and were
ordered in the queue as follows:
8, 40, 44, 8, 40, 44, 8, 40, 44, ...

Note that by the definition
of LATER, 44 is LATER than 40 (44 > 40 and 44 - 40 <= 32), 40 is
LATER than 8 (40 > 8 and 40 - 8 <= 32), and 8 is LATER than 44
(8 < 44 and 44 - 8 > 32). Given the presence of three updates
from the same IMP with these three sequence numbers, this is what
would be expected. Since each update is LATER than one of the
others, a cycle is formed which keeps the three updates floating
around the network indefinitely. Thus the IMPs spend most of
their CPU time and buffer space in processing these updates. The
problem was to figure out how these three updates could possibly
have existed at the same time. After all, getting from update 8
to update 40 should require 2 or 3 full minutes, plus 31
intervening sequence numbers. So how could 8 still be around
when 40 was generated, especially since no updates with
intervening sequence numbers were present?
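The cycle can be checked mechanically. Reimplementing the LATER definition from earlier in the paper:

```python
# Each of the three corrupted sequence numbers is LATER than the next
# one around the ring, so there is no unique "most recent" update.

def later(n, m):
    # the LATER relation on 6-bit sequence numbers
    return (n > m and n - m <= 32) or (n < m and m - n > 32)

for newer, older in [(44, 40), (40, 8), (8, 44)]:
    assert later(newer, older)   # each beats the next one around

# Whichever of the three an IMP currently holds, one of the others
# always looks newer, so all three are flooded indefinitely.
```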
Our first thought was that maybe the real-time clock in IMP
50 was running one or two orders of magnitude faster than normal,
invalidating our assumptions about the maximum number of updates
which could be generated in a given time. An alternative
hypothesis suggested itself however when we looked at the binary
representations of the three sequence numbers:
8 - 001000
40 - 101000
44 - 101100
Note that 44 has only one more bit than 40, which has only one
more bit than 8. Furthermore, the three different updates were
completely identical, except for their sequence numbers. This
suggests that there was really only one update, 44, whose
sequence number was twice corrupted by dropped bits. (Of course,
it's also possible that the "real" update was 8, and was
corrupted by added bits. However, bit-dropping has proven itself
to be a much more common sort of hardware malfunction than
bit-adding, although spontaneously dropped bits may sometimes
come back on spontaneously.)
Surely, the reader will object, there must be protection
against dropped bits. Yes there is protection, but apparently
not enough. The update packets themselves are checksummed, so a
dropped bit in an update packet is readily detected. Remember
though that if an update needs to be retransmitted, it is
recreated from tabled information. For maximal reliability, the
tables must be checksummed also, and the checksum must be
recomputed every time the table is accessed. However, this would
require either a large number of CPU cycles (for frequent
checksumming of a large area of memory) or a large amount of
memory (to store the checksums for a lot of small areas). Since
CPU cycles and memory are both potentially scarce resources, this
did not seem to us to be a cost-effective way to deal with
problems that arise, say, once per year (this is the first such
problem encountered in a year and a half of running this routing
algorithm). Time and space can be saved by recomputing the
checksum at a somewhat slower frequency, but this is less
reliable, in that it allows a certain number of dropped bits to
"fall between the cracks." It seems likely then that one of the
malfunctioning IMPs had to retransmit update 44 at least twice
(recreating it each time from tabled information), retransmitting
it at least once with the corrupted sequence number 40, and at
least once with the corrupted sequence number 8. This would
cause those three sequence numbers to be extant in the network
simultaneously, even though protocol is supposed to ensure that
this is impossible.
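One form the "maximal reliability" option could take is sketched below. This is hypothetical: the IMPs did not do this, for the cost reasons just given, and the function names are invented for the sketch.

```python
# Hypothetical sketch of checksumming an update's tabled form, so that
# corruption is caught when the packet is re-created for retransmission
# rather than silently propagated with a bad sequence number.

import zlib

def store_update(tables, origin, fields):
    """Record an update's fields along with a checksum of them."""
    blob = repr(fields).encode()
    tables[origin] = (fields, zlib.crc32(blob))

def recreate_packet(tables, origin):
    """Rebuild the update packet, verifying the table entry first."""
    fields, stored_crc = tables[origin]
    if zlib.crc32(repr(fields).encode()) != stored_crc:
        raise ValueError("tabled update corrupted; do not retransmit")
    return fields
```

Even this only detects corruption at re-creation time; bits dropped between checksum recomputations can still "fall between the cracks," which is the tradeoff the text describes.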
Actually, the detection of dropped bits is most properly a
hardware function. The next generation of IMP hardware (the "C30
IMP") will be able to detect and correct all single-bit errors,
and will detect all other bit errors. Uncorrectable bit errors
will cause the IMP to go into its "loader/dumper." (An IMP in
its loader/dumper is not usable for transferring data, and is
officially in the "down" state. However, an IMP in its
loader/dumper is easily controllable from the NCC, and can be
restarted or reloaded without on-site intervention.) Current
hardware does have parity checking (which should detect single
dropped bits), but this feature has had to be turned off since
(a) there are too many spurious parity "errors," i.e., most of
the time when the machines complain of parity errors there don't
really seem to be any, and (b) parity errors cause the machines
to simply halt, rather than go into their loader/dumpers, which
means that on-site intervention is required to restart them.
Pending the introduction of improved hardware, what can be
done to prevent problems like this from recurring in the future?
It is easy to think of many ways of avoiding this particular
problem, especially if one does not consider the problems that
may arise from the "fixes." For example, we might be able to
avoid this sort of problem by spending a lot more CPU cycles on
checksumming, but this may be too expensive because of the side
effects it would introduce. (Also, it is not clear that any
memory checksumming strategy can be totally free of "cracks.") A
very simple and conservative fix to prevent this particular
problem from recurring is to modify clause (a) of the definition
of LATER so that the "<=" is replaced by "<" (strictly less
than). We will implement this fix, but it cannot be guaranteed
that no related problems will ever arise.
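The effect of this fix can be demonstrated directly (illustrative Python; `later_fixed` is a name invented here):

```python
# With "<" in clause (a), 40 and 8 become incomparable: neither is
# LATER than the other, which breaks the three-way cycle that kept the
# corrupted updates circulating.

def later_fixed(n, m):
    return (n > m and n - m < 32) or (n < m and m - n > 32)

assert later_fixed(44, 40)       # normal ordering is unaffected
assert not later_fixed(40, 8)    # 40 - 8 == 32: no longer LATER
assert not later_fixed(8, 40)    # ...and 8 is not LATER than 40 either
```

With no update in the set appearing newer than every other, an IMP's tables settle on one of them instead of being displaced forever.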
What is really needed is not some particular fix to the
routing algorithm, but a more general fix. In some sense, the
problem we saw was not really a routing problem. The routing
code was working correctly, and the routes that were generated
were correct and consistent. The real problem is that a freakish
hardware malfunction caused a high priority process to run wild,
devouring resources needed by other processes, thereby making the
network unusable. The fact that the wild process was the routing
process is incidental. In designing the routing process, we
carefully considered the amount of resource utilization it would
require. By strictly controlling and limiting the rate at which
updates can be generated, we tried to prevent any situation in
which the routing process would make excessive demands on the
system. As we have seen though, even our carefully designed
mechanisms were unable to protect against every possible sort of
hardware failure. We need a better means of detecting that some
high priority process in the IMP, despite all the safeguards we
have built in, is still consuming too many resources. Once this
is detected, the IMP can be automatically placed in its
loader/dumper. In the case under discussion, we would have liked
to have all the IMPs go into their loader/dumpers when the
problem arose. This would have enabled us to re-initialize and
restart all the IMPs much more quickly. (Although restarting
individual IMPs did little good, restarting all the IMPs
simultaneously would have cleared up the problem instantly, since
all routing tables in all IMPs would have been initialized
simultaneously.) It took us no more than an hour to figure out
how to restore the network; several additional hours were
required because it took so long for us to gain control of the
misbehaving IMPs and get them back to normal. A built-in
software alarm system (assuming, of course, that it was not
subject to false alarms) might have enabled us to restore the
network more quickly, significantly reducing the duration of the
outage. This is not to say that a better alarm and control
system could ever be a replacement for careful study and design
which attempts to properly distribute the utilization of
important resources, but only that it is a necessary adjunct, to
handle the cases that will inevitably fall between the cracks of
even the most careful design.
REFERENCES
J.M. McQuillan, I. Richer, and E.C. Rosen, "The New Routing
Algorithm for the ARPANET," IEEE Transactions on Communications,
May 1980.

E.C. Rosen, "The Updating Protocol of ARPANET's New Routing
Algorithm," Computer Networks, February 1980.