|
NWG/RFC# 636 JDB BPC RST DCW3 MLK 23-OCT-75 22:27 30490
TIP/TENEX Reliability Improvements
RFC 636 J. Burchfiel - BBN-TENEX
B. Cosell - BBN-NET
NIC 30490 R. Tomlinson - BBN-TENEX
D. Walden - BBN-NET
10 June 1974
TIP/TENEX Reliability Improvements
During the past months we have felt strong pressure to improve the
reliability of TIP/TENEX network connections, since an improvement in
the reliability of users' connections between TENEXs and TIPs would have
a major impact on the appearance of overall network reliability, owing
to the large number and high visibility of TENEXs and TIPs. Despite the
emphasis on TIP/TENEX interaction, all work done applies equally well to
interactions between Hosts of any type.
The remainder of this RFC gives a sketch of our plan for improving the
reliability of connections between TIPs and TENEXs. Major portions of
this plan have already been implemented (TIP version 322; TENEX version
1.32) and are now undergoing final test prior to release throughout the
network. Completion of the implementation of the plan is expected in
the next quarter.
Our plan for improving the reliability of TIP/TENEX connections is
concerned with obtaining and maintaining TIP/TENEX connections,
gracefully recovering from lost connections, and providing clear
messages to the user whenever the state of his connection changes.
When a TIP user attempts to open a connection to any Host, the Host may
be down. In this case it would be helpful to provide the user with
information about the extent of the Host's unavailability. To facilitate
this, we modified the IMP program to accept and utilize information from
a Host about when the Host will be back up and for what reason it is
down. TENEX is to be modified to supply such information before it goes
down, or through manual means, after it has gone down. When the TIP
user then attempts to connect to the down TENEX, the IMP local to the
TENEX returns the information about why and for how long TENEX will be
down. The TIP is to be modified to report this sort of information to
the user; e.g., "Host unavailable because of hardware maintenance --
expected available Tuesday at 16:30 GMT".
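As a minimal sketch of the reporting step above, the function below assembles the user-visible text from a reason and an expected return time. The function name, its parameters, and the way the IMP's information is represented are all hypothetical; the RFC specifies only the example wording.

```python
from datetime import datetime, timezone
from typing import Optional

# Weekday names are spelled out so the output does not depend on locale.
_DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
         "Saturday", "Sunday"]


def host_down_report(reason: str, back_up: Optional[datetime] = None) -> str:
    """Build the text a TIP might show for a Host that is known to be down.

    `reason` and `back_up` stand in for the information returned by the
    destination IMP; this representation is an assumption of the sketch.
    """
    text = f"Host unavailable because of {reason}"
    if back_up is None:
        # No return time was supplied, so report the reason only.
        return text
    gmt = back_up.astimezone(timezone.utc)
    day = _DAYS[gmt.weekday()]
    return f"{text} -- expected available {day} at {gmt:%H:%M} GMT"
```

Passing a timezone-aware `datetime` keeps the conversion to GMT unambiguous.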
The TIP's logger is presently not reentrant. Thus, no single TIP user
can be allowed to tie up the logger for too long at a time; and the TIP
therefore enforces a timeout of arbitrary length (about 60 seconds) on
logger use. However, a heavily loaded Host cannot be guaranteed always
to respond within 60 seconds to a TIP login request, and at present TIP
users sometimes cannot get connected to a heavily loaded TENEX. To
correct this problem, the TIP logger will be made reentrant and the
timeout on logger use will be eliminated.
One notorious soft spot in the Host/Host protocol which degrades the
reliability of connections is the Host/Host protocol incremental
allocate mechanism. Low frequency software bugs, intermittent hardware
bugs, etc., can lead to the incremental allocates associated with a
connection getting out of synchronization. When this happens it usually
appears to the user as if the connection just "hung up". A slight
addition to the Host/Host protocol to allow connection allocates to be
resynchronized has been designed and implemented for both the TIP and
TENEX.
TENEX has a number of internal consistency checks (called "bughalts")
which occasionally cause TENEX to halt. Frequently, after diagnosis by
system personnel, TENEX can be made to proceed without loss from the
viewpoint of local users. A mechanism is being provided which allows
TENEX to proceed in this case from the point of view of TIP users of
TENEX.
The appropriate mechanism entails the following: TENEX will not drop
its ready line during a bughalt (from which TENEX can usually proceed
successfully), nor will it clear its NCP tables and abort all
connections. Instead, after a bughalt TENEX will: discard the message
it is currently receiving, as the IMP has returned an Incomplete
Transmission to the source for this message; reinitialize the interface
to the IMP; and resynchronize, on all connections possible, Host/Host
protocol allocate inconsistencies due to lost messages, RFNMs etc. The
latter is done with the same mechanism described above. This procedure
is not guaranteed to save all data -- a tiny bit may be lost -- but this
is of secondary importance to maintaining the connection over the TENEX
bughalt.
The TIP user must be kept fully informed as TENEX halts and then
continues. Therefore, the TIP has been modified to report "Host not
responding -- connection suspended" when it senses that TENEX has halted
(it does this by properly interpreting messages returned by the
destination IMP). When TENEX resumes service after proceeding from a
bughalt, the above procedure notifies the TIP that service is restored,
and the TIP has been modified to report "Service resumed" to all users
of that Host.
On the other hand, the service interruption may not be proceedable and
TENEX may have to do a total system reload and restart. In this case
TENEX will clear its NCP connection tables and send a Host/Host protocol
reset command to all other Hosts. On receiving this reset command, the
TIP will report "Host reset -- connection closed" to all users of that
Host with suspended connections. The TIP user can then re-login to the
TENEX or to some other Host.
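The user-visible behaviour described above can be sketched as a small state machine on the TIP side. The class and method names are hypothetical; only the three quoted report strings come from the text, and how the TIP actually detects each event is not modelled.

```python
from enum import Enum


class ConnState(Enum):
    OPEN = "open"
    SUSPENDED = "suspended"
    CLOSED = "closed"


class TipConnection:
    """One user's connection as seen by the TIP (illustrative only)."""

    def __init__(self):
        self.state = ConnState.OPEN
        self.reports = []  # messages shown to the terminal user, in order

    def host_not_responding(self):
        # The destination IMP indicates that the Host has halted.
        if self.state is ConnState.OPEN:
            self.state = ConnState.SUSPENDED
            self.reports.append("Host not responding -- connection suspended")

    def service_restored(self):
        # The Host proceeded from its bughalt and resynchronized.
        if self.state is ConnState.SUSPENDED:
            self.state = ConnState.OPEN
            self.reports.append("Service resumed")

    def host_reset(self):
        # The Host reloaded and sent a Host/Host reset command.
        if self.state is ConnState.SUSPENDED:
            self.state = ConnState.CLOSED
            self.reports.append("Host reset -- connection closed")
```

Each event is ignored unless the connection is in the state the text associates with it, so a reset is reported only to users whose connections were suspended.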
Of course, the user may not have the patience to wait for service to
resume after a TENEX bughalt. Instead, he may unilaterally choose to
connect to some other Host, ignoring the previously suspended
connection. If TENEX is then able to proceed, its NCP will still think
its connection to the TIP is good and suitable for use. Thus, we have a
connection which the TIP thinks is closed and TENEX thinks is open, a
phenomenon known as the "half-closed connection". An automatic
procedure for cleanly completing the closing of such a connection has
been specified and implemented for the TIP and TENEX.
Since TENEX will maintain connections across service interruptions, the
TIP user will be required to take the security precaution of telling the
TIP to "forget" his suspended connection before abandoning his terminal.
The command @H 0 (for example) will guarantee that his connection will
not be reestablished on resumption of service. Otherwise, his job
would be left at the mercy of anyone who acquires that terminal.
An appendix follows which describes the Host/Host protocol changes made.
These changes are backward compatible (with the exception that Hosts
which have not implemented these changes will sometimes receive
unrecognizable Host/Host protocol commands which they presumably discard
without suffering harm). These protocol changes are ad hoc in nature
but in light of their backward compatibility and potential utility, ARPA
okayed their addition to the TIP and TENEX NCPs without (we believe) any
implication that other Hosts have to implement them (although we would
encourage their widespread implementation).
Appendix - Ad Hoc Change to Host-Host Protocol
A.1 Introduction
The current Host-Host protocol (NIC #8246) contains no provisions
for resynchronizing the status information kept at the two ends of
each connection. In particular, if either host suffers a service
interruption, or if a control message is lost or corrupted in an
interface or in the subnet, the status information at the two ends
of the connection will be inconsistent.
Since the current protocol provides no way to correct this
condition, the NCPs at the two ends stay "confused" forever. An
occasional frustrating symptom of this effect is the "lost
allocate" phenomenon, where the receiving NCP believes that it has
bit and message allocations outstanding, while the sending NCP
believes that it does not have any allocation. As a result,
information flow over that connection can never be restarted.
Use of the Host-Host RST (reset) command is inappropriate here, as
it destroys all connections between the two hosts. What is needed
is a way to resynchronize only the affected connection without
disturbing any others.
A second troublesome symptom of inconsistency in status
information is the "half-closed" connection: after a service
interruption or network partitioning, one NCP may believe that a
connection is still open, while the other believes that the
connection is closed (does not exist). When such an inconsistency
is discovered, the "open" end of the connection should be closed.
A.2 The RAR, RAS and RAP commands
To achieve resynchronization of allocation, we add the following
three commands to the host-host protocol.
            8 bits    8 bits
          -------------------
          !        !        !
     16   !  RAR   !  link  !
          !        !        !
          -------------------
          Reset Allocation by Receiver
            8 bits    8 bits
          -------------------
          !        !        !
     17   !  RAS   !  link  !
          !        !        !
          -------------------
          Reset Allocation by Sender
            8 bits    8 bits
          -------------------
          !        !        !
     20   !  RAP   !  link  !
          !        !        !
          -------------------
          Reset Allocation Please
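Each of the three commands is an 8-bit command code followed by an 8-bit link number. The sketch below packs and unpacks that two-byte layout. It assumes the printed codes are octal, which the sequence 16, 17, 20 suggests but the text does not state; the function names are hypothetical.

```python
import struct
from typing import Tuple

# Command codes as printed in the diagrams.
# ASSUMPTION: the printed values are octal, hence the 0o prefix.
RAR = 0o16  # Reset Allocation by Receiver
RAS = 0o17  # Reset Allocation by Sender
RAP = 0o20  # Reset Allocation Please


def encode_command(opcode: int, link: int) -> bytes:
    """Pack a command as two unsigned bytes: opcode, then link."""
    if not 0 <= opcode <= 0xFF or not 0 <= link <= 0xFF:
        raise ValueError("opcode and link must each fit in 8 bits")
    return struct.pack("BB", opcode, link)


def decode_command(data: bytes) -> Tuple[int, int]:
    """Unpack the first two bytes of `data` into (opcode, link)."""
    opcode, link = struct.unpack("BB", data[:2])
    return opcode, link
```

For example, `encode_command(RAS, 5)` yields the two bytes a sending Host would place in a control message to request resynchronization of link 5.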
The RAS command is sent from the Host sending on "link" to the
Host receiving on "link". This command may be sent whenever the
sending Host desires to resynch the status information associated
with the connection (and doesn't have a message in transit through
the network). Some circumstances in which the sending Host may
choose to do this are:
   1) After a timeout when there is traffic to move but no
      allocation (assumes that an allocation has been lost);
   2) When an inconsistent event occurs associated with that
      connection (e.g. an outstanding allocation in excess of 2^32
      bits or 2^16 messages);
   3) After the sending host has suffered an interruption of
      network service;
   4) In response to a RAP (see below).
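The four circumstances can be sketched as a single predicate over the sender's view of a connection. The record fields, the function name, and the 30-second starvation timeout are assumptions made for illustration; the text leaves the timeout and the sender's bookkeeping unspecified.

```python
from dataclasses import dataclass

MAX_BITS = 2**32  # bit bound quoted in circumstance 2
MAX_MSGS = 2**16  # message bound quoted in circumstance 2


@dataclass
class SenderView:
    """What the sending NCP knows about one connection (hypothetical)."""
    bits_allocated: int
    msgs_allocated: int
    has_traffic: bool = False         # data is queued for this link
    seconds_starved: float = 0.0      # time spent waiting for allocation
    message_in_transit: bool = False  # a message is still in the network
    restarted: bool = False           # network service was interrupted
    rap_received: bool = False        # the receiver sent a RAP


def should_send_ras(s: SenderView, starve_timeout: float = 30.0) -> bool:
    """Return True if the sender may reasonably issue a RAS now."""
    if s.message_in_transit:
        return False  # RAS is sent only with no message in transit
    starved = s.has_traffic and (s.bits_allocated == 0 or s.msgs_allocated == 0)
    if starved and s.seconds_starved >= starve_timeout:
        return True   # circumstance 1: an allocate was probably lost
    if s.bits_allocated > MAX_BITS or s.msgs_allocated > MAX_MSGS:
        return True   # circumstance 2: impossible allocation
    return s.restarted or s.rap_received  # circumstances 3 and 4
```

Because the sending Host alone decides when to resynchronize, a predicate like this would be consulted locally and never needs agreement from the other end.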
The RAR command is sent from the Host receiving on "link" to the
Host sending on "link" in response to an RAS. It marks the
completion of the connection resynchronization. When the RAR is
returned the connection is in the known state of having no
messages in transit in either direction and the allocations are
zero. The receiving Host may then start afresh with a new
allocation and normal message transmission can proceed. Since the
RAR may be sent ONLY in response to an RAS, there are no races in
the resynchronization. All of the initiative lies with the
sending Host.
If the receiving Host detects an anomalous situation, however,
there is no way to inform the sending Host that a
resynchronization is desirable. For this purpose, the RAP command
is provided. It constitutes a "suggestion" on the part of the
receiving Host that the sending Host resynchronize; the sending
Host is free to honor it or not as it sees fit. Since there is no
obligatory response to a RAP, the receiving Host may send them as
frequently as it chooses and no harm can occur. For example, if a
message in excess of the allocate arrives, the receiving Host
might send RAPs every few seconds until the sending Host replies,
with no fear of races if one or more RAPs pass a RAS in the
network.
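A receiving Host that notices a message in excess of its allocation might pace its RAPs as sketched below. The class, its fields, and the 5-second interval are hypothetical; the text says only "every few seconds". The clock is passed in as a plain number so the logic can be followed without real timers.

```python
class ReceivingEnd:
    """Receiver-side bookkeeping for one link (illustrative only)."""

    def __init__(self, link, rap_interval=5.0):
        self.link = link
        self.bits = 0              # bit allocation still outstanding
        self.msgs = 0              # message allocation still outstanding
        self.rap_interval = rap_interval
        self.anomaly = False       # set when resynchronization is wanted
        self.last_rap = None       # time the most recent RAP was sent

    def on_message(self, nbits):
        # A message in excess of the allocation is the anomaly to report.
        if self.msgs < 1 or nbits > self.bits:
            self.anomaly = True
            return
        self.msgs -= 1
        self.bits -= nbits

    def on_ras(self):
        # The sender has started a resynchronization; stop suggesting one.
        self.anomaly = False
        self.last_rap = None

    def poll(self, now):
        """Return "RAP" if a suggestion is due at time `now`, else None."""
        if not self.anomaly:
            return None
        if self.last_rap is not None and now - self.last_rap < self.rap_interval:
            return None
        self.last_rap = now
        return "RAP"
```

Repeating the RAP is harmless because the sending Host is under no obligation to answer any particular one.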
A.3 Resynchronization Procedure
The resynchronization sequence below may be initiated only by the
sender either for internally generated reasons or upon the receipt
of a RAP.
a) Sender - decision to resynch
   1) Set state to "Wait-for-RAR" (Defer transmission of
      message.)
   2) Wait until no RFNM outstanding
   3) Send RAS
   4) Zero allocation
   5) Ignore allocates until RAR received
   6) Set state to "Open" (Resume normal message transmission
      subject to flow control.)
b) Receiver - receipt of RAS
   1) Send RAR
   2) Zero allocation
   3) Send a new allocation
When the sender is in the "Wait-for-RAR" state it is not permitted
to send new regular messages. (Note that steps 4 and 5 will
ensure this in the normal course of events.) With the return of
the RAR the pipeline contains no messages and no allocates, and the
outstanding allocation variables at both ends are forced into
agreement by setting them both to zero. The receiver will then
reconsider bit and message allocation, and send an ALL command for
any allocation it cares to make.
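The sender and receiver steps of the resynchronization procedure can be sketched as two small classes whose comments cite the step numbers. The class names, the `outbox` lists standing in for the network, and the attribute names are hypothetical, and delivery is assumed to preserve order.

```python
class ResyncSender:
    def __init__(self, msgs=0, bits=0):
        self.state = "Open"
        self.msgs, self.bits = msgs, bits
        self.rfnm_outstanding = False
        self.ras_sent = False
        self.outbox = []  # commands handed to the network, in order

    def begin_resync(self):
        self.state = "Wait-for-RAR"                  # sender step 1
        self._maybe_send_ras()

    def on_rfnm(self):
        self.rfnm_outstanding = False
        self._maybe_send_ras()

    def _maybe_send_ras(self):
        waiting = self.state == "Wait-for-RAR" and not self.ras_sent
        if waiting and not self.rfnm_outstanding:    # sender step 2
            self.outbox.append("RAS")                # sender step 3
            self.ras_sent = True
            self.msgs = self.bits = 0                # sender step 4

    def on_all(self, msgs, bits):
        if self.state == "Wait-for-RAR":
            return                                   # sender step 5
        self.msgs += msgs
        self.bits += bits

    def on_rar(self):
        if self.ras_sent:
            self.ras_sent = False
            self.state = "Open"                      # sender step 6


class ResyncReceiver:
    def __init__(self, msgs=0, bits=0):
        self.msgs, self.bits = msgs, bits
        self.outbox = []

    def on_ras(self, new_msgs, new_bits):
        self.outbox.append("RAR")                    # receiver step 1
        self.msgs = self.bits = 0                    # receiver step 2
        self.msgs, self.bits = new_msgs, new_bits    # receiver step 3
        self.outbox.append(("ALL", new_msgs, new_bits))
```

In a "lost allocate" scenario the receiver believes it has allocation outstanding while the sender has none; after the exchange both ends hold exactly the allocation carried by the new ALL.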
A.4 The Problem of Half-closed Connections
The above procedures provide a way to resynchronize a connection
after a brief lapse by a communications component, which results
in lost messages or allocates for an open connection.
A longer and more severe interruption of communication may result
from a partitioning of the subnet or from a service interruption
on one of the communicating hosts. It is undesirable to tie up
resources indefinitely under such circumstances, so the user is
provided with the option of freeing up these resources (including
himself) by unilaterally dissolving the connection. Here
"unilaterally" means sending the CLS command and closing the
connection without receiving the CLS acknowledgement. Note that
this is legal only if the subnet indicates that the destination is
dead.
When service is restored after such an interruption, the status
information at the two ends of the connection is out of
synchronization. One end believes that the connection is open,
and may proceed to use the connection. The disconnecting end
believes that the connection is closed (does not exist), and may
proceed to re-initialize communication by opening a new connection
(RTS or STR command) using the same socket pair or same link.
The resynchronization needed here is to properly close the open
end of the connection when the inconsistency is detected. We will
accomplish this by specifying consistency checks and adding a new
pair of commands.
A.5 The NXS and NXR Commands
The "missing CLS" situation described above can manifest itself in
two ways. The first way involves action taken by the NCP at the
"open" end of the connection. It may continue to send regular
messages on the link of the half-closed connection, or control
messages referencing its link. The closed end should respond with
an NXS if the message referred to a non-existent transmit link
(e.g. was an ALL) or NXR if the message referred to a non-existent
receive link (e.g. a data message). On receipt of such an NXS or
NXR message, the NCP at the "open" end should close the connection
by modifying its tables (without sending any CLS command) thereby
bringing both ends into agreement.
            8 bits    8 bits
          -------------------
          !        !        !
     21   !  NXR   !  link  !
          !        !        !
          -------------------
          Non-existent Receive Link
            8 bits    8 bits
          -------------------
          !        !        !
     22   !  NXS   !  link  !
          !        !        !
          -------------------
          Non-existent Send Link
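The exchange described in this section can be sketched from both ends. The sets of link numbers standing in for NCP tables, the function names, and the message kinds "ALL" and "DATA" are hypothetical simplifications. The codes are assumed to be octal as printed, and the mapping of each reply to the stale send or receive side at the "open" end is one reading of the text.

```python
# ASSUMPTION: the printed command codes are octal.
NXR = 0o21  # Non-existent Receive Link
NXS = 0o22  # Non-existent Send Link


def closed_end_reply(kind, link, send_links, recv_links):
    """Decide what the "closed" end answers to an incoming message.

    kind is "ALL" (refers to our transmit link) or "DATA" (refers to
    our receive link). Returns (opcode, link), or None if the link exists.
    """
    if kind == "ALL" and link not in send_links:
        return (NXS, link)
    if kind == "DATA" and link not in recv_links:
        return (NXR, link)
    return None


def open_end_on_reply(opcode, link, send_links, recv_links):
    """Close the stale half at the "open" end; nothing is sent back."""
    if opcode == NXR:
        # The far end has no receive link, so our send side is stale.
        send_links.discard(link)
    elif opcode == NXS:
        # The far end has no send link, so our receive side is stale.
        recv_links.discard(link)
    return []  # deliberately empty: no CLS is sent
```

Only local tables change at the "open" end, which is what brings the two ends into agreement without a further exchange.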
A.6 Consistency Checks
A second way this inconsistency can show up involves actions
initiated by the NCP at the "closed" end. It may (thinking the
connection is closed) send an STR or RTS to reopen the connection.
The NCP at the "open" end should detect the inconsistency when it
receives such an RTS or STR command, because it specifies the same
socket pair as an existing open connection, or, in the case of an
RTS, the same link. In this case, the NCP at the "open" end
should close the connection (without sending any CLS command) to
bring the two ends into agreement before responding to the
RTS/STR.
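The check at the "open" end can be sketched as follows. The table, assumed to hold the connections to one remote Host as a mapping from socket pair to link, and the function name are hypothetical simplifications of real NCP tables.

```python
def stale_connections_on_open(table, cmd, local_socket, foreign_socket,
                              link=None):
    """Close connections that an incoming RTS or STR shows to be stale.

    `table` maps (local_socket, foreign_socket) -> link for one remote
    Host. Matching entries are removed without sending any CLS, and the
    removed keys are returned so the caller can then answer the request.
    """
    stale = []
    key = (local_socket, foreign_socket)
    if key in table:
        stale.append(key)  # same socket pair as an open connection
    if cmd == "RTS" and link is not None:
        # Only an RTS names a link, so only an RTS can collide on one.
        for k, used in table.items():
            if used == link and k not in stale:
                stale.append(k)
    for k in stale:
        del table[k]
    return stale
```

After this returns, the request is handled as if the old connection had been closed normally.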
A.7 Conclusion
The scheme presented in Section A.2 to resynchronize allocation
has one very important property: the data stream is preserved
through the exchange. Since no data is lost, it is safe to
initiate resynchronization from either end at any time. When in
doubt, resynchronize.
The consistency checks for RTS and STR, and the NXR and NXS
commands provide the synchronization needed to complete the
closing of "half-closed" connections.
The protocol changes above
|