RFC: 817
MODULARITY AND EFFICIENCY IN PROTOCOL IMPLEMENTATION
David D. Clark
MIT Laboratory for Computer Science
Computer Systems and Communications Group
July, 1982
1. Introduction
Many protocol implementers have made the unpleasant discovery that
their packages do not run quite as fast as they had hoped. The blame
for this widely observed problem has been attributed to a variety of
causes, ranging from details in the design of the protocol to the
underlying structure of the host operating system. This RFC will
discuss some of the commonly encountered reasons why protocol
implementations seem to run slowly.
Experience suggests that one of the most important factors in
determining the performance of an implementation is the manner in which
that implementation is modularized and integrated into the host
operating system. For this reason, it is useful to discuss the question
of how an implementation is structured at the same time that we consider
how it will perform. In fact, this RFC will argue that modularity is
one of the chief villains in attempting to obtain good performance, so
that the designer is faced with a delicate and inevitable tradeoff
between good structure and good performance. Further, the single factor
which most strongly determines how well this conflict can be resolved is
not the protocol but the operating system.
2. Efficiency Considerations
There are many aspects to efficiency. One aspect is sending data
at minimum transmission cost, which is a critical aspect of common
carrier communications, if not in local area network communications.
Another aspect is sending data at a high rate, which may not be possible
at all if the net is very slow, but which may be the one central design
constraint when taking advantage of a local net with high raw bandwidth.
The final consideration is doing the above with minimum expenditure of
computer resources. This last may be necessary to achieve high speed,
but in the case of the slow net may be important only in that the
resources used up, for example cpu cycles, are costly or otherwise
needed. It is worth pointing out that these different goals often
conflict; for example it is often possible to trade off efficient use of
the computer against efficient use of the network. Thus, there may be
no such thing as a successful general purpose protocol implementation.
The simplest measure of performance is throughput, measured in bits
per second. It is worth doing a few simple computations in order to get
a feeling for the magnitude of the problems involved. Assume that data
is being sent from one machine to another in packets of 576 bytes, the
maximum generally acceptable internet packet size. Allowing for header
overhead, this packet size permits 4288 bits in each packet. If a
useful throughput of 10,000 bits per second is desired, then a data
bearing packet must leave the sending host about every 430 milliseconds,
a little over two per second. This is clearly not difficult to achieve.
However, if one wishes to achieve 100 kilobits per second throughput,
the packet must leave the host every 43 milliseconds, and to achieve one
megabit per second, which is not at all unreasonable on a high-speed
local net, the packets must be spaced no more than 4.3 milliseconds apart.
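These figures are easy to reproduce. The following fragment, in C,
performs the computation just described; the 4288-bit figure is the
usable payload per packet derived above.

    #include <stdio.h>

    int main(void)
    {
        const double data_bits = 4288.0;              /* usable bits per 576-byte packet */
        const double rates[] = { 10e3, 100e3, 1e6 };  /* target throughputs, bits/sec */

        for (int i = 0; i < 3; i++) {
            double interval_ms = 1000.0 * data_bits / rates[i];
            printf("%8.0f bits/sec: one packet every %5.1f ms\n",
                   rates[i], interval_ms);
        }
        return 0;
    }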
These latter numbers are a slightly more alarming goal on which to set
one's sights. Many operating systems take a substantial fraction of
a millisecond just to service an interrupt. If the protocol has been
structured as a process, it is necessary to go through a process
scheduling before the protocol code can even begin to run. If any piece
of a protocol package or its data must be fetched from disk, real time
delays of between 30 to 100 milliseconds can be expected. If the
protocol must compete for cpu resources with other processes of the
system, it may be necessary to wait a scheduling quantum before the
protocol can run. Many systems have a scheduling quantum of 100
milliseconds or more. Considering these sorts of numbers, it becomes
immediately clear that the protocol must be fitted into the operating
system in a thorough and effective manner if anything like reasonable
throughput is to be achieved.
There is one obvious conclusion immediately suggested by even this
simple analysis. Except in very special circumstances, when many
packets are being processed at once, the cost of processing a packet is
dominated by factors, such as cpu scheduling, which are independent of
the packet size. This suggests two general rules which any
implementation ought to obey. First, send data in large packets.
Obviously, if processing time per packet is a constant, then throughput
will be directly proportional to the packet size. Second, never send an
unneeded packet. Unneeded packets use up just as many resources as a
packet full of data, but perform no useful function. RFC 813, "Window
and Acknowledgement Strategy in TCP", discusses one aspect of reducing
the number of packets sent per useful data byte. This document will
mention other attacks on the same problem.
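The first rule can be made concrete with a small computation. The
sketch below assumes, purely for illustration, a fixed processing cost
of one millisecond per packet; under that assumption the attainable
throughput is simply proportional to packet size.

    #include <stdio.h>

    int main(void)
    {
        const double per_packet_cost = 0.001;   /* assumed: 1 ms fixed cost per packet */
        const int sizes[] = { 64, 576, 1500 };  /* data bytes per packet */

        for (int i = 0; i < 3; i++) {
            /* constant cost per packet implies throughput proportional to size */
            printf("%5d-byte packets: at most %9.0f bits/sec\n",
                   sizes[i], sizes[i] * 8.0 / per_packet_cost);
        }
        return 0;
    }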
The above analysis suggests that there are two main parts to the
problem of achieving good protocol performance. The first has to do
with how the protocol implementation is integrated into the host
operating system. The second has to do with how the protocol package
itself is organized internally. This document will consider each of
these topics in turn.
3. The Protocol vs. the Operating System
There are normally three reasonable ways in which to add a protocol
to an operating system. The protocol can be in a process that is
provided by the operating system, or it can be part of the kernel of the
operating system itself, or it can be put in a separate communications
processor or front end machine. This decision is strongly influenced by
details of hardware architecture and operating system design; each of
these three approaches has its own advantages and disadvantages.
The "process" is the abstraction which most operating systems use
to provide the execution environment for user programs. A very simple
path for implementing a protocol is to obtain a process from the
operating system and implement the protocol to run in it.
Superficially, this approach has a number of advantages. Since
modifications to the kernel are not required, the job can be done by
someone who is not an expert in the kernel structure. Since it is often
impossible to find somebody who is experienced both in the structure of
the operating system and the structure of the protocol, this path, from
a management point of view, is often extremely appealing. Unfortunately,
putting a protocol in a process has a number of disadvantages, related
to both structure and performance. First, as was discussed above,
process scheduling can be a significant source of real-time delay.
There is not only the actual cost of going through the scheduler, but
the problem that the operating system may not have the right sort of
priority tools to bring the process into execution quickly whenever
there is work to be done.
Structurally, the difficulty with putting a protocol in a process
is that the protocol may be providing services, for example support of
data streams, which are normally obtained by going to special kernel
entry points. Depending on the generality of the operating system, it
may be impossible to take a program which is accustomed to reading
through a kernel entry point, and redirect it so it is reading the data
from a process. The most extreme example of this problem occurs when
implementing server telnet. In almost all systems, the device handler
for the locally attached teletypes is located inside the kernel, and
programs read and write from their teletype by making kernel calls. If
server telnet is implemented in a process, it is then necessary to take
the data streams provided by server telnet and somehow get them back
down inside the kernel so that they mimic the interface provided by
local teletypes. It is usually the case that special kernel
modification is necessary to achieve this structure, which somewhat
defeats the benefit of having removed the protocol from the kernel in
the first place.
Clearly, then, there are advantages to putting the protocol package
in the kernel. Structurally, it is reasonable to view the network as a
device, and device drivers are traditionally contained in the kernel.
Presumably, the problems associated with process scheduling can be
sidestepped, at least to a certain extent, by placing the code inside the
kernel. And it is obviously easier to make the server telnet channels
mimic the local teletype channels if they are both realized in the same
level in the kernel.
However, implementation of protocols in the kernel has its own set
of pitfalls. First, network protocols have a characteristic which is
shared by almost no other device: they require rather complex actions
to be performed as a result of a timeout. The problem with this
requirement is that the kernel often has no facility by which a program
can be brought into execution as a result of the timer event. What is
really needed, of course, is a special sort of process inside the
kernel. Most systems lack this mechanism. Failing that, the only
execution mechanism available is to run at interrupt time.
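To make the missing facility concrete, the following C sketch shows a
minimal "callout" table of the sort a kernel might provide for timer
driven protocol actions; the names and fixed table size are invented
for illustration. If the expiry work is done directly in callout_run,
invoked from the clock interrupt, one inherits exactly the
interrupt-time problems discussed next; the alternative is for
callout_run merely to awaken a kernel process.

    #define NCALLOUT 32

    struct callout {
        long ticks;               /* clock ticks until expiry; 0 means slot free */
        void (*func)(void *);     /* protocol action to run, e.g. retransmit */
        void *arg;                /* typically the connection block */
    };

    static struct callout callouts[NCALLOUT];

    /* Schedule a protocol action; called from protocol code. */
    int callout_set(long ticks, void (*func)(void *), void *arg)
    {
        for (int i = 0; i < NCALLOUT; i++) {
            if (callouts[i].ticks == 0) {
                callouts[i].ticks = ticks;
                callouts[i].func = func;
                callouts[i].arg = arg;
                return 0;
            }
        }
        return -1;                /* table full */
    }

    /* Called once per clock tick. */
    void callout_run(void)
    {
        for (int i = 0; i < NCALLOUT; i++) {
            if (callouts[i].ticks > 0 && --callouts[i].ticks == 0)
                callouts[i].func(callouts[i].arg);
        }
    }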
There are substantial drawbacks to implementing a protocol to run
at interrupt time. First, the actions performed may be somewhat complex
and time consuming, compared to the maximum amount of time that the
operating system is prepared to spend servicing an interrupt. Problems
can arise if interrupts are masked for too long. This is particularly
bad when running as a result of a clock interrupt, which can imply that
the clock interrupt is masked. Second, the environment provided by an
interrupt handler is usually extremely primitive compared to the
environment of a process. There are usually a variety of system
facilities which are unavailable while running in an interrupt handler.
The most important of these is the ability to suspend execution pending
the arrival of some event or message. It is a cardinal rule of almost
every known operating system that one must not invoke the scheduler
while running in an interrupt handler. Thus, the programmer who is
forced to implement all or part of his protocol package as an interrupt
handler must be the best sort of expert in the operating system
involved, and must be prepared for development sessions filled with
obscure bugs which crash not just the protocol package but the entire
operating system.
A final problem with processing at interrupt time is that the
system scheduler has no control over the percentage of system time used
by the protocol handler. If a large number of packets arrive, from a
foreign host that is either malfunctioning or fast, all of the time may
be spent in the interrupt handler, effectively killing the system.
There are other problems associated with putting protocols into an
operating system kernel. The simplest problem often encountered is that
the kernel address space is simply too small to hold the piece of code
in question. This is a rather artificial sort of problem, but it is a
severe problem nonetheless on many machines. It is an appallingly
unpleasant experience to do an implementation with the knowledge that
for every byte of new feature put in one must find some other byte of
old feature to throw out. It is hopeless to expect an effective and
general implementation under this kind of constraint. Another problem
is that the protocol package, once it is thoroughly entwined in the
operating system, may need to be redone every time the operating system
changes. If the protocol and the operating system are not maintained by
the same group, this makes maintenance of the protocol package a
perpetual headache.
The third option for protocol implementation is to take the
protocol package and move it outside the machine entirely, on to a
separate processor dedicated to this kind of task. Such a machine is
often described as a communications processor or a front-end processor.
There are several advantages to this approach. First, the operating
system on the communications processor can be tailored for precisely
this kind of task. This makes the job of implementation much easier.
Second, one does not need to redo the task for every machine to which
the protocol is to be added. It may be possible to reuse the same
front-end machine on different host computers. Since the task need not
be done as many times, one might hope that more attention could be paid
to doing it right. Given a careful implementation in an environment
which is optimized for this kind of task, the resulting package should
turn out to be very efficient. Unfortunately, there are also problems
with this approach. There is, of course, a financial problem associated
with buying an additional computer. In many cases, this is not a
problem at all since the cost is negligible compared to what the
programmer would cost to do the job in the mainframe itself. More
fundamentally, the communications processor approach does not completely
sidestep any of the problems raised above. The reason is that the
communications processor, since it is a separate machine, must be
attached to the mainframe by some mechanism. Whatever that mechanism,
code is required in the mainframe to deal with it. It can be argued
that the program to deal with the communications processor is simpler
than the program to implement the entire protocol package. Even if that
is so, the communications processor interface package is still a
protocol in nature, with all of the same structural problems. Thus, all
of the issues raised above must still be faced. In addition to those
problems, there are some other, more subtle problems associated with an
outboard implementation of a protocol. We will return to these problems
later.
There is a way of attaching a communications processor to a
mainframe host which sidesteps all of the mainframe implementation
problems, which is to use some preexisting interface on the host machine
as the port by which a communications processor is attached. This
strategy is often used as a last stage of desperation when the software
on the host computer is so intractable that it cannot be changed in any
way. Unfortunately, it is almost inevitably the case that all of the
available interfaces are totally unsuitable for this purpose, so the
result is unsatisfactory at best. The most common way in which this
form of attachment occurs is when a network connection is being used to
mimic local teletypes. In this case, the front-end processor can be
attached to the mainframe by simply providing a number of wires out of
the front-end processor, each corresponding to a connection, which are
plugged into teletype ports on the mainframe computer. (Because of the
appearance of the physical configuration which results from this
arrangement, Michael Padlipsky has described this as the "milking
machine" approach to computer networking.) This strategy solves the
immediate problem of providing remote access to a host, but it is
extremely inflexible. The channels being provided to the host are
restricted by the host software to one purpose only, remote login. It
is impossible to use them for any other purpose, such as file transfer
or sending mail, so the host is integrated into the network environment
in an extremely limited and inflexible manner. If this is the best that
can be done, then it should be tolerated. Otherwise, implementors
should be strongly encouraged to take a more flexible approach.
4. Protocol Layering
The previous discussion suggested that there was a decision to be
made as to where a protocol ought to be implemented. In fact, the
decision is much more complicated than that, for the goal is not to
implement a single protocol, but to implement a whole family of protocol
layers, starting with a device driver or local network driver at the
bottom, then IP and TCP, and eventually reaching the application
specific protocol, such as Telnet, FTP and SMTP on the top. Clearly,
the bottommost of these layers is somewhere within the kernel, since the
physical device driver for the net is almost inevitably located there.
Equally clearly, the top layers of this package, which provide the user
his ability to perform the remote login function or to send mail, are
not entirely contained within the kernel. Thus, the question is not
whether the protocol family shall be inside or outside the kernel, but
how it shall be sliced in two between that part inside and that part
outside.
Since protocols come nicely layered, an obvious proposal is that
one of the layer interfaces should be the point at which the inside and
outside components are sliced apart. Most systems have been implemented
in this way, and many have been made to work quite effectively. One
obvious place to slice is at the upper interface of TCP. Since TCP
provides a bidirectional byte stream, which is somewhat similar to the
I/O facility provided by most operating systems, it is possible to make
the interface to TCP almost mimic the interface to other existing
devices. Except in the matter of opening a connection, and dealing with
peculiar failures, the software using TCP need not know that it is a
network connection, rather than a local I/O stream that is providing the
communications function. This approach does put TCP inside the kernel,
which raises all the problems addressed above. It also raises the
problem that the interface to the IP layer can, if the programmer is not
careful, become excessively buried inside the kernel. It must be
remembered that things other than TCP are expected to run on top of IP.
The IP interface must be made accessible, even if TCP sits on top of it
inside the kernel.
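The sort of structure involved can be sketched as a device-switch
table, in which a TCP connection presents the same read and write
entry points as a local device; the structure and names below are
invented for illustration.

    /* A switch entry: TCP provides the same read/write shape as a
     * local device, so callers need not know which they have. */
    struct iosw {
        int (*read)(void *obj, char *buf, int len);
        int (*write)(void *obj, const char *buf, int len);
    };

    struct channel {
        const struct iosw *ops;   /* e.g. tty_iosw or tcp_iosw */
        void *obj;                /* device state or TCP connection block */
    };

    /* Kernel entry points used by all programs; only open() need know
     * whether the channel is a teletype or a network connection. */
    int chan_read(struct channel *ch, char *buf, int len)
    {
        return ch->ops->read(ch->obj, buf, len);
    }

    int chan_write(struct channel *ch, const char *buf, int len)
    {
        return ch->ops->write(ch->obj, buf, len);
    }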
Another obvious place to slice is above Telnet. The advantage of
slicing above Telnet is that it solves the problem of having remote
login channels emulate local teletype channels. The disadvantage of
putting Telnet into the kernel is that the amount of code which has now
been included there is getting remarkably large. In some early
implementations, the size of the network package, when one includes
protocols at the level of Telnet, rivals the size of the rest of the
supervisor. This leads to vague feelings that all is not right.
Any attempt to slice through a lower layer boundary, for example
between internet and TCP, reveals one fundamental problem. The TCP
layer, as well as the IP layer, performs a demultiplexing function on
incoming datagrams. Until the TCP header has been examined, it is not
possible to know for which user the packet is ultimately destined.
Therefore, if TCP, as a whole, is moved outside the kernel, it is
necessary to create one separate process called the TCP process, which
performs the TCP multiplexing function, and probably all of the rest of
TCP processing as well. This means that incoming data destined for a
user process involves not just a scheduling of the user process, but
scheduling the TCP process first.
This suggests an alternative structuring strategy which slices
through the protocols, not along an established layer boundary, but
along a functional boundary having to do with demultiplexing. In this
approach, certain parts of IP and certain parts of TCP are placed in the
kernel. The amount of code placed there is sufficient so that when an
incoming datagram arrives, it is possible to know for which process that
datagram is ultimately destined. The datagram is then routed directly
to the final process, where additional IP and TCP processing is
performed on it. This removes from the kernel any requirement for timer
based actions, since they can be done by the process provided by the
user. This structure has the additional advantage of reducing the
amount of code required in the kernel, so that it is suitable for
systems where kernel space is at a premium. RFC 814, titled "Names,
Addresses, Ports, and Routes," discusses this rather orthogonal slicing
strategy in more detail.
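A sketch of the demultiplexing code this strategy leaves in the kernel
follows; it examines only those header fields needed to identify the
destination process, deferring checksums, reassembly, sequencing, and
timers to that process. The connection table and names are invented
for illustration.

    #include <stdint.h>

    struct conn {
        uint32_t local_addr, remote_addr;   /* internet addresses */
        uint16_t local_port, remote_port;   /* TCP ports */
        int      pid;                       /* process holding the rest of TCP/IP */
    };

    #define NCONN 64
    static struct conn conns[NCONN];
    static int nconns;

    /* Return the process for which this datagram is destined, or -1. */
    int demux(const uint8_t *ip, int len)
    {
        if (len < 20)
            return -1;
        int ihl = (ip[0] & 0x0f) * 4;          /* IP header length, bytes */
        if (ip[9] != 6 || len < ihl + 20)      /* protocol 6 is TCP */
            return -1;

        uint32_t src = (uint32_t)ip[12] << 24 | (uint32_t)ip[13] << 16
                     | (uint32_t)ip[14] << 8  | ip[15];
        uint32_t dst = (uint32_t)ip[16] << 24 | (uint32_t)ip[17] << 16
                     | (uint32_t)ip[18] << 8  | ip[19];
        const uint8_t *tcp = ip + ihl;
        uint16_t sport = (uint16_t)(tcp[0] << 8 | tcp[1]);
        uint16_t dport = (uint16_t)(tcp[2] << 8 | tcp[3]);

        for (int i = 0; i < nconns; i++) {
            struct conn *c = &conns[i];
            if (c->local_addr == dst && c->remote_addr == src &&
                c->local_port == dport && c->remote_port == sport)
                return c->pid;      /* caller queues the datagram there */
        }
        return -1;                  /* unknown connection */
    }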
A related discussion of protocol layering and multiplexing can be
found in Cohen and Postel [1].
5. Breaking Down the Barriers
In fact, the implementor should be sensitive to the possibility of
even more peculiar slicing strategies in dividing up the various
protocol layers between the kernel and the one or more user processes.
The result of the strategy proposed above was that part of TCP should
execute in the process of the user. In other words, instead of having
one TCP process for the system, there is one TCP process per connection.
Given this architecture, it is no longer necessary to imagine that all
of the TCPs are identical. One TCP could be optimized for high
throughput applications, such as file transfer. Another TCP could be
optimized for small, low-delay applications such as Telnet. In fact, it
would be possible to produce a TCP which was somewhat integrated with
the Telnet or FTP on top of it. Such an integration is extremely
important, for it can lead to a kind of efficiency which more
traditional structures are incapable of producing. Earlier, this paper
pointed out that one of the important rules to achieving efficiency was
to send the minimum number of packets for a given amount of data. The
idea of protocol layering interacts very strongly (and poorly) with this
goal, because independent layers have independent ideas about when
packets should be sent, and unless these layers can somehow be brought
into cooperation, additional packets will flow. The best example of
this is the operation of server telnet in a character at a time remote
echo mode on top of TCP. When a packet containing a character arrives
at a server host, each layer has a different response to that packet.
TCP has an obligation to acknowledge the packet. Either server telnet
or the application layer above has an obligation to echo the character
received in the packet. If the character is a Telnet control sequence,
then Telnet has additional actions which it must perform in response to
the packet. The result of this, in most implementations, is that
several packets are sent back in response to the one arriving packet.
Combining all of these return messages into one packet is important for
several reasons. First, of course, it reduces the number of packets
being sent over the net, which directly reduces the charges incurred for
many common carrier tariff structures. Second, it reduces the number of
scheduling actions which will occur inside both hosts, which, as was
discussed above, is extremely important in improving throughput.
The way to achieve this goal of packet sharing is to break down the
barrier between the layers of the protocols, in a very restrained and
careful manner, so that a limited amount of information can leak across
the barrier to enable one layer to optimize its behavior with respect to
the desires of the layers above and below it. For example, it would
represent an improvement if TCP, when it received a packet, could ask
the layer above whether or not it would be worth pausing for a few
milliseconds before sending an acknowledgement in order to see if the
upper layer would have any outgoing data to send. Dallying before
sending the acknowledgement produces precisely the right sort of
optimization if the client of TCP is server Telnet. However, dallying
before sending an acknowledgement is absolutely unacceptable if TCP is
being used for file transfer, for in file transfer there is almost never
data flowing in the reverse direction, and the delay in sending the
acknowledgement probably translates directly into a delay in obtaining
the next packets. Thus, TCP must know a little about the layers above
it to adjust its performance as needed.
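In outline, this knowledge can be as simple as a per-connection hint
set by the layer above. The following sketch is illustrative only;
send_ack_now and ack_after_delay are assumed primitives (the latter
using a timer facility such as the callout table sketched earlier).

    struct tcb {
        int ack_pending;    /* an acknowledgement is owed to the peer */
        int expect_reply;   /* hint from above: reverse data likely soon */
    };

    /* Assumed primitives: send an ACK segment now, or arm a short timer. */
    extern void send_ack_now(struct tcb *tp);
    extern void ack_after_delay(struct tcb *tp, int ms);

    /* Called when an in-order data segment has been accepted. */
    void tcp_ack_policy(struct tcb *tp)
    {
        tp->ack_pending = 1;
        if (tp->expect_reply)
            ack_after_delay(tp, 20);   /* dally: the echo can carry the ACK */
        else
            send_ack_now(tp);          /* file transfer: delay stalls the sender */
    }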
It would be possible to imagine a general purpose TCP which was
equipped with all sorts of special mechanisms by which it would query
the layer above and modify its behavior accordingly. In the structures
suggested above, in which there is not one but several TCPs, the TCP can
simply be modified so that it produces the correct behavior as a matter
of course. This structure has the disadvantage that there will be
several implementations of TCP existing on a single machine, which can
mean more maintenance headaches if a problem is found where TCP needs to
be changed. However, it is probably the case that each of the TCPs will
be substantially simpler than the general purpose TCP which would
otherwise have been built. There are some experimental projects
currently under way which suggest that this approach may make designing
of a TCP, or almost any other layer, substantially easier, so that the
total effort involved in bringing up a complete package is actually less
if this approach is followed. This approach is by no means generally
accepted, but deserves some consideration.
The general conclusion to be drawn from this sort of consideration
is that a layer boundary has both a benefit and a penalty. A visible
layer boundary, with a well specified interface, provides a form of
isolation between two layers which allows one to be changed with the
confidence that the other one will not stop working as a result.
However, a firm layer boundary almost inevitably leads to inefficient
operation. This can easily be seen by analogy with other aspects of
operating systems. Consider, for example, file systems. A typical
operating system provides a file system, which is a highly abstracted
representation of a disk. The interface is highly formalized, and
presumed to be highly stable. This makes it very easy for naive users
to have access to disks without having to write a great deal of
software. The existence of a file system is clearly beneficial. On the
other hand, it is clear that the restricted interface to a file system
almost inevitably leads to inefficiency. If the interface is organized
as a sequential read and write of bytes, then there will be people who
wish to do high throughput transfers who cannot achieve their goal. If
the interface is a virtual memory interface, then other users will
regret the necessity of building a byte stream interface on top of the
memory mapped file. The most objectionable inefficiency results when a
highly sophisticated package, such as a data base management package,
must be built on top of an existing operating system. Almost
inevitably, the implementors of the database system attempt to reject
the file system and obtain direct access to the disks. They have
sacrificed modularity for efficiency.
The same conflict appears in networking, in a rather extreme form.
The concept of a protocol is still unknown and frightening to most naive
programmers. The idea that they might have to implement a protocol, or
even part of a protocol, as part of some application package, is a
dreadful thought. And thus there is great pressure to hide the function
of the net behind a very hard barrier. On the other hand, the kind of
inefficiency which results from this is a particularly undesirable sort
of inefficiency, for it shows up, among other things, in increasing the
cost of the communications resource used up to achieve the application
goal. In cases where one must pay for one's communications costs, they
usually turn out to be the dominant cost within the system. Thus, doing
an excessively good job of packaging up the protocols in an inflexible
manner has a direct impact on increasing the cost of the critical
resource within the system. This is a dilemma which will probably only
be solved when programmers become somewhat less alarmed about protocols,
so that they are willing to weave a certain amount of protocol structure
into their application program, much as application programs today weave
parts of database management systems into the structure of their
application program.
An extreme example of putting the protocol package behind a firm
layer boundary occurs when the protocol package is relegated to a front-
end processor. In this case the interface to the protocol is some other
protocol. It is difficult to imagine how to build close cooperation
between layers when they are that far separated. Realistically, one of
the prices which must be associated with an implementation so physically
modularized is that the performance will suffer as a result. Of course,
a separate processor for protocols could be very closely integrated into
the mainframe architecture, with interprocessor co-ordination signals,
shared memory, and similar features. Such a physical modularity might
work very well, but there is little documented experience with this
closely coupled architecture for protocol support.
6. Efficiency of Protocol Processing
To this point, this document has considered how a protocol package
should be broken into modules, and how those modules should be
distributed between free standing machines, the operating system kernel,
and one or more user processes. It is now time to consider the other
half of the efficiency question, which is what can be done to speed the
execution of those programs that actually implement the protocols. We
will make some specific observations about TCP and IP, and then conclude
with a few generalities.
IP is a simple protocol, especially with respect to the processing
of normal packets, so it should be easy to get it to perform
efficiently. The only area of any complexity related to actual packet
processing has to do with fragmentation and reassembly. The reader is
referred to RFC 815, titled "IP Datagram Reassembly Algorithms", for
specific consideration of this point.
Most costs in the IP layer come from table look up functions, as
opposed to packet processing functions. An outgoing packet requires two
translation functions to be performed. The internet address must be
translated to a target gateway, and a gateway address must be translated
to a local network number (if the host is attached to more than one
network). It is easy to build a simple implementation of these table
look up functions that in fact performs very poorly. The programmer
should keep in mind that there may be as many as a thousand network
numbers in a typical configuration. Linear searching of a thousand
entry table on every packet is extremely unsuitable. In fact, it may be
worth asking TCP to cache a hint for each connection, which can be
handed down to IP each time a packet is sent, to try to avoid the
overhead of a table look up.
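The hint mechanism can be sketched as follows: the connection caches
the result of its last lookup, and IP verifies the cached entry with a
single comparison before falling back to the full (hashed, not linear)
search. The names here are invented for illustration.

    #include <stdint.h>

    struct route_hint {
        uint32_t dest;       /* internet address the hint is valid for */
        uint32_t gateway;    /* cached first-hop gateway */
        int      net;        /* cached local network attachment */
        int      valid;
    };

    /* Assumed slow path: the full table lookup, e.g. a hashed search. */
    extern int route_lookup(uint32_t dest, uint32_t *gw, int *net);

    /* Called per packet; the hint lives in the connection block and is
     * handed down on every send, so the common case does no searching. */
    int ip_route(uint32_t dest, struct route_hint *h, uint32_t *gw, int *net)
    {
        if (h->valid && h->dest == dest) {     /* fast path: one compare */
            *gw = h->gateway;
            *net = h->net;
            return 0;
        }
        if (route_lookup(dest, gw, net) < 0)
            return -1;
        h->dest = dest; h->gateway = *gw; h->net = *net; h->valid = 1;
        return 0;
    }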
TCP is a more complex protocol, and presents many more
opportunities for getting things wrong. There is one area which is
generally accepted as causing noticeable and substantial overhead as
part of TCP processing. This is computation of the checksum. It would
be nice if this cost could be avoided somehow, but the idea of an end-
to-end checksum is absolutely central to the functioning of TCP. No
host implementor should think of omitting the validation of a checksum
on incoming data.
Various clever tricks have been used to try to minimize the cost of
computing the checksum. If it is possible to add additional microcoded
instructions to the machine, a checksum instruction is the most obvious
candidate. Since computing the checksum involves picking up every byte
of the segment and examining it, it is possible to combine the operation
of computing the checksum with the operation of copying the segment from
one location to another. Since a number of data copies are probably
already required as part of the processing structure, this kind of
sharing might conceivably pay off if it didn't cause too much trouble to
the modularity of the program. Finally, computation of the checksum
seems to be one place where careful attention to the details of the
algorithm used can make a drastic difference in the throughput of the
program. The Multics system provides one of the best case studies of
this, since Multics is about as poorly organized to perform this
function as any machine implementing TCP. Multics is a 36-bit word
machine, with four 9-bit bytes per word. The eight-bit bytes of a TCP
segment are laid down packed in memory, ignoring word boundaries. This
means that when it is necessary to pick up the data as a set of 16-bit
units for the purpose of adding them to compute checksums, horrible
masking and shifting is required for each 16-bit value. An early
version of a program using this strategy required 6 milliseconds to
checksum a 576-byte segment. Obviously, at this point, checksum
computation was becoming the central bottleneck to throughput. A more
careful recoding of this algorithm reduced the checksum processing time
to less than one millisecond. The strategy used was extremely dirty.
It involved adding up carefully selected words of the area in which the
data lay, knowing that for those particular words, the 16-bit values
were properly aligned inside the words. Only after the addition had
been done were the various sums shifted, and finally added to produce
the eventual checksum. This kind of highly specialized programming is
probably not acceptable if used everywhere within an operating system.
It is clearly appropriate for one highly localized function which can be
clearly identified as an extreme performance bottleneck.
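For reference, the checksum itself is simply a 16-bit one's-complement
sum of the data. The portable sketch below also illustrates the
copy-sharing idea mentioned above, computing the sum in the same pass
that moves the data; it is not the word-aligned Multics recoding just
described, which summed carefully selected full words instead.

    #include <stdint.h>
    #include <stddef.h>

    /* Copy a segment and compute the 16-bit one's-complement sum in one
     * pass, so the data is touched once rather than twice. */
    uint16_t copy_and_checksum(uint8_t *dst, const uint8_t *src, size_t len)
    {
        uint32_t sum = 0;
        size_t i;

        for (i = 0; i + 1 < len; i += 2) {
            dst[i] = src[i];
            dst[i + 1] = src[i + 1];
            sum += (uint32_t)(src[i] << 8 | src[i + 1]);  /* 16-bit units */
        }
        if (i < len) {                   /* odd trailing byte, zero padded */
            dst[i] = src[i];
            sum += (uint32_t)(src[i] << 8);
        }
        while (sum >> 16)                /* fold carries (one's complement) */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }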
Another area of TCP processing which may cause performance problems
is the overhead of examining all of the possible flags and options which
occur in each incoming packet. One paper, by Bunch and Day [2], asserts
that the overhead of packet header processing is actually an important
factor limiting throughput. Not all measurement
experiments have tended to support this result. To whatever extent it
is true, however, there is an obvious strategy which the implementor
ought to use in designing his program. He should build his program to
optimize the expected case. It is easy, especially when first designing
a program, to pay equal attention to all of the possible outcomes of
every test. In practice, however, few of these will ever happen. A TCP
should be built on the assumption that the next packet to arrive will
have absolutely nothing special about it, and will be the next one
expected in the sequence space. One or two tests are sufficient to
determine that the expected set of control flags are on. (The ACK flag
should be on; the Push flag may or may not be on. No other flags should
be on.) One test is sufficient to determine that the sequence number of
the incoming packet is one greater than the last sequence number
received. In almost every case, that will be the actual result. Again,
using the Multics system as an example, failure to optimize the case of
receiving the expected sequence number had a detectable effect on the
performance of the system. The particular problem arose when a number
of packets arrived at once. TCP attempted to process all of these
packets before awaking the user. As a result, by the time the last
packet arrived, there was a threaded list of packets which had several
items on it. When a new packet arrived, the list was searched to find
the location into which the packet should be inserted. Obviously, the
list should be searched from highest sequence number to lowest sequence
number, because one is expecting to receive a packet which comes after
those already received. By mistake, the list was searched from front to
back, starting with the packets with the lowest sequence number. The
amount of time spent searching this list backwards was easily detectable
in the metering measurements.
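In code, the expected-case test amounts to two comparisons; the sketch
below uses the standard TCP flag bits, with everything unusual
relegated to a separate, careful path.

    #include <stdint.h>

    #define TH_FIN 0x01
    #define TH_SYN 0x02
    #define TH_RST 0x04
    #define TH_PSH 0x08
    #define TH_ACK 0x10
    #define TH_URG 0x20

    struct tcb {
        uint32_t rcv_nxt;    /* next sequence number expected */
    };

    /* Test the expected case first: ACK on, Push optional, nothing else
     * set, and exactly the next segment in sequence.  Almost every
     * arriving segment should pass and take the short path. */
    int tcp_fast_path(const struct tcb *tp, uint8_t flags, uint32_t seq)
    {
        if ((flags & ~TH_PSH) != TH_ACK)   /* one test covers all flags */
            return 0;                      /* rare: take the slow path */
        if (seq != tp->rcv_nxt)            /* one test for in-order arrival */
            return 0;
        return 1;                          /* common case: process at once */
    }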
Other data structures can be organized to optimize the action which
is normally taken on them. For example, the retransmission queue is
very seldom actually used for retransmission, so it should not be
organized to optimize that action. In fact, it should be organized to
optimize the discarding of things from it when the acknowledgement
arrives. In many cases, the easiest way to do this is not to save the
packet at all, but to reconstruct it only if it needs to be
retransmitted, starting from the data as it was originally buffered by
the user.
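A sketch of such an organization: the queue is reduced to a pair of
sequence numbers and a pointer into the user's buffer, so that
acknowledgement processing is a single assignment, and a segment is
reconstructed only when a retransmission actually occurs. The
structure is illustrative, and SYN sequence accounting is omitted.

    #include <stdint.h>
    #include <stddef.h>

    struct sendq {
        const uint8_t *base;   /* user data as originally buffered */
        uint32_t snd_una;      /* oldest unacknowledged sequence number */
        uint32_t snd_nxt;      /* next sequence number to send */
        uint32_t iss;          /* initial send sequence number */
    };

    /* Common case: an ACK arrives.  No list walking, no packet freeing. */
    void sendq_ack(struct sendq *q, uint32_t ack)
    {
        if (ack - q->snd_una <= q->snd_nxt - q->snd_una)   /* plausible */
            q->snd_una = ack;
    }

    /* Rare case: rebuild the segment from the user buffer rather than
     * keeping a saved copy of the packet.  Returns the data pointer. */
    const uint8_t *sendq_retransmit(const struct sendq *q, uint32_t *seq,
                                    size_t *len, size_t max_segment)
    {
        size_t outstanding = q->snd_nxt - q->snd_una;
        if (outstanding == 0)
            return NULL;
        *seq = q->snd_una;
        *len = outstanding < max_segment ? outstanding : max_segment;
        return q->base + (q->snd_una - q->iss);
    }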
There is another generality, at least as important as optimizing
the common case, which is to avoid copying data any more times than
necessary. One more result from the Multics TCP may prove enlightening
here. Multics takes between two and three milliseconds within the TCP
layer to process an incoming packet, depending on its size. For a 576-
byte packet, the three milliseconds is used up approximately as follows.
One millisecond is used computing the checksum. Six hundred
microseconds is spent copying the data. (The data is copied twice, at
.3 milliseconds a copy.) One of those copy operations could correctly
be included as part of the checksum cost, since it is done to get the
data on a known word boundary to optimize the checksum algorithm.
However, the copy also performs another necessary transfer at the same
time. Header processing and packet resequencing takes .7 milliseconds.
The rest of the time is used in miscellaneous processing, such as
removing packets from the retransmission queue which are acknowledged by
this packet. Data copying is the second most expensive single operation
after data checksumming. Some implementations, often because of an
excessively layered modularity, end up copying the data around a great
deal. Other implementations end up copying the data because there is no
shared memory between processes, and the data must be moved from process
to process via a kernel operation. Unless the amount of this activity
is kept strictly under control, it will quickly become the major
performance bottleneck.
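One common discipline for keeping copies under control is to pass
buffer descriptors between layers rather than the data itself, so that
each layer prepends its header to a chain and the data is moved at
most once. The following is a minimal illustrative sketch of such a
chain; the chained buffers of later kernels are the well-known
development of this idea.

    #include <stddef.h>

    /* A packet is a chain of descriptors; the driver walks the chain
     * at transmission time (gather output), so no layer need copy. */
    struct buf {
        struct buf *next;       /* next piece of this packet */
        const char *data;       /* this piece's bytes */
        size_t len;
    };

    /* Prepend a header without touching the data: no copy at all. */
    struct buf *prepend(struct buf *hdr, struct buf *chain)
    {
        hdr->next = chain;
        return hdr;
    }

    /* Total length, e.g. for the length fields in the headers. */
    size_t chain_len(const struct buf *b)
    {
        size_t n = 0;
        for (; b != NULL; b = b->next)
            n += b->len;
        return n;
    }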
7. Conclusions
This document has addressed two aspects of obtaining performance
from a protocol implementation, the way in which the protocol is layered
and integrated into the operating system, and the way in which the
detailed handling of the packet is optimized. It would be nice if one
or the other of these costs would completely dominate, so that all of
one's attention could be concentrated there. Regrettably, this is not
so. Depending on the particular sort of traffic one is getting, for
example, whether Telnet one-byte packets or file transfer maximum size
packets at maximum speed, one can expect to see one or the other cost
being the major bottleneck to throughput. Most implementors who have
studied their programs in an attempt to find out where the time was
going have reached the unsatisfactory conclusion that it is going
equally to all parts of their program. With the possible exception of
checksum processing, very few people have ever found that their
performance problems were due to a single, horrible bottleneck which
they could fix by a single stroke of inventive programming. Rather, the
performance was something which was improved by painstaking tuning of
the entire program.
Most discussions of protocols begin by introducing the concept of
layering, which tends to suggest that layering is a fundamentally
wonderful idea which should be a part of every consideration of
protocols. In fact, layering is a mixed blessing. Clearly, a layer
interface is necessary whenever more than one client of a particular
layer is to be allowed to use that same layer. But an interface,
precisely because it is fixed, inevitably leads to a lack of complete
understanding as to what one layer wishes to obtain from another. This
has to lead to inefficiency. Furthermore, layering is a potential snare
in that one is tempted to think that a layer boundary, which was an
artifact of the specification procedure, is in fact the proper boundary
to use in modularizing the implementation. Again, in certain cases, an
architected layer must correspond to an implemented layer, precisely so
that several clients can have access to that layer in a reasonably
straightforward manner. In other cases, cunning rearrangement of the
implemented module boundaries to match with various functions, such as
the demultiplexing of incoming packets, or the sending of asynchronous
outgoing packets, can lead to unexpected performance improvements
compared to more traditional implementation strategies. Finally, good
performance is something which is difficult to retrofit onto an existing
program. Since performance is influenced, not just by the fine detail,
but by the gross structure, it is sometimes the case that in order to
obtain a substantial performance improvement, it is necessary to
completely redo the program from the bottom up. This is a great
disappointment to programmers, especially those doing a protocol
implementation for the first time. Programmers who are somewhat
inexperienced and unfamiliar with protocols are sufficiently concerned
with getting their program logically correct that they do not have the
capacity to think at the same time about the performance of the
structure they are building. Only after they have achieved a logically
correct program do they discover that they have done so in a way which
has precluded real performance. Clearly, it is more difficult to design
a program thinking from the start about both logical correctness and
performance. With time, as implementors as a group learn more about the
appropriate structures to use for building protocols, it will be
possible to proceed with an implementation project having more
confidence that the structure is rational, that the program will work,
and that the program will work well. Those of us now implementing
protocols have the privilege of being on the forefront of this learning
process. It should be no surprise that our programs sometimes suffer
from the uncertainty we bring to bear on them.
Citations
[1] Cohen and Postel, "On Protocol Multiplexing", Sixth Data
Communications Symposium, ACM/IEEE, November 1979.
[2] Bunch and Day, "Control Structure Overhead in TCP", Trends and
Applications: Computer Networking, NBS Symposium, May 1980.