Network Working Group                                         J. Klensin
Request for Comments: 5198                                  M. Padlipsky
Obsoletes: 698                                                March 2008
Updates: 854
Category: Standards Track
Unicode Format for Network Interchange
Status of This Memo
This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.
Abstract
The Internet today is in need of a standardized form for the
transmission of internationalized "text" information, paralleling the
specifications for the use of ASCII that date from the early days of
the ARPANET. This document specifies that format, using UTF-8 with
normalization and specific line-ending sequences.
Table of Contents
1. Introduction
  1.1. Requirement for a Standardized Text Stream Format
  1.2. Terminology
2. Net-Unicode Definition
3. Normalization
4. Versions of Unicode
5. Applicability and Stability of this Specification
  5.1. Use in IETF Applications Specifications
  5.2. Unicode Versions and Applicability
6. Security Considerations
7. Acknowledgments
Appendix A. History and Context
Appendix B. The ASCII NVT Definition
Appendix C. The Line-Ending Problem
Appendix D. A Note about Related Future Work
References
  Normative References
  Informative References
Klensin & Padlipsky Standards Track [Page 1]
RFC 5198 Network Unicode March 2008
1. Introduction
1.1. Requirement for a Standardized Text Stream Format
Historically, Internet protocols have been largely ASCII-based and
references to "text" in protocols have assumed ASCII text and
specifically text in Network Virtual Terminal ("NVT") or "Network
ASCII" form (see Appendix A and Appendix B). Protocols and formats
that have moved beyond ASCII have included arrangements to
specifically identify the character set and often the language being
used.
In our more internationalized world, "text" clearly no longer equates
unambiguously to "network ASCII". Fortunately, however, we are
converging on Unicode [Unicode] [ISO10646] as a single international
interchange character coding and no longer need to deal with per-
script standards for character sets (e.g., one standard for each of
Arabic, Cyrillic, Devanagari, etc., or even standards keyed to
languages that are usually considered to share a script, such as
French, German, or Swedish). Unfortunately, though, while it is
certainly time to define a Unicode-based text type for use as a
common text interchange format, "use Unicode" involves even more
ambiguity than "use ASCII" did decades ago.
Unicode identifies each character by an integer, called its "code
point", in the range 0-0x10ffff. These integers can be encoded into
byte sequences for transmission in at least three standard and
generally-recognized encoding forms, all of which are completely
defined in The Unicode Standard and the documents cited below:
o UTF-8 [RFC3629] defines a variable-length encoding that may be
applied uniformly to all code points.
o UTF-16 [RFC2781] encodes the range of Unicode characters whose
code points are less than 65536 straightforwardly as 16-bit
integers, and provides a "surrogate" mechanism for encoding larger
code points in 32 bits.
o UTF-32 (also known as UCS-4) simply encodes each code point as a
32-bit integer.
Older forms and nomenclature, such as the 16-bit UCS-2, are now
strongly discouraged.
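As a rough illustration of the three encoding forms listed above, the
following Python sketch reports how many bytes a single code point
occupies in each. The helper name `encoded_sizes` is invented for this
example. The big-endian codec variants are chosen only so that Python
does not prepend a byte order mark.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the byte length of one character in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),  # big-endian, no BOM
        "utf-32": len(ch.encode("utf-32-be")),  # big-endian, no BOM
    }


# U+0041 is ASCII, U+00E0 is below 65536, U+10348 needs a surrogate pair.
for ch in ("\u0041", "\u00e0", "\U00010348"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Only UTF-8 keeps ASCII characters at one byte each, which is one of
the compatibility properties that motivates its choice later in this
document.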
As with ASCII, any of these forms may be used with different line-
ending conventions. That flexibility can be an additional source of
confusion with, e.g., index (offset) references into documents based
on character counts.
This document proposes to establish "Net-Unicode" as a new
standardized text transmission form for the Internet, to serve as an
internationalized alternative for NVT ASCII when specified in new --
and, where appropriate, updated -- protocols. UTF-8 [RFC3629] is
chosen for the coding because it has good compatibility properties
with ASCII and for other reasons discussed in the existing IETF
character set policy [RFC2277]. "Net-Unicode" is specified in
Section 2; the subsequent sections of the document provide background
and explanation.
Whenever there is a choice, Unicode SHOULD be used with the text
encoding specified here. This combination is preferred to the
double-byte encoding of "extended ASCII" [RFC0698] or the assorted
per-language or per-country character coding systems.
1.2. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
2. Net-Unicode Definition
The Network Unicode format (Net-Unicode) is defined as follows.
Parts of this definition are deliberately informal, providing
guidance for specific profiles or rules in the protocols that
reference this one rather than firm rules that apply globally.
1. Characters MUST be encoded in UTF-8 as defined in [RFC3629].
2. If the protocol has the concept of "lines", line-endings MUST be
indicated by the sequence Carriage-Return (CR, U+000D) followed
by Line-Feed (LF, U+000A), often known just as CRLF. CR SHOULD
NOT appear except when followed by LF. The only other allowed
context in which CR is permitted is in the combination CR NUL,
which is not recommended (see the note at the end of this
section).
3. The control characters in the ASCII range (U+0000 to U+001F and
U+007F to U+009F) SHOULD generally be avoided. Space (SP,
U+0020), CR, LF, and Form Feed (FF, U+000C) are exceptions to
this principle, but use of all but the first requires care as
discussed elsewhere in this document. The so-called "C1
Controls" (U+0080 through U+009F), which did not appear in ASCII,
MUST NOT appear.
FF should be used only with caution: it does not have a standard
and universal interpretation and, in particular, if its use
assumes a page length, such assumptions may not be appropriate in
international contexts (e.g., considering 8.5x11 inch paper
versus A4). Other control characters are used to affect display
format, control devices, or to structure files. None of those
uses is appropriate for streams of plain text.
4. Before transmission, all character sequences SHOULD be normalized
according to Unicode normalization form "NFC" (see Section 3).
5. As suggested in Section 6 of RFC 3629, the Byte Order Mark
("BOM") signature MUST NOT appear at the beginning of these text
strings.
6. Systems conforming to this specification MUST NOT transmit any
string containing any code point that is unassigned in the
version of Unicode on which they are dependent. The version of
NFC and the version of Unicode used by that system MUST be
consistent.
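The rules in items 1 and 3 through 6 can be approximated by a small
checker. This is a minimal sketch, not a conformance test: the
function name `net_unicode_problems` is invented here, line-ending
rules are left out, and "unassigned" is judged by the general category
"Cn" in whatever Unicode tables the local Python library ships (that
category also covers noncharacters).

```python
import unicodedata

BOM = "\ufeff"


def net_unicode_problems(data: bytes) -> list:
    """List ways in which `data` departs from items 1 and 3-6."""
    try:
        text = data.decode("utf-8")  # item 1: strict UTF-8
    except UnicodeDecodeError:
        return ["not valid UTF-8"]
    problems = []
    if text.startswith(BOM):  # item 5: no BOM signature
        problems.append("starts with a BOM")
    if any("\u0080" <= ch <= "\u009f" for ch in text):
        problems.append("contains a C1 control")  # item 3 (MUST NOT)
    if any(unicodedata.category(ch) == "Cn" for ch in text):
        # item 6, relative to this library's version of Unicode
        problems.append("contains an unassigned code point")
    if unicodedata.normalize("NFC", text) != text:  # item 4 (SHOULD)
        problems.append("not NFC-normalized")
    return problems


print(net_unicode_problems("caf\u00e9\r\n".encode("utf-8")))
print(net_unicode_problems("a\u0300".encode("utf-8")))
```

Because item 4 is a SHOULD rather than a MUST, a receiver might treat
the last finding as a warning rather than a reason to reject the
string.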
The use of LF without CR is questionable; see Appendix B for more
discussion. The newer control characters IND (U+0084) and NEL ("Next
Line", U+0085) might have been used to disambiguate the various line-
ending situations, but, because their use has not been established on
the Internet, because many protocols require CRLF, and because IND
and NEL fall within the "C1 Controls" group (see above), they MUST
NOT be used. Similar observations apply to the yet newer line and
paragraph separators at U+2028 and U+2029 and any future characters
that might be defined to serve these functions. For this
specification and protocols that depend on it, lines end in CRLF and
only in CRLF. Anything that does not end in CRLF is either not a
line or is severely malformed.
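The "CRLF and only CRLF" rule can be sketched as a splitter that
works on the encoded bytes. The name `split_net_lines` is invented
for this example. Splitting on the two-octet sequence is safe in
UTF-8 because the octets 0x0D and 0x0A never occur inside a
multi-octet character.

```python
def split_net_lines(data: bytes):
    """Split on CRLF only; return (lines, trailing) as bytes.

    `trailing` is whatever follows the last CRLF. If it is non-empty,
    the input did not end with a complete line.
    """
    *lines, trailing = data.split(b"\r\n")
    return lines, trailing


print(split_net_lines(b"first\r\nsecond\r\n"))
print(split_net_lines(b"bare LF\nis not a line end\r\nunterminated"))
```

Python's `str.splitlines` is deliberately not used: it also breaks at
a lone LF, at NEL, and at U+2028 and U+2029, none of which end a line
under this specification.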
The NVT specification contained a number of additional provisions,
e.g., for the optional use of backspacing and "bare CR" (sent as CR
NUL) to generate overstruck character sequences. The much greater
number of precomposed characters in Unicode, the availability of
combining characters, and the growing use of markup conventions of
various types to show, e.g., emphasis (rather than attempting to do
that via the use of special characters), should make such sequences
largely unnecessary. These sequences SHOULD be avoided if at all
possible. However, because they were optional in NVT applications
and this specification is an NVT superset, they cannot be prohibited
entirely. The most important of these rules is that CR MUST NOT
appear unless it is immediately followed by LF (indicating end of
line) or NUL. Because NUL (an octet whose value is all zeros, i.e.,
%x00 in the notation of [RFC5234]) is hostile to programming
languages that use that character as a string delimiter, the CR NUL
sequence SHOULD be avoided for that reason as well.
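The rule that CR appears only before LF or NUL lends itself to a
one-line check. The sketch below uses a negative lookahead over the
encoded bytes; the names `BARE_CR` and `has_bare_cr` are invented for
this example.

```python
import re

# A CR that is not immediately followed by LF or NUL.
BARE_CR = re.compile(rb"\r(?![\n\x00])")


def has_bare_cr(data: bytes) -> bool:
    """True if `data` contains a CR outside the CR LF and CR NUL pairs."""
    return BARE_CR.search(data) is not None


print(has_bare_cr(b"line\r\n"))
print(has_bare_cr(b"bad\rline"))
```

A CR that is the final octet of the input is also reported, since
nothing follows it. A streaming receiver would instead hold that
octet back until the next one arrives.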
3. Normalization
There are cases where strings of Unicode are fundamentally
equivalent, essentially representing the same text. These are called
"canonical equivalents" in the Unicode Standard. For example, the
following pairs of strings are canonically equivalent:
   U+2126 OHM SIGN
   U+03A9 GREEK CAPITAL LETTER OMEGA

   U+0061 LATIN SMALL LETTER A, U+0300 COMBINING GRAVE ACCENT
   U+00E0 LATIN SMALL LETTER A WITH GRAVE
Comparison of strings becomes much easier if any such cases are
always represented by a single unique form. The Unicode Consortium
specifies a normalization form, known as NFC [NFC], which provides
the necessary mappings and mechanisms to convert all canonically
equivalent sequences to a single unique form. Typically, this form
produces precomposed characters for any sequences that can be
represented in that fashion. It also reorders other combining marks
so that they have a unique and unambiguous order.
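The two pairs above, and the reordering of combining marks, can be
observed with the normalization function in Python's standard
`unicodedata` module. The wrapper `nfc` is a convenience defined for
this example, and the results depend on the Unicode tables that the
local library carries.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize `text` to Unicode normalization form NFC."""
    return unicodedata.normalize("NFC", text)


ohm, omega = "\u2126", "\u03a9"
decomposed, precomposed = "a\u0300", "\u00e0"

# The raw strings differ, but their NFC forms agree.
print(ohm == omega, nfc(ohm) == omega)
print(decomposed == precomposed, nfc(decomposed) == precomposed)

# Combining marks with different combining classes are put in a fixed
# order: dot below (U+0323) sorts ahead of dot above (U+0307).
print(nfc("q\u0307\u0323") == "q\u0323\u0307")
```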
Of the various normalization forms defined as part of Unicode, NFC is
closest to actual use in practice, minimizes side-effects due to
considering characters equivalent that may not be equivalent in all
situations, and typically requires the least work when converting
from non-Unicode encodings.
Section 2 requires that, except in very unusual circumstances, all
Net-Unicode strings be transmitted in normalized form. However, some
implementations of applications may rely on operating system libraries
over which they have little control. That fact, together with the
robustness principle, suggests that receivers of such strings should
be prepared to receive unnormalized ones and should not react to them
in excessive ways.
4. Versions of Unicode
Unicode changes and expands over time. Large blocks of space are
reserved for future expansion. New versions, which appear at regular
intervals, add new scripts and characters. Occasionally they also
change some property definitions. In retrospect, one of the
advantages of ASCII [ASCII] when it was chosen was that the code
space was full when the Standard was first published. There was no
practical way to add characters or change code point assignments
without being obviously incompatible.
While there are some security issues if people deliberately try to
trick the system (see Section 6), Unicode version changes should not
have a significant impact on the text stream specification of this
document for the following reasons:
o The transformation between Unicode code table positions and the
corresponding UTF-8 code is algorithmic; it does not depend on
whether a code point has been assigned or not.
o The normalization recommended here, NFC (see Section 3), performs
a very limited set of mappings, much more limited than those of
the more extensive NFKC used in, e.g., Nameprep [RFC3491].
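Both points can be illustrated briefly in Python. The helper
`utf8_hex` is invented for this example, and U+0378 is used on the
assumption that it is still unassigned in the reader's version of
Unicode, as it was when this example was written.

```python
import unicodedata


def utf8_hex(code_point: int) -> str:
    """UTF-8 octets of a code point, assigned or not, as hex digits."""
    return chr(code_point).encode("utf-8").hex()


# The encoding is purely arithmetic on the code point value.
print(utf8_hex(0x0378))

# NFC leaves a compatibility character alone; NFKC rewrites it.
ligature = "\ufb01"  # LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFC", ligature) == ligature)
print(unicodedata.normalize("NFKC", ligature))
```

The ligature survives NFC unchanged but becomes the two letters "fi"
under NFKC, which is the kind of broader mapping that this
specification avoids.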
The NFC tables may be updated over time as new characters are added,
but the Unicode Consortium has guaranteed the stability of all NFC
strings. That is, if a string does not contain any unassigned
characters, and it is normalized according to NFC, it will always be
normalized according to all future versions of the Unicode Standard.
The stability of the Net-Unicode format is thus guaranteed when any
implementation that converts text into Net-Unicode format does not
permit unassigned characters.
Because Unicode code points that are reserved for private use do not
have standard definitions or normalization interpretations, they
SHOULD be avoided in strings intended for Internet interchange.
Were Unicode to be changed in a way that violated these assumptions,
i.e., that either invalidated the byte string order specified in RFC
3629 or that changed the stability of NFC as stated above, this
specification would not apply. Put differently, this specification
applies only to versions of Unicode starting with version 5.0 and
extending to, but not including, any version for which changes are
made in either the UTF-8 definition or to NFC stability. Such
changes would violate established Unicode policies and are hence
unlikely, but, should they occur, it would be necessary to evaluate
them for compatibility with this specification and other Internet
uses of NFC.
If the specification of a protocol references this one, strings that
are received by that protocol and that appear to be UTF-8 and are not
otherwise identified (e.g., by charset labeling) SHOULD be treated as
using UTF-8 in conformance with this specification.
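One plausible reading of "appear to be UTF-8" is that the octets pass
a strict UTF-8 decoder. The sketch below takes that reading; the
function name `appears_to_be_utf8` is invented here, and a real
protocol might apply further heuristics.

```python
def appears_to_be_utf8(data: bytes) -> bool:
    """True if `data` decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(appears_to_be_utf8("na\u00efve".encode("utf-8")))
print(appears_to_be_utf8(b"na\xefve"))  # ISO 8859-1 octets
```

Pure ASCII input also passes this test, which is consistent with
ASCII being a subset of UTF-8.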
5. Applicability and Stability of this Specification
5.1. Use in IETF Applications Specifications
During the development of this specification, there was some
confusion about where it would be useful given that, e.g., the
individual MIME media types used in email and with HTTP have their
own rules about UTF-8 character types and normalization, and the
application transport protocols impose their own conventions about
line endings. There are three answers. The first is that, in
retrospect, it would have been better to have those protocols and
content types standardized in the way specified here, even though it
is certainly too late to change them at this time. The second is
that we have several protocols that are dependent on either the
original Telnet design or other arrangements requiring a standard,
interoperable, string definition without specific content-labels of
one sort or another. Whois [RFC3912] is an example member of this
group. As consideration is given to upgrading them for non-ASCII
use, this specification provides a normative reference that provides
the same stability that NVT has provided the ASCII forms. This
specification is intended for use by other specifications that have
not yet defined how to use Unicode. Finally, having a preferred standard
Internet definition for Unicode text streams -- rather than just one
for transmission codings -- may help improve the specification and
interoperability of protocols to be developed in the future. This
specification is not intended for use with specifications that
already allow the use of UTF-8 and precisely define that use.
5.2. Unicode Versions and Applicability
The IETF faces a practical dilemma with regard to versions of
Unicode. Each new version brings with it new characters and
sometimes new combining characters. Version 5.0 introduces the new
concept of sequences of characters named as if they were individual
characters (see [NamedSequences]). The normalization represented by
NFC is stable if all strings are transmitted and stored in normalized
form, if corrections are never made to character definitions or
normalization tables, and if unassigned code points are never used.
The latter is important because an unassigned code point always
normalizes to itself. However, if the same code point is assigned to
a character in a future version, it may participate in some other
normalization mapping (some specific difficulties in this regard are
discussed in [RFC4690]). It is worth noting that transmission in
normalized form is not required by either the IETF's UTF-8 Standard
[RFC3629] or by standards dependent on the current version of
Stringprep [RFC3454].
All would be well with this as described in Section 4 except for one
problem: Applications typically do not perform their own conversions
to Unicode and may not perform their own normalizations but instead
rely on operating system or language library functions -- functions
that may be upgraded or otherwise changed without changes to the
application code itself. Consequently, there may be no plausible way
for an application to know which version of Unicode, or which version
of the normalization procedures, it is utilizing, nor is there any
way by which it can guarantee that the two will be consistent.
Because of per-version changes in definitions and tables, Stringprep
and documents depending on it are now tied to Unicode Version 3.2
[Unicode32] and full interoperability of Internet Standard UTF-8
[RFC3629], when used with normalization as specified here, is
dependent on normalization definitions and the definition of UTF-8
itself not changing after Unicode Version 5.0. These assumptions
seem fairly safe, but they are still assumptions. Rather than being
linked to the latest available version of Unicode, version 5.0
[Unicode] or broader concepts of version independence based on
specific assumptions and conditions, this specification could
reasonably have been tied, like Stringprep and Nameprep to Unicode
3.2 [Unicode32] or some more recent intermediate version, but, in
addition to the obvious disadvantages of having different IETF
standards tied to different versions of Unicode, the library-based
application implementation behavior described above makes these
version linkages nearly meaningless in practice.
In theory, one can get around this problem in four ways:
1. Freeze on a particular version of Unicode and try to insist that
applications enforce that version by, e.g., containing lists of
unassigned characters and prohibiting their use. Of course, this
would prohibit evolution to include newly-added scripts and the
tables of unassigned code points would be cumbersome.
2. Require that every Unicode "text" string or file start with a
version indication, somewhat akin to the "byte order mark"
indicator. It is unlikely that this provision would be
practical. More important, it would require that each
application implementation be prepared to either support multiple
normalization tables and versions or that it reject text from
Unicode versions with which it was not prepared to deal.
3. Devise a different set of normalization rules that would, e.g.,
guarantee that no character assigned to a previously-unassigned
code point in Unicode was ever normalized to anything but itself
and use those rules instead of NFC. It is not clear whether or
not such a set of rules is possible or whether some other
completely stable set of rules could be devised, perhaps in
combination with restrictions on the ways in which characters
were added in future versions of Unicode.
4. Devise a normalization process that is otherwise equivalent to
NFC but that rejects code points that are unassigned in the
current version of Unicode, rather than mapping those code points
to themselves. This would still leave some risk of incompatible
corrections in Unicode and possibly a few edge cases, but it is
probably stable enough for Internet use in the overwhelming
number of cases. This process has been discussed in the Unicode
Consortium under the name "Stable NFC".
None of these approaches seems ideal: the ideal procedure would be as
stable and predictable as ASCII has been. But that level is simply
not feasible as long as Unicode continues to evolve by the addition
of new code points and scripts. The fourth option listed above
appears to be a reasonable compromise.
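The fourth option might look roughly like the following Python
sketch. It is an illustration of the idea only, not the Unicode
Consortium's definition of "Stable NFC". The function name
`stable_nfc_sketch` is invented here, and "unassigned" again means
general category "Cn" in the local library's tables.

```python
import unicodedata


def stable_nfc_sketch(text: str) -> str:
    """NFC, but refuse code points this library reports as unassigned."""
    for ch in text:
        if unicodedata.category(ch) == "Cn":
            raise ValueError(f"unassigned code point U+{ord(ch):04X}")
    return unicodedata.normalize("NFC", text)


# The Unicode version behind the tables; it changes with the library,
# not with the application code.
print(unicodedata.unidata_version)
print(stable_nfc_sketch("a\u0300") == "\u00e0")
```

The same input can therefore be accepted by one installation and
rejected by another with older tables, which is the residual risk the
text above describes.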
6. Security Considerations
This specification provides a standard form for the use of Unicode as
"network text". Most of the same security issues that apply to
UTF-8, as discussed in [RFC3629], apply to it, although it should be
slightly less subject to some risks by virtue of requiring NFC
normalization and generally being somewhat more restrictive.
However, shifts in Unicode versions, as discussed in Section 5.2, may
introduce other security issues.
Programs that receive these streams should use extreme caution about
assuming that incoming data are normalized, since it might be
possible to use unnormalized forms, as well as invalid UTF-8, as part
of an attack. In particular, firewalls and other systems that
interpret UTF-8 streams should be developed with the clear knowledge
that an attacker may deliberately send unnormalized text, for
instance, to avoid detection by naive text-matching systems.
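The evasion described above is easy to reproduce. In this sketch a
naive substring filter misses a blocked word that arrives in
decomposed form, while a filter that normalizes both sides catches
it. The function names and sample strings are invented for this
example.

```python
import unicodedata


def naive_match(needle: str, haystack: str) -> bool:
    """Compare code points as received, with no normalization."""
    return needle in haystack


def normalized_match(needle: str, haystack: str) -> bool:
    """Normalize both sides to NFC before comparing."""
    needle = unicodedata.normalize("NFC", needle)
    haystack = unicodedata.normalize("NFC", haystack)
    return needle in haystack


blocked = "caf\u00e9"              # precomposed e with acute
incoming = "order at cafe\u0301"   # 'e' followed by COMBINING ACUTE

print(naive_match(blocked, incoming))
print(normalized_match(blocked, incoming))
```

Normalizing on receipt addresses this one case only; invalid UTF-8
and other equivalences outside NFC need separate handling.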
NVT contains a requirement, of necessity repeated here (see
Section 2), that the CR character be immediately followed by either
LF or ASCII NUL (an octet with all bits zero). NUL may be
problematic for some programming languages that use it as a string
terminator, and hence a trap for the unwary, unless caution is used.
This may be an additional reason to avoid the use of CR entirely,
except in sequence with LF, as suggested above.
The discussion about Unicode versions above (see Section 4 and
Section 5.2) makes several assumptions about future versions of
Unicode, about NFC normalization being applied properly, and about
UTF-8 being processed and transmitted exactly as specified in RFC
3629. If any of those assumptions are not correct, then there are
cases in which strings that would be considered equivalent do not
compare equal. Robust code should be prepared for those
possibilities.
7. Acknowledgments
Many thanks to Mark Davis, Martin Duerst, and Michel Suignard for
suggestions about Unicode normalization that led to the format
described here, and especially to Mark for providing the paragraphs
that describe the role of NFC. Thanks also to Mark, Doug Ewell, and
Asmus Freytag for corrected text describing Unicode transmission
forms, and to Tim Bray, Carsten Bormann, Stephane Bortzmeyer, Martin
Duerst, Frank Ellermann, Clive D.W. Feather, Ted Hardie, Bjoern
Hoehrmann, Alfred Hoenes, Kent Karlsson, Bill McQuillan, George
Michaelson, Chris Newman, and Marcos Sanz for a number of helpful
comments and clarification requests.
Appendix A. History and Context
This appendix contains a review of prior work in the ARPANET and
Internet to establish a standard text type, work that establishes the
context and motivation for the approach taken in this document. The
text is explanatory rather than normative: nothing in this appendix is
intended to change or update any current specification. Those who
are uninterested in this review and analysis can safely skip this
appendix.
One of the earlier application design decisions made in the
development of ARPANET, a decision that was carried forward into the
Internet, was the decision to standardize on a single and very
specific coding for "text" to be passed across the network [RFC0020].
Hosts on the network were then responsible for translating or mapping
from whatever character coding conventions were used locally to that
common intermediate representation, with sending hosts mapping to it
and receiving ones mapping from it to their local forms as needed.
It is interesting to note that at the time the ARPANET was being
developed, participating host operating systems used at least three
different character coding standards: the antiquated BCD (Binary
Coded Decimal), the then-dominant major manufacturer-backed EBCDIC
(Extended BCD Interchange Code), and the then-still emerging ASCII
(American Standard Code for Information Interchange). Since the
ARPANET was an "open" project and EBCDIC was intimately linked to a
particular hardware vendor, the original Network Working Group agreed
that its standard should be ASCII. That ASCII form was precisely
"7-bit ASCII in an 8-bit field", which was in effect a compromise
between hosts that were natively 7-bit oriented (e.g., with five
seven-bit characters in a 36-bit word), those that were 8-bit
oriented (using eight-bit characters) and those that placed the
seven-bit ASCII characters in 9-bit fields with two leading zero bits
(four characters in a 36-bit word).
More standardization was suggested in the first preliminary
description of the Telnet protocol [RFC0097]. With the iterations of
that protocol [RFC0137] [RFC0139] and the drawing together of an
essentially formal definition somewhat later [RFC0318], a standard
abstraction, the Network Virtual Terminal (NVT), was established. NVT
character-coding conventions (initially called "Telnet ASCII" and
later called "NVT ASCII", or, more casually, "network ASCII")
included the requirement that Carriage Return followed by Line Feed
(CRLF) be the common representation for ending lines of text (given
that some participating "Host" operating systems used the one
natively, some the other, at least one used both, and a few used
neither, preferring variable-length lines with counts or special
delimiters or markers instead) and specified conventions for some
other characters. Also, since NVT ASCII was restricted to seven-bit
characters, use of the high-order bit in octets was reserved for the
transmission of control signaling information.
At a very high level, the concept was that a system could use
whatever character coding and line representations were appropriate
locally, but text transmitted over the network as text must conform
to the single "network virtual terminal" convention. Virtually all
early Internet protocols that presume transfer of "text" assume this
virtual terminal model, although different ones assume or limit it in
different ways. Telnet, the command stream and ASCII Type in FTP
[RFC0542], the message stream in SMTP transfer [RFC2821], and the
strings passed to finger [RFC0742] and whois [RFC0954] are the
classic examples. More recently, HTTP [RFC1945] [RFC2616] follows
the same general model but permits 8-bit data and leaves the line end
sequence unspecified (the latter has been the source of a significant
number of problems).
Appendix B. The ASCII NVT Definition
The main body of this specification is intended as an update to, and
internationalized version of, the Net-ASCII definition. The
specification is self-contained in that parts of the Net-ASCII
definition that are no longer recommended are not included above.
Because Net-ASCII evolved somewhat over time and there has been
debate about which specification is the "official" Net-ASCII, it is
appropriate to review the key elements of that definition here. This
review is informal with regard to the contents of Net-ASCII and
should not be considered as a normative update or summary of the
earlier specifications (Section 2 does specify some normative updates
to those specifications and some comments below are consistent with
it).
The first part of the section titled "THE NVT PRINTER AND KEYBOARD"
in RFC 854 [RFC0854] is generally, although not universally,
considered to be the normative definition of the (ASCII) Network
Virtual Terminal and hence of Net-ASCII. It includes not only the
graphic ASCII characters but a number of control characters. The
latter are given Internet-specific meanings that are often more
specific than the definitions in the ASCII specification. In today's
usage, and for the present specification, the following
clarifications and updates to that list should be noted. Each one is
accompanied by a brief explanation of the reason why the original
specification is no longer appropriate.
1. The "defined but not required" codes -- BEL (U+0007), BS
(U+0008), HT (U+0009), VT (U+000B), and FF (U+000C) -- and the
undefined control codes ("C0") SHOULD NOT be used unless required
by exceptional circumstances. Either their original "network
printer" definitions are no longer in general use, common
practice has evolved away from the formats specified there, or
their use to simulate characters that are better handled by
Unicode is no longer appropriate. While the appearance of some
of these characters on the list may seem surprising, BS now has
an ambiguous interpretation in practice (erasing in some systems
but not in others), the width associated with HT varies with the
environment, and VT and FF do not have a uniform effect with
regard to either vertical positioning or the associated
horizontal position result. Of course, telnet escapes are not
considered part of the data stream and hence are unaffected by
this provision.
2. In Net-ASCII, CR MUST NOT appear except when immediately followed
by either NUL or LF, with the latter (CR LF) designating the "new
line" function. Today and as specified above, CR should
generally appear only when followed by LF. Because page layout
is better done in other ways, because NUL has a special
interpretation in some programming languages, and to avoid other
types of confusion, CR NUL should preferably be avoided as
specified above.
3. LF CR SHOULD NOT appear except as a side-effect of multiple CR LF
sequences (e.g., CR LF CR LF).
4. The historical NVT documents do not call out either "bare LF" (LF
without CR) or HT for special treatment. Both have generally
been understood to be problematic. In the case of LF, there is a
difference in interpretation as to whether its semantics imply
"go to same position on the next line" or "go to the first
position on the next line" and interoperability considerations
suggest not depending on which interpretation the receiver
applies. At the same time, misinterpretation of LF is less
harmful than misinterpretation of "bare" CR: in the CR case, text
may be erased or made completely unreadable; in the LF one, the
worst consequence is a very funny-looking display. Obviously, HT
is problematic because there is no standard way to transmit
intended tab position or width information in running text.
Again, the harm is unlikely to be great if HT is simply
interpreted as one or more spaces, but, in general, it cannot be
relied upon to format information.
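The four items above can be approximated by a simple scanner over an octet stream. The Python sketch below is illustrative only: the function name find_problems and the wording of its report strings are assumptions, and it does not attempt to recognize telnet escapes. It reports the discouraged control codes from item 1, CR NUL, bare CR, and bare LF. An LF CR sequence that is not part of CR LF CR LF shows up as a bare LF, a bare CR, or both.

```python
# Controls that item 1 says SHOULD NOT be used: BEL, BS, HT, VT, FF.
DISCOURAGED = {0x07: "BEL", 0x08: "BS", 0x09: "HT", 0x0B: "VT", 0x0C: "FF"}
CR, LF, NUL = 0x0D, 0x0A, 0x00


def find_problems(data: bytes) -> list[tuple[int, str]]:
    """Return (offset, description) pairs for constructs discouraged above."""
    problems = []
    for i, octet in enumerate(data):
        nxt = data[i + 1] if i + 1 < len(data) else None
        prev = data[i - 1] if i > 0 else None
        if octet in DISCOURAGED:
            problems.append((i, DISCOURAGED[octet]))
        elif octet == CR:
            if nxt == NUL:
                problems.append((i, "CR NUL"))   # item 2: preferably avoided
            elif nxt != LF:
                problems.append((i, "bare CR"))  # item 2: CR only before LF
        elif octet == LF:
            if prev != CR:
                problems.append((i, "bare LF"))  # item 4
        elif octet < 0x20 and octet != NUL:
            # Remaining C0 controls; NUL itself is a defined NVT code.
            problems.append((i, "other control"))
    return problems
```

For example, find_problems(b"a\tb") reports an HT at offset 1, while a stream consisting only of graphic characters and CR LF pairs yields an empty list.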
It is worth noting that the telnet IAC character (an octet consisting
of all ones, i.e., %xFF) itself is not a problem for UTF-8 since that
particular octet cannot appear in a valid UTF-8 string. However,
while few of them have been used, telnet permits other command-
introducer characters whose bit sequences in an octet may be part of
valid UTF-8 characters. While it causes no ambiguity in UTF-8,
Unicode assigns a graphic character ("Latin Small Letter Y with
Diaeresis") to U+00FF (octets C3 BF in UTF-8). Some caution is
clearly in order in this area.
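The relationship between the IAC octet and the code point U+00FF can be checked with a few lines of Python. This is a minimal sketch: the helper name is_valid_utf8 is an assumption made for the example, and it relies on Python's built-in strict UTF-8 codec.

```python
IAC = 0xFF  # telnet "Interpret As Command" octet

# U+00FF is a graphic character, but its UTF-8 encoding is the two
# octets C3 BF, neither of which is the IAC octet.
encoded = "\u00ff".encode("utf-8")
print(encoded.hex())  # c3bf


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under a strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A lone FF octet is never valid UTF-8; the encoded character is.
print(is_valid_utf8(bytes([IAC])))  # False
print(is_valid_utf8(encoded))       # True
```

The distinction matters in practice: a receiver that confuses the code point value FF with the octet value FF could treat a legitimate character as a telnet command introducer.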
Appendix C. The Line-Ending Problem
The definition of how a line ending should be denoted in plain text
strings on the wire for the Internet has been controversial from even
before the introduction of NVT. Some have argued that recipients
should be required to interpret almost anything that a sender might
intend as a line ending as actually a line ending. Others have
pointed out that this would lead to some ambiguities of
interpretation and presentation and would violate the principle that
we should minimize the number of forms that are permitted on the wire
in order to promote interoperability and eliminate the "every
recipient needs to understand every sender format" problem. The
design of this specification, like that of NVT, takes the latter
approach. Its designers believe that there is little point in a
standard if it is to specify "anyone can do whatever they like and
the receiver just needs to cope".
A further discussion of the nature and evolution of the line-ending
problem appears in Section 5.8 of the Unicode Standard [Unicode] and
is suggested for additional reading. If we were starting with the
Internet today, it would probably be sensible to follow the
recommendation there and use LS (U+2028) exclusively, in preference
to CRLF. However, the installed base of use of CRLF and the
importance of forward compatibility with NVT and protocols that
assume it make that impossible, so it is necessary to continue using
CRLF as the "New Line Function" ("NLF", see the terminology section
in that reference).
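The difference between the two approaches discussed in this appendix can be seen in a short Python sketch. It is illustrative only, and the sample string is invented for the example. Python's str.splitlines() behaves like the permissive receiver, accepting CR LF, bare LF, LS (U+2028), and several other characters as line boundaries, while splitting on CR LF alone reflects the single wire form adopted here.

```python
# Sample text mixing three candidate line endings: CR LF, LS, and bare LF.
text = "first\r\nsecond\u2028third\nfourth"

# Permissive receiver: anything that might be a line ending is one.
lenient = text.splitlines()
print(lenient)  # ['first', 'second', 'third', 'fourth']

# Strict receiver: only CR LF ends a line on the wire.
strict = text.split("\r\n")
print(len(strict))  # 2
```

The two receivers disagree about how many lines the same octets contain, which is the kind of ambiguity a single permitted form is meant to avoid.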
Appendix D. A Note about Related Future Work
Consideration should be given to a Telnet (or SSH [RFC4251]) option
to specify this type of stream and an FTP extension [RFC0959] to
permit a new "Unicode text" data TYPE.
References
Normative References
[ISO10646] International Organization for Standardization,
"Information Technology - Universal Multiple-Octet
Coded Character Set (UCS) - Part 1: Architecture
and Basic Multilingual Plane", ISO/IEC 10646-1:2000,
October 2000.
[NFC] Davis, M. and M. Duerst, "Unicode Standard Annex
#15: Unicode Normalization Forms", October 2006,
<http://www.unicode.org/reports/tr15/>.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
10646", STD 63, RFC 3629, November 2003.
[RFC5234] Crocker, D. and P. Overell, "Augmented BNF for
Syntax Specifications: ABNF", STD 68, RFC 5234,
January 2008.
[Unicode] The Unicode Consortium, "The Unicode Standard,
Version 5.0", 2007.
Boston, MA, USA: Addison-Wesley. ISBN
0-321-48091-0
[Unicode32] The Unicode Consortium, "The Unicode Standard,
Version 3.0", 2000.
(Reading, MA, Addison-Wesley, 2000. ISBN
0-201-61633-5). Version 3.2 consists of the definition
in that book as amended by the Unicode Standard
Annex #27: Unicode 3.1
(http://www.unicode.org/reports/tr27/) and by the
Unicode Standard Annex #28: Unicode 3.2
(http://www.unicode.org/reports/tr28/).
Informative References
[ASCII] American National Standards Institute (formerly
United States of America Standards Institute), "USA
Code for Information Interchange", ANSI X3.4-1968,
1968.
ANSI X3.4-1968 has been replaced by newer versions
with slight modifications, but the 1968 version
remains definitive for the Internet. ISO 646
International Reference Version (IRV)
[ISO.646.1991] is usually considered equivalent to
ASCII.
[ISO.646.1991] International Organization for Standardization,
"Information technology - ISO 7-bit coded character
set for information interchange", ISO Standard 646,
1991.
[NamedSequences] The Unicode Consortium, "NamedSequences-4.1.0.txt",
2005, <http://www.unicode.org/Public/UNIDATA/NamedSequences.txt>.
[RFC0020] Cerf, V., "ASCII format for network interchange",
RFC 20, October 1969.
[RFC0097] Melvin, J. and R. Watson, "First Cut at a Proposed
Telnet Protocol", RFC 97, February 1971.
[RFC0137] O'Sullivan, T., "Telnet Protocol - a proposed
document", RFC 137, April 1971.
[RFC0139] O'Sullivan, T., "Discussion of Telnet Protocol",
RFC 139, May 1971.
[RFC0318] Postel, J., "Telnet Protocols", RFC 318,
April 1972.
[RFC0542] Neigus, N., "File Transfer Protocol", RFC 542,
August 1973.
[RFC0698] Mock, T., "Telnet extended ASCII option", RFC 698,
July 1975.
[RFC0742] Harrenstien, K., "NAME/FINGER Protocol", RFC 742,
December 1977.
[RFC0854] Postel, J. and J. Reynolds, "Telnet Protocol
Specification", STD 8, RFC 854, May 1983.
[RFC0954] Harrenstien, K., Stahl, M., and E. Feinler,
"NICNAME/WHOIS", RFC 954, October 1985.
[RFC0959] Postel, J. and J. Reynolds, "File Transfer
Protocol", STD 9, RFC 959, October 1985.
[RFC1945] Berners-Lee, T., Fielding, R., and H. Nielsen,
"Hypertext Transfer Protocol -- HTTP/1.0",
RFC 1945, May 1996.
[RFC2277] Alvestrand, H., "IETF Policy on Character Sets and
Languages", BCP 18, RFC 2277, January 1998.
[RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
Masinter, L., Leach, P., and T. Berners-Lee,
"Hypertext Transfer Protocol -- HTTP/1.1",
RFC 2616, June 1999.
[RFC2781] Hoffman, P. and F. Yergeau, "UTF-16, an encoding of
ISO 10646", RFC 2781, February 2000.
[RFC2821] Klensin, J., "Simple Mail Transfer Protocol",
RFC 2821, April 2001.
[RFC3454] Hoffman, P. and M. Blanchet, "Preparation of
Internationalized Strings ("stringprep")",
RFC 3454, December 2002.
[RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A
Stringprep Profile for Internationalized Domain
Names (IDN)", RFC 3491, March 2003.
[RFC3912] Daigle, L., "WHOIS Protocol Specification",
RFC 3912, September 2004.
[RFC4251] Ylonen, T. and C. Lonvick, "The Secure Shell (SSH)
Protocol Architecture", RFC 4251, January 2006.
[RFC4690] Klensin, J., Faltstrom, P., Karp, C., and IAB,
"Review and Recommendations for Internationalized
Domain Names (IDNs)", RFC 4690, September 2006.
Authors' Addresses
John C Klensin
1770 Massachusetts Ave, #322
Cambridge, MA 02140
USA
Phone: +1 617 491 5735
EMail: john-ietf@jck.com
Michael A. Padlipsky
8011 Stewart Ave.
Los Angeles, CA 90045
USA
Phone: +1 310-670-4288
EMail: the.map@alum.mit.edu
Full Copyright Statement
Copyright (C) The IETF Trust (2008).
This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.
This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.