summaryrefslogtreecommitdiff
path: root/doc/rfc/rfc1843.txt
diff options
context:
space:
mode:
Diffstat (limited to 'doc/rfc/rfc1843.txt')
-rw-r--r--doc/rfc/rfc1843.txt283
1 files changed, 283 insertions, 0 deletions
diff --git a/doc/rfc/rfc1843.txt b/doc/rfc/rfc1843.txt
new file mode 100644
index 0000000..088af59
--- /dev/null
+++ b/doc/rfc/rfc1843.txt
@@ -0,0 +1,283 @@
+
+
+
+
+
+
+Network Working Group F. Lee
+Request for Comments: 1843 Stanford University
+Category: Informational August 1995
+
+
+ HZ - A Data Format for Exchanging Files of
+ Arbitrarily Mixed Chinese and ASCII characters
+
+Status of this Memo
+
+ This memo provides information for the Internet community. This memo
+ does not specify an Internet standard of any kind. Distribution of
+ this memo is unlimited.
+
+Abstract
+
+ The content of this memo is identical to an article of the same title
+ written by the author on September 4, 1989. In this memo, GB stands
+ for GB2312-80. Note that the title is kept only for historical
+ reasons. HZ has been widely used for purposes other than "file
+ exchange".
+
+1. Introduction
+
+ Most existing computer systems which can handle a text file of
+ arbitrarily mixed Chinese and ASCII characters use 8-bit codes. To
+ exchange such text files through electronic mail on ASCII computer
+ systems, it is necessary to encode them in a 7-bit format. A generic
+ binary to ASCII encoder is not sufficient, because there is currently
+ no universal standard for such 8-bit codes. For example, CCDOS and
+ Macintosh's Chinese OS use different internal codes. Fortunately,
+ there is a PRC national standard, GuoBiao (GB), for the encoding of
+ Chinese characters, and Chinese characters encoded in the above
+ systems can be easily converted to GB by a simple formula. (* The ROC
+ standard BIG-5 is outside the scope of this article.)
+
+ HZ is a 7-bit data format proposed for arbitrarily mixed GB and ASCII
+ text file exchange. HZ is also intended for the design of terminal
+ emulators that display and edit mixed Chinese and ASCII text files in
+ real time.
+
+
+
+
+
+
+
+
+
+
+
+Lee Informational [Page 1]
+
+RFC 1843 HZ - A Data Format for Exchanging Files August 1995
+
+
+2. Specification
+
+ The format of HZ is described in the following.
+
+ Without loss of generality, we assume that all Chinese characters
+ (HanZi) have already been encoded in GB. A GB (GB1 and GB2) code is
+ a two byte code, where the first byte is in the range $21-$77
+ (hexadecimal), and the second byte is in the range $21-$7E.
+
+ A graphical ASCII character is a byte in the range $21-$7E. A non-
+ graphical ASCII character is a byte in the range $0-$20 or of the
+ value $7F.
+
+ Since the range of a graphical ASCII character overlaps that of a GB
+ byte, a byte in the range $21-$7E is interpreted according to the
+ mode it is in. There are two modes, namely ASCII mode and GB mode.
+
+ By convention, a non-graphical ASCII character should only appear in
+ ASCII mode.
+
+ The default mode is ASCII mode.
+
+ In ASCII mode, a byte is interpreted as an ASCII character, unless a
+ '~' is encountered. The character '~' is an escape character. By
+ convention, it must be immediately followed ONLY by '~', '{' or '\n'
+ (<LF>), with the following special meaning.
+
+ o The escape sequence '~~' is interpreted as a '~'.
+ o The escape-to-GB sequence '~{' switches the mode from ASCII to
+ GB.
+ o The escape sequence '~\n' is a line-continuation marker to be
+ consumed with no output produced.
+
+ In GB mode, characters are interpreted two bytes at a time as (pure)
+ GB codes until the escape-from-GB code '~}' is read. This code
+ switches the mode from GB back to ASCII. (Note that the escape-
+ from-GB code '~}' ($7E7D) is outside the defined GB range.)
+
+ The decoding process is clear from the above description.
+
+ The encoding process is straightforward. Note that an (ASCII) '~' is
+ always encoded as '~~'. A sequence of GB codes is enclosed in '~{'
+ and '~}'.
+
+
+
+
+
+
+
+
+Lee Informational [Page 2]
+
+RFC 1843 HZ - A Data Format for Exchanging Files August 1995
+
+
+3. Remarks & Recommendations
+
+ We choose to encode any ASCII character except '~' as it is, rather
+ than as a two byte code, and we choose ASCII as the default mode for
+ the following reasons. The computer systems we use is ASCII based. A
+ HZ file containing pure ASCII characters (i.e. no Chinese characters)
+ except '~' is precisely a pure ASCII file. In general, the English
+ (ASCII) portion of a HZ file is directly readable.
+
+ The escape character '~' is chosen not only because it is commonly
+ used in the ASCII world, but also because '~' ($7E) is outside the
+ defined range ($21-$77) of the first byte of a GB code.
+
+ In ASCII mode, other potential escape sequences, i.e., two byte
+ sequences beginning with '~' (other than '~~', '~{', '~\n') are
+ currently invalid HZ sequences. Hence, they can be used for future
+ extension of HZ with total upward compatibility.
+
+ The line-continuation marker '~\n' is useful if one wants to encode
+ long lines in the original text into short lines in this data format
+ without introducing extra newline characters in the decoding process.
+
+ There is no limit on the length of a line. In fact, the whole file
+ could be one long line or even contain no newline characters. Any
+ DECODER of this HZ data format should not and has no need to operate
+ on the concept of a line.
+
+ It is easy to write encoders and decoders for HZ. An encoder or
+ decoder needs to lookahead at most one character in the input data
+ stream.
+
+ Given the current mode, it is also possible and easy to decode a HZ
+ data stream by scanning backward. One of the implication is that
+ "backspaces" can be handled correctly by a terminal emulator.
+
+ To facilitate the effective use of programs supporting line/page
+ skips such as "more" on UNIX with a terminal emulator understanding
+ the HZ format, it is RECOMMENDED that the ENCODER (which outputs in
+ HZ) sets a maximum line size of less than 80 characters. Since '\n'
+ is an ASCII character, the syntax of HZ then automatically implies
+ that GB codes appearing at the end of a line must be terminated with
+ the escape-from-GB code '~}', and the line-continuation marker '~\n'
+ should be inserted appropriately. The price to paid is that the
+ encoded file size is slightly larger.
+
+ It is important to understand the following distinction. Note that
+ the above recommendation does NOT change the HZ format. It is simply
+ an encoding "style" which follows the syntax of HZ. Note that this
+
+
+
+Lee Informational [Page 3]
+
+RFC 1843 HZ - A Data Format for Exchanging Files August 1995
+
+
+ "style" is not built into HZ. It is an additional convention built
+ "on top of" HZ. Other applications may require different "styles",
+ but the same basic HZ DECODER will always work. The essence of HZ is
+ to provide such a flexible basic data format for files of arbitrarily
+ mixed Chinese and ASCII characters.
+
+4. Examples
+
+ To illustrate the "stylistic" issue of HZ encoding, we give the
+ following four examples of encoded text, which should produce the
+ same decoded output. (The recommendation in the last section refers
+ to Example 2.)
+
+ Example 1: (Suppose there is no line size limit.)
+ This sentence is in ASCII.
+ The next sentence is in GB.~{<:Ky2;S{#,NpJ)l6HK!#~}Bye.
+
+ Example 2: (Suppose the maximum line size is 42.)
+ This sentence is in ASCII.
+ The next sentence is in GB.~{<:Ky2;S{#,~}~
+ ~{NpJ)l6HK!#~}Bye.
+
+ Example 3: (Suppose a new line is started whenever there is a mode
+ switch.)
+ This sentence is in ASCII.
+ The next sentence is in GB.~
+ ~{<:Ky2;S{#,NpJ)l6HK!#~}~
+ Bye.
+
+Acknowledgement
+
+ Edmund Lai was the first one who brought my attention to this topic.
+ Discussions with Ed, Tin-Fook Ngai, Yagui Wei and Ricky Yeung were
+ very helpful in shaping the ideas in this article. Thanks to Tin-Fook
+ for his careful review of the draft and numerous interesting
+ suggestions.
+
+References
+
+ [1] Fung Fung Lee, "HZ - A Data Format for Exchanging Files of
+ Arbitrarily Mixed Chinese and ASCII Characters," September 4,
+ 1989.
+ As part of //ftp.ifcss.org/software/unix/convert/HZ-2.0.tar.gz
+
+Security Considerations
+
+ Security issues are not addressed in this memo.
+
+
+
+
+Lee Informational [Page 4]
+
+RFC 1843 HZ - A Data Format for Exchanging Files August 1995
+
+
+Author's Address
+
+ Fung Fung Lee
+ Computer Systems Laboratory
+ Stanford University
+ Stanford, CA 94309
+
+ Phone: +1 415 723 1450
+ EMail: lee@csl.stanford.edu
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+Lee Informational [Page 5]
+