UTF-8 extended to 4 byte encoding

author: Charles Forsyth <charles.forsyth@gmail.com> 2013-06-08 10:20:28 +0000
committer: Charles Forsyth <charles.forsyth@gmail.com> 2013-06-08 10:20:28 +0000
commit: 1aedbd062644a7fc13c582534c097ca08af4414d (patch)
tree: c98e368b1c36957890b7bb04f37e5742c5712339 /man/6
parent: b82549ec3500645cb6c8ff9d911d7c22ea451b98 (diff)
1 files changed, 42 insertions, 18 deletions
diff --git a/man/6/utf b/man/6/utf
index 0e7f04d3..12832a16 100644
--- a/man/6/utf
+++ b/man/6/utf
@@ -1,13 +1,13 @@
 .TH UTF 6
 .SH NAME
-UTF, Unicode, ASCII \- character set and format
+UTF, Unicode, ASCII, rune \- character set and format
 .SH DESCRIPTION
 The Inferno character set and representation are
 based on the Unicode Standard and on the ISO multibyte
 .SM UTF-8
 encoding (Universal Character
 Set Transformation Format, 8 bits wide).
-The Unicode Standard represents its characters in 16
+The Unicode Standard represents its characters in 21
 bits;
 .SM UTF-8
 represents such
@@ -17,7 +17,9 @@ Throughout this manual,
 is shortened to
 .SM UTF.
 .PP
-Internally, programs store individual Unicode characters as 16-bit values.
+Internally, programs store individual Unicode characters as 32-bit integers,
+of which only 21 bits are currently used.
+Documentation often refers to them as `runes', following Plan 9.
 However, any external manifestation of textual information,
 in files or at the interface between programs, uses the
 machine-independent, byte-stream encoding called
@@ -36,18 +38,27 @@ The
 .SM UTF
 encoding of the Unicode Standard is backward compatible with
 .SM ASCII\c
-: Inferno programs handle
+: programs presented only with
 .SM ASCII
-text, as well as uninterpreted byte streams, without special arrangement.
+work on Inferno
+even if not written to deal with
+.SM UTF\c
+,
+as do
+programs that deal with uninterpreted byte streams.
 However, programs that perform semantic processing on
 characters must convert from
 .SM UTF
-to Unicode
+to runes
 in order to work properly with non-\c
 .SM ASCII
 input.
 Normally, all necessary conversions are done by the Limbo compiler
-and execution environment, but sometimes more is necessary, such
+and execution environment, when converting between
+.B "array of byte"
+and
+.B "string" ,
+but sometimes more is needed, such
 as when a program receives
 .SM UTF
 input one byte at a time;
@@ -56,29 +67,42 @@ see
 for routines to handle such processing.
 .PP
 Letting numbers be binary,
-a Unicode character x
-is converted to a multibyte
+a rune x is converted to a multibyte
 .SM UTF
 sequence
 as follows:
-.EX
-01.   x in [00000000.0bbbbbbb] \(-> 0bbbbbbb
-10.   x in [00000bbb.bbbbbbbb] \(-> 110bbbbb, 10bbbbbb
-11.   x in [bbbbbbbb.bbbbbbbb] \(-> 1110bbbb, 10bbbbbb, 10bbbbbb
-.EE
+.PP
+01.   x in [000000.00000000.0bbbbbbb] → 0bbbbbbb
+.br
+10.   x in [000000.00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
+.br
+11.   x in [000000.bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
+.br
+100.  x in [bbbbbb.bbbbbbbb.bbbbbbbb] → 11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
+.br
+.PP
 .PP
 Conversion 01 provides a one-byte sequence that spans the
 .SM ASCII
 character set in a compatible way.
-Conversions 10 and 11 represent higher-valued characters
-as sequences of two or three bytes with the high bit set.
-Inferno does not support the 4-, 5-, and 6-byte sequences proposed by X-Open.
+Conversions 10, 11 and 100 represent higher-valued characters
+as sequences of two, three or four bytes with the high bit set.
+Inferno does not support the 5 and 6 byte sequences proposed by X-Open.
 When there are multiple ways to encode a value, for example rune 0,
 the shortest encoding is used.
 .PP
 In the inverse mapping,
 any sequence except those described above
-is incorrect and is converted to the Unicode value of hexadecimal 0080.
+is incorrect and is converted to the rune hexadecimal FFFD.
+.SH FILES
+.TF "/lib/unicode "
+.TP
+.B /lib/unicode
+table of characters and descriptions, suitable for
+.IR look (1).
 .SH "SEE ALSO"
+.IR ascii (1),
+.IR tcs (1),
 .IR sys-byte2char (2),
+.IR keyboard (6),
 .IR "The Unicode Standard" .
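Note on the conversion table: the four conversions and the inverse mapping described in the patched page can be sketched in a few lines. The following is an illustrative sketch only, not the Inferno implementation; the function names rune_to_utf and utf_to_rune and the constant ERROR are hypothetical. It assumes the standard UTF-8 lead byte 11110bbb for four-byte sequences, an upper limit of hexadecimal 10FFFF, and that a bad sequence consumes one byte and yields rune FFFD.

```python
ERROR = 0xFFFD  # rune returned for any incorrect sequence, as the man page states


def rune_to_utf(x: int) -> bytes:
    """Encode one rune as UTF-8, always choosing the shortest encoding."""
    if x < 0 or x > 0x10FFFF:
        x = ERROR  # assumption: out-of-range values become the error rune
    if x < 0x80:          # conversion 01: 0bbbbbbb
        return bytes([x])
    if x < 0x800:         # conversion 10: 110bbbbb, 10bbbbbb
        return bytes([0xC0 | (x >> 6), 0x80 | (x & 0x3F)])
    if x < 0x10000:       # conversion 11: 1110bbbb, 10bbbbbb, 10bbbbbb
        return bytes([0xE0 | (x >> 12),
                      0x80 | ((x >> 6) & 0x3F),
                      0x80 | (x & 0x3F)])
    # conversion 100: 11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
    return bytes([0xF0 | (x >> 18),
                  0x80 | ((x >> 12) & 0x3F),
                  0x80 | ((x >> 6) & 0x3F),
                  0x80 | (x & 0x3F)])


def utf_to_rune(b: bytes):
    """Decode the first rune of b; return (rune, number of bytes consumed)."""
    if not b:
        raise ValueError("empty input")
    c = b[0]
    if c < 0x80:
        return c, 1
    if 0xC0 <= c < 0xE0:
        n, x, lowest = 2, c & 0x1F, 0x80
    elif 0xE0 <= c < 0xF0:
        n, x, lowest = 3, c & 0x0F, 0x800
    elif 0xF0 <= c < 0xF8:
        n, x, lowest = 4, c & 0x07, 0x10000
    else:
        return ERROR, 1   # stray continuation byte, or a 5/6-byte lead byte
    if len(b) < n:
        return ERROR, 1   # truncated sequence
    for k in b[1:n]:
        if k & 0xC0 != 0x80:
            return ERROR, 1   # not a 10bbbbbb continuation byte
        x = (x << 6) | (k & 0x3F)
    if x < lowest or x > 0x10FFFF:
        return ERROR, 1   # not the shortest encoding, or out of range
    return x, n
```

A program that receives UTF one byte at a time could buffer up to four bytes and call utf_to_rune once enough bytes have arrived for the lead byte seen; the check against lowest is what rejects encodings longer than the shortest one.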