author Charles Forsyth <charles.forsyth@gmail.com> 2013-06-08 10:20:28 +0000
committer Charles Forsyth <charles.forsyth@gmail.com> 2013-06-08 10:20:28 +0000
commit 1aedbd062644a7fc13c582534c097ca08af4414d (patch)
tree c98e368b1c36957890b7bb04f37e5742c5712339 /man/6
parent b82549ec3500645cb6c8ff9d911d7c22ea451b98 (diff)
UTF-8 extended to 4 byte encoding
Diffstat (limited to 'man/6')
-rw-r--r-- man/6/utf 60
1 files changed, 42 insertions, 18 deletions
diff --git a/man/6/utf b/man/6/utf
index 0e7f04d3..12832a16 100644
--- a/man/6/utf
+++ b/man/6/utf
@@ -1,13 +1,13 @@
.TH UTF 6
.SH NAME
-UTF, Unicode, ASCII \- character set and format
+UTF, Unicode, ASCII, rune \- character set and format
.SH DESCRIPTION
The Inferno character set and representation are
based on the Unicode Standard and on the ISO multibyte
.SM UTF-8
encoding (Universal Character
Set Transformation Format, 8 bits wide).
-The Unicode Standard represents its characters in 16
+The Unicode Standard represents its characters in 21
bits;
.SM UTF-8
represents such
@@ -17,7 +17,9 @@ Throughout this manual,
is shortened to
.SM UTF.
.PP
-Internally, programs store individual Unicode characters as 16-bit values.
+Internally, programs store individual Unicode characters as 32-bit integers,
+of which only 21 bits are currently used.
+Documentation often refers to them as `runes', following Plan 9.
However, any external manifestation of textual information,
in files or at the interface between programs, uses the
machine-independent, byte-stream encoding called
@@ -36,18 +38,27 @@ The
.SM UTF
encoding of the Unicode Standard is backward compatible with
.SM ASCII\c
-: Inferno programs handle
+: programs presented only with
.SM ASCII
-text, as well as uninterpreted byte streams, without special arrangement.
+work on Inferno
+even if not written to deal with
+.SM UTF\c
+,
+as do
+programs that deal with uninterpreted byte streams.
However, programs that perform semantic processing on
characters must convert from
.SM UTF
-to Unicode
+to runes
in order to work properly with non-\c
.SM ASCII
input.
Normally, all necessary conversions are done by the Limbo compiler
-and execution environment, but sometimes more is necessary, such
+and execution environment, when converting between
+.B "array of byte"
+and
+.B "string" ,
+but sometimes more is needed, such
as when a program receives
.SM UTF
input one byte at a time;
@@ -56,29 +67,42 @@ see
for routines to handle such processing.
.PP
Letting numbers be binary,
-a Unicode character x
-is converted to a multibyte
+a rune x is converted to a multibyte
.SM UTF
sequence
as follows:
-.EX
-01. x in [00000000.0bbbbbbb] \(-> 0bbbbbbb
-10. x in [00000bbb.bbbbbbbb] \(-> 110bbbbb, 10bbbbbb
-11. x in [bbbbbbbb.bbbbbbbb] \(-> 1110bbbb, 10bbbbbb, 10bbbbbb
-.EE
+.PP
+01. x in [000000.00000000.0bbbbbbb] → 0bbbbbbb
+.br
+10. x in [000000.00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
+.br
+11. x in [000000.bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
+.br
+100. x in [bbbbbb.bbbbbbbb.bbbbbbbb] → 11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
+.br
+.PP
.PP
Conversion 01 provides a one-byte sequence that spans the
.SM ASCII
character set in a compatible way.
-Conversions 10 and 11 represent higher-valued characters
-as sequences of two or three bytes with the high bit set.
-Inferno does not support the 4-, 5-, and 6-byte sequences proposed by X-Open.
+Conversions 10, 11 and 100 represent higher-valued characters
+as sequences of two, three or four bytes with the high bit set.
+Inferno does not support the 5- and 6-byte sequences proposed by X-Open.
When there are multiple ways to encode a value, for example rune 0,
the shortest encoding is used.
.PP
In the inverse mapping,
any sequence except those described above
-is incorrect and is converted to the Unicode value of hexadecimal 0080.
+is incorrect and is converted to the rune with hexadecimal value FFFD.
+.SH FILES
+.TF "/lib/unicode "
+.TP
+.B /lib/unicode
+table of characters and descriptions, suitable for
+.IR look (1).
.SH "SEE ALSO"
+.IR ascii (1),
+.IR tcs (1),
.IR sys-byte2char (2),
+.IR keyboard (6),
.IR "The Unicode Standard" .
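
Review note: the four conversions in the patched page can be checked against a small stand-alone sketch. The C below is not Inferno's library code. The names rune_to_utf, utf_to_rune, RUNE_ERROR and RUNE_MAX are hypothetical, chosen for illustration. It assumes that values above hexadecimal 10FFFF are replaced by FFFD, that a malformed sequence consumes one byte, and that the 4-byte lead byte has the standard UTF-8 form 11110bbb. Surrogate values are not treated specially.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the man page only fixes the value FFFD for bad input. */
enum { RUNE_ERROR = 0xFFFD, RUNE_MAX = 0x10FFFF };

/* Encode rune r into s (room for 4 bytes); return the number of bytes written. */
int rune_to_utf(unsigned char *s, uint32_t r)
{
    assert(s != NULL);
    if (r > RUNE_MAX)                 /* assumption: out-of-range becomes FFFD */
        r = RUNE_ERROR;
    if (r < 0x80) {                   /* conversion 01: 0bbbbbbb */
        s[0] = (unsigned char)r;
        return 1;
    }
    if (r < 0x800) {                  /* conversion 10: 110bbbbb 10bbbbbb */
        s[0] = (unsigned char)(0xC0 | (r >> 6));
        s[1] = (unsigned char)(0x80 | (r & 0x3F));
        return 2;
    }
    if (r < 0x10000) {                /* conversion 11: 1110bbbb + 2 tails */
        s[0] = (unsigned char)(0xE0 | (r >> 12));
        s[1] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
        s[2] = (unsigned char)(0x80 | (r & 0x3F));
        return 3;
    }
    /* conversion 100: 11110bbb + 3 tails */
    s[0] = (unsigned char)(0xF0 | (r >> 18));
    s[1] = (unsigned char)(0x80 | ((r >> 12) & 0x3F));
    s[2] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
    s[3] = (unsigned char)(0x80 | (r & 0x3F));
    return 4;
}

/*
 * Decode one rune from s[0..n-1] into *r; return the bytes consumed.
 * Malformed, truncated or non-shortest input yields RUNE_ERROR and
 * consumes one byte (an assumption of this sketch); n == 0 returns 0.
 */
int utf_to_rune(uint32_t *r, const unsigned char *s, size_t n)
{
    /* smallest rune that legitimately needs a sequence of each length */
    static const uint32_t min[] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t c;
    int len, i;

    assert(r != NULL && s != NULL);
    *r = RUNE_ERROR;
    if (n == 0)
        return 0;
    c = s[0];
    if (c < 0x80) {
        *r = c;
        return 1;
    }
    if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; }
    else
        return 1;                     /* stray tail byte or 5/6-byte lead */
    if ((size_t)len > n)
        return 1;                     /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 1;                 /* tail byte is not 10bbbbbb */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min[len] || c > RUNE_MAX)
        return 1;                     /* not the shortest encoding, or too big */
    *r = c;
    return len;
}
```

With this sketch, rune 20AC (the euro sign) encodes to the three bytes E2 82 AC, and a rune beyond the 16-bit range such as 1F600 encodes to the four bytes F0 9F 98 80. The two-byte sequence C0 80, a non-shortest encoding of rune 0, decodes to FFFD.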