| author | Charles Forsyth <charles.forsyth@gmail.com> | 2013-06-08 10:20:28 +0000 |
|---|---|---|
| committer | Charles Forsyth <charles.forsyth@gmail.com> | 2013-06-08 10:20:28 +0000 |
| commit | 1aedbd062644a7fc13c582534c097ca08af4414d (patch) | |
| tree | c98e368b1c36957890b7bb04f37e5742c5712339 /man/6 | |
| parent | b82549ec3500645cb6c8ff9d911d7c22ea451b98 (diff) | |
UTF-8 extended to 4 byte encoding
Diffstat (limited to 'man/6')
| -rw-r--r-- | man/6/utf | 60 |
1 files changed, 42 insertions, 18 deletions
@@ -1,13 +1,13 @@
 .TH UTF 6
 .SH NAME
-UTF, Unicode, ASCII \- character set and format
+UTF, Unicode, ASCII, rune \- character set and format
 .SH DESCRIPTION
 The Inferno character set and representation are
 based on the Unicode Standard and on the ISO multibyte
 .SM UTF-8
 encoding (Universal Character Set
 Transformation Format, 8 bits wide).
-The Unicode Standard represents its characters in 16
+The Unicode Standard represents its characters in 21
 bits;
 .SM UTF-8
 represents such
@@ -17,7 +17,9 @@ Throughout this manual,
 is shortened to
 .SM UTF.
 .PP
-Internally, programs store individual Unicode characters as 16-bit values.
+Internally, programs store individual Unicode characters as 32-bit integers,
+of which only 21 bits are currently used.
+Documentation often refers to them as `runes', following Plan 9.
 However, any external manifestation of textual information,
 in files or at the interface between programs, uses the
 machine-independent, byte-stream encoding called
@@ -36,18 +38,27 @@ The
 .SM UTF
 encoding of the Unicode Standard is backward compatible with
 .SM ASCII\c
-: Inferno programs handle
+: programs presented only with
 .SM ASCII
-text, as well as uninterpreted byte streams, without special arrangement.
+work on Inferno
+even if not written to deal with
+.SM UTF\c
+,
+as do
+programs that deal with uninterpreted byte streams.
 However, programs that perform semantic processing on
 characters must convert from
 .SM UTF
-to Unicode
+to runes
 in order to work properly with non-\c
 .SM ASCII
 input.
 Normally, all necessary conversions are done by the Limbo compiler
-and execution environment, but sometimes more is necessary, such
+and execution envirnoment, when converting between
+.B "array of byte"
+and
+.B "string" ,
+but sometimes more is needed, such
 as when a program receives
 .SM UTF
 input one byte at a time;
@@ -56,29 +67,42 @@ see
 for routines to handle such processing.
 .PP
 Letting numbers be binary,
-a Unicode character x
-is converted to a multibyte
+a rune x is converted to a multibyte
 .SM UTF
 sequence
 as follows:
-.EX
-01. x in [00000000.0bbbbbbb] \(-> 0bbbbbbb
-10. x in [00000bbb.bbbbbbbb] \(-> 110bbbbb, 10bbbbbb
-11. x in [bbbbbbbb.bbbbbbbb] \(-> 1110bbbb, 10bbbbbb, 10bbbbbb
-.EE
+.PP
+01. x in [000000.00000000.0bbbbbbb] → 0bbbbbbb
+.br
+10. x in [000000.00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
+.br
+11. x in [000000.bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
+.br
+100. x in [bbbbbb.bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
+.br
+.PP
 .PP
 Conversion 01 provides a one-byte sequence that spans the
 .SM ASCII
 character set in a compatible way.
-Conversions 10 and 11 represent higher-valued characters
-as sequences of two or three bytes with the high bit set.
-Inferno does not support the 4-, 5-, and 6-byte sequences proposed by X-Open.
+Conversions 10, 11 and 100 represent higher-valued characters
+as sequences of two, three or four bytes with the high bit set.
+Inferno does not support the 5 and 6 byte sequences proposed by X-Open.
 When there are multiple ways to encode a value, for example rune 0,
 the shortest encoding is used.
 .PP
 In the inverse mapping,
 any sequence except those described above
-is incorrect and is converted to the Unicode value of hexadecimal 0080.
+is incorrect and is converted to the rune hexadecimal FFFD.
+.SH FILES
+.TF "/lib/unicode "
+.TP
+.B /lib/unicode
+table of characters and descriptions, suitable for
+.IR look (1).
 .SH "SEE ALSO"
+.IR ascii (1),
+.IR tcs (1),
 .IR sys-byte2char (2),
+.IR keyboard (6),
 .IR "The Unicode Standard" .
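The conversion table in the patched man page can be sketched in a few lines of Python. This is an illustration only, not Inferno's implementation: the names `encode_rune`, `decode_rune`, `RUNE_MAX` and `RUNE_ERROR` are hypothetical, the upper limit of 0x10FFFF is an assumption taken from the Unicode code space, and consuming exactly one byte on a malformed sequence is likewise an assumption. Note that row 100 of the table as committed prints `1110bbbb` as the lead byte of the four-byte form; standard UTF-8 uses `11110bbb` there, and the sketch follows the standard.

```python
RUNE_MAX = 0x10FFFF   # assumption: largest rune, taken from the Unicode code space
RUNE_ERROR = 0xFFFD   # rune substituted for malformed input, per the man page text


def encode_rune(x: int) -> bytes:
    """Encode one rune as the shortest UTF-8 sequence (rows 01, 10, 11, 100)."""
    if x < 0 or x > RUNE_MAX:
        raise ValueError("rune out of range")
    if x < 0x80:                      # row 01: 0bbbbbbb
        return bytes([x])
    if x < 0x800:                     # row 10: 110bbbbb 10bbbbbb
        return bytes([0xC0 | (x >> 6), 0x80 | (x & 0x3F)])
    if x < 0x10000:                   # row 11: 1110bbbb 10bbbbbb 10bbbbbb
        return bytes([0xE0 | (x >> 12),
                      0x80 | ((x >> 6) & 0x3F),
                      0x80 | (x & 0x3F)])
    # row 100: 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb (standard four-byte form)
    return bytes([0xF0 | (x >> 18),
                  0x80 | ((x >> 12) & 0x3F),
                  0x80 | ((x >> 6) & 0x3F),
                  0x80 | (x & 0x3F)])


def decode_rune(buf: bytes, i: int = 0) -> tuple[int, int]:
    """Decode the rune starting at buf[i]; return (rune, bytes consumed).

    Any sequence not produced by encode_rune, including overlong forms,
    yields (RUNE_ERROR, 1). Skipping one byte on error is an assumption.
    """
    b0 = buf[i]
    if b0 < 0x80:
        return b0, 1
    if 0xC0 <= b0 < 0xE0:
        n, x, lowest = 2, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 < 0xF0:
        n, x, lowest = 3, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 < 0xF8:
        n, x, lowest = 4, b0 & 0x07, 0x10000
    else:                             # stray continuation byte or 5/6-byte lead
        return RUNE_ERROR, 1
    if i + n > len(buf):              # sequence truncated
        return RUNE_ERROR, 1
    for b in buf[i + 1:i + n]:
        if b & 0xC0 != 0x80:          # continuation bytes must be 10bbbbbb
            return RUNE_ERROR, 1
        x = (x << 6) | (b & 0x3F)
    if x < lowest or x > RUNE_MAX:    # reject overlong or out-of-range values
        return RUNE_ERROR, 1
    return x, n
```

For instance, `encode_rune(0x20AC)` yields the three bytes E2 82 AC, and `decode_rune` applied to those bytes returns the rune 0x20AC with a length of 3. The overlong pair C0 80, a two-byte spelling of rune 0, is rejected, which matches the rule that the shortest encoding is used.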
