| author | Charles.Forsyth <devnull@localhost> | 2006-12-22 20:52:35 +0000 |
|---|---|---|
| committer | Charles.Forsyth <devnull@localhost> | 2006-12-22 20:52:35 +0000 |
| commit | 46439007cf417cbd9ac8049bb4122c890097a0fa (patch) | |
| tree | 6fdb25e5f3a2b6d5657eb23b35774b631d4d97e4 /man/6/utf | |
| parent | 37da2899f40661e3e9631e497da8dc59b971cbd0 (diff) | |
20060303-partial
Diffstat (limited to 'man/6/utf')
| -rw-r--r-- | man/6/utf | 84 |
1 files changed, 84 insertions, 0 deletions
diff --git a/man/6/utf b/man/6/utf
new file mode 100644
index 00000000..0e7f04d3
--- /dev/null
+++ b/man/6/utf
@@ -0,0 +1,84 @@
+.TH UTF 6
+.SH NAME
+UTF, Unicode, ASCII \- character set and format
+.SH DESCRIPTION
+The Inferno character set and representation are
+based on the Unicode Standard and on the ISO multibyte
+.SM UTF-8
+encoding (Universal Character
+Set Transformation Format, 8 bits wide).
+The Unicode Standard represents its characters in 16
+bits;
+.SM UTF-8
+represents such
+values in an 8-bit byte stream.
+Throughout this manual,
+.SM UTF-8
+is shortened to
+.SM UTF.
+.PP
+Internally, programs store individual Unicode characters as 16-bit values.
+However, any external manifestation of textual information,
+in files or at the interface between programs, uses the
+machine-independent, byte-stream encoding called
+.SM UTF.
+.PP
+.SM UTF
+is designed so the 7-bit
+.SM ASCII
+set (values hexadecimal 00 to 7F),
+appear only as themselves
+in the encoding.
+Characters with values above 7F appear as sequences of two or more
+bytes with values only from 80 to FF.
+.PP
+The
+.SM UTF
+encoding of the Unicode Standard is backward compatible with
+.SM ASCII\c
+: Inferno programs handle
+.SM ASCII
+text, as well as uninterpreted byte streams, without special arrangement.
+However, programs that perform semantic processing on
+characters must convert from
+.SM UTF
+to Unicode
+in order to work properly with non-\c
+.SM ASCII
+input.
+Normally, all necessary conversions are done by the Limbo compiler
+and execution environment, but sometimes more is necessary, such
+as when a program receives
+.SM UTF
+input one byte at a time;
+see
+.IR sys-byte2char (2)
+for routines to handle such processing.
+.PP
+Letting numbers be binary,
+a Unicode character x
+is converted to a multibyte
+.SM UTF
+sequence
+as follows:
+.EX
+01. x in [00000000.0bbbbbbb] \(-> 0bbbbbbb
+10. x in [00000bbb.bbbbbbbb] \(-> 110bbbbb, 10bbbbbb
+11. x in [bbbbbbbb.bbbbbbbb] \(-> 1110bbbb, 10bbbbbb, 10bbbbbb
+.EE
+.PP
+Conversion 01 provides a one-byte sequence that spans the
+.SM ASCII
+character set in a compatible way.
+Conversions 10 and 11 represent higher-valued characters
+as sequences of two or three bytes with the high bit set.
+Inferno does not support the 4-, 5-, and 6-byte sequences proposed by X-Open.
+When there are multiple ways to encode a value, for example rune 0,
+the shortest encoding is used.
+.PP
+In the inverse mapping,
+any sequence except those described above
+is incorrect and is converted to the Unicode value of hexadecimal 0080.
+.SH "SEE ALSO"
+.IR sys-byte2char (2),
+.IR "The Unicode Standard" .
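
Reviewer's note: the three conversions and the inverse mapping described in the new man page can be sketched as follows. This is an illustration in Python, not Limbo and not code from this commit. The names `encode`, `decode_one`, and `BAD` are hypothetical. Rejecting over-long forms in the decoder is an assumption, based on reading "the shortest encoding is used" together with "any sequence except those described above is incorrect".

```python
BAD = 0x0080  # value the man page assigns to any incorrect sequence


def encode(x):
    """Encode one 16-bit value with conversions 01, 10, and 11."""
    if not 0 <= x <= 0xFFFF:
        raise ValueError("this sketch covers 16-bit values only")
    if x <= 0x7F:
        # Conversion 01: 0bbbbbbb
        return bytes([x])
    if x <= 0x7FF:
        # Conversion 10: 110bbbbb, 10bbbbbb
        return bytes([0xC0 | (x >> 6), 0x80 | (x & 0x3F)])
    # Conversion 11: 1110bbbb, 10bbbbbb, 10bbbbbb
    return bytes([0xE0 | (x >> 12),
                  0x80 | ((x >> 6) & 0x3F),
                  0x80 | (x & 0x3F)])


def is_continuation(byte):
    """True for bytes of the form 10bbbbbb."""
    return (byte & 0xC0) == 0x80


def decode_one(b):
    """Return (value, bytes consumed) for the sequence at the start of b.

    Incorrect input yields (BAD, 1). Consuming exactly one byte on error
    is an assumption of this sketch, not something the man page states.
    """
    if len(b) == 0:
        raise ValueError("empty input")
    c = b[0]
    if c < 0x80:
        return c, 1
    if 0xC0 <= c < 0xE0 and len(b) >= 2 and is_continuation(b[1]):
        x = ((c & 0x1F) << 6) | (b[1] & 0x3F)
        if x > 0x7F:  # assumed: over-long two-byte forms are incorrect
            return x, 2
    elif (0xE0 <= c < 0xF0 and len(b) >= 3
          and is_continuation(b[1]) and is_continuation(b[2])):
        x = ((c & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)
        if x > 0x7FF:  # assumed: over-long three-byte forms are incorrect
            return x, 3
    return BAD, 1
```

For example, `encode(0x41)` gives the single byte `0x41`, while `encode(0x20AC)` gives the three bytes `E2 82 AC`. A lead byte of `F0`, which would begin a 4-byte sequence, falls through to `BAD`.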
