diff options
Diffstat (limited to 'lib/sexp')
| -rw-r--r-- | lib/sexp | 699 |
1 files changed, 699 insertions, 0 deletions
diff --git a/lib/sexp b/lib/sexp new file mode 100644 index 00000000..0fae9606 --- /dev/null +++ b/lib/sexp @@ -0,0 +1,699 @@ +Network Working Group R. Rivest +Internet Draft May 4, 1997 +Expires November 4, 1997 + + + S-Expressions + draft-rivest-sexp-00.txt + + +Status of this Memo + + Distribution of this memo is unlimited. + + This document is an Internet-Draft. Internet Drafts are working + documents of the Internet Engineering Task Force (IETF), its Areas, + and its Working Groups. Note that other groups may also distribute + working documents as Internet Drafts. + + Internet Drafts are draft documents valid for a maximum of six + months, and may be updated, replaced, or obsoleted by other documents + at any time. It is not appropriate to use Internet Drafts as + reference material, or to cite them other than as a ``working draft'' + or ``work in progress.'' + + To learn the current status of any Internet-Draft, please check the + ``1id-abstracts.txt'' listing contained in the internet-drafts Shadow + Directories on: ftp.is.co.za (Africa), nic.nordu.net (Europe), + ds.internic.net (US East Coast), ftp.isi.edu (US West Coast), + or munnari.oz.au (Pacific Rim) + + +Abstract + +This memo describes a data structure called "S-expressions" that are +suitable for representing arbitrary complex data structures. We make +precise the encodings of S-expressions: we give a "canonical form" for +S-expressions, described two "transport" representations, and also +describe an "advanced" format for display to people. + + + +1. Introduction + +S-expressions are data structures for representing complex data. They +are either byte-strings ("octet-strings") or lists of simpler +S-expressions. Here is a sample S-expression: + + (snicker "abc" (#03# |YWJj|)) + +It is a list of length three: + + -- the octet-string "snicker" + + -- the octet-string "abc" + + -- a sub-list containing two elements: + - the hexadecimal constant #03# + - the base-64 constant |YWJj| (which is the same as "abc") + +This note gives a specific proposal for constructing and utilizing +S-expressions. The proposal is independent of any particular application. + +Here are the design goals for S-expressions: + + -- generality: S-expressions should be good at representing arbitrary + data. + + -- readability: it should be easy for someone to examine and + understand the structure of an S-expression. + + -- economy: S-expressions should represent data compactly. + + -- tranportability: S-expressions should be easy to transport + over communication media (such as email) that are known to be + less than perfect. + + -- flexibility: S-expressions should make it relatively simple to + modify and extend data structures. + + -- canonicalization: it should be easy to produce a unique + "canonical" form of an S-expression, for digital signature purposes. + + -- efficiency: S-expressions should admit in-memory representations + that allow efficient processing. + + +Section 2 gives an introduction to S-expressions. +Section 3 discusses the character sets used. +Section 4 presents the various representations of octet-strings. +Section 5 describes how to represent lists. +Section 6 discusses how S-expressions are represented for various uses. +Section 7 gives a BNF syntax for S-expressions. +Section 8 talks about how S-expressions might be represented in memory. +Section 9 briefly describes implementations for handling S-expressions. +Section 10 discusses how applications might utilize S-expressions. +Section 11 gives historical notes on S-expressions. +Section 12 gives references. + +2. S-expressions -- informal introduction + +Informally, an S-expression is either: + -- an octet-string, or + -- a finite list of simpler S-expressions. + +An octet-string is a finite sequence of eight-bit octets. There may be +many different but equivalent ways of representing an octet-string + + abc -- as a token + + "abc" -- as a quoted string + + #616263# -- as a hexadecimal string + + 3:abc -- as a length-prefixed "verbatim" encoding + + {MzphYmM=} -- as a base-64 encoding of the verbatim encoding + (that is, an encoding of "3:abc") + + |YWJj| -- as a base-64 encoding of the octet-string "abc" + +These encodings are all equivalent; they all denote the same octet string. + +We will give details of these encodings later on, and also describe how to +give a "display type" to a byte string. + +A list is a finite sequence of zero or more simpler S-expressions. A list +may be represented by using parentheses to surround the sequence of encodings +of its elements, as in: + + (abc (de #6667#) "ghi jkl") + +As we see, there is variability possible in the encoding of an +S-expression. In some cases, it is desirable to standardize or +restrict the encodings; in other cases it is desirable to have no +restrictions. The following are the target cases we aim to handle: + + -- a "transport" encoding for transporting the S-expression between + computers. + + -- a "canonical" encoding, used when signing the S-expression. + + -- an "advanced" encoding used for input/output to people. + + -- an "in-memory" encoding used for processing the S-expression in + the computer. + +These need not be different; in this proposal the canonical encoding +is the same as the transport encoding, for example. In this note we +propose (related) encoding techniques for each of these uses. + +3. Character set + +We will be describing encodings of S-expressions. Except when giving +"verbatim" encodings, the character set used is limited to the following +characters in US-ASCII: + Alphabetic: A B ... Z a b ... z + numeric: 0 1 ... 9 + whitespace: space, horizontal tab, vertical tab, form-feed + carriage-return, line-feed + The following graphics characters, which we call "pseudo-alphabetic": + - hyphen or minus + . period + / slash + _ underscore + : colon + * asterisk + + plus + = equal + The following graphics characters, which are "reserved punctuation": + ( left parenthesis + ) right parenthesis + [ left bracket + ] right bracket + { left brace + } right brace + | vertical bar + # number sign + " double quote + & ampersand + \ backslash + The following characters are unused and unavailable, except in + "verbatim" encodings: + ! exclamation point + % percent + ^ circumflex + ~ tilde + ; semicolon + ' apostrophe + , comma + < less than + > greater than + ? question mark + + +4. Octet string representations + +This section describes in detail the ways in which an octet-string may +be represented. + +We recall that an octet-string is any finite sequence of octets, and +that the octet-string may have length zero. + + +4.1 Verbatim representation + +A verbatim encoding of an octet string consists of four parts: + + -- the length (number of octets) of the octet-string, + given in decimal most significant digit first, with + no leading zeros. + + -- a colon ":" + + -- the octet string itself, verbatim. + +There are no blanks or whitespace separating the parts. No "escape +sequences" are interpreted in the octet string. This encoding is also +called a "binary" or "raw" encoding. + +Here are some sample verbatim encodings: + + 3:abc + 7:subject + 4::::: + 12:hello world! + 10:abcdefghij + 0: + +4.2 Quoted-string representation + +The quoted-string representation of an octet-string consists of: + + -- an optional decimal length field + + -- an initial double-quote (") + + -- the octet string with "C" escape conventions (\n,etc) + + -- a final double-quote (") + +The specified length is the length of the resulting string after any +escape sequences have been handled. The string does not have any +"terminating NULL" that C includes, and the length does not count such +a character. + +The length is optional. + +The escape conventions within the quoted string are as follows (these follow +the "C" programming language conventions, with an extension for +ignoring line terminators of just LF or CRLF): + \b -- backspace + \t -- horizontal tab + \v -- vertical tab + \n -- new-line + \f -- form-feed + \r -- carriage-return + \" -- double-quote + \' -- single-quote + \\ -- back-slash + \ooo -- character with octal value ooo (all three digits + must be present) + \xhh -- character with hexadecimal value hh (both digits + must be present) + \<carriage-return> -- causes carriage-return to be ignored. + \<line-feed> -- causes linefeed to be ignored + \<carriage-return><line-feed> -- causes CRLF to be ignored. + \<line-feed><carriage-return> -- causes LFCR to be ignored. + +Here are some examples of quoted-string encodings: + + "subject" + "hi there" + 7"subject" + 3"\n\n\n" + "This has\n two lines." + "This has\ + one." + "" + +4.3 Token representation + +An octet string that meets the following conditions may be given +directly as a "token". + + -- it does not begin with a digit + + -- it contains only characters that are + -- alphabetic (upper or lower case), + -- numeric, or + -- one of the eight "pseudo-alphabetic" punctuation marks: + - . / _ : * + = + (Note: upper and lower case are not equivalent.) + (Note: A token may begin with punctuation, including ":"). + +Here are some examples of token representations: + + subject + not-before + class-of-1997 + //microsoft.com/names/smith + * + + +4.4 Hexadecimal representation + +An octet-string may be represented with a hexadecimal encoding consisting of: + + -- an (optional) decimal length of the octet string + + -- a sharp-sign "#" + + -- a hexadecimal encoding of the octet string, with each octet + represented with two hexadecimal digits, most significant + digit first. + + -- a sharp-sign "#" + +There may be whitespace inserted in the midst of the hexadecimal +encoding arbitrarily; it is ignored. It is an error to have +characters other than whitespace and hexadecimal digits. + +Here are some examples of hexadecimal encodings: + + #616263# -- represents "abc" + 3#616263# -- also represents "abc" + # 616 + 263 # -- also represents "abc" + + +4.5 Base-64 representation + +An octet-string may be represented in a base-64 coding consisting of: + + -- an (optional) decimal length of the octet string + + -- a vertical bar "|" + + -- the rfc 1521 base-64 encoding of the octet string. + + -- a final vertical bar "|" + +The base-64 encoding uses only the characters + A-Z a-z 0-9 + / = +It produces four characters of output for each three octets of input. +If the input has one or two left-over octets of input, it produces an +output block of length four ending in two or one equals signs, respectively. +Output routines compliant with this standard MUST output the equals signs +as specified. Input routines MAY accept inputs where the equals signs are +dropped. + +There may be whitespace inserted in the midst of the base-64 encoding +arbitrarily; it is ignored. It is an error to have characters other +than whitespace and base-64 characters. + +Here are some examples of base-64 encodings: + + |YWJj| -- represents "abc" + | Y W + J j | -- also represents "abc" + 3|YWJj| -- also represents "abc" + |YWJjZA==| -- represents "abcd" + |YWJjZA| -- also represents "abcd" + + +4.6 Display hint + +Any octet string may be preceded by a single "display hint". + +The purposes of the display hint is to provide information on how +to display the octet string to a user. It has no other function. +Many of the MIME types work here. + +A display-hint is an octet string surrounded by square brackets. +There may be whitespace separating the octet string from the +surrounding brackets. Any of the legal formats may be used for the +octet string. + +Here are some examples of display-hints: + + [image/gif] + [URI] + [charset=unicode-1-1] + [text/richtext] + [application/postscript] + [audio/basic] + ["http://abc.com/display-types/funky.html"] + +In applications an octet-string that is untyped may be considered to have +a pre-specified "default" mime type. The mime type + "text/plain; charset=iso-8859-1" +is the standard default. + + +4.7 Equality of octet-strings + +Two octet strings are considered to be "equal" if and only if they +have the same display hint and the same data octet strings. + +Note that octet-strings are "case-sensitive"; the octet-string "abc" +is not equal to the octet-string "ABC". + +An untyped octet-string can be compared to another octet-string (typed +or not) by considering it as a typed octet-string with the default +mime-type. + + +5. Lists + +Just as with octet-strings, there are several ways to represent an +S-expression. Whitespace may be used to separate list elements, but +they are only required to separate two octet strings when otherwise +the two octet strings might be interpreted as one, as when one token +follows another. Also, whitespace may follow the initial left +parenthesis, or precede the final right parenthesis. + +Here are some examples of encodings of lists: + + (a b c) + + ( a ( b c ) ( ( d e ) ( e f ) ) ) + + (11:certificate(6:issuer3:bob)(7:subject5:alice)) + + ({3Rt=} "1997" murphy 3:{XC++}) + + +6. Representation types + +There are three "types" of representations: + + -- canonical + + -- basic transport + + -- advanced transport + +The first two MUST be supported by any implementation; the last is +optional. + + +6.1 Canonical representation + +This canonical representation is used for digital signature purposes, +transmission, etc. It is uniquely defined for each S-expression. It +is not particularly readable, but that is not the point. It is +intended to be very easy to parse, to be reasonably economical, and to +be unique for any S-expression. + +The "canonical" form of an S-expression represents each octet-string +in verbatim mode, and represents each list with no blanks separating +elements from each other or from the surrounding parentheses. + +Here are some examples of canonical representations of S-expressions: + + (6:issuer3:bob) + + (4:icon[12:image/bitmap]9:xxxxxxxxx) + + (7:subject(3:ref5:alice6:mother)) + + +6.2 Basic transport representation + +There are two forms of the "basic transport" representation: + + -- the canonical representation + + -- an rfc-2045 base-64 representation of the canonical representation, + surrounded by braces. + +The transport mechanism is intended to provide a universal means of +representing S-expressions for transport from one machine to another. + +Here are some examples of an S-expression represented in basic +transport mode: + + (1:a1:b1:c) + + {KDE6YTE6YjE6YykA} + + (this is the same S-expression encoded in base-64) + +There is a difference between the brace notation for base-64 used here +and the || notation for base-64'd octet-strings described above. Here +the base-64 contents are converted to octets, and then re-scanned as +if they were given originally as octets. With the || notation, the +contents are just turned into an octet-string. + + +6.3 Advanced transport representation + +The "advanced transport" representation is intended to provide more +flexible and readable notations for documentation, design, debugging, +and (in some cases) user interface. + +The advanced transport representation allows all of the representation +forms described above, include quoted strings, base-64 and hexadecimal +representation of strings, tokens, representations of strings with +omitted lengths, and so on. + + +7. BNF for syntax + +We give separate BNF's for canonical and advanced forms of S-expressions. +We use the following notation: + <x>* means 0 or more occurrences of <x> + <x>+ means 1 or more occurrences of <x> + <x>? means 0 or 1 occurrences of <x> + parentheses are used for grouping, as in (<x> | <y>)* + +For canonical and basic transport: + +<sexpr> :: <string> | <list> +<string> :: <display>? <simple-string> ; +<simple-string> :: <raw> ; +<display> :: "[" <simple-string> "]" ; +<raw> :: <decimal> ":" <bytes> ; +<decimal> :: <decimal-digit>+ ; + -- decimal numbers should have no unnecessary leading zeros +<bytes> -- any string of bytes, of the indicated length +<list> :: "(" <sexp>* ")" ; +<decimal-digit> :: "0" | ... | "9" ; + +For advanced transport: + +<sexpr> :: <string> | <list> +<string> :: <display>? <simple-string> ; +<simple-string> :: <raw> | <token> | <base-64> | <hexadecimal> | + <quoted-string> ; +<display> :: "[" <simple-string> "]" ; +<raw> :: <decimal> ":" <bytes> ; +<decimal> :: <decimal-digit>+ ; + -- decimal numbers should have no unnecessary leading zeros +<bytes> -- any string of bytes, of the indicated length +<token> :: <tokenchar>+ ; +<base-64> :: <decimal>? "|" ( <base-64-char> | <whitespace> )* "|" ; +<hexadecimal> :: "#" ( <hex-digit> | <white-space> )* "#" ; +<quoted-string> :: <decimal>? <quoted-string-body> +<quoted-string-body> :: "\"" <bytes> "\"" +<list> :: "(" ( <sexp> | <whitespace> )* ")" ; +<whitespace> :: <whitespace-char>* ; +<token-char> :: <alpha> | <decimal-digit> | <simple-punc> ; +<alpha> :: <upper-case> | <lower-case> | <digit> ; +<lower-case> :: "a" | ... | "z" ; +<upper-case> :: "A" | ... | "Z" ; +<decimal-digit> :: "0" | ... | "9" ; +<hex-digit> :: <decimal-digit> | "A" | ... | "F" | "a" | ... | "f" ; +<simple-punc> :: "-" | "." | "/" | "_" | ":" | "*" | "+" | "=" ; +<whitespace-char> :: " " | "\t" | "\r" | "\n" ; +<base-64-char> :: <alpha> | <decimal-digit> | "+" | "/" | "=" ; +<null> :: "" ; + +8. In-memory representations + +For processing, the S-expression would typically be parsed and represented +in memory in a more more amenable to efficient processing. We suggest +two alternatives: + + -- "list-structure" + + -- "array-layout" + +We only sketch these here, as they are only suggestive. The code referenced +below illustrates these styles in more detail. + + +8.1. List-structure memory representation + +Here there are separate records for simple-strings, strings, and +lists. An S-expression of the form ("abc" "de") would require two +records for the simple strings, two for the strings, and two for the +list elements. This is a fairly conventional representation, and +details are omitted here. + +8.2 Array-layout memory representation + +Here each S-expression is represented as a contiguous array of bytes. +The first byte codes the "type" of the S-expression: + + 01 octet-string + + 02 octet-string with display-hint + + 03 beginning of list (and 00 is used for "end of list") + +Each of the three types is immediately followed by a k-byte integer +indicating the size (in bytes) of the following representation. Here +k is an integer that depends on the implementation, it might be +anywhere from 2 to 8, but would be fixed for a given implementation; +it determines the size of the objects that can be handled. The transport +and canonical representations are independent of the choice of k made by +the implementation. + +Although the length of lists are not given in the usual S-expression +notations, it is easy to fill them in when parsing; when you reach a +right-parenthesis you know how long the list representation was, and +where to go back to fill in the missing length. + + +8.2.1 Octet string + +This is represented as follows: + + 01 <length> <octet-string> + +For example (here k = 2) + + 01 0003 a b c + +8.2.2 Octet-string with display-hint + +This is represented as follows: + + 02 <length> + 01 <length> <octet-string> /* for display-type */ + 01 <length> <octet-string> /* for octet-string */ + +For example, the S-expression + + [gif] #61626364# + +would be represented as (with k = 2) + + 02 000d + 01 0003 g i f + 01 0004 61 62 63 64 + +8.2.3 List + +This is represented as + + 03 <length> <item1> <item2> <item3> ... <itemn> 00 + +For example, the list (abc [d]ef (g)) is represented in memory as (with k=2) + + 03 001b + 01 0003 a b c + 02 0009 + 01 0001 d + 01 0002 e f + 03 0005 + 01 0001 g + 00 + 00 + +9. Code + +There is code available for reading and parsing the various +S-expression formats proposed here. + +See http://theory.lcs.mit.edu/~rivest/sexp.html + + +10. Utilization of S-expressions + +This note has described S-expressions in general form. Application writers +may wish to restrict their use of S-expressions in various ways. Here are +some possible restrictions that might be considered: + + -- no display-hints + -- no lengths on hexadecimal, quoted-strings, or base-64 encodings + -- no empty lists + -- no empty octet-strings + -- no lists having another list as its first element + -- no base-64 or hexadecimal encodings + -- fixed limits on the size of octet-strings + +11. Historical note + +The S-expression technology described here was originally developed +for ``SDSI'' (the Simple Distributed Security Infrastructure by +Lampson and Rivest [SDSI]) in 1996, although the origins clearly date +back to McCarthy's LISP programming language. It was further refined +and improved during the merger of SDSI and SPKI [SPKI] during the +first half of 1997. S-expressions are similar to, but more readable +and flexible than, Bernstein's "net-strings" [BERN]. + +12. References + +[SDSI] "A Simple Distributed Security Architecture", by + Butler Lampson, and Ronald L. Rivest + http://theory.lcs.mit.edu/~cis/sdsi.html + +[SPKI] <a href="http://www.clark.net/pub/cme/html/spki.html">SPKI--A + Simple Public Key Infrastructure</a> + +[BERN] Dan Bernstein's "net-strings"; Internet Draft + draft-bernstein-netstrings-02.txt + +Author's Address + + Ronald L. Rivest + Room 324, 545 Technology Square + MIT Laboratory for Computer Science + Cambridge, MA 02139 + + rivest@theory.lcs.mit.edu + + |
