summaryrefslogtreecommitdiff
path: root/lib/sexp
diff options
context:
space:
mode:
Diffstat (limited to 'lib/sexp')
-rw-r--r--lib/sexp699
1 files changed, 699 insertions, 0 deletions
diff --git a/lib/sexp b/lib/sexp
new file mode 100644
index 00000000..0fae9606
--- /dev/null
+++ b/lib/sexp
@@ -0,0 +1,699 @@
+Network Working Group R. Rivest
+Internet Draft May 4, 1997
+Expires November 4, 1997
+
+
+ S-Expressions
+ draft-rivest-sexp-00.txt
+
+
+Status of this Memo
+
+ Distribution of this memo is unlimited.
+
+ This document is an Internet-Draft. Internet Drafts are working
+ documents of the Internet Engineering Task Force (IETF), its Areas,
+ and its Working Groups. Note that other groups may also distribute
+ working documents as Internet Drafts.
+
+ Internet Drafts are draft documents valid for a maximum of six
+ months, and may be updated, replaced, or obsoleted by other documents
+ at any time. It is not appropriate to use Internet Drafts as
+ reference material, or to cite them other than as a ``working draft''
+ or ``work in progress.''
+
+ To learn the current status of any Internet-Draft, please check the
+ ``1id-abstracts.txt'' listing contained in the internet-drafts Shadow
+ Directories on: ftp.is.co.za (Africa), nic.nordu.net (Europe),
+ ds.internic.net (US East Coast), ftp.isi.edu (US West Coast),
+ or munnari.oz.au (Pacific Rim)
+
+
+Abstract
+
+This memo describes a data structure called "S-expressions" that are
+suitable for representing arbitrary complex data structures. We make
+precise the encodings of S-expressions: we give a "canonical form" for
+S-expressions, described two "transport" representations, and also
+describe an "advanced" format for display to people.
+
+
+
+1. Introduction
+
+S-expressions are data structures for representing complex data. They
+are either byte-strings ("octet-strings") or lists of simpler
+S-expressions. Here is a sample S-expression:
+
+ (snicker "abc" (#03# |YWJj|))
+
+It is a list of length three:
+
+ -- the octet-string "snicker"
+
+ -- the octet-string "abc"
+
+ -- a sub-list containing two elements:
+ - the hexadecimal constant #03#
+ - the base-64 constant |YWJj| (which is the same as "abc")
+
+This note gives a specific proposal for constructing and utilizing
+S-expressions. The proposal is independent of any particular application.
+
+Here are the design goals for S-expressions:
+
+ -- generality: S-expressions should be good at representing arbitrary
+ data.
+
+ -- readability: it should be easy for someone to examine and
+ understand the structure of an S-expression.
+
+ -- economy: S-expressions should represent data compactly.
+
+ -- tranportability: S-expressions should be easy to transport
+ over communication media (such as email) that are known to be
+ less than perfect.
+
+ -- flexibility: S-expressions should make it relatively simple to
+ modify and extend data structures.
+
+ -- canonicalization: it should be easy to produce a unique
+ "canonical" form of an S-expression, for digital signature purposes.
+
+ -- efficiency: S-expressions should admit in-memory representations
+ that allow efficient processing.
+
+
+Section 2 gives an introduction to S-expressions.
+Section 3 discusses the character sets used.
+Section 4 presents the various representations of octet-strings.
+Section 5 describes how to represent lists.
+Section 6 discusses how S-expressions are represented for various uses.
+Section 7 gives a BNF syntax for S-expressions.
+Section 8 talks about how S-expressions might be represented in memory.
+Section 9 briefly describes implementations for handling S-expressions.
+Section 10 discusses how applications might utilize S-expressions.
+Section 11 gives historical notes on S-expressions.
+Section 12 gives references.
+
+2. S-expressions -- informal introduction
+
+Informally, an S-expression is either:
+ -- an octet-string, or
+ -- a finite list of simpler S-expressions.
+
+An octet-string is a finite sequence of eight-bit octets. There may be
+many different but equivalent ways of representing an octet-string
+
+ abc -- as a token
+
+ "abc" -- as a quoted string
+
+ #616263# -- as a hexadecimal string
+
+ 3:abc -- as a length-prefixed "verbatim" encoding
+
+ {MzphYmM=} -- as a base-64 encoding of the verbatim encoding
+ (that is, an encoding of "3:abc")
+
+ |YWJj| -- as a base-64 encoding of the octet-string "abc"
+
+These encodings are all equivalent; they all denote the same octet string.
+
+We will give details of these encodings later on, and also describe how to
+give a "display type" to a byte string.
+
+A list is a finite sequence of zero or more simpler S-expressions. A list
+may be represented by using parentheses to surround the sequence of encodings
+of its elements, as in:
+
+ (abc (de #6667#) "ghi jkl")
+
+As we see, there is variability possible in the encoding of an
+S-expression. In some cases, it is desirable to standardize or
+restrict the encodings; in other cases it is desirable to have no
+restrictions. The following are the target cases we aim to handle:
+
+ -- a "transport" encoding for transporting the S-expression between
+ computers.
+
+ -- a "canonical" encoding, used when signing the S-expression.
+
+ -- an "advanced" encoding used for input/output to people.
+
+ -- an "in-memory" encoding used for processing the S-expression in
+ the computer.
+
+These need not be different; in this proposal the canonical encoding
+is the same as the transport encoding, for example. In this note we
+propose (related) encoding techniques for each of these uses.
+
+3. Character set
+
+We will be describing encodings of S-expressions. Except when giving
+"verbatim" encodings, the character set used is limited to the following
+characters in US-ASCII:
+ Alphabetic: A B ... Z a b ... z
+ numeric: 0 1 ... 9
+ whitespace: space, horizontal tab, vertical tab, form-feed
+ carriage-return, line-feed
+ The following graphics characters, which we call "pseudo-alphabetic":
+ - hyphen or minus
+ . period
+ / slash
+ _ underscore
+ : colon
+ * asterisk
+ + plus
+ = equal
+ The following graphics characters, which are "reserved punctuation":
+ ( left parenthesis
+ ) right parenthesis
+ [ left bracket
+ ] right bracket
+ { left brace
+ } right brace
+ | vertical bar
+ # number sign
+ " double quote
+ & ampersand
+ \ backslash
+ The following characters are unused and unavailable, except in
+ "verbatim" encodings:
+ ! exclamation point
+ % percent
+ ^ circumflex
+ ~ tilde
+ ; semicolon
+ ' apostrophe
+ , comma
+ < less than
+ > greater than
+ ? question mark
+
+
+4. Octet string representations
+
+This section describes in detail the ways in which an octet-string may
+be represented.
+
+We recall that an octet-string is any finite sequence of octets, and
+that the octet-string may have length zero.
+
+
+4.1 Verbatim representation
+
+A verbatim encoding of an octet string consists of four parts:
+
+ -- the length (number of octets) of the octet-string,
+ given in decimal most significant digit first, with
+ no leading zeros.
+
+ -- a colon ":"
+
+ -- the octet string itself, verbatim.
+
+There are no blanks or whitespace separating the parts. No "escape
+sequences" are interpreted in the octet string. This encoding is also
+called a "binary" or "raw" encoding.
+
+Here are some sample verbatim encodings:
+
+ 3:abc
+ 7:subject
+ 4:::::
+ 12:hello world!
+ 10:abcdefghij
+ 0:
+
+4.2 Quoted-string representation
+
+The quoted-string representation of an octet-string consists of:
+
+ -- an optional decimal length field
+
+ -- an initial double-quote (")
+
+ -- the octet string with "C" escape conventions (\n,etc)
+
+ -- a final double-quote (")
+
+The specified length is the length of the resulting string after any
+escape sequences have been handled. The string does not have any
+"terminating NULL" that C includes, and the length does not count such
+a character.
+
+The length is optional.
+
+The escape conventions within the quoted string are as follows (these follow
+the "C" programming language conventions, with an extension for
+ignoring line terminators of just LF or CRLF):
+ \b -- backspace
+ \t -- horizontal tab
+ \v -- vertical tab
+ \n -- new-line
+ \f -- form-feed
+ \r -- carriage-return
+ \" -- double-quote
+ \' -- single-quote
+ \\ -- back-slash
+ \ooo -- character with octal value ooo (all three digits
+ must be present)
+ \xhh -- character with hexadecimal value hh (both digits
+ must be present)
+ \<carriage-return> -- causes carriage-return to be ignored.
+ \<line-feed> -- causes linefeed to be ignored
+ \<carriage-return><line-feed> -- causes CRLF to be ignored.
+ \<line-feed><carriage-return> -- causes LFCR to be ignored.
+
+Here are some examples of quoted-string encodings:
+
+ "subject"
+ "hi there"
+ 7"subject"
+ 3"\n\n\n"
+ "This has\n two lines."
+ "This has\
+ one."
+ ""
+
+4.3 Token representation
+
+An octet string that meets the following conditions may be given
+directly as a "token".
+
+ -- it does not begin with a digit
+
+ -- it contains only characters that are
+ -- alphabetic (upper or lower case),
+ -- numeric, or
+ -- one of the eight "pseudo-alphabetic" punctuation marks:
+ - . / _ : * + =
+ (Note: upper and lower case are not equivalent.)
+ (Note: A token may begin with punctuation, including ":").
+
+Here are some examples of token representations:
+
+ subject
+ not-before
+ class-of-1997
+ //microsoft.com/names/smith
+ *
+
+
+4.4 Hexadecimal representation
+
+An octet-string may be represented with a hexadecimal encoding consisting of:
+
+ -- an (optional) decimal length of the octet string
+
+ -- a sharp-sign "#"
+
+ -- a hexadecimal encoding of the octet string, with each octet
+ represented with two hexadecimal digits, most significant
+ digit first.
+
+ -- a sharp-sign "#"
+
+There may be whitespace inserted in the midst of the hexadecimal
+encoding arbitrarily; it is ignored. It is an error to have
+characters other than whitespace and hexadecimal digits.
+
+Here are some examples of hexadecimal encodings:
+
+ #616263# -- represents "abc"
+ 3#616263# -- also represents "abc"
+ # 616
+ 263 # -- also represents "abc"
+
+
+4.5 Base-64 representation
+
+An octet-string may be represented in a base-64 coding consisting of:
+
+ -- an (optional) decimal length of the octet string
+
+ -- a vertical bar "|"
+
+ -- the rfc 1521 base-64 encoding of the octet string.
+
+ -- a final vertical bar "|"
+
+The base-64 encoding uses only the characters
+ A-Z a-z 0-9 + / =
+It produces four characters of output for each three octets of input.
+If the input has one or two left-over octets of input, it produces an
+output block of length four ending in two or one equals signs, respectively.
+Output routines compliant with this standard MUST output the equals signs
+as specified. Input routines MAY accept inputs where the equals signs are
+dropped.
+
+There may be whitespace inserted in the midst of the base-64 encoding
+arbitrarily; it is ignored. It is an error to have characters other
+than whitespace and base-64 characters.
+
+Here are some examples of base-64 encodings:
+
+ |YWJj| -- represents "abc"
+ | Y W
+ J j | -- also represents "abc"
+ 3|YWJj| -- also represents "abc"
+ |YWJjZA==| -- represents "abcd"
+ |YWJjZA| -- also represents "abcd"
+
+
+4.6 Display hint
+
+Any octet string may be preceded by a single "display hint".
+
+The purposes of the display hint is to provide information on how
+to display the octet string to a user. It has no other function.
+Many of the MIME types work here.
+
+A display-hint is an octet string surrounded by square brackets.
+There may be whitespace separating the octet string from the
+surrounding brackets. Any of the legal formats may be used for the
+octet string.
+
+Here are some examples of display-hints:
+
+ [image/gif]
+ [URI]
+ [charset=unicode-1-1]
+ [text/richtext]
+ [application/postscript]
+ [audio/basic]
+ ["http://abc.com/display-types/funky.html"]
+
+In applications an octet-string that is untyped may be considered to have
+a pre-specified "default" mime type. The mime type
+ "text/plain; charset=iso-8859-1"
+is the standard default.
+
+
+4.7 Equality of octet-strings
+
+Two octet strings are considered to be "equal" if and only if they
+have the same display hint and the same data octet strings.
+
+Note that octet-strings are "case-sensitive"; the octet-string "abc"
+is not equal to the octet-string "ABC".
+
+An untyped octet-string can be compared to another octet-string (typed
+or not) by considering it as a typed octet-string with the default
+mime-type.
+
+
+5. Lists
+
+Just as with octet-strings, there are several ways to represent an
+S-expression. Whitespace may be used to separate list elements, but
+they are only required to separate two octet strings when otherwise
+the two octet strings might be interpreted as one, as when one token
+follows another. Also, whitespace may follow the initial left
+parenthesis, or precede the final right parenthesis.
+
+Here are some examples of encodings of lists:
+
+ (a b c)
+
+ ( a ( b c ) ( ( d e ) ( e f ) ) )
+
+ (11:certificate(6:issuer3:bob)(7:subject5:alice))
+
+ ({3Rt=} "1997" murphy 3:{XC++})
+
+
+6. Representation types
+
+There are three "types" of representations:
+
+ -- canonical
+
+ -- basic transport
+
+ -- advanced transport
+
+The first two MUST be supported by any implementation; the last is
+optional.
+
+
+6.1 Canonical representation
+
+This canonical representation is used for digital signature purposes,
+transmission, etc. It is uniquely defined for each S-expression. It
+is not particularly readable, but that is not the point. It is
+intended to be very easy to parse, to be reasonably economical, and to
+be unique for any S-expression.
+
+The "canonical" form of an S-expression represents each octet-string
+in verbatim mode, and represents each list with no blanks separating
+elements from each other or from the surrounding parentheses.
+
+Here are some examples of canonical representations of S-expressions:
+
+ (6:issuer3:bob)
+
+ (4:icon[12:image/bitmap]9:xxxxxxxxx)
+
+ (7:subject(3:ref5:alice6:mother))
+
+
+6.2 Basic transport representation
+
+There are two forms of the "basic transport" representation:
+
+ -- the canonical representation
+
+ -- an rfc-2045 base-64 representation of the canonical representation,
+ surrounded by braces.
+
+The transport mechanism is intended to provide a universal means of
+representing S-expressions for transport from one machine to another.
+
+Here are some examples of an S-expression represented in basic
+transport mode:
+
+ (1:a1:b1:c)
+
+ {KDE6YTE6YjE6YykA}
+
+ (this is the same S-expression encoded in base-64)
+
+There is a difference between the brace notation for base-64 used here
+and the || notation for base-64'd octet-strings described above. Here
+the base-64 contents are converted to octets, and then re-scanned as
+if they were given originally as octets. With the || notation, the
+contents are just turned into an octet-string.
+
+
+6.3 Advanced transport representation
+
+The "advanced transport" representation is intended to provide more
+flexible and readable notations for documentation, design, debugging,
+and (in some cases) user interface.
+
+The advanced transport representation allows all of the representation
+forms described above, include quoted strings, base-64 and hexadecimal
+representation of strings, tokens, representations of strings with
+omitted lengths, and so on.
+
+
+7. BNF for syntax
+
+We give separate BNF's for canonical and advanced forms of S-expressions.
+We use the following notation:
+ <x>* means 0 or more occurrences of <x>
+ <x>+ means 1 or more occurrences of <x>
+ <x>? means 0 or 1 occurrences of <x>
+ parentheses are used for grouping, as in (<x> | <y>)*
+
+For canonical and basic transport:
+
+<sexpr> :: <string> | <list>
+<string> :: <display>? <simple-string> ;
+<simple-string> :: <raw> ;
+<display> :: "[" <simple-string> "]" ;
+<raw> :: <decimal> ":" <bytes> ;
+<decimal> :: <decimal-digit>+ ;
+ -- decimal numbers should have no unnecessary leading zeros
+<bytes> -- any string of bytes, of the indicated length
+<list> :: "(" <sexp>* ")" ;
+<decimal-digit> :: "0" | ... | "9" ;
+
+For advanced transport:
+
+<sexpr> :: <string> | <list>
+<string> :: <display>? <simple-string> ;
+<simple-string> :: <raw> | <token> | <base-64> | <hexadecimal> |
+ <quoted-string> ;
+<display> :: "[" <simple-string> "]" ;
+<raw> :: <decimal> ":" <bytes> ;
+<decimal> :: <decimal-digit>+ ;
+ -- decimal numbers should have no unnecessary leading zeros
+<bytes> -- any string of bytes, of the indicated length
+<token> :: <tokenchar>+ ;
+<base-64> :: <decimal>? "|" ( <base-64-char> | <whitespace> )* "|" ;
+<hexadecimal> :: "#" ( <hex-digit> | <white-space> )* "#" ;
+<quoted-string> :: <decimal>? <quoted-string-body>
+<quoted-string-body> :: "\"" <bytes> "\""
+<list> :: "(" ( <sexp> | <whitespace> )* ")" ;
+<whitespace> :: <whitespace-char>* ;
+<token-char> :: <alpha> | <decimal-digit> | <simple-punc> ;
+<alpha> :: <upper-case> | <lower-case> | <digit> ;
+<lower-case> :: "a" | ... | "z" ;
+<upper-case> :: "A" | ... | "Z" ;
+<decimal-digit> :: "0" | ... | "9" ;
+<hex-digit> :: <decimal-digit> | "A" | ... | "F" | "a" | ... | "f" ;
+<simple-punc> :: "-" | "." | "/" | "_" | ":" | "*" | "+" | "=" ;
+<whitespace-char> :: " " | "\t" | "\r" | "\n" ;
+<base-64-char> :: <alpha> | <decimal-digit> | "+" | "/" | "=" ;
+<null> :: "" ;
+
+8. In-memory representations
+
+For processing, the S-expression would typically be parsed and represented
+in memory in a more more amenable to efficient processing. We suggest
+two alternatives:
+
+ -- "list-structure"
+
+ -- "array-layout"
+
+We only sketch these here, as they are only suggestive. The code referenced
+below illustrates these styles in more detail.
+
+
+8.1. List-structure memory representation
+
+Here there are separate records for simple-strings, strings, and
+lists. An S-expression of the form ("abc" "de") would require two
+records for the simple strings, two for the strings, and two for the
+list elements. This is a fairly conventional representation, and
+details are omitted here.
+
+8.2 Array-layout memory representation
+
+Here each S-expression is represented as a contiguous array of bytes.
+The first byte codes the "type" of the S-expression:
+
+ 01 octet-string
+
+ 02 octet-string with display-hint
+
+ 03 beginning of list (and 00 is used for "end of list")
+
+Each of the three types is immediately followed by a k-byte integer
+indicating the size (in bytes) of the following representation. Here
+k is an integer that depends on the implementation, it might be
+anywhere from 2 to 8, but would be fixed for a given implementation;
+it determines the size of the objects that can be handled. The transport
+and canonical representations are independent of the choice of k made by
+the implementation.
+
+Although the length of lists are not given in the usual S-expression
+notations, it is easy to fill them in when parsing; when you reach a
+right-parenthesis you know how long the list representation was, and
+where to go back to fill in the missing length.
+
+
+8.2.1 Octet string
+
+This is represented as follows:
+
+ 01 <length> <octet-string>
+
+For example (here k = 2)
+
+ 01 0003 a b c
+
+8.2.2 Octet-string with display-hint
+
+This is represented as follows:
+
+ 02 <length>
+ 01 <length> <octet-string> /* for display-type */
+ 01 <length> <octet-string> /* for octet-string */
+
+For example, the S-expression
+
+ [gif] #61626364#
+
+would be represented as (with k = 2)
+
+ 02 000d
+ 01 0003 g i f
+ 01 0004 61 62 63 64
+
+8.2.3 List
+
+This is represented as
+
+ 03 <length> <item1> <item2> <item3> ... <itemn> 00
+
+For example, the list (abc [d]ef (g)) is represented in memory as (with k=2)
+
+ 03 001b
+ 01 0003 a b c
+ 02 0009
+ 01 0001 d
+ 01 0002 e f
+ 03 0005
+ 01 0001 g
+ 00
+ 00
+
+9. Code
+
+There is code available for reading and parsing the various
+S-expression formats proposed here.
+
+See http://theory.lcs.mit.edu/~rivest/sexp.html
+
+
+10. Utilization of S-expressions
+
+This note has described S-expressions in general form. Application writers
+may wish to restrict their use of S-expressions in various ways. Here are
+some possible restrictions that might be considered:
+
+ -- no display-hints
+ -- no lengths on hexadecimal, quoted-strings, or base-64 encodings
+ -- no empty lists
+ -- no empty octet-strings
+ -- no lists having another list as its first element
+ -- no base-64 or hexadecimal encodings
+ -- fixed limits on the size of octet-strings
+
+11. Historical note
+
+The S-expression technology described here was originally developed
+for ``SDSI'' (the Simple Distributed Security Infrastructure by
+Lampson and Rivest [SDSI]) in 1996, although the origins clearly date
+back to McCarthy's LISP programming language. It was further refined
+and improved during the merger of SDSI and SPKI [SPKI] during the
+first half of 1997. S-expressions are similar to, but more readable
+and flexible than, Bernstein's "net-strings" [BERN].
+
+12. References
+
+[SDSI] "A Simple Distributed Security Architecture", by
+ Butler Lampson, and Ronald L. Rivest
+ http://theory.lcs.mit.edu/~cis/sdsi.html
+
+[SPKI] <a href="http://www.clark.net/pub/cme/html/spki.html">SPKI--A
+ Simple Public Key Infrastructure</a>
+
+[BERN] Dan Bernstein's "net-strings"; Internet Draft
+ draft-bernstein-netstrings-02.txt
+
+Author's Address
+
+ Ronald L. Rivest
+ Room 324, 545 Technology Square
+ MIT Laboratory for Computer Science
+ Cambridge, MA 02139
+
+ rivest@theory.lcs.mit.edu
+
+