.TH SEXPRS 6
.SH NAME
sexprs \- symbolic expressions
.SH DESCRIPTION
S-expressions (`symbolic expressions') provide a way for programs to store and
exchange tree-structured text and binary data.
The Limbo module
.IR sexprs (2)
provides the variant defined by
Rivest in Internet Draft
.L draft-rivest-sexp-00.txt
(4 May 1997),
as used for instance by the Simple Public Key Infrastructure (SPKI).
It provides a `canonical' form of S-expression,
and an `advanced' form for display.
Both forms can convey binary data directly and efficiently, unlike some
other schemes such as XML.
The two forms are closely related, and both can be read or written by
.IR sexprs (2),
including a variant sometimes used for transport on links that are not 8-bit safe.
.PP
An S-expression is either a sequence of bytes (a byte
.IR string ),
or a parenthesised list of smaller S-expressions.
All forms start with the fundamental rules below, in extended BNF:
.IP
.EX
.ft R
.ta \w'\f2simple-stringxxxxx\f1'u +\w'\ ::=\ 'u
\f2sexpr\fP	::=	\f2string\fP | \f2list\fP
\f2list\fP	::=	'(' \f2sexpr\fP* ')'
.EE
.DT
.PD
.PP
They give the recursive structure.
The various representations ultimately differ only in how the byte string is represented
and whether white space such as blanks or newlines can appear.
.PP
Furthermore, the definition of
.I string
is also common to all forms:
.IP
.EX
.ft R
.ta \w'\f2simple-stringxxxxx\f1'u +\w'\ ::=\ 'u
\f2string\fP	::=	\f2display\fP? \f2simple-string\fP
\f2display\fP	::=	'[' \f2simple-string\fP ']'
.EE
.DT
.PD
.PP
The optional bracketed
.I display
string provides information on how to present the associated byte string to a user.
(``It has no other function.  Many of the MIME types work here.'')
Although supported by
.IR sexprs (2),
it is largely unused by Inferno applications and is usually left out.
The canonical and advanced forms differ in their definitions of
.IR simple-string .
They always denote sequences of 8-bit bytes, but with different syntax (encodings).
Two
.I strings
are equal iff their
.I simple-strings
encode the same byte strings (for both data and
.IR display ).
.PP
.I Canonical
form must be used when exchanging S-expressions between computers,
and when digitally signing an expression.
It is defined by the complete set of rules below:
.IP
.EX
.ft R
.ta \w'\f2simple-stringxxxxx\f1'u +\w'\ ::=\ 'u
\f2sexpr\fP	::=	\f2string\fP | \f2list\fP
\f2list\fP	::=	'(' \f2sexpr\fP* ')'
\f2string\fP	::=	\f2display\fP? \f2simple-string\fP
\f2display\fP	::=	'[' \f2simple-string\fP ']'
\f2simple-string\fP	::=	\f2raw\fP
\f2raw\fP	::=	\f2nbytes\fP ':' \f2byte*\fP
\f2nbytes\fP	::=	\f5[1-9][0-9]\fP* |  \f50\fP
.EE
.DT
.PD
.PP
Its
.I simple-string
is a raw byte string.
The primitive
.I byte
represents an 8-bit byte.
The length of every byte string is given explicitly by a preceding decimal value
.I nbytes
(with no leading zeroes).
There is no white space.
It is `canonical' because it is uniquely defined for each S-expression.
It is efficient to parse even on small computers.
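.PP
As an illustration only, the canonical rules above might be implemented along the following lines.
The sketch is in Python rather than Limbo and is not the interface of
.IR sexprs (2);
the names
.B encode
and
.B decode
are hypothetical, an S-expression is modelled as a Python
.B bytes
value or a list, and the optional
.I display
string is ignored.
.EX
```python
# Sketch of canonical-form encoding and decoding (assumed names, no display hints).
# An S-expression is modelled as bytes (a string) or a list of S-expressions.

def encode(sexpr):
    """Return the canonical form of sexpr as bytes."""
    if isinstance(sexpr, bytes):
        # nbytes ':' byte*
        return str(len(sexpr)).encode("ascii") + b":" + sexpr
    return b"(" + b"".join(encode(e) for e in sexpr) + b")"

def decode(data, pos=0):
    """Parse one canonical S-expression at pos; return (value, next_pos)."""
    if data[pos:pos + 1] == b"(":
        pos += 1
        items = []
        while data[pos:pos + 1] != b")":
            if pos >= len(data):
                raise ValueError("unterminated list")
            item, pos = decode(data, pos)
            items.append(item)
        return items, pos + 1
    colon = data.index(b":", pos)
    digits = data[pos:colon]
    # The length is decimal with no leading zeroes (a lone 0 is allowed).
    if not digits.isdigit() or (len(digits) > 1 and digits[:1] == b"0"):
        raise ValueError("bad length")
    end = colon + 1 + int(digits)
    if end > len(data):
        raise ValueError("truncated string")
    return data[colon + 1:end], end
```
.EE
.PP
Because each string carries its length, the decoder never needs to scan the string's contents, which is what keeps parsing cheap.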
.PP
.I Advanced
form is more elaborate, and has two main differences:
not all byte strings need an explicit length, and binary
data can be represented in printable form, either using hexadecimal or base 64 encodings,
or using quoted strings (with escape sequences similar to those of Limbo or C).
Unquoted text is called a
.IR token ,
and is restricted by the standard to a specific alphabet:
it must contain only letters, digits, or characters from the set
.LR "-./_:*+=" ,
and must not start with a digit.
The latter restriction is imposed to allow byte counts to be distinguished from tokens without
lookahead, but has the consequence that decimal numbers must be quoted,
as must non-ASCII characters in
.IR utf (6)
encoding.
Upper- and lower-case letters are distinct.
Advanced form is defined by the complete set of rules below:
.IP
.EX
.ft R
.ta \w'\f2simple-stringxxxxx\f1'u +\w'\ ::=\ 'u
\f2sexpr\fP	::=	\f2string\fP | \f2list\fP
\f2list\fP	::=	'(' ( \f2sexpr\fP | \f2whitespace\fP )* ')'
\f2string\fP	::=	\f2display\fP? \f2simple-string\fP
\f2display\fP	::=	'[' \f2simple-string\fP ']'
\f2simple-string\fP	::=	\f2raw\fP | \f2token\fP | \f2base-64\fP | \f2hexadecimal\fP |  \f2quoted-string\fP
\f2raw\fP	::=	\f2nbytes\fP ':' \f2byte*\fP
\f2nbytes\fP	::=	\f5[1-9][0-9]\fP* |  \f50\fP
\f2token\fP	::=	\f2token-start\fP \f2token-char*\fP
\f2base-64\fP	::=	\f2nbytes\fP? '|' ( \f2base-64-char\fP | \f2whitespace\fP )* '|'
\f2hexadecimal\fP	::=	'#' ( \f2hex-digit\fP | \f2whitespace\fP )* '#'
\f2quoted-string\fP	::=	\f2nbytes\fP? \f2quoted-string-body\fP
\f2quoted-string-body\fP	::=	'"' \f2byte*\fP '"'
\f2token-start\fP	::=	\f5[-./_:*+=a-zA-Z]\fP
\f2token-char\fP	::=	\f2token-start\fP | \f5[0-9]\fP
\f2hex-digit\fP	::=	\f5[0-9a-fA-F]\fP
\f2base-64-char\fP	::=	\f5[a-zA-Z0-9+/=]\fP
.EE
.PD
.DT
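.PP
A program writing advanced form must decide whether a byte string can appear as a bare
.I token
or needs quoting.
A minimal sketch of that test, in Python and using a hypothetical helper
.BR is_token ,
follows the
.I token-start
and
.I token-char
rules directly:
.EX
```python
import re

# Token alphabet as given by the token-start and token-char rules.
TOKEN = re.compile(rb"[-./_:*+=a-zA-Z][-./_:*+=a-zA-Z0-9]*")

def is_token(data):
    """Report whether the byte string data can be written as an unquoted token."""
    return TOKEN.fullmatch(data) is not None
```
.EE
.PP
Under these rules
.B best-of-3
is a token, but
.B 3
and
.B 5.6
are not, since they start with a digit.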
.PP
.I Whitespace
is any sequence of blank, tab, newline or carriage-return characters;
note that it can appear only at the places shown.
The
.I bytes
in a
.I quoted-string-body
are interpreted according to the quoting rules for Limbo (or C).
That is, the bytes are enclosed in quotes, and may contain the
escape sequences for the following characters:
backspace
.RB ( \eb ),
form-feed
.RB ( \ef ),
newline
.RB ( \en ),
carriage-return
.RB ( \er ),
tab
.RB ( \et ),
vertical tab
.RB ( \ev ),
octal escape
.BI \e ooo
(all three digits must be given),
hexadecimal escape
.BI \ex hh
(both digits must be given),
.B \e\e
for backslash,
.B \e'
for single quote, and
\f5\e"\fP to include a quote in a string.
Note that a quoted string can have an optional
.IR nbytes ,
but it gives the length of the byte string resulting
.I after
interpreting character escapes.
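.PP
The escape rules can be sketched as follows.
This is an assumed Python illustration, not the code used by
.IR sexprs (2):
.B unquote
is a hypothetical name, it takes only the bytes between the quotes,
and it handles only the escapes listed above.
Byte values are written numerically to keep the listing free of backslashes.
.EX
```python
# Sketch: decode the body of a quoted string (the bytes between the quotes).
BACKSLASH = 92  # byte value of the backslash character

# Single-character escapes named in the text, mapped to byte values.
SIMPLE = {
    ord("b"): 8, ord("f"): 12, ord("n"): 10, ord("r"): 13,
    ord("t"): 9, ord("v"): 11,
    BACKSLASH: BACKSLASH, ord("'"): ord("'"), ord('"'): ord('"'),
}

def number(digits, count, alphabet, base):
    """Convert exactly count digits drawn from alphabet to a byte value."""
    if len(digits) != count or any(d not in alphabet for d in digits):
        raise ValueError("bad escape")
    value = int(digits, base)
    if value > 255:
        raise ValueError("escape out of range")
    return value

def unquote(body):
    """Return the byte string denoted by a quoted-string body."""
    out = bytearray()
    i = 0
    while i < len(body):
        c = body[i]
        if c != BACKSLASH:
            out.append(c)
            i += 1
        elif i + 1 >= len(body):
            raise ValueError("dangling backslash")
        elif body[i + 1] in SIMPLE:
            out.append(SIMPLE[body[i + 1]])
            i += 2
        elif body[i + 1] == ord("x"):
            # Hexadecimal escape: both digits must be given.
            out.append(number(body[i + 2:i + 4], 2, b"0123456789abcdefABCDEF", 16))
            i += 4
        else:
            # Octal escape: all three digits must be given.
            out.append(number(body[i + 1:i + 4], 3, b"01234567", 8))
            i += 4
    return bytes(out)
```
.EE
.PP
An
.I nbytes
prefix would be checked against the length of the value returned, not the length of the body.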
.PP
Both canonical and advanced forms can contain binary data verbatim.
Sometimes that is troublesome for storage or transport.
At the lexical level any
.I sexpr
can therefore be replaced by the following:
.IP
.EX
.ft R
\&'{' ( \f2base-64-char\fP | \f2whitespace\fP )* '}'
.EE
.PP
where the text between the braces is the base-64 encoding of the
.I sexpr
expressed in canonical or advanced form.
The S-expression parser will replace the sequence by its decoded form, and resume
parsing at the start of that byte string.
Note the difference in syntax and interpretation from rule
.IR base-64
above, which encodes a
.IR simple-string ,
not an
.IR sexpr .
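.PP
As a rough sketch of the brace form, the following Python fragment wraps an already encoded
.I sexpr
for transport and unwraps it again.
The function names are hypothetical, and the standard
.B base64
module stands in for whatever encoder an implementation actually uses.
.EX
```python
import base64

def to_transport(sexpr_bytes):
    """Wrap an encoded S-expression in braces as base-64 text."""
    return b"{" + base64.b64encode(sexpr_bytes) + b"}"

def from_transport(text):
    """Recover the encoded S-expression from a brace-delimited sequence."""
    if text[:1] != b"{" or text[-1:] != b"}":
        raise ValueError("not a brace-delimited sequence")
    # Whitespace may appear between the braces, so remove it before decoding.
    payload = b"".join(text[1:-1].split())
    return base64.b64decode(payload, validate=True)
```
.EE
.PP
The bytes returned by
.B from_transport
are themselves an S-expression in canonical or advanced form and still have to be parsed.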
.SH EXAMPLES
The following S-expression is in canonical form:
.IP
.EX
(12:hello world!(5:inner0:))
.EE
.PP
It is a list of two elements: the string
.BR "hello world!" ,
and another list also with two elements,
the string
.BR inner
and an empty string.
All the bytes in the example are printable characters, but they could have been arbitrary binary values.
.PP
The following is an S-expression in advanced form:
.IP
.EX
(hello-world
    (* "3" "5.6")
    (best-of-3 (5:inner0:)))
.EE
.PP
Note that advanced form contains canonical form as a subset;
here it is used for the innermost list.
.SH SEE ALSO
.IR sexprs (2),
.IR json (6),
.IR ubfa (6)
.PP
R. Rivest, ``S-expressions'', Network Working Group Internet Draft
(4 May 1997),
reproduced in
.BR /lib/sexp .