diff options
Diffstat (limited to 'doc/ebookimp.ms')
| -rw-r--r-- | doc/ebookimp.ms | 389 |
1 files changed, 389 insertions, 0 deletions
diff --git a/doc/ebookimp.ms b/doc/ebookimp.ms new file mode 100644 index 00000000..79e3fc2a --- /dev/null +++ b/doc/ebookimp.ms @@ -0,0 +1,389 @@ +.TL +Navigating Large XML Documents on Small Devices +.AU +Roger Peppe +.AI +Vita Nuova +.br +April 2002 +.AB +Browsing eBooks on platforms with limited memory presents an +interesting problem: how can memory usage be bounded despite +the need to view documents that may be much larger than the +available memory. A simple interface to an XML parser enables +this whilst retaining much of the ease of access afforded +by XML parsers that read all of a document into memory at once. +.AE +.SH +Introduction +.LP +The Open Ebook Publication Structure was devised by the Open Ebook Forum +in order to ``provide a specification for representing the content of electronic +books''. It is based on many existing standards, notably XML and HTML. +An Open eBook publication consists of a set of documents bound together +with an Open eBook package file which enumerates all the documents, +pictures and other items that make up the book +.LP +The underlying document format is essentially HTML compatible, +which is where the first problem arises: HTML was not designed to +make it easy to view partial sections of a document. Conventionally +an entire HTML document is read in at once and rendered onto +the device. When viewing an eBook on a limited-memory device, +however, this may not be possible; books tend to be fairly large. +For such a device, the ideal format would keep the book itself +in non-volatile storage (e.g. flash or disk) and make it possible +for reader to seek to an arbitrary position in the book and render +what it finds there. +.LP +This is not possible in an HTML or XML document, as the +arbitrarily nested nature of the format means that every +position in the document has some unknown surrounding context, +which cannot be discovered without reading sequentially through +the document from the beginning. +.SH +SAX and DOM +.LP +There are two conventional programming interfaces to an XML +parser. A SAX parser provides a stream of XML entities, leaving +it up to the application to maintain the context. It is not possible +to rewind the stream, except, perhaps, to the beginning. +Using a SAX parser is +fairly straightforward, but awkward: the stream-like nature +of the interface does not map well to the tree-like structure +that is XML. A DOM parser reads a whole document into an internal +data structure representation, so a program can treat it exactly +as a tree. This also enables a program to access parts of the +document in an arbitrary order. +The DOM approach is all very well for small documents, but for large +documents the memory usage can rapidly grow to exceed +the available memory capacity. For eBook documents, this is unacceptable. +.SH +A different approach +.LP +The XML parser used in the eBook browser is akin to a SAX parser, +in that only a little of the XML structure is held in memory at one time. +The first significant difference is that the XML entities returned are +taken from one level of the tree - if the program does not wish to +see the contents of a particular XML tag, it is trivial to skip over. +The second significant difference is that random access is possible. +This possibility comes from the observation that if we have visited +a part of the document we can record the context that we found there +and restore it later if necessary. In this scheme, if we wish to return later to +a part of a document that we are currently at, we can create a ``mark'', +a token that holds the current context; at some later time we can use +that mark to return to this position. +.LP +The eBook browser uses this technique to enable random access +to the document on a page-by-page basis. Moreover a mark +can be written to external storage, thus allowing an external +``index'' into the document so it is not always necessary to +read the entire document from the start in order to jump to a particular +page in that document. +.SH +The programming interface +.LP +The interface is implemented by a module named +.CW Xml , +which provides a +.CW Parser +adt which gives access to the contents of an XML document. +Xml items are represented by an +.CW Item +pick adt with one branch of the pick corresponding to each +type of item that might be encountered. +.LP +The interface to the parser looks like this: +.P1 +open: fn(f: string, warning: chan of (Locator, string)): (ref Parser, string); +Parser: adt { + next: fn(p: self ref Parser): ref Item; + down: fn(p: self ref Parser); + up: fn(p: self ref Parser); + mark: fn(p: self ref Parser): ref Mark; + atmark: fn(p: self ref Parser, m: ref Mark): int; + goto: fn(p: self ref Parser, m: ref Mark); + str2mark: fn(p: self ref Parser, s: string): ref Mark; +}; +.P2 +To start parsing an XML document, it must first be +.CW open ed; +.CW warning +is a channel on which non-fatal error messages will be sent +if they are encountered during the parsing of the document. +It can be nil, in which case warnings are ignored. +If the document is opened successfully, a new +.CW Parser +adt, say +.I p , +is returned. +Calling +.CW \fIp\fP.next +returns the next XML item at the current level of the tree. If there +are no more items in the current branch at the current level, it +returns +.CW nil . +When a +.CW Tag +item is returned, +.CW \fIp\fP.down +can be used to descend ``into'' that tag; subsequent calls of +.CW \fIp\fP.next +will return XML items contained within the tag, +and +.CW \fIp\fP.up +returns to the previous level. +.LP +An +.CW Item +is a pick adt: +.P1 +Item: adt { + fileoffset: int; + pick { + Tag => + name: string; + attrs: Attributes; + Text => + ch: string; + ws1, ws2: int; + Process => + target: string; + data: string; + Doctype => + name: string; + public: int; + params: list of string; + Stylesheet => + attrs: Attributes; + Error => + loc: Locator; + msg: string; + } +}; +.P2 +.CW Item.Tag +represents a XML tag, empty or not. The XML +fragments +.CW "<tag></tag>" '' `` +and +.CW "<tag />" '' `` +look identical from the point of view of this interface. +A +.CW Text +item holds text found in between tags, with adjacent whitespaces merged +and whitespace at the beginning and end of the text elided. +.CW Ws1 +and +.CW ws2 +are non-zero if there was originally whitespace at the beginning +or end of the text respectively. +.CW Process +represents an XML processing request, as found between +.CW "<?....?>" '' `` +delimiters. +.CW Doctype +and +.CW Stylesheet +are items found in an XML document's prolog, the +former representing a +.CW "<!DOCTYPE...>" '' `` +document type declaration, and the latter an XML +stylesheet processing request. +.LP +When most applications are processing documents, they +will wish to ignore all items other than +.CW Tag +and +.CW Text . +To this end, it is conventional to define a ``front-end'' function +to return desired items, discard others, and take an appropriate +action when an error is encountered. Here's an example: +.P1 +nextitem(p: ref Parser): ref Item +{ + while ((gi := p.next()) != nil) { + pick i := gi { + Error => + sys->print("error at %s:%d: %s\n", + i.loc.systemid, i.loc.line, i.msg); + exit; + Process => + ; # ignore + Stylesheet => + ; # ignore + Doctype => + ; # ignore + * => + return gi; + } + } + return nil; +} +.P2 +When +.CW nextitem +encounters an error, it exits; it might instead handle the +error another way, say by raising an exception to be caught at the +outermost level of the parsing code. +.SH +A small example +.LP +Suppose we have an XML document that contains some data that we would +like to extract, ignoring the rest of the document. For this example we will +assume that the data is held within +.CW <data> +tags, which contain zero or more +.CW <item> +tags, holding the actual data as text within them. +Tags that we do not recognize we choose to ignore. +So for example, given the following XML document: +.P1 +<metadata> + <a>hello</a> + <b>goodbye</b> +</metadata> +<data> + <item>one</item> + <item>two</item> + <item>three</item> +</data> +<data> + <item>four</item> +</data> +.P2 +we wish to extract all the data items, but ignore everything inside +the +.CW <metadata> +tag. First, let us define another little convenience function to get +the next XML tag, ignoring extraneous items: +.P1 +nexttag(p: ref Parser): ref Item.Tag +{ + while ((gi := nextitem(p)) != nil) { + pick i := gi { + Tag => + return i; + } + } + return nil; +} +.P2 +Assuming that the document has already been opened, +the following function scans through the document, looking +for top level +.CW <data> +tags, and ignoring others: +.P1 +document(p: ref Parser) +{ + while ((i := nexttag(p)) != nil) { + if (i.name == "data") { + p.down(); + data(p); + p.up(); + } + } +} +.P2 +The function to parse a +.CW <data> +tag is almost as straightforward; it scans for +.CW <item> +tags and extracts any textual data contained therein: +.P1 +data(p: ref Parser) +{ + while ((i := nexttag(p)) != nil) { + if (i.name == "item") { + p.down(); + if ((gni := p.next()) != nil) { + pick ni := gni { + Text => + sys->print("item data: %s\n", ni.ch); + } + } + p.up(); + } + } +} +.P2 +The above program is all very well and works fine, but +suppose that the document that we are parsing is very +large, with data items scattered through its length, and that +we wish to access those items in an order that is not necessarily +that in which they appear in the document. +This is quite straightforward; every time we see a +data item, we record the current position with a mark. +Assuming the global declaration: +.P1 +marks: list of ref Mark; +.P2 +the +.CW document +function might become: +.P1 +document(p: ref Parser) +{ + while ((i := nexttag(p)) != nil) { + if (i.name == "data") { + p.down(); + marks = p.mark() :: marks; + p.up(); + } + } +} +.P2 +At some later time, we can access the data items arbitrarily, +for instance: +.P1 + for (m := marks; m != nil; m = tl m) { + p.goto(hd m); + data(p); + } +.P2 +If we wish to store the data item marks in some external index +(in a file, perhaps), the +.CW Mark +adt provides a +.CW str +function which returns a string representation of the mark. +.CW Parser 's +.CW str2mark +function can later be used to recover the mark. Care must +be taken that the document it refers to has not been changed, +otherwise it is likely that the mark will be invalid. +.SH +The eBook implementation +.LP +The Open eBook reader software uses the primitives described above +to maintain display-page-based access to arbitrarily large documents +while trying to bound memory usage. +Unfortunately it is difficult to unconditionally bound memory usage, +given that any element in an XML document may be arbitrarily +large. For instance a perfectly legal document might have 100MB +of continuous text containing no tags whatsoever. The described +interface would attempt to put all this text in one single item, rapidly +running out of memory! Similar types of problems can occur when +gathering the items necessary to format a particular tag. +For instance, to format the first row of a table, it is necessary to lay out +the entire table to determine the column widths. +.LP +I chose to make the simplifying assumption that top-level items within +the document would be small enough to fit into memory. +From the point of view of the display module, the document +looks like a simple sequence of items, one after another. +One item might cover more than one page, in which case a different +part of it will be displayed on each of those pages. +.LP +One difficulty is that the displayed size of an item depends on many +factors, such as stylesheet parameters, size of installed fonts, etc. +When a document is read, the page index must have been created +from the same document with the same parameters. It is difficult in +general to enumerate all the relevant parameters; they would need +to be stored inside, or alongside the index; any change would invalidate +the index. Instead of doing this, as the document is being displayed, +the eBook display program constantly checks to see if the results +it is getting from the index match with the results it is getting +when actually laying out the document. If the results differ, the +index is remade; the discrepancy will hopefully not be noticed by +the user! |
