1 files changed, 611 insertions, 0 deletions
diff --git a/doc/realinferno/real.ms b/doc/realinferno/real.ms
new file mode 100644
index 00000000..e2174dba
--- /dev/null
+++ b/doc/realinferno/real.ms
@@ -0,0 +1,611 @@
+.FP palatino
+.de ]C
+\&[\\$1]\\$2
+..
+.if \nZ=0 .so real.ref
+.EQ
+delim $$
+.EN
+.TL
+Real Inferno
+.AU
+.I "Eric Grosse"
+.AI
+.I "Lucent Technologies, Bell Labs"
+.I "Murray Hill NJ 07974 USA"
+.I "ehg@bell-labs.com"
+.\"date{19 Aug 1996, minor revisions 7 Jan 1998}
+.FS
+Previously appeared in R.F. Boisvert (editor),
+.ft I
+The Quality of Numerical Software: Assessment and
+Enhancement: Proceedings of the IFIP TC2/WG2.5
+Working Conference on the Quality of Numerical
+Software, Oxford,
+United Kingdom, 8-12 July 1996,
+.ft R
+Chapman Hall,
+London,
+1997 (pp. 270-279).
+.FE
+.AB
+Inferno is an operating system well suited to applications that need to be
+portable, graphical, and networked. This paper describes the fundamental
+floating point facilities of the system, including: tight rules on
+expression evaluation, binary/decimal conversion, exceptions and rounding,
+and the elementary function library.
+.AE
+.PP
+Although the focus of Inferno is interactive media, its portability across
+hardware and operating platforms, its relative simplicity, and its strength
+in distributed computing make it attractive for advanced scientific
+computing as well. Since the appearance of a new operating system is a
+relatively uncommon event, this is a special opportunity for numerical
+analysts to voice their opinion about what fundamental facilities they need.
+The purpose of this short paper is to describe numerical aspects of the
+initial release of Inferno, and to invite comment before the tyranny of
+backward compatibility makes changes impossible.
+.PP
+An overview of Inferno is given by Dorward et. al.
+.]C "Inferno" ,
+but for our immediate purposes it may suffice to say that Inferno plays the
+role of a traditional operating system (with compilers, process control,
+networking, graphics, and so on) but can run either on bare hardware or on
+top of another operating system like Windows95 or Unix. Programs for
+.I "Inferno"
+are written in the language
+.I "Limbo"
+and compiled to
+machine-independent object files for the
+.I "Dis"
+virtual
+machine, which is then implemented with runtime compilation for best
+performance. Files are accessible over networks using the
+.I "Styx"
+protocol; together with the presentation of most system resources as files
+and the manipulation of file namespaces, this permits integration of a
+collection of machines into a team. Limbo looks somewhat like a mixture of C
+and Pascal, augmented by modules (to cope with the namespace and dynamic
+loading needs of large programs) and by a channel facility for convenient
+(coarse-grain) parallel programing. Array references are bounds-checked and
+memory is garbage collected.
+.PP
+The rest of this paper covers the fundamental floating point environment
+provided by the Limbo compiler and
+.I "math"
+module, the ``elementary
+functions,'' and finally some comments on why particular definitions were
+chosen or why certain facilities were included or excluded. This discussion
+assumes the reader is familiar with scientific computing in general and the
+IEEE floating point standard in particular.
+.NH 1
+Floating point
+.PP
+In Limbo, arithmetic on literal and named constants is evaluated at compile
+time with all exceptions ignored. Arithmetic on variables is left by the
+compiler to runtime, even if data path analysis shows the value to be a
+compile time constant. This implies that tools generating Limbo source must
+do their own simplification, and not expect the compiler to change $x/x$
+into $1$, or $-(y-x)$ into $x-y$, or even $x-0$ into $x$. Negation $-x$
+changes the sign of $x$; note that this not the same as $0-x$ if $x=0$.
+.PP
+The compiler may perform subexpression elimination and other forms of code
+motion, but not across calls to the mode and status functions. It respects
+parentheses. The evaluation order of $a+b+c$ follows the parse tree and is
+therefore the same as for $(a+b)+c$. These are the same rules as for Fortran
+and C.
+.PP
+Contracted multiply-add instructions (with a single rounding) are not
+generated by the compiler, though they may be used in the native
+.SM BLAS
+libraries. All arithmetic follows the IEEE floating point standard
+.]C "IEEEfp" ,
+except that denormalized numbers may not be supported; see the
+discussion in section 3.
+.PP
+The most important numerical development at the language level recently has
+been accurate binary/decimal conversion
+.]C "Clinger"
+.]C "Gay"
+.]C "SteeleWhite" .
+Thus printing a real using
+.CW "\%g"
+and reading
+it on a different machine guarantees recovering identical bits. (Limbo uses
+the familiar
+.I "printf"
+syntax of C, but checks argument types against
+the format string at compile time, in keeping with its attempt to help the
+programmer by stringent type checking.) A good
+.I "strtod/dtoa"
+is,
+unfortunately, 1700 lines of source (15kbytes compiled), though with modest
+average runtime penalty. This code must be used in the compiler so that
+coefficients are accurately transferred to bytecodes. Smaller, faster, but
+sloppier, runtimes will also be provided when mandated by limited memory and
+specialized use. However, programmers may assume the features described in
+this paper are present in all Inferno systems intended for general computing.
+.PP
+Each thread has a floating point control word (governing rounding mode and
+whether a particular floating point exception causes a trap) and a floating
+point status word (storing accumulated exception flags). Functions
+.I "FPcontrol"
+and
+.I "FPstatus"
+copy bits to the control or status word, in
+positions specified by a mask, returning previous values of the bits. 
+.I "getFPcontrol"
+and
+.I "getFPstatus"
+return the words unchanged.
+.PP
+The constants
+.I "INVAL, ZDIV, OVFL, UNFL, INEX"
+are non-overlapping
+single-bit masks used to compose arguments or return values. They stand for
+the five IEEE exceptions:
+.IP •
+``invalid operation'' ($0/0$,$0 * infinity $,$ infinity - infinity $,$sqrt{-1}$)
+.IP •
+``division by zero'' ($1/0$),
+.IP •
+``overflow'' ($1.8e308$)
+.IP •
+``underflow'' ($1.1e-308$)
+.IP •
+``inexact'' ($.3*.3$).
+.PP
+The constants
+.I "RND_NR, RND_NINF, RND_PINF, RND_Z"
+are distinct
+bit patterns for ``round to nearest even'', ``round toward $-{infinity} $'',
+``round toward $+{infinity} $'', ``round toward $0$'', any of which can be set
+or extracted from the floating point control word using
+.I "RND_MASK" .
+For example,
+.IP •
+to arrange for the program to tolerate underflow,
+.I "FPcontrol(0,UNFL)."
+.IP •
+to check and clear the inexact flag,
+.I "FPstatus(0,INEX)."
+.IP •
+to set directed rounding,
+.I "FPcontrol(RND_PINF,RND_MASK)."
+.PP
+By default,
+.I "INEX"
+is quiet and
+.I "OVFL, UNFL, ZDIV,"
+and
+.I "INVAL"
+are fatal. By default, rounding is to nearest even, and library
+functions are entitled to assume this. Functions that wish to use quiet
+overflow, underflow, or zero-divide should either set and restore the
+control register themselves or clearly document that the caller must do so.
+The ``default'' mentioned here is what a Limbo program gets if started in a
+fresh environment. Threads inherit floating point control and status from
+their parent at the time of spawning and therefore one can spawn a ``round
+toward 0'' shell and re-run a program to effortlessly look for rounding
+instabilities in a program.
+.NH 1
+Elementary functions
+.PP
+The constants
+.I "Infinity, NaN, MachEps, Pi, Degree"
+are defined. Since
+Inferno has thorough support of Unicode, it was tempting to name these $infinity $, $ε $, $π $, and °, but people (or rather, their
+favorite text editing tools) may not be ready yet for non-\s-2ASCII\s0
+source text.
+.I "Infinity"
+and
+.I "NaN"
+are the positive infinity
+and quiet not-a-number of the IEEE standard, double precision.
+.I MachEps
+is $2 sup {-52}$, the unit in the last place of the mantissa $1.0$.
+The value of
+.I "Pi"
+is the nearest machine number to the
+mathematical value $π $.
+.I "Degree"
+is
+$"Pi" / 180$.
+.PP
+Three useful functions
+.I "fdim, fmax, fmin"
+are adopted from the
+Numerical C extensions
+.]C "NumerC" .
+The unusual one of these, often
+denoted $(x-y) sub {+}$, is defined by $roman "fdim" ( x , y )=x-y$ if $x > y$, else $0$. The compiler may turn these into efficient machine instruction sequences,
+possibly even branch-free, rather than function calls. There are two almost
+redundant mod functions:
+.I "remainder(x,y)"
+is as defined by the IEEE
+standard (with result $roman "|" r roman "|" <= y/2$);
+.I "fmod(x,y)"
+is $x roman "mod" y$,
+computed in exact arithmetic with $0<= r<y$. Limbo has a ``tuple'' type,
+which is the natural return value in the call $(i,f)= roman "modf" ( x )$ to
+break $x$ into integer and fractional parts. The function
+.I "rint"
+rounds to an integer, following the rounding mode specified in the floating
+point control word.
+.PP
+For a good-quality, freely-available elementary function library,
+.I "math"
+uses the IEEE subset of
+.I "fdlibm"
+.]C "fdlibm" .
+Of course, a
+conforming implementation may use entirely different source, but must take
+care with accuracy and with special arguments. There are the customary
+power, trigonometric, Bessel, and erf functions, and specialized versions $roman "expm1"( x )=e sup x - 1$, $roman "log1p" ( x )=log ( 1 + x )$. An additional function 
+$roman "pow10"( n ) = 10 sup n$ is defined; in the default implementation this is
+just fdlibm's $roman "pow" ( 10. , n )$ but it is provided so that separate
+trade-offs of accuracy and simplicity can be made
+.]C "MacIlroy" .
+.I "fdlibm"
+uses extra precise argument reduction, so the computed $sin (n*Pi)$
+is small but nonzero. If demands warrant, degree versions of the
+trigonometric functions will be added, but for now the style $sin (45*Degree)$ is used.
+The library also provides IEEE functions
+.I "ilogb, scalbn, copysign, finite, isnan,"
+and
+.I "nextafter" .
+.PP
+The functions
+.I "dot, norm1, norm2, iamax, gemm"
+are adopted from the
+.SM BLAS
+.]C "blas"
+to get tuned linear algebra kernels for
+each architecture, possibly using extra-precise accumulators. These are
+defined by $sum {{x sub i}{y sub i}}$, $sum roman | {x sub i} roman | $, $ sqrt{sum { x sub {i sup 2}}} $, $i$ such
+that $roman | {x sub i} roman | = roman max $, and $C= alpha AB + beta C$ with optional transposes on $A$
+and $B$. Since Limbo has only one floating-point type, there is no need here
+for a precision prefix. Limbo array slices permit the calling sequences to
+be more readable than in Fortran77 or C, though restricted to unit stride.
+This encourages better cache performance anyway. The matrix multiply
+function
+.I "gemm"
+remains general stride (and is the foundation for
+other operations
+.]C "Kagstrom" ).
+.PP
+Limbo is like C in providing singly-subscripted arrays with indexing
+starting at 0. Although Limbo offers arrays of arrays, as in C, for
+scientific work a better choice is to adopt the style of linearizing
+subscripts using Fortran storage order. This promotes easier exchange of
+data with other applications and reuses effort in organizing loops to
+achieve good locality. In previous language work
+.]C "pine" ,
+we implemented
+a C preprocessor that allowed the programmer to choose a convenient origin
+(such as 1) and have it compiled into 0 for the base language; because we
+passed arrays as dope vectors, we were even able to allow different origins
+for the same array in calling and called functions. The main lesson we
+learned from that experience, however, was that permutations become a
+nightmare when there is anything but dogmatic adherence to a single origin.
+So for an $m$ by $n$ matrix $A$, the programmer should use loops with $0<=
+i<m$ and $0<= j<n$ and access $A[i+m*j]$.
+.PP
+For interoperability with foreign file formats and for saving main memory in
+selected applications, functions are provided for copying bits between and
+reals and 32-bit or 64-bit IEEE-format values.
+.PP
+Finally,
+.I "math"
+provides a tuned quicksort function
+.I "sort(x,p)"
+where
+.I "x"
+is a real array and
+.I "p"
+is an int array representing
+a 0-origin permutation. This function leaves the contents of
+.I "x"
+untouched and rearranges
+.I "p"
+so that $x[{p sub i}]<= x[p sub {i+1}]$. This is
+usually what one wants to do: sort an array of abstract data types based on
+some key, but without the need to actually swap large chunks of memory.
+.NH 1
+Rationale
+.PP
+This section discusses why certain numerical features were included or not.
+.NH 2
+Rounding modes and accumulated exceptions
+.PP
+Directed rounding is only needed in a very few places in scientific
+computing, but in those places it is indispensable. Accumulated floating
+point exceptions are even more useful. User trap handling is a harder
+problem, and may be worth leaving for later, possibly with a default
+``retrospective diagnostics'' log
+.]C "Kahan" .
+.PP
+Note that the exception masks must be architecture independent, since they
+reside in the Limbo bytecodes, and therefore the implementation involves a
+small amount of bit fiddling. Still, it is efficient enough to encourage
+use. It would be difficult to port to a processor that only had static
+rounding modes in instruction opcodes rather than the dynamic model
+specified in section 2 of the IEEE standard.  Fortunately, the Alpha
+does provide both models.
+.NH 2
+Sudden underflow
+.PP
+Some processor vendors make supporting gradual underflow just too hard. (One
+must struggle upon the system trap to reconstruct exactly which instruction
+was executing and what the state of the registers was. On the MIPS, it is
+said to be 30 pages of assembler.) So Inferno supports denormalized numbers
+only if the hardware makes this easy. Providing underflow that is correct
+but very slow, as some systems do, is not necessarily doing the user a favor.
+.PP
+To determine portably if a particular system offers gradual underflow, mask
+off UNFL and do trial arithmetic.
+.NH 2
+Speed
+.PP
+Computers with slow (software) gradual underflow usually provide a fast
+flush-to-0 alternative. This often suffices, though there are important
+examples where it forces an uglier and slower coding style. A worse
+situation is if the hardware uses system traps for Infinity and NaN
+arithmetic. The resulting slowdown will make otherwise excellent and natural
+algorithms run slowly
+.]C "Demmel" .
+Sadly, even some x86 implementations
+that do non-finite arithmetic in hardware, do it relatively slowly.
+.PP
+We considered providing syntax to declare a certain program scope within
+which precise IEEE behavior was required, and relaxing the rules outside
+such scopes.
+(The numerical C extensions
+.]C "NumerC" 
+use pragma
+for this purpose.)
+These scope declarations would need to be in the
+bytecodes, since significant optimization may be attempted by the runtime
+compiler. After some discussion, and with some trepidation, it was agreed
+that instead all compilers would be required to preserve the same result and
+status as for an unoptimized version.
+.NH 2
+Comparison
+.PP
+The standard C operators
+.CW ==
+.CW !=
+.CW "<"
+.CW "<="
+.CW ">"
+.CW ">="
+are the only comparisons provided, and they behave exactly
+like the ``math'' part of Table 4 of the IEEE standard. Programs interested
+in handling NaN data should test explicitly. This seems to be the way most
+people program and leads to code more understandable to nonexperts. It is
+true that with more operators one can correctly write code that propagates
+NaNs to a successful conclusion\-but that support has been left for later.
+NaN(''tag'') should be added at that same time.
+.NH 2
+Precision
+.PP
+All implementations run exclusively in IEEE double precision. If the
+hardware has extra-precise accumulators, the round-to-double mode is set
+automatically and not changeable, in keeping with Limbo's design to have
+only one floating point type. Extended precision hardware, if available, may
+be used by the built-in elementary function and
+.SM BLAS
+libraries.
+Also, we contemplate adding a dotsharp function that would use a very long
+accumulator for very precise inner products, independent of the order of
+vector elements
+.]C "kulisch" .
+But reference implementations that use only
+double precision, avoid contracted multiply-add, and evaluate in the order 1
+up to n will always be available for strict portability.
+.PP
+At the time the decision was made to restrict the system to 64-bit floating
+point, Limbo integers were almost exclusively 32-bit and the consistency
+argument to have a single real type was compelling. Now that Limbo has more
+integer types the decision might be reconsidered. But so many engineers
+needlessly struggle with programs run in short precision, that offering it
+may do as much harm as good. On most modern computers used for general
+purpose scientific computing, 64-bit floating point arithmetic is as fast as
+32-bit, except for the memory traffic. In cases where the shorter precision
+would suffice and memory is a crucial concern, the programmer should
+consider carefully scaled fixed point or specialized compression. To
+efficiently interoperate with data files that use the short format,
+programmers may use the provided realbits32 function. While there are surely
+appropriate uses for a first-class 32-bit real type, for now we follow
+Kahan's sarcastic motto ``why use lead when gold will do?''
+.NH 2
+BLAS
+.PP
+The few
+.SM BLAS
+in the core library were chosen for readability and,
+in case of gemm, for optimization beyond what a reasonable compiler would
+attempt. We expect that compilers will (soon) be good enough that the
+difference between compiling $y+=a*x$ and calling daxpy is small. Also, as
+mentioned above, dot and gemm might reasonably use combined multiply-add or
+a long accumulator in some optional implementations.
+.NH 2
+$GAMMA ( x )$
+.PP
+To avoid confusion with the C math library, which defined
+.I "gamma"
+as $ln GAMMA $, we offer only
+.I "lgamma"
+for now. This function and 
+.I "modf"
+return an (int,real) tuple rather than assigning through an
+integer pointer, in keeping with Limbo's design. The opportunity has been
+taken to drop some obsolete functions like
+.I "frexp" .
+Other functions
+are unchanged from the C math library.
+.NH 2
+Future
+.PP
+A prototype preprocessor has been written to allow the scientific programmer
+to write $A[i,j]$ for an $A$ that was created as a $Matrix(m,n)$ and to have
+the subscript linearization done automatically. Here $Matrix$ is an Limbo
+abstract data type containing a real array and integers $m$, $n$, and column
+stride $lda$ used as in typical Fortran calling sequences.
+.PP
+The Limbo compiler is soon expected to implement the type
+.I "complex" .
+.PP
+Higher level numerical libraries will also be provided, and although that
+topic is beyond the scope of this paper, opinions about what should come
+first would be welcome.
+.PP
+Distributed computing has not been mentioned here because it involves
+relatively few considerations specific to floating point computation.
+However, it may be worth noting that in the default environment (with
+underflow trapped, so that presence or absence of denormalized numbers is
+not significant) programs run independently on heterogeneous machines
+nevertheless get precisely identical results, even with respect to thread
+scheduling. This implies that certain communication steps can be avoided,
+and that regression testing is considerably simplified.
+.PP
+Please direct comments on these numerical aspects of Inferno to Eric Grosse.
+More general technical comments can be directed to Vita Nuova
+.CW comments@vitanuova.com ). (
+I am grateful to Joe Darcy, Berkeley,
+to David Gay, Bell Labs, to David Hook, University of Melbourne,
+and to participants of the IFIP WG2.5 Working
+Conference on Quality of Numerical Software for insightful comments on a
+first draft of this paper.
+.\"the principal developers of Inferno: Sean Dorward, Rob Pike, Dave Presotto, Howard Trickey, and Phil Winterbottom.
+.SH
+Trademarks
+.LP
+Inferno, Limbo, and Dis are trademarks of Vita Nuova Holdings Limited.
+Unix is a trademark of Unix Systems Laboratories.
+Windows95 is a trademark of Microsoft.
+.EQ
+delim off
+.EN
+.SH
+References
+.nr PS -1
+.nr VS -1
+.LP
+.nr [N 0 1
+.de ]N
+.IP \\n+([N.
+.if \nZ>0 .tm \\$1 \\n([N
+..
+....
+....
+.]N "Inferno"
+S. Dorward,
+R. Pike, 
+D.\ L. Presotto,
+D.\ M. Ritchie,
+H. Trickey,
+P. Winterbottom,
+``The Inferno Operating System'',
+.I "Bell Labs Technical Journal" ,
+Vol. 2,
+No. 1,
+Winter 1997, pp. 5-18.
+Reprinted in this volume.
+.]N "Clinger"
+W.\ D. Clinger.
+``How to read floating point numbers accurately.
+In \fIProceedings of the ACM SIGPLAN'90 Conference on Programming
+Language Design and Implementation\fP, pages 92-101, 1990.
+....
+....
+.]N "Demmel"
+James\ W. Demmel and Xiaoye Li.
+Faster numerical algorithms via exception handling.
+In Jr. Earl\ Swartzlander, Mary\ Jane Irwin, and Graham Jullien,
+editors, \fIProceedings: 11th Symposium on Computer Arithmetic\fP. IEEE
+Computer Society Press, 1993.
+....
+....
+.]N "blas"
+Jack\ J. Dongarra, Jeremy\ Du Croz, Sven Hammarling, and Richard\ J. Hanson.
+Algorithm 656: An extended set of Basic Linear Algebra Subprograms.
+\fIACM Trans. on Mathematical Software\fP, 14(1):18-32, March 1988.
+....
+....
+.]N "Gay"
+D.\ M. Gay.
+Correctly rounded binary-decimal and decimal-binary conversions.
+Numerical Analysis Manuscript No. 90-10, AT&T Bell Laboratories,
+Murray Hill, NJ, 1990.
+freely redistributable, available at
+.CW http://netlib.bell-labs.com/netlib/fp/ .
+....
+....
+.]N "pine"
+E.\ H. Grosse and W.\ M. Coughran, Jr.
+The pine programming language.
+Numerical Analysis Manuscript 83-4, AT&T Bell Laboratories, 1983.
+.br
+.CW ftp://cm.bell-labs.com/cm/cs/doc/92/pine.ps.Z .
+....
+....
+.]N "IEEEfp"
+IEEE.
+Standard for binary floating-point arithmetic.
+Technical Report Std 754-1985, ANSI, 1985.
+....
+....
+.]N "Kagstrom"
+Bo\ Kagstrom, Per Ling, and Charles Van\ Loan.
+Portable high performance GEMM-based Level 3 BLAS.
+In R.\ F.\ Sincovec et\ al., editor, \fIParallel Processing for
+Scientific Computing\fP, pages 339-346. SIAM Publications, 1993.
+.CW /netlib/blas/ .
+....
+....
+.]N "Kahan"
+W.\ Kahan.
+Lecture notes on the status of IEEE Standard 754 for binary
+floating-point arithmetic.
+Technical report, Univ. Calif. Berkeley, May 23 1995.
+Work in Progress.
+....
+....
+.]N "kulisch"
+U.\ Kulisch and W.L. Miranker.
+\fIComputer arithmetic in theory and practice.\fP
+Academic Press, 1980.
+....
+....
+.]N "MacIlroy"
+M.\ D. McIlroy.
+Mass produced software components.
+In Peter Naur and Brian Randell, editors, \fISoftware Engineering\fP,
+pages 138-155, 1969.
+Garmisch, Germany, October 1968.
+....
+....
+.]N "fdlibm"
+Kwok\ C. Ng.
+.CW fdlibm :
+C math library for machines that support ieee 754
+floating-point.
+freely redistributable; available at
+.CW http://netlib.bell-labs.com/netlib/fdlibm/ ,
+March 1995.
+....
+....
+.]N "SteeleWhite"
+G.\ L. Steele and J.\ L. White.
+How to print floating point numbers accurately.
+In \fIProceedings of the ACM SIGPLAN'90 Conference on Programming
+Language Design and Implementation\fP, pages 112-126, 1990.
+....
+....
+.]N "NumerC"
+X3J11.1.
+Chapter 5, floating-point C extensions.
+Technical report, ANSI, March 29 1995.
+.nr PS +1
+.nr VS +1