XNumbers
Documentation for XNumbers: Extended-exponent floating-point numbers.
This package implements "X-numbers" as described by Fukushima (2012) — which refined the ideas of Smith et al. (1981). As Fukushima explained,
...we represent a non-zero arbitrary real number, $X$, by a pair of an IEEE754 floating point number, $x$, and a signed integer, $i_X$. More specifically speaking, we choose a certain large power of 2 as the radix, $B$, and regard $x$ and $i_X$ as the significand and the exponent with respect to it. Namely, we express $X$ as $X = x B^{i_X}$. The major difference from Smith et al. (1981) is the choice of the radix...
Note that this does not increase the precision of floating-point operations (the number of digits in the significand), but vastly increases the range of numbers that can be represented. Therefore, X-numbers are not frequently useful in numerical analysis. Even the standard Float64 can represent numbers of magnitude roughly $2^{-1022}$ to $2^{1024}$, which is usually sufficient — whereas the roughly 16 digits of precision can frequently be inadequate. In such cases, it is better to use extended-precision arithmetic, provided by packages like DoubleFloats.jl, Quadmath.jl, ArbNumerics.jl, or the built-in BigFloat type.
Instead, X-numbers are useful in very specific cases, like the computation of Associated Legendre Functions (ALFs) and hence (scalar) spherical harmonics of very high degree. While the same goals could be partially achieved using certain other types like BigFloat, X-numbers can be implemented far more efficiently, requiring anywhere from about 10% longer to a few times longer than similar computations with the underlying float type. However, see Xing et al. (2020) for techniques to compute ALFs using standard float types that can be advantageous in some ways. X-numbers might also find use in machine learning, where the precision of a Float64 is unnecessary, but the range of a Float16 may be too restrictive.
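As a rough sketch of the idea (assuming that xnumber, multiplication of an X-number by a float, and the Float64 conversion documented in the API below are all exported), a value whose magnitude exceeds the Float64 range stays representable as an X-number:

```julia
using XNumbers

x = 1.0e200
x * x                  # Inf: the product overflows a plain Float64

X = xnumber(1.0e200)   # the same value stored as an X-number
Y = 1.0e200 * X        # float * X-number; the value 1.0e400 remains finite as an X-number
Float64(Y)             # overflows on conversion back, since 1.0e400 exceeds the Float64 range
```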
Citing
See CITATION.bib for the relevant reference(s).
API
Base.float — Method

float(x)
Float32(x)
Float64(x)
Float128(x)
Double64(x)

Convert an X-number x to its underlying float form.
Follows the routine given in Table 6 of Fukushima (2012).
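For instance, a minimal sketch (assuming xnumber and the conversions are exported):

```julia
using XNumbers

X = xnumber(0.5)   # X-number representing 0.5
float(X)           # back to the underlying Float64, i.e. 0.5
Float64(X)         # equivalent explicit conversion
```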
XNumbers.linear_combination — Method

linear_combination(f, X, g, Y)

Compute $fX+gY$, where $f$ and $g$ are floating-point numbers, and $X$ and $Y$ are X-numbers.
Follows the routine given in Table 8 of Fukushima (2012).
This routine — translated from Fukushima's Fortran code — does not account for the possibility that X or Y may be zero. It may be better to just use the more natural expression f*X+g*Y.
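A short sketch of both spellings (calling linear_combination module-qualified in case it is not exported; scalar arithmetic on X-numbers is assumed to behave as described above):

```julia
using XNumbers

X = xnumber(3.0)
Y = xnumber(4.0)

# fX + gY with f = 2.0 and g = -1.0, i.e. 2*3 - 1*4 = 2
Z1 = XNumbers.linear_combination(2.0, X, -1.0, Y)

# The more natural expression, which also copes with zero operands
Z2 = 2.0*X + (-1.0)*Y

Float64(Z1), Float64(Z2)   # both should convert back to 2.0
```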
XNumbers.normalize — Method

normalize(x)

Normalize a weakly normalized X-number.
Follows the routine given in Table 7 of Fukushima (2012).
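A hedged sketch (the explicit XNumber{T} constructor is described under xnumber below; the value here is assumed to fall outside the canonical significand range, otherwise normalize is simply a no-op):

```julia
using XNumbers

# Built by hand, so the significand may lie outside the canonical range
X = XNumber{Float64}(1.0e300, 0)

# Bring it into normalized form before using it with other routines
Xn = XNumbers.normalize(X)
Float64(Xn)   # still represents 1.0e300
```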
XNumbers.xnumber — Method

xnumber(x)
xnumber(x, i)

Construct an X-number with the same base type as x, and normalize. If i is not included, it is assumed to be 0.
It is also possible to construct an XNumber explicitly as XNumber{T}(x, i), which bypasses the normalization step. Use caution if doing so, as most methods assume that XNumbers are normalized.
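A brief sketch of the constructors (assuming xnumber and XNumber are exported):

```julia
using XNumbers

X1 = xnumber(2.0)      # exponent defaults to 0; the result is normalized
X2 = xnumber(2.0, 3)   # significand 2.0 with X-number exponent 3, then normalized

# Explicit construction skips normalization; use with caution, since most
# methods assume their XNumber arguments are already normalized.
X3 = XNumber{Float64}(2.0, 3)
```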