In Search of a Fast Cube Root
Following are thoughts and observations regarding the machine extraction of
cube roots with a particular emphasis on speed of calculations.
Cube roots are a topic of interest, because they're used in many common
formulae and a fast cube root function isn't included with Microsoft Visual
Studio.
In the absence of a special cube root function, a typical strategy is
calculation via a power function (e.g., pow(x, 1.0/3.0)).
This can be problematic in terms of speed and in terms of accuracy
when negative numbers aren't handled properly.
All of the implementations of cube root functions I discovered rely on Newton's
method. Notable implementations include William. M.
Kahan's cube root function as found in the BSD sources (kahan's
cbrt) and Ken Turkowski's as presented in an Apple technical report (CubeRoot).
Kahan's implementation is particularly robust, handling negative numbers, NaN,
INF and subnormals. Turkowski's paper includes interesting facts and insights.
Cube Root by Newton's method
Given an
approximation ‘a’ of the cube root of ‘R’,
the following formula will produce ‘b’, a closer
approximation to the cube root of ‘R’.

or
Repeated application yields results progressively closer to the true cube root
of R, with the number of bits of precision doubling with each iteration.
Consequently, the accuracy of the initial guess fed into the iteration is
especially important. Desired degrees of precision can be achieved with fewer
iterations when initial estimates are more accurate. Computationally, each iteration
is relatively inexpensive: one divide, two multiplies and two adds.
Turkowski's approach factors the cube root of R into the product of a power of
two and a cube root of a small value with a limited range. He then uses a
quadratic approximation to estimate the cube root of the small value. This
approach yields an initial value with approximately 6 bits of precision to seed
the Newton-Raphson iteration.
Kahan uses a bit hack to generate an initial estimate with 5 bits of precision.
The approach exploits the properties of IEEE 754 floating point numbers by
leveraging the fact that their binary representation is close to a log2
reppresentation. The following code demonstrates the technique applied for the
double precision case. (Note: Kahan's code also includes support for
handling subnormal numbers. The following code assumes 32-bit integers).
inline double cbrt_5d(double d)
{
const unsigned int B1 = 715094163;
double t = 0.0;
unsigned int* pt = (unsigned int*) &t;
unsigned int* px = (unsigned int*) &d;
pt[1]=px[1]/3+B1;
return t;
}
General Nth Root Approximations
I've generalized this approach to provide initial approximations for nth roots
generally for IEEE 754 floats and doubles in the following:
template
<int n>
inline float nth_rootf(float
x)
{
const int ebits = 8;
const int fbits = 23;
int& i = (int&) x;
const int bias = (1 << (ebits-1))-1;
i = (i - (bias << fbits)) / n + (bias << fbits);
return x;
}
template <int n>
inline double nth_rootd(double
x)
{
const int ebits = 11;
const int fbits = 52;
_int64& i = (_int64&) x;
const _int64 bias = (1 << (ebits-1))-1;
i = (i - (bias << fbits)) / n + (bias << fbits);
return x;
}
To use the code to generate an nth root approximation, n is specified as a C++
template parameter. For example, a float approximation for the cube root of 5.0
can be can be generated with the call nth_rootf<3>(5.0f). In my experience, these
functions provide estimates in the range of 5 to 6 bits of precision.
Halley's Method applied to Cube
Roots
Recent investigations into faster converging cube root approximations led to a
method offered by Otis E. Lancaster in Machine Method for the Extraction of
a Cube Root1. Lancaster’s method is calculated relatively
efficiently, and it exhibits cubic convergence in contrast to the quadratic
convergence of Newton's method. It takes the following form (where ‘b’ is a closer approximation to
the cube root of R than ‘a’).

The method is actually an application of a method devised by comet discoverer
Edmond Halley (see Halley's
Method). Over the years it has
been repeatedly rediscovered and attributed to others including Lambert and
Bailey. With slight massaging, the equation can be reduced to the following:

In this form, the calculation cost amounts to one divide, three multiplies and
three additions. Although this exceeds Newton-Raphson by a multiply and an add,
speed improvements can be reaped thanks to the faster rate at which
Halley’s method converges and the relatively high cost of divide, which
is a factor in each iteration of either method.
Comparison
of Various Methods
Following
are times and accuracy estimates of various methods including the bit hack
approximation (cbrt_5f, cbrt_5d), up to three iterations of Halley’s
method and up to four iterations of Newton’s methods. These results were
obtained on an Intel Core Duo system using C++ compiled with Microsoft Visual
Studio 2005.
In
the tests, one million cube roots are calculated uniformly across the range
from 0.0 to 1.0. Times are reported in milliseconds. The second column
indicates an estimate of the minimum number of bits of precision in the worst
case found. The last column indicates an estimate of the average number of bits
of precision per evaluation.
The
cbrt_5f and cbrt_5d functions were used to seed the iterative methods.
It’s important to note that a better approximation will converge even
more quickly. Given the strategy used, a single iteration of Halley’s
method appears sufficient for 16 bits of precision. Also, note the costs of
various methods in comparison to the standard pow( ) function.
32-bit float tests
----------------------------------------
cbrt_5f 8.8 ms 5 mbp 6.223 abp
pow
144.5 ms 23 mbp 23.000 abp
halley x 1 31.8
ms 15 mbp 18.961 abp
halley x 2 59.0
ms 23 mbp 23.000 abp
newton x 1 23.4
ms 10 mbp 12.525 abp
newton x 2 48.9
ms 20 mbp 22.764 abp
newton x 3 72.0
ms 23 mbp 23.000 abp
newton x 4 89.6
ms 23 mbp 23.000 abp
64-bit double tests
----------------------------------------
cbrt_5d 23.5 ms 5 mbp 6.299 abp
pow
151.3 ms 52 mbp 52.000 abp
halley x 1 56.5
ms 16 mbp 19.503 abp
halley x 2 95.3
ms 47 mbp 51.575 abp
halley x 3 122.1 ms 51 mbp 51.965 abp
newton x 1 48.8
ms 10 mbp 12.596 abp
newton x 2 78.6
ms 20 mbp 25.205 abp
newton x 3 104.5 ms 40 mbp 47.681 abp
newton x 4 134.7 ms 51 mbp 51.992 abp
The
source code is available here.
1. Otis E. Lancaster, Machine Method for the Extraction of Cube Root Journal of
the American Statistical Association, Vol. 37, No. 217. (Mar., 1942), pp.
112-115.