This paper presents fast implementations of the inverse square root and the arcsine, both in double precision. In single precision it is often possible to compute elementary functions such as the square root using a small table and one ordinary Newton-Raphson iteration. In double precision, a substantially larger table is necessary to reach the desired precision, or, if a smaller table is used, the additional Newton-Raphson iterations needed to reach that precision often require the evaluation of other expensive elementary functions. Furthermore, large tables occupy cache memory that would otherwise be available to the application code.
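As an illustration of the single-precision approach (a minimal sketch, not taken from the paper; here the x86_64 rsqrtss instruction plays the role of the small table):

    /* Sketch: single-precision inverse square root from a small hardware
       table (rsqrtss, about 12 bits) refined by one Newton-Raphson step. */
    #include <xmmintrin.h>

    static inline float rsqrt_sp(float x)
    {
        /* ~12-bit initial approximation from the hardware lookup. */
        float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
        /* One Newton-Raphson iteration, y <- y*(3 - x*y*y)/2, which roughly
           doubles the number of correct bits, close to full single precision. */
        return 0.5f * y * (3.0f - x * y * y);
    }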
The desired precision can instead be obtained with a small table by using a method of higher order than the second-order Newton-Raphson method. A generalization of Newton's method to higher order is Householder's method, which unfortunately often results in very complicated expressions requiring many multiplications, additions, and even divisions.
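For reference, Householder's iteration of order $d+1$ for a root of $f(y) = 0$ can be written in the standard form
\[
  y_{n+1} = y_n + d\,\frac{(1/f)^{(d-1)}(y_n)}{(1/f)^{(d)}(y_n)},
\]
where $d = 1$ recovers the Newton-Raphson step $y_{n+1} = y_n - f(y_n)/f'(y_n)$; the higher derivatives of $1/f$ are what generally make the resulting expressions unwieldy.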
We show how a high-order method can be used that requires only a few extra additions and multiplications per additional order. The method starts from the Taylor expansion of the difference between the value of the elementary function and the starting guess of each iteration. If the Taylor series is truncated after the second term, ordinary Newton iterations are obtained. In several cases the difference between the true value and the starting guess can be simplified algebraically; in those cases we show that it is advantageous to carry the Taylor series to higher order to obtain a fast-converging method. Moreover, we show how the coefficients of a Chebyshev polynomial can be fitted to minimize the error of the functions close to zero and at the same time reduce the number of terms in the Taylor expansion.
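As an illustration of the idea for the inverse square root (our notation, not necessarily the paper's): if $y_0$ approximates $1/\sqrt{x}$ and $e = 1 - x y_0^2$, then
\[
  \frac{1}{\sqrt{x}} = \frac{y_0}{\sqrt{x\,y_0^2}} = y_0\,(1-e)^{-1/2}
  = y_0\left(1 + \tfrac{1}{2}e + \tfrac{3}{8}e^2 + \tfrac{5}{16}e^3 + \cdots\right).
\]
Truncating after the linear term gives exactly the Newton-Raphson step $y_1 = y_0(3 - x y_0^2)/2$, while each further term raises the order of convergence by one at the cost of roughly one extra multiplication and addition in a Horner evaluation of the polynomial in $e$.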
In the paper we benchmark two example implementations of the method on the x86_64 architecture. The first is the inverse square root, where the actual table (to 12-bit precision) is provided by the processor hardware. The inverse square root is important in many application programs, including computer graphics and explicit particle simulation codes, for instance the Monte Carlo and Molecular Dynamics methods of statistical mechanics. The other example is the arcsine function, which has a slowly converging Taylor expansion and for which no table is provided by the hardware. The vectorized versions of the inverse square root implementation are 3.5 times faster than compiled code on the Athlon64 and about 5 times faster on the Core 2. The scalar version of the arcsine function is, depending on order and table size, between 2 and 3 times faster than the compiled code; the vectorized version is between 3 and 4 times faster on the Athlon64 and between 4 and 5 times faster than the compiled version on the Core 2.
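For concreteness, a minimal sketch (our own, using SSE2 intrinsics; not the paper's actual code) of how the hardware 12-bit table can be combined with one high-order polynomial correction to reach approximately double precision for two values at a time:

    /* Sketch: double-precision inverse square root.  The 12-bit hardware
       table (rsqrtps) supplies the seed, and one fifth-order polynomial
       correction in e = 1 - x*y*y lifts it to roughly double precision
       (about 12 bits times the order of the method). */
    #include <emmintrin.h>

    static inline __m128d rsqrt_dp(__m128d x)
    {
        /* ~12-bit seed via the single-precision hardware approximation. */
        __m128d y = _mm_cvtps_pd(_mm_rsqrt_ps(_mm_cvtpd_ps(x)));

        /* e = 1 - x*y^2 */
        __m128d e = _mm_sub_pd(_mm_set1_pd(1.0),
                               _mm_mul_pd(x, _mm_mul_pd(y, y)));

        /* y * (1 + e/2 + 3e^2/8 + 5e^3/16 + 35e^4/128), Horner's rule. */
        __m128d p = _mm_set1_pd(35.0 / 128.0);
        p = _mm_add_pd(_mm_mul_pd(p, e), _mm_set1_pd(5.0 / 16.0));
        p = _mm_add_pd(_mm_mul_pd(p, e), _mm_set1_pd(3.0 / 8.0));
        p = _mm_add_pd(_mm_mul_pd(p, e), _mm_set1_pd(1.0 / 2.0));
        p = _mm_add_pd(_mm_mul_pd(p, e), _mm_set1_pd(1.0));
        return _mm_mul_pd(y, p);
    }

Since the seed is accurate to about 12 bits, a correction of order five (terms up to e^4) is, in this sketch, enough to push the truncation error below the double-precision rounding level.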