Approximating cosine (update)

EDIT: The results below are wrong. The three slow machines were compiled with the wrong options, so they were unnecessarily slow. With "-march=native" (or, for the Pi, "-mfloat-abi=hard -mfpu=neon"), the 9th order polynomial is again conclusively the best algorithm on all machines.

Last year I implemented a cosine approximation using a 9th order polynomial, and was pleased that in my tests it was both faster and more accurate than other methods based on table lookup. However, when I recently re-ran the benchmarks on a number of less powerful computers, it turned out that the table lookup implementations were faster on those machines.

You can see the full results at /approximations/cosf.html. In short, the 9th order polynomial is best on the Pentium 4, Pentium M, Core 2 and Athlon II, but interpolated table lookup is faster on the Celeron T3100, the Atom N455 (in 32-bit mode) and the ARM chip in the Raspberry Pi 3 Model B. Accuracy seems to be the same across all machines, thanks to IEEE standards (despite being compiled with -ffast-math).