File       | Function                 | My Time    | Intel Time
-----------+--------------------------+------------+------------
sinfcosf.s | sine                     | 30, 75     | 38, 51, 64
sinfcosf.s | cosine                   | 33, 78     | 36, 51, 64
tanf.s     | tangent                  | 34, 44     | 57, 61, 97
asinf.s    | arcsine                  | 31, 22, 28 | 74
acosf.s    | arccosine                | 31, 22, 28 | 73
atanf.s    | arctangent               | 37, 24     | 43
atan2f.s   | arctangent, two-argument | 28, 44     | 57, 62

These are single-precision trigonometric functions I designed and implemented in hand-coded assembly. They are fast, deliver faithfully rounded results, and conform to standards. I used Maple to design the approximating polynomials and other mathematics.
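Eric's coefficients and Maple work are not published; the following sketch only shows the general shape of such an approximation, using plain Taylor coefficients in place of fitted minimax ones.

```python
import math

# Hypothetical sketch, not Eric's coefficients: after argument reduction,
# sine on [-pi/4, pi/4] is approximated by a short odd polynomial evaluated
# in Horner form.  A production routine would use minimax coefficients
# designed with a tool such as Maple; Taylor terms are used here for clarity.
C3, C5, C7 = -1.0 / 6.0, 1.0 / 120.0, -1.0 / 5040.0

def sin_poly(x):
    x2 = x * x
    return x * (1.0 + x2 * (C3 + x2 * (C5 + x2 * C7)))

# Worst observed error over a sampling of the reduced range:
worst = max(abs(sin_poly(i * math.pi / 4000) - math.sin(i * math.pi / 4000))
            for i in range(-1000, 1001))
```

Even these unoptimized coefficients stay within about 3×10⁻⁷ of sine on the reduced range; fitted coefficients of the same degree do considerably better, which is what makes a faithful single-precision result attainable.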

Argument reduction in the sine, cosine, and tangent routines is interesting. Because π is irrational, some very large floating-point numbers are very close to multiples of π/2. These numbers have very small sines or cosines. Determining the correct result requires finding a remainder that is many orders of magnitude smaller than the least significant bit in the input.
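A hedged illustration of the problem (in Python, standing in for the assembly): reducing a large argument by a rounded value of π/2 accumulates one copy of the constant's rounding error per multiple removed, and the remainder ends up nowhere near the true one.

```python
import math

# Illustration of why naive reduction fails (not Eric's method):
# (float)(pi/2), i.e. pi/2 rounded to a 24-bit significand:
PIO2_F32 = 1.57079637050628662109375

x = 1.0e8                             # a large, exactly representable argument
r_naive = math.fmod(x, PIO2_F32)      # reduce by the rounded constant
r_better = math.fmod(x, math.pi / 2)  # reduce by double-precision pi/2

# Roughly 6.4e7 multiples of pi/2 are removed, and the ~4.4e-8 error in
# PIO2_F32 is multiplied by that count, so the two remainders disagree
# grossly.  (Even r_better is inexact; a correct routine carries many more
# bits of pi/2 through the reduction.)
error = abs(r_naive - r_better)
```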

In order to produce a faithfully rounded result, the error in the remainder operation, the error in the polynomial, and the rounding errors in the polynomial evaluation must be, in total, less than the value of the least significant bit of the result expressed in single-precision.

A faithfully rounded result is one of the two floating-point numbers bracketing the exact mathematical answer. This criterion is used because determining the nearest floating-point number requires excessive time when the exact answer is very near the point midway between two floating-point numbers.
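A concrete illustration (in Python doubles, with a hypothetical value rather than one from Eric's routines): the exact real 1/3 is not representable, so a faithful result is either of its two floating-point neighbors.

```python
import math
from fractions import Fraction

# Illustration with binary64: the exact real 1/3 falls between two
# representable values; a faithfully rounded routine may return either,
# while correct rounding would require the nearer one.
below = 1 / 3                            # 1/3 rounds downward in binary64
above = math.nextafter(below, math.inf)  # the next representable value up

# Both bracket the exact value (compared exactly via Fraction):
assert Fraction(below) < Fraction(1, 3) < Fraction(above)
```

Determining which of the two is nearer is exactly the work that becomes expensive when the exact answer sits almost exactly at their midpoint, which is why the faithful criterion was chosen.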

I tested each single-argument routine with each of the 4,294,967,296 input values.
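Such a test simply walks every 32-bit pattern and treats it as a float input. A sketch of the idea (in Python; the real harness would compare each result against a high-precision reference):

```python
import math
import struct

def bits_to_float32(b):
    """Reinterpret a 32-bit integer as an IEEE-754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

# The real test visits all 2**32 patterns; a sparse stride keeps this
# sketch fast.  A reference comparison would go where the comment sits.
checked = 0
for b in range(0, 2**32, 2**20 + 17):   # odd stride to vary the low bits
    x = bits_to_float32(b)
    if math.isfinite(x):
        # result = routine_under_test(x); compare against a reference here
        checked += 1
```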

The times shown are CPU cycles on an Intel Core 2, measured as the duration per call when called repeatedly. (This represents throughput. The latency, the duration from start of a call to when the result is available, is longer.) For comparison, times measured for Intel's Integrated Performance Primitives (IPP) are shown. A routine has several times because its behavior varies over its domain. These times are for normal numbers, not infinities, denormals, NaNs, or out-of-domain values.
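The measurement style can be sketched portably (wall-clock time per call rather than Core 2 cycle counts, but the same repeated-call, throughput-oriented idea):

```python
import math
import time

# Throughput-style measurement as described above: time many back-to-back
# calls and divide by the count.  This is a portable analogue of the cycle
# counts in the table, not a reproduction of them.
N = 200_000
t0 = time.perf_counter()
for _ in range(N):
    math.sin(1.2345)
per_call = (time.perf_counter() - t0) / N
```

Measuring latency instead would require a dependent chain, feeding each result into the next call so that calls cannot overlap.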

The sine and cosine routines spike to 75 and 78 cycles for a small part of their input domains (four exponent values out of 254) because the table entries share address bits with a stack location, causing the processor to delay execution until it determines there is no actual physical address collision.

The sine, cosine, and tangent routines use generated data in sinfcosfTable.s and tanfTable.s, but the generating code is not publicly available. (It is straightforward arithmetic but was not published by Apple.)

Construction of a High-Performance FFT (PDF; also available as Microsoft Word in a ZIP file). This paper describes thoroughly and in detail the construction of a high-performance FFT. It is quite long, over 90 pages. Starting from the mathematical definition of an FFT, it transforms the math into a form suitable for an O(n log n) implementation.

The design is discussed almost entirely in C but divides the work into subroutines that can clearly be implemented as high-performance code on an ordinary processor or on a single-instruction multiple-data (SIMD) processor. (That is, the routines perform the same calculations on independent data, so the work maps naturally onto SIMD execution.)

You are welcome to copy the paper for the purpose of reviewing my work. If you would like to use the paper in your work, please read the license in the ZIP file.

A radix-4 FFT butterfly subroutine (text file). In my work for EADS, I wrote a high-performance FFT. This code sample (published with permission) is the radix-4 butterfly subroutine.

First, if you have seen an FFT implementation before, search for the first
occurrence of "`1:`". This is a label which starts the main loop. In
some FFT implementations, the butterfly calculations are a mess. I sought to
express the calculations cleanly and bring out the symmetry in the mathematics.

Second, if you have some experience with high-performance code or
instruction scheduling, search for the second occurrence of "`1:`".
Here begins a loop that performs the same calculations, but the instructions
are reordered for high performance. (I apologize that the documentation for
this reordering is in another file that I do not have permission to publish.)
This loop runs in 31 CPU cycles per iteration.

(There are two versions of the loop because the first performs the calculations in a simple, comprehensible, serial manner. It works and is used for development and illustration, but it is not assembled into the production code.)
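The EADS source itself is linked above; as a hedged, textbook-form sketch (my illustration, not the EADS code), one radix-4 decimation-in-time butterfly combines four inputs with three twiddle factors:

```python
# Textbook radix-4 DIT butterfly (a sketch, not the EADS assembly):
# inputs a, b, c, d and twiddle factors w1, w2, w3 yield four outputs.
def radix4_butterfly(a, b, c, d, w1=1, w2=1, w3=1):
    b, c, d = b * w1, c * w2, d * w3
    t0, t1 = a + c, a - c        # sums and differences of one pair
    t2 = b + d                   # and of the other pair
    t3 = (b - d) * 1j            # multiplying by j rotates 90 degrees
    return t0 + t2, t1 - t3, t0 - t2, t1 + t3

# With unit twiddles this is exactly a 4-point DFT:
X = radix4_butterfly(1, 2, 3, 4)
```

The symmetry mentioned above is visible here: every output is a sum or difference of the same two pairwise combinations, which is what a clean implementation brings out.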

Zero-Knowledge Proofs (PDF). These are slides for a talk about zero-knowledge proofs, aimed at an audience of intelligent people but using mostly high-school mathematics. Zero-knowledge proofs are quite interesting: one person can prove to another that they know a number (a password) without giving an eavesdropper any information about what the number is or how to forge a proof.

This was not prepared as a work project, but it shows some of my expository style, even though it does not include the narration that accompanies the slides.
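For flavor, one round of a standard Schnorr-style identification protocol can be sketched as follows (my hedged illustration with insecurely tiny numbers, not necessarily the construction used in the slides):

```python
import random

# A standard Schnorr-style zero-knowledge round (illustrative only; toy
# sizes, and not necessarily the protocol presented in the slides).
p, g = 2267, 2          # small prime modulus and base (insecurely small)
x = 1234                # the prover's secret "password"
y = pow(g, x, p)        # public value derived from the secret

r = random.randrange(1, p - 1)
t = pow(g, r, p)                 # prover's commitment
c = random.randrange(1, p - 1)   # verifier's random challenge
s = (r + c * x) % (p - 1)        # response; the random r masks x

# The verifier checks the relation without learning x:
ok = pow(g, s, p) == (t * pow(y, c, p)) % p
```

The check always passes for an honest prover because exponents of g can be reduced modulo p − 1 (Fermat's little theorem), while the transcript (t, c, s) reveals nothing about x on its own.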


© Copyright 2003 by Eric Postpischil.