http://aggregate.org/
@techreport{magicalgorithms,
author={Henry Gordon Dietz},
title={{The Aggregate Magic Algorithms}},
institution={University of Kentucky},
howpublished={Aggregate.Org online technical report},
URL={http://aggregate.org/MAGIC/}
}
Absolute Value of a Float
IEEE floating point uses an explicit sign bit, so the absolute value can be taken by a bitwise AND with the complement of the sign bit. For IA32 32-bit, the sign bit is an int value of 0x80000000, for IA32 64-bit, the sign bit is the long long int value 0x8000000000000000. Of course, if you prefer to use just int values, the IA32 64-bit sign bit is 0x80000000 at an int address offset of +1. For example:
double x; /* make x = abs(x) */ *(((int *) &x) + 1) &= 0x7fffffff;
Alignment of Pointers
Alignment of pointers is a pretty common problem, and there are several relevant tricks, so, at the suggestion of Jean-Charles Meyrignac (who also provided an example of the upward alignment described below), I've added a little description here.
It is fairly obvious that the downward alignment of an address a to a multiple-of-b boundary, where b is a power of 2, is simply (a & ~(b-1)). Of course, ~(b-1) is also -b, so (a & -b) also works (the difference is usually nothing; if b is a constant, most compilers will fold the first into the second form).
For upward alignment, we simply add b-1: ((a + (b-1)) & -b).
Of course, there are a few complications. First, languages like C, which allow pointer arithmetic, generally scale pointer offsets by the size of the target object -- which would keep our math from working. It used to be that casting a pointer as a (char *)
would turn-off this scaling, but with long char
and such out there, a cast as (void *)
is probably a safer bet. Unfortunately, C doesn't define bitwise operations on pointers of any flavor, so you'll need to cast to an appropriately-large integer type before doing a bitwise AND.
Secondly, aligning an address doesn't help unless you allocated a large enough chunk of memory so that you can treat your data structure as starting at the aligned address. In general, if you wish to create a b-aligned data structure with c bytes, you would do something like: a=((typeof(a))(((int)(((void *)malloc(c+(b-1)))+(b-1)))&-b))
. Please excuse my use of the GCC typeof()
. Anyway, this is particularly useful for cache-line alignment of data structures. One little annoyance: you can't call free(a); you'll need to keep a copy of the original block address for that.
Average of Integers
This is actually an extension of the "well known" fact that for binary integer values x and y, (x+y) equals ((x&y)+(x|y)) equals ((x^y)+2*(x&y)).
Given two integer values x and y, the (floor of the) average normally would be computed by (x+y)/2; unfortunately, this can yield incorrect results due to overflow. A very sneaky alternative is to use (x&y)+((x^y)/2). If we are aware of the potential non-portability due to the fact that C does not specify if shifts are signed, this can be simplified to (x&y)+((x^y)>>1). In either case, the benefit is that this code sequence cannot overflow.
Bit Reversal
Reversing the bits in an integer x is somewhat painful, but here's a SWAR algorithm for a 32-bit value:
unsigned int reverse(register unsigned int x) { x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1)); x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2)); x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4)); x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8)); return((x >> 16) | (x << 16)); }
x = (abcd efgh) ------------------- x = badc fehg x = dcba hgfe x = hgfe dcba
It also is possible to re-write this algorithm to use 4 instead of 8 constants, thus saving some instruction bandwidth. On my 1.2GHz Athlon (a Thunderbird), the difference is too small to measure reliably. Here's the other version:
unsigned int reverse(register unsigned int x) { register unsigned int y = 0x55555555; x = (((x >> 1) & y) | ((x & y) << 1)); y = 0x33333333; x = (((x >> 2) & y) | ((x & y) << 2)); y = 0x0f0f0f0f; x = (((x >> 4) & y) | ((x & y) << 4)); y = 0x00ff00ff; x = (((x >> 8) & y) | ((x & y) << 8)); return((x >> 16) | (x << 16)); }
Comparison of Float Values
IEEE floating point has a number of nice properties, including the ability to use 2's complement integer comparisons to compare floating point values, provided the native byte order is consistent between float and integer values. The only complication is the use of sign+magnitude representation in floats. The AMD Athlon Processor x86 Code Optimization Guide gives a nice summary on Page 43. Here's a set of C routines that embody the same logic:
#define FasI(f) (*((int *) &(f))) #define FasUI(f) (*((unsigned int *) &(f))) #define lt0(f) (FasUI(f) > 0x80000000U) #define le0(f) (FasI(f) <= 0) #define gt0(f) (FasI(f) > 0) #define ge0(f) (FasUI(f) <= 0x80000000U)
Comparison to Mask Conversion
In many cases, it is useful to convert the result of a comparison, which is either 0 or some non-zero bit pattern, into either a "clean" 0 or -1 bit mask.
For many systems, this can be efficienty done by C code that simply uses the logic operators and negation: -(x!=0) or -!!x. This is a very well known and old method, really a direct consequence of (and partial motivation for) the C concept of conditional results being integers. However, for some compilers and instruction sets (especially SWAR targets), the code generated for logicals is terrible, sometimes even including conditional branches! Where this obvious coding doesn't work well, here are some alternatives.
If the messy non-negative integer value is x, the sanitized version is:
(((int) (-x)) >> (WORDBITS - 1))
To remove the constraint that the messy value be non-negative, use:
(((int) (x | -x)) >> (WORDBITS - 1))
Logically, this works because the shift by (WORDBITS-1) replicates the sign bit to create a mask -- be aware, however, that the C language does not require that shifts are signed even if their operands are signed, so there is a potential portability problem. Additionally, one might think that a shift by any number greater than or equal to WORDBITS would have the same effect, but many instruction sets have shifts that behave strangely when such shift distances are specified.
Of course, the opposite condition can be tested using:
(((int) ~(x | -x)) >> (WORDBITS - 1))
If you prefer the C-standard 0 or 1 comparison result, simply use the unsigned shift:
(((unsigned int) (x | -x)) >> (WORDBITS - 1))
The opposite condition can be obtained using:
(((unsigned int) ~(x | -x)) >> (WORDBITS - 1))
Dual-Linked List with One Pointer Field
Normally, a dual-linked circular list would contain both previous and next pointer fields and the current position in the list would be identified by a single pointer. By using two current pointers, one to the node in question and the other to the one just before/after it, it becomes possible to store only a single pointer value in each node. The value stored in each node is the XOR of the next and previous pointers that normally would have been stored in each node. Decoding is obvious.
Unfortunately, using this trick in C is awkward because the XOR operation is not defined for pointers.
Divide Rounding
Joe Ibershoff, one of my students, pointed-out that integer divide normally yields the floor, but both ceiling and round-to-nearest are easy and useful. I thought these were fairly well-known tricks closely related to the Alignment of Pointers magic, but perhaps they aren't so obvious...? He points out that Ceiling(a/b) is (a+b-1)/b and RoundToNearest(a/b) is (a+(b/2))/b. Of course, these tricks also work if divide is implemented in less obvious ways, such as shifts or shift-and-subtract sequences.
GPU Any
A basic SIMD operation, "any" is a logical-OR reduction that returns true if any of its inputs are true. On SIMD hardware, this is usually very easy... but not so on GPUs (Graphics Processing Units). NVIDIA's CUDA has recently added hardware support, but there is a more portable way that is just as fast. The p_any(flag) operation within a block can be accomplished by:
if (flag) sharedtemp = serial; /* maskable store */ __syncthreads(); /* optional with right block size */ p_any = (sharedtemp == (serial++));
We first publically announced this at SC08, and we're pretty sure we invented it. The trick is that NVIDIA's hardware seems to take constant time to resolve N threads storing into the same object, i.e., it picks a winner. This behaviour is not documented, but has been experimentally observed; this p_any(flag) will run on any of the CUDA hardware, and takes essentially the same time as the atomic any that was added to later CUDA hardware. There are actually quite a few useful operations that can be built using variations on this trick....
GPU SyncBlocks
The most fundamental aggregate function (or you might call it a collective communication) is a barrier synchronization. On SIMD hardware, this is usually implicit... but not so on GPUs (Graphics Processing Units). Within a Block, NVIDIA's CUDA provides a barrier called __syncthreads(). Across Blocks -- if you are running a number of Blocks that the GPU can timeshare rather than batch process -- you can synchronize using this:
/* First, sync within each Block */ __syncthreads(); /* Pick a representative from each (here, 1D) block */ if (threadIdx.x == 0) { /* Get my barrier number */ int barno = barnos[blockIdx.x] + 1; int hisbarno; int who = (blockIdx.x + 1) % gridDim.x; /* Check in at barrier */ barnos[blockIdx.x] = barno; /* Scan for all here or somebody passed */ do { /* Wait for who */ do { hisbarno = barnos[who]; } while (hisbarno < barno); /* Bump to next who */ if (++who >= gridDim.x) who = 0; } while ((hisbarno == barno) && (who != blockIdx.x)); /* Tell others we are all here */ barnos[blockIdx.x] = barno + 1; } /* Rejoin with rest of my Block */ __syncthreads();
The above code assumes that barnos[] is a volatile (forced memory access) array in GPU global memory that is initialized to 0. The type can be either int or float; it is not critical because either way wrap-around will take longer than GPUs will let one kernel run by default. Cost is O(number of Blocks) if all arrive at the same time, but O(1) for the last to arrive if there is any temporal skew. The O(1) behavior is due to counting by 2 per barrier; if all Blocks typically arrive roughly simultaneously, the algorithm can be simplified to count by 1. The OpenCL version of this algorithm has been tested on both NVIDIA and ATI GPUs with good performance. This algorithm also is the obvious basis for efficient within-a-kernel reductions and scans....
We first publically showed various GPU variants of this algorithm at SC08, and it was published within the MS thesis of two of Dietz's students in July 2009 and September 2009. Actually, it is a trivial variation on the lockless shared memory barrier that we developed for SHMAPERS and published over a decade ago. (In fact, it took less time to synchronize four processors in a Sun server than for one of those processors to execute a single atomic memory instruction!) I note the dates because late in 2009 somebody else published and claimed to have invented what is an inferior variant of this algorithm and did not cite us....
Gray Code Conversion
A Gray code is any binary coding sequence in which only a single bit position changes as we move from one value to the next. There are many such codes, but the traditional one is computed such that the Kth Gray code is K^(K>>1).
The well-known algorithm for conversion from Gray to binary is a linear sequence of XORs that makes it seem each bit must be dealt with separately. Fortunately, that is equivalent to a parallel prefix XOR that can be computed using SWAR techniques in log time. For 32-bit Gray code values produced as described above, the conversion from Gray code back to unsigned binary is:
unsigned int g2b(unsigned int gray) { gray ^= (gray >> 16); gray ^= (gray >> 8); gray ^= (gray >> 4); gray ^= (gray >> 2); gray ^= (gray >> 1); return(gray); }
Integer Constant Multiply
Given an integer value x and an integer or floating point value y, the value of x*y can be computed efficiently using a sequence derived from the binary value of x. For example, if x is 5 (4 + 1):
y2 = y + y; y4 = y2 + y2; result = y + y4;
In the special case that y is an integer, this can be done with shifts:
y4 = (y << 2); result = y + y4;
Integer Minimum or Maximum
Given 2's complement integer values x and y, the minimum can be computed without any branches as x+(((y-x)>>(WORDBITS-1))&(y-x)). Logically, this works because the shift by (WORDBITS-1) replicates the sign bit to create a mask -- be aware, however, that the C language does not require that shifts are signed even if their operands are signed, so there is a potential portability problem. Additionally, one might think that a shift by any number greater than or equal to WORDBITS would have the same effect, but many instruction sets have shifts that behave strangely when such shift distances are specified.
Of course, maximum can be computed using the same trick: x-(((x-y)>>(WORDBITS-1))&(x-y)).
Actually, the Integer Selection coding trick is just as efficient in encoding minimum and maximum....
Integer Power
Given an integer value x and an integer or floating point value y, the value of y to the x power can be computed efficiently using a sequence derived from the binary value of x. For example, if x is 5 (4 + 1):
y2 = y * y; y4 = y2 * y2; result = y * y4;
Integer Selection
A branchless, lookup-free, alternative to code like if (a<b) x=c; else x=d; is ((((a-b) >> (WORDBITS-1)) & (c^d)) ^ d). This code assumes that the shift is signed, which, of course, C does not promise.
Is Power of 2
A non-negative binary integer value x is a power of 2 iff (x&(x-1)) is 0 using 2's complement arithmetic.
Leading Zero Count
Some machines have had single instructions that count the number of leading zero bits in an integer; such an operation can be an artifact of having floating point normalization hardware around. Clearly, floor of base 2 log of x is (WORDBITS-lzc(x)). In any case, this operation has found its way into quite a few algorithms, so it is useful to have an efficient implementation:
unsigned int lzc(register unsigned int x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); return(WORDBITS - ones(x)); }
Least Significant 1 Bit
This can be useful for extracting the lowest numbered element of a bit set. Given a 2's complement binary integer value x, (x&-x) is the least significant 1 bit. (This was pointed-out by Tom May.) The reason this works is that it is equivalent to (x & ((~x) + 1)); any trailing zero bits in x become ones in ~x, adding 1 to that carries into the following bit, and AND with x yields only the flipped bit... the original position of the least significant 1 bit.
Alternatively, since (x&(x-1)) is actually x stripped of its least significant 1 bit, the least significant 1 bit is also (x^(x&(x-1))).
Log2 of an Integer
Given a binary integer value x, the floor of the base 2 log of that number efficiently can be computed by the application of two variable-precision SWAR algorithms. The first "folds" the upper bits into the lower bits to construct a bit vector with the same most significant 1 as x, but all 1's below it. The second SWAR algorithm is population count, defined elsewhere in this document. However, we must consider the issue of what the log2(0) should be; the log of 0 is undefined, so how that value should be handled is unclear. The following code for handling a 32-bit value gives two options: if LOG0UNDEFINED, this code returns -1 for log2(0); otherwise, it returns 0 for log2(0). For a 32-bit value:
unsigned int floor_log2(register unsigned int x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); #ifdef LOG0UNDEFINED return(ones32(x) - 1); #else return(ones32(x >> 1)); #endif }
Suppose instead that you want the ceiling of the base 2 log. The floor and ceiling are identical if x is a power of two; otherwise, the result is 1 too small. This can be corrected using the power of 2 test followed with the comparison-to-mask shift used in integer minimum/maximum. The result is:
unsigned int log2(register unsigned int x) { register int y = (x & (x - 1)); y |= -y; y >>= (WORDBITS - 1); x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); #ifdef LOG0UNDEFINED return(ones(x) - 1 - y); #else return(ones(x >> 1) - y); #endif }
Next Largest Power of 2
Given a binary integer value x, the next largest power of 2 can be computed by a SWAR algorithm that recursively "folds" the upper bits into the lower bits. This process yields a bit vector with the same most significant 1 as x, but all 1's below it. Adding 1 to that value yields the next largest power of 2. For a 32-bit value:
unsigned int nlpo2(register unsigned int x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); return(x+1); }
Most Significant 1 Bit
Given a binary integer value x, the most significant 1 bit (highest numbered element of a bit set) can be computed using a SWAR algorithm that recursively "folds" the upper bits into the lower bits. This process yields a bit vector with the same most significant 1 as x, but all 1's below it. Bitwise AND of the original value with the complement of the "folded" value shifted down by one yields the most significant bit. For a 32-bit value:
unsigned int msb32(register unsigned int x) { x |= (x >> 1); x |= (x >> 2); x |= (x >> 4); x |= (x >> 8); x |= (x >> 16); return(x & ~(x >> 1)); }
Natural Data Type Precision Conversions
For integers used to represent natural data types, simply shifting right works well for conversion to a lower precision, but shifting left is not very effective for converting to a higher precision. The problem is simply that if the "new" bits are taken to be 0s, the maximum value will never be attained. Likewise, if taken to be any fixed non-0 value, the value zero will never be obtained. A good answer to this problem is to replicate the existing bit pattern in the "new" bits, truncating or repeating the pattern if more bits are needed.
For example, a 10-bit raw pixel value (e.g., from my Canon G1) called x can be extended to a 16-bit value by the expression ((x<<6)|(x>>4)). This way, both the maximum and minimum values are reachable, with good linearity throughout the entire range.
Polynomials
It is fairly obvious, but x0+x1*x+x2*x*x+x3*x*x*x+... always can be rewritten as the usually faster equivalent x0+x*(x1+x*(x2+x*(x3+x*(...)))). There are various accuracy and other issues, but this sort of obvious transformation should not be overlooked.
Population Count (Ones Count)
The population count of a binary integer value x is the number of one bits in the value. Although many machines have single instructions for this, the single instructions are usually microcoded loops that test a bit per cycle; a log-time algorithm coded in C is often faster. The following code uses a variable-precision SWAR algorithm to perform a tree reduction adding the bits in a 32-bit value:
unsigned int ones32(register unsigned int x) { /* 32-bit recursive reduction using SWAR... but first step is mapping 2-bit values into sum of 2 1-bit values in sneaky way */ x -= ((x >> 1) & 0x55555555); x = (((x >> 2) & 0x33333333) + (x & 0x33333333)); x = (((x >> 4) + x) & 0x0f0f0f0f); x += (x >> 8); x += (x >> 16); return(x & 0x0000003f); }
It is worthwhile noting that the SWAR population count algorithm given above can be improved upon for the case of counting the population of multi-word bit sets. How? The last few steps in the reduction are using only a portion of the SWAR width to produce their results; thus, it would be possible to combine these steps across multiple words being reduced.
One additional note: the AMD Athlon optimization guidelines suggest a very similar algorithm that replaces the last three lines with return((x * 0x01010101) >> 24);. For the Athlon (which has a very fast integer multiply), I would have expected AMD's code to be faster... but it is actually 6% slower according to my benchmarks using a 1.2GHz Athlon (a Thunderbird). Why? Well, it so happens that GCC doesn't use a multiply instruction - it writes out the equivalent shift and add sequence!
Same Within Tolerance
Sometimes it is necessary to test if two integers, a and b, have the same value within a given tolerance, c. The obvious test would be something like ((a>b)?(a-b):(b-a))<c, which isn't horrifically inefficient, but does involve a conditional branch. Alternatively, abs(a-b)<c would do... but again, it takes some cleverness to implement abs() without a conditional jump. Here's a branchless alternative.
If (a-b)>0, then (b-a)<0; similarly, if (a-b)<0, then (b-a)>0. Both can't be greater than 0 simultaneously. Suppose that (a-b)>0. Subtracting ((a-b)-c) will produce a negative result iff a and b are within c of each other. Of course, our assumption requires (b-a)<0, so ((b-a)-c) simply becomes more negative (assuming the value doesn't wrap around). Generalizing, if either ((a-b)-c)>0 or ((b-a)-c)>0 then the values of a and b are not the same within tolerance c. In other words, they are within tolerance if:
(((a-b-c)&(b-a-c))<0)
This test can be rewritten a variety of ways. The <0 part is really just examining the sign bit, so a mask or shift could be used to extract the bit value instead. For example, using 32-bit words, (((a-b-c)&(b-a-c))>>31) using unsigned >> will produce the value 1 for true or 0 for false. It is also possible to factor-out t=a-b, giving:
(((t-c)&(-t-c))<0)
Which is really equivalent to abs(t)<c.
Once again, an excellent alternative computation has been provided by Jean-Charles Meyrignac, who observed that checking if v is in the interval [i .. j), which one might normally compute by (v>=i && v<j), can instead be done as ((unsigned int)(v-i) < (j-i)).
Checking if a and b are the same within c is thus basically checking ((unsigned int)(a-b+c)< (c+c)). Well, almost. It actually considers one more value to be in the range than it should... and it does odd things when values wrap around. Still, often close enough....
Shift-and-Add Optimization
Rather obviously, if an integer multiply can be implemented by a shift-and-add sequence, then a shift-and-add sequence can be implemented by multiplying by the appropriate constant... with some speedup on processors like the AMD Athlon. Unfortunately, GCC seems to believe constant multiplies should always be converted into shift-and-add sequences, so there is a problem in using this optimization in C source code.
Sign Extension
Although many instruction sets provide single machine instructions that implement sign extension of 2's-complement integers, I've been sent a number of tricks for sign extension. I've included them here because sign extension instructions generally work only on the data sizes directly understood by the processor, whereas these methods work on any bit precisions.
The most obvious method assumes that you have a signed right shift: to extend an a-bit number x to b bits, shift left by b-a, then signed shift that value right by b-a bits. I believe this has been widely known and used for many years -- I know I didn't invent it, but used it decades ago.
Jean-Charles Meyrignac suggested a shiftless variation that basically does a 1-bit add to flip high bits: if n is 2 to the a, simply compute (((x | -n) + (n/2)) ^ (n/2)). This version has been posted here for some time....
However, in August 2010, Joe Zbiciak sent me a little email with a much cleaner shiftless sign extension: ((x ^ n) - n) where n is the value of the top bit position in the number to be extended. Thus, to sign-extend an 8-bit value x, compute ((x ^ 128) - 128). It really couldn't be much simpler or more obvious... at least once you've been told how to do it. ;-)
Swap Values Without a Temporary
Given two binary integer values, x and y, the values can be exchanged without use of a temporary by:
x ^= y; /* x' = (x^y) */ y ^= x; /* y' = (y^(x^y)) = x */ x ^= y; /* x' = (x^y)^x = y */
It should be obvious that this can be done with various operators and their inverses, but xor has the unusual property that it is it's own inverse. For example, here it is with modular add and subtract:
x += y; /* x' = (x+y) */ y = x - y; /* y' = (x+y)-y = x */ x -= y; /* x' = (x+y)-x = y */
This works on machines that don't have xor instructions. It even would also appear to work for floating point values, but where there is a significant difference in magnitude between x and y, there can be a serious loss of accuracy in the value with the smaller magnitude. For example, if x has a much greater magnitude than y, then (x+y)==x and we end with y=0. The interesting thing is that you can losslessly swap floating-point values by treating them as integers and using xor.
SIMD Within A Register (SWAR) Operations
Before we coined the name SWAR in Fall 1996, we already had defined a complete set of basic operations and described how they could be implemented with good efficiency. On February 4, 1997, we posted this fairly complete overview document and there also are slides from a seminar presentation I gave at Purdue. These methods were used in our SWARC compiler and were detailed in a number of our publications throughout the 1990s. We hadn't posted them on this page because they were so prominently published elsewhere.
However, much to our surprize, United States Patent 7039906, "Compiler for enabling multiple signed independent data elements per register," was filed October 20, 2000 and was issued to IBM on May 2, 2006! By our reading, this patent appears to seriously overlap my and Randy Fisher's earlier published work -- much of which was cited by the patent examiner. I am posting this note here so that people who happen to hear about the IBM patent will not be discouraged from using the prior art technologies developed by us, which, by being cited in the patent, are explicitly not covered by the patent.
Trailing Zero Count
Given the Least Significant 1 Bit and Population Count (Ones Count) algorithms, it is trivial to combine them to construct a trailing zero count (as pointed-out by Joe Bowbeer):
unsigned int tzc(register int x) { return(ones((x & -x) - 1)); }
Other People's Magic
The following are generally good sources of magic algorithms:
- HAKMEM
a collection of mostly low-level tricks and otherwise interesting ideas, some quite old - Bit Twiddling Hacks
a page full of tricks very similar in spirit to those here - AMD Athlon Processor x86 Code Optimization Guide
an exceptionally complete coverage of Athlon-oriented magic at both the C and x86 assembly levels - Intel Architecture Optimization Reference Manual
Intel Pentium 4 Processor Optimization Reference Manual
Intel's equivalents to the AMD document above. very x86-oriented (including MMX, SSE, and SSE2) - The PowerPC Compiler Writer's Guide
- Technical Site Index for azillionmonkeys
lots of stuff, including hacks in x86 and C - Reciprocal Multiplication, a tutorial
a really nice piece of work by Doug Jones... - Digital Signal Processing Tricks (from comp.dsp)
- various good algorithms
- Programming Pearls book online
- Hacker's Delight is a book that seems to talk about many of the same issues discussed on this page
- Bit Permutations by Jasper Neumann, with very nice diagrams explaining them
- Bit Twiddling Hacks by Sean Eron Anderson, a very carefully done collection