## Thursday, April 5, 2012

### Checking if divisible by 10 or 5 the bitwise way.

My Problem: check if something was divisble by 10, the rules where I couldn't use the modulus, divison or multiplication operator. The check had to be in the form of a single expression, no branches or loop constructs could be used, pointer arithmetic wasn't allowed, neither was the use of floating point data types. The code had to work on a x86 CPU; and had to be ANSI C.

After spending a few hours I've come up with the following solution:
((x<<0x1F)+(x<<0x1E)+(x<<0x1B)+(x<<0x1A)+
(x<<0x17)+(x<<0x16)+(x<<0x13)+(x<<0x12)+
(x<<0x0F)+(x<<0x0E)+(x<<0x0B)+(x<<0x0A)+
(x<<0x07)+(x<<0x06)+(x<<0x03)+(x<<0x02)+x<0x33333334)&&!(x&1)

Now, this isn't exactly a fast solution, expecially on a P4 which lacks a barrel shifter, however this solution led me finding a faster soluton. The rules where I couldn't use the multiplication operator. But after removing those shifted adds and replacing it with a constant 0xCCCCCCCD, I had the following:

(x*0xCCCCCCCD) < 0x33333334 &&!(x&1)

much to my surprise this is faster than using modulus (x%10) as well as the division operator. MUL on the x86 is actually the fastest way to do division for NPOT (non-power-of-two) constants. This can be proofed with:
(x * 5^-1) < 0x33333334

Explaining how this works is quite simple: the constant 0xCCCCCCCD is the inverse of 5 modulo 2^32. Of course I couldn't stop here. I took it a bit further and came up with the following code:

!((0x19999999 * x + (x >> 1) + (x >> 3)) >> 28) [
"\x0\x1\x2\x2\x3\x3\x4\x5\x5\x6\x7\x7\x8\x8\x9\x0"
]

This is a lot more complicated to proof, mainly because it's simply odd code. To get the most obvious odd bit sorted out C allows array[idx] and idx[array] because both cases decay to *(idx+array) or *(array+idx) which are equivlent. Now for the explination:

In the table we have here: the following: 0,1,2,2,3,3,4,5,5,6,7,7,8,8,9,0
this maps the multiples of 5 to either 0 or 8. Better expressed as: (8/5 * x) % 16 (the error term is never too big) -- The real equivalency being:
((8/5 + 0x0.00000002/5) * x - delta) % 16 for delta in [0, 0.625/2^28]
This one uses a lot more operations to do the same thing, 3 shifts, 2 adds, and one MUL, still a lot faster on a pentium D than the former one. Of course I took it two more steps further.

The rules where it had to be ANSI C, but I sort of broke that bit, by exposing some x86 assembly (which entailed the best solutions).

MUL x, 0xCCCCCCCD, x
ROR x, 0x19999999
CMP x, 0x19999999
JLE hit_table ; same table {0,1,2,2,3,3,4,5,5,6,7,7,8,8,9,0}

This it the same to the CPU as the one above, but expressed in a simple fashion. Essentially the observation is:

x * 0xCCCCCCCD is even exactly if x is even, because 0xCCCCCCCD is odd, so the right rotation to move the last bit to the first and compare against 0x19999999 will give you the same answer, that is if you have ROR. So I rewrote it another way for ROR-less systems.

This method is the same as the first: which I find is a lot more nicer, too many MOVs though, but the real power comes from the x86 LEA instruction.

MOV %edx, 0x33333334
MOV %ebx, %eax
MUL %edx
AND %edx, x
LEA %edx, [%edx+%edx*4]

The reason this works is the x86 MUL instruction puts it's double width result in two registers, the sad part is you cannot write this in C.

But in the end the winner for performance ended up being:
(x * 0xCCCCCCCD) < 0x33333334 && !(x&1)