# Is this really faster? - Journal of Omnifarious

## Aug. 30th, 2009

### 01:38 pm - *Is this really faster?*

This:

unsigned int clipdigit(unsigned int * const v) { unsigned int digit = (*v) % 10; (*v) /= 10; return digit; }is turned into this:

.globl clipdigit .type clipdigit, @function clipdigit: .LFB11: .cfi_startproc movl (%rdi), %ecx movl $-858993459, %edx movl %ecx, %eax mull %edx shrl $3, %edx leal 0(,%rdx,8), %eax movl %edx, (%rdi) leal (%rax,%rdx,2), %edx movl %ecx, %eax subl %edx, %eax ret .cfi_endproc .LFE11: .size clipdigit, .-clipdigit

As a small hint/bit of explanation, 2^{32} - 858993459 = 3435973837 = 2^{35} / 10 + 2.

Is mull really **that** much faster than divl on x86_64 machines?

I was expecting to get code more like this rather straightforward bit:

.globl clipdigit .type clipdigit, @function clipdigit: .LFB11: .cfi_startproc movl (%rdi), %eax movl $10, %ecx xorl %edx, %edx divl %ecx movl %eax, (%rdi) movl %edx, %eax ret .cfi_endproc .LFE11: .size clipdigit, .-clipdigit

It turns out in testing that the second clip of code is much, much slower than the first clip. The strange mull method is about 5 times faster than the straightforward divl method. Wow, divl seems really broken if it's that slow.

**Current Mood:**confused

rosencrantz319(Link)omnifarious(Link)I just did. It turns out that divl is really, really slow.

hattifattener(Link)omnifarious(Link)I guess that makes sense. gcc is pretty spiffy for noticing a divide by a constant and turning into such convoluted (but much faster) code.

omnifarious(Link)You know, what's really amusing is close to the beginning of my education as a programmer I taught myself how to write multiply and divide routines in assembly for instructions sets that didn't have them. The 6502 and the Z-80 specifically. At around that time, I actually knew how many cycles many of the instructions I used took because it was clearly listed next to the instruction in the manual.

Now I look online and have a really hard time finding hard information like that. I know things have gotten lots more complicated with pipelines and caches and it's not anywhere near as simple as a simple cycle count for each addressing mode. But some kind of indication that the chip's implementation of divide is much more expensive than its implementation of multiply would be really helpful.

Edited at 2009-08-31 06:25 am (UTC)