?

Log in

No account? Create an account

Is this really faster? - Journal of Omnifarious

Aug. 30th, 2009

01:38 pm - Is this really faster?

Previous Entry Share Next Entry

This:

unsigned int clipdigit(unsigned int * const v)
{
   unsigned int digit = (*v) % 10;
   (*v) /= 10;
   return digit;
}
is turned into this:
.globl clipdigit
	.type	clipdigit, @function
clipdigit:
.LFB11:
	.cfi_startproc
	movl	(%rdi), %ecx
	movl	$-858993459, %edx
	movl	%ecx, %eax
	mull	%edx
	shrl	$3, %edx
	leal	0(,%rdx,8), %eax
	movl	%edx, (%rdi)
	leal	(%rax,%rdx,2), %edx
	movl	%ecx, %eax
	subl	%edx, %eax
	ret
	.cfi_endproc
.LFE11:
	.size	clipdigit, .-clipdigit

As a small hint/bit of explanation, 232 - 858993459 = 3435973837 = 235 / 10 + 2.

Is mull really that much faster than divl on x86_64 machines?

I was expecting to get code more like this rather straightforward bit:

.globl clipdigit
	.type	clipdigit, @function
clipdigit:
.LFB11:
	.cfi_startproc
	movl	(%rdi), %eax
        movl    $10, %ecx
        xorl    %edx, %edx
        divl    %ecx
        movl    %eax, (%rdi)
        movl    %edx, %eax
	ret
	.cfi_endproc
.LFE11:
	.size	clipdigit, .-clipdigit


It turns out in testing that the second clip of code is much, much slower than the first clip. The strange mull method is about 5 times faster than the straightforward divl method. Wow, divl seems really broken if it's that slow.

Current Mood: [mood icon] confused

Comments:

From:rosencrantz319
Date:August 30th, 2009 09:00 pm (UTC)
(Link)
(Reply) (Thread)
[User Picture]
From:omnifarious
Date:August 30th, 2009 09:22 pm (UTC)
(Link)

I just did. It turns out that divl is really, really slow.

(Reply) (Parent) (Thread)
From:hattifattener
Date:August 30th, 2009 09:57 pm (UTC)
(Link)
Yeah, division by an arbitrary number is slow. Has been since the dawn of mechanical computation. Single-cycle multipliers have become pretty common (though they use a lot of silicon), but I don't know of anything with a single-cycle divide.
(Reply) (Thread)
[User Picture]
From:omnifarious
Date:August 30th, 2009 10:45 pm (UTC)
(Link)

I guess that makes sense. gcc is pretty spiffy for noticing a divide by a constant and turning into such convoluted (but much faster) code.

(Reply) (Parent) (Thread)
[User Picture]
From:omnifarious
Date:August 31st, 2009 06:25 am (UTC)
(Link)

You know, what's really amusing is close to the beginning of my education as a programmer I taught myself how to write multiply and divide routines in assembly for instructions sets that didn't have them. The 6502 and the Z-80 specifically. At around that time, I actually knew how many cycles many of the instructions I used took because it was clearly listed next to the instruction in the manual.

Now I look online and have a really hard time finding hard information like that. I know things have gotten lots more complicated with pipelines and caches and it's not anywhere near as simple as a simple cycle count for each addressing mode. But some kind of indication that the chip's implementation of divide is much more expensive than its implementation of multiply would be really helpful.



Edited at 2009-08-31 06:25 am (UTC)
(Reply) (Parent) (Thread)