
List:       wine-devel
Subject:    Re: PATCH: ddraw speedup
From:       Ove Kaaven <ovek () arcticnet ! no>
Date:       1999-05-30 13:23:43


On Sun, 30 May 1999, Marcus Meissner wrote:

>    if (palette != NULL) {
> -    unsigned short *pal = (unsigned short *) palette->screen_palents;
> +    unsigned short * pal = (unsigned short *) palette->screen_palents;
...
> +      /* gcc generates slightly inefficient code for the the copy / lookup,
> +       * it generates one excess memory access (to pal) per pixel. Since
> +       * we know that pal is not modified by the memory write we can

Couldn't this have been solved by making it

const unsigned short * const pal = ...

or

register unsigned short * pal = ...

or something of that kind?
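
To make that concrete, here is a minimal sketch of the plain C lookup loop
with the qualifiers applied. The function name `convert_row_c` and its
parameters are hypothetical, not taken from the patch, and whether the extra
memory access actually disappears depends on the compiler and its register
allocation, so the generated assembly would need checking.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expand one row of 8-bit palette indices into
 * 16-bit pixels through a lookup table of 256 entries. */
static void convert_row_c(const uint8_t *src, uint16_t *dst, size_t width,
                          const uint16_t *palents)
{
    /* A const local pointer whose address is never taken: stores through
     * dst cannot change it, so the compiler is free to keep it in a
     * register for the whole loop. */
    const uint16_t *const pal = palents;
    size_t x;

    for (x = 0; x < width; x++)
        dst[x] = pal[src[x]];
}
```

The qualifiers are only hints about the pointer itself; they do not force a
register assignment.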

> +       * put it into a register and reduce the number of memory accesses 
> +       * from 4 to 3 pp. There are two xor eax,eax to avoid pipeline stalls.
> +       * (This is not guaranteed to be the fastest method.)
> +       */
> +      __asm__ __volatile__(
> +      "movl %0,%%esi\n"
> +      "movl %1,%%edi\n"
> +      "movl %2,%%ecx\n"
> +      "movl %3,%%edx\n"
> +      "xor %%eax,%%eax\n"
> +      "1:\n"
> +      "    lodsb\n"
> +      "    movw (%%edx,%%eax,2),%%ax\n"
> +      "    stosw\n"
> +      "    xor %%eax,%%eax\n"
> +      "    loop 1b\n"
> +      :
> +      : "g" (c_src), "g" (c_dst) , "g" (width), "g" (pal)
> +      : "ax","di", "si", "dx" , "cx", "cc" , "memory"
> +      );
> +      c_dst+=width; /* Note: not pitch! */
> +      c_src+=pitch;

I don't like these constraints either (and you should have used "eax" 
instead of "ax", "edi" instead of "di", and so on). I'd have coded it more
like

__asm__ __volatile__(
"xorl %%eax,%%eax\n"
"1:\n"
"lodsb\n"
"movw (%%edx,%%eax,2),%%ax\n"
"stosw\n"
"xorl %%eax,%%eax\n"
"loop 1b\n"
: "=S" (c_src), "=D" (c_dst)
: "S" (c_src), "D" (c_dst), "c" (width), "d" (pal)
: "eax", "memory"
);
c_src+=(pitch-width);

Even this is probably slightly inefficient on a Pentium and above, though:
I've heard that lods/stos are slower and stall the pipelines more than
separate mov and inc/add instructions. Perhaps the best approach would be
to compile the original code (with "const" or "register" applied) using
pgcc, and then put the resulting assembly (perhaps further tweaked) in
here.
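
For reference, a sketch of the same lods/stos loop with every register the
loop modifies declared as a read-write operand, so the compiler knows that
esi, edi, ecx and eax all change (the `loop` instruction decrements the
count register). The function name `expand_row`, its signature, the zero-width
guard and the portable fallback are my assumptions, not part of the patch; it
also assumes a GCC-compatible compiler and that the direction flag is clear.
It makes no claim about speed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expand one row of 8-bit palette indices into
 * 16-bit pixels. pal must point to a table of 256 entries. */
static void expand_row(const uint8_t *src, uint16_t *dst,
                       size_t width, const uint16_t *pal)
{
    if (width == 0)
        return;  /* "loop" with a zero count would wrap around */

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    {
        size_t idx = 0;  /* held in eax/rax; lodsb fills its low byte */

        __asm__ __volatile__(
            "1:\n\t"
            "lodsb\n\t"                            /* al = *src++          */
            "movw (%[pal],%[idx],2), %w[idx]\n\t"  /* ax = pal[al]         */
            "stosw\n\t"                            /* *dst++ = ax          */
            "xorl %k[idx], %k[idx]\n\t"            /* clear index register */
            "loop 1b"                              /* --count, repeat      */
            : "+S" (src), "+D" (dst), "+c" (width), [idx] "+a" (idx)
            : [pal] "d" (pal)
            : "cc", "memory");
    }
#else
    /* Portable fallback for other compilers and architectures. */
    while (width--)
        *dst++ = pal[*src++];
#endif
}
```

Since the pointers are passed by value here, the caller would still advance
its own source pointer by the pitch and its destination pointer by the width
after each row.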

