-
May 26th, 2009, 09:32 AM
#1
Faster way???
Hi all,
I have an image sequence taken from a movie. My task is to send this image sequence through the projector which we built and connected via a USB cable. I have to read the whole image sequence, create three colour images (R, G, B) for each frame, change the bit depth if needed, and finally send them to the projector.
Because it takes a lot of time to read all the images and convert them to R, G, B components, I have to keep them in RAM. Another reason is that the data has to be ready before sending it to the projector. If the user changes the bit depth, it gets worse, because that conversion also takes a lot of time. It is never real-time. I think I am going about this task the wrong way. There must be a much faster way to manage it, but I don't know it yet.
In short: do you have any ideas for reading the data from the file system, converting it to R, G, B components, and changing the bit depth quickly? Thanks.
-
May 26th, 2009, 01:21 PM
#2
Re: Faster way???
There's too little to go on.
Have any code?
What are the source / destination image dimensions, frame rate...?
What's the target computer like - single core, multiple cores, 3 GHz?
What's the USB interface - 2.0?
If my post was interesting or helpful, perhaps you would consider clicking the 'rate this post' to let me know (middle icon of the group in the upper right of the post).
-
May 26th, 2009, 11:06 PM
#3
Re: Faster way???
Originally Posted by JVene
There's too little to go on.
Have any code?
What are the source / destination image dimensions, frame rate...?
What's the target computer like - single core, multiple cores, 3 GHz?
What's the USB interface - 2.0?
The code is very, very long because of the projector panel controls. But simply,
for(int f = 0; f < the_number_of_frames; f++)
{
    for(int i = 0; i < 1024; i++)
    {
        for(int j = 0; j < 768; j++)
        {
            Color pixelcolor = mybmp->GetPixel(i, j);
        }
    }
}
This is for obtaining the R, G, B values. Then I keep them in an array. This process takes more than 1 minute for 150 frames. That is a lot! And 150 frames is nothing; it is a 3-second movie (50 Hz).
Another slow process is converting the bit depth. Here is my function to convert it (I haven't tried it yet):
unsigned char * BitConverter(unsigned char * old, int oldbit, int newbit)
{
    double ratio;
    for(int i = 0; i < 1024*768*3*deg.m_PicNum; i++)
    {
        ratio = *(old+i) / ((pow(2.0, oldbit)) - 1.0);
        *(old+i) = (pow(2.0, oldbit)) * ratio;
    }
    return old;
}
If you calculate the number of loop iterations, it is a huge number. But there must be an easier way to do it.
We are always working in XGA format (1024x768). I don't know the CPU of the computer, but it is one of the newest; let's say a Core 2 Duo. We are using USB 2.0, but that is not so important now. Maybe I will mention it in further replies, because there is another problem with USB 2.0.
-
May 26th, 2009, 11:47 PM
#4
Re: Faster way???
There's just too much about performance writing to go into without starting a book.
However, try this as one example....
...and note, this isn't refined or tested - I'm just re-arranging code without checking here....
Code:
unsigned char * BitConverter(unsigned char * old, int oldbit, int newbit)
{
    double ratio;
    double p = pow( 2.0, oldbit );
    double p1 = p - 1.0;
    unsigned char * ptr = old;
    unsigned char * lim = ptr + (1024*768*3*deg.m_PicNum);
    while( ptr < lim )
    {
        ratio = *ptr / p1;
        *ptr = p * ratio;
        ratio = *(ptr+1) / p1;
        *(ptr+1) = p * ratio;
        ptr += 2;
    }
    return old;
}
Note, for example, that I've incremented the pointer by 2; it may be better by 3 (doing 3 at a time) because of your "*3" limit - you'll need to take care of the last pixel being processed.
This change is classic Knuth to some degree.
There's a lot more than that, and it's possible the compiler would perform some of this optimization anyway.
Also, this:
Color pixelcolor=mybmp->GetPixel(i,j);
Is terribly slow. Unless you're also plucking out some region of the source, the source pixels can most likely be accessed in order, so a function finding a pixel at (i, j) for each pixel is a waste. Putting that into a function may be another waste, depending on what GetPixel does, and making it a call on the mybmp object is yet another. This loop should probably be inside the bitmap object so that context is "assumed" during the loop, instead of provided at each iteration of the loop.
You should also avoid creating a pixelcolor inside each iteration of the loop.
There's more - much more - including potential assembler optimizations using SIMD, but the compiler may do well enough without that.
Also, on a dual core, there's plenty of opportunity to consider parallel work. Several ways to divide the job come to mind - too many to list at first.
Try thinking along these lines and come back with a few specific optimization attempts along the way.
Last edited by JVene; May 26th, 2009 at 11:50 PM.
-
May 27th, 2009, 02:04 AM
#5
Re: Faster way???
I measured the elapsed time of both my previous code and the code you gave me. Here are the results for 140 frames:
My code: 34.79 sec
Your code (ptr += 2): 4.55 sec
Your code (ptr += 3): 4.39 sec
There is not such a big difference between the second and third, so I will keep ptr += 3.
This result is very good for now.
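For reference, timings like those above can be taken with a small helper. This is a sketch using modern std::chrono; the original code more likely used clock() or GetTickCount().

```cpp
#include <chrono>

// Run a callable once and return the wall-clock seconds it took.
template <class F>
double SecondsTaken(F&& work)
{
    const auto t0 = std::chrono::steady_clock::now();
    work();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```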
I think I didn't understand your advice about mybmp->GetPixel(i,j); let me explain my problem in a little more detail.
I am now asking the vital question. As you see, I have tried 140 frames for this example. Imagine a one-hour movie. If it is a cinema movie, it will be 24 Hz (24 frames per second). I need to calculate the R, G, B components for each frame, so that means 24*3 = 72 frames per second. Since it is a one-hour movie, that is 60*60 times 72, which means 259200 frames. And it is XGA format (1024x768), so 259200 times 1024x768 pixels, which is a huge amount of data.
From the above calculation, it can be seen that it is impossible to keep this data in memory. However, extracting the R, G, B components takes a lot of time, let alone in real time. Could you please give me an idea for real-time video streaming for this purpose? The tasks which have to be done before sending are below:
Read the image sequence,
Extract the R, G, B channels,
Change the bit depth (if needed),
Start sending the data using USB.
Thanks a lot for your help.
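For scale, the raw-data arithmetic above can be sanity-checked in a few constants (assuming uncompressed 24-bit XGA frames; note that the three R, G, B planes of a frame are the same bytes as the interleaved frame, so the channel split does not triple the volume):

```cpp
#include <cstdint>

// Raw data volumes for uncompressed 24-bit 1024x768 video at 24 Hz.
constexpr std::uint64_t kBytesPerFrame  = 1024ull * 768 * 3;      // 2,359,296 bytes (~2.25 MiB)
constexpr std::uint64_t kBytesPerSecond = kBytesPerFrame * 24;    // ~54 MiB/s
constexpr std::uint64_t kBytesPerHour   = kBytesPerSecond * 3600; // ~190 GiB/hour
```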
-
May 27th, 2009, 06:58 AM
#6
Re: Faster way???
Let me work backwards a little here.
Unless I missed something, 1 second of uncompressed video is about 170 Mbytes of data, and as of yet you've not mentioned a decompression phase, so I assume you have some kind of RAID setup (or other specialty device) to supply nearly 200 Mbytes per second sustained data, and room for the approximate 10.7 Gbytes per minute of video.
I suspect your ultimate target is either experimental or a device (embedded/ASIC or similar).
Ok, you say the code is too long to post here, that's fair, but this general layout:
Code:
for(int f = 0; f < the_number_of_frames; f++)
{
    for(int i = 0; i < 1024; i++)
    {
        for(int j = 0; j < 768; j++)
        {
            Color pixelcolor = mybmp->GetPixel(i, j);
        }
    }
}
While logically fine, it is slow for reasons similar to the previous improvement (Don Knuth comes to us from the late 60's, is still writing his works today, and much of his advice can't be escaped).
Here the operations associated with creating a Color (pixelcolor) - whatever that is (even just pushing space on the stack) - are on the interior of a critical loop. Move it outside the loop; say you make a new one each frame, not for each pixel (768 times less work).
mybmp - an object representing an image frame - obviously has a member function GetPixel. This is called on the interior of a critical loop (a theme repeating here). It shouldn't be at the innermost point of a critical loop. That's one of the things that took time in the previous example - pow was called inside a critical loop, calculations were performed inside the loop that need not be, and the increment of the loop itself need not be one unit.
mybmp "owns" all the pixels. At the very least, give it an entire line to work with and make it 1/768th as busy as it is now.
As in, all the stuff that you're doing which is too long to put here, which takes the position of the line for GetPixel...
Put that stuff INTO the bitmap object, as a member function. Call that function once per line, not once per pixel.
Move the innermost material of loops so they're doing the least volume of work possible. You're aiming for microscopic levels of work in there; anything extra magnifies CPU drain.
As you've seen, this very basic thought process can return 4, 8 sometimes 100 times the performance levels.
For example:
Code:
for(int f = 0; f < the_number_of_frames; f++)
{
    for(int i = 0; i < 1024; i++)
    {
        mybmp->ProcessLine( i );
    }
}

void imageframeobject::ProcessLine( int i )
{
    // Pixel is just my pseudo code; this might be unsigned char in your code
    Pixel *ptr = GetPositionOfLine( i ); // << inline
    Pixel *lim = ptr + 768;
    Color pixelcolor;
    while ( ptr < lim )
    {
        // Process pixel at ptr
        pixelcolor = *ptr; // etc
        // Process pixel at ptr + 1
        pixelcolor = *(ptr + 1); // etc
        ptr += 2;
    }
}
The point here is to put the processing inside the bitmap, so you're not pushing that "this" pointer onto the stack for every call to GetPixel, along with pushing its two parameters onto the stack, before calculating a position and getting a color value that's sitting in a contiguous series of bytes.
Further, if you're processing the color adjustment at each pixel, inline SIMD using one of the SSEn variants available on the Core2 could probably process an entire 3-color RGB at one time (SSE can perform multiple multiplications at once, not in a series like the previous example, improving performance by another factor of 3, perhaps). I've done it, and it's not something I do every day, so I have to reach for the SSE reference each time (sometimes I'm working in ARM VFP, the ARM processor's SIMD). Providing that example here would be research for me, and that's starting to turn into actual work, so I'll leave that for your googling.
Further, as a 'gross' (should I say coarse) means of parallel enhancement, you could start two threads and process every other frame in each thread, dividing the job in two and almost doubling performance. If you're not familiar with threaded programming, that's an interesting road to travel. The key in this case is to 'schedule' the output so the display stays in sequence, but it's not really THAT difficult.
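The every-other-frame idea can be sketched like this (using std::thread for brevity; a VC2008-era build would use CreateThread or _beginthreadex instead, and ProcessFrame is a stand-in for the real per-frame work):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Split the frame list between two workers: one thread takes even
// frames, the other takes odd frames. The two index sets are disjoint,
// so no locking is needed; output ordering is still the consumer's job.
void ProcessAllFrames(std::vector<int>& frames, void (*ProcessFrame)(int&))
{
    auto worker = [&frames, ProcessFrame](std::size_t first) {
        for (std::size_t i = first; i < frames.size(); i += 2)
            ProcessFrame(frames[i]);
    };
    std::thread even(worker, std::size_t(0));
    std::thread odd(worker, std::size_t(1));
    even.join();
    odd.join();
}
```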
Again, working backwards, your data 'pump' requires you to process 170 Mbytes per second, well within the Core2's performance. However, at a sustained algorithmic processing on each pixel (which is essentially each byte), you have to consider that for a mix of operations (some taking 1 clock tick, others taking about 10 or more) you have only about 1 billion operations per second per CPU core you can depend upon (this varies per core style and speed of CPU, and against the instruction mix actually used). It might be more or less depending on your algorithms. Let's say it is 1 billion per second, that means you have only enough speed for about 7 or 8, maybe 10 machine operations on each pixel. If you use a quad core, that's 4 times the available work per pixel. You could eat up 10 machine operations in a few function calls, so you are headed for at least considering both inline assembler and/or SIMD and likely threading to sustain this 'engine'.
Is this project due soon? Is it a product, a one time experiment?
PM me on the subject if you're getting into a time crunch.
Last edited by JVene; May 27th, 2009 at 07:54 AM.
-
May 27th, 2009, 10:52 AM
#7
Re: Faster way???
Hi,
You might be able to use a look-up table to speed up some of the bit-depth conversion as well.
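A sketch of that idea, assuming 8-bit input: since a byte has only 256 possible values, the whole pow()/ratio computation can be done once per value instead of once per byte of the image. (This variant rounds to nearest, so full-scale 255 maps to 31 for 8 -> 5 bits; the function names are mine, not from the thread.)

```cpp
#include <cstddef>

// Build the 256-entry mapping once; converting a byte is then a single
// table load.
void BuildDepthLUT(int oldbit, int newbit, unsigned char lut[256])
{
    const double oldMax = (1 << oldbit) - 1;  // e.g. 255 for 8-bit
    const double newMax = (1 << newbit) - 1;  // e.g. 31 for 5-bit
    for (int v = 0; v < 256; ++v)
        lut[v] = static_cast<unsigned char>(v / oldMax * newMax + 0.5);
}

void ConvertDepth(unsigned char *buf, std::size_t count, const unsigned char lut[256])
{
    for (std::size_t i = 0; i < count; ++i)
        buf[i] = lut[buf[i]];
}
```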
Alan
-
May 27th, 2009, 11:49 AM
#8
Re: Faster way???
Good point.
I also noticed this minor gem: for the bitconverter
Code:
unsigned char * BitConverter(unsigned char * old, int oldbit, int newbit)
{
    double ratio;
    double p = pow( 2.0, oldbit );
    double p1 = p - 1.0;
    double px = 1.0 / p1;
    unsigned char * ptr = old;
    unsigned char * lim = ptr + (1024*768*3*deg.m_PicNum);
    while( ptr < lim )
    {
        ratio = *ptr * px;
        *ptr = p * ratio;
        ratio = *(ptr+1) * px;
        *(ptr+1) = p * ratio;
        ptr += 2;
    }
    return old;
}
-
May 28th, 2009, 01:01 PM
#9
Re: Faster way???
I am trying to jump into this discussion, but I can't understand a few things:
What exactly are you trying to do in your BitConverter() function? Convert from bit depth oldbit to newbit? But why is newbit not used in that function?
Also, what is deg.m_PicNum? The number of frames? Are you doing them all at once?
Why is your original movie at 50 Hz? You later mentioned 24 frames per second…
Anyway, at less than 1 megapixel resolution, 1 byte per channel, 24 frames a second, you need ~50 MB/s, which is easily provided by modern hard drives and is less than USB 2.0 can do.
Another thought: traversing such a large array (which won't fit in any cache) is expensive. Why don't you do your bit conversion right where you already have your color components?
If you can post a skeleton of the entire process, someone might suggest further optimizations.
Vlad - MS MVP [2007 - 2012] - www.FeinSoftware.com
Convenience and productivity tools for Microsoft Visual Studio:
FeinWindows - replacement windows manager for Visual Studio, and more...
-
May 28th, 2009, 01:51 PM
#10
Re: Faster way???
Originally Posted by JVene
Code:
ratio= *ptr * px;
*ptr=p*ratio;
Without analyzing the logic here, you can see that this is the same as:
Code:
*ptr=p* (*ptr) * px;
And if you precalculate p*px, you can have one multiplication and one assignment instead of two multiplications. Considering the number of iterations, that could be important…
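That folding can be sketched as follows (a hypothetical Rescale helper; p and px play the same roles as in the snippets above):

```cpp
#include <cstddef>

// One multiply per byte: the constant product p * px is hoisted out of
// the loop instead of being applied as two multiplies inside it.
void Rescale(unsigned char *buf, std::size_t n, double p, double px)
{
    const double scale = p * px;  // neither p nor px changes in the loop
    for (std::size_t i = 0; i < n; ++i)
        buf[i] = static_cast<unsigned char>(buf[i] * scale);
}
```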
-
May 29th, 2009, 10:33 AM
#11
Re: Faster way???
Originally Posted by VladimirF
I am trying to jump into this discussion, but I can't understand a few things:
What exactly are you trying to do in your BitConverter() function? Convert from bit depth oldbit to newbit? But why is newbit not used in that function?
Also, what is deg.m_PicNum? The number of frames? Are you doing them all at once?
Why is your original movie at 50 Hz? You later mentioned 24 frames per second…
Anyway, at less than 1 megapixel resolution, 1 byte per channel, 24 frames a second, you need ~50 MB/s, which is easily provided by modern hard drives and is less than USB 2.0 can do.
Another thought: traversing such a large array (which won't fit in any cache) is expensive. Why don't you do your bit conversion right where you already have your color components?
If you can post a skeleton of the entire process, someone might suggest further optimizations.
Sorry, I could not follow my topic yesterday because of a whole-day meeting.
I have made very good improvements in my code, and now it is barely fast enough, but I am sure it can be better.
First, I have an image sequence taken from a movie, in 24-bit RGB format with XGA resolution. My main purpose is to:
1) Read a pre-defined number of images from this image sequence, let's say 25,
2) Convert them one by one to R, G, B channels, because I need to send these channels to my projector.
3) The user may not want to use 8 bits per color; he may use 5 or 6 bits instead. So I need to convert the data to newbit bits. I can embed this code inside the function which I use for reading. I will try it later.
4) Send this block of image data to the projector using USB 2.0.
The code which JVene gave me is a little wrong, because it should be:
Code:
double p = pow( 2.0, oldbit );
double t = pow( 2.0, newbit );
//double p1 = p - 1.0;
unsigned char * ptr = old;
unsigned char * lim = ptr + (1024*768*3*deg.m_PicNum);
while( ptr < lim )
{
    ratio = *ptr / (p - 1.0);
    *ptr = t * ratio;
    *ptr >>= 3;
    ...
This function basically converts all the values in the block from oldbit bits to newbit bits. As I said before, I will try to embed this into the Read function.
I am trying to find the best parameters in order to show a seamless movie. You are right: if it is 24-bit, everything is almost okay. It costs exactly 54 megabytes per second, and that is smaller than 60 MB/sec, which is the limit of USB 2.0. However, please keep in mind that this is a highly optimistic number that you may never reach. One parameter is the length of the cable, as an example. I think, in our case, the maximum USB transfer speed is about 50 MB/sec, which almost fits your example.
I need to convert the bit depth just because the user may want to increase the frame rate. In that case, I automatically decrease the bit depth in order not to exceed the limit of the USB.
Another thought: traversing such a large array (which won't fit in any cache) is expensive. Why don't you do your bit conversion right where you already have your color components?
Obtaining the color components is time-consuming enough. I did them separately because I wanted to see the elapsed time. But I will combine them later.
Anyway, my program is now quite fast, but not fast enough. Let's say 24 frames per second and 24-bit color depth; obtaining the R, G, B channels itself takes nearly 3-4 seconds on my computer. JVene already advised that I can also include multithreading in my code. I have never tried this before, but I will. If you have any other advice, please let me know.
-
May 29th, 2009, 02:11 PM
#12
Re: Faster way???
Originally Posted by koliva
Anyway, my program is now quite fast, but not fast enough. Let's say 24 frames per second and 24-bit color depth; obtaining the R, G, B channels itself takes nearly 3-4 seconds on my computer. JVene already advised that I can also include multithreading in my code. I have never tried this before, but I will. If you have any other advice, please let me know.
I was thinking for a while that you could simply shift the old value to convert it to a new bit depth.
Reading the formula in your post, I got concerned about losing accuracy for some values, so I ran this test:
Code:
int oldbit(8), newbit(5);
int iNew(0), iOld(0);
double p = pow( 2.0, oldbit );
double t = pow( 2.0, newbit );
double ratio(0.0);
for(int i = 0; i < 256; ++i)
{
    ratio = i / (p - 1.0);
    iOld = t * ratio;
    iNew = i >> 3;
    if(iNew != iOld)
        printf("Diff: iNew=%d, iOld=%d\n", iNew, iOld);
}
I only found *ONE* difference, for the value of 255: my shift got the correct new value of 31, while your formula produced 32, which is not even in the range of 5-bit values…
Needless to say, the performance of one shift will beat two floating-point operations.
In addition, the main point of JVene's advice about getting the pixel color was NOT multi-threading, but another way to access the bitmap's data: getting a direct pointer to its buffer and simply incrementing it to get to the next pixel. If your bitmap is not 32 bits per pixel, you'd need to take care of the scan-line padding.
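A sketch of that access pattern (assuming a 24bpp buffer like the one GDI+ LockBits hands back, where scan0 and stride come from BitmapData; the visit callback and function name are mine):

```cpp
#include <cstddef>

// Walk every pixel of a 24bpp image through a raw pointer, skipping
// the padding bytes at the end of each scan line. GDI+ stores 24bpp
// pixels in B, G, R order.
void ForEachPixel24(unsigned char *scan0, int stride, int width, int height,
                    void (*visit)(unsigned char b, unsigned char g, unsigned char r))
{
    for (int y = 0; y < height; ++y)
    {
        unsigned char *p = scan0 + static_cast<std::size_t>(y) * stride;
        for (int x = 0; x < width; ++x, p += 3)
            visit(p[0], p[1], p[2]);
    }
}
```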
Please come back here with your new results / questions.
-
May 29th, 2009, 03:16 PM
#13
Re: Faster way???
Unless I'm just off my game today, which happens, why is this:
double px = 1.0 / ( p - 1.0 );
ratio = i * px;
Not the same as
ratio = i/(p-1.0);
?
Neither p nor p - 1.0 are changing within the loop, and multiplication by an inverse is the same as division, only faster inside a loop.
Anyway, indeed I did mention that one very crude but useful means of overcoming single-core speed limitations might be to thread: hand off every other frame to a second core, or every 4th frame on a quad core, with some care taken to send those frames out from a queue in the correct sequence, and probably some work synchronizing sound to the video.
It's true, too, that it's not the thrust of my suggestions thus far.
Another point to consider is that a touch of inline SIMD could perform operations on a full RGB triplet in one instruction.
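For the simplest case discussed in this thread, the shift-based depth conversion, a SIMD sketch with SSE2 intrinsics (rather than inline asm) processes 16 bytes per iteration. Assumptions: the buffer length is a multiple of 16, and unaligned loads are acceptable.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>

// Convert 8-bit samples to 5-bit by shifting each byte right by 3,
// sixteen bytes at a time. SSE2 has no per-byte shift, so shift the
// 16-bit lanes and mask off the bits that leak across byte boundaries.
void ShiftDepthSSE2(unsigned char *buf, std::size_t count)  // count % 16 == 0 assumed
{
    const __m128i mask = _mm_set1_epi8(0x1F);  // keep only the 5 valid bits
    for (std::size_t i = 0; i < count; i += 16)
    {
        __m128i v = _mm_loadu_si128(reinterpret_cast<__m128i*>(buf + i));
        v = _mm_and_si128(_mm_srli_epi16(v, 3), mask);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(buf + i), v);
    }
}
```

Tail pixels and alignment are left to the caller, as in the asm discussion below.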
There are also ways to improve performance by loading adjacent pixels into a register and working that from the register.
Here's a quick look at that....
Let's say the data is organized as RGB. A crude but effective means of pulling these into an RGB routine is to load the entire RGB into a CPU register (this is obviously not portable).
The x86 or x64 (depending on your mode) can load 4 or 8 bytes at once.
The data is loaded into the register in a "burst" instead of one byte at a time - that is, if you're using C constructs to load bytes, each byte load takes about as long as loading one doubleword (4 bytes) or one quadword (8 bytes).
Anyway, chuck that into a register quickly, then keep it there, splitting it into RGB components within the CPU (it's very fast).
The odd part is that you have RGB - 3 parts - against a machine that will load 4 parts, making a "phase" problem of the 3/4 size oddity. It's not a problem if you 'unroll' the loop to account for it.
Let's say you're in 32 bits: you load the first RGBR, which has the extra R of the next pixel.
Now you roll and slice the RGB target out of the register, ending up with just the leftover R of the next pixel.
Your code continues inline, not "looping" yet (as in my previous example, processing a few pixels before looping), so you load the next doubleword, which will be GBRG, giving you the GB that belongs to the leftover R from before, and ending up with the next pixel's RG.
Roll and slice your target RGB, then load BRGB. Now you have the B for the leftover RG of the target pixel, and you have the entire RGB of the next pixel.
Roll and slice two pixels at once here.
Now you loop, because, if you paid attention, you're loading RGBR GBRG BRGB - a 4-pixel series using only 3 loads into the CPU, completing a cycle that 'aligns' with the odd phase of 3 to 4.
Doing this accounts for the weakness of the CPU at processing bytes.
I realize it's just a touch deeper into the notion of optimizing, and this notion plays into the idea of moving into SIMD optimization also, because there you're able to process RGB in a single step - so you need to feed it RGB content rather smoothly (read: efficiently).
C and C++ can't perform this kind of optimization work alone. Using special keywords that directly access registers by their x86 or x64 names is an extension you could consider, but you can't use 3 registers at once and leave C only 1 GP register. Using inline assembler you can pull several tricks to make things really burn through blocks of data.
Note, too, that in x64 there's even more room to work in.
Using 3 registers of 8 bytes you can read
RGBR GBRG < one register
BRGB RGBR < second register
GBRG BRGB < third register
That's 8 RGB pixels in 3 reads of sequential data.
Obviously the rolling and slicing takes a little time, but the cache and read-cycle hits you take by reading bytes are actually worse.
If you adjust your work toward the integer domain, this can achieve speeds several times your target, I'm sure.
The reverse works as well for sending data out in R, G and B monochrome output.
Packing 4 adjacent pixels into a single register before writing those 4 pixels to RAM is much faster than one byte at a time.
So, if you modified my suggestion into the reverse form, you'd load as much into a register as you can, and prepare 3 output registers.
In 32 bits, let's say you load RGBR into EAX.
Now, roll that so it's RRGB, so the B is at the tail.
Using the byte register to register move (mov bh, al, then mov ch, ah ), you have moved one pixel of B and one of G into two destinations
Roll EAX again so it's GBRR and then mov dx, ax.
What you have is now one register devoted to blue (ebx) with 1 pixel, one to green (ecx) with 1 pixel, and one to red with 2 pixels (edx)
Now you load the next doubleword, GBRG.
Roll EAX so it's RGGB.
mov bl, al
mov cl, ah
and before you move on, note where you are.....
You have
ebx with 2 blue pixels, in order 00BB
ecx with 2 green pixels, in order 00GG
edx with 2 red pixels, in order 00RR
Roll those so they become BB00, GG00, RR00
Then continue (it's getting long and too much like work, but now you finish the two remaining pixels and load the next doubleword).
This continues until your pixel order looks like....
blue (ebx) is in the order 1234
green (ecx) is in the order 1234
red (edx ) is in the order 1234
Now, you move these 4 pixels into your output destination as a doubleword (an unsigned integer of 4 bytes).
You keep pumping 4 pixels in from the source, rolling and slicing them into a destination doubleword for each of the 3 colors
until they're full, send them out to RAM, and continue.
What you're doing is trading the performance hit you get by reading and writing bytes to RAM for rolling and moving data within registers (which is very fast by comparison) - in the bargain you're moving 4 bytes of data at a time as fast as you were moving 1 byte at a time.
Since your data doesn't fit into cache anyway, this ultimately helps considerably. Even when it does fit into cache, the CPU has to do something like this under the hood just to read bytes that aren't aligned with its hardware.
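A portable C++ rendering of the same packing idea (my assumptions: a little-endian machine, and memcpy for the unaligned 32-bit loads, which compilers turn into single doubleword reads):

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>

// Split interleaved RGB into three planes, reading 4 bytes at a time.
// Every 3 doubleword loads yield 4 complete pixels (the 3-vs-4 "phase"
// described above). A byte-wise tail handles any leftover pixels.
void SplitRGB(const unsigned char *in, unsigned char *r, unsigned char *g,
              unsigned char *b, std::size_t pixels)
{
    std::size_t p = 0;
    for (; p + 4 <= pixels; p += 4)
    {
        std::uint32_t w0, w1, w2;  // R1G1B1R2, G2B2R3G3, B3R4G4B4
        std::memcpy(&w0, in + p * 3,     4);
        std::memcpy(&w1, in + p * 3 + 4, 4);
        std::memcpy(&w2, in + p * 3 + 8, 4);
        r[p]     = w0 & 0xFF;          g[p]     = (w0 >> 8) & 0xFF;   b[p]     = (w0 >> 16) & 0xFF;
        r[p + 1] = w0 >> 24;           g[p + 1] = w1 & 0xFF;          b[p + 1] = (w1 >> 8) & 0xFF;
        r[p + 2] = (w1 >> 16) & 0xFF;  g[p + 2] = w1 >> 24;           b[p + 2] = w2 & 0xFF;
        r[p + 3] = (w2 >> 8) & 0xFF;   g[p + 3] = (w2 >> 16) & 0xFF;  b[p + 3] = w2 >> 24;
    }
    for (; p < pixels; ++p)  // leftover pixels, one byte at a time
    {
        r[p] = in[p * 3];
        g[p] = in[p * 3 + 1];
        b[p] = in[p * 3 + 2];
    }
}
```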
If it's too much, well - I know, this kind of thing is used when the crunch is on. It's why I asked if this is an experiment, a college assignment or a prototype for a product.
Last edited by JVene; May 29th, 2009 at 03:42 PM.
-
June 2nd, 2009, 06:57 AM
#14
Re: Faster way???
I have changed my code as you suggested. Here is part of my code:
Code:
Drawing::Rectangle r(0, 0, 1024, 768);
Imaging::BitmapData ^bmData = mybmp->LockBits(r, Imaging::ImageLockMode::ReadWrite, Imaging::PixelFormat::Format24bppRgb);
int stride = bmData->Stride;
System::IntPtr Scan0 = bmData->Scan0;
byte * p = (byte *)(void *)Scan0;
int nOffset = stride - 1024*3;
long curr = pic*1024*768*3;
for(int y = 0; y < 768; ++y)
{
    for(int x = 0; x < 1024; ++x)
    {
        // pixel bytes are ordered B, G, R in memory, not R, G, B
        buf[curr+(y*1024)+x] = p[2];
        buf[curr+(1024*768)+(y*1024)+x] = p[1];
        buf[curr+(2*1024*768)+(y*1024)+x] = p[0];
        p += 3;
    }
    p += nOffset;
}
mybmp->UnlockBits(bmData);
This part of the code assigns the R, G, B values into my variable. It takes 4890 ms for 100 pictures. That is still too high for my application; it should take less than 2 seconds.
JVene, thanks for your help, but I could not implement what you said in your last message. I have some experience with assembler, but I have never embedded it in my C++ code. I am sure, though, that it will take less time when I use assembler.
If it's too much, well - I know, this kind of thing is used when the crunch is on. It's why I asked if this is an experiment, a college assignment or a prototype for a product.
None of them; this is my research topic. Actually my purpose is to use our projector, but I want to show a short movie. It would be a great plus.
VladimirF, thank you also for your help. I have used what you suggested.
Last edited by koliva; June 2nd, 2009 at 07:29 AM.
-
June 2nd, 2009, 02:03 PM
#15
Re: Faster way???
Well, I'm a sucker for research projects....
Here's a 32-bit asm example using VC2008 inline assembler:
Code:
void splitrgb( unsigned char *inbuffer, unsigned char *outbuffer )
{
    unsigned char *iptr = inbuffer;
    unsigned char *optr = outbuffer;
    unsigned char *limit = iptr + 1024*768*3;
    __asm {
        mov esi, iptr            ; source pointer
        mov edi, optr            ; dest pointer
    loop_top:
        mov eax, [esi]           ; load dword from source, becomes r2, b1, g1, r1
        mov dl, al               ; move r from FIRST pixel into red register
        mov cl, ah               ; move g from FIRST pixel into green register
        shr eax, 16              ; shift eax to move r2, b1 into position
        mov bl, al               ; move b from FIRST pixel into blue register
        mov dh, ah               ; move r from SECOND pixel into red register
        mov eax, [esi + 4]       ; load dword from source, becomes g3, r3, b2, g2
        mov ch, al               ; move g from SECOND pixel into green register
        mov bh, ah               ; move b from SECOND pixel into blue register
        shr eax, 16              ; shift eax to move g3, r3 into position
        shl edx, 16              ; shift red register to access pixels 3 & 4 (now ordered 2143)
        shl ecx, 16              ; shift green register to access pixels 3 & 4
        shl ebx, 16              ; shift blue register to access pixels 3 & 4
        mov dl, al               ; move r3 into position
        mov cl, ah               ; move g3 into position
        mov eax, [esi + 8]       ; load dword from source, becomes b4, g4, r4, b3
        mov bl, al               ; move b3 into position
        mov dh, ah               ; move r4 into position
        shr eax, 16              ; shift eax to move b4, g4 into position
        mov ch, al               ; move g4 into position
        mov bh, ah               ; move b4 into position
        ror ebx, 16              ; all 3 output registers are in 2143 order
        ror ecx, 16              ; but need to rotate to become 4321 order
        ror edx, 16              ; ebx = blue, ecx = green, edx = red - 4 adjacent pixels
        mov [edi], ebx           ; store 4 blue pixels
        mov [edi + 786432], ecx  ; store 4 green pixels
        mov [edi + 1572864], edx ; store 4 red pixels
        add esi, 12              ; increment source pointer
        add edi, 4               ; increment destination pointer
        cmp esi, limit
        jl loop_top
    }
}
The function assumes you have two buffers, one a buffer of pixels in rgb format that's 1024 x 768 x 3 bytes, just one frame.
The output is assumed to also be 1024 x 768 x 3, but where the first 1024 x 768 bytes is all blue, then the next 1024 x 768 is all green, then 1024 x 768 is all red - a 3 color separation.
The registers ESI and EDI are used to point to source and destination buffers, while EAX is used to chop up the incoming rgb pixel data.
EBX, ECX and EDX are used as output buffers of 4 adjacent pixels for each of the color planes in blue, green and red.
Each loop, the input stream is parsed for 4 adjacent pixels. They come in 3 groups (an unrolled loop is constructed).
The first group of input is R1G1B1R2
The second group is G2B2R3G3
The third group is B3R4G4B4
Now, when x86 loads the first group, R1G1B1R2, it becomes R2B1G1R1 in the EAX register. Output, similarly, must be "stored" in the order 4321, so that when a doubleword write sends it to the destination buffer, it rotates into 1234, the correct output order.
I don't have video data, so I checked this only with a sequenced block of RAM to see that it worked as expected.
On a single core AMD at 2.9 Ghz it seems to be able to separate at about 110 frames per second.
You may need to pay attention to my loop termination, I'm not 100% certain - I did this quickly and checked only that it "seemed" to do what I expected.
A 64bit version of this would be a little faster, but would work along the same lines.
An SIMD version would do this a little differently, and would likely be faster still, though I've not considered it with any care. It may only be beneficial in the context of mixing the 3 color separation with the brightness/contrast adjustment.
Last edited by JVene; June 2nd, 2009 at 02:21 PM.