Wednesday, August 17, 2005

Implementing the blur filter in Flash Player 8

Like I said I wanted to explain how the blur filter in Flash Player 8 came to be. It was the first filter we tried to implement after the cache as bitmap functionality started working and probably also the last one which needed tweaks just before we went into code lock down.

My first attempt was to faithfully create an SVG compliant gaussian blur filter. So after studying the spec I had a quick and dirty version running for some testing. The results were not very pleasing. First, it ran very slow and secondly it looked awful when animated since box blur sizes are rounded to integers. I am pretty sure Flash users would have been upset about that. The SVG spec suggests to implement special versions with different convolution matrices for small sizes, but code size and performance where a concern for me. So I scrapped the whole idea of being SVG compliant and started to work on a more flexible implementation.

The idea of using a box blur (also called a median filter) is a good one since it can be easily optimized in a way that it's performance is not really dependent on the blur size itself compared to a real gaussian blur. Although while running a box blur three times on an image gets you a decent approximation of a gaussian blur, I decided to not call it a gaussian blur and let users run it once or twice times only also. This allows for greater flexibility when performance is critical since running it just once is much faster and can yield interesting effects. Running it just one time on a image f.ex. creates a fake motion blur effect. I also decided to make the blur filter sub pixel precise, to provide better looks when animating and get rid of the requirement for special versions for small sizes. So you can not only select sizes like 2, 3 and 4, but also 1.5, 2.5, 2.05 etc.

The first step was to write the box blur filter in plain C code which was done fairly quickly. The performance was decent I thought and I quickly moved on to other projects. Until the first testers tried to run it on a 800x600 image. Doh, the Flash Player would freeze up for several seconds. I guess it was not fast enough. :-) As one engineer quickly figured out it was the division in the inner loop of the blur filter which was the problem. He replaced it with a multiply which I boldly reverted right afterwards since animating the blur now caused flickering. Not good. It was not until I remembered from one of the x86 AMD optimizations guides that most divisions indeed can be represented with a multiply and shift. After a long weekend using some scrap C code I came up with the solution. Even better, it was now possible to use MMX and AltiVec which quickly made its way into the Flash Player.

Way later in the release cycle I continued to work on further optimizations, the last one was a special version if the blur size is 2^n. So if you select a size of 2, 4, 8 and 16 it will run slightly faster. Why only slightly you might ask? Well, at this point further optimizations have essentially no effect since the blur filter is not constrained by the CPU, but your memory bus speed. That's also the reason that it will almost always run slower on a Mac than a PC since the memory bus architecture and speeds of Macs generally lag way behind that of PCs. So don't tell me we do not care about the Mac. I simply can't magically increase throughput of data on this architecture and I already use every trick in the book like using the vec_dst() instruction. The only way to get more performance will be to use the GPU (meaning graphics card).

Just for kicks here is the inner loop MMX code for our blur filter if the size is 2^n (the other version contains a few tricks I do not really want to reveal :-). This little piece here is really where 90% of the computing time is spent when you apply a blur:


loop:
movd mm0, [esi]
punpcklbw mm0, mm7

movd mm2, [esi + ebx]
punpcklbw mm2, mm7

paddusw mm4, mm2
psubusw mm4, mm0

movq mm6, mm4
psrlw mm6, mm3
packuswb mm6, mm7
movd [edi], mm6

paddusw mm4, mm2
psubusw mm4, mm0

add edi, eax
add esi, 4
sub ecx, 1
jnz loop


As you can see everything runs in registers. Most of the time is spent in the movd instructions, the rest does not even show up. In the next version of the Flash Player we'll probably continue to optimize this further, maybe keeping track of two pixels at a time so we can use movq. There are also some tricks on x86 you can use to improve prefetching data into the cache.

4 Comments:

Blogger Burak said...

Hi,

Looking at the code I was wondering if it would be more optimized if you used DEC instruction rather than SUB at the end. In the old days it would definitely mean something (reducing the code size which increases instructions that fit in the instruction cache, when you don't check the carry flag)...

Great posts BTW. Your blog rocks!

Best regards,
Burak
www.asvguy.com

Wednesday, August 17, 2005 9:21:00 PM  
Anonymous mocaloca said...

How about that dynamic mask with blurred transition .... ssshhhhh
Tinic you rock!

Sunday, August 28, 2005 8:04:00 PM  
Anonymous Gary said...

Burak,

See page 2-68 of the Intel IA32 Architecture Optimization Manual:

"Assembly/Compiler Coding Rule 42. (M impact, H generality) inc and
dec instructions should be replaced with an add or sub instruction, because
add and sub overwrite all flags, whereas inc and dec do not, therefore
creating false dependencies on earlier instructions that set the flags."

Gary

Monday, October 10, 2005 11:35:00 PM  
Anonymous Mark said...

A box blur is not the same as a median filter.

Wednesday, February 22, 2006 12:22:00 AM  

Post a Comment

<< Home