Adobe Pixel Bender in Flash Player 10 Beta
A long time ago, back in Flash Player 8 days we had the idea of adding a generic way to do bitmap filters. Hard coding bitmap filters like we did for Flash Player 8 is not only not flexible, but has the burden of adding huge amounts of native code into the player and having to optimize it for each and every platform. The issue for us has always been how you would author such generic filters. Various ideas were floating around but in the end there was one sticking point: we had no language and no compiler. After Macromedia's merger with Adobe the Flash Player and the Adobe Pixel Bender team came together and we finally had what we needed: a language and a compiler.
The Pixel Bender runtime in the Flash Player is drastically different from what you find in the Adobe Pixel Bender Toolkit. The only connection the Flash Player has is the byte code which the toolkit does generate, it generates files with the .pbj extension. A .pbj file contains a binary representation of opcodes/instructions of your Pixel Bender kernel, much the same way a .swf contains ActionScript3 byte code. The byte code itself is designed to translate well into a number of different run times, but for this Flash Player release the focus was a software run time.
You heard right, software run time. Pixel Bender kernels do not run using any GPU functionality whatsoever in Flash Player 10.
Take a breath. :-)
Running filters on a GPU has a number of critical limitation. If we would have supported the GPU to render filters in this release we would have had to fall back to software in many cases. Even if you have the right hardware. And then there is the little issue that we only would have enabled this in the 'gpu' wmode. So it is critical to have a well performing software fallback; and I mean one which does not suck like some other frameworks which we have tried first (and which I will not mention by name). A good software implementation also means you can reach more customers which simply do not have the required hardware, which is probably 80-90% of the machines connected to the web out there. Lastly this is the only way we can guarantee somewhat consistent results across platforms. Although I have to point out that that you'll see differences which are the result of compromises to get better performance.
So why did we not just integrate what the Adobe Pixel Bender Toolkit does, which does support GPUs? First, we need to run on 99% of all the machines out there, down to a plain Pentium I with MMX support running at 400Mhz. Secondly, I would hate to see the Flash Player installer grow by 2 or 3 megabytes in download size. That's not what the Flash Player is about. The software implementation in Flash Player 10 as it stands now clocks in at about 35KB of compressed code. -- I am perfectly aware that some filters would get faster by an order of two magnitudes(!) on a GPU. We know that too well and for this release you will have to deal with this limitation. The important thing to take away here is: A kernel which runs well in the toolkit might not run well at all in the Flash Player.
But... I have more news you might not like. ;-) If you ever run a Pixel Bender filter on PowerPC based Mac you will see that it runs about 10 times slower than on an Intel based Mac. For this release we only had time to implement a JIT code engine for Intel based CPUs. On a PowerPC Mac Pixel Bender kernels will run in interpreted mode. I leave it up to you to make a judgment of how this will affect you. All I can say: Be careful when deploying content using Pixel Bender filters, know your viewers.
Now for some more technical details: the JIT for Pixel Bender filters in Flash Player 10 support various instructions sets, down to plain x87 floating point math and up to SSE2 for some operations like texture sampling which take the most amount of time usually. Given the nature of these filters working like shaders, i.e. being embarrassingly parallel, running Pixel Bender kernels scales linearly with amount of CPUs/cores you have on your machine. On an 8-core machine you will usually be limited by memory bandwidth. Here is a CPU readout on my MacPro when I run a filter on a large image (3872x2592 pixels):
There are 4 different ways of using Pixel Bender kernels in the Flash Player. Let me start with most obvious one and come down to the more interesting case:
- Filters. Use a Pixel Bender kernel as a filter on any DisplayObject. Obvious.
- Fill. Use a Pixel Bender kernel to define your own fill type. Want a fancy star shaped high quality gradient? A nice UV gradient? Animated fills? No problem.
- Blend mode. Not happy with the built-in blend modes? Simply build your own.
This 4th way of using Pixel Bender kernels runs completely separate from your main ActionScript code. It runs in separate thread which allows you to keep your UI responsive even if a Pixel Bender kernel takes a very long time to complete. This works fairly similar to a URLLoader. You send a request with all the information, including your source data, output objects, parameters etc. and a while later an event is dispatched telling you that it is finished. This will be great for any application which wants to do heavy processing.
In my next post I show some concrete examples of how you would use these Pixel Bender kernels in these different scenarios. For now I'll let this information sink in.
What follows are a few random technical nuggets I noted in my specification when it comes to the implementation in the Flash player, highly technical but important to know if you are pushing the limits of this feature:
- The internal RGB color space of the Flash Player is alpha pre-multiplied and that is what the Pixel Bender kernel gets.
- Output color values are always clamped against the alpha. This is not the case when the output is a ByteArray or Vector.
- The maximum native JIT code buffer size for a kernel is 32KB, if you hit this limit which can happen with complex filters the Flash Player falls back to interpreted mode mode like it does in all cases on PowerPC based machines.
- You can freely mix linear and nearest sampling in your kernel.
- Maximum coordinate range is 24bit, that means for values outside the range of -4194304..4194303 coordinates will wrap when you sample and not clamp correctly anymore.
- The linear sampler does sample up to 8bit of sub pixel information, meaning you'll get a maximum of 256 steps. This is also the case if you sample from a ByteArray with floating point data.
- Math functions apart from simple multiplication, division, addition and subtracting work slightly differently on different platforms, depending on the C-library implementation or CPU.
- In the Flash Player 10 beta vecLib is used on OSX for math functions. Slightly different results on OSX are the result. This might change in the final release as the results could be too different to be acceptable. (This is at least one instance where something will be significantly faster on Mac than on PC)
- The JIT does not do intelligent caching. In the case of fills that means that each new fill will create a new code section and rejit.
- There are usually 4 separate JIT'd code sections which each handle different total pixel counts, from 1 pixel to 4 pixels at a time. This is required for anti-aliasing as the rasterizer works with single pixel buffers in this case.
- When an if-else-endif statement is encountered, the JIT switches to scalar mode, i.e. the if-else-ending section will be expanded up to 4 times as scalar code. Anything outside of a if-else-endif block is still processed as vectors. It's best to move sampling outside of if statements if practical. The final write to the destination is always vectorized.
- The total number of JIT'd code is limited by the virtual address space. Each code section reserves 128Kbytes (4*32KB) of virtual address space.
- The first 4 pixels rendered of every instance of a shader is run in interpreted mode, the native code generation is done during that first run. You might get artifacts if you depend on limit values as the interpreted mode uses different math functions. If you are on a multicore system, every new span rendered will create a new instance of a shader, i.e. the code is JITd 8*4 times on a 8-core system. This way the JIT is completely without any locks.