Tuesday, May 20, 2008

Adobe Pixel Bender in Flash Player 10 Beta

Lee Brimelow has posted a snippet of code showing how to use Adobe Pixel Bender kernels in Flash Player 10. Time for me to go into details about this feature. As usual, it holds some surprises and unexpected behavior. I'll keep this post free of actual Pixel Bender sample code, but I promise to show some samples soon.

A long time ago, back in the Flash Player 8 days, we had the idea of adding a generic way to do bitmap filters. Hard-coding bitmap filters like we did for Flash Player 8 is not only inflexible, it also carries the burden of adding huge amounts of native code to the player and having to optimize it for each and every platform. The issue for us has always been how you would author such generic filters. Various ideas were floating around, but in the end there was one sticking point: we had no language and no compiler. After Macromedia's merger with Adobe, the Flash Player and Adobe Pixel Bender teams came together and we finally had what we needed: a language and a compiler.

The Pixel Bender runtime in the Flash Player is drastically different from what you find in the Adobe Pixel Bender Toolkit. The only connection between the two is the byte code the toolkit generates, which it writes to files with the .pbj extension. A .pbj file contains a binary representation of the opcodes/instructions of your Pixel Bender kernel, much the same way a .swf contains ActionScript 3 byte code. The byte code itself is designed to translate well into a number of different runtimes, but for this Flash Player release the focus was a software runtime.

You heard right, a software runtime. Pixel Bender kernels do not run using any GPU functionality whatsoever in Flash Player 10.

Take a breath. :-)

Running filters on a GPU has a number of critical limitations. If we had supported the GPU for rendering filters in this release, we would have had to fall back to software in many cases, even if you have the right hardware. And then there is the little issue that we would only have enabled this in the 'gpu' wmode. So it is critical to have a well-performing software fallback, and I mean one which does not suck like some other frameworks which we tried first (and which I will not mention by name). A good software implementation also means you can reach more customers who simply do not have the required hardware, which is probably 80-90% of the machines connected to the web out there. Lastly, this is the only way we can guarantee somewhat consistent results across platforms, although I have to point out that you'll see differences which are the result of compromises to get better performance.

So why did we not just integrate what the Adobe Pixel Bender Toolkit does, which does support GPUs? First, we need to run on 99% of all the machines out there, down to a plain Pentium I with MMX support running at 400 MHz. Secondly, I would hate to see the Flash Player installer grow by 2 or 3 megabytes in download size. That's not what the Flash Player is about. The software implementation in Flash Player 10 as it stands now clocks in at about 35 KB of compressed code. I am perfectly aware that some filters would get faster by two orders of magnitude(!) on a GPU. We know that all too well, and for this release you will have to deal with this limitation. The important thing to take away here is: a kernel which runs well in the toolkit might not run well at all in the Flash Player.

But... I have more news you might not like. ;-) If you ever run a Pixel Bender filter on a PowerPC-based Mac, you will see that it runs about 10 times slower than on an Intel-based Mac. For this release we only had time to implement a JIT code engine for Intel-based CPUs. On a PowerPC Mac, Pixel Bender kernels will run in interpreted mode. I leave it up to you to judge how this will affect you. All I can say: be careful when deploying content using Pixel Bender filters, and know your viewers.

Now for some more technical details: the JIT for Pixel Bender filters in Flash Player 10 supports various instruction sets, from plain x87 floating point math up to SSE2 for some operations like texture sampling, which usually take the most time. Since these filters work like shaders, i.e. they are embarrassingly parallel, running Pixel Bender kernels scales linearly with the number of CPUs/cores in your machine. On an 8-core machine you will usually be limited by memory bandwidth. Here is a CPU readout on my Mac Pro when I run a filter on a large image (3872x2592 pixels):

[Image: CPU readout from the Mac Pro while running the filter]

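To make the "embarrassingly parallel" point a bit more tangible, here is a minimal sketch in Python (not Pixel Bender or ActionScript, and not how the Flash Player is actually implemented). It assumes an image is just a list of rows of grayscale values; `invert_kernel`, `run_band` and `run_parallel` are hypothetical names made up for this illustration. Because every output pixel depends only on its own input, the rows can be cut into bands and handed to independent workers without any coordination.

```python
from concurrent.futures import ThreadPoolExecutor


def invert_kernel(pixel):
    """Hypothetical per-pixel kernel: invert one grayscale value in 0..1."""
    return 1.0 - pixel


def run_band(kernel, rows):
    """Apply the kernel to every pixel in a band of rows."""
    return [[kernel(p) for p in row] for row in rows]


def run_parallel(kernel, image, workers):
    """Split the image into contiguous row bands and process them independently."""
    band = max(1, (len(image) + workers - 1) // workers)
    bands = [image[i:i + band] for i in range(0, len(image), band)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input bands
        results = list(pool.map(lambda rows: run_band(kernel, rows), bands))
    out = []
    for part in results:
        out.extend(part)
    return out


image = [[(x + y) / 10.0 for x in range(4)] for y in range(6)]
parallel = run_parallel(invert_kernel, image, workers=4)
serial = run_band(invert_kernel, image)
print(parallel == serial)
```

Python threads will not give you a real speedup for this kind of work; the sketch only shows why the partitioning is trivial when pixels do not depend on each other.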
There are 4 different ways of using Pixel Bender kernels in the Flash Player. Let me start with the most obvious ones and work down to the most interesting case:
  • Filters. Use a Pixel Bender kernel as a filter on any DisplayObject. Obvious.
  • Fill. Use a Pixel Bender kernel to define your own fill type. Want a fancy star shaped high quality gradient? A nice UV gradient? Animated fills? No problem.
  • Blend mode. Not happy with the built-in blend modes? Simply build your own.
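One rough way to picture the difference between these three uses is by what a kernel gets to look at for each output pixel. The following is a simplified sketch in Python, not Pixel Bender code; the function names and the exact inputs are assumptions for illustration, and real kernels can take additional parameters and images. Pixels are (r, g, b, a) tuples with values in 0..1.

```python
import math


def grayscale_filter(src):
    """Filter-style kernel: one source pixel in, one pixel out."""
    r, g, b, a = src
    level = (r + g + b) / 3.0
    return (level, level, level, a)


def radial_fill(x, y, cx, cy, radius):
    """Fill-style kernel: no source pixel, the color comes from the coordinate."""
    d = min(1.0, math.hypot(x - cx, y - cy) / radius)
    level = 1.0 - d
    return (level, level, level, 1.0)


def average_blend(background, foreground):
    """Blend-style kernel: combines the pixel underneath with the pixel on top."""
    return tuple((b + f) / 2.0 for b, f in zip(background, foreground))


print(grayscale_filter((0.25, 0.5, 0.75, 1.0)))
print(radial_fill(10.0, 10.0, 10.0, 10.0, 5.0))
print(average_blend((0.0, 0.0, 0.0, 1.0), (1.0, 1.0, 1.0, 1.0)))
```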

What about the 4th? Well, as you can see, the ones in the list are designed for graphics only. The last one is more powerful than that. Instead of targeting a specific graphics primitive in the Flash Player, you can target BitmapData objects, ByteArrays or Vectors. Not only that, but if you use a ByteArray or Vector the data you handle are 32-bit floating point numbers for each channel, unlike BitmapData, which is limited to 8-bit unsigned integers per channel. In the end this means you can use Pixel Bender kernels not only for graphics stuff but for generic number crunching, if you can accept the 32-bit floating point limitation.
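To get a feel for the difference between the two storage formats, here is a small Python sketch using only the standard `struct` module. The helpers `to_uint8` and `to_float32` are hypothetical, and the rounding used for the 8-bit case is an assumption, not a statement about what the Flash Player does internally.

```python
import struct


def to_uint8(value):
    """Store a channel value the way an 8-bit-per-channel bitmap could: clamp, then quantize."""
    clamped = min(1.0, max(0.0, value))
    return int(round(clamped * 255))


def to_float32(value):
    """Round-trip a value through a 32-bit float, as a float buffer would store it."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


value = 0.1234567
as_byte = to_uint8(value) / 255.0   # only 256 possible levels
as_float = to_float32(value)        # roughly 7 significant decimal digits

print(abs(as_byte - value))         # error on the order of 1e-3
print(abs(as_float - value))        # error far below 1e-7

# Values outside 0..1 survive in a float buffer but not in an 8-bit channel.
print(to_uint8(2.5), to_float32(2.5))

# The "32-bit floating point limitation": not every integer above 2**24 is representable.
print(to_float32(16777217.0))
```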

This 4th way of using Pixel Bender kernels runs completely separately from your main ActionScript code. It runs in a separate thread, which allows you to keep your UI responsive even if a Pixel Bender kernel takes a very long time to complete. This works much like a URLLoader: you send a request with all the information, including your source data, output objects, parameters etc., and a while later an event is dispatched telling you that it is finished. This will be great for any application which wants to do heavy processing.
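The request/event pattern can be sketched in a few lines of Python. This is only an analogy for the flow described above, not the Flash Player API: `start_job` and the `("complete", output)` event tuple are hypothetical, and a `queue.Queue` stands in for the event dispatch back to the main code.

```python
import queue
import threading


def start_job(kernel, source, events):
    """Run the kernel over the source data on a background thread."""
    def run():
        output = [kernel(v) for v in source]
        # Hand the result back to the main thread as a "complete" event
        events.put(("complete", output))

    worker = threading.Thread(target=run)
    worker.start()
    return worker


events = queue.Queue()
worker = start_job(lambda v: v * v, [1.0, 2.0, 3.0], events)

# The main thread is free to do other work here while the job runs.

name, output = events.get(timeout=5)  # later: pick up the completion event
worker.join()
print(name, output)
```

The point of the design is that the caller never blocks on the heavy work itself; it only reacts once the completion event arrives.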

In my next post I will show some concrete examples of how you would use these Pixel Bender kernels in these different scenarios. For now I'll let this information sink in.

What follows are a few random technical nuggets I noted in my specification about the implementation in the Flash Player. They are highly technical, but important to know if you are pushing the limits of this feature:

  • The internal RGB color space of the Flash Player is alpha pre-multiplied and that is what the Pixel Bender kernel gets.

  • Output color values are always clamped against the alpha. This is not the case when the output is a ByteArray or Vector.

  • The maximum native JIT code buffer size for a kernel is 32KB. If you hit this limit, which can happen with complex filters, the Flash Player falls back to interpreted mode, like it does in all cases on PowerPC-based machines.

  • You can freely mix linear and nearest sampling in your kernel.

  • The maximum coordinate range is 24-bit. That means for values outside the range of -4194304..4194303, coordinates will wrap when you sample and no longer clamp correctly.

  • The linear sampler samples up to 8 bits of sub-pixel information, meaning you'll get a maximum of 256 steps. This is also the case if you sample from a ByteArray with floating point data.

  • Math functions apart from simple multiplication, division, addition and subtraction work slightly differently on different platforms, depending on the C library implementation or CPU.

  • In the Flash Player 10 beta vecLib is used on OSX for math functions. Slightly different results on OSX are the result. This might change in the final release as the results could be too different to be acceptable. (This is at least one instance where something will be significantly faster on Mac than on PC)

  • The JIT does not do intelligent caching. In the case of fills that means that each new fill will create a new code section and re-JIT.

  • There are usually 4 separate JIT'd code sections which each handle different total pixel counts, from 1 pixel to 4 pixels at a time. This is required for anti-aliasing as the rasterizer works with single pixel buffers in this case.

  • When an if-else-endif statement is encountered, the JIT switches to scalar mode, i.e. the if-else-endif section will be expanded up to 4 times as scalar code. Anything outside of an if-else-endif block is still processed as vectors. It's best to move sampling outside of if statements if practical. The final write to the destination is always vectorized.

  • The total number of JIT'd code sections is limited by the virtual address space. Each code section reserves 128 KB (4*32KB) of virtual address space.

  • The first 4 pixels rendered by every instance of a shader are run in interpreted mode; the native code generation is done during that first run. You might get artifacts if you depend on limit values, as the interpreted mode uses different math functions. If you are on a multicore system, every new span rendered will create a new instance of a shader, i.e. the code is JIT'd 8*4 times on an 8-core system. This way the JIT works completely without locks.
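A few of the nuggets above are easier to picture with numbers. The following Python sketch illustrates pre-multiplied alpha, clamping color against alpha, and a linear sampler that only resolves 256 sub-pixel steps. All function names are hypothetical, and details such as flooring the sub-pixel fraction and clamping at the edges are assumptions made for the illustration, not a description of the actual player code.

```python
import math


def premultiply(r, g, b, a):
    """Convert straight RGBA to alpha pre-multiplied RGBA."""
    return (r * a, g * a, b * a, a)


def clamp_to_alpha(r, g, b, a):
    """Clamp alpha to 0..1, then clamp each color channel to 0..alpha."""
    a = min(1.0, max(0.0, a))
    return tuple(min(a, max(0.0, c)) for c in (r, g, b)) + (a,)


def sample_linear_1d(samples, x, steps=256):
    """Linear interpolation where the fractional position is quantized to 1/steps."""
    i = int(math.floor(x))
    frac = math.floor((x - i) * steps) / steps   # assumed: fraction is floored
    i0 = min(max(i, 0), len(samples) - 1)        # assumed: clamp at the edges
    i1 = min(max(i + 1, 0), len(samples) - 1)
    return samples[i0] + (samples[i1] - samples[i0]) * frac


print(premultiply(1.0, 0.5, 0.0, 0.5))
print(clamp_to_alpha(0.9, 0.2, -0.1, 0.5))

# 1000 distinct positions between two samples collapse to at most 256 values.
values = {sample_linear_1d([0.0, 1.0], k / 1000.0) for k in range(1000)}
print(len(values) <= 256)
```

In a pre-multiplied color space a color channel larger than alpha is not a valid color, which is why clamping against alpha makes sense for bitmap output but is skipped for raw ByteArray or Vector data.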

20 Comments:

Blogger tomsamson said...

While i understand you first want to get a fine working software only solution going there goes my last hope that the "gpu support" as it is in flash player 10 would be any useful for various types of content.
What about making the beta phase longer and made up of several public revisions instead of kicking the thing out after one or two public betas as release version in a state it disappoints so much after the hype that was made for it beforehand?
With flash its always hoping for the next player version to address things and add features one wants, too bad now that already starts before the current "new" player is released.

Tuesday, May 20, 2008 8:12:00 PM  
Blogger den ivanov said...

Thanx for lots of details. Will check this Bender more closely now. )

Tuesday, May 20, 2008 11:34:00 PM  
Blogger daninsaopaulo said...

First let me say I'm really excited about the prospect of a vectorized language useful for intensive generic number-crunching becoming available for the web. The potential for this is huge. (Regardless of GPU acceleration - as you mention, it's a rather small portion of the web audience).

A couple things:
- The 32KB limit for JIT sounds limiting and arbitrary. Runtime warnings when we hit this would be good - even better, allow us to override this!

- More explicit detail on when/how the code is JIT'ed is always better. There is practically zero documentation on the AS3 JIT compiler, for example, even though there are plenty of gotchas. I have a much better idea of what is going on with Java or .NET.

- Why the focus on bitmap filters since Flash 8? RIA developers could care less, game developers want 3D and acceleration, who is the audience for this stuff?

Thanks.

Wednesday, May 21, 2008 12:46:00 AM  
Blogger Ben said...

While I am still very excited at the prospect of pixel bender support (not to mention the host of other new features in 10), the removal of GPU support is truly perplexing. With every imminent new release people ask for hardware acceleration and this time I thought it actually may happen. With each new announcement it seems that GPU support is crippled more and more.
You start out by saying it was left out of this release to avoid inconsistencies across platforms but then spend the rest of your post pointing out all of the inconsistencies that exist in this release with the software only implementation.
What I really don't understand is that the whole point of having new WMODEs is to allow the content creator to choose if they want to use the GPU at the cost of potential compatibility issues. Surely it is up to us whether we run the risk of people not being able to see our content or not? If I create a full browser 3D game do I really expect a Pentium 1 user without a GPU to be able to play it?

Wednesday, May 21, 2008 5:48:00 AM  
Blogger Darrin Massena said...

Thanks for the exposition on Pixel Bender, Tunic! Extremely helpful. I'm sure you will be taking a lot of flack over the lack of hardware acceleration but there is still enormous benefit to be derived without it.

I'm particularly interested in background processing via ShaderJob. What are the relative priorities of the ShaderJob threads vs. the mainline Actionscript execution thread? I.e. what impact will background processing have on the foreground?

Wednesday, May 21, 2008 10:01:00 AM  
Blogger Jonathan Maïm said...

Thanks for the flash10 pixel bender run-time details, looking forward for the next code samples !

Wednesday, May 21, 2008 10:10:00 AM  
Blogger Darrin Massena said...

Another question for you Tunic, is the shader bytecode specification going to be made public so we can generate shaders at runtime?

Wednesday, May 21, 2008 10:11:00 AM  
Blogger Neverbith said...

It would be nice to have an alternative Flash Player version with GPU acceleration, even if it makes Flash Player a bit harder to maintain, IMHO the effort wouldn't be meaningless.

For me it would be nice to have scope level variable declaration, so each variable just exists on loop, conditional, try...catch, etc blocks, porting the Vector improvements to the Dictionary class (maybe adding support for something similar to generics?). Enums, and structures support would be a nice thing as well.

I guess that would depend more on ECMA4 tho.

Wednesday, May 21, 2008 10:19:00 AM  
Blogger Quasimondo said...

Thank you once more for some highly interesting insight!

Could you maybe elaborate why you decided not to allow loops for the Flash implementation of Pixel Bender?

Wednesday, May 21, 2008 10:30:00 AM  
Blogger Bryan Gale said...

Despite the lack of GPU acceleration, will using Pixel Bender still be faster than the equivalent AS3 code acting on a BitmapData/ByteArray/Vector (at least when it's not interpreted)?

Wednesday, May 21, 2008 11:02:00 AM  
Blogger Robin said...

"This might change in the final release as the results could be too different to be acceptable. (This is at least one instance where something will be significantly faster on Mac than on PC)"

If I'm getting this right, you want to cripple OS X's implementation to get performance on par with other platforms? If this is the case, it comes as a surprise.

I was shocked last week when an old (think government agency old) Windows box completely outperformed my 3 month old MacBook Pro whilst generating a 19 page pdf with (the most awesome) AlivePDF. Why not let OS X catch up somewhere?

Wednesday, May 21, 2008 1:06:00 PM  
Blogger Tek said...

@neverbirth> Alternative player is not a good solution. The good solution would be to have Flash player plug-ins support, all signed by Adobe and downloaded from a central server.

I secretly have crossed my fingers that this release will bring plug-ins support when I first heard of voip (because of the code size) but it seems this is probably something too hard to introduce in the flashplayer ecosystem for now.

Wednesday, May 21, 2008 2:04:00 PM  
Blogger Alex Bustin said...

Is GPU support scheduled for a 10.x release or shall we all be waiting until 11?

I think we should at least be able to have GPU support in the standalone players for the people who don't develop for the browser.

Wednesday, May 21, 2008 4:14:00 PM  
Blogger Neverbith said...

@tek> yep, you are right, I was at work when wrote my comment, heh, and that was the first thing I thought of regarding the size that the Flash Player could reach, although I think nowadays most countries basic internet services could afford it. Just my opinion since I'm on one of the countries with worse ISPs at Europe, and since I've travelled to some poor countries.

BTW, after playing a bit more with Flash 10 I realized that even if it included the Z axes it still takes into account the children indexes for sorting, am I missing something and z-sorting is available through some function/method? if not, any plans into looking for it? of course, it wouldn't take a lot of code and time to implement this, but it's always pleasant to see these sort of things added by default.

Saturday, May 24, 2008 11:35:00 PM  
Blogger Marc said...

Hi there.

My name is Marc and I work for a VJ software that allows flash playback. We seem to have a little bit of a Threading limitation with the flash player and it'd be great if we could talk to you a little to see if there's a way to solve that issue. Could you mail me at nostromo[nospasm]@10pm.org ? It would be awesome.

Regards
Marc

Wednesday, June 11, 2008 2:35:00 AM  
Blogger krizzle said...

"You can freely mix linear and nearest sampling in your kernel."

uhm.. and how does that work? oO

Wednesday, June 11, 2008 1:22:00 PM  
Blogger Hudson said...

I have been working on a pixel bender flash filter and have a couple of questions.
The absence of loops is really limiting. Why was that such a big addition?
Re: the 32k jit limit, I was thinking this would be for the byte code, but when I had a filter that was about 12k in byte code size, it hit the 32k limit for the jit buffer. Is that about the kind of expansion one could expect, or does it vary widely depending on the code?

Monday, October 06, 2008 8:19:00 AM  
Blogger Hudson said...

another question: As an attempt to work around the limitation of no loops, would it be possible to create a progressive pixel bender filter, e.g., on each iteration it would operate on the pixels from the last iteration. Off hand I don't see how to do this...

Monday, October 06, 2008 8:28:00 AM  
Blogger polycube said...

I'm late here, just beginning my Pixel Bender bending.

This post and a few others of yours have been *incredibly helpful* in understanding Pixel Bender. Subtleties that you wouldn't get just from playing with it, hints about the implementation & differences across products/platform.

Thanks, just thanks! => david van brink

Monday, October 06, 2008 10:31:00 AM  
