Tuesday, May 20, 2008

Adobe Pixel Bender in Flash Player 10 Beta

Lee Brimelow has posted a snippet of code showing how to use Adobe Pixel Bender kernels in Flash Player 10. Time for me to go into detail about this feature. As usual, this feature holds surprises and unexpected behavior. I'll save full working samples for a later post, which I promise to write soon.

A long time ago, back in the Flash Player 8 days, we had the idea of adding a generic way to do bitmap filters. Hard-coding bitmap filters like we did for Flash Player 8 is not only inflexible, it carries the burden of adding huge amounts of native code to the player and having to optimize it for each and every platform. The issue for us has always been how you would author such generic filters. Various ideas were floating around, but in the end there was one sticking point: we had no language and no compiler. After Macromedia's merger with Adobe, the Flash Player and Adobe Pixel Bender teams came together and we finally had what we needed: a language and a compiler.

The Pixel Bender runtime in the Flash Player is drastically different from what you find in the Adobe Pixel Bender Toolkit. The only connection to the Flash Player is the byte code the toolkit generates: files with the .pbj extension. A .pbj file contains a binary representation of the opcodes/instructions of your Pixel Bender kernel, much the same way a .swf contains ActionScript 3 byte code. The byte code itself is designed to translate well into a number of different runtimes, but for this Flash Player release the focus was a software runtime.

You heard right, software run time. Pixel Bender kernels do not run using any GPU functionality whatsoever in Flash Player 10.

Take a breath. :-)

Running filters on a GPU has a number of critical limitations. Had we supported the GPU to render filters in this release, we would have had to fall back to software in many cases, even on the right hardware. And then there is the little issue that we would only have enabled this in the 'gpu' wmode. So it is critical to have a well-performing software fallback; and I mean one which does not suck like some other frameworks we tried first (and which I will not mention by name). A good software implementation also means you can reach the many customers who simply do not have the required hardware, which is probably 80-90% of the machines connected to the web out there. Lastly, this is the only way we can guarantee somewhat consistent results across platforms. Although I have to point out that you'll still see differences which are the result of compromises made for better performance.

So why did we not just integrate what the Adobe Pixel Bender Toolkit does, which does support GPUs? First, we need to run on 99% of all the machines out there, down to a plain Pentium I with MMX support running at 400MHz. Secondly, I would hate to see the Flash Player installer grow by 2 or 3 megabytes in download size. That's not what the Flash Player is about. The software implementation in Flash Player 10 as it stands now clocks in at about 35KB of compressed code. I am perfectly aware that some filters would get faster by two orders of magnitude(!) on a GPU. We know that all too well, and for this release you will have to deal with this limitation. The important thing to take away here is: a kernel which runs well in the toolkit might not run well at all in the Flash Player.

But... I have more news you might not like. ;-) If you run a Pixel Bender filter on a PowerPC based Mac you will see that it runs about 10 times slower than on an Intel based Mac. For this release we only had time to implement a JIT code engine for Intel based CPUs. On a PowerPC Mac, Pixel Bender kernels will run in interpreted mode. I leave it up to you to judge how this will affect you. All I can say is: be careful when deploying content using Pixel Bender filters, and know your viewers.

Now for some more technical details: the JIT for Pixel Bender filters in Flash Player 10 supports various instruction sets, from plain x87 floating point math up to SSE2 for some operations like texture sampling, which usually take the most time. Given that these filters work like shaders, i.e. are embarrassingly parallel, running Pixel Bender kernels scales linearly with the number of CPUs/cores in your machine. On an 8-core machine you will usually be limited by memory bandwidth. Here is a CPU readout on my MacPro when I run a filter on a large image (3872x2592 pixels):



There are 4 different ways of using Pixel Bender kernels in the Flash Player. Let me start with the most obvious ones and work down to the more interesting case:
  • Filters. Use a Pixel Bender kernel as a filter on any DisplayObject. Obvious.
  • Fill. Use a Pixel Bender kernel to define your own fill type. Want a fancy star shaped high quality gradient? A nice UV gradient? Animated fills? No problem.
  • Blend mode. Not happy with the built-in blend modes? Simply build your own.
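To make these three concrete, here is roughly what the wiring might look like in ActionScript 3 once a .pbj file has been loaded into a ByteArray. Treat this as a sketch against the beta API, not a definitive sample; the variable names ('kernelBytes', 'myClip', 'box') are mine.

```actionscript
import flash.display.Shader;
import flash.display.Sprite;
import flash.filters.ShaderFilter;

// Assumes 'kernelBytes' is a ByteArray holding a loaded .pbj file,
// e.g. fetched with a URLLoader using URLLoaderDataFormat.BINARY.
var shader:Shader = new Shader(kernelBytes);

// 1. Filter: apply the kernel to any DisplayObject.
myClip.filters = [new ShaderFilter(shader)];

// 2. Fill: use the kernel as a custom fill type.
var box:Sprite = new Sprite();
box.graphics.beginShaderFill(shader);
box.graphics.drawRect(0, 0, 200, 200);
box.graphics.endFill();
addChild(box);

// 3. Blend mode: use the kernel as a custom blend.
myClip.blendShader = shader;
```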

What about the 4th? Well, as you can see, the ones in the list are designed for graphics only. The last one is more powerful than that. Instead of targeting a specific graphics primitive in the Flash Player, you can target BitmapData objects, ByteArrays or Vectors. Not only that, but if you use a ByteArray or Vector, the data you handle are 32-bit floating point numbers for each channel, unlike BitmapData which is limited to 8-bit unsigned integers per channel. In the end this means you can use Pixel Bender kernels not only for graphics, but for generic number crunching, if you can accept the 32-bit floating point limitation.

This 4th way of using Pixel Bender kernels runs completely separate from your main ActionScript code. It runs in separate thread which allows you to keep your UI responsive even if a Pixel Bender kernel takes a very long time to complete. This works fairly similar to a URLLoader. You send a request with all the information, including your source data, output objects, parameters etc. and a while later an event is dispatched telling you that it is finished. This will be great for any application which wants to do heavy processing.
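As a sketch of how this asynchronous path might look (class and event names as they appear in the beta; 'kernelBytes' is assumed to hold a loaded .pbj, and the output sizing reflects my reading of the API):

```actionscript
import flash.display.Shader;
import flash.display.ShaderJob;
import flash.events.ShaderEvent;

var shader:Shader = new Shader(kernelBytes);

// Output target: a Vector of 32-bit floats. For a kernel writing a
// float4 per pixel this needs width * height * 4 elements.
var result:Vector.<Number> = new Vector.<Number>(512 * 512 * 4);

var job:ShaderJob = new ShaderJob(shader, result, 512, 512);
job.addEventListener(ShaderEvent.COMPLETE, jobDone);
job.start(); // returns immediately, the kernel runs on a separate thread

function jobDone(event:ShaderEvent):void {
    // 'result' now holds the kernel's floating point output
}
```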

In my next post I'll show some concrete examples of how you would use Pixel Bender kernels in these different scenarios. For now I'll let this information sink in.

What follows are a few random technical nuggets I noted in my specification of the implementation in the Flash Player; highly technical, but important to know if you are pushing the limits of this feature:

  • The internal RGB color space of the Flash Player is alpha pre-multiplied and that is what the Pixel Bender kernel gets.

  • Output color values are always clamped against the alpha. This is not the case when the output is a ByteArray or Vector.

  • The maximum native JIT code buffer size for a kernel is 32KB. If you hit this limit, which can happen with complex filters, the Flash Player falls back to interpreted mode, like it does in all cases on PowerPC based machines.

  • You can freely mix linear and nearest sampling in your kernel.

  • The maximum coordinate range is 24 bits; for values outside the range of -4194304..4194303, coordinates will wrap when you sample and no longer clamp correctly.

  • The linear sampler samples up to 8 bits of sub-pixel information, meaning you'll get a maximum of 256 steps. This is also the case if you sample from a ByteArray with floating point data.

  • Math functions, apart from simple multiplication, division, addition and subtraction, work slightly differently on different platforms, depending on the C library implementation or CPU.

  • In the Flash Player 10 beta, vecLib is used on OSX for math functions, which yields slightly different results on OSX. This might change in the final release as the results could be too different to be acceptable. (This is at least one instance where something will be significantly faster on Mac than on PC.)

  • The JIT does not do intelligent caching. In the case of fills, that means each new fill will create a new code section and re-JIT.

  • There are usually 4 separate JIT'd code sections which each handle different total pixel counts, from 1 pixel to 4 pixels at a time. This is required for anti-aliasing as the rasterizer works with single pixel buffers in this case.

  • When an if-else-endif statement is encountered, the JIT switches to scalar mode, i.e. the if-else-endif section will be expanded up to 4 times as scalar code. Anything outside of an if-else-endif block is still processed as vectors. It's best to move sampling outside of if statements if practical. The final write to the destination is always vectorized.

  • The total amount of JIT'd code is limited by the virtual address space. Each code section reserves 128KB (4*32KB) of virtual address space.

  • The first 4 pixels rendered of every instance of a shader are run in interpreted mode; the native code generation is done during that first run. You might get artifacts if you depend on limit values, as the interpreted mode uses different math functions. On a multicore system, every new span rendered will create a new instance of the shader, i.e. the code is JIT'd 8*4 times on an 8-core system. This way the JIT gets by completely without any locks.

Friday, May 16, 2008

What does GPU acceleration mean?

The just released Adobe® Flash® Player 10 beta version includes two new window modes (wmode) which control how the Flash Player pushes its graphics to the screen.

Traditionally there have been 3 modes:

normal: In this mode we are using plain bitmap drawing functions to get our rasterized images to the screen. On Windows that means using BitBlt; on OSX we are using CopyBits, or Quartz2D if the browser supports it.

transparent: This mode tries to do alpha blending on top of the HTML page, i.e. whatever is below the SWF will show through. The alpha blending is usually fairly expensive in CPU resources, so it is advised not to use this mode in normal cases. In Internet Explorer this code path does not actually go through BitBlt; it uses a DirectDraw context provided by the browser into which we composite the SWF.

opaque: Somewhat esoteric, but essentially like transparent, i.e. it uses DirectDraw in Internet Explorer. But instead of compositing, the Flash Player just overwrites whatever is in the background. This mode behaves like normal on OSX and Linux.

Now to the new modes:

direct: This mode tries to use the fastest path to the screen, or the direct path if you will. In most cases it will ignore whatever the browser would want to do to make things like overlapping HTML menus work. A typical use case for this mode is video playback. On Windows this mode uses DirectDraw, or Direct3D on Vista; on OSX and Linux we are using OpenGL. Fidelity should not be affected when you use this mode.

gpu: This is fully fledged compositing (plus some extras) using some functionality of the graphics card. Think of it as similar to what OSX and Vista do for their desktop managers: the content of windows (in Flash language that means movie clips) is still rendered in software, but the result is composited using hardware. When possible we also scale video natively in the card. More parts of our software rasterizer might move to the GPU over the next few Flash Player versions; this is just a start. On Windows this mode uses Direct3D; on OSX and Linux we are using OpenGL.

Now to the tricky part, things which will cause endless confusion if not explained:

1. Just because the Flash Player is using the video card for rendering does not mean it will be faster. In the majority of cases your content will become slower.

...

Confused yet? Good, that means you have the same understanding of what GPU support means as everyone else.

Content has to be specifically designed to work well with GPU functionality. The software rasterizer in the Flash Player can optimize a lot of cases the GPU cannot; you as the designer will have to be aware of what a GPU does and adapt your content accordingly. I realize this statement is useless unless we can provide guidance, something we can hopefully achieve in the not too distant future.

2. The hardware requirements for the GPU mode are stiff. You will need at least a DirectX 9 class card. We essentially have the exact same hardware requirements as Windows Vista with Aero Glass enabled; Aero Glass uses the exact same hardware functionality we do. So if Aero Glass does not work well on your machine, the Flash Player will likely not run well in GPU mode either (but to clarify, you do NOT need Aero Glass for the GPU mode to work in the Flash Player, I am merely talking about hardware requirements here).

3. Pixel fidelity is not guaranteed when you use the GPU mode. You have to expect that content will look different on different machines; even colors might not match perfectly. This includes video. Future Flash Players will change the look of your content in this mode. We will try our best to limit the pain, but please bear in mind that in many cases we have no control over this.

Here is an example; the left shows it running using the new gpu mode, the right using the normal mode. This is a video which is 320x240 pixels large showing red text, and as you can see the gpu mode arguably looks better as the hardware does UV blending:



The downside in this specific case is that the edges of the text are not as crisp anymore.

4. Frame rates will max out at the screen refresh rate. So whatever frame rate you set in your Flash movie is meaningless if it is higher than 60. This is the case for both the 'direct' and 'gpu' modes. In most cases you should end up at a frame rate of around 50-55 due to dropped frames, which occur from time to time for various reasons.

5. Please do not blindly enable either new mode (gpu or direct) in your content. Creating a GPU based context in the browser is very expensive and will drain memory and CPU resources to the point where the browser becomes unresponsive. It is usually best practice to limit yourself to one SWF per HTML page using these modes. The target should be content taking over most of the page and doing full-frame changes, like video. Never, ever enable this for banners. Plain Flex applications should not use these modes either if they are not doing full screen refreshes.

6. GPU functionality ties us to the video card manufacturers and their drivers. Given that, you can expect that a significant number of customers will not be able to view your content if you enable this mode, due to driver incompatibilities and various defects in the software stack.

Finally, this beta version of the Flash Player is not yet tuned for maximum performance in the gpu mode. We are making progress, but all the above points will still apply in the long term.

The 'direct' mode should never make your content slower, except with respect to point 4 above. It should either not change anything or lower CPU consumption somewhat with very large content, i.e. something larger than 1024x768 pixels.

What we'd ask you to do, though, is to give this a test drive. This is completely new territory for us and we expect to encounter lots of obstacles, meaning bugs.

Wednesday, May 14, 2008

Adobe Is Making Some Noise Part 3

[Update: I have updated the code sample to match the API changes in build 10.0.1.525]

Along with the new event to drive dynamically generated audio playback there is one gaping hole: where do you get source audio from? Sure, you can load raw audio from external sources through ByteArray, but that is less than efficient. What you really need is a way to have sound assets you can access from your ActionScript code.

In Flash Player 10, code named Astro, the Sound object will have one more method designed to work together with the "sampleData" event handler. It will extract raw sound data from an existing sound asset. That means any mp3 file you have in the library or load externally can be accessed and processed. Let's look at an actual code example. Here I am passing audio through from an existing Sound object to another Sound object which does the dynamic audio playback:

var mp3sound:Sound = new Sound();
var dynamicSound:Sound = new Sound();
var samples:ByteArray = new ByteArray();

function sampleData(event:SampleDataEvent):void {
    samples.position = 0;
    var len:Number = mp3sound.extract(samples, 1777);
    if (len < 1777) {
        // not enough samples left: wrap around for a seamless loop
        len += mp3sound.extract(samples, 1777 - len, 0);
    }
    samples.position = 0;
    for (var c:int = 0; c < len; c++) {
        var left:Number = samples.readFloat();
        var right:Number = samples.readFloat();
        event.data.writeFloat(left);
        event.data.writeFloat(right);
    }
}

function loadComplete(event:Event):void {
    dynamicSound.addEventListener("sampleData", sampleData);
    dynamicSound.play();
}

mp3sound.addEventListener(Event.COMPLETE, loadComplete);
mp3sound.load(new URLRequest("sound.mp3"));

Notice the extract() call here. This function will extract raw sound data from any existing Sound object. The format returned is always 44100Hz stereo; the number format is 32-bit floating point, which you can read and write using ByteArray.readFloat and ByteArray.writeFloat. The floating point values are normalized between -1.0 and 1.0. Here is the full prototype of this function:

function extract(target:ByteArray,
                 length:Number,
                 startPosition:Number = -1):Number;
  • target: A ByteArray object in which the extracted sound samples should be placed.
  • length: The number of sound samples to extract. A sample contains both the left and right channels -- that is, two 32-bit floating point values.
  • startPosition: The sample at which extraction should begin. If you don't specify a value, the first call to extract() starts at the beginning of the sound; subsequent calls without a value for startPosition progress sequentially through the sound.
  • extract() returns the number of samples which could be retrieved. This might be less than the length you requested at the very end of a sound.
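As a quick illustration of these parameters (my own snippet, not from any official documentation), extracting the first second of a fully loaded sound would look like this:

```actionscript
import flash.utils.ByteArray;

var firstSecond:ByteArray = new ByteArray();
// 44100 samples = one second; each sample is a left/right pair of floats,
// so 'firstSecond' will end up holding 44100 * 2 * 4 bytes.
var got:Number = mp3sound.extract(firstSecond, 44100, 0);
// 'got' may be less than 44100 if the sound is shorter than a second.
```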


With what we provide in Flash Player 10, we hope we are addressing the most pressing needs of what you want to do with sound. It will likely be just a matter of time until we see high level frameworks from the community on the magnitude of something like Papervision3D. The next couple of years should be very interesting indeed when it comes to sound on the web.

What's missing? Unfortunately some features did not make it into Flash Player 10: Extracting audio data from a microphone and extracting audio from a NetStream object. We are aware that both features are highly desirable, but for various reasons it was not possible to make this happen in this release.

Adobe Is Making Some Noise Part 2

[Update: I have updated the code sample to match the API changes in build 10.0.1.525]

The public beta of Adobe® Flash® Player 10, code named Astro, has been released. When you read the release notes you'll notice a small section talking about audio:

"Dynamic Sound Generation — Dynamic sound generation extends the Sound class to play back dynamically created audio content through the use of an event listener on the Sound object."

Yes, in Flash Player 10 you will be able to dynamically create audio. It's not an all-powerful API; it is designed to provide a low level abstraction of the native sound driver, hence providing the most flexible platform to build your music applications on. The API has one big compromise which I can't address without large infrastructural changes, and that is latency. Latency is horrible, to the point where some applications will simply not be possible. Improving latency will require profound changes in the Flash Player, which I will tackle for the next major revision. But for now this simple API will likely change the way you think about sound in the Flash Player.

Programming dynamic sound is all about how quickly and consistently you can deliver data to the sound card. Most sound cards work using a ring buffer, meaning you as the programmer push data into that ring buffer while the sound card feeds from it at the same time. The high level APIs to deal with this revolve around two concepts: 1. the device model, 2. the interrupt model.

In model 1 we run in a loop (usually in a thread) and write sound data to the device. The write blocks if the ring buffer is full. The loop continues until the sound ends. This is the most common method of playing back sound on Unix-like systems such as Linux.

In model 2 we have a function which is called by the system (usually from an interrupt on older systems) in which the application fills part of the ring buffer. The callback function is called whenever the sound card hits a point where it runs low on samples in the ring buffer. In an OS without real threading this is usually the only way to make sound playback work. MacOS 9 or older and Windows 98 or older used this system, and OSX continues to provide a way to do this in CoreAudio. As ActionScript has no threading, this is the model we expose. We could have used frame events to implement a loop, but that would represent an odd programming model.

Flash Player 10, code named Astro, supports a new event on the Sound object: "sampleData" (named "samplesCallback" in this beta build). It is dispatched at regular intervals requesting more audio data. In the event callback function you have to fill a given ByteArray (Sound.samplesCallbackData in the beta; the event's data property in the updated API) with a certain amount of sound data. The amount is variable, from 512 samples to 8192 samples per event. That is something you decide on, and it is a balance between performance and latency in your application. The less data you provide per event, the more overhead is spent in the Flash Player. The more data you provide, the longer the latency in your application will be. If you just play continuous audio, we suggest using the maximum amount of data per event, as the difference in overall performance can be quite large.
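To put rough numbers on that trade-off (my own arithmetic, not official figures): at 44100Hz a block of samples covers samples/44100 seconds of audio, so the per-event granularity ranges from about 11.6ms to about 185.8ms:

```actionscript
// Audio time covered by one event at the two allowed extremes:
var minBlockMs:Number = 512 / 44100 * 1000;  // roughly 11.6 ms of audio per event
var maxBlockMs:Number = 8192 / 44100 * 1000; // roughly 185.8 ms of audio per event
```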

I should note that this API will change slightly (mostly name changes) in the final release of the Flash Player; this beta represents an older build. I'll update this post with new code once the API is finalized.

Now some real code; some of you on the beta program have seen it. Here is how you play a continuous sine wave in Flash Player 10 with the smallest amount of code:

var sound:Sound = new Sound();
function sineWavGenerator(event:SampleDataEvent):void {
    for (var c:int = 0; c < 1234; c++) {
        var sample:Number = Math.sin(
            (Number(c + event.position) / Math.PI / 2)) * 0.25;
        event.data.writeFloat(sample);
        event.data.writeFloat(sample);
    }
}
sound.addEventListener("sampleData", sineWavGenerator);
sound.play();

That's it. That simple. You can't get any more low level or flexible than this. The sample above is kept simple; actual code would probably not call Math.sin() in the inner loop. You would rather prepare a ByteArray or Array outside and copy the data from there.
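For instance (this is my own variation, not an official sample), you could precompute a wavetable once and only copy from it inside the handler:

```actionscript
import flash.events.SampleDataEvent;
import flash.media.Sound;

// Precompute the waveform outside the handler; the table size is my choice.
const TABLE_SIZE:int = 8192;
var table:Vector.<Number> = new Vector.<Number>(TABLE_SIZE, true);
for (var i:int = 0; i < TABLE_SIZE; i++) {
    table[i] = Math.sin(i / Math.PI / 2) * 0.25;
}

var sound:Sound = new Sound();
function fromTable(event:SampleDataEvent):void {
    for (var c:int = 0; c < 8192; c++) {
        var idx:int = int((c + event.position) % TABLE_SIZE);
        var sample:Number = table[idx];
        event.data.writeFloat(sample); // left channel
        event.data.writeFloat(sample); // right channel
    }
}
sound.addEventListener("sampleData", fromTable);
sound.play();
```

Note that the table length here is not an exact multiple of the sine period, so a real synthesizer would pick a table size that wraps cleanly.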

The sound format is fixed at a sample rate of 44100Hz, 2 channels (stereo), using 32-bit floating point normalized samples. This is currently the highest quality format possible within the Flash Player. We will be targeting a more flexible system in a future Flash Player; it was not possible to offer different sample rates and more or fewer channels in this version. If you need to resample you can use either pure ActionScript 3 or even Adobe Pixel Bender.

The SampleDataEvent.position property which is passed in is the sample position, not the time, of the segment of audio being requested. You can convert this value to milliseconds by dividing it by 44.1.
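In code that is simply (my own one-liner):

```actionscript
import flash.events.SampleDataEvent;

function sampleData(event:SampleDataEvent):void {
    // 44100 samples per second = 44.1 samples per millisecond
    var elapsedMs:Number = event.position / 44.1;
    trace("requesting audio starting at", elapsedMs, "ms");
}
```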

Your event handler has to provide at least 512 samples each time it is dispatched, and at most 8192. If you provide fewer, the Flash Player assumes you have reached the end of the sound, plays the remaining samples and dispatches a SOUND_COMPLETE event. If you provide more than 8192, an exception occurs. In the sample above I use 1234 to make it clear that it can be any value between 512 and 8192.

The event is dispatched in real time. That means you can inject new audio data interactively. The key part to understand here is that we are never dealing with long stretches of sound data at any given time.

There is an internal buffer in the Flash Player of about 0.2 to 0.5 seconds, depending on the platform, which prevents drop outs. It will automatically be increased if drop outs occur. This internal buffer is the key to the high latency I was alluding to earlier. You should never depend on a certain latency with this API in your application. To enforce this, there is a slight random factor in the choice of buffer size when the Flash Player launches.

Continue to read Part 3 which talks about one more new Sound API in Flash Player 10.

Adobe Is Making Some Noise Part 1

If you have been following the community, there has been quite a stir around the lack of sound features in the Flash Player. Not only the lack of sound features, but the last two dot releases have introduced serious bugs in the sound support. This is unfortunate, and some of these are squarely my fault, but some of them are design choices for which there are no workarounds.

The big controversy has been around the SOUND_COMPLETE event, which has become less predictable. There is a change I had to make for Windows Vista to fix an audio-video synchronization issue. Windows Vista uses a new audio stack and supports the older but still viable WaveOut API through an emulation layer. This emulation layer does not behave exactly like previous iterations of Windows and exposed some bad assumptions the Flash Player made to drive audio-video synchronization and the famous SOUND_COMPLETE event.

To be clear: the accuracy of SOUND_COMPLETE will get even worse in Flash Player 10. There, I said it. Why? Because the world moves on, and the power consumption of devices is becoming more important than ever, even for desktops. One of the stumbling blocks to improving power consumption on Windows was the Flash Player, and we've gotten an earful from Microsoft and Intel because of it. The key part here was a Win32 API we used: timeBeginPeriod()/timeEndPeriod(). We were setting the period to 1ms so that events like SOUND_COMPLETE would be more accurate, dispatched at the right time. This worked from Win98 through WinXP but is essentially useless on Windows Vista, since the sound stack works in a different way as I mentioned above. So apart from being useless on Vista, using this API is deadly for power consumption and a showcase of bad engineering. We cannot continue to rely on it.

So what does that mean? You should never have relied on SOUND_COMPLETE in the first place to do sound processing. Easy for me to say, but Macromedia did miss putting a big warning in the documentation. Or maybe the real problem was that no one really paid attention to the sound stack for such a long time. I am trying to change this, unfortunately causing havoc for some clients and breaking backwards compatibility in the process.

Continue to read Part 2 to see how we will address the requests from the community in Flash Player 10.

Friday, May 02, 2008

Adobe Open Screen Project

Unless you have been living under a rock you probably saw the announcement Adobe made this week:


This is a significant move by Adobe to bring the Adobe® Flash® Player to a much broader audience than ever before. By doing this we are responding to requests from customers throughout different industries who want to leverage the Flash platform.

So, yes, I haven't been blogging in a while; that has been due to my busy schedule (I'll be more specific soon). But since I've worked on fixing up the FLV and SWF specs a little for this project, I figured I should add my own personal thoughts.

As the site above mentions this project essentially consists of the following changes:
  • Removing restrictions on use of the SWF and FLV/F4V specifications
  • Publishing the device porting layer APIs for Adobe Flash Player
  • Publishing the Adobe Flash® Cast™ protocol and the AMF protocol for robust data services
  • Removing licensing fees – making next major releases of Adobe Flash Player and Adobe AIR for devices free
All of these are very significant for a lot of customers, although individual parties might think that one particular item matters most to them. I assure you that all of them are extremely important.

Bringing a platform like the Adobe Flash Player to devices has its challenges. One of them revolves around documentation. By making the file format and API documentation open, the entry barrier for a lot of developers becomes much lower. In the past it could be a struggle to get the necessary documentation to the teams which needed it. This problem is now gone.

In many circumstances the technologies Adobe provided were not compatible with how a particular device or infrastructure was set up. Given the old licensing restrictions, it could prove difficult to find technical solutions. With the specifications open and unrestricted, a vendor might instead choose a clean rewrite of some parts if they see fit, with the result that it integrates better into their eco-system.

Now, there have been questions about whether there are any gotchas in what Adobe has announced concerning the SWF and FLV specifications. There IS in fact a license; it is on page two of the specifications. You might be shocked, however, by how small it is. It's essentially a BSD-style damages disclaimer plus some additional information about trademarks and such. Read it before you use it.

The specifications are available here as PDF files:

http://www.adobe.com/devnet/swf/
http://www.adobe.com/devnet/flv/

Does that mean you can implement your own clone of what is in these specifications without fear of Adobe coming after you? Yes. Notice that I am careful how I phrase this question, though. Adobe still holds all the trademarks around this technology; you cannot use any of them in your clone or anything related to it, unless we give the OK to do so, obviously.

If you find inaccuracies or missing information in these specifications, let us know; we can integrate the feedback right away so it can appear in the next revision of these documents.