Thursday, February 01, 2007

The limits of software rendering

I have been following the progress of Papervision3D, the 3D engine for Flash, recently. Here at Adobe we are always interested in applications which challenge the Flash Player performance-wise. We spent some time profiling some demonstrations to see if there is anything we can do to speed things up. The short answer is: not really.

What is the problem? Well, there is a limit to what software can do on current CPUs. The limit can be squarely attributed to memory bandwidth and the size of the L1 and L2 caches. I thought I'd explain the point by showing some actual profiling data I collected with the demonstration mentioned above. You can reproduce the same results, including seeing the actual assembly instructions, with any third-party profiler out there.

For testing I used a pretty beefy system, an AMD Athlon 64 X2 4800+. CodeAnalyst was my profiler of choice since this is what AMD provides for their CPUs. Intel's VTune does the job just as well if you have an Intel CPU.

Here are the top three loops which show up in the profiler and together take about 40% of overall CPU time:

The first loop does the affine bitmap blitting, meaning it handles rotation and scaling. This loop takes about 20-25% of overall time. Its characteristic is that memory access is almost always non-sequential, meaning you get lots of L1 and L2 cache misses.
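
As a rough illustration of why such a loop is hard on the caches, here is a minimal C sketch of an affine blit inner loop. It is not the Flash Player code; the function name `affine_blit_row`, the 16.16 fixed-point format, and the wrap-around addressing are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch, not the Flash Player implementation.
 * Fills one destination scanline by stepping through the source bitmap
 * along an arbitrary direction (du, dv) in 16.16 fixed point.
 * Nearest-neighbour sampling; coordinates wrap around the source. */
void affine_blit_row(uint32_t *dst, int count,
                     const uint32_t *src, int src_w, int src_h,
                     int32_t u, int32_t v,    /* start position, 16.16 */
                     int32_t du, int32_t dv)  /* step per dest pixel, 16.16 */
{
    assert(src_w > 0 && src_h > 0);
    for (int i = 0; i < count; i++) {
        int sx = (u >> 16) % src_w;
        int sy = (v >> 16) % src_h;
        if (sx < 0) sx += src_w;
        if (sy < 0) sy += src_h;
        /* The source address jumps by a whole row whenever sy changes,
         * which is where the cache misses come from. */
        dst[i] = src[sy * src_w + sx];
        u += du;
        v += dv;
    }
}
```

The destination is written sequentially, but as soon as `dv` is non-zero the source reads jump between rows, so consecutive reads rarely share a cache line.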



The second loop does a simple copy from one buffer to another. Some will recognize this code; it was adapted from the AMD Athlon Processor x86 Code Optimization Guide. This loop takes about 10-15% of overall time.
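
For contrast with the affine loop, a plain buffer-to-buffer copy looks roughly like the C sketch below. This is not the optimized routine from the AMD guide, just the simplest equivalent; the name `copy_rect` and the stride parameters are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy a rectangle of 32-bit pixels between buffers.
 * Strides are given in pixels. Each row is read and written sequentially,
 * so the loop is limited by memory bandwidth rather than by cache misses. */
void copy_rect(uint32_t *dst, size_t dst_stride,
               const uint32_t *src, size_t src_stride,
               size_t width, size_t height)
{
    assert(width <= dst_stride && width <= src_stride);
    for (size_t y = 0; y < height; y++) {
        memcpy(dst + y * dst_stride, src + y * src_stride,
               width * sizeof(uint32_t));
    }
}
```

Even with perfectly sequential access, every pixel still has to cross the memory bus twice, once on the read and once on the write.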



The third loop does a blur for a radius which is a power of two. I have presented this code in a blog post before. This loop takes another 10-15% of overall time.
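
The reason a power of two helps is that the division by the window size turns into a shift. The C sketch below shows the idea for one row of 8-bit samples. It is a simplified illustration, not the code from the earlier post; the name `box_blur_row_pow2`, the edge clamping, and treating the window size (rather than the radius) as the power of two are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: horizontal box blur of one row of 8-bit samples.
 * The window holds 2^shift samples, so the average is a right shift
 * instead of a division. Samples outside the row are clamped to the edge. */
void box_blur_row_pow2(uint8_t *dst, const uint8_t *src, int n, int shift)
{
    assert(n > 0 && shift >= 0 && shift < 16);
    int window = 1 << shift;
    int left = window / 2;            /* samples taken left of the centre */
    uint32_t sum = 0;

    /* Prime the running sum with the window positioned for pixel 0. */
    for (int k = -left; k < window - left; k++) {
        int idx = k < 0 ? 0 : (k >= n ? n - 1 : k);
        sum += src[idx];
    }
    for (int i = 0; i < n; i++) {
        dst[i] = (uint8_t)(sum >> shift);
        /* Slide the window: add the next sample, drop the leftmost one. */
        int out = i - left;
        int in = i - left + window;
        out = out < 0 ? 0 : (out >= n ? n - 1 : out);
        in = in < 0 ? 0 : (in >= n ? n - 1 : in);
        sum += src[in];
        sum -= src[out];
    }
}
```

With the running sum, each output pixel costs one add, one subtract, and one shift regardless of the window size.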



So what can you determine from the above? Look at the CPU clock column. This is the total number of cycles spent on an individual instruction. Memory access, especially random access like in the affine bitmap rendering, is the most expensive operation. At that point nothing else really matters: without GPU hardware support there is pretty much nothing more we can do on modern CPUs when it comes to graphics performance.
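
To put a number on the difference between sequential and strided access, the following C sketch counts how many cache lines a walk through a 32-bit bitmap touches. It is only a counting model, not a benchmark; the function name `cache_lines_touched` and the 64-byte line size used in the note below are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: count the cache lines touched by a walk through a
 * bitmap of 32-bit pixels. The walk starts at pixel 0 and advances by
 * step_pixels each time; consecutive accesses that fall into the same
 * line are counted once. */
size_t cache_lines_touched(size_t step_pixels, size_t count, size_t line_bytes)
{
    assert(line_bytes >= sizeof(uint32_t));
    size_t touched = 0;
    size_t last_line = (size_t)-1;
    for (size_t i = 0; i < count; i++) {
        size_t line = (i * step_pixels * sizeof(uint32_t)) / line_bytes;
        if (line != last_line) {
            touched++;
            last_line = line;
        }
    }
    return touched;
}
```

For a bitmap 1024 pixels wide and 64-byte lines, reading 1024 pixels along a row (`step_pixels = 1`) touches 64 lines, while reading 1024 pixels down a column (`step_pixels = 1024`) touches 1024 lines. A rotated blit sits somewhere between those two extremes.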

Obviously there are other areas we can improve. The other 60% of overall time has maybe another 10% overall performance bump in store if we fundamentally change some algorithms and fix bottlenecks in Tamarin, although in this example ActionScript execution really does not account for much; it is really all related to graphics code. In the end the conclusion remains: it is memory access time which kills us.

What is even more surprising for some is that even multi-core systems will not help dramatically. In most cases the memory bus and even the L2 cache are shared among the cores.

36 Comments:

Blogger tomsamson said...

As always, interesting insights there :)
Though yup, what does that mean for the future of the Flash Player?
I guess you're growing bored of always hearing that, but I'd so love it if we could get OpenGL rendering support on all desktop platforms with one of the next Flash Player versions, since it seems like there's really not much left to do on the software side.

Thursday, February 01, 2007 9:26:00 PM  
Anonymous Joa Ebert said...

I have two questions regarding this. And first I have to say that I make use of the "there are no dumb questions" law :o)

What is the main reason that I can walk through a big BitmapData in less time with languages like C than in Flash? (With a JIT there should not be such a big difference, should there?) I guess with C I have the same memory access problems.

Another question is related to this case. Why do I get such big differences in the execution time?

Without any code ~200 ms, with the void ~2 secs, with the doNothing() ~11 secs.

public function Main()
{
    var ms: int = getTimer();
    for ( var i: int = 0; i < 0xffffff; i++ )
    {
        //
        //void;
        //doNothing();
    }

    trace( getTimer() - ms );
}

private function doNothing(): void {}

Thank you for your interesting posts. I hope you have some time to explain this as well.

Saturday, February 03, 2007 7:49:00 AM  
Anonymous Anonymous said...

Hi Tinic,
A big Thank You from the PaperVision3D community for taking the trouble - and having the expertise - to pinpoint our bottlenecks.

I fear however, that you did not select the most representative sample to analyse - Ricardo's feedback cube uses previous renderer's textures for subsequent iterations and these are likely to be larger than our average bitmaps - hence the cache thrashing.

From your code snippets it's unclear how Flash's affine blitter works with regard to what in the 3D world is called mip-mapping. Mip-maps are a series of precomputed copies of the bitmap, each half the dimensions of the preceding one. This isn't currently functional in Papervision3D, but the idea is to select the copy of the source bitmap that is of similar size to its representation in the output bitmap. Thus the oversampling that is used in the high quality blitter (hopefully) only has to touch the adjacent pixels in the source map. Thus, provided the source map isn't too wide, cache hits are more likely. And it looks a whole lot better too.

If, as is quite likely in the demonstration you analysed, the source bitmap is so large that the source pixels sampled for adjacent destination pixels are maybe 10 pixels apart, cache misses will prevail. And the intervening 9 pixels will have no effect on the output, so it will tend to look grainy, with pixels dancing around as the 3D object moves and the lucky 1 selected pixel in every 10 changes.
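
The mip-mapping idea described above can be sketched in C as follows. This is only an illustration under stated assumptions: 8-bit grayscale bitmaps with even dimensions, and the hypothetical names `downsample_2x2` and `select_mip_level`; it is neither Papervision3D nor Flash Player code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: build the next mip level of an 8-bit grayscale
 * bitmap by averaging each 2x2 block. Width and height must be even. */
void downsample_2x2(uint8_t *dst, const uint8_t *src, int src_w, int src_h)
{
    assert(src_w > 0 && src_h > 0 && src_w % 2 == 0 && src_h % 2 == 0);
    int dst_w = src_w / 2;
    for (int y = 0; y < src_h / 2; y++) {
        for (int x = 0; x < dst_w; x++) {
            const uint8_t *p = src + (2 * y) * src_w + 2 * x;
            int sum = p[0] + p[1] + p[src_w] + p[src_w + 1];
            dst[y * dst_w + x] = (uint8_t)((sum + 2) >> 2); /* rounded mean */
        }
    }
}

/* Pick the mip level whose size best matches the on-screen size.
 * Level 0 is the full bitmap; each further level halves both dimensions.
 * src_per_dst is how many source pixels map onto one destination pixel. */
int select_mip_level(int src_per_dst, int max_level)
{
    assert(src_per_dst >= 1 && max_level >= 0);
    int level = 0;
    while (level < max_level && (src_per_dst >> (level + 1)) >= 1)
        level++;
    return level;
}
```

For the 10-pixels-apart case, `select_mip_level(10, 8)` picks level 3, the copy at one eighth of the original dimensions, so neighbouring destination pixels read neighbouring source pixels again.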

Is this what you are using the power of 2 radius blurring to overcome? Or was that an effect of some ActionScript code that the demonstration author included?

Speaking of which, is there a higher performance branch for power of two bitmaps in the affine blitter?

I'm sure that with appropriately sized (and shaped) source bitmaps, the cache misses will be reduced. This then opens up the option of multi-core affine blitting. I'd imagine the simplest approach for this would be to separate the task by having a thread per CPU working on a separate raster line in the destination map. This would avoid most of the synchronisation headaches, as the raster lines are totally separated, and would be reasonably cache friendly too.

Don't give up on software rendering just yet - PaperVision3D will adapt to the strengths and weaknesses of the Flash Player - we just need to know what they are :)

jakelewis aty blueyonder doty co doty uk

Saturday, February 03, 2007 10:03:00 AM  
Blogger andrew said...

I would have to echo tomsamson's response. Where does that leave us? What do we have to look forward to in the future of the Flash Player in terms of graphics performance? Is OpenGL or DirectX rendering feasible? Is this something that the Flash development team is even considering?

I hate to think that we're reaching the limits here for the next few years.

Saturday, February 03, 2007 11:48:00 AM  
Anonymous Anonymous said...

Director supports DirectX. I don't think native GPU support in Flash is a good choice. Why not support embedding Director into Flash if the Shockwave plugin is available?

Wednesday, February 14, 2007 2:16:00 PM  
Blogger tomsamson said...

Nay, that's not a good idea even if it were technically possible.
Director has a bloated plugin which isn't nearly as widespread as Flash.
Also, using DirectX is no good regarding cross-platform compatibility (it would be nice as an additional option, not as the only one :-) ). Besides, we want performance in Flash, not to use Director with all its drawbacks ;-)
(Though I'm curious what the Director update will bring, right now it doesn't look good for/with Director.)
Can you tell me why it's not a good idea to have GPU support in Flash?
Sure, one has to watch out to keep the plugin size as low as possible, but besides that there are few other reasons speaking against GPU support, no?
(And even regarding that point, with today's web connectivity getting faster and faster, a 1.5 MB plugin wouldn't be a huge problem either, for example.)
It's not like some years ago, when having a 3D card was something special and any file bigger than 1 MB was critical for most users.
Personally I'm excited to play around with PV3D and I'm sure it'll be great for some things, but yeah, no matter how good it is, it can't replace the speed enhancement of GPU rendering, of course.

Friday, February 16, 2007 12:27:00 PM  
Anonymous Rostislav Siryk said...

Tinic, if I understood you right, copying large amounts of data in and out of memory is the most expensive task for Flash Player in terms of CPU load, and screen bitmap handling in Flash Player is such an expensive operation?

If so, the subject of your post has nothing specifically to do with Papervision3D. As I understand it, ordinary 2D rendering operations (for example, changing the alpha of a large bitmap inside the Flash Player) would be exactly (or just a bit less) as expensive as drawing a 3D face with a large bitmap, right?

If so, this is a good answer to the many who say "Forget about PV3D, Flash Player 3D apps are slow": this slowness has nothing to do with 3D. It is a limitation of the speed of redrawing large bitmaps inside the Flash Player, and 3D apps are potentially no slower than regular 2D ones with lots of graphics.

Am I right?

Thursday, March 15, 2007 10:50:00 AM  
Anonymous Helio Miranda said...

I wonder why, after several years of the graphics acceleration concept being around, the Flash Player did not take advantage of the hardware evolution. Of course there is a lot yet to do on the software side! The player will have to specialize in graphics optimizations a lot more across the different platforms. This shouldn't harm the cross-platform concept.
It's important to remember that Adobe isn't running alone in the "animation for the web" business anymore, and this feature (graphics performance and resources) is still a weak point in the Flash platform.
This is surely the terrain where Flash's competitor(s) will attack.

Monday, April 09, 2007 12:52:00 PM  
Anonymous lerc said...

As much as I would like to see 3D accelerated flash, there are significant obstacles to be overcome.

Despite the power of 3D acceleration these days, some areas of development are still fairly new. 3D has advanced well as a primary focus app, but that is a totally different thing from the niche that Flash fills.

Flash is frequently used in multiple places on the same screen. Flash is supposed to be lightweight. I'm not sure how many 3D implementations are up to that.

Here is something that would be an interesting test for resource consumption.
Make a plugin that pretends to be a Flash plugin. All this plugin would do, however, is open up an OpenGL context in the area that the plugin occupies.
Measure the resource usage of that plugin. My suspicion is that it would be scarily high.

Wednesday, May 02, 2007 8:27:00 PM  
Anonymous steve said...

Part of the reason why Papervision is slow is because it has to tessellate so much to avoid affine transform distortions.

If Flash could include optimised routines for perspective distortion, that would improve a lot of things quite dramatically.
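
The size of the distortion that tessellation has to hide can be worked out with a small C sketch. The function names `affine_u` and `perspective_u` and the use of view-space depth `z` are assumptions made for the example; nothing here is taken from Papervision3D or Flash Player.

```c
#include <assert.h>

/* Affine (screen-space linear) interpolation of a texture coordinate. */
double affine_u(double u0, double u1, double t)
{
    return u0 + (u1 - u0) * t;
}

/* Perspective-correct interpolation: u/z and 1/z are linear in screen
 * space, so interpolate those and divide. z0 and z1 are view-space depths. */
double perspective_u(double u0, double z0, double u1, double z1, double t)
{
    assert(z0 > 0.0 && z1 > 0.0);
    double u_over_z = (u0 / z0) * (1.0 - t) + (u1 / z1) * t;
    double one_over_z = (1.0 / z0) * (1.0 - t) + (1.0 / z1) * t;
    return u_over_z / one_over_z;
}
```

For an edge running from depth 1 to depth 3 with `u` going from 0 to 1, the screen-space midpoint gets `u = 0.5` from the affine formula but `u = 0.25` from the perspective-correct one. Splitting the face into smaller triangles shrinks the depth range per triangle and with it this error, at the cost of many more triangles to draw.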

Thursday, May 03, 2007 11:05:00 AM  
Anonymous Franl said...

Have you ever had a look at libraries like AmanithVG or OpenVG? They claim to accelerate vector graphics using hardware 3D power, which sounds like a very interesting opportunity for Flash.

Thursday, May 03, 2007 4:07:00 PM  
Anonymous Brian Sexton said...

I wish we could have even 2D acceleration in Flash; 3D acceleration would be great too, but even 2D Flash content is often slow on systems that can pump out fully textured 3D at higher resolutions and higher frame rates, so it would be nice if Adobe would take advantage of the tremendous installed base of GPUs in the world in any way and every way possible to improve Flash performance.

Wednesday, May 09, 2007 1:52:00 AM  
Anonymous Dave K said...

Interesting that you mention opening up an OpenGL context. You could get this functionality by using the Processing project, as it has OpenGL capabilities. Note, if you haven't heard of it: it is a Java-based project, in beta at the moment.

Correct me if I'm wrong, but I recall reading a while ago that Mac and Linux use OpenGL for rendering while Windows uses DirectX. Is there a reason not to use OpenGL across all platforms and simply (I realize this may not be simple) create an API that allows the user to actually use OpenGL to do 3D?

Friday, May 18, 2007 1:57:00 PM  
Anonymous Anonymous said...

Papervision spends its time emulating perspective distortion using affine distortion.

If you could add perspective distortion to Flash9, then Papervision wouldn't have to break textures up into thousands of triangles to avoid distortion errors.

Thursday, May 24, 2007 7:46:00 AM  
Blogger Andrew Simmons said...

Why can't Flash use the GPU? Why not use OpenGL and build 3D functionality into Flash? Sooner or later it has to happen. I hope to see it in Flash Player 10.

Friday, May 25, 2007 9:56:00 AM  
Blogger John said...

Y'know, a lot of the things that would challenge the performance of Flash Player aren't getting written, because Flash has made certain choices that make those performance issues fundamentally too expensive.

I'll give the relatively common example of integer promotion behavior during arithmetic. Many standard simple containers require exotic implementation strategies just to get efficient iterators, because of all the promotion and retruncation that goes on when you want to do simple arithmetic. Implementing a priority queue in AS3 would be a nightmare.

Now, I recognize that the choice was made to maintain backwards compatibility with previous versions of ActionScript and to support certain curious aspects of the numeric standard in ECMA. However, experience shows a simple solution path: create a few non-standard types, probably fixed size, and give them (and only them) the efficient behavior. That way your standards and your backwards compatibility are maintained for int, and you can start taking efficient implementation strategies for (say) int32, int64, uint32 and uint64.

I do not want to pay for an up-step to double and a truncation every time I need integer division. If you're really that worried about ActionScript performance, start giving your developers the fundamental tools they need to write high performance applications.

This only starts mattering once you stop thinking of Flash as an animation platform and start thinking of it as an application platform. There's more to life than graphics.

Friday, May 25, 2007 10:41:00 AM  
Blogger John Connors said...

Blimey, that was sobering reading. I was hoping that some of the newer parallelizable polygon rendering algorithms based on half-space functions would enable software renderers to get more of an edge.

Once again we all get stuck waiting for the bus. Nothing really changes does it?

Wednesday, May 30, 2007 2:39:00 AM  
Anonymous Brad Borne said...

Well then, looks like you guys need to do some hardware acceleration, or at least open up that external thread. It's not Flash's performance that's horrible, it's the Flash plugin. But that OpenGL in OS X must be helping somewhat: the World 2 demo runs much better in OS X than in Windows on the same Mac Pro. Not only that, but all my games ran better in the plugin back in the days of Flash 7, save for bitmap caching.

It's really sad knowing that I'm going to release World 2 and not one single computer on the internet will be able to play it full speed, unless they download it and play it in the standalone player.

I keep hearing that the Fancy Pants Adventure changed people's perceptions of what can be done in Flash, that's got to be some good publicity for Adobe. Can't you guys do SOMETHING to improve the state of Flash gaming that isn't code related? Please?

Saturday, June 02, 2007 2:57:00 PM  
Anonymous Anonymous said...

Hey Tinic, can you explain to these yahoos why OpenGL is not feasible? We're talking about a system that draws graphics based on z-ordered triangle surfaces, and blitted texture mapping straight off the video card memory.

To get Flash to actually take advantage of OpenGL acceleration, the plugin would require a "middle-ware" layer that translates "Flash stuff" to something comprehensible to OpenGL. I argue that this middle layer would make things just as slow (if not slower) than it is already.

Tuesday, June 12, 2007 9:27:00 PM  
Blogger Nick said...

It all comes down to user expectations. Regardless of how we view the pros and cons of software vs. hardware graphics performance, the public just WANT IT - smoother, faster, better. If we cannot get a significant improvement over the current performance with bitmap rendering / tearing via just software, go hardware ... or risk losing the user base now that Flash has some real competition with Silverlight.

Monday, July 02, 2007 1:11:00 AM  
Blogger Pascal said...

The assembly code could be re-ordered a lot, especially the bitmap rotation stuff. Going fixed-point instead of SSE could help too, like what demomakers did back in the good ol' 486 days. A fast rotozoomer was quite a mandatory exercise at that time, just like non-perspective-correct texture mapping.

Thursday, July 05, 2007 5:13:00 AM  
Anonymous Anonymous said...

"I argue that this middle layer would make things just as slow (if not slower) than it is already."
Of course it won't, unless you don't have hardware acceleration on your computer. In which case it should fall back to using software rendering or notifying the user to upgrade/install drivers.

Flash Player is basically a translator from AS3 to native code anyway, and all that is required is some new APIs within Flash and an additional bit of translation to use a different (massively faster) native library for drawing.

Of course, it won't be a magic bullet, 3D programming will still require dedication and patience, but at least the option to create breathtaking 3D browser apps will be there.

Since what you can do is already hardware-limited because software rendering requires a serious processor to get any kind of frame rate, it won't be much of a leap to expect people to have an opengl gfx card to use your 3d game/editor/virtual world.

Monday, July 16, 2007 6:19:00 AM  
Blogger debreuil said...

"Of course it won't, unless you don't have hardware acceleration..."

I wouldn't be so sure of that. The gpu is very different conceptually, all triangles (no curves) and single pixel filters. Swf is really optimized for rendering pixels in sequence. I suspect it could probably be as fast or faster, but it certainly isn't a no brainer...

Also, that was a really good point above about Flash being used in multiple places in a browser page, which makes the GPU an even worse fit. I would first lobby ATI and NVidia to add hardware support for 2D and curves at least : ). Maybe the new geometry shaders can help, but why should 2D settle for being a 2nD class citizen!

Wednesday, July 18, 2007 11:37:00 PM  
Anonymous Anonymous said...

Yes, OpenGL for Flash would be good. Java has been developing JOGL with some reasonable success, and they claim their native coding overhead is only a few lines of code - the rest being compiled in the VM.

Carl

Friday, July 20, 2007 9:18:00 PM  
Anonymous Anonymous said...

btw, I may have a good workaround.
You don't need OpenGL to do the rendering. Just do the most basic OpenGL setup, and test if shaders are supported. Then just do amap (as many as possible) calculations on the GPU. It is fast and can free up your CPU, plus it has some good features which will make it faster.
This way not much has to be changed and lots of stuff can be hardware accelerated.
I'm not checking this site for replies, so please mail me the reply to this. nyad55@gmail.com
(I just stumbled here)

Tuesday, August 07, 2007 10:31:00 AM  
Anonymous Antonio Vargas said...

Please take a look at some of the standard references on fast affine mapping for software rendering, especially the part about tiled textures for cache friendly texture mapping :)

--> http://www.multi.fi/~mbc/personal_sources.html
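
The tiled texture layout mentioned above can be illustrated with a short C sketch. This is a generic version of the technique, not code from the linked page; the function name `tiled_index` and the row-by-row ordering of tiles are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: index of pixel (x, y) in a texture stored as
 * square tiles of tile x tile pixels, with tiles laid out row by row.
 * width must be a multiple of tile. */
size_t tiled_index(size_t x, size_t y, size_t width, size_t tile)
{
    assert(tile > 0 && width % tile == 0 && x < width);
    size_t tiles_per_row = width / tile;
    size_t tile_x = x / tile, tile_y = y / tile;
    size_t in_x = x % tile, in_y = y % tile;
    return (tile_y * tiles_per_row + tile_x) * tile * tile
         + in_y * tile + in_x;
}
```

With 4x4 tiles of 32-bit pixels, one tile is 64 bytes, so pixels (0,0) and (0,1) end up 16 bytes apart instead of a full row apart. A rotated walk through the texture then stays within the same few cache lines for longer.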

Thursday, August 30, 2007 10:05:00 AM  
Anonymous mr.doob said...

http://www.mrdoob.com/lab/PV3D/acid_cubes.html

doesn't work anymore, the proper link is:

http://www.mrdoob.com/lab/pv3d/acid_cubes.html

Sorry about that. Oh, and with the latest version of Flash Player, it renders a bit differently.

Saturday, September 01, 2007 6:56:00 AM  
Blogger Makc said...

Hello, and sorry if this is irrelevant, but I couldn't figure out your email.

I would like to ask you to spare some time and answer this question:
what is exact logic of end-of-scanline check in flash player for
programmatic fills?

The problem is that when we divide an edge by some vertex, we end up
with slightly different shape drawn (with lineTo-s), and in our
situation this turns out to have significant visual impact. We would
like to adjust intermediate vertices to match internal behavior of
flash player, and for that reason we need to know this particular bit.

Thanks in advance.

P.S.: btw, did you know that specifying 32-bit colors in solid color fills
crashes IE/9.0.45 and breaks all scripts in FF/9.0.47? Unless this is
a known issue, please pass this information on to your QA.

Monday, December 03, 2007 5:11:00 AM  
Blogger Purplemint said...

Why doesn't Flash Player support multithreading?

Wednesday, January 30, 2008 6:59:00 AM  
Blogger Sunny Tambi said...

Hi Tinic,
Nice blog. I have a silly question for you:
can you tell me what the system requirements are for running healthy graphics?

Thursday, June 05, 2008 7:51:00 AM  
