Thursday, February 01, 2007

The limits of software rendering

I have recently been following the progress of Papervision3D, the 3D engine for Flash. Here at Adobe we are always interested in applications which challenge the performance of the Flash Player. We spent some time profiling some of the demonstrations to see if there is anything we can do to speed them up. The short answer is: not really.

What is the problem? Well, there is a limit to what software can do on current CPUs. The limit can be squarely attributed to memory bandwidth and the size of the L1 and L2 caches. I thought I'd explain the point by showing some actual profiling data I collected with the demonstration mentioned above. You can reproduce the same results, including the actual assembly instructions, with any third-party profiler out there.

For testing I used a pretty beefy system, an AMD Athlon 64 X2 4800+. CodeAnalyst was my profiler of choice since this is what AMD provides for its CPUs. Intel's VTune does the job just as well if you have an Intel CPU.

Here are the top three loops which show up in the profiler and together take about 40% of overall CPU time:

The first loop does the affine bitmap blitting, meaning it handles rotation and scaling. This loop takes about 20-25% of overall time. Its defining characteristic is that memory access is almost always non-sequential, which means you get lots of L1 and L2 cache misses.
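
To make the access pattern concrete, here is a minimal sketch of an affine blit in C. It is not the Flash Player code: the `Image` type, the function name `affine_blit`, the 16.16 fixed-point coefficients and the nearest-neighbor sampling are all assumptions made for illustration. The point is only that the destination is written row by row while the source is read wherever the transform lands.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical image type: 32-bit pixels stored row by row. */
typedef struct {
    uint32_t *pixels;
    int width;
    int height;
} Image;

/* For every destination pixel (x, y), compute the source position
 *   u = a*x + b*y + tx,  v = c*x + d*y + ty
 * in 16.16 fixed point and copy the nearest source pixel.
 * Destination pixels that map outside the source are left untouched.
 * Assumes >> on negative values is an arithmetic shift, as on common compilers. */
void affine_blit(const Image *src, Image *dst,
                 int32_t a, int32_t b, int32_t c, int32_t d,
                 int32_t tx, int32_t ty)
{
    for (int y = 0; y < dst->height; y++) {
        /* Source position for the first pixel of this destination row. */
        int32_t u = b * y + tx;
        int32_t v = d * y + ty;
        uint32_t *out = dst->pixels + (size_t)y * dst->width;

        for (int x = 0; x < dst->width; x++) {
            int sx = u >> 16;
            int sy = v >> 16;
            if (sx >= 0 && sx < src->width && sy >= 0 && sy < src->height) {
                /* Sequential write, but the read address jumps around
                 * in memory as soon as the image is rotated. */
                out[x] = src->pixels[(size_t)sy * src->width + sx];
            }
            u += a;  /* step one destination pixel to the right */
            v += c;
        }
    }
}
```

With an identity transform (`a = d = 1 << 16`, everything else 0) the reads are sequential too. With a 90 degree rotation every read lands on a different source row, so for a large bitmap nearly every access touches a new cache line.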



The second loop does a simple copy from one buffer to another. Some will recognize this code; it was adapted from the AMD Athlon Processor x86 Code Optimization Guide. This loop takes about 10-15% of overall time.
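
For readers without the guide at hand, the job this loop performs looks roughly like the following in plain C. This is only a sketch with a hypothetical name, `copy_pixels`; the real loop is hand-tuned assembly and is not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy `count` 32-bit pixels from src to dst. The buffers must not overlap. */
void copy_pixels(uint32_t *dst, const uint32_t *src, size_t count)
{
    size_t i = 0;

    /* Main loop, unrolled by four to cut down on loop overhead. */
    for (; i + 4 <= count; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }

    /* Remaining zero to three pixels. */
    for (; i < count; i++) {
        dst[i] = src[i];
    }
}
```

There is no arithmetic in here at all. Every instruction is a load or a store, so the time this loop takes is essentially the time the memory system needs to move the data.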



The third loop does a blur for a radius which is a power of two. I have presented this code in a blog post before. This loop takes another 10-15% of overall time.
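
As a reminder of why a power of two is convenient, here is a minimal sketch of a one-dimensional box blur over a single 8-bit channel. It is not the code from the earlier post. The names `box_blur_row` and `clamp_index`, the clamped edges and the choice to make the window size (rather than the radius) `1 << shift` are assumptions for illustration. Because the window size is a power of two, the division by the window size becomes a shift.

```c
#include <stdint.h>

/* Clamp an index into the valid range [0, n - 1]. */
static int clamp_index(int i, int n)
{
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

/* Box blur of one row of n 8-bit samples (n >= 1).
 * The window holds size = 1 << shift samples and covers the indices
 * i - size/2 .. i + size - size/2 - 1, clamped at the edges. */
void box_blur_row(const uint8_t *in, uint8_t *out, int n, int shift)
{
    int size = 1 << shift;
    int half = size / 2;
    uint32_t sum = 0;

    /* Build the running sum for the first output sample. */
    for (int k = -half; k < size - half; k++) {
        sum += in[clamp_index(k, n)];
    }

    for (int i = 0; i < n; i++) {
        /* Average: dividing by a power of two is just a shift. */
        out[i] = (uint8_t)(sum >> shift);

        /* Slide the window: drop the leftmost sample, add the next one. */
        sum -= in[clamp_index(i - half, n)];
        sum += in[clamp_index(i + size - half, n)];
    }
}
```

The arithmetic per sample is one add, one subtract and one shift, regardless of the window size. What is left is reading and writing the pixels.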



So what can you determine from the above? Look at the CPU clock column in the profiler output. This is the total number of cycles spent on an individual instruction. Memory access, especially random access like in the affine bitmap rendering, is the most expensive operation. Everything else hardly matters at that point: without GPU hardware support there is pretty much nothing more we can do on modern CPUs when it comes to graphics performance.
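
If you do not have a profiler handy, a small experiment shows the same effect. The function below is a hypothetical sketch, not taken from the Flash Player: it adds up every element of an array exactly once, but visits the elements with a configurable stride.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum all n elements of data, visiting them in strided order.
 * stride must be at least 1. With stride == 1 the walk is sequential;
 * with a larger stride consecutive reads are stride * 4 bytes apart. */
uint64_t sum_with_stride(const uint32_t *data, size_t n, size_t stride)
{
    uint64_t sum = 0;

    for (size_t start = 0; start < stride; start++) {
        for (size_t i = start; i < n; i += stride) {
            sum += data[i];
        }
    }
    return sum;
}
```

The result and the number of additions are identical for every stride. Run it over a buffer much larger than the L2 cache, time a stride of 1 against a stride of a few thousand elements, and any difference you measure comes from memory access alone.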

Obviously there are other areas we can improve. The other 60% of overall time may have another 10% overall performance bump in store if we fundamentally change some algorithms and fix bottlenecks in Tamarin. In this example, though, ActionScript execution really does not account for much; it is really all related to graphics code. In the end the conclusion remains: it is memory access time which kills us.

What is even more surprising for some is that even multi-core systems will not help dramatically. In most cases the memory bus and even the L2 cache are shared among the cores.
