Sunday, September 10, 2006

gcc challenges

Dealing with compilers can be quite a challenge sometimes. Recently Mike stumbled over a pretty bad crash on one of his older machines. It looked innocent enough. The only way to enable SSE1/SSE2 intrinsics in gcc is to pass '-msse -msse2'. With these options gcc does not use SSE for floating point math in general, but it does emit cvttss2si instructions for ordinary float-to-integer conversions. Since his machine uses an Athlon Model 4, an illegal instruction error killed the browser; SSE1 only arrived with the Athlon XP and Pentium III generations. This behavior makes it impossible to safely do runtime detection of SSE1/SSE2 when using gcc unless you split the code for the various architectures (x86, MMX, SSE1, SSE2 and SSE3) into separate files and customize the compile options for each, which is also the hack suggested in the gcc man pages:

"These options will enable GCC to use these extended instructions in generated code, even without -mfpmath=sse. Applications which perform runtime CPU detection must compile separate files for each supported architecture, using the appropriate flags. In particular, the file containing the CPU detection code should be compiled without these options."
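
A minimal sketch of the dispatch pattern the man page describes, squeezed into one listing for readability. Everything named here (scale_fn, scale_generic, scale_sse, cpu_has_sse, pick_scale) is hypothetical and not taken from the Flash sources. In a real build the SSE routine would live in its own file compiled with '-msse', while the detection and baseline code would be compiled without it; the __SSE__ guard below only stands in for that split.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel signature, used only for this sketch. */
typedef void (*scale_fn)(float *data, size_t n, float k);

/* Baseline path: belongs in a file compiled WITHOUT -msse/-msse2. */
void scale_generic(float *data, size_t n, float k) {
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        data[i] *= k;
}

#if defined(__SSE__)
#include <xmmintrin.h>
#define HAVE_SCALE_SSE 1
/* SSE path: stands in for a separate file compiled WITH -msse. */
void scale_sse(float *data, size_t n, float k) {
    __m128 vk = _mm_set1_ps(k);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(data + i, _mm_mul_ps(_mm_loadu_ps(data + i), vk));
    for (; i < n; i++)          /* leftover elements */
        data[i] *= k;
}
#endif

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
/* CPUID leaf 1 reports SSE in EDX bit 25. Must be built without SSE flags. */
int cpu_has_sse(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((edx >> 25) & 1u);
}
#else
int cpu_has_sse(void) { return 0; }   /* assumption: no SSE off x86 */
#endif

/* Chooses an implementation once, at run time. */
scale_fn pick_scale(void) {
#ifdef HAVE_SCALE_SSE
    if (cpu_has_sse())
        return scale_sse;
#endif
    return scale_generic;
}
```

A caller would fetch the pointer once (scale_fn f = pick_scale();) and then call f(buffer, count, factor), so no SSE instruction is reached on a CPU that lacks it.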

There is also an old bug about this which was essentially rejected. Why is there no option to simply turn on intrinsics? This would make things so much easier for developers and guarantee better portability of source code.

I see two practical options right now: either we totally disable all SSE1/SSE2 optimizations, which will severely cripple performance (yes, some of the rendering code will see a 30-60% slowdown, which will affect most users), or we use the Intel compiler to compile these files. Using ICC would also allow us to compile some of the inline assembly for which I have not created intrinsics yet. The real challenge for me now is to modify the autoconf scripts to use two compilers, something autoconf does not seem to support by default. Google and the various forums I looked at are not of much help on this subject either.

Why not use separate files as suggested by the gcc man pages? Well, it would add another few weeks of refactoring and stabilizing code to make this happen. I fear that I will be tasked with this eventually though.

We also discovered recently that '-O2' can generate bad code and had to switch to '-O1' for the time being. It triggered badness in our support math routines for the JIT and, funnily enough, in some rather simple C code (our MMX code works great here):

for ( ; n ; n-- ) {
    uint32 srcP = src[0];
    uint32 src0 = ((srcP&0x0000FF00)<< 8)|
                  ((srcP&0x000000FF)<< 0);
    uint32 src1 = ((srcP&0xFF000000)>> 8)|
                  ((srcP&0x00FF0000)>>16);

    uint32 dst0 = dst[0];
    uint32 dst1 = dst[1];

    dst0 = ((dst0*(srcP>>24))>>8) + src0;
    dst1 = ((dst1*(srcP>>24))>>8) + src1;

    dst[0] = dst0;
    dst[1] = dst1;

    src+=1;
    dst+=2;
}

This is not reproducible in a standalone test app; I've tried it. Why would gcc make it easy on us anyway? :-) We are trying to figure out which optimization option is the culprit so we can disable it. Also, the same code compiles fine with Apple's gcc build 5431, as opposed to gcc 4.0.3, which I am using on Ubuntu right now. Even if it were to fail in a future revision of Apple's gcc, this code would never be triggered there, as SSE1 and SSE2 are always guaranteed to be available on MacIntel machines.
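
For anyone who wants to poke at this, a standalone harness could look roughly like the sketch below. It assumes that uint32 is an unsigned 32-bit integer and that each source pixel expands into two destination words; the wrapper name blend_span and the sample values in the note that follows are made up for illustration. Comparing the output of an '-O1' build against an '-O2' build is one way to check whether the optimizer changes the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t uint32;   /* assumption: uint32 is an unsigned 32-bit type */

/* Hypothetical wrapper around the loop from the post. */
void blend_span(uint32 *dst, const uint32 *src, size_t n) {
    assert(n == 0 || (dst != NULL && src != NULL));
    for ( ; n; n--) {
        uint32 srcP = src[0];
        /* spread the four 8-bit channels over two words, 16 bits apart */
        uint32 src0 = ((srcP & 0x0000FF00) <<  8) |
                      ((srcP & 0x000000FF) <<  0);
        uint32 src1 = ((srcP & 0xFF000000) >>  8) |
                      ((srcP & 0x00FF0000) >> 16);

        uint32 dst0 = dst[0];
        uint32 dst1 = dst[1];

        /* scale the destination by the top byte of the source, then add */
        dst0 = ((dst0 * (srcP >> 24)) >> 8) + src0;
        dst1 = ((dst1 * (srcP >> 24)) >> 8) + src1;

        dst[0] = dst0;
        dst[1] = dst1;

        src += 1;
        dst += 2;
    }
}
```

With a source pixel of 0x80402010 and both destination words set to 0x00FF00FF, working the arithmetic by hand gives 0x009F808F and 0x00FF80BF, which makes a handy reference value when comparing optimization levels.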

10 Comments:

Blogger ZephyrXero said...

Why not just release two versions? One for i386 without MMX or SSE, and one for i686 that would have all the extensions enabled for modern processors?

Wednesday, September 13, 2006 1:17:00 PM  
Anonymous Anonymous said...

That looks like a good idea. Btw, aren't systems without SSE/SSE2 too slow for flash content anyway?

Friday, September 15, 2006 3:58:00 AM  
Anonymous Samuel Agesilas said...

Anonymous, one of the issues with platforms like Flash is that you have to guarantee backwards compatibility, so it's encouraging that Adobe is thinking about the older/legacy architectures. This, however, is a fascinating problem. My question: what measures would be put in place to make room for further advances in processor-specific architectures and at the same time prevent this from happening? Is it even possible?

Monday, September 18, 2006 3:56:00 PM  
Anonymous Anonymous said...

"That looks like a good idea. Btw, aren't systems without SSE/SSE2 too slow for flash content anyway?"

And where'd you get this assumption? Flash has been overly popular since the days of 233MHz Pentium-MMXs. I have a 420MHz AMD K6-2 box, and Flash content works just fine on it. There are some speed issues on some movies that use some weird 3D crap, but that may be that I don't have a 3D card (ATi was too busy making their original AIW be a TV integration unit than a 3D card at the time.)

Sunday, September 24, 2006 2:57:00 PM  
Anonymous Phil K said...

What is the purpose of using inline ASM? I'm not a terribly experienced coder, but generally speaking the performance gain is negligible, and properly written C can achieve similar figures, with the added benefit of being more portable. Won't using inline ASM prevent the simple repackaging of Flash for different architectures?

Monday, September 25, 2006 1:42:00 PM  
Anonymous Josep said...

Why not ship both binaries in the same installation package, and let the installer detect which is more appropriate to install?
To do this, the install script could read the processor properties from /proc/cpuinfo.

I think that shipping both binaries in the same installer package is possible considering the low download size of the flash plugin.
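
A rough sketch of that check, written in C rather than as a shell script. It assumes a Linux x86 /proc/cpuinfo whose feature list sits on a line starting with "flags"; the function names line_has_flag and cpuinfo_has_flag are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if `flag` occurs in `line` as a whole word, else 0. */
int line_has_flag(const char *line, const char *flag) {
    assert(line != NULL && flag != NULL);
    size_t len = strlen(flag);
    if (len == 0)
        return 0;
    const char *p = line;
    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        char end = p[len];
        int ends = (end == '\0' || end == ' ' || end == '\t' || end == '\n');
        if (starts && ends)
            return 1;
        p += 1;   /* keep scanning, e.g. past the "sse" inside "ssse3" */
    }
    return 0;
}

/* Returns 1 if the flag is listed, 0 if not, -1 if the file cannot be read.
 * Assumption: the flags line fits in the 4096-byte buffer. */
int cpuinfo_has_flag(const char *flag) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (f == NULL)
        return -1;
    char line[4096];
    int found = 0;
    while (!found && fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "flags", 5) == 0)
            found = line_has_flag(line, flag);
    }
    fclose(f);
    return found;
}
```

An installer could then pick the SSE build only when cpuinfo_has_flag("sse2") returns 1 and fall back to the plain build for 0 or -1.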

Wednesday, September 27, 2006 3:51:00 AM  
Anonymous Anonymous said...

Well, I got this assumption from the words about a huge performance drop. I was wrong. Anyway, the best idea would be to have several different packages with various instructions enabled.

Friday, September 29, 2006 10:36:00 AM  
Anonymous Anonymous said...

As a (former) gcc developer I'd just like to comment on why gcc doesn't have an "intrinsics only" option for SSE code gen... The basic problem is that, if you have the intrinsics enabled, the SSE2 registers have to be available too (otherwise there would be no way to load data for the intrinsics' use). However, there is no way to make registers available/unavailable on a finer granularity than entire files.
If the registers are available, there is no way to guarantee that the register allocator doesn't decide to use them for some other code sequence (like the cvttss2si operations you noticed - yes, the decision to use them comes from the register allocator).

Well, why don't we fix that, you're probably wondering. The trouble is that the register allocator is more than 30,000 lines of incomprehensible spaghetti code (and worse, spaghetti data structures) and everyone is afraid to mess with it. In the past ten years, there have been multiple attempts to replace it; all have failed.

The good news is that the LTO project (due in the gcc 4.4-4.5 timeframe, i.e. one or two years out -- hopefully they won't have another unnecessary major version number bump over it) should at least narrow the granularity for this sort of thing to per-function.
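
For readers arriving later: GCC releases newer than the one discussed here accept a per-function target attribute (it arrived in the 4.4 series, to the best of my knowledge), and Clang accepts it too. The sketch below assumes such a compiler on x86, including support for __builtin_cpu_supports, and uses made-up function names. It is not what the Flash code did.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain C fallback, safe on any CPU. */
void add_u32_generic(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t n) {
    assert(n == 0 || (dst != NULL && a != NULL && b != NULL));
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <emmintrin.h>
#define HAVE_ADD_U32_SSE2 1
/* Only this function may use SSE2; the rest of the file keeps the
 * baseline instruction set given on the command line. */
__attribute__((target("sse2")))
void add_u32_sse2(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
    }
    for (; i < n; i++)          /* leftover elements */
        dst[i] = a[i] + b[i];
}
#endif

/* Run-time dispatch between the two versions. */
void add_u32(uint32_t *dst, const uint32_t *a, const uint32_t *b, size_t n) {
#ifdef HAVE_ADD_U32_SSE2
    if (__builtin_cpu_supports("sse2")) {
        add_u32_sse2(dst, a, b, n);
        return;
    }
#endif
    add_u32_generic(dst, a, b, n);
}
```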

Thursday, October 26, 2006 5:23:00 PM  
Anonymous Міша said...

-O2 is dangerous, because it includes "strict-aliasing". Disabling the latter explicitly should improve life for you... Here are the default compiler options on FreeBSD, for example:

-O2 -fno-strict-aliasing -pipe

Even better, turn on the -Wstrict-aliasing warning and fix all places in the code, where the warning is triggered...
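
To make the aliasing point concrete, here is a small sketch of the kind of fix that warning usually leads to. It is a generic illustration with a made-up function name, not code from the post, and it assumes float is a 32-bit IEEE 754 type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The pattern the warning targets reads a float through an
 * incompatible pointer type:
 *
 *     uint32_t bits = *(uint32_t *)&f;
 *
 * With strict aliasing enabled the optimizer may assume such accesses
 * never overlap, so the result is not guaranteed. */

/* Well-defined alternative: copy the bytes instead of casting. */
uint32_t float_bits(float f) {
    uint32_t bits;
    assert(sizeof bits == sizeof f);   /* assumption: 32-bit float */
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Compilers typically turn a fixed-size memcpy like this into a single register move, so the safe version should not cost anything in practice.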

Wednesday, June 13, 2007 2:21:00 PM  
Anonymous ShopMan said...

I like articles like this. Thanks!

Saturday, August 25, 2007 11:47:00 PM  
