Update (10/31/06): I still see tons of people reading this old entry. Please read this for up-to-date information. A beta for Linux is now available.
Now that Flash Player 8 is soon to be released for Windows and OS X, I am sure we'll get plenty of questions about support for Unix, probably mostly for Linux. No, we haven't forgotten that we need to support it. It WILL happen.
But to tell you the truth, I am not even happy with the state of Flash Player 7 on Linux. So much so that I use the Windows version along with Wine to get Flash content to display on my Linux boxen. So rather than just do a quick and dirty port for Flash Player 8, I would really like to get it done right. The kicker is this: It's damn hard. Much harder than even supporting OS X. While porting command line applications to Unix is beyond trivial, porting media applications (which the Flash Player essentially is) is a real nightmare. This starts with sound support, where we have to support many different sound standards (ALSA, OSS, aRTs, ESD etc.), and continues with framework support (X11, Qt, GTK, for copy & paste support for example), IMEs (is there such a thing?), font support, which is almost beyond comprehension, and many other quirks and forked 'standards'. And how about support for PowerPC and x86-64? All of this needs to be done to get something acceptable on Linux IMO.
Another problem is that Linux's main compiler is gcc, which means that ALL of our MMX code, which is written in Intel notation, will not compile. MMX is what makes the Flash Player reasonably fast. Having to use plain C code automatically means that rendering performance will be cut by at least 50%. I am really not kidding here. So why is it so difficult to port performance optimizations? Let me show an example of the effort required to support a plethora of platforms.
WARNING! I will go into some low level assembly code here, so you might want to skip this ;-)
Disclaimer: The code in this post was totally made up in a few minutes and is NOT in the Flash Player. I am not even sure any of this code does the right thing.
So, assume a simple generic compositing function which is used if the target platform is unknown:
typedef struct argb {
unsigned char alpha;
unsigned char red;
unsigned char green;
unsigned char blue;
} argb;
#ifdef GENERIC
void compositeARGB(argb *src, argb *dst, int n) {
for ( int i=0; i < n; i++ ) {
int a = 256 - src->alpha;
dst->alpha= ( (dst->alpha* a ) >> 8 ) + src->alpha;
dst->red = ( (dst->red * a ) >> 8 ) + src->red ;
dst->green= ( (dst->green* a ) >> 8 ) + src->green;
dst->blue = ( (dst->blue * a ) >> 8 ) + src->blue ;
src++;
dst++;
}
}
#endif // GENERIC
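To make the arithmetic concrete, here is a minimal sketch of the per-channel formula from the generic loop, pulled out into a helper. The name blend_channel is made up for illustration, and the assert encodes an assumption the post does not state: that the source is premultiplied by its alpha (each channel is no larger than the alpha value), which is what keeps the sum within one byte.

```c
#include <assert.h>

/* One channel of the generic blend: dst * (256 - alpha) / 256 + src.
 * blend_channel is a hypothetical helper, not Flash Player code. */
static unsigned char blend_channel(unsigned char dst, unsigned char src,
                                   unsigned char src_alpha)
{
    /* Assumption: premultiplied source, so src never exceeds its alpha.
     * Without this the sum below can exceed 255 and wrap around. */
    assert(src <= src_alpha);

    int a = 256 - src_alpha;   /* inverse alpha, in the range 1..256 */
    return (unsigned char)(((dst * a) >> 8) + src);
}
```

With a fully transparent source (alpha 0) the destination passes through unchanged, and with a fully opaque source (alpha 255) the destination contributes nothing.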
Now for 32-bit machines we would modify this function in the following way to get a drastic performance increase (50-60% faster, on PowerPC even more). Several things to notice:
- We use array access instead of doing a dst++ and src++ since some compilers generate better code.
- We do not access components on a byte per byte basis since that is extremely slow.
#ifdef TARGET_32BIT
void compositeARGB(argb *src, argb *dst, int n) {
for ( int i=0; i < n; i++ ) {
int a = 256 - src[i].alpha;
unsigned int dr = *((unsigned int *)&dst[i]);
unsigned int dl = ( dr >> 8 ) & 0x00FF00FF;
dr &= 0x00FF00FF;
unsigned int sr = *((unsigned int *)&src[i]);
unsigned int sl = sr & 0xFF00FF00;
sr &= 0x00FF00FF;
dr = ( ( dr * a ) >> 8 ) & 0x00FF00FF;
dl = ( dl * a ) & 0xFF00FF00;
dr += sr;
dl += sl;
*((unsigned int *)&dst[i]) = dr | dl;
}
}
#endif // TARGET_32BIT
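The trick in the 32-bit version is that two 8-bit channels sit in separate 16-bit lanes of one integer, so a single multiply scales both. Here is a minimal standalone sketch of that idea, checked against the byte-at-a-time formula. The names blend_byte and blend_packed are made up for illustration; the inverse alpha is passed in as a parameter so the sketch does not depend on which byte of the word holds alpha.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: one channel, truncated to 8 bits like the generic loop. */
static uint8_t blend_byte(uint8_t d, uint8_t s, int a)
{
    return (uint8_t)(((d * a) >> 8) + s);
}

/* Blend four packed channels, two per multiply.
 * a is the inverse alpha (256 - alpha), so it lies in 1..256. */
static uint32_t blend_packed(uint32_t d, uint32_t s, int a)
{
    assert(a >= 1 && a <= 256);

    uint32_t d_even = d & 0x00FF00FFu;          /* bytes 0 and 2 */
    uint32_t d_odd  = (d >> 8) & 0x00FF00FFu;   /* bytes 1 and 3 */

    /* 255 * 256 still fits in 16 bits, so the lanes never collide. */
    uint32_t even = ((d_even * (uint32_t)a) >> 8) & 0x00FF00FFu;
    uint32_t odd  = (d_odd * (uint32_t)a) & 0xFF00FF00u;

    /* Mask after the add so a carry wraps inside its own byte,
     * matching what the unsigned char version does. */
    even = (even + (s & 0x00FF00FFu)) & 0x00FF00FFu;
    odd  = (odd  + (s & 0xFF00FF00u)) & 0xFF00FF00u;

    return even | odd;
}
```

The final masks are an addition of this sketch; they only matter when the per-channel sum overflows a byte, which should not happen with premultiplied input.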
Let's continue with an MMX version specifically made to compile using Microsoft Visual Studio. Please note that we can't use intrinsics on x86 since they generate extremely slow code (also see the 3/17/2004 entry here, which explains the problems). This version is about 4 times faster than the original generic C code:
#ifdef TARGET_MMX_MSVC
void compositeARGB(argb *src, argb *dst, int n) {
const __int64 a256 = 0x0100010001000100; // 256 in each 16-bit lane
_asm {
mov edi, dst
mov esi, src
mov ecx, n
pxor mm7, mm7
movq mm6, a256
loop:
movd mm0, [esi]
punpcklbw mm0, mm7
movd mm1, [edi]
punpcklbw mm1, mm7
movq mm2, mm0
pshufw mm2, mm2, 0
movq mm3, mm6
psubusw mm3, mm2
pmullw mm1, mm3
psrlw mm1, 8
paddw mm1, mm0
packuswb mm1, mm7
movd [edi], mm1
add esi, 4
add edi, 4
sub ecx, 1
jnz loop
}
}
#endif // TARGET_MMX_MSVC
But this won't compile under Linux, which means we would have to revert to the second version, which is much slower. Under Linux we need to rewrite this using AT&T notation. So here we go:
#ifdef TARGET_MMX_GCC
void compositeARGB(unsigned int *src, unsigned int *dst, int n) {
unsigned int a256[2];
a256[0] = 0x01000100;
a256[1] = 0x01000100;
asm volatile ("pxor %%mm7,%%mm7\n\t"
"movq (%3),%%mm6\n\t"
"loop:\n\t"
"movd (%0),%%mm0\n\t"
"punpcklbw %%mm7,%%mm0\n\t"
"movd (%1),%%mm1\n\t"
"punpcklbw %%mm7,%%mm1\n\t"
"movq %%mm0,%%mm2\n\t"
"pshufw $0,%%mm2,%%mm2\n\t"
"movq %%mm6,%%mm3\n\t"
"psubusw %%mm2,%%mm3\n\t"
"pmullw %%mm3,%%mm1\n\t"
"psrlw $8,%%mm1\n\t"
"paddw %%mm0,%%mm1\n\t"
"packuswb %%mm7,%%mm1\n\t"
"movd %%mm1,(%1)\n\t"
"addl $4,%0\n\t"
"addl $4,%1\n\t"
"subl $1,%2\n\t"
"jnz loop\n\t"
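/* src, dst and n are modified by the loop, so they are read-write operands */
: "+r" (src), "+r" (dst), "+r" (n)
: "r" (a256)
: "mm0", "mm1", "mm2", "mm3", "mm6", "mm7", "memory", "cc"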
);
}
#endif // TARGET_MMX_GCC
What? All that work just to compile under GCC? This means we need to port thousands of lines of code to support AT&T notation (Yes, I know binutils 2.1 now has Intel notation support, although some changes are still required. In theory we could use the Intel compiler also, but if we want to ship this code as an SDK it will be a problem).
Next, let's look at what is needed to support PowerPC and AltiVec to get decent performance on that platform. Here is the AltiVec version of the compositing routine from above:
#ifdef TARGET_ALTIVEC
void compositeARGB(argb *src, argb *dst, int n) {
while (((long)dst & 0xF) && (n > 0))
{
int a = 256 - src->alpha;
dst->alpha= ( (dst->alpha* a ) >> 8 ) + src->alpha;
dst->red = ( (dst->red * a ) >> 8 ) + src->red ;
dst->green= ( (dst->green* a ) >> 8 ) + src->green;
dst->blue = ( (dst->blue * a ) >> 8 ) + src->blue ;
src++;
dst++;
n--;
}
vector unsigned char s0;
vector unsigned char s1 = vec_ld (0, (unsigned char *) src);
vector unsigned char s2;
const vector unsigned char aPr01 = (vector unsigned char)
(31, 0, 31, 0, 31, 0, 31, 0, 31, 4, 31, 4, 31, 4, 31, 4);
const vector unsigned char aPr23 = (vector unsigned char)
(31, 8, 31, 8, 31, 8, 31, 8, 31, 12, 31, 12, 31, 12, 31, 12);
const vector unsigned short v256 = (vector unsigned short) (256);
const vector unsigned char dPr = (vector unsigned char)
(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30);
const vector unsigned char sPr = vec_lvsl (0, (unsigned char *) src);
while ( n >= 4 ) {
s2 = vec_ld (16, (unsigned char *) src);
s0 = vec_perm (s1, s2, sPr);
s1 = s2;
vector unsigned char d0 = vec_ld (0, (unsigned char *) dst);
vector unsigned short a01 = (vector unsigned short)
vec_perm (s0, (vector unsigned char) (0), aPr01);
vector unsigned short a23 = (vector unsigned short)
vec_perm (s0, (vector unsigned char) (0), aPr23);
vector unsigned short s01 = (vector unsigned short)
vec_mergeh (s0, (vector unsigned char) (0));
vector unsigned short s23 = (vector unsigned short)
vec_mergel (s0, (vector unsigned char) (0));
vector unsigned short d01 = (vector unsigned short)
vec_mergeh ((vector unsigned char) (0), d0);
vector unsigned short d23 = (vector unsigned short)
vec_mergel ((vector unsigned char) (0), d0);
a01 = vec_sub (v256, a01);
a23 = vec_sub (v256, a23);
d01 = vec_mladd (d01, a01, s01);
d23 = vec_mladd (d23, a23, s23);
vec_st ((vector unsigned char)
vec_perm (d01, d23, dPr), 0, (unsigned char *) dst);
dst += 4;
src += 4;
n -= 4;
}
while (n--)
{
int a = 256 - src->alpha;
dst->alpha= ( (dst->alpha* a ) >> 8 ) + src->alpha;
dst->red = ( (dst->red * a ) >> 8 ) + src->red ;
dst->green= ( (dst->green* a ) >> 8 ) + src->green;
dst->blue = ( (dst->blue * a ) >> 8 ) + src->blue ;
src++;
dst++;
}
}
#endif // TARGET_ALTIVEC
If you can understand any of this, kudos. Writing AltiVec code is even worse than writing MMX code IMO. They had the best intentions when they designed AltiVec, but code bloat and maintenance are simply a nightmare.
But we are not done yet!!! Let's say we want to port Flash Player 8 to Microsoft Windows XP 64-bit edition (which is also on the TODO list for the Flash Player). This version of Windows does not support MMX, only SSE1/SSE2/SSE3 and above. That means we need to write yet another version specifically for this OS. Also, Microsoft Visual Studio does not support inline assembly here, which means we need to use intrinsics even though they generate slow code:
#ifdef TARGET_SSE2
void compositeARGB(argb *src, argb *dst, int n) {
__m128i zero = _mm_setzero_si128();
const unsigned int A256[4] = { 0x01000100, 0x01000100, 0x01000100, 0x01000100 };
__m128i a256 = *((__m128i *)A256);
for ( ; n > 1; n -= 2) {
__m128i m0 = _mm_unpacklo_epi8(*((__m128i *)src),zero);
__m128i m1 = _mm_unpacklo_epi8(*((__m128i *)dst),zero);
__m128i m2 = _mm_shufflehi_epi16(_mm_shufflelo_epi16(m0,0),0); // each pixel gets its own alpha
__m128i m3 = _mm_subs_epu16(a256,m2);
m1 = _mm_mullo_epi16(m1,m3);
m1 = _mm_srli_epi16(m1,8);
m1 = _mm_add_epi16(m1,m0);
m1 = _mm_packus_epi16(m1,zero); // unsigned saturation, channels are 0..255
*((__m128i *)dst) = m1;
src+=2;
dst+=2;
}
for ( ; n > 0 ; n-- ) {
int a = 256 - src->alpha;
dst->alpha= ( (dst->alpha* a ) >> 8 ) + src->alpha;
dst->red = ( (dst->red * a ) >> 8 ) + src->red ;
dst->green= ( (dst->green* a ) >> 8 ) + src->green;
dst->blue = ( (dst->blue * a ) >> 8 ) + src->blue ;
src++;
dst++;
}
}
#endif // TARGET_SSE2
(This is actually not quite complete and will crash hard, since the source and destination data pointers need to be aligned for this to work.)
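One way around the alignment requirement, sketched here as an assumption and not as what the Flash Player does, is to move only the 8 bytes of the two pixels with _mm_loadl_epi64 and _mm_storel_epi64, which have no alignment requirement. The names argb8, composite_scalar and composite_pairs are made up for illustration, and the sketch falls back to plain C when SSE2 is not available at compile time.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

typedef struct { uint8_t alpha, red, green, blue; } argb8; /* hypothetical */

/* Plain C reference, same arithmetic as the generic version. */
static void composite_scalar(const argb8 *src, argb8 *dst, int n)
{
    for (int i = 0; i < n; i++) {
        int a = 256 - src[i].alpha;
        dst[i].alpha = (uint8_t)(((dst[i].alpha * a) >> 8) + src[i].alpha);
        dst[i].red   = (uint8_t)(((dst[i].red   * a) >> 8) + src[i].red);
        dst[i].green = (uint8_t)(((dst[i].green * a) >> 8) + src[i].green);
        dst[i].blue  = (uint8_t)(((dst[i].blue  * a) >> 8) + src[i].blue);
    }
}

/* Two pixels per iteration, using 64-bit loads and stores only. */
static void composite_pairs(const argb8 *src, argb8 *dst, int n)
{
    assert(n >= 0);
#ifdef HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    const __m128i v256 = _mm_set1_epi16(256);
    for (; n > 1; n -= 2, src += 2, dst += 2) {
        /* Widen the 8 channel bytes of two pixels to 16-bit lanes. */
        __m128i s = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)src), zero);
        __m128i d = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)dst), zero);
        /* Broadcast each pixel's alpha across its own four lanes. */
        __m128i a = _mm_shufflehi_epi16(_mm_shufflelo_epi16(s, 0), 0);
        a = _mm_sub_epi16(v256, a);
        d = _mm_srli_epi16(_mm_mullo_epi16(d, a), 8);
        d = _mm_add_epi16(d, s);
        /* Store only the low 8 bytes so later pixels are not touched. */
        _mm_storel_epi64((__m128i *)dst, _mm_packus_epi16(d, zero));
    }
#endif
    composite_scalar(src, dst, n);   /* leftover pixel, or everything */
}
```

Note that the SSE2 path saturates on overflow while the C path wraps around, so the two only agree when the source is premultiplied by its alpha. Whether unaligned 64-bit accesses are fast enough is a separate question that would need measuring.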
Now imagine that the Flash Player has not only this one routine which needs to be optimized, but in fact dozens to hundreds of these and you see why getting a decent port is so complex.
If you bothered to read to this point it means that Macromedia wants you really bad. Really, really bad. :-) We've been looking for Linux gurus for a while (well, it has been for more than a year now without anyone being even close to what we need, I guess they all go to Google...) and for best results they would have to be in house. Collaboration with the team here on a daily basis as a long term full time employee will be key to get the best results. So apply right now for this job. Not only will you be able to do exciting work on Linux, you'll even be paid for it!