It's well known that StretchBlt is slow, especially in
halftone mode. What's less well known is that it can bring Windows to its
knees. You might find this incredible, but it's easy to prove. I made a test application
that repeatedly calls StretchBlt to resize a bitmap from 6000 x 4500 down to
200 x 150 in halftone mode. The source bitmap is constant, and both the source
and the destination are memory device contexts, i.e. the test does not repaint
any windows. Running this test severely degrades the performance of all other
applications, including the Task Manager, regardless of process priorities, and despite plenty of idle cores.
Applications repaint themselves belatedly or not at all, and video applications
drop their frame rates to nearly zero.

Why is this so? It turns out that every GDI operation
involves acquiring a system-wide lock, called the GDI lock. This strategy works
well enough provided GDI operations take very little time, which is usually the
case. However, as soon as one process hogs the lock by doing long GDI
operations, all other processes are screwed, because applications typically
spend most of their time updating their windows, and windows can only be
updated via GDI calls. The GDI lock is a system-wide bottleneck that potentially
reduces Windows to cooperative multitasking, which fails when one process
doesn't cooperate.

But why is a lock required at all when the application is
only resizing bitmaps in memory? How can this possibly affect other
applications? The issue is that all GDI objects, including device contexts,
pens, brushes and so on, are maintained at the system level, not per-process.
The purpose of the GDI lock is to protect GDI objects and attributes from being
corrupted by simultaneous modification from multiple threads. In other words,
even though my test application's source and destination bitmaps are in memory
and invisible to other processes, the StretchBlt has to be serialized anyway
because it potentially affects GDI state and GDI state is global.

Microsoft has known about this issue all along, though they
didn't publicize it for obvious reasons. They finally got around to doing
something about it in Windows 7. However, it seems they were unable to get rid
of the GDI lock altogether, so instead they substituted a large number of
finer-grained locks for the one monolithic GDI lock. In theory this might help,
but it remains to be seen whether it fixes the pathological case I'm
describing. The consensus seems to be that Windows 7 generally exhibits poor 2D
performance compared to XP, and my limited testing bears this out.

This issue has serious implications for FFRend. It means
that the Monitor bar potentially limits FFRend's overall throughput, because
the monitor window's StretchBlt can block the rendering thread from blitting to
the output window, particularly for large (HD or higher) frame sizes and smooth
(halftone) monitor quality. Apparently DirectDraw also has to acquire the GDI
lock, even if only Blt is called (as opposed to GetDC/ReleaseDC), because the
issue occurs even though the output window uses DirectDraw instead of GDI, and
even in full-screen exclusive mode. This last point is especially egregious.
The whole point of full-screen exclusive mode is that the system allows one
window to not cooperate with other windows on a given monitor, because the
other windows will be covered anyway. Incredibly, my StretchBlt test starves
FFRend even when FFRend is in full-screen exclusive mode and covering the test
application's window. This seems totally wrong to me.

The only solution I can see is to avoid StretchBlt, but this
means FFRend has to include its own bitmap resizing code. Why not roll my own
file system too while I'm at it? Bitmap resizing is no joke when quality and
performance are both goals. Bilinear interpolation is easy enough, but image
quality degrades as the size difference between the source and destination
bitmaps increases. Bicubic behaves better, but it's slow and complicated and
full of floating point. It would probably have to be implemented in SSE2
assembler to perform well enough. And of course it would be nice if the code
supported all the Freeframe bitmap formats, i.e. not only 32-bit but 24-bit and
5-6-5 too. Right.
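
To make the bilinear trade-off concrete, here is a minimal sketch of a bilinear resize for an 8-bit single-channel image. It is only an illustration under assumptions: the function name `resizeBilinear` and the single-channel layout are hypothetical, and it is not FFRend code or what StretchBlt does internally. The point to notice is that each destination pixel blends only the four nearest source pixels, no matter how large the reduction is.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of FFRend): bilinear resize of an 8-bit,
// single-channel image stored row by row in a vector.
std::vector<uint8_t> resizeBilinear(const std::vector<uint8_t>& src,
                                    int sw, int sh, int dw, int dh)
{
    assert(sw > 0 && sh > 0 && dw > 0 && dh > 0);
    assert(src.size() == static_cast<std::size_t>(sw) * sh);
    std::vector<uint8_t> dst(static_cast<std::size_t>(dw) * dh);
    const float xScale = static_cast<float>(sw) / dw;
    const float yScale = static_cast<float>(sh) / dh;
    for (int y = 0; y < dh; ++y) {
        // Map the destination pixel center back into source coordinates.
        const float fy = std::max(0.0f, (y + 0.5f) * yScale - 0.5f);
        const int y0 = std::min(static_cast<int>(fy), sh - 1);
        const int y1 = std::min(y0 + 1, sh - 1);
        const float wy = fy - y0;  // vertical blend weight
        for (int x = 0; x < dw; ++x) {
            const float fx = std::max(0.0f, (x + 0.5f) * xScale - 0.5f);
            const int x0 = std::min(static_cast<int>(fx), sw - 1);
            const int x1 = std::min(x0 + 1, sw - 1);
            const float wx = fx - x0;  // horizontal blend weight
            // Blend the two pixels on each row, then blend the two rows.
            const float top = src[static_cast<std::size_t>(y0) * sw + x0] * (1.0f - wx)
                            + src[static_cast<std::size_t>(y0) * sw + x1] * wx;
            const float bot = src[static_cast<std::size_t>(y1) * sw + x0] * (1.0f - wx)
                            + src[static_cast<std::size_t>(y1) * sw + x1] * wx;
            dst[static_cast<std::size_t>(y) * dw + x] =
                static_cast<uint8_t>(top * (1.0f - wy) + bot * wy + 0.5f);
        }
    }
    return dst;
}
```

Shrinking the 8 x 1 row {255, 255, 255, 0, 0, 255, 255, 255} to a single pixel with this sketch yields 0, because only the two middle pixels are sampled, even though the row is mostly white. That is the quality loss at large size differences in miniature.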

Another possibility would be to still use StretchBlt, but
pre-scale the frames before feeding them to StretchBlt, maybe only if the
difference between the frame size and the monitor window size exceeds a certain
threshold. Implementing a 2x2 averaging down-sample is a relatively simple
matter. Of course this plan only works if the frame size is divisible by two in
both axes (usually true), and the resulting image quality remains to be seen.
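
A minimal sketch of the pre-scaling idea follows, assuming 32-bit pixels with four 8-bit channels. The names `average4`, `halve2x2` and `prescalePasses`, and the `maxRatio` threshold parameter, are hypothetical illustrations rather than FFRend code. `halve2x2` averages each 2x2 block into one pixel, and `prescalePasses` is one possible policy for deciding how many halvings to apply before handing the frame to StretchBlt.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Average four 32-bit pixels channel by channel (8 bits each), with rounding.
static inline uint32_t average4(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        const uint32_t sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
                           + ((c >> shift) & 0xFF) + ((d >> shift) & 0xFF);
        out |= ((sum + 2) >> 2) << shift;
    }
    return out;
}

// Hypothetical helper: halve a 32-bit image in both axes by 2x2 averaging.
// Requires even width and height, as noted in the text.
std::vector<uint32_t> halve2x2(const std::vector<uint32_t>& src, int sw, int sh)
{
    assert(sw > 0 && sh > 0 && sw % 2 == 0 && sh % 2 == 0);
    assert(src.size() == static_cast<std::size_t>(sw) * sh);
    const int dw = sw / 2, dh = sh / 2;
    std::vector<uint32_t> dst(static_cast<std::size_t>(dw) * dh);
    for (int y = 0; y < dh; ++y) {
        const uint32_t* row0 = &src[static_cast<std::size_t>(2 * y) * sw];
        const uint32_t* row1 = row0 + sw;
        for (int x = 0; x < dw; ++x) {
            dst[static_cast<std::size_t>(y) * dw + x] =
                average4(row0[2 * x], row0[2 * x + 1], row1[2 * x], row1[2 * x + 1]);
        }
    }
    return dst;
}

// Hypothetical policy: count the halving passes to apply while the frame is
// still more than maxRatio times the monitor size in either axis, stopping
// early if a dimension becomes odd or would drop below the target size.
int prescalePasses(int sw, int sh, int dw, int dh, int maxRatio)
{
    int passes = 0;
    while (sw % 2 == 0 && sh % 2 == 0 &&
           sw / 2 >= dw && sh / 2 >= dh &&
           (sw > dw * maxRatio || sh > dh * maxRatio)) {
        sw /= 2;
        sh /= 2;
        ++passes;
    }
    return passes;
}
```

With a threshold of 4, the 6000 x 4500 to 200 x 150 case from my test gets two passes (down to 1500 x 1125, where the odd height stops further halving), leaving StretchBlt one sixteenth as many source pixels to process. Whether that is enough to keep the lock hold time tolerable would have to be measured.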

Microsoft's admission of the problem can be found in an obscure MSDN blog post, "Engineering Windows 7 Graphics Performance".