It's well known that StretchBlt is slow, especially in halftone mode. What's less well known is that it can bring Windows to its knees. You might find this incredible, but it's easy to prove. I made a test application that repeatedly calls StretchBlt to resize a bitmap from 6000 x 4500 down to 200 x 150 in halftone mode. The source bitmap is constant, and both the source and the destination are memory device contexts, i.e. the test does not repaint any windows. Running this test severely degrades the performance of all other applications, including the Task Manager, regardless of process priorities, and despite plenty of idle cores. Applications repaint themselves belatedly or not at all, and video applications drop their frame rates to nearly zero.
Why is this so? It turns out that every GDI operation involves acquiring a system-wide lock, called the GDI lock. This strategy works well enough provided GDI operations take very little time, which is usually the case. However, as soon as one process hogs the lock by doing long GDI operations, all other processes are screwed, because applications typically spend most of their time updating their windows, and windows can only be updated via GDI calls. The GDI lock is a system-wide bottleneck that potentially reduces Windows to cooperative multitasking, which fails when one process doesn't cooperate.
But why is a lock required at all when the application is only resizing bitmaps in memory? How can this possibly affect other applications? The issue is that all GDI objects, including device contexts, pens, brushes and so on, are maintained at the system level, not per-process. The purpose of the GDI lock is to protect GDI objects and attributes from being corrupted by simultaneous modification from multiple threads. In other words, even though my test application's source and destination bitmaps are in memory and invisible to other processes, the StretchBlt has to be serialized anyway, because it potentially affects GDI state, and GDI state is global.
Microsoft has known about this issue all along, though they didn't publicize it for obvious reasons. They finally got around to doing something about it in Windows 7; however, it seems they were unable to get rid of the GDI lock altogether, so instead they replaced the one monolithic GDI lock with a large number of finer-grained locks. In theory this might help, but it remains to be seen whether it fixes the pathological case I'm describing. The consensus seems to be that Windows 7 generally exhibits poor 2D performance compared to XP, and my limited testing bears this out.
This issue has serious implications for FFRend. It means that the Monitor bar potentially limits FFRend's overall throughput, because the monitor window's StretchBlt can block the rendering thread from blitting to the output window, particularly for large (HD or higher) frame sizes and smooth (halftone) monitor quality. Apparently DirectDraw also has to acquire the GDI lock, even if only Blt is called (as opposed to GetDC/ReleaseDC), because the issue occurs even though the output window uses DirectDraw instead of GDI, and even in full-screen exclusive mode. This last point is especially egregious. The whole point of full-screen exclusive mode is that the system allows one window to not cooperate with other windows on a given monitor, because the other windows will be covered anyway. Incredibly, my StretchBlt test starves FFRend even when FFRend is in full-screen exclusive mode and covering the test application's window. This seems totally wrong to me.
The only solution I can see is to avoid StretchBlt, but this means FFRend has to include its own bitmap resizing code. Why not roll my own file system too while I'm at it? Bitmap resizing is no joke when quality and performance are both goals. Bilinear interpolation is easy enough but image quality degrades as the size difference between the source and destination bitmaps increases. Bicubic behaves better but it's slow and complicated and full of floating point. It would probably have to be implemented in SSE2 assembler to perform well enough. And of course it would be nice if the code supported all the Freeframe bitmap formats, i.e. not only 32-bit but 24-bit and 5-6-5 too. Right.
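To make the bilinear trade-off concrete, here is a minimal sketch of a bilinear resize for 32-bit frames. It is not FFRend code: the function name ResizeBilinear32, the tightly packed 4-bytes-per-pixel layout, and the use of double-precision math are all assumptions made for illustration, and a production version would want fixed-point or SIMD arithmetic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of FFRend): fetch one channel of one pixel
// from a tightly packed 32-bit image (4 bytes per pixel, no row padding).
static inline double Channel(const std::vector<std::uint8_t>& img,
                             int width, int x, int y, int c)
{
    return img[(static_cast<std::size_t>(y) * width + x) * 4 + c];
}

// Hypothetical helper: bilinear resize of a tightly packed 32-bit image.
// Each destination pixel blends only the 2x2 source pixels nearest to its
// mapped centre, no matter how large the reduction is.
std::vector<std::uint8_t> ResizeBilinear32(const std::vector<std::uint8_t>& src,
                                           int srcW, int srcH,
                                           int dstW, int dstH)
{
    assert(srcW > 0 && srcH > 0 && dstW > 0 && dstH > 0);
    assert(src.size() == static_cast<std::size_t>(srcW) * srcH * 4);
    std::vector<std::uint8_t> dst(static_cast<std::size_t>(dstW) * dstH * 4);
    const double scaleX = static_cast<double>(srcW) / dstW;
    const double scaleY = static_cast<double>(srcH) / dstH;
    for (int y = 0; y < dstH; ++y) {
        // Map the destination pixel centre into source space and clamp.
        const double sy = std::min(std::max((y + 0.5) * scaleY - 0.5, 0.0),
                                   srcH - 1.0);
        const int y0 = static_cast<int>(sy);        // sy >= 0, so this floors
        const int y1 = std::min(y0 + 1, srcH - 1);
        const double fy = sy - y0;
        for (int x = 0; x < dstW; ++x) {
            const double sx = std::min(std::max((x + 0.5) * scaleX - 0.5, 0.0),
                                       srcW - 1.0);
            const int x0 = static_cast<int>(sx);
            const int x1 = std::min(x0 + 1, srcW - 1);
            const double fx = sx - x0;
            for (int c = 0; c < 4; ++c) {
                const double top = Channel(src, srcW, x0, y0, c) * (1.0 - fx)
                                 + Channel(src, srcW, x1, y0, c) * fx;
                const double bot = Channel(src, srcW, x0, y1, c) * (1.0 - fx)
                                 + Channel(src, srcW, x1, y1, c) * fx;
                const double v = top * (1.0 - fy) + bot * fy;
                dst[(static_cast<std::size_t>(y) * dstW + x) * 4 + c] =
                    static_cast<std::uint8_t>(v + 0.5);  // round to nearest
            }
        }
    }
    return dst;
}
```

The quality problem falls straight out of the inner loop: only four source pixels contribute to each destination pixel. Shrink an 8 x 1 strip to 1 x 1 and the result is the average of the two middle pixels, with the other six ignored entirely. At a reduction like 6000 x 4500 to 200 x 150, nearly all of the source data is skipped, which is why plain bilinear aliases so badly at large size differences.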
Another possibility would be to still use StretchBlt, but pre-scale the frames before feeding them to StretchBlt, maybe only if the difference between the frame size and the monitor window size exceeds a certain threshold. Implementing a 2x2 averaging down-sample is a relatively simple matter. Of course this plan only works if the frame size is divisible by two in both axes (usually true), and the resulting image quality remains to be seen.
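Here is a rough sketch of what that pre-scaling step might look like for 32-bit frames. Nothing here is actual FFRend code: Downsample2x2, PrescalePasses, and the maxRatio threshold are hypothetical names, and the sketch assumes tightly packed rows (4 bytes per pixel, no padding). The 24-bit and 5-6-5 formats would need their own variants.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: halve a tightly packed 32-bit image in both axes by
// averaging each 2x2 block of source pixels, one channel at a time.
// Both dimensions must be even.
std::vector<std::uint8_t> Downsample2x2(const std::vector<std::uint8_t>& src,
                                        int srcW, int srcH)
{
    assert(srcW > 0 && srcH > 0);
    assert(srcW % 2 == 0 && srcH % 2 == 0);
    assert(src.size() == static_cast<std::size_t>(srcW) * srcH * 4);
    const int dstW = srcW / 2;
    const int dstH = srcH / 2;
    const std::size_t srcStride = static_cast<std::size_t>(srcW) * 4;
    std::vector<std::uint8_t> dst(static_cast<std::size_t>(dstW) * dstH * 4);
    for (int y = 0; y < dstH; ++y) {
        const std::uint8_t* row0 = &src[static_cast<std::size_t>(2 * y) * srcStride];
        const std::uint8_t* row1 = row0 + srcStride;
        std::uint8_t* out = &dst[static_cast<std::size_t>(y) * dstW * 4];
        for (int x = 0; x < dstW; ++x) {
            for (int c = 0; c < 4; ++c) {
                // Sum the four neighbours, then divide with rounding.
                const unsigned sum = row0[8 * x + c] + row0[8 * x + 4 + c]
                                   + row1[8 * x + c] + row1[8 * x + 4 + c];
                out[4 * x + c] = static_cast<std::uint8_t>((sum + 2) / 4);
            }
        }
    }
    return dst;
}

// Hypothetical policy: how many halvings to apply before StretchBlt.
// Keeps halving while both dimensions are even and the frame is still more
// than maxRatio times the monitor window in both axes. With maxRatio >= 2
// the pre-scaled frame never ends up smaller than the monitor window.
int PrescalePasses(int frameW, int frameH, int monitorW, int monitorH,
                   int maxRatio)
{
    assert(maxRatio >= 2);
    int passes = 0;
    while (frameW % 2 == 0 && frameH % 2 == 0 &&
           frameW > monitorW * maxRatio && frameH > monitorH * maxRatio) {
        frameW /= 2;
        frameH /= 2;
        ++passes;
    }
    return passes;
}
```

For the 6000 x 4500 test bitmap and a 200 x 150 monitor window with maxRatio set to 4, PrescalePasses returns 2: two halvings give 1500 x 1125, and the odd height stops it there. StretchBlt would then only have to reduce by 7.5x instead of 30x. Whether that shortens the time spent holding the GDI lock enough to matter is something only measurement can answer.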
Microsoft's admission of the problem can be found in an obscure MSDN blog post: Engineering Windows 7 Graphics Performance.