Tuesday, June 05, 2012

StretchBlt can hang Windows by hogging GDI lock

It's well-known that StretchBlt is slow, especially in halftone mode. What's less well-known is that it can bring Windows to its knees. You might find this incredible, but it's easy to prove. I made a test application that repeatedly calls StretchBlt to resize a bitmap from 6000 x 4500 down to 200 x 150 in halftone mode. The source bitmap is constant, and both the source and the destination are memory device contexts, i.e. the test does not repaint any windows. Running this test severely degrades the performance of all other applications, including the Task Manager, regardless of process priorities, and despite plenty of idle cores. Applications repaint themselves belatedly or not at all, and video applications drop their frame rates to nearly zero.

Why is this so? It turns out that every GDI operation involves acquiring a system-wide lock, called the GDI lock. This strategy works well enough provided GDI operations take very little time, which is usually the case. However as soon as one process hogs the lock by doing long GDI operations, all other processes are screwed, because applications typically spend most of their time updating their windows, and windows can only be updated via GDI calls. The GDI lock is a system-wide bottleneck that potentially reduces Windows to cooperative multitasking, which fails when one process doesn't cooperate.

But why is a lock required at all when the application is only resizing bitmaps in memory? How can this possibly affect other applications? The issue is that all GDI objects including device contexts, pens, brushes and so on are maintained at the system level, not per-process. The purpose of the GDI lock is to protect GDI objects and attributes from being corrupted by simultaneous modification from multiple threads. In other words, even though my test application's source and destination bitmaps are in memory and invisible to other processes, the StretchBlt has to be serialized anyway because it potentially affects GDI state and GDI state is global.

Microsoft has known about this issue all along, though they didn't publicize it for obvious reasons. They finally got around to doing something about it in Windows 7. However, it seems they were unable to get rid of the GDI lock altogether, so instead they substituted a large number of finer-grained locks for the one monolithic GDI lock. In theory this might help, but it remains to be seen whether it fixes the pathological case I'm describing. The consensus seems to be that Windows 7 generally exhibits poor 2D performance compared to XP, and my limited testing bears this out.

This issue has serious implications for FFRend. It means that the Monitor bar potentially limits FFRend's overall throughput, because the monitor window's StretchBlt can block the rendering thread from blitting to the output window, particularly for large (HD or higher) frame sizes and smooth (halftone) monitor quality. Apparently DirectDraw also has to acquire the GDI lock, even if only Blt is called (as opposed to GetDC/ReleaseDC), because the issue occurs even though the output window uses DirectDraw instead of GDI, and even in full-screen exclusive mode. This last point is especially egregious. The whole point of full-screen exclusive mode is that the system allows one window to not cooperate with other windows on a given monitor, because the other windows will be covered anyway. Incredibly, my StretchBlt test starves FFRend even when FFRend is in full-screen exclusive mode and covering the test application's window. This seems totally wrong to me.

The only solution I can see is to avoid StretchBlt, but this means FFRend has to include its own bitmap resizing code. Why not roll my own file system too while I'm at it? Bitmap resizing is no joke when quality and performance are both goals. Bilinear interpolation is easy enough but image quality degrades as the size difference between the source and destination bitmaps increases. Bicubic behaves better but it's slow and complicated and full of floating point. It would probably have to be implemented in SSE2 assembler to perform well enough. And of course it would be nice if the code supported all the Freeframe bitmap formats, i.e. not only 32-bit but 24-bit and 5-6-5 too. Right.

Another possibility would be to still use StretchBlt, but pre-scale the frames before feeding them to StretchBlt, maybe only if the difference between the frame size and the monitor window size exceeds a certain threshold. Implementing a 2x2 averaging down-sample is a relatively simple matter. Of course this plan only works if the frame size is divisible by two in both axes (usually true), and the resulting image quality remains to be seen.
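
For illustration, here is a minimal sketch of a 2x2 averaging down-sample. It is not FFRend's code: it assumes 32-bit pixels, tightly packed rows and even dimensions, and the name Downsample2x2 is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Halve a 32-bit (4 bytes per pixel) image in both axes by averaging each
// 2x2 block of source pixels, channel by channel.
// Assumptions: rows are tightly packed (stride == width * 4) and both
// dimensions are even.
std::vector<uint8_t> Downsample2x2(const std::vector<uint8_t>& src,
                                   size_t width, size_t height)
{
    const size_t kBpp = 4;  // bytes per pixel (assumed 32-bit format)
    assert(width % 2 == 0 && height % 2 == 0);
    assert(src.size() == width * height * kBpp);
    const size_t dstWidth = width / 2, dstHeight = height / 2;
    std::vector<uint8_t> dst(dstWidth * dstHeight * kBpp);
    for (size_t y = 0; y < dstHeight; y++) {
        const uint8_t* row0 = src.data() + (2 * y) * width * kBpp;  // upper source row
        const uint8_t* row1 = row0 + width * kBpp;                  // lower source row
        uint8_t* out = dst.data() + y * dstWidth * kBpp;
        for (size_t x = 0; x < dstWidth; x++) {
            for (size_t c = 0; c < kBpp; c++) {
                const size_t i = 2 * x * kBpp + c;  // this channel in the left pixel
                const unsigned sum = row0[i] + row0[i + kBpp]
                                   + row1[i] + row1[i + kBpp];
                out[x * kBpp + c] = static_cast<uint8_t>((sum + 2) / 4);  // rounded mean
            }
        }
    }
    return dst;
}
```

Each call halves both dimensions, so it could be applied repeatedly until the remaining size difference is small enough to hand to StretchBlt.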

The test source code is available here.

Microsoft's admission of the problem can be found in an obscure MSDN blog post: Engineering Windows 7 Graphics Performance.

Thursday, May 24, 2012

Should render thread run at a higher priority than plugin threads?

Hypothesis: In FFRend's parallel-processing frame pipeline, the render thread should run at a higher priority than plugin threads. The reason is that the pipeline uses a "pull" model, and the renderer is the puller. So if the frame rate timer is signaled and there's a frame in the renderer's queue, no good can come from deferring the render in favor of creating additional frame backlog further up the pipeline.

In practice this would only matter in the case where the engine is loaded heavily enough for there to be competition over CPU cores, but not so heavily that the renderer is unable to keep up with the frame rate timer.

Boosting the render thread's priority wouldn't improve throughput, because the engine is either keeping up with the frame rate or not, and by definition we're only interested in the case where it's keeping up. What it might do is reduce latency and jitter.

If in fact the renderer sometimes remains blocked even though its frame rate timer is signaled and there's a queued frame ready to be rendered, in that instance the frame is displayed after its due time. Thus instead of the interval between frames being constant or nearly so, frames would vary between being bunched together and being spread out in time, even though on average the system is keeping up. This is the essence of jitter.

Jitter could be a more serious problem in DirectDraw Exclusive mode, because in this case the renderer is synchronizing to the vertical retrace, not just to a timer. The render thread is basically sitting in a polling loop somewhere in DirectX or the graphics driver, repeatedly asking the hardware if the retrace has started, and burning CPU the whole while. Not pretty but that's how it works. The question is, can the render thread be preempted by a plugin at that moment? If so we'll almost certainly miss the start of the vertical retrace, in which case DirectDraw will force the render thread to wait until the next one, i.e. the render will be delayed by an entire frame. It's possible that the vertical retrace wait is privileged code and therefore can't be preempted by an ordinary application thread. I sure wish I had better documentation. DirectX is shrouded in mystery.

But in the Cooperative (windowed) mode case the problem is more straightforward. It's easy enough to characterize the current jitter, by sampling the CPU's performance counter in the renderer, storing the samples in a memory buffer, and calculating their deviation afterwards. If there is significant jitter, and boosting the renderer's priority lessens it appreciably, it's a win.
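
The "calculate their deviation afterwards" step could be sketched roughly as follows. This is only an illustration: the names JitterStats and CalcJitter are hypothetical, and the counter frequency is passed in as a parameter (on Windows it would presumably come from QueryPerformanceFrequency).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct JitterStats {
    double meanMs;    // average interval between rendered frames
    double stdDevMs;  // standard deviation of the intervals, i.e. the jitter
};

// ticks: raw counter samples, one per rendered frame, in the order taken.
// ticksPerSecond: frequency of the counter that produced the samples.
JitterStats CalcJitter(const std::vector<int64_t>& ticks, double ticksPerSecond)
{
    assert(ticks.size() >= 2 && ticksPerSecond > 0);
    const size_t n = ticks.size() - 1;  // number of frame-to-frame intervals
    std::vector<double> intervalMs(n);
    double sum = 0;
    for (size_t i = 0; i < n; i++) {
        intervalMs[i] = (ticks[i + 1] - ticks[i]) * 1000.0 / ticksPerSecond;
        sum += intervalMs[i];
    }
    const double mean = sum / n;
    double sumSq = 0;
    for (size_t i = 0; i < n; i++) {
        const double d = intervalMs[i] - mean;
        sumSq += d * d;
    }
    return { mean, std::sqrt(sumSq / n) };  // population standard deviation
}
```

Running it once with the render thread at normal priority and once with it boosted, under the same load, would give two deviation figures to compare.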

Queues view crashes

If the Queues view was visible, and certain actions were taken while FFRend was paused, resuming would likely cause a crash. Some specific scenarios:

1) Pause, load a project with (a lot) fewer plugins, and then resume.
2) If at least one plugin has helpers: Pause, delete some plugins (but not the one with helpers), and then resume.

In all cases the issue was the Queues view accessing non-existent frames due to stale frame pointers. One case was in Renderer: though m_CurFrame was cleared in Run's stop case, the stop code wasn't executing, due to Run's erroneous initial no-op test (since removed). PluginHelper wasn't invalidating ANY of its frame pointers, neither input nor output.

Invalidation was not only lacking in places but also inconsistent in terms of timing, e.g. Plugin was clearing its input frame pointers on engine START (in ResetQueues), but clearing its output frame pointer on plugin STOP. There's no need to invalidate on stop, because the pointers don't actually become invalid until RunInit deletes the excess frames (in AllocFrames). Thus all frame pointer invalidation is now consistently done on engine START instead of engine STOP.

Plugin invalidates its input and output frame pointers in ResetQueues. PluginHelper does the same in ResetState. Renderer invalidates its current frame pointer in Run's start case, just before launching the worker thread.

Wednesday, April 25, 2012

intermittent bogus frames during continuous single stepping

The issue is that CreatePauseFrame accesses a frame (m_CurFrame) which the renderer no longer owns. CreatePauseFrame accesses it indirectly via m_BackBuf, which uses the surface description m_SurfDesc, the lpSurface member of which points to the frame, having been set to m_CurFrame by the Render method. However m_CurFrame is ONLY guaranteed to be valid in Render, i.e. during the period between reading the input frame and writing it to either the monitor or the free queue. The moment the current frame's reference count reaches zero, it no longer belongs to the renderer and can be overwritten by another thread at any time. In theory the issue could occur whenever the app pauses, but the odds are low. Continuous single stepping greatly increases the odds, by calling CreatePauseFrame in a tight loop.

It sounds scary but it's probably only a nuisance: so long as m_CurFrame still points to valid memory, the worst that can happen is an occasional bogus frame on the monitor.

If single step doesn't disable monitoring, the behavior becomes less likely, because in this case the frame is probably queued to the monitor window. The monitor window's timer hook is responsible for dequeuing and disposing of the frame, but it runs in the main thread, so it can't decrement the frame's reference count to zero while we're in SingleStep, because the main thread can't be in two places at once.

The fact that the engine stops the renderer after the plugins doesn't save us, because it doesn't guarantee that the renderer will render another frame between the time that some plugin munges the renderer's current frame just before stopping, and when the renderer worker thread stops.

A possible solution would be to change engine's Pause to do the following:

1. stop the renderer
2. wait for the render queue to contain a frame
3. pause the engine, stopping all plugin workers
4. single step the renderer, processing the queued frame
5. create the pause frame from the renderer's current frame

By the time you create the pause frame, all plugins were stopped BEFORE the final frame was rendered, so there's no one left alive to munge the current frame.

Note however that this solution breaks the normal single step, causing it to always step two frames instead of one.

Wednesday, April 11, 2012

In the works: column resizing, previews, non-AVI clips

The next version (2.2.04) is almost ready, and introduces a long-overdue feature: resizable columns in all views. It's only UI candy, but it's expected behavior and fairly low-risk. The next version also fixes a fairly serious bug which was accidentally introduced in 2.2.01: the MIDI Setup view's Plugin page is always empty, and selecting its tab can cause the view to resize incorrectly, overwriting the droplist and Learn check box.

After that, I'm considering adding a Preview bar, for previewing the output of one or more plugins. This would mostly be useful for clip players and source plugins. The bar would behave similarly to the Monitor bar, except that it would allow multiple preview windows within the bar. All the previews would be the same size, but the number of previews would be user-selectable, along with their layout (horizontal stack, vertical stack, or tiled). The advantage of giving previews their own bar (as opposed to adding them to the existing Monitor bar) is that this allows the Monitor and Preview bars to be positioned differently in the GUI. The Monitor bar typically mirrors the program's main output, so it wants to be bigger than the previews; it may even belong on its own display (this is possible if you have enough video outputs).

I'm also working on upgrading the clip player to use DirectShow instead of VfW, so it can open non-AVI clips directly without needing AVISynth. This is a riskier proposition but it would make FFRend more user-friendly.

Thursday, April 05, 2012

FFRend 2.2.03: Monitor source is back!

FFRend 2.2.03 is available for download. It brings back monitor source selection, a nice feature that was lost in the V2 rewrite. It also fixes some bugs.

Monitor source selection allows the monitor bar to display the output of any plugin, not just the one connected to the renderer. This was difficult to implement in V2 due to multithreading complications, so it was omitted, until now. To change the monitor source, use Plugin/Monitor (F8), the plugin context menu, or the monitor bar's context menu.

Bug fixes include a stall that occurred when stopping a recording if the final plugin had multiple threads assigned to it, and UI jerkiness when a menu was displayed while an edit control had focus.

Also, due to an embarrassing oversight, the check for updates feature introduced in 2.2.02 fails to find its installer script. 2.2.03 fixes this, or if you prefer, there's a patch available here.

Please download the latest version. Here are the release notes.

Tuesday, April 03, 2012

streaming uncompressed video: LAN vs. HDMI capture

One of my background research projects is streaming uncompressed video (ideally at least XGA res, 30 FPS) from one PC to another over a LAN. This would allow some interesting multi-user setups, e.g.:

FFRend -> FFRend: first PC/user is providing a submix to the second PC/user.
Whorld -> FFRend: same idea

The main advantages: distributed computing (load is distributed over multiple computers), and each PC can run full-screen and have dedicated user input (mouse, keyboard, MIDI etc). Whorld -> FFRend is particularly interesting b/c the Whorld app offers a much richer user experience than the UltraWhorld plugin.

I did a quick back of the envelope calculation: 1024 x 768 x 3 = 2.36 MB x 30FPS = 70.8 MB/s. Then I tested my Gigabit LAN. With 9k jumbo frames enabled, I was able to get around 79 MB/s, as measured by Netio. That's from an i7 920 box to an i5 2500K box, through a switch, with short cables. I tried eliminating the switch but it didn't matter. Bottom line: it's a bit too close for comfort but it might work.

The next thing I tried was bigFug's streamers. I tried for an hour or so, but no luck: I could only get them working in memory mode, not in TCP or UDP. So I resigned myself to rolling my own. No big deal, it's more fun anyway. I already had the sockets code kicking around from other projects. The one thing I didn't have was fast code for busting RGB32 down to RGB24. I run in 32-bit color, but I can't afford to send the unused alpha channel over the LAN. 1024 x 768 x 4 x 30 would be 94.4 MB/s, definitely not doable.

So I did some poking around on the net, but in the end it was quicker to just roll the 32/24 conversions in x86 assembler. My benchmarks say they get the job done in about 500 microseconds. I could cut that in half or better using SSE3 but that's a hassle and I can't be bothered at this stage. The code is here if you need it.
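
For anyone who'd rather not deal with assembler, the same conversion can be sketched in plain C++. This is not the assembler code mentioned above and it won't match its speed. It assumes the unused byte is the fourth byte of each pixel, and the function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack 4-byte pixels into 3-byte pixels by dropping every fourth byte.
// Assumption: the unused (alpha) byte is the last byte of each pixel.
std::vector<uint8_t> Rgb32ToRgb24(const std::vector<uint8_t>& src)
{
    assert(src.size() % 4 == 0);
    const size_t pixels = src.size() / 4;
    std::vector<uint8_t> dst(pixels * 3);
    const uint8_t* in = src.data();
    uint8_t* out = dst.data();
    for (size_t i = 0; i < pixels; i++, in += 4, out += 3) {
        out[0] = in[0];
        out[1] = in[1];
        out[2] = in[2];  // in[3] is skipped
    }
    return dst;
}

// Inverse: expand 3-byte pixels to 4-byte pixels, zeroing the unused byte.
std::vector<uint8_t> Rgb24ToRgb32(const std::vector<uint8_t>& src)
{
    assert(src.size() % 3 == 0);
    const size_t pixels = src.size() / 3;
    std::vector<uint8_t> dst(pixels * 4);
    const uint8_t* in = src.data();
    uint8_t* out = dst.data();
    for (size_t i = 0; i < pixels; i++, in += 3, out += 4) {
        out[0] = in[0];
        out[1] = in[1];
        out[2] = in[2];
        out[3] = 0;  // unused byte
    }
    return dst;
}
```

The sender would call the first function just before writing a frame to the socket, and the receiver would call the second just after reading one.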

Other points of interest: setting the sockets receive and send buffer sizes (SO_RCVBUF and SO_SNDBUF) is crucial. I got the best results by making them big enough to fit an entire frame. I tried disabling Nagle (TCP_NODELAY) but it turns out Nagle is helping in this case. Be careful to avoid round-trips. The send plugin should only send, and the receive plugin should only receive. Don't second-guess TCP, just relax and let it handle the bookkeeping.

So anyway I fired up my streaming plugins and discovered what I probably should have guessed: It works fine, so long as the PCs don't have much else to do. I can stream uncompressed XGA at 30 FPS with nary a hiccup, with one or maybe two (out of four) cores fully loaded by other plugins. But if I load that third core, the whole show grinds to a halt. Even though that core appears to be idle.

It's strange, because I don't show much CPU load from the streaming, even if I display kernel usage. My theory is that it's something else, maybe memory bandwidth, or bus traffic. The truth is, I just don't know why it doesn't work better.

It might work a whole lot better with 10 Gigabit Ethernet, but the NICs are still way too expensive. So my question (finally) is: how difficult would it be to realize this scheme using HDMI capture? I've seen cards around that claim to be able to capture uncompressed HDMI at resolutions plenty higher than XGA. The Blackmagic Intensity Pro even has a nice DirectShow API so I could presumably roll myself a Freeframe plugin to receive the incoming frames and insert them into my mix. Does this make any sense or am I just dreaming?

UPDATE: As it turns out the Blackmagic devices only output HD video resolutions (720p or 1080i).

I'm going to explore multihoming instead. According to my reading there's a decent chance I can get 120 MB/s or better using dual-port server-style NICs. The current crop from Intel looks good. Probably the CPU utilization will go down too, because with server-style cards the CPU can offload more of the work. It's a bit pricey but still cheap compared to 10 Gigabit.

HDMI capture wouldn't have solved my problem exactly anyway. The goal was basically to insert one frame stream directly into the other, so that no frames are lost. Streaming over TCP does this, and the proof is that if you pause the destination PC, flow control kicks in and the source PC pauses too. In other words the source PC is really a slave of the destination.

Unless I'm misunderstanding, with HDMI (or any other) capture, it's not master/slave, it's asynchronous: between the source and the destination you've got a display adapter which has its own clock and isn't pausing for anything. It neither knows nor cares whether the destination is capturing. If the destination pauses, the source merrily continues to output frames, and those frames are lost from the destination's point of view. This isn't what I want.

I might even try just installing some ordinary extra NICs, the $10 kind. I don't necessarily need LACP: since I have custom code, I could do my own link aggregation, e.g. by sending half the frame over one NIC and half over the other. Depending on CPU and memory bus saturation issues etc. it might not work, but it's a cheap and moderately amusing experiment.

Saturday, March 31, 2012

max gigabit speed

between ckci and badbox2
test: NetCPS
jumbo 9K frames
50K writes/reads to/from socket
direct connection: 79.6 MBs
through switch: 78.0 MBs
CPU utilization: 10..15%

1024 x 768 x 3 x 25 = 58.9 MBs
1024 x 768 x 3 x 30 = 70.7 MBs

Looks like uncompressed XGA over Ethernet is doable, for sure at 25 FPS and maybe at 30.
Whorld to FFRend? FFRend to FFRend? Problem is they both use 4-byte pixels, not 3.

1024 x 768 x 4 x 25 = 78.6 MBs    <- not quite! USB3 instead?
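
The arithmetic in these notes can be reproduced with a tiny helper. It's only a sketch: StreamRateMBs is a made-up name, and MB is taken to mean 10^6 bytes, which is what the figures above imply.

```cpp
#include <cassert>

// Bandwidth needed for uncompressed video, in MB/s (1 MB = 10^6 bytes).
double StreamRateMBs(int width, int height, int bytesPerPixel, int fps)
{
    assert(width > 0 && height > 0 && bytesPerPixel > 0 && fps > 0);
    return static_cast<double>(width) * height * bytesPerPixel * fps / 1e6;
}
```

StreamRateMBs(1024, 768, 3, 25) comes out at roughly 58.98, the 30 FPS case at roughly 70.78, and the 4-byte case at 25 FPS at roughly 78.64, which can then be compared against the measured 78.0 to 79.6.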

Monday, March 26, 2012

if edit control has focus, menus periodically freeze message loop

This bug was introduced in V2. FFRend 1.7.3 doesn't exhibit the bug, and the earliest version of V2 does. The cause is a porting error: the following bug fix from version 1.0.3 wasn't ported into V2:
28oct06    add ProcessMessageFilter to fix edit box/menu pauses
ToDo: if an edit control has focus, menus cause periodic pauses in message loop

BOOL CFFRendApp::ProcessMessageFilter(int code, LPMSG lpMsg)
{
    // If a menu is displayed while an edit control has focus, the message loop
    // pauses periodically until the menu is closed; this applies to all menus,
    // including context and system menus, and it's a problem for timer-driven
    // apps that use edit controls.  The problem is caused by the undocumented
    // WM_SYSTIMER message (0x118), which Windows uses internally for various
    // purposes including scrolling and blinking the caret in an edit control.
    // The solution is to suppress WM_SYSTIMER, but only if the filter code is
    // MSGF_MENU, otherwise the caret won't blink while scrolling.  The caret
    // doesn't blink while a menu is displayed even without this workaround.
    // if displaying a menu and message is WM_SYSTIMER
    if (code == MSGF_MENU && lpMsg->message == 0x118) {
        // use GetClassName because IsKindOf fails if the edit control doesn't
        // have a CEdit instance; see Microsoft knowledge base article Q145616
        TCHAR    szClassName[6];
        if (GetClassName(lpMsg->hwnd, szClassName, 6)
        && !_tcsicmp(szClassName, _T("Edit"))) {    // if recipient is an edit control
            return TRUE;    // suppress WM_SYSTIMER
        }
    }
    return CWinApp::ProcessMessageFilter(code, lpMsg);
}

Monday, January 23, 2012

Important FFRend bug fix; testers needed please!

The last released version of FFRend had a bug that caused the entire desktop to flicker whenever a row view was updated. Since just about every command updates one or more row views, the desktop flickered like crazy. You'd think I would have noticed it during testing but I normally run FFRend full-screen. I'm pretty embarrassed about it. Anyway please take the latest version, which fixes this lame bug and some others too.

FFRend download

I would greatly appreciate it if any FFRend users who hang out here would take the new version out for a spin and report back if they see anything weird. I'm especially worried about the row view code, because that's what caused the desktop flicker bug in the first place. The Freeframe Parameter and MIDI Setup row views are now supposed to remember the scroll position separately for each plugin. Try switching back and forth between a plugin that's scrolled and one that isn't. Do you see any painting artifacts, or does the view paint smoothly?

This version also fixes a bunch of problems related to synchronization of oscillators between plugins, which crept in with V2 as side effects of parallel processing. The earlier versions of V2 would lose sync between plugins at the drop of a hat. It should be a lot better now. The problem was basically that in a pipeline each plugin has its own frame of reference in terms of time, so it's necessary to compensate for that in various places. To debug it I had to make a pair of special Freeframe plugins, one that stamps its parameter directly onto the frame as a number, and another that draws its parameter onto the frame as a waveform, like an oscilloscope. They're pretty handy actually. I can make them available if anyone wants them...

Thursday, January 12, 2012

check for updates & automatic updates

The basic steps are:

1. Obtain the version number of the latest release, by downloading the project web site's download page and parsing its HTML for the download URL.

2. If the installed version is out of date, use the download URL found in the download page to download the zip file of the current binary release.

3. Launch a helper that unzips the zip file, and then runs the installer. The helper is needed because it's impossible to reinstall an app while the app is running.

The first two steps are simple enough. CInternetSession::OpenURL is the easiest way I found to download files via HTTP. OpenURL creates a CHttpFile, and from there it's straightforward. The only gotcha was that the CHttpFile instance is only valid while the session exists. The hardest part was parsing the download page for the version number and download URL. Parsing is always a pain. And of course there were degenerate cases: exceptions to be handled, temp file leaks to be avoided, etc.
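
Once the version number has been parsed out of the page, deciding whether the installed version is out of date comes down to a field-by-field comparison. Here's a rough sketch, assuming dotted numeric versions like 2.2.03; ParseVersion and IsNewer are hypothetical names, not FFRend's actual functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Split a dotted version string such as "2.2.03" into {2, 2, 3}.
// Fields that aren't numeric are treated as 0.
std::vector<int> ParseVersion(const std::string& s)
{
    assert(!s.empty());
    std::vector<int> parts;
    std::istringstream in(s);
    std::string field;
    while (std::getline(in, field, '.'))
        parts.push_back(std::atoi(field.c_str()));
    return parts;
}

// Returns true if 'latest' is a newer version than 'installed'.
bool IsNewer(const std::string& latest, const std::string& installed)
{
    std::vector<int> a = ParseVersion(latest);
    std::vector<int> b = ParseVersion(installed);
    const size_t n = std::max(a.size(), b.size());
    a.resize(n);  // pad the shorter version with zeros
    b.resize(n);
    return a > b;  // vectors compare lexicographically, field by field
}
```

Comparing numeric fields rather than raw strings means 2.10 correctly counts as newer than 2.9.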

The update code adds about 20K to the .NET executable. This is mostly due to the static linking of MFC, which means we pay for bloated objects such as CInternetSession and CHttpFile. There's also the cost of the minizip library but this is much more modest since we're already paying for zlib anyway.

The main problem is that the app has to exit before the installer runs, otherwise the installer fails, understandably enough. It's also expected behavior for the app to restart after the installer finishes, and for extra credit the installer file should be deleted from the temp folder afterwards. Initially I thought we'd need an actual helper app to deal with all this, but batch files came to the rescue.

To avoid a race, the script needs to give FFRend enough time to exit before starting the installer. This seemed problematic since XP batch syntax doesn't support sleep, but then I discovered the ping hack, which works fine (a built-in delay command, timeout, was finally added in Vista).

Fortunately it's possible to make msiexec non-interactive by specifying the /passive flag. This completely suppresses the installer dialog though a progress bar is briefly shown. The next problem was restarting the app without leaving the batch file's console window up. Here the secret was the incredibly useful start command, which I somehow managed not to discover for all these years. The start command does have a catch though: it doesn't necessarily pass a fully qualified path to the app, which causes serious problems if the app is expecting to be able to deduce its home folder from the command line. Luckily you can force start to pass the full path, by using the %cd% batch variable, which translates to the current directory. Also watch out for start's mandatory first argument, the window title, which has to be enclosed in quotes regardless of whether it contains spaces. In this case the window isn't visible long enough to read the title, so an empty string is fine.

Here's the completed reinstall.bat, which gets created by FFRend using CreateProcess "cmd /c". Note that the script assumes it's running in the same working folder as the application, which can be ensured by setting CreateProcess's lpCurrentDirectory argument to the application path. The arguments are:
%1 ping count: waits approximately N - 1 seconds
%2 installer path, e.g. "C:\whatever\temp\FFRend.msi"
%3 the name of the executable to launch, e.g. FFRend

@echo off
title Reinstalling %3
ping -n %1 127.0.0.1 >nul
msiexec /passive /i %2 REINSTALLMODE=vomus REINSTALL=ALL
if errorlevel 1 goto error
start "" "%cd%\%3"
del %2 >nul