Tuesday, June 05, 2012

StretchBlt can hang Windows by hogging GDI lock

It's well-known that StretchBlt is slow, especially in halftone mode. What's less well-known is that it can bring Windows to its knees. You might find this incredible, but it's easy to prove. I made a test application that repeatedly calls StretchBlt to resize a bitmap from 6000 x 4500 down to 200 x 150 in halftone mode. The source bitmap is constant, and both the source and the destination are memory device contexts, i.e. the test does not repaint any windows. Running this test severely degrades the performance of all other applications, including the Task Manager, regardless of process priorities, and despite plenty of idle cores. Applications repaint themselves belatedly or not at all, and video applications drop their frame rates to nearly zero.

Why is this so? It turns out that every GDI operation involves acquiring a system-wide lock, called the GDI lock. This strategy works well enough provided GDI operations take very little time, which is usually the case. However, as soon as one process hogs the lock by doing long GDI operations, all other processes are screwed, because applications typically spend most of their time updating their windows, and windows can only be updated via GDI calls. The GDI lock is a system-wide bottleneck that potentially reduces Windows to cooperative multitasking, which fails when one process doesn't cooperate.

But why is a lock required at all when the application is only resizing bitmaps in memory? How can this possibly affect other applications? The issue is that all GDI objects, including device contexts, pens, brushes and so on, are maintained at the system level, not per-process. The purpose of the GDI lock is to protect GDI objects and attributes from being corrupted by simultaneous modification from multiple threads. In other words, even though my test application's source and destination bitmaps are in memory and invisible to other processes, the StretchBlt has to be serialized anyway, because it potentially affects GDI state, and GDI state is global.

Microsoft has known about this issue all along, though they didn't publicize it for obvious reasons. They finally got around to doing something about it in Windows 7; however, it seems they were unable to get rid of the GDI lock altogether, so instead they substituted a large number of finer-grained locks for the one monolithic GDI lock. In theory this might help, but it remains to be seen whether it fixes the pathological case I'm describing. The consensus seems to be that Windows 7 generally exhibits poor 2D performance compared to XP, and my limited testing bears this out.

This issue has serious implications for FFRend. It means that the Monitor bar potentially limits FFRend's overall throughput, because the monitor window's StretchBlt can block the rendering thread from blitting to the output window, particularly for large (HD or higher) frame sizes and smooth (halftone) monitor quality. Apparently DirectDraw also has to acquire the GDI lock, even if only Blt is called (as opposed to GetDC/ReleaseDC), because the issue occurs even though the output window uses DirectDraw instead of GDI, and even in full-screen exclusive mode. This last point is especially egregious. The whole point of full-screen exclusive mode is that the system allows one window to not cooperate with other windows on a given monitor, because the other windows will be covered anyway. Incredibly, my StretchBlt test starves FFRend even when FFRend is in full-screen exclusive mode and covering the test application's window. This seems totally wrong to me.

The only solution I can see is to avoid StretchBlt, but this means FFRend has to include its own bitmap resizing code. Why not roll my own file system too while I'm at it? Bitmap resizing is no joke when quality and performance are both goals. Bilinear interpolation is easy enough but image quality degrades as the size difference between the source and destination bitmaps increases. Bicubic behaves better but it's slow and complicated and full of floating point. It would probably have to be implemented in SSE2 assembler to perform well enough. And of course it would be nice if the code supported all the Freeframe bitmap formats, i.e. not only 32-bit but 24-bit and 5-6-5 too. Right.

Another possibility would be to still use StretchBlt, but pre-scale the frames before feeding them to StretchBlt, maybe only if the difference between the frame size and the monitor window size exceeds a certain threshold. Implementing a 2x2 averaging down-sample is a relatively simple matter. Of course this plan only works if the frame size is divisible by two in both axes (usually true), and the resulting image quality remains to be seen.

The test source code is available here.

Microsoft's admission of the problem can be found in an obscure MSDN blog post: Engineering Windows 7 Graphics Performance.

Thursday, May 24, 2012

Should render thread run at a higher priority than plugin threads?

Hypothesis: In FFRend's parallel-processing frame pipeline, the render thread should run at a higher priority than plugin threads. The reason is that the pipeline uses a "pull" model, and the renderer is the puller. So if the frame rate timer is signaled and there's a frame in the renderer's queue, no good can come from deferring the render in favor of creating additional frame backlog further up the pipeline.

In practice this would only matter in the case where the engine is loaded heavily enough for there to be competition over CPU cores, but not so heavily that the renderer is unable to keep up with the frame rate timer.

Boosting the render thread's priority wouldn't improve throughput, because the engine is either keeping up with the frame rate or not, and by definition we're only interested in the case where it's keeping up. What it might do is reduce latency and jitter.

If in fact the renderer sometimes remains blocked even though its frame rate timer is signaled and there's a queued frame ready to be rendered, then in those instances the frame is displayed after its due time. Thus instead of the interval between frames being constant or nearly so, frames would alternately bunch together and spread out in time, even though on average the system is keeping up. This is the essence of jitter.

Jitter could be a more serious problem in DirectDraw Exclusive mode, because in this case the renderer is synchronizing to the vertical retrace, not just to a timer. The render thread is basically sitting in a polling loop somewhere in DirectX or the graphics driver, repeatedly asking the hardware if the retrace has started, and burning CPU the whole while. Not pretty but that's how it works. The question is, can the render thread be preempted by a plugin at that moment? If so we'll almost certainly miss the start of the vertical retrace, in which case DirectDraw will force the render thread to wait until the next one, i.e. the render will be delayed by an entire frame. It's possible that the vertical retrace wait is privileged code and therefore can't be preempted by an ordinary application thread. I sure wish I had better documentation. DirectX is shrouded in mystery.

But in Cooperative (windowed) mode the problem is more straightforward. It's easy enough to characterize the current jitter, by sampling the CPU's performance counter in the renderer, storing the samples in a memory buffer, and calculating their deviation afterwards. If there is significant jitter, and boosting the renderer's priority lessens it appreciably, it's a win.
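The deviation calculation might look like this, with timestamps already converted to seconds (in FFRend the samples would come from the performance counter; the function name is mine):

```cpp
#include <cmath>
#include <vector>

// Given a buffer of render timestamps in seconds, return the standard
// deviation of the frame-to-frame intervals -- one simple measure of
// jitter. Zero means perfectly even frame pacing.
double IntervalJitter(const std::vector<double>& stamps)
{
    if (stamps.size() < 3)
        return 0;   // need at least two intervals
    std::vector<double> ivl;
    for (size_t i = 1; i < stamps.size(); i++)
        ivl.push_back(stamps[i] - stamps[i - 1]);
    double mean = 0;
    for (double v : ivl)
        mean += v;
    mean /= ivl.size();
    double var = 0;
    for (double v : ivl)
        var += (v - mean) * (v - mean);
    return std::sqrt(var / ivl.size());
}
```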

Queues view crashes

If the Queues view was visible, and certain actions were taken while FFRend was paused, resuming would likely cause a crash. Some specific scenarios:

1) Pause, load a project with (a lot) fewer plugins, and then resume.
2) If at least one plugin has helpers: Pause, delete some plugins (but not the one with helpers), and then resume.

In all cases the issue was the Queues view accessing non-existent frames due to stale frame pointers. One case was in Renderer: though m_CurFrame was cleared in Run's stop case, the stop code wasn't executing, due to Run's erroneous initial no-op test (since removed). PluginHelper wasn't invalidating ANY of its frame pointers, neither input nor output.

Invalidation was not only lacking in places but also inconsistent in terms of timing, e.g. Plugin was clearing its input frame pointers on engine START (in ResetQueues), but clearing its output frame pointer on plugin STOP. There's no need to invalidate on stop, because the pointers don't actually become invalid until RunInit deletes the excess frames (in AllocFrames). Thus all frame pointer invalidation is now consistently done on engine START instead of engine STOP.

Plugin invalidates its input and output frame pointers in ResetQueues. PluginHelper does the same in ResetState. Renderer invalidates its current frame pointer in Run's start case, just before launching the worker thread.

Wednesday, April 25, 2012

intermittent bogus frames during continuous single stepping

The issue is that CreatePauseFrame accesses a frame (m_CurFrame) which the renderer no longer owns. CreatePauseFrame accesses it indirectly via m_BackBuf, which uses the surface description m_SurfDesc, the lpSurface member of which points to the frame, having been set to m_CurFrame by the Render method. However m_CurFrame is ONLY guaranteed to be valid in Render, i.e. during the period between reading the input frame and writing it to either the monitor or the free queue. The moment the current frame's reference count reaches zero, it no longer belongs to the renderer and can be overwritten by another thread at any time. In theory the issue could occur whenever the app pauses, but the odds are low. Continuous single stepping greatly increases the odds, by calling CreatePauseFrame in a tight loop.

It sounds scary but it's probably only a nuisance: so long as m_CurFrame still points to valid memory, the worst that can happen is an occasional bogus frame on the monitor.

If single step doesn't disable monitoring, the behavior becomes less likely, because in this case the frame is probably queued to the monitor window. The monitor window's timer hook is responsible for dequeuing and disposing of the frame, but it runs in the main thread, so it can't decrement the frame's reference count to zero while we're in SingleStep, because the main thread can't be in two places at once.

The fact that the engine stops the renderer after the plugins doesn't save us, because it doesn't guarantee that the renderer will render another frame between the time that some plugin munges the renderer's current frame just before stopping, and when the renderer worker thread stops.

A possible solution would be to change engine's Pause to do the following:

1. stop the renderer
2. wait for the render queue to contain a frame
3. pause the engine, stopping all plugin workers
4. single step the renderer, processing the queued frame
5. create the pause frame from the renderer's current frame

By the time the pause frame is created, all plugins were stopped BEFORE the final frame was rendered, so there's no one left alive to munge the current frame.

Note however that this solution breaks the normal single step, causing it to always step two frames instead of one.

Wednesday, April 11, 2012

In the works: column resizing, previews, non-AVI clips

The next version (2.2.04) is almost ready, and introduces a long-overdue feature: resizable columns in all views. It's only UI candy, but it's expected behavior and fairly low-risk. The next version also fixes a fairly serious bug which was accidentally introduced in 2.2.01: the MIDI Setup view's Plugin page is always empty, and selecting its tab can cause the view to resize incorrectly, overwriting the droplist and Learn check box.

After that, I'm considering adding a Preview bar, for previewing the output of one or more plugins. This would mostly be useful for clip players and source plugins. The bar would behave similarly to the Monitor bar, except that it would allow multiple preview windows within the bar. All the previews would be the same size, but the number of previews would be user-selectable, along with their layout (horizontal stack, vertical stack, or tiled). The advantage of giving previews their own bar (as opposed to adding them to the existing Monitor bar) is that this allows the Monitor and Preview bars to be positioned differently in the GUI. The Monitor bar typically mirrors the program's main output, so it wants to be bigger than the previews; it may even belong on its own display (this is possible if you have enough video outputs).

I'm also working on upgrading the clip player to use DirectShow instead of VfW, so it can open non-AVI clips directly without needing AVISynth. This is a riskier proposition but it would make FFRend more user-friendly.

Thursday, April 05, 2012

FFRend 2.2.03: Monitor source is back!

FFRend 2.2.03 is available for download. It brings back monitor source selection, a nice feature that was lost in the V2 rewrite. It also fixes some bugs.

Monitor source selection allows the monitor bar to display the output of any plugin, not just the one connected to the renderer. This was difficult to implement in V2 due to multithreading complications, so it was omitted, until now. To change the monitor source, use Plugin/Monitor (F8), the plugin context menu, or the monitor bar's context menu.

Bug fixes include a stall that occurred when stopping a recording if the final plugin had multiple threads assigned to it, and UI jerkiness when a menu was displayed while an edit control had focus.

Also, due to an embarrassing oversight, the check for updates feature introduced in 2.2.02 fails to find its installer script. 2.2.03 fixes this, or if you prefer, there's a patch available here.

Please download the latest version. Here are the release notes.

Tuesday, April 03, 2012

streaming uncompressed video: LAN vs. HDMI capture

One of my background research projects is streaming uncompressed video (ideally at least XGA res, 30 FPS) from one PC to another over a LAN. This would allow some interesting multi-user setups, e.g.:

FFRend -> FFRend: first PC/user is providing a submix to the second PC/user.
Whorld -> FFRend: same idea

The main advantages: distributed computing (load is distributed over multiple computers), and each PC can run full-screen and have dedicated user input (mouse, keyboard, MIDI etc). Whorld -> FFRend is particularly interesting b/c the Whorld app offers a much richer user experience than the UltraWhorld plugin.

I did a quick back of the envelope calculation: 1024 x 768 x 3 = 2.36 MB x 30FPS = 70.8 MB/s. Then I tested my Gigabit LAN. With 9k jumbo frames enabled, I was able to get around 79 MB/s, as measured by Netio. That's from an i7 920 box to an i5 2500K box, through a switch, with short cables. I tried eliminating the switch but it didn't matter. Bottom line: it's a bit too close for comfort but it might work.
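That back-of-the-envelope arithmetic, in code form (decimal megabytes per second, the same convention Netio reports in; the function name is mine):

```cpp
// Required LAN throughput for an uncompressed video stream, in decimal
// MB/s. 1024 x 768 x 3 bytes x 30 FPS comes to about 70.8 MB/s --
// uncomfortably close to the ~79 MB/s the gigabit link measured.
double StreamRate(int width, int height, int bytesPerPixel, int fps)
{
    return double(width) * height * bytesPerPixel * fps / 1e6;
}
```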

The next thing I tried was bigFug's streamers. I tried for an hour or so, but no luck: I could only get them working in memory mode, not in TCP or UDP. So I resigned myself to rolling my own. No big deal, it's more fun anyway. I already had the sockets code kicking around from other projects. The one thing I didn't have was fast code for busting RGB32 down to RGB24. I run in 32-bit color, but I can't afford to send the unused alpha channel over the LAN. 1024 x 768 x 4 x 30 would be 94.4 MB/s, definitely not doable.

So I did some poking around on the net, but in the end it was quicker to just roll the 32/24 conversions in x86 assembler. My benchmarks say they get the job done in about 500 microseconds. I could cut that in half or better using SSE3 but that's a hassle and I can't be bothered at this stage. The code is here if you need it.
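For reference, here's a portable scalar version of the transform (the real code is hand-written x86 assembler for speed; the function name and BGRA layout are my assumptions):

```cpp
#include <cstddef>
#include <cstdint>

// Pack 32-bit pixels (BGRA) down to 24-bit (BGR), dropping the unused
// alpha byte. A plain C++ rendering of what the assembler version does.
void Rgb32ToRgb24(const uint8_t* src, uint8_t* dst, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        dst[0] = src[0];    // blue
        dst[1] = src[1];    // green
        dst[2] = src[2];    // red
        src += 4;           // skip alpha
        dst += 3;
    }
}
```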

Other points of interest: setting the sockets receive and send buffer sizes (SO_RCVBUF and SO_SNDBUF) is crucial. I got the best results by making them big enough to fit an entire frame. I tried disabling Nagle (TCP_NODELAY) but it turns out Nagle is helping in this case. Be careful to avoid round-trips. The send plugin should only send, and the receive plugin should only receive. Don't second-guess TCP, just relax and let it handle the bookkeeping.
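The buffer sizing boils down to a couple of setsockopt calls; a POSIX sketch is below (the actual plugins use Winsock, where the calls are the same apart from the cast on the option pointer; the function name is mine):

```cpp
#include <sys/socket.h>

// Size the socket's send and receive buffers to hold one whole
// uncompressed frame. Nagle is deliberately left enabled (no
// TCP_NODELAY), since disabling it hurt in this workload.
// Note the kernel may cap the buffers below the requested size.
bool SizeForFrame(int sock, int frameBytes)
{
    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF,
                   &frameBytes, sizeof(frameBytes)) < 0)
        return false;
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
                   &frameBytes, sizeof(frameBytes)) < 0)
        return false;
    return true;
}
```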

So anyway I fired up my streaming plugins and discovered what I probably should have guessed: It works fine, so long as the PCs don't have much else to do. I can stream uncompressed XGA at 30 FPS with nary a hiccup, with one or maybe two (out of four) cores fully loaded by other plugins. But if I load that third core, the whole show grinds to a halt. Even though that core appears to be idle.

It's strange, because I don't show much CPU load from the streaming, even if I display kernel usage. My theory is that it's something else, maybe memory bandwidth, or bus traffic. The truth is, I just don't know why it doesn't work better.

It might work a whole lot better with 10 Gigabit Ethernet, but the NICs are still way too expensive. So my question (finally) is: how difficult would it be to realize this scheme using HDMI capture? I've seen cards around that claim to be able to capture uncompressed HDMI at resolutions plenty higher than XGA. The Blackmagic Intensity Pro even has a nice DirectShow API so I could presumably roll myself a Freeframe plugin to receive the incoming frames and insert them into my mix. Does this make any sense or am I just dreaming?

UPDATE: As it turns out the Blackmagic devices only output HD video resolutions (720p or 1080i).

I'm going to explore multihoming instead. According to my reading there's a decent chance I can get 120 MB/s or better using dual-port server-style NICs. The current crop from Intel looks good. Probably the CPU utilization will go down too, because with server-style cards the CPU can offload more of the work. It's a bit pricey but still cheap compared to 10 Gigabit.

HDMI capture wouldn't have solved my problem exactly anyway. The goal was basically to insert one frame stream directly into the other, so that no frames are lost. Streaming over TCP does this, and the proof is that if you pause the destination PC, flow control kicks in and the source PC pauses too. In other words the source PC is really a slave of the destination.

Unless I'm misunderstanding, with HDMI (or any other) capture, it's not master/slave, it's asynchronous: between the source and the destination you've got a display adapter which has its own clock and isn't pausing for anything. It neither knows nor cares whether the destination is capturing. If the destination pauses, the source merrily continues to output frames, and those frames are lost from the destination's point of view. This isn't what I want.

I might even try just installing some ordinary extra NICs, the $10 kind. I don't necessarily need LACP: since I have custom code, I could do my own link aggregation, e.g. by sending half the frame over one NIC and half over the other. Depending on CPU and memory bus saturation issues etc. it might not work, but it's a cheap and moderately amusing experiment.

Saturday, March 31, 2012

max gigabit speed

between ckci and badbox2
test: NetCPS
jumbo 9K frames
50K writes/reads to/from socket
direct connection: 79.6 MB/s
through switch: 78.0 MB/s
CPU utilization: 10..15%

1024 x 768 x 3 x 25 = 58.9 MB/s
1024 x 768 x 3 x 30 = 70.7 MB/s

Looks like uncompressed XGA over Ethernet is doable, for sure at 25 FPS and maybe at 30.
Whorld to FFRend? FFRend to FFRend? Problem is they both use 4-byte pixels, not 3.

1024 x 768 x 4 x 25 = 78.6 MB/s    <- not quite! USB3 instead?