Tuesday, April 03, 2012

streaming uncompressed video: LAN vs. HDMI capture

One of my background research projects is streaming uncompressed video (ideally at least XGA res, 30 FPS) from one PC to another over a LAN. This would allow some interesting multi-user setups, e.g.:

FFRend -> FFRend: first PC/user is providing a submix to the second PC/user.
Whorld -> FFRend: same idea

The main advantages: distributed computing (load is distributed over multiple computers), and each PC can run full-screen and have dedicated user input (mouse, keyboard, MIDI etc). Whorld -> FFRend is particularly interesting b/c the Whorld app offers a much richer user experience than the UltraWhorld plugin.

I did a quick back of the envelope calculation: 1024 x 768 x 3 = 2.36 MB x 30FPS = 70.8 MB/s. Then I tested my Gigabit LAN. With 9k jumbo frames enabled, I was able to get around 79 MB/s, as measured by Netio. That's from an i7 920 box to an i5 2500K box, through a switch, with short cables. I tried eliminating the switch but it didn't matter. Bottom line: it's a bit too close for comfort but it might work.

The next thing I tried was bigFug's streamers. I tried for an hour or so, but no luck: I could only get them working in memory mode, not in TCP or UDP. So I resigned myself to rolling my own. No big deal, it's more fun anyway. I already had the sockets code kicking around from other projects. The one thing I didn't have was fast code for busting RGB32 down to RGB24. I run in 32-bit color, but I can't afford to send the unused alpha channel over the LAN. 1024 x 768 x 4 x 30 would be 94.5 MB/s, definitely not doable.

So did some poking around on the net, but in the end it was quicker to just roll the 32/24 conversions in x86 assembler. My benchmarks say thehttp://www.blogger.com/img/blank.gify get the job done in about 500 microseconds. I could cut that in half or better using SSE3 but that's a hassle and I can't be bothered at this stage. The code is here if you need it.

Other points of interest: setting the sockets receive and send buffer sizes (SO_RCVBUF and SO_SNDBUF) is crucial. I got the best results by making them big enough to fit an entire frame. I tried disabling Nagle (TCP_NODELAY) but it turns out Nagle is helping in this case. Be careful to avoid round-trips. The send plugin should only send, and the receive plugin should only receive. Don't second-guess TCP, just relax and let it handle the bookkeeping.

So anyway I fired up my streaming plugins and discovered what I probably should have guessed: It works fine, so long as the PCs don't have much else to do. I can stream uncompressed XGA at 30 FPS with nary a hiccup, with one or maybe two (out of four) cores fully loaded by other plugins. But if I load that third core, the whole show grinds to a halt. Even though that core appears to be idle.

It's strange, because I don't show much CPU load from the streaming, even if I display kernel usage. My theory is that it's something else, maybe memory bandwidth, or bus traffic. The truth is, I just don't know why it doesn't work better.

It might work a whole lot better with 10 GB Ethernet, but the NICs are still way too expensive. So my question (finally) is: how difficult would it be to realize this scheme using HDMI capture?? I've seen cards around that claim to be able to capture uncompressed HDMI at resolutions plenty higher than XGA. The Blackmagic Intensity Pro even has a nice DirectShow API so I could presumably roll myself a Freeframe plugin to receive the incoming frames and insert them into my mix. Does this make any sense or am I just dreaming?

UPDATE: As it turns out the Blackmagic devices only output HD video resolutions (720p or 1080i).

I'm going to explore multihoming instead. According to my reading there's a decent chance I can get 120 MB/s or better using dual-port server-style NICs. The current crop from Intel looks good. Probably the CPU utilization will go down too, because with server-style cards the CPU can offload more of the work. It's a bit pricey but still cheap compared to 10 Gigabit.

HDMI capture wouldn't have solved my problem exactly anyway. The goal was basically to insert one frame stream directly into the other, so that no frames are lost. Streaming over TCP does this, and the proof is that if you pause the destination PC, flow control kicks in and the source PC pauses too. In other words the source PC is really a slave of the destination.

Unless I'm misunderstanding, with HDMI (or any other) capture, it's not master/slave, it's asynchronous: between the source and the destination you've got a display adapter which has its own clock and isn't pausing for anything. It neither knows nor cares whether the destination is capturing. If the destination pauses, the source merrily continues to output frames, and those frames are lost from the destination's point of view. This isn't what I want.

I might even try just installing some ordinary extra NICs, the $10 kind. I don't necessarily need LACP: since I have custom code, I could do my own link aggregation, e.g. by sending half the frame over one NIC and half over the other. Depending on CPU and memory bus saturation issues etc. it might not work, but it's a cheap and moderately amusing experiment.

No comments: