I have an application that streams data over the network at high bandwidths that is having a lot of performance issues. I narrowed it down to a very minimal case, and am leaving out all the rest of the details here:
The application is a multimedia application that must stream frames of data using UDP at consistent frame rates. Each frame is about 160kB split into 927 UDP packets (of 174 bytes each), with each packet sent to one of 59 destination devices (max 16 packets per destination). It is a dedicated machine connected to a dedicated gigabit LAN and I've verified that a gigabit connection was established between all network devices. The performance issue I'm experiencing is highly inconsistent frame rates caused by a periodic delay in the call to sendto() at extremely regular intervals. For the purposes of testing I've turned off the frame rate limiter and am sending data as fast as possible (with the limiter set to 60FPS average bandwidth is roughly 75mbps).
I have two configurations that I've tested:
1) 927 sockets. Each frame sends 174 bytes over each of these sockets.
2) 59 sockets. Each frame sends 2436 to 2784 bytes over each of these sockets.
At gigabit speed each frame *should* take about 1.3ms to send (160kB x 8 bits / 1000Mbps ~= 0.0013s, mind your bits and bytes). In practice I can't even get close to that (see below).
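As a rough check of that estimate, here is a minimal C sketch of the arithmetic. The packet count and payload size come from the numbers above; the 66 bytes of per-packet overhead (UDP, IPv4 and Ethernet framing) is my assumption about a plain untagged Ethernet link, not something measured on this network.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-packet overhead on the wire, in bytes (not from the post):
 * 8 UDP + 20 IPv4 + 14 Ethernet header + 4 FCS + 8 preamble + 12 inter-frame gap. */
enum { WIRE_OVERHEAD_BYTES = 8 + 20 + 14 + 4 + 8 + 12 };

/* Milliseconds needed to clock `packets` datagrams of `payload_bytes` each
 * onto a link running at `link_bits_per_sec`, adding `overhead_bytes` per packet. */
double frame_time_ms(size_t packets, size_t payload_bytes,
                     size_t overhead_bytes, double link_bits_per_sec)
{
    assert(link_bits_per_sec > 0.0);
    double bits = (double)packets * (double)(payload_bytes + overhead_bytes) * 8.0;
    return bits / link_bits_per_sec * 1000.0;
}
```

Calling frame_time_ms(927, 174, 0, 1e9) gives roughly 1.29ms for the payload alone, and frame_time_ms(927, 174, WIRE_OVERHEAD_BYTES, 1e9) gives roughly 1.78ms once the assumed overhead is counted. Either way it is an order of magnitude below the 10-20ms per frame I actually see.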
In configuration 1, when sending frames as fast as possible, each frame generally takes about 10ms to send. The real problem is every 0.7 seconds (~65 frames, ~10MB), a frame takes a whopping 1.8 seconds to send. I can't explain this massive delay, but it is at extremely regular intervals.
In configuration 2, the same problem exists, except each frame generally takes about 20ms to send, and every 0.1 seconds (~6 frames, ~1MB), a frame takes 0.13 seconds to send. Again, it is at extremely regular intervals.
I had thought that something else on the machine was interfering with network communications (checking usual culprits, making sure wireless networking disabled, etc.) but the thing is configuration 1 and 2 experience distinctly different patterns of delay. One is 1.8 seconds every 0.7 seconds, the other is 0.13 seconds every 0.1 seconds.
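To put numbers on these patterns, one option is to log how long each frame takes to send and summarize the stalls afterwards. The sketch below is only an illustration: the struct, the function name and the threshold are hypothetical, and it assumes the per-frame times were captured separately (for example with QueryPerformanceCounter around the loop of sendto() calls).

```c
#include <assert.h>
#include <stddef.h>

/* Summary of the stalls found in a log of per-frame send times. */
struct stall_summary {
    size_t stalls;          /* frames slower than the threshold */
    double worst_ms;        /* slowest frame seen */
    double mean_gap_frames; /* mean frame count between consecutive stalls,
                               0 if fewer than two stalls were found */
};

/* Scans `n` per-frame send times (in ms) and reports how many exceed
 * `threshold_ms` and how far apart those slow frames are on average. */
struct stall_summary summarize_stalls(const double *frame_ms, size_t n,
                                      double threshold_ms)
{
    struct stall_summary s = {0, 0.0, 0.0};
    size_t last = 0, gap_total = 0;

    assert(frame_ms != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (frame_ms[i] > s.worst_ms)
            s.worst_ms = frame_ms[i];
        if (frame_ms[i] > threshold_ms) {
            if (s.stalls > 0)
                gap_total += i - last; /* distance from the previous stall */
            last = i;
            s.stalls++;
        }
    }
    if (s.stalls > 1)
        s.mean_gap_frames = (double)gap_total / (double)(s.stalls - 1);
    return s;
}
```

Comparing mean_gap_frames and worst_ms between the two socket configurations gives a concrete version of the "1.8 seconds every 0.7 seconds" versus "0.13 seconds every 0.1 seconds" observation.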
Another piece of information that may be important is the delay does not seem to happen when the network cable is unplugged. This, of course, strongly suggests some hardware issues at some point in the network. However, I have not been able to verify this yet as the software is for a client running things remotely, and communication is sometimes difficult (unfortunately, I can't be at the site to witness the actual problem, which does not occur on the development machine :-( ).
My test application does not do anything exotic. It simply creates the sockets like so:
socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
And sends data packets to their appropriate destinations:
sendto(p.sock, data, datalen, 0, paddr, addrlen);
All calls to sendto() are succeeding, all data is sent and I have verified that it is correctly received by the remote devices.
1. What is causing the delays at regular intervals?
2. Why does the delay time and interval length depend on how many sockets I have open and/or how much data I'm sending per socket (note that in both configurations each frame is the same amount of data total)?
3. How can I troubleshoot / solve this? Are there some kind of socket options I can set to improve performance and consistency?
It's frustrating because the software goes live soon, and these problems didn't occur until the software was run on site. Any advice, hints, info would be greatly appreciated.
Thanks a lot,
JC
Edit:
I should note that I sent data as fast as possible in the test application, but in the real application even when I limit as low as 30FPS, I still see the same problem, except the intervals are longer. For example, at 30FPS (approx average bandwidth 30-40mbps) it takes 10-20ms to send a frame and I still see the 1.8 second delays, but it happens every 3 or 4 seconds rather than every 0.7 seconds.
Edders
October 29th, 2008, 05:13 AM
What platform are you using? Why are you using so many different sockets for what looks like a single stream of data?
Have you tried changing the size of the UDP packets to see if that makes any change to what you see? You now send 174 bytes per packet, but you could try sending 1392 bytes (8 times as much) to see if this has any influence on what you see?
On a single stream with a high data rate I'd recommend making the task of the networking stack of your OS as simple as possible. Put as much data as you can in each packet and limit the number of sockets and ports.
TheCPUWizard
October 29th, 2008, 08:51 AM
Is your program the ONLY thing transmitting on the network???
[If you "think" the answer is yes, have you used a packet sniffer to verify???]
When I was distributing [true realtime] audio over TCP/IP a few years back, I had similar types of issues. The issue became one of two things...
1) Something else was transmitting on the network.
2) Some other protocol was running on my NIC that was doing periodic messaging.
IMHO, the ONLY way to tell is with a good packet sniffer.....
peenie
October 29th, 2008, 12:39 PM
Thanks for your replies.
> What platform are you using?
Windows XP SP3 on a MacBook Pro, 2.4GHz T7700 Core 2 Duo. I do not know what NIC is in this machine.
The hardware setup was only available on-site, not during development; it was during our on-site testing that the problems were discovered.
The development machine, where there doesn't seem to be any problems (but the network setup is too different for it to be relevant) is Windows XP SP3 on a Thinkpad T60, 2.16GHz T2600 Core Duo, Intel 3945ABG wireless NIC.
> Why are you using so many different sockets for what looks like a single stream of data?
In a test application I had significant performance issues sending all of the data through a single socket. On the development machine I could maintain 60FPS with both of the above configurations; with a single socket I could not get past 6 or 7 FPS.
In the single socket test, each frame took roughly 40ms to send and every 3.1 seconds a frame took 3.8 seconds to send (compare to above results for configuration 1 and 2).
It seems that as I use fewer sockets:
- Time to send frame increases.
- Time between delayed frames increases.
- Delay time increases.
> Have you tried changing the size of the UDP packets to see if that makes any change to what you see? You now send 174 bytes per packet, but you could try sending 1392 bytes (8 times as much) to see if this has any influence on what you see?
I have tested with 512 byte packets (sending 316 packets per "frame" for the same amount of data), but only using a single socket. In that test frames took 150ms to send, but were completely consistent; there were no delays. I can try this test with the above socket configurations; however, the protocol used to talk to the network devices sticks to UDP packet boundaries, so increasing packet sizes is not an option. (That's also why I sort of ignored these test results... but it's interesting that it got rid of the delays.)
> On a single stream with a high data rate I'd recommend making the task of the networking stack of your OS as simple as possible. Put as much data as you can in each packet and limit the number of sockets and ports.
Unfortunately, the packet size must be fixed at 174. I have no control over the device protocol. And for some reason, fewer sockets seem to perform worse here, which is kind of counter-intuitive.
> Is your program the ONLY thing transmitting on the network???
> [If you "think" the answer is yes, have you used a packet sniffer to verify???]
I think the answer is yes. I will verify this with Wireshark at some point in the next few days.
Thanks for your advice,
JC
TheCPUWizard
October 29th, 2008, 12:49 PM
If you are using the standard Windows stack, then there is a lot going on in most situations. Wireshark will show SOME of it, but there are also internal loopbacks which never make it onto the wire. (For example: is WINS running? Is DNS running? Is the Computer Browser service running?)
I would approach this by creating a trivial (relatively speaking) kernel mode driver so that Windows is not even aware of the device.
If you implement the buffering and packetization inside the Kernel code, then you should be able to easily achieve your goals.....
Edders
October 30th, 2008, 05:17 AM
I think you have had good advice from TheCPUWizard (no surprise there!). I have only dealt with low latency audio streaming, and none of our Windows applications get close to the kind of data rate you are transmitting. We do have an embedded solution with lots of high data rate audio streams, but it is on a completely different platform.
TheCPUWizard
October 30th, 2008, 05:57 AM
> I think you have had good advice from TheCPUWizard (no surprise there!). I have only dealt with low latency audio streaming, and none of our Windows applications get close to the kind of data rate you are transmitting. We do have an embedded solution with lots of high data rate audio streams, but it is on a completely different platform.
Edders may have a viable approach (depending on the exact scenario). Instead of using a custom driver stack within the PC, the use of an external microcontroller would potentially simplify the software, and also eliminate issues with OS level scheduling.
peenie
October 30th, 2008, 07:00 AM
> Edders may have a viable approach (depending on the exact scenario). Instead of using a custom driver stack within the PC, the use of an external microcontroller would potentially simplify the software, and also eliminate issues with OS level scheduling.
That may not be a bad idea, except there are tight time constraints. I would prefer the external hardware approach because then I can offload that work to somebody else. An inefficiency in the network protocol triples the required bandwidth; if I use some external hardware I can stream data to that hardware at 25mbps instead of 75mbps... well under the limits of, say, hi-speed USB 2. Still, I feel like there must be a simpler solution ... I mean, it's a 2.4GHz Core 2 Duo, it should have the processing power... 75mbps doesn't even come close to saturating old EISA buses, it's well under memory bandwidth limits and the theoretical gigabit limit of the networking hardware. I'm convinced it's some bottleneck in the driver, assuming it's not something else taking up network bandwidth as you suggested earlier. We'll be trying different hardware soon so we'll see.
Unfortunately I probably don't have time to research and write a custom driver. At some point soon we're going to try installing the software on a different machine. If that doesn't improve anything then I'll have to hop on the train and head to the site next week and see what else I can discover.
Thanks for your replies,
JC
TheCPUWizard
October 30th, 2008, 07:04 AM
The "problem" is basically inherent in the choice of a non-real time OS for what is basically a real-time task. This does not mean that there is something "wrong" with the OS, simply that it is not appropriate.
Would you enter a Formula 1 car in a demolition derby?
Would you enter a Hummer in an Indy race?
Both are vehicles with some very strong points....
MikeAThon
October 30th, 2008, 10:21 AM
Bandwidth problems in Winsock over UDP on gigabit ethernet networks are sometimes resolved by focusing on:
1. Jumbo frames setting. This must be consistent for all machines connected to the network.
2. The value of the FastSendDatagramThreshold setting in the registry. See http://technet.microsoft.com/en-us/library/bb726981.aspx for a description of many many registry settings, including this one. The link is for Win2k, but this registry description is the same in later versions of Windows.
As a suggestion, get yourself some other UDP performance benchmarking program, and try that instead of your own frame streaming program. That way, you can determine whether the network is up to standards, quite apart from your program's use of it, and thus isolate the problem to the network or to your program's use of it.
peenie
November 7th, 2008, 03:23 AM
I went to the site yesterday and discovered the problem. Before going on site I did a few more experiments with the theory that it was a driver problem on the MacBooks we were using, as I never saw the issue on the PC development machine. There was only a 10/100 switch available to test with at the time. Testing on that, with a 100mbit connection connected to one of the hardware devices, there were no issues at all. We dropped the connection speed down to 10mbits and still, no issues. There was no opportunity to test with a gigabit connection there, so I went to the site with the idea that it was a strange bug with the MacBook drivers in gigabit mode only.
Well, after getting there I immediately tested on the MacBook in gigabit mode and the problems happened. Skipping 100mbits, I tested at 10mbits and the delays disappeared and performance was significantly better. Ok, that supports the broken driver theory. So I tested on a PC running at 1gbit and... it didn't work. Same delays. Confused, I tested on my development machine at 1gbit and... same problems. There goes the MacBook theory.
Well, I was not responsible for the network setup at the site and had assumed it was sane. As it turns out, the 59 hardware devices can only run at 10mbit speeds (they are not configurable). However, the network consisted of 5 Netgear unmanaged switches, daisy-chained with 1gbit connections with each one breaking out into 11-13 of the 10mbit devices, and the machine connected at 1gbit to the end of the chain (the person on-site who had verified that all devices were connected at gigabit speeds had misinterpreted the lights on the front of the switches, which actually showed 10mbits to each device).
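For a feel of what that speed mismatch means for a switch, here is a back-of-the-envelope sketch in C. It is a simplified fluid model, not a description of how these Netgear switches actually behave: it assumes the packets for one device arrive back-to-back on the gigabit side (the worst case) and ignores store-and-forward latency. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes still queued in the switch at the moment the last packet of a
 * back-to-back burst has arrived on the fast port. */
double burst_backlog_bytes(size_t packets, size_t wire_bytes_per_packet,
                           double in_bits_per_sec, double out_bits_per_sec)
{
    assert(in_bits_per_sec > 0.0 && out_bits_per_sec > 0.0);
    double total = (double)packets * (double)wire_bytes_per_packet;
    if (out_bits_per_sec >= in_bits_per_sec)
        return 0.0; /* egress keeps up, nothing accumulates */
    return total * (1.0 - out_bits_per_sec / in_bits_per_sec);
}

/* Milliseconds the slow port needs to transmit the whole burst. */
double burst_drain_ms(size_t packets, size_t wire_bytes_per_packet,
                      double out_bits_per_sec)
{
    assert(out_bits_per_sec > 0.0);
    return (double)packets * (double)wire_bytes_per_packet * 8.0
           / out_bits_per_sec * 1000.0;
}
```

With 16 packets of 240 bytes on the wire (174 bytes of payload plus an assumed 66 bytes of UDP/IP/Ethernet overhead), going from 1gbit to 10mbit leaves about 3.8kB queued per device and takes about 3ms to drain. The average rate per device is small, but nearly every burst has to be buffered in the switch; a switch that cannot do that will drop packets or push back with flow control.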
All of the dropped packets I was seeing were caused entirely by connection speed mismatches and the inability of the dumb switches to convert. The inconsistent frame rates with delays at regular intervals that I was seeing were entirely caused by ethernet flow control. I suspect that Windows somehow handles flow control logic per-socket, which may explain the difference in delays with different socket configurations, and also may explain the shorter delays and better performance with the one-socket-per-destination configuration. All of these problems were made worse by per-switch inconsistencies caused by the network layout.
This also explains why the development machine did not show any of the issues: it was never connected to a full set of 59 devices, and when it was connected to test hardware, it was connected through switches that could handle the data rate conversion properly. It also explains why we didn't see the problem in the tests earlier that day with the 10/100 switch, which could handle rate changes correctly. Had we had a smart gigabit switch available at that time, it would have also worked just fine.
To verify all of that: disabling flow control eliminated the delays, dropping the computer down to a 10mbit connection eliminated the dropped packets, and reducing the number of devices on the network also eliminated the delays.
The solution is to fix the network layout. We ordered a new set of nicer, smarter switches and reorganized the layout to something more sane. One switch will serve as the top level switch with a gigabit connection to the machine, and will connect to each of the 5 other switches which will be connected to the devices -- the new switches can handle the data rate conversions, and the new layout also equalizes the bandwidth through all of the switches. We'll have to experiment with flow control settings.
Unfortunately I had to return home before the new devices arrived; hopefully the new setup will resolve all of the issues.
Thanks for all the advice here. I thought I'd post back with the (likely) solution, since it was completely different from all of the other theories.
J
peenie
January 2nd, 2009, 02:20 PM
Just an update in case anybody runs into a similar problem.
In fact, everything I just posted about in that last post was completely wrong. There were a lot of red herrings.
We noticed that pressing the "repair" button in Windows to repair the network connection fixed the problem. Going through the list of what that actually does, we were able to identify the ARP cache as the source of the problem.
Windows clears the ARP cache by default every 10 minutes, and this corresponded exactly with what we were seeing (when we watched for a while, it would slow down then every 10 minutes reset and run normally again).
Lo and behold, clearing the ARP cache (command "arp -d *") fixed the problem every time. I can't explain it.
The solution in the end was to edit a few registry keys to set the ARP refresh time to 1 minute instead of 10. This solved the problem entirely. The keys to control refresh times are ArpCacheLife and ArpCacheMinReferencedLife. They are valid for Windows Server 2003 and XP, but not for Vista. They are documented here:
Now, I'm fairly positive that there were other unrelated issues originally (e.g. the flow control problems were real), but after getting rid of all of those it was this problem that proved to be the most troublesome. Very strange, and a relatively simple solution after 2 weeks of on-site debugging and hardware swapping.
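For anyone wanting to reproduce the registry change, here is a small C sketch that builds the corresponding reg.exe command lines. The two value names are the ones mentioned above; the key path (Tcpip\Parameters under HKLM) and the REG_DWORD type with a value in seconds are my assumptions based on Microsoft's TCP/IP registry documentation, so check them against your Windows version. The helper name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed location of the TCP/IP parameters (not stated in the post). */
#define TCPIP_PARAMS_KEY \
    "HKLM\\SYSTEM\\CurrentControlSet\\Services\\Tcpip\\Parameters"

/* Writes a `reg add` command that sets the DWORD value `name` to `seconds`.
 * Returns the command length, or -1 if `buf` is too small. */
int format_arp_reg_cmd(char *buf, size_t size, const char *name,
                       unsigned seconds)
{
    assert(buf != NULL && name != NULL && strlen(name) > 0);
    int n = snprintf(buf, size,
                     "reg add \"%s\" /v %s /t REG_DWORD /d %u /f",
                     TCPIP_PARAMS_KEY, name, seconds);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Calling it once with "ArpCacheLife" and once with "ArpCacheMinReferencedLife", both with 60, yields the two commands to run from an administrator prompt. As far as I know, changes to these TCP/IP parameters only take effect after a restart.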
So if anybody is writing any high bandwidth streaming network applications on Windows, especially when communicating to a large group of devices with fixed IPs, and you notice strange slowdown and packet dropping that regularly resets and goes away, try this first. Really, what I learned here is to not ignore things that work -- I had known the "Repair" button worked for quite some time but never actually bothered to research what it was doing for some reason.
J
MikeAThon
January 2nd, 2009, 05:16 PM
This is good information. Thank you for coming back and updating the thread with your solution.
ARP cache. Interesting. I wonder if you could share some of your theories on the possible reasons why clearing it would solve the problem.
Mike
codeguru.com
Copyright Internet.com Inc., All Rights Reserved.