I am a bad, bad man for not having blogged about this sooner. Last year I blogged about How I Chose SlimDX as my graphics/sound/etc library of choice. And I've been meaning to blog about how I chose my network library, and what I think of it, yet have failed to do so. This was particularly important to me because (spoiler alert) I really love the networking library I'm currently using and think the guy making it is also a real class act.
For the first versions of AI War, I was using an earlier version of the Lidgren Network Library. I found this easy to implement, and got the game up and running pretty quickly. The Lidgren library is a UDP-based solution, which means that it is a layer over an unreliable, connectionless protocol. UDP by itself makes no promise that a message arrives at all, or that messages arrive in the order they were sent, so applications that require both reliability and ordering (like AI War) need something like the Lidgren library on top of it. This library does a really good job of managing this whole process in a transparent way, fortunately, so you don't have to think about most of that sort of thing as a programmer working a layer above it.
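To make "reliable and in order over UDP" a bit more concrete, here is a minimal sketch of the receiving half of that idea: every message carries a sequence number, early arrivals are held back, and duplicates from resends are dropped. This is not Lidgren's actual code or API; the `OrderedReceiver` class and its method names are made up for illustration, and a real library also has to handle acknowledgements, resend timers, and sequence number wraparound.

```python
class OrderedReceiver:
    """Releases messages strictly in sequence order, buffering early arrivals.

    Hypothetical illustration only; not taken from any real library.
    """

    def __init__(self):
        self.next_seq = 0   # sequence number we are waiting for
        self.pending = {}   # early arrivals, keyed by sequence number

    def receive(self, seq, payload):
        """Accept one datagram; return the payloads that can now be delivered."""
        if seq < self.next_seq or seq in self.pending:
            return []       # duplicate, e.g. a resend of something we already have
        self.pending[seq] = payload
        delivered = []
        # Hand over every consecutive message starting at the one we expect.
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered


receiver = OrderedReceiver()
out = []
# Message 2 arrives early (twice), message 1 arrives late.
for seq, payload in [(0, "move"), (2, "attack"), (2, "attack"), (1, "build")]:
    out.extend(receiver.receive(seq, payload))
print(out)  # ['move', 'build', 'attack']
```

The application only ever sees the messages in the order they were sent, which is the part a library like this hides from you.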
There were a few bugs in the Lidgren library at the time, though, that eventually led me to create my own sockets-based code based on what I'd learned from working with Lidgren's code. Mainly I was concerned because the Lidgren library was flagged as beta software by its own creator, and I didn't want to have my game running on critical code that was beta. So I coded my own replacement.
Later Alpha Until A Few Months Post-Release
My sockets-based implementation was surprisingly easy to implement. I used TCP instead of UDP, you see, which meant that all the reliability and orderliness that the Lidgren library had to create on top of UDP was built right into the protocol. I mean, after all, isn't that the entire point of TCP as a protocol? There were dire warnings on many game programming sites about not using TCP for these sorts of purposes, so I was wary, but in one afternoon of programming I had a working prototype of my event-driven TCP-based code, and it worked perfectly.
In further testing, for months and months with my alpha testers, it continued to work perfectly. I stopped worrying about the TCP issue; clearly it was a non-issue. The game released in May, quietly at first, with most players playing solo at that time. Gradually it picked up an increasing playerbase, and more of them started wanting to play online together. Largely it was fine. I think it was around late June that I started getting scattered reports of really terrible performance in networking, however. Mostly it was people playing from the US to Europe, or between Australia and anywhere else.
My first thought was that these were simply connections that could not handle AI War -- AI War is a big game, and uses a fair bit of bandwidth when a lot of ships are being moved around at once. However, since this issue was a constant thing, even when no significant amount of command data was being passed, I was rather stumped for a month or so. I implemented a "Network skip" feature that adjusted network latency, and this largely let people work around the problem, but it then made the command lag pretty horrific when they used that feature.
I don't remember what or who eventually helped me realize what the problem was (it's been a while), but the key realization was this: TCP networking absolutely stinks in high packet-loss situations, which is what was interfering with these specific connections that were otherwise really fast and capable. In other words: all those articles warning against using TCP were (of course) absolutely right, and the issue had come back to bite me in a major way. All those months of testing were for naught, despite all the various network load conditions we had subjected it to, because we'd never subjected it to a high packet-loss environment.
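One commonly cited reason a reliable, in-order stream suffers under packet loss is that a single lost packet holds up everything sent after it until the resend arrives. I can't say for certain that this was the exact mechanism behind the reports above, so take this as a toy model under made-up numbers (100 ms latency, 300 ms resend delay), with a hypothetical `delivery_times` helper. TCP also slows its send rate when it sees loss, which this sketch leaves out.

```python
def delivery_times(send_times, lost, latency, retransmit_delay, in_order):
    """Return the time (ms) each message is handed to the application.

    lost: set of message indexes whose first copy is dropped; each of those
    is assumed to arrive retransmit_delay later than it otherwise would.
    in_order: if True, a message cannot be delivered before the ones sent
    ahead of it (stream semantics).
    """
    arrivals = [
        sent + latency + (retransmit_delay if index in lost else 0)
        for index, sent in enumerate(send_times)
    ]
    if not in_order:
        return arrivals
    delivered = []
    latest = 0
    for arrival in arrivals:
        # A message waits for every earlier message to be delivered first.
        latest = max(latest, arrival)
        delivered.append(latest)
    return delivered


send_times = [0, 50, 100, 150, 200]
stream = delivery_times(send_times, lost={1}, latency=100,
                        retransmit_delay=300, in_order=True)
independent = delivery_times(send_times, lost={1}, latency=100,
                             retransmit_delay=300, in_order=False)
print(stream)       # [100, 450, 450, 450, 450]
print(independent)  # [100, 450, 200, 250, 300]
```

With one dropped packet, the in-order stream delivers messages 2 through 4 late even though they arrived on time. On a link that drops packets constantly, that stall happens over and over.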
Returning To The Lidgren Network Library
So the very first thing I did was look up the Lidgren library again, since I had liked it so well in the alpha stages of AI War. I figured that it might have had a number of updates in the intervening 6ish months (it had -- substantial ones), and I also figured that if there were errors in it, they would still be less of an issue than the packet-loss issue with TCP. Basically, I was running monthly public betas, and I could let everyone beat on the Lidgren library in the game for a solid month, and see what shook out.
Another afternoon spent implementing the very-differently-organized Lidgren library (all changes very much for the better), and it was up and running. I pushed it out to the current beta release, and the affected parties noted that their issues with latency had all disappeared. And there was much rejoicing! All seemed to be well. It was working well for somewhere between a few dozen and a few hundred people (always hard to know, since not everyone who downloads beta versions posts or registers on the forums), and it kept doing so for the remainder of that beta month. That beta then turned into the official version of the game, and the Lidgren library has been an integral part of AI War ever since. But my story isn't quite done yet.
Sending Large Messages
Sometime after that first release that made the Lidgren library an official part of the AI War game, I started getting reports from players that the initial connection was super slow for them (when they would be transferring between 500KB and 2MB of data for the initial "full sync" of the game), but that once they were connected, it was fine. And it wasn't just a little bit slow -- it was taking upwards of 30 minutes for a few of them, sometimes.
The strangest thing was that it wasn't all the time, even though it was mostly centered in that same crowd that had been having the issues with my TCP implementation. And weirdest of all, these same folks had had very speedy and nice sending of the initial game state under the TCP solution, and it was just the in-game performance that had then stunk. Groan. There were suggestions that I should use the TCP for file send, and UDP for the game itself, but that would have been a nightmare to code and maintain on a number of levels (dual connections that are really the same connection, but used for different purposes -- yuck).
So after my initial investigation of the issue, I had basically determined that any message above a certain size was causing a performance hit in the library, but that breaking that same message up into smaller bits was no problem. I talked to Michael Lidgren, the author of the eponymous library, via email. Basically what it boiled down to, in talking to him, was that the way the library had to manage resending of lost packets, combined with the way it was breaking up the data that I was unceremoniously dumping into its buffers, was causing a lot of extra re-sends that were problematic.
This was something he moved to look into (I could have done the same, since his library is open source, and I did poke around in it some -- but despite the fact that his code is extremely well organized and commented and formatted, it is nonetheless quite a complex library simply by nature). It's actually quite possible this was largely a bug in the network drivers on the specific cards; in the end, that wound up looking like the most likely explanation. In the meantime, with some suggestions from Michael, I implemented some logic on my end that put less load on the network card by spacing out that content slightly, and the result was perfect.
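The "break it up and space it out" idea can be sketched roughly like this. This is not the actual AI War code; the function names, the 1024-byte chunk size, and the 10 ms pause are all placeholder assumptions, and `send` stands in for whatever call hands one message to the network layer.

```python
import time


def split_into_chunks(data, chunk_size):
    """Cut a byte string into pieces of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def send_paced(data, send, chunk_size=1024, pause_seconds=0.01, sleep=time.sleep):
    """Send data as a series of small messages with a short pause between them.

    send:  callable that transmits one chunk (hypothetical stand-in).
    sleep: injectable so the pacing can be tested without really waiting.
    Returns the number of chunks sent.
    """
    chunks = split_into_chunks(data, chunk_size)
    for index, chunk in enumerate(chunks):
        send(chunk)
        if index < len(chunks) - 1:
            sleep(pause_seconds)  # give the link a moment before the next piece
    return len(chunks)


# Dry run: record what would be sent and how long we would have paused.
sent = []
pauses = []
count = send_paced(b"x" * 2500, sent.append, chunk_size=1024,
                   pause_seconds=0.01, sleep=pauses.append)
print(count, [len(c) for c in sent], pauses)  # 3 [1024, 1024, 452] [0.01, 0.01]
```

The receiving side then has to stitch the pieces back together, which is straightforward as long as the chunks are sent reliably and in order.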
Until a few months later...
The concept of MTU was not something I was familiar with, not being formally trained in any way for network programming. So when a few players later started having some disconnections, I was not sure what the issue was, but it seemed network-card related. The Lidgren library had been working fine for months for an ever-wider group of players, after all, so I doubted it was a problem there.
As fortune would have it, the player with this problem happened to be extraordinarily knowledgeable about networking in general, and with the help of the debug data that is built into the Lidgren library (which I make available through a key combo in AI War), he was able to figure out that this was an MTU problem. This was not a bug at all, per se, but the default value in the Lidgren library at that time was high enough that it would cause problems with a super tiny minority of players and network card/router combos.
Fortunately, I was able to set this via a simple property in AI War, to override the Lidgren library defaults, and the user's problem vanished. Voila! I had contacted Michael Lidgren again when this specific problem came up, since I'd been through a ton of debugging steps with the player and had videos of what the network was doing, and Michael had some more good suggestions that helped lead to the discovery of the MTU issue. So I contacted him again to let him know what the player had figured out and what I had implemented as an override in my program, and suggested that he consider lowering the default MTU in his library. He agreed, and that was another problem solved.
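For anyone else who hasn't run into MTU before, the arithmetic looks roughly like this. The MTU is the largest packet a link will carry in one piece; with IPv4, 20 bytes of that go to the IP header (without options) and 8 bytes to the UDP header, and whatever is left is room for your data. The helper names below are made up, the MTU values are just common examples (1500 for Ethernet, 576 as a conservative figure), and the `app_header` parameter is a placeholder for any per-packet overhead a library adds; none of this reflects the Lidgren library's actual defaults.

```python
IPV4_HEADER_BYTES = 20  # minimum IPv4 header, no options
UDP_HEADER_BYTES = 8


def max_udp_payload(mtu):
    """Largest UDP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - UDP_HEADER_BYTES


def datagrams_needed(total_bytes, mtu, app_header=0):
    """How many datagrams it takes to carry total_bytes at a given MTU.

    app_header: hypothetical per-datagram overhead added by a library.
    """
    payload = max_udp_payload(mtu) - app_header
    if payload <= 0:
        raise ValueError("MTU too small for the headers")
    return -(-total_bytes // payload)  # ceiling division


full_sync = 2 * 1024 * 1024  # a 2MB full sync, the upper figure from earlier
for mtu in (1500, 576):
    print(mtu, max_udp_payload(mtu), datagrams_needed(full_sync, mtu))
```

If the library builds packets bigger than some hop on the path can carry, they get fragmented or dropped along the way, which fits the kind of trouble this player was seeing. A lower setting costs a few more packets but fits through more links.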
It's been around 5 months now, during which AI War has had its playerbase grow 400%, with no more issues since those two. Additionally, those two were really unusual and outside the bounds of even what most games would require in terms of load spikes (AI War is quite hefty on the network when it comes to moving lots of units or doing a full sync). As best we could determine in the end, they were less issues with the Lidgren library itself and more unfortunate confluences with hardware that was nonstandard in some way.
In other words, the Lidgren network library has been rock solid since I implemented it into a post-release beta version of the game, with the only two "issues" relating to it not really being issues with it directly, but with network hardware under it. The reason that I related those issues at all was because I think it illustrates the character and devotion of Michael Lidgren, as he helped to solve both of them despite suspecting that they were not really problems with his library from the get-go.
The last thing I want is for this post to result in a slew of simple help requests to Michael, so don't take this to mean he's going to help you set up networking or learn the basics of how to use his library. I made sure not to bother him until I had collected a ton of data that I thought he was the only one who could reasonably interpret, about really specific complex issues that I'd already spent significant time trying to debug on my own. I'm pointing out his devotion because any team that is worried they might implement the Lidgren library and then just be stuck if they discover some sort of rare but fatal bug (of which I have seen none, anyway) can rest easy that the creator of the library is very much on the ball and is interested in making sure his library is rock solid. He's not a tutor or a guide, and I wouldn't expect him to give out a ton of free support (he's giving this out for free, come on), but it's also not one of those headless, abandoned projects that no one cares about or updates anymore.
For any team working in C# and looking for a UDP networking library, I can't recommend the Lidgren one enough. It's been rock solid for a demanding game like AI War, the creator stands behind the library, and it's small and efficient -- and open source, if you need to make changes of your own. Given all of the various sample projects he includes with it, you can probably get something up and running in an afternoon, or at least a day or two, as well. This is precisely why I've felt so bad for not making this blog post before now, because there are few products (free ones, no less) which I can recommend so wholeheartedly and unequivocally.