Wednesday, February 24, 2010

Choosing A Network Library in C#

I am a bad, bad man for not having blogged about this sooner. Last year I blogged about How I Chose SlimDX as my graphics/sound/etc library of choice. And I've been meaning to blog about how I chose my network library, and what I think of it, yet have failed to do so. This was particularly important to me because (spoiler alert) I really love the networking library I'm currently using and think the guy making it is also a real class act.

Early Alpha
For the first versions of AI War, I was using an earlier version of the Lidgren Network Library. I found this easy to implement, and got the game up and running pretty quickly. The Lidgren library is a UDP-based solution, which means that it is a layer over an unreliable, connectionless protocol. That means that for applications that require reliability and need all messages to arrive in order (like AI War), you need something like the Lidgren library on top of UDP. This library does a really good job of managing this whole process in a transparent way, fortunately, so you don't have to think about most of that sort of thing as a programmer working a layer above it.
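
To make the "layer over UDP" idea a bit more concrete, here is a minimal sketch of the receiving half of such a layer. It is written in Python for brevity and is not Lidgren's actual code: the class name, the sequence-number scheme, and the ack method are all assumptions for illustration. Each datagram carries a sequence number, early arrivals are buffered, duplicates from resends are dropped, and messages are only handed to the game in order.

```python
class OrderedReceiver:
    """Delivers messages in sequence order, buffering any that arrive early."""

    def __init__(self):
        self.next_seq = 0   # sequence number we are waiting for
        self.pending = {}   # early arrivals, keyed by sequence number

    def receive(self, seq, payload):
        """Accept one datagram; return the payloads that are now deliverable."""
        if seq < self.next_seq or seq in self.pending:
            return []       # duplicate (for example from a resend): ignore it
        self.pending[seq] = payload
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready

    def ack(self):
        """Highest in-order sequence number received so far, or -1 if none."""
        return self.next_seq - 1


receiver = OrderedReceiver()
# Datagrams arrive out of order, and sequence number 1 shows up twice.
arrivals = [(1, "b"), (0, "a"), (1, "b"), (3, "d"), (2, "c")]
delivered = []
for seq, payload in arrivals:
    delivered.extend(receiver.receive(seq, payload))
```

The sending half (not shown) would keep each message around until the matching ack comes back and resend it after a timeout; that bookkeeping is the part a library like this takes off your hands.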

There were a few bugs in the Lidgren library at the time, though, that eventually led me to create my own sockets-based code based on what I'd learned from working with Lidgren's code. Mainly I was concerned because the Lidgren library was flagged as beta software by its own creator, and I didn't want to have my game running on critical code that was beta. So I coded my own replacement.

Later Alpha Until A Few Months Post-Release
My sockets-based implementation was surprisingly easy to implement. I used TCP instead of UDP, you see, which meant that all that reliability and orderliness that the Lidgren library was having to create when running on top of UDP was built right into the protocol. I mean, after all, isn't that the entire point of TCP as a protocol? There were dire warnings on many game programming sites about not using TCP for these sorts of purposes, so I was wary, but in one afternoon of programming I had a working prototype of my event-driven TCP-based code, and it worked perfectly.

In further testing, for months and months with my alpha testers, it continued to work perfectly. I stopped worrying about the TCP issue; clearly it was a non-issue. The game released in May, quietly at first, with most players playing solo at that time. Gradually it picked up an increasing playerbase, and more of them started wanting to play online together. Largely it was fine. I think it was around late June that I started getting scattered reports of really terrible performance in networking, however. Mostly it was people playing from the US to Europe, or between Australia and anywhere else.

My first thought was that these were simply connections that could not handle AI War -- AI War is a big game, and uses a fair bit of bandwidth when a lot of ships are being moved around at once. However, since this issue was a constant thing, even when no significant amount of command data was being passed, I was rather stumped for a month or so. I implemented a "Network skip" feature that adjusted network latency, and this largely let people work around the problem, but it then made the command lag pretty horrific when they used that feature.

I don't remember what or who eventually helped me realize what the problem was (it's been a while), but the key realization was this: TCP networking absolutely stinks in high packet-loss situations, which is what was interfering with these specific connections that were otherwise really fast and capable. (With TCP, a single lost packet holds up everything queued behind it until it is resent, and the protocol also throttles its send rate when it sees loss.) In other words: all those articles warning against using TCP were (of course) absolutely right, and the issue had come back to bite me in a major way. All those months of testing were for naught, despite all the various network load conditions we had subjected it to, because we'd never subjected it to a high packet-loss environment.

Drat.

Returning To The Lidgren Network Library
So the very first thing I did was look up the Lidgren library again, since I had liked it so well in the alpha stages of AI War. I figured that it might have had a number of updates in the intervening 6ish months (it had -- substantial ones), and I also figured that if there were errors in it, it would still be less of an issue than the packet-loss issue with TCP. Basically, I was running monthly public betas, and I could let everyone beat on the Lidgren library in the game for a solid month, and see what shook out.

Another afternoon spent implementing the very-differently-organized Lidgren library (all changes very much for the better), and it was up and running. I pushed it out to the current beta release, and the affected parties noted that their issues with latency had all disappeared. And there was much rejoicing! All seemed to be well. It was working well for somewhere between a few dozen and a few hundred people (always hard to know, since not everyone who downloads beta versions posts or registers on the forums), and it kept doing so for the remainder of that beta month. That beta then turned into the official version of the game, and the Lidgren library has been an integral part of AI War ever since. But my story isn't quite done yet.

Sending Large Messages
Sometime after that first release that made the Lidgren library an official part of the AI War game, I started getting reports from players that the initial connection speed was super slow for them (when they would be transferring between 500KB and 2MB of data for the initial "full sync" of the game), but that after they were connected, it was fine. And it wasn't just a little bit slow -- it was taking upwards of 30 minutes for a few of them, sometimes.

The strangest thing was that it wasn't all the time, even though it was mostly centered in that same crowd that had been having the issues with my TCP implementation. And weirdest of all, these same folks had had very speedy and nice sending of the initial game state under the TCP solution, and it was just the in-game performance that had then stunk. Groan. There were suggestions that I should use the TCP for file send, and UDP for the game itself, but that would have been a nightmare to code and maintain on a number of levels (dual connections that are really the same connection, but used for different purposes -- yuck).

So after my initial investigation of the issue, I had basically determined that any data that was above a certain size was causing a performance impact in the library, but that breaking that same message up into smaller bits was no problem. I talked to Michael Lidgren, the author of the eponymous library, via email. Basically what it boiled down to, in talking to him, was that the way the library was having to manage resending of lost packets, combined with the way that it was breaking up the data that I was unceremoniously dumping into its buffers, was causing a lot of extra re-sends that were problematic.

This was something he moved to look into (I could have done the same, since his library is open source, and I did poke around in it some -- but despite the fact that his program is extremely well organized and commented and formatted, it is nonetheless quite a complex library simply by nature). It's actually quite possible this was largely a bug in the network drivers on the specific cards, which is what it wound up looking like was most likely in the end. In the meantime, with some suggestions from Michael, I implemented some logic on my end that put less load on the network card by spacing out that content slightly, and the result was perfect.
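
For the curious, the "spacing out" idea looks roughly like the sketch below. This is Python rather than the C# I actually wrote, and it is not the AI War code: the function name, the 8 KB piece size, and the 10 ms pause are made-up values for illustration only. The point is simply that the big blob is handed to the network layer a piece at a time, with a short breather between pieces, instead of all at once.

```python
import time


def send_paced(payload, send, chunk_size=8 * 1024, delay_s=0.01, sleep=time.sleep):
    """Hand `payload` to `send` in pieces, pausing between pieces.

    `send` is whatever callable pushes bytes into the network layer;
    `sleep` is injectable so the pacing can be tested without waiting.
    Returns the number of pieces sent.
    """
    sent = 0
    for offset in range(0, len(payload), chunk_size):
        if offset:
            sleep(delay_s)  # give the lower layers time to drain
        send(payload[offset:offset + chunk_size])
        sent += 1
    return sent


outbox = []   # stands in for the network layer
pauses = []   # records the pauses instead of actually sleeping
count = send_paced(b"x" * 20000, outbox.append, sleep=pauses.append)
```

In a real game you would not block the main loop with a sleep; you would send one piece per tick or per timer callback, but the shape of the logic is the same.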

Until a few months later...

MTU
The concept of MTU was not something I was familiar with, not being formally trained in any way for network programming. So when a few players later started having some disconnections, I was not sure what the issue was, but it seemed network-card related. The Lidgren library had been working fine for months for an ever-wider group of players, after all, so I doubted it was a problem there.

As fortune would have it, the player with this problem happened to be extraordinarily knowledgeable about networking in general, and with the help of the debug data that is built into the Lidgren library (which I make available through a key combo in AI War), he was able to figure out that this was an MTU problem. This was not a bug at all, per se, but the default value in the Lidgren library at that time was high enough that it would cause problems with a super tiny minority of players and network card/router combos.

Fortunately, I was able to set this via a simple property in AI War, to override the Lidgren library defaults, and the user's problem vanished. Voila! I had contacted Michael Lidgren again when this specific problem came up, since I'd been through a ton of debugging steps with the player and had videos of what the network was doing, and Michael had had some more good suggestions that helped lead to the discovery of the MTU issue. So I contacted him again to let him know what the player had figured out and what I had implemented as an override in my program, and suggested that he consider lowering the default MTU in his library. He agreed, and that was another problem solved.

Conclusion
It's been around 5 months now, during which AI War has had its playerbase grow 400%, with no more issues since those two. Additionally, those two were really unusual and outside the bounds of even what most games would require in terms of load spikes (AI War is quite hefty on the network when it comes to moving lots of units or doing a full sync), and really they were less issues with the Lidgren library than unfortunate confluences with hardware that was nonstandard in some way, we basically determined in the end.

In other words, the Lidgren network library has been rock solid since I implemented it into a post-release beta version of the game, with the only two "issues" relating to it not really being issues with it directly, but with network hardware under it. The reason that I related those issues at all was because I think it illustrates the character and devotion of Michael Lidgren, as he helped to solve both of them despite suspecting that they were not really problems with his library from the get-go.

The last thing I want is for this post to result in a slew of simple help requests to Michael, so don't take this to mean he's going to help you set up networking or learn the basics of what is going on with how to use his library. I made sure not to bother him until I had collected a ton of data that I thought he was the only one who could reasonably interpret, about really specific complex issues that I'd already spent significant time trying to debug on my own. I'm pointing out his devotion because any team that is worried they might implement the Lidgren library and then just be stuck if they discover some sort of rare but fatal bug (of which I have seen none, anyway) can rest easy that the creator of the library is very much on the ball and is interested in making sure his library is rock solid. He's not a tutor or a guide, and I wouldn't expect him to give out a ton of free support (he's giving this out for free, come on), but it's also not one of those headless, abandoned projects that no one cares about or updates anymore.

For any team working in C# and looking for a UDP networking library, I can't recommend the Lidgren one enough. It's been rock solid for a demanding game like AI War, the creator stands behind the library, and it's small and efficient -- and open source, if you need to make changes of your own. Given all of the various sample projects he includes with it, you can probably get something up and running in an afternoon, or at least a day or two, as well. This is precisely why I've felt so bad for not making this blog post before now, because there are few (free, no less) products which I can recommend so wholeheartedly and unequivocally.

19 comments:

  1. I second that! I've been using the Lidgren Library for a few years now. It's been very easy to use and functional nonetheless.

  2. What was the MTU size that was causing problems and what did you go with? John Carmack mentioned that 1200 works well today.

    You might also want to look at http://www.linuxfoundation.org/collaborate/workgroups/networking/netem to simulate packet loss. I have found it surprisingly helpful in reproducing the impossible to reproduce network bug.

  3. Hmm, seems to have cut off the link.

    http://www.linuxfoundation.org/
    collaborate/workgroups/
    networking/netem

  4. The default MTU in Lidgren was 1459, but reducing it to 1400 solved the problems. It's interesting that Carmack was suggesting 1200; that's significantly lower than most (all?) networking hardware I know of. Though, again, I'm far from an expert when it comes to that sort of thing.

  5. Interestingly the default MTU in gen3 of the library is 1406 bytes (with an explanation in the comments of NetPeerConfiguration)

  6. Oh, great to know, I'll have to check that out. Thanks for stopping by!

  7. Well, you convinced me.

    One question though: I am vaguely familiar with the concept of MTU, but I don't understand how it could cause problems. Especially, since it is only a maximum, how making it even *lower* could resolve those problems?

    Also, I imagine the problem had something to do with fragmenting packets, but... why is that a problem? If a packet is fragmented and one-half of it is dropped, do *both* halves of it need to be resent?

    Thanks for the lesson, Chris; hope to read more from you in the future :)

  8. "I am vaguely familiar with the concept of MTU, but I don't understand how it could cause problems. Especially, since it is only a maximum, how making it even *lower* could resolve those problems?"

    Well -- and hopefully this isn't the blind leading the blind here -- this is my understanding of it.

    Suppose that your desired packets of data range in size, but are generally something like 2000-4000 bytes or larger. Now suppose you have an MTU setting of 1412 in your application. This means that a 2000 byte packet will get split into one that's 1412, and another that's 588. Inclusive of headers, I presume, so actually smaller when considering the data itself, but I'll ignore headers for the sake of simplicity in the example.

    So far so good, right? Assuming that the MTU on your network card, and every router between you and the destination network card, are all 1412 or smaller, the packets just get passed along and that's that.

    But what if one of those routers has an MTU of 1400? That means that suddenly your packet of size 1412 is too large to fit through that router, and so that router is going to split it in some manner: maybe making a packet of 1400 and another of 12, I don't know.

    Now the destination network card, which you'd expect would be getting two packets, actually is getting three. Given the nature of networking, it should just reassemble these and be done with it, right? The worst thing that's happening here is that you're introducing some slight extra lag because you COULD have had two larger packets split more evenly rather than three packets that are smaller (so you're wasting transmission bandwidth with excess headers, and whatever else goes into splitting these packets in terms of messaging about the split).
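
    The arithmetic in that example can be sketched like so. This is a toy model in Python, with function names I made up, and it ignores headers just like the numbers above do. Real IP fragmentation also has alignment rules for where the split can fall, so an actual router would not cut at exactly 1400 and 12, but the packet count comes out the same.

```python
def fragment(size, mtu):
    """Split `size` bytes into pieces no larger than `mtu` (headers ignored)."""
    pieces = []
    while size > 0:
        piece = min(size, mtu)
        pieces.append(piece)
        size -= piece
    return pieces


def through_path(size, app_mtu, router_mtu):
    """Return (what the application sends, what the destination receives)."""
    sent = fragment(size, app_mtu)
    arrived = []
    for piece in sent:
        # A router with a smaller MTU splits any piece that does not fit.
        arrived.extend(fragment(piece, router_mtu))
    return sent, arrived


sent, arrived = through_path(2000, app_mtu=1412, router_mtu=1400)
# sent is [1412, 588]; arrived is [1400, 12, 588]
```

    With the application MTU lowered to 1400 or less, the same call gives identical lists for what was sent and what arrived, which is the whole reason lowering the setting helps.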

    So that's a minor inefficiency, but with broadband speeds probably not noticeable. And that's generally been the case for most folks, on whatever network they are on.

    The problem comes when one of those routers does... something else. I'm not sure what the router does, if it just rejects packets larger than its MTU or if it splits them incorrectly, or what. I imagine it varies by bad router or driver. But the effect on the networking library is that it sees the packet not get through, and does a resend... and a resend... and a resend... which makes it look to the game like the connection has died even though it's still there. The smaller bits of data still get through, but anything larger than the MTU is disappearing into the ether somewhere along the way.

    More often than not, when players have this rare problem, simply upgrading their network drivers and/or their personal router firmware solves this issue. It's some sort of bug in one or the other. But sometimes they can't do that, or the problem is at their ISP instead, and so that's why it is handy for them to be able to lower the MTU to get around whatever the bug is in the other hardware's software.

    Hope that helps, and glad you enjoyed the post!

  9. Thank you! I am using Lidgren now for the past week and it’s really not that hard. I still had my doubts about it and was busy doing some research into other network library options. Then I stumbled onto your blog. I can now rest assured that in 6 or more months later I won’t be in a total disaster with regards to network coding. Thank you again for your advice and tips.

    Another website that helped me a lot with getting started with Lidgren: http://xnacoding.blogspot.com/

  10. A library not already mentioned here is networkComms.net. It has a plethora of features, such as TCP, UDP, IPv6, serialisation, compression, encryption etc. More than lidgren anyways. There is a simple article on how to create a quick client server application here.

  11. Christopher, thank you for the great article!

    Could you please describe what delivery methods (ReliableSequenced, ReliableOrdered, etc) you used in AI War and why?

    And could you also write about your main loop: does your server receive messages in a separate thread? do clients register callback for accepting messages from server?
    Did you use Lidgren's SequenceChannels?

    I'm new in game dev and your answers will help me to understand the basics.

    Thank you!

  12. "Christopher, thank you for the great article!"

    My pleasure!

    "Could you please describe what delivery methods (ReliableSequenced, ReliableOrdered, etc) you used in AI War and why?"

    It's been a long time since I touched any of this, as it "just works," but I use NetChannel.ReliableInOrder1. That way I don't have to worry about state or resends myself: I know that whatever message is received is the very next message, and it is in order.

    That has the negative of being slower to get messages in some cases, but for my games it is rare that synchronous state would not work.

    In our later games A Valley Without Wind 1 and 2, it would have been more traditional to go with less-sequenced net channels, but we already had something working and thus ran with that with other workarounds higher up in the code.

    "And could you also write about your main loop: does your server receive messages in a separate thread? do clients register callback for accepting messages from server?"

    I avoid callbacks whenever possible. I actually modified the code some IIRC so that it could just continue using the heartbeat messages. Doing anything multithreaded has to be done with care, because you run into state corruption issues sometimes if multiple things try to mess with the same variable at once. So the locking and unlocking there is something that has to be done carefully, and also requires CPU time.

    For most of this I believe lidgren runs in its own thread and then reports its data to variables on the main thread. I then read that data on the main thread and do whatever with it, just in a regular polling interval. I do handle locking and so forth in other parts of my programming (mainly AI), but I felt like for networking that was not appropriate since networking has to poll so very frequently.
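
    A rough sketch of that polling pattern, in Python rather than C# and with invented names (this is neither Lidgren's code nor AI War's): the network thread does nothing but drop received messages into a thread-safe queue, and the main loop drains that queue without blocking whenever it gets around to it.

```python
import queue
import threading

incoming = queue.Queue()  # thread-safe hand-off between the two threads


def network_thread(messages):
    """Stand-in for the library's receive thread: it only enqueues."""
    for msg in messages:
        incoming.put(msg)


def poll_messages(limit=100):
    """Called from the main loop: drain what has arrived, without blocking."""
    drained = []
    while len(drained) < limit:
        try:
            drained.append(incoming.get_nowait())
        except queue.Empty:
            break
    return drained


worker = threading.Thread(target=network_thread, args=(["move", "attack", "sync"],))
worker.start()
worker.join()              # only here so the example is deterministic
handled = poll_messages()  # game state is touched on the main thread only
```

    The limit argument is there so that one burst of traffic cannot stall a frame; anything left over just gets picked up on the next poll.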

    But that was me, and I am very very far from an expert at that level of network communications.

    "Did you use Lidgren's SequenceChannels?"

    I don't _think_ so, but bear in mind this is code I have not had to touch in four years, heh.

    "I'm new in game dev and your answers will help me to understand the basics."

    Hope it helps, but bear in mind this is all from the very foggy past for me, heh. All my games are still networked, but we don't actually mess with the transport layer at all anymore because it's been that rock-solid. Go Lidgren!

  13. Thank you for such a detailed answer!

    Last question: what is the interval in AI thread at server? I mean - how often server updates the game scene?

    Thank you again!

  14. "Thank you for such a detailed answer!"

    My pleasure. :)

    "Last question: what is the interval in AI thread at server? I mean - how often server updates the game scene?"

    For the AI thread, it's fairly "slow" because this is a higher-order thinking form of AI. In other words, low-level AI like pathfinding and basic target selection are all handled on the main thread in realtime -- that is critical. Think of this like "instinct."

    But secondary information is passed to the AI thread, and it does heavy number crunching to make larger strategic decisions that don't require instant response. Think of this like "conscious mind."

    Of course neither are anything of the sort, but that's how it's modeled.

    The secondary thread thus gets data from the main thread in a rolling queue, because there are often tens of thousands of entries in there, and the transfer is usually in batches of a thousand or two maybe ten times a second. That varies based on the load at the time, as there is a lot of scaling to that based on load; so the numbers I just said aren't quite exact, but it's a good midrange number.

    The AI thread then transfers orders back around five times a second. Those don't have to be terribly fast, since again these are more strategic concerns rather than moment to moment. Still it's way faster than a human. :)

    I should also note that the "instinct" level of AI applies equally to human and enemy ships, so that's the stuff that makes it so that you don't have to tell them every little thing. Whereas the AI thread is only for your opponent.

  15. Hello, Christopher, and thanks for such a great overview!

    I'm starting to use Lidgren in my own game now, and I have a really daunting task to code a very large data sending on first connect.

    In my game everything is player-generated, down to textures and terrain noises, so I need to send this bundle to client when he connects to the server.

    The bundle can weigh anywhere from 1MB to 100MB and beyond, and I really don't know what would be the best way to send it over from server to the client.

    Since this is the only article I found that mentions large messages, could you please spare some advice?

    What in your opinion is the best way to send large data with Lidgren? I'm thinking of two ways now, send over a zipped file (I don't know if this is even possible with lidgren) or send a superbig serialized object. I don't know how to do that either. In other words, I don't know anything. Could you please suggest the first step for me? :)

  16. Hey there,

    Glad you found the article helpful. There are a few things to think about when you're thinking about large data:

    1. First step is to create the data. This could mean anything (more on this in a bit).

    2. Second step is to SEND the data. This is the only part where Lidgren is involved. You could do this in large or small chunks.

    3. Third step is to take the data from step 2, and possibly reassemble it or else do whatever else.

    That's the broad overview of your process. Now to the specifics:

    Zipping: During step 1 above, you can zip anything you like. There are loads of libraries around for doing zipping and unzipping in a variety of formats at a variety of speeds.

    The important thing about zipping is that you do this in step 1, and then you simply unzip in step 3. Lidgren is not involved, nor does it know what you are doing. It's just passing bytes across that you hand it.

    Analogy: when you are wanting to send a physical package, you put some things in a box. You can squish stuff in or leave it to rattle around, it doesn't matter. All the delivery man knows is that he is carrying a box, end of story. Lidgren is that delivery man.

    Step 2: You really want to break this up into chunks. Lidgren can do this for you to some extent, but you'll want to help it out. I suggest breaking it into something like 100KB or 500KB chunks, and sending those one at a time through the Lidgren network. Once one has sent, send the next. On the other end, reassemble them.

    Analogy: Handing the delivery man 30 smaller packages, or handing him one truck-sized package. Sure he might be able to manage the huge thing, but that puts a burden on him that he may have trouble with, and if you CAN put things in smaller packages (you easily can in your case), then you may as well keep things simple for him.



    In all, you basically are doing the serialization so that you get a byte array at the end of it. Then you compress that or don't. Then you send 100KB to 500KB of that byte array into Lidgren at a time. Lidgren will make sure it gets to the other end in one piece, and in the right order.

    The other end takes the data that is coming in, and puts it into something like a MemoryStream on its end, recombining it just like it was on the sending machine. If it's supposed to unzip this after everything is all done, then go for it. Heck, if you want to encrypt/decrypt as well, you can.
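
    Here is the whole pipeline as a small sketch. It is Python rather than C# (io.BytesIO plays the role of a MemoryStream), the function names are invented, and the actual send over Lidgren is left out entirely: the list of chunks just stands in for "whatever arrives on the other end". It also leans on the transport delivering chunks reliably and in order, as described above.

```python
import io
import os
import zlib

CHUNK_SIZE = 100 * 1024  # 100 KB, the low end of the range suggested above


def make_chunks(data, chunk_size=CHUNK_SIZE):
    """Steps 1 and 2: compress the serialized bytes, then slice them up."""
    packed = zlib.compress(data)
    return [packed[i:i + chunk_size] for i in range(0, len(packed), chunk_size)]


def reassemble(chunks):
    """Step 3: append chunks in arrival order, then decompress the result."""
    buffer = io.BytesIO()
    for chunk in chunks:
        buffer.write(chunk)
    return zlib.decompress(buffer.getvalue())


# Random bytes stand in for a serialized bundle; they barely compress,
# so this payload is guaranteed to span several chunks.
bundle = os.urandom(300_000)
chunks = make_chunks(bundle)   # each chunk would be one message to send
restored = reassemble(chunks)  # what the receiving side ends up with
```

    In practice you would also send the total chunk count (or total length) in the first message, so the receiver knows when it has everything and can show a progress bar.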

    And that's it! Hope that helps.

  17. Wow! Thanks for such comprehensive analogies, I understand most of the process now!

