Rock, Paper, Azure is back…

Rock, Paper, Azure (RPA) is back!   For those of you who have played before, be sure to get back in the game!  If you haven’t heard of RPA, check out my past posts on the subject.   In short, RPA is a game that we built in Windows Azure.  You build a bot that plays a modified version of rock, paper, scissors on your behalf, and you try to outsmart the competition.  Over the summer, we ran a Grand Tournament where the first place prize was $5,000!   This time, we’ve decided to change things a bit and do both a competition and a sweepstakes.   The game, of course, is a competition because you’re trying to win.  But we heard from many who didn’t want to get in the game because the competition was a bit fierce.   Competition:  from Nov 25 through Dec 16, each Friday, we’ll give the top 5 bots a $50 Best Buy gift card.  If you’re the top bot each Friday, you’ll get a $50 gift card each Friday.  Sweepstakes:  for all bots in the game on Dec 16th, we’ll run the final round and then select a winner at random to win a trip to Cancun.   We’re also giving away an Acer Aspire S3 laptop, a Windows Phone, and an Xbox Kinect bundle.  Perfect timing for the holidays! Check it out at!

Geo-Load Balancing with the Azure Traffic Manager

One of the great new features of the Windows Azure platform is the Azure Traffic Manager, a geo load balancer and durability solution for your cloud solutions. For any large website, managing traffic globally is critical to the architecture for both disaster recovery and load balancing.

When you deploy a typical web role in Azure, each instance is automatically load balanced at the datacenter level. The Azure Fabric Controller manages upgrades and maintenance of those instances to ensure uptime. But what if you want to have a web solution closer to where your users are? Or automatically direct traffic to a location in the event of an outage?

This is where the Azure Traffic Manager comes in, and I have to say, it is so easy to set up that it boggles my mind: in this day and age, individuals can prop up large, redundant, durable, distributed applications in seconds that would rival the infrastructure of the largest websites.

From within the Azure portal, the first step is to click the Virtual Network menu item. On the Virtual Network page, we can set up a number of things, including the Traffic Manager. Essentially, the goal of the first step is to define which Azure deployments we’d like to add to our policy, what type of load balancing we’ll use, and finally a DNS entry that we’ll use as a CNAME.

We can route traffic for performance (best response time based on where the user is located), failover (traffic sent to the primary and only to the secondary/tertiary if the primary is offline), and round robin (traffic is equally distributed). In all cases, the traffic manager monitors endpoints and will not send traffic to endpoints that are offline.

I had someone ask me why you’d use round robin over routing based on performance. There’s one big case where that may be desirable: if your users are very geography centric (or inclined to hit your site at a specific time), you’d likely see patterns where one deployment gets maxed out while another does not.
To ease the traffic spikes to one deployment, round robin would be the way to go.  Of course, an even better solution is to combine traffic shaping based on performance with Azure scaling to meet demand. In the above image, let’s say I want to create a failover for the Rock Paper Azure botlab (a fairly silly example, but it works).   I first added my main botlab (deployed to South Central) to the DNS names, and then added my instance deployed to North Central:   From the bottom of the larger image above, you can see I’m picking a DNS name of as the public URL.  What I’d typically do at this point is go in to my DNS records, and add a CNAME, such as “” –> “”. In my case, I want this to be a failover policy, so users only get sent to my North Central datacenter in the event the south central instance is offline.  To simulate that, I took my south central instance offline, and from the Traffic Manager policy report, you’d see something like this: To test, we’ll fetch the main page in IE: … and we’re served from North Central.  Of course, the user doesn’t know (short of a traceroute) where they are going, and that’s the general idea.  There’s nothing stopping you from deploying completely different instances except of course for the potential end-user confusion! But what about database synchronization?   That’s a topic for another post …
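To make the three policy types concrete, here is a minimal sketch of how each one might pick an endpoint. This is only a model of the behavior described above, not how Traffic Manager is implemented; the `Endpoint` type, the `LatencyMs` value, and the endpoint names are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum RoutingPolicy { Performance, Failover, RoundRobin }

// Hypothetical endpoint record. LatencyMs stands in for
// "response time based on where the user is located".
public sealed class Endpoint
{
    public string Name { get; set; }
    public bool Online { get; set; }
    public int LatencyMs { get; set; }
}

public sealed class PolicySketch
{
    private int roundRobinCursor;

    // Endpoints are assumed to be listed in priority order (primary first).
    public Endpoint Choose(RoutingPolicy policy, IList<Endpoint> endpoints)
    {
        // Under every policy, offline endpoints never receive traffic.
        List<Endpoint> online = endpoints.Where(e => e.Online).ToList();
        if (online.Count == 0) return null;

        switch (policy)
        {
            case RoutingPolicy.Failover:
                // First endpoint in priority order that is still online.
                return online[0];
            case RoutingPolicy.Performance:
                // Lowest response time for this caller.
                return online.OrderBy(e => e.LatencyMs).First();
            default:
                // Round robin: rotate through the online endpoints.
                Endpoint chosen = online[roundRobinCursor % online.Count];
                roundRobinCursor++;
                return chosen;
        }
    }
}

public static class PolicyDemo
{
    public static void Main()
    {
        var endpoints = new List<Endpoint>
        {
            new Endpoint { Name = "south-central", Online = false, LatencyMs = 30 },
            new Endpoint { Name = "north-central", Online = true, LatencyMs = 45 }
        };
        var sketch = new PolicySketch();
        // With the primary offline, failover falls through to the secondary.
        Console.WriteLine(sketch.Choose(RoutingPolicy.Failover, endpoints).Name);
    }
}
```

The point of the sketch is that the health check comes first in every case; the policy only decides among endpoints that are still online.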

Use the Windows Azure CDN for Assets

The most common response to running stuff in the cloud (Azure and Amazon in particular) is that it’s too expensive for the little guy. And generally, hosting VMs when a small shared site or something similar will suffice is a tough argument.

There are aspects to Azure, though, that are very cost effective as they do “micro-scale” very well. A good example of this is the Azure CDN, or more simply, Azure Blob Storage. It’s effective for exchanging files, it’s effective at fast delivery, and it even offers lightweight security using shared access signatures (links that essentially only work for a period of time). It’s durable: not just redundant internally, but externally as well, automatically creating a backup in another datacenter.

For MSDN subscribers, you already have Azure benefits, but even going out of pocket on Blob storage isn’t likely to set you back much: $0.15/GB of storage per month, $0.01/10,000 transactions, and $0.15/GB outbound bandwidth ($0.20 in Asia; all inbound free). A transaction is essentially a “hit” on a resource, so each time someone downloads, say, an image file, it’s bandwidth + 1 transaction. Because these are micro transactions, for small apps, personal use, etc., it’s quite economical … often adding up to pennies per month. A few typical examples are using storage to host files for a website, serve content to mobile devices, and simply offload resources (images/JS files) from website code.

Depending on usage, the Azure Content Delivery Network (CDN) can be a great way to enhance the user experience. It may not always be the case (and I’ll explain why) but essentially, the CDN has dozens of edge servers around the world. While your storage account is served from a single datacenter, having the data on the edge servers greatly enhances speed. Suppose an app on a phone is delivering documents/text to a worldwide audience … enabling CDN puts the content much closer.
I created a test storage account in North Europe (one of the Azure datacenters) to test this, using a small graphic from RPA. Here’s the same element via the CDN (we could be using custom DNS names, but for demo purposes we’re not).

Here’s a trace to the storage account in the datacenter. From North Carolina, it’s really not that bad, all things considered: you can see we’re routed to NY, then on across the pond, with a total latency of about 116ms.

And now the CDN: MUCH faster, with the edge server chosen not only by physical distance but also by network congestion. Of course, I won’t see a 100ms difference between the two on a small graphic, but if you’re serving up large documents/images, multiple images, or streaming content, the difference will be noticeable.

If you’re new to Azure and have an account, creating a storage account from the dashboard is easy. You’d just click on your storage accounts, and enter a name/location. You’d typically pick someplace close to you or where most of your users are. To enable CDN, you’d just click the CDN link on the left nav, and enable it. Once complete, you’ll see it on the main account screen with the HTTP endpoint.

So why wouldn’t you do this? Well, it’s all about cacheability. If an asset is frequently changing or infrequently used, it’s not a good candidate for caching. If there is a cache miss at a CDN endpoint, the CDN will retrieve the asset from the base storage account. This will incur an additional transaction, but more importantly it’s slower than if the user just went straight to the storage account. So depending on usage, it may or may not be beneficial.
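To put the "pennies per month" claim in numbers, here is a small sketch that applies the Blob Storage rates quoted above. The workload figures in `Main` are made up for illustration, and the sketch assumes simple linear billing at the US/Europe rates; it does not model any separate CDN charges, which the post does not quote.

```csharp
using System;

public static class BlobCostSketch
{
    // Rates as quoted in the post (US/Europe).
    private const decimal StoragePerGbMonth = 0.15m;
    private const decimal PerTenThousandTransactions = 0.01m;
    private const decimal OutboundPerGb = 0.15m;

    // Estimated monthly cost in dollars; inbound bandwidth is free.
    public static decimal MonthlyCost(decimal storedGb, long transactions, decimal outboundGb)
    {
        decimal storage = storedGb * StoragePerGbMonth;
        decimal transactionCost = (transactions / 10000m) * PerTenThousandTransactions;
        decimal bandwidth = outboundGb * OutboundPerGb;
        return storage + transactionCost + bandwidth;
    }

    public static void Main()
    {
        // Hypothetical small site: 0.2 GB stored, 50,000 hits, 1 GB served.
        // 0.03 (storage) + 0.05 (transactions) + 0.15 (bandwidth) = 0.23 dollars.
        Console.WriteLine(MonthlyCost(0.2m, 50000, 1m));
    }
}
```

Note that a CDN cache miss adds one more transaction against the storage account, so an asset that is rarely cached costs slightly more per request, not less.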

Azure and Phone … Better Together

We had an excellent time presenting today’s Windows Phone Camp in Charlotte. Thank you to everyone who attended. Here are some resources and major points of today’s “To the cloud…” session.

First, here is the slide deck for the presentation.  To The Cloud...

Next, download the Windows Azure Toolkit for Windows Phone. This contains both the sending notifications sample, and the Babelcam application. Note that there are quite a few setup steps; using the Web Platform Installer is a great way to make all of this easier.

The key takeaway that I really wanted to convey: while the cloud is most often demonstrated in massive scale scenarios, it’s also incredibly efficient at micro scale.

The first scenario we looked at was using Azure Blob Storage as a simple (yet robust) way to host assets. Think of Blob Storage as a scalable file system with optional built-in CDN support. Regardless of where your applications are hosted (shared host, dedicated hosting, or your own datacenter), and regardless of the type of application (client, phone, web, etc.), the CDN offers a tremendously valuable way to distribute those resources.

For MSDN subscribers, you already have access so there’s no excuse to not use this benefit. But even if you had to go out of pocket, hosting assets in Azure is $0.15/GB per month, + $0.01/10,000 transactions, + $0.15/GB outbound bandwidth (inbound is free). For small applications, it’s almost free. Obviously you need to do the math for your app, but consider hosting 200MB in assets (images, JS files, XAPs, etc.), a million transactions a month with several GB of data transfers: it’s very economical at the cost of a few dollars / month.

In the second demo, we looked at using Azure Queues to enhance the push notification service on the phone. The idea is that we’ll queue failed notifications, and retry them for some specified period of time. For the demo, I only modified the raw notifications.
In PushNotificationsController.cs (in toolkit demo above), I modified SendMicrosoftRaw slightly:

    [HttpPost]
    public ActionResult SendMicrosoftRaw(string userId, string message)
    {
        if (string.IsNullOrWhiteSpace(message))
        {
            return this.Json("The notification message cannot be null, empty nor white space.", JsonRequestBehavior.AllowGet);
        }

        var resultList = new List<MessageSendResultLight>();
        var uris = this.pushUserEndpointsRepository.GetPushUsersByName(userId).Select(u => u.ChannelUri);
        var pushUserEndpoint = this.pushUserEndpointsRepository.GetPushUsersByName(userId).FirstOrDefault();

        var raw = new RawPushNotificationMessage
        {
            SendPriority = MessageSendPriority.High,
            RawData = Encoding.UTF8.GetBytes(message)
        };

        foreach (var uri in uris)
        {
            var messageResult = raw.SendAndHandleErrors(new Uri(uri));
            resultList.Add(messageResult);

            // New: queue the failed notification so it can be retried later.
            // Note that this queues the user's first endpoint, which is fine
            // for the demo where each user has a single channel.
            if (messageResult.Status.Equals(MessageSendResultLight.Error))
            {
                this.QueueError(pushUserEndpoint, message);
            }
        }

        return this.Json(resultList, JsonRequestBehavior.AllowGet);
    }

Really the only major change is that if the messageResult comes back with an error, we’ll queue the notification for a retry. QueueError looks like this:

    private void QueueError(PushUserEndpoint pushUser, string message)
    {
        var queue = this.cloudQueueClient.GetQueueReference("notificationerror");
        queue.CreateIfNotExist();

        // Payload format: "<channel URI>|<message>"
        queue.AddMessage(new CloudQueueMessage(
            string.Format("{0}|{1}", pushUser.ChannelUri.ToString(), message)
        ));
    }

We’re simply placing the message on the queue with the data we want: you need to get used to string parsing with queues. In this case, we’ll delimit the data (which is the channel URI and the message of the notification) with a pipe character. While the channel URI is not likely to change, it’s a better approach to store the username and not the URI in the message, and instead do a lookup of the current URI before sending (much like the top of SendMicrosoftRaw does), but for the purposes of the demo this is fine.
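Since the queue payload is just a pipe-delimited string, it can help to keep the packing and unpacking in one place. The following is a minimal sketch, not part of the toolkit: `NotificationPayload` and its method names are hypothetical, and it assumes the channel URI never contains a literal pipe. It splits on the first pipe only, so a notification message that itself contains a pipe survives the round trip.

```csharp
using System;

public static class NotificationPayload
{
    private const char Separator = '|';

    // Builds "<channel URI>|<message>".
    public static string Pack(string channelUri, string message)
    {
        if (channelUri == null || channelUri.IndexOf(Separator) >= 0)
        {
            throw new ArgumentException("Channel URI must be non-null and must not contain '|'.", "channelUri");
        }
        return channelUri + Separator + message;
    }

    // Splits on the FIRST separator only, so the message may contain '|'.
    // Returns false for a malformed payload instead of throwing.
    public static bool TryUnpack(string payload, out string channelUri, out string message)
    {
        channelUri = null;
        message = null;
        if (payload == null) return false;

        string[] parts = payload.Split(new[] { Separator }, 2);
        if (parts.Length != 2) return false;

        channelUri = parts[0];
        message = parts[1];
        return true;
    }

    public static void Main()
    {
        string payload = Pack("http://example.test/channel/1", "score 3|2");
        string uri, text;
        if (TryUnpack(payload, out uri, out text))
        {
            Console.WriteLine(uri);
            Console.WriteLine(text);
        }
    }
}
```

Returning false for a malformed payload matters in a polling loop: an unexpected message can be discarded rather than raising an exception.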
When we try sending a raw notification when the application isn’t running, we’ll get an error. Typically, without a queue, you’re stuck. Using a tool like Cloud Storage Studio, we can see the notification is written to the failure queue, including the channel URI and the message.

So, now we need a simple mechanism to poll for messages in the queue, and try to send them again. Because this is an Azure webrole, there’s a way to get a “free” thread to do some processing. I say free because it’s invoked by the Azure runtime automatically, so it’s a perfect place to do some background processing outside of the main site. In Webrole.cs, you’ll see there is no Run() method. The base WebRole Run() method does nothing (it does an indefinite sleep), but we can override that. The caveat is, we never want to exit this method. If an exception bubbles out of this method, or we forget to loop, the role will recycle when the method exits:

    public override void Run()
    {
        this.cloudQueueClient = cloudQueueClient ??
            GetStorageAccountFromConfigurationSetting().CreateCloudQueueClient();

        var queue = this.cloudQueueClient.GetQueueReference("notificationerror");
        queue.CreateIfNotExist();

        while (true)
        {
            Thread.Sleep(200);

            // Hide the message from other readers for 60 seconds.
            CloudQueueMessage message = queue.GetMessage(TimeSpan.FromSeconds(60));
            if (message == null) continue;

            // Give up after 60 attempts.
            if (message.DequeueCount > 60)
            {
                queue.DeleteMessage(message);
                continue;
            }

            // Payload format: "<channel URI>|<message>"
            string[] messageParameters = message.AsString.Split('|');

            var raw = new RawPushNotificationMessage
            {
                SendPriority = MessageSendPriority.High,
                RawData = Encoding.UTF8.GetBytes(messageParameters[1])
            };

            var messageResult = raw.SendAndHandleErrors(new Uri(messageParameters[0]));

            if (messageResult.Status.Equals(MessageSendResultLight.Success))
            {
                queue.DeleteMessage(message);
            }
        }
    }

What this code is doing, every 200 milliseconds, is looking for a message on the failure queue. Messages are marked with a 60 second timeout, which will act as our “retry” window.
Also, if we’ve tried to send the message more than 60 times, we’ll quit trying. Got to stop somewhere, right?   We’ll then grab the message from the queue, and parse it based on the pipe character we put in there. We’ll then send another raw notification to that channel URI. If the message was sent successfully, we’ll delete the message. Otherwise, do nothing and it will reappear in 60 seconds.   While this code is running in an Azure Web Role, it’s just as easy to run in a client app, service app, or anything else. Pretty straightforward, right? No database calls, stored procedures, locking or flags to update. Download the completed project (which is the base solution in the toolkit plus these changes) here (note: you’ll still need the toolkit):  VS2010 Solution The final demo was putting it all together using the Babelcam demo – Azure queues, tables, notifications, and ACS. Questions or comments? Let me know.
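The retry rules described above (give up after 60 dequeues, delete on success, otherwise leave the message to reappear) can be separated from the queue calls, which also makes it easy to keep exceptions from bubbling out of Run(). This is a sketch under assumptions: `RetryOutcome`, `RetryPolicySketch`, and the `trySend` callback are hypothetical names, not toolkit or Azure SDK types.

```csharp
using System;

public enum RetryOutcome { Discarded, Delivered, LeftForRetry }

public static class RetryPolicySketch
{
    // Same limit as the Run() loop in the post.
    public const int MaxDequeueCount = 60;

    // dequeueCount: how many times the message has been read from the queue.
    // trySend: hypothetical callback that returns true when the send succeeds.
    public static RetryOutcome Process(int dequeueCount, Func<bool> trySend)
    {
        // Checked before sending, so an exhausted message is never re-sent.
        if (dequeueCount > MaxDequeueCount) return RetryOutcome.Discarded;

        bool sent;
        try
        {
            sent = trySend();
        }
        catch (Exception)
        {
            // Treat any failure as "not sent" so the polling loop keeps running.
            sent = false;
        }

        return sent ? RetryOutcome.Delivered : RetryOutcome.LeftForRetry;
    }

    public static void Main()
    {
        Console.WriteLine(Process(1, () => true));
        Console.WriteLine(Process(61, () => true));
    }
}
```

In the loop, `Discarded` and `Delivered` would both map to deleting the message, while `LeftForRetry` means doing nothing and letting the 60 second visibility timeout bring it back.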

RPA Grand Tournament Complete!

We had a GREAT time running the RockPaperAzure Grand Tournament Wednesday. We had a total of 302 players from around the world. We saw some amazing strategies, and learned a lot along the way. We intend to keep the site running with more contests along the way. If you have any interest in running a competition for a user group, conference, or other activity, we’d be happy to create a round for you on the site. Contact us for info.

Server Load

Nothing exposes bugs faster than a system under load! And load it had. Looking at the number of bots submitted during the competition, we had over 3000 bots flooding in (mostly during the afternoon), which kept our game engine farm quite busy. As you may have seen in our webcasts, as the number of players increases, the number of player matches goes up significantly, and the graph we showed only goes up to 50 players. Expand this to 300 players, and the load is tremendous. (That’s one of the excellent points about Azure: we’re able to hit a button and handle the load.)

We’re going to continue some experimental open rounds so stay tuned!

Top Reasons for Bot Rejection

Just a quick tip for those in the Grand Tournament. There are 2 very common reasons we’re seeing bots get rejected. We’ll call it the red text of death that appears when you submit a bot, but it fails code analysis. If all you see is “Bot Successfully Uploaded,” then move along, nothing to see here.

The most common reason is coders adding debugging statements to their code. We do block the Diagnostics namespace (among a few others), so it’s quite likely any debugging code will trigger a failure. If you’re deep into a bot with a lot of debug code, I recommend an #if DEBUG block around these sections to remove them from release builds.

The second most common reason is locks. Locks, as you likely know, are syntactic sugar around a Monitor, which resides in the System.Threading namespace. We don’t allow that, either. But many may ask, “Why no locks? Is it not a good practice to do this around a critical section of code?” Not in this case. Your bot shouldn’t be using statics since we can’t guarantee when or how many instances of your bots will be loaded at any given time, and I can guarantee this will burn you. Your bot also can’t create threads, so locking is irrelevant in this case and would only slow things down.

In many cases, these errors are triggered in bots where folks are rolling their own random number generator. While you don’t have to use it, we highly recommend you use the Moves.GetRandomNumber() method. It works the same way, but we guarantee unique generation for bots and we do the locking in the engine to ensure this. There’s a fun story behind that one.

Good luck!
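As a rough illustration of wrapping debug statements so they drop out of release builds, here is a small sketch. The class and method names are hypothetical (the real bot interface isn’t shown here), and it assumes you submit a Release build in which the DEBUG symbol is not defined.

```csharp
using System;

public class DebuggableBotSketch
{
    private int movesMade;

    // Hypothetical decision method: cycles through three move indexes.
    public int NextMoveIndex()
    {
        movesMade++;
        int choice = movesMade % 3;

#if DEBUG
        // Compiled only when the DEBUG symbol is defined.
        // The type is fully qualified so that no "using System.Diagnostics;"
        // is left behind at the top of the file in a Release build.
        System.Diagnostics.Debug.WriteLine("move " + movesMade + " -> " + choice);
#endif

        return choice;
    }

    public static void Main()
    {
        var bot = new DebuggableBotSketch();
        Console.WriteLine(bot.NextMoveIndex());
    }
}
```

Because the preprocessor removes the whole block, the release assembly contains no reference to the Diagnostics namespace from this code.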

RockPaperAzure Grand Tournament

We’re back, this time with an International Grand Tournament in Rock, Paper, Azure. So what’s new?

First, we heard many folks loud and clear that they weren’t happy it was U.S. residents only. So, now’s your chance: we’ve opened up the tournament to Canada, the UK, Sweden, New Zealand, Germany, China, and of course the USA. We’ve also included country flags in the leaderboard.

Next, we’ve changed some of the rules. Specifically, players are now “blind” when they play in the GT. What does that mean? It means that your bot will not know the team name of the opponent. While playing, the name of the opponent is a “?” and this is also reflected in the game history and log file. Why this change? Primarily, we felt it made the game a little more interesting as it focuses on algorithms as opposed to brute force. We’ve created a GT Practice Round that is not blind, so if you wish, you can tinker in this round to get some exposure and fine tune your logic. Of course, playing in the practice round is optional.

Next, players will break down into heats during the GT. After the round closes, we’ll segment players into a number of heats (as I write this, I can’t quite recall if we agreed on a random 25% in each heat, or 25 players per heat). The idea is that this creates a ladder approach to get to the top and adds a bit of excitement to see how far up the ladder your bot can go. It also scales nicer, since we’re assuming a higher involvement in the competition.

Finally, we decided to give away something a little sweeter than an Xbox. This time, we’ve got $5,000 riding on first place! Additionally, what we’ve decided to do is spread out the winnings a bit more so second place receives $1,000, and the next ten players (3rd-12th place) all receive $250.

So, why the prize structure? Well, during an in-person event during our original 6 week competition, I heard someone remark that it would be too difficult to place in the top 3 to get a prize, much less win the Xbox. I can understand that because, indeed, some of the bots we saw were really phenomenal. What we wanted to do was make it so there were enough prizes to reward “pretty good play” for those who (like myself) are interested in playing a little, but not spending a hundred hours coding a bot. With the new prize structure plus blind playing, it’s really anyone’s game with a little clever code. We hope you think so, too… and have fun playing! Questions or comments, feel free to ping us either here on my blog or through the website.

Top Failed Bots

During a presentation the other day to the Charlotte ALT.NET group, I made a joke that Rock, Paper, Azure is doing something completely ridiculous: we invite people to write code and we’ll run it arbitrarily. (Well, not really arbitrarily, but it does present a unique security challenge.) We’ve naturally had a few interesting submissions, so I’m posting some of them here for interest’s sake.

First up:

    Thread.Sleep(...);

We see this one fairly often. In a game where you have very limited time, I’m puzzled why some people would intentionally sleep their bots.

Next, this one is fairly innocent:

    if (dataset.Tables.Contains(opponent.TeamName))
    {
        DataTable table = dataset.Tables[opponent.TeamName];
        table.Rows.Clear();
        foreach (Round round in rounds)
        {
            ...

… but in a short run game like this, I’d steer away from datasets. (Plus, we have LINQ!) This one gets caught in the filter not because it poses a specific threat, but because we don’t allow anything from System.Data.

Third:

    try
    {
        StreamWriter writer = new StreamWriter(@"C:\RPSLog.txt", append);
        writer.Write(builder);
        writer.Close();
    }
    catch (Exception)
    {
    }

WOW, a lot of people are trying to write text files. Not allowed.

And now for my all time favorite … an obvious hack attempt:

    StringBuilder builder = new StringBuilder();
    using (SqlConnection connection = new SqlConnection(ConfigurationManager.
        ConnectionStrings[0].ConnectionString))
    {
        using (SqlCommand command = new SqlCommand("SELECT * FROM sys.Tables", connection))
        {
            SqlDataReader reader = command.ExecuteReader();
            while (reader.Read())
            {
                builder.Append(reader["[name]"]);
                builder.Append(",");
            }
        }
    }
    you.Log.AppendLine("Here: " + builder);

This last one actually raises a legitimate issue and security threat, so much so that we can ban players (or worse) for this kind of thing. Still, though, not much of a threat: the code can’t get through, but even if it could, connection strings aren’t available to the app domain running the round, and even so, the engine only has execute permissions on the procedures necessary to insert the results.

I’m curious to see what else comes through!

Rock, Paper, Azure Deep Dive: Part 2

In part 1, I detailed some of the specifics in getting Rock, Paper, Azure (RPA) up and running in Windows Azure. In this post, I’ll start detailing some of the other considerations in the project. In many ways, this was a very real migration scenario of a reasonably complex application. (This post doesn’t contain any helpful info on playing the game, but those interested in scalability or migration, read on!)

The first issue we had with the application was scalability. Every time players are added to the game, the scalability requirements of course increase. The original purpose of the engine wasn’t to be some big open-ended game played on the internet; I imagine the idea was to host small games (10 or fewer players). While the game worked fine for < 10 players, we started to hit some brick walls as we climbed to 15, and then some dead ends around 20 or so. This is not a failing of the original app design because it was doing what it was intended to do.

In my past presentations on scalability and performance, the golden rule I always discuss is: you have to be able to benchmark and measure your performance. Whether it is 10 concurrent users or a million, there should always be some baseline metric for the application (requests/sec., load, etc.). In this case, we wanted to be able to quickly run (within a few minutes) a 100 player round, with capacity to handle 500 players. The problem with reaching these numbers is that as the number of players goes up, the number of matches played goes up drastically: N * (N - 1) / 2. Even for just 50 players, the curve is steep. Now imagine 100 or 500 players!

The first step in increasing the scale was to pinpoint the two main problem areas we identified in the app. The primary one was the threading model around making a move. In an even match against another player, roughly 2,000 games will be played. The original code would spin up a thread for each move for each game in the match. That means that for a single match, a total of 4,000 threads are created, and in a 100-player round, 4,950 matches = 19,800,000 threads! For 500 players, that number swells to 499,000,000. The advantage of the model, though, is that should a player go off into the weeds, the system can abort the thread and spin up a new thread in the next game.

What we decided to do is create a single thread per player (instead of a thread per move). By implementing 2 wait handles in the class (specifically a ManualResetEvent and AutoResetEvent) we can accomplish the same thing as the previous method. (You can see this implementation in the Player.cs file in the DecisionClock class.) The obvious advantage here is that we go from 20 million threads in a 100 player round to around 9,900: still a lot, but significantly faster. In the first tests, 5 to 10 player matches would take around 5+ minutes to complete. Factored out (we didn’t want to wait) a 100 player match would take well over a day. In this model, it’s significantly faster: a 100 player match is typically complete within a few minutes.

The next issue was multithreading the game thread itself. In the original implementation, games would be played in a loop that would match all players against each other, blocking on each iteration. Our first thought was to use the Parallel Extensions (or PFx) libraries built into .NET 4, kicking off each game as a Task. This did indeed work, but the problem was that games are so CPU intensive, creating more than 1 thread per processor is a bad idea. If the system decided to context switch when it was your move, it could create a problem with the timing, and we had an issue with a few timeouts from time to time. Since modifying the underlying thread pool thread count is generally a bad idea, we decided to implement a smart thread pool like the one here on The Code Project. With this, we have the ability to auto scale the threads dynamically based on a number of conditions.
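To give a feel for the single-thread-per-player idea with two wait handles, here is a minimal sketch. It is not the DecisionClock code from Player.cs, which isn’t reproduced in this post; `PlayerWorker`, the `decide` callback, and the timeout handling are assumptions made for illustration. One long-lived thread waits on an AutoResetEvent for a move request and signals a ManualResetEvent when the move is ready, so the game thread can wait with a timeout.

```csharp
using System;
using System.Threading;

public sealed class PlayerWorker : IDisposable
{
    private readonly Func<int, int> decide;   // hypothetical bot callback: game number -> move
    private readonly AutoResetEvent moveRequested = new AutoResetEvent(false);
    private readonly ManualResetEvent moveReady = new ManualResetEvent(false);
    private readonly Thread thread;
    private volatile bool stopping;
    private bool timedOut;
    private int game;
    private int move;

    public PlayerWorker(Func<int, int> decide)
    {
        this.decide = decide;
        this.thread = new Thread(this.Loop) { IsBackground = true };
        this.thread.Start();
    }

    private void Loop()
    {
        while (true)
        {
            this.moveRequested.WaitOne();          // sleep until a move is requested
            if (this.stopping) return;
            try { this.move = this.decide(this.game); }
            catch (Exception) { this.move = -1; }  // a faulty bot must not kill the process
            this.moveReady.Set();                  // wake the game thread
        }
    }

    // Asks the player's thread for a move and waits up to 'timeout' for it.
    public bool TryGetMove(int gameNumber, TimeSpan timeout, out int result)
    {
        if (this.timedOut) throw new InvalidOperationException("Worker timed out; replace it.");
        this.moveReady.Reset();
        this.game = gameNumber;
        this.moveRequested.Set();
        bool ready = this.moveReady.WaitOne(timeout);
        this.timedOut = !ready;                    // a late worker must not be reused
        result = ready ? this.move : -1;
        return ready;
    }

    public void Dispose()
    {
        this.stopping = true;
        this.moveRequested.Set();
        // Only release the handles if the worker actually exited.
        if (this.thread.Join(TimeSpan.FromSeconds(1)))
        {
            this.moveRequested.Dispose();
            this.moveReady.Dispose();
        }
    }

    public static void Main()
    {
        using (var worker = new PlayerWorker(g => g % 3))
        {
            int m;
            bool ok = worker.TryGetMove(4, TimeSpan.FromSeconds(1), out m);
            Console.WriteLine(ok + " " + m);
        }
    }
}
```

The trade-off versus a thread per move is visible in `TryGetMove`: a player that blows its time limit leaves its thread busy, so this sketch marks the worker as unusable and expects the caller to replace it, rather than aborting a thread on every move.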
The final issue was memory management. This was solved by design: the issue was that the original engine (and Bot Lab) don’t store any results until the round is over. This means that all the log files really start to eat up RAM. Again, that’s not a problem for 10 or 20 players; we’re talking 100-200+ players, and the RAM just bogs everything down. The number of players in the Bot Lab is small enough that this wasn’t a concern, and the game server handles this by design by using SQL Azure, recording results as the games are played.

Next time in the deep dive series, we’ll look at a few other segments of the game. Until next time!
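For anyone who wants to check the thread counts quoted in this post, here is the arithmetic as a short sketch. The 4,000 threads per match figure comes from the post (roughly 2,000 games with one thread per move); the two threads per match for the new model follow from one thread per player. The class and method names are just illustrative.

```csharp
using System;

public static class RoundMath
{
    public const long ThreadsPerMatchOldModel = 4000; // one thread per move, ~2,000 games
    public const long ThreadsPerMatchNewModel = 2;    // one thread per player

    // Every player meets every other player once: N * (N - 1) / 2.
    public static long Matches(long players)
    {
        return players * (players - 1) / 2;
    }

    public static long ThreadsOldModel(long players)
    {
        return Matches(players) * ThreadsPerMatchOldModel;
    }

    public static long ThreadsNewModel(long players)
    {
        return Matches(players) * ThreadsPerMatchNewModel;
    }

    public static void Main()
    {
        foreach (long n in new long[] { 50, 100, 500 })
        {
            Console.WriteLine("{0} players: {1} matches, {2} threads (old), {3} threads (new)",
                n, Matches(n), ThreadsOldModel(n), ThreadsNewModel(n));
        }
    }
}
```

For 100 players this gives 4,950 matches, 19,800,000 threads in the old model, and 9,900 in the new one, which matches the figures above.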

RPA: Burned by Static Cling

In a previous post about locking in Rock, Paper, Azure, I said this somewhat offhand:

“In this case, there’s no reason to use such code in a bot. The only time you’d need to is if your bot has a static method/object reference, but that’s a bad idea in a bot because it will lead to unpredictable results. Your bot should have only instance members.”

I should’ve called that out more, and in this case, we have a player who lost because of it. It’s especially tough because things seemingly worked fine, until the final tournament round. Here’s why, and here’s some information on static variables (Shared in VB) for those who haven’t used them before.

In short, a static modifier on a method or variable makes the member part of the type itself instead of part of each instance. This is really useful on helper methods in particular, because a static member can be used without instantiating the type to which the member belongs. Static objects (variables), in any code, should be a red flag. They have very specific advantages and lower overhead (only 1 is created no matter how many objects of that type are created), BUT they can burn you easily if you’re not certain how the object is loaded and used. (Static methods, as a general rule, tend to have less risk than static objects/variables.) Unfortunately, that’s what happened in last Friday’s competition to one of our players.

So how can they burn you? For one, they’re not thread safe. Of course, they could be thread safe, but you’d have to be cognizant of what they are doing to make them thread safe. (Non-statics might not be thread safe either; however, using statics implies global reuse, so it heightens the exposure to thread safety issues that can be hard to track down.) One example: we modified the original engine to do multithreading, and the original log had a static method that used a StringBuilder to build a log file. This caused problems because the StringBuilder is not thread safe, so we had to add locking. The problem was always there (even if it wasn’t static), but the problem never manifested because the core was single threaded.

Another way they can burn you is that two or more objects may be accessing the static objects in a nondeterministic way, leading to unpredictable results. The unfortunate part about this in particular was that the issue didn’t manifest until the main tournament round, so everything appeared fine until the final round.

The game engine runs two types of rounds: player rounds, and full rounds. In either case, the engine will spin up many threads to execute the games, but during a player round, the engine loops the new bot against everyone else. As other players submit their bots, the engine will load your bot only once. Because your bot is loaded only once, there’s really no chance of static variables causing a problem, much like the StringBuilder example above. But during the tournament round, all of a sudden many more games are played. Consider that with 50 players, only 49 games are played when a new bot is uploaded. But in a full round, 1,225 games are played! There is a much stronger chance your bot will have multiple instances loaded concurrently, and modifying static variables will cause the bots to go haywire.

So, the lesson of the week is: don’t use statics in a bot! Question or comment about a bot? Let us know…
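Here is a tiny sketch of the difference, using two made-up bot classes (they don’t implement the real bot interface). Each bot counts the moves it has seen; the static version shares one counter across every instance, which is exactly what goes wrong when the engine loads several copies of the same bot.

```csharp
using System;

// Shares ONE counter across every instance of the type.
public class StaticBot
{
    private static int movesSeen;
    public int Record() { return ++movesSeen; }
}

// Each instance keeps its own counter.
public class InstanceBot
{
    private int movesSeen;
    public int Record() { return ++movesSeen; }
}

public static class StaticClingDemo
{
    public static void Main()
    {
        // Two copies of each bot, as if the engine were playing two matches.
        var s1 = new StaticBot();
        var s2 = new StaticBot();
        var i1 = new InstanceBot();
        var i2 = new InstanceBot();

        s1.Record();
        i1.Record();

        // s2 has seen one move, but its count also includes s1's move.
        Console.WriteLine("static:   " + s2.Record());
        // i2's count only reflects its own history.
        Console.WriteLine("instance: " + i2.Record());
    }
}
```

This example is single threaded on purpose, so the cross-talk is at least predictable. With several matches running concurrently, the order in which instances touch the shared field also becomes nondeterministic.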
