We’re doing some work on a new product on Azure and it’s been a learning process. For those of you who are just getting started, here’s the information I’ve found most useful.
Connections will drop. Regularly. Your application MUST be able to handle dropped connections. They are inevitable and intrinsic to the cloud architecture (e.g. operations like replacing a dead node, splitting a Federation member in SQL Database, etc.). The Transient Fault Handling Application Block provides well-thought-out, tested ways to handle this. Use it.
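To make the idea concrete, here is a minimal sketch of retry-with-backoff, the general pattern behind this kind of transient fault handling. It is written in Python for brevity and is not the Application Block's API. `TransientError`, `with_retries`, and all the parameter names and defaults are assumptions made up for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (dropped connection, timeout)."""


def with_retries(operation, max_attempts=5, base_delay=0.5, max_delay=30.0,
                 sleep=time.sleep):
    """Call operation(); on TransientError, back off exponentially and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff capped at max_delay, with jitter so that
            # many clients do not all retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
```

You would wrap each database or service call, e.g. `with_retries(lambda: run_query(conn))`, where `run_query` is your own function that raises `TransientError` for retryable failures. Only retry operations that are safe to repeat.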
A SQL select that normally takes 10 milliseconds to complete can take a minute. Azure is a service-based platform of shared resources, which means two kinds of latency or interruption occur regularly. The first is the time taken to make a request and receive a response over the internet. Since those requests and responses can travel through any number of routers before they return to the client, timeouts and disconnections are more frequent than on local, fixed networks. The second is the time it takes a shared-resource system like Azure to create backup copies of data for durability and to replace removed instances and reroute their requests. In some cases that means the user waits a minute for the web page to complete. In others, it means you should communicate with other systems in Azure asynchronously, so that one slow call does not hold up everything else.
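As a rough illustration of the async approach, the sketch below runs several remote calls concurrently and gives each one a timeout, so a call that suddenly takes a minute cannot stall the rest. It uses Python's standard `asyncio`. The helper names `call_with_timeout` and `fetch_all` and the default timeout are assumptions for this example, not part of any Azure SDK.

```python
import asyncio


async def call_with_timeout(coro_factory, timeout_seconds):
    """Await one call, giving up after timeout_seconds. Returns (ok, result)."""
    try:
        result = await asyncio.wait_for(coro_factory(), timeout=timeout_seconds)
        return True, result
    except asyncio.TimeoutError:
        # The call took too long; report failure instead of blocking the caller.
        return False, None


async def fetch_all(calls, timeout_seconds=5.0):
    """Run all calls concurrently; results come back in the same order as calls."""
    return await asyncio.gather(
        *(call_with_timeout(call, timeout_seconds) for call in calls)
    )
```

Each entry in `calls` is a zero-argument function returning a coroutine, for example `asyncio.run(fetch_all([load_profile, load_orders]))` with your own `async def` functions. A timed-out call comes back as `(False, None)`, which the page can render as "try again" rather than hanging.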
Even if the only user of your app at first will be your proud mom, set it up on two datacenters. There’s a giant difference between a single instance and two instances. Going from 2 to N has some issues, but far fewer. Learn from the start how to properly set up a clustered system. We set up one in the US and one in Europe so that we would run into more of the latency and communication problems between the two systems.
Default Capacity & Billing Settings
When you create anything in Azure, it has default limits set for capacity, billing, etc. Check them. You don’t want a well-designed system to stop responding, not because it can’t handle the load, but because the increased number of customers caused you to hit the maximum charge for one of the services. We hit this problem with the Phone Home license server, which was down for half an hour as a result.
Web App Logging
Using Trace sucks. There’s no way to set the level of logging by component, and there’s no log.IsDebugEnabled, which is a tremendous performance win. We’re still working through how best to configure log4net and will update this post when we make a final decision. (A big question: do we log to each web app instance’s local storage and lose that logging if the web app VM crashes, or do we write out to shared storage?)
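For readers who have not used a logger with these two features, here is a small sketch of per-component levels and the "is debug enabled" guard. It uses Python's standard `logging` module as a stand-in, so it is an analogy rather than log4net configuration. The component names `app.billing` and `app.render` and the helper functions are made up for the example.

```python
import logging

# Default for the whole application: warnings and above only.
logging.basicConfig(level=logging.WARNING,
                    format="%(name)s %(levelname)s %(message)s")

# Per-component levels (hypothetical component names).
billing_log = logging.getLogger("app.billing")
billing_log.setLevel(logging.DEBUG)   # verbose while we chase a billing bug
render_log = logging.getLogger("app.render")
render_log.setLevel(logging.ERROR)    # quiet: this component is stable


def expensive_dump(state):
    """Stand-in for a costly serialization done only for debug output."""
    return ", ".join(f"{key}={value}" for key, value in sorted(state.items()))


def log_state(log, state):
    """Log state at debug level; return True if the message was built."""
    # Same idea as checking log.IsDebugEnabled: skip the expensive work
    # entirely when debug output is switched off for this component.
    if log.isEnabledFor(logging.DEBUG):
        log.debug("state: %s", expensive_dump(state))
        return True
    return False
```

With this setup `log_state(billing_log, ...)` builds and writes the message, while `log_state(render_log, ...)` returns immediately without calling `expensive_dump` at all. That skipped work is where the performance win comes from.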
- Best Practices for Performance in Azure Applications
- Enterprise Library 5.0 Integration Pack for Windows Azure
- Scaling Applications Using Azure Cloud Services
- (Some) Best Practices for Building Windows Azure Cloud Applications
- Cloud Power: How to scale Azure Websites globally with Traffic Manager
- Best Practices for the Design of Large-Scale Services on Azure Cloud Services
Azure SQL Database
- Azure SQL Database
- Azure SQL Database Guidelines and Limitations
- Performance Considerations with Azure SQL Database
- Guidelines for Connecting to Azure SQL Database
- Azure SQL Database Concepts
At the Build Conference I attended a session by two of the top Azure system engineers. They had a slide come up labeled “Load Testing” and they said “take it live.” Their point was that you can, and should, load test before you go live. But it’s only as a live system that you truly test it and find the problems that such testing exposes.
Running on Azure is new, different, and a learning experience. Be ready to resolve issues quickly because you will hit quite a few when you first go live. And by definition, most of them will be in unexpected places.