Ken Church, Albert Greenberg and James Hamilton of Microsoft recently put out a paper on “Delivering Embarrassingly Distributed Cloud Services.”[PDF] Like most papers of this type, it’s a dry read, but informative. It looks at the tradeoff between mega-data center size and micro-data center diversity from the viewpoints of both total cost of ownership and performance.
The most important line in the entire report, of course, is “The trade-offs vary by application.” However, they make the argument that applications with little need for server-to-server communications will show benefits in cost, scale, reliability and performance through geo-diversification – in other words, lots of little datacenters as opposed to one big datacenter.
This seems to fly in the face of the trend in data consolidation, but there is a point to it: Any data center needs redundancy, but a single centralized data center needs proportionally more of it than a collection of small data centers does, because the spare capacity has to cover the loss of the whole site rather than one site among many. As Church, Greenberg, and Hamilton put it, “the more geo-diversity, the better. N+1 redundancy becomes more attractive for large N.”
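To make the "large N" point concrete, here is a minimal sketch of the arithmetic, assuming N equally sized sites carry the load and one extra site of the same size is held in reserve. The function name and the equal-size assumption are mine, not the paper's.

```python
def n_plus_1_overhead(n_sites: int) -> float:
    """Spare capacity as a fraction of useful capacity under N+1 redundancy.

    Assumes (for illustration only) that all sites are the same size:
    n_sites carry the load and one more identical site is the spare.
    """
    if n_sites < 1:
        raise ValueError("need at least one site carrying load")
    return 1 / n_sites


if __name__ == "__main__":
    # One big site plus a full mirror, versus many small sites plus one spare.
    for n in (1, 2, 10, 50):
        print(f"N={n:>2}: spare capacity = {n_plus_1_overhead(n):.0%} of useful capacity")
```

Under these assumptions a single mega data center needs a full duplicate (100% overhead), while ten micro data centers need only one extra (10% overhead).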
The part that really interested me, though, was the networking section. (Section 3, in case you want to skip right to it.) Church, Greenberg, and Hamilton point out that in a large, centralized datacenter, you can have end-to-end control and assure a particular level of performance through supported service level agreements. On the other hand, they argue:
“[with distributed data centers] the cloud service provider has ceded control of quality to its Internet access providers, and so cannot support (or even fully monitor) SLAs on flows that cross out multiple provider networks, as the bulk of the traffic will do. However, by artfully exploiting the diversity in choice of network providers and using performance sensitive global load balancing techniques, performance may not appreciably suffer. Moreover, by exploiting geo-diversity in design, there may be attendant gains in reducing latency…”
“Many large analysis applications are best run centrally in mega data centers… Interactive applications are best run near users… [they] can be delivered with better QoS (e.g., smaller TCP round trip times…) via micro data centers.”
The argument’s sound, especially when you consider that interactive applications are probably the most latency sensitive because they need to make multiple trips to and from the client and server with every interaction.
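A rough back-of-the-envelope sketch shows why distance matters so much for chatty applications. It assumes signals travel through fiber at about 200,000 km/s (roughly two thirds of the speed of light) and counts propagation delay only; the distances and the number of round trips are made-up illustrative values, not figures from the paper.

```python
# Approximate signal speed in optical fiber (assumption: ~2/3 the speed of light).
FIBER_SPEED_KM_PER_S = 200_000


def propagation_rtt_ms(distance_km: float) -> float:
    """Round trip propagation delay in milliseconds over the given one-way distance."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_S * 1000


def interaction_delay_ms(distance_km: float, round_trips: int) -> float:
    """Total propagation delay for an interaction needing several sequential round trips."""
    return round_trips * propagation_rtt_ms(distance_km)


if __name__ == "__main__":
    # Hypothetical comparison: a distant mega data center vs. a nearby micro data center.
    for label, km in (("mega, 4000 km away", 4000), ("micro, 200 km away", 200)):
        print(f"{label}: {interaction_delay_ms(km, round_trips=5):.0f} ms for 5 round trips")
```

With these illustrative numbers the distant site costs about 200 ms per interaction in propagation delay alone, against about 10 ms for the nearby one. Queuing, routing and retransmission delays are deliberately left out.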
But reducing the propagation delay (or distance delay) is merely one part of the performance equation. By ceding control over router performance and transmission, you have no way of diagnosing network round trip time problems if they occur, and you wouldn’t be able to fix them – short of the messy step of changing service providers – even if you could diagnose them. If something goes wrong, it could negate the speed gained by diversifying servers, so moving to this model is more of a gamble than a guarantee of improvement. Granted, it’s a gamble that might make sense for some apps and some organizations – some apps, apparently, can get away with less than 100% uptime.

by Patrick Ancipink
by Brian Boyko