Tuesday, April 16, 2013

Preventing epidemics in cloud architectures

I'm developing a theory that you're breaking some sort of union rule if you try to hold a cloud computing event without putting Netflix's Adrian Cockcroft on the agenda. (Though when I mentioned this theory to Adrian, he assured me it was OK so long as he was at least mentioned in a presentation or two.) In any case, he was on stage at the Linux Collaboration Summit in San Francisco this week to talk about "Dystopia as a Service." 

Most of Adrian's talks examine various aspects of Netflix's computing architecture. It's an architecture that's both massive and almost entirely based on Amazon Web Services. It also offers a great example of what a cloud architecture should look like. Some specifics are doubtless unique to Amazon, and others unique to Netflix. But it's also true that many of the basic patterns and approaches that Netflix follows are useful study points for any "cloud native" application architecture. (Hence Adrian's ubiquity at cloud events.)


These patterns include things like making master copies of data cloud-resident, dynamically provisioning everything, and making sure that all services are ephemeral. This contrasts with the traditional IT pattern of having mostly heavyweight, monolithic services that you individually protected with all manner of reliability and availability mechanisms from N+1 power supplies to failover clusters. Bill Baker (then at Microsoft) wryly put the contrast between the traditional scale-up IT pattern and the scale-out cloud pattern thusly: “In scale-up, servers are like pets. You name them and when they get sick, you nurse them back to health. In scale-out, servers are like cattle. You number them and when they get sick, you shoot them.”
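The "cattle, not cattle-doctor" approach can be sketched in a few lines. This is a minimal illustration, not Netflix's actual tooling: the `Instance`, `provision`, and `reconcile` names are hypothetical, and the health flag stands in for whatever real health checks an orchestrator would run. The point is structural: an unhealthy instance is replaced wholesale with a freshly provisioned one rather than repaired in place.

```python
# Hypothetical sketch of the "cattle, not pets" pattern: unhealthy
# instances are replaced, never nursed back to health.

class Instance:
    def __init__(self, number):
        self.number = number   # cattle are numbered, not named
        self.healthy = True

def provision(number):
    """Stand up a fresh, identically configured instance."""
    return Instance(number)

def reconcile(fleet):
    """Keep healthy instances; replace any sick one with a new copy."""
    return [inst if inst.healthy else provision(inst.number)
            for inst in fleet]

fleet = [provision(n) for n in range(4)]
fleet[2].healthy = False        # simulate a sick instance
fleet = reconcile(fleet)
print(all(inst.healthy for inst in fleet))  # True: the sick one was replaced
```

Note that `reconcile` only works because every instance is stateless and identically configured, which is exactly why the pattern depends on the ruthless standardization the rest of this post complicates.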

However, for this post, I'm going to focus on one particular point that Adrian raised that hasn't been so widely discussed. That's the tension between efficiency and robustness (or anti-fragility as Adrian called it). 

The basic idea is this. Maximizing efficiency typically involves doing things like replicating "the best" as patterns and minimizing variability. You standardize ruthlessly—one operating system variant, unified monitoring, "copy exact" (to use Intel terminology) from one region to another, common configurations, and so forth. The problem is that an environment that has been ruthlessly standardized is also a monoculture. And monocultures can be catastrophically affected by singular events such as security exploits, software bugs triggered by data or a date, and DNS or certificate issues of various kinds.

Although the specifics vary, we see the tradeoffs associated with maximizing efficiency in other domains as well. For example, it's generally recognized that today's highly tuned and lean supply chains are also highly vulnerable to disruption. After the Japanese tsunami, the Chicago Sun-Times wrote:

“When you’re running incredibly lean and you’re going global, you become very vulnerable to supply disruptions,” says Stanley Fawcett, a professor of global supply chain management at Brigham Young University.

The risks are higher because so many companies keep inventories low to save money. They can’t sustain production for long without new supplies.

Subaru of America has suspended overtime at its only North American plant, in Lafayette, Ind. Toyota Motor Corp. has canceled overtime and Saturday production at its 13 North American plants. The two companies are trying to conserve their existing supplies.

Every field has techniques to intelligently reduce the impact of such events. However, there remains something of a tradeoff between efficiency on the one hand and robustness on the other, given the need to get away from monocultures as much as possible. Adrian described Netflix as using "automated diversity management" and "diversifying the automation as well" (by using two independent monitoring systems).
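The idea of diversifying the automation itself can be sketched as follows. This is an illustrative stand-in, not a description of Netflix's monitoring stack: the two monitor functions, their signals, and their thresholds are all invented for the example. What matters is that the two checks are independently implemented, so a bug in one is unlikely to also exist in the other, and their disagreement is itself a useful signal.

```python
# Hypothetical sketch of diversified monitoring: two independently
# written checks watch the same service. Action is taken only when
# they agree; disagreement is surfaced rather than silently resolved.

def monitor_a(metrics):
    # First implementation: watches the error rate.
    return metrics["error_rate"] > 0.05

def monitor_b(metrics):
    # Independent implementation: different signal, different threshold.
    return metrics["p99_latency_ms"] > 500

def assess(metrics):
    a, b = monitor_a(metrics), monitor_b(metrics)
    if a and b:
        return "unhealthy"      # both agree: safe to act automatically
    if a != b:
        return "investigate"    # disagreement may mean a monitor is broken
    return "healthy"

print(assess({"error_rate": 0.01, "p99_latency_ms": 120}))  # healthy
```

The cost of this diversity is the efficiency tradeoff the post describes: two monitoring systems mean twice the maintenance, but a bug or blind spot in one no longer takes down the whole feedback loop.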

Of course, every organization will have to decide for themselves just where and how to introduce diversity. (Famously, Netflix is all-in on a single cloud provider—Amazon Web Services—however much they introduce diversity elsewhere and this has contributed to outages at times.)

Some diversity will arise naturally as organizations introduce new technologies, such as new virtualization platforms, that they will continue to run alongside existing ones. Similarly, most IT departments today, for better or worse, don't ruthlessly standardize to the degree that cloud providers do. Thus, a certain degree of "organic diversity" comes naturally.

However, it's worth remembering—as organizations increasingly adopt some of the practices in use by public cloud providers—that the ultimate goal isn't necessarily complete standardization, even when complete standardization is practical. Today, IT is hybrid just because that's the way it evolved. But even as organizations transform into much more of an architected-for-cloud world, it's worth remembering that hybrid IT can also be a good architectural practice for keeping bugs and other shocks from becoming epidemics.

[Graphic by Dominic Alves Flickr/CC http://www.flickr.com/photos/dominicspics/5425951169/]


Unknown said...


Great insights here. Figuring out a way to deliver services that fit the unique needs of the business - but that are standardized enough to deliver fast and cheap (and not fall to epidemic) - seems to be the great puzzle of the cloud age.

Kurt Milne
