Friday, February 24, 2012

Cloud Drives New Thinking About Data Architecture

It doesn't take much experience in the cloud space to realize that it is subject to the same limitations as grid computing, and for the same reasons. One of the biggest limitations is data: the size of data sets, the bandwidth and cost required to move them, and the latency introduced in processing them. I have repeatedly heard Alistair Croll quote another cloud visionary: "next to the cost of moving data, everything else is free." It's simple, and it's true, yet when adopting cloud so few people seem aware of how important it is to architect for this reality.

As Geva Perry points out, the reality of cloud is that, like most technologies, it enters a company through the everyday knowledge worker and then wends its way upward as it delivers value, until finally the CxOs become aware of its success, typically right before they engage in top-down initiatives to bring the technology into the company. Most of these early adopters are software developers and systems administrators, both of whom are eternally on the lookout for better solutions that make their lives easier. Neither takes a data-centric focus, which results in sub-optimal solutions. And in cloud, where the return can be so high, a sub-optimal solution still looks great, masking its inherent shortcomings.

As I've explained to many who confuse cloud with mainframes using the centralization argument, it's important to realize there is a huge difference. Mainframes were about physical centralization. And if we proved nothing else, we proved physical centralization is a bad thing from a disaster recovery point of view, a cost of operations point of view, a response time point of view, and several others that I won't detail. Cloud gives us the best of both worlds: logical centralization within a physically dispersed reality.

New models require new architectures. Since the fundamental value element of any computing system is the data being processed, it's natural to optimize the architecture for data. In the cloud, the optimal data architecture takes advantage of geographic diversity, leveraging virtualization concepts to manage the data logically as if it were in a traditional centralized model. The reality is that no matter how much storage you can put in one location, you'll never have enough. And even if you did, you'd never have enough buildings to house it. And if you did, you'd run out of bandwidth to move it around and make use of it. We are in a data-driven age where we're better at collecting data than using it, and combined with mobile technologies, in which every electronic device suddenly becomes a sensor of some type, we're starting to sink under the weight. Telcos had it right with their local exchanges, the distribution points that linked the home to the global network.

In the Smart Grid arena I helped one of the largest utilities rethink their smart grid strategy. The original design called for bringing all the consumption data back from the smart meters to the data center. The primary challenge on the table was how to move the data: wireless technology? Broadband over the wire? I turned the tables by refocusing the discussion on the fundamental assumptions. Why did the data need to be transferred to the data center? The answer: the business requirement was to be able to tell every user the accumulated cost of their power consumed to date at any given point in time. Digging deeper, we found the current mainframe could not meet this requirement, being able to process only 200,000 bills each day, hardly the real-time answer the business wanted. So I asked if we could flip the architecture, borrowing a distributed control model from my powertrain engineering days with GM. I argued it would make more sense to accumulate the data at the smart meter or neighborhood, calculate the billing at that level, and then access the detailed data from the data center only when needed. Our research showed the "when needed" use cases were limited to billing disputes, service issues, and the occasional audit. Since only 1% of customers called each month, even if 100% of those calls required the customer rep to reach down into the smart meter to retrieve the detailed data, it was certainly easier and less of a load than bringing 100% of the data back to the data center. Frankly, it was hard to argue against a distributed model, which became the standard and has slowly replaced the original centralized model of smart grid touted for years as the answer.
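To make the trade-off concrete, here is a minimal Python sketch of the distributed model described above. It is an illustration under stated assumptions, not the utility's actual design: the `SmartMeter` class, the flat `rate_per_kwh` tariff, the 30 readings per meter, and the `readings_moved` helper are all hypothetical. Each meter keeps its own detail and a running cost, sends only a small summary upstream, and gives up its full history only when a dispute, service issue, or audit calls for it.

```python
from dataclasses import dataclass, field


@dataclass
class SmartMeter:
    """Hypothetical meter that stores its own readings and running bill."""
    meter_id: str
    rate_per_kwh: float  # assumed flat tariff, for illustration only
    readings: list = field(default_factory=list)
    cost_to_date: float = 0.0

    def record(self, kwh: float) -> None:
        # Keep the detail locally and update the running total in place.
        self.readings.append(kwh)
        self.cost_to_date += kwh * self.rate_per_kwh

    def summary(self) -> dict:
        # Small payload: the only thing sent upstream routinely.
        return {"meter_id": self.meter_id,
                "cost_to_date": round(self.cost_to_date, 2)}

    def detail(self) -> list:
        # Full history, pulled only for disputes, service issues, or audits.
        return list(self.readings)


def readings_moved(meters, dispute_rate):
    """Count detailed readings shipped to the data center under each model."""
    # Centralized: every reading from every meter is transferred.
    centralized = sum(len(m.readings) for m in meters)
    # Distributed: only meters involved in a customer call give up detail.
    # (Routine summaries are still sent, but they are one small record each.)
    disputed = meters[: round(len(meters) * dispute_rate)]
    distributed = sum(len(m.detail()) for m in disputed)
    return centralized, distributed


# 100 hypothetical meters, 30 readings each, 1% of customers calling in.
meters = [SmartMeter(f"m{i}", rate_per_kwh=0.12) for i in range(100)]
for m in meters:
    for _ in range(30):
        m.record(1.5)

print(meters[0].summary())
print(readings_moved(meters, dispute_rate=0.01))
```

With these made-up numbers the centralized model moves every reading while the distributed model moves detail for one meter, which is the 1% versus 100% argument in miniature. The customer-facing question ("what is my cost to date?") is answered from the summary alone.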

I have advocated the same distributed architecture for mobile providers: accumulate usage on the mobile phone, execute the billing algorithm there, and if the provider needs the detail, download what you need when you need it. I have advocated a more generic version for healthcare payers, retailers, and the supply chain, touting the advantage of storing the data where it's collected.

The data management tools are in their infancy, but there is significant work going on around the world on the subject. Consider that five years ago your options within the database world were limited to a cluster. Now we have sharding, high-performance filesystems, and highly scalable SQL databases, plus a whole new class of data management from the MapReduce/Hadoop world.

Embrace it or fight it, the reality doesn't change. Data growth is exploding. Storage densities are plateauing. Is it better to learn how to hold your breath, tread water, or swim?
