In the end, everything is about data. We use data to make decisions, which, combined with opposable thumbs, helps separate us humans from the other primates. Today the supply of data far outstrips the demand for its use. We create data all day, every day; everything we do creates data. From shipments to tweets, from logins to purchases, there is a "paper" trail behind every step we take. It's not too surprising, then, that Cisco estimates a 13-fold increase in global mobile internet data traffic from 2012 to 2017. That's an astounding number! Naturally, such incredible growth will require wireless providers to expand their bandwidth, companies to expand the storage and compute needed to utilize the data, and new tools and techniques to make use of it all. The last thing we want is to leave all that data lying around, collecting dust, right? True, but unfortunately that's the most likely scenario.
Although we are good at using data, we are only good at using familiar data. We are not so good at using data folded into complex relationships, the data equivalent of the fog of war. Why? In part, I believe, because we are creatures of habit who often let intuition rather than deduction drive our decisions. Companies, however, rightly view data as treasure, the new gold, capable of providing new insights which can in turn generate new business opportunities. Data is seen as objective, the "truth". This is the challenge of Big Data: to give humans the tools to overcome our innate inability to see trends, and to gain new insights through the fog of war.
Of course, data can only provide value in decision making if the analysis is ready and relevant before the decision needs to be made. I call this Real Time Analytics: the business solution enabled by Big Data. Life and business offer plenty of examples of companies that waited too long for the right answer when a good enough answer would have carried the day. Leaders at all levels need guidance from data analysis to make decisions. Whether in the form of an answer (this is what you should do) or simply insight (here is what is happening), we now realize that a good decision today is better than a great decision tomorrow.
So our future is one of making better-informed decisions faster, using the ever-expanding scope of data we generate combined with our increasing capacity to analyze it. Sounds great, but is IT structured to handle the work? Our IT departments have been designed around a centralized model: bring everything back to a handful of locations that house massive amounts of network bandwidth, compute, and storage. A 13-fold increase in mobile internet data traffic means more data, faster, and a thickening fog of war. The implications for companies moving Big Data programs mainstream are enormous. There is a point at which a centralized architecture simply will not work; it will not be possible to bring all the data back to one place, perform the calculations, and distribute the results again within the time frame Real Time Analytics requires.
In short, data does not scale vertically. We have already maximized its compression, and we transmit it at the speed of light; we are up against the physical laws of nature. There is no solution on the horizon that lets us transmit ever-larger increments of data faster from one location to another. Instead, today we have a roughly linear relationship between the quantity of data being analyzed and the time it takes to analyze it. There is a model, popular in discussion but often misunderstood and misapplied in practice, which addresses the scalability of data: parallelization. If we perform the calculations on a data set in parallel rather than one at a time in serial, we can reduce the time required for analysis (though this approach is not always viable). Parallelization is built into ETL (Extract, Transform, Load) tools and is at the heart of Big Data frameworks such as MapReduce (the foundation of Hadoop). Once data can be processed in parallel, processing can occur in a distributed/federated environment with reliable, repeatable results. Taken to its logical extreme, if data were analyzed at its point of origin, the analysis workload would be as distributed as it can possibly be. The net result is greater throughput for the overall system, and thus shorter analysis cycle times. It's like adding fog lights to the jeep traversing the fog of war: we focus on delivering answers, not raw data.
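To make the map/reduce pattern concrete, here is a minimal sketch in Python (the chunk data and function names are illustrative, not a real Hadoop job): each chunk of text is counted independently, the "map" step, which can run on as many workers as we have chunks, and the partial counts are then merged in a "reduce" step.

```python
from multiprocessing import Pool

def map_count(chunk):
    # Map step: each worker counts words in its own chunk, independently
    # of every other chunk -- this is what makes the work parallelizable.
    counts = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def reduce_counts(partials):
    # Reduce step: merge the per-chunk partial counts into one result.
    total = {}
    for partial in partials:
        for word, n in partial.items():
            total[word] = total.get(word, 0) + n
    return total

if __name__ == "__main__":
    chunks = ["big data big answers", "data beats intuition", "big answers win"]
    with Pool(processes=3) as pool:
        partials = pool.map(map_count, chunks)  # runs the map step in parallel
    print(reduce_counts(partials))
```

Because the chunks never need to see one another, the same map and reduce functions give identical results whether they run on one machine or across a federated cluster.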
We need to prepare IT for a new world where the "work" is done as close to the user as possible, what I call mobilizing IT. Now that we understand parallelization and have been applying it in various forms for over a decade, it's time to unleash its power by moving out of the traditional data center (many will note that much of this trail has already been blazed by the Content Delivery Network). One of the first challenges we must gear up to solve is how to push the analysis of data out toward the endpoints where the data is collected. It turns out that the point of origin for much of the data being generated sits very near storage and compute resources, whether on the device (a mobile phone) or one hop away on the network (the cloud). Combine this with in-network data management and routing capabilities, and the solution becomes very compelling. There are ancillary benefits as well, such as the opportunity to architect for continuous availability rather than the more expensive but less stable approach of disaster recovery. Real-time compliance routines could be applied. The opportunities are endless; the solutions, so far, are few.
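Here is a minimal sketch of what endpoint-side analysis could look like, assuming hypothetical `summarize` and `merge_summaries` helpers: each device reduces its raw readings to a tiny summary, and only those summaries, never the raw data, cross the network to be merged centrally.

```python
def summarize(readings):
    # Runs on the endpoint: collapse raw readings into a small summary
    # record that is cheap to transmit upstream.
    return {
        "count": len(readings),
        "sum": sum(readings),
        "min": min(readings),
        "max": max(readings),
    }

def merge_summaries(summaries):
    # Runs centrally: combine summaries without ever seeing the raw data.
    return {
        "count": sum(s["count"] for s in summaries),
        "sum": sum(s["sum"] for s in summaries),
        "min": min(s["min"] for s in summaries),
        "max": max(s["max"] for s in summaries),
    }

device_a = summarize([21.0, 22.5, 23.1])  # computed on the phone/sensor
device_b = summarize([19.8, 20.4])
fleet = merge_summaries([device_a, device_b])
fleet_mean = fleet["sum"] / fleet["count"]  # the answer crossed the network, not the readings
```

The design choice is the point: the network carries a few numbers per device instead of every reading, which is exactly the "answers, not raw data" trade-off.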
Decomposition is a great approach to making big problems solvable. We have grown IT into a monstrous centralized monolith that boggles our minds. As our data sets explode we need to think about new approaches, and one worth considering is federating our data. We'll need an index mechanism, with an associated service, to locate data so it can be used properly. We can then leverage the index metadata to make real-time decisions about how to execute the analytics: how much will be done at the endpoint? Where will the data be collected and sub-analyzed? At what point can analysis stop because we have the "good enough" answer? We need to stay ahead of the data war fog.
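One way such an index-driven decision might look, as a sketch with an invented catalog and an invented size threshold: the planner consults only the metadata, never the data itself, to decide whether a data set is analyzed in place at its home site or shipped back to the core.

```python
# Hypothetical federated index: dataset name -> location and size metadata.
CATALOG = {
    "web_logs":   {"site": "edge-us-east", "rows": 50_000_000},
    "crm_orders": {"site": "dc-core",      "rows": 120_000},
}

def plan_query(dataset, threshold_rows=1_000_000):
    # Consult the index metadata to choose an execution site: large data
    # sets are analyzed where they live, small ones can be shipped centrally.
    meta = CATALOG[dataset]
    if meta["rows"] > threshold_rows:
        return {"dataset": dataset, "run_at": meta["site"], "ship_data": False}
    return {"dataset": dataset, "run_at": "dc-core", "ship_data": True}

print(plan_query("web_logs"))
print(plan_query("crm_orders"))
```

A real service would weigh more than row counts (freshness, network cost, compliance constraints), but the shape is the same: cheap metadata lookups standing in front of expensive data movement.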