Of all the new Azure analytics goodies that Microsoft unveiled at its Build 2015 developer conference, the Azure Data Lake service is among those that I’m especially excited about. In a nutshell, Azure Data Lake will let organizations store massive amounts of data of all types (structured, unstructured, semi-structured) in one repository, with or without a schema, depending on which approach best balances quality and cost. This opens the door to some really powerful analytic possibilities, especially for high-throughput, low-latency scenarios such as Internet of Things (IoT) implementations.
Data lakes as a concept have been around for a few years, but Microsoft’s introduction of the data lake as an Azure cloud service breaks new ground. This is in large part thanks to the wider availability of Microsoft’s PolyBase technology, which allows an end user to query a data lake and a SQL database, seamlessly joining data across both and using the data with standard reporting tools.
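To make that cross-store idea concrete, here is a minimal sketch of the kind of query PolyBase enables: an external table mapped to files in the data lake joined to an ordinary relational table, all in one statement. PolyBase itself is T-SQL, so the query is shown here as a Python string an application might submit through any DB-API driver (such as pyodbc); every table and column name is hypothetical.

```python
# Sketch of a PolyBase-style cross-store join. SensorReadings_Lake stands in
# for an external table over data lake files; Equipment is an ordinary
# relational table. All object names are hypothetical examples.
POLYBASE_STYLE_QUERY = """
SELECT   e.EquipmentId,
         e.Model,
         AVG(s.EngineTemp) AS AvgEngineTemp
FROM     dbo.Equipment AS e
JOIN     dbo.SensorReadings_Lake AS s  -- external table over lake files
         ON s.EquipmentId = e.EquipmentId
GROUP BY e.EquipmentId, e.Model;
"""

def submit_query(connection, sql):
    """Run the query through any DB-API connection (e.g. pyodbc to SQL Server)
    and return the joined rows, ready for a standard reporting tool."""
    cursor = connection.cursor()
    cursor.execute(sql)
    return cursor.fetchall()
```

The point is that the reporting side never sees two stores: the join happens in one query, and the result set looks like any other table.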
While at first glance Azure Data Lake seems to have a lot of similarities to the Azure HDInsight service (Hadoop as a service), Data Lake is designed for massively parallel query processing and ingestion. I believe it will become the staging hub for all sorts of analytical workloads. It won’t replace the traditional data warehouse, but instead will complement and modernize it.
For analytics practitioners, Azure Data Lake represents a fundamental paradigm shift. In the past, we had to build generic data models that could support multiple different use cases. And we had to conform to rigid data warehousing methodologies to ensure we were maximizing the use of infrastructure. The mantra was “make it generic and always available” because of infrastructure costs. If handling an analytic workload meant buying 35 more servers, then we had better maximize the value we got from that infrastructure. I believe that this rigidity is one of the reasons many data warehouse projects have failed: warehouse users gave up on solutions that took too long to bring to market in the quest to maximize infrastructure usage.
When using cloud resources, the scenario is radically different. We don’t have to buy those 35 servers, install them and keep them running all the time. Instead we can choose a combination of servers and services in Azure that scale on demand to support a spiky workload. Obtaining computing resources really is as simple as using a slider bar: move the slider up when you need peak resources, down to accommodate a lighter workload. With this model, Azure creates endless possibilities for storing and analyzing data, and all sorts of pie-in-the-sky thinking becomes practical.
Golden Nuggets and IoT
I recently watched a presentation delivered by Raghu Ramakrishnan, a Technical Fellow at Microsoft, where he used the analogy of a “digital shoebox” of data to help people think about big data. It’s a great way to envision the possibilities of Azure Data Lake. You can put lots of data into your “shoebox,” store it in your closet and take it out later when a business need or question arises. When you open your shoebox later and sift through all the data, you’ll be able to find all sorts of “golden nuggets” that help you improve your operations, understand your customers and optimize your business.
Just like the “digital shoebox,” Azure Data Lake will allow organizations to collect and store massive amounts of data at a much lower cost than was possible before, and then decide what to do with it later, whether that’s using Hadoop to find patterns or moving it into a data warehouse for structured analysis. It will also enable IT organizations to take on a much more consultative role in meeting business needs and having those “golden nuggets” of information ready when the business asks a question.
What does a “golden nugget” of insight look like? The most exciting possibility created by Azure Data Lake and other Azure analytic services is the ability to combine data from standard LOB applications—like ERP, CRM and e-commerce systems—with real-time data from sensors, mobile devices, social media, video streams, etc. Think of all you could achieve by pairing up different analytic workloads to get new insights.
As I mentioned above, the Internet of Things is one application that requires the massive throughput and low latency that Azure Data Lake will enable. It also involves pairing data from sensors and devices with data from standard LOB applications.
Industries that use heavy equipment, like construction and mining, are a great illustration of how IoT can pair different analytic workloads to deliver powerful benefits. For example, think of a construction company that owns a lot of heavy equipment like excavators and bulldozers. This equipment represents a significant cost to the company, so we can assume the company collects data about equipment costs and usage and stores it in a plain-vanilla data warehouse. This lets them perform some analysis to optimize decision-making about equipment—when to retire a piece of equipment, when to buy vs. lease, what maintenance to invest in throughout the equipment lifecycle, etc.
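That warehouse-side analysis can be as simple as a cost-per-operating-hour calculation. Here is a minimal sketch of one such retire-vs-keep signal; the records, field names and threshold are all hypothetical examples, not anything from a real system.

```python
# A minimal sketch of the kind of equipment analysis a plain-vanilla data
# warehouse supports. Records, fields, and the threshold are hypothetical.
def retirement_candidates(equipment_records, max_cost_per_hour):
    """Flag equipment whose lifetime maintenance cost per operating hour
    exceeds a threshold -- a simple retire-vs-keep signal."""
    candidates = []
    for rec in equipment_records:
        cost_per_hour = rec["maintenance_cost"] / rec["operating_hours"]
        if cost_per_hour > max_cost_per_hour:
            candidates.append((rec["equipment_id"], round(cost_per_hour, 2)))
    return candidates

fleet = [
    {"equipment_id": "EXC-101", "maintenance_cost": 48_000, "operating_hours": 1_200},
    {"equipment_id": "BUL-207", "maintenance_cost": 15_000, "operating_hours": 2_500},
]
print(retirement_candidates(fleet, max_cost_per_hour=25.0))
# → [('EXC-101', 40.0)]  (48,000 / 1,200 hours = $40/hour, over the $25 threshold)
```

In a real warehouse this would of course be a query over historical fact tables rather than an in-memory loop, but the decision logic is the same.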
While valuable, these insights are latent (looking at what happened in the past) and difficult to apply in real time. That’s where IoT comes in and where the possibilities offered by Azure Data Lake and other Azure analytic services really shine. For example, let’s say the construction company wants to prevent equipment operators from abusing equipment or using it in sub-optimal ways that could lead to more frequent breakdowns. The company could combine insights from its data warehouse with real-time data coming in from equipment sensors to alert the foreman or operator when abuse is detected.
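The real-time half of that scenario can be sketched just as simply: take per-model operating limits derived from the historical warehouse analysis, and check each incoming sensor reading against them. The limits, metric names and alert wording below are all hypothetical.

```python
# Sketch of the real-time side: compare incoming sensor readings against
# per-model operating limits derived from historical warehouse analysis.
# Limits, metric names, and alert wording are hypothetical examples.
OPERATING_LIMITS = {
    "excavator": {"rpm": 2_200, "hydraulic_psi": 5_000},
}

def check_reading(model, reading):
    """Return an alert string for each metric in the reading that exceeds
    the model's limit; an empty list means normal operation."""
    alerts = []
    for metric, limit in OPERATING_LIMITS.get(model, {}).items():
        if reading.get(metric, 0) > limit:
            alerts.append(f"{model}: {metric}={reading[metric]} exceeds limit {limit}")
    return alerts

# An operator revving an excavator past its rated RPM trips an alert.
print(check_reading("excavator", {"rpm": 2_650, "hydraulic_psi": 4_100}))
# → ['excavator: rpm=2650 exceeds limit 2200']
```

In production the readings would arrive through a streaming pipeline and the alert would be routed to the foreman or operator, but the core idea is this pairing: historical insight sets the thresholds, real-time data triggers the action.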
Azure Data Lake combined with other Azure analytic services will make powerful applications like this possible across a wide variety of industries. A preview of Data Lake will be available later this year.
In the coming weeks I’ll check out some other recently announced Azure services that support this vision and what they mean for data practitioners.