The Data Lake will get Cloudy in 2017

Written by Chief Enterprise Architect, Ravi Dharnikota. Originally posted in SandHill Magazine.

In 2016, we saw a lot of interesting discussions around the data lake. In the spirit of reflecting on the past and looking at the future of this technology, here are the significant themes I foresee in 2017 for the enterprise data lake.

The data lake is no longer just Hadoop

Until recently, the enterprise data lake has largely been discussed, designed and implemented around Hadoop. Doug Cutting, creator of Hadoop, commented that Hadoop is, at its core, HDFS, YARN and MapReduce. He went on to say that many now favor Spark over MapReduce. That leaves a distributed storage system (HDFS) and a resource negotiator (YARN). HDFS has alternatives in S3, Azure Blob/Azure Data Lake Store and Google Cloud Storage. YARN has alternatives such as Mesos. Spark also has its own built-in resource management and job scheduling when run in stand-alone mode. This brings alternatives such as S3, EMR, Redshift, Spark and Azure, which sit outside traditional Hadoop, into the enterprise data lake discussion.

Separation of storage and compute

Elasticity is among the most attractive promises of the cloud. Pay for what you consume when you need it, thus keeping your costs under control. It also leads to simple operational models where you can scale and manage your storage and compute independently.

For example, a marketing team buys terabytes of data to merge with other internal data, but quickly realizes that the quality of the purchased data leaves a lot to be desired. Their internal on-premises enterprise data lake is built for regular, predictable batch loads, and investing in more nodes for the Hadoop cluster is not an option. Such a burst load is handled well by elastic compute that scales separately from storage.

Data lake “cloudification”

Hadoop has proven a challenge for enterprises, especially from an admin and skill set perspective. This has resulted in data lakes with the right intent, but poor execution, stuck in labs. At this point, the cloud seems like a great alternative with the promise of low to no maintenance and removing the skill set required around management, maintenance and admin of the data lake platform infrastructure. The cloud also allows for freedom to research new technologies (which are rapidly changing in the big data world) and experiment with solutions without a large up-front investment.

Avoiding vendor lock-in

Various cloud vendors today provide solutions for the data lake. Each vendor has its own advantages with respect to technology, ecosystem, price, ease of use, etc. Having multiple options for cloud data lake platforms gives the enterprise exposure to the latest innovations, but it also makes it harder to insulate the consumers of these platforms from change. As with most technology on-ramps, vendors and customers will predictably try many solutions until they settle on one. Vendor offerings will change frequently and significantly as their functionality and market focus improve. Enterprises will likely have expert infrastructure teams implementing several vendor offerings concurrently and learning to tweak their interfaces. All of this is good and healthy practice, but application teams will be subjected to the vendor and solution shakeout. Hundreds of users who have better things to do will naturally become inhibitors to the infrastructure team's need to keep improving the cloud infrastructure selection. There needs to be a strategy to protect the application teams in this multi-cloud data lake environment against frequent changes to their data pipelines.
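One way to picture such a strategy is a thin, vendor-neutral layer between application pipelines and the storage platform. The sketch below is only an illustration under stated assumptions: the ObjectStore interface, the S3Store and GcsStore classes, and the build_pipeline_input function are hypothetical names, not part of any vendor SDK or of SnapLogic. Pipelines refer to logical dataset names, and only the infrastructure team's choice of store decides where the data physically lives.

```python
from abc import ABC, abstractmethod
from typing import Dict


class ObjectStore(ABC):
    """Hypothetical vendor-neutral interface that pipelines code against."""

    @abstractmethod
    def uri_for(self, dataset: str) -> str:
        """Return the vendor-specific location of a logical dataset."""


class S3Store(ObjectStore):
    """Illustrative adapter that builds Amazon S3 style URIs."""

    def __init__(self, bucket: str) -> None:
        self.bucket = bucket

    def uri_for(self, dataset: str) -> str:
        return f"s3://{self.bucket}/{dataset}"


class GcsStore(ObjectStore):
    """Illustrative adapter that builds Google Cloud Storage style URIs."""

    def __init__(self, bucket: str) -> None:
        self.bucket = bucket

    def uri_for(self, dataset: str) -> str:
        return f"gs://{self.bucket}/{dataset}"


def build_pipeline_input(store: ObjectStore, dataset: str) -> Dict[str, str]:
    """Application-side code: it knows the dataset name, not the vendor."""
    return {"dataset": dataset, "location": store.uri_for(dataset)}


if __name__ == "__main__":
    # Swapping vendors changes one constructor call, not the pipeline code.
    for store in (S3Store("lake"), GcsStore("lake")):
        print(build_pipeline_input(store, "marketing/leads"))
```

With a layer like this, the infrastructure team can trial or switch platforms by changing which store is constructed, while the application teams' pipeline definitions stay the same.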

Operationalizing the data lake

The enterprise is a retrofit job. Nothing can be introduced into the enterprise in isolation. Any new system needs to work well with the technology, skill sets and processes that already exist. As data lakes graduate from the labs and become ready for consumption by the enterprise, one of the biggest challenges to adoption and success is retrofitting and operationalizing them.

A well-oiled process in an enterprise is mostly automated. How Continuous Integration and Continuous Deployment (CI/CD), versioning and collaboration are handled will define the success of the enterprise data lake. Self-service usage, driven by ease of use, API access and a security framework that supports enterprise authentication and authorization with well-defined access controls, streamlines consumption of data and services and prevents chaos.

As the demand for non-traditional data integration increases, the themes that have been developing over the last year should improve the success of enterprise data lake implementation and adoption.

So ok, I have a growing collection of data files in a “data lake”. I’ve already created metadata describing where they are, what they contain, who owns them etc. It would be very compelling if I could register these files in SnapLogic in the same way as I upload actual “Files” to project space. Then I could build patterns and pipelines that unlock the underlying data in my “data lake” without the end user needing to understand where the lake is, what the naming conventions/security keys/formats/encryption/compressions are etc. I’m very new here, perhaps this concept already exists… (something along the lines of a Metadata Registry / Metadata Repository)

SnapLogic supports parameterizing the input to pipelines, which can be leveraged in Patterns and Pipelines. With regard to support for a metadata repository/registry, the best way to expose the data to end users without publishing the underlying details is through Hive (which is supported by SnapLogic).
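To make the idea of a metadata registry feeding parameterized pipelines concrete, here is a minimal sketch in plain Python. It is not a SnapLogic or Hive API: the DatasetEntry and MetadataRegistry classes, the field names and the example location are all hypothetical. The registry maps a logical dataset name to the details an end user should not need to know, and hands them back as a flat set of parameters that a pipeline could accept as input.

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class DatasetEntry:
    """Hypothetical metadata record for one data lake dataset."""
    location: str      # where the files live
    file_format: str   # e.g. "csv", "parquet"
    compression: str   # e.g. "gzip", "none"
    owner: str         # team or person responsible for the data


class MetadataRegistry:
    """Illustrative in-memory registry keyed by logical dataset name."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, name: str, entry: DatasetEntry) -> None:
        self._entries[name] = entry

    def pipeline_parameters(self, name: str) -> Dict[str, str]:
        """Resolve a logical name into parameters for a pipeline run."""
        if name not in self._entries:
            raise KeyError(f"dataset not registered: {name}")
        return asdict(self._entries[name])


if __name__ == "__main__":
    registry = MetadataRegistry()
    registry.register(
        "customer_leads",
        DatasetEntry(
            location="s3://example-lake/marketing/leads/",  # made-up path
            file_format="csv",
            compression="gzip",
            owner="marketing",
        ),
    )
    # The end user supplies only the logical name.
    print(registry.pipeline_parameters("customer_leads"))
```

In practice the same lookup role could be played by Hive table definitions, as the reply above suggests, with the pipeline receiving a table name instead of file-level details.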