By Alex Bordei, VP of Product and Engineering at Lentiq
As you may already know, after dominating the industry for ten years, Hadoop is slowly losing momentum. The questions now are: why, and what are the alternatives? Let’s explore this a bit.
Source: Google Trends
Kubernetes has already surpassed Hadoop
According to a recent article, Doug Cutting said he “doesn’t see anything else coming along that will replace or supplant Hadoop, particularly when it comes to large-scale, on-premise deployments.”
Actually, it’s pretty clear where we need to look next: Kubernetes. Kubernetes currently has a higher adoption rate than Hadoop had at its peak. It’s already being used for ML at scale by a lot of people, for instance at booking.com, and by my own team as well.
It’s true that Kubernetes is a general-purpose technology rather than a big-data-specialized solution, which is probably why Doug Cutting did not consider it an alternative. However, being general purpose means that more people will know how to use it, that it will have a bigger, more comprehensive ecosystem, and that it will be developed faster and maintained better. It’s a product that removes the need for a specialized solution.
Big data is now normal data.
Kubernetes alone is obviously not enough. In fact, not even Hadoop is enough for today’s requirements, which have changed compared to ten years ago. Users now need native model lifecycle management, better support for deep learning frameworks (TensorFlow and friends), better accelerator support, transaction support, a better data catalogue, and so on.
All of this is now offered by open-source applications outside of the Hadoop ecosystem, with good integration with Kubernetes: Spark and Ray/Modin for distributed processing, Presto and SparkSQL as query engines, Argo as a workflow engine, Kafka as a queue, etc.
Most of these technologies provide “operators” (as they are called in Kubernetes parlance), which are a kind of Kubernetes-native controller. The Confluent Kafka operator does what Cloudera Manager’s Kafka support does, only better: what you get is a self-healing, auto-scalable, cloud-native Kafka cluster. And that’s not the operator healing anything – it’s Kubernetes.
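To make the idea concrete, here is a minimal sketch of how you would request an operator-managed Kafka cluster through the Kubernetes API, using the official Python client. The custom resource group, version, and field names below are hypothetical stand-ins; the actual schema depends on which operator you deploy.

```python
# A minimal sketch of requesting an operator-managed Kafka cluster.
# Assumes the `kubernetes` Python client is installed and a Kafka operator
# has already registered a Kafka CRD in the cluster; the group, version,
# and spec fields below are illustrative and vary by operator.
from kubernetes import client, config

config.load_kube_config()  # use the current kubectl context

kafka_cluster = {
    "apiVersion": "kafka.example.com/v1",  # hypothetical CRD group/version
    "kind": "Kafka",
    "metadata": {"name": "my-cluster", "namespace": "data"},
    "spec": {
        "replicas": 3,                 # the operator keeps this count healthy
        "storage": {"size": "100Gi"},
    },
}

api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="kafka.example.com",
    version="v1",
    namespace="data",
    plural="kafkas",
    body=kafka_cluster,
)
```

From here on, the operator watches this resource and reconciles the cluster towards it: if a broker pod dies, Kubernetes reschedules it; if you patch `replicas`, the operator scales the cluster.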
Kubernetes and the Cloud
First of all, there are not that many large-scale deployments. You may think there are, judging by the first slide of every big data presentation we’ve seen in the last 10 years or so, but in my experience most advanced analytics and ML projects are relatively “low” scale, on the order of terabytes rather than petabytes. There are a few Googles in the world dealing with truly tremendous amounts of data, but they are in a class of their own; the majority of the world is not.
Medium-sized and even small companies are dipping their toes in the ML lake these days (pun intended). They have different usage patterns compared to the traditional on-premises models. They don’t have the engineers to deploy and maintain a cluster of specialized technology akin to a software supercomputer. Even if they could hire them, it makes no financial sense to keep them on the payroll, especially when *aaS alternatives exist.
Is it really cheaper to operate in the cloud than on-premises? If you run the numbers, object storage is one order of magnitude cheaper than Hadoop’s disk-based storage. Yes, you pay for the transactions, but if you use RAM properly – and Spark knows how to do that – you don’t perform that many operations against the object storage, and the benefits start accumulating. The performance of object storage, in both GCS and S3, is comparable to that of a medium-scale (tens of nodes) Hadoop cluster for ML-type workloads.
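As an illustration, here is a minimal PySpark sketch of that pattern, assuming the s3a connector and credentials are already configured on the cluster and using a hypothetical bucket path: one scan of the object store, then repeated work against RAM.

```python
# A minimal sketch: read once from object storage, cache in RAM, then
# iterate without generating further S3 transactions. The bucket path is
# hypothetical; assumes the s3a connector (hadoop-aws) is configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("object-storage-demo").getOrCreate()

# One scan of the object store...
events = spark.read.parquet("s3a://my-data-lake/events/")

# ...pinned in memory, so repeated queries hit RAM, not S3.
events.cache()
events.count()  # materialize the cache

daily = events.groupBy("event_date").count()
daily.show()
```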
Beyond the raw storage cost, keeping storage and compute apart has the tremendous benefit of letting you (auto-)scale the compute side, which is where 90% of the costs go. Think m4.16xlarge at $2,336/month versus $230/month for 10 TB in S3. Since the ML machines are idle most of the time, releasing them back to the cloud provider yields huge cost savings.
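A quick back-of-the-envelope check of those figures, using approximate US list prices at the time of writing (they vary by region):

```python
# Back-of-the-envelope check of the compute vs. storage figures above.
# Prices are approximate list prices and vary by region.
hours_per_month = 730

m4_16xlarge_hourly = 3.20            # on-demand $/hour (approximate)
compute_monthly = m4_16xlarge_hourly * hours_per_month
print(f"m4.16xlarge: ${compute_monthly:,.0f}/month")   # ~ $2,336

s3_gb_month = 0.023                  # S3 Standard $/GB-month (approximate)
storage_monthly = s3_gb_month * 10 * 1024              # 10 TB
print(f"10 TB in S3: ${storage_monthly:,.0f}/month")   # ~ $236
```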
Hadoop cannot achieve this elasticity because of its data-and-code colocation principle. It takes too much time to “evict” terabytes of data from one node just to add them back in the morning. This is why we need the cloud, plus some mechanism that lets us add and remove nodes from our processing clusters without it being a “node unavailable” event. In Kubernetes, removing and adding a node is routine, and with operators, applications can be made to adapt when the cluster shrinks.
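For illustration, this is roughly what a graceful node removal looks like through the Kubernetes API – the same thing `kubectl drain` automates. The node name is hypothetical; recent versions of the Python client expose `V1Eviction` (older ones used `V1beta1Eviction`).

```python
# A minimal sketch of graceful node removal: cordon the node, then evict
# its pods so their controllers reschedule them on the remaining nodes.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
node = "worker-3"  # hypothetical node name

# Cordon: no new pods land on this node.
v1.patch_node(node, {"spec": {"unschedulable": True}})

# Evict the node's pods; Deployments/StatefulSets recreate them elsewhere,
# so this is not a "node unavailable" event for the application.
pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node}")
for pod in pods.items:
    eviction = client.V1Eviction(
        metadata=client.V1ObjectMeta(
            name=pod.metadata.name,
            namespace=pod.metadata.namespace,
        )
    )
    v1.create_namespaced_pod_eviction(
        pod.metadata.name, pod.metadata.namespace, eviction
    )
```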
My feeling is that both the shortage of engineering resources and the cost savings will trigger a lot of enterprises to move to the cloud. Daimler recently moved its data lake to Azure, so we can safely say the transition has already started.
Serverless and Lock-in
There are many powerful serverless services out there, such as SageMaker, and the overall use of serverless services will clearly increase. However, as Doug Cutting notes, “there’s a lot of folks who don’t want to get locked into a single cloud vendor.” Kubernetes is a solution even for that. The managed Kubernetes services out there are more or less compatible with each other and, in fact, the CNCF has a Kubernetes service provider certification, as portability is a core design principle of Kubernetes. So while cloud-specific serverless offerings will clearly gain usage, my bet is on Kubernetes as the de facto platform for ML and advanced analytics for the foreseeable future. I do think, however, that Kubernetes-based services will become more and more “serverless” themselves, removing the nodes from the picture and focusing on higher-level services.
Conclusions
What does this mean for people who have Hadoop clusters in production? If I were them, I would invest in some Kubernetes training. There’s no urgency to migrate right now, but one should at least know what’s out there. They should talk to their application developers and see whether there is any plan to migrate towards microservices and the cloud. If there is, I would consider getting into that ecosystem of CI/CD pipelines and trying to eliminate the distinction between normal application development and “big data” work, because there is simply no difference anymore.
About the Author:
Alex Bordei serves as VP of Product and Engineering at Lentiq (https://www.lentiq.com/). He has over ten years of experience in building cloud products. He has an MSc in Computer Science and has always been keen on research in advanced software technologies.