Operational challenges in Hadoop

Hadoop Challenges – Ecosystem of Different Products

Hadoop is an ecosystem of many different products designed to work in unison. Given this complexity, it poses a number of operational challenges that you should take into account before planning your next cluster. In this blog, we will highlight some of these challenges.

Deployment Challenges – Can I quickly deploy my data pipeline?

When we talk to potential Hadoop customers, the first thing they complain about is the time it takes to set up and start working with Hadoop – it simply cannot be done quickly. There is usually a long lead time just to get the cluster configured, up and running, and made accessible to users. Because multiple components (services) must be configured for different analysis needs, setting up Hadoop for one use case can be very different from setting it up for another.
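As a simple illustration of that configuration burden, even a minimal HDFS deployment requires several coordinated settings spread across files such as core-site.xml and hdfs-site.xml (the hostname and paths below are placeholders, not values from any particular cluster):

```xml
<!-- core-site.xml: the default filesystem URI that every service must agree on -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value> <!-- placeholder host -->
  </property>
</configuration>

<!-- hdfs-site.xml: replication factor and NameNode storage location -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/hdfs/name</value> <!-- placeholder path -->
  </property>
</configuration>
```

And this is HDFS alone; each additional service in the pipeline (YARN, Hive, HBase, and so on) brings its own configuration files, which is why a cluster tuned for one use case rarely carries over to another.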

[Figure: Hadoop challenges – setting up a data pipeline quickly, meeting seasonal spikes and growth, avoiding underutilized hardware, and making data available to developers]

Underutilized Hardware – Can I run multiple clusters on the same shared hardware?

Most Hadoop clusters are deployed on bare-metal servers. Fear of running into performance-predictability issues has hindered the uptake of VM-based virtualization in Hadoop. At the same time, designing clusters to meet peak workloads and avoid noisy-neighbor problems has led to cluster sprawl and underutilized bare-metal servers.

Data Availability to developers – How do I provide developers access to data?

This is another area where Hadoop faces major challenges. Customers often complain that developers and QA engineers do not get access to production-quality data to test their changes or run test cases. Often, new clusters are set up with synthetic data that barely resembles production data.
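For context, stock HDFS does offer one partial mitigation: read-only snapshots, which expose a point-in-time view of production data without duplicating the underlying blocks. A hedged sketch of the standard commands (the directory and snapshot name below are placeholders):

```shell
# Enable snapshots on the production directory (requires HDFS admin rights)
hdfs dfsadmin -allowSnapshot /data/prod

# Create a named, read-only point-in-time snapshot
hdfs dfs -createSnapshot /data/prod dev-baseline

# Developers can browse the snapshot under the hidden .snapshot directory
hdfs dfs -ls /data/prod/.snapshot/dev-baseline
```

Note that HDFS snapshots live on the same cluster as production, so developers still need access to that cluster – which is exactly the gap that separate, writable clones aim to close.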

Meeting seasonal demands – How do I handle spikes and growth?

While capacity planning has traditionally focused on meeting peak workloads, the real challenge comes when the ops team must size the cluster for unpredictable seasonal peaks. It’s a conundrum: how do you avoid over-provisioning initially, and how do you avoid under-provisioning as data sets and workloads grow?

The container-based Robin Cloud Platform (RCP) for Big Data, NoSQL, and relational databases helps customers prepare for and address these operational challenges with ease. From single-click cluster setup to scaling out and scaling up, Robin’s cloud platform is built from the ground up to address the operational issues common to most Hadoop deployments. Robin’s built-in IOPS control also ensures performance predictability by letting you set both minimum and maximum IOPS for every application.

Challenges | With Robin Cloud Platform (RCP) you can...
How do I quickly deploy my entire pipeline? | Set up your data pipeline on your existing commodity hardware
How do I avoid over-provisioning and under-provisioning of hardware? | Size your cluster appropriately and not waste resources
How can I share data with my developers? | Share data with dev and QA teams using built-in snapshot and clone capabilities
How do I handle seasonal spikes? | Scale out and scale up your cluster
How can I run my data pipelines on shared infrastructure? | Set CPU, memory, and IOPS reservations to prevent noisy-neighbor problems



Author: Deba Chatterjee, Director Products
