As Hadoop gains traction among companies of all sizes, many are discovering that getting a cluster to run optimally is a daunting task. In fact, it’s impossible for any human to respond in real time to all the changing conditions across multiple nodes and fix the problems causing bottlenecks or performance dips. Yet it’s precisely this performance degradation that you need to confront, especially in large-scale deployments. After all, if your cluster doesn’t run smoothly and efficiently, you can’t count on Hadoop to deliver business-critical results on time, and that wastes both time and money. So what should you pay attention to in order to keep your clusters operating at full capacity? Here are three warning signs to keep in mind.
Warning #1: You think you’re out of capacity
Most companies put considerable effort into capacity planning when designing a Hadoop deployment. You probably made painstaking calculations to ensure that enough resources (CPU, memory, network throughput, disk I/O, and storage) were provisioned for your cluster’s anticipated workload. But once a cluster is brought online, the true litmus test is whether all your jobs run efficiently and complete on time. Yet sometimes it can appear that you’re out of capacity when you know you’re not: you try to run more applications, and they simply won’t start.
Naturally, you start by using Ganglia or some other monitoring tool to root through various cluster metrics, looking for anomalies. You might check CPU usage and find your processors aren’t even close to 100% utilization. Your 10Gb/s network is peaking at only 50Mb/s, so that’s not the problem either. What else can you do?
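Before digging deeper, it can help to confirm those numbers from a node itself with a quick script. This is a minimal sketch, assuming the third-party psutil package is installed and that the interface of interest is named eth0 (both are assumptions to adjust for your hosts); it samples node-level CPU and network throughput much the way a monitoring agent would.

```python
# Quick node-level sanity check: CPU utilization and network throughput.
# Assumes the third-party "psutil" package (pip install psutil) and an
# interface named "eth0"; adjust both for your nodes.
import psutil

INTERFACE = "eth0"  # hypothetical interface name; change to match your hosts
WINDOW = 5.0        # seconds over which to measure

before = psutil.net_io_counters(pernic=True).get(INTERFACE)
cpu_pct = psutil.cpu_percent(interval=WINDOW)  # blocks for WINDOW seconds
after = psutil.net_io_counters(pernic=True).get(INTERFACE)

if before and after:
    rx_mbps = (after.bytes_recv - before.bytes_recv) * 8 / WINDOW / 1e6
    tx_mbps = (after.bytes_sent - before.bytes_sent) * 8 / WINDOW / 1e6
    print(f"CPU: {cpu_pct:.1f}%  RX: {rx_mbps:.1f} Mb/s  TX: {tx_mbps:.1f} Mb/s")
else:
    print(f"Interface {INTERFACE} not found on this node")
```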
While most monitoring tools can show that your network is busy, they can’t always show you why it’s busy. This is either because the tool doesn’t give sufficiently granular detail into the inner workings of your cluster’s activities, or because certain areas worth troubleshooting (e.g. Hadoop configuration settings) aren’t normally flagged by these types of tools. The first challenge in these instances is to identify the root cause of your problem.
The issue can often be traced to the YARN architecture. In fact, Hadoop clusters are frequently YARN-bound, meaning their performance is limited by the way YARN allocates and manages resources.
YARN allocates units of work (containers) to nodes and tracks whether each container completes. When jobs are scheduled, YARN assigns node resources statically, so once the jobs are running, no further resource adjustments are made to their containers. As a result, YARN can’t react quickly to changing conditions, and it must be configured to accommodate worst-case scenarios. In fact, it’s very likely that you have unused resources tied up in these statically assigned containers, resources the jobs will never actually use.
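One way to see this gap is to compare what the ResourceManager reports as allocated against what it reports as available. The following is a minimal sketch against YARN’s ResourceManager REST API (the /ws/v1/cluster/metrics endpoint); the RM hostname and port are placeholders, and the exact field names can vary slightly between Hadoop versions.

```python
# Sketch: compare memory YARN has handed to containers against what remains
# available, using the ResourceManager REST API (/ws/v1/cluster/metrics).
# The RM address below is a placeholder for your environment.
import json
import urllib.request

RM_URL = "http://resourcemanager.example.com:8088"  # hypothetical RM address

with urllib.request.urlopen(f"{RM_URL}/ws/v1/cluster/metrics") as resp:
    metrics = json.load(resp)["clusterMetrics"]

allocated = metrics["allocatedMB"]
available = metrics["availableMB"]
total = metrics["totalMB"]
pending = metrics["appsPending"]

print(f"Memory allocated to containers: {allocated} MB of {total} MB")
print(f"Memory still available:         {available} MB")
print(f"Applications waiting:           {pending}")
# If applications sit pending while node-level CPU and disk stay mostly idle,
# the statically sized containers are likely holding memory they never use.
```

If the allocated figure sits near the total while your node-level CPU, network, and disk graphs stay flat, that mismatch is a strong hint that container sizing, not physical hardware, is the real constraint.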
Warning #2: Your high-priority jobs are not finishing on time
Not all jobs run on a cluster are of equal importance. This is especially true in multi-tenant, multi-workload situations where several disparate cluster users are trying to run different applications simultaneously. If there are critical jobs that have a finite window of time to complete, you’ll want to ensure they meet their deadlines. But what if one of these high-priority jobs is suddenly taking too long to finish and missing its SLA?
You could start by checking whether a parameter or configuration setting has recently changed. Failing that, you can email other cluster users to ask whether they’ve recently changed their applications or settings in a way that’s affecting overall cluster performance. This is a time-consuming approach, though, and prone to inadequate disclosure by end users. In any case, it’s quite likely that resource contention between low- and high-priority jobs kept the critical applications from finishing on time, and up-front planning and tuning often cannot prevent that kind of contention.
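Rather than relying on email threads, you can at least enumerate who is holding resources while your critical job waits. Here is a rough sketch against the ResourceManager’s /ws/v1/cluster/apps endpoint; the RM address is a placeholder, and field availability can differ by Hadoop version.

```python
# Sketch: list running applications so you can see which users and queues are
# holding resources while a high-priority job falls behind. Uses the YARN
# ResourceManager REST API; the RM address below is a placeholder.
import json
import urllib.request

RM_URL = "http://resourcemanager.example.com:8088"  # hypothetical RM address

with urllib.request.urlopen(f"{RM_URL}/ws/v1/cluster/apps?states=RUNNING") as resp:
    payload = json.load(resp)

apps = (payload.get("apps") or {}).get("app", [])
# Sort by allocated memory so the heaviest consumers appear first.
for app in sorted(apps, key=lambda a: a.get("allocatedMB", 0), reverse=True):
    minutes = app.get("elapsedTime", 0) / 60000
    print(f'{app["id"]}  user={app["user"]}  queue={app["queue"]}  '
          f'mem={app.get("allocatedMB", 0)} MB  running={minutes:.0f} min')
```

A report like this won’t fix contention by itself, but it turns a vague suspicion about low-priority jobs into a concrete list you can act on, for example by moving the worst offenders to a capped queue.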
Warning #3: Your cluster grinds to a halt periodically
For this warning sign, let’s look at a common, real-life example: on a multi-tenant cluster used by hundreds of developers, you notice the cluster nearly grinds to a halt on a regular basis. You see heavy disk usage but can’t identify the root cause without a visualization tool that operates on the right input data.
You can use a node-monitoring tool (such as Ganglia or Cloudera Manager) that will show the disks getting busy. But these tools cannot explain why the disks are busy. The main drawback is that node monitoring tools cannot give you visibility down to the task-, user-, or job-level as your applications run — they merely provide node-level summaries.
To isolate the cause of the problem using these traditional node-monitoring tools, you could log into individual nodes and watch for heavy disk consumers with a tool such as iostat (for device-level statistics) or pidstat and iotop (for per-process usage). But you would have to know exactly when to anticipate the problem in order to catch the spike in disk usage this way. That is impossible if you rely on human attention alone; technology must play a part.
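If you want to automate that legwork, a small script can do the sampling for you. Below is one possible approach, a sketch that assumes the third-party psutil package is installed and that it runs with enough privileges to read other users’ I/O counters; it takes two per-process snapshots and reports the heaviest disk consumers, and could be scheduled to run continuously so the spike is recorded even when nobody is logged in.

```python
# Sketch: sample per-process disk I/O twice and report the biggest consumers,
# approximating what you would otherwise chase by hand with pidstat or iotop.
# Assumes the third-party "psutil" package (pip install psutil); reading other
# users' I/O counters typically requires root.
import time
import psutil

INTERVAL = 10  # seconds between the two snapshots

def snapshot():
    counters = {}
    for proc in psutil.process_iter(["pid", "name"]):
        try:
            io = proc.io_counters()
            counters[proc.info["pid"]] = (proc.info["name"], io.read_bytes, io.write_bytes)
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue  # process exited or is off-limits; skip it
    return counters

first = snapshot()
time.sleep(INTERVAL)
second = snapshot()

deltas = []
for pid, (name, reads, writes) in second.items():
    if pid in first:
        _, prev_reads, prev_writes = first[pid]
        deltas.append(((reads - prev_reads) + (writes - prev_writes), name, pid))

# Print the ten processes that moved the most data in the sampling window.
for total_bytes, name, pid in sorted(deltas, reverse=True)[:10]:
    print(f"pid={pid:<7} {name:<25} {total_bytes / 1e6:.1f} MB of disk I/O in {INTERVAL}s")
```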
Seeing the writing on the wall before it’s written
In each of these cases, the warning signs of faulty performance were difficult — or impossible — to troubleshoot with human intervention alone, especially without the proper view into cluster activity. Since symptoms can be misleading, the result is many wasted man-hours spent experimenting with various remedies. A more efficient solution is to invest in tools that can make corrections automatically at the first sign of a contention problem, even when jobs are running.
We’ll continue to see the proliferation and adoption of Hadoop across companies of all sizes and industries. Unfortunately, human ability alone is not sufficient to guarantee optimally running clusters. To maximize the value of your Hadoop deployment, you need the ability to anticipate, react quickly, and make decisions in real time. Pay close attention to these three warning signs to help pinpoint areas for improvement.