As Hadoop gains traction among companies of all sizes, many are discovering that getting a cluster to run optimally is a daunting task. In fact, it’s impossible for any human to respond in real time to all the changing conditions across multiple nodes and fix the problems that cause bottlenecks or performance dips. Yet it’s just this performance degradation that’s critical to confront, especially for large-scale deployments. After all, if your cluster doesn’t run smoothly and efficiently, you can’t count on Hadoop to deliver business-critical results on time, and you waste both time and money. So what should you watch for to make sure your clusters are operating at peak efficiency? Here are three warning signs to keep in mind.
Warning #1: You think you’re out of capacity
Most companies put considerable effort into capacity planning when designing a Hadoop deployment. You probably made painstaking calculations to ensure that enough resources — specifically CPU, memory, network throughput, and disk I/O and storage — were provisioned for your cluster’s anticipated workload. But once a cluster is brought online, the true litmus test is whether all your jobs run efficiently and complete on time. Yet sometimes it may appear you’re out of capacity when you know you’re not, because when you try to run more applications, you can’t.
Naturally, you start by using Ganglia or some other monitoring tool to root through various cluster metrics, looking for anomalies. You might check CPU usage but find your processors aren’t even close to being 100% utilized. Your 10Gb network is peaking at only 50Mb — so that’s not the problem. What else can you do?
While most monitoring tools can show that your network is busy, they can’t always show you why it’s busy. This is either because the tool doesn’t give sufficiently granular detail about your cluster’s inner workings, or because certain areas to troubleshoot (e.g. Hadoop configuration settings) aren’t normally flagged by these types of tools. The first challenge in these instances is to identify the root cause of your problem.
The issue can often be traced to the YARN architecture. In fact, Hadoop clusters are frequently YARN-bound, meaning that their performance is limited by the way YARN allocates resources.
YARN allocates units of work (containers) to nodes and tracks whether each container completes. When jobs are scheduled, YARN statically assigns node resources, so once jobs are running, no further resource adjustments are made to their containers. As a result, YARN can’t react quickly to changing conditions, and it must be configured to accommodate worst-case scenarios. In fact, it’s very likely that you have unused resources tied up in these statically assigned containers: resources that the jobs will never actually use.
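To make the cost of static allocation concrete, here is a minimal Python sketch that compares what containers reserve with what they actually use. It does not call any YARN API: the Container class, its field names, and the sample numbers are all hypothetical, chosen only to illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Container:
    job: str            # hypothetical job label
    allocated_mb: int   # memory reserved for the container at launch
    peak_used_mb: int   # highest usage observed while it ran (made-up sample)


def stranded_memory_mb(containers):
    """Memory that was reserved but never used, summed over all containers."""
    return sum(max(c.allocated_mb - c.peak_used_mb, 0) for c in containers)


def utilization(containers):
    """Fraction of reserved memory that the containers actually used."""
    allocated = sum(c.allocated_mb for c in containers)
    used = sum(min(c.peak_used_mb, c.allocated_mb) for c in containers)
    return used / allocated if allocated else 0.0


# Hypothetical workload sized for the worst case.
containers = [
    Container("etl", 4096, 1500),
    Container("etl", 4096, 3900),
    Container("report", 8192, 2048),
]

print(f"stranded: {stranded_memory_mb(containers)} MB")
print(f"utilization: {utilization(containers):.0%}")
```

In this made-up sample, more than half of the reserved memory is never touched, yet the scheduler would still treat it as unavailable to new applications.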
Warning #2: Your high-priority jobs are not finishing on time
Not all jobs run on a cluster are of equal importance. This is especially true in multi-tenant, multi-workload situations where several disparate cluster users are trying to run different applications simultaneously. If there are critical jobs that have a finite window of time to complete, you’ll want to ensure they meet their deadlines. But what if one of these high-priority jobs is suddenly taking too long to finish and missing its SLA?
You could start by checking whether a parameter or configuration setting has recently changed. Barring this, you can email other cluster users to see if they’ve recently changed their applications or settings in a way that’s impacting overall cluster performance. This is a time-consuming approach, though, and it depends on end users fully disclosing what they changed. In any case, it’s quite likely that resource contention between low- and high-priority jobs prevented the critical applications from finishing on time, and up-front planning and tuning often cannot prevent this kind of resource contention.
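The first check, whether a setting has changed, can be sketched as a comparison of two configuration snapshots. The sketch below assumes you already have the settings as plain dictionaries; how the snapshots are collected, the function name diff_settings, and the sample values are all assumptions for illustration.

```python
def diff_settings(old, new):
    """Return {key: (old_value, new_value)} for every setting that differs.

    A key missing from one snapshot is reported with None on that side.
    """
    changes = {}
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


# Hypothetical snapshots taken before and after the slowdown began.
last_week = {
    "yarn.nodemanager.resource.memory-mb": "65536",
    "yarn.scheduler.maximum-allocation-mb": "8192",
}
today = {
    "yarn.nodemanager.resource.memory-mb": "65536",
    "yarn.scheduler.maximum-allocation-mb": "16384",
}

for key, (before, after) in diff_settings(last_week, today).items():
    print(f"{key}: {before} -> {after}")
```

A comparison like this only catches configuration drift. It says nothing about contention caused by other users' jobs, which is the more likely culprit described above.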
Warning #3: Your cluster grinds to a halt periodically
For this warning sign, let’s look at a common, real-life example: on a multi-tenant cluster used by hundreds of developers, you notice the cluster nearly grinds to a halt on a regular basis. You see heavy disk usage but can’t identify the root cause without a visualization tool that operates on the right input data.
You can use a node-monitoring tool (such as Ganglia or Cloudera Manager) that will show the disks getting busy. But these tools cannot explain why the disks are busy. The main drawback is that node monitoring tools cannot give you visibility down to the task-, user-, or job-level as your applications run — they merely provide node-level summaries.
To isolate the cause of the problem using these traditional node-monitoring tools, you could log into nodes and use tools such as iostat (for per-device activity) or iotop (for per-process activity) to find every process that has significant disk usage. But you would have to know exactly when to anticipate the problem to detect the spike in disk usage with this method. This is impossible to do if you rely on human intervention alone; technology must play a part.
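As a rough illustration of what automated sampling on a single node might look like, the Python sketch below reads the per-process counters that Linux exposes in /proc/<pid>/io and ranks processes by bytes read and written between two snapshots. The function names are made up for this example, reading other users' processes usually requires root, and mapping a PID back to a Hadoop job, task, or user is outside the scope of the sketch.

```python
import os


def parse_proc_io(text):
    """Parse the 'key: value' lines of a /proc/<pid>/io file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            stats[key.strip()] = int(value)
    return stats


def snapshot():
    """Collect I/O counters for every process we are allowed to read (Linux only)."""
    snap = {}
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            try:
                with open(f"/proc/{entry}/io") as f:
                    snap[int(entry)] = parse_proc_io(f.read())
            except OSError:
                continue  # process exited, or we lack permission
    return snap


def top_disk_users(before, after, n=3):
    """Rank PIDs by bytes read plus bytes written between two snapshots."""
    deltas = {}
    for pid, now in after.items():
        prev = before.get(pid)
        if prev is None:
            continue  # process started after the first snapshot
        deltas[pid] = (
            now.get("read_bytes", 0) - prev.get("read_bytes", 0)
            + now.get("write_bytes", 0) - prev.get("write_bytes", 0)
        )
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Calling snapshot() twice, a few seconds apart, and passing both results to top_disk_users gives a crude per-process view on one node. Doing this continuously across every node, and tying the results to jobs and users, is the part that needs purpose-built tooling.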
Seeing the writing on the wall before it’s written
In each of these cases, the warning signs of faulty performance were difficult — or impossible — to troubleshoot with human intervention alone, especially without the proper view into cluster activity. Since symptoms can be misleading, the result is many wasted man-hours spent experimenting with various remedies. A more efficient solution is to invest in tools that can make corrections automatically at the first sign of a contention problem, even when jobs are running.
Hadoop will continue to proliferate across companies of all sizes and industries. Unfortunately, human ability alone is not sufficient to guarantee optimally running clusters. To maximize the value of your Hadoop deployment, you need the ability to anticipate problems, react quickly, and make decisions in real time. Pay close attention to these three warning signs to help pinpoint areas for improvement.