The USE Method applied to Hadoop

There’s an excellent methodology for performance analysis known as the "USE Method": For every resource, check Utilization, Saturation and Errors" ^[1]:

utilization — How close to capacity is the resource? (ex: CPU usage, % disk used)
saturation — How often is work waiting? (ex: network drops or timeouts)
errors — Are there error events?

For example, USE metrics for a supermarket cashier would include:

checkout utilization: flow of items across the belt; constantly stopping to look up the price of tomatoes or check coupons will harm utilization.
checkout saturation: number of customers waiting in line
checkout errors: calling for the manager

The cashier is an I/O resource; there are also capacity resources, such as available stock of turkeys:

utilization: amount of remaining stock (zero remaining would be 100% utilization)
saturation: in the US, you may need to sign up for turkeys near the Thanksgiving holiday.
errors: spoiled or damaged product

As you can see, it’s possible to have high utilization and low saturation (steady stream of customers but no line, or no turkeys in stock but people happily buying ham), low utilization and high saturation (a cashier in training with a long line), or any other combination.

Why the USE method is useful

It may not be obvious why the USE method is novel — "Hey, it’s good to record metrics" isn’t new advice.

What’s novel is its negative space. "Hey, it’s good enough to record this limited set of metrics" is quite surprising. Here’s "USE method, the extended remix": For each resource, check Utilization, Saturation and Errors. This is the necessary and sufficient foundation of a system performance analysis, and advanced effort is only justifiable once you have done so.

There are further benefits:

An elaborate solution space becomes a finite, parallelizable, delegatable list of activities.
Blind spots become obvious. We once had a client issue where datanodes would go dark in difficult-to-reproduce circumstances. After much work, we found that persistent connections from Flume and chatter around numerous small files created by Hive were consuming all available datanode handler threads. We learned by pain what we could have learned by making a list — there was no visibility for the number of available handler threads.
Saturation metrics are often non-obvious, and high saturation can have non-obvious consequences. (For example, the hard lower bound of TCP throughput)
It’s known that Teddy Bears make superb level-1 support techs, because being forced to deliver a clear, uninhibited problem statement often suggests its solution. To some extent any framework this clear and simple will carry benefits, simply by forcing an organized problem description.

<remark>http://en.wikipedia.org/wiki/Rubber_duck_debugging</remark>;

Look for the Bounding Resource

The USE metrics described below help you to identify the limiting resource of a job; to diagnose a faulty or misconfigured system; or to guide tuning and provisioning of the base system.

Improve / Understand Job Performance

Hadoop is designed to drive max utilization for its bounding resource by coordinating manageable saturation_ of the resources in front of it.

The "bounding resource" is the fundamental limit on performance — you can’t process a terabyte of data from disk faster than you can read a terabyte of data from disk. k

disk read throughput
disk write throughput
job process CPU
child process RAM, with efficient utilization of internal buffers
If you don’t have the ability to specify hardware, you may need to accept "network read/write throughput" as a bounding resource during replication.

At each step of a job, what you’d like to see is very high utilization of exactly one bounding resource from that list, with reasonable headroom and managed saturation for everything else. What’s "reasonable"? As a rule of thumb, utilization above 70% in a non-bounding resource deserves a closer look.

Diagnose Flaws

Balanced Configuration/Provisioning of base system

Resource List

Please see the [glossary] for definitions of terms. I’ve borrowed many of the system-level metrics from Brendan Gregg’s Linux Checklist; visit there for a more-detailed list.

Table 1. USE Method Checklist

resource	type	metric	instrument

CPU-like concerns

CPU	utilization	system CPU	`top`/`htop`: CPU %, overall
	utilization	job process CPU	`top`/`htop`: CPU %, each child process
	saturation	max user processes	`ulimit -u` ("noproc" in `/etc/security/limits.d/…`)
mapper slots	utilization	mapper slots used	jobtracker console; impacted by `mapred.tasktracker.map.tasks.maximum`
mapper slots	saturation	mapper tasks waiting	jobtracker console; impacted by scheduler and by speculative execution settings
	saturation	task startup overhead	???
	saturation	combiner activity	jobtracker console (TODO table cell name)
reducer slots	utilization	reducer slots used	jobtracker console; impacted by `mapred.tasktracker.reduce.tasks.maximum`
reducer slots	saturation	reducer tasks waiting	jobtracker console; impacted by scheduler and by speculative execution and slowstart settings

Memory concerns	_	__	_

memory capacity	utilization	total non-OS RAM	`free`, `htop`; you want the total excluding caches+buffers.
	utilization	child process RAM	`free`, `htop`: "RSS"; impacted by `mapred.map.child.java.opts` and `mapred.reduce.child.java.opts`
	utilization	JVM old-gen used	JMX
	utilization	JVM new-gen used	JMX
memory capacity	saturation	swap activity	`vmstat 1` - look for "r" > CPU count.
	saturation	old-gen gc count	JMX, gc logs (must be specially enabled)
	saturation	old-gen gc pause time	JMX, gc logs (must be specially enabled)
	saturation	new-gen gc pause time	JMX, gc logs (must be specially enabled)
mapper sort buffer	utilization	record size limit	announced in job process logs; controlled indirectly by `io.sort.record.percent`, spill percent tunables
	utilization	record count limit	announced in job process logs; controlled indirectly by `io.sort.record.percent`, spill percent tunables
mapper sort buffer	saturation	spill count	spill counters (jobtracker console)
	saturation	sort streams	io sort factor tunable (`io.sort.factor`)
shuffle buffers	utilization	buffer size	child process logs
	utilization	buffer %used	child process logs
shuffle buffers	saturation	spill count	spill counters (jobtracker console)
	saturation	sort streams	merge parallel copies tunable `mapred.reduce.parallel.copies` (TODO: also `io.sort.factor`?)
OS caches/buffers	utilization	total c+b	`free`, `htop`

disk concerns	_	__	_

system disk I/O	utilization	req/s, read	`iostat -xz 1` (system-wide); `iotop` (per process); `/proc/{PID}/sched` "se.statistics.iowait_sum"
	utilization	req/s, write	`iostat -xz 1` (system-wide); `iotop` (per process); `/proc/{PID}/sched` "se.statistics.iowait_sum"
	utilization	MB/s, read	`iostat -xz 1` (system-wide); `iotop` (per process); `/proc/{PID}/sched` "se.statistics.iowait_sum"
	utilization	MB/s, write	`iostat -xz 1` (system-wide); `iotop` (per process); `/proc/{PID}/sched` "se.statistics.iowait_sum"
system disk I/O	saturation	queued requests	`iostat -xnz 1`; look for "avgqu-sz" > 1, or high "await".
system disk I/O	errors		`/sys/devices/…/ioerr_cnt`; `smartctl`, `/var/log/messages`

network concerns	_	__	_

network I/O	utilization		`netstat`; `ip -s {link}`; `/proc/net/{dev}` — RX/TX throughput as fraction of max bandwidth
network I/O	saturation		`ifconfig` ("overruns", "dropped"); `netstat -s` ("segments retransmited"); `/proc/net/dev` (RX/TX "drop")
network I/O	errors	interface-level	`ifconfig` ("errors", "dropped"); `netstat -i` ("RX-ERR"/"TX-ERR"); `/proc/net/dev` ("errs", "drop")
		request timeouts	daemon and child process logs
handler threads	utilization	nn handlers	(TODO: how to measure) vs `dfs.namenode.handler.count`
	utilization	jt handlers	(TODO: how to measure) vs
	utilization	dn handlers	(TODO: how to measure) vs `dfs.datanode.handler.count`
	utilization	dn xceivers	(TODO: how to measure) vs `dfs.datanode.max.xcievers

framework concerns	_	__	_

disk capacity	utilization	system disk used	`df -bM`
	utilization	HDFS directories	`du -smc /path/to/mapred_scratch_dirs` (for all directories in `dfs.data.dir`, `dfs.name.dir`, `fs.checkpoint.dir`)
	utilization	mapred scratch space	`du -smc /path/to/mapred_scratch_dirs` (TODO scratch dir tunable)
	utilization	total HDFS free	namenode console
	utilization	open file handles	`ulimit -n` ("nofile" in `/etc/security/limits.d/…`)
job process	errors		stderr log
	errors		stdout log
	errors		counters
datanode	errors
namenode	errors
secondarynn	errors
tasktracker	errors
jobtracker	errors

Metrics in bold are critical resources — you would like to have exactly one of these at its full sustainable level

Ignore past here please

See What’s Happening

JMX (Java Monitoring Extensions)

Whenever folks are having "my programming language is better than yours", the Java afficianado can wait until somebody’s scored a lot of points and smugly play the "Yeah, but Java has JMX" card. It’s simply amazing.

Deep Explanation of JMX

VisualVM is a client for examining JMX metrics.

If you’re running remotely (as you would be on a real cluster), here are instructions for ^[2]

There is also an open-source version, MX4J.

You need a file called jmxremote_optional.jar, from Oracle Java’s "Remote API Reference Implementation"; download from Oracle. They like to gaslight web links, so who knows if that will remain stable.
Add it to the classpath of

(on a mac, in /usr/bin/jvisualvm)

Run visualvm visualvm -cp:a /path/to/jmxremote_optional.jar (on a mac, add a 'j' in front: /usr/bin/jvisualvm …).

You will need to install several plugins; I use VisualVM-Extensions, VisualVM-MBeans, Visual GC, Threads Inspector, Tracer-{Jvmstat,JVM,Monitpr} Probes.

Poor-man’s profiling

To find the most CPU-intensive java threads:

ps -e HO ppid,lwp,%cpu --sort %cpu | grep java | tail; sudo killall -QUIT java

The -QUIT sends the SIGQUIT signal to Elasticsearch, but QUIT doesn’t actually make the JVM quit. It starts a Javadump, which will write information about all of the currently running threads to standard out.

Other tools:

Ganglia
BTrace

Rough notes

Metrics:

number of spills
disk {read,write} {req/s,MB/s}
CPU % {by process}
GC
- heap used (total, %)
- new gen pause
- old gen pause
- old gen rate
- STW count
system memory
- resident ram {by process}
- paging
network interface
- throughput {read, write}
- queue
handler threads
- handlers
- xceivers *
mapper task CPU
mapper taks Network interface — throughput Storage devices — throughput, capacity Controllers — storage, network cards Interconnects — CPUs, memory, throughput
disk throughput
handler threads
garbage collection events

Exchanges:

* * shuffle buffers — memory for disk * gc options — CPU for memory

If at all possible, use a remote monitoring framework like Ganglia, Zabbix or Nagios. However clusterssh or its OSX port along with the following commands will help

Exercises

Exercise 1: start an intensive job (eg <remark>TODO: name one of the example jobs</remark>) that will saturate but not overload the cluster. Record all of the above metrics during each of the following lifecycle steps:

map step, before reducer job processes start (data read, mapper processing, combiners, spill)
near the end of the map step, so that mapper processing and reducer merge are proceeding simultaneously
reducer process step (post-merge; reducer processing, writing and replication proceeding)

Exercise 2: For each of the utilization and saturation metrics listed above, describe job or tunable adjustments that would drive it to an extreme. For example, the obvious way to drive shuffle saturation (number of merge passes after mapper completion) is to bring a ton of data down on one reducer — but excessive map tasks or a reduce_slowstart_pct of 100% will do so as well.

1. developed by Brendan Gregg for system performance tuning, modified here for Hadoop

2. Thanks to Mark Feeney for the writeup

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

XX54-tuning-use_method_checklist.asciidoc

XX54-tuning-use_method_checklist.asciidoc

The USE Method applied to Hadoop

Look for the Bounding Resource

Improve / Understand Job Performance

Diagnose Flaws

Balanced Configuration/Provisioning of base system

Resource List

See What’s Happening

JMX (Java Monitoring Extensions)

Poor-man’s profiling

Rough notes

Exercises

Files

XX54-tuning-use_method_checklist.asciidoc

Latest commit

History

XX54-tuning-use_method_checklist.asciidoc

File metadata and controls

The USE Method applied to Hadoop

Look for the Bounding Resource

Improve / Understand Job Performance

Diagnose Flaws

Balanced Configuration/Provisioning of base system

Resource List

See What’s Happening

JMX (Java Monitoring Extensions)

Poor-man’s profiling

Rough notes

Exercises