
Tutorials

Jacek Masiulaniec edited this page Jun 23, 2015 · 12 revisions
  3.1. Write to OpenTSDB
  3.2. Collect operating system metrics
  3.3. Set the cluster tag
  3.4. Access the API
  3.5. Poll remote hosts
  3.6. Shed load

3.1. Write to OpenTSDB

TSP treats OpenTSDB as a feed subscriber with special requirements:

  • It scales best when receiving feeds directly from forwarders (without going through the aggregator).
  • It doesn't need to receive data points that don't contribute to line plots.

Therefore, register it by adding the following entry to /etc/tsp-controller/network:

<subscriber id="tsd" host="tsd.example.com" direct="true" dedup="true"/>

In case your OpenTSDB servers aren't reachable via a single domain name (e.g. a VIP), provide the domain names of the OpenTSDB servers as a comma-separated list:

<subscriber
 id="tsd"
 host="tsd01.example.com,tsd02.example.com"
 direct="true"
 dedup="true"
/>

Apply by restarting the controller:

service tsp-controller restart

Once the config propagates (about 1 minute), OpenTSDB will start receiving feeds on TCP/4242 from all forwarders and the poller.

tsp-controller(8) has more details about registering subscribers.

3.2. Collect operating system metrics

TSP does not ship a plugin that reads operating system metrics. Instead, it makes it easy to reuse plugins developed by the tcollector project.

For example, collecting CPU metrics is a matter of running:

cd /etc/tsp/collect.d
wget https://raw.githubusercontent.com/OpenTSDB/tcollector/bd7c16e47e7617e93a092da50b2f9be671d4ef47/collectors/0/procstats.py
chmod +x procstats.py

Use your favorite deployment system to install these scripts.

Once tsp-forwarder detects a new script (about 1 second), it executes it automatically. You can observe the newly added metrics by running the following command on the aggregator host:

tcpdump -Alnp -i any "dst net localnet and dst port 4242" | grep "put proc\."

3.3. Set the cluster tag

By default, data points obtained from TSP lack explicit server group information. The host tag identifies individual hosts, which in some cases is sufficient to infer group membership. For example, host=web01.us.example.com can reasonably be expected to be a member of the web.us server group.

However, this convention can be difficult to enforce. For this reason, TSP provides a way to supply the group information explicitly, storing it in the cluster tag of every data point.

Provide the desired cluster information by creating the file /etc/tsp-controller/config:

<config>
    <hostgroup id="web">
        <cluster id="web.us">
            <host id="web01.us.example.com"/>
            <host id="web02.us.example.com"/>
            <host id="web03.us.example.com"/>
        </cluster>
    </hostgroup>
</config>

Apply by restarting the controller:

service tsp-controller restart

Once the config propagates (about 1 minute), the real-time feed will start including the new tag for all metrics originating at these three web hosts.
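For example, a put line from web01 that previously carried only the host tag would then also carry the cluster tag (metric name, value, and tag order here are illustrative):

```
put proc.stat.procs_blocked 1434988800 12 host=web01.us.example.com cluster=web.us
```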

tsp-controller(8) has more details about the format of the config file.

3.4. Access the API

In order to access the real-time API, you must write a job that accepts a single TCP connection on TCP/4242 and reads the put commands from it. In addition, the job has to respond to a heartbeat request (the version command) every 5 seconds.

Once the job is up and listening on TCP/4242, register it by adding the following entry to /etc/tsp-controller/network:

<subscriber id="myjob" host="myjob.example.com"/>

Apply by restarting the controller:

service tsp-controller restart

Once the config propagates (about 1 minute), your job will receive the stream over a connection established by the aggregator.

Developing stream-processing jobs is easy. For example, here is a proof-of-concept threshold checker implemented in bash:

# heartbeat responds to aggregator's heartbeat requests.
heartbeat() {
	while true
	do
		echo "Built on <unknown> (myjob)" || exit
		sleep 5
	done
}

# readfeed reads the feed looking for problems with blocked i/o
readfeed() {
	while read _ metric t n tags
	do
		if [ "$metric" = "proc.stat.procs_blocked" -a "$n" -gt 100 ]
		then
			series="$metric $tags"
			echo "$t,$series,too many blocked processes ($n>100)"
		fi
	done
}

heartbeat | nc -k -l 4242 | grep "^put " | readfeed

Its output is three-column CSV: the first column is the time in Unix epoch format, the second is the series used to detect the problem, and the third is a diagnostic message.
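To check the detection logic offline, you can feed readfeed a hand-written put line (the function is restated here so the snippet runs standalone; the sample values are made up):

```shell
# Same readfeed as above, restated so this snippet runs standalone.
readfeed() {
	while read _ metric t n tags
	do
		if [ "$metric" = "proc.stat.procs_blocked" -a "$n" -gt 100 ]
		then
			series="$metric $tags"
			echo "$t,$series,too many blocked processes ($n>100)"
		fi
	done
}

# A made-up sample data point in the feed's put format.
echo "put proc.stat.procs_blocked 1434988800 150 host=web01.us.example.com" | readfeed
```

The sample value trips the threshold, so the checker prints one CSV row; a value of 100 or less would produce no output.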

tsp-aggregator(8) has more details on the data contract of the feed.

3.5. Poll remote hosts

The tsp-poller(8) service exists to allow polling of remote hosts. It accepts plugins just like tsp-forwarder(8), except these plugins are installed under /etc/tsp-poller/collect.d.

TODO: include collect-f5 and collect-netscaler examples.
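Until those examples land, here is a minimal plugin sketch. It assumes a tsp-poller plugin, like the tcollector plugins above, prints one "metric timestamp value tags" line per sample on stdout; the ping check, metric name, and target host are made up for illustration:

```shell
#!/bin/sh
# Hypothetical poller plugin: report whether a remote host answers a ping.
# sample prints one tcollector-style line: <metric> <timestamp> <value> <tags>
sample() {
	host=$1
	if ping -c 1 -W 1 "$host" >/dev/null 2>&1
	then
		up=1
	else
		up=0
	fi
	echo "net.ping.up $(date +%s) $up host=$host"
}

# A real plugin would loop over its targets and sleep between rounds.
sample router01.example.com
```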

3.6. Shed load

Betfair once experienced a complete failure of OpenTSDB caused by a partial failure of the underlying HBase layer. The right emergency response in such a scenario is to reduce the data rate of the real-time feed so that it stops exceeding the available capacity.

TSP gives the operator a mechanism for precisely blocking data points inserted into the feed. For example, to block all metrics matching foo.*, create the file /etc/tsp-controller/filter with the following content:

#!/bin/bash

program=$1

noop() {
	echo "[]"
	exit 0
}

[ "$program" = "tsp-forwarder" ] || noop

cat <<'EOF'
[
	{
		"Match": ["^foo\\."],
		"Block": true
	}
]
EOF

Make it executable:

chmod +x /etc/tsp-controller/filter
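Because the forwarders parse the script's output as JSON, it is worth validating both branches before deploying. A minimal check, assuming python3 is available (it writes a throwaway copy of the filter from this page so it can be tried anywhere; in production, point it at /etc/tsp-controller/filter directly):

```shell
# Sketch: validate a filter script's JSON output before deploying it.
# This writes a throwaway copy of the filter so the check runs anywhere.
filter=$(mktemp)
cat >"$filter" <<'SCRIPT'
#!/bin/bash

program=$1

noop() {
	echo "[]"
	exit 0
}

[ "$program" = "tsp-forwarder" ] || noop

cat <<'EOF'
[
	{
		"Match": ["^foo\\."],
		"Block": true
	}
]
EOF
SCRIPT
chmod +x "$filter"

# Both branches must print well-formed JSON.
"$filter" tsp-forwarder | python3 -m json.tool >/dev/null && echo "forwarder rules: OK"
"$filter" some-other-program | python3 -m json.tool >/dev/null && echo "noop branch: OK"
```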

Once the propagation delay elapses (about 1 minute), the forwarders will start dropping all matching data points.

tsp-controller(8) has more details about the filter mechanism.

Return to Documentation.