Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Plugin Redis: Intial implementation #2

Merged
merged 3 commits into from
Jun 18, 2018

Conversation

LukeHandle
Copy link

@LukeHandle LukeHandle commented Jun 7, 2018

If redis-cli is in the PATH, it will:

  • Run --stat showing key metric changes over 5 seconds
  • Run --latency for 3 seconds, 3 times
  • Run --intrinsic-latency, testing the local systems latency
  • Run info for general statistics

If the cli tool is not present locally, netcat is used to hit the specified host/port with an info.

Both methods support the options:

PLUGIN_OPTS_REDIS_HOST defaults to 127.0.0.1
PLUGIN_OPTS_REDIS_PORT defaults to 6379

I can re PR this against the upstream/master - but it looks like a much larger change until your change rackerlabs#158 is merged.

If redis-cli is in the PATH, it will:
  * Run the "--stat" showing key metric changes over 5 seconds
  * Run "--latency" for 3 seconds, 3 times
  * Run "--intrinsic-latency", testing the local systems latency
  * Run "info" for general statistics

If the cli tool is not present locally, netcat is used to hit the specified host/port.

Both methods support the options:

`PLUGIN_OPTS_REDIS_HOST` defaults to 127.0.0.1
`PLUGIN_OPTS_REDIS_PORT` defaults to 6379
Copy link
Owner

@tonyskapunk tonyskapunk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks really good.

Few suggested changes, some additional ideas to handle things a little bit different.

I think doing the PR against the branch that is used for the PR on the upstream repo is the right thing to do.


Two ideas that triggered this.

When running this plugin, I immediately noticed an increase in execution time(see below).

Sounds like it would be nice to run plugins/core processes in parallel.

Maybe plugins could live in a different repo.

At the moment is too soon but wanted to write down them.

Thanks for the help!!

--

EDIT: Adding the timing examples:

Without plugins:

2018-06-08 10:11:15-05:00 [INFO] No plugins found.
2018-06-08 10:11:15-05:00 [INFO] Loading plugins from: /usr/lib/recap/plugins-enabled
2018-06-08 10:11:15-05:00 [INFO] Loading plugin: /usr/lib/recap/plugins-enabled/*
2018-06-08 10:11:15-05:00 [WARNING] No plugins found to execute in: /usr/lib/recap/plugins-enabled
2018-06-08 10:11:15-05:00 [INFO] recap[7252]: Caught signal - deleting /var/lock/recap.lock
2018-06-08 10:11:15-05:00 [INFO] Execution time: 11s
2018-06-08 10:11:15-05:00 [INFO] --- Ending recap[7252] ---

Enabling redis plugin:

2018-06-08 10:13:52-05:00 [INFO] 1 plugins found: redis
2018-06-08 10:13:52-05:00 [INFO] Loading plugins from: /usr/lib/recap/plugins-enabled
2018-06-08 10:13:52-05:00 [INFO] Loading plugin: /usr/lib/recap/plugins-enabled/redis
2018-06-08 10:13:52-05:00 [INFO] Starting 'redis' report - redis_20180608-101341.log
2018-06-08 10:14:05-05:00 [INFO] Ended 'redis' report
2018-06-08 10:14:05-05:00 [INFO] recap[7658]: Caught signal - deleting /var/lock/recap.lock
2018-06-08 10:14:05-05:00 [INFO] Execution time: 24s
2018-06-08 10:14:05-05:00 [INFO] --- Ending recap[7658] ---

if [[ -n "${redis_cli}" ]]; then
print_blankline "${LOGFILE}"

echo "Redis stat (+delta / 0.5s)" >> "${LOGFILE}"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth defining another option variable for the stat? PLUGIN_OPTS_REDIS_STAT ? or is a value that will hardly change?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the stat interval? As we are running it for 5 seconds, the value would only ever make sense below 1 second. It's possible that on a really busy server, you might see benefit lowering further, but you'd likely be better placed increasing the run time (or running proper Redis monitoring!). I don't think it will need changing that often.

Speaking of run time - I considered adding options for configuring the "timeout" values, but that felt like overload of variables.

log INFO "Starting '${plugin_name}' report - ${LOGFILE##*/}"
print_redis_timings "${LOGFILE}"
print_blankline "${LOGFILE}"
print_redis_info "${LOGFILE}"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to pass ${LOGFILE} in here for the print_redis_*, since you are not receiving the value on those functions and also because you're calling those functions in the same scope where ${LOGFILE} is defined, that allows you to use it in the same way you use ${redis_cli}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I initially had them redefined locally in each of the other functions, but then realised they were of course still in scope (but forgot to remove this remnant).

print_redis() {
local LOGFILE="$1"
local plugin_name=${FUNCNAME[0]/print_/}
local redis_cli=$( type -p redis-cli )
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you test if redis-cli is available in two out of the three print_redis_* functions, maybe is worth to test that in here and bail out if not installed ?

Maybe you only call print_redis_info since you have another way to pull info through nc ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. It crossed my mind, but thought it'd get messy when then running print_redis_info and checking there anyway - I'll add a new function as suggested further down.

}

print_redis_latency() {
local latency=$( timeout 3 "${redis_cli}" -h "${host}" -p "${port}" --latency 2>&1 )
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, that ANSI code you are removing(*2K) took me a while to understand it, it was until I ran a strace that I could see it, just for documenting purposes:

$ strace -o redis_cli.log -f -s255  timeout 0.1 "${redis_cli}" -h "${host}" -p "${port}" --latency 2>&1 
min: 0, max: 1, avg: 0.62 (8 samples)

$ grep min redis_cli.log |tail -1
9087  write(1, "\33[0G\33[2Kmin: 0, max: 1, avg: 0.62 (8 samples)", 45) = 45

The above means ( from this nice document https://www.real-world-systems.com/docs/ANSIcode.html)

  • \33 - Escape sequence
  • [0g - Clear horizontal tab stop at current position
  • [2K - erase entire current line

A quick example on how that affects the logs:

$ printf "$( date --rfc-3339=s ) \33[0G\33[2K[foo]\n"
[foo]

$ printf "$( date --rfc-3339=s ) [foo]\n"
2018-06-08 11:35:49-05:00 [foo]

I found another difference, I'm using Arch for testing and redis version 4.0.9-1 from community repo.

$ echo $( timeout 3 "${redis_cli}" -h "${host}" -p "${port}" --latency 2>&1 )
0 17 0.53 94

While the version provided in CentOS 7, 3.2.10-2.el7 from EPEL. Print outs differently:

$ echo $(timeout 3 "${redis_cli}" -h "${host}" -p "${port}" --latency 2>&1 )
min: 0, max: 1, avg: 0.07 (294 samples)

I not sure what's going on with the redis client version 4, I guess is "OK" just printing the tree values knowing that they mean "min, max and avg".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I encountered a similar problem in another project where I was only splitting a string on \n, missing the \r (CR) character which would then reset the cursor position causing an overwrite...

I noticed the problem here purely by outputting to a file and viewing with less. cat interprets the escape codes by default unless ran with -e.

Did you want me to update the comment/add another line saying along lines of:

test this with "redis-cli --latency | cat -e" to see raw output and escape codes

For reference:

$ redis-cli --latency | cat -e
^[[0G^[[2Kmin: 11, max: 11, avg: 11.00 (1 samples)^[[0G^[[2Kmin: 0, max: 11, avg: 5.50 (2 samples)

Interestingly, it appears it also only runs for a second on 4.0:

$ time redis-cli --latency | cat -e
0 13 0.19 89$

real	0m1.025s
user	0m0.001s
sys	0m0.010s

With the help page showing the answer:

  --latency          Enter a special mode continuously sampling latency.
                     If you use this mode in an interactive session it runs
                     forever displaying real-time stats. Otherwise if --raw or
                     --csv is specified, or if you redirect the output to a non
                     TTY, it samples the latency for 1 second (you can use
                     -i to change the interval), then produces a single output
                     and exits.

I've tested setting the -i to 3, but it takes marginally longer and terminates early with no output (it doesn't output an update for each measurement like the 3.2 client). Setting this lower to 2.5 would still cause issues on an extremely loaded system:

 $ time timeout 3 redis-cli --latency -i 2.5 | cat -e
Terminated

real	0m3.397s
user	0m0.002s
sys	0m0.013s

I'm inclined to still set it to 2.5 as it's fairly safe, but do you think I should instead add some version checking and rely purely on the builtin timer for 4.0?

if [[ -n "${redis_cli}" ]]; then
"${redis_cli}" -h "${host}" -p "${port}" info all &>> "${LOGFILE}"
else
( printf "INFO ALL\r\n"; sleep 1 ) | nc "${host}" "${port}" &>> "${LOGFILE}"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of comments in here.

It would be a good idea to test if nc is available to use it. As plugins are optional and enabled by users package maintainers won't need to add dependencies to them.

I added a similar comment on the print_redis function, where it might be worth checking for redis_cli and maybe nc as well(?) and just call those print_redis_* functions that will produce results with the found commands. Maybe this function can be split into print_redis_info and print_redis_info_nc or something similar. Just throwing some ideas.

--

Netcat(nc) is hanging when not using nmap ncat, with Arch I attempted using gnu-netcat and openbsd-netcat and after receiving the all the info from redis it remains connected. CentOS works fine as I have tested with nmap-ncat. Debian has to use ncat as this is the nmap "netcat" tool, otherwise the other netcat tools(netcat-traditional, netcat-openbsd) hang as well.

In Debian the nmap ncat tool is not offered as an alternative of nc then one needs to be explicit about its use. In CentOS there is no issue as nc is symlinked to ncat. Then I suggest the use of ncat that should satisfy those distros to use the nmap netcat tool.

( printf "INFO ALL\r\n"; sleep 1 ) | ncat -C "${host}" "${port}"

I only tested with centos 7, debian 9 (and Arch, but that does not count)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure sure. Another function seems neatish, maybe I'll name it print_redis_info_fallback instead.


Ahh, so the issues with hanging are due to the line end differences between implementations?

I'll also add a -w 5 to try and handle timeouts (almost all the redis-cli calls are wrapped in timeout which would handle this).

If we still see further issues here, we could instead switch to a raw sockets implementation à la http://milhouseonsoftware.com/2015/07/27/writing-a-redis-client-in-pure-bash-part-1/

Would bypass issues with ncat being present or differing behaviours, but plenty of other problems would present themselves I also suspect.

$ exec {redis_socket}<>/dev/tcp/${host}/${port}
$ printf "INFO ALL\r\nQUIT\r\n" >&${redis_socket}
$ cat <&${redis_socket}
[...]
# Keyspace
db0:keys=284,expires=5,avg_ttl=3127545

+OK

@@ -0,0 +1,69 @@
#!/bin/bash
#
# Copyright (C) 2017 Rackspace, Inc.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your name should be here, I'll ask around as that would be my impression.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can update this in the Contributors section, but as discussed, this is fine.

Copy link
Author

@LukeHandle LukeHandle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look!

I think I'll wrap the options which take time under a flag which is default off to speed up execution.

I'm considering a further flag to run a CLIENT LIST (default off), but I'm hesitant on the useful-ness of this.

if [[ -n "${redis_cli}" ]]; then
print_blankline "${LOGFILE}"

echo "Redis stat (+delta / 0.5s)" >> "${LOGFILE}"
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the stat interval? As we are running it for 5 seconds, the value would only ever make sense below 1 second. It's possible that on a really busy server, you might see benefit lowering further, but you'd likely be better placed increasing the run time (or running proper Redis monitoring!). I don't think it will need changing that often.

Speaking of run time - I considered adding options for configuring the "timeout" values, but that felt like overload of variables.


echo "Redis stat (+delta / 0.5s)" >> "${LOGFILE}"
# "--stat" expects a TTY (does an ioctl with TCGETS), stdbuf resolves this neatly
timeout 5 stdbuf -oL "${redis_cli}" -h "${host}" -p "${port}" --stat -i 0.5 &>> "${LOGFILE}"
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funnily enough, I believe my initial understanding was false here. When setting the timeout larger without the stdbuf, it does output to the terminal, but (penny-drop), buffered. My solution was correct for the actual problem, but my assumption on what it must have been doing after seeing what happened in the strace with the ioctl was wrong.

I'll update the comment here to reflect this.

}

print_redis_latency() {
local latency=$( timeout 3 "${redis_cli}" -h "${host}" -p "${port}" --latency 2>&1 )
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I encountered a similar problem in another project where I was only splitting a string on \n, missing the \r (CR) character which would then reset the cursor position causing an overwrite...

I noticed the problem here purely by outputting to a file and viewing with less. cat interprets the escape codes by default unless ran with -e.

Did you want me to update the comment/add another line saying along lines of:

test this with "redis-cli --latency | cat -e" to see raw output and escape codes

For reference:

$ redis-cli --latency | cat -e
^[[0G^[[2Kmin: 11, max: 11, avg: 11.00 (1 samples)^[[0G^[[2Kmin: 0, max: 11, avg: 5.50 (2 samples)

Interestingly, it appears it also only runs for a second on 4.0:

$ time redis-cli --latency | cat -e
0 13 0.19 89$

real	0m1.025s
user	0m0.001s
sys	0m0.010s

With the help page showing the answer:

  --latency          Enter a special mode continuously sampling latency.
                     If you use this mode in an interactive session it runs
                     forever displaying real-time stats. Otherwise if --raw or
                     --csv is specified, or if you redirect the output to a non
                     TTY, it samples the latency for 1 second (you can use
                     -i to change the interval), then produces a single output
                     and exits.

I've tested setting the -i to 3, but it takes marginally longer and terminates early with no output (it doesn't output an update for each measurement like the 3.2 client). Setting this lower to 2.5 would still cause issues on an extremely loaded system:

 $ time timeout 3 redis-cli --latency -i 2.5 | cat -e
Terminated

real	0m3.397s
user	0m0.002s
sys	0m0.013s

I'm inclined to still set it to 2.5 as it's fairly safe, but do you think I should instead add some version checking and rely purely on the builtin timer for 4.0?

if [[ -n "${redis_cli}" ]]; then
"${redis_cli}" -h "${host}" -p "${port}" info all &>> "${LOGFILE}"
else
( printf "INFO ALL\r\n"; sleep 1 ) | nc "${host}" "${port}" &>> "${LOGFILE}"
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure sure. Another function seems neatish, maybe I'll name it print_redis_info_fallback instead.


Ahh, so the issues with hanging are due to the line end differences between implementations?

I'll also add a -w 5 to try and handle timeouts (almost all the redis-cli calls are wrapped in timeout which would handle this).

If we still see further issues here, we could instead switch to a raw sockets implementation à la http://milhouseonsoftware.com/2015/07/27/writing-a-redis-client-in-pure-bash-part-1/

Would bypass issues with ncat being present or differing behaviours, but plenty of other problems would present themselves I also suspect.

$ exec {redis_socket}<>/dev/tcp/${host}/${port}
$ printf "INFO ALL\r\nQUIT\r\n" >&${redis_socket}
$ cat <&${redis_socket}
[...]
# Keyspace
db0:keys=284,expires=5,avg_ttl=3127545

+OK

print_redis_info() {
echo "Redis INFO ALL" >> "${LOGFILE}"
if [[ -n "${redis_cli}" ]]; then
"${redis_cli}" -h "${host}" -p "${port}" info all &>> "${LOGFILE}"
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll wrap this in a timeout 3 as well. Should not be needed, but the client doesn't seem to have better connection timeout handling and would otherwise hang (indefinitely?) when the other end is DROP-ing all packets.

(Note: --intrinsic-latency does not initiate any connection)

@@ -0,0 +1,69 @@
#!/bin/bash
#
# Copyright (C) 2017 Rackspace, Inc.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can update this in the Contributors section, but as discussed, this is fine.

print_redis() {
local LOGFILE="$1"
local plugin_name=${FUNCNAME[0]/print_/}
local redis_cli=$( type -p redis-cli )
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. It crossed my mind, but thought it'd get messy when then running print_redis_info and checking there anyway - I'll add a new function as suggested further down.

* Fixes from PR
* Add flag for running timings (default no)
* Add flag for printing client list (default no)
* More comments in code
@LukeHandle
Copy link
Author

Hey, pushed a new commit with following:

  • Adjusted ncat per recommendations for compatibility and added wait-timeout
  • Add flag for running timings (default no) - reduce the execution time
  • Add flag for printing client list (default no) - can be verbose, but might be handy
  • More comments in code - clarified some bits
  • Add timeout on redis-cli singular command
  • Separated handling for lack of CLI tool fallback
  • Improved Redis 4 compatibility

Hopefully not too many regressions in quality

@tonyskapunk
Copy link
Owner

This is looking good, I tested the new code and is working as expected, just last minor request, could you try to break the code and comments inside 80c when possible? e.g.

# from
local latency=$( timeout 3 "${redis_cli}" -h "${host}" -p "${port}" --latency -i 2.5 2>&1 )
# to
local latency=$( timeout 3 \
                   "${redis_cli}" \
                     -h "${host}" \
                     -p "${port}" \
                     --latency \
                     -i 2.5 \
                     2>&1
               )

I know in some cases is not possible to do so, but when possible let's do it. recaplog has some good examples about it.

@LukeHandle
Copy link
Author

Hey, I made the change to be within 80c. I was rather conservative with my multi-line command breaks, but the recaplog seemed to only break when strictly needed vs. on every arg.

@tonyskapunk
Copy link
Owner

tested this and looking good :) 👍

Thanks for the contribution.

@tonyskapunk tonyskapunk merged commit 8c64521 into tonyskapunk:issue155 Jun 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants