metrics: report pod communication latency percentiles in pdf #212

Open · wants to merge 1 commit into base: master
Conversation

@askervin (Contributor)

  • Add R routine to load and plot percentiles from JSON.
  • Support any number of percentiles 1..n.
  • Store percentile configuration in JSON.
  • Fix handling percentile value 100.

Signed-off-by: Antti Kervinen antti.kervinen@intel.com

@askervin (Contributor, Author)

Request for comments:

  • The default latency percentiles are now 5-25-50-75-95.
  • This means we visualize
    • median (50)
    • the range including 50 % of the measurements (25, 75)
    • the boundaries of the fastest and slowest 5 % (5, 95)
  • If we'd like to drop only 1 % as outliers in both ends (instead of 5 %), we could use 1-25-50-75-99.
  • If, in addition to the previous set, we'd also like the fastest and slowest values, we could use 0-1-25-50-75-99-100.
  • With the default value I'm throwing out 10 % of the results as outliers. That means less noise and more stable results, but at the risk of discarding something that could have been spotted with (1, 99) in the percentiles (see the R sketch below).
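To make the trade-off concrete, here is a minimal R sketch (base R only; latencies_ms is a made-up sample, not data from this test) showing what each candidate percentile set would report:

# Hypothetical sample of round-trip latencies in milliseconds
latencies_ms <- c(4.3, 4.4, 4.4, 4.5, 4.5, 4.6, 4.7, 5.0, 6.1, 11.1)

quantile(latencies_ms, probs = c(5, 25, 50, 75, 95) / 100)         # current default
quantile(latencies_ms, probs = c(1, 25, 50, 75, 99) / 100)         # drop only 1 % at each end
quantile(latencies_ms, probs = c(0, 1, 25, 50, 75, 99, 100) / 100) # plus the min and max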

@obedmr added the scaling label on Oct 11, 2019
@dklyle (Contributor) left a comment


Thanks, this looks good. There are a couple of issues to address. One is that, without a legend, the chart is impossible to interpret without reading the R code. The second is rendering more than one test run (data directory) for comparison: the code loops over more than one directory if present, but only the last result gets rendered.

metrics/report/report_dockerfile/tidy_scaling_nc.R: two inline review threads (outdated, resolved)
@grahamwhaley

I've done one R metric latency graph before, over in the kata fio metrics. I was only plotting two percentiles, but you can see the example graphs over on the issue at kata-containers/tests#885

Note I used colours/opacity to try and group and separate the tests and the values. Yeah, I do think we need a legend on the graph to show which colour/lines mean what :-)

@@ -22,6 +22,7 @@ nc_req_msg_len=${nc_req_msg_len:-1000}
nc_port=33101
# request message
nc_req_msg=$(head -c $nc_req_msg_len /dev/zero | tr '\0' 'x')
nc_percentiles=(5 25 50 75 95)


I'm testing this PR (struggling a little to get good data to check the views - that is more a 'local' issue for me though - working on it)... A high-level observation: I think it would be a good idea if we stored the raw data in the JSON results, as well as the cooked percentiles. That way, we can:

  • potentially draw some other useful graphs in the R, to show boxplots for instance covering all the data/results for better fine grained insights
  • later potentially modify the R and/or any ELK stack views or post processing we have, and still be able to apply that to historic data

I suspect we do also want to hold some 'cooked' data in the JSON as well, like at present, as it is not necessarily going to be easy to do that cooking inside ELK, and generating and presenting pre-cooked data to ELK is for now the easiest UI visualisation route.

Probably don't go adding this until we've reviewed all the code - may as well do a single-pass PR update in one go ;-)

@grahamwhaley

OOI @askervin - I see you are taking the time measurements for the nc in, I think, nanoseconds, and then scaling them to ms - I know we do that for the launch times etc. (which take place in the range of 'seconds' iirc) - is this ms granularity working nicely on your test cluster? On my local cluster I pretty much see results of only 4 and 5 ms reported. I'm wondering if we should measure microseconds....
I tweaked the script to dump out the sorted raw data, and I see data like this (the tail end of the data from a single node 20 pod test run)...

4375 4377 4382 4401 4405 4413 4414 4418 4421 4421 4423 4424 4424 4427 4428 4428 4429 4429 4430
 4430 4430 4432 4433 4434 4435 4436 4436 4438 4438 4443 4443 4444 4446 4447 4447 4448 4448 4449 
4449 4449 4449 4450 4452 4452 4454 4454 4455 4456 4456 4458 4458 4458 4458 4458 4460 4461 4461 
4461 4461 4462 4462 4462 4463 4463 4463 4464 4464 4464 4464 4465 4465 4465 4465 4465 4466 4466 
4467 4467 4468 4469 4469 4469 4470 4470 4471 4471 4471 4472 4473 4473 4473 4473 4473 4474 4474 
4474 4475 4475 4477 4477 4477 4478 4478 4478 4478 4479 4479 4480 4480 4480 4480 4481 4481 4482 
4482 4483 4483 4483 4484 4484 4484 4484 4485 4485 4485 4485 4485 4486 4486 4486 4486 4486 4486 
4487 4487 4487 4487 4487 4487 4487 4488 4488 4489 4489 4490 4490 4490 4490 4490 4490 4491 4491 
4491 4491 4491 4492 4492 4492 4492 4493 4493 4493 4493 4493 4493 4494 4494 4494 4494 4494 4494 
4494 4494 4495 4495 4495 4495 4496 4496 4497 4497 4497 4497 4497 4497 4497 4498 4498 4499 4499 
4499 4499 4499 4500 4500 4500 4500 4501 4501 4501 4501 4501 4501 4501 4501 4501 4501 4501 4501 
4502 4502 4502 4502 4502 4503 4503 4503 4503 4503 4503 4503 4504 4504 4504 4504 4504 4504 4504 
4505 4505 4505 4505 4505 4505 4506 4506 4507 4507 4507 4507 4507 4508 4508 4508 4509 4509 4509 
4509 4509 4509 4509 4509 4510 4510 4510 

.... a bunch skipped for brevity.....
 4901 4902 4902 4903 4903 4905 4906 4909 4910 4910 4910 4910 4912 4912 4913 4913 4914 4914 4915 
4915 4916 4916 4916 4917 4917 4917 4918 4918 4918 4918 4919 4919 4919 4920 4921 4922 4922 4923 
4924 4924 4924 4925 4925 4925 4926 4928 4928 4930 4930 4931 4931 4931 4932 4933 4933 4933 4935 
4935 4937 4937 4938 4938 4940 4941 4941 4942 4942 4942 4943 4943 4943 4944 4944 4944 4946 4946 
4947 4948 4948 4949 4949 4950 4950 4951 4951 4951 4951 4951 4953 4953 4954 4954 4955 4956 4956 
4956 4956 4957 4959 4961 4962 4962 4963 4963 4967 4971 4972 4973 4974 4974 4975 4976 4977 4978 
4979 4980 4981 4982 4983 4984 4985 4985 4986 4986 4990 4991 4993 4994 4995 4999 5001 5002 5004 
5004 5007 5008 5011 5011 5013 5017 5017 5019 5020 5020 5022 5024 5024 5024 5027 5027 5043 5045 
5046 5051 5053 5056 5057 5058 5058 5061 5066 5074 5077 5088 5092 5098 5099 5101 5106 5115 5117 
5121 5124 5125 5125 5125 5137 5142 5150 5160 5166 5171 5174 5175 5176 5177 5182 5182 5184 5185 
5196 5198 5206 5214 5220 5232 5240 5254 5287 5305 5317 5320 5325 5431 5436 5439 5447 5457 5470 
5556 5561 5570 5588 5634 5671 5677 5688 5732 5773 5888 5901 5911 5922 6000 6023 6034 6088 6094 
6625 6769 7660 8599 8816 11069 11124

What is interesting for me right now is we don't really get to see that spread across the 4-5ms range, and we don't get to see those tail end outliers that ramp from 6ms up to 11ms. This is the sort of reason we should probably capture the raw data in the JSON and find some way to show that data in the graphs - I think it can give us more insight than the current rather step-function result we are seeing.
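As a rough sketch of the kind of view raw data would enable (raw_ms here is a hypothetical numeric vector of per-request latencies in ms, not something the current JSON provides), an ECDF on a log axis makes the 6-11 ms tail visible where the percentile bands alone hide it:

library(ggplot2)

# raw_ms: placeholder vector of per-request latencies (ms)
ggplot(data.frame(latency_ms = raw_ms), aes(x = latency_ms)) +
    stat_ecdf() +          # cumulative distribution over all requests
    scale_x_log10() +      # log axis spreads out the 6-11 ms tail
    labs(x = "latency (ms)", y = "fraction of requests")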

@grahamwhaley

For ref, as I have one to hand, here is a report-generated diagram of a 100 pod run at ms resolution, and then a 20 pod run at us resolution - both from the same node/cluster. Shows promise, I think:

[attached figure: scaling_nc-1]

median_index = match("p50", names(ltpt))
if (!is.na(median_index)) {
# Draw median line
latp = latp + geom_line(data=latencydata[latencydata$testname==plot_test,],aes_string(x="n_pods",y=names(ltpt)[median_index],color="testname"))


@askervin - I was wondering if there might be some built in R functionality for generating a similar chart with 'fading percentile bands' from the raw data - looks like it is very similar to a fanchart/fanplot - not sure if we can utilise that with our data under R though (did not dig too far): https://journal.r-project.org/archive/2015-1/abel.pdf
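For reference, a rough sketch of how the fading bands could be derived from raw data without the fanplot package; it assumes a hypothetical data frame raw with columns n_pods and latency_us, which is not something this PR currently produces:

library(ggplot2)

probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
# Per-pod-count percentiles computed from the raw measurements
per_pod <- lapply(split(raw$latency_us, raw$n_pods), quantile, probs = probs)
bands <- data.frame(n_pods = as.numeric(names(per_pod)),
                    do.call(rbind, per_pod), check.names = FALSE)

ggplot(bands, aes(x = n_pods)) +
    geom_ribbon(aes(ymin = `5%`, ymax = `95%`), alpha = 0.2, fill = "steelblue") +
    geom_ribbon(aes(ymin = `25%`, ymax = `75%`), alpha = 0.4, fill = "steelblue") +
    geom_line(aes(y = `50%`), colour = "steelblue") +
    labs(y = "latency (us)")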

@askervin (Contributor, Author)

@grahamwhaley , I agree that millisecond granularity is rather coarse for reporting network latency. I get 12 ms on my VM cluster, but 3-4 ms on real hardware. However, I think we should make sure that we'd really benefit from finer granularity instead of just adding noise.

For the current measurement loop, I compared the granularity of date, $EPOCHREALTIME and Python time.time(), calling them back-to-back as fast as possible.

#!/bin/bash

measure_accuracy() {
    n_limit=$1 # number of accuracy measurements
    n=0
    while [[ $n -lt $n_limit ]] ;
    do
        n=$(( $n + 1 ))
        local start_time=$(date +%s%N)
        local end_time=$(date +%s%N)
        local date_timediff_ns=$(( end_time - start_time ))

        local er_start_time=$EPOCHREALTIME
        local er_end_time=$EPOCHREALTIME
        local er_timediff_us=$(bc <<< "(($er_end_time-$er_start_time) * 1000000)/1")

        local py_timediff_us=$(python3 -c 'import time; start_t=time.time(); end_t=time.time(); print("%d" % ((end_t-start_t)*1000000,))')

        echo "date_ms $(( $date_timediff_ns / 1000000 ))"
        echo "date_us $(( $date_timediff_ns / 1000 ))"
        echo "date_ns $date_timediff_ns"
        echo "EPOCHREALTIME_ms $(( $er_timediff_us / 1000 ))"
        echo "EPOCHREALTIME_us $er_timediff_us"
        echo "python3_ms $(( $py_timediff_us / 1000 ))"
        echo "python3_us $py_timediff_us"
    echo "---"
    done
}

measure_accuracy $1

On my high-performance desktop measuring the time of "do nothing":

  • date_us varies between 900 - 2000 us
  • EPOCHREALTIME_us varies between 12 - 37 us
  • python3_us varies between 4-5 us

Another thing is the latency server implementation. Trying out different server+client combos on localhost:

  • nc ... -e /bin/cat server + nc client:
    • date_us varies between 2700 - 8700 us
    • EPOCHREALTIME_us varies between 1400 - 6400 us
  • nc ... -e /bin/cat server + python client doing time.time() and socket.sendall()...socket.recv()
    • python timediff varies between 540 - 1200 us
  • python socket server + python socket client:
    • python timediff varies between 418 - 732 us

I would conclude that if we want to get to the microsecond level, we should not do the time measurement with date and the nc client anymore. We should use more efficient time measurement and socket send/recv. The current solution would give roughly a +-3000 us error, a Python client roughly +-300 us, and a Python server+client roughly +-150 us.

@grahamwhaley

Good data and valid point on the noise/overheads of the tools contributing significantly to the end result ;-) That fits/correlates with what I was inferring - that we are trimming off too many significant digits right now I think.

In which case, I agree, we probably want to be using better tools to measure the latency. First, we should probably define exactly what we are trying to measure (TCP or just network response time from a pod for instance) - and then we can consider other specific network tools like ping (for an ICMP bounce, if the pod network supports that), or iperf3 or similar, where we can get back good raw data. iirc, iperf3 can return results in JSON format https://github.com/kata-containers/tests/blob/master/metrics/network/network-metrics-iperf3.sh#L63 ;-).

one of the problems with network tools is - there are just soooo many to choose from...

@askervin (Contributor, Author)

@grahamwhaley , good point. I changed the time in the result JSON to microsecond accuracy. We can handle the measurement method as a separate topic. In any case, the current round-trip time accuracy is enough for spotting a reproducible anomaly after scaling to 400+ pods.

I also changed the number of percentiles from 5 to 9, so that we can see where the previously hidden 5 % of latency times land. (Added min, max, 1 % and 99 % percentiles.)

Saving the original measurements with all necessary details (like the pod IPs associated with each latency time, and the timestamp) helps with debugging anomalies. It also addresses the concern about having only "cooked" data available after a test run.

@grahamwhaley

Whilst I remember, something I spotted that is not actually modified in this PR....
in the JSON stats capture, I think we are storing a bunch of global data that does not change inside each iteration's results. It would probably be better to save that out once during the init() phase, at a global test level. In particular, I was thinking about:

				"Pod_command": "${pod_command//\"/\\\"}",
				"Request_length": "${nc_req_msg_len}",
				"Requests_per_pod": "${nc_reqs_per_pod}",
				"Sender": "serial",
				"Percentiles": [$(IFS=, ; echo "${latency_percentiles[*]}")],

@grahamwhaley left a comment


@askervin - some feedback, mostly just little things.
I think if/before we land this in the code base, though, we would want to clean up a couple of things, like the graph legend and, if we still use them, the box plots.
But, overall, looking useful and hopefully giving you the data you need.

@@ -338,6 +339,8 @@ run() {
sleep ${settle_time}

if [[ ${nc_reqs_per_pod} -ge 1 ]]; then
mkdir -p "$RESULT_DIR" 2>/dev/null || true
local latency_raw_output="$RESULT_DIR/${TEST_NAME// /-}.tmaster_tworker_pods_r_ipord_ipaddr_lattot_latconn_latio_latdisconn_rx.raw"


took me a moment to figure out what that lengthy name was... ah, you have encoded the column names into the filename? Is that a common idiom?
Given you are storing the raw data in a separate file, rather than into the JSON (which is where I thought it would go), maybe you can make that file a CSV, and then you can encode the column names in the first row, and just use .csv on the filename?
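To illustrate the suggestion (the file name and the lattot column here are hypothetical, taken only from the column list currently encoded in the filename):

# With a header row, the column names travel with the data and the filename
# can stay short, e.g. "scaling-nc-latency-raw.csv".
raw <- read.csv("scaling-nc-latency-raw.csv")
head(raw)               # column names come straight from the header row
summary(raw$lattot)     # e.g. the total-latency column, if named as in the raw file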

if [[ $(echo ${nc_req_msg} | nc ${pod_ip} ${nc_port}) != "${nc_req_msg}" ]]; then
pod_ip_ord=$(( pod_ip_ord + 1 ))
local latency_failed=0
local latency_pod_start_time=${EPOCHREALTIME/./}


nice find with EPOCHREALTIME :-) - nitpick - the JSON still says Units of ms, I think - let's update that to be correct (even if we don't presently use that Units field).

library(tibble) # tibbles for tidy data

testnames=c(
"k8s-scaling-nc.*"


I found in another PR that the .* is a bogus overhang from another R file, and it messes up the regexp that is then constructed below. Just delete the .* from the string, I think.

#### Now extract latency time percentiles (ltp) ########
########################################################
ltp=br$latency_time$Percentiles
# Percentile thresholds, for example [5, 25, 50, 75, 95]


Is this comment now out of date, since you added 0/1/99/100 as well, iirc?

}
}

if (length(latencydata[[1]]) <= 20 && perc_count == 5) {


why perc_count == 5? - ah, I guess as the boxplot takes 5 input values....
Hmm, so, two things then:

  • given we now generate >5 items in the test run, I suspect this code will no longer run?
  • There is no strict standard for what all the whiskers on a boxplot mean - I suspect the outer ones are expected maybe to be some stddev extraction - but I think we are using them as percentiles. We should probably document that in the graph (unless it is obvious from the axis - I don't have a graph to hand to check); a sketch of one option follows below. There is an example at https://owi.usgs.gov/blog/boxplots/, but that might be a little OOT.
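For reference, a hedged sketch of one way to make the whisker semantics explicit in ggplot2. It assumes wide percentile columns p5/p25/p50/p75/p95 (names following the p50 convention used elsewhere in this PR) in a hypothetical frame ltpt_wide:

library(ggplot2)

ggplot(ltpt_wide, aes(x = factor(n_pods))) +
    geom_boxplot(aes(ymin = p5, lower = p25, middle = p50,
                     upper = p75, ymax = p95),
                 stat = "identity") +   # use our percentiles, not ggplot's own stats
    labs(x = "pods", y = "latency (ms)",
         caption = "whiskers = 5th/95th percentiles, box = 25th-75th, line = median")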

}
}

cat("\n\nLatency percentiles illustrated in the Figure below: ", paste0(ltp_perc, "\\%"), "\n\n")


Ah, look at that - so, maybe disregard some of my above comments if we think the graph is clear enough already ;-)

suppressMessages(suppressWarnings(library(ggplot2))) # ability to plot nicely.
suppressMessages(library(jsonlite)) # to load the data.
suppressMessages(library(scales)) # For de-science notation of axis
library(tibble) # tibbles for tidy data


I think you might need gridExtra here, like we have in tidy_scaling.R. Right now it works because you inherit it from the tidy_scaling.R that runs before this script, but when run on its own in debug mode I see:

Error in grid.arrange(latp, ncol = 1) :
  could not find function "grid.arrange"
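A likely fix (an assumption on my part, mirroring what tidy_scaling.R presumably loads) is to make the dependency explicit in this script:

suppressMessages(suppressWarnings(library(gridExtra))) # provides grid.arrange()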


nudge.

upper_prev_name = names(ltpt)[perc_count-n+1]
perc_dist = abs(n-perc_mid)
alpha = 0.7 * ((n+1) / (perc_mid+1))**2
latp = latp + geom_ribbon(data=latencydata[latencydata$testname==plot_test,],aes_string(x="n_pods",ymin=lower_name,ymax=lower_next_name,fill="testname"),alpha=alpha)


Before we land this, we should try and generate a proper legend for the plot. I suspect that will either involve some heavy aes() usage, or generating a custom legend.
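As a sketch of one option (not what the PR currently does): reshape the bands into long form and map the band level to an aesthetic inside aes(), so ggplot2 generates the legend itself. The frame bands_long and its columns (n_pods, lower, upper, band) are hypothetical:

library(ggplot2)

ggplot(bands_long, aes(x = n_pods)) +
    geom_ribbon(aes(ymin = lower, ymax = upper, alpha = band), fill = "steelblue") +
    scale_alpha_manual(name = "Percentile band",
                       values = c("5-95%" = 0.25, "25-75%" = 0.55)) +
    labs(y = "latency (ms)")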

@askervin (Contributor, Author)

Rebased the whole series. Now this is based on the rapid test. Quite a few changes - see the last commit message.
I'll run a few test rounds...

@grahamwhaley

hi @askervin - thanks for the update. I tried to give it a quick spin - initial feedback...

  • Probably consider a rebase/merge of the patches
    • as they are getting quite long now, and I think that last one is a replace/rewrite of a bunch of the previous ones?
    • the branch looks like it needs a rebase before we can merge anyway (conflicted)
  • I think you need to set SMF_USE_COLLECTD=true somewhere. Right now I think the collectd daemonset gets run due to a bug in my framework code, where if [ -n "$SMF_USE_COLLECTD" ] does not do the 'not set' check like I thought! If you'd like to fix that whilst you are there.... (change it to an == string test for true?)
  • I failed to make the busybox deploy without setting use_api=no - my suspicion is that the ncat command line substitution maybe does not work in the json. I dunno.
  • nitpick: I think the hotlink in the Rmd points to the old nc test still (not the _rapid version)
  • I ran with the default 20 pods. The lat table shows results for n_pods 1,5,10 - I was thinking it might show 1,10,20 - so I peeked at the code - it's not obvious to me how it is working out those rows - I'll leave that for you to contemplate, and fix/document/clarify as necessary.

(oh, I got a run by hacking out my collectd check bug and setting no_api btw)

- Add R routine to load and plot percentiles from JSON.
- Support any number of percentiles 1..n.
- Store percentile configuration in JSON.

Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>
@askervin (Contributor, Author)

@grahamwhaley , Thanks for your feedback!

  • Patches squashed into one.
  • SMF_USE_COLLECTD=true added and the string test changed to true.
  • Rmd script link fixed.

Issue:

  • When I try to use_api, I'm getting HTTP/HTTPS complaint... like this:
$ strace -e trace=network -s 1000 curl -s http://127.0.0.1:6443/apis/apps/v1/namespaces/default/deployments -XPOST -H 'Content-Type: application/json' -d@/root/metrics/scaling/rapidnc.json
...
sendto(3<TCP:[847285]>, "POST /apis/apps/v1/namespaces/default/deployments HTTP/1.1\r\nHost: 127.0.0.1:6443\r\nUser-Agent: curl/7.66.0...")
...
recvfrom(3<TCP:[847285]>, "HTTP/1.0 400 Bad Request\r\n\r\nClient sent an HTTP request to an HTTPS server.\n", 102400, 0, NULL, NULL) = 76

But curl gets stuck if I just s/http:/https:/. I need to figure out how to make the apiserver accept HTTP requests. Do you have any tips for this?

Todo:

  • check n_pods in the table with a single testname; it should show the first, mid and last.

@grahamwhaley

thanks @askervin - I'll try to get to review..... @dklyle is the curl API expert ;-)

@grahamwhaley

I may know what the use_api issue I ran into is, and it's probably not this patchset - it happens with the existing rapid test as well... I ran into this yesterday.
When using the API, the tests start up the kubeproxy, but, that only works if you happen to be running the tests on the master node. This works for me when I'm running a single node local clr-k8s-examples stack, but when I'm running a 3-node kind cluster, the nodes run under docker, and I run the tests on the bare metal host.... and then I get the busybox launch failure...

I'm slightly confused though, as I was sure I'd been running API-based tests on the kind cluster. So I'm going to have to dig a little more, but I think there is an implicit requirement in our tests that if you want to use the (default) API method, you need to run on the master node....
/cc @dklyle (and thanks for confirming my suspicions on this yesterday)

@grahamwhaley

(oops, accidental mis-click closed.... re-opened now....)

@grahamwhaley

OK, take that kubeproxy item with a little pinch of salt - it turned out that on my test machine I still had a kube proxy running (probably to the clr-k8s-examples stackup) when I tried to run and bind to the kind cluster. I still got some 'nc TIMEOUT' errors, but wrt the use of use_api, this PR is probably fine, but...

  • I know we have some cleanup/exit issues, as often when a test fails or I quit it I end up with the daemonsets still present (collectd), and probably the proxy as well
  • we could/should detect if a proxy is already running and probably error out, maybe.

suppressMessages(suppressWarnings(library(ggplot2))) # ability to plot nicely.
suppressMessages(library(jsonlite)) # to load the data.
suppressMessages(library(scales)) # For de-science notation of axis
library(tibble) # tibbles for tidy data


nudge.


### For developers: uncomment following variables to run this as is in R
# resultdirs=c("PATH/TO/RES1/", ...) # keep the ending slash on result paths
# inputdir=""


Do we need this comment? We don't have it in any of the other R files, which do the same thing. Generally I think this is covered by the use of the report debug method and the sourcing of the Env.R file.

}
}

# Table presentation.


I don't comprehend how the code relates to the final table - can you add some commentary please?
For instance, in my 20 pod run, I see a table in the pdf with rows for 1,5,10 pods - how is that math done? I thought maybe it would be 1,10,20 for a 20 pod run for instance?

local latency_json="$(cat << EOF
"latency_time": {
"Percentiles": [$(IFS=, ; echo "${latency_percentiles[*]}")],
"Result": ${latency_percentiles[$(( ${#latency_percentiles[@]} / 2 ))]},


Nitpick - we might want to rename Result to something like Midpoint to reflect the actual value it holds.

# Measure network latency
if [[ ${nc_reqs} -ge 1 ]]; then
mkdir -p "$RESULT_DIR" 2>/dev/null || true
local latency_raw_output="$RESULT_DIR/${TEST_NAME// /-}.tmaster_tworker_pods_req_ipaddr_lattot_latconn_latio_latdisconn_rx.raw"


I still don't like the mega huge raw name.... or that it's not a json or actual csv :-)
