WIP: FR #84 Include containerd-specific labels to data coming from powercaprapl sensor #109
Conversation
Hey! I must have done something wrong :) In the container:
In the compose file:
The logs when it starts:
But when I try to curl, it's not happy:
No error messages, but it just hangs... I think I have the right image, because with the param on the normal image it gives me an error, and here it doesn't. Let me know if there's anything else I can try :) |
Hi ! I've tried to reproduce but the container does answer me :/ |
I think I identified the problem @pierreozoux runs into. The rs-docker crate we use for this feature uses tokio as an async runtime. The prometheus exporter itself uses actix. I guess some conflicts happen as I get tokio log messages when I reproduce the issue on Pierre's machine (using prometheus exporter). I imagine I can't reproduce on mine because this is not deterministic and may rely on lower level configurations on the system (not sure but strongly suspected). I think we should either get rid of tokio in rs-docker (rs-docker kind of needs a refresh anyway) or actix in the prometheus exporter. I've also heard about bollard (https://github.com/fussybeaver/bollard) which seems to have more contributors. But it uses tokio too. So maybe the solution is to get rid of actix (I was thinking about moving prometheus exporter in full sync + thread anyway). Do you have any thoughts about that @rossf7 @PierreRust @uggla ? |
@bpetit, my 2 cents: I think that goes against the direction things are heading. I mean, all web frameworks have pushed hard (actix, rocket which now seems to use async, warp, etc. --> https://github.com/flosse/rust-web-framework-comparison) to move to async because it gives better performance. Even though we clearly don't need that performance for Scaphandre's web server, I'm afraid it will be difficult to find a web framework that does not use async, and we might end up with something not well supported in the future. So I would rather bet on bumping tokio to the latest version and getting all tokio consumers to use that same dependency (not mixing tokio versions), then see if that fixes the bug. But of course it is not so easy if it is flaky and can't be reproduced 100% of the time. |
Hi @bpetit I agree with @uggla on this. I think replacing actix with the same version of tokio is a good approach. I had the same issue when looking at the kubernetes integration. https://github.com/clux/kube-rs is the most popular library and uses tokio. There is also https://github.com/ynqa/kubernetes-rust which isn't async but the last commit was 2 years ago. I tried to get that working but I wasn't able to. Although that is probably because I'm new to rust. |
Hi ! Thanks for your views on this. I'll give tokio/hyper a shot for the prometheus exporter. Let's see. |
Seems to work, I'll run some more tests and then jump back on the docker integration if it's satisfying. |
Here is a review with comments. Hoping it will help a little.
        error!("server error: {}", e);
    }
} else {
    panic!("{} is not a valid TCP port number", port);
maybe change error msg to: "is not a valid TCP port number or is already bound"
Yep !
Actually I don't think it's the right message as it is triggered only if we can't parse the port parameter as a u16. If the port number is valid but can't be reserved we will get an error from Server::bind most likely.
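For illustration, a minimal sketch of how the two failure modes could be kept apart with hyper (the function layout and names here are assumptions, not the exact code in this PR):

use hyper::Server;
use std::net::SocketAddr;

fn main() {
    let port_arg = "8080";
    // Parse failure: the argument is not a valid u16, which is what the
    // current panic message describes.
    let port: u16 = match port_arg.parse() {
        Ok(p) => p,
        Err(_) => panic!("{} is not a valid TCP port number", port_arg),
    };
    let addr = SocketAddr::from(([0, 0, 0, 0], port));
    // Reservation failure (port already bound, missing privileges, ...)
    // shows up here, from Server::try_bind, not from the parse above.
    match Server::try_bind(&addr) {
        Ok(_builder) => { /* build the service and call .serve(...) here */ }
        Err(e) => eprintln!("could not bind {}: {}", addr, e),
    }
}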
{
    info!(
        "{}: Refresh topology",
        Utc::now().format("%Y-%m-%dT%H:%M:%S")
FYI, I will make a PR to change the logging stuff. The way it is done today is not really good.
What is the part that worries you ?
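For context, a minimal sketch of one way the logging could be simplified: let the logger emit the timestamp instead of formatting Utc::now() at every call site (env_logger here is only an assumption, not necessarily what the follow-up PR will use):

use log::info;

fn main() {
    // env_logger prefixes each record with a timestamp and level on its own,
    // so call sites don't need chrono formatting anymore.
    // Run with RUST_LOG=info to see the output.
    env_logger::Builder::from_default_env()
        .format_timestamp_millis()
        .init();

    info!("Refresh topology");
}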
Cargo.toml
Outdated
time = "0.2.25"
colored = "2.0.0"
chrono = "0.4.19"
rs-docker = { version = "0.0.58", optional = true }
Good one, thanks ! Yes it does. Actually I think I'll use my fork instead: https://github.com/bpetit/rs-docker/
as rs-docker seems to be unmaintained. I'll then update the dependencies so we are even. WDYT ?
};
let context = Arc::new(power_metrics);
let make_svc = make_service_fn(move |_| {
    let ctx = context.clone();
Here I would use shadowing to keep the same name: let context = context.clone(). I think it is easier to follow.
Same for sfx below.
And do you really need to clone twice, here and in the async block below? I'm not sure, but maybe it is only required in the async block.
I had errors I couldn't resolve if I didn't clone twice. Maybe I did it wrong. I'll try to give you some details (next week most probably) so we can look into it and see if it should be done differently.
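For reference, a minimal sketch of why two clones tend to show up with hyper's make_service_fn / service_fn: the outer closure runs once per connection, the inner one once per request, and each returned future must own its own Arc handle (the names are illustrative, not the PR's exact code):

use std::{convert::Infallible, sync::Arc};
use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Request, Response, Server};

#[tokio::main]
async fn main() {
    let context = Arc::new(String::from("shared exporter state"));

    let make_svc = make_service_fn(move |_conn| {
        // First clone: moved into this per-connection closure.
        let context = context.clone();
        async move {
            Ok::<_, Infallible>(service_fn(move |_req: Request<Body>| {
                // Second clone: the per-request future must own its handle,
                // because this closure can be called again for the next request.
                let context = context.clone();
                async move { Ok::<_, Infallible>(Response::new(Body::from((*context).clone()))) }
            }))
        }
    });

    let addr = ([127, 0, 0, 1], 8080).into();
    Server::bind(&addr).serve(make_svc).await.unwrap();
}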
Thanks a lot for the review ! I'm pretty busy over the next few days, but I should be able to integrate your suggestions and build a working version of prom on tokio with container labels (+ docker extra labels) for Wednesday the 9th. 🤞 |
Prometheus exporter with tokio seems to work fine. However, I'm not comfortable having an async library as a requirement for gathering data from the docker socket locally. It's fine in a pull mode like prometheus, especially if we have a tokio runtime for the server itself, in the same version. But I don't think it's fine to require exporters like JSON, CLI, or any simple exporter to have an async runtime just to get extra information about containers. rs-docker and bollard do require async. I'm forking rs-docker into a minimalistic, read-only and synchronous version. I guess it's enough for what we need here. We could then upgrade to something more fancy if needed afterwards. cc @uggla @rossf7 |
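For illustration, a minimal sketch of the kind of synchronous, read-only access this needs: a blocking Unix socket and plain HTTP/1.1 are enough to list containers, with no async runtime involved (this is not the fork's actual API, just the idea):

use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

fn main() -> std::io::Result<()> {
    // The Docker Engine API is plain HTTP over a Unix socket.
    let mut stream = UnixStream::connect("/var/run/docker.sock")?;
    stream.write_all(
        b"GET /containers/json HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n",
    )?;

    // With Connection: close, reading to EOF returns the headers plus the JSON body.
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    println!("{}", response);
    Ok(())
}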
Hi @bpetit
I ran into a couple of problems in my testing.
The first was an error in the logs for listing the pods.
isahc::handler: request completed with error: the server certificate could not be validated
It seems to be because the k8s client is connecting to http://localhost:6443. If I changed this to the host in my kubeconfig it was fine.
The second problem was with the k8s regex. My cluster was using a different format.
src/sensors/utils.rs
Outdated
@@ -22,9 +29,15 @@ impl ProcessTracker {
    /// let tracker = ProcessTracker::new(5);
    /// ```
    pub fn new(max_records_per_process: u16) -> ProcessTracker {
        let regex_cgroup_docker = Regex::new(r"^/docker/.*$").unwrap();
        let regex_cgroup_kubernetes = Regex::new(r"^/kubepods.slice/.*$").unwrap();
On the cluster I was testing with, the cgroup file has a different format. Could the regex be more generic to support both formats?
#/proc/193876/cgroup
1:name=systemd:/kubepods/burstable/pod7a8cbc91-66e9-4303-88df-513f77240233/acd77757d49868ead1f706f901271e737594d0e11cec86d4bfa4de45a0512938
The cluster was kubernetes v1.20.2 installed using kubeadm on ubuntu 20.10 with docker 20.10.2
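For illustration, a single pattern can already cover both layouts, since the systemd-driver path starts with /kubepods.slice/ and the cgroupfs one with /kubepods/ (a sketch of the idea, not necessarily the regex the PR ends up with):

use regex::Regex;

fn main() {
    // Matches both cgroup drivers:
    //   systemd:  /kubepods.slice/kubepods-burstable.slice/...
    //   cgroupfs: /kubepods/burstable/pod<uid>/<container-id>
    let regex_cgroup_kubernetes = Regex::new(r"^/kubepods.*$").unwrap();

    assert!(regex_cgroup_kubernetes.is_match("/kubepods.slice/kubepods-burstable.slice/x.scope"));
    assert!(regex_cgroup_kubernetes.is_match(
        "/kubepods/burstable/pod7a8cbc91-66e9-4303-88df-513f77240233/acd77757d49868ead1f706f901271e737594d0e11cec86d4bfa4de45a0512938"
    ));
}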
I guess we need a flag or an env var to set the kubernetes API URI? I'll extend the regexp, thanks for the feedback! |
What about checking for the env vars and, if they're not set, having a flag that can be set manually? |
Hi @rossf7 ! I've added the gathering of kubernetes env vars. If those vars are present, they are used first to determine the server URI. I also made the regexp more flexible. I'd like to hear your thoughts and test results :) |
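For reference, a minimal sketch of that fallback order, using the standard in-cluster variables KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT (the variable names come from the Kubernetes docs; the function and flag handling are assumptions, not the PR's exact code):

use std::env;

// Prefer the in-cluster env vars when they are set, otherwise fall back to a
// value supplied via a flag, and finally to the local API server address.
fn kubernetes_api_uri(flag_value: Option<String>) -> String {
    match (
        env::var("KUBERNETES_SERVICE_HOST"),
        env::var("KUBERNETES_SERVICE_PORT"),
    ) {
        (Ok(host), Ok(port)) => format!("https://{}:{}", host, port),
        _ => flag_value.unwrap_or_else(|| String::from("https://localhost:6443")),
    }
}

fn main() {
    println!("{}", kubernetes_api_uri(None));
}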
@bpetit Many thanks for the changes :) I'll retest and report back. Most likely will be tomorrow. |
@bpetit My testing went well. I just needed to make some small changes to the helm chart and to adjust for my cluster.
I think this is really close now. 💚 🚀
src/sensors/utils.rs
Outdated
    .unwrap()
    .strip_prefix("docker-")
    .unwrap()
    .strip_suffix(".scope")
    .unwrap();
Suggested change: replace
    .unwrap()
    .strip_prefix("docker-")
    .unwrap()
    .strip_suffix(".scope")
    .unwrap();
with
    .unwrap();
I had to remove this. Otherwise there was a crash. Here is an example from my cluster.
/kubepods/burstable/podb55b6901e3073a2abf41783540cb7b36/f60b363dd1d5fa5939a879804b3b96836e130f57ca1ae4442da5c368accf751b
I have a bad feeling this varies by container runtime and we might need to support multiple formats.
Dumb question, but could this be a function so we can handle multiple formats?
You're right, it seems to vary a lot from one setup to another. I think having a function is a good idea.
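For illustration, a small helper could absorb both layouts without panicking: the systemd driver wraps the id as docker-<id>.scope while cgroupfs leaves the bare id as the last path segment (a sketch of the idea, not the code that finally landed):

/// Extract the container id from the last segment of a cgroup path, handling
/// both the systemd driver ("docker-<id>.scope") and the cgroupfs driver
/// (bare "<id>") without crashing on unexpected formats.
fn container_id_from_cgroup_path(path: &str) -> Option<String> {
    let last = path.trim_end_matches('/').rsplit('/').next()?;
    let id = last
        .strip_prefix("docker-")
        .and_then(|s| s.strip_suffix(".scope"))
        .unwrap_or(last);
    if id.is_empty() { None } else { Some(id.to_string()) }
}

fn main() {
    assert_eq!(
        container_id_from_cgroup_path("/kubepods.slice/kubepods-burstable.slice/docker-abc123.scope"),
        Some("abc123".to_string())
    );
    assert_eq!(
        container_id_from_cgroup_path(
            "/kubepods/burstable/podb55b6901e3073a2abf41783540cb7b36/f60b363dd1d5fa5939a879804b3b96836e130f57ca1ae4442da5c368accf751b"
        ),
        Some("f60b363dd1d5fa5939a879804b3b96836e130f57ca1ae4442da5c368accf751b".to_string())
    );
}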
I think it may be because I'm using the cgroupfs driver. The recommended driver is systemd but my cluster is a temp one on equinix metal and I didn't configure it 🤦♂️
Next time I'll use systemd and see if that changes things.
https://kubernetes.io/docs/setup/production-environment/container-runtimes/#cgroup-drivers
I've tried to make this part not mandatory. Could you try the latest version of the code in your environment?
thanks 🙏🏽
Thanks for the changes. I've retested with the systemd and cgroupfs drivers and both were fine.
Co-authored-by: Ross Fairbanks <rossf7@users.noreply.github.com>
…ub.com:hubblo-org/scaphandre into feature/#84-include-containerd-specific-labels
LGTM