
WIP: FR #84 Include containerd-specific labels to data coming from powercaprapl sensor #109

Merged: 62 commits into main on Sep 25, 2021

Conversation

bpetit
Contributor

@bpetit bpetit commented May 9, 2021

No description provided.

@bpetit bpetit linked an issue May 9, 2021 that may be closed by this pull request
@pierreozoux
Contributor

pierreozoux commented May 25, 2021

Hi!

I must have done something wrong :)

In the container:

scaphandre -v
scaphandre 0.3.0

In the compose file:

  scaphandre:
    image: hubblo/scaphandre:build-PR_84-docker-labels
    volumes:
      - type: bind
        source: /proc
        target: /proc
      - type: bind
        source: /var/run/docker.sock
        target: /var/run/docker.sock
        read_only: true
      - type: bind
        source: /sys/class/powercap
        target: /sys/class/powercap
    command: ["prometheus", "--containers"]

The logs when it starts:

   logs scaphandre
Attaching to metrics-collection_scaphandre_1
scaphandre_1        | Scaphandre prometheus exporter
scaphandre_1        | Sending ⚡ metrics
scaphandre_1        | Press CTRL-C to stop scaphandre

But when I try to curl, it is not happy:

curl localhost:8080

No error message, it just hangs...

I think I have the right image, because with that parameter on the normal image it gives me an error, and here it doesn't.
I had a quick look at the code and there doesn't seem to be much logging, so I don't think I can increase the verbosity :P

Let me know if I can do anything else :)

@bpetit
Contributor Author

bpetit commented May 26, 2021

Hi!

I've tried to reproduce, but the container does answer me :/
By the way, you need to query localhost:8080/metrics to get the metrics; the root endpoint will only give you a warning message.
Regarding the hanging query, I'm not sure what it could be about. Could you try directly from the container, with docker exec?

@bpetit
Contributor Author

bpetit commented May 27, 2021

I think I've identified the problem @pierreozoux runs into. The rs-docker crate we use for this feature uses tokio as an async runtime. The prometheus exporter itself uses actix. I guess some conflict happens, as I get tokio log messages when I reproduce the issue on Pierre's machine (using the prometheus exporter). I imagine I can't reproduce it on mine because this is not deterministic and may depend on lower-level configuration of the system (not sure, but strongly suspected).

I think we should either get rid of tokio in rs-docker (rs-docker needs a refresh anyway) or of actix in the prometheus exporter. I've also heard about bollard (https://github.com/fussybeaver/bollard), which seems to have more contributors, but it uses tokio too. So maybe the solution is to get rid of actix (I was thinking about moving the prometheus exporter to fully synchronous code plus a thread anyway).

Do you have any thoughts about that @rossf7 @PierreRust @uggla ?

@bpetit bpetit self-assigned this May 27, 2021
@uggla
Collaborator

uggla commented May 28, 2021

Do you have any thoughts about that @rossf7 @PierreRust @uggla ?

@bpetit, my 2 cents,

I don't think that's the direction history is going. I mean, all web frameworks have struggled to move to async (actix; rocket seems to use async now; warp; etc. → https://github.com/flosse/rust-web-framework-comparison) because it gives better performance. Even though we clearly don't need that performance for Scaphandre's web server, I'm afraid it will be difficult to find a web framework that does not use async, and we might end up with something not well supported in the future.
Writing our own sync HTTP(S) web server is possible too, but that's probably not a good idea from a security standpoint.
Also, we will see more and more tools/libraries using async for IO. Sometimes it is needed, sometimes it is just hype, so not always a good reason, but that's the underlying trend.

So I would rather bet on bumping tokio to the latest version, making all tokio consumers use that same dependency (not mixing tokio versions), and then checking whether that fixes the bug. Of course, that's not so easy if the bug is flaky and can't be reproduced 100% of the time.

@rossf7
Contributor

rossf7 commented May 28, 2021

Hi @bpetit, I agree with @uggla on this. I think replacing actix with the same version of tokio is a good approach.

I had the same issue when looking at the Kubernetes integration. https://github.com/clux/kube-rs is the most popular library and it uses tokio.

There is also https://github.com/ynqa/kubernetes-rust, which isn't async, but its last commit was 2 years ago. I tried to get it working but wasn't able to, although that is probably because I'm new to Rust.

@bpetit
Contributor Author

bpetit commented May 29, 2021

Hi! Thanks for your views on this. I'll give tokio/hyper a shot for the prometheus exporter. Let's see.

@bpetit
Contributor Author

bpetit commented Jun 1, 2021

It seems to work. I'll run some more tests and then jump back on the Docker integration if it's satisfying.

Collaborator

@uggla uggla left a comment


Here is a review with comments. Hoping it will help a little.

error!("server error: {}", e);
}
} else {
panic!("{} is not a valid TCP port number", port);
Collaborator


Maybe change the error message to: "is not a valid TCP port number or is already bound".

Contributor Author


Yep!

Contributor Author

@bpetit bpetit Jun 3, 2021


Actually I don't think that's the right message, as this branch is only triggered when we can't parse the port parameter as a u16. If the port number is valid but can't be reserved, we will most likely get an error from Server::bind instead.
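For illustration only (a hypothetical helper, not the PR's code), the two failure modes end up in different places:

fn check_port(port: &str) -> u16 {
    // Only a value that cannot be parsed as a u16 reaches the panic below;
    // a valid port that is already taken would fail later, at bind time,
    // as discussed above.
    match port.parse::<u16>() {
        Ok(p) => p,
        Err(_) => panic!("{} is not a valid TCP port number", port),
    }
}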

{
info!(
"{}: Refresh topology",
Utc::now().format("%Y-%m-%dT%H:%M:%S")
Collaborator

@uggla uggla Jun 1, 2021


FYI, I will make a PR to change the logging stuff. The way it is done today is not really good.

Contributor Author


What is the part that worries you?

Cargo.toml Outdated
time = "0.2.25"
colored = "2.0.0"
chrono = "0.4.19"
rs-docker = { version = "0.0.58", optional = true }
Collaborator

@uggla uggla Jun 1, 2021


You may use cargo tree to find the dependencies.
Maybe it brings in older deps?
From crates.io: (screenshot of rs-docker's dependency list)

Contributor Author


Good one, thanks! Yes, it does. Actually I think I'll use my fork instead, https://github.com/bpetit/rs-docker/, as rs-docker seems to be unmaintained. I'll then update the dependencies so we are even. WDYT?

};
let context = Arc::new(power_metrics);
let make_svc = make_service_fn(move |_| {
let ctx = context.clone();
Collaborator


Here I would use shadowing to keep the same name, let context = context.clone(). I think it is easier to follow.
Same for sfx below.
And do you really need to clone twice, here and in the async block below? I'm not sure, but maybe it is only required in the async block.

Contributor Author


I had errors I couldn't resolve if I didn't clone twice. Maybe I did it wrong. I'll try to give you some details (next week most probably) so we can look into it and see if it should be done differently.
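For the record, a minimal self-contained sketch of the pattern under discussion (assuming hyper 0.14 and tokio; the shared state here is a stand-in for the real context, not Scaphandre's actual type). Both closures can be invoked many times, so each level captures its own clone of the Arc:

use std::convert::Infallible;
use std::net::SocketAddr;
use std::sync::Arc;
use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Request, Response, Server};

#[tokio::main]
async fn main() {
    let context = Arc::new(String::from("metrics state")); // stand-in for the real context
    let make_svc = make_service_fn(move |_conn| {
        // First clone: one handle per incoming connection, since this closure
        // may run many times and cannot give away ownership of `context`.
        let context = context.clone();
        async move {
            Ok::<_, Infallible>(service_fn(move |_req: Request<Body>| {
                // Second clone: the per-request closure also runs many times,
                // so it hands a fresh handle to each request future.
                let context = context.clone();
                async move { Ok::<_, Infallible>(Response::new(Body::from((*context).clone()))) }
            }))
        }
    });
    let addr = SocketAddr::from(([127, 0, 0, 1], 8080));
    if let Err(e) = Server::bind(&addr).serve(make_svc).await {
        eprintln!("server error: {}", e);
    }
}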

@bpetit
Contributor Author

bpetit commented Jun 3, 2021

Thanks a lot for the review! I'm pretty busy over the next few days, but I should be able to integrate your suggestions and build a working version of the prometheus exporter on tokio, with container labels (plus the Docker extra labels), by Wednesday the 9th. 🤞

@bpetit
Contributor Author

bpetit commented Jun 14, 2021

The prometheus exporter with tokio seems to work fine. However, I'm not comfortable having a library that requires async just to gather data from the Docker socket locally. It's fine in a pull mode like prometheus, especially if we have a tokio runtime for the server itself, in the same version. But I don't think it's fine to require exporters like JSON, CLI, or any simple exporter to pull in an async runtime to be able to get extra information about containers. rs-docker and bollard both require async. I'm forking rs-docker into a minimalistic, read-only and synchronous version. I guess it's enough for what we need here. We could then upgrade to something fancier if needed afterwards. cc @uggla @rossf7
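To illustrate the idea (a minimal sketch only, not the actual fork): reading from the Docker socket synchronously only needs a plain HTTP request over the unix socket, with no async runtime at all.

use std::io::{Read, Write};
use std::os::unix::net::UnixStream;

fn list_containers_raw() -> std::io::Result<String> {
    // Docker's engine API answers plain HTTP on the unix socket.
    let mut stream = UnixStream::connect("/var/run/docker.sock")?;
    stream.write_all(
        b"GET /containers/json HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n",
    )?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    // `response` still contains the HTTP headers followed by the JSON body;
    // a real client would check the status line and deserialize the body.
    Ok(response)
}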

Contributor

@rossf7 rossf7 left a comment


Hi @bpetit
I ran into a couple of problems in my testing.

The first was an error in the logs for listing the pods.

isahc::handler: request completed with error: the server certificate could not be validated

It seems to be because the k8s client is connecting to http://localhost:6443. If I changed this to the host in my kubeconfig, it was fine.

The second problem was with the k8s regex. My cluster was using a different format.

@@ -22,9 +29,15 @@ impl ProcessTracker {
/// let tracker = ProcessTracker::new(5);
/// ```
pub fn new(max_records_per_process: u16) -> ProcessTracker {
let regex_cgroup_docker = Regex::new(r"^/docker/.*$").unwrap();
let regex_cgroup_kubernetes = Regex::new(r"^/kubepods.slice/.*$").unwrap();
Contributor


On the cluster I was testing with, the cgroup file has a different format. Could the regex be more generic to support both formats?

#/proc/193876/cgroup
1:name=systemd:/kubepods/burstable/pod7a8cbc91-66e9-4303-88df-513f77240233/acd77757d49868ead1f706f901271e737594d0e11cec86d4bfa4de45a0512938

The cluster was Kubernetes v1.20.2, installed using kubeadm on Ubuntu 20.10 with Docker 20.10.2.

@bpetit
Contributor Author

bpetit commented Sep 13, 2021

I guess we need a flag or an env var to set the Kubernetes API URI?

I'll extend the regexp, thanks for the feedback!
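As a rough sketch of what a more flexible pattern could look like (illustrative only, not necessarily the exact expression that landed in the PR), something permissive like the following accepts both the systemd driver form (/kubepods.slice/…) and the cgroupfs form reported above:

use regex::Regex; // same crate as in the snippet quoted earlier

fn main() {
    let regex_cgroup_kubernetes = Regex::new(r"^/kubepods.*$").unwrap();
    // systemd cgroup driver
    assert!(regex_cgroup_kubernetes.is_match("/kubepods.slice/kubepods-burstable.slice/"));
    // cgroupfs cgroup driver (format reported by @rossf7)
    assert!(regex_cgroup_kubernetes.is_match(
        "/kubepods/burstable/pod7a8cbc91-66e9-4303-88df-513f77240233/acd77757d49868ead1f706f901271e737594d0e11cec86d4bfa4de45a0512938"
    ));
}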

@rossf7
Contributor

rossf7 commented Sep 14, 2021

I guess we need a flag or an env var to set the Kubernetes API URI?

The env vars KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT will be set for the container when it's running as a pod.

What about checking for those, and if they are not set, falling back to a flag that can be set manually?
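A hedged sketch of that fallback logic (hypothetical function and flag name, and a placeholder default, mirroring the localhost:6443 address mentioned earlier):

use std::env;

// Prefer the in-cluster env vars; otherwise fall back to a user-supplied URI.
// `flag_uri` and the localhost default are placeholders for illustration.
fn kubernetes_api_uri(flag_uri: Option<String>) -> String {
    match (
        env::var("KUBERNETES_SERVICE_HOST"),
        env::var("KUBERNETES_SERVICE_PORT"),
    ) {
        (Ok(host), Ok(port)) => format!("https://{}:{}", host, port),
        _ => flag_uri.unwrap_or_else(|| String::from("http://localhost:6443")),
    }
}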

@bpetit
Contributor Author

bpetit commented Sep 16, 2021

Hi @rossf7!

I've added the gathering of the Kubernetes env vars. If those vars are present, they are used first to determine the server URI.

I also made the regexp more flexible.

I'd like to hear your thoughts and test results :)

@rossf7
Contributor

rossf7 commented Sep 17, 2021

@bpetit Many thanks for the changes :) I'll retest and report back, most likely tomorrow.

Contributor

@rossf7 rossf7 left a comment


@bpetit My testing went well. I just needed to make some small changes for the Helm chart and to adjust for my cluster.

(screenshot of test results, 2021-09-18)

I think this is really close now. 💚 🚀

Comment on lines 227 to 231
.unwrap()
.strip_prefix("docker-")
.unwrap()
.strip_suffix(".scope")
.unwrap();
Contributor


Suggested change
.unwrap()
.strip_prefix("docker-")
.unwrap()
.strip_suffix(".scope")
.unwrap();
.unwrap();

I had to remove this. Otherwise there was a crash. Here is an example from my cluster.

/kubepods/burstable/podb55b6901e3073a2abf41783540cb7b36/f60b363dd1d5fa5939a879804b3b96836e130f57ca1ae4442da5c368accf751b

I have a bad feeling this varies by container runtime and we might need to support multiple formats.

Dumb question, but could this be a function so we can handle multiple formats?

Contributor Author


You're right, it seems to vary a lot from one setup to another. I think having a function is a good idea.
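Something along these lines could work (a sketch with a hypothetical name, handling both the systemd-style docker-<id>.scope component and the bare id seen with the cgroupfs driver):

// Hypothetical helper: extract the container id from the last cgroup path
// component without unwrapping, so an unexpected format returns the raw
// component instead of crashing.
fn container_id_from_cgroup_path(path: &str) -> Option<String> {
    let last = path.split('/').last()?;
    // systemd cgroup driver: ".../docker-<id>.scope"
    if let Some(id) = last
        .strip_prefix("docker-")
        .and_then(|s| s.strip_suffix(".scope"))
    {
        return Some(id.to_string());
    }
    // cgroupfs driver: the last component is already the bare container id
    Some(last.to_string())
}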

Contributor

@rossf7 rossf7 Sep 18, 2021


I think it may be because I'm using the cgroupfs driver. The recommended driver is systemd, but my cluster is a temporary one on Equinix Metal and I didn't configure it 🤦‍♂️

Next time I'll use systemd and see if that changes things.

https://kubernetes.io/docs/setup/production-environment/container-runtimes/#cgroup-drivers

Contributor Author


I've tried to make this part not mandatory. Could you try the latest version of the code in your environment?

thanks 🙏🏽

Contributor


Thanks for the changes. I've retested with the systemd and cgroupfs drivers and both were fine.

bpetit and others added 12 commits September 18, 2021 15:26
Co-authored-by: Ross Fairbanks <rossf7@users.noreply.github.com>
Co-authored-by: Ross Fairbanks <rossf7@users.noreply.github.com>
Co-authored-by: Ross Fairbanks <rossf7@users.noreply.github.com>
Co-authored-by: Ross Fairbanks <rossf7@users.noreply.github.com>
…ub.com:hubblo-org/scaphandre into feature/#84-include-containerd-specific-labels
Contributor

@rossf7 rossf7 left a comment


LGTM

@bpetit bpetit merged commit c901389 into main Sep 25, 2021

Successfully merging this pull request may close these issues.

Include containerd-specific labels to data coming from powercaprapl sensor
4 participants