Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ASCII-1886] List agent open files in flares on Linux #28584

Merged
merged 24 commits into from
Aug 23, 2024

Conversation

pgimalac
Copy link
Contributor

@pgimalac pgimalac commented Aug 20, 2024

What does this PR do?

Add a package pkg/util/lsof which provides open file information for a given process, and use it to add a file in flares containing the agent's open files.

Motivation

Help with debugging, know what the agent is doing, what dynamic libraries are loaded, etc.

Additional Notes

Only implemented on Linux for now, and in non-local mode.

In local mode there is currently no way to know the PID of the running agent (pid file is an optional arg of the run command, and it doesn't exist in the flare command).

Example output for an agent:

FD  Type Size      OpenPerm FilePerm          Name
0   DEV  0         rw       -rw--w----        /dev/pts/0
1   DEV  0         rw       -rw--w----        /dev/pts/0
2   DEV  0         rw       -rw--w----        /dev/pts/0
3   DEV  0         r-       -rw-rw-rw-        /dev/tty
4   REG  0         rw       -rw-------        anon_inode:[eventpoll]
5   PIPE 0         r-       -rw-------        pipe:[3713635]
6   PIPE 0         -w       -rw-------        pipe:[3713635]
7   DEV  0         r-       -rw-rw-rw-        /dev/tty
8   REG  156313    -w       -rw-r--r--        /var/log/datadog/agent.log
9   unix 0         rw       connected:default stream:
10  tcp  0         rw       ESTABLISHED       172.17.0.2:56134->3.233.157.160:443
11  udp  0         rw       CLOSE             127.0.0.1:8125->0.0.0.0:0
12  unix 0         rw       connected:default stream:
13  tcp  0         rw       ESTABLISHED       127.0.0.1:43500->127.0.0.1:5001
14  tcp  0         rw       ESTABLISHED       172.17.0.2:35346->3.233.148.192:443
15  tcp  0         rw       LISTEN            127.0.0.1:5001->0.0.0.0:0
16  tcp  0         rw       ESTABLISHED       127.0.0.1:43484->127.0.0.1:5001
17  tcp  0         rw       ESTABLISHED       127.0.0.1:43490->127.0.0.1:5001
18  tcp  0         rw       ESTABLISHED       127.0.0.1:5001->127.0.0.1:43484
19  tcp  0         rw       ESTABLISHED       127.0.0.1:5001->127.0.0.1:43490
20  tcp  0         rw       LISTEN            127.0.0.1:5000->0.0.0.0:0
21  tcp  0         rw       ESTABLISHED       172.17.0.2:56132->3.233.157.160:443
22  tcp  0         rw       ESTABLISHED       127.0.0.1:5001->127.0.0.1:43500
23  tcp  0         rw       ESTABLISHED       172.17.0.2:56158->3.233.157.160:443
24  tcp  0         rw       ESTABLISHED       172.17.0.2:56170->3.233.157.160:443
25  tcp  0         rw       ESTABLISHED       127.0.0.1:5001->127.0.0.1:35242
26  REG  42        -w       -rw-r--r--        /tmp/4278883081/COMP-VWFVT21J6N-devenv/flare_creation.log
mem REG  228474696 r-xp     -rwxr-xr-x        /workspaces/datadog-agent/bin/agent/agent
mem REG  309792    r-xp     -rw-r--r--        /usr/lib/aarch64-linux-gnu/libnss_systemd.so.2
mem REG  2190752   r-xp     -rw-r--r--        /usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.30
mem REG  551064    r-xp     -rw-r--r--        /usr/lib/aarch64-linux-gnu/libm.so.6
mem REG  84296     r-xp     -rw-r--r--        /usr/lib/aarch64-linux-gnu/libgcc_s.so.1
mem REG  1637400   r-xp     -rwxr-xr-x        /usr/lib/aarch64-linux-gnu/libc.so.6
mem REG  41544     r-xp     -rw-r--r--        /workspaces/datadog-agent/dev/lib/libdatadog-agent-rtloader.so.0.1.0
mem REG  187776    r-xp     -rwxr-xr-x        /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1
mem REG  187776    r--p     -rwxr-xr-x        /usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

Tested manually when running the agent in a Linux container.

Also explicitly tried using LD_PRELOAD to replace the rtloader shared library with a custom one, and it was properly displayed.

@pr-commenter
Copy link

pr-commenter bot commented Aug 20, 2024

Regression Detector

Regression Detector Results

Run ID: a8c32720-c6f2-498e-9049-b8c7cf70b5cd Metrics dashboard Target profiles

Baseline: bcbcd06
Comparison: 999bab2

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI links
file_tree memory utilization +1.88 [+1.78, +1.98] Logs
basic_py_check % cpu utilization +0.96 [-1.65, +3.57] Logs
uds_dogstatsd_to_api ingress throughput +0.00 [-0.03, +0.04] Logs
tcp_dd_logs_filter_exclude ingress throughput +0.00 [-0.01, +0.01] Logs
pycheck_lots_of_tags % cpu utilization -0.12 [-2.58, +2.35] Logs
otel_to_otel_logs ingress throughput -0.57 [-1.38, +0.24] Logs
idle memory utilization -0.99 [-1.03, -0.95] Logs
tcp_syslog_to_blackhole ingress throughput -1.30 [-13.98, +11.38] Logs
uds_dogstatsd_to_api_cpu % cpu utilization -1.78 [-2.57, -0.98] Logs

Bounds Checks

perf experiment bounds_check_name replicates_passed
idle memory_usage 10/10

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

Copy link

cit-pr-commenter bot commented Aug 20, 2024

Go Package Import Differences

Baseline: bcbcd06
Comparison: 999bab2

binaryosarchchange
agentlinuxamd64
+4, -0
+github.com/DataDog/datadog-agent/comp/core/lsof/def
+github.com/DataDog/datadog-agent/comp/core/lsof/fx
+github.com/DataDog/datadog-agent/comp/core/lsof/impl
+github.com/DataDog/datadog-agent/pkg/util/lsof
agentlinuxarm64
+4, -0
+github.com/DataDog/datadog-agent/comp/core/lsof/def
+github.com/DataDog/datadog-agent/comp/core/lsof/fx
+github.com/DataDog/datadog-agent/comp/core/lsof/impl
+github.com/DataDog/datadog-agent/pkg/util/lsof
agentwindowsamd64
+4, -0
+github.com/DataDog/datadog-agent/comp/core/lsof/def
+github.com/DataDog/datadog-agent/comp/core/lsof/fx
+github.com/DataDog/datadog-agent/comp/core/lsof/impl
+github.com/DataDog/datadog-agent/pkg/util/lsof
agentdarwinamd64
+4, -0
+github.com/DataDog/datadog-agent/comp/core/lsof/def
+github.com/DataDog/datadog-agent/comp/core/lsof/fx
+github.com/DataDog/datadog-agent/comp/core/lsof/impl
+github.com/DataDog/datadog-agent/pkg/util/lsof
agentdarwinarm64
+4, -0
+github.com/DataDog/datadog-agent/comp/core/lsof/def
+github.com/DataDog/datadog-agent/comp/core/lsof/fx
+github.com/DataDog/datadog-agent/comp/core/lsof/impl
+github.com/DataDog/datadog-agent/pkg/util/lsof
iot-agentlinuxamd64
+4, -0
+github.com/DataDog/datadog-agent/comp/core/lsof/def
+github.com/DataDog/datadog-agent/comp/core/lsof/fx
+github.com/DataDog/datadog-agent/comp/core/lsof/impl
+github.com/DataDog/datadog-agent/pkg/util/lsof
iot-agentlinuxarm64
+4, -0
+github.com/DataDog/datadog-agent/comp/core/lsof/def
+github.com/DataDog/datadog-agent/comp/core/lsof/fx
+github.com/DataDog/datadog-agent/comp/core/lsof/impl
+github.com/DataDog/datadog-agent/pkg/util/lsof
heroku-agentlinuxamd64
+4, -0
+github.com/DataDog/datadog-agent/comp/core/lsof/def
+github.com/DataDog/datadog-agent/comp/core/lsof/fx
+github.com/DataDog/datadog-agent/comp/core/lsof/impl
+github.com/DataDog/datadog-agent/pkg/util/lsof

@pr-commenter
Copy link

pr-commenter bot commented Aug 20, 2024

Test changes on VM

Use this command from test-infra-definitions to manually test this PR changes on a VM:

inv create-vm --pipeline-id=42571680 --os-family=ubuntu

Note: This applies to commit 999bab2

pkg/util/lsof/lsof_linux.go Show resolved Hide resolved
pkg/util/lsof/lsof_linux.go Show resolved Hide resolved
pkg/util/lsof/lsof_linux.go Outdated Show resolved Hide resolved
pkg/util/lsof/lsof_linux.go Show resolved Hide resolved
pkg/util/lsof/lsof_linux.go Outdated Show resolved Hide resolved
pkg/util/lsof/lsof_linux.go Outdated Show resolved Hide resolved
pkg/util/lsof/lsof_linux.go Show resolved Hide resolved
Comment on lines +51 to +60
var out bytes.Buffer
writer := tabwriter.NewWriter(&out, 1, 1, 1, ' ', 0)

fmt.Fprint(writer, "FD\tType\tSize\tOpenPerm\tFilePerm\tName\t\n")
for _, file := range files {
fmt.Fprintf(writer, "%s\t%s\t%d\t%s\t%s\t%s\t\n", file.Fd, file.Type, file.Size, file.OpenPerm, file.FilePerm, file.Name)
}

_ = writer.Flush()
return out.String()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏 👏 👏 👏 👏 👏 👏 👏 👏 👏

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was pretty happy when I discovered text/tabwriter 😁

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test for that function by the way, it's somewhat testing text/tabwriter but I think it can avoid breaking things in the output format accidentally

pkg/util/lsof/lsof_linux_test.go Outdated Show resolved Hide resolved
@@ -336,6 +337,7 @@ func getSharedFxOption() fx.Option {
path.DefaultStreamlogsLogFile,
)),
flare.Module(),
lsof.Module(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could probably move this directly in the flare component. I think is a bit overkill to have a component that only add a flare provider.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking long term, this component could provide an API endpoint to access the open files information, that way the flare can include the info for every running agent

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good use case 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that is the case we might need to make part of the core bundle, what do you think?

@GustavoCaso
Copy link
Member

Overall things look good to me. I think we could improve the error handling and if something fails we could write that into the result file in the flare, that way we have more visibility into things when extracting lsof information fails.

"strings"
"syscall"

"github.com/prometheus/procfs"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good thing that we already had this dependency in the Agent 😉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be fair it has almost no dependency itself, and I don't think it needs CGO, so I think it's a pretty lightweight dependency anyway

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's mostly a text parser for /proc

@GustavoCaso
Copy link
Member

I think we could improve the error handling and if something fails we could write that into the result file in the flare, that way we have more visibility into things when extracting lsof information fails.

This was already resolve by this comment #28584 (comment)

@pgimalac pgimalac requested a review from a team as a code owner August 22, 2024 12:44
@pgimalac pgimalac removed the request for review from a team August 22, 2024 12:50
Copy link
Member

@GustavoCaso GustavoCaso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work 🎉

@pgimalac
Copy link
Contributor Author

/merge

@dd-devflow
Copy link

dd-devflow bot commented Aug 23, 2024

🚂 MergeQueue: pull request added to the queue

The median merge time in main is 22m.

Use /merge -c to cancel this operation!

@pgimalac
Copy link
Contributor Author

/merge -c

@dd-devflow
Copy link

dd-devflow bot commented Aug 23, 2024

⚠️ MergeQueue: This merge request build was cancelled

This merge request build was cancelled

If you need support, contact us on Slack #devflow!

@pgimalac
Copy link
Contributor Author

/merge

@dd-devflow
Copy link

dd-devflow bot commented Aug 23, 2024

🚂 MergeQueue: pull request added to the queue

The median merge time in main is 22m.

Use /merge -c to cancel this operation!

@pgimalac pgimalac added the qa/done Skip QA week as QA was done before merge and regressions are covered by tests label Aug 23, 2024
@dd-mergequeue dd-mergequeue bot merged commit f186f2f into main Aug 23, 2024
229 checks passed
@dd-mergequeue dd-mergequeue bot deleted the pgimalac/lsof-flare branch August 23, 2024 09:07
@github-actions github-actions bot added this to the 7.58.0 milestone Aug 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog/no-changelog qa/done Skip QA week as QA was done before merge and regressions are covered by tests team/agent-shared-components
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants