scripts/drtprod: send logs to datadog #123227

sudomateo · 2024-04-29T17:20:35Z

Previously, clusters created by roachprod logged exclusively to disk, requiring operators to either SSH into the instance or use roachprod logs to view logs for a CockroachDB node.

This patch adds a new roachprod fluent-bit-start command that, when run, installs and starts Fluent Bit on the CockroachDB cluster listening on 127.0.0.1:5170. The CockroachDB logging configuration has also been updated to log to this Fluent Bit endpoint, choosing not to error if the endpoint is unavailable. Clusters still log to disk so as not to break existing workflows. The drtprod script was also updated to install and configure Fluent Bit on the DRT clusters. A complementary roachprod fluent-bit-stop command was also added to stop Fluent Bit.

Epic: none

Release note: none

scripts/drtprod

dt

Nice!

pkg/roachprod/fluentbit/fluentbit.go

pkg/roachprod/roachprod.go

herkolategan

Nice work! Thanks for adding.
I would however strongly recommend we test these changes with a TC run of roachtest that includes multitenant tests as well, since some changes are made to the log configuration.

pkg/roachprod/fluentbit/files/fluent-bit.yaml.tmpl

pkg/roachprod/fluentbit/fluentbit.go

pkg/cmd/roachprod/main.go

herkolategan · 2024-04-30T10:38:15Z

pkg/roachprod/fluentbit/fluentbit.go

+		combinedErr = errors.CombineErrors(combinedErr, err)
+	}
+
+	if err := c.Run(ctx, l, l.Stdout, l.Stderr, install.WithNodes(c.Nodes), "fluent-bit", `


The execution speed of this Install command could benefit from c.ParallelE since there is no reason to run these sequentially on each node.

Run documents that commands are run in parallel when running on multiple nodes, which is what's happening here. Happy to change it to Parallel or ParallelE if that's not the case.

When running on just one node, the command output is streamed to stdout. When running on multiple nodes, the commands run in parallel, their output is cached and then emitted all together once all commands are completed.

Ah sorry, you are correct; I confused myself with the loop just above that, which I thought could all be optimised to run in parallel.

I think what I meant is the hostname + config gen could all be grouped with Parallel if I'm not mistaken.

Moved into Parallel. Let me know what you think.

Nice, thanks!

pkg/roachprod/install/cockroach.go


sudomateo · 2024-04-30T19:32:40Z

> Nice work! Thanks for adding. I would however strongly recommend we test these changes with a TC run of roachtest that includes multitenant tests as well, since some changes are made to the log configuration.

Will work on the test case Wednesday or Thursday. Meanwhile the rest of the requested changes have been made.

herkolategan · 2024-05-01T09:27:58Z

> > Nice work! Thanks for adding. I would however strongly recommend we test these changes with a TC run of roachtest that includes multitenant tests as well, since some changes are made to the log configuration.
>
> Will work on the test case Wednesday or Thursday. Meanwhile the rest of the requested changes have been made.

Thanks for making the changes!

In the meantime, I have started two TC runs of roachtest [1] [2]: one to double-check that multi-tenant works, and one with a random set of 30% of the roachtests.
[1] https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestNightlyGceBazel/15060095
[2] https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestNightlyGceBazel/15060141

sudomateo · 2024-05-03T04:37:14Z

bors r+

sudomateo · 2024-05-03T04:47:14Z

@herkolategan I think we're good here since one roachtest passed and the other had at least a 30% pass rate. I can write tests on the fluentbit package to assert the configuration is generated correctly or add another roachtest if you think that'll be beneficial to these changes.

craig · 2024-05-03T05:10:41Z

Build succeeded:

renatolabs · 2024-05-03T20:24:41Z

It seems that we're unconditionally generating fluent-servers logging configuration -- can this be made opt-in? I believe most roachprod clusters won't have this enabled (including roachtest clusters), and as a result we get lots of "errors" like this in the logs:

E240503 20:21:50.574780 55 1@util/log/buffered_sink.go:353 ⋮ [-] 229  logging error from *log.fluentSink: dial tcp ‹127.0.0.1:5170›: connect: connection refused

sudomateo · 2024-05-03T20:34:10Z

> It seems that we're unconditionally generating fluent-servers logging configuration -- can this be made opt-in? I believe most roachprod clusters won't have this enabled (including roachtest clusters), and as a result we get lots of "errors" like this in the logs:
>
> E240503 20:21:50.574780 55 1@util/log/buffered_sink.go:353 ⋮ [-] 229  logging error from *log.fluentSink: dial tcp ‹127.0.0.1:5170›: connect: connection refused

I originally had it opt-in but removed it for simplicity. I can submit another pull request to make it opt-in again.

sudomateo · 2024-05-03T21:54:39Z

@renatolabs submitted #123603 to address your latest comment.

123564: drtprod: move drt-chaos to local ssds r=dt a=itsbilal

Now that the local ssd bug is identified, we can move drt-chaos to local ssds as well and save some $.

Epic: none

Release note: None

123593: roachprod: make clusters secure by default r=srosenberg a=renatolabs

**roachprod: make clusters secure by default**

Roachtest has been creating secure clusters in tests for a while now. It makes sense for roachprod to use the same, better default when setting up clusters. If necessary, users can still create insecure clusters by passing the `--insecure` flag to `roachprod start` and other commands.

Fixes: #38539

Release note: None

**roachprod: use non-root user by default in roachprod sql**

This brings roachprod's and roachtest's defaults closer. Now that we create secure roachprod clusters by default (which includes creating an admin user), we are able to change the default authentication mode used when starting a SQL shell. Like with `pgurl`, the authentication mode can be changed with the `--auth-mode` command line flag.

123603: roachprod: opt-in fluent-servers logging configuration r=sudomateo a=sudomateo

In #123227 we added the ability for clusters created by `roachprod` to send their logs to an external system using the `fluent-servers` attribute in the CockroachDB logging configuration. However, the `fluent-servers` attribute was enabled unconditionally, which caused clusters to log an error if the requisite Fluent Bit server was not there to receive the logs.

```
E240503 20:21:50.574780 55 1@util/log/buffered_sink.go:353 ⋮ [-] 229 logging error from *log.fluentSink: dial tcp ‹127.0.0.1:5170›: connect: connection refused
```

This patch adds a `--enable-fluent-sink` option to the `roachprod start` command to conditionally allow clusters to use the `fluent-servers` logging configuration. It also updates `scripts/drtprod` to use this option so they can continue to log to external systems.

Epic: none

Release note: none

Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
Co-authored-by: Renato Costa <renato@cockroachlabs.com>
Co-authored-by: Matthew Sanabria <24284972+sudomateo@users.noreply.github.com>

srosenberg · 2024-05-06T16:17:41Z

> I originally had it opt-in but removed it for simplicity. I can submit another pull request to make it opt-in again.

Looks like the opt-in was also needed for another reason, namely it broke some scenarios in mixed-version tests [1], wherein an older binary was incompatible with some of the specified log channels; e.g., KV_DISTRIBUTION was added after 22.1.

[1] #123656

sudomateo requested a review from a team as a code owner April 29, 2024 17:20

sudomateo requested review from herkolategan and DarrylWong and removed request for a team April 29, 2024 17:20

dt reviewed Apr 29, 2024

sudomateo force-pushed the CLOUDOPS-9089-drt-logs branch from e4bf8da to 0ab16d5 Compare April 29, 2024 19:24

dt approved these changes Apr 29, 2024

sudomateo force-pushed the CLOUDOPS-9089-drt-logs branch 2 times, most recently from 591c0b5 to 0fe7f13 Compare April 30, 2024 01:36

dt approved these changes Apr 30, 2024

herkolategan requested changes Apr 30, 2024

sudomateo force-pushed the CLOUDOPS-9089-drt-logs branch 2 times, most recently from 68ed174 to c0b283c Compare April 30, 2024 14:20

sudomateo force-pushed the CLOUDOPS-9089-drt-logs branch from c0b283c to 22068bf Compare April 30, 2024 14:53

herkolategan approved these changes May 2, 2024

craig bot merged commit 906605a into cockroachdb:master May 3, 2024
22 checks passed

sudomateo deleted the CLOUDOPS-9089-drt-logs branch May 3, 2024 20:51

sudomateo mentioned this pull request May 3, 2024

roachprod: opt-in fluent-servers logging configuration #123603

Merged

srosenberg mentioned this pull request May 6, 2024

roachtest: rebalance/by-load/leases/mixed-version failed #123656

Closed

srosenberg mentioned this pull request May 8, 2024

mixedversion: redirect failures to test-eng if user hooks never ran #123680

Merged
