scripts/drtprod: send logs to datadog #123227
Nice!
Nice work! Thanks for adding.
I would however strongly recommend we test these changes with a TC run of roachtest
that includes multitenant tests as well, since some changes are made to the log configuration.
pkg/roachprod/fluentbit/fluentbit.go
Outdated
	combinedErr = errors.CombineErrors(combinedErr, err)
}

if err := c.Run(ctx, l, l.Stdout, l.Stderr, install.WithNodes(c.Nodes), "fluent-bit", `
The execution speed of this `Install` command could benefit from `c.ParallelE`, since there is no reason to run these sequentially on each node.
`Run` documents that commands are run in parallel when running on multiple nodes, which is what's happening here. Happy to change it to `Parallel` or `ParallelE` if that's not the case.

> When running on just one node, the command output is streamed to stdout. When running on multiple nodes, the commands run in parallel, their output is cached and then emitted all together once all commands are completed.
Ah sorry, you are correct. I confused myself with the loop just above that, which I thought could all be optimised to run in parallel.
I think what I meant is that the hostname + config gen could all be grouped with `Parallel`, if I'm not mistaken.
Moved into `Parallel`. Let me know what you think.
Nice, thanks!
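For readers following this thread, here is a minimal sketch of the pattern being discussed: running independent per-node preparation (such as hostname lookup and config generation) concurrently and combining any errors. It is not roachprod's `Parallel` implementation. `runOnNodes` and `prepareNode` are hypothetical names, and the standard library's `errors.Join` stands in for `errors.CombineErrors` from the diff above.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// prepareNode is a hypothetical stand-in for per-node work such as
// resolving a hostname and generating a config file. Node 3 fails
// here purely to show how errors are collected.
func prepareNode(node int) error {
	if node == 3 {
		return fmt.Errorf("node %d: config generation failed", node)
	}
	return nil
}

// runOnNodes runs fn once per node, all concurrently, waits for every
// call to finish, and returns the non-nil errors joined together
// (or nil if every call succeeded).
func runOnNodes(nodes []int, fn func(node int) error) error {
	errs := make([]error, len(nodes))
	var wg sync.WaitGroup
	for i, n := range nodes {
		wg.Add(1)
		go func(i, n int) {
			defer wg.Done()
			// Each goroutine writes only to its own slot, so no mutex is needed.
			errs[i] = fn(n)
		}(i, n)
	}
	wg.Wait()
	return errors.Join(errs...)
}

func main() {
	if err := runOnNodes([]int{1, 2, 3, 4}, prepareNode); err != nil {
		fmt.Println("preparation failed:", err)
	}
}
```

Collecting errors into a pre-sized slice keeps one failing node from hiding failures on the others, which is the same intent as accumulating into `combinedErr`.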
Previously, clusters created by `roachprod` logged exclusively to disk, requiring operators to either SSH into the instance or use `roachprod logs` to view logs for a CockroachDB node.

This patch adds a new `roachprod fluent-bit-start` command that, when run, installs and starts Fluent Bit on the CockroachDB cluster listening on `127.0.0.1:5170`. The CockroachDB logging configuration has also been updated to log to this Fluent Bit endpoint, choosing not to error if the endpoint is unavailable. Clusters still log to disk so as not to break existing workflows. The `drtprod` script was also updated to install and configure Fluent Bit on the DRT clusters. A complementary `roachprod fluent-bit-stop` command was also added to stop Fluent Bit.

Epic: none
Release note: none
Will work on the test case Wednesday or Thursday. Meanwhile the rest of the requested changes have been made.
Thanks for making the changes! I have started two TC runs of
bors r+
@herkolategan I think we're good here since one
It seems that we're unconditionally generating
I originally had it opt-in but removed it for simplicity. I can submit another pull request to make it opt-in again.
@renatolabs submitted #123603 to address your latest comment.
123564: drtprod: move drt-chaos to local ssds r=dt a=itsbilal

Now that the local ssd bug is identified, we can move drt-chaos to local ssds as well and save some $.

Epic: none
Release note: None

123593: roachprod: make clusters secure by default r=srosenberg a=renatolabs

**roachprod: make clusters secure by default**

Roachtest has been creating secure clusters in tests for a while now. It makes sense for roachprod to use the same, better default when setting up clusters. If necessary, users can still create insecure clusters by passing the `--insecure` flag to `roachprod start` and other commands.

Fixes: #38539
Release note: None

**roachprod: use non-root user by default in roachprod sql**

This brings roachprod's and roachtest's defaults closer. Now that we create secure roachprod clusters by default (which includes creating an admin user), we are able to change the default authentication mode used when starting a SQL shell. Like with `pgurl`, the authentication mode can be changed with the `--auth-mode` command line flag.

123603: roachprod: opt-in fluent-servers logging configuration r=sudomateo a=sudomateo

In #123227 we added the ability for clusters created by `roachprod` to send their logs to an external system using the `fluent-servers` attribute in the CockroachDB logging configuration. However, the `fluent-servers` attribute was enabled unconditionally, which caused clusters to log an error if the requisite Fluent Bit server was not there to receive the logs.

```
E240503 20:21:50.574780 55 1@util/log/buffered_sink.go:353 ⋮ [-] 229 logging error from *log.fluentSink: dial tcp ‹127.0.0.1:5170›: connect: connection refused
```

This patch adds a `--enable-fluent-sink` option to the `roachprod start` command to conditionally allow clusters to use the `fluent-servers` logging configuration. It also updates `scripts/drtprod` to use this option so they can continue to log to external systems.

Epic: none
Release note: none

Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
Co-authored-by: Renato Costa <renato@cockroachlabs.com>
Co-authored-by: Matthew Sanabria <24284972+sudomateo@users.noreply.github.com>
Looks like the opt-in was also needed for another reason, namely it broke some scenarios in mixed-version tests [1], wherein an older binary was incompatible with some of the specified log channels; e.g., [1] #123656