Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scripts/drtprod: send logs to datadog #123227

Merged
merged 1 commit into from
May 3, 2024

Conversation

sudomateo
Copy link
Contributor

@sudomateo sudomateo commented Apr 29, 2024

Previously, clusters created by roachprod logged exclusively to disk, requiring operators to either SSH into the instance or use roachprod logs to view logs for a CockroachDB node.

This patch adds a new roachprod fluent-bit-start command that, when run, installs and starts Fluent Bit on the CockroachDB cluster listening on 127.0.0.1:5170. The CockroachDB logging configuration has also been updated to log to this Fluent Bit endpoint, choosing not to error if the endpoint is unavailble. Clusters still log to disk as to not break existing workflows. The drtprod script was also updated to install and configure Fluent Bit on the DRT clusters. A complementary roachprod fluent-bit-stop command was also added to stop Fluent Bit.

Epic: none

Release note: none

@sudomateo sudomateo requested a review from a team as a code owner April 29, 2024 17:20
@sudomateo sudomateo requested review from herkolategan and DarrylWong and removed request for a team April 29, 2024 17:20
@cockroach-teamcity
Copy link
Member

This change is Reviewable

scripts/drtprod Outdated Show resolved Hide resolved
@sudomateo sudomateo force-pushed the CLOUDOPS-9089-drt-logs branch 2 times, most recently from 591c0b5 to 0fe7f13 Compare April 30, 2024 01:36
Copy link
Member

@dt dt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

pkg/roachprod/fluentbit/fluentbit.go Show resolved Hide resolved
pkg/roachprod/fluentbit/fluentbit.go Outdated Show resolved Hide resolved
pkg/roachprod/roachprod.go Show resolved Hide resolved
Copy link
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Thanks for adding.
I would however strongly recommend we test these changes with a TC run of roachtest that includes multitenant tests as well, since some changes are made to the log configuration.

pkg/roachprod/fluentbit/files/fluent-bit.yaml.tmpl Outdated Show resolved Hide resolved
pkg/roachprod/fluentbit/fluentbit.go Outdated Show resolved Hide resolved
pkg/cmd/roachprod/main.go Outdated Show resolved Hide resolved
combinedErr = errors.CombineErrors(combinedErr, err)
}

if err := c.Run(ctx, l, l.Stdout, l.Stderr, install.WithNodes(c.Nodes), "fluent-bit", `
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The execution speed of this Install command could benefit from c.ParallelE since there is no reason to run these sequentially on each node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Run documents that commands are run in parallel when running on multiple nodes, which is what's happening here. Happy to change it to Parallel or ParallelE if that's not the case.

When running on just one node, the command output is streamed to stdout. When running on multiple nodes, the commands run in parallel, their output is cached and then emitted all together once all commands are completed.

Copy link
Collaborator

@herkolategan herkolategan Apr 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry, you are correct; I confused myself with the loop just above that where I think I thought it could all be optimised to run in parallel.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what I meant is the hostname + config gen could all be grouped with Parallel if I'm not mistaken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved into Parallel. Let me know what you think.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice Thanks!

pkg/roachprod/install/cockroach.go Outdated Show resolved Hide resolved
@sudomateo sudomateo force-pushed the CLOUDOPS-9089-drt-logs branch 2 times, most recently from 68ed174 to c0b283c Compare April 30, 2024 14:20
Previously, clusters created by `roachprod` logged exclusively to disk,
requiring operators to either SSH into the instance or use `roachprod
logs` to view logs for a CockroachDB node.

This patch adds a new `roachprod fluent-bit-start` command that, when
run, installs and starts Fluent Bit on the CockroachDB cluster listening
on `127.0.0.1:5170`. The CockroachDB logging configuration has also been
updated to log to this Fluent Bit endpoint, choosing not to error if the
endpoint is unavailble. Clusters still log to disk as to not break
existing workflows. The `drtprod` script was also updated to install and
configure Fluent Bit on the DRT clusters. A complementary `roachprod
fluent-bit-stop` command was also added to stop Fluent Bit.

Epic: none

Release note: none
@sudomateo
Copy link
Contributor Author

Nice work! Thanks for adding. I would however strongly recommend we test these changes with a TC run of roachtest that includes multitenant tests as well, since some changes are made to the log configuration.

Will work on the test case Wednesday or Thursday. Meanwhile the rest of the requested changes have been made.

@herkolategan
Copy link
Collaborator

Nice work! Thanks for adding. I would however strongly recommend we test these changes with a TC run of roachtest that includes multitenant tests as well, since some changes are made to the log configuration.

Will work on the test case Wednesday or Thursday. Meanwhile the rest of the requested changes have been made.

Thanks for making the changes!

I have started two TC runs of roachtest [1] [2] so long as I want to double check multi-tenant works & then just a random set of 30% of the roachtests.
[1] https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestNightlyGceBazel/15060095
[2] https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestNightlyGceBazel/15060141

@sudomateo
Copy link
Contributor Author

bors r+

@sudomateo
Copy link
Contributor Author

@herkolategan I think we're good here since one roachtest passed and the other had at least a 30% pass rate. I can write tests on the fluentbit package to assert the configuration is generated correctly or add another roachtest if you think that'll be beneficial to these changes.

@craig craig bot merged commit 906605a into cockroachdb:master May 3, 2024
22 checks passed
@renatolabs
Copy link
Contributor

It seems that we're unconditionally generating fluent-servers logging configuration -- can this be made opt-in? I believe most roachprod clusters won't have this enabled (including roachtest clusters), and as a result we get lots of "errors" like this in the logs:

E240503 20:21:50.574780 55 1@util/log/buffered_sink.go:353 ⋮ [-] 229  logging error from *log.fluentSink: dial tcp ‹127.0.0.1:5170›: connect: connection refused

@sudomateo
Copy link
Contributor Author

It seems that we're unconditionally generating fluent-servers logging configuration -- can this be made opt-in? I believe most roachprod clusters won't have this enabled (including roachtest clusters), and as a result we get lots of "errors" like this in the logs:

E240503 20:21:50.574780 55 1@util/log/buffered_sink.go:353 ⋮ [-] 229  logging error from *log.fluentSink: dial tcp ‹127.0.0.1:5170›: connect: connection refused

I originally had it opt-in but removed it for simplicity. I can submit another pull request to make it opt-in again.

@sudomateo sudomateo deleted the CLOUDOPS-9089-drt-logs branch May 3, 2024 20:51
@sudomateo
Copy link
Contributor Author

@renatolabs submitted #123603 to address your latest comment.

craig bot pushed a commit that referenced this pull request May 6, 2024
123564: drtprod: move drt-chaos to local ssds r=dt a=itsbilal

Now that the local ssd bug is identified, we can move drt-chaos to local ssds as well and save some $.

Epic: none

Release note: None

123593: roachprod: make clusters secure by default r=srosenberg a=renatolabs

**roachprod: make clusters secure by default**
Roachtest has been creating secure clusters in tests for a while
now. It makes sense for roachprod to use the same, better default when
setting up clusters.

If necessary, users can still create insecure clusters by passing the
`--insecure` flag to `roachprod start` and other commands.

Fixes: #38539

Release note: None

**roachprod: use non-root user by default in roachprod sql**
This brings roachprod's and roachtest's defaults closer. Now that we
create secure roachprod clusters by default (which includes creating
an admin user), we are able to change the default authentication mode
used when starting a SQL shell.

Like with `pgurl`, the authentication mode can be changed with the
`--auth-mode` command line flag.


123603: roachprod: opt-in fluent-servers logging configuration r=sudomateo a=sudomateo

In #123227 we added the
ability for clusters created by `roachprod` to send their logs to an
external system using the `fluent-servers` attribute in the CockroachDB
logging configuration. However, the `fluent-servers` attribute was
enabled unconditionally which caused clusters to to log an error if the
requisite Fluent Bit server was not there to receive the logs.

```
E240503 20:21:50.574780 55 1@util/log/buffered_sink.go:353 ⋮ [-] 229  logging error from *log.fluentSink: dial tcp ‹127.0.0.1:5170›: connect: connection refused
```

This patch adds a `--enable-fluent-sink` option to the `roachprod start`
command to conditionally allow clusters to use the `fluent-servers`
logging configuration. It also updates `scripts/drtprod` to use this
option so they can continue to log to external systems.

Epic: none

Release note: none


Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
Co-authored-by: Renato Costa <renato@cockroachlabs.com>
Co-authored-by: Matthew Sanabria <24284972+sudomateo@users.noreply.github.com>
@srosenberg
Copy link
Member

I originally had it opt-in but removed it for simplicity. I can submit another pull request to make it opt-in again.

Looks like the opt-in was also needed for another reason, namely it broke some scenarios in mixed-version tests [1], wherein an older binary was incompatible with some of the specified log channels; e.g., KV_DISTRIBUTION was added after 22.1.

[1] #123656

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants