multi-arch-builders: use splunk to monitor and send logs #894

Merged: marmijo merged 3 commits into coreos:main from the add_splunk_config branch on Jul 30, 2023

Conversation

Member

@marmijo marmijo commented Jul 21, 2023

Add a splunk forwarder sidecar container to the RHCOS multiarch builders that will send journald logs to a central splunk server.

commit ec961fe76eeaf518535d36a221287cdb133ea5a7
Author: Michael Armijo <marmijo@redhat.com>
Date:   Mon Jul 24 02:20:39 2023 -0400

    multi-arch-builders: add additional RHCOS specific configs
    
    Add RHCOS-specific configs for `aarch64` and `ppc64le` that merge
    the newly created splunk config with the arch-specific configs.

commit 309e991f89af0b97875a34a9ad1a3462641f1af1
Author: Michael Armijo <marmijo@redhat.com>
Date:   Mon Jul 24 02:19:11 2023 -0400

    multi-arch-builders: include splunk config in `s390x` RHCOS builder
    
    Include the newly created splunk config when creating the `s390x`
    remote builder.

commit bb2e704cfa61607441226fe863d6cd0aa818b55f
Author: Michael Armijo <marmijo@redhat.com>
Date:   Fri Jul 21 17:51:48 2023 -0400

    multi-arch-builders: add splunk butane config
    
    Add a butane config to the multi-arch builders to include when creating the
    RHCOS builders. This butane config will build and start a splunk container
    to send logs to a central RH splunk server.
    
    See: https://issues.redhat.com/browse/COS-2131

@marmijo marmijo changed the title WIP remote-builders: use splunk to monitor and send logs WIP: remote-builders: use splunk to monitor and send logs Jul 21, 2023
@marmijo marmijo marked this pull request as draft July 21, 2023 22:07
@marmijo marmijo changed the title WIP: remote-builders: use splunk to monitor and send logs multi-arch-builders: use splunk to monitor and send logs Jul 24, 2023
@marmijo marmijo force-pushed the add_splunk_config branch 2 times, most recently from d678c41 to da49276 Compare July 25, 2023 22:24
ExecStartPre=nm-online --timeout=30
# use `--arch=x86_64` because the RPM doesnt exist for aarch64. ppc64le and s390x RPMs dont work for journald input.
ExecStart=-podman build --pull-always --cache-ttl=480h --arch=x86_64 -t localhost/splunkforwarder:latest https://gitlab.corp.redhat.com/paas/cicd.git#main:itpaas-util-images/splunkforwarder
ExecStart=-podman build --build-arg project=${SPLUNK_PROJECT} --cache-ttl=480h --from localhost/splunkforwarder:latest -t localhost/splunkforwarder-rhcos https://gitlab.corp.redhat.com/paas/docker-paas.git#master:splunk
Contributor

Is there no way to build the container and just pull it from the registry here?
Building the container each time on reboot sounds failure prone and unnecessary.

Also, if using pre-built container images, it would be possible to poll for updates regularly instead of relying on reboots for updates. I'm not sure how often that would happen, but we might want to have those run unattended, as that is most of the reason we are using FCOS for the pipeline.

Member Author

@dustymabe and I talked about this. If we push it to a registry, we most likely need it locked down somewhere. That would mean configuring extra credentials on the builders when we install. Building the container each time also reduces the burden of maintaining a separate container build workflow.

Contributor

If you already talked through the implications of building it and decided it is the better option, I am fine with that.

What about the automated updates? I guess right now it would happen each time fcos is updated (rebooted) right? Is that frequent enough?

Member

@jschintag if the container fails to build on any particular boot then we'll just continue to use the previous container image that was built on a previous boot.

If we do more than one reboot in quick succession, the --cache-ttl=480h will essentially make the builds on the next reboot a no-op.

For updates, yeah. The container would get rebuilt every two weeks when FCOS goes down for automated updates.
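
For reference, the numbers in this thread can be put side by side with a quick unit conversion. This is only a sanity-check sketch: the 14-day interval is an assumption read from the "every two weeks" remark above, and the variable names are made up for illustration.

```shell
# Values from the thread: --cache-ttl=480h and an update reboot roughly every two weeks.
cache_ttl_hours=480
update_interval_days=14   # assumption: "every two weeks" means 14 days

# Convert the TTL to days so the two numbers are comparable.
cache_ttl_days=$((cache_ttl_hours / 24))
echo "cache TTL: ${cache_ttl_days} days, update interval: ${update_interval_days} days"
```

Since `--cache-ttl` limits cache reuse to layers created less than the given duration ago, a TTL longer than the update cycle means that whether a particular reboot does a full rebuild presumably depends on the age of the cached layers and on whether `--pull-always` fetched a newer base image.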

Environment=GIT_SSL_NO_VERIFY=1
ExecStartPre=nm-online --timeout=30
# use `--arch=x86_64` because the RPM doesnt exist for aarch64. ppc64le and s390x RPMs dont work for journald input.
ExecStart=-podman build --pull-always --cache-ttl=480h --arch=x86_64 -t localhost/splunkforwarder:latest https://gitlab.corp.redhat.com/paas/cicd.git#main:itpaas-util-images/splunkforwarder
Member

Is it also possible to hide the internal gitlab address?

Contributor

We could maybe have a generic internal HTTP Link that contains the address, but it would of course introduce new dependencies

Member Author

@marmijo marmijo Jul 27, 2023

Another option could be to replace the address with ${ITPAAS_SPLUNK_URL}. There are two other env variables in this file that will have to be resolved by creating the ignition config like:

envsubst < coreos-builder-splunk.bu | butane --pretty --strict > coreos-builder-splunk.ign

I'm creating an internal readme explaining the splunk container and how to install it, including which variables to set during the installation of the RHCOS remote builders. I could have the URL documented there as well.

Member

yeah, if we want to hide it then envsubst is probably the way to go.

Member Author

Updated!

Volume=/var/log/journal:/var/log/journal:ro
Volume=/run/log/journal:/run/log/journal:ro
Volume=/etc/machine-id:/etc/machine-id:ro
Volume=/var/home/core:/var/log/journald:ro
Member

this seems a bit odd. This is a container running under the splunk user but we mount the core user's home directory into the container under /var/log/journald?

Member Author

Oops, carried over from debugging before we officially got journald logs into splunk. I'll remove this

Volume=/run/log/journal:/run/log/journal:ro
Volume=/etc/machine-id:/etc/machine-id:ro
Volume=/var/home/core:/var/log/journald:ro
Network=host
Member

Can you explain why Network=host is needed? Maybe add a comment?

Member Author

done!

ContainerName=rhcos-multiarch-splunk
Image=localhost/splunkforwarder-rhcos:latest
RunInit=true
User=root
Member

add comment maybe?

Member

Though now I have a question: this means that it runs as "root" in your user namespace (i.e. as UID=0 inside the container, which is equivalent to the splunk user on the host), right? It's not running as actual "root" on the host?

Member Author

Correct, we're running the container as the splunk user on the host, not the root user
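
The mapping described here can be illustrated with the `uid_map` format that the kernel exposes under `/proc/<pid>/uid_map`: each line gives the first ID inside the namespace, the first ID outside it, and the length of the range. The sketch below is only an illustration. The host UID 1001 for the splunk user and the subordinate range starting at 100000 are made-up values, and `map_uid` is a hypothetical helper, not something from this PR.

```shell
# Translate a UID seen inside a user namespace to the UID outside it, reading
# uid_map-style lines ("<first ID inside> <first ID outside> <count>") on stdin.
map_uid() {
  awk -v uid="$1" '
    uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
    END { if (!found) exit 1 }   # no range covers this UID
  '
}

# Hypothetical mapping for a rootless user with host UID 1001.
example_map='0 1001 1
1 100000 65536'

# UID 0 ("root" in the container) maps to the unprivileged host user.
printf '%s\n' "$example_map" | map_uid 0
# Any other container UID lands in the subordinate range.
printf '%s\n' "$example_map" | map_uid 1000
```

On a builder, the actual mapping for the rootless user can be inspected with `podman unshare cat /proc/self/uid_map`.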

[Container]
ContainerName=rhcos-multiarch-splunk
Image=localhost/splunkforwarder-rhcos:latest
RunInit=true
Member

maybe add a comment here too if we have the context.

Volume=/etc/machine-id:/etc/machine-id:ro
Volume=/var/home/core:/var/log/journald:ro
Network=host
PodmanArgs=--ipc=host --group-add keep-groups --hostname=${SPLUNK_HOSTNAME}
Member

--ipc=host is needed for journalctl to talk to the journal on the host?

--group-add keep-groups is needed to pick up the wheel and systemd-journal groups from the splunk user?

Member Author

I added a comment explaining why 3 additional options are needed

Comment on lines 88 to 89
# Figure out soon if we can get the Red Hat Certs onto the hosts
Environment=GIT_SSL_NO_VERIFY=1
Member

I still need to look into this.

multi-arch-builders/coreos-builder-splunk.bu (outdated, resolved)
multi-arch-builders/coreos-ppc64le-builder-512e.bu (outdated, resolved)
@marmijo marmijo marked this pull request as ready for review July 28, 2023 20:24
Member

@dustymabe dustymabe left a comment

Some comments.. We're getting really close, but I still need to investigate GIT_SSL_NO_VERIFY too.

Do you mind also rebasing on top of latest main branch?

Comment on lines +76 to +78
# use the container host's network to properly establish
# a connection with the splunk server
Network=host
Member

Even with the comment I still don't quite understand why this is needed. Maybe we can get together so I can understand more?

Member Author

Sure, we can chat more about it. I got this information from the ITPAAS internal documentation and code. The splunk team also advised me to use this option. I tried to run the container without it and it wouldn't connect to the server.

Member

interesting. would definitely be nice to know why - maybe some issues with NAT?

Comment on lines 79 to 81
# use `--ipc=host` to use the container host's IPC namespace
# to properly establish a connection with the splunk server
Member

usually ipc is for processes to communicate with each other on the same host IIUC. Is there message passing being done back and forth between journalctl running in the container and some daemon on the host? or something else going on?

Member Author

I would imagine it has something to do with the splunk process itself needing to communicate with the server. I was advised by the splunk team to use it this way. We can chat more about this 1:1.

Member

Hmm. I tried googling around a bit but couldn't find anything. Can you confirm that if you don't have this setting it doesn't work?

Member Author

Yes, I tried it earlier today, and again just now to make sure. It doesn't work without --ipc=host. I can see a connection with splunk, but no logs are being sent to the server.

Member

ok maybe just add a "we don't know why this is needed but it doesn't work without it" to the comment.

multi-arch-builders/builder-splunk.bu (resolved)
# The --arch=x86_64 works here because we have `qemu-user-static-x86`
# installed on non-x86_64 FCOS streams.
ExecStart=-podman build --pull-always --cache-ttl=480h --arch=x86_64 -t localhost/splunkforwarder:latest ${ITPAAS_SPLUNK_REPO}
ExecStart=-podman build --build-arg project=${DEPLOYMENT_CLIENT} --cache-ttl=480h --from localhost/splunkforwarder:latest -t localhost/splunkforwarder-rhcos ${SPLUNK_SIDECAR_REPO}
Member

feels like we don't really need to obfuscate the value of DEPLOYMENT_CLIENT do we?

Member Author

I'm not sure about this. It is a piece of the internal configuration that's used to establish a connection with the server.

Member

right, but it's just a string. I don't think there's really any need to hide it, but I'm fine with leaving it as it is.

Member Author

marmijo commented Jul 28, 2023

Do you mind also rebasing on top of latest main branch?

Done!

Member

@dustymabe dustymabe left a comment

LGTM

@marmijo marmijo merged commit 4b8d009 into coreos:main Jul 30, 2023
2 checks passed
@marmijo marmijo deleted the add_splunk_config branch September 1, 2023 17:34