add logging and separate muxer for metrics #809

Merged
merged 4 commits into from
Apr 13, 2019

Conversation

@jpeeler jpeeler commented Apr 10, 2019

This is to hopefully help debug a strange issue I've seen where HTTPS metrics are not served when the pod first launches. (If the pod is restarted, however, metrics are served as expected.) These changes may be worth keeping even if they don't fix the problem, especially the mux change.

(BZ #1689836)

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 10, 2019
@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 10, 2019
@jpeeler
Author

jpeeler commented Apr 11, 2019

Ok, definitely leaving in these changes!

time="2019-04-11T00:09:02Z" level=error msg="Metrics (https) serving failed: open /var/run/secrets/serving-cert/tls.crt: no such file or directory"

Will try making the secret non-optional tomorrow; maybe that will cause the pod to fail until the secret is ready. As best I can tell, the secrets end up in the pod over a minute after OLM is started. Extracted from inside the OLM pod:

sh-4.2$ ps aux
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1001          1  0.8  0.4  77864 67128 ?        Ssl  00:09   0:10 /bin/olm -writeStatusName operator-lifecycle-manager -tls-cert /var/run/secrets/serving-cert/tls.crt -tls-key /var/run/secrets/serving-cert/tls.key
1001         23  0.0  0.0  11828  2900 pts/0    Ss   00:27   0:00 /bin/sh
1001         34  0.0  0.0  51752  3528 pts/0    R+   00:28   0:00 ps aux
sh-4.2$ ls -l /var/run/secrets/serving-cert --full-time
total 0
lrwxrwxrwx. 1 root root 14 2019-04-11 00:10:29.460383912 +0000 tls.crt -> ..data/tls.crt
lrwxrwxrwx. 1 root root 14 2019-04-11 00:10:29.460383912 +0000 tls.key -> ..data/tls.key
sh-4.2$ ps -p 1 -o lstart
                 STARTED
Thu Apr 11 00:09:01 2019

This confirms that metrics aren't being served (but health is, as expected):

sh-4.2$ lsof -i -P -n
lsof: no pwd entry for UID 1001
lsof: no pwd entry for UID 1001
COMMAND PID     USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
lsof: no pwd entry for UID 1001
olm       1     1001    7u  IPv6  58483      0t0  TCP *:8080 (LISTEN)
lsof: no pwd entry for UID 1001
olm       1     1001    9u  IPv4 225600      0t0  TCP 10.128.0.9:33000->172.30.0.1:443 (ESTABLISHED)
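
The timestamps above suggest the process started (00:09:01) before the certificate files appeared (00:10:29), so the TLS server most likely tried to load the certificate once, failed, and never retried. A minimal sketch of that failure mode, using only the standard library: the function names and the /nonexistent paths are hypothetical and are not taken from the OLM source.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"net/http"
)

// serveMetricsTLS starts an HTTPS server and returns the error that ends it.
// ListenAndServeTLS loads the key pair at startup, so a missing file makes
// it return immediately instead of serving.
func serveMetricsTLS(addr, certPath, keyPath string) error {
	srv := &http.Server{Addr: addr, Handler: http.NewServeMux()}
	return srv.ListenAndServeTLS(certPath, keyPath)
}

// isMissingFile reports whether err was caused by a file that does not exist.
func isMissingFile(err error) bool {
	return errors.Is(err, fs.ErrNotExist)
}

func main() {
	// Hypothetical paths standing in for a secret that is not mounted yet.
	err := serveMetricsTLS("127.0.0.1:0",
		"/nonexistent/serving-cert/tls.crt",
		"/nonexistent/serving-cert/tls.key")
	fmt.Println("metrics (https) serving failed:", err)
	fmt.Println("caused by a missing file:", isMissingFile(err))
}
```

If the error is only logged and the goroutine exits, nothing listens on the metrics port until the pod restarts, which would match the lsof output above.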

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 11, 2019
@jpeeler
Author

jpeeler commented Apr 11, 2019

/retest
e2e-aws passed... looks like e2e-aws-olm had a test issue

@ecordell
Member

ecordell commented Apr 11, 2019

Can we use fsnotify to know when the files have actually been mounted and start the serving then?
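
One way to picture "start serving only once the files are there" is sketched below. Note that this is a plain polling loop from the standard library, swapped in for fsnotify to keep the example self-contained; it is not what the PR implements. waitForFiles, allExist and demo are hypothetical names, and the temp directory stands in for the mounted secret.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// allExist reports whether every path can be stat'ed. os.Stat follows
// symlinks, so the ..data/ links seen in a secret volume resolve normally.
func allExist(paths []string) bool {
	for _, p := range paths {
		if _, err := os.Stat(p); err != nil {
			return false
		}
	}
	return true
}

// waitForFiles blocks until every path exists or ctx is done.
func waitForFiles(ctx context.Context, interval time.Duration, paths ...string) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if allExist(paths) {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// demo simulates a secret that shows up shortly after the process starts.
func demo() error {
	dir, err := os.MkdirTemp("", "serving-cert")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	cert := filepath.Join(dir, "tls.crt")
	key := filepath.Join(dir, "tls.key")

	go func() {
		time.Sleep(50 * time.Millisecond) // stand-in for the late mount
		_ = os.WriteFile(cert, []byte("placeholder"), 0o600)
		_ = os.WriteFile(key, []byte("placeholder"), 0o600)
	}()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return waitForFiles(ctx, 10*time.Millisecond, cert, key)
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("wait failed:", err)
		return
	}
	fmt.Println("certificate files present; TLS serving could start here")
}
```

A watcher-based version would react sooner and avoid the polling interval, at the cost of an extra dependency; the control flow (block, then serve) would be the same.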

@ecordell
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 11, 2019
Also, don't use default muxer for health either. Although it's pretty
low risk, technically vendored packages can modify the global default
mux.
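
A rough sketch of what the commit message above describes: each server gets its own http.ServeMux instead of sharing http.DefaultServeMux, so a handler registered globally by some vendored package cannot leak onto either port. The /healthz and /metrics paths, the helper names and the placeholder handler are assumptions for illustration, not the PR's actual code.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newHealthMux returns a mux that only knows about the health endpoint.
func newHealthMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// newMetricsMux returns a mux that only serves the given metrics handler.
func newMetricsMux(metrics http.Handler) *http.ServeMux {
	mux := http.NewServeMux()
	mux.Handle("/metrics", metrics)
	return mux
}

// placeholderMetrics stands in for a real metrics handler.
func placeholderMetrics() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "# placeholder metrics output")
	})
}

// statusFor sends a GET for path to h in memory and returns the status code.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	health := newHealthMux()
	metrics := newMetricsMux(placeholderMetrics())

	// Each mux answers its own path and returns 404 for the other one.
	fmt.Println("health mux: ", statusFor(health, "/healthz"), statusFor(health, "/metrics"))
	fmt.Println("metrics mux:", statusFor(metrics, "/metrics"), statusFor(metrics, "/healthz"))
}
```

Each mux would then be handed to its own http.Server (plain HTTP for health, TLS for metrics when certificates are configured).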
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 11, 2019
@jpeeler
Author

jpeeler commented Apr 11, 2019

/test unit
Just saw the metrics test pass, so I think this is ready for LGTM again.

@jpeeler jpeeler changed the title WIP: add logging and separate muxer for metrics add logging and separate muxer for metrics Apr 11, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 11, 2019
@jpeeler
Author

jpeeler commented Apr 11, 2019

/retest

@ecordell
Member

/lgtm
/retest

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 11, 2019
@ecordell
Member

/retest

1 similar comment
@jpeeler
Author

jpeeler commented Apr 12, 2019

/retest

@jpeeler
Author

jpeeler commented Apr 12, 2019

/hold
I just realized I need to slightly modify the metrics test to not fail when metrics are being served over HTTP.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 12, 2019
Ensure that when certificate arguments are passed, metrics are retrieved
over HTTPS. Otherwise, retrieve metrics over HTTP.
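
The rule in the commit message above can be sketched as a small helper that a test might use to decide how to fetch metrics. This is an illustration only: metricsScheme, metricsURL, the host and the port are hypothetical, and rejecting a lone certificate or key is an assumption rather than behavior confirmed by the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// metricsScheme picks the scheme from the certificate arguments:
// both set means HTTPS, neither set means HTTP.
func metricsScheme(certPath, keyPath string) (string, error) {
	switch {
	case certPath != "" && keyPath != "":
		return "https", nil
	case certPath == "" && keyPath == "":
		return "http", nil
	default:
		// Assumption: a lone cert or key is treated as a configuration error.
		return "", errors.New("both a certificate and a key are required for HTTPS")
	}
}

// metricsURL builds the URL a test would request for the given arguments.
func metricsURL(host string, port int, certPath, keyPath string) (string, error) {
	scheme, err := metricsScheme(certPath, keyPath)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%s://%s:%d/metrics", scheme, host, port), nil
}

func main() {
	// Hypothetical host and port, used only to show the two outcomes.
	withCerts, _ := metricsURL("localhost", 8081, "tls.crt", "tls.key")
	withoutCerts, _ := metricsURL("localhost", 8081, "", "")
	fmt.Println(withCerts)
	fmt.Println(withoutCerts)
}
```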
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 12, 2019
@jpeeler
Author

jpeeler commented Apr 12, 2019

/retest

@jpeeler
Author

jpeeler commented Apr 12, 2019

/retest
CI is not behaving today.

@ecordell
Member

/retest

@ecordell
Member

/lgtm
/hold cancel

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Apr 12, 2019
@openshift-ci-robot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ecordell, jpeeler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jpeeler
Author

jpeeler commented Apr 12, 2019

/retest

4 similar comments
@jpeeler
Author

jpeeler commented Apr 12, 2019

/retest

@jpeeler
Author

jpeeler commented Apr 12, 2019

/retest

@ecordell
Member

/retest

@jpeeler
Author

jpeeler commented Apr 13, 2019

/retest

@ecordell
Member

/retest

@openshift-merge-robot openshift-merge-robot merged commit 74d3ec7 into operator-framework:master Apr 13, 2019
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
4 participants