
Add call to installer-gather.sh on failure #3475

Merged
merged 1 commit into openshift:master on Apr 22, 2019

Conversation

sdodson
Member

@sdodson sdodson commented Apr 15, 2019

Don't remove existing artifact gathering just yet.

Depends on openshift/installer#1561

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 15, 2019
@deads2k
Contributor

deads2k commented Apr 16, 2019

This will be awesome!

@openshift/sig-master

@sdodson
Member Author

sdodson commented Apr 16, 2019

/test pj-rehearse

@wking
Member

wking commented Apr 16, 2019

Why is this WIP?

MCO-e2e-aws:

level=fatal msg="failed to initialize the cluster: Cluster operator marketplace is still updating: timed out waiting for the condition"

ovn-e2e-aws:

level=fatal msg="failed to wait for bootstrap-complete event: timed out waiting for the condition"

It looks like there's an issue with the SSH-agent:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/3475/rehearse-3475-pull-ci-openshift-ovn-kubernetes-master-e2e-aws/1/artifacts/e2e-aws/container-logs/teardown.log.gz | gunzip | grep -3 'Could not open a connection to your authentication agent'
curl --insecure --silent --connect-timeout 5 --retry 3 --cert /tmp/artifacts/installer/tls/journal-gatewayd.crt --key /tmp/artifacts/installer/tls/journal-gatewayd.key --url https://3.84.112.162:19531/entries?_SYSTEMD_UNIT=openshift.service
curl --insecure --silent --connect-timeout 5 --retry 3 --cert /tmp/artifacts/installer/tls/journal-gatewayd.crt --key /tmp/artifacts/installer/tls/journal-gatewayd.key --url https://3.84.112.162:19531/entries?_SYSTEMD_UNIT=kubelet.service
curl --insecure --silent --connect-timeout 5 --retry 3 --cert /tmp/artifacts/installer/tls/journal-gatewayd.crt --key /tmp/artifacts/installer/tls/journal-gatewayd.key --url https://3.84.112.162:19531/entries?_SYSTEMD_UNIT=crio.service
Could not open a connection to your authentication agent.
No user exists for uid 1263960000
unknown user 1263960000
oc --insecure-skip-tls-verify --request-timeout=5s get apiserver.config.openshift.io authentication.config.openshift.io build.config.openshift.io console.config.openshift.io dns.config.openshift.io featuregate.config.openshift.io image.config.openshift.io infrastructure.config.openshift.io ingress.config.openshift.io network.config.openshift.io oauth.config.openshift.io project.config.openshift.io scheduler.config.openshift.io -o json
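OpenSSH bails out very early when getpwuid() cannot resolve the container's random UID, and ssh-add additionally needs a running agent with SSH_AUTH_SOCK exported. A minimal sketch of the workaround that surfaces later in this thread, assuming a group-writable /etc/passwd (the whoami guard is illustrative, not the template's actual code):

# Self-register the arbitrary OpenShift-assigned UID so getpwuid() succeeds,
# then start an agent and load the key before any ssh/scp calls.
if ! whoami &> /dev/null; then
    echo "${USER_NAME:-default}:x:$(id -u):0:${USER_NAME:-default} user:${HOME}:/sbin/nologin" >> /etc/passwd
fi
eval $(ssh-agent)
ssh-add /etc/openshift-installer/ssh-privatekey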

@sdodson
Member Author

sdodson commented Apr 16, 2019

Why is this WIP?

I was trying to figure out if it worked or not. I'll have to look into how to fix the agent.

@sdodson
Member Author

sdodson commented Apr 16, 2019

/test pj-rehearse

@sdodson
Member Author

sdodson commented Apr 16, 2019

Last rehearsal didn't seem to include the teardown from this job. I thought it would only select jobs that would be affected by the change?

@abhinavdahiya
Contributor

/test pj-rehearse

@openshift-ci-robot openshift-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 17, 2019
@sdodson
Member Author

sdodson commented Apr 17, 2019

I'm relatively certain that the items required to find the bootstrap IP are also present in the ansible job, but I at least want proof that this works in general before I add it there as well.

roll 1d100 and hope for the best...

@sdodson
Member Author

sdodson commented Apr 17, 2019

/test pj-rehearse

echo "${USER_NAME:-default}:x:$(id -u):0:${USER_NAME:-default} user:${HOME}:/sbin/nologin" >> /etc/passwd
fi
fi
ssh-add /tmp/cluster/ssh-privatekey
Contributor

the key exists at /etc/openshift-installer/ssh-privatekey

wking added a commit to wking/origin that referenced this pull request Apr 17, 2019
This lets us SSH from the teardown container into the cluster without
hitting:

  $ ssh -A core@$bootstrap_ip
  No user exists for uid 1051910000

OpenSSH has a very early getpwuid call [1] with no provision for
bypassing via HOME or USER environment variables like we did for Bazel
[2].  OpenShift runs with the random UIDs by default [3]:

  By default, all containers that we try and launch within OpenShift,
  are set blocked from “RunAsAny” which basically means that they are
  not allowed to use a root user within the container.  This prevents
  root actions such as chown or chmod from being run and is a sensible
  security precaution as, should a user be able to perform a local
  exploit to break out of the container, then they would not be
  running as root on the underlying container host.  NB what about
  user-namespaces some of you are no doubt asking, these are
  definitely coming but the testing/hardening process is taking a
  while and whilst companies such as Red Hat are working hard in this
  space, there is still a way to go until they are ready for the
  mainstream.

while Kubernetes sorts out user namespacing [4].  Despite the high
UIDs, all users on the cluster are GID 0, so the g+w is sufficient
(vs. a+w), and maybe this mitigates concerns about increased
writability for such an important file.  The main mitigation is that
these are throw-away CI containers, and not long-running production
containers where we are concerned about malicious entry.

A more polished fix has landed in CRI-O [5], but the CI cluster is
stuck on OpenShift 3.11 and Docker at the moment.

Our SSH usecase is for gathering logs in the teardown container [6],
but we've been using the tests image for both tests and teardown since
b16dcfc (images/tests/Dockerfile*: Install gzip for compressing
logs, 2019-02-19, openshift#22094).

[1]: https://github.com/openssh/openssh-portable/blob/V_7_4_P1/ssh.c#L577
[2]: openshift/release#1185
[3]: https://blog.openshift.com/getting-any-docker-image-running-in-your-own-openshift-cluster/
[4]: kubernetes/enhancements#127
[5]: cri-o/cri-o#2022
[6]: openshift/release#3475
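The image-side half of the fix is a one-liner; a sketch assuming openshift/origin#22592 is essentially this (not a verbatim quote of that PR):

# Image build step, run as root: OpenShift's random UIDs all carry GID 0,
# so group-write lets each runtime user append its own passwd entry.
chmod g+w /etc/passwd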
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 17, 2019
@sdodson
Copy link
Member Author

sdodson commented Apr 17, 2019

depends on openshift/origin#22592

@sdodson
Copy link
Member Author

sdodson commented Apr 18, 2019

/test pj-rehearse

eval $(ssh-agent)
ssh-add /etc/openshift-installer/ssh-privatekey
ssh -A -o PreferredAuthentications=publickey -o StrictHostKeyChecking=false -o UserKnownHostsFile=/dev/null core@${bootstrap_ip} /bin/bash -x /usr/local/bin/installer-gather.sh
scp -o PreferredAuthentications=publickey -o StrictHostKeyChecking=false -o UserKnownHostsFile=/dev/null core@${bootstrap_ip}:log-bundle.tar.gz /tmp/artifacts/bootstrap-logs.tar.gz
Contributor

Would /tmp/artifacts/installer/bootstrap-logs.tar.gz be a better choice?

Member Author

Sure, I'll change that real quick.

@abhinavdahiya
Contributor

sh-4.2$ bootstrap_ip=$(python -c \
>                   'import sys, json; d=reduce(lambda x,y: dict(x.items() + y.items()), map(lambda x: x["resources"], json.load(sys.stdin)["modules"])); k="aws_instance.bootstrap"; print d[k]["primary"]["attributes"]["public_ip"] if k in d else ""' \
>                   < /tmp/artifacts/installer/terraform.tfstate
>               )
sh-4.2$ whoami
whoami: cannot find name for user ID 1031310000
sh-4.2$ ls -lah /etc/passwd
-rw-rw-r--. 1 root root 630 Mar  6 02:32 /etc/passwd
sh-4.2$ echo "${USER_NAME:-default}:x:$(id -u):0:${USER_NAME:-default} user:${HOME}:/sbin/nologin" >> /etc/passwd
sh-4.2$ whoami
default
sh-4.2$ eval $(ssh-agent)
Agent pid 36
sh-4.2$ ssh-add /etc/openshift-installer/ssh-privatekey
Identity added: /etc/openshift-installer/ssh-privatekey (/etc/openshift-installer/ssh-privatekey)
sh-4.2$ ssh -A -o PreferredAuthentications=publickey -o StrictHostKeyChecking=false -o UserKnownHostsFile=/dev/null core@${bootstrap_ip} /bin/bash -x /usr/local/bin/installer-gather.sh
Could not create directory '/.ssh'.
Warning: Permanently added '54.209.181.157' (ECDSA) to the list of known hosts.
Gathering bootstrap journals ...
+ ARTIFACTS=/tmp/artifacts
+ echo 'Gathering bootstrap journals ...'
+ mkdir -p /tmp/artifacts/bootstrap/journals
+ for service in bootkube openshift kubelet crio
+ journalctl --boot --no-pager --output=short --unit=bootkube
+ for service in bootkube openshift kubelet crio
+ journalctl --boot --no-pager --output=short --unit=openshift
+ for service in bootkube openshift kubelet crio
+ journalctl --boot --no-pager --output=short --unit=kubelet
+ for service in bootkube openshift kubelet crio
+ journalctl --boot --no-pager --output=short --unit=crio
Gathering bootstrap containers ...
+ echo 'Gathering bootstrap containers ...'
+ mkdir -p /tmp/artifacts/bootstrap/containers
+ sudo crictl ps --all --quiet
+ read -r container
++ grep -oP 'Name: \K(.*)'
++ sudo crictl ps -a --id 76d398bdbefdbe43f513e8adb7c4e84b22000c35f02d662fdf03b0204b7e83ea -v
+ container_name=machine-config-server
+ sudo crictl logs 76d398bdbefdbe43f513e8adb7c4e84b22000c35f02d662fdf03b0204b7e83ea
+ sudo crictl inspect 76d398bdbefdbe43f513e8adb7c4e84b22000c35f02d662fdf03b0204b7e83ea
+ read -r container
++ sudo crictl ps -a --id ef2290a9d7b8899dbb35b8894134f6f1f91f318c66c8a326a172857b5314b6bc -v
++ grep -oP 'Name: \K(.*)'
+ container_name=machine-config-controller
+ sudo crictl logs ef2290a9d7b8899dbb35b8894134f6f1f91f318c66c8a326a172857b5314b6bc
+ sudo crictl inspect ef2290a9d7b8899dbb35b8894134f6f1f91f318c66c8a326a172857b5314b6bc
+ read -r container
+ mkdir -p /tmp/artifacts/bootstrap/pods
+ read -r container
+ sudo podman ps --all --quiet
+ sudo podman logs 192cada536d7
+ sudo podman inspect 192cada536d7
+ read -r container
+ sudo podman logs 8b11a7008838
+ sudo podman inspect 8b11a7008838
+ read -r container
+ sudo podman logs 770e4f9df136
+ sudo podman inspect 770e4f9df136
+ read -r container
+ sudo podman logs 9abfd8340668
+ sudo podman inspect 9abfd8340668
+ read -r container
+ sudo podman logs e98b0ea3dd36
+ sudo podman inspect e98b0ea3dd36
+ read -r container
+ sudo podman logs a4047d8c4229
+ sudo podman inspect a4047d8c4229
+ read -r container
+ sudo podman logs a390c27012c8
+ sudo podman inspect a390c27012c8
+ read -r container
+ sudo podman logs d0b1eae518df
+ sudo podman inspect d0b1eae518df
+ read -r container
+ mkdir -p /tmp/artifacts/control-plane /tmp/artifacts/resources
Gathering cluster resources ...
+ echo 'Gathering cluster resources ...'
+ queue resources/nodes.list oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get nodes -o jsonpath --template '{range .items[*]}{.metadata.name}{"\n"}{end}'
+ local TARGET=/tmp/artifacts/resources/nodes.list
+ shift
++ jobs
++ wc -l
+ local LIVE=0
+ [[ 0 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/masters.list oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get nodes -o jsonpath -l node-role.kubernetes.io/master --template '{range .items[*]}{.metadata.name}{"\n"}{end}'
+ local TARGET=/tmp/artifacts/resources/masters.list
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get nodes -o jsonpath --template '{range .items[*]}{.metadata.name}{"\n"}{end}'
++ wc -l
++ jobs
+ local LIVE=1
+ [[ 1 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/containers oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get pods --all-namespaces --template '{{ range .items }}{{ $name := .metadata.name }}{{ $ns := .metadata.namespace }}{{ range .spec.containers }}-n {{ $ns }} {{ $name }} -c {{ .name }}{{ "\n" }}{{ end }}{{ range .spec.initContainers }}-n {{ $ns }} {{ $name }} -c {{ .name }}{{ "\n" }}{{ end }}{{ end }}'
+ local TARGET=/tmp/artifacts/resources/containers
+ shift
++ wc -l
++ jobs
+ local LIVE=2
+ [[ 2 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/api-pods oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get pods -l apiserver=true --all-namespaces --template '{{ range .items }}-n {{ .metadata.namespace }} {{ .metadata.name }}{{ "\n" }}{{ end }}'
+ local TARGET=/tmp/artifacts/resources/api-pods
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get pods --all-namespaces --template '{{ range .items }}{{ $name := .metadata.name }}{{ $ns := .metadata.namespace}}{{ range .spec.containers }}-n {{ $ns }} {{ $name }} -c {{ .name }}{{ "\n" }}{{ end }}{{ range .spec.initContainers }}-n {{ $ns }} {{ $name }} -c {{ .name }}{{ "\n" }}{{ end }}{{ end }}'
++ jobs
++ wc -l
+ local LIVE=3
+ [[ 3 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/apiservices.json oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get apiservices -o json
+ local TARGET=/tmp/artifacts/resources/apiservices.json
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get nodes -o jsonpath -l node-role.kubernetes.io/master --template '{range .items[*]}{.metadata.name}{"\n"}{end}'
++ wc -l
++ jobs
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get pods -l apiserver=true --all-namespaces --template '{{ range .items }}-n {{ .metadata.namespace }} {{ .metadata.name }}{{ "\n" }}{{ end }}'
+ local LIVE=4
+ [[ 4 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/clusteroperators.json oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get clusteroperators -o json
+ local TARGET=/tmp/artifacts/resources/clusteroperators.json
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get apiservices -o json
++ wc -l
++ jobs
+ local LIVE=5
+ [[ 5 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/clusterversion.json oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get clusterversion -o json
+ local TARGET=/tmp/artifacts/resources/clusterversion.json
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get clusteroperators -o json
++ wc -l
++ jobs
+ local LIVE=6
+ [[ 6 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/configmaps.json oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get configmaps --all-namespaces -o json
+ local TARGET=/tmp/artifacts/resources/configmaps.json
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get clusterversion -o json
++ wc -l
++ jobs
+ local LIVE=7
+ [[ 7 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/csr.json oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get csr -o json
+ local TARGET=/tmp/artifacts/resources/csr.json
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get configmaps --all-namespaces -o json
++ wc -l
++ jobs
+ local LIVE=8
+ [[ 8 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/endpoints.json oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get endpoints --all-namespaces -o json
+ local TARGET=/tmp/artifacts/resources/endpoints.json
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get csr -o json
++ jobs
++ wc -l
+ local LIVE=9
+ [[ 9 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/events.json oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get events --all-namespaces -o json
+ local TARGET=/tmp/artifacts/resources/events.json
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get endpoints --all-namespaces -o json
++ jobs
++ wc -l
+ local LIVE=10
+ [[ 10 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/kubeapiserver.json oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get kubeapiserver -o json
+ local TARGET=/tmp/artifacts/resources/kubeapiserver.json
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get events --all-namespaces -o json
++ jobs
++ wc -l
+ local LIVE=11
+ [[ 11 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/kubecontrollermanager.json oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get kubecontrollermanager -o json
+ local TARGET=/tmp/artifacts/resources/kubecontrollermanager.json
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get kubeapiserver -o json
++ jobs
++ wc -l
+ local LIVE=12
+ [[ 12 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/machineconfigpools.json oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get machineconfigpools -o json
+ local TARGET=/tmp/artifacts/resources/machineconfigpools.json
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get kubecontrollermanager -o json
++ wc -l
++ jobs
+ local LIVE=13
+ [[ 13 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/machineconfigs.json oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get machineconfigs -o json
+ local TARGET=/tmp/artifacts/resources/machineconfigs.json
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get machineconfigpools -o json
++ jobs
++ wc -l
+ local LIVE=14
+ [[ 14 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/namespaces.json oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get namespaces -o json
+ local TARGET=/tmp/artifacts/resources/namespaces.json
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get machineconfigs -o json
++ jobs
++ wc -l
+ local LIVE=15
+ [[ 15 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/nodes.json oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get nodes -o json
+ local TARGET=/tmp/artifacts/resources/nodes.json
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get namespaces -o json
++ jobs
++ wc -l
+ local LIVE=16
+ [[ 16 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/openshiftapiserver.json oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get openshiftapiserver -o json
+ local TARGET=/tmp/artifacts/resources/openshiftapiserver.json
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get nodes -o json
++ wc -l
++ jobs
+ local LIVE=17
+ [[ 17 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/pods.json oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get pods --all-namespaces -o json
+ local TARGET=/tmp/artifacts/resources/pods.json
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get openshiftapiserver -o json
++ wc -l
++ jobs
+ local LIVE=18
+ [[ 18 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/rolebindings.json oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get rolebindings --all-namespaces -o json
+ local TARGET=/tmp/artifacts/resources/rolebindings.json
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get pods --all-namespaces -o json
++ jobs
++ wc -l
+ local LIVE=19
+ [[ 19 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/roles.json oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get roles --all-namespaces -o json
+ local TARGET=/tmp/artifacts/resources/roles.json
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get rolebindings --all-namespaces -o json
++ jobs
++ wc -l
+ local LIVE=20
+ [[ 20 -ge 45 ]]
+ [[ -n '' ]]
+ queue resources/services.json oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get services --all-namespaces -o json
+ local TARGET=/tmp/artifacts/resources/services.json
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get roles --all-namespaces -o json
++ jobs
++ wc -l
+ local LIVE=21
+ [[ 21 -ge 45 ]]
+ [[ -n '' ]]
+ FILTER=gzip
+ queue resources/openapi.json.gz oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get --raw /openapi/v2
+ local TARGET=/tmp/artifacts/resources/openapi.json.gz
+ shift
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get services --all-namespaces -o json
++ jobs
++ wc -l
Waiting for logs ...
+ local LIVE=22
+ [[ 22 -ge 45 ]]
+ [[ -n gzip ]]
+ echo 'Waiting for logs ...'
+ wait
+ gzip
+ sudo oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get --raw /openapi/v2
error: the server doesn't have a resource type "pods"
error: the server doesn't have a resource type "nodes"
error: the server doesn't have a resource type "pods"
error: the server doesn't have a resource type "nodes"
error: the server doesn't have a resource type "clusteroperators"
error: the server doesn't have a resource type "apiservices"
error: the server doesn't have a resource type "configmaps"
error: the server doesn't have a resource type "clusterversion"
error: the server doesn't have a resource type "csr"
error: the server doesn't have a resource type "endpoints"
error: the server doesn't have a resource type "kubecontrollermanager"
error: the server doesn't have a resource type "kubeapiserver"
error: the server doesn't have a resource type "events"
error: the server doesn't have a resource type "machineconfigpools"
error: the server doesn't have a resource type "nodes"
error: the server doesn't have a resource type "machineconfigs"
error: the server doesn't have a resource type "namespaces"
error: the server doesn't have a resource type "openshiftapiserver"
error: the server doesn't have a resource type "pods"
error: the server doesn't have a resource type "roles"
Error from server (NotFound): the server could not find the requested resource
error: the server doesn't have a resource type "services"
error: the server doesn't have a resource type "rolebindings"
Gather remote logs
+ echo 'Gather remote logs'
+ MASTERS=()
+ export MASTERS
+ '[' 0 -ne 0 ']'
++ stat --printf=%s /tmp/artifacts/resources/masters.list
+ '[' 0 -ne 0 ']'
++ sudo oc --config=/opt/openshift/auth/kubeconfig whoami --show-server
++ grep -oP 'api.\K([a-z\.]*)'
+ DOMAIN=ci
+ mapfile -t MASTERS
++ dig -t SRV _etcd-server-ssl._tcp.ci +short
++ cut -f 4 -d ' '
++ sed 's/.$//'
/usr/local/bin/installer-gather.sh: line 92: $(dig -t SRV "_etcd-server-ssl._tcp.${DOMAIN}" +short | cut -f 4 -d ' ' | sed 's/.$//'): No such file or directory
+ tar cz -C /tmp/artifacts .
Log bundle written to ~/log-bundle.tar.gz
+ echo 'Log bundle written to ~/log-bundle.tar.gz'
sh-4.2$ scp -o PreferredAuthentications=publickey -o StrictHostKeyChecking=false -o UserKnownHostsFile=/dev/null core@${bootstrap_ip}:log-bundle.tar.gz /tmp/artifacts/bootstrap-logs.tar.gz
Could not create directory '/.ssh'.
Warning: Permanently added '54.209.181.157' (ECDSA) to the list of known hosts.
log-bundle.tar.gz                                                                                                                                       100%   26KB 938.4KB/s   00:00
sh-4.2$ ls -lah /tmp/artifacts/
total 28K
drwxrwsrwx. 3 root    1031310000  52 Apr 18 21:09 .
drwxrwxrwt. 1 root    root        61 Apr 18 21:08 ..
-rw-r--r--. 1 default 1031310000 27K Apr 18 21:09 bootstrap-logs.tar.gz
drwxr-sr-x. 4 default 1031310000 199 Apr 18 21:06 installer

Looks like this is working. :yay:
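A side note on the xtrace above: the queue helper in installer-gather.sh throttles the background oc calls, capping concurrency at 45 jobs and optionally piping output through FILTER (gzip for the openapi dump). Reconstructed from the trace, roughly (a sketch; the real script may differ in detail):

queue() {
    local TARGET="${ARTIFACTS}/${1}"   # artifact file, e.g. resources/nodes.list
    shift
    local LIVE="$(jobs | wc -l)"
    while [[ "${LIVE}" -ge 45 ]]; do   # wait until a background slot frees up
        sleep 1
        LIVE="$(jobs | wc -l)"
    done
    if [[ -n "${FILTER}" ]]; then
        sudo "${@}" | "${FILTER}" >"${TARGET}" &
    else
        sudo "${@}" >"${TARGET}" &
    fi
}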

@sdodson sdodson changed the title [WIP] Add call to installer-gather.sh on failure Add call to installer-gather.sh on failure Apr 18, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 18, 2019
@@ -385,7 +385,15 @@ objects:
--key /tmp/artifacts/installer/tls/journal-gatewayd.key \
--url "https://${bootstrap_ip}:19531/entries?_SYSTEMD_UNIT=${service}.service"
done
fi
Member

I think you need to keep a closing fi on if [ -n "${bootstrap_ip}" ], although it should live after the block you add below, but before the else for if [ -f /tmp/artifacts/installer/terraform.tfstate ].
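In other words, the intended nesting is roughly this (a sketch with placeholder comments, not the actual template):

if [ -f /tmp/artifacts/installer/terraform.tfstate ]; then
    # ... bootstrap_ip extracted from terraform.tfstate ...
    if [ -n "${bootstrap_ip}" ]; then
        # existing journal-gatewayd curl loop
        # new installer-gather.sh ssh/scp block from this PR
        :
    fi  # keep this closing fi: after the new block, before the else
else
    # fallback gathering when there is no terraform state
    :
fi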

eval $(ssh-agent)
ssh-add /etc/openshift-installer/ssh-privatekey
ssh -A -o PreferredAuthentications=publickey -o StrictHostKeyChecking=false -o UserKnownHostsFile=/dev/null core@${bootstrap_ip} /bin/bash -x /usr/local/bin/installer-gather.sh
scp -o PreferredAuthentications=publickey -o StrictHostKeyChecking=false -o UserKnownHostsFile=/dev/null core@${bootstrap_ip}:log-bundle.tar.gz /tmp/artifacts/installer/bootstrap-logs.tar.gz fi
Member

ah, also we want to drop the fi here.

eval $(ssh-agent)
ssh-add /etc/openshift-installer/ssh-privatekey
ssh -A -o PreferredAuthentications=publickey -o StrictHostKeyChecking=false -o UserKnownHostsFile=/dev/null core@${bootstrap_ip} /bin/bash -x /usr/local/bin/installer-gather.sh
scp -o PreferredAuthentications=publickey -o StrictHostKeyChecking=false -o UserKnownHostsFile=/dev/null core@${bootstrap_ip}:log-bundle.tar.gz /tmp/artifacts/installer/bootstrap-logs.tar.gz fi
Member

Still have a trailing fi here.

Don't remove existing artifact gathering just yet.

Depends on openshift/installer#1561
@openshift-ci-robot
Contributor

@sdodson: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/rehearse/openshift/ovn-kubernetes/master/e2e-aws 6ccb707e876cd90933c0d5309d55c651c5348db9 link /test pj-rehearse
ci/rehearse/openshift/machine-config-operator/master/e2e-aws 6ccb707e876cd90933c0d5309d55c651c5348db9 link /test pj-rehearse
ci/rehearse/openshift/cluster-kube-scheduler-operator/master/e2e-aws b1a9e39c50ceb146474282757adc4a4f800505b6 link /test pj-rehearse
ci/rehearse/openshift/cluster-kube-apiserver-operator/master/e2e-aws-operator e328d85ed41ca4623b8060f8b74a4227ebf3a6af link /test pj-rehearse
ci/rehearse/openshift/jenkins-client-plugin/master/e2e-aws-jenkins e328d85ed41ca4623b8060f8b74a4227ebf3a6af link /test pj-rehearse
ci/rehearse/openshift/cluster-api-actuator-pkg/master/e2e-aws-operator e328d85ed41ca4623b8060f8b74a4227ebf3a6af link /test pj-rehearse
ci/rehearse/openshift/kubernetes-autoscaler/master/e2e-aws-operator 980e92f85fa94f7f8499ed8b0e0203bdc76970fa link /test pj-rehearse
ci/rehearse/openshift/ansible-service-broker/master/operator-molecule-e2e 3661841 link /test pj-rehearse
ci/rehearse/openshift/installer/master/e2e-openstack 3661841 link /test pj-rehearse
ci/prow/pj-rehearse 3661841 link /test pj-rehearse

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wking
Member

wking commented Apr 22, 2019

/lgtm

Dunno what's up with CI, though.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 22, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sdodson, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 22, 2019
@openshift-merge-robot openshift-merge-robot merged commit b9682c8 into openshift:master Apr 22, 2019
@openshift-ci-robot
Contributor

@sdodson: Updated the following 8 configmaps:

  • prow-job-cluster-launch-installer-openstack-e2e configmap in namespace ci using the following files:
    • key cluster-launch-installer-openstack-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-openstack-e2e.yaml
  • prow-job-cluster-launch-installer-openstack-e2e configmap in namespace ci-stg using the following files:
    • key cluster-launch-installer-openstack-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-openstack-e2e.yaml
  • prow-job-cluster-launch-installer-src configmap in namespace ci using the following files:
    • key cluster-launch-installer-src.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-src.yaml
  • prow-job-cluster-launch-installer-src configmap in namespace ci-stg using the following files:
    • key cluster-launch-installer-src.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-src.yaml
  • prow-job-cluster-launch-e2e-40 configmap in namespace ci using the following files:
    • key cluster-launch-e2e-40.yaml using file ci-operator/templates/openshift/openshift-ansible/cluster-launch-e2e-40.yaml
  • prow-job-cluster-launch-e2e-40 configmap in namespace ci-stg using the following files:
    • key cluster-launch-e2e-40.yaml using file ci-operator/templates/openshift/openshift-ansible/cluster-launch-e2e-40.yaml
  • prow-job-cluster-launch-installer-e2e configmap in namespace ci using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml
  • prow-job-cluster-launch-installer-e2e configmap in namespace ci-stg using the following files:
    • key cluster-launch-installer-e2e.yaml using file ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml

In response to this:

Don't remove existing artifact gathering just yet.

Depends on openshift/installer#1561

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

bertinatto pushed a commit to bertinatto/origin that referenced this pull request Apr 24, 2019, with the same commit message as above.
abhinavdahiya added a commit to abhinavdahiya/release that referenced this pull request Apr 25, 2019
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
sig/master
size/M: Denotes a PR that changes 30-99 lines, ignoring generated files.