AppCreate diagnostic #16658

Merged

Conversation

sosiouxme
Member

@sosiouxme sosiouxme commented Oct 3, 2017

Implements https://trello.com/c/Zv4hVlyQ/130-diagnostic-to-recreate-app-create-loop-script as a diagnostic.

https://trello.com/c/Zv4hVlyQ/27-3-continue-appcreate-diagnostic-work
https://trello.com/c/aNWlMtMk/61-demo-merge-appcreate-diagnostic
https://trello.com/c/H0jsgQwu/63-3-complete-appcreate-diagnostic-functionality

Status:

  • Create and cleanup project
  • Deploy and cleanup app
  • Wait for app to start
  • Test ability to connect to app via service
  • Test that app responds correctly
  • Test ability to connect via route
  • Write stats/results to file as json

Not yet addressed in this PR (depending on how reviews progress vs development):

  • Run a build to completion
  • Test ability to attach storage
  • Gather and write useful information (logs, status) on failure

Builds on top of #17773 for handling parameters to the diagnostic as well as #17857 which is a refactor on top of that.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress (a PR that should not merge because it is a work in progress) and size/XL (a PR that changes 500-999 lines, ignoring generated files) labels Oct 3, 2017
@openshift-merge-robot openshift-merge-robot added the needs-rebase (a PR that cannot be merged because it has merge conflicts with HEAD) label Oct 3, 2017
@openshift-merge-robot openshift-merge-robot removed the needs-rebase label Oct 3, 2017
@openshift-merge-robot openshift-merge-robot added the needs-rebase label Nov 22, 2017
@openshift-ci-robot openshift-ci-robot added the size/XXL (a PR that changes 1000+ lines, ignoring generated files) label and removed the size/XL label Dec 20, 2017
@openshift-merge-robot openshift-merge-robot removed the needs-rebase label Dec 20, 2017
@sosiouxme sosiouxme force-pushed the 20170928-app-loop-diagnostic branch 4 times, most recently from 14ade1a to c3c7591 January 8, 2018 03:07
@sosiouxme sosiouxme changed the title from "[WIP] app-create loop diagnostic" to "AppCreate diagnostic" Jan 8, 2018
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress label Jan 8, 2018
@sosiouxme
Member Author

@openshift/sig-master I would like to have online ops start trying this out and getting feedback on actual usage with 3.9; for that to happen, I will need some reviews this week.

switch index {
case 0:
	errmsg = fmt.Sprintf("--%s specified that client config should be at %s\n", confFlagName, path)
case len(paths) - 1: // config in ~/.kube
Contributor

this is really fragile...

errmsg := ""
switch index {
case 0:
	errmsg = fmt.Sprintf("--%s specified that client config should be at %s\n", confFlagName, path)
Contributor

if no explicit config file is passed in, does this check even make sense? won't confFlagValue be ""?

case len(paths) - 1: // config in ~/.kube
	// no error message indicated if it is not there... user didn't say it would be
default: // can be multiple paths from the env var in theory; all cases should go here
	if len(os.Getenv(config.OpenShiftConfigPathEnvVar)) != 0 {
Contributor

trying to craft specific messages for specific indices in the loading order seems weird

Member Author

I agree. This is very old code... I couldn't think of anything cleaner at the time; perhaps I can do better now.
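
One possible cleaner shape, sketched here only as an illustration of the concern raised in this thread, is to record where each candidate path came from instead of inferring it from its position in the loading order. The `pathSource`, `candidate`, and `missingMessage` names are hypothetical, as are the `config` flag name, the `KUBECONFIG` variable name, and the wording of the environment-variable message; none of this is taken from the PR.

```go
package main

import "fmt"

// pathSource records how a candidate config path was supplied.
type pathSource int

const (
	fromFlag    pathSource = iota // explicit command-line flag
	fromEnv                       // environment variable
	fromDefault                   // default location such as ~/.kube
)

// candidate pairs a path with its source, so no index arithmetic is needed.
type candidate struct {
	path   string
	source pathSource
	origin string // flag or env var name; empty for the default location
}

// missingMessage returns the message to show when the file at c.path is
// absent, or "" when the user never claimed it would exist.
func missingMessage(c candidate) string {
	switch c.source {
	case fromFlag:
		return fmt.Sprintf("--%s specified that client config should be at %s", c.origin, c.path)
	case fromEnv:
		return fmt.Sprintf("$%s specified that client config should be at %s", c.origin, c.path)
	default:
		return ""
	}
}

func main() {
	candidates := []candidate{
		{path: "/tmp/explicit.kubeconfig", source: fromFlag, origin: "config"},
		{path: "/tmp/env.kubeconfig", source: fromEnv, origin: "KUBECONFIG"},
		{path: "/home/user/.kube/config", source: fromDefault},
	}
	for _, c := range candidates {
		if msg := missingMessage(c); msg != "" {
			fmt.Println(msg)
		}
	}
}
```

With this shape, a flag that was never passed simply produces no `fromFlag` candidate, which also sidesteps the empty-flag-value question raised earlier in the review.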

signal.Notify(sig, os.Interrupt, syscall.SIGTERM)
go func() {
	<-sig
	d.out.Warn("DCluAC001", nil, "Received interrupt; aborting diagnostic")
Contributor

does this actually abort the other gofunc?

Member Author

No, it doesn't. There is no way to abort a goroutine from outside: Go's concurrency is cooperative, so the goroutine has to want to stop. The main one keeps running after an interrupt; we're just not paying attention to it any longer. The only thing I could think of is to set up another channel and check it at various points to see whether an interrupt occurred, but that seemed even messier. Do you have a different suggestion?

This is not new at all BTW, just code moved around.

Contributor

missed it was a move

Member Author

Most of the second commit is stuff moving around


<-done // wait until either finishes
signal.Stop(sig)
d.logResult()
Contributor

@liggitt liggitt Jan 8, 2018

this can run in parallel with the assignment on line 282 or 268-269 if interrupt is received, and crash with a data race error

Member Author

If you hit interrupt, how concerned are you about race conditions crashing diagnostics? I'm not sure a crash is even possible here. Yes, data could conceivably be written into the object while the result is being logged, but would you get anything worse than bad output? Either way, it doesn't seem like an important edge case. I'd be happy to use a better pattern for handling interrupts if one is known. I think the rest of the product... probably just exits?

Contributor

I wasn't sure if an individual diagnostic could get interrupted and the overall process was expected to keep going

Member Author

yes, that's the idea... it moves on to run the next diagnostic if any. but at that point nothing is going to use the result from the previous diagnostic. Part of the reason you want it to keep going is to give it a chance to clean up the resources it created, so there's an actual benefit...

@sosiouxme sosiouxme force-pushed the 20170928-app-loop-diagnostic branch 2 times, most recently from 6d82a45 to a2d934d January 10, 2018 03:28
@@ -105,6 +108,9 @@ func (o *NewProjectOptions) complete(f *clientcmd.Factory, args []string) error
}

func (o *NewProjectOptions) Run(useNodeSelector bool) error {
if o.Output == nil {
Contributor

either move this to complete(), or compute a local var defaulting to os.Stdout... don't generally want to mutate options in the Run() method

Member Author

👍 that makes sense. Hopefully I can rely on complete() being called.

Member Author

Of course, it turns out there are a bunch of tests that simply construct the options directly and don't run complete() on them (indeed, they can't since it's private). So I can either change every test, or use a local var with default like you said.

@sosiouxme
Member Author

/retest

Member

@soltysh soltysh left a comment

/lgtm
/approve

@openshift-ci-robot openshift-ci-robot added the lgtm (a PR that is ready to be merged) and approved (a PR that has been approved by an approver from all required OWNERS files) labels Jan 29, 2018
@soltysh
Member

soltysh commented Jan 29, 2018

/retest

@sosiouxme
Member Author

looks like about 50 things went wrong...
/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@soltysh
Member

soltysh commented Jan 30, 2018

More strange errors...
/retest

@deads2k deads2k removed the lgtm label Jan 30, 2018
@deads2k
Contributor

deads2k commented Jan 30, 2018

un lgtm-ing to calm down the retest bot. Those test integration failures are real and caused by this pull.

@sosiouxme
Member Author

sosiouxme commented Feb 5, 2018

/test origin-verify
(results missing?)
Integration test failures seem related to project creation which I touched here... will work to fix.

@sosiouxme
Member Author

looks like we're back to normal flakes.
/refresh
/retest

@sosiouxme
Member Author

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/16658/test_pull_request_origin_end_to_end/9231/ was #18522 (already resolved)
verify has already passed, but for some reason the bot is still reporting the old failure.
🤷‍♂️
/retest

@sosiouxme
Member Author

sosiouxme commented Feb 12, 2018

updated and rebased last week, @soltysh can I get a re-lgtm now that the merge window is reopened?

the bit that I needed to change was in new-project... using a local variable to default the output writer because tests didn't set it or complete it

@sosiouxme
Member Author

ready for re-review

Member

@soltysh soltysh left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm label Feb 16, 2018
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: soltysh, sosiouxme

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-merge-robot
Contributor

/test all [submit-queue is verifying that this PR is safe to merge]

@openshift-ci-robot

openshift-ci-robot commented Feb 17, 2018

@sosiouxme: The following tests failed, say /retest to rerun them all:

Test name                            Commit    Details   Rerun command
ci/openshift-jenkins/origin/verify   ae25bd4   link      /test origin-verify
ci/openshift-jenkins/gcp             6c78e37   link      /test gcp

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot
Contributor

Automatic merge from submit-queue (batch tested with PRs 16658, 18643).

@openshift-merge-robot openshift-merge-robot merged commit b26e530 into openshift:master Feb 17, 2018
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

9 participants