Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

teamcity: failed test: TestDockerCLI/test_demo_partitioning because of server unreachable #40734

Closed
cockroach-teamcity opened this issue Sep 12, 2019 · 16 comments · Fixed by #41029
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Sep 12, 2019

The following tests appear to have failed on master (acceptance): TestDockerCLI/test_demo_partitioning.tcl, TestDockerCLI

You may want to check for open issues.

#1484411:

root@127.134.203.135:42943/movr> error when contacting licensing server: 
Get https://register.cockroachdb.com/api/license?clusterid=6b487ccb-f22e-4290-bc18-d2232869aa0d&kind=demo&version=v19.2: 
net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)



Please assign, take a look and update the issue accordingly.

@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Sep 12, 2019
@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone Sep 12, 2019
@jordanlewis
Copy link
Member

Duplicate of #39720.

@jordanlewis jordanlewis reopened this Sep 12, 2019
@jordanlewis jordanlewis changed the title teamcity: failed test: TestDockerCLI teamcity: failed test: TestDockerCLI/test_demo_partitioning Sep 12, 2019
@jordanlewis
Copy link
Member

@rohany PTAL

@rohany
Copy link
Contributor

rohany commented Sep 13, 2019

I can't really replicate the error, and I'm unsure of how to debug this without any of the logs. However, I had some concerns with testing this while writing the tests due to the partitioning being asynchronous + expect's inability to write a test like "while the output is not like something, do something else". Maybe the solution is to just disable this test for now?

Or instead, I could test the Partitioning step of the MovR workload independently of the demo output. The main issue with the demo tests is that the go unittest versions of cockroach demo tests don't use the CCL binary, which makes us have to resort to these tcl/expect tests which I'm not as familiar with.

@rohany
Copy link
Contributor

rohany commented Sep 13, 2019

Though we do want some way to verify that demo really does partition its data, so idk.

@rohany
Copy link
Contributor

rohany commented Sep 13, 2019

Hallelujah! Though the logs show a worse problem.

root@127.134.203.135:42943/movr> error when contacting licensing server: Get https://register.cockroachdb.com/api/license?clusterid=6b487ccb-f22e-4290-bc18-d2232869aa0d&kind=demo&version=v19.2: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

It doesn't seem like we can reliably contact the licensing server from within this test...

@knz
Copy link
Contributor

knz commented Sep 17, 2019

Generally TC tests can occasionally fail while attempting to connect to external resources. We used to encounter this often back when we were letting the build update yarn dependencies.

IMHO the correct way forward is to tolerate this failure somehow. I can see several ways with various trade-offs:

  • skip the test and mark it as successful. If we do this and the server does become permanently unavailable, we will be blind to this.
  • skip the test and mark it as skipped (this is a different status than successful which TC can recognize to annotate the result). If we do this we will be more able to detect server failures, but the work needed to mark a CLI test as skipped is a bit more complex.
  • change the way we handle failures in TC to only file issues after a test has failed consistently for a day or more.

@jordanlewis it would be nice to step back and see "out of the box" here. The last time we were encountering this problem was when the URL checker was verifying doc links, and TC was recording failures due to occasional outages of the docs site. Back then we decided to side-step by simply not running that test on CI any more (it's now nightly).

@knz knz changed the title teamcity: failed test: TestDockerCLI/test_demo_partitioning teamcity: failed test: TestDockerCLI/test_demo_partitioning because of server unreachable Sep 17, 2019
@rohany
Copy link
Contributor

rohany commented Sep 18, 2019

Are we able to assign failure or success not based on the output within the test? If so, i think skipping the test (if it's a license acquisition error) and marking it successful, and adding a nightly test to check the availability of the licensing server makes sense.

@knz
Copy link
Contributor

knz commented Sep 18, 2019

I have an embryo of an idea please help me think through this:

  • I think there are two separate features here, one is license acquisition from a server and the other is the availability of some data layout and other things inside the demo session
  • today the demo code combines the two and it's hard to test them separately. In particular as you've found the test for the 2nd part fails if the test for the 1st part fails too
  • what if we could decouple the two things?
  • an idea: make it possible for cockroach demo to take the license string on the command line, with a new flag
  • you'd have all the feature tests in CI runs for the demo session use a pre-defined license string provided by the test runner (command line in Example_cli, from outside the session in TestDockerCLI) -- the specific string is not relevant to the test. This would run on every CI run, but would not be dependent on the license server
  • once per day (or per week) we'd have a nightly run that uses the license server and verifies that a demo session can start with it

WDYT?

@rohany
Copy link
Contributor

rohany commented Sep 18, 2019

I agree -- this functionality should be decoupled for better testing. I imagine in the future there could be more things dependent on having a license, so this will save us some time in the future. We will have to take the license string and the cluster organization via the command line probably.

@knz
Copy link
Contributor

knz commented Sep 18, 2019

or environment variables.

@rohany
Copy link
Contributor

rohany commented Sep 18, 2019

I am leaning towards a CLA, rather than environment variables -- if we expose some environment variables that automatically set the license etc, users might expect these to apply to not just demo. Unless this is something we'd want to do

@knz
Copy link
Contributor

knz commented Sep 18, 2019 via email

@bdarnell
Copy link
Contributor

I don't really like environment variables, but I do think it makes sense to be able to pass a license on the command line to demo/init/start-single-node (but not start - this way we don't have to worry about what happens when its set differently on different nodes). This is a special case of being able to set arbitrary cluster settings via those same commands.

@tbg
Copy link
Member

tbg commented Sep 23, 2019

@knz
Copy link
Contributor

knz commented Sep 24, 2019

I have investigated this and I found a few bugs in the code, beyond the flaky test.
I will send a PR shortly.

@craig craig bot closed this as completed in #41029 Sep 24, 2019
@craig craig bot closed this as completed in 4242276 Sep 24, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants