Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concurrency TCK failures with latest glassfish nightly build #24059

Closed
gurunrao opened this issue Aug 1, 2022 · 34 comments · Fixed by jakartaee/platform-tck#1111
Closed

Concurrency TCK failures with latest glassfish nightly build #24059

gurunrao opened this issue Aug 1, 2022 · 34 comments · Fixed by jakartaee/platform-tck#1111
Labels
ee10-tck EE 10 TCK failures tck-web-profile

Comments

@gurunrao
Copy link
Contributor

gurunrao commented Aug 1, 2022

Following Concurrency TCK failures with latest Glassfish 7 nightly Web Profile build:

  • ee.jakarta.tck.concurrent.spec.ManagedExecutorService.resourcedef.ManagedExecutorDefinitionWebTests.testCopyCompletableFutureEJB

CI run - https://ci.eclipse.org/jakartaee-tck/job/10/job/jakarta-concurrency-tck-glassfish-run/95/

@gurunrao gurunrao added the ee10-tck EE 10 TCK failures label Aug 1, 2022
@dmatej
Copy link
Contributor

dmatej commented Aug 1, 2022

With commit 71a6150 it passed locally. Can you rerun the test?

I am looking at the AfterTypeDiscoveryMassOperationsTest now ...
The test changed recently ... https://github.com/jakartaee/cdi-tck/pull/401/files ... on the line 128 is not an assertEquals now, but previous version has an assertTrue there.

https://ci.eclipse.org/jakartaee-tck/job/10/job/jakarta-cdi-tck-glassfish-run-cert/org.glassfish$glassfish.cdi-tck/19/testReport/junit/org.jboss.cdi.tck.tests.full.extensions.lifecycle.atd/AfterTypeDiscoveryTest/testFinalAlternatives_null__1_/

===============================================
CDI TCK
Total tests run: 1831, Passes: 1831, Failures: 0, Skips: 0
===============================================

@erdlet
Copy link
Member

erdlet commented Aug 1, 2022

Seems the build fetches an old CDI TCK:
[INFO] Downloading from jakarta-snapshots: https://jakarta.oss.sonatype.org/content/repositories/staging/jakarta/enterprise/cdi-tck-api/4.0.5/cdi-tck-api-4.0.5.pom see here

Those errors were fixed in CDI TCK 4.0.7, which was released after the PR stated by @dmatej.

@gurunrao
Copy link
Contributor Author

gurunrao commented Aug 1, 2022

Thanks @erdlet, @dmatej.
I have updated TCK runner to use 4.0.7 - https://github.com/eclipse-ee4j/jakartaee-tck/pull/1104/files

@gurunrao
Copy link
Contributor Author

gurunrao commented Aug 1, 2022

@erdlet @scottmarlow - should the above CDI TCK test update be sent to Specification Committee for approval, since this is an change in behavior of the tests?
As per current process, test behavior is not allowed to change post review, tests can be excluded based on challenge.
CC @dmatej @arjantijms

@erdlet
Copy link
Member

erdlet commented Aug 1, 2022

I'm no expert in CDI TCK, so as far as I understood was this change only to improve the test assumption, not the expected result.

@dmatej
Copy link
Contributor

dmatej commented Aug 1, 2022

It seems to me that the test was just stabilized to allow to use also other extensions while previous expected just three which was wrong assumption. It is ok now.

@scottmarlow
Copy link
Member

Following Concurrency TCK failures with latest Glassfish 7 nightly Web Profile build:

* ee.jakarta.tck.concurrent.spec.ManagedExecutorService.resourcedef.ManagedExecutorDefinitionWebTests.testCopyCompletableFutureEJB

CI run - https://ci.eclipse.org/jakartaee-tck/job/10/job/jakarta-concurrency-tck-glassfish-run/95/

Some Concurrency TCK problems with ManagedExecutorDefinitionWebTests were mentioned started in comment jakartaee/concurrency#260 (review) that seem to continue up to comment jakartaee/concurrency#260 (comment)

@scottmarlow
Copy link
Member

Does anyone know why GlassFish is failing the Concurrency test on Web Profile yet?

@arjantijms
Copy link
Contributor

Does anyone know why GlassFish is failing the Concurrency test on Web Profile yet?

A new test was added that for some reason fails. It fails on both the full profile and the web profile.

@scottmarlow
Copy link
Member

Does anyone know why GlassFish is failing the Concurrency test on Web Profile yet?

A new test was added that for some reason fails. It fails on both the full profile and the web profile.

According to https://github.com/eclipse-ee4j/jakartaee-tck/wiki/Jakarta-EE-10.0-TCK-results, the Concurrency TCK test only fails on Web Profile but not Full Platform mode.

So, the only thing that we know is the ee.jakarta.tck.concurrent.spec.ManagedExecutorService.resourcedef.ManagedExecutorDefinitionWebTests.testCopyCompletableFutureEJB is failing for unknown reasons.

We are also testing against the Staged Concurrency TCK https://download.eclipse.org/ee4j/cu/jakartaee10/staged/eftl/concurrency-tck-3.0.2.zip last built on 2022-07-30.

Once thing that we haven't done is create a Concurrency TCK issue (or challenge if we learn of something to challenge). Any volunteers to create a https://github.com/jakartaee/concurrency/issues for this failure? Beyond the test name, I think we also need to include the test output, some of which I pasted into https://gist.github.com/scottmarlow/045911483dfca81173ce82afcf04a362 (including some of tests before and after in case that is relevant).

I think that there has been discussions with the Concurrency team but I would like a tracking Concurrency issue so we can get more direct feedback on what the test failure means happened exactly. It would be helpful to have an explanation of what the test code at https://github.com/jakartaee/concurrency/blob/master/tck/src/main/java/ee/jakarta/tck/concurrent/spec/ManagedExecutorService/resourcedef/ManagedExecutorDefinitionOnEJBServlet.java#L144 is validating exactly which might be a clue as to what needs to change (in the test or GlassFish).

@hs536
Copy link
Contributor

hs536 commented Aug 9, 2022

A problem of the same test class is discussed below and suggested to be fixed. Isn't this a related problem?

jakartaee/concurrency#263

@scottmarlow
Copy link
Member

@hs536 good question, do you know if the jakartaee/concurrency#264 fix would help GlassFish pass the Concurrency TCK on Web Profile?

@starksm64
Copy link
Member

@arjantijms Can you validate the latest Concurrency TCK with jakartaee/concurrency#264 ?

@starksm64
Copy link
Member

The current CCRs are already out of date, so this can be merged:

current:
shasum -a 256 /tmp/jakarta-jakartaeetck-10.0.0.zip 
cfe4db9cf5f2bf5cfff568377f7a799c3dc4624e882585a0c659d781764183dd  /tmp/jakarta-jakartaeetck-10.0.0.zip

current CCR sha256: 62a9cbbf46315ef4b5487fdbdf6e628d272fa3ace9148c5716ba57ea4415df19

@scottmarlow
Copy link
Member

I started the Platform TCK build via https://ci.eclipse.org/jakartaee-tck/job/10/job/eftl-jakartaeetck-build-100/82/ for jakartaee-tck/pull/1109

@scottmarlow
Copy link
Member

GlassFish 7 Web Profile still fails with the updated staged https://download.eclipse.org/ee4j/cu/jakartaee10/staged/eftl/concurrency-tck-3.0.2.zip
TCK via https://ci.eclipse.org/jakartaee-tck/job/10/job/jakarta-concurrency-tck-glassfish-run/102 which failed the ee.jakarta.tck.concurrent.spec.ManagedExecutorService.resourcedef.ManagedExecutorDefinitionWebTests.testCopyCompletableFutureEJB test.

FYI, the ^ test run was done with https://download.eclipse.org/ee4j/cu/jakartaee10/staged/eftl/concurrency-tck-3.0.2.zip that has sha256sum 22728d729f620d6a85ae903e7d1184e0a7508a4328491b785f1b4f3d7215ca93.

Contents of concurrency-tck-3.0.2.info:

22728d729f620d6a85ae903e7d1184e0a7508a4328491b785f1b4f3d7215ca93  concurrency-tck-3.0.2-dist.zip
-rw-r--r--. 1 jenkins 1001550000 423031 Aug  9 15:56 concurrency-tck-3.0.2-dist.zip

@scottmarlow
Copy link
Member

https://ci.eclipse.org/jakartaee-tck/job/10/job/jakarta-concurrency-tck-glassfish-run/102/artifact/glassfish-runner/concurrency-tck/target/glassfish7/glassfish/domains/domain1/logs/server.log contains an interesting warning on a related test:

2022-08-09T16:28:44.055266Z] [GlassFish 7.0] [WARNING] [] [ee.jakarta.tck.concurrent.framework.TestServlet] [tid: _ThreadID=28 _ThreadName=http-listener-1(1)] [levelValue: 900] [[
  Caught exception attempting to call test method testCopyCompletableFutureEJB on servlet ee.jakarta.tck.concurrent.spec.ManagedExecutorService.resourcedef.ManagedExecutorDefinitionOnEJBServlet
java.lang.AssertionError: StringContext must be propagated and Application context and IntContext must be left unchanged per ManagedExecutorDefinition and ContextServiceDefinition config. expected [StringContext propagated;IntContext unchanged] but found [StringContext propagated;IntContext incorrect:271]
	at org.testng.Assert.fail(Assert.java:99)
	at org.testng.Assert.failNotEquals(Assert.java:1037)
	at org.testng.Assert.assertEqualsImpl(Assert.java:140)
	at org.testng.Assert.assertEquals(Assert.java:122)
	at org.testng.Assert.assertEquals(Assert.java:629)
	at ee.jakarta.tck.concurrent.spec.ManagedExecutorService.resourcedef.ManagedExecutorDefinitionOnEJBServlet.testCopyCompletableFutureEJB(ManagedExecutorDefinitionOnEJBServlet.java:144)

Could the ^ warning leave GlassFish in a bad state?

Also @jamezp noticed the following warning which would be good to understand better:

[2022-08-09T16:28:44.053280Z] [GlassFish 7.0] [SEVERE] [] [jakarta.enterprise.concurrent] [tid: _ThreadID=389 _ThreadName=java:module/concurrent/ExecutorB-managedThreadFactory-Thread-1] [levelValue: 1000] [[
  Thread context provider 'IntContext' is not registered in WEB-APP/services/jakarta.enterprise.concurrent.spi.ThreadContextProvider and will be ignored!]]

@jamezp
Copy link

jamezp commented Aug 9, 2022

My gut tells me the failure has something to do with that error log. The IntContext is what the test is using for checking the assertion. Having it not registered seems like a possible issue.

@scottmarlow
Copy link
Member

Hmm, I'll try to grep for the error.

@scottmarlow
Copy link
Member

GlassFish class appserver/extras/embedded/all/target/classes/org/glassfish/concurrent/runtime/ContextSetupProviderImpl.class seems to contain the "Thread context provider.*is not registered in" message, but I am not sure where GlassFish is getting the org.glassfish.concurrent.runtime.ContextSetupProviderImpl class from, that likely needs an update.

@gurunrao gurunrao changed the title CDI and Concurrency TCK failures with latest glassfish nightly build Concurrency TCK failures with latest glassfish nightly build Aug 10, 2022
@gurunrao
Copy link
Contributor Author

gurunrao commented Aug 10, 2022

@arjan - A challenge for test based on glassfish logs is needed? or fix at glassfish is the final solution?
CC @scottmarlow
Please note: fix jakartaee/concurrency#264 is not helping GF to pass CU TCK.

@dmatej
Copy link
Contributor

dmatej commented Aug 10, 2022

I am not sure where GlassFish is getting the org.glassfish.concurrent.runtime.ContextSetupProviderImpl class from, that likely needs an update.

https://github.com/eclipse-ee4j/glassfish/blob/master/appserver/concurrent/concurrent-impl/src/main/java/org/glassfish/concurrent/runtime/ContextSetupProviderImpl.java

The question is what update ... there is also some commented out code ...
https://github.com/eclipse-ee4j/glassfish/blob/master/appserver/concurrent/concurrent-impl/src/main/java/org/glassfish/concurrent/runtime/ContextSetupProviderImpl.java#L199

@OndroMih
Copy link
Contributor

OndroMih commented Aug 11, 2022

I ran the TCK locally and here are my findings:

  • When I only run the test method testCopyCompletableFutureEJB of ee.jakarta.tck.concurrent.spec.ManagedExecutorService.resourcedef.ManagedExecutorDefinitionWebTests, the test passes
  • When I run the full Concurrency Web TCK, or when I run all the test methods in the ManagedExecutorDefinitionWebTests class, the test fails

The test also fails when I only execute the test method testCopyCompletableFuture before it.

Therefore GlassFish is "polluted" by the test testCopyCompletableFuture, which causes the test testCopyCompletableFutureEJB to fail.

@scottmarlow
Copy link
Member

I ran the TCK locally and here are my findings:

* When I only run the test method `testCopyCompletableFutureEJB` of `ee.jakarta.tck.concurrent.spec.ManagedExecutorService.resourcedef.ManagedExecutorDefinitionWebTests`, the test passes

* When I run the full Concurrency Web TCK, or when I run all the test methods in the `ManagedExecutorDefinitionWebTests` class, the test fails

The test also fails when I only execute the test method testCopyCompletableFuture before it.

Therefore GlassFish is "polluted" by the test testCopyCompletableFuture, which causes the test testCopyCompletableFutureEJB to fail.

So, GlassFish could pass all of the tests if testCopyCompletableFutureEJB is run separately. I do not know of any requirements for tests to be run all at the same time. With the Platform TCK, if all tests were run at once, it would take multiple days to run all of the tests, instead we run smaller sets of tests. Also, I think the GlassFish Platform TCK Porting kit retries tests on the first failure, which I think would be the same.

@arjantijms
Copy link
Contributor

According to https://github.com/eclipse-ee4j/jakartaee-tck/wiki/Jakarta-EE-10.0-TCK-results, the Concurrency TCK test only fails on Web Profile but not Full Platform mode.

I think that the Full Platform mode simply excludes that test. If you alter the test file to not exclude anything for Full Platform mode, when running against GlassFish full platform, all the new web profile test also run against the full profile glassfish. This includes the failing web profile test, and it fails on glassfish full profile too.

If you normally run the full profile TCK against full profile glassfish it excludes the web profile tests, and then it passes.

So that indicates that very specific test fails on both GlassFish web profile, and GlassFish full profile, even though the web profile and full profile TCK runs don't show that.

@scottmarlow
Copy link
Member

scottmarlow commented Aug 13, 2022

As per jakartaee/platform-tck#1111 (comment), this is now solved via the jakartaee/platform-tck#1111 change.

In summary, the workaround is done following what was observed in comment #24059 (comment) above. We exclude the failing web test and run that test separately.

@scottmarlow
Copy link
Member

According to https://github.com/eclipse-ee4j/jakartaee-tck/wiki/Jakarta-EE-10.0-TCK-results, the Concurrency TCK test only fails on Web Profile but not Full Platform mode.

I think that the Full Platform mode simply excludes that test. If you alter the test file to not exclude anything for Full Platform mode, when running against GlassFish full platform, all the new web profile test also run against the full profile glassfish. This includes the failing web profile test, and it fails on glassfish full profile too.

If you normally run the full profile TCK against full profile glassfish it excludes the web profile tests, and then it passes.

So that indicates that very specific test fails on both GlassFish web profile, and GlassFish full profile, even though the web profile and full profile TCK runs don't show that.

Is this mentioned in the Concurrency TCK doc requirements? I'm not against running both profiles for Full Platform.

@OndroMih
Copy link
Contributor

Although it's possible to pass the TCK if the failing test is executed in a separate test run, it seems there's some profound error in GlassFish.

The test fails only if executed after the testCopyCompletableFuture test, which essentially tests the same behavior but in servlet context. When I created a copy of the testCopyCompletableFutureEJB and executed after the testCopyCompletableFutureEJB, it again failed. This basically means that when the test testCopyCompletableFutureEJB is executed twice, it always fails the second time. The same happens if I switch the order of tests and execute testCopyCompletableFutureEJB and then execute testCopyCompletableFuture test - the EJB one passes and the servlet one (the latter) fails this time.

I still need to investigate why the second run always fails (whether it's in servlet context or EJB context seems to not matter) but we should really fix it in GlassFish. However, this issue probably shouldn't block a CCR because GF can pass the Web TCK without breaking any rules.

@OndroMih
Copy link
Contributor

After some investigation and debugging, I see several issues in GF:

  • ContextSetupProviderImpl.java in omnifaces/omniconcurrent copies a reference to one of the contexts to the allRemaining variable, and later modifies the allRemianing variable when a context is saved for asynchronous execution. At this point, also one of the contexts is modified, which means that the configuration is mangled and the sets of contexts don't correspond to the original setup. In this particular case, the IntContext is removed from contextUnchanged and therefore it's propagated because "remaining" contexts are set to propagate. I think a fix is to change the asssignments to allRemaining from allRemaining = contextPropagate and other contexts to allRemaining = new HashSet(contextPropagate) copy the sets instead.
  • Even after the fix I suggest in the first point, the list of ThreadContextProvider services is empty when the context is saved from an asynchronous thread. This is because in a new thread used to execute a completableFuture handler, Utility.getClassLoader() is null, pointing to the JDK platform classlaoder, and this classloader doesn't see the definitions in the deployed application. Nothing sets the thread context classloader for the completableFuture threads to the application's WebAppClassloader. This means that the custom contexts are invalid in subsequent handlers and are ignored.
    • A fix to set up the correct classloader isn't simple. We could store the current context classloader in the thread object and set it in the thread's run method before the task is executed. But the catch is that, currently, the thread is being created with the JDK platform classloader as the context classloader in ConcurrentRuntime.java#L473. This is because of a classloader leak issue reported in Payara here and fixed here. Maybe we could store the app classloader in the thread object in a weak reference to avoid any leaks but this needs a thorough investigation.
    • Alternatively, ContextSetupProviderImpl shouldn't load ThreadContextProvider services each time the saveContext is called. It could be executed once if null or even eagerly in constructor, there's no point in loading them each time, potentially with a different classloader, or am I wrong?

@gurunrao
Copy link
Contributor Author

gurunrao commented Aug 17, 2022

After some investigation and debugging, I see several issues in GF:

* [ContextSetupProviderImpl.java](https://github.com/omnifaces/omniconcurrent/blob/a06b9d76ed0652d4f80e7031aa2a8ae07f29931a/concurrent-impl/src/main/java/org/glassfish/concurrent/runtime/ContextSetupProviderImpl.java#L141) in `omnifaces/omniconcurrent` copies a reference to one of the contexts to the `allRemaining` variable, and later modifies the [allRemianing variable](https://github.com/omnifaces/omniconcurrent/blob/a06b9d76ed0652d4f80e7031aa2a8ae07f29931a/concurrent-impl/src/main/java/org/glassfish/concurrent/runtime/ContextSetupProviderImpl.java#L202) when a context is saved for asynchronous execution. At this point, also one of the contexts is modified, which means that the configuration is mangled and the sets of contexts don't correspond to the original setup. In this particular case, the `IntContext` is removed from `contextUnchanged` and therefore it's propagated because "remaining" contexts are set to propagate. I think a fix is to change the asssignments to `allRemaining` from `allRemaining = contextPropagate` and other contexts to `allRemaining = new HashSet(contextPropagate)` copy the sets instead.

* Even after the fix I suggest in the first point, the list of [ThreadContextProvider services](https://github.com/omnifaces/omniconcurrent/blob/a06b9d76ed0652d4f80e7031aa2a8ae07f29931a/concurrent-impl/src/main/java/org/glassfish/concurrent/runtime/ContextSetupProviderImpl.java#L197) is empty when the context is saved from an asynchronous thread. This is because in a new thread used to execute a completableFuture handler, `Utility.getClassLoader()` is `null`, pointing to the JDK platform classlaoder, and this classloader doesn't see the definitions in the deployed application. Nothing sets the thread context classloader for the completableFuture threads to the application's `WebAppClassloader`. This means that the custom contexts are invalid in subsequent handlers and are ignored.
  
  * A fix to set up the correct classloader isn't simple. We could store the current context classloader in the thread object and set it in the thread's run method before the task is executed. But the catch is that, currently, the thread is being created with the JDK platform classloader as the context classloader in [ConcurrentRuntime.java#L473](https://github.com/omnifaces/omniconcurrent/blob/a06b9d76ed0652d4f80e7031aa2a8ae07f29931a/concurrent-impl/src/main/java/org/glassfish/concurrent/runtime/ConcurrentRuntime.java#L473). This is because of a classloader leak issue reported in Payara [here](https://github.com/payara/Payara/issues/4098) and fixed [here](https://github.com/payara/Payara/pull/5081). Maybe we could store the app classloader in the thread object in a weak reference to avoid any leaks but this needs a thorough investigation.
  * Alternatively,  [ContextSetupProviderImpl](https://github.com/omnifaces/omniconcurrent/blob/a06b9d76ed0652d4f80e7031aa2a8ae07f29931a/concurrent-impl/src/main/java/org/glassfish/concurrent/runtime/ContextSetupProviderImpl.java#L197) shouldn't load `ThreadContextProvider` services each time the `saveContext` is called. It could be executed once if `null` or even eagerly in constructor, there's no point in loading them each time, potentially with a different classloader, or am I wrong?

@OndroMih - Are you planning PR for the issues, mentioned?

@OndroMih
Copy link
Contributor

I'm sorry, I don't have a complete fix yet and I'm going on holiday next week. So I won't fix it before 28th August.

@OndroMih
Copy link
Contributor

This shouldn't block raising a CCR for Web Profile. At the platform call I believe there was an agreement that it's OK if TCK tests are passed individually and not in a single run, and GF 7-M7 passes the TCK when the failing test is executed separately. I hope that the GF/TCK team raises a CCR for Web Profile soon.

@OndroMih
Copy link
Contributor

OndroMih commented Sep 2, 2022

Because I believe it's important to fix this to make the feature of context propagation usable in production, I've raised a follow-up issue #24101.

OndroMih added a commit to OndroMih/glassfish that referenced this issue Sep 4, 2022
Tests are failing now as expected, need to fix the issue.
Signed-off-by:Ondro Mihalyi <mihalyi@omnifish.ee>
dmatej added a commit that referenced this issue Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ee10-tck EE 10 TCK failures tck-web-profile
Projects
None yet
9 participants