Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gradle PR build failing #6130

Closed
planetf1 opened this issue Jan 18, 2022 · 17 comments
Closed

Gradle PR build failing #6130

planetf1 opened this issue Jan 18, 2022 · 17 comments
Assignees
Labels
build-failure High Priority - a build is failing

Comments

@planetf1
Copy link
Member

The gradle PR build is failing with errors such as:

AssetConsumerOMASCreateTagIT > testCreateTag(String) > org.odpi.openmetadata.accessservices.assetconsumer.fvt.execution.AssetConsumerOMASCreateTagIT.testCreateTag(String)[1] STANDARD_OUT
7785
    AssetGUID: 4358845d-1017-495c-b944-d9b8a33604da
7786
    SchemaTypeGUID: acd45be1-5e35-4415-92bc-82a00567b89b
7787
    Asset2GUID: 8d9c1bde-c967-4743-b6d0-063aacd2b2a4
7788

7789
    11:11:01.864 [Test worker] ERROR o.o.o.c.ffdc.RESTExceptionHandler - FFDC Response: TagResponse{tag=null, exceptionClassName='org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException', exceptionCausedBy='null', actionDescription='getTag', relatedHTTPCode=400, exceptionErrorMessage='OMAG-COMMON-400-021 The identifier 87b16829-7bad-4181-8ab5-36df81c54703 of the InformalTag passed by newUserId to method getTag of service Asset Consumer OMAS is not recognized by the serverinmem server', exceptionErrorMessageId='OMAG-COMMON-400-021', exceptionErrorMessageParameters=[87b16829-7bad-4181-8ab5-36df81c54703, InformalTag, newUserId, getTag, Asset Consumer OMAS, serverinmem], exceptionSystemAction='The system is unable to process the request because the identifier is invalid.', exceptionUserAction='Update the caller to provide a correct asset identifier.', exceptionProperties={parameterName=guid}}
7790
    11:11:01.883 [Test worker] ERROR o.o.o.c.ffdc.RESTExceptionHandler - Detected Invalid Parameter Exception in REST Response
7791
    org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException: OMAG-COMMON-400-021 The identifier 87b16829-7bad-4181-8ab5-36df81c54703 of the InformalTag passed by newUserId to method getTag of service Asset Consumer OMAS is not recognized by the serverinmem server

or

AnalyticsModelingOMASSynchronizationIT > testSynchronization(String) > org.odpi.openmetadata.accessservices.analyticsmodeling.fvt.execution.AnalyticsModelingOMASSynchronizationIT.testSynchronization(String)[2] STANDARD_ERROR
8155
    FVTUnexpectedCondition{testCaseName='SynchronizationTest', errorDescription='testUpdateArtifact', caughtException=PropertyServerException{reportedHTTPCode=400, reportingClassName='org.odpi.openmetadata.commonservices.ffdc.RESTExceptionHandler', reportingActionDescription='updateArtifact', reportedErrorMessage='OMAG-COMMON-400-016 An unexpected org.odpi.openmetadata.accessservices.analyticsmodeling.ffdc.exceptions.AnalyticsModelingCheckedException exception was caught by updateArtifact; error message was OMAS-ANALYTICS-MODELING-031 User testUser failed to update artifact iBASEMODULE. The artifact is unknown.', reportedErrorMessageId='OMAG-COMMON-400-016', reportedErrorMessageParameters=[org.odpi.openmetadata.accessservices.analyticsmodeling.ffdc.exceptions.AnalyticsModelingCheckedException, updateArtifact, OMAS-ANALYTICS-MODELING-031 User testUser failed to update artifact iBASEMODULE. The artifact is unknown.], reportedSystemAction='The system is unable to process the request and has returned an exception to the caller.', reportedUserAction='Review the error message.  Also look up its full message definition which includes the system action and user action.  This is most likely to describe the correct action to take to resolve the error.  If that does not help, look for other diagnostics created at the same time.  Also validate that the caller is a valid client of this server and is operating correctly.', reportedCaughtException=null, reportedCaughtExceptionClassName='org.odpi.openmetadata.accessservices.analyticsmodeling.ffdc.exceptions.AnalyticsModelingCheckedException', relatedProperties=null}}
8156
    	at org.odpi.openmetadata.accessservices.analyticsmodeling.fvt.synchronization.SynchronizationTest.testUpdateArtifact(SynchronizationTest.java:149)

This was first seen with PR #6116 -see https://github.com/odpi/egeria/actions/runs/1674299644

At the time of this initial failure other PRs were continuing to pass. This continued for up to 3 days, including the last good build
at https://github.com/odpi/egeria/actions/runs/1687765378

Subsequently all PR checks are now failing

@planetf1 planetf1 added the build-failure High Priority - a build is failing label Jan 18, 2022
@planetf1 planetf1 self-assigned this Jan 18, 2022
@planetf1
Copy link
Member Author

Tried so far

  • Disable caching in gradle properties
  • Disable git action configured caching
  • alternative jvms

Observed

  • existing environments don't seem to hit this issue (even if ~/.m2 & git tree cleaned out_) - mac, linux - though all have 16GB ram
  • A freshly created Ubuntu 20.04+latest updates 8GB did reproduce this failure - and IS STILL reproducing the failure (though it has built successfully a few times - intermittent)

@planetf1
Copy link
Member Author

Also now occuring on data manager:

> Task :open-metadata-test:open-metadata-fvt:access-services-fvt:data-engine-fvt:test

FindFVT > find(String, DataEngineClient, RepositoryService) > org.odpi.openmetadata.accessservices.dataengine.fvt.FindFVT.find(String, DataEngineClient, RepositoryService)[2] FAILED
    org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException at FindFVT.java:42

> Task :open-metadata-test:open-metadata-fvt:access-services-fvt:data-manager-fvt:test FAILED

DataManagerOMASIT > testCreateEvents(String) > org.odpi.openmetadata.accessservices.datamanager.fvt.execution.DataManagerOMASIT.testCreateEvents(String)[2] STANDARD_OUT
    ==========================================
    CreateEventsTest FVT (servergraph) failed
    Number of Failed tests: 1
    Captured Exceptions: 1
     ==> org.odpi.openmetadata.fvt.utilities.exceptions.FVTUnexpectedCondition with message getEventBroker(no GUID for Retrieve)
    FVTUnexpectedCondition{testCaseName='CreateEventsTest', errorDescription='getEventBroker(no GUID for Retrieve)', caughtException=null}

DataManagerOMASIT > testCreateEvents(String) > org.odpi.openmetadata.accessservices.datamanager.fvt.execution.DataManagerOMASIT.testCreateEvents(String)[2] STANDARD_ERROR
    FVTUnexpectedCondition{testCaseName='CreateEventsTest', errorDescription='getEventBroker(no GUID for Retrieve)', caughtException=null}

@planetf1
Copy link
Member Author

Running more tests in parallel it seems as if every FVT may now be failing with guid search issues. EVERY test most likely. May just be seen as I upgraded the failing image to more cpu/mem resource - so more parallel tests without timeouts.

@planetf1
Copy link
Member Author

Noting configuration on failing system:

jonesn@ubuntu-buildtest:~/src/egeria$ java -version
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment Temurin-11.0.13+8 (build 11.0.13+8)
OpenJDK 64-Bit Server VM Temurin-11.0.13+8 (build 11.0.13+8, mixed mode)
jonesn@ubuntu-buildtest:~/src/egeria$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
jonesn@ubuntu-buildtest:~/src/egeria$ uname -a
Linux ubuntu-buildtest 5.4.0-1010-ibm #11-Ubuntu SMP Wed Dec 8 06:56:03 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

@planetf1
Copy link
Member Author

Tips on debugging the graph.

Change properties

jonesn@ubuntu-buildtest:~/j/janusgraph-0.6.0$ cat conf/janusgraph-berkeleyje-lucene.properties | grep analytics
storage.directory=/home/jonesn/src/egeria/open-metadata-test/open-metadata-fvt/access-services-fvt/analytics-modeling-fvt/build/data/servers/servergraph/repository/graph//berkeley
index.search.directory=/home/jonesn/src/egeria/open-metadata-test/open-metadata-fvt/access-services-fvt/analytics-modeling-fvt/build/data/servers/servergraph/repository/graph//searchindex

follow the tips to connect, check vertices - it's then possible to connect remotely

gremlin> :remote connect tinkerpop.server conf/remote.yaml
==>Configured localhost/127.0.0.1:8182
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8182] - type ':remote console' to return to local mode
gremlin> graph
==>standardjanusgraph[berkeleyje:/home/jonesn/src/egeria/open-metadata-test/open-metadata-fvt/access-services-fvt/analytics-modeling-fvt/build/data/servers/servergraph/repository/graph//berkeley]
gremlin> g
==>graphtraversalsource[standardjanusgraph[berkeleyje:/home/jonesn/src/egeria/open-metadata-test/open-metadata-fvt/access-services-fvt/analytics-modeling-fvt/build/data/servers/servergraph/repository/graph//berkeley], standard]
gremlin> g.V().count()
==>42

@planetf1
Copy link
Member Author

A query on the graph (for the analytics modeling FVT) seems to show iBASEMODULE is present within the underlying graph -> https://gist.github.com/443b5f9c1f14b40ce007814b6399c9f0

@wbittles
Copy link
Contributor

@planetf1 Can you provide some info on what steps you followed to be able to debug the PR please.

@planetf1
Copy link
Member Author

Possible fix merged. Will summarise after

@planetf1
Copy link
Member Author

planetf1 commented Jan 19, 2022

To summarise what happened here in terms of FVT failures

  • The gradle build started failing for a PR Add Archive Manager OMES #6116
  • Extensive logs were able to provide the failure message
  • We were missing the chassis error log which was an obvious place to check -- (see Save additional build artifacts to improve ability to debug build #5290 - would have been useful)
  • This same PR built locally without error for @mandy-chessell and myself, in multiple environments (linux-RHEL, macOS)
  • other PR builds worked
  • Initial investigation was specific to this PR - with many file changes, but none that appeared immediately obvious
  • The built was left as blocking in case this was a latent code issue -- to protect master, even though maven builds were working
  • The failure appeared to suggest entities were being created, but then could not be queried
  • Whilst investigating, other PR builds started failing
  • Even a test PR with a negligible change (ie README) would fail
  • At this point it looks more an environmental issue
  • I tried various changes therefore relating to caching (the most obvious cause), performance, threading & tried building in more environments
  • I managed to reproduce a failure in an Ubuntu 20.04 virtual environment (cloud)
  • I was able to capture the chassis logs & state of the JanusGraph as created by the analytics modeling FVT, which looked ok. I was able to access the graph using gremlin and some visualization tools & search for the assets (though I did not move this to another egeria server for further testing ... since)
  • The same machine built other PRs ok with no obvious changes
  • The machine became very unstable with multiple total freezes - appearing to be some kind of resource issue (kernel settings more than ulimits?)
  • I switched off inode watches as we were hitting kernel limits, though this in itself did not fix the issue on the build machine - but it did seem to impact my local failing environment in a positive way, it was then stable.
  • I moved the build environment to ubuntu 18.04 & immediately PRs started working reliably, and continue to do so.
  • At all times everything other than ubuntu 20.04 worked perfectly

Note: the investigation was delayed by a few days due to writing & running the Egeria dojo -

In summary:

  • There is a particular issue building in Ubuntu 20.04 (never seen elsewhere)
  • There is probably a relationship to watching inodes

I think this came about because

  • github started rolling out an updated base image
  • Existing workers continued to use an older image until the rollout completed
  • Therefore some builds failed, some didnt
  • Over time every worker was updated, and every one failed

Using a github hosted runner we DO NOT have fine grained control over the workers - see https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners

One remaining concern is that the //exact// reason for the failure is not known - we should revisit this, possibly leave until after a further image update in case it's a more transient issue

Closing & will open a new issue to debug & revert back to Ubuntu 20.04

@planetf1
Copy link
Member Author

Reopening.

This has continued to occur. Most recently in #6450 I addressed some issues with overlapping port numbers being used in tests. Whilst that was a potential problem, it appears not to be the only cause, since intermittently PR builds have continued to fail.

The most recent failure was at

AssetConsumerOMASCreateTagIT > testCreateTag(String) > org.odpi.openmetadata.accessservices.assetconsumer.fvt.execution.AssetConsumerOMASCreateTagIT.testCreateTag(String)[1] STANDARD_OUT
[6837](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6837)
    AssetGUID: accc79a7-1832-4c28-a25e-e5afd490aad3
[6838](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6838)
    SchemaTypeGUID: ced2170e-12b4-454f-bf94-d5bbd98f502c
[6839](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6839)
    Asset2GUID: c01a71db-158f-405c-8843-f339a592854e
[6840](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6840)

[6841](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6841)
    PublicTag1GUID: 33b0b042-4f2a-4671-8fd9-cdd64ce2cff3
[6842](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6842)

[6843](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6843)
    PublicTag2GUID: 68122c0f-66ec-4068-8d63-1e733a8408b3
[6844](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6844)

[6845](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6845)
    10:38:51.721 [Test worker] ERROR o.o.o.c.ffdc.RESTExceptionHandler - FFDC Response: TagResponse{tag=null, exceptionClassName='org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException', exceptionCausedBy='null', actionDescription='getTag', relatedHTTPCode=400, exceptionErrorMessage='OMAG-COMMON-400-021 The identifier 950eff2e-98f2-46bb-a7f9-b1bb61d84e75 of the InformalTag passed by newUserId to method getTag of service Asset Consumer OMAS is not recognized by the serverinmem server', exceptionErrorMessageId='OMAG-COMMON-400-021', exceptionErrorMessageParameters=[950eff2e-98f2-46bb-a7f9-b1bb61d84e75, InformalTag, newUserId, getTag, Asset Consumer OMAS, serverinmem], exceptionSystemAction='The system is unable to process the request because the identifier is invalid.', exceptionUserAction='Update the caller to provide a correct asset identifier.', exceptionProperties={parameterName=guid}}
[6846](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6846)
    10:38:51.753 [Test worker] ERROR o.o.o.c.ffdc.RESTExceptionHandler - Detected Invalid Parameter Exception in REST Response
[6847](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6847)
    org.odpi.openmetadata.frameworks.connectors.ffdc.InvalidParameterException: OMAG-COMMON-400-021 The identifier 950eff2e-98f2-46bb-a7f9-b1bb61d84e75 of the InformalTag passed by newUserId to method getTag of service Asset Consumer OMAS is not recognized by the serverinmem server
[6848](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6848)
    	at org.odpi.openmetadata.commonservices.ffdc.RESTExceptionHandler.throwInvalidParameterException(RESTExceptionHandler.java:269)
[6849](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6849)
    	at org.odpi.openmetadata.commonservices.ffdc.RESTExceptionHandler.detectAndThrowStandardExceptions(RESTExceptionHandler.java:147)
[6850](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6850)
    	at org.odpi.openmetadata.accessservices.assetconsumer.client.rest.AssetConsumerRESTClient.callInformalTagGetRESTCall(AssetConsumerRESTClient.java:170)
[6851](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6851)
    	at org.odpi.openmetadata.accessservices.assetconsumer.client.AssetConsumer.getTag(AssetConsumer.java:1370)
[6852](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6852)
    	at org.odpi.openmetadata.accessservices.assetconsumer.fvt.tagging.CreateTagTest.getTagTest(CreateTagTest.java:240)
[6853](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6853)
    	at org.odpi.openmetadata.accessservices.assetconsumer.fvt.tagging.CreateTagTest.runIt(CreateTagTest.java:123)
[6854](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6854)
    	at org.odpi.openmetadata.accessservices.assetconsumer.fvt.tagging.CreateTagTest.performFVT(CreateTagTest.java:70)
[6855](https://github.com/odpi/egeria/runs/6209227696?check_suite_focus=true#step:5:6855)
    	at

@planetf1 planetf1 reopened this Apr 28, 2022
planetf1 added a commit to planetf1/egeria that referenced this issue May 11, 2022
Signed-off-by: Nigel Jones <nigel.l.jones+git@gmail.com>
@planetf1
Copy link
Member Author

Hvaing checked the Asset Consumer FVT code, the exception above is normal. It occurs when a tag is not found -- and in this case, this is exactly what the FVT is verifying (perhaps not ideal that we get an exception in a not found case)

Therefore the error is NOT related to the exceptions above.

Instead the failure point seems to be around:

CategoryIT > testCategory(String) > org.odpi.openmetadata.accessservices.subjectarea.fvt.junit.CategoryIT.testCategory(String)[1] STANDARD_OUT
[7849](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7849)
    CategoryFVT runIt started
[7850](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7850)
The message received from the daemon indicates that the daemon has disappeared.
[7851](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7851)

[7852](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7852)
Build request sent: Build{id=94798d2d-5003-4b65-8a8e-1911572cdb64, currentDir=/home/runner/work/egeria/egeria}
[7853](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7853)
Attempting to read last messages from the daemon log...
[7854](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7854)
Daemon pid: 1650
[7855](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7855)
  log file: /home/runner/.gradle/daemon/7.4.2/daemon-1650.out.log
[7856](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7856)
----- Last  20 lines from daemon log file - daemon-1650.out.log -----
[7857](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7857)
    14:42:04,181 |-INFO in ch.qos.logback.core.joran.spi.ConfigurationWatchList@23afc725 - URL [jar:file:/home/runner/work/egeria/egeria/open-metadata-resources/open-metadata-samples/access-services-samples/subject-area-client-samples/subject-area-definition-sample/build/libs/subject-area-definition-sample-3.9-SNAPSHOT.jar!/logback.xml] is not of type file
[7858](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7858)
    14:42:04,484 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set
[7859](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7859)
    14:42:04,493 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
[7860](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7860)
    14:42:04,550 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT]
[7861](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7861)
    14:42:04,576 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
[7862](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7862)
    14:42:04,772 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to OFF
[7863](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7863)
    14:42:04,772 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT]
[7864](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7864)
    14:42:04,773 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
[7865](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7865)
    14:42:04,777 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@5d3f99d7 - Registering current configuration as safe fallback point
[7866](https://github.com/odpi/egeria/runs/6371867613?check_suite_focus=true#step:5:7866)

which is the daemon appearing.

The most likely cause of this is resource exhaustion - max processes, files, memory etc

Will disable to gradle build daemon to see if this helps (or provides a clearer error)

planetf1 added a commit that referenced this issue May 11, 2022
Disable gradle daemon for CI build #6130
@planetf1
Copy link
Member Author

change has resulted in clean builds so far. Closing for now and will re-open if the issue returns

@planetf1
Copy link
Member Author

planetf1 commented Jun 8, 2022

Reopening as this is still ongoing (ie #6611 )

@planetf1 planetf1 reopened this Jun 9, 2022
@planetf1 planetf1 removed the blocker label Jun 10, 2022
planetf1 added a commit to planetf1/egeria that referenced this issue Jun 10, 2022
Signed-off-by: Nigel Jones <nigel.l.jones+git@gmail.com>
planetf1 added a commit that referenced this issue Jun 10, 2022
#6130 disable parallel gradle build on ci/cd (github runner)
@planetf1
Copy link
Member Author

Latest change = stable builds for now
Closing & will continue to monitor

@planetf1
Copy link
Member Author

Occured another twice today, again the gradle daemon disappeared.

Looked through recent gradle changes and issues, and gradle/gradle#19750 seems a possible explanation.

We are setting jvm opts -- and perhaps it may be better not to. This will override gradle's curated defaults. Combined with leaving parallel builds etc off, we should try this.

Cannot be reproduced in an external environment, only on the gradle runners, so this will have to be another exploratory change.

@planetf1 planetf1 reopened this Jun 28, 2022
planetf1 added a commit to planetf1/egeria that referenced this issue Jun 28, 2022
…6130

Signed-off-by: Nigel Jones <nigel.l.jones+git@gmail.com>
planetf1 added a commit to planetf1/egeria that referenced this issue Jul 5, 2022
odpi#6130

Signed-off-by: Nigel Jones <nigel.l.jones+git@gmail.com>
@planetf1
Copy link
Member Author

planetf1 commented Jul 5, 2022

Build logs continue to show a similar failure, as if the JVM - or the gradle process has gone away.

Tried

  • Update gradle to the latest version (7.4 RC3) - still fails (and required a groovy update)
  • Switch JVM from adoptium 11 to zulu 11 -> the zulu build hangs at the same place....?

both of these have been backed off. Harmless but not addressing issue.

Then

  • ensured the 'stopAfter' setting for the exec task was set to 'test'. This will ensure the server chassis used for a test will stop after the test is complete. When run in parallel, since the tests are at the end, this has little impact. But it seems as if the test environment is running tight on resources, processes, file descriptors, memory . The tests are now series, so this setting becomes important to limit the amount of processes active. Hopefully improving reliability.

@planetf1
Copy link
Member Author

tests have been stable, so closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build-failure High Priority - a build is failing
Projects
None yet
Development

No branches or pull requests

2 participants