
Hot update testing on working master #9415

Closed
13 of 14 tasks
skabashnyuk opened this issue Apr 13, 2018 · 8 comments
Labels
kind/task Internal things, technical debt, and to-do tasks to be performed. status/in-progress This issue has been taken by an engineer and is under active development.

Comments


skabashnyuk commented Apr 13, 2018

Description

We need to test the hot update procedure. Proposed scenario: run the hot update several times while Selenium tests are running.

During Hot Update testing we discovered the following issues:

  • Adapt ServiceTermination functionality to workspaces recovering #9317

  • Add an ability to use distributed cache for storing workspace statuses in WorkspaceRuntimes #9206

  • Server checkers don't work correctly after workspace recovery

  • Many tests are skipped because their preparation steps fail.
    (screenshot: screenshot_20180420_141055)
    The reason is that after a rolling update of the Che server, the liveness and readiness probes start failing after some period of time (not measured exactly), while /api/system/state stays available with status 200 OK (checked with a browser and with a bash script that polled it continuously; a minimal polling sketch is included after the attached logs below). The event that we received from OpenShift is:
    Killing container with id docker://che:Need to kill Pod.
    It is reproducible only on local OCP 3.7, not on local OCP 3.9 or the https://dev.rdu2c.fabric8.io:8443/ dev cluster, so it can be marked as solved as long as it is not reproducible on the current test OCP.

  • Che Server logs contain the following error messages:

Che Server logs
2018-04-20 09:31:11,651[main]             [INFO ] [o.a.catalina.startup.Catalina 683]   - Server startup in 13478 ms

Received SIGTERM
Stopping Che server running on localhost:8080
2018-04-20 09:33:37,191[main]             [INFO ] [o.a.c.core.StandardServer 524]       - A valid shutdown command was received via the shutdown port. Stopping the Server instance.
2018-04-20 09:33:37,191[main]             [INFO ] [o.a.c.http11.Http11NioProtocol 658]  - Pausing ProtocolHandler ["http-nio-8080"]
2018-04-20 09:33:37,242[main]             [INFO ] [o.a.c.core.StandardService 477]      - Stopping service [Catalina]
2018-04-20 09:33:37,355[ost-startStop-2]  [INFO ] [o.e.c.a.s.server.SystemManager 132]  - Preparing system to shutdown
2018-04-20 09:33:37,370[ost-startStop-2]  [INFO ] [o.e.c.a.s.s.ServiceTerminator 67]    - Suspending down 'workspace' service
2018-04-20 09:33:37,379[ost-startStop-2]  [INFO ] [.e.c.a.w.s.WorkspaceSharedPool 115]  - Shutdown workspace threads pool, wait 30s to stop normally
2018-04-20 09:33:37,379[ost-startStop-2]  [INFO ] [.e.c.a.w.s.WorkspaceSharedPool 127]  - Workspace threads pool is terminated
2018-04-20 09:33:37,379[ost-startStop-2]  [INFO ] [o.e.c.a.s.s.ServiceTerminator 81]    - Service 'workspace' is suspended
2018-04-20 09:33:37,381[ost-startStop-2]  [INFO ] [o.e.c.a.s.s.ServiceTerminator 67]    - Suspending down 'KubernetesClient' service
2018-04-20 09:33:37,381[ost-startStop-2]  [INFO ] [o.e.c.a.s.s.ServiceTerminator 81]    - Service 'KubernetesClient' is suspended
2018-04-20 09:33:37,382[ost-startStop-2]  [INFO ] [o.e.c.a.s.server.SystemManager 139]  - System is ready to shutdown
2018-04-20 09:33:37,385[ost-startStop-2]  [INFO ] [e.c.w.i.k.n.p.PVCSubPathHelper 197]  - PVCSubPathHelper thread pool is terminated
2018-04-20 09:33:37,394[ost-startStop-2]  [INFO ] [o.e.c.a.core.util.FileCleaner 80]    - File cleaner is stopped
2018-04-20 09:33:37,448[main]             [INFO ] [o.a.c.http11.Http11NioProtocol 687]  - Stopping ProtocolHandler ["http-nio-8080"]
2018-04-20 09:33:37,450[main]             [INFO ] [o.a.c.http11.Http11NioProtocol 707]  - Destroying ProtocolHandler ["http-nio-8080"]
2018-04-20 09:33:50,098[-24-94xb7-58653]  [ERROR] [unknown.jul.logger 49]               - JGRP000019: failed passing message to receiver
java.lang.NoClassDefFoundError: java/io/ObjectStreamClass
	at org.eclipse.persistence.internal.helper.CustomObjectInputStream.resolveClass(CustomObjectInputStream.java:42)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1826)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
	at org.eclipse.persistence.sessions.serializers.JavaSerializer.deserialize(JavaSerializer.java:56)
	at org.eclipse.persistence.internal.sessions.coordination.jgroups.JGroupsRemoteConnection.onMessage(JGroupsRemoteConnection.java:130)
	at org.eclipse.persistence.internal.sessions.coordination.jgroups.JGroupsRemoteConnection$1.receive(JGroupsRemoteConnection.java:56)
	at org.jgroups.JChannel.up(JChannel.java:768)
	at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1034)
	at org.jgroups.stack.Protocol.up(Protocol.java:426)
	at org.jgroups.stack.Protocol.up(Protocol.java:426)
	at org.jgroups.protocols.FRAG2.up(FRAG2.java:182)
	at org.jgroups.protocols.FlowControl.up(FlowControl.java:438)
	at org.jgroups.stack.Protocol.up(Protocol.java:426)
	at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:294)
	at org.jgroups.protocols.UNICAST.up(UNICAST.java:425)
	at org.jgroups.protocols.pbcast.NAKACK2.deliverBatch(NAKACK2.java:989)
	at org.jgroups.protocols.pbcast.NAKACK2.removeAndPassUp(NAKACK2.java:919)
	at org.jgroups.protocols.pbcast.NAKACK2.handleMessage(NAKACK2.java:851)
	at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:611)
	at org.jgroups.protocols.BARRIER.up(BARRIER.java:152)
	at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:155)
	at org.jgroups.protocols.FD.up(FD.java:260)
	at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:325)
	at org.jgroups.protocols.MERGE2.up(MERGE2.java:237)
	at org.jgroups.protocols.Discovery.up(Discovery.java:296)
	at org.jgroups.protocols.TP.passMessageUp(TP.java:1658)
	at org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1876)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: Illegal access: this web application instance has been stopped already. Could not load [java.io.ObjectStreamClass]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
	at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForClassLoading(WebappClassLoaderBase.java:1301)
	at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1158)
	at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1119)
	... 33 common frames omitted
Caused by: java.lang.IllegalStateException: Illegal access: this web application instance has been stopped already. Could not load [java.io.ObjectStreamClass]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
	at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForResourceLoading(WebappClassLoaderBase.java:1311)
	at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForClassLoading(WebappClassLoaderBase.java:1299)
	... 35 common frames omitted
  • Multiuser Che Server on Kubernetes/OpenShift does not stop normally. Destroying the context hangs and the pod is killed when the timeout is reached.

  • PersistenceService must be stopped not on termination but on PreDestroy, and it must be the last thing to do.
    We cannot stop PersistenceService on termination because system termination might be called by a Che admin via REST; if a user then tries to get some information from the database, they will get an error because all the connections and sessions are already closed.
    Issue: [Hot_update] Persistence service should not be stopped when Che is shut down via rest #9609

  • Che does not stop the persistence service on shutdown, and since we added JGroups for EclipseLink cache coordination, we need to stop the persistence service to be able to stop the Che server gracefully.

  • The JGroups-based EclipseLink cache coordinator does not close the EclipseLinkCommandChannel, which leads to problems when stopping Che (this needs detailed investigation).

  • Terminate the components that work with the database before stopping the persistence service (the list may be supplemented):

    • Temporary workspace cleaner
    • Workspace activity scheduler
  • The bootstrapper uses the Che Server Route URL to notify the Che Server when bootstrapping is finished. This is a problem because there is a possible (and reproducible) case where the stopping Che Server is starting a workspace and waiting for bootstrapper events, BUT the bootstrapper connects to the Che Server Route URL and actually reaches the NEW Che Server instance. As a result, the old Che Server never receives the event that the workspace is bootstrapped, and the bootstrapping wait timeout occurs.
    A possible solution is to implement a distributed messaging service.
    Providing the Che Server IP might also solve the issue on a local OpenShift, but it won't work in a multi-cluster case like OSIO, where the Che Server is started on one OpenShift cluster and the workspaces on another. Registered the issue Workspace start hangs up on bootstrapping phase when Rolling Update is in progress #9740.

  • The workspace agent needs to communicate with the Che Server, and it uses the Che Server Route URL for its requests. During a Rolling Update there may be hanging requests (for a few minutes) while the OpenShift Router picks up the new Che Server. As a result, the Workspace Agent fails to start when it requests the Che Server while the Che Servers are being switched over. Logs are attached.

WsAgentLogs
2018-04-26 15:48:09,020[main]             [INFO ] [o.a.c.core.StandardEngine 259]       - Starting Servlet Engine: Apache Tomcat/8.5.23
2018-04-26 15:48:09,111[ost-startStop-1]  [INFO ] [o.a.c.startup.HostConfig 959]        - Deploying web application archive [/home/user/che/ws-agent/webapps/ROOT.war]
2018-04-26 15:48:43,050[ost-startStop-1]  [ERROR] [o.a.c.c.C.[.[localhost].[/] 4752]    - Exception sending context initialized event to listener instance of class [org.eclipse.che.inject.CheBootstrap]
com.google.inject.CreationException: Unable to create injector, see the following errors:
1) Invocation error of method public void org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer.initialize() throws org.eclipse.che.api.core.ConflictException,org.eclipse.che.api.core.NotFoundException,org.eclipse.che.api.core.ServerException,org.eclipse.che.api.core.ForbiddenException on org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer@1e57ace0
  at org.eclipse.che.api.project.server.ProjectApiModule.configure(ProjectApiModule.java:60) (via modules: org.eclipse.che.wsagent.server.WsAgentModule -> org.eclipse.che.api.project.server.ProjectApiModule)
  while locating org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer
1 error
        at com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:470)
        at com.google.inject.internal.InternalInjectorCreator.injectDynamically(InternalInjectorCreator.java:184)
        at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:110)
        at com.google.inject.Guice.createInjector(Guice.java:99)
        at org.everrest.guice.servlet.EverrestGuiceContextListener.getInjector(EverrestGuiceContextListener.java:140)
        at com.google.inject.servlet.GuiceServletContextListener.contextInitialized(GuiceServletContextListener.java:47)
        at org.everrest.guice.servlet.EverrestGuiceContextListener.contextInitialized(EverrestGuiceContextListener.java:85)
        at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4745)
        at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5207)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:752)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:728)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:734)
        at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:988)
        at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1860)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.eclipse.che.api.core.ServerException: java.io.IOException: Failed access: http://che-eclipse-che.172.19.20.43.nip.io/api/workspace/workspace74g0awl999pk64t9?token, method: GET, response code: 504, message: 

504 Gateway Time-out

The server didn't respond in time.
	at org.eclipse.che.api.project.server.impl.WorkspaceProjectSynchronizer.workspaceDto(WorkspaceProjectSynchronizer.java:224)
	at org.eclipse.che.api.project.server.impl.WorkspaceProjectSynchronizer.getAll(WorkspaceProjectSynchronizer.java:128)
	at org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer.initializeRegisteredProjects(OnWorkspaceStartProjectInitializer.java:63)
	at org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer.initialize(OnWorkspaceStartProjectInitializer.java:57)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.eclipse.che.inject.lifecycle.InitModule$1$1.afterInjection(InitModule.java:46)
	at com.google.inject.internal.MembersInjectorImpl.notifyListeners(MembersInjectorImpl.java:119)
	at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:121)
	at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:90)
	at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:268)
	at com.google.inject.internal.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:46)
	at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1092)
	at com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
	at com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:194)
	at com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:41)
	at com.google.inject.internal.InternalInjectorCreator$1.call(InternalInjectorCreator.java:205)
	at com.google.inject.internal.InternalInjectorCreator$1.call(InternalInjectorCreator.java:199)
	at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1085)
	at com.google.inject.internal.InternalInjectorCreator.loadEagerSingletons(InternalInjectorCreator.java:199)
	at com.google.inject.internal.InternalInjectorCreator.injectDynamically(InternalInjectorCreator.java:180)
	... 18 common frames omitted
Caused by: java.io.IOException: Failed access: http://che-eclipse-che.172.19.20.43.nip.io/api/workspace/workspace74g0awl999pk64t9?token, method: GET, response code: 504, message:

504 Gateway Time-out

The server didn't respond in time.
	at org.eclipse.che.api.core.rest.DefaultHttpJsonRequest.doRequest(DefaultHttpJsonRequest.java:284)
	at org.eclipse.che.api.core.rest.DefaultHttpJsonRequest.request(DefaultHttpJsonRequest.java:163)
	at org.eclipse.che.api.project.server.impl.WorkspaceProjectSynchronizer.workspaceDto(WorkspaceProjectSynchronizer.java:221)
	... 40 common frames omitted
2018-04-26 15:48:43,051[ost-startStop-1]  [ERROR] [o.a.c.core.StandardContext 5208]     - One or more listeners failed to start. Full details will be found in the appropriate container log file
2018-04-26 15:48:43,053[ost-startStop-1]  [ERROR] [o.a.c.core.StandardContext 5259]     - Context [] startup failed due to previous errors
2018-04-26 15:48:43,067[ost-startStop-1]  [INFO ] [o.a.c.startup.HostConfig 1023]       - Deployment of web application archive [/home/user/che/ws-agent/webapps/ROOT.war] has finished in [33,955] ms
2018-04-26 15:48:43,069[main]             [INFO ] [o.a.c.http11.Http11NioProtocol 632]  - Starting ProtocolHandler ["http-nio-4401"]
2018-04-26 15:48:43,077[main]             [INFO ] [o.a.catalina.startup.Catalina 683]   - Server startup in 34108 ms
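For reference, the continuous check mentioned in the probe item above was a bash script; an equivalent minimal sketch in Java is shown below. The route host is copied from the attached logs, so treat it as illustrative. It polls /api/system/state once per second and prints the HTTP status, which can be compared with the liveness/readiness probe behaviour during the rolling update:

import java.net.HttpURLConnection;
import java.net.URL;

public class SystemStatePoller {
  public static void main(String[] args) throws Exception {
    // Illustrative route host copied from the logs; replace with the real Che route.
    URL state = new URL("http://che-eclipse-che.172.19.20.43.nip.io/api/system/state");
    while (true) {
      HttpURLConnection conn = (HttpURLConnection) state.openConnection();
      conn.setConnectTimeout(5_000);
      conn.setReadTimeout(5_000);
      try {
        // Expected to keep returning 200 OK even while the probes report failure.
        System.out.println(System.currentTimeMillis() + " /api/system/state -> " + conn.getResponseCode());
      } catch (Exception e) {
        System.out.println(System.currentTimeMillis() + " /api/system/state -> " + e.getMessage());
      } finally {
        conn.disconnect();
      }
      Thread.sleep(1_000);
    }
  }
}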
skabashnyuk added the kind/task and team/platform labels Apr 13, 2018
gorkem mentioned this issue Apr 13, 2018
akorneta self-assigned this Apr 18, 2018
akorneta added the status/in-progress label Apr 18, 2018

sleshchenko commented Apr 20, 2018

@akorneta and I are working on testing the Hot Update functionality. Starting a Rolling Update while Selenium tests are running helped us discover the following issues:

  • Server checkers don't work correctly after workspace recovery

  • Many tests are skipped because their preparation steps fail.
    (screenshot: screenshot_20180420_141055)

  • Che Server logs contain the following error messages:

Che Server logs
2018-04-20 09:31:11,651[main]             [INFO ] [o.a.catalina.startup.Catalina 683]   - Server startup in 13478 ms

Received SIGTERM
Stopping Che server running on localhost:8080
2018-04-20 09:33:37,191[main]             [INFO ] [o.a.c.core.StandardServer 524]       - A valid shutdown command was received via the shutdown port. Stopping the Server instance.
2018-04-20 09:33:37,191[main]             [INFO ] [o.a.c.http11.Http11NioProtocol 658]  - Pausing ProtocolHandler ["http-nio-8080"]
2018-04-20 09:33:37,242[main]             [INFO ] [o.a.c.core.StandardService 477]      - Stopping service [Catalina]
2018-04-20 09:33:37,355[ost-startStop-2]  [INFO ] [o.e.c.a.s.server.SystemManager 132]  - Preparing system to shutdown
2018-04-20 09:33:37,370[ost-startStop-2]  [INFO ] [o.e.c.a.s.s.ServiceTerminator 67]    - Suspending down 'workspace' service
2018-04-20 09:33:37,379[ost-startStop-2]  [INFO ] [.e.c.a.w.s.WorkspaceSharedPool 115]  - Shutdown workspace threads pool, wait 30s to stop normally
2018-04-20 09:33:37,379[ost-startStop-2]  [INFO ] [.e.c.a.w.s.WorkspaceSharedPool 127]  - Workspace threads pool is terminated
2018-04-20 09:33:37,379[ost-startStop-2]  [INFO ] [o.e.c.a.s.s.ServiceTerminator 81]    - Service 'workspace' is suspended
2018-04-20 09:33:37,381[ost-startStop-2]  [INFO ] [o.e.c.a.s.s.ServiceTerminator 67]    - Suspending down 'KubernetesClient' service
2018-04-20 09:33:37,381[ost-startStop-2]  [INFO ] [o.e.c.a.s.s.ServiceTerminator 81]    - Service 'KubernetesClient' is suspended
2018-04-20 09:33:37,382[ost-startStop-2]  [INFO ] [o.e.c.a.s.server.SystemManager 139]  - System is ready to shutdown
2018-04-20 09:33:37,385[ost-startStop-2]  [INFO ] [e.c.w.i.k.n.p.PVCSubPathHelper 197]  - PVCSubPathHelper thread pool is terminated
2018-04-20 09:33:37,394[ost-startStop-2]  [INFO ] [o.e.c.a.core.util.FileCleaner 80]    - File cleaner is stopped
2018-04-20 09:33:37,448[main]             [INFO ] [o.a.c.http11.Http11NioProtocol 687]  - Stopping ProtocolHandler ["http-nio-8080"]
2018-04-20 09:33:37,450[main]             [INFO ] [o.a.c.http11.Http11NioProtocol 707]  - Destroying ProtocolHandler ["http-nio-8080"]
2018-04-20 09:33:50,098[-24-94xb7-58653]  [ERROR] [unknown.jul.logger 49]               - JGRP000019: failed passing message to receiver
java.lang.NoClassDefFoundError: java/io/ObjectStreamClass
	at org.eclipse.persistence.internal.helper.CustomObjectInputStream.resolveClass(CustomObjectInputStream.java:42)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1826)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
	at org.eclipse.persistence.sessions.serializers.JavaSerializer.deserialize(JavaSerializer.java:56)
	at org.eclipse.persistence.internal.sessions.coordination.jgroups.JGroupsRemoteConnection.onMessage(JGroupsRemoteConnection.java:130)
	at org.eclipse.persistence.internal.sessions.coordination.jgroups.JGroupsRemoteConnection$1.receive(JGroupsRemoteConnection.java:56)
	at org.jgroups.JChannel.up(JChannel.java:768)
	at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1034)
	at org.jgroups.stack.Protocol.up(Protocol.java:426)
	at org.jgroups.stack.Protocol.up(Protocol.java:426)
	at org.jgroups.protocols.FRAG2.up(FRAG2.java:182)
	at org.jgroups.protocols.FlowControl.up(FlowControl.java:438)
	at org.jgroups.stack.Protocol.up(Protocol.java:426)
	at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:294)
	at org.jgroups.protocols.UNICAST.up(UNICAST.java:425)
	at org.jgroups.protocols.pbcast.NAKACK2.deliverBatch(NAKACK2.java:989)
	at org.jgroups.protocols.pbcast.NAKACK2.removeAndPassUp(NAKACK2.java:919)
	at org.jgroups.protocols.pbcast.NAKACK2.handleMessage(NAKACK2.java:851)
	at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:611)
	at org.jgroups.protocols.BARRIER.up(BARRIER.java:152)
	at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:155)
	at org.jgroups.protocols.FD.up(FD.java:260)
	at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:325)
	at org.jgroups.protocols.MERGE2.up(MERGE2.java:237)
	at org.jgroups.protocols.Discovery.up(Discovery.java:296)
	at org.jgroups.protocols.TP.passMessageUp(TP.java:1658)
	at org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1876)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: Illegal access: this web application instance has been stopped already. Could not load [java.io.ObjectStreamClass]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
	at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForClassLoading(WebappClassLoaderBase.java:1301)
	at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1158)
	at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1119)
	... 33 common frames omitted
Caused by: java.lang.IllegalStateException: Illegal access: this web application instance has been stopped already. Could not load [java.io.ObjectStreamClass]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
	at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForResourceLoading(WebappClassLoaderBase.java:1311)
	at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForClassLoading(WebappClassLoaderBase.java:1299)
	... 35 common frames omitted

The plan is to fix the discovered issues and rerun the Selenium tests with the Rolling Update.

One more issue that was discovered but is not related to the Rolling Update:

  • Multiuser Che Server on Kubernetes/OpenShift does not stop normally. Destroying the context hangs and the pod is killed when the timeout is reached.


l0rd commented Apr 20, 2018

Thank you for this status report @sleshchenko. I have a few questions:

  1. The last issue you mention in your comment (Che Server is not stopped normally...) looks related to the second one you mentioned (Che Server logs contain the following error messages...). Is that the case?

  2. I am of course interested in having the hot update on openshift.io as soon as possible :-) And I am wondering whether the tests that are still failing cover scenarios that won't happen on openshift.io. What I mean is that, if we have the chance, we should release a first hot update version that works for osio needs (first iteration) and then release a second iteration of the hot update that also works for corner cases that are not relevant to osio. In particular, I am thinking about the following osio restrictions that may limit the testing scenarios we should consider in the first iteration:

  • we support fewer stacks and ssh is disabled (hence the kind of agents running in workspaces is really limited)
  • we can assume that when we start a Che update there will be no workspace starting

sleshchenko commented

@l0rd

  1. The last issue you mention in your comment (Che Server is not stopped normally...) looks related to the second one you mentioned (Che Server logs contain the following error messages...). Is that the case?

We investigated it a bit more, and it looks like the reason is the closing of JGroups channels. I hope we'll fix it soon in our hot_update branch.

Also, we discovered that the 503 Service Unavailable issue does not happen during the rolling update itself. Sometimes the Che Server crashes after it has been updated. I hope this issue can be fixed by closing the JGroups channels correctly.
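For context, closing a JGroups channel on shutdown looks roughly like the minimal sketch below. This is only an illustration of the JGroups API, not the actual fix in the hot_update branch, and the class and cluster names are hypothetical:

import org.jgroups.JChannel;

public class CacheCoordinationChannel {
  private final JChannel channel;

  public CacheCoordinationChannel() throws Exception {
    // Hypothetical channel used for EclipseLink cache coordination.
    channel = new JChannel();
    channel.connect("eclipselink-cache-coordination");
  }

  // Should be invoked during Che server shutdown, before Tomcat stops the webapp class loader,
  // so that no cache-coordination message is delivered to an already-stopped webapp.
  public void shutdown() {
    channel.disconnect(); // leave the cluster
    channel.close();      // release the protocol stack and its threads
  }
}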

skabashnyuk commented

I am of course interested in having the hot update on openshift.io as soon as possible :-) And I am wondering whether the tests that are still failing cover scenarios that won't happen on openshift.io. What I mean is that, if we have the chance, we should release a first hot update version that works for osio needs (first iteration) and then release a second iteration of the hot update that also works for corner cases that are not relevant to osio. In particular, I am thinking about the following osio restrictions that may limit the testing scenarios we should consider in the first iteration:

I see your point @l0rd. We will try to do a quick investigation into what we can propose as a fast alternative. Maybe we can propose some kind of "manual hot update".


akorneta commented Apr 24, 2018

While testing the hot_update branch, @sleshchenko and I found out a few things:

  • Che does not stop the persistence service on shutdown, and since we added JGroups for EclipseLink cache coordination, we need to stop the persistence service to be able to stop the Che server gracefully.
  • The JGroups-based EclipseLink cache coordinator does not close the EclipseLinkCommandChannel, which leads to problems when stopping Che (this needs detailed investigation).
  • Terminate the components that work with the database (e.g. the temporary workspace cleaner and the workspace activity scheduler) before stopping the persistence service (the list may be supplemented).
  • After a rolling update of the Che server, the liveness and readiness probes start failing after some period of time (not measured exactly), while /api/system/state stays available with status 200 OK (checked with a browser and with a bash script that polled it continuously). The event that we received from OpenShift is:
    Killing container with id docker://che:Need to kill Pod.

At the current state, we have a workaround for the JGroups cache coordinator that allows us to stop the Che server gracefully, which is needed for the rolling update.
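To make the intended ordering concrete, here is a minimal sketch of the shutdown sequence described above; the field names are illustrative placeholders, not the actual Che components:

import javax.persistence.EntityManagerFactory;
import org.jgroups.JChannel;

public class GracefulShutdownSketch {
  private final AutoCloseable temporaryWorkspaceCleaner;   // DB-using component (placeholder)
  private final AutoCloseable workspaceActivityScheduler;  // DB-using component (placeholder)
  private final EntityManagerFactory entityManagerFactory; // JPA persistence unit
  private final JChannel cacheCoordinationChannel;         // EclipseLink cache-coordination channel

  public GracefulShutdownSketch(AutoCloseable cleaner, AutoCloseable scheduler,
                                EntityManagerFactory emf, JChannel channel) {
    this.temporaryWorkspaceCleaner = cleaner;
    this.workspaceActivityScheduler = scheduler;
    this.entityManagerFactory = emf;
    this.cacheCoordinationChannel = channel;
  }

  public void shutdown() throws Exception {
    // 1. Stop everything that still talks to the database.
    temporaryWorkspaceCleaner.close();
    workspaceActivityScheduler.close();
    // 2. Stop the persistence service itself.
    entityManagerFactory.close();
    // 3. Close the JGroups channel so no cache-coordination message arrives
    //    after the webapp class loader has been stopped (see the NoClassDefFoundError above).
    cacheCoordinationChannel.close();
  }
}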

sleshchenko removed their assignment Apr 27, 2018
sleshchenko commented

We're still testing the Hot Update functionality and trying to fix all the issues we have faced. We discovered the following additional issues:

  • The workspace agent needs to communicate with the Che Server, and it uses the Che Server Route URL for its requests. During a Rolling Update there may be hanging requests (for a few minutes) while the OpenShift Router picks up the new Che Server. As a result, the Workspace Agent fails to start when it requests the Che Server while the Che Servers are being switched over. Logs are attached (a retry sketch is shown after this list).
WsAgentLogs
2018-04-26 15:48:09,020[main]             [INFO ] [o.a.c.core.StandardEngine 259]       - Starting Servlet Engine: Apache Tomcat/8.5.23
2018-04-26 15:48:09,111[ost-startStop-1]  [INFO ] [o.a.c.startup.HostConfig 959]        - Deploying web application archive [/home/user/che/ws-agent/webapps/ROOT.war]
2018-04-26 15:48:43,050[ost-startStop-1]  [ERROR] [o.a.c.c.C.[.[localhost].[/] 4752]    - Exception sending context initialized event to listener instance of class [org.eclipse.che.inject.CheBootstrap]
com.google.inject.CreationException: Unable to create injector, see the following errors:
1) Invocation error of method public void org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer.initialize() throws org.eclipse.che.api.core.ConflictException,org.eclipse.che.api.core.NotFoundException,org.eclipse.che.api.core.ServerException,org.eclipse.che.api.core.ForbiddenException on org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer@1e57ace0
  at org.eclipse.che.api.project.server.ProjectApiModule.configure(ProjectApiModule.java:60) (via modules: org.eclipse.che.wsagent.server.WsAgentModule -> org.eclipse.che.api.project.server.ProjectApiModule)
  while locating org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer
1 error
        at com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:470)
        at com.google.inject.internal.InternalInjectorCreator.injectDynamically(InternalInjectorCreator.java:184)
        at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:110)
        at com.google.inject.Guice.createInjector(Guice.java:99)
        at org.everrest.guice.servlet.EverrestGuiceContextListener.getInjector(EverrestGuiceContextListener.java:140)
        at com.google.inject.servlet.GuiceServletContextListener.contextInitialized(GuiceServletContextListener.java:47)
        at org.everrest.guice.servlet.EverrestGuiceContextListener.contextInitialized(EverrestGuiceContextListener.java:85)
        at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4745)
        at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5207)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:752)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:728)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:734)
        at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:988)
        at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1860)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.eclipse.che.api.core.ServerException: java.io.IOException: Failed access: http://che-eclipse-che.172.19.20.43.nip.io/api/workspace/workspace74g0awl999pk64t9?token, method: GET, response code: 504, message: 

504 Gateway Time-out

The server didn't respond in time.
	at org.eclipse.che.api.project.server.impl.WorkspaceProjectSynchronizer.workspaceDto(WorkspaceProjectSynchronizer.java:224)
	at org.eclipse.che.api.project.server.impl.WorkspaceProjectSynchronizer.getAll(WorkspaceProjectSynchronizer.java:128)
	at org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer.initializeRegisteredProjects(OnWorkspaceStartProjectInitializer.java:63)
	at org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer.initialize(OnWorkspaceStartProjectInitializer.java:57)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.eclipse.che.inject.lifecycle.InitModule$1$1.afterInjection(InitModule.java:46)
	at com.google.inject.internal.MembersInjectorImpl.notifyListeners(MembersInjectorImpl.java:119)
	at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:121)
	at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:90)
	at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:268)
	at com.google.inject.internal.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:46)
	at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1092)
	at com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
	at com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:194)
	at com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:41)
	at com.google.inject.internal.InternalInjectorCreator$1.call(InternalInjectorCreator.java:205)
	at com.google.inject.internal.InternalInjectorCreator$1.call(InternalInjectorCreator.java:199)
	at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1085)
	at com.google.inject.internal.InternalInjectorCreator.loadEagerSingletons(InternalInjectorCreator.java:199)
	at com.google.inject.internal.InternalInjectorCreator.injectDynamically(InternalInjectorCreator.java:180)
	... 18 common frames omitted
Caused by: java.io.IOException: Failed access: http://che-eclipse-che.172.19.20.43.nip.io/api/workspace/workspace74g0awl999pk64t9?token, method: GET, response code: 504, message:

504 Gateway Time-out

The server didn't respond in time.
	at org.eclipse.che.api.core.rest.DefaultHttpJsonRequest.doRequest(DefaultHttpJsonRequest.java:284)
	at org.eclipse.che.api.core.rest.DefaultHttpJsonRequest.request(DefaultHttpJsonRequest.java:163)
	at org.eclipse.che.api.project.server.impl.WorkspaceProjectSynchronizer.workspaceDto(WorkspaceProjectSynchronizer.java:221)
	... 40 common frames omitted
2018-04-26 15:48:43,051[ost-startStop-1]  [ERROR] [o.a.c.core.StandardContext 5208]     - One or more listeners failed to start. Full details will be found in the appropriate container log file
2018-04-26 15:48:43,053[ost-startStop-1]  [ERROR] [o.a.c.core.StandardContext 5259]     - Context [] startup failed due to previous errors
2018-04-26 15:48:43,067[ost-startStop-1]  [INFO ] [o.a.c.startup.HostConfig 1023]       - Deployment of web application archive [/home/user/che/ws-agent/webapps/ROOT.war] has finished in [33,955] ms
2018-04-26 15:48:43,069[main]             [INFO ] [o.a.c.http11.Http11NioProtocol 632]  - Starting ProtocolHandler ["http-nio-4401"]
2018-04-26 15:48:43,077[main]             [INFO ] [o.a.catalina.startup.Catalina 683]   - Server startup in 34108 ms
  • The bootstrapper uses the Che Server Route URL to notify the Che Server when bootstrapping is finished. This is a problem because there is a possible (and reproducible) case where the stopping Che Server is starting a workspace and waiting for bootstrapper events, BUT the bootstrapper connects to the Che Server Route URL and actually reaches the NEW Che Server instance. As a result, the old Che Server never receives the event that the workspace is bootstrapped, and the bootstrapping wait timeout occurs. A possible solution is to implement a distributed messaging service.
    Providing the Che Server IP might also solve the issue on a local OpenShift, but it won't work in a multi-cluster case like OSIO, where the Che Server is started on one OpenShift cluster and the workspaces on another.
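A minimal sketch of how the ws-agent side could tolerate the router switch-over is a bounded retry around the workspace request, as below. The endpoint is copied from the attached logs and the helper is hypothetical; this is not the actual ws-agent code:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class WorkspaceDtoFetchSketch {

  // Retries the GET while the OpenShift router returns 504 during the Rolling Update,
  // instead of failing the ws-agent start on the first attempt.
  public static int fetchWithRetry(String workspaceUrl, int maxAttempts, long delayMs)
      throws IOException, InterruptedException {
    IOException last = new IOException("no attempt was made");
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      HttpURLConnection conn = (HttpURLConnection) new URL(workspaceUrl).openConnection();
      conn.setConnectTimeout(10_000);
      conn.setReadTimeout(10_000);
      try {
        int code = conn.getResponseCode();
        if (code != 504) {
          return code; // the router reached a live Che Server (or returned a real error worth reporting)
        }
        last = new IOException("504 Gateway Time-out on attempt " + attempt);
      } catch (IOException e) {
        last = e; // the connection may also drop while the router switches servers
      } finally {
        conn.disconnect();
      }
      Thread.sleep(delayMs);
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    // Endpoint copied from the attached logs; illustrative only.
    int code = fetchWithRetry(
        "http://che-eclipse-che.172.19.20.43.nip.io/api/workspace/workspace74g0awl999pk64t9",
        10, 5_000);
    System.out.println("workspace API responded with HTTP " + code);
  }
}

A retry only hides the temporary 504s; it does not solve the bootstrapper-event problem above, which still needs a distributed messaging service or an equivalent mechanism.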


akorneta commented Apr 27, 2018

We also created an issue for the eclipse-link extension and added a workaround for JChannel closing so that the Che master can stop normally.

One more thing that was missed:

  • PersistenceService must be stopped not on termination but on PreDestroy, and it must be the last thing to do.
    We cannot stop PersistenceService on termination because system termination might be called by a Che admin via REST; if a user then tries to get some information from the database, they will get an error because all the connections and sessions are already closed.
    Issue: [Hot_update] Persistence service should not be stopped when Che is shut down via rest #9609
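A minimal sketch of what this could look like, assuming a @PreDestroy-style lifecycle hook; the class and field names are illustrative, not the actual PersistenceService:

import javax.annotation.PreDestroy;
import javax.persistence.EntityManagerFactory;

public class PersistenceServiceSketch {
  private final EntityManagerFactory emf;

  public PersistenceServiceSketch(EntityManagerFactory emf) {
    this.emf = emf;
  }

  // Runs only when the container is really going down, not in the ServiceTerminator path,
  // so a REST-triggered "system terminate" leaves the DB connections and sessions usable.
  // It should be the last shutdown step, after all DB-using components are terminated.
  @PreDestroy
  public void stop() {
    if (emf.isOpen()) {
      emf.close();
    }
  }
}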

skabashnyuk commented

Work will be continued here: #9657
