
Hot update testing on working master #9415

Closed
13 of 14 tasks
skabashnyuk opened this issue Apr 13, 2018 · 8 comments
Labels
kind/task Internal things, technical debt, and to-do tasks to be performed. status/in-progress This issue has been taken by an engineer and is under active development.

Comments


skabashnyuk commented Apr 13, 2018

Description

We need to test the hot update procedure. Proposed scenario: run the hot update several times while Selenium tests are running.

During Hot Update testing we discovered the following issues:

  • Adapt ServiceTermination functionality to workspaces recovering #9317

  • Add an ability to use distributed cache for storing workspace statuses in WorkspaceRuntimes #9206

  • Server checkers don't work correctly after workspace recovery

  • Many tests are skipped because their preparation steps fail.
    (screenshot: screenshot_20180420_141055)
    The reason is that after a rolling update of the Che server, the liveness and readiness probes start failing after some period of time (not measured exactly), while /api/system/state stays available with status 200 OK (checked with a browser and with a bash script that polled it continuously; a minimal polling sketch is included after the attached logs below). The event that we received from OpenShift is:
    Killing container with id docker://che:Need to kill Pod.
    It is reproducible only on local OCP 3.7, not on local OCP 3.9 or the https://dev.rdu2c.fabric8.io:8443/ dev cluster, so it can be marked as solved as long as it is not reproducible on the current test OCP.

  • Che Server logs contain the following error messages:

Che Server logs
2018-04-20 09:31:11,651[main]             [INFO ] [o.a.catalina.startup.Catalina 683]   - Server startup in 13478 ms

Received SIGTERM
Stopping Che server running on localhost:8080
2018-04-20 09:33:37,191[main]             [INFO ] [o.a.c.core.StandardServer 524]       - A valid shutdown command was received via the shutdown port. Stopping the Server instance.
2018-04-20 09:33:37,191[main]             [INFO ] [o.a.c.http11.Http11NioProtocol 658]  - Pausing ProtocolHandler ["http-nio-8080"]
2018-04-20 09:33:37,242[main]             [INFO ] [o.a.c.core.StandardService 477]      - Stopping service [Catalina]
2018-04-20 09:33:37,355[ost-startStop-2]  [INFO ] [o.e.c.a.s.server.SystemManager 132]  - Preparing system to shutdown
2018-04-20 09:33:37,370[ost-startStop-2]  [INFO ] [o.e.c.a.s.s.ServiceTerminator 67]    - Suspending down 'workspace' service
2018-04-20 09:33:37,379[ost-startStop-2]  [INFO ] [.e.c.a.w.s.WorkspaceSharedPool 115]  - Shutdown workspace threads pool, wait 30s to stop normally
2018-04-20 09:33:37,379[ost-startStop-2]  [INFO ] [.e.c.a.w.s.WorkspaceSharedPool 127]  - Workspace threads pool is terminated
2018-04-20 09:33:37,379[ost-startStop-2]  [INFO ] [o.e.c.a.s.s.ServiceTerminator 81]    - Service 'workspace' is suspended
2018-04-20 09:33:37,381[ost-startStop-2]  [INFO ] [o.e.c.a.s.s.ServiceTerminator 67]    - Suspending down 'KubernetesClient' service
2018-04-20 09:33:37,381[ost-startStop-2]  [INFO ] [o.e.c.a.s.s.ServiceTerminator 81]    - Service 'KubernetesClient' is suspended
2018-04-20 09:33:37,382[ost-startStop-2]  [INFO ] [o.e.c.a.s.server.SystemManager 139]  - System is ready to shutdown
2018-04-20 09:33:37,385[ost-startStop-2]  [INFO ] [e.c.w.i.k.n.p.PVCSubPathHelper 197]  - PVCSubPathHelper thread pool is terminated
2018-04-20 09:33:37,394[ost-startStop-2]  [INFO ] [o.e.c.a.core.util.FileCleaner 80]    - File cleaner is stopped
2018-04-20 09:33:37,448[main]             [INFO ] [o.a.c.http11.Http11NioProtocol 687]  - Stopping ProtocolHandler ["http-nio-8080"]
2018-04-20 09:33:37,450[main]             [INFO ] [o.a.c.http11.Http11NioProtocol 707]  - Destroying ProtocolHandler ["http-nio-8080"]
2018-04-20 09:33:50,098[-24-94xb7-58653]  [ERROR] [unknown.jul.logger 49]               - JGRP000019: failed passing message to receiver
java.lang.NoClassDefFoundError: java/io/ObjectStreamClass
	at org.eclipse.persistence.internal.helper.CustomObjectInputStream.resolveClass(CustomObjectInputStream.java:42)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1826)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
	at org.eclipse.persistence.sessions.serializers.JavaSerializer.deserialize(JavaSerializer.java:56)
	at org.eclipse.persistence.internal.sessions.coordination.jgroups.JGroupsRemoteConnection.onMessage(JGroupsRemoteConnection.java:130)
	at org.eclipse.persistence.internal.sessions.coordination.jgroups.JGroupsRemoteConnection$1.receive(JGroupsRemoteConnection.java:56)
	at org.jgroups.JChannel.up(JChannel.java:768)
	at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1034)
	at org.jgroups.stack.Protocol.up(Protocol.java:426)
	at org.jgroups.stack.Protocol.up(Protocol.java:426)
	at org.jgroups.protocols.FRAG2.up(FRAG2.java:182)
	at org.jgroups.protocols.FlowControl.up(FlowControl.java:438)
	at org.jgroups.stack.Protocol.up(Protocol.java:426)
	at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:294)
	at org.jgroups.protocols.UNICAST.up(UNICAST.java:425)
	at org.jgroups.protocols.pbcast.NAKACK2.deliverBatch(NAKACK2.java:989)
	at org.jgroups.protocols.pbcast.NAKACK2.removeAndPassUp(NAKACK2.java:919)
	at org.jgroups.protocols.pbcast.NAKACK2.handleMessage(NAKACK2.java:851)
	at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:611)
	at org.jgroups.protocols.BARRIER.up(BARRIER.java:152)
	at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:155)
	at org.jgroups.protocols.FD.up(FD.java:260)
	at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:325)
	at org.jgroups.protocols.MERGE2.up(MERGE2.java:237)
	at org.jgroups.protocols.Discovery.up(Discovery.java:296)
	at org.jgroups.protocols.TP.passMessageUp(TP.java:1658)
	at org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1876)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: Illegal access: this web application instance has been stopped already. Could not load [java.io.ObjectStreamClass]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
	at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForClassLoading(WebappClassLoaderBase.java:1301)
	at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1158)
	at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1119)
	... 33 common frames omitted
Caused by: java.lang.IllegalStateException: Illegal access: this web application instance has been stopped already. Could not load [java.io.ObjectStreamClass]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
	at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForResourceLoading(WebappClassLoaderBase.java:1311)
	at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForClassLoading(WebappClassLoaderBase.java:1299)
	... 35 common frames omitted
  • Multiuser Che Server on Kubernetes/OpenShift does not stop normally. Destroying the context hangs and the pod is killed when the timeout is reached.

  • PersistenceService must be stopped not on termination but on PreDestroy, and it must be the last thing to do.
    We cannot stop PersistenceService on termination because system termination might be called by a Che admin via REST; if a user then tries to get some information from the database, they will get an error because all the connections and sessions are already closed.
    Issue: [Hot_update] Persistence service should not be stopped when Che is shut down via rest #9609

  • Che does not stop the persistence service on shutdown, and since we added JGroups for EclipseLink cache coordination, we need to stop the persistence service to be able to stop the Che server gracefully.

  • The JGroups-based EclipseLink cache coordinator does not close the EclipseLinkCommandChannel, which leads to problems when stopping Che (this needs detailed investigation).

  • Terminate the components that work with the database before stopping the persistence service (the list may be supplemented):

    • Temporary workspace cleaner
    • Workspace activity scheduler
  • The bootstrapper uses the Che Server Route URL to notify the Che Server when bootstrapping is finished. This is a problem because there is a possible (and reproducible) case where the stopping Che Server is starting a workspace and waiting for bootstrapper events, BUT the bootstrapper connects to the Che Server Route URL and actually reaches the NEW Che Server instance. As a result, the old Che Server never receives the event that the workspace is bootstrapped, and the bootstrapping wait timeout occurs.
    A possible solution is to implement a distributed messaging service.
    Providing the Che Server IP might also solve the issue on a local OpenShift, but it won't work in a multi-cluster case like OSIO, where the Che Server is started on one OpenShift cluster and the workspaces on another. Registered the issue Workspace start hangs up on bootstrapping phase when Rolling Update is in progress #9740.

  • The workspace agent needs to communicate with the Che Server, and it uses the Che Server Route URL for its requests. During a Rolling Update there may be hanging requests (for a few minutes) while the OpenShift Router picks up the new Che Server. As a result, the Workspace Agent fails to start when it requests the Che Server while the Che Servers are being switched over. Logs are attached.

WsAgentLogs
2018-04-26 15:48:09,020[main]             [INFO ] [o.a.c.core.StandardEngine 259]       - Starting Servlet Engine: Apache Tomcat/8.5.23
2018-04-26 15:48:09,111[ost-startStop-1]  [INFO ] [o.a.c.startup.HostConfig 959]        - Deploying web application archive [/home/user/che/ws-agent/webapps/ROOT.war]
2018-04-26 15:48:43,050[ost-startStop-1]  [ERROR] [o.a.c.c.C.[.[localhost].[/] 4752]    - Exception sending context initialized event to listener instance of class [org.eclipse.che.inject.CheBootstrap]
com.google.inject.CreationException: Unable to create injector, see the following errors:
1) Invocation error of method public void org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer.initialize() throws org.eclipse.che.api.core.ConflictException,org.eclipse.che.api.core.NotFoundException,org.eclipse.che.api.core.ServerException,org.eclipse.che.api.core.ForbiddenException on org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer@1e57ace0
  at org.eclipse.che.api.project.server.ProjectApiModule.configure(ProjectApiModule.java:60) (via modules: org.eclipse.che.wsagent.server.WsAgentModule -> org.eclipse.che.api.project.server.ProjectApiModule)
  while locating org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer
1 error
        at com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:470)
        at com.google.inject.internal.InternalInjectorCreator.injectDynamically(InternalInjectorCreator.java:184)
        at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:110)
        at com.google.inject.Guice.createInjector(Guice.java:99)
        at org.everrest.guice.servlet.EverrestGuiceContextListener.getInjector(EverrestGuiceContextListener.java:140)
        at com.google.inject.servlet.GuiceServletContextListener.contextInitialized(GuiceServletContextListener.java:47)
        at org.everrest.guice.servlet.EverrestGuiceContextListener.contextInitialized(EverrestGuiceContextListener.java:85)
        at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4745)
        at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5207)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:752)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:728)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:734)
        at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:988)
        at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1860)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.eclipse.che.api.core.ServerException: java.io.IOException: Failed access: http://che-eclipse-che.172.19.20.43.nip.io/api/workspace/workspace74g0awl999pk64t9?token, method: GET, response code: 504, message: 

504 Gateway Time-out

The server didn't respond in time.
	at org.eclipse.che.api.project.server.impl.WorkspaceProjectSynchronizer.workspaceDto(WorkspaceProjectSynchronizer.java:224)
	at org.eclipse.che.api.project.server.impl.WorkspaceProjectSynchronizer.getAll(WorkspaceProjectSynchronizer.java:128)
	at org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer.initializeRegisteredProjects(OnWorkspaceStartProjectInitializer.java:63)
	at org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer.initialize(OnWorkspaceStartProjectInitializer.java:57)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.eclipse.che.inject.lifecycle.InitModule$1$1.afterInjection(InitModule.java:46)
	at com.google.inject.internal.MembersInjectorImpl.notifyListeners(MembersInjectorImpl.java:119)
	at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:121)
	at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:90)
	at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:268)
	at com.google.inject.internal.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:46)
	at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1092)
	at com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
	at com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:194)
	at com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:41)
	at com.google.inject.internal.InternalInjectorCreator$1.call(InternalInjectorCreator.java:205)
	at com.google.inject.internal.InternalInjectorCreator$1.call(InternalInjectorCreator.java:199)
	at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1085)
	at com.google.inject.internal.InternalInjectorCreator.loadEagerSingletons(InternalInjectorCreator.java:199)
	at com.google.inject.internal.InternalInjectorCreator.injectDynamically(InternalInjectorCreator.java:180)
	... 18 common frames omitted
Caused by: java.io.IOException: Failed access: http://che-eclipse-che.172.19.20.43.nip.io/api/workspace/workspace74g0awl999pk64t9?token, method: GET, response code: 504, message:

504 Gateway Time-out

The server didn't respond in time.
	at org.eclipse.che.api.core.rest.DefaultHttpJsonRequest.doRequest(DefaultHttpJsonRequest.java:284)
	at org.eclipse.che.api.core.rest.DefaultHttpJsonRequest.request(DefaultHttpJsonRequest.java:163)
	at org.eclipse.che.api.project.server.impl.WorkspaceProjectSynchronizer.workspaceDto(WorkspaceProjectSynchronizer.java:221)
	... 40 common frames omitted
2018-04-26 15:48:43,051[ost-startStop-1]  [ERROR] [o.a.c.core.StandardContext 5208]     - One or more listeners failed to start. Full details will be found in the appropriate container log file
2018-04-26 15:48:43,053[ost-startStop-1]  [ERROR] [o.a.c.core.StandardContext 5259]     - Context [] startup failed due to previous errors
2018-04-26 15:48:43,067[ost-startStop-1]  [INFO ] [o.a.c.startup.HostConfig 1023]       - Deployment of web application archive [/home/user/che/ws-agent/webapps/ROOT.war] has finished in [33,955] ms
2018-04-26 15:48:43,069[main]             [INFO ] [o.a.c.http11.Http11NioProtocol 632]  - Starting ProtocolHandler ["http-nio-4401"]
2018-04-26 15:48:43,077[main]             [INFO ] [o.a.catalina.startup.Catalina 683]   - Server startup in 34108 ms
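For reference, the continuous check mentioned in the probe item above was a bash script; an equivalent minimal sketch in Java is shown below. The route host is copied from the attached logs, so treat it as illustrative. It polls /api/system/state once per second and prints the HTTP status, which can be compared with the liveness/readiness probe behaviour during the rolling update:

import java.net.HttpURLConnection;
import java.net.URL;

public class SystemStatePoller {
  public static void main(String[] args) throws Exception {
    // Illustrative route host copied from the logs; replace with the real Che route.
    URL state = new URL("http://che-eclipse-che.172.19.20.43.nip.io/api/system/state");
    while (true) {
      HttpURLConnection conn = (HttpURLConnection) state.openConnection();
      conn.setConnectTimeout(5_000);
      conn.setReadTimeout(5_000);
      try {
        // Expected to keep returning 200 OK even while the probes report failure.
        System.out.println(System.currentTimeMillis() + " /api/system/state -> " + conn.getResponseCode());
      } catch (Exception e) {
        System.out.println(System.currentTimeMillis() + " /api/system/state -> " + e.getMessage());
      } finally {
        conn.disconnect();
      }
      Thread.sleep(1_000);
    }
  }
}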
skabashnyuk added the kind/task and team/platform labels Apr 13, 2018
gorkem mentioned this issue Apr 13, 2018
akorneta self-assigned this Apr 18, 2018
akorneta added the status/in-progress label Apr 18, 2018

sleshchenko commented Apr 20, 2018

@akorneta and I are working on testing the Hot Update functionality. Starting a Rolling Update while Selenium tests are running helped us discover the following issues:

  • Server checkers don't work correctly after workspace recovery

  • Many tests are skipped because their preparation steps fail.
    (screenshot: screenshot_20180420_141055)

  • Che Server logs contain the following error messages:

Che Server logs
2018-04-20 09:31:11,651[main]             [INFO ] [o.a.catalina.startup.Catalina 683]   - Server startup in 13478 ms

Received SIGTERM
Stopping Che server running on localhost:8080
2018-04-20 09:33:37,191[main]             [INFO ] [o.a.c.core.StandardServer 524]       - A valid shutdown command was received via the shutdown port. Stopping the Server instance.
2018-04-20 09:33:37,191[main]             [INFO ] [o.a.c.http11.Http11NioProtocol 658]  - Pausing ProtocolHandler ["http-nio-8080"]
2018-04-20 09:33:37,242[main]             [INFO ] [o.a.c.core.StandardService 477]      - Stopping service [Catalina]
2018-04-20 09:33:37,355[ost-startStop-2]  [INFO ] [o.e.c.a.s.server.SystemManager 132]  - Preparing system to shutdown
2018-04-20 09:33:37,370[ost-startStop-2]  [INFO ] [o.e.c.a.s.s.ServiceTerminator 67]    - Suspending down 'workspace' service
2018-04-20 09:33:37,379[ost-startStop-2]  [INFO ] [.e.c.a.w.s.WorkspaceSharedPool 115]  - Shutdown workspace threads pool, wait 30s to stop normally
2018-04-20 09:33:37,379[ost-startStop-2]  [INFO ] [.e.c.a.w.s.WorkspaceSharedPool 127]  - Workspace threads pool is terminated
2018-04-20 09:33:37,379[ost-startStop-2]  [INFO ] [o.e.c.a.s.s.ServiceTerminator 81]    - Service 'workspace' is suspended
2018-04-20 09:33:37,381[ost-startStop-2]  [INFO ] [o.e.c.a.s.s.ServiceTerminator 67]    - Suspending down 'KubernetesClient' service
2018-04-20 09:33:37,381[ost-startStop-2]  [INFO ] [o.e.c.a.s.s.ServiceTerminator 81]    - Service 'KubernetesClient' is suspended
2018-04-20 09:33:37,382[ost-startStop-2]  [INFO ] [o.e.c.a.s.server.SystemManager 139]  - System is ready to shutdown
2018-04-20 09:33:37,385[ost-startStop-2]  [INFO ] [e.c.w.i.k.n.p.PVCSubPathHelper 197]  - PVCSubPathHelper thread pool is terminated
2018-04-20 09:33:37,394[ost-startStop-2]  [INFO ] [o.e.c.a.core.util.FileCleaner 80]    - File cleaner is stopped
2018-04-20 09:33:37,448[main]             [INFO ] [o.a.c.http11.Http11NioProtocol 687]  - Stopping ProtocolHandler ["http-nio-8080"]
2018-04-20 09:33:37,450[main]             [INFO ] [o.a.c.http11.Http11NioProtocol 707]  - Destroying ProtocolHandler ["http-nio-8080"]
2018-04-20 09:33:50,098[-24-94xb7-58653]  [ERROR] [unknown.jul.logger 49]               - JGRP000019: failed passing message to receiver
java.lang.NoClassDefFoundError: java/io/ObjectStreamClass
	at org.eclipse.persistence.internal.helper.CustomObjectInputStream.resolveClass(CustomObjectInputStream.java:42)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1826)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
	at org.eclipse.persistence.sessions.serializers.JavaSerializer.deserialize(JavaSerializer.java:56)
	at org.eclipse.persistence.internal.sessions.coordination.jgroups.JGroupsRemoteConnection.onMessage(JGroupsRemoteConnection.java:130)
	at org.eclipse.persistence.internal.sessions.coordination.jgroups.JGroupsRemoteConnection$1.receive(JGroupsRemoteConnection.java:56)
	at org.jgroups.JChannel.up(JChannel.java:768)
	at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1034)
	at org.jgroups.stack.Protocol.up(Protocol.java:426)
	at org.jgroups.stack.Protocol.up(Protocol.java:426)
	at org.jgroups.protocols.FRAG2.up(FRAG2.java:182)
	at org.jgroups.protocols.FlowControl.up(FlowControl.java:438)
	at org.jgroups.stack.Protocol.up(Protocol.java:426)
	at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:294)
	at org.jgroups.protocols.UNICAST.up(UNICAST.java:425)
	at org.jgroups.protocols.pbcast.NAKACK2.deliverBatch(NAKACK2.java:989)
	at org.jgroups.protocols.pbcast.NAKACK2.removeAndPassUp(NAKACK2.java:919)
	at org.jgroups.protocols.pbcast.NAKACK2.handleMessage(NAKACK2.java:851)
	at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:611)
	at org.jgroups.protocols.BARRIER.up(BARRIER.java:152)
	at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:155)
	at org.jgroups.protocols.FD.up(FD.java:260)
	at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:325)
	at org.jgroups.protocols.MERGE2.up(MERGE2.java:237)
	at org.jgroups.protocols.Discovery.up(Discovery.java:296)
	at org.jgroups.protocols.TP.passMessageUp(TP.java:1658)
	at org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1876)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: Illegal access: this web application instance has been stopped already. Could not load [java.io.ObjectStreamClass]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
	at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForClassLoading(WebappClassLoaderBase.java:1301)
	at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1158)
	at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1119)
	... 33 common frames omitted
Caused by: java.lang.IllegalStateException: Illegal access: this web application instance has been stopped already. Could not load [java.io.ObjectStreamClass]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
	at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForResourceLoading(WebappClassLoaderBase.java:1311)
	at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForClassLoading(WebappClassLoaderBase.java:1299)
	... 35 common frames omitted

The plan is to fix the discovered issues and rerun the Selenium tests with the Rolling Update.

One more issue that was discovered but is not related to the Rolling Update:

  • Multiuser Che Server on Kubernetes/OpenShift does not stop normally. Destroying the context hangs and the pod is killed when the timeout is reached.


l0rd commented Apr 20, 2018

Thank you for this status report @sleshchenko. I have a few questions:

  1. The last issue you mention in your comment (Che Server is not stopped normally...) looks related to the second one you mentioned (Che Server logs contain the following error messages...). Is that the case?

  2. I am of course interested in having the hot update on openshift.io as soon as possible :-) And I am wondering whether the tests that are still failing cover scenarios that won't happen on openshift.io. What I mean is that, if we have the chance, we should release a first hot update version that works for osio needs (first iteration) and then release a second iteration of the hot update that also works for corner cases that are not relevant to osio. In particular, I am thinking about the following osio restrictions that may limit the testing scenarios we should consider in the first iteration:

  • we support fewer stacks and ssh is disabled (hence the kind of agents running in workspaces is really limited)
  • we can assume that when we start a Che update there will be no workspace starting

sleshchenko commented

@l0rd

  1. The last issue you mention in your comment (Che Server is not stopped normally...) looks related to the second one you mentioned (Che Server logs contain the following error messages...). Is that the case?

We investigated it a bit more, and it looks like the reason is the closing of JGroups channels. I hope we'll fix it soon in our hot_update branch.

Also, we discovered that the 503 Service Unavailable issue does not happen during the rolling update itself. Sometimes the Che Server crashes after it has been updated. I hope this issue can be fixed by closing the JGroups channels correctly.
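For context, closing a JGroups channel on shutdown looks roughly like the minimal sketch below. This is only an illustration of the JGroups API, not the actual fix in the hot_update branch, and the class and cluster names are hypothetical:

import org.jgroups.JChannel;

public class CacheCoordinationChannel {
  private final JChannel channel;

  public CacheCoordinationChannel() throws Exception {
    // Hypothetical channel used for EclipseLink cache coordination.
    channel = new JChannel();
    channel.connect("eclipselink-cache-coordination");
  }

  // Should be invoked during Che server shutdown, before Tomcat stops the webapp class loader,
  // so that no cache-coordination message is delivered to an already-stopped webapp.
  public void shutdown() {
    channel.disconnect(); // leave the cluster
    channel.close();      // release the protocol stack and its threads
  }
}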

skabashnyuk commented

I am of course interested in having the hot update on openshift.io as soon as possible :-) And I am wondering whether the tests that are still failing cover scenarios that won't happen on openshift.io. What I mean is that, if we have the chance, we should release a first hot update version that works for osio needs (first iteration) and then release a second iteration of the hot update that also works for corner cases that are not relevant to osio. In particular, I am thinking about the following osio restrictions that may limit the testing scenarios we should consider in the first iteration:

I see your point @l0rd. We will try to do a quick investigation into what we can propose as a fast alternative. Maybe we can propose some kind of "manual hot update".


akorneta commented Apr 24, 2018

While testing the hot_update branch, @sleshchenko and I found out a few things:

  • Che does not stop the persistence service on shutdown, and since we added JGroups for EclipseLink cache coordination, we need to stop the persistence service to be able to stop the Che server gracefully.
  • The JGroups-based EclipseLink cache coordinator does not close the EclipseLinkCommandChannel, which leads to problems when stopping Che (this needs detailed investigation).
  • Terminate the components that work with the database (e.g. the temporary workspace cleaner and the workspace activity scheduler) before stopping the persistence service (the list may be supplemented).
  • After a rolling update of the Che server, the liveness and readiness probes start failing after some period of time (not measured exactly), while /api/system/state stays available with status 200 OK (checked with a browser and with a bash script that polled it continuously). The event that we received from OpenShift is:
    Killing container with id docker://che:Need to kill Pod.

At the current state, we have a workaround for the JGroups cache coordinator that allows us to stop the Che server gracefully, which is needed for the rolling update.
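To make the intended ordering concrete, here is a minimal sketch of the shutdown sequence described above; the field names are illustrative placeholders, not the actual Che components:

import javax.persistence.EntityManagerFactory;
import org.jgroups.JChannel;

public class GracefulShutdownSketch {
  private final AutoCloseable temporaryWorkspaceCleaner;   // DB-using component (placeholder)
  private final AutoCloseable workspaceActivityScheduler;  // DB-using component (placeholder)
  private final EntityManagerFactory entityManagerFactory; // JPA persistence unit
  private final JChannel cacheCoordinationChannel;         // EclipseLink cache-coordination channel

  public GracefulShutdownSketch(AutoCloseable cleaner, AutoCloseable scheduler,
                                EntityManagerFactory emf, JChannel channel) {
    this.temporaryWorkspaceCleaner = cleaner;
    this.workspaceActivityScheduler = scheduler;
    this.entityManagerFactory = emf;
    this.cacheCoordinationChannel = channel;
  }

  public void shutdown() throws Exception {
    // 1. Stop everything that still talks to the database.
    temporaryWorkspaceCleaner.close();
    workspaceActivityScheduler.close();
    // 2. Stop the persistence service itself.
    entityManagerFactory.close();
    // 3. Close the JGroups channel so no cache-coordination message arrives
    //    after the webapp class loader has been stopped (see the NoClassDefFoundError above).
    cacheCoordinationChannel.close();
  }
}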

sleshchenko removed their assignment Apr 27, 2018
sleshchenko commented

We're still testing the Hot Update functionality and trying to fix all the issues we have faced. We discovered the following additional issues:

  • The workspace agent needs to communicate with the Che Server, and it uses the Che Server Route URL for its requests. During a Rolling Update there may be hanging requests (for a few minutes) while the OpenShift Router picks up the new Che Server. As a result, the Workspace Agent fails to start when it requests the Che Server while the Che Servers are being switched over. Logs are attached (a retry sketch is shown after this list).
WsAgentLogs
2018-04-26 15:48:09,020[main]             [INFO ] [o.a.c.core.StandardEngine 259]       - Starting Servlet Engine: Apache Tomcat/8.5.23
2018-04-26 15:48:09,111[ost-startStop-1]  [INFO ] [o.a.c.startup.HostConfig 959]        - Deploying web application archive [/home/user/che/ws-agent/webapps/ROOT.war]
2018-04-26 15:48:43,050[ost-startStop-1]  [ERROR] [o.a.c.c.C.[.[localhost].[/] 4752]    - Exception sending context initialized event to listener instance of class [org.eclipse.che.inject.CheBootstrap]
com.google.inject.CreationException: Unable to create injector, see the following errors:
1) Invocation error of method public void org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer.initialize() throws org.eclipse.che.api.core.ConflictException,org.eclipse.che.api.core.NotFoundException,org.eclipse.che.api.core.ServerException,org.eclipse.che.api.core.ForbiddenException on org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer@1e57ace0
  at org.eclipse.che.api.project.server.ProjectApiModule.configure(ProjectApiModule.java:60) (via modules: org.eclipse.che.wsagent.server.WsAgentModule -> org.eclipse.che.api.project.server.ProjectApiModule)
  while locating org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer
1 error
        at com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:470)
        at com.google.inject.internal.InternalInjectorCreator.injectDynamically(InternalInjectorCreator.java:184)
        at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:110)
        at com.google.inject.Guice.createInjector(Guice.java:99)
        at org.everrest.guice.servlet.EverrestGuiceContextListener.getInjector(EverrestGuiceContextListener.java:140)
        at com.google.inject.servlet.GuiceServletContextListener.contextInitialized(GuiceServletContextListener.java:47)
        at org.everrest.guice.servlet.EverrestGuiceContextListener.contextInitialized(EverrestGuiceContextListener.java:85)
        at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4745)
        at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5207)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:752)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:728)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:734)
        at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:988)
        at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1860)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.eclipse.che.api.core.ServerException: java.io.IOException: Failed access: http://che-eclipse-che.172.19.20.43.nip.io/api/workspace/workspace74g0awl999pk64t9?token, method: GET, response code: 504, message: 

504 Gateway Time-out

The server didn't respond in time.
	at org.eclipse.che.api.project.server.impl.WorkspaceProjectSynchronizer.workspaceDto(WorkspaceProjectSynchronizer.java:224)
	at org.eclipse.che.api.project.server.impl.WorkspaceProjectSynchronizer.getAll(WorkspaceProjectSynchronizer.java:128)
	at org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer.initializeRegisteredProjects(OnWorkspaceStartProjectInitializer.java:63)
	at org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer.initialize(OnWorkspaceStartProjectInitializer.java:57)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.eclipse.che.inject.lifecycle.InitModule$1$1.afterInjection(InitModule.java:46)
	at com.google.inject.internal.MembersInjectorImpl.notifyListeners(MembersInjectorImpl.java:119)
	at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:121)
	at com.google.inject.internal.ConstructorInjector.construct(ConstructorInjector.java:90)
	at com.google.inject.internal.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:268)
	at com.google.inject.internal.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:46)
	at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1092)
	at com.google.inject.internal.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:40)
	at com.google.inject.internal.SingletonScope$1.get(SingletonScope.java:194)
	at com.google.inject.internal.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:41)
	at com.google.inject.internal.InternalInjectorCreator$1.call(InternalInjectorCreator.java:205)
	at com.google.inject.internal.InternalInjectorCreator$1.call(InternalInjectorCreator.java:199)
	at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1085)
	at com.google.inject.internal.InternalInjectorCreator.loadEagerSingletons(InternalInjectorCreator.java:199)
	at com.google.inject.internal.InternalInjectorCreator.injectDynamically(InternalInjectorCreator.java:180)
	... 18 common frames omitted
Caused by: java.io.IOException: Failed access: http://che-eclipse-che.172.19.20.43.nip.io/api/workspace/workspace74g0awl999pk64t9?token, method: GET, response code: 504, message:

504 Gateway Time-out

The server didn't respond in time.
	at org.eclipse.che.api.core.rest.DefaultHttpJsonRequest.doRequest(DefaultHttpJsonRequest.java:284)
	at org.eclipse.che.api.core.rest.DefaultHttpJsonRequest.request(DefaultHttpJsonRequest.java:163)
	at org.eclipse.che.api.project.server.impl.WorkspaceProjectSynchronizer.workspaceDto(WorkspaceProjectSynchronizer.java:221)
	... 40 common frames omitted
2018-04-26 15:48:43,051[ost-startStop-1]  [ERROR] [o.a.c.core.StandardContext 5208]     - One or more listeners failed to start. Full details will be found in the appropriate container log file
2018-04-26 15:48:43,053[ost-startStop-1]  [ERROR] [o.a.c.core.StandardContext 5259]     - Context [] startup failed due to previous errors
2018-04-26 15:48:43,067[ost-startStop-1]  [INFO ] [o.a.c.startup.HostConfig 1023]       - Deployment of web application archive [/home/user/che/ws-agent/webapps/ROOT.war] has finished in [33,955] ms
2018-04-26 15:48:43,069[main]             [INFO ] [o.a.c.http11.Http11NioProtocol 632]  - Starting ProtocolHandler ["http-nio-4401"]
2018-04-26 15:48:43,077[main]             [INFO ] [o.a.catalina.startup.Catalina 683]   - Server startup in 34108 ms
  • The bootstrapper uses the Che Server Route URL to notify the Che Server when bootstrapping is finished. This is a problem because there is a possible (and reproducible) case where the stopping Che Server is starting a workspace and waiting for bootstrapper events, BUT the bootstrapper connects to the Che Server Route URL and actually reaches the NEW Che Server instance. As a result, the old Che Server never receives the event that the workspace is bootstrapped, and the bootstrapping wait timeout occurs. A possible solution is to implement a distributed messaging service.
    Providing the Che Server IP might also solve the issue on a local OpenShift, but it won't work in a multi-cluster case like OSIO, where the Che Server is started on one OpenShift cluster and the workspaces on another.
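A minimal sketch of how the ws-agent side could tolerate the router switch-over is a bounded retry around the workspace request, as below. The endpoint is copied from the attached logs and the helper is hypothetical; this is not the actual ws-agent code:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class WorkspaceDtoFetchSketch {

  // Retries the GET while the OpenShift router returns 504 during the Rolling Update,
  // instead of failing the ws-agent start on the first attempt.
  public static int fetchWithRetry(String workspaceUrl, int maxAttempts, long delayMs)
      throws IOException, InterruptedException {
    IOException last = new IOException("no attempt was made");
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      HttpURLConnection conn = (HttpURLConnection) new URL(workspaceUrl).openConnection();
      conn.setConnectTimeout(10_000);
      conn.setReadTimeout(10_000);
      try {
        int code = conn.getResponseCode();
        if (code != 504) {
          return code; // the router reached a live Che Server (or returned a real error worth reporting)
        }
        last = new IOException("504 Gateway Time-out on attempt " + attempt);
      } catch (IOException e) {
        last = e; // the connection may also drop while the router switches servers
      } finally {
        conn.disconnect();
      }
      Thread.sleep(delayMs);
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    // Endpoint copied from the attached logs; illustrative only.
    int code = fetchWithRetry(
        "http://che-eclipse-che.172.19.20.43.nip.io/api/workspace/workspace74g0awl999pk64t9",
        10, 5_000);
    System.out.println("workspace API responded with HTTP " + code);
  }
}

A retry only hides the temporary 504s; it does not solve the bootstrapper-event problem above, which still needs a distributed messaging service or an equivalent mechanism.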


akorneta commented Apr 27, 2018

We also created an issue for the eclipse-link extension and added a workaround for JChannel closing so that the Che master can stop normally.

One more thing that was missed:

  • PersistenceService must be stopped not on termination but on PreDestroy, and it must be the last thing to do.
    We cannot stop PersistenceService on termination because system termination might be called by a Che admin via REST; if a user then tries to get some information from the database, they will get an error because all the connections and sessions are already closed.
    Issue: [Hot_update] Persistence service should not be stopped when Che is shut down via rest #9609
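A minimal sketch of what this could look like, assuming a @PreDestroy-style lifecycle hook; the class and field names are illustrative, not the actual PersistenceService:

import javax.annotation.PreDestroy;
import javax.persistence.EntityManagerFactory;

public class PersistenceServiceSketch {
  private final EntityManagerFactory emf;

  public PersistenceServiceSketch(EntityManagerFactory emf) {
    this.emf = emf;
  }

  // Runs only when the container is really going down, not in the ServiceTerminator path,
  // so a REST-triggered "system terminate" leaves the DB connections and sessions usable.
  // It should be the last shutdown step, after all DB-using components are terminated.
  @PreDestroy
  public void stop() {
    if (emf.isOpen()) {
      emf.close();
    }
  }
}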

skabashnyuk commented

Work will be continued here: #9657
