Hot update testing on working master #9415
@akorneta and I are working on testing the Hot Update functionality. Starting a Rolling Update while the Selenium tests are running helped us discover the following issues:
Che Server logs:

```
2018-04-20 09:31:11,651[main] [INFO ] [o.a.catalina.startup.Catalina 683] - Server startup in 13478 ms
Received SIGTERM
Stopping Che server running on localhost:8080
2018-04-20 09:33:37,191[main] [INFO ] [o.a.c.core.StandardServer 524] - A valid shutdown command was received via the shutdown port. Stopping the Server instance.
2018-04-20 09:33:37,191[main] [INFO ] [o.a.c.http11.Http11NioProtocol 658] - Pausing ProtocolHandler ["http-nio-8080"]
2018-04-20 09:33:37,242[main] [INFO ] [o.a.c.core.StandardService 477] - Stopping service [Catalina]
2018-04-20 09:33:37,355[ost-startStop-2] [INFO ] [o.e.c.a.s.server.SystemManager 132] - Preparing system to shutdown
2018-04-20 09:33:37,370[ost-startStop-2] [INFO ] [o.e.c.a.s.s.ServiceTerminator 67] - Suspending down 'workspace' service
2018-04-20 09:33:37,379[ost-startStop-2] [INFO ] [.e.c.a.w.s.WorkspaceSharedPool 115] - Shutdown workspace threads pool, wait 30s to stop normally
2018-04-20 09:33:37,379[ost-startStop-2] [INFO ] [.e.c.a.w.s.WorkspaceSharedPool 127] - Workspace threads pool is terminated
2018-04-20 09:33:37,379[ost-startStop-2] [INFO ] [o.e.c.a.s.s.ServiceTerminator 81] - Service 'workspace' is suspended
2018-04-20 09:33:37,381[ost-startStop-2] [INFO ] [o.e.c.a.s.s.ServiceTerminator 67] - Suspending down 'KubernetesClient' service
2018-04-20 09:33:37,381[ost-startStop-2] [INFO ] [o.e.c.a.s.s.ServiceTerminator 81] - Service 'KubernetesClient' is suspended
2018-04-20 09:33:37,382[ost-startStop-2] [INFO ] [o.e.c.a.s.server.SystemManager 139] - System is ready to shutdown
2018-04-20 09:33:37,385[ost-startStop-2] [INFO ] [e.c.w.i.k.n.p.PVCSubPathHelper 197] - PVCSubPathHelper thread pool is terminated
2018-04-20 09:33:37,394[ost-startStop-2] [INFO ] [o.e.c.a.core.util.FileCleaner 80] - File cleaner is stopped
2018-04-20 09:33:37,448[main] [INFO ] [o.a.c.http11.Http11NioProtocol 687] - Stopping ProtocolHandler ["http-nio-8080"]
2018-04-20 09:33:37,450[main] [INFO ] [o.a.c.http11.Http11NioProtocol 707] - Destroying ProtocolHandler ["http-nio-8080"]
2018-04-20 09:33:50,098[-24-94xb7-58653] [ERROR] [unknown.jul.logger 49] - JGRP000019: failed passing message to receiver
java.lang.NoClassDefFoundError: java/io/ObjectStreamClass
    at org.eclipse.persistence.internal.helper.CustomObjectInputStream.resolveClass(CustomObjectInputStream.java:42)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1826)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2000)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
    at org.eclipse.persistence.sessions.serializers.JavaSerializer.deserialize(JavaSerializer.java:56)
    at org.eclipse.persistence.internal.sessions.coordination.jgroups.JGroupsRemoteConnection.onMessage(JGroupsRemoteConnection.java:130)
    at org.eclipse.persistence.internal.sessions.coordination.jgroups.JGroupsRemoteConnection$1.receive(JGroupsRemoteConnection.java:56)
    at org.jgroups.JChannel.up(JChannel.java:768)
    at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1034)
    at org.jgroups.stack.Protocol.up(Protocol.java:426)
    at org.jgroups.stack.Protocol.up(Protocol.java:426)
    at org.jgroups.protocols.FRAG2.up(FRAG2.java:182)
    at org.jgroups.protocols.FlowControl.up(FlowControl.java:438)
    at org.jgroups.stack.Protocol.up(Protocol.java:426)
    at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:294)
    at org.jgroups.protocols.UNICAST.up(UNICAST.java:425)
    at org.jgroups.protocols.pbcast.NAKACK2.deliverBatch(NAKACK2.java:989)
    at org.jgroups.protocols.pbcast.NAKACK2.removeAndPassUp(NAKACK2.java:919)
    at org.jgroups.protocols.pbcast.NAKACK2.handleMessage(NAKACK2.java:851)
    at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:611)
    at org.jgroups.protocols.BARRIER.up(BARRIER.java:152)
    at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:155)
    at org.jgroups.protocols.FD.up(FD.java:260)
    at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:325)
    at org.jgroups.protocols.MERGE2.up(MERGE2.java:237)
    at org.jgroups.protocols.Discovery.up(Discovery.java:296)
    at org.jgroups.protocols.TP.passMessageUp(TP.java:1658)
    at org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1876)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: Illegal access: this web application instance has been stopped already. Could not load [java.io.ObjectStreamClass]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
    at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForClassLoading(WebappClassLoaderBase.java:1301)
    at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1158)
    at org.apache.catalina.loader.WebappClassLoaderBase.loadClass(WebappClassLoaderBase.java:1119)
    ... 33 common frames omitted
Caused by: java.lang.IllegalStateException: Illegal access: this web application instance has been stopped already. Could not load [java.io.ObjectStreamClass]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
    at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForResourceLoading(WebappClassLoaderBase.java:1311)
    at org.apache.catalina.loader.WebappClassLoaderBase.checkStateForClassLoading(WebappClassLoaderBase.java:1299)
    ... 35 common frames omitted
```

The plan is to fix the discovered issues and rerun the Selenium tests with the Rolling Update. One more issue was discovered that is not related to the Rolling Update:
Thank you for this status report @sleshchenko. I have a few questions:
We investigated it a bit more, and it looks like the reason for it is the closing of JGroups channels. I hope we'll fix it soon.
I see your point @l0rd. We will do some quick investigation into what we can propose as a fast alternative. Maybe we can propose some kind of "manual hot update".
While testing the hot update branch, @sleshchenko and I found out a few things:
In the current state, we have a trick over the JGroups cache coordinator that allows us to stop the Che server gracefully, which is needed for the rolling update.
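For context, the essence of the trick is to make sure the JGroups channel used by the eclipse-link cache coordinator is closed before Tomcat tears down the webapp classloader, so no cluster message is deserialized against an already-stopped classloader (the `NoClassDefFoundError` above). A minimal sketch of the idea, assuming we hold a reference to the channel (the class and field names are illustrative, not the actual Che code):

```java
import org.jgroups.JChannel;

/**
 * Illustrative sketch: close the cache-coordination channel before the
 * webapp stops, so JGroups stops delivering messages to a classloader
 * that is already gone.
 */
public class CacheCoordinationShutdown {

  private final JChannel cacheCoordinationChannel; // assumed to be held/injected somewhere

  public CacheCoordinationShutdown(JChannel channel) {
    this.cacheCoordinationChannel = channel;
  }

  /** Invoked early in the server shutdown sequence. */
  public void onServerShutdown() {
    // Leave the cluster first, then release the transport; both calls
    // are safe to repeat if the channel is already disconnected/closed.
    cacheCoordinationChannel.disconnect();
    cacheCoordinationChannel.close();
  }
}
```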
We're still testing the Hot Update functionality and trying to fix all the issues we've faced. So far we have discovered the following issues:
WsAgent logs:

```
2018-04-26 15:48:09,020[main] [INFO ] [o.a.c.core.StandardEngine 259] - Starting Servlet Engine: Apache Tomcat/8.5.23
2018-04-26 15:48:09,111[ost-startStop-1] [INFO ] [o.a.c.startup.HostConfig 959] - Deploying web application archive [/home/user/che/ws-agent/webapps/ROOT.war]
2018-04-26 15:48:43,050[ost-startStop-1] [ERROR] [o.a.c.c.C.[.[localhost].[/] 4752] - Exception sending context initialized event to listener instance of class [org.eclipse.che.inject.CheBootstrap]
com.google.inject.CreationException: Unable to create injector, see the following errors:

1) Invocation error of method public void org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer.initialize() throws org.eclipse.che.api.core.ConflictException,org.eclipse.che.api.core.NotFoundException,org.eclipse.che.api.core.ServerException,org.eclipse.che.api.core.ForbiddenException on org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer@1e57ace0
  at org.eclipse.che.api.project.server.ProjectApiModule.configure(ProjectApiModule.java:60) (via modules: org.eclipse.che.wsagent.server.WsAgentModule -> org.eclipse.che.api.project.server.ProjectApiModule)
  while locating org.eclipse.che.api.project.server.impl.OnWorkspaceStartProjectInitializer

1 error
    at com.google.inject.internal.Errors.throwCreationExceptionIfErrorsExist(Errors.java:470)
    at com.google.inject.internal.InternalInjectorCreator.injectDynamically(InternalInjectorCreator.java:184)
    at com.google.inject.internal.InternalInjectorCreator.build(InternalInjectorCreator.java:110)
    at com.google.inject.Guice.createInjector(Guice.java:99)
    at org.everrest.guice.servlet.EverrestGuiceContextListener.getInjector(EverrestGuiceContextListener.java:140)
    at com.google.inject.servlet.GuiceServletContextListener.contextInitialized(GuiceServletContextListener.java:47)
    at org.everrest.guice.servlet.EverrestGuiceContextListener.contextInitialized(EverrestGuiceContextListener.java:85)
    at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4745)
    at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5207)
    at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
    at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:752)
    at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:728)
    at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:734)
    at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:988)
    at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1860)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.eclipse.che.api.core.ServerException: java.io.IOException: Failed access: http://che-eclipse-che.172.19.20.43.nip.io/api/workspace/workspace74g0awl999pk64t9?token, method: GET, response code: 504, message:
```
We also created an issue for the eclipse-link extension and added a workaround for JChannel closing, so that the Che master can be stopped normally. One more thing that is still missing:
Work will be continued here: #9657
Description
We need to test the hot update procedure. Proposed scenario: run hot update several times during Selenium testing.
During Hot Update testing we discovered the following issues:
Adapt ServiceTermination functionality to workspaces recovering #9317
Add an ability to use distributed cache for storing workspace statuses in WorkspaceRuntimes #9206
Server checkers don't work correctly after workspace recovery
Many tests are skipped because their preparation steps fail.
![screenshot_20180420_141055](https://user-images.githubusercontent.com/5887312/39048009-ad506d58-44a4-11e8-96e0-50a45df5af57.png)
The reason is that after a rolling update of the Che server, the liveness and readiness probes start failing after some period of time (we did not measure exactly how long), while /api/system/state stays available with status 200 OK (tried with a browser, and with a bash script that checked continuously; a rough equivalent of that check is sketched below). The event that we received from OpenShift is:
Killing container with id docker://che:Need to kill Pod.
It is reproducible only on a local OCP 3.7, but not on a local OCP 3.9 or the https://dev.rdu2c.fabric8.io:8443/ dev cluster, so it can be considered solved as long as it is not reproducible on the current test OCP.
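For anyone reproducing the probe behavior, here is a rough Java equivalent of the bash check mentioned above; the endpoint host is taken from the logs in this thread, and the 2-second interval is just an example:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/** Polls /api/system/state and prints the HTTP status every 2 seconds. */
public class SystemStateProbe {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://che-eclipse-che.172.19.20.43.nip.io/api/system/state");
    while (true) {
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setConnectTimeout(2_000);
      conn.setReadTimeout(2_000);
      try {
        System.out.println(System.currentTimeMillis() + " -> HTTP " + conn.getResponseCode());
      } catch (IOException e) {
        System.out.println(System.currentTimeMillis() + " -> unreachable: " + e.getMessage());
      } finally {
        conn.disconnect();
      }
      Thread.sleep(2_000);
    }
  }
}
```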
The Che Server logs contain the following error messages:
Che Server logs
The multiuser Che Server on Kubernetes/OpenShift does not stop normally: destroying the context hangs, and the pod is killed when the timeout is reached.
PersistenceService must be stopped not on termination but on PreDestroy, and it must be the last thing to do.
The reason we cannot stop the PersistenceService on termination is that system termination might be called by a Che admin via REST; if a user then tries to get some information from the database, they will get an error, because all the connections and sessions are closed.
Issue: [Hot_update] Persistence service should not be stopped when Che is shut down via rest #9609
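To make the ordering constraint above concrete, here is a minimal sketch (the class and interface names are illustrative, not the actual Che classes): REST-triggered termination only suspends the services that use the database, and the persistence layer itself is closed last, in a PreDestroy-style hook that runs only when the webapp really goes down:

```java
import javax.annotation.PreDestroy;
import javax.persistence.EntityManagerFactory;

public class ShutdownSequence {

  /** Anything that talks to the database and can be suspended. */
  interface Terminable {
    void terminate();
  }

  private final EntityManagerFactory emf;
  private final Iterable<Terminable> dbClients;

  public ShutdownSequence(EntityManagerFactory emf, Iterable<Terminable> dbClients) {
    this.emf = emf;
    this.dbClients = dbClients;
  }

  /** REST-triggered system termination: suspend services, keep the DB usable. */
  public void terminate() {
    for (Terminable t : dbClients) {
      t.terminate();
    }
    // Deliberately NOT closing the EntityManagerFactory here: after an admin
    // calls terminate via REST, reads from the database must still work.
  }

  /** Runs only when the JVM/webapp actually shuts down; last step overall. */
  @PreDestroy
  public void stopPersistence() {
    if (emf.isOpen()) {
      emf.close(); // closes all connections and sessions
    }
  }
}
```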
Che does not stop the persistence service on shutdown, but since we added JGroups for eclipse-link cache coordination, we now need to stop the persistence service in order to stop the Che server gracefully.
The JGroups-based eclipse-link cache coordinator does not close the EclipseLinkCommandChannel, which leads to problems when stopping Che (this needs to be investigated in detail).
Terminate the components that work with the database before stopping the persistence service (the list may be supplemented):
The bootstrapper uses the Che Server Route URL to notify the Che Server when bootstrapping is finished. There is an issue here, because there is a possible (and reproducible) case where the stopping Che Server is starting a workspace and waiting for bootstrapper events, BUT the bootstrapper connects to the Che Server Route URL and is actually connected to the NEW Che Server instance. This way the old Che Server never receives the event that the workspace is bootstrapped, and the bootstrapping wait timeout occurs.
A possible solution here is implementing a distributed messaging service (see the sketch after the issue link below).
Also, providing the Che Server IP may solve the issue on a local OpenShift, but it won't work in a multicluster case like OSIO, where the Che Server is started on one OpenShift cluster and the workspaces on another one. Registered the issue: Workspace start hangs up on bootstrapping phase when Rolling Update is in progress #9740.
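A rough sketch of what such a distributed relay could look like; JGroups is just one candidate transport here, and the cluster name and payload format are made up. The idea is that whichever Che Server instance receives the bootstrapper's REST call re-broadcasts the event to the whole cluster, so the (old) instance that is actually waiting for it still completes:

```java
import org.jgroups.JChannel;
import org.jgroups.Message;
import org.jgroups.ReceiverAdapter;

/** Illustrative relay that broadcasts bootstrapper events to all Che Server instances. */
public class BootstrapperEventRelay extends ReceiverAdapter {

  private final JChannel channel;

  public BootstrapperEventRelay() throws Exception {
    channel = new JChannel(); // default stack; a Kubernetes-friendly stack would be configured here
    channel.setReceiver(this);
    channel.connect("che-bootstrapper-events"); // hypothetical cluster name
  }

  /** Called by the instance that received the REST notification from the bootstrapper. */
  public void publish(String machineName, String event) throws Exception {
    channel.send(new Message(null, machineName + ":" + event)); // null destination = broadcast
  }

  /** Called on every instance; the one with a pending bootstrap completes it. */
  @Override
  public void receive(Message msg) {
    String payload = (String) msg.getObject();
    // ...look up the locally pending bootstrap (if any) and complete it
    System.out.println("bootstrapper event: " + payload);
  }
}
```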
The workspace agent needs to communicate with the Che Server somehow, and it uses the Che Server Route URL for its requests. But during a Rolling Update there may be hanging requests (for a few minutes) while the OpenShift Router picks up the new Che Server. As a result, the Workspace Agent fails to start when it requests the Che Server while the Che Servers are being switched. Logs are attached.
WsAgent logs
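Until the router switchover is seamless, the agent-side calls would have to tolerate a window of failing or hanging requests. A minimal retry sketch; the timeouts and retry budget are guesses, not Che's actual values:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/** Retries a GET against the Che master, tolerating the router-switch window. */
public class MasterClient {

  static int getWithRetry(String endpoint, int attempts, long backoffMs) throws Exception {
    IOException last = null;
    for (int i = 0; i < attempts; i++) {
      HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
      conn.setConnectTimeout(5_000);  // fail fast instead of hanging for minutes
      conn.setReadTimeout(10_000);
      try {
        int code = conn.getResponseCode();
        if (code < 500) {
          return code; // the router's 504 (see the logs above) falls through and is retried
        }
        last = new IOException("HTTP " + code);
      } catch (IOException e) {
        last = e; // connect/read timeout while the router switches instances
      } finally {
        conn.disconnect();
      }
      Thread.sleep(backoffMs);
    }
    throw last;
  }
}
```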