
watch should handle etcd old version exception #1075

Closed
chenchun opened this issue May 14, 2018 · 31 comments

@chenchun

chenchun commented May 14, 2018

I am running Spark on Kubernetes. This is the full issue description: https://issues.apache.org/jira/browse/SPARK-24266

I think the exception too old resource version: 21648111 (21653211) should be handled better in kubernetes-client instead of simply being thrown to the caller, because the resource version is cached by kubernetes-client, not by the caller.

if (status.getCode() == HTTP_GONE) {
  webSocketRef.set(null); // lose the ref: closing in close() would only generate a Broken pipe exception
  // shut down executor, etc.
  closeEvent(new KubernetesClientException(status));
  close();
  return;
}
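For callers on client versions that still surface this, one caller-side workaround is to open a fresh watch whenever onClose reports HTTP 410 (Gone). A minimal sketch, assuming a single-pod watch; the namespace and pod name are placeholders:

import java.net.HttpURLConnection;

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watch;
import io.fabric8.kubernetes.client.Watcher;

public class RewatchOnGone {
  private final KubernetesClient client = new DefaultKubernetesClient();

  Watch watchPod(String namespace, String podName) {
    return client.pods().inNamespace(namespace).withName(podName).watch(new Watcher<Pod>() {
      @Override
      public void eventReceived(Action action, Pod pod) {
        System.out.println(action + " " + pod.getMetadata().getName());
      }

      @Override
      public void onClose(KubernetesClientException cause) {
        // HTTP 410 means the client's cached resourceVersion is too old; start over with a fresh watch.
        if (cause != null && cause.getCode() == HttpURLConnection.HTTP_GONE) {
          watchPod(namespace, podName);
        }
      }
    });
  }
}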

@chenchun
Author

cc @foxish

@yujiantao

yujiantao commented Jun 22, 2019

We run into the same issue, any progress on this?

@chenchun
Author

@yujiantao For a simple fix, you can try commenting out these lines https://github.com/fabric8io/kubernetes-client/blob/v4.0.5/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java#L141-L143
We've been using it for a long time and everything is fine.

@SergSlipushenko

We are also running into the same issue with Spark; it would be great to see a fix eventually.

@stale

stale bot commented Oct 28, 2019

This issue has been automatically marked as stale because it has not had any activity since 90 days. It will be closed if no further activity occurs within 7 days. Thank you for your contributions!

@stale stale bot added the status/stale label Oct 28, 2019
@SergSlipushenko

It is still relevant for people using k8s-as-a-service on Azure. We applied the workaround mentioned by @chenchun and it works fine so far...

@stale

stale bot commented Jan 27, 2020

This issue has been automatically marked as stale because it has not had any activity since 90 days. It will be closed if no further activity occurs within 7 days. Thank you for your contributions!

@stale

stale bot commented Apr 28, 2020

This issue has been automatically marked as stale because it has not had any activity since 90 days. It will be closed if no further activity occurs within 7 days. Thank you for your contributions!

@stale stale bot added the status/stale label Apr 28, 2020
@stijndehaes

@yujiantao For a simple fix, you can try commenting out these lines https://github.com/fabric8io/kubernetes-client/blob/v4.0.5/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchConnectionManager.java#L141-L143
We've been using it for a long time and everything is fine.

@chenchun Is this something we could maybe put into the client, so that for some watches you don't have to care about resource version problems?

@stale stale bot removed the status/stale label Apr 30, 2020
@SergSlipushenko

It is still relevant and requires rebuilding Spark from source just to use a patched kubernetes-client 😞

@rohanKanojia
Member

@SergSlipushenko: Actually, we added it in the past but had to remove it, see #1800

@chenchun
Author

chenchun commented May 2, 2020

@stijndehaes I took a look at #1800. Would it be better to add a boolean flag controlling whether to re-watch automatically when a resource version change is received? That way we don't break the contract of sending HTTP_GONE if the resource version is old, and we also make life easier for people who don't care about the problem.

@rohanKanojia
Member

@chenchun: Adding a boolean flag to reconnect sounds nice to me 👍. This gets requested quite often. I think we need to tweak WatchConnectionManager.
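As a rough illustration of what such a tweak could look like inside WatchConnectionManager; reconnectOnGone and runWatch() are hypothetical names used for this sketch, not existing fields or methods of the client:

if (status.getCode() == HTTP_GONE) {
  if (reconnectOnGone) { // hypothetical opt-in flag, not part of the current client
    // Forget the stale cached resourceVersion and open a fresh watch instead of failing.
    resourceVersion.set(null); // assumes the manager keeps the version in an AtomicReference
    runWatch();              // hypothetical helper that re-opens the websocket watch
  } else {
    // Current behaviour: surface the failure to the caller via onClose().
    webSocketRef.set(null);
    closeEvent(new KubernetesClientException(status));
    close();
  }
  return;
}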

@rohanKanojia rohanKanojia added the enhancement and component/kubernetes-client labels May 2, 2020
@stijndehaes

I also noticed there is a deprecated watch method that allows you to set a resource version. Looking at the git history does not tell me much, but using that method lets you pass a null resource version, in which case you don't get the HTTP_GONE message, I believe?
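That deprecated overload takes the resource version as its first argument. Assuming an existing KubernetesClient named client, placeholder namespace/pod names, and the same Pod/Watcher imports as the snippets above, the call looks roughly like this:

Watch watch = client.pods()
    .inNamespace("default")
    .withName("spark-driver")
    .watch(null, new Watcher<Pod>() { // null resourceVersion: let the API server choose the starting point
      @Override
      public void eventReceived(Action action, Pod pod) {
        // handle ADDED / MODIFIED / DELETED events
      }

      @Override
      public void onClose(KubernetesClientException cause) {
        // cause carries the HTTP_GONE details when the watch closes because of an old resource version
      }
    });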

@manusa
Member

manusa commented May 4, 2020

We implemented SharedInformers (#1384) a while back to mimic client-go's behavior and provide an extra level of abstraction for Watch operations (Kubernetes client-go: watch.Interface vs. cache.NewInformer vs. cache.NewSharedIndexInformer? and Writing Controllers/SharedInformers)

Our implementation of SharedInformers already takes care of the HTTP_GONE scenario.

If you are looking for this reconnect behavior, I would encourage using SharedInformers instead of Watch, or else using watch with your own reconnect implementation. I think providing this behavior for watch too would duplicate a feature that's already available in Informers.

@rohanKanojia maybe we can use this issue to provide some additional examples and documentation on different use cases for SharedInformers. I think it's currently unclear that they should be the default approach to watching resources.

@stijndehaes

stijndehaes commented May 5, 2020

@manusa one big difference is that with a watcher we can watch one single pod. This is what spark-submit does when watching the driver; with a SharedInformer I am watching all the pods. Unless there is a way to watch a single pod?
Anyway, I guess this will use more resources than needed, unless I am mistaken and the overhead is negligible?

@manusa
Member

manusa commented May 6, 2020

I'm really unsure how we implemented the SharedInformer and what the current features are, but there should be an option to filter by labels (or even fields, i.e. metadata.name).
Any filtering option applicable to watch should be available to SharedInformers (since the latter is a superset of the former).
If this option isn't there, we should work on providing it instead.

@stijndehaes

@manusa found it! You can do it like this I think:

val podInformer = informers.sharedIndexInformerFor(
      classOf[Pod],
      classOf[PodList],
      new OperationContext().withNamespace(NAMESPACE).withName(PODNAME),
      60000)

@manusa
Member

manusa commented May 6, 2020

🎉 Good news @stijndehaes, thx for sharing!

@rohanKanojia
Member

@stijndehaes: Hi, I tried documenting SharedInformer support in a blog [0] this weekend. I tried out the code you listed in the comment, but unfortunately it didn't work as expected. I kept getting this error:

/usr/java/jdk-14.0.1/bin/java -javaagent:/opt/ideaIC-2019.3.3/idea-IC-193.6494.35/lib/idea_rt.jar=39383:/opt/ideaIC-2019.3.3/idea-IC-193.6494.35/bin -Dfile.encoding=UTF-8 -classpath <kubernetes-client 4.10.1 and its transitive dependencies, omitted for brevity> io.fabric8.NamespacedInformerDemo
May 11, 2020 12:37:57 PM io.fabric8.NamespacedInformerDemo main
INFO: Informer factory initialized.
May 11, 2020 12:37:58 PM io.fabric8.NamespacedInformerDemo main
INFO: Starting all registered informers
[informer-controller-Pod] INFO io.fabric8.kubernetes.client.informers.cache.Controller - informer#Controller: ready to run resync and reflector runnable
[informer-controller-Pod] INFO io.fabric8.kubernetes.client.informers.cache.Reflector - Started ReflectorRunnable watch for class io.fabric8.kubernetes.api.model.Pod
[informer-controller-Pod] WARN io.fabric8.kubernetes.client.informers.cache.Controller - Reflector list-watching job exiting because the thread-pool is shutting down
java.util.concurrent.RejectedExecutionException: Error while starting ReflectorRunnable watch
	at io.fabric8.kubernetes.client.informers.cache.Reflector.listAndWatch(Reflector.java:85)
	at io.fabric8.kubernetes.client.informers.cache.Controller.run(Controller.java:112)
	at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: java.util.concurrent.RejectedExecutionException: Error while doing ReflectorRunnable list
	at io.fabric8.kubernetes.client.informers.cache.Reflector.getList(Reflector.java:73)
	at io.fabric8.kubernetes.client.informers.cache.Reflector.reListAndSync(Reflector.java:94)
	at io.fabric8.kubernetes.client.informers.cache.Reflector.listAndWatch(Reflector.java:80)
	... 2 more
Caused by: java.lang.NullPointerException
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.getRootUrl(OperationSupport.java:129)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.getNamespacedUrl(OperationSupport.java:136)
	at io.fabric8.kubernetes.client.dsl.base.OperationSupport.getNamespacedUrl(OperationSupport.java:147)
	at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:656)
	at io.fabric8.kubernetes.client.informers.SharedInformerFactory$1.list(SharedInformerFactory.java:134)
	at io.fabric8.kubernetes.client.informers.SharedInformerFactory$1.list(SharedInformerFactory.java:127)
	at io.fabric8.kubernetes.client.informers.cache.Reflector.getList(Reflector.java:67)
	... 4 more

On debugging, I realized that we were losing OperationContext details when passing our own OperationContext. Let me create a PR to fix this.

[0] https://medium.com/@rohaan/introduction-to-fabric8-kubernetes-java-client-informer-api-b945082d69af

@rohanKanojia
Member

With my PR, you should be able to get namespaced informers. However, when I was trying to get an informer for one specific resource, I noticed one strange issue: I was getting some unnecessary delete events:

INFO: Starting all registered informers
[informer-controller-Pod] INFO io.fabric8.kubernetes.client.informers.cache.Controller - informer#Controller: ready to run resync and reflector runnable
[informer-controller-Pod] INFO io.fabric8.kubernetes.client.informers.cache.Reflector - Started ReflectorRunnable watch for class io.fabric8.kubernetes.api.model.Pod
[OkHttp https://172.17.0.2:8443/...] INFO io.fabric8.kubernetes.client.informers.cache.ReflectorWatcher - Event received MODIFIED
May 11, 2020 1:41:38 PM io.fabric8.NamespacedInformerDemo$1 onAdd
INFO: Pod testpod got added
May 11, 2020 1:41:49 PM io.fabric8.NamespacedInformerDemo$1 onUpdate
INFO: Pod testpod got updated
May 11, 2020 1:41:50 PM io.fabric8.NamespacedInformerDemo$1 onDelete
INFO: Pod testpod got deleted
May 11, 2020 1:42:20 PM io.fabric8.NamespacedInformerDemo$1 onDelete
INFO: Pod testpod got deleted
May 11, 2020 1:42:49 PM io.fabric8.NamespacedInformerDemo$1 onUpdate
INFO: Pod testpod got updated
May 11, 2020 1:42:50 PM io.fabric8.NamespacedInformerDemo$1 onDelete
INFO: Pod testpod got deleted
May 11, 2020 1:43:20 PM io.fabric8.NamespacedInformerDemo$1 onDelete

Upon debugging, I found that when we query a single resource, the response is a single resource rather than a list. Hence, deserialization fails during the list step, resulting in the resource's ObjectMeta being added to the list but with zero items:

return new ListerWatcher<T, TList>() {
  @Override
  public TList list(ListOptions params, String namespace, OperationContext context) throws KubernetesClientException {
    BaseOperation<T, TList, ?, ?> listBaseOperation = baseOperation.newInstance(context.withNamespace(namespace));
    listBaseOperation.setType(apiTypeClass);
    listBaseOperation.setListType(apiListTypeClass);
    return listBaseOperation.list();
  }

  @Override
  public Watch watch(ListOptions params, String namespace, OperationContext context, Watcher<T> resourceWatcher) throws KubernetesClientException {
    BaseOperation<T, TList, ?, ?> watchBaseOperation = baseOperation.newInstance(context);
    watchBaseOperation.setType(apiTypeClass);
    watchBaseOperation.setListType(apiListTypeClass);
    // Register Custom Kind in case of CustomResource
    if (context.getApiGroupName() != null && context.getApiGroupVersion() != null) {
      String apiGroupNameAndVersion = context.getApiGroupName() +
          (context.getApiGroupName().endsWith("/") ? context.getApiGroupVersion() : ("/" + context.getApiGroupVersion()));
      KubernetesDeserializer.registerCustomKind(apiGroupNameAndVersion, apiTypeClass.getSimpleName(), apiTypeClass);
    }
    return watchBaseOperation.watch(params.getResourceVersion(), resourceWatcher);
  }
};

I checked client-go's implementation, but I'm not sure if they support listing a specific resource. Maybe informers are not meant to list specific resources, since they implement the ListerWatcher interface: they are supposed to list and watch only, which only applies to lists.

asfgit pushed a commit to apache/spark that referenced this issue Jul 21, 2020
[SPARK-24266][K8S] Restart the watcher when we receive a version changed from k8s

### What changes were proposed in this pull request?

Restart the watcher when it failed with a HTTP_GONE code from the kubernetes api. Which means a resource version has changed.

For more relevant information see here: fabric8io/kubernetes-client#1075

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Running spark-submit to a k8s cluster.

Not sure how to make an automated test for this. If someone can help me out that would be great.

Closes #28423 from stijndehaes/bugfix/k8s-submit-resource-version-change.

Authored-by: Stijn De Haes <stijndehaes@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
@stale

stale bot commented Aug 9, 2020

This issue has been automatically marked as stale because it has not had any activity since 90 days. It will be closed if no further activity occurs within 7 days. Thank you for your contributions!

@stale stale bot added the status/stale label Aug 9, 2020
@stale stale bot closed this as completed Aug 16, 2020
jkleckner pushed a commit to jkleckner/spark that referenced this issue Aug 30, 2020
[SPARK-24266][K8S] Restart the watcher when we receive a version changed from k8s

jkleckner pushed a commit to jkleckner/spark that referenced this issue Sep 25, 2020
[SPARK-24266][K8S] Restart the watcher when we receive a version changed from k8s

jkleckner pushed a commit to jkleckner/spark that referenced this issue Sep 25, 2020
[SPARK-24266][K8S] Restart the watcher when we receive a version changed from k8s

jkleckner pushed a commit to jkleckner/spark that referenced this issue Sep 28, 2020
[SPARK-24266][K8S] Restart the watcher when we receive a version changed from k8s
holdenk added a commit to holdenk/spark that referenced this issue Oct 27, 2020
[SPARK-24266][K8S] Restart the watcher when we receive a version changed from k8s (part of a larger squashed backport that also carries several unrelated decommissioning commits)
jkleckner pushed a commit to jkleckner/spark that referenced this issue Nov 3, 2020
[SPARK-24266][K8S] Restart the watcher when we receive a version changed from k8s
dongjoon-hyun pushed a commit to apache/spark that referenced this issue Nov 3, 2020
… changed from k8s

### What changes were proposed in this pull request?

This is a straight application of #28423 onto branch-3.0

Restart the watcher when it failed with a HTTP_GONE code from the kubernetes api. Which means a resource version has changed.

For more relevant information see here: fabric8io/kubernetes-client#1075

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This was tested in #28423 by running spark-submit to a k8s cluster.

Closes #29533 from jkleckner/backport-SPARK-24266-to-branch-3.0.

Authored-by: Stijn De Haes <stijndehaes@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@SunithaR

@manusa @rohanKanojia has the Watcher been fixed for this issue? We are using a watcher to watch a Kubernetes Job and started noticing this issue. Going through this thread, it looks like SharedInformer is an alternative. Can a SharedInformer be used to watch a Kubernetes Job?

@rohanKanojia
Member

@SunithaR: Are you talking about the batch/v1 Job resource? If yes, here is a simple example of its usage:

try (KubernetesClient client = new DefaultKubernetesClient()) {
    // Get Informer Factory
    SharedInformerFactory sharedInformerFactory = client.informers();

    // Create instance for Job Informer
    SharedIndexInformer<Job> jobSharedIndexInformer = sharedInformerFactory.sharedIndexInformerFor(Job.class, JobList.class,
            5 * 1000L);
    logger.info("Informer factory initialized.");

    // Add Event Handler for actions on all Job events received
    jobSharedIndexInformer.addEventHandler(
            new ResourceEventHandler<Job>() {
                @Override
                public void onAdd(Job job) {
                    logger.info("Job " + job.getMetadata().getName() + " got added");
                }

                @Override
                public void onUpdate(Job oldJob, Job newJob) {
                    logger.info("Job " + oldJob.getMetadata().getName() + " got updated");
                }

                @Override
                public void onDelete(Job job, boolean deletedFinalStateUnknown) {
                    logger.info("Job " + job.getMetadata().getName() + " got deleted");
                }
            }
    );

    logger.info("Starting all registered informers");
    sharedInformerFactory.startAllRegisteredInformers();

    // Wait for 1 minute
    Thread.sleep(60 * 1000L);
    logger.info("Stopping informers now..");
    sharedInformerFactory.stopAllRegisteredInformers();
}

@SunithaR

Thanks @rohanKanojia, yes, batch/v1 Job. Is there a way to watch a single Job? Currently we have this:

Watch watch = client.batch().jobs().inNamespace(NAMESPACE).withName(jobName).watch(kubeJobWatcher);

When onClose is called we then delete the Job. Currently we are seeing onClose called with the exception KubernetesClientException: too old resource version: 827312380 (827319450). However, we want to continue watching it. Is this possible with the SharedInformer?

@rohanKanojia
Member

I think you should be able to do it with something like this:

SharedIndexInformer<Job> jobSharedIndexInformer = sharedInformerFactory.sharedIndexInformerFor(Job.class, JobList.class,
                    new OperationContext().withNamespace(NAMESPACE).withFields(Collections.singletonMap("metadata.name", jobName)),
                    RESYNC_PERIOD);

@SunithaR

Thanks @rohanKanojia will try it out. What is the RESYNC_PERIOD?

@rohanKanojia
Member

Informers maintain their own internal cache. A resync plays back all the events held in the informer's internal cache. So after every resync period the informer re-queries the API server (list+watch) and updates its cache. You can have a look at this blog [0] to see how an informer is used with a resync period. Recently we also added support for skipping resync when the resync period is set to 0 [1].
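For instance, the resync period is the last argument of sharedIndexInformerFor; the 30-second value below is just an illustration:

// Re-list from the API server and replay cached events every 30 seconds.
// Passing 0L instead disables the periodic resync.
SharedIndexInformer<Job> jobInformer = sharedInformerFactory.sharedIndexInformerFor(
    Job.class, JobList.class, 30 * 1000L);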

[0] https://rohaan.medium.com/introduction-to-fabric8-kubernetes-java-client-informer-api-b945082d69af
[1] #2651

@SunithaR

SunithaR commented Jan 27, 2021

@manusa @rohanKanojia when sharedInformerFactory.stopAllRegisteredInformers() is called, it produces logs with ERROR severity (see stack trace below). Since we are invoking a graceful shutdown, it would be good if these messages were logged at WARN level instead of ERROR (which signals a major issue). There is a similar issue which has been fixed, kubernetes-client/java#656, but the log level has not been reduced.
Can you please take a look and see if it is possible to reduce the log level or if there is any other workaround when using SharedInformer?

2021-01-26 00:21:02.192 ERROR 1 --- [ol-202-thread-1] i.f.k.c.i.cache.ProcessorListener : Processor thread interrupted: null
2021-01-26 00:21:02.192 ERROR 1 --- [-controller-Job] i.f.k.client.informers.cache.Controller : DefaultController#processLoop got interrupted null
java.lang.InterruptedException: null

@rohanKanojia
Member

@SunithaR: Hi, sorry for the late reply. Yes, you're right, we should log these messages at DEBUG/WARN rather than ERROR.

} catch (InterruptedException ex) {
  log.error("Processor thread interrupted: {}", ex.getMessage());

log.error("DefaultController#processLoop got interrupted {}", t.getMessage(), t);

I would appreciate it if you could create a separate issue for this. We will try to fix it in upcoming sprints. I think it would be awesome if you could contribute a PR to fix it, as it doesn't seem that involved :-).
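A rough sketch of the kind of change being asked for, based on the ProcessorListener catch block quoted above (not the actual patch):

} catch (InterruptedException ex) {
  // Interruption is expected during a deliberate stopAllRegisteredInformers(), so WARN (or DEBUG) is enough.
  log.warn("Processor thread interrupted: {}", ex.getMessage());
  Thread.currentThread().interrupt(); // restore the interrupt flag
}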

@SunithaR

@rohanKanojia you got it :), opened a new issue: #2753. Will switch the log to WARN severity and create a PR.
