Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retries for rare failures #4704

Closed
eric-czech opened this issue Dec 17, 2020 · 2 comments
Closed

Retries for rare failures #4704

eric-czech opened this issue Dec 17, 2020 · 2 comments

Comments

@eric-czech
Copy link

I recently ran into several issues with gcsfs (fsspec/gcsfs#316, fsspec/gcsfs#315, and fsspec/gcsfs#318) where errors are occasionally thrown, but only in large worfklows where enough http calls are made for them to become probable.

@martindurant suggested forcing dask to retry tasks that may fail like this with .compute(... retries=N) in fsspec/gcsfs#316, which has worked well. However, I also see this in Xarray/Zarr code interacting with gcsfs directly:

Example Traceback
Traceback (most recent call last):
  File "scripts/convert_phesant_data.py", line 100, in <module>
    fire.Fire()
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fire/core.py", line 463, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "scripts/convert_phesant_data.py", line 96, in sort_zarr
    ds.to_zarr(fsspec.get_mapper(output_path), consolidated=True, mode="w")
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/xarray/core/dataset.py", line 1652, in to_zarr
    return to_zarr(
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/xarray/backends/api.py", line 1368, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/xarray/backends/api.py", line 1128, in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/xarray/backends/zarr.py", line 417, in store
    self.set_variables(
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/xarray/backends/zarr.py", line 489, in set_variables
    writer.add(v.data, zarr_array, region=region)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/xarray/backends/common.py", line 145, in add
    target[...] = source
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/zarr/core.py", line 1115, in __setitem__
    self.set_basic_selection(selection, value, fields=fields)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/zarr/core.py", line 1210, in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/zarr/core.py", line 1501, in _set_basic_selection_nd
    self._set_selection(indexer, value, fields=fields)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/zarr/core.py", line 1550, in _set_selection
    self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/zarr/core.py", line 1664, in _chunk_setitem
    self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/zarr/core.py", line 1729, in _chunk_setitem_nosync
    self.chunk_store[ckey] = cdata
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/mapping.py", line 151, in __setitem__
    self.fs.pipe_file(key, maybe_convert(value))
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/asyn.py", line 121, in wrapper
    return maybe_sync(func, self, *args, **kwargs)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/asyn.py", line 100, in maybe_sync
    return sync(loop, func, *args, **kwargs)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
    raise exc.with_traceback(tb)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in f
    result[0] = await future
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 1007, in _pipe_file
    return await simple_upload(
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 1523, in simple_upload
    j = await fs._call(
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 525, in _call
    raise e
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 507, in _call
    self.validate_response(status, contents, json, path, headers)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 1228, in validate_response
    raise HttpError(error)
gcsfs.utils.HttpError: Required

Has there already been a discussion about how to address rare errors like this? Arguably, I could file the same issue with Zarr but it seemed more productive to start here at a higher level of abstraction.

To be clear, the code for the example failure above typically succeeds and reproducing this failure is difficult. I have only seen it a couple times now like this, where the calling code does not include dask, but it did make me want to know if there were any plans to tolerate rare failures in Xarray as Dask does.

@martindurant
Copy link
Contributor

As far as I can tell, this has only been happening in gcsfs - so my suggestion, to try to collect the set of conditions that should be considered "retryable" but currently aren't, still holds. However, it is also worthwhile discussing where else in the stack retries might be applied, which would affect multiple storage backends.

@shoyer
Copy link
Member

shoyer commented Dec 17, 2020

This does happen with some other backends, specifically netCDF and pydap when access remote datasets via HTTP/opendap. We have a robust_getitem helper functions for this that you'll see is used in the netCDF4 and pydap backends:

def robust_getitem(array, key, catch=Exception, max_retries=6, initial_delay=500):

I think exponential backoff with fuzzing is the right strategy for rare network failures, but I would suggest trying to push this to as low of a level as possible, e.g., ideally inside gcsfs. Retrying the whole dask computation seems quite wasteful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants