HttpError with no message halfway through large GS write workload #316
The broken-pipe certainly sounds like it should be retried, and it's odd that for this case we don't know. Did you not even get an HTTP status code? There's probably a longer list of specific errors that should be considered retriable. Note that dask allows you to retry whole failed tasks, and that might be a good safeguard against weird intermittent problems on a small number of tasks in a large graph. Of course, we'd like to get it fixed anyway. |
I didn't, and here's the full log for reference, where snakemake is running a script that uses local dask:

```
Building DAG of jobs...
Creating conda environment envs/gwas.yaml...
Downloading and installing remote packages.
Environment for envs/gwas.yaml created (location: .snakemake/conda/0a479a2e)
Using shell: /bin/bash
Provided cores: 64
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	bgen_to_zarr
	1
Select jobs to execute...

[Fri Dec 4 12:37:26 2020]
Downloading from remote: rs-ukb/raw/gt-imputation/ukb59384_imp_chr15_v3_s487296.sample
Shutting down, this might take some time.
```
|
Is there a way to do that when working through Xarray? Or is there some global dask distributed property that would control that? I looked at one point and was a little confused as to how that's supposed to work and whether or not it's safe with IO operations like this. |
I'm not sure - it's an optional argument to |
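(Editor's aside: a minimal sketch of one way to get per-task retries when writing zarr through xarray and dask. This assumes a `dask.distributed` Client, since `retries=` is a distributed-scheduler feature; the GCS path and dataset are placeholders, not anything from this thread.)

```python
# Sketch only: per-task retries for a zarr write via xarray + dask.
# Assumes a dask.distributed Client; the gs:// path is a placeholder.
import numpy as np
import xarray as xr
from dask.distributed import Client

client = Client()  # local "distributed" cluster, for illustration

ds = xr.Dataset({"x": (("t",), np.arange(10_000))}).chunk({"t": 1_000})

# Build the store step lazily instead of writing immediately...
delayed = ds.to_zarr("gs://my-bucket/example.zarr", compute=False)

# ...then compute it, re-running any failed task up to 3 times.
client.compute(delayed, retries=3).result()
```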
I saw this twice today outside of the context of Dask, once in Xarray as logged in pydata/xarray#4704 and again when called from Pandas:
I mention it because, of the 3 errors I linked to in pydata/xarray#4704, this one appears to be the most prevalent (and difficult to work around). |
Can you please apply the following and see if you get extra information?

```diff
--- a/gcsfs/core.py
+++ b/gcsfs/core.py
@@ -1299,7 +1299,7 @@ class GCSFileSystem(AsyncFileSystem):
         elif "invalid" in str(msg):
             raise ValueError("Bad Request: %s\n%s" % (path, msg))
         elif error:
-            raise HttpError(error)
+            raise HttpError({"code": status, "message": error})
         elif status:
             raise HttpError({"code": status})
```

(I can push this to a branch, if that helps with installation) |
The only mention I can find of something similar is this, where
which is not very descriptive. |
Indeed, why not add the following for completeness?

```diff
--- a/gcsfs/core.py
+++ b/gcsfs/core.py
@@ -1290,6 +1290,7 @@ class GCSFileSystem(AsyncFileSystem):
         # TODO: limit to appropriate exceptions
         msg = content
+        logger.debug("Error condition: %s" % ((status, content, json, path, headers), ))
         if status == 404:
             raise FileNotFoundError
         elif status == 403:
```
|
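(Editor's aside for anyone applying the patch above: the extra `logger.debug` call only appears if Python logging is configured to emit DEBUG records. A minimal sketch follows, assuming gcsfs logs under a logger named "gcsfs"; adjust the name if the module uses a different one, e.g. "gcsfs.core".)

```python
# Sketch: surface DEBUG output from the patched error handler.
# Assumes the gcsfs module logger is named "gcsfs".
import logging

logging.basicConfig(format="%(asctime)s %(name)s %(levelname)s %(message)s")
logging.getLogger("gcsfs").setLevel(logging.DEBUG)
```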
Thanks @martindurant, I patched my client environment and will post anything that gets caught here. |
I'm getting this same `gcsfs.utils.HttpError: Required` error with long-running zarr writes to GCS. The errors are common enough that jobs with ~100k chunks usually fail. I'll add these debug lines tomorrow and try running again to see if there is any more detail on the HttpError. |
For Dask retries you may want to try the dask.annotate function with the retries= keyword. This will require the latest release, I think.
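(Editor's aside, a hedged sketch of what that suggestion might look like, untested against any particular dask release: `dask.annotate` attaches annotations to graph layers as they are created, so the context manager should wrap the graph-building step, and the `retries` annotation is only acted on by the distributed scheduler. The store path is a placeholder.)

```python
# Sketch only: asking the distributed scheduler to retry failed tasks
# via dask.annotate. Requires a dask version with annotation support.
import dask
import dask.array as da
from dask.distributed import Client

client = Client()

with dask.annotate(retries=3):
    arr = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
    # Placeholder GCS path; compute=False returns a lazy write task.
    stored = arr.to_zarr("gs://my-bucket/example.zarr", compute=False)

client.compute(stored).result()
```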
|
Actually, it may still be in a PR.
|
Here's the traceback:
The docs mentioned above are pretty cryptic, so it's hard to know what is going wrong, but for now could this 400 be added to |
Yes, I suppose we can allow 400 in general or specifically look for this weird response. I hope it is indeed intermittent. |
We have been running into much the same error (#323). After running our script for the last few hours, we were able to extract some more information about the error we were seeing. This code ran on a GCE VM. The following errors happened consecutively:
First, a gateway timeout happened (status: 504), after which the subsequent retries failed with a 401. A possible solution would be to make the number of retries configurable (in order to mitigate the risk of reaching API call limits) and to retry on 401 as well? An exponential backoff would also be preferable. |
How about: if the initial error (504) seems to be retriable, then we continue retrying regardless of the subsequent errors? Could you give that a try to see if it does the business - or maybe the first error changes something fundamental and there needs to be a deeper kind of reset. |
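(Editor's aside, to make that strategy concrete: a rough, generic sketch, not gcsfs's actual code, with all names illustrative. The first failure decides whether the operation is treated as retriable; once it is, subsequent failures of any kind keep being retried with exponential backoff up to a hard cap on attempts.)

```python
import random
import time


class HttpError(Exception):
    """Illustrative stand-in for an HTTP error carrying a status code."""

    def __init__(self, code, message=""):
        super().__init__(f"{code}: {message}")
        self.code = code


RETRIABLE_STATUSES = {408, 429, 500, 502, 503, 504}  # illustrative set


def call_with_retries(make_request, max_attempts=6, base_delay=1.0):
    """If the first failure looks retriable, keep retrying subsequent
    failures too (e.g. a 401 after a 504), with exponential backoff."""
    first_was_retriable = False
    for attempt in range(max_attempts):
        try:
            return make_request()
        except HttpError as exc:
            retriable = exc.code in RETRIABLE_STATUSES
            if attempt == 0:
                first_was_retriable = retriable
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or not (retriable or first_was_retriable):
                raise
            # Exponential backoff with a little jitter.
            time.sleep(base_delay * 2 ** attempt + random.random())
```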
I am seeing similar issues, but also see a KeyError for a missing DataArray (which I verified is on disk as part of a dataset) bubbling up through xarray/zarr. I'm leaving this info here in case it helps or is related. I grabbed HEAD, which formats the HttpErrors a little better and shows the status code in the traceback. Some of my jobs hit the 401, while others hit the KeyError. I don't see a 504. Both can be remedied by enough retries of the workload.

HttpError

KeyError
|
@chrisroat, was there again no message with the HTTP errors? The KeyError is not a surprise; that is a valid outcome from zarr when it seems like the file in question can't be reached. Could one of the people on this thread please aggregate what we have learned into a PR, which either recognises the errors encountered here so that they become retriable, or else retries all errors by default except for a specific list that we know are not retriable? |
@martindurant The HttpErrors only have the status code passed in. My errors are 401 or KeyError. In general, I don't think a 401 should be retried if there is a chance it's real and a token has expired? In @DPGrev's case, it seems like the gateway error has triggered something, and it's possible the retries will all be 401s. @DPGrev, is it possible for you to check by using HEAD and adding 401 to the list in this function? Also, note that the back-off is exponential already. For the KeyError case, it means there was no underlying error that was caught -- it seems tricky to remedy that if it is at the GCS layer. |
I am getting the same error as was posted on Dec 31 in this thread (#316 (comment)), with a similar traceback and the 'Required: 400'. I understand this might have been solved by #335, so I'll try using gcsfs HEAD now instead of 0.7.2. Though not getting the error is no proof that it is solved (it is quite intermittent and random), I'll post back here with my results. |
Just noticed the fix in #380, so I tried out a test run with the latest changes. The good news is that it did seem to fix the 400 errors. I made it through 1.7TB of a 2+TB dataset write (much further than before), but the bad news is that I ran into another failure:
|
Would you say that this is another error that we should retry for? There are quite a few exceptions in aiohttp:

```
aiohttp.client_exceptions.ClientConnectionError
aiohttp.client_exceptions.ClientConnectorCertificateError
aiohttp.client_exceptions.ClientConnectorError
aiohttp.client_exceptions.ClientConnectorSSLError
aiohttp.client_exceptions.ClientError
aiohttp.client_exceptions.ClientHttpProxyError
aiohttp.client_exceptions.ClientOSError
aiohttp.client_exceptions.ClientPayloadError
aiohttp.client_exceptions.ClientProxyConnectionError
aiohttp.client_exceptions.ClientResponse
aiohttp.client_exceptions.ClientResponseError
aiohttp.client_exceptions.ClientSSLError
aiohttp.client_exceptions.ContentTypeError
aiohttp.client_exceptions.ServerConnectionError
aiohttp.client_exceptions.ServerDisconnectedError
aiohttp.client_exceptions.ServerTimeoutError
aiohttp.client_exceptions.WSServerHandshakeError
```
|
This was definitely intermittent, since the job had already been running a couple of hours, so I think retrying is appropriate. It seems hard to pick through these and figure out which actually should be retried, so another approach would be to just retry all of them, with a reasonable limit on total retries? |
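(Editor's aside, a hedged illustration of that "retry broadly, cap total attempts" idea: `aiohttp.ClientError` is the common base class of most of the exceptions listed above, so one except clause covers them, and timeouts can be handled alongside. This is a generic caller-side sketch, not gcsfs's implementation; the URL and names are placeholders.)

```python
import asyncio
import random

import aiohttp


async def fetch_with_retries(session, url, max_attempts=6, base_delay=1.0):
    """Retry any client/connection-level aiohttp failure, plus timeouts,
    with exponential backoff and a hard cap on total attempts."""
    for attempt in range(max_attempts):
        try:
            async with session.get(url) as resp:
                resp.raise_for_status()
                return await resp.read()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == max_attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt + random.random())


async def main():
    async with aiohttp.ClientSession() as session:
        data = await fetch_with_retries(session, "https://example.com/")
        print(len(data))


if __name__ == "__main__":
    asyncio.run(main())
```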
Yes, I tend to agree with you, after having previously resisted the idea. There are just too many fail cases! The list is here, so this could be edited to add |
Please feel free to put that in a PR :) |
Can do. I'll test it out again with this workload and then make a PR. |
Similar to #315, I saw this today after 10s of GB of data had already been written to a Zarr archive (i.e. this isn't a problem with initial writes, it appears to be something spurious in long-running jobs):
Any ideas what this could be or if it should be caught/retried somewhere?
gcsfs version: 0.7.1