HttpError with no message halfway through large GS write workload #316

Closed
eric-czech opened this issue Dec 4, 2020 · 27 comments

@eric-czech

Similar to #315, I saw this today after tens of GB of data had already been written to a Zarr archive (i.e. this isn't a problem with initial writes; it appears to be something spurious in long-running jobs):

Traceback (most recent call last):
  File "scripts/convert_genetic_data.py", line 312, in <module>
    fire.Fire()
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/fire/core.py", line 463, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "scripts/convert_genetic_data.py", line 296, in bgen_to_zarr
    ds = rechunk_dataset(
  File "scripts/convert_genetic_data.py", line 217, in rechunk_dataset
    res = fn(
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/sgkit/io/bgen/bgen_reader.py", line 519, in rechunk_bgen
    rechunked.execute()
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/rechunker/api.py", line 76, in execute
    self._executor.execute_plan(self._plan, **kwargs)
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/rechunker/executors/dask.py", line 24, in execute_plan
    return plan.compute(**kwargs)
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/dask/base.py", line 167, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/dask/base.py", line 452, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/dask/threaded.py", line 76, in get
    results = get_async(
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/dask/local.py", line 486, in get_async
    raise_exception(exc, tb)
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/dask/local.py", line 316, in reraise
    raise exc
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/dask/local.py", line 222, in execute_task
    result = _execute_task(task, data)
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/dask/array/core.py", line 3724, in store_chunk
    return load_store_chunk(x, out, index, lock, return_stored, False)
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/dask/array/core.py", line 3713, in load_store_chunk
    out[index] = np.asanyarray(x)
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/zarr/core.py", line 1115, in __setitem__
    self.set_basic_selection(selection, value, fields=fields)
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/zarr/core.py", line 1210, in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/zarr/core.py", line 1501, in _set_basic_selection_nd
    self._set_selection(indexer, value, fields=fields)
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/zarr/core.py", line 1550, in _set_selection
    self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/zarr/core.py", line 1664, in _chunk_setitem
    self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/zarr/core.py", line 1729, in _chunk_setitem_nosync
    self.chunk_store[ckey] = cdata
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/fsspec/mapping.py", line 154, in __setitem__
    self.fs.pipe_file(key, value)
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/fsspec/asyn.py", line 121, in wrapper
    return maybe_sync(func, self, *args, **kwargs)
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/fsspec/asyn.py", line 100, in maybe_sync
    return sync(loop, func, *args, **kwargs)
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
    raise exc.with_traceback(tb)
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in f
    result[0] = await future
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/gcsfs/core.py", line 1007, in _pipe_file
    return await simple_upload(
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/gcsfs/core.py", line 1523, in simple_upload
    j = await fs._call(
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/gcsfs/core.py", line 525, in _call
    raise e
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/gcsfs/core.py", line 507, in _call
    self.validate_response(status, contents, json, path, headers)
  File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/gcsfs/core.py", line 1228, in validate_response
    raise HttpError(error)
gcsfs.utils.HttpError: Required

Any ideas what this could be, or whether it should be caught/retried somewhere?

gcsfs version: 0.7.1

@martindurant
Member

The broken-pipe error certainly sounds like it should be retried, and it's odd that in this case we don't know what went wrong. Did you not even get an HTTP status code? There's probably a longer list of specific errors that should be considered retriable.

Note that dask allows you to retry whole failed tasks, and that might be a good safeguard against weird intermittent problems on a small number of tasks in a large graph. Of course, we'd like to get it fixed anyway.
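
For illustration, a minimal sketch of that safeguard (assuming a dask.distributed Client is driving the computation; the plain threaded scheduler shown in the traceback above may silently ignore retries):

import dask.array as da
from dask.distributed import Client

client = Client()  # local cluster; retries= is honored by the distributed scheduler
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
# Ask dask to re-run any task that raises, up to 3 times, before failing the graph.
total = x.sum().compute(retries=3)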

@eric-czech
Author

eric-czech commented Dec 4, 2020

Did you not even get an HTTP status code?

I didn't. Here's the full log for reference; snakemake is running a script that uses local dask:

Log
Building DAG of jobs...
Creating conda environment envs/gwas.yaml...
Downloading and installing remote packages.
Environment for envs/gwas.yaml created (location: .snakemake/conda/0a479a2e)
Using shell: /bin/bash
Provided cores: 64
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 bgen_to_zarr
1
Select jobs to execute...

[Fri Dec 4 12:37:26 2020]
rule bgen_to_zarr:
input: rs-ukb/raw/gt-imputation/ukb_imp_chr15_v3.bgen, rs-ukb/raw/gt-imputation/ukb_mfi_chr15_v3.txt, rs-ukb/raw/gt-imputation/ukb59384_imp_chr15_v3_s487296.sample
output: rs-ukb/prep/gt-imputation/ukb_chr15.ckpt
jobid: 0
wildcards: bgen_contig=15
threads: 63
resources: mem_mb=222822

Downloading from remote: rs-ukb/raw/gt-imputation/ukb59384_imp_chr15_v3_s487296.sample
Finished download.
Downloading from remote: rs-ukb/raw/gt-imputation/ukb_imp_chr15_v3.bgen
Finished download.
Downloading from remote: rs-ukb/raw/gt-imputation/ukb_mfi_chr15_v3.txt
Finished download.
Activating conda environment: /workdir/.snakemake/conda/0a479a2e
2020-12-04 12:51:13,013 | main | INFO | Loading BGEN dataset for contig Contig(name=15, index=14) from rs-ukb/raw/gt-imputation/ukb_imp_chr15_v3.bgen (chunks = (250, -1))
2020-12-04 12:52:04,036 | main | INFO | Rechunking dataset for contig Contig(name=15, index=14) to gs://rs-ukb/prep/gt-imputation/ukb_chr15.zarr (chunks = (5216, 5792)):
<xarray.Dataset>
Dimensions: (alleles: 2, genotypes: 3, samples: 487409, variants: 2767971)
Dimensions without coordinates: alleles, genotypes, samples, variants
Data variables:
variant_id (variants) |S186 dask.array<chunksize=(721600,), meta=np.ndarray>
variant_rsid (variants) |S132 dask.array<chunksize=(922657,), meta=np.ndarray>
variant_position (variants) int32 dask.array<chunksize=(2767971,), meta=np.ndarray>
variant_maf (variants) float32 dask.array<chunksize=(2767971,), meta=np.ndarray>
variant_minor_allele (variants) |S118 dask.array<chunksize=(922657,), meta=np.ndarray>
variant_info (variants) float32 dask.array<chunksize=(2767971,), meta=np.ndarray>
variant_allele (variants, alleles) |S172 dask.array<chunksize=(390167, 2), meta=np.ndarray>
sample_id (samples) int32 4476413 3205773 ... 4315851
sample_sex (samples) uint8 1 2 2 2 2 1 ... 1 2 2 1 2 1
variant_contig (variants) int64 14 14 14 14 ... 14 14 14 14
variant_contig_name (variants) |S2 b'15' b'15' ... b'15' b'15'
call_genotype_probability (variants, samples, genotypes) float16 dask.array<chunksize=(250, 487409, 3), meta=np.ndarray>
call_genotype_probability_mask (variants, samples, genotypes) bool dask.array<chunksize=(250, 487409, 3), meta=np.ndarray>
Attributes:
contigs: ['15']
contig_name: 15
contig_index: 14
[#################### ] | 51% Completed | 2hr 8min 8.3s
Traceback (most recent call last):
File "scripts/convert_genetic_data.py", line 312, in
fire.Fire()
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/fire/core.py", line 138, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/fire/core.py", line 463, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "scripts/convert_genetic_data.py", line 296, in bgen_to_zarr
ds = rechunk_dataset(
File "scripts/convert_genetic_data.py", line 217, in rechunk_dataset
res = fn(
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/sgkit/io/bgen/bgen_reader.py", line 519, in rechunk_bgen
rechunked.execute()
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/rechunker/api.py", line 76, in execute
self._executor.execute_plan(self._plan, **kwargs)
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/rechunker/executors/dask.py", line 24, in execute_plan
return plan.compute(**kwargs)
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/dask/base.py", line 167, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/dask/base.py", line 452, in compute
results = schedule(dsk, keys, **kwargs)
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/dask/threaded.py", line 76, in get
results = get_async(
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/dask/local.py", line 486, in get_async
raise_exception(exc, tb)
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/dask/local.py", line 316, in reraise
raise exc
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/dask/local.py", line 222, in execute_task
result = _execute_task(task, data)
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/dask/array/core.py", line 3724, in store_chunk
return load_store_chunk(x, out, index, lock, return_stored, False)
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/dask/array/core.py", line 3713, in load_store_chunk
out[index] = np.asanyarray(x)
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/zarr/core.py", line 1115, in setitem
self.set_basic_selection(selection, value, fields=fields)
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/zarr/core.py", line 1210, in set_basic_selection
return self._set_basic_selection_nd(selection, value, fields=fields)
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/zarr/core.py", line 1501, in _set_basic_selection_nd
self._set_selection(indexer, value, fields=fields)
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/zarr/core.py", line 1550, in _set_selection
self._chunk_setitem(chunk_coords, chunk_selection, chunk_value, fields=fields)
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/zarr/core.py", line 1664, in _chunk_setitem
self._chunk_setitem_nosync(chunk_coords, chunk_selection, value,
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/zarr/core.py", line 1729, in _chunk_setitem_nosync
self.chunk_store[ckey] = cdata
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/fsspec/mapping.py", line 154, in setitem
self.fs.pipe_file(key, value)
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/fsspec/asyn.py", line 121, in wrapper
return maybe_sync(func, self, *args, **kwargs)
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/fsspec/asyn.py", line 100, in maybe_sync
return sync(loop, func, *args, **kwargs)
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
raise exc.with_traceback(tb)
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in f
result[0] = await future
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/gcsfs/core.py", line 1007, in _pipe_file
return await simple_upload(
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/gcsfs/core.py", line 1523, in simple_upload
j = await fs._call(
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/gcsfs/core.py", line 525, in _call
raise e
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/gcsfs/core.py", line 507, in _call
self.validate_response(status, contents, json, path, headers)
File "/workdir/.snakemake/conda/0a479a2e/lib/python3.8/site-packages/gcsfs/core.py", line 1228, in validate_response
raise HttpError(error)
gcsfs.utils.HttpError: Required
[Fri Dec 4 15:00:54 2020]
Error in rule bgen_to_zarr:
jobid: 0
output: rs-ukb/prep/gt-imputation/ukb_chr15.ckpt
conda-env: /workdir/.snakemake/conda/0a479a2e
shell:
python scripts/convert_genetic_data.py bgen_to_zarr --input-path-bgen=rs-ukb/raw/gt-imputation/ukb_imp_chr15_v3.bgen --input-path-variants=rs-ukb/raw/gt-imputation/ukb_mfi_chr15_v3.txt --input-path-samples=rs-ukb/raw/gt-imputation/ukb59384_imp_chr15_v3_s487296.sample --output-path=gs://rs-ukb/prep/gt-imputation/ukb_chr15.zarr --contig-name=15 --contig-index=14 --remote=True && touch rs-ukb/prep/gt-imputation/ukb_chr15.ckpt
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

Note that dask allows you to retry whole failed tasks

Is there a way to do that when working through Xarray? Or is there some global dask distributed property that would control that? I looked at one point and was a little confused as to how that's supposed to work and whether or not it's safe with IO operations like this.

@martindurant
Member

Is there a way to do that when working through Xarray

I'm not sure - it's an optional argument to .compute().
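
For the rechunker path in the original traceback, execute() appears to forward its keyword arguments to dask's compute(), so something like the following might work (untested sketch; store paths and chunk sizes are illustrative, and retries needs a scheduler that honors it, e.g. dask.distributed):

import dask.array as da
from rechunker import rechunk

source = da.random.random((20_000, 20_000), chunks=(5_000, 20_000))
plan = rechunk(
    source,
    target_chunks=(20_000, 5_000),
    max_mem="1GB",
    target_store="gs://bucket/target.zarr",  # illustrative paths
    temp_store="gs://bucket/tmp.zarr",
)
# Rechunked.execute(**kwargs) passes its kwargs through to dask compute(),
# so a per-task retry budget can be requested here.
plan.execute(retries=3)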

@eric-czech
Author

eric-czech commented Dec 17, 2020

I saw this twice today outside of the context of Dask, once in Xarray as logged in pydata/xarray#4704 and again when called from Pandas:

Traceback (most recent call last):
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 507, in _call
    self.validate_response(status, contents, json, path, headers)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 1228, in validate_response
    raise HttpError(error)
gcsfs.utils.HttpError: Required
Traceback (most recent call last):
  File "scripts/gwas.py", line 438, in <module>
    fire.Fire()
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fire/core.py", line 463, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "scripts/gwas.py", line 401, in run_gwas
    df.to_parquet(path)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/pandas/util/_decorators.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/pandas/core/frame.py", line 2109, in to_parquet
    to_parquet(
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/pandas/io/parquet.py", line 258, in to_parquet
    return impl.write(
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/pandas/io/parquet.py", line 112, in write
    self.api.parquet.write_table(
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/pyarrow/parquet.py", line 1733, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/pyarrow/parquet.py", line 591, in write_table
    self.writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 1433, in pyarrow._parquet.ParquetWriter.write_table
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/spec.py", line 1352, in write
    self.flush()
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/spec.py", line 1387, in flush
    self._initiate_upload()
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 1402, in _initiate_upload
    self.location = sync(
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
    raise exc.with_traceback(tb)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in f
    result[0] = await future
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 1486, in initiate_upload
    headers, _ = await fs._call(
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 525, in _call
    raise e
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 507, in _call
    self.validate_response(status, contents, json, path, headers)
  File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/90e5c2a1/lib/python3.8/site-packages/gcsfs/core.py", line 1228, in validate_response
    raise HttpError(error)
gcsfs.utils.HttpError: Required

I mention it because, of the 3 errors I linked to in pydata/xarray#4704, this one appears to be the most prevalent (and the most difficult to work around).

@martindurant
Member

martindurant commented Dec 18, 2020

Can you please apply the following and see if you get extra information?

--- a/gcsfs/core.py
+++ b/gcsfs/core.py
@@ -1299,7 +1299,7 @@ class GCSFileSystem(AsyncFileSystem):
             elif "invalid" in str(msg):
                 raise ValueError("Bad Request: %s\n%s" % (path, msg))
             elif error:
-                raise HttpError(error)
+                raise HttpError({"code": status, "message": error})
             elif status:
                 raise HttpError({"code": status})

(I can push this to a branch, if that helps with installation)

@martindurant
Member

The only mention I can find of something similar is this, where

required: The API request is missing required information. The required information could be a parameter or resource property.

which is not very descriptive.

@martindurant
Member

Indeed, why not add the following for completeness

--- a/gcsfs/core.py
+++ b/gcsfs/core.py
@@ -1290,6 +1290,7 @@ class GCSFileSystem(AsyncFileSystem):
                 # TODO: limit to appropriate exceptions
                 msg = content

+            logger.debug("Error condition: %s" % ((status, content, json, path, headers), ))
             if status == 404:
                 raise FileNotFoundError
             elif status == 403:
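
To actually surface that message, DEBUG logging needs to be enabled for gcsfs, along these lines (assuming the module logger is registered under the name "gcsfs"):

import logging

logging.basicConfig()  # make sure a handler exists
logging.getLogger("gcsfs").setLevel(logging.DEBUG)  # emit the logger.debug() line above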

@eric-czech
Author

Thanks @martindurant, I patched my client environment and will post anything that gets caught here.

@slevang
Contributor

slevang commented Dec 31, 2020

I'm getting this same gcsfs.utils.HttpError: Required error with long-running zarr writes to GCS. The errors are common enough that jobs with ~100k chunks usually fail. I'll add these debug lines tomorrow and try running again to see if there is any more detail on the HttpError.

@mrocklin
Contributor

mrocklin commented Dec 31, 2020 via email

@mrocklin
Contributor

mrocklin commented Dec 31, 2020 via email

@mrocklin
Contributor

mrocklin commented Dec 31, 2020 via email

@slevang
Contributor

slevang commented Dec 31, 2020

Here's the traceback:

  File "/root/miniconda3/lib/python3.8/site-packages/xarray/core/dataset.py", line 1745, in to_zarr
    return to_zarr(
  File "/root/miniconda3/lib/python3.8/site-packages/xarray/backends/api.py", line 1482, in to_zarr
    writes = writer.sync(compute=compute)
  File "/root/miniconda3/lib/python3.8/site-packages/xarray/backends/common.py", line 155, in sync
    delayed_store = da.store(
  File "/root/miniconda3/lib/python3.8/site-packages/dask/array/core.py", line 945, in store
    result.compute(**kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/dask/base.py", line 167, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/dask/base.py", line 452, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/dask/threaded.py", line 76, in get
    results = get_async(
  File "/root/miniconda3/lib/python3.8/site-packages/dask/local.py", line 486, in get_async
    raise_exception(exc, tb)
  File "/root/miniconda3/lib/python3.8/site-packages/dask/local.py", line 316, in reraise
    raise exc
  File "/root/miniconda3/lib/python3.8/site-packages/dask/local.py", line 222, in execute_task
    result = _execute_task(task, data)
  File "/root/miniconda3/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/root/miniconda3/lib/python3.8/site-packages/dask/array/core.py", line 3632, in store_chunk
    return load_store_chunk(x, out, index, lock, return_stored, False)
  File "/root/miniconda3/lib/python3.8/site-packages/dask/array/core.py", line 3621, in load_store_chunk
    out[index] = np.asanyarray(x)
  File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1122, in __setitem__
    self.set_basic_selection(selection, value, fields=fields)
  File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1217, in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
  File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1508, in _set_basic_selection_nd
    self._set_selection(indexer, value, fields=fields)
  File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1580, in _set_selection
    self._chunk_setitems(lchunk_coords, lchunk_selection, chunk_values,
  File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1709, in _chunk_setitems
    self.chunk_store.setitems({k: v for k, v in zip(ckeys, cdatas)})
  File "/root/miniconda3/lib/python3.8/site-packages/fsspec/mapping.py", line 110, in setitems
    self.fs.pipe(values)
  File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 121, in wrapper
    return maybe_sync(func, self, *args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 100, in maybe_sync
    return sync(loop, func, *args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
    raise exc.with_traceback(tb)
  File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in f
    result[0] = await future
  File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 222, in _pipe
    await asyncio.gather(
  File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 1007, in _pipe_file
    return await simple_upload(
  File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 1524, in simple_upload
    j = await fs._call(
  File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 525, in _call
    raise e
  File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 507, in _call
    self.validate_response(status, contents, json, path, headers)
  File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 1229, in validate_response
    raise HttpError({"code": status, "message": error})
gcsfs.utils.HttpError: {'code': 400, 'message': 'Required', 'errors': [{'message': 'Required', 'domain': 'global', 'reason': 'required'}]}

The docs mentioned above are pretty cryptic, so it's hard to know what is going wrong, but for now could this 400 be added to is_retriable in some form?

@martindurant
Member

could this 400 be added to is_retriable

Yes, I suppose we can allow 400 in general or specifically look for this weird response. I hope it is indeed intermittent.
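
Roughly what that check could look like (a sketch only; the attribute access on HttpError is assumed here, not copied from the gcsfs source):

from gcsfs.utils import HttpError

RETRIABLE_CODES = {429, 500, 502, 503, 504}

def is_retriable(exc):
    # Sketch: treat the usual transient statuses as retriable, plus the
    # intermittent 400/"Required" response reported in this thread.
    if isinstance(exc, HttpError):
        code = getattr(exc, "code", None)
        return code in RETRIABLE_CODES or (code == 400 and "Required" in str(exc))
    return False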

@DPGrev
Contributor

DPGrev commented Jan 7, 2021

We have been running into somewhat the same error (#323). After running our script for the last few hours, we were able to extract some more information about the error we were seeing. This code ran on a GCE VM.

The following errors happened consecutively:

2021-01-07 13:05:17,955 [validate_response] ERROR - Error condition: (504, b'', None, '<< REDACTED LINK TO FILE >>', <CIMultiDictProxy('X-GUploader-UploadID': '<< REDACTED >>', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Thu, 07 Jan 2021 13:05:17 GMT', 'Vary': 'Origin', 'Vary': 'X-Origin', 'Expires': 'Thu, 07 Jan 2021 13:05:17 GMT', 'Cache-Control': 'private, max-age=0', 'Content-Length': '0', 'Server': 'UploadServer')>)
2021-01-07 13:05:19,237 [validate_response] ERROR - Error condition: (401, b'Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object.', None, '<< REDACTED LINK TO SAME FILE >>', <CIMultiDictProxy('X-GUploader-UploadID': '<< REDACTED >>', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Thu, 07 Jan 2021 13:05:19 GMT', 'Vary': 'Origin', 'Vary': 'X-Origin', 'WWW-Authenticate': 'Bearer realm="https://accounts.google.com/"', 'Expires': 'Thu, 07 Jan 2021 13:05:19 GMT', 'Cache-Control': 'private, max-age=0', 'Content-Length': '93', 'Server': 'UploadServer')>)

First, a gateway timeout occurred (status 504), after which the _call method retried and received a 401 status code.
Interestingly, the VM does have access to the data, so the 401 status seems to be spurious in our case.

A possible solution would be to make the number of retries configurable (in order to mitigate the risk of reaching API call limits) and to retry on 401 as well. An exponential backoff would also be preferable.
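
Roughly the shape we have in mind (a generic sketch, not gcsfs's actual _call loop; request is a hypothetical coroutine returning (status, content), and the retriable-status set and retry count are illustrative):

import asyncio
import random

RETRIABLE_STATUSES = {401, 429, 500, 502, 503, 504}  # illustrative; includes 401 per the above

async def request_with_backoff(request, *args, retries=6, base_delay=1.0, **kwargs):
    # Retry an async call with exponential backoff plus jitter; `retries` is
    # configurable so API call limits are less of a concern.
    for attempt in range(retries):
        status, content = await request(*args, **kwargs)
        if status not in RETRIABLE_STATUSES:
            return status, content
        if attempt == retries - 1:
            raise RuntimeError(f"gave up after {retries} attempts, last status {status}")
        await asyncio.sleep(base_delay * 2 ** attempt + random.random())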

@martindurant
Member

How about: if the initial error (504) seems to be retriable, then we continue retrying, whatever the subsequent errors are? Could you give that a try to see if it does the business - or maybe the first error changes something fundamental and a deeper kind of reset is needed.
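
In sketch form (not the real _call loop; is_retriable stands for whatever status/exception check we settle on):

import asyncio

async def call_with_sticky_retry(request, is_retriable, retries=6):
    # Once a retriable error (e.g. the 504) has been seen, keep retrying even
    # if later attempts fail differently (e.g. the spurious 401 above).
    seen_retriable = False
    for attempt in range(retries):
        try:
            return await request()
        except Exception as exc:
            seen_retriable = seen_retriable or is_retriable(exc)
            if not seen_retriable or attempt == retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)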

@chrisroat
Contributor

chrisroat commented Jan 18, 2021

I am seeing similar issues, but also see a KeyError for a missing DataArray (which I verified is on disk as part of a dataset) bubbling up through xarray/zarr. I'm leaving this info here in case it helps or is related. I grabbed HEAD, which formats the HttpErrors a little better and shows the status code in the traceback. Some of my jobs hit the 401, while others hit the KeyError. I don't see a 504. Both can be remedied by enough retries of a workload.

HttpError
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/datajoint/autopopulate.py", line 159, in populate
    make(dict(key))
  File "/opt/conda/lib/python3.8/site-packages/starmap/pipeline.py", line 1552, in make
    dask.compute(tasks)
  File "/opt/conda/lib/python3.8/site-packages/dask/base.py", line 561, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 2681, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 1990, in gather
    return self.sync(
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 836, in sync
    return sync(
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 340, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 324, in f
    result[0] = yield future
  File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 1855, in _gather
    raise exception.with_traceback(traceback)
  File "/opt/conda/lib/python3.8/site-packages/dask/array/core.py", line 110, in getter
    c = np.asarray(c)
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/opt/conda/lib/python3.8/site-packages/xarray/core/indexing.py", line 503, in __array__
    return np.asarray(self.array, dtype=dtype)
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/opt/conda/lib/python3.8/site-packages/xarray/core/indexing.py", line 663, in __array__
    return np.asarray(self.array, dtype=dtype)
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/opt/conda/lib/python3.8/site-packages/xarray/core/indexing.py", line 568, in __array__
    return np.asarray(array[self.key], dtype=None)
  File "/opt/conda/lib/python3.8/site-packages/xarray/backends/zarr.py", line 55, in __getitem__
    array = self.get_array()
  File "/opt/conda/lib/python3.8/site-packages/xarray/backends/zarr.py", line 52, in get_array
    return self.datastore.ds[self.variable_name]
  File "/opt/conda/lib/python3.8/site-packages/zarr/hierarchy.py", line 341, in __getitem__
    return Array(self._store, read_only=self._read_only, path=path,
  File "/opt/conda/lib/python3.8/site-packages/zarr/core.py", line 123, in __init__
    self._load_metadata()
  File "/opt/conda/lib/python3.8/site-packages/zarr/core.py", line 140, in _load_metadata
    self._load_metadata_nosync()
  File "/opt/conda/lib/python3.8/site-packages/zarr/core.py", line 149, in _load_metadata_nosync
    meta_bytes = self._store[mkey]
  File "/opt/conda/lib/python3.8/site-packages/fsspec/mapping.py", line 132, in __getitem__
    result = self.fs.cat(k)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 241, in cat
    raise ex
  File "/opt/conda/lib/python3.8/site-packages/gcsfs/core.py", line 887, in _cat_file
    headers, out = await self._call("GET", u2, headers=head)
  File "/opt/conda/lib/python3.8/site-packages/gcsfs/core.py", line 548, in _call
    raise e
  File "/opt/conda/lib/python3.8/site-packages/gcsfs/core.py", line 530, in _call
    self.validate_response(status, contents, json, path, headers)
  File "/opt/conda/lib/python3.8/site-packages/gcsfs/core.py", line 1340, in validate_response
    raise HttpError({"code": status})
Exception: , 401
KeyError
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/datajoint/autopopulate.py", line 159, in populate
    make(dict(key))
  File "/opt/conda/lib/python3.8/site-packages/starmap/pipeline.py", line 1552, in make
    dask.compute(tasks)
  File "/opt/conda/lib/python3.8/site-packages/dask/base.py", line 561, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 2681, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 1990, in gather
    return self.sync(
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 836, in sync
    return sync(
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 340, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 324, in f
    result[0] = yield future
  File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/opt/conda/lib/python3.8/site-packages/distributed/client.py", line 1855, in _gather
    raise exception.with_traceback(traceback)
  File "/opt/conda/lib/python3.8/site-packages/dask/array/core.py", line 110, in getter
    c = np.asarray(c)
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/opt/conda/lib/python3.8/site-packages/xarray/core/indexing.py", line 503, in __array__
    return np.asarray(self.array, dtype=dtype)
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/opt/conda/lib/python3.8/site-packages/xarray/core/indexing.py", line 663, in __array__
    return np.asarray(self.array, dtype=dtype)
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/opt/conda/lib/python3.8/site-packages/xarray/core/indexing.py", line 568, in __array__
    return np.asarray(array[self.key], dtype=None)
  File "/opt/conda/lib/python3.8/site-packages/xarray/backends/zarr.py", line 55, in __getitem__
    array = self.get_array()
  File "/opt/conda/lib/python3.8/site-packages/xarray/backends/zarr.py", line 52, in get_array
    return self.datastore.ds[self.variable_name]
  File "/opt/conda/lib/python3.8/site-packages/zarr/hierarchy.py", line 349, in __getitem__
    raise KeyError(item)
KeyError: 'image'

@martindurant
Member

@chrisroat, was there again no message with the HTTP errors? The KeyError is not a surprise; that is a valid outcome from zarr when it seems like the file in question can't be reached.

Could one of the people on this thread please aggregate what we have learned into a PR, which either recognises the errors encountered here so that they become retriable, or else retries all errors by default except for a specific list that we know is not retriable?

@chrisroat
Contributor

@martindurant The HttpErrors only have the status code passed in.

My errors are 401 or KeyError. In general, I don't think a 401 should be retried if there is a chance it's real and a token has expired. In the @DPGrev case, it seems like the gateway error has triggered something, and it's possible the retries will all be 401s. @DPGrev, is it possible for you to check by using HEAD and adding 401 to the list in this function? Also, note that the back-off is exponential already.

For the KeyError case, it means there was no underlying error that was caught -- it seems tricky to remedy that if it is at the GCS layer.

@vincentschut
Contributor

vincentschut commented Feb 18, 2021

I am getting the same error as was posted on Dec 31 in this thread (#316 (comment)), with a similar traceback and the 400 'Required' error. I understand this might have been solved by #335, so I'll try using gcsfs HEAD now instead of 0.7.2. Though not getting the error is no proof that it is solved (it is quite intermittent and random), I'll post back here with my results.

@slevang
Contributor

slevang commented Apr 29, 2021

Just noticed the fix in #380, so I tried out a test run with the latest changes. The good news is that it did seem to fix the 400 errors. I made it through 1.7TB of a 2+TB dataset write (much further than before), but the bad news is I ran into another failure:

  File "/root/miniconda3/lib/python3.8/site-packages/xarray/core/dataset.py", line 1790, in to_zarr
    return to_zarr(
  File "/root/miniconda3/lib/python3.8/site-packages/xarray/backends/api.py", line 1476, in to_zarr
    writes = writer.sync(compute=compute)
  File "/root/miniconda3/lib/python3.8/site-packages/xarray/backends/common.py", line 156, in sync
    delayed_store = da.store(
  File "/root/miniconda3/lib/python3.8/site-packages/dask/array/core.py", line 1041, in store
    result.compute(**kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/dask/base.py", line 283, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/dask/base.py", line 565, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/dask/threaded.py", line 76, in get
    results = get_async(
  File "/root/miniconda3/lib/python3.8/site-packages/dask/local.py", line 487, in get_async
    raise_exception(exc, tb)
  File "/root/miniconda3/lib/python3.8/site-packages/dask/local.py", line 317, in reraise
    raise exc
  File "/root/miniconda3/lib/python3.8/site-packages/dask/local.py", line 222, in execute_task
    result = _execute_task(task, data)
  File "/root/miniconda3/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/root/miniconda3/lib/python3.8/site-packages/dask/array/core.py", line 3975, in store_chunk
    return load_store_chunk(x, out, index, lock, return_stored, False)
  File "/root/miniconda3/lib/python3.8/site-packages/dask/array/core.py", line 3962, in load_store_chunk
    out[index] = x
  File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1211, in __setitem__
    self.set_basic_selection(selection, value, fields=fields)
  File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1306, in set_basic_selection
    return self._set_basic_selection_nd(selection, value, fields=fields)
  File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1597, in _set_basic_selection_nd
    self._set_selection(indexer, value, fields=fields)
  File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1669, in _set_selection
    self._chunk_setitems(lchunk_coords, lchunk_selection, chunk_values,
  File "/root/miniconda3/lib/python3.8/site-packages/zarr/core.py", line 1861, in _chunk_setitems
    self.chunk_store.setitems(values)
  File "/root/miniconda3/lib/python3.8/site-packages/zarr/storage.py", line 1068, in setitems
    self.map.setitems(values)
  File "/root/miniconda3/lib/python3.8/site-packages/fsspec/mapping.py", line 111, in setitems
    self.fs.pipe(values)
  File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 72, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 53, in sync
    raise result[0]
  File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 20, in _runner
    result[0] = await coro
  File "/root/miniconda3/lib/python3.8/site-packages/fsspec/asyn.py", line 178, in _pipe
    await asyncio.gather(
  File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 1076, in _pipe_file
    location = await initiate_upload(self, bucket, key, content_type, metadata)
  File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 1586, in initiate_upload
    headers, _ = await fs._call(
  File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 531, in _call
    status, headers, info, contents = await self._request(
  File "/root/miniconda3/lib/python3.8/site-packages/gcsfs/core.py", line 505, in _request
    async with self.session.request(
  File "/root/miniconda3/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in __aenter__
    self._resp = await self._coro
  File "/root/miniconda3/lib/python3.8/site-packages/aiohttp/client.py", line 544, in _request
    await resp.start(conn)
  File "/root/miniconda3/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 890, in start
    message, payload = await self._protocol.read()  # type: ignore
  File "/root/miniconda3/lib/python3.8/site-packages/aiohttp/streams.py", line 604, in read
    await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected

@martindurant
Member

martindurant commented Apr 29, 2021

Would you say that this is another error that we should retry for? There are quite a few exceptions in aiohttp.client_exceptions:

aiohttp.client_exceptions.ClientConnectionError
aiohttp.client_exceptions.ClientConnectorCertificateError
aiohttp.client_exceptions.ClientConnectorError
aiohttp.client_exceptions.ClientConnectorSSLError
aiohttp.client_exceptions.ClientError
aiohttp.client_exceptions.ClientHttpProxyError
aiohttp.client_exceptions.ClientOSError
aiohttp.client_exceptions.ClientPayloadError
aiohttp.client_exceptions.ClientProxyConnectionError
aiohttp.client_exceptions.ClientResponse
aiohttp.client_exceptions.ClientResponseError
aiohttp.client_exceptions.ClientSSLError
aiohttp.client_exceptions.ContentTypeError
aiohttp.client_exceptions.ServerConnectionError
aiohttp.client_exceptions.ServerDisconnectedError
aiohttp.client_exceptions.ServerTimeoutError
aiohttp.client_exceptions.WSServerHandshakeError

@slevang
Contributor

slevang commented Apr 29, 2021

This was definitely intermittent since the job had already been running a couple of hours, so I think retrying is appropriate. It seems hard to pick through these and figure out which actually should be retried, so another approach would be to just retry all of them, with a reasonable limit on total retries?

@martindurant
Member

Yes, I tend to agree with you, after having previously resisted the idea. There are just too many failure cases! The list is here, so it could be edited to add aiohttp.client_exceptions.ClientError, which seems to be the superclass of all of the others.
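
In sketch form, the edit would be something like the following (the tuple name used here is illustrative; check the actual source for the real name and location of the list):

import aiohttp

# aiohttp.ClientError is the base class of the connection errors listed above
# (ServerDisconnectedError, ClientOSError, ...), so listing it makes all of
# them retriable in one go.
RETRIABLE_EXCEPTIONS = (aiohttp.ClientError,)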

@martindurant
Member

Please feel free to put that in a PR :)

@slevang
Contributor

slevang commented Apr 29, 2021

Can do. I'll test it out again with this workload and then make a PR.

@slevang
Contributor

slevang commented Apr 30, 2021

Awesome: with #380 and #385 I'm now able to smoothly write multi-terabyte zarr arrays directly to GCS.
