-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Document issues at scale #20
Comments
Over the course of a day, here are a few problems I discovered when trying to ramp up to ~20 nodes in a cluster to run jobs. This including deleting and creating clusters serveral times throughout the day. It isn't clear to me which of these are definitely due to the larger cluster and which are just spurious. I'll log them in case they continue to occur.
Normally the cluster is immediately accessible after a resize.
Full Trace2020-10-21 14:06:27,368 | __main__ | INFO | Filter summary: Traceback (most recent call last): File "scripts/gwas_dev.py", line 300, in # logger.info("Done") File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/fire/core.py", line 138, in Fire component_trace = _Fire(component, args, parsed_flag_args, context, name) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/fire/core.py", line 463, in _Fire component, remaining_args = _CallAndUpdateTrace( File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace component = fn(*varargs, **kwargs) File "scripts/gwas_dev.py", line 232, in run_qc_1 logger.info('Applying QC filters') File "scripts/gwas_dev.py", line 112, in apply_filters v = v.compute() File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/xarray/core/dataarray.py", line 834, in compute return new.load(**kwargs) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/xarray/core/dataarray.py", line 808, in load ds = self._to_temp_dataset().load(**kwargs) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/xarray/core/dataset.py", line 654, in load evaluated_data = da.compute(*lazy_data.values(), **kwargs) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/dask/base.py", line 452, in compute results = schedule(dsk, keys, **kwargs) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/distributed/client.py", line 2725, in get results = self.gather(packed, asynchronous=asynchronous, direct=direct) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/distributed/client.py", line 1986, in gather return self.sync( File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/distributed/client.py", line 832, in sync return sync( File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/distributed/utils.py", line 340, in sync raise exc.with_traceback(tb) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/distributed/utils.py", line 324, in f result[0] = yield future File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/tornado/gen.py", line 735, in run value = future.result() File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/distributed/client.py", line 1851, in _gather raise exception.with_traceback(traceback) File "/opt/conda/lib/python3.8/site-packages/dask/array/core.py", line 102, in getter File "/opt/conda/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray File "/opt/conda/lib/python3.8/site-packages/xarray/core/indexing.py", line 495, in __array__ File "/opt/conda/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray File "/opt/conda/lib/python3.8/site-packages/xarray/core/indexing.py", line 560, in __array__ File "/opt/conda/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray File "/opt/conda/lib/python3.8/site-packages/xarray/coding/variables.py", line 70, in __array__ File "/opt/conda/lib/python3.8/site-packages/xarray/coding/variables.py", line 218, in _scale_offset_decoding File "/opt/conda/lib/python3.8/site-packages/xarray/coding/variables.py", line 70, in __array__ File "/opt/conda/lib/python3.8/site-packages/xarray/coding/variables.py", line 138, in _apply_mask File "/opt/conda/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray File "/opt/conda/lib/python3.8/site-packages/xarray/core/indexing.py", line 560, in __array__ File "/opt/conda/lib/python3.8/site-packages/xarray/backends/zarr.py", line 54, in __getitem__ File "/opt/conda/lib/python3.8/site-packages/xarray/backends/zarr.py", line 51, in get_array File "/opt/conda/lib/python3.8/site-packages/zarr/hierarchy.py", line 338, in __getitem__ File "/opt/conda/lib/python3.8/site-packages/zarr/core.py", line 124, in __init__ File "/opt/conda/lib/python3.8/site-packages/zarr/core.py", line 141, in _load_metadata File "/opt/conda/lib/python3.8/site-packages/zarr/core.py", line 150, in _load_metadata_nosync File "/opt/conda/lib/python3.8/site-packages/fsspec/mapping.py", line 133, in __getitem__ AttributeError: 'FSMap' object has no attribute 'missing_exceptions'
Full Log(gwas-dev) eczech@ukb-experiments:~/repos/ukb-gwas-pipeline-nealelab$ python scripts/gwas_dev.py run_qc_1 --input-path=rs-ukb/prep/gt-imputation/ukb_chr21.zarr --output-path=rs-ukb/prep/gt-imputation-qc/ukb_chr21.zarr 2020-10-21 15:46:30,108 | __main__ | INFO | Initialized script with dask client: 2020-10-21 15:46:30,109 | __main__ | INFO | Running stage 1 QC 2020-10-21 15:46:43,305 | __main__ | INFO | Loaded dataset: Dimensions: (alleles: 2, genotypes: 3, samples: 487409, variants: 1261158) Dimensions without coordinates: alleles, genotypes, samples, variants Data variables: call_genotype_probability_mask (variants, samples, genotypes) bool dask.array sample_id (samples) int32 dask.array sample_sex (samples) uint8 dask.array variant_allele (variants, alleles) |S101 dask.array variant_contig (variants) int64 dask.array variant_contig_name (variants) |S2 dask.array variant_id (variants) |S115 dask.array variant_info (variants) float32 dask.array variant_maf (variants) float32 dask.array variant_minor_allele (variants) |S101 dask.array variant_position (variants) int32 dask.array variant_rsid (variants) |S115 dask.array call_genotype_probability (variants, samples, genotypes) float16 dask.array call_dosage (variants, samples) float16 dask.array call_dosage_mask (variants, samples) bool dask.array Attributes: contig_index: 20 contig_name: 21 contigs: ['21'] 2020-10-21 15:46:43,305 | __main__ | INFO | Applying QC filters 2020-10-21 15:46:43,384 | __main__ | INFO | Filter summary: 2020-10-21 15:46:43,911 | __main__ | INFO | high_info: {False: 896362, True: 364796} 2020-10-21 15:46:44,550 | __main__ | INFO | Filter summary: 2020-10-21 16:05:33,221 | __main__ | INFO | nonzero_stddev: {True: 364796} 2020-10-21 16:05:34,199 | __main__ | INFO | Saving dataset to rs-ukb/prep/gt-imputation-qc/ukb_chr21.zarr: Dimensions: (alleles: 2, genotypes: 3, samples: 487409, variants: 364796) Dimensions without coordinates: alleles, genotypes, samples, variants Data variables: call_genotype_probability_mask (variants, samples, genotypes) bool dask.array sample_id (samples) int32 dask.array sample_sex (samples) uint8 dask.array variant_allele (variants, alleles) |S101 dask.array variant_contig (variants) int64 dask.array variant_contig_name (variants) |S2 dask.array variant_id (variants) |S115 dask.array variant_info (variants) float32 dask.array variant_maf (variants) float32 dask.array variant_minor_allele (variants) |S101 dask.array variant_position (variants) int32 dask.array variant_rsid (variants) |S115 dask.array call_genotype_probability (variants, samples, genotypes) float16 dask.array call_dosage (variants, samples) float16 dask.array call_dosage_mask (variants, samples) bool dask.array variant_dosage_std (variants) float32 dask.array Attributes: contig_index: 20 contig_name: 21 contigs: ['21'] Traceback (most recent call last): File "scripts/gwas_dev.py", line 304, in fire.Fire() File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/fire/core.py", line 138, in Fire component_trace = _Fire(component, args, parsed_flag_args, context, name) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/fire/core.py", line 463, in _Fire component, remaining_args = _CallAndUpdateTrace( File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace component = fn(*varargs, **kwargs) File "scripts/gwas_dev.py", line 227, in run_qc_1 save_dataset(ds, output_path) File "scripts/gwas_dev.py", line 65, in save_dataset ds.to_zarr(store, mode="w", consolidated=True) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/xarray/core/dataset.py", line 1652, in to_zarr return to_zarr( File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/xarray/backends/api.py", line 1369, in to_zarr writes = writer.sync(compute=compute) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/xarray/backends/common.py", line 155, in sync delayed_store = da.store( File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/dask/array/core.py", line 981, in store result.compute(**kwargs) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/dask/base.py", line 167, in compute (result,) = compute(self, traverse=False, **kwargs) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/dask/base.py", line 452, in compute results = schedule(dsk, keys, **kwargs) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/distributed/client.py", line 2725, in get results = self.gather(packed, asynchronous=asynchronous, direct=direct) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/distributed/client.py", line 1986, in gather return self.sync( File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/distributed/client.py", line 832, in sync return sync( File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/distributed/utils.py", line 340, in sync raise exc.with_traceback(tb) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/distributed/utils.py", line 324, in f result[0] = yield future File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/tornado/gen.py", line 735, in run value = future.result() File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/distributed/client.py", line 1851, in _gather raise exception.with_traceback(traceback) File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 75, in loads pass File "/opt/conda/lib/python3.8/site-packages/zarr/core.py", line 1946, in __setstate__ File "/opt/conda/lib/python3.8/site-packages/zarr/core.py", line 124, in __init__ File "/opt/conda/lib/python3.8/site-packages/zarr/core.py", line 141, in _load_metadata File "/opt/conda/lib/python3.8/site-packages/zarr/core.py", line 150, in _load_metadata_nosync File "/opt/conda/lib/python3.8/site-packages/fsspec/mapping.py", line 132, in __getitem__ File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 227, in cat File "/opt/conda/lib/python3.8/site-packages/gcsfs/core.py", line 826, in _cat_file File "/opt/conda/lib/python3.8/site-packages/gcsfs/core.py", line 487, in _call File "/opt/conda/lib/python3.8/site-packages/aiohttp/client.py", line 1012, in __aenter__ File "/opt/conda/lib/python3.8/site-packages/aiohttp/client.py", line 504, in _request File "/opt/conda/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 847, in start File "/opt/conda/lib/python3.8/site-packages/aiohttp/streams.py", line 591, in read aiohttp.client_exceptions.ServerDisconnectedError
Full Log2020-10-21 21:39:06,490 | __main__ | INFO | Initialized script with dask client: 2020-10-21 21:39:06,490 | __main__ | INFO | Running stage 1 QC (input_path=rs-ukb/prep/gt-imputation-qc/ukb_chr22.zarr, output_path=rs-ukb/pipe/nealelab-gwas-uni-ancestry-v3/input/gt-imputation/ukb_chr22.zarr) 2020-10-21 21:39:06,748 | __main__ | INFO | Loaded dataset: Dimensions: (alleles: 2, genotypes: 3, samples: 487409, variants: 365260) Dimensions without coordinates: alleles, genotypes, samples, variants Data variables: call_dosage (variants, samples) float16 dask.array call_dosage_mask (variants, samples) bool dask.array call_genotype_probability (variants, samples, genotypes) float16 dask.array call_genotype_probability_mask (variants, samples, genotypes) bool dask.array sample_id (samples) int32 dask.array sample_sex (samples) uint8 dask.array variant_allele (variants, alleles) |S414 dask.array variant_contig (variants) int64 dask.array variant_contig_name (variants) |S2 dask.array variant_dosage_std (variants) float32 dask.array variant_id (variants) |S428 dask.array variant_info (variants) float32 dask.array variant_maf (variants) float32 dask.array variant_minor_allele (variants) |S414 dask.array variant_position (variants) int32 dask.array variant_rsid (variants) |S304 dask.array Attributes: contig_index: 21 contig_name: 22 contigs: ['22'] 2020-10-21 21:39:06,748 | __main__ | INFO | Applying variant QC filters 2020-10-21 21:39:08,630 | __main__ | INFO | Filter summary (True ==> kept): 2020-10-21 21:39:09,108 | __main__ | INFO | high_maf: {True: 213203, False: 152057} 2020-10-21 21:39:09,435 | __main__ | INFO | Filter summary (True ==> kept): 2020-10-21 21:45:10,070 | __main__ | INFO | in_hwe: {True: 142510, False: 70693} 2020-10-21 21:45:10,326 | __main__ | INFO | Applying sample QC filters (sample_qc_path=rs-ukb/prep/main/ukb_sample_qc.zarr) 2020-10-21 21:45:10,905 | __main__ | INFO | Filter summary (True ==> kept): 2020-10-21 21:45:11,098 | __main__ | INFO | no_aneuploidy: {True: 501854, False: 651} 2020-10-21 21:45:11,220 | __main__ | INFO | in_pca: {True: 407127, False: 95378} 2020-10-21 21:45:11,305 | __main__ | INFO | in_ethnic_groups: {True: 455793, False: 46712} 2020-10-21 21:45:11,322 | __main__ | INFO | overall: {True: 365941, False: 136564} 2020-10-21 21:45:15,710 | __main__ | INFO | Saving dataset to rs-ukb/pipe/nealelab-gwas-uni-ancestry-v3/input/gt-imputation/ukb_chr22.zarr: Dimensions: (alleles: 2, genotypes: 3, principal_components: 40, samples: 365941, variants: 142510) Dimensions without coordinates: alleles, genotypes, principal_components, samples, variants Data variables: variant_hwe_p_value (variants) float64 dask.array call_dosage (variants, samples) float16 dask.array call_dosage_mask (variants, samples) bool dask.array call_genotype_probability (variants, samples, genotypes) float16 dask.array call_genotype_probability_mask (variants, samples, genotypes) bool dask.array sample_id (samples) int32 dask.array sample_sex (samples) uint8 dask.array variant_allele (variants, alleles) |S414 dask.array variant_contig (variants) int64 dask.array variant_contig_name (variants) |S2 dask.array variant_dosage_std (variants) float32 dask.array variant_id (variants) |S428 dask.array variant_info (variants) float32 dask.array variant_maf (variants) float32 dask.array variant_minor_allele (variants) |S414 dask.array variant_position (variants) int32 dask.array variant_rsid (variants) |S304 dask.array variant_genotype_counts (variants, genotypes) int32 dask.array sample_qc_sex (samples) float64 dask.array sample_genetic_sex (samples) float64 dask.array sample_age_at_recruitment (samples) float64 dask.array sample_principal_component (samples, principal_components) float64 dask.array sample_ethnic_background (samples) float64 dask.array sample_genotype_measurement_batch (samples) float64 dask.array sample_genotype_measurement_plate (samples) |S13 dask.array sample_genotype_measurement_well (samples) |S3 dask.array Traceback (most recent call last): File "scripts/gwas_dev.py", line 304, in # logger.info(f"Saving p-values to {output_path}") File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/0522a0a1/lib/python3.8/site-packages/fire/core.py", line 138, in Fire component_trace = _Fire(component, args, parsed_flag_args, context, name) File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/0522a0a1/lib/python3.8/site-packages/fire/core.py", line 463, in _Fire component, remaining_args = _CallAndUpdateTrace( File "/home/eczech/repos/ukb-gwas-pipeline-nealelab/.snakemake/conda/0522a0a1/lib/python3.8/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace component = fn(*varargs, **kwargs) File "scripts/gwas_dev.py", line 247, in run_qc_2
Full TraceTraceback (most recent call last): File "scripts/gwas_dev.py", line 304, in fire.Fire() File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/fire/core.py", line 138, in Fire component_trace = _Fire(component, args, parsed_flag_args, context, name) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/fire/core.py", line 463, in _Fire component, remaining_args = _CallAndUpdateTrace( File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/fire/core.py", line 672, in _CallAndUpdateTrace component = fn(*varargs, **kwargs) File "scripts/gwas_dev.py", line 227, in run_qc_1 save_dataset(ds, output_path) File "scripts/gwas_dev.py", line 65, in save_dataset ds.to_zarr(store, mode="w", consolidated=True) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/xarray/core/dataset.py", line 1652, in to_zarr return to_zarr( File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/xarray/backends/api.py", line 1369, in to_zarr writes = writer.sync(compute=compute) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/xarray/backends/common.py", line 155, in sync delayed_store = da.store( File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/dask/array/core.py", line 981, in store result.compute(**kwargs) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/dask/base.py", line 167, in compute (result,) = compute(self, traverse=False, **kwargs) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/dask/base.py", line 452, in compute results = schedule(dsk, keys, **kwargs) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/distributed/client.py", line 2725, in get results = self.gather(packed, asynchronous=asynchronous, direct=direct) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/distributed/client.py", line 1986, in gather return self.sync( File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/distributed/client.py", line 832, in sync return sync( File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/distributed/utils.py", line 340, in sync raise exc.with_traceback(tb) File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/distributed/utils.py", line 324, in f result[0] = yield future File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/tornado/gen.py", line 735, in run value = future.result() File "/home/eczech/miniconda3/envs/gwas-dev/lib/python3.8/site-packages/distributed/client.py", line 1851, in _gather raise exception.with_traceback(traceback) File "/opt/conda/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 75, in loads pass ModuleNotFoundError: No module named 'gcsfs' |
On several occasions when all work seems to bottleneck at a single node, I see an Info page like this: Clicking on that individual worker with all the pending tasks results in "500: Internal Server Error". This workload (running GWAS regressions) eventually failed and it appears that all the workers reset, or at least the "logs" link above shows output as if they were all restarted after the failure. I'll see if the individual VMs have logs retained somewhere. |
Nope, looks like all worker logs get cleared for some reason. |
For some reason this step failed on chr22 but not chr21 today. The error was consistent with OOM yet this step does nothing but filter and rechunk. It is probably the rechunking that is problematic -- why it is not consistently problematic is odd (the datasets for chr 21 and 22 are very similar, they have same number of samples and ~365k variants +/- 1k). |
Allowing fsspec to overwrite zarr archives is a very bad idea. On more than one occasion, I have been puzzled as to why a client script appears to be idling with little to no cpu usage for ~15 minutes only to find that it is deleting a previously written zarr archive very slowly. These should always be deleted first with "gsutil -m" instead of relying on Perhaps fsspec is deleting files synchronously instead of in parallel like gsutil does. |
Today I tried to run more rechunking operations, but had an issue I have not seen before. I tried to run a GKE cluster with 5 64 vCPU (262144 MB RAM, 200G disk) nodes and a rechunking process on each one, for different contigs, but all jobs continuously failed with an error like "Exit code 255": I believe this is somehow being thrown from the Python process run as part of the snakemake CLI command. There is no information in the logs though, e.g.: This is the first time an error has occurred (other than OOM) without any useful logging, and in all cases it happens right after the bgen file is downloaded and before the conda environment is activated for the rechunking code. So far I have failed to reproduce this by running the same jobs, on the same VMs, and in the same cluster using smaller bgen files (chr XY runs as usual). I also tried running a job for a larger contig (e.g. 18) on a local Debian buster system (in the same conda env) but that also worked as expected. Changes in use for this work: c7318b5 UPDATE I tried this again on a cluster of 32 vCPUs instead. I still see the same failure for chromosome 19 and 20 (or presumably anything larger) but not 21 or 22. The problem likely has something to do with the file size. I'm not sure what code snakemake is using to do the file download, but the next logical step is probably to get onto a k8s cluster node, run the docker image manually, and then try to run the same command that snakemake issues. This might produce a system message or something else that is easier to catch in a terminal before digging into the code. Command used to run snakemake script:
Note: I updated snakemake from 5.22 to 5.30 and google-cloud-sdk from 315.0 to 319.0 for these runs (the latest versions available). UPDATE See #25. The issue was disk space though it is still unclear why snakemake appears to be duplicating files on download. |
In the last couple days while rechunking bgen files, I encountered two jobs that failed for these reasons (after running for ~6 hrs):
|
I have seen the first checksum error at least 4 or 5 times now, and it looks like it is a snakemake bug. I logged it at snakemake/snakemake#785. The second error was logged at fsspec/gcsfs#315. |
And here is another error that came out of nowhere today:
Filed as fsspec/gcsfs#316. |
This is a broad issue that I'll use to document some of the problems that have come up when scaling out the UKB pipeline.
For convenience, here is a list of issues that have been filed as a result of these problems:
Some of these are discussed more below. Others were filed directly in upstream repos.
The text was updated successfully, but these errors were encountered: