clinvar UnicodeDecodeError: pybedtools issue #3078

Closed
naumenko-sa opened this issue Feb 3, 2020 · 3 comments

@naumenko-sa
Contributor

Hi everyone!

This happens during RNA-seq variant calling. The last command logged was:

/n/app/bcbio/dev/anaconda/bin/bedtools slop \
-i /n/scratch2/hsph_bioinformatic_core/sn240/atanasova2020/atanasova/work/align/S10/S10_star/../S10SJ.out-minimized.bed \
-g /n/app/bcbio/dev/genomes/Hsapiens/hg38/seq/hg38.fa.fai -b 10 | \
bedtools merge -i - > S10SJ.out-minimized-padded.bed

error:

[2020-02-03T15:41Z] multiprocessing: concat_variant_files
[2020-02-03T15:41Z] multiprocessing: run_rnaseq_ann_filter
[2020-02-03T15:41Z] Removing variants within 10 bases of splice junctions listed in /n/scratch2/hsph_bioinformatic_core/sn240/atanasova2020/atanasova/work/joint/gatk-haplotype-joint/b1/S10SJ.out-minimized-padded.bed from /n/scratch2/hsph_bioinformatic_core/sn240/atanasova2020/atanasova/work/joint/gatk-haplotype-joint/b1/b1-joint-effects-annotated-gemini-filter.vcf.gz. 
Traceback (most recent call last):
  File "/n/app/bcbio/dev/anaconda/bin/bcbio_nextgen.py", line 245, in <module>
    main(**kwargs)
  File "/n/app/bcbio/dev/anaconda/bin/bcbio_nextgen.py", line 46, in main
    run_main(**kwargs)
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 50, in run_main
    fc_dir, run_info_yaml)
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 91, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/bcbio/pipeline/main.py", line 266, in rnaseqpipeline
    samples = rnaseq.rnaseq_variant_calling(samples, run_parallel)
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/bcbio/pipeline/rnaseq.py", line 106, in rnaseq_variant_calling
    samples = run_parallel("run_rnaseq_ann_filter", samples)
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"], batch_size=1, backend="multiprocessing")(joblib.delayed(fn)(*x) for x in items):
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 921, in __call__
    if self.dispatch_one_batch(iterator):
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 549, in __init__
    self.results = batch()
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/joblib/parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/bcbio/utils.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/bcbio/distributed/multitasks.py", line 287, in run_rnaseq_ann_filter
    return rnaseq.run_rnaseq_ann_filter(*args)
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/bcbio/pipeline/rnaseq.py", line 157, in run_rnaseq_ann_filter
    vrn_file = variation.filter_junction_variants(vrn_file, data)
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/bcbio/rnaseq/variation.py", line 205, in filter_junction_variants
    pybedtools.BedTool(vrn_file).intersect(spliceslop, wa=True, header=True, v=True).saveas(out_base)
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/pybedtools/bedtool.py", line 840, in decorated
    result = method(self, *args, **kwargs)
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/pybedtools/bedtool.py", line 3134, in saveas
    compressed=compressed)
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/site-packages/pybedtools/bedtool.py", line 1289, in _collapse
    out_.writelines(in_)
  File "/n/app/bcbio/dev/anaconda/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2720: invalid continuation byte
Mon Feb  3 10:41:30 EST 2020

SN
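
For anyone hitting the same traceback, here is a minimal sketch (not bcbio or pybedtools code) for finding which lines of a file are not valid UTF-8. Byte 0xe9 is "é" in Latin-1, so an accented character in an annotation field is one plausible cause, but that is an assumption rather than something the log confirms. The helper name `find_non_utf8_lines` is hypothetical.

```python
import gzip


def find_non_utf8_lines(path):
    """Return (line_number, offset_in_line, raw_bytes) for each line that is not valid UTF-8.

    Hypothetical diagnostic helper; handles plain and gzip/bgzip-compressed files.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    bad = []
    with opener(path, "rb") as handle:
        # Read raw bytes so that decoding failures can be caught per line.
        for lineno, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                # exc.start is the offset of the first undecodable byte in this line.
                bad.append((lineno, exc.start, raw))
    return bad
```

Running it on the VCF passed to `filter_junction_variants` should point at the record that triggers the error, if the input file is where the bad byte comes from.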

@hackdna self-assigned this on Feb 3, 2020
@hackdna
Member

hackdna commented Feb 5, 2020

I ran this pipeline locally in Vagrant (up to the QC step) using version 1.2.0a, with S10.fastq.gz as input and a copy of the intermediate results from /n/scratch2/hsph_bioinformatic_core/sn240/atanasova2020/atanasova/work, but could not reproduce the error.

Debug log:

[2020-02-05T16:23Z] System YAML configuration: /home/vagrant/local/share/bcbio-nextgen/galaxy/bcbio_system.yaml.
[2020-02-05T16:23Z] Locale set to C.UTF-8.
[2020-02-05T16:23Z] Resource requests: picard; memory: 15.00; cores: 4
[2020-02-05T16:23Z] Configuring 1 jobs to run, using 1 cores each with 15.00g of memory reserved for each job
[2020-02-05T16:23Z] Timing: organize samples
[2020-02-05T16:23Z] multiprocessing: organize_samples
[2020-02-05T16:23Z] Using input YAML configuration: /data/sergey/atanasova.yaml
[2020-02-05T16:23Z] Checking sample YAML configuration: /data/sergey/atanasova.yaml
[2020-02-05T16:23Z] The vcfanno configuration /data/genomes/hg38/config/vcfanno/rnaedit.conf was not found for hg38, skipping.
[2020-02-05T16:23Z] Retreiving program versions from /home/vagrant/local/share/bcbio-nextgen/manifest/python-packages.yaml.
[2020-02-05T16:23Z] Retreiving program versions from /home/vagrant/local/share/bcbio-nextgen/manifest/r-packages.yaml.
[2020-02-05T16:23Z] Testing minimum versions of installed programs
[2020-02-05T16:23Z] multiprocessing: prepare_sample
[2020-02-05T16:23Z] Preparing S10
[2020-02-05T16:23Z] Resource requests: picard, samtools, star; memory: 15.00, 15.00, 15.00; cores: 4, 4, 4
[2020-02-05T16:23Z] Configuring 1 jobs to run, using 1 cores each with 15.00g of memory reserved for each job
[2020-02-05T16:23Z] Timing: alignment
[2020-02-05T16:23Z] multiprocessing: disambiguate_split
[2020-02-05T16:23Z] multiprocessing: process_alignment
[2020-02-05T16:23Z] Aligning lane S10 with star aligner
[2020-02-05T16:23Z] Resource requests: cufflinks, samtools; memory: 15.00, 15.00; cores: 4, 4
[2020-02-05T16:23Z] Configuring 1 jobs to run, using 1 cores each with 15.00g of memory reserved for each job
[2020-02-05T16:23Z] Timing: disambiguation
[2020-02-05T16:23Z] Timing: transcript assembly
[2020-02-05T16:23Z] Timing: estimate expression (threaded)
[2020-02-05T16:23Z] multiprocessing: generate_transcript_counts
[2020-02-05T16:23Z] multiprocessing: run_salmon_index
[2020-02-05T16:23Z] Transcriptome index for /data/sergey/inputs/transcriptome/hg38.fa detected, skipping building.
[2020-02-05T16:23Z] multiprocessing: run_salmon_reads
[2020-02-05T16:23Z] Transcriptome index for /data/sergey/inputs/transcriptome/hg38.fa detected, skipping building.
[2020-02-05T16:23Z] multiprocessing: detect_fusions
[2020-02-05T16:23Z] Resource requests: dexseq, express; memory: 3.00, 3.00; cores: 1, 1
[2020-02-05T16:23Z] Configuring 1 jobs to run, using 1 cores each with 3.00g of memory reserved for each job
[2020-02-05T16:23Z] Timing: estimate expression (single threaded)
[2020-02-05T16:25Z] Resource requests: gatk, vardict; memory: 3.50, 15.00; cores: 1, 4
[2020-02-05T16:25Z] Configuring 1 jobs to run, using 1 cores each with 15.00g of memory reserved for each job
[2020-02-05T16:25Z] Timing: RNA-seq variant calling
[2020-02-05T16:25Z] multiprocessing: run_rnaseq_variant_calling
[2020-02-05T16:25Z] multiprocessing: square_batch_region
[2020-02-05T16:25Z] multiprocessing: concat_variant_files
[2020-02-05T16:25Z] multiprocessing: run_rnaseq_ann_filter
[2020-02-05T16:25Z] Pad BED file : S10
[2020-02-05T16:25Z] Timing: finished

Commands log:

[2020-02-05T16:25Z] /home/vagrant/local/share/bcbio-nextgen/galaxy/../anaconda/bin/bedtools slop -i /data/sergey/align/S10/S10_star/../S10SJ.out-minimized.bed -g /data/genomes/hg38/seq/hg38.fa.fai -b 10 | bedtools merge -i - > /data/sergey/bcbiotx/tmpjsbz3l4o/S10SJ.out-minimized-padded.bed

I am going to try to reproduce this on O2 next.

@naumenko-sa
Contributor Author

Update: this is fixed in the pybedtools repo. We need a pybedtools release before we can update the production bcbio instances.
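
Until such a release is installed, one possible stopgap is to rewrite the offending plain-text file as valid UTF-8 before it is read. The sketch below is not what the pybedtools fix does. It assumes that any bytes that fail UTF-8 decoding are Latin-1, and it works on uncompressed text only, so a bgzipped VCF would need to be decompressed first and recompressed and reindexed afterwards. The helper name `rewrite_as_utf8` is hypothetical.

```python
def rewrite_as_utf8(src_path, dst_path):
    """Copy a plain-text file, re-encoding lines that are not valid UTF-8.

    Hypothetical helper. Returns the number of lines that needed re-encoding.
    """
    fixed = 0
    # newline="" keeps the original line endings untouched on write.
    with open(src_path, "rb") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for raw in src:
            try:
                text = raw.decode("utf-8")
            except UnicodeDecodeError:
                # Assumption: stray bytes such as 0xe9 are Latin-1 characters.
                text = raw.decode("latin-1")
                fixed += 1
            dst.write(text)
    return fixed
```

Decoding line by line, rather than treating the whole file as Latin-1, leaves lines that are already valid UTF-8 unchanged.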

@naumenko-sa changed the title from "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2720: invalid continuation byte" to "clinvar UnicodeDecodeError: pybedtools issue" on May 27, 2020
@naumenko-sa mentioned this issue on May 27, 2020
@naumenko-sa
Contributor Author

daler/pybedtools#319

This issue was closed.