Combining individual circRNA read counts - error #68

Open
mihinduk opened this issue Sep 16, 2019 · 19 comments

@mihinduk

Describe the bug
DCC quits when trying to combine individual circRNA read counts. This only happens when I run it in a Docker container in an interactive queue:

    bsub -Is -q research-hpc \
      -a 'docker(buddej/dcc:0.1.3)' \
      /bin/bash

I have run this locally without issue and am now trying to run it on a larger server, because some datasets demand too much memory for my local machine (although I am testing on a small dataset of 83 samples).

Since I have to convert the wrapper I wrote from Python 3 to Python 2.7.16 to be compatible with DCC, I have isolated the actual DCC command (the previous steps of generating the input files worked) and have just been running this inside the interactive queue:

To Reproduce
Steps to reproduce the behavior:

  1. Command line used for the command:
    DCC @/gscmnt/gc2645/wgs/km_test/dcc/gtex/02.-ProcessedData/06.-circRNA/Hg19/DCC/DCC_InputFiles/Amygdala/samplesheet -mt1 @/gscmnt/gc2645/wgs/km_test/dcc/gtex/02.-ProcessedData/06.-circRNA/Hg19/DCC/DCC_InputFiles/Amygdala/mate1 -mt2 @/gscmnt/gc2645/wgs/km_test/dcc/gtex/02.-ProcessedData/06.-circRNA/Hg19/DCC/DCC_InputFiles/Amygdala/mate2 -T 20 -D -R /gscmnt/gc2645/wgs/resources/RNAseq/genome/hg19_Repeats_RepeatMasker_SimpleRepeats.gtf -an /gscmnt/gc2645/wgs/resources/RNAseq/genome/gencode.v19.annotation.spike-in.gtf -Pi -F -M -Nr 1 1 -fg -k -G -A /gscmnt/gc2645/wgs/resources/RNAseq/genome/GRCh37.p13.genome.lite.spike-in.fa -B @/gscmnt/gc2645/wgs/km_test/dcc/gtex/02.-ProcessedData/06.-circRNA/Hg19/DCC/DCC_InputFiles/Amygdala/bam_files

  2. Complete error message
    finished circRNA detection from file _tmp_DCC/SRR818418_unified.Chimeric.out.junction.7MIX0L
    Combining individual circRNA read counts
    Traceback (most recent call last):
      File "/usr/local/bin/DCC", line 11, in <module>
        load_entry_point('DCC==0.4.7', 'console_scripts', 'DCC')()
      File "/usr/local/lib/python2.7/site-packages/DCC-0.4.7-py2.7.egg/DCC/main.py", line 287, in main
      File "/usr/local/lib/python2.7/site-packages/DCC-0.4.7-py2.7.egg/DCC/circAnnotate.py", line 26, in selectGeneGtf
      File "/usr/local/lib/python2.7/site-packages/HTSeq/__init__.py", line 197, in __iter__
        for line in FileOrSequence.__iter__(self):
      File "/usr/local/lib/python2.7/site-packages/HTSeq/__init__.py", line 50, in __iter__
        for line in lines:
    IOError: [Errno 14] Bad address

Desktop:

  • Python version: 2.7.16
  • Version: DCC 0.4.7
  • Dockerfile: buddej/dcc:0.1.3
    https://github.com/buddej/mgi-hpc/blob/master/dcc/Dockerfile

Any advice you can give would be greatly appreciated.

Thank you,
Kathie Mihindukulasuriya

@tjakobi tjakobi self-assigned this Sep 19, 2019
@mihinduk
Author

mihinduk commented Oct 2, 2019

I have solved the problem for PE data by setting -T 2 and requesting more memory, based on the stats generated by running the command under /usr/bin/time -v. For SE data, the identical command runs on a local server, but when I run it on a remote server using Docker I get the following error:

Command:
/usr/bin/time -v DCC @/gscmnt/gc2645/wgs/km_test/dcc/msbb/02.-ProcessedData/06.-circRNA/Hg19/DCC/DCC_InputFiles/BM36/samplesheet -mt1 @/gscmnt/gc2645/wgs/km_test/dcc/msbb/02.-ProcessedData/06.-circRNA/Hg19/DCC/DCC_InputFiles/BM36/read -T 2 -D -N -R /gscmnt/gc2645/wgs/resources/RNAseq/genome/hg19_Repeats_RepeatMasker_SimpleRepeats.gtf -an /gscmnt/gc2645/wgs/resources/RNAseq/genome/gencode.v19.annotation.spike-in.gtf -F -M -Nr 1 1 -fg -k -G -A /gscmnt/gc2645/wgs/resources/RNAseq/genome/GRCh37.p13.genome.lite.spike-in.fa -B @/gscmnt/gc2645/wgs/km_test/dcc/msbb/02.-ProcessedData/06.-circRNA/Hg19/DCC/DCC_InputFiles/BM36/bam_files -O /gscmnt/gc2645/wgs/km_test/dcc/msbb/02.-ProcessedData/06.-circRNA/Hg19/DCC/BM36 -t /gscmnt/gc2645/wgs/km_test/dcc/msbb/02.-ProcessedData/06.-circRNA/Hg19/DCC/DCC_InputFiles/BM36/_tmp_DCC

Error:
Traceback (most recent call last):
  File "/usr/local/bin/DCC", line 11, in <module>
    load_entry_point('DCC==0.4.7', 'console_scripts', 'DCC')()
  File "/usr/local/lib/python2.7/site-packages/DCC-0.4.7-py2.7.egg/DCC/main.py", line 145, in main
  File "/usr/local/lib/python2.7/site-packages/DCC-0.4.7-py2.7.egg/DCC/main.py", line 529, in remove_empty_lines
TypeError: 'NoneType' object is not iterable
Command exited with non-zero status 1

@JunmingH

I ran into the same issue. Do you have any idea how to solve it?

@mihinduk
Author

I would try increasing the amount of memory you request:

| Dataset | Tissue | Samples | Cores | Time (h:mm:ss) | Max mem   |
|---------|--------|---------|-------|----------------|-----------|
| MSBB    | BM10   | 325     | 2     | 56:43:04       | 282082312 |
| MSBB    | BM22   | 334     | 4     | 44:50:42       | 313739268 |
| MSBB    | BM36   | 315     | 4     | 37:54:52       | 296575084 |
| MSBB    | BM44   | 308     | 4     | 40:16:14       | 286656860 |
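
To turn numbers like these into a scheduler request, the peak memory can be read from the `/usr/bin/time -v` report mentioned earlier in the thread. The sketch below is only an illustration: it assumes GNU time's `Maximum resident set size (kbytes)` line and assumes the "Max mem" column is in kilobytes. The helper names and the 20 % headroom are made up for this example and are not prescribed by DCC or by the scheduler.

```python
import math
import re

# GNU time -v prints a line of the form
#   "Maximum resident set size (kbytes): 282082312"
_MAX_RSS = re.compile(r"Maximum resident set size \(kbytes\):\s*(\d+)")


def max_rss_kb(time_report):
    """Return the peak resident set size in kB from `/usr/bin/time -v` output."""
    match = _MAX_RSS.search(time_report)
    if match is None:
        raise ValueError("no 'Maximum resident set size' line found")
    return int(match.group(1))


def suggested_request_gib(rss_kb, headroom=0.2):
    """Convert kB to GiB, add headroom (assumed 20 %), and round up."""
    gib = rss_kb / (1024.0 * 1024.0)
    return int(math.ceil(gib * (1.0 + headroom)))


# Example with the BM10 value from the table, assuming it is in kB.
report = "\tMaximum resident set size (kbytes): 282082312\n"
peak = max_rss_kb(report)
print(round(peak / (1024.0 * 1024.0), 1))  # peak in GiB
print(suggested_request_gib(peak))         # GiB to request, with headroom
```

How the resulting figure is passed to the scheduler depends on the cluster configuration, so check the local documentation for the expected units.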

@JunmingH

Thanks for your reply! Is this the solution for the first issue or the second one?

@tjakobi
Contributor

tjakobi commented Nov 19, 2019

Hi @mihinduk,
hi @JunmingH,

Thank you for reporting the issues and for your patience.

Increasing the memory should fix issue 1, since there the error message refers to a bad memory address: IOError: [Errno 14] Bad address. Requesting more memory in cluster-scheduled environments should solve that issue.

The second issue looks familiar. Could it be that you forgot to specify -mt2?

Cheers,
Tobias

@mihinduk
Author

mihinduk commented Nov 19, 2019 via email

@tjakobi
Contributor

tjakobi commented Nov 19, 2019

Hi @mihinduk,

For SE data you must not use -mt1; -mt1 and -mt2 are reserved for PE setups.

Cheers,
Tobias

@mihinduk
Author

Hi Tobias,

Do you have any idea why this would run locally:
/usr/bin/time -v DCC @/40/Public_Data/bulkRNASeq/201812_MSBB/Gene_Expression/02.-ProcessedData/06.-circRNA/Hg19/BM44/DCC/DCC_InputFiles/samplesheet -mt1 @/40/Public_Data/bulkRNASeq/201812_MSBB/Gene_Expression/02.-ProcessedData/06.-circRNA/Hg19/BM44/DCC/DCC_InputFiles/read -T 4 -D -N -R /40/pipelines/RNAseq/circRNA/hg19_Repeats_RepeatMasker_SimpleRepeats.gtf -an /40/pipelines/RNAseq/circRNA/Hg19_gencodev19_spikein/gencode.v19.annotation.spike-in.gtf -F -M -Nr 1 1 -fg -k -G -A /40/pipelines/RNAseq/circRNA/Hg19_gencodev19_spikein/GRCh37.p13.genome.lite.spike-in.fa -B @/40/Public_Data/bulkRNASeq/201812_MSBB/Gene_Expression/02.-ProcessedData/06.-circRNA/Hg19/BM44/DCC/DCC_InputFiles/bam_files -O /40/Public_Data/bulkRNASeq/201812_MSBB/Gene_Expression/02.-ProcessedData/06.-circRNA/Hg19/BM44/DCC/ -t /40/Public_Data/bulkRNASeq/201812_MSBB/Gene_Expression/02.-ProcessedData/06.-circRNA/Hg19/BM44/DCC/_tmp_DCC

@JunmingH

@tjakobi Hi Tobias,
I am also running SE data, without specifying -mt1 and -mt2, but I still get the same error message.
I also tried putting all the file locations into a single list file and running it that way, but I got the same error.

@JunmingH

@tjakobi Hi Tobias,

I have a question about combining the results. Since I could not run all files at the same time, I ran them one by one and stored the results in different directories. How can I combine them, given that each subject has different results? And how should I treat the missing circRNAs?

Thanks!

@tjakobi
Contributor

tjakobi commented Nov 24, 2019

> Hi Tobias,
>
> Do you have any idea why this would run locally:
> /usr/bin/time -v DCC @/40/Public_Data/bulkRNASeq/201812_MSBB/Gene_Expression/02.-ProcessedData/06.-circRNA/Hg19/BM44/DCC/DCC_InputFiles/samplesheet -mt1 @/40/Public_Data/bulkRNASeq/201812_MSBB/Gene_Expression/02.-ProcessedData/06.-circRNA/Hg19/BM44/DCC/DCC_InputFiles/read -T 4 -D -N -R /40/pipelines/RNAseq/circRNA/hg19_Repeats_RepeatMasker_SimpleRepeats.gtf -an /40/pipelines/RNAseq/circRNA/Hg19_gencodev19_spikein/gencode.v19.annotation.spike-in.gtf -F -M -Nr 1 1 -fg -k -G -A /40/pipelines/RNAseq/circRNA/Hg19_gencodev19_spikein/GRCh37.p13.genome.lite.spike-in.fa -B @/40/Public_Data/bulkRNASeq/201812_MSBB/Gene_Expression/02.-ProcessedData/06.-circRNA/Hg19/BM44/DCC/DCC_InputFiles/bam_files -O /40/Public_Data/bulkRNASeq/201812_MSBB/Gene_Expression/02.-ProcessedData/06.-circRNA/Hg19/BM44/DCC/ -t /40/Public_Data/bulkRNASeq/201812_MSBB/Gene_Expression/02.-ProcessedData/06.-circRNA/Hg19/BM44/DCC/_tmp_DCC

I am not sure how DCC handles the case where only mate1 is supplied, as in your example. Generally it is either both -mt1 AND -mt2 or neither of them, so I am pretty sure your command with only -mt1 is not behaving correctly. In any case, this needs to be caught in the code before DCC starts running.
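
Until such a check exists in DCC itself, a wrapper script could test the option list before launching the run. The following is a minimal sketch of that idea, not DCC's actual code: `check_mate_options` is a hypothetical helper that encodes only the rule stated in this thread (both -mt1 and -mt2 for PE data, neither for SE data).

```python
def check_mate_options(argv):
    """Return an error message if -mt1/-mt2 are used inconsistently, else None.

    Rule taken from the discussion: paired-end runs pass both -mt1 and -mt2,
    single-end runs pass neither.
    """
    has_mt1 = "-mt1" in argv
    has_mt2 = "-mt2" in argv
    if has_mt1 != has_mt2:
        given, missing = ("-mt1", "-mt2") if has_mt1 else ("-mt2", "-mt1")
        return "%s given without %s: use both for PE data, neither for SE data" % (
            given, missing)
    return None


# Shortened version of the single-end call from this thread (paths abbreviated).
se_call = ["DCC", "@samplesheet", "-mt1", "@read", "-T", "2", "-D", "-N"]
print(check_mate_options(se_call))
```

A wrapper would abort when the function returns a message and only invoke DCC when it returns `None`.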

@tjakobi
Contributor

tjakobi commented Nov 24, 2019

> @tjakobi Hi Tobias,
> I am also running SE data, without specifying -mt1 and -mt2, but I still get the same error message.
> I also tried putting all the file locations into a single list file and running it that way, but I got the same error.

Hi @JunmingH,

are you referring to this error?

    remove_empty_lines
    TypeError: 'NoneType' object is not iterable
    Command exited with non-zero status 1

Cheers,
Tobias

@tjakobi
Contributor

tjakobi commented Nov 24, 2019

> @tjakobi Hi Tobias,
>
> I have a question about combining the results. Since I could not run all files at the same time, I ran them one by one and stored the results in different directories. How can I combine them, given that each subject has different results? And how should I treat the missing circRNAs?
>
> Thanks!

Hi @JunmingH,

please see my response in your new issue: #72 (comment)

Cheers,
Tobias
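
For readers who land here first: the maintainer's answer is in #72 and is not reproduced in this thread. Purely as a generic illustration of the question, the sketch below merges per-sample count dictionaries over the union of all circRNAs and fills gaps with 0. The data layout, the `(chrom, start, end, strand)` key, the sample values, and the choice of 0 are all assumptions made for the example; after per-run filtering, a missing entry may mean "filtered out" rather than "not expressed".

```python
def merge_counts(per_sample):
    """Merge {sample: {circ_key: count}} into one table, filling gaps with 0.

    circ_key is assumed to be a (chrom, start, end, strand) tuple.
    Returns the sorted sample names and {circ_key: [count per sample]}.
    """
    samples = sorted(per_sample)
    all_keys = sorted(set(key for counts in per_sample.values() for key in counts))
    table = {}
    for key in all_keys:
        table[key] = [per_sample[name].get(key, 0) for name in samples]
    return samples, table


# Made-up example data for two separate single-sample runs.
runs = {
    "subject1": {("chr1", 100, 500, "+"): 4, ("chr2", 10, 90, "-"): 2},
    "subject2": {("chr1", 100, 500, "+"): 7},
}
samples, merged = merge_counts(runs)
print("\t".join(["circRNA"] + samples))
for key in sorted(merged):
    print("\t".join(["%s:%d-%d:%s" % key] + [str(c) for c in merged[key]]))
```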

@JunmingH

@tjakobi Yes, it is the same error:

    remove_empty_lines
    TypeError: 'NoneType' object is not iterable
    Command exited with non-zero status 1

@tjakobi
Contributor

tjakobi commented Nov 25, 2019

Hi @JunmingH,

could you please make sure that your BAM input list file does not contain any empty lines?

Also: I would like to see your complete DCC call. Did you use @filename for specifying the input list?

Cheers,
Tobias
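
The check requested above can be scripted. This is a small sketch under the assumption that both list files contain one path per line; `read_list_file` and `check_inputs` are hypothetical helper names, and whether a blank line is really what triggers the `remove_empty_lines` error has not been confirmed in this thread.

```python
import io
import os
import tempfile


def read_list_file(path):
    """Read a list file (one path per line); return entries and blank line numbers."""
    entries, blank_lines = [], []
    with io.open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if line.strip():
                entries.append(line.strip())
            else:
                blank_lines.append(number)
    return entries, blank_lines


def check_inputs(samplesheet, bam_list):
    """Return a list of problems found in the two list files."""
    problems = []
    junctions, blank_j = read_list_file(samplesheet)
    bams, blank_b = read_list_file(bam_list)
    if blank_j:
        problems.append("%s: blank line(s) at %s" % (samplesheet, blank_j))
    if blank_b:
        problems.append("%s: blank line(s) at %s" % (bam_list, blank_b))
    if len(junctions) != len(bams):
        problems.append("%d junction files but %d BAM files" % (len(junctions), len(bams)))
    return problems


# Demo on throwaway files: the BAM list has a stray empty line in the middle.
tmp_dir = tempfile.mkdtemp()
sheet = os.path.join(tmp_dir, "samplesheet")
bam_files = os.path.join(tmp_dir, "bam_files")
with io.open(sheet, "w", encoding="utf-8") as handle:
    handle.write(u"/align/s1_Chimeric.out.junction\n/align/s2_Chimeric.out.junction\n")
with io.open(bam_files, "w", encoding="utf-8") as handle:
    handle.write(u"/align/s1.bam\n\n/align/s2.bam\n")
demo_problems = check_inputs(sheet, bam_files)
for problem in demo_problems:
    print(problem)
```

When calling it on real inputs, pass the plain file names, without the leading `@` used on the DCC command line.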

@JunmingH

@tjakobi Sure

python2 ${app_dir}/main.py @samplesheet \
  -D -N -R ${gtf_dir}/GRCh38_Repeats_simpleRepeats_RepeatMasker.gtf \
  -an ref/GRCh38/annotation/Homo_sapiens.GRCh38.95.gtf \
  -F -M -Nr 1 1 -fg -G -A ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
  -T 2 -O /dcc_all_results/ \
  -B @bam_files

The samplesheet and bam_files files are formatted like this.

samplesheet:
/align/subject1.sort.coord.combined_Chimeric.out.junction
/align/subject2.sort.coord.combined_Chimeric.out.junction
/align/subject3.sort.coord.combined_Chimeric.out.junction

bam_files:
/align/subject1.sort.coord.combined_Aligned.sortedByCoord.out.bam
/align/subject2.sort.coord.combined_Aligned.sortedByCoord.out.bam
/align/subject3.sort.coord.combined_Aligned.sortedByCoord.out.bam

@tjakobi
Contributor

tjakobi commented Nov 25, 2019

Hi @JunmingH,

looks good. Could you please attach the original bam_files and samplesheet files?

Cheers,
Tobias

@JunmingH

Could you give me an email address? I can send it to you!

@tjakobi
Contributor

tjakobi commented Nov 25, 2019

You can directly upload files here on GitHub via the area under the text field ("Attach files by dragging...")

@tjakobi tjakobi closed this as completed Nov 27, 2019
@tjakobi tjakobi reopened this Nov 27, 2019

3 participants