Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question about circFilter.py #78

Open
JunmingH opened this issue Mar 16, 2020 · 14 comments
Open

Question about circFilter.py #78

JunmingH opened this issue Mar 16, 2020 · 14 comments
Assignees

Comments

@JunmingH
Copy link

Hi,

I am trying to run DCC but always stuck at circFilter.py
The error message is below.
Traceback (most recent call last):
File "/circu_RNA/DCC-0.4.8/DCC/main.py", line 842, in
main()
File "/circu_RNA/DCC-0.4.8/DCC/main.py", line 375, in main
filt.filter_nonrep(rep_file, indx0, count0)
File "/circu_RNA/DCC-0.4.8/DCC/circFilter.py", line 110, in filter_nonrep
nonrep = np.column_stack((indx0, count0))
File "/share/pkg.7/python2/2.7.16/install/lib/python2.7/site-packages/numpy/lib/shape_base.py", line 640, in column_stack
return _nx.concatenate(arrays, 1)
MemoryError

Could you please give me some ideas about this?

BTW is that possible it spends two days on combining those results together? Since each time it takes a very long time to combine, but still, have a memory error. I only use 4 threads and request 252 GB.

Best

@tjakobi tjakobi self-assigned this Mar 19, 2020
@tjakobi
Copy link
Contributor

tjakobi commented Mar 19, 2020

Dear @JunmingH,

would you have the complete command line that triggered this error?

Regarding the time constraints - two days seems pretty long. But the time required depends on data and command line used.

Cheers,
Tobias

@JunmingH
Copy link
Author

Hi Tobias,

Attached is the script. I ran it with around 300 samples. But there is no chance to run it completly.

/DCC-0.4.7/DCC/main.py @DCC_InputFiles/samplesheet_BM10 -T 2 -D -N -R /ref/GRCh38_Repeats_simpleRepeats_RepeatMasker.gtf -an
/ref/gencode.v26.primary_assembly.annotation.gtf -F -M -Nr 1 1 -fg -G -A /ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa -O /MSBB_rep/BM10 -B @DCC_InputFiles/bam_files_BM10 -t MSBB_rep/tmp_dir/BM10

@tjakobi
Copy link
Contributor

tjakobi commented Mar 20, 2020

Hi @JunmingH,

I don't think I ever ran DCC on such a large number of sample - therefore the merging might indeed take a lot of time.

However, you might tune some of the parameters to reduce the time:

  • -T does not apply to the merging step, that part is single threaded, but you might want to increase the number of threads to speed up the rest of the computation

  • -Nr 1 1 is a very low threshold that will yield a very high number of false positives, i.e. its enough if a circle is only seen once with one BSJ read in one sample. I would recommend something like --Nr 2,5 or even -Nr 5,10 - meaning 2 reads in at least 5 samples or 5 reads in at least 10. THis also should speed up the merging.

  • The memory error is probably related to the extremely huge spare matrix that contains mostly zeros holding circRNA counts

Let me know if the new parameters can address your issue.

Cheers,
Tobias

@JunmingH
Copy link
Author

Hi Tobias,

One of my scripts is successful. It used 400 Gb memory for filter function. But it still stuck at a place.

I am not sure which part it is:

The error message is :
"[E::faidx_adjust_position] The sequence "chr1" not found
[E::faidx_adjust_position] The sequence "chr1" not found
[E::faidx_adjust_position] The sequence "chr1" not found
[E::faidx_adjust_position] The sequence "chr1" not found
[E::faidx_adjust_position] The sequence "chr1" not found
........."
Have you ever see this before?

@JunmingH
Copy link
Author

BTW very appreciate for your help!

@tjakobi
Copy link
Contributor

tjakobi commented Mar 21, 2020

Dear @JunmingH,

You may want to create a new index for your genome fasta file.

Try samtools faidx genome.fasta in the folder of the genome file. Maybe the index is corrupted or incomplete.

Cheers,
Tobias

@JunmingH
Copy link
Author

Do you mean -A /ref/Homo_sapiens.GRCh38.dna.primary_assembly.fa this file?

@tjakobi
Copy link
Contributor

tjakobi commented Mar 21, 2020

Yes, exactly.

@JunmingH
Copy link
Author

Hi Tobias,

Is this warning normal?, I got this when doing linear counting

2020-03-24 09:52:35,105 WARNING: circRNA start position ('chr18', '45196357') does not have mapped read counts, treated as 0

@gnilihzeux
Copy link

Hi, @tjakobi , I'm stuck with the same part. My error information is as follows:

Filtering by read counts
Count CircSkip junctions
Traceback (most recent call last):
  File "/usr/bin/DCC", line 11, in <module>
    load_entry_point('DCC==0.4.7', 'console_scripts', 'DCC')()
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 472, in main
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 669, in findCircSkipJunction
  File "build/bdist.linux-x86_64/egg/DCC/Circ_nonCirc_Exon_Match.py", line 281, in findcircAdjacent
  File "build/bdist.linux-x86_64/egg/DCC/Circ_nonCirc_Exon_Match.py", line 224, in getAdjacent
ValueError: invalid literal for int() with base 10: '1_3182'
started circRNA detection from file 

I have no idea that whether if my gtf annotation, which was downloaded from NCBI, is not right, my gtf:

##gff-version 3
##source-version rtracklayer 1.46.0
##date 2020-04-13
NC_003977.2     RefSeq  region  1       3182    .       +       .       ID=NC_003977.2:1_3182;Dbxref=taxon:10407;Is_circular=true;gbkey=Src;genome=genomic;mol_type=genomic DNA;strain=ayw;
NC_003977.2     RefSeq  gene    1376    1840    .       +       .       ID=gene-HBVgp3;Dbxref=GeneID:944566;gbkey=Gene;Name=X;gene=X;gene_biotype=protein_coding;locus_tag=HBVgp3;
NC_003977.2     RefSeq  CDS     1376    1840    .       +       0       ID=cds-YP_009173867.1;Dbxref=Genbank:YP_009173867.1,GeneID:944566;gbkey=CDS;Name=YP_009173867.1;gene=X;locus_tag=HBVgp3;Parent=gene-HBVgp3;product=X protein;protein_id=YP_009173867.1;
NC_003977.2     RefSeq  regulatory_region       1592    1602    .       +       .       ID=id-NC_003977.2:1592_1602;gbkey=regulatory;Note=direct repeat sequence 2 (DR2)%3b cis-acting sequence which participates in synthesis of minus-strand DNA
NC_003977.2     RefSeq  regulatory_region       1775    1794    .       +       .       ID=id-NC_003977.2:1775_1794;gbkey=regulatory;Note=phi%3b cis-acting sequence which participates in synthesis of minus-strand DNA
NC_003977.2     RefSeq  sequence_difference     1775    1775    .       -       .       ID=id-NC_003977.2:1775_1775;gbkey=misc_difference;Note=G is A in ref[2]%3b~conflict
NC_003977.2     RefSeq  gene    1816    2454    .       +       .       ID=gene-HBVgp4;Dbxref=GeneID:944568;gbkey=Gene;Name=C;gene=C;gene_biotype=protein_coding;locus_tag=HBVgp4;

But CircCoordinates and CircCount were still generated. Two separate chunks not affect each other ?

Thanks for your time.

@JunmingH
Copy link
Author

Try ucsc annotation file. This one did not have chromosome number.
Junming

@gnilihzeux
Copy link

@JunmingH Thanks. It's a genome of Hepatitis B virus and there is no information in UCSC.

@tjakobi
Copy link
Contributor

tjakobi commented Apr 17, 2020

Hi, sorry for the delayed response.

There might be an issue with the format of the 9th column, and DCC running into problems gathering data from it. DCC normally uses GTF files and you are providing a GFF3 file, which might result in problems. You might want to try using a GTF formatted file just to rule this possibility out. See https://www.biostars.org/p/99462/ for details on the differences.

@gnilihzeux
Copy link

Thanks, it is indeed a problem of GTF.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants