Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing gene_bed path in igenomes config #46

Closed
jinmingda opened this issue Sep 25, 2019 · 11 comments
Closed

Missing gene_bed path in igenomes config #46

jinmingda opened this issue Sep 25, 2019 · 11 comments

Comments

@jinmingda
Copy link
Contributor

jinmingda commented Sep 25, 2019

Hello, I was wondering if there's any particular reasons why gene_bed path is missing in conf/igenomes.config. I haven't checked all the genomes but for GRCh37, the BED file generated by makeGeneBED has the same number of lines as the one located in s3://ngi-igenomes/igenomes/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.bed despite the difference in the order of a few rows. So I tend to believe they are identical. I understand you still need to have a process to generate BED files for those uncommon genomes. But it may save users some time and computing resources when it comes to human or mouse samples.

@drpatelh
Copy link
Member

Hi @jinmingda . Well spotted! Yes, as you rightly suggested there is no reason why it can't be added back in 👍 I can't remember why I removed it now 🤔 However, the makeGeneBED process is possibly one of the quickest steps in the entire pipeline so I'm not convinced that its going to be a massive saving on time and compute.

@jinmingda
Copy link
Contributor Author

jinmingda commented Sep 25, 2019

@drpatelh Thanks for your prompt reply! I notice this because I had trouble with makeGeneBED when running the pipeline on AWS Batch. It exits with a status code of 137, which indicates insufficient resources allocated to the container (task.attempt somehow doesn't seem to work). I can get around this by giving makeGeneBED more RAM, but that's beyond the scope of this issue. Anyway, I just want to confirm that you can skip makeGeneBED by using the BED files from iGenomes. Thank you!

@drpatelh
Copy link
Member

drpatelh commented Sep 26, 2019

Hi @jinmingda Yes, its entirely possible to skip this step by using the same (or a different one for that matter) BED file as is hosted on iGenomes. The RAM issue is a bit more concerning and the resource requirements will need to be updated. I have increased the resources in the nf-core/chipseq pipeline for the same process and will need to do the same here:
https://github.com/nf-core/chipseq/blob/333bba334512625775157aa0087c023cf0bde397/main.nf#L370

Need to find the time to update this pipeline with the more recently released chipseq pipeline 😅

Just out of interest did you encounter any other issues in getting the pipeline to complete?

@jinmingda
Copy link
Contributor Author

@drpatelh Yes there're a few issues I ran into that I have to modify the original pipeline code to get it working with my samples. Many of them are due to RAM requirement. For example, it seems all the processes that involve annotatePeaks.pl will be terminated without returning a non-zero status code if there's not enough memory. It took me a few hours before I realized what was going on because Nextflow would still mark them as succeeded. Additionally, I replaced the process_long label with process_medium in merge_library_macs and merge_replicate_macs because the memory footprint of macs2 is based on the size of the input BAM file. So in many cases, 2 GB is too small for real human samples.

Another issue I've found is in makeGenomeFilter. {params.blacklist} in line 464 resolves to /home/ubuntu/.nextflow/assets/nf-core/atacseq/assets/blacklist/GRCh37-blacklist.bed on my instance, but it is not accessible by workers if worker nodes and head node don't share the same disk. I notice there's a channel created for blacklist file in line 219. So I managed to solve this issue by using the channel as input and reference the input file rather than param.blacklist directly in script. I think this may be an issue caused by specific working environments.

@jinmingda
Copy link
Contributor Author

FYI, here's the command.log of merge_library_macs_annotate when there is not enough RAM.

nxf-scratch-dir ip-10-1-128-56:/tmp/nxf.1xnCYOrpFQ

	Using Custom Genome
	Peak file = consensus_peaks.mRp.clN.bed
	Genome = genome.fa
	Organism = unknown
	Using gene_ids for GTF file
	Custom annotation GTF file: genes.gtf (using transcript_id)
	Peak/BED file conversion summary:
		BED/Header formatted lines: 134293
		peakfile formatted lines: 0
		Duplicated Peak IDs: 0

	Peak File Statistics:
		Total Peaks: 134293
		Redundant Peak IDs: 0
		Peaks lacking information: 0 (need at least 5 columns per peak)
		Peaks with misformatted coordinates: 0 (should be integer)
		Peaks with misformatted strand: 0 (should be either +/- or 0/1)

	Peak file looks good!

	Reading Positions...
	-----------------------
	Finding Closest TSS...
	Processing custom annotation file...
	Features that will be considered:
		exon

Killed
	Features that will be considered:
		exon

Killed
	Prioritizing Annotations: 
	Annotating:
		Annotation	Number of peaks	Total size (bp)	Log2 Enrichment
	Annotating:
		Annotation	Number of peaks	Total size (bp)	Log2 Enrichment
	Counting Tags in Peaks from each directory...
	Organism: unknown
	Outputing Annotation File...
	Done annotating peaks file

@drpatelh
Copy link
Member

This is priceless feedback @jinmingda ! Thank you very much. I had no idea annotatePeaks.pl just blacks out like that when it doesnt have enough RAM. Im assuming it still gives an exit code of 0 which is why Nextflow thinks its finished - not good. I have seen a couple of other issues with this script that make things very difficult to trace without going back to the sysout logs.

Memory requirements for annotatePeaks.pl and MACS2 will need to be increased as you suggested 👍

Yes, the blacklist issue has come up in the past on the nf-core Slack workspace and it is exactly for the reasons you mentioned. I did create the channel to stage the file properly but I forgot to add it into the process itself. Will need some sort of conditional channel input to the process to account for instances where it isnt provided.

It would be great if you could contribute directly to the code base. We need all the help we can get 😅 and it will benefit anyone else that tries to run the pipeline too. Let me know what you think and I can add you as a contributor to nf-core 👍

@jinmingda
Copy link
Contributor Author

@drpatelh I'd love to create a pull request and make contribution to the nf-core project. I will try to address the issues I've found so far.

@drpatelh
Copy link
Member

Amazing! Thank you. Ive sent you an invitation 😎

Please feel free to add yourself and your Institute to the contributors page on the website too. Details for this can be found at the bottom of this comment:
nf-core/website#3 (comment)

Im sure you probably know this already but just fork the repository and create a PR to the dev branch and ask for it to be reviewed when you are ready. Once the CI tests are passing and we have reviewed we can merge it into the pipeline 👍

@drpatelh
Copy link
Member

Also, we have an atacseq channel on the nf-core Slack workspace which may be worth joining too.
https://nf-co.re/join

@drpatelh
Copy link
Member

see:
#47
#48

@drpatelh
Copy link
Member

drpatelh commented Oct 3, 2019

Fixed in #49

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants