Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interpreting result files and selecting files for visualization and DAR analysis #413

Open
Rafaelsoler13 opened this issue Apr 27, 2023 · 4 comments

Comments

@Rafaelsoler13
Copy link

Rafaelsoler13 commented Apr 27, 2023

Dear developers,

Firstly, I would like to express my appreciation for the detailed ATAC-seq pipeline that you developed. As a PhD student, I find this tool extremely helpful in analyzing ATAC-seq data. However, as someone who is still not 100% familiarized with all the terms and concepts in this field, I have a few questions that I would appreciate your help with.

Firstly, I am struggling to interpret some of the result files. Is there a tutorial or documentation that can explain each result file in more detail?

In addition, I am also unsure about which files are the final ones for visualization purposes, such as bigWigs or bam files, and which ones are ideal for posterior analysis using csaw. I would greatly appreciate it if you could provide some guidance on this matter.

Lastly, I recently ran the example subset data test ENCSR356KRQ_subsampled to check if the pipeline is running without any issues. Could you provide an updated html as a positive control to validate that everything worked fine? Or is there an output message that confirms that everything went okay? This is my QC file.

qc.html.txt

Best,

Rafael

@Rafaelsoler13 Rafaelsoler13 changed the title Interpreting pipeline result files and selecting appropriate files for visualization and posterior analysis Interpreting result files and selecting files for visualization and DAR analysis Apr 27, 2023
@sufyazi
Copy link

sufyazi commented Apr 28, 2023

Not one of the developers but I can try helping out. First of all, you have to run croo on the metadata.json output. It helps a lot in sieving through the opaque, various output into digestible, even visualizable bits. The README.md has a section that you can follow along.

After running croo you will find four major directories in the output, namely align, peak, qc, and signal. They are descriptive and self-explanatory: the bam file in align is the aligned reads file you would use for any downstream analysis, the narrowPeak.gz files in peak (multiple ones depending on whether you turn idr analysis on or off - if I am not mistaken this is possible by customizing the config JSON) are the output of MACS2 and represent the statistically called regions of interest (in this case, open chromatin) and what you will use for any downstream analysis like with csaw. qc is just for QC visualization, and signal contains the BigWig files, which are the continuous format data files for use in IGV to visualize read pileups. Think of it as the processed version of bam files, I guess.

I am not sure if there is any kind of 'best practice' validation step post-pipeline run, but I personally just trust the Cromwell std err output, so if I see the line Workflow: id=27336d51-bc86-46a6-8d59-bff543d8d73d, status=Succeeded (the id value here will be random every run as it is the workflow UUID) and Cromwell finished successfully. I consider the pipeline run a success. It's always good to do some sanity check by visualizing the output bigwig files and checking the QC html (just open the file on your browser of choice) for things that are out of the ordinary (which I won't get into here - you can read papers like the one from Yan et al. 2020. You asked for a 'positive control' html example, but I don't think such thing is possible because there isn't really one 'correct' atac-seq pipeline output.

By the way, you can find the visualization of the workflow in an svg file generated by croo. Below is one I grabbed from my run to demonstrate how helpful it is.

example visualization of the workflow

As for which IDR-processed output files (narrowPeaks) to use for your downstream analysis, that's something you might need to decide yourself depending on how many biological replicates you did, the nature of the experiment/study, etc. Might be worth discussing with your supervisor. You can also glean what 'overlap', conservative and optimal set means in the pipeline context by checking out the specification here.

Hope this helps at least.

@Rafaelsoler13
Copy link
Author

Thank you for the detailed explanation. It's always a pleasure read answers like this one =)

@Rafaelsoler13
Copy link
Author

Rafaelsoler13 commented Nov 15, 2023

Sorry to reopen this issue, but I was still wondering which file I should use for DAR analysis. Should I use the narrowPeak file in the peak/ folder? Or should I use Overlap Peaks / IDR Peaks? Or are they just to check the reproducibility of the peaks? @sufyazi

Thank you!

@Rafaelsoler13 Rafaelsoler13 reopened this Nov 15, 2023
@sufyazi
Copy link

sufyazi commented Nov 24, 2023

I think the narrowPeak files in the peak folder should be fine to use, though I personally just merge the overlap and IDR peak files and use these instead in my analysis. They are just more stringent peaks. But it might depend on what you want to do/see in your analysis so your mileage may vary and you might be fine just using the less stringent, background-filtered (bfilt) peak set file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants