CSI indexes #20

Open
MorexV3CAGE opened this issue Oct 17, 2023 · 4 comments

MorexV3CAGE commented Oct 17, 2023

Hi,
we wanted to analyze our ATAC-seq data from plants using csaw, but the problem is that the BAM indexes need to be in the CSI format due to the size of the chromosomes. Is there a way to use csaw with these indexes? Or do you maybe plan to make an option to use these indexes in the future?
If you have any suggestions maybe even for a different software that works well with these indexes, I would appreciate it as well.

Thank you!

LTLA (Owner) commented Oct 20, 2023

I think this is pretty reasonable, but I'm quite busy. Can you make an MRE (minimal reproducible example) with some small mock data to help me out?

In particular, I would like to know how far we can go with the current code. You might actually be able to create a Rsamtools::BamFile() instance with index= set to a CSI file and pass that to csaw::windowCounts() or related functions; the CSI file path should then get passed through to the C++ code. Fingers crossed, the latest version of Rhtslib might be able to read the CSI file, in which case everything might just work as-is.
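The suggestion above could look roughly like the following sketch. It is not a tested recipe: the helper name `count_with_csi` and its arguments are hypothetical, paired-end data (`pe = "both"`) is an assumption, and whether the CSI index is actually honoured depends on the installed Rsamtools/Rhtslib versions.

```r
# Sketch only: count_with_csi() and its arguments are hypothetical names.
# A CSI index can be created beforehand with `samtools index -c sample.bam`,
# which writes sample.bam.csi next to the BAM file.

count_with_csi <- function(bam_paths, csi_paths, width = 200, max_frag = 200) {
    stopifnot(length(bam_paths) == length(csi_paths))

    # Pair each BAM file with its CSI index explicitly via index=.
    bam_files <- Map(
        function(bam, csi) Rsamtools::BamFile(bam, index = csi),
        bam_paths, csi_paths
    )

    # Assumption: paired-end libraries, with a cap on the fragment size.
    param <- csaw::readParam(pe = "both", max.frag = max_frag)

    # windowCounts() accepts a list of BamFile objects in place of file paths.
    csaw::windowCounts(unname(bam_files), width = width, param = param)
}
```

Calling `count_with_csi(c("a.bam", "b.bam"), c("a.bam.csi", "b.bam.csi"))` with real files would return the usual RangedSummarizedExperiment of window counts, provided the index is read correctly.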

MorexV3CAGE (Author) commented

Thank you for your response. I have tried it with Rsamtools::BamFile() and CSI indices, and it has worked so far. But my R session crashes when it gets to filterWindowsLocal. This is due to the size of the dataset: while calculating the filterWindows step, RAM usage reaches 100GB+, and even though I have 256GB (using Jupyter notebook rstudio), it crashes. I tried it with a smaller portion of the data and it worked, so I guess I will have to split up the process somehow. Or would you have any recommendations regarding this function?
Otherwise, if I don't come across any further problems, I think Rsamtools::BamFile() is the best and easiest solution.

LTLA (Owner) commented Nov 11, 2023

Sorry for the late reply. I've never seen it use so much memory before. I guess if you have super-long chromosomes, it'll create a large matrix to accommodate all of the windows. How long are your chromosomes in total, how many samples do you have, and what window sizes/spacings are you using?

MorexV3CAGE (Author) commented Nov 14, 2023

Hi, yeah, I guess it would be due to the size of the dataset. The chromosome lengths are:

sequence    length
Chr1A       601925861
Chr1B       720616616
Chr2A       802176689
Chr2B       824672899
Chr3A       758701763
Chr3B       866600556
Chr4A       772123497
Chr4B       703802483
Chr5A       723594571
Chr5B       742439866
Chr6A       627992934
Chr6B       739292552
Chr7A       753887139
Chr7B       749354956

And there are 7 samples with 3 replicates each.
The maximum fragment size for readParam was 200, and the window width for windowCounts was also 200.
We tried increasing this to windows of 2000 for regionCounts, and that is where RAM usage climbed so high.

But in the end, we didn't need to use this function, so our analysis is finished. Thanks for the CSI index loading idea.
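As a rough check on the "large matrix" explanation earlier in the thread, the sketch below gives a back-of-the-envelope upper bound on the size of a dense integer count matrix for the posted chromosome lengths and 21 libraries. The spacings tried are assumptions (50 is, to my knowledge, the windowCounts() default, and the thread does not state the spacing used), and the estimate ignores window filtering, GRanges overhead and intermediate copies.

```r
# Chromosome lengths as posted in the thread (stored as doubles to avoid
# integer overflow when summing).
chr_len <- c(
    Chr1A = 601925861, Chr1B = 720616616, Chr2A = 802176689,
    Chr2B = 824672899, Chr3A = 758701763, Chr3B = 866600556,
    Chr4A = 772123497, Chr4B = 703802483, Chr5A = 723594571,
    Chr5B = 742439866, Chr6A = 627992934, Chr6B = 739292552,
    Chr7A = 753887139, Chr7B = 749354956
)

n_libraries <- 7 * 3      # 7 samples x 3 replicates
bytes_per_count <- 4      # an R integer occupies 4 bytes

# Upper bound (in GiB) on a dense count matrix if every window were kept.
dense_matrix_gib <- function(spacing) {
    n_windows <- sum(ceiling(chr_len / spacing))
    n_windows * n_libraries * bytes_per_count / 2^30
}

total_bp <- sum(chr_len)

# Assumed spacings: 50, 200 (equal to the window width) and 2000.
spacings <- c(50, 200, 2000)
estimates <- vapply(spacings, dense_matrix_gib, numeric(1))
names(estimates) <- paste0("spacing_", spacings)

print(total_bp)
print(round(estimates, 2))
```

The genome totals about 10.4 Gbp, so the number of windows, and with it the matrix size, scales inversely with the spacing. Treat the result as an order-of-magnitude guide only, since real memory use also depends on how many windows survive filtering and on temporary objects created along the way.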
