Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestions about methylpy #28

Open
lhqing opened this issue Mar 25, 2019 · 2 comments
Open

Suggestions about methylpy #28

lhqing opened this issue Mar 25, 2019 · 2 comments

Comments

@lhqing
Copy link
Contributor

lhqing commented Mar 25, 2019

Hi, Yupeng,

Some suggestions of methylpy based on my experience. I can work on improve these (since I already did these here https://github.com/lhqing/cemba_data/blob/master/cemba_data/tools/). But I'd like to know your opinions:

  1. chromosome names: I suggest always keep the chromosome name same as genome fasta, i.e. keep "chr" by default for genomes like mm10 and hg19, or never change chromosome names. I found this will decrease conflits in a lot cases.

  2. chromosome orders: Right now the order of chrom in ALLC is based on bam file, and there is no paticular "sorted or not" check for ALLC file. What's worse is in the merge_allc_files_worker function, the order of chrom may be random, because dict.keys() is not ordered.
    https://github.com/yupenghe/methylpy/blob/methylpy/methylpy/utilities.py#L543

  3. Use bgzip/tabix:

    1. Much faster then gzip.open() + f.seek
    2. allow more even parallel in funcs like merge ALLC (instead of chromosome level using current index, tabix allow query ALLC in any bin level)
      I known this need a lot changes in all functions, but I found it worthwhile. I implemented the open_allc and merge_allc function based on bgzip/tabix which can be several times faster then gzip + f.seek strategy using old index. I tend to fully drop the old index strategy.
  4. More universal map_to_region function:
    which is very important for single cell, my implementation here: map_to_region
    Given an ALLC file, this function count mC and cov for a list of regions, e.g. chrom 100kb bins, gene bodies, I use bedtools map internally, which is very fast.

@yupenghe
Copy link
Owner

yupenghe commented Oct 5, 2019

Thanks for the suggestions. Here are my thoughts.

  1. I totally agree with you and removing "chr" prefix did cause a lot of confusion.
  2. Not sure if this is important especially when point 3 is implemented as long as the position is sorted within each chromosome.
  3. Agree. I haven't test tabix yet but it could totally be a better strategy.
  4. Agree. Coverage and count are also important stats and should be reported.

I also have some thoughts that would like your comments.

  1. allc file should be separated into two files: one for CG sites and one for non-CG sites, where the generation of the latter could be optional. Most projects only care about CG sites which are a very small subset of all sites.
  2. Since pandas is extremely popular and tall data format is very powerful, I am thinking of changing some file formats (e.g. methylation level of a list of regions in a list of samples) into the tall data format.
  3. Reimplement the RMS test with python. Installing GSL library seems a bottleneck for methylpy installation.

@lhqing
Copy link
Contributor Author

lhqing commented Nov 30, 2019

Hi Yupeng, sorry I've missed this for a while...

  1. I agree, and this is not hard to implement, but this is quite a dramatic change, I guess shouldn't make this as default option.

  2. Totally agree, and not only tall data, but saving data in the plain text file also impact performance, especially when having >100 samples (for single-cell data, we can easily have that), the DMS table is huge, we could use some custom tall data format store in HDF5.

  3. This will be great, cython can probably do the same job.

In addition, actually I've implemented quite some ALLC related functions in another repo here https://github.com/lhqing/ALLCools, which basically implement everything related to ALLC handling using tabix/bgzip, add added some code for single-cell related stuff. I am willing to incorporate that into methylpy.

I think I will have time on this after next Feb 2020

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants