- RDR: --minINCLUDE now accepts both INT and FLOAT. If INT, it is the minimum length of the part of a read included within a specific feature. If FLOAT, it is the minimum fraction of the included part within a specific feature.
- BAF: add index to the output folder of each step, e.g., "1_pileup".
- docs: update TODO list.
- RDR: add --minINCLUDE option for read filtering, which is the minimum length of the part of a read included within a specific feature. For example, if the genomic range of a feature is chr1:1000-3000, and one fetched read (100bp) is aligned to two loci, chr1:601-660 (60bp) and chr1:3801-3840 (40bp), then no part of the read is actually included within the feature, hence it will be filtered by --minINCLUDE=30, whereas older versions of xcltk may keep the read. Note that because feature counting in RDR is performed independently for each feature, a read filtered by --minINCLUDE in one feature may still be fetched and counted by other features.
- update docstring, using the numpydoc style.
- add TODO list in docs/TODO.md.
- fix typo.
- BAF: in ref_phasing, use multiprocessing to phase SNPs of one chromosome per subprocess.
- specify dtype of column 0 as str in pd.read_csv() when loading region file.
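
The --minINCLUDE filter described in the RDR entries above can be illustrated with a minimal sketch. It is not xcltk's actual code: the function names are hypothetical, coordinates are taken as 1-based and inclusive, and the FLOAT case is assumed to be a fraction of the aligned read length, which the notes above do not state explicitly.

```python
def included_length(blocks, feat_start, feat_end):
    """Number of aligned bases that fall inside the feature.

    blocks: list of (start, end) aligned segments of one read,
            1-based and inclusive (an assumption of this sketch).
    """
    total = 0
    for start, end in blocks:
        overlap = min(end, feat_end) - max(start, feat_start) + 1
        if overlap > 0:
            total += overlap
    return total


def passes_min_include(blocks, feat_start, feat_end, min_include):
    """Return True if the read would be kept for this feature."""
    n_incl = included_length(blocks, feat_start, feat_end)
    if isinstance(min_include, float):
        # Assumption: the fraction is relative to the aligned read length.
        read_len = sum(end - start + 1 for start, end in blocks)
        return n_incl >= min_include * read_len
    return n_incl >= min_include


# The read from the example: two blocks flanking feature chr1:1000-3000.
blocks = [(601, 660), (3801, 3840)]
print(included_length(blocks, 1000, 3000))         # 0
print(passes_min_include(blocks, 1000, 3000, 30))  # False
```

Because the check is applied per feature, the same read can fail here and still pass for a different feature that overlaps one of its blocks.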
The v0.2.x series was skipped because this new version has several substantial updates:
- BAF: do reference phasing on local machines instead of using online service.
- BAF & RDR: better support well-based (e.g., SMART-seq) data without the need to merge the input BAM files first;
- coding improvement using a more unified framework, mainly using the `fc` (feature counting) and `utils` sub-modules.
Feature enhancement
BAF part:
- add `xcltk baf` command line tool to support reference phasing on local machines instead of using online service.
- `xcltk allelefc`: better support well-based (e.g., SMART-seq) data without the need to merge the input BAM files first;
- `xcltk allelefc`: both REF and ALT allele counting will exclude the UMIs/reads mapped to both alleles when no_dup_hap is True.
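
The no_dup_hap behaviour described above can be sketched as a set operation. This is a hypothetical illustration, not the `xcltk allelefc` implementation: the function name and the input layout (one collection of UMI or read IDs per allele) are assumptions.

```python
def count_alleles(ref_ids, alt_ids, no_dup_hap=True):
    """Count UMIs/reads supporting the REF and ALT alleles.

    When no_dup_hap is True, IDs seen on both alleles are dropped
    from both counts.
    """
    ref = set(ref_ids)
    alt = set(alt_ids)
    if no_dup_hap:
        dup = ref & alt  # IDs mapped to both alleles
        ref -= dup
        alt -= dup
    return len(ref), len(alt)


# "u3" supports both alleles, so it is excluded from both counts.
print(count_alleles(["u1", "u2", "u3"], ["u3", "u4"]))         # (2, 1)
print(count_alleles(["u1", "u2", "u3"], ["u3", "u4"], False))  # (3, 2)
```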
RDR part:
- better support well-based (e.g., SMART-seq) data without the need to merge the input BAM files first;
- re-implement the `xcltk basefc` using the `fc` (feature counting) framework.
Preprocess:
- re-implement the preprocess pipeline by replacing the bash scripts with python functions, e.g., wrapping SNP calling (previously `baf_pre_phase.sh`) into `xcltk.baf.genotype::pileup()`; reference phasing locally with `xcltk.baf.genotype::ref_phasing()`; wrapping allele-specific feature counting (previously `baf_post_phase.sh`) with `xcltk.baf.count::afc_wrapper()`.
- further wrap the three functions into a pipeline implemented as a sub-module `xcltk.baf.pipeline` and also as a command line tool `xcltk baf`.
Others:
- rename the cmdline command `xcltk pileup` to `xcltk allelefc`.
- make the cmdline options more unified, e.g., "--samList" and "--ncores" in "xcltk allelefc", "xcltk pipeline", and "xcltk basefc".
- usage() functions by default output to stdout instead of stderr.
- cmdline "--help" option exit code changes from 1 to 0.
- add/update a few util sub-modules such as `vcf.py`, `xlog.py` etc.
- add post_hoc scripts for post-processing xcltk output.
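
The two command-line changes listed above (usage text going to stdout, and "--help" exiting with code 0) follow a common convention that can be sketched as below. The program name `demo`, the option handling, and the error branch are hypothetical and only illustrate the convention; they are not taken from xcltk.

```python
import sys


def usage(fp=None):
    """Write the usage text; defaults to stdout rather than stderr."""
    if fp is None:
        fp = sys.stdout
    fp.write("Usage: demo [options]\n")
    fp.write("  -h, --help    Print this message and exit.\n")


def main(argv):
    """Return the process exit code for the given argument list."""
    if "-h" in argv or "--help" in argv:
        usage()  # asking for help is not an error
        return 0
    for arg in argv:
        if arg.startswith("-"):
            # Assumption for illustration: unknown options are errors,
            # so the usage text goes to stderr with a non-zero code.
            usage(sys.stderr)
            return 1
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```

With this layout, `demo --help | less` works as expected and scripts can treat a zero exit code from "--help" as success.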
- initialize "data" dir and add feature annotation files.
- baf: add reference phasing correction (xcltk rpc).
- preprocess: restructure, update scripts and data.
- rdr: output 4-column features.
- baf_pre_impute: keep het SNPs only after calling germline SNPs
- baf_post_impute: output all regions when running xcltk pileup
- rdr: fix a bug that pysam was not imported.
update baf haploblock pileup:
- re-implement the module
- fix the double counting issue of UMIs or reads when aggregating phased SNPs (some UMIs or reads could cover more than one SNP)
- fix the issue that some UMIs are aligned to both haplotype alleles (--countDupHap)
- add an option to output all regions (--outputAllReg)
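
The double counting issue mentioned above arises when per-SNP counts are summed within a block, since one UMI or read can cover several SNPs. A minimal sketch of the idea behind the fix, assuming per-SNP sets of UMI IDs for each haplotype (the function names and data layout are hypothetical, not xcltk's own):

```python
def naive_block_counts(snps):
    """Sum per-SNP counts; a UMI covering two SNPs is counted twice."""
    hap1 = sum(len(h1) for h1, _ in snps)
    hap2 = sum(len(h2) for _, h2 in snps)
    return hap1, hap2


def dedup_block_counts(snps):
    """Count each UMI once per haplotype across all SNPs in the block."""
    hap1, hap2 = set(), set()
    for h1, h2 in snps:
        hap1.update(h1)
        hap2.update(h2)
    return len(hap1), len(hap2)


# Two phased SNPs in one block; "u2" and "u3" each cover both SNPs.
snps = [
    ({"u1", "u2"}, {"u3"}),
    ({"u2"}, {"u3", "u4"}),
]
print(naive_block_counts(snps))  # (3, 3)
print(dedup_block_counts(snps))  # (2, 2)
```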
- rdr: fix program suspension caused by unmatched chrom
- baf_pre: add --umi and --duplicates options
re-implement fixref with pysam:
- support genome fasta as ref (-r)
- support gzip/bgzip input and output vcf
- support multiple alt alleles
- support multiple samples
- indels would be filtered
- support only ploidy = 2 for now
- baf_post: support multiple BAMs
- baf_pileup: set cellTAG None when given bam list
- copy barcode file for baf_pileup and copy barcode & region files for phase_snp
- basefc: replace region.stop with region.end
- small fixes
- baf_pileup: add --uniqCOUNT
- specify sample ID through cmdline option
- phase_snp: fix load_phase
- baf_post: update pileup cmdline
- add pileup module and fix double counting
- phase_snp: support bed,gff,tsv for input region
- phase_snp: support vcf as input for phase file
- add gzip support for region sub-module
- baf_pre_impute: add -C/--call option and use cellsnp-lite by default to call germline SNPs instead of freebayes
- small fix
- baf_pre_impute and baf_pileup pass tests
- add baf_pileup pipeline
- add baf_pre_imputation pipeline
- add utils
- add fixref
- add feature-count
- add xcltk cmdline
- init modules: baf, rdr and reg
- add cmdline apps: xcltk-baf, xcltk-rdr and xcltk-reg