Skip to content

Commit

Permalink
Merge pull request #31 from zjnolen/anc_state
Browse files Browse the repository at this point in the history
Allow polarizing SFS to ancestral states
  • Loading branch information
zjnolen authored Feb 13, 2024
2 parents be08d41 + 74e0554 commit 7180ede
Show file tree
Hide file tree
Showing 6 changed files with 58 additions and 12 deletions.
2 changes: 2 additions & 0 deletions .test/config/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,8 @@ reference:
exclude: []
min_size: 3000

ancestral: "data/ref/testref.fa"

#===================== Sample Set Configuration =======================#

exclude_ind: []
Expand Down
12 changes: 9 additions & 3 deletions config/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -134,6 +134,13 @@ Required configuration of the reference.
- `min_size:` A size in bp (integer). All contigs below this size will be
excluded from analysis.

- `ancestral:` A path to a fasta file containing the ancestral states in your
reference genome. This is optional, and is used to polarize allele
frequencies in SAF files to ancestral/derived. If you leave this empty,
the reference genome itself will be used as ancestral, and you should be
sure the [`params`] [`realSFS`] [`fold`] is set to `1`. If you put a reference
here, you can set that to `0`.

Reference genomes should be uncompressed, and contig names should be clear and
concise. Currently, there are some issues parsing contig names with
underscores, so please change these in your reference before running the
Expand Down Expand Up @@ -453,9 +460,8 @@ or a pull request and I'll gladly put it in.
- `pruning_min-weight:` The minimum r^2 to assume two positions are in
linkage disequilibrium when pruning (float)
- `realSFS:` Settings for realSFS
- `fold:` Whether or not to fold the produced SFS (0 or 1, [docs](http://www.popgen.dk/angsd/index.php/SFS_Estimation))
**NOTE:** I have not implemented the use of an ancestral reference into
this workflow, so this should always be set to 1 until I implement this.
- `fold:` Whether or not to fold the produced SFS. Set to 1 if you have not
provided an ancestral-state reference (0 or 1, [docs](http://www.popgen.dk/angsd/index.php/SFS_Estimation))
- `sfsboot:` Doesn't work now, but when it does it will produce this many
bootstrapped SFS per population and population pair (integer)
- `fst:` Settings for $F_{ST}$ calculation in ANGSD
Expand Down
4 changes: 3 additions & 1 deletion config/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,8 @@ reference:
exclude: []
min_size:

ancestral:

#===================== Sample Set Configuration =======================#

exclude_ind: []
Expand Down Expand Up @@ -137,7 +139,7 @@ params:
fit_LDdecay_n_correction: true
pruning_min-weight: 0.1
realsfs:
fold: 1 # Should only be set to 1 for now
fold: 1 # Should only be set to 1, unless ancestral reference is given above
sfsboot: 100
fst:
whichFst: 1
Expand Down
26 changes: 22 additions & 4 deletions workflow/rules/0.1_ref_prep.smk
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,24 @@ rule link_ref:
"""


if config["ancestral"]:

rule link_anc_ref:
"""Link ancestral reference genome to results directory if it exists"""
input:
config["ancestral"],
output:
"results/ref/{ref}/{ref}.anc.fa",
log:
"logs/ref/link_ref/{ref}.anc.log",
conda:
"../envs/shell.yaml"
shell:
"""
ln -sr {input} {output} 2> {log}
"""


rule bwa_index:
"""Index reference genome for bwa (mapping)"""
input:
Expand All @@ -50,13 +68,13 @@ rule bwa_index:
rule samtools_faidx:
"""Index reference genome using samtools (fai index used by several tools)"""
input:
"results/ref/{ref}/{ref}.fa",
"results/ref/{ref}/{prefix}.fa",
output:
"results/ref/{ref}/{ref}.fa.fai",
"results/ref/{ref}/{prefix}.fa.fai",
log:
"logs/ref/samtools_faidx/{ref}.log",
"logs/ref/samtools_faidx/{ref}/{prefix}.log",
benchmark:
"benchmarks/ref/samtools_faidx/{ref}.log"
"benchmarks/ref/samtools_faidx/{ref}/{prefix}.log"
wrapper:
"v2.4.0/bio/samtools/faidx"

Expand Down
8 changes: 6 additions & 2 deletions workflow/rules/3.1_safs.smk
Original file line number Diff line number Diff line change
Expand Up @@ -7,10 +7,12 @@ rule angsd_doSaf_pop:
"""
input:
unpack(filt_depth),
unpack(get_anc_ref),
bam="results/datasets/{dataset}/bamlists/{dataset}.{ref}_{population}{dp}.bamlist",
bams=get_bamlist_bams,
bais=get_bamlist_bais,
ref="results/ref/{ref}/{ref}.fa",
reffai="results/ref/{ref}/{ref}.fa.fai",
regions="results/datasets/{dataset}/filters/chunks/{ref}_chunk{chunk}.rf",
output:
saf=temp(
Expand Down Expand Up @@ -50,7 +52,7 @@ rule angsd_doSaf_pop:
"""
angsd -doSaf 1 -bam {input.bam} -GL {params.gl_model} -ref {input.ref} \
-nThreads {threads} {params.extra} -minMapQ {params.mapQ} \
-minQ {params.baseQ} -sites {input.sites} -anc {input.ref} \
-minQ {params.baseQ} -sites {input.sites} -anc {input.anc} \
{params.extra_saf} -rf {input.regions} -out {params.out} &> {log}
"""

Expand Down Expand Up @@ -109,10 +111,12 @@ rule angsd_doSaf_sample:
"""
input:
unpack(filt_depth),
unpack(get_anc_ref),
bam="results/datasets/{dataset}/bamlists/{dataset}.{ref}_{population}{dp}.bamlist",
bams=get_bamlist_bams,
bais=get_bamlist_bais,
ref="results/ref/{ref}/{ref}.fa",
reffai="results/ref/{ref}/{ref}.fa.fai",
output:
saf="results/datasets/{dataset}/safs/{dataset}.{ref}_{population}{dp}_{sites}-filts.saf.gz",
safidx="results/datasets/{dataset}/safs/{dataset}.{ref}_{population}{dp}_{sites}-filts.saf.idx",
Expand Down Expand Up @@ -140,7 +144,7 @@ rule angsd_doSaf_sample:
"""
(angsd -doSaf 1 -bam {input.bam} -GL {params.gl_model} -ref {input.ref} \
-nThreads {threads} {params.extra} -minMapQ {params.mapQ} \
-minQ {params.baseQ} -sites {input.sites} -anc {input.ref} \
-minQ {params.baseQ} -sites {input.sites} -anc {input.anc} \
-setMinDepthInd {params.mindepthind} {params.extra_saf} \
-out {params.out}) &> {log}
"""
18 changes: 16 additions & 2 deletions workflow/rules/common.smk
Original file line number Diff line number Diff line change
Expand Up @@ -154,8 +154,8 @@ def get_raw_fastq(wildcards):
& (units["lib"] == wildcards.lib),
["sample", "fq1", "fq2"],
].set_index("sample")
if not pd.isna(unit.fq1[0]):
return {"sample": [unit.fq1[0], unit.fq2[0]]}
if not pd.isna(unit.fq1.item()):
return {"sample": [unit.fq1.item(), unit.fq2.item()]}
if "sra" in units:
sra = units[
(units["sample"] == wildcards.sample)
Expand Down Expand Up @@ -454,6 +454,7 @@ def get_endo_cont_stat(wildcards):
# return "results/datasets/{dataset}/glfs/chunks/{dataset}.{ref}_{population}{dp}_chunk{chunk}_allsites-filts.glf.gz"


# Select filter file based off of full depth samples or subsampled depth
def filt_depth(wildcards):
if config["subsample_redo_filts"]:
return {
Expand All @@ -466,6 +467,19 @@ def filt_depth(wildcards):
}


# Choose to use an ancestral reference if present, otherwise use main reference
def get_anc_ref(wildcards):
if config["ancestral"]:
return {
"anc": "results/ref/{ref}/{ref}.anc.fa",
"ancfai": "results/ref/{ref}/{ref}.anc.fa.fai",
}
return {
"anc": "results/ref/{ref}/{ref}.fa",
"ancfai": "results/ref/{ref}/{ref}.fa.fai",
}


## Get random sampling proportion depending on if LD decay is being calculated
## or if LD pruning is being done
def get_ngsld_sampling(wildcards):
Expand Down

0 comments on commit 7180ede

Please sign in to comment.