Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow polarizing SFS to ancestral states #31

Merged
merged 1 commit into from
Feb 13, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions .test/config/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,8 @@ reference:
exclude: []
min_size: 3000

ancestral: "data/ref/testref.fa"

#===================== Sample Set Configuration =======================#

exclude_ind: []
Expand Down
12 changes: 9 additions & 3 deletions config/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -134,6 +134,13 @@ Required configuration of the reference.
- `min_size:` A size in bp (integer). All contigs below this size will be
excluded from analysis.

- `ancestral:` A path to a fasta file containing the ancestral states in your
reference genome. This is optional, and is used to polarize allele
frequencies in SAF files to ancestral/derived. If you leave this empty,
the reference genome itself will be used as ancestral, and you should be
sure the [`params`] [`realSFS`] [`fold`] is set to `1`. If you put a reference
here, you can set that to `0`.

Reference genomes should be uncompressed, and contig names should be clear and
concise. Currently, there are some issues parsing contig names with
underscores, so please change these in your reference before running the
Expand Down Expand Up @@ -453,9 +460,8 @@ or a pull request and I'll gladly put it in.
- `pruning_min-weight:` The minimum r^2 to assume two positions are in
linkage disequilibrium when pruning (float)
- `realSFS:` Settings for realSFS
- `fold:` Whether or not to fold the produced SFS (0 or 1, [docs](http://www.popgen.dk/angsd/index.php/SFS_Estimation))
**NOTE:** I have not implemented the use of an ancestral reference into
this workflow, so this should always be set to 1 until I implement this.
- `fold:` Whether or not to fold the produced SFS. Set to 1 if you have not
provided an ancestral-state reference (0 or 1, [docs](http://www.popgen.dk/angsd/index.php/SFS_Estimation))
- `sfsboot:` Doesn't work now, but when it does it will produce this many
bootstrapped SFS per population and population pair (integer)
- `fst:` Settings for $F_{ST}$ calculation in ANGSD
Expand Down
4 changes: 3 additions & 1 deletion config/config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,8 @@ reference:
exclude: []
min_size:

ancestral:

#===================== Sample Set Configuration =======================#

exclude_ind: []
Expand Down Expand Up @@ -137,7 +139,7 @@ params:
fit_LDdecay_n_correction: true
pruning_min-weight: 0.1
realsfs:
fold: 1 # Should only be set to 1 for now
fold: 1 # Should only be set to 1, unless ancestral reference is given above
sfsboot: 100
fst:
whichFst: 1
Expand Down
26 changes: 22 additions & 4 deletions workflow/rules/0.1_ref_prep.smk
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,24 @@ rule link_ref:
"""


if config["ancestral"]:

rule link_anc_ref:
"""Link ancestral reference genome to results directory if it exists"""
input:
config["ancestral"],
output:
"results/ref/{ref}/{ref}.anc.fa",
log:
"logs/ref/link_ref/{ref}.anc.log",
conda:
"../envs/shell.yaml"
shell:
"""
ln -sr {input} {output} 2> {log}
"""


rule bwa_index:
"""Index reference genome for bwa (mapping)"""
input:
Expand All @@ -50,13 +68,13 @@ rule bwa_index:
rule samtools_faidx:
"""Index reference genome using samtools (fai index used by several tools)"""
input:
"results/ref/{ref}/{ref}.fa",
"results/ref/{ref}/{prefix}.fa",
output:
"results/ref/{ref}/{ref}.fa.fai",
"results/ref/{ref}/{prefix}.fa.fai",
log:
"logs/ref/samtools_faidx/{ref}.log",
"logs/ref/samtools_faidx/{ref}/{prefix}.log",
benchmark:
"benchmarks/ref/samtools_faidx/{ref}.log"
"benchmarks/ref/samtools_faidx/{ref}/{prefix}.log"
wrapper:
"v2.4.0/bio/samtools/faidx"

Expand Down
8 changes: 6 additions & 2 deletions workflow/rules/3.1_safs.smk
Original file line number Diff line number Diff line change
Expand Up @@ -7,10 +7,12 @@ rule angsd_doSaf_pop:
"""
input:
unpack(filt_depth),
unpack(get_anc_ref),
bam="results/datasets/{dataset}/bamlists/{dataset}.{ref}_{population}{dp}.bamlist",
bams=get_bamlist_bams,
bais=get_bamlist_bais,
ref="results/ref/{ref}/{ref}.fa",
reffai="results/ref/{ref}/{ref}.fa.fai",
regions="results/datasets/{dataset}/filters/chunks/{ref}_chunk{chunk}.rf",
output:
saf=temp(
Expand Down Expand Up @@ -50,7 +52,7 @@ rule angsd_doSaf_pop:
"""
angsd -doSaf 1 -bam {input.bam} -GL {params.gl_model} -ref {input.ref} \
-nThreads {threads} {params.extra} -minMapQ {params.mapQ} \
-minQ {params.baseQ} -sites {input.sites} -anc {input.ref} \
-minQ {params.baseQ} -sites {input.sites} -anc {input.anc} \
{params.extra_saf} -rf {input.regions} -out {params.out} &> {log}
"""

Expand Down Expand Up @@ -109,10 +111,12 @@ rule angsd_doSaf_sample:
"""
input:
unpack(filt_depth),
unpack(get_anc_ref),
bam="results/datasets/{dataset}/bamlists/{dataset}.{ref}_{population}{dp}.bamlist",
bams=get_bamlist_bams,
bais=get_bamlist_bais,
ref="results/ref/{ref}/{ref}.fa",
reffai="results/ref/{ref}/{ref}.fa.fai",
output:
saf="results/datasets/{dataset}/safs/{dataset}.{ref}_{population}{dp}_{sites}-filts.saf.gz",
safidx="results/datasets/{dataset}/safs/{dataset}.{ref}_{population}{dp}_{sites}-filts.saf.idx",
Expand Down Expand Up @@ -140,7 +144,7 @@ rule angsd_doSaf_sample:
"""
(angsd -doSaf 1 -bam {input.bam} -GL {params.gl_model} -ref {input.ref} \
-nThreads {threads} {params.extra} -minMapQ {params.mapQ} \
-minQ {params.baseQ} -sites {input.sites} -anc {input.ref} \
-minQ {params.baseQ} -sites {input.sites} -anc {input.anc} \
-setMinDepthInd {params.mindepthind} {params.extra_saf} \
-out {params.out}) &> {log}
"""
18 changes: 16 additions & 2 deletions workflow/rules/common.smk
Original file line number Diff line number Diff line change
Expand Up @@ -154,8 +154,8 @@ def get_raw_fastq(wildcards):
& (units["lib"] == wildcards.lib),
["sample", "fq1", "fq2"],
].set_index("sample")
if not pd.isna(unit.fq1[0]):
return {"sample": [unit.fq1[0], unit.fq2[0]]}
if not pd.isna(unit.fq1.item()):
return {"sample": [unit.fq1.item(), unit.fq2.item()]}
if "sra" in units:
sra = units[
(units["sample"] == wildcards.sample)
Expand Down Expand Up @@ -454,6 +454,7 @@ def get_endo_cont_stat(wildcards):
# return "results/datasets/{dataset}/glfs/chunks/{dataset}.{ref}_{population}{dp}_chunk{chunk}_allsites-filts.glf.gz"


# Select filter file based off of full depth samples or subsampled depth
def filt_depth(wildcards):
if config["subsample_redo_filts"]:
return {
Expand All @@ -466,6 +467,19 @@ def filt_depth(wildcards):
}


# Choose to use an ancestral reference if present, otherwise use main reference
def get_anc_ref(wildcards):
if config["ancestral"]:
return {
"anc": "results/ref/{ref}/{ref}.anc.fa",
"ancfai": "results/ref/{ref}/{ref}.anc.fa.fai",
}
return {
"anc": "results/ref/{ref}/{ref}.fa",
"ancfai": "results/ref/{ref}/{ref}.fa.fai",
}


## Get random sampling proportion depending on if LD decay is being calculated
## or if LD pruning is being done
def get_ngsld_sampling(wildcards):
Expand Down
Loading