Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added optional --fitted_hist option to generate probabilities and lookup table (useful for Merfin) #53

Open
wants to merge 175 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
175 commits
Select commit Hold shift + click to select a range
9a188f9
Update run_sweep.sh
tbenavi1 Apr 25, 2017
a380342
Update simhisto.sh
tbenavi1 Apr 25, 2017
f6c844a
Update eval_err.pl
tbenavi1 Apr 25, 2017
9f02f1d
Update simgenome.pl
tbenavi1 Apr 25, 2017
637f4de
Update simhisto.sh
tbenavi1 Apr 25, 2017
dd3926e
Add files via upload
tbenavi1 Apr 25, 2017
fbd52f8
Add files via upload
tbenavi1 Apr 25, 2017
a37cc61
Update genomescope.R
tbenavi1 Apr 25, 2017
fa32668
Update genomescope.R
tbenavi1 Apr 27, 2017
52f2576
Update chromosomeMutator.py
tbenavi1 May 17, 2017
8a2c0e2
Update parameteranalysis.py
tbenavi1 May 17, 2017
17c6196
Rename README.md to README2.md
tbenavi1 May 17, 2017
e4cdc69
Create README.md
tbenavi1 May 17, 2017
86fb967
Update README.md
tbenavi1 May 17, 2017
0b23e8b
Update README.md
tbenavi1 May 17, 2017
2c61b30
Update README.md
tbenavi1 May 17, 2017
3f0fd3d
Update simgenome.pl
tbenavi1 Jun 29, 2017
729255f
Update simgenome.pl
tbenavi1 Jun 29, 2017
3590b63
Update simgenome.pl
tbenavi1 Jun 29, 2017
922b6dd
Update genomescope.R
tbenavi1 Oct 17, 2017
8dc2813
Create genomescope_newmodel.R
tbenavi1 Nov 1, 2018
333d0fc
refactored into R package
calcnerd76 Jan 24, 2019
075d224
refactored into R package
calcnerd76 Jan 24, 2019
e476eec
merged changes
calcnerd76 Jan 24, 2019
923db95
fixed ending brace
calcnerd76 Jan 24, 2019
577ab9d
fixed p=4 model
calcnerd76 Jan 28, 2019
8d5d3ab
add DESCRIPTION
calcnerd76 Jan 28, 2019
a9be8cd
add NAMESPACE
calcnerd76 Jan 28, 2019
f874ad5
fixed constraints for p=4
calcnerd76 Jan 28, 2019
0db5a37
fixed boundaries
calcnerd76 Jan 28, 2019
cdb0234
correct constraints
calcnerd76 Jan 28, 2019
8eed723
switch constraints
calcnerd76 Jan 30, 2019
a623b47
tidy nls_peak
calcnerd76 Jan 31, 2019
7cb7046
changes
calcnerd76 Jan 31, 2019
3dc8115
fix het calculation
calcnerd76 Feb 1, 2019
4b1846e
add --testing parameter
calcnerd76 Feb 1, 2019
ce78aa2
enforce heterozygosity constraint
calcnerd76 Feb 1, 2019
cf75799
fixed --testing filename
calcnerd76 Feb 1, 2019
a41c7ca
fix testing for multiple ploidy and if not converge
calcnerd76 Feb 1, 2019
4a19495
fix --testing length
calcnerd76 Feb 2, 2019
6f5ea5e
topologies
calcnerd76 Feb 5, 2019
a924783
fix typos
calcnerd76 Feb 5, 2019
38d5862
fix typos
calcnerd76 Feb 5, 2019
f25cf6e
update NAMESPACE
calcnerd76 Feb 5, 2019
6566832
default for atotal_len and top for testing mode
calcnerd76 Feb 5, 2019
8a70f78
add topology as optional parameter
calcnerd76 Feb 7, 2019
953b031
fixed increment typo
calcnerd76 Feb 7, 2019
df7359f
return low het and topology as parameter
calcnerd76 Feb 14, 2019
7eb6efa
fixed naming of functions for topologies
calcnerd76 Feb 14, 2019
4cf0b8b
fix legend format
calcnerd76 Feb 14, 2019
a3b4e48
fixed legend margins
calcnerd76 Feb 14, 2019
d9927cf
add transform parameter
calcnerd76 Feb 18, 2019
9ddfc98
fix switch
calcnerd76 Feb 18, 2019
471a065
fix report results for transform
calcnerd76 Feb 18, 2019
87102ec
fix typo
calcnerd76 Feb 18, 2019
65e303e
fix
calcnerd76 Feb 18, 2019
c4454a2
fix
calcnerd76 Feb 18, 2019
a4a340b
fix transform
calcnerd76 Feb 19, 2019
d8c42f3
fix transform
calcnerd76 Feb 19, 2019
5fee718
p=6 model
calcnerd76 Feb 19, 2019
211ffc6
fix p=6 report results
calcnerd76 Feb 19, 2019
825b86b
trying typical_error = 35
calcnerd76 Feb 19, 2019
eef878c
add p=6, t=6
calcnerd76 Feb 19, 2019
dde4685
add p=6 topology = 11
calcnerd76 Feb 19, 2019
071e00d
fix report results for topology=11
calcnerd76 Feb 19, 2019
1ff27b3
p=6 full model
calcnerd76 Feb 19, 2019
f48df28
add files for p=6 full model
calcnerd76 Feb 19, 2019
acc0f5b
fix report results
calcnerd76 Feb 19, 2019
bf022b7
change r initials
calcnerd76 Feb 19, 2019
2755953
fix report results
calcnerd76 Feb 19, 2019
f3cb1a1
fix p=6 t=11 model
calcnerd76 Feb 19, 2019
dc0005d
trying new r initials
calcnerd76 Feb 19, 2019
4aa1120
fix typical error
calcnerd76 Feb 19, 2019
6deeba4
change r initials
calcnerd76 Feb 19, 2019
da6f98a
new r initials
calcnerd76 Feb 19, 2019
107636d
try fitting only first 100 values
calcnerd76 Feb 19, 2019
99596f2
smaller dmax
calcnerd76 Feb 19, 2019
375c0c7
d_max = 1
calcnerd76 Feb 19, 2019
2de006a
change r initials for p=11
calcnerd76 Feb 19, 2019
cea19e0
update
calcnerd76 Feb 21, 2019
a3a3c14
get rid of up to 100
calcnerd76 Feb 21, 2019
b48d8e2
fix r initials
calcnerd76 Feb 21, 2019
a49e670
change
calcnerd76 Feb 25, 2019
c6b5cbf
kmer_rates parameter and start calculation for transformed
calcnerd76 Feb 25, 2019
08254ed
alpha rates parameter in progress
calcnerd76 Feb 26, 2019
d613f90
add p=4 full model
calcnerd76 Feb 26, 2019
5b4111c
more files
calcnerd76 Feb 26, 2019
19397c3
fix p=4 unique model
calcnerd76 Feb 26, 2019
82504c3
add parameter for initial rates
calcnerd76 Feb 26, 2019
9ff7914
change string to character
calcnerd76 Feb 26, 2019
ff1fd7d
fixing report results for optional parameters
calcnerd76 Feb 26, 2019
2fc9a88
fix
calcnerd76 Feb 26, 2019
37a12f3
fix report results
calcnerd76 Feb 26, 2019
a2002ea
fix report results
calcnerd76 Feb 26, 2019
3a53ba3
fix r_initials for kmer rates
calcnerd76 Feb 28, 2019
01e0c11
as.numeric to avoid integer overflow
calcnerd76 Feb 28, 2019
ad5d4a5
standardize formulas
calcnerd76 Mar 3, 2019
a712a99
add d_init parameter and set default to 0.10
calcnerd76 Mar 4, 2019
d6d24af
add p=5 models
calcnerd76 Mar 4, 2019
c17edf5
add p=6 models
calcnerd76 Mar 4, 2019
2a47a68
add more topologies
calcnerd76 Mar 5, 2019
98e4b99
finish p=6 topologies
calcnerd76 Mar 5, 2019
1ecd4f3
add optional parameter for no unique sequence on plots
Mar 9, 2019
6de44d1
make testing mode parallel
Mar 18, 2019
fc489c2
fix report results
Mar 18, 2019
a28cd1a
not append for true params
Mar 18, 2019
3d3a840
try new r initials
Mar 19, 2019
c98c12a
apples to apples comparison for het values in testing mode
Mar 20, 2019
2cf4e6e
fix typo
Mar 20, 2019
bc930c2
testing report results for p=6
Mar 20, 2019
ec0da1d
added true parameters for p=3
Apr 1, 2019
8c02313
added true parameters for p=3
Apr 1, 2019
81293bb
fix report results bug
Apr 1, 2019
e61f49f
transform x^2
Apr 4, 2019
f9065e7
fix transform exponent
Apr 8, 2019
991e75c
fix transform exp and score_model function
Apr 8, 2019
4fee06d
fix estimate genome peak to do peak indices then topologies
Apr 10, 2019
00bc679
fixed verbose mode, changed up estimate genome peak to do peak indice…
Apr 10, 2019
9eaa228
make nls peak show model each time
Apr 11, 2019
8c35f78
fuller p=5 and p=6 model (except testing mode)
Apr 22, 2019
6169b65
fix files
Apr 22, 2019
9eaab4b
fix typo
Apr 22, 2019
5c5d649
add optional parameter for trace flag for nlsLM function
May 28, 2019
e82b7a3
optional trace flag
May 28, 2019
bdf2f77
make predict3_0 functions
May 29, 2019
92fc3ef
full p=5 model
May 29, 2019
adb09e8
fix p=5 p=6 t=0 report results for testing mode
Jun 1, 2019
fc3a7af
fix p=5 p=6 t=0 report results for testing mode
Jun 1, 2019
0681193
fix p=5 p=6 t=0 report results for testing mode
Jun 1, 2019
0bdcdc7
fix p=5 p=6 t=0 report results for testing mode
Jun 1, 2019
1db998e
simpler full model for p=5 and report results
Jun 4, 2019
ac8e515
make r minimum 0.00001 to avoid singular matrix
Jun 6, 2019
4eac402
try exponentiating variables to enforce positivity and removing lower…
Jun 13, 2019
cc0e31d
remove log transform
Jun 17, 2019
e613071
multiple starting heterozygosity rates
Jun 17, 2019
53aed92
simpler p=6 full model
Jun 18, 2019
cc43928
fix typo
Jun 18, 2019
f8e324b
change factor for smaller intial step size
Jun 18, 2019
dc25a6b
code cleaning in preparation for publication
Jun 19, 2019
21550d1
p6 extra newline for graphs
Jun 19, 2019
5d343c7
added predict1_0 and predict2_0 functions
Jun 19, 2019
528c213
larger margin for plots (for p=6)
Jun 27, 2019
665fe06
default fit to transformed, plot 4 plots
Jul 22, 2019
47a9351
updates
Aug 5, 2019
7eba733
updates for final
Aug 19, 2019
a953105
change readme
Aug 19, 2019
97e43b3
update
Aug 19, 2019
b8b1825
deleted analysis folder
Aug 19, 2019
7804516
add kmer coverage as optional input for website
Aug 19, 2019
f75557e
polyploid examples for website
Aug 20, 2019
1d0aa22
website updates
Aug 21, 2019
20bddaa
website updates
Aug 21, 2019
965b559
add polyploid descriptions to website
Aug 21, 2019
3026e51
add analysis files to github
Aug 22, 2019
c664fa8
write installation script
Aug 22, 2019
d6e21e1
add installation instructions to readme
Aug 22, 2019
3a61c13
DOI for preprint
Aug 26, 2019
aade38e
Added p=1 and p=2 to simulated data
Oct 9, 2019
248c760
remove p=1 from heterozygosities
tbenavi1 Oct 9, 2019
2a170b2
fix typo
Oct 9, 2019
f19aebd
fix naming for length commands
tbenavi1 Oct 11, 2019
e3fcdd5
add num_rounds advanced parameter
Oct 30, 2019
71a6416
Merge branch 'master' of https://github.com/tbenavi1/genomescope
Oct 30, 2019
2916949
add smudgeplot simulated analysis
tbenavi1 Dec 10, 2019
e14dbc3
add smudgeplot simulated analysis
tbenavi1 Dec 10, 2019
ec5086a
add smudgeplot simulated analysis
tbenavi1 Dec 10, 2019
7dc6806
Update repetitiveness_smudgeplot_commands.py
tbenavi1 Dec 11, 2019
8a1acba
update citation
Mar 18, 2020
68cc1ad
Update README.md
tbenavi1 Mar 18, 2020
f27d326
stop counting line numbers after 100k to avoid gateway timeout
mschatz Mar 24, 2020
d209051
updates for genomescope 2.0 url
mschatz Mar 24, 2020
5034ed4
Merge branch 'master' of https://github.com/tbenavi1/genomescope2.0
mschatz Mar 24, 2020
d72090a
added fitted hist and probabilities, fixed issue with install.R repo
gf777 Feb 23, 2021
e82f41d
Update report_results.R
gf777 Oct 8, 2021
5b2d2eb
Update report_results.R
gf777 Oct 8, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
The table of contents is too big for display.
Diff view
Diff view
  •  
  •  
  •  
The diff you're trying to view is too large. We only load the first 3000 changed files.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -5,3 +5,4 @@ user_data/*
user_uploads/*
!user_uploads/index.html

.Rproj.user
12 changes: 12 additions & 0 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
Package: genomescope
Title: Reference-free profiling of genomes
Version: 2.0
Authors@R: person("Timothy", "Ranallo-Benavidez", email = "tbenavi1@jhu.edu",
role = c("aut", "cre"))
Description: GenomeScope analyzes the k-mer histogram to output estimates for genome size, heterozygosity, and repetitiveness, without requiring a reference genome. GenomeScope employs a polyploid-aware mixture model that, within seconds, accurately infers genome properties from unassembled sequencing data. GenomeScope produces a report and several informative plots describing the genome properties.
Depends: R (>= 3.1.0)
Imports: argparse, minpack.lm
License: file LICENSE
LazyData: true
RoxygenNote: 6.1.1
Encoding: UTF-8
71 changes: 71 additions & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
# Generated by roxygen2: do not edit by hand

export(estimate_Genome_peakp)
export(eval_model)
export(nls_peak)
export(predict1_0)
export(predict1_0_unique)
export(predict1_1)
export(predict1_1_unique)
export(predict2_0)
export(predict2_0_unique)
export(predict2_1)
export(predict2_1_unique)
export(predict3_0)
export(predict3_0_unique)
export(predict3_1)
export(predict3_1_unique)
export(predict4_0)
export(predict4_0_unique)
export(predict4_1)
export(predict4_1_unique)
export(predict4_2)
export(predict4_2_unique)
export(predict5_0)
export(predict5_0_unique)
export(predict5_1)
export(predict5_1_unique)
export(predict5_2)
export(predict5_2_unique)
export(predict5_3)
export(predict5_3_unique)
export(predict5_4)
export(predict5_4_unique)
export(predict5_5)
export(predict5_5_unique)
export(predict6_0)
export(predict6_0_unique)
export(predict6_1)
export(predict6_10)
export(predict6_10_unique)
export(predict6_11)
export(predict6_11_unique)
export(predict6_12)
export(predict6_12_unique)
export(predict6_13)
export(predict6_13_unique)
export(predict6_14)
export(predict6_14_unique)
export(predict6_15)
export(predict6_15_unique)
export(predict6_16)
export(predict6_16_unique)
export(predict6_1_unique)
export(predict6_2)
export(predict6_2_unique)
export(predict6_3)
export(predict6_3_unique)
export(predict6_4)
export(predict6_4_unique)
export(predict6_5)
export(predict6_5_unique)
export(predict6_6)
export(predict6_6_unique)
export(predict6_7)
export(predict6_7_unique)
export(predict6_8)
export(predict6_8_unique)
export(predict6_9)
export(predict6_9_unique)
export(report_results)
export(score_model)
68 changes: 68 additions & 0 deletions R/estimate_genome_peak.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
#' Function to fit 2p peak model, with p forms
#'
#' @param kmer_hist_orig A data frame of the original histogram data (starting at 1 and with last position removed).
#' @param x An integer vector of the x-coordinates of the histogram (after filtering out low coverage errors and high coverage kmers).
#' @param y A numeric vector of the y-coordinates of the histogram (after filtering out low coverage errors and high coverage kmers).
#' @param k An integer corresponding to the kmer length.
#' @param p An integer corresponding to the ploidy.
#' @param topology An integer corresponding to the topology to use.
#' @param estKmercov An integer corresponding to the estimated kmer coverage of the polyploid genome.
#' Set to -1 if not specified by user.
#' @param round An integer corresponding to the iteration number (0, 1, 2, 3) for the fitting process.
#' @param foldername A character vector corresponding to the name of the output directory.
#' @param arguments A data frame of the user-specified inputs.
#' @return A list (nls, nlsscore) where nls is the nlsLM model object (with some additional components)
#' and nlsscore is the score (model RSSE) corresponding to the best fit (of the p forms).
#' @export
estimate_Genome_peakp<-function(kmer_hist_orig, x, y, k, p, topology, estKmercov, round, foldername, arguments) {
if (topology==-1) {
p_to_num_topologies = c(1, 1, 1, 2, 5, 16)
num_topologies = p_to_num_topologies[p]
topologies = 1:num_topologies
}
else {
num_topologies = 1
topologies = c(topology)
}
numofKmers = sum(as.numeric(x)*as.numeric(y))
if (estKmercov==-1) {
#In situations with low heterozygosity, the peak with highest amplitude typically corresponds to the homozygous peak (i.e. the p-th peak).
#However, with increasing heterozygosity, the highest amplitude peak may be an earlier peak.
#Thus, when setting the estimated kmer coverage, we will need to iterate through these possibilities.
#num_peak_indices indicates how many possibilities we need to iterate through.
num_peak_indices = p
y_transform = as.numeric(x)**transform_exp*as.numeric(y)
estKmercov1 = x[which(y_transform==max(y_transform))][1]
}
else {
# When the user sets the estimated kmer coverage, we only need to iterate through one possibility
num_peak_indices = 1
## We set the estimated kmer coverage to be the user specified value
estKmercov1 = estKmercov
}
estLength1 = numofKmers/estKmercov1

nls00 = NULL
peak_indices = 1:num_peak_indices
for (i in peak_indices) {
nls0 = NULL
top_count = 0
## We see what happens when we set the estimated kmer coverage to be 1/i times the x-coordinate where the max peak occurs (1 <= i <= p if the user doesn't set the estimated kmer coverage, and i=1 if they do)
estKmercov2 = estKmercov1 / i
estLength2 = numofKmers/estKmercov2

if (VERBOSE) {cat(paste("trying with kmercov: ", estKmercov2, "\n"))}

for (top in topologies) {
if (VERBOSE) {cat(paste("trying with topology: ", top, "\n"))}
top_count = top_count + 1
nls1 = nls_peak(x, y, k, p, top, estKmercov2, estLength2, MAX_ITERATIONS)
nls0 = eval_model(kmer_hist_orig, nls0, nls1, p, round, foldername, arguments)[[1]]
}
if (i < num_peak_indices) { #if this is not the last evaluation
nls00 = eval_model(kmer_hist_orig, nls00, nls0, p, round, foldername, arguments)[[1]]
}
}

return(eval_model(kmer_hist_orig, nls00, nls0, p, round, foldername, arguments))
}
86 changes: 86 additions & 0 deletions R/eval_model.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
#' Evaluate distinct model forms, in order to resolve ambiguity of which peak is the homozygous peak
#'
#' @param kmer_hist_orig A data frame of the original histogram data (starting at 1 and with last position removed).
#' @param nls0,nls1 The nlsLM model objects to evaluate and compare.
#' @param p An integer corresponding to the ploidy.
#' @param round An integer corresponding to the iteration number (0, 1, 2, 3) for the fitting process.
#' @param foldername A character vector corresponding to the name of the output directory.
#' @param arguments A data frame of the user-specified inputs.
#' @return A list (nls, nlsscore) where nls is the nlsLM model object (with some additional components)
#' and nlsscore is the score (model RSSE) corresponding to the best fit (of the p forms).
#' @export
eval_model<-function(kmer_hist_orig, nls0, nls1, p, round, foldername, arguments) {
nls0score = -1
nls1score = -1

## Evaluate the score the nls0
if (!is.null(nls0)) {
nls0score = score_model(kmer_hist_orig, nls0, round+0.1, foldername)

#if(VERBOSE) {cat(paste("nls0score$all:\t", nls0score$all[[1]], "\n"))}

if (VERBOSE) {
mdir = paste(foldername, "/round", round, ".1", sep="")
dir.create(mdir, showWarnings=FALSE)
report_results(kmer_prof_orig,kmer_prof_orig, k, p, (list(nls0, nls0score)) , mdir, arguments, TRUE)
}
}
else {
if (VERBOSE) {cat("nls0score failed to converge\n")}
}

## Evaluate the score of nls1
if (!is.null(nls1)) {
nls1score = score_model(kmer_hist_orig, nls1, round+0.2, foldername)

if(VERBOSE) {cat(paste("nls1score$all:\t", nls1score$all[[1]], "\n"))}

if (VERBOSE) {
mdir = paste(foldername, "/round", round, ".2", sep="")
dir.create(mdir, showWarnings=FALSE)
report_results(kmer_prof_orig, kmer_prof_orig, k, p, (list(nls1, nls1score)) , mdir, arguments, FALSE)
}
}
else {
if (VERBOSE) {cat("nls1score failed to converge\n")}
}

## Return the better of the scores
if (!is.null(nls0)) {
if (!is.null(nls1)) {
pdiff = abs(nls0score$all[[1]] - nls1score$all[[1]]) / max(nls0score$all[[1]], nls1score$all[[1]])

if (pdiff < SCORE_CLOSE) {
het0 = nls0$ahet
het1 = nls1$ahet

#if (het1 * SCORE_HET_FOLD_DIFFERENCE < het0) {
if (het1 + 0.01 < het0) {
if (VERBOSE) {cat("returning nls0, similar score, higher het\n")}
return (list(nls1, nls1score))
}
#else if (het0 * SCORE_HET_FOLD_DIFFERENCE < het1) {
else if (het0 + 0.01 < het1) {
if (VERBOSE) {cat("returning nls1, similar score, higher het\n")}
return (list(nls0, nls0score))
}
}

if (nls0score$all[[1]] < nls1score$all[[1]]) {
if (VERBOSE) {cat("returning nls0, better score\n")}
return (list(nls0, nls0score))
}
else {
if (VERBOSE) {cat("returning nls1, better score\n")}
return (list(nls1, nls1score))
}
}
else {
if (VERBOSE) {cat("returning nls0, nls1 fail\n")}
return (list(nls0, nls0score))
}
}

if (VERBOSE) {cat("returning nls1 by default\n")}
return (list(nls1, nls1score))
}
Loading