GitHub - petrelharp/euroibd: Code to do data processing and create figures for "The geography of recent genetic ancestry across Europe", http://arxiv.org/abs/1207.3815

petrelharp / euroibd Public

Notifications You must be signed in to change notification settings
Fork 6
Star 15

Code to do data processing and create figures for "The geography of recent genetic ancestry across Europe", http://arxiv.org/abs/1207.3815

CC0-1.0 license

15 stars 6 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
IBS_vs_dist.R		IBS_vs_dist.R
LICENSE		LICENSE
PCA_vs_IBD.R		PCA_vs_IBD.R
README		README
anonymize.R		anonymize.R
beagle-false-positives.R		beagle-false-positives.R
beagle-true-positives.R		beagle-true-positives.R
blocks-eda.R		blocks-eda.R
clean-sample-info.R		clean-sample-info.R
dbgap_In_Europe.out		dbgap_In_Europe.out
dbgap_samples.R		dbgap_samples.R
endpoints-and-overlap.R		endpoints-and-overlap.R
example-Europeans.out		example-Europeans.out
example-POPRES_chr22.beagle.gz		example-POPRES_chr22.beagle.gz
example-europeans.imiss		example-europeans.imiss
example-kin.genome.gz		example-kin.genome.gz
find_close_rels_IBD.R		find_close_rels_IBD.R
fit-error-model.R		fit-error-model.R
genetic.maps.R		genetic.maps.R
hapmap-phased-to-beagle.py		hapmap-phased-to-beagle.py
ibd-blocklens.csv		ibd-blocklens.csv
ibd-blocks-fns.R		ibd-blocks-fns.R
ibd-pca.R		ibd-pca.R
ibd-pop-info.csv		ibd-pop-info.csv
insert-ibd-segments.py		insert-ibd-segments.py
inversion-writeup-plots.R		inversion-writeup-plots.R
laplace-inversion-fns.R		laplace-inversion-fns.R
make-hapmap-plus-popres.sh		make-hapmap-plus-popres.sh
make-lentab.R		make-lentab.R
makefile		makefile
mixup-by-region.sh		mixup-by-region.sh
mixup-genomes.py		mixup-genomes.py
parse-sims-fns.R		parse-sims-fns.R
plus-process-fibd.py		plus-process-fibd.py
recode-hapmap-legend-for-popres.R		recode-hapmap-legend-for-popres.R
run-beagle.sh		run-beagle.sh
sensitivity.R		sensitivity.R
sfig-8-sharing-rates.svg		sfig-8-sharing-rates.svg
writeup-plots.R		writeup-plots.R

Repository files navigation

This is the collection of codes used in analysis of european POPRES data
(see Nelson et al, http://www.ncbi.nlm.nih.gov/pubmed/18760391; 
obtain access to data at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000145.v1.p1 .)
for the paper "The geography of recent genetic ancestry across Europe", 
by Peter Ralph and Graham Coop, 
published May 7, 2013, in PLoS Biology, at
http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001555

To see what makes what, see the makefile.

Most of the files are pretty clean, but a few have some clutter.

None will actually do anything without obtaining the POPRES data and running BEAGLE http://faculty.washington.edu/browning/beagle/beagle.html on them.

Please use these if they are useful, and credit us if appropriate.


--------------

If you are looking to use the inference method, look for the functions sinv() and squoosh().

Also, linv() gives the least-squares approximation, useful for finding a good starting point.

For instance, see the block in inversion-writeup-plots.R that sets things up to infer histories for all the pairs of populations:

# Set-up: the operator L and the false positive rate estimates:
gens <- cumsum( rep(1:36, each=10) )
L <- error.disc.trans( lenbins=lenbins, gens=gens, chrlens=.chrlens[setdiff(seq_along(.chrlens),omitchroms)] )) # this takes a while!! do it once and save.
fp <- disc.fp(lenbins, chrlens=.chrlens)
fp <- runmed(fp,k=15)
# Set-up for intial least-squares estimate:
S <- as.vector( smooth( hist( blocks$maplen, breaks=c(lenbins,100), plot=FALSE )$counts ) )
S <- S/mean(S)
S[lenbins>20] <- exp( predict( lm( log(S) ~ lenbins + I(lenbins^2), subset=(lenbins>20) & (S>0) ), newdata=list(lenbins=lenbins[lenbins>20]) ) )
est.Sigma <- function ( np, X, alpha=100 ) {
    # estimate Sigma by combining the population average with the smoothed, observed data
    # where Sigma is the covariance matrix of X/np
    smX <- as.vector( smooth(X) )
    ppp <- (alpha*np/(2543640+alpha*np))  # 2543640 = total # pairs
    Sigma <- ppp*smX + (1-ppp)*sum(smX)*S/sum(S)
    return( Sigma/np^2 )
}
Ksvd <- svd( 1/sqrt(as.vector(S)) * L )  # pre-compute SVD of L
total.npairs <- choose( length(unique(blocks$id1)), 2 )
linverse <- linv( X=lendist/total.npairs, S=as.vector(S), Sigma=est.Sigma(np=total.npairs,X=lendist), Ksvd=Ksvd, maxk=30, fp=fp )
sminverse <- smoothed.nonneg( lS=linverse, Ksvd=Ksvd, bestk=15, lam=1 )

# Here is where we get the actual estimate:
ans <- sinv( lendist, sminverse, L, fp, npairs=total.npairs, lam=0, gamma=1e6 )

# squoosh iterates sinv, looking for the smoothest solution that drops no more than 2 units of log-likelihood
smooth = squoosh( lendist, xinit=ans, L, fp, npairs=ans$npairs, minlen=minlen, gamma=100, lam=0, fitfn="loglik", relthresh=2/ans$npairs, twid=.05 )


--------------

Notes:

* The .gmap files provide a mapping from the endpoint indices that BEAGLE uses to genetic distances (e.g. in centiMorgans).  Important: BEAGLE uses 0-based indices.

* BEAGLE 4.0 is out, and improved over the version we used (3.3).