Skip to content

Latest commit

 

History

History
647 lines (479 loc) · 17.5 KB

prepare.rst

File metadata and controls

647 lines (479 loc) · 17.5 KB

Course materials preparation

This section contains the steps that we did to produce the materials that course participants got ready-made. That is the linux machine image, online documentation and the slide deck.

Online documentation

Login to https://github.com. Create a new project called ngs-course, with a default readme file.

Clone the project to local machine and initialize sphinx docs. Choose SSH clone link in GitHub.

git clone git@github.com:libor-m/ngs-course.git

cd ngs-course

# use default answers to all the questions
# enter project name and version 1.0
sphinx-quickstart

Now track all files created by sphinx-quickstart in current directory with git and publish to GitHub.

git add .
git commit -m 'empty sphinx project'

# ignore _build directory in git
echo _build >> .gitignore
git add .gitignore
git commit -m 'ignore _build directory'

# publish the first docs
# setting up argument less git pull with '-u'
git push -u origin master

To get live view of the documents, login to https://readthedocs.org. Your GitHub account can be paired with Read the Docs account in Edit Profile/Social Accounts, then you can simply 'import' new projects from your GitHub with one click. Import the new project and wait for it to build. After the build the docs can be found at http://ngs-course.readthedocs.org (or click the View button).

Now write the docs, commit and push. Rinse and repeat. Try to keep the commits small, just one change a time.

git add _whatever_new_files_
git commit -m '_your meaningful description of what you did here_'
git push

References that may come handy:

Adding another instance

Check out the version which will serve as starting material, create and publish new branch.

git pull
git checkout praha-january-2019
git checkout -b praha-january-2023
git push -u origin praha-january-2023

Log in to Read the Docs, set the new branch as the default version in Admin > Advanced.

If the version (branch) is not visible yet, do a force build of some previous version to get a fresh checkout.

Check if webhooks are set up both in ReadTheDocs > Project > Admin > Integratinos and in GitHub > Settings > Webhooks.

Slack

http://ngs-course.slack.com/

Add a new channel every year. Add the channel to defaults in slack Admin.

Update invite link in index.rst (30 day validity).

Cloud image

Create new machine

We expect ~16 participants. To make things simple we'll host them all on a single instance.

Follow the Meta Cloud quick start. Briefly:

  • add ssh keys

  • add SSH and ICMP security rules (more rules later)

  • Compute > Instance > Launch instance, fill this in the wizard dialog

    • Debian (64 bit)
    • flavor hpc.16core-32ram
    • 32 GB RAM - little less than 2 GB per user
    • 16 vCPUs - keep 2 of the allowed 18 for the testing instance
    • 160 GB HDD as system drive (need space for basic system, gcc, rstudio and produced data * N participants)
  • more rules in security group

    • HTTP to set up let's encrypt cert
    • 443 for secured RStudio
    • 60k-61k for mosh
    • 5690 rstudio + shiny

Debian conifg

SSH to the machine - read the IP in the OpenStack interface and log in with debian user name.

ssh debian@${INSTANCE_IP}

# start as super user
sudo su

# Prague time zone
dpkg-reconfigure tzdata

# find fastest mirror
apt install netselect-apt

# patch it in sources.list
vi /etc/sources.list

# upgrade all
apt update
apt upgrade

# keep the sources list over reboot
# +apt_preserve_sources_list: true
vi /etc/cloud/cloud.cfg

# install the basic tools for more configuration work
apt install vim screen mosh git

# log in as debian
su debian

# create an ssh key
ssh-keygen -t ed25519

# checkout dotfiles
git clone git@github.com:libor-m/dotfiles.git

# link vim config
ln -s dotfiles/vim/.vimrc .

# back to root shell
exit

# link vim config for root
cd
ln -s ~debian/dotfiles/vim/.vimrc .

Now it should be easy to work as debian user, with vim configured even for sudo.

Tiny fixes to make work as debian pleasurable.

# colrize prompt - uncomment force_color_prompt=yes
# add ll alias - uncomment alias ll='ls -l'
# export MANWIDTH=120
vi ~/.bashrc
. ~/.bashrc

Set up the user skeleton, so the newly created users will be set up as needed. Fancy login message will sure help;)

sudo su

# colrize prompt - uncomment force_color_prompt=yes
# add ll alias - uncomment alias ll='ls -l'
# fast sort and uniq
# export LC_ALL=C
# maximal width of man
# export MANWIDTH=120
# # wget impersonating normal browser
# # good for being tracked with goo.gl for example
# alias wgets='H="--header"; wget $H="Accept-Language: en-us,en;q=0.5" $H="Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" $H="Connection: keep-alive" -U "Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2" --referer=/ '
vi /etc/skel/.bashrc

# some screen settings
cat > /etc/skel/.screenrc << 'EOF'
hardstatus alwayslastline
hardstatus string '%{= kG}[%{G}%H%? %1`%?%{g}][%= %{= kw}%-w%{+b yk} %n*%t%?(%u)%? %{-}%+w %=%{g}][%{B}%d.%m. %{W}%c%{g}]'

defscrollback 20000

startup_message off
EOF

# basic RStudio ide config
# obtained by configuring one instance for liborm and then copying the
# resulting file
mkdir -p /etc/skel/.config/rstudio
cat > /etc/skel/.config/rstudio/rstudio-prefs.json <<'EOF'
{
    "save_workspace": "never",
    "font_size_points": 11,
    "editor_theme": "Solarized Dark",
    "panes": {
        "quadrants": [
            "TabSet1",
            "TabSet2",
            "Source",
            "Console"
        ],
        "tabSet1": [
            "Environment",
            "History",
            "Files",
            "Connections",
            "Build",
            "VCS",
            "Tutorial",
            "Presentation"
        ],
        "tabSet2": [
            "Plots",
            "Packages",
            "Help",
            "Viewer"
        ],
        "console_left_on_top": false,
        "console_right_on_top": false
    },
    "posix_terminal_shell": "bash"
}
EOF

# MOTD
cat > /etc/motd <<"EOF"

  _ __   __ _ ___        ___ ___  _   _ _ __ ___  ___
 | '_ \ / _` / __|_____ / __/ _ \| | | | '__/ __|/ _ \
 | | | | (_| \__ \_____| (_| (_) | |_| | |  \__ \  __/
 |_| |_|\__, |___/      \___\___/ \__,_|_|  |___/\___|
        |___/

EOF
exit

Install some basic software

sudo apt install pv curl wget jq locate

# build tools
sudo apt install build-essential pkg-config autoconf

# add important stuff to python
sudo apt install python-dev python-pip python-virtualenv

# java because of fastqc
# sudo apt install openjdk-8-jre-headless

# let's try default jre
sudo apt install default-jre-headless

Set up a dynamic DNS to get some nice login name.

cd
ln -s dotfiles/duckdns

cat duckdns/duck.cron
# add the printed line to crontab
crontab -e

This is what it takes to create a basic usable system in VirtualBox. We can shut it down now with sudo shutdown -h now and take a snapshot of the machine. If any installation goes haywire from now on, it's easy to revert to this basic system.

Install R and RStudio

R is best used in RStudio - server version can be used in web browser.

mkdir ~/sw
cd ~/sw

# install latest R
# https://cran.r-project.org/bin/linux/debian/
sudo bash -c "echo 'deb http://cloud.r-project.org/bin/linux/debian bookworm-cran40/' > /etc/apt/sources.list.d/cran.list"
sudo apt install dirmngr
sudo apt-key adv --keyserver keys.gnupg.net --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF'
sudo apt update
sudo apt install r-base

sudo apt install libxml2-dev libcurl4-openssl-dev libssl-dev
sudo R
> update.packages(.libPaths(), checkBuilt=TRUE, ask=F)
> install.packages(c("tidyverse", "shiny", "reshape2", "vegan"))
> quit(save="no")

# RStudio with prerequisities
sudo apt install gdebi-core
wget https://download2.rstudio.org/server/jammy/amd64/rstudio-server-2024.09.0-375-amd64.deb
sudo gdebi rstudio-server-*.deb

# and fix upstart config
# https://support.rstudio.com/hc/en-us/community/posts/200780986-Errors-during-startup-asio-netdb-error-1-Host-not-found-authoritative-
# remove 2 from [2345]
sudo nano /usr/lib/rstudio-server/extras/upstart/rstudio-server.conf

# install nginx as a front end
# snapd is needed for certbot ;(
sudo apt install nginx snapd

# test if http is accessible from local browser

# simple nginx proxy config for rstudio
sudo su
cat > /etc/nginx/sites-enabled/ngs-course.duckdns.org <<'EOF'
  map $http_upgrade $connection_upgrade {
  default upgrade;
  ''      close;
  }

  server {
  location / {
      proxy_pass http://localhost:8787;
      proxy_http_version 1.1;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection $connection_upgrade;
      proxy_read_timeout 20d;
  }

  server_name ngs-course.duckdns.org;

  listen 80;

  }
EOF

# remove the default site
rm /etc/nginx/sites-enabled/default

# test and reload
nginx -t
nginx -s reload

# test if RStudio login page is visible at http
# .. we'll use the non-sudo account to access rstudio later

# secure with certbot
# (snap paths are somehow broken..and restarting the whole system is soo windows98)
/snap/bin/certbot --nginx

Install additional software

There are packages that are not in the standard repos, or the versions in the repos is very obsolete. It's worth it to install such packages by hand, when there is not much dependencies.

mkdir -p ~/sw

# install a tar with the most common method
inst-tar() {
  TAR_EXTRACT="${2:-xj}"
  cd ~/sw
  wget -O - "$1" | tar $TAR_EXTRACT || return 1
  # extract possible dir name from the tar path
  cd $( echo "$1" | egrep -o '/[^-/]+-' |  sed 's/^.//;s/$/*/' )
  ./configure
  make && sudo make install
}

# pipe viewer
inst-tar http://www.ivarch.com/programs/sources/pv-1.8.14.tar.gz xz

# parallel
inst-tar http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2

# tabtk
cd ~/sw
git clone https://github.com/lh3/tabtk.git
cd tabtk/
# no configure in the directory
make
# no installation procedure defined in makefile
# just copy the executable to a suitable location
sudo cp tabtk /usr/local/bin

# fastqc
cd ~/sw
wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.12.1.zip
unzip fastqc_*.zip
rm fastqc_*.zip
chmod +x FastQC/fastqc

# vcftools
cd ~/sw
wget -O - https://github.com/vcftools/vcftools/tarball/master | tar xz
cd vcftools*
./autogen.sh
./configure
make && sudo make install

# samtools
inst-tar https://github.com/samtools/samtools/releases/download/1.21/samtools-1.21.tar.bz2

# bcftools
inst-tar https://github.com/samtools/bcftools/releases/download/1.21/bcftools-1.21.tar.bz2

# htslib (tabix)
inst-tar https://github.com/samtools/htslib/releases/download/1.21/htslib-1.21.tar.bz2

# bwa
cd ~/sw
wget -O - https://github.com/lh3/bwa/archive/refs/tags/v0.7.18.tar.gz | tar xz
cd bwa*
make
sudo cp bwa /usr/local/bin
# copy the man
sudo bash -c "<bwa.1 gzip > /usr/share/man/man1/bwa.1.gz"

# velvet
cd ~/sw
wget -O - https://www.ebi.ac.uk/~zerbino/velvet/velvet_1.2.10.tgz | tar xz
cd velvet*
make
sudo cp velveth velvetg /usr/local/bin

# bedtools
cd ~/sw
wget -O - https://github.com/arq5x/bedtools2/releases/download/v2.31.1/bedtools-2.31.1.tar.gz | tar xz
cd bedtools2/
make && sudo make install

# clean up
rm -rf bcftools-*/ bedtools2/ bwa-*/ htslib-*/ parallel-*/ pv-*/ samtools-*/ tabtk/ vcftools-vcftools-*/

TODO - future proofing of the installs with getting the latest - but release - quality code with something like this (does not work with tags yet):

gh-get-release() { echo $1 | cut -d/ -f4,5 | xargs -I{} curl -s https://api.github.com/repos/{}/releases/latest | jq -r .tarball_url | xargs -I{} curl -Ls {} | tar xz ;}

Check what are the largest packages:

dpkg-query -Wf '${Installed-Size}\t${Package}\n' | sort -n

Create the user accounts

For a multi-user machine, we need the low-privileged accounts and at least a quota to prevent DoS by overfilling the disk.

Name the accounts user01 to user22:

sudo su
cd

# aptitude search '?provides(wordlist)'
apt install wamerican

# generate some funny passwords
</usr/share/dict/words egrep "^[a-z]{5,8}$" |
  sort -R |
  paste -d' ' - - - |
  head -22 |
  nl -w2 -n'rz' |
  sed 's/^/user/' \
> users.tsv

# use `adduser` as debian alternative
# --gecos '' --disabled-password to get unattended run
adduser --gecos '' --disabled-password liborm
adduser --gecos '' --disabled-password janouse1
usermod -a -G sudo liborm
usermod -a -G sudo janouse1

# normal users
# for user update
# prepared in sheet, copy-paste via cat > users.tsv
<users.tsv cut -f1 | xargs -n1 adduser --gecos '' --disabled-password

# use chpasswd to update the passwords
<users.tsv tr "\t" ":" | chpasswd

# add quotas
# https://www.digitalocean.com/community/tutorials/how-to-set-filesystem-quotas-on-debian-10
apt install quota
# add ,usrquota to / mount
vi /etc/fstab
mount -o remount /
quotacheck -ugm /
quotaon -v /
<users.tsv cut -f1 | xargs -I{} setquota -u {} 8G 10G 0 0 /

# copy-paste users.tsv to shared google sheet
# delete on disk
rm users.tsv

Refresh user home dirs

For the next course, when the machine is not freshly created, remove the old mess and copy the new skeleton.

sudo su

# which user accounts will be handled
seq 2 31 | xargs printf "/home/user%02d\n" > user-homes

# move old home dirs to _bak
mkdir -p /home/_bak3

# move selected user homes
<user-homes xargs mv -t /home/_bak2

# create skeleton dirs
<user-homes cut -d/ -f3 | xargs -I{} mkhomedir_helper {}

Sample datasets

Use data from my nightingale project, subset the data for two selected chromosomes.

# see read counts for chromosomes
samtools view 41-map-smalt/alldup.bam | mawk '{cnt[$3]++;} END{for(c in cnt) print c, cnt[c];}' | sort --key=2rn,2
# extract readnames that mapped to chromosome 1 or chromosome Z
mkdir -p kurz/00-reads
samtools view 41-map-smalt/alldup.bam | mawk '($3 == "chr1" || $3 == "chrZ"){print $1;}' | sort > kurz/readnames
parallel "fgrep -A 3 -f kurz/readnames {} | grep -v '^--$' > kurz/00-reads/{/}" ::: 10-mid-split/*.fastq

# reduce the genome as well
# http://edwards.sdsu.edu/labsite/index.php/robert/381-perl-one-liner-to-extract-sequences-by-their-identifer-from-a-fasta-file
perl -ne 'if(/^>(\S+)/){$c=grep{/^$1$/}qw(chr1 chrZ)}print if $c' 51-liftover-all/lp2.fasta > kurz/20-genome/luscinia_small.fasta

# subset the vcf file with grep
# [the command got lost;]

Transfer the data to user directory (root cannot log in remotely):

# on host machine
cd somewhere.../data-pack

VM=ngs-course.duckdns.org
scp -r data-shared "debian@${VM}:~"
scp -r home/user/projects "debian@${VM}:~"

On the remote machine:

# make the shared data 'shared'
sudo mv ~/data-shared /

# change permissons back to 'read only' for user
sudo chown -R root:root /data-shared

Cleanup

# update the file database
sudo updatedb

# remove history not to confuse users
sudo su
history -cw

# ctrl-d
history -cw

Update the machine

When Debian + RStudio are reasonably updatable, we can keep the previous image. Hostname is derived from instance name via cloud-init, so renaming the instance in OpenStack should do the trick. Still /etc/hosts need to be edited to make sudo happy.

# as root
sudo su

# general update
# (add new CRAN key)
KEYID='95C0FAF38DB3CCAD0C080A7BDC78B2DDEABC47B7'
gpg --keyserver keyserver.ubuntu.com --recv-key $KEYID
gpg --armor --export $KEYID > /etc/apt/trusted.gpg.d/cran_debian_key.asc

apt update
apt upgrade

# update certificates
snap refresh
certbot certonly --nginx
systemctl restart nginx

# create a reusable update script
cat > update-packages.R <<EOF
update.packages(lib.loc=.libPaths()[1], ask=F, checkBuilt=T, Ncpus=parallel::detectCores())
EOF

# update R packages
R
> source('update-packages.R')

# update rstudio as normal user
exit
cd ~/sw
wget https://download2.rstudio.org/server/jammy/amd64/rstudio-server-2024.09.0-375-amd64.deb
sudo rstudio-server active-sessions
sudo rstudio-server offline
sudo gdebi rstudio-server-*-amd64.deb
sudo rstudio-server online

Slide deck

Libor's slide deck was created using Adobe InDesign (you can get the CS2 version almost legally for free). Vasek's slide deck was created with Microsoft Powerpoint. Images are shamelessly taken from the internet, with the 'fair use for teaching' policy ;)