Computational correction of index switching in multiplexed sequencing libraries

Supplementary information to the article "Computational correction of index switching in multiplexed sequencing libraries" by Anton JM Larsson, Geoff Stanley, Rahul Sinha, Irving L. Weissman, Rickard Sandberg available now in Nature Methods (http://dx.doi.org/10.1038/nmeth.4666)

The Jupyter notebook (Correcting_Spreading_of_Signal_Notebook.ipynb) contains Python code for the analysis and correction of index-swapping, including the generation of Figures 1, 2A-C, S1 and S2 in the manuscript (written by Anton JM Larsson). The notebook (sandbergCorrection_analyzeClustering.ipynb) contains the R code to reproduce Figures 2D-E and S3 of the manuscript (written by Geoff Stanley).

unspread.py

unspread.py estimates the percentage of contaminating reads in the experiment, estimates the 'rate of spreading', and corrects the read counts if the experiment is affected to a sufficient degree. The unspread.py script requires a table of read counts supplied as a .csv file with added information regarding each cell's index barcodes.

System Requirements

unspready.py is a python3 script with dependencies:

pandas: 0.19.2
numpy: 1.9.0
matplotlib: 2.0
statsmodels: 0.6.1
scipy: 1.0.0
patsy: 0.4.1

No further installation is needed.

Usage

usage: unspread.py [-h] [--i5 STRING] [--i7 STRING] [--rows INTEGER] [--cols INTEGER] [--idx_col INTEGER] [--sep CHAR] [--h INTEGER] [--c INTEGER] [--t FLOAT] [--idx_in_id BOOLEAN] [--delim_idx CHAR] [--column BOOLEAN] filename

Unspread: Computational correction of barcode index spreading

positional arguments:

filename .csv file with counts

optional arguments:

-h, --help show this help message and exit

--i5 STRING Index name of i5 barcodes (default: 'i5.index.name')

--i7 STRING Index name of i7 barcodes (default: 'i7.index.name')

--rows INTEGER Number of rows in plate (default: 16)

--cols INTEGER Number of columns in plate (default: 24)

--idx_col INTEGER Which column serves as the index (default: 0)

--sep CHAR The separator in the .csv file (default ',')

--h INTEGER The number of reads to use to be considered highly expressed in only one cell (default: 30)

--c INTEGER Cutoff to remove addition false positives (default: 5)

--t FLOAT Threshold for acceptable fraction of spread counts (default: 0.05)

--idx_in_id BOOLEAN If the index is in the cell id (i.e. cellid_i5_i7) (Default: 0 (False), set to 1 otherwise (True))

--delim_idx CHAR If the index is in the cell id, the delimiting character (Default: '_')

--column BOOLEAN If each column is represents a cell, otherwise each row. (default: 1 (True), set to 0 otherwise (False))

Output

unspread.py outputs a set of figures with diagnostic information comparable to the figures in the article. A log file is also saved. If the plate is affected a corrected .csv file will also be made.

Example

An example from the first plate in the manuscript:

cell.name	N.index.name	S.index.name	0610005C13Rik	0610007C21Rik	...
HSC02_a_p1c7r2_P01	N701	S522	0	117	...
HSC02_a_p1c5r5_P03	N702	S522	0	5	...

In this particular example, genes are structured by column and cells by rows but the converse is also supported.

To run the correction of the first plate in the manuscript:

./unspread.py mHSC_plate1HiSeq_counts_IndexInfo_anon.csv --i5 'S.index.name' --i7 'N.index.name' --column 0 --sep ' '

This command should not take longer than a minute.

The expected command line output is:

Reading file: mHSC_plate1HiSeq_counts_IndexInfo_anon.csv

Estimating spreading from mHSC_plate1HiSeq_counts_IndexInfo_anon.csv

Found expression to be biased along a certain column and row combination 753 times out of 899

Estimated the median rate of spreading to be 0.0098

Estimated fraction of spread reads to be 0.14827 and variance explained R-squared = 0.8996

Saving figure from analysis to mHSC_plate1HiSeq_counts_IndexInfo_anon_figures.pdf

Saving log file from analysis to mHSC_plate1HiSeq_counts_IndexInfo_anon_unspread.log

Correcting spreading for each gene

Saving correction to mHSC_plate1HiSeq_counts_IndexInfo_anon_corrected.csv

The genes in the manuscript, Mki67 and Tacr, have ID 7963 and 12319 respectively.

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
data		data
Correcting_Spreading_of_Signal_Notebook.ipynb		Correcting_Spreading_of_Signal_Notebook.ipynb
LICENSE		LICENSE
README.md		README.md
mHSC_plate1HiSeq_counts_IndexInfo_anon.csv		mHSC_plate1HiSeq_counts_IndexInfo_anon.csv
mHSC_plate1NextSeq_counts_IndexInfo_reindex_anon.csv		mHSC_plate1NextSeq_counts_IndexInfo_reindex_anon.csv
sandbergCorrection_analyzeClustering.ipynb		sandbergCorrection_analyzeClustering.ipynb
unspread.py		unspread.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Computational correction of index switching in multiplexed sequencing libraries

unspread.py

System Requirements

Usage

Output

Example

About

Releases

Packages

Contributors 2

Languages

License

sandberg-lab/Spreading-Correction

Folders and files

Latest commit

History

Repository files navigation

Computational correction of index switching in multiplexed sequencing libraries

unspread.py

System Requirements

Usage

Output

Example

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages