Review and resource materials from around the internet for data science, with applications in bioinformatics and computational biology and other domains, that I've found useful.
Table of Contents
- Learning to Learn
- Statistics and Probability
- General mathematics
- Linear Algebra
- Network Science
- Algorithms and Data Structures
- Programming
- Structured Query Language (SQL)
- Statistical Methods and Machine Learning
- Computational Biology
- Domain Knowledge
- Data Visualization and Making Figures
- Should-Read Data Science Papers
- Software Engineering
- Reproducible Science
- People Skills and Communication
- Other Lists
- License
Resources and tips on how to self-learn and learn with others
- The Thinker's Guide to The Art of Socratic Questioning (PDF) - A checklist of questions to help facilitate directed discussions on topics.
- Questions for a Socratic Dialogue (PDF) - Nine types of questions that can be used to facilitate understanding.
Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.
- Handbook of Biological Statistics and R Supplement - Online set of notes from "Biological Data Analysis" course from University of Delaware.
- Engineering Statistics Handbook - Handbook to help scientists and engineers incorporate statistical methods.
- Stat Trek - Teach yourself statistics.
- Online Statistics Education - Developed by Rice University, University of Houston Clear Lake, and Tufts University.
- BS704 Probability - Boston University course on probability.
- StatQuest - Series of videos on miscellaneous complex topics such as p-values, principal component analysis (PCA), and R-squared.
- STAT 505 Applied Multivariate Statistical Analysis - Penn State Eberly College of Science course.
- StatSoft Electronic Statistics Textbook
- UW Summer Institutes Archive Material - Various learning material in statistics, data analysis, machine learning, genetics, and clinical research.
- Practical Data Science for Stats - Collection of curated articles on practical data science.
- Statistics for Biologists - Nature collection of articles on statistical analysis.
- Top Upvoted Questions on CrossValidated - Great questions with great answers about topics in statistics and machine learning.
- Ordination Methods for Ecologists - Resource of ordination methodology
- Probability Cheatsheet - Compiled by William Chen and Joe Blitzstein
- Statistics Done Wrong - Reviews popular statistical errors and slip-ups committed by scientists every day.
- Statistics for Hackers - By Jake VanderPlas (PyCon 2016)
- Modern Statistics for Modern Biologists
- An Introduction to Statistical Learning - With editions in R and Python.
- Quick-R - Quick reference to statistical methods using R.
- ModernDive - Statistical Inference via Data Science
- UCLA IDRE Statistics - Examples of statistical analyses using R, SAS, SPSS, and Stata.
- r-statistics.co - Educational resource for machine learning and statistical computing in R.
- W2024 Applied Linear Regression Analysis
- Python for Data Analysis, 3E by Wes McKinney
- Automate the Boring Stuff with Python by Al Sweigart
- Think Python, 2nd Edition by Allen B. Downey
- Think Stats 2E by Allen B. Downey
- P-values, False Discovery Rate (FDR) and q-values
- FAQ: How do I interpret odds ratio in logistic regression?
- Standard error of the mean of a sample binomial distribution
- Common Probability Distributions: The Data Scientist's Crib Sheet - Data scientists have hundreds of probability distributions from which to choose. Where to start?
- Choosing the correct statistical test in SAS, Stata, SPSS, and R - Table giving general guidelines on choosing statistical tests.
- Warning Signs in Experimental Design and Interpretation - Nine common warning signs in experimental design and nine common warning signs in interpretation of experiments by Peter Norvig.
- Univariate Distribution Relationships - An interactive flowchart showing the relationships between various univariate distributions.
- First Internet Gallery of Statistics Jokes
- PLoS's Ten Simple Rules for Effective Statistical Practice
- Common Statistical Pitfalls in Basic Science Research
- Review of Probability Theory - Maleki and Do (PDF)
- Effect Size FAQs
- Common Mistakes in Using Statistics - Spotting Them and Avoiding Them
- Common Statistical Tests are Linear Models (Or: How to Teach Stats)
- The Permutation Test - A Visual Explanation of Statistical Testing
- Visualising Residuals - Using R and ggplot2.
- Forecasting: Principles and Practice, 3rd Ed by Rob J Hyndman and George Athanasopoulos
- Interpreting Cohen's d effect size
- Interpreting Correlations
- Interactive Machine Learning, Deep Learning and Statistics websites
- Tidyverse - Opinionated collection of R packages designed for data science
- Tidymodels - Framework that is a collection of packages for modeling and machine learning using tidyverse principles.
- Text Mining with R - A tidy approach to performing text analysis in R.
- Cross-industry standard process for data mining - Wikipedia - Open standard for common processes used in data mining, which can be applied to data science analyses.
- The Limits of Data By C. Thi Nguyen - Emphasizes the importance of understanding the context of your data and that it inherently has biases.
- Build a Career in Data Science by Emily Robinson and Jacqueline Nolis - A guide on landing your first data science job and being a valued senior employee, rather than on just the technical details of how regression works. The authors also have an accompanying podcast.
- ExcelDemy - Excel courses, tutorials, and templates.
- Excel Easy - Excel tutorials and tips on functions and more.
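Of the entries above, the permutation test is simple enough to sketch in a few lines. The following is a minimal illustration and not code from any linked resource: the function name `permutation_test`, its parameters, and the example data are all assumptions made here. It estimates a two-sided p-value for a difference in group means by repeatedly shuffling the pooled observations.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Hypothetical helper for illustration; uses only the standard library.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)

    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1

    # Add one to numerator and denominator so the estimate is never zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up example data: two clearly separated groups.
control = [1, 2, 3, 4, 5]
treated = [11, 12, 13, 14, 15]
p_value = permutation_test(control, treated)
print(f"estimated p-value: {p_value:.4f}")
```

The add-one correction is one common convention; other references report the raw proportion instead.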
Resources generally related to learning and understanding mathematical foundations
- A Gentle Introduction to the Art of Mathematics - Gentle introduction to basic mathematical notation, set theory, writing mathematical proofs, and mathematical thinking.
Linear algebra is the branch of mathematics concerning vector spaces and linear mappings between such spaces.
- Essence of Linear Algebra - Excellent, short overview of linear algebra concepts that help develop intuition on the matter.
- MIT OCW 18.06SC Linear Algebra - Taught by Gilbert Strang.
- Linear algebra explained in four pages - Excerpt from the No Bullshit Guide to Linear Algebra by Ivan Savov.
- S.O.S. Mathematics Matrix Algebra
- PCA, Eigenvectors, and Eigenvalues (Cross Validated)
- The Matrix Reference Manual - Reference information about linear algebra and the properties of real and complex matrices.
- Linear Algebra Review and Reference - Kolter and Do (PDF)
- Immersive Linear Algebra
Network science is an academic field which studies complex networks such as telecommunication networks, computer networks, biological networks, cognitive and semantic networks, and social networks, considering distinct elements or actors represented by nodes (or vertices) and the connections between the elements or actors as links (or edges).
- Network Science Book - Online book with visualizations and interactive tools about network science by Albert-László Barabási.
- Graph Theory by Sarada Herke - YouTube series on graph theory.
- Network Science - Aggregate of all things network science: research, introductions, people, journals, conferences, datasets, etc.
- Handbook of Graphs and Networks in People Analytics - The second volume in a series of technical textbooks for professionals working in analytics
- Awesome Network Analysis - Curated awesome list of network analysis resources
In mathematics and computer science, an algorithm is a self-contained step-by-step set of operations to be performed.
In computer science, a data structure is a particular way of organizing and storing data in a computer so that it can be accessed and modified efficiently.
- Bioinformatic Algorithms - Algorithm lectures by Phillip Compeau and Pavel Pevzner.
- Algorithms for DNA Sequencing - Ben Langmead's lectures on algorithms used in DNA sequencing.
- Rosalind - Learn bioinformatics and programming through problem solving.
- VisuAlgo - Visualizing data structures and algorithms through animation.
- Discrete Mathematics: An Open Introduction
Computer programming (often shortened to programming) is a process that leads from an original formulation of a computing problem to executable computer programs.
- DevDocs - API documentation browser.
- Hyperpolyglot - Commonly used features in programming languages in side-by-side format.
- Learn X in Y Minutes - Quick start to many programming languages, data structures, and common tools.
- How to Report Bugs Effectively
- Rosetta Code - Programming chrestomathy site.
- Cookbook for R - Provides solutions to common tasks and problems in analyzing data.
- OverAPI.com - Collecting All Cheat Sheets.
- The Art of Comments - Essay on how to comment well.
- devhints.io - Modest collection of cheatsheets.
- Teach Yourself Programming in Ten Years
- Code Complete Book Review - Detailed review and notes of book.
- The Pragmatic Programmer Quick Reference
- Bash Pitfalls - Common errors that Bash programmers make, along with Bash FAQs and general Bash Programming.
- Select Star SQL - Interactive book which aims to be the best place on the internet for learning SQL.
- Tech Dev Guide - By Google.
- How to C in 2016
- explainshell - See help text that matches each argument.
- Teach Yourself Computer Science
- Competitive Programming Books
- Comprehensive Python Cheatsheet
- Practical Business Python
- Full Stack Python - Build, deploy and operate Python apps.
- Select Star SQL - Interactive online book with a non-toy dataset to learn SQL.
- SQLZoo - Tutorials for learning SQL step by step, organized by function.
- SQL Tutorial - Quick access tutorials on SQL.
Machine learning is the subfield of computer science that "gives computers the ability to learn without being explicitly programmed".
- Naive Bayes Part 1 and Naive Bayes Part 2
- How to choose a predictive model after k-fold cross-validation?
- Parametric versus nonparametric bootstrap resampling
- Feature engineering using R
- How to Use t-SNE Effectively - Interactive visualization to explore how t-SNE behaves in order to use it more effectively.
- Accurately Measuring Model Prediction Error
- Understanding the Bias-Variance Tradeoff
- Random Forests - Creator Leo Breiman's site on random forests.
- Google's Machine Learning Crash Course - Learn TensorFlow APIs.
- Learning Math for Machine Learning by Vincent Chen
- Calculus Made Easy (PDF)
Computational biology involves the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.
- RPKM measure is inconsistent among samples
- RPKM-TPM.r - R script to show RPKM vs TPM
- StatQuest: RPKM, FPKM and TPM
- Why do we use the negative binomial distribution for analysing RNAseq data?
- QCFail.com - Articles about common next-generation sequencing problems
- Differences between DESeq/edgeR and CuffDiff in RNA-seq
- HarvardX Biomedical Data Science Open Online Training
- Question: Can someone please explain in simple terms how DESeq2 works?
- RNA-seqlopedia - Overview of RNA-seq and choices for a successful experiment.
- Theory Behind DESeq2
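Several entries above deal with RPKM and TPM normalization. As a minimal sketch, using made-up read counts and gene lengths rather than anything from the linked resources, the following computes both measures with the standard definitions. The function names `rpkm` and `tpm` and the sample values are assumptions made here. It shows that TPM values sum to one million in every sample, while RPKM totals vary between samples.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts)
    return [
        count / (length / 1e3) / (total_reads / 1e6)
        for count, length in zip(counts, lengths_bp)
    ]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = [count / (length / 1e3) for count, length in zip(counts, lengths_bp)]
    scale = sum(rates) / 1e6
    return [rate / scale for rate in rates]


# Hypothetical data: three genes, two samples.
lengths_bp = [1000, 2000, 4000]
sample_a = [10, 20, 40]
sample_b = [10, 20, 400]

for name, counts in [("sample_a", sample_a), ("sample_b", sample_b)]:
    print(
        name,
        "RPKM total:", round(sum(rpkm(counts, lengths_bp))),
        "TPM total:", round(sum(tpm(counts, lengths_bp))),
    )
```

Because the TPM total is fixed, a gene's TPM can be read as its share of the sample; the same is not true of RPKM, whose total depends on the sample's composition.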
Data visualization or data visualisation is viewed by many disciplines as a modern equivalent of visual communication. It involves the creation and study of the visual representation of data, meaning "information that has been abstracted in some schematic form, including attributes or variables for the units of information".
- A Compendium of Clean Graphs in R
- How to Create Publication-Quality Figures
- Make Better Figures Faster Using Illustrator
- A Tour Through the Visualization Zoo
- Adobe Illustrator for Scientists (YouTube playlist)
- WebGraphviz is Graphviz in the Browser
- Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing
- from Data to Viz - Leads you to most appropriate graph for your data.
- Beautiful plotting in R: A ggplot2 cheatsheet
- Effectively Using Matplotlib
- Fundamentals of Data Visualization by Claus O. Wilke
- Practical Typography by Matthew Butterick
- ditaa - Small command-line utility to convert diagrams using ASCII art
- Asciiflow - GUI to easily create ASCII plain text diagrams
- Hand drawn feel to diagrams
- 10+ Guidelines for Better Tables in R (2020) - Notes on making better tables with accompanying R code
- The Design Philosophy of Great Tables (2024) - Design philosophy behind the great-tables Python package for generating effective tables of data
- FriendsDontLetFriends - Opinionated essay about good and bad practices in data visualization with examples.
Data science, computational biology, and bioinformatics papers to cover the breadth of their fields.
- applied-ml by Eugene Yan - Curated papers, articles, and blogs on data science and machine learning in production.
- List of important publications in data science - Wikipedia
- How to read a research paper - One question to ask when reading papers.
- Zhang Lab Recommendations
- The Leek group guide to genomics papers
- "Foundations of Computational and Systems Biology" Readings - MIT OCW course readings.
- Question: What Are The Classic Papers In Bioinformatics?
- Best Academic Papers About the Microbiome
- Staying Current in Bioinformatics & Genomics: 2017 Edition by Stephen Turner
- RNA-Seq Analysis, Differential Gene Expression, and Functional Enrichment Analysis (Recent removal of readings page, but course overall is valuable)
General knowledge mapping and exploration tools
- Inciteful - Tools to help you accelerate your research
Software engineering is the application of engineering to the development of software in a systematic method.
- "The Guide to the Software Engineering Body of Knowledge"
- Software Engineering - Ian Sommerville
- Unix as IDE Series
- Software Engineering Resources - Aggregation of over 1800 software engineering resources on various topics.
- Flowchart Symbols Explained
- Write the Docs - A global community of people who care about documentation.
- Amazon Web Services - A Practical Guide
- Amazon Web Services in Plain English
- Command-line Tools can be 235x Faster than your Hadoop Cluster - Simple but effective demonstration of using the right tool for the right amount of data.
Reproducibility is the ability to get the same research results using the raw data and computer programs provided by the researchers.
- A statistical definition for reproducibility and replicability
- scifigure: Visualize Reproducibility and Replicability in a Comparison of Scientific Studies (R package)
- What should Researchers Expect When They Replicate Studies? A Statistical View of Replicability in Psychological Science
- A Guide to Reproducible Code in Ecology and Evolution (PDF)
- Docker for Beginners - By Prakhar Srivastav.
- Riffomonas: Reproducible Research - By Patrick D. Schloss.
People skills are patterns of behavior and behavioral interactions. Among people, it is an umbrella term for skills under three related sets of abilities: personal effectiveness, interaction skills, and intercession skills.
- How to ask good questions - By Julia Evans.
- How To Ask Questions The Smart Way - By Eric Raymond.
- Teaching Tech Together - By Greg Wilson.
- (An Opinionated Talk) On Preparing Good Talks (PDF) - By Ranjit Jhala.
- CommKit - By MIT's Department of Biological Engineering Communication Fellows on successful scientific communication.
- General Principles of Mathematical Communication - By Mathematical Association of America.
- Community Tool Box - By University of Kansas.
- Speech-Words to Minutes - Estimate how many words are needed for a given timed speech.
- Novelist Cormac McCarthy's tips on how to write a great science paper - The Pulitzer prizewinner shares his advice for pleasing readers, editors and yourself.
- Science Writing: Guidelines and Guidance - Notes from Carl Zimmer on writing about science, medicine, and the environment.
- Write the Paper First - Argues that "writing now is a favor to yourself" and the benefits of clear writing for organizing thoughts early.
Useful lists on their own that may intersect other topics above.