Welcome to the companion code repository for the O'Reilly book Python and R for the Modern Data Scientist. You can also access this repository as an RStudio Cloud project (account required).
Success in data science depends on the flexible and appropriate use of tools. That includes Python and R, two of the foundational programming languages in the field. With this book, data scientists from the Python and R communities will learn how to speak the dialects of each language. By recognizing the strengths of working with both, you'll discover new ways to accomplish data science tasks and expand your skill set.
Authors Rick J Scavetta and Boyan Angelov explain the fundamentals of these languages and highlight where each one excels over the other, whether it's their linguistic features or the power of their open source ecosystems. Not only will you learn how to use Python and R together in real-world settings, but you'll also broaden your knowledge and job opportunities by working as a bilingual data scientist.
- Learn Python and R from the perspective of your current language
- Understand the strengths and weaknesses of each language
- Identify use cases where one language is better suited than the other
- Understand the modern open source ecosystem available for both, including packages, frameworks, and workflows
- Learn how to integrate R and Python in a single workflow
- Follow a real-world case study that demonstrates ways to use these languages together
When available, companion scripts to the book are found in their respective chapter directories.
Part II. Levels of working together I: Bilingual
Part III. Modern Context
Part IV. Levels of working together II: Synergy
Appendix A. Bilingual Dictionary
- Available here.
Datasets used in the book can be found as follows.
This dataset is from the R ggplot2 package:
library(ggplot2)
data(diamonds)
These are available in base R:
data(PlantGrowth)
data(iris)
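This dataset is available through the Python scikit-learn package. Note that load_boston was removed in scikit-learn 1.2, so the code below requires an earlier version: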
from sklearn.datasets import load_boston
boston_data = load_boston()
The Amazon music review data can be downloaded here. We use the "digital music" subset.
This dataset on swimming pool and car detection using satellite imagery is available on Kaggle.
The daily Australian temperatures dataset can be downloaded directly from GitHub.
Obtain this data and the spatial raster (the bioclimatic variables) using the R sdmbench package:
library(sdmbench)
data <- get_benchmarking_data("Loxodonta africana")
This object is a list that contains the occurrence data in data$df_data and the raster layers in data$raster_data.
These data can be downloaded from Kaggle.
The wildfires data can be downloaded from the USDA website directly or from Kaggle. To run the case study, add the file FPA_FOD_20170508.sqlite to the ch07-case-study/data/ folder.
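Before running the case study, it can help to confirm that the file is in the expected place and opens as a SQLite database. The following is a minimal sketch using only the Python standard library. The helper name list_tables is hypothetical (it is not part of the book's code), and the path assumes you run it from the repository root.

```python
import sqlite3
from pathlib import Path

# Location described above; assumes the current directory is the repository root.
DB_PATH = Path("ch07-case-study/data/FPA_FOD_20170508.sqlite")


def list_tables(db_path):
    """Return the sorted names of all tables in the SQLite file at db_path.

    Hypothetical helper for a quick sanity check, not part of the book's code.
    """
    db_path = Path(db_path)
    if not db_path.is_file():
        raise FileNotFoundError(f"Expected the wildfires database at {db_path}")
    # Open read-only so the check cannot modify the downloaded file.
    uri = db_path.resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    if DB_PATH.is_file():
        print(list_tables(DB_PATH))
    else:
        print(f"Missing file: {DB_PATH}")
```

If the file is missing, the script reports the path it looked for instead of creating an empty database, which is what a plain sqlite3.connect call on a nonexistent path would do.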
This dataset is from the R dplyr package:
library(dplyr)
data(starwars)