We translate important verbs from the tidyverse
to their closest Python equivalents.
The added value lies not so much in reproducing the R documentation as in finding matching R and Python functions that perform the exact same manipulation.
Purpose: describe the problem the verb tries to solve and how it approaches/solves it.
Official documentation: Please link to the official documentation, if available.
Good help, tutorials: Please link to explanations, examples, and discussions that you found helpful.
Working examples should be self-contained and as short as possible (asap). The R and Python examples should do the exact same thing. Ideally, the code can be copy-pasted into R or Python and will run as expected. To reach this goal, load data from libraries (best case) or provide the data in the repo and link to it. Use standard datasets where possible (mtcars, diamonds, ...).
Purpose: Use select() to choose a subset of an existing dataset based on the columns you need.
This is useful for narrowing a large dataset down to the essential data: the narrowed dataset will contain only the needed columns.
Official documentation:
https://www.rdocumentation.org/packages/dplyr/versions/0.5.0/topics/select
Good help, tutorials:
Working example:
##save columns u and v in a new table:
#new_dataset <- select(source_dataset, column_name_u, column_name_v)
library(nycflights13)
library(tidyverse)
##explicitly choose columns:
select(flights, year, month, day)
## Leads to the output:
# A tibble: 336,776 x 3
# year month day
# <int> <int> <int>
# 1 2013 1 1
# 2 2013 1 1
# 3 2013 1 1
# 4 2013 1 1
# 5 2013 1 1
# 6 2013 1 1
# 7 2013 1 1
# 8 2013 1 1
# 9 2013 1 1
#10 2013 1 1
# … with 336,766 more rows
##from-to selection
select(flights, year:day)
##select all columns by excluding specific columns
select(flights, -(year:day))
Purpose: Use column indexing (and loc/iloc) to choose a subset of an existing dataset based on the columns you need.
This is useful for narrowing a large dataset down to the essential data: the narrowed dataset will contain only the needed columns.
Documentation:
https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html#transforming
Good help, tutorials:
- https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c
- https://www.youtube.com/watch?v=iaziBEhdyRk
Working example:
import pandas as pd
# construct a small sample data set inline so the example is self-contained
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'white', 'gray', 'black', 'pink', 'red'],
                   'food': ['steak', 'lamb', 'mango', 'apple', 'cheese', 'melon', 'beans', 'burrito'],
                   'score': [4.2, 1.8, 9.0, 3.3, 1.8, 9.5, 2.2, 7.7],
                   'A': [5, 2, 10, 3, 7, 4, 1, 9],
                   'B': [5, 2, 3, 1, 8, 2, 6, 4]})
#one column
df['food']
#multiple columns
df[['color', 'food', 'score']]
#select rows from:until
df[2:7]
#select rows up to position 7, or from position 2 on
df[:7]
df[2:]
#select every 10th row
df[::10]
#select rows with index labels 0 to 5 (loc is label-based and inclusive) in column A
df.loc[0:5, 'A']
#iloc selects by position instead of label, e.g. the first column
df.iloc[0:5, 0]
#get all rows where column A is less than 2
df.loc[df.loc[:, 'A'] < 2]
#lambda functions for selecting can be written inline
df.loc[lambda d: (d.A > 7) & (d.B < 4)]
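dplyr's select() also ships helpers such as contains() and matches(); the closest pandas counterpart is DataFrame.filter(), which selects columns by name pattern. A minimal sketch on a small made-up frame (the column names are assumptions for illustration only):

```python
import pandas as pd

# small made-up frame; the column names are only for illustration
df = pd.DataFrame({'year': [2013, 2013], 'month': [1, 1], 'day': [1, 2],
                   'dep_delay': [2, 4], 'arr_delay': [11, 20]})

# keep columns whose name contains 'delay' (similar to dplyr's contains("delay"))
delays = df.filter(like='delay')
print(list(delays.columns))  # ['dep_delay', 'arr_delay']

# keep columns whose name matches a regex (similar to dplyr's matches("^d"))
d_cols = df.filter(regex='^d')
print(list(d_cols.columns))  # ['day', 'dep_delay']
```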
Purpose: Use filter() to find rows/cases meeting certain conditions.
Official documentation: https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/filter
Good help, tutorials:
Working example:
library(tidyverse)
library(ggplot2)
# filter for cars with city mileage below 20 miles per gallon
lowMpgCars <- filter(mpg, cty < 20)
print(lowMpgCars)
Purpose: Use boolean indexing to find rows/cases meeting certain conditions.
Documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing
Good help, tutorials:
Working example:
from plotnine.data import mpg
filtered = mpg[mpg.cty < 20]
print(filtered)
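An alternative to boolean indexing is DataFrame.query(), whose string expressions read closer to dplyr's filter(). A minimal sketch on a tiny made-up stand-in for the mpg data:

```python
import pandas as pd

# tiny made-up stand-in for the mpg data set
cars = pd.DataFrame({'model': ['a4', 'civic', 'mustang'],
                     'cty': [18, 24, 15]})

# same condition as cars[cars.cty < 20], written as a query string
filtered = cars.query('cty < 20')
print(filtered.model.tolist())  # ['a4', 'mustang']
```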
Purpose: When the dataset is unordered, arrange() reorders rows by the values of one or more columns.
Official documentation: https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/arrange
Good help, tutorials:
- https://www.datanovia.com/en/lessons/reorder-data-frame-rows-in-r/
- https://www.youtube.com/watch?v=mcowyd3sgSA
- http://www.datasciencemadesimple.com/sorting-dataframe-r-using-dplyr/
Working example :
library(dplyr)
iris <- read.csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")
# Reorder rows by sepal.length in *ascending* order:
iris_sorted_asc <- arrange(iris, sepal.length)
# Reorder rows by sepal.length in *descending* order using '-' or 'desc()':
iris_sorted_desc <- arrange(iris, desc(sepal.length))
iris_sorted_desc <- arrange(iris, -sepal.length)
# Additionally it can be sorted by *more than one* value.
# Simply add a new column to the expression:
iris_sorted_by_multiple_values <- arrange(iris, desc(sepal.length), sepal.width)
Purpose: When the dataset is unordered, sort_values() reorders rows by the values of one or more columns.
Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
Good help, tutorials:
- https://datatofish.com/sort-pandas-dataframe/
- https://stackoverflow.com/questions/37787698/how-to-sort-pandas-dataframe-from-one-column
- https://www.geeksforgeeks.org/python-pandas-dataframe-sort_values-set-1/
Working example:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv", sep=",")
# Reorder rows by sepal.length in *ascending* order:
df = df.sort_values(by=['sepal.length'], ascending=[True])
# Reorder rows by sepal.length in *descending* order:
df = df.sort_values(by=['sepal.length'], ascending=[False])
# Reorder rows by *more than one* value.
# Add a new column and its order orientation to the expression:
df = df.sort_values(by=['sepal.length', 'sepal.width'], ascending=[True, False])
# If NAs should come first, add *na_position*:
df = df.sort_values(by=['sepal.length', 'sepal.width'], ascending=[True, False], na_position='first')
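When sorting only to grab the top rows, nlargest()/nsmallest() are a shortcut for sort_values().head(). A minimal sketch on a made-up frame standing in for the iris data:

```python
import pandas as pd

# made-up frame standing in for the iris data
df = pd.DataFrame({'species': ['setosa', 'virginica', 'versicolor'],
                   'sepal_length': [5.1, 7.9, 6.4]})

# equivalent to df.sort_values('sepal_length', ascending=False).head(2)
top2 = df.nlargest(2, 'sepal_length')
print(top2.species.tolist())  # ['virginica', 'versicolor']
```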
Purpose: Apply functions to existing columns and add new columns containing the result. If you additionally want to drop the old variables/columns, use transmute().
Official documentation: https://www.rdocumentation.org/packages/dplyr/versions/0.5.0/topics/mutate
Good help, tutorials:
- https://r4ds.had.co.nz/transform.html#add-new-variables-with-mutate
- https://cran.r-project.org/web/packages/dplyr/dplyr.pdf - page 45
- https://youtu.be/2dFpblO7MB8
Working example:
library(tidyverse)
data("mtcars")
mutate(mtcars, displ_l = disp / 61.0237)
transmute(mtcars, displ_l = disp / 61.0237)
Purpose: Apply functions to existing columns and add new columns containing the result.
Documentation: https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html#transforming
Good help, tutorials:
- https://stackoverflow.com/questions/28417293/sample-datasets-in-pandas
- https://vincentarelbundock.github.io/Rdatasets/datasets.html
Working example:
#https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv
import pandas as pd
mtcars = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv")
mtcars = mtcars.assign(displ_l=mtcars["disp"] / 61.0237)
print(mtcars)
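One difference from mutate(): a plain assign() keyword cannot refer to a column created in the same call, but a lambda can, because it receives the intermediate frame. A minimal sketch on a made-up two-row stand-in for mtcars:

```python
import pandas as pd

# made-up two-row stand-in for mtcars
cars = pd.DataFrame({'model': ['x', 'y'], 'disp': [160.0, 360.0]})

# later keyword arguments may use columns created by earlier ones
# when they are written as lambdas (close to chained mutate() expressions)
cars = cars.assign(displ_l=lambda d: d.disp / 61.0237,
                   displ_rounded=lambda d: d.displ_l.round(1))
print(cars.displ_rounded.tolist())  # [2.6, 5.9]
```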
Purpose: Create one or more scalar variables summarizing the variables of an existing table. Tables with groups created by group_by()
will result in one row in the output for each group. Tables with no groups will result in one row.
Official documentation: https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/summarise
Good help, tutorials:
Working example:
install.packages("nycflights13")
library(nycflights13)
library(tidyverse)
flights %>%
summarize(sum_dep_delay = sum(dep_delay, na.rm = T),
avg_dep_delay = mean(dep_delay, na.rm = T) )
Purpose: Create one or more scalar values summarizing columns of an existing table; a grouped DataFrame yields one row per group, an ungrouped one a single row.
Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html
Good help, tutorials:
Working example:
import pandas as pd
nyc = pd.read_csv('https://raw.githubusercontent.com/ismayc/pnwflights14/master/data/flights.csv')
# sum and mean of dep_delay; NaN values are skipped by default (like na.rm = TRUE)
print(nyc['dep_delay'].agg(['sum', 'mean']))
Purpose: Group data by one or more variables to calculate aggregated values for other variables.
Official documentation: https://dplyr.tidyverse.org/reference/group_by.html
Good help, tutorials:
Working example:
install.packages("nycflights13")
library(nycflights13)
library(tidyverse)
flights %>%
group_by(year, month, day) %>%
summarize(sum_dep_delay = sum(dep_delay, na.rm = T),
avg_dep_delay = mean(dep_delay, na.rm = T))
Purpose: Group data by one or more variables to calculate aggregated values for other variables.
Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
Good help, tutorials:
Working example:
import pandas as pd
nyc = pd.read_csv('https://raw.githubusercontent.com/ismayc/pnwflights14/master/data/flights.csv')
# group by date, then aggregate dep_delay per group (like group_by() %>% summarize())
nyc_grouped = nyc.groupby(['year', 'month', 'day'])['dep_delay'].agg(['sum', 'mean'])
print(nyc_grouped.head(50))
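pandas (0.25+) also supports named aggregation, which maps almost one-to-one onto the summarize() call in the R example. A minimal self-contained sketch with made-up delay values:

```python
import pandas as pd

# made-up miniature flights table
flights = pd.DataFrame({'month': [1, 1, 2, 2],
                        'dep_delay': [2.0, 4.0, 10.0, None]})

# one output column per keyword, like summarize() in dplyr;
# sum/mean skip NaN by default, matching na.rm = TRUE
out = flights.groupby('month').agg(
    sum_dep_delay=('dep_delay', 'sum'),
    avg_dep_delay=('dep_delay', 'mean'))
print(out)
```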
Purpose: gather() takes multiple columns and collapses them into key-value pairs, duplicating all other columns as needed. Use gather() when you notice that you have columns that are not variables.
Official documentation: https://tidyr.tidyverse.org/reference/gather.html
Good help, tutorials:
Working example:
library(tidyverse)
# prepare ungathered data set
product <- c("Product A", "Product B", "Product C")
customerA <- c(6886, 5486, 1865)
customerB <- c(4896, 9865, 3154)
# ungathered data set has multiple columns with sales numbers
ungathered_data <- data.frame(product, customerA, customerB)
# gathered data set has only one column with sales numbers
gathered_data <- gather(data = ungathered_data, "customer", "sales", 2:3)
print(gathered_data)
Purpose: melt() unpivots a DataFrame from wide format to long format, optionally leaving identifier variables set. This is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are "unpivoted" to the row axis, leaving just two non-identifier columns, variable and value.
Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html
Good help, tutorials:
Working example:
import pandas as pd
# prepare ungathered data set
# ungathered data set has multiple columns with sales numbers
ungathered_data = pd.DataFrame({'product': {0: 'A', 1: 'B', 2: 'C'},
'customer A': {0: 6886, 1: 5486, 2: 1865},
'customer B': {0: 4896, 1: 9865, 2: 3154}})
# gathered data set has only one column with sales numbers
gathered_data = pd.melt(ungathered_data, id_vars=['product'], value_vars=['customer A', 'customer B'], var_name='customer', value_name='sales')
print(gathered_data)
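melt() and pivot() are inverses; as a sanity check, pivoting the gathered frame back should recover the wide layout. A minimal sketch:

```python
import pandas as pd

wide = pd.DataFrame({'product': ['A', 'B'],
                     'customer A': [6886, 5486],
                     'customer B': [4896, 9865]})

long = pd.melt(wide, id_vars=['product'],
               var_name='customer', value_name='sales')

# pivot() spreads the key/value pair back into one column per customer
wide_again = (long.pivot(index='product', columns='customer', values='sales')
                  .reset_index())
wide_again.columns.name = None
print(wide_again)
```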
Purpose: spread() takes two columns (key & value) and spreads them into multiple columns: it makes "long" data wider.
Official documentation: https://www.rdocumentation.org/packages/tidyr/versions/0.8.3/topics/spread
Good help, tutorials:
Working example (asap):
library(tidyverse)
#print(table2)
table2 %>%
spread(key = type, value = count)
Purpose: pivot_table() takes two columns (key & value) and spreads them into multiple columns: it makes "long" data wider.
Official documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html
Good help, tutorials:
Working example (asap):
import pandas as pd
#sample data
d = {'country': ['Afghanistan', 'Afghanistan', 'Afghanistan', 'Afghanistan', 'Brazil', 'Brazil', 'Brazil', 'Brazil'],
     'year': [1999, 1999, 2000, 2000, 1999, 1999, 2000, 2000],
     'type': ['cases', 'population', 'cases', 'population', 'cases', 'population', 'cases', 'population'],
     'count': [745, 19987071, 2666, 20595360, 37737, 172006362, 80488, 174504898]}
#log sample data as table
df = pd.DataFrame(data=d)
print(df)
#log spread data
df = pd.pivot_table(df, index='country', columns='type', values='count')
print(df)
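One caveat worth knowing when reproducing spread(): pivot_table() silently aggregates duplicate index/column combinations (mean by default), while DataFrame.pivot() raises on them. A minimal sketch with made-up values:

```python
import pandas as pd

# (A, cases) occurs twice on purpose
df = pd.DataFrame({'country': ['A', 'A', 'A'],
                   'type': ['cases', 'cases', 'population'],
                   'count': [10, 20, 1000]})

# pivot_table() averages the duplicates silently
t = pd.pivot_table(df, index='country', columns='type', values='count')
print(t.loc['A', 'cases'])  # 15.0

# df.pivot(index='country', columns='type', values='count')
# would raise a ValueError here because of the duplicate pair
```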
Purpose: Standard joins (inner, full, left, right) augment a data frame with information from another data frame.
Official documentation: https://dplyr.tidyverse.org/reference/join.html
Good help, tutorials:
Working example:
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(21000, 23400, 26800)
employ <- data.frame(employee, salary)
employee <- c('John Doe','Peter Gynn','Max Mustermann')
size <- c(220, 130, 160)
employ2 <- data.frame(employee, size)
#join on equally named columns
employInner <- inner_join(employ, employ2)
employFull <- full_join(employ, employ2)
employLeft <- left_join(employ, employ2)
employRight <- right_join(employ, employ2)
Purpose: Standard joins (inner, full, left, right) augment a data frame with information from another data frame.
Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
Good help, tutorials:
- https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
- https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html
Working example:
import pandas as pd
employ = pd.DataFrame({'employee': ['John Doe','Peter Gynn','Jolie Hope'], 'salary': [21000, 23400, 26800]})
employ2 = pd.DataFrame({'employee': ['John Doe','Peter Gynn','Max Mustermann'], 'size': [220, 130, 160]})
employInner = pd.merge(employ, employ2, how='inner')
employOuter = pd.merge(employ, employ2, how='outer')
employLeft = pd.merge(employ, employ2, how='left')
employRight = pd.merge(employ, employ2, how='right')
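merge() can also report where each row came from via indicator=True, which is handy for checking a join. A sketch reusing the same two made-up frames:

```python
import pandas as pd

employ = pd.DataFrame({'employee': ['John Doe', 'Peter Gynn', 'Jolie Hope'],
                       'salary': [21000, 23400, 26800]})
employ2 = pd.DataFrame({'employee': ['John Doe', 'Peter Gynn', 'Max Mustermann'],
                        'size': [220, 130, 160]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
checked = pd.merge(employ, employ2, how='outer', indicator=True)
print(checked[['employee', '_merge']])
```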
Purpose: Filter the rows of a data frame based on matches in another data frame, e.g. keep only flights to the most popular destinations. Filtering joins can also match on multiple columns.
Official documentation: https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/join#1_sections
Function names: semi_join(), anti_join()
Good help, tutorials:
- https://r4ds.had.co.nz/relational-data.html#filtering-joins
- http://stat545.com/bit001_dplyr-cheatsheet.html#semi_joinpublishers-superheroes
Working example:
library(nycflights13)
library(tidyverse)
# Get the top Flight destinations
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
flights %>%
semi_join(top_dest)
# Find flights that have no matching plane in the planes table
flights %>%
anti_join(planes, by = "tailnum") %>%
count(tailnum, sort = TRUE)
Purpose: pandas has no dedicated semi or anti join; both can be emulated with isin() or with merge(..., indicator=True).
Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.isin.html
Good help, tutorials:
- https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
Working example (asap):
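pandas has no built-in semi_join()/anti_join(); a common emulation uses isin(). A minimal sketch on made-up frames mirroring the flights example:

```python
import pandas as pd

# made-up miniature flights table and a table of 'top' destinations
flights = pd.DataFrame({'dest': ['ATL', 'BOS', 'ATL', 'SFO'],
                        'delay': [3, 5, 7, 1]})
top_dest = pd.DataFrame({'dest': ['ATL', 'BOS']})

# semi join: keep flights whose dest appears in top_dest
semi = flights[flights['dest'].isin(top_dest['dest'])]
print(semi['dest'].tolist())  # ['ATL', 'BOS', 'ATL']

# anti join: keep flights whose dest does NOT appear in top_dest
anti = flights[~flights['dest'].isin(top_dest['dest'])]
print(anti['dest'].tolist())  # ['SFO']
```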
Purpose: Turn implicitly missing values in a long data format into explicitly missing values (using NA or special values).
Official documentation: https://tidyr.tidyverse.org/reference/complete.html
Good help, tutorials:
Working example:
library(tidyverse)
data <- tibble("sex" = c("m", "m", "f"),
"treatment" = c("A", "B", "A"),
"results" = rnorm(3))
print(data)
# A tibble: 3 x 3
# sex treatment results
# <chr> <chr> <dbl>
# 1 m A -0.135
# 2 m B -1.99
# 3 f A 0.465
# as can be seen, the case sex=f & treatment=B is missing
data <- complete(data, sex, treatment, fill = list("results" = "-99"))
print(data)
# A tibble: 4 x 3
# sex treatment results
# <chr> <chr> <chr>
# 1 f A 0.464851026534199
# 2 f B -99
# 3 m A -0.134708501608753
# 4 m B -1.9857518802666
# now it's there
Purpose: Turns implicit missing values into explicit missing values.
Documentation:
http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.from_product.html
Good help, tutorials:
Working example (asap):
import pandas as pd
df = pd.DataFrame({'user':[1, 1, 2, 3, 3, 3], 'x':['a', 'b', 'a', 'a', 'c', 'd'], 'y':1})
print(df)
# user x y
#0 1 a 1
#1 1 b 1
#2 2 a 1
#3 3 a 1
#4 3 c 1
#5 3 d 1
df = df.set_index(['user','x'])
mux = pd.MultiIndex.from_product([df.index.levels[0], df.index.levels[1]],names=['user','x'])
df = df.reindex(mux, fill_value=0).reset_index()
print(df)
# user x y
#0 1 a 1
#1 1 b 1
#2 1 c 0
#3 1 d 0
#4 2 a 1
#5 2 b 0
#6 2 c 0
#7 2 d 0
#8 3 a 1
#9 3 b 0
#10 3 c 1
#11 3 d 1
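An alternative to the MultiIndex/reindex route is unstack()/stack(), which also makes every (user, x) combination explicit. A minimal sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'user': [1, 1, 2, 3, 3, 3],
                   'x': ['a', 'b', 'a', 'a', 'c', 'd'],
                   'y': 1})

# unstack() widens the frame (one column per x, missing pairs filled with 0),
# stack() then lengthens it again, so every combination becomes explicit
full = (df.set_index(['user', 'x'])['y']
          .unstack(fill_value=0)
          .stack()
          .reset_index(name='y'))
print(len(full))  # 12 rows: 3 users x 4 x-values
```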