sternclean seeks to simplify cleaning data frames.
Multiple cleaning steps are accomplished in a single function.
For example, you can change column types, impute one set of columns' NAs with a set value, impute another set of columns' NAs with a group mean, and replace another set of columns' infinite values with a different set value, all in a few lines of clean code.
Here is the order of operations under the hood:
1. Change the types
2. Remove columns
3. Impute NAs
4. Impute infinites
This fixed order allows multiple cleaning processes to happen in one function call.
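The order listed above can be illustrated with plain base R. This is only a sketch of the sequence, not sternclean's actual implementation; the toy data frame `df` and its columns are hypothetical.

```r
# Hypothetical toy data (not part of the package)
df <- data.frame(
  id    = c("a", "b", "c"),
  score = c("1", "Inf", NA),
  note  = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# 1. Change the types: numeric imputation needs a numeric column
df$score <- as.numeric(df$score)
# 2. Remove columns that are not needed
df$note <- NULL
# 3. Impute NAs with a set value
df$score[is.na(df$score)] <- 0
# 4. Impute infinite values with another set value
df$score[is.infinite(df$score)] <- 100

print(df)
```

Converting types first matters: a column stored as text has to become numeric before its NAs or infinite values can be replaced with numbers.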
We will start with simple one-step cleaning examples. Later we will take on more complex situations.
Rickle and Mortan Dataset
people         original_person  intelligence  evil_rank
Rickle         Rickle           Inf           5
Mortan         Mortan           9             2.75
Jerry          Jerry            0.1           2
Pickle Rickle  Rickle           Inf           NA
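To follow along, the example data can be rebuilt by hand. The sketch below assumes `people` is a factor and `intelligence` is stored as character, which matches the `class()` output shown in the type-change examples; the type of `original_person` is not stated in this document and is assumed to be character.

```r
# Reconstruction of the example data; column types are assumptions
rickle_and_mortan <- data.frame(
  people          = factor(c("Rickle", "Mortan", "Jerry", "Pickle Rickle")),
  original_person = c("Rickle", "Mortan", "Jerry", "Rickle"),
  intelligence    = c("Inf", "9", "0.1", "Inf"),
  evil_rank       = c(5, 2.75, 2, NA),
  stringsAsFactors = FALSE
)

str(rickle_and_mortan)
```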
class(rickle_and_mortan$people)
#> [1] "factor"
sternclean("rickle_and_mortan",
           class_to_strng = "people")
class(rickle_and_mortan$people)
#> [1] "character"
class(rickle_and_mortan$intelligence)
#> [1] "character"
sternclean("rickle_and_mortan",
           class_to_numer = "intelligence")
class(rickle_and_mortan$intelligence)
#> [1] "numeric"
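For comparison, the same two type changes can be written in base R. This is a rough equivalent for illustration, not how sternclean performs the conversion internally; `df` is a hypothetical stand-in for the example data.

```r
# Hypothetical stand-in for two columns of the example data
df <- data.frame(
  people       = factor(c("Rickle", "Mortan", "Jerry", "Pickle Rickle")),
  intelligence = c("Inf", "9", "0.1", "Inf"),
  stringsAsFactors = FALSE
)

# Factor to character
df$people <- as.character(df$people)
# Character to numeric; the text "Inf" becomes the numeric value Inf
df$intelligence <- as.numeric(df$intelligence)

class(df$people)
class(df$intelligence)
```

When a factor holds numbers, convert it with `as.numeric(as.character(x))`; calling `as.numeric()` directly on a factor returns the internal level codes instead of the values.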
Column/Row Removal Parameters
sternclean("rickle_and_mortan",
           remove_columns = "intelligence")

people         original_person  evil_rank
Rickle         Rickle           5
Mortan         Mortan           2.75
Jerry          Jerry            2
Pickle Rickle  Rickle           NA
sternclean("rickle_and_mortan",
           remove_na_rows = "evil_rank")

people         original_person  intelligence  evil_rank
Rickle         Rickle           Inf           5
Mortan         Mortan           9             2.75
Jerry          Jerry            0.1           2
sternclean("rickle_and_mortan",
           removeby_regex = "pe")

intelligence  evil_rank
Inf           5
9             2.75
0.1           2
Inf           NA
sternclean("rickle_and_mortan",
           remove_all_nas = TRUE)

people         original_person  intelligence  evil_rank
Rickle         Rickle           Inf           5
Mortan         Mortan           9             2.75
Jerry          Jerry            0.1           2
sternclean("rickle_and_mortan",
           remove_non_num = TRUE)

intelligence  evil_rank
Inf           5
9             2.75
0.1           2
Inf           NA
sternclean("rickle_and_mortan",
           remove_all_exc = c("people", "evil_rank"))

people         evil_rank
Rickle         5
Mortan         2.75
Jerry          2
Pickle Rickle  NA
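The removal examples above map onto familiar base R idioms. The sketch below is an approximation for comparison only, not sternclean's implementation; it assumes `intelligence` is already numeric, as the remove_non_num output implies, and all object names are hypothetical.

```r
# Hypothetical copy of the example data with intelligence already numeric
df <- data.frame(
  people          = c("Rickle", "Mortan", "Jerry", "Pickle Rickle"),
  original_person = c("Rickle", "Mortan", "Jerry", "Rickle"),
  intelligence    = c(Inf, 9, 0.1, Inf),
  evil_rank       = c(5, 2.75, 2, NA),
  stringsAsFactors = FALSE
)

# Like remove_columns: drop a named column
no_intel <- df[, setdiff(names(df), "intelligence"), drop = FALSE]
# Like remove_na_rows: drop rows where one column is NA
no_na_rank <- df[!is.na(df$evil_rank), ]
# Like removeby_regex: drop columns whose names match a pattern
no_pe <- df[, !grepl("pe", names(df)), drop = FALSE]
# Like remove_all_nas: drop rows with an NA in any column
complete_rows <- df[complete.cases(df), ]
# Like remove_non_num: keep numeric columns only
numeric_only <- df[, vapply(df, is.numeric, logical(1)), drop = FALSE]
# Like remove_all_exc: keep only the listed columns
kept <- df[, c("people", "evil_rank"), drop = FALSE]
```

Note that `Inf` is not `NA`, so `complete.cases()` keeps the rows with infinite intelligence, which agrees with the remove_all_nas output shown above.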
sternclean("rickle_and_mortan",
           impute_na2mean = "evil_rank")

people         original_person  intelligence  evil_rank
Rickle         Rickle           Inf           5
Mortan         Mortan           9             2.75
Jerry          Jerry            0.1           2
Pickle Rickle  Rickle           Inf           3.25
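The imputed value 3.25 is the mean of the three observed ranks: (5 + 2.75 + 2) / 3. A base R sketch of the same calculation follows; it is for illustration and is not taken from sternclean's source.

```r
evil_rank <- c(5, 2.75, 2, NA)

# Mean of the observed (non-NA) values
col_mean <- mean(evil_rank, na.rm = TRUE)

# Replace only the missing entries
evil_rank[is.na(evil_rank)] <- col_mean

evil_rank
```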
sternclean("rickle_and_mortan",
           impute_na_cols = "evil_rank",
           impute_na_with = 1738)

people         original_person  intelligence  evil_rank
Rickle         Rickle           Inf           5
Mortan         Mortan           9             2.75
Jerry          Jerry            0.1           2
Pickle Rickle  Rickle           Inf           1738
sternclean("rickle_and_mortan",
           impute_grpmean = "evil_rank",
           impute_grpwith = "original_person")

original_person  people         intelligence  evil_rank
Jerry            Jerry          0.1           2
Mortan           Mortan         9             2.75
Rickle           Rickle         Inf           5
Rickle           Pickle Rickle  Inf           5
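Pickle Rickle's missing rank is filled with 5 because the only other row in the Rickle group has a rank of 5. One way to sketch a group-mean imputation in base R uses `ave()`; this is an assumed equivalent, not sternclean's own code, and it does not reproduce the row and column reordering seen in the output above.

```r
# Hypothetical copy of the relevant columns
df <- data.frame(
  people          = c("Rickle", "Mortan", "Jerry", "Pickle Rickle"),
  original_person = c("Rickle", "Mortan", "Jerry", "Rickle"),
  evil_rank       = c(5, 2.75, 2, NA),
  stringsAsFactors = FALSE
)

# Per-row group mean, ignoring NAs within each group
grp_mean <- ave(df$evil_rank, df$original_person,
                FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing entries with their group's mean
missing <- is.na(df$evil_rank)
df$evil_rank[missing] <- grp_mean[missing]

df
```

If every value in a group is missing, `mean(x, na.rm = TRUE)` returns `NaN`, so such groups would still need a fallback value.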
sternclean("rickle_and_mortan",
           impute_inf_col = "intelligence",
           impute_inf_wit = 1738)

people         original_person  intelligence  evil_rank
Rickle         Rickle           1738          5
Mortan         Mortan           9             2.75
Jerry          Jerry            0.1           2
Pickle Rickle  Rickle           1738          NA
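In base R, the same replacement of infinite values can be sketched with `is.infinite()`. This is shown for comparison and is not sternclean's implementation.

```r
intelligence <- c(Inf, 9, 0.1, Inf)

# is.infinite() is TRUE for both Inf and -Inf, and FALSE for NA
intelligence[is.infinite(intelligence)] <- 1738

intelligence
```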
sternclean("rickle_and_mortan",
           impute_cust_cl = "evil_rank",
           impute_cust_fn = quantile,
           probs = .25,
           na.rm = TRUE
)

people         original_person  intelligence  evil_rank
Rickle         Rickle           Inf           5
Mortan         Mortan           9             2.75
Jerry          Jerry            0.1           2
Pickle Rickle  Rickle           Inf           2.375
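The value 2.375 is the first quartile of the observed ranks (2, 2.75, 5) under R's default quantile method. A small helper can sketch the idea of imputing with any summary function; `impute_with()` is a hypothetical name for illustration and is not part of sternclean.

```r
# Hypothetical helper: fill NAs with the result of fn(x, ...)
impute_with <- function(x, fn, ...) {
  fill <- unname(fn(x, ...))  # drop names such as "25%"
  x[is.na(x)] <- fill
  x
}

evil_rank <- c(5, 2.75, 2, NA)
imputed <- impute_with(evil_rank, quantile, probs = 0.25, na.rm = TRUE)

imputed
```

Extra arguments such as `probs` and `na.rm` are forwarded to the function through `...`, which mirrors how they are passed in the call above.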
Here we:
- change the people column's class to string
- change the intelligence column's class to numeric
- remove the original_person column
- impute the NAs in the evil_rank column with the column's mean
- impute the infinite values in the intelligence column with 1738
sternclean("rickle_and_mortan",
           class_to_strng = "people",
           class_to_numer = "intelligence",
           remove_columns = "original_person",
           impute_na2mean = "evil_rank",
           impute_inf_col = "intelligence",
           impute_inf_wit = 1738
)

people         intelligence  evil_rank
Rickle         1738          5
Mortan         9             2.75
Jerry          0.1           2
Pickle Rickle  1738          3.25
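For comparison, the five steps of the combined call could be written out in base R roughly as follows. This is a sketch under the same type assumptions as the earlier examples (`people` as a factor, `intelligence` as character), not sternclean's internal code.

```r
# Hypothetical reconstruction of the example data
df <- data.frame(
  people          = factor(c("Rickle", "Mortan", "Jerry", "Pickle Rickle")),
  original_person = c("Rickle", "Mortan", "Jerry", "Rickle"),
  intelligence    = c("Inf", "9", "0.1", "Inf"),
  evil_rank       = c(5, 2.75, 2, NA),
  stringsAsFactors = FALSE
)

# Change the types
df$people <- as.character(df$people)
df$intelligence <- as.numeric(df$intelligence)
# Remove columns
df$original_person <- NULL
# Impute NAs with the column mean
df$evil_rank[is.na(df$evil_rank)] <- mean(df$evil_rank, na.rm = TRUE)
# Impute infinite values with a set value
df$intelligence[is.infinite(df$intelligence)] <- 1738

df
```

Each step is short on its own; the appeal of the single call is that all of them are declared together and always run in the same order.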
Compared to Original Data Frame
people         original_person  intelligence  evil_rank
Rickle         Rickle           Inf           5
Mortan         Mortan           9             2.75
Jerry          Jerry            0.1           2
Pickle Rickle  Rickle           Inf           NA