sternclean seeks to simplify cleaning dataframes.

Multiple cleaning steps are accomplished in a single function call.

For example, in a few lines of code you can change column types, impute the NAs in one set of columns with a fixed value, impute the NAs in another set of columns with a group mean, and replace the infinite values in a third set of columns with another fixed value.

Here is the order of operations under the hood:

  • Change the types
  • Remove columns
  • Impute NAs
  • Impute infinites

Because the steps always run in this order, several cleaning operations can be combined in one call.
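
To make the ordering concrete, here is a minimal base R sketch of the four stages. It is not sternclean's implementation: `clean_in_order` and all of its arguments are hypothetical names chosen for illustration. It does show why type changes have to come first, since `is.infinite()` returns `FALSE` for every element of a character column.

```r
# Hypothetical helper (not part of sternclean): applies the four stages
# in the documented order using base R only.
clean_in_order <- function(df, to_numeric = NULL, drop = NULL,
                           na_cols = NULL, na_value = 0,
                           inf_cols = NULL, inf_value = 0) {
  # 1. Change the types (via as.character so factors convert by label)
  for (col in to_numeric) df[[col]] <- as.numeric(as.character(df[[col]]))
  # 2. Remove columns
  df <- df[, setdiff(names(df), drop), drop = FALSE]
  # 3. Impute NAs with a fixed value
  for (col in na_cols) df[[col]][is.na(df[[col]])] <- na_value
  # 4. Impute infinite values with a fixed value
  for (col in inf_cols) df[[col]][is.infinite(df[[col]])] <- inf_value
  df
}

toy <- data.frame(a = c("1", "Inf", NA), b = c("x", "y", "z"),
                  stringsAsFactors = FALSE)
cleaned <- clean_in_order(toy, to_numeric = "a", drop = "b",
                          na_cols = "a", na_value = -1,
                          inf_cols = "a", inf_value = 99)
print(cleaned)
```

If the infinite-value step ran before the type change, the string "Inf" in column `a` would pass through untouched.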

Simple Examples

We will start with simple one-step cleaning examples. Later we will take on more complex situations.

Rickle and Mortan Dataset

| people | original_person | intelligence | evil_rank |
| --- | --- | --- | --- |
| Rickle | Rickle | Inf | 5 |
| Mortan | Mortan | 9 | 2.75 |
| Jerry | Jerry | 0.1 | 2 |
| Pickle Rickle | Rickle | Inf | NA |
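
The README does not show how the data frame is built, so the following is one plausible construction rather than the author's. The `people` column is made a factor and `intelligence` a character vector to match the `class()` output in the Class Change section; the type of `original_person` is an assumption.

```r
# One possible way to build the example data (column types partly assumed)
rickle_and_mortan <- data.frame(
  people          = factor(c("Rickle", "Mortan", "Jerry", "Pickle Rickle")),
  original_person = c("Rickle", "Mortan", "Jerry", "Rickle"),  # type assumed
  intelligence    = c("Inf", "9", "0.1", "Inf"),               # stored as text
  evil_rank       = c(5, 2.75, 2, NA),
  stringsAsFactors = FALSE
)
str(rickle_and_mortan)
```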

Class Change Parameters

class(rickle_and_mortan$people)
#> [1] "factor"

sternclean("rickle_and_mortan",
           class_to_strng = "people")

class(rickle_and_mortan$people)
#> [1] "character"

class(rickle_and_mortan$intelligence)
#> [1] "character"

sternclean("rickle_and_mortan",
           class_to_numer = "intelligence")

class(rickle_and_mortan$intelligence)
#> [1] "numeric"
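
For comparison, the same two conversions can be written in base R. This is only a sketch of the equivalent result, not how sternclean performs them. A factor should go through `as.character()` before `as.numeric()`, because `as.numeric()` on a factor returns its integer level codes rather than its labels.

```r
df <- data.frame(people = factor(c("Rickle", "Mortan")),
                 intelligence = c("Inf", "9"),
                 stringsAsFactors = FALSE)

# factor -> character
df$people <- as.character(df$people)
# character -> numeric; the string "Inf" becomes the numeric value Inf
df$intelligence <- as.numeric(df$intelligence)

class(df$people)
class(df$intelligence)
is.infinite(df$intelligence)
```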

Column/Row Removal Parameters

sternclean("rickle_and_mortan",
           remove_columns = "intelligence")

| people | original_person | evil_rank |
| --- | --- | --- |
| Rickle | Rickle | 5 |
| Mortan | Mortan | 2.75 |
| Jerry | Jerry | 2 |
| Pickle Rickle | Rickle | NA |

sternclean("rickle_and_mortan",
           remove_na_rows =  "evil_rank")

| people | original_person | intelligence | evil_rank |
| --- | --- | --- | --- |
| Rickle | Rickle | Inf | 5 |
| Mortan | Mortan | 9 | 2.75 |
| Jerry | Jerry | 0.1 | 2 |

sternclean("rickle_and_mortan",
           removeby_regex = "pe")

| intelligence | evil_rank |
| --- | --- |
| Inf | 5 |
| 9 | 2.75 |
| 0.1 | 2 |
| Inf | NA |

sternclean("rickle_and_mortan",
           remove_all_nas = TRUE)

| people | original_person | intelligence | evil_rank |
| --- | --- | --- | --- |
| Rickle | Rickle | Inf | 5 |
| Mortan | Mortan | 9 | 2.75 |
| Jerry | Jerry | 0.1 | 2 |

sternclean("rickle_and_mortan",
           remove_non_num = TRUE)

| intelligence | evil_rank |
| --- | --- |
| Inf | 5 |
| 9 | 2.75 |
| 0.1 | 2 |
| Inf | NA |

sternclean("rickle_and_mortan",
           remove_all_exc = c("people", "evil_rank"))

| people | evil_rank |
| --- | --- |
| Rickle | 5 |
| Mortan | 2.75 |
| Jerry | 2 |
| Pickle Rickle | NA |
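
As a rough guide to what each removal parameter does, the results above can be reproduced with base R subsetting. This sketch assumes `intelligence` has already been converted to numeric, and it is not taken from sternclean's source.

```r
df <- data.frame(
  people          = c("Rickle", "Mortan", "Jerry", "Pickle Rickle"),
  original_person = c("Rickle", "Mortan", "Jerry", "Rickle"),
  intelligence    = c(Inf, 9, 0.1, Inf),
  evil_rank       = c(5, 2.75, 2, NA),
  stringsAsFactors = FALSE
)

# Like removeby_regex = "pe": drop columns whose names match the pattern
no_pe <- df[, !grepl("pe", names(df)), drop = FALSE]

# Like remove_na_rows = "evil_rank": drop rows with NA in that column
no_na_rank <- df[!is.na(df$evil_rank), ]

# Like remove_all_nas = TRUE: drop rows with NA in any column
complete <- df[complete.cases(df), ]

# Like remove_non_num = TRUE: keep only numeric columns
numeric_only <- df[, vapply(df, is.numeric, logical(1)), drop = FALSE]

# Like remove_all_exc: keep only the listed columns
kept <- df[, c("people", "evil_rank"), drop = FALSE]
```

Note that `Inf` is not `NA`, so rows with infinite `intelligence` survive both NA filters.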

Impute Parameters

sternclean("rickle_and_mortan",
           impute_na2mean = "evil_rank")

| people | original_person | intelligence | evil_rank |
| --- | --- | --- | --- |
| Rickle | Rickle | Inf | 5 |
| Mortan | Mortan | 9 | 2.75 |
| Jerry | Jerry | 0.1 | 2 |
| Pickle Rickle | Rickle | Inf | 3.25 |

sternclean("rickle_and_mortan",
           impute_na_cols = "evil_rank",
           impute_na_with = 1738)

| people | original_person | intelligence | evil_rank |
| --- | --- | --- | --- |
| Rickle | Rickle | Inf | 5 |
| Mortan | Mortan | 9 | 2.75 |
| Jerry | Jerry | 0.1 | 2 |
| Pickle Rickle | Rickle | Inf | 1738 |

sternclean("rickle_and_mortan",
           impute_grpmean = "evil_rank",
           impute_grpwith = "original_person")

| original_person | people | intelligence | evil_rank |
| --- | --- | --- | --- |
| Jerry | Jerry | 0.1 | 2 |
| Mortan | Mortan | 9 | 2.75 |
| Rickle | Rickle | Inf | 5 |
| Rickle | Pickle Rickle | Inf | 5 |

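
The group-mean result can be checked by hand with base R's `ave()`. This is an illustrative sketch only, and unlike the output above it keeps the original row and column order. The only non-missing `evil_rank` in the Rickle group is 5, so that is the value Pickle Rickle receives.

```r
df <- data.frame(
  people          = c("Rickle", "Mortan", "Jerry", "Pickle Rickle"),
  original_person = c("Rickle", "Mortan", "Jerry", "Rickle"),
  evil_rank       = c(5, 2.75, 2, NA),
  stringsAsFactors = FALSE
)

# Mean of evil_rank within each original_person group, ignoring NAs
group_means <- ave(df$evil_rank, df$original_person,
                   FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing entries with their group's mean
missing <- is.na(df$evil_rank)
df$evil_rank[missing] <- group_means[missing]
print(df)
```
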
sternclean("rickle_and_mortan",
           impute_inf_col = "intelligence",
           impute_inf_wit = 1738)

| people | original_person | intelligence | evil_rank |
| --- | --- | --- | --- |
| Rickle | Rickle | 1738 | 5 |
| Mortan | Mortan | 9 | 2.75 |
| Jerry | Jerry | 0.1 | 2 |
| Pickle Rickle | Rickle | 1738 | NA |

sternclean("rickle_and_mortan",
           impute_cust_cl = "evil_rank",
           impute_cust_fn = quantile,
           probs = .25,
           na.rm = TRUE
           )

| people | original_person | intelligence | evil_rank |
| --- | --- | --- | --- |
| Rickle | Rickle | Inf | 5 |
| Mortan | Mortan | 9 | 2.75 |
| Jerry | Jerry | 0.1 | 2 |
| Pickle Rickle | Rickle | Inf | 2.375 |
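
The custom-function example appears to forward extra arguments such as `probs` and `na.rm` to the supplied function. A minimal sketch of that pattern in base R follows; `impute_with` is a hypothetical helper, not a sternclean function.

```r
# Hypothetical helper: replace NAs in x with fn(x, ...)
impute_with <- function(x, fn, ...) {
  x[is.na(x)] <- unname(fn(x, ...))
  x
}

evil_rank <- c(5, 2.75, 2, NA)

# 25th percentile of (2, 2.75, 5) with R's default quantile type is 2.375
by_quantile <- impute_with(evil_rank, quantile, probs = 0.25, na.rm = TRUE)

# The column mean of (5, 2.75, 2) is 3.25
by_mean <- impute_with(evil_rank, mean, na.rm = TRUE)

print(by_quantile)
print(by_mean)
```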

More Complex Example

Here we:

  • change the people column's class to string
  • change the intelligence column's class to numeric
  • remove the original_person column
  • impute the NAs in the evil_rank column with the column's mean
  • impute the infinite values in the intelligence column with 1738

sternclean("rickle_and_mortan",
           class_to_strng = "people",
           class_to_numer = "intelligence",
           remove_columns = "original_person",
           impute_na2mean = "evil_rank",
           impute_inf_col = "intelligence",
           impute_inf_wit = 1738
           )

| people | intelligence | evil_rank |
| --- | --- | --- |
| Rickle | 1738 | 5 |
| Mortan | 9 | 2.75 |
| Jerry | 0.1 | 2 |
| Pickle Rickle | 1738 | 3.25 |

Compared to Original Data Frame

| people | original_person | intelligence | evil_rank |
| --- | --- | --- | --- |
| Rickle | Rickle | Inf | 5 |
| Mortan | Mortan | 9 | 2.75 |
| Jerry | Jerry | 0.1 | 2 |
| Pickle Rickle | Rickle | Inf | NA |