R/lgb.convert_with_rules.R
lgb.convert_with_rules.Rd
Attempts to prepare a clean dataset to put in an lgb.Dataset.
Factor and character columns are converted to integer.
In addition, the rules created during conversion are kept, so you can convert other datasets with the same mapping.
This is useful if you specifically need an integer dataset instead of a numeric dataset.
NOTE: In previous releases of LightGBM, this function was called lgb.prepare_rules2.
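The idea of keeping conversion rules can be illustrated with a small base R sketch. It is not the LightGBM implementation: the helpers `make_rules` and `apply_rules` are hypothetical names introduced here for illustration only, and they assume that a rule is a named integer vector mapping each level to a code, with unseen values mapped to 0 (the behaviour the examples on this page describe).

```r
# Hypothetical helpers for illustration; they are NOT part of lightgbm.

# Build one rule per factor/character column: a named integer vector
# mapping each level (name) to an integer code (value).
make_rules <- function(data) {
  rules <- list()
  for (col in names(data)) {
    x <- data[[col]]
    if (is.character(x)) {
      x <- factor(x)
    }
    if (is.factor(x)) {
      lv <- levels(x)
      rules[[col]] <- stats::setNames(seq_along(lv), lv)
    }
  }
  rules
}

# Apply existing rules to another dataset.
# Values that have no rule (unknown levels, NA) become 0.
apply_rules <- function(data, rules) {
  for (col in names(rules)) {
    mapped <- unname(rules[[col]][as.character(data[[col]])])
    mapped[is.na(mapped)] <- 0L
    data[[col]] <- as.integer(mapped)
  }
  data
}

train <- data.frame(
  x = c(1.5, 2.5, 3.5),
  colour = c("red", "green", "red"),
  stringsAsFactors = FALSE
)
rules <- make_rules(train)

# "blue" was never seen when the rules were created
new_data <- data.frame(
  x = c(4.5, 5.5),
  colour = c("green", "blue"),
  stringsAsFactors = FALSE
)
converted <- apply_rules(new_data, rules)
```

Because the rules are stored separately from the data, the same mapping can be reused on any later dataset, which is what keeps the integer codes consistent between training and prediction data.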
lgb.convert_with_rules(data, rules = NULL)
Argument | Description |
---|---|
data | A data.frame or data.table to prepare. |
rules | A set of rules from the data preparator, if already used. |
A list with the cleaned dataset (data) and the rules (rules).
The data must be converted to a matrix format (as.matrix) for input in lgb.Dataset.
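The matrix conversion mentioned above might look like the following sketch. It assumes the lightgbm package is installed. The choice of Sepal.Length as the label and the use of the `categorical_feature` argument of `lgb.Dataset` are illustrative assumptions, not something this page prescribes.

```r
library(lightgbm)

data(iris)

# Convert the factor column (Species) to integer codes and keep the rules
converted <- lgb.convert_with_rules(data = iris)

# Illustrative setup: predict Sepal.Length from the remaining columns
feature_names <- c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")

# lgb.Dataset expects a matrix, not a data.frame
features <- as.matrix(converted$data[, feature_names])
label <- as.numeric(converted$data$Sepal.Length)

dtrain <- lgb.Dataset(
  data = features
  , label = label
  , categorical_feature = "Species" # assumption: treat the integer codes as categories
)
```

Keeping `converted$rules` around lets you convert a validation or prediction set with the same integer codes before calling `as.matrix` on it.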
data(iris)

str(iris)
#> 'data.frame':	150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

new_iris <- lgb.convert_with_rules(data = iris)
str(new_iris$data)
#> 'data.frame':	150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : int  1 1 1 1 1 1 1 1 1 1 ...

data(iris) # Erase iris dataset
iris$Species[1L] <- "NEW FACTOR" # Introduce junk factor (NA)
#> Warning: invalid factor level, NA generated

# Use conversion using known rules
# Unknown factors become 0, excellent for sparse datasets
newer_iris <- lgb.convert_with_rules(data = iris, rules = new_iris$rules)

# Unknown factor is now zero, perfect for sparse datasets
newer_iris$data[1L, ] # Species became 0 as it is an unknown factor
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2       0

newer_iris$data[1L, 5L] <- 1.0 # Put back real initial value

# Is the newly created dataset equal? YES!
all.equal(new_iris$data, newer_iris$data)
#> [1] TRUE

# Can we test our own rules?
data(iris) # Erase iris dataset

# We remapped values differently
personal_rules <- list(
  Species = c(
    "setosa" = 3L
    , "versicolor" = 2L
    , "virginica" = 1L
  )
)
newest_iris <- lgb.convert_with_rules(data = iris, rules = personal_rules)
str(newest_iris$data) # SUCCESS!
#> 'data.frame':	150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : int  0 3 3 3 3 3 3 3 3 3 ...