The R package fsttable aims to provide a fully functional data.table interface to on-disk fst files. The focus of the package is on keeping memory usage as low as possible without sacrificing features of in-memory data.table operations.
You can install the latest package version with:
devtools::install_github("fstpackage/fsttable")
First, we create an on-disk fst file containing a medium-sized dataset:
library(fsttable)
# write some sample data to disk
nr_of_rows <- 1e6
x <- data.table::data.table(X = 1:nr_of_rows, Y = LETTERS[1 + (1:nr_of_rows) %% 26])
fst::write_fst(x, "1.fst")
Then we define our fst_table by using:
ft <- fst_table("1.fst")
This fst_table can be used as a regular data.table object. For example, we can print:
ft
#> <fst file>
#> 1e+06 rows, 2 columns
#>
#> X Y
#> <int> <chr>
#> 1 1 B
#> 2 2 C
#> 3 3 D
#> 4 4 E
#> 5 5 F
#> -- -- --
#> 999996 999996 K
#> 999997 999997 L
#> 999998 999998 M
#> 999999 999999 N
#> 1000000 1000000 O
We can select columns:
ft[, .(Y)]
#> <fst file>
#> 1e+06 rows, 1 columns
#>
#> Y
#> <chr>
#> 1 B
#> 2 C
#> 3 D
#> 4 E
#> 5 F
#> -- --
#> 999996 K
#> 999997 L
#> 999998 M
#> 999999 N
#> 1000000 O
We can select rows:
ft[1:4,]
#> <fst file>
#> 4 rows, 2 columns
#>
#> X Y
#> <int> <chr>
#> 1 1 B
#> 2 2 C
#> 3 3 D
#> 4 4 E
Or both at the same time:
ft[1:4, .(X)]
#> <fst file>
#> 4 rows, 1 columns
#>
#> X
#> <int>
#> 1 1
#> 2 2
#> 3 3
#> 4 4
During the operations shown above, the actual data was never fully loaded from the file. That’s because of fsttable’s philosophy of keeping RAM usage as low as possible. Printing a few lines of a table doesn’t require knowledge of the remaining lines, so fsttable will never actually load them.
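As a rough illustration of why such partial access is cheap, the fst package itself can read a file's metadata and a small slice of its rows and columns without loading the rest. The sketch below calls fst directly on a small temporary file. The path and sample data are made up for the example, and it is only an assumption that fsttable relies on similar calls internally.

```r
library(fst)

# write a small sample file to a temporary path (illustration only)
path <- tempfile(fileext = ".fst")
df <- data.frame(
  X = 1:1000,
  Y = LETTERS[1 + (1:1000) %% 26],
  stringsAsFactors = FALSE
)
write_fst(df, path)

# the metadata holds the row count and column names;
# no column data is read here
meta <- metadata_fst(path)
meta$nrOfRows
meta$columnNames

# read only rows 1 to 4 of column X; the other rows and
# column Y are not loaded into memory
part <- read_fst(path, columns = "X", from = 1, to = 4)
part
```

Only the requested slice ends up in `part`, which is the same idea that lets a table be printed or subset without reading the whole file.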
The same holds when you create a new subset:
ft2 <- ft[1:4, .(X)]
No actual data is loaded into RAM. The copy still uses the original fst file and keeps the data on disk:
# small size because actual data is still on disk
object.size(ft2)
#> 5808 bytes
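For contrast, the same sample data held fully in memory takes far more space than the few kilobytes reported above. A minimal sketch using base R only, which rebuilds the data as a plain data.frame rather than a data.table so that it runs on its own (the exact size depends on platform and R version, so no output is shown):

```r
# rebuild the sample data in memory (same columns as the file written earlier)
nr_of_rows <- 1e6
x_mem <- data.frame(
  X = 1:nr_of_rows,
  Y = LETTERS[1 + (1:nr_of_rows) %% 26],
  stringsAsFactors = FALSE
)

# size of the full in-memory copy, in megabytes
print(object.size(x_mem), units = "MB")
```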