-
Notifications
You must be signed in to change notification settings - Fork 41
/
README.Rmd
187 lines (137 loc) · 9.08 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
---
output:
github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
```
<img src="man/figures/fst.png" align="right" height="196" width="196" />
[![Build Status](https://github.com/fstpackage/fst/actions/workflows/R-CMD-check.yaml/badge.svg?branch=develop)](https://github.com/fstpackage/fst/actions/workflows/R-CMD-check.yaml)
[![License: AGPL v3](https://img.shields.io/badge/License-AGPL%20v3-blue.svg)](https://www.gnu.org/licenses/agpl-3.0)
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/fst)](https://cran.r-project.org/package=fst)
[![fastverse status badge](https://fastverse.r-universe.dev/badges/fst)](https://fastverse.r-universe.dev/ui#package:fastverse)
[![codecov](https://codecov.io/gh/fstpackage/fst/branch/develop/graph/badge.svg)](https://app.codecov.io/gh/fstpackage/fst)
[![downloads](https://cranlogs.r-pkg.org/badges/fst)](http://cran.rstudio.com/web/packages/fst/index.html)
[![total_downloads](https://cranlogs.r-pkg.org/badges/grand-total/fst)](http://cran.rstudio.com/web/packages/fst/index.html)
## Overview
The [_fst_ package][fstRepo] for R provides a fast, easy and flexible way to serialize data frames. With access speeds of multiple GB/s, _fst_ is specifically designed to unlock the potential of high speed solid state disks that can be found in most modern computers. Data frames stored in the _fst_ format have full random access, both in column and rows.
The figure below compares the read and write performance of the _fst_ package to various alternatives.
```{r speedTable, results = 'asis', message = FALSE, echo = FALSE}
require(ggplot2)
require(data.table)
require(fst)
require(knitr)
speed_results <- read_fst("res_readme.fst", as.data.table = TRUE)
speeds <- speed_results[NrOfRows == 5e7, list(
`Time (ms)` = as.integer(median(Time)),
`Size (MB)` = format(median(FileSize), digits = 2),
`Speed (MB/s)` = as.integer(median(Speed)),
N = .N),
by = "Package,Mode"]
speeds[Package == "data.table" & Mode == "Write", Method := "fwrite"]
speeds[Package == "data.table" & Mode == "Read", Method := "fread"]
speeds[Package == "fst" & Mode == "Write", Method := "write_fst"]
speeds[Package == "fst" & Mode == "Read", Method := "read_fst"]
speeds[Package == "baseR" & Mode == "Write", Method := "saveRDS"]
speeds[Package == "baseR" & Mode == "Read", Method := "readRDS"]
speeds[Package == "feather" & Mode == "Write", Method := "write_feather"]
speeds[Package == "feather" & Mode == "Read", Method := "read_feather"]
speeds[, Format := "bin"]
speeds[Package == "data.table", Format := "csv"]
setkey(speeds, Package, Mode)
display_speeds <- copy(speeds)
for (col in colnames(display_speeds)[2:ncol(display_speeds)]) {
setnames(display_speeds, col, "CurCol")
display_speeds[, CurCol := as.character(CurCol)]
display_speeds[Package == "fst", CurCol := paste0("**", CurCol, "**")]
setnames(display_speeds, "CurCol", col)
}
kable(display_speeds[, c(7, 8, 3, 4, 5, 6)])
```
These benchmarks were performed on a laptop (i7 4710HQ @2.5 GHz) with a reasonably fast SSD (M.2 Samsung SM951) using the dataset defined below. Parameter *Speed* was calculated by dividing the in-memory size of the data frame by the measured time. These results are also visualized in the following graph:
```{r speed-bench, echo = FALSE, message = FALSE, results = 'hide', fig.width = 8.5, fig.height = 6}
ggplot(speed_results[NrOfRows == 5e7, .(Speed = median(Speed)), by = "Package,Mode"]) +
geom_bar(aes(Mode, Speed, fill = Mode), colour = "darkgrey", stat = "identity") +
facet_wrap(~ Package, 1) +
ylim(0, NA) +
ylab("Read or Write speed (MB/s)") +
xlab("") +
theme(legend.position="none")
```
As can be seen from the figure, the measured speeds for the _fst_ package are very high and even top the maximum drive speed of the SSD used. The package accomplishes this by an effective combination of multi-threading and compression. The on-disk file sizes of _fst_ files are also much smaller than that of the other formats tested. This is an added benefit of _fst_'s use of type-specific compressors on each stored column.
In addition to methods for data frame serialization, _fst_ also provides methods for multi-threaded in-memory compression with the popular LZ4 and ZSTD compressors and an extremely fast multi-threaded hasher.
## Multi-threading
The _fst_ package relies heavily on multi-threading to boost the read- and write speed of data frames. To maximize throughput, _fst_ compresses and decompresses data _in the background_ and tries to keep the disk busy writing and reading data at the same time.
## Installation
The easiest way to install the package is from CRAN:
```{r, eval = FALSE}
install.packages("fst")
```
You can also use the development version from GitHub:
```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("fstpackage/fst", ref = "develop")
```
## Basic usage
```{r, results = 'hide', echo = FALSE, message = FALSE}
require(fst)
```
Using _fst_ is simple. Data can be stored and retrieved using methods _write\_fst_ and _read\_fst_:
```{r}
# Generate some random data frame with 10 million rows and various column types
nr_of_rows <- 1e7
df <- data.frame(
Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),
Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
)
# Store the data frame to disk
write_fst(df, "dataset.fst")
# Retrieve the data frame again
df <- read_fst("dataset.fst")
```
_Note: the dataset defined in this example code was also used to obtain the benchmark results shown in the introduction._
## Random access
The _fst_ file format provides full random access to stored datasets. You can retrieve a selection of columns and rows with:
```{r, results = 'hide'}
df_subset <- read_fst("dataset.fst", c("Logical", "Factor"), from = 2000, to = 5000)
```
This reads rows 2000 to 5000 from columns _Logical_ and _Factor_ without actually touching any other data in the stored file. That means that a subset can be read from file **without reading the complete file first**. This is different from, say, _readRDS_ or _read\_feather_ where you have to read the complete file or column before you can make a subset.
## Compression
For compression the excellent and speedy [LZ4][lz4Repo] and [ZSTD][zstdRepo] compression algorithms are used. These compressors (in combination with type-specific bit filters), enable _fst_ to achieve high compression speeds at reasonable compression factors. The compression factor can be tuned from 0 (minimum) to 100 (maximum):
```{r, results = 'hide'}
write_fst(df, "dataset.fst", 100) # use maximum compression
```
Compression reduces the size of the _fst_ file that holds your data. But because the (de-)compression is done _on background threads_, it can increase the total read- and write speed as well. The graph below shows how the use of multiple threads enhances the read and write speed of our sample dataset.
```{r multi-threading, fig.width = 10, fig.height = 8, echo = FALSE}
speed_results[Package %in% c("baseR", "feather"), Threads := 1]
speed_results[, Threads := as.factor(Threads)]
speed_results[, Speed := 0.001 * Speed]
speed_results[, NrOfRows := ifelse(NrOfRows == 1e7, "1 million rows", "10 million rows")]
fig_results <- speed_results[, .(Speed = median(Speed)), by = "Package,Mode,NrOfRows,Threads"]
ggplot(fig_results) +
geom_bar(aes(Package, Speed, fill = Threads),
stat = "identity", position = "dodge", width = 0.8) +
facet_grid(Mode ~ NrOfRows, scales = "free_y") +
scale_fill_manual(values = colorRampPalette(c("#FBE3A5", "#433B54"))(10)[10:3]) +
ylab("Read or write speed in GB/s") +
theme_minimal()
```
The _csv_ format used by the _fread_ and _fwrite_ methods of package _data.table_ is actually a human-readable text format and not a binary format. Normally, binary formats would be much faster than the _csv_ format, because _csv_ takes more space on disk, is row based, uncompressed and needs to be parsed into a computer-native format to have any meaning. So any serializer that's working on _csv_ has an enormous disadvantage as compared to binary formats. Yet, the results show that _data.table_ is on par with binary formats and when more threads are used, it can even be faster. Because of this impressive performance, it was included in the graph for comparison.
## Bindings in other languages
**Julia**: [**`FstFileFormat.jl`**][fstformatRepo] A naive Julia binding using RCall.jl
> **Note to users**: From CRAN release v0.8.0, the _fst_ format is stable and backwards compatible. That means that all _fst_ files generated with package v0.8.0 or later can be read by future versions of the package.
```{r, echo = FALSE, results = 'hide'}
# cleanup
file.remove("dataset.fst")
```
[fstRepo]: https://github.com/fstpackage/fst
[lz4Repo]: https://github.com/lz4/lz4
[zstdRepo]: https://github.com/facebook/zstd
[fstformatRepo]: https://github.com/xiaodaigh/FstFileFormat.jl