forked from influxdata/telegraf
Commit
feat(outputs.parquet): Introduce Parquet output
fixes: influxdata#14786
Showing 8 changed files with 690 additions and 18 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
//go:build !custom || outputs || outputs.parquet | ||
|
||
package all | ||
|
||
import _ "github.com/influxdata/telegraf/plugins/outputs/parquet" // register plugin |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,106 @@ | ||
# Parquet Output Plugin | ||
|
||
This plugin writes metrics to Parquet files. By default, the plugin groups | ||
metrics by metric name and writes all metrics of a group to the same file. If | ||
a metric's schema does not match the schema of its file, the metric is dropped. | ||
|
||
To learn more about Parquet, check out the [Parquet docs][] as well as a blog | ||
post on [Querying Parquet][]. | ||
|
||
[Parquet docs]: https://parquet.apache.org/docs/ | ||
[Querying Parquet]: https://www.influxdata.com/blog/querying-parquet-millisecond-latency/ | ||
|
||
## Global configuration options <!-- @/docs/includes/plugin_config.md --> | ||
|
||
In addition to the plugin-specific configuration settings, plugins support | ||
additional global and plugin configuration settings. These settings are used to | ||
modify metrics, tags, and fields, or to create aliases and configure ordering, etc. | ||
See [CONFIGURATION.md][CONFIGURATION.md] for more details. | ||
|
||
[CONFIGURATION.md]: ../../../docs/CONFIGURATION.md#plugins | ||
|
||
## Configuration | ||
|
||
```toml @sample.conf | ||
# A plugin that writes metrics to parquet files | ||
[[outputs.parquet]] | ||
## Directory to write parquet files in. If a file already exists the output | ||
## will attempt to continue using the existing file. | ||
# directory = "." | ||
|
||
## Files are rotated after the time interval specified. When set to 0 no time | ||
## based rotation is performed. | ||
# rotation_interval = "0h" | ||
|
||
## Timestamp field name | ||
## Field name to use to store the timestamp. If set to an empty string, then | ||
## the timestamp is omitted. | ||
# timestamp_field_name = "timestamp" | ||
``` | ||
|
||
## Building Parquet Files | ||
|
||
### Schema | ||
|
||
Parquet files require a schema when writing files. To generate a schema, | ||
Telegraf will go through all grouped metrics and generate an Apache Arrow schema | ||
based on the union of all fields and tags. If a field and tag have the same name | ||
then the field takes precedence. | ||
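The union-with-precedence rule described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the plugin's Go implementation: the `Metric` dictionary layout and the `build_schema` helper are hypothetical, and type names stand in for Arrow data types.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for a Telegraf metric: a name plus tag and field maps.
Metric = Dict[str, Any]


def build_schema(metrics: List[Metric]) -> Dict[str, str]:
    """Return a column-name -> type-name mapping covering every tag and field."""
    schema: Dict[str, str] = {}
    # First pass: every tag becomes a string column.
    for m in metrics:
        for key in m.get("tags", {}):
            schema.setdefault(key, "string")
    # Second pass: a field overwrites a tag of the same name.
    for m in metrics:
        for key, value in m.get("fields", {}).items():
            schema[key] = type(value).__name__
    return schema


metrics = [
    {"name": "cpu", "tags": {"host": "a", "usage": "tag"}, "fields": {"usage": 0.5}},
    {"name": "cpu", "tags": {"host": "b"}, "fields": {"idle": 99}},
]
print(build_schema(metrics))
```

Here `usage` exists as both a tag and a field, so the resulting column takes the field's type. The extra passes over all grouped metrics also hint at why the first flush is the slow one.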
|
||
A consequence of schema generation is that the first flush in which a metric | ||
name is seen takes much longer, because the plugin makes an additional pass | ||
over the metrics to generate the schema. Subsequent flushes are significantly | ||
faster. | ||
|
||
When writing to a file, the plugin looks up each schema column in the metric | ||
and adds a null value if the column is not present. Conversely, fields that | ||
first appear after the initial flush are not in the schema and are omitted. | ||
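A short sketch of that lookup, assuming the schema is just an ordered list of column names. The `to_row` helper is hypothetical and written in plain Python for illustration only:

```python
from typing import Any, Dict, List, Optional


def to_row(columns: List[str], values: Dict[str, Any]) -> List[Optional[Any]]:
    """Build one row in schema order; missing columns become None (null)."""
    # Keys in `values` that are not in `columns` are never looked up,
    # so they are silently left out of the row.
    return [values.get(col) for col in columns]


columns = ["host", "usage"]          # schema fixed at the first flush
later = {"host": "a", "temp": 41.5}  # "usage" is missing, "temp" is new
print(to_row(columns, later))
```

Because the row is driven by the schema rather than by the metric, the missing `usage` value becomes a null and the new `temp` field never reaches the file.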
|
||
### Write | ||
|
||
The plugin uses a buffered writer, which may hold some metrics in memory | ||
before writing them to disk. This approach is used because it can write | ||
multiple flushes of metrics more compactly into a single Parquet row group. | ||
|
||
Additionally, the Parquet format requires a proper footer, so close must be | ||
called on the file to ensure it is properly formatted. | ||
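One way to spot a file that was never closed is to check for the Parquet magic bytes. An unencrypted Parquet file starts and ends with the 4-byte marker `PAR1`, and the trailing marker is only written together with the footer. The helper below is a rough heuristic under that assumption; `looks_complete` is a hypothetical name, and passing this check does not prove the file is otherwise valid.

```python
import os

MAGIC = b"PAR1"  # marker at the start and end of an unencrypted Parquet file


def looks_complete(path: str) -> bool:
    """Check that a file starts and ends with the Parquet magic bytes."""
    size = os.path.getsize(path)
    # Smallest plausible layout: header magic + 4-byte footer length + footer magic.
    if size < 2 * len(MAGIC) + 4:
        return False
    with open(path, "rb") as f:
        head = f.read(len(MAGIC))
        f.seek(-len(MAGIC), os.SEEK_END)
        tail = f.read(len(MAGIC))
    return head == MAGIC and tail == MAGIC
```

A file left behind by a process that was killed before closing the writer would typically fail this check, since its footer and trailing marker are missing.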
|
||
## File Rotation | ||
|
||
If a file with the same target name exists at start, the existing file is | ||
rotated to avoid overwriting it or conflicting with its schema. | ||
|
||
File rotation is available via a time-based interval that a user can optionally | ||
set. Because a buffered writer is used, size-based rotation is not possible, | ||
as the file may not actually receive data at each interval. | ||
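The time-based decision can be sketched as a single comparison. This is an illustrative Python helper, not the plugin's code: `should_rotate` and its parameters are hypothetical names, and it mirrors the documented behavior that an interval of 0 disables rotation.

```python
from datetime import datetime, timedelta


def should_rotate(opened_at: datetime, now: datetime, interval: timedelta) -> bool:
    """Return True once `interval` has elapsed since the file was opened."""
    if interval <= timedelta(0):
        return False  # an interval of 0 disables time-based rotation
    return now - opened_at >= interval


opened = datetime(2024, 1, 1, 0, 0)
print(should_rotate(opened, datetime(2024, 1, 1, 0, 30), timedelta(hours=1)))  # False
print(should_rotate(opened, datetime(2024, 1, 1, 1, 0), timedelta(hours=1)))   # True
print(should_rotate(opened, datetime(2024, 1, 2), timedelta(0)))               # False
```

The check depends only on the clock, not on the file, which is why it still works when buffered data has not yet reached disk and the on-disk size says little about how much has been written.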
|
||
## Explore Parquet Files | ||
|
||
If a user wishes to explore a schema or data in a Parquet file quickly, then | ||
look at the options below. | ||
|
||
### CLI | ||
|
||
The Arrow repo contains a Go CLI tool to read and parse Parquet files: | ||
|
||
```sh | ||
go install github.com/apache/arrow/go/v16/parquet/cmd/parquet_reader@latest | ||
parquet_reader <file> | ||
``` | ||
|
||
### Python | ||
|
||
Users can also use the [pyarrow][] library to quickly open and explore Parquet | ||
files: | ||
|
||
```python | ||
import pyarrow.parquet as pq | ||
|
||
table = pq.read_table('example.parquet') | ||
``` | ||
|
||
Once created, a user can look at the various [pyarrow.Table][] functions to | ||
further explore the data. | ||
|
||
[pyarrow]: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html | ||
[pyarrow.Table]: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table |