feat(outputs.parquet): Introduce Parquet output #15602

Merged (4 commits), Jul 25, 2024

@powersj powersj commented Jul 8, 2024

Summary

Introduces a new output that writes metrics in Parquet format. Metrics are grouped by metric name and written to files. Because the schema must be known before writing, the first time we encounter a metric name we generate the schema from that metric, then use the buffered file writer from then on to write files as efficiently as possible.
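As a rough illustration of the "first metric defines the schema" idea, here is a minimal standard-library Python sketch. It is not the plugin's Go code: the `Metric` class, the `TYPE_NAMES` mapping, the `timestamp` column name, and the choice to drop fields that are missing from the first-seen schema are all assumptions made for the example. Tags are always treated as strings, matching the "always use string arrow type for tags" commit in this PR.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Hypothetical mapping from Python value types to column type names.
TYPE_NAMES = {bool: "bool", int: "int64", float: "float64", str: "string"}


@dataclass
class Metric:
    """Simplified stand-in for a Telegraf metric (assumption for this sketch)."""
    name: str
    tags: Dict[str, str]
    fields: Dict[str, Any]
    timestamp: int  # nanoseconds since the epoch


def infer_schema(metric: Metric) -> Dict[str, str]:
    """Build an ordered column-name -> type mapping from a single metric."""
    schema = {"timestamp": "int64"}
    for key in sorted(metric.tags):
        schema[key] = "string"  # tags are always strings
    for key, value in sorted(metric.fields.items()):
        schema[key] = TYPE_NAMES[type(value)]
    return schema


class MetricGrouper:
    """Groups metrics by name; the first metric seen for a name fixes its schema."""

    def __init__(self) -> None:
        self.schemas: Dict[str, Dict[str, str]] = {}
        self.rows: Dict[str, List[list]] = {}

    def add(self, metric: Metric) -> None:
        if metric.name not in self.schemas:
            # First encounter of this metric name: generate the schema once.
            self.schemas[metric.name] = infer_schema(metric)
            self.rows[metric.name] = []
        schema = self.schemas[metric.name]
        values = {"timestamp": metric.timestamp, **metric.tags, **metric.fields}
        # Columns absent from this metric become None; keys that are not in
        # the schema are dropped (a simplification chosen for this sketch).
        self.rows[metric.name].append([values.get(col) for col in schema])
```

In this sketch a later metric with the same name reuses the schema from the first one, so each group always produces rows with a fixed set of columns.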

We must close the file correctly; otherwise it will not be a valid Parquet file. For that reason, I have avoided creating files by template name, because of complications with keeping too many files open. Instead, there is a time-based rotation option that closes the existing file and creates a new one.
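The rotation behavior can be sketched as follows. Parquet keeps its file metadata in a footer that is written when the file is closed, which is why an unclosed file is unusable. This standard-library Python sketch only imitates that with a placeholder footer; the `RotatingWriter` class, its parameters, the file naming, and the injectable `clock` are hypothetical and are not the plugin's actual implementation or configuration.

```python
import os
import time

# Placeholder for the metadata a real Parquet writer emits on close.
FOOTER = b"<footer>"


class RotatingWriter:
    """Keeps at most one file open and rotates it after a fixed interval."""

    def __init__(self, directory, prefix, rotation_interval, clock=time.monotonic):
        self.directory = directory
        self.prefix = prefix
        self.rotation_interval = rotation_interval  # seconds
        self.clock = clock  # injectable so rotation can be tested without sleeping
        self.closed_paths = []
        self._file = None
        self._opened_at = 0.0
        self._sequence = 0

    def _open(self):
        path = os.path.join(self.directory, f"{self.prefix}-{self._sequence}.part")
        self._sequence += 1
        self._file = open(path, "wb")
        self._opened_at = self.clock()

    def close(self):
        """Finish the current file; without this step the footer is missing."""
        if self._file is None:
            return
        self._file.write(FOOTER)
        self._file.close()
        self.closed_paths.append(self._file.name)
        self._file = None

    def write(self, payload: bytes):
        # Rotate first if the open file has exceeded its allowed age.
        if self._file is not None and self.clock() - self._opened_at >= self.rotation_interval:
            self.close()
        if self._file is None:
            self._open()
        self._file.write(payload)
```

Because only one file is open at any time, the number of open file handles stays constant no matter how long the writer runs, which is the trade-off described above compared with opening one file per template name.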

Check out this blog post for an overview of Parquet files: https://www.influxdata.com/blog/how-good-parquet-wide-tables/

Checklist

  • No AI generated code was used in this PR

Related issues

fixes: #14786

@telegraf-tiger telegraf-tiger bot added the feat and plugin/output labels Jul 8, 2024
@powersj powersj self-assigned this Jul 8, 2024
@powersj powersj force-pushed the feat/parquet-serializer branch 2 times, most recently from a2a428d to a0c7b38 Compare July 18, 2024 17:27
@powersj powersj added the ready for final review label Jul 18, 2024
@powersj powersj assigned srebhan and DStrand1 and unassigned powersj Jul 18, 2024
@powersj powersj marked this pull request as ready for review July 18, 2024 18:33
@DStrand1 DStrand1 left a comment

Preliminary review on the readme, looking over the code now!

plugins/outputs/parquet/README.md (outdated)
plugins/outputs/parquet/README.md (outdated)
@DStrand1 DStrand1 removed their assignment Jul 19, 2024
@srebhan srebhan left a comment

The code looks really good @powersj! Just some small comments from my side...

plugins/outputs/parquet/README.md (outdated)
plugins/outputs/parquet/parquet.go (outdated)
plugins/outputs/parquet/parquet.go (outdated)
powersj added 3 commits July 23, 2024 07:11
* extra word typo in readme
* initialize timestamp field in init(), update tests
* always use string arrow type for tags
@powersj powersj force-pushed the feat/parquet-serializer branch from 62bc2e1 to 98399b6 Compare July 23, 2024 13:11
@telegraf-tiger

Download PR build artifacts for linux_amd64.tar.gz, darwin_arm64.tar.gz, and windows_amd64.zip.
Downloads for additional architectures and packages are available below.

⚠️ This pull request increases the Telegraf binary size by 4.12 % for linux amd64 (new size: 262.5 MB, nightly size 252.1 MB)

Additional PR build artifacts

Artifact URLs

| DEB | RPM | TAR GZ | ZIP |
| --- | --- | --- | --- |
| amd64.deb | aarch64.rpm | darwin_amd64.tar.gz | windows_amd64.zip |
| arm64.deb | armel.rpm | darwin_arm64.tar.gz | windows_arm64.zip |
| armel.deb | armv6hl.rpm | freebsd_amd64.tar.gz | windows_i386.zip |
| armhf.deb | i386.rpm | freebsd_armv7.tar.gz | |
| i386.deb | ppc64le.rpm | freebsd_i386.tar.gz | |
| mips.deb | riscv64.rpm | linux_amd64.tar.gz | |
| mipsel.deb | s390x.rpm | linux_arm64.tar.gz | |
| ppc64el.deb | x86_64.rpm | linux_armel.tar.gz | |
| riscv64.deb | | linux_armhf.tar.gz | |
| s390x.deb | | linux_i386.tar.gz | |
| | | linux_mips.tar.gz | |
| | | linux_mipsel.tar.gz | |
| | | linux_ppc64le.tar.gz | |
| | | linux_riscv64.tar.gz | |
| | | linux_s390x.tar.gz | |

@powersj powersj requested a review from srebhan July 24, 2024 21:25
@srebhan srebhan left a comment

Thanks @powersj! Code looks fantastic!

@srebhan srebhan merged commit a3eda34 into influxdata:master Jul 25, 2024
27 checks passed
@github-actions github-actions bot added this to the v1.32.0 milestone Jul 25, 2024
Development

Successfully merging this pull request may close these issues.

Parquet Serializer
3 participants