[C++] Does the dataset API support compression & appending to existing parquet files? #36834
Just to answer the second question, about compression:

```cpp
class ARROW_DS_EXPORT ParquetFileWriteOptions : public FileWriteOptions {
 public:
  /// \brief Parquet writer properties.
  std::shared_ptr<parquet::WriterProperties> writer_properties;

  /// \brief Parquet Arrow writer properties.
  std::shared_ptr<parquet::ArrowWriterProperties> arrow_writer_properties;

 protected:
  explicit ParquetFileWriteOptions(std::shared_ptr<FileFormat> format)
      : FileWriteOptions(std::move(format)) {}

  friend class ParquetFileFormat;
};
```

Maybe you can configure compression in `writer_properties`.
Much appreciated!

```cpp
ds::FileSystemDatasetWriteOptions write_options;
auto format = std::make_shared<ds::ParquetFileFormat>();
auto pq_options = std::dynamic_pointer_cast<arrow::dataset::ParquetFileWriteOptions>(
    format->DefaultWriteOptions());
// Configure Snappy compression through the Parquet writer properties.
pq_options->writer_properties = parquet::WriterProperties::Builder()
                                    .created_by("1.0")
                                    ->compression(arrow::Compression::SNAPPY)
                                    ->build();
write_options.file_write_options = pq_options;
write_options.filesystem = filesystem;
write_options.base_dir = base_dir;
write_options.partitioning = partitioning;
write_options.basename_template = "part{i}.parquet";
```

Still don't know if Q1 is possible.
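(`FileFormat::DefaultWriteOptions()` returns the options as the base `FileWriteOptions` type, which is why the `dynamic_pointer_cast` to `ParquetFileWriteOptions` is needed before setting `writer_properties`.)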
As for Q1, I guess you can:

But I don't know if there is a more convenient way to solve it.
This would be the easiest way, I think. It will require you to create a plan (scan -> write). Otherwise you could use the dataset writer directly, but partitioning happens as part of the write node, not the dataset writer, so you would need to implement that yourself as well. Much better to create a query plan combining the scan and the write, I think. I will see if we have any examples.
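A minimal sketch of that scan -> write plan, assuming `dataset` is an already-constructed `std::shared_ptr<ds::Dataset>` (for example an `InMemoryDataset` holding the buffered batches) and reusing the `write_options` from the earlier snippet:

```cpp
#include <arrow/dataset/api.h>
#include <arrow/result.h>

namespace ds = arrow::dataset;

arrow::Status ScanAndWrite(const std::shared_ptr<ds::Dataset>& dataset,
                           const ds::FileSystemDatasetWriteOptions& write_options) {
  // Build a scanner over the source dataset; this is the "scan" half of the plan.
  ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
  ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());
  // FileSystemDataset::Write consumes the scanner and applies
  // write_options.partitioning while writing, i.e. the "write" half.
  return ds::FileSystemDataset::Write(write_options, scanner);
}
```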
Describe the usage question you have. Please include as many useful details as possible.
I am streaming some time series data to parquet files with the following aspects:
I have been using parquet::arrow::FileWriter::WriteRecordBatch, which can append batches to a Parquet file but doesn't support partitioning.
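A condensed sketch of that approach, assuming a single open output stream `sink` and no partitioning (the function and parameter names here are illustrative):

```cpp
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>

arrow::Status AppendBatches(
    const std::shared_ptr<arrow::Schema>& schema,
    const std::vector<std::shared_ptr<arrow::RecordBatch>>& batches,
    const std::shared_ptr<arrow::io::OutputStream>& sink) {
  // Open one writer on the sink; batches are appended to this open file.
  ARROW_ASSIGN_OR_RAISE(
      auto writer, parquet::arrow::FileWriter::Open(
                       *schema, arrow::default_memory_pool(), sink));
  for (const auto& batch : batches) {
    // Appends the batch's rows to the file being written; no partitioning.
    ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
  }
  return writer->Close();
}
```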
So I tried the Dataset API. It seems that ExistingDataBehavior can only overwrite or delete existing files, not append to them. If appending isn't possible, I have to wait until I have a full partition in memory before flushing it to a file, which uses more memory.
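For reference, the behaviors selectable via `ExistingDataBehavior` on `FileSystemDatasetWriteOptions` (none of them appends):

```cpp
// The possible ExistingDataBehavior values; none of them appends to
// files already present under base_dir.
write_options.existing_data_behavior = ds::ExistingDataBehavior::kError;
// ds::ExistingDataBehavior::kOverwriteOrIgnore:
//     overwrite files on a name collision, leave other existing files alone.
// ds::ExistingDataBehavior::kDeleteMatchingPartitions:
//     delete partition directories that will receive new data before writing.
// ds::ExistingDataBehavior::kError:
//     raise an error if the destination already contains data.
```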
Please advise. Thanks.
Component(s)
C++