ARROW-8218: [C++] Decompress record batch messages in parallel at field level. Only allow LZ4_FRAME, ZSTD compression #6777

Closed
wants to merge 4 commits into from


wesm (Member) commented Mar 31, 2020

I also changed the metadata key to "ARROW:experimental_compression", if anyone has opinions.

Haven't run benchmarks but will do so tomorrow.

pitrou (Member) left a comment

Some comments. Also, I suppose you'll tackle the compression path at some point too?

cpp/src/arrow/ipc/reader.cc (resolved)
cpp/src/arrow/ipc/reader.cc (outdated, resolved)
cpp/src/arrow/ipc/options.h (resolved)

wesm (Member, Author) commented Mar 31, 2020

I addressed the comments and also parallelized the compression path. Would someone take another look at these new changes?

Comment on lines +203 to +210
if (options_.use_threads) {
  return ::arrow::internal::ParallelFor(static_cast<int>(out_->body_buffers.size()),
                                        CompressOne);
} else {
  for (size_t i = 0; i < out_->body_buffers.size(); ++i) {
    RETURN_NOT_OK(CompressOne(i));
  }
  return Status::OK();
Member Author

This "optional parallelism" pattern occurs frequently; I'll open a JIRA about factoring it out into a helper function.

https://issues.apache.org/jira/browse/ARROW-8299
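
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of what such a helper could look like. It is not the helper proposed in ARROW-8299 and does not use Arrow's API: `Status` is a hypothetical stand-in for `arrow::Status`, `OptionalParallelFor` is an illustrative name, and `std::async` (one thread per task) stands in for Arrow's `ParallelFor`.

```cpp
#include <functional>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for arrow::Status so the sketch compiles on its own.
struct Status {
  bool is_ok = true;
  std::string message;
  bool ok() const { return is_ok; }
  static Status OK() { return Status(); }
  static Status Error(const std::string& msg) {
    Status st;
    st.is_ok = false;
    st.message = msg;
    return st;
  }
};

// Hypothetical helper: call func(0), ..., func(num_tasks - 1), either
// concurrently or serially depending on use_threads.
Status OptionalParallelFor(bool use_threads, int num_tasks,
                           const std::function<Status(int)>& func) {
  if (use_threads) {
    // Assumption: one std::async task per index, purely for illustration.
    std::vector<std::future<Status>> futures;
    futures.reserve(num_tasks);
    for (int i = 0; i < num_tasks; ++i) {
      futures.push_back(std::async(std::launch::async, func, i));
    }
    // Wait for every task; report the first failure in index order.
    Status first_error = Status::OK();
    for (auto& fut : futures) {
      Status st = fut.get();
      if (first_error.ok() && !st.ok()) first_error = st;
    }
    return first_error;
  }
  // Serial path: stop at the first failure.
  for (int i = 0; i < num_tasks; ++i) {
    Status st = func(i);
    if (!st.ok()) return st;
  }
  return Status::OK();
}
```

With a helper of this shape, the call site above would reduce to a single call that passes `options_.use_threads`, the buffer count, and `CompressOne`.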

bkietz (Member) left a comment

LGTM, just one style question

Comment on lines 345 to 346
Status DecompressBuffers(const std::vector<std::shared_ptr<ArrayData>>& fields,
                         Compression::type compression, const IpcReadOptions& options) {
Member

Since you're mutating the contents of fields, should this be

Suggested change
- Status DecompressBuffers(const std::vector<std::shared_ptr<ArrayData>>& fields,
-                          Compression::type compression, const IpcReadOptions& options) {
+ Status DecompressBuffers(Compression::type compression, const IpcReadOptions& options,
+                          std::vector<std::shared_ptr<ArrayData>>* fields) {

?

Member Author

This is a somewhat unusual case, since the vector itself is not mutated. When I see std::vector<T>* in a function signature, that suggests to me that the vector itself is modified. Thoughts?

Member

It's true; the vector itself isn't mutated and const correctness isn't broken here. I was only thinking of trying to communicate mutation to a future reader. Unfortunately all we have is the ampersand and comments, neither of which is ideal here (... unless we start tossing std::vector<const T> around everywhere we want to be clear that the elements are immutable)
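
To make the point in this thread concrete, the sketch below shows why a `const std::vector<std::shared_ptr<T>>&` parameter still lets the callee mutate the pointed-to objects: the constness of `shared_ptr` is shallow. `FieldData`, `MarkDecompressed`, and `AllDecompressed` are hypothetical names used only for illustration (with `FieldData` standing in for `ArrayData`); the `shared_ptr<const T>` variant is one possible way to express read-only elements, not something proposed in this PR.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for arrow::ArrayData.
struct FieldData {
  bool decompressed = false;
};

// The vector is const: elements cannot be added, removed, or re-pointed.
// The objects the shared_ptrs point to are still mutable, because
// operator-> on a const std::shared_ptr<T> returns a plain T*.
void MarkDecompressed(const std::vector<std::shared_ptr<FieldData>>& fields) {
  for (const auto& field : fields) {
    field->decompressed = true;  // compiles: mutates the pointee
    // fields.push_back(field);  // would not compile: the vector is const
  }
}

// Read-only elements need shared_ptr<const T> as the element type.
bool AllDecompressed(const std::vector<std::shared_ptr<const FieldData>>& fields) {
  for (const auto& field : fields) {
    // field->decompressed = true;  // would not compile: the pointee is const
    if (!field->decompressed) return false;
  }
  return true;
}
```

So the existing signature is const-correct as far as the compiler is concerned, and whether the elements are mutated is visible only from comments or from the element type.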

wesm (Member, Author) commented Mar 31, 2020

Flaky Thrift download again

FAILED: thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-download 
cd /build/cpp/thrift_ep-prefix/src && /usr/bin/cmake -P /build/cpp/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-download-DEBUG.cmake && /usr/bin/cmake -E touch /build/cpp/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-download
CMake Error at thrift_ep-stamp/thrift_ep-download-DEBUG.cmake:49 (message):
  Command failed: 1

   '/usr/bin/cmake' '-Dmake=' '-Dconfig=' '-P' '/build/cpp/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-download-DEBUG-impl.cmake'

  See also

    /build/cpp/thrift_ep-prefix/src/thrift_ep-stamp/thrift_ep-download-*.log

wesm (Member, Author) commented Mar 31, 2020

Googletest download flaked in Appveyor:

FAILED: googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-download 
cmd.exe /C "cd /D C:\projects\arrow\cpp\build\googletest_ep-prefix\src && C:\Miniconda37-x64\envs\arrow\Library\bin\cmake.exe -P C:/projects/arrow/cpp/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-download-RELEASE.cmake && C:\Miniconda37-x64\envs\arrow\Library\bin\cmake.exe -E touch C:/projects/arrow/cpp/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-download"
CMake Error at googletest_ep-stamp/googletest_ep-download-RELEASE.cmake:49 (message):
  Command failed: 1
   'C:/Miniconda37-x64/envs/arrow/Library/bin/cmake.exe' '-Dmake=' '-Dconfig=' '-P' 'C:/projects/arrow/cpp/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-download-RELEASE-impl.cmake'
  See also
    C:/projects/arrow/cpp/build/googletest_ep-prefix/src/googletest_ep-stamp/googletest_ep-download-*.log
[10/269] Performing configure step for 'mimalloc_ep'

It passed on my fork though.

+1. Merging this

wesm closed this in 087464c on Mar 31, 2020
wesm deleted the ARROW-8218 branch on March 31, 2020 21:59
wesm pushed a commit that referenced this pull request May 9, 2020
…eather V2

This PR always puts the compressed size in little-endian format for Feather V2 since the reader expected the little-endian format.

Based on [the discussion](#6777 (comment)) at #6777, [this commit](aa28280) reads compressed_length in Feather V2 format as little-endian. However, the writer [puts compressed_length in native-endian](https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L177).

This PR can fix failures related to reading compressed feather format in `arrow-ipc-read-write-test` and `arrow-feather-test`.

Closes #7137 from kiszk/ARROW-8747

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Wes McKinney <wesm+git@apache.org>
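
As an illustration of the fix described in this commit message, the following sketch writes and reads a length prefix in little-endian byte order regardless of the host's endianness. It is not the code from the referenced commit: `WriteInt64LE` and `ReadInt64LE` are hypothetical helper names, and a 64-bit width for the length prefix is an assumption made here.

```cpp
#include <cstdint>

// Hypothetical helper: store value into out[0..7], least significant byte
// first. Shifting the value (instead of copying its in-memory bytes) gives
// the same output on little-endian and big-endian hosts.
void WriteInt64LE(int64_t value, uint8_t* out) {
  const uint64_t bits = static_cast<uint64_t>(value);
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<uint8_t>(bits >> (8 * i));
  }
}

// Hypothetical helper: inverse of WriteInt64LE.
int64_t ReadInt64LE(const uint8_t* in) {
  uint64_t bits = 0;
  for (int i = 0; i < 8; ++i) {
    bits |= static_cast<uint64_t>(in[i]) << (8 * i);
  }
  return static_cast<int64_t>(bits);
}
```

A writer that copies the integer's in-memory bytes directly produces this layout only on little-endian machines, which is the kind of reader/writer mismatch the commit message describes.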
kou pushed a commit that referenced this pull request May 15, 2020
This PR writes and reads the Plasma header (version, type, and length) in the big-endian format, which makes the header easy to interpret across machines of different endianness.

Currently, the Plasma header is written in native-endian format [here](https://github.com/apache/arrow/blob/master/cpp/src/plasma/io.cc#L65-L71), so the version, type, and length of a given Plasma file cannot be determined reliably across platforms. Feather V2 also uses little-endian for the header based on [the discussion](#6777 (comment)). This PR uses little-endian by following this discussion.

Closes #7146 from kiszk/ARROW-8757

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
3 participants