ARROW-8218: [C++] Decompress record batch messages in parallel at field level. Only allow LZ4_FRAME, ZSTD compression #6777
Conversation
Some comments. Also, I suppose you'll tackle the compression path at some point too?
I addressed the comments and also parallelized the compression path. Would someone take another look at these new changes?
```cpp
if (options_.use_threads) {
  return ::arrow::internal::ParallelFor(static_cast<int>(out_->body_buffers.size()),
                                        CompressOne);
} else {
  for (size_t i = 0; i < out_->body_buffers.size(); ++i) {
    RETURN_NOT_OK(CompressOne(i));
  }
  return Status::OK();
}
```
This "optional parallelism" pattern occurs frequently, I'll open a JIRA about factoring it out into a helper function.
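Such a helper could look roughly like this. This is a minimal self-contained sketch, not Arrow's actual API: `MaybeParallelFor` and the stripped-down `Status` are hypothetical stand-ins for `arrow::internal::ParallelFor` and `arrow::Status`.

```cpp
#include <functional>
#include <future>
#include <vector>

// Minimal stand-in for arrow::Status, just enough for the sketch.
struct Status {
  bool ok_ = true;
  static Status OK() { return Status{}; }
  static Status Error() {
    Status s;
    s.ok_ = false;
    return s;
  }
  bool ok() const { return ok_; }
};

// Hypothetical helper factoring out the "optional parallelism" pattern:
// run task(i) for i in [0, num_tasks), in parallel when requested,
// serially otherwise. The first non-OK status wins.
Status MaybeParallelFor(bool use_threads, int num_tasks,
                        const std::function<Status(int)>& task) {
  if (!use_threads) {
    for (int i = 0; i < num_tasks; ++i) {
      Status st = task(i);
      if (!st.ok()) return st;
    }
    return Status::OK();
  }
  std::vector<std::future<Status>> futures;
  futures.reserve(num_tasks);
  for (int i = 0; i < num_tasks; ++i) {
    futures.push_back(std::async(std::launch::async, task, i));
  }
  Status result = Status::OK();
  for (auto& f : futures) {
    Status st = f.get();
    if (result.ok() && !st.ok()) result = st;
  }
  return result;
}
```

Call sites would then collapse to a single `MaybeParallelFor(options_.use_threads, n, CompressOne)` instead of repeating the if/else.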
LGTM, just one style question
cpp/src/arrow/ipc/reader.cc
```cpp
Status DecompressBuffers(const std::vector<std::shared_ptr<ArrayData>>& fields,
                         Compression::type compression, const IpcReadOptions& options) {
```
Since you're mutating the contents of `fields`, should this be

```cpp
Status DecompressBuffers(Compression::type compression, const IpcReadOptions& options,
                         std::vector<std::shared_ptr<ArrayData>>* fields) {
```

?
Sort of a weird case, not often seen, since the vector itself is not mutated. When I see `std::vector<T>*` in a function signature, it suggests to me that the vector is modified. Thoughts?
It's true; the vector itself isn't mutated and const correctness isn't broken here. I was only thinking of trying to communicate mutation to a future reader. Unfortunately all we have is the ampersand and comments, neither of which is ideal here (... unless we start tossing `std::vector<const T>` around everywhere we want to be clear that the elements are immutable).
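The trade-off being weighed can be sketched as follows (types simplified; the `ArrayData` stand-in and the `DecompressInPlace*` names are hypothetical). Both signatures leave the vector itself untouched and only mutate the objects reachable through the `shared_ptr` elements; the pointer form merely forces an `&` at the call site as a hint to readers.

```cpp
#include <memory>
#include <vector>

// Simplified stand-in for arrow::ArrayData.
struct ArrayData {
  int length = 0;
};

// Style 1: const reference. The vector is never resized, but the
// pointed-to ArrayData objects are still mutated through the shared_ptrs,
// which the signature does not advertise.
void DecompressInPlace(const std::vector<std::shared_ptr<ArrayData>>& fields) {
  for (const auto& field : fields) field->length += 1;
}

// Style 2: pointer parameter. The caller must write &fields, signalling
// mutation -- at the cost of suggesting the vector itself might change.
void DecompressInPlacePtr(std::vector<std::shared_ptr<ArrayData>>* fields) {
  for (const auto& field : *fields) field->length += 1;
}
```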
Flaky Thrift download again.

Googletest download flaked on Appveyor. It passed on my fork though. +1. Merging this.
…Feather V2 This PR always puts the compressed size in little-endian format for Feather V2, since the reader expects little-endian. Based on [the discussion](#6777 (comment)) at #6777, [this commit](aa28280) reads compressed_length in the Feather V2 format as little-endian. However, the writer [puts compressed_length in native-endian](https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L177). This PR fixes failures related to reading the compressed Feather format in `arrow-ipc-read-write-test` and `arrow-feather-test`. Closes #7137 from kiszk/ARROW-8747 Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Signed-off-by: Wes McKinney <wesm+git@apache.org>
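The essence of that fix can be sketched as explicit byte-by-byte little-endian (de)serialization of the 8-byte compressed length, so the on-disk representation is the same regardless of host byte order. The helper names below are hypothetical illustrations, not Arrow's actual API.

```cpp
#include <cstdint>

// Serialize a 64-bit compressed length into 8 bytes, little-endian,
// independent of the host's native byte order.
void PutCompressedLengthLE(int64_t length, uint8_t* out) {
  uint64_t v = static_cast<uint64_t>(length);
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<uint8_t>(v >> (8 * i));  // byte i holds bits [8i, 8i+8)
  }
}

// Deserialize the little-endian bytes back into a 64-bit length.
int64_t ReadCompressedLengthLE(const uint8_t* in) {
  uint64_t v = 0;
  for (int i = 0; i < 8; ++i) {
    v |= static_cast<uint64_t>(in[i]) << (8 * i);
  }
  return static_cast<int64_t>(v);
}
```

With both sides written this way, a big-endian writer and a little-endian reader agree on the value.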
This PR writes and reads the Plasma header (version, type, and length) in big-endian format, which makes it easy to interpret a header of Plasma data across machines with different endianness. The current code writes the Plasma header in native endian at [here](https://github.com/apache/arrow/blob/master/cpp/src/plasma/io.cc#L65-L71), so it is not possible to know the version, type, and length of a given Plasma file across different platforms. Feather V2 also uses little-endian for the header based on [the discussion](#6777 (comment)); this PR follows that discussion. Closes #7146 from kiszk/ARROW-8757 Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
I also changed the metadata key to "ARROW:experimental_compression", if anyone has opinions.
Haven't run benchmarks but will do so tomorrow.