ARROW-17468: [C++] Validation for RLE arrays #13916 (Closed)
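The pull request adds C++ validation for run-length encoded (RLE) arrays. As a rough, standalone illustration only (not taken from this PR's diff; the function name and the use of a plain `std::vector` are assumptions), the sketch below checks the core invariant such validation has to enforce on the run-ends buffer: run ends must be positive, strictly increasing, and the last run end must cover the array's logical length.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: returns an error message, or std::nullopt if valid.
std::optional<std::string> ValidateRunEnds(const std::vector<int32_t>& run_ends,
                                           int64_t logical_length) {
  int64_t previous = 0;
  for (size_t i = 0; i < run_ends.size(); ++i) {
    // Each run end must be > 0 and strictly greater than the previous one.
    if (run_ends[i] <= previous) {
      return "run end at index " + std::to_string(i) +
             " is not positive and strictly increasing";
    }
    previous = run_ends[i];
  }
  // The last run end must reach at least the array's logical length.
  if (run_ends.empty() && logical_length > 0) {
    return "non-empty array has no run ends";
  }
  if (!run_ends.empty() && previous < logical_length) {
    return "last run end does not cover the array's logical length";
  }
  return std::nullopt;
}
```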
Signed-off-by: Felix Yan <felixonmars@archlinux.org> Lead-authored-by: Yibo Cai <yibo.cai@arm.com> Co-authored-by: Felix Yan <felixonmars@archlinux.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
* Update the CUDA runtime version as CUDA 9.1 images are not available anymore
* Fix passing child command arguments to "docker run"

Checked locally on an Ubuntu 20.04 host with:
```
UBUNTU=18.04 archery --debug docker run ubuntu-cuda-cpp
UBUNTU=20.04 archery --debug docker run ubuntu-cuda-cpp
```
Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
…ead_metadata (apache#13629) Add `filesystem` support to `pq.read_metadata` and `pq.read_schema`. Lead-authored-by: kshitij12345 <kshitijkalambarkar@gmail.com> Co-authored-by: Kshiteej K <kshitijkalambarkar@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…pache#13899) Checked locally on an Ubuntu 20.04 host with:
```
archery docker run ubuntu-cuda-python
```
Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
apache#13821) Will fix [ARROW-13763](https://issues.apache.org/jira/browse/ARROW-13763). A separate Jira issue will be made to address closing files in V2 ParquetDataset, which needs to be handled in the C++ layer. Adds a context manager to `pq.ParquetFile` to close the input file, and ensures reads within `pq.ParquetDataset` and `pq.read_table` are closed.
```python
# user-opened file-like object will not be closed
with open('file.parquet', 'rb') as f:
    with pq.ParquetFile(f) as p:
        table = p.read()
        assert not f.closed  # did not inadvertently close the open file
        assert not p.closed
    assert not f.closed  # parquet context exit didn't close it
    assert not p.closed  # references the input file status
assert f.closed  # normal context exit close
assert p.closed

# ...

# path-like will be closed upon exit or `ParquetFile.close`
with pq.ParquetFile('file.parquet') as p:
    table = p.read()
    assert not p.closed
assert p.closed
```
Authored-by: Miles Granger <miles59923@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
See https://issues.apache.org/jira/browse/ARROW-17289 Lead-authored-by: Yaron Gvili <rtpsw@hotmail.com> Co-authored-by: rtpsw <rtpsw@hotmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
…ctor should limit to Integer.MAX_VALUE (apache#13815) We got an IndexOutOfBoundsException:
```
2022-08-03 09:33:34,076 Error executing query, currentState RUNNING, java.lang.RuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3315 in stage 5.0 failed 4 times, most recent failure: Lost task 3315.3 in stage 5.0 (TID 3926) (30.97.116.209 executor 49): java.lang.IndexOutOfBoundsException: index: 2147312542, length: 777713 (expected: range(0, 2147483648))
    at org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:699)
    at org.apache.iceberg.shaded.org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:826)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$VarWidthReader.nextVal(VectorizedParquetDefinitionLevelReader.java:418)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$BaseReader.nextBatch(VectorizedParquetDefinitionLevelReader.java:235)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$VarWidthTypePageReader.nextVal(VectorizedPageIterator.java:353)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$BagePageReader.nextBatch(VectorizedPageIterator.java:161)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$VarWidthTypeBatchReader.nextBatchOf(VectorizedColumnIterator.java:191)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$BatchReader.nextBatch(VectorizedColumnIterator.java:74)
    at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:158)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:51)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:35)
    at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:134)
    at org.apache.iceberg.spark.source.BaseDataReader.next(BaseDataReader.java:98)
    at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
    at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
```
The root cause is that the following code in `BaseVariableWidthVector.handleSafe` can fail to reallocate because of int overflow, which then leads to the `IndexOutOfBoundsException` when the data is put into the vector.
```java
protected final void handleSafe(int index, int dataLength) {
  while (index >= getValueCapacity()) {
    reallocValidityAndOffsetBuffers();
  }
  final int startOffset = lastSet < 0 ? 0 : getStartOffset(lastSet + 1);
  // startOffset + dataLength could overflow
  while (valueBuffer.capacity() < (startOffset + dataLength)) {
    reallocDataBuffer();
  }
}
```
The offset width of `BaseVariableWidthVector` is 4 bytes, while the maximum memory allocation is Long.MAX_VALUE. This makes the memory allocation check invalid. Authored-by: xianyangliu <xianyangliu@tencent.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
Currently, Java JNI builds on GitHub Actions can take one hour due to a very long Arrow C++ build phase (example: https://github.com/apache/arrow/runs/7881918943?check_suite_focus=true#step:6:3512). Disable unused Arrow C++ components so as to make the C++ build faster. Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
…nt updates (apache#13769) Building on apache#12157 Lead-authored-by: Jacob Wujciak-Jens <jacob@wujciak.de> Co-authored-by: Jonathan Keane <jkeane@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Add a `--validate` option to `archery crossbow status`. If `--validate` is specified and there are any missing artifacts, `archery crossbow status --validate` exits with a non-zero exit code. We can use it in CI to detect missing artifacts. We can't use `@github-actions crossbow submit` for this change because this isn't merged into the master branch yet. See https://github.com/ursacomputing/crossbow/branches/all?query=build-674, which was submitted for `nightly-packages` manually. Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…and *-glib-devel should have .gir (apache#13876) The current configuration is inverted: *-glib-libs currently have the .gir files and *-glib-devel have the .typelib files. Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…pache#13910) I noticed looking at pkg.go.dev that there really isn't anyone using the existing `compute` module, which makes sense since it isn't really finished and only provides limited utility currently. This change will mark the `compute` module as a separate sub-module inside of the `arrow` module, allowing us to use `go1.18` in this new code without forcing anyone who *isn't* using the compute module to upgrade. That way I can leverage generics when writing the new compute code where appropriate. Authored-by: Matt Topol <zotthewizard@gmail.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
… GetSchema (apache#13898) Consistently implements and tests the GetSchema method in Flight SQL. Builds on apache#13897. Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
…lization for `LocalFileSystem` (apache#13796) Introduce a specialization of `GetFileInfoGenerator` in the `LocalFileSystem` class. This implementation tries to improve performance by hiding latencies at two levels:
1. Child directories can be read ahead so that listing directory entries from disk can proceed in parallel with other work;
2. Directory entries can be `stat`'ed and yielded in chunks so that the `FileInfoGenerator` consumer can start receiving entries before a large directory is fully processed.
Both mechanisms can be tuned using dedicated parameters in `LocalFileSystemOptions`. Signed-off-by: Pavel Solodovnikov <pavel.al.solodovnikov@gmail.com> Co-Authored-by: Igor Seliverstov <iseliverstov@querifylabs.com> Lead-authored-by: Pavel Solodovnikov <pavel.al.solodovnikov@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>
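A rough usage sketch of the tuning described above, assuming a recent Arrow C++ build; the option field names `directory_readahead` and `file_info_batch_size` are assumptions, since the message does not name the parameters.

```cpp
#include <arrow/filesystem/filesystem.h>
#include <arrow/filesystem/localfs.h>
#include <arrow/util/async_generator.h>
#include <iostream>

int main() {
  arrow::fs::LocalFileSystemOptions options;
  options.directory_readahead = 8;      // assumed name: parallel child-directory listing
  options.file_info_batch_size = 1000;  // assumed name: entries yielded per chunk
  auto fs = std::make_shared<arrow::fs::LocalFileSystem>(options);

  arrow::fs::FileSelector selector;
  selector.base_dir = "/tmp";
  selector.recursive = true;

  // GetFileInfoGenerator yields batches of FileInfo asynchronously, so a
  // consumer can start processing before the whole tree has been walked.
  auto gen = fs->GetFileInfoGenerator(selector);
  auto batches = arrow::CollectAsyncGenerator(std::move(gen)).result().ValueOrDie();
  for (const auto& batch : batches) {
    for (const auto& info : batch) {
      std::cout << info.path() << std::endl;
    }
  }
  return 0;
}
```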
It looks like the entries in the truth tables were copy-pasted and the _results_ were updated to match the function, but not the operator. Authored-by: Gil Forsyth <gil@forsyth.dev> Signed-off-by: Yibo Cai <yibo.cai@arm.com>
…pache#13906) Typical real-life Arrow datasets contain List-type vectors of primitive types. This PR introduces a ListBinder that maps lists of primitive types to java.sql.Types.ARRAY. Lead-authored-by: Igor Suhorukov <igor.suhorukov@gmail.com> Co-authored-by: igor.suhorukov <igor.suhorukov@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
…rdered after adding duplicated fields (apache#13321) Authored-by: Hongze Zhang <hongze.zhang@intel.com> Signed-off-by: David Li <li.davidm96@gmail.com>
…railing bits (apache#13915) Authored-by: Matt Topol <zotthewizard@gmail.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
…pache#13913) Authored-by: Jacob Wujciak-Jens <jacob@wujciak.de> Signed-off-by: Rok <rok@mihevc.org>
This PR aims to upgrade ORC to version 1.7.6. Apache ORC 1.7.6 is the most recent maintenance release with the following bug fixes. - https://github.com/apache/orc/releases/tag/v1.7.6 - https://orc.apache.org/news/2022/08/17/ORC-1.7.6/ Authored-by: William Hyun <william@apache.org> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
… PKGBUILD (apache#13917) Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…datafusion-c (apache#13923) The binary uploaders are dev/release/05-binary-upload.sh and dev/release/post-02-binary.sh. We need to customize the .deb package name. This also adds missing environment variable entries. Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…me or index (apache#13652) Authored-by: anjakefala <anja@voltrondata.com> Signed-off-by: David Li <li.davidm96@gmail.com>
Relating to building the Compute functionality for Go with Arrow, this is the implementation of ArraySpan / ExecValue / ExecResult, etc. It could be separated out from the function interface definitions, so this PR could be made while apache#13924 is still being reviewed. Authored-by: Matt Topol <zotthewizard@gmail.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Authored-by: Matt Topol <zotthewizard@gmail.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
Authored-by: Matt Topol <zotthewizard@gmail.com> Signed-off-by: Matt Topol <zotthewizard@gmail.com>
…apache#14210) I couldn't reproduce it, so I added a suppression instead. In both cases, the error is that the server is uncontactable. That shouldn't happen, but I changed the tests to also bind to port 0 instead of using a potentially flaky free port finder. Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
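As a generic illustration (not Arrow's test code) of why binding to port 0 avoids the race inherent in a "find a free port, then bind it" helper: the kernel assigns an unused port atomically at `bind()` time, and `getsockname()` then reports which port was chosen. The sketch below uses plain POSIX sockets.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return 1;

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(0);  // port 0: let the OS pick a free port atomically

  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) return 1;

  // Ask the kernel which port it actually assigned.
  sockaddr_in bound{};
  socklen_t len = sizeof(bound);
  if (getsockname(fd, reinterpret_cast<sockaddr*>(&bound), &len) != 0) return 1;
  std::printf("port assigned by the OS: %d\n", ntohs(bound.sin_port));

  close(fd);
  return 0;
}
```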
This is a follow-up of apache#14204. Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: David Li <li.davidm96@gmail.com>
…tructor with new (apache#14216) Advantages: readability, exception safety, and efficiency (the latter only for shared_ptr). Cases where it doesn't apply: when calling a private/protected constructor within a class member function, make_shared/make_unique can't work. Authored-by: Jin Shang <shangjin1997@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
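A minimal illustration of those three advantages (not from the PR's diff; the `Options` type is made up for the example):

```cpp
#include <memory>
#include <string>
#include <utility>

struct Options {
  std::string name;
  int level;
  Options(std::string name, int level) : name(std::move(name)), level(level) {}
};

int main() {
  // Readability: the type is spelled once and "new" never appears.
  auto a = std::make_shared<Options>("compression", 3);
  auto b = std::make_unique<Options>("encoding", 1);

  // Efficiency (shared_ptr only): make_shared performs a single allocation
  // holding both the control block and the object, whereas the line below
  // performs two separate allocations.
  auto c = std::shared_ptr<Options>(new Options("two allocations", 0));

  // Exception safety: before C++17, passing two freshly new'd objects as
  // separate arguments, e.g. f(std::shared_ptr<Options>(new Options("x", 0)), g()),
  // could leak if g() threw between the allocation and the shared_ptr taking
  // ownership; make_shared/make_unique avoid that window entirely.
  return a->level + b->level + c->level;
}
```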
…pache#14228) Temporarily pin the LLVM version on AppVeyor due to a bug in Conda's packaging of LLVM. Authored-by: Jin Shang <shangjin1997@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
This is a follow-up of apache#14216. We can't use std::make_shared for CUDA-related classes because their constructors aren't public. Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…th output buffer (apache#14230) When the output type of an expression is of variable length, e.g. string, Gandiva would realloc the output buffer to make space for new outputs for each row. When the number of rows is high, some memory allocators perform poorly. We can use a std::vector-like approach to amortize the allocation cost: first allocate some initial space depending on the input size; each time we run out of space, double the buffer size; at the end, shrink it to fit the actual size. The Arrow string builder also uses this approach. Authored-by: Jin Shang <shangjin1997@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>
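A minimal sketch of that growth strategy, assuming a standalone buffer class rather than Gandiva's actual resizer; error handling for failed allocations is omitted for brevity.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <cstdio>

class GrowableOutputBuffer {
 public:
  explicit GrowableOutputBuffer(int64_t initial_capacity)
      : capacity_(initial_capacity > 0 ? initial_capacity : 64),
        size_(0),
        data_(static_cast<uint8_t*>(std::malloc(capacity_))) {}

  ~GrowableOutputBuffer() { std::free(data_); }

  // Append `length` bytes, doubling the capacity whenever it is exhausted.
  // Doubling keeps the number of reallocations logarithmic in the output
  // size, so the amortized cost per appended byte is constant.
  void Append(const uint8_t* bytes, int64_t length) {
    while (size_ + length > capacity_) {
      capacity_ *= 2;
      data_ = static_cast<uint8_t*>(std::realloc(data_, capacity_));
    }
    std::memcpy(data_ + size_, bytes, length);
    size_ += length;
  }

  // After the last row, give back the unused tail so the result only
  // retains the bytes actually produced.
  void ShrinkToFit() {
    if (size_ == 0 || size_ == capacity_) return;
    data_ = static_cast<uint8_t*>(std::realloc(data_, size_));
    capacity_ = size_;
  }

  int64_t size() const { return size_; }
  const uint8_t* data() const { return data_; }

 private:
  int64_t capacity_;
  int64_t size_;
  uint8_t* data_;
};

int main() {
  GrowableOutputBuffer buf(/*initial_capacity=*/16);
  const char row[] = "example output";
  for (int i = 0; i < 100; ++i) {
    buf.Append(reinterpret_cast<const uint8_t*>(row), sizeof(row) - 1);
  }
  buf.ShrinkToFit();
  std::printf("total bytes: %lld\n", static_cast<long long>(buf.size()));
  return 0;
}
```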
TweakValidityBit returns a new Array, so the calling function should use the returned value. https://github.com/apache/arrow/blob/6cc37cf2d1ba72c46b64fbc7ac499bd0d7296d20/cpp/src/arrow/testing/gtest_util.cc#L568-L579 Authored-by: kshitij12345 <kshitijkalambarkar@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
… OpenTelemetry propagation (apache#11920) Adds a client middleware that sends the span/trace ID to the server, and a server middleware that gets the span/trace ID and starts a child span. The middleware are available in builds without OpenTelemetry; they simply do nothing. Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: David Li <li.davidm96@gmail.com>
Use `File.deleteOnExit` to delete the JNI lib file on JVM exit. `File.deleteOnExit` actually adds a shutdown hook to make sure the file is deleted. Authored-by: jackylee-ch <lijunqing@baidu.com> Signed-off-by: David Li <li.davidm96@gmail.com>
zeroshade pushed a commit that referenced this pull request on Feb 17, 2023:
This PR gathers work from multiple PRs that can be closed after this one is merged:
- Closes #13752
- Closes #13754
- Closes #13842
- Closes #13882
- Closes #13916
- Closes #14063
- Closes #13970
And the issues associated with those PRs can also be closed:
- Fixes #20350 - Add RunEndEncodedScalarType
- Fixes #32543
- Fixes #32544
- Fixes #32688
- Fixes #32731
- Fixes #32772
- Fixes #32774
* Closes: #32104
Lead-authored-by: Felipe Oliveira Carvalho <felipekde@gmail.com> Co-authored-by: Tobias Zagorni <tobias@zagorni.eu> Signed-off-by: Matt Topol <zotthewizard@gmail.com>