Require batches to be non-empty in multi-batch JSON reader #17837
Conversation
Force-pushed from cc0b2e7 to af8d052
Rebase successful. Only C++ changes remain, so I will remove Python codeowners.
/ok to test
/ok to test
/ok to test
Some small requests.
cpp/src/io/json/read_json.cu
Outdated
auto const batch_limit = static_cast<std::size_t>(std::numeric_limits<int32_t>::max()) -
                         (max_subchunks_prealloced * size_per_subchunk);
return std::min<std::size_t>(batch_limit,
                             getenv_or<std::size_t>("LIBCUDF_JSON_BATCH_SIZE", batch_limit));
cuIO has a number of these environment variable knobs. @vuule IIRC we discussed centralizing documentation of them somewhere. Did we ever do that? I'm not sure who is supposed to know that this exists.
I'm not sure where we should document this - LIBCUDF_JSON_BATCH_SIZE is used only for testing and benchmarking purposes now, similar to the LIBCUDF_LARGE_STRINGS_THRESHOLD env var.
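For context, a minimal sketch of how such an override could be exercised from a test or benchmark. Only the LIBCUDF_JSON_BATCH_SIZE variable itself comes from this PR; the file name and the 1 MiB batch size below are made-up illustration values.

// Sketch only: force a small batch size so a modest JSONL file is read in
// several batches. The env var must be set before the reader queries it.
#include <cudf/io/json.hpp>
#include <cstdlib>

int main()
{
  // 1 MiB batches; value chosen arbitrarily for illustration.
  setenv("LIBCUDF_JSON_BATCH_SIZE", "1048576", /*overwrite=*/1);

  auto opts = cudf::io::json_reader_options::builder(
                  cudf::io::source_info{"records.jsonl"})  // hypothetical input file
                  .lines(true)
                  .build();
  auto result = cudf::io::read_json(opts);  // reader splits the input into batches
  return 0;
}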
These are mostly meant for internal use, so documenting them is not too urgent IMO.
We have added a bunch of env vars recently, and it would be nice to have them listed somewhere. @vyasr is there an issue for the env var docs?
I don't know. I thought that we discussed one but I don't see anything in a quick search. We can open a new one.
@@ -295,6 +299,10 @@ datasource::owning_buffer<rmm::device_buffer> get_record_range_raw_input(
  }
}

auto const batch_limit = static_cast<size_t>(std::numeric_limits<int32_t>::max());
Double-checking here, we do actually want the limit to be explicitly tied to int32_t, right? i.e. we don't want the implementation-defined size_t size? Are we choosing int32_t because it is size_type? If so, should this be cudf::size_type in the arg to numeric_limits?
Yes, we want batch_limit to be explicitly set to 2^31 - 1, since that is the maximum string size accepted by the JSON tokenizer.
cudf/cpp/src/io/json/nested_json_gpu.cu, line 86 in acbcf45:
CUDF_EXPECTS(input_size == 0 || (input_size - 1) <= std::numeric_limits<int32_t>::max(),
If we are changing the arg to numeric_limits to cudf::size_type, I think we should modify check_input_size as well.
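For reference, a minimal sketch (not the actual libcudf implementation) of the kind of guard check_input_size performs, assuming the 2^31 - 1 tokenizer limit quoted above; the function name here is illustrative.

// Sketch of the tokenizer-size guard discussed above: reject inputs whose
// byte size cannot be indexed with a 32-bit offset.
#include <cudf/utilities/error.hpp>
#include <cstddef>
#include <cstdint>
#include <limits>

void check_input_size_sketch(std::size_t input_size)
{
  CUDF_EXPECTS(input_size == 0 ||
                 (input_size - 1) <= static_cast<std::size_t>(
                                       std::numeric_limits<std::int32_t>::max()),
               "Input is too large for the JSON tokenizer");
}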
I agree. I don't know this part of the code well enough to know whether int32_t is really semantically cudf::size_type here, or if it is a semantically different (but numerically equivalent) limit. I'll defer to your judgment.
Semantically, when do we prefer cudf::size_type over int32_t?
The cudf::size_type is the type used for the number of elements in a column, offsets to elements within a column, indices to address specific elements, segments for subsets of column elements, etc.
If this is meant to represent one of those things above, it should be cudf::size_type.
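As a side note (an assumption worth verifying against cudf/types.hpp), cudf::size_type is numerically the same 32-bit signed type as int32_t, so the distinction discussed here is about intent rather than range.

// Illustrative check: the two types share the same range, so choosing between
// them documents intent (column element counts/offsets vs. raw byte sizes)
// rather than changing any limit.
#include <cudf/types.hpp>
#include <cstdint>
#include <limits>
#include <type_traits>

static_assert(std::is_same_v<cudf::size_type, std::int32_t>);
static_assert(std::numeric_limits<cudf::size_type>::max() ==
              std::numeric_limits<std::int32_t>::max());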
Thanks, Bradley. I believe that the usage in the JSON batching logic and the tokenizer should be int32_t in that case, since we are referring to the size of (and offsets in) a raw JSON string before the cudf table is constructed.
Yes, that sounds right to me.
/ok to test
/merge
Merged 8b89ea0 into rapidsai:branch-25.02
Description
Fixes #17836
Checklist