Require batches to be non-empty in multi-batch JSON reader #17837
Conversation
Force-pushed from cc0b2e7 to af8d052
Rebase successful. Only C++ changes remain, so I will remove Python codeowners.
/ok to test
/ok to test
/ok to test
Some small requests.
cpp/src/io/json/read_json.cu
Outdated
auto const batch_limit = static_cast<std::size_t>(std::numeric_limits<int32_t>::max()) -
                         (max_subchunks_prealloced * size_per_subchunk);
return std::min<std::size_t>(batch_limit,
                             getenv_or<std::size_t>("LIBCUDF_JSON_BATCH_SIZE", batch_limit));
cuIO has a number of these environment variable knobs. @vuule IIRC we discussed centralizing documentation of them somewhere. Did we ever do that? I'm not sure who is supposed to know that this exists.
I'm not sure where we should document this - LIBCUDF_JSON_BATCH_SIZE is used only for testing and benchmarking purposes now, similar to the LIBCUDF_LARGE_STRINGS_THRESHOLD env var.
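For context, a minimal sketch of how such an override could be exercised from a test or benchmark. Only the LIBCUDF_JSON_BATCH_SIZE variable itself comes from this PR; the file name and the 1 MiB batch size below are made-up illustration values.

// Sketch only: force a small batch size so a modest JSONL file is read in
// several batches. The env var must be set before the reader queries it.
#include <cudf/io/json.hpp>
#include <cstdlib>

int main()
{
  // 1 MiB batches; value chosen arbitrarily for illustration.
  setenv("LIBCUDF_JSON_BATCH_SIZE", "1048576", /*overwrite=*/1);

  auto opts = cudf::io::json_reader_options::builder(
                  cudf::io::source_info{"records.jsonl"})  // hypothetical input file
                  .lines(true)
                  .build();
  auto result = cudf::io::read_json(opts);  // reader splits the input into batches
  return 0;
}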
These are mostly meant for internal use, so documenting them is not too urgent IMO.
We have added a bunch of env vars recently, and it would be nice to have them listed somewhere. @vyasr is there an issue for the env var docs?
I don't know. I thought that we discussed one but I don't see anything in a quick search. We can open a new one.
@@ -295,6 +299,10 @@ datasource::owning_buffer<rmm::device_buffer> get_record_range_raw_input(
  }
}

auto const batch_limit = static_cast<size_t>(std::numeric_limits<int32_t>::max());
Double-checking here, we do actually want the limit to be explicitly tied to int32_t, right? i.e. we don't want the implementation-defined size_t size? Are we choosing int32_t because it is size_type? If so, should this be cudf::size_type in the arg to numeric_limits?
Yes, we want batch_limit to be explicitly set to 2^31 - 1, since that is the maximum string size accepted by the JSON tokenizer.
cudf/cpp/src/io/json/nested_json_gpu.cu, line 86 in acbcf45:
CUDF_EXPECTS(input_size == 0 || (input_size - 1) <= std::numeric_limits<int32_t>::max(),
If we are changing the arg to numeric_limits to cudf::size_type, I think we should modify check_input_size as well.
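For reference, a minimal sketch (not the actual libcudf implementation) of the kind of guard check_input_size performs, assuming the 2^31 - 1 tokenizer limit quoted above; the function name here is illustrative.

// Sketch of the tokenizer-size guard discussed above: reject inputs whose
// byte size cannot be indexed with a 32-bit offset.
#include <cudf/utilities/error.hpp>
#include <cstddef>
#include <cstdint>
#include <limits>

void check_input_size_sketch(std::size_t input_size)
{
  CUDF_EXPECTS(input_size == 0 ||
                 (input_size - 1) <= static_cast<std::size_t>(
                                       std::numeric_limits<std::int32_t>::max()),
               "Input is too large for the JSON tokenizer");
}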
I agree. I don't know this part of the code well enough to know whether int32_t is really semantically cudf::size_type here, or if it is a semantically different (but numerically equivalent) limit. I'll defer to your judgment.
Semantically, when do we prefer cudf::size_type over int32_t?
The cudf::size_type is the type used for the number of elements in a column, offsets to elements within a column, indices to address specific elements, segments for subsets of column elements, etc.
If this is meant to represent one of those things above, it should be cudf::size_type.
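As a side note (an assumption worth verifying against cudf/types.hpp), cudf::size_type is numerically the same 32-bit signed type as int32_t, so the distinction discussed here is about intent rather than range.

// Illustrative check: the two types share the same range, so choosing between
// them documents intent (column element counts/offsets vs. raw byte sizes)
// rather than changing any limit.
#include <cudf/types.hpp>
#include <cstdint>
#include <limits>
#include <type_traits>

static_assert(std::is_same_v<cudf::size_type, std::int32_t>);
static_assert(std::numeric_limits<cudf::size_type>::max() ==
              std::numeric_limits<std::int32_t>::max());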
Thanks, Bradley. I believe that the usage in the JSON batching logic and the tokenizer should be int32_t in that case, since we are referring to the size of (and offsets in) a raw JSON string before the cudf table is constructed.
Yes, that sounds right to me.
/ok to test
/merge
Merged 8b89ea0 into rapidsai:branch-25.02
Description
Fixes #17836
Checklist