Require batches to be non-empty in multi-batch JSON reader #17837

Merged: 10 commits, Feb 4, 2025

@shrshi shrshi commented Jan 28, 2025

Description

Fixes #17836

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shrshi shrshi added bug Something isn't working cuIO cuIO issue non-breaking Non-breaking change labels Jan 28, 2025

copy-pr-bot bot commented Jan 28, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jan 28, 2025
@shrshi shrshi marked this pull request as ready for review January 28, 2025 07:32
@shrshi shrshi requested a review from a team as a code owner January 28, 2025 07:32
@shrshi shrshi requested review from vyasr and ttnghia January 28, 2025 07:32
@vuule vuule self-requested a review January 28, 2025 17:52
@shrshi shrshi changed the base branch from branch-25.04 to branch-25.02 January 29, 2025 18:15
@shrshi shrshi requested review from a team as code owners January 29, 2025 18:15
@shrshi shrshi requested a review from Matt711 January 29, 2025 18:15
@shrshi shrshi force-pushed the empty-table-bugfix branch from cc0b2e7 to af8d052 Compare January 29, 2025 23:56

copy-pr-bot bot commented Jan 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@bdice bdice removed request for a team and Matt711 January 29, 2025 23:56
@bdice bdice dismissed galipremsagar’s stale review January 29, 2025 23:57

Rebase successful. Only C++ changes remain, so I will remove Python codeowners.

shrshi commented Jan 29, 2025

/ok to test

shrshi commented Jan 31, 2025

/ok to test

shrshi commented Feb 3, 2025

/ok to test

@shrshi shrshi requested a review from mythrocks February 3, 2025 19:29

@vyasr vyasr left a comment

Some small requests.

auto const batch_limit = static_cast<std::size_t>(std::numeric_limits<int32_t>::max()) -
(max_subchunks_prealloced * size_per_subchunk);
return std::min<std::size_t>(batch_limit,
getenv_or<std::size_t>("LIBCUDF_JSON_BATCH_SIZE", batch_limit));
Contributor

cuIO has a number of these environment variable knobs. @vuule IIRC we discussed centralizing documentation of them somewhere. Did we ever do that? I'm not sure who is supposed to know that this exists.

Contributor Author

I'm not sure where we should document this. LIBCUDF_JSON_BATCH_SIZE is currently used only for testing and benchmarking, similar to the LIBCUDF_LARGE_STRINGS_THRESHOLD env var.

Contributor

These are mostly meant for internal use, so documenting them is not too urgent IMO.
We have added a bunch of env vars recently, and it would be nice to have them listed somewhere. @vyasr is there an issue for the env var docs?

Contributor

I don't know. I thought that we discussed one but I don't see anything in a quick search. We can open a new one.
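
For readers unfamiliar with this knob, the clamping behavior in the snippet under review can be sketched in standalone C++. This is only an illustration under stated assumptions: `getenv_or_sketch` is a hypothetical stand-in for libcudf's internal `getenv_or` helper, and the values given to `max_subchunks_prealloced` and `size_per_subchunk` are placeholders, not the values used in `read_json.cu`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <sstream>
#include <string>

// Hypothetical stand-in for libcudf's internal getenv_or helper (assumption):
// returns the parsed value of the environment variable, or `fallback` when the
// variable is unset or cannot be parsed as T.
template <typename T>
T getenv_or_sketch(char const* name, T fallback)
{
  char const* raw = std::getenv(name);
  if (raw == nullptr) { return fallback; }
  std::istringstream stream{std::string{raw}};
  T value{};
  return (stream >> value) ? value : fallback;
}

// Placeholder values chosen for illustration only (assumptions, not from the PR).
constexpr std::size_t max_subchunks_prealloced = 3;
constexpr std::size_t size_per_subchunk        = 10000;

// Same shape as the expression quoted in the review: the environment variable
// can lower the batch size, but it can never raise it above batch_limit.
std::size_t batch_size_upper_bound()
{
  auto const batch_limit = static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()) -
                           (max_subchunks_prealloced * size_per_subchunk);
  return std::min<std::size_t>(
    batch_limit, getenv_or_sketch<std::size_t>("LIBCUDF_JSON_BATCH_SIZE", batch_limit));
}
```

With `LIBCUDF_JSON_BATCH_SIZE=1024` in the environment, `batch_size_upper_bound()` in this sketch returns 1024. With the variable unset, or set above the limit, it returns `batch_limit`.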

@@ -295,6 +299,10 @@ datasource::owning_buffer<rmm::device_buffer> get_record_range_raw_input(
}
}

auto const batch_limit = static_cast<size_t>(std::numeric_limits<int32_t>::max());
Contributor

Double-checking here, we do actually want the limit to be explicitly tied to int32_t, right? i.e. we don't want the implementation-defined size_t size? Are we choosing int32_t because it is size_type? If so, should this be cudf::size_type in the arg to numeric_limits?

Contributor Author

Yes, we want batch_limit to be explicitly set to 2^31 - 1, since that is the maximum string size accepted by the JSON tokenizer.

CUDF_EXPECTS(input_size == 0 || (input_size - 1) <= std::numeric_limits<int32_t>::max(),

If we change the arg to numeric_limits to cudf::size_type, I think we should modify check_input_size as well.

Contributor

I agree. I don't know this part of the code well enough to know whether int32_t is really semantically cudf::size_type here or if it is a semantically different (but numerically equivalent) limit. I'll defer to your judgment.

Contributor Author

Semantically, when do we prefer cudf::size_type over int32_t?

Contributor

https://github.com/rapidsai/cudf/blob/branch-25.04/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md#cudfsize_type

The cudf::size_type is the type used for the number of elements in a column, offsets to elements within a column, indices to address specific elements, segments for subsets of column elements, etc.

If this is meant to represent one of those things above ^, it should be cudf::size_type.

Contributor Author

Thanks, Bradley. I believe that the usage in the JSON batching logic and the tokenizer should be int32_t in that case since we are referring to the size of (and offsets in) a raw JSON string before the cudf table is constructed.

Contributor

Yes that sounds right to me.
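
The distinction settled in this thread can be sketched in standalone C++. This is a hedged illustration, not the libcudf source: `size_type_sketch` is a hypothetical alias standing in for `cudf::size_type`, and `input_size_ok` only mirrors the condition in the `CUDF_EXPECTS` line quoted earlier in the thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical alias standing in for cudf::size_type so the sketch builds
// without libcudf (assumption).
using size_type_sketch = std::int32_t;

// The two limits are numerically equal. The thread's point is that they mean
// different things: size_type counts column elements, while the limit below
// counts bytes of raw JSON text before any cudf table exists.
static_assert(std::numeric_limits<size_type_sketch>::max() ==
              std::numeric_limits<std::int32_t>::max());

// Byte limit on one batch of raw JSON input, written in terms of int32_t.
constexpr std::size_t batch_limit =
  static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max());

// Mirrors the condition quoted from check_input_size: empty input passes, and
// otherwise the last byte offset (input_size - 1) must fit in int32_t.
constexpr bool input_size_ok(std::size_t input_size)
{
  return input_size == 0 ||
         (input_size - 1) <= static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max());
}

// A batch of exactly batch_limit bytes satisfies the check.
static_assert(input_size_ok(batch_limit));
```

Keeping the byte limit in terms of `int32_t` leaves the batching code and the size check expressed in the same type, independent of any future change to the element-count type.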

shrshi commented Feb 4, 2025

/ok to test

@shrshi shrshi added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Feb 4, 2025

vyasr commented Feb 4, 2025

/merge

@rapids-bot rapids-bot bot merged commit 8b89ea0 into rapidsai:branch-25.02 Feb 4, 2025
107 of 108 checks passed
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Status: Landed
Development

Successfully merging this pull request may close these issues.

[BUG] Batched multi-source JSON reader does not require each batch to contain at least one JSON line
6 participants