[FEA] Allow Configuring Default Bit Width Used for Json and Csv Readers #11182

Closed
isVoid opened this issue Jun 30, 2022 · 0 comments · Fixed by #11272
Labels
cuIO cuIO issue feature request New feature or request

Comments

@isVoid
Contributor

isVoid commented Jun 30, 2022

As part of #10558, workflows could save memory if a default bit width could be configured for columns read via cuIO. The CSV and JSON formats do not encode data types in the file, so today numeric columns default to 64-bit when no data type is specified. It would be nice to provide a reader option that defaults them to 32-bit.

It would also be nice to have separate options for integer columns and floating-point columns, because the precision requirements for these two column types are often different.

In cuDF Python, this could be exposed via a global config, for example:

cudf.set_config("default_io_int_bitwidth", 32)
df = cudf.read_csv(file) # all integer columns are 32 bit

An alternative to this approach today is to read a small portion of the file, determine the column kind and limit the precision in the second pass:

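import numpy as np
# First pass: read one row only to infer each column's dtype
inferred = cudf.read_csv(file, nrows=1).dtypes
# Narrow integer/float columns to 4 bytes (32-bit); leave other kinds unchanged
dtypes = {col: np.dtype(f"{dt.kind}4") if dt.kind in "if" else dt for col, dt in inferred.items()}
df = cudf.read_csv(file, dtype=dtypes)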

This setup is tedious and requires an extra kernel launch. It would be nice to avoid both.
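
The dtype-narrowing step of the two-pass workaround can be sketched without a GPU. The helper below is hypothetical (`narrow_type_string` is not a cuDF or NumPy function). It assumes NumPy-style type strings such as `<i8`, where the trailing number counts bytes rather than bits, and rewrites 8-byte integer and float types to their 4-byte counterparts while leaving everything else alone.

```python
def narrow_type_string(type_str: str) -> str:
    """Map 8-byte int/uint/float type strings (e.g. '<i8') to 4-byte ones.

    Hypothetical helper for illustration; not part of cuDF or NumPy.
    """
    if len(type_str) < 3:
        return type_str  # too short to carry an item size, e.g. '|O'
    byte_order, kind, size = type_str[0], type_str[1], type_str[2:]
    if byte_order in "<>=|" and kind in "iuf" and size == "8":
        return f"{byte_order}{kind}4"
    return type_str  # bools, datetimes, already-narrow types stay unchanged


# Type strings as a first pass over the file might report them (made-up columns)
inferred = {"id": "<i8", "price": "<f8", "name": "|O", "flag": "|b1"}
narrowed = {col: narrow_type_string(t) for col, t in inferred.items()}
print(narrowed)
```

A configurable default would make this mapping unnecessary, since the reader would pick the narrower type during inference.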

@isVoid isVoid added feature request New feature or request Needs Triage Need team to review and classify labels Jun 30, 2022
@isVoid isVoid added the cuIO cuIO issue label Jun 30, 2022
rapids-bot bot pushed a commit that referenced this issue Aug 1, 2022
This PR introduces a cuDF option that lets users control the default bit width for integer and floating-point types. The first iteration provides only three settings: `None`, 32-bit, and 64-bit. `None` means the result dtype matches what pandas would construct; otherwise, the default is whatever the user specifies.

"Default" implies that the option should only affect places that require type inference, which include:

- CSV/JSON readers when dtypes are not specified
- cuDF constructors
- Materializing a range index
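
The "default only applies during inference" rule can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PR's implementation: the option names, `set_option`, and `resolve_dtype` are placeholders invented here, and the sketch assumes pandas would infer 64-bit types when the option is `None`.

```python
from typing import Optional

_VALID_BITWIDTHS = (None, 32, 64)

# Hypothetical option store; the keys are illustrative placeholders.
_options = {"default_integer_bitwidth": None, "default_float_bitwidth": None}


def set_option(name: str, value: Optional[int]) -> None:
    """Set a default bit width, accepting only None, 32, or 64."""
    if name not in _options:
        raise KeyError(f"unknown option: {name}")
    if value not in _VALID_BITWIDTHS:
        raise ValueError(f"{name} must be None, 32, or 64, got {value!r}")
    _options[name] = value


def resolve_dtype(kind: str, explicit: Optional[str] = None) -> str:
    """Return the dtype name for a column of the given kind ('int' or 'float')."""
    if explicit is not None:
        return explicit  # the user chose a dtype, so no inference happens
    key = "default_integer_bitwidth" if kind == "int" else "default_float_bitwidth"
    bitwidth = _options[key]
    # Assumption: None means "match pandas", taken here to be 64-bit.
    return f"{kind}{64 if bitwidth is None else bitwidth}"


set_option("default_integer_bitwidth", 32)
print(resolve_dtype("int"))                    # int32: inferred, option applies
print(resolve_dtype("float"))                  # float64: float option is still None
print(resolve_dtype("int", explicit="int16"))  # int16: explicit dtype wins
```

Keeping the integer and float settings separate mirrors the request in the issue, where the two kinds of column often have different precision requirements.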

This PR is the first demonstration of `cudf.option` and depends on #11193. The diff will shrink once that PR is merged.

closes #11182 #10318

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #11272
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024