[FEA] Allow Configuring Default Bit Width Used for Json and Csv Readers #11182

isVoid · 2022-06-30T21:47:57Z

As part of #10558, workflows could save memory if a default bit width could be configured for columns read via cuIO. Today, the CSV and JSON formats do not encode data types in the file, so columns default to 64-bit when no data type is specified. It would be nice to provide a reader option that defaults them to 32-bit instead.

It would also be nice to have separate options for integer columns and floating-point columns, because the precision requirements for these two column types often differ.

An example of how this could look in cuDF Python, using a global config:

cudf.set_config("default_io_int_bitwidth", 32)
df = cudf.read_csv(file) # all integer columns are 32 bit

An alternative to this approach today is to read a small portion of the file, determine the column kind and limit the precision in the second pass:

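import numpy as np
inferred = cudf.read_csv(file, nrows=1).dtypes  # first pass: infer column kinds from one row
# Narrow integer ("i") and float ("f") columns to 4-byte (32-bit) types; keep other columns as inferred.
dtypes = {name: np.dtype(f"{dt.kind}4") if dt.kind in "if" else dt for name, dt in inferred.items()}
df = cudf.read_csv(file, dtype=dtypes)  # second pass: read everything with explicit dtypes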

This setup is tedious and requires an extra kernel launch. It would be nice to avoid that.
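For anyone who wants to try the two-pass workaround without a GPU, here is a minimal sketch that uses pandas as a stand-in for cudf. The helper name `read_csv_narrow` and the sample data are hypothetical, and the sketch assumes the first row is representative of each column's kind.

```python
import io

import numpy as np
import pandas as pd


def read_csv_narrow(csv_text, bitwidth=32):
    """Two-pass read: infer column kinds from one row, then re-read with narrower dtypes.

    Hypothetical helper for illustration; pandas stands in for cudf here.
    """
    nbytes = bitwidth // 8
    # First pass: let the reader infer dtypes from a single row.
    inferred = pd.read_csv(io.StringIO(csv_text), nrows=1).dtypes
    # Narrow only signed-integer ("i") and floating ("f") columns; leave the rest untouched.
    dtypes = {
        name: np.dtype(f"{dt.kind}{nbytes}") if dt.kind in "if" else dt
        for name, dt in inferred.items()
    }
    # Second pass: read the full data with explicit dtypes.
    return pd.read_csv(io.StringIO(csv_text), dtype=dtypes)


csv_text = "a,b,c\n1,1.5,x\n2,2.5,y\n"
df = read_csv_narrow(csv_text)
print(df.dtypes)
```

Note that the data is parsed twice, which is exactly the overhead a configurable default would remove.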

This PR introduces a cudf option that lets users control the default bit width for integer and floating-point types. The first iteration provides only three values: `None`, 32-bit and 64-bit. `None` means the result dtype aligns with what pandas constructs; otherwise, the default is what the user specifies.

"Default" implies that the option only affects places that require type inference, which include:

- CSV/JSON readers when dtypes are not specified
- cuDF constructors
- Materializing a range index

This PR is the first demonstration use of `cudf.option` and depends on #11193. The diff will shrink once that is merged.

closes #11182 #10318

Authors:
- Michael Wang (https://github.com/isVoid)

Approvers:
- Ashwin Srinath (https://github.com/shwina)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #11272
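The semantics described in the PR text above (`None`, 32 or 64, applied only where a type is inferred) can be sketched with NumPy alone. This is an illustration under stated assumptions, not the merged implementation: the option names, `set_option`, and `resolve_inferred_dtype` are all hypothetical stand-ins.

```python
import numpy as np

# Hypothetical option store; the option names are assumptions made for this sketch.
_OPTIONS = {"default_integer_bitwidth": None, "default_float_bitwidth": None}
_VALID_BITWIDTHS = (None, 32, 64)


def set_option(name, value):
    """Set an option, accepting only None, 32 or 64."""
    if name not in _OPTIONS:
        raise KeyError(f"unknown option: {name}")
    if value not in _VALID_BITWIDTHS:
        raise ValueError(f"{name} must be one of {_VALID_BITWIDTHS}, got {value!r}")
    _OPTIONS[name] = value


def resolve_inferred_dtype(inferred, explicit=None):
    """Pick the dtype for a column: explicit dtypes win, inferred ones follow the options."""
    if explicit is not None:
        # The user specified a dtype, so no inference happens and the default does not apply.
        return np.dtype(explicit)
    inferred = np.dtype(inferred)
    if inferred.kind == "i":
        bitwidth = _OPTIONS["default_integer_bitwidth"]
    elif inferred.kind == "f":
        bitwidth = _OPTIONS["default_float_bitwidth"]
    else:
        bitwidth = None  # non-numeric columns are never affected
    if bitwidth is None:
        # None keeps the inferred (pandas-like) dtype.
        return inferred
    return np.dtype(f"{inferred.kind}{bitwidth // 8}")


set_option("default_integer_bitwidth", 32)
int_result = resolve_inferred_dtype("int64")
float_result = resolve_inferred_dtype("float64")
explicit_result = resolve_inferred_dtype("int64", explicit="int16")
```

Keeping the integer and float settings separate lets a user halve integer storage while leaving floating-point precision at 64-bit, which is the case raised in the original request.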

isVoid added feature request New feature or request Needs Triage Need team to review and classify labels Jun 30, 2022

isVoid added the cuIO cuIO issue label Jun 30, 2022

isVoid mentioned this issue Jul 2, 2022

Add cudf.options #11193

Merged

isVoid mentioned this issue Jul 15, 2022

Provide an Option for Default Integer and Floating Bitwidth #11272

Merged

rapids-bot bot closed this as completed in #11272 Aug 1, 2022

bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
