As part of #10558, workflows could save memory if a default bitwidth were configurable for columns read via cuIO. Today, the CSV and JSON formats do not encode data types in the file, so columns default to 64-bit when no dtype is specified. It would be nice to provide a reader option that defaults them to 32-bit instead.
It would also be nice to have separate options for integer columns and floating-point columns, because the precision requirements for these two column types often differ.
In cuDF Python, this could be exposed through a global config, for example:
cudf.set_config("default_io_int_bitwidth", 32)
df = cudf.read_csv(file)  # all integer columns are 32-bit
A workaround available today is to read a small portion of the file, determine each column's kind, and then limit the precision in a second pass.
This PR introduces a cuDF option that lets users control the default bitwidth for integer and floating-point types. The first iteration provides only three settings: `None`, 32-bit, and 64-bit. `None` means the result dtype aligns with what pandas constructs; otherwise, the bitwidth the user specifies is used.
"Default" implies that the option only affects places that require type inference, which include:
- CSV/JSON readers when dtypes are not specified
- cuDF constructors
- Materializing a range index.
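The rule described above can be sketched in plain Python. This is not the PR's implementation: `resolve_default_dtype` is a hypothetical helper, and the pandas-aligned result is assumed to be 64-bit, matching the 64-bit default the issue describes.

```python
def resolve_default_dtype(kind, default_bitwidth=None, explicit_dtype=None):
    """Hypothetical helper: choose a dtype name for a column of the given kind."""
    # An explicit dtype means no inference happens, so the option is ignored.
    if explicit_dtype is not None:
        return explicit_dtype
    if kind not in ("int", "float"):
        raise ValueError(f"unsupported kind: {kind!r}")
    # None: fall back to the pandas-aligned width (assumed 64-bit here).
    if default_bitwidth is None:
        return f"{kind}64"
    if default_bitwidth not in (32, 64):
        raise ValueError("default bitwidth must be None, 32, or 64")
    return f"{kind}{default_bitwidth}"


inferred_int = resolve_default_dtype("int", default_bitwidth=32)
inferred_float = resolve_default_dtype("float")
pinned = resolve_default_dtype("int", default_bitwidth=32, explicit_dtype="int8")
```

Passing a separate bitwidth for each kind mirrors the issue's request for independent integer and floating-point settings.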
This PR is the first demonstration of `cudf.option` and depends on #11193. The diff will shrink once that PR is merged.
closes #11182 #10318
Authors:
- Michael Wang (https://github.com/isVoid)
Approvers:
- Ashwin Srinath (https://github.com/shwina)
- Vyas Ramasubramani (https://github.com/vyasr)
URL: #11272
The two-pass workaround is tedious to set up and requires an extra kernel launch. It would be nice to avoid that.