Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File-based CSV: infer datatypes #28893

Closed
2 tasks
girarda opened this issue Aug 1, 2023 · 1 comment
Closed
2 tasks

File-based CSV: infer datatypes #28893

girarda opened this issue Aug 1, 2023 · 1 comment

Comments

@girarda
Copy link
Contributor

girarda commented Aug 1, 2023

What

Add an option to infer the data types when of a CSV stream even if the user has not provided a schema.

The existing S3 source does this with pyarrow, and most existing users set this option to True so we shouldn't let this be a breaking change.

This ticket is extracted from #28133.

Proposal 1: Use pyarrow to infer the datatype if the option is set to true

Proposal 2: Write our own inference algorithm

We can infer the type of a column by reading the first N rows of a file. For each column:

  1. If all values are boolean values, assume the field is a boolean
  2. If all values are numbers, assume the field is a number
  3. If all values are lists, assume the field is a list
  4. If any, but not all values are a list, assume the field is an object
  5. If any value is an object, assume the field is an object
  6. Else, assume the field is a string

This approach will require validating the output with existing sources to ensure backward compatibility

Acceptance criteria

  • The CsvFormat has a configuration option enabling auto schema inferrence
  • The schema inference is backward compatible with the existing S3 source with CSV files
@girarda
Copy link
Contributor Author

girarda commented Aug 1, 2023

grooming notes:

  • the schema inference will need to be used in the discover to create the right schema
  • the read operation should use the schema that was inferred from the discover step. it will already be available at the read step
  • we should limit the number of rows we read when inferring the schema. We can use a limit similar to the one we use for the json format (bytes per file)
  • testing: we can use the pyarrow inferrence and compare it's output with our own algorithm for extra security
  • we do not want to use pyarrow long-term because it loads the full csv file in memory

@maxi297 maxi297 self-assigned this Aug 2, 2023
maxi297 added a commit that referenced this issue Aug 4, 2023
maxi297 added a commit that referenced this issue Aug 4, 2023
maxi297 added a commit that referenced this issue Aug 7, 2023
maxi297 added a commit that referenced this issue Aug 8, 2023
maxi297 added a commit that referenced this issue Aug 8, 2023
maxi297 added a commit that referenced this issue Aug 9, 2023
maxi297 added a commit that referenced this issue Aug 9, 2023
maxi297 added a commit that referenced this issue Aug 9, 2023
girarda added a commit that referenced this issue Aug 14, 2023
* [ISSUE #28893] infer csv schema

* [ISSUE #28893] align with pyarrow

* Automated Commit - Formatting Changes

* [ISSUE #28893] legacy inference and infer only when needed

* [ISSUE #28893] fix scenario tests

* [ISSUE #28893] using discovered schema as part of read

* [ISSUE #28893] self-review + cleanup

* [ISSUE #28893] fix test

* [ISSUE #28893] code review part #1

* [ISSUE #28893] code review part #2

* Fix test

* formatcdk

* first pass

* [ISSUE #28893] code review

* fix mypy issues

* comment

* rename for clarity

* Add a scenario test case

* this isn't optional anymore

* FIX test log level

* Re-adding failing tests

* [ISSUE #28893] improve inferrence to consider multiple types per value

* Automated Commit - Formatting Changes

* [ISSUE #28893] remove InferenceType.PRIMITIVE_AND_COMPLEX_TYPES

* Code review

* Automated Commit - Formatting Changes

* fix unit tests

---------

Co-authored-by: maxi297 <maxime@airbyte.io>
Co-authored-by: maxi297 <maxi297@users.noreply.github.com>
brianjlai added a commit that referenced this issue Aug 14, 2023
* [ISSUE #28893] infer csv schema

* [ISSUE #28893] align with pyarrow

* Automated Commit - Formatting Changes

* [ISSUE #28893] legacy inference and infer only when needed

* [ISSUE #28893] fix scenario tests

* [ISSUE #28893] using discovered schema as part of read

* [ISSUE #28893] self-review + cleanup

* [ISSUE #28893] fix test

* [ISSUE #28893] code review part #1

* [ISSUE #28893] code review part #2

* Fix test

* formatcdk

* [ISSUE #28893] code review

* FIX test log level

* Re-adding failing tests

* [ISSUE #28893] improve inferrence to consider multiple types per value

* Automated Commit - Formatting Changes

* add file adapters for avro, csv, jsonl, and parquet

* fix try catch

* pr feedback with a few additional default options set

* fix things from the rebase of master

---------

Co-authored-by: maxi297 <maxime@airbyte.io>
Co-authored-by: maxi297 <maxi297@users.noreply.github.com>
octavia-approvington pushed a commit that referenced this issue Aug 15, 2023
* [ISSUE #28893] infer csv schema

* [ISSUE #28893] align with pyarrow

* Automated Commit - Formatting Changes

* [ISSUE #28893] legacy inference and infer only when needed

* [ISSUE #28893] fix scenario tests

* [ISSUE #28893] using discovered schema as part of read

* [ISSUE #28893] self-review + cleanup

* [ISSUE #28893] fix test

* [ISSUE #28893] code review part #1

* [ISSUE #28893] code review part #2

* Fix test

* formatcdk

* [ISSUE #28893] code review

* FIX test log level

* Re-adding failing tests

* [ISSUE #28893] improve inferrence to consider multiple types per value

* set decimal_as_float to True

* update

* Automated Commit - Formatting Changes

* add file adapters for avro, csv, jsonl, and parquet

* fix try catch

* update

* format

* pr feedback with a few additional default options set

---------

Co-authored-by: maxi297 <maxime@airbyte.io>
Co-authored-by: maxi297 <maxi297@users.noreply.github.com>
Co-authored-by: brianjlai <brian.lai@airbyte.io>
@maxi297 maxi297 closed this as completed Aug 15, 2023
harrytou pushed a commit to KYVENetwork/airbyte that referenced this issue Sep 1, 2023
* [ISSUE airbytehq#28893] infer csv schema

* [ISSUE airbytehq#28893] align with pyarrow

* Automated Commit - Formatting Changes

* [ISSUE airbytehq#28893] legacy inference and infer only when needed

* [ISSUE airbytehq#28893] fix scenario tests

* [ISSUE airbytehq#28893] using discovered schema as part of read

* [ISSUE airbytehq#28893] self-review + cleanup

* [ISSUE airbytehq#28893] fix test

* [ISSUE airbytehq#28893] code review part #1

* [ISSUE airbytehq#28893] code review part #2

* Fix test

* formatcdk

* first pass

* [ISSUE airbytehq#28893] code review

* fix mypy issues

* comment

* rename for clarity

* Add a scenario test case

* this isn't optional anymore

* FIX test log level

* Re-adding failing tests

* [ISSUE airbytehq#28893] improve inferrence to consider multiple types per value

* Automated Commit - Formatting Changes

* [ISSUE airbytehq#28893] remove InferenceType.PRIMITIVE_AND_COMPLEX_TYPES

* Code review

* Automated Commit - Formatting Changes

* fix unit tests

---------

Co-authored-by: maxi297 <maxime@airbyte.io>
Co-authored-by: maxi297 <maxi297@users.noreply.github.com>
harrytou pushed a commit to KYVENetwork/airbyte that referenced this issue Sep 1, 2023
* [ISSUE airbytehq#28893] infer csv schema

* [ISSUE airbytehq#28893] align with pyarrow

* Automated Commit - Formatting Changes

* [ISSUE airbytehq#28893] legacy inference and infer only when needed

* [ISSUE airbytehq#28893] fix scenario tests

* [ISSUE airbytehq#28893] using discovered schema as part of read

* [ISSUE airbytehq#28893] self-review + cleanup

* [ISSUE airbytehq#28893] fix test

* [ISSUE airbytehq#28893] code review part #1

* [ISSUE airbytehq#28893] code review part #2

* Fix test

* formatcdk

* [ISSUE airbytehq#28893] code review

* FIX test log level

* Re-adding failing tests

* [ISSUE airbytehq#28893] improve inferrence to consider multiple types per value

* Automated Commit - Formatting Changes

* add file adapters for avro, csv, jsonl, and parquet

* fix try catch

* pr feedback with a few additional default options set

* fix things from the rebase of master

---------

Co-authored-by: maxi297 <maxime@airbyte.io>
Co-authored-by: maxi297 <maxi297@users.noreply.github.com>
harrytou pushed a commit to KYVENetwork/airbyte that referenced this issue Sep 1, 2023
…ytehq#29342)

* [ISSUE airbytehq#28893] infer csv schema

* [ISSUE airbytehq#28893] align with pyarrow

* Automated Commit - Formatting Changes

* [ISSUE airbytehq#28893] legacy inference and infer only when needed

* [ISSUE airbytehq#28893] fix scenario tests

* [ISSUE airbytehq#28893] using discovered schema as part of read

* [ISSUE airbytehq#28893] self-review + cleanup

* [ISSUE airbytehq#28893] fix test

* [ISSUE airbytehq#28893] code review part #1

* [ISSUE airbytehq#28893] code review part #2

* Fix test

* formatcdk

* [ISSUE airbytehq#28893] code review

* FIX test log level

* Re-adding failing tests

* [ISSUE airbytehq#28893] improve inferrence to consider multiple types per value

* set decimal_as_float to True

* update

* Automated Commit - Formatting Changes

* add file adapters for avro, csv, jsonl, and parquet

* fix try catch

* update

* format

* pr feedback with a few additional default options set

---------

Co-authored-by: maxi297 <maxime@airbyte.io>
Co-authored-by: maxi297 <maxi297@users.noreply.github.com>
Co-authored-by: brianjlai <brian.lai@airbyte.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants