File-based CSV: infer datatypes #28893

girarda · 2023-08-01T00:55:06Z

What

Add an option to infer the data types of a CSV stream even if the user has not provided a schema.

The existing S3 source does this with pyarrow, and most existing users set this option to True, so this must not become a breaking change.

This ticket is extracted from #28133.

Proposal 1: Use pyarrow to infer the datatype if the option is set to true

Proposal 2: Write our own inference algorithm

We can infer the type of a column by reading the first N rows of a file. For each column:

If all values are boolean values, assume the field is a boolean
If all values are numbers, assume the field is a number
If all values are lists, assume the field is a list
If some, but not all, values are lists, assume the field is an object
If any value is an object, assume the field is an object
Else, assume the field is a string
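
A minimal sketch of the rules above, not the implementation that was merged. It assumes that only "true"/"false" (case-insensitive) count as booleans, that anything Python's float() accepts counts as a number, and that lists and objects appear as JSON-encoded cell values. The names infer_column_type and infer_schema and the max_rows default are hypothetical.

```python
import csv
import io
import itertools
import json
from typing import Any, Dict, Iterable, List

# Assumption: only these spellings count as booleans.
BOOLEAN_STRINGS = {"true", "false"}


def _is_boolean(value: str) -> bool:
    return value.strip().lower() in BOOLEAN_STRINGS


def _is_number(value: str) -> bool:
    # Assumption: a number is anything float() accepts (this includes "nan" and "inf").
    try:
        float(value)
        return True
    except ValueError:
        return False


def _parse_json(value: str) -> Any:
    # Returns the decoded value, or None when the cell is not valid JSON.
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        return None


def infer_column_type(values: Iterable[str]) -> str:
    """Apply the rules in order; the first rule that matches wins."""
    values = list(values)
    if not values:
        return "string"
    if all(_is_boolean(v) for v in values):
        return "boolean"
    if all(_is_number(v) for v in values):
        return "number"
    parsed = [_parse_json(v) for v in values]
    list_flags = [isinstance(p, list) for p in parsed]
    if all(list_flags):
        return "list"
    # Some (but not all) lists, or at least one object.
    if any(list_flags) or any(isinstance(p, dict) for p in parsed):
        return "object"
    return "string"


def infer_schema(rows: Iterable[Dict[str, str]], max_rows: int = 100) -> Dict[str, str]:
    """Infer one type per column from the first max_rows rows."""
    columns: Dict[str, List[str]] = {}
    for row in itertools.islice(rows, max_rows):
        for name, value in row.items():
            # csv.DictReader uses None for missing cells and for surplus fields.
            if name is None or value is None:
                continue
            columns.setdefault(name, []).append(value)
    return {name: infer_column_type(vals) for name, vals in columns.items()}


SAMPLE = '''id,active,tags,meta,name
1,true,"[1, 2]","{""a"": 1}",alice
2,false,"[3]","[1]",bob
'''

if __name__ == "__main__":
    print(infer_schema(csv.DictReader(io.StringIO(SAMPLE))))
```

The order of the checks matters: a column of "1" and "0" is treated as a number here, not a boolean, which is one of the places where the output would need to be compared against the existing S3 source.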

This approach will require validating the output against existing sources to ensure backward compatibility.

Acceptance criteria

The CsvFormat has a configuration option enabling auto schema inference
The schema inference is backward compatible with the existing S3 source with CSV files

girarda · 2023-08-01T16:59:49Z

grooming notes:

The schema inference will need to be used in discover to create the right schema
The read operation should use the schema that was inferred during the discover step; it will already be available at the read step
We should limit the number of rows we read when inferring the schema. We can use a limit similar to the one we use for the json format (bytes per file)
Testing: we can use the pyarrow inference and compare its output with our own algorithm for extra safety
We do not want to use pyarrow long-term because it loads the full CSV file in memory
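
One way to bound the rows read during inference by a per-file byte budget, as suggested in the notes above. This is only a sketch: the function name iter_rows_within_budget and the budget values are hypothetical, and the actual limit used for the json format is not stated in this thread.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def iter_rows_within_budget(
    stream: TextIO, max_bytes: int, encoding: str = "utf-8"
) -> Iterator[Dict[str, str]]:
    """Yield CSV rows until the encoded size of the lines read would exceed max_bytes.

    The header line counts toward the budget. Limitation of this sketch: a quoted
    field that spans several lines can be cut off at the budget boundary.
    """

    def bounded_lines() -> Iterator[str]:
        consumed = 0
        for line in stream:
            consumed += len(line.encode(encoding))
            if consumed > max_bytes:
                return
            yield line

    # csv.DictReader accepts any iterator that yields strings.
    yield from csv.DictReader(bounded_lines())


if __name__ == "__main__":
    data = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
    for row in iter_rows_within_budget(data, max_bytes=8):
        print(row)
```

Because the stream is consumed lazily, the file is never loaded fully in memory, which is the concern raised about pyarrow in the notes.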

* [ISSUE #28893] infer csv schema
* [ISSUE #28893] align with pyarrow
* Automated Commit - Formatting Changes
* [ISSUE #28893] legacy inference and infer only when needed
* [ISSUE #28893] fix scenario tests
* [ISSUE #28893] using discovered schema as part of read
* [ISSUE #28893] self-review + cleanup
* [ISSUE #28893] fix test
* [ISSUE #28893] code review part #1
* [ISSUE #28893] code review part #2
* Fix test
* formatcdk
* first pass
* [ISSUE #28893] code review
* fix mypy issues
* comment
* rename for clarity
* Add a scenario test case
* this isn't optional anymore
* FIX test log level
* Re-adding failing tests
* [ISSUE #28893] improve inferrence to consider multiple types per value
* Automated Commit - Formatting Changes
* [ISSUE #28893] remove InferenceType.PRIMITIVE_AND_COMPLEX_TYPES
* Code review
* Automated Commit - Formatting Changes
* fix unit tests
---------
Co-authored-by: maxi297 <maxime@airbyte.io>
Co-authored-by: maxi297 <maxi297@users.noreply.github.com>

* [ISSUE #28893] infer csv schema
* [ISSUE #28893] align with pyarrow
* Automated Commit - Formatting Changes
* [ISSUE #28893] legacy inference and infer only when needed
* [ISSUE #28893] fix scenario tests
* [ISSUE #28893] using discovered schema as part of read
* [ISSUE #28893] self-review + cleanup
* [ISSUE #28893] fix test
* [ISSUE #28893] code review part #1
* [ISSUE #28893] code review part #2
* Fix test
* formatcdk
* [ISSUE #28893] code review
* FIX test log level
* Re-adding failing tests
* [ISSUE #28893] improve inferrence to consider multiple types per value
* Automated Commit - Formatting Changes
* add file adapters for avro, csv, jsonl, and parquet
* fix try catch
* pr feedback with a few additional default options set
* fix things from the rebase of master
---------
Co-authored-by: maxi297 <maxime@airbyte.io>
Co-authored-by: maxi297 <maxi297@users.noreply.github.com>

* [ISSUE #28893] infer csv schema
* [ISSUE #28893] align with pyarrow
* Automated Commit - Formatting Changes
* [ISSUE #28893] legacy inference and infer only when needed
* [ISSUE #28893] fix scenario tests
* [ISSUE #28893] using discovered schema as part of read
* [ISSUE #28893] self-review + cleanup
* [ISSUE #28893] fix test
* [ISSUE #28893] code review part #1
* [ISSUE #28893] code review part #2
* Fix test
* formatcdk
* [ISSUE #28893] code review
* FIX test log level
* Re-adding failing tests
* [ISSUE #28893] improve inferrence to consider multiple types per value
* set decimal_as_float to True
* update
* Automated Commit - Formatting Changes
* add file adapters for avro, csv, jsonl, and parquet
* fix try catch
* update
* format
* pr feedback with a few additional default options set
---------
Co-authored-by: maxi297 <maxime@airbyte.io>
Co-authored-by: maxi297 <maxi297@users.noreply.github.com>
Co-authored-by: brianjlai <brian.lai@airbyte.io>

girarda added the team/extensibility label Aug 1, 2023

maxi297 self-assigned this Aug 2, 2023

clnoll added the area/file-cdk label Aug 3, 2023

maxi297 added a commit that referenced this issue Aug 4, 2023

[ISSUE #28893] infer csv schema

eaba483

maxi297 added a commit that referenced this issue Aug 4, 2023

[ISSUE #28893] align with pyarrow

60757c1

maxi297 mentioned this issue Aug 4, 2023

Issue 28893/infer schema csv #29099

Merged

maxi297 added a commit that referenced this issue Aug 7, 2023

[ISSUE #28893] legacy inference and infer only when needed

a394666

maxi297 added a commit that referenced this issue Aug 7, 2023

[ISSUE #28893] fix scenario tests

4f9d162

maxi297 added a commit that referenced this issue Aug 7, 2023

[ISSUE #28893] using discovered schema as part of read

0617c82

maxi297 added a commit that referenced this issue Aug 8, 2023

[ISSUE #28893] self-review + cleanup

d157aa3

maxi297 added a commit that referenced this issue Aug 8, 2023

[ISSUE #28893] fix test

57b011f

maxi297 added a commit that referenced this issue Aug 9, 2023

[ISSUE #28893] code review part #1

71cdca9

maxi297 added a commit that referenced this issue Aug 9, 2023

[ISSUE #28893] code review part #2

f651d03

maxi297 added a commit that referenced this issue Aug 9, 2023

[ISSUE #28893] code review

82db6c3

maxi297 added a commit that referenced this issue Aug 10, 2023

[ISSUE #28893] improve inferrence to consider multiple types per value

f1a60ba

maxi297 added a commit that referenced this issue Aug 11, 2023

[ISSUE #28893] remove InferenceType.PRIMITIVE_AND_COMPLEX_TYPES

9f3479f

maxi297 closed this as completed Aug 15, 2023
