-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
File-based CSV: infer datatypes #28893
Labels
Comments
grooming notes:
|
maxi297
added a commit
that referenced
this issue
Aug 4, 2023
maxi297
added a commit
that referenced
this issue
Aug 4, 2023
maxi297
added a commit
that referenced
this issue
Aug 7, 2023
maxi297
added a commit
that referenced
this issue
Aug 7, 2023
maxi297
added a commit
that referenced
this issue
Aug 7, 2023
maxi297
added a commit
that referenced
this issue
Aug 8, 2023
maxi297
added a commit
that referenced
this issue
Aug 9, 2023
maxi297
added a commit
that referenced
this issue
Aug 9, 2023
maxi297
added a commit
that referenced
this issue
Aug 10, 2023
maxi297
added a commit
that referenced
this issue
Aug 11, 2023
girarda
added a commit
that referenced
this issue
Aug 14, 2023
* [ISSUE #28893] infer csv schema * [ISSUE #28893] align with pyarrow * Automated Commit - Formatting Changes * [ISSUE #28893] legacy inference and infer only when needed * [ISSUE #28893] fix scenario tests * [ISSUE #28893] using discovered schema as part of read * [ISSUE #28893] self-review + cleanup * [ISSUE #28893] fix test * [ISSUE #28893] code review part #1 * [ISSUE #28893] code review part #2 * Fix test * formatcdk * first pass * [ISSUE #28893] code review * fix mypy issues * comment * rename for clarity * Add a scenario test case * this isn't optional anymore * FIX test log level * Re-adding failing tests * [ISSUE #28893] improve inferrence to consider multiple types per value * Automated Commit - Formatting Changes * [ISSUE #28893] remove InferenceType.PRIMITIVE_AND_COMPLEX_TYPES * Code review * Automated Commit - Formatting Changes * fix unit tests --------- Co-authored-by: maxi297 <maxime@airbyte.io> Co-authored-by: maxi297 <maxi297@users.noreply.github.com>
brianjlai
added a commit
that referenced
this issue
Aug 14, 2023
* [ISSUE #28893] infer csv schema * [ISSUE #28893] align with pyarrow * Automated Commit - Formatting Changes * [ISSUE #28893] legacy inference and infer only when needed * [ISSUE #28893] fix scenario tests * [ISSUE #28893] using discovered schema as part of read * [ISSUE #28893] self-review + cleanup * [ISSUE #28893] fix test * [ISSUE #28893] code review part #1 * [ISSUE #28893] code review part #2 * Fix test * formatcdk * [ISSUE #28893] code review * FIX test log level * Re-adding failing tests * [ISSUE #28893] improve inferrence to consider multiple types per value * Automated Commit - Formatting Changes * add file adapters for avro, csv, jsonl, and parquet * fix try catch * pr feedback with a few additional default options set * fix things from the rebase of master --------- Co-authored-by: maxi297 <maxime@airbyte.io> Co-authored-by: maxi297 <maxi297@users.noreply.github.com>
octavia-approvington
pushed a commit
that referenced
this issue
Aug 15, 2023
* [ISSUE #28893] infer csv schema * [ISSUE #28893] align with pyarrow * Automated Commit - Formatting Changes * [ISSUE #28893] legacy inference and infer only when needed * [ISSUE #28893] fix scenario tests * [ISSUE #28893] using discovered schema as part of read * [ISSUE #28893] self-review + cleanup * [ISSUE #28893] fix test * [ISSUE #28893] code review part #1 * [ISSUE #28893] code review part #2 * Fix test * formatcdk * [ISSUE #28893] code review * FIX test log level * Re-adding failing tests * [ISSUE #28893] improve inferrence to consider multiple types per value * set decimal_as_float to True * update * Automated Commit - Formatting Changes * add file adapters for avro, csv, jsonl, and parquet * fix try catch * update * format * pr feedback with a few additional default options set --------- Co-authored-by: maxi297 <maxime@airbyte.io> Co-authored-by: maxi297 <maxi297@users.noreply.github.com> Co-authored-by: brianjlai <brian.lai@airbyte.io>
harrytou
pushed a commit
to KYVENetwork/airbyte
that referenced
this issue
Sep 1, 2023
* [ISSUE airbytehq#28893] infer csv schema * [ISSUE airbytehq#28893] align with pyarrow * Automated Commit - Formatting Changes * [ISSUE airbytehq#28893] legacy inference and infer only when needed * [ISSUE airbytehq#28893] fix scenario tests * [ISSUE airbytehq#28893] using discovered schema as part of read * [ISSUE airbytehq#28893] self-review + cleanup * [ISSUE airbytehq#28893] fix test * [ISSUE airbytehq#28893] code review part #1 * [ISSUE airbytehq#28893] code review part #2 * Fix test * formatcdk * first pass * [ISSUE airbytehq#28893] code review * fix mypy issues * comment * rename for clarity * Add a scenario test case * this isn't optional anymore * FIX test log level * Re-adding failing tests * [ISSUE airbytehq#28893] improve inferrence to consider multiple types per value * Automated Commit - Formatting Changes * [ISSUE airbytehq#28893] remove InferenceType.PRIMITIVE_AND_COMPLEX_TYPES * Code review * Automated Commit - Formatting Changes * fix unit tests --------- Co-authored-by: maxi297 <maxime@airbyte.io> Co-authored-by: maxi297 <maxi297@users.noreply.github.com>
harrytou
pushed a commit
to KYVENetwork/airbyte
that referenced
this issue
Sep 1, 2023
* [ISSUE airbytehq#28893] infer csv schema * [ISSUE airbytehq#28893] align with pyarrow * Automated Commit - Formatting Changes * [ISSUE airbytehq#28893] legacy inference and infer only when needed * [ISSUE airbytehq#28893] fix scenario tests * [ISSUE airbytehq#28893] using discovered schema as part of read * [ISSUE airbytehq#28893] self-review + cleanup * [ISSUE airbytehq#28893] fix test * [ISSUE airbytehq#28893] code review part #1 * [ISSUE airbytehq#28893] code review part #2 * Fix test * formatcdk * [ISSUE airbytehq#28893] code review * FIX test log level * Re-adding failing tests * [ISSUE airbytehq#28893] improve inferrence to consider multiple types per value * Automated Commit - Formatting Changes * add file adapters for avro, csv, jsonl, and parquet * fix try catch * pr feedback with a few additional default options set * fix things from the rebase of master --------- Co-authored-by: maxi297 <maxime@airbyte.io> Co-authored-by: maxi297 <maxi297@users.noreply.github.com>
harrytou
pushed a commit
to KYVENetwork/airbyte
that referenced
this issue
Sep 1, 2023
…ytehq#29342) * [ISSUE airbytehq#28893] infer csv schema * [ISSUE airbytehq#28893] align with pyarrow * Automated Commit - Formatting Changes * [ISSUE airbytehq#28893] legacy inference and infer only when needed * [ISSUE airbytehq#28893] fix scenario tests * [ISSUE airbytehq#28893] using discovered schema as part of read * [ISSUE airbytehq#28893] self-review + cleanup * [ISSUE airbytehq#28893] fix test * [ISSUE airbytehq#28893] code review part #1 * [ISSUE airbytehq#28893] code review part #2 * Fix test * formatcdk * [ISSUE airbytehq#28893] code review * FIX test log level * Re-adding failing tests * [ISSUE airbytehq#28893] improve inferrence to consider multiple types per value * set decimal_as_float to True * update * Automated Commit - Formatting Changes * add file adapters for avro, csv, jsonl, and parquet * fix try catch * update * format * pr feedback with a few additional default options set --------- Co-authored-by: maxi297 <maxime@airbyte.io> Co-authored-by: maxi297 <maxi297@users.noreply.github.com> Co-authored-by: brianjlai <brian.lai@airbyte.io>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
What
Add an option to infer the data types when of a CSV stream even if the user has not provided a schema.
The existing S3 source does this with pyarrow, and most existing users set this option to True so we shouldn't let this be a breaking change.
This ticket is extracted from #28133.
Proposal 1: Use pyarrow to infer the datatype if the option is set to true
Proposal 2: Write our own inference algorithm
We can infer the type of a column by reading the first N rows of a file. For each column:
This approach will require validating the output with existing sources to ensure backward compatibility
Acceptance criteria
The text was updated successfully, but these errors were encountered: