Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-49519][SQL] Merge options of table and relation when constructing FileScanBuilder #47996

Closed
wants to merge 8 commits into from

Conversation

liujiayi771
Copy link
Contributor

@liujiayi771 liujiayi771 commented Sep 5, 2024

What changes were proposed in this pull request?

Merge the options of both DataSourceV2Relation and FileTable when constructing FileScanBuilder.

Why are the changes needed?

Currently, the subclass of FileTable only uses the options from relation when constructing the FileScanBuilder, which leads to the omission of the contents in FileTable.options. For the TableCatalog, the dsOptions can be set into the FileTable.options returned by the TableCatalog.loadTable method. If only the relation options are used here, the TableCatalog will not be able to pass dsOptions that contains table options to FileScan.

Merge the two options is a better option.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Sep 5, 2024
@liujiayi771
Copy link
Contributor Author

Hi @cloud-fan. Could you help to review?

CSVScanBuilder(sparkSession, fileIndex, schema, dataSchema, options)
override def newScanBuilder(options: CaseInsensitiveStringMap): CSVScanBuilder = {
val finalOptions = this.options.asCaseSensitiveMap().asScala.filterNot {
case (k, _) => options.containsKey(k)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this? I think combining two maps will respect the keys in the later map.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove the filter and combining directly.

case (k, _) => options.containsKey(k)
}.toMap ++ options.asScala.toMap
CSVScanBuilder(sparkSession, fileIndex, schema, dataSchema,
new CaseInsensitiveStringMap(finalOptions.asJava))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do other formats have the same issue?

Copy link
Contributor Author

@liujiayi771 liujiayi771 Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan orc/parquet/json/textTable do not combine the options of table and relation, while jdbcTable does merge them. I think we can add merging logic for all of them. How do you think?

@github-actions github-actions bot added the AVRO label Sep 5, 2024
@liujiayi771 liujiayi771 changed the title [SPARK-49519][SQL] Combine options of table and relation when constructing CSVScanBuilder [SPARK-49519][SQL] Merge options of table and relation when constructing FileScanBuilder Sep 5, 2024
@cloud-fan
Copy link
Contributor

Is it possible to add a test?

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above, it needs a test case for the following claim. Please add it to protect your contribution from any potential future regressions, @liujiayi771 .

If only the relation options are used here, the TableCatalog will not be able to pass dsOptions that contains table options to FileScan.

@liujiayi771
Copy link
Contributor Author

Okay, I will add a test case, thanks.

@liujiayi771
Copy link
Contributor Author

@cloud-fan @dongjoon-hyun I have added a test case, could you help to review?

@adrian-wang
Copy link
Contributor

+1, LGTM

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @liujiayi771 , @cloud-fan , @adrian-wang .
Merged to master for Apache Spark 4.0.0-preview2.

@dongjoon-hyun
Copy link
Member

To @liujiayi771 , FYI, Apache Spark 4.0.0-preview2 RC1 will be released next Monday.

@liujiayi771 liujiayi771 deleted the csv-options branch September 12, 2024 01:10
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…ing FileScanBuilder

### What changes were proposed in this pull request?
Merge the options of both `DataSourceV2Relation` and `FileTable` when constructing `FileScanBuilder`.

### Why are the changes needed?
Currently, the subclass of `FileTable` only uses the options from relation when constructing the `FileScanBuilder`, which leads to the omission of the contents in `FileTable.options`. For the `TableCatalog`, the `dsOptions` can be set into the `FileTable.options` returned by the `TableCatalog.loadTable` method. If only the relation options are used here, the `TableCatalog` will not be able to pass `dsOptions` that contains table options to `FileScan`.

Merge the two options is a better option.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47996 from liujiayi771/csv-options.

Lead-authored-by: joey.ljy <joey.ljy@alibaba-inc.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
…ing FileScanBuilder

### What changes were proposed in this pull request?
Merge the options of both `DataSourceV2Relation` and `FileTable` when constructing `FileScanBuilder`.

### Why are the changes needed?
Currently, the subclass of `FileTable` only uses the options from relation when constructing the `FileScanBuilder`, which leads to the omission of the contents in `FileTable.options`. For the `TableCatalog`, the `dsOptions` can be set into the `FileTable.options` returned by the `TableCatalog.loadTable` method. If only the relation options are used here, the `TableCatalog` will not be able to pass `dsOptions` that contains table options to `FileScan`.

Merge the two options is a better option.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47996 from liujiayi771/csv-options.

Lead-authored-by: joey.ljy <joey.ljy@alibaba-inc.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants