[Ray Data] Add filtering and column pruning when reading from BigQuery table #48821

Open
PetrZhitnikov opened this issue Nov 20, 2024 · 0 comments
Labels: data (Ray Data-related issues), enhancement (Request for new feature and/or capability), triage (Needs triage, e.g. priority, bug/not-bug, and owning component)

Comments

@PetrZhitnikov
Description

It would be great if `ray.data.read_bigquery` accepted row filters and a list of columns to read when reading from a BigQuery table.

Use case

Existing implementation

Currently, I can run code like this to read only selected columns, and only the rows matching a filter, from a table:

import ray
ds = ray.data.read_bigquery(
    project_id="my_project",
    query="""
        SELECT station_number, mean_temp
        FROM `bigquery-public-data.samples.gsod`
        where year = 1940 and month = 1 and day = 1
    """,
)

However, this runs the query and creates a temporary table, which introduces extra cost and a delay before reading starts.

Proposed option

On the other hand, the BigQuery Read API supports passing filters and fields directly in the read request against the existing table, via TableReadOptions (parameters selected_fields[] and row_restriction).
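For reference, here is a minimal sketch of such a read request made directly with the `google-cloud-bigquery-storage` Python client, outside of Ray. The helper names `table_path` and `create_filtered_read_session` are made up for illustration, and the sketch assumes the client library is installed and credentials are configured. It is not how Ray currently builds its read sessions.

```python
from typing import List, Optional


def table_path(project: str, dataset: str, table: str) -> str:
    """Build the fully qualified table resource name used by the Read API."""
    return f"projects/{project}/datasets/{dataset}/tables/{table}"


def create_filtered_read_session(
    billing_project: str,
    table: str,
    selected_fields: Optional[List[str]] = None,
    row_restriction: str = "",
):
    """Open a read session that prunes columns and filters rows server-side.

    Hypothetical helper: `billing_project` is the project billed for the read,
    `table` is a path as returned by `table_path`.
    """
    # Imported lazily so `table_path` stays usable without the client library.
    from google.cloud.bigquery_storage import BigQueryReadClient, types

    read_options = types.ReadSession.TableReadOptions(
        selected_fields=selected_fields or [],
        row_restriction=row_restriction,
    )
    requested_session = types.ReadSession(
        table=table,
        data_format=types.DataFormat.ARROW,
        read_options=read_options,
    )
    client = BigQueryReadClient()
    return client.create_read_session(
        parent=f"projects/{billing_project}",
        read_session=requested_session,
        max_stream_count=1,
    )


# Example call (needs credentials, so it is left commented out):
# session = create_filtered_read_session(
#     billing_project="my_project",
#     table=table_path("bigquery-public-data", "samples", "gsod"),
#     selected_fields=["station_number", "mean_temp"],
#     row_restriction="year = 1940 AND month = 1 AND day = 1",
# )
```

No query job runs and no temporary table is created in this path; the filter and column list travel with the read session request itself.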

What I would like is an interface like this:

import ray
ds = ray.data.read_bigquery(
    project_id="my_project",
    dataset="bigquery-public-data.samples.gsod",
    selected_fields=["station_number", "mean_temp"],
    row_restriction="year = 1940 and month = 1 and day = 1",
)

These new parameters would be propagated down to the BigQuery Read API read request. In that case, data would stream directly from the existing table, without the extra cost and time spent creating an intermediate table.
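One open question is how the new parameters would interact with `query`. Below is a rough sketch of the argument handling I have in mind, assuming the filters only make sense for direct table reads via `dataset`. `TableReadRequest` and `resolve_table_read` are hypothetical names, not existing Ray code, and the exact rules would be up to the maintainers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class TableReadRequest:
    """Hypothetical container for the options of a direct table read."""

    dataset: str
    selected_fields: List[str] = field(default_factory=list)
    row_restriction: str = ""


def resolve_table_read(
    query: Optional[str] = None,
    dataset: Optional[str] = None,
    selected_fields: Optional[List[str]] = None,
    row_restriction: Optional[str] = None,
) -> Optional[TableReadRequest]:
    """Return a TableReadRequest for a direct table read, or None for the query path."""
    if query is not None and dataset is not None:
        raise ValueError("Provide either `query` or `dataset`, not both.")
    if query is None and dataset is None:
        raise ValueError("Provide one of `query` or `dataset`.")

    if query is not None:
        # With a SQL query, filtering and projection belong in the SQL itself.
        if selected_fields or row_restriction:
            raise ValueError(
                "`selected_fields` and `row_restriction` apply only when "
                "reading a table via `dataset`."
            )
        return None

    # Direct table read: carry the pushdown options to the read request.
    return TableReadRequest(
        dataset=dataset,
        selected_fields=list(selected_fields or []),
        row_restriction=row_restriction or "",
    )
```

With this shape, existing calls that pass only `query` or only `dataset` would behave as before, and the new arguments would be opt-in.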

@PetrZhitnikov PetrZhitnikov added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Nov 20, 2024
@jcotant1 jcotant1 added the data Ray Data-related issues label Nov 20, 2024