Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

let BigQueryGetData operator take a list of fields for the "order by" clause #39127

Closed
1 of 2 tasks
lopezvit opened this issue Apr 19, 2024 · 8 comments
Closed
1 of 2 tasks
Assignees
Labels
area:providers good first issue kind:feature Feature Requests provider:google Google (including GCP) related issues

Comments

@lopezvit
Copy link

Description

Sometimes you just need a the latest value of a field (e.g. updatedAt) so further operators downstream could use said value in their own query.
This can be done by SELECT MAX(updatedAt) [...] but that would required a lot of re-write, when simply adding a new param ordering_fields could solve the same issue, allowing to create a query similar to:
SELECT updatedAt FROM [...] LIMIT 1 ORDER BY updatedAt DESC

Example implementation (not tested):

def generate_query(self, hook: BigQueryHook) -> str:
    """Generate a SELECT query if for the given dataset and table ID."""
    query = "select "
    if self.selected_fields:
        query += self.selected_fields
    else:
        query += "*"
    query += (
        f" from `{self.table_project_id or hook.project_id}.{self.dataset_id}"
        f".{self.table_id}` limit {self.max_results}"
    )
    if self.ordering_fields:
        query += f" ORDER BY {self.ordering_fields}"
    return query

Use case/motivation

The operator BigQueryGetData should have 1 more params ordering_fields so the generated query would also include the ORDER BY clause.

Related issues

#24460

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@lopezvit lopezvit added kind:feature Feature Requests needs-triage label for new issues that we didn't triage yet labels Apr 19, 2024
@jscheffl jscheffl added provider:google Google (including GCP) related issues and removed needs-triage label for new issues that we didn't triage yet labels Apr 19, 2024
@shahar1
Copy link
Contributor

shahar1 commented Apr 20, 2024

@eladkal Could you please assign me?

@shahar1
Copy link
Contributor

shahar1 commented Apr 23, 2024

BigQueryGetData's generate_query() method is used for submitting deferred jobs, while normally, the operator utilizes the hook's method list_rows. The latter utilizes BigQuery's API directly, which does not support ordering clauses. Having the feature available only for queries the deferred operator makes is discouraged.

There are two options left:

  1. Running the explicit query (including the ORDER BY clause) via BigQueryInsertJobOperator.
  2. Performing the ordering in Python as part of the operator logic, but I'm not sure there is a real advantage compared to the alternative.

@eladkal What do you think?

@Lee-W
Copy link
Member

Lee-W commented Apr 24, 2024

1. Running the explicit query (including the ORDER BY clause) via `BigQueryInsertJobOperator`.

I might prefer this one. I think the performance would be better than 2

@lopezvit
Copy link
Author

But, how do you get the result when using BigQueryInsertJobOperator? As far as I understood it it returns the job_id, not the result 🤔

@shahar1
Copy link
Contributor

shahar1 commented Apr 26, 2024

But, how do you get the result when using BigQueryInsertJobOperator? As far as I understood it, it returns the job_id, not the result 🤔

You got me on that one, as indeed - it only returns job_id.
Using BigQueryInsertJobOperator, the only option is to write the results into a BigQuery table, so it might make sense to create an alternative for outputting the results into an XCOM instead.
One possible implementation would be by adding a flag to the current operator, but then it will also have to wait until the job is actually completed.
A better option, IMO, would be to create a dedicated operator that will return the results into the XCOM given a job_id, to keep operations atomic.

@Lee-W, What are your thoughts about this one?

@Lee-W
Copy link
Member

Lee-W commented Apr 29, 2024

A better option, IMO, would be to create a dedicated operator that will return the results into the XCOM given a job_id, to keep operations atomic.

This sounds to be a better solution 👍

@shahar1
Copy link
Contributor

shahar1 commented Apr 29, 2024

A better option, IMO, would be to create a dedicated operator that will return the results into the XCOM given a job_id, to keep operations atomic.

This sounds to be a better solution 👍

Thanks :)
I've been trying out different configurations, and after discussing with @eladkal we came to the conclusion that querying data by job_id could be merged with the existing BigQueryGetData operator, as logic is very similar (also in the deferrable mode).
I'll create a PR and tag you when I'm done.

@eladkal
Copy link
Contributor

eladkal commented May 17, 2024

Fixed in #39315

@eladkal eladkal closed this as completed May 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers good first issue kind:feature Feature Requests provider:google Google (including GCP) related issues
Projects
None yet
Development

No branches or pull requests

5 participants