-
Notifications
You must be signed in to change notification settings - Fork 14.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Index on task instance #30762
Index on task instance #30762
Conversation
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
|
4baf36f
to
a8aae28
Compare
99f3760
to
285e4d6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good, I do not see any bad side effects of it. Anyone else wants to chime in?
af583c7
to
f3c363f
Compare
airflow/migrations/versions/0125_2_6_0_add_index_to_task_instance_table.py
Outdated
Show resolved
Hide resolved
f3c363f
to
57d0aa2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the airflow_version
part is figured out.
3a6cdc2
to
b8cf42c
Compare
Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions. |
Hi @babinos87 , I happened to be looking a bit at index utilization and noticed that this one isn't getting used on our cluster. I saw you shared a query that this benefits. Is that emitted from our codebase? Or is it just a custom query that you run? Thanks |
@dstandish this is a custom query, we have some custom code run by the scheduler to find the last successful dag run date, which runs this query and this index helps greatly. |
Thanks @babinos87 In that case, I don't think this belongs in airflow so I will plan to remove it. You can ensure that the index is there yourself as part of your infra process. Thanks |
Reverts apache#30762 Index is only helpful for a user's custom query -- not for airflow in general (see comment apache#30762 (comment)). I also observed that it also takes up as much (or more) space than the table itself. As such it doesn't belong in airflow OSS.
@dstandish this would make sense if there was a way to define custom sql schema / commands as part of the installation process. As far as I am aware, there is no such thing, and any manual changes could potentially break on db migration. I think it would be more appropriate to plan some work for this, so anyone could use it to define custom schema without risking of breaking the database between upgrades? In any case, I don't foresee any trouble caused by this extra index, and it would help if it stays - my use case is quite basic so I think that someone else could benefit from this too. |
You are free to propose somethign if you could ask - for example - a PR where you would allow anyone to (for example) plug -in a script that will be used as a step after migration and a way to add it to be able to add after-migration steps, that might be a good contribution that you could do. As anyone else in the community you can propose PRs there, and it's a nice and isolated one that you are probably one of the best people to test and have need for it, so it would be nice if you could propose one (or find / sponsor/ encourage someone who could contribute it if you do not feel like you have the right skills) |
I don't think it's likely to break anything. You can simply name it Indexes are not free. When you have more indexes, that is more storage used, more work for postgres on each update, and more for the query planner to sort through. It doesn't make sense to force all airflow users in the world to have an index that doesn't help them at all, just so that you can use it. I think we have like 9 indexes or something on task instance? That's getting quite high. We can't only add indexes, we must also try to remove ones that are not useful. Here's the other thing.... The indexes we include in airflow, you know, they might not be perfect for everyone's cluster. It's possible that some clusters might have an unusual usage pattern and maybe as a result they might need to add special index, even without having custom logic or queries. And even in that case it might not make sense to add such an index to airflow, given that doing so would be a cost for everyone and a benefit perhaps only to one. Right? |
Index is only helpful for a user's custom query -- not for airflow in general (see comment #30762 (comment)). Noticed that this query had zero scans over a period of months. I also observed that it also takes up as much space as the table itself. Since it's not generally useful, it doesn't belong in airflow OSS. Reverts #30762
Index is only helpful for a user's custom query -- not for airflow in general (see comment #30762 (comment)). Noticed that this query had zero scans over a period of months. I also observed that it also takes up as much space as the table itself. Since it's not generally useful, it doesn't belong in airflow OSS. Reverts #30762 (cherry picked from commit e20b400)
Index is only helpful for a user's custom query -- not for airflow in general (see comment #30762 (comment)). Noticed that this query had zero scans over a period of months. I also observed that it also takes up as much space as the table itself. Since it's not generally useful, it doesn't belong in airflow OSS. Reverts #30762 (cherry picked from commit e20b400)
The index this comment is talking about was removed in apache#30762.
The index this comment is talking about was removed in #30762.
The index this comment is talking about was removed in apache#30762.
Index is only helpful for a user's custom query -- not for airflow in general (see comment apache/airflow#30762 (comment)). Noticed that this query had zero scans over a period of months. I also observed that it also takes up as much space as the table itself. Since it's not generally useful, it doesn't belong in airflow OSS. Reverts #30762 GitOrigin-RevId: e20b400317ae4eb41181c5b0cee466eff768b521
Index is only helpful for a user's custom query -- not for airflow in general (see comment apache/airflow#30762 (comment)). Noticed that this query had zero scans over a period of months. I also observed that it also takes up as much space as the table itself. Since it's not generally useful, it doesn't belong in airflow OSS. Reverts #30762 GitOrigin-RevId: e20b400317ae4eb41181c5b0cee466eff768b521
Index is only helpful for a user's custom query -- not for airflow in general (see comment apache/airflow#30762 (comment)). Noticed that this query had zero scans over a period of months. I also observed that it also takes up as much space as the table itself. Since it's not generally useful, it doesn't belong in airflow OSS. Reverts #30762 GitOrigin-RevId: e20b400317ae4eb41181c5b0cee466eff768b521
Added index
ti_state_incl_start_date
ontaskinstance
table. This index seems to have great positive effect when querying this table in a setup with many millions such rows. Existing indexti_state_lkp
is not heavily utilized, as thestart_date
column (not part of the existing index) needs to be selected, hence a sequential scan will be performed in a very large table. Including thestart_date
column makes a use ofINDEX ONLY SCAN
, which improves performance dramatically in such situations.This change is applicable to PostgreSQL database backends only, as other databases do not support this feature.
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rst
or{issue_number}.significant.rst
, in newsfragments.