Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index on task instance #30762

Merged
merged 1 commit into from
May 8, 2023
Merged

Conversation

babinos87
Copy link
Contributor

@babinos87 babinos87 commented Apr 20, 2023

Added index ti_state_incl_start_date on taskinstance table. This index seems to have great positive effect when querying this table in a setup with many millions such rows. Existing index ti_state_lkp is not heavily utilized, as the start_date column (not part of the existing index) needs to be selected, hence a sequential scan will be performed in a very large table. Including the start_date column makes a use of INDEX ONLY SCAN, which improves performance dramatically in such situations.

This change is applicable to PostgreSQL database backends only, as other databases do not support this feature.


Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg
Copy link

boring-cyborg bot commented Apr 20, 2023

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@babinos87 babinos87 marked this pull request as draft April 20, 2023 10:15
@babinos87 babinos87 force-pushed the index_on_task_instance branch 2 times, most recently from 4baf36f to a8aae28 Compare April 20, 2023 10:30
@babinos87 babinos87 marked this pull request as ready for review April 20, 2023 10:32
@babinos87 babinos87 force-pushed the index_on_task_instance branch 2 times, most recently from 99f3760 to 285e4d6 Compare April 21, 2023 10:32
Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good, I do not see any bad side effects of it. Anyone else wants to chime in?

@babinos87 babinos87 force-pushed the index_on_task_instance branch 3 times, most recently from af583c7 to f3c363f Compare April 24, 2023 14:57
@babinos87 babinos87 force-pushed the index_on_task_instance branch from f3c363f to 57d0aa2 Compare April 24, 2023 19:27
Copy link
Member

@uranusjr uranusjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the airflow_version part is figured out.

@babinos87 babinos87 force-pushed the index_on_task_instance branch 3 times, most recently from 3a6cdc2 to b8cf42c Compare May 2, 2023 09:41
@potiuk potiuk merged commit 2af2ddc into apache:main May 8, 2023
@boring-cyborg
Copy link

boring-cyborg bot commented May 8, 2023

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

@jedcunningham jedcunningham added this to the Airflow 2.6.2 milestone Jun 5, 2023
@jedcunningham jedcunningham added the type:bug-fix Changelog: Bug Fixes label Jun 5, 2023
@eladkal eladkal modified the milestones: Airflow 2.6.2, Airflow 2.7.0 Jun 8, 2023
@dstandish
Copy link
Contributor

Hi @babinos87 , I happened to be looking a bit at index utilization and noticed that this one isn't getting used on our cluster. I saw you shared a query that this benefits. Is that emitted from our codebase? Or is it just a custom query that you run? Thanks

@babinos87
Copy link
Contributor Author

@dstandish this is a custom query, we have some custom code run by the scheduler to find the last successful dag run date, which runs this query and this index helps greatly.

@dstandish
Copy link
Contributor

Thanks @babinos87

In that case, I don't think this belongs in airflow so I will plan to remove it. You can ensure that the index is there yourself as part of your infra process.

Thanks

dstandish added a commit to astronomer/airflow that referenced this pull request Jan 11, 2024
Reverts apache#30762

Index is only helpful for a user's custom query -- not for airflow in general (see comment apache#30762 (comment)).  I also observed that it also takes up as much (or more) space than the table itself. As such it doesn't belong in airflow OSS.
@babinos87
Copy link
Contributor Author

babinos87 commented Jan 12, 2024

@dstandish this would make sense if there was a way to define custom sql schema / commands as part of the installation process. As far as I am aware, there is no such thing, and any manual changes could potentially break on db migration. I think it would be more appropriate to plan some work for this, so anyone could use it to define custom schema without risking of breaking the database between upgrades?

In any case, I don't foresee any trouble caused by this extra index, and it would help if it stays - my use case is quite basic so I think that someone else could benefit from this too.

@potiuk
Copy link
Member

potiuk commented Jan 12, 2024

I think it would be more appropriate to plan some work for this, so anyone could use it to define custom schema without risking of breaking the database between upgrades? In any case, I don't foresee any trouble caused by this extra index, and it would help if it stays - my use case is quite basic so I think that someone else could benefit from this too.

You are free to propose somethign if you could ask - for example - a PR where you would allow anyone to (for example) plug -in a script that will be used as a step after migration and a way to add it to be able to add after-migration steps, that might be a good contribution that you could do.

As anyone else in the community you can propose PRs there, and it's a nice and isolated one that you are probably one of the best people to test and have need for it, so it would be nice if you could propose one (or find / sponsor/ encourage someone who could contribute it if you do not feel like you have the right skills)

@dstandish
Copy link
Contributor

@dstandish this would make sense if there was a way to define custom sql schema / commands as part of the installation process. As far as I am aware, there is no such thing, and any manual changes could potentially break on db migration. I think it would be more appropriate to plan some work for this, so anyone could use it to define custom schema without risking of breaking the database between upgrades?

In any case, I don't foresee any trouble caused by this extra index, and it would help if it stays - my use case is quite basic so I think that someone else could benefit from this too.

I don't think it's likely to break anything. You can simply name it my_company_index_blah_blah and we won't clash with it. Sure if we drop a column that it uses (unlikely) then you might have a problem. But I think that's I think a risk that you must accept given that you are doing something custom with the scheduler that requires a special index.

Indexes are not free. When you have more indexes, that is more storage used, more work for postgres on each update, and more for the query planner to sort through. It doesn't make sense to force all airflow users in the world to have an index that doesn't help them at all, just so that you can use it. I think we have like 9 indexes or something on task instance? That's getting quite high. We can't only add indexes, we must also try to remove ones that are not useful.

Here's the other thing.... The indexes we include in airflow, you know, they might not be perfect for everyone's cluster. It's possible that some clusters might have an unusual usage pattern and maybe as a result they might need to add special index, even without having custom logic or queries. And even in that case it might not make sense to add such an index to airflow, given that doing so would be a cost for everyone and a benefit perhaps only to one. Right?

dstandish added a commit that referenced this pull request Jan 13, 2024
Index is only helpful for a user's custom query -- not for airflow in general (see comment #30762 (comment)).  Noticed that this query had zero scans over a period of months.  I also observed that it also takes up as much space as the table itself.  Since it's not generally useful, it doesn't belong in airflow OSS.

Reverts #30762
potiuk pushed a commit that referenced this pull request Jan 13, 2024
Index is only helpful for a user's custom query -- not for airflow in general (see comment #30762 (comment)).  Noticed that this query had zero scans over a period of months.  I also observed that it also takes up as much space as the table itself.  Since it's not generally useful, it doesn't belong in airflow OSS.

Reverts #30762

(cherry picked from commit e20b400)
ephraimbuddy pushed a commit that referenced this pull request Jan 15, 2024
Index is only helpful for a user's custom query -- not for airflow in general (see comment #30762 (comment)).  Noticed that this query had zero scans over a period of months.  I also observed that it also takes up as much space as the table itself.  Since it's not generally useful, it doesn't belong in airflow OSS.

Reverts #30762

(cherry picked from commit e20b400)
jedcunningham added a commit to astronomer/airflow that referenced this pull request May 7, 2024
The index this comment is talking about was removed in apache#30762.
jedcunningham added a commit that referenced this pull request May 7, 2024
The index this comment is talking about was removed in #30762.
pateash pushed a commit to pateash/airflow that referenced this pull request May 13, 2024
The index this comment is talking about was removed in apache#30762.
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Jul 18, 2024
Index is only helpful for a user's custom query -- not for airflow in general (see comment apache/airflow#30762 (comment)).  Noticed that this query had zero scans over a period of months.  I also observed that it also takes up as much space as the table itself.  Since it's not generally useful, it doesn't belong in airflow OSS.

Reverts #30762

GitOrigin-RevId: e20b400317ae4eb41181c5b0cee466eff768b521
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Sep 20, 2024
Index is only helpful for a user's custom query -- not for airflow in general (see comment apache/airflow#30762 (comment)).  Noticed that this query had zero scans over a period of months.  I also observed that it also takes up as much space as the table itself.  Since it's not generally useful, it doesn't belong in airflow OSS.

Reverts #30762

GitOrigin-RevId: e20b400317ae4eb41181c5b0cee466eff768b521
kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Nov 8, 2024
Index is only helpful for a user's custom query -- not for airflow in general (see comment apache/airflow#30762 (comment)).  Noticed that this query had zero scans over a period of months.  I also observed that it also takes up as much space as the table itself.  Since it's not generally useful, it doesn't belong in airflow OSS.

Reverts #30762

GitOrigin-RevId: e20b400317ae4eb41181c5b0cee466eff768b521
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants