Separate and split run job method into prepare/execute/complete steps #30308

Merged - 6 commits merged into apache:main from split-job-run-into-steps on Apr 11, 2023

Conversation

@potiuk (Member) commented Mar 26, 2023:

As a follow-up after decoupling of the job logic from the BaseJob
ORM object (#30255), the run method of BaseJob should also be
decoupled from it (allowing BaseJobPydantic to be passed) as well
as split into three steps, in order to allow db-less mode.

The "prepare" and "complete" steps of the run method are modifying
BaseJob ORM-mapped object, so they should be called over the
internal-api from LocalTask, DafFileProcessor and Triggerer running
in db-less mode. The "execute" method however does not need the
database however and should be run locally.
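For illustration only, here is a minimal sketch of what such a three-step split could look like. The function names, signatures and bodies are assumptions made for this description, not the actual implementation in this PR:

```python
from airflow.utils import timezone
from airflow.utils.session import NEW_SESSION, provide_session
from airflow.utils.state import State


@provide_session
def prepare_job_for_execution(job, *, session=NEW_SESSION):
    # Modifies the ORM-mapped BaseJob row, so in db-less mode this step
    # would have to go over the internal API.
    job.state = State.RUNNING
    job.start_date = timezone.utcnow()
    session.merge(job)
    session.commit()


def execute_job(job, execute_callable):
    # Pure logic with no database access - this can always run locally,
    # even in db-less mode.
    return execute_callable()


@provide_session
def complete_job_execution(job, *, session=NEW_SESSION):
    # Again modifies the ORM-mapped row, so it is internal-API bound in
    # db-less mode (simplified - the real step would account for failures).
    job.state = State.SUCCESS
    job.end_date = timezone.utcnow()
    session.merge(job)
    session.commit()
```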

This is not yet full AIP-44 conversion, this is a prerequisite to do
so - and AIP-44 conversion will be done as a follow-up after this one.

Closes: #30295



A Contributor commented on this diff snippet:

    session.commit()
    self.state = State.RUNNING
    self.start_date = timezone.utcnow()
    self.state = State.RUNNING
Duplicate

@ephraimbuddy added the changelog:skip label (Changes that should be skipped from the changelog - CI, tests, etc.) on Apr 11, 2023
@potiuk force-pushed the split-job-run-into-steps branch from 796b866 to 4d61689 on April 11, 2023 08:24
@potiuk (Member Author) commented Apr 11, 2023:

Ok. I updated it now and I hope it better explains why the split is needed and how we are going to use it when we move to AIP-44.

Note that it is not strictly needed now (as @uranusjr noticed) for the prepare/complete methods to be able to handle JobPydantic as input (and they don't get it yet) - this is in preparation for the final step of refactoring in #30325 and the follow-up steps to add @internal_api calls and make it work for the AIP-44 case.

Adding the JobPydantic option allowed me to see (see the comment in the current version) that I need to remove job_runner from job altogether (because currently I need to use "type: ignore [union-attr]" to get rid of a MyPy error).

I applied keyword-only session and :meta private: for the internal methods (also prefixed them with _) and extracted out retrieving the Job when you have just a JobPydantic (which we will use when the methods are used for the AIP-44 case).

Finally, I added a mermaid diagram showing the current Job lifecycle and how it will look when we implement AIP-44. This might help with reasoning about why we needed to separate prepare/complete and why they should also take JobPydantic as input.
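For illustration only, a minimal sketch of the patterns mentioned above (keyword-only session, :meta private:, and retrieving the ORM Job when you only have a JobPydantic). The helper name and body are hypothetical, not the code added in this PR:

```python
from __future__ import annotations

from sqlalchemy.orm import Session

from airflow.jobs.base_job import BaseJob
from airflow.jobs.pydantic.base_job import BaseJobPydantic
from airflow.utils.session import NEW_SESSION, provide_session


@provide_session
def _get_orm_job(job: BaseJob | BaseJobPydantic, *, session: Session = NEW_SESSION) -> BaseJob:
    """Return the ORM Job, looking it up by id when only a Pydantic representation is available.

    :meta private:
    """
    if isinstance(job, BaseJobPydantic):
        return session.query(BaseJob).filter(BaseJob.id == job.id).one()
    return job
```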

@potiuk force-pushed the split-job-run-into-steps branch from 4d61689 to 1acb079 on April 11, 2023 15:54
@potiuk force-pushed the split-job-run-into-steps branch from 1acb079 to f6ce7e5 on April 11, 2023 18:22
As a follow-up after decoupling of the job logic from the BaseJob
ORM object (apache#30255), the `run` method of BaseJob should also be
decoupled from it (allowing BaseJobPydantic to be passed) as well
as split into three steps, in order to allow db-less mode.

The "prepare" and "complete" steps of the `run` method are modifying
BaseJob ORM-mapped object, so they should be called over the
internal-api from LocalTask, DafFileProcessor and Triggerer running
in db-less mode. The "execute" method however does not need the
database however and should be run locally.

This is not yet full AIP-44 conversion, this is a prerequisite to do
so - and AIP-44 conversion will be done as a follow-up after this one.

However, we added a mermaid diagram showing the job lifecycle with and
without the Internal API, to make it easier to reason about.

Closes: apache#30295
@potiuk force-pushed the split-job-run-into-steps branch from f6ce7e5 to 0e96ca8 on April 11, 2023 18:24
@jedcunningham (Member) left a comment:

Thanks for iterating a few times here @potiuk.

Comment on lines +341 to +342:

    with create_session() as session:
        job.heartbeat(session=session)
A Member commented:
Do we need to create a session explicitly? Can't we just not pass one (and maybe comment that that is intentional)?

@potiuk (Member Author) replied on Apr 11, 2023:

This is an excellent question! And I have some likely good answers after becoming intimately familiar with the Job code now :). It was somewhat of a surprise for me to learn how things currently work, so I am happy to share my findings when asked (I wanted to do it later, but since you asked, here you go :):

This is how it was done originally (and I also added a note about it in the diagrams that it leads to multiple sessions being open at the same time). The problem is (and this is even in the current code) that at the place where heartbeat is executed, there is no session available from the upper-stack. I wanted to avoid any "logic" change in the series of refactors.

See the code here (this is before any of my refactorings):

    with create_session() as session:

That this was happening was also quite a surprise to me, and actually this is one of the cool things about this refactoring: such things become plainly obvious as we decouple the database code from the actual logic.

So as a result of this refactor I left two TODOs in the code (the heartbeat one might have been lost with the changes/rebases I've done):

  1. Maybe we should NOT keep the session open during the whole "execute_job"? We currently keep a session open, and possibly Connection, Variable and XCom retrieval could make use of it (but likely they do not, as the session is not passed down during execute):

    with create_session() as session:

I am not 100% sure if the session can be passed down from the current thread (thread local?) in our use of it, but I guess it must be passed down explicitly. In that case it seems we keep the session open while the task is running even though we are NOT USING it. This is quite a waste of the session - a TCP/IP connection is kept open, and for Postgres there is a process running on the other end (pgbouncer somewhat mitigates it) - yet we keep the session open while even an hour-long task is running and we cannot use that session!

But I deliberately wanted to keep the original behaviour, to make sure that my refactors are just that - refactors. I am painfully aware that mixing refactors and functionality changes is a very bad idea. However, once we cut off the 2.6 branch I am planning to explore that and possibly optimise the use of sessions there. The decoupling I've done will be rather helpful in doing it "cleanly", I think. And maybe we will find that we can vastly optimize the session/DB usage here, and maybe we won't need pgbouncer any more if we complete it? I certainly want to explore the consequences of changing the session allocation here. I think things might look a bit different in SchedulerJob, so any changes there should be done carefully.

  2. Then, the heartbeat session is kind of connected to the above. The heartbeat might simply not be able to get the session any other way, because the execute() method has no session to pass. So we are creating a new session for the heartbeat with:

    with create_session() as session:

    and I guess the idea for doing it without @provide_session was that the first, "only_if_necessary" part of the heartbeat check can be done without opening a session, so "with create_session" is there to create a session only when it is actually needed (see the sketch below). In my AIP-44 variant of heartbeat (now deleted) I split the heartbeat method into two to achieve the same with @provide_session, and possibly we can do something similar in the future, but (YAGNI) - maybe not, so I deleted it without looking back.
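For illustration only, a minimal sketch contrasting the two approaches described in point 2. The `_heartbeat_needed` check is a hypothetical stand-in for the "only_if_necessary" logic, not the actual heartbeat code:

```python
from airflow.utils import timezone
from airflow.utils.session import NEW_SESSION, create_session, provide_session


def _heartbeat_needed(job) -> bool:
    # Hypothetical placeholder for the "only_if_necessary" condition.
    return True


@provide_session
def heartbeat_with_provide_session(job, *, session=NEW_SESSION):
    # A session is obtained up front by the decorator, even when the
    # heartbeat turns out to be unnecessary.
    if not _heartbeat_needed(job):
        return
    job.latest_heartbeat = timezone.utcnow()
    session.merge(job)


def heartbeat_with_lazy_session(job):
    # The cheap check happens first; a session is created only when
    # there is actually something to write.
    if not _heartbeat_needed(job):
        return
    with create_session() as session:
        job.latest_heartbeat = timezone.utcnow()
        session.merge(job)
```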

I will need to confirm these findings with others (@ashb? @uranusjr - maybe I missed some session-sharing mechanisms here), but with the refactor I tried to carefully replicate what was originally in the code.

I hope the answer is not too long :D

potiuk and others added 5 commits April 11, 2023 22:14
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
@potiuk (Member Author) commented Apr 11, 2023:

Thanks for iterating a few times here @potiuk.

It's cool to bounce ideas off you and get good advice. I don't mind reworking my own stuff to eventually get it much better :D (and I learn better ways to rebase and resolve conflicts - it's been a long time since I stopped fearing even complex conflict resolution and reworking even big parts of a change, as long as I have mastered the tooling and have our extensive test harness to verify it at every stage). Fear of change is one of the worst reasons not to make a change.

@potiuk merged commit 73e0313 into apache:main on Apr 11, 2023
@potiuk deleted the split-job-run-into-steps branch on April 11, 2023 21:48
potiuk added a commit to potiuk/airflow that referenced this pull request Apr 11, 2023
This is the final step of decoupling of the job runner from the ORM
based BaseJob. After this change, we finally reach the state where
the BaseJob is just the state of the job being run, and all
the logic is kept in a separate "JobRunner" entity which just
keeps a reference to the job. It also makes sure that the
job in each runner is defined as appropriate for each job type:

* SchedulerJobRunner, BackfillJobRunner can only use BaseJob
* DagProcessorJobRunner, TriggererJobRunner and especially the
  LocalTaskJobRunner can keep both BaseJob and its Pydantic
  BaseJobPydantic representation - for AIP-44 usage.

The highlights of this change:

* Job does not have job_runner reference any more
* Job is a mandatory parameter when creating each JobRunner
* run_job method takes as parameters the job (i.e. where the state
  of the job is kept) and executor_callable - i.e. the method
  to run when the job gets executed
* the heartbeat callback is also passed as a generic callable in order
  to execute the post-heartbeat operation of each job
  type
* there is no more need to specify job_type when you create a
  BaseJob; the job gets its type simply by creating a runner
  with the job

This is the final stage of refactoring that was split into
reviewable stages: apache#30255 -> apache#30302 -> apache#30308 -> this PR.

Closes: apache#30325
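For illustration, a rough sketch of the shape described above; the signatures and the runner body are assumptions based on this commit message, not the exact final API:

```python
from typing import Callable


def run_job(job, executor_callable: Callable[[], None]) -> None:
    # Owns the job's state transitions (prepare -> execute -> complete);
    # the runner only supplies the actual work as a callable.
    ...


class TriggererJobRunner:
    def __init__(self, job) -> None:
        # The job is a mandatory constructor argument; creating the runner
        # is what determines the job's job_type.
        self.job = job

    def _execute(self) -> None:
        ...  # the actual trigger loop would live here

    def start(self) -> None:
        run_job(job=self.job, executor_callable=self._execute)
```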
potiuk added a commit that referenced this pull request Apr 11, 2023
Labels
AIP-44 (Airflow Internal API), area:API (Airflow's REST/HTTP API), area:CLI, area:Scheduler (including HA scheduler), area:webserver (Webserver related issues), changelog:skip (Changes that should be skipped from the changelog: CI, tests, etc.)
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

AIP-44 Split BaseJob run() method into "start", "execute", "stop"
6 participants