Standardize airflow build process and switch to Hatchling build backend #36537

Merged · 1 commit · Jan 10, 2024

Conversation

@potiuk (Member) commented Jan 2, 2024

This PR changes the Airflow installation and build backend to use the new
standard Python ways of building Python applications.

We have been trying to do this for quite a while. Airflow traditionally
used a complex and convoluted build process based on setuptools
and an (extremely) custom setup.py file. That process survived the
migration to Airflow 2.0, the split of the Airflow monorepo into
Airflow and Providers, the addition of pre-installed providers, and the
switch of providers to flit (following the build standards).

So far, tooling in the Python ecosystem had not been able to fulfill our
needs, and we refrained from developing our own tooling. Finally, with
the appearance of Hatch (managed by the Python Packaging Authority) and
a few recent advancements there, we are able to switch to the standard
Python ways of managing project dependency configuration and project
build setup (with a few customizations).

This PR makes the Airflow build process follow these standard PEPs:

  • Airflow has all build configuration stored in pyproject.toml,
    following PEP 518, which allows any frontend (pip, poetry,
    hatch, flit, or whatever other frontend is used) to
    install the required build dependencies, install Airflow
    locally, and build distribution packages (sdist/wheel) -
    see the sketch after this list.

  • The Hatchling backend follows PEP 517 for a standard source tree
    and build backend implementation that allows the build to be
    executed in a frontend-independent way.

  • We store all project metadata in pyproject.toml, following
    PEP 621, where all the necessary project metadata components
    are defined.

  • We plug into Hatchling's "editable build" hooks, following
    PEP 660. Hatchling internally builds an editable wheel that
    is used as an ephemeral step and as communication between backend
    and frontend (this ephemeral wheel is used to make an
    editable installation of the project - suitable for fast
    iteration on code without reinstalling the package).
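
A minimal sketch of what the relevant pyproject.toml sections look like
under these PEPs (illustrative and abbreviated - not the full Airflow
file; the exact pins and fields in the real file differ):

```toml
[build-system]
# PEP 518: build requirements that any frontend can install;
# PEP 517: the backend that actually performs the build.
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
# PEP 621: standard project metadata.
name = "apache-airflow"
description = "Programmatically author, schedule and monitor workflows"
requires-python = ">=3.8"
# Fields computed at build time (such as version) are declared dynamic.
dynamic = ["version"]

[tool.hatch.build.hooks.custom]
# Custom build hook (see hatch_build.py in this PR) that adjusts
# wheel metadata during the build.
path = "hatch_build.py"
```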

With Airflow having many provider packages in a single source tree,
where we want to be able to install and develop Airflow and the
providers together, it is no small feat to implement the case where
the editable installation has to behave quite a bit differently from
the packaged one when it comes to packaging and dependencies: in an
editable install you want to edit the sources directly, while in an
installable package you want a separate Airflow package and provider
packages. Fortunately, the standardisation efforts in the Python
Packaging community, and the tooling implementing them, have finally
made this possible.

Some of the important ways this has been achieved:

  • We continue using provider.yaml in providers as the single source
    of truth for per-provider dependencies. We added the possibility
    to specify "devel-dependencies" in provider.yaml so that all
    per-provider dependencies in generated/provider_dependencies.json
    and pyproject.toml are generated from those dependencies via the
    update-providers-dependencies pre-commit.

  • Pyproject.toml is generally managed manually, but the part where
    provider dependencies and bundle dependencies are used is
    automatically updated by a pre-commit whenever provider
    dependencies change. Those generated provider extras contain
    just the dependencies of the providers - not the provider
    packages - but in the final "standard" wheel file they are
    replaced with "apache-airflow-providers-PROVIDER" dependencies,
    so that the wheel package will only install the provider and use
    the dependencies of the provider version it installs.

  • We are utilising custom Hatchling build hooks (PEP 660 standard)
    that allow the 'standard' wheel package to be modified on-the-fly
    while the wheel is being prepared: by adding pre-installed package
    dependencies (which are not needed in an editable build) and by
    removing all devel extras (which are not needed in the PyPI-
    distributed wheel package). This solves the conundrum of having
    different "editable" and "standard" behaviour while keeping the
    same project specification in pyproject.toml (a sketch of such a
    hook follows this list).

  • We added a description of how Hatch can be employed as a build
    frontend in order to manage a local virtualenv and install Airflow
    in an editable way easily - while keeping all the properties of the
    installed application (including a working airflow CLI and
    package metadata discovery) - as well as how to use PEP-standard
    ways of building wheel and sdist packages.

  • We have a custom step (following the PEP standards) to inject
    airflow-specific build steps: compiling the www assets and
    generating the git commit hash version to display in the UI.

  • We also show how all of this makes it easy to manage local
    virtualenvs and editable installations for Airflow contributors -
    without vendor lock-in of the build tools. By following the
    standard PEPs, Airflow can be locally and editably installed by
    anyone using any build frontend tool that follows the standards -
    whether you use pip, poetry, Hatch, flit or any other frontend
    build tool, local Airflow installation and package building will
    work the same way for all of them, with both "editable" and
    "standard" package preparation managed by the hatchling backend
    in the same way.

  • Previously our extras contained a ".", which is not a normalized
    name for extras - pip and other tools replaced it automatically
    with "_". This change updates the extra names to contain
    "-" rather than "." in the name, following PEP 685 (for example,
    "cncf.kubernetes" becomes "cncf-kubernetes"). This should be
    fully backwards compatible: users will still be able to use "." but
    it will be normalized to "-" in Airflow packages. This is also
    future-proof, as it is expected that all package managers and tools
    will eventually apply PEP 685 to extras, even if currently
    some of the tools (pip + setuptools) might generate warnings.

  • Additionally, this change organizes the documentation around
    the extras and dependencies, explaining the reasoning behind
    all the different extras we have.

  • As a bonus (and this is what we used to test it all), we are
    documenting how to use the Hatch frontend to:

    • manage multiple Python installations
    • manage multiple Python virtualenv environments
    • build Airflow packages for release management

    (see the examples after this list)
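
For illustration, the kind of frontend-agnostic workflow this enables
(a sketch - the `devel` extra and the Hatch subcommands shown are the
standard ones, but exact invocations may vary with your setup):

```bash
# Editable install with any PEP 517/660 frontend (pip shown here):
pip install -e ".[devel]"

# Or let Hatch manage the environment as the frontend:
hatch env create   # create the project virtualenv
hatch shell        # enter it, with the airflow CLI working

# Build sdist and wheel in a frontend-independent way:
python -m build
# ...or equivalently with Hatch:
hatch build
```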

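A simplified sketch of the custom build-hook mechanism described above
(illustrative - not Airflow's actual hatch_build.py; it assumes a
Hatchling version whose wheel builder honors the `dependencies` key in
`build_data`, and the provider list below is a placeholder):

```python
from __future__ import annotations

from typing import Any

from hatchling.builders.hooks.plugin.interface import BuildHookInterface

# Placeholder - the real list is generated from the provider.yaml files.
PRE_INSTALLED_PROVIDERS = [
    "apache-airflow-providers-common-sql",
    "apache-airflow-providers-sqlite",
]


class CustomBuildHook(BuildHookInterface):
    """Adjust wheel metadata depending on the kind of build."""

    def initialize(self, version: str, build_data: dict[str, Any]) -> None:
        # Hatchling passes "standard" for regular wheels and
        # "editable" for PEP 660 editable wheels.
        if version == "standard":
            # The distributed wheel depends on real provider packages;
            # the editable build keeps the flattened dependency lists.
            build_data["dependencies"].extend(PRE_INSTALLED_PROVIDERS)
```
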
Fixes: #30764

@boring-cyborg bot added the area:API, area:dev-tools, area:production-image, area:providers and provider:fab labels on Jan 2, 2024
@potiuk (Member, Author) commented Jan 2, 2024

The failures are expected. I'm still solving some problems, but all the big ones are solved, and I have successfully run it locally both on Mac and Linux in various combinations - I am really confident it works :)

Review thread on .github/workflows/ci.yml (outdated, resolved)
@potiuk (Member, Author) commented Jan 4, 2024

BTW, after a good night's sleep I got an idea and figured out a way to simplify it even more - I think I can make it so that I do not need an editable version of the dependencies at all (following the pattern I used for the preinstalled dependencies). I think our pyproject.toml should only contain the provider dependencies in the [provider] extras, and I will be able to dynamically replace them with apache-airflow-providers-PROVIDER in the .whl (the standard installation also uses the same standard wheel).

This way there will be literally zero changes to the contributor workflow.

This one will install the amazon dependencies:

pip install -e ".[amazon]"

But these will install the amazon provider:

pip install ".[amazon]"

pip install "apache-airflow[amazon]"

I will try to get it working today actually :) @uranusjr

@potiuk potiuk force-pushed the switch-to-hatch branch 2 times, most recently from a1b654d to ccbe5a5 Compare January 5, 2024 23:52
@jscheffl (Contributor) left a comment

Mainly I was reading the docs without testing locally whether it all works - anyway, this is in DRAFT state and CI is red at the moment. But I wanted to leave early feedback. Otherwise, while reading, I am now joining the "convinced" crowd :-D

Review threads (all resolved): CONTRIBUTING.rst, INSTALL (2), LOCAL_VIRTUALENV.rst, airflow/providers/fab/provider.yaml, hatch_build.py, pyproject.toml (2), scripts/in_container/run_prepare_airflow_packages.py, setup.cfg.back
@potiuk potiuk force-pushed the switch-to-hatch branch 7 times, most recently from f04a2a9 to 9f504b1 Compare January 6, 2024 15:50
@potiuk (Member, Author) commented Jan 6, 2024

Thanks @jscheffl for the VERY THOROUGH review. I addressed most comments (and the new iteration is way simpler and automatically addressed some of your comments regardless). I left a few conversations unresolved where things should be fixed before "undrafting" it.

Pushed a new version; the next step for this PR is to make the build green.

potiuk added a commit to potiuk/airflow that referenced this pull request Jan 6, 2024
Breeze auto-detects whether it should upgrade itself, based on
finding the Airflow directory it is in and calculating the hash of
the pyproject.toml it uses. Finding the Airflow sources to
act on used to rely on Airflow's setup.cfg and checking the package
name inside, but since we are about to remove setup.cfg and
move all project configuration to pyproject.toml (see apache#36537), this
mechanism will stop working.

This PR changes it to just check whether an `airflow` subdir is present
and contains an `__init__.py` with "airflow" inside. That should be
"good enough" and fast, and it should also be backwards compatible
in case a new Breeze is used in older Airflow sources.
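
A minimal sketch of the detection logic described above (a hypothetical
helper - Breeze's actual implementation may differ):

```python
from pathlib import Path


def looks_like_airflow_sources(directory: Path) -> bool:
    """Heuristic: an Airflow source tree has an `airflow` subdir whose
    `__init__.py` mentions "airflow"."""
    init_py = directory / "airflow" / "__init__.py"
    try:
        return init_py.is_file() and "airflow" in init_py.read_text()
    except OSError:
        return False
```
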
potiuk added a commit to potiuk/airflow that referenced this pull request Jan 6, 2024
potiuk added a commit to potiuk/airflow that referenced this pull request Jan 6, 2024
potiuk added a commit that referenced this pull request Jan 6, 2024
potiuk added a commit that referenced this pull request Jan 13, 2024
As part of the extraction from #36537 I noticed that mysql-connector-python
also causes a lot of backtracking.

This PR bumps the minimum version of mysql-connector-python from
REALLY OLD (2018) to JUST OLD (2022).

(cherry picked from commit c29632a)
potiuk added a commit that referenced this pull request Jan 13, 2024
…nd (#36537)

@ephraimbuddy added the type:misc/internal label on Jan 15, 2024
ephraimbuddy pushed a commit that referenced this pull request Jan 15, 2024
The `graphviz` dependency has been problematic as a required Airflow
dependency - especially for ARM-based installations. The graphviz
package requires binary graphviz libraries - which is already a
limitation - but it also requires the graphviz Python
bindings to be built and installed. This does not work for older
Linux installations and - more importantly - when you try
to install the graphviz libraries for Python 3.8 or 3.9 on ARM M1
MacBooks, the packages fail to install, because the Python bindings
compilation for M1 only works for Python 3.10+.

There is no easy solution for that except commenting out the
graphviz dependency from setup.py when you want to install Airflow
for Python 3.8 or 3.9 on a MacBook M1.

However, graphviz is really used in two places:

* when you want to render DAGs via the airflow CLI - either to an image
  or directly to the terminal (for terminals/systems supporting imgcat)

* when you want to render the ER diagram after you have modified Airflow
  models

The latter is a development-only feature; the former is a production
feature, but a very niche one.

This PR turns rendering of the images in Airflow into an optional feature
(only working when the graphviz Python bindings are installed) and
effectively turns graphviz into an optional extra (and removes it
from the requirements).

This is technically not a breaking change - the CLIs to render the
DAGs are still there, and IF you already have graphviz installed, they
will continue working as they did before. The only case where it does
not work is a fresh installation without graphviz, and there it will
raise an error informing you that you need it.
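
The optional-import pattern this describes, as a rough sketch
(illustrative - not Airflow's exact code; the function name and error
message are hypothetical):

```python
def render_dag_image(dag, filename: str) -> None:
    """Render a DAG graph - only available when graphviz is installed."""
    try:
        import graphviz  # now an optional extra, not a required dependency
    except ImportError:
        raise RuntimeError(
            "Rendering DAG images requires the optional graphviz package. "
            "Install it, e.g. with: pip install apache-airflow[graphviz]"
        )
    ...  # build and save the graph using the graphviz bindings
```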

Graphviz will remain installed for most users:

* the Airflow image will still contain the graphviz library, because
  it is added there as an extra
* when a previous version of Airflow has already been installed,
  the graphviz library is already there and Airflow will
  continue working as it did

The only change is for fresh installations of the new version of Airflow,
where graphviz will need to be specified as an extra
or installed separately in order to enable the DAG rendering option.

Taking into account this behaviour (which only requires installing
the graphviz package), this should not be considered a breaking
change.

Extracted from: #36537

(cherry picked from commit 89f1737)
ephraimbuddy pushed a commit that referenced this pull request Jan 15, 2024
Previously we limited the grpcio minimum version to stop `pip`
backtracking from happening, and we could not do it within the limits
of the spark provider, because some google dependencies used it and
conflicted with it. This problem is now gone as we have newer
versions of the google dependencies, so we can not only safely move
the limit to the spark provider but also bump it slightly higher to
reduce the amount of backtracking we need to do.

Extracted from #36537

(cherry picked from commit ded01a5)
ephraimbuddy pushed a commit that referenced this pull request Jan 15, 2024
While testing #36537 I noticed cohere backtracking quite a bit with
the older version. Bumping cohere to a more recent minimum version
(released in November) decreased it quite a bit.

Since cohere is a mostly standalone package, and we likely want to
nudge people to a later version anyway, it is safe to assume we can
bump the minimum version.

(cherry picked from commit 9797f92)
ephraimbuddy pushed a commit that referenced this pull request Jan 15, 2024
ephraimbuddy pushed a commit that referenced this pull request Jan 15, 2024
…nd (#36537)

potiuk added a commit to potiuk/airflow that referenced this pull request Feb 13, 2024
Airflow sdist packages have been broken by apache#37340 and fixed by
apache#37388, but we had not noticed, because the CI check for sdist
packages has been broken since apache#36537, where we standardized the
naming of the sdist packages to follow the modern syntax (and we
silently skipped installation because no providers were found).

This PR fixes it:

* changes the expected naming format to follow the new standard
* treats "no providers found" as an error

Treating "no providers" as success was useful at some point in time,
when we ran sdist checks as part of regular PRs and some PRs resulted
in a "no providers changed" condition. However, sdist verification now
only happens in the canary build (so all providers are affected), and
we also have an if condition in the job itself that skips the
installation step if there are no providers.
potiuk added a commit that referenced this pull request Feb 13, 2024
sunank200 pushed a commit to astronomer/airflow that referenced this pull request Feb 21, 2024
…37406)

abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
This is a regular bump of Amazon-provider-related dependencies.
The way botocore releases are done puts a lot of
strain on `pip` to resolve the right set of dependencies, including
long backtracking, when there are too many versions available.

Therefore, from time to time, we bump the minimum version of
Amazon-related dependencies to limit the impact of the frequent
releases of boto and botocore. It is also generally fine to update the
min version of dependencies for providers, because at the very least
users can still use previously released providers in case they
have a problem with those dependencies. Also, many of the updated
dependencies contain fixes and features we implicitly depend on, and
bumping them regularly is a good way to make sure all the
functionality of the Amazon provider works as expected.

Another reason for the bump is that as of version 1.33, the botocore
and boto3 versions stopped being shifted by 3 (previously boto3 1.28
corresponded to botocore 1.31). See boto/boto3#2702.

The watchtower min version is bumped to version 3 (which is 12 months
old, whereas before we opted for a much older one, more than 2 years
old); again, if users want to use an older version of watchtower, they
can opt for a previous provider version.

This change saves 5-6 minutes of backtracking when `pip` tries to
find the right versions of dependencies when upgrading to a newer
version.

Extracted from apache#36537
abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
…che#36698)

We had a REALLY old minimum version of Pandas set for our
pandas dependency - Pandas 0.17.1 was released in 2015 (!).

Looking at the dependency tree, most of our dependencies had
>1.2.5 set - which is a more than reasonable limit, as Pandas 1.2.5
was released in June 2021, more than 2.5 years ago.

This limit bump further helps us to limit the pip backtracking
that starts happening in certain situations.

Extracted from: apache#36537
abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
…nd (apache#36537)

abhishekbhakat pushed a commit to abhishekbhakat/my_airflow that referenced this pull request Mar 5, 2024
…37406)

@Taragolis Taragolis deleted the switch-to-hatch branch April 3, 2024 08:51