
[AIRFLOW-2952] Splits CI into k8s + docker-compose #3797

Closed
wants to merge 10 commits into from

Conversation

dimberman
Contributor

@dimberman dimberman commented Aug 23, 2018

Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Since using docker-compose for everything was causing k8s integration
tests to die silently, this will determine whether a CI test is in k8s
or docker-compose mode
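
Such a dispatcher might look roughly like the following sketch. The `KUBERNETES_MODE` variable name and the commented-out script paths are illustrative assumptions, not code from this PR:

```shell
#!/usr/bin/env bash
# Sketch of a CI entrypoint that branches between the two test modes.
# KUBERNETES_MODE and the commented script names are assumptions.
set -e

run_ci() {
  if [ "${KUBERNETES_MODE:-}" = "true" ]; then
    echo "running kubernetes integration tests"
    # scripts/ci/kubernetes/run_tests.sh would go here
  else
    echo "running docker-compose tests"
    # docker-compose run airflow-testing would go here
  fi
}

run_ci
```

With a switch like this, the k8s job can no longer silently fall through to the docker-compose path, which is the failure mode described above.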

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • When adding new operators/hooks/sensors, the autoclass documentation generation needs to be added.

Code Quality

  • Passes git diff upstream/master -u -- "*.py" | flake8 --diff

@dimberman
Contributor Author

@bolkedebruin @Fokko @kaxil PTAL. I'm going to try to get this all working tonight (though I assume the tests should pass, based on the Travis run on my branch).

Kubernetes tests are silently running non-kubernetes airflow tests. This means that it will show up as passing as long as the non-kubernetes tests pass.

@dimberman
Contributor Author

ede6729#diff-354f30a63fb0907d4ad57269548329e3R43

Looks like this might not be so simple. Getting errors based on:
IOError: [Errno 13] Permission denied: u'/home/travis/.wheelhouse/bleach-2.1.4-py2.py3-none-any.whl'

@gerardo any idea how to get around these permission issues?

I have to run to an event but will revisit later tonight.

@gerardo
Contributor

gerardo commented Aug 24, 2018

@dimberman I'll have a look now

@gerardo
Contributor

gerardo commented Aug 24, 2018

@dimberman Given that the Kubernetes CI scripts run outside Docker, this line should be sudo chown -R travis.travis . $HOME/.wheelhouse/ $HOME/.cache/pip instead.
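
The suggested fix is a single recursive chown over the checkout and both caches. A runnable illustration of the pattern, using throwaway directories in place of the real Travis paths (on Travis the command runs under sudo as travis.travis):

```shell
# Illustration of the one-shot ownership fix. The temp directories stand
# in for the Travis checkout, $HOME/.wheelhouse and $HOME/.cache/pip.
set -e
demo="$(mktemp -d)"
mkdir -p "$demo/wheelhouse" "$demo/pip-cache"
# On Travis: sudo chown -R travis.travis . $HOME/.wheelhouse/ $HOME/.cache/pip
chown -R "$(id -u):$(id -g)" "$demo" "$demo/wheelhouse" "$demo/pip-cache"
# After the chown, pip can write wheels into the cache again:
touch "$demo/wheelhouse/bleach-2.1.4-py2.py3-none-any.whl"
echo "ok"
```

This addresses the `IOError: [Errno 13] Permission denied` on the wheelhouse path quoted earlier in the thread.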

I think we can expose minikube as just another service container inside the docker-compose setup, but for the sake of getting the K8S tests back up, it looks good.

@dimberman
Contributor Author

@gerardo No luck with that. Any other potential culprits?

@bolkedebruin
Contributor

Travis sometimes has this issue. Just make "sudo rm -rf " part of the script for one run. Also, don't rely on "travis" as the user name; use the right env var.

@gerardo
Contributor

gerardo commented Aug 29, 2018

Hey @dimberman, after your change, the build started failing in a different place: sudo: kadmin: command not found. This means tox is running the 2-setup-kdc.sh script.

I'm not sure what the best way to do this with tox is, but at this point we need to skip these scripts and run only the package installation steps, 5-run-tests.sh, 6-check-license.sh, and codecov.
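
One way to express that filtering, sketched with the script names from this thread (the loop and the SKIP list are assumptions about how a runner might be wired, not code from the PR):

```shell
# Sketch: iterate over the numbered CI scripts but skip KDC setup in
# Kubernetes mode. Only the script names come from the discussion above.
set -e
SKIP="2-setup-kdc.sh"
result=""
for script in 2-setup-kdc.sh 5-run-tests.sh 6-check-license.sh; do
  case " $SKIP " in
    *" $script "*) result="$result skip:$script" ;;   # kadmin unavailable here
    *)             result="$result run:$script" ;;
  esac
done
echo "$result"
```

A real integration would invoke the scripts instead of recording them, but the skip list is the essential part: it keeps tox from reaching `kadmin` on hosts that don't have it.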

@gerardo
Contributor

gerardo commented Aug 29, 2018

@dimberman I was trying to run the tests as-is inside our Docker image, but so far minikube doesn't seem to want to run inside Docker.

@gerardo
Contributor

gerardo commented Aug 29, 2018

For future reference. This looks good: https://github.com/kubernetes-sigs/kubeadm-dind-cluster

If you're an application developer, you may be better off with Minikube because it's more mature and less dependent on the local environment, but if you're feeling adventurous you may give kubeadm-dind-cluster a try, too. In particular you can run kubeadm-dind-cluster in CI environment such as Travis without having issues with nested virtualization.

@Fokko
Contributor

Fokko commented Aug 29, 2018

Nice one @gerardo

I'm also working on getting rid of tox, since we now have both docker-compose and tox, and they both act as a virtualisation layer.

@dimberman
Contributor Author

@gerardo Minikube definitely will not run inside Docker (there is such a thing as "Docker in Docker", but it's a rabbit hole we should avoid by any means necessary). Let me see if I can remove those earlier tasks.

Interesting! That looks really cool. I think that would be a great idea for a future PR to switch off of minikube.

@dimberman
Contributor Author

@Fokko @gerardo Quick update. I've still been running into weird minikube issues and have been unable to get the CI to build properly. This is now blocking me from implementing/PRing fixes for the k8s executor, and the bug reports are starting to pile up. Could we revert the dockerized CI and then re-merge it once we get it working with k8s?

I'm working with the k8s-kubeadm-dind guys as I think the best way forward might be to switch to that.

@gerardo
Contributor

gerardo commented Aug 30, 2018

@dimberman I can take a stab at making this work in a separate branch if you want. This is definitely a blocker, but reverting sounds like even more work.

@dimberman
Contributor Author

@gerardo I agree that it would be a pain, but it's going to REALLY hurt if we merge PRs for a couple of weeks and then can't track down what broke the k8s executor when it restarts. Definitely please try on a different branch.

.travis.yml Outdated
- TOX_ENV=py27-backend_sqlite-env_docker
- TOX_ENV=py27-backend_postgres-env_docker
- TOX_ENV=py35-backend_mysql-env_docker PYTHON_VERSION=3
- TOX_ENV=py35-backend_sqlite-env_ddocker PYTHON_VERSION=3
Contributor

There's a typo here

@dimberman
Contributor Author

@Fokko @bolkedebruin @gerardo I was able to get kubeadm to work with a local registry (that was a rough experience lol). I'm still running into some weird tox issues (like being unable to find Python 3.5), but progress!

@dimberman
Contributor Author

cc: @feng-tao @kaxil Just a warning: any PR merged right now is not being tested against Kubernetes.

@dimberman
Contributor Author

@gerardo OK, it's now solidly back in the court of "getting tox to work". Kubeadm is able to build and deploy. PTAL and let me know how we can get these to pass.

@gerardo
Contributor

gerardo commented Aug 31, 2018

@dimberman I'm trying to figure out the simplest changes that can get this to work. So far:

@gerardo
Contributor

gerardo commented Aug 31, 2018

@dimberman
Contributor Author

@gerardo OK, further progress. The main issue left is that it keeps attempting to compile the S3 tests even though it claims there is no moto (this is after I attempted to install moto both in tox and in Travis).

@gauthiermartin
Contributor

@dimberman Currently I'm having an issue while running ./docker/build.sh locally. There still seems to be an issue with SLUGIFY_USES_TEXT_UNIDECODE=yes when running the script locally. I know you added that env var in the travis-ci.yml file, but it is also required when running the script locally. Should we export it in the build.sh file?
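
If it does go into build.sh, a default-if-unset export would make local runs behave like CI without clobbering a value the caller already set. A sketch (only the variable name comes from this thread; the guard pattern is a suggestion):

```shell
# Sketch: default SLUGIFY_USES_TEXT_UNIDECODE to "yes" when build.sh is
# run outside Travis, preserving any value the caller already exported.
unset SLUGIFY_USES_TEXT_UNIDECODE   # simulate a fresh local shell here
export SLUGIFY_USES_TEXT_UNIDECODE="${SLUGIFY_USES_TEXT_UNIDECODE:-yes}"
echo "$SLUGIFY_USES_TEXT_UNIDECODE"
```

The `${VAR:-default}` expansion means the Travis config can still override the value while local invocations of the script get a working default.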

@dimberman
Contributor Author

@Fokko @barni is going to investigate how we can make the Airflow API tests work today. We should hopefully have this working soon.

@Fokko
Contributor

Fokko commented Sep 12, 2018

Awesome work @dimberman

# specific language governing permissions and limitations
# under the License.

set -x
Member

set -e too?

except Exception as e:
print(e)
from .s3_to_hive_operator import *
pass
Member

This looks like debugging code?

Contributor Author

@ashb This is more something I wanted to ask how to solve. I ran into issues with the S3 tests where they don't skip even though moto shows up as "None". This was preventing the k8s tests from running at all, since tests were failing at import.

Contributor Author

Here's an example build where it fails because it's attempting to run the moto decorator even though it shouldn't be able to: https://travis-ci.org/bloomberg/airflow/jobs/423162910#L4949

@@ -57,10 +58,12 @@ passenv = *
commands =
pip wheel --progress-bar off -w {homedir}/.wheelhouse -f {homedir}/.wheelhouse -e .[devel_ci]
pip install --progress-bar off --find-links={homedir}/.wheelhouse --no-index -e .[devel_ci]
env_kubernetes: pip install boto3
env_kubernetes: pip install moto
Member

Do we not want boto3 and moto always? Also aren't these already installed as test_requires from setup.py? Why do we need to specify them directly here? (My tox is hazy, so there may be a reason)

Contributor Author

Related to the problem above: something is making the S3 tests attempt to run a "None" function, blocking all testing.

# specific language governing permissions and limitations
# under the License.

set -x
Member

set -e

AIRFLOW_ROOT="$DIRNAME/../.."

# Fix file permissions
sudo chown -R travis.travis . $HOME/.wheelhouse/ $HOME/.cache/pip
Member

What if I want to be able to run kube-based tests locally, what does the workflow for that look like?


$DIRNAME/minikube/start_minikube.sh
#rm /etc/docker/daemon.json
#sudo cp $DIRNAME/daemon.json /etc/docker/
Member

Remove commented code please

@@ -0,0 +1,50 @@
#!/usr/bin/env bash
Member

set -e

.travis.yml Outdated
- pip install tox
- pip install codecov
- pip install boto3
- pip install moto
Member

These are in setup.py - why do we need them here?

@dimberman
Contributor Author

@Fokko @bolkedebruin @ashb I am seeing multiple errors with the Kubernetes executor once it's running:

  1. The API calls are failing because airflow thinks that orm_dag is None
[2018-09-12 20:41:01,730] ERROR in app: Exception on /api/experimental/dags/example_kubernetes_annotation/paused/false [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/local/lib/python2.7/dist-packages/airflow/api/auth/backend/default.py", line 32, in decorated
    return function(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/airflow/www_rbac/api/experimental/endpoints.py", line 156, in dag_paused
    orm_dag.is_paused = False
AttributeError: 'NoneType' object has no attribute 'is_paused'
  2. When I log into the webserver (both through k-d-c and through minikube), I am unable to see any DAGs, even though the logs show them as loaded and the files are in the correct directory.

  3. When I attempt to ping through k-d-c, it just hangs indefinitely.

For these reasons I think this is a high-priority bug, especially since every PR we add from here on just makes the k8s executor more and more broken. I realize reverting the docker-compose CI would be a pain, but if we don't either do that or fix these bugs in the short term, I fear the k8s executor work will become even more broken.

@gerardo
Contributor

gerardo commented Sep 12, 2018

If it's becoming too hard, I agree with reverting the docker-compose CI changes. I could go back and create a new branch with those changes and work on fixing the k8s setup.

@Fokko
Contributor

Fokko commented Sep 17, 2018

We should be able to set up a separate branch beside the docker-compose one, which installs minikube and spins up the Airflow Docker image, right?

@odracci
Contributor

odracci commented Sep 19, 2018

I created #3922 as an alternative solution for this issue. /cc @dimberman @gerardo @Fokko

@dimberman
Contributor Author

@odracci I like your solution better as a short-term fix (switching to k-d-c should ideally be done later when the build is stable). Let me know when it's ready to review :).

@Fokko
Contributor

Fokko commented Sep 21, 2018

I've merged @odracci's PR.

@Fokko Fokko closed this Sep 21, 2018