Support configuration key to toggle using IPs or hostnames #7

Merged: 1 commit merged into data-platform from patch_socket on Jun 26, 2017

Conversation

astahlman

Background

We're patching the socket library in order to use IP addresses instead of hostnames, because Lyft hostnames are special and don't actually resolve (see https://github.com/lyft/etl/pull/5560 for context). service_venv sets the $PATH such that /usr/local/lib/service_venv/bin/ takes precedence over /srv/service/current/bin/, so it's not enough to simply put a patched version of airflow in the ETL repository.

We're ensuring that our patched version of airflow takes precedence by overwriting /usr/local/lib/service_venv/bin/airflow with a script that just calls our patched version at /srv/service/current/bin/airflow (from the ETL repository). See this salt-state [1].

That doesn't work anymore, because with the rollout of frozen_venvs we can't count on airflow being at a hardcoded path. https://github.com/lyft/etl/pull/6073 was one attempt to solve this by finding the service_venv root dynamically. That approach doesn't work, because it depends on service_venv being available by the time the Jinja template is rendered, but service_venv won't be available until it's created in the lyft-python state (which requires that the template is rendered).

Solution

We can avoid this chicken-and-egg problem if we simply patch the Airflow code directly, rather than trying to overwrite it on deployment. We apply our patch iff the prefer_ip_over_hostname key under the lyft configuration section is set.
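
A minimal sketch of the idea, for illustration only (the configuration helpers and the exact socket functions patched here are assumptions, not necessarily what this PR does):

import socket

import requests

from airflow import configuration


def get_private_ip(name=''):
    # EC2 instances expose their private IP at this static metadata address.
    return requests.get("http://169.254.169.254/latest/meta-data/local-ipv4").text


# Apply the patch iff the prefer_ip_over_hostname toggle is set under [lyft].
# (Sketch only: the real patch may hook different socket functions.)
if configuration.has_option('lyft', 'prefer_ip_over_hostname') and \
        configuration.getboolean('lyft', 'prefer_ip_over_hostname'):
    socket.getfqdn = get_private_ip
    socket.gethostname = get_private_ip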

Rollout

On the sharedairflowworker (which is only running test DAGs so far), we will stop overwriting the airflow command with our patched version and set the prefer_ip_over_hostname key.
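
Assuming the toggle lives in the standard airflow.cfg (the exact location is deployment-specific), setting it would look something like:

[lyft]
prefer_ip_over_hostname = True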

If this works, then we can take the same approach for the other ASGs. Until then, we will continue overwriting the airflow command under service_venv/ with our patched version from ETL.

cc @lyft/data-platform

[1] https://github.com/lyft/etl/blob/master/ops/config/states/etl/init.sls#L50-L59

@astahlman (Author)

👀 @perryzheng

cc @SaurabhBajaj

@bhuwanchopra left a comment

Overall makes sense to me. Minor suggestion:

import requests

from airflow import configuration
from airflow.bin.cli import CLIFactory

def get_private_ip(name=''):
    # Look up this instance's private IP via the EC2 metadata service
    r = requests.get("http://169.254.169.254/latest/meta-data/local-ipv4")

Should "http://169.254.169.254/latest/meta-data/local-ipv4" also be a part of configuration?

@astahlman (Author)

We could, but I see no reason to make it configurable unless it could change. This is the static address from which EC2 instances vend metadata - see the AWS docs for details.

@amalakar

Wish boto3 had a nice wrapper around this, but boto/boto3#313 has been open for a while. There is boto.utils.get_instance_metadata() though:

>>> import boto.utils
>>> instance_metadata = boto.utils.get_instance_metadata()
>>> print instance_metadata['local-ipv4']
10.0.24.67

But that adds a dependency on another library, though it may already be available.

Overall, it looks like the hostname the node returns:

PRODUCTION: amalakar@etl-production-iad-d532cf46:~$ hostname
etl-production-iad-d532cf46

doesn't seem to have a DNS entry or an entry in /etc/hosts:

PRODUCTION: amalakar@etl-production-iad-d532cf46:~$ host etl-production-iad-d532cf46
Host etl-production-iad-d532cf46 not found: 3(NXDOMAIN)

I have seen a lot of systems that depend on gethostname resolving correctly via a DNS lookup. For example, the Hadoop CLI fails when this isn't satisfied. I think any instance that gets provisioned should have these pieces working; it's an assumption a lot of systems out there make.

@astahlman should we create a ticket against #provisioning to take care of this? We may end up patching/hacking the startup scripts of many other systems otherwise.

@astahlman (Author)

Yeah, provisioning is aware of it, and unfortunately it seems that this is by design. Lyft uses the hostname to encode some information about the host that's used for monitoring purposes; see the previous discussions in these two Slack threads:

  1. https://lyft.slack.com/archives/C3ASS377S/p1493405313083642
  2. https://lyft.slack.com/archives/C2U75K0H5/p1484849237000002

@perryzheng

👍 lgtm

@astahlman (Author)

💨

@astahlman merged commit 5586353 into data-platform on Jun 26, 2017
@eschachar deleted the patch_socket branch on September 24, 2022