Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kubernetes support for Metaflow #644

Merged
merged 35 commits into from
Oct 15, 2021
Merged

Kubernetes support for Metaflow #644

merged 35 commits into from
Oct 15, 2021

Conversation

savingoyal
Copy link
Collaborator

@savingoyal savingoyal commented Aug 13, 2021

Mostly ready for review (except for TODOs marked in the PR)

This PR provides compute integration for Metaflow. Similar to Metaflow's integration with AWS Batch, users can scale up and out from their laptops to their Kubernetes clusters (Amazon EKS, for now) trivially.

from metaflow import FlowSpec, step, kubernetes

class LinearFlow(FlowSpec):

    @step
    def start(self):
        self.my_var = 'hello world'
        self.next(self.a)

    @kubernetes(cpu=1)
    @step
    def a(self):
        print('the data artifact is: %s' % self.my_var)
        self.next(self.end)

    @step
    def end(self):
        print('the data artifact is still: %s' % self.my_var)

if __name__ == '__main__':
    LinearFlow()
python flow.py run

This will execute a on your Kubernetes cluster and provide all of Metaflow's goodness.

To easily test this PR - you can

  1. Use the local file system for metadata tracking --metadata=local,
  2. Deploy EKS (say through eksctl) and configure kubernetes credentials on your laptop,
  3. Create an S3 bucket (bar) - and assign an IRSA (foo) that allows Kubernetes pods to talk to S3
  4. Launch any Metaflow flow on top of Kubernetes -
python my_flow.py --datastore=s3 --datastore-root=s3://bar/metaflow run --with kubernetes:service_account=foo

"Easy configure" for this feature is in-flight.

Please route all your feedback to our slack workspace - http://slack.outerbounds.co

@resources decorator is shared by all compute related decorators -
@Batch, @lambda, @K8s, @titus. This patch moves it out of
batch_decorator.py so that other decorators can cleanly reference
it.
@corleyma
Copy link

This looks good; it seems like the next step might be to add some functionality to the step-functions plugin, so that it can optionally compile steps to synchronous EKS executions instead of Batch executions? First pass could be all or nothing, but the ability to mix and match per step could be interesting.

@savingoyal
Copy link
Collaborator Author

This looks good; it seems like the next step might be to add some functionality to the step-functions plugin, so that it can optionally compile steps to synchronous EKS executions instead of Batch executions? First pass could be all or nothing, but the ability to mix and match per step could be interesting.

@corleyma Yep, that's the logical next step in our Kubernetes story. This PR still has some open questions. Hopefully, it will be ready to merge next week and we can fast follow with an SFN story. Mixing up Batch & Kubernetes on SFN will be a bit tricky since we rely on the response structure of submit_job (AWS Batch) & runJob (EKS) API calls to manage state passing. Once we get to that bridge, let's see how non-trivial it is to cross :)

@savingoyal savingoyal linked an issue Aug 19, 2021 that may be closed by this pull request
@savingoyal savingoyal changed the title @kubernetes Kubernetes support for Metaflow Aug 19, 2021
@savingoyal savingoyal marked this pull request as ready for review August 28, 2021 05:54
"stderr",
job_id=job.id,
)
t = time.time()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can avoid duplicating line 265-267 (echo) by setting status = None on line 263

self._kwargs = kwargs

# Kubernetes namespace defaults to `default`
self._kwargs["namespace"] = self._kwargs["namespace"] or "default"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fail if "namespace" is not in kwargs, need to use
self._kwargs.get("namespace", "default")

def create(self):
# Check that job attributes are sensible.

# CPU value should be greater than 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider a for loop:

`

for arg in ["cpu", "memory", " "disk"]:
    if float(self._kwargs.get(arg, 0)) <= 0:
      raise ...
      

`

@oavdeev oavdeev mentioned this pull request Sep 24, 2021
5 tasks
@sappier
Copy link
Contributor

sappier commented Oct 4, 2021

Tested it with #719. It works very good on test flows I tried.
The only lacking feature is a @conda decorator support on both Linux and MacOS. Unfortunately, didn't manage to find a solution.

@pikulmar
Copy link

Tested it with #719. It works very good on test flows I tried. The only lacking feature is a @conda decorator support on both Linux and MacOS. Unfortunately, didn't manage to find a solution.

The @conda decorator seems to work, cf. #40 (comment) .

@sappier
Copy link
Contributor

sappier commented Oct 13, 2021

Tested it with #719. It works very good on test flows I tried. The only lacking feature is a @conda decorator support on both Linux and MacOS. Unfortunately, didn't manage to find a solution.

The @conda decorator seems to work, cf. #40 (comment) .

Hmm, may be I'm doing something wrong. Tried it again on Mac with the latest #756. A flow always results in a such error:

$ python 04-playlist-plus/playlist.py --environment=conda run --with=kubernetes:secrets=mfl-k8s-secret,image=amancevice/pandas
Metaflow 2.3.2.post33+gitf58ae58 executing PlayListFlow for user:romk
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint is happy!
Bootstrapping conda environment...(this could take a few minutes)
2021-10-13 21:13:17.688 Workflow starting (run-id 279):
2021-10-13 21:13:20.243 [279/start/1337 (pid 37150)] Task is starting.
2021-10-13 21:13:22.460 [279/start/1337 (pid 37150)] [e9d6dd02-617f-406b-a172-00237d88c3b1] Task is starting (Status Job:Active Pod:Pending Container:Waiting [ContainerCreating])...
2021-10-13 21:13:31.653 [279/start/1337 (pid 37150)] [e9d6dd02-617f-406b-a172-00237d88c3b1] Task is starting (Status Job:Active Pod:Running Container:Running)...
2021-10-13 21:13:30.846 [279/start/1337 (pid 37150)] [e9d6dd02-617f-406b-a172-00237d88c3b1] Setting up task environment.
2021-10-13 21:13:51.758 [279/start/1337 (pid 37150)] [e9d6dd02-617f-406b-a172-00237d88c3b1] Downloading code package...
2021-10-13 21:13:55.133 [279/start/1337 (pid 37150)] [e9d6dd02-617f-406b-a172-00237d88c3b1] Code package downloaded.
2021-10-13 21:13:55.194 [279/start/1337 (pid 37150)] [e9d6dd02-617f-406b-a172-00237d88c3b1] Task is starting.
2021-10-13 21:13:56.115 [279/start/1337 (pid 37150)] [e9d6dd02-617f-406b-a172-00237d88c3b1] Bootstrapping environment...
2021-10-13 21:14:39.347 [279/start/1337 (pid 37150)] [e9d6dd02-617f-406b-a172-00237d88c3b1] Environment bootstrapped.
2021-10-13 21:14:39.353 [279/start/1337 (pid 37150)] [e9d6dd02-617f-406b-a172-00237d88c3b1] bash: line 1: metaflow_PlayListFlow_osx-64_179c56284704ca8e53622f848a3df27cdd1f4327/bin/python: cannot execute binary file: Exec format error
2021-10-13 21:14:49.655 [279/start/1337 (pid 37150)] Kubernetes error:
2021-10-13 21:14:49.655 [279/start/1337 (pid 37150)] Error (exit code 126). This could be a transient error. Use @retry to retry.
2021-10-13 21:14:49.779 [279/start/1337 (pid 37150)]
2021-10-13 21:14:49.942 [279/start/1337 (pid 37150)] Task failed.
2021-10-13 21:14:50.188 Workflow failed.
2021-10-13 21:14:50.188 Terminating 0 active tasks...
2021-10-13 21:14:50.188 Flushing logs...
    Step failure:
    Step start (task-id 1337) failed.

Ahh, finally fixed it with a patch:

--- a/metaflow/plugins/conda/conda_step_decorator.py
+++ b/metaflow/plugins/conda/conda_step_decorator.py
@@ -193,13 +193,13 @@ class CondaStepDecorator(StepDecorator):
         # a macOS. This is needed because of gotchas around inconsistently
         # case-(in)sensitive filesystems for macOS and linux.
         for deco in decos:
-            if deco.name == 'batch' and platform.system() == 'Darwin':
+            if deco.name in ('batch', 'kubernetes') and platform.system() == 'Darwin':
                 return True
         return False

     def _architecture(self, decos):
         for deco in decos:
-            if deco.name == 'batch':
+            if deco.name in ('batch', 'kubernetes'):
                 # force conda resolution for linux-64 architectures
                 return 'linux-64'
         bit = '32'
@@ -295,7 +295,9 @@ class CondaStepDecorator(StepDecorator):
                          retry_count,
                          max_user_code_retries,
                          ubf_context):
-        if self.is_enabled(ubf_context) and 'batch' not in cli_args.commands:
+        is_batch = 'batch' in cli_args.commands
+        is_kubernetes = 'kubernetes' in cli_args.commands
+        if self.is_enabled(ubf_context) and not is_batch and not is_kubernetes:
             python_path = self.metaflow_home
             if os.environ.get('PYTHONPATH') is not None:
                 python_path = os.pathsep.join([os.environ['PYTHONPATH'], python_path])

savingoyal and others added 2 commits October 13, 2021 12:58
Conda environment should pack linux python binary when run on MacOS
to avoid an error

metaflow_PlayListFlow_osx-64_179c56284704ca8e53622f848a3df27cdd1f4327/bin/python: cannot execute binary file: Exec format error
Base automatically changed from plugin-linter to master October 14, 2021 20:43
@savingoyal savingoyal merged commit f7d54d6 into master Oct 15, 2021
@savingoyal savingoyal deleted the kubernetes-pr branch October 15, 2021 22:51
@savingoyal savingoyal restored the kubernetes-pr branch October 15, 2021 23:16
@savingoyal savingoyal deleted the kubernetes-pr branch October 27, 2021 21:35
@savingoyal savingoyal restored the kubernetes-pr branch October 27, 2021 21:35
@savingoyal savingoyal deleted the kubernetes-pr branch December 29, 2021 23:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for Kubernetes (with Argo)
6 participants