
Deployment support/Kedro deployer #2058

Open
merelcht opened this issue Nov 23, 2022 · 13 comments
@merelcht
Member

merelcht commented Nov 23, 2022

Description

We want to offer users better support with deploying Kedro.

  • Production deployment is important to drive value; we're at least a couple of years past the point when people were happy just running things locally
  • A large percentage of "hard" questions we get are related to deployment
    • TODO: Consider quantifying this percentage and providing some examples
  • The development team is also suffering
    • The existing deployment guides have a lot of challenges and are hard to maintain
    • Tribal knowledge about the different platforms and integrations is concentrated in a few people (who wrote the deployment guides and plugins years ago)
    • Maintainers have to know too much (nobody can be an expert at deploying to so many platforms)

Implementation ideas

  • Short-term
    • Support at least 1 "production" deployment archetype/stack out of the box
    • Create a single set of best practices for deployment
      • There should be a clear mapping from "best-practice Kedro pipeline" to "best-practice Kedro deployment"
  • Medium-term

Questions

@datajoely
Contributor

So I love this, and I think we should really emphasise the power of modular pipelines as a standout feature of Kedro. Choosing the granularity of what to translate is critical here.

  • In most cases a Kedro Node should not (necessarily) == a 'task' in an orchestrator.
  • I think our concept of a modular pipeline, a.k.a. the 'super node' in Viz, is in most cases what should be translated to a deployable task (see the sketch below).
  • The user should be able to define this level of granularity too, because they may want to decide on a case-by-case basis.

Couler looks v cool btw.
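
A minimal sketch of that granularity choice, using the existing modular pipeline API. The pipeline(..., namespace=...) factory and the node attributes are real Kedro APIs, but mapping one namespace to one orchestrator task is an assumption for illustration:

from collections import defaultdict

from kedro.pipeline import node, pipeline

def preprocess(raw):
    return raw

def train(table):
    return "model"

# A modular pipeline: every node inside gets the "data_science" namespace.
ds = pipeline(
    [
        node(preprocess, inputs="raw", outputs="table", name="preprocess"),
        node(train, inputs="table", outputs="model", name="train"),
    ],
    namespace="data_science",
)

# Group nodes by namespace; each group would become one orchestrator task.
tasks = defaultdict(list)
for n in ds.nodes:
    tasks[n.namespace or "<ungrouped>"].append(n.name)

print(dict(tasks))
# {'data_science': ['data_science.preprocess', 'data_science.train']}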

@noklam
Contributor

noklam commented Jun 21, 2023

https://www.linen.dev/s/kedro/t/12647848/hi-kedro-community-slightly-smiling-face-i-have-a-question-r#45fa4774-5ebd-4826-9c10-47fd95d1bebd

Something like this has already been done in some plugins; we should come up with a flexible way for Kedro core to create the mapping for different targets (a rough sketch below).
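
To make that concrete, here is one possible shape for such a mapping interface. DeploymentTranslator and AirflowTranslator are hypothetical names, not existing Kedro APIs:

from abc import ABC, abstractmethod

from kedro.pipeline import Pipeline

class DeploymentTranslator(ABC):
    """Hypothetical interface a kedro-<platform> plugin would implement."""

    @abstractmethod
    def translate(self, pipeline: Pipeline) -> str:
        """Render the pipeline as the target platform's DAG definition."""

class AirflowTranslator(DeploymentTranslator):
    def translate(self, pipeline: Pipeline) -> str:
        # A real plugin would render a full DAG file; this stub just lists tasks.
        tasks = "\n".join(f"# task: {n.name}" for n in pipeline.nodes)
        return "# Airflow DAG stub\n" + tasks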

@astrojuanlu
Member

I guess #143 is tangentially related?

@noklam
Contributor

noklam commented Jun 22, 2023

It is! Although I am thinking more about deployment to a platform/orchestrator here, there are definitely cases of users deploying a Kedro pipeline to an endpoint.

@noklam
Contributor

noklam commented Oct 25, 2023

Not sure where this should go, so I'll just put it here as an early thought. Inspired by deepyaman, it seems that it's fairly easy to convert a Kedro pipeline to Metaflow. It's just a variable assignment between steps (nodes in Kedro's terms).

The benefit of using Metaflow is that it lets you abstract infrastructure with a decorator like @kubernetes(memory=64000) (obviously the tricky part is getting the cluster set up properly, but from a data scientist's perspective that doesn't matter, as they simply want more compute for a specific task). This could potentially be integrated more smoothly with Kedro's tags to denote the required infra.

This is not to say Kedro is going to integrate with Metaflow, but to show the possibility of doing it.

from metaflow import FlowSpec, step

class LinearFlow(FlowSpec):

    @step
    def start(self):
        # Data is passed between steps simply by assigning to self.
        self.my_var = 'hello world'
        self.next(self.a)

    @step
    def a(self):
        print('the data artifact is: %s' % self.my_var)
        self.next(self.end)

    @step
    def end(self):
        print('the data artifact is still: %s' % self.my_var)

if __name__ == '__main__':
    LinearFlow()
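
To illustrate the tags idea: node tags are a real Kedro feature, but the "infra:" naming convention below is purely hypothetical. A translator could read such tags and emit the corresponding Metaflow decorator, e.g. @kubernetes(memory=64000):

from kedro.pipeline import node, pipeline

def train_model(features):
    return "model"

# Hypothetical convention: "infra:" tags declare the compute a node needs.
training = pipeline(
    [
        node(
            train_model,
            inputs="features",
            outputs="model",
            name="train_model",
            tags=["infra:kubernetes", "infra:memory=64000"],
        )
    ]
)

# A translator would map these tags onto the target's decorators/resources.
for n in training.nodes:
    infra = sorted(t.removeprefix("infra:") for t in n.tags if t.startswith("infra:"))
    print(n.name, infra)  # train_model ['kubernetes', 'memory=64000']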

@astrojuanlu
Member

Related: #3094

That issue contains a research synthesis; we can use this issue to collect the plan.

@astrojuanlu
Member

Found in #770 from @idanov:

kedro deploy airflow

(found after listening to @ankatiyar's Tech Design on kedro-airflow and reading kedro-org/kedro-plugins#25 (comment))
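
As a sketch of how such a command could be wired up using standard click (the deploy/airflow names, the option, and the behaviour below are hypothetical, not an existing Kedro API):

import click

@click.group(name="deploy")
def deploy_cli():
    """Hypothetical `kedro deploy` command group, exposed via a plugin."""

@deploy_cli.command(name="airflow")
@click.option("--output", default="dags/", help="Where to write the DAG file.")
def deploy_airflow(output):
    """Translate the project's pipelines into an Airflow DAG."""
    click.echo(f"Writing Airflow DAG to {output} ...")

if __name__ == "__main__":
    deploy_cli()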

@astrojuanlu
Member

As we learn more about how to deploy to Airflow (see @DimedS's #3860), it becomes more apparent that the list of steps can become quite involved. This is not new, as the Databricks workflows also require some care.

We have plenty of evidence that this is an important pain point that affects lots of users.

The main questions are:

  1. Is there anything that can be done in general to make these processes easier, simpler, leaner? (So that we don't rush to automate something that should have been easier in the first place)
  2. Is there a way these processes can be partly automated for specific platforms of choice? (Assuming that there's no general solution for all platforms, given the low traction of couler)
    • And which platforms should we choose, and with which criteria?

@noklam
Contributor

noklam commented May 21, 2024

We should really treat Kedro as a pipeline DSL. Most orchestrators work on a DAG, so these are all generic features. For example:

  • Grouped memory nodes in Airflow are not unique to Airflow

So there is definitely a theme around DAG manipulation: is the current Pipeline API flexible enough? We had to implement a separate DFS to support the grouped memory nodes feature for Airflow (see the sketch below). We could also extend the grouped memory node feature to, e.g., running specific nodes on a GPU machine, a Spark cluster, a different Python environment, or a particular machine type.

Where would be the best place to hold this metadata? Maybe tags?

There is some early vision in #770 and #3094 from @datajoely last year.
It would also be really cool to have some kind of DryRunner where one can orchestrate multiple Kedro sessions in one go; this would allow one to catch errors like a memory dataset not existing in an "every Kedro node as an Airflow node" situation.
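
A minimal sketch of that generic grouping, assuming (as kedro-airflow does) that any dataset missing from the catalog is a MemoryDataset whose producer and consumers must land in the same task. group_memory_nodes is a hypothetical helper, and it ignores the cycle-detection a real implementation would need:

from kedro.pipeline import Pipeline

def group_memory_nodes(pipeline: Pipeline, catalog_datasets: set) -> list:
    """Merge nodes that are linked only by in-memory (uncatalogued) datasets."""
    # Union-find over node names.
    parent = {n.name: n.name for n in pipeline.nodes}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    producers = {out: n.name for n in pipeline.nodes for out in n.outputs}

    # If an edge's dataset is not in the catalog, fuse producer and consumer.
    for n in pipeline.nodes:
        for inp in n.inputs:
            if inp in producers and inp not in catalog_datasets:
                parent[find(n.name)] = find(producers[inp])

    groups = {}
    for name in parent:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())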

@datajoely
Contributor

Yeah - my push here is not to focus too much on Airflow, but to really address the fundamental problems which make Kedro slightly awkward in these situations.

@astrojuanlu
Member

Possible inspiration: Apache Beam https://beam.apache.org/get-started/beam-overview/

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Flink, Apache Spark, and Google Cloud Dataflow.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=LOOPBACK"
])
with beam.Pipeline(options) as p:
    ...

There's a difference, though, between deploying and running; this is more similar to our Dask runner https://docs.kedro.org/en/stable/deployment/dask.html

But maybe it means that we need a clearer mental model too. In my head, when we deploy a Kedro project, Kedro is no longer responsible for the execution, whereas writing a runner implies that Kedro is in fact driving the execution and acting as a poor man's orchestrator.

@datajoely
Contributor

I guess there is a philosophical question (very much related to granularity) on how you express which chunks of pipeline(s) get executed in different distributed systems.

The execution plan of a Kedro DAG is resolved at runtime; we do not have a standardised way of:

  1. Declaring a group of nodes which must be run together
  2. Saying what infrastructure they should run on
  3. Validating/ensuring that data is passed between groups through persistent storage (see the sketch below).

I'm not against the context manager approach for doing this in principle - but I think it speaks to the more fundamental problem of how some implicit elements of Kedro (which increase development velocity) leave a fair amount of unmanaged complexity when it gets to this point in the process.
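
On point 3 in the list above, a validation pass is fairly mechanical once groups exist. In this sketch, the groups mapping (node name to group id) and validate_group_boundaries are both hypothetical:

from kedro.pipeline import Pipeline

def validate_group_boundaries(pipeline: Pipeline, groups: dict,
                              catalog_datasets: set) -> list:
    """Return datasets that cross a group boundary without being persisted."""
    producer_group = {
        out: groups[n.name] for n in pipeline.nodes for out in n.outputs
    }

    violations = []
    for n in pipeline.nodes:
        for inp in n.inputs:
            produced_elsewhere = (
                inp in producer_group and producer_group[inp] != groups[n.name]
            )
            # A boundary-crossing dataset must be in the catalog, i.e. persisted.
            if produced_elsewhere and inp not in catalog_datasets:
                violations.append(inp)
    return violations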
