
Support for importing processes from one file to another #238

Closed

vardaofthevalier opened this issue Nov 4, 2016 · 20 comments

@vardaofthevalier

Hi there! It would be really cool if it were possible to import process definitions from one file to another in order to support code reuse between workflows. Is this a feature that can be utilized through the Groovy language already, or would it require additional engineering work to support this in the Nextflow DSL? I couldn't find anything about this specifically in the official Nextflow documentation and I'm fairly new to Groovy, so any advice or thoughts you have about addressing this would be much appreciated. I'd be happy to read any existing documentation that already covers this if there is any. Thanks in advance!

@pditommaso
Member

Currently it is possible to reuse Groovy scripts or JAR libraries through the standard Java/Groovy import mechanism. Sub-workflows are not (yet) supported, but we are planning to add this feature likely next year.
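For concreteness, here is a minimal sketch of that existing mechanism, assuming the standard behavior that Nextflow adds the project's lib/ directory to the classpath so Groovy classes placed there are visible to the pipeline script (the file, class, and method names below are illustrative, not from the original thread):

// lib/SeqUtils.groovy - picked up automatically from the project's lib/
// directory (illustrative name, not part of Nextflow itself)
class SeqUtils {
    // strip the directory and FASTA extension from a file path
    static String baseName(String path) {
        path.tokenize('/')[-1].replaceAll(/\.(fa|fasta)$/, '')
    }
}

// main.nf - the class is visible without an explicit import
names = Channel
    .fromPath(params.fasta)
    .map { SeqUtils.baseName(it.toString()) }

This covers sharing plain Groovy code; it does not help with sharing processes, which is what this issue is about.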

@pditommaso changed the title Support for importing processes from one file to another → [possible feature request] Support for importing processes from one file to another (Nov 5, 2016)
@mes5k
Contributor

mes5k commented Jan 20, 2017

Introduction

The overarching goal of this proposal is to be able to isolate and group a pipeline of processes into a reusable unit that can be shared and incorporated into other pipelines.

This proposal attempts to address this need by defining a small number of extensions to the Nextflow language while maintaining the core behavior and spirit of Nextflow's existing design and execution model.

The central idea is to extend the concept of a process to something called a module_process. A module_process would be expressed in a separate file as a (mostly) normal Nextflow pipeline. The main difference is that instead of channels being created at the beginning of the pipeline (e.g. with Channel.from(...)), the channels would be declared. Similarly, a module_process would also declare output channels. Once given a name, this module_process could be included in a normal Nextflow pipeline and expanded, similar to how a macro is expanded in other languages.

Syntax

Here is a possible syntax for a module_process file:

// Give the module a name.
nextflow_module:
    com.example.coolthing

// Declare input channels.  These channels are assumed to
// be "injected" from a calling pipeline.
input_channels:
    input_1,
    input_2

// normal processes
process a {
    input:
    val(x) from input_1

    output:
    stdout into output_1
    //...
}

process b {
    input:
    val(y) from input_2

    //...
}

process c {
    //...
}


// Declare output channels to be "exported".
output_channels:
    output_1
    output_2

The keyword nextflow_module gives the module a name, while the input_channels and output_channels keywords define the input and output channels, respectively. Some assumptions about module_processes include:

  • module_processes are not standalone nextflow programs. They require input channels to be injected into the module.
  • Likewise, the only way to get data out of a module_process is through an output channel. The exception is results written with the publishDir directive, which would behave normally.

A possible syntax for using a module_process is as follows:

// Import like a normal Java/Groovy object, or use a different syntax?
import com.example.coolthing

Channel.fromPath(...).into{ origin }

// normal process
process xyz {

    input:
    val(z) from origin

    // ...

    output:
    val(x) into x_channel
    val(y) into y_channel
}


// use the module process
module_process coolthing {

    // map local channels to the module's input channels
    input:
    coolthing.input_1 from x_channel
    coolthing.input_2 from y_channel

    // map the module's output channels to local channels
    output:
    coolthing.output_1 into x_results
    coolthing.output_2 into y_results
}

// another normal process
process finish_x {

    input:
    val(x) from x_results
}

Using a module_process involves simply declaring it as such and mapping the input and output channels. There would be no script or exec section in the module_process definition; instead, the module_process just connects input and output channels from the main pipeline to those defined in the module.

Execution

The goal is to leave pipeline execution exactly the same as it is now. The idea is that interpretation of a module will "include" or "flatten" the module processes into the main Nextflow pipeline so that the executor sees only one pipeline script that consists of processes and channels, just like Nextflow now.

Here is a possible way of imagining what the executor would see, given the example above:

Channel.fromPath(...).into{ origin }

// normal process
process xyz {

    //...

    output:
    val(x) into x_channel
    val(y) into y_channel
}

// normal processes
process coolthing.a {
    input:
    val(x) from coolthing.input_1

    output:
    stdout into coolthing.output_1
    //...
}

process coolthing.b {
    input:
    val(y) from coolthing.input_2

    //...
}

process coolthing.c {
    //...
}

// another normal process
process finish_x {

    input:
    val(x) from coolthing.output_1
}

Conclusion

This proposal introduces four new keywords to the Nextflow language: nextflow_module, input_channels, output_channels, and module_process. These keywords, along with the idea of "including" module code into a final script for execution, provide a (hopefully) simple model for modularizing Nextflow in a (hopefully) lightweight manner that disrupts neither the dataflow programming model (still just processes and channels) nor the overall execution behavior.

Disclaimer

Obviously all of the names are only suggestions. Maybe nextflow_module should be subpipeline? Or perhaps module_process should be subpipeline or subpipeline_process? Or ...?

Also, I've not implemented any of this, so I have no actual idea whether it would work. :)

@pditommaso
Member

pditommaso commented Mar 13, 2017

I partially agree with this proposal. I think there shouldn't be a separate module concept; it should be possible to include any NF script into another.

The only requirement should be to properly declare the expected workflow inputs and outputs using the approach suggested by @mes5k, possibly declaring them in the existing workflow object, e.g.:

workflow { 
  input: 
  foo 
  bar 

  output: 
  gus
  baz
} 

Moreover it should be possible to continue to use the existing script parameters mechanism, both for backward compatibility and parametrisation when the script is used standalone.

My idea is that the current params values should be used to initialise input channels to a default value when such inputs are not explicitly provided. This could also be useful to replace the current common idiom in a NF script:

params.foo = '/some/file'
params.bar = '/data/*.fq' 

foo_file = file(params.foo)
bar_ch = Channel.fromPath(params.bar)

with:

params.foo = '/some/file'
params.bar = '/data/*.fq' 

workflow {
  input: 
  foo_file = file(params.foo)
  bar_ch = Channel.fromPath(params.bar)
} 

The main difference would be that foo_file and bar_ch could be provided when invoking the script as a sub-workflow.

On the invoking part I still have a lot of doubts. Among the open problems:

  • How to reference sub-workflows? Ideally it should be possible to use the same name specified on the nextflow run command line (thus downloading the script when needed).
  • How to reference inputs and outputs? By name? Should they be typed?
  • What syntax should be used? A new subworkflow keyword, or maybe extend the process definition with a new subworkflow component in place of script?

@mes5k
Contributor

mes5k commented Apr 5, 2017

Sorry for the slow reply on this @pditommaso! I really like your ideas. I think workflows calling workflows is much more elegant than resorting to a new module keyword. Here are my thoughts on some of your questions:

  • How to reference sub-workflows? I'd vote for a two-pronged approach. First, I'd load other workflows defined on the filesystem via a specified path. Then I'd build on the existing git support for handling remote projects. The git approach could pull the repo as normal and then just return a path. Building search paths for code is something most people will grok from Java, Python, etc.
  • Syntax? I like the idea of overloading process with a new subworkflow component. I think that makes it clear that the subworkflow really just spawns a bunch more processes. Something like:
process coolthing {

    // Input channels must line up with subworkflow's workflow.input.
    input:
    input_1 from x_channel
    input_2 from y_channel

    // Output channels must line up with subworkflow's workflow.output.
    output:
    output_x into x_results
    output_y into y_results

    subworkflow:
      path: 'sub/coolthing_module.nf'

//// or...

    subworkflow:
      git: 'http://coolthing/nextflow/repo'
      tag: 'v2.3.4'
}
  • Typing inputs/outputs? I wonder if we could get away with forcing workflow.input and workflow.output to be only dataflow values or dataflow queues? I know that I occasionally use groovy variables (generally constants) in my pipelines, but I believe I could easily convert all of them to dataflow values. Maybe I've missed an important use case?

The main question I have is about backwards compatibility. I assume that you want to maintain it, but should that extend to using existing pipelines as modules? I'd argue that a nextflow script must define workflow.input and workflow.output before it can be used as a module - I wouldn't want to write a bunch of complicated logic trying to infer how to hook things up if workflow.input and workflow.output don't exist.

In any case, I'm very excited to see where this goes. The more nextflow I write, the more I find that modules/subworkflows would help!

@pditommaso
Member

How to reference sub-workflows? I'd vote for a two-pronged approach. First, I'd load other workflows defined on the filesystem via a specified path. Then I'd build on the existing git support for handling remote projects.

Yes. I was thinking of a mechanism similar to the one used for the run command: it first checks the specified path on the local file system, then looks for it on GitHub (unless the full GH URL is specified). I like the idea of specifying the revision (tag).

Syntax? I like the idea of overloading process with a new subworkflow component

I like it as well, though I'm not sure how feasible it is to implement. It must be verified.

I wonder if we could get away with forcing workflow.input and workflow.output to be only dataflow values or dataflow queues?

There isn't a real need for this, because any object that is not a dataflow value is implicitly converted to it when connected to a process in the from clause.
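A tiny DSL1 illustration of that implicit conversion (the names are illustrative): a plain Groovy value used in a from clause behaves as if it had been wrapped in a dataflow value channel.

greeting = 'hello'          // a plain Groovy string, not a channel

process echoIt {
    input:
    val(x) from greeting    // implicitly converted to a dataflow value

    output:
    stdout into result

    """
    echo $x
    """
}

result.println()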

I assume that you want to maintain it, but should that extend to using existing pipelines as modules? I'd argue that a nextflow script must define workflow.input and workflow.output before it can be used as a module

YES!

This could be a base on which to attempt an implementation, but I guess it will be much more challenging to code when put into practice.

@msmootsgi
Contributor

Sounds like a plan! Would you like me to write something up that fully describes things? I can add it to the repo and then you can modify it as you see fit.

I'm also happy to help implement this, but I'll need plenty of guidance to get going.

@pditommaso
Member

You are more than welcome.

@stevekm
Contributor

stevekm commented Jan 25, 2018

Any updates on this?

Also, regarding implementation, I was wondering: are processes considered 'objects' in Nextflow? I implemented a similar feature in my own workflow manager by making each pipeline task ("process") a Python object; the user determines which tasks get run through a YAML file that simply lists the names and order of tasks to complete, and the program looks for modules of the same name to run.

Not sure if such an approach would work in Groovy & Nextflow, having each process in a separate file and simply providing a list of which ones to import & run. This way, you could dynamically create the pipeline based on user inputs.

@stevekm
Contributor

stevekm commented Jan 30, 2018

Also, it appears I am a little late to the party: after further investigation, I found that a similar feature has been implemented using the 'profiles' feature, as discussed here:

https://groups.google.com/forum/#!searchin/nextflow/grape-nf%7Csort:date/nextflow/jTzzE-Lb5iU/To0PxL0EAwAJ

and implemented here:

https://github.com/guigolab/grape-nf

https://github.com/guigolab/grape-nf/blob/master/nextflow.config

Not sure if this is exactly the same as loading processes from external files, though.
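For context, here is a minimal sketch of the profiles mechanism in general (illustrative values, not grape-nf's actual configuration). Profiles swap configuration per run rather than importing code:

// nextflow.config - select a profile with: nextflow run main.nf -profile star
profiles {
    star {
        params.mapper     = 'star'
        process.container = 'example/star:latest'     // illustrative image
    }
    hisat2 {
        params.mapper     = 'hisat2'
        process.container = 'example/hisat2:latest'   // illustrative image
    }
}

The pipeline script then branches on params.mapper; the set of processes stays fixed, so nothing is actually imported from another file.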

@merky

merky commented Mar 28, 2018

@mes5k @pditommaso I would love the modular approach spec'd out above! Anyone working on this? Enabling modularity would make Nextflow an exponentially more powerful workflow executor.

@stevekm I think the profiles/templates approach you mentioned solves a different problem, not the modular re-use of subsections of a workflow, which is what the original proposal addresses.

@pditommaso
Member

This is still among the desirable things to do; unfortunately, there's been no progress yet.

@stevekm
Contributor

stevekm commented May 2, 2018

After working with Nextflow for a while I've started to see the lack of this feature as an advantage. Keeping all workflow processes contained in a single file greatly reduces complexity, compared to importing external modules. Trying to understand and troubleshoot pipelines that use the latter format is a big headache. Thoughts? Maybe a better discussion for Gitter/Google Groups?

@pditommaso
Member

pditommaso commented May 7, 2018

After working with Nextflow for a while I've started to see the lack of this feature as an advantage

This is a controversial topic. In general I agree that duplication is better than the wrong abstraction, also taking into consideration that NF was conceived with the idea of making a pipeline easily readable and understandable from the developer's point of view, at the level of the tools and command lines executed.

However there are use cases in which the ability to reuse a module or a complete pipeline can be useful.

The goal of this issue is not to implement a module system but instead to implement the ability to import into a NF script an existing NF pipeline and execute it as a sub-workflow. This would allow users to decompose big projects into small self-contained workflows that could be recomposed in bigger ones as needed.

@msmootsgi
Contributor

I can't emphasize enough how important it is to have some feature to abstract away layers of detail. I've got several pipelines that are over 1000 lines of code, with several dozen processes and hundreds of operators manipulating the data per pipeline. These files are simply too large to reason about easily, which slows down our ability to improve and enhance them and makes it much easier to introduce bugs. I also find myself cutting and pasting sections of pipelines when I only need a subset of functionality for certain projects. This is especially frustrating because there are clear sections of the pipelines that could be extracted and naturally expressed as sub-workflows (or modules, or whatever).

I'm just sorry that I haven't had the time to contribute this myself.

@prakruthiburra

Hey all. We REALLY need this feature. If we could get some help, perhaps we could contribute? Maybe a phone conference to discuss how we could start?

@oguitart

oguitart commented Nov 8, 2018

Hi,

I think this would be very useful. We have huge workflows, and it would be better to split them into subworkflows and import them into a main workflow that would run them sequentially.

@rspreafico

rspreafico commented Mar 27, 2019

Just another upvote for modules. I have played with Nextflow and, more recently, WDL/Cromwell. My (personal) conclusion is that Nextflow is superior in almost every respect, except maybe a few. The strongest point in favor of WDL is the reusability of tasks, which can be easily imported by several workflows. That is really huge, as it saves a lot of time when writing pipelines and allows making necessary updates to a task only once. I cannot think of any reason to stick with WDL if Nextflow supported modules. (The minor points were that WDL is less Paolo-dependent ;-) and that the language specification is separate from the implementation - and of course the Broad brand helps. Nothing nearly as critical as modules.) So help me go back to Nextflow and convince my peers too ;-)

@pditommaso
Member

Nearly there. Have a look at #984, going to be merged into master next month.

@rspreafico

That is AWESOME! Thanks for pointing me to that. The new syntax will likely require extensive revision not only of the docs, but also of commonly used patterns, e.g. due to downplaying the use of publishDir and when.

@pditommaso
Member

Closing this, since the modules feature has been merged into the master branch.
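For reference, the modules feature as it later stabilized in DSL2 is based on an include statement; a minimal sketch (the module path, process name, and command are illustrative):

// modules/fastqc.nf - a process defined in its own file
process FASTQC {
    input:
    path reads

    output:
    path 'fastqc_out'

    """
    mkdir fastqc_out
    fastqc --outdir fastqc_out $reads
    """
}

// main.nf - import the process and call it
nextflow.enable.dsl = 2

include { FASTQC } from './modules/fastqc'

workflow {
    reads_ch = Channel.fromPath(params.reads)
    FASTQC(reads_ch)
}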
