
Support for importing processes from one file to another #238

Closed

vardaofthevalier opened this issue Nov 4, 2016 · 20 comments

@vardaofthevalier

Hi there! It would be really cool if it were possible to import process definitions from one file to another in order to support code reuse between workflows. Is this a feature that can be utilized through the Groovy language already, or would it require additional engineering work to support this in the Nextflow DSL? I couldn't find anything about this specifically in the official Nextflow documentation and I'm fairly new to Groovy, so any advice or thoughts you have about addressing this would be much appreciated. I'd be happy to read any existing documentation that already covers this if there is any. Thanks in advance!

@pditommaso
Member

Currently it is possible to reuse Groovy scripts or JAR libraries through the standard Java/Groovy import mechanism. Sub-workflows are not (yet) supported, but we are planning to add this feature likely next year.
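For concreteness, here is a minimal sketch of that existing mechanism, assuming the standard behavior that Nextflow adds the project's lib/ directory to the classpath so Groovy classes placed there are visible to the pipeline script (the file, class, and method names below are illustrative, not from the original thread):

// lib/SeqUtils.groovy - picked up automatically from the project's lib/
// directory (illustrative name, not part of Nextflow itself)
class SeqUtils {
    // strip the directory and FASTA extension from a file path
    static String baseName(String path) {
        path.tokenize('/')[-1].replaceAll(/\.(fa|fasta)$/, '')
    }
}

// main.nf - the class is visible without an explicit import
names = Channel
    .fromPath(params.fasta)
    .map { SeqUtils.baseName(it.toString()) }

This covers sharing plain Groovy code; it does not help with sharing processes, which is what this issue is about.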

@pditommaso changed the title Support for importing processes from one file to another → [possible feature request] Support for importing processes from one file to another (Nov 5, 2016)
@mes5k
Contributor

mes5k commented Jan 20, 2017

Introduction

The overarching goal of this proposal is to be able to isolate and group a pipeline of processes into a reusable unit that can be shared and incorporated into other pipelines.

This proposal attempts to address this need by defining a small number of extensions to the Nextflow language while maintaining the core behavior and spirit of Nextflow's existing design and execution model.

The central idea is to extend the concept of a process to something called a module_process. A module_process would be expressed in a separate file as a (mostly) normal Nextflow pipeline. The main difference is that instead of channels being created at the beginning of the pipeline (e.g. with Channel.from(...)), the channels would be declared. Similarly, a module_process would also declare output channels. Once given a name, this module_process could be included in a normal Nextflow pipeline and expanded, similar to how a macro is expanded in other languages.

Syntax

Here is a possible syntax for a module_process file:

// Give the module a name.
nextflow_module:
    com.example.coolthing

// Declare input channels.  These channels are assumed to
// be "injected" from a calling pipeline.
input_channels:
    input_1,
    input_2

// normal processes
process a {
    input:
    val(x) from input_1

    output:
    stdout into output_1
    //...
}

process b {
    input:
    val(y) from input_2

    //...
}

process c {
    //...
}


// Declare output channels to be "exported".
output_channels:
    output_1
    output_2

The keyword nextflow_module gives the module a name, while the input_channels and output_channels keywords define the input and output channels, respectively. Some assumptions about module_processes include:

  • module_processes are not standalone nextflow programs. They require input channels to be injected into the module.
  • Likewise, the only way to get data out of a module_process is through an output channel. The exception is results written with the publishDir directive, which would behave normally.

A possible syntax for using a module_process is as follows:

// Import like a normal Java/Groovy object, or use a different syntax?
import com.example.coolthing

Channel.fromPath(...).into{ origin }

// normal process
process xyz {

    input:
    val(z) from origin

    // ...

    output:
    val(x) into x_channel
    val(y) into y_channel
}


// use the module process
module_process coolthing {

    // map local channels to the module's input channels
    input:
    coolthing.input_1 from x_channel
    coolthing.input_2 from y_channel

    // map the module's output channels to local channels
    output:
    coolthing.output_1 into x_results
    coolthing.output_2 into y_results
}

// another normal process
process finish_x {

    input:
    val(x) from x_results
}

Using a module_process involves simply declaring it as such and mapping the input and output channels. There would be no script or exec section in the module_process definition; instead, the module_process just connects input and output channels from the main pipeline to those defined in the module.

Execution

The goal is to leave pipeline execution exactly the same as it is now. The idea is that interpretation of a module will "include" or "flatten" the module processes into the main Nextflow pipeline so that the executor sees only one pipeline script that consists of processes and channels, just like Nextflow now.

Here is a possible way of imagining what the executor would see, given the example above:

Channel.fromPath(...).into{ origin }

// normal process
process xyz {

    //...

    output:
    val(x) into x_channel
    val(y) into y_channel
}

// normal processes
process coolthing.a {
    input:
    val(x) from coolthing.input_1

    output:
    stdout into coolthing.output_1
    //...
}

process coolthing.b {
    input:
    val(y) from coolthing.input_2

    //...
}

process coolthing.c {
    //...
}

// another normal process
process finish_x {

    input:
    val(x) from coolthing.output_1
}

Conclusion

This proposal introduces four new keywords to the Nextflow language: nextflow_module, input_channels, output_channels, and module_process. These keywords, along with the idea of "including" module code into a final script for execution, provide a (hopefully) simple model for modularizing Nextflow in a (hopefully) lightweight manner that disrupts neither the dataflow programming model (still just processes and channels) nor the overall execution behavior.

Disclaimer

Obviously all of the names are only suggestions. Maybe nextflow_module should be subpipeline? Or perhaps module_process should be subpipeline or subpipeline_process? Or ...?

Also, I've not implemented any of this, so I have no actual idea whether it would work. :)

@pditommaso
Member

pditommaso commented Mar 13, 2017

I partially agree with this proposal. I think there shouldn't be a separate module concept; it should be possible to include any NF script into another.

The only requirement should be to properly declare the expected workflow inputs and outputs using the approach suggested by @mes5k, possibly declaring them in the existing workflow object, e.g.:

workflow { 
  input: 
  foo 
  bar 

  output: 
  gus
  baz
} 

Moreover it should be possible to continue to use the existing script parameters mechanism, both for backward compatibility and parametrisation when the script is used standalone.

My idea is that the current params values should be used to initialise input channels to a default value when such inputs are not explicitly provided. This could also be useful to replace the current common idiom in a NF script:

params.foo = '/some/file'
params.bar = '/data/*.fq' 

foo_file = file(params.foo)
bar_ch = Channel.fromPath(params.bar)

with:

params.foo = '/some/file'
params.bar = '/data/*.fq' 

workflow {
  input: 
  foo_file = file(params.foo)
  bar_ch = Channel.fromPath(params.bar)
} 

The main difference would be that foo_file and bar_ch could be provided when invoking the script as a sub-workflow.

On the invoking part I still have a lot of doubts. Among the open problems:

  • How to reference sub-workflows? Ideally it should be possible to use the same name specified on the nextflow run command line (thus downloading the script when needed).
  • How to reference inputs and outputs? By name? Should they be typed?
  • What syntax should be used? A new subworkflow keyword, or maybe extend the process definition with a new subworkflow component in place of script?

@mes5k
Contributor

mes5k commented Apr 5, 2017

Sorry for the slow reply on this @pditommaso! I really like your ideas. I think workflows calling workflows is much more elegant than resorting to a new module keyword. Here are my thoughts on some of your questions:

  • How to reference sub-workflows? I'd vote for a two-pronged approach. First, I'd load other workflows defined on the filesystem via a specified path. Then I'd build on the existing git support for handling remote projects. The git approach could pull the repo as normal and then just return a path. Building search paths for code is something most people will grok from Java, Python, etc.
  • Syntax? I like the idea of overloading process with a new subworkflow component. I think that makes it clear that the subworkflow really just spawns a bunch more processes. Something like:
process coolthing {

    // Input channels must line up with subworkflow's workflow.input.
    input:
    input_1 from x_channel
    input_2 from y_channel

    // Output channels must line up with subworkflow's workflow.output.
    output:
    output_x into x_results
    output_y into y_results

    subworkflow:
      path: 'sub/coolthing_module.nf'

//// or...

    subworkflow:
      git: 'http://coolthing/nextflow/repo'
      tag: 'v2.3.4'
}
  • Typing inputs/outputs? I wonder if we could get away with forcing workflow.input and workflow.output to be only dataflow values or dataflow queues? I know that I occasionally use groovy variables (generally constants) in my pipelines, but I believe I could easily convert all of them to dataflow values. Maybe I've missed an important use case?

The main question I have is about backwards compatibility. I assume that you want to maintain it, but should that extend to using existing pipelines as modules? I'd argue that a nextflow script must define workflow.input and workflow.output before it can be used as a module - I wouldn't want to write a bunch of complicated logic trying to infer how to hook things up if workflow.input and workflow.output don't exist.

In any case, I'm very excited to see where this goes. The more nextflow I write, the more I find that modules/subworkflows would help!

@pditommaso
Member

How to reference sub-workflows? I'd vote for a two-pronged approach. First, I'd load other workflows defined on the filesystem via a specified path. Then I'd build on the existing git support for handling remote projects.

Yes. I was thinking of a mechanism similar to the one used for the run command: it first checks the specified path on the local file system, then looks for it on GitHub (unless the full GH URL is specified). I like the idea of specifying the revision (tag).

Syntax? I like the idea of overloading process with a new subworkflow component

I like it as well, though I'm not sure how feasible it is to implement. It must be verified.

I wonder if we could get away with forcing workflow.input and workflow.output to be only dataflow values or dataflow queues?

There isn't a real need for this, because any object that is not a dataflow value is implicitly converted to it when connected to a process in the from clause.
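A tiny DSL1 illustration of that implicit conversion (the names are illustrative): a plain Groovy value used in a from clause behaves as if it had been wrapped in a dataflow value channel.

greeting = 'hello'          // a plain Groovy string, not a channel

process echoIt {
    input:
    val(x) from greeting    // implicitly converted to a dataflow value

    output:
    stdout into result

    """
    echo $x
    """
}

result.println()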

I assume that you want to maintain it, but should that extend to using existing pipelines as modules? I'd argue that a nextflow script must define workflow.input and workflow.output before it can be used as a module

YES!

This could be a base on which to attempt an implementation, but I guess it will be much more challenging to code when put into practice.

@msmootsgi
Contributor

Sounds like a plan! Would you like me to write something up that fully describes things? I can add it to the repo and then you can modify it as you see fit.

I'm also happy to help implement this, but I'll need plenty of guidance to get going.

@pditommaso
Member

You are more than welcome.

@stevekm
Contributor

stevekm commented Jan 25, 2018

Any updates on this?

Also, regarding implementation, I was wondering: are processes considered 'objects' in Nextflow? I implemented a similar feature in my own workflow manager by making each pipeline task ("process") a Python object; the user determines which tasks get run through a YAML file that simply lists the names and order of tasks to complete, and the program looks for modules of the same name to run.

Not sure if such an approach would work in Groovy & Nextflow, having each process in a separate file and simply providing a list of which ones to import & run. This way, you could dynamically create the pipeline based on user inputs.

@stevekm
Contributor

stevekm commented Jan 30, 2018

Also, it appears I am a little late to the party: after further investigation, I found that a similar feature has been implemented using the 'profiles' feature, as discussed here:

https://groups.google.com/forum/#!searchin/nextflow/grape-nf%7Csort:date/nextflow/jTzzE-Lb5iU/To0PxL0EAwAJ

and implemented here:

https://github.com/guigolab/grape-nf

https://github.com/guigolab/grape-nf/blob/master/nextflow.config

Not sure if this is exactly the same as loading processes from external files, though.
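For context, here is a minimal sketch of the profiles mechanism in general (illustrative values, not grape-nf's actual configuration). Profiles swap configuration per run rather than importing code:

// nextflow.config - select a profile with: nextflow run main.nf -profile star
profiles {
    star {
        params.mapper     = 'star'
        process.container = 'example/star:latest'     // illustrative image
    }
    hisat2 {
        params.mapper     = 'hisat2'
        process.container = 'example/hisat2:latest'   // illustrative image
    }
}

The pipeline script then branches on params.mapper; the set of processes stays fixed, so nothing is actually imported from another file.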

@merky

merky commented Mar 28, 2018

@mes5k @pditommaso I would love the modular approach spec'd out above! Anyone working on this? Enabling modularity would make Nextflow an exponentially more powerful workflow executor.

@stevekm I think the profiles/templates approach you mentioned solves a different problem, not the modular re-use of subsections of a workflow, which is what the original proposal addresses.

@pditommaso
Member

This is still among the desirable things to do; unfortunately, there's been no progress yet.

@stevekm
Contributor

stevekm commented May 2, 2018

After working with Nextflow for a while I've started to see the lack of this feature as an advantage. Keeping all workflow processes contained in a single file greatly reduces complexity, compared to importing external modules. Trying to understand and troubleshoot pipelines that use the latter format is a big headache. Thoughts? Maybe a better discussion for Gitter/Google Groups?

@pditommaso
Member

pditommaso commented May 7, 2018

After working with Nextflow for a while I've started to see the lack of this feature as an advantage

This is a controversial topic. In general I agree that duplication is better than the wrong abstraction, also taking into consideration that NF was conceived with the idea of making a pipeline easily readable and understandable from the developer's point of view, at the level of the tools and command lines executed.

However there are use cases in which the ability to reuse a module or a complete pipeline can be useful.

The goal of this issue is not to implement a module system but instead to implement the ability to import into a NF script an existing NF pipeline and execute it as a sub-workflow. This would allow users to decompose big projects into small self-contained workflows that could be recomposed in bigger ones as needed.

@msmootsgi
Contributor

I can't emphasize enough how important it is to have some feature to abstract away layers of detail. I've got several pipelines that are over 1000 lines of code, with several dozen processes and hundreds of operators manipulating the data per pipeline. These files are simply too large to reason about easily, which slows down our ability to improve and enhance them and makes it much easier to introduce bugs. I also find myself cutting and pasting sections of pipelines when I only need a subset of functionality for certain projects. This is especially frustrating because there are clear sections of the pipelines that could be extracted and naturally expressed as sub-workflows (or modules, or whatever).

I'm just sorry that I haven't had the time to contribute this myself.

@prakruthiburra

Hey all. We REALLY need this feature. If we could get some help, perhaps we could contribute? Maybe a phone conference to discuss how we could start?

@oguitart

oguitart commented Nov 8, 2018

Hi,

I think this would be very useful. We have huge workflows, and it would be better to split them into subworkflows and import them into a main workflow that would run them sequentially.

@rspreafico

rspreafico commented Mar 27, 2019

Just another upvote for modules. I have played with Nextflow and, more recently, WDL/Cromwell. My (personal) conclusion is that Nextflow is superior in almost every respect, except maybe a few. The strongest point in favor of WDL is the reusability of tasks, which can be easily imported by several workflows. That is really huge, as it saves a lot of time when writing pipelines and allows making necessary updates to a task only once. I cannot think of any reason to stick with WDL if Nextflow supported modules. (The minor points were that WDL is less Paolo-dependent ;-) and that the language specification is separate from the implementation - and of course the Broad brand helps. Nothing nearly as critical as modules.) So help me go back to Nextflow and convince my peers too ;-)

@pditommaso
Member

Nearly there. Have a look at #984, going to be merged into master next month.

@rspreafico

That is AWESOME! Thanks for pointing me to that. The new syntax will likely require extensive revision not only of the docs, but also of commonly used patterns, e.g. due to downplaying the use of publishDir and when.

@pditommaso
Member

Closing this, since the modules feature has been merged into the master branch.
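For reference, the modules feature as it later stabilized in DSL2 is based on an include statement; a minimal sketch (the module path, process name, and command are illustrative):

// modules/fastqc.nf - a process defined in its own file
process FASTQC {
    input:
    path reads

    output:
    path 'fastqc_out'

    """
    mkdir fastqc_out
    fastqc --outdir fastqc_out $reads
    """
}

// main.nf - import the process and call it
nextflow.enable.dsl = 2

include { FASTQC } from './modules/fastqc'

workflow {
    reads_ch = Channel.fromPath(params.reads)
    FASTQC(reads_ch)
}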
