
Proposal: Static types #309

Draft · wants to merge 1 commit into preview-25-04

Conversation

@bentsherman commented May 1, 2024

This PR is a showcase of the proposed syntax for static types in Nextflow.

While I started with the goal of simply adding type annotations and type checking, I realized that many aspects of the language needed to be re-thought in order to provide a consistent developer experience. Some of these things can be done now, but I suspect they will be more difficult without static types, so I have tried to show them in their "best form" in this PR.

Changes

  • Update to Nextflow 25.04. See Update to Nextflow 25.04 #347 for details.

  • Type annotations. The following declarations can be annotated with a type:

    • workflow params/outputs
    • workflow takes/emits
    • process inputs/outputs
    • function parameters/return
    • local variables (generally not needed)

    Nextflow will use these type annotations to infer the type of every value in the workflow and to check that each value is used consistently with its declared type (see the sketch after this list).

    The main built-in types are:

    • Integer, Number, Boolean, String: primitive types
    • Path: file or directory
    • List<E>, Set<E>, Bag<E>: collections with various constraints on ordering and uniqueness
    • Map<K,V>: map of key-value pairs
    • Channel<E>: channel (i.e. queue channel)
  • Records, enums, optional types. Types can be composed in several ways to facilitate domain modeling:

    • records: a combination of named values, e.g. a sample is a meta map AND some files:
      record Sample { meta: Map ; files: List<Path> }
    • enums: a union of named values, e.g. a shirt size can be small or medium or large:
      enum TshirtSize { Small, Medium, Large }
    • optionals: any type can be suffixed with ? to denote that it can be null (e.g. String?), otherwise it should never be null
  • Define pipeline params in the main script. Each param has a type. Complex types can be composed from collections, records, and enums. Rather than specifying a particular input format for input files, simply specify a type and Nextflow will use the type like a schema to transparently load from any source (CSV/JSON/etc). Config params are defined separately in the main config. nextflow_schema.json remains unchanged but will be partially generated from the main script / config.

  • Only use params in entry workflow. Params are not known outside the entry workflow. Pass params into processes and workflows as explicit inputs instead.

  • Processes are just functions. Instead of calling a process directly with channels, use operators and supply the process name in place of an operator closure:

    // execute FASTQC in parallel on each input file
    Channel.fromPath( "inputs/*.fastq" ).map(FASTQC)
    
    // execute ACCUMULATE sequentially on each input file
    // (replaces experimental recursion)
    Channel.fromPath( "inputs/*.txt" ).reduce(ACCUMULATE)
  • Value channels are just values. No need to create value channels -- just use the value itself, and Nextflow will figure out the rest. All channels are queue channels.

  • Simple operators. Use a simple and composable set of operators:

    • collect: collect channel elements into a collection (i.e. bag)
    • cross: cross product of two channels
    • filter: filter a channel based on a condition
    • gather: nested gather (similar to groupTuple)
    • join: relational join of two channels (i.e. horizontal)
    • map: transform a channel
    • mix: concatenate multiple channels (i.e. vertical)
    • reduce: accumulate each channel element into a single value
    • scan: like reduce but emit each intermediate value
    • scatter: nested scatter (similar to flatMap)
    • subscribe: invoke a function for each channel element
    • view: print each channel element
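
To make the items above concrete, here is a minimal sketch combining typed params, records, enums, optionals, and the simplified operators. The record and enum syntax is taken from this proposal; the typed params block and the exact operator chaining are illustrative assumptions, not final syntax:

// records and enums compose domain types
record Sample {
    meta: Map
    files: List<Path>
}

enum DownloadMethod {
    FTP, SRATOOLS, ASPERA
}

// hypothetical typed params in the main script -- Nextflow would load
// `input` like a schema from any source (CSV/JSON/etc)
params {
    input: List<Sample>
    download_method: DownloadMethod
    dbgap_key: String?    // optional: may be null
}

workflow {
    // a plain list becomes a channel; each element is checked as a Sample
    Channel.fromList( params.input )
        .filter { sample -> !sample.files.isEmpty() }
        .map(FASTQC)    // process supplied in place of an operator closure
        .view()
}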

Benefits

  • Well-defined workflow inputs. Workflow inputs are explicitly defined alongside the entry workflow as a set of name-type pairs (i.e. a record type). Complex params can be loaded transparently from any source (file, database, API, etc) as long as the runtime supports it. The JSON schema of a param is inferred from the param's type.

  • Well-defined workflow outputs. Workflow outputs are explicitly defined as a set of name-type pairs (i.e. a record type). Each output can create an index file, which is essentially a serialization of a channel to external storage (file, database, API, etc), and each output can define how its published files are organized in a directory tree. The JSON schema of an output is inferred from the output's type.

  • Make pipeline import-able. Separating the "core" workflow (i.e. SRA) from params and publishing makes it easy to import the pipeline into larger pipelines. See https://github.com/bentsherman/fetchngs2rnaseq for a more complete example.

  • Simpler dataflow logic. Processes are called like an operator closure, generally with a single channel of maps, where the map keys correspond to the process inputs. Additional inputs can be provided as named args. As a result, the amount of boilerplate in the workflow logic and process definitions is significantly reduced (see the sketch after this list).

  • Simpler operator library. With a minimal set of operators, users can easily determine which operator to use based on their needs. The operators listed above are statically typed and pertain only to stream operations.

  • Simpler process inputs/outputs. Process inputs/outputs are declared in the same way as workflow takes/emits and pipeline params/outputs, instead of the old custom type qualifiers. Inputs of type Path are automatically staged. Thanks to the simplified dataflow logic described above, tuples are generally not needed.
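
As a hedged sketch of this reduced boilerplate: the name: Type input/output style below follows the declarations shown elsewhere in this PR, while the specific process, its input names, and the output expression are illustrative assumptions:

// inputs/outputs declared like takes/emits -- no val/path/tuple qualifiers
process FASTQC {
    input:
    id: String
    fastq_1: Path
    fastq_2: Path?    // optional: single-end samples pass null

    output:
    html: Path = file("${id}_fastqc.html")

    script:
    """
    fastqc ${fastq_1} ${fastq_2 ?: ''}
    """
}

// called over a channel of maps; keys are matched to inputs by name,
// so no tuples are needed to carry (id, fastq_1, fastq_2) together
samples.map(FASTQC).view()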

Extra Notes

This proposed syntax will be enabled by the following internal improvements:

  • New script/config parser, which enables us to evolve the Nextflow language into whatever we want, without being constrained by Groovy syntax (though it still must compile to Groovy AST).

  • Static analysis, which can infer the type of every value based on the declared types of pipeline/workflow/process inputs.

  • Automatic generation of JSON schemas for workflow params and outputs based on type annotations. Preserves support for external tools like Seqera Platform, and lays the groundwork to transparently support different connectors (CSV/JSON file, HTTP API, SQL database, etc).

@bentsherman bentsherman changed the title DSL2+ / DSL3 preview DSL2+ / DSL3 proof-of-concept May 1, 2024
@bentsherman bentsherman changed the title DSL2+ / DSL3 proof-of-concept Preview: DSL2+ (and beyond) May 1, 2024
(Several comments from @mahesh-panchal, @samuell, and @bentsherman were marked as resolved, off-topic, or outdated and are hidden.)

@bentsherman bentsherman changed the title Preview: DSL2+ (and beyond) Proposal: Beyond DSL2 May 21, 2024
@bentsherman bentsherman changed the title Proposal: Beyond DSL2 Proposal: Static types Nov 2, 2024
Comment on lines +24 to +28
def args_fasterqdump = task.ext.args_fasterqdump ?: ''
def args_pigz = task.ext.args_pigz ?: ''
Author

Rethinking the approach to ext args. People seem to really like being able to set these args from the config without the hassle of custom params and passing them through the workflow logic. If nothing else, I think we can keep supporting the ext approach, even with static types, so that it doesn't become a roadblock.

One way we could evolve the ext approach is to allow a process to declare additional params/args in a separate section:

  input:
  // ...

  params:
  args_fasterqdump: String
  args_pigz: String
  prefix: String

These are essentially process-level params, i.e. inputs that are passed in through the config rather than the pipeline code.

The config would look basically the same, perhaps with a clearer syntax:

params {
    // no glob patterns, only simple names
    withName: SRATOOLS_FASTERQDUMP {
        args_fasterqdump = '--split-files --include-technical'
    }
}

So it would be basically the same approach as the ext config, but since the process declares these settings, Nextflow can validate them in the config, provide auto-completion, etc.

My only concern is that these "process params" could be abused. At the same time, these params can basically only be strings, which limits their usability to things like args, prefix, etc., so maybe it doesn't matter. Need to think on it more.

Member

I agree about the potential for abuse, but the ability to use closures has been a very strong attraction in the usage of ext.args. It's essentially a necessity for ext.prefix, and being able to select parameters based on pipeline inputs is highly desired too.

Something I'm wondering is whether a profiles approach might also solve this issue (although it would have to support nesting to some degree), so it could be set using a JSON schema or something, but this might be too complex for the average user.

Member

Also, using only simple names is not going to help; for example, nf-core/rnaseq has 5 patterns for SAMTOOLS_INDEX.

Member

Also, I fully support having this in params, because anything that could potentially change the results should be in the -params-file, not half in params and half in process config (such as a nextflow.config in the launch directory, which could be forgotten if someone comes back to rerun something later).

@mahesh-panchal (Member) commented Dec 13, 2024

I've also come to the opinion that it's likely only the ext.args equivalent that should be in the params. There are parts that should remain under pipeline developer control, so having these params as strings is perhaps fine; the dynamic part should then perhaps be process inputs instead, onto which these user-defined extra opts get appended. The ext.prefix is also something that should likely go in the process input: block too, rather than something a user can control.

process FOO {
    input:
    path bar
    val args, params: true // Map where keys in the map are appendable from params, otherwise throws an error?
    val prefix // Map [ bwa_prefix, samtools_prefix ]

    script:
    """
    ...
    """
}

What worries me about this idea is that a user will say they want to remove developer-set args (where the idea is that the developer only sets what's necessary for pipeline stability).

Author

Following the latest refactor, I think it will be possible to declare ext args as regular inputs:

  input:
  // ...
  args_fasterqdump: String
  args_pigz: String
  prefix: String

But now, these inputs could be supplied either in the pipeline code:

ch_input.map(FOO, args_pigz: '...')

Or the config:

process {
    withName: FOO {
        ext.args_fasterqdump = '--split-files --include-technical'
    }
}

This becomes possible because extra process inputs are now specified by name in the pipeline code. The only question would be how to handle overrides -- it wouldn't make sense for the user to be able to override "main" inputs like sample id from the config, so might still need a way to distinguish "main" inputs from "extra" inputs.

Member

If one could override an entire map from the config, that would mean people could also use portions of a workflow. So I can see a use for overriding from the config.

I like the idea though, but there definitely needs to be something to handle the overrides.

Comment on lines 93 to 74
ftp_samples
.mix(sratools_samples)
.mix(aspera_samples)
@bentsherman (Author) commented Nov 20, 2024

Instead of doing a separate filter/map for each branch, you could handle all three branches in a single map operator:

ncbi_settings = params.download_method == DownloadMethod.SRATOOLS
    ? CUSTOM_SRATOOLSNCBISETTINGS( sra_metadata.collect() )
    : null

samples = sra_metadata.map { meta ->
    def method = getDownloadMethod(meta, params.download_method)
    match( method ) { // or switch, or if-else, etc
        DownloadMethod.FTP -> {
            def out = SRA_FASTQ_FTP ( meta )
            new Sample(meta.id, out.fastq_1, out.fastq_2, out.md5_1, out.md5_2)
        },
        DownloadMethod.SRATOOLS -> {
            def sra = SRATOOLS_PREFETCH ( meta, ncbi_settings, dbgap_key )
            def fastq = SRATOOLS_FASTERQDUMP ( meta, sra, ncbi_settings, dbgap_key )
            def fastq_1 = fastq[0]
            def fastq_2 = !meta.single_end ? fastq[1] : null
            new Sample(meta.id, fastq_1, fastq_2, null, null)
        },
        DownloadMethod.ASPERA -> {
            def out = ASPERA_CLI ( meta, 'era-fasp' )
            new Sample(meta.id, out.fastq_1, out.fastq_2, out.md5_1, out.md5_2)
        }
    }
}

(the match is like a switch-case or if-else chain, just an idea I had for how we could do it in Nextflow)

I think this is a more concise description of the workflow, because you basically just write "regular code" and less operator logic. Now I enjoy playing with operators as much as anyone, but beyond a certain point they become a distraction in my opinion. Others may disagree.

On the other hand, this approach isn't very amenable to subworkflows. You can't call a workflow inside the map closure, and you can't factor out any of this code into a separate function. I might be able to remove this limitation in the future. But for now I think this approach would lead to fewer subworkflows. Maybe that's fine if the refactor simplifies the overall workflow logic by a lot. You certainly wouldn't have to argue so much about naming workflows then 😉

At the end of the day, either way works just fine. You can split up the logic and use more operators, which allows you to factor out subworkflows like for the SRATOOLS route, or you can do it all in one giant map operator, or something in between.

Either way is easy to do for fetchngs because you just mix the various routes at the end. But when you need to join several related channels from different processes -- think like raw fastqc + trimmed fastqc + bam for each sample in rnaseq -- I think that's where the giant map operator is much easier. A good rule of thumb might be to use separate operators for mixing, but use a single map operator for joining.

Member

I think it would be nice if subworkflows behaved in a similar fashion to processes. At the moment, they're just zip-ties for channels, and I think people would find working with subworkflows nicer if they behaved more like functions.

Author

On the topic of subworkflows, do you think people should be able to do things like channels of channels? Because that's what making subworkflows more like functions would mean. You would be able to have nested levels of concurrency, and you wouldn't be able to know the full process DAG up front.

@bentsherman (Author) commented Dec 13, 2024

I guess not knowing the full process DAG isn't inherently bad, but it would be a fundamental change in Nextflow's execution model.

But my bigger concern is that it might make Nextflow much more complicated -- basically on the same level as any general-purpose language -- which might be prohibitive for many potential developers.

I mean, if we're talking about extra process inputs being too much of a hassle, imagine what allowing channels of channels would do 😅

Member

Hmm, I hadn't thought of channels of channels. It's not something I think should be permitted. My thought was really more like passing a list input which would get turned into a channel, explicitly in the subworkflow. Like a process, a subworkflow wouldn't take a channel as input, but you would use map to call it on groups of things most likely.

Member

I was thinking again about the suggestion that subworkflows only allow non-channel inputs (in this DSL3 proposal context), and realized this might make recursion and feedback loops easier to implement. If a subworkflow has to treat inputs and outputs a bit like processes do (i.e. all inputs and outputs are a non-channel type, and inputs are explicitly converted into channels in the main: block with channel factories), might it make certain things easier to compose? Or does it just become memory-hungry with all the extra objects that get created (I'm thinking of someone spawning 10,000 subworkflows for some purpose, rather than the handful of channels if it remains like a zip-tie)?

Author

After the latest refactor, my original comment here (the giant match-case block) is no longer relevant. The existing paradigm of operators and workflows is largely retained, only simplified.

I continue to see subworkflows as essentially "compiler macros" (what you call "zip-ties for channels"). But I am keeping in the back of my head the idea of a function that doesn't allow channel inputs but does allow some kind of nested parallelization via processes.

@mahesh-panchal (Member) commented Apr 28, 2025

I have to admit, I don't think my naive idea of "subworkflows as functions" scales very well, so the current model is still necessary.

It's more that there needs to be some kind of translation from it, because people do find thinking that way a bit easier, and it translates to recursion more easily too. There seems to be interest, as seen here: https://nextflow.slack.com/archives/C02T98A23U7/p1745459250690079

@mahesh-panchal (Member) commented Apr 29, 2025

Would it be conceivable to have another type of block, different from workflow? A "composite" block, for lack of a better word. The idea being that those processes are not linked with channels, but just pass values from one process to the next (there can still be channels under the hood). This might also provide a logical grouping for those who want a group-submission feature, and it would be something that could be recursed on, rather than a workflow. I think this might be a very popular feature.

Author

I think this could work. For example, we could have a workflow that uses input/output instead of take/emit:

workflow SRATOOLS {
  input:
  meta: Map
  ncbi_settings: Path
  dbgap_key: String
  
  main:
  sra = SRATOOLS_PREFETCH ( meta, ncbi_settings, dbgap_key )
  fastq = SRATOOLS_FASTERQDUMP ( meta, sra, ncbi_settings, dbgap_key )

  output:
  sample: Sample = meta + fastq
}

record Sample {
  id: String
  // ...
}

Or, alternatively, a process that uses main instead of script, and can compose other processes:

process SRATOOLS {
  input:
  // (same as above)
  
  main:
  sra = SRATOOLS_PREFETCH ( meta, ncbi_settings, dbgap_key )
  fastq = SRATOOLS_FASTERQDUMP ( meta, sra, ncbi_settings, dbgap_key )

  output:
  // (same as above)
}

@bentsherman (Author)

I found a way to achieve static types with significantly fewer changes, and still managed to simplify the dataflow logic overall.

The key breakthrough is to treat the record type as a contract over a map (like the meta-map), rather than a fixed data type that can only contain the record members at all times. This gives you a great deal of flexibility.

We still use a map operator to call a process. But instead of calling the process in the operator closure, we call it as the operator closure. For each map in the channel, the map keys are matched to the process inputs by name.

  • The map can have additional keys, the process just won't use them
  • If the map is missing some keys, they can be provided as named args (no need to use join or combine)

Record types are not used much explicitly. They are mainly used in input/output declarations, but most of the intermediate code is just channels of maps.

This turns out to be really elegant, and it will be much easier for me to implement 😅
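
A minimal sketch of this matching behaviour, with hypothetical process and key names:

// each element is a map; its keys are matched to FOO's inputs by name
ch_input = Channel.of(
    [id: 'sample1', fastq: file('s1.fastq'), strandedness: 'auto']    // extra keys are simply ignored
)

// `args_pigz` is not a key in the maps, so it is supplied as a named arg
// (no join or combine needed)
ch_input.map(FOO, args_pigz: '--best')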

@nf-core-bot (Member)

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.0.2.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

github-actions bot commented May 15, 2025

nf-core pipelines lint overall result: Failed ❌

Posted for pipeline commit 84e16ca

✅ 159 tests passed
❔ 15 tests were ignored
❗ 5 tests had warnings
❌ 5 tests failed

❌ Test failures:

  • nextflow_config - Config variable not found: params.input
  • schema_params - Default value for param input invalid: Not in pipeline parameters. Check nextflow.config.
  • schema_params - Default value for param ena_metadata_fields invalid: Not in pipeline parameters. Check nextflow.config.
  • schema_params - Default value for param skip_fastq_download invalid: Not in pipeline parameters. Check nextflow.config.
  • schema_params - Default value for param dbgap_key invalid: Not in pipeline parameters. Check nextflow.config.

❗ Test warnings:

  • schema_lint - Parameter input is not defined in the correct subschema (input_output_options)
  • schema_params - Schema param input not found from nextflow config
  • schema_params - Schema param ena_metadata_fields not found from nextflow config
  • schema_params - Schema param skip_fastq_download not found from nextflow config
  • schema_params - Schema param dbgap_key not found from nextflow config

❔ Tests ignored:

  • files_exist - File is ignored: .github/workflows/awsfulltest.yml
  • files_exist - File is ignored: .github/workflows/awstest.yml
  • files_exist - File is ignored: assets/multiqc_config.yml
  • files_exist - File is ignored: conf/igenomes.config
  • files_exist - File is ignored: conf/igenomes_ignored.config
  • files_exist - File is ignored: conf/modules.config
  • files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md
  • files_unchanged - File ignored due to lint config: assets/sendmail_template.txt
  • files_unchanged - File ignored due to lint config: assets/nf-core-fetchngs_logo_light.png
  • files_unchanged - File ignored due to lint config: docs/images/nf-core-fetchngs_logo_light.png
  • files_unchanged - File ignored due to lint config: docs/images/nf-core-fetchngs_logo_dark.png
  • actions_ci - actions_ci
  • actions_awstest - 'awstest.yml' workflow not found: /home/runner/work/fetchngs/fetchngs/.github/workflows/awstest.yml
  • multiqc_config - multiqc_config
  • modules_config - modules_config

✅ Tests passed:

Run details

  • nf-core/tools version 3.0.2
  • Run at 2025-05-15 12:10:23

@bentsherman bentsherman changed the base branch from dev to preview-25-04 May 15, 2025 12:21
@maxulysse maxulysse added this to the 1.14.0 milestone May 19, 2025