Proposal: Static types #309
base: preview-25-04
Conversation
def args_fasterqdump = task.ext.args_fasterqdump ?: ''
def args_pigz = task.ext.args_pigz ?: ''
Rethinking the approach to ext args. People seem to really like being able to set these args from the config without the hassle of custom params and passing them through the workflow logic. If nothing else, I think we can keep supporting the ext approach, even with static types, so that it doesn't become a roadblock.
One way we could evolve the ext approach is to allow a process to declare additional params/args in a separate section:
input:
    // ...

params:
    args_fasterqdump: String
    args_pigz: String
    prefix: String
These are essentially process-level params, i.e. inputs that are passed in through the config rather than the pipeline code.
The config would look basically the same, perhaps with a clearer syntax:
params {
    // no glob patterns, only simple names
    withName: SRATOOLS_FASTERQDUMP {
        args_fasterqdump = '--split-files --include-technical'
    }
}
So it would be basically the same approach as the ext config, but since the process declares these settings, Nextflow can validate them in the config, provide auto-completion, etc.
My only concern is that these "process params" could be abused. At the same time, these params can basically only be strings, which limits their usability to things like args, prefix, etc., so maybe it doesn't matter. I need to think on it more.
I agree with the potential for abuse, but the ability to use closures has been a very strong attraction here in the usage of ext.args. It's essentially a necessity for ext.prefix, and it's highly desired to be able to select parameters based on pipeline inputs too.
Something I'm wondering is whether a profiles approach might also solve this issue (although it would have to support nesting to some degree), so it could be set using a JSON schema or something, but this might be too complex for the average user.
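For context, the current ext approach in an nf-core-style modules.config leans heavily on closures, roughly like this (process name, param name, and values are illustrative):
process {
    withName: 'FASTP' {
        // closures let the args depend on params and on the per-sample meta map
        ext.args   = { params.trim_polyg ? '--trim_poly_g' : '' }
        ext.prefix = { "${meta.id}.trimmed" }
    }
}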
Also, using only simple names is not going to help; for example, nf-core/rnaseq has 5 patterns for SAMTOOLS_INDEX.
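That is, selectors like these (illustrative paths, not the exact rnaseq patterns) have no simple-name equivalent:
process {
    withName: '.*:ALIGN_STAR:BAM_SORT_STATS_SAMTOOLS:SAMTOOLS_INDEX' {
        ext.args = ''
    }
    withName: '.*:QUANTIFY_RSEM:BAM_SORT_STATS_SAMTOOLS:SAMTOOLS_INDEX' {
        ext.args = '-c'
    }
}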
Also, I fully support having this in params, because anything that could potentially change the results should be in the -params-file, and not half in params and half in process config such as a nextflow.config in the launch directory, which could be forgotten if someone comes back later to rerun something.
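That way the whole run can be reproduced from a single file, e.g. a hypothetical params.json (reusing the args_fasterqdump name proposed above) supplied with nextflow run main.nf -params-file params.json:
{
    "input": "ids.csv",
    "args_fasterqdump": "--split-files --include-technical"
}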
I've also come to the opinion that it's likely only the ext.args equivalent that should be in the params. There are parts that should remain under pipeline developer control, so having these params as strings is perhaps fine; the dynamic part should then perhaps be process inputs instead, onto which these user-defined extra opts get appended. The ext.prefix is also something that should likely be a process input too, rather than something a user can control.
process FOO {
    input:
    path bar
    val args, params: true // Map where keys in the map are appendable from params, otherwise throws an error?
    val prefix // Map [ bwa_prefix, samtools_prefix ]

    script:
    """
    ...
    """
}
What worries me about this idea is that a user will say they want to remove developer-set args (where the idea is that the developer only sets what's necessary for pipeline stability).
Following the latest refactor, I think it will be possible to declare ext args as regular inputs:
input:
    // ...
    args_fasterqdump: String
    args_pigz: String
    prefix: String
But now, these inputs could be supplied either in the pipeline code:
ch_input.map(FOO, args_pigz: '...')
Or the config:
process {
    withName: FOO {
        ext.args_fasterqdump = '--split-files --include-technical'
    }
}
This becomes possible because extra process inputs are now specified by name in the pipeline code. The only question would be how to handle overrides -- it wouldn't make sense for the user to be able to override "main" inputs like the sample id from the config, so we might still need a way to distinguish "main" inputs from "extra" inputs.
If one could override an entire map from the config, that would mean people could also use portions of a workflow too. So I can see a use for overriding from the config.
I like the idea, but there definitely needs to be something to handle the overrides.
workflows/sra/main.nf
Outdated
ftp_samples
    .mix(sratools_samples)
    .mix(aspera_samples)
Instead of doing a separate filter/map for each branch, you could handle all three branches in a single map operator:
ncbi_settings = params.download_method == DownloadMethod.SRATOOLS
    ? CUSTOM_SRATOOLSNCBISETTINGS( sra_metadata.collect() )
    : null

samples = sra_metadata.map { meta ->
    def method = getDownloadMethod(meta, params.download_method)
    match( method ) { // or switch, or if-else, etc.
        DownloadMethod.FTP -> {
            def out = SRA_FASTQ_FTP ( meta )
            new Sample(meta.id, out.fastq_1, out.fastq_2, out.md5_1, out.md5_2)
        },
        DownloadMethod.SRATOOLS -> {
            def sra = SRATOOLS_PREFETCH ( meta, ncbi_settings, dbgap_key )
            def fastq = SRATOOLS_FASTERQDUMP ( meta, sra, ncbi_settings, dbgap_key )
            def fastq_1 = fastq[0]
            def fastq_2 = !meta.single_end ? fastq[1] : null
            new Sample(meta.id, fastq_1, fastq_2, null, null)
        },
        DownloadMethod.ASPERA -> {
            def out = ASPERA_CLI ( meta, 'era-fasp' )
            new Sample(meta.id, out.fastq_1, out.fastq_2, out.md5_1, out.md5_2)
        }
    }
}
(the match is like a switch-case or if-else chain, just an idea I had for how we could do it in Nextflow)
I think this is a more concise description of the workflow, because you basically just write "regular code" and less operator logic. Now I enjoy playing with operators as much as anyone, but beyond a certain point they become a distraction in my opinion. Others may disagree.
On the other hand, this approach isn't very amenable to subworkflows. You can't call a workflow inside the map closure, and you can't factor out any of this code into a separate function. I might be able to remove this limitation in the future. But for now I think this approach would lead to fewer subworkflows. Maybe that's fine if the refactor simplifies the overall workflow logic by a lot. You certainly wouldn't have to argue so much about naming workflows then 😉
At the end of the day, either way works just fine. You can split up the logic and use more operators, which allows you to factor out subworkflows like for the SRATOOLS route, or you can do it all in one giant map operator, or something in between.
Either way is easy to do for fetchngs because you just mix the various routes at the end. But when you need to join several related channels from different processes -- think like raw fastqc + trimmed fastqc + bam for each sample in rnaseq -- I think that's where the giant map operator is much easier. A good rule of thumb might be to use separate operators for mixing, but use a single map operator for joining.
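For the joining case, the giant-map style would look roughly like this (process names are placeholders, reusing the operator-style calls sketched above):
results = samples.map { meta ->
    def fastqc_raw  = FASTQC_RAW ( meta )
    def trimmed     = TRIM ( meta )
    def fastqc_trim = FASTQC_TRIM ( trimmed )
    def bam         = ALIGN ( trimmed )
    // everything for this sample is already in scope, so no join/groupTuple is needed
    meta + [fastqc_raw: fastqc_raw, fastqc_trim: fastqc_trim, bam: bam]
}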
I think it would be nice if subworkflows behaved in a similar fashion to processes. At the moment, they're just zip-ties for channels, and I think people would find working with subworkflows nicer if they behaved more like functions.
On the topic of subworkflows, do you think people should be able to do things like channels of channels? Because that's what making subworkflows more like functions would mean. You would be able to have nested levels of concurrency, and you wouldn't be able to know the full process DAG up front.
I guess not knowing the full process DAG isn't inherently bad, but it would be a fundamental change in Nextflow's execution model.
But my bigger concern is that it might make Nextflow much more complicated -- basically on the same level as any general-purpose language -- which might be prohibitive for many potential developers.
I mean, if we're talking about extra process inputs being too much of a hassle, imagine what allowing channels of channels would do 😅
Hmm, I hadn't thought of channels of channels. It's not something I think should be permitted. My thought was really more like passing a list input which would get turned into a channel, explicitly in the subworkflow. Like a process, a subworkflow wouldn't take a channel as input, but you would use map to call it on groups of things most likely.
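In today's syntax, the idea would look roughly like this (PROCESS_GROUP and FOO are placeholders): the take is a plain list rather than a channel, and the channel is created explicitly inside the subworkflow:
workflow PROCESS_GROUP {
    take:
    samples              // a plain List of meta maps, not a channel

    main:
    ch_samples = Channel.fromList( samples )
    FOO ( ch_samples )

    emit:
    results = FOO.out
}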
I was thinking about the suggestion of subworkflows only allowing non-channel inputs again (in this DSL3 proposal context) and realized: might this make recursion and feedback loops easier to implement? If a subworkflow has to treat inputs and outputs a bit like processes do (i.e. all inputs and outputs are non-channel types, and inputs are explicitly converted into channels in the main: block with channel factories), might it make certain things easier to compose, or does it just become memory-hungry with all the extra objects that get created? (I'm thinking of someone spawning 10,000 subworkflows for some purpose, rather than the handful of channels if it remains like a zip-tie.)
After the latest refactor, my original comment here (the giant match-case block) is no longer relevant. The existing paradigm of operators and workflows is largely retained, only simplified.
I continue to see subworkflows as essentially "compiler macros" (what you call "zip-ties for channels"). But I am keeping in the back of my head the idea of a function that doesn't allow channel inputs but does allow some kind of nested parallelization via processes.
I have to admit, I don't think my naive notion of "subworkflows as functions" scales very well, so the current model is still necessary.
It's more that there needs to be some kind of translation from it, because people do find thinking in that way a bit easier, and it translates to recursion more easily too. There seems to be interest in this, as seen here: https://nextflow.slack.com/archives/C02T98A23U7/p1745459250690079
Would it be conceivable to have another type of block, different from workflow? A composite block, for lack of a better word atm. The idea being that those processes are not linked with channels, but just pass values from one process to the next (there can still be channels under the hood). This might also make for a logical grouping for those who want a group-submission feature. This would be something that could be recursed on too, rather than a workflow. I think this might be a very popular feature.
I think this could work. For example, we could have a workflow that uses input/output instead of take/emit:
workflow SRATOOLS {
    input:
    meta: Map
    ncbi_settings: Path
    dbgap_key: String

    main:
    sra = SRATOOLS_PREFETCH ( meta, ncbi_settings, dbgap_key )
    fastq = SRATOOLS_FASTERQDUMP ( meta, sra, ncbi_settings, dbgap_key )

    output:
    sample: Sample = meta + fastq
}
record Sample {
    id: String
    // ...
}
Or, alternatively, a process that uses main instead of script, and can compose other processes:
process SRATOOLS {
    input:
    // (same as above)

    main:
    sra = SRATOOLS_PREFETCH ( meta, ncbi_settings, dbgap_key )
    fastq = SRATOOLS_FASTERQDUMP ( meta, sra, ncbi_settings, dbgap_key )

    output:
    // (same as above)
}
I found a way to achieve static types with significantly fewer changes, and still managed to simplify the dataflow logic overall. The key breakthrough is to treat the record type as a contract over a map (like the meta-map), rather than a fixed data type that can only contain the record members at all times. This gives you a great deal of flexibility. We still use a
Record types are not used much explicitly. They are mainly used in input/output declarations, but most of the intermediate code is just channels of maps. This turns out to be really elegant, and it will be much easier for me to implement 😅
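A rough sketch of how I picture it (hypothetical; the details in the PR may differ):
record Sample {
    id: String
    fastq_1: Path
    fastq_2: Path?
}

// intermediate values are plain maps, so extra keys (run accession, md5s, ...) can ride along
samples = sra_metadata.map { meta ->
    meta + [fastq_1: file("${meta.id}_1.fastq.gz"), fastq_2: null]
}

// the record is only used as a contract where it is declared, e.g.:
// emit:
// samples: Channel<Sample>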
Warning: Newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.0.2. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.
This PR is a showcase of the proposed syntax for static types in Nextflow.
While I started with the goal of simply adding type annotations and type checking, I realized that many aspects of the language needed to be re-thought in order to provide a consistent developer experience. Some of these things can be done now, but I suspect they will be more difficult without static types, so I have tried to show them in their "best form" in this PR.
Changes
Update to Nextflow 25.04. See Update to Nextflow 25.04 #347 for details.
Type annotations. Declarations throughout the pipeline (params, workflow takes and emits, process inputs and outputs, and workflow outputs) can be annotated with a type.
Nextflow will use these type annotations to infer the type of every value in the workflow and make sure they are valid.
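For example, a workflow's takes and emits might be annotated like this (a minimal sketch reusing names from this PR; exact details may differ):
workflow SRA {
    take:
    sra_metadata: Channel<Map>
    download_method: DownloadMethod

    main:
    // ... as in workflows/sra/main.nf ...

    emit:
    samples: Channel<Sample>
}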
The main built-in types are:
Integer, Number, Boolean, String: primitive types
Path: file or directory
List<E>, Set<E>, Bag<E>: collections with various constraints on ordering and uniqueness
Map<K,V>: map of key-value pairs
Channel<E>: channel (i.e. queue channel)
Records, enums, optional types. Types can be composed in several ways to facilitate domain modeling:
record Sample { meta: Map ; files: List<Path> }
enum TshirtSize { Small, Medium, Large }
A type can be suffixed with ? to denote that it can be null (e.g. String?); otherwise it should never be null.
Define pipeline params in the main script. Each param has a type. Complex types can be composed from collections, records, and enums. Rather than specifying a particular input format for input files, simply specify a type and Nextflow will use the type like a schema to transparently load from any source (CSV/JSON/etc). Config params are defined separately in the main config. nextflow_schema.json remains unchanged but will be partially generated from the main script / config.
Only use params in entry workflow. Params are not known outside the entry workflow. Pass params into processes and workflows as explicit inputs instead.
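A minimal sketch of how these two points fit together (syntax extrapolated from this proposal; the PR may differ):
params {
    input: List<String>                                   // e.g. SRA ids, loaded via the type like a schema (CSV/JSON/etc)
    download_method: DownloadMethod = DownloadMethod.FTP
    outdir: String
}

workflow {
    main:
    // params are only visible in the entry workflow; pass them on explicitly
    samples = SRA ( params.input, params.download_method )
}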
Processes are just functions. Instead of calling a process directly with channels, use operators and supply the process name in place of an operator closure:
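For example, as sketched earlier in the review discussion:
// the process name takes the place of an operator closure;
// extra inputs can be supplied as named args
samples = ch_input.map(FOO, args_pigz: '--best')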
Value channels are just values. No need to create value channels -- just use the value itself, Nextflow will figure out the rest. All channels are queue channels.
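For instance, a settings file produced once and reused for every channel element is just a plain value, with no Channel.value() wrapper:
ncbi_settings = CUSTOM_SRATOOLSNCBISETTINGS( sra_metadata.collect() )
sra = sra_metadata.map { meta -> SRATOOLS_PREFETCH( meta, ncbi_settings, dbgap_key ) }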
Simple operators. Use a simple and composable set of operators:
collect: collect channel elements into a collection (i.e. bag)
cross: cross product of two channels
filter: filter a channel based on a condition
gather: nested gather (similar to groupTuple)
join: relational join of two channels (i.e. horizontal)
map: transform a channel
mix: concatenate multiple channels (i.e. vertical)
reduce: accumulate each channel element into a single value
scan: like reduce but emit each intermediate value
scatter: nested scatter (similar to flatMap)
subscribe: invoke a function for each channel element
view: print each channel element
Benefits
Well-defined workflow inputs. Workflow inputs are explicitly defined alongside the entry workflow as a set of name-type pairs (i.e. a record type). Complex params can be loaded transparently from any source (file, database, API, etc) as long as the runtime supports it. The JSON schema of a param is inferred from the param's type.
Well-defined workflow outputs. Workflow outputs are explicitly defined as a set of name-type pairs (i.e. a record type). Each output can create an index file, which is essentially a serialization of a channel to external storage (file, database, API, etc), and each output can define how its published files are organized in a directory tree. The JSON schema of an output is inferred from the output's type.
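A rough sketch of an output definition with an index file, along the lines of the workflow output definition previewed in recent Nextflow releases (details may differ here):
output {
    samples {
        path 'fastq'                // how published files are organized
        index {
            path 'samples.csv'      // serialize the samples channel to an index file
        }
    }
}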
Make pipeline import-able. Separating the "core" workflow (i.e. SRA) from params and publishing makes it easy to import the pipeline into larger pipelines. See https://github.com/bentsherman/fetchngs2rnaseq for a more complete example.
Simpler dataflow logic. Processes are called like an operator closure, generally with a single channel of maps, where the map keys correspond to the process inputs. Additional inputs can be provided as named args. As a result, the amount of boilerplate in the workflow logic and process definition is significantly reduced.
Simpler operator library. With a minimal set of operators, users can easily determine which operator to use based on their needs. The operators listed above are statically typed and pertain only to stream operations.
Simpler process inputs/outputs. Process inputs/outputs are declared in the same way as workflow takes/emits and pipeline params/outputs, instead of the old custom type qualifiers. Inputs of type Path are automatically staged. Thanks to the simplified dataflow logic described above, tuples are generally not needed.
Extra Notes
This proposed syntax will be enabled by the following internal improvements:
New script/config parser, which enables us to evolve the Nextflow language into whatever we want, without being constrained by Groovy syntax (though it still must compile to Groovy AST).
Static analysis, which can infer the type of every value based on the declared types of pipeline/workflow/process inputs.
Automatic generation of JSON schemas for workflow params and outputs based on type annotations. Preserves support for external tools like Seqera Platform, and lays the groundwork to transparently support different connectors (CSV/JSON file, HTTP API, SQL database, etc).