Merge of conflicting process names in logs #5940

kdesnos · 2025-04-03T00:02:43Z

Hi,

I'm not really sure this can be considered a bug, but since this "issue" puzzled me, I'm sharing it with everyone to see if it should be acted upon.

The issue

When the list of process names is printed in the .nextflow.log file, two distinct processes that are declared in two different .nf files but share the exact same name may be merged into a single entry in the list:
Apr-03 09:55:00.694 [main] DEBUG nextflow.Session - Workflow process names [dsl2]: process_a, process_b

This was the source of my puzzlement: when you add a new process to a file, you expect to see it appear in this list.

Steps to reproduce

Declare processes in two files:

// file: main.nf
process a {
    output:
        stdout

    script: 
    """
        echo "I'm process main:a"
    """
}

// file: sub.nf
process a {
    output:
        stdout

    script: 
    """
        echo "I'm process sub:a"
    """
}

Execute the workflow; the following line is printed in the logs:
[main] DEBUG nextflow.Session - Workflow process names [dsl2]: a

If the processes are used in several hierarchical workflows, the "resolved" names are also printed in this list:
a for the use of main.nf:a in the "top level" workflow of main.nf, and sub:a for the use of sub.nf:a in the sub workflow.

The cause

The reason behind this behavior is that the static member function ScriptMeta.allProcessNames() returns the process names as a Set<String>, which cannot contain duplicates. In a scenario where a is used within both the main workflow and a sub workflow, the following elements are actually added to the set:

a : process main.nf:a
a : process sub.nf:a
a : Use of process main.nf:a within top-level workflow
sub:a : Use of process sub.nf:a within a sub workflow.

This results in the following set: a, sub:a.
If sub.nf:a is not used directly within a workflow, but is instead imported under an alias in another workflow (e.g. include { a as x } from './sub.nf'), the list looks like this: a, x, with no clear trace of the multiple definitions of a.
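
The collapse can be reproduced outside Nextflow. The following Python sketch is only an illustration under stated assumptions: it does not call any Nextflow code, the `registered` list simply mirrors the four entries listed above, and an insertion-ordered de-duplication stands in for the Set<String> (the actual ordering of the Nextflow set is not something this sketch claims to reproduce).

```python
# Illustrative only: mimics the four registrations listed above.
# Each pair is (name added to the set, where that name comes from).
registered = [
    ("a", "definition of process main.nf:a"),
    ("a", "definition of process sub.nf:a"),
    ("a", "use of main.nf:a in the top-level workflow"),
    ("sub:a", "use of sub.nf:a in the sub workflow"),
]

# dict.fromkeys keeps first-insertion order and drops duplicate keys,
# standing in for a set of strings that cannot contain duplicates.
process_names = list(dict.fromkeys(name for name, _ in registered))

print(", ".join(process_names))  # a, sub:a
```

Four registrations go in, but only two names come out, and nothing in the result says that a covers two definitions and one use.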

What can be done?

To answer what can be done, I should first explain why this is bothering me :) My objective* is to analyze the traces of pipelines run with Nextflow in order to build a model for predicting the performance of future executions of the same pipeline. To do that, I need to be able to associate each process execution from a report, identified by its "resolved" and possibly aliased name, with the corresponding process definition (file:process_name). This is currently impossible, notably because of the issue described above.

Potential "solution" number 1:
A "simple" solution would be to replace the Set<String> returned by the ScriptMeta.allProcessNames() with a List<String>. That way, the printed list of process names would be "complete". In our example:
a, a, a, sub:a
I'm not a big fan of this solution though. Although the multiple definitions and uses of the process now appear as expected, the three a entries in the list are not very informative, and the difference between the definition and the use of a in the top workflow is not visible.

Potential "solution" number 2:
I think it would be more informative to print the list of process definitions separately from the list of "resolved" process names. These two lists could then act as a dictionary associating each resolved process name with its corresponding file and process definition. The result could look something like this in our example:

[main] DEBUG nextflow.Session - Workflow process definitions [dsl2]: main.nf [a, other_process, ...], sub.nf [a, ...]
[main] DEBUG nextflow.Session - Workflow resolved process names: a[main.nf:a], sub:a[sub.nf:a], x[sub.nf:a]
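
To make the intended structure concrete, here is a minimal Python sketch of the two lists and of how they could be rendered in the format shown above. Everything in it is hypothetical: `definitions`, `resolved`, and the two formatting helpers are names made up for this illustration and do not correspond to anything in the Nextflow code base.

```python
# Hypothetical data for the running example; not Nextflow structures.
# File name -> process names defined in that file.
definitions = {"main.nf": ["a"], "sub.nf": ["a"]}

# Resolved (possibly aliased) name -> (file, process definition).
resolved = {
    "a": ("main.nf", "a"),
    "sub:a": ("sub.nf", "a"),
    "x": ("sub.nf", "a"),
}

def format_definitions(defs):
    """Render 'file [p1, p2], file2 [p3]' as in the first proposed log line."""
    return ", ".join(f"{file} [{', '.join(names)}]" for file, names in defs.items())

def format_resolved(res):
    """Render 'name[file:process], ...' as in the second proposed log line."""
    return ", ".join(f"{name}[{file}:{proc}]" for name, (file, proc) in res.items())

print("Workflow process definitions [dsl2]: " + format_definitions(definitions))
print("Workflow resolved process names: " + format_resolved(resolved))
```

With the second structure, looking up the definition behind an alias such as x is a plain dictionary access, which is the association needed when analyzing reports.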

I can probably provide code for this solution in the next few days if you think it would be interesting to include it in a future version (which would also be great for me, to avoid having to maintain this in my own fork :D )

Cheers,

Karol

*: (I'm a scientist in embedded high-performance system design and optimization)

kdesnos · 2025-04-03T06:16:28Z

Code implementing the aforementioned solution number 2 is available in my fork kdesnos/nextflow:printProcessDictionnaryInLogs.

The only difference from the previously described behavior is that resolved process names are grouped by process definition, as follows:
[main] DEBUG nextflow.Session - Workflow resolved process names: main.nf=a=[a], sub.nf=a=[sub:a, x]
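
For anyone who wants to consume this line from a script, the following Python sketch turns it into a lookup from resolved name to definition. It assumes the log format is exactly the one shown above (file=process=[names], comma-separated) and that file and process names contain no `=`, `,` or whitespace; the function name `parse_resolved_names` is made up for this example.

```python
import re

# Assumed format, copied from the example log line above.
LINE = ("[main] DEBUG nextflow.Session - Workflow resolved process names: "
        "main.nf=a=[a], sub.nf=a=[sub:a, x]")

# One group per definition: file, process name, bracketed resolved names.
GROUP = re.compile(r"([^=,\s]+)=([^=,\s]+)=\[([^\]]*)\]")

def parse_resolved_names(line):
    """Map each resolved name to its (file, process) definition."""
    # Keep only the part after the fixed message prefix.
    _, _, payload = line.partition("Workflow resolved process names: ")
    lookup = {}
    for file, process, names in GROUP.findall(payload):
        for name in names.split(","):
            name = name.strip()
            if name:  # skip empty entries such as "[]"
                lookup[name] = (file, process)
    return lookup

lookup = parse_resolved_names(LINE)
print(lookup["x"])  # ('sub.nf', 'a')
```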

kdesnos · 2025-04-04T04:52:01Z

I created a PR in case the proposed change is deemed worthy of production. I believe that, beyond my own need, this change makes it easier to identify what an aliased process name corresponds to.

Importantly, I verified that none of the suggested information is printed when the log level is raised to trace.

kdesnos mentioned this issue Apr 4, 2025

Print process dictionary in logs (#5940) #5944

Open
