Merge of conflicting process names in logs #5940

Open
kdesnos opened this issue Apr 3, 2025 · 2 comments

kdesnos commented Apr 3, 2025

Hi,

I'm not really sure this can be considered a bug, but since this "issue" puzzled me, I'm sharing it with everyone to see if it should be acted upon.

The issue

When the list of process names is printed in the .nextflow.log file, two distinct processes that are declared in two different .nf files but share the exact same name can collapse into a single entry in the list.
Apr-03 09:55:00.694 [main] DEBUG nextflow.Session - Workflow process names [dsl2]: process_a, process_b

This was the source of my puzzlement, as when adding a new process in a file, you expect to see it appear in this list.

Steps to reproduce

Declare a process named a in each of two files:

// file: main.nf
process a {
    output:
        stdout

    script: 
    """
        echo "I'm process main:a"
    """
}

// file: sub.nf
process a {
    output:
        stdout

    script: 
    """
        echo "I'm process sub:a"
    """
}

Execute the workflow; the following line is printed in the log:
[main] DEBUG nextflow.Session - Workflow process names [dsl2]: a

If the processes are used in several hierarchical workflows, the "resolved" names are also printed:
a for the use of main.nf:a in the "top level" workflow of main.nf, and sub:a for the use of sub.nf:a in a sub-workflow.

The cause

The reason behind this behavior is that the static method ScriptMeta.allProcessNames() returns the process names as a Set<String>, which cannot contain duplicates. In a scenario where a is used in both the main workflow and a sub-workflow, the following elements are actually added to the set:

  • a : process main.nf:a
  • a : process sub.nf:a
  • a : Use of process main.nf:a within top-level workflow
  • sub:a : Use of process sub.nf:a within a sub workflow.

This results in the set a, sub:a.
If sub.nf:a is not used directly within a workflow, but is instead imported under an alias in another workflow (e.g. include { a as x } from './sub.nf'), then the list becomes a, x, with no clear trace of the multiple definitions of a.
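The collapse can be reproduced outside Nextflow. The following is a minimal Python sketch, not Nextflow's actual code: the `registered` list is a hypothetical stand-in for the four registrations listed above, and a Python set plays the role of the Set<String>.

```python
# Hypothetical model of the four registrations described above.
# This is NOT Nextflow code; names and origins are taken from the example.
registered = [
    ("a", "definition in main.nf"),
    ("a", "definition in sub.nf"),
    ("a", "use of main.nf:a in the top-level workflow"),
    ("sub:a", "use of sub.nf:a in a sub-workflow"),
]

# A set keeps one entry per distinct string, so the three "a" entries merge
# and the information about where each one came from is lost.
unique_names = {name for name, _origin in registered}

print(", ".join(sorted(unique_names)))
```

Four registrations go in, but only two distinct strings come out, which matches the list printed in the log.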

What can be done?

To answer that, I should first explain why this bothers me :) My objective* is to analyze the traces of pipelines run with Nextflow in order to build a model that predicts the performance of future executions of the same pipeline. To do that, I need to associate each process execution in a report, identified by its "resolved" and possibly aliased name, with the corresponding process file:process_name. This is currently impossible, notably because of the issue described above.

Potential "solution" number 1:
A "simple" solution would be to replace the Set<String> returned by ScriptMeta.allProcessNames() with a List<String>. That way, the printed list of process names would be "complete". In our example:
a, a, a, sub:a
I'm not a big fan of this solution though. Although the multiple definitions and uses of the process now appear as expected, the three a entries in the list are not very informative, and the difference between the definition and the use of a in the top-level workflow is not visible.
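To illustrate the limitation of the list-based variant, here is a small Python sketch (again a model, not Nextflow code; `registered_names` is a hypothetical stand-in for the returned list). The duplicates survive, but nothing distinguishes one a from another.

```python
from collections import Counter

# Hypothetical content of the list in the example above.
registered_names = ["a", "a", "a", "sub:a"]

# Joining the list reproduces the "complete" log line.
log_line = ", ".join(registered_names)
print(log_line)

# Counting shows that "a" appears three times, but the entries carry no
# file name and no definition/use marker, so they cannot be told apart.
counts = Counter(registered_names)
print(counts["a"])
```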

Potential "solution" number 2:
I think it would be more informative to print the list of process definitions separately from the list of "resolved" process names. These two lists could then act as a dictionary that associates each resolved process name with its corresponding file and process definition. The result could look something like this in our example:

[main] DEBUG nextflow.Session - Workflow process definitions [dsl2]: main.nf [a, other_process, ...], sub.nf [a, ...]
[main] DEBUG nextflow.Session - Workflow resolved process names: a[main.nf:a], sub:a[sub.nf:a], x[sub.nf:a]  
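As a rough illustration of the data behind these two lines, the Python sketch below builds both of them from a single table. The `records` structure is an assumption made for this example and does not reflect how Nextflow stores process metadata; only the a processes from the example are included.

```python
# Hypothetical records: (definition file, process name, resolved names).
# The structure is assumed for illustration only.
records = [
    ("main.nf", "a", ["a"]),
    ("sub.nf", "a", ["sub:a", "x"]),
]

# Process definitions grouped per file.
definitions = {}
for nf_file, process, _resolved_names in records:
    definitions.setdefault(nf_file, []).append(process)

# Dictionary from each resolved name to its "file:process" definition.
resolved_to_definition = {
    resolved: f"{nf_file}:{process}"
    for nf_file, process, resolved_names in records
    for resolved in resolved_names
}

# Format both tables in the style of the two proposed log lines.
definitions_line = ", ".join(
    f"{nf_file} [{', '.join(names)}]" for nf_file, names in definitions.items()
)
resolved_line = ", ".join(
    f"{resolved}[{definition}]"
    for resolved, definition in resolved_to_definition.items()
)

print(definitions_line)
print(resolved_line)
```

With this dictionary, an aliased name such as x can be traced back to sub.nf:a with a single lookup.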

I can probably provide code for this solution in the next few days if you think it would be worth including in a future version (which would also be great for me, as I would not have to maintain this in my own fork :D )

Cheers,

Karol

*: (I'm a scientist in embedded high-performance system design and optimization)


kdesnos commented Apr 3, 2025

Code implementing the aforementioned solution number 2 is available in my fork kdesnos/nextflow:printProcessDictionnaryInLogs.

The only difference from the previously described behavior is that resolved process names are grouped by process definition, as follows:
[main] DEBUG nextflow.Session - Workflow resolved process names: main.nf=a=[a], sub.nf=a=[sub:a, x]
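For trace analysis, a line in this grouped format can be turned back into a lookup table. The Python sketch below assumes the payload looks exactly like the example above (file=process=[names]) and that file and process names contain no spaces, commas, or equals signs; the actual output of the fork may differ.

```python
import re

# Payload copied from the example log line above (prefix removed).
payload = "main.nf=a=[a], sub.nf=a=[sub:a, x]"

# One group per "file=process=[resolved, names]" entry.
# Assumption: file and process names contain no '=', ',' or whitespace.
entry = re.compile(r"([^=,\s]+)=([^=,\s]+)=\[([^\]]*)\]")

# Map every resolved (possibly aliased) name to its (file, process) definition.
lookup = {}
for nf_file, process, names in entry.findall(payload):
    for resolved in names.split(","):
        resolved = resolved.strip()
        if resolved:
            lookup[resolved] = (nf_file, process)

print(lookup["x"])
```

Each process name found in an execution report can then be resolved to its definition through `lookup`.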


kdesnos commented Apr 4, 2025

I created a PR in case the proposed change is deemed worthy of production. I believe that, beyond my own need, this change makes it easier to identify what an aliased process name corresponds to.

Importantly, I verified that none of the suggested information is already printed when the log level is raised to trace.
