
Wrong file references when input and output file names are identical #5882


Open
zlibto opened this issue Mar 12, 2025 · 3 comments


zlibto commented Mar 12, 2025

Bug report

When the input and output files of a process have identical names, downstream processes receive the wrong file, even though the publishDir outputs are correct. See the minimal example below.

No error or warning (e.g. name collision) was raised.

Expected behavior and actual behavior

Obviously, using identical file names is bad practice, but it can happen by accident when file names are not very explicit (we discovered this when trying to generate files in different folders). I think raising a name collision error would be appropriate (for step1 in the example below).

Steps to reproduce the problem

Minimum example:

nextflow.enable.dsl=2
nextflow.enable.strict=true

params.output_dir = 'output'

process step0 {
    publishDir "step0/", mode: 'copy'

    output:
    path('duplicate_file_name.txt')

    script:
    """
    echo 'step0' > duplicate_file_name.txt
    """
}

process step1 {
    publishDir "step1/", mode: 'copy'
    
    input:
    path(input)

    output:
    path('duplicate_file_name.txt')

    script:
    """
    echo "step1" > duplicate_file_name.txt
    """
}

process summary {
    publishDir ".", mode: 'copy'
    
    input:
    path(file0, stageAs:'file0')
    path(file1, stageAs: 'file1')

    output: 
    path('summary.txt') 

    script:
    """
    cat file0 file1 > summary.txt
    """
}


workflow {

    main:

    step0()
    step1(step0.out)
    summary(step0.out, step1.out)
}

// join(...) gives the same output

Program output

  • step0/duplicate_file_name.txt
step0
  • step1/duplicate_file_name.txt
step1
  • summary.txt
step1
step1

(Should be step0\nstep1 given the intention of summary(step0.out, step1.out))

Environment

  • Nextflow version: 24.10.5
  • Java version: openjdk 23.0.2 2025-01-21
  • Operating system: macOS
  • Bash version: zsh 5.9 (arm64-apple-darwin24.0)

Additional context

@bentsherman
Member

In general, it is valid for a process to forward an input as an output, and there is no way for Nextflow to know for certain that it is being done improperly (e.g. overwriting the file versus simply using it and including it in the outputs).

So it is up to you to ensure that you forward input files properly. In step1, you could prevent a (silent) name collision by staging the input file under a different name.
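A minimal sketch of that suggestion, based on the example from the report. The staged name `step0_input.txt` is an arbitrary choice for illustration, and the sketch assumes that the default staging mode links the input into the task directory, so that writing to a staged file of the same name would modify the upstream file.

```nextflow
process step0 {
    output:
    path('duplicate_file_name.txt')

    script:
    """
    echo 'step0' > duplicate_file_name.txt
    """
}

process step1 {
    input:
    // Stage the input under a different name (hypothetical name) so that
    // the output written below cannot collide with it.
    path(input, stageAs: 'step0_input.txt')

    output:
    path('duplicate_file_name.txt')

    script:
    """
    echo 'step1' > duplicate_file_name.txt
    """
}

process summary {
    publishDir ".", mode: 'copy'

    input:
    path(file0, stageAs: 'file0')
    path(file1, stageAs: 'file1')

    output:
    path('summary.txt')

    script:
    """
    cat file0 file1 > summary.txt
    """
}

workflow {
    step0()
    step1(step0.out)
    summary(step0.out, step1.out)
}
```

With the input staged under its own name, step1 should create a new file in its task directory instead of writing over the staged input, so summary would be expected to receive two distinct files.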

@zlibto
Author

zlibto commented Mar 15, 2025

Thanks for the explanation! I see your point about forwarding the file name.

I'm still confused about why the output of summary differs from what publishDir produces. Is it because the file that step0.out refers to is modified by the input-to-output forwarding in step1?

@bentsherman
Member

I think step1 is overwriting the output file of step0, so by the time summary runs, it's effectively printing the same file twice.
