Always log values that are used to compute task hashes #5857

Open
ewels opened this issue Mar 5, 2025 · 1 comment
ewels commented Mar 5, 2025

The task hash is a critical part of Nextflow infrastructure, used to determine whether a cached result can be used or not when re-running Nextflow with -resume (amongst other things).

Currently, when resuming a pipeline doesn't work as expected, extensive detective work is required to figure out why. This usually involves doing two or more new runs with the -dump-hashes flag, which can be time-consuming and costly. It is especially difficult if the failed cache hit is not always reproducible.

The key information that is helpful in these cases is what values were used to calculate the hash. If Nextflow could always print this, debugging resume failures would be much simpler.

We don't want to print this to the Nextflow log, as it could represent a lot of data for large runs with millions of tasks. We also don't want to add another file to the task work directory, as Nextflow already creates several and this can put pressure on the file system.
The suggested approach is to add this information as a Bash comment at the top of the .command.begin file already present in the task work directory.

Ideally this data can be written as YAML and surrounded by fixed strings so that it can easily be pulled out programmatically, for example:

# # Start of task hash info
# container: quay.io/nextflow/rnaseq-nf:v1.1
# inputs:
#   '*':
#     - sourceObj: /home/abhinav/rnaseq-nf/work/03/23372f156e80deb4d7183c5f509274/ggal_gut
#       storePath: /home/abhinav/rnaseq-nf/work/03/23372f156e80deb4d7183c5f509274/ggal_gut
#       stageName: ggal_gut
#     - sourceObj: /home/abhinav/rnaseq-nf/work/55/15b60995682daf79ecb64bcbb8e44e/fastqc_ggal_gut_logs
#       storePath: /home/abhinav/rnaseq-nf/work/55/15b60995682daf79ecb64bcbb8e44e/fastqc_ggal_gut_logs
#       stageName: fastqc_ggal_gut_logs
#   config:
#     - sourceObj: /home/abhinav/rnaseq-nf/multiqc
#       storePath: /home/abhinav/rnaseq-nf/multiqc
#       stageName: multiqc
# # End of task hash info

This could then easily be pulled out and parsed as YAML:

container: quay.io/nextflow/rnaseq-nf:v1.1
inputs:
  '*':
    - sourceObj: /home/abhinav/rnaseq-nf/work/03/23372f156e80deb4d7183c5f509274/ggal_gut
      storePath: /home/abhinav/rnaseq-nf/work/03/23372f156e80deb4d7183c5f509274/ggal_gut
      stageName: ggal_gut
    - sourceObj: /home/abhinav/rnaseq-nf/work/55/15b60995682daf79ecb64bcbb8e44e/fastqc_ggal_gut_logs
      storePath: /home/abhinav/rnaseq-nf/work/55/15b60995682daf79ecb64bcbb8e44e/fastqc_ggal_gut_logs
      stageName: fastqc_ggal_gut_logs
  config:
    - sourceObj: /home/abhinav/rnaseq-nf/multiqc
      storePath: /home/abhinav/rnaseq-nf/multiqc
      stageName: multiqc
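
As a rough illustration of the "pulled out programmatically" step, here is a minimal Python sketch. It assumes the marker strings are exactly those in the example above and that every line in between is prefixed with "# ". The function name `extract_hash_info` is hypothetical, and none of this exists in Nextflow today.

```python
# Marker strings copied from the proposed example; they are an assumption,
# not an existing Nextflow format.
START = "# # Start of task hash info"
END = "# # End of task hash info"


def extract_hash_info(text: str) -> str:
    """Return the text between the markers with the comment prefix removed."""
    lines = []
    inside = False
    for line in text.splitlines():
        stripped = line.rstrip()
        if stripped == START:
            inside = True
            continue
        if stripped == END:
            break
        if inside:
            # Drop '# ' from content lines, or a bare '#' from empty ones.
            if line.startswith("# "):
                lines.append(line[2:])
            elif line.startswith("#"):
                lines.append(line[1:])
    return "\n".join(lines)


example = """\
# # Start of task hash info
# container: quay.io/nextflow/rnaseq-nf:v1.1
# inputs:
#   config:
#     - stageName: multiqc
# # End of task hash info
"""

print(extract_hash_info(example))
```

The returned string could then be handed to any YAML parser, for example PyYAML's `yaml.safe_load`.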

Note

The above is pseudo-code based on the blog post only; the exact structure of the YAML can follow whatever makes most sense given how Nextflow holds this data in memory.
If possible, though, it would be good to structure and label it so that it is as human-readable as possible.

Having this info would make debugging failed resumes as simple as a diff command:

diff work/aa/bbbbb/.command.begin work/cc/dddd/.command.begin
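
A plain diff of the two files would also show any unrelated differences between them. As a hedged sketch of a narrower comparison, the Python below diffs only two already-extracted hash-info blocks using the standard library's `difflib`. The function name `changed_hash_lines` and the sample values are made up for illustration.

```python
import difflib


def changed_hash_lines(cached: str, rerun: str) -> list[str]:
    """Return only the added/removed lines between two hash-info blocks."""
    diff = list(
        difflib.unified_diff(
            cached.splitlines(),
            rerun.splitlines(),
            fromfile="cached",
            tofile="rerun",
            lineterm="",
        )
    )
    # The first two lines are the '---' / '+++' file headers; skip them and
    # keep only lines marked as removed ('-') or added ('+').
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Hypothetical hash info from a cached task and from its unexpected re-run.
cached = "container: example/image:v1\ninputs:\n  x: 1"
rerun = "container: example/image:v2\ninputs:\n  x: 1"

for line in changed_hash_lines(cached, rerun):
    print(line)
```

With these sample inputs, only the `container` line differs, which would point at the container tag as the reason for the cache miss.
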

ewels added the planned label Mar 5, 2025

adamrtalbot commented

This would be great, because we could start to build on it:

  • Diagnostic tools: why did Nextflow not use the cache on this run?
  • In-flight tools: Nextflow can report that it will re-run this task because this item is different
  • Reporting: a tool like Seqera Platform could highlight the changed items after the run

But step 1 is to collect the data.
