Data lineage tracking (aka CID store) #5715

pditommaso · 2025-01-27T13:15:42Z

Tentative implementation for addressable data store (very basic POC so far).

Update on 1 Mar 2025 from #5787 by @jorgee

M1 Implementation of CID store for provenance

Changes:

The CID store is specified by workflow.data.store.location
The workflow hash is created from the workflow and parameters description
Workflow, task and output metadata are stored in <cid.store.location>/.meta
References to other CID metadata use the form cid://<workflow_hash|task_hash>/<output_target_path>
CID NIO filesystem to access data based on CID URLs
nextflow cid command to log, show and get lineage from CID store metadata
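
To make the reference format above more concrete, here is a minimal Java sketch that splits a `cid://` reference into its hash and output target path with `java.net.URI`. The class `CidRefSketch` and its fields are hypothetical names used only for illustration; the CID NIO filesystem in the PR may parse references differently.

```java
import java.net.URI;

// Hypothetical helper (not part of the PR): splits cid://<hash>/<output_target_path>.
final class CidRefSketch {
    final String hash;        // workflow or task hash
    final String targetPath;  // output target path, empty when the reference is just a hash

    private CidRefSketch(String hash, String targetPath) {
        this.hash = hash;
        this.targetPath = targetPath;
    }

    static CidRefSketch parse(String ref) {
        URI uri = URI.create(ref);
        if (!"cid".equals(uri.getScheme())) {
            throw new IllegalArgumentException("Not a cid:// reference: " + ref);
        }
        // getAuthority() rather than getHost(): a hex hash is not guaranteed to be a valid hostname
        String hash = uri.getAuthority();
        if (hash == null || hash.isEmpty()) {
            throw new IllegalArgumentException("Missing hash in: " + ref);
        }
        String path = uri.getPath() == null ? "" : uri.getPath();
        if (path.startsWith("/")) {
            path = path.substring(1);  // drop the separator between hash and path
        }
        return new CidRefSketch(hash, path);
    }
}
```

For example, `CidRefSketch.parse("cid://494b7281ea2e02e985636dfbbcf6b8b3/multiqc_report.html")` would yield the hash and the relative path `multiqc_report.html`.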

Known Limitations:

For outputs published to absolute paths or URLs that are not subfolders of the outputDir, we cannot infer the relative output target path, so they are not currently tracked in the CID store. We could create a hash of the parent directory of the URL or absolute path and use it as the relative folder.
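
The proposed workaround could look roughly like the following Java sketch. It is an assumption about how the idea might be implemented, not code from the PR: `TargetPathSketch`, `relativeTarget` and `shortHash` are hypothetical names, and SHA-256 truncated to 16 hex characters is an arbitrary choice.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of the workaround described above (not part of the PR).
final class TargetPathSketch {

    // Returns the output target path relative to outputDir, or a
    // "<hash of parent dir>/<file name>" path when the file lives elsewhere.
    static String relativeTarget(Path outputDir, Path published) {
        Path out = outputDir.toAbsolutePath().normalize();
        Path pub = published.toAbsolutePath().normalize();
        if (pub.startsWith(out)) {
            return out.relativize(pub).toString().replace('\\', '/');
        }
        // Outside outputDir: use a hash of the parent directory as the relative folder
        String parent = String.valueOf(pub.getParent());
        return shortHash(parent) + "/" + pub.getFileName();
    }

    // First 16 hex characters of the SHA-256 digest (length chosen arbitrarily)
    static String shortHash(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 8; i++) {
                hex.append(String.format("%02x", digest[i]));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

With this approach, two files published to the same external directory would share the same hashed folder, which keeps sibling outputs grouped in the store.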

Signed-off-by: Paolo Di Tommaso <[email protected]>

netlify · 2025-01-27T13:16:03Z

✅ Deploy Preview for nextflow-docs-staging canceled.

Name	Link
🔨 Latest commit	`36d5c82`
🔍 Latest deploy log	https://app.netlify.com/sites/nextflow-docs-staging/deploys/680086091518c30008500621


pditommaso · 2025-02-13T09:36:58Z

@jorgee apologies, can the latest changes be made as a PR against this branch? That would make it much simpler for me to understand what's new.

jorgee · 2025-02-13T09:59:24Z

> @jorgee apologies, can the latest changes be made as a PR against this branch? That would make it much simpler for me to understand what's new.

I have reverted the changes in this branch and created a new one in PR #5787


jorgee · 2025-04-15T18:38:51Z

Pushed minor fixes for the render. It was failing because of the change of the task inputs type (FileInParam is now path). There was also the problem of including forbidden characters in the Mermaid node id. Moreover, I have added a default value for the html file. Users only need to specify the data to render.
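
As an illustration of the Mermaid node id problem, the sketch below derives a safe id from a LID by keeping only letters, digits and underscores. The thread does not list which characters Mermaid rejected, so this conservative allow-list is an assumption, and `MermaidIdSketch.safeNodeId` is a hypothetical name rather than the fix that was actually pushed.

```java
// Hypothetical sketch (not the PR's implementation): derive a Mermaid-safe node id.
final class MermaidIdSketch {

    // Replaces every character outside [A-Za-z0-9_] with an underscore and
    // adds a prefix so the id never starts with a digit.
    static String safeNodeId(String lid) {
        String cleaned = lid.replaceAll("[^A-Za-z0-9_]", "_");
        return "n_" + cleaned;
    }
}
```

The original LID can still be shown as the node label; only the id needs sanitizing. Note that distinct LIDs can map to the same id under this scheme (for example `a/b` and `a.b`), so a real implementation may prefer a counter or a hash.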

I have also checked what happens when you pass a task run or workflow run LID. In the case of task runs, it renders the task graph starting from the requested task and its predecessors based on file parameter dependencies. In the case of workflows, it renders the workflow and the input parameters. It was unintentional, but I will keep it, just an extra functionality for free.

Finally, I have added the closest property in the validation errors when parsing the fragments or query strings.
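
One common way to produce such a hint is to pick the known property with the smallest edit distance to the one the user typed. The sketch below uses Levenshtein distance; that choice, and the names `ClosestNameSketch`, `distance` and `closest`, are assumptions for illustration, since the thread does not say how the closest property is computed.

```java
import java.util.List;

// Hypothetical sketch: suggest the closest known property name for a typo.
final class ClosestNameSketch {

    // Classic Levenshtein edit distance using two rolling rows
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the known name with the smallest distance, or null if the list is empty
    static String closest(String name, List<String> known) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : known) {
            int d = distance(name, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }
}
```

A validation error for an unknown fragment such as `#outpus` could then append a hint pointing at `outputs`.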


pditommaso · 2025-04-16T14:44:21Z

It looks in good shape. I've run some tests and made a few small changes. Some notes:

Better not to use . in error messages. It may be confusing when it follows URIs or numbers.
Timestamps have been changed to OffsetDateTime using the current time zone. This allows keeping track of the local time at the current location. ✅
Added li command shortcut. ✅
We may want to change resolvedConfig to config for simplicity
We may want to change XxxOutputs and #outputs to XxxOutput. In the same manner, inputs to input
I find the describe command verbose. What about view or print instead?
TaskRun should likely include the resolved task script
The command lineage find 'type=DataOutput' works OK, but the equivalent channel.fromPath('lid:///?type=DataOutput') does not seem to work 🔴
It likely doesn't make much sense to keep nf-lineage-h2. Planning to move it to an external plugin
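
For readers unfamiliar with the timestamp change, this small Java sketch shows what an `OffsetDateTime` in a given zone preserves compared with a plain UTC instant. The class and method names are hypothetical; only the `java.time` calls are real API.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneId;

// Hypothetical sketch: record timestamps with their local UTC offset.
final class TimestampSketch {

    // Converts an instant to an OffsetDateTime in the given zone,
    // keeping the UTC offset that applied at that moment.
    static OffsetDateTime atZone(Instant instant, ZoneId zone) {
        return OffsetDateTime.ofInstant(instant, zone);
    }

    // Current time with the offset of the JVM's default time zone
    static OffsetDateTime nowLocal() {
        return OffsetDateTime.now();
    }
}
```

For instance, `atZone(Instant.parse("2025-04-16T14:44:21Z"), ZoneId.of("Europe/Madrid"))` prints as `2025-04-16T16:44:21+02:00`: the same instant, but the record also tells you the wall-clock time where the run happened.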

bentsherman · 2025-04-16T16:53:35Z

> I find the describe command verbose. What about view or print instead?

view would be consistent with the view command, especially if we merge #5966

> The command lineage find 'type=DataOutput' works OK, but the equivalent channel.fromPath('lid:///?type=DataOutput') does not seem to work

I don't think this is meant to work? The query isn't really returning a single JSON but a collection of JSONs. This is why I suggested to remove the lid:// prefix when making a query, to not give the illusion that it's a LID path

bentsherman · 2025-04-16T17:04:01Z

modules/nf-lineage/src/main/nextflow/lineage/model/DataOutput.groovy

+class DataOutput implements LinSerializable {
+    /**


Is this class used for file outputs? If so then I would call it FileOutput, I'm having a hard time understanding what it's used for

bentsherman · 2025-04-17T01:26:30Z

modules/nextflow/src/main/groovy/nextflow/extension/PublishOp.groovy

@@ -214,7 +219,7 @@ class PublishOp {
            else {
                log.warn "Invalid extension '${ext}' for index file '${indexPath}' -- should be CSV, JSON, or YAML"
            }
-            session.notifyFilePublish(indexPath)
+            session.notifyFilePublish(indexPath, null, publishOpts.tags as Map)


Should this be annotations instead of tags?

bentsherman · 2025-04-17T01:27:14Z

modules/nextflow/src/main/groovy/nextflow/processor/TaskId.groovy

@@ -38,6 +38,8 @@ class TaskId extends Number implements Comparable, Serializable, Cloneable {

    private final int value

+    int getValue() { value }


There is already intValue() for this, see below
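
The point can be seen in a stripped-down Java sketch: any subclass of `java.lang.Number` must already implement `intValue()`, so a separate `getValue()` accessor would duplicate it. `TaskIdSketch` is a hypothetical stand-in, not the real `TaskId` class.

```java
// Hypothetical stand-in for TaskId: Number forces intValue() to exist,
// so callers can read the wrapped int without an extra getter.
final class TaskIdSketch extends Number {
    private static final long serialVersionUID = 1L;

    private final int value;

    TaskIdSketch(int value) {
        this.value = value;
    }

    @Override
    public int intValue() {
        return value;
    }

    @Override
    public long longValue() {
        return value;
    }

    @Override
    public float floatValue() {
        return value;
    }

    @Override
    public double doubleValue() {
        return value;
    }
}
```

Code that needs the raw id can call `new TaskIdSketch(7).intValue()` directly.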

bentsherman · 2025-04-17T01:53:49Z

modules/nf-lineage/src/main/nextflow/lineage/config/LineageConfig.groovy

+@CompileStatic
+class LineageConfig {
+
+    final LineageStoreOpts store


If the only option under this scope is location, then I think we can shorten it to workflow.lineage.store

bentsherman · 2025-04-17T02:50:49Z

modules/nf-lineage/src/main/nextflow/lineage/model/WorkflowOutputs.groovy

+ */
+@Canonical
+@CompileStatic
+class WorkflowOutputs implements LinSerializable {


Continuing the discussion here about making workflow runs Merkle-compliant:

Rename WorkflowRun to WorkflowLaunch

Rename this class to WorkflowRun

Rename the workflowRun field in this class to launch

Use the hash of this class as the "workflow run id" in the lineage log

This way, the WorkflowRun represents the "final" record that is created and the entrypoint into the lineage from the runs log. It resolves the issue where lid://<hash>#outputs requires a reverse lookup. Even if we never go full CID, it still makes sense to do it

We could do a similar thing for tasks, i.e. TaskRun -> TaskLaunch and TaskOutputs -> TaskRun
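
A rough Java sketch of the proposed structure follows. It assumes records are hashed with SHA-256 over a simple text serialization; both the serialization and the names `MerkleRunSketch`, `launchHash` and `runHash` are made up for illustration and do not reflect the PR's actual hashing.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Hypothetical sketch of the proposal: the final run record embeds the launch hash,
// so the run hash is the single entry point into the lineage.
final class MerkleRunSketch {

    static String sha256(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // "WorkflowLaunch": known before execution (workflow description and parameters)
    static String launchHash(String workflowDescription, String params) {
        return sha256("launch\n" + workflowDescription + "\n" + params);
    }

    // "WorkflowRun": created at the end; references the launch and the output LIDs
    static String runHash(String launchHash, List<String> outputs) {
        return sha256("run\n" + launchHash + "\n" + String.join("\n", outputs));
    }
}
```

Because the run record points to its launch and outputs, starting from the run hash reaches everything by following references forward, which is why no reverse lookup would be needed for `#outputs`.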


bentsherman · 2025-04-17T03:56:30Z

modules/nextflow/src/main/groovy/nextflow/cli/CmdLineage.groovy

+
+        @Override
+        String getName() {
+            return 'log'


If we rename describe to view, maybe we should also rename log to list to align more with the pipeline commands, and to not confuse with nextflow log or .nextflow.log


bentsherman · 2025-04-17T04:27:25Z

modules/nextflow/src/main/groovy/nextflow/extension/PublishOp.groovy

@@ -257,7 +262,7 @@ class PublishOp {
     */
    protected Object normalizePaths(value, targetResolver) {
        if( value instanceof Path ) {
-            return List.of(value.getBaseName(), normalizePath(value, targetResolver))
+            return normalizePath(value, targetResolver)


I tested with rnaseq-nf (workflow-outputs-3 branch) and received the following workflow output structure:

    [
      { "type": "Collection", "name": "samples", "value": [ /* ... */ ] },
      { "type": "Collection", "name": "summary", "value": [ "multiqc_report", "lid://494b7281ea2e02e985636dfbbcf6b8b3/multiqc_report.html" ] }
    ]

But this is not quite right. The summary output should just be a file. Wrapping in a list might be nice for a CSV file but it obscures the output here and also causes the LinObserver to infer the wrong type (Collection).

Applying this change produces the correct output structure:

    [
      { "type": "Collection", "name": "samples", "value": [ /* ... */ ] },
      { "type": "Path", "name": "summary", "value": "lid://2265a814fd1c205ecc5b629070d759e2/multiqc_report.html" }
    ]
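
The effect of the wrapping on type inference can be reproduced with a few lines of Java. This is only a guess at the kind of check involved: `OutputTypeSketch.inferType` is hypothetical and the real LinObserver logic is not shown in this thread.

```java
import java.nio.file.Path;
import java.util.Collection;

// Hypothetical sketch: infer a lineage output type from the published value.
final class OutputTypeSketch {

    static String inferType(Object value) {
        if (value instanceof Path) {
            return "Path";
        }
        if (value instanceof Collection) {
            return "Collection";
        }
        return value == null ? "null" : value.getClass().getSimpleName();
    }
}
```

Under such a check, a bare `Path` is reported as `Path`, while the same path wrapped as `[baseName, path]` is reported as `Collection`, which matches the difference between the two output structures above.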


bentsherman · 2025-04-17T05:27:14Z

modules/nf-lineage/src/main/nextflow/lineage/fs/LinPath.groovy

After playing around with the lineage command, I am skeptical about how much we are overloading this lid pseudo-filesystem. I thought it was just a nice add-on that we could experiment with, but now I think it's just getting in the way.

Currently there are three main uses for lid paths:

(1) lid://<hash>[#props]: returns a metadata record or sub-path. This has no practical utility in a Nextflow script, not even for workflow outputs. Now that #outputs is a list, I can't access an output by name (e.g. #outputs.samples), which means I can't use channel.fromPath() to access an LID output in the same way as a samplesheet. So the LID output is no longer a drop-in replacement for samplesheets.

On the command line, it would be simpler to just provide the hash and use jq:

    # before
    nextflow li describe lid://<hash>#params
    # after
    nextflow li describe <hash> | jq .params

In a web interface like the platform, you'll use a graphical interface to navigate this metadata, so the fragment syntax is not needed there.

(2) lid:///?<name>=<value>&...: used by the find command to retrieve a collection of metadata records. This also has no utility in a Nextflow script, because it is unrelated to domain-specific data like #outputs. It is only used by the find command, so the URI syntax is just getting in the way:

    # before
    # oops, forgot to escape the & ...
    nextflow li find lid:///?type=DataOutput&workflowRun=lid://2265a814fd1c205ecc5b629070d759e2
    # after
    nextflow li find type=DataOutput workflowRun=2265a814fd1c205ecc5b629070d759e2

(3) lid://<hash>/<path>: returns a content-addressed file. This is the original use case and the only one that still makes sense as far as I can tell. I think this works perfectly both on the command line and in the Nextflow script/runtime.

Based on this analysis, I think we should ditch (1) and (2) entirely and use lid:// only to refer to files.

Maybe we could use the fragment to refer to a specific output, e.g. lid://<hash>#samples. That would at least restore the original use case of passing a workflow output as input to a downstream pipeline.
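
To illustrate the suggested `name=value` form of the find command, here is a minimal Java sketch that turns separate command-line arguments into a query map. `FindArgsSketch` is a hypothetical name and this is not how the find command is implemented in the PR. Passing each pair as its own argument avoids `&` altogether, which an unquoted shell command would otherwise interpret as "run in background".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: parse "name=value" arguments into a query map.
final class FindArgsSketch {

    static Map<String, String> parse(String... args) {
        Map<String, String> query = new LinkedHashMap<>();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected <name>=<value> but got: " + arg);
            }
            // Split on the first '=' only, so values may themselves contain '='
            query.put(arg.substring(0, eq), arg.substring(eq + 1));
        }
        return query;
    }
}
```

Called as `FindArgsSketch.parse("type=DataOutput", "workflowRun=2265a814fd1c205ecc5b629070d759e2")`, this yields a two-entry map that could be matched against stored metadata records.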

Addressable data store

472fcc7

Signed-off-by: Paolo Di Tommaso <[email protected]>

pditommaso marked this pull request as draft January 27, 2025 13:15

pditommaso added 3 commits January 31, 2025 10:52

Merge branch 'master' into cid-store

4f8c524

Addressable data store [wip 2] [ci skip]

b5e8c46

Signed-off-by: Paolo Di Tommaso <[email protected]>

Minor changes [ci skip]

669afd5

Signed-off-by: Paolo Di Tommaso <[email protected]>

pditommaso force-pushed the master branch 2 times, most recently from 5a93547 to 27345a6 Compare February 10, 2025 21:46

jorgee force-pushed the cid-store branch from 3ac180f to 669afd5 Compare February 13, 2025 09:44

pditommaso added 3 commits February 17, 2025 13:15

Addressable data store

c93a713

Signed-off-by: Paolo Di Tommaso <[email protected]>

Addressable data store [wip 2] [ci skip]

c0c660f

Signed-off-by: Paolo Di Tommaso <[email protected]>

Minor changes [ci skip]

2a2d76f

Signed-off-by: Paolo Di Tommaso <[email protected]>

jorgee force-pushed the cid-store branch from 669afd5 to 2a2d76f Compare February 17, 2025 12:15

jorgee and others added 15 commits February 17, 2025 18:16

M0 implementation

a2139e3

Signed-off-by: jorgee <[email protected]>

fix tests

fddc5f7

Signed-off-by: jorgee <[email protected]>

fix tests

fe780a8

Signed-off-by: jorgee <[email protected]>

first M1 updates

f9f7ed2

Signed-off-by: jorgee <[email protected]>

fix tests

0c2492e

Signed-off-by: jorgee <[email protected]>

update descriptions

41ac817

Signed-off-by: jorgee <[email protected]>

fix test

cdc3116

Signed-off-by: jorgee <[email protected]>

Merge branch 'master' into cid-store-m0

642e7b1

Merge branch 'cid-store' of github.com:nextflow-io/nextflow into cid-…

1400e80

…store

Merge branch 'master' into cid-store

d64c71a

Merge branch 'cid-store' into cid-store-m0

975143f

First commit to M1 implementation

82b1ccd

Signed-off-by: jorgee <[email protected]>

fix NPE in tests

edfaf5b

Signed-off-by: jorgee <[email protected]>

Fix NPE in tests

c207d92

Signed-off-by: jorgee <[email protected]>

Add CidStore factory

f4b9031

Signed-off-by: jorgee <[email protected]>

jorgee and others added 4 commits April 15, 2025 15:36

change getTargetPath with flags to different methods

9f41c28

Signed-off-by: jorgee <[email protected]>

change resolved config from string to map

f43e46c

Signed-off-by: jorgee <[email protected]>

CID to lineage rename (#5977)

5208c38

Signed-off-by: jorgee <[email protected]> Signed-off-by: Jorge Ejarque <[email protected]> Signed-off-by: Paolo Di Tommaso <[email protected]> Co-authored-by: Paolo Di Tommaso <[email protected]>

fixes render, hint of closer property name

abf0c3a

Signed-off-by: jorgee <[email protected]>

pditommaso marked this pull request as ready for review April 16, 2025 08:04

pditommaso added 3 commits April 16, 2025 12:37

Just blanks [ci fast]

ac7f675

Signed-off-by: Paolo Di Tommaso <[email protected]>

Minor changes

4c9f8e0

Signed-off-by: Paolo Di Tommaso <[email protected]>

Fix failing tests [ci fast]

306fbaf

Signed-off-by: Paolo Di Tommaso <[email protected]>

pditommaso changed the title ~~Addressable data store (aka CID store)~~ Data lineage tracking (aka CID store) Apr 16, 2025

Add support for command aliases [ci fast]

226bf65

Signed-off-by: Paolo Di Tommaso <[email protected]>

bentsherman self-requested a review April 16, 2025 16:55

bentsherman reviewed Apr 16, 2025


bentsherman reviewed Apr 17, 2025


bentsherman added 5 commits April 16, 2025 22:07

cleanup code style

38735b0

Signed-off-by: Ben Sherman <[email protected]>

replace nested ifs with if-guards

6d71bc9

Signed-off-by: Ben Sherman <[email protected]>

change default render file name to lineage.html

52e3333

Signed-off-by: Ben Sherman <[email protected]>

fix typo

baed1ad

Signed-off-by: Ben Sherman <[email protected]>

Merge branch 'master' into cid-store

d3bc077

bentsherman reviewed Apr 17, 2025


bentsherman added 2 commits April 16, 2025 23:12

cleanup whitespace

e7f437b

Signed-off-by: Ben Sherman <[email protected]>

don't wrap singleton file output as a list

52664ce

Signed-off-by: Ben Sherman <[email protected]>

bentsherman reviewed Apr 17, 2025


fix failing test

36d5c82

Signed-off-by: Ben Sherman <[email protected]>

bentsherman reviewed Apr 17, 2025
