Skip to content

Data lineage tracking (aka CID store) #5715

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 87 commits into
base: master
Choose a base branch
from
Open

Data lineage tracking (aka CID store) #5715

wants to merge 87 commits into from

Conversation

pditommaso
Copy link
Member

@pditommaso pditommaso commented Jan 27, 2025

Tentative implementation for addressable data store (very basic POC so far).

Update on 1 Mar 2025 from #5787 by @jorgee

M1 Implementation of CID store for provenance

Changes:

  • CID store is specified by workflow.data.store.location
  • Workflow Hash is created based on the workflow and parameters description
  • workflow, tasks and outputs metadata are stored in <cid.store.location>/.meta
  • references to other cid metadata are cid://<workflow_hash|task_hash/output_target_path
  • CID NIO Filesystem to access data based on CIS URLs
  • nextflow cid command to log, show and get lineage from CID store metadata

Known Limitations:

  • Outputs which are not published in absolutePaths or URLs which are not subfolders both the outputDir, we can not infer the relative output target path. They are not currently tracked in the CID store. We could create a hash for the parent directory of the URL or absolute path and use it as relative folder.

Signed-off-by: Paolo Di Tommaso <[email protected]>
@pditommaso pditommaso marked this pull request as draft January 27, 2025 13:15
Copy link

netlify bot commented Jan 27, 2025

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 36d5c82
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/680086091518c30008500621

@pditommaso pditommaso force-pushed the master branch 2 times, most recently from 5a93547 to 27345a6 Compare February 10, 2025 21:46
@pditommaso
Copy link
Member Author

@jorgee apologies, can latest changes be made as PR against this branch? so it will be much simpler do understand what's new for me

@jorgee
Copy link
Contributor

jorgee commented Feb 13, 2025

@jorgee apologies, can latest changes be made as PR against this branch? so it will be much simpler do understand what's new for me

I have reverted the changes in this branch and created a new one in PR #5787

Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
jorgee and others added 4 commits April 15, 2025 15:36
Signed-off-by: jorgee <[email protected]>
Signed-off-by: Jorge Ejarque <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Co-authored-by: Paolo Di Tommaso <[email protected]>
@jorgee
Copy link
Contributor

jorgee commented Apr 15, 2025

Pushed minor fixes for the render. It was failing because of the change of the task inputs type (FileInParam is now path). There was also the problem of including forbidden characters in the Mermaid node id. Moreover, I have added a default value for the html file. Users only need to specify the data to render.

I have also checked what happens when you pass a task run or workflow run LID. In the case of task runs, it renders the task graph starting from the requested task and its predecessors based on file parameter dependencies. In the case of workflows, it renders the workflow and the input parameters. It was unintentional, but I will keep it, just an extra functionality for free.

Finally, I have added the closest property in the validation errors when parsing the fragments or query strings.

@pditommaso pditommaso marked this pull request as ready for review April 16, 2025 08:04
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
@pditommaso pditommaso changed the title Addressable data store (aka CID store) Data lineage tracking (aka CID store) Apr 16, 2025
@pditommaso
Copy link
Member Author

pditommaso commented Apr 16, 2025

It looks in a good shape. I've made some tests and a a few little changes. Some notes:

  1. better not use . in error messages. It may be confusing when preceding URIs or numbers.
  2. Timestamps have been changes to OffsetDateTime using the current time zone. This allow keep track of the current location time. ✅
  3. Added li command shortcut. ✅
  4. We may want to change resolvedConfig to config for simplicity
  5. We may want to change XxxOutputs and #outputs to XxxOutput. Along the same manner inputs to input
  6. Find describe command verbose. What about view or print instead?
  7. Likely TaskRun should include the resolved task script
  8. Command lineage find 'type=DataOutput' work OK, but equivalent channel.fromPath('lid:///?type=DataOutput') seems not working 🔴
  9. Likely makes no much sense to keep nf-lineage-h2. Planning to more an external plugin

@bentsherman
Copy link
Member

Find describe command verbose. What about view or print instead?

view would be consistent with the view command, especially if we merge #5966

Command lineage find 'type=DataOutput' work OK, but equivalent channel.fromPath('lid:///?type=DataOutput') seems not working

I don't think this is meant to work? The query isn't really returning a single JSON but a collection of JSONs. This is why I suggested to remove the lid:// prefix when making a query, to not give the illusion that it's a LID path

@bentsherman bentsherman self-requested a review April 16, 2025 16:55
Comment on lines +32 to +33
class DataOutput implements LinSerializable {
/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this class used for file outputs? If so then I would call it FileOutput, I'm having a hard time understanding what it's used for

@@ -214,7 +219,7 @@ class PublishOp {
else {
log.warn "Invalid extension '${ext}' for index file '${indexPath}' -- should be CSV, JSON, or YAML"
}
session.notifyFilePublish(indexPath)
session.notifyFilePublish(indexPath, null, publishOpts.tags as Map)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be annotations instead of tags?

@@ -38,6 +38,8 @@ class TaskId extends Number implements Comparable, Serializable, Cloneable {

private final int value

int getValue() { value }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already intValue() for this, see below

@CompileStatic
class LineageConfig {

final LineageStoreOpts store
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the only option under this scope is location, then I think we can shorten it to workflow.lineage.store

*/
@Canonical
@CompileStatic
class WorkflowOutputs implements LinSerializable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing the discussion here about making workflow runs Merkle-compliant:

  1. Rename WorkflowRun to WorkflowLaunch
  2. Rename this class to WorkflowRun
  3. Rename the workflowRun field in this class to launch
  4. Use the hash of this class as the "workflow run id" in the lineage log

This way, the WorkflowRun represents the "final" record that is created and the entrypoint into the lineage from the runs log. It resolves the issue where lid://<hash>#outputs requires a reverse lookup. Even if we never go full CID, it still makes sense to do it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do a similar thing for tasks, i.e. TaskRun -> TaskLaunch and TaskOutputs -> TaskRun


@Override
String getName() {
return 'log'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we rename describe to view, maybe we should also rename log to list to align more with the pipeline commands, and to not confuse with nextflow log or .nextflow.log

@@ -257,7 +262,7 @@ class PublishOp {
*/
protected Object normalizePaths(value, targetResolver) {
if( value instanceof Path ) {
return List.of(value.getBaseName(), normalizePath(value, targetResolver))
return normalizePath(value, targetResolver)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested with rnaseq-nf (workflow-outputs-3 branch) and received the following workflow output structure:

[
  {
    "type": "Collection",
    "name": "samples",
    "value": [ /* ... */ ]
  },
  {
    "type": "Collection",
    "name": "summary",
    "value": [
      "multiqc_report",
      "lid://494b7281ea2e02e985636dfbbcf6b8b3/multiqc_report.html"
    ]
  }
]

But this is not quite right. The summary output should just be a file. Wrapping in a list might be nice for a CSV file but it obscures the output here and also causes the LinObserver to infer the wrong type (Collection).

Applying this change produces the correct output structure:

[
  {
    "type": "Collection",
    "name": "samples",
    "value": [ /* ... */ ]
  },
  {
    "type": "Path",
    "name": "summary",
    "value": "lid://2265a814fd1c205ecc5b629070d759e2/multiqc_report.html"
  }
]

Signed-off-by: Ben Sherman <[email protected]>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After playing around with the lineage command, I am skeptical about how much we are overloading this lid pseudo-filesystem. I thought it was just a nice add-on that we could experiment with, but now I think it's just getting in the way.

Currently there are three main uses for lid paths:

  1. lid://<hash>[#props]: returns a metadata record or sub-path. This has no practical utility in a Nextflow script, not even for workflow outputs. Now that #outputs is a list, I can't access an output by name (e.g. #outputs.samples), which means I can't use channel.fromPath() to access an LID output in the same way as a samplesheet. So the LID output is no longer a drop-in replacement for samplesheets.

On the command line, it would be simpler to just provide the hash and use jq:

# before
nextflow li describe lid://<hash>#params

# after
nextflow li describe <hash> | jq .params

In a web interface like the platform, you'll use a graphical interface to navigate this metadata, so the fragment syntax is not needed there.

  1. lid:///?<name>=<value>&...: used by the find command to retrieve a collection of metadata records. This also has no utility in a Nextflow script, because it is unrelated to domain-specific data like #outputs. It is only used by the find command, so the URI syntax is just getting in the way:
# before
# oops, forgot to escape the & ...
nextflow li find lid:///?type=DataOutput&workflowRun=lid://2265a814fd1c205ecc5b629070d759e2

# after
nextflow li find type=DataOutput workflowRun=2265a814fd1c205ecc5b629070d759e2
  1. lid://<hash>/<path>: returns a content-addressed file. This is the original use case and the only one that still makes sense as far as I can tell. I think this works perfectly both on the command line and in the Nextflow script/runtime.

Based on this analysis, I think we should ditch (1) and (2) entirely and use lid:// only to refer to files.

Maybe we could use the fragment to refer to a specific output, e.g. lid://<hash>#samples. That would at least restore the original use case of passing a workflow output as input to a downstream pipeline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants