-
Notifications
You must be signed in to change notification settings - Fork 680
Data lineage tracking (aka CID store) #5715
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Paolo Di Tommaso <[email protected]>
✅ Deploy Preview for nextflow-docs-staging canceled.
|
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
5a93547
to
27345a6
Compare
@jorgee apologies, can latest changes be made as PR against this branch? so it will be much simpler do understand what's new for me |
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: jorgee <[email protected]>
Signed-off-by: jorgee <[email protected]>
Signed-off-by: jorgee <[email protected]>
Signed-off-by: jorgee <[email protected]>
Signed-off-by: jorgee <[email protected]>
Signed-off-by: jorgee <[email protected]>
Signed-off-by: jorgee <[email protected]>
Signed-off-by: jorgee <[email protected]>
Signed-off-by: jorgee <[email protected]>
Signed-off-by: jorgee <[email protected]>
Signed-off-by: jorgee <[email protected]>
Signed-off-by: jorgee <[email protected]>
Signed-off-by: jorgee <[email protected]>
Signed-off-by: jorgee <[email protected]> Signed-off-by: Jorge Ejarque <[email protected]> Signed-off-by: Paolo Di Tommaso <[email protected]> Co-authored-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: jorgee <[email protected]>
Pushed minor fixes for the render. It was failing because of the change of the task inputs type (FileInParam is now path). There was also the problem of including forbidden characters in the Mermaid node id. Moreover, I have added a default value for the html file. Users only need to specify the data to render. I have also checked what happens when you pass a task run or workflow run LID. In the case of task runs, it renders the task graph starting from the requested task and its predecessors based on file parameter dependencies. In the case of workflows, it renders the workflow and the input parameters. It was unintentional, but I will keep it, just an extra functionality for free. Finally, I have added the closest property in the validation errors when parsing the fragments or query strings. |
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
It looks in a good shape. I've made some tests and a a few little changes. Some notes:
|
I don't think this is meant to work? The query isn't really returning a single JSON but a collection of JSONs. This is why I suggested to remove the |
class DataOutput implements LinSerializable { | ||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this class used for file outputs? If so then I would call it FileOutput
, I'm having a hard time understanding what it's used for
@@ -214,7 +219,7 @@ class PublishOp { | |||
else { | |||
log.warn "Invalid extension '${ext}' for index file '${indexPath}' -- should be CSV, JSON, or YAML" | |||
} | |||
session.notifyFilePublish(indexPath) | |||
session.notifyFilePublish(indexPath, null, publishOpts.tags as Map) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be annotations
instead of tags
?
@@ -38,6 +38,8 @@ class TaskId extends Number implements Comparable, Serializable, Cloneable { | |||
|
|||
private final int value | |||
|
|||
int getValue() { value } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already intValue()
for this, see below
@CompileStatic | ||
class LineageConfig { | ||
|
||
final LineageStoreOpts store |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the only option under this scope is location
, then I think we can shorten it to workflow.lineage.store
*/ | ||
@Canonical | ||
@CompileStatic | ||
class WorkflowOutputs implements LinSerializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Continuing the discussion here about making workflow runs Merkle-compliant:
- Rename
WorkflowRun
toWorkflowLaunch
- Rename this class to
WorkflowRun
- Rename the
workflowRun
field in this class tolaunch
- Use the hash of this class as the "workflow run id" in the lineage log
This way, the WorkflowRun
represents the "final" record that is created and the entrypoint into the lineage from the runs log. It resolves the issue where lid://<hash>#outputs
requires a reverse lookup. Even if we never go full CID, it still makes sense to do it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do a similar thing for tasks, i.e. TaskRun
-> TaskLaunch
and TaskOutputs
-> TaskRun
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
|
||
@Override | ||
String getName() { | ||
return 'log' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we rename describe
to view
, maybe we should also rename log
to list
to align more with the pipeline commands, and to not confuse with nextflow log
or .nextflow.log
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
@@ -257,7 +262,7 @@ class PublishOp { | |||
*/ | |||
protected Object normalizePaths(value, targetResolver) { | |||
if( value instanceof Path ) { | |||
return List.of(value.getBaseName(), normalizePath(value, targetResolver)) | |||
return normalizePath(value, targetResolver) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested with rnaseq-nf (workflow-outputs-3
branch) and received the following workflow output structure:
[
{
"type": "Collection",
"name": "samples",
"value": [ /* ... */ ]
},
{
"type": "Collection",
"name": "summary",
"value": [
"multiqc_report",
"lid://494b7281ea2e02e985636dfbbcf6b8b3/multiqc_report.html"
]
}
]
But this is not quite right. The summary
output should just be a file. Wrapping in a list might be nice for a CSV file but it obscures the output here and also causes the LinObserver to infer the wrong type (Collection).
Applying this change produces the correct output structure:
[
{
"type": "Collection",
"name": "samples",
"value": [ /* ... */ ]
},
{
"type": "Path",
"name": "summary",
"value": "lid://2265a814fd1c205ecc5b629070d759e2/multiqc_report.html"
}
]
Signed-off-by: Ben Sherman <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After playing around with the lineage
command, I am skeptical about how much we are overloading this lid
pseudo-filesystem. I thought it was just a nice add-on that we could experiment with, but now I think it's just getting in the way.
Currently there are three main uses for lid
paths:
lid://<hash>[#props]
: returns a metadata record or sub-path. This has no practical utility in a Nextflow script, not even for workflow outputs. Now that#outputs
is a list, I can't access an output by name (e.g.#outputs.samples
), which means I can't usechannel.fromPath()
to access an LID output in the same way as a samplesheet. So the LID output is no longer a drop-in replacement for samplesheets.
On the command line, it would be simpler to just provide the hash and use jq
:
# before
nextflow li describe lid://<hash>#params
# after
nextflow li describe <hash> | jq .params
In a web interface like the platform, you'll use a graphical interface to navigate this metadata, so the fragment syntax is not needed there.
lid:///?<name>=<value>&...
: used by thefind
command to retrieve a collection of metadata records. This also has no utility in a Nextflow script, because it is unrelated to domain-specific data like#outputs
. It is only used by thefind
command, so the URI syntax is just getting in the way:
# before
# oops, forgot to escape the & ...
nextflow li find lid:///?type=DataOutput&workflowRun=lid://2265a814fd1c205ecc5b629070d759e2
# after
nextflow li find type=DataOutput workflowRun=2265a814fd1c205ecc5b629070d759e2
lid://<hash>/<path>
: returns a content-addressed file. This is the original use case and the only one that still makes sense as far as I can tell. I think this works perfectly both on the command line and in the Nextflow script/runtime.
Based on this analysis, I think we should ditch (1) and (2) entirely and use lid://
only to refer to files.
Maybe we could use the fragment to refer to a specific output, e.g. lid://<hash>#samples
. That would at least restore the original use case of passing a workflow output as input to a downstream pipeline.
Tentative implementation for addressable data store (very basic POC so far).
Update on 1 Mar 2025 from #5787 by @jorgee
M1 Implementation of CID store for provenance
Changes:
Known Limitations: