[datasetio] register_datasets() does not identify remote nvidia provider #1860

Open
2 tasks
raspawar opened this issue Apr 2, 2025 · 1 comment
Labels
bug Something isn't working

Comments

@raspawar
Contributor

raspawar commented Apr 2, 2025

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

For the implementation of the remote::nvidia datasetio provider, the DatasetsRoutingTable does not allow the nvidia provider to be selected because of the logic below. The provider_id gets set to localfs instead.

The issue is being caused by:

# infer provider from source
if source.type == DatasetType.rows.value:
    provider_id = "localfs"
elif source.type == DatasetType.uri.value:
    # infer provider from uri
    if source.uri.startswith("huggingface"):
        provider_id = "huggingface"
    else:
        provider_id = "localfs"
else:
    raise ValueError(f"Unknown data source type: {source.type}")

https://github.com/meta-llama/llama-stack/blob/main/llama_stack/distribution/routers/routing_tables.py#L434
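To make the effect of that branch concrete, here is a minimal standalone sketch of the same inference applied to the hf:// URI used in the reproduction script. The `DatasetType` enum and the `infer_provider_id` helper are stand-ins written for this illustration, not the llama-stack definitions; only the branching mirrors the snippet quoted above.

```python
from enum import Enum


class DatasetType(Enum):
    # Stand-in for the llama-stack enum (assumption: only these two values matter here).
    uri = "uri"
    rows = "rows"


def infer_provider_id(source_type: str, uri: str = "") -> str:
    """Hypothetical helper that mirrors the quoted inference branch."""
    if source_type == DatasetType.rows.value:
        return "localfs"
    elif source_type == DatasetType.uri.value:
        # Only URIs that literally start with "huggingface" are routed there.
        if uri.startswith("huggingface"):
            return "huggingface"
        return "localfs"
    raise ValueError(f"Unknown data source type: {source_type}")


# "hf://..." does not start with "huggingface", so it falls through to localfs.
print(infer_provider_id("uri", "hf://datasets/default/sample-basic-test"))
```

With this logic there is no input that can yield "nvidia", which matches the behavior reported in this issue.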

Error logs

from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
import os

os.environ["NVIDIA_PROJECT_ID"] = "experment@1"
client = LlamaStackAsLibraryClient("nvidia")
_ = client.initialize()

client.datasets.register(
    purpose="post-training/messages",
    dataset_id="sample-basic-test",
    source={
        "type": "uri",
        "uri": "hf://datasets/default/sample-basic-test",
    },
    metadata={"format": "json", "description": "This is an example of a dataset"},
)

result ==>
DatasetRegisterResponse(identifier='sample-basic-test', metadata={'format': 'json', 'description': 'This is an example of a dataset'}, provider_id='localfs', provider_resource_id='sample-basic-test', purpose='post-training/messages', source=SourceUriDataSource(type='uri', uri='hf://datasets/default/sample-basic-test'), type='dataset', access_attributes=None)

The response has provider_id='localfs', which blocks redirection to the NvidiaDatasetIOAdapter.

Expected behavior

Allow provider_id to be passed as an argument that decides where the dataset is registered.
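One way the requested behavior could look is sketched below, under stated assumptions: `resolve_provider_id` and its `registered_providers` parameter are hypothetical names invented for this illustration and are not part of llama-stack. The idea is that an explicit provider_id takes precedence, and the existing inference is used only as a fallback.

```python
from typing import Iterable, Optional


def resolve_provider_id(
    source_type: str,
    uri: str = "",
    provider_id: Optional[str] = None,
    registered_providers: Iterable[str] = ("localfs", "huggingface", "nvidia"),
) -> str:
    """Hypothetical sketch: prefer an explicit provider_id, otherwise infer it."""
    if provider_id is not None:
        # Reject provider ids that are not configured (assumed validation step).
        if provider_id not in set(registered_providers):
            raise ValueError(f"Provider {provider_id!r} is not registered")
        return provider_id

    # Fallback: same inference as the logic quoted in this issue.
    if source_type == "rows":
        return "localfs"
    if source_type == "uri":
        return "huggingface" if uri.startswith("huggingface") else "localfs"
    raise ValueError(f"Unknown data source type: {source_type}")


uri = "hf://datasets/default/sample-basic-test"
print(resolve_provider_id("uri", uri, provider_id="nvidia"))  # explicit choice wins
print(resolve_provider_id("uri", uri))                        # inferred fallback
```

Keeping the inference as a fallback would leave existing callers that never pass provider_id unaffected.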

@raspawar raspawar added the bug Something isn't working label Apr 2, 2025
@raspawar
Contributor Author

raspawar commented Apr 2, 2025

cc: @dglogo, @mattf, @yanxi0830, @raghotham
