Skip to content

Improved type annotation #7429

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 26 commits into
base: main
Choose a base branch
from
Open

Conversation

saiden89
Copy link

@saiden89 saiden89 commented Feb 28, 2025

I've refined several type annotations throughout the codebase to align with current best practices and enhance overall clarity. Given the complexity of the code, there may still be areas that need further attention. I welcome any feedback or suggestions to make these improvements even better.

@saiden89 saiden89 marked this pull request as ready for review March 4, 2025 13:40
@saiden89
Copy link
Author

saiden89 commented May 9, 2025

@lhoestq Could someone please take a quick look or let me know if there’s anything I should change? Thanks!

@lhoestq
Copy link
Member

lhoestq commented May 12, 2025

could you fix the conflicts ? I think some type annotations have been improved since your first commit

@saiden89
Copy link
Author

It should be good now.
I'm happy to add more annotations or refine further if needed—just let me know!

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only did a small review and spotted a few issues, can you double check the rest please ?

if self.dtype == "double": # fix inferred type
self.dtype = "float64"
if self.dtype == "float": # fix inferred type
self.dtype = "float32"
self.pa_type = string_to_arrow(self.dtype)

def __call__(self):
def __call__(self) -> str:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a string, but a pyarrow type

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def __call__(self) -> str:
def __call__(self) -> Any:

"""Wrapper for dataset transforms that recreate a new Dataset to transmit the format of the original dataset to the new dataset"""

@wraps(func)
def wrapper(*args, **kwargs):
def wrapper(*args: Any, **kwargs: Any) -> Union[dict, Dataset, DatasetDict]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def wrapper(*args: Any, **kwargs: Any) -> Union[dict, Dataset, DatasetDict]:
def wrapper(*args: Any, **kwargs: Any) -> Union[Dataset, DatasetDict]:

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This suggestion introduces the following error on the return type

Type "dict[Unknown, Unknown] | DatasetDict | Unknown | Dataset" is not assignable to return type "Dataset | DatasetDict"
  Type "dict[Unknown, Unknown] | DatasetDict | Unknown | Dataset" is not assignable to type "Dataset | DatasetDict"
    Type "dict[Unknown, Unknown]" is not assignable to type "Dataset | DatasetDict"
      "dict[Unknown, Unknown]" is not assignable to "Dataset"
      "dict[Unknown, Unknown]" is not assignable to "DatasetDict"PylancereportReturnType
(variable) out: dict[Unknown, Unknown] | DatasetDict | Unknown | Dataset

**kwargs,
):
**kwargs: Any,
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]:
) -> Dataset:

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The from_csv constructor is backed by a builder whose read() method can produce multiple types (e.g. Dataset, IterableDataset). In this specific usage, it will always return a Dataset, but if we annotate from_csv() as returning Dataset, the type checker rightly complains about the other branches.

Short-term fix

Until we improve the builder’s signature, we can locally assert the type:

return cast(Dataset, CsvDatasetReader(
            path_or_paths,
            split=split,
            features=features,
            cache_dir=cache_dir,
            keep_in_memory=keep_in_memory,
            num_proc=num_proc,
            **kwargs,
        ).read())

Long-term improvement

Ideally, the builder(s) would require more advanced typing, like a Protocol or Generic. That would remove the need for any casts and keep the API both safe and precise.

Thoughts?
If anyone has deeper familiarity with the builder implementation or advanced typing patterns, please feel free to chime in!

**kwargs,
):
**kwargs: Any,
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]:
) -> Dataset:

**kwargs,
):
**kwargs: Any,
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]:
) -> Dataset:

**kwargs,
):
**kwargs: Any,
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]:
) -> Dataset:

**kwargs,
):
**kwargs: Any,
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]:
) -> Dataset:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

from_parquet return type annotation
2 participants