-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Improved type annotation #7429
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Improved type annotation #7429
Conversation
… TextDatasetReader
@lhoestq Could someone please take a quick look or let me know if there’s anything I should change? Thanks! |
could you fix the conflicts ? I think some type annotations have been improved since your first commit |
It should be good now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only did a small review and spotted a few issues, can you double check the rest please ?
if self.dtype == "double": # fix inferred type | ||
self.dtype = "float64" | ||
if self.dtype == "float": # fix inferred type | ||
self.dtype = "float32" | ||
self.pa_type = string_to_arrow(self.dtype) | ||
|
||
def __call__(self): | ||
def __call__(self) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a string, but a pyarrow type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def __call__(self) -> str: | |
def __call__(self) -> Any: |
"""Wrapper for dataset transforms that recreate a new Dataset to transmit the format of the original dataset to the new dataset""" | ||
|
||
@wraps(func) | ||
def wrapper(*args, **kwargs): | ||
def wrapper(*args: Any, **kwargs: Any) -> Union[dict, Dataset, DatasetDict]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def wrapper(*args: Any, **kwargs: Any) -> Union[dict, Dataset, DatasetDict]: | |
def wrapper(*args: Any, **kwargs: Any) -> Union[Dataset, DatasetDict]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This suggestion introduces the following error on the return type
Type "dict[Unknown, Unknown] | DatasetDict | Unknown | Dataset" is not assignable to return type "Dataset | DatasetDict"
Type "dict[Unknown, Unknown] | DatasetDict | Unknown | Dataset" is not assignable to type "Dataset | DatasetDict"
Type "dict[Unknown, Unknown]" is not assignable to type "Dataset | DatasetDict"
"dict[Unknown, Unknown]" is not assignable to "Dataset"
"dict[Unknown, Unknown]" is not assignable to "DatasetDict"PylancereportReturnType
(variable) out: dict[Unknown, Unknown] | DatasetDict | Unknown | Dataset
**kwargs, | ||
): | ||
**kwargs: Any, | ||
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]: | |
) -> Dataset: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The from_csv
constructor is backed by a builder whose read()
method can produce multiple types (e.g. Dataset
, IterableDataset
). In this specific usage, it will always return a Dataset
, but if we annotate from_csv()
as returning Dataset
, the type checker rightly complains about the other branches.
Short-term fix
Until we improve the builder’s signature, we can locally assert the type:
return cast(Dataset, CsvDatasetReader(
path_or_paths,
split=split,
features=features,
cache_dir=cache_dir,
keep_in_memory=keep_in_memory,
num_proc=num_proc,
**kwargs,
).read())
Long-term improvement
Ideally, the builder(s) would require more advanced typing, like a Protocol
or Generic
. That would remove the need for any casts and keep the API both safe and precise.
Thoughts?
If anyone has deeper familiarity with the builder implementation or advanced typing patterns, please feel free to chime in!
**kwargs, | ||
): | ||
**kwargs: Any, | ||
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]: | |
) -> Dataset: |
**kwargs, | ||
): | ||
**kwargs: Any, | ||
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]: | |
) -> Dataset: |
**kwargs, | ||
): | ||
**kwargs: Any, | ||
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]: | |
) -> Dataset: |
**kwargs, | ||
): | ||
**kwargs: Any, | ||
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
) -> Union[dict[str, IterableDataset], IterableDataset, Dataset, DatasetDict]: | |
) -> Dataset: |
I've refined several type annotations throughout the codebase to align with current best practices and enhance overall clarity. Given the complexity of the code, there may still be areas that need further attention. I welcome any feedback or suggestions to make these improvements even better.
from_parquet
return type annotation #7202