Feature/tensor support #673

Draft: wants to merge 72 commits into base: develop

mdbenito (Collaborator) commented Apr 25, 2025

Description

This PR adds support for tensor data to pydvl.valuation.dataset.Dataset through generics, an Array protocol, and a collection of array wrapper functions in pydvl.utils.array.
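To make the idea concrete, here is a minimal sketch of how a structural array type and a bound TypeVar can keep the container type through a function call. The names ArrayLike, ArrayT and take_rows are hypothetical and are not taken from pydvl.utils.array; the real protocol may declare different members.

```python
from typing import Protocol, TypeVar, runtime_checkable

import numpy as np


@runtime_checkable
class ArrayLike(Protocol):
    """Hypothetical structural type: anything with a shape that can be indexed."""

    @property
    def shape(self) -> tuple[int, ...]: ...

    def __len__(self) -> int: ...

    def __getitem__(self, key): ...


# Bound TypeVar: a type checker infers that the output type equals the input type
ArrayT = TypeVar("ArrayT", bound=ArrayLike)


def take_rows(data: ArrayT, indices: list[int]) -> ArrayT:
    """Select rows by position; numpy arrays stay numpy arrays."""
    return data[indices]


x = np.arange(12).reshape(4, 3)
subset = take_rows(x, [0, 2])
```

Because the protocol is structural, a torch tensor would satisfy it as well without either library having to know about the other.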

‼️Beware of the AI slop‼️ This is an attempt at using Claude Code, together with Serena, to implement a complete feature. Despite careful design of tasks and subtasks, and continuous hand-holding of the dummy, the result had tons of bugs (some of them subtle), inconsistent use of types, generics and the array utilities, and craploads of duplicate and inane tests which nevertheless left important cases out. I hope to have removed much of this, but some is left, mostly in array.py and the associated tests.

Changes

  • Dataset now supports instantiation with tensors or numpy arrays. The type is preserved.
  • Makes all valuation methods agnostic to the array type, with a few exceptions.
  • Fixes some issues with Dataset indexing.
  • Dataset can take memory-mapped numpy arrays, or memory-map them itself if mmap=True, reducing the per-node memory cost.
  • Serialization correctly handles memory maps.
  • Updated the MSR notebook, which uses a torch model for the utility, to load the data as tensors.
  • Introduces a new protocol TorchSupervisedModel, which is implemented e.g. by skorch.NeuralNetClassifier and is used in the MSR notebook (not a new dependency).
  • Introduces a new SkorchSupervisedScorer to handle skorch models.
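As background for the memory-mapping item, the following sketch shows only the underlying numpy mechanism: an array saved to disk is reopened with mmap_mode so that its pages are read on demand instead of being copied into memory. It does not reproduce how Dataset implements mmap=True, and the file name is arbitrary.

```python
import tempfile
from pathlib import Path

import numpy as np

x = np.arange(20, dtype=np.float64).reshape(10, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "x.npy"
    np.save(path, x)
    # mmap_mode="r" maps the file read-only instead of loading it into RAM
    x_mm = np.load(path, mmap_mode="r")
    is_memmap = isinstance(x_mm, np.memmap)
    # Copy a small slice into regular memory; only those pages are read
    first_rows = np.array(x_mm[:2])
    # Release the mapping before the temporary directory is removed
    del x_mm
```

Several processes on one node can map the same file, which is where the per-node saving comes from.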

TODO

  • Use generics consistently / simplify in array.py

Checklist

  • Wrote Unit tests (if necessary)
  • Updated Documentation (if necessary)
  • Updated Changelog
  • If notebooks were added/changed, added boilerplate cells are tagged with "tags": ["hide"] or "tags": ["hide-input"]

mdbenito and others added 30 commits March 17, 2025 15:51
# Conflicts:
#	src/pydvl/utils/types.py
#	src/pydvl/valuation/scorers/supervised.py
# Conflicts:
#	src/pydvl/valuation/samplers/classwise.py
- Create array_ops.py with utilities for both numpy arrays and PyTorch tensors
- Implement type-preserving functions for array creation and manipulation
- Add proper type hints with Array protocol and TypeVar for type preservation
- Add utility functions for library-specific operations
- Import array_ops in utils/__init__.py

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add stratified_split_indices utility in array_ops.py to handle both numpy arrays and tensors
- Update RawData.__post_init__ with improved type checking
- Update Dataset.from_arrays to support tensors through type-agnostic operations
- Add type hints and update docstrings for tensor support

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
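For orientation, a stratified index split can be sketched as below. This is not the stratified_split_indices from the commit: the function name, signature and rounding rule are assumptions, and only numpy input is handled (tensor labels would first need converting).

```python
import numpy as np


def stratified_split_sketch(labels: np.ndarray, train_size: float, seed: int = 0):
    """Return (train_idx, test_idx) with roughly equal class proportions."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for cls in np.unique(labels):
        # Positions of all samples belonging to this class, in random order
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        n_train = int(round(train_size * len(idx)))
        train_parts.append(idx[:n_train])
        test_parts.append(idx[n_train:])
    train_idx = np.sort(np.concatenate(train_parts))
    test_idx = np.sort(np.concatenate(test_parts))
    return train_idx, test_idx


y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
train_idx, test_idx = stratified_split_sketch(y, train_size=0.5)
```

Splitting per class and concatenating afterwards is what keeps the class ratio of y in both halves.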
- Update GroupedDataset to handle PyTorch tensors
- Implement type-agnostic data_to_group and group_to_data mappings
- Maintain tensor type in data_indices and logical_indices methods
- Add comprehensive tests for tensor operations in GroupedDataset

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
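The two mappings named in this commit can be illustrated with a short sketch. The variable names mirror the commit message, but the construction below is an assumption about their meaning rather than the GroupedDataset code.

```python
from collections import defaultdict

import numpy as np

# One (arbitrary) group id per data point
data_groups = np.array([2, 0, 2, 1, 0])

# return_inverse relabels the ids as 0..n_groups-1, one entry per data point
group_ids, data_to_group = np.unique(data_groups, return_inverse=True)

# Invert the mapping: group index -> positions of its data points
group_to_data = defaultdict(list)
for data_idx, group_idx in enumerate(data_to_group.tolist()):
    group_to_data[group_idx].append(data_idx)
```

torch.unique accepts the same return_inverse flag, so the first step has a direct tensor counterpart.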
…sor support

Extended test coverage to validate tensor support in Dataset and GroupedDataset classes:
- Added tests for mixed input types and error handling
- Added tests for edge cases like empty groups
- Added tests for single vs multi-dimensional tensors
- Added test for complex sequences of operations to verify type preservation
- Verified factory methods maintain type consistency

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Updated Sample class to support PyTorch tensors
- Modified IndexSampler to be tensor-agnostic
- Added tests for tensor support in samplers
- Updated hash and equality methods to work with both array types
- Replaced numpy-specific operations with array_ops equivalents

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
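Arrays are not hashable and compare elementwise, so a sample type that holds one needs custom equality and hashing. The class below is a hypothetical stand-in, not the Sample class touched by the commit; it assumes the subset is stored as an integer numpy array.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass(frozen=True, eq=False)
class SampleSketch:
    idx: Optional[int]
    subset: np.ndarray

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, SampleSketch):
            return NotImplemented
        # array_equal reduces the elementwise comparison to a single bool
        return self.idx == other.idx and np.array_equal(self.subset, other.subset)

    def __hash__(self) -> int:
        # Hash the bytes of a canonical int64 copy so that equal samples
        # with different integer dtypes hash identically
        canonical = np.ascontiguousarray(self.subset, dtype=np.int64)
        return hash((self.idx, canonical.tobytes()))


a = SampleSketch(0, np.array([1, 2, 3]))
b = SampleSketch(0, np.array([1, 2, 3], dtype=np.int32))
c = SampleSketch(0, np.array([1, 2, 4]))
```

Normalising the dtype before hashing keeps the hash consistent with equality, which sets and dictionary keys rely on.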
Fixed type errors in array_ops.py by adding proper type annotations and casts.
- Added overloads for functions to maintain type precision
- Fixed return type annotations for tensor operations
- Added proper casting to ensure type safety
- Fixed tensor-specific operations like .to() and .long()
- Ensured consistent return types match input types

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
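The overload pattern mentioned in this commit might look roughly like the sketch below. The function to_index_dtype is hypothetical; the tensor branch assumes torch.Tensor.long(), and torch is only imported for type checking so the snippet runs without it.

```python
from typing import TYPE_CHECKING, Any, overload

import numpy as np

if TYPE_CHECKING:
    import torch


@overload
def to_index_dtype(x: np.ndarray) -> np.ndarray: ...
@overload
def to_index_dtype(x: "torch.Tensor") -> "torch.Tensor": ...


def to_index_dtype(x: Any) -> Any:
    """Cast to a 64-bit integer dtype, returning the same array type."""
    if isinstance(x, np.ndarray):
        return x.astype(np.int64)
    # Assumed tensor branch: detect torch by module name to avoid importing it
    if type(x).__module__.startswith("torch"):
        return x.long()
    raise TypeError(f"Unsupported array type: {type(x).__name__}")


out = to_index_dtype(np.array([1.0, 2.0]))
```

The overloads give a type checker one precise signature per library, while a single untyped implementation does the runtime dispatch.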
…cumentation

- Add comprehensive tests for tensor indices handling in samplers
- Verify Sample.subset is always a numpy array
- Test ClasswiseSample with tensor inputs for both subset and ooc_subset
- Add error handling tests for invalid input types
- Add documentation note about converting tensor indices to numpy arrays
- Tests verify proper conversion and appropriate handling of tensor inputs

This completes step 5.3 of the tensor support implementation plan.
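A conversion of the kind these tests describe could be sketched as follows. The helper as_numpy_indices is hypothetical and duck-types the tensor check so that it runs without torch; the sampler code in the PR may validate inputs differently.

```python
import numpy as np


def as_numpy_indices(indices) -> np.ndarray:
    """Return indices as a 1-D int64 numpy array."""
    # torch tensors expose detach() and cpu(); checking for those methods
    # avoids a hard dependency on torch in this sketch
    if hasattr(indices, "detach") and hasattr(indices, "cpu"):
        indices = indices.detach().cpu().numpy()
    out = np.asarray(indices)
    # Reject non-integer input instead of truncating it silently
    if out.size and not np.issubdtype(out.dtype, np.integer):
        raise TypeError(f"Indices must be integers, got dtype {out.dtype}")
    return out.astype(np.int64).reshape(-1)


idx = as_numpy_indices([3, 1, 2])
```

Converting once at the boundary means downstream code only ever has to handle numpy index arrays.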
mdbenito added 30 commits April 20, 2025 12:57