Feature/tensor support #673

mdbenito · 2025-04-25T17:15:19Z

Description

This PR adds support for tensor data to pydvl.valuation.dataset.Dataset through generics, an Array protocol and a collection of array wrapper functions in pydvl.utils.array.
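To illustrate the general idea of combining a structural array type with generics, here is a minimal sketch. It is not the code in this PR: the names `ArrayLike`, `ArrayT` and `take_rows` are hypothetical, and only numpy is used so that the snippet runs without torch installed.

```python
from typing import Protocol, TypeVar, runtime_checkable

import numpy as np


@runtime_checkable
class ArrayLike(Protocol):
    """Hypothetical structural type: anything with a shape, a length and indexing."""

    @property
    def shape(self) -> tuple[int, ...]: ...

    def __len__(self) -> int: ...

    def __getitem__(self, key): ...


# A TypeVar bound to the protocol lets type checkers see that the output
# type of a function matches its input type.
ArrayT = TypeVar("ArrayT", bound=ArrayLike)


def take_rows(x: ArrayT, indices: list[int]) -> ArrayT:
    """Select rows by integer index; the result has the same type as the input."""
    return x[indices]


rows = take_rows(np.arange(6).reshape(3, 2), [0, 2])
```

Because both numpy arrays and torch tensors accept a list of integers as an index, a function written this way should return whichever type it was given, without importing either library explicitly.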

‼️Beware of the AI slop‼️ This is an attempt at implementing a complete feature with Claude Code and Serena. Despite careful design of tasks and subtasks, and continuous hand-holding of the dummy, there were tons of bugs (some of them subtle), inconsistencies in the use of types, generics and the array utilities, and craploads of duplicate and inane tests which nevertheless left important cases out. I hope to have removed much of this, but some is left, mostly in array.py and its associated tests.

Changes

Dataset now supports instantiation with tensors or numpy arrays. The type is preserved.
All valuation methods are now agnostic to the array type, with a few exceptions.
Fixes some issues with Dataset indexing.
Dataset can take memmapped numpy arrays, or memmap them if mmap=True, reducing the per-node memory cost.
Serialization correctly handles memory maps.
Updated the MSR notebook, which uses a torch model for the utility, to load the data as tensors.
Introduces a new protocol TorchSupervisedModel, which is implemented e.g. by skorch.NeuralNetClassifier, and used in the MSR notebook (not a new dependency).
Introduces a new SkorchSupervisedScorer to handle skorch models by moving tensors to the CPU before scoring.
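The memory-map and serialization items above can be illustrated with a small, self-contained sketch. This is not the Dataset implementation: `MemmapHolder` is a hypothetical class, and the approach shown (pickling only the file location, dtype and shape, then reopening the map on load) is an assumption about one reasonable way to do it.

```python
import os
import pickle
import tempfile

import numpy as np


class MemmapHolder:
    """Hypothetical container: keeps an array on disk and pickles only its location."""

    def __init__(self, x: np.ndarray, path: str):
        # Write the data once, then reopen it read-only as a memory map.
        mm = np.memmap(path, dtype=x.dtype, mode="w+", shape=x.shape)
        mm[:] = x
        mm.flush()
        del mm
        self.path = path
        self.x = np.memmap(path, dtype=x.dtype, mode="r", shape=x.shape)

    def __getstate__(self):
        # Send metadata only, so the payload stays small and every
        # process on the same node maps the same file.
        return {"path": self.path, "dtype": self.x.dtype.str, "shape": self.x.shape}

    def __setstate__(self, state):
        self.path = state["path"]
        self.x = np.memmap(
            state["path"], dtype=state["dtype"], mode="r", shape=tuple(state["shape"])
        )


with tempfile.TemporaryDirectory() as tmp:
    data = np.arange(12, dtype=np.float64).reshape(4, 3)
    holder = MemmapHolder(data, os.path.join(tmp, "x.dat"))
    clone = pickle.loads(pickle.dumps(holder))
    same = bool(np.array_equal(clone.x, data))
    is_memmap = isinstance(clone.x, np.memmap)
    del clone, holder  # release the maps before the directory is removed
```

The trade-off of this design is that the unpickling process must be able to see the same file, which holds for workers on one node but not necessarily across machines.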

TODO

Use generics consistently / simplify in array.py

Checklist

Wrote Unit tests (if necessary)
Updated Documentation (if necessary)
Updated Changelog
If notebooks were added/changed, added boilerplate cells are tagged with "tags": ["hide"] or "tags": ["hide-input"]

- Create array_ops.py with utilities for both numpy arrays and PyTorch tensors
- Implement type-preserving functions for array creation and manipulation
- Add proper type hints with Array protocol and TypeVar for type preservation
- Add utility functions for library-specific operations
- Import array_ops in utils/__init__.py

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

- Add stratified_split_indices utility in array_ops.py to handle both numpy arrays and tensors
- Update RawData.__post_init__ with improved type checking
- Update Dataset.from_arrays to support tensors through type-agnostic operations
- Add type hints and update docstrings for tensor support

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
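For readers unfamiliar with stratified splitting, the following is a rough numpy-only sketch of the technique. It does not reproduce the signature or the tensor handling of stratified_split_indices: the function name `stratified_split` and its parameters are hypothetical. The final shuffle mirrors the later commit that avoids returning indices sorted by class.

```python
import numpy as np


def stratified_split(y: np.ndarray, train_size: float, seed: int = 0):
    """Return (train_idx, test_idx), keeping class proportions roughly equal."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for label in np.unique(y):
        # Shuffle the positions of this class, then cut them at train_size.
        idx = rng.permutation(np.flatnonzero(y == label))
        n_train = int(round(train_size * len(idx)))
        train_parts.append(idx[:n_train])
        test_parts.append(idx[n_train:])
    # Shuffle once more so that the result is not grouped by class.
    train_idx = rng.permutation(np.concatenate(train_parts))
    test_idx = rng.permutation(np.concatenate(test_parts))
    return train_idx, test_idx


y = np.array([0] * 8 + [1] * 4)
train_idx, test_idx = stratified_split(y, train_size=0.75)
```

With 8 samples of class 0 and 4 of class 1, a 0.75 split puts 6 and 3 samples of each class in the training set, so both parts keep the 2:1 class ratio.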

- Update GroupedDataset to handle PyTorch tensors
- Implement type-agnostic data_to_group and group_to_data mappings
- Maintain tensor type in data_indices and logical_indices methods
- Add comprehensive tests for tensor operations in GroupedDataset

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

…sor support

Extended test coverage to validate tensor support in Dataset and GroupedDataset classes:
- Added tests for mixed input types and error handling
- Added tests for edge cases like empty groups
- Added tests for single vs multi-dimensional tensors
- Added test for complex sequences of operations to verify type preservation
- Verified factory methods maintain type consistency

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

- Updated Sample class to support PyTorch tensors
- Modified IndexSampler to be tensor-agnostic
- Added tests for tensor support in samplers
- Updated hash and equality methods to work with both array types
- Replaced numpy-specific operations with array_ops equivalents

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

Fixed type errors in array_ops.py by adding proper type annotations and casts.
- Added overloads for functions to maintain type precision
- Fixed return type annotations for tensor operations
- Added proper casting to ensure type safety
- Fixed tensor-specific operations like .to() and .long()
- Ensured consistent return types match input types

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

…cumentation

- Add comprehensive tests for tensor indices handling in samplers
- Verify Sample.subset is always a numpy array
- Test ClasswiseSample with tensor inputs for both subset and ooc_subset
- Add error handling tests for invalid input types
- Add documentation note about converting tensor indices to numpy arrays
- Tests verify proper conversion and appropriate handling of tensor inputs

This completes step 5.3 of the tensor support implementation plan.


mdbenito and others added 30 commits March 17, 2025 15:51

Tentative Array protocol, generic Dataset and SupervisedModel/Scorer, torch specializations

3b870fa

Implement mmap support for ndarrays in dataset

550cd20

Another go at conditional torch imports

93201d5

convenience wrapper for torch model profiler

e6caa08

Fix import

09c302f

Make utilitymodel generic

d8e1eb9

Missing chunk from commit 93201d5

d7add76

WIP on bzf notebook

811b843

Merge branch 'develop' into feature/tensor-support

bfd5c06

# Conflicts: # src/pydvl/utils/types.py # src/pydvl/valuation/scorers/supervised.py

Merge branch 'refs/heads/develop' into feature/tensor-support

77c37c0

# Conflicts: # src/pydvl/valuation/samplers/classwise.py

WIP: get and setstate for memmapped arrays in dataset

3047333

🧹🧹

15f370c

🧹🧹

e5dde24

🧹🧹

877de45

Factor out mmapping code

fd78d2f

Cleanup, simplify, use ndarrays for indices, fix some errors

f459b2f

Add tensor support to cwscorer. Add support functions

a792cf0

Do not convert tensors to ndarrays implicitly in valuationresult

3600b25

Finish array-agnostic cwshapley. Add support functions. Tests

ec231c7

Potentially allow for array-agnostic custom bagged models

a6e8074

Revert silly typevar

323695a

Ignore js crap

1820915

Test tensor support of data oob

6b6b5fc

mdbenito added 30 commits April 20, 2025 12:57

Add repr to dataset

1f31f45

Improve cleanup of mmapped arrays

9f19741

Use ndarrays for names in dataset

3cc018b

fix getting data with single index

7d70dd2

Improve tests for dataset

576c84a

Fix types in load_digits_dataset

e338d93

Some documentation in docstrings

b828fec

Cleanup, fix types, simplify

41d20e2

Refactor dataset utilities in notebooks.support

7fae16f

Remove unnecessary methods after tensor support for datasets

900464d

Additional shuffle in stratified_split_indices to avoid sorting datasets by class

8d99379

Remove legacy reshape unnecessary with tensor support in Dataset

21c477d

Syntactic sugar: support moving RawData to cpu directly

78fc15a

Missing allowed None type when slicing Dataset

3c760da

Simpler interface for load_digits

2583922

Typing shenanigans

fac471d

Use skorch.classifier.NeuralNetClassifier in MSR notebook

22f79be

Avoid creating copies of already memmapped arrays. Update documentation

fab771b

Use skorch.classifier.NeuralNetClassifier in MSR experiment

c7cc42f

Fix slicing with None for GroupedDataset

549f56b

🧹🧹

ad7aa70

Delete TorchModelScorer (unnecessary with tensor Dataset)

cc9a8b3

Don't fail if sacred is missing in support module for bzf notebook

e06fad5

Delete unnecessary TorchUtility after tensor support for Dataset

b873c47

Fix import

9705faa

Fix more imports

1a7a18a

Add SkorchSupervisedScorer to move tensors to cpu before scoring

d692852

gitignore serena config

61f8e4d

Rename

0628e1f

Fix array tests

ecbefdb
