Investigate possible issues with partial views #244

Closed
vbarrielle opened this issue Oct 22, 2020 · 20 comments · Fixed by #245

@vbarrielle
Collaborator

As evidenced by #242 and #243, middle_outer_views is not tested enough, and edge cases can surface in numerous algorithms. We need to write tests for the many functions involving these partial views, to reveal any remaining bugs.

vbarrielle self-assigned this Oct 22, 2020
@mulimoen
Collaborator

What are the requirements of data and indices? Do they have to be the same length, or is it enough for them to be as long as indptr requires? Do indices outside of the range given by indptr have to be convertible to usize?

@mulimoen
Collaborator

Minor optimisation: to_owned of a view could adjust indptr to start at 0 and keep only the minimum number of entries in indices and data.

@vbarrielle
Collaborator Author

> What are the requirements of data and indices? Do they have to be the same length, or is it enough for them to be as long as indptr requires? Do indices outside of the range given by indptr have to be convertible to usize?

In most algorithms, the only requirement is that for all i < indptr.len() - 1, indices[indptr[i]..indptr[i + 1]] is not out of bounds, and the same for data. This imposes that data and indices have the same length, but does not impose anything outside that range. However, in the current state, the structure checks of CsMat are much more restrictive, and it's very probable that some implementations depend on these stricter guarantees.
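
Concretely, a minimal sketch of that relaxed check (a hypothetical helper, not the actual CsMat structure check) could look like this:

```rust
/// Sketch of the relaxed structure check described above: every window of
/// `indptr` must be a valid, in-bounds range into `indices` and `data`, but
/// nothing is required of the elements outside those windows.
fn is_valid_partial_view<N>(indptr: &[usize], indices: &[usize], data: &[N]) -> bool {
    indices.len() == data.len()
        && indptr
            .windows(2)
            .all(|w| w[0] <= w[1] && w[1] <= indices.len())
}
```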

I see two possible lines of action:

  • commit to the possibility to have zero-copy partial views of a matrix, relax structure checks and increase test coverage
  • take a shortcut and allocate when trying to slice a matrix. This seems to be scipy's approach.

I lean towards the first approach, as I don't like the performance penalty of the second one.

> Minor optimisation: to_owned of a view could adjust indptr to start at 0 and keep only the minimum number of entries in indices and data.

Yes, that looks like a good idea.
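
A rough sketch of what that could do for a CSR-style view (a hypothetical free function, not the actual to_owned):

```rust
/// Rebase `indptr` so it starts at 0 and copy only the entries of `indices`
/// and `data` that the view actually references.
fn view_to_owned<N: Clone>(
    indptr: &[usize],
    indices: &[usize],
    data: &[N],
) -> (Vec<usize>, Vec<usize>, Vec<N>) {
    let start = indptr.first().copied().unwrap_or(0);
    let end = indptr.last().copied().unwrap_or(0);
    let rebased: Vec<usize> = indptr.iter().map(|&p| p - start).collect();
    (rebased, indices[start..end].to_vec(), data[start..end].to_vec())
}
```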

@vbarrielle
Collaborator Author

A drawback of the zero-copy approach is that equality between two sparse matrices is no longer trivial, because the non-indexed elements are arbitrary. I'm pretty sure the current implementation has this bug.

@vbarrielle
Collaborator Author

There is a way to have a zero-copy approach without having unused/invalid elements. The idea is to use another type for the indptr storage, which would store a slice and an offset to apply to the values. To make this fit into the API there would need to be a trait defining everything that can be done with an indptr, and all algorithms would be implemented using this trait. This way the type system would help to prove correct usage.

That would technically be a breaking change, but not one that would affect much code in practice, I guess.
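
A rough sketch of the kind of trait and offset-carrying indptr type I have in mind (all names hypothetical):

```rust
use std::ops::Range;

/// Everything an algorithm needs to ask of an indptr storage.
trait IndPtrStorage {
    /// Number of outer dimensions (rows for CSR).
    fn outer_dims(&self) -> usize;
    /// Range into `indices`/`data` for outer dimension `i`, already
    /// corrected by the view's offset.
    fn outer_range(&self, i: usize) -> Range<usize>;
}

/// Zero-copy indptr of a partial view: a slice into the parent matrix's
/// indptr plus the offset to subtract from every stored value.
struct OffsetIndPtr<'a> {
    raw: &'a [usize],
    offset: usize,
}

impl IndPtrStorage for OffsetIndPtr<'_> {
    fn outer_dims(&self) -> usize {
        self.raw.len().saturating_sub(1)
    }
    fn outer_range(&self, i: usize) -> Range<usize> {
        (self.raw[i] - self.offset)..(self.raw[i + 1] - self.offset)
    }
}
```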

@mulimoen
Collaborator

Equality is currently defined as structural equality, so I don't think this is a big problem. The approx trait could be used for semantic equality.

@mulimoen
Collaborator

Scipy ensures indptr always starts at 0; what does SuiteSparse do?

@vbarrielle
Collaborator Author

vbarrielle commented Oct 23, 2020

It looks like SuiteSparse requires indptr to start at 0, for instance in ldl_valid_matrix.

@vbarrielle
Collaborator Author

> There is a way to have a zero-copy approach without having unused/invalid elements. The idea is to use another type for the indptr storage, which would store a slice and an offset to apply to the values. To make this fit into the API there would need to be a trait defining everything that can be done with an indptr, and all algorithms would be implemented using this trait. This way the type system would help to prove correct usage.
>
> That would technically be a breaking change, but not one that would affect much code in practice, I guess.

I've started prototyping a trait around that idea; any feedback is appreciated!

@mulimoen
Collaborator

How about this approach: https://gist.github.com/mulimoen/c476ee6a2daf5ba9e341d87dfbc46e14? data and indices would always be part of the matrix; indptr[0] is just a remapping down to zero.
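
Roughly, every lookup just subtracts indptr[0] (a minimal sketch, not the exact contents of the gist):

```rust
/// Range into `indices`/`data` for outer dimension `i` of a matrix (or
/// view) whose indptr does not necessarily start at 0.
fn outer_range(indptr: &[usize], i: usize) -> std::ops::Range<usize> {
    let offset = indptr[0];
    (indptr[i] - offset)..(indptr[i + 1] - offset)
}
```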

@vbarrielle
Collaborator Author

That's a nice insight, no need for a trait with your approach!

@mulimoen
Collaborator

Another alternative: indptr could return a Cow.
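
Something along these lines (a sketch with a hypothetical signature):

```rust
use std::borrow::Cow;

/// Borrow the stored indptr when it already starts at 0, otherwise
/// allocate a rebased copy.
fn indptr(raw: &[usize]) -> Cow<'_, [usize]> {
    match raw.first() {
        None | Some(&0) => Cow::Borrowed(raw),
        Some(&offset) => Cow::Owned(raw.iter().map(|&p| p - offset).collect()),
    }
}
```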

vbarrielle added a commit that referenced this issue Oct 25, 2020

As suggested by @mulimoen (#244 (comment)), it's doable to have a concrete type abstracting the classic indptr case with the first element at 0, and the sliced matrix case, by simply performing the subtraction everywhere.
@vbarrielle
Collaborator Author

I really like the concrete type approach with indptr[0] used as an offset: it makes the type a guarantee of valid data, and it can take its own share of the structure checks. I've started implementing it; it looks like this for now: https://github.com/vbarrielle/sprs/blob/99663dcad21e34e3260b64e37b6e5fa33d14ff5a/src/sparse/indptr.rs

@mulimoen
Collaborator

mulimoen commented Oct 26, 2020

Looking great! I have a couple of ideas below:

  • I'm missing a form of canonical indptr, which can be used for FFI, maybe expressed as a Cow for an improper indptr?
  • ndarray calls the slice operation as_slice, should we follow the same naming?
  • new consumes storage on failure, could we return this on failure?
  • Internal API idea: a method ranges which returns an iterator of std::ops::Range<usize> which can be directly consumed (see the sketch after this list)
  • A lot of the maps can be removed and we could use unwrap, since we should always have a first and last element
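
For the ranges idea, roughly (a sketch as a free function rather than a method):

```rust
use std::ops::Range;

/// Iterate over the `indices`/`data` range of each outer dimension,
/// correcting for an indptr that does not necessarily start at 0.
fn ranges(indptr: &[usize]) -> impl Iterator<Item = Range<usize>> + '_ {
    let offset = indptr.first().copied().unwrap_or(0);
    indptr.windows(2).map(move |w| (w[0] - offset)..(w[1] - offset))
}
```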

@vbarrielle
Collaborator Author

> * I'm missing a form of canonical `indptr`, which can be used for FFI, maybe expressed as a `Cow` for an improper indptr?

Yes this looks like a good idea.

> * `ndarray` calls the `slice` operation `as_slice`, should we follow the same naming?

I'll change that for consistency. But it's fallible here, so maybe try_as_slice?

> * `new` consumes `storage` on failure, could we return this on failure?

I'll see if it can be done.

> * Internal API idea: a method `ranges` which returns an iterator of `std::ops::Range<usize>` which can be directly consumed

That's interesting indeed, maybe it should be what's returned by iter_outer? There's one downside to that, as Range is not Copy AFAIR, but other than that I like this idea, even for a public API.

> * A lot of the `map`s can be removed and we could use `unwrap`, since we should always have a `first` and `last` element

I don't think we will always have a first and last element; for instance, I don't think we should disallow creating a view around 0 rows. I know that in dense linear algebra this kind of edge case can be very useful.

@mulimoen
Collaborator

ndarray's as_slice returns an Option; try_as_slice seems a bit verbose.

A size-one indptr has two elements, so it would be "natural" for a zero-sized indptr to have one element. Although a length of zero would mean we could implement Default.

@vbarrielle
Collaborator Author

@mulimoen, was that the kind of Cow-returning API you had in mind?

@mulimoen
Collaborator

That was exactly what I had in mind; we can now use this in to_owned on CsMatBase.

@vbarrielle
Collaborator Author

Hello, just an update to let you know that my work on this is going to be a bit slow: my second son was born this week and I have much less free time. I still hope to be able to keep working on this, though.

@vbarrielle
Collaborator Author

My current state is shown in #245; it's not without issues, but I think most of the work is there now.
