[WIP] Indexing along axes #2

Merged
merged 9 commits into from
Feb 18, 2015

tshort (Collaborator) commented Feb 16, 2015

This is less of a "work-in-progress" and more of a "something to start discussion" on indexing along axes. Here are feature ideas and other discussion items:

  • It'd be nice if other packages could plug in axis types and define ways to index into them.
  • It'd be nice if other packages could declare that an axis type is ordered, so it can be indexed automatically (sounds like a good use of a Tim-Holy-Trait-Trick).
  • It'd be nice to define default indexing for column names and for ranges on time axes.
  • For range indexes on time axes, should it be A[[from, to],:], A[(from,to),:], A[1s:9s,:], or something else?
  • What else should we provide by default for indexing?

What's implemented is an axesindexes method that tries to generalize indexing along an axis. It should return a UnitRange or other simple indexing type. It currently generates a bunch of warnings, and some indexing cases are unchecked. Here's what works now:

julia> a = AxisArray(reshape([1:24], 12,2), (.1:.1:1.2, [:a,:b]))
12x2 AxisArrays.AxisArray{Int64,2,Array{Int64,2},(:row,:col),(FloatRange{Float64},Array{Symbol,1}),(Float64,Symbol)}:
  1  13
  2  14
  3  15
  4  16
  5  17
  6  18
  7  19
  8  20
  9  21
 10  22
 11  23
 12  24

julia> a[:,[:a]]
12x1 AxisArrays.AxisArray{Int64,2,SubArray{Int64,2,Array{Int64,2},(UnitRange{Int64},Array{Int64,1}),1},(:row,:col),(FloatRange{Float64},Array{Symbol,1}),(Float64,Symbol)}:
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12

julia> a[[.9, 1.1],[:a]]
3x1 AxisArrays.AxisArray{Int64,2,SubArray{Int64,2,Array{Int64,2},(UnitRange{Int64},Array{Int64,1}),1},(:row,:col),(FloatRange{Float64},Array{Symbol,1}),(Float64,Symbol)}:
  9
 10
 11

julia> a[[.3, 1.1],:]
9x2 AxisArrays.AxisArray{Int64,2,SubArray{Int64,2,Array{Int64,2},(UnitRange{Int64},UnitRange{Int64}),1},(:row,:col),(FloatRange{Float64},Array{Symbol,1}),(Float64,Symbol)}:
  3  15
  4  16
  5  17
  6  18
  7  19
  8  20
  9  21
 10  22
 11  23

julia> a[[.9, Inf],:]
4x2 AxisArrays.AxisArray{Int64,2,SubArray{Int64,2,Array{Int64,2},(UnitRange{Int64},UnitRange{Int64}),1},(:row,:col),(FloatRange{Float64},Array{Symbol,1}),(Float64,Symbol)}:
  9  21
 10  22
 11  23
 12  24

julia> a = AxisArray(reshape([1:24], 12,2), (Ordered([.1:.1:1.2]), [:a,:b]))
12x2 AxisArrays.AxisArray{Int64,2,Array{Int64,2},(:row,:col),(AxisArrays.Ordered{Float64,Array{Float64,1}},Array{Symbol,1}),(Float64,Symbol)}:
  1  13
  2  14
  3  15
  4  16
  5  17
  6  18
  7  19
  8  20
  9  21
 10  22
 11  23
 12  24

julia> a[[.9, 1.1],:]
3x2 AxisArrays.AxisArray{Int64,2,SubArray{Int64,2,Array{Int64,2},(UnitRange{Int64},UnitRange{Int64}),1},(:row,:col),(AxisArrays.Ordered{Float64,Array{Float64,1}},Array{Symbol,1}),(Float64,Symbol)}:
  9  21
 10  22
 11  23

julia> a[[.9, 1.1],[:a]]
3x1 AxisArrays.AxisArray{Int64,2,SubArray{Int64,2,Array{Int64,2},(UnitRange{Int64},Array{Int64,1}),1},(:row,:col),(AxisArrays.Ordered{Float64,Array{Float64,1}},Array{Symbol,1}),(Float64,Symbol)}:
  9
 10
 11


mbauman (Member) commented Feb 16, 2015

Yes, I think some sort of smart indexing behaviors are essential to making these arrays powerful. That said, I don't really want to pun too much on the indexing operation. There are a few reasons for this:

  • Base Julia allows indexing by floating point integers (… for now; see "Should we expect floating point indexing to be implemented?", JuliaLang/julia#10154)
  • We can't customize the lowering of the end keyword — it will always be the last integer index, as determined by size/length/trailingdims. But it's not uncommon to use end in computations, and it's not possible to disable it. If you have a floating point axis, and use end in a floating point computation, it will promote and not mean what you wanted it to mean.
  • Indexing floating point types and trying to get elements by exact equality or interpolating under the hood seems crazy.
  • Indexing behaviors are already pretty complicated.

I was thinking about having only two different smart indexing behaviors, with two different traits Dimensional and Categorical:

indexing(::Union(Number, AbstractDate)) = Dimensional()
indexing(::Union(Symbol, AbstractString)) = Categorical()

Dimensional axes must be sorted, unique, and their only special indexing behavior is with an explicit Interval type to select ranges of data (see mbauman/Signals.jl#10). Categorical axes are unsorted, and their only special behavior is using their element type directly to select a single point or slice.
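The two-trait idea can be sketched in current Julia syntax (the thread itself predates it). This is only an illustration: `axistrait` and `isvalidaxis` are hypothetical names, and `Dates.TimeType` stands in for the `AbstractDate` mentioned above.

```julia
using Dates

# Trait types named after the comment above.
struct Dimensional end
struct Categorical end

# Hypothetical trait function: pick a trait from the axis element type.
axistrait(::AbstractVector{<:Union{Real,Dates.TimeType}}) = Dimensional()
axistrait(::AbstractVector{<:Union{Symbol,AbstractString}}) = Categorical()

# Hypothetical invariant check that dispatches on the trait.
isvalidaxis(ax) = isvalidaxis(axistrait(ax), ax)
isvalidaxis(::Dimensional, ax) = issorted(ax) && allunique(ax)  # sorted and unique
isvalidaxis(::Categorical, ax) = true                           # no ordering requirement

times = [0.1, 0.2, 0.3]
cols = [:a, :b]

axistrait(times)          # Dimensional()
axistrait(cols)           # Categorical()
isvalidaxis([0.3, 0.1])   # false, because a Dimensional axis must be sorted
```

Dispatching on a trait value rather than on the element type directly means a third package could opt its own axis type in by adding a single `axistrait` method.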

Are there other behaviors you'd want here? Or other kinds of types that you'd like to have as axes?

tshort (Collaborator, Author) commented Feb 16, 2015

I like the two traits you've defined. That seems like it'd cover the normal cases I can think of. I think it'd still be nice to allow other packages to further customize indexing behavior. An example might be a type that wants to use indexing for interpolation along an axis.

Regarding indexing by floating point integers, I'm not sure it contradicts what I implemented. It seems like the sentiment in JuliaLang/julia#10154 is to allow for floating point indexing of the sort we might see use of here. I agree that mbauman/Signals.jl#10 is an important question, and Julia allows a lot of ways to define syntax for that. The notation A[Interval(0.1, 0.9), :] is a little verbose but quite readable. A[Interval(Date(1980,1,1):Date(2015,1,1)), :] is even more verbose, so it'd be nice for TimeSeries to be able to use a more concise alternative (possibly with ISO 8601 strings: A["1998-12-01/2004-04-02", :]). Anyway, my main point is to give other packages control over their axes indexing. As you point out, open-ended intervals are another tricky consideration.

As far as other kinds of types as axes, I think we should try to accommodate anything that's vector-like, including iterators. For example, I could see having an axis that's a DataFrame. That'd be a way to add metadata to rows of an AxisArray object. A DataFrame wouldn't normally fit because it's not a vector-like object, but eachrow(d::DataFrame) is an iterator over rows that could make it look like a vector.

mbauman (Member) commented Feb 16, 2015

Yes, the more I think about it, the more I like your external dispatch-driven approach. I had initially included the axis element types as parameters of AxisArray as I was imagining dispatching on them, kind of like this:

getindex{EltA,EltB,EltC}(AxisArray{, (EltA, EltB, EltC)}, ::Interval{EltA}, ::EltB, ::UnitRange{EltC})

But that gets complicated really fast. And you'd probably want to use Unions, which are buggy in dispatch with static parameters. I think the simplest thing to do is have a fallback getindex defined for Any types that punts to an axisindexes function to get the integer or integer range indices and redispatch to the core stagedfunction.

getindex(A::AxisArray, I...) = getindex(A, map(axisindexes, A.axes, I)...)

I like the simplicity. It'll have to be a little fancier to deal with different lengths (and we can do this without map or splat with stagedfunctions). The axisindexes function could have a nice error message for unspecialized axis/index pairs. And other packages could extend it to get much fancier behavior.
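A minimal sketch of that fallback in current Julia syntax, under stated assumptions: `LabeledArray` and `toindex` are hypothetical stand-ins for `AxisArray` and `axisindexes`, and the number of indices is assumed to equal the number of axes (the "different lengths" case is not handled).

```julia
# Hypothetical wrapper: an array plus one label vector per dimension.
struct LabeledArray{A<:AbstractArray,Ax<:Tuple}
    data::A
    axes::Ax
end

# Lower one index for one axis to something a plain array understands.
toindex(ax, i::Integer) = i
toindex(ax, i::AbstractUnitRange{<:Integer}) = i
toindex(ax, ::Colon) = Colon()
# Categorical-style lookup: index a label axis by one of its elements.
function toindex(ax::AbstractVector{T}, label::T) where {T<:Union{Symbol,AbstractString}}
    i = findfirst(isequal(label), ax)
    i === nothing && error("index $label not found")
    return i
end
# Descriptive error for axis/index pairs nobody has specialized.
toindex(ax, i) = error("indexing by $(typeof(i)) is not supported for axes of type $(typeof(ax))")

# Fallback getindex: lower every index, then index the wrapped array.
Base.getindex(A::LabeledArray, I...) = A.data[map(toindex, A.axes, I)...]

A = LabeledArray(reshape(1:24, 12, 2), ([0.25 * k for k in 1:12], [:a, :b]))
A[3, :b]   # 15
A[:, :a]   # first column, the values 1 through 12
```

Restricting the label method to `Symbol` and string axes sidesteps a dispatch ambiguity with the `Integer` method for integer-valued axes; the sketch does not try to resolve that case.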


I think that I only want to allow ranges when the default StepRange A:B enumerates all possible values between A and B. This is the case for Integers and Dates, but not DateTime. That could be a third trait. A[Date(1980):Date(2015),:] is pretty concise, but I agree an ISO 8601 string could do even better.

tshort (Collaborator, Author) commented Feb 16, 2015

How about the following for use as the default way to index intervals on ordered axes?

Interval(0.3, 2.5)
Interval(from = 0.3)   # open-ended `to`
Interval(to = 2.5)   # open-ended `from`

It's not the most concise approach, but it's pretty straightforward.
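One way the keyword form could look in current Julia syntax. This is a sketch, not the package's definition: it assumes a real-valued axis so that an omitted bound can default to an infinity, and `intervalindexes` is a hypothetical helper name.

```julia
struct Interval{T}
    lo::T
    hi::T
end

# Keyword form: an omitted bound defaults to -Inf or Inf (assumes a real-valued axis).
Interval(; from = -Inf, to = Inf) = Interval(promote(from, to)...)

# Hypothetical helper: index range of a sorted axis covered by the closed interval.
intervalindexes(ax, iv::Interval) =
    searchsortedfirst(ax, iv.lo):searchsortedlast(ax, iv.hi)

ax = [0.25 * k for k in 1:12]                 # 0.25, 0.5, ..., 3.0

intervalindexes(ax, Interval(0.75, 2.5))      # 3:10
intervalindexes(ax, Interval(from = 2.25))    # 9:12, open-ended `to`
intervalindexes(ax, Interval(to = 0.5))       # 1:2, open-ended `from`
```

Infinite defaults would not carry over to date axes; those would need some other sentinel for an open end.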

mbauman (Member) commented Feb 17, 2015

I've rebased your work on top of my recent getindex fixes, and then took a stab at implementing what we've been talking about here. Take a look, I think it should be pretty functional. The Interval type is about as minimal as it gets, but it does the trick (pretty wild that you don't need == for searchsorted which "Returns the range of indices of "a" which compare as equal").
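The `searchsorted` observation can be reproduced with a small sketch in current Julia syntax. It assumes a plain sorted `Vector` axis and defines the mixed `isless` methods directly, rather than through `promote` and `convert` as the PR does.

```julia
struct Interval{T}
    lo::T
    hi::T
end

# Only an ordering is defined; no == method is needed.
Base.isless(a::Interval, b::Interval) = isless(a.hi, b.lo)
Base.isless(a::Interval, b::Real) = isless(a.hi, b)   # whole interval lies below b
Base.isless(a::Real, b::Interval) = isless(a, b.lo)   # a lies below the whole interval

ax = [0.25 * k for k in 1:12]            # 0.25, 0.5, ..., 3.0

# An element is "equal" to the interval when neither is less than the other,
# which is exactly the set of elements inside [lo, hi].
searchsorted(ax, Interval(0.75, 2.5))    # 3:10
searchsorted(ax, Interval(0.8, 0.9))     # empty range, no element falls inside
```

Under this ordering `searchsorted` treats "compares as equal" as "neither side is less", which is why no `==` method is required.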

tshort (Collaborator, Author) commented Feb 17, 2015

Awesome. That code is quite concise and easy enough to follow!

i = findfirst(ax, idx)
i == 0 && error("index $idx not found")
i
end
tshort (Collaborator, Author) commented

It'd be nice to have another method that indexes on an array of elements. Here's a start at that (untested):

function axisindexes{T}(::Type{Categorical}, ax::AbstractVector{T}, idx::AbstractVector{T}) 
    res = findin(ax, idx)
    length(res) == 0 && error("index $idx not found")
    res
end

Edit: fixed a typo. I actually tried it, and it works. Note that findin selects the right columns but returns them in the axis's original order, not in the order specified in idx.
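The ordering difference can be seen in a short sketch in current Julia syntax, where `findin(ax, idx)` has been replaced by `findall(in(idx), ax)`. The function name `orderedindexes` is hypothetical, and the per-element lookup is just one possible way to keep the requested order.

```julia
ax = [:a, :b, :c]
idx = [:c, :a]

# Equivalent of findin: positions come back in the order of the axis.
findall(in(idx), ax)          # [1, 3]

# Hypothetical alternative: look up each label in turn, keeping the order of idx.
function orderedindexes(ax, idx)
    map(idx) do label
        i = findfirst(isequal(label), ax)
        i === nothing && error("index $label not found")
        i
    end
end

orderedindexes(ax, idx)       # [3, 1]
```

The per-element version also reports a missing label individually, whereas the `findin`-style check only fails when none of the labels are found.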

mbauman (Member) commented

Seems reasonable. You can go ahead and add it to this branch. I'll work tonight on fixing the getindex ambiguities. It's a bit of a mess.

tshort (Collaborator, Author) commented

Okay. (Probably not til tonight.)

On Tue, Feb 17, 2015 at 9:12 AM, Matt Bauman wrote:

In src/core.jl
#2 (comment):

+Base.convert{T}(::Type{Interval{T}}, x) = Interval(x,x)
+Base.isless(a::Interval, b::Interval) = isless(a.hi, b.lo)
+Base.isless(a::Interval, b) = isless(promote(a,b)...)
+Base.isless(a, b::Interval) = isless(promote(a,b)...)
+
+# Default axes indexing throws an error
+axisindexes(ax, idx) = axisindexes(axistype(ax), ax, idx)
+axisindexes(::Type{Unsupported}, ax, idx) = error("elementwise indexing is not supported for axes of type $(typeof(ax))")
+# Dimensional axes may be indexed by intervals of their elements
+axisindexes{T}(::Type{Dimensional}, ax::AbstractVector{T}, idx::Interval{T}) = searchsorted(ax, idx)
+# Categorical axes may be indexed by their elements
+function axisindexes{T}(::Type{Categorical}, ax::AbstractVector{T}, idx::T)
+    i = findfirst(ax, idx)
+    i == 0 && error("index $idx not found")
+    i
+end


tshort and others added 8 commits February 17, 2015 20:04
Symbols, FloatRanges, and an Ordered type for ordering vectors.
This is missing tests, but it's a start and works in my quick interactive tests.  This creates three AxisTypes: Categorical, Dimensional, and Unsupported, as determined by the axistype function.

I created the checkaxis function (but don't use it yet) to enforce type-specific invariants.

The axisindexes function is used to 'lower' the fancy indexing behavior to a supported basic indexing type (Int, Range, etc).

A fallback `getindex(A, ::Any...)` calls the axisindexes functions for each fancy indexing dimension.  It just works with the other behaviors (like `A[Axis{:col}(Interval(.1,.5))]`).
Eliminate splatting for N<=4 for the fallback getindex function.  It's a little verbose, but the meta-meta-programming alternative would probably be too confusing.
Since we're not depending upon dispatch for the fancier axis indexing behaviors, this type parameter is not needed.
mbauman (Member) commented Feb 18, 2015

This is looking great. It seems like Coveralls doesn't like reporting statuses on PRs for forks. (This is just GnuTLS.jl flaking out). I think I'll just merge and then flesh out the tests using the results from master.

mbauman added a commit that referenced this pull request Feb 18, 2015
@mbauman mbauman merged commit b7fb661 into master Feb 18, 2015
mbauman (Member) commented Feb 18, 2015

Thanks for all the feedback and help here!

tshort (Collaborator, Author) commented Feb 18, 2015

Agreed: looking great! Handling ambiguity warnings looked painful.

@mbauman mbauman deleted the axes-indexing branch February 19, 2015 01:09