Use-cases #3

Closed · nevi-me opened this issue Mar 24, 2019 · 15 comments

@nevi-me commented Mar 24, 2019

One of the reasons why I'm interested in dataframe libraries for Rust is that Rust could make for a good ETL tool.

What other use-cases do people have?

@hwchen (Contributor) commented Mar 25, 2019

My own use case is definitely on the ETL side, so the things that are important to me are:

  • re-shaping DataFrames (join, concatenate, pivot, melt)
  • ease of working with strings (e.g. splitting a string column in two; see the sketch below)
  • ergonomic map/apply

I'm sure there are some I'm missing, but these come to mind first.
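[Editor's note: a rough, hedged sketch of the last two bullets, splitting a string column in two and a closure-based map/apply, over a plain vector of strings. StrColumn and its methods are invented names for illustration, not an existing API.]

struct StrColumn(Vec<String>);

impl StrColumn {
    // Split each value once on `sep`, producing two new columns.
    fn split_into_two(&self, sep: char) -> (StrColumn, StrColumn) {
        let (mut left, mut right) = (Vec::new(), Vec::new());
        for v in &self.0 {
            let (l, r) = v.split_once(sep).unwrap_or((v.as_str(), ""));
            left.push(l.to_string());
            right.push(r.to_string());
        }
        (StrColumn(left), StrColumn(right))
    }

    // Ergonomic map/apply: transform every value with a closure.
    fn apply(&self, f: impl Fn(&str) -> String) -> StrColumn {
        StrColumn(self.0.iter().map(|v| f(v)).collect())
    }
}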

@LukeMathWalker

My main use case concerns ML workloads:

  • a strongly-typed data structure that does not require all elements to be of the same type, as in ndarray (see the sketch below);
  • expressive manipulation (named axes);
  • the basic SQL-like manipulations.
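[Editor's note: a minimal sketch of what "strongly typed but heterogeneous" could mean in practice. Frame2 and its field names are invented for the illustration: each column has its own static element type, unlike ndarray, where every element shares one type.]

struct Frame2 {
    city: Vec<String>,
    surface: Vec<u32>,
}

fn main() {
    let df = Frame2 {
        city: vec!["Paris".to_string(), "Nice".to_string()],
        surface: vec![50, 74],
    };
    // Each column access is statically typed; no downcasting needed.
    let total: u32 = df.surface.iter().sum();
    assert_eq!(total, 124);
    assert_eq!(df.city.len(), df.surface.len());
}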

@galuhsahid

Hi, hope y'all don't mind me chiming in. I'm very interested in dataframe libraries for Rust, and agreed, I think Rust could make a great ETL tool!

I think my main use cases have been covered by the previous comments. Some other ones:

  • Data imputation (mean imputation is sketched below)
  • Filtering
  • Window functions
  • Time series manipulation
@jblondin commented Apr 19, 2019

I'm looking at ML / Data science use cases, as well. Basically, I want a library that can ETL some data and feed it into ndarray or various machine learning libraries, so interoperability is a big part of what I'm looking for.

Some other features beyond what's already been mentioned:

  • Scaling / normalization for numeric features
  • Categorical feature encoding (e.g. one-hot, sketched below, or feature hashing)
  • Serialization
  • Partitioning (with or without stratified sampling) and cross-validation support
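[Editor's note: a minimal sketch of one of the listed preprocessing steps, one-hot encoding, over a plain vector of category indices. The function name and representation are invented; a real crate would integrate this with columns and handle unseen categories.]

fn one_hot(values: &[usize], n_categories: usize) -> Vec<Vec<f64>> {
    values
        .iter()
        .map(|&v| {
            // One row per value, all zeros except the value's own slot.
            let mut row = vec![0.0; n_categories];
            row[v] = 1.0;
            row
        })
        .collect()
}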

@LukeMathWalker

I think that things like scaling, normalization, and feature encoding do not necessarily belong in the (core) dataframe library, @jblondin. I see them more in a Scikit-learn-ish port that uses the dataframe as a first-class input type.
What do you think?

@jblondin

> I think that things like scaling, normalization, and feature encoding do not necessarily belong in the (core) dataframe library, @jblondin. I see them more in a Scikit-learn-ish port that uses the dataframe as a first-class input type.
> What do you think?

Good point. While I don't think we should be beholden to trying to mimic the Python way of doing things, in the interest of the 'prefer small crates' Rust philosophy, most of my points should be in a separate ML-focused preprocessing crate. Similarly, we might want to put the time-series-specific features @galuhsahid mentions in a separate crate as well.

Thank you for bringing that up! This thread may be useful for defining some crate boundaries as well as needed use cases.

@jblondin

> • expressive manipulation (named axes);

@LukeMathWalker I'm not sure I understand exactly what this means. What would this entail?

@LukeMathWalker

LukeMathWalker commented Apr 21, 2019

I have definitely been too concise there @jblondin, my fault.
I meant

> compile-time checks on common manipulations (e.g. access to columns by index name), steering as far away as possible from a "stringy" API.

quoting myself from #1.

@jesskfullwood

jesskfullwood commented May 1, 2019

> compile-time checks on common manipulations (e.g. access to columns by index name), steering as far away as possible from a "stringy" API.

This is something I really struggled with. It would be lovely to write df["age"].mean() and have it fail to compile if "age" is not a valid column label, but there is no way to do this in Rust. The closest I got was to define a trait:

pub trait ColId: Copy {
    const NAME: &'static str;
    type Output;
}

Then use a macro

define_col!(Age, u16, "age");

which expands to something like

#[derive(Clone, Copy)]
struct Age;

impl ColId for Age {
    type Output = u16;
    const NAME: &'static str = "age";
}

Then you can write df[Age].mean(), which works but is pretty ugly and unintuitive.
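[Editor's note: for concreteness, a self-contained sketch of how that marker-type indexing can be wired up. Frame and the Mean trait are hypothetical names invented for this illustration; a real frame would hold many columns. The ColId/Age definitions above are repeated so the block stands alone.]

use std::marker::PhantomData;
use std::ops::Index;

pub trait ColId: Copy {
    const NAME: &'static str;
    type Output;
}

#[derive(Clone, Copy)]
struct Age;

impl ColId for Age {
    type Output = u16;
    const NAME: &'static str = "age";
}

// Hypothetical single-column frame, just to show the indexing mechanics.
struct Frame<C: ColId> {
    values: Vec<C::Output>,
    _marker: PhantomData<C>,
}

impl<C: ColId> Index<C> for Frame<C> {
    type Output = [C::Output];
    fn index(&self, _: C) -> &[C::Output] {
        &self.values
    }
}

// Extension trait so `df[Age].mean()` only compiles for columns whose
// element type has a Mean impl.
trait Mean {
    fn mean(&self) -> f64;
}

impl Mean for [u16] {
    fn mean(&self) -> f64 {
        self.iter().map(|&v| f64::from(v)).sum::<f64>() / self.len() as f64
    }
}

fn main() {
    let df = Frame::<Age> { values: vec![16, 32, 48], _marker: PhantomData };
    assert_eq!(df[Age].mean(), 32.0); // indexing with a wrong label is a type error
}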

frameless has a "symbol" syntax which allows a much cleaner interface, like this (quoting the docs):

case class Apartment(city: String, surface: Int, price: Double, bedrooms: Int)
val apartments = Seq(
  Apartment("Paris", 50,  300000.0, 2),
  Apartment("Nice",  74,  325000.0, 3)
)
val apts = TypedDataset.create(apartments)
apts.select(apts('city)).show() // select city column with symbol
apts.select(apts('surface) * 10, apts('surface) + 2).show()  // select two columns and manipulate

Time for an RFC? 😄

@jblondin commented May 1, 2019

> Then use a macro
>
> define_col!(Age, u16, "age");

That's basically what the tablespace macro in agnes does:

tablespace![
  table example_table {
    Age: u16 = "age",
  }
]

I'd agree that a cleaner, simpler approach would be preferred, but I'm not exactly sure how to go about doing that 😄

@nevi-me (Author) commented May 1, 2019

Hi @jesskfullwood, another solution that could work is to lazily evaluate your table/dataframe, though you might not get something as ergonomic as df["age"].mean().

If you had a Column which has a type, you could:

pub struct Column {
    data: ArrayRef,
    data_type: DataType, // where this is an enum of different types
}

pub trait AggregationFn {
    fn mean(&self) -> Result<f64>;
    fn sum(&self) -> Result<f64>; // of course this can be any output result
}

impl AggregationFn for Column {
    fn mean(&self) -> Result<f64> {
        if self.data_type.is_numeric() {
            Ok(self.data.mean()) // assuming this is implemented somewhere as a kernel
        } else {
            Err(MyError("cannot calculate mean of non-numeric column"))
        }
    }

    fn sum(&self) -> Result<f64> {
        if self.data_type.is_numeric() {
            Ok(self.data.sum()) // likewise delegated to a compute kernel
        } else {
            Err(MyError("cannot calculate sum of non-numeric column"))
        }
    }
}

@jesskfullwood

jesskfullwood commented May 2, 2019

@nevi-me This is the way I originally did it, and it is basically the approach that Arrow takes. But it is quite limiting and largely negates the point of using Rust, IMO. The DataFrame doesn't 'know' what it contains, so it cannot statically check whether a given operation (e.g. "fetch this column") is valid. This is the major problem I have with pandas et al.

You are also limited in the types a given Column can contain, because each type must be enumerated in the DataType enum ahead of time. Essentially this limits you to primitive types. I think it would be much nicer to be able to have e.g. enums like

enum Sex { Male, Female, NotStated }

within a Column rather than falling back to

is_male: bool

Re lazy evaluation, I think that is a separate topic. If you had a hypothetical typesafe dataframe, one could imagine building up operations into a type, à la how Futures work, e.g.

join(df1, df2, UserId1, UserId2) // frame1, frame2, join col 1, join col 2

could either directly evaluate the join, resulting in a new Frame<...>, or build up a Join<...> type which could be executed at a later point. The latter is how Frameless works.

One benefit of lazy evaluation is that in theory the query can be optimized similar to a database so that you only execute the parts strictly necessary to generate the result you ask for.
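[Editor's note: a minimal sketch of that deferred style. Frame, Join, and execute are invented names, and the nested-loop join is only for illustration; nothing runs until the plan is executed, which is exactly where an optimizer could step in.]

struct Frame<T>(Vec<T>);

struct Join<L, R> {
    left: Frame<L>,
    right: Frame<R>,
}

impl<L, R> Join<L, R> {
    // Nothing happens at construction time; we only record the plan.
    fn new(left: Frame<L>, right: Frame<R>) -> Self {
        Join { left, right }
    }

    // The work happens only when the plan is executed, so an optimizer
    // could rewrite or fuse operations before this point.
    fn execute(self, key: impl Fn(&L, &R) -> bool) -> Vec<(L, R)>
    where
        L: Clone,
        R: Clone,
    {
        let mut out = Vec::new();
        for l in &self.left.0 {
            for r in &self.right.0 {
                if key(l, r) {
                    out.push((l.clone(), r.clone()));
                }
            }
        }
        out
    }
}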

ETA: I should mention, the optimization layer has conveniently already been written for us: Weld.

@jblondin commented May 2, 2019

@jesskfullwood I do think it's possible to create a typesafe dataframe wrapper around Arrow (I'm currently working on it).

Adding custom data types (e.g. enums) might be a bit more difficult -- I'm not currently sure how to handle types outside of Arrow's (or at least the Rust Arrow implementation's) primitive data types. I think it should be theoretically possible, though, with Arrow's union, list, and struct frameworks.

As a more general use case question, what are our needs, datatype-wise, beyond the typical primitives / strings? @jesskfullwood brings up an interesting use case with enums (or really any arbitrary type), but we'd have to figure out how that would work with our interoperability goals.

@LukeMathWalker commented May 2, 2019

I strongly agree with @jesskfullwood - having a list/enum of acceptable/primitive types feels like an anti-pattern to me.
We should be able to handle arbitrary Rust types. The question becomes: how can we make this play nicely with Apache Arrow?

A possible solution would be to use a trait where a Rust struct/enum provides methods that convert it to a memory layout built from Apache Arrow primitives. It basically tells us how to lay the type out in memory using the capabilities offered by Apache Arrow.
This might be a little tiresome to do at first, but we could probably get to the point where we can automate it for most types using a #[derive(ArrowCompatible)] macro.
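[Editor's note: a minimal sketch of what such a trait might look like, reusing the Sex enum from above. ArrowCompatible and its methods are hypothetical, not an existing Arrow API, and the u8 representation stands in for a dictionary-encoded Arrow column.]

// Hypothetical trait: how to lay a Rust type down using Arrow primitives.
trait ArrowCompatible {
    type ArrowRepr; // the Arrow-primitive layout this type maps onto
    fn to_arrow(&self) -> Self::ArrowRepr;
    fn from_arrow(repr: Self::ArrowRepr) -> Self;
}

enum Sex { Male, Female, NotStated }

// An enum can be stored as a small integer (dictionary-encoded) column.
impl ArrowCompatible for Sex {
    type ArrowRepr = u8;

    fn to_arrow(&self) -> u8 {
        match self {
            Sex::Male => 0,
            Sex::Female => 1,
            Sex::NotStated => 2,
        }
    }

    fn from_arrow(repr: u8) -> Self {
        match repr {
            0 => Sex::Male,
            1 => Sex::Female,
            _ => Sex::NotStated,
        }
    }
}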

@LukeMathWalker

Btw, I didn't know about frameless - super cool! Thanks @jesskfullwood 😄

@jblondin mentioned this issue May 23, 2019
@nevi-me closed this as not planned Jun 8, 2023