Use-cases #3
My own use case is definitely on the ETL side, so the things that are important to me are:
I'm sure there are some I'm missing, but these come to mind first.

---
My main use case concerns ML workloads:
---
Hi, hope y'all don't mind me chiming in - I'm very interested in dataframe libraries for Rust, and agreed, I think Rust could make a great ETL tool! I think my main use cases have been covered by the previous comments. Some other ones:

---
I'm looking at ML / data science use cases as well. Basically, I want a library that can ETL some data and feed it into an ML workflow. Some other features beyond what's already been mentioned:

---
I think that things like scaling, normalization, and feature encoding do not necessarily belong in the (core) dataframe library @jblondin. I see them more in a Scikit-learn-ish port that uses the dataframe as a first-class input type.

---
Good point. While I don't think we should be beholden to mimicking the Python way of doing things, in the interest of the "prefer small crates" Rust philosophy, most of my points should live in a separate ML-focused preprocessing crate. Similarly, we might want to put the time-series-specific features @galuhsahid mentions in a separate crate as well.

Thank you for bringing that up! This thread may be useful for defining some crate boundaries as well as needed use cases.

---
@LukeMathWalker I'm not sure I understand exactly what this means. What would this entail?

---
This is something I really struggled with. It would be lovely to do:

```rust
pub trait ColId: Copy {
    const NAME: &'static str;
    type Output;
}
```

Then use a macro `define_col!(Age, u16, "age")` which expands to something like:

```rust
#[derive(Clone, Copy)]
struct Age;

impl ColId for Age {
    type Output = u16;
    const NAME: &'static str = "age";
}
```

Then do something like:
```scala
case class Apartment(city: String, surface: Int, price: Double, bedrooms: Int)

val apartments = Seq(
  Apartment("Paris", 50, 300000.0, 2),
  Apartment("Nice", 74, 325000.0, 3)
)

val apts = TypedDataset.create(apartments)
apts.select(apts('city)).show()                             // select city column with symbol
apts.select(apts('surface) * 10, apts('surface) + 2).show() // select two columns and manipulate
```

Time for an RFC? 😄

---
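Not from the thread, but a toy sketch of how such a `ColId` trait could give type-safe column access. The `Frame` type and its `insert`/`get` methods are invented for illustration, using type erasure over `std::any::Any` internally:

```rust
use std::any::Any;
use std::collections::HashMap;

trait ColId: Copy {
    const NAME: &'static str;
    // `'static` bound so the column data can be boxed as `dyn Any`.
    type Output: 'static;
}

#[derive(Clone, Copy)]
struct Age;

impl ColId for Age {
    const NAME: &'static str = "age";
    type Output = u16;
}

// A toy frame: column name -> type-erased Vec of that column's Output type.
struct Frame {
    cols: HashMap<&'static str, Box<dyn Any>>,
}

impl Frame {
    fn new() -> Self {
        Frame { cols: HashMap::new() }
    }

    fn insert<C: ColId>(&mut self, _col: C, data: Vec<C::Output>) {
        self.cols.insert(C::NAME, Box::new(data));
    }

    // The column marker type statically determines the element type returned.
    fn get<C: ColId>(&self, _col: C) -> Option<&Vec<C::Output>> {
        self.cols.get(C::NAME)?.downcast_ref::<Vec<C::Output>>()
    }
}

fn main() {
    let mut frame = Frame::new();
    frame.insert(Age, vec![30u16, 42]);
    // `get(Age)` yields `Option<&Vec<u16>>` -- no casting at the call site.
    assert_eq!(frame.get(Age), Some(&vec![30u16, 42]));
}
```

Because `get` is parameterized by the column marker, asking for `Age` statically yields a `Vec<u16>`, which is the ergonomic property the frameless example above has in Scala.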
That's basically what the `tablespace!` macro does:

```rust
tablespace![
    table example_table {
        Age: u16 = "age",
    }
];
```

I'd agree that a cleaner, simpler approach would be preferred, but I'm not exactly sure how to go about doing that 😄

---
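For reference, a hedged sketch of what a `define_col!`-style declarative macro could look like; this is illustrative only, not the actual crate's macro:

```rust
trait ColId: Copy {
    const NAME: &'static str;
    type Output;
}

// Expands `define_col!(Age, u16, "age")` into the unit struct plus
// `ColId` impl described in the earlier comment.
macro_rules! define_col {
    ($id:ident, $ty:ty, $name:expr) => {
        #[derive(Clone, Copy)]
        struct $id;

        impl ColId for $id {
            const NAME: &'static str = $name;
            type Output = $ty;
        }
    };
}

define_col!(Age, u16, "age");

fn main() {
    // The trait must be in scope to use the associated const.
    assert_eq!(Age::NAME, "age");
}
```

A `tablespace!`-style macro would presumably loop this expansion over each `Ident: Type = "name"` entry in the table body.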
Hi @jesskfullwood, another solution that could work is if you lazily evaluate your table/dataframe, though you might not get something as ergonomic. If you had a column like:

```rust
pub struct Column {
    data: ArrayRef,
    data_type: DataType, // where this is an enum of different types
}
```

you could check the type at runtime:

```rust
pub trait AggregationFn {
    fn mean(&self) -> Result<f64>;
    fn sum(&self) -> Result<f64>; // of course this can be any output result
}

impl AggregationFn for Column {
    fn mean(&self) -> Result<f64> {
        if self.data_type.is_numeric() {
            Ok(self.data.mean()) // assuming this is implemented somewhere as a kernel
        } else {
            Err(MyError("cannot calculate mean of non-numeric column"))
        }
    }
}
```

---
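To make the idea above concrete without depending on Arrow, here is a self-contained sketch where a plain Rust enum stands in for `ArrayRef` plus `DataType` (all names invented):

```rust
// The enum carries both the runtime type tag and the data itself.
#[derive(Debug)]
enum ColumnData {
    Int64(Vec<i64>),
    Utf8(Vec<String>),
}

struct Column {
    data: ColumnData,
}

impl Column {
    // `mean` succeeds for numeric columns and errors otherwise,
    // mirroring the runtime type check in the comment above.
    fn mean(&self) -> Result<f64, String> {
        match &self.data {
            ColumnData::Int64(v) if !v.is_empty() => {
                Ok(v.iter().sum::<i64>() as f64 / v.len() as f64)
            }
            ColumnData::Int64(_) => Err("empty column".to_string()),
            ColumnData::Utf8(_) => {
                Err("cannot calculate mean of non-numeric column".to_string())
            }
        }
    }
}

fn main() {
    let nums = Column { data: ColumnData::Int64(vec![1, 2, 3]) };
    assert_eq!(nums.mean(), Ok(2.0));

    let strs = Column { data: ColumnData::Utf8(vec!["a".to_string()]) };
    assert!(strs.mean().is_err());
}
```

The trade-off relative to the typed `ColId` approach is visible here: type mismatches surface as runtime `Err`s rather than compile errors.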
@nevi-me This is the way I originally did it, and it is basically the approach the dynamically-typed implementations take. You are also limited in the types a given value within a column can hold.

Re lazy evaluation, I think that is a separate topic. If you had a hypothetical typesafe dataframe, one could imagine building up operations into a type, where a join could either be evaluated directly, resulting in a new frame, or deferred. One benefit of lazy evaluation is that in theory the query can be optimized, similar to a database, so that you only execute the parts strictly necessary to generate the result you ask for.

ETA: I should mention, the optimization layer has conveniently already been written for us: Weld.

---
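As a hedged illustration of the lazy-evaluation idea (all names invented; a real engine would optimize the plan before running it, which is where something like Weld would slot in):

```rust
// Operations accumulate into a plan and only run on `collect`.
#[derive(Debug, Clone)]
enum Op {
    FilterGt(f64),
    MulBy(f64),
}

struct LazyCol {
    data: Vec<f64>,
    plan: Vec<Op>,
}

impl LazyCol {
    fn new(data: Vec<f64>) -> Self {
        LazyCol { data, plan: Vec::new() }
    }

    // Builder methods record work instead of doing it.
    fn filter_gt(mut self, x: f64) -> Self {
        self.plan.push(Op::FilterGt(x));
        self
    }

    fn mul_by(mut self, x: f64) -> Self {
        self.plan.push(Op::MulBy(x));
        self
    }

    // Evaluation walks the recorded plan in order; an optimizer could
    // rewrite `self.plan` here before executing it.
    fn collect(self) -> Vec<f64> {
        self.plan.into_iter().fold(self.data, |d, op| match op {
            Op::FilterGt(x) => d.into_iter().filter(|v| *v > x).collect(),
            Op::MulBy(x) => d.into_iter().map(|v| v * x).collect(),
        })
    }
}

fn main() {
    let out = LazyCol::new(vec![1.0, 5.0, 10.0])
        .filter_gt(2.0)
        .mul_by(2.0)
        .collect();
    println!("{:?}", out); // [10.0, 20.0]
}
```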
@jesskfullwood I do think it's possible to create a typesafe dataframe wrapper around Arrow (I'm currently working on it). Adding custom data types (e.g. enums) might be a bit more difficult -- I'm not currently sure how to handle types outside of Arrow's (or at least the Rust Arrow implementation's) primitive data types. I think it should be theoretically possible, though, with Arrow's union, list, and struct frameworks.

As a more general use-case question: what are our needs, datatype-wise, beyond the typical primitives / strings? @jesskfullwood brings up an interesting use case with enums (or really any arbitrary type), but we'd have to figure out how that would work with our interoperability goals.

---
I strongly agree with @jesskfullwood - having a list/enum of acceptable/primitive types feels like an anti-pattern to me. A possible solution would be to use a trait, where a Rust struct/enum provides methods that convert it to a memory layout that uses Apache Arrow primitives. It basically tells us how to lay it down in memory using the capabilities offered by Apache Arrow.
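A minimal sketch of that trait idea, with invented names (`ArrowPrimitive`, `IntoArrowLayout`) standing in for whatever the real Arrow-backed types would be:

```rust
// Stand-in for an Arrow-representable primitive buffer.
#[derive(Debug, PartialEq)]
enum ArrowPrimitive {
    Int32(Vec<i32>),
    Utf8(Vec<String>),
}

// A user-defined type describes how to lower itself into a layout
// built from Arrow-representable primitives.
trait IntoArrowLayout: Sized {
    fn into_layout(values: Vec<Self>) -> ArrowPrimitive;
}

// An enum lowered to a Utf8 column (a real implementation might
// prefer a dictionary/categorical encoding instead).
#[derive(Debug)]
enum Color {
    Red,
    Green,
}

impl IntoArrowLayout for Color {
    fn into_layout(values: Vec<Self>) -> ArrowPrimitive {
        ArrowPrimitive::Utf8(values.iter().map(|c| format!("{:?}", c)).collect())
    }
}

fn main() {
    let layout = Color::into_layout(vec![Color::Red, Color::Green]);
    assert_eq!(
        layout,
        ArrowPrimitive::Utf8(vec!["Red".to_string(), "Green".to_string()])
    );
}
```

The core library would then only need to understand `ArrowPrimitive`-style layouts, while user crates opt their own types in via the trait.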
Btw, I didn't know about that.

---
One of the reasons why I'm interested in dataframe libraries for Rust is that Rust could make for a good ETL tool.
What other use-cases do people have?