Skip to content

Update more parquet APIs to use u64 #57

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 3 commits into
base: kyle/change-parquet-u64
Choose a base branch
from

Conversation

alamb
Copy link

@alamb alamb commented Apr 7, 2025

Note: -Targets apache#7371

This gives some idea of what other APIs would need to be changed to fully and properly support > 4GB files in WASM (32 bit usize)

I may be mistaken, but I wanted to get this PR up for discussion

@@ -431,7 +431,7 @@ impl ChunkReader for ArrowColumnChunkData {
))
}

fn get_bytes(&self, _start: u64, _length: usize) -> Result<Bytes> {
fn get_bytes(&self, _start: u64, _length: u64) -> Result<Bytes> {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This length would only have to change if we expected to be fetching more than 4GB in a single request, no?

@@ -83,7 +83,7 @@ pub trait MetadataSuffixFetch: MetadataFetch {
///
/// Note the returned type is a boxed future, often created by
/// [FutureExt::boxed]. See the trait documentation for an example
fn fetch_suffix(&mut self, suffix: usize) -> BoxFuture<'_, Result<Bytes>>;
fn fetch_suffix(&mut self, suffix: u64) -> BoxFuture<'_, Result<Bytes>>;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, we'll never be fetching more than 4GB in a suffix request for metadata

@@ -93,7 +93,7 @@ pub struct MetadataLoader<F> {
/// The in-progress metadata
metadata: ParquetMetaData,
/// The offset and bytes of remaining unparsed data
remainder: Option<(usize, Bytes)>,
remainder: Option<(u64, Bytes)>,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an offset into the cached remainder, which is a buffer of the prefetch_size I believe. Assuming that we'll never be prefetching >4GB of data for the metadata, this shouldn't need to change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you are right -- we don't need to change this.

Comment on lines +1106 to +1108
pub(crate) fn column_index_range(&self) -> Option<Range<u64>> {
let offset = u64::try_from(self.column_index_offset?).ok()?;
let length = u64::try_from(self.column_index_length?).ok()?;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these might be valid cases that we'd need to change

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree

@@ -138,7 +138,7 @@ impl AsyncFileReader for Box<dyn AsyncFileReader + '_> {
}

impl<T: AsyncFileReader + MetadataFetch + AsyncRead + AsyncSeek + Unpin> MetadataSuffixFetch for T {
fn fetch_suffix(&mut self, suffix: usize) -> BoxFuture<'_, Result<Bytes>> {
fn fetch_suffix(&mut self, suffix: u64) -> BoxFuture<'_, Result<Bytes>> {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would also have to change the suffix length as well as all other places that use usize as an offset in a file.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we necessarily would need to change it in the signature, but we would need to ensure that we cast the suffix to u64 internally rather than casting the length to usize.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suffix is the length of the returned Bytes, so I'd imagine this should remain a usize.

// Size of the serialized thrift metadata plus the 8 byte footer. Only set if
// `self.parse_metadata` is called.
metadata_size: Option<usize>,
prefetch_hint: Option<u64>,
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As these are lengths, they probably should be u64 as well (though I realize it is unlikely people will be "pre" fetching GBs of data

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think usize is appropriate here...one wouldn't be able to fetch more than 4GB on a 32-bit machine anyway. Also, there are only 4 bytes available in the footer for the metadata length, so metadata_size could probably be change to a u32 rather than u64.

@@ -397,7 +397,7 @@ impl ParquetMetaDataReader {
pub async fn load_and_finish<F: MetadataFetch>(
mut self,
fetch: F,
file_size: usize,
file_size: u64,
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these various load APIs also need to be updated to use a u64 file length to support larger files in parquet

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

@@ -95,11 +94,13 @@ impl ChunkReader for File {
Ok(BufReader::new(self.try_clone()?))
}

fn get_bytes(&self, start: u64, length: usize) -> Result<Bytes> {
let mut buffer = Vec::with_capacity(length);
fn get_bytes(&self, start: u64, length: u64) -> Result<Bytes> {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think by @tustvold's comment in apache#7371 this one in theory doesn't need to be updated (as the length bytes needs to fit into memory / Bytes anyways)

@@ -84,13 +84,13 @@ pub struct ParquetMetaDataReader {
///
/// This is parsed from the last 8 bytes of the Parquet file
pub struct FooterTail {
metadata_length: usize,
metadata_length: u64,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here...u32 if not left as usize

@@ -46,12 +46,12 @@ pub enum ParquetError {
ArrowError(String),
/// Error when the requested column index is more than the
/// number of columns in the row group
IndexOutOfBound(usize, usize),
IndexOutOfBound(u64, u64),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an index into an array or vector, so this should remain usize IMO

Comment on lines -52 to +54
/// Returned when a function needs more data to complete properly. The `usize` field indicates
/// Returned when a function needs more data to complete properly. The `u64` field indicates
/// the total number of bytes required, not the number of additional bytes.
NeedMoreData(usize),
NeedMoreData(u64),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is thrown when a provided buffer needs to have more data in it. Here too it doesn't make sense to ask for more bytes than a usize can index.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants