Update more parquet APIs to use u64 #57

alamb · 2025-04-07T16:50:57Z

This gives some idea of what other APIs would need to be changed to fully and properly support > 4GB files in WASM (32 bit usize)

I may be mistaken, but I wanted to get this PR up for discussion

kylebarron · 2025-04-07T17:01:40Z

parquet/src/arrow/arrow_writer/mod.rs

@@ -431,7 +431,7 @@ impl ChunkReader for ArrowColumnChunkData {
        ))
    }

-    fn get_bytes(&self, _start: u64, _length: usize) -> Result<Bytes> {
+    fn get_bytes(&self, _start: u64, _length: u64) -> Result<Bytes> {


This length would only have to change if we expected to be fetching more than 4GB in a single request, no?

kylebarron · 2025-04-07T17:02:13Z

parquet/src/arrow/async_reader/metadata.rs

@@ -83,7 +83,7 @@ pub trait MetadataSuffixFetch: MetadataFetch {
    ///
    /// Note the returned type is a boxed future, often created by
    /// [FutureExt::boxed]. See the trait documentation for an example
-    fn fetch_suffix(&mut self, suffix: usize) -> BoxFuture<'_, Result<Bytes>>;
+    fn fetch_suffix(&mut self, suffix: u64) -> BoxFuture<'_, Result<Bytes>>;


Likewise, we'll never be fetching more than 4GB in a suffix request for metadata

kylebarron · 2025-04-07T17:03:33Z

parquet/src/arrow/async_reader/metadata.rs

@@ -93,7 +93,7 @@ pub struct MetadataLoader<F> {
    /// The in-progress metadata
    metadata: ParquetMetaData,
    /// The offset and bytes of remaining unparsed data
-    remainder: Option<(usize, Bytes)>,
+    remainder: Option<(u64, Bytes)>,


This is an offset into the cached remainder, which is a buffer of the prefetch_size I believe. Assuming that we'll never be prefetching >4GB of data for the metadata, this shouldn't need to change.

Yeah, you are right -- we don't need to change this.

kylebarron · 2025-04-07T17:05:15Z

parquet/src/file/metadata/mod.rs

+    pub(crate) fn column_index_range(&self) -> Option<Range<u64>> {
+        let offset = u64::try_from(self.column_index_offset?).ok()?;
+        let length = u64::try_from(self.column_index_length?).ok()?;


I think these might be valid cases that we'd need to change

alamb · 2025-04-07T16:52:46Z

parquet/src/arrow/async_reader/mod.rs

@@ -138,7 +138,7 @@ impl AsyncFileReader for Box<dyn AsyncFileReader + '_> {
 }

 impl<T: AsyncFileReader + MetadataFetch + AsyncRead + AsyncSeek + Unpin> MetadataSuffixFetch for T {
-    fn fetch_suffix(&mut self, suffix: usize) -> BoxFuture<'_, Result<Bytes>> {
+    fn fetch_suffix(&mut self, suffix: u64) -> BoxFuture<'_, Result<Bytes>> {


I think we would also have to change the suffix length as well as all other places that use usize as an offset in a file.

I don't think we necessarily would need to change it in the signature, but we would need to ensure that we cast the suffix to u64 internally rather than casting the length to usize.

suffix is the length of the returned Bytes, so I'd imagine this should remain a usize.

alamb · 2025-04-07T16:53:35Z

parquet/src/file/metadata/reader.rs

-    // Size of the serialized thrift metadata plus the 8 byte footer. Only set if
-    // `self.parse_metadata` is called.
-    metadata_size: Option<usize>,
+    prefetch_hint: Option<u64>,


As these are lengths, they probably should be u64 as well (though I realize it is unlikely people will be "pre" fetching GBs of data

I think usize is appropriate here...one wouldn't be able to fetch more than 4GB on a 32-bit machine anyway. Also, there are only 4 bytes available in the footer for the metadata length, so metadata_size could probably be change to a u32 rather than u64.

alamb · 2025-04-07T16:54:02Z

parquet/src/file/metadata/reader.rs

@@ -397,7 +397,7 @@ impl ParquetMetaDataReader {
    pub async fn load_and_finish<F: MetadataFetch>(
        mut self,
        fetch: F,
-        file_size: usize,
+        file_size: u64,


I think these various load APIs also need to be updated to use a u64 file length to support larger files in parquet

alamb · 2025-04-07T18:04:46Z

parquet/src/file/reader.rs

@@ -95,11 +94,13 @@ impl ChunkReader for File {
        Ok(BufReader::new(self.try_clone()?))
    }

-    fn get_bytes(&self, start: u64, length: usize) -> Result<Bytes> {
-        let mut buffer = Vec::with_capacity(length);
+    fn get_bytes(&self, start: u64, length: u64) -> Result<Bytes> {


I think by @tustvold's comment in apache#7371 this one in theory doesn't need to be updated (as the length bytes needs to fit into memory / Bytes anyways)

etseidl · 2025-04-07T21:16:00Z

parquet/src/file/metadata/reader.rs

@@ -84,13 +84,13 @@ pub struct ParquetMetaDataReader {
 ///
 /// This is parsed from the last 8 bytes of the Parquet file
 pub struct FooterTail {
-    metadata_length: usize,
+    metadata_length: u64,


Same here...u32 if not left as usize

etseidl · 2025-04-07T23:38:05Z

parquet/src/errors.rs

@@ -46,12 +46,12 @@ pub enum ParquetError {
    ArrowError(String),
    /// Error when the requested column index is more than the
    /// number of columns in the row group
-    IndexOutOfBound(usize, usize),
+    IndexOutOfBound(u64, u64),


This is an index into an array or vector, so this should remain usize IMO

etseidl · 2025-04-07T23:39:35Z

parquet/src/errors.rs

-    /// Returned when a function needs more data to complete properly. The `usize` field indicates
+    /// Returned when a function needs more data to complete properly. The `u64` field indicates
    /// the total number of bytes required, not the number of additional bytes.
-    NeedMoreData(usize),
+    NeedMoreData(u64),


This is thrown when a provided buffer needs to have more data in it. Here too it doesn't make sense to ask for more bytes than a usize can index.

github-actions bot added the parquet label Apr 7, 2025

Update more parquet APIs to use u64

90e71af

alamb force-pushed the alamb/more_u64 branch from 62c5c72 to 90e71af Compare April 7, 2025 16:51

alamb mentioned this pull request Apr 7, 2025

Change Parquet API interaction to use u64 (support files larger than 4GB in WASM) apache/arrow-rs#7371

Merged

alamb added 2 commits April 7, 2025 12:59

hack more

d12a798

update

094eb7a

kylebarron reviewed Apr 7, 2025

View reviewed changes

alamb commented Apr 7, 2025

View reviewed changes

etseidl reviewed Apr 7, 2025

View reviewed changes

Update more parquet APIs to use u64 #57

Are you sure you want to change the base?

Update more parquet APIs to use u64 #57

Uh oh!

Conversation

alamb commented Apr 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

alamb commented Apr 7, 2025 •

edited

Loading