-
Notifications
You must be signed in to change notification settings - Fork 1
Update more parquet APIs to use u64 #57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: kyle/change-parquet-u64
Are you sure you want to change the base?
Conversation
@@ -431,7 +431,7 @@ impl ChunkReader for ArrowColumnChunkData { | |||
)) | |||
} | |||
|
|||
fn get_bytes(&self, _start: u64, _length: usize) -> Result<Bytes> { | |||
fn get_bytes(&self, _start: u64, _length: u64) -> Result<Bytes> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This length
would only have to change if we expected to be fetching more than 4GB in a single request, no?
@@ -83,7 +83,7 @@ pub trait MetadataSuffixFetch: MetadataFetch { | |||
/// | |||
/// Note the returned type is a boxed future, often created by | |||
/// [FutureExt::boxed]. See the trait documentation for an example | |||
fn fetch_suffix(&mut self, suffix: usize) -> BoxFuture<'_, Result<Bytes>>; | |||
fn fetch_suffix(&mut self, suffix: u64) -> BoxFuture<'_, Result<Bytes>>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise, we'll never be fetching more than 4GB in a suffix request for metadata
@@ -93,7 +93,7 @@ pub struct MetadataLoader<F> { | |||
/// The in-progress metadata | |||
metadata: ParquetMetaData, | |||
/// The offset and bytes of remaining unparsed data | |||
remainder: Option<(usize, Bytes)>, | |||
remainder: Option<(u64, Bytes)>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an offset into the cached remainder, which is a buffer of the prefetch_size
I believe. Assuming that we'll never be prefetching >4GB of data for the metadata, this shouldn't need to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you are right -- we don't need to change this.
pub(crate) fn column_index_range(&self) -> Option<Range<u64>> { | ||
let offset = u64::try_from(self.column_index_offset?).ok()?; | ||
let length = u64::try_from(self.column_index_length?).ok()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these might be valid cases that we'd need to change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree
@@ -138,7 +138,7 @@ impl AsyncFileReader for Box<dyn AsyncFileReader + '_> { | |||
} | |||
|
|||
impl<T: AsyncFileReader + MetadataFetch + AsyncRead + AsyncSeek + Unpin> MetadataSuffixFetch for T { | |||
fn fetch_suffix(&mut self, suffix: usize) -> BoxFuture<'_, Result<Bytes>> { | |||
fn fetch_suffix(&mut self, suffix: u64) -> BoxFuture<'_, Result<Bytes>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we would also have to change the suffix length as well as all other places that use usize
as an offset in a file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we necessarily would need to change it in the signature, but we would need to ensure that we cast the suffix to u64
internally rather than casting the length to usize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suffix
is the length of the returned Bytes
, so I'd imagine this should remain a usize
.
// Size of the serialized thrift metadata plus the 8 byte footer. Only set if | ||
// `self.parse_metadata` is called. | ||
metadata_size: Option<usize>, | ||
prefetch_hint: Option<u64>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As these are lengths, they probably should be u64 as well (though I realize it is unlikely people will be "pre" fetching GBs of data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think usize is appropriate here...one wouldn't be able to fetch more than 4GB on a 32-bit machine anyway. Also, there are only 4 bytes available in the footer for the metadata length, so metadata_size could probably be change to a u32 rather than u64.
@@ -397,7 +397,7 @@ impl ParquetMetaDataReader { | |||
pub async fn load_and_finish<F: MetadataFetch>( | |||
mut self, | |||
fetch: F, | |||
file_size: usize, | |||
file_size: u64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these various load APIs also need to be updated to use a u64 file length to support larger files in parquet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
@@ -95,11 +94,13 @@ impl ChunkReader for File { | |||
Ok(BufReader::new(self.try_clone()?)) | |||
} | |||
|
|||
fn get_bytes(&self, start: u64, length: usize) -> Result<Bytes> { | |||
let mut buffer = Vec::with_capacity(length); | |||
fn get_bytes(&self, start: u64, length: u64) -> Result<Bytes> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think by @tustvold's comment in apache#7371 this one in theory doesn't need to be updated (as the length
bytes needs to fit into memory / Bytes anyways)
@@ -84,13 +84,13 @@ pub struct ParquetMetaDataReader { | |||
/// | |||
/// This is parsed from the last 8 bytes of the Parquet file | |||
pub struct FooterTail { | |||
metadata_length: usize, | |||
metadata_length: u64, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here...u32
if not left as usize
@@ -46,12 +46,12 @@ pub enum ParquetError { | |||
ArrowError(String), | |||
/// Error when the requested column index is more than the | |||
/// number of columns in the row group | |||
IndexOutOfBound(usize, usize), | |||
IndexOutOfBound(u64, u64), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an index into an array or vector, so this should remain usize
IMO
/// Returned when a function needs more data to complete properly. The `usize` field indicates | ||
/// Returned when a function needs more data to complete properly. The `u64` field indicates | ||
/// the total number of bytes required, not the number of additional bytes. | ||
NeedMoreData(usize), | ||
NeedMoreData(u64), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is thrown when a provided buffer needs to have more data in it. Here too it doesn't make sense to ask for more bytes than a usize
can index.
Note: -Targets apache#7371
This gives some idea of what other APIs would need to be changed to fully and properly support > 4GB files in WASM (32 bit
usize
)I may be mistaken, but I wanted to get this PR up for discussion