
API for encoding/decoding ParquetMetadata with more control #6002

Closed

alamb opened this issue Jul 4, 2024 · 24 comments

Labels: enhancement (Any new improvement worthy of an entry in the changelog) · parquet (Changes to the parquet crate)

alamb commented Jul 4, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
There are several cases where we would like to have more control over the encoding/decoding of Parquet metadata:

  1. serialize and deserialize it outside of a parquet file so I can store it in a cache outside of parquet files and avoid slow object store requests (Page indexes in `decode_metadata` and `encode_metadata` #5988)
  2. Selective decoding of a subset (e.g. columns or row groups) of parquet metadata #5855
  3. Way to share SchemaDescriptorPtr across ParquetMetadata objects #5999
  4. Ensuring that the page index structures are read properly requires setting some non-obvious options on the reader, such as ArrowReaderOptions::with_page_index (see the sketch just below)
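As an illustration of item 4, here is a minimal sketch of what is needed today to get the page indexes populated when opening a file with the sync Arrow reader (ArrowReaderOptions::with_page_index and ParquetRecordBatchReaderBuilder are existing parquet crate APIs; the file name is made up):

use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
use std::fs::File;

let file = File::open("data.parquet")?;
// Without this non-obvious option the reader leaves
// ParquetMetaData::column_index / offset_index unpopulated
let options = ArrowReaderOptions::new().with_page_index(true);
let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?;
let metadata = builder.metadata(); // &Arc<ParquetMetaData>, with page indexes if present in the file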

At the time of writing, the current APIs are limited:

  1. There is no API exposed for creating / writing parquet metadata
  2. The API exposed for reading parquet metadata, decode_metadata, offers no finer-grained control

Describe the solution you'd like
I would like an API that allows more fine-grained control over reading/writing metadata, and that permits adding additional features over time in a backwards-compatible way.

Describe alternatives you've considered

Here is one potential idea -- to create Encoder / Decoder structs that can encode and decode the metadata along with various configuration options.

Ideally this struct would be integrated into the rest of the crate, e.g. used in SerializedFileWriter?

let encoder = ParquetMetadataEncoder::new()
    .with_some_options(foo);
let mut buffer = vec![];
encoder.encode(metadata, &mut buffer);

Similarly for decoding

let decoder = ParquetMetadataDecoder::new()
   .with_offset_index(true)
   .with_column_index(true);

let result = decoder.decode(&buffer);
// the decoder needs some way to communicate
// that it doesn't have sufficient information (e.g. the PageIndex
// wasn't present). Maybe this should just be an error?
match result {
  FullDecode(metadata) => return metadata,
  NeedMoreData(range) => {
    // fetch additional data and pass to the decoder?
    todo!()
  }
  // ...
};

Additional context
This ticket is based on the discussion with @adriangb here #5988

There are a bunch of discussions on metadata speed here #5770

Here is a PR with a proposed 'encode_metadata' function: #6000

@alamb added the parquet and enhancement labels on Jul 4, 2024
tustvold commented Jul 4, 2024

This sounds quite a lot like https://docs.rs/parquet/latest/parquet/arrow/async_reader/struct.MetadataLoader.html ?

@alamb changed the title from "API for encoding/deocding ParquetMetadata with more control" to "API for encoding/decoding ParquetMetadata with more control" on Jul 5, 2024
alamb commented Jul 5, 2024

> This sounds quite a lot like https://docs.rs/parquet/latest/parquet/arrow/async_reader/struct.MetadataLoader.html ?

That is quite similar -- thank you. Some differences might be offering a normal (non-async) API, as well as an equivalent encoder.

adriangb commented Jul 6, 2024

Yes, I think the asyncness would be an important difference. Also that the existing APIs kind of want to load from an entire file. I suppose you could give it a "file" with just the footer and tell it to load just that range... but it feels a bit forced? Same with the asyncness. For my use case I could do some pointless async work (as in, make an async file-like thing that just points to a Vec<u8>), but in general unnecessary async work is not ideal. My general experience is that it's nice to decouple IO from encoding / decoding logic.

alamb commented Jul 8, 2024

> My general experience is that it's nice to decouple IO from encoding / decoding logic.

Yes, I agree this would be ideal -- having two things:

  1. Something that handles the actual encode/decode of bytes
  2. Something that reads data from a remote source + decodes them (what MetadataLoader seems to do)

adriangb commented Jul 9, 2024

I took a crack at using MetadataLoader since I happen to have all of the parquet file bytes in memory when writing (although this is not necessarily the case if you're streaming them somewhere).

My approach was to manually grab the footer based on the footer size declared in the penultimate 4 bytes of the file and save that. But the metadata size declared in the footer seems to not include the Page Index, and I'm not sure how I'd calculate the start location of the Page Index (and other stuff like bloom filters).
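For reference, the layout being described: the last 8 bytes of a parquet file are a 4-byte little-endian FileMetadata length followed by the magic PAR1, and that declared length does not cover the page indexes or bloom filters. A minimal sketch of the arithmetic (footer_metadata_range is a hypothetical helper, not an existing API):

use std::ops::Range;

// Given the file size and its final 8 bytes, locate the thrift-encoded
// FileMetadata. Page indexes and bloom filters sit *before* this range
// and are not accounted for by the declared length.
fn footer_metadata_range(file_size: usize, last_8: &[u8; 8]) -> Option<Range<usize>> {
    if &last_8[4..] != b"PAR1" {
        return None; // not a parquet footer
    }
    let md_len = u32::from_le_bytes(last_8[..4].try_into().unwrap()) as usize;
    file_size.checked_sub(8 + md_len).map(|start| start..file_size - 8)
}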

My implementation looks somewhat like:

Code
#[derive(Debug, Clone)]
struct AsyncBytes {
    file_size: usize,
    inner: Bytes,
}

impl AsyncBytes {
    fn new(file_size: usize, inner: Bytes) -> Self {
        Self {
            file_size,
            inner,
        }
    }
}

impl MetadataFetch for AsyncBytes {
    fn fetch(&mut self, range: Range<usize>) -> BoxFuture<'_, ParquetResult<Bytes>> {
        // check that the range is within the metadata section
        let available_range = self.file_size - self.inner.len()..self.file_size;
        if !(available_range.start <= range.start && available_range.end >= range.end) {
            return async move {
                let err = format!("Attempted to fetch data from outside metadata section: range={:?}, available_range={:?}", range, available_range);
                Err(parquet::errors::ParquetError::General(err))
            }
            .boxed();
        }
        // adjust the range to be within the data section
        let range = range.start - available_range.start..range.end - available_range.start;
        let data = self.inner.slice(range.start..range.end);
        async move { Ok(data) }.boxed()
    }
}


/// Load parquet metadata, including the page index, from bytes.
/// This assumes the entire metadata (and no more) is in the provided bytes.
/// Although this method is async, no IO is performed.
pub async fn load_metadata(file_size: usize, serialized_parquet_metadata: Bytes) -> ParquetResult<Arc<ParquetMetaData>> {
    let loaded_metadata = decode_metadata(&serialized_parquet_metadata)?;
    let reader = AsyncBytes::new(file_size, serialized_parquet_metadata);
    let mut metadata = MetadataLoader::new(reader, loaded_metadata);
    metadata.load_page_index(true, true).await?;
    Ok(Arc::new(metadata.finish()))
}

Not sure what the right APIs would be for this sort of use case, or in general, but it seems like MetadataLoader can't really be used here without some changes.

adriangb commented Jul 10, 2024

I got my thing working, but it seems quite brittle. TLDR is that I'm just tracking what bytes DataFusion reads and then slicing to those. Which seems like it could be quite inefficient and might break if DataFusion changes internal details.

Code
#[derive(Debug, Clone)]
struct AsyncBytes {
    file_size: usize,
    data_suffix: Bytes,
    min_offset: usize,
    max_offset: usize,
}

impl AsyncBytes {
    fn new(file_size: usize, data_suffix: Bytes) -> Self {
        Self {
            file_size,
            data_suffix,
            min_offset: file_size,
            max_offset: file_size,
        }
    }

    fn fetched_range(&self) -> Range<usize> {
        self.min_offset..self.max_offset
    }
}

impl MetadataFetch for &mut AsyncBytes {
    fn fetch(&mut self, range: Range<usize>) -> BoxFuture<'_, ParquetResult<Bytes>> {
        self.min_offset = self.min_offset.min(range.start);
        self.max_offset = self.max_offset.max(range.end);
        let available_range = self.file_size - self.data_suffix.len()..self.file_size;
        if !(available_range.start <= range.start && available_range.end >= range.end) {
            return async move {
                let err = format!(
                    "Attempted to fetch data from outside metadata section: range={range:?}, available_range={available_range:?}"
                );
                Err(parquet::errors::ParquetError::General(err))
            }
            .boxed();
        }
        // adjust the range to be within the data section
        let range = range.start - available_range.start..range.end - available_range.start;
        let data = self.data_suffix.slice(range.start..range.end);
        async move { Ok(data) }.boxed()
    }
}

pub async fn load_metadata(
    file_size: usize,
    serialized_parquet_metadata: Bytes,
) -> ParquetResult<Arc<ParquetMetaData>> {
    let mut reader = AsyncBytes::new(file_size, serialized_parquet_metadata.clone());
    let loader = MetadataLoader::load(&mut reader, file_size, None).await?;
    let loaded_metadata = loader.finish();
    let mut metadata = MetadataLoader::new(&mut reader, loaded_metadata);
    metadata.load_page_index(true, true).await?;
    Ok(Arc::new(metadata.finish()))
}

pub async fn extract_metadata_from_file(file_data: &Bytes) -> ParquetResult<Vec<u8>> {
    let loaded_metadata = parse_metadata(file_data)?;
    let mut reader = AsyncBytes::new(file_data.len(), file_data.clone());
    let mut metadata = MetadataLoader::new(&mut reader, loaded_metadata);
    metadata.load_page_index(true, true).await?;
    metadata.finish();
    Ok(file_data[reader.fetched_range().start..].to_vec())
}

alamb commented Jul 13, 2024

> I got my thing working, but it seems quite brittle. TLDR is that I'm just tracking what bytes DataFusion reads and then slicing to those. Which seems like it could be quite inefficient and might break if DataFusion changes internal details.

Good to hear you got it working. Yes, I agree that working out a more flexible and more efficient API would be ideal.

As I think you are hinting at, MetadataLoader was designed for the exact needs of the parquet reader, so it is not easy to use elsewhere.

Maybe a good place to start would be to write tests / examples of what you are trying to do. For example:

  1. Read and decode metadata from a parquet footer
  • with/without offset index
  • with/without bloom filters
  • when the initial pre-fetch didn't include the bytes for the FileMetadata
  • when the initial pre-fetch didn't include the bytes for some of the out-of-line structures (offset index, bloom filters)

Also, are you trying to support the case where you have bytes in memory that you want to decode parquet metadata from?

adriangb commented Jul 13, 2024

> Also, are you trying to support the case where you have bytes in memory that you want to decode parquet metadata from?

Yes, exactly. But to get those bytes in memory I also have to write them somehow.

The big picture use case is that I have a Vec<RecordBatch> in memory that I want to write out to a Parquet file in an object store. I also want to save metadata (in the general sense) about this new file to a commit log / secondary index. This metadata (in the general sense) store has file paths, partitioning information, file sizes, creation dates, row group statistics and also the parquet metadata. The point is that I can then take a query and push down as much as I can into this metadata store, returning everything I need to start reading files from object storage while minimizing slow object storage IO. If I store the parquet metadata there as well then in a single query to the metadata store I can get everything I need to start reading chunks of actual data from object storage.

Currently I'm writing the Vec<RecordBatch> to a Bytes (maybe in the future I'll want to write directly to object storage but that's a problem for another day) then using something like described in #6002 (comment) to extract just the metadata from those bytes. Having a metadata writer as I'm trying to do in #6000 would make this a bit less hacky because I could load the ParquetMetadata from the in-memory bytes of the entire file (there are various APIs already available for this, e.g. MetadataLoader) instead of doing the trick of tracking which bytes are being read.

In thinking about it more I don't think we need a new metadata loader. There are various places where metadata references byte ranges or offsets that apply to the entire file (e.g. the column index offsets) so there's always going to be a bit of friction trying to load metadata without the rest of the file. Maybe this is an indication that I'm abusing metadata and instead should be making a completely parallel structure but practically that's unjustifiable in terms of complexity and adding more conversions to load / dump when we already have a good serialization format. In any case, I think a simplified version of #6002 (comment) for reading would be okay:

#[derive(Debug, Clone)]
struct MetadataBytes {
    file_size: usize,
    serialized_parquet_metadata: Bytes,
}

impl MetadataBytes {
    fn new(file_size: usize, serialized_parquet_metadata: Bytes) -> Self {
        Self {
            file_size,
            serialized_parquet_metadata,
        }
    }
}

impl MetadataFetch for &mut MetadataBytes {
    fn fetch(&mut self, range: Range<usize>) -> BoxFuture<'_, ParquetResult<Bytes>> {
        let available_range = self.file_size - self.serialized_parquet_metadata.len()..self.file_size;
        if !(available_range.start <= range.start && available_range.end >= range.end) {
            return async move {
                let err = format!(
                    "Attempted to fetch data from outside metadata section: range={range:?}, available_range={available_range:?}"
                );
                Err(parquet::errors::ParquetError::General(err))
            }
            .boxed();
        }
        // adjust the range to be within the data section
        let range = range.start - available_range.start..range.end - available_range.start;
        let data = self.serialized_parquet_metadata.slice(range.start..range.end);
        async move { Ok(data) }.boxed()
    }
}

pub async fn load_metadata(
    file_size: usize,
    serialized_parquet_metadata: Bytes,
) -> ParquetResult<Arc<ParquetMetaData>> {
    let mut reader = MetadataBytes::new(file_size, serialized_parquet_metadata.clone());
    let loader = MetadataLoader::load(&mut reader, file_size, None).await?;
    let loaded_metadata = loader.finish();
    let mut metadata = MetadataLoader::new(&mut reader, loaded_metadata);
    metadata.load_page_index(true, true).await?;
    Ok(Arc::new(metadata.finish()))
}

There's still some friction here as visible in the complexity of the code, in particular the two-step loading of the page indexes and the false asyncness. The former I now understand is just because you need information from the metadata to know how to load the page indexes. The latter is not worth making a whole new API for.

I don't know if you feel this code is worth committing to the project, I'm happy to just use it myself until someone comes along with another use case for loading ParquetMetadata from just the metadata bytes.

alamb commented Jul 17, 2024

In various conversations I have had over the last few days, both internally at InfluxData as well as with others, this has come up.

Basically, I think having the ability to easily read/write ParquetMetaData into/from bytes that may or may not be in a parquet file is a valuable feature for people trying to build indexes like you describe.

For example, I think @XiangpengHao is thinking about it in some contexts, and I know @crepererum and @NGA-TRAN are as well.

Thus, now that we have a vehicle for working on code for 53.0.0 (the 53.0.0-dev branch) that won't lead to potentially massive merge conflicts, I think we should proceed with sorting out these APIs.

I will find time to actively help and review.

alamb commented Jul 17, 2024

I started working on an example here: #6081 (and tried to summarize what I think the use case is).

alamb commented Aug 6, 2024

I think we have our first chunk of the writing side done here: #6197

See also my attempt to document more clearly how all these structures relate and the various APIs available

alamb commented Aug 6, 2024

I also updated the example in #6081 to use the API that @adriangb added in #6197 (and that I touched up in #6202)

I would say that thanks to @adriangb and @etseidl the writing of ParquetMetaData is looking quite nice

If someone (🎣 ) had time to make a similar API for reading I think we would be in great shape

@alamb changed the title from "API for encoding/decoding ParquetMetadata with more control" to "[EPIC] API for encoding/decoding ParquetMetadata with more control" on Aug 7, 2024
@alamb changed the title from "[EPIC] API for encoding/decoding ParquetMetadata with more control" to "API for encoding/decoding ParquetMetadata with more control" on Aug 7, 2024
alamb commented Aug 7, 2024

Reminder: here is what the metadata looks like

┌──────────────────────┐                                
│                      │                                
│         ...          │                                
│                      │                                
│┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ │                                
│     ColumnIndex     ◀│─ ─ ─                           
││    (Optional)     │ │     │                          
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  │                                
│┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ │     │ FileMetadata             
│     OffsetIndex      │       contains embedded        
││    (Optional)     │◀┼ ─   │ offsets to               
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  │  │    ColumnIndex and          
│╔═══════════════════╗ │     │ OffsetIndex              
│║                   ║ │  │                             
│║                   ║ ┼ ─   │                          
│║   FileMetadata    ║ │                                
│║                   ║ ┼ ─ ─ ┘                          
│║                   ║ │                                
│╚═══════════════════╝ │                                
│┌───────────────────┐ │                                
││  metadata length  │ │ length of FileMetadata  (only) 
│└───────────────────┘ │                                
│┌───────────────────┐ │                                
││      'PAR1'       │ │ Parquet Magic Bytes            
│└───────────────────┘ │                                
└──────────────────────┘                                
                                                        
     Output Buffer                                      

How to read this today

Using the code in #6081 as an example, here is the best way I have come up with for reading metadata without firing up a parquet file reader:

Note this DOES NOT read the ColumnIndex / OffsetIndex, even if they are present

use parquet::file::footer::{decode_footer, decode_metadata};
use parquet::file::metadata::ParquetMetaData;
use std::io::Read;
use std::path::Path;

/// Reads the metadata from a file
///
/// This function reads the format written by `write_metadata_to_file`
fn read_metadata_from_file(file: impl AsRef<Path>) -> ParquetMetaData {
    let mut file = std::fs::File::open(file).unwrap();
    // This API is kind of awkward compared to the writer
    let mut buffer = Vec::new();
    file.read_to_end(&mut buffer).unwrap();
    let len = buffer.len();

    let mut footer = [0; 8];
    footer.copy_from_slice(&buffer[len - 8..len]);

    let md_length = decode_footer(&footer).unwrap();
    // note this also doesn't contain the ColumnIndex or OffsetIndex
    let metadata_buffer = &buffer[len - 8 - md_length..len - 8];
    decode_metadata(metadata_buffer).unwrap()
}

Proposed API

Here is how I would like to interact with the data (this would apply equally to metadata stored in memory blobs too)

/// Reads the metadata from a file
///
/// This function reads the format written by `write_metadata_to_file`
fn read_metadata_from_file(file: impl AsRef<Path>) -> ParquetMetaData {
    let mut file = std::fs::File::open(file).unwrap();
    // This API is kind of awkward compared to the writer
    let mut buffer = Vec::new();
    file.read_to_end(&mut buffer).unwrap();

    let decoder = ParquetMetaDataDecoder::new()
        // read the ColumnIndex and OffsetIndex, if present, populating
        // ParquetMetaData::column_index and ParquetMetaData::offset_index
        .with_page_index(true);

    decoder.decode(&buffer).unwrap()
}

Nuances

  1. Is this sufficient to decode the footer from a parquet file itself?
  2. Since the FileMetadata structure has pointers / offsets into the overall file, if you don't have the entire file in memory you need to rebase those offsets relative to the slice you do have (see the small illustration below)
  3. How will we work this into the parquet metadata loader (which may need to fetch multiple buffers)?
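To make nuance 2 concrete, here is a tiny illustration (the helper is hypothetical, not an existing API); it is the same arithmetic the MaskedBytes adapter further down performs on each fetch:

// Rebase a file-absolute offset into an index within an in-memory
// suffix of the file that begins at byte `suffix_start`.
fn rebase(file_absolute_offset: usize, suffix_start: usize) -> Option<usize> {
    file_absolute_offset.checked_sub(suffix_start) // None => not in the suffix
}

// e.g. a ColumnIndex at file offset 1000, when only the last 100 bytes of a
// 1050-byte file are in memory (suffix_start = 950): rebase(1000, 950) == Some(50)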

alamb commented Aug 7, 2024

My suggestion is that we try to pull the decoding code into a structure like ParquetMetadataReader as described in #6002 (comment), and then try to update the existing code in MetadataLoader and ArrowFileReader to use it.

I am pretty sure there will be adjustments required, but I think that would be a good place to start.

Perhaps @adriangb you could give it a try given your interest https://github.com/apache/arrow-rs/pull/6081/files#r1706311772

adriangb commented Aug 7, 2024

So here's what I've been working with:

/// Load parquet metadata, including the page index, from bytes.
/// This assumes the entire metadata (and no more) is in the provided bytes.
/// Although this method is async, no IO is performed.
pub async fn load_metadata(
    file_size: usize,
    serialized_parquet_metadata: Bytes,
) -> ParquetResult<Arc<ParquetMetaData>> {
    let metadata_length = serialized_parquet_metadata.len();
    let mut reader = MaskedBytes::new(
        Box::new(AsyncBytes::new(serialized_parquet_metadata)),
        file_size - metadata_length..file_size,
    );
    let metadata = MetadataLoader::load(&mut reader, file_size, None).await?;
    let loaded_metadata = metadata.finish();
    let mut metadata = MetadataLoader::new(&mut reader, loaded_metadata);
    metadata.load_page_index(true, true).await?;
    Ok(Arc::new(metadata.finish()))
}
Supporting code
/// Adapt a `Bytes` to a `MetadataFetch` implementation.
struct AsyncBytes {
    data: Bytes,
}

impl AsyncBytes {
    fn new(data: Bytes) -> Self {
        Self { data }
    }
}

impl MetadataFetch for AsyncBytes {
    fn fetch(&mut self, range: Range<usize>) -> BoxFuture<'_, ParquetResult<Bytes>> {
        async move { Ok(self.data.slice(range.start..range.end)) }.boxed()
    }
}

/// A `MetadataFetch` implementation that reads from a subset of the full data
/// while accepting ranges that address the full data.
struct MaskedBytes {
    inner: Box<dyn MetadataFetch + Send>,
    inner_range: Range<usize>,
}

impl MaskedBytes {
    fn new(inner: Box<dyn MetadataFetch + Send>, inner_range: Range<usize>) -> Self {
        Self { inner, inner_range }
    }
}

impl MetadataFetch for &mut MaskedBytes {
    fn fetch(&mut self, range: Range<usize>) -> BoxFuture<'_, ParquetResult<Bytes>> {
        // check that the range is within the metadata section
        let inner_range = self.inner_range.clone();
        if !(inner_range.start <= range.start && inner_range.end >= range.end) {
            return async move {
                let err = format!(
                    "Attempted to fetch data from outside metadata section: range={range:?}, available_range={inner_range:?}",
                );
                Err(parquet::errors::ParquetError::General(err))
            }
            .boxed();
        }
        // adjust the range to be within the data section
        let range = range.start - self.inner_range.start..range.end - self.inner_range.start;
        self.inner.fetch(range)
    }
}

Sorry, I didn't fully understand the question. I think the API looks good on the surface and, pending internal details, should work.

That offset adjustment would be 0 if you (1) have the whole file or (2) are loading metadata dumped by #6197.
So maybe v0 of this API assumes it's one of those cases and doesn't adjust offsets at all, but I'm open to alternatives.

As you point out, this might be hard to integrate with MetadataLoader because MetadataLoader is async and expects to be able to make many async calls to load data. We'd have to do some pretty aggressive refactoring to rework MetadataLoader into some sort of push-based parser, or make some lower-level push-based parser that both MetadataLoader and ParquetMetaDataDecoder can rely on. I think this is what you're suggesting when you say "My suggestion is that we try to pull the decoding code into a structure like ParquetMetadataReader as described in #6002 (comment), and then try to update the existing code in MetadataLoader and ArrowFileReader to use it.", right?

tustvold commented Aug 7, 2024

FWIW when I set out to write the MetadataLoader, the initial goal was for it to be push-based; however, I struggled to come up with a suitable interface for this in the time I had available. One option might be to return a special Error that allows it to "request" a range be loaded, but it ends up pretty gnarly.

IMO the trick is to share the sync decoding logic and expose it in an ergonomic way, and accept that the IO piece will have to be different for async vs non-async. This is broadly the pattern that is used throughout the parquet crate, and I don't really see a way around it.

adriangb commented Aug 8, 2024

To alleviate concerns about the API design, could we keep that private? That is, we'd have:

  1. MetadataLoader: the existing public async API for loading metadata.
  2. ParquetMetaDataDecoder: a high-level, single-shot sync API for decoding metadata (you need to have all of the bytes in memory).
  3. An internal push-based (or similar) decoder that gets used by those two (a rough sketch follows below). This API we can change in the future, e.g. to decode in a single shot instead of load footer -> load metadata with page index offsets -> load page index.
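One hypothetical shape for that internal layer (all names are illustrative, nothing here is an existing API): the core never performs IO itself; it either finishes or reports the next file-absolute byte range it needs, so the sync decoder can slice an in-memory buffer while MetadataLoader awaits a fetch between pushes:

use std::ops::Range;

/// Outcome of feeding bytes to a hypothetical push-based metadata decoder.
enum PushDecode<M> {
    /// Decoding is complete.
    Finished(M),
    /// The caller must supply this file-absolute byte range
    /// (e.g. the page index region) on the next push.
    NeedsData(Range<usize>),
}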

alamb commented Aug 8, 2024

@adriangb I think #6002 (comment) is a great idea

It would also make ParquetMetaDataDecoder / the internal push-based decoder mirror the structure you implemented with ParquetMetaDataEncoder / ThriftMetadataWriter, which is nicely symmetric and also seems to work well.

> Sorry, I didn't fully understand the question. I think the API looks good on the surface and, pending internal details, should work.

Sorry for not being clear, I was just trying to say it would be good not to have two entirely separate paths for decoding the metadata. I think your idea of the "internal push-based (or similar) decoder" sounds perfect.

tustvold commented:

The somewhat unfortunate formulation of MetadataLoader has also come up on #6157

alamb commented Oct 2, 2024

The update here is that, thanks to several PRs from @etseidl and myself, I am going to claim this is now basically complete. It is possible to read/write parquet metadata and manipulate it much more easily now (using ParquetMetaDataReader).
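For anyone landing here later, a minimal sketch of the resulting round trip (API names as of parquet 53; `metadata: ParquetMetaData` is assumed to exist, and the exact signatures are worth checking against the current docs):

use bytes::Bytes;
use parquet::file::metadata::{ParquetMetaDataReader, ParquetMetaDataWriter};

// Write: serialize the metadata (including any page indexes) to a buffer
let mut buffer = Vec::new();
ParquetMetaDataWriter::new(&mut buffer, &metadata).finish()?;

// Read: decode it back, asking for the page indexes to be populated
let decoded = ParquetMetaDataReader::new()
    .with_page_indexes(true)
    .parse_and_finish(&Bytes::from(buffer))?;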

@alamb closed this as completed on Oct 2, 2024
adriangb commented Oct 2, 2024

Amazing work, thank you all!

Xuanwo commented Oct 8, 2024

Hello, everyone! This API is a great improvement. I have adopted it in the parquet-opendal crate, significantly reducing duplicate code. It works really well: apache/opendal#5170

Thank you, @etseidl, for implementing this. And thanks to everyone here who has joined the discussion.

Xuanwo commented Oct 8, 2024

The only question left for me is whether I still need to implement AsyncFileReader::get_metadata.

It's a bit strange to write code like:

let reader = ParquetMetaDataReader::new().with_prefetch_hint(Some(self.prefetch_footer_size));
let size = self.content_length as usize;
// Use `self` inside a `fn get_metadata(&mut self)`
let meta = reader.load_and_finish(self, size).await?;

alamb commented Oct 8, 2024

It does look a bit strange, but I am not sure what an alternative would look like.
