API for encoding/decoding ParquetMetadata with more control #6002

alamb · 2024-07-04T09:57:17Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
There are several cases where we would like to have more control over the encoding/deocing of Parquet metadata:

serialize and deserialize it outside of a parquet file so I can store it in a cache outside of parquet files and avoid slow object store requests (Page indexes in `decode_metadata` and `encode_metadata` #5988)
Selective decoding of a subset (e.g. columns or row groups) of parquet metadata #5855
Way to share SchemaDescriptorPtr across ParquetMetadata objects #5999
Ensuring that the Page index structures are read properly requires setting some non obvious settings on the reader scuh as ArrowReaderOptions::with_page_index

At the time of writing, the current APIs exposed

No API exposed for creating / writing parquet metadata
The API exposed for reading parquet metadata, decode_metadata, has no way for finer grained control

Describe the solution you'd like
I would like an API that allows more fine grained control over reading/writing metadata and that permits adding additional features over time in a backwards compatible way

Describe alternatives you've considered

Here is one potential idea -- to create Encoder / Decoder structs that can encode and decode the metadata along with various configuration options.

Ideally this struct would be integrated into the rest of the crate, e.g. used in SerializedFileWriter?

let encoder = ParquetMetadataEncoder::new()
   .with_some_options(foo);
let mut buffer = vec![]
encoder.encode(metadata, &mut buffer)

Similarly for decoding

let decoder = ParquetMetadataDecoder::new()
   .with_offset_index(true)
   .with_column_index(true);

let result = decoder.decode(&buffer);
// decoder need to have some way to communicate
// if it doesn't have sufficient information (e.g the PageIndex 
// wasn't present). Maybe this should just be an error?
match result {
  FullDecode(metadata) => return metadata,
  NeedMoreData(range) => {
    // fetch additional data and pass to deocder?
    todo!()
  } 
  ..
};

Additional context
This ticket is based on the discussion with @adriangb here #5988

There are a bunch of discussions on metadata speed here #5770

Here is a PR with a proposed 'encode_metadata' function: #6000

The text was updated successfully, but these errors were encountered:

tustvold · 2024-07-04T11:35:09Z

This sounds quite a lot like https://docs.rs/parquet/latest/parquet/arrow/async_reader/struct.MetadataLoader.html ?

alamb · 2024-07-05T10:21:30Z

This sounds quite a lot like https://docs.rs/parquet/latest/parquet/arrow/async_reader/struct.MetadataLoader.html ?

That is quite similar -- thank you. Some differences might be also be the with a normal (non async) API as well as an equivalent encoder

adriangb · 2024-07-06T23:00:14Z

Ye I think the asyncness would be an important difference. Also that the existing APIs kind of want to load from an entire file. I suppose you could give it a "file" with just the footer and tell it to load just that range... but it feels a bit forced? Same with the asyncness. For my use case I could do some pointless async work (as in, make an async file like thing that just points to a Vec<u8> but in general unnecessary async work is not ideal. My general experience is that it's nice to decouple IO from encoding / decoding logic.

alamb · 2024-07-08T10:06:25Z

My general experience is that it's nice to decouple IO from encoding / decoding logic.

Yes I agree this would be ideal. Having two things:

Something that handles the actual encode/decode of bytes
Something that reads data from a remote source + decodes them (what MetadataLoader seems to do)

adriangb · 2024-07-09T23:41:10Z

I took a crack at using MetadataLoader since I happen to have all of the parquet file bytes in memory when writing (although this is not necessarily the case if your'e streaming them somewhere).

My approach was to manually grab the footer based on the footer size declared in the penultimate 4 bytes of the file and save that. But the metadata size declared in the footer seems to not include the Page Index, and I'm not sure how I'd calculate the start location of the Page Index (and other stuff like bloom filters).

My implementation looks somewhat like:

Code

#[derive(Debug, Clone)]
struct AsyncBytes {
    file_size: usize,
    inner: Bytes,
}

impl AsyncBytes {
    fn new(file_size: usize, inner: Bytes) -> Self {
        Self {
            file_size,
            inner,
        }
    }
}

impl MetadataFetch for AsyncBytes {
    fn fetch(&mut self, range: Range<usize>) -> BoxFuture<'_, ParquetResult<Bytes>> {
        // check that the range is within the metadata section
        let available_range = self.file_size - self.inner.len()..self.file_size;
        if !(available_range.start <= range.start && available_range.end >= range.end) {
            return async move {
                let err = format!("Attempted to fetch data from outside metadata section: range={:?}, available_range={:?}", range, available_range);
                Err(parquet::errors::ParquetError::General(err))
            }
            .boxed();
        }
        // adjust the range to be within the data section
        let range = range.start - available_range.start..range.end - available_range.start;
        let data = self.inner.slice(range.start..range.end);
        async move { Ok(data) }.boxed()
    }
}


/// Load parquet metadata, including the page index, from bytes.
/// This assumes the entire metadata (and no more) is in the provided bytes.
/// Although this method is async, no IO is performed.
pub async fn load_metadata(file_size: usize, serialized_parquet_metadata: Bytes) -> ParquetResult<Arc<ParquetMetaData>> {
    let loaded_metadata = decode_metadata(&serialized_parquet_metadata)?;
    let reader = AsyncBytes::new(file_size, serialized_parquet_metadata);
    let mut metadata = MetadataLoader::new(reader, loaded_metadata);
    metadata.load_page_index(true, true).await?;
    Ok(Arc::new(metadata.finish()))
}

Not sure what the right APIs would be for this sort of use case, or in general but it seems like MetadataLoader can’t really be used here without some changes.

adriangb · 2024-07-10T16:45:13Z

I got my thing working, but it seems quite brittle. TLDR is that I'm just tracking what bytes DataFusion reads and then slicing to those. Which seems like it could be quite inefficient and might break if DataFusion changes internal details.

Code

#[derive(Debug, Clone)]
struct AsyncBytes {
    file_size: usize,
    data_suffix: Bytes,
    min_offset: usize,
    max_offset: usize,
}

impl AsyncBytes {
    fn new(file_size: usize, data_suffix: Bytes) -> Self {
        Self {
            file_size,
            data_suffix,
            min_offset: file_size,
            max_offset: file_size,
        }
    }

    fn fetched_range(&self) -> Range<usize> {
        self.min_offset..self.max_offset
    }
}

impl MetadataFetch for &mut AsyncBytes {
    fn fetch(&mut self, range: Range<usize>) -> BoxFuture<'_, ParquetResult<Bytes>> {
        self.min_offset = self.min_offset.min(range.start);
        self.max_offset = self.max_offset.max(range.end);
        let available_range = self.file_size - self.data_suffix.len()..self.file_size;
        if !(available_range.start <= range.start && available_range.end >= range.end) {
            return async move {
                let err = format!(
                    "Attempted to fetch data from outside metadata section: range={range:?}, available_range={available_range:?}"
                );
                Err(parquet::errors::ParquetError::General(err))
            }
            .boxed();
        }
        // adjust the range to be within the data section
        let range = range.start - available_range.start..range.end - available_range.start;
        let data = self.data_suffix.slice(range.start..range.end);
        async move { Ok(data) }.boxed()
    }
}

pub async fn load_metadata(
    file_size: usize,
    serialized_parquet_metadata: Bytes,
) -> ParquetResult<Arc<ParquetMetaData>> {
    let mut reader = AsyncBytes::new(file_size, serialized_parquet_metadata.clone());
    let loader = MetadataLoader::load(&mut reader, file_size, None).await?;
    let loaded_metadata = loader.finish();
    let mut metadata = MetadataLoader::new(&mut reader, loaded_metadata);
    metadata.load_page_index(true, true).await?;
    Ok(Arc::new(metadata.finish()))
}

pub async fn extract_metadata_from_file(file_data: &Bytes) -> ParquetResult<Vec<u8>> {
    let loaded_metadata = parse_metadata(file_data)?;
    let mut reader = AsyncBytes::new(file_data.len(), file_data.clone());
    let mut metadata = MetadataLoader::new(&mut reader, loaded_metadata);
    metadata.load_page_index(true, true).await?;
    metadata.finish();
    Ok(file_data[reader.fetched_range().start..].to_vec())
}

alamb · 2024-07-13T11:01:36Z

I got my thing working, but it seems quite brittle. TLDR is that I'm just tracking what bytes DataFusion reads and then slicing to those. Which seems like it could be quite inefficient and might break if DataFusion changes internal details.

Good to hear you got it working. Yes I agree getting a more flexible API worked out that is more efficient would be ideal

As I think you are hinting at, MetadataLoader was designed for whatever the exact needs of the parquet reader were, so is not easy to use outside.

Maybe a good place to start would be to write tests / examples of what you are trying to do. For example:

Read and decode metadata from a parquet footer

with/without offset index;
with/without bloom filters
when the initial pre-fetch didn't include the bytes for the FileMetadata
When the intiial pre-fetch didn't include the bytes for some of the out of line structures (offset index, bloom filters)

Also are you trying to support when you have bytes in memory that you want to decode parquet metadata from?

adriangb · 2024-07-13T17:50:05Z

Also are you trying to support when you have bytes in memory that you want to decode parquet metadata from?

Yes, exactly. But to get those bytes in memory I also have to write them somehow.

The big picture use case is that I have a Vec<RecordBatch> in memory that I want to write out to a Parquet file in an object store. I also want to save metadata (in the general sense) about this new file to a commit log / secondary index. This metadata (in the general sense) store has file paths, partitioning information, file sizes, creation dates, row group statistics and also the parquet metadata. The point is that I can then take a query and push down as much as I can into this metadata store, returning everything I need to start reading files from object storage while minimizing slow object storage IO. If I store the parquet metadata there as well then in a single query to the metadata store I can get everything I need to start reading chunks of actual data from object storage.

Currently I'm writing the Vec<RecordBatch> to a Bytes (maybe in the future I'll want to write directly to object storage but that's a problem for another day) then using something like described in #6002 (comment) to extract just the metadata from those bytes. Having a metadata writer as I'm trying to do in #6000 would make this a bit less hacky because I could load the ParquetMetadata from the in-memory bytes of the entire file (there are various APIs already available for this, e.g. MetadataLoader) instead of doing the trick of tracking which bytes are being read.

In thinking about it more I don't think we need a new metadata loader. There are various places where metadata references byte ranges or offsets that apply to the entire file (e.g. the column index offsets) so there's always going to be a bit of friction trying to load metadata without the rest of the file. Maybe this is an indication that I'm abusing metadata and instead should be making a completely parallel structure but practically that's unjustifiable in terms of complexity and adding more conversions to load / dump when we already have a good serialization format. In any case, I think a simplified version of #6002 (comment) for reading would be okay:

#[derive(Debug, Clone)]
struct MetadataBytes {
    file_size: usize,
    serialized_parquet_metadata: Bytes,
}

impl MetadataBytes {
    fn new(file_size: usize, serialized_parquet_metadata: Bytes) -> Self {
        Self {
            file_size,
            serialized_parquet_metadata,
        }
    }
}

impl MetadataFetch for &mut MetadataBytes {
    fn fetch(&mut self, range: Range<usize>) -> BoxFuture<'_, ParquetResult<Bytes>> {
        let available_range = self.file_size - self.serialized_parquet_metadata.len()..self.file_size;
        if !(available_range.start <= range.start && available_range.end >= range.end) {
            return async move {
                let err = format!(
                    "Attempted to fetch data from outside metadata section: range={range:?}, available_range={available_range:?}"
                );
                Err(parquet::errors::ParquetError::General(err))
            }
            .boxed();
        }
        // adjust the range to be within the data section
        let range = range.start - available_range.start..range.end - available_range.start;
        let data = self.serialized_parquet_metadata.slice(range.start..range.end);
        async move { Ok(data) }.boxed()
    }
}

pub async fn load_metadata(
    file_size: usize,
    serialized_parquet_metadata: Bytes,
) -> ParquetResult<Arc<ParquetMetaData>> {
    let mut reader = MetadataBytes::new(file_size, serialized_parquet_metadata.clone());
    let loader = MetadataLoader::load(&mut reader, file_size, None).await?;
    let loaded_metadata = loader.finish();
    let mut metadata = MetadataLoader::new(&mut reader, loaded_metadata);
    metadata.load_page_index(true, true).await?;
    Ok(Arc::new(metadata.finish()))
}

There's still some friction here as visible in the complexity of the code, in particular the two-step loading of the page indexes and the false asyncness. The former I now understand is just because you need information from the metadata to know how to load the page indexes. The latter is not worth making a whole new API for.

I don't know if you feel this code is worth committing to the project, I'm happy to just use it myself until someone comes along with another use case for loading ParquetMetadata from just the metadata bytes.

alamb · 2024-07-17T20:20:48Z

In various conversations I have had the last few days, both internally at InfluxData as well as with others, this has come up

Basically, I think having the ability to easily read/write ParquetMetaData into/from bytes that might/might not be in a parquet file is a valuable feature for people trying to build indexes like you describe

For example I think @XiangpengHao is thinking about it in some contexts, and I know @crepererum and @NGA-TRAN are as well.

Thus now that we have a vehicle for working on code for 53.0.0 (53.0.0-dev branch) that won't lead to potentially massive merge conflicts, I think we should proceed sorting out these APIs.

I will find time to actively help and review

alamb · 2024-07-17T21:53:14Z

I started working on an example here: #6081 (and tried to summarize what I think the usecase is).

alamb · 2024-08-06T20:24:55Z

I think we have our first chunk of the writing side done here: #6197

See also my attempt to document more clearly how all these structures relate and the various APIs available

alamb · 2024-08-06T22:23:05Z

I also updated the example in #6081 to use the API that @adriangb added in #6197 (and that I touched up in #6202)

I would say that thanks to @adriangb and @etseidl the writing of ParquetMetaData is looking quite nice

If someone (🎣 ) had time to make a similar API for reading I think we would be in great shape

alamb · 2024-08-07T12:00:17Z

REminder here is what the metadata looks like

┌──────────────────────┐                                
│                      │                                
│         ...          │                                
│                      │                                
│┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ │                                
│     ColumnIndex     ◀│─ ─ ─                           
││    (Optional)     │ │     │                          
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  │                                
│┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ │     │ FileMetadata             
│     OffsetIndex      │       contains embedded        
││    (Optional)     │◀┼ ─   │ offsets to               
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─  │  │    ColumnIndex and          
│╔═══════════════════╗ │     │ OffsetIndex              
│║                   ║ │  │                             
│║                   ║ ┼ ─   │                          
│║   FileMetadata    ║ │                                
│║                   ║ ┼ ─ ─ ┘                          
│║                   ║ │                                
│╚═══════════════════╝ │                                
│┌───────────────────┐ │                                
││  metadata length  │ │ length of FileMetadata  (only) 
│└───────────────────┘ │                                
│┌───────────────────┐ │                                
││      'PAR1'       │ │ Parquet Magic Bytes            
│└───────────────────┘ │                                
└──────────────────────┘                                
                                                        
     Output Buffer

How to read this today

Using the code in #6081 as an example, here is the best way I have come up with for reading metadata without firing up a parquet file reader:

Note this DOES NOT read the ColumnIndex / OffsetIndex, even if if they are present

/// Reads the metadata from a file
///
/// This function reads the format written by `write_metadata_to_file`
fn read_metadata_from_file(file: impl AsRef<Path>) -> ParquetMetaData {
    let mut file = std::fs::File::open(file).unwrap();
    // This API is kind of awkward compared to the writer
    let mut buffer = Vec::new();
    file.read_to_end(&mut buffer).unwrap();
    let len = buffer.len();

    let mut footer = [0; 8];
    footer.copy_from_slice(&buffer[len - 8..len]);

    let md_length = decode_footer(&footer).unwrap();
    // note this also doesn't contain the ColumnOffset or ColumnIndex
    let metadata_buffer = &buffer[len - 8 - md_length..md_length];
    decode_metadata(metadata_buffer).unwrap()
}

Proposed API

Here is how I would like to interact with the data (this would apply equally to metadata stored memory blobs too)

/// Reads the metadata from a file
///
/// This function reads the format written by `write_metadata_to_file`
fn read_metadata_from_file(file: impl AsRef<Path>) -> ParquetMetaData {
    let mut file = std::fs::File::open(file).unwrap();
    // This API is kind of awkward compared to the writer
    let mut buffer = Vec::new();
    file.read_to_end(&mut buffer).unwrap();

     let decoder = ParquetMetaDataDecoder::new()
       // read OffsetIndex and PageIndex,  if present, populating 
       // ParquetMetaData::column_index and ParquetMetaData::offset_index 
       .with_page_index(true);

     decoder.decode(&but).unwrap()
}

Nuances

Is this sufficient to coordinate / decode the footer from a parquet file itself
Since the FileMetadata structure have pointers / offsets into the buffer if you don't have the entire file in memory you need to update the offsets relative to the slice you do have
How will we work this into the parquet metadata loader (that may need to fetch multiple buffers)

alamb · 2024-08-07T12:01:27Z

My suggestion is that we try to pull the decoding code into a structure like ParquetMetadataReader as described in #6002 (comment) and try to then update the existing code in MetadataLoader and ArrowFileReader to use it.

I am pretty sure there will be adjustments required, but that would be a good place to start I think

Perhaps @adriangb you could give it a try given your interest https://github.com/apache/arrow-rs/pull/6081/files#r1706311772

adriangb · 2024-08-07T13:01:03Z

So here's what I've been working with:

/// Load parquet metadata, including the page index, from bytes.
/// This assumes the entire metadata (and no more) is in the provided bytes.
/// Although this method is async, no IO is performed.
pub async fn load_metadata(
    file_size: usize,
    serialized_parquet_metadata: Bytes,
) -> ParquetResult<Arc<ParquetMetaData>> {
    let metadata_length = serialized_parquet_metadata.len();
    let mut reader = MaskedBytes::new(
        Box::new(AsyncBytes::new(serialized_parquet_metadata)),
        file_size - metadata_length..file_size,
    );
    let metadata = MetadataLoader::load(&mut reader, file_size, None).await?;
    let loaded_metadata = metadata.finish();
    let mut metadata = MetadataLoader::new(&mut reader, loaded_metadata);
    metadata.load_page_index(true, true).await?;
    Ok(Arc::new(metadata.finish()))
}

Supporting code

/// Adapt a `Bytes` to a `MetadataFetch` implementation.
struct AsyncBytes {
    data: Bytes,
}

impl AsyncBytes {
    fn new(data: Bytes) -> Self {
        Self { data }
    }
}

impl MetadataFetch for AsyncBytes {
    fn fetch(&mut self, range: Range<usize>) -> BoxFuture<'_, ParquetResult<Bytes>> {
        async move { Ok(self.data.slice(range.start..range.end)) }.boxed()
    }
}

/// A `MetadataFetch` implementation that reads from a subset of the full data
/// while accepting ranges that address the full data.
struct MaskedBytes {
    inner: Box<dyn MetadataFetch + Send>,
    inner_range: Range<usize>,
}

impl MaskedBytes {
    fn new(inner: Box<dyn MetadataFetch + Send>, inner_range: Range<usize>) -> Self {
        Self { inner, inner_range }
    }
}

impl MetadataFetch for &mut MaskedBytes {
    fn fetch(&mut self, range: Range<usize>) -> BoxFuture<'_, ParquetResult<Bytes>> {
        // check that the range is within the metadata section
        let inner_range = self.inner_range.clone();
        if !(inner_range.start <= range.start && inner_range.end >= range.end) {
            return async move {
                let err = format!(
                    "Attempted to fetch data from outside metadata section: range={range:?}, available_range={inner_range:?}",
                );
                Err(parquet::errors::ParquetError::General(err))
            }
            .boxed();
        }
        // adjust the range to be within the data section
        let range = range.start - self.inner_range.start..range.end - self.inner_range.start;
        self.inner.fetch(range)
    }
}

Sorry I didn't fully understand the question. I think the API looks good on the surface and pending internal details should work.

That offset adjustment would be 0 if you (1) have the whole file or (2) are loading metadata dumped by #6197.
So maybe v0 of this API assumes it's one of those cases and doesn't adjust offsets at all, but I'm open to alternatives.

As you point out this might be hard to integrate with MetadataLoader because MetadataLoader is async and expects to be able to make many async calls to load data. We'd have to do some pretty aggressive refactoring to have rework MetadataLoader to be some sort of push based parser, or make some lower level push based parser that both MetadataLoader and ParquetMetaDataDecoder can rely on. I think this is what you're suggesting when you say "My suggestion is that we try to pull the decoding code into a structure like ParquetMetadataReader as described in #6002 (comment) and try to then update the existing code in MetadataLoader and ArrowFileReader to use it." right?

tustvold · 2024-08-07T13:29:38Z

FWIW when I set out to write the MetadataLoader the initial goal was for it to be push-based, however, I struggled to come up with a suitable interface for this in the time I had available. One option might be to return a special Error that allows it to "request" a range be loaded, but it ends up pretty gnarly.

IMO the trick is to share the sync decoding logic and expose it an ergonomic way, and accept that the IO piece will have to be different for async vs non-async. This is broadly the pattern that is used throughout the parquet crate, and I don't really see a way around it.

adriangb · 2024-08-08T01:07:28Z

To alleviate concerns about the API design, could we keep that private? That is, we'd have:

MetadataLoader: the existing public async API for loading metadata.
ParquetMetaDataDecoder: a high level single-shot sync API for decoding metadata (you need to have all of the bytes in memory)
An internal push based or whatever API decoder that gets used by those two. This API we can change in the future, e.g. to decode in a single shot instead instead of load footer -> load metadata with page index offsets -> load page index.

alamb · 2024-08-08T11:41:43Z

@adriangb I think #6002 (comment) is a great idea

It also would make ParquetMetaDataDecoder / the internal push based whatever mirror the structure you implemented with ParquetMetaDataEncoder / ThriftMetadataWriter which is nicely symmetric and also seems to work well

Sorry I didn't fully understand the question. I think the API looks good on the surface and pending internal details should work.

Sorry for not being clear, I was just trying to say it would be good not to have two entirely separate paths for decoding the metadata. I think you ridea of the "internal push based or whatever API decoder" sounds perfect

tustvold · 2024-08-27T10:25:46Z

The somewhat unfortunate formulation of MetadataLoader has also come up on #6157

alamb · 2024-10-02T20:01:27Z

Update here is that thanks to several PRs from @etseidl and myself I am going to claim this is now basically complete. It is possible to read/write parquet metadata and manipulate it much more easily now (using ParquetMetaDataReader)

adriangb · 2024-10-02T20:02:18Z

Amazing work thank you all!

Xuanwo · 2024-10-08T03:45:47Z

Hello, everyone! This API is a great improvement. I have adopted it in the parquet-opendal crate, significantly reducing duplicate code. It works really well: apache/opendal#5170

Thank you, @etseidl, for implementing this. And thanks to everyone here who has joined the discussion.

Xuanwo · 2024-10-08T03:49:33Z

The only question left for me is whether I still need to implement AsyncFileReader::get_metadata.

It's a bit strange to write code like:

let reader = ParquetMetaDataReader::new().with_prefetch_hint(Some(self.prefetch_footer_size));
let size = self.content_length as usize;
// Use `self` inside a `fn get_metadata(&mut self)`
let meta = reader.load_and_finish(self, size).await?;

alamb · 2024-10-08T15:32:48Z

It does look a bit strange, but I am not sure what an alternate would look like

alamb added parquet Changes to the parquet crate enhancement Any new improvement worthy of a entry in the changelog labels Jul 4, 2024

alamb mentioned this issue Jul 4, 2024

Add ParquetMetadataWriter allow ad-hoc encoding of ParquetMetadata #6000

Closed

alamb changed the title ~~API for encoding/deocding ParquetMetadata with more control~~ API for encoding/decoding ParquetMetadata with more control Jul 5, 2024

alamb mentioned this issue Jul 5, 2024

Read Parquet metadata via suffix requests #5979

Open

adriangb mentioned this issue Jul 13, 2024

Reintroduce: Write Bloom filters between row groups instead of the end #5933

Merged

alamb mentioned this issue Jul 17, 2024

Example of reading and writing parquet metadata outside the file #6081

Merged

alamb mentioned this issue Jul 26, 2024

[DISCUSSION] Parquet Metadata Improvements #6129

Open

etseidl mentioned this issue Aug 6, 2024

Add ThriftMetadataWriter for writing Parquet metadata #6197

Merged

alamb changed the title ~~API for encoding/decoding ParquetMetadata with more control~~ [EPIC] API for encoding/decoding ParquetMetadata with more control Aug 7, 2024

alamb changed the title ~~[EPIC] API for encoding/decoding ParquetMetadata with more control~~ API for encoding/decoding ParquetMetadata with more control Aug 7, 2024

tustvold mentioned this issue Aug 27, 2024

Parquet/async: Default to suffix requests on supporting readers/object stores #6157

Draft

etseidl mentioned this issue Sep 13, 2024

POC: Add ParquetMetaDataReader #6392

Closed

etseidl mentioned this issue Sep 20, 2024

Add ParquetMetaDataReader #6431

Merged

This was referenced Sep 26, 2024

Add round trip tests for reading/writing parquet metadata #6463

Merged

Add builder style API for manipulating ParquetMetaData #6465

Closed

alamb closed this as completed Oct 2, 2024

alamb mentioned this issue Oct 3, 2024

Add example of how to use parquet metadata reader APIs for a local cache #6504

Closed

Xuanwo mentioned this issue Oct 8, 2024

refactor(integration/parquet): Use ParquetMetaDataReader instead apache/opendal#5170

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API for encoding/decoding ParquetMetadata with more control #6002

API for encoding/decoding ParquetMetadata with more control #6002

alamb commented Jul 4, 2024 •

edited

Loading

tustvold commented Jul 4, 2024

alamb commented Jul 5, 2024

adriangb commented Jul 6, 2024

alamb commented Jul 8, 2024

adriangb commented Jul 9, 2024 •

edited

Loading

adriangb commented Jul 10, 2024 •

edited

Loading

alamb commented Jul 13, 2024

adriangb commented Jul 13, 2024 •

edited

Loading

alamb commented Jul 17, 2024

alamb commented Jul 17, 2024 •

edited

Loading

alamb commented Aug 6, 2024

alamb commented Aug 6, 2024

alamb commented Aug 7, 2024

alamb commented Aug 7, 2024 •

edited

Loading

adriangb commented Aug 7, 2024 •

edited

Loading

tustvold commented Aug 7, 2024

adriangb commented Aug 8, 2024

alamb commented Aug 8, 2024

tustvold commented Aug 27, 2024

alamb commented Oct 2, 2024 •

edited

Loading

adriangb commented Oct 2, 2024

Xuanwo commented Oct 8, 2024 •

edited

Loading

Xuanwo commented Oct 8, 2024

alamb commented Oct 8, 2024

API for encoding/decoding ParquetMetadata with more control #6002

API for encoding/decoding ParquetMetadata with more control #6002

Comments

alamb commented Jul 4, 2024 • edited Loading

tustvold commented Jul 4, 2024

alamb commented Jul 5, 2024

adriangb commented Jul 6, 2024

alamb commented Jul 8, 2024

adriangb commented Jul 9, 2024 • edited Loading

adriangb commented Jul 10, 2024 • edited Loading

alamb commented Jul 13, 2024

adriangb commented Jul 13, 2024 • edited Loading

alamb commented Jul 17, 2024

alamb commented Jul 17, 2024 • edited Loading

alamb commented Aug 6, 2024

alamb commented Aug 6, 2024

alamb commented Aug 7, 2024

How to read this today

Proposed API

Nuances

alamb commented Aug 7, 2024 • edited Loading

adriangb commented Aug 7, 2024 • edited Loading

tustvold commented Aug 7, 2024

adriangb commented Aug 8, 2024

alamb commented Aug 8, 2024

tustvold commented Aug 27, 2024

alamb commented Oct 2, 2024 • edited Loading

adriangb commented Oct 2, 2024

Xuanwo commented Oct 8, 2024 • edited Loading

Xuanwo commented Oct 8, 2024

alamb commented Oct 8, 2024

alamb commented Jul 4, 2024 •

edited

Loading

adriangb commented Jul 9, 2024 •

edited

Loading

adriangb commented Jul 10, 2024 •

edited

Loading

adriangb commented Jul 13, 2024 •

edited

Loading

alamb commented Jul 17, 2024 •

edited

Loading

alamb commented Aug 7, 2024 •

edited

Loading

adriangb commented Aug 7, 2024 •

edited

Loading

alamb commented Oct 2, 2024 •

edited

Loading

Xuanwo commented Oct 8, 2024 •

edited

Loading