Commit
GH-15185: [C++][Parquet] Improve documentation for Parquet Reader column_indices (#15184)

This fixes and improves the documentation by specifying more precisely which level of the schema the column_indices argument refers to.
* Closes: #15185

Lead-authored-by: LouisClt <louis1110@hotmail.fr>
Co-authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
LouisClt and wjones127 authored Jan 6, 2023
1 parent 6bd847b commit a580f27
Showing 2 changed files with 27 additions and 3 deletions.
17 changes: 16 additions & 1 deletion cpp/src/parquet/arrow/reader.h
@@ -133,6 +133,7 @@ class PARQUET_EXPORT FileReader {
  // fully-materialized arrow::Array instances
  //
  // Returns error status if the column of interest is not flat.
+ // The indicated column index is relative to the schema
  virtual ::arrow::Status GetColumn(int i, std::unique_ptr<ColumnReader>* out) = 0;

  /// \brief Return arrow schema for all the columns.
@@ -225,7 +226,21 @@ class PARQUET_EXPORT FileReader {

  /// \brief Read the given columns into a Table
  ///
- /// The indicated column indices are relative to the schema
+ /// The indicated column indices are relative to the internal representation
+ /// of the parquet table. For instance :
+ /// 0 foo.bar
+ ///       foo.bar.baz           0
+ ///       foo.bar.baz2          1
+ ///   foo.qux                   2
+ /// 1 foo2                      3
+ /// 2 foo3                      4
+ ///
+ /// i=0 will read foo.bar.baz, i=1 will read only foo.bar.baz2 and so on.
+ /// Only leaf fields have indices; foo itself doesn't have an index.
+ /// To get the index for a particular leaf field, one can use
+ /// manifest().schema_fields to get the top level fields, and then walk the
+ /// tree to identify the relevant leaf fields and access its column_index.
+ /// To get the total number of leaf fields, use FileMetadata.num_columns().
  virtual ::arrow::Status ReadTable(const std::vector<int>& column_indices,
  std::shared_ptr<::arrow::Table>* out) = 0;
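
The tree walk described in the new ReadTable comment can be sketched as follows. This is a minimal illustration, not code from the commit and not the Arrow API: `Node` is a hypothetical stand-in for an entry of `manifest().schema_fields`, and `CollectLeafIndices`, `LeafIndicesFor` and `ExampleSchema` are made-up helper names. The only assumption carried over from the comment is that leaf nodes hold a `column_index` and non-leaf nodes do not (modelled here as -1).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a schema tree node (NOT the Arrow type).
// column_index is the leaf index, or -1 for a non-leaf node such as foo.
struct Node {
  std::string name;
  int column_index;
  std::vector<Node> children;
};

// Depth-first walk that appends every leaf index found under `node`.
void CollectLeafIndices(const Node& node, std::vector<int>* out) {
  if (node.column_index >= 0) {
    out->push_back(node.column_index);
    return;
  }
  for (const Node& child : node.children) {
    CollectLeafIndices(child, out);
  }
}

// Looks up a top-level field by name and returns the leaf indices below it.
// An unknown name yields an empty vector.
std::vector<int> LeafIndicesFor(const std::vector<Node>& top_level,
                                const std::string& name) {
  std::vector<int> indices;
  for (const Node& field : top_level) {
    if (field.name == name) CollectLeafIndices(field, &indices);
  }
  return indices;
}

// Builds the example schema from the comment: foo {bar {baz, baz2}, qux},
// foo2, foo3, with leaf indices 0 through 4.
std::vector<Node> ExampleSchema() {
  Node baz{"baz", 0, {}};
  Node baz2{"baz2", 1, {}};
  Node bar{"bar", -1, {baz, baz2}};
  Node qux{"qux", 2, {}};
  Node foo{"foo", -1, {bar, qux}};
  Node foo2{"foo2", 3, {}};
  Node foo3{"foo3", 4, {}};
  return {foo, foo2, foo3};
}
```

Under these assumptions, `LeafIndicesFor(ExampleSchema(), "foo")` gathers the indices of baz, baz2 and qux, which is the kind of vector one would then pass as `column_indices` to read the whole of foo.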

13 changes: 11 additions & 2 deletions cpp/src/parquet/metadata.h
@@ -282,11 +282,20 @@ class PARQUET_EXPORT FileMetaData {

  bool Equals(const FileMetaData& other) const;

- /// \brief The number of top-level columns in the schema.
+ /// \brief The number of parquet "leaf" columns.
  ///
  /// Parquet thrift definition requires that nested schema elements are
- /// flattened. This method returns the number of columns in the un-flattened
+ /// flattened. This method returns the number of columns in the flattened
  /// version.
+ /// For instance, if the schema looks like this :
+ /// 0 foo.bar
+ ///       foo.bar.baz           0
+ ///       foo.bar.baz2          1
+ ///   foo.qux                   2
+ /// 1 foo2                      3
+ /// 2 foo3                      4
+ /// This method will return 5, because there are 5 "leaf" fields (so 5
+ /// flattened fields)
  int num_columns() const;

  /// \brief The number of flattened schema elements.
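
To make the distinction between top-level fields and leaf columns concrete, here is a small sketch that counts both for the example schema in the num_columns() comment. It does not use the Parquet library: `Node`, `CountLeaves`, `CountLeafColumns` and `ExampleSchema` are hypothetical names introduced only for this illustration, and a node is treated as a leaf simply when it has no children.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical schema tree node (NOT a Parquet type): a name plus children.
// A node with no children is treated as a leaf column.
struct Node {
  std::string name;
  std::vector<Node> children;
};

// Counts the leaf nodes in the subtree rooted at `node`.
int CountLeaves(const Node& node) {
  if (node.children.empty()) return 1;
  int total = 0;
  for (const Node& child : node.children) {
    total += CountLeaves(child);
  }
  return total;
}

// Counts the leaf columns across all top-level fields, i.e. the size of
// the flattened schema that the comment describes.
int CountLeafColumns(const std::vector<Node>& top_level) {
  int total = 0;
  for (const Node& field : top_level) {
    total += CountLeaves(field);
  }
  return total;
}

// Builds the example schema: foo {bar {baz, baz2}, qux}, foo2, foo3.
std::vector<Node> ExampleSchema() {
  Node bar{"bar", {Node{"baz", {}}, Node{"baz2", {}}}};
  Node foo{"foo", {bar, Node{"qux", {}}}};
  return {foo, Node{"foo2", {}}, Node{"foo3", {}}};
}
```

For this schema the top-level vector has 3 entries while `CountLeafColumns` gives 5, which matches the value the comment states for num_columns() and shows why the earlier "top-level columns" wording was misleading.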
