ARROW-367: converter json <=> Arrow file format for Integration tests #203

Closed
wants to merge 2 commits

Conversation

julienledem (Member)

No description provided.

@julienledem (Member Author)

@wesm here is a tool to convert from Arrow to JSON and back, as well as validate an Arrow file against a JSON one.
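
For context, a sketch of how such a converter might be invoked once built; the entry-point class (`org.apache.arrow.tools.Integration`) and the `-a`/`-j`/`-c` flags are assumptions based on the later integration tooling, not confirmed by the diff shown here:

```
// Hypothetical driver; class name, flags, and command values are assumptions.
public class IntegrationDemo {
  public static void main(String[] args) throws Exception {
    // Arrow file -> JSON file
    org.apache.arrow.tools.Integration.main(
        new String[] {"-a", "sample.arrow", "-j", "sample.json", "-c", "ARROW_TO_JSON"});
    // Validate an existing Arrow file against a JSON file
    org.apache.arrow.tools.Integration.main(
        new String[] {"-a", "sample.arrow", "-j", "sample.json", "-c", "VALIDATE"});
  }
}
```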

@wesm (Member) commented Nov 11, 2016

Fantastic, will review today. I made good progress yesterday on the JSON file format in C++, so I will write an equivalent tool.

@wesm (Member) left a comment

+1, LGTM modulo minor comments. We'll be sure to include nulls in the integration test datasets.

for (int j = 0; j < valueCount; j++) {
  Object arrow = arrowVector.getAccessor().getObject(j);
  Object json = jsonVector.getAccessor().getObject(j);
  if (!Objects.equal(arrow, json)) {

@wesm (Member)

This works for nested types and nulls?

@julienledem (Member Author)

Yes. `Objects.equal` takes care of nulls, and `getObject(index)` materializes nested types and lists.
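
A minimal sketch of why the null cases work with Guava's `Objects.equal` (the method used in the comparison loop above):

```
import com.google.common.base.Objects;

public class NullSafeEqualityDemo {
  public static void main(String[] args) {
    // Objects.equal is null-safe: no NullPointerException on either side.
    System.out.println(Objects.equal(null, null));  // true
    System.out.println(Objects.equal(null, 7));     // false
    // Materialized nested values (e.g. lists) compare structurally.
    System.out.println(Objects.equal(
        java.util.Arrays.asList(1, 2), java.util.Arrays.asList(1, 2)));  // true
  }
}
```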

public class ArrowFileTestFixtures {
  static final int COUNT = 10;

  static void writeData(int count, MapVector parent) {

@wesm (Member)

Come to think of it, many of these test cases could use some null values

@julienledem (Member Author)

good point

@julienledem (Member Author)

Sorry, I was busy this week, but I will add more tests as suggested.
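
A sketch of what a null-bearing fixture might look like, assuming the writer API of this era (where skipping a position leaves that slot null); the helper name and field layout are hypothetical:

```
import org.apache.arrow.vector.complex.MapVector;
import org.apache.arrow.vector.complex.impl.ComplexWriterImpl;
import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter;
import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter;
import org.apache.arrow.vector.complex.writer.IntWriter;

public class NullDataFixture {
  // Hypothetical variant of writeData that leaves every third slot null.
  static void writeDataWithNulls(int count, MapVector parent) {
    ComplexWriter writer = new ComplexWriterImpl("root", parent);
    MapWriter rootWriter = writer.rootAsMap();
    IntWriter intWriter = rootWriter.integer("int");
    for (int i = 0; i < count; i++) {
      if (i % 3 == 0) {
        continue;  // no write at this position => the value stays null
      }
      intWriter.setPosition(i);
      intWriter.writeInt(i);
    }
    writer.setValueCount(count);
  }
}
```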

@wesm (Member) commented Nov 17, 2016

+1

@asfgit closed this in 8417096 on Nov 18, 2016
@julienledem deleted the integration branch on April 25, 2017
wesm pushed a commit to wesm/arrow that referenced this pull request Sep 2, 2018
The key change here is to ensure that munmap is called at most once for
a given memory-mapped file.

Previously, MemoryMapSource::CloseFile was calling munmap on every
invocation (provided that a file had ever been successfully mapped into
memory for that instance). This is problematic in a multi-threaded
environment, even if each MemoryMapSource instance is being used in only
one thread, as illustrated by the following hypothetical sequence of
operations:

thread 1 @ time 1: munmap(0xf000, 4096)
thread 2 @ time 2: void *addr = mmap(NULL, 4096, ...) // addr <- 0xf000
thread 1 @ time 3: munmap(0xf000, 4096)

After time 3, the mapping for the memory segment beginning at 0xf000 has
been invalidated, so the next attempt by thread 2 to access memory
within that segment will likely cause a segfault (unless yet another
thread has mmap'd that segment in the meantime, in which case the
results could be even more interesting, but certainly no better).

Also, I'm adding/modifying a couple comments in header files to mark
"sample" implementations accordingly. This is intended to give API
consumers a heads up as to the intent and level of maturity of those
sections of the codebase.

Author: William Forson <william@gluent.com>

Closes apache#203 from wdforson/master and squashes the following commits:

782713a [William Forson] Adjust 'sample' code comments
7337551 [William Forson] PARQUET-799: Fix bug in MemoryMapSource::CloseFile
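
The fix described above boils down to an at-most-once close guard. A minimal sketch of that pattern, written in Java purely for illustration (the actual change is in the C++ `MemoryMapSource::CloseFile`; these names are hypothetical):

```
// Illustrative only; names are hypothetical, not from the patch.
final class MappedRegion implements AutoCloseable {
  private long address;    // nonzero while the mapping is live
  private boolean closed;

  MappedRegion(long address) {
    this.address = address;
  }

  @Override
  public synchronized void close() {
    if (closed) {
      return;              // later calls are no-ops, so a recycled address
    }                      // can never be unmapped out from under another user
    closed = true;
    unmap(address);        // exactly one unmap per mapping
    address = 0;
  }

  private static void unmap(long addr) {
    // platform-specific; stubbed for this sketch
  }
}
```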
nevi-me added a commit that referenced this pull request Sep 14, 2020
A rebase and significant rewrite of sunchao/parquet-rs#197

Big improvement: I now use a more natural nested enum style; it helps break out the patterns of data types. The rest of the broad strokes still apply.

Goal
===

Writing many columns to a file is a chore. If you can put your values into a struct which mirrors the schema of your file, this `derive(ParquetRecordWriter)` will write out all the fields, in the order in which they are defined, to a row group.

How to Use
===

```
extern crate parquet;
#[macro_use] extern crate parquet_derive;

#[derive(ParquetRecordWriter)]
struct ACompleteRecord<'a> {
  pub a_bool: bool,
  pub a_str: &'a str,
}
```

RecordWriter trait
===

This is the new trait which `parquet_derive` will implement for your structs.

```
use super::RowGroupWriter;

pub trait RecordWriter<T> {
  fn write_to_row_group(&self, row_group_writer: &mut Box<RowGroupWriter>);
}
```

How does it work?
===

The `parquet_derive` crate adds code-generation functionality to the Rust compiler: it takes Rust syntax and emits additional syntax. This macro expansion works on Rust 1.15+ stable. It is a dynamic plugin, loaded by the machinery in cargo, so users don't have to do any special `build.rs` steps or anything like that; it's automatic once `parquet_derive` is included in their project. The `parquet_derive/src/Cargo.toml` has a section saying as much:

```
[lib]
proc-macro = true
```

The Rust struct tagged with `#[derive(ParquetRecordWriter)]` is provided to the `parquet_record_writer` function in `parquet_derive/src/lib.rs`. The `syn` crate parses the struct from a string representation into an AST (a recursive enum value). The AST contains all the values I care about when generating a `RecordWriter` impl:

 - the name of the struct
 - the lifetime variables of the struct
 - the fields of the struct

The fields of the struct are translated from AST to a flat `FieldInfo` struct. It has the bits I care about for writing a column: `field_name`, `field_lifetime`, `field_type`, `is_option`, `column_writer_variant`.

The code then does the equivalent of templating to build the `RecordWriter` implementation. The templating functionality is provided by the `quote` crate. At a high-level the template for `RecordWriter` looks like:

```
impl RecordWriter for $struct_name {
  fn write_row_group(..) {
    $({
      $column_writer_snippet
    })
  }
}
```

This template is then added under the struct definition, ending up as something like:

```
struct MyStruct {
}
impl RecordWriter for MyStruct {
  fn write_row_group(..) {
    {
       write_col_1();
    };
   {
       write_col_2();
   }
  }
}
```

And finally, _THIS_ is the code passed to rustc. It's just code now, fully expanded and standalone. If a user ever changes their `struct MyValue` definition, the `ParquetRecordWriter` will be regenerated. There are no intermediate values to version control or worry about.

Viewing the Derived Code
===

To see the generated code before it's compiled, it is very useful to install `cargo expand` ([more info on GitHub](https://github.com/dtolnay/cargo-expand)); then you can do:

```
cd $WORK_DIR/parquet-rs/parquet_derive_test
cargo expand --lib > ../temp.rs
```

and then inspect the dumped contents:

```
struct DumbRecord {
    pub a_bool: bool,
    pub a2_bool: bool,
}
impl RecordWriter<DumbRecord> for &[DumbRecord] {
    fn write_to_row_group(
        &self,
        row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>,
    ) {
        let mut row_group_writer = row_group_writer;
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        };
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a2_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        }
    }
}
```

Now I need to write out all the combinations of types we support and make sure it writes out data.

Procedural Macros
===

The `parquet_derive` crate can ONLY export the derivation functionality; no traits, nothing else. The derive crate cannot host test cases. It's kind of like a "dummy" crate which is only used by the compiler, never by the code.

The parent crate cannot use the derivation functionality, which is important because it means test code cannot be in the parent crate. This forces us to have a third crate, `parquet_derive_test`.

I'm open to being wrong on any one of these finer points. I had to bang on this for a while to get it to compile!

Potentials For Better Design
===

 - [x] Recursion could be limited by generating the code as "snippets" instead of one big `quote!` AST generator. Or so I think. It might be nicer to push generating each column's writing code into another loop.
 - [X] ~~It would be nicer if I didn't have to be so picky about data going in to the `write_batch` function. Is it possible we could make a version of the function which accept `Into<DataType>` or similar? This would greatly simplify this derivation code as it would not need to enumerate all the supported types. Something like `write_generic_batch(&[impl Into<DataType>])` would be neat.~~ (not tackling in this generation of the plugin)
 - [X] ~~Another idea to improving writing columns, could we have a write function for `Iterator`s? I already have a `Vec<DumbRecord>`, if I could just write a mapping for accessing the one value, we could skip the whole intermediate vec for `write_batch`. Should have some significant memory advantages.~~ (not tackling in this generation of the plugin, it's a bigger parquet-rs enhancement)
 - [X] ~~It might be worthwhile to derive a parquet schema directly from a struct definition. That should stamp out opportunities for type errors.~~ (moved to #203)

Status
===

I have successfully integrated this work with my own data exporter (takes postgres/couchdb and outputs a single parquet file).

I think this code is worth including in the project, with the caveat that it only generates simplistic `RecordWriter`s. As people start to use it, we can add code generation for more complex, nested structs. We can convert the nested matching style to a fancier looping style, but for now this explicit nesting is easier to debug and understand (to me at least!).

Closes #4140 from xrl/parquet_derive

Lead-authored-by: Xavier Lange <xrlange@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Bryant Biggs <bryantbiggs@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
rafael-telles pushed a commit to rafael-telles/arrow that referenced this pull request Nov 11, 2021
Fix method docs errors on server.h and sqlite example