ARROW-367: converter json <=> Arrow file format for Integration tests #203

Closed
wants to merge 2 commits

Conversation

julienledem (Member)

No description provided.

@julienledem (Member Author)

@wesm here is a tool to convert from Arrow to JSON and back, as well as validate an Arrow file against a JSON one.
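
For context, a sketch of how such a converter might be invoked once built; the entry-point class (`org.apache.arrow.tools.Integration`) and the `-a`/`-j`/`-c` flags are assumptions based on the later integration tooling, not confirmed by the diff shown here:

```
// Hypothetical driver; class name, flags, and command values are assumptions.
public class IntegrationDemo {
  public static void main(String[] args) throws Exception {
    // Arrow file -> JSON file
    org.apache.arrow.tools.Integration.main(
        new String[] {"-a", "sample.arrow", "-j", "sample.json", "-c", "ARROW_TO_JSON"});
    // Validate an existing Arrow file against a JSON file
    org.apache.arrow.tools.Integration.main(
        new String[] {"-a", "sample.arrow", "-j", "sample.json", "-c", "VALIDATE"});
  }
}
```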

@wesm (Member) commented Nov 11, 2016

Fantastic, will review today. I made good progress yesterday on the JSON file format in C++, so I will write an equivalent tool.

@wesm (Member) left a comment

+1, LGTM modulo minor comments. We'll be sure to include nulls in the integration test datasets.

for (int j = 0; j < valueCount; j++) {
  Object arrow = arrowVector.getAccessor().getObject(j);
  Object json = jsonVector.getAccessor().getObject(j);
  if (!Objects.equal(arrow, json)) {

@wesm (Member)

This works for nested types and nulls?

@julienledem (Member Author)

Yes. `Objects.equal` takes care of nulls, and `getObject(index)` materializes nested types and lists.
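
A minimal sketch of why the null cases work with Guava's `Objects.equal` (the method used in the comparison loop above):

```
import com.google.common.base.Objects;

public class NullSafeEqualityDemo {
  public static void main(String[] args) {
    // Objects.equal is null-safe: no NullPointerException on either side.
    System.out.println(Objects.equal(null, null));  // true
    System.out.println(Objects.equal(null, 7));     // false
    // Materialized nested values (e.g. lists) compare structurally.
    System.out.println(Objects.equal(
        java.util.Arrays.asList(1, 2), java.util.Arrays.asList(1, 2)));  // true
  }
}
```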

public class ArrowFileTestFixtures {
  static final int COUNT = 10;

  static void writeData(int count, MapVector parent) {

@wesm (Member)

Come to think of it, many of these test cases could use some null values

@julienledem (Member Author)

good point

@julienledem (Member Author)

Sorry, I was busy this week, but I will add more tests as suggested.
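
A sketch of what a null-bearing fixture might look like, assuming the writer API of this era (where skipping a position leaves that slot null); the helper name and field layout are hypothetical:

```
import org.apache.arrow.vector.complex.MapVector;
import org.apache.arrow.vector.complex.impl.ComplexWriterImpl;
import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter;
import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter;
import org.apache.arrow.vector.complex.writer.IntWriter;

public class NullDataFixture {
  // Hypothetical variant of writeData that leaves every third slot null.
  static void writeDataWithNulls(int count, MapVector parent) {
    ComplexWriter writer = new ComplexWriterImpl("root", parent);
    MapWriter rootWriter = writer.rootAsMap();
    IntWriter intWriter = rootWriter.integer("int");
    for (int i = 0; i < count; i++) {
      if (i % 3 == 0) {
        continue;  // no write at this position => the value stays null
      }
      intWriter.setPosition(i);
      intWriter.writeInt(i);
    }
    writer.setValueCount(count);
  }
}
```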

@wesm (Member) commented Nov 17, 2016

+1

@asfgit closed this in 8417096 on Nov 18, 2016
@julienledem deleted the integration branch on April 25, 2017
wesm pushed a commit to wesm/arrow that referenced this pull request Sep 2, 2018
The key change here is to ensure that munmap is called at most once for
a given memory-mapped file.

Previously, MemoryMapSource::CloseFile was calling munmap on every
invocation (provided that a file had ever been successfully mapped into
memory for that instance). This is problematic in a multi-threaded
environment, even if each MemoryMapSource instance is being used in only
one thread, as illustrated by the following hypothetical sequence of
operations:

thread 1 @ time 1: munmap(0xf000, 4096)
thread 2 @ time 2: void *addr = mmap(NULL, 4096, ...) // addr <- 0xf000
thread 1 @ time 3: munmap(0xf000, 4096)

After time 3, the mapping for the memory segment beginning at 0xf000 has
been invalidated, so the next attempt by thread 2 to access memory
within that segment will likely cause a segfault (unless yet another
thread has mmap'd that segment in the meantime, in which case the
results could be even more interesting, but certainly no better).

Also, I'm adding/modifying a couple comments in header files to mark
"sample" implementations accordingly. This is intended to give API
consumers a heads up as to the intent and level of maturity of those
sections of the codebase.

Author: William Forson <william@gluent.com>

Closes apache#203 from wdforson/master and squashes the following commits:

782713a [William Forson] Adjust 'sample' code comments
7337551 [William Forson] PARQUET-799: Fix bug in MemoryMapSource::CloseFile
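
The fix described above boils down to an at-most-once close guard. A minimal sketch of that pattern, written in Java purely for illustration (the actual change is in the C++ `MemoryMapSource::CloseFile`; these names are hypothetical):

```
// Illustrative only; names are hypothetical, not from the patch.
final class MappedRegion implements AutoCloseable {
  private long address;    // nonzero while the mapping is live
  private boolean closed;

  MappedRegion(long address) {
    this.address = address;
  }

  @Override
  public synchronized void close() {
    if (closed) {
      return;              // later calls are no-ops, so a recycled address
    }                      // can never be unmapped out from under another user
    closed = true;
    unmap(address);        // exactly one unmap per mapping
    address = 0;
  }

  private static void unmap(long addr) {
    // platform-specific; stubbed for this sketch
  }
}
```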
nevi-me added a commit that referenced this pull request Sep 14, 2020
A rebase and significant rewrite of sunchao/parquet-rs#197

Big improvement: I now use a more natural nested enum style; it helps break out the patterns of data types. The rest of the broad strokes still apply.

Goal
===

Writing many columns to a file is a chore. If you can put your values into a struct which mirrors the schema of your file, this `derive(ParquetRecordWriter)` will write out all the fields, in the order in which they are defined, to a row group.

How to Use
===

```
extern crate parquet;
#[macro_use] extern crate parquet_derive;

#[derive(ParquetRecordWriter)]
struct ACompleteRecord<'a> {
  pub a_bool: bool,
  pub a_str: &'a str,
}
```

RecordWriter trait
===

This is the new trait which `parquet_derive` will implement for your structs.

```
use super::RowGroupWriter;

pub trait RecordWriter<T> {
  fn write_to_row_group(&self, row_group_writer: &mut Box<RowGroupWriter>);
}
```

How does it work?
===

The `parquet_derive` crate adds code-generation functionality to the Rust compiler: it takes Rust syntax and emits additional syntax. This macro expansion works on Rust 1.15+ stable. It is a dynamic plugin, loaded by the machinery in cargo, so users don't have to do any special `build.rs` steps or anything like that; it's automatic once `parquet_derive` is included in their project. The `parquet_derive/src/Cargo.toml` has a section saying as much:

```
[lib]
proc-macro = true
```

The Rust struct tagged with `#[derive(ParquetRecordWriter)]` is provided to the `parquet_record_writer` function in `parquet_derive/src/lib.rs`. The `syn` crate parses the struct from a string representation into an AST (a recursive enum value). The AST contains all the values I care about when generating a `RecordWriter` impl:

 - the name of the struct
 - the lifetime variables of the struct
 - the fields of the struct

The fields of the struct are translated from AST to a flat `FieldInfo` struct. It has the bits I care about for writing a column: `field_name`, `field_lifetime`, `field_type`, `is_option`, `column_writer_variant`.

The code then does the equivalent of templating to build the `RecordWriter` implementation. The templating functionality is provided by the `quote` crate. At a high-level the template for `RecordWriter` looks like:

```
impl RecordWriter for $struct_name {
  fn write_row_group(..) {
    $({
      $column_writer_snippet
    })
  }
}
```

This template is then added under the struct definition, ending up as something like:

```
struct MyStruct {
}
impl RecordWriter for MyStruct {
  fn write_row_group(..) {
    {
       write_col_1();
    };
   {
       write_col_2();
   }
  }
}
```

And finally, _THIS_ is the code passed to rustc. It's just code now, fully expanded and standalone. If a user ever changes their `struct MyValue` definition, the `ParquetRecordWriter` will be regenerated. There are no intermediate values to version control or worry about.

Viewing the Derived Code
===

To see the generated code before it's compiled, it is very useful to install `cargo expand` ([more info on GitHub](https://github.com/dtolnay/cargo-expand)); then you can do:

```
cd $WORK_DIR/parquet-rs/parquet_derive_test
cargo expand --lib > ../temp.rs
```

and then inspect the dumped contents:

```
struct DumbRecord {
    pub a_bool: bool,
    pub a2_bool: bool,
}
impl RecordWriter<DumbRecord> for &[DumbRecord] {
    fn write_to_row_group(
        &self,
        row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>,
    ) {
        let mut row_group_writer = row_group_writer;
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        };
        {
            let vals: Vec<bool> = self.iter().map(|x| x.a2_bool).collect();
            let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
            if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
                column_writer
            {
                typed.write_batch(&vals[..], None, None).unwrap();
            }
            row_group_writer.close_column(column_writer).unwrap();
        }
    }
}
```

Now I need to write out all the combinations of types we support and make sure it writes out data.

Procedural Macros
===

The `parquet_derive` crate can ONLY export the derivation functionality; no traits, nothing else. The derive crate cannot host test cases. It's kind of like a "dummy" crate which is only used by the compiler, never by the code.

The parent crate cannot use the derivation functionality, which is important because it means test code cannot be in the parent crate. This forces us to have a third crate, `parquet_derive_test`.

I'm open to being wrong on any one of these finer points. I had to bang on this for a while to get it to compile!

Potentials For Better Design
===

 - [x] Recursion could be limited by generating the code as "snippets" instead of one big `quote!` AST generator. Or so I think. It might be nicer to push generating each column's writing code into another loop.
 - [X] ~~It would be nicer if I didn't have to be so picky about data going in to the `write_batch` function. Is it possible we could make a version of the function which accept `Into<DataType>` or similar? This would greatly simplify this derivation code as it would not need to enumerate all the supported types. Something like `write_generic_batch(&[impl Into<DataType>])` would be neat.~~ (not tackling in this generation of the plugin)
 - [X] ~~Another idea to improving writing columns, could we have a write function for `Iterator`s? I already have a `Vec<DumbRecord>`, if I could just write a mapping for accessing the one value, we could skip the whole intermediate vec for `write_batch`. Should have some significant memory advantages.~~ (not tackling in this generation of the plugin, it's a bigger parquet-rs enhancement)
 - [X] ~~It might be worthwhile to derive a parquet schema directly from a struct definition. That should stamp out opportunities for type errors.~~ (moved to #203)

Status
===

I have successfully integrated this work with my own data exporter (takes postgres/couchdb and outputs a single parquet file).

I think this code is worth including in the project, with the caveat that it only generates simplistic `RecordWriter`s. As people start to use it, we can add code generation for more complex, nested structs. We can convert the nested matching style to a fancier looping style, but for now this explicit nesting is easier to debug and understand (to me at least!).

Closes #4140 from xrl/parquet_derive

Lead-authored-by: Xavier Lange <xrlange@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Co-authored-by: Bryant Biggs <bryantbiggs@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
rafael-telles pushed a commit to rafael-telles/arrow that referenced this pull request Nov 11, 2021
Fix method docs errors on server.h and sqlite example