ARROW-92: Arrow to Parquet Schema conversion #68

xhochy · 2016-04-24T17:58:23Z

This is my current WIP state. To make the schema conversion complete, we will probably need the physical structure too, since Arrow schemas only describe logical types whereas a Parquet schema describes both logical and physical types.

wesm · 2016-04-24T23:23:47Z

We'll have to make some decisions about type mappings. For example:

arrow::StringType becomes BYTE_ARRAY with UTF8 annotation
arrow::BinaryType (needs to be implemented) becomes BYTE_ARRAY with no ConvertedType
arrow::CharType (if ever used, we can skip it for now) becomes FIXED_LEN_BYTE_ARRAY
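The three mappings listed above can be sketched as a small lookup. This is only an illustration: the enums and the `MapStringLike` function are hypothetical stand-ins so the snippet compiles on its own, not the real Arrow or parquet-cpp types. The list does not name an annotation for CharType, so the sketch leaves it unset, which is an assumption.

```cpp
#include <cassert>

// Hypothetical stand-ins for the Arrow and Parquet enums (illustrative only).
enum class ArrowKind { STRING, BINARY, CHAR };
enum class PhysicalType { BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY };
enum class Annotation { NONE, UTF8 };

struct ParquetLeaf {
  PhysicalType physical;
  Annotation annotation;
  int type_length;  // only meaningful for FIXED_LEN_BYTE_ARRAY, otherwise -1
};

// Map a string-like Arrow type to a Parquet leaf description.
// `char_width` is only consulted for CHAR and is the fixed width in bytes.
inline ParquetLeaf MapStringLike(ArrowKind kind, int char_width = -1) {
  switch (kind) {
    case ArrowKind::STRING:
      return {PhysicalType::BYTE_ARRAY, Annotation::UTF8, -1};
    case ArrowKind::BINARY:
      return {PhysicalType::BYTE_ARRAY, Annotation::NONE, -1};
    case ArrowKind::CHAR:
      assert(char_width > 0);  // a fixed-length column needs a width
      // Assumption: no annotation, since the list above does not name one.
      return {PhysicalType::FIXED_LEN_BYTE_ARRAY, Annotation::NONE, char_width};
  }
  return {PhysicalType::BYTE_ARRAY, Annotation::NONE, -1};  // not reached
}
```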

For List types, we should use the 3-level array encoding as described here https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema/types.h#L49
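For reference, the 3-level layout wraps the element in a repeated group, which itself sits inside a group annotated as LIST. The sketch below builds and prints that shape with a hypothetical `Node` struct. The inner names `list` and `element` follow the Parquet format specification's LIST convention and may differ from the names used in the linked header.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal schema node, not the parquet-cpp Node class.
struct Node {
  std::string repetition;  // "required", "optional" or "repeated"
  std::string type;        // "group" or a physical type name
  std::string name;
  std::string annotation;  // e.g. "LIST", or empty
  std::vector<Node> children;
};

// Wrap `element` in the 3-level list structure:
// outer LIST group -> repeated group "list" -> element.
Node MakeList(const std::string& name, bool list_nullable, Node element) {
  assert(!name.empty());  // a list column needs a name
  element.name = "element";
  Node repeated{"repeated", "group", "list", "", {std::move(element)}};
  return Node{list_nullable ? "optional" : "required", "group", name, "LIST",
              {std::move(repeated)}};
}

// Render a node tree in a Parquet-schema-like text form.
std::string Render(const Node& node, int indent = 0) {
  const std::string pad(indent * 2, ' ');
  std::string out = pad + node.repetition + " " + node.type + " " + node.name;
  if (!node.annotation.empty()) out += " (" + node.annotation + ")";
  if (node.children.empty()) return out + ";\n";
  out += " {\n";
  for (const Node& child : node.children) out += Render(child, indent + 1);
  return out + pad + "}\n";
}
```

Calling `Render(MakeList("values", true, Node{"optional", "int32", "", "", {}}))` yields an optional group `values` annotated LIST, containing a repeated group `list`, containing an optional int32 `element`.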

What other parts do you think are underspecified?

xhochy · 2016-04-25T07:17:12Z

For Decimal, we need to decide whether using the smallest possible physical type is the correct strategy.
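To make the "smallest possible physical type" strategy concrete, here is a minimal sketch. It assumes the Parquet rule that a decimal may be stored as INT32 up to precision 9 and as INT64 up to precision 18, and falls back to a FIXED_LEN_BYTE_ARRAY of minimal width otherwise. The type and function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the Parquet physical type enum.
enum class PhysicalType { INT32, INT64, FIXED_LEN_BYTE_ARRAY };

struct DecimalStorage {
  PhysicalType type;
  int byte_length;  // width of the stored unscaled value in bytes
};

// Smallest number of bytes whose signed two's-complement range holds every
// unscaled value with `precision` decimal digits.
int MinBytesForPrecision(int precision) {
  assert(precision > 0);
  // Need 2^(8n - 1) >= 10^precision, i.e. n >= (precision * log2(10) + 1) / 8.
  return static_cast<int>(
      std::ceil((precision * std::log2(10.0) + 1.0) / 8.0));
}

// Pick the narrowest storage for a decimal of the given precision.
DecimalStorage SmallestDecimalStorage(int precision) {
  if (precision <= 9) return {PhysicalType::INT32, 4};
  if (precision <= 18) return {PhysicalType::INT64, 8};
  return {PhysicalType::FIXED_LEN_BYTE_ARRAY, MinBytesForPrecision(precision)};
}
```

The open question in this thread is whether this choice should be made implicitly by the converter, as above, or stated explicitly on the Arrow type.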

wesm · 2016-04-25T18:50:12Z

I see. For decimals, I agree that we need either multiple Arrow types or metadata on the DecimalType indicating the physical storage type. I would say it's better to make this explicit in the Arrow data type; let me know what you think.

xhochy · 2016-05-01T08:52:59Z

A simple storage_type field on the DecimalType would probably be enough. Since this probably needs to go into the spec, I opened separate issues for it: https://issues.apache.org/jira/browse/ARROW-183 and https://issues.apache.org/jira/browse/ARROW-184

xhochy · 2016-05-01T15:45:11Z

The PR is now in a state that provides a minimal schema conversion basis for Pandas<->Parquet.

wesm · 2016-05-01T16:47:37Z

cpp/src/arrow/parquet/schema.cc

+      break;
+    case Type::CHAR:
+      type = ParquetType::FIXED_LEN_BYTE_ARRAY;
+      logical_type = LogicalType::UTF8;


Aside: we'll need to visit the string encoding question, as logical unicode characters won't map neatly onto a char(n) type
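A small illustration of the mismatch, assuming UTF-8 encoded values: the number of bytes a string occupies differs from the number of code points it contains, so a fixed byte width does not correspond to a fixed character count. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count Unicode code points in a string that is assumed to be valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx and are not counted.
std::size_t CountCodePoints(const std::string& utf8) {
  std::size_t count = 0;
  for (unsigned char byte : utf8) {
    if ((byte & 0xC0) != 0x80) ++count;
  }
  return count;
}

// Hypothetical check: does the encoded value fit a fixed-width slot?
bool FitsFixedWidth(const std::string& utf8, std::size_t width_in_bytes) {
  assert(width_in_bytes > 0);
  return utf8.size() <= width_in_bytes;
}
```

For example, the five-character word "naïve" occupies six bytes in UTF-8, so it would not fit a 5-byte FIXED_LEN_BYTE_ARRAY slot even though it has only five logical characters.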

wesm · 2016-05-01T16:51:42Z

This looks good, apart from the exception handling question.
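The PR timeline lists a follow-up commit titled "Add macro to convert ParquetException to Status". A minimal sketch of that general idea is shown here; the `Status` and `ParquetException` classes, the macro name, and the helper functions are simplified, hypothetical stand-ins and not the code from that commit.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Simplified stand-in for a status type returned instead of throwing.
class Status {
 public:
  static Status OK() { return Status(true, ""); }
  static Status IOError(std::string msg) {
    return Status(false, std::move(msg));
  }
  bool ok() const { return ok_; }
  const std::string& message() const { return message_; }

 private:
  Status(bool ok, std::string msg) : ok_(ok), message_(std::move(msg)) {}
  bool ok_;
  std::string message_;
};

// Simplified stand-in for the exception type thrown by the Parquet side.
class ParquetException : public std::runtime_error {
 public:
  using std::runtime_error::runtime_error;
};

// Hypothetical macro: evaluate `expr` and turn a thrown ParquetException
// into an early return of a failed Status from the enclosing function.
#define RETURN_STATUS_ON_PARQUET_EXCEPTION(expr) \
  do {                                           \
    try {                                        \
      (expr);                                    \
    } catch (const ParquetException& e) {        \
      return Status::IOError(e.what());          \
    }                                            \
  } while (0)

// Stand-in for a call into exception-throwing code.
void BuildNode(bool supported) {
  if (!supported) throw ParquetException("unsupported type");
}

// Exception-free wrapper that reports failures through Status.
Status ConvertField(bool supported) {
  RETURN_STATUS_ON_PARQUET_EXCEPTION(BuildNode(supported));
  return Status::OK();
}
```

With this shape, callers on the Arrow side only check `Status::ok()` and never see an exception cross the library boundary.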

wesm · 2016-05-01T22:53:08Z

+1, thank you

* Fix missing set the include directory of gtest * Fix to use same format as other dependencies

Author: Aliaksei Sandryhaila <aliaksei.sandryhaila@hp.com> Closes apache#68 from asandryh/PARQUET-537 and squashes the following commits: 18dca87 [Aliaksei Sandryhaila] Added a unit test. dfb7a0b [Aliaksei Sandryhaila] PARQUET-537: Ensure that LocalFileSource is properly closed.


* Offset buffer can be pre-grown in Parquet ByteArray reader * nit

* Initial commit * Introduce TranslateHolder * Remove unused header

* Add translate expression support (apache#68) * Initial commit * Introduce TranslateHolder * Remove unused header * Return 1 if empty string is given as substring (apache#69) * Add two math operations: floor & ceil (apache#72) * Inital commit * Add ceil function Co-authored-by: PHILO-HE <feilong.he@intel.com>


ARROW-92: Arrow to Parquet Schema conversion

8a0293e

xhochy added 2 commits May 1, 2016 10:44

Add more types

9a6c876

make format

38e68e5

xhochy added 2 commits May 1, 2016 11:02

Add struct conversion

42ed0ea

Include string

9c5b085

xhochy changed the title ~~[WIP] ARROW-92: Arrow to Parquet Schema conversion~~ ARROW-92: Arrow to Parquet Schema conversion May 1, 2016

wesm reviewed May 1, 2016

Add macro to convert ParquetException to Status

e3aa261

asfgit closed this in 355f7c9 May 1, 2016

xhochy deleted the arrow-92 branch March 7, 2017 16:16

praveenbingo pushed a commit to praveenbingo/arrow that referenced this pull request Aug 30, 2018

Fix missing include directory of gtest in CMakeLists.txt (apache#68)

bc387f3

* Fix missing set the include directory of gtest * Fix to use same format as other dependencies

praveenbingo pushed a commit to praveenbingo/arrow that referenced this pull request Aug 30, 2018

Fix missing include directory of gtest in CMakeLists.txt (apache#68)

871303a

* Fix missing set the include directory of gtest * Fix to use same format as other dependencies

praveenbingo pushed a commit to praveenbingo/arrow that referenced this pull request Aug 30, 2018

Fix missing include directory of gtest in CMakeLists.txt (apache#68)

77c01d4

* Fix missing set the include directory of gtest * Fix to use same format as other dependencies

praveenbingo pushed a commit to praveenbingo/arrow that referenced this pull request Aug 30, 2018

Fix missing include directory of gtest in CMakeLists.txt (apache#68)

37ccfe8

* Fix missing set the include directory of gtest * Fix to use same format as other dependencies

praveenbingo pushed a commit to praveenbingo/arrow that referenced this pull request Sep 4, 2018

Fix missing include directory of gtest in CMakeLists.txt (apache#68)

fe6a5cf

* Fix missing set the include directory of gtest * Fix to use same format as other dependencies

praveenbingo pushed a commit to praveenbingo/arrow that referenced this pull request Sep 10, 2018

Fix missing include directory of gtest in CMakeLists.txt (apache#68)

f3cd4ca

* Fix missing set the include directory of gtest * Fix to use same format as other dependencies

praveenbingo pushed a commit to praveenbingo/arrow that referenced this pull request Sep 10, 2018

Fix missing include directory of gtest in CMakeLists.txt (apache#68)

5eba775

* Fix missing set the include directory of gtest * Fix to use same format as other dependencies

xuechendi pushed a commit to xuechendi/arrow that referenced this pull request Aug 4, 2020

Offset buffer can be pre-grown in Parquet ByteArray reader (apache#68)

4d166c4

* Offset buffer can be pre-grown in Parquet ByteArray reader * nit

zhouyuan pushed a commit to zhouyuan/arrow that referenced this pull request Jan 6, 2022

Add translate expression support (apache#68)

28fbddf

* Initial commit * Introduce TranslateHolder * Remove unused header

zhztheplayer pushed a commit to zhztheplayer/arrow-1 that referenced this pull request Jan 7, 2022

Add translate expression support (apache#68)

2f46e8a

* Initial commit * Introduce TranslateHolder * Remove unused header

zhztheplayer pushed a commit to zhztheplayer/arrow-1 that referenced this pull request Feb 8, 2022

Add translate expression support (apache#68)

7f40757

* Initial commit * Introduce TranslateHolder * Remove unused header

zhztheplayer pushed a commit to zhztheplayer/arrow-1 that referenced this pull request Mar 3, 2022

Add translate expression support (apache#68)

a8c8fcc

* Initial commit * Introduce TranslateHolder * Remove unused header

rui-mo pushed a commit to rui-mo/arrow-1 that referenced this pull request Mar 23, 2022

Add translate expression support (apache#68)

3e6b037

* Initial commit * Introduce TranslateHolder * Remove unused header

paleolimbot mentioned this pull request Jan 28, 2023

[R] Crash on MacOS (x86) when running tests with homebrew apache-arrow also installed #33903

Closed

github-actions bot mentioned this pull request Aug 30, 2024

GH-33999: [Go] Removed cast from byte[] to string and copied entire string when Value() is called #34450

Closed
