ARROW-203: Python: Basic filename based Parquet read/write #83
Conversation
Travis fails because we do not build parquet. This now raises the question of how we should pull in
Some thoughts on my questions from above:
# Must be in one expression to avoid calling std::move which is not possible
# in Cython (due to missing rvalue support)
reader = unique_ptr[FileReader](new FileReader(default_memory_pool(),
I think you can use reader.reset, so you aren't relying on the move functionality/rvalue references at all?
No, this is not due to the assignment to reader but about forwarding the return value of ParquetFileReader.OpenFile(filename).
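For context, the distinction being discussed can be sketched in plain C++: forwarding a factory's returned unique_ptr requires a move, while reset()/release() avoid rvalue references entirely, which is one possible workaround for Cython's limitation. The types below are hypothetical stand-ins, not the actual parquet-cpp classes.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for parquet's file reader factory.
struct ParquetFileReader {
  std::string path;
  static std::unique_ptr<ParquetFileReader> OpenFile(const std::string& p) {
    return std::unique_ptr<ParquetFileReader>(new ParquetFileReader{p});
  }
};

// Forwarding the factory's return value needs a move assignment of the
// temporary -- exactly what Cython cannot express.
std::unique_ptr<ParquetFileReader> open_with_move(const std::string& p) {
  std::unique_ptr<ParquetFileReader> reader;
  reader = ParquetFileReader::OpenFile(p);  // move assignment
  return reader;
}

// reset() + release() transfers ownership via a raw pointer instead,
// sidestepping rvalue references altogether.
std::unique_ptr<ParquetFileReader> open_with_reset(const std::string& p) {
  std::unique_ptr<ParquetFileReader> reader;
  reader.reset(ParquetFileReader::OpenFile(p).release());
  return reader;
}
```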
We are probably going to run into some issues with Cython's lackluster support for C++11 features. What I've done in other cases is create a shim layer, either as a header-only or header + companion .cpp file, that provides a simplified API for the Cython caller. I think for now it's good to get this end-to-end use case working.
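A shim layer of the kind described might look like the following header-only sketch: plain functions and raw pointers that Cython can call without touching rvalue references. All names here are illustrative, not actual Arrow or parquet-cpp APIs.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical reader type hidden behind the shim.
struct FileReader {
  std::string path;
};

// C-style factory: the unique_ptr lives only inside the shim, and raw
// ownership is handed to the Cython caller.
inline FileReader* OpenFileReader(const std::string& path) {
  std::unique_ptr<FileReader> reader(new FileReader{path});
  return reader.release();
}

// Matching teardown function so the Cython side never calls delete directly.
inline void CloseFileReader(FileReader* reader) { delete reader; }
```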
Seems reasonable to me. I'm not super familiar with this part of the code yet, so getting some feedback from @wesm also likely makes sense. (Also, it looks like the build is failing with the latest push.)
std::shared_ptr<Table> out;
ReadTableFromFile(std::move(file_reader), &out);
ASSERT_EQ(1, out->num_columns());
ASSERT_EQ(100, out->num_rows());
maybe good to get in the habit of doing const int num_rows = 100
it's really great that you got this working! re: the thirdparty toolchain, at least for testing the Python side, we may be able to use the conda dev artifacts to simplify things (https://anaconda.org/apache/parquet-cpp). Having to build parquet-cpp from source (and its thirdparty, especially Thrift) kind of stinks. let me know what you think
if (!primitive_array) {
  PARQUET_IGNORE_NOT_OK(writer.Close());
  return Status::NotImplemented("Table must consist of PrimitiveArray instances");
}
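The PARQUET_IGNORE_NOT_OK(writer.Close()) pattern closes the writer on the error path while deliberately discarding the close call's own status, so it cannot mask the NotImplemented error being returned. A minimal sketch of how such a macro can work; the Status class and macro below are illustrative stand-ins, not parquet-cpp's actual definitions.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Minimal stand-in for a Status result type.
class Status {
 public:
  static Status OK() { return Status(""); }
  static Status IOError(const std::string& msg) { return Status(msg); }
  bool ok() const { return msg_.empty(); }

 private:
  explicit Status(std::string msg) : msg_(std::move(msg)) {}
  std::string msg_;
};

// Evaluate the expression for its side effects, but discard its Status so
// cleanup failures don't shadow the error already being propagated.
#define IGNORE_NOT_OK(expr) \
  do {                      \
    Status _s = (expr);     \
    (void)_s;               \
  } while (0)

// Toy writer to show the macro still runs the cleanup.
struct Writer {
  bool closed = false;
  Status Close() {
    closed = true;
    return Status::OK();
  }
};
```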
@emkornfield per our related discussion about strings, at least in Parquet-land, variable and fixed-length binary values (and strings) are considered primitive types. So you could have a table of all string columns and it would be still semantically "flat". I don't think it's a big deal but it will impact code like this (obviously we will want to address nested data as soon as we can here)
@wesm That's a fair point, but I'm not sure the semantic flatness makes a big difference here. The complicated part of converting the arrow binary/strings to parquet still remains, and the best way to convert the data depends on the encoding. If it is plain encoding, then we would need to convert the ParquetByteArray; if it's delta-encoded length, then a new batch API for the parquet writer is likely more appropriate.
I'm not attached to either representation, so if we can get the type system to help better support some use-cases I'm for it. But it feels like for parquet the advantages of making string types primitive might not necessarily solve the harder problems. This debate will all go away when we JIT everything, right ;-) ?
It's more of a question of the type hierarchy. Conceivably we could have code involving std::is_base_of with PrimitiveArray. Does not need to be resolved here.
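A sketch of what such std::is_base_of-driven code could look like, using hypothetical class names to stand in for the Arrow array hierarchy (in which StringArray does not currently derive from PrimitiveArray).

```cpp
#include <cassert>
#include <type_traits>

// Illustrative hierarchy only -- not the real Arrow classes.
struct Array {};
struct PrimitiveArray : Array {};
struct Int32Array : PrimitiveArray {};
struct StringArray : Array {};  // today: not a PrimitiveArray

// Compile-time check generic writer code could use to accept only
// semantically flat column types.
template <typename T>
constexpr bool is_flat() {
  return std::is_base_of<PrimitiveArray, T>::value;
}

static_assert(is_flat<Int32Array>(), "int32 columns are flat");
static_assert(!is_flat<StringArray>(), "string columns are not (currently)");
```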
I left minor comments but this is a great start to get this working. Can you add a JIRA for adding support and testing for strings, booleans, and smaller integer types (int8, int16)? We will also need to deal with unsigned integers (but they will come back as signed integers until Parquet 2.0 happens someday... c'est la vie).
Addressed all comments. For the conda things, I probably need some expert advice (toolchain and
Thank you. In the interest of getting this in I'm merging this now; we should address the build/packaging comment separately (particularly if
+1
Introducing a cache to hold the projectors and filters for re-use. The cache is an LRU that can hold 100 entries.
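As a rough illustration of the cache described above (not the actual implementation), a bounded LRU can be built from a std::list tracking recency plus a hash map for O(1) lookup; the least-recently-used entry is evicted once capacity is exceeded.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>

// Minimal LRU cache sketch: Get() promotes an entry to most-recently-used,
// Put() evicts the least-recently-used entry when over capacity.
template <typename K, typename V>
class LruCache {
 public:
  explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

  void Put(const K& key, const V& value) {
    auto it = map_.find(key);
    if (it != map_.end()) {  // replace existing entry
      order_.erase(it->second.second);
      map_.erase(it);
    }
    order_.push_front(key);
    map_.emplace(key, std::make_pair(value, order_.begin()));
    if (map_.size() > capacity_) {  // evict LRU (back of the list)
      map_.erase(order_.back());
      order_.pop_back();
    }
  }

  // Returns nullptr on a miss; promotes the key to MRU on a hit.
  const V* Get(const K& key) {
    auto it = map_.find(key);
    if (it == map_.end()) return nullptr;
    order_.splice(order_.begin(), order_, it->second.second);
    return &it->second.first;
  }

 private:
  std::size_t capacity_;
  std::list<K> order_;  // front = most recently used
  std::unordered_map<K, std::pair<V, typename std::list<K>::iterator>> map_;
};
```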
…etFileReader

Author: Deepak Majeti <deepak.majeti@hpe.com>

Closes apache#83 from majetideepak/PARQUET-559 and squashes the following commits:

d7f9c47 [Deepak Majeti] modified ParquetFileReader API
b5269f4 [Deepak Majeti] modified travis ci script to print log on failure
940de12 [Deepak Majeti] fixed memory leak
6de0b70 [Deepak Majeti] print logs
e5ea341 [Deepak Majeti] print failure
9462954 [Deepak Majeti] try fixing test failure
3a0a1d7 [Deepak Majeti] modified External Stream
9d8b44c [Deepak Majeti] fixed formatting
11b3e6f [Deepak Majeti] Enable an external InputStream as a source to the ParquetFileReader
[oap-native-sql] Cast to String and cast to smaller scale type support