[Java] Document how to convert JDBC Adapter result into a Parquet file #316
Conversation
To close #315.
Thanks for adding more examples!
java/source/dataset.rst (Outdated)
===================

Go to :doc:`JDBC Adapter - Write ResultSet to Parquet File <jdbc>` for an example.
Ideally we would have a parquet example here that doesn't include things like JDBC. Do you think it would be best to add it as part of this PR?
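For reference, a JDBC-free read could look roughly like the following. This is a minimal sketch assuming the Dataset API; the file path is illustrative, and ScanOptions/scanBatches usage assumes a recent Arrow Java release.

import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.dataset.file.FileSystemDatasetFactory;
import org.apache.arrow.dataset.jni.NativeMemoryPool;
import org.apache.arrow.dataset.scanner.ScanOptions;
import org.apache.arrow.dataset.scanner.Scanner;
import org.apache.arrow.dataset.source.Dataset;
import org.apache.arrow.dataset.source.DatasetFactory;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowReader;

public class ReadParquet {
  public static void main(String[] args) throws Exception {
    // Illustrative path; dataset URIs need an explicit scheme such as file://.
    String uri = "file:///tmp/example.parquet";
    ScanOptions options = new ScanOptions(/*batchSize*/ 32768);
    try (BufferAllocator allocator = new RootAllocator();
         DatasetFactory factory = new FileSystemDatasetFactory(
             allocator, NativeMemoryPool.getDefault(), FileFormat.PARQUET, uri);
         Dataset dataset = factory.finish();
         Scanner scanner = dataset.newScan(options);
         ArrowReader reader = scanner.scanBatches()) {
      while (reader.loadNextBatch()) {
        // The reader owns the root, so we do not close it per batch.
        VectorSchemaRoot root = reader.getVectorSchemaRoot();
        System.out.println(root.contentToTSVString());
      }
    }
  }
}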
protected Schema readSchema() {
    return schema;
  }
}
Could we add whitespace to the code below so it's organized into sections? I think it will be easier to read.
java/source/jdbc.rst (Outdated)
Write ResultSet to Parquet File
===============================

In this example, we take the JDBC adapter result and write it into a Parquet file.
Hmm, was this specific example requested? I think it would be better to include a minimal read/write parquet example in dataset.rst and remove this one. The jdbc.rst already has an example for converting ResultSet to VectorSchemaRoot. What do you think?
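For readers of this thread, the existing ResultSet-to-VectorSchemaRoot conversion mentioned here is roughly along these lines. A minimal sketch, not the cookbook's exact code; the H2 connection string matches the one used in this PR, while the query is illustrative.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

import org.apache.arrow.adapter.jdbc.ArrowVectorIterator;
import org.apache.arrow.adapter.jdbc.JdbcToArrow;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;

public class ResultSetToVectorSchemaRoot {
  public static void main(String[] args) throws Exception {
    try (BufferAllocator allocator = new RootAllocator();
         Connection connection =
             DriverManager.getConnection("jdbc:h2:mem:h2-jdbc-adapter")) {
      ResultSet resultSet = connection.createStatement()
          .executeQuery("SELECT 1 AS ID, 'one' AS NAME");
      // Each iteration yields one Arrow batch converted from the ResultSet.
      try (ArrowVectorIterator it =
               JdbcToArrow.sqlToArrowVectorIterator(resultSet, allocator)) {
        while (it.hasNext()) {
          // With default config each batch is a fresh root the caller closes.
          try (VectorSchemaRoot root = it.next()) {
            System.out.println(root.contentToTSVString());
          }
        }
      }
    }
  }
}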
Ah I see the related issue now. I think it would be best if we had "read/write parquet" examples in dataset.rst and then added a very minimal example of why/how to extend the ArrowReader class for JDBC. What do you think?
That makes sense, let me also divide it.
Hi @lidavidm, do you have any recommendations on where I could start searching/reviewing for this issue? Did you see this error when you were working on it? Current error messages:

- Closed with outstanding buffers allocated
- RefCnt has gone negative
Have you isolated the problem? Looked at it in a debugger? Enabled allocation tracing?
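For reference, one way to enable allocation tracing in Arrow Java is the debug-allocator system property. The property name below is an assumption from memory, not something stated in this thread, and it must be set before any Arrow allocator class is loaded.

// Assumed switch: enables Arrow's debug allocator, which records allocation
// call sites so leaked buffers can be traced back to their origin.
// Equivalent to passing -Darrow.memory.debug.allocator=true on the JVM command line.
System.setProperty("arrow.memory.debug.allocator", "true");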
Hi @danepitkin, the changes were added as requested.
Nice work! I left a couple more comments. Let me know what you think.
What are those comments?
java/source/dataset.rst (Outdated)
Write Parquet Files
Can we move this to io.rst? That's where "Read parquet" is.
> Can we move this to io.rst? That's where "Read parquet" is.
Currently, io.rst redirects to dataset.rst for reading Parquet. What about adding "write Parquet" to io.rst and also redirecting to dataset.rst for it?
Hmm, I think it's actually better to put "Write Parquet" examples in io.rst. The dataset.rst examples are primarily for querying (reading) data.
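For context, writing Parquet via the Dataset API can look roughly like this. A sketch only: it reads the demo IPC file used elsewhere in this PR, and it assumes a recent Arrow Java where DatasetFileWriter.write takes the allocator as its first argument; the output URI is illustrative.

import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.arrow.dataset.file.DatasetFileWriter;
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.SeekableReadChannel;
import org.apache.arrow.vector.util.ByteArrayReadableSeekableByteChannel;

public class WriteParquet {
  public static void main(String[] args) throws Exception {
    try (BufferAllocator allocator = new RootAllocator()) {
      // Demo Arrow IPC file referenced elsewhere in this PR.
      byte[] ipcBytes = Files.readAllBytes(
          Paths.get("./thirdpartydeps/arrowfiles/random_access.arrow"));
      try (ArrowFileReader reader = new ArrowFileReader(
               new SeekableReadChannel(
                   new ByteArrayReadableSeekableByteChannel(ipcBytes)),
               allocator)) {
        // The destination must be a URI with a scheme, not a bare path.
        DatasetFileWriter.write(allocator, reader, FileFormat.PARQUET,
            "file:///tmp/parquet-write-demo");
      }
    }
  }
}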
Changed
java/source/io.rst (Outdated)
@@ -579,3 +579,95 @@ Reading and writing dictionary-encoded data requires separately tracking the dic
Dictionary-encoded data recovered: [0, 3, 4, 5, 7]
Dictionary recovered: Dictionary DictionaryEncoding[id=666,ordered=false,indexType=Int(8, true)] [Andorra, Cuba, Grecia, Guinea, Islandia, Malta, Tailandia, Uganda, Yemen, Zambia]
Decoded data: [Andorra, Guinea, Islandia, Malta, Uganda]

Customize Logic to Read Dataset
Can we move this to jdbc.rst? I think it fits better there since it's directly applicable.
I kept just the steps needed to implement a data reader, and referenced the jdbc page as an example.
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;

class JDBCReader extends ArrowReader {
Could we somehow delete the duplicate code here and reuse the other one? Or combine the two?
Only the jdbc page maintains this demo example now.
Awesome!
I forgot to hit "Submit Review" 😅 sorry!
I would appreciate your help with a new code review, @danepitkin.
Overall LGTM! Thank you @davisusanibar
import org.apache.arrow.vector.util.ByteArrayReadableSeekableByteChannel;

// read arrow demo data
Path uriRead = Paths.get("./thirdpartydeps/arrowfiles/random_access.arrow");
Should we add a comment describing what's in this file? Looks like it's three row groups of 3 rows each based on the output.
Added
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;

class JDBCReader extends ArrowReader {
Awesome!
  }
}

((Logger) LoggerFactory.getLogger("org.apache.arrow")).setLevel(Level.TRACE);
Why are we fiddling with loggers and adding logback to the example? I don't think we need any of that?
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;

class JDBCReader extends ArrowReader {
Explain that we need this because writing a dataset takes an ArrowReader, so we have to adapt the JDBC ArrowVectorIterator to the ArrowReader interface
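To make that concrete for readers of this thread, such an adapter can look roughly like the following. This is a sketch reconstructed from the snippets quoted in this review, not the cookbook's final code; the constructor and loadNextBatch details are illustrative and assume setReuseVectorSchemaRoot(true).

import java.io.IOException;

import org.apache.arrow.adapter.jdbc.ArrowVectorIterator;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowReader;
import org.apache.arrow.vector.types.pojo.Schema;

// Adapts the JDBC adapter's ArrowVectorIterator to the ArrowReader
// interface that the dataset writer consumes.
class JDBCReader extends ArrowReader {
  private final ArrowVectorIterator iter;
  private VectorSchemaRoot root;
  private boolean firstLoad = true;

  JDBCReader(BufferAllocator allocator, ArrowVectorIterator iter) {
    super(allocator);
    this.iter = iter;
  }

  @Override
  public boolean loadNextBatch() throws IOException {
    if (firstLoad) {
      // getVectorSchemaRoot() below pulls the first batch lazily, and the
      // writer may call it before the first loadNextBatch().
      firstLoad = false;
      return getVectorSchemaRoot() != null;
    }
    if (iter.hasNext()) {
      root = iter.next(); // Same root instance when reuse is enabled.
      return root.getRowCount() != 0;
    }
    return false;
  }

  @Override
  public long bytesRead() {
    return 0; // Not meaningful for an in-memory iterator.
  }

  @Override
  protected void closeReadSource() throws IOException {
    // The iterator and its underlying ResultSet are closed by the caller.
  }

  @Override
  protected Schema readSchema() throws IOException {
    return null; // The schema is exposed through the root instead.
  }

  @Override
  public VectorSchemaRoot getVectorSchemaRoot() throws IOException {
    if (root == null) {
      root = iter.next();
    }
    return root;
  }
}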
JdbcToArrowUtils.getUtcCalendar())
    .setTargetBatchSize(2)
    .setReuseVectorSchemaRoot(true)
    .setArraySubTypeByColumnNameMap(
In the interest of keeping examples concise, let's use sample data that doesn't require us to deal with all of this in the first place.
Hi David,

When I try to run JDBCReader I get "URI has empty scheme":

java.lang.RuntimeException: URI has empty scheme: '/tmp
    at org.apache.arrow.dataset.file.JniWrapper.writeFromScannerToFile(Native Method)
    at org.apache.arrow.dataset.file.DatasetFileWriter.write(DatasetFileWriter.java:46)
    at org.apache.arrow.dataset.file.DatasetFileWriter.write(DatasetFileWriter.java:59)

Any idea what could be causing this?

Regards,
GP
Hi @pronzato, this project also uses the JDBC reader: https://github.com/davisusanibar/java-python-by-cdata.git. Could you please try using that and confirm whether it also fails?
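For context on the error reported above: the stack trace points at the destination argument of DatasetFileWriter.write, which expects a URI with an explicit scheme rather than a bare filesystem path. A minimal sketch of the likely fix; the helper and paths are illustrative, not from this thread.

import org.apache.arrow.dataset.file.DatasetFileWriter;
import org.apache.arrow.dataset.file.FileFormat;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.ipc.ArrowReader;

public class UriSchemeFix {
  // 'reader' stands in for any ArrowReader, e.g. the JDBCReader adapter above.
  static void writeParquet(BufferAllocator allocator, ArrowReader reader) {
    // A bare path such as "/tmp/out" fails with "URI has empty scheme";
    // a file:// URI carries the scheme the native writer requires.
    DatasetFileWriter.write(allocator, reader, FileFormat.PARQUET, "file:///tmp/out");
  }
}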