Refactor Parquet compression handling to avoid native libraries #2397
runtimeOnly('com.hadoop.gplcompression:hadoop-lzo:0.4.20') // For LZO codec.
runtimeOnly('org.lz4:lz4-java:1.8.0') // For LZ4 codec.
runtimeOnly('com.github.rdblue:brotli-codec:0.1.1') // For Brotli codec.
runtimeOnly('com.fasterxml.woodstox:woodstox-core:5.2.1') {
#1162 will make this less terrible
@@ -70,7 +70,7 @@ dependencies {
'net.sf.trove4j:trove4j:3.0.3',
'com.intellij:annotations:5.1',
'commons-codec:commons-codec:1.11',
'org.apache.commons:commons-compress:1.18'
depCommonsCompress
#1162 also
...s/parquet/compression/src/main/java/io/deephaven/parquet/compress/NonClosingInputStream.java
.../parquet/compression/src/main/java/io/deephaven/parquet/compress/NonClosingOutputStream.java
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ParquetFileReader.java
final Map<String, String> extraMetaData) throws IOException {
this.pageSize = pageSize;
this.allocator = allocator;
this.extraMetaData = new HashMap<>(extraMetaData);
writeChannel = channelsProvider.getWriteChannel(filePath, false); // TODO add support for appending
this.type = type;
this.channelsProvider = channelsProvider;
CodecFactory codecFactory = new CodecFactory(new Configuration(), pageSize);
this.compressor = codecFactory.getCompressor(codecName);
DeephavenCodecFactory ccf = new DeephavenCodecFactory(Collections.emptyList());
Ditto: this should probably be a singleton, so the configured classes are shared.
Note, however, that it has no state except the CompressionCodecFactory, which in turn only holds the codecs it was built with; no other state is maintained along the way. Perhaps a default constructor could pick up the list from Configuration instead?
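As a rough illustration of the singleton suggestion above, here is a minimal sketch. `SharedCodecFactory` and its methods are hypothetical stand-ins, not the actual `DeephavenCodecFactory` API, and the codec list is modeled as plain class names rather than Hadoop types. Because the factory is immutable after construction, one shared instance should be safe for both readers and writers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a stateless codec factory; not the PR's real class. */
final class SharedCodecFactory {
    // One shared instance, built once with the default (empty) codec list.
    private static final SharedCodecFactory DEFAULT =
            new SharedCodecFactory(Collections.emptyList());

    // The only state: the codec class names this factory was built with.
    private final List<String> codecClassNames;

    SharedCodecFactory(List<String> codecClassNames) {
        // Defensive copy so later changes to the caller's list cannot leak in.
        this.codecClassNames =
                Collections.unmodifiableList(new ArrayList<>(codecClassNames));
    }

    /** Returns the shared default instance instead of constructing a new one per writer. */
    static SharedCodecFactory getInstance() {
        return DEFAULT;
    }

    List<String> codecClassNames() {
        return codecClassNames;
    }
}
```

Callers would then use `SharedCodecFactory.getInstance()` where the diff above constructs a new factory in each `ParquetFileWriter`.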
buildSrc/src/main/groovy/io.deephaven.java-repository-conventions.gradle
extensions/parquet/compression/src/main/java/io/deephaven/parquet/compress/codec/LzoCodec.java
extensions/parquet/compression/src/main/java/io/deephaven/parquet/compress/codec/ZstdCodec.java
extensions/parquet/compression/src/main/java/io/deephaven/parquet/compress/Compressor.java
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnWriterImpl.java
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnChunkReaderImpl.java
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnChunkReaderImpl.java
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnChunkReaderImpl.java
extensions/parquet/base/src/main/java/io/deephaven/parquet/base/ColumnPageReaderImpl.java
132c7bf to 3c1a110
buildSrc/src/main/groovy/io.deephaven.java-repository-conventions.gradle
0cfbbf6 to 55cd5fe
...s/parquet/compression/src/main/java/io/deephaven/parquet/compress/DeephavenCodecFactory.java
55cd5fe to c8ff3ec
Currently, this fails when I try to run tests. For ParquetTableReadWriteTest:
- Whichever one runs first (usually testParquetZstdCompressionCodec) fails with:
org.apache.hadoop.io.compress.CompressionCodec: Provider org.apache.hadoop.io.compress.BrotliCodec could not be instantiated
java.util.ServiceConfigurationError: org.apache.hadoop.io.compress.CompressionCodec: Provider org.apache.hadoop.io.compress.BrotliCodec could not be instantiated
at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:582)
at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:804)
at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:722)
at java.base/java.util.ServiceLoader$3.next(ServiceLoader.java:1395)
at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:119)
at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:180)
at io.deephaven.parquet.compress.DeephavenCodecFactory.<init>(DeephavenCodecFactory.java:123)
at io.deephaven.parquet.compress.DeephavenCodecFactory.<init>(DeephavenCodecFactory.java:119)
at io.deephaven.parquet.compress.DeephavenCodecFactory.<clinit>(DeephavenCodecFactory.java:38)
at io.deephaven.parquet.base.ParquetFileWriter.<init>(ParquetFileWriter.java:56)
at io.deephaven.parquet.table.ParquetTableWriter.getParquetFileWriter(ParquetTableWriter.java:314)
at io.deephaven.parquet.table.ParquetTableWriter.write(ParquetTableWriter.java:210)
at io.deephaven.parquet.table.ParquetTableWriter.write(ParquetTableWriter.java:155)
at io.deephaven.parquet.table.ParquetTableWriter.write(ParquetTableWriter.java:179)
at io.deephaven.parquet.table.ParquetTools.writeParquetTableImpl(ParquetTools.java:670)
at io.deephaven.parquet.table.ParquetTools.writeTable(ParquetTools.java:224)
at io.deephaven.parquet.table.ParquetTools.writeTable(ParquetTools.java:140)
at io.deephaven.parquet.table.ParquetTableReadWriteTest.compressionCodecTestHelper(ParquetTableReadWriteTest.java:265)
at io.deephaven.parquet.table.ParquetTableReadWriteTest.testParquetZstdCompressionCodec(ParquetTableReadWriteTest.java:296)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:110)
at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
at org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:62)
at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
at org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:33)
at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:94)
at com.sun.proxy.$Proxy2.processTestClass(Unknown Source)
at org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:119)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
at org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:182)
at org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:164)
at org.gradle.internal.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:414)
at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:48)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:56)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.UnsatisfiedLinkError: Couldn't load native library 'brotli'. [LoaderResult: os.name="Mac OS X", os.arch="aarch64", os.version="12.4", java.vm.name="OpenJDK 64-Bit Server VM", java.vm.version="11.0.14+9-LTS", java.vm.vendor="Azul Systems, Inc.", alreadyLoaded="null", loadedFromSystemLibraryPath="false", nativeLibName="libbrotli.dylib", temporaryLibFile="/var/folders/16/1ktb8rbx1rx2c17ykj42jxhr0000gn/T/brotli14466667012403530655/libbrotli.dylib", libNameWithinClasspath="/lib/darwin-aarch64/libbrotli.dylib", usedThisClassloader="false", usedSystemClassloader="false", java.library.path="/Users/rcaudy/Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:."]
at org.meteogroup.jbrotli.libloader.BrotliLibraryLoader.loadBrotli(BrotliLibraryLoader.java:35)
at org.apache.hadoop.io.compress.BrotliCodec.<init>(BrotliCodec.java:40)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:780)
... 69 more
- The remainder fail with:
Could not initialize class io.deephaven.parquet.compress.DeephavenCodecFactory
java.lang.NoClassDefFoundError: Could not initialize class io.deephaven.parquet.compress.DeephavenCodecFactory
at io.deephaven.parquet.base.ParquetFileWriter.<init>(ParquetFileWriter.java:56)
at io.deephaven.parquet.table.ParquetTableWriter.getParquetFileWriter(ParquetTableWriter.java:314)
at io.deephaven.parquet.table.ParquetTableWriter.write(ParquetTableWriter.java:210)
at io.deephaven.parquet.table.ParquetTableWriter.write(ParquetTableWriter.java:155)
at io.deephaven.parquet.table.ParquetTableWriter.write(ParquetTableWriter.java:179)
at io.deephaven.parquet.table.ParquetTableWriter.write(ParquetTableWriter.java:151)
at io.deephaven.parquet.table.ParquetTools.writeParquetTableImpl(ParquetTools.java:666)
at io.deephaven.parquet.table.ParquetTools.writeTable(ParquetTools.java:224)
at io.deephaven.parquet.table.ParquetTools.writeTable(ParquetTools.java:153)
at io.deephaven.parquet.table.ParquetTableReadWriteTest.groupingByLongKey(ParquetTableReadWriteTest.java:221)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.runTestClass(JUnitTestClassExecutor.java:110)
at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:58)
at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecutor.execute(JUnitTestClassExecutor.java:38)
at org.gradle.api.internal.tasks.testing.junit.AbstractJUnitTestClassProcessor.processTestClass(AbstractJUnitTestClassProcessor.java:62)
at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
at org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:33)
at org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:94)
at com.sun.proxy.$Proxy2.processTestClass(Unknown Source)
at org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:119)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:36)
at org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
at org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:182)
at org.gradle.internal.remote.internal.hub.MessageHubBackedObjectConnection$DispatchWrapper.dispatch(MessageHubBackedObjectConnection.java:164)
at org.gradle.internal.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:414)
at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:64)
at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:48)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:56)
at java.base/java.lang.Thread.run(Thread.java:829)
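The two traces above match how the JVM treats a failed static initializer: the first access surfaces the original error, and every later access gets `NoClassDefFoundError: Could not initialize class ...`. The sketch below reproduces that pattern in isolation. `FailingHolder` and `StaticInitFailureDemo` are hypothetical names, and the `UnsatisfiedLinkError` is simulated rather than coming from a real native load.

```java
/** Hypothetical class whose static initializer always fails, simulating a native-load error. */
final class FailingHolder {
    // Not a compile-time constant, so reading it triggers class initialization.
    static final Object INSTANCE = create();

    private static Object create() {
        // Simulated failure; an Error thrown during static initialization
        // propagates as-is instead of being wrapped.
        throw new UnsatisfiedLinkError("simulated: couldn't load native library");
    }
}

final class StaticInitFailureDemo {
    /** Touches FailingHolder and reports the simple name of whatever was thrown. */
    static String attempt() {
        try {
            return String.valueOf(FailingHolder.INSTANCE);
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // The first access runs the static initializer and sees the original error.
        System.out.println(attempt());
        // Later accesses find the class marked as failed.
        System.out.println(attempt());
    }
}
```

This is why only the first test to run reports the Brotli failure directly, while the remainder report the class as uninitializable.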
...s/parquet/compression/src/main/java/io/deephaven/parquet/compress/DeephavenCodecFactory.java
…uet/compress/DeephavenCodecFactory.java Co-authored-by: Ryan Caudy <rcaudy@gmail.com>
Okay, since Brotli blows up in its constructor, we're a bit limited in what we can do here. Option one: reimplement the actual BrotliCodec class (only about 100 lines long, none of it interesting) to fail later by lazily initializing the underlying native code. Option two: factor out the test and the wiring that runs it, so that Gradle only attempts to invoke it on the correct platform. I think option two is better: it lets the Brotli codec remain self-contained and keeps it off the classpath except for this specific test.
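For reference, option one (failing later rather than in the constructor) could look roughly like the sketch below. This is not the real `BrotliCodec`; `LazyNativeCodec`, its `Supplier`-based loader, and `describe()` are hypothetical names chosen for illustration. The native load is deferred until the codec is first used.

```java
import java.util.function.Supplier;

/** Hypothetical codec wrapper that defers its native load until first use. */
final class LazyNativeCodec {
    private final Supplier<Object> nativeLoader;
    private Object nativeHandle; // guarded by "this"

    /** Constructing the codec never touches native code, so discovery cannot fail here. */
    LazyNativeCodec(Supplier<Object> nativeLoader) {
        this.nativeLoader = nativeLoader;
    }

    /** Loads the native handle on first call; a load failure surfaces here instead. */
    private synchronized Object handle() {
        if (nativeHandle == null) {
            nativeHandle = nativeLoader.get();
        }
        return nativeHandle;
    }

    /** Stand-in for any operation that actually needs the native library. */
    String describe() {
        return "codec backed by " + handle();
    }
}
```

With this shape, a service-loader scan on an unsupported platform would still instantiate the codec successfully, and only callers that actually request it would see the `UnsatisfiedLinkError`.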
Co-authored-by: Ryan Caudy <rcaudy@gmail.com>
...et/table/src/brotliTest/java/io/deephaven/parquet/table/BroltiParquetTableReadWriteTest.java
import static org.junit.Assert.assertTrue;

public class BroltiParquetTableReadWriteTest {
Shouldn't this be an @Category(OutOfBandTest.class) test?
No, the codec tests are basically instant. Perhaps the read/write test should be split into two classes, with only the slow tests run out of band?
Still failing on Mac OS/M1:
Super happy we are reducing our external repositories. Created this for follow-up: #2459
Snappy should be fixed now, confirmed on ARM by @lbooker42.
This still needs some testing on ARM platforms. It removes the "parquet" codecs and replaces them with "hadoop-only" ones, while still letting these codecs be used by both readers and writers.
Downstream builds as of this patch can no longer use Brotli, but demand for Brotli seems spotty anyway. Adding the same unmaintained rdblue Brotli codec jar to the classpath causes the factory to pick it up automatically (confirmed by the test-only dependency), without leaving it on the classpath for platforms that don't support it. Follow-up work should selectively include that jar in x86 Docker containers.
Configuration is stubbed in but not yet well used; it allows certain codecs to be added ahead of others. For example, this could let an implementation specify that a particular native implementation should be used even though a fallback LZO codec is implemented in Java.
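One way to picture the "added ahead of others" behavior is a first-registered-wins lookup, sketched below. `CodecPreference`, `Entry`, and `resolve` are hypothetical names and do not reflect how `DeephavenCodecFactory` or Hadoop's `CompressionCodecFactory` resolve codecs internally; the sketch only assumes that configured codecs are registered before discovered ones.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of codec precedence; these names are not part of the PR. */
final class CodecPreference {
    /** A codec name paired with the implementation that provides it. */
    static final class Entry {
        final String codecName;
        final String implementation;

        Entry(String codecName, String implementation) {
            this.codecName = codecName;
            this.implementation = implementation;
        }
    }

    /**
     * Builds a name-to-implementation map. Configured entries are registered
     * first, so they take precedence over discovered fallbacks of the same name.
     */
    static Map<String, String> resolve(List<Entry> configured, List<Entry> discovered) {
        Map<String, String> byName = new LinkedHashMap<>();
        for (Entry e : configured) {
            byName.putIfAbsent(e.codecName, e.implementation);
        }
        for (Entry e : discovered) {
            byName.putIfAbsent(e.codecName, e.implementation);
        }
        return byName;
    }
}
```

Under that assumption, configuring a native LZO entry would shadow the pure-Java fallback, while codecs with no configured entry would still resolve to whatever was discovered.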
Fixes #2118