
Bare Java JNI Bindings optimized for Apache Spark #1798

Merged (34 commits, Jun 4, 2019)

Conversation

@eisber (Collaborator) commented Mar 20, 2019

I moved the existing Java bindings into classic/. The additional bare bindings are under bare/.

Based on @jackgerrits' comment, I added java/bare/README.md to provide some justification for why these are separate.

The actual Apache Spark bindings using MLlib interfaces will be found in MMLSpark.

@JohnLangford (Member)

@deaktator @jon-morra-zefr comments on this?

@eisber Is there an argument that we need two sets of bindings?

@jackgerrits changed the title from "Bara Java JNI Bindings optimized for Apache Spark" to "Bare Java JNI Bindings optimized for Apache Spark" on Mar 20, 2019
@jon-morra-zefr (Contributor)

The methodology seems weird here. Why have two totally different interfaces? It seems like it would be confusing for people as opposed to integrating it all into one. It also duplicates a heck of a lot of surrounding infrastructure (pom, cmake, etc.).

@eisber (Collaborator, Author) commented Mar 21, 2019

Hi,

I outlined the difference in https://github.com/eisber/vowpal_wabbit/blob/marcozo/spark/java/bare/README.md

There are a couple of reasons this is duplicated:

  • The 2nd set of bindings is not intended for end users of VW, but rather for further integration; e.g. the hashing must happen before calling into this layer.
  • The bare layer does not need the locking that happens in the classic layer, and that locking would negatively impact performance.
  • As for the pom: the bare build packages the boost and zlib dependencies into the jar. Since this is a fundamental change, I don't want to break the original approach.
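The hashing point above can be sketched as follows. This is a hypothetical illustration, not the actual bare API: VW's real hashing is a seeded murmur variant, and `String.hashCode()` here is only a stand-in showing the idea of pre-hashing feature names into VW's default 2^18 weight space on the JVM side before crossing the JNI boundary.

```java
// Hypothetical sketch of JVM-side feature pre-hashing, NOT the real bare
// bindings API. VW's actual hash is a seeded murmur variant; the stdlib
// String.hashCode() below is a stand-in used purely for illustration.
public final class FeatureHasher {
    // VW's default weight-table size is 2^18 entries (the -b 18 default).
    static final int NUM_BITS = 18;
    static final int MASK = (1 << NUM_BITS) - 1;

    /** Map a feature name to an index in [0, 2^18). */
    static int hashFeature(String featureName) {
        return featureName.hashCode() & MASK; // masking keeps it non-negative
    }

    public static void main(String[] args) {
        System.out.println("price -> " + hashFeature("price"));
    }
}
```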

@jon-morra-zefr

  • What's your proposal for integration?
  • Modify the original Java object? I'm worried about breaking existing dependencies. Also, as stated above, I don't want to bear the cost of the locking.
  • Have the classes in the same src/ folder?
  • Create a single jar file?

Markus

@JohnLangford (Member)

Reading through the README:

  1. Being able to construct examples directly in Java is a big computational win.
  2. Driving passes is possible with the current interface, so not using/allowing a cache file is a limitation of bare?
  3. Exposing allreduce/spanning tree is obviously important for parallel learning which can become a big computational win.
  4. I'm a little unclear on locking. Having locking on by default seems reasonable, since the VW reduction stack is not reentrant, but if you know you are using it in a thread-safe way it does add some overhead. I expect this is modest, however. @eisber, do you have any measurements of this?

So (1) and (3) seem like obviously desirable additional features of an interface.

The biggest concern I have is around having two interfaces---a single interface combining 'classic' with 1, 3, and optionally 4 would be more powerful, less confusing, and more maintainable.

@eisber (Collaborator, Author) commented Mar 22, 2019

  1. One could expose RunPasses, but the primary use-case is Spark, and there we have all the examples hashed in memory anyway, so performing disk I/O would just hurt perf.
  2. Not yet, but it feels weird to artificially introduce additional computation when trying to strip away as much as possible (though I can relate to adding some computation for the sake of simplifying the API).

Please also consider the different packaging, as the bare interface includes all the binaries inside the jar. It's definitely desirable to do it this way for ease of use on Azure Databricks/Spark; it's less clear for other existing use-cases, as it requires write access to a temp directory.
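The temp-directory caveat comes from how a jar-bundled native library has to be loaded: System.load needs a real filesystem path, so the shared library must first be copied out of the jar. A minimal sketch of that step (the helper name and resource layout are assumptions, not the PR's actual code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch of loading a native library bundled inside a jar: the bytes are
// copied out to a temp file, which is why write access to a temp directory
// is required. Names here are illustrative, not the PR's actual code.
public final class NativeLoader {
    /** Extract a resource stream to a temp file; the caller then System.load()s it. */
    static Path extractToTemp(InputStream in, String libName) throws IOException {
        Path tmp = Files.createTempFile(libName, ".so");
        tmp.toFile().deleteOnExit();
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        // In the real case the stream would come from something like
        // NativeLoader.class.getResourceAsStream("/natives/libvw_jni.so").
        Path p = extractToTemp(
                new java.io.ByteArrayInputStream(new byte[]{1, 2, 3}), "libvw_jni");
        System.out.println("extracted " + Files.size(p) + " bytes to " + p);
        // System.load(p.toString()); // would load the real library
    }
}
```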

As for maintainability: the overlap between the old and the new interface is minimal (basically the vw::initialize() and finish() calls). Also, the pom files differ significantly (all binaries are included in the jar; unit tests cannot be used, but rather integration tests, as the classpath needs to be set on the jar to be able to extract the binaries).

The bare API is built with the Spark wrapper code in mind, which can control the state properly. Looking at https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/java/src/main/java/vowpalWabbit/learner/VWIntLearner.java#L38,
there is overhead for the lock and for checking whether the learner is open.
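The per-call overhead in question follows the pattern below: each learn/predict in the classic layer takes a lock and checks an open flag before reaching the native code. This is a paraphrase of that structure with a stubbed native call, not the actual VWIntLearner source:

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Paraphrase of the classic layer's per-call pattern: lock + isOpen check
// around every native call. The bare layer skips this because the Spark
// wrapper controls threading and lifetime itself. nativeLearnStub stands
// in for the real JNI call.
public final class LockingLearner {
    private final Lock lock = new ReentrantLock();
    private volatile boolean open = true;

    public int learn(String example) {
        lock.lock();
        try {
            if (!open) {
                throw new IllegalStateException("learner was closed");
            }
            return nativeLearnStub(example);
        } finally {
            lock.unlock();
        }
    }

    public void close() {
        open = false;
    }

    // Stand-in for the real JNI call.
    private int nativeLearnStub(String example) {
        return example.length();
    }

    public static void main(String[] args) {
        LockingLearner learner = new LockingLearner();
        System.out.println(learner.learn("1 |f a"));
    }
}
```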

Maybe naming can solve the confusion and a bit more cleanup in the pom.

@eisber (Collaborator, Author) commented Mar 22, 2019

Potentially, building the classic bindings on top of bare would be an option... though the packaging and library loading need some careful thought. 2nd iteration?

@JohnLangford (Member)

Is there a reason to avoid a union interface?

@jon-morra-zefr (Contributor)

I haven't looked at this in a while, so I'm not going to comment on the details here. Suffice it to say that it feels like the goal here is to expose a different interface in addition to the one already exposed. Ideally these 2 interfaces would be right beside each other in the same code base. The build tools would hopefully be modified to handle producing multiple artifacts instead of having 2 build chains. But since I'm not in the weeds, I'm not going to put my foot down here, just food for thought.

@JohnLangford (Member)

As far as names: bare/classic doesn't seem to quite have the right connotations. Would spark/simple be a better pair of names? The original interface seems to be optimized around safe use, while the new one is really optimized for efficiency, particularly in Spark.

W.r.t. artifacts, it seems like 2 are essential, because one does and one does not incorporate binaries. (It's a bit ironic that we previously incorporated binaries.)

@eisber, I don't follow your comment about 'same can be done for C++ code'. Other than that, it seems like shifting the naming and consolidating the build a bit would be quite helpful for maintenance in the future. Can you do that?

@eisber (Collaborator, Author) commented Apr 1, 2019

I'm happy to rename classic -> simple, bare -> spark.

I can refactor the pom.xml to avoid duplication. 'Same can be done for C++ code' refers to the CMake files.

I'm still iterating on implementation, so it might take a bit.

@JohnLangford (Member)

Ok, sounds like we have a plan.

(Resolved review threads on java/CMakeLists.txt and java/README.md.)
@JohnLangford (Member)

@eisber Can you respond to @jackgerrits ? This looks very close to a merge so I'd like to get it in for the next release.

Commits added before merge:
  • formatted code
  • improved Spanning Tree error message
  • re-added README.md
  • moved JNI util to the right place
@eisber eisber merged commit fa5c3df into VowpalWabbit:master Jun 4, 2019