
Bare Java JNI Bindings optimized for Apache Spark #1798

Merged (34 commits, Jun 4, 2019)

Conversation

@eisber (Collaborator) commented Mar 20, 2019

I moved the existing Java bindings into classic/. The additional bare bindings are under bare/.

Based on @jackgerrits' comment, I added java/bare/README.md to provide some justification for why these are separate.

The actual Apache Spark bindings using MLlib interfaces will be found in MMLSpark.

@JohnLangford (Member)

@deaktator @jon-morra-zefr comments on this?

@eisber Is there an argument that we need two sets of bindings?

@jackgerrits changed the title from "Bara Java JNI Bindings optimized for Apache Spark" to "Bare Java JNI Bindings optimized for Apache Spark" on Mar 20, 2019
@jon-morra-zefr (Contributor)

The methodology seems weird here. Why have two totally different interfaces? It seems like it would be confusing for people as opposed to integrating it all into one. It also duplicates a heck of a lot of surrounding infrastructure (pom, cmake, etc.).

@eisber (Collaborator, Author) commented Mar 21, 2019

Hi,

I outlined the difference in https://github.com/eisber/vowpal_wabbit/blob/marcozo/spark/java/bare/README.md

There are a couple of reasons this is duplicated:

  • The 2nd set of bindings is not intended for end users of VW, but rather for further integration; e.g. the hashing must happen before calling into this layer.
  • The bare layer does not need the locking that happens in the classic layer, and that locking would negatively impact performance.
  • As for the pom: the bare build packages the boost and zlib dependencies into the jar. Since this is a fundamental change, I don't want to break the original approach.
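The hashing point above can be sketched as follows. This is a hypothetical illustration, not the actual bare API: VW's real hashing is a seeded murmur variant, and `String.hashCode()` here is only a stand-in showing the idea of pre-hashing feature names into VW's default 2^18 weight space on the JVM side before crossing the JNI boundary.

```java
// Hypothetical sketch of JVM-side feature pre-hashing, NOT the real bare
// bindings API. VW's actual hash is a seeded murmur variant; the stdlib
// String.hashCode() below is a stand-in used purely for illustration.
public final class FeatureHasher {
    // VW's default weight-table size is 2^18 entries (the -b 18 default).
    static final int NUM_BITS = 18;
    static final int MASK = (1 << NUM_BITS) - 1;

    /** Map a feature name to an index in [0, 2^18). */
    static int hashFeature(String featureName) {
        return featureName.hashCode() & MASK; // masking keeps it non-negative
    }

    public static void main(String[] args) {
        System.out.println("price -> " + hashFeature("price"));
    }
}
```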

@jon-morra-zefr

  • What's your proposal for integration?
  • Modify the original Java object? I'm worried about breaking existing dependencies. Also, as stated above, I don't want to bear the cost of the locking.
  • Have the classes in the same src/ folder?
  • Create a single jar file?

Markus

@JohnLangford (Member)

Reading through the README:

  1. Being able to construct examples directly in Java is a big computational win.
  2. Driving passes is possible with the current interface, so not using/allowing a cache file is a limitation of bare?
  3. Exposing allreduce/spanning tree is obviously important for parallel learning which can become a big computational win.
  4. I'm a little unclear on locking. Having locking on by default seems reasonable, since the VW reduction stack is not reentrant, but if you know you are using it in a thread-safe way it does add some overhead. I expect this is modest, however. @eisber, do you have any measurements of this?

So (1) and (3) seem like obviously desirable additional features of an interface.

The biggest concern I have is around having two interfaces---a single interface combining 'classic' with 1, 3, and optionally 4 would be more powerful, less confusing, and more maintainable.

@eisber (Collaborator, Author) commented Mar 22, 2019

  1. One could expose RunPasses, but the primary use-case is Spark, and there we have all the examples hashed in memory anyway, so performing disk I/O would just hurt perf.
  2. Not yet, but it feels weird to artificially introduce additional computation when trying to strip away as much as possible (though I can relate to adding some computation for the sake of simplifying the API).

Please also consider the different packaging, as the bare interface includes all the binaries inside the jar. It's definitely desirable to do it this way for ease of use on Azure Databricks/Spark; it's less clear for other existing use-cases, as it requires write access to a temp directory.
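The temp-directory caveat comes from how a jar-bundled native library has to be loaded: System.load needs a real filesystem path, so the shared library must first be copied out of the jar. A minimal sketch of that step (the helper name and resource layout are assumptions, not the PR's actual code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch of loading a native library bundled inside a jar: the bytes are
// copied out to a temp file, which is why write access to a temp directory
// is required. Names here are illustrative, not the PR's actual code.
public final class NativeLoader {
    /** Extract a resource stream to a temp file; the caller then System.load()s it. */
    static Path extractToTemp(InputStream in, String libName) throws IOException {
        Path tmp = Files.createTempFile(libName, ".so");
        tmp.toFile().deleteOnExit();
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        // In the real case the stream would come from something like
        // NativeLoader.class.getResourceAsStream("/natives/libvw_jni.so").
        Path p = extractToTemp(
                new java.io.ByteArrayInputStream(new byte[]{1, 2, 3}), "libvw_jni");
        System.out.println("extracted " + Files.size(p) + " bytes to " + p);
        // System.load(p.toString()); // would load the real library
    }
}
```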

As for maintainability: the overlap between the old and the new interface is minimal (basically the vw::initialize() and finish() calls). Also, the pom files differ significantly (all binaries are included in the jar; unit tests cannot be used, but rather integration tests, as the classpath needs to be set on the jar to be able to extract the binaries).

The bare API is built with the Spark wrapper code in mind, which can control the state properly. Looking at https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/java/src/main/java/vowpalWabbit/learner/VWIntLearner.java#L38,
there is overhead for the lock and for checking whether the learner is open.
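The per-call overhead in question follows the pattern below: each learn/predict in the classic layer takes a lock and checks an open flag before reaching the native code. This is a paraphrase of that structure with a stubbed native call, not the actual VWIntLearner source:

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Paraphrase of the classic layer's per-call pattern: lock + isOpen check
// around every native call. The bare layer skips this because the Spark
// wrapper controls threading and lifetime itself. nativeLearnStub stands
// in for the real JNI call.
public final class LockingLearner {
    private final Lock lock = new ReentrantLock();
    private volatile boolean open = true;

    public int learn(String example) {
        lock.lock();
        try {
            if (!open) {
                throw new IllegalStateException("learner was closed");
            }
            return nativeLearnStub(example);
        } finally {
            lock.unlock();
        }
    }

    public void close() {
        open = false;
    }

    // Stand-in for the real JNI call.
    private int nativeLearnStub(String example) {
        return example.length();
    }

    public static void main(String[] args) {
        LockingLearner learner = new LockingLearner();
        System.out.println(learner.learn("1 |f a"));
    }
}
```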

Maybe naming can solve the confusion and a bit more cleanup in the pom.

@eisber (Collaborator, Author) commented Mar 22, 2019

Potentially, building the classic bindings on top of bare would be an option... though the packaging and library loading need some careful thought. 2nd iteration?

@JohnLangford (Member)

Is there a reason to avoid a union interface?

@jon-morra-zefr (Contributor)

I haven't looked at this in a while, so I'm not going to comment on the details here. Suffice it to say that it feels like the goal here is to expose a different interface in addition to the one already exposed. Ideally these 2 interfaces would be right beside each other in the same code base. The build tools would hopefully be modified to handle producing multiple artifacts instead of having 2 build chains. But since I'm not in the weeds, I'm not going to put my foot down here, just food for thought.

@JohnLangford (Member)

As far as names: bare/classic doesn't seem to quite have the right connotations. Would spark/simple be a better pair of names? The original interface seems to be optimized around safe use, while the new one is really optimized for efficiency, particularly in Spark.

W.r.t. artifacts, it seems like 2 are essential, because one does and one does not incorporate binaries. (It's a bit ironic that we previously incorporated binaries.)

@eisber, I don't follow your comment about 'same can be done for C++ code'. Other than that, it seems like shifting the naming and consolidating the build a bit would be quite helpful for maintenance in the future. Can you do that?

@eisber (Collaborator, Author) commented Apr 1, 2019

I'm happy to rename classic -> simple, bare -> spark.

I can refactor the pom.xml to avoid duplication. 'Same can be done for C++ code' refers to the CMake files.

I'm still iterating on implementation, so it might take a bit.

@JohnLangford (Member)

Ok, sounds like we have a plan.

(Resolved review threads on java/CMakeLists.txt and java/README.md.)
@JohnLangford (Member)

@eisber Can you respond to @jackgerrits ? This looks very close to a merge so I'd like to get it in for the next release.

Commits added before merge:
  • formatted code
  • improved Spanning Tree error message
  • re-added README.md
  • moved JNI util to the right place
@eisber eisber merged commit fa5c3df into VowpalWabbit:master Jun 4, 2019