Bare Java JNI Bindings optimized for Apache Spark #1798
Conversation
…hyper-parameter tuning on Spark
…he classic headers
… into marcozo/spark
@deaktator @jon-morra-zefr comments on this? @eisber Is there an argument that we need two sets of bindings?
The methodology seems weird here. Why have two totally different interfaces? It seems like it would be confusing for people as opposed to integrating it all into one. It also duplicates a heck of a lot of surrounding infrastructure (pom, cmake, etc.).
Hi, I outlined the differences in https://github.com/eisber/vowpal_wabbit/blob/marcozo/spark/java/bare/README.md. There are a couple of reasons this is duplicated:
Markus
Reading through the README:
So (1) and (3) seem like obviously desirable additional features of an interface. The biggest concern I have is around having two interfaces: a single interface combining 'classic' with 1, 3, and optionally 4 would be more powerful, less confusing, and more maintainable.
Please also consider the different packaging: the bare interface includes all the binaries inside the jar. That's definitely desirable for ease of use on Azure Databricks/Spark, but less clearly so for other existing use-cases, as it requires write access to a temp directory. As for maintainability: the overlap between the old and the new interface is minimal (basically the vw::initialize() and finish() calls). The pom files also differ significantly (all binaries are included in the jar, and unit tests cannot be used; integration tests are needed instead, since the classpath must be set on the jar to be able to extract the binaries). The bare API is built with the Spark wrapper code in mind, which can control the state properly; compare https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/java/src/main/java/vowpalWabbit/learner/VWIntLearner.java#L38. Maybe naming, plus a bit more cleanup in the pom, can resolve the confusion.
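To make the packaging point concrete, here is a minimal sketch of the kind of loader that jar-embedded binaries imply: extract the bundled shared library to a temp file, then load it by absolute path. The class name, resource path, and library name are illustrative assumptions, not the actual layout in this PR.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class NativeLoader {
    // Hypothetical resource path; the real jar layout in this PR may differ.
    private static final String RESOURCE = "/natives/libvw_jni.so";

    static void load() throws IOException {
        // Extract the bundled shared library to a temporary file.
        Path tmp = Files.createTempFile("libvw_jni", ".so");
        tmp.toFile().deleteOnExit();
        try (InputStream in = NativeLoader.class.getResourceAsStream(RESOURCE)) {
            if (in == null) {
                throw new IOException("native library not found in jar: " + RESOURCE);
            }
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        }
        // Load by absolute path; this is why the bare packaging needs
        // write access to a temp directory.
        System.load(tmp.toAbsolutePath().toString());
    }
}
```

This also illustrates why integration tests are needed rather than unit tests: the library can only be extracted once the code is actually running from the packaged jar.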
Potentially, building classic on top of bare would be an option... though the packaging and library loading need some careful thought. 2nd iteration?
Is there a reason to avoid a union interface?
I haven't looked at this in a while, so I'm not going to comment on the details here. Suffice it to say that it feels like the goal here is to expose a different interface in addition to the one already exposed. Ideally these 2 interfaces would be right beside each other in the same code base. The build tools would hopefully be modified to handle producing multiple artifacts instead of having 2 build chains. But since I'm not in the weeds, I'm not going to put my foot down here, just food for thought.
As far as names: bare/classic doesn't seem to quite have the right connotations. Would spark/simple be a better name? The original interface seems to be optimized around safe use, while the new one is really optimized for efficiency, particularly in Spark. W.r.t. artifacts, it seems like 2 are essential because one does and the other does not incorporate binaries. (It's a bit ironic that we previously incorporated binaries.) @eisber, I don't follow your comment about 'same can be done for C++ code'. Other than that, it seems like shifting the naming and consolidating the build a bit would be quite helpful for maintenance in the future. Can you do that?
I'm happy to rename classic -> simple, bare -> spark. I can refactor the pom.xml to avoid duplication. I'm still iterating on implementation, so it might take a bit.
Ok, sounds like we have a plan.
… into marcozo/spark
…into marcozo/spark
@eisber Can you respond to @jackgerrits? This looks very close to a merge, so I'd like to get it in for the next release.
…into marcozo/spark
formatted code
improved Spanning Tree error message
re-added README.md
moved JNI util to the right place
I moved the existing Java bindings into classic/
The additional bare bindings are under bare/
Based on @jackgerrits' comment, I added java/bare/README.md to provide some justification for why these are separate.
The actual Apache Spark bindings using MLlib interfaces will be found in MMLSpark.