This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-984] Add Java NDArray and introduce Java Operator Builder class #12772

Closed
Wants to merge 10 commits.

Conversation

@lanking520 (Member) commented Oct 9, 2018

Description

This is a PR based on @andrewfayres's PR #12757. Please merge that one first and then this one.

In this PR, the Java NDArray is introduced, along with a brand-new operator Builder class. Deprecated operators were removed from the Java API macros. Please see the unit tests for more information.
@andrewfayres @nswamy @yzhliu @gigasquid

Special thanks to @zachgk and @piyushghai for guidance and support on the new Java API!
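
For illustration only, here is a rough sketch of the builder style this PR introduces for operators. The names below are made up and are not the actual generated code: required arguments go in the constructor, optional ones are chained setters, and invoke() performs the call.

    // Hypothetical sketch, not MXNet code: a Java-friendly operator builder in Scala.
    class OnesLikeBuilder(shape: Array[Int]) {
      private var device: String = "cpu"                      // optional argument with a default
      def setDevice(d: String): OnesLikeBuilder = { device = d; this }
      def invoke(): Array[Float] =
        Array.fill(shape.product)(1.0f)                       // stand-in for the real backend call
    }

    // From Java this would read as:
    //   float[] out = new OnesLikeBuilder(new int[]{2, 3}).setDevice("cpu").invoke();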

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@lanking520 lanking520 requested a review from szha as a code owner October 9, 2018 21:05
@sandeep-krishnamurthy sandeep-krishnamurthy added the Scala, pr-awaiting-review (PR is waiting for code review), and Java (Label to identify Java API component) labels Oct 9, 2018
@lanking520 lanking520 changed the title [MXNET-984][DO NOT MERGE] Add Java NDArray and introduce Java Operator Builder class [MXNET-984] Add Java NDArray and introduce Java Operator Builder class Oct 9, 2018
@marcoabreu (Contributor)
Awesome contribution! Thanks a lot :)

Do you think it's possible to introduce test coverage early on so we can track improvements?

@lanking520 (Member, Author)

> Awesome contribution! Thanks a lot :)
>
> Do you think it's possible to introduce test coverage early on so we can track improvements?

Hi @marcoabreu, there are unit tests on the Java side that cover both the handmade operators and the generated operators. Unfortunately, we cannot have full unit-test coverage for the generated operators, since there are 200+ of them and they are already tested on the Python end. I would be grateful if you could request more tests if you find a part of the code that is not covered.

@marcoabreu (Contributor) commented Oct 9, 2018

Sure, I fully understand and agree. I was thinking about a targeted coverage measurement for the Java API. The backend calls would be mocked and we would validate that the pass-through is working properly. The actual logic is still tested on the Scala side.

The idea is to detect if some changes on the scala side broke the compatibility or behaviour of the Java API.

What do you think?
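
As an illustration of the pass-through testing idea described above, a minimal sketch follows. The Backend trait and JavaArrayWrapper class are hypothetical stand-ins, not MXNet classes; the point is only that the wrapper forwards its arguments unchanged and a JUnit test verifies this against a stub.

    import org.junit.Assert.assertEquals
    import org.junit.Test

    // Hypothetical stand-in for the native/Scala layer the Java wrapper delegates to.
    trait Backend {
      def add(lhs: Long, rhs: Long): Long
    }

    // Hypothetical thin Java-facing wrapper: it should do nothing but forward calls.
    class JavaArrayWrapper(val handle: Long, backend: Backend) {
      def add(other: JavaArrayWrapper): JavaArrayWrapper =
        new JavaArrayWrapper(backend.add(this.handle, other.handle), backend)
    }

    class PassThroughTest {
      @Test
      def addForwardsBothHandles(): Unit = {
        var seen: (Long, Long) = null
        val stub = new Backend {
          override def add(lhs: Long, rhs: Long): Long = { seen = (lhs, rhs); 42L }
        }
        val out = new JavaArrayWrapper(1L, stub).add(new JavaArrayWrapper(2L, stub))
        assertEquals((1L, 2L), seen)  // the arguments reached the backend unchanged
        assertEquals(42L, out.handle) // the wrapper returned whatever the backend produced
      }
    }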

@andrewfayres (Contributor) left a comment


Minor changes, but other than that LGTM.

@@ -24,4 +24,6 @@ object DType extends Enumeration {
val UInt8 = org.apache.mxnet.DType.UInt8
val Int32 = org.apache.mxnet.DType.Int32
val Unknown = org.apache.mxnet.DType.Unknown

+ private[mxnet] def numOfBytes(dType: DType): Int = org.apache.mxnet.DType.numOfBytes(dType)
Contributor:

This looks like an unused private method. I think we can get rid of it.

@@ -64,7 +64,7 @@ You can also disable only one rule, by specifying its rule id, as specified in:

<check level="error" class="org.scalastyle.file.FileLineLengthChecker" enabled="true">
<parameters>
- <parameter name="maxLineLength"><![CDATA[100]]></parameter>
+ <parameter name="maxLineLength"><![CDATA[132]]></parameter>
Contributor:

Undo this. I know you picked it up from my change but Naveen wanted us to have this discussion on the dev list before changing it.

@andrewfayres (Contributor)

I like the idea of getting a test coverage measurement working early for the Java API. I think this is something we can add to our plan. Our current testing strategy is that the Java unit tests will be aimed at testing the compatibility of the Java/Scala interaction, and all behavior testing will occur on the Scala side. With that said, the current Scala testing is lacking in some areas and needs to be improved.

@andrewfayres (Contributor) left a comment


We should aim for having unit tests for all of the methods in the non-generated NDArray code. Given that it's just a wrapper, is in its early stages, and there are other things waiting on this, I won't block the PR for it, but we should do a follow-up with better test coverage. Other than that, LGTM.

@yzhliu (Member) left a comment


I haven't looked into the details, but a high-level comment:
I feel it is better to follow Spark's approach: name it JavaNDArray or something, and provide toNDArray(), so that we don't need to duplicate every class in Java. Each time an API needs an NDArray, Java users can operate on JavaNDArray and convert to Scala's NDArray (a rough sketch follows after the list below).

And I think Scala's Context, DType, Shape, and IO can be re-used; we do not benefit much from inventing new classes. So far we duplicate too much.

  • Context
    The only change I see is that devtype2str and devstr2type now become Java Maps, but who will actually use them? Even if they were needed in some unusual case, users can easily invoke .asJava themselves.

  • DType
    Nothing changes

  • Shape
    Java users can already construct a Shape easily: Shape.create(1, 2, 3, 4). If you worry about toVector returning a Scala Vector, we can add one more function, toList.

  • IO
    Nothing new.
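
A rough sketch of the approach being suggested here (the names are hypothetical, and it assumes the existing org.apache.mxnet.NDArray and Shape classes): a single thin Java-facing class that converts back to the Scala NDArray on demand, with Scala collections converted via the standard JavaConverters instead of duplicated classes.

    import scala.collection.JavaConverters._

    // Hypothetical sketch of the Spark-style wrapper suggested above.
    class JavaNDArray(private val nd: org.apache.mxnet.NDArray) {
      // Java users get Java collections where the Scala API returns Scala ones ...
      def shapeAsList: java.util.List[Integer] =
        nd.shape.toVector.map(Int.box).asJava
      // ... and can always drop down to the Scala NDArray when an API requires it.
      def toNDArray: org.apache.mxnet.NDArray = nd
    }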

@@ -606,7 +606,7 @@ scalaclean:

scalapkg:
(cd $(ROOTDIR)/scala-package; \
- mvn package -P$(SCALA_PKG_PROFILE),$(SCALA_VERSION_PROFILE) -Dcxx="$(CXX)" \
+ mvn package -P$(SCALA_PKG_PROFILE),$(SCALA_VERSION_PROFILE),integrationtest -Dcxx="$(CXX)" \
Member:

Why run the tests here?

Member Author:

No tests are actually triggered here. This is just to avoid the JUnit tests running during make scalapkg; the integrationtest profile flag disables the JUnit unit tests.

@andrewfayres (Contributor)

A few comments: what this is doing is essentially Spark's approach (see their JavaDoubleRDD). They have a JavaDoubleRDD class that just calls the methods of their Scala RDD and internally handles the conversions. You can also see this in their Java examples; throughout all of the examples I've looked at, the Java examples never use any part of the core library outside of the api.java package.

If we don't provide a Java-friendly wrapper for all of the classes we want to expose to Java developers, it'll lead to a rather confusing experience. They'll be expected to learn what is and is not meant for them to use from Scala's package, while also having to know when to use the javaapi package instead.

Take the IO wrapper, for example: all the Java wrapper exposes is a Java-friendly version of DataDesc. If we don't wrap that and instead expect them to use the Scala one, then they'll also be exposed to all the other classes in IO that aren't intended for them to use (aren't Java-friendly). Any methods we add there to make it Java-friendly would pollute the Scala API with things Scala developers don't need.
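
For concreteness, here is a sketch of the kind of thin DataDesc wrapper being argued for. This is not the actual javaapi code; it assumes the Scala org.apache.mxnet.DataDesc exposes name, shape, and layout fields.

    import scala.collection.JavaConverters._

    // Hypothetical javaapi-style wrapper: re-expose only Java-friendly pieces of DataDesc
    // so Java users never need to browse the rest of org.apache.mxnet.IO.
    class JavaDataDesc(private val underlying: org.apache.mxnet.DataDesc) {
      def name: String = underlying.name
      def layout: String = underlying.layout
      def shapeAsList: java.util.List[Integer] =
        underlying.shape.toVector.map(Int.box).asJava
      // Conversion back to the Scala class for APIs that need it.
      def toDataDesc: org.apache.mxnet.DataDesc = underlying
    }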

@yzhliu (Member) commented Oct 12, 2018

Thanks for providing your points of view, @andrewfayres.

IMO it is not always "bad" to expose Scala classes. Scala classes are not necessarily a "new language" that users have to learn. Take scala.Tuple: many Java third-party libraries (Guava, Apache Commons, javatuples) provide exactly the same thing.

Spark only re-implements a very small set of classes, mostly RDDs, and not every RDD has a Java version.

At least these four classes are already "Java-friendly". Duplicating them will only increase maintenance effort. Honestly, I don't see why DataDesc is not Java-friendly.

It makes sense to me to have a Java version of NDArray (and maybe also Module, etc.), because they have too many functions that are not Java-friendly. But not every class is in that situation.

@yzhliu (Member) commented Oct 12, 2018

Also, I can easily find a Java example using Scala objects in the link you mentioned: https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/JavaSparkPi.java#L34-L37

And in their documentation, they at least recommend that Java users use scala.Tuple2: https://spark.apache.org/docs/0.9.1/java-programming-guide.html

And here, https://spark.apache.org/examples.html, if you click the "Java" tab, there is plenty of usage of Spark Scala objects: LogisticRegression, Vector, etc.

And let's check their Java test case: https://github.com/apache/spark/blob/master/mllib/src/test/java/org/apache/spark/ml/classification/JavaLogisticRegressionSuite.java

I don't think it makes a difference whether they come from core (or org.apache.spark.sql.XXX can be viewed as the "core" of Spark SQL); the point is more about whether it is good to expose "Scala objects" to Java users. My answer is yes, as long as they turn out to be Java-friendly.

@andrewfayres (Contributor)

I'm not taking the position that Scala objects should never be exposed to Java users. I have no objections to doing so when it's appropriate. My objection is to exposing half a package to them because it is already Java-friendly, and requiring them to know which parts they are expected to cherry-pick from it.

I think everything you linked to is from outside the core module I was talking about. The reason I was making that distinction is that core seems to be where they have separate Java and Scala APIs. At a cursory glance, it looks like the entire mllib module is Java-friendly and doesn't have a separate API (as if it had been designed that way from the start). If we'd planned on Java support from the very beginning, this is probably how we'd have done things.

I have no objections to the idea that we could have them use scala.Tuple2. It's useful, straightforward, is a concept everyone knows, and has no analog in Java SE.

I agree that DataDesc is Java-friendly (or at least good enough), but there are plenty of other classes (even in the same IO file as DataDesc) that aren't. I think it's unreasonable to expect Java developers to know that they should pick only certain things from that package while other things are off limits.

I'm not particularly concerned about the maintenance costs because they're just wrappers. The only time I foresee us doing maintenance is if we make a breaking change to the corresponding method of the Scala API which should be very rare (and when it happens there's already so much to update that an extra wrapper should make little difference).

@yzhliu (Member) commented Oct 12, 2018

Then I do not agree with the opinion that, even though DataDesc is Java-friendly, we should still duplicate it because other things are duplicated.

A user types DataDesc and only one choice exists; I feel that is far better than having two things under different namespaces pop up that have the same API and behave the same, but are not compatible with each other.

I think you agree that if a class is already designed to be Java-friendly, then we can just expose it; that is how you took spark-mllib as an example. So why quibble about the namespace?

The main objective of providing a "Java API" is not to have a Java namespace; as a user, why do I care which namespace I'm using? As a remedy for not making its APIs Java-friendly from the very beginning, Spark invented JavaRDD; we are facing the same problem here.

Also, be careful about saying maintenance is easy. Count how many lines were written for only four tiny, simple classes. And when APIs change, grow, or are removed, we'll suffer from maintaining two things. And remember, IO/Optimizer/etc. are very likely to be extended.

@andrewfayres (Contributor)

In offline discussions we've agreed that we'll collectively craft an email with both positions, send it out for feedback, and go with the consensus that emerges.

In the meantime, development of the Java API will proceed on a new branch.

@lanking520 (Member, Author)

Here is the alternative one, closing this now! #12816

@lanking520 lanking520 closed this Oct 12, 2018
Labels: Java (Label to identify Java API component), pr-awaiting-review (PR is waiting for code review), Scala

5 participants