[GRAPHX] Spark 3789 - Python Bindings for GraphX #4205

Closed
wants to merge 35 commits

Conversation


@kdatta kdatta commented Jan 26, 2015

First pull request for PyGraphX. The following code is added (a rough sketch of the layering follows the list):

  • Java API for GraphX including JavaVertexRDD, JavaEdgeRDD and JavaGraph
  • Python backend including PythonVertexRDD, PythonEdgeRDD and PythonGraph
  • graphx package is added to pyspark
    • includes vertex.py, edge.py, graph.py and tests.py
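
To make the layering concrete, here is a rough sketch (my own illustration, not code from this patch) of what a Java-facing wrapper such as JavaVertexRDD might look like, assuming it follows the JavaRDD pattern from Spark Core; the pyspark.graphx classes would presumably drive wrappers like this through Py4J.

import scala.reflect.ClassTag

import org.apache.spark.api.java.JavaRDD
import org.apache.spark.graphx.VertexRDD

// Hypothetical shape of a Java-facing wrapper around the Scala VertexRDD.
// Names and signatures are illustrative only; the classes in this PR may differ.
class JavaVertexRDD[VD](val vertexRDD: VertexRDD[VD])(implicit val vdTag: ClassTag[VD])
  extends Serializable {

  // Expose the vertices as a plain JavaRDD of (vertexId, attribute) pairs.
  def toJavaRDD(): JavaRDD[(Long, VD)] = JavaRDD.fromRDD(vertexRDD)

  // Count the vertices without requiring Scala-specific types from the caller.
  def count(): Long = vertexRDD.count()
}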

Conflicts:
	core/src/main/scala/org/apache/spark/rdd/RDD.scala
Conflicts:
	graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala
…DImpl class hierarchy. Fixed compile issues.

kdatta commented Jan 26, 2015

Updated to "[GRAPHX] Spark 3789 - Python Bindings for GraphX"

On Mon, Jan 26, 2015 at 9:35 AM, Mark Hamstra notifications@github.com
wrote:

Yes, +1 to Sandy's request.

In general, the JIRA should explain why a change is necessary or
advisable, the description of the PR should explain what is being done
to fix the problem or add the feature, and the title of the PR should
provide a useful summary of the PR since that title will end up as the
commit message in the git log.

When these recommendations aren't followed, reviewers and other developers
are forced to look in multiple places or even at the code in order to get
even a basic idea of what the PR or commit is about.

import scala.language.implicitConversions
import scala.reflect.ClassTag

trait JavaVertexRDDLike[VD, This <: JavaVertexRDDLike[VD, This, R],
Contributor

I'd remove this trait and fold all of this code directly into the JavaVertexRDD class. The *RDDLike pattern in the Java API wasn't a great design and I'd like to avoid mimicking it in new code.
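
To illustrate the difference (a sketch only, not code from this PR; the VertexWrapper* names are mine): the *RDDLike pattern puts the implementations on a generic trait that a thin concrete class extends, whereas the suggestion here is to put them directly on the concrete class.

import scala.reflect.ClassTag

import org.apache.spark.graphx.VertexRDD

// The *RDDLike pattern: a generic trait carries the implementations and a thin
// concrete class extends it (the design being discouraged here).
trait VertexWrapperLike[VD, This <: VertexWrapperLike[VD, This]] {
  def vertexRDD: VertexRDD[VD]
  def numVertices(): Long = vertexRDD.count()
}

class VertexWrapperViaTrait[VD: ClassTag](val vertexRDD: VertexRDD[VD])
  extends VertexWrapperLike[VD, VertexWrapperViaTrait[VD]]

// The suggested alternative: fold everything into the concrete class, so no
// generic trait exists for other code to inherit implementations from.
class VertexWrapperFlat[VD: ClassTag](val vertexRDD: VertexRDD[VD]) {
  def numVertices(): Long = vertexRDD.count()
}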

Author

Josh, then how will we handle type bounds? Is there a new design for the Java API?

Contributor

What do you mean by "handle type bounds"? In this PR's current code, it looks like JavaVertexRDDLike is only extended by a single class and isn't used as part of any method signatures, return types, etc, so unless I'm overlooking something I don't see why it can't be removed. Inheriting implementations from generic traits has bitten us in the past via https://issues.scala-lang.org/browse/SI-8905 (see https://issues.apache.org/jira/browse/SPARK-3266), so if this trait isn't necessary then we shouldn't have it.

The JavaRDDLike traits in the Spark Core Java API are an unfortunate holdover from an earlier design and exist primarily for code re-use purposes. We can't remove them now because that would break binary compatibility.

Author

Got it. I will change this.

@JoshRosen

I left an initial pass of comments. I haven't really dug into the details very much yet, but a couple of high-level comments:

  • There's a lot of code duplication in the Python code that creates the Java RDDs, so it would be nice to see if there's a way to refactor the code to remove this duplication. My concern here is largely around future maintainability, since I'm worried that we'll see the copies of the code diverge when people make changes without being aware of the duplicate copies.
  • I'd like to avoid repeating the Java*Like pattern, since it doesn't look necessary here and it has caused problems in the past: see https://issues.scala-lang.org/browse/SI-8905 and https://issues.apache.org/jira/browse/SPARK-3266.

Now that we're increasingly seeing Spark libraries being written in one JVM language and used from another (e.g. a Spark library written against the Java API and called from Scala), it might be nice to try to extend GraphX's Scala API to expose Java-friendly methods instead of adding a new Java API. This is a major departure from how we've handled Java APIs up until now, but it might be a better long-term decision for new code. I think @rxin may be able to chime in here with more details. GraphX might be a nice context to explore this idea since it's a much smaller API than Spark as a whole.
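
As a rough illustration of that idea (my sketch; nothing like this exists in GraphX, and the JavaFriendlyGraph name is hypothetical): instead of a parallel Java wrapper hierarchy, the Scala API could grow Java-callable entry points that take plain Class tokens instead of implicit ClassTags and return the ordinary Scala Graph.

import scala.reflect.ClassTag

import org.apache.spark.api.java.JavaRDD
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical Java-friendly builder living alongside the Scala API rather than
// in a separate Java API. Illustration only.
object JavaFriendlyGraph {
  // Java callers pass Class tokens; we turn them into the ClassTags that
  // Graph.fromEdges needs implicitly.
  def fromEdges[VD, ED](
      edges: JavaRDD[Edge[ED]],
      defaultVertexAttr: VD,
      vdClass: Class[VD],
      edClass: Class[ED]): Graph[VD, ED] = {
    implicit val vdTag: ClassTag[VD] = ClassTag(vdClass)
    implicit val edTag: ClassTag[ED] = ClassTag(edClass)
    Graph.fromEdges(edges.rdd, defaultVertexAttr)
  }
}

A Java caller could then invoke something like JavaFriendlyGraph.fromEdges(edges, 0, Integer.class, String.class) with no Scala-specific machinery on its side, assuming the usual static forwarders for Scala objects.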


kdatta commented Jan 26, 2015

I like the idea of extending the Scala APIs for Java instead of having a
separate Java API. It's a better model, and I think we can remove a lot of
code duplication and avoid creating layers of wrapper classes. So does this
mean that all Java-friendly methods in GraphX will return objects that Java
can work with, e.g. Iterators, JList and so on?
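
For what it's worth, the conversion in that direction is mostly mechanical; here is a sketch of the kind of return type being asked about, using scala.collection.JavaConverters (the JavaFriendlyOps helper name is mine, not from GraphX):

import scala.collection.JavaConverters._

import org.apache.spark.graphx.Graph

// Sketch only: return a java.util.List instead of a Scala collection so that
// Java (and Py4J) callers can consume the result directly.
object JavaFriendlyOps {
  // Collect the first n vertex ids of a graph as boxed java.lang.Longs.
  def takeVertexIds[VD, ED](graph: Graph[VD, ED], n: Int): java.util.List[java.lang.Long] =
    graph.vertices.take(n).map(pair => java.lang.Long.valueOf(pair._1)).toSeq.asJava
}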


@pwendell

Hey all, I'd like to close this issue and defer to a design doc before there is a lot of commenting on this. This pulls in another patch that is itself not merged, and major portions of the PySpark API are copy/pasted. In particular, it might be good to wait until #3234 is merged before asking for a lot of community review here.


rxin commented Jan 26, 2015

@kdatta thanks for working on this.

I also commented on JIRA. For such a massive change, can you write a high-level design document and attach it to the JIRA ticket?


kdatta commented Jan 26, 2015

There's duplication of effort between #3234 and #3789. Should I wait for
#3234 then?


kdatta commented Jan 26, 2015

Hi All,

Here's what I suggest we do:

  1. Complete a design document on PyGraphX and attach it to JIRA
  2. Wait for the Java API for GraphX to be resolved, i.e. pull request [SPARK-3665][GraphX] Java API for GraphX #3234.
     There's no way around this and no reason for duplication of effort.
  3. Remove the Java API for GraphX from pull request [GRAPHX] Spark 3789 - Python Bindings for GraphX #4205 and build on [SPARK-3665][GraphX] Java API for GraphX #3234 instead.
  4. Create a new pull request for PyGraphX.


@asfgit asfgit closed this in 622ff09 Jan 28, 2015