[GRAPHX] Spark 3789 - Python Bindings for GraphX #4205

Closed
wants to merge 35 commits

Conversation


@kdatta kdatta commented Jan 26, 2015

First pull request for PyGraphX. The following code is added (a rough sketch of the layering follows the list):

  • Java API for GraphX including JavaVertexRDD, JavaEdgeRDD and JavaGraph
  • Python backend including PythonVertexRDD, PythonEdgeRDD and PythonGraph
  • graphx package is added to pyspark
    • includes vertex.py, edge.py, graph.py and tests.py
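
To make the layering concrete, here is a rough sketch (my own illustration, not code from this patch) of what a Java-facing wrapper such as JavaVertexRDD might look like, assuming it follows the JavaRDD pattern from Spark Core; the pyspark.graphx classes would presumably drive wrappers like this through Py4J.

import scala.reflect.ClassTag

import org.apache.spark.api.java.JavaRDD
import org.apache.spark.graphx.VertexRDD

// Hypothetical shape of a Java-facing wrapper around the Scala VertexRDD.
// Names and signatures are illustrative only; the classes in this PR may differ.
class JavaVertexRDD[VD](val vertexRDD: VertexRDD[VD])(implicit val vdTag: ClassTag[VD])
  extends Serializable {

  // Expose the vertices as a plain JavaRDD of (vertexId, attribute) pairs.
  def toJavaRDD(): JavaRDD[(Long, VD)] = JavaRDD.fromRDD(vertexRDD)

  // Count the vertices without requiring Scala-specific types from the caller.
  def count(): Long = vertexRDD.count()
}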

Conflicts:
	core/src/main/scala/org/apache/spark/rdd/RDD.scala
Conflicts:
	graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala
…DImpl class hierarchy. Fixed compile issues.

kdatta commented Jan 26, 2015

Updated to "[GRAPHX] Spark 3789 - Python Bindings for GraphX"

On Mon, Jan 26, 2015 at 9:35 AM, Mark Hamstra notifications@github.com
wrote:

Yes, +1 to Sandy's request.

In general, the JIRA should explain why a change is necessary or
advisable, the description of the PR should explain what is being done
to fix the problem or add the feature, and the title of the PR should
provide a useful summary of the PR since that title will end up as the
commit message in the git log.

When these recommendations aren't followed, reviewers and other developers
are forced to look in multiple places or even at the code in order to get
even a basic idea of what the PR or commit is about.

import scala.language.implicitConversions
import scala.reflect.ClassTag

trait JavaVertexRDDLike[VD, This <: JavaVertexRDDLike[VD, This, R],
Contributor

I'd remove this trait and fold all of this code directly into the JavaVertexRDD class. The *RDDLike pattern in the Java API wasn't a great design and I'd like to avoid mimicking it in new code.
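
To illustrate the difference (a sketch only, not code from this PR; the VertexWrapper* names are mine): the *RDDLike pattern puts the implementations on a generic trait that a thin concrete class extends, whereas the suggestion here is to put them directly on the concrete class.

import scala.reflect.ClassTag

import org.apache.spark.graphx.VertexRDD

// The *RDDLike pattern: a generic trait carries the implementations and a thin
// concrete class extends it (the design being discouraged here).
trait VertexWrapperLike[VD, This <: VertexWrapperLike[VD, This]] {
  def vertexRDD: VertexRDD[VD]
  def numVertices(): Long = vertexRDD.count()
}

class VertexWrapperViaTrait[VD: ClassTag](val vertexRDD: VertexRDD[VD])
  extends VertexWrapperLike[VD, VertexWrapperViaTrait[VD]]

// The suggested alternative: fold everything into the concrete class, so no
// generic trait exists for other code to inherit implementations from.
class VertexWrapperFlat[VD: ClassTag](val vertexRDD: VertexRDD[VD]) {
  def numVertices(): Long = vertexRDD.count()
}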

Author

Josh, then how will we handle type bounds? Is there a new design for the Java API?

Contributor

What do you mean by "handle type bounds"? In this PR's current code, it looks like JavaVertexRDDLike is only extended by a single class and isn't used as part of any method signatures, return types, etc, so unless I'm overlooking something I don't see why it can't be removed. Inheriting implementations from generic traits has bitten us in the past via https://issues.scala-lang.org/browse/SI-8905 (see https://issues.apache.org/jira/browse/SPARK-3266), so if this trait isn't necessary then we shouldn't have it.

The JavaRDDLike traits in the Spark Core Java API are an unfortunate holdover from an earlier design and exist primarily for code re-use purposes. We can't remove them now because that would break binary compatibility.

Author

Got it. I will change this.

@JoshRosen

I left an initial pass of comments. I haven't really dug into the details very much yet, but a couple of high-level comments:

  • There's a lot of code duplication in the Python code that creates the Java RDDs, so it would be nice to see if there's a way to refactor the code to remove this duplication. My concern here is largely around future maintainability, since I'm worried that we'll see the copies of the code diverge when people make changes without being aware of the duplicate copies.
  • I'd like to avoid repeating the Java*Like pattern, since it doesn't look necessary here and it has caused problems in the past: see https://issues.scala-lang.org/browse/SI-8905 and https://issues.apache.org/jira/browse/SPARK-3266.

Now that we're increasingly seeing Spark libraries being written in one JVM language and used from another (e.g. a Spark library written against the Java API and called from Scala), it might be nice to try to extend GraphX's Scala API to expose Java-friendly methods instead of adding a new Java API. This is a major departure from how we've handled Java APIs up until now, but it might be a better long-term decision for new code. I think @rxin may be able to chime in here with more details. GraphX might be a nice context to explore this idea since it's a much smaller API than Spark as a whole.
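
As a rough illustration of that idea (my sketch; nothing like this exists in GraphX, and the JavaFriendlyGraph name is hypothetical): instead of a parallel Java wrapper hierarchy, the Scala API could grow Java-callable entry points that take plain Class tokens instead of implicit ClassTags and return the ordinary Scala Graph.

import scala.reflect.ClassTag

import org.apache.spark.api.java.JavaRDD
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical Java-friendly builder living alongside the Scala API rather than
// in a separate Java API. Illustration only.
object JavaFriendlyGraph {
  // Java callers pass Class tokens; we turn them into the ClassTags that
  // Graph.fromEdges needs implicitly.
  def fromEdges[VD, ED](
      edges: JavaRDD[Edge[ED]],
      defaultVertexAttr: VD,
      vdClass: Class[VD],
      edClass: Class[ED]): Graph[VD, ED] = {
    implicit val vdTag: ClassTag[VD] = ClassTag(vdClass)
    implicit val edTag: ClassTag[ED] = ClassTag(edClass)
    Graph.fromEdges(edges.rdd, defaultVertexAttr)
  }
}

A Java caller could then invoke something like JavaFriendlyGraph.fromEdges(edges, 0, Integer.class, String.class) with no Scala-specific machinery on its side, assuming the usual static forwarders for Scala objects.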


kdatta commented Jan 26, 2015

I like the idea of extending the Scala APIs for Java instead of having a
separate Java API. It's a better model, and I think we can remove a lot of
code duplication and avoid creating layers of wrapper classes. So does this
mean that all Java-friendly methods in GraphX will return objects that Java
can work with, e.g. Iterators, JList and so on?
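
For what it's worth, the conversion in that direction is mostly mechanical; here is a sketch of the kind of return type being asked about, using scala.collection.JavaConverters (the JavaFriendlyOps helper name is mine, not from GraphX):

import scala.collection.JavaConverters._

import org.apache.spark.graphx.Graph

// Sketch only: return a java.util.List instead of a Scala collection so that
// Java (and Py4J) callers can consume the result directly.
object JavaFriendlyOps {
  // Collect the first n vertex ids of a graph as boxed java.lang.Longs.
  def takeVertexIds[VD, ED](graph: Graph[VD, ED], n: Int): java.util.List[java.lang.Long] =
    graph.vertices.take(n).map(pair => java.lang.Long.valueOf(pair._1)).toSeq.asJava
}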


@pwendell

Hey all, I'd like to close this issue and defer to a design doc before there is a lot of commenting on this. This pulls in another patch that is itself not merged, and major portions of the PySpark API are copy/pasted. In particular, it might be good to wait until #3234 is merged before asking for a lot of community review here.


rxin commented Jan 26, 2015

@kdatta thanks for working on this.

I also commented on JIRA. For such a massive change, can you write a high-level design document and attach it to the JIRA ticket?


kdatta commented Jan 26, 2015

There's duplication of effort between #3234 and #3789. Should I wait for
#3234 then?


kdatta commented Jan 26, 2015

Hi All,

Here's what I suggest we do:

  1. Complete a design document on PyGraphX and attach it to JIRA
  2. Wait for the Java API for GraphX to be resolved, i.e. pull request [SPARK-3665][GraphX] Java API for GraphX #3234.
     There's no way around this and no reason for duplication of effort.
  3. Remove the Java API for GraphX from pull request [GRAPHX] Spark 3789 - Python Bindings for GraphX #4205 and build on [SPARK-3665][GraphX] Java API for GraphX #3234 instead.
  4. Create a new pull request for PyGraphX.


@asfgit asfgit closed this in 622ff09 Jan 28, 2015