
[Feat][Spark] Implementation of PySpark bindings to Scala API #300

Merged: 25 commits merged into apache:main from 297-add-pyspark-bindings on Jan 10, 2024

Conversation

@SemyonSinchenko (Member) commented Dec 23, 2023

Proposed changes

PySpark bindings to the Spark Scala package of GraphAr, implemented with py4j. Related to #297 and follows the discussed concept.

Types of changes

What types of changes does your code introduce to GraphAr?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING doc
  • I have signed the CLA
  • Lint and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

The proposed changes are very big and contain only a draft/concept of the implementation. I checked some methods and bindings in a Jupyter Notebook, but I do not want to write tests and documentation before getting approval of the concept. @acezen, could you please take a top-level look at the overall design, naming conventions, package structure, etc.? Thanks in advance! After the design is finalized I will focus on covering the code with tests.

The main idea is the following: it looks like the Spark implementation will be "Scala API first", so I tried to avoid moving any real logic into Python. I simply proxy the entire public API (except utils and datasources, which are not useful from PySpark), even in places where the same logic could be reproduced in pure Python. The benefit is that this should simplify maintenance of the bindings. The downside is that some things are not done in a "pythonic" way, and the source code is mostly boilerplate, so anyone who wants to understand the logic should refer to the Scala implementation.

Couple of things:

  • Every public field is exposed in Python via a get_ method.
  • For beans, set_ methods are also provided.
  • Each class has a to_scala() method that returns a reference to the corresponding JVM object.
  • Each class has two main constructors, from_python(args) and from_scala(jvm_object), which give two ways to create a Python wrapper: from_scala wraps an existing JVM object, while from_python initializes a new JVM object from parameters provided in Python (see the sketch after this list).
  • There is no direct way to overload methods, so for some of them I split the original method in two. For example, in Python there are two methods, GraphWriter.write_with_graph_info(graph_info) and GraphWriter.write(path, name, vertex_chunk_size, edge_chunk_size, file_type, version). I can change this, but there are many different tricks for mimicking overloading in Python.
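
For illustration, here is a minimal sketch of that wrapper pattern (not the actual code in this PR: the class body and the get_name/getName pair are assumptions; only the JVM class and its setters come from the Scala snippet shown later in this thread):

from py4j.java_gateway import JavaObject
from pyspark.sql import SparkSession


class Property:
    """Hypothetical Python wrapper around com.alibaba.graphar.Property (illustration only)."""

    def __init__(self, jvm_obj: JavaObject) -> None:
        # The wrapper only keeps a reference to the JVM object; no real logic lives in Python.
        self._jvm_property_obj = jvm_obj

    def get_name(self) -> str:
        # Every public field is exposed through a get_ method delegating to the JVM bean getter.
        return self._jvm_property_obj.getName()

    def set_name(self, name: str) -> None:
        # Beans also get set_ methods.
        self._jvm_property_obj.setName(name)

    def to_scala(self) -> JavaObject:
        # Hand the underlying JVM object back so it can be passed to the Scala API.
        return self._jvm_property_obj

    @classmethod
    def from_scala(cls, jvm_obj: JavaObject) -> "Property":
        # Wrap an already existing JVM object.
        return cls(jvm_obj)

    @classmethod
    def from_python(cls, name: str, data_type: str, is_primary: bool) -> "Property":
        # Create a new JVM object from Python parameters via the active SparkSession's gateway.
        jvm = SparkSession.getActiveSession()._jvm
        jvm_obj = jvm.com.alibaba.graphar.Property()
        jvm_obj.setName(name)
        jvm_obj.setData_type(data_type)
        jvm_obj.setIs_primary(is_primary)
        return cls(jvm_obj)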

P.S. To try it now (it is a temporary way for now):

  1. Install poetry
  2. poetry env use python3.9
  3. poetry install --with=spark,dev,jupyter
  4. poetry run jupyter lab
  5. The entry point of the Python library is graphar_pyspark.initialize(spark: SparkSession), which should be called first (see the usage sketch below)
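
A minimal usage sketch for this temporary setup (the SparkSession configuration and the JAR path are assumptions; the JAR is the one built from the spark/ module):

from pyspark.sql import SparkSession

import graphar_pyspark

# Assumed location of the GraphAr Spark JAR built from the spark/ module beforehand.
GRAPHAR_JAR = "spark/target/graphar-0.1.0-SNAPSHOT.jar"

spark = (
    SparkSession.builder
    .appName("graphar-pyspark-demo")
    .config("spark.jars", GRAPHAR_JAR)
    .getOrCreate()
)

# Entry point of the library: binds the Python wrappers to the Scala API via py4j.
graphar_pyspark.initialize(spark)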

P.P.S. As mentioned, this is just a draft for now, so it may still contain a lot of errors. Once the concept is finalized, I will cover it with unit tests.

====================
@Corwinpro Could you please take a look and give me your opinion about the design of the bindings?

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   .gitignore
	new file:   pyspark/graphar_pysaprk/__init__.py
	new file:   pyspark/graphar_pysaprk/enums.py
	new file:   pyspark/graphar_pysaprk/graph.py
	new file:   pyspark/graphar_pysaprk/info.py
	new file:   pyspark/graphar_pysaprk/reader.py
	new file:   pyspark/graphar_pysaprk/writer.py
	new file:   pyspark/poetry.lock
	new file:   pyspark/pyproject.toml
@CLAassistant commented Dec 23, 2023

CLA assistant check
All committers have signed the CLA.

@SemyonSinchenko SemyonSinchenko changed the title Implementation draft & concept [Feat][Spark] Implementation draft & concept of PySpark bindings to Scala API Dec 24, 2023
@acezen (Contributor) commented Dec 26, 2023

Nice work! Thanks for the draft and concept. I will take a look and reply ASAP.

@acezen (Contributor) commented Dec 27, 2023

Hi @SemyonSinchenko, can you give an example of how to use graphar-pyspark from a user's point of view, based on the concept?

@acezen (Contributor) left a comment

The API of graphar-pyspark looks good to me; it seems clean and workable. Thanks for the work!

@SemyonSinchenko (Member, Author)

The API of graphar-pyspark looks good to me; it seems clean and workable. Thanks for the work!

Thank you! In that case I will focus on covering the code with tests and fixing errors. I will update this PR soon. For now I will skip the topic of CI/CD, because the Python tests can only run after the Scala project is built, and I have no clear vision of how to couple these things.

@SemyonSinchenko (Member, Author)

@acezen In the Scala part, == comparison is used for property groups and properties, and it is always false except when the objects are literally the same instance. Is this a bug or expected behavior? Even if you pass an object with identical fields, the expression will still be false.

You can easily check it yourself (this example uses Property, but it relies on the same beans as PropertyGroup):

[sem@nixos:~/github/GraphAr/spark/target]$ scala -cp graphar-0.1.0-SNAPSHOT.jar
Welcome to Scala 2.13.10 (OpenJDK 64-Bit Server VM, Java 19.0.2).
Type in expressions for evaluation. Or try :help.

scala> import com.alibaba.graphar.Property;
import com.alibaba.graphar.Property

scala> val p1 = new Property();
val p1: com.alibaba.graphar.Property = com.alibaba.graphar.Property@a6e4897

scala> val p2 = new Property()
val p2: com.alibaba.graphar.Property = com.alibaba.graphar.Property@66e827a8

scala> p1.setName("name")

scala> p2.setName("name")

scala> p1.setIs_primary(false)

scala> p2.setIs_primary(false)

scala> p1.setData_type("INT32")

scala> p2.setData_type("INT32")

scala> p1 == p2
val res6: Boolean = false

I'm asking because it affects how I should write tests for the Python bindings. If this behavior is expected, then I will use something like this (the only way to get an actual true result):

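# Both sides of the check wrap the same JVM object (from_scala(to_scala()) returns a wrapper
# around the original reference), so the reference-based Scala == comparison evaluates to true: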
assert vertex_info_from_py.contain_property_group(PropertyGroup.from_scala(vertex_info_from_py.get_property_groups()[0].to_scala()))

If it is not expected behavior, I can implement def canEqual(that: Any): Boolean and override def equals(that: Any): Boolean for Property and PropertyGroup in the Scala API, so that == compares fields rather than references.

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	renamed:    pyspark/graphar_pysaprk/__init__.py -> pyspark/graphar_pyspark/__init__.py
	renamed:    pyspark/graphar_pysaprk/enums.py -> pyspark/graphar_pyspark/enums.py
	renamed:    pyspark/graphar_pysaprk/graph.py -> pyspark/graphar_pyspark/graph.py
	renamed:    pyspark/graphar_pysaprk/info.py -> pyspark/graphar_pyspark/info.py
	renamed:    pyspark/graphar_pysaprk/reader.py -> pyspark/graphar_pyspark/reader.py
	renamed:    pyspark/graphar_pysaprk/writer.py -> pyspark/graphar_pyspark/writer.py
	modified:   pyspark/poetry.lock
	modified:   pyspark/pyproject.toml
	new file:   pyspark/tests/__init__.py
	new file:   pyspark/tests/conftest.py
	new file:   pyspark/tests/test_enums.py
	new file:   pyspark/tests/test_info.py
@SemyonSinchenko SemyonSinchenko changed the title [Feat][Spark] Implementation draft & concept of PySpark bindings to Scala API [Feat][Spark] Implementation of PySpark bindings to Scala API Dec 27, 2023
 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   .gitignore
	modified:   pyspark/graphar_pyspark/__init__.py
	modified:   pyspark/graphar_pyspark/info.py
	modified:   pyspark/poetry.lock
	modified:   pyspark/pyproject.toml
	modified:   pyspark/tests/test_info.py
@SemyonSinchenko (Member, Author)

Another problem I found: the Scala part uses snakeyaml, and there is a problem with the combination of this library and Spark:

E                   py4j.protocol.Py4JJavaError: An error occurred while calling z:com.alibaba.graphar.VertexInfo.loadVertexInfo.
E                   : java.lang.NoSuchMethodError: 'void org.yaml.snakeyaml.constructor.Constructor.<init>(java.lang.Class)'
E                   	at com.alibaba.graphar.VertexInfo$.loadVertexInfo(VertexInfo.scala:291)
E                   	at com.alibaba.graphar.VertexInfo.loadVertexInfo(VertexInfo.scala)
E                   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

The only workaround I found is to implement YAML reading on the Python side. Could you please check the last commit (pyspark/graphar_pyspark/info.py:344)? If you are OK with such an implementation, I will change GraphInfo and EdgeInfo loading from YAML in the same way. Otherwise, we need to make changes on the Scala side (there are some options in SO threads, so it looks possible).
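
For context, a rough sketch of what such a Python-side loader could look like (this is not the code from the commit: yaml is PyYAML, and the from_python keyword arguments are hypothetical, guessed from the GraphAr vertex-info YAML layout):

import yaml  # PyYAML

from graphar_pyspark.info import VertexInfo  # import path follows this PR's package layout


def load_vertex_info_from_yaml(path: str) -> VertexInfo:
    """Parse the vertex info YAML in Python instead of calling the Scala loader."""
    with open(path, encoding="utf-8") as f:
        raw = yaml.safe_load(f)

    # The keyword arguments below are hypothetical; the real constructor may differ.
    return VertexInfo.from_python(
        label=raw["label"],
        chunk_size=raw["chunk_size"],
        prefix=raw.get("prefix", ""),
        property_groups=raw.get("property_groups", []),
        version=raw["version"],
    )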

@acezen (Contributor) commented Dec 28, 2023

@acezen In the Scala part, == comparison is used for property groups and properties, and it is always false except when the objects are literally the same instance. Is this a bug or expected behavior? Even if you pass an object with identical fields, the expression will still be false. […]

Thanks for the note. The Scala-side == of PropertyGroup seems to be a bug; it's not what we expected. What we expect is that objects with identical fields compare as true. I will create an issue and submit a bug fix for this.

@acezen (Contributor) commented Dec 28, 2023

Another problem I found: the Scala part uses snakeyaml, and there is a problem with the combination of this library and Spark: […] The only workaround I found is to implement YAML reading on the Python side. […]

I think the workaround of implementing YAML reading in Python would make that code not really a binding to Scala, which is not what we want. It's OK to make changes on the Scala side, even replacing snakeyaml with another YAML library, if that's the simplest way to fix it.

@acezen (Contributor) commented Dec 28, 2023

@acezen In the Scala part, == comparison is used for property groups and properties, and it is always false except when the objects are literally the same instance. Is this a bug or expected behavior? Even if you pass an object with identical fields, the expression will still be false. […]

The comparison bug has been fixed by PR #306

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	new file:   pyspark/README.rst
	modified:   pyspark/graphar_pyspark/info.py
	modified:   pyspark/pyproject.toml
	modified:   pyspark/tests/test_info.py
@SemyonSinchenko (Member, Author) commented Dec 28, 2023

I think the workaround of implementing YAML reading in Python would make that code not really a binding to Scala, which is not what we want. It's OK to make changes on the Scala side, even replacing snakeyaml with another YAML library, if that's the simplest way to fix it.

@acezen I could rewrite the Scala part from the Java snakeyaml approach to a more "Scala way", with case classes and Scala collections instead of BeanProperty and Java collections; circe-yaml could be used as the YAML parser. But it is a lot of work, so I want to discuss it first. Should we discuss it here, or is it better to create another issue?

UPD: I found the root of the problem. PySpark depends on some Hadoop JARs, and the Hadoop JARs depend on a different version of snakeyaml. I found this one in the dependencies on the Python side:
(screenshot of the conflicting snakeyaml dependency)

Starting from snakeyaml 2.0 the syntax is slightly different. I will try to update the existing Scala code to the newer version of snakeyaml.

SOLUTION

  1. Update snakeyaml in pom.xml to 2.x.x (2.0.0 in my case)
  2. Add LoaderOptions to Constructor: new Constructor(classOf[VertexInfo], new LoaderOptions())

I checked, and all tests on the Scala part pass:

Run completed in 1 minute, 9 seconds.
Total number of tests run: 21
Suites: completed 10, aborted 0
Tests: succeeded 21, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

@acezen Should I open a bug (I'm not sure it is a bug; who in this world knows that PySpark silently downloads some Hadoop dependencies?) and fix it on the Scala side?
P.S. For now I will use the new Scala JAR with this fix for my tests of the PySpark API.
P.P.S. Just checked: with the new JAR the tests on the PySpark side pass, and the call to SparkSession._jvm.com.alibaba.graphar.VertexInfo.loadVertexInfo works fine! 🎉

@Corwinpro left a comment

Hi, thanks very much for this amazing work! I focused almost exclusively on the Python side here, so please kindly let me know if I missed something Spark/Scala-specific.

Also, please feel free to ignore any of the comments below. These are just suggestions, but I hope you find them useful.

In general, the main points are:

  • Usage of a singleton object instantiated at import time (very fragile IMO, and it could be more user-friendly)
  • Usage of staticmethod (most of these appear to be class methods from what I could see)
  • Some minor general/typing improvements.

Thank you very much!

- update pyproject.toml
- fix a lot of things
- some work based on comments
- license header everywhere
- minor changes

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	new file:   pyspark/Makefile
	modified:   pyspark/graphar_pyspark/graph.py
	modified:   pyspark/graphar_pyspark/info.py
	modified:   pyspark/graphar_pyspark/reader.py
	modified:   pyspark/poetry.lock
	modified:   pyspark/pyproject.toml
	modified:   pyspark/tests/__init__.py
	modified:   pyspark/tests/conftest.py
	modified:   pyspark/tests/test_enums.py
	modified:   pyspark/tests/test_info.py
	new file:   pyspark/tests/test_reader.py

 Changes not staged for commit:
	modified:   spark/pom.xml
	modified:   spark/src/main/scala/com/alibaba/graphar/EdgeInfo.scala
	modified:   spark/src/main/scala/com/alibaba/graphar/GraphInfo.scala
	modified:   spark/src/main/scala/com/alibaba/graphar/VertexInfo.scala
 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   pyspark/graphar_pyspark/__init__.py

 Changes not staged for commit:
	modified:   spark/pom.xml
	modified:   spark/src/main/scala/com/alibaba/graphar/EdgeInfo.scala
	modified:   spark/src/main/scala/com/alibaba/graphar/GraphInfo.scala
	modified:   spark/src/main/scala/com/alibaba/graphar/VertexInfo.scala
@acezen (Contributor) commented Dec 29, 2023

SOLUTION

  1. Update snakeyaml in pom.xml to 2.x.x (2.0.0 in my case)
  2. Add LoaderOptions to Constructor: new Constructor(classOf[VertexInfo], new LoaderOptions())

[…] @acezen Should I open a bug (I'm not sure it is a bug; who in this world knows that PySpark silently downloads some Hadoop dependencies?) and fix it on the Scala side? […]

Yes, you can open a new issue and we can discuss this problem there. Maybe opening a feature issue is better, and you can open a new pull request to update the snakeyaml version.

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   pyspark/graphar_pyspark/__init__.py
	modified:   pyspark/graphar_pyspark/enums.py
	modified:   pyspark/graphar_pyspark/graph.py
	modified:   pyspark/graphar_pyspark/info.py
	modified:   pyspark/graphar_pyspark/reader.py
	new file:   pyspark/graphar_pyspark/util.py
	modified:   pyspark/graphar_pyspark/writer.py
	modified:   pyspark/tests/test_info.py
	modified:   pyspark/tests/test_reader.py
	new file:   pyspark/tests/test_writer.py
 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   pyspark/graphar_pyspark/util.py
	modified:   pyspark/graphar_pyspark/writer.py
	modified:   pyspark/tests/test_writer.py
@SemyonSinchenko (Member, Author)

So... The code is mostly covered by tests:

======================== test session starts ========================
platform linux -- Python 3.9.18, pytest-7.4.3, pluggy-1.3.0
rootdir: /home/sem/github/GraphAr/pyspark
plugins: cov-4.1.0, anyio-4.2.0
collected 14 items

tests/test_enums.py ...                                                                                                                                                                  [ 21%]
tests/test_info.py ......                                                                                                                                                                [ 64%]
tests/test_reader.py ...                                                                                                                                                                 [ 85%]
tests/test_writer.py ..                                                                                                                                                                  [100%]

---------- coverage: platform linux, python 3.9.18-final-0 -----------
Name                          Stmts   Miss  Cover
-------------------------------------------------
graphar_pyspark/__init__.py      27      2    93%
graphar_pyspark/enums.py         41      0   100%
graphar_pyspark/graph.py         90     28    69%
graphar_pyspark/info.py         374     34    91%
graphar_pyspark/reader.py        84     22    74%
graphar_pyspark/util.py          50     23    54%
graphar_pyspark/writer.py        57     10    82%
-------------------------------------------------
TOTAL                           723    119    84%

I think it is ready for review. @acezen Can we define a fixed list of checkboxes that need to be completed to make it ready for merge? @Corwinpro I tried to address all of your comments, thanks! Could you please take another look at the unresolved ones, where I left replies?

@acezen (Contributor) left a comment

The code seems cool! I left some comments, and there are some things that still need to be implemented:

  • Enable the tests in GitHub Actions
  • Can we add an example or tutorial showing how to use graphar-pyspark? You can refer to the Scala example.
  • Maybe add some documentation about graphar-pyspark.

Thanks for the contribution!

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   .gitignore
	modified:   docs/Makefile
	modified:   docs/index.rst
	new file:   docs/pyspark/api/graphar_pyspark.rst
	new file:   docs/pyspark/api/modules.rst
	new file:   docs/pyspark/index.rst
	new file:   docs/pyspark/pyspark-lib.rst
	modified:   pyspark/Makefile
	modified:   pyspark/graphar_pyspark/__init__.py
	modified:   pyspark/graphar_pyspark/enums.py
	new file:   pyspark/graphar_pyspark/errors.py
	modified:   pyspark/graphar_pyspark/graph.py
	modified:   pyspark/graphar_pyspark/info.py
	modified:   pyspark/graphar_pyspark/reader.py
	modified:   pyspark/graphar_pyspark/util.py
	modified:   pyspark/graphar_pyspark/writer.py
	modified:   pyspark/poetry.lock
	modified:   pyspark/pyproject.toml
	modified:   pyspark/tests/__init__.py
	modified:   pyspark/tests/conftest.py
	modified:   pyspark/tests/test_enums.py
	modified:   pyspark/tests/test_info.py
	modified:   pyspark/tests/test_reader.py
	modified:   pyspark/tests/test_writer.py
+ add poetry-lock file to .licenserc ignore section

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   .licenserc.yaml
	new file:   pyspark/README.md
	deleted:    pyspark/README.rst
	modified:   pyspark/pyproject.toml
- new tests
- improved coverage
- updated Makefile for Python project
- updated pyproject.toml

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   pyspark/Makefile
	modified:   pyspark/graphar_pyspark/info.py
	modified:   pyspark/graphar_pyspark/writer.py
	modified:   pyspark/pyproject.toml
	modified:   pyspark/tests/test_info.py
	modified:   pyspark/tests/test_writer.py
 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   pyspark/graphar_pyspark/writer.py
 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   pyspark/tests/test_writer.py
@acezen (Contributor) left a comment

Hi @SemyonSinchenko, I left some comments about the code and docs, and here are some general questions:

  • How do we build the graphar_pyspark package? Do we need to install some necessary dependencies? If so, maybe you need to add something like requirements.txt/requirements-dev.txt to install dependencies with pip.
  • What's the relationship between the Scala Spark library and graphar_pyspark? If we need to package the PySpark library, do we need to build the Scala library first?
  • Maybe add a CI workflow to the Spark CI to run the build step and tests in GitHub Actions?

@SemyonSinchenko (Member, Author)

* How do we build the graphar_pyspark package? Do we need to install some necessary dependencies? If so, maybe you need to add something like `requirements.txt/requirements-dev.txt` to install dependencies with pip.

Currently poetry is used as the build system (poetry install installs the project for local development and poetry build builds the distribution). Dependencies are defined inside the pyproject.toml file as dependency groups (spark, tests, docs, etc.).

* What's the relationship between the Scala Spark library and graphar_pyspark? If we need to package the PySpark library, do we need to build the Scala library first?

I know two ways to do it. The first is the way pyspark itself does it: downloading JAR files silently during pip install. In my opinion this is a bad option, because it may easily break users' environments. The other way is like the Synapse ML Python project or PyDeequ, but that approach requires Maven coordinates, and as I understand it we still do not have Maven publishing. Do you have plans to test the GraphAr Spark Scala library against different Spark and Scala versions (like graphar_1.0.0_spark_3.2_scala_2.12_hadoop_3.2 or graphar_1.0.0_spark_3.5_scala_2.13_hadoop_2, etc.)?

* Maybe add a CI workflow to the [Spark CI](https://github.com/alibaba/GraphAr/blob/main/.github/workflows/spark.yaml) to run the build step and tests in GitHub Actions?

I can add the following:

  • CI for code linting of the pyspark project
  • CI for tests of the pyspark project as a step inside spark.yml

Also, I can add something like a python-release.yml action that would include:

  • running tests
  • adding a PyPI API token (as in the poetry docs)
  • publishing the package to PyPI (poetry publish --build)

In this case graphar_pyspark will be installable via pip install. The only thing I need from you is to add a GitHub secret with the token (like PYPI_API_TOKEN or something similar).

@SemyonSinchenko (Member, Author) commented Jan 4, 2024

  • CI/CD for code style of the Python project
  • update the Spark CI/CD by adding the Python tests
  • update README.md as described in the comments
  • docstrings for all public API
  • test coverage > 95%
  • usage example

Did I miss something?

@acezen (Contributor) commented Jan 4, 2024

  • CI/CD for code style of the Python project
  • update the Spark CI/CD by adding the Python tests
  • update README.md as described in the comments
  • docstrings for all public API
  • test coverage > 95%

Did I miss something?

Overall LGTM, thanks! Maybe we can add some tutorials or examples for graphar_pyspark?

- ruff passed
- coverage 95%+
- docstrings for all the public API

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   pyspark/graphar_pyspark/__init__.py
	modified:   pyspark/graphar_pyspark/enums.py
	modified:   pyspark/graphar_pyspark/errors.py
	modified:   pyspark/graphar_pyspark/graph.py
	modified:   pyspark/graphar_pyspark/info.py
	modified:   pyspark/graphar_pyspark/reader.py
	modified:   pyspark/graphar_pyspark/util.py
	modified:   pyspark/graphar_pyspark/writer.py
	modified:   pyspark/poetry.lock
	modified:   pyspark/pyproject.toml
	modified:   pyspark/tests/test_reader.py
	new file:   pyspark/tests/test_transform.py
	modified:   pyspark/tests/test_writer.py
@acezen (Contributor) commented Jan 8, 2024

Hi @SemyonSinchenko, is there any update on the remaining items? We plan to release GraphAr v0.11.0 in the middle of January; I think we can include this feature in that release.

@SemyonSinchenko (Member, Author)

Hi @SemyonSinchenko, is there any update on the remaining items? We plan to release GraphAr v0.11.0 in the middle of January; I think we can include this feature in that release.

Hi! I was on a short vacation. Most of the work is done; I will finish it in a couple of days.

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	new file:   .github/workflows/pyspark.yml
	new file:   docs/pyspark/how-to.rst
	modified:   docs/pyspark/index.rst
 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   pyspark/README.md
@SemyonSinchenko (Member, Author)

@acezen To be honest, I'm not very experienced with DevOps tooling, and the Spark workflow file is very complicated (some magic with paths). I'm afraid to touch it because I might break something. Can we couple the building of spark & pyspark in the next PR? This one is already very big. Also, I was going to suggest testing GraphAr against multiple versions of Hadoop & Spark (via a matrix in GHA), so the Spark CI may change soon anyway... What do you think?

@acezen (Contributor) commented Jan 10, 2024

@acezen To be honest, I'm not very experienced with DevOps tooling, and the Spark workflow file is very complicated (some magic with paths). I'm afraid to touch it because I might break something. Can we couple the building of spark & pyspark in the next PR? This one is already very big. Also, I was going to suggest testing GraphAr against multiple versions of Hadoop & Spark (via a matrix in GHA), so the Spark CI may change soon anyway... What do you think?

I agree with you. The Spark workflow is a little complicated and needs refining. I think we can open a new issue for that and couple the building of spark & pyspark there. It's OK to merge this PR first.

@acezen (Contributor) left a comment

I think it's ready to merge once the typo and the CI/CD settings are fixed.

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   .github/workflows/pyspark.yml
	modified:   pyspark/README.md
 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   pyspark/graphar_pyspark/info.py
@acezen (Contributor) left a comment

LGTM, thanks for the great work!

@acezen merged commit 2faccd8 into apache:main on Jan 10, 2024
2 checks passed
@SemyonSinchenko deleted the 297-add-pyspark-bindings branch on April 22, 2024 at 17:48