[Feat][Spark] Implementation of PySpark bindings to Scala API #300
Conversation
On branch 297-add-pyspark-bindings Changes to be committed: modified: .gitignore new file: pyspark/graphar_pysaprk/__init__.py new file: pyspark/graphar_pysaprk/enums.py new file: pyspark/graphar_pysaprk/graph.py new file: pyspark/graphar_pysaprk/info.py new file: pyspark/graphar_pysaprk/reader.py new file: pyspark/graphar_pysaprk/writer.py new file: pyspark/poetry.lock new file: pyspark/pyproject.toml
nice work! Thanks for the draft & concept, I will take a look and reply ASAP. |
hi @SemyonSinchenko, can you give an example of how to use graphar-pyspark from the user's point of view, based on the concept? |
The API of graphar-pyspark looks good to me; it seems clean and works. Thanks for the work!
Thank you! In this case I will focus on covering the code with tests and fixing errors. I will update this PR soon. I will skip the topic of CI/CD for now, because it is important to run the Python tests only after the Scala project is built, and I have no clear vision of how to couple these things. |
@acezen In the scala part you can check it easily by yourself (this example is about comparing two identically configured `Property` objects):

```
[sem@nixos:~/github/GraphAr/spark/target]$ scala -cp graphar-0.1.0-SNAPSHOT.jar
Welcome to Scala 2.13.10 (OpenJDK 64-Bit Server VM, Java 19.0.2).
Type in expressions for evaluation. Or try :help.

scala> import com.alibaba.graphar.Property;
import com.alibaba.graphar.Property

scala> val p1 = new Property();
val p1: com.alibaba.graphar.Property = com.alibaba.graphar.Property@a6e4897

scala> val p2 = new Property()
val p2: com.alibaba.graphar.Property = com.alibaba.graphar.Property@66e827a8

scala> p1.setName("name")
scala> p2.setName("name")
scala> p1.setIs_primary(false)
scala> p2.setIs_primary(false)
scala> p1.setData_type("INT32")
scala> p2.setData_type("INT32")

scala> p1 == p2
val res6: Boolean = false
```

I'm asking because it affects how I should write tests for the Python bindings. If this behavior is expected, then I will use something like this (the only way I found to get an actual comparison):

```python
assert vertex_info_from_py.contain_property_group(
    PropertyGroup.from_scala(vertex_info_from_py.get_property_groups()[0].to_scala())
)
```

If it is not expected behavior, I can implement … |
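For illustration, a hedged sketch of the second strategy alluded to above (implementing equality on the Python wrapper itself); `PropertyGroupWrapper` is a hypothetical stand-in for the PR's wrapper classes, assuming `to_scala()` returns a py4j `JavaObject`:

```python
class PropertyGroupWrapper:
    """Hypothetical wrapper used only to illustrate Python-side equality."""

    def __init__(self, jvm_obj) -> None:
        self._jvm_obj = jvm_obj

    def to_scala(self):
        # Link to the underlying JVM object.
        return self._jvm_obj

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, PropertyGroupWrapper):
            return NotImplemented
        # py4j forwards this call to java.lang.Object#equals on the JVM side,
        # so it only becomes useful once the Scala classes compare field-wise.
        return self._jvm_obj.equals(other._jvm_obj)
```

With field-wise `equals` on the Scala side (the fix later landed as PR #306 in this thread), tests could assert `pg1 == pg2` directly instead of round-tripping through `contain_property_group`.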
On branch 297-add-pyspark-bindings Changes to be committed: renamed: pyspark/graphar_pysaprk/__init__.py -> pyspark/graphar_pyspark/__init__.py renamed: pyspark/graphar_pysaprk/enums.py -> pyspark/graphar_pyspark/enums.py renamed: pyspark/graphar_pysaprk/graph.py -> pyspark/graphar_pyspark/graph.py renamed: pyspark/graphar_pysaprk/info.py -> pyspark/graphar_pyspark/info.py renamed: pyspark/graphar_pysaprk/reader.py -> pyspark/graphar_pyspark/reader.py renamed: pyspark/graphar_pysaprk/writer.py -> pyspark/graphar_pyspark/writer.py modified: pyspark/poetry.lock modified: pyspark/pyproject.toml new file: pyspark/tests/__init__.py new file: pyspark/tests/conftest.py new file: pyspark/tests/test_enums.py new file: pyspark/tests/test_info.py
On branch 297-add-pyspark-bindings Changes to be committed: modified: .gitignore modified: pyspark/graphar_pyspark/__init__.py modified: pyspark/graphar_pyspark/info.py modified: pyspark/poetry.lock modified: pyspark/pyproject.toml modified: pyspark/tests/test_info.py
Another problem I found: in the scala part, calling `VertexInfo.loadVertexInfo` through py4j fails with the following error:

```
E py4j.protocol.Py4JJavaError: An error occurred while calling z:com.alibaba.graphar.VertexInfo.loadVertexInfo.
E : java.lang.NoSuchMethodError: 'void org.yaml.snakeyaml.constructor.Constructor.<init>(java.lang.Class)'
E at com.alibaba.graphar.VertexInfo$.loadVertexInfo(VertexInfo.scala:291)
E at com.alibaba.graphar.VertexInfo.loadVertexInfo(VertexInfo.scala)
E at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
```

The only workaround I found is to implement the YAML reading on the Python side. Could you please check the last commit (…)? |
Thanks for the note. The scala-part `==` of PropertyGroup seems to be a bug; it's not what we expected. What we expect is that objects with identical fields compare as true. I will create an issue and submit a bug fix for this. |
I think the workaround of implementing the YAML reading in Python would make the code not a binding to scala, which is not what we want. It's ok to make changes on the scala side, even replacing snake-yaml with another yaml lib, if that's the simplest way to fix it. |
The comparison bug has been fixed by PR #306 |
On branch 297-add-pyspark-bindings Changes to be committed: new file: pyspark/README.rst modified: pyspark/graphar_pyspark/info.py modified: pyspark/pyproject.toml modified: pyspark/tests/test_info.py
@acezen I may rewrite the scala part from using the java snake-yaml to a more "scala-way" with case classes and scala collections instead of BeanProperty and java collections; circe-yaml may be used as the YAML parser. But it is a lot of work, so I want to discuss it first. Should we discuss it here, or is it better to create another issue?

UPD: I found the root of the problem. PySpark depends on some Hadoop jars, and the hadoop jars depend on a different version of snakeyaml. I found this one in the dependencies on the python side: … Starting from snakeyaml 2.0 the syntax is slightly different. I will try to update the existing scala code to the newer version of snakeyaml.

SOLUTION: …
I checked, and all tests on the scala part passed:

```
Run completed in 1 minute, 9 seconds.
Total number of tests run: 21
Suites: completed 10, aborted 0
Tests: succeeded 21, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

@acezen Should I open a bug (I'm not sure it is a bug; who in this world knows that pyspark silently downloads some hadoop dependencies?) and fix it in the scala part? |
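One way to see the silently downloaded dependency mentioned above (a hedged sketch; the exact jar names depend on the installed pyspark version):

```python
# The pip distribution of pyspark bundles its JVM dependencies under pyspark/jars;
# listing the snakeyaml jar(s) there shows which version py4j calls will actually hit.
# If it differs from the version the Scala project was compiled against, calls can
# fail with a NoSuchMethodError like the one earlier in this thread.
import pathlib
import pyspark

jars_dir = pathlib.Path(pyspark.__path__[0]) / "jars"
print(sorted(p.name for p in jars_dir.glob("snakeyaml*.jar")))
```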
Hi, thanks very much for this amazing work! I focused almost exclusively on python here, so please kindly let me know if I missed something spark/scala specific.
Also please feel free to ignore any of the below comments. Those are just suggestions, but I hope you find them useful.
In general, the main points are:
- Usage of the singleton object instantiated at import time (very fragile imo, and it can be made more user-friendly)
- Usage of `staticmethod` (most of the cases appear to be class methods from what I could see inside)
- Some minor general / typing improvements
Thank you very much!
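On the `staticmethod` point, a brief hedged sketch (class and method names are illustrative, not the PR's actual code): a `from_scala` factory written as a `classmethod` respects subclassing, while a `staticmethod` hard-codes the class it constructs:

```python
class InfoWrapper:
    """Illustrative base wrapper around a py4j JVM object."""

    def __init__(self, jvm_obj) -> None:
        self._jvm_obj = jvm_obj

    # As a staticmethod, the constructed class is fixed forever:
    @staticmethod
    def from_scala_static(jvm_obj) -> "InfoWrapper":
        return InfoWrapper(jvm_obj)

    # As a classmethod, `cls` is whichever class the factory is called on,
    # so a hypothetical subclass VertexInfoWrapper(InfoWrapper) gets the
    # right type from the same factory for free:
    @classmethod
    def from_scala(cls, jvm_obj) -> "InfoWrapper":
        return cls(jvm_obj)
```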
- update pyproject.toml - fix a lot of things - some work based on comments - license header everywhere - minor changes On branch 297-add-pyspark-bindings Changes to be committed: new file: pyspark/Makefile modified: pyspark/graphar_pyspark/graph.py modified: pyspark/graphar_pyspark/info.py modified: pyspark/graphar_pyspark/reader.py modified: pyspark/poetry.lock modified: pyspark/pyproject.toml modified: pyspark/tests/__init__.py modified: pyspark/tests/conftest.py modified: pyspark/tests/test_enums.py modified: pyspark/tests/test_info.py new file: pyspark/tests/test_reader.py Changes not staged for commit: modified: spark/pom.xml modified: spark/src/main/scala/com/alibaba/graphar/EdgeInfo.scala modified: spark/src/main/scala/com/alibaba/graphar/GraphInfo.scala modified: spark/src/main/scala/com/alibaba/graphar/VertexInfo.scala
On branch 297-add-pyspark-bindings Changes to be committed: modified: pyspark/graphar_pyspark/__init__.py Changes not staged for commit: modified: spark/pom.xml modified: spark/src/main/scala/com/alibaba/graphar/EdgeInfo.scala modified: spark/src/main/scala/com/alibaba/graphar/GraphInfo.scala modified: spark/src/main/scala/com/alibaba/graphar/VertexInfo.scala
Yes, you can open a new issue and we can discuss the problem there. Maybe opening a feature issue is better, and you can open a new pull request to update the snake-yaml version. |
On branch 297-add-pyspark-bindings Changes to be committed: modified: pyspark/graphar_pyspark/__init__.py modified: pyspark/graphar_pyspark/enums.py modified: pyspark/graphar_pyspark/graph.py modified: pyspark/graphar_pyspark/info.py modified: pyspark/graphar_pyspark/reader.py new file: pyspark/graphar_pyspark/util.py modified: pyspark/graphar_pyspark/writer.py modified: pyspark/tests/test_info.py modified: pyspark/tests/test_reader.py new file: pyspark/tests/test_writer.py
On branch 297-add-pyspark-bindings Changes to be committed: modified: pyspark/graphar_pyspark/util.py modified: pyspark/graphar_pyspark/writer.py modified: pyspark/tests/test_writer.py
So... The code is mostly covered by tests:

```
======================== test session starts ========================
platform linux -- Python 3.9.18, pytest-7.4.3, pluggy-1.3.0
rootdir: /home/sem/github/GraphAr/pyspark
plugins: cov-4.1.0, anyio-4.2.0
collected 14 items

tests/test_enums.py ...                                       [ 21%]
tests/test_info.py ......                                     [ 64%]
tests/test_reader.py ...                                      [ 85%]
tests/test_writer.py ..                                       [100%]

---------- coverage: platform linux, python 3.9.18-final-0 ----------
Name                          Stmts   Miss  Cover
-------------------------------------------------
graphar_pyspark/__init__.py      27      2    93%
graphar_pyspark/enums.py         41      0   100%
graphar_pyspark/graph.py         90     28    69%
graphar_pyspark/info.py         374     34    91%
graphar_pyspark/reader.py        84     22    74%
graphar_pyspark/util.py          50     23    54%
graphar_pyspark/writer.py        57     10    82%
-------------------------------------------------
TOTAL                           723    119    84%
```
I think it is ready for review. @acezen Could we define a closed list of checkboxes that need to be ticked off to make it ready for merge? @Corwinpro I tried to address all your comments, thanks! Could you please take another look at the unresolved ones, where I left replies? |
The code seems cool! I left some comments, and there are some things that still need to be done:
- Enable the tests in GitHub Actions.
- Can we add some example or tutorial to show how to use graphar-pyspark? You can refer to the Scala examples.
- Maybe add some documentation about graphar-pyspark.
Thanks for the contribution!
On branch 297-add-pyspark-bindings Changes to be committed: modified: .gitignore modified: docs/Makefile modified: docs/index.rst new file: docs/pyspark/api/graphar_pyspark.rst new file: docs/pyspark/api/modules.rst new file: docs/pyspark/index.rst new file: docs/pyspark/pyspark-lib.rst modified: pyspark/Makefile modified: pyspark/graphar_pyspark/__init__.py modified: pyspark/graphar_pyspark/enums.py new file: pyspark/graphar_pyspark/errors.py modified: pyspark/graphar_pyspark/graph.py modified: pyspark/graphar_pyspark/info.py modified: pyspark/graphar_pyspark/reader.py modified: pyspark/graphar_pyspark/util.py modified: pyspark/graphar_pyspark/writer.py modified: pyspark/poetry.lock modified: pyspark/pyproject.toml modified: pyspark/tests/__init__.py modified: pyspark/tests/conftest.py modified: pyspark/tests/test_enums.py modified: pyspark/tests/test_info.py modified: pyspark/tests/test_reader.py modified: pyspark/tests/test_writer.py
+ add poetry-lock file to .licenserc ignore section On branch 297-add-pyspark-bindings Changes to be committed: modified: .licenserc.yaml new file: pyspark/README.md deleted: pyspark/README.rst modified: pyspark/pyproject.toml
- new tests - improved coverage - updated Makefile for Python project - updated pyproject.toml On branch 297-add-pyspark-bindings Changes to be committed: modified: pyspark/Makefile modified: pyspark/graphar_pyspark/info.py modified: pyspark/graphar_pyspark/writer.py modified: pyspark/pyproject.toml modified: pyspark/tests/test_info.py modified: pyspark/tests/test_writer.py
On branch 297-add-pyspark-bindings Changes to be committed: modified: pyspark/graphar_pyspark/writer.py
On branch 297-add-pyspark-bindings Changes to be committed: modified: pyspark/tests/test_writer.py
hi, @SemyonSinchenko, I left some comments about the code and docs, and here are some general questions:
- How do we build the graphar_pyspark package? Do we need to install some necessary dependencies? If so, maybe you need to add something like `requirements.txt`/`requirements-dev.txt` so that dependencies can be installed with pip.
- What's the relationship between scala spark and graphar_pyspark? If we need to package the pyspark, do we need to build the scala part first?
- Maybe add a CI workflow to the spark CI to run the build step and tests in GitHub Actions? |
Currently … I know two ways to do it. The first is the way pyspark itself does it, where the JAR files are downloaded silently while … I can add the following: … Also, I can add something like … In this case … |
Did I miss something? |
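For reference, a hedged sketch of what the user-facing wiring could look like under this thread's assumptions: the jar path reuses the one from the scala REPL example above, `graphar_pyspark.initialize` is the entry point named in the PR description, and everything else is illustrative rather than the package's confirmed API:

```python
from pyspark.sql import SparkSession

import graphar_pyspark

# Build the Scala project first, then point the Spark session at the resulting jar
# so the JVM classes behind the bindings are on the classpath.
spark = (
    SparkSession.builder
    .appName("graphar-pyspark-demo")
    .config("spark.jars", "spark/target/graphar-0.1.0-SNAPSHOT.jar")
    .getOrCreate()
)

# Per the PR description, this must be called before any other graphar_pyspark API.
graphar_pyspark.initialize(spark)
```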
overall LGTM, thanks~ Maybe we can add some tutorials or examples for graphar_pyspark? |
- ruff passed - coverage 95%+ - docstrings for all the public API On branch 297-add-pyspark-bindings Changes to be committed: modified: pyspark/graphar_pyspark/__init__.py modified: pyspark/graphar_pyspark/enums.py modified: pyspark/graphar_pyspark/errors.py modified: pyspark/graphar_pyspark/graph.py modified: pyspark/graphar_pyspark/info.py modified: pyspark/graphar_pyspark/reader.py modified: pyspark/graphar_pyspark/util.py modified: pyspark/graphar_pyspark/writer.py modified: pyspark/poetry.lock modified: pyspark/pyproject.toml modified: pyspark/tests/test_reader.py new file: pyspark/tests/test_transform.py modified: pyspark/tests/test_writer.py
hi, @SemyonSinchenko, is there any update on the remaining items? We plan to release GraphAr v0.11.0 in the middle of January; I think we can include this feature in that release. |
Hi! I was on a short vacation. Most of the work is done; I will finish it in a couple of days. |
On branch 297-add-pyspark-bindings Changes to be committed: modified: pyspark/README.md
@acezen To be honest, I'm not a very experienced guy in DevOps tooling, and the spark workflow file is very complicated (some magic with paths). I'm afraid to touch it because I could break something. Could we couple the building of spark & pyspark in the next PR? This one is already very big. And I was going to suggest testing GraphAr against multiple versions of Hadoop & Spark (via … |
I agree with you. The spark workflow is a little complicated and needs to be refined. I think we can open a new issue for that and couple the building of spark & pyspark there. It's ok to merge this PR first. |
I think it's ready to merge once the typo and the CI/CD settings are fixed.
On branch 297-add-pyspark-bindings Changes to be committed: modified: .github/workflows/pyspark.yml modified: pyspark/README.md
On branch 297-add-pyspark-bindings Changes to be committed: modified: pyspark/graphar_pyspark/info.py
LGTM, thanks for the great work!
Proposed changes
PySpark: Python bindings to the Spark Scala package of GraphAr with py4j. Related to #297 and follows the discussed concept.

Types of changes
What types of changes does your code introduce to GraphAr? Put an `x` in the boxes that apply.

Checklist
Put an `x` in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

Further comments
The proposed changes are very big and contain only a draft/concept of the implementation. I checked some methods and bindings in a Jupyter Notebook, but I do not want to write tests and documentation before getting approval of the concept. @acezen could you please take a top-level look at the overall design, naming conventions, package structure, etc.? Thanks in advance! After finalizing the design I will focus on covering the code with tests.
The main idea is the following: it looks like the Spark implementation will be "Scala API first", so I tried to avoid moving any real logic to Python. I just tried to proxy all the public API (except `utils` and `datasources`, which are useless for PySpark), even in places where the same logic could be reproduced in pure Python. The benefit is that this should simplify maintenance of the bindings. The downside is that some things are done in a not-"pythonic" way, and the source code is almost meaningless on its own: anyone who wants to understand the logic should refer to the Scala implementation.

A couple of things:
- Fields are exposed through `get_` methods; `set_` methods are also provided.
- Each wrapper has a `to_scala()` method that returns the link to the corresponding JVM object.
- Each wrapper has `from_python(args)` and `from_scala(jvm_object)`, which allow creating a Python wrapper in two ways: the first is creating the wrapper on top of an existing JVM object, and the second is initializing a new JVM object with parameters provided from Python.
- Overloading is mimicked with distinct names, e.g. `GraphWriter.write_with_graph_info(graph_info)` and `GraphWriter.write(path, name, vertex_chunk_size, edge_chunk_size, file_type, version)`. I can change it, but there are a lot of different tricks for mimicking overloading in Python (a sketch of this pattern follows below).

P.S. To try it now (it is a temporary way, for now):
1. Install `poetry`
2. `poetry env use python3.9`
3. `poetry install --with=spark,dev,jupyter`
4. `poetry run jupyter lab`

There is `graphar_pyspark.initialize(spark: SparkSession)` that should be called first.

P.P.S. As I mentioned, it is just a draft for now and a lot of errors may be inside. Once the concept is finalized, I will cover it with unit tests.
====================
@Corwinpro Could you please take a look and give me your opinion on the design of the bindings?