
[Feat][Spark] Implementation of PySpark bindings to Scala API #300

Merged: 25 commits merged into apache:main from 297-add-pyspark-bindings on Jan 10, 2024

Conversation

@SemyonSinchenko (Member) commented Dec 23, 2023

Proposed changes

PySpark bindings to the Spark Scala package of GraphAr, implemented with py4j. Related to #297 and follows the discussed concept.

Types of changes

What types of changes does your code introduce to GraphAr?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING doc
  • I have signed the CLA
  • Lint and unit tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

The proposed changes are very big and contain only a draft/concept of the implementation. I checked some methods and bindings in a Jupyter Notebook, but I do not want to write tests and documentation before getting approval of the concept. @acezen, could you please take a top-level look at the overall design, naming conventions, package structure, etc.? Thanks in advance! After the design is finalized I will focus on covering the code with tests.

The main idea is the following: it looks like the Spark implementation will be "Scala API first", so I tried to avoid moving any real logic into Python. I simply proxy the entire public API (except utils and datasources, which are not useful from PySpark), even in places where the same logic could be reproduced in pure Python. The benefit is that this should simplify maintenance of the bindings. The downside is that some things are not done in a "pythonic" way, and the source code is mostly boilerplate, so anyone who wants to understand the logic should refer to the Scala implementation.

Couple of things:

  • Every public field is exposed in Python via a get_ method.
  • For beans, set_ methods are also provided.
  • Each class has a to_scala() method that returns a reference to the corresponding JVM object.
  • Each class has two main constructors, from_python(args) and from_scala(jvm_object), which give two ways to create a Python wrapper: from_scala wraps an existing JVM object, while from_python initializes a new JVM object from parameters provided in Python (see the sketch after this list).
  • There is no direct way to overload methods, so for some of them I split the original method in two. For example, in Python there are two methods, GraphWriter.write_with_graph_info(graph_info) and GraphWriter.write(path, name, vertex_chunk_size, edge_chunk_size, file_type, version). I can change this, but there are many different tricks for mimicking overloading in Python.
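
For illustration, here is a minimal sketch of that wrapper pattern (not the actual code in this PR: the class body and the get_name/getName pair are assumptions; only the JVM class and its setters come from the Scala snippet shown later in this thread):

from py4j.java_gateway import JavaObject
from pyspark.sql import SparkSession


class Property:
    """Hypothetical Python wrapper around com.alibaba.graphar.Property (illustration only)."""

    def __init__(self, jvm_obj: JavaObject) -> None:
        # The wrapper only keeps a reference to the JVM object; no real logic lives in Python.
        self._jvm_property_obj = jvm_obj

    def get_name(self) -> str:
        # Every public field is exposed through a get_ method delegating to the JVM bean getter.
        return self._jvm_property_obj.getName()

    def set_name(self, name: str) -> None:
        # Beans also get set_ methods.
        self._jvm_property_obj.setName(name)

    def to_scala(self) -> JavaObject:
        # Hand the underlying JVM object back so it can be passed to the Scala API.
        return self._jvm_property_obj

    @classmethod
    def from_scala(cls, jvm_obj: JavaObject) -> "Property":
        # Wrap an already existing JVM object.
        return cls(jvm_obj)

    @classmethod
    def from_python(cls, name: str, data_type: str, is_primary: bool) -> "Property":
        # Create a new JVM object from Python parameters via the active SparkSession's gateway.
        jvm = SparkSession.getActiveSession()._jvm
        jvm_obj = jvm.com.alibaba.graphar.Property()
        jvm_obj.setName(name)
        jvm_obj.setData_type(data_type)
        jvm_obj.setIs_primary(is_primary)
        return cls(jvm_obj)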

P.S. To try it now (it is a temporary way for now):

  1. Install poetry
  2. poetry env use python3.9
  3. poetry install --with=spark,dev,jupyter
  4. poetry run jupyter lab
  5. The entry point of the Python library is graphar_pyspark.initialize(spark: SparkSession), which should be called first (see the usage sketch below)
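
A minimal usage sketch for this temporary setup (the SparkSession configuration and the JAR path are assumptions; the JAR is the one built from the spark/ module):

from pyspark.sql import SparkSession

import graphar_pyspark

# Assumed location of the GraphAr Spark JAR built from the spark/ module beforehand.
GRAPHAR_JAR = "spark/target/graphar-0.1.0-SNAPSHOT.jar"

spark = (
    SparkSession.builder
    .appName("graphar-pyspark-demo")
    .config("spark.jars", GRAPHAR_JAR)
    .getOrCreate()
)

# Entry point of the library: binds the Python wrappers to the Scala API via py4j.
graphar_pyspark.initialize(spark)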

P.P.S. As mentioned, this is just a draft for now, so it may still contain a lot of errors. Once the concept is finalized, I will cover it with unit tests.

====================
@Corwinpro Could you please take a look and give me your opinion about the design of the bindings?

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   .gitignore
	new file:   pyspark/graphar_pysaprk/__init__.py
	new file:   pyspark/graphar_pysaprk/enums.py
	new file:   pyspark/graphar_pysaprk/graph.py
	new file:   pyspark/graphar_pysaprk/info.py
	new file:   pyspark/graphar_pysaprk/reader.py
	new file:   pyspark/graphar_pysaprk/writer.py
	new file:   pyspark/poetry.lock
	new file:   pyspark/pyproject.toml
@CLAassistant commented Dec 23, 2023

CLA assistant check
All committers have signed the CLA.

@SemyonSinchenko SemyonSinchenko changed the title Implementation draft & concept [Feat][Spark] Implementation draft & concept of PySpark bindings to Scala API Dec 24, 2023
@acezen (Contributor) commented Dec 26, 2023

Nice work! Thanks for the draft and concept. I will take a look and reply ASAP.

@acezen (Contributor) commented Dec 27, 2023

Hi @SemyonSinchenko, can you give an example of how to use graphar-pyspark from a user's point of view, based on the concept?

@acezen (Contributor) left a comment

The API of graphar-pyspark looks good to me; it seems clean and workable. Thanks for the work!

@SemyonSinchenko (Member, Author)

The API of graphar-pyspark looks good to me; it seems clean and workable. Thanks for the work!

Thank you! In that case I will focus on covering the code with tests and fixing errors. I will update this PR soon. For now I will skip the topic of CI/CD, because the Python tests can only run after the Scala project is built, and I have no clear vision of how to couple these things.

@SemyonSinchenko (Member, Author)

@acezen In the Scala part, == comparison is used for property groups and properties, and it is always false except when the objects are literally the same instance. Is this a bug or expected behavior? Even if you pass an object with identical fields, the expression will still be false.

You can easily check it yourself (this example uses Property, but it relies on the same beans as PropertyGroup):

[sem@nixos:~/github/GraphAr/spark/target]$ scala -cp graphar-0.1.0-SNAPSHOT.jar
Welcome to Scala 2.13.10 (OpenJDK 64-Bit Server VM, Java 19.0.2).
Type in expressions for evaluation. Or try :help.

scala> import com.alibaba.graphar.Property;
import com.alibaba.graphar.Property

scala> val p1 = new Property();
val p1: com.alibaba.graphar.Property = com.alibaba.graphar.Property@a6e4897

scala> val p2 = new Property()
val p2: com.alibaba.graphar.Property = com.alibaba.graphar.Property@66e827a8

scala> p1.setName("name")

scala> p2.setName("name")

scala> p1.setIs_primary(false)

scala> p2.setIs_primary(false)

scala> p1.setData_type("INT32")

scala> p2.setData_type("INT32")

scala> p1 == p2
val res6: Boolean = false

I'm asking because it affects how I should write tests for the Python bindings. If this behavior is expected, then I will use something like this (the only way to get an actual true result):

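# Both sides of the check wrap the same JVM object (from_scala(to_scala()) returns a wrapper
# around the original reference), so the reference-based Scala == comparison evaluates to true: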
assert vertex_info_from_py.contain_property_group(PropertyGroup.from_scala(vertex_info_from_py.get_property_groups()[0].to_scala()))

If it is not expected behavior, I can implement def canEqual(that: Any): Boolean and override def equals(that: Any): Boolean for Property and PropertyGroup in the Scala API, so that == compares fields rather than references.

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	renamed:    pyspark/graphar_pysaprk/__init__.py -> pyspark/graphar_pyspark/__init__.py
	renamed:    pyspark/graphar_pysaprk/enums.py -> pyspark/graphar_pyspark/enums.py
	renamed:    pyspark/graphar_pysaprk/graph.py -> pyspark/graphar_pyspark/graph.py
	renamed:    pyspark/graphar_pysaprk/info.py -> pyspark/graphar_pyspark/info.py
	renamed:    pyspark/graphar_pysaprk/reader.py -> pyspark/graphar_pyspark/reader.py
	renamed:    pyspark/graphar_pysaprk/writer.py -> pyspark/graphar_pyspark/writer.py
	modified:   pyspark/poetry.lock
	modified:   pyspark/pyproject.toml
	new file:   pyspark/tests/__init__.py
	new file:   pyspark/tests/conftest.py
	new file:   pyspark/tests/test_enums.py
	new file:   pyspark/tests/test_info.py
@SemyonSinchenko SemyonSinchenko changed the title [Feat][Spark] Implementation draft & concept of PySpark bindings to Scala API [Feat][Spark] Implementation of PySpark bindings to Scala API Dec 27, 2023
 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   .gitignore
	modified:   pyspark/graphar_pyspark/__init__.py
	modified:   pyspark/graphar_pyspark/info.py
	modified:   pyspark/poetry.lock
	modified:   pyspark/pyproject.toml
	modified:   pyspark/tests/test_info.py
@SemyonSinchenko (Member, Author)

Another problem I found: the Scala part uses snakeyaml, and there is a problem with the combination of this library and Spark:

E                   py4j.protocol.Py4JJavaError: An error occurred while calling z:com.alibaba.graphar.VertexInfo.loadVertexInfo.
E                   : java.lang.NoSuchMethodError: 'void org.yaml.snakeyaml.constructor.Constructor.<init>(java.lang.Class)'
E                   	at com.alibaba.graphar.VertexInfo$.loadVertexInfo(VertexInfo.scala:291)
E                   	at com.alibaba.graphar.VertexInfo.loadVertexInfo(VertexInfo.scala)
E                   	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

The only workaround I found is to implement YAML reading on the Python side. Could you please check the last commit (pyspark/graphar_pyspark/info.py:344)? If you are OK with such an implementation, I will change GraphInfo and EdgeInfo loading from YAML in the same way. Otherwise, we need to make changes on the Scala side (there are some options in SO threads, so it looks possible).
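
For context, a rough sketch of what such a Python-side loader could look like (this is not the code from the commit: yaml is PyYAML, and the from_python keyword arguments are hypothetical, guessed from the GraphAr vertex-info YAML layout):

import yaml  # PyYAML

from graphar_pyspark.info import VertexInfo  # import path follows this PR's package layout


def load_vertex_info_from_yaml(path: str) -> VertexInfo:
    """Parse the vertex info YAML in Python instead of calling the Scala loader."""
    with open(path, encoding="utf-8") as f:
        raw = yaml.safe_load(f)

    # The keyword arguments below are hypothetical; the real constructor may differ.
    return VertexInfo.from_python(
        label=raw["label"],
        chunk_size=raw["chunk_size"],
        prefix=raw.get("prefix", ""),
        property_groups=raw.get("property_groups", []),
        version=raw["version"],
    )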

@acezen (Contributor) commented Dec 28, 2023

@acezen In the Scala part, == comparison is used for property groups and properties, and it is always false except when the objects are literally the same instance. Is this a bug or expected behavior? Even if you pass an object with identical fields, the expression will still be false. […]

Thanks for the note. The Scala-side == of PropertyGroup seems to be a bug; it's not what we expected. What we expect is that objects with identical fields compare as true. I will create an issue and submit a bug fix for this.

@acezen (Contributor) commented Dec 28, 2023

Another problem I found: the Scala part uses snakeyaml, and there is a problem with the combination of this library and Spark: […] The only workaround I found is to implement YAML reading on the Python side. […]

I think the workaround of implementing YAML reading in Python would make that code not really a binding to Scala, which is not what we want. It's OK to make changes on the Scala side, even replacing snakeyaml with another YAML library, if that's the simplest way to fix it.

@acezen (Contributor) commented Dec 28, 2023

@acezen In the Scala part, == comparison is used for property groups and properties, and it is always false except when the objects are literally the same instance. Is this a bug or expected behavior? Even if you pass an object with identical fields, the expression will still be false. […]

The comparison bug has been fixed by PR #306

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	new file:   pyspark/README.rst
	modified:   pyspark/graphar_pyspark/info.py
	modified:   pyspark/pyproject.toml
	modified:   pyspark/tests/test_info.py
@SemyonSinchenko (Member, Author) commented Dec 28, 2023

I think the workaround of implementing YAML reading in Python would make that code not really a binding to Scala, which is not what we want. It's OK to make changes on the Scala side, even replacing snakeyaml with another YAML library, if that's the simplest way to fix it.

@acezen I could rewrite the Scala part from the Java snakeyaml approach to a more "Scala way", with case classes and Scala collections instead of BeanProperty and Java collections; circe-yaml could be used as the YAML parser. But it is a lot of work, so I want to discuss it first. Should we discuss it here, or is it better to create another issue?

UPD: I found the root of the problem. PySpark depends on some Hadoop JARs, and the Hadoop JARs depend on a different version of snakeyaml. I found this one in the dependencies on the Python side:
(screenshot of the conflicting snakeyaml dependency)

Starting from snakeyaml 2.0 the syntax is slightly different. I will try to update the existing Scala code to the newer version of snakeyaml.

SOLUTION

  1. Update snakeyaml in pom.xml to 2.x.x (2.0.0 in my case)
  2. Add LoaderOptions to Constructor: new Constructor(classOf[VertexInfo], new LoaderOptions())

I checked, and all tests on the Scala part pass:

Run completed in 1 minute, 9 seconds.
Total number of tests run: 21
Suites: completed 10, aborted 0
Tests: succeeded 21, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

@acezen Should I open a bug (I'm not sure it is a bug; who in this world knows that PySpark silently downloads some Hadoop dependencies?) and fix it on the Scala side?
P.S. For now I will use the new Scala JAR with this fix for my tests of the PySpark API.
P.P.S. Just checked: with the new JAR the tests on the PySpark side pass, and the call to SparkSession._jvm.com.alibaba.graphar.VertexInfo.loadVertexInfo works fine! 🎉

@Corwinpro left a comment

Hi, thanks very much for this amazing work! I focused almost exclusively on the Python side here, so please kindly let me know if I missed something Spark/Scala-specific.

Also, please feel free to ignore any of the comments below. These are just suggestions, but I hope you find them useful.

In general, the main points are:

  • Usage of a singleton object instantiated at import time (very fragile IMO, and it could be more user-friendly)
  • Usage of staticmethod (most of these appear to be class methods from what I could see)
  • Some minor general/typing improvements.

Thank you very much!

- update pyproject.toml
- fix a lot of things
- some work based on comments
- license header everywhere
- minor changes

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	new file:   pyspark/Makefile
	modified:   pyspark/graphar_pyspark/graph.py
	modified:   pyspark/graphar_pyspark/info.py
	modified:   pyspark/graphar_pyspark/reader.py
	modified:   pyspark/poetry.lock
	modified:   pyspark/pyproject.toml
	modified:   pyspark/tests/__init__.py
	modified:   pyspark/tests/conftest.py
	modified:   pyspark/tests/test_enums.py
	modified:   pyspark/tests/test_info.py
	new file:   pyspark/tests/test_reader.py

 Changes not staged for commit:
	modified:   spark/pom.xml
	modified:   spark/src/main/scala/com/alibaba/graphar/EdgeInfo.scala
	modified:   spark/src/main/scala/com/alibaba/graphar/GraphInfo.scala
	modified:   spark/src/main/scala/com/alibaba/graphar/VertexInfo.scala
 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   pyspark/graphar_pyspark/__init__.py

 Changes not staged for commit:
	modified:   spark/pom.xml
	modified:   spark/src/main/scala/com/alibaba/graphar/EdgeInfo.scala
	modified:   spark/src/main/scala/com/alibaba/graphar/GraphInfo.scala
	modified:   spark/src/main/scala/com/alibaba/graphar/VertexInfo.scala
@acezen (Contributor) commented Dec 29, 2023

SOLUTION

  1. Update snakeyaml in pom.xml to 2.x.x (2.0.0 in my case)
  2. Add LoaderOptions to Constructor: new Constructor(classOf[VertexInfo], new LoaderOptions())

[…] @acezen Should I open a bug (I'm not sure it is a bug; who in this world knows that PySpark silently downloads some Hadoop dependencies?) and fix it on the Scala side? […]

Yes, you can open a new issue and we can discuss this problem there. Maybe opening a feature issue is better, and you can open a new pull request to update the snakeyaml version.

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   pyspark/graphar_pyspark/__init__.py
	modified:   pyspark/graphar_pyspark/enums.py
	modified:   pyspark/graphar_pyspark/graph.py
	modified:   pyspark/graphar_pyspark/info.py
	modified:   pyspark/graphar_pyspark/reader.py
	new file:   pyspark/graphar_pyspark/util.py
	modified:   pyspark/graphar_pyspark/writer.py
	modified:   pyspark/tests/test_info.py
	modified:   pyspark/tests/test_reader.py
	new file:   pyspark/tests/test_writer.py
 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   pyspark/graphar_pyspark/util.py
	modified:   pyspark/graphar_pyspark/writer.py
	modified:   pyspark/tests/test_writer.py
@SemyonSinchenko (Member, Author)

So... The code is mostly covered by tests:

======================== test session starts ========================
platform linux -- Python 3.9.18, pytest-7.4.3, pluggy-1.3.0
rootdir: /home/sem/github/GraphAr/pyspark
plugins: cov-4.1.0, anyio-4.2.0
collected 14 items

tests/test_enums.py ...                                                                                                                                                                  [ 21%]
tests/test_info.py ......                                                                                                                                                                [ 64%]
tests/test_reader.py ...                                                                                                                                                                 [ 85%]
tests/test_writer.py ..                                                                                                                                                                  [100%]

---------- coverage: platform linux, python 3.9.18-final-0 -----------
Name                          Stmts   Miss  Cover
-------------------------------------------------
graphar_pyspark/__init__.py      27      2    93%
graphar_pyspark/enums.py         41      0   100%
graphar_pyspark/graph.py         90     28    69%
graphar_pyspark/info.py         374     34    91%
graphar_pyspark/reader.py        84     22    74%
graphar_pyspark/util.py          50     23    54%
graphar_pyspark/writer.py        57     10    82%
-------------------------------------------------
TOTAL                           723    119    84%

I think it is ready for review. @acezen Can we define a fixed list of checkboxes that need to be completed to make it ready for merge? @Corwinpro I tried to address all of your comments, thanks! Could you please take another look at the unresolved ones, where I left replies?

@acezen (Contributor) left a comment

The code seems cool! I left some comments, and there are some things that still need to be implemented:

  • Enable the tests in GitHub Actions
  • Can we add an example or tutorial showing how to use graphar-pyspark? You can refer to the Scala example.
  • Maybe add some documentation about graphar-pyspark.

Thanks for the contribution!

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   .gitignore
	modified:   docs/Makefile
	modified:   docs/index.rst
	new file:   docs/pyspark/api/graphar_pyspark.rst
	new file:   docs/pyspark/api/modules.rst
	new file:   docs/pyspark/index.rst
	new file:   docs/pyspark/pyspark-lib.rst
	modified:   pyspark/Makefile
	modified:   pyspark/graphar_pyspark/__init__.py
	modified:   pyspark/graphar_pyspark/enums.py
	new file:   pyspark/graphar_pyspark/errors.py
	modified:   pyspark/graphar_pyspark/graph.py
	modified:   pyspark/graphar_pyspark/info.py
	modified:   pyspark/graphar_pyspark/reader.py
	modified:   pyspark/graphar_pyspark/util.py
	modified:   pyspark/graphar_pyspark/writer.py
	modified:   pyspark/poetry.lock
	modified:   pyspark/pyproject.toml
	modified:   pyspark/tests/__init__.py
	modified:   pyspark/tests/conftest.py
	modified:   pyspark/tests/test_enums.py
	modified:   pyspark/tests/test_info.py
	modified:   pyspark/tests/test_reader.py
	modified:   pyspark/tests/test_writer.py
+ add poetry-lock file to .licenserc ignore section

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   .licenserc.yaml
	new file:   pyspark/README.md
	deleted:    pyspark/README.rst
	modified:   pyspark/pyproject.toml
- new tests
- improved coverage
- updated Makefile for Python project
- updated pyproject.toml

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   pyspark/Makefile
	modified:   pyspark/graphar_pyspark/info.py
	modified:   pyspark/graphar_pyspark/writer.py
	modified:   pyspark/pyproject.toml
	modified:   pyspark/tests/test_info.py
	modified:   pyspark/tests/test_writer.py
 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   pyspark/graphar_pyspark/writer.py
 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   pyspark/tests/test_writer.py
@acezen (Contributor) left a comment

Hi @SemyonSinchenko, I left some comments about the code and docs, and here are some general questions:

  • How do we build the graphar_pyspark package? Do we need to install some necessary dependencies? If so, maybe you need to add something like requirements.txt/requirements-dev.txt to install dependencies with pip.
  • What's the relationship between the Scala Spark library and graphar_pyspark? If we need to package the PySpark library, do we need to build the Scala library first?
  • Maybe add a CI workflow to the Spark CI to run the build step and tests in GitHub Actions?

@SemyonSinchenko (Member, Author)

* How do we build the graphar_pyspark package? Do we need to install some necessary dependencies? If so, maybe you need to add something like `requirements.txt/requirements-dev.txt` to install dependencies with pip.

Currently poetry is used as the build system (poetry install installs the project for local development and poetry build builds the distribution). Dependencies are defined inside the pyproject.toml file as dependency groups (spark, tests, docs, etc.).

* What's the relationship between the Scala Spark library and graphar_pyspark? If we need to package the PySpark library, do we need to build the Scala library first?

I know two ways to do it. The first is the way pyspark itself does it: downloading JAR files silently during pip install. In my opinion this is a bad option, because it may easily break users' environments. The other way is like the Synapse ML Python project or PyDeequ, but that approach requires Maven coordinates, and as I understand it we still do not have Maven publishing. Do you have plans to test the GraphAr Spark Scala library against different Spark and Scala versions (like graphar_1.0.0_spark_3.2_scala_2.12_hadoop_3.2 or graphar_1.0.0_spark_3.5_scala_2.13_hadoop_2, etc.)?

* Maybe add a CI workflow to the [Spark CI](https://github.com/alibaba/GraphAr/blob/main/.github/workflows/spark.yaml) to run the build step and tests in GitHub Actions?

I can add the following:

  • CI for code linting of the pyspark project
  • CI for tests of the pyspark project as a step inside spark.yml

Also, I can add something like a python-release.yml action that would include:

  • running tests
  • adding a PyPI API token (as in the poetry docs)
  • publishing the package to PyPI (poetry publish --build)

In this case graphar_pyspark will be installable via pip install. The only thing I need from you is to add a GitHub secret with the token (like PYPI_API_TOKEN or something similar).

@SemyonSinchenko (Member, Author) commented Jan 4, 2024

  • CI/CD for code style of the Python project
  • update the Spark CI/CD by adding the Python tests
  • update README.md as described in the comments
  • docstrings for all public API
  • test coverage > 95%
  • usage example

Did I miss something?

@acezen (Contributor) commented Jan 4, 2024

  • CI/CD for code style of the Python project
  • update the Spark CI/CD by adding the Python tests
  • update README.md as described in the comments
  • docstrings for all public API
  • test coverage > 95%

Did I miss something?

Overall LGTM, thanks! Maybe we can add some tutorials or examples for graphar_pyspark?

- ruff passed
- coverage 95%+
- docstrings for all the public API

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   pyspark/graphar_pyspark/__init__.py
	modified:   pyspark/graphar_pyspark/enums.py
	modified:   pyspark/graphar_pyspark/errors.py
	modified:   pyspark/graphar_pyspark/graph.py
	modified:   pyspark/graphar_pyspark/info.py
	modified:   pyspark/graphar_pyspark/reader.py
	modified:   pyspark/graphar_pyspark/util.py
	modified:   pyspark/graphar_pyspark/writer.py
	modified:   pyspark/poetry.lock
	modified:   pyspark/pyproject.toml
	modified:   pyspark/tests/test_reader.py
	new file:   pyspark/tests/test_transform.py
	modified:   pyspark/tests/test_writer.py
@acezen (Contributor) commented Jan 8, 2024

Hi @SemyonSinchenko, is there any update on the remaining items? We plan to release GraphAr v0.11.0 in the middle of January; I think we can include this feature in that release.

@SemyonSinchenko (Member, Author)

Hi @SemyonSinchenko, is there any update on the remaining items? We plan to release GraphAr v0.11.0 in the middle of January; I think we can include this feature in that release.

Hi! I was on a short vacation. Most of the work is done; I will finish it in a couple of days.

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	new file:   .github/workflows/pyspark.yml
	new file:   docs/pyspark/how-to.rst
	modified:   docs/pyspark/index.rst
 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   pyspark/README.md
@SemyonSinchenko (Member, Author)

@acezen To be honest, I'm not very experienced with DevOps tooling, and the Spark workflow file is very complicated (some magic with paths). I'm afraid to touch it because I might break something. Can we couple the building of spark & pyspark in the next PR? This one is already very big. Also, I was going to suggest testing GraphAr against multiple versions of Hadoop & Spark (via a matrix in GHA), so the Spark CI may change soon anyway... What do you think?

@acezen (Contributor) commented Jan 10, 2024

@acezen To be honest, I'm not very experienced with DevOps tooling, and the Spark workflow file is very complicated (some magic with paths). I'm afraid to touch it because I might break something. Can we couple the building of spark & pyspark in the next PR? This one is already very big. Also, I was going to suggest testing GraphAr against multiple versions of Hadoop & Spark (via a matrix in GHA), so the Spark CI may change soon anyway... What do you think?

I agree with you. The Spark workflow is a little complicated and needs refining. I think we can open a new issue for that and couple the building of spark & pyspark there. It's OK to merge this PR first.

@acezen (Contributor) left a comment

I think it's ready to merge once the typo and the CI/CD settings are fixed.

 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   .github/workflows/pyspark.yml
	modified:   pyspark/README.md
 On branch 297-add-pyspark-bindings
 Changes to be committed:
	modified:   pyspark/graphar_pyspark/info.py
@acezen (Contributor) left a comment

LGTM, thanks for the great work!

@acezen merged commit 2faccd8 into apache:main on Jan 10, 2024
2 checks passed
@SemyonSinchenko deleted the 297-add-pyspark-bindings branch on April 22, 2024 at 17:48