From 7423bb925901f955386bee4c889b429d92256db5 Mon Sep 17 00:00:00 2001
From: Wey Gu
Date: Wed, 1 Mar 2023 14:47:07 +0000
Subject: [PATCH 1/4] release: 0.2.0

---
 README.md                 | 150 +++++++++----------------------
 docs/Environment_Setup.md | 126 ++++++++++++++++++++++++++++-
 pyproject.toml            |   2 +-
 3 files changed, 159 insertions(+), 119 deletions(-)

diff --git a/README.md b/README.md
index da65313..2e0a8f3 100644
--- a/README.md
+++ b/README.md
@@ -61,102 +61,6 @@ pip install ngdi
 - [NebulaGraph Python Client 3.4+](https://github.com/vesoft-inc/nebula-python)
 - [NetworkX](https://networkx.org/)
 
-## Run on PySpark Jupyter Notebook(Spark Engine)
-
-Assuming we have put the `nebula-spark-connector.jar` and `nebula-algo.jar` in `/opt/nebulagraph/ngdi/package/`.
-
-```bash
-export PYSPARK_PYTHON=python3
-export PYSPARK_DRIVER_PYTHON=jupyter
-export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=0.0.0.0 --port=8888 --no-browser"
-
-pyspark --driver-class-path /opt/nebulagraph/ngdi/package/nebula-spark-connector.jar \
-    --driver-class-path /opt/nebulagraph/ngdi/package/nebula-algo.jar \
-    --jars /opt/nebulagraph/ngdi/package/nebula-spark-connector.jar \
-    --jars /opt/nebulagraph/ngdi/package/nebula-algo.jar
-```
-
-Then we could access Jupyter Notebook with PySpark and refer to [examples/spark_engine.ipynb](https://github.com/wey-gu/nebulagraph-di/examples/spark_engine.ipynb)
-
-## Submit Algorithm job to Spark Cluster(Spark Engine)
-
-Assuming we have put the `nebula-spark-connector.jar` and `nebula-algo.jar` in `/opt/nebulagraph/ngdi/package/`;
-We have put the `ngdi-py3-env.zip` in `/opt/nebulagraph/ngdi/package/`.
-And we have the following Algorithm job in `pagerank.py`:
-
-```python
-from ngdi import NebulaGraphConfig
-from ngdi import NebulaReader
-
-# set NebulaGraph config
-config_dict = {
-    "graphd_hosts": "graphd:9669",
-    "metad_hosts": "metad0:9669,metad1:9669,metad2:9669",
-    "user": "root",
-    "password": "nebula",
-    "space": "basketballplayer",
-}
-config = NebulaGraphConfig(**config_dict)
-
-# read data with spark engine, query mode
-reader = NebulaReader(engine="spark")
-query = """
-    MATCH ()-[e:follow]->()
-    RETURN e LIMIT 100000
-"""
-reader.query(query=query, edge="follow", props="degree")
-df = reader.read()
-
-# run pagerank algorithm
-pr_result = df.algo.pagerank(reset_prob=0.15, max_iter=10)
-```
-
-> Note, this could be done by Airflow, or other job scheduler in production.
-
-Then we can submit the job to Spark cluster:
-
-```bash
-spark-submit --master spark://master:7077 \
-    --driver-class-path /opt/nebulagraph/ngdi/package/nebula-spark-connector.jar \
-    --driver-class-path /opt/nebulagraph/ngdi/package/nebula-algo.jar \
-    --jars /opt/nebulagraph/ngdi/package/nebula-spark-connector.jar \
-    --jars /opt/nebulagraph/ngdi/package/nebula-algo.jar \
-    --py-files /opt/nebulagraph/ngdi/package/ngdi-py3-env.zip \
-    pagerank.py
-```
-
-## Run ngdi algorithm job from python script(Spark Engine)
-
-We have everything ready as above, including the `pagerank.py`.
-
-```python
-import subprocess
-
-subprocess.run(["spark-submit", "--master", "spark://master:7077",
-    "--driver-class-path", "/opt/nebulagraph/ngdi/package/nebula-spark-connector.jar",
-    "--driver-class-path", "/opt/nebulagraph/ngdi/package/nebula-algo.jar",
-    "--jars", "/opt/nebulagraph/ngdi/package/nebula-spark-connector.jar",
-    "--jars", "/opt/nebulagraph/ngdi/package/nebula-algo.jar",
-    "--py-files", "/opt/nebulagraph/ngdi/package/ngdi-py3-env.zip",
-    "pagerank.py"])
-```
-
-## Run on single machine(NebulaGraph Engine)
-
-Assuming we have NebulaGraph cluster up and running, and we have the following Algorithm job in `pagerank_nebula_engine.py`:
-
-This file is the same as `pagerank.py` except for the following line:
-
-```diff
-- reader = NebulaReader(engine="spark")
-+ reader = NebulaReader(engine="nebula")
-```
-
-Then we can run the job on single machine:
-
-```bash
-python3 pagerank.py
-```
 
 ## Documentation
 
@@ -168,6 +72,8 @@ python3 pagerank.py
 
 See also: [examples/spark_engine.ipynb](https://github.com/wey-gu/nebulagraph-di/blob/main/examples/spark_engine.ipynb)
 
+Run Algorithm on top of NebulaGraph:
+
 ```python
 from ngdi import NebulaReader
 
@@ -200,31 +106,41 @@ pr_result = df.algo.pagerank(reset_prob=0.15, max_iter=10) # this will take some
 graph = reader.to_graphx() # not yet implemented
 ```
 
-### NebulaGraph Engine Examples(not yet implemented)
+Write back to NebulaGraph:
 
 ```python
-from ngdi import NebulaReader
+from ngdi import NebulaWriter
+from ngdi.config import NebulaGraphConfig
 
-# read data with nebula engine, query mode
-reader = NebulaReader(engine="nebula")
-reader.query("""
-    MATCH ()-[e:follow]->()
-    RETURN e.src, e.dst, e.degree LIMIT 100000
-""")
-df = reader.read() # this will take some time
-df.show(10)
+config = NebulaGraphConfig()
 
-# read data with nebula engine, scan mode
-reader = NebulaReader(engine="nebula")
-reader.scan(edge_types=["follow"])
-df = reader.read() # this will take some time
-df.show(10)
+properties = {
+    "louvain": "cluster_id"
+}
 
-# convert dataframe to NebulaGraphObject
-graph = reader.to_graph() # this will take some time
-graph.nodes.show(10)
-graph.edges.show(10)
+writer = NebulaWriter(data=df_result, sink="nebulagraph_vertex", config=config, engine="spark")
+writer.set_options(
+    tag="louvain",
+    vid_field="_id",
+    properties=properties,
+    batch_size=256,
+    write_mode="insert",
+)
+writer.write()
+```
 
-# run pagerank algorithm
-pr_result = graph.algo.pagerank(reset_prob=0.15, max_iter=10) # this will take some time
+Then we can query the result in NebulaGraph:
+
+```cypher
+MATCH (v:louvain)
+RETURN id(v), v.louvain.cluster_id LIMIT 10;
+```
+
+### NebulaGraph Engine Examples (not yet implemented)
+
+Basically the same as Spark Engine, but with `engine="nebula"`.
+
+```diff
+- reader = NebulaReader(engine="spark")
++ reader = NebulaReader(engine="nebula")
 ```
diff --git a/docs/Environment_Setup.md b/docs/Environment_Setup.md
index 4b8df8f..61249de 100644
--- a/docs/Environment_Setup.md
+++ b/docs/Environment_Setup.md
@@ -1,6 +1,15 @@
 # Environment Setup
 
-## With Nebula-UP
+**TOC**
+
+- [Quick Start in 5 Minutes](#with-nebula-up-quick-start)
+- [Run In Production](#run-in-production)
+  - [Run on PySpark Jupyter Notebook](#run-on-pyspark-jupyter-notebook)
+  - [Submit Algorithm job to Spark Cluster](#submit-algorithm-job-to-spark-cluster)
+  - [Run ngdi algorithm PySpark job from a Python script](#run-ngdi-algorithm-pyspark-job-from-a-python-script)
+  - [Run on single machine with NebulaGraph engine](#run-on-single-machine-with-nebulagraph-engine)
+
+## With Nebula-UP (Quick Start)
 
 ### Installation
 
@@ -31,3 +40,118 @@ Just visit [http://localhost:7001](http://localhost:7001) in your browser, with:
 - host: `graphd:9669`
 - user: `root`
 - password: `nebula`
+
+## Run In Production
+
+### Run on PySpark Jupyter Notebook
+
+Assuming we have put the `nebula-spark-connector.jar` and `nebula-algo.jar` in `/opt/nebulagraph/ngdi/package/`.
+
+```bash
+export PYSPARK_PYTHON=python3
+export PYSPARK_DRIVER_PYTHON=jupyter
+export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=0.0.0.0 --port=8888 --no-browser"
+
+pyspark --driver-class-path /opt/nebulagraph/ngdi/package/nebula-spark-connector.jar \
+    --driver-class-path /opt/nebulagraph/ngdi/package/nebula-algo.jar \
+    --jars /opt/nebulagraph/ngdi/package/nebula-spark-connector.jar \
+    --jars /opt/nebulagraph/ngdi/package/nebula-algo.jar
+```
+
+Then we can access the Jupyter Notebook with PySpark; see [examples/spark_engine.ipynb](https://github.com/wey-gu/nebulagraph-di/blob/main/examples/spark_engine.ipynb) for a walkthrough.
+
+### Submit Algorithm job to Spark Cluster
+
+Assuming we have put `nebula-spark-connector.jar` and `nebula-algo.jar` in `/opt/nebulagraph/ngdi/package/`,
+put `ngdi-py3-env.zip` in the same directory,
+and have the following algorithm job in `pagerank.py`:
+
+```python
+from ngdi import NebulaGraphConfig
+from ngdi import NebulaReader
+
+# set NebulaGraph config
+config_dict = {
+    "graphd_hosts": "graphd:9669",
+    "metad_hosts": "metad0:9669,metad1:9669,metad2:9669",
+    "user": "root",
+    "password": "nebula",
+    "space": "basketballplayer",
+}
+config = NebulaGraphConfig(**config_dict)
+
+# read data with spark engine, query mode
+reader = NebulaReader(engine="spark")
+query = """
+    MATCH ()-[e:follow]->()
+    RETURN e LIMIT 100000
+"""
+reader.query(query=query, edge="follow", props="degree")
+df = reader.read()
+
+# run pagerank algorithm
+pr_result = df.algo.pagerank(reset_prob=0.15, max_iter=10)
+```
+
+> Note: in production, the submission could be orchestrated by Airflow or another job scheduler; see the sketch at the end of this section.
+
+Then we can submit the job to the Spark cluster:
+
+```bash
+spark-submit --master spark://master:7077 \
+    --driver-class-path /opt/nebulagraph/ngdi/package/nebula-spark-connector.jar \
+    --driver-class-path /opt/nebulagraph/ngdi/package/nebula-algo.jar \
+    --jars /opt/nebulagraph/ngdi/package/nebula-spark-connector.jar \
+    --jars /opt/nebulagraph/ngdi/package/nebula-algo.jar \
+    --py-files /opt/nebulagraph/ngdi/package/ngdi-py3-env.zip \
+    pagerank.py
+```
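+
+For illustration, the same submission as an Airflow task might look like the sketch below; the `SparkSubmitOperator` parameters and the `spark_default` connection are assumptions to adapt to your deployment:
+
+```python
+# A sketch: schedule the same spark-submit via Airflow instead of running it by hand.
+from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
+
+submit_pagerank = SparkSubmitOperator(
+    task_id="ngdi_pagerank",
+    application="pagerank.py",
+    jars="/opt/nebulagraph/ngdi/package/nebula-spark-connector.jar,"
+    "/opt/nebulagraph/ngdi/package/nebula-algo.jar",
+    py_files="/opt/nebulagraph/ngdi/package/ngdi-py3-env.zip",
+    conn_id="spark_default",  # the Airflow connection that holds the Spark master URL
+)
+```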
+
+### Run ngdi algorithm PySpark job from a Python script
+
+We have everything ready as above, including `pagerank.py`, so we can also trigger the submission from a Python script:
+
+```python
+import subprocess
+
+subprocess.run(["spark-submit", "--master", "spark://master:7077",
+    "--driver-class-path", "/opt/nebulagraph/ngdi/package/nebula-spark-connector.jar",
+    "--driver-class-path", "/opt/nebulagraph/ngdi/package/nebula-algo.jar",
+    "--jars", "/opt/nebulagraph/ngdi/package/nebula-spark-connector.jar",
+    "--jars", "/opt/nebulagraph/ngdi/package/nebula-algo.jar",
+    "--py-files", "/opt/nebulagraph/ngdi/package/ngdi-py3-env.zip",
+    "pagerank.py"])
+```
+
+### Run on single machine with NebulaGraph engine
+
+Assuming we have a NebulaGraph cluster up and running, we need the following algorithm job in `pagerank_nebula_engine.py`.
+
+This file is the same as `pagerank.py` except for the following line:
+
+```diff
+- reader = NebulaReader(engine="spark")
++ reader = NebulaReader(engine="nebula")
+```
+
+Then we can run the job on a single machine:
+
+```bash
+python3 pagerank_nebula_engine.py
+```
\ No newline at end of file
diff --git a/pyproject.toml b/pyproject.toml
index ad5977c..9bdd387 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -2,7 +2,7 @@
 
 [project]
 name = "ngdi"
-version = "0.1.9"
+version = "0.2.0"
 description = "NebulaGraph Data Intelligence Suite"
 authors = [
     {name = "Wey Gu", email = "weyl.gu@gmail.com"},

From ee55a5ecfae5f3e2d9ad3a9fd404458ca83bd06c Mon Sep 17 00:00:00 2001
From: Wey Gu
Date: Wed, 1 Mar 2023 14:50:49 +0000
Subject: [PATCH 2/4] chore: readme polished

---
 README.md   | 45 ++++++++++++---------------------------------
 docs/API.md | 27 +++++++++++++++++++++++++++
 2 files changed, 39 insertions(+), 33 deletions(-)

diff --git a/README.md b/README.md
index 2e0a8f3..1dec9f5 100644
--- a/README.md
+++ b/README.md
@@ -42,7 +42,8 @@ Scan Mode ▼
 
 - Setup env with Nebula-Up following [this guide](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/Environment_Setup.md).
 - Install ngdi with pip from the Jupyter Notebook with http://localhost:8888 (password: `nebula`).
-- Open the demo notebook and run cells with `Shift+Enter` or `Ctrl+Enter`.
+- Open the demo notebook and run the cells one by one.
+- Check the [API Reference](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/API.md)
 
 ## Installation
 
@@ -64,6 +65,7 @@ pip install ngdi
 
 ## Documentation
 
+[Environment Setup](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/Environment_Setup.md)
 [API Reference](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/API.md)
 
 ## Usage
@@ -74,36 +76,18 @@ See also: [examples/spark_engine.ipynb](https://github.com/wey-gu/nebulagraph-di
 
 Run Algorithm on top of NebulaGraph:
 
+> Note: there is also a query mode; refer to the [examples](https://github.com/wey-gu/nebulagraph-di/blob/main/examples/spark_engine.ipynb) or the [docs](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/API.md) for more details.
+
 ```python
 from ngdi import NebulaReader
 
-# read data with spark engine, query mode
-reader = NebulaReader(engine="spark")
-query = """
-    MATCH ()-[e:follow]->()
-    RETURN e LIMIT 100000
-"""
-reader.query(query=query, edge="follow", props="degree")
-df = reader.read() # this will take some time
-df.show(10)
-
 # read data with spark engine, scan mode
 reader = NebulaReader(engine="spark")
 reader.scan(edge="follow", props="degree")
-df = reader.read() # this will take some time
-df.show(10)
-
-# read data with spark engine, load mode (not yet implemented)
-reader = NebulaReader(engine="spark")
-reader.load(source="hdfs://path/to/edge.csv", format="csv", header=True, schema="src: string, dst: string, rank: int")
-df = reader.read() # this will take some time
-df.show(10)
+df = reader.read()
 
 # run pagerank algorithm
-pr_result = df.algo.pagerank(reset_prob=0.15, max_iter=10) # this will take some time
-
-# convert dataframe to NebulaGraphObject
-graph = reader.to_graphx() # not yet implemented
+pr_result = df.algo.pagerank(reset_prob=0.15, max_iter=10)
 ```
 
 Write back to NebulaGraph:
@@ -114,18 +98,13 @@ from ngdi.config import NebulaGraphConfig
 
 config = NebulaGraphConfig()
 
-properties = {
-    "louvain": "cluster_id"
-}
+properties = {"louvain": "cluster_id"}
 
-writer = NebulaWriter(data=df_result, sink="nebulagraph_vertex", config=config, engine="spark")
+writer = NebulaWriter(
+    data=df_result, sink="nebulagraph_vertex", config=config, engine="spark")
 writer.set_options(
-    tag="louvain",
-    vid_field="_id",
-    properties=properties,
-    batch_size=256,
-    write_mode="insert",
-)
+    tag="louvain", vid_field="_id", properties=properties,
+    batch_size=256, write_mode="insert")
 writer.write()
 ```
 
diff --git a/docs/API.md b/docs/API.md
index 9a2023d..1d074c8 100644
--- a/docs/API.md
+++ b/docs/API.md
@@ -69,6 +69,33 @@ reader.query(query=query, edge="follow", props="degree")
 df = reader.read()
 ```
 
+- Load mode
+
+> not yet implemented
+
+```python
+# read data with spark engine, load mode (not yet implemented)
+reader = NebulaReader(engine="spark")
+reader.load(source="hdfs://path/to/edge.csv", format="csv", header=True, schema="src: string, dst: string, rank: int")
+df = reader.read() # this will take some time
+df.show(10)
+```
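+
+Until load mode lands, a plain PySpark read can serve as a stopgap. This is a sketch only; the path and the Spark DDL schema string are illustrative:
+
+```python
+# Stopgap: load the edge list with plain PySpark while NebulaReader.load() is pending.
+from pyspark.sql import SparkSession
+
+spark = SparkSession.builder.getOrCreate()
+df = spark.read.csv(
+    "hdfs://path/to/edge.csv",
+    header=True,
+    schema="src STRING, dst STRING, rank INT",
+)
+df.show(10)
+```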
+
 ## engines
 
 - `ngdi.engines.SparkEngine` is the Spark Engine for `ngdi.NebulaReader`, `ngdi.NebulaWriter` and `ngdi.NebulaAlgorithm`.

From 0651b5785ec272ddc3768f43893bbaa998bd5526 Mon Sep 17 00:00:00 2001
From: Wey Gu
Date: Wed, 1 Mar 2023 15:00:35 +0000
Subject: [PATCH 3/4] chore: move arch to the end

---
 README.md | 80 ++++++++++++++++++++++++++++++--------------------------
 1 file changed, 48 insertions(+), 32 deletions(-)

diff --git a/README.md b/README.md
index 1dec9f5..76161a2 100644
--- a/README.md
+++ b/README.md
@@ -6,38 +6,6 @@
 NebulaGraph Data Intelligence Suite for Python (ngdi) is a powerful Python library that offers a range of APIs for data scientists to effectively read, write, analyze, and compute data in NebulaGraph. This library allows data scientists to perform these operations on a single machine using NetworkX, or in a distributed computing environment using Spark, in a unified and intuitive API. With ngdi, data scientists can easily access and process data in NebulaGraph, enabling them to perform advanced analytics and gain valuable insights.
 
-```
-                  ┌───────────────────────────────────────────────────┐
-                  │ Spark Cluster │
-                  │ .─────. .─────. .─────. .─────. │
-               ┌─▶│ : ; : ; : ; : ; │
-               │ │ `───' `───' `───' `───' │
-Algorithm │
- Spark └───────────────────────────────────────────────────┘
- Engine ┌────────────────────────────────────────────────────────────────┐
- └──┤ │
- │ NebulaGraph Data Intelligence Suite(ngdi) │
- │ ┌────────┐ ┌──────┐ ┌────────┐ ┌─────┐ │
- │ │ Reader │ │ Algo │ │ Writer │ │ GNN │ │
- │ └────────┘ └──────┘ └────────┘ └─────┘ │
- │ ├────────────┴───┬────────┴─────┐ └──────┐ │
- │ ▼ ▼ ▼ ▼ │
- │ ┌─────────────┐ ┌──────────────┐ ┌──────────┐┌───────────┐ │
- ┌──┤ │ SparkEngine │ │ NebulaEngine │ │ NetworkX ││ DGLEngine │ │
- │ │ └─────────────┘ └──────────────┘ └──────────┘└───────────┘ │
- │ └──────────┬─────────────────────────────────────────────────────┘
- │ │ Spark
- │ └────────Reader ────────────┐
-Spark Reader Query Mode │
-Scan Mode ▼
- │ ┌───────────────────────────────────────────────────┐
- │ │ NebulaGraph Graph Engine Nebula-GraphD │
- │ ├──────────────────────────────┬────────────────────┤
- │ │ NebulaGraph Storage Engine │ │
- └─▶│ Nebula-StorageD │ Nebula-Metad │
- └──────────────────────────────┴────────────────────┘
-```
-
 ## Quick Start in 5 Minutes
 
 - Setup env with Nebula-Up following [this guide](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/Environment_Setup.md).
@@ -123,3 +91,51 @@ Basically the same as Spark Engine, but with `engine="nebula"`.
 - reader = NebulaReader(engine="spark")
 + reader = NebulaReader(engine="nebula")
 ```
+
+## How it works
+
+ngdi is a unified abstraction layer for different engines; the current implementation is based on Spark, NetworkX, DGL and NebulaGraph, but it is easy to extend to other engines such as Flink, GraphScope and PyG.
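+
+As an illustrative sketch of that abstraction (reusing the Spark-engine calls from the examples above; the NebulaGraph engine is not yet implemented and its reader options may differ):
+
+```python
+from ngdi import NebulaReader
+
+# The intent of the abstraction: swap the engine name, keep the same API surface.
+engine = "spark"  # or "nebula" for single-machine execution
+reader = NebulaReader(engine=engine)
+reader.scan(edge="follow", props="degree")
+df = reader.read()
+```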
+
+```
+                  ┌───────────────────────────────────────────────────┐
+                  │ Spark Cluster │
+                  │ .─────. .─────. .─────. .─────. │
+               ┌─▶│ : ; : ; : ; : ; │
+               │ │ `───' `───' `───' `───' │
+Algorithm │
+ Spark └───────────────────────────────────────────────────┘
+ Engine ┌────────────────────────────────────────────────────────────────┐
+ └──┤ │
+ │ NebulaGraph Data Intelligence Suite(ngdi) │
+ │ ┌────────┐ ┌──────┐ ┌────────┐ ┌─────┐ │
+ │ │ Reader │ │ Algo │ │ Writer │ │ GNN │ │
+ │ └────────┘ └──────┘ └────────┘ └─────┘ │
+ │ ├────────────┴───┬────────┴─────┐ └──────┐ │
+ │ ▼ ▼ ▼ ▼ │
+ │ ┌─────────────┐ ┌──────────────┐ ┌──────────┐┌───────────┐ │
+ ┌──┤ │ SparkEngine │ │ NebulaEngine │ │ NetworkX ││ DGLEngine │ │
+ │ │ └─────────────┘ └──────────────┘ └──────────┘└───────────┘ │
+ │ └──────────┬─────────────────────────────────────────────────────┘
+ │ │ Spark
+ │ └────────Reader ────────────┐
+Spark Reader Query Mode │
+Scan Mode ▼
+ │ ┌───────────────────────────────────────────────────┐
+ │ │ NebulaGraph Graph Engine Nebula-GraphD │
+ │ ├──────────────────────────────┬────────────────────┤
+ │ │ NebulaGraph Storage Engine │ │
+ └─▶│ Nebula-StorageD │ Nebula-Metad │
+ └──────────────────────────────┴────────────────────┘
+```
\ No newline at end of file

From 12a2959cd510cef937cd34b23d2a9b7ad1cbab3d Mon Sep 17 00:00:00 2001
From: Wey Gu
Date: Wed, 1 Mar 2023 15:01:57 +0000
Subject: [PATCH 4/4] chore: move prerequisites to the bottom

---
 README.md | 36 ++++++++++++++++++++-------------
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index 76161a2..663f110 100644
--- a/README.md
+++ b/README.md
@@ -19,21 +19,10 @@
 
 pip install ngdi
 ```
 
-### Spark Engine Prerequisites
-- Spark 2.4, 3.0(not yet tested)
-- [NebulaGraph 3.4+](https://github.com/vesoft-inc/nebula)
-- [NebulaGraph Spark Connector 3.4+](https://repo1.maven.org/maven2/com/vesoft/nebula-spark-connector/)
-- [NebulaGraph Algorithm 3.1+](https://repo1.maven.org/maven2/com/vesoft/nebula-algorithm/)
-
-### NebulaGraph Engine Prerequisites
-- [NebulaGraph 3.4+](https://github.com/vesoft-inc/nebula)
-- [NebulaGraph Python Client 3.4+](https://github.com/vesoft-inc/nebula-python)
-- [NetworkX](https://networkx.org/)
-
-
 ## Documentation
 
 [Environment Setup](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/Environment_Setup.md)
+
 [API Reference](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/API.md)
@@ -138,4 +127,25 @@
 │ │ NebulaGraph Storage Engine │ │
 └─▶│ Nebula-StorageD │ Nebula-Metad │
 └──────────────────────────────┴────────────────────┘
-```
\ No newline at end of file
+```
+
+### Spark Engine Prerequisites
+- Spark 2.4, 3.0 (not yet tested)
+- [NebulaGraph 3.4+](https://github.com/vesoft-inc/nebula)
+- [NebulaGraph Spark Connector 3.4+](https://repo1.maven.org/maven2/com/vesoft/nebula-spark-connector/)
+- [NebulaGraph Algorithm 3.1+](https://repo1.maven.org/maven2/com/vesoft/nebula-algorithm/)
+
+### NebulaGraph Engine Prerequisites
+- [NebulaGraph 3.4+](https://github.com/vesoft-inc/nebula)
+- [NebulaGraph Python Client 3.4+](https://github.com/vesoft-inc/nebula-python)
+- [NetworkX](https://networkx.org/)
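+
+As a quick, illustrative sanity check that the NebulaGraph engine prerequisites are importable (the `nebula3` module name is an assumption based on nebula-python 3.x):
+
+```python
+# Hypothetical import check for the NebulaGraph engine prerequisites.
+import networkx as nx
+from nebula3.gclient.net import ConnectionPool  # ships with nebula-python 3.x
+
+print(f"NetworkX {nx.__version__} is available")
+```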