release: 0.2 polished docs and readme
Showing 4 changed files with 229 additions and 176 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,230 +1,163 @@ | ||
# NebulaGraph Data Intelligence(ngdi) Suite | ||
|
||
![image](https://user-images.githubusercontent.com/1651790/221876073-61ef4edb-adcd-4f10-b3fc-8ddc24918ea1.png) | ||
|
||
[![pdm-managed](https://img.shields.io/badge/pdm-managed-blueviolet)](https://pdm.fming.dev) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE) [![PyPI version](https://badge.fury.io/py/ngdi.svg)](https://badge.fury.io/py/ngdi) [![Python](https://img.shields.io/badge/python-3.6%2B-blue.svg)](https://www.python.org/downloads/release/python-360/) | ||
<p align="center"> | ||
<em>Data Intelligence Suite: run graph algorithms on NebulaGraph in 4 lines of code</em> | ||
</p> | ||
|
||
NebulaGraph Data Intelligence Suite for Python (ngdi) is a powerful Python library that offers a range of APIs for data scientists to effectively read, write, analyze, and compute data in NebulaGraph. This library allows data scientists to perform these operations on a single machine using NetworkX, or in a distributed computing environment using Spark, through a unified and intuitive API. With ngdi, data scientists can easily access and process data in NebulaGraph, enabling them to perform advanced analytics and gain valuable insights. | ||
<p align="center"> | ||
<a href="LICENSE" target="_blank"> | ||
<img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" alt="License"> | ||
</a> | ||
|
||
``` | ||
┌───────────────────────────────────────────────────┐ | ||
│ Spark Cluster │ | ||
│ .─────. .─────. .─────. .─────. │ | ||
┌─▶│ : ; : ; : ; : ; │ | ||
│ │ `───' `───' `───' `───' │ | ||
Algorithm │ | ||
Spark └───────────────────────────────────────────────────┘ | ||
Engine ┌────────────────────────────────────────────────────────────────┐ | ||
└──┤ │ | ||
│ NebulaGraph Data Intelligence Suite(ngdi) │ | ||
│ ┌────────┐ ┌──────┐ ┌────────┐ ┌─────┐ │ | ||
│ │ Reader │ │ Algo │ │ Writer │ │ GNN │ │ | ||
│ └────────┘ └──────┘ └────────┘ └─────┘ │ | ||
│ ├────────────┴───┬────────┴─────┐ └──────┐ │ | ||
│ ▼ ▼ ▼ ▼ │ | ||
│ ┌─────────────┐ ┌──────────────┐ ┌──────────┐┌───────────┐ │ | ||
┌──┤ │ SparkEngine │ │ NebulaEngine │ │ NetworkX ││ DGLEngine │ │ | ||
│ │ └─────────────┘ └──────────────┘ └──────────┘└───────────┘ │ | ||
│ └──────────┬─────────────────────────────────────────────────────┘ | ||
│ │ Spark | ||
│ └────────Reader ────────────┐ | ||
Spark Reader Query Mode │ | ||
Scan Mode ▼ | ||
│ ┌───────────────────────────────────────────────────┐ | ||
│ │ NebulaGraph Graph Engine Nebula-GraphD │ | ||
│ ├──────────────────────────────┬────────────────────┤ | ||
│ │ NebulaGraph Storage Engine │ │ | ||
└─▶│ Nebula-StorageD │ Nebula-Metad │ | ||
└──────────────────────────────┴────────────────────┘ | ||
``` | ||
<a href="https://badge.fury.io/py/ngdi" target="_blank"> | ||
<img src="https://badge.fury.io/py/ngdi.svg" alt="PyPI version"> | ||
</a> | ||
|
||
<a href="https://www.python.org/downloads/release/python-360/" target="_blank"> | ||
<img src="https://img.shields.io/badge/python-3.6%2B-blue.svg" alt="Python"> | ||
</a> | ||
|
||
<a href="https://pdm.fming.dev" target="_blank"> | ||
<img src="https://img.shields.io/badge/pdm-managed-blueviolet" alt="pdm-managed"> | ||
</a> | ||
|
||
</p> | ||
|
||
--- | ||
|
||
**Documentation**: <a href="https://github.com/wey-gu/nebulagraph-di#documentation" target="_blank">https://github.com/wey-gu/nebulagraph-di#documentation</a> | ||
|
||
**Source Code**: <a href="https://github.com/wey-gu/nebulagraph-di" target="_blank">https://github.com/wey-gu/nebulagraph-di</a> | ||
|
||
--- | ||
|
||
|
||
NebulaGraph Data Intelligence Suite for Python (ngdi) is a powerful Python library that offers APIs for data scientists to effectively read, write, analyze, and compute data in NebulaGraph. | ||
|
||
With a single-machine engine (NetworkX) or a distributed computing environment (Spark), we can run graph analysis and algorithms on top of NebulaGraph in fewer than 10 lines of code, through a unified and intuitive API. | ||
|
||
## Quick Start in 5 Minutes | ||
|
||
- Set up the environment with Nebula-Up following [this guide](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/Environment_Setup.md). | ||
- Install ngdi with pip from the Jupyter Notebook at http://localhost:8888 (password: `nebula`). | ||
- Open the demo notebook and run cells with `Shift+Enter` or `Ctrl+Enter`. | ||
- Open the demo notebook and run cells one by one. | ||
- Check the [API Reference](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/API.md). | ||
|
||
## Installation | ||
|
||
```bash | ||
pip install ngdi | ||
``` | ||
|
||
### Spark Engine Prerequisites | ||
- Spark 2.4, 3.0(not yet tested) | ||
- [NebulaGraph 3.4+](https://github.com/vesoft-inc/nebula) | ||
- [NebulaGraph Spark Connector 3.4+](https://repo1.maven.org/maven2/com/vesoft/nebula-spark-connector/) | ||
- [NebulaGraph Algorithm 3.1+](https://repo1.maven.org/maven2/com/vesoft/nebula-algorithm/) | ||
|
||
### NebulaGraph Engine Prerequisites | ||
- [NebulaGraph 3.4+](https://github.com/vesoft-inc/nebula) | ||
- [NebulaGraph Python Client 3.4+](https://github.com/vesoft-inc/nebula-python) | ||
- [NetworkX](https://networkx.org/) | ||
|
||
## Run on PySpark Jupyter Notebook(Spark Engine) | ||
|
||
Assume `nebula-spark-connector.jar` and `nebula-algo.jar` have been placed in `/opt/nebulagraph/ngdi/package/`. | ||
## Usage | ||
|
||
```bash | ||
export PYSPARK_PYTHON=python3 | ||
export PYSPARK_DRIVER_PYTHON=jupyter | ||
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip=0.0.0.0 --port=8888 --no-browser" | ||
|
||
pyspark --driver-class-path /opt/nebulagraph/ngdi/package/nebula-spark-connector.jar:/opt/nebulagraph/ngdi/package/nebula-algo.jar \ | ||
--jars /opt/nebulagraph/ngdi/package/nebula-spark-connector.jar,/opt/nebulagraph/ngdi/package/nebula-algo.jar | ||
``` | ||
### Spark Engine Examples | ||
|
||
Then we could access Jupyter Notebook with PySpark and refer to [examples/spark_engine.ipynb](https://github.com/wey-gu/nebulagraph-di/examples/spark_engine.ipynb) | ||
See also: [examples/spark_engine.ipynb](https://github.com/wey-gu/nebulagraph-di/blob/main/examples/spark_engine.ipynb) | ||
|
||
## Submit Algorithm job to Spark Cluster(Spark Engine) | ||
Run Algorithm on top of NebulaGraph: | ||
|
||
Assume `nebula-spark-connector.jar`, `nebula-algo.jar`, and `ngdi-py3-env.zip` have been placed in `/opt/nebulagraph/ngdi/package/`, and that the following algorithm job is saved as `pagerank.py`: | ||
> Note: there is also a query mode; refer to the [examples](https://github.com/wey-gu/nebulagraph-di/blob/main/examples/spark_engine.ipynb) or the [docs](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/API.md) for more details. | ||
```python | ||
from ngdi import NebulaGraphConfig | ||
from ngdi import NebulaReader | ||
|
||
# set NebulaGraph config | ||
config_dict = { | ||
"graphd_hosts": "graphd:9669", | ||
"metad_hosts": "metad0:9559,metad1:9559,metad2:9559",  # metad listens on 9559 by default; 9669 is the graphd port | ||
"user": "root", | ||
"password": "nebula", | ||
"space": "basketballplayer", | ||
} | ||
config = NebulaGraphConfig(**config_dict) | ||
|
||
# read data with spark engine, query mode | ||
# read data with spark engine, scan mode | ||
reader = NebulaReader(engine="spark") | ||
query = """ | ||
MATCH ()-[e:follow]->() | ||
RETURN e LIMIT 100000 | ||
""" | ||
reader.query(query=query, edge="follow", props="degree") | ||
reader.scan(edge="follow", props="degree") | ||
df = reader.read() | ||
|
||
# run pagerank algorithm | ||
pr_result = df.algo.pagerank(reset_prob=0.15, max_iter=10) | ||
``` | ||
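For readers unfamiliar with the two parameters passed to `pagerank` above, here is a minimal pure-Python sketch of power-iteration PageRank. It is not ngdi's implementation: the function, the sample edges, and the normalization (ranks sum to 1, rank held by vertices without outgoing edges is spread evenly) are assumptions made for illustration. `reset_prob` is the probability of jumping to a random vertex, and `max_iter` caps the number of iterations.

```python
from collections import defaultdict


def pagerank(edges, reset_prob=0.15, max_iter=10):
    """Illustrative power-iteration PageRank over (src, dst) edge tuples."""
    nodes = sorted({vertex for edge in edges for vertex in edge})
    if not nodes:
        return {}

    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank sitting on vertices with no outgoing edges is shared by everyone.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = reset_prob / n + (1 - reset_prob) * dangling / n
        new_rank = {v: base for v in nodes}
        # Every vertex passes the rest of its rank evenly along its out-edges.
        for src in nodes:
            targets = out_links[src]
            for dst in targets:
                new_rank[dst] += (1 - reset_prob) * rank[src] / len(targets)
        rank = new_rank
    return rank


# Hypothetical "follow" edges, shaped like the (src, dst) pairs read above.
follow_edges = [
    ("player100", "player101"),
    ("player101", "player100"),
    ("player102", "player100"),
]
for vertex, score in sorted(pagerank(follow_edges).items()):
    print(vertex, round(score, 4))
```

A larger `max_iter` moves the scores closer to their fixed point; a larger `reset_prob` flattens the differences between vertices.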
|
||
> Note: in production, this could be done by Airflow or another job scheduler. | ||
Then we can submit the job to the Spark cluster: | ||
Write back to NebulaGraph: | ||
|
||
```bash | ||
spark-submit --master spark://master:7077 \ | ||
--driver-class-path /opt/nebulagraph/ngdi/package/nebula-spark-connector.jar:/opt/nebulagraph/ngdi/package/nebula-algo.jar \ | ||
--jars /opt/nebulagraph/ngdi/package/nebula-spark-connector.jar,/opt/nebulagraph/ngdi/package/nebula-algo.jar \ | ||
--py-files /opt/nebulagraph/ngdi/package/ngdi-py3-env.zip \ | ||
pagerank.py | ||
``` | ||
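`spark-submit` takes `--jars` and `--py-files` as comma-separated lists and `--driver-class-path` as a single classpath string, so passing the same flag more than once is likely to keep only the last value. The sketch below assembles the argument list from plain Python lists. `build_spark_submit_args` is a hypothetical helper, not part of ngdi, and the `:` classpath separator assumes a Linux driver host.

```python
import shlex

PACKAGE_DIR = "/opt/nebulagraph/ngdi/package"


def build_spark_submit_args(master, jars, py_files, script):
    """Return a spark-submit argument list with every jar passed exactly once."""
    return [
        "spark-submit",
        "--master", master,
        # Classpath entries are joined with ':' (assumes a Linux driver host).
        "--driver-class-path", ":".join(jars),
        # --jars and --py-files expect comma-separated lists.
        "--jars", ",".join(jars),
        "--py-files", ",".join(py_files),
        script,
    ]


args = build_spark_submit_args(
    master="spark://master:7077",
    jars=[
        f"{PACKAGE_DIR}/nebula-spark-connector.jar",
        f"{PACKAGE_DIR}/nebula-algo.jar",
    ],
    py_files=[f"{PACKAGE_DIR}/ngdi-py3-env.zip"],
    script="pagerank.py",
)

# Print the command for review; hand `args` to subprocess.run(args, check=True)
# on a machine where spark-submit is installed to actually submit the job.
print(shlex.join(args))
```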
```python | ||
from ngdi import NebulaWriter | ||
from ngdi.config import NebulaGraphConfig | ||
|
||
## Run ngdi algorithm job from python script(Spark Engine) | ||
config = NebulaGraphConfig() | ||
|
||
With everything above in place, including `pagerank.py`, the job can also be submitted from a Python script: | ||
properties = {"louvain": "cluster_id"} | ||
|
||
```python | ||
import subprocess | ||
|
||
subprocess.run(["spark-submit", "--master", "spark://master:7077", | ||
"--driver-class-path", "/opt/nebulagraph/ngdi/package/nebula-spark-connector.jar:/opt/nebulagraph/ngdi/package/nebula-algo.jar", | ||
"--jars", "/opt/nebulagraph/ngdi/package/nebula-spark-connector.jar,/opt/nebulagraph/ngdi/package/nebula-algo.jar", | ||
"--py-files", "/opt/nebulagraph/ngdi/package/ngdi-py3-env.zip", | ||
"pagerank.py"]) | ||
writer = NebulaWriter( | ||
data=df_result, sink="nebulagraph_vertex", config=config, engine="spark") | ||
writer.set_options( | ||
tag="louvain", vid_field="_id", properties=properties, | ||
batch_size=256, write_mode="insert",) | ||
writer.write() | ||
``` | ||
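One way to picture what `properties`, `vid_field`, and `batch_size` control is to think of the write as batched nGQL `INSERT VERTEX` statements. The sketch below is an illustration under assumptions: it takes `properties` to map a result column to a tag property, vertex IDs to be strings, and property values to be numeric. `build_insert_statements` and the sample rows are hypothetical; the actual writer presumably goes through the NebulaGraph Spark Connector instead.

```python
def build_insert_statements(rows, tag, vid_field, properties, batch_size=256):
    """Group rows into INSERT VERTEX statements of at most batch_size vertices."""
    columns = list(properties)  # result columns to read
    prop_names = ", ".join(properties[col] for col in columns)  # tag properties to write
    statements = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        values = []
        for row in batch:
            # No quoting or escaping of property values: numeric values only.
            props = ", ".join(str(row[col]) for col in columns)
            values.append('"{}":({})'.format(row[vid_field], props))
        statements.append(
            "INSERT VERTEX {}({}) VALUES {};".format(tag, prop_names, ", ".join(values))
        )
    return statements


# Hypothetical Louvain output: one row per vertex with its community id.
sample_rows = [
    {"_id": "player100", "louvain": 1},
    {"_id": "player101", "louvain": 1},
    {"_id": "player102", "louvain": 2},
]
for statement in build_insert_statements(
    sample_rows, tag="louvain", vid_field="_id",
    properties={"louvain": "cluster_id"}, batch_size=2,
):
    print(statement)
```

A smaller `batch_size` means more, shorter statements; the tag (`louvain` here) and its `cluster_id` property would need to exist in the space before writing.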
|
||
## Run on single machine(NebulaGraph Engine) | ||
Then we can query the result in NebulaGraph: | ||
|
||
Assume a NebulaGraph cluster is up and running, and the following algorithm job is saved as `pagerank_nebula_engine.py`: | ||
```cypher | ||
MATCH (v:louvain) | ||
RETURN id(v), v.louvain.cluster_id LIMIT 10; | ||
``` | ||
|
||
This file is the same as `pagerank.py` except for the following line: | ||
### NebulaGraph Engine Examples(not yet implemented) | ||
|
||
Basically the same as Spark Engine, but with `engine="nebula"`. | ||
|
||
```diff | ||
- reader = NebulaReader(engine="spark") | ||
+ reader = NebulaReader(engine="nebula") | ||
``` | ||
|
||
Then we can run the job on a single machine: | ||
|
||
```bash | ||
python3 pagerank_nebula_engine.py | ||
``` | ||
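Switching engines by changing a single argument suggests a thin dispatch layer in front of engine-specific backends. The sketch below shows one way such a layer could look. Everything in it (`SketchReader`, `register_engine`, the stub backends, and the sample edges) is hypothetical and only mirrors the `engine=`, `scan()`, and `read()` names used in this README; it is not ngdi's actual code.

```python
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str, int]  # (src, dst, degree)

# Stand-in data so the sketch runs without NebulaGraph or Spark.
SAMPLE_EDGES: List[Edge] = [
    ("player100", "player101", 95),
    ("player101", "player102", 90),
]

# Registry mapping an engine name to the function that scans an edge type.
ENGINES: Dict[str, Callable[[str], List[Edge]]] = {}


def register_engine(name: str):
    """Decorator that registers a scan function under an engine name."""
    def decorator(scan_fn):
        ENGINES[name] = scan_fn
        return scan_fn
    return decorator


@register_engine("spark")
def scan_with_spark(edge: str) -> List[Edge]:
    # A real backend would read the edge type through Spark.
    return list(SAMPLE_EDGES)


@register_engine("nebula")
def scan_with_nebula(edge: str) -> List[Edge]:
    # A real backend would query NebulaGraph from a single machine.
    return list(SAMPLE_EDGES)


class SketchReader:
    """Same calls for every engine; only the backend function differs."""

    def __init__(self, engine: str):
        if engine not in ENGINES:
            raise ValueError(f"unknown engine: {engine!r}")
        self._scan_fn = ENGINES[engine]
        self._edge = None

    def scan(self, edge: str) -> None:
        self._edge = edge

    def read(self) -> List[Edge]:
        if self._edge is None:
            raise RuntimeError("call scan() before read()")
        return self._scan_fn(self._edge)


for engine in ("spark", "nebula"):
    reader = SketchReader(engine=engine)
    reader.scan(edge="follow")
    print(engine, reader.read())
```

With a registry like this, supporting another engine means registering one more scan function while the calling code stays unchanged.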
|
||
## Documentation | ||
|
||
[API Reference](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/API.md) | ||
|
||
## Usage | ||
[Environment Setup](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/Environment_Setup.md) | ||
|
||
### Spark Engine Examples | ||
|
||
See also: [examples/spark_engine.ipynb](https://github.com/wey-gu/nebulagraph-di/examples/spark_engine.ipynb) | ||
|
||
```python | ||
from ngdi import NebulaReader | ||
|
||
# read data with spark engine, query mode | ||
reader = NebulaReader(engine="spark") | ||
query = """ | ||
MATCH ()-[e:follow]->() | ||
RETURN e LIMIT 100000 | ||
""" | ||
reader.query(query=query, edge="follow", props="degree") | ||
df = reader.read() # this will take some time | ||
df.show(10) | ||
|
||
# read data with spark engine, scan mode | ||
reader = NebulaReader(engine="spark") | ||
reader.scan(edge="follow", props="degree") | ||
df = reader.read() # this will take some time | ||
df.show(10) | ||
[API Reference](https://github.com/wey-gu/nebulagraph-di/blob/main/docs/API.md) | ||
|
||
# read data with spark engine, load mode (not yet implemented) | ||
reader = NebulaReader(engine="spark") | ||
reader.load(source="hdfs://path/to/edge.csv", format="csv", header=True, schema="src: string, dst: string, rank: int") | ||
df = reader.read() # this will take some time | ||
df.show(10) | ||
## How it works | ||
|
||
# run pagerank algorithm | ||
pr_result = df.algo.pagerank(reset_prob=0.15, max_iter=10) # this will take some time | ||
ngdi is a unified abstraction layer for different engines. The current implementation is based on Spark, NetworkX, DGL and NebulaGraph, but it is easy to extend to other engines such as Flink, GraphScope, and PyG. | ||
|
||
# convert dataframe to NebulaGraphObject | ||
graph = reader.to_graphx() # not yet implemented | ||
``` | ||
┌───────────────────────────────────────────────────┐ | ||
│ Spark Cluster │ | ||
│ .─────. .─────. .─────. .─────. │ | ||
┌─▶│ : ; : ; : ; : ; │ | ||
│ │ `───' `───' `───' `───' │ | ||
Algorithm │ | ||
Spark └───────────────────────────────────────────────────┘ | ||
Engine ┌────────────────────────────────────────────────────────────────┐ | ||
└──┤ │ | ||
│ NebulaGraph Data Intelligence Suite(ngdi) │ | ||
│ ┌────────┐ ┌──────┐ ┌────────┐ ┌─────┐ │ | ||
│ │ Reader │ │ Algo │ │ Writer │ │ GNN │ │ | ||
│ └────────┘ └──────┘ └────────┘ └─────┘ │ | ||
│ ├────────────┴───┬────────┴─────┐ └──────┐ │ | ||
│ ▼ ▼ ▼ ▼ │ | ||
│ ┌─────────────┐ ┌──────────────┐ ┌──────────┐┌───────────┐ │ | ||
┌──┤ │ SparkEngine │ │ NebulaEngine │ │ NetworkX ││ DGLEngine │ │ | ||
│ │ └─────────────┘ └──────────────┘ └──────────┘└───────────┘ │ | ||
│ └──────────┬─────────────────────────────────────────────────────┘ | ||
│ │ Spark | ||
│ └────────Reader ────────────┐ | ||
Spark Reader Query Mode │ | ||
Scan Mode ▼ | ||
│ ┌───────────────────────────────────────────────────┐ | ||
│ │ NebulaGraph Graph Engine Nebula-GraphD │ | ||
│ ├──────────────────────────────┬────────────────────┤ | ||
│ │ NebulaGraph Storage Engine │ │ | ||
└─▶│ Nebula-StorageD │ Nebula-Metad │ | ||
└──────────────────────────────┴────────────────────┘ | ||
``` | ||
|
||
### NebulaGraph Engine Examples(not yet implemented) | ||
### Spark Engine Prerequisites | ||
- Spark 2.4, 3.0(not yet tested) | ||
- [NebulaGraph 3.4+](https://github.com/vesoft-inc/nebula) | ||
- [NebulaGraph Spark Connector 3.4+](https://repo1.maven.org/maven2/com/vesoft/nebula-spark-connector/) | ||
- [NebulaGraph Algorithm 3.1+](https://repo1.maven.org/maven2/com/vesoft/nebula-algorithm/) | ||
|
||
```python | ||
from ngdi import NebulaReader | ||
### NebulaGraph Engine Prerequisites | ||
- [NebulaGraph 3.4+](https://github.com/vesoft-inc/nebula) | ||
- [NebulaGraph Python Client 3.4+](https://github.com/vesoft-inc/nebula-python) | ||
- [NetworkX](https://networkx.org/) | ||
|
||
# read data with nebula engine, query mode | ||
reader = NebulaReader(engine="nebula") | ||
reader.query(""" | ||
MATCH ()-[e:follow]->() | ||
RETURN e.src, e.dst, e.degree LIMIT 100000 | ||
""") | ||
df = reader.read() # this will take some time | ||
df.show(10) | ||
|
||
# read data with nebula engine, scan mode | ||
reader = NebulaReader(engine="nebula") | ||
reader.scan(edge_types=["follow"]) | ||
df = reader.read() # this will take some time | ||
df.show(10) | ||
|
||
# convert dataframe to NebulaGraphObject | ||
graph = reader.to_graph() # this will take some time | ||
graph.nodes.show(10) | ||
graph.edges.show(10) | ||
## License | ||
|
||
# run pagerank algorithm | ||
pr_result = graph.algo.pagerank(reset_prob=0.15, max_iter=10) # this will take some time | ||
``` | ||
This project is licensed under the terms of the Apache License 2.0. |