gBolt--a fast, memory efficient, and light-weight implementation for gSpan algorithm in data mining
gBolt is up to 100x faster (see detailed experiments) than Yan's original implementation with multi-threading on a single machine. gBolt also reduces more than 200 folds memory usage, running efficiently on personal computers.
gBolt is fast because it:
- Adopts OpenMP task-based parallel programming;
- Incorporates C++11 hastable and hashset;
- Uses contiguous memory storage;
- Uses partial pruning.
gBolt is memory efficient because it:
- Incorporates C++11 emplace_back method;
- Reconstructs a graph with frequent edges and nodes before mining;
- Uses a customized Path data structure to reuse memory in recursive procedures.
gBolt is light-weight because it:
- Can be installed without any heavy third-party library;
- Delivers cross-platform performance without utilizing architectural features.
gBolt is correct because:
- We have ran experiments for
extern/data/Compound_422
andextern/data/Chemical_340
with minimal support from 0.1 to 0.9. The results generated by gBolt are exactly the same as Yan's gSpan-64.
Required:
- cmake:
sudo apt-get install cmake
or install from source. - gcc >= 4.9
Optional:
- jemalloc: install from source.
- OpenMP Environment: enable multi-threading
Steps:
git clone --recursive https://github.com/Jokeren/gBolt.git
cd gBolt
mkdir build && cd build
cmake ..
make
Build Options:
-DGBOLT_SERIAL=ON
: serial execution without OpenMP-DGBOLT_PERFORMANCE=ON
: display simple performance information and use hash map-DJEMALLOC_DIR=/path/to/dir
: use jemalloc for memory management-DCMAKE_BUILD_TYPE=<Release> or <RelWithDebInfo>
: config build type
Run an example:
./build/gbolt -i extern/data/Chemical_340 -s 0.2
Arguments help:
./build/gbolt -h
Multi-threading config:
export OMP_NUM_THREADS=<hardware core num for recommendation>
Examples:
./extern/data
Format:
t # <graph-id>
v <vertex-id> <vertex-label>
...
e <vertex-id> <vertex-id> <edge-label>
...
-
<graph-id>
must be contiguous without gaps, which means gBolt only supportst # <id>
followed byt # <id + 1>
. -
<vertex-id>
must be contiguous without gaps, which means gBolt only supportsv # <id>
followed byv # <id + 1>
. -
All the
<id>
fields must be non-negative integers starting from 0. -
All the
<label>
fields must be non-negative integers.
Yan, Xifeng, and Jiawei Han. "gspan: Graph-based substructure pattern mining." Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on. IEEE, 2002.
@misc{zhou_2019, title={Jokeren/gBolt}, url={https://github.com/Jokeren/gBolt}, journal={GitHub}, author={Zhou, Keren}, year={2019}, month={March}}
gBolt is designed for efficiency, so we have not developed utilities for it. Your are welcome to implement Python
, C
, or graphical interfaces.