persona2vec

A simple implementation of persona2vec. For details, see the persona2vec paper:
https://arxiv.org/abs/2006.04941

Installation

You can install persona2vec as a library with a single command:

python libs/setup.py install

Requirements

The codebase is implemented in Python 3.7.3, and the package versions used for development are listed below. The library should also work in other environments. If you run into a problem, please open an issue and I will handle it.

networkx          2.3
tqdm              4.28.1
numpy             1.15.4
pandas            0.23.4
texttable         1.5.0
scipy             1.1.0
argparse          1.1.0
gensim            3.6.0
python-louvain      - 

How to use

You can use persona2vec as a library:

from persona2vec.model import Persona2Vec
from persona2vec.utils import read_graph

G = read_graph(NETWORK_FILE_NAME)
model = Persona2Vec(G, lambd=LAMBDA, dimensions=DIM, workers=NUMBER_OF_CORES)
emb = model.embedding

For details, see the example notebook, examples/example_karate.ipynb.

Datasets - inputs

The utility function read_graph in persona2vec/utils.py reads input files. You can create an edgelist file (*.elist) with the networkx function nx.write_edgelist.
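As a minimal sketch of preparing an input file, the snippet below writes the karate club graph bundled with networkx to an edgelist. The output path is hypothetical, and writing without edge attributes (data=False) is an assumption about what read_graph expects, not something this README confirms.

```python
import os
import tempfile

import networkx as nx

# Zachary's karate club ships with networkx; any nx.Graph would do.
G = nx.karate_club_graph()

# Hypothetical output path, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "karate.elist")

# data=False writes one "source target" pair per line, without attribute dicts.
# Assumption: read_graph accepts this plain two-column format.
nx.write_edgelist(G, path, data=False)

# Read the file back with networkx to check the round trip.
H = nx.read_edgelist(path, nodetype=int)
print(H.number_of_nodes(), H.number_of_edges())
```

The resulting file can then be used as NETWORK_FILE_NAME in the library example or as the --input argument on the command line.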

Outputs

persona2vec produces three outputs:

  1. Persona network: the network that results from ego-splitting. The file format is edgelist (*.elist), the same as the input format.

  2. Persona-to-node and node-to-persona mappings: dicts that connect each original node to its split persona nodes, and vice versa. The relation between a node and its personas is one-to-many. The file format is pickle (*.json).

  3. Base embedding and persona embedding: the two embeddings learned by persona2vec. The file format is pickle (.w2v); see save_word2vec_format.
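To make the persona network and the two mappings concrete, here is a rough pure-Python sketch of ego-splitting with connected components as the clustering step. It is not the library's implementation: the function names (ego_split, connected_components) are hypothetical, the graph is assumed to be simple and undirected, and using lambd directly as the weight between all personas of the same node is an assumption based on the description of --lambd in this README.

```python
from collections import defaultdict


def connected_components(nodes, adj):
    """Connected components of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, comp = [start], set()
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(comp)
    return components


def ego_split(edges, lambd=0.5):
    """Hypothetical helper: split each node into one persona per neighbor cluster."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    node_to_persona = {}  # original node -> list of persona ids (one-to-many)
    persona_to_node = {}  # persona id -> original node
    persona_of = {}       # (node, neighbor) -> persona of `node` facing `neighbor`
    next_id = 0
    for node in sorted(adj):
        # Ego-net without the ego itself: the neighbors and the edges among them.
        for comp in connected_components(adj[node], adj):
            persona_to_node[next_id] = node
            node_to_persona.setdefault(node, []).append(next_id)
            for neighbor in comp:
                persona_of[(node, neighbor)] = next_id
            next_id += 1

    persona_edges = []
    # Each original edge links the two personas that face each other.
    for u, v in edges:
        persona_edges.append((persona_of[(u, v)], persona_of[(v, u)], 1.0))
    # Assumption: personas of the same node are tied together with weight lambd.
    for personas in node_to_persona.values():
        for i, a in enumerate(personas):
            for b in personas[i + 1:]:
                persona_edges.append((a, b, lambd))
    return persona_edges, node_to_persona, persona_to_node


# Two triangles that share node 0: node 0 should get two personas.
edges = [(0, 1), (1, 2), (0, 2), (0, 3), (3, 4), (0, 4)]
persona_edges, node_to_persona, persona_to_node = ego_split(edges, lambd=0.5)
print(node_to_persona[0], len(persona_edges))
```

In this toy graph only node 0 sits between two groups, so it is the only node that is split; every other node keeps a single persona.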

For Reproducibility

We use snakemake to make the experiments in the paper reproducible. The code for the experiments is under the workflow folder.

We have two workflows: one for link prediction with persona2vec (node2vec) and one for SPLITTER (baseline). You can run each workflow with the snakemake command in its folder.

Using as command line interface

persona2vec also supports command line arguments. The following command learns an embedding and saves it along with the persona network, persona-to-node mapping, node-to-persona mapping, base embedding, and persona embedding.

persona2vec --input [INPUT_FILES_DIR] \
            --persona-network [PERSONA_NETWORK_DIR] \
            --persona-to-node [PERSONA_TO_NODE_DIR] \
            --node-to-persona [NODE_TO_PERSONA_DIR] \
            --base-emb [BASE_EMB_DIR] \
            --persona-emb [PERSONA_EMB_DIR]
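
If you prefer to launch the command from a script, the following sketch assembles the same argument list in Python. The flags are the ones listed above; every path is a hypothetical placeholder, and the snippet only builds and prints the command rather than running it.

```python
import shlex

# Hypothetical paths, for illustration only; replace them with your own.
paths = {
    "--input": "data/karate.elist",
    "--persona-network": "out/persona_network.elist",
    "--persona-to-node": "out/persona_to_node.json",
    "--node-to-persona": "out/node_to_persona.json",
    "--base-emb": "out/base_emb.w2v",
    "--persona-emb": "out/persona_emb.w2v",
}

# Build the argument vector: program name followed by flag/value pairs.
argv = ["persona2vec"]
for flag, value in paths.items():
    argv += [flag, value]

# Quote each argument so the printed command is safe to paste into a shell.
command = " ".join(shlex.quote(arg) for arg in argv)
print(command)

# To actually run it, pass the list to subprocess.run(argv, check=True).
```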

To train persona2vec with 32 dimensions:

persona2vec --dimensions 32

You can also change the random-walk configuration, for example for the persona embedding:

persona2vec --num-walks-persona 20 --walk-length-persona 80

Input and output options

  --input [INPUT]       Input network path as edgelist format
  --persona-network [PERSONA_NETWORK]
                        Persona network path.
  --persona-to-node [PERSONA_TO_NODE]
                        Persona to node mapping file.
  --node-to-persona [NODE_TO_PERSONA]
                        Node to persona mapping file.
  --base-emb [BASE_EMB]
                        Base Embeddings path
  --persona-emb [PERSONA_EMB]
                        Persona Embeddings path

Model options

  --lambd LAMBD         Edge weight for persona edge, usually 0~1.
  --clustering-method CLUSTERING_METHOD
                        Name of the clustering method used to split
                        personas; choose one of
                        ('connected_component', 'modularity', 'label_prop').
  --dimensions DIMENSIONS
                        Number of dimensions. Default is 128.
  --walk-length-base WALK_LENGTH_BASE
                        Length of walk per source. Default is 40.
  --num-walks-base NUM_WALKS_BASE
                        Number of walks per source. Default is 10.
  --window-size-base WINDOW_SIZE_BASE
                        Context size for optimization. Default is 5.
  --epoch-base EPOCH_BASE
                        Number of epochs in the base embedding
  --walk-length-persona WALK_LENGTH_PERSONA
                        Length of walk per source. Default is 80.
  --num-walks-persona NUM_WALKS_PERSONA
                        Number of walks per source. Default is 10.
  --window-size-persona WINDOW_SIZE_PERSONA
                        Context size for optimization. Default is 10.
  --epoch-persona EPOCH_PERSONA
                        Number of epochs in persona embedding
  --p P                 Return hyperparameter for random-walker. Default is 1.
  --q Q                 In-out hyperparameter for random-walker. Default is 1.
  --workers WORKERS     Number of parallel workers. Default is 8.
  --weighted            Boolean specifying (un)weighted. Default is
                        unweighted.
  --unweighted
  --directed            Graph is (un)directed. Default is undirected.
  --undirected
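
The --p and --q options use the same names as the return and in-out parameters of node2vec's biased random walk. As a rough illustration of what they control, the sketch below computes the unnormalized node2vec transition weights for an unweighted graph. The function name transition_weights is hypothetical, and whether persona2vec's walker follows exactly this rule is an assumption.

```python
def transition_weights(prev, cur, adj, p=1.0, q=1.0):
    """Unnormalized weights for leaving `cur` after arriving from `prev`."""
    weights = {}
    for nxt in adj[cur]:
        if nxt == prev:
            weights[nxt] = 1.0 / p  # step back to the previous node
        elif nxt in adj[prev]:
            weights[nxt] = 1.0      # stay at distance 1 from the previous node
        else:
            weights[nxt] = 1.0 / q  # move outward, to distance 2
    return weights


# Small undirected toy graph as an adjacency dict.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}

# The walker came from node 0 and now sits at node 1.
w = transition_weights(prev=0, cur=1, adj=adj, p=2.0, q=0.5)

# Normalize the weights to get transition probabilities.
total = sum(w.values())
probs = {node: weight / total for node, weight in w.items()}
print(probs)
```

A larger p makes the walk less likely to step straight back, and a smaller q pushes it away from the previous node's neighborhood. With the defaults p = q = 1, all neighbors get the same weight in an unweighted graph, so the walk is a plain uniform random walk.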