This code and data repository accompanies the following two papers:
- Choosing to grow a graph - Jan Overgoor, Austin R. Benson, Johan Ugander. (WWW, 2019)
- Scaling choice models of relational social data - Jan Overgoor, George Pakapol Supaniratisai, Johan Ugander. (KDD, 2020)
In the filenames and documentation we sometimes refer to Paper #1 and Paper #2 instead.
For questions, please email Jan at overgoor@stanford.edu.
The code for fitting logit models, as well as the code to generate the synthetic graphs, is written in Python 3. The code for the plots is written in R.
We used the following versions of external Python libraries:
- `networkx=2.1`
- `numpy=1.18.1`
- `scipy=1.2.0`
- `pandas=0.23.0`
- `torch=0.4.0` (to accelerate the optimizing routine)
- `plfit`: install from here, but remove `plfit_v1.py` before building, for Python 3 compatibility.
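A quick way to see whether a local environment matches the versions listed above is sketched below. It is a convenience check and not part of the repository; the helper names `installed_version` and `compare_versions` are made up for this illustration, and `plfit` is left out because it is built from source.

```python
from importlib import metadata

# Versions as listed in this README (plfit is built from source and not checked).
PINNED = {
    "networkx": "2.1",
    "numpy": "1.18.1",
    "scipy": "1.2.0",
    "pandas": "0.23.0",
    "torch": "0.4.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(pinned, lookup=installed_version):
    """Return {package: (listed, installed)} for every package that differs."""
    mismatches = {}
    for package, wanted in pinned.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in compare_versions(PINNED).items():
        print(f"{package}: README lists {wanted}, found {found or 'not installed'}")
```

Newer versions may well work; a mismatch only signals a difference from the setup the results were produced with.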
Instructions to reproduce the results for both papers can be found in the corresponding `README.md` files.
All of the following steps are also encoded in `paper1_driver.sh`.
To reproduce the results from Sections 4.1 and 4.2, follow these steps (from the `/src/paper1` folder):
- Generate synthetic graphs with `python synth_generate.py`. This generates 10 graphs for each (r, p) combination and writes them to `data_path/graphs`, as defined in `util.py`.
- Extract, for each edge, the relevant choice data with `python synth_process.py`. The choice set data is written to `data_path/choices`.
- Run the analysis code with `python make_plot_data.py`.
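To give a feel for what "choice data" per edge means in the second step above, here is a minimal sketch. It is not the code in `synth_process.py`: the function name `extract_choice_sets`, the row layout, the use of in-degree as the only feature, and the rule that every previously seen node except the source is a candidate are all assumptions made for illustration.

```python
from collections import defaultdict


def extract_choice_sets(edges):
    """Turn a time-ordered list of directed edges (source, target) into rows.

    Each row describes one candidate target for one edge event: its in-degree
    just before the edge formed, and whether it was the node actually chosen.
    """
    in_degree = defaultdict(int)
    nodes = []    # nodes in order of first appearance
    seen = set()
    rows = []
    for event_id, (source, target) in enumerate(edges):
        for node in (source, target):
            if node not in seen:
                seen.add(node)
                nodes.append(node)
        for candidate in nodes:
            if candidate == source:
                continue  # assumption: no self-loops in the choice set
            rows.append({
                "event": event_id,
                "candidate": candidate,
                "in_degree": in_degree[candidate],
                "chosen": int(candidate == target),
            })
        in_degree[target] += 1  # update features only after the event is recorded
    return rows
```

Each event contributes exactly one row with `chosen == 1`; a conditional logit model is then fit on the features of the chosen row relative to the other rows of the same event.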
For the analysis in Section 4.3, follow these steps:
- Download the Flickr data with `curl -O -4 http://socialnetworks.mpi-sws.org/data/flickr-growth.txt.gz data/`. This file is about 141 MB.
- Process the Flickr data with `python flickr_process.py`. This code takes a while to run.
- Build the RMarkdown report with `R -e "rmarkdown::render('../paper1_reports/flickr_data.Rmd', output_file='../paper1_reports/flickr_data.pdf')"`.
For the analysis in Section 4.4, follow these steps:
- Download the Microsoft Academic Graph. Warning: the uncompressed size of this data set is over 165 GB. Download it with the following Bash code:

      mkdir ~/mag_raw
      cd ~/mag_raw
      for i in {0..8}
      do
        curl -O -4 https://academicgraphv1wu.blob.core.windows.net/aminer/mag_papers_$i.zip
        unzip mag_papers_$i.zip
      done

- Process the data with `python mag_process.py`. Note that you can change the field of study to process. This code takes a while to run.
- Build the RMarkdown report with `R -e "rmarkdown::render('../paper1_reports/mag_climatology.Rmd', output_file='../paper1_reports/mag_climatology.pdf')"`.
Finally, to produce the figures of the paper, run the R code to make the plots with `Rscript make_plots.R`.
The code for the simulation experiments uses a different implementation of the conditional logit fitting procedure than in Paper #1. This code base contains the following routines:
- Simulating graph edge formation under both a regular conditional logit (single-mode multinomial) choice model and a mixed-mode multinomial choice model
- Feature extraction under different hyperparameters:
  - Sampling methods
  - Candidates subsampling size
  - Events subsampling size
- Choice model fitting (single and de-mixed).
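The routines listed above can be illustrated end to end with a small, self-contained sketch: simulate choices from a conditional logit model, subsample the non-chosen candidates uniformly, and fit the model on the reduced choice sets. This is not the repository's implementation. It assumes a single scalar feature per alternative, uses plain Newton steps instead of the repository's optimizer, and the names `simulate_choices`, `subsample`, and `fit_theta` are hypothetical.

```python
import math
import random


def simulate_choices(n_events, n_alternatives, theta, rng):
    """Simulate events in which alternative j is chosen with probability
    proportional to exp(theta * x_j), where x_j is a scalar feature."""
    events = []
    for _ in range(n_events):
        features = [rng.random() for _ in range(n_alternatives)]
        weights = [math.exp(theta * x) for x in features]
        chosen = rng.choices(range(n_alternatives), weights=weights, k=1)[0]
        events.append((features, chosen))
    return events


def subsample(events, n_negatives, rng):
    """Keep the chosen alternative plus n_negatives uniformly sampled
    non-chosen alternatives. Position 0 always holds the chosen one."""
    sampled = []
    for features, chosen in events:
        others = [j for j in range(len(features)) if j != chosen]
        negatives = rng.sample(others, n_negatives)
        sampled.append([features[chosen]] + [features[j] for j in negatives])
    return sampled


def fit_theta(sampled, n_iter=25):
    """Maximize the conditional logit log-likelihood over theta with Newton steps."""
    theta = 0.0
    for _ in range(n_iter):
        grad, hess = 0.0, 0.0
        for xs in sampled:
            shift = max(theta * x for x in xs)  # subtract max for numerical stability
            ws = [math.exp(theta * x - shift) for x in xs]
            total = sum(ws)
            mean = sum(w * x for w, x in zip(ws, xs)) / total
            second = sum(w * x * x for w, x in zip(ws, xs)) / total
            grad += xs[0] - mean             # chosen feature minus model expectation
            hess -= second - mean * mean     # minus the model variance of the feature
        if hess == 0.0:
            break
        theta -= grad / hess
    return theta
```

For example, `fit_theta(subsample(simulate_choices(2000, 30, 2.0, rng), 5, rng))` with `rng = random.Random(0)` fits the model on six alternatives per event instead of thirty. With uniform sampling of the non-chosen alternatives, the estimate should remain close to the true parameter, at the cost of some extra variance.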
To reproduce the figures, go through the following steps (from the `/src/paper2` folder). They are also encoded in `paper2_driver.sh`.
- To generate the synthetic data for Figure 2, run `python complexity.py`.
- To generate the synthetic graph and subsequent analysis for Figure 3, run `python synth_generate.py mnl; python synth_experiment.py fig3`.
- To generate the synthetic graph and subsequent analysis for Figure 4, run `python synth_generate.py mixed_mnl; python synth_experiment.py fig4`.
- Finally, to produce the figures of the paper, run the R code to make the plots with `Rscript make_plots.R`.
Because discrete choice models are widely studied in other fields, there are many other software libraries available for the major statistical programming languages. For Python, there is an implementation in `statsmodels`, as well as the `larch`, `pylogit`, `choix`, and `choicemodels` packages. For R, there are the `mlogit` and `mnlogit` libraries. Stata has the `clogit` and `xtmelogit` routines built in, and there are a number of user-written routines as well. We haven't tested these libraries, but they might be useful.