Code Scripts and Data for KDD'21 paper: MTC

Cite

contact Chaoqi Yang (chaoqiy2@illinois.edu, ycqsjtu@gmail.com) for any question.

@inproceedings{yang2021mtc,
    title = {MTC: Multiresolution Tensor Completion from Partial and Coarse Observations},
    author = {Yang, Chaoqi and Singh, Navjot and Xiao, Cao and Qian, Cheng and Solomonik, Edgar and Sun, Jimeng},
    booktitle = {Proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) 2021},
    year = {2021}
}

We provide the code for (1) tensor data processing; (2) baselines; (3) our models.

0. Quick start

Run our model MTC on the synthetic data

cd src_synthetic_numpy; python test_X_C1_C2_unknown.py --partial 0.05 --jacobi 1 --multi 1

Run our model MTC on a subset of GCSS data

cd src_GCSS_torch; python run.py --model MTC --partial 0.05 --jacobi 1 --multi 1

1. Folder Tree

./data_synthetic
- the generated synthetic data (.pkl format): X, C1, C2, P1, P2
- the generating script: synthetic.ipynb
./src_synthetic_numpy (numpy implementation for synthetic data)
- codes with only partial observation: test_only_T.py
- codes with X, C1, C2 and unknown P1, P2: test_X_C1_C2_unknown.py
- codes with X, C1, C2 and known P1, P2: test_X_C1_C2_known.py
./data_GCSS
- data download link: https://pair-code.github.io/covid19_symptom_dataset/?country=GB (The US data)
- a subset of GCSS data (.pkl format): X, O1, O2, P1, P2
- data preprocessing jupyter notebook: Google Tensor Data.ipynb
./src_GCSS_numpy (numpy implementation for GCSS data)
- MTC codes with X, O1, O2 and unknown P1, P2: X_O1_O2_unknwon_P1_unknown_P2.py
- Baseline codes with O1, O2 and unknown P1, P2: OracleCPD.py, BGD.py, PREMA.py, CMTF.py
- MTC codes with X, O1, O2 and unknown P1, known P2: X_O1_O2_unknwon_P1_known_P2.py
- Codes for tensor initialization comparison: initialization.py
./src_GCSS_torch (torch implementation for GCSS data, GPU support)
- the interfact to call completion models: run.py
- different model functions: model.py

2. Python package dependency

pip install scipy==1.5.0
pip install numpy==1.19.1
pip install pandas==1.1.2
pip install torch==1.7.0

The dependency packages are pretty common. If any missing, please install that.

3. Instruction for synthetic data

3.1 Generate the synthetic data

The following is the key scripts to generate the synthetic data by ./src_synthetic/synthetic.ipynb.

import numpy as np
np.random.seed(400)
# generate factors
U = np.random.random((125, 10))
V = np.random.random((125, 10))
W = np.random.random((125, 10))
U = np.sort(U, axis=0) # make it smooth, so as to be continuous mode
W = np.sort(W, axis=0) # make it smooth, so as to be continuous mode

# tensor X
T = np.einsum('ir,jr,kr->ijk',U,V,W,optimize=True)

# mapping on the first mode
P1 = np.random.random((12, 125))
P1 = np.argsort(P1, axis=0) <= 0.0

# mapping on the first mode
P2 = np.random.random((12, 125))
P2 = np.argsort(P2, axis=0) <= 0.0

# C1, C2
C1 = np.einsum('ijk,ri->rjk',T,P1,optimize=True)
C2 = np.einsum('ijk,rj->irk',T,P2,optimize=True)

We also provide the data (used in the paper): ./data_synthetic.

3.2 Run and visualize the experiments

The following is the argument options:

optional arguments:
  -h, --help         show this help message and exit
  --partial PARTIAL  the amount of partial observation
  --jacobi JACOBI    Jacobi (1) or not (0)
  --multi MULTI      multiresolution (1) or not (0)

To get the results with only the partial observation, use

python test_only_X.py --partial [0.05 or other] --jacobi [0/1] --multi [0/1]

To get the results with also coarse information (C1, C2), but not with the aggregation (P1, P2), use

python test_X_C1_C2_unknown.py --partial [0.05 or other] --jacobi [0/1] --multi [0/1]

To get the results with also coarse information (C1, C2) and the aggregation (P1, P2), use

python test_X_C1_C2_known.py --partial [0.05 or other] --jacobi [0/1] --multi [0/1]

In fact, we also provide all the running result files in ./src_synthetic_numpy/synthetic_result/.., and the results could be fed into the provided jupyter notebook for visualization.

4. Instruction for GCSS data

4.1 data preprocessing

We already provide a subset of GCSS data in ./data_GCSS. To get the whole tensor, following these steps:

First, download the GCSS data from https://pair-code.github.io/covid19_symptom_dataset/?country=GB
Second, run the data process script (Google Tensor Data.ipynb) to generate X, C1, C3, P1, P3 (Since in this dataset, the aggregation is on the first and the third mode, the C1, C3 are coarse tensors, and P1, P3 are the according aggregation matrices), or using the following code equivalently,

import pandas as pd
import numpy as np

data = pd.read_csv('./2020_country_daily_2020_US_daily_symptoms_dataset.csv') # change the data path accordingly

# all zipcode
zipcode = list(filter(lambda x: ('US' in x) and (x.count('_') == 2), data.key.unique()))
# the time
time = data.date.unique().tolist()
# disease list
disease = data.columns.tolist()[2:]

# tensor placeholder
T = np.zeros((len(zipcode), len(disease), len(time)))

# process per zipcode
for i, single_zip in enumerate(zipcode):
    print (i, single_zip)
    tmp = data[data.key == single_zip]
    re_time = [time.index(i) for i in tmp.date.tolist()]
    T[i, :, :][:, re_time] = tmp.values[:, 2:].T

# change nan to zero
T = np.nan_to_num(T)

# zipcode -> state mapping
state = np.unique([item.split('_')[1] for item in zipcode]).tolist()
P1 = np.zeros((len(state), len(zipcode)))

for i, z in enumerate(zipcode):
    for j, s in enumerate(state):
        if s in z:
            P1[j, i] = 1
            
# date -> week mapping
P3 = np.zeros(((len(time) - 1) // 7 + 1, len(time)))
for i, t in enumerate(time):
    P3[i // 7, i] = 1

O1 = np.einsum('ijk,ri->rjk',T,P1,optimize=True)
O3 = np.einsum('ijk,rk->ijr',T,P3,optimize=True)

import pickle 
# dump
pickle.dump(O1, open('O1_google.pkl', 'wb'))
pickle.dump(O3, open('O2_google.pkl', 'wb'))
pickle.dump(T, open('X_google.pkl', 'wb'))
pickle.dump(P1, open('P1_google.pkl', 'wb'))
pickle.dump(P3, open('P2_google.pkl', 'wb'))

ATTENTION! In the code, the variable names might be a little bit different, for example, we use variable name O1 to denote the tensor aggregated on the first mode. Take care, not get confused.

4.2 Use torch implementation

One command for all:

python run.py --model [model name] --partial [0.05 or other] --jacobi [0/1] --multi [0/1]

The [model name] could be (1) OracleCPD; (2) CPC-ALS; (3) BGD; (4) CMTF; (5) PREMA; (6) MTC; (7) MTC-P2. Here, MTC is our proposed model without knowning P1,P2 while MTC-P2 knows P2 but not P1. Also, the last two arguments are only for MTC and MTC-P2.

4.3 Use numpy implementation

Oracle CPD model: run standard CP decomposition model on the origianl tensor

python OracleCPD.py

CPC-ALS model: only use the partial observation, treat the problem as a tensor completion problem

python CPC_ALS.py --partial [0.05 or other]

For BGD, CMTF, PREMA

python BGD.py --partial [0.05 or other]
python CMTF.py --partial [0.05 or other]
python PREMA.py --partial [0.05 or other]

For our model variant: MTC_{both-}, MTC_{multi-}, MTC_{jacobi-}, MTC:

python X_C1_C3_unknwon_P1_unknown_P3.py --partial [0.05 or other] --jacobi [0/1] --multi [0/1]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Code Scripts and Data for KDD'21 paper: MTC

Cite

0. Quick start

1. Folder Tree

2. Python package dependency

3. Instruction for synthetic data

3.1 Generate the synthetic data

3.2 Run and visualize the experiments

4. Instruction for GCSS data

4.1 data preprocessing

4.2 Use torch implementation

4.3 Use numpy implementation

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
data_GCSS		data_GCSS
data_synthetic		data_synthetic
src_GCSS_numpy		src_GCSS_numpy
src_GCSS_torch		src_GCSS_torch
src_synthetic_numpy		src_synthetic_numpy
README.md		README.md

ycq091044/MTC

Folders and files

Latest commit

History

Repository files navigation

Code Scripts and Data for KDD'21 paper: MTC

Cite

0. Quick start

1. Folder Tree

2. Python package dependency

3. Instruction for synthetic data

3.1 Generate the synthetic data

3.2 Run and visualize the experiments

4. Instruction for GCSS data

4.1 data preprocessing

4.2 Use torch implementation

4.3 Use numpy implementation

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages