This is the official codebase for the KDD 2021 paper *Generalized Zero-Shot Extreme Multi-Label Learning*.
Nilesh Gupta, Sakina Bohra, Yashoteja Prabhu, Saurabh Purohit, Manik Varma
Extreme Multi-label Learning (XML) involves assigning the subset of most relevant labels to a data point from an extremely large set of label choices. An unaddressed challenge in XML is that of predicting unseen labels with no training points.

Generalized Zero-shot XML (GZXML) is a paradigm where the task is to tag a data point with the most relevant labels from a large universe of both seen and unseen labels.
```bash
# Build
make

# Download GZ-Eurlex-4.3K dataset
mkdir GZXML-Datasets
cd GZXML-Datasets
pip install gdown
gdown "https://drive.google.com/uc?id=1j27bQZol6gOQ7AATawShcF4jXJr3Venb"
tar -xvzf GZ-Eurlex-4.3K.tar.gz
cd -

# Train and predict ZestXML on GZ-Eurlex-4.3K dataset
./run_eurlex.sh train
./run_eurlex.sh predict

# Install dependencies of metrics.py
pip install -r requirements.txt

# Install pyxclib for evaluation
git clone https://github.com/kunaldahiya/pyxclib.git
cd pyxclib
python3 setup.py install --user
cd -

# Prints evaluation metrics
python metrics.py GZ-Eurlex-4.3K
```
The following datasets were used in the paper for benchmarking GZXML algorithms (all datasets can be downloaded from here):
- GZ-EURLex-4.3K, Document Tagging of EU law pages
- GZ-Amazon-1M, Item to Item Recommendation of Amazon products
- GZ-Wikipedia-1M, Document Tagging of Wikipedia pages
The following are some statistics of these datasets:
| Dataset | Train Points | Test Points | Seen Labels | Unseen Labels | Point Features | Label Features |
|---|---|---|---|---|---|---|
| GZ-Eurlex-4.3K | 45,000 | 6,000 | 4,108 | 163 | 100,000 | 24,316 |
| GZ-Amazon-1M | 914,179 | 1,465,767 | 476,381 | 483,725 | 1,000,000 | 1,476,381 |
| GZ-Wikipedia-1M | 2,271,533 | 2,705,425 | 495,107 | 776,612 | 1,000,000 | 1,438,196 |
All sparse matrices are stored in the text sparse matrix format; please refer to the text sparse matrix format subsection below for more details. The following are the details of the required files (a short loading sketch follows this list):
- `Xf.txt`: all features used in the `tf-idf` representation of documents (`(trn/tst/val)_X_Xf`); the i-th line denotes the i-th feature of the tf-idf representation. For the datasets used in the paper, these are the stemmed unigram and bigram features of documents, but you can choose any set of features depending on your application.
- `Yf.txt`: similar to `Xf.txt`, it represents the features of all labels. In addition to unigrams and bigrams, we also add a unique feature specific to each label (represented by `__label__<i>__<label-i-text>`; this feature is present only in the i-th label's features), which allows the model to have label-specific parameters and helps it do well on many-shot labels. Features with `__parent__` in them are specific to the `GZ-EURLex-4.3K` dataset because raw labels in this dataset carry additional information about the parent concepts of each label; you can safely ignore these features for any other/new dataset.
- `(trn/tst/val)_X_Xf.txt`: sparse matrix (documents x document-features) representing the `tf-idf` feature matrix of (trn/tst/val) input documents.
- `Y_Yf.txt`: similar to `(trn/tst/val)_X_Xf.txt` but for labels; this is the sparse matrix (labels x label-features) representing the `tf-idf` feature matrix of labels.
- `trn_Y_Yf.txt`: similar to `Y_Yf.txt` but contains features for only the seen labels (can be interpreted as `Y_Yf[seen-labels]`).
- `(trn/tst/val)_X_Y.txt`: sparse matrix (documents x labels) representing the (trn/tst/val) document-label relevance matrix.
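As a quick sanity check after downloading a dataset, the sketch below loads a few of these files and verifies that their dimensions agree. This is a minimal sketch, not part of the codebase: it assumes pyxclib's `data_utils.read_sparse_file` (pyxclib is installed above for evaluation) can parse the text sparse matrix format described in the next subsection, and that the dataset was extracted to `GZXML-Datasets/GZ-Eurlex-4.3K`; the exact pyxclib API may differ across versions.

```python
# Sanity-check sketch (not part of this codebase): load GZ-Eurlex-4.3K files and
# check that their dimensions are mutually consistent.
# Assumes pyxclib's read_sparse_file parses the text sparse matrix format below;
# the exact function name/arguments may differ across pyxclib versions.
import os
from xclib.data import data_utils

data_dir = "GZXML-Datasets/GZ-Eurlex-4.3K"  # extraction path from the commands above (assumption)

trn_X_Xf = data_utils.read_sparse_file(os.path.join(data_dir, "trn_X_Xf.txt"))  # documents x document-features
trn_X_Y = data_utils.read_sparse_file(os.path.join(data_dir, "trn_X_Y.txt"))    # documents x labels
Y_Yf = data_utils.read_sparse_file(os.path.join(data_dir, "Y_Yf.txt"))          # labels x label-features

with open(os.path.join(data_dir, "Xf.txt"), encoding="utf-8") as f:
    num_doc_features = sum(1 for _ in f)  # one document feature per line

assert trn_X_Xf.shape[0] == trn_X_Y.shape[0], "train document counts must match"
assert trn_X_Y.shape[1] == Y_Yf.shape[0], "label universe must match"
assert trn_X_Xf.shape[1] == num_doc_features, "document features must match Xf.txt"
print(trn_X_Xf.shape, trn_X_Y.shape, Y_Yf.shape)
```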
This is a plain-text row-major representation of a sparse matrix. The following are the details of the format (a small reader sketch follows this list):
- The first line contains two space-separated integers denoting the dimensions of the matrix (i.e. `num_row num_column`)
- `num_row` lines follow the first line, and each line represents a sparse row vector
- A sparse row vector is written as space-separated non-zero entries of the vector, where each entry is represented as `<index>:<value>`. For example, if the vector is `[0, 0, 0.5, 0.4, 0, 0.2]` then its sparse text representation is `2:0.5 3:0.4 5:0.2` (NOTE: the indexing starts from 0)
- You can check `GZ-Eurlex-4.3K/trn_X_Xf.txt` for an example of this format
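If you prefer not to depend on pyxclib for data handling, here is a minimal reader for this format using only numpy/scipy. It is a sketch written against the description above, not code from this repository; the helper name `read_text_sparse_matrix` is ours.

```python
# Minimal reader for the text sparse matrix format described above
# (a sketch, not part of this codebase). Returns a scipy CSR matrix.
import numpy as np
from scipy.sparse import csr_matrix

def read_text_sparse_matrix(path):
    with open(path, encoding="utf-8") as f:
        num_row, num_column = map(int, f.readline().split())  # header: matrix dimensions
        indptr = [0]
        indices, values = [], []
        for line in f:
            for entry in line.split():          # "index:value" pairs, 0-based indices
                idx, val = entry.split(":")
                indices.append(int(idx))
                values.append(float(val))
            indptr.append(len(indices))         # an empty line is a row with no non-zeros
    assert len(indptr) - 1 == num_row, "row count does not match header"
    return csr_matrix(
        (np.array(values, dtype=np.float32),
         np.array(indices, dtype=np.int64),
         np.array(indptr, dtype=np.int64)),
        shape=(num_row, num_column),
    )

# Example: the row "2:0.5 3:0.4 5:0.2" decodes to [0, 0, 0.5, 0.4, 0, 0.2].
# X = read_text_sparse_matrix("GZXML-Datasets/GZ-Eurlex-4.3K/trn_X_Xf.txt")
```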
```bib
@InProceedings{Gupta21,
    author = "Gupta, N. and Bohra, S. and Prabhu, Y. and Purohit, S. and Varma, M.",
    title = "Generalized Zero-Shot Extreme Multi-label Learning",
    booktitle = "Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining",
    month = "August",
    year = "2021"
}
```