This is the official codebase for the KDD 2021 paper Generalized Zero-Shot Extreme Multi-Label Learning.

nilesh2797/zestxml

Generalized Zero-Shot Extreme Multi-Label Learning

Nilesh Gupta, Sakina Bohra, Yashoteja Prabhu, Saurabh Purohit, Manik Varma

Overview

Extreme Multi-label Learning (XML) involves assigning the subset of most relevant labels to a data point from an extremely large set of label choices. An unaddressed challenge in XML is predicting unseen labels, i.e. labels that have no training points.

Generalized Zero-shot XML (GZXML) is a paradigm where the task is to tag a data point with the most relevant labels from a large universe of both seen and unseen labels.

Running the Code

# Build
make

# Download GZ-Eurlex-4.3K dataset
mkdir GZXML-Datasets
cd GZXML-Datasets
pip install gdown
gdown "https://drive.google.com/uc?id=1j27bQZol6gOQ7AATawShcF4jXJr3Venb"
tar -xvzf GZ-Eurlex-4.3K.tar.gz
cd -

# Train and predict ZestXML on GZ-Eurlex-4.3K dataset
./run_eurlex.sh train
./run_eurlex.sh predict

# Install dependencies of metrics.py
pip install -r requirements.txt
# Install pyxclib for evaluation
git clone https://github.com/kunaldahiya/pyxclib.git
cd pyxclib
python3 setup.py install --user
cd -

# Prints evaluation metrics
python metrics.py GZ-Eurlex-4.3K

Public Datasets

The following datasets were used in the paper for benchmarking GZXML algorithms (all datasets can be downloaded from here):

  • GZ-EURLex-4.3K, Document Tagging of EU law pages
  • GZ-Amazon-1M, Item to Item Recommendation of Amazon products
  • GZ-Wikipedia-1M, Document Tagging of Wikipedia pages

Some statistics of these datasets:

Dataset           Train Points   Test Points   Seen Labels   Unseen Labels   Point Features   Label Features
GZ-Eurlex-4.3K    45,000         6,000         4,108         163             100,000          24,316
GZ-Amazon-1M      914,179        1,465,767     476,381       483,725         1,000,000        1,476,381
GZ-Wikipedia-1M   2,271,533      2,705,425     495,107       776,612         1,000,000        1,438,196

Data Format

All sparse matrices are stored in the text sparse matrix format; see the "Text sparse matrix format" subsection for details. The required files are:

  • Xf.txt: all features used in the tf-idf representation of documents ((trn/tst/val)_X_Xf); the ith line denotes the ith feature of that representation. For the datasets used in the paper, these are the stemmed unigram and bigram features of the documents, but you can use any set of features that suits your application.
  • Yf.txt: like Xf.txt, but lists the features of all labels. In addition to unigrams and bigrams, each label gets a unique feature of the form __label__<i>__<label-i-text>, which is present only in the ith label's features. This gives the model label-specific parameters and helps it do well on many-shot labels. Features containing __parent__ are specific to the GZ-EURLex-4.3K dataset, whose raw labels carry additional information about the parent concepts of each label; you can safely ignore these features for any other or new dataset.
  • (trn/tst/val)_X_Xf.txt: sparse matrix (documents x document-features) holding the tf-idf features of the (trn/tst/val) input documents.
  • Y_Yf.txt: like (trn/tst/val)_X_Xf.txt, but for labels: the sparse matrix (labels x label-features) holding the tf-idf features of all labels.
  • trn_Y_Yf.txt: like Y_Yf.txt, but contains features for only the seen labels (can be interpreted as Y_Yf[seen-labels]).
  • (trn/tst/val)_X_Y.txt: sparse matrix (documents x labels) holding the (trn/tst/val) document-label relevance matrix.
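
To illustrate the Y_Yf[seen-labels] reading of trn_Y_Yf.txt, here is a minimal sketch in Python. It rests on assumptions that are not stated in this README: that a label counts as seen when at least one training document is tagged with it, that trn_X_Y is indexed over the full label universe, and that rows are kept in ascending label order. The helper names seen_label_ids and select_rows are hypothetical and not part of this codebase; matrices are held as lists of {index: value} dicts.

```python
from typing import Dict, List

SparseRow = Dict[int, float]  # column index -> non-zero value


def seen_label_ids(trn_X_Y: List[SparseRow]) -> List[int]:
    """Return sorted ids of labels tagged on at least one training document.

    Assumption: this is what "seen" means, and columns of trn_X_Y are
    indexed over the full label universe.
    """
    seen = set()
    for row in trn_X_Y:
        seen.update(label for label, rel in row.items() if rel != 0)
    return sorted(seen)


def select_rows(Y_Yf: List[SparseRow], label_ids: List[int]) -> List[SparseRow]:
    """Keep only the label-feature rows of the given labels (Y_Yf[label_ids])."""
    return [Y_Yf[i] for i in label_ids]


# Toy data: 3 training documents, 4 labels, 3 label features.
trn_X_Y = [{0: 1.0}, {0: 1.0, 2: 1.0}, {2: 1.0}]
Y_Yf = [{0: 0.7}, {1: 0.3}, {0: 0.2, 2: 0.9}, {1: 0.5}]

seen = seen_label_ids(trn_X_Y)
trn_Y_Yf = select_rows(Y_Yf, seen)
unseen = [i for i in range(len(Y_Yf)) if i not in set(seen)]
```

In this toy setup labels 0 and 2 are seen and labels 1 and 3 are unseen, so trn_Y_Yf holds two of the four rows of Y_Yf. The released datasets may order or index the seen labels differently, so check the shipped files before relying on this.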

Text sparse matrix format

This is a plain-text, row-major representation of a sparse matrix. The format is as follows:

  • The first line contains two space-separated integers denoting the dimensions of the matrix (i.e. num_row num_column).
  • num_row lines follow the first line, each representing one sparse row vector.
  • A sparse row vector is written as its non-zero entries separated by spaces, each entry in the form <index>:<value>. For example, the vector [0, 0, 0.5, 0.4, 0, 0.2] is written as 2:0.5 3:0.4 5:0.2 (NOTE: indexing starts from 0).
  • See GZ-Eurlex-4.3K/trn_X_Xf.txt for an example of a matrix in this format.
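
The format described above can be read and written with a few lines of Python. The following is a minimal sketch using only the standard library, not the loader used by this codebase; the function names parse_sparse_matrix, format_sparse_matrix and dense_to_sparse are hypothetical, and rows are held as {index: value} dicts.

```python
from typing import Dict, List, Sequence, Tuple

SparseRow = Dict[int, float]  # 0-based column index -> non-zero value


def parse_sparse_matrix(text: str) -> Tuple[int, int, List[SparseRow]]:
    """Parse text in the sparse matrix format into (num_row, num_column, rows)."""
    lines = text.split("\n")
    num_row, num_col = (int(tok) for tok in lines[0].split())
    rows: List[SparseRow] = []
    for line in lines[1:1 + num_row]:
        row: SparseRow = {}
        for entry in line.split():  # an empty line is an all-zero row
            index, value = entry.split(":")
            if not 0 <= int(index) < num_col:
                raise ValueError(f"column index {index} out of range")
            row[int(index)] = float(value)
        rows.append(row)
    if len(rows) != num_row:
        raise ValueError(f"expected {num_row} rows, found {len(rows)}")
    return num_row, num_col, rows


def format_sparse_matrix(num_col: int, rows: Sequence[SparseRow]) -> str:
    """Serialize rows: a 'num_row num_column' header, then one line per row."""
    lines = [f"{len(rows)} {num_col}"]
    for row in rows:
        lines.append(" ".join(f"{i}:{v}" for i, v in sorted(row.items())))
    return "\n".join(lines) + "\n"


def dense_to_sparse(vector: Sequence[float]) -> SparseRow:
    """Keep only the non-zero entries of a dense vector."""
    return {i: float(v) for i, v in enumerate(vector) if v != 0}


# The example vector from the format description, as a 1 x 6 matrix.
text = format_sparse_matrix(6, [dense_to_sparse([0, 0, 0.5, 0.4, 0, 0.2])])
num_row, num_col, rows = parse_sparse_matrix(text)
```

For files as large as the 1M-label datasets, reading the whole file into a string and keeping rows as dicts is likely too slow and memory-hungry; a streaming reader that fills a compressed sparse row structure would be the more practical choice.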

Cite

@InProceedings{Gupta21,
  author    = "Gupta, N. and Bohra, S. and Prabhu, Y. and Purohit, S. and Varma, M.",
  title     = "Generalized Zero-Shot Extreme Multi-label Learning",
  booktitle = "Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining",
  month     = "August",
  year      = "2021"
}
