The source code used for our KDD'2020 paper.
- GCC compiler (used to compile the C source file): see the guide for installing GCC.
Because of size constraints, the datasets are provided through external links rather than in this repository. Please copy the downloaded files to ${dataset}/.
If you are using our datasets, you can skip this step. Otherwise, first use this preprocessing tool to extract a sentences.json for your own corpus, and then run
bash preprocess.sh
to generate the index files.
cd c
bash run_emb_part_tax.sh
This step compiles the source file and trains the embeddings for concept learning. The --topic_file option in the script specifies the seed taxonomy.
As an example, you can set ${topic_file} to topics_field.txt for the dblp dataset and topics_des.txt for the yelp dataset. These topic files are already provided with the datasets. To specify your own seed taxonomy, create a new file named topics_{xxx}.txt.
Each line starts with a parent node (the root node is ROOT), followed by a tab. The children of this parent are then appended, separated by spaces. The generated embedding file is stored under ${dataset}.
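The seed taxonomy format described above can be illustrated with a short Python sketch. The file name topics_example.txt, the topic names, and both helper functions are hypothetical and are not part of the repository. The sketch assumes only the stated format (a parent, a tab, then space-separated children) and, as a further assumption, joins multi-word topics with underscores.

```python
from pathlib import Path
import tempfile


def write_seed_taxonomy(path, taxonomy):
    """Write one line per parent: parent, a tab, then space-separated children."""
    lines = [f"{parent}\t{' '.join(children)}" for parent, children in taxonomy.items()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_seed_taxonomy(path):
    """Parse a seed taxonomy file back into a {parent: [children]} dict."""
    taxonomy = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        parent, _, children = line.partition("\t")
        taxonomy[parent.strip()] = children.split()
    return taxonomy


# Hypothetical two-level seed taxonomy. Multi-word topics use underscores
# (an assumption) so that spaces only ever separate children.
seed = {
    "ROOT": ["data_mining", "machine_learning"],
    "data_mining": ["clustering", "association_rules"],
}

out_path = Path(tempfile.mkdtemp()) / "topics_example.txt"
write_seed_taxonomy(out_path, seed)
parsed = read_seed_taxonomy(out_path)
```

Round-tripping the file through the reader is a cheap way to catch a stray space where a tab was intended before starting a training run.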
A Jupyter notebook version is available for this step. You can change the dataset and topic_file names in main.ipynb.
Alternatively, you can use the following Python program to generate the results.
cd ..
python main.py --dataset ${dataset} --topic_file ${topic_file}
This step completes the taxonomy structure and outputs keywords for each node in the taxonomy. As an example, for the DBLP dataset, you can run
python main.py --dataset dblp --topic_file topics_field.txt
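To run this step for more than one dataset, the command line can be assembled programmatically. The sketch below is illustrative only: `build_main_command` is a hypothetical helper, while the `--dataset` and `--topic_file` flags and the dataset-to-file pairs come from the examples in this README. It prints the commands rather than executing them.

```python
import shlex

# Dataset-to-seed-file pairs taken from the examples in this README.
SEED_FILES = {"dblp": "topics_field.txt", "yelp": "topics_des.txt"}


def build_main_command(dataset, topic_file, python="python"):
    """Return the argv list for one main.py run (hypothetical helper)."""
    return [python, "main.py", "--dataset", dataset, "--topic_file", topic_file]


commands = [build_main_command(d, t) for d, t in SEED_FILES.items()]
for cmd in commands:
    # Print only. To execute, pass cmd to subprocess.run(cmd, check=True)
    # from the repository root.
    print(shlex.join(cmd))
```

Building the command as a list avoids shell-quoting problems if a dataset or file name ever contains spaces.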
python generate_bash.py
This command generates a script from c/template.sh that can recursively run embedding training for all topics and subtopics.
cd c
bash run_emb_full_tax.sh
This command generates the final topical taxonomy under a result directory.
Results for each topic are written to result\${dataset}\${topic}\subtopics_for_${topic}.txt. For example, each line in result\DBLP\data_mining\subtopics_for_data_mining.txt is one subtopic of data mining (a cluster of words).
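A result file can be loaded for inspection with a few lines of Python. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the words within a line are assumed to be whitespace-separated (the README does not specify the delimiter), and the path is built with pathlib so the separator follows the operating system. The demonstration writes placeholder words to a temporary directory and does not show real output.

```python
from pathlib import Path
import tempfile


def subtopics_path(result_dir, dataset, topic):
    """Build <result_dir>/<dataset>/<topic>/subtopics_for_<topic>.txt."""
    return Path(result_dir) / dataset / topic / f"subtopics_for_{topic}.txt"


def load_subtopics(path):
    """Return one list of words per non-empty line (one line = one subtopic).

    Assumes words are whitespace-separated; adjust if the real files differ.
    """
    text = Path(path).read_text(encoding="utf-8")
    return [line.split() for line in text.splitlines() if line.strip()]


# Demonstration on a throwaway file with placeholder words, not real output.
demo_root = Path(tempfile.mkdtemp())
demo_file = subtopics_path(demo_root, "DBLP", "data_mining")
demo_file.parent.mkdir(parents=True)
demo_file.write_text("word_a word_b word_c\nword_d word_e\n", encoding="utf-8")

clusters = load_subtopics(demo_file)
print(len(clusters), "subtopics; first cluster:", clusters[0])
```

To read actual results, call `load_subtopics(subtopics_path("result", "DBLP", "data_mining"))` from the directory that contains the result folder.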