Keyword detection for a particular domain, based on the PRDualRank framework.
Work done during Fall 2019, Spring 2020, Summer 2020, Fall 2020 at FORWARD Lab @ UIUC, by Dipro Ray and Shuhan Wang.
Dipro Ray (dipror2@illinois.edu)
- All source code lives in `.py` files in the `./final_stuff/` directory. The latest versions of the scripts are `final_framework_v9_precision.py` and `final_framework_v9_recall.py`. Earlier versions (from v1 onward) are kept in the same directory. Most of the later versions differ primarily in the kind of ranking function used.
- Metrics and the associated scripts and notebooks (`.ipynb`) are in the `./development_ipynbs/` directory. You can find the code to compute keyword precision and recall here.
- Input data and output data are stored in `./final_stuff/data/` and `./final_stuff/outputs/`, respectively.
- Archives can be found in the `./old/` subdirectory of each folder, as well as in the `./metrics/` directory.
If you'd like to use spaCy's GPU capability, make sure you have access to a GPU. (For Nvidia users, use `nvidia-smi` to check GPU info.) Then, install CUDA Toolkit 10.2. (Note the version! As far as I know, spaCy/cupy isn't compatible with the latest 11.0 version yet.) Use `nvcc --version` to confirm that the CUDA toolkit is installed and is the correct version. Execute `pip install -U spacy[cuda102]`. Then, try:
~$ python3
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacy
>>> spacy.require_gpu()
True
If the last line is `True` and no errors are raised, you're good to go with respect to spaCy GPU compatibility. Before proceeding to the next steps, execute `python -m spacy download en_core_web_sm`.
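If a script should keep working on machines without a GPU, the check above can be made non-fatal. The following is a minimal sketch, not code from this repository: `select_device` is a hypothetical helper name, and it relies on `spacy.prefer_gpu()`, which returns `False` when no GPU can be activated instead of raising the way `spacy.require_gpu()` does.

```python
def select_device() -> str:
    """Return "gpu" if spaCy activated a GPU, "cpu" if it did not,
    or "spacy-missing" if spaCy is not installed at all."""
    try:
        import spacy
    except ImportError:
        return "spacy-missing"
    # prefer_gpu() reports success as a bool and falls back to CPU silently.
    return "gpu" if spacy.prefer_gpu() else "cpu"


print(select_device())
```

Call it once at startup, before loading any spaCy model, so that the model is allocated on the selected device.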
- `cd` into the repository directory and run `pip install -r requirements.txt`.
- `cd` into the `final_stuff` subdirectory. This will be the main working directory henceforth.
- `final_framework_v9_precision.py` or `final_framework_v9_recall.py` is the script to run to execute the latest framework. (FYI: It uses the helper scripts `prdualrank.py`, `wikiscore.py`, and `extractor_helpers.py`.)
- Place your input data in the `./data/` directory. Make sure to sanitize the input! Lowercase all text, and remove all non-alphanumeric characters except periods.
- An example command to run the script is:
python3 final_framework_v6.py arxiv_titles_and_abstracts_short.txt 350 750 test_run.txt 9
- `arxiv_titles_and_abstracts_short.txt`: This parameter indicates that the input data file is `./data/arxiv_titles_and_abstracts_short.txt`.
- 350: The number of patterns to be extracted in each iteration.
- 750: The number of keywords to be extracted in each iteration.
- `test_run.txt`: This means that the output will be stored in `./outputs/test_run.txt`.
- 9: This refers to the scoring method (look in `final_framework_v6.py` for details). For now, this parameter is always set to 9.
- An extra parameter, `iter_num`, exists, but it needs to be set within the source code of `final_framework_v6.py`. It affects the number of iterations to be run.
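The input sanitization described above (lowercase everything, keep only alphanumerics and periods) could be done along the following lines. This is a sketch under stated assumptions rather than the preprocessing actually used for this project: `sanitize_line` and `sanitize_file` are hypothetical names, "alphanumeric" is taken to mean ASCII letters and digits, whitespace is preserved so words stay separated, and each removed character is replaced by a space so that hyphenated terms do not fuse into one token.

```python
import re

# Assumption: keep a-z, 0-9, periods, and whitespace; everything else is dropped.
_DISALLOWED = re.compile(r"[^a-z0-9.\s]")
_SPACES = re.compile(r"[ \t]+")


def sanitize_line(line: str) -> str:
    """Lowercase a line and strip every character that is not allowed."""
    lowered = line.lower()
    # Replace each disallowed character with a space (an assumption, see above).
    cleaned = _DISALLOWED.sub(" ", lowered)
    # Collapse the runs of spaces left behind and trim the ends.
    return _SPACES.sub(" ", cleaned).strip()


def sanitize_file(src_path: str, dst_path: str) -> None:
    """Write a sanitized copy of src_path to dst_path, one line at a time."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(sanitize_line(line) + "\n")
```

For example, `sanitize_line("Graph-Based Ranking, v2.0!")` yields `graph based ranking v2.0`. Writing the result into `./data/` makes the file available as the first argument of the command above.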
- As mentioned earlier, your results will be stored in the designated output file.
- Push your results with a meaningful file name that indicates the date, time, framework version, scoring method, and corpus, and make sure the file is in the outputs directory.
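One way to keep output names consistent is to generate them. The sketch below is only a suggestion: the function name `output_filename`, the field order, and the separator are all assumptions, not a convention defined by this repository.

```python
from datetime import datetime
from typing import Optional


def output_filename(version: str, scoring_method: int, corpus: str,
                    when: Optional[datetime] = None) -> str:
    """Build a name of the form <date>_<time>_<version>_score<method>_<corpus>.txt."""
    when = when or datetime.now()
    stamp = when.strftime("%Y-%m-%d_%H%M")
    return f"{stamp}_{version}_score{scoring_method}_{corpus}.txt"
```

For a run of the v9 precision script with scoring method 9 on an arXiv corpus at 14:30 on 5 November 2020, this gives `2020-11-05_1430_v9_precision_score9_arxiv.txt`, which can be passed as the output-file argument so the result lands in `./outputs/` under that name.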