Automated Domain-based Keyword Detection

Keyword detection for a particular domain, based on the PRDualRank framework.

Contributors

Work done during Fall 2019, Spring 2020, Summer 2020, Fall 2020 at FORWARD Lab @ UIUC, by Dipro Ray and Shuhan Wang.

Previous Maintainers

Dipro Ray (dipror2@illinois.edu)

Where do I find X?

All source code lives in .py files in the ./final_stuff/ directory. The latest versions of the scripts are final_framework_v9_precision.py and final_framework_v9_recall.py. All previous versions (v1 through v9) are also in the same directory. Most of the later versions differ primarily in the kind of ranking function used.
Metrics, along with the associated scripts and notebooks (.ipynb), are in the ./development_ipynbs/ directory. The code to compute keyword precision and recall is there.
Input data and output data are stored in ./final_stuff/data/ and ./final_stuff/outputs/ respectively.
Archives can be found in the ./old/ subdirectory of each folder, as well as the ./metrics/ directory.

How do I run the framework?

If you'd like to use spaCy's GPU capability, make sure you have access to a GPU. (For Nvidia users, use nvidia-smi to check GPU info.) Then install CUDA Toolkit 10.2. Note the version: as far as I know, spaCy/cupy isn't compatible with the latest 11.0 version yet. Use nvcc --version to confirm that the CUDA Toolkit is installed and that it is the correct version. Execute pip install -U spacy[cuda102]. Then, try:

~$ python3
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import spacy
>>> spacy.require_gpu()
True

If the last line is True and no errors are raised, you're good to go with respect to spaCy GPU compatibility. Before proceeding to the next steps, execute python -m spacy download en_core_web_sm.
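The interactive check above can also be scripted. The following is a minimal sketch, not part of the repository: the function name spacy_gpu_status is hypothetical, and it uses spacy.prefer_gpu() rather than spacy.require_gpu() so that it reports the result instead of raising when no GPU is usable.

```python
def spacy_gpu_status() -> str:
    """Report whether spaCy can use a GPU (hypothetical helper, not in the repo)."""
    try:
        import spacy
    except ImportError:
        # spaCy is not installed in this environment
        return "spacy-not-installed"
    # prefer_gpu() returns True if a GPU was activated, False otherwise
    return "gpu" if spacy.prefer_gpu() else "cpu"


if __name__ == "__main__":
    print(spacy_gpu_status())
```

If this prints cpu on a machine that has a GPU, the CUDA Toolkit version and the spacy[cuda102] install are the first things to recheck.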

cd into the repository directory and run pip install -r requirements.txt
cd into the final_stuff subdirectory. This will be the main working directory henceforth.
Run final_framework_v9_precision.py or final_framework_v9_recall.py to execute the latest framework. (FYI: these use the helper scripts prdualrank.py, wikiscore.py, and extractor_helpers.py.)
Place your input data in the ./data/ directory. Make sure to sanitize the input: lowercase all text, and remove all non-alphanumeric characters except periods.
An example command to run the script is: python3 final_framework_v6.py arxiv_titles_and_abstracts_short.txt 350 750 test_run.txt 9
1. arxiv_titles_and_abstracts_short.txt: This parameter indicates that the input data file is ./data/arxiv_titles_and_abstracts_short.txt
2. 350: The number of patterns to be extracted in each iteration
3. 750: The number of keywords to be extracted in each iteration
4. test_run.txt: This means that the output will be stored in ./outputs/test_run.txt
5. 9: This refers to the scoring method (look in final_framework_v6.py for details). For now, this parameter is always set to 9.
6. An extra parameter iter_num exists, but it needs to be set within the source code of final_framework_v6.py. It affects the number of iterations to be run.
As mentioned earlier, your results will be stored in the designated output file.
Push your results with a meaningful file name that indicates the date, time, framework version, scoring method, and corpus, and make sure the file is in the outputs directory.
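The input sanitization step described above (lowercase everything, drop non-alphanumeric characters except periods) could be sketched as follows. This is an illustration under stated assumptions, not the repository's own preprocessing: the function name sanitize_text is hypothetical, whitespace is kept so that words stay separated, removed characters are replaced by a space rather than deleted, and only ASCII letters and digits count as alphanumeric.

```python
import re


def sanitize_text(raw: str) -> str:
    """Lowercase text and keep only a-z, 0-9, periods, and whitespace.

    Hypothetical helper; the whitespace handling is an assumption.
    """
    lowered = raw.lower()
    # Replace anything that is not a letter, digit, period, or whitespace with a space
    cleaned = re.sub(r"[^a-z0-9.\s]", " ", lowered)
    # Collapse runs of spaces and tabs left behind by the substitution
    cleaned = re.sub(r"[ \t]+", " ", cleaned)
    return cleaned.strip()


if __name__ == "__main__":
    print(sanitize_text("Deep Learning, for NLP: a (short) survey!"))
```

Replacing punctuation with a space instead of deleting it keeps hyphenated terms such as "graph-based" from collapsing into a single token; if the framework expects a different treatment, adjust the first substitution accordingly.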

