- What is POKE?
- Usage Instructions
- Dataset Information
- Troubleshooting
- Routes for further exploration
- Meet the team behind POKE
- References and other Important Information
## What is POKE?

- POKE (Performance Oriented Keyphrase Extractor, name subject to change) is a Hindi keyphrase extraction program.
- POKE is based on the YAKE! algorithm for keyword extraction.
- POKE makes improvements and adjustments to YAKE! for superior performance on Hindi-language keyphrase extraction.
## Usage Instructions

- The project contains a .zip file named `group4_NLP_project.zip`. (If this does not apply, click this and follow the given instructions.)
- Extract the contents of the .zip file to a folder of your choice. In this case, the folder is named `group4_NLP_project`, the same as the .zip file.
- Once extracted, your folder structure should include `pre_processor.py`, `main.py`, and the dataset folders `annotated_hindi_data` and `hindi_dump`.
- Once extracted, do the following:
  - Make sure your current working directory is set to the folder you just extracted the contents of the .zip file to.
  - Run the `pre_processor.py` file first (e.g., `python pre_processor.py`). This file pre-processes the data and should create a folder named `processed_data` in the current working directory.
  - Now run the `main.py` file. This should create a folder named `keywords_data` in the current working directory. Note that the newly created folder contains keyphrase data in JSON format, which allows for ease of further processing. A quick sanity check is sketched below.
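After both scripts have run, a short check like the following confirms the pipeline produced its outputs; the folder names come from the steps above, while the inspected filename is illustrative:

```python
# Sanity check: both output folders should exist after the two runs.
import json
import os

assert os.path.isdir("processed_data"), "run pre_processor.py first"
assert os.path.isdir("keywords_data"), "run main.py after pre_processor.py"

# Peek at one JSON keyphrase file; the specific file chosen is arbitrary.
sample = os.path.join("keywords_data", os.listdir("keywords_data")[0])
with open(sample, encoding="utf-8") as f:
    print(json.load(f))
```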
- By default, the input folder path and output folder path of the `main.py` file are set to this:
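  A plausible reconstruction, assuming `main.py` reads the folder created by `pre_processor.py` and writes to the JSON output folder described above (the exact values in the source file may differ):

  ```python
  # Assumed defaults for main.py: read pre-processed files, write JSON keyphrases.
  input_folder_path = "./processed_data"   # folder created by pre_processor.py
  output_folder_path = "./keywords_data"   # folder created by main.py
  ```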
- Changes can be made to the input folder path and output folder path as one sees fit. Both `main.py` and `pre_processor.py` have input and output folder paths named `input_folder_path` and `output_folder_path` respectively. These variables can easily be found with `Ctrl+F` on Windows, or `Cmd+F` on Mac, and edited in any of the above files.
- By changing the `input_folder_path` variable of the `pre_processor.py` file to `./hindi_dump`, the noisy `hindi_dump` dataset can be used instead of the `annotated_hindi_data` dataset. Caution: there are thousands of files in the folder, so unexpected behavior or long wait times can be expected.
- Some dependencies for smooth running of the code might not be installed on your system by default. In this case, follow the instructions that your CLI gives you: install each relevant dependency with `pip install <package-name>`.
- For reference, these are the import statements:
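  Based on the libraries mentioned in this README (NLTK for POS tagging, JSON output), the imports likely resemble this sketch; the exact list in the source files may differ:

  ```python
  # Assumed imports; verify against pre_processor.py and main.py.
  import json      # keyphrase output is written as JSON
  import os        # walking the input/output folders
  import nltk      # POS tagging (needs the `punkt` and `indian` resources)
  from nltk.corpus import indian          # Indian-language POS-tagged corpus
  from nltk.tokenize import word_tokenize
  ```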
- In case of any issues, refer to the Troubleshooting Section
## Dataset Information

- This dataset was obtained from this repository.
- The dataset folder contains two subfolders, namely `annotated_hindi_data` and `hindi_dump`.
- In `annotated_hindi_data`, there are two subfolders of importance: `annotations` and `data`. `data` contains 71 .txt files containing various Hindi texts, while `annotations` contains the same number of .txt files with line-separated annotated keyphrases that correspond to the .txt files in the `data` folder, with the same filename. For example, `4.txt` in `annotations` contains the annotated keywords of `4.txt` in `data`. A short loading sketch follows this list.
- The folder `hindi_dump` contains about 3000 .txt files with varying levels of noise and length. This dataset contains noisy data, unlike the `annotated_hindi_data` dataset.
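A minimal sketch of reading one document and its gold keyphrases, assuming the layout described above (paths relative to the project folder):

```python
# Load one text file and its line-separated annotated keyphrases.
import os

base = "./annotated_hindi_data"
doc_id = "4.txt"

with open(os.path.join(base, "data", doc_id), encoding="utf-8") as f:
    text = f.read()

with open(os.path.join(base, "annotations", doc_id), encoding="utf-8") as f:
    gold_keyphrases = [line.strip() for line in f if line.strip()]

print(len(gold_keyphrases), "annotated keyphrases for", doc_id)
```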
## References and other Important Information

- Library use in this project is kept minimal. However, libraries like NLTK have been used for POS (part-of-speech) tagging.
- It's optimal to use a virtual environment to run the code. Check it out.
- This implementation of POKE is based on the YAKE! algorithm paper. For more details on the implementation, click this link.
- This implementation could be further optimized by applying the concepts given in this paper, PatternRank.
- Further, more effective normalization and preprocessing techniques can be applied with this paper as inspiration; it helped shape our implementation of normalization.
- The bulk of our stopword list was obtained from this paper, which referenced this GitHub library containing the stopword lists. It must be noted that after multiple rounds of testing, we added many of our own stopwords to this list.
## Troubleshooting

- One might have trouble installing the `punkt` and `indian` packages in NLTK.
- To fix this, follow the instructions below:
  i. In your CLI, open the Python shell by typing `python` into it.
  ii. Then, import the NLTK library by entering `import nltk`.
  iii. You can then install the needed package by entering `nltk.download('<package-name>')`.
  iv. To quit the Python shell, use `quit()`.
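Equivalently, the two downloads named above can be run as a short script:

```python
# One-off setup: fetch the NLTK resources used by this project.
import nltk

nltk.download("punkt")   # tokenizer models
nltk.download("indian")  # Indian-language POS-tagged corpus (includes Hindi)
```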
## Routes for further exploration

- PLMs (pretrained language models) have proved effective for keyword extraction, as seen in the case of KeyBERT. This calls for some implementation and exploration in that area; a starting sketch follows this list.
- LLMs (large language models) have been causing quite the buzz lately, with the introduction of GPT (Generative Pre-trained Transformer) by OpenAI. With the Langchain framework and possibly Hindi LLMs, there lies great potential for effective Hindi keyword extraction.
- Research on the effectiveness of PLMs for keyword extraction can be done with this paper in mind.
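As a starting point for the PLM route, a minimal KeyBERT sketch; the model choice and parameters here are assumptions, not part of POKE (requires `pip install keybert`):

```python
# Illustrative KeyBERT usage with a multilingual sentence-transformer
# that can embed Hindi text.
from keybert import KeyBERT

kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")

doc = "यह हिंदी कीवर्ड निष्कर्षण का एक छोटा उदाहरण दस्तावेज़ है।"
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 2),  # unigrams and bigrams
    stop_words=None,               # KeyBERT's default English stopwords don't apply to Hindi
    top_n=5,
)
print(keywords)
```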
## Meet the team behind POKE

🔸Annabathini Swejan Chowdary
🔸Parth Bhandari
🔸Patibandla Venkata Chiranjeevi Tarun