- What is POKE?
- Usage Instructions
- Dataset Information
- Troubleshooting
- Routes for further exploration
- Meet the team behind POKE
- References and other Important Information
## What is POKE?

- POKE (Performance Oriented Keyphrase Extractor, name subject to change) is a Hindi keyphrase extraction program.
- POKE is based on the YAKE! algorithm for keyword extraction.
- POKE makes improvements and adjustments to YAKE! for superior performance on Hindi-language keyphrase extraction.
## Usage Instructions

- The project contains a .zip file named `group4_NLP_project.zip`. (If this does not apply, click this and follow the given instructions.)
- Extract the contents of the .zip file to a folder of your choice. In this case, the folder is named `group4_NLP_project`, the same as the .zip file.
- Once extracted, your folder structure should include `pre_processor.py`, `main.py`, and the dataset folders `annotated_hindi_data` and `hindi_dump`.
- Once extracted, do the following:
  - Make sure your current working directory is set to the folder you just extracted the contents of the .zip file to.
  - Run the `pre_processor.py` file first (e.g., `python pre_processor.py`). This file pre-processes the data and should create a folder named `processed_data` in the current working directory.
  - Now run the `main.py` file. This should create a folder named `keywords_data` in the current working directory. Note that the newly created folder contains keyphrase data in JSON format, which allows for ease of further processing. A quick sanity check is sketched below.
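After both scripts have run, a short check like the following confirms the pipeline produced its outputs; the folder names come from the steps above, while the inspected filename is illustrative:

```python
# Sanity check: both output folders should exist after the two runs.
import json
import os

assert os.path.isdir("processed_data"), "run pre_processor.py first"
assert os.path.isdir("keywords_data"), "run main.py after pre_processor.py"

# Peek at one JSON keyphrase file; the specific file chosen is arbitrary.
sample = os.path.join("keywords_data", os.listdir("keywords_data")[0])
with open(sample, encoding="utf-8") as f:
    print(json.load(f))
```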
- By default, the input folder path and output folder path of the `main.py` file are set to this:
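  A plausible reconstruction, assuming `main.py` reads the folder created by `pre_processor.py` and writes to the JSON output folder described above (the exact values in the source file may differ):

  ```python
  # Assumed defaults for main.py: read pre-processed files, write JSON keyphrases.
  input_folder_path = "./processed_data"   # folder created by pre_processor.py
  output_folder_path = "./keywords_data"   # folder created by main.py
  ```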
- Changes can be made to the input folder path and output folder path as one sees fit. Both `main.py` and `pre_processor.py` have input and output folder paths named `input_folder_path` and `output_folder_path` respectively. These variables can easily be found with `Ctrl+F` on Windows, or `Cmd+F` on Mac, and edited in any of the above files.
- By changing the `input_folder_path` variable of the `pre_processor.py` file to `./hindi_dump`, the noisy `hindi_dump` dataset can be used instead of the `annotated_hindi_data` dataset. Caution: there are thousands of files in the folder, so unexpected behavior or long wait times can be expected.
- Some dependencies for smooth running of the code might not be installed on your system by default. In this case, follow the instructions that your CLI gives you: install each relevant dependency with `pip install <package-name>`.
- For reference, these are the import statements:
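  Based on the libraries mentioned in this README (NLTK for POS tagging, JSON output), the imports likely resemble this sketch; the exact list in the source files may differ:

  ```python
  # Assumed imports; verify against pre_processor.py and main.py.
  import json      # keyphrase output is written as JSON
  import os        # walking the input/output folders
  import nltk      # POS tagging (needs the `punkt` and `indian` resources)
  from nltk.corpus import indian          # Indian-language POS-tagged corpus
  from nltk.tokenize import word_tokenize
  ```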
- In case of any issues, refer to the Troubleshooting Section
## Dataset Information

- This dataset was obtained from this repository.
- The dataset folder contains two subfolders, namely `annotated_hindi_data` and `hindi_dump`.
- In `annotated_hindi_data`, there are two subfolders of importance: `annotations` and `data`. `data` contains 71 .txt files containing various Hindi texts, while `annotations` contains the same number of .txt files with line-separated annotated keyphrases that correspond to the .txt files in the `data` folder, with the same filename. For example, `4.txt` in `annotations` contains the annotated keywords of `4.txt` in `data`. A short loading sketch follows this list.
- The folder `hindi_dump` contains about 3000 .txt files with varying levels of noise and length. This dataset contains noisy data, unlike the `annotated_hindi_data` dataset.
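A minimal sketch of reading one document and its gold keyphrases, assuming the layout described above (paths relative to the project folder):

```python
# Load one text file and its line-separated annotated keyphrases.
import os

base = "./annotated_hindi_data"
doc_id = "4.txt"

with open(os.path.join(base, "data", doc_id), encoding="utf-8") as f:
    text = f.read()

with open(os.path.join(base, "annotations", doc_id), encoding="utf-8") as f:
    gold_keyphrases = [line.strip() for line in f if line.strip()]

print(len(gold_keyphrases), "annotated keyphrases for", doc_id)
```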
## References and other Important Information

- Library use in this project is kept minimal. However, libraries like NLTK have been used for POS (part-of-speech) tagging.
- It's optimal to use a virtual environment to run the code. Check it out.
- This implementation of POKE is based on the YAKE! algorithm paper. For more details on the implementation, click this link.
- This implementation could be further optimized by applying the concepts given in this paper, PatternRank.
- Further, more effective normalization and preprocessing techniques can be applied with this paper as inspiration; it helped shape our implementation of normalization.
- The bulk of our stopword list was obtained from this paper, which referenced this GitHub library containing the stopword lists. It must be noted that after multiple rounds of testing, we added many of our own stopwords to this list.
## Troubleshooting

- One might have trouble installing the `punkt` and `indian` packages in NLTK.
- To fix this, follow the instructions below:
  i. In your CLI, open the Python shell by typing `python` into it.
  ii. Then, import the NLTK library by entering `import nltk`.
  iii. You can then install the needed package by entering `nltk.download('<package-name>')`.
  iv. To quit the Python shell, use `quit()`.
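Equivalently, the two downloads named above can be run as a short script:

```python
# One-off setup: fetch the NLTK resources used by this project.
import nltk

nltk.download("punkt")   # tokenizer models
nltk.download("indian")  # Indian-language POS-tagged corpus (includes Hindi)
```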
## Routes for further exploration

- PLMs (pretrained language models) have proved effective for keyword extraction, as seen in the case of KeyBERT. This calls for some implementation and exploration in that area; a starting sketch follows this list.
- LLMs (large language models) have been causing quite the buzz lately, with the introduction of GPT (Generative Pre-trained Transformer) by OpenAI. With the Langchain framework and possibly Hindi LLMs, there lies great potential for effective Hindi keyword extraction.
- Research on the effectiveness of PLMs for keyword extraction can be done with this paper in mind.
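As a starting point for the PLM route, a minimal KeyBERT sketch; the model choice and parameters here are assumptions, not part of POKE (requires `pip install keybert`):

```python
# Illustrative KeyBERT usage with a multilingual sentence-transformer
# that can embed Hindi text.
from keybert import KeyBERT

kw_model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")

doc = "यह हिंदी कीवर्ड निष्कर्षण का एक छोटा उदाहरण दस्तावेज़ है।"
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 2),  # unigrams and bigrams
    stop_words=None,               # KeyBERT's default English stopwords don't apply to Hindi
    top_n=5,
)
print(keywords)
```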
## Meet the team behind POKE

🔸Annabathini Swejan Chowdary
🔸Parth Bhandari
🔸Patibandla Venkata Chiranjeevi Tarun