pke is an open source python-based keyphrase extraction toolkit. It provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new models. pke also allows for easy benchmarking of state-of-the-art keyphrase extraction models, and ships with supervised models trained on the SemEval-2010 dataset.
To install pke from GitHub using pip:
pip install git+https://github.com/boudinfl/pke.git
pke relies on spacy (>= 3.2.3) for text processing and requires models to be installed:
# download the english model
python -m spacy download en_core_web_sm
pke provides a standardized API for extracting keyphrases from a document. Start by typing the lines below. To use another model, simply replace pke.unsupervised.TopicRank with another model (see the list of implemented models).
import pke
# initialize keyphrase extraction model, here TopicRank
extractor = pke.unsupervised.TopicRank()
# load the content of the document, here document is expected to be a simple
# test string and preprocessing is carried out using spacy
extractor.load_document(input='text', language='en')
# keyphrase candidate selection, in the case of TopicRank: sequences of nouns
# and adjectives (i.e. `(Noun|Adj)*`)
extractor.candidate_selection()
# candidate weighting, in the case of TopicRank: using a random walk algorithm
extractor.candidate_weighting()
# N-best selection, keyphrases contains the 10 highest scored candidates as
# (keyphrase, score) tuples
keyphrases = extractor.get_n_best(n=10)
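The three stages called above (candidate selection, candidate weighting, N-best selection) can also be illustrated without pke or spacy installed. The following is a minimal sketch, not pke's implementation: the stopword list and the `select_candidates`, `weight_candidates` and `n_best` helpers are hypothetical names introduced here, candidates are simply runs of non-stopwords rather than noun and adjective sequences, and the weight is a plain frequency count rather than TopicRank's random walk.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; pke relies on spacy instead).
STOPWORDS = {
    "a", "an", "and", "are", "for", "from", "in", "is", "of", "on", "the",
    "to", "with",
}


def select_candidates(text):
    """Return runs of consecutive non-stopword tokens as candidate phrases."""
    candidates = []
    # Punctuation acts as a phrase boundary, like a stopword does.
    for segment in re.split(r"[.!?;,:]", text.lower()):
        current = []
        for token in re.findall(r"[a-z0-9-]+", segment):
            if token in STOPWORDS:
                if current:
                    candidates.append(" ".join(current))
                current = []
            else:
                current.append(token)
        if current:
            candidates.append(" ".join(current))
    return candidates


def weight_candidates(candidates):
    """Score each distinct candidate by its number of occurrences."""
    return Counter(candidates)


def n_best(weights, n=10):
    """Return the n highest scored candidates as (keyphrase, score) tuples."""
    # Sort by descending score, then alphabetically so ties are deterministic.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


text = (
    "Keyphrase extraction is the task of selecting phrases from a document. "
    "Keyphrase extraction is useful for indexing a document."
)
keyphrases = n_best(weight_candidates(select_candidates(text)), n=3)
```

Keeping the stages as separate functions mirrors the pipeline described above, where each component can be replaced independently, for instance by swapping the frequency count for a graph-based score.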
A detailed example is provided in the examples/ directory.
To get your hands dirty with pke, we invite you to try our tutorials out.
Name | Link |
---|---|
Getting started with pke and keyphrase extraction | |
Model parameterization | |
Benchmarking models | |
pke currently implements the following keyphrase extraction models:
- Unsupervised models
  - Statistical models
    - FirstPhrases
    - TfIdf
    - KPMiner (El-Beltagy and Rafea, 2010)
    - YAKE (Campos et al., 2020)
  - Graph-based models
    - TextRank (Mihalcea and Tarau, 2004)
    - SingleRank (Wan and Xiao, 2008)
    - TopicRank (Bougouin et al., 2013)
    - TopicalPageRank (Sterckx et al., 2015)
    - PositionRank (Florescu and Caragea, 2017)
    - MultipartiteRank (Boudin, 2018)
- Supervised models
  - Feature-based models
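
To give an idea of what the graph-based family does, here is a minimal TextRank-style sketch: words become nodes, co-occurrence within a small window creates edges, and a PageRank-style random walk scores the nodes. This is not pke's code. The `cooccurrence_graph` and `random_walk_scores` helpers are hypothetical, the window size, damping factor and iteration count are illustrative assumptions, and the sketch ranks single words rather than full keyphrases.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Link tokens whose positions are less than `window` apart (undirected)."""
    graph = defaultdict(set)
    for i, token in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != token:
                graph[token].add(other)
                graph[other].add(token)
    return graph


def random_walk_scores(graph, damping=0.85, iterations=50):
    """Score nodes with a PageRank-style power iteration."""
    scores = {node: 1.0 / len(graph) for node in graph}
    for _ in range(iterations):
        updated = {}
        for node in graph:
            # Each neighbour shares its current score equally among its edges.
            incoming = sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            updated[node] = (1 - damping) / len(graph) + damping * incoming
        scores = updated
    return scores


tokens = ["graph", "ranking", "keyphrase", "graph", "extraction", "graph", "model"]
scores = random_walk_scores(cooccurrence_graph(tokens))
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy input, "graph" is connected to every other word, so it receives the highest score. The models listed above differ mainly in what the nodes are (words, candidates, topics) and in how edges and biases are defined.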
For comparison purposes, overall results of implemented models on commonly used benchmark datasets are available in results. Code for reproducing these experiments is in the benchmarking notebook.
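
This section does not state which measures the reported results use. A common choice in keyphrase extraction benchmarks is exact-match precision, recall and F1 over the top N predictions, and the sketch below assumes that setting. The `precision_recall_f1` helper is hypothetical and not part of pke, and it compares lowercased strings only (benchmarks often also stem the phrases, which is omitted here).

```python
def precision_recall_f1(predicted, gold, n=10):
    """Exact-match precision, recall and F1 over the top-n predictions."""
    # Lowercase and deduplicate the predictions while preserving their order.
    top = list(dict.fromkeys(p.lower() for p in predicted))[:n]
    references = {g.lower() for g in gold}
    matches = sum(1 for p in top if p in references)
    precision = matches / len(top) if top else 0.0
    recall = matches / len(references) if references else 0.0
    if matches == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example data, for illustration only.
predicted = ["keyphrase extraction", "random walk", "toolkit", "python"]
gold = ["Keyphrase extraction", "toolkit", "benchmarking"]
precision, recall, f1 = precision_recall_f1(predicted, gold, n=4)
```

With the output of `get_n_best`, the `predicted` list would be the first element of each (keyphrase, score) tuple.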
If you use pke, please cite the following paper:
@InProceedings{boudin:2016:COLINGDEMO,
author = {Boudin, Florian},
title = {pke: an open source python-based keyphrase extraction toolkit},
booktitle = {Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations},
month = {December},
year = {2016},
address = {Osaka, Japan},
pages = {69--73},
url = {http://aclweb.org/anthology/C16-2015}
}