keywords.py gives IndexError: list index out of range when words parameter is provided. #2598

Closed
ariel-frischer opened this issue Sep 8, 2019 · 6 comments · Fixed by #2738
Labels: bug (Issue described a bug), Hacktoberfest (Issues marked for hacktoberfest), impact LOW (Low impact on affected users), reach MEDIUM (Affects a significant number of users)

Comments

@ariel-frischer

ariel-frischer commented Sep 8, 2019

I'm really confused about why I'm getting this error. Perhaps I'm making a silly mistake; I'm not familiar with gensim or NLP in general.
I'm running Windows 10 Home 64-bit, conda version : 4.7.11, conda-build version : 2.18.8, python version : 3.7.3.final.0
My code attempts to get keywords per sentence in a loop. To simplify matters, I've isolated the following code that triggers the error when calling keywords from gensim's keywords.py.

from gensim.summarization import keywords

s = "Don’t dive right into solving without a plan (and somehow hope you can muddle your way through)."
keywords(s, words=4, scores=False, split=True, lemmatize=True)

File "C:\Users\username\Anaconda3\envs\gensim\lib\site-packages\gensim\summarization\keywords.py", line 521, in keywords
    extracted_lemmas = _extract_tokens(graph.nodes(), pagerank_scores, ratio, words)
  File "C:\Users\username\Anaconda3\envs\gensim\lib\site-packages\gensim\summarization\keywords.py", line 304, in _extract_tokens
    return [(scores[lemmas[i]], lemmas[i],) for i in range(int(length))]
  File "C:\Users\username\Anaconda3\envs\gensim\lib\site-packages\gensim\summarization\keywords.py", line 304, in <listcomp>
    return [(scores[lemmas[i]], lemmas[i],) for i in range(int(length))]
IndexError: list index out of range

I've tried setting scores=True, lemmatize=False, and split=False, but the same error persists. I've also tried removing the parentheses and the apostrophe, but the error persisted. What did work was removing the words parameter altogether, but it still shouldn't raise an error when words is provided. Thanks in advance for the help!

@mpenkov
Collaborator

mpenkov commented Sep 9, 2019

This could well be a bug.

Are you able to step through the problem with a debugger?

@ariel-frischer
Author

Yes, stepping through with the debugger I can see why the following lines (in keywords._extract_tokens()) raise an error: words == 4 while len(lemmas) == 2, so there will be an index out of range error.

length = len(lemmas) * ratio if words is None else words
return [(scores[lemmas[i]], lemmas[i],) for i in range(int(length))]

I'm unsure what the best way to handle this would be; perhaps just cap the upper bound of the final range when words > len(lemmas).
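The failure described above can be reproduced without gensim. The following is only a sketch: extract_tokens_unchecked is a hypothetical stand-in built from the two lines quoted in this comment, not gensim's actual function, and the lemmas and scores are made-up placeholders.

```python
def extract_tokens_unchecked(lemmas, scores, ratio, words):
    """Hypothetical stand-in wrapping the two quoted lines; not gensim's code."""
    # When `words` is given, it is used as the length without any bounds check.
    length = len(lemmas) * ratio if words is None else words
    return [(scores[lemmas[i]], lemmas[i]) for i in range(int(length))]


# Placeholder data: only two lemmas are available, as seen in the debugger.
lemmas = ["lemma_a", "lemma_b"]
scores = {"lemma_a": 0.5, "lemma_b": 0.5}

try:
    extract_tokens_unchecked(lemmas, scores, ratio=0.2, words=4)
    raised = False
except IndexError:
    # range(4) walks past the end of the two-element list.
    raised = True

print("IndexError raised:", raised)
```

With words=None the length is derived from len(lemmas), so the index can never run past the end of the list, which would explain why dropping the words parameter avoids the error.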

@piskvorky
Owner

piskvorky commented Sep 9, 2019

@arielFrischer Can you go through the source and determine what the invariants of these variables are? words, lemmas, length, etc.? What are they for? Why do you need words?

Unfortunately, none of us are familiar with this code.

@piskvorky
Owner

piskvorky commented Sep 9, 2019

I looked at the code; IMO there is an implicit (unchecked) invariant that words <= len(lemmas).

If words is provided and greater than len(lemmas), such as in your example where lemmas=[u'dive', u'right'], the code will fail.

@mpenkov can we somehow track down the original contributor and check with them what this is about?

Because I'm not sure if a simple words = min(int(words), len(lemmas)) is enough, or if it's a deeper problem.
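For illustration, the simple clamp could look roughly like the sketch below. It is not the change that was eventually merged, and it does not settle whether a deeper problem exists. extract_tokens_capped is a hypothetical name, the scores are made up, and ranking lemmas by descending score is an assumption about the surrounding code.

```python
def extract_tokens_capped(lemmas, scores, ratio, words):
    """Hypothetical variant that enforces words <= len(lemmas) by clamping."""
    # Assumption: lemmas are ranked by descending score before selection.
    ranked = sorted(lemmas, key=lambda lemma: scores[lemma], reverse=True)
    if words is None:
        length = int(len(ranked) * ratio)
    else:
        # Clamp so the requested count never exceeds the available lemmas.
        length = min(int(words), len(ranked))
    return [(scores[lemma], lemma) for lemma in ranked[:length]]


# The two lemmas from the example; the scores are invented for the sketch.
lemmas = ["dive", "right"]
scores = {"dive": 0.6, "right": 0.4}

result = extract_tokens_capped(lemmas, scores, ratio=0.2, words=4)
print(result)
```

With this clamp, asking for more keywords than exist silently returns fewer than requested; whether that is the behaviour the original contributor intended is the open question here.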

@mpenkov mpenkov self-assigned this Sep 9, 2019
@mpenkov mpenkov added the conda label Sep 29, 2019
@EinKaiser

Hi, can I have a go at this?

@mpenkov
Collaborator

mpenkov commented Oct 12, 2019

Sure. I think the main part of this ticket will be investigating what the original contributor was trying to achieve, as @piskvorky pointed out.

@piskvorky piskvorky added the Hacktoberfest, impact LOW, reach MEDIUM, and bug labels and removed the conda label Oct 12, 2019