markovify's make_sentence_with_start() doesn't seem to work properly #181

Open
nezetimesthree opened this issue Aug 10, 2023 · 11 comments

@nezetimesthree

heya @jsvine. i'm writing some fairly simple code with markovify, and i keep running into a couple of issues.

  1. make_sentence_with_start doesn't find sentences with the given words when they're clearly there, even with strict=False
  2. for some reason, when my generated prompt is exactly two words long, it gives me a KeyError: ('word_a', 'word_b').
    it does work as expected in some cases, but in my tests the issues happen a lot more often. i can give you the code if you need it.
@jsvine
Owner

jsvine commented Aug 11, 2023

Hi @nezetimesthree, and thanks for your interest in markovify. When you get a chance, please provide code and text that reproduces the problem. Without that, it will unfortunately be quite hard to debug.

@nezetimesthree
Author

nezetimesthree commented Aug 11, 2023

of course. here's the code and text file.

from transformers import pipeline
import random
import markovify

model_link = "IProject-10/bert-base-uncased-finetuned-squad2"
question_answerer = pipeline("question-answering", model=model_link)

with open('mayakovsky.txt', 'r') as file:
  f = file.readlines()
  poems = []
  poem = ''
  dataset = ''
  for line in f:
    dataset += line.strip() + '. '
    if line != '\n':
      poem += line.strip() + ' '
    else:
      poems.append(poem)
      poem = ''

context = random.choice(poems)
question = input()

answer = question_answerer(question=question, context=context)['answer']

print(answer, '->', ' '.join(answer.split()[-2:]))

text_model = markovify.Text(' '.join(poems))

if len(answer.split()) > 1:
  print(text_model.make_sentence_with_start(' '.join(answer.split()[-2:]), strict=False, tries=100), end='\n')
else:
  print(text_model.make_sentence_with_start(answer, strict=False, tries=100), end='\n')
for i in range(5):
  print(text_model.make_short_sentence(200, min_length=100, tries=100), end='\n')

mayakovsky.txt

@jsvine
Owner

jsvine commented Aug 14, 2023

Thanks for sharing this, @nezetimesthree.

It seems that you're passing make_sentence_with_start a "start" that was generated by an LLM, so it is not guaranteed to be a start that actually exists in your corpus. That is a requirement for markovify, and for this type of Markov chain generally. Is that correct? If so, this is expected behavior of markovify and I would not consider it a bug.

If I've misunderstood, could you share a simpler code example that doesn't depend on other libraries, yet still reproduces the problem? In this example, the logic that uses IProject-10/bert-base-uncased-finetuned-squad2 is fairly intertwined here with the logic that uses markovify, and there are several different calls to markovify, making it difficult to debug.
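To illustrate the requirement above, here is a minimal toy sketch of a two-word-state Markov chain. It is not markovify's implementation; `build_chain`, `has_state`, and the sample sentences are hypothetical names and data chosen for illustration. It shows why a two-word start that never occurs as a state in the parsed corpus cannot be looked up, which is the kind of situation that can surface as a `KeyError` on a tuple of two words.

```python
from collections import defaultdict

# Hypothetical sentinel tokens marking sentence boundaries in this toy chain.
BEGIN, END = "__BEGIN__", "__END__"


def build_chain(sentences, state_size=2):
    """Map each run of `state_size` words to the words seen right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = [BEGIN] * state_size + sentence.split() + [END]
        for i in range(len(words) - state_size):
            state = tuple(words[i:i + state_size])
            chain[state].append(words[i + state_size])
    return dict(chain)


def has_state(chain, start):
    """Return True if the whitespace-split start is a known state."""
    return tuple(start.split()) in chain


toy_chain = build_chain(["the cat sat down", "the cat ran away"])

print(has_state(toy_chain, "the cat"))  # True: this pair occurs in the corpus
print(has_state(toy_chain, "the dog"))  # False: looking it up would raise KeyError
```

Checking for the state before generating, as `has_state` does here, turns an opaque lookup failure into an explicit condition that can be handled.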

@nezetimesthree
Author

thanks for taking a look, @jsvine. but that's a misunderstanding: the LLM only gives answers taken from the given context, which, in this case, is one of the poems from the file. i've checked the failing cases against the poem dataset, and the words were always there. for some reason, NewlineText didn't see them as the start of a sentence. maybe it's because some of the lines consist of only one word? could this be the issue?

@jsvine
Owner

jsvine commented Aug 14, 2023

Thank you for the helpful clarification, @nezetimesthree. Could you share a start that the code fails on but that is definitely a start in the corpus?

@nezetimesthree
Author

hello again, @jsvine. sorry i didn't answer yesterday, but here's the example, the error, and the proof that it's clearly there.

image
image

@jsvine
Owner

jsvine commented Aug 15, 2023

Thanks; can you share that as copy-pasteable text?

@nezetimesthree
Author

addition: here's what happens when it receives only one word
image
image

can you clarify what you mean by "copy-pasteable text", though? if i understand you correctly, then the words are "ладно слажен" and "Наоборот"

@jsvine
Owner

jsvine commented Aug 15, 2023

Great, thanks; that's what I was looking for, indeed.

@jsvine
Owner

jsvine commented Aug 16, 2023

Thanks again for the helpful example. Taking a closer look, the issue seems not to be with make_sentence_with_start, but rather the sentence parser much earlier in the processing pipeline.

import markovify

with open("mayakovsky.txt", "r") as file:
    model = markovify.Text(file.read())


def test_presence(fragment):
    return any(
        any(fragment == token for token in sentence)
        for sentence in model.parsed_sentences
    )


print(test_presence("Послушайте!"))
print(test_presence("слажен"))

Prints:

True
False

The default Text model uses a regex-powered filter to remove sentences that could cause problems, mostly related to apostrophes and quotation marks. It also invokes unidecode, which seems to be causing the problem here. Because it's a generally useful approach, I don't want to remove that step from the library, but there are two ways you should be able to handle this on your end:

  • Calling markovify.Text(..., well_formed=False), which skips the filtering step
  • Extending markovify.Text (documented here) to behave in a way better suited to your corpus.

Using well_formed=False seems to work well, although you'll have to contend with the punctuation (or strip it out in a pre-processing step), as you'll see with the comma below:

import markovify

with open("mayakovsky.txt", "r") as file:
    model = markovify.Text(file.read(), well_formed=False)

print(model.make_sentence_with_start("ладно слажен,"))

Prints: ладно слажен, — и все обвыл.
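The pre-processing option mentioned above could look roughly like the following sketch. It uses only the standard library; `strip_punctuation` is a hypothetical helper name, not part of markovify. It assumes that dropping every Unicode punctuation character is acceptable for the corpus. Since that also removes sentence-ending punctuation, it is probably better suited to a corpus where each line is treated as its own sentence.

```python
import unicodedata


def strip_punctuation(text):
    """Remove Unicode punctuation line by line, keeping line breaks intact."""
    cleaned_lines = []
    for line in text.splitlines():
        # Unicode general categories starting with "P" cover punctuation,
        # including commas, dashes, periods, and exclamation marks.
        kept = "".join(
            ch for ch in line if not unicodedata.category(ch).startswith("P")
        )
        # Collapse the extra spaces left behind by removed characters.
        cleaned_lines.append(" ".join(kept.split()))
    return "\n".join(cleaned_lines)


print(strip_punctuation("ладно слажен, — и все обвыл."))
# ладно слажен и все обвыл
```

With the corpus cleaned this way, a start such as "ладно слажен" would no longer need the trailing comma to match a state.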

@nezetimesthree
Author

thank you very much, @jsvine. i will test it and return with the result next week. sorry for making you wait for it, but i just won't have a chance this week. thank you again, and we'll see if this works.
