markovify's make_sentence_with_start() doesn't seem to work properly #181

nezetimesthree · 2023-08-10T21:45:05Z

heya @jsvine. i'm writing some fairly simple code with markovify, and i keep running into a couple of issues.

make_sentence_with_start doesn't find sentences containing the given words even when they're clearly in the text, even with strict=False
for some reason, when my generated prompt is exactly two words long, it gives me a KeyError: ('word_a', 'word_b').
it does work as expected in some cases, but in my tests the issues happen a lot more often. i can give you the code if you need it.

jsvine · 2023-08-11T16:23:17Z

Hi @nezetimesthree, and thanks for your interest in markovify. When you get a chance, please provide code and text that reproduces the problem. Without that, it will unfortunately be quite hard to debug.

nezetimesthree · 2023-08-11T17:17:17Z

of course. here's the code and text file.

from transformers import pipeline
import random
import markovify

model_link = "IProject-10/bert-base-uncased-finetuned-squad2"
question_answerer = pipeline("question-answering", model=model_link)

with open('mayakovsky.txt', 'r') as file:
  f = file.readlines()
  poems = []
  poem = ''
  dataset = ''
  for line in f:
    dataset += line.strip() + '. '
    if line != '\n':
      poem += line.strip() + ' '
    else:
      poems.append(poem)
      poem = ''

context = random.choice(poems)
question = input()

answer = question_answerer(question=question, context=context)['answer']

print(answer, '->', ' '.join(answer.split()[-2:]))

text_model = markovify.Text(' '.join(poems))

if len(answer.split()) > 1:
  print(text_model.make_sentence_with_start(' '.join(answer.split()[-2:]), strict=False, tries=100), end='\n')
else:
  print(text_model.make_sentence_with_start(answer, strict=False, tries=100), end='\n')
for i in range(5):
  print(text_model.make_short_sentence(200, min_length=100, tries=100), end='\n')

mayakovsky.txt

jsvine · 2023-08-14T12:46:00Z

Thanks for sharing this, @nezetimesthree.

It seems that you're passing make_sentence_with_start a "start" that was generated by an LLM, so it is not guaranteed to be a start that actually exists in your corpus. That is a requirement for markovify and for this type of Markov chain generally. Is that correct? If so, this is expected behavior of markovify and I would not consider it a bug.

If I've misunderstood, could you share a simpler code example that doesn't depend on other libraries, yet still reproduces the problem? In this example, the logic that uses IProject-10/bert-base-uncased-finetuned-squad2 is fairly intertwined here with the logic that uses markovify, and there are several different calls to markovify, making it difficult to debug.
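To illustrate the requirement described above, here is a minimal standard-library sketch of a word-level chain with a state size of 2. It is not markovify's implementation; `build_model`, `continue_from`, and the two-sentence corpus are hypothetical names and data made up for illustration. It shows why a two-word start can raise a KeyError even when both words occur in the corpus: the pair has to occur as consecutive words to be a known state.

```python
from collections import defaultdict

BEGIN = "<BEGIN>"  # hypothetical padding marker, not markovify's constant
END = "<END>"      # hypothetical end marker


def build_model(sentences, state_size=2):
    """Map each run of `state_size` consecutive words to the words that follow it."""
    model = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        items = [BEGIN] * state_size + words + [END]
        for i in range(len(words) + 1):
            state = tuple(items[i:i + state_size])
            model[state].append(items[i + state_size])
    return dict(model)  # plain dict, so an unknown state raises KeyError


def continue_from(model, start, max_words=50):
    """Follow the first recorded continuation of each state (deterministic)."""
    state = tuple(start.split())
    out = list(state)
    for _ in range(max_words):
        following = model[state][0]  # KeyError if the state never occurred
        if following == END:
            break
        out.append(following)
        state = state[1:] + (following,)
    return " ".join(out)


# Made-up corpus, used only for this illustration.
model = build_model(["the cat sat on the mat", "the dog sat on the rug"])

print(continue_from(model, "sat on"))  # prints "sat on the mat"

try:
    continue_from(model, "cat on")  # both words exist, but never side by side
except KeyError as exc:
    print("unknown state:", exc)
```

In this sketch, "cat" and "on" both appear in the corpus, yet `("cat", "on")` is not a key in the model, so the lookup fails before any text is generated.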

nezetimesthree · 2023-08-14T13:02:54Z

thanks for taking a look, @jsvine, but i think you've misunderstood: the LLM only gives answers taken from the given context, which in this case is one of the poems from the file. i've checked the failing cases against the poem dataset, and the words were always there. for some reason, NewlineText didn't see them as a start for sentences. maybe it's because some of the lines consist of only one word? could this be the issue?

jsvine · 2023-08-14T13:08:23Z

Thank you for the helpful clarification, @nezetimesthree. Could you share a start that the code fails on but that is definitely a start in the corpus?

nezetimesthree · 2023-08-15T15:07:04Z

hello again, @jsvine. sorry i didn't answer yesterday, but here's the example, the error, and the proof that it's clearly there.

jsvine · 2023-08-15T15:08:29Z

Thanks; can you share that as copy-pasteable text?

nezetimesthree · 2023-08-15T15:17:32Z

addition: here's what happens when it receives only one word

can you clarify what you mean by "copy-pasteable text", though? if i understand you correctly, then the words are "ладно слажен" and "Наоборот"

jsvine · 2023-08-15T15:18:51Z

Great, thanks; that's what I was looking for, indeed.

jsvine · 2023-08-16T17:53:46Z

Thanks again for the helpful example. Taking a closer look, the issue seems not to be with make_sentence_with_start, but rather with the sentence parser much earlier in the processing pipeline.

import markovify

with open("mayakovsky.txt", "r") as file:
    model = markovify.Text(file.read())


def test_presence(fragment):
    return any(
        any(fragment == token for token in sentence)
        for sentence in model.parsed_sentences
    )


print(test_presence("Послушайте!"))
print(test_presence("слажен"))

Prints:

True
False
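The check above compares single tokens. For a two-word start, it can also help to test whether the words occur consecutively within one parsed sentence. The following is a sketch under the assumption that the parsed sentences are lists of word tokens (as `model.parsed_sentences` is used above); `contains_fragment` and the sample data are hypothetical and not part of markovify.

```python
def contains_fragment(parsed_sentences, fragment):
    """Return True if the fragment's words occur consecutively in any sentence."""
    words = fragment.split()
    n = len(words)
    if n == 0:
        return False
    return any(
        list(sentence[i:i + n]) == words
        for sentence in parsed_sentences
        for i in range(len(sentence) - n + 1)
    )


# Made-up token lists standing in for model.parsed_sentences.
sample = [["the", "cat", "sat"], ["on", "the", "mat."]]

print(contains_fragment(sample, "the cat"))  # True: adjacent in one sentence
print(contains_fragment(sample, "sat on"))   # False: spans two sentences
print(contains_fragment(sample, "the mat"))  # False: the token is "mat."
```

With a real model, this could be called as `contains_fragment(model.parsed_sentences, "ладно слажен")`. The last case shows that attached punctuation makes a token differ from the bare word.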

The default Text model uses a regex-powered filter to remove sentences that could cause problems, mostly regarding apostrophes and quotation marks. It also invokes unidecode, which seems to be causing the problem here. Because it's a generally useful approach, I don't want to remove that step from the library, but there are two ways you should be able to handle this on your end:

Calling markovify.Text(..., well_formed=False), which skips the filtering step
Extending markovify.Text (documented here) to behave in a way better suited to your corpus.

Using well_formed=False seems to work well, although you'll have to contend with the punctuation (or strip it out in a pre-processing step), as you'll see with the comma below:

import markovify

with open("mayakovsky.txt", "r") as file:
    model = markovify.Text(file.read(), well_formed=False)

print(model.make_sentence_with_start("ладно слажен,"))

Prints: ладно слажен, — и все обвыл.
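One possible shape for the pre-processing step mentioned above is a small helper that strips leading and trailing punctuation from each token. This is only a sketch using the standard library; `strip_punctuation` and `normalize_fragment` are hypothetical names, not markovify functions. The same normalization would need to be applied to both the corpus and the start fragment so that their tokens match.

```python
import unicodedata


def strip_punctuation(token):
    """Remove leading and trailing punctuation (Unicode categories P*) from a token."""
    start, end = 0, len(token)
    while start < end and unicodedata.category(token[start]).startswith("P"):
        start += 1
    while end > start and unicodedata.category(token[end - 1]).startswith("P"):
        end -= 1
    return token[start:end]


def normalize_fragment(fragment):
    """Strip punctuation from every token and drop tokens that become empty."""
    cleaned = (strip_punctuation(token) for token in fragment.split())
    return " ".join(token for token in cleaned if token)


print(normalize_fragment("ладно слажен,"))  # prints "ладно слажен"
```

One caveat: markovify.Text splits sentences on sentence-ending punctuation, so stripping everything from the corpus before building the model would remove those boundaries. Normalizing line by line and building the model with markovify.NewlineText, which treats each line as a sentence, is one way to keep them.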

nezetimesthree · 2023-08-16T21:04:35Z

thank you very much, @jsvine. i will test it and return with the result next week. sorry for making you wait for it, but i just won't have a chance this week. thank you again, and we'll see if this works.
