subclassing markovify.Text to allow for different types of 'sentences' #145

Open
mooseyboots opened this issue Oct 29, 2020 · 3 comments

@mooseyboots

mooseyboots commented Oct 29, 2020

hi and thx for yr great library.

i made a cli program to run it on my own texts.

i'm trying to add a subclass to it that enables me to feed it sentences that don't begin with initial capital letters and might begin with stars, bullets, etc. i made a subclass (modeled on your NewlineText) to modify the regexes in split_into_sentences(), changing the lookahead search that mandates an initial capital letter after sentence end (splitters.py, line 45) to read r"\s+(?=[-•\w‘’“”'*\|/~\",])", and added a few more punctuation marks to the previous regexes (hyphen, ellipsis/triple periods).
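For readers following along, here is a minimal sketch of what that lookahead change does, using only the standard `re` module. The relaxed pattern is the one quoted above; the capital-only pattern and the `split_after_period` helper are hypothetical stand-ins for comparison, not copies of the library's code.

```python
import re

# Lookahead quoted in the comment above: after the whitespace, the next
# character may be a word character or one of several opening marks.
relaxed_lookahead = re.compile(r"\s+(?=[-•\w‘’“”'*\|/~\",])", re.U)

# Capital-only lookahead, an assumed stand-in for the stricter behaviour.
strict_lookahead = re.compile(r"\s+(?=[A-Z])")

sample = (
    "who among us not embalmed.\n"
    "walk up to wall and kick it, once, twice, there."
)


def split_after_period(lookahead, text):
    """Split on whitespace that follows a period and satisfies the lookahead."""
    pattern = re.compile(r"(?<=\.)" + lookahead.pattern, re.U)
    return pattern.split(text)


print(split_after_period(relaxed_lookahead, sample))
print(split_after_period(strict_lookahead, sample))
```

With the relaxed lookahead the sample splits into two pieces after "embalmed."; with the capital-only lookahead it stays in one piece, because the next line starts with a lowercase letter.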

it works if i manually generate a corpus and markov model from one of my texts, but not if i run my program using the subclass. one "sentence" will have a period in the middle of it and will continue printing text after it.

so i wanted to ask if there is anything in the way that sentences are made from the markov model that would affect these modified regexes or disregard them? and is there a better way to go about modifying sentence endings than messing with split_into_sentences()?

[sorry if its obvious in the code. i'm very much a novice with programming.]

@jsvine
Owner

jsvine commented Oct 30, 2020

Hi @mooseyboots, and thanks for your interest in this library. I'm having a bit of trouble, however, understanding the specifics of your inquiry. Could you provide some code, inputs, and outputs that demonstrate the issue?

@mooseyboots
Author

here is my subclass modifying split_into_sentences():

import re
import markovify
from markovify.splitters import is_sentence_ender


class NoInitCaps(markovify.Text):
    """
    An attempt to subclass markovify.Text to allow sentences that do not begin with an initial capital letter.
    """

    def split_into_sentences(self, text):
        potential_end_pat = re.compile(
            r"".join(
                [
                    r"([-\w\.\"'’~”&\]\)]+[…(\.){1,4}\?!])",  # A word that ends with punctuation, including ellipsis, possibly separated by white space
                    r"([‘’“”'~\"\)\]]*)",  # Followed by optional quote/parens/etc
                    r"\s+(?=[-•\w‘’“”'*\|/~\",])",  # followed by whitespace. then a lookahead to the next char, which can be alphanumeric or initial punctuation
                ]
            ),
            re.U,  # U for Unicode!
        )
        dot_iter = re.finditer(potential_end_pat, text)
        end_indices = [
            (x.start() + len(x.group(1)) + len(x.group(2)))
            for x in dot_iter
            if is_sentence_ender(x.group(1))
        ]
        spans = zip([None] + end_indices, end_indices + [None])
        sentences = [text[start:end].strip() for start, end in spans]
        return sentences

    def sentence_split(self, text):
        return self.split_into_sentences(text)
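One detail in the first pattern that may be worth checking: inside square brackets, `(\.){1,4}` is not a group with a repeat count, it is just a set of literal characters. The sketch below, using only the standard `re` module, shows which single characters that trailing class accepts. The `alternation` pattern is a hypothetical alternative, not something taken from the library. Since the subclass still filters candidates through `is_sentence_ender`, this may not be the cause of the output described in this thread.

```python
import re

# The trailing character class from the first pattern in the subclass above.
trailing_class = re.compile(r"[…(\.){1,4}\?!]")

# Inside [...], grouping and repetition syntax is literal, so each of these
# single characters matches the class on its own.
literal_hits = [ch for ch in "….?!(){},14" if trailing_class.fullmatch(ch)]

# Hypothetical alternative that spells the endings out as alternation:
# an ellipsis character, one to four periods, or a single ? or !.
alternation = re.compile(r"(?:…|\.{1,4}|[?!])")

print(literal_hits)
print(bool(alternation.fullmatch("...")), bool(alternation.fullmatch("4")))
```

The class matches exactly one character at a time, so a run of three periods is only recognised one period at a time, while digits such as `1` and `4` count as possible endings.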

a selection of input from one of my files:

    • error, which makes things swollen, gives them that look of filling out just a little more space than is theirs, so that they bump into other swollen things, seek room.

renege.

‘empty’ words, imagine!

who among us not embalmed.
walk up to wall and kick it, once, twice, there.

, lying in wait / for the neonate. 

1 incorrect 'sentence' from the sample output using my subclass:

who among us not embalmed. walk up to the rhythm of beer and coffee, on the verge of nothing here.

so the word "embalmed." is not counting as an end.

but what confused me is that if i use the subclass manually to generate a corpus, such as something like:

  from markovify import Chain, Text
  from mkv_this.noinitcaps import NoInitCaps

  text = "/PATH/TO/INPUT/scrapbook.txt"

  with open(text, "r") as t:
      txt = t.read()
      text_obj = NoInitCaps(txt) # my subclass
      corpus = text_obj.generate_corpus(txt)
      clist = list(corpus)
  with open("/PATH/TO/OUTPUT/markov-corpus-no-init-caps.txt", "w") as c:
      c.write(str(clist))

the word "embalmed." will actually be the last item in its sentence's list:

 ['renege.'], ['‘empty’', 'words,', 'imagine!'], ['who', 'among', 'us', 'not', 'embalmed.'], ['walk', 'up', 'to', 'wall', 'and', 'kick', 'it,', 'once,', 'twice,', 'there.'], [',', 'lying', 'in', 'wait', '/', 'for', 'the', 'neonate.']

which to me suggested that the regex sentence splitter was working correctly.

my query is, if i want to change how markovify understands what constitutes the end of a sentence, is that all i need to do or are there other things to modify?

@jsvine
Owner

jsvine commented Nov 3, 2020

Hi @mooseyboots, and thanks for the additional details. Judging from the sample of the corpus you shared, which seems to place each sentence on a new line, the easiest solution may just be to use the already-defined markovify.NewlineText class.

And if that doesn't quite fit your use-case, you can use that subclass's definition as a perhaps-simpler starting place (i.e., swapping out the regular expression below for the regular expression of your choosing):

markovify/markovify/text.py

Lines 287 to 293 in 16b9367

class NewlineText(Text):
    """
    A (usable) example of subclassing markovify.Text. This one lets you markovify
    text where the sentences are separated by newlines instead of ". "
    """

    def sentence_split(self, text):
        return re.split(r"\s*\n\s*", text)
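To see what that split does to input like the sample posted earlier in this thread, here is a small sketch that applies the same regular expression directly with the standard `re` module, so it runs without markovify installed. The `sample` string is a few lines copied from that input, and filtering out the trailing empty string is an addition for display only.

```python
import re

# The same pattern that NewlineText.sentence_split uses in the snippet above.
newline_pat = re.compile(r"\s*\n\s*")

# A few lines from the sample input posted earlier in this thread.
sample = (
    "renege.\n"
    "\n"
    "‘empty’ words, imagine!\n"
    "\n"
    "who among us not embalmed.\n"
    "walk up to wall and kick it, once, twice, there.\n"
    "\n"
    ", lying in wait / for the neonate. \n"
)

# Drop the empty string left over after the final newline.
sentences = [s for s in newline_pat.split(sample) if s]

for sentence in sentences:
    print(repr(sentence))
```

Each line becomes its own sentence whatever its first character is, and blank lines collapse into a single separator. The catch is that a line holding several period-separated sentences would stay together as one unit.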
