subclassing markovify.Text to allow for different types of 'sentences' #145
Hi @mooseyboots, and thanks for your interest in this library. I'm having a bit of trouble, however, understanding the specifics of your inquiry. Could you provide some code, inputs, and outputs that demonstrate the issue?
here is my subclass modifying
a selection of input from one of my files:
1 incorrect 'sentence' from the sample output using my subclass:
so the word "embalmed." is not counting as an end. but what confused me is that if i use the subclass manually to generate a corpus, such as something like:
the word "embalmed." will actually be the last item in its sentence's list:
which to me suggested that the regex sentence splitter was working correctly. my query is: if i want to change how markovify understands what constitutes the end of a sentence, is that all i need to do, or are there other things to modify?
Hi @mooseyboots, and thanks for the additional details. Judging from the sample of the corpus you shared, which seems to place each sentence on a new line, the easiest solution may just be to use the already-defined subclass. And if that doesn't quite fit your use case, you can use that subclass's definition as a perhaps-simpler starting place (i.e., swapping out the regular expression below for the regular expression of your choosing): Lines 287 to 293 in 16b9367
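As a rough illustration of the "one sentence per line" approach suggested here, the splitting step can be sketched with the standard library alone. The names `NEWLINE_PATTERN`, `split_on_newlines`, and `LineText` are hypothetical and not part of markovify; the sketch assumes the splitter would be hooked in by overriding the `sentence_split` method on a subclass of `markovify.Text`, as shown in the comment.

```python
import re

# Hypothetical pattern: a line break, plus any whitespace around it,
# separates two 'sentences'.
NEWLINE_PATTERN = re.compile(r"\s*\n\s*")


def split_on_newlines(text):
    """Return one 'sentence' per non-empty line of text."""
    return [piece for piece in NEWLINE_PATTERN.split(text.strip()) if piece]


# Assumed wiring into markovify (not executed here):
#
#     class LineText(markovify.Text):      # hypothetical subclass name
#         def sentence_split(self, text):
#             return split_on_newlines(text)

sample = "* first line without a capital\n\n- second line. still the same sentence\n"
sentences = split_on_newlines(sample)
print(sentences)
```

With this kind of splitter, a period in the middle of a line does not end the sentence; only the line break does. To split on something else, the regular expression would be the piece to swap.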
hi and thx for yr great library.
i made a cli program to run it on my own texts.
i'm trying to add a subclass to it that enables me to feed it sentences that don't begin with initial capital letters and might begin with stars, bullets, etc. i made a subclass (modeled on your NewlineText) to modify the regexes in split_into_sentences(), changing the lookahead search that mandates an initial capital letter after sentence end (splitters.py, line 45) to read r"\s+(?=[-•\w‘’“”'*\|/~\",])", and added a few more punctuation marks to the previous regexes (hyphen, ellipses/triple periods).
it works if i manually generate a corpus and markov model from one of my texts, but not if i run my program using the subclass. one "sentence" will have a period in the middle of it and will continue printing text after it.
so i wanted to ask if there is anything in the way that sentences are made from the markov model that would affect these modified regexes or disregard them? and is there a better way to go about modifying sentence endings than messing with split_into_sentences()?
[sorry if it's obvious in the code. i'm very much a novice with programming.]
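To make the modified lookahead concrete, here is a minimal, standard-library-only sketch of a splitter built around the regex quoted above. It is a simplification and not markovify's own split_into_sentences(): the lookbehind for ending punctuation, the pattern name `END_PATTERN`, and the function name `split_loose_sentences` are assumptions made for illustration.

```python
import re

# Simplified, hypothetical pattern (not the library's own):
END_PATTERN = re.compile(
    r"(?<=[.?!…])"                  # previous character is ending punctuation
    r"\s+"                          # the whitespace between two 'sentences'
    r"(?=[-•\w‘’“”'*\|/~\",])"      # next 'sentence' may start with a bullet,
                                    # star, quote, or any word character
)


def split_loose_sentences(text):
    """Split text wherever ending punctuation is followed by whitespace
    and one of the permitted sentence-opening characters."""
    return [piece for piece in END_PATTERN.split(text.strip()) if piece]


sample = "the body was embalmed. * then a starred line follows… and more text! - done"
sentences = split_loose_sentences(sample)
last_word = sentences[0].split()[-1]
print(sentences)
print(last_word)
```

Running a splitter like this directly on a sample of the input is a quick way to check whether a word such as "embalmed." really lands at the end of its sentence before the text is handed to the model.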