Running example, results not same as yours #46

Closed
lukakostic opened this issue Sep 16, 2018 · 10 comments

Comments

@lukakostic

With input of

text = """Automatic summarization is the process of reducing a text document with a
computer program in order to create a summary that retains the most important points
of the original document. As the problem of information overload has grown, and as
the quantity of data has increased, so has interest in automatic summarization.
Technologies that can make a coherent summary take into account variables such as
length, writing style and syntax. An example of the use of summarization technology
is search engines such as Google. Document summarization is another."""

print(summarize(text))

I get

Automatic summarization is the process of reducing a text document with a
Document summarization is another.

Which is much worse and not the same as the result you got (in the example):

Automatic summarization is the process of reducing a text document with a computer
program in order to create a summary that retains the most important points of the
original document.

@fbarrios
Contributor

Hi! By default we use newlines to divide the text into sentences. Can you try removing the line breaks from the input text?

@lukakostic
Author

What do you mean by breaks? Like replacing dots with '\n', or replacing '\b' with '\n'?

@fbarrios
Contributor

Removing "\n" from the text altogether.

@lukakostic
Author

But then won't things like this:

A
B

turn into 'AB' in some user-made text?

Would it be better to turn '\n' into '.' and remove any duplicate dots (... to .)?

@lukakostic
Author

lukakostic commented Sep 16, 2018

Doing the above (replacing newlines with '.' and making sure only one consecutive dot is allowed) breaks it. Now summarize(text) and summarize with ratio=0.2 are both empty (they print nothing), and summarize with words=10 prints "Document summarization is another." The keywords are still the same.

@lukakostic
Author

It seems this adds dots in places where it shouldn't, because the input has newlines in some unnecessary places. But if I remove the newlines, two words can get merged into one when they shouldn't be...

@lukakostic
Author

Just removing newlines ("\n" to "") results in the same output as turning "\n" into "."

@lukakostic
Author

Except that the keywords now include "technologyis search", which is obviously an artifact of removing newlines:

An example of the use of summarization technology
is search engines such as Google
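A minimal sketch of a third option the thread does not spell out: joining hard-wrapped lines with a space instead of deleting the newline or turning it into a dot. The helper name `unwrap_lines` is hypothetical and not part of the library discussed here. It assumes a single newline is only a line wrap and that a blank line marks a paragraph break.

```python
import re


def unwrap_lines(text):
    """Join hard-wrapped lines with a space (hypothetical helper, not a library API).

    Assumption: single newlines are line wraps, blank lines separate paragraphs.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split into paragraphs on blank lines.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, collapse every whitespace run (newlines included)
    # into one space, so "technology\nis" becomes "technology is".
    joined = [" ".join(p.split()) for p in paragraphs]
    # Keep one newline between paragraphs so they remain separate lines.
    return "\n".join(p for p in joined if p)


wrapped = (
    "An example of the use of summarization technology\n"
    "is search engines such as Google."
)
print(unwrap_lines(wrapped))
# An example of the use of summarization technology is search engines such as Google.
```

With this approach the 'A' / 'B' case from earlier in the thread becomes 'A B' rather than 'AB', and the cleaned text can then be passed to summarize as before.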

@lukakostic
Author

It gives a good result when I manually correct the text to remove unneeded newlines, but the summary with words=10 is empty now. Could it be that it can't produce a summary that short?

@fbarrios
Contributor

fbarrios commented Sep 16, 2018

Our summaries consist of the most relevant sentences in a given text. The task of splitting a text into sentences is not solved, so we make a best effort using this regex.

That regex treats different lines (i.e., pieces of text separated by \n) as different sentences. In another project we evaluated changing this behavior, but in the end decided to keep it as it is, since it is an easier task for the user to remove newlines if the text is well formatted. This needs to be better documented, so I created a ticket for that.

The other behavior does seem like a bug. It could be that the summarizer misbehaves when the words parameter is too small. Can you create a separate issue for that with an example? I will close this one.

Thank you!
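To illustrate why the hard-wrapped input produced the fragment "Automatic summarization is the process of reducing a text document with a", here is a rough sketch of a splitter that treats every newline as a sentence boundary. The function `naive_split` and its pattern are assumptions made for illustration; this is not the regex the project uses, only a simplified stand-in with the same newline behavior.

```python
import re


def naive_split(text):
    """Illustrative splitter (an assumption, not the library's regex).

    Breaks after '.', '!' or '?' followed by whitespace, and at every newline.
    """
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    # Drop empty pieces and surrounding whitespace.
    return [p.strip() for p in parts if p and p.strip()]


text = (
    "Automatic summarization is the process of reducing a text document with a\n"
    "computer program. Document summarization is another."
)

for sentence in naive_split(text):
    print(repr(sentence))
# 'Automatic summarization is the process of reducing a text document with a'
# 'computer program.'
# 'Document summarization is another.'

# Replacing the wrap with a space first keeps the sentence in one piece.
for sentence in naive_split(text.replace("\n", " ")):
    print(repr(sentence))
# 'Automatic summarization is the process of reducing a text document with a computer program.'
# 'Document summarization is another.'
```

Because a summary is built from whole "sentences", a line wrap in the middle of a sentence leaves a truncated fragment as a candidate, which matches the output reported at the top of this issue.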
