Running example, results not same as yours #46

lukakostic · 2018-09-16T19:29:40Z

With input of

text = """Automatic summarization is the process of reducing a text document with a
computer program in order to create a summary that retains the most important points
of the original document. As the problem of information overload has grown, and as
the quantity of data has increased, so has interest in automatic summarization.
Technologies that can make a coherent summary take into account variables such as
length, writing style and syntax. An example of the use of summarization technology
is search engines such as Google. Document summarization is another."""

print(summarize(text))

I get

Automatic summarization is the process of reducing a text document with a
Document summarization is another.

Which is much worse and not the same as the result shown in the example:

Automatic summarization is the process of reducing a text document with a computer
program in order to create a summary that retains the most important points of the
original document.

fbarrios · 2018-09-16T20:17:19Z

Hi! By default we use newlines to divide into sentences. Can you try removing the line breaks from the input text?

lukakostic · 2018-09-16T20:18:51Z

What do you mean by breaks? Like replacing dots with '\n', or replacing '\b' with '\n'?

fbarrios · 2018-09-16T20:19:49Z

Removing "\n" from the text altogether.

lukakostic · 2018-09-16T20:21:19Z

But then won't things like this:

A
B

turn into 'AB' in some user-made text?

Would it be better to turn '\n' into '.' and remove any duplicate dots ('...' to '.')?

lukakostic · 2018-09-16T20:30:31Z

Doing the above (replacing newlines with '.' and allowing only one consecutive dot) breaks it. Now summarize(text) and summarize with ratio=0.2 are both empty (they print nothing), and summarize with words=10 prints "Document summarization is another." The keywords are still the same.

lukakostic · 2018-09-16T20:34:29Z

It seems this adds dots in places where it shouldn't, because the input has newlines in some unnecessary places. But if I remove the newlines, words can get merged into one word when they shouldn't be...
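To make the problem concrete, here is a minimal sketch of the replacement described above, applied to two lines of the sample input. It uses only the standard library and does not call the summarizer; the variable names are made up for illustration.

import re

# Two hard-wrapped lines from the sample input; the line break falls mid-sentence.
wrapped = (
    "An example of the use of summarization technology\n"
    "is search engines such as Google. Document summarization is another."
)

# Turn every newline into a dot, then collapse runs of dots into a single dot.
dotted = re.sub(r"\.+", ".", wrapped.replace("\n", "."))

print(dotted)
# An example of the use of summarization technology.is search engines such as Google. Document summarization is another.

The dot lands in the middle of a sentence ("technology.is"), because the newline was only a wrapping break and not a sentence boundary.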

lukakostic · 2018-09-16T20:35:19Z

Just removing newlines ("\n" to "") results in the same output as turning "\n" into "."

lukakostic · 2018-09-16T20:37:07Z

Except that the keywords now include "technologyis search", which is obviously an artifact of removing the newlines:

An example of the use of summarization technology
is search engines such as Google
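One way to avoid this artifact is to replace each newline with a space instead of deleting it. The following is only a sketch under that assumption: unwrap is a hypothetical helper, not part of the library, and it assumes every line break in the input is a wrapping break.

import re

def unwrap(text):
    """Join hard-wrapped lines with single spaces (hypothetical helper)."""
    # Collapse every whitespace run, newlines included, into one space.
    return re.sub(r"\s+", " ", text).strip()

wrapped = (
    "An example of the use of summarization technology\n"
    "is search engines such as Google."
)

fused = wrapped.replace("\n", "")   # deleting newlines glues words together
joined = unwrap(wrapped)            # joining with a space keeps them apart

print(fused)   # An example of the use of summarization technologyis search engines such as Google.
print(joined)  # An example of the use of summarization technology is search engines such as Google.

Note that this also flattens blank lines between paragraphs, so text that relies on newlines as real separators (like the 'A' / 'B' case above) would need different handling.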

lukakostic · 2018-09-16T21:29:38Z

It gives a good result when I manually correct the text to remove the unneeded newlines, but the summary with words=10 is empty now. Could it be that it can't produce a summary that short?

fbarrios · 2018-09-16T23:05:08Z

Our summaries consist of the most relevant sentences in a given text. The task of splitting a text into sentences is not solved, so we make a best effort using this regex.

That regex treats different lines (i.e., pieces of text separated by \n) as different sentences. In another project we evaluated changing this behavior, but in the end decided to keep it as it is, since it is an easier task for the user to remove newlines if the text is well formatted. This has to be better documented, so I created a ticket for that.

The other behavior does seem like a bug. It could be that the summarizer misbehaves when the words parameter is too small. Can you create a separate issue for that with an example? I will close this one.

Thank you!

fbarrios mentioned this issue Sep 16, 2018

Document newline behavior #48

Closed

fbarrios closed this as completed Sep 16, 2018
