Summarization Task not working as expected #377

jpborgesmoura · 2023-11-19T14:53:46Z

Hello everyone,
I'm new to artificial intelligence and am running some tests and research with the spacy-llm tool.

I need to run a text summarization task and, for my experiments, I'm using the open-source Dolly model in its 3b version.

I have a reasonably good understanding of how it works, but I'm running into trouble getting the model to summarize my text under the restrictions I need: summarize texts written in Brazilian Portuguese, keep the summaries in that language, and use at most 30 words.

For this purpose, I'm setting in my config.cfg file, among other things, the attributes 'max_n_words' and 'template', where the template is a slightly adjusted version of the default summarization.v1.jinja that explicitly requires the final result to be in Brazilian Portuguese (only setting the language as 'pt' in the config file doesn't seem to be enough).

Here's what my config.cfg file looks like:

[nlp]
lang = "pt"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.Summarization.v1"
max_n_words = 30

[components.llm.task.template]
@misc = "spacy.FileReader.v1"
path = "summarization-portuguese.v1.jinja"

[components.llm.model]
@llm_models = "spacy.Dolly.v1"
name = "dolly-v2-3b"

In my template, I updated the prompt as follows: "You are an expert summarization system. Your task is to accept Text as input and summarize the Text in a concise way. The summary must always be written in Brazilian Portuguese."

However, when the model runs on my text, the results don't always come back in Portuguese (sometimes they're in English), and the number of words exceeds the maximum I set. For the 'max_n_words' attribute, I've read in the API docs that 'this should not expected to work exactly', but in my case it seems to be ignored entirely.
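Since the word limit is only a request made to the model, one way to guarantee it is to enforce it after generation. The following is a minimal sketch, not part of spacy-llm: `truncate_to_words` is a hypothetical helper name, and it assumes that counting whitespace-separated words is good enough for this purpose.

```python
def truncate_to_words(summary: str, max_n_words: int) -> str:
    """Cut `summary` down to at most `max_n_words` whitespace-separated words.

    Hypothetical helper for illustration; it is not part of spacy-llm.
    """
    words = summary.split()
    if len(words) <= max_n_words:
        # Already within the limit: only trim surrounding whitespace.
        return summary.strip()
    # Keep the first max_n_words words and mark the cut with an ellipsis.
    return " ".join(words[:max_n_words]) + "..."


summary = "O modelo gerou um resumo bem mais longo do que o limite pedido"
print(truncate_to_words(summary, 5))  # prints "O modelo gerou um resumo..."
```

Applied to the pipeline output, this would be called on `doc._.summary` with the same limit used in the config (30). Truncation can cut a sentence in half, so it is a safety net for the limit, not a replacement for a model that follows the instruction.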

That's the code of my experiment:

pip install spacy-llm

import spacy
from spacy_llm.util import assemble

nlp = assemble("config.cfg")

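# Keep the raw text and the processed Doc in separate variables.
text = ''' ... some text written in portuguese ... '''

doc = nlp(text)
# The summarization task stores its result in the custom attribute doc._.summary.
print(doc._.summary)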

I don't know if there's something that I'm missing and I ask for some kind of help of yours, if it's possible. In advance, I want to thank you for your attention and apologize if it's a foolish question or if I made some mistakes with my English.

Best regards,
João Paulo

rmitsch · 2023-11-20T08:25:08Z

Maintainer

Hi @jpborgesmoura!

> (only setting the language as 'pt' in the config file doesn't seem to be enough).

Yes, that setting affects the spaCy pipeline per se, but doesn't change the prompt. It might be useful to automatically append this setting to the prompt, but there are also downsides to it. We'll consider it.
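To illustrate what appending the language setting to the prompt could look like, here is a minimal sketch in plain Python. It is not how spacy-llm builds its prompts: `LANGUAGE_NAMES` and `add_language_instruction` are hypothetical names, and mapping 'pt' to "Brazilian Portuguese" is an assumption taken from this thread (the code alone does not distinguish regional variants).

```python
# Hypothetical mapping from language codes to the wording used in the prompt.
# "pt" -> "Brazilian Portuguese" is an assumption based on this use case.
LANGUAGE_NAMES = {"pt": "Brazilian Portuguese", "en": "English", "de": "German"}


def add_language_instruction(prompt: str, lang: str) -> str:
    """Append an explicit output-language instruction to `prompt`.

    Unknown language codes leave the prompt unchanged.
    """
    name = LANGUAGE_NAMES.get(lang)
    if name is None:
        return prompt
    return f"{prompt.rstrip()} The summary must always be written in {name}."


base_prompt = "You are an expert summarization system."
print(add_language_instruction(base_prompt, "pt"))
```

One downside is visible even in this small version: a two-letter code says nothing about the regional variant, and an instruction the user did not write may conflict with a custom template.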

> I don't know if there's something that I'm missing and I ask for some kind of help of yours, if it's possible.

LLMs tend to pick up automatically on the language in the prompt if they are trained on multilingual data. In the case of Dolly v2, the model was trained on English data only (see here). I recommend that you:

- use another model that was trained on multilingual data
- possibly translate the prompt into Portuguese, if the results with the existing prompt are still returned in English

jpborgesmoura Nov 21, 2023
Author

Hi rmitsch, you're right! I've switched to the BLOOM model, which was trained on Portuguese data, and its results are mostly in Portuguese (it still occasionally outputs English, but I can handle that because it doesn't happen often). Thanks for your support!
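For the occasional English output, a rough check can flag summaries that need a retry or manual review. The sketch below is an assumption about how such handling might look, not what was actually used in this thread: `looks_portuguese` is a hypothetical helper, and the two stopword sets are tiny hand-picked lists, far less reliable than a real language-identification library.

```python
import re

# Tiny hand-picked stopword lists, for illustration only (an assumption).
# Words that are common in both languages (e.g. "a", "do", "as", "no") are left out.
PT_STOPWORDS = {"de", "que", "não", "uma", "para", "com", "os", "em", "é", "da", "um", "por"}
EN_STOPWORDS = {"the", "of", "and", "to", "in", "is", "that", "for", "with", "it", "this", "on", "are"}


def looks_portuguese(text: str) -> bool:
    """Return True if `text` contains more Portuguese than English stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    pt_hits = sum(token in PT_STOPWORDS for token in tokens)
    en_hits = sum(token in EN_STOPWORDS for token in tokens)
    return pt_hits > en_hits


for summary in ["O resumo do texto é curto e não tem erros.",
                "The summary of the text is short."]:
    status = "ok" if looks_portuguese(summary) else "needs review"
    print(f"{status}: {summary}")
```

In the pipeline, the check would run on `doc._.summary`; a summary flagged as "needs review" could be regenerated or set aside. Very short summaries may contain no stopwords at all and would be flagged as well.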
