
Re-query speaker name when multiple speaker names returned during Group Chat speaker selection #2304

Merged
merged 20 commits into main from requery_speaker_name_on_multiple
Apr 30, 2024

Conversation

@marklysze marklysze commented Apr 6, 2024

Note: See UPDATED approach in the comment below.

Why are these changes needed?

During the speaker selection process (when in "auto" mode), the LLM returns the name of the next speaker. This is fairly reliable with OpenAI's models, but open-source/weight models can sometimes have trouble returning simply the name of the next speaker. Often, they will return a sentence, a paragraph, or even a large sequence of text. Furthermore, I have found that to get the correct next speaker name you often have to prompt these LLMs to provide an explanation, so that a chain of thought leads the LLM to the correct agent; the resulting response can then include the other agent names as part of the reasoning.

Currently, if there is more than one valid agent name it fails the speaker selection process by returning:
GroupChat select_speaker failed to resolve the next speaker's name. This is because the speaker selection OAI call returned: ...

This PR aims to provide a second-chance by prompting the LLM again, this time with a specific prompt text together with the returned text, asking the LLM to provide just the one agent name based on some rules. In my testing, I have found that this simpler step helps to overcome what would be a failing step and, depending on the model, this can drastically reduce the occurrence of the multiple agent names failure.

How does it work?

  • GroupChat has a new attribute, requery_on_multiple_speaker_names (bool), that enables a user to turn this feature on; the default is off (False)
  • During the agent name matching step in speaker selection, GroupChat._finalize_speaker:
    • If a single agent name is detected in the response, it returns the agent (unchanged)
    • If multiple agent names are detected in the response and GroupChat.requery_on_multiple_speaker_names is True it will:
      • Compile a new chat message with a hard-coded prompt, including the response's text as context
      • Send the message to the LLM (the same LLM as the one used for speaker selection)
      • If the new response has just one agent name, it is returned as the next agent (success!)
      • If the new response has no agent name or more than one, it fails with a similar message to the current error message
    • If no agent names are detected, it fails with the current error message (unchanged)
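The steps above can be sketched in plain Python. This is a minimal sketch, not AutoGen's actual implementation: `find_mentioned_agents`, `finalize_speaker`, and the `requery_fn` callback are hypothetical names standing in for the real `GroupChat._finalize_speaker` logic and the second LLM call.

```python
import re

def find_mentioned_agents(response: str, agent_names: list) -> list:
    # Collect the valid agent names that appear in the response, ordered by
    # first occurrence; \b stops a name matching inside a longer identifier.
    hits = []
    for name in agent_names:
        match = re.search(rf"\b{re.escape(name)}\b", response)
        if match:
            hits.append((match.start(), name))
    return [name for _, name in sorted(hits)]

def finalize_speaker(response, agent_names, requery_on_multiple, requery_fn):
    mentioned = find_mentioned_agents(response, agent_names)
    if len(mentioned) == 1:
        return mentioned[0]  # single agent name detected: return it unchanged
    if len(mentioned) > 1 and requery_on_multiple:
        # Second chance: ask the same LLM again with the hard-coded prompt
        # plus the ambiguous response as context.
        retry = find_mentioned_agents(requery_fn(response), agent_names)
        if len(retry) == 1:
            return retry[0]
        raise ValueError("Re-query still returned zero or multiple agent names")
    raise ValueError("Failed to resolve the next speaker's name")
```

In the real code the `requery_fn` step is the extra speaker-selection chat completion; here a plain callable keeps the sketch self-contained.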

The prompt to select the name

I have put together a prompt that performs reasonably well in selecting a speaker name. However, I foresee that this prompt will be tweaked over time and, possibly, have the option to be overridden by the user.

I tried zero-shot prompts and few-shot prompts and, oddly, the zero-shot prompt worked better.

The prompt I have is (where {name} is the response with multiple agent names):

Your role is to identify the current or next speaker based on the provided context.
The valid speaker names are {[agent.name for agent in agents]}.
To determine the speaker use these prioritised rules:
1. If the context refers to themselves as a speaker e.g. "As the..." , choose that speaker's name
2. If it refers to the "next" speaker name, choose that name
3. Otherwise, choose the first provided speaker's name in the context

Respond with just the name of the speaker and do not provide a reason.
Context:
{name}
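As a sketch, the re-query message could be assembled like this. The helper name is hypothetical, the dict mirrors a typical OpenAI-style chat message, and whether it is sent with the system or user role is an implementation detail of the PR:

```python
SELECT_NAME_PROMPT = (
    "Your role is to identify the current or next speaker based on the provided context.\n"
    "The valid speaker names are {agent_names}.\n"
    "To determine the speaker use these prioritised rules:\n"
    "1. If the context refers to themselves as a speaker e.g. \"As the...\" , choose that speaker's name\n"
    "2. If it refers to the \"next\" speaker name, choose that name\n"
    "3. Otherwise, choose the first provided speaker's name in the context\n"
    "\n"
    "Respond with just the name of the speaker and do not provide a reason.\n"
    "Context:\n"
    "{context}"
)

def build_requery_message(agent_names: list, ambiguous_response: str) -> dict:
    # {agent_names} fills in for {[agent.name for agent in agents]} in the
    # prompt text; {context} is the multi-name response being disambiguated.
    return {
        "role": "system",
        "content": SELECT_NAME_PROMPT.format(
            agent_names=agent_names, context=ambiguous_response
        ),
    }
```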

Testing

Tests are based on these 7 multiple agent-name responses (which I have seen through my testing of models):

| # | Response | Expected Agent Name |
|---|----------|---------------------|
| 1 | Product_Manager because they speak after the Chief_Marketing_Officer. | Product_Manager |
| 2 | Thanks Chief_Marketing_Officer, as the Product_Manager my plan is to produce some amazing product ideas. | Product_Manager |
| 3 | Product_Manager. Here are five ideas that I think will impress the Chief_Marketing_Officer and be great for a marketing strategy for the Digital_Marketer. | Product_Manager |
| 4 | The next speaker, after the Chief_Marketing_Officer, is the Product_Manager. | Product_Manager |
| 5 | As the Product_Manager I've decided that an infotainment system that links up with your phone will have the biggest impact on the marketplace. Digital_Marketer, over to you for an amazing strategy. | Product_Manager |
| 6 | Thank you Digital_Marketer, the next speaker will be Chief_Marketing_Officer. | Chief_Marketing_Officer |
| 7 | What a great team! Let's hear from the Chief_Marketing_Officer now that the Product_Manager has some ideas. | Chief_Marketing_Officer |
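The seven cases can be replayed against any candidate selector with a small harness. This is a sketch: `first_mentioned` is a baseline heuristic for comparison, not the PR's re-query logic, and `CASES` is transcribed from the table above.

```python
AGENTS = ["Chief_Marketing_Officer", "Product_Manager", "Digital_Marketer"]

CASES = [  # (ambiguous response, expected agent name)
    ("Product_Manager because they speak after the Chief_Marketing_Officer.",
     "Product_Manager"),
    ("Thanks Chief_Marketing_Officer, as the Product_Manager my plan is to "
     "produce some amazing product ideas.", "Product_Manager"),
    ("Product_Manager. Here are five ideas that I think will impress the "
     "Chief_Marketing_Officer and be great for a marketing strategy for the "
     "Digital_Marketer.", "Product_Manager"),
    ("The next speaker, after the Chief_Marketing_Officer, is the "
     "Product_Manager.", "Product_Manager"),
    ("As the Product_Manager I've decided that an infotainment system that "
     "links up with your phone will have the biggest impact on the "
     "marketplace. Digital_Marketer, over to you for an amazing strategy.",
     "Product_Manager"),
    ("Thank you Digital_Marketer, the next speaker will be "
     "Chief_Marketing_Officer.", "Chief_Marketing_Officer"),
    ("What a great team! Let's hear from the Chief_Marketing_Officer now "
     "that the Product_Manager has some ideas.", "Chief_Marketing_Officer"),
]

def first_mentioned(response: str):
    # Baseline heuristic: pick whichever valid name appears earliest.
    hits = [(response.index(name), name) for name in AGENTS if name in response]
    return min(hits)[1] if hits else None

def score(select_fn) -> int:
    # Number of cases (out of 7) the selector resolves correctly.
    return sum(select_fn(text) == expected for text, expected in CASES)
```

On these cases the first-mention baseline is right only when the correct name happens to come first (cases 1, 3, 5, and 7), which is consistent with the later observation that the first mentioned name is correct more often than not.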

Results

| LLM | Correct out of 7 |
|---|---|
| **Open source/weight** | |
| llama2:13b-chat | 7 / 7 |
| mistral:7b-instruct-v0.2-q6_K | 7 / 7 |
| mixtralq4 | 7 / 7 |
| mixtralq5 | 7 / 7 |
| orca2:13b-q5_K_S | 7 / 7 |
| codellama:34b-instruct | 6 / 7 |
| gemma:2b-instruct | 6 / 7 |
| gemma:7b-instruct | 6 / 7 |
| neural-chat:7b-v3.3-q6_K | 6 / 7 |
| openhermes:7b-mistral-v2.5-q6_K | 6 / 7 |
| qwen:14b-chat-q6_K | 6 / 7 |
| codellama:34b-python | 5 / 7 |
| llama2:7b-chat-q6_K | 5 / 7 |
| yi:34b-chat-q4_K_M | 5 / 7 |
| deepseek-coder:6.7b-instruct-q6_K | 4 / 7 |
| solar:10.7b-instruct-v1-q5_K_M | 4 / 7 |
| phi | 3 / 7 |
| phind-codellama:34b-v2 | 3 / 7 |
| codellama:7b-python | 0 / 7 |
| codellama:13b-python | 0 / 7 |
| nexusraven | 0 / 7 |
| **OpenAI** | |
| gpt-3.5-turbo | 6 / 7 |
| gpt-4 | 6 / 7 |
| gpt-4-turbo-preview | 6 / 7 |
| **Mistral.AI** | |
| small | 6 / 7 |
| medium | 6 / 7 |
| large | 7 / 7 |

I believe this capability will noticeably improve the reliability of GroupChat speaker selection.

What is missing from this PR

Tests - As this PR is based on LLM responses, I need some guidance on what tests (if any) to create. Additionally, as this is focused more towards alt-models, can that be tested in any way? Finally, we would need to make sure that the responses are consistent.

Documentation - Along with the broader need for GroupChat documents (see #2243), I think this could be added to that PR as well as tips for Non-OpenAI models.

Thanks!

Related issue number

Based on shortcomings identified in #1746.

Checks

codecov-commenter commented Apr 6, 2024

Codecov Report

Attention: Patch coverage is 25.00000% with 9 lines in your changes missing coverage. Please review.

Project coverage is 50.01%. Comparing base (4a44093) to head (c464f45).

| Files | Patch % | Lines |
|---|---|---|
| autogen/agentchat/groupchat.py | 25.00% | 9 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2304       +/-   ##
===========================================
+ Coverage   38.14%   50.01%   +11.86%     
===========================================
  Files          78       78               
  Lines        7865     7874        +9     
  Branches     1683     1824      +141     
===========================================
+ Hits         3000     3938      +938     
+ Misses       4615     3605     -1010     
- Partials      250      331       +81     
| Flag | Coverage Δ |
|---|---|
| unittest | 14.21% <16.66%> (?) |
| unittests | 48.98% <25.00%> (+10.85%) ⬆️ |


@marklysze marklysze requested a review from ekzhu April 6, 2024 05:16
@marklysze marklysze self-assigned this Apr 6, 2024

ekzhu commented Apr 6, 2024

Thanks for the analysis! It looks like requerying (i.e., giving the LLM a second chance) makes the selection more robust.

I am wondering, instead of further parametrizing the "auto" method, whether we can add another speaker_selection_method, such as "auto_with_retry", which retries the selection until a single speaker is returned. Do you think this approach goes a step further in addressing the robustness issue?

Effectively, this will be a new built-in speaker selection method. You can see how a user-defined selection method can currently be used here: https://microsoft.github.io/autogen/docs/topics/groupchat/customized_speaker_selection
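The suggested "auto_with_retry" behaviour could be sketched as a wrapper around a single-shot selection call. Names here are hypothetical; in AutoGen proper this would sit behind speaker_selection_method rather than be a bare function:

```python
def auto_with_retry(select_once, max_attempts: int = 3):
    # Wrap a single-shot selection call: keep retrying until it resolves
    # exactly one speaker, or give up after max_attempts and re-raise.
    def select(context):
        last_error = None
        for _ in range(max_attempts):
            try:
                return select_once(context)
            except ValueError as err:  # zero or multiple names resolved
                last_error = err
        raise last_error
    return select
```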

@sonichi sonichi requested review from joshkyh and yiranwu0 April 8, 2024 08:18

@joshkyh joshkyh left a comment

Thanks for the PR! Love the analysis.

@marklysze
Collaborator Author

> Thanks for the analysis! It looks like requerying (i.e., giving the LLM a second chance) makes the selection more robust.
>
> I am wondering, instead of further parametrizing the "auto" method, whether we can add another speaker_selection_method, such as "auto_with_retry", which retries the selection until a single speaker is returned. Do you think this approach goes a step further in addressing the robustness issue?
>
> Effectively, this will be a new built-in speaker selection method. You can see how a user-defined selection method can currently be used here: https://microsoft.github.io/autogen/docs/topics/groupchat/customized_speaker_selection

Thanks for highlighting the possible approach, @ekzhu - I didn't think about it as a new method but it's definitely worth considering.

Regarding the retries until a single speaker is returned:

During my testing, I found that if it didn't succeed with the first re-query, it was because it returned either:

  • text that wouldn't be useful for re-querying (it went off on a tangent)
  • blanks

It was rare for it to return text that still had agent names or useful context to then feed back for a re-query.

It is possible that we could introduce a second re-query prompt that is different from the first one (possibly simpler, like "Select the most prominent speaker name from this list {[agent.name for agent in agents]}. Context: {name}"). If this failed to identify a single agent, we could fall back to the next suggestion (below) or return the failed response.

Alternatively, we could just take the first mentioned name from the original response (which seems to be, more often than not, the correct one) rather than throwing an error.
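The fallback chain being discussed (detailed re-query, then a simpler re-query, then the first mentioned name) could be composed generically. This is a sketch with hypothetical names; each strategy is assumed to return the list of candidate names it resolved:

```python
def resolve_with_fallbacks(response: str, strategies) -> str:
    # Try each strategy in order; the first to isolate exactly one
    # candidate name wins, otherwise fall through to the next.
    for strategy in strategies:
        candidates = strategy(response)
        if len(candidates) == 1:
            return candidates[0]
    raise ValueError("No strategy resolved a single speaker name")
```

A GroupChat could then register the detailed re-query, the simpler prompt, and the first-mention heuristic as three entries in `strategies`, trying the cheapest-to-fail options first.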

New method, auto_with_retry

In terms of adding a new method, I wasn't sure how much "auto" was already used for logic within the code, and was wondering whether adding a new but similar method would mean having to replicate any future changes to auto for auto_with_retry as well.

In groupchat.py it doesn't look like it would be too much work to accommodate both auto and auto_with_retry. I'm happy to add it as a method, which would make it a more obvious approach for users.

Let me know if we do want to make this change and I'll update code.

@marklysze
Collaborator Author

Okay, so the manual selection process breaks when the user does not select an agent (the response is blank or "q" in manual_select_speaker).

So that needs to be handled. As the on-screen direction is "enter nothing or 'q' to use auto selection", I will have it select the next agent if the user doesn't select a valid one during manual selection. I've committed a fix.
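The fallback described can be sketched as follows. The helper is hypothetical (the real change lives in manual_select_speaker, which works with agent objects rather than plain strings):

```python
def resolve_manual_selection(user_input: str, agents: list, current_idx: int) -> str:
    # A valid 1-based number picks that agent; blank, "q", or anything
    # invalid falls back to the next agent in round-robin order.
    choice = user_input.strip()
    if choice.isdigit() and 1 <= int(choice) <= len(agents):
        return agents[int(choice) - 1]
    return agents[(current_idx + 1) % len(agents)]
```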

@sonichi sonichi added this pull request to the merge queue Apr 30, 2024
Merged via the queue into main with commit 5b6ae32 Apr 30, 2024
75 of 85 checks passed
@sonichi sonichi deleted the requery_speaker_name_on_multiple branch April 30, 2024 04:21
jayralencar pushed a commit to jayralencar/autogen that referenced this pull request May 28, 2024
Re-query speaker name when multiple speaker names returned during Group Chat speaker selection (microsoft#2304)

* Added requery_on_multiple_speaker_names to GroupChat and updated _finalize_speaker to requery on multiple speaker names (if enabled)

* Removed unnecessary comments

* Update to current main

* Tweak error message.

* Comment clarity

* Expanded description of Group Chat requery_on_multiple_speaker_names

* Reworked to two-way nested chat for speaker selection with default of 2 retries.

* Adding validation of new GroupChat attributes

* Updates as per @ekzhu's suggestions

* Update groupchat

- Added select_speaker_auto_multiple_template and select_speaker_auto_none_template
- Added max_attempts comment
- Re-instated support for role_for_select_speaker_messages

* Update conversable_agent.py

Added ability to force override role for a message to support select speaker prompt.

* Update test_groupchat.py

Updated existing select_speaker test functions as underlying approach has changed, added necessary tests for new functionality.

* Removed block for manual selection in select_speaker function.

* Catered for no-selection during manual selection mode

---------

Co-authored-by: Chi Wang <wang.chi@microsoft.com>
Labels: alt-models (pertains to using alternate, non-GPT models, e.g., local models, llama, etc.), group chat (group-chat-related issues)