
Commit

Merge branch 'main' into main
Hk669 authored May 21, 2024
2 parents 4725328 + 90ca2ca commit 2270927
Showing 155 changed files with 11,240 additions and 8,088 deletions.
38 changes: 37 additions & 1 deletion .github/workflows/contrib-openai.yml
@@ -74,7 +74,43 @@ jobs:
with:
file: ./coverage.xml
flags: unittests

AgentEvalTest:
strategy:
matrix:
os: [ubuntu-latest]
python-version: ["3.10"]
runs-on: ${{ matrix.os }}
environment: openai1
steps:
# checkout to pr branch
- name: Checkout
uses: actions/checkout@v4
with:
ref: ${{ github.event.pull_request.head.sha }}
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install packages and dependencies
run: |
docker --version
python -m pip install --upgrade pip wheel
pip install -e .
python -c "import autogen"
pip install "pytest-cov>=5" pytest-asyncio
- name: Coverage
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
AZURE_OPENAI_API_KEY: ${{ secrets.AZURE_OPENAI_API_KEY }}
AZURE_OPENAI_API_BASE: ${{ secrets.AZURE_OPENAI_API_BASE }}
OAI_CONFIG_LIST: ${{ secrets.OAI_CONFIG_LIST }}
run: |
pytest test/agentchat/contrib/agent_eval/test_agent_eval.py
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v3
with:
file: ./coverage.xml
flags: unittests
CompressionTest:
strategy:
matrix:
29 changes: 29 additions & 0 deletions .github/workflows/contrib-tests.yml
@@ -125,6 +125,35 @@ jobs:
file: ./coverage.xml
flags: unittests

AgentEvalTest:
strategy:
fail-fast: false
matrix:
os: [ubuntu-latest]
python-version: ["3.10"]
runs-on: ${{ matrix.os }}
steps:
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install packages and dependencies for all tests
run: |
python -m pip install --upgrade pip wheel
pip install "pytest-cov>=5"
- name: Install packages and dependencies for AgentEval
run: |
pip install -e .
- name: Coverage
run: |
pytest test/agentchat/contrib/agent_eval/ --skip-openai
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v3
with:
file: ./coverage.xml
flags: unittests

CompressionTest:
runs-on: ${{ matrix.os }}
strategy:
16 changes: 11 additions & 5 deletions README.md
@@ -14,27 +14,33 @@
<img src="https://github.com/microsoft/autogen/blob/main/website/static/img/flaml.svg" width=200>
<br>
</p> -->
:fire: May 13, 2024: [The Economist](https://www.economist.com/science-and-technology/2024/05/13/todays-ai-models-are-impressive-teams-of-them-will-be-formidable) published an article about multi-agent systems (MAS) following a January 2024 interview with [Chi Wang](https://github.com/sonichi).

:fire: May 11, 2024: [AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation](https://openreview.net/pdf?id=uAjxFFing2) received the best paper award in [ICLR 2024 LLM Agents Workshop](https://llmagents.github.io/).

:fire: Apr 26, 2024: [AutoGen.NET](https://microsoft.github.io/autogen-for-net/) is available for .NET developers!

:fire: Apr 17, 2024: Andrew Ng cited AutoGen in [The Batch newsletter](https://www.deeplearning.ai/the-batch/issue-245/) and [What's next for AI agentic workflows](https://youtu.be/sal78ACtGTc?si=JduUzN_1kDnMq0vF) at Sequoia Capital's AI Ascent (Mar 26).

:fire: Mar 3, 2024: What's new in AutoGen? 📰[Blog](https://microsoft.github.io/autogen/blog/2024/03/03/AutoGen-Update); 📺[Youtube](https://www.youtube.com/watch?v=j_mtwQiaLGU).

:fire: Mar 1, 2024: The first AutoGen multi-agent experiment on the challenging [GAIA](https://huggingface.co/spaces/gaia-benchmark/leaderboard) benchmark achieved No. 1 accuracy across all three levels.

:tada: Jan 30, 2024: AutoGen is highlighted by Peter Lee in Microsoft Research Forum [Keynote](https://t.co/nUBSjPDjqD).
<!-- :tada: Jan 30, 2024: AutoGen is highlighted by Peter Lee in Microsoft Research Forum [Keynote](https://t.co/nUBSjPDjqD). -->

:tada: Dec 31, 2023: [AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework](https://arxiv.org/abs/2308.08155) is selected by [TheSequence: My Five Favorite AI Papers of 2023](https://thesequence.substack.com/p/my-five-favorite-ai-papers-of-2023).

<!-- :fire: Nov 24: pyautogen [v0.2](https://github.com/microsoft/autogen/releases/tag/v0.2.0) is released with many updates and new features compared to v0.1.1. It switches to using openai-python v1. Please read the [migration guide](https://microsoft.github.io/autogen/docs/Installation#python). -->

<!-- :fire: Nov 11: OpenAI's Assistants are available in AutoGen and interoperable with other AutoGen agents! Check out our [blogpost](https://microsoft.github.io/autogen/blog/2023/11/13/OAI-assistants) for details and examples. -->

:tada: Nov 8, 2023: AutoGen is selected into [Open100: Top 100 Open Source achievements](https://www.benchcouncil.org/evaluation/opencs/annual.html) 35 days after spinoff.
:tada: Nov 8, 2023: AutoGen is selected into [Open100: Top 100 Open Source achievements](https://www.benchcouncil.org/evaluation/opencs/annual.html) 35 days after spinoff from [FLAML](https://github.com/microsoft/FLAML).

:tada: Nov 6, 2023: AutoGen is mentioned by Satya Nadella in a [fireside chat](https://youtu.be/0pLBvgYtv6U).
<!-- :tada: Nov 6, 2023: AutoGen is mentioned by Satya Nadella in a [fireside chat](https://youtu.be/0pLBvgYtv6U). -->

:tada: Nov 1, 2023: AutoGen is the top trending repo on GitHub in October 2023.
<!-- :tada: Nov 1, 2023: AutoGen is the top trending repo on GitHub in October 2023. -->

:tada: Oct 03, 2023: AutoGen spins off from FLAML on GitHub and has a major paper update (first version on Aug 16).
<!-- :tada: Oct 03, 2023: AutoGen spins off from [FLAML](https://github.com/microsoft/FLAML) on GitHub. -->

<!-- :tada: Aug 16: Paper about AutoGen on [arxiv](https://arxiv.org/abs/2308.08155). -->

7 changes: 7 additions & 0 deletions autogen/agentchat/contrib/agent_eval/README.md
@@ -0,0 +1,7 @@
Agents for running the AgentEval pipeline.

AgentEval is a process for evaluating an LLM-based system's performance on a given task.

When given a task to evaluate and a few example runs, the critic and subcritic agents create evaluation criteria for evaluating a system's solution. Once the criteria have been created, the quantifier agent can evaluate subsequent task solutions based on the generated criteria.

For more information see: [AgentEval Integration Roadmap](https://github.com/microsoft/autogen/issues/2162)
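A minimal end-to-end sketch of the pipeline using the two helpers added in this PR (`generate_criteria` and `quantify_criteria`). It assumes an `OAI_CONFIG_LIST` file for the LLM configuration and that the `Task` model exposes `name`, `description`, `successful_response`, and `failed_response` fields; the task and test-case values are illustrative only, not part of this change.

```python
import json

import autogen
from autogen.agentchat.contrib.agent_eval.agent_eval import generate_criteria, quantify_criteria
from autogen.agentchat.contrib.agent_eval.task import Task

# Assumption: an OAI_CONFIG_LIST file is available; any valid llm_config works.
llm_config = {"config_list": autogen.config_list_from_json("OAI_CONFIG_LIST")}

# Hypothetical task with one successful and one failed example run.
task = Task(
    name="Math problem solving",
    description="Given any math problem, determine whether it is solved correctly.",
    successful_response="The answer is 5. TERMINATE",
    failed_response="I am not sure how to solve this problem.",
)

# Step 1: the critic (optionally helped by the subcritic) proposes evaluation criteria.
criteria = generate_criteria(llm_config=llm_config, task=task, use_subcritic=False)

# Step 2: the quantifier scores a new solution against those criteria.
results = quantify_criteria(
    llm_config=llm_config,
    criteria=criteria,
    task=task,
    test_case="Problem: 2 + 3. Answer: 5. TERMINATE",  # illustrative test case
    ground_truth="true",
)
print(json.dumps(results, indent=2))
```

Passing `use_subcritic=True` to `generate_criteria` additionally routes the critic's output through the subcritic agent, which expands each criterion into subcriteria before quantification.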
101 changes: 101 additions & 0 deletions autogen/agentchat/contrib/agent_eval/agent_eval.py
@@ -0,0 +1,101 @@
from typing import Dict, List, Literal, Optional, Union

import autogen
from autogen.agentchat.contrib.agent_eval.criterion import Criterion
from autogen.agentchat.contrib.agent_eval.critic_agent import CriticAgent
from autogen.agentchat.contrib.agent_eval.quantifier_agent import QuantifierAgent
from autogen.agentchat.contrib.agent_eval.subcritic_agent import SubCriticAgent
from autogen.agentchat.contrib.agent_eval.task import Task


def generate_criteria(
llm_config: Optional[Union[Dict, Literal[False]]] = None,
task: Task = None,
additional_instructions: str = "",
max_round=2,
use_subcritic: bool = False,
):
"""
Creates a list of criteria for evaluating the utility of a given task.
Args:
llm_config (dict or bool): llm inference configuration.
task (Task): The task to evaluate.
additional_instructions (str): Additional instructions for the criteria agent.
max_round (int): The maximum number of rounds to run the conversation.
use_subcritic (bool): Whether to use the subcritic agent to generate subcriteria.
Returns:
list: A list of Criterion objects for evaluating the utility of the given task.
"""
critic = CriticAgent(
system_message=CriticAgent.DEFAULT_SYSTEM_MESSAGE + "\n" + additional_instructions,
llm_config=llm_config,
)

critic_user = autogen.UserProxyAgent(
name="critic_user",
max_consecutive_auto_reply=0, # terminate without auto-reply
human_input_mode="NEVER",
code_execution_config={"use_docker": False},
)

agents = [critic_user, critic]

if use_subcritic:
subcritic = SubCriticAgent(
llm_config=llm_config,
)
agents.append(subcritic)

groupchat = autogen.GroupChat(
agents=agents, messages=[], max_round=max_round, speaker_selection_method="round_robin"
)
critic_manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)

critic_user.initiate_chat(critic_manager, message=task.get_sys_message())
criteria = critic_user.last_message()
content = criteria["content"]
# need to strip out any extra code around the returned json
content = content[content.find("[") : content.rfind("]") + 1]
criteria = Criterion.parse_json_str(content)
return criteria


def quantify_criteria(
llm_config: Optional[Union[Dict, Literal[False]]] = None,
criteria: List[Criterion] = None,
task: Task = None,
test_case: str = "",
ground_truth: str = "",
):
"""
Quantifies the performance of a system using the provided criteria.
Args:
llm_config (dict or bool): llm inference configuration.
criteria ([Criterion]): A list of criteria for evaluating the utility of a given task.
task (Task): The task to evaluate.
test_case (str): The test case to evaluate.
ground_truth (str): The ground truth for the test case.
Returns:
dict: A dictionary where the keys are the criteria and the values are the assessed performance based on accepted values for each criterion.
"""
quantifier = QuantifierAgent(
llm_config=llm_config,
)

quantifier_user = autogen.UserProxyAgent(
name="quantifier_user",
max_consecutive_auto_reply=0, # terminate without auto-reply
human_input_mode="NEVER",
code_execution_config={"use_docker": False},
)

quantifier_user.initiate_chat( # noqa: F841
quantifier,
message=task.get_sys_message()
+ "Evaluation dictionary: "
+ Criterion.write_json(criteria)
+ "actual test case to evaluate: "
+ test_case,
)
quantified_results = quantifier_user.last_message()
return {"actual_success": ground_truth, "estimated_performance": quantified_results["content"]}
41 changes: 41 additions & 0 deletions autogen/agentchat/contrib/agent_eval/criterion.py
@@ -0,0 +1,41 @@
from __future__ import annotations

import json
from typing import List

from pydantic import BaseModel


class Criterion(BaseModel):
"""
A class that represents a criterion for agent evaluation.
"""

name: str
description: str
accepted_values: List[str]
sub_criteria: List[Criterion] = list()

@staticmethod
def parse_json_str(criteria: str):
"""
Create a list of Criterion objects from a json string.
Args:
criteria (str): JSON string that represents the criteria.
Returns:
[Criterion]: A list of Criterion objects represented by the JSON string.
"""
return [Criterion(**crit) for crit in json.loads(criteria)]

@staticmethod
def write_json(criteria):
"""
Create a json string from a list of Criterion objects.
Args:
criteria ([Criterion]): A list of Criterion objects.
Returns:
str: A json string that represents the list of Criterion objects.
"""
return json.dumps([crit.model_dump() for crit in criteria], indent=2)
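
For reference, a quick round-trip sketch of these helpers with a nested criterion; the values are made up for illustration (real criteria come from the critic and subcritic agents):

```python
from autogen.agentchat.contrib.agent_eval.criterion import Criterion

# Illustrative criteria only.
criteria = [
    Criterion(
        name="accuracy",
        description="Whether the final answer is mathematically correct.",
        accepted_values=["incorrect", "partially correct", "correct"],
        sub_criteria=[
            Criterion(
                name="arithmetic",
                description="Correctness of the intermediate arithmetic steps.",
                accepted_values=["poor", "fair", "good"],
            )
        ],
    )
]

json_str = Criterion.write_json(criteria)      # list of Criterion -> JSON string
restored = Criterion.parse_json_str(json_str)  # JSON string -> list of Criterion
assert restored[0].sub_criteria[0].name == "arithmetic"
```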
41 changes: 41 additions & 0 deletions autogen/agentchat/contrib/agent_eval/critic_agent.py
@@ -0,0 +1,41 @@
from typing import Optional

from autogen.agentchat.conversable_agent import ConversableAgent


class CriticAgent(ConversableAgent):
"""
An agent for creating a list of criteria for evaluating the utility of a given task.
"""

DEFAULT_SYSTEM_MESSAGE = """You are a helpful assistant. You suggest criteria for evaluating different tasks. They should be distinguishable, quantifiable and not redundant.
Convert the evaluation criteria into a list where each item is a criteria which consists of the following dictionary as follows
{"name": name of the criterion, "description": criteria description , "accepted_values": possible accepted inputs for this key}
Make sure "accepted_values" include the acceptable inputs for each key that are fine-grained and preferably multi-graded levels and "description" includes the criterion description.
Output just the criteria string you have created, no code.
"""

DEFAULT_DESCRIPTION = "An AI agent for creating list criteria for evaluating the utility of a given task."

def __init__(
self,
name="critic",
system_message: Optional[str] = DEFAULT_SYSTEM_MESSAGE,
description: Optional[str] = DEFAULT_DESCRIPTION,
**kwargs,
):
"""
Args:
name (str): agent name.
system_message (str): system message for the ChatCompletion inference.
Please override this attribute if you want to reprogram the agent.
description (str): The description of the agent.
**kwargs (dict): Please refer to other kwargs in
[ConversableAgent](../../conversable_agent#__init__).
"""
super().__init__(
name=name,
system_message=system_message,
description=description,
**kwargs,
)
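
`generate_criteria` builds this agent by appending its `additional_instructions` argument to `DEFAULT_SYSTEM_MESSAGE`; the same pattern applies when instantiating the critic directly. A sketch assuming an `OAI_CONFIG_LIST`-based configuration; the extra instruction is hypothetical:

```python
import autogen
from autogen.agentchat.contrib.agent_eval.critic_agent import CriticAgent

# Assumption: an OAI_CONFIG_LIST file provides the model configuration.
llm_config = {"config_list": autogen.config_list_from_json("OAI_CONFIG_LIST")}

critic = CriticAgent(
    # Hypothetical extra instruction appended to the default prompt,
    # mirroring what generate_criteria does with additional_instructions.
    system_message=CriticAgent.DEFAULT_SYSTEM_MESSAGE + "\nLimit yourself to at most five criteria.",
    llm_config=llm_config,
)
```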
36 changes: 36 additions & 0 deletions autogen/agentchat/contrib/agent_eval/quantifier_agent.py
@@ -0,0 +1,36 @@
from typing import Optional

from autogen.agentchat.conversable_agent import ConversableAgent


class QuantifierAgent(ConversableAgent):
"""
An agent for quantifying the performance of a system using the provided criteria.
"""

DEFAULT_SYSTEM_MESSAGE = """"You are a helpful assistant. You quantify the output of different tasks based on the given criteria.
The criterion is given in a json list format where each element is a distinct criteria.
The each element is a dictionary as follows {"name": name of the criterion, "description": criteria description , "accepted_values": possible accepted inputs for this key}
You are going to quantify each of the crieria for a given task based on the task description.
Return a dictionary where the keys are the criteria and the values are the assessed performance based on accepted values for each criteria.
Return only the dictionary, no code."""

DEFAULT_DESCRIPTION = "An AI agent for quantifing the performance of a system using the provided criteria."

def __init__(
self,
name="quantifier",
system_message: Optional[str] = DEFAULT_SYSTEM_MESSAGE,
description: Optional[str] = DEFAULT_DESCRIPTION,
**kwargs,
):
"""
Args:
name (str): agent name.
system_message (str): system message for the ChatCompletion inference.
Please override this attribute if you want to reprogram the agent.
description (str): The description of the agent.
**kwargs (dict): Please refer to other kwargs in
[ConversableAgent](../../conversable_agent#__init__).
"""
super().__init__(name=name, system_message=system_message, description=description, **kwargs)
42 changes: 42 additions & 0 deletions autogen/agentchat/contrib/agent_eval/subcritic_agent.py
@@ -0,0 +1,42 @@
from typing import Optional

from autogen.agentchat.conversable_agent import ConversableAgent


class SubCriticAgent(ConversableAgent):
"""
An agent for creating subcriteria from a given list of criteria for evaluating the utility of a given task.
"""

DEFAULT_SYSTEM_MESSAGE = """You are a helpful assistant to the critic agent. You suggest sub criteria for evaluating different tasks based on the criteria provided by the critic agent (if you feel it is needed).
They should be distinguishable, quantifiable, and related to the overall theme of the critic's provided criteria.
You operate by taking in the description of the criteria. You then create a new key called sub criteria where you provide the sub criteria for the given criteria.
The value of the sub_criteria is a dictionary where the keys are the subcriteria and each value is as follows {"description": sub criteria description , "accepted_values": possible accepted inputs for this key}
Do this for each criteria provided by the critic (removing the criteria's accepted values). "accepted_values" include the acceptable inputs for each key that are fine-grained and preferably multi-graded levels. "description" includes the criterion description.
Once you have created the sub criteria for the given criteria, you return the json (make sure to include the contents of the critic's dictionary in the final dictionary as well).
Make sure to return a valid json and no code"""

DEFAULT_DESCRIPTION = "An AI agent for creating subcriteria from a given list of criteria."

def __init__(
self,
name="subcritic",
system_message: Optional[str] = DEFAULT_SYSTEM_MESSAGE,
description: Optional[str] = DEFAULT_DESCRIPTION,
**kwargs,
):
"""
Args:
name (str): agent name.
system_message (str): system message for the ChatCompletion inference.
Please override this attribute if you want to reprogram the agent.
description (str): The description of the agent.
**kwargs (dict): Please refer to other kwargs in
[ConversableAgent](../../conversable_agent#__init__).
"""
super().__init__(
name=name,
system_message=system_message,
description=description,
**kwargs,
)