GPRO - Feature Addition #272

Soham4001A · 2025-01-29T17:25:01Z

Description

This PR introduces Generalized Policy Reward Optimization (GRPO) as a new feature in stable-baselines3-contrib. GRPO extends Proximal Policy Optimization (PPO) by incorporating:
• Sub-step sampling per macro step, allowing multiple forward passes before environment transitions.
• Customizable reward scaling, enabling users to pass their own scaling functions or use the default tanh-based normalization.
• Better adaptability in reinforcement learning (RL) tasks, particularly for tracking and dynamic environments.
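
As a rough illustration of the customizable reward scaling described above, the sketch below shows how a user-supplied function could override a tanh-based default. The names `default_tanh_scaling` and `scale_rewards` are hypothetical and not taken from this PR. The exact form of the default (standardize the rewards, then apply tanh) is also an assumption.

```python
import math
from typing import Callable, List, Optional, Sequence

ScalingFn = Callable[[Sequence[float]], List[float]]


def default_tanh_scaling(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Hypothetical default: standardize rewards, then squash into (-1, 1) with tanh."""
    if not rewards:
        raise ValueError("rewards must not be empty")
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps avoids division by zero when all rewards are identical.
    return [math.tanh((r - mean) / (std + eps)) for r in rewards]


def scale_rewards(
    rewards: Sequence[float], scaling_fn: Optional[ScalingFn] = None
) -> List[float]:
    """Apply the user's scaling function if given, else the tanh-based default."""
    fn = scaling_fn if scaling_fn is not None else default_tanh_scaling
    return fn(rewards)
```

A caller could pass any function with the same shape, for example `scale_rewards(rewards, scaling_fn=lambda rs: [0.5 * r for r in rs])`, to replace the default.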

GRPO allows agents to explore action spaces more efficiently and refine their policy updates through multiple evaluations per time step.
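
The idea of multiple evaluations per time step can be sketched as follows: draw several candidate actions from the policy for one observation, score each with a reward function, and compare each sample against the group. All names here (`sample_sub_steps`, `group_relative_advantages`, the toy policy and reward) are hypothetical. Normalizing by the group mean and standard deviation is an assumption, not necessarily what this PR implements.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def sample_sub_steps(
    policy_sample: Callable[[float], float],
    reward_fn: Callable[[float, float], float],
    observation: float,
    n_samples: int,
) -> Tuple[List[float], List[float]]:
    """Draw n_samples actions for one observation and score each of them."""
    actions = [policy_sample(observation) for _ in range(n_samples)]
    rewards = [reward_fn(observation, a) for a in actions]
    return actions, rewards


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Assumed scheme: each reward relative to the group mean, in std units."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Toy usage: a Gaussian "policy" centered on the observation and a reward
# that prefers actions close to 1.0 (both invented for illustration).
rng = random.Random(0)
actions, rewards = sample_sub_steps(
    policy_sample=lambda obs: rng.gauss(obs, 1.0),
    reward_fn=lambda obs, a: -((a - 1.0) ** 2),
    observation=0.0,
    n_samples=8,
)
advantages = group_relative_advantages(rewards)
```

In this sketch the environment is not stepped between samples; only the final chosen action would trigger an environment transition.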

Context

DLR-RM/stable-baselines3#2076

[x] I have raised an issue to propose this change (required)

Types of changes

[ ] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)
[ ] Documentation (update in the documentation)

Checklist:

[x] I've read the CONTRIBUTION guide (required)
[x] The functionality/performance matches that of the source (required for new training algorithms or training-related features).
[x] I have updated the tests accordingly (required for a bug fix or a new feature).
[x] I have included an example of using the feature (required for new features).
[x] I have included baseline results (required for new training algorithms or training-related features).
[ ] I have updated the documentation accordingly.
[x] I have updated the changelog accordingly (required).
[x] I have reformatted the code using make format (required)
[x] I have checked the codestyle using make check-codestyle and make lint (required)
[x] I have ensured make pytest and make type both pass. (required)

Soham Sane added 5 commits January 29, 2025 10:49

init - untested

c07a7b4

Reformatted but yet untested - still need to edit test files

4643314

Ready for PR (Untested Still

adb6110

Ready for PR - Tested

0c33dbb

Changelog updated

70b67f6

This was referenced Jan 29, 2025

Group Relative Proximity Optimization (GRPO) - New Feature DLR-RM/stable-baselines3#2076

Closed

[Feature Request] Group Relative Proximity Optimization (GRPO) #273

Open

araffin changed the base branch from feat/cem to master January 30, 2025 08:44

Updated GRPO to use environment reward function for sampled rewards

306ae63
