Roadmap/list of tasks #58

pepinu · 2025-01-26T16:02:15Z

Hey all, love the project idea.

I wonder if there is a plan to outline what is still missing from the repo to get R1 fully reproduced, maybe using the Projects tab or something similar? That way we could track how much work is needed for a full reproduction.

Right now there is a Plan of Attack in the README, which is a good start. However, it is hard to tell whether step 1 is almost done in the codebase or whether there are 2, 3, or 8 things that still have to be completed.

I think that would help attract more enthusiastic people to the project by giving a ballpark of how far along we are, what is yet to be done, etc.

Just throwing this out there, if I missed similar documentation I'd be grateful for any links 😄

pepinu (Author) · 2025-01-26T23:16:13Z

Alright, so I've done a little digging into the trl and distilabel codebases and tried to categorize the current state of the repository versus what is missing to get it to the state described in the paper.

Keep in mind that I'm not well versed (or versed at all 😄) in either trl or distilabel, so I could be missing something. Feel free to comment so we get the full picture.

Missing functionality for full reproduction:

- GRPOTrainer has no group-based advantage normalization (Paper Eq. 3): A_i = (r_i - group_mean)/group_std would need to be implemented in trl
- System configuration: group_size, ε (epsilon) for clipping, and β (beta) for KL regularization
- Multi-stage training pipeline (cold-start trainer, RL trainer, rejection sampler, final RL alignment)
- Data generation lacks rejection sampling, language mixing filters, and special token formatting (§2.3.1)
- No special token handling
- No data curation process (§2.4)
- No model-size adaptations
- No distillation trainer
- Rewards: weighted combination w/ language consistency reward
- Multi-reward system missing; currently only a single reward is supported
- Evaluators: Codeforces, LiveCodeBench, SWE-Bench, length-controlled scoring
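
For reference, the group-based normalization in Eq. 3 can be sketched in a few lines of plain Python. This is only an illustration, not trl code: the function name `group_advantages` and the small `eps` added to the denominator are assumptions, and since the formula above does not say whether population or sample standard deviation is meant, population is used here.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of completions sampled for the same prompt.

    Implements A_i = (r_i - group_mean) / group_std. The `eps` term is an
    assumption added to avoid division by zero when all rewards are equal.
    """
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # population std; an assumption, not from the paper
    return [(r - group_mean) / (group_std + eps) for r in rewards]


# Four completions for one prompt, two judged correct and two incorrect.
advantages = group_advantages([1.0, 0.0, 1.0, 0.0])
```

Completions that beat the group average get a positive advantage and the rest a negative one. When every completion in a group receives the same reward, all advantages are zero and the group contributes no learning signal.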

This might be a good start. I suppose you already have such a plan outlined internally and some of you are working on these items, hence the question: how can we contribute?

Implementation Status Table

| Component | Current Status | Missing Elements |
| --- | --- | --- |
| Core Training Pipeline | Basic SFT + GRPO scripts exist | Multi-stage orchestration (cold-start → RL → rejection → final RL) |
| GRPO Mechanics | Basic PPO implementation with single reward | Group advantage normalization; KL penalty term (β); Epsilon (ε) clipping; Multi-reward system support |
| Special Token Handling | Simple / template | Cold-start formatting tokens (<\|reasoning\|>, <\|summary\|>); Structured extraction; Special token formatting (§2.3.1) |
| Reward System | Basic accuracy reward | Language consistency reward; Reward weighting system; Multi-reward combination; Multi-reward support |
| Data Generation | Basic Distilabel pipeline | Rejection sampling; Language mixing filters; Quality filtering; Data curation process (§2.4) |
| Distillation Pipeline | Not implemented | Complete distillation trainer |
| GRPO Configuration | Hardcoded parameters | Dataclass with: group_size, ε (epsilon) for clipping, β (beta) for KL regularization |
| Evaluation Framework | Basic metrics | Codeforces; LiveCodeBench; SWE-Bench; Length-controlled scoring |
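
To make the "GRPO Configuration" and "GRPO Mechanics" rows more concrete, here is a minimal sketch of a config dataclass together with the per-sample objective term as I read it from the paper: the clipped ratio term minus β times a KL estimate. The names `GRPOSettings`, `clipped_term`, `kl_estimate`, and `objective_term` are hypothetical and are not trl's API, and the default values are placeholders rather than the paper's hyperparameters.

```python
import math
from dataclasses import dataclass


@dataclass
class GRPOSettings:  # hypothetical name, not a trl class
    group_size: int = 8    # completions sampled per prompt (placeholder value)
    epsilon: float = 0.2   # clipping range for the probability ratio (placeholder)
    beta: float = 0.04     # weight of the KL regularization term (placeholder)


def clipped_term(ratio, advantage, cfg):
    """PPO-style clipped surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - cfg.epsilon, min(ratio, 1.0 + cfg.epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_estimate(logp_ref, logp_policy):
    """Non-negative KL estimate: pi_ref/pi - log(pi_ref/pi) - 1."""
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0


def objective_term(ratio, advantage, logp_ref, logp_policy, cfg):
    """Quantity to maximize for one sample: clipped surrogate minus beta * KL."""
    return clipped_term(ratio, advantage, cfg) - cfg.beta * kl_estimate(logp_ref, logp_policy)


cfg = GRPOSettings()
term = objective_term(ratio=1.5, advantage=1.0, logp_ref=-1.0, logp_policy=-1.0, cfg=cfg)
```

With a positive advantage, a ratio of 1.5 is clipped to 1 + ε, so the update cannot push the policy arbitrarily far from the one that generated the samples. The KL estimate is zero when the policy and the reference assign the same log-probability.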

Also, I thought it could be a good idea to write a guide on how to combine free credits from cloud providers to run this on 1xH100 at no cost, making the repo more accessible to newcomers who do not have the required compute on hand.

agulati18 · 2025-01-27T04:33:14Z

This is great! Thanks for putting this together

For "GRPOTrainer has no group-based advantage normalization (Paper Eq. 3): A_i = (r_i - group_mean)/group_std", there is a PR here: #66

Knight7561 · 2025-01-27T18:06:13Z

Thanks for putting this up! Would love to follow up and contribute.

hesamsheikh · 2025-01-28T15:05:12Z

Regarding "Rewards: weighted combination w/ language consistency reward": I submitted a PR for a weighted combination of multiple rewards. Tell me what you think.
huggingface/trl#2676
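
As a rough illustration of the idea (not the code in that PR), a weighted combination of several reward functions can be sketched as below. The helper `combine_rewards` and both toy reward functions are hypothetical. In particular, `toy_language_consistency` is a crude stand-in that only counts pure-ASCII tokens, and `toy_accuracy` is a placeholder check.

```python
def combine_rewards(completion, reward_fns, weights):
    """Return the weighted sum of several reward functions for one completion."""
    if len(reward_fns) != len(weights):
        raise ValueError("need exactly one weight per reward function")
    return sum(w * fn(completion) for fn, w in zip(reward_fns, weights))


def toy_accuracy(completion):
    # Placeholder check; a real accuracy reward would verify the final answer.
    return 1.0 if completion.strip().endswith("4") else 0.0


def toy_language_consistency(completion):
    # Crude stand-in: fraction of whitespace-separated tokens that are pure ASCII.
    tokens = completion.split()
    if not tokens:
        return 0.0
    return sum(t.isascii() for t in tokens) / len(tokens)


score = combine_rewards(
    "The answer is 4",
    [toy_accuracy, toy_language_consistency],
    [1.0, 0.5],
)
```

Setting every weight to 1.0 reduces this to a plain sum of the rewards, so the single-reward case is just a list with one function and one weight.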

yld3 · 2025-02-10T01:24:32Z

I looked at the technical report and could not find explicit implementation details for the accuracy reward and the format reward. These are crucial for getting the GRPO loss function right, so I am wondering whether they are:

- standard: then I would love some pointers to references where they are more formally defined
- non-standard: then I would be interested to know the status of their implementation in this repo. Is it a first shot? An educated guess? Or an implementation based on unpublished but known details about GRPO?

Thanks, and very impressive project. I hope contributors manage to reproduce the training of DeepSeek!
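
For context, the report describes both rewards only informally: a rule-based accuracy reward that checks a final answer given in a specified format, and a format reward that requires the reasoning to sit between `<think>` and `</think>` tags. A minimal sketch under those assumptions follows. The `<answer>` tag convention, the exact-match comparison, the 0/1 reward values, and the function names are illustrative guesses, not this repo's or DeepSeek's implementation.

```python
import re

# Assumed layout: <think>...</think> followed by <answer>...</answer>.
FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(completion):
    """Return 1.0 if the whole completion follows the assumed tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.fullmatch(completion.strip()) else 0.0


def accuracy_reward(completion, reference):
    """Return 1.0 if the extracted answer exactly matches the reference, else 0.0."""
    match = ANSWER_RE.search(completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0


good = "<think>2 + 2 = 4</think> <answer>4</answer>"
bad = "The answer is 4"
```

Exact string matching is the simplest possible check. Math answers would likely need normalization or symbolic comparison, and code answers would need test execution, which this sketch does not attempt.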
