Replies: 5 comments
-
Alright, so I've done a little digging into the trl and distilabel codebases and tried to categorize the current state of the repository versus what is missing to get it to the state described in the paper. Keep in mind that I'm not well versed (or versed at all 😄) in either trl or distilabel, so I could be missing something. Feel free to comment, so we get the full picture.
Missing functionality for full reproduction:
This might be a good start. I suppose you have such a plan outlined internally and some of you are already working on these items, hence the question: how can we contribute?
Implementation Status Table
Also, I thought it could be a good idea to have a guide on how to use free credits from a cloud provider to run this on 1xH100 at no cost, to make the repo more accessible to newcomers who do not have the required compute on hand.
-
This is great! Thanks for putting this together.
For the item "GRPOTrainer has no group-based advantage normalization (Paper Eq. 3): A_i = (r_i - group_mean) / group_std", there is a PR here: #66
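For anyone following along, here is a minimal sketch of what that normalization could look like in plain Python. This is not the code from the PR or from trl. The function name `group_advantages`, the `eps` term, and the choice of population standard deviation are my own assumptions for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one group of completions for the same prompt.

    Implements A_i = (r_i - group_mean) / group_std for a single group.
    `eps` is an assumed stabilizer so a group with identical rewards
    does not divide by zero.
    """
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # population std; an assumption, not confirmed by the paper
    return [(r - group_mean) / (group_std + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)
print(advantages)
```

Completions that score above their group's mean get a positive advantage and those below get a negative one, so no separate value model is needed to provide the baseline.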
-
Thanks for putting this up! Would love to follow up and contribute.
-
I looked at the technical report and could not find explicit details regarding implementation for the accuracy reward and the format reward. This is crucial in order to get the GRPO loss function correct so I am wondering if this is either
Thanks, and very impressive project. I hope contributors will manage to reproduce the training of DeepSeek!
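To make the question concrete, here is one possible reading of the two rule-based rewards, written as a rough sketch. Everything here is an assumption on my part: the `<think>`/`<answer>` tag layout, exact string matching for correctness, the 0/1 reward values, and the function names `format_reward` and `accuracy_reward`. The report does not confirm any of these details.

```python
import re

# Assumed layout: reasoning inside <think> tags, then the final answer inside <answer> tags.
FORMAT_PATTERN = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed tag layout, else 0.0."""
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the extracted answer equals the reference string, else 0.0.

    Exact string comparison is a simplification; real math answers would
    likely need parsing and equivalence checking.
    """
    match = ANSWER_PATTERN.search(completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0


# Hypothetical completion for the prompt "What is 2 + 2?"
completion = "<think>2 + 2 is 4</think>\n<answer>4</answer>"
print(format_reward(completion), accuracy_reward(completion, "4"))
```

How the two rewards are weighted or summed before the group normalization is another detail I could not find, so it is left out of the sketch.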
-
Hey all, love the project idea.
I wonder if there is a plan to outline what is missing from the original repo to get this fully reproduced, maybe using the Projects tab or something similar? That way we would be able to track the volume of work needed to fully reproduce R1.
Right now there is a Plan of Attack in the README, which is a good start. However, it is hard to tell whether step 1 is almost there in the codebase or whether there are 2, 3, or 8 things that still have to be completed.
I think a ballpark of how far along we are and what is yet to be done would help attract more enthusiastic people to the project.
Just throwing this out there. If I missed similar documentation, I'd be grateful for any links 😄