What's changed
- Adds Direct Preference Optimization (DPO) style rewards by @opentaco on #99
- Changes print format on exception catch by @camfairchild on #135
- Brings back netuid and wandb to logged config by @p-ferreira on #137
- Adds DPO penalty update by @Eugene-hu on #138
- Adds original reward output to wandb logs by @isabella618033 on #139
- Reweights reward models by @Eugene-hu on #140
- Updates stale documentation by @steffencruz on #129