Provide Descriptions (READMEs) for `trl-lib/dataset` #2470

Kallinteris-Andreas · 2024-12-13T07:17:13Z

Feature request

TRL includes a few datasets in HF/trl-lib, but those datasets do not include any information about themselves.

For example, trl-lib/Capybara does not have a README.md. It would be useful to include minimal information such as:

- Who made it? The best information I could find is that NousResearch (the makers of the Hermes models) has a model named Capybara. Was this dataset used to train that model, or is it something else?
- What is it for: SFT, reward modeling, or RLHF (DPO/PPO)? From my limited understanding, different dataset types are used for each of those processes.
- What is it intended to accomplish (domain adaptation? bias reduction?), and what is it intended to improve?

Motivation

Extra information is always useful. It is essential for evaluating the impact of the training process.

Your contribution

Not sure what I can do to help, as I am not familiar with those datasets.

qgallouedec · 2024-12-13T16:33:05Z

Yes, that's a good point!

All datasets in hf.co/trl-lib are taken from an original dataset. We should at least indicate this dataset in the readme with something like:

This dataset is a processed version of [openbmb/UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) with this [script](https://github.com/huggingface/trl/blob/main/examples/datasets/ultrafeedback.py).

To do this, each script in https://github.com/huggingface/trl/blob/main/examples/datasets should define a card and push it, as is done with the model card in

trl/scripts/generate_tiny_models.py

Lines 69 to 97 in 179ba53

    
    MODEL_CARD = """
    ---
    library_name: transformers
    tags: [trl]
    ---

    # Tiny {model_class_name}

    This is a minimal model built for unit tests in the [TRL](https://github.com/huggingface/trl) library.
    """

    api = HfApi()

    def push_to_hub(model, tokenizer, suffix=None):
        model_class_name = model.__class__.__name__
        content = MODEL_CARD.format(model_class_name=model_class_name)
        model_card = ModelCard(content)
        repo_id = f"{ORGANIZATION}/tiny-{model_class_name}"
        if suffix is not None:
            repo_id += f"-{suffix}"
        if api.repo_exists(repo_id):
            print(f"Model {repo_id} already exists, skipping")
        else:
            model.push_to_hub(repo_id)
            tokenizer.push_to_hub(repo_id)
            model_card.push_to_hub(repo_id)
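
For the dataset scripts, the same pattern could look roughly like the sketch below. It is only an illustration, not the code that ended up in TRL: the `DATASET_CARD` template, the helper names `render_dataset_card` and `push_dataset_card`, and the `dataset_type` value in the usage line are all made up for this example. The only library call relied on is `huggingface_hub.DatasetCard` and its `push_to_hub` method, which is the dataset counterpart of `ModelCard`.

```python
# Hypothetical template; the wording follows the sentence suggested earlier in this thread.
DATASET_CARD = """\
---
tags: [trl]
---

# {dataset_name} Dataset

This dataset is a processed version of [{source_id}](https://huggingface.co/datasets/{source_id}) with this [script]({script_url}).

Dataset type: {dataset_type}. See the [dataset formats documentation](https://huggingface.co/docs/trl/en/dataset_formats).
"""


def render_dataset_card(dataset_name, source_id, script_url, dataset_type):
    """Fill the template; pure string formatting, no network access."""
    return DATASET_CARD.format(
        dataset_name=dataset_name,
        source_id=source_id,
        script_url=script_url,
        dataset_type=dataset_type,
    )


def push_dataset_card(repo_id, content):
    """Upload the rendered card as the README.md of a dataset repo (hypothetical helper)."""
    # Imported lazily so that rendering works even without huggingface_hub installed.
    from huggingface_hub import DatasetCard

    DatasetCard(content).push_to_hub(repo_id, repo_type="dataset")


if __name__ == "__main__":
    # The dataset_type below is a placeholder, not a claim about the real dataset.
    print(
        render_dataset_card(
            dataset_name="UltraFeedback",
            source_id="openbmb/UltraFeedback",
            script_url="https://github.com/huggingface/trl/blob/main/examples/datasets/ultrafeedback.py",
            dataset_type="<type goes here>",
        )
    )
```

Each script would then call `push_dataset_card(repo_id, content)` right after pushing the processed dataset, so the README always matches the script that produced the data.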

We could also add the type/format of dataset with a link to the relevant section in this page of the documentation: https://huggingface.co/docs/trl/en/dataset_formats
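
As a rough illustration of how the type could be filled in automatically, the sketch below guesses a dataset type from its column names. The column conventions in `COLUMN_RULES` reflect my reading of the dataset formats page linked above and should be checked against it; `guess_dataset_type` is a hypothetical helper, not part of TRL.

```python
# Assumed column conventions, ordered from most to least specific.
# Verify against the dataset formats documentation before relying on them.
COLUMN_RULES = [
    ({"prompt", "completions", "labels"}, "stepwise supervision"),
    ({"prompt", "completion", "label"}, "unpaired preference"),
    ({"chosen", "rejected"}, "preference"),
    ({"prompt", "completion"}, "prompt-completion"),
    ({"prompt"}, "prompt-only"),
    ({"messages"}, "language modeling"),
    ({"text"}, "language modeling"),
]


def guess_dataset_type(column_names):
    """Return the first type whose required columns are all present, else None."""
    columns = set(column_names)
    for required, type_name in COLUMN_RULES:
        if required <= columns:  # subset test
            return type_name
    return None


if __name__ == "__main__":
    print(guess_dataset_type(["prompt", "chosen", "rejected"]))
    print(guess_dataset_type(["messages"]))
```

With the `datasets` library, the column names of a loaded split are available as `dataset.column_names`, so the guess could be written into the card without hard-coding the type in every script.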

qgallouedec added the 📚 documentation, ✨ enhancement, 🗃️ data, 🙋 help from community wanted, and 👶 good first issue labels on Dec 13, 2024

August-murr linked a pull request on Dec 16, 2024 that will close this issue: adding readme for datasets #2491 (open, 5 tasks)
