ValueError when using prepare_for_training() on multi-label annotated Argilla dataset with single annotations #2665

m-newhauser · 2023-04-06T11:33:57Z

Describe the bug
I'm getting a ValueError when trying to use the prepare_for_training() method to prepare a multi-label annotated Argilla dataset for training. I'm only getting the error when a given record has just a single annotation. Everything works fine if all records have more than one assigned annotation label.

To Reproduce

import datetime
import argilla as rg

# Create dataset with a multi-label record that only has one annotation
dataset_rg = rg.DatasetForTextClassification(
    [
        rg.TextClassificationRecord(text=None,
                                    inputs={'title': 'This is the title of an article'}, 
                                    prediction=None, 
                                    prediction_agent=None, 
                                    annotation=['LABEL_1'], 
                                    annotation_agent='team', 
                                    vectors=None, 
                                    multi_label=True, 
                                    explanation=None, 
                                    id='12345', 
                                    metadata={'split': 'train'}, 
                                    status='Validated', 
                                    event_timestamp=datetime.datetime(2023, 4, 4, 9, 57, 59, 910986), 
                                    search_keywords=None)
    ]
)

# Prepare the dataset for training
dataset_rg.prepare_for_training(framework="setfit")

Generates the following error:

ValueError: Class label 1 greater than configured num_classes 1

Expected behavior
Expect the method to return a Dataset that is ready for training.

Dataset({
    features: ['id', 'text', 'label', 'binarized_label'],
    num_rows: 1
})
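
For readers unfamiliar with the `binarized_label` column named above, the sketch below shows one common way a multi-label annotation can be multi-hot encoded in plain Python. It is an illustration only: `binarize_labels` is a hypothetical helper, not Argilla's implementation, and reading `binarized_label` as a multi-hot vector is an assumption.

```python
from typing import List, Sequence, Tuple


def binarize_labels(
    annotations: Sequence[Sequence[str]],
) -> Tuple[List[str], List[List[int]]]:
    """Hypothetical helper: multi-hot encode each record's labels."""
    # Build a stable label vocabulary from every annotation in the dataset.
    label_names = sorted({label for labels in annotations for label in labels})
    index = {name: i for i, name in enumerate(label_names)}

    vectors = []
    for labels in annotations:
        # One slot per known label; set a slot to 1 if the record has that label.
        vector = [0] * len(label_names)
        for label in labels:
            vector[index[label]] = 1
        vectors.append(vector)
    return label_names, vectors


# A record with a single annotation still gets a valid vector.
names, vectors = binarize_labels([["LABEL_1"], ["LABEL_1", "LABEL_2"]])
print(names)    # ['LABEL_1', 'LABEL_2']
print(vectors)  # [[1, 0], [1, 1]]
```

Under this encoding a single-annotation record is just a vector with one non-zero entry, which is why it seems reasonable to expect it to be handled like any other record.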

Environment:

OS: GitHub Codespaces (Ubuntu Linux)
Argilla Version: 1.6.0

Additional context
The method works properly when the given record is multi-label AND has more than one annotation label, for example:

import datetime
import argilla as rg

# Create dataset with a multi-label record that has two annotations
dataset_rg = rg.DatasetForTextClassification(
    [
        rg.TextClassificationRecord(text=None,
                                    inputs={'title': 'This is the title of an article'}, 
                                    prediction=None, 
                                    prediction_agent=None, 
                                    annotation=['LABEL_1', 'LABEL_2'], 
                                    annotation_agent='team', 
                                    vectors=None, 
                                    multi_label=True, 
                                    explanation=None, 
                                    id='12345', 
                                    metadata={'split': 'train'}, 
                                    status='Validated', 
                                    event_timestamp=datetime.datetime(2023, 4, 4, 9, 57, 59, 910986), 
                                    search_keywords=None)
    ]
)

# Prepare the dataset for training
dataset_rg.prepare_for_training(framework="setfit")

Returns:

Dataset({
    features: ['id', 'text', 'label', 'binarized_label'],
    num_rows: 1
})
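
One difference between the two snippets above is that the failing dataset contains a single distinct label overall, while the working one contains two. A minimal pre-flight check for that condition could look like the sketch below. The helper names and the threshold of two labels are assumptions for illustration, and whether this is the actual trigger inside Argilla 1.6.0 is not confirmed here.

```python
from typing import Iterable, Optional, Sequence, Set


def distinct_labels(annotations: Iterable[Optional[Sequence[str]]]) -> Set[str]:
    """Hypothetical helper: collect every label used across all records."""
    # Records without an annotation (None or empty) are skipped.
    return {label for labels in annotations if labels for label in labels}


def check_label_count(
    annotations: Iterable[Optional[Sequence[str]]], minimum: int = 2
) -> None:
    """Raise a descriptive error when the dataset has too few distinct labels."""
    found = distinct_labels(annotations)
    if len(found) < minimum:
        raise ValueError(
            f"Found {len(found)} distinct label(s) {sorted(found)}; "
            f"at least {minimum} are assumed to be needed for training."
        )


# Annotations taken from the two reproduction snippets.
failing = [["LABEL_1"]]
working = [["LABEL_1", "LABEL_2"]]

check_label_count(working)  # passes silently
try:
    check_label_count(failing)
except ValueError as err:
    print(f"Pre-flight check failed: {err}")
```

Running such a check before `prepare_for_training()` would surface the condition with a clearer message than the `num_classes` error.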



davidberenstein1957 · 2023-05-04T15:25:51Z

@m-newhauser thanks again for reporting this. This was resolved in 1.7.0

# Description

Updated the argilla.training integration

Closes #2658
Closes #2665
Closes #2659

**Type of change**

(Please delete options that are not relevant. Remember to title the PR according to the type of change)

- [X] Bug fix (non-breaking change which fixes an issue)
- [X] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Refactor (change restructuring the codebase without changing functionality)
- [ ] Improvement (change adding some improvement to an existing functionality)
- [ ] Documentation update

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And ideally, reference `tests`)

argilla/tests/training/*

**Checklist**

- [ ] I have merged the original branch into my forked branch
- [ ] I added relevant documentation
- [ ] follows the style guidelines of this project
- [ ] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

---------

Co-authored-by: david <david.m.berenstein@gmail.com>
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com>
Co-authored-by: Alvaro Bartolome <alvarobartt@yahoo.com>

## [1.7.0](v1.6.0...v1.7.0)

### Added

- add `max_retries` and `num_threads` parameters to `rg.log` to run data logging request concurrently with backoff retry policy. See [#2458](#2458) and [#2533](#2533)
- `rg.load` accepts `include_vectors` and `include_metrics` when loading data. Closes [#2398](#2398)
- Added `settings` param to `prepare_for_training` ([#2689](#2689))
- Added `prepare_for_training` for `openai` ([#2658](#2658))
- Added `ArgillaOpenAITrainer` ([#2659](#2659))
- Added `ArgillaSpanMarkerTrainer` for Named Entity Recognition ([#2693](#2693))
- Added `ArgillaTrainer` CLI support. Closes ([#2809](#2809))

### Changed

- Argilla quickstart image dependencies are externalized into `quickstart.requirements.txt`. See [#2666](#2666)
- bulk endpoints will upsert data when record `id` is present. Closes [#2535](#2535)
- moved from `click` to `typer` CLI support. Closes ([#2815](#2815))
- Argilla server docker image is built with PostgreSQL support. Closes [#2686](#2686)
- The `rg.log` computes all batches and raise an error for all failed batches.
- The default batch size for `rg.log` is now 100.

### Fixed

- `argilla.training` bugfixes and unification ([#2665](#2665))
- Resolved several small bugs in the `ArgillaTrainer`.

### Deprecated

- The `rg.log_async` function is deprecated and will be removed in next minor release.

davidberenstein1957 added this to the v1.7.0 milestone Apr 10, 2023

davidberenstein1957 self-assigned this Apr 10, 2023

davidberenstein1957 added the type: bug Indicates an unexpected problem or unintended behavior label Apr 10, 2023

davidberenstein1957 added a commit that referenced this issue Apr 10, 2023

chore: added more explicit warning for having too little labels #2665

0e2241a

davidberenstein1957 linked a pull request Apr 10, 2023 that will close this issue

chore: added more explicit warning for having too little labels #2665 #2669

Closed

2 tasks

This was referenced Apr 10, 2023

chore: added more explicit warning for having too little labels #2665 #2669

Closed

Feat/2658 add argilla training module for openai with several bug fixes #2691

Merged

davidberenstein1957 closed this as completed May 4, 2023

frascuchon mentioned this issue May 9, 2023

feat: update the argilla training integration #2858

Merged

14 tasks

frascuchon mentioned this issue May 9, 2023

Release v1.7.0 #2817

Merged
