ValueError when using prepare_for_training() on multi-label annotated Argilla dataset with single annotations #2665

Closed
m-newhauser opened this issue Apr 6, 2023 · 1 comment · Fixed by #2691 or #2858

@m-newhauser
Contributor

Describe the bug
I'm getting a ValueError when trying to use the prepare_for_training() method to prepare a multi-label annotated Argilla dataset for training. I'm only getting the error when a given record has just a single annotation. Everything works fine if all records have more than one assigned annotation label.

To Reproduce

import datetime
import argilla as rg

# Create dataset with a multi-label record that only has one annotation
dataset_rg = rg.DatasetForTextClassification(
    [
        rg.TextClassificationRecord(text=None,
                                    inputs={'title': 'This is the title of an article'}, 
                                    prediction=None, 
                                    prediction_agent=None, 
                                    annotation=['LABEL_1'], 
                                    annotation_agent='team', 
                                    vectors=None, 
                                    multi_label=True, 
                                    explanation=None, 
                                    id='12345', 
                                    metadata={'split': 'train'}, 
                                    status='Validated', 
                                    event_timestamp=datetime.datetime(2023, 4, 4, 9, 57, 59, 910986), 
                                    search_keywords=None)
    ]
)

# Prepare the dataset for training
dataset_rg.prepare_for_training(framework="setfit")

Generates the following error:

ValueError: Class label 1 greater than configured num_classes 1

Expected behavior
Expect the method to return a Dataset that is ready for training.

Dataset({
    features: ['id', 'text', 'label', 'binarized_label'],
    num_rows: 1
})

Environment (please complete the following information):

  • OS: GitHub Codespaces (Ubuntu Linux)
  • Argilla Version: 1.6.0

Additional context
The method works properly when the given record is multi-label AND has more than one annotation label, for example:

import datetime
import argilla as rg

# Create dataset with a multi-label record that has two annotations
dataset_rg = rg.DatasetForTextClassification(
    [
        rg.TextClassificationRecord(text=None,
                                    inputs={'title': 'This is the title of an article'}, 
                                    prediction=None, 
                                    prediction_agent=None, 
                                    annotation=['LABEL_1', 'LABEL_2'], 
                                    annotation_agent='team', 
                                    vectors=None, 
                                    multi_label=True, 
                                    explanation=None, 
                                    id='12345', 
                                    metadata={'split': 'train'}, 
                                    status='Validated', 
                                    event_timestamp=datetime.datetime(2023, 4, 4, 9, 57, 59, 910986), 
                                    search_keywords=None)
    ]
)

# Prepare the dataset for training
dataset_rg.prepare_for_training(framework="setfit")

Returns:

Dataset({
    features: ['id', 'text', 'label', 'binarized_label'],
    num_rows: 1
})
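
For reference, here is a minimal sketch of what a multi-label binarization step is expected to produce. It is plain Python and not Argilla's actual implementation; `binarize_labels` and `ALL_LABELS` are hypothetical names introduced only for illustration. The point is that a record with a single annotation should still map to a vector with one slot per label in the schema, exactly like a record with several annotations.

```python
from typing import Iterable, List, Sequence


def binarize_labels(annotation: Iterable[str], all_labels: Sequence[str]) -> List[int]:
    """Return a multi-hot vector with one slot per label in `all_labels`."""
    index = {label: i for i, label in enumerate(all_labels)}
    vector = [0] * len(all_labels)
    for label in annotation:
        if label not in index:
            raise ValueError(f"Unknown label {label!r}; expected one of {list(all_labels)}")
        vector[index[label]] = 1
    return vector


# Hypothetical label schema (an assumption for this sketch, not taken from Argilla).
ALL_LABELS = ["LABEL_1", "LABEL_2"]

single = binarize_labels(["LABEL_1"], ALL_LABELS)             # one annotation
double = binarize_labels(["LABEL_1", "LABEL_2"], ALL_LABELS)  # two annotations
print(single)  # [1, 0]
print(double)  # [1, 1]
```

Both vectors have the same length because the label schema is fixed up front rather than derived from how many annotations a given record happens to carry.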
@davidberenstein1957 davidberenstein1957 added this to the v1.7.0 milestone Apr 10, 2023
@davidberenstein1957 davidberenstein1957 self-assigned this Apr 10, 2023
@davidberenstein1957 davidberenstein1957 added the type: bug Indicates an unexpected problem or unintended behavior label Apr 10, 2023
davidberenstein1957 added a commit that referenced this issue May 3, 2023
…es (#2691)

# Description

Updated the argilla.training integration

Closes #2658
Closes #2665 
Closes #2659 

**Type of change**

(Please delete options that are not relevant. Remember to title the PR
according to the type of change)

- [X] Bug fix (non-breaking change which fixes an issue)
- [X] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [ ] Refactor (change restructuring the codebase without changing
functionality)
- [ ] Improvement (change adding some improvement to an existing
functionality)
- [ ] Documentation update

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And
ideally, reference `tests`)

argilla/tests/training/*

**Checklist**

- [ ] I have merged the original branch into my forked branch
- [ ] I added relevant documentation
- [ ] follows the style guidelines of this project
- [ ] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [ ] I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)

---------

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com>
Co-authored-by: Alvaro Bartolome <alvarobartt@yahoo.com>
@davidberenstein1957
Member

@m-newhauser thanks again for reporting this. This was resolved in 1.7.0

frascuchon added a commit that referenced this issue May 9, 2023

Co-authored-by: david <david.m.berenstein@gmail.com>
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com>
Co-authored-by: Alvaro Bartolome <alvarobartt@yahoo.com>
frascuchon added a commit that referenced this issue May 10, 2023
## [1.7.0](v1.6.0...v1.7.0)

### Added

- add `max_retries` and `num_threads` parameters to `rg.log` to run data
logging requests concurrently with backoff retry policy. See
[#2458](#2458) and
[#2533](#2533)
- `rg.load` accepts `include_vectors` and `include_metrics` when loading
data. Closes [#2398](#2398)
- Added `settings` param to `prepare_for_training`
([#2689](#2689))
- Added `prepare_for_training` for `openai`
([#2658](#2658))
- Added `ArgillaOpenAITrainer`
([#2659](#2659))
- Added `ArgillaSpanMarkerTrainer` for Named Entity Recognition
([#2693](#2693))
- Added `ArgillaTrainer` CLI support. Closes
([#2809](#2809))

### Changed

- Argilla quickstart image dependencies are externalized into
`quickstart.requirements.txt`. See
[#2666](#2666)
- bulk endpoints will upsert data when record `id` is present. Closes
[#2535](#2535)
- moved from `click` to `typer` CLI support. Closes
([#2815](#2815))
- Argilla server docker image is built with PostgreSQL support. Closes
[#2686](#2686)
- `rg.log` computes all batches and raises an error for all failed
batches.
- The default batch size for `rg.log` is now 100.

### Fixed

- `argilla.training` bugfixes and unification
([#2665](#2665))
- Resolved several small bugs in the `ArgillaTrainer`.

### Deprecated

- The `rg.log_async` function is deprecated and will be removed in the next
minor release.