
[FEATURE] Add metadata field to the FeedbackRecord #3155

Closed
tugot17 opened this issue Jun 12, 2023 · 11 comments · Fixed by #3194

@tugot17

tugot17 commented Jun 12, 2023

Add support for adding metadata to the FeedbackRecord. 

Currently, when pushing the records to the dataset, I can only push the fields that are later used as a TextField. In some scenarios, I would like to push not only the values but also the metadata, e.g., the hash values associated with different fields. They come in handy when analysing the results after the labelling is completed. I'm aware of the external_id but in many use cases it is not sufficient.

The solution would be to either:

  • not requiring all the pushed fields to be present as a TextField, or
  • adding a metadata field param that would capture such fields.

The code:

import json
from pathlib import Path

import argilla as rg

fields = [
    rg.TextField(name="instruction", required=True),
    rg.TextField(name="input", required=True),
    rg.TextField(name="completion_a", required=True),
    rg.TextField(name="completion_b", required=True),
]

dataset = rg.FeedbackDataset(
    guidelines="Please, read the prompt carefully and...",
    questions=questions,
    fields=fields,
)


def build_records(path: Path) -> list[rg.FeedbackRecord]:
    with open(path) as f:
        data = json.load(f)

    records = []

    for point in data:
        record = rg.FeedbackRecord(
            fields={
                "instruction_hash": point["instruction_hash"],
                "instruction": point["instruction"],
                "input": point["input"],
                "completion_a": point["completion_a"],
                "completion_b": point["completion_b"],
                "completion_hash_a": point["completion_hash_a"],
                "completion_hash_b": point["completion_hash_b"],
                "pair_hash": point["pair_hash"],
            }
        )
        records.append(record)

    return records


records = build_records(path=Path(__file__).parent / "data.json")
dataset.add_records(records)
dataset.push_to_argilla(name="my-dataset")

The exception:

Exception: Failed while adding the records to the `FeedbackTask` dataset in 
Argilla with exception: Argilla server returned an error with http status: 422
Error details: [{'response': "Error: found fields values for non configured 
fields: ['instruction_hash', 'completion_hash_a', 'completion_hash_b', 
'pair_hash']", 'params': {}}]
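Until metadata is supported, one workaround (a sketch, not an official Argilla API; the helper name and sample data here are hypothetical) is to filter each data point down to the configured field names before building the record, so the server no longer sees non-configured fields:

```python
# Hypothetical helper: keep only the configured TextField names,
# dropping extras such as hash values that the server rejects with a 422.
CONFIGURED_FIELDS = {"instruction", "input", "completion_a", "completion_b"}


def filter_to_configured(point: dict, configured=CONFIGURED_FIELDS) -> dict:
    """Return only the key/value pairs that match a configured field."""
    return {k: v for k, v in point.items() if k in configured}


point = {
    "instruction": "Summarize the text",
    "instruction_hash": "abc123",
    "input": "Some text",
    "completion_a": "A",
    "completion_b": "B",
    "completion_hash_a": "h1",
    "completion_hash_b": "h2",
    "pair_hash": "p1",
}

fields = filter_to_configured(point)
# fields now contains only the four configured keys; the hashes are dropped
# (and would have to be stored elsewhere until metadata support lands).
```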
@dvsrepo
Member

dvsrepo commented Jun 12, 2023

Fully agree with this @tugot17. This is already planned and your use case could greatly help us design and implement this.

Some feedback from your side would be super helpful.

For other tasks, we have metadata fields that are searchable and can be used for filtering data, besides retrieving them when you read the dataset from Python. The values for these fields are always visible to all users (annotators and admins)

But for the Feedback task we're considering some more fine-grained alternatives:

  • Defining the visibility of the Fields, in the sense of making them visible in the record card and letting users filter by them. This visibility could be applied only to annotators, for example.
  • Introducing a new MetadataField object with different data types (str, int, float, dates, etc.), and then letting admins specify whether they are public (available to users as described above) or private: not available in the UI, only retrievable when reading the records for post-processing activities.

A canonical use case for private metadata fields: when asking labelers to rank two responses from different models (for evaluation, for example), you'd like to shuffle Response 1 and Response 2, so that model 1's response isn't always shown as Response 1 and model 2's as Response 2, which can introduce bias. In this case, I'd like a private metadata field indicating, for each record, which model_id was Response 1 and which was Response 2.

Another is to keep a prompt id across different records, to split the ranking problem into several records: for prompt id = 1234, I create several records with pairs of responses, instead of a single record asking the user to rank several pairs at the same time.
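The shuffling use case above can be sketched in plain Python (the fields/metadata layout here is an illustration of the idea, not a defined Argilla schema):

```python
import random


def make_pair(prompt_id: int, responses: dict[str, str], rng: random.Random) -> dict:
    """Shuffle two model responses and record the assignment as private metadata.

    `responses` maps model_id -> response text, e.g. {"model_a": "...", "model_b": "..."}.
    """
    model_ids = list(responses)
    rng.shuffle(model_ids)  # randomize which model lands in which slot
    first, second = model_ids
    return {
        # What the annotator sees: two anonymous responses.
        "fields": {"response_1": responses[first], "response_2": responses[second]},
        # What only admins/post-processing see: the model behind each slot,
        # plus the shared prompt id linking related records.
        "metadata": {
            "prompt_id": prompt_id,
            "response_1_model": first,
            "response_2_model": second,
        },
    }


record = make_pair(1234, {"model_a": "Answer A", "model_b": "Answer B"}, random.Random(0))
```

Reading the metadata back after labelling lets you attribute each ranked response to its model without the annotator ever having seen the mapping.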

@dvsrepo
Member

dvsrepo commented Jun 12, 2023

cc @nataliaElv @frascuchon

@tugot17
Author

tugot17 commented Jun 12, 2023

The first option seems to be a much quicker fix, though it would introduce two different views, one for admins and one for annotators, which would add unnecessary complexity.

Why not just add a visible param to rg.TextField?

@dvsrepo
Member

dvsrepo commented Jun 12, 2023

Yes, sorry, by the first option I meant adding a visibility/visible property to each TextField. We still need to decide what visible means: is it only about presenting the field to the user in the record card, about not showing it at all, and is it searchable?

@dvsrepo
Member

dvsrepo commented Jun 12, 2023

As discussed, I think we could start with a new boolean attribute visible for TextField where:

  1. You can store values from the Python SDK at record creation.
  2. The field is not shown in the record card.
  3. When you retrieve the records with fetch_records, you get its values.

What do you think @frascuchon , @gabrielmbmb , @alvarobartt , @keithCuniah ?
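A minimal sketch of how such a flag could behave, using a plain dataclass as a stand-in rather than the real rg.TextField (the visible attribute is a proposal, not an existing Argilla parameter):

```python
from dataclasses import dataclass


@dataclass
class TextField:
    """Simplified stand-in for rg.TextField with the proposed flag."""
    name: str
    required: bool = True
    visible: bool = True  # proposed: hide the field from the record card


fields = [
    TextField(name="instruction"),
    TextField(name="instruction_hash", visible=False),
]


def record_card_fields(fields: list[TextField]) -> list[str]:
    """Field names shown in the annotation UI: visible fields only."""
    return [f.name for f in fields if f.visible]


def fetched_fields(fields: list[TextField]) -> list[str]:
    """Field names returned by fetch_records: all fields, visible or not."""
    return [f.name for f in fields]
```

This matches the three points above: values can be stored at creation, hidden fields stay out of the record card, and fetch_records still returns everything.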

@nataliaElv
Member

I guess the fields that aren't visible shouldn't be searchable using the search bar once we have it, right?

@dvsrepo
Member

dvsrepo commented Jun 12, 2023

I guess the fields that aren't visible shouldn't be searchable using the search bar once we have it, right?

I think at least at the beginning we can assume that.

As we discussed the other day and confirmed with @tugot17, having a way to store additional fields is crucial for many use cases, so we should ship the minimum, simplest solution as soon as possible and then iterate on more complex use cases.

@frascuchon
Member

Yes, I think allowing some fields to be hidden, or even removed from the record itself in some endpoints, can be a first approach to this metadata solution.

Regarding the search, in general terms, I would say that visible/non-visible should be decoupled from search accessibility (it may make sense for admins to search on non-visible fields to apply some kind of data processing). Maybe we can limit search capabilities in the UI, but I think we should still index the data for search.

@dvsrepo
Member

dvsrepo commented Jun 13, 2023

I agree, let's go with a simple solution to solve the non-visible fields, and then we'll see further use cases (search, etc.)

@frascuchon
Member

Regarding the first solution, I'm not convinced by the visible attribute, since it is tightly coupled to the current annotation view, which can change in the future.

I would instead define a metadata property at the record level to let users store whatever extra info they need for a record. It's a more general solution and is decoupled from how data will be shown.

for data in dataset:
    rg.FeedbackRecord(
        fields={"field1": data.pop("a"), "field2": data.pop("b")},
        metadata={**data},
    )

Apart from this, we need to think about how to provide searchable info outside the record cards, but this can be tackled as a separate problem.
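Applied to the original example, the record-level metadata approach could look like this sketch (plain Python for the splitting logic; the field names are the issue author's, and passing the leftover dict as metadata follows the proposal above):

```python
# The fields configured on the dataset; everything else becomes metadata.
CONFIGURED = ("instruction", "input", "completion_a", "completion_b")


def split_record(point: dict, configured=CONFIGURED) -> tuple[dict, dict]:
    """Split a raw data point into configured fields and leftover metadata."""
    record_fields = {k: point[k] for k in configured}
    record_metadata = {k: v for k, v in point.items() if k not in configured}
    return record_fields, record_metadata


point = {
    "instruction": "Do X",
    "input": "text",
    "completion_a": "A",
    "completion_b": "B",
    "instruction_hash": "abc",
    "pair_hash": "p1",
}

record_fields, record_metadata = split_record(point)
# With the proposed API, the record would then be built as:
#   rg.FeedbackRecord(fields=record_fields, metadata=record_metadata)
```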

@dvsrepo
Member

dvsrepo commented Jun 14, 2023

Sounds good!

@frascuchon frascuchon modified the milestones: v1.11.0, v1.12.0 Jun 14, 2023
@frascuchon frascuchon modified the milestones: v1.12.0, v1.11.0 Jun 15, 2023
frascuchon pushed a commit that referenced this issue Jun 16, 2023
# Description

This PR adds a new attribute called `metadata` to the `Record` of the
`FeedbackDataset`.

- A new `metadata` column/attribute has been added to the `Record` ORM
class (a new column `metadata` will be added by the migration script
generated by `alembic`)
- All the endpoints in the API v1 that list records have been updated to
return this new `metadata` column
- The SDK has been updated to parse the `metadata` key returned by the
API
- The Python client has been updated so new records can be created
including `metadata`

Additionally, I've refactored some `if/else` conditions in some methods
of the `FeedbackDataset` class.

Closes #3155

**Type of change**

- [x] New feature (non-breaking change which adds functionality)
- [x] Refactor

**How Has This Been Tested**

Unit tests have been updated, and I've done some manual tests.

**Checklist**

- [x] I have merged the original branch into my forked branch
- [ ] I added relevant documentation
- [x] My code follows the style guidelines of this project
- [x] I did a self-review of my code
- [x] I made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works
- [x] I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)

---------

Co-authored-by: Alvaro Bartolome <alvaro@argilla.io>