
[FEATURE] Add metadata field to the FeedbackRecord #3155

Closed
tugot17 opened this issue Jun 12, 2023 · 11 comments · Fixed by #3194

@tugot17

tugot17 commented Jun 12, 2023

Add support for adding metadata to the FeedbackRecord. 

Currently, when pushing the records to the dataset, I can only push the fields that are later used as a TextField. In some scenarios, I would like to push not only the values but also the metadata, e.g., the hash values associated with different fields. They come in handy when analysing the results after the labelling is completed. I'm aware of the external_id but in many use cases it is not sufficient.

The solution would be to either:

  • not requiring all the pushed fields to be present as a TextField, or
  • adding a metadata field param that would capture such fields.

The code:

import json
from pathlib import Path

import argilla as rg

fields = [
    rg.TextField(name="instruction", required=True),
    rg.TextField(name="input", required=True),
    rg.TextField(name="completion_a", required=True),
    rg.TextField(name="completion_b", required=True),
]

dataset = rg.FeedbackDataset(
    guidelines="Please, read the prompt carefully and...",
    questions=questions,
    fields=fields,
)


def build_records(path: Path) -> list[rg.FeedbackRecord]:
    with open(path) as f:
        data = json.load(f)

    records = []

    for point in data:
        record = rg.FeedbackRecord(
            fields={
                "instruction_hash": point["instruction_hash"],
                "instruction": point["instruction"],
                "input": point["input"],
                "completion_a": point["completion_a"],
                "completion_b": point["completion_b"],
                "completion_hash_a": point["completion_hash_a"],
                "completion_hash_b": point["completion_hash_b"],
                "pair_hash": point["pair_hash"],
            }
        )
        records.append(record)

    return records


records = build_records(path=Path(__file__).parent / "data.json")
dataset.add_records(records)
dataset.push_to_argilla(name="my-dataset")

The exception:

Exception: Failed while adding the records to the `FeedbackTask` dataset in 
Argilla with exception: Argilla server returned an error with http status: 422
Error details: [{'response': "Error: found fields values for non configured 
fields: ['instruction_hash', 'completion_hash_a', 'completion_hash_b', 
'pair_hash']", 'params': {}}]
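Until metadata is supported, one workaround (a sketch, not an official Argilla API; the helper name and sample data here are hypothetical) is to filter each data point down to the configured field names before building the record, so the server no longer sees non-configured fields:

```python
# Hypothetical helper: keep only the configured TextField names,
# dropping extras such as hash values that the server rejects with a 422.
CONFIGURED_FIELDS = {"instruction", "input", "completion_a", "completion_b"}


def filter_to_configured(point: dict, configured=CONFIGURED_FIELDS) -> dict:
    """Return only the key/value pairs that match a configured field."""
    return {k: v for k, v in point.items() if k in configured}


point = {
    "instruction": "Summarize the text",
    "instruction_hash": "abc123",
    "input": "Some text",
    "completion_a": "A",
    "completion_b": "B",
    "completion_hash_a": "h1",
    "completion_hash_b": "h2",
    "pair_hash": "p1",
}

fields = filter_to_configured(point)
# fields now contains only the four configured keys; the hashes are dropped
# (and would have to be stored elsewhere until metadata support lands).
```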
@dvsrepo
Member

dvsrepo commented Jun 12, 2023

Fully agree with this @tugot17. This is already planned and your use case could greatly help us design and implement this.

Some feedback from your side would be super helpful.

For other tasks, we have metadata fields that are searchable and can be used for filtering data, besides retrieving them when you read the dataset from Python. The values for these fields are always visible to all users (annotators and admins)

But for the Feedback task we're considering some more fine-grained alternatives:

  • Defining the visibility of the Fields, in the sense of making them visible in the record card and letting users filter by them. This visibility could be applied only to annotators, for example.
  • Introducing a new MetadataField object with different data types (str, int, float, dates, etc.), and then letting admins specify whether they are public (available to users as described above) or private: not available in the UI, only retrievable when reading the records for post-processing activities.

A canonical use case for private metadata fields: when asking labelers to rank two responses from different models (for evaluation, for example), you'd like to shuffle Response 1 and Response 2, so that model 1's response isn't always shown as Response 1 and model 2's as Response 2, which can introduce bias. In this case, I'd like a private metadata field indicating, for each record, which model_id was Response 1 and which was Response 2.

Another is to keep a prompt id across different records, to split the ranking problem into several records: for prompt id = 1234, I create several records with pairs of responses, instead of a single record asking the user to rank several pairs at the same time.
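The shuffling use case above can be sketched in plain Python (the fields/metadata layout here is an illustration of the idea, not a defined Argilla schema):

```python
import random


def make_pair(prompt_id: int, responses: dict[str, str], rng: random.Random) -> dict:
    """Shuffle two model responses and record the assignment as private metadata.

    `responses` maps model_id -> response text, e.g. {"model_a": "...", "model_b": "..."}.
    """
    model_ids = list(responses)
    rng.shuffle(model_ids)  # randomize which model lands in which slot
    first, second = model_ids
    return {
        # What the annotator sees: two anonymous responses.
        "fields": {"response_1": responses[first], "response_2": responses[second]},
        # What only admins/post-processing see: the model behind each slot,
        # plus the shared prompt id linking related records.
        "metadata": {
            "prompt_id": prompt_id,
            "response_1_model": first,
            "response_2_model": second,
        },
    }


record = make_pair(1234, {"model_a": "Answer A", "model_b": "Answer B"}, random.Random(0))
```

Reading the metadata back after labelling lets you attribute each ranked response to its model without the annotator ever having seen the mapping.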

@dvsrepo
Member

dvsrepo commented Jun 12, 2023

cc @nataliaElv @frascuchon

@tugot17
Author

tugot17 commented Jun 12, 2023

The first option seems to be a much quicker fix, though it would introduce two different views, one for admins and one for annotators, which would add unnecessary complexity.

Why not just add a visible param to rg.TextField?

@dvsrepo
Member

dvsrepo commented Jun 12, 2023

Yes, sorry, by the first option I meant adding a visibility/visible property to each TextField. We still need to decide what visible means: is it only about presenting the field to the user in the record card, about not showing it at all, and is it searchable?

@dvsrepo
Member

dvsrepo commented Jun 12, 2023

As discussed, I think we could start with a new boolean attribute visible for TextField where:

  1. You can store values from the Python SDK at record creation.
  2. The field is not shown in the record card.
  3. When you retrieve the records with fetch_records, you get its values.

What do you think @frascuchon , @gabrielmbmb , @alvarobartt , @keithCuniah ?
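A minimal sketch of how such a flag could behave, using a plain dataclass as a stand-in rather than the real rg.TextField (the visible attribute is a proposal, not an existing Argilla parameter):

```python
from dataclasses import dataclass


@dataclass
class TextField:
    """Simplified stand-in for rg.TextField with the proposed flag."""
    name: str
    required: bool = True
    visible: bool = True  # proposed: hide the field from the record card


fields = [
    TextField(name="instruction"),
    TextField(name="instruction_hash", visible=False),
]


def record_card_fields(fields: list[TextField]) -> list[str]:
    """Field names shown in the annotation UI: visible fields only."""
    return [f.name for f in fields if f.visible]


def fetched_fields(fields: list[TextField]) -> list[str]:
    """Field names returned by fetch_records: all fields, visible or not."""
    return [f.name for f in fields]
```

This matches the three points above: values can be stored at creation, hidden fields stay out of the record card, and fetch_records still returns everything.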

@nataliaElv
Member

I guess the fields that aren't visible shouldn't be searchable using the search bar once we have it, right?

@dvsrepo
Member

dvsrepo commented Jun 12, 2023

I guess the fields that aren't visible shouldn't be searchable using the search bar once we have it, right?

I think at least at the beginning we can assume that.

As we discussed the other day and confirmed with @tugot17, having a way to store additional fields is crucial for many use cases, so we should ship the minimum, simplest solution as soon as possible and then iterate on more complex use cases.

@frascuchon
Member

Yes, I think allowing some fields to be hidden, or even removed from the record itself in some endpoints, can be a first approach to this metadata solution.

Regarding the search, in general terms, I would say that visible/non-visible should be decoupled from search accessibility (it may make sense for admins to search on non-visible fields to apply some kind of data processing). Maybe we can limit search capabilities in the UI, but I think we should still index the data for search.

@dvsrepo
Member

dvsrepo commented Jun 13, 2023

I agree, let's go with a simple solution to solve the non-visible fields, and then we'll see further use cases (search, etc.)

@frascuchon
Member

Regarding the first solution, I'm not convinced by the visible attribute, since it is tightly coupled to the current annotation view, which can change in the future.

I would instead define a metadata property at the record level to let users store whatever extra info they need for a record. It's a more general solution and is decoupled from how data will be shown.

for data in dataset:
    rg.FeedbackRecord(
        fields={"field1": data.pop("a"), "field2": data.pop("b")},
        metadata={**data},
    )

Apart from this, we need to think about how to provide searchable info outside the record cards, but this can be tackled as a separate problem.
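Applied to the original example, the record-level metadata approach could look like this sketch (plain Python for the splitting logic; the field names are the issue author's, and passing the leftover dict as metadata follows the proposal above):

```python
# The fields configured on the dataset; everything else becomes metadata.
CONFIGURED = ("instruction", "input", "completion_a", "completion_b")


def split_record(point: dict, configured=CONFIGURED) -> tuple[dict, dict]:
    """Split a raw data point into configured fields and leftover metadata."""
    record_fields = {k: point[k] for k in configured}
    record_metadata = {k: v for k, v in point.items() if k not in configured}
    return record_fields, record_metadata


point = {
    "instruction": "Do X",
    "input": "text",
    "completion_a": "A",
    "completion_b": "B",
    "instruction_hash": "abc",
    "pair_hash": "p1",
}

record_fields, record_metadata = split_record(point)
# With the proposed API, the record would then be built as:
#   rg.FeedbackRecord(fields=record_fields, metadata=record_metadata)
```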

@dvsrepo
Member

dvsrepo commented Jun 14, 2023

Sounds good!

@frascuchon frascuchon modified the milestones: v1.11.0, v1.12.0 Jun 14, 2023
@frascuchon frascuchon modified the milestones: v1.12.0, v1.11.0 Jun 15, 2023
frascuchon pushed a commit that referenced this issue Jun 16, 2023
# Description

This PR adds a new attribute called `metadata` to the `Record` of the
`FeedbackDataset`.

- A new `metadata` column/attribute has been added to the `Record` ORM
class (a new column `metadata` will be added by the migration script
generated by `alembic`)
- All the endpoints in the API v1 that list records have been updated to
return this new `metadata` column
- The SDK has been updated to parse the `metadata` key returned by the
API
- The Python client has been updated so new records can be created
including `metadata`

Additionally, I've refactored some `if/else` conditions in some methods
of the `FeedbackDataset` class.

Closes #3155

**Type of change**

- [x] New feature (non-breaking change which adds functionality)
- [x] Refactor

**How Has This Been Tested**

Unit tests have been updated, and I've done some manual tests.

**Checklist**

- [x] I have merged the original branch into my forked branch
- [ ] I added relevant documentation
- [x] My code follows the style guidelines of this project
- [x] I did a self-review of my code
- [x] I made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works
- [x] I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)

---------

Co-authored-by: Alvaro Bartolome <alvaro@argilla.io>