[BUG] Validating multi-label text classification records makes new copies of them instead of updating #3265

marcelbusch · 2023-06-26T10:41:57Z

Describe the bug
In multi-label text classification, whenever I annotate records, they get copied to a validated version, but the original record stays the same. So I have a dataset with 1000 records, and after annotating 20 of them, I now have 1020 records in my dataset, the 1000 records with status "default" and the 20 annotated records. Is this expected behaviour? Because like this I can't filter for records I haven't annotated yet, when I filter for status:default I still get all 1000 records.

I run argilla and elasticsearch with the supplied docker-compose file without any modifications.

To Reproduce
Steps to reproduce the behavior:

DATASET_NAME = '...'
labels = [...]
settings = rg.TextClassificationSettings(label_schema=labels)
rg.configure_dataset_settings(name=DATASET_NAME, settings=settings)

# sample is a pandas dataframe containing 1000 records
records = sample.apply(lambda x: rg.TextClassificationRecord(
    inputs={"user": x.username, "body": x.rawContent},
    metadata={"hashtags": x.hashtags},
    id=x.id,
    multi_label=True
), axis=1).values.tolist()

rg.log(records, DATASET_NAME)

Expected behavior
I expect the original records to change status, so after annotating 20 records I should still have 1000 records in my dataset, now 980 with status "default" and 20 with status "validated"

Environment (please complete the following information):

OS: macOS 11.7.5
Browser: Firefox
Argilla Version: 1.10.0
Docker Image: docker-compose 3.9 (argilla-server:latest, elasticsearch:8.5.3)

Edit: Update on some more behaviour:

when I query for a word that results in more than 1 result, I get the unlabeled and the validated records
when I query for a word/combination of words that exist(s) only in 1 record, I also get both records (via .load and in the ui), but in the ui I can only see the validated record, but there is blank space where the unlabeled record would be (there is no element for it in the dom though, the validated record is offset via translateY)

The text was updated successfully, but these errors were encountered:

davidberenstein1957 · 2023-06-27T10:11:16Z

@marcelbusch do you know for sure that you are the only one working on this. What version of the server and Python package are you using?

I think it might be good to evaluate the query syntax w.r.t. the query bahaviour. Are you using it correctly?

marcelbusch · 2023-06-28T09:29:08Z

Yes, I am sure I'm the only one working on this. Server and python package version are both 1.10.0

I'm attaching a screenshot to clearify which behaviour I mean: I labeled and validated a record (so now there is 2 versions of it in my dataset) now I query for a combination of words that only exists in this specific record, naturally I get both versions, in python via rg.load, and in the UI it also shows that there are 2 records at the bottom, but I can only see one of them

davidberenstein1957 · 2023-06-28T09:31:44Z

@marcelbusch described in our query syntax overview you would need to query a bit differently in that case "nurnberg AND demonstration AND wegstrecke" should be what you intend to achieve, correct?

w.r.t. the duplicating records, is that still happening? Could you show me a video?

marcelbusch · 2023-06-28T09:37:58Z

No, when I connect them with AND it's still the same behaviour...

Yes that's still happening, here's a video:

argilla-duplication.mp4

davidberenstein1957 · 2023-06-28T09:39:58Z

@frascuchon

davidberenstein1957 · 2023-06-28T09:40:10Z

Thanks @marcelbusch, this really helps us.

davidberenstein1957 · 2023-06-28T10:52:38Z

Hi @marcelbusch, we cannot reproduce this behaviour. Could your provide us with more context w.r.t. your data? Also, could you run pip install argilla --update, delete your current docker images, and set their tag specific to 1.10.0 during re-deployment?

frascuchon · 2023-06-28T10:53:17Z

Thanks for your feedback @marcelbusch! Is also happening when you validate a single record (without bulk validation action)

Can you share also the extra record info (you can access it by clicking the ... button) for both records?

marcelbusch · 2023-06-28T11:18:34Z

Could your provide us with more context w.r.t. your data?

I could send you a csv of my pandas dataframe, or what do you mean?

Also, could you run pip install argilla --update, delete your current docker images, and set their tag specific to 1.10.0 during re-deployment?

So I should run server version 1.10.0 with argilla python version 1.11.0? And the docker tag would be v1.10.0 or releases-1.10.0?

Is also happening when you validate a single record (without bulk validation action)

Yes, same thing when using the validate button at the bottom and when setting records per page to 1

Can you share also the extra record info (you can access it by clicking the ... button) for both records?>

They both have the same id, could this be because i set the id manually during record creation?

marcelbusch · 2023-06-29T11:33:00Z

I took a look in my debugger at this point: updateDatasetRecords and in the record that get's passed in there when I click on a label the id is incorrect, in fact it seems to be rounded to the nearest 100: the record has an initial id of 1478720431326732291, the record that gets passed in the updateDatasetRecords function has the id 1478720431326732300

frascuchon · 2023-06-30T07:53:29Z

Thanks for all this info @marcelbusch. It's really helpful. We'll work on fixing that.

As a temporal workaround, I think you can set the record id as a string instead of a number. This should avoid this problem.

frascuchon · 2023-06-30T13:57:34Z

I've been doing some test and it looks like is a javascript limit

Using curl:

curl -X 'POST' \
  "http://localhost:6900/api/datasets/test-dataset/TextClassification:search?include_metrics=false&workspace=argilla&limit=50&from=0" \
  -H 'accept: application/json' \
  -H 'X-Argilla-Api-Key: argilla.apikey' \
  -H 'Content-Type: application/json' \
  -d '{}'
{"total":1,"records":[{"id":10805720881385292014,"status":"Default","metrics":{},"last_updated":"2023-06-30T13:23:45.462574","inputs":{"additionalProp1":"string","additionalProp3":"string","additionalProp2":"string"},"multi_label":true}],"aggregations":{"predicted_as":{},"annotated_as":{},"annotated_by":{},"predicted_by":{},"status":{"Default":1},"predicted":{},"score":{},"words":{"string":1},"metadata":{}}}

From UI:

@leiyre @damianpumar @keithCuniah Any ideas?

# Description This PR deprecates the integer support for record id. Also, warn if big integers are used as record id Refs #3265 **Type of change** (Please delete options that are not relevant. Remember to title the PR according to the type of change) **How Has This Been Tested** After the changes the, issue cannot be reproduced anymore. Tested in dev environments **Checklist** - [ ] I added relevant documentation - [X] follows the style guidelines of this project - [X] I did a self-review of my code - [ ] I made corresponding changes to the documentation - [ ] My changes generate no new warnings - [X] I have added tests that prove my fix is effective or that my feature works - [ ] I filled out [the contributor form](https://tally.so/r/n9XrxK) (see text above) - [X] I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/) --------- Co-authored-by: David Berenstein <david.m.berenstein@gmail.com>

davidberenstein1957 assigned frascuchon Jun 28, 2023

frascuchon added the type: bug Indicates an unexpected problem or unintended behavior label Jun 28, 2023

frascuchon assigned leiyre Jun 28, 2023

frascuchon added this to the v1.12.0 milestone Jun 28, 2023

alvarobartt modified the milestones: v1.12.0, v1.13 Jun 29, 2023

frascuchon assigned damianpumar and keithCuniah Jul 3, 2023

frascuchon mentioned this issue Jul 4, 2023

feat: warn users when using integer record ids #3328

Merged

8 tasks

frascuchon linked a pull request Jul 4, 2023 that will close this issue

feat: warn users when using integer record ids #3328

Merged

8 tasks

frascuchon closed this as completed in #3328 Jul 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Validating multi-label text classification records makes new copies of them instead of updating #3265

[BUG] Validating multi-label text classification records makes new copies of them instead of updating #3265

marcelbusch commented Jun 26, 2023 •

edited

Loading

davidberenstein1957 commented Jun 27, 2023

marcelbusch commented Jun 28, 2023

davidberenstein1957 commented Jun 28, 2023

marcelbusch commented Jun 28, 2023

davidberenstein1957 commented Jun 28, 2023

davidberenstein1957 commented Jun 28, 2023

davidberenstein1957 commented Jun 28, 2023

frascuchon commented Jun 28, 2023

marcelbusch commented Jun 28, 2023

marcelbusch commented Jun 29, 2023

frascuchon commented Jun 30, 2023

frascuchon commented Jun 30, 2023

[BUG] Validating multi-label text classification records makes new copies of them instead of updating #3265

[BUG] Validating multi-label text classification records makes new copies of them instead of updating #3265

Comments

marcelbusch commented Jun 26, 2023 • edited Loading

davidberenstein1957 commented Jun 27, 2023

marcelbusch commented Jun 28, 2023

davidberenstein1957 commented Jun 28, 2023

marcelbusch commented Jun 28, 2023

davidberenstein1957 commented Jun 28, 2023

davidberenstein1957 commented Jun 28, 2023

davidberenstein1957 commented Jun 28, 2023

frascuchon commented Jun 28, 2023

marcelbusch commented Jun 28, 2023

marcelbusch commented Jun 29, 2023

frascuchon commented Jun 30, 2023

frascuchon commented Jun 30, 2023

marcelbusch commented Jun 26, 2023 •

edited

Loading