improvement: Better efficiency of Weak Labels when vectors exist #3444

tomaarsen · 2023-07-24T10:48:19Z

Hello!

Description

As mentioned in #3404, weak labels are slow when the dataset contains vectors. Weak labeling support was added before we supported vectors at all, so once vectors started being included by default in every rg.load call, weak labeling became much slower.
In short, vectors were added around January 2023, while weak labeling dates from around October 2022. It seems that we can safely set include_vectors=False, as I cannot find a place where the loaded vectors are actually used: it seems that embeddings are passed in via extend_matrix rather than read from the newer record.vectors.
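
To illustrate why skipping vectors can matter, here is a minimal, self-contained sketch that does not use Argilla at all. It builds toy records as plain dicts, optionally attaches an embedding to each, and compares the size of the serialized payload. The record layout, the 384-dimensional vector, the "mini" vector name, and the build_records and payload_size helpers are all assumptions made for illustration; that payload size is what drives the slowdown is a plausible guess, not something this PR measures.

```python
import json
import random


def build_records(n, include_vectors, dim=384, seed=0):
    """Build n toy records; attach a dim-sized float vector only if requested.

    Hypothetical stand-in for a dataset load, not Argilla's implementation.
    """
    rng = random.Random(seed)
    records = []
    for i in range(n):
        record = {"id": i, "text": f"example text {i}"}
        if include_vectors:
            record["vectors"] = {"mini": [rng.random() for _ in range(dim)]}
        records.append(record)
    return records


def payload_size(records):
    """Size in bytes of the JSON payload that would be transferred and parsed."""
    return len(json.dumps(records).encode("utf-8"))


with_vectors = payload_size(build_records(100, include_vectors=True))
without_vectors = payload_size(build_records(100, include_vectors=False))
print(f"with vectors:    {with_vectors} bytes")
print(f"without vectors: {without_vectors} bytes")
```

Every record carries hundreds of floats that weak labeling never reads, so leaving them out shrinks the work done per record.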

Type of change

Improvement

How Has This Been Tested

Via pytest .\tests\labeling\text_classification\ and by running:

import time
import argilla as rg
from argilla.labeling.text_classification import Rule, WeakLabels, load_rules, add_rules

from datasets import load_dataset

dataset_ds = load_dataset("argilla/agnews_weak_labeling", split="train")
dataset_rb = rg.read_datasets(dataset_ds, task="TextClassification")
rg.log(dataset_rb, name="agnews_weak_labeling_test")

# define queries and patterns for each category (using ES DSL)
queries = [
    (["money", "financ*", "dollar*"], "Business"),
    (["war", "gov*", "minister*", "conflict"], "World"),
    (["footbal*", "sport*", "game", "play*"], "Sports"),
    (["sci*", "techno*", "computer*", "software", "web"], "Sci/Tech"),
]

# define rules
rules = [Rule(query=term, label=label) for terms, label in queries for term in terms]

add_rules(dataset="agnews_weak_labeling_test", rules=rules)

start_t = time.time()
weak_labels = WeakLabels(dataset="agnews_weak_labeling_test")
print(f"{time.time() - start_t:.4f}s to load records and apply weak labels.")

This script loads a dataset with vectors, adds some rules, and then loads the weak labels with a timer. On develop, i.e. with vectors, this takes ~27.5 seconds. When include_vectors=False, it takes ~5.5 seconds.
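
For reference, the two timings reported above work out to the following relative improvement. This is plain arithmetic on the approximate figures quoted in this description, not a new measurement; the variable names are illustrative.

```python
before_s = 27.5  # reported on develop, vectors included
after_s = 5.5    # reported with include_vectors=False

# Ratio of the two wall-clock times and the share of time saved.
speedup = before_s / after_s
time_saved_pct = (before_s - after_s) / before_s * 100

print(f"{speedup:.1f}x faster, {time_saved_pct:.0f}% less time")
# prints "5.0x faster, 80% less time"
```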

See also these cProfile graphs:
Before: [cProfile graph image]
After: [cProfile graph image]

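Profiles like these can be collected with the standard library alone. The sketch below profiles a hypothetical stand-in function, slow_load, and prints the ten most expensive entries by cumulative time. It does not reproduce the graphs from this PR, and the PR does not say which tool rendered them.

```python
import cProfile
import io
import pstats


def slow_load(n=200_000):
    """Hypothetical stand-in for the code being profiled."""
    return sum(i * i for i in range(n))


# Collect profiling data only around the call of interest.
profiler = cProfile.Profile()
profiler.enable()
slow_load()
profiler.disable()

# Render the ten entries with the highest cumulative time into a string.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(10)
report = stream.getvalue()
print(report)
```

To compare two versions, run the same block on each branch and compare the cumulative times of the top entries.
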
Checklist

I added relevant documentation
follows the style guidelines of this project
I did a self-review of my code
I made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
I filled out the contributor form (see text above)
I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

(Note: I'm unsure whether this best falls under "Changed" or another category in the CHANGELOG)

Tom Aarsen

codecov · 2023-07-24T19:56:38Z

Codecov Report

Patch coverage: 93.43% and project coverage change: +0.14% 🎉

Comparison is base (6630d7b) 90.13% compared to head (27ec573) 90.28%.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #3444      +/-   ##
===========================================
+ Coverage    90.13%   90.28%   +0.14%     
===========================================
  Files          233      243      +10     
  Lines        12493    13226     +733     
===========================================
+ Hits         11261    11941     +680     
- Misses        1232     1285      +53

Flag	Coverage Δ
pytest	`?`

Files Changed	Coverage Δ
...ack/integrations/huggingface/card/_dataset_card.py	`100.00% <ø> (ø)`
.../feedback/integrations/huggingface/card/_parser.py	`100.00% <ø> (ø)`
src/argilla/client/feedback/types.py	`100.00% <ø> (ø)`
src/argilla/client/sdk/commons/errors.py	`72.22% <ø> (ø)`
src/argilla/feedback/__init__.py	`100.00% <ø> (ø)`
src/argilla/tasks/database/migrate.py	`39.13% <ø> (-4.87%)`	⬇️
src/argilla/training/autotrain_advanced.py	`0.00% <0.00%> (ø)`
src/argilla/utils/telemetry.py	`89.09% <ø> (ø)`
src/argilla/client/feedback/training/schemas.py	`87.50% <50.00%> (-1.01%)`	⬇️
src/argilla/server/settings.py	`77.41% <50.00%> (-3.76%)`	⬇️
... and 74 more

... and 4 files with indirect coverage changes

frascuchon

💯

…rgilla into feat/shortcuts-improvements
* 'feat/shortcuts-improvements' of github.com:argilla-io/argilla:
  feat: update CLI to use async connection to DB (#3450)
  feat: add more value validations for rating questions (#3452)
  ci: selective `runs-on` value for tests execution (#3455)
  feat: update `package.yml` triggers (#3422)
  fix: uncancellable CI jobs (#3458)
  chore: Fix `ruff` line length (#3459)
  [pre-commit.ci] pre-commit autoupdate (#3449)
  improvement: Better efficiency of Weak Labels when vectors exist (#3444)
  refactor: add `ArgillaDatasetMixin` and re-structure `argilla.feedback.schemas` (#3427)
  chore: Set release version
  fix: add missing `suggestion_type_enum` values (#3445)
  [pre-commit.ci] pre-commit autoupdate (#3380)
  docs: fix username in HF Spaces docs (#3432)

Improve efficiency of Weak Labels when vectors exist

5206e28

tomaarsen requested a review from frascuchon July 24, 2023 10:48

Updated changelog accordingly

27ec573

frascuchon approved these changes Jul 25, 2023

Merge branch 'develop' into enhancement/improve_weak_labeling_efficiency

ea034ff

frascuchon merged commit 7a7cb68 into argilla-io:develop Jul 25, 2023
2 checks passed

tomaarsen deleted the enhancement/improve_weak_labeling_efficiency branch July 25, 2023 12:36
