improvement: Better efficiency of Weak Labels when vectors exist #3444

@tomaarsen tomaarsen commented Jul 24, 2023

Closes #3404

Hello!

Description

As mentioned in #3404, computing weak labels is slow when the dataset contains vectors. Weak labeling support was added before vectors were supported, so once vectors started being included by default in every rg.load call, weak labeling became much slower.
In short, vectors were added around January 2023, while weak labeling dates from around October 2022. It seems that we can safely set include_vectors=False, as I cannot find any place where the vectors are actually used: weak labeling appears to take embeddings via extend_matrix rather than the newer record.vectors.
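
To illustrate why skipping vectors helps, here is a minimal, self-contained sketch that compares the serialized size of records with and without an attached embedding. The record layout, the `fake_record` and `payload_size` helpers, and the 384-dimensional vector are assumptions made for illustration only; this is not Argilla's record schema or loading code. In the real client the equivalent switch is the include_vectors argument on rg.load discussed above.

```python
import json
import random


def fake_record(i, dim=384, include_vectors=True):
    """Build a stand-in record (hypothetical layout, not Argilla's schema)."""
    record = {"id": i, "text": f"example text {i}", "annotation": None}
    if include_vectors:
        # One embedding per record; this is the part that dominates the payload.
        record["vectors"] = {"embedding": [random.random() for _ in range(dim)]}
    return record


def payload_size(n_records, include_vectors):
    """Return the size in bytes of the JSON payload for n_records records."""
    records = [
        fake_record(i, include_vectors=include_vectors) for i in range(n_records)
    ]
    return len(json.dumps(records).encode("utf-8"))


with_vectors = payload_size(100, include_vectors=True)
without_vectors = payload_size(100, include_vectors=False)
print(f"with vectors: {with_vectors} bytes, without vectors: {without_vectors} bytes")
```

The exact numbers depend on the vector dimension, but the payload with vectors should be far larger, which is consistent with the slowdown described above.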

Type of change

  • Improvement

How Has This Been Tested

Via pytest .\tests\labeling\text_classification\ and by running:

import time
import argilla as rg
from argilla.labeling.text_classification import Rule, WeakLabels, add_rules

from datasets import load_dataset

dataset_ds = load_dataset("argilla/agnews_weak_labeling", split="train")
dataset_rb = rg.read_datasets(dataset_ds, task="TextClassification")
rg.log(dataset_rb, name="agnews_weak_labeling_test")

# define queries and patterns for each category (using ES DSL)
queries = [
    (["money", "financ*", "dollar*"], "Business"),
    (["war", "gov*", "minister*", "conflict"], "World"),
    (["footbal*", "sport*", "game", "play*"], "Sports"),
    (["sci*", "techno*", "computer*", "software", "web"], "Sci/Tech"),
]

# define rules
rules = [Rule(query=term, label=label) for terms, label in queries for term in terms]

add_rules(dataset="agnews_weak_labeling_test", rules=rules)

start_t = time.time()
weak_labels = WeakLabels(dataset="agnews_weak_labeling_test")
print(f"{time.time() - start_t:.4f}s to load records and apply weak labels.")

This script loads a dataset with vectors, adds some rules, and then times the construction of WeakLabels. On develop, i.e. with vectors included, this takes ~27.5 seconds. With include_vectors=False, it takes ~5.5 seconds, roughly a 5x speedup.
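
For repeating the before/after measurement, a small timing helper along these lines may be convenient. It is a sketch using only the standard library; the `timed` context manager and the `speedup` function are hypothetical names introduced here, and the `time.sleep` call merely stands in for the WeakLabels construction in the script above.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the wrapped block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def speedup(before_s, after_s):
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before_s / after_s


results = {}
with timed("stand-in workload", results):
    time.sleep(0.01)  # placeholder for the code being measured

print(f"measured: {results['stand-in workload']:.4f}s")
# Timings reported in this PR: ~27.5 s before, ~5.5 s after.
print(f"speedup: {speedup(27.5, 5.5):.1f}x")
```

`time.perf_counter` is used rather than `time.time` because it is a monotonic, high-resolution clock intended for measuring intervals.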

See also these cProfile graphs:
Before: [image: Schermafbeelding_20230724_123356]

After: [image: Schermafbeelding_20230724_123405]
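
A profile like the ones pictured can be collected with the standard library's cProfile and pstats modules. The sketch below profiles a stand-in workload; `slow_load` is a hypothetical function invented for this example, not part of Argilla, and the tool used to render the graphs above is not stated in this PR.

```python
import cProfile
import io
import pstats


def slow_load(n=200, dim=256):
    """Stand-in workload: build n records that each carry a vector."""
    return [{"id": i, "vector": [float(j) for j in range(dim)]} for i in range(n)]


# Profile only the call of interest.
profiler = cProfile.Profile()
profiler.enable()
records = slow_load()
profiler.disable()

# Write the five most expensive entries, by cumulative time, into a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

To profile the real script instead, the `slow_load()` call would be replaced by the WeakLabels construction shown earlier, which requires a running Argilla instance.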

Checklist

  • I added relevant documentation
  • My code follows the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I filled out the contributor form (see text above)
  • I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

(Note: I'm unsure whether this best falls under "Changed" or another category in the CHANGELOG)


- Tom Aarsen

codecov bot commented Jul 24, 2023

Codecov Report

Patch coverage: 93.43% and project coverage change: +0.14% 🎉

Comparison is base (6630d7b) 90.13% compared to head (27ec573) 90.28%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3444      +/-   ##
===========================================
+ Coverage    90.13%   90.28%   +0.14%     
===========================================
  Files          233      243      +10     
  Lines        12493    13226     +733     
===========================================
+ Hits         11261    11941     +680     
- Misses        1232     1285      +53     
Flag Coverage Δ
pytest ?

Flags with carried forward coverage won't be shown.

Files Changed Coverage Δ
...ack/integrations/huggingface/card/_dataset_card.py 100.00% <ø> (ø)
.../feedback/integrations/huggingface/card/_parser.py 100.00% <ø> (ø)
src/argilla/client/feedback/types.py 100.00% <ø> (ø)
src/argilla/client/sdk/commons/errors.py 72.22% <ø> (ø)
src/argilla/feedback/__init__.py 100.00% <ø> (ø)
src/argilla/tasks/database/migrate.py 39.13% <ø> (-4.87%) ⬇️
src/argilla/training/autotrain_advanced.py 0.00% <0.00%> (ø)
src/argilla/utils/telemetry.py 89.09% <ø> (ø)
src/argilla/client/feedback/training/schemas.py 87.50% <50.00%> (-1.01%) ⬇️
src/argilla/server/settings.py 77.41% <50.00%> (-3.76%) ⬇️
... and 74 more

... and 4 files with indirect coverage changes


@frascuchon frascuchon left a comment

💯

@frascuchon frascuchon merged commit 7a7cb68 into argilla-io:develop Jul 25, 2023
2 checks passed
@tomaarsen tomaarsen deleted the enhancement/improve_weak_labeling_efficiency branch July 25, 2023 12:36
leiyre pushed a commit that referenced this pull request Aug 1, 2023
…rgilla into feat/shortcuts-improvements

* 'feat/shortcuts-improvements' of github.com:argilla-io/argilla:
  feat: update CLI to use async connection to DB (#3450)
  feat: add more value validations for rating questions (#3452)
  ci: selective `runs-on` value for tests execution (#3455)
  feat: update `package.yml` triggers (#3422)
  fix: uncancellable CI jobs (#3458)
  chore: Fix `ruff` line length (#3459)
  [pre-commit.ci] pre-commit autoupdate (#3449)
  improvement: Better efficiency of Weak Labels when vectors exist (#3444)
  refactor: add `ArgillaDatasetMixin` and re-structure `argilla.feedback.schemas` (#3427)
  chore: Set release version
  fix: add missing `suggestion_type_enum` values (#3445)
  [pre-commit.ci] pre-commit autoupdate (#3380)
  docs: fix username in HF Spaces docs (#3432)
Development

Successfully merging this pull request may close these issues.

[BUG] WeakLabels take so much time to compute when vectors are present
2 participants