improvement: Better efficiency of Weak Labels when vectors exist #3444
Closes #3404
Hello!
Description
As mentioned in #3404, weak labels are slow to build when the dataset contains vectors. Weak labeling support predates vector support, so when vectors started being included by default in every `rg.load` call, weak labeling became much slower. In short, vectors were added around January 2023, while weak labeling dates from around October 2022. It seems we can safely set `include_vectors=False`, as I cannot find any place where the loaded vectors are actually used: the weak labeling code relies on `extend_matrix` rather than the newer `record.vectors`. A sketch of the kind of change is below.
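As an illustration only (the function name and exact call site are hypothetical; the real change lives inside the weak labeling helpers), the fix amounts to passing `include_vectors=False` wherever the dataset is loaded:

```python
import argilla as rg

def load_records_for_weak_labeling(dataset: str, query: str = None):
    """Illustrative helper: load records without their vectors.

    Weak labeling only consumes the record texts and annotations,
    so skipping the (potentially large) vectors avoids needless
    download and deserialization work.
    """
    return rg.load(
        dataset,
        query=query,
        include_vectors=False,  # the fix: don't fetch vectors that are never read
    )
```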
Type of change
How Has This Been Tested
Via `pytest .\tests\labeling\text_classification\` and by running a timing script (sketched below). The script loads a dataset with vectors, adds some rules, and then builds the weak labels under a timer. On `develop`, i.e. with vectors included, this takes ~27.5 seconds; with `include_vectors=False`, it takes ~5.5 seconds.
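The original timing script is not reproduced here; a rough equivalent, with a placeholder dataset name and example rules standing in for the real ones, looks like:

```python
import time

from argilla.labeling.text_classification import Rule, WeakLabels

DATASET = "my_dataset_with_vectors"  # placeholder: any dataset logged with vectors

# Example rules for illustration; the real script used rules suited to its dataset.
rules = [
    Rule(query="money OR cash", label="SPAM"),
    Rule(query="thanks", label="HAM"),
]

start = time.perf_counter()
weak_labels = WeakLabels(rules=rules, dataset=DATASET)
elapsed = time.perf_counter() - start
print(f"Building WeakLabels took {elapsed:.1f}s")  # ~27.5s before, ~5.5s after
```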
See also the before (with vectors) and after (without vectors) cProfile graphs, attached to this PR as images.
Checklist
(Note: I'm unsure whether this best falls under "Changed" or another category in the CHANGELOG)