
rg.load takes too long because of the vector field #2398

Closed
cceyda opened this issue Feb 22, 2023 · 9 comments · Fixed by #2417 or #2668
Comments

cceyda (Contributor) commented Feb 22, 2023

Is your feature request related to a problem? Please describe.
rg.load takes too long because of the vector field, even when I don't need it.

Describe the solution you'd like
Allow passing a param (or doing it through the query) to use the Elasticsearch `_source` field to exclude the vector when it is not needed.

"_source": {        
        "exclude": ["vector"]
    }
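
For reference, here is a minimal sketch of the kind of query this maps to on the Elasticsearch side, assuming a plain `elasticsearch` Python client and an illustrative index name (this is not Argilla's internal code):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative connection settings

# Fetch records but drop the large vector field from the returned _source.
resp = es.search(
    index="my_dataset",  # hypothetical index name
    body={
        "query": {"match_all": {}},
        "_source": {"excludes": ["vector"]},  # skip vectors we don't need
    },
)
records = [hit["_source"] for hit in resp["hits"]["hits"]]
```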

Describe alternatives you've considered
wait

Additional context
Low priority

cceyda added the type: enhancement label on Feb 22, 2023
davidberenstein1957 (Member):

#2261

davidberenstein1957 (Member):

Hi @cceyda, thanks for picking this up! I think we need to review the rg.log/rg.load performance in general, given that a lot of users have had issues with this for larger records.

Perhaps we can implement something like you suggested, or some lazy loading. At the least, I expect to be able to fix some of these load issues by implementing #2341

orasik (Contributor) commented Feb 24, 2023

Excluding vectors would be great. Technically, vectors will not be required in some cases when training a model.

Could the option `vectors=None` in `rg.load` be used to exclude the record's vector? @davidberenstein1957

dvsrepo (Member) commented Feb 24, 2023

Thanks @cceyda and @orasik, this is a needed and requested feature. @frascuchon, we discussed this a long time ago (getting a projection of only some fields when using `rg.load`). Is there a relatively easy solution to this?

davidberenstein1957 (Member) commented Feb 24, 2023 via email

orasik (Contributor) commented Feb 24, 2023

@davidberenstein1957 @frascuchon I found a bug when I tried to hard-code a list of fields to test.

This line

```python
projection={"*"},
```

leads to this code for selecting which fields to include:

```python
"fields": list(projection) if projection else ["id"],
```

If you make it `"fields": ["id"]`, you will receive this error:

```
records = [sdk_record_class.parse_obj(r).to_client() for r in records]
  File "pydantic/main.py", line 527, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 342, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for TextClassificationRecord
```

Not sure what validation behind `parse_obj` is causing this error.
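
A minimal sketch of why `parse_obj` fails here, using a simplified stand-in for the client's `TextClassificationRecord` (the real model has more fields):

```python
from typing import Dict, Optional
from pydantic import BaseModel, ValidationError

class TextClassificationRecord(BaseModel):  # simplified stand-in for the client model
    id: Optional[str] = None
    inputs: Dict[str, str]  # required field: no default value

try:
    # Only "id" was projected from the search response, so "inputs" is missing.
    TextClassificationRecord.parse_obj({"id": "123"})
except ValidationError as err:
    print(err)  # -> 1 validation error ... inputs: field required
```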

orasik (Contributor) commented Feb 24, 2023

Found the issue.

`inputs` is a required field in the validation. If I change the line

```python
"fields": list(projection) if projection else ["id"],
```

to:

```python
"fields": ["id", "inputs"]
```

it works fine.

I will raise a PR soon

orasik (Contributor) commented Feb 24, 2023

PR #2417

davidberenstein1957 (Member):

@frascuchon I think we should add some handling w.r.t. which fields we update during a `rg.log` afterwards, to avoid overwriting records without vectors. Can this be done easily?

frascuchon added this to the v1.5.0 milestone on Mar 8, 2023
frascuchon added a commit that referenced this issue Mar 9, 2023
# Description

Add the ability to specify which fields to retrieve when loading data from Argilla. `rg.load` takes too long because of the vector field, even when it is not needed.

Closes #2398

**Type of change**

(Please delete options that are not relevant. Remember to title the PR
according to the type of change)

- [X] Bug fix (non-breaking change which fixes an issue)
- [X] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [X] Refactor (change restructuring the codebase without changing
functionality)
- [X] Improvement (change adding some improvement to an existing
functionality)
- [X] Documentation update

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And
ideally, reference `tests`)

- [X] Using the following code: 
```python
import argilla as rg
labelled_ds = rg.load("my_data_set", limit=2, fields=["text"])
# will load id, inputs, and text
labelled_ds = rg.load("my_data_set", limit=2)
# will load everything
```

**Checklist**

- [X] I have merged the original branch into my forked branch
- [X] I added relevant documentation
- [X] follows the style guidelines of this project
- [X] I did a self-review of my code
- [X] I made corresponding changes to the documentation
- [X] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
frascuchon reopened this on Mar 22, 2023
frascuchon modified the milestones: v1.5.0, v1.6.0 on Mar 22, 2023
frascuchon added a commit that referenced this issue Mar 22, 2023
## [1.5.0](v1.4.0...v1.5.0) - 2023-03-21

### Added

- Add the fields to retrieve when loading the data from argilla.
`rg.load` takes too long because of the vector field, even when users
don't need it. Closes
[#2398](#2398)
- Add new page and components for dataset settings. Closes
[#2442](#2003)
- Add ability to show an image in records (for TokenClassification and TextClassification) if a URL is passed in metadata with the key `_image_url`
- Non-searchable fields support in metadata.
[#2570](#2570)

### Changed

- Labels are now centralized in a specific vuex ORM called GlobalLabel
Model, see #2210. This model
is the same for TokenClassification and TextClassification (so both tasks
have labels with color_id and shortcuts parameters in the vuex ORM)
- The shortcuts improvement for labels
[#2339](#2339) has been moved
to the vuex ORM in the dataset settings feature
[#2444](eb37c3b)
- Update "Define a labeling schema" section in docs.
- The record inputs are sorted alphabetically in UI by default.
[#2581](#2581)

### Fixed

- Allow URL to be clickable in Jupyter notebook again. Closes
[#2527](#2527)

### Removed

- Removed some deprecated data scan endpoints used by old clients. This
change will break compatibility with clients `<v1.3.0`
- Stopped using the old deprecated scan endpoints in the Python client. This
will break client compatibility with server versions `<1.3.0`
- Removed the previous way to add labels through the dataset page. Now
labels can be added only through the dataset settings page.



### As always, thanks to our amazing contributors!
- Documentation update: tutorial for text classification models
comparison (#2426) by @embonhomme
- Docs: fix little typo (#2522) by @anakin87
- Docs: Tutorial on image classification (#2420) by @burtenshaw
frascuchon modified the milestones: v1.6.0, v1.7.0 on Apr 9, 2023
frascuchon added a commit that referenced this issue Apr 16, 2023
# NOTE

~~PR #2576 must be merged first to make this PR "mergeable"~~

# Description

Work implemented in #2577 and
#2540 tries to introduce a
mechanism for:

1. Loading data with field projections (selecting which record fields
will be loaded)
2. Allow partial updates when uploading projected records

Some problems with this flow were raised during the work:

- There is a field mismatch between client and API records. The proposed
field projection refers to the Record API data model, and aligning
it with the client representation can be messy and give unexpected
behavior.
- The update flag mechanism is a bit cumbersome because of the client
data projection and the record class validation alignments. So, even for
a single-field update, you need to retrieve extra fields to keep the
validations aligned.

The field projection in the `rg.load` function can introduce confusion
because the two record models, client and API, are not aligned. So, I
prefer to remove this functionality. Instead, we can introduce
some extra parameters to `load` so users can choose to exclude
`vectors` or `metrics` when loading data. This will help to mitigate the
problems found in #2398

This option, together with changes introduced in #2576, allows users to
load partial record info and still update it without losing stored data.

For example, if users want to add some metadata to already stored
records, they could try the following flow:

```python
ds = rg.load(name=dataset, exclude_vectors=True)

for record in ds:
   record.metadata = ...

rg.log(name=dataset, records=ds) # This will update records only by adding the metadata info
```

Closes [#2398](#2398)

Refs: #2577 and
#2540

**Type of change**

(Please delete options that are not relevant. Remember to title the PR
according to the type of change)

- [x] Improvement (change adding some improvement to an existing
functionality)


**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And
ideally, reference `tests`)

Existing tests have been modified to include this new behavior.

**Checklist**

- [x] I have merged the original branch into my forked branch
- [ ] I added relevant documentation
- [x] follows the style guidelines of this project
- [x] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works
- [x] I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)

---------

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
frascuchon added a commit that referenced this issue May 10, 2023
## [1.7.0](v1.6.0...v1.7.0)

### Added

- Added `max_retries` and `num_threads` parameters to `rg.log` to run the data
logging request concurrently with a backoff retry policy. See
[#2458](#2458) and
[#2533](#2533)
- `rg.load` accepts `include_vectors` and `include_metrics` when loading
data (see the usage sketch after this list). Closes [#2398](#2398)
- Added `settings` param to `prepare_for_training`
([#2689](#2689))
- Added `prepare_for_training` for `openai`
([#2658](#2658))
- Added `ArgillaOpenAITrainer`
([#2659](#2659))
- Added `ArgillaSpanMarkerTrainer` for Named Entity Recognition
([#2693](#2693))
- Added `ArgillaTrainer` CLI support. Closes
([#2809](#2809))
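
A short usage sketch based on the parameter names listed above (the dataset name is illustrative and exact defaults may differ by version):

```python
import argilla as rg

# Load without vectors or metrics to keep the payload small.
ds = rg.load(name="my_dataset", include_vectors=False, include_metrics=False)

# Log records concurrently, retrying failed batches with a backoff policy.
rg.log(records=ds, name="my_dataset", num_threads=4, max_retries=3)
```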

### Changed

- Argilla quickstart image dependencies are externalized into
`quickstart.requirements.txt`. See
[#2666](#2666)
- bulk endpoints will upsert data when record `id` is present. Closes
[#2535](#2535)
- Moved from `click` to `typer` for CLI support. Closes
([#2815](#2815))
- Argilla server docker image is built with PostgreSQL support. Closes
[#2686](#2686)
- `rg.log` computes all batches and raises an error for all failed
batches.
- The default batch size for `rg.log` is now 100.

### Fixed

- `argilla.training` bugfixes and unification
([#2665](#2665))
- Resolved several small bugs in the `ArgillaTrainer`.

### Deprecated

- The `rg.log_async` function is deprecated and will be removed in the next
minor release.