-
Notifications
You must be signed in to change notification settings - Fork 399
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rg.load takes too long because of the vector field #2398
Comments
Hi @cceyda, thanks for picking this up! I think we need to review the Perhaps we can implement something like you suggested, or some lazy loading. At least, I expect to be able to fix some of these load issues by implementing this #2341 |
Excluding vectors would be great. Technically vectors will not be required in some cases when training a model. The option |
thanks @cceyda and @orasik , this is a needed and requested feat. @frascuchon we discussed this a long time ago (get a projection of only some fields when using rg.load) Is there a relatively easy solution to this? |
I agree with the idea about the projection but IMO they are 2 separate things. Perhaps we can also think about quantization of the vectors to reduce the size a bit. Ideally, I also see this as a broader topic where we potentially can choose which attributes to load/update for a list of records. On 24 Feb 2023, at 21:40, Daniel Vila Suero ***@***.***> wrote:
thanks @cceyda and @orasik , this is a needed and requested feat. @frascuchon we discussed this a long time ago (get a projection of only some fields when using rg.load) Is there a relatively easy solution to this?
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: ***@***.***>
|
@davidberenstein1957 @frascuchon I found a bug when I tried to hard-code a list of fields to test. This line argilla/src/argilla/client/client.py Line 707 in c35d63a
is leading to this code for selecting which fields to include argilla/src/argilla/client/apis/datasets.py Line 185 in c35d63a
If you make You will receive this error:
Not sure what is the validation behind parse_obj that is causing this error. |
Found the issue.
argilla/src/argilla/client/apis/datasets.py Line 185 in c35d63a
to:
it works fine. I will raise a PR soon |
PR #2417 |
@frascuchon I think we should add some w.r.t. which fields we update during a |
# Description Add the fields to retrieve when loading the data from argilla. `rg.load` takes too long because of the vector field, even when I don't need it. Closes #2398 **Type of change** (Please delete options that are not relevant. Remember to title the PR according to the type of change) - [X] Bug fix (non-breaking change which fixes an issue) - [X] New feature (non-breaking change which adds functionality) - [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected) - [X] Refactor (change restructuring the codebase without changing functionality) - [X] Improvement (change adding some improvement to an existing functionality) - [X] Documentation update **How Has This Been Tested** (Please describe the tests that you ran to verify your changes. And ideally, reference `tests`) - [X] Using the following code: ``` import argilla as rg labelled_ds = rg.load("my_data_set", limit=2, fields=["text"]) # will load id, inputs, and text labelled_ds = rg.load("my_data_set", limit=2) # will load everything ``` **Checklist** - [X] I have merged the original branch into my forked branch - [X] I added relevant documentation - [X] follows the style guidelines of this project - [X] I did a self-review of my code - [X] I made corresponding changes to the documentation - [X] My changes generate no new warnings - [ ] I have added tests that prove my fix is effective or that my feature works
## [1.5.0](v1.4.0...v1.5.0) - 2023-03-21 ### Added - Add the fields to retrieve when loading the data from argilla. `rg.load` takes too long because of the vector field, even when users don't need it. Closes [#2398](#2398) - Add new page and components for dataset settings. Closes [#2442](#2003) - Add ability to show image in records (for TokenClassification and TextClassification) if an URL is passed in metadata with the key \_image_url - Non-searchable fields support in metadata. [#2570](#2570) ### Changed - Labels are now centralized in a specific vuex ORM called GlobalLabel Model, see #2210. This model is the same for TokenClassification and TextClassification (so both task have labels with color_id and shortcuts parameters in the vuex ORM) - The shortcuts improvement for labels [#2339](#2339) have been moved to the vuex ORM in dataset settings feature [#2444](eb37c3b) - Update "Define a labeling schema" section in docs. - The record inputs are sorted alphabetically in UI by default. [#2581](#2581) ### Fixes - Allow URL to be clickable in Jupyter notebook again. Closes [#2527](#2527) ### Removed - Removing some data scan deprecated endpoints used by old clients. This change will break compatibility with client `<v1.3.0` - Stop using old scan deprecated endpoints in python client. This logic will break client compatibility with server version `<1.3.0` - Remove the previous way to add labels through the dataset page. Now labels can be added only through dataset settings page. ### As always, thanks to our amazing contributors! - Documentation update: tutorial for text classification models comparison (#2426) by @embonhomme - Docs: fix little typo (#2522) by @anakin87 - Docs: Tutorial on image classification (#2420) by @burtenshaw
# NOTE ~~PR #2576 must be merged first to make this PR "mergeable"~~ # Description Work implemented in #2577 and #2540 tries to introduce a mechanism for: 1. Loading data with field projections (selecting which record fields will be loaded) 2. Allow partial updates when uploading projected records Some problems of this flow are raised during the work: - There is a field mismatch between client and API records. The proposed field projection is referring to the Record API data model, and aligning it with client representation can be messy and give an unexpected behavior - The update flag mechanism is a bit cumbersome because of the client data projection and the record class validation alignments. So, even for a single field update, you need to retrieve extra fields to keep aligned the validations. The field projection in `rg.load` function can introduce confusion because both record models, client, and API, are not aligned. So, I prefer to remove this functionality. Instead of that, we can introduce some extra parameters to load for setup if users want to exclude `vectors` or `metrics` when loading data. This will help to mitigate problems founds in #2398 This option, together with changes introduced in #2576, allows users to load partial record info and still update it without losing stored data. For example, if users want to add some metadata to already stored records, they could try the following flow: ```python ds = rg.load(name=dataset, exclude_vectors=True) for record in ds: record.metadata = ... rg.log(name=dataset, records=ds) # This will update records only by adding the metadata info ``` Closes [#2398](#2398) Refs: #2577 and #2540 **Type of change** (Please delete options that are not relevant. Remember to title the PR according to the type of change) - [x] Improvement (change adding some improvement to an existing functionality) **How Has This Been Tested** (Please describe the tests that you ran to verify your changes. And ideally, reference `tests`) Existing tests are been modified including this new behavior **Checklist** - [x] I have merged the original branch into my forked branch - [ ] I added relevant documentation - [x] follows the style guidelines of this project - [x] I did a self-review of my code - [ ] I made corresponding changes to the documentation - [ ] My changes generate no new warnings - [x] I have added tests that prove my fix is effective or that my feature works - [x] I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/) --------- Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
## [1.7.0](v1.6.0...v1.7.0) ### Added - add `max_retries` and `num_threads` parameters to `rg.log` to run data logging request concurrently with backoff retry policy. See [#2458](#2458) and [#2533](#2533) - `rg.load` accepts `include_vectors` and `include_metrics` when loading data. Closes [#2398](#2398) - Added `settings` param to `prepare_for_training` ([#2689](#2689)) - Added `prepare_for_training` for `openai` ([#2658](#2658)) - Added `ArgillaOpenAITrainer` ([#2659](#2659)) - Added `ArgillaSpanMarkerTrainer` for Named Entity Recognition ([#2693](#2693)) - Added `ArgillaTrainer` CLI support. Closes ([#2809](#2809)) ### Changed - Argilla quickstart image dependencies are externalized into `quickstart.requirements.txt`. See [#2666](#2666) - bulk endpoints will upsert data when record `id` is present. Closes [#2535](#2535) - moved from `click` to `typer` CLI support. Closes ([#2815](#2815)) - Argilla server docker image is built with PostgreSQL support. Closes [#2686](#2686) - The `rg.log` computes all batches and raise an error for all failed batches. - The default batch size for `rg.log` is now 100. ### Fixed - `argilla.training` bugfixes and unification ([#2665](#2665)) - Resolved several small bugs in the `ArgillaTrainer`. ### Deprecated - The `rg.log_async` function is deprecated and will be removed in next minor release.
Is your feature request related to a problem? Please describe.
rg.load
takes too long because of thevector
field, even when I don't need it.Describe the solution you'd like
Allow passing a param or through the query
use the elasticsearch
_source
query field to excludevector
when not needed.Describe alternatives you've considered
wait
Additional context
Low priority
The text was updated successfully, but these errors were encountered: