
rg.load takes too long because of the vector field #2398

Closed
cceyda opened this issue Feb 22, 2023 · 9 comments · Fixed by #2417 or #2668
Comments

cceyda (Contributor) commented Feb 22, 2023

Is your feature request related to a problem? Please describe.
rg.load takes too long because of the vector field, even when I don't need it.

Describe the solution you'd like
Allow passing a param (or doing it through the query) to use the Elasticsearch `_source` field to exclude the vector when it is not needed.

"_source": {        
        "exclude": ["vector"]
    }
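
For reference, here is a minimal sketch of the kind of query this maps to on the Elasticsearch side, assuming a plain `elasticsearch` Python client and an illustrative index name (this is not Argilla's internal code):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative connection settings

# Fetch records but drop the large vector field from the returned _source.
resp = es.search(
    index="my_dataset",  # hypothetical index name
    body={
        "query": {"match_all": {}},
        "_source": {"excludes": ["vector"]},  # skip vectors we don't need
    },
)
records = [hit["_source"] for hit in resp["hits"]["hits"]]
```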

Describe alternatives you've considered
wait

Additional context
Low priority

cceyda added the type: enhancement label on Feb 22, 2023
davidberenstein1957 (Member):

#2261

davidberenstein1957 (Member):

Hi @cceyda, thanks for picking this up! I think we need to review the rg.log/rg.load performance in general, given that a lot of users have had issues with this for larger records.

Perhaps we can implement something like you suggested, or some lazy loading. At the least, I expect to be able to fix some of these load issues by implementing #2341

orasik (Contributor) commented Feb 24, 2023

Excluding vectors would be great. Technically, vectors will not be required in some cases when training a model.

Could the option `vectors=None` in `rg.load` be used to exclude the record's vector? @davidberenstein1957

dvsrepo (Member) commented Feb 24, 2023

Thanks @cceyda and @orasik, this is a needed and requested feature. @frascuchon, we discussed this a long time ago (getting a projection of only some fields when using `rg.load`). Is there a relatively easy solution to this?

davidberenstein1957 (Member) commented Feb 24, 2023 via email

orasik (Contributor) commented Feb 24, 2023

@davidberenstein1957 @frascuchon I found a bug when I tried to hard-code a list of fields to test.

This line

```python
projection={"*"},
```

leads to this code for selecting which fields to include:

```python
"fields": list(projection) if projection else ["id"],
```

If you make it `"fields": ["id"]`, you will receive this error:

```
records = [sdk_record_class.parse_obj(r).to_client() for r in records]
  File "pydantic/main.py", line 527, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 342, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for TextClassificationRecord
```

Not sure what validation behind `parse_obj` is causing this error.
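
A minimal sketch of why `parse_obj` fails here, using a simplified stand-in for the client's `TextClassificationRecord` (the real model has more fields):

```python
from typing import Dict, Optional
from pydantic import BaseModel, ValidationError

class TextClassificationRecord(BaseModel):  # simplified stand-in for the client model
    id: Optional[str] = None
    inputs: Dict[str, str]  # required field: no default value

try:
    # Only "id" was projected from the search response, so "inputs" is missing.
    TextClassificationRecord.parse_obj({"id": "123"})
except ValidationError as err:
    print(err)  # -> 1 validation error ... inputs: field required
```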

orasik (Contributor) commented Feb 24, 2023

Found the issue.

`inputs` is a required field in the validation. If I change the line

```python
"fields": list(projection) if projection else ["id"],
```

to:

```python
"fields": ["id", "inputs"]
```

it works fine.

I will raise a PR soon

orasik (Contributor) commented Feb 24, 2023

PR #2417

davidberenstein1957 (Member):

@frascuchon I think we should add some handling w.r.t. which fields we update during a `rg.log` afterwards, to avoid overwriting records without vectors. Can this be done easily?

frascuchon added this to the v1.5.0 milestone on Mar 8, 2023
frascuchon added a commit that referenced this issue Mar 9, 2023
# Description

Add the ability to specify which fields to retrieve when loading data from Argilla. `rg.load` takes too long because of the vector field, even when it is not needed.

Closes #2398

**Type of change**

(Please delete options that are not relevant. Remember to title the PR
according to the type of change)

- [X] Bug fix (non-breaking change which fixes an issue)
- [X] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing
functionality to not work as expected)
- [X] Refactor (change restructuring the codebase without changing
functionality)
- [X] Improvement (change adding some improvement to an existing
functionality)
- [X] Documentation update

**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And
ideally, reference `tests`)

- [X] Using the following code: 
```python
import argilla as rg
labelled_ds = rg.load("my_data_set", limit=2, fields=["text"])
# will load id, inputs, and text
labelled_ds = rg.load("my_data_set", limit=2)
# will load everything
```

**Checklist**

- [X] I have merged the original branch into my forked branch
- [X] I added relevant documentation
- [X] follows the style guidelines of this project
- [X] I did a self-review of my code
- [X] I made corresponding changes to the documentation
- [X] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
frascuchon reopened this on Mar 22, 2023
frascuchon modified the milestones: v1.5.0, v1.6.0 on Mar 22, 2023
frascuchon added a commit that referenced this issue Mar 22, 2023
## [1.5.0](v1.4.0...v1.5.0) - 2023-03-21

### Added

- Add the fields to retrieve when loading the data from argilla.
`rg.load` takes too long because of the vector field, even when users
don't need it. Closes
[#2398](#2398)
- Add new page and components for dataset settings. Closes
[#2442](#2003)
- Add ability to show an image in records (for TokenClassification and TextClassification) if a URL is passed in metadata with the key `_image_url`
- Non-searchable fields support in metadata.
[#2570](#2570)

### Changed

- Labels are now centralized in a specific vuex ORM called GlobalLabel
Model, see #2210. This model
is the same for TokenClassification and TextClassification (so both tasks
have labels with color_id and shortcuts parameters in the vuex ORM)
- The shortcuts improvement for labels
[#2339](#2339) has been moved
to the vuex ORM in the dataset settings feature
[#2444](eb37c3b)
- Update "Define a labeling schema" section in docs.
- The record inputs are sorted alphabetically in UI by default.
[#2581](#2581)

### Fixed

- Allow URL to be clickable in Jupyter notebook again. Closes
[#2527](#2527)

### Removed

- Removed some deprecated data scan endpoints used by old clients. This
change will break compatibility with clients `<v1.3.0`
- Stopped using the old deprecated scan endpoints in the Python client. This
will break client compatibility with server versions `<1.3.0`
- Removed the previous way to add labels through the dataset page. Now
labels can be added only through the dataset settings page.



### As always, thanks to our amazing contributors!
- Documentation update: tutorial for text classification models
comparison (#2426) by @embonhomme
- Docs: fix little typo (#2522) by @anakin87
- Docs: Tutorial on image classification (#2420) by @burtenshaw
frascuchon modified the milestones: v1.6.0, v1.7.0 on Apr 9, 2023
frascuchon added a commit that referenced this issue Apr 16, 2023
# NOTE

~~PR #2576 must be merged first to make this PR "mergeable"~~

# Description

Work implemented in #2577 and
#2540 tries to introduce a
mechanism for:

1. Loading data with field projections (selecting which record fields
will be loaded)
2. Allow partial updates when uploading projected records

Some problems with this flow were raised during the work:

- There is a field mismatch between client and API records. The proposed
field projection refers to the Record API data model, and aligning
it with the client representation can be messy and give unexpected
behavior.
- The update flag mechanism is a bit cumbersome because of the client
data projection and the record class validation alignments. So, even for
a single-field update, you need to retrieve extra fields to keep the
validations aligned.

The field projection in the `rg.load` function can introduce confusion
because the two record models, client and API, are not aligned. So, I
prefer to remove this functionality. Instead, we can introduce
some extra parameters to `load` so users can choose to exclude
`vectors` or `metrics` when loading data. This will help to mitigate the
problems found in #2398

This option, together with changes introduced in #2576, allows users to
load partial record info and still update it without losing stored data.

For example, if users want to add some metadata to already stored
records, they could try the following flow:

```python
ds = rg.load(name=dataset, exclude_vectors=True)

for record in ds:
   record.metadata = ...

rg.log(name=dataset, records=ds) # This will update records only by adding the metadata info
```

Closes [#2398](#2398)

Refs: #2577 and
#2540

**Type of change**

(Please delete options that are not relevant. Remember to title the PR
according to the type of change)

- [x] Improvement (change adding some improvement to an existing
functionality)


**How Has This Been Tested**

(Please describe the tests that you ran to verify your changes. And
ideally, reference `tests`)

Existing tests have been modified to include this new behavior.

**Checklist**

- [x] I have merged the original branch into my forked branch
- [ ] I added relevant documentation
- [x] follows the style guidelines of this project
- [x] I did a self-review of my code
- [ ] I made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [x] I have added tests that prove my fix is effective or that my
feature works
- [x] I have added relevant notes to the CHANGELOG.md file (See
https://keepachangelog.com/)

---------

Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
frascuchon added a commit that referenced this issue May 10, 2023
## [1.7.0](v1.6.0...v1.7.0)

### Added

- Added `max_retries` and `num_threads` parameters to `rg.log` to run the data
logging request concurrently with a backoff retry policy. See
[#2458](#2458) and
[#2533](#2533)
- `rg.load` accepts `include_vectors` and `include_metrics` when loading
data (see the usage sketch after this list). Closes [#2398](#2398)
- Added `settings` param to `prepare_for_training`
([#2689](#2689))
- Added `prepare_for_training` for `openai`
([#2658](#2658))
- Added `ArgillaOpenAITrainer`
([#2659](#2659))
- Added `ArgillaSpanMarkerTrainer` for Named Entity Recognition
([#2693](#2693))
- Added `ArgillaTrainer` CLI support. Closes
([#2809](#2809))
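
A short usage sketch based on the parameter names listed above (the dataset name is illustrative and exact defaults may differ by version):

```python
import argilla as rg

# Load without vectors or metrics to keep the payload small.
ds = rg.load(name="my_dataset", include_vectors=False, include_metrics=False)

# Log records concurrently, retrying failed batches with a backoff policy.
rg.log(records=ds, name="my_dataset", num_threads=4, max_retries=3)
```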

### Changed

- Argilla quickstart image dependencies are externalized into
`quickstart.requirements.txt`. See
[#2666](#2666)
- bulk endpoints will upsert data when record `id` is present. Closes
[#2535](#2535)
- Moved from `click` to `typer` for CLI support. Closes
([#2815](#2815))
- Argilla server docker image is built with PostgreSQL support. Closes
[#2686](#2686)
- `rg.log` computes all batches and raises an error for all failed
batches.
- The default batch size for `rg.log` is now 100.

### Fixed

- `argilla.training` bugfixes and unification
([#2665](#2665))
- Resolved several small bugs in the `ArgillaTrainer`.

### Deprecated

- The `rg.log_async` function is deprecated and will be removed in the next
minor release.