feat: add search engine reindexing cli task #4404

Merged
merged 12 commits into from
Dec 18, 2023

@jfcalvo jfcalvo commented Dec 12, 2023

Description

This PR adds a new CLI task to reindex datasets and records, so that it can be run once Elasticsearch/OpenSearch mappings are updated.

To execute the CLI task, use the following command:

$ argilla server search_engine reindex

This task iterates over all the datasets in the database, reindexing each dataset and then all of its records.
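As a rough illustration of that flow, here is a minimal sketch that uses an in-memory fake instead of a real search engine. Only `delete_index` and `create_index` appear in this PR's review excerpts; `Dataset`, `InMemorySearchEngine`, `index_records`, `stream_batches`, and `reindex_all` are hypothetical names chosen for the example, not the actual implementation.

```python
import asyncio
from dataclasses import dataclass, field
from typing import AsyncIterator, Dict, List


@dataclass
class Dataset:
    # Hypothetical stand-in for the database model.
    id: int
    records: List[str] = field(default_factory=list)


class InMemorySearchEngine:
    """Hypothetical fake engine; only delete_index/create_index come from the PR."""

    def __init__(self) -> None:
        self.indexes: Dict[int, List[str]] = {}

    async def delete_index(self, dataset: Dataset) -> None:
        self.indexes.pop(dataset.id, None)

    async def create_index(self, dataset: Dataset) -> None:
        self.indexes[dataset.id] = []

    async def index_records(self, dataset: Dataset, records: List[str]) -> None:
        # Assumed bulk-indexing call; the name is illustrative.
        self.indexes[dataset.id].extend(records)


async def stream_batches(items: List[str], batch_size: int) -> AsyncIterator[List[str]]:
    # Yield fixed-size batches so only one batch is handled at a time.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


async def reindex_all(datasets: List[Dataset], engine: InMemorySearchEngine, batch_size: int = 2) -> int:
    """Recreate each dataset index, then reindex its records batch by batch."""
    total = 0
    for dataset in datasets:
        await engine.delete_index(dataset)
        await engine.create_index(dataset)
        async for batch in stream_batches(dataset.records, batch_size):
            await engine.index_records(dataset, batch)
            total += len(batch)
    return total


engine = InMemorySearchEngine()
engine.indexes[1] = ["stale"]  # simulate an outdated index
datasets = [Dataset(1, ["a", "b", "c"]), Dataset(2, ["d"])]
reindexed = asyncio.run(reindex_all(datasets, engine))
```

The stale entry for dataset 1 disappears because the index is dropped and recreated before any record is indexed again.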

Note

We are using server-side cursors (streams) to fetch the collections of datasets and records in the CLI task. For more information, take a look at the SQLAlchemy documentation.
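The idea behind streaming is to consume rows in bounded batches instead of loading the whole result set into memory. The sketch below shows only that consumption pattern, using the standard library `sqlite3` module. This is an assumption made to keep the example self-contained: `sqlite3` does not provide a true server-side cursor, and the PR itself relies on SQLAlchemy streams. The `records` table and `stream_rows` helper are hypothetical.

```python
import sqlite3
from typing import Iterator, List, Tuple


def stream_rows(conn: sqlite3.Connection, query: str, batch_size: int) -> Iterator[List[Tuple]]:
    """Yield rows in batches of at most batch_size using cursor.fetchmany."""
    cursor = conn.cursor()
    cursor.execute(query)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch
    cursor.close()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, dataset_id INTEGER)")
conn.executemany("INSERT INTO records (dataset_id) VALUES (?)", [(1,)] * 5)

# Each iteration holds at most two rows in memory.
batch_sizes = [len(batch) for batch in stream_rows(conn, "SELECT id FROM records ORDER BY id", 2)]
```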

Closes #4335

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested

  • Tested manually.
  • Tested manually using PostgreSQL.

Checklist

  • I added relevant documentation
  • follows the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I filled out the contributor form (see text above)
  • I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

@dosubot dosubot bot added the size:L, area: cli, area: server, area: tests, language: python, team: backend, and type: enhancement labels Dec 12, 2023

The URL of the deployed environment for this PR is https://argilla-quickstart-pr-4404-ki24f765kq-no.a.run.app


async def _reindex_datasets(db: AsyncSession, search_engine: SearchEngine, progress: Progress) -> None:
    task = progress.add_task(
        f"reindexing datasets...",
Member

Here you could specify that only Feedback datasets will be reindexed, just to make this clear to users.

Member Author

Done.

Comment on lines +46 to +47
await search_engine.delete_index(dataset)
await search_engine.create_index(dataset)
Member

For "soft" changes we can use the update mappings and the update settings endpoints:

See docs:

If we are careful with mapping changes and do not modify any field type, we could always update mappings without deleting the index. See here
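To make the suggestion concrete, here is a hedged sketch of an update-mapping request. It builds a `PUT /<index>/_mapping` request that adds a new field to an existing index, without sending it. The base URL, index name, and field name are hypothetical, and `build_put_mapping_request` is a helper written for this example; it is not part of Argilla or of any client library.

```python
import json
from urllib.request import Request


def build_put_mapping_request(base_url: str, index: str, properties: dict) -> Request:
    """Build (but do not send) an update-mapping request for an existing index."""
    body = json.dumps({"properties": properties}).encode("utf-8")
    return Request(
        url=f"{base_url}/{index}/_mapping",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical values: a local cluster, an example index, and a new keyword field.
request = build_put_mapping_request(
    "http://localhost:9200",
    "hypothetical-dataset-index",
    {"new_field": {"type": "keyword"}},
)
```

Adding a field this way keeps the existing documents in place. Changing the type of an existing field is not possible through this endpoint and would still require recreating the index.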

from typer.testing import CliRunner


@pytest.mark.asyncio
Member

Suggested change
@pytest.mark.asyncio

Member Author

Done.

codecov bot commented Dec 13, 2023

Codecov Report

Attention: 18 lines in your changes are missing coverage. Please review.

Comparison is base (4ceb23d) 65.96% compared to head (6830425) 66.07%.
Report is 2 commits behind head on develop.

Files Patch % Lines
src/argilla/cli/server/search_engine/reindex.py 72.72% 18 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4404      +/-   ##
===========================================
+ Coverage    65.96%   66.07%   +0.11%     
===========================================
  Files          330      333       +3     
  Lines        19115    19188      +73     
===========================================
+ Hits         12609    12679      +70     
- Misses        6506     6509       +3     

@frascuchon
Member

It would be nice to add some references in the docs about this new task. I think the best place would be here

@jfcalvo
Member Author

jfcalvo commented Dec 14, 2023

It would be nice to add some references in the docs about this new task. I think the best place would be here

Ok, I will add a new section in the docs about this task.

@jfcalvo
Member Author

jfcalvo commented Dec 14, 2023

It would be nice to add some references in the docs about this new task. I think the best place would be here

Ok, I will add a new section in the docs about this task.

Done.

@frascuchon frascuchon self-requested a review December 18, 2023 16:07
@dosubot dosubot bot added the lgtm label Dec 18, 2023
@jfcalvo jfcalvo merged commit f79b803 into develop Dec 18, 2023
18 of 19 checks passed
@jfcalvo jfcalvo deleted the feat/add-reindex-cli-task branch December 18, 2023 17:27
Development

Successfully merging this pull request may close these issues.

[FEATURE] Add new argilla cli option to reindex all entities into search engine
3 participants