
[Task]: Setup scripts to populate the search index from the DB #2092

Closed
acouch opened this issue Sep 17, 2024 · 0 comments
acouch commented Sep 17, 2024


Migrated from navapbc#10
Originally created by @chouinar on Wed, 15 May 2024 17:26:42 GMT


Summary

  • Should be usable in a non-script way (i.e. functions we can call with specific opportunity records - we’ll use it for tests as well)
  • Should only load records that aren’t drafts and have an opportunity status
  • Should make a new index with configurable values (number of shards)
  • Should set up an alias
  • Should use bulk uploads for performance
  • https://opensearch.org/docs/latest/im-plugin/index-templates/
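The filtering and bulk-upload requirements above can be sketched as a helper that turns opportunity records into OpenSearch bulk actions. This is an illustrative sketch, not the project's actual code: the function name, record fields (`is_draft`, `opportunity_status`, `opportunity_id`), and index name are assumptions.

```python
# Sketch: build OpenSearch bulk actions from opportunity records,
# skipping drafts and records without an opportunity status.
# Field and function names are illustrative assumptions.

def build_bulk_actions(records: list[dict], index_name: str) -> list[dict]:
    actions = []
    for record in records:
        if record.get("is_draft"):
            continue  # drafts are never indexed
        if not record.get("opportunity_status"):
            continue  # only records with a status are searchable
        actions.append(
            {
                "_index": index_name,
                "_id": record["opportunity_id"],
                "_source": record,
            }
        )
    return actions
```

A list like this could then be handed to `opensearchpy.helpers.bulk` for the actual upload, which batches the requests for performance.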

Acceptance criteria

No response

acouch closed this as completed Sep 17, 2024
acouch pushed a commit to navapbc/simpler-grants-gov that referenced this issue Sep 18, 2024
#47)

Fixes HHS#2092

Set up a script to populate the search index by loading opportunities
from the DB, serializing them to JSON, loading them into a new index,
and then aliasing that index.

Several utilities were created to simplify working with the OpenSearch
client (a wrapper for setting up configuration and common patterns).

Iterating over the opportunities and doing something with them is a
common pattern in several of our scripts, so nothing is really different
there.

The meaningful implementation is how we handle creating and aliasing the
index. In OpenSearch you can give any index an alias (including putting
multiple indexes behind the same alias). The approach is pretty simple:
* Create an index
* Load opportunities into the index
* Atomically swap the index backing the `opportunity-index-alias`
* Delete the old indexes if any exist
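The first step is where creation-time settings such as the shard count get applied. A minimal sketch of building that request body, under the assumption that it would be passed to `client.indices.create(index=..., body=...)` in `opensearch-py` (the helper name and defaults are illustrative, not the project's code):

```python
# Sketch: index settings applied at creation time.
# Helper name and default values are illustrative assumptions.

def build_index_settings(number_of_shards: int = 1,
                         number_of_replicas: int = 1) -> dict:
    # Shard count is fixed when the index is created, which is why
    # re-sharding means building a fresh index and re-aliasing it.
    return {
        "settings": {
            "index": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
            }
        }
    }
```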

This approach means that our search endpoint just needs to query the
alias, and we can keep making new indexes and swapping them out behind
the scenes. Because we could remake the index every few minutes, if we
ever need to re-configure things like the number of shards, or any other
index-creation configuration, we just update that in this script and
wait for it to run again.
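The atomic swap works because OpenSearch's `_aliases` endpoint applies a list of actions in a single request. A sketch of building that request body (the helper name is an assumption; the `_aliases` actions format is the real OpenSearch API):

```python
# Sketch: build a single _aliases request that removes the alias from
# the old indexes and adds it to the new one in one atomic step, so
# the alias never points at zero indexes mid-swap.
# The helper name is an illustrative assumption.

def build_alias_swap(alias: str, new_index: str,
                     old_indexes: list[str]) -> dict:
    actions = [
        {"remove": {"index": old, "alias": alias}} for old in old_indexes
    ]
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}
```

A body like this could be sent with `client.indices.update_aliases(body=...)`; deleting the old indexes happens afterward as a separate step.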

I ran this locally after loading `83250` records, and it took about 61s.

You can run this locally yourself by doing:
```sh
make init
make db-seed-local
poetry run flask load-search-data load-opportunity-data
```

If you'd like to see the data, you can test it out on
http://localhost:5601/app/dev_tools#/console - here is an example query
that searches for the word `research` across a few fields and filters to
just forecasted/posted opportunities.

```json
GET opportunity-index-alias/_search
{
  "size": 25,
  "from": 0,
  "query": {
    "bool": {
      "must": [
        {
          "simple_query_string": {
            "query": "research",
            "default_operator": "AND",
            "fields": ["agency.keyword^16", "opportunity_title^2", "opportunity_number^12", "summary.summary_description", "opportunity_assistance_listings.assistance_listing_number^10", "opportunity_assistance_listings.program_title^4"]
          }
        }
      ],
      "filter": [
        {
          "terms": {
            "opportunity_status": [
              "forecasted",
              "posted"
            ]
          }
        }
      ]
    }
  }
}

```
acouch pushed a commit to navapbc/simpler-grants-gov that referenced this issue Sep 18, 2024
acouch pushed a commit that referenced this issue Sep 18, 2024