
Need a more performant way to bulk generate embeddings for terms #759

Closed
dkotter opened this issue Apr 16, 2024 · 3 comments · Fixed by #779
Labels: type:bug Something isn't working.
Milestone: 3.1.0

Comments
@dkotter
Collaborator

dkotter commented Apr 16, 2024

Describe the bug

In v2.2.0 of ClassifAI we added the ability to classify content within your own terms using OpenAI Embeddings. In order for this to work, we need embedding data to be generated for each term and for the post we are comparing those terms to.

The post embedding data is generated on the fly when the comparison is triggered, but we don't want to do that for terms, as there may be hundreds or thousands of them. Instead, these are generated in bulk when the feature is first set up. This has always been a known limitation: if you have lots of terms, this process will probably run into timeouts, memory issues, or rate limit issues with OpenAI.

In #758, we are making some changes to how OpenAI Embeddings work, but we have not yet fixed this issue. Ideally it is fixed and added to the same release, as those changes require all embeddings to be regenerated.

There are two issues I'm currently aware of:

  1. We generate these embeddings when the settings are saved, but only for taxonomies that are turned on. The first time you save, the taxonomy settings haven't been saved yet, so nothing runs; you have to save again for things to work.
  2. During that same save, we generate an embedding for every term that doesn't already have one. For sites with 1000+ terms, this will almost certainly lead to timeouts or memory issues. Sites with far fewer terms will probably run into OpenAI rate limits.

Ideally we would introduce some sort of queue management system to address this, general enough that it can be reused by other features that may come in the future. There are existing tools we could adopt, like Action Scheduler or Cavalcade, but we may be fine just building a lightweight system on top of the scheduled event system in WordPress.
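As a rough illustration of the lightweight WP-Cron approach mentioned above, term IDs could be chunked and each chunk processed in its own scheduled event, so no single request has to embed every term. This is only a sketch under assumptions: the hook name, batch size, meta key, and the `classifai_generate_term_embedding()` helper are hypothetical, not actual ClassifAI APIs.

```php
<?php
// Hypothetical sketch: a lightweight queue built on WP-Cron.
// Hook name, batch size, meta key, and the embedding helper are
// placeholders, not real ClassifAI code.

add_action( 'classifai_embed_term_batch', 'classifai_embed_term_batch', 10, 2 );

/**
 * Queue embedding generation for all terms in a taxonomy,
 * one small batch per cron event.
 */
function classifai_queue_term_embeddings( string $taxonomy, int $batch_size = 50 ) {
	$term_ids = get_terms(
		array(
			'taxonomy'   => $taxonomy,
			'fields'     => 'ids',
			'hide_empty' => false,
		)
	);

	foreach ( array_chunk( $term_ids, $batch_size ) as $i => $chunk ) {
		// Stagger batches a minute apart to stay under provider rate limits.
		wp_schedule_single_event(
			time() + ( $i * MINUTE_IN_SECONDS ),
			'classifai_embed_term_batch',
			array( $chunk, $taxonomy )
		);
	}
}

function classifai_embed_term_batch( array $term_ids, string $taxonomy ) {
	foreach ( $term_ids as $term_id ) {
		// Skip terms that already have an embedding stored.
		if ( get_term_meta( $term_id, 'classifai_embedding', true ) ) {
			continue;
		}
		classifai_generate_term_embedding( $term_id, $taxonomy ); // hypothetical helper
	}
}
```

The per-batch delay is the simplest rate-limit lever; a real implementation would also need failure handling and a way to detect when the whole queue has drained.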

Steps to Reproduce

  1. Set up the Classification Feature with OpenAI Embeddings as the Provider
  2. Turn on at least one taxonomy and hit save
  3. Notice that no embeddings are actually generated
  4. Hit save again and notice the embeddings get generated

You can also generate 1000+ terms and run this process again, though note this will cost money since it makes API requests. I've tested locally using an embeddings model run through Ollama, and at around 1000 terms I run into memory issues.

Screenshots, screen recording, code snippet

No response

Environment information

No response

WordPress information

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@dkotter dkotter added the type:bug Something isn't working. label Apr 16, 2024
@dkotter dkotter added this to the 3.1.0 milestone Apr 16, 2024
@jeffpaul jeffpaul mentioned this issue May 28, 2024
@Sidsector9 Sidsector9 self-assigned this May 29, 2024
@Sidsector9
Member

I've investigated both Action Scheduler and Cavalcade and found that the latter requires disabling WP-Cron. For this reason I think Action Scheduler is the more reasonable candidate.

I have a branch with Action Scheduler implemented; however, I'm facing some intermittent PHP memory exhaustion errors. I suspect it has to do with scheduling jobs inside the for() loop. I'll fix that and push the branch this week.
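For reference, one way to keep the scheduling loop from ballooning memory is to enqueue one Action Scheduler action per chunk of term IDs rather than one action per term. `as_enqueue_async_action()` is Action Scheduler's real API; the hook name, batch size, and group are assumptions for illustration.

```php
<?php
// Hypothetical sketch using Action Scheduler: one async action per chunk
// of term IDs, so the enqueue loop runs batch-count times, not term-count
// times. Hook name and group are placeholders.

function classifai_queue_embeddings_with_as( array $term_ids, int $batch_size = 100 ) {
	foreach ( array_chunk( $term_ids, $batch_size ) as $chunk ) {
		as_enqueue_async_action(
			'classifai_embed_term_batch',      // handler generates embeddings for the chunk
			array( 'term_ids' => $chunk ),
			'classifai'                        // group, so pending jobs can be listed/cancelled together
		);
	}
}
```

Grouping the actions also makes it easy to check whether a previous bulk run is still in flight before enqueueing a new one.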

@dkotter
Collaborator Author

dkotter commented Jun 3, 2024

@Sidsector9 Worth noting that on a different (private) project, @iamdharmesh implemented https://github.com/deliciousbrains/wp-background-processing to solve this, so that's another tool we can look into. I know he compared it to Action Scheduler and had a few reasons for choosing it, so it may be worth talking to him.

@Sidsector9
Member

@dkotter Dharmesh and I discussed this last week and concluded that either is a good choice, as both have their pros and cons.

I decided to go ahead with Action Scheduler to align with Woo's decision to migrate all of their background-processing jobs to AS. Related: woocommerce/woocommerce#44246
