-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Classify content within existing content structure using the OpenAI Embeddings API #437
Conversation
…can be used to turn on integration. When content is saved or terms are saved, generate embedding data. Use that embedding data to assign terms to posts.
…n if all other checks pass
…existing bulk action handling to work for multiple providers
…ected taxonomies. Move some code around
Noting from our review/discussion outside GitHub, but would be good to add an admin notice if someone has both Watson NLU and OpenAI Embeddings features as there will be a likely race condition. There could be cases where someone is purposely using both services/features, so probably best to just alert that both services are active and not attempt to disable anything by default. I suspect that checking for and throwing this notice when someone saves/updates either settings should suffice (don't need to check in other places). |
…ngs are saved if both classification features are turned on. Both should be able to run at the same time but if both use the same taxonomies, one will overwrite the other. Fix a potential division by zero error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the great work! I've left some minor review items.
add_filter( "handle_bulk_actions-edit-$post_type", [ $this, 'bulk_action_handler' ], 10, 3 ); | ||
|
||
if ( is_post_type_hierarchical( $post_type ) ) { | ||
add_action( 'page_row_actions', [ $this, 'register_row_action' ], 10, 2 ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be add_filter instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can move both add_action( 'page_row_actions...)
and add_action( 'post_row_actions...)
inside the constructor and handle it conditionally inside register_row_action
.
This is so that we avoid hooking the same callback on the same hook multiple times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 22eeead
if ( is_post_type_hierarchical( $post_type ) ) { | ||
add_action( 'page_row_actions', [ $this, 'register_row_action' ], 10, 2 ); | ||
} else { | ||
add_action( 'post_row_actions', [ $this, 'register_row_action' ], 10, 2 ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be add_filter instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 22eeead
…s being done in #403. Remove code that is no longer needed
… that should be a filter
Edit: Updated my version of npm locally to 8.19.4 and reinstalled dependencies and was then able to reproduce the lint errors. Still not sure why these weren't flagged previously and not sure I agree with the changes it wants (basically removes all extra spaces) but things should pass now. |
…ctions class to better support multiple providers, following what was done in #437
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the great work here @dkotter. Very impressive work on embeddings comparison. The code looks amazing and it tests well.
I have added a few minor notes to discuss and I think we are good to merge this once we conclude these notes.
Thanks for all your efforts on this. ❤️
'var classifaiEmbeddingData = %s;', | ||
wp_json_encode( | ||
[ | ||
'enabled' => true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should reflect the actual enabled value instead of the hard-coded true value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we never make it to this point if the feature is not enabled, which is why I left this hardcoded instead of pulling the value from our settings again. I think for now I'm going to leave this, but I do have plans in the future to use embedding data in other places (like generating recommended content). At that point, we'll probably change slightly how things are loaded here, since we'll have multiple features that can be enabled/disabled.
…e are doing our bulk data generation
Description of the Change
This PR introduces a new feature using OpenAI, in particular using the Embeddings API. At a high level, the embeddings API can generate vector data representing the relatedness of text strings. This data isn't super useful by itself but can be used to compare items to one other, finding how related things are.
There are a number of potential applications for this, like generating recommended/related content or providing search results (which we can look to tackle if we deem this PR worth merging) but for this initial work with embeddings, I decided to build a feature that helps classify your content.
We have an existing integration with IBM Watson that can analyze your content and provide terms that best represent that content, which we then store in various taxonomies (depending on settings). While this is great, a lot of sites want more control over their content structure and don't want a bunch of random terms being added. For instance, they may have a set structure of 12 categories that all content should be classified in to and they don't want new categories being added to that list.
The idea behind this feature is to support those setups by automatically classifying content into your existing terms. When you create or edit a term, we send that term information to the Embeddings API and then store the vector data into term meta. Then when a post is published, we also send that content to the Embeddings API to generate vector data and store that in post meta. We then run a cosine similarity comparison on the data that represents the post against all the term data to determine which terms match the closest. The highest matching terms then get auto-set (with all of this being configurable in settings).
Screenshots
We add a new settings page for OpenAI Embeddings. This is where you turn the feature on as well as choose which post types and post statuses should be auto-classified, as well as the taxonomies that should be used and how many matching terms should get set:
When the taxonomy setting is changed, we generate the embedding data for each existing term, if that data doesn't already exist.
For existing content, we've utilized the existing Classify option in the bulk edit dropdown as well as used the existing inline Classify option:
We also use the existing Gutenberg integration to turn off the classification feature on a post-by-post basis:
Reviewers notes
Language Processing
feature but still uses OpenAI. I've added this as a new Provider, which made sense to me but the existing codebase doesn't necessarily support the idea of multiple features provided by the same Provider within the same Service. All this to say that we have a new tab underLanguage Processing
for OpenAI. Was not ideal to have two tabs both labeled OpenAI so I changed the first integration to beOpenAI ChatGPT
and this feature to beOpenAI Embeddings
. We could try combining these into a single settings page but ends up being a pretty long page of settings. Open to other ideas on how to best support these types of setupsHow to test the Change
Language Processing > OpenAI Embeddings
Classify content
option is turned on and at least onePost types
,Post statuses
andTaxonomies
options are setChangelog Entry
Credits
Props @dkotter
Checklist: