Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Classify content within existing content structure using the OpenAI Embeddings API #437

Merged
merged 27 commits into from
May 18, 2023

Conversation

dkotter
Copy link
Collaborator

@dkotter dkotter commented Apr 10, 2023

Description of the Change

This PR introduces a new feature using OpenAI, in particular using the Embeddings API. At a high level, the embeddings API can generate vector data representing the relatedness of text strings. This data isn't super useful by itself but can be used to compare items to one other, finding how related things are.

There are a number of potential applications for this, like generating recommended/related content or providing search results (which we can look to tackle if we deem this PR worth merging) but for this initial work with embeddings, I decided to build a feature that helps classify your content.

We have an existing integration with IBM Watson that can analyze your content and provide terms that best represent that content, which we then store in various taxonomies (depending on settings). While this is great, a lot of sites want more control over their content structure and don't want a bunch of random terms being added. For instance, they may have a set structure of 12 categories that all content should be classified in to and they don't want new categories being added to that list.

The idea behind this feature is to support those setups by automatically classifying content into your existing terms. When you create or edit a term, we send that term information to the Embeddings API and then store the vector data into term meta. Then when a post is published, we also send that content to the Embeddings API to generate vector data and store that in post meta. We then run a cosine similarity comparison on the data that represents the post against all the term data to determine which terms match the closest. The highest matching terms then get auto-set (with all of this being configurable in settings).

Screenshots

We add a new settings page for OpenAI Embeddings. This is where you turn the feature on as well as choose which post types and post statuses should be auto-classified, as well as the taxonomies that should be used and how many matching terms should get set:

Embeddings settings page

When the taxonomy setting is changed, we generate the embedding data for each existing term, if that data doesn't already exist.

For existing content, we've utilized the existing Classify option in the bulk edit dropdown as well as used the existing inline Classify option:

Bulk classify

Inline classify

We also use the existing Gutenberg integration to turn off the classification feature on a post-by-post basis:

Gutenberg integration

Reviewers notes

  • This is a new Language Processing feature but still uses OpenAI. I've added this as a new Provider, which made sense to me but the existing codebase doesn't necessarily support the idea of multiple features provided by the same Provider within the same Service. All this to say that we have a new tab under Language Processing for OpenAI. Was not ideal to have two tabs both labeled OpenAI so I changed the first integration to be OpenAI ChatGPT and this feature to be OpenAI Embeddings. We could try combining these into a single settings page but ends up being a pretty long page of settings. Open to other ideas on how to best support these types of setups
  • I have not added a WP-CLI integration or Classic Editor support
  • Generating the embedding data is quick but running the comparison function on the data may be slow (the data generated by the API is an array of data around 1500 items in length). I have noted in the code a few places where we may want to add a limit to help with performance but for now, I've left it so when a post is published, we compare that post against all terms within a taxonomy. For sites that have hundreds of terms, they may start to see some performance issues that in theory could lead to timeouts on save (all these calculations run in the admin so I'm not super worried but may be worth doing some stress testing on limits)

How to test the Change

  1. Set up an integration with OpenAI in Language Processing > OpenAI Embeddings
  2. Ensure the Classify content option is turned on and at least one Post types, Post statuses and Taxonomies options are set
  3. In whatever taxonomy you've selected, ensure you have one or more terms saved
  4. Create a new piece of content and publish that content (or match whatever post status option you selected)
  5. Ensure one or more terms gets auto-assigned

Changelog Entry

Added - Automatically classify content into your existing taxonomy structure using the OpenAI Embeddings API

Credits

Props @dkotter

Checklist:

  • I agree to follow this project's Code of Conduct.
  • I have updated the documentation accordingly.
  • I have added tests to cover my change.
  • All new and existing tests pass.

@dkotter dkotter self-assigned this Apr 10, 2023
@jeffpaul jeffpaul requested review from a team and Sidsector9 and removed request for a team April 10, 2023 21:37
@jeffpaul jeffpaul added this to the 2.1.0 milestone Apr 10, 2023
@dkotter dkotter marked this pull request as ready for review April 11, 2023 03:03
@dkotter dkotter requested review from a team and jeffpaul as code owners April 11, 2023 03:03
@jeffpaul
Copy link
Member

Noting from our review/discussion outside GitHub, but would be good to add an admin notice if someone has both Watson NLU and OpenAI Embeddings features as there will be a likely race condition. There could be cases where someone is purposely using both services/features, so probably best to just alert that both services are active and not attempt to disable anything by default. I suspect that checking for and throwing this notice when someone saves/updates either settings should suffice (don't need to check in other places).

…ngs are saved if both classification features are turned on. Both should be able to run at the same time but if both use the same taxonomies, one will overwrite the other. Fix a potential division by zero error
@dkotter dkotter requested a review from Sidsector9 April 12, 2023 16:57
Copy link
Member

@Sidsector9 Sidsector9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great work! I've left some minor review items.

add_filter( "handle_bulk_actions-edit-$post_type", [ $this, 'bulk_action_handler' ], 10, 3 );

if ( is_post_type_hierarchical( $post_type ) ) {
add_action( 'page_row_actions', [ $this, 'register_row_action' ], 10, 2 );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be add_filter instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can move both add_action( 'page_row_actions...) and add_action( 'post_row_actions...) inside the constructor and handle it conditionally inside register_row_action.

This is so that we avoid hooking the same callback on the same hook multiple times.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 22eeead

if ( is_post_type_hierarchical( $post_type ) ) {
add_action( 'page_row_actions', [ $this, 'register_row_action' ], 10, 2 );
} else {
add_action( 'post_row_actions', [ $this, 'register_row_action' ], 10, 2 );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be add_filter instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 22eeead

@dkotter dkotter removed this from the 2.1.0 milestone May 1, 2023
@dkotter dkotter added this to the 2.2.0 milestone May 1, 2023
@dkotter dkotter requested a review from Sidsector9 May 2, 2023 20:41
@dkotter
Copy link
Collaborator Author

dkotter commented May 2, 2023

There's eslint errors being reported here within the gutenberg-plugin.js file. These weren't being reported previously even though the code that is being flagged hasn't changed. I'm also not getting these errors reported locally, just here in the GitHub Action. If anyone has time to take a look at that and let me know what you think, that would be great. It's flagging just some spacing issues which I can fix, just curious why those aren't getting flagged for me locally and why they weren't getting flagged here on GitHub until recently.

Edit: Updated my version of npm locally to 8.19.4 and reinstalled dependencies and was then able to reproduce the lint errors. Still not sure why these weren't flagged previously and not sure I agree with the changes it wants (basically removes all extra spaces) but things should pass now.

dkotter added a commit that referenced this pull request May 4, 2023
…ctions class to better support multiple providers, following what was done in #437
@jeffpaul jeffpaul mentioned this pull request May 17, 2023
16 tasks
Copy link
Member

@iamdharmesh iamdharmesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the great work here @dkotter. Very impressive work on embeddings comparison. The code looks amazing and it tests well.

I have added a few minor notes to discuss and I think we are good to merge this once we conclude these notes.

Thanks for all your efforts on this. ❤️

readme.txt Outdated Show resolved Hide resolved
'var classifaiEmbeddingData = %s;',
wp_json_encode(
[
'enabled' => true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should reflect the actual enabled value instead of the hard-coded true value.

Copy link
Collaborator Author

@dkotter dkotter May 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we never make it to this point if the feature is not enabled, which is why I left this hardcoded instead of pulling the value from our settings again. I think for now I'm going to leave this, but I do have plans in the future to use embedding data in other places (like generating recommended content). At that point, we'll probably change slightly how things are loaded here, since we'll have multiple features that can be enabled/disabled.

includes/Classifai/Admin/BulkActions.php Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants