Classify content within existing content structure using the OpenAI Embeddings API #437

dkotter · 2023-04-10T02:47:49Z

Description of the Change

This PR introduces a new feature built on OpenAI's Embeddings API. At a high level, the Embeddings API generates vector data that represents a piece of text. This data isn't super useful by itself but can be used to compare items to one another to find how related they are.

There are a number of potential applications for this, like generating recommended/related content or providing search results (which we can look to tackle if we deem this PR worth merging) but for this initial work with embeddings, I decided to build a feature that helps classify your content.

We have an existing integration with IBM Watson that can analyze your content and provide terms that best represent that content, which we then store in various taxonomies (depending on settings). While this is great, a lot of sites want more control over their content structure and don't want a bunch of random terms being added. For instance, they may have a set structure of 12 categories that all content should be classified into, and they don't want new categories being added to that list.

The idea behind this feature is to support those setups by automatically classifying content into your existing terms. When you create or edit a term, we send that term information to the Embeddings API and store the returned vector data in term meta. When a post is published, we also send its content to the Embeddings API and store the resulting vector data in post meta. We then run a cosine similarity comparison between the post's vector and every term's vector to determine which terms match most closely. The highest-matching terms then get auto-set (with all of this being configurable in settings).
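The comparison step described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the plugin's PHP, and it is not the code in this PR: the function names (`cosine_similarity`, `top_terms`), the term names, and the tiny three-dimensional vectors are all made up for the example, and real embedding vectors are far longer.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors.

    Returns 0.0 when either vector has zero magnitude, which avoids
    a division by zero.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_terms(post_vector, term_vectors, count=1):
    """Rank terms by similarity to the post and keep the best `count`."""
    scored = [
        (term, cosine_similarity(post_vector, vector))
        for term, vector in term_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in scored[:count]]


# Hypothetical data: one post vector and three term vectors.
post = [0.9, 0.1, 0.0]
terms = {
    "Sports": [1.0, 0.0, 0.0],
    "Cooking": [0.0, 1.0, 0.0],
    "Travel": [0.0, 0.0, 1.0],
}
print(top_terms(post, terms, count=2))  # ['Sports', 'Cooking']
```

The `count` argument plays the role of the "how many matching terms should get set" setting; the zero-magnitude guard is one simple way to keep the calculation from dividing by zero.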

Screenshots

We add a new settings page for OpenAI Embeddings. This is where you turn the feature on, choose which post types and post statuses should be auto-classified, select the taxonomies to use, and set how many matching terms get assigned.

When the taxonomy setting is changed, we generate the embedding data for each existing term, if that data doesn't already exist.
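As a rough sketch of that "only if the data doesn't already exist" behaviour (in Python rather than the plugin's PHP, and not the actual implementation): `term_meta` is a plain dictionary standing in for term meta, and `fake_embed` is a hypothetical stand-in for the request to the Embeddings API.

```python
def backfill_term_embeddings(term_names, term_meta, embed):
    """Generate embeddings only for terms that do not have one yet.

    term_meta maps term name -> stored vector (a stand-in for term meta).
    embed is any callable that turns text into a vector; in the real
    feature this would be a request to the Embeddings API.
    Returns the list of terms that were newly embedded.
    """
    generated = []
    for name in term_names:
        if name in term_meta:
            continue  # data already exists, so skip the API request
        term_meta[name] = embed(name)
        generated.append(name)
    return generated


def fake_embed(text):
    """Hypothetical stand-in for the API call, for demonstration only."""
    return [float(len(text)), float(text.count("a"))]


meta = {"News": [4.0, 0.0]}
print(backfill_term_embeddings(["News", "Travel", "Food"], meta, fake_embed))
# ['Travel', 'Food']
```

Skipping terms that already have data keeps a settings save from re-sending every term to the API each time.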

For existing content, we reuse the existing Classify option in the bulk edit dropdown as well as the existing inline Classify option.

We also use the existing Gutenberg integration to let you turn off the classification feature on a post-by-post basis.

Reviewers notes

This is a new Language Processing feature but still uses OpenAI. I've added this as a new Provider, which made sense to me, but the existing codebase doesn't necessarily support the idea of multiple features provided by the same Provider within the same Service. All this to say that we have a new tab under Language Processing for OpenAI. It was not ideal to have two tabs both labeled OpenAI, so I changed the first integration to be OpenAI ChatGPT and this feature to be OpenAI Embeddings. We could try combining these into a single settings page, but that ends up being a pretty long page of settings. Open to other ideas on how to best support these types of setups.
I have not added a WP-CLI integration or Classic Editor support.
Generating the embedding data is quick but running the comparison function on the data may be slow (the data generated by the API is an array around 1500 items in length). I have noted a few places in the code where we may want to add a limit to help with performance, but for now I've left it so that when a post is published, we compare that post against all terms within a taxonomy. Sites that have hundreds of terms may start to see performance issues that in theory could lead to timeouts on save (all these calculations run in the admin so I'm not super worried, but it may be worth doing some stress testing on limits).
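To put rough numbers on that last note, here is a back-of-the-envelope estimate in Python. It assumes a straightforward cosine similarity (one dot product plus two magnitude sums, so about three multiply-adds per vector item) and uses the approximate length of 1500 mentioned above; the function name is hypothetical and the results are operation counts, not timings.

```python
def comparison_cost(num_terms, dimensions=1500):
    """Estimate multiply-add operations for one post against all terms.

    One cosine similarity needs a dot product plus two magnitude sums,
    so roughly 3 * dimensions multiply-adds per term.
    """
    return 3 * dimensions * num_terms


for term_count in (12, 500, 5000):
    print(term_count, comparison_cost(term_count))
# 12 54000
# 500 2250000
# 5000 22500000
```

The cost grows linearly with the number of terms, so a cap on how many terms are compared would bound the work per save. How long these counts take in practice depends on the environment, which is why stress testing is still worthwhile.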

How to test the Change

Set up an integration with OpenAI in Language Processing > OpenAI Embeddings
Ensure the Classify content option is turned on and at least one option is selected for each of Post types, Post statuses and Taxonomies
In whatever taxonomy you've selected, ensure you have one or more terms saved
Create a new piece of content and publish that content (or match whatever post status option you selected)
Ensure one or more terms get auto-assigned

Changelog Entry

Added - Automatically classify content into your existing taxonomy structure using the OpenAI Embeddings API

Credits

Props @dkotter

Checklist:

I agree to follow this project's Code of Conduct.
I have updated the documentation accordingly.
I have added tests to cover my change.
All new and existing tests pass.

jeffpaul · 2023-04-11T17:57:04Z

Noting from our review/discussion outside GitHub: it would be good to add an admin notice if someone has both the Watson NLU and OpenAI Embeddings features enabled, as there will likely be a race condition. There could be cases where someone is purposely using both services/features, so it's probably best to just alert that both services are active and not attempt to disable anything by default. I suspect that checking for and throwing this notice when someone saves/updates either settings page should suffice (we don't need to check in other places).
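A minimal sketch of the kind of check suggested above, in Python rather than the plugin's PHP. The settings shape (`enabled`, `taxonomies`), the function name, and the message text are assumptions for illustration. Listing the shared taxonomies is an optional extra beyond a plain "both are active" alert.

```python
def classification_conflict_notice(nlu, embeddings):
    """Return a warning string if both features are enabled, else None.

    Each argument is a dict with hypothetical keys:
    "enabled" (bool) and "taxonomies" (iterable of taxonomy names).
    Nothing is disabled here; the function only builds a message.
    """
    if not (nlu["enabled"] and embeddings["enabled"]):
        return None
    shared = sorted(set(nlu["taxonomies"]) & set(embeddings["taxonomies"]))
    message = "Both Watson NLU and OpenAI Embeddings classification are active."
    if shared:
        message += " Shared taxonomies: " + ", ".join(shared) + "."
    return message


nlu_settings = {"enabled": True, "taxonomies": ["category", "post_tag"]}
embedding_settings = {"enabled": True, "taxonomies": ["category"]}
print(classification_conflict_notice(nlu_settings, embedding_settings))
```

Calling this only when either settings page is saved matches the suggestion that the check doesn't need to run anywhere else.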

Sidsector9

Thanks for the great work! I've left some minor review items.

Sidsector9 · 2023-04-14T13:30:20Z

includes/Classifai/Admin/BulkActions.php

+			add_filter( "handle_bulk_actions-edit-$post_type", [ $this, 'bulk_action_handler' ], 10, 3 );
+
+			if ( is_post_type_hierarchical( $post_type ) ) {
+				add_action( 'page_row_actions', [ $this, 'register_row_action' ], 10, 2 );


Should this be add_filter instead?

We can move both add_action( 'page_row_actions...) and add_action( 'post_row_actions...) inside the constructor and handle it conditionally inside register_row_action.

This is so that we avoid hooking the same callback on the same hook multiple times.

Done in 22eeead

Sidsector9 · 2023-04-14T13:30:31Z

includes/Classifai/Admin/BulkActions.php

+			if ( is_post_type_hierarchical( $post_type ) ) {
+				add_action( 'page_row_actions', [ $this, 'register_row_action' ], 10, 2 );
+			} else {
+				add_action( 'post_row_actions', [ $this, 'register_row_action' ], 10, 2 );


Should this be add_filter instead?

Done in 22eeead

dkotter · 2023-05-02T20:43:35Z

There are ESLint errors being reported here within the gutenberg-plugin.js file. These weren't being reported previously even though the code that is being flagged hasn't changed. I'm also not getting these errors reported locally, just here in the GitHub Action. If anyone has time to take a look at that and let me know what you think, that would be great. It's only flagging some spacing issues, which I can fix; I'm just curious why those aren't getting flagged for me locally and why they weren't getting flagged here on GitHub until recently.

Edit: Updated my version of npm locally to 8.19.4, reinstalled dependencies, and was then able to reproduce the lint errors. Still not sure why these weren't flagged previously, and not sure I agree with the changes it wants (it basically removes all extra spaces), but things should pass now.

iamdharmesh

Thanks a lot for the great work here @dkotter. Very impressive work on embeddings comparison. The code looks amazing and it tests well.

I have added a few minor notes to discuss and I think we are good to merge this once we conclude these notes.

Thanks for all your efforts on this. ❤️

iamdharmesh · 2023-05-17T18:10:29Z

includes/Classifai/Providers/OpenAI/Embeddings.php

+				'var classifaiEmbeddingData = %s;',
+				wp_json_encode(
+					[
+						'enabled'              => true,


Maybe this should reflect the actual enabled value instead of the hard-coded true value.

So we never make it to this point if the feature is not enabled, which is why I left this hardcoded instead of pulling the value from our settings again. I think for now I'm going to leave this, but I do have plans in the future to use embedding data in other places (like generating recommended content). At that point, we'll probably change slightly how things are loaded here, since we'll have multiple features that can be enabled/disabled.

dkotter added 12 commits April 6, 2023 14:13

Add WIP code for embeddings. Currently have a new settings page that can be used to turn on integration. When content is saved or terms are saved, generate embedding data. Use that embedding data to assign terms to posts.

c9f8366

Add settings for post types and statuses and use those when running functionality

c9d931d

Add a custom filter to turn off embeddings for a particular item, even if all other checks pass

c2b5dba

Add setting to choose which taxonomies should be used for classification

cf1c66c

Utilize taxonomy setting to ensure we only run functionality on supported taxonomies

42d9138

Refactor code a bit. Add a term permissions check before we assign terms

dcc167b

Add a setting to choose how many terms should get assigned

9979367

Add ability to bulk classify existing content. Slightly refactor the existing bulk action handling to work for multiple providers

4823a1e

Add support to existing Gutenberg plugin for the new embedding feature

af4ee0c

When settings are saved, generate embedding data for all terms in selected taxonomies. Move some code around

c546614

Add tests

5d9687f

Rename the OpenAI providers

16f228b

dkotter self-assigned this Apr 10, 2023

dkotter added 3 commits April 9, 2023 21:03

Fix tests

1736419

Fix typo

576134c

More test fixes

cefe6ca

jeffpaul requested review from a team and Sidsector9 and removed request for a team April 10, 2023 21:37

jeffpaul added this to the 2.1.0 milestone Apr 10, 2023

Update docs

f428ac0

dkotter marked this pull request as ready for review April 11, 2023 03:03

dkotter requested review from a team and jeffpaul as code owners April 11, 2023 03:03

Sidsector9 reviewed Apr 11, 2023

Show an admin warning when NLU settings are saved or Embeddings settings are saved if both classification features are turned on. Both should be able to run at the same time but if both use the same taxonomies, one will overwrite the other. Fix a potential division by zero error

eab2549

dkotter requested a review from Sidsector9 April 12, 2023 16:57

Sidsector9 requested changes Apr 17, 2023

dkotter removed this from the 2.1.0 milestone May 1, 2023

dkotter added this to the 2.2.0 milestone May 1, 2023

dkotter added 5 commits May 2, 2023 13:25

Merge branch 'develop' into feature/openai-embeddings

0aa2c8d

Simplify how we load JS data for the gutenberg plugin, copying what is being done in #403. Remove code that is no longer needed

89e8b64

Simplify our bulk actions handler to avoid duplicate code. Fix a hook that should be a filter

22eeead

Fix e2e tests

88c8702

Set onboarding options. Fix an error if no onboarding options are set

9200bf4

dkotter requested a review from Sidsector9 May 2, 2023 20:41

dkotter mentioned this pull request May 2, 2023

Ensure all required options are in new onboarding #449

Open

1 task

Fix eslint errors

79e9737

dkotter added a commit that referenced this pull request May 4, 2023

Add bulk action support to transcribe audio files. Refactor our BulkActions class to better support multiple providers, following what was done in #437

81bdf1a

Ensure we don't get JS errors if NLU is not turned on

a89256b

jeffpaul mentioned this pull request May 17, 2023

Release version 2.2.0 #457

Closed

16 tasks

iamdharmesh reviewed May 17, 2023

dkotter added 3 commits May 18, 2023 11:49

Merge branch 'develop' into feature/openai-embeddings

3a0b353

Ensure we are getting terms that don't have embedding data yet when we are doing our bulk data generation

3f45946

Remove unneeded method

40f94d9

dkotter merged commit bb719ca into develop May 18, 2023

dkotter deleted the feature/openai-embeddings branch May 18, 2023 18:10

This was referenced Jun 20, 2023

Support OpenAI Embeddings in the Classic Editor #496

Closed

Add WP-CLI command to bulk classify items using the OpenAI Embeddings feature #497

Closed

phpbits mentioned this pull request Jun 30, 2023

Add WP-CLI command to bulk process OpenAI Embeddings #521

Merged

4 tasks
