Updates to the OpenAI Embeddings Provider #758
Conversation
- …512 vectors for easier storage
- …ons. Modify our calculation slightly
- …sing each through custom filters. Allows these to easily be used and changed by extenders
- …ould have, passing a filter around this value
- …post or term embeddings. Chunk content down before we generate embeddings. Start process of changing how we compare embeddings to work with these chunks, as we now have an array of embeddings instead of a single embedding
- …round this. Increase the max number of terms to 5000. Order terms by most used to least used
- …ices to admins. Fix a bug in our hide notice script so it only triggers when the close icon is clicked. Add functionality to regenerate embeddings for all terms and delete all post embeddings. Ensure we build embeddings when running our preview function, in case the post embeddings are out of date.
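The chunking change described in the commits above (generating an array of embeddings per post instead of a single one) can be sketched roughly as follows. This is an illustrative Python sketch, not ClassifAI's actual PHP code; the function names and chunk size are assumptions.

```python
import math

def chunk_content(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into roughly chunk_size-character pieces on word boundaries.

    Hypothetical helper: the real plugin's chunking strategy may differ.
    """
    words = text.split()
    chunks, current, length = [], [], 0
    for word in words:
        if length + len(word) + 1 > chunk_size and current:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(word)
        length += len(word) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Standard cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def best_chunk_score(query_embedding: list[float],
                     chunk_embeddings: list[list[float]]) -> float:
    """A post now has an array of embeddings; score it by its best-matching chunk."""
    return max(cosine_similarity(query_embedding, e) for e in chunk_embeddings)
```

Each chunk would be embedded separately, and a comparison against the post then reduces to the highest-scoring chunk.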
Note that you'll see a couple
Thanks a lot for working on this, @dkotter. This is such an amazing enhancement to ClassifAI.
The code looks good to me, and in most cases, it works fine. I just noticed a significant threshold % difference between the trunk and this branch. Is this expected? Could you please help to check if you experience the same, or is this something that only happens with my setup? Also, if this is expected, then we might want to add information in the notice to adjust the threshold.
[Screenshot comparison: trunk vs. this PR branch]
Thank you.
I do think this is expected, as we're not only using a different model but have also slightly changed our comparison function. I am not seeing quite as drastic differences as what you've shown, but I am seeing differences before and after. The good things I'm seeing though are:
To test further, I used some of the examples that OpenAI provides (utilizing their python SDK) to run comparisons and I got very similar distance scores. A few examples:
These obviously don't match 100% but I believe they are close enough to show things are working as expected (especially as we round things up and convert to a percentage, so you'd end up with 66.15% vs 67.56% and 57.68% vs 58.45% respectively).
I'm not worried about this for a few reasons:
For sure, I think we can call this out in our changelog, but I probably wouldn't do more than that; open to other thoughts though. Also worth mentioning that I'm open to others testing things out and seeing if we need changes to our comparison function. I took what OpenAI is doing in their SDK and converted that from Python to PHP, but there's a good chance I missed something or messed something up in that conversion. There are also other comparison functions we could look to use; cosine similarity seems to be the most widely used, but it may be worth testing out others.
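For reference, the comparison function being discussed is standard cosine similarity, plus the round-and-convert-to-percentage step mentioned earlier in the thread (e.g. 0.6756 becoming 67.56%). A minimal Python sketch of both steps, illustrative only and not the plugin's exact PHP implementation:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def to_percentage(similarity: float) -> float:
    """Round a similarity score to a two-decimal percentage."""
    return round(similarity * 100, 2)
```

With unit-normalized embeddings (as OpenAI's models return), cosine similarity reduces to a plain dot product, which is one reason it's the most common choice.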
@dkotter Thanks for checking on this and providing a detailed explanation; everything looks good to me. Regarding the comparison function, I've already utilized it in finding similar terms, and it works really well there. So, there's no issue with that. I think we're good to merge this.
Description of the Change
OpenAI recently introduced new Embedding models that are cheaper, give better results and come with the ability to reduce the size of the vectors produced. The main purpose of this PR is to update our existing OpenAI Embeddings Provider to use this new model. In addition, a few other improvements have been made to how we use Embeddings.
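On the vector-size reduction: the text-embedding-3 models accept a `dimensions` parameter on the embeddings API, and per OpenAI's documentation, shortening an embedding this way is equivalent to truncating the full vector and renormalizing it to unit length. A sketch of that post-hoc shortening (the function name is an assumption for illustration):

```python
import math

def shorten_embedding(embedding: list[float], dimensions: int = 512) -> list[float]:
    """Truncate an embedding to its first `dimensions` values and renormalize
    to unit length, mirroring what the API's `dimensions` parameter does
    per OpenAI's embeddings documentation."""
    truncated = embedding[:dimensions]
    norm = math.sqrt(sum(x * x for x in truncated))
    return [x / norm for x in truncated]
```

Renormalizing matters because downstream cosine comparisons assume unit-length vectors.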
High level overview of changes
- Move from the `text-embedding-ada-002` model to the `text-embedding-3-small` model. Decided to use this model instead of `text-embedding-3-large` as it's cheaper to run while still giving good results.
- Chunk content down before we generate embeddings, so for a long `post_content` field, we may generate 4 or 5 embeddings. This required changes in multiple places as we now have an array of embeddings to compare; we merge those results together, sort by score and then remove duplicates, giving us the highest matched items.
- Add a generic `generate_embedding` method that will take in some text and generate embeddings for it. The more specific methods that generate embeddings for a post or a term now pass content into this generic method. This also makes it easier for 3rd parties to use this same method.
- Rename our comparison function to `cosine_similarity` so it's more clear what it's doing (and allows us to add other similarity functions if desired).

How to test the Change
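One way to sanity-check the merge step described in the overview: with an array of embeddings per post, each chunk contributes its own `(item_id, score)` result, and those need to be merged, sorted by score, and deduplicated so each item keeps its highest score. A minimal Python sketch with a hypothetical data shape:

```python
def merge_chunk_results(results: list[tuple[int, float]],
                        limit: int = 10) -> list[tuple[int, float]]:
    """Merge per-chunk (item_id, score) results: sort by score descending,
    then keep only the first (highest-scoring) occurrence of each item."""
    seen: set[int] = set()
    merged = []
    for item_id, score in sorted(results, key=lambda r: r[1], reverse=True):
        if item_id not in seen:
            seen.add(item_id)
            merged.append((item_id, score))
    return merged[:limit]
```

Sorting before deduplicating guarantees the retained score for each item is its best chunk match.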
Changelog Entry
Credits
Props @dkotter
Checklist: