add example: <Generate Embeddings> and <Embedding Similarity Search> #274

Closed
wants to merge 23 commits

Conversation

@aceld (Contributor) commented Apr 21, 2023

No description provided.

codecov bot commented Apr 21, 2023

Codecov Report

Merging #274 (9e2d687) into master (c3b2451) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #274      +/-   ##
==========================================
+ Coverage   97.02%   97.05%   +0.02%     
==========================================
  Files          17       18       +1     
  Lines         705      712       +7     
==========================================
+ Hits          684      691       +7     
  Misses         15       15              
  Partials        6        6              
Impacted Files Coverage Δ
embeddings_utils.go 100.00% <100.00%> (ø)

@sashabaranov (Owner):

Hey, thank you for this PR!

@aceld (Contributor, Author) commented Apr 25, 2023

@sashabaranov Hello, I believe that this example is what many developers need. It is a basic case of using Embedding for semantic search. I hope you can review and approve it. Thank you.

@sashabaranov (Owner):

@aceld could you please make changes based on the comments above? Would love to merge this PR after the changes are made

@aceld (Contributor, Author) commented Apr 25, 2023

@sashabaranov Which comment do you want me to edit?

@sashabaranov (Owner):

@aceld both!

README.md: two review comments (outdated, resolved)
@sashabaranov (Owner):

Duh, I think comments were in draft stage and not published, sorry!

@aceld (Contributor, Author) commented Apr 26, 2023

@sashabaranov done.

// Calculate dot product
dot := DotProduct(v1, v2)
// Calculate magnitude of v1
v1Magnitude := math.Sqrt(float64(DotProduct(v1, v1)))
@sashabaranov (Owner):

Embeddings are normalized to length 1, so we don't need to do that. CosineSimilarity is equal to DotProduct in this case.

https://platform.openai.com/docs/guides/embeddings/which-distance-function-should-i-use
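
For illustration, a minimal sketch (not part of the PR) of why the two are equivalent for unit-length vectors; it reuses the DotProduct helper from the diff above and assumes "math" is imported:

// cosineSimilarity is shown only to illustrate the point: for vectors of
// length 1, both magnitudes below are 1, so the result equals the dot product.
func cosineSimilarity(v1, v2 []float32) float64 {
	dot := float64(DotProduct(v1, v2))
	m1 := math.Sqrt(float64(DotProduct(v1, v1))) // 1.0 for normalized embeddings
	m2 := math.Sqrt(float64(DotProduct(v2, v2))) // 1.0 for normalized embeddings
	return dot / (m1 * m2)                       // equals dot when m1 == m2 == 1
}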

@aceld (Contributor, Author):

@sashabaranov OK, I have simply deleted the CosineSimilarity function. It is not used in the example anyway.

v2 := []float32{2, 4, 6}
expected := float32(28.0)
result := DotProduct(v1, v2)
if result != expected {
@sashabaranov (Owner):

You can't compare floats like that https://bitbashing.io/comparing-floats.html

@aceld (Contributor, Author):

@sashabaranov So how should I compare them? Could you provide an example case?

@sashabaranov (Owner):

@aceld something like

func isClose(a, b float32) bool {
	if a == b {
		return true
	}
	return math.Abs(float64(a-b)) < 1e-12
}

https://floating-point-gui.de/errors/comparison/
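
A hedged sketch of how the test above might use such a helper inside a Go test function (v1 is an assumption here; []float32{1, 2, 3} is what makes the expected dot product with v2 equal 28):

// v1 is assumed; v2, expected, and DotProduct come from the diff being reviewed.
v1 := []float32{1, 2, 3}
v2 := []float32{2, 4, 6}
expected := float32(28.0)
result := DotProduct(v1, v2)
if !isClose(result, expected) {
	t.Errorf("DotProduct(%v, %v) = %v, want %v", v1, v2, result, expected)
}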

@sashabaranov (Owner):

@aceld if you have some time — maybe let's remove all README changes, fix float comparison and merge!

@aceld (Contributor, Author) commented Jun 5, 2023

OK @sashabaranov

@aceld (Contributor, Author) commented Jun 9, 2023

@sashabaranov I apologize for not being able to find time to make the changes recently. I am sorry for any inconvenience caused, and I will try my best to allocate time next week to fix this issue.

@aceld (Contributor, Author) commented Jun 13, 2023

@sashabaranov
I have completed the fix according to your requirements, but the Codecov Report did not pass. I don't quite understand it. Please enlighten me. Thank you.

@sashabaranov (Owner):

@aceld hey, your imports are not sorted properly. Please refer to https://golangci-lint.run/ for more information.
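
For context, golangci-lint (via gci/goimports) generally expects standard-library imports grouped before third-party imports. A purely illustrative grouping for an example file like this one might look as follows (the exact import list is an assumption, not the PR's actual file):

import (
	// standard library first
	"context"
	"fmt"
	"sort"

	// third-party packages after a blank line
	"github.com/sashabaranov/go-openai"
)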

@aceld (Contributor, Author) commented Jun 16, 2023

@sashabaranov done!

package openai

// DotProduct Calculate dot product of two vectors.
func DotProduct(v1, v2 []float32) float32 {
@sashabaranov (Owner):

Why not put it as an Embedding method?

@aceld (Contributor, Author):

(image attached)

@sashabaranov Like you said.

@sashabaranov (Owner):

Totally, but we don't really have []float32 vectors in the library except for Embedding struct. Might make sense to add it as func (e Embedding) DotProduct(another Embedding)
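
A rough sketch of what that method might look like (assuming the Embedding struct exposes its vector as a []float32 field named Embedding, and ignoring length-mismatch handling):

// DotProduct returns the dot product of two embedding vectors. Because
// OpenAI embeddings are normalized to length 1, this also equals their
// cosine similarity. Both vectors are assumed to have the same length.
func (e Embedding) DotProduct(other Embedding) float32 {
	var sum float32
	for i := range e.Embedding {
		sum += e.Embedding[i] * other.Embedding[i]
	}
	return sum
}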

Comment on lines +582 to +594
file, err := os.Create("embeddings.bin")
if err != nil {
	fmt.Printf("Create file error: %v\n", err)
	return
}
defer file.Close()

encoder := gob.NewEncoder(file)
err = encoder.Encode(selectionsEmbeddings)
if err != nil {
	fmt.Printf("Encode error: %v\n", err)
	return
}
@sashabaranov (Owner):

I think file I/O and marshalling are largely out of scope for this example. Could you please remove it?

@aceld (Contributor, Author):

@sashabaranov Sure, you're right. Do you have any suggestions on how to store vector data more efficiently? I would appreciate some advice.

@sashabaranov (Owner):

@aceld I think the storage of vector data is largely out of scope for this README; the point is just to show an example, not to build a vector-search DB.

Comment on lines +664 to +684
input := "I am a Golang Software Engineer, I like Go and OpenAI."

// get embedding of input
inputEmbd, err := getEmbedding(ctx, client, []string{input})
if err != nil {
	fmt.Printf("GetEmbedding error: %v\n", err)
	return
}

// Calculate similarity through the cosine matching algorithm
var questionScores []float32
for _, embed := range allEmbeddings {
	// OpenAI embeddings are normalized to length 1, which means that
	// cosine similarity can be computed slightly faster using just a dot product
	score := openai.DotProduct(embed, inputEmbd)
	questionScores = append(questionScores, score)
}

// Take the indexes of the selections with the highest similarity
sortedIndexes := sortIndexes(questionScores)
sortedIndexes = sortedIndexes[:3] // Top 3
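
The sortIndexes helper is not shown in this excerpt; a minimal sketch of what such a helper could look like (an assumption, using the standard sort package) is:

// sortIndexes returns the indexes of scores ordered by descending similarity,
// so sortedIndexes[0] points at the best-matching selection.
func sortIndexes(scores []float32) []int {
	indexes := make([]int, len(scores))
	for i := range indexes {
		indexes[i] = i
	}
	sort.Slice(indexes, func(a, b int) bool {
		return scores[indexes[a]] > scores[indexes[b]]
	})
	return indexes
}
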
@sashabaranov (Owner):

Could we please add this section to the previous example and have one single example for embeddings?

sashabaranov added the "help wanted" and "stale" labels on Jun 30, 2023
@Leeaandrob:

It's fantastic! How can I help get this PR merged? @aceld @sashabaranov

@sashabaranov (Owner):

@Leeaandrob yes, please feel free to fork it and continue in another PR!

@aceld (Contributor, Author) commented Jul 12, 2023

@sashabaranov @Leeaandrob The code conflicts have been resolved, and currently, all checks have passed.

github-actions bot removed the "stale" label on Jul 13, 2023
@sashabaranov (Owner):

@aceld there are still a number of changes to be made, as mentioned above. To reiterate:
