
fix: the cosine similarity is evaluated for top comments and bot comments are ignored #225

Open · wants to merge 5 commits into base: development

Conversation

@gentlementlegen (Member) commented Dec 26, 2024

Resolves #174

What are the changes

  • Bot comments are ignored when building the prompt, which greatly reduces token usage and inaccuracy.
  • If the token count for the built prompt exceeds the model limit, the comments are stripped down to the 10 most relevant ones based on their cosine similarity with the specification (a sketch of this flow follows below).
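A minimal sketch of that flow; `countTokens` and `rankBySimilarity` are hypothetical injected helpers and the comment shape is simplified, not the plugin's actual types:

interface IssueComment {
  id: number;
  body: string;
  author: string;
  isBot: boolean;
}

const TOKEN_MODEL_LIMIT = 124000; // depends on the model; configurable in practice
const MAX_RELEVANT_COMMENTS = 10;

function renderPrompt(specification: string, comments: IssueComment[]): string {
  return `Specification:\n${specification}\n\nComments:\n${JSON.stringify(comments, null, 2)}`;
}

function buildEvaluationPrompt(
  specification: string,
  allComments: IssueComment[],
  countTokens: (text: string) => number,
  rankBySimilarity: (spec: string, comments: IssueComment[]) => IssueComment[]
): string {
  // 1. Ignore bot comments entirely.
  let comments = allComments.filter((comment) => !comment.isBot);
  let prompt = renderPrompt(specification, comments);

  // 2. If the prompt exceeds the model's token limit, keep only the comments
  //    most similar to the specification.
  if (countTokens(prompt) > TOKEN_MODEL_LIMIT) {
    comments = rankBySimilarity(specification, comments).slice(0, MAX_RELEVANT_COMMENTS);
    prompt = renderPrompt(specification, comments);
  }
  return prompt;
}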

RFC @sshivaditya2019

@gentlementlegen gentlementlegen marked this pull request as ready for review December 26, 2024 07:39
@0x4007 (Member) left a comment

Very skeptical of the TF-IDF approach. We should go simpler and filter out more instead.

@@ -15,6 +15,9 @@ import {
import { BaseModule } from "../types/module";
import { ContextPlugin } from "../types/plugin-input";
import { GithubCommentScore, Result } from "../types/results";
import { TfIdf } from "../helpers/tf-idf";

const TOKEN_MODEL_LIMIT = 124000;
Member:

This depends on the model and possibly should be an environment variable because we might change models.

Member:

Hard coding the 124000 doesn't seem like a solution there either.

Member Author:

@0x4007 It is not hard coded; it is configurable within the config file. There is no API to retrieve a model's max token value, as far as I know.

Member:

Line 179 is hard coded.

Member Author:

What should I do if the configuration is undefined? Should I throw an error and stop the run?

Member:

Yes. If we don't have it saved in our library or collection of known amounts, then it should throw.
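A sketch of how that could look; the `resolveTokenLimit` helper and the table of known limits are illustrative assumptions, not the plugin's actual configuration code:

const KNOWN_MODEL_TOKEN_LIMITS: Record<string, number> = {
  // Illustrative values for commonly used models.
  "gpt-4o": 128000,
  "gpt-4o-mini": 128000,
};

function resolveTokenLimit(model: string, configuredLimit?: number): number {
  // Prefer an explicitly configured limit.
  if (configuredLimit !== undefined) {
    return configuredLimit;
  }
  // Otherwise fall back to the collection of known limits.
  const knownLimit = KNOWN_MODEL_TOKEN_LIMITS[model];
  if (knownLimit === undefined) {
    // No configured value and no known limit for this model: stop the run.
    throw new Error(`No token limit configured or known for model "${model}".`);
  }
  return knownLimit;
}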

@@ -61,7 +64,7 @@ export class ContentEvaluatorModule extends BaseModule {
const allCommentsUnClean = data.allComments || [];
const allComments: { id: number; comment: string; author: string }[] = [];
for (const commentObj of allCommentsUnClean) {
if (commentObj.user) {
if (commentObj.user && commentObj.user.type !== "Bot") {
Member:

We should also filter out slash commands? And minimized comments?

Member Author:

Yes, I agree. I didn't program this part, so I was not sure what to keep or remove.
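A sketch of the broader filter being discussed; the `user.type` check mirrors the diff above, while the slash-command and `isMinimized` checks are assumptions about the available comment data:

interface RawComment {
  id: number;
  body: string;
  user?: { login: string; type: string };
  isMinimized?: boolean;
}

function isEvaluableComment(comment: RawComment): boolean {
  if (!comment.user || comment.user.type === "Bot") return false; // bot comments
  if (comment.body.trim().startsWith("/")) return false; // slash commands
  if (comment.isMinimized) return false; // minimized (hidden) comments
  return true;
}

// Usage: const allComments = allCommentsUnClean.filter(isEvaluableComment);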

Member Author:

3cbd808

RFC @sshivaditya2019 I am not sure why all the comments were sent (commands, bot comments, etc.). Now only evaluated user comments (which are also curated) are sent in the evaluation prompt, which seems better. Let me know if that introduces any issues with comment evaluation precision.

Member:

It shouldn't affect it at all. I would proceed with implicit approval even if shiv doesn't reply in time for merge.


@gentlementlegen It shouldn't impact the comment evaluation at all. I think the prompt was updated, though the original one may have included something related to bot comments.

@@ -178,15 +181,33 @@ export class ContentEvaluatorModule extends BaseModule {
const dummyResponse = JSON.stringify(this._generateDummyResponse(comments), null, 2);
const maxTokens = this._calculateMaxTokens(dummyResponse);

const promptForComments = this._generatePromptForComments(specification, comments, allComments);
let promptForComments = this._generatePromptForComments(specification, comments, allComments);
if (this._calculateMaxTokens(promptForComments, Infinity) > TOKEN_MODEL_LIMIT) {
Member:

I'm skeptical about this whole TF-IDF approach.

  1. The tokenizer algorithm is different per model, and I'm not sure which model we should be using. I'm starting to see a pattern of OpenAI and Anthropic release cycles alternating, so to get the best models perhaps we should also alternate. I heard good things about the new OpenAI model.
  2. I don't fully understand why we need this if we can just filter out a lot more overhead, like the bot comments and slash commands.
  3. The weaknesses of this approach aren't fully clear to me. We did some earlier research on implementing this method here before, but I recall it wasn't a good fit. I just don't remember why; perhaps it was accuracy related.

Member Author:

This should be an edge case for very long conversations, which should not happen very often. The advantage of TF-IDF is that we keep a set of relevant comments instead of arbitrarily keeping the first or last n comments.
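For illustration, a self-contained sketch of the technique (not the repository's actual `TfIdf` helper): build TF-IDF vectors for the specification and each comment, then keep the comments with the highest cosine similarity to the specification.

function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Build one sparse TF-IDF vector per document (term -> weight).
function tfIdfVectors(documents: string[]): Map<string, number>[] {
  const tokenized = documents.map(tokenize);
  const docFrequency = new Map<string, number>();
  for (const tokens of tokenized) {
    for (const term of new Set(tokens)) {
      docFrequency.set(term, (docFrequency.get(term) ?? 0) + 1);
    }
  }
  return tokenized.map((tokens) => {
    const vector = new Map<string, number>();
    for (const term of tokens) {
      vector.set(term, (vector.get(term) ?? 0) + 1);
    }
    for (const [term, count] of vector) {
      // Smoothed IDF so terms shared by every document are not zeroed out.
      const idf = Math.log(1 + documents.length / (docFrequency.get(term) ?? 1));
      vector.set(term, (count / tokens.length) * idf);
    }
    return vector;
  });
}

function cosineSimilarity(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  for (const [term, weight] of a) {
    dot += weight * (b.get(term) ?? 0);
  }
  const norm = (v: Map<string, number>) => Math.sqrt([...v.values()].reduce((sum, w) => sum + w * w, 0));
  const magnitude = norm(a) * norm(b);
  return magnitude === 0 ? 0 : dot / magnitude;
}

// Keep the `limit` comments most similar to the specification.
function topCommentsBySimilarity(specification: string, comments: string[], limit = 10): string[] {
  const [specVector, ...commentVectors] = tfIdfVectors([specification, ...comments]);
  return comments
    .map((comment, i) => ({ comment, score: cosineSimilarity(specVector, commentVectors[i]) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((entry) => entry.comment);
}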

Member:

Can you articulate the weaknesses or concerns?

Member Author:

The main concern is that we base the selection solely on similarity, which might be less precise than using the LLM to determine which comments are the most relevant. On the other hand, I think it is better than writing summaries of the comments, because this way we keep the content intact during evaluation (we just shorten the list).

Since this should not happen often (and even less often now that we will properly filter out useless comments once I resolve your comment), the easiest and most straightforward solution seemed to be the best choice.

@sshivaditya2019

I don't think TF-IDF would be the best option for selecting the comments, as it only takes into account word frequency and does not give any weight to semantics. A better solution might be to switch to a larger-context model like Gemini, which provides a 2 million token context window, when we reach the token limit, and to exclude bot-generated comments from the selection process.

@gentlementlegen (Member Author)

This was added as a failsafe for when we go over the limit. Gemini can be an option, but theoretically we could also go beyond its token limit (even if that is unlikely). Since we can configure models that all have different max token limits (and third parties could be using smaller and cheaper ones), I think it is important that the technique we use to lower the context size is not itself based on an LLM.

@0x4007 (Member) commented Dec 29, 2024

More careful filtering of comments, like removing bot commands and comments, and possibly text summarization or concatenating multiple calls, are all more accurate approaches. TF-IDF is not the right tool for the job.

@gentlementlegen (Member Author)

The commands and bot comments are also filtered out in this PR; I added this as a last resort in case that is not enough. As I said before, I don't think it should rely on an LLM itself, because third-party users could choose a tiny model like GPT-3 with only 3,000 tokens available, which would not even be capable of summarizing. I could change this to summarize each comment in one sentence, which would use one LLM call per comment, which I thought could be expensive.

@0x4007 (Member) commented Dec 29, 2024

Doing multiple calls to score everything and then concatenating the results seems the most straightforward, with no data loss.

@gentlementlegen (Member Author)

@0x4007 Doing the evaluation is not the problem; the problem is that the context given to the LLM gets too big. If an issue has 300 comments, then the prompt would contain those 300 comments during evaluation, which would exceed the token limit, so it has to get smaller. I don't see a way to fix that with no data loss, unless you meant comparing the comment against every other single comment one by one?

@0x4007 (Member) commented Dec 30, 2024

Divide into two and do 150 per call. Receive the results arrays and concatenate them together.

@gentlementlegen (Member Author) commented Dec 30, 2024

@0x4007 This is not reliable, and if a third party decides to use a model like o4-mini it would break anyway. What should be done in that case?

Plus, concatenating would not make sense. I would have to run the comment against the first 150 and the last 150 (which would make the context inaccurate) and then probably average the result. I believe you didn't see how comments are evaluated: the problem is that, for the context needed to understand comments, we send all the comments in the context evaluated against the user's comments. Refer to:

@0x4007 (Member) commented Dec 31, 2024

Surely it's a bit of a trade-off without all of the comments in one shot, but this approach seems to trade off the least (see the sketch after this comment).

  1. The task specification is by far the most important reference point.
  2. We can inject a new line in the prompt explaining that what follows is part 2/2 and that, due to context window limits, we had to split the conversation into multiple parts for evaluation.

> and if a third party decides to use a model like o4-mini it would break anyways. What should be done in that case?

Why would they complain about using a non-default model? We set the default to what we know works.
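A sketch of the splitting approach proposed above (not what this PR implements); `evaluatePart` stands in for the LLM call, and the prompt wording is illustrative:

interface CommentScore {
  id: number;
  score: number;
}

async function evaluateInParts(
  specification: string,
  comments: { id: number; body: string }[],
  parts: number,
  evaluatePart: (prompt: string) => Promise<CommentScore[]>
): Promise<CommentScore[]> {
  const chunkSize = Math.ceil(comments.length / parts);
  const results: CommentScore[] = [];
  for (let i = 0; i < parts; i++) {
    const chunk = comments.slice(i * chunkSize, (i + 1) * chunkSize);
    if (chunk.length === 0) continue;
    // The specification is repeated in every part, and each part is labeled
    // so the model knows the conversation was split due to context limits.
    const prompt =
      `Task specification:\n${specification}\n\n` +
      `Note: due to context window limits, this is part ${i + 1}/${parts} of the conversation.\n\n` +
      `Comments:\n${JSON.stringify(chunk, null, 2)}`;
    results.push(...(await evaluatePart(prompt)));
  }
  // Concatenated score arrays from all parts.
  return results;
}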

Successfully merging this pull request may close these issues: Check overhead for comment evaluation.