Feat/pf tests v2 #60
Conversation
…tflow tests. Also initial Promptflow
…ill after execution. Obviously, this is a very hacky workaround to be able to use the exact chainlit code for e2e tests, and we may not keep it (we're implementing unit tests for the recipes server and assistant independently), but I'll finish the implementation since it works and is nearly done
…ne systematically as testing infra gets added
Approving the PR. Perhaps we can chat about the comments / open questions in our next check-in.
I don't think it's an issue, but to confirm: Depending on how the output of this is used, having two definitions for value 1 ("The ANSWER is logically false from the information contained in the CONTEXT." and "It is not possible to determine whether the ANSWER is true or false without further information.") might not be ideal. Especially if only the integer is returned as the prompt suggests, one cannot distinguish between the two cases.
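One way to avoid the ambiguity is to give every outcome its own integer and parse the judge's reply against that scale. This is only a minimal sketch: the `Verdict` names, values, and `parse_verdict` helper are illustrative assumptions, not the prompt's actual wording.

```python
from enum import IntEnum

class Verdict(IntEnum):
    """Hypothetical disambiguated scale: each outcome gets its own integer."""
    CONTRADICTED = 0   # ANSWER is logically false given the CONTEXT
    UNDETERMINED = 1   # cannot be determined without further information
    ENTAILED = 2       # ANSWER follows from the CONTEXT

def parse_verdict(raw: str) -> Verdict:
    """Map the judge's bare integer reply onto a distinct verdict."""
    return Verdict(int(raw.strip()))
```

With distinct values, downstream code can treat "contradicted" and "undetermined" differently even when only the integer is returned.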
Hmm, good point. I took the prompt from Microsoft, but no doubt we can improve it. As we add tests for LLM-generated output (as opposed to memories/recipes, which are deterministic), we should revisit this.
Not specific to this work, but in this file we assume 17,839,995 as the correct total population of Mali, while in flow.dag.yaml loc 11 it's 17,907,114. Might be a data/memory problem.
That must have been left in from an earlier point when HAPI gave a slightly different number. Defining the tests will come in the next wave, but I also updated the tests for this PR as I'm extending it to add batch tests. Sorry, I should have started a new branch; more reviewing on this one soon. :)
Question: The examples' chat history only contains user input and no assistant responses. Didn't we at one point add the assistant output to the history as well for context?
Well spotted! I've been grappling with this. chat_history is used for intent detection, and I can see where assistant responses might inform that decision, but most of the signal comes from the user ...
user: what's the total population of Mali?
assistant: It's XXXXXXX
user: plot a map by state
In terms of intent, I feel this is just as effective, and it uses far fewer tokens (bear in mind assistant responses can be BIG) ...
user: what's the total population of Mali?
user: plot a map by state
So I chopped out the AI response. I agree it is confusing, though. Should we perhaps rename the field to chat_history_user, or make this more obvious elsewhere?
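The trimming described above could be sketched as a simple filter over the turns. This is an assumption about the shape of the data: the `role`/`content` dicts and the `user_only_history` helper are illustrative, and the real chat_history structure in the codebase may differ.

```python
# Hypothetical shape of the trimmed history; the real chat_history
# structure in the codebase may differ.
def user_only_history(chat_history: list[dict]) -> list[dict]:
    """Drop assistant turns: they can be very large and (arguably) add
    little extra signal for intent detection."""
    return [m for m in chat_history if m.get("role") == "user"]

history = [
    {"role": "user", "content": "what's the total population of Mali?"},
    {"role": "assistant", "content": "It's XXXXXXX"},
    {"role": "user", "content": "plot a map by state"},
]
trimmed = user_only_history(history)  # only the two user turns remain
```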
This PR implements the following:
- README instructions
- Addition of Promptflow for LLM output evaluation. Includes a new docker-compose-dev.yml, as this wouldn't be needed in production. The user needs to build with
  docker compose -f docker-compose.yml -f docker-compose-dev.yml up -d --build
  to get this functionality
- Slight refactoring of the chainlit UI for neatness
- Tweak to the memory judge to add few-shot examples to better handle context switching. Needed for basic tests
- A Promptflow wrapper to call the chainlit code, using mock chainlit objects to override
NOTE: Though the test can run the same code as chainlit, which will test recipes/memory and the on-the-fly assistant as a user would, it has the following limitation:
As noted above, this isn't a neat solution due to calling chainlit's async code, but it does work. I'll merge it into main as a guide for future end-to-end tests, but the main focus will be on testing the recipes server AI and the AI assistant in isolation.
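The mock-object approach above might look roughly like the following. This is a minimal sketch under stated assumptions: `FakeMessage` and `handle_message` are stand-ins invented for illustration (real chainlit objects have a much larger surface, and the app's actual entry point would be patched in via something like unittest.mock), not the PR's actual wrapper.

```python
import asyncio

# Stand-in for chainlit's Message that records sends instead of talking
# to a websocket. A sketch only: real chainlit objects have more surface.
class FakeMessage:
    sent: list[str] = []

    def __init__(self, content: str = "", **_kwargs):
        self.content = content

    async def send(self):
        FakeMessage.sent.append(self.content)

# Hypothetical handler shaped like a chainlit @cl.on_message callback;
# in the real tests this would be the app's actual async entry point.
async def handle_message(message: FakeMessage) -> None:
    reply = f"echo: {message.content}"
    await FakeMessage(content=reply).send()

def run_e2e_turn(user_text: str) -> str:
    """Run one conversational turn headless and capture the reply."""
    FakeMessage.sent.clear()
    asyncio.run(handle_message(FakeMessage(content=user_text)))
    return FakeMessage.sent[-1]
```

Driving the handler through `asyncio.run` is what makes the async chainlit code path awkward to reuse, which is the "hacky" part the note refers to.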