
Feat/pf tests v2 #60

Merged
merged 20 commits into from
Jun 29, 2024

Conversation


@dividor dividor commented Jun 28, 2024

This PR implements the following:

  1. README instructions

  2. Addition of Promptflow for LLM output evaluation. Includes a new docker-compose-dev.yml, as this wouldn't be needed in production. Users need to build with `docker compose -f docker-compose.yml -f docker-compose-dev.yml up -d --build` to enable this functionality

  3. Slight refactoring of the Chainlit UI for neatness

  4. Tweak to the memory judge to add few-shot examples so it better handles context switching. Needed for the basic tests

  5. A Promptflow wrapper that calls the Chainlit code, using mock Chainlit objects as overrides
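To make point 5 concrete, here is a minimal sketch of the mock-Chainlit approach: stub out the `chainlit` module before importing the app so its UI calls become inspectable no-ops. All names here (`MockMessage`, `run_turn`, `echo_handler`) are hypothetical stand-ins, not the actual wrapper in this PR.

```python
# Sketch: stub out chainlit before the app imports it, so handlers can be
# driven from a test without a running UI. Names are hypothetical.
import sys
import asyncio
from unittest.mock import MagicMock

captured = []

class MockMessage:
    """Stands in for chainlit.Message; records content instead of rendering."""
    def __init__(self, content="", **kwargs):
        self.content = content
    async def send(self):
        captured.append(self.content)

mock_chainlit = MagicMock()
mock_chainlit.Message = MockMessage
mock_chainlit.on_message = lambda f: f   # decorators become pass-throughs
mock_chainlit.user_session = MagicMock()
sys.modules["chainlit"] = mock_chainlit  # must happen before importing the app

async def run_turn(handler, user_text):
    """Feed one user message to a handler and return the captured reply."""
    await handler(MockMessage(user_text))
    return captured[-1] if captured else None

# Toy handler standing in for the real Chainlit entry point:
async def echo_handler(msg):
    await MockMessage(f"echo: {msg.content}").send()

print(asyncio.run(run_turn(echo_handler, "total population of Mali?")))
```

A Promptflow node can then call `run_turn` per test input and hand the captured output to an evaluator.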

NOTE: Though the test can run the same code as Chainlit, which exercises recipes/memory and the on-the-fly assistant as a user would, it has the following limitations:

  • No support for images
  • Only one test included in flow.dat.json, i.e. no data.jsonl just yet
  • The test function runs and completes but doesn't exit, due to some hanging async process in Chainlit. After a lot of investigation, I gave up for now; instead the flow is run as a script, which is then killed. A VERY hacky workaround.
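The "run as a script, then kill it" workaround described in the last bullet can be sketched as follows. This is a simplified illustration, not the repo's actual code: the inline child script stands in for the real test flow, and the `RESULT:` sentinel is a hypothetical convention.

```python
# Sketch: launch the test in a subprocess, read its result line, then kill
# the process, since the Chainlit async loop never exits on its own.
import subprocess
import sys

# Inline stand-in for the real flow: prints a result, then hangs forever.
CODE = 'print("RESULT: ok", flush=True)\nimport time; time.sleep(3600)'

proc = subprocess.Popen([sys.executable, "-c", CODE],
                        stdout=subprocess.PIPE, text=True)
result = None
for line in proc.stdout:          # read until the flow reports its result
    if line.startswith("RESULT:"):
        result = line.strip()
        break
proc.kill()                       # the hacky part: never wait for a clean exit
proc.wait()
print(result)
```

The parent harvests the output it needs and terminates the child, trading a clean shutdown for not having to untangle Chainlit's event loop.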

As noted above, this isn't a neat solution because it calls Chainlit's async code, but it does work. I will merge into main as a guide for future end-to-end tests, but the main focus will be to test the recipes server AI and the AI assistant in isolation.

dividor added 18 commits June 25, 2024 11:25
…ill after execution. Obviously, this is a very hacky workaround to be able to use the exact Chainlit code for e2e tests, and we may not use this (we're implementing unit tests for the recipes server and assistant independently), but will finish the implementation as it works and is nearly done
…ne systematically as testing infra gets added
@dividor dividor requested a review from JanPeterDatakind June 28, 2024 21:39

@JanPeterDatakind left a comment


Approving the PR. Perhaps we can chat about the comments/open questions in our next check-in.

JanPeterDatakind (Contributor) commented:

I don't think it's an issue, but to confirm: Depending on how the output of this is used, having two definitions for value 1 ("The ANSWER is logically false from the information contained in the CONTEXT." and "It is not possible to determine whether the ANSWER is true or false without further information.") might not be ideal. Especially if only the integer is returned as the prompt suggests, one cannot distinguish between the two cases.
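One way to resolve the ambiguity the reviewer describes is to give each failure mode its own integer, so a bare score is always distinguishable. The sketch below is a hypothetical restructured rubric, with labels paraphrasing the two definitions currently sharing value 1; it is not the prompt in this PR.

```python
# Sketch: one integer per case, so "logically false" and "undeterminable"
# can no longer collide on the same score. Scale values are hypothetical.
GROUNDEDNESS_SCALE = {
    1: "The ANSWER is logically false given the CONTEXT.",
    2: "Cannot determine whether the ANSWER is true or false from the CONTEXT alone.",
    # ... intermediate levels ...
    5: "The ANSWER follows logically from the CONTEXT.",
}

def describe(score: int) -> str:
    """Map a judge's integer score back to its unambiguous definition."""
    return GROUNDEDNESS_SCALE.get(score, "undefined score")

print(describe(2))
```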

dividor (Contributor, Author) replied:

Hmm, good point. I took the prompt from Microsoft, but no doubt we can improve it. Once we add tests for LLM-generated output (as opposed to memories/recipes, which are deterministic), we should revisit this.

JanPeterDatakind (Contributor) commented:

Not specific to this work, but in this file we assume 17,839,995 as the correct total population of Mali, while in flow.dag.yaml loc 11 it's 17,907,114. Might be a data/memory problem.

dividor (Contributor, Author) replied:

That must have been left in there from an earlier point, when HAPI gave a slightly different number. Defining the tests will come in the next wave, but I also updated the tests for this PR as I'm extending it to add batch tests. Sorry, I should have started a new branch; more reviewing on this one soon. :)

JanPeterDatakind (Contributor) commented:

Question: The examples' chat history only contains user input and no assistant responses. Didn't we at one point add the assistant output to the history as well for context?

dividor (Contributor, Author) replied:

Well spotted! I've been grappling with this. chat_history is used for intent detection, and I can see where assistant responses might inform that decision, but the majority of the signal comes from the user ...

user: what's the total population of Mali?
assistant: It's XXXXXXX
user: plot a map by state

In terms of intent, this is just as effective, I feel, and uses far fewer tokens (bear in mind assistant responses can be BIG) ...

user: what's the total population of Mali?
user: plot a map by state

So I chopped out the AI response. I agree, though, that it is confusing; should we perhaps rename it to chat_history_user or something, or make this more obvious elsewhere?
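The trade-off discussed above can be sketched as a small filter applied before intent detection. The message shape (`{"role": ..., "content": ...}` dicts) and the function name are hypothetical illustrations, not the repo's actual code.

```python
# Sketch: keep only user turns in the history passed to the intent judge,
# since assistant responses can be very large and add few tokens of signal.
def user_only_history(chat_history):
    """Drop assistant turns before intent detection to save tokens."""
    return [m for m in chat_history if m["role"] == "user"]

history = [
    {"role": "user", "content": "what's the total population of Mali?"},
    {"role": "assistant", "content": "It's XXXXXXX"},
    {"role": "user", "content": "plot a map by state"},
]
print([m["content"] for m in user_only_history(history)])
```

Naming the filtered field `chat_history_user`, as suggested above, would make it obvious at the call site that assistant turns have been dropped.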

@JanPeterDatakind JanPeterDatakind merged commit 1d398d4 into main Jun 29, 2024
4 checks passed