
Feat/pf tests v2 #60

Merged
merged 20 commits into from
Jun 29, 2024

Conversation


@dividor dividor commented Jun 28, 2024

This PR implements the following:

  1. README instructions

  2. Addition of Promptflow for LLM output evaluation. Includes a new docker-compose-dev.yml, as this wouldn't be needed in production. Users need to build with `docker compose -f docker-compose.yml -f docker-compose-dev.yml up -d --build` to enable this functionality

  3. Slight refactoring of the Chainlit UI for neatness

  4. Tweak to the memory judge to add few-shot examples so it better handles context switching. Needed for the basic tests

  5. A Promptflow wrapper that calls the Chainlit code, using mock Chainlit objects as overrides
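To make point 5 concrete, here is a minimal sketch of the mock-Chainlit approach: stub out the `chainlit` module before importing the app so its UI calls become inspectable no-ops. All names here (`MockMessage`, `run_turn`, `echo_handler`) are hypothetical stand-ins, not the actual wrapper in this PR.

```python
# Sketch: stub out chainlit before the app imports it, so handlers can be
# driven from a test without a running UI. Names are hypothetical.
import sys
import asyncio
from unittest.mock import MagicMock

captured = []

class MockMessage:
    """Stands in for chainlit.Message; records content instead of rendering."""
    def __init__(self, content="", **kwargs):
        self.content = content
    async def send(self):
        captured.append(self.content)

mock_chainlit = MagicMock()
mock_chainlit.Message = MockMessage
mock_chainlit.on_message = lambda f: f   # decorators become pass-throughs
mock_chainlit.user_session = MagicMock()
sys.modules["chainlit"] = mock_chainlit  # must happen before importing the app

async def run_turn(handler, user_text):
    """Feed one user message to a handler and return the captured reply."""
    await handler(MockMessage(user_text))
    return captured[-1] if captured else None

# Toy handler standing in for the real Chainlit entry point:
async def echo_handler(msg):
    await MockMessage(f"echo: {msg.content}").send()

print(asyncio.run(run_turn(echo_handler, "total population of Mali?")))
```

A Promptflow node can then call `run_turn` per test input and hand the captured output to an evaluator.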

NOTE: Though the test can run the same code as Chainlit, which exercises recipes/memory and the on-the-fly assistant as a user would, it has the following limitations:

  • No support for images
  • Only one test included in flow.dat.json, i.e. no data.jsonl just yet
  • The test function runs and completes but doesn't exit, due to some hanging async process in Chainlit. After a lot of investigation, I gave up for now; instead the flow is run as a script, which is then killed. A VERY hacky workaround.
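The "run as a script, then kill it" workaround described in the last bullet can be sketched as follows. This is a simplified illustration, not the repo's actual code: the inline child script stands in for the real test flow, and the `RESULT:` sentinel is a hypothetical convention.

```python
# Sketch: launch the test in a subprocess, read its result line, then kill
# the process, since the Chainlit async loop never exits on its own.
import subprocess
import sys

# Inline stand-in for the real flow: prints a result, then hangs forever.
CODE = 'print("RESULT: ok", flush=True)\nimport time; time.sleep(3600)'

proc = subprocess.Popen([sys.executable, "-c", CODE],
                        stdout=subprocess.PIPE, text=True)
result = None
for line in proc.stdout:          # read until the flow reports its result
    if line.startswith("RESULT:"):
        result = line.strip()
        break
proc.kill()                       # the hacky part: never wait for a clean exit
proc.wait()
print(result)
```

The parent harvests the output it needs and terminates the child, trading a clean shutdown for not having to untangle Chainlit's event loop.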

As noted above, this isn't a neat solution because it calls Chainlit's async code, but it does work. I will merge into main as a guide for future end-to-end tests, but the main focus will be to test the recipes server AI and the AI assistant in isolation.

dividor added 18 commits June 25, 2024 11:25
…ill after execution. Obviously, this is a very hacky workaround to be able to use the exact Chainlit code for e2e tests, and we may not use this (we're implementing unit tests for the recipes server and assistant independently), but will finish the implementation as it works and is nearly done
…ne systematically as testing infra gets added
@dividor dividor requested a review from JanPeterDatakind June 28, 2024 21:39

@JanPeterDatakind left a comment


Approving the PR. Perhaps we can chat about the comments/open questions in our next check-in.

JanPeterDatakind (Contributor) commented:

I don't think it's an issue, but to confirm: Depending on how the output of this is used, having two definitions for value 1 ("The ANSWER is logically false from the information contained in the CONTEXT." and "It is not possible to determine whether the ANSWER is true or false without further information.") might not be ideal. Especially if only the integer is returned as the prompt suggests, one cannot distinguish between the two cases.
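One way to resolve the ambiguity the reviewer describes is to give each failure mode its own integer, so a bare score is always distinguishable. The sketch below is a hypothetical restructured rubric, with labels paraphrasing the two definitions currently sharing value 1; it is not the prompt in this PR.

```python
# Sketch: one integer per case, so "logically false" and "undeterminable"
# can no longer collide on the same score. Scale values are hypothetical.
GROUNDEDNESS_SCALE = {
    1: "The ANSWER is logically false given the CONTEXT.",
    2: "Cannot determine whether the ANSWER is true or false from the CONTEXT alone.",
    # ... intermediate levels ...
    5: "The ANSWER follows logically from the CONTEXT.",
}

def describe(score: int) -> str:
    """Map a judge's integer score back to its unambiguous definition."""
    return GROUNDEDNESS_SCALE.get(score, "undefined score")

print(describe(2))
```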

dividor (Contributor, Author) replied:

Hmm, good point. I took the prompt from Microsoft, but no doubt we can improve it. Once we add tests for LLM-generated output (as opposed to memories/recipes, which are deterministic), we should revisit this.

JanPeterDatakind (Contributor) commented:

Not specific to this work, but in this file we assume 17,839,995 as the correct total population of Mali, while in flow.dag.yaml loc 11 it's 17,907,114. Might be a data/memory problem.

dividor (Contributor, Author) replied:

That must have been left in there from an earlier point, when HAPI gave a slightly different number. Defining the tests will come in the next wave, but I also updated the tests for this PR as I'm extending it to add batch tests. Sorry, I should have started a new branch; more reviewing on this one soon. :)

JanPeterDatakind (Contributor) commented:

Question: The examples' chat history only contains user input and no assistant responses. Didn't we at one point add the assistant output to the history as well for context?

dividor (Contributor, Author) replied:

Well spotted! I've been grappling with this. chat_history is used for intent detection, and I can see where assistant responses might inform that decision, but the majority of the signal comes from the user ...

user: what's the total population of Mali?
assistant: It's XXXXXXX
user: plot a map by state

In terms of intent, this is just as effective, I feel, and uses far fewer tokens (bear in mind assistant responses can be BIG) ...

user: what's the total population of Mali?
user: plot a map by state

So I chopped out the AI response. I agree, though, that it is confusing; should we perhaps rename it to chat_history_user or something, or make this more obvious elsewhere?
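The trade-off discussed above can be sketched as a small filter applied before intent detection. The message shape (`{"role": ..., "content": ...}` dicts) and the function name are hypothetical illustrations, not the repo's actual code.

```python
# Sketch: keep only user turns in the history passed to the intent judge,
# since assistant responses can be very large and add few tokens of signal.
def user_only_history(chat_history):
    """Drop assistant turns before intent detection to save tokens."""
    return [m for m in chat_history if m["role"] == "user"]

history = [
    {"role": "user", "content": "what's the total population of Mali?"},
    {"role": "assistant", "content": "It's XXXXXXX"},
    {"role": "user", "content": "plot a map by state"},
]
print([m["content"] for m in user_only_history(history)])
```

Naming the filtered field `chat_history_user`, as suggested above, would make it obvious at the call site that assistant turns have been dropped.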

@JanPeterDatakind JanPeterDatakind merged commit 1d398d4 into main Jun 29, 2024
4 checks passed