This document walks you through the concepts implemented in this sample so you can understand its capabilities and how to do the same.
Humans interact with each other through conversations that have some context of what is being discussed. OpenAI's ChatGPT can interact this way with humans as well. However, this capability is not native to an LLM itself; it must be implemented. Let's explore what happens when we ask our LLM follow-up questions that imply an existing context, just as you would in a conversation with another person.
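To make "it must be implemented" concrete, here is a minimal, hypothetical sketch (not the sample's actual code) of how an application supplies that context: it keeps the prior messages and re-sends them with every new request.

```csharp
using System.Collections.Generic;

// Hypothetical sketch: an LLM has no memory of its own, so the application
// keeps the conversation history and sends it along with each new question.
public record ChatMessage(string Role, string Text);

public class Conversation
{
    private readonly List<ChatMessage> _history = new();

    public void AddUserMessage(string text) =>
        _history.Add(new ChatMessage("user", text));

    public void AddAssistantMessage(string text) =>
        _history.Add(new ChatMessage("assistant", text));

    // The payload sent to the model is the prior messages plus the new question.
    public IReadOnlyList<ChatMessage> BuildRequestPayload() => _history;
}
```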
Let's observe this in action. Follow the steps after launching the application:
- Start a new Chat Session.
- Enter a question, `What is the largest lake in North America?`, and wait for the response, `Lake Superior`.
- Enter a follow-up without context, `What is the second largest?`, and wait for the response, `Lake Huron`.
- Enter a third follow-up, `What is the third largest?`, and wait for the response, `Great Bear Lake`.
Clearly, the LLM is able to keep context for the conversation and answer appropriately. While this concept is simple enough, it can present some challenges. It also introduces the concept of tokens for services like OpenAI.
Large language models require chat history to generate contextually relevant results. But there is a limit to how much text you can send: large language models have limits on how much text they can process in a request and output in a response. These limits are expressed not as words but as tokens. Tokens represent words or parts of words; on average, four characters make up one token. Tokens are essentially the compute currency for a large language model. Because of this limit, it is necessary to manage how many tokens each request consumes. This can be tricky in certain scenarios. You need to provide enough context for the LLM to generate a correct response, while avoiding the negative results of consuming too many tokens, which can include incomplete responses or unexpected behavior.
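As a rough illustration of the four-characters-per-token average mentioned above, here is a hypothetical back-of-the-envelope estimator (a real application would use the model's tokenizer):

```csharp
using System;

// Rough token estimate using the ~4 characters-per-token average described above.
// This is only a heuristic; a real implementation would use the model's tokenizer.
public static class TokenEstimator
{
    public static int Estimate(string text) =>
        (int)Math.Ceiling(text.Length / 4.0);
}

// Example: "What is the largest lake in North America?" is 42 characters,
// or roughly 11 tokens by this estimate.
```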
This application allows you to configure how large the context window can be (that is, the length of the chat history sent with each request). This is done using the configuration value `MaxConversationTokens`, which you can adjust in the appsettings.json file.
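One illustrative way such a limit could be applied is sketched below: keep the most recent messages that fit within the budget. This is a hypothetical approach, not necessarily the sample's exact implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch of enforcing a MaxConversationTokens budget: walk the
// history from newest to oldest and stop once the estimated token budget is
// exhausted, so the most recent context is always preserved.
public static class ContextWindow
{
    public static List<string> Trim(IReadOnlyList<string> history, int maxConversationTokens)
    {
        var window = new List<string>();
        int tokensUsed = 0;

        foreach (var message in history.Reverse()) // newest messages first
        {
            int tokens = (int)Math.Ceiling(message.Length / 4.0); // ~4 chars per token
            if (tokensUsed + tokens > maxConversationTokens)
                break;

            window.Insert(0, message); // restore chronological order
            tokensUsed += tokens;
        }

        return window;
    }
}
```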
Large language models are amazing in their ability to generate completions to a user's questions. However, these requests to generate completions from an LLM are computationally expensive (expressed in tokens) and can also be quite slow. This cost and latency increase as the amount of text increases.
In a pattern called Retrieval Augmented Generation, or RAG, data from a database is used to augment or ground the LLM by providing additional information with which to generate a response. These payloads can get rather large. It is not uncommon to consume thousands of tokens and wait 3-4 seconds for a response with large payloads. In a world where milliseconds count, waiting 3-4 seconds is often an unacceptable user experience.
Thankfully we can create a cache for this type of solution to reduce both cost and latency. In this exercise, we will introduce a specialized cache called a semantic cache.
Traditional caches are key-value pairs and use an equality match on the key to get data. Keys for a semantic cache are vectors (or embeddings), which represent words in a high-dimensional space where words with similar meaning or intent are in close proximity to each other.
A cache GET is done with a specialized vector query in which the match is made by comparing the proximity of these vectors. The result is a cached completion previously generated by an LLM. Vector queries include a similarity score that represents how close the vectors are to each other, with values ranging from 0 (no similarity) to 1 (exact match).
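One common metric behind such a similarity score is cosine similarity between the two embedding vectors. The vector query computes the score for you inside the database; the sketch below only illustrates what that score measures.

```csharp
using System;

// Illustration of what a similarity score measures: the cosine similarity
// between two embedding vectors. Similar text produces vectors that point in
// nearly the same direction, yielding a score close to 1.
public static class VectorMath
{
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimensionality.");

        double dot = 0, magnitudeA = 0, magnitudeB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            magnitudeA += a[i] * a[i];
            magnitudeB += b[i] * b[i];
        }

        return dot / (Math.Sqrt(magnitudeA) * Math.Sqrt(magnitudeB));
    }
}
```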
To execute a vector query for a semantic cache, the user's text is converted into vectors and then used as the filter predicate to search for similar vectors in the cache. For our semantic cache, we will create a query that returns just one result, and we use a similarity score to dial in how close the user's words and intent must be to the cache's key values. The greater the score, the more similar the words and intent; the lower the score, the less similar the words, and potentially the intent as well.
In practice, setting the similarity score value can be tricky. Too high, and the cache will quickly fill up with multiple responses for very similar questions. Too low, and the cache will return irrelevant responses that do not satisfy the user. In some scenarios, developers may opt to return multiple items from the cache, letting the user decide which is relevant.
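Putting these pieces together, here is a hypothetical in-memory sketch of a semantic cache GET, building on the `CosineSimilarity` helper above. The type names and threshold parameter are illustrative, not the sample's actual API.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory semantic cache: the user text is embedded elsewhere,
// then the closest cached entry is returned only if its similarity score
// meets the configured threshold (e.g. CacheSimilarityScore).
public record CacheItem(float[] Vector, string Completion);

public class SemanticCache
{
    private readonly List<CacheItem> _items = new();
    private readonly double _similarityScore; // e.g. 0.99 by default in this sample

    public SemanticCache(double similarityScore) => _similarityScore = similarityScore;

    public void Put(float[] vector, string completion) =>
        _items.Add(new CacheItem(vector, completion));

    // Returns a single cached completion, or null when nothing is similar enough.
    public string? TryGet(float[] queryVector)
    {
        var best = _items
            .Select(item => (item, score: VectorMath.CosineSimilarity(item.Vector, queryVector)))
            .OrderByDescending(x => x.score)
            .FirstOrDefault();

        return best.item is not null && best.score >= _similarityScore
            ? best.item.Completion
            : null;
    }
}
```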
Let's observe the semantic cache in action. Follow the steps after launching the application:
- Launch the application locally or in Codespaces.
- Press the "Clear Cache" button.
- Start a new Chat Session.
- Enter a question, `What is the largest lake in North America?`
- Enter a follow-up without context, `What is the second largest?`
- Enter a third follow-up, `What is the third largest?`
To test the cache, we will repeat the above sequence with slightly modified prompts.
- Enter the question, `What is the largest lake in North America?`. Observe that the response is much faster. It also has "(cached response)" appended to it.
- Enter a slightly different version of this question, `What is the biggest lake in North America?`. Observe that the response took slightly longer and consumed tokens, as you can see. This was essentially the same question with the same intent, so why didn't it result in a cache hit? The reason is the similarity score. It defaults to a value of `0.99`, which means the question must be nearly exactly the same as what was cached.
- Open the appsettings.Development.json file in the project. Edit the `CacheSimilarityScore` value and adjust it from `0.99` to `0.95`. Save the file.
- Relaunch the application.
- Start a new Chat Session.
- Clear the cache.
- Enter a question, `What is the largest lake in North America?`
- Then enter the similar question, `What is the biggest lake in North America?`. Notice that this time the result comes from the cache.
Spend some time trying different sequences of questions (and follow-up questions), then modify them with different similarity scores. You can click Clear Cache if you want to start over and run the same series of questions again.
The last section dives into the LLM orchestration SDK created by Microsoft Research called Semantic Kernel. Semantic Kernel is an open-source SDK that lets you easily build agents that can call your existing code. As a highly extensible SDK, Semantic Kernel works with models from OpenAI, Azure OpenAI, Hugging Face, and more! You can connect it to various vector databases using built-in connectors. By combining your existing C#, Python, and Java code with these models, you can build agents that answer questions and automate processes.
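As a minimal sketch of what chat completions with Semantic Kernel can look like (assuming the Microsoft.SemanticKernel package and an Azure OpenAI deployment; the deployment name, endpoint, and key below are placeholders, not values from this sample):

```csharp
using System;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

// Minimal sketch: build a kernel with an Azure OpenAI chat completion service,
// keep a ChatHistory for conversational context, and request a completion.
var kernel = Kernel.CreateBuilder()
    .AddAzureOpenAIChatCompletion(
        deploymentName: "<your-deployment-name>",
        endpoint: "https://<your-resource>.openai.azure.com/",
        apiKey: "<your-api-key>")
    .Build();

var chat = kernel.GetRequiredService<IChatCompletionService>();

var history = new ChatHistory();
history.AddUserMessage("What is the largest lake in North America?");

var response = await chat.GetChatMessageContentAsync(history);
Console.WriteLine(response.Content);
```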
There aren't any unique tests to run with Semantic Kernel in this sample that are not already covered here. But you can look through the SemanticKernelService.cs implementation in this sample, as well as ChatService.cs where it is used. The sample is intentionally simple, intended to give you a quick start in exploring Semantic Kernel's features and capabilities.