This document walks you through the concepts implemented in this sample so you can understand its capabilities and how to do the same.
Humans interact with each other through conversations that have some context of what is being discussed. OpenAI's ChatGPT can interact this way with humans as well. However, this capability is not native to an LLM itself; it must be implemented. Let's explore what happens when we ask our LLM follow-up questions that imply an existing context, just as you would in a conversation with another person.
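To make "it must be implemented" concrete, here is a minimal, hypothetical sketch (not the sample's actual code) of how an application supplies that context: it keeps the prior messages and re-sends them with every new request.

```csharp
using System.Collections.Generic;

// Hypothetical sketch: an LLM has no memory of its own, so the application
// keeps the conversation history and sends it along with each new question.
public record ChatMessage(string Role, string Text);

public class Conversation
{
    private readonly List<ChatMessage> _history = new();

    public void AddUserMessage(string text) =>
        _history.Add(new ChatMessage("user", text));

    public void AddAssistantMessage(string text) =>
        _history.Add(new ChatMessage("assistant", text));

    // The payload sent to the model is the prior messages plus the new question.
    public IReadOnlyList<ChatMessage> BuildRequestPayload() => _history;
}
```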
Let's observe this in action. Follow the steps after launching the application:
- Start a new Chat Session.
- Enter a question, `What is the largest lake in North America?`, and wait for the response, `Lake Superior`.
- Enter a follow-up without context, `What is the second largest?`, and wait for the response, `Lake Huron`.
- Enter a third follow-up, `What is the third largest?`, and wait for the response, `Great Bear Lake`.
Clearly, the LLM is able to keep context for the conversation and answer appropriately. While this concept is simple enough, it can present some challenges. It also introduces the concept of tokens for services like OpenAI.
Large language models require chat history to generate contextually relevant results. But there is a limit to how much text you can send: large language models have limits on how much text they can process in a request and output in a response. These limits are expressed not as words but as tokens. Tokens represent words or parts of words; on average, four characters make up one token. Tokens are essentially the compute currency for a large language model. Because of this limit, it is necessary to manage how many tokens each request consumes. This can be tricky in certain scenarios. You need to provide enough context for the LLM to generate a correct response, while avoiding the negative results of consuming too many tokens, which can include incomplete responses or unexpected behavior.
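As a rough illustration of the four-characters-per-token average mentioned above, here is a hypothetical back-of-the-envelope estimator (a real application would use the model's tokenizer):

```csharp
using System;

// Rough token estimate using the ~4 characters-per-token average described above.
// This is only a heuristic; a real implementation would use the model's tokenizer.
public static class TokenEstimator
{
    public static int Estimate(string text) =>
        (int)Math.Ceiling(text.Length / 4.0);
}

// Example: "What is the largest lake in North America?" is 42 characters,
// or roughly 11 tokens by this estimate.
```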
This application allows you to configure how large the context window can be (that is, the length of the chat history sent with each request). This is done using the configuration value `MaxConversationTokens`, which you can adjust in the appsettings.json file.
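One illustrative way such a limit could be applied is sketched below: keep the most recent messages that fit within the budget. This is a hypothetical approach, not necessarily the sample's exact implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch of enforcing a MaxConversationTokens budget: walk the
// history from newest to oldest and stop once the estimated token budget is
// exhausted, so the most recent context is always preserved.
public static class ContextWindow
{
    public static List<string> Trim(IReadOnlyList<string> history, int maxConversationTokens)
    {
        var window = new List<string>();
        int tokensUsed = 0;

        foreach (var message in history.Reverse()) // newest messages first
        {
            int tokens = (int)Math.Ceiling(message.Length / 4.0); // ~4 chars per token
            if (tokensUsed + tokens > maxConversationTokens)
                break;

            window.Insert(0, message); // restore chronological order
            tokensUsed += tokens;
        }

        return window;
    }
}
```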
Large language models are amazing in their ability to generate completions to a user's questions. However, these requests to generate completions from an LLM are computationally expensive (expressed in tokens) and can also be quite slow. This cost and latency increase as the amount of text increases.
In a pattern called Retrieval Augmented Generation, or RAG, data from a database is used to augment or ground the LLM by providing additional information with which to generate a response. These payloads can get rather large. It is not uncommon to consume thousands of tokens and wait 3-4 seconds for a response with large payloads. In a world where milliseconds count, waiting 3-4 seconds is often an unacceptable user experience.
Thankfully we can create a cache for this type of solution to reduce both cost and latency. In this exercise, we will introduce a specialized cache called a semantic cache.
Traditional caches are key-value pairs and use an equality match on the key to get data. Keys for a semantic cache are vectors (or embeddings), which represent words in a high-dimensional space where words with similar meaning or intent are in close proximity to each other.
A cache GET is done with a specialized vector query in which the match is made by comparing the proximity of these vectors. The result is a cached completion previously generated by an LLM. Vector queries include a similarity score that represents how close the vectors are to each other, with values ranging from 0 (no similarity) to 1 (exact match).
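One common metric behind such a similarity score is cosine similarity between the two embedding vectors. The vector query computes the score for you inside the database; the sketch below only illustrates what that score measures.

```csharp
using System;

// Illustration of what a similarity score measures: the cosine similarity
// between two embedding vectors. Similar text produces vectors that point in
// nearly the same direction, yielding a score close to 1.
public static class VectorMath
{
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimensionality.");

        double dot = 0, magnitudeA = 0, magnitudeB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            magnitudeA += a[i] * a[i];
            magnitudeB += b[i] * b[i];
        }

        return dot / (Math.Sqrt(magnitudeA) * Math.Sqrt(magnitudeB));
    }
}
```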
To execute a vector query for a semantic cache, the user's text is converted into vectors and then used as the filter predicate to search for similar vectors in the cache. For our semantic cache, we will create a query that returns just one result, and we use a similarity score to dial in how close the user's words and intent must be to the cache's key values. The greater the score, the more similar the words and intent; the lower the score, the less similar the words, and potentially the intent as well.
In practice, setting the similarity score value can be tricky. Too high, and the cache will quickly fill up with multiple responses for very similar questions. Too low, and the cache will return irrelevant responses that do not satisfy the user. In some scenarios, developers may opt to return multiple items from the cache, letting the user decide which is relevant.
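Putting these pieces together, here is a hypothetical in-memory sketch of a semantic cache GET, building on the `CosineSimilarity` helper above. The type names and threshold parameter are illustrative, not the sample's actual API.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory semantic cache: the user text is embedded elsewhere,
// then the closest cached entry is returned only if its similarity score
// meets the configured threshold (e.g. CacheSimilarityScore).
public record CacheItem(float[] Vector, string Completion);

public class SemanticCache
{
    private readonly List<CacheItem> _items = new();
    private readonly double _similarityScore; // e.g. 0.99 by default in this sample

    public SemanticCache(double similarityScore) => _similarityScore = similarityScore;

    public void Put(float[] vector, string completion) =>
        _items.Add(new CacheItem(vector, completion));

    // Returns a single cached completion, or null when nothing is similar enough.
    public string? TryGet(float[] queryVector)
    {
        var best = _items
            .Select(item => (item, score: VectorMath.CosineSimilarity(item.Vector, queryVector)))
            .OrderByDescending(x => x.score)
            .FirstOrDefault();

        return best.item is not null && best.score >= _similarityScore
            ? best.item.Completion
            : null;
    }
}
```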
Let's observe the semantic cache in action. Follow the steps after launching the application:
- Launch the application locally or in Codespaces.
- Press the "Clear Cache" button.
- Start a new Chat Session.
- Enter a question, `What is the largest lake in North America?`
- Enter a follow-up without context, `What is the second largest?`
- Enter a third follow-up, `What is the third largest?`
To test the cache, we will repeat the above sequence with slightly modified prompts.
- Enter the question, `What is the largest lake in North America?`. Observe that the response is much faster. It also has "(cached response)" appended to it.
- Enter a slightly different version of this question, `What is the biggest lake in North America?`. Observe that the response took slightly longer and consumed tokens, as you can see. This was essentially the same question with the same intent, so why didn't it result in a cache hit? The reason is the similarity score. It defaults to a value of `0.99`, which means the question must be nearly exactly the same as what was cached.
- Open the appsettings.Development.json file in the project. Edit the `CacheSimilarityScore` value and adjust it from `0.99` to `0.95`. Save the file.
- Relaunch the application.
- Start a new Chat Session.
- Clear the cache.
- Enter a question, `What is the largest lake in North America?`
- Then enter the similar question, `What is the biggest lake in North America?`. Notice that this time the result comes from the cache.
Spend some time trying different sequences of questions (and follow-up questions), then modify them with different similarity scores. You can click Clear Cache if you want to start over and run the same series of questions again.
The last section dives into the LLM orchestration SDK created by Microsoft Research called Semantic Kernel. Semantic Kernel is an open-source SDK that lets you easily build agents that can call your existing code. As a highly extensible SDK, Semantic Kernel works with models from OpenAI, Azure OpenAI, Hugging Face, and more! You can connect it to various vector databases using built-in connectors. By combining your existing C#, Python, and Java code with these models, you can build agents that answer questions and automate processes.
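As a minimal sketch of what chat completions with Semantic Kernel can look like (assuming the Microsoft.SemanticKernel package and an Azure OpenAI deployment; the deployment name, endpoint, and key below are placeholders, not values from this sample):

```csharp
using System;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

// Minimal sketch: build a kernel with an Azure OpenAI chat completion service,
// keep a ChatHistory for conversational context, and request a completion.
var kernel = Kernel.CreateBuilder()
    .AddAzureOpenAIChatCompletion(
        deploymentName: "<your-deployment-name>",
        endpoint: "https://<your-resource>.openai.azure.com/",
        apiKey: "<your-api-key>")
    .Build();

var chat = kernel.GetRequiredService<IChatCompletionService>();

var history = new ChatHistory();
history.AddUserMessage("What is the largest lake in North America?");

var response = await chat.GetChatMessageContentAsync(history);
Console.WriteLine(response.Content);
```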
There aren't any unique tests to run with Semantic Kernel in this sample that are not already covered here. But you can look through the SemanticKernelService.cs implementation in this sample, as well as ChatService.cs where it is used. The sample is intentionally simple, intended to give you a quick start in exploring Semantic Kernel's features and capabilities.