This project is a LangChain prompt-engineering application. The vision: load several real analysis HTML pages from a website, extract their text with LangChain's Beautiful Soup-based HTML loader, split the material into chunks, and use ChatOpenAI to interact with it. I will try to restrict ChatOpenAI from using its own knowledge of real analysis and instead rely only on the documents. Can it come up with proofs? Let's find out.
- ChatOpenAI: Underlying LLM that the user asks questions to in QA format
- LangChain: Framework that facilitates document chunking, retrieval, and construction of the LLM pipeline
- BSHTMLLoader: Extracts clean text from raw HTML, which helps the LLM understand the context
- RecursiveCharacterTextSplitter: Splits documents into smaller logical chunks, decreasing the context required to be fed into the LLM
- OpenAI Embeddings: Converts document texts to vectors for fast similarity querying
- Chroma DB Vectorstore: Stores document embeddings in memory for quick retrieval
- ConversationalRetrievalChain: LangChain-provided chain that combines a retriever with the chat model to answer questions over the documents
- LangChain Playground: Allows easy testing of the chain in a browser
- Docker: facilitates deployment
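For reference, normalizing a downloaded page to UTF-8 might look like the snippet below. This is a sketch, not the project's actual Containerfile: the URL is a placeholder, and ISO-8859-1 is an assumed source encoding.

```shell
#!/bin/sh
# In the Containerfile, each page would be fetched first, e.g.:
#   curl -fsSL "https://example.com/real-analysis/ch1.html" -o raw.html
# (placeholder URL, not the project's actual source)

# Re-encode a downloaded page as UTF-8 so the HTML loader hits no decode
# errors. ISO-8859-1 is an assumption about the source encoding.
to_utf8() {
  iconv -f ISO-8859-1 -t UTF-8 "$1" > "$2"
}
```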
- Containerfile curls 4 HTML real analysis pages and converts them to UTF-8 to avoid character-encoding errors
- UTF-8 files are loaded via Beautiful Soup into LangChain documents
- LangChain documents are broken up into smaller documents
- Smaller documents are converted into vectors via OpenAI Embeddings
- Embeddings are stored into in-memory ChromaDB
- ChromaDB is converted to a retriever object, which is then fed into LangChain's ConversationalRetrievalChain
- ConversationalRetrievalChain is exposed to the user via LangChain FastAPI playground
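The flow above can be sketched in a few lines of LangChain code. This is a sketch, not the project's actual code: the chunk size, overlap, and model settings are assumptions, and the imports target the classic `langchain` package layout (deferred into the function so the file can be read without the package installed).

```python
def build_chain(html_paths):
    """Build a retrieval QA chain over the given UTF-8 HTML files.

    Sketch only: chunk_size/chunk_overlap and temperature are assumed
    values, not the project's actual settings.
    """
    # langchain imports are deferred so the module can be inspected
    # without the package installed
    from langchain.chains import ConversationalRetrievalChain
    from langchain.chat_models import ChatOpenAI
    from langchain.document_loaders import BSHTMLLoader
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.vectorstores import Chroma

    # Load each UTF-8 HTML file into LangChain documents
    docs = []
    for path in html_paths:
        docs.extend(BSHTMLLoader(path).load())

    # Break the documents into smaller chunks
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(docs)

    # Embed the chunks and store them in an in-memory Chroma vectorstore
    vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())

    # Wrap the retriever and chat model in a conversational QA chain
    return ConversationalRetrievalChain.from_llm(
        llm=ChatOpenAI(temperature=0),
        retriever=vectorstore.as_retriever(),
    )
```

Running the chain requires an `OPENAI_API_KEY`; a question is asked with `chain({"question": "...", "chat_history": []})`.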
- ConversationalRetrievalChain will only answer questions based on the retrieved context, so its power is quite limited. If we want more expressiveness, we need to use a different class or write our own. Still, it has a clear use case: when you want the LLM to answer strictly from the context, this is the right class, since you don't want it hallucinating text that isn't in the documents.
- I had to choose the right chunk size and overlap for the RecursiveCharacterTextSplitter. If chunks are too small, they won't carry enough information for the LLM to understand; if too large, retrieval becomes less precise and more irrelevant text gets fed into the context.
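To make the trade-off concrete, here is a minimal character-window splitter. It is not the real RecursiveCharacterTextSplitter (which also prefers paragraph and sentence boundaries), but it shows how chunk size and overlap interact:

```python
def split_with_overlap(text, chunk_size, overlap):
    """Split text into fixed-size windows that overlap by `overlap` chars.

    Overlap keeps a sentence that straddles a chunk boundary visible in
    both neighboring chunks, so retrieval doesn't lose it.
    """
    step = chunk_size - overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_with_overlap("abcdefghij", chunk_size=4, overlap=2)
# Each chunk shares its last 2 characters with the next chunk's first 2.
```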
I asked it to answer question 1.4.5, and it could not do so because the answer was not directly in the text.
However, it did understand the question, which is great: retrieval found the right document chunk.
Then, I asked it a question directly from the text, about equivalence relations. Here is the passage I wanted it to find.
Success!
It couldn't do it, since the information was not in the context.
- Clone the repository.
- Create a .env file with OPENAI_API_KEY set to your OpenAI API key
- Run `docker-compose up --build`
- Open `http://0.0.0.0:8000/test/playground/` in your browser