Finetuning LLMs for ReAct. Unleashing the power of finetuning to… | by Pranav Jadhav | Feb, 2024 | Towards AI #657
Related issues

#333: Paper Digest: NeurIPS-2023 Highlights (Full List)

### Details

Similarity score: 0.88

- [ ] [Paper Digest: NeurIPS-2023 Highlights (Full List)](https://www.paperdigest.org/data/neurips-2023-full.html)

Paper Digest: NeurIPS 2023 Highlights
1. Toolformer: Language Models Can Teach Themselves to Use Tools
2. Self-Refine: Iterative Refinement with Self-Feedback
3. Vicuna Evaluation: Exploring LLM-as-a-Judge and Chatbot Arena

Suggested labels: `{ "key": "LLM-Applications", "value": "Topics related to practical applications of Large Language Models in various fields" }`

#546: [2304.11015] DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction

### Details

Similarity score: 0.87

- [ ] [[2304.11015] DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction](https://arxiv.org/abs/2304.11015)

DESCRIPTION: DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction (Mohammadreza Pourreza, Davood Rafiei). There is currently a significant gap between the performance of fine-tuned models and prompting approaches using Large Language Models (LLMs) on the challenging task of text-to-SQL, as evaluated on datasets such as Spider. To improve the performance of LLMs in the reasoning process, we study how decomposing the task into smaller sub-tasks can be effective. In particular, we show that breaking down the generation problem into sub-problems and feeding the solutions of those sub-problems into LLMs can be an effective approach for significantly improving their performance. Our experiments with three LLMs show that this approach consistently improves their simple few-shot performance by roughly 10%, pushing the accuracy of LLMs towards SOTA or surpassing it. On the holdout test set of Spider, the SOTA in terms of execution accuracy was 79.9, and the new SOTA at the time of this writing using our approach is 85.3. Our approach with in-context learning beats many heavily fine-tuned models by at least 5%. Additionally, when evaluated on the BIRD benchmark, our approach achieved an execution accuracy of 55.9%, setting a new SOTA on its holdout test set.

URL: https://arxiv.org/abs/2304.11015

Suggested labels: `{'label-name': 'Text-to-SQL', 'label-description': 'Focuses on generating SQL queries from natural language text.', 'confidence': 76.74}`

#332: streaming-llm: Efficient Streaming Language Models with Attention Sinks

### Details

Similarity score: 0.85

> **Note: Efficient Streaming Language Models with Attention Sinks**
>
> [mit-han-lab/streaming-llm: Efficient Streaming Language Models with Attention Sinks](https://github.com/mit-han-lab/streaming-llm)
>
> **TL;DR:** We deploy LLMs for infinite-length inputs without sacrificing efficiency and performance.
>
> **News**
>
> - [2023/10] StreamingLLM is integrated into Intel Extension for Transformers.
> - [2023/10] Check out Attention Sinks, a third-party implementation to enable StreamingLLM on more Hugging Face LLMs.
>
> **Abstract**
>
> Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach, but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink: keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of the attention sink is due to the strong attention scores towards initial tokens acting as a "sink", even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite-length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding-window recomputation baseline by up to a 22.2x speedup.
>
> **Usage**
>
> Environment setup:
>
> ```
> conda create -yn streaming python=3.8
> conda activate streaming
>
> pip install torch torchvision torchaudio
> pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece
>
> python setup.py develop
> ```
>
> Run the streaming Llama chatbot:
>
> ```
> CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py --enable_streaming
> ```
>
> **FAQ**
>
> **What does "working on infinite-length inputs" imply for LLMs?**
> Handling infinite-length text with LLMs presents challenges. Notably, storing all previous Key and Value (KV) states demands significant memory, and models might struggle to generate text beyond their training sequence length. StreamingLLM addresses this by retaining only the most recent tokens and the attention sinks, discarding intermediate tokens. This enables the model to generate coherent text from recent tokens without a cache reset, a capability not seen in earlier methods.
>
> **Is the context window of LLMs expanded?**
> No. The context window remains unchanged: only the most recent tokens and the attention sinks are retained, and middle tokens are discarded. This means the model can only process the latest tokens, and the context window remains constrained by its initial pre-training. For instance, if Llama-2 is pre-trained with a context window of 4096 tokens, then the maximum cache size for StreamingLLM on Llama-2 remains 4096.
>
> **Can I input an extensive text, like a book, into StreamingLLM for summarization?**
> While you can input a lengthy text, the model will only recognize the latest tokens.

#314: Prompt Engineering Guide

### Details

Similarity score: 0.85

- [ ] [Prompt Engineering Guide | Prompt Engineering Guide](https://www.promptingguide.ai/)

#153: Mastering LLM Techniques: Inference Optimization | NVIDIA Technical Blog

### Details

Similarity score: 0.85

- [ ] [Mastering LLM Techniques: Inference Optimization | NVIDIA Technical Blog](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/)

This post discusses the most pressing challenges in LLM inference, along with some practical solutions. Readers should have a basic understanding of transformer architecture and the attention mechanism in general; the post itself covers the intricacies of LLM inference in its subsequent sections.

#130: Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models | Papers With Code

### Details

Similarity score: 0.85

- [ ] [Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models | Papers With Code](https://paperswithcode.com/paper/language-agent-tree-search-unifies-reasoning)
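As a side note on the StreamingLLM digest above: the cache policy its FAQ describes (keep the initial attention-sink tokens plus the most recent window, drop everything in between) reduces to a simple eviction rule. Here is a minimal sketch, where `kv_cache` stands in for a list of per-token KV entries and the `n_sink`/`window` sizes are illustrative defaults, not the repo's actual configuration:

```python
def evict(kv_cache: list, n_sink: int = 4, window: int = 2044) -> list:
    """StreamingLLM-style eviction sketch: keep the first n_sink
    entries (the attention sinks) and the most recent `window`
    entries, discarding the middle so the cache never exceeds
    n_sink + window entries."""
    if len(kv_cache) <= n_sink + window:
        return kv_cache  # still within budget, nothing to drop
    return kv_cache[:n_sink] + kv_cache[-window:]
```

This also matches the FAQ's point that the context window is not expanded: for a Llama-2 model pre-trained on 4096 tokens, `n_sink + window` can never usefully exceed 4096.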
Finetuning LLMs for ReAct
Description:
Finetuning LLMs for ReAct: Unleashing the power of finetuning to improve multi-hop question-answering ability in LLMs.
Author: Pranav Jadhav
Published in: Towards AI
Reading Time: 14 min read
Published: 6 days ago
Views: 71
In this article, I will share my findings from benchmarking and finetuning open-source language models for ReAct (Reasoning + Acting). I demonstrate that finetuning can dramatically improve the accuracy of LLMs in answering multi-hop questions using ReAct. I also present a new dataset that can be used to finetune models for the ReAct format presented in the original paper (Yao et al., 2022). My findings indicate that, through finetuning, open-source LLMs show promise for making agents that can effectively reason and use tools.
Language Models Reasoning?
Since ChatGPT started the language model gold rush, we’ve been consistently surprised by the abilities of these neural networks to imitate our speech and writing. However, a key component of intelligence that separated these models from us was reasoning. The reasoning barrier first faltered when chain-of-thought (CoT) prompting was introduced by Wei et al. in 2022. They found that simply prompting the language model to “think step by step” and output intermediate reasoning steps improved accuracy on question-answering tasks. The reasoning ability of LLMs didn’t end there, however. Another development was chain-of-thought with self-consistency (CoT-SC), in which multiple reasoning traces are generated and the majority answer is returned as the final answer (Wang et al., 2022). Then, in late 2022, a team of researchers from Princeton University and Google Research published a paper called ReAct: Synergizing Reasoning and Acting in Language Models. In this paper, the team introduced a method of prompting LLMs to output a sequence of thought, action, and observation steps to reach a final answer.
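To make CoT-SC concrete, here is a minimal sketch of the majority-vote procedure. The `generate` stub and the "Answer:" output format are illustrative assumptions, not details from the article:

```python
from collections import Counter

def generate(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical LLM call that samples one chain-of-thought
    completion ending in a line like 'Answer: <answer>'."""
    raise NotImplementedError  # wire up your model or API here

def extract_answer(completion: str) -> str:
    # Take whatever follows the last 'Answer:' marker.
    return completion.rsplit("Answer:", 1)[-1].strip()

def cot_self_consistency(question: str, n_samples: int = 5) -> str:
    prompt = f"{question}\nLet's think step by step."
    # Sample several independent reasoning traces at nonzero temperature...
    answers = [extract_answer(generate(prompt)) for _ in range(n_samples)]
    # ...and return the majority answer (Wang et al., 2022).
    return Counter(answers).most_common(1)[0][0]
```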
What is ReAct?
Simply put, ReAct is a prompting strategy that forces an LLM to “reason” about what it is doing and to interact with tools using actions. I will give a basic explanation here, but for a deep dive, I recommend looking at the blog post or the paper.
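For readers who want something concrete before diving into those sources, here is a minimal sketch of the thought/action/observation loop. The `generate` stub (here with a stop-sequence argument), the `Action: Tool[input]` parsing, and the `tools` dict are assumptions for illustration; the paper's actual prompts and tool set differ in detail:

```python
import re

def generate(prompt: str, stop: list[str] | None = None) -> str:
    """Hypothetical LLM call that returns a completion, truncated
    before the first occurrence of any stop string."""
    raise NotImplementedError  # wire up your model or API here

def react_agent(question: str, tools: dict, max_steps: int = 8) -> str | None:
    """ReAct loop sketch: alternate Thought/Action steps, execute each
    action with a tool, and feed the result back in as an Observation."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = generate(transcript + "Thought:", stop=["Observation:"])
        transcript += "Thought:" + step + "\n"
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if match is None:
            continue  # no action emitted; let the model keep thinking
        name, arg = match.groups()
        if name == "Finish":
            return arg  # Finish[answer] carries the final answer
        tool = tools.get(name)
        observation = tool(arg) if tool else f"Unknown tool: {name}"
        transcript += f"Observation: {observation}\n"
    return None  # no answer within the step budget
```

For the multi-hop questions the article targets, `tools` might map names like "Search" and "Lookup" to a Wikipedia search function and a passage-lookup function, in the spirit of the original paper's setup.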
Read More
Suggested labels
{'label-name': 'ReAct-Prompting', 'label-description': 'Describes the method of prompting LLMs to output a sequence of thought, action, and observation steps to reach a final answer', 'gh-repo': 'https://pub.towardsai.net/finetuning-llms-for-react-9ab291d84ddc', 'confidence': 63.39}