---
title: "ARMADA: Attribute-Based Multimodal Data Augmentation"
id: "2408.10086v1"
description: "TL;DR: ARMADA augments image-text pairs using knowledge-guided attribute manipulation, improving multimodal language models."
author: Xiaomeng Jin, Jeonghwan Kim, Yu Zhou, Kuan-Hao Huang, Te-Lin Wu, Nanyun Peng, Heng Ji
date: "2024-08-19"
image: "https://browse.arxiv.org/html/2408.10086v1/x1.png"
categories: ['production']
format:
  html:
    code-overflow: wrap
---

![](https://browse.arxiv.org/html/2408.10086v1/x1.png)

### Summary:

- The paper introduces ARMADA, a novel attribute-based multimodal data augmentation framework that extracts entities and visual attributes, then modifies the visual attributes of entities in images by building an entity-attribute multimodal knowledge base (KB).
- By generating semantically consistent, knowledge-grounded image-text pairs, ARMADA addresses the limitations of existing multimodal data augmentation methods.
- Empirical results demonstrate that the proposed augmentation strategy leads to substantial gains on image-text downstream tasks such as image-text retrieval, VQA, and image captioning, and especially on fine-grained image classification tasks that rely on attribute-centric information.

### Major Findings:

1. ARMADA is a novel multimodal data generation framework that extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation.
2. ARMADA generates visually similar images of disparate categories using neighboring entities in the KB hierarchy.
3. ARMADA uses the commonsense knowledge of LLMs to modulate auxiliary visual attributes such as backgrounds for a more robust representation of original entities.
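
The KB-driven attribute swap behind these findings can be pictured with a toy entity-attribute KB (a hypothetical sketch; the paper's actual KB is far richer and pairs each caption edit with a corresponding image edit):

```python
# Toy sketch of ARMADA-style attribute-based augmentation. The KB maps
# each entity to attribute values plus its parent in the hierarchy;
# sibling entities supply semantically valid replacement values.
KB = {
    "golden retriever": {"coat color": "golden", "parent": "dog"},
    "black labrador":   {"coat color": "black",  "parent": "dog"},
}

def augment_caption(caption: str, entity: str, attribute: str) -> list[str]:
    """Swap an entity's attribute value with values from sibling entities."""
    parent = KB[entity]["parent"]
    siblings = [e for e, a in KB.items() if a["parent"] == parent and e != entity]
    out = []
    for sib in siblings:
        new_value = KB[sib][attribute]
        # Replace the attribute mention in the caption; in the paper the
        # paired image is edited correspondingly by an image-editing model.
        out.append(caption.replace(KB[entity][attribute], new_value))
    return out

print(augment_caption("a golden dog in the park", "golden retriever", "coat color"))
# ['a black dog in the park']
```

Because replacements come from neighboring KB entities rather than free-form generation, the augmented pairs stay knowledge-grounded.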

### Analysis and Critique:

- The paper does not provide a detailed analysis of the limitations and potential biases of the proposed method.
- It does not discuss methodological issues or conflicting evidence that may affect the validity of the results.
- It lacks a clear comparison with existing multimodal data augmentation methods, making it difficult to evaluate the effectiveness of ARMADA.
- The potential applications and implications of the proposed method are not clearly discussed.
- Ethical considerations and societal impact are likewise left unaddressed.

## Appendix

| Field | Value |
|----------|----------|
| Model | accounts/fireworks/models/mixtral-8x22b-instruct |
| Date Generated | 2024-08-20 |
| Abstract | [https://arxiv.org/abs/2408.10086v1](https://arxiv.org/abs/2408.10086v1) |
| HTML | [https://browse.arxiv.org/html/2408.10086v1](https://browse.arxiv.org/html/2408.10086v1) |
| Truncated | False |
| Word Count | 7291 |
---
title: "A Comparison of Large Language Model and Human Performance on Random Number Generation Tasks"
id: "2408.09656v1"
description: "LLMs like ChatGPT-3.5 generate random sequences more effectively than humans, with fewer repetitive and sequential patterns."
author: Rachel M. Harrison
date: "2024-08-19"
image: "https://browse.arxiv.org/html/2408.09656v1/extracted/5799024/figures/patterns_frequency.png"
categories: ['prompt-engineering', 'social-sciences', 'robustness', 'hci']
format:
  html:
    code-overflow: wrap
---

![](https://browse.arxiv.org/html/2408.09656v1/extracted/5799024/figures/patterns_frequency.png)

### Summary:

- The study compares the performance of ChatGPT-3.5, a large language model (LLM), with human performance on Random Number Generation Tasks (RNGTs).
- ChatGPT-3.5 avoids repetitive and sequential patterns more effectively than humans, with notably lower repeat frequencies and adjacent number frequencies.
- The research aims to deepen our understanding of how LLMs can more closely mimic human random generation behaviors and to broaden their applications in cognitive and behavioral science research.

### Major Findings:

1. **Testing for human-like cognitive biases**: The study examines whether ChatGPT-3.5, trained on human-generated text, exhibits human-like cognitive biases when generating random number sequences.
2. **ChatGPT-3.5 more effectively avoids repetitive and sequential patterns**: Initial findings indicate that ChatGPT-3.5 avoids repetitive and sequential patterns more effectively than humans, with notably lower repeat frequencies and adjacent number frequencies.
3. **Potential for broader applications**: Continued research into different models, parameters, and prompting methodologies will deepen our understanding of how LLMs can mimic human random generation behaviors, while also broadening their applications in cognitive and behavioral science research.
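
The two pattern metrics named above can be computed directly from a generated sequence (a minimal sketch; the metric names follow this summary's wording, not necessarily the paper's exact definitions):

```python
def repeat_frequency(seq: list[int]) -> float:
    """Fraction of adjacent pairs in which the same number is repeated."""
    pairs = list(zip(seq, seq[1:]))
    return sum(a == b for a, b in pairs) / len(pairs)

def adjacent_number_frequency(seq: list[int]) -> float:
    """Fraction of adjacent pairs that are sequential (differ by exactly 1)."""
    pairs = list(zip(seq, seq[1:]))
    return sum(abs(a - b) == 1 for a, b in pairs) / len(pairs)

seq = [3, 3, 4, 9, 1, 2]
print(repeat_frequency(seq))           # (3,3) is 1 repeat out of 5 pairs -> 0.2
print(adjacent_number_frequency(seq))  # (3,4) and (1,2) -> 0.4
```

Lower values on both metrics indicate fewer of the stereotyped patterns that humans tend to produce.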

### Analysis and Critique:

- The study's focus on comparing LLMs with human performance on RNGTs is a novel approach to understanding the capabilities and limitations of LLMs.
- The use of ChatGPT-3.5, a widely recognized and advanced model, provides a strong foundation for the study.
- The findings suggest that LLMs can mimic certain aspects of human cognitive biases, which could have significant implications for cognitive and behavioral science research.
- However, the study's reliance on a single model (ChatGPT-3.5) may limit the generalizability of its findings; future research could benefit from comparing the performance of multiple LLMs.
- Additionally, the study does not explore the impact of different prompting strategies or model parameters on LLM performance in RNGTs, which could be a fruitful area for future research.

## Appendix

| Field | Value |
|----------|----------|
| Model | accounts/fireworks/models/mixtral-8x22b-instruct |
| Date Generated | 2024-08-20 |
| Abstract | [https://arxiv.org/abs/2408.09656v1](https://arxiv.org/abs/2408.09656v1) |
| HTML | [https://browse.arxiv.org/html/2408.09656v1](https://browse.arxiv.org/html/2408.09656v1) |
| Truncated | False |
| Word Count | 4221 |
---
title: "A System for Automated Unit Test Generation Using Large Language Models and Assessment of Generated Test Suites"
id: "2408.07846v1"
description: "LLMs can automate unit test generation, and AgoneTest offers a scalable solution for Java projects, complete with a new dataset and evaluation methodology."
author: Andrea Lops, Fedelucio Narducci, Azzurra Ragone, Michelantonio Trizio, Claudio Bartolini
date: "2024-08-14"
image: "https://browse.arxiv.org/html/2408.07846v1/x1.png"
categories: ['robustness', 'programming']
format:
  html:
    code-overflow: wrap
---

![](https://browse.arxiv.org/html/2408.07846v1/x1.png)

### Summary:

The paper introduces AgoneTest, an automated system designed to generate test suites for Java projects and evaluate their quality. The system focuses on class-level test code generation and automates the entire process from test generation to test assessment. AgoneTest leverages the Methods2Test dataset and integrates libraries such as JaCoCo, PITest, and TsDetect to compute metrics for test evaluation. The main contributions of the work include the AgoneTest system, a methodology for evaluating LLMs and prompting techniques, and a new dataset called Classes2Test.

### Major Findings:

1. AgoneTest is a closed-loop, highly automated software system that supports the generation and assessment of unit tests for real-life open-source Java projects.
2. The system provides a comprehensive evaluation of LLMs and prompting techniques in the task of developing unit tests, along with a set of metrics and test smells to assess the quality of the generated test suites.
3. Classes2Test is an annotated open-source Java project dataset that extends Methods2Test, allowing for the assessment of test performance of an LLM on the entire class rather than on a single method.
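
The closed-loop generate-compile-run-measure cycle can be pictured as a simple pipeline (a hypothetical sketch with made-up function names; the real system drives Java tooling such as JaCoCo, PITest, and TsDetect):

```python
from dataclasses import dataclass, field

@dataclass
class TestSuiteReport:
    compiled: bool
    passed: int = 0
    failed: int = 0
    metrics: dict = field(default_factory=dict)  # e.g. coverage, mutation score

def evaluate_generated_suites(suites, compile_fn, run_fn, metric_fns):
    """Compile each LLM-generated suite, run it, and collect quality metrics."""
    reports = []
    for suite in suites:
        if not compile_fn(suite):  # many generated suites fail at this step
            reports.append(TestSuiteReport(compiled=False))
            continue
        passed, failed = run_fn(suite)
        metrics = {name: fn(suite) for name, fn in metric_fns.items()}
        reports.append(TestSuiteReport(True, passed, failed, metrics))
    return reports
```

Separating compilation, execution, and metric collection makes the high compile-failure rate noted below directly measurable.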

### Analysis and Critique:

The paper presents a promising approach to automating the generation and evaluation of unit test suites using LLMs. However, there are some potential limitations and areas for improvement:

1. The scope of the evaluation is limited to Java projects, which may not generalize well to other programming languages.
2. The evaluation only considers two LLMs and two prompt types, which may not fully capture the capabilities of other models and prompting techniques.
3. The temperature parameter is set to 0, which may limit the creativity and diversity of the generated test cases.
4. A significant number of generated test classes fail to compile or execute, highlighting the need for improved LLM performance in generating syntactically and semantically correct test code.
5. The evaluation metrics used may not fully capture the quality of the test suite, and additional metrics or approaches may be needed to provide a more comprehensive assessment.

Future work should focus on addressing these limitations and further refining the AgoneTest system to improve its performance and applicability to a wider range of projects and programming languages.

## Appendix

| Field | Value |
|----------|----------|
| Model | accounts/fireworks/models/mixtral-8x22b-instruct |
| Date Generated | 2024-08-20 |
| Abstract | [https://arxiv.org/abs/2408.07846v1](https://arxiv.org/abs/2408.07846v1) |
| HTML | [https://browse.arxiv.org/html/2408.07846v1](https://browse.arxiv.org/html/2408.07846v1) |
| Truncated | False |
| Word Count | 9465 |
---
title: "A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition"
id: "2408.09491v1"
description: "New method reduces errors, eliminates repetition in audio-LLM speech recognition, improving performance in noisy environments."
author: Yangze Li, Xiong Wang, Songjun Cao, Yike Zhang, Long Ma, Lei Xie
date: "2024-08-18"
image: "https://browse.arxiv.org/html/2408.09491v1/x1.png"
categories: ['social-sciences', 'robustness']
format:
  html:
    code-overflow: wrap
---

![](https://browse.arxiv.org/html/2408.09491v1/x1.png)

### Summary:

The paper proposes a transcription prompt-based audio-LLM to address substitution errors and decoding repetition in speech recognition tasks. The approach introduces an ASR expert as a transcription tokenizer and a hybrid autoregressive/non-autoregressive (AR-NAR) decoding method. Experiments on the 10k-hour WenetSpeech Mandarin corpus show relative character error rate (CER) reductions of 12.2% and 9.6% on the Test_Net and Test_Meeting evaluation sets, respectively, compared to the baseline. Notably, the decoding repetition rate is reduced to zero, indicating that the repetition problem has been fundamentally solved.

### Major Findings:

1. The proposed transcription prompt-based audio-LLM effectively addresses substitution errors and decoding repetition in speech recognition tasks.
2. The hybrid AR-NAR decoding approach fundamentally solves the decoding repetition problem and achieves a lower ASR decoding real-time factor (RTF).
3. The proposed method significantly improves speech recognition performance, with relative CER decreases of 12.2% and 9.6% on the Test_Net and Test_Meeting evaluation sets, respectively.
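
CER, the headline metric here, is character-level edit distance divided by reference length; a minimal sketch of its standard computation:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # Single-row Levenshtein dynamic program over characters.
    dp = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i           # prev holds the diagonal cell
        for j in range(1, len(h) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (r[i - 1] != h[j - 1]))   # substitution
            prev = cur
    return dp[len(h)] / len(r)

print(cer("hello", "hallo"))  # 1 substitution / 5 chars -> 0.2
```

A "relative" CER reduction of 12.2% means the new CER is 87.8% of the baseline's, not 12.2 absolute points lower.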

### Analysis and Critique:

The paper presents a promising approach to improving speech recognition performance in noisy environments. The use of an ASR expert as a transcription tokenizer and a hybrid AR-NAR decoding method effectively addresses substitution errors and decoding repetition. However, the paper does not discuss the potential limitations or biases of the proposed method, and its performance on other languages and datasets is not evaluated, which could limit generalizability. Further research is needed on both fronts.

## Appendix

| Field | Value |
|----------|----------|
| Model | accounts/fireworks/models/mixtral-8x22b-instruct |
| Date Generated | 2024-08-20 |
| Abstract | [https://arxiv.org/abs/2408.09491v1](https://arxiv.org/abs/2408.09491v1) |
| HTML | [https://browse.arxiv.org/html/2408.09491v1](https://browse.arxiv.org/html/2408.09491v1) |
| Truncated | False |
| Word Count | 3902 |
---
title: "Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents"
id: "2408.07199v1"
description: "LLMs struggle with multi-step reasoning in interactive environments. Our framework, combining MCTS search, self-critique, and iterative fine-tuning, improves LLM agents' performance in complex tasks, outperforming baselines and human performance in a simulated e-commerce platform."
author: Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, Rafael Rafailov
date: "2024-08-13"
image: "https://browse.arxiv.org/html/2408.07199v1/extracted/5790031/images/AgentTree2.png"
categories: ['prompt-engineering', 'hci']
format:
  html:
    code-overflow: wrap
---

![](https://browse.arxiv.org/html/2408.07199v1/extracted/5790031/images/AgentTree2.png)

### Summary:

The paper introduces Agent Q, a novel approach that combines several key concepts in reasoning, search, self-critique, and reinforcement learning to improve the planning and reasoning capabilities of a web agent. The method utilizes Monte Carlo Tree Search (MCTS) to guide trajectory collection and iteratively improve model performance using direct preference optimization (DPO). The proposed approach is evaluated in the WebShop environment and on a real-world reservations booking website, demonstrating significant improvements in the model's zero-shot performance and outperforming GPT-4's performance after a single day of autonomous data collection.

### Major Findings:

1. The Agent Q framework improves the model's zero-shot absolute success rate from 18.6% to 81.7% (a 340% relative increase) in real-world booking experiments, outperforming GPT-4's performance after a single day of autonomous data collection.
2. When equipped with online search capability, Agent Q's absolute success rate further improves to 95.4%.
3. The approach represents a significant step forward in the development of autonomous web agents through its search and self-critique capabilities, setting a new benchmark for reliable multi-step decision-making in interactive settings.
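
The MCTS component that guides trajectory collection relies on a tree policy for picking which web action to expand; a minimal UCT-style selection step looks like this (an illustrative sketch, not the paper's exact formulation, and the action names are invented):

```python
import math

def uct_select(children, exploration_c: float = 1.0):
    """Pick the child action maximizing UCT: mean value + exploration bonus.

    `children` maps action -> (visit_count, total_value); the parent's
    visit count is taken as the sum of child visits.
    """
    parent_visits = sum(n for n, _ in children.values())
    def score(item):
        n, total = item[1]
        if n == 0:
            return float("inf")  # always try unvisited actions first
        return total / n + exploration_c * math.sqrt(math.log(parent_visits) / n)
    return max(children.items(), key=score)[0]

children = {"click_search": (10, 7.0), "open_cart": (2, 1.8), "go_back": (0, 0.0)}
print(uct_select(children))  # go_back (unvisited, so explored first)
```

The trajectories collected this way then feed DPO, which prefers higher-value branches over lower-value siblings during fine-tuning.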

### Analysis and Critique:

1. The paper does not discuss the potential limitations of the proposed approach, such as the computational resources required for MCTS and DPO, or the scalability of the method to more complex environments.
2. The evaluation of Agent Q is limited to the WebShop environment and a real-world reservations booking website, and it is unclear how the approach would perform in other domains or tasks.
3. The paper does not provide a detailed comparison with other state-of-the-art methods for improving the planning and reasoning capabilities of web agents, making it difficult to assess the relative performance of Agent Q.
4. The paper does not discuss the potential ethical implications of deploying autonomous web agents, such as the risk of bias or the impact on human employment.
5. The paper does not provide a clear roadmap for future research, making it difficult to identify potential directions for improving the proposed approach.

## Appendix

| Field | Value |
|----------|----------|
| Model | accounts/fireworks/models/mixtral-8x22b-instruct |
| Date Generated | 2024-08-20 |
| Abstract | [https://arxiv.org/abs/2408.07199v1](https://arxiv.org/abs/2408.07199v1) |
| HTML | [https://browse.arxiv.org/html/2408.07199v1](https://browse.arxiv.org/html/2408.07199v1) |
| Truncated | False |
| Word Count | 9890 |
---
title: "Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning"
id: "2408.09600v1"
description: "Antidote removes harmful parameters post-fine-tuning, reducing harmful content without compromising performance, regardless of training hyper-parameters."
author: Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Josh Kimball, Ling Liu
date: "2024-08-18"
image: "https://browse.arxiv.org/html/2408.09600v1/x1.png"
categories: ['security']
format:
  html:
    code-overflow: wrap
---

![](https://browse.arxiv.org/html/2408.09600v1/x1.png)

### Summary:

The paper proposes a post-fine-tuning stage solution called Antidote to address the issue of harmful fine-tuning in large language models (LLMs). The authors evaluate existing solutions and find that they are highly sensitive to training hyper-parameters in the fine-tuning stage, which they call the hyper-parameter sensitive issue. Antidote aims to realign the model after the fine-tuning stage has been completed, remaining agnostic to the training details of that stage. The method relies on the philosophy that by removing harmful parameters, the model can be recovered from harmful behaviors, regardless of how those harmful parameters were formed during fine-tuning. Empirical results show that Antidote reduces harmful scores while maintaining accuracy on downstream tasks.

### Major Findings:

1. Existing solutions for harmful fine-tuning are highly sensitive to the training hyper-parameters of the fine-tuning stage, an issue the authors name the hyper-parameter sensitive issue.
2. Antidote, a post-fine-tuning realignment solution, remains agnostic to the training details of the fine-tuning stage, sidestepping the hyper-parameter sensitive issue.
3. Comprehensive experiments on four downstream tasks and different attack settings verify the effectiveness of the proposed method.
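
The "remove harmful parameters" philosophy can be sketched as a one-shot pruning step: score each weight's contribution to harmful behavior and zero out the top-scoring fraction (a hypothetical NumPy illustration; the paper operates on LLM weights with its own importance scoring):

```python
import numpy as np

def antidote_prune(weights: np.ndarray, harmful_scores: np.ndarray,
                   prune_ratio: float = 0.01) -> np.ndarray:
    """Zero out the weights with the highest harmful-importance scores."""
    k = max(1, int(prune_ratio * weights.size))
    # Indices of the k weights deemed most responsible for harmful behavior.
    top_k = np.argsort(harmful_scores.ravel())[-k:]
    pruned = weights.copy().ravel()
    pruned[top_k] = 0.0
    return pruned.reshape(weights.shape)

w = np.array([0.5, -1.2, 0.3, 2.0])
scores = np.array([0.1, 0.9, 0.2, 0.05])  # weight at index 1 looks most harmful
print(antidote_prune(w, scores, prune_ratio=0.25))  # index 1 zeroed, rest intact
```

Because the step runs after fine-tuning finishes and touches only a small fraction of weights, it is independent of the fine-tuning hyper-parameters by construction.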

### Analysis and Critique:

The paper presents a novel approach to addressing the issue of harmful fine-tuning in LLMs. The authors provide a thorough evaluation of existing solutions and identify their limitations, which is a significant contribution to the field. The proposed method, Antidote, offers a promising solution to the hyper-parameter sensitive issue, which has not been systematically studied before.

However, the paper does not discuss potential limitations or unanswered questions regarding the proposed method. For instance, it is unclear how Antidote would perform in scenarios where the harmful data is not easily identifiable or when the model has already been significantly compromised. Additionally, the paper does not provide a comparison of Antidote with other post-fine-tuning stage defenses, such as those mentioned in the related work section.

In conclusion, the paper presents a valuable contribution to the field of LLM safety alignment, but further research is needed to explore the limitations and potential improvements of the proposed method.

## Appendix

| Field | Value |
|----------|----------|
| Model | accounts/fireworks/models/mixtral-8x22b-instruct |
| Date Generated | 2024-08-20 |
| Abstract | [https://arxiv.org/abs/2408.09600v1](https://arxiv.org/abs/2408.09600v1) |
| HTML | [https://browse.arxiv.org/html/2408.09600v1](https://browse.arxiv.org/html/2408.09600v1) |
| Truncated | False |
| Word Count | 8169 |