-
-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimise the output to reduce token consumption #113
base: master
Are you sure you want to change the base?
Conversation
Implemented a function in esbmc_util that uses regex patterns to remove some output lines to reduce token consmuption and to facilitate the reduction of the likelihood that LLMs like GPT 3.5 turbo reach the token limit for some codebases that need comprehensive explanations. Addded in config.json four new cases that address common bugs in C programs: buffer overflow, arithmetic overflow, array out-of-bounds and memory leaks. Signed-off-by: mihai.state <mihaita.state@yahoo.com>
Created more regex patterns that remove some lines that can be ommitted and provide shorter outputs. Added a few more indications in the config.json file that tell the LLM to avoid mentioning time measurements Signed-off-by: mihai.state <mihaita.state@yahoo.com>
Created new output scenarious in the config.json file for NaN check, struct fields check, deadlock check and data races check Created the reduce_output field in the config.json file to tell the LLM to produce a smaller output Created new regex patterns in the reduce_output2() function Created the remove_patterns_nltk that uses NLP methods to identify patterns in a sequence of tokenized words. The advantage of it are that it provides more granular control over text sentences, it can be more precise in some situations. Created the test_output_reducer file that tests the functionality of the esbmc_output_optimisation(), reduce_output2() and remove_patterns_nltk() functions that covers all the string patterns Instantiated the GPT_4_TURBO_PREVIEW model in the enum of the ai_models.py and used it for various code bases. Signed-off-by: mihai.state <mihaita.state@yahoo.com>
@@ -7,11 +7,11 @@ | |||
import sys | |||
|
|||
# Enables arrow key functionality for input(). Do not remove import. | |||
import readline | |||
#import readline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you comment this out? It says clearly, "Do not remove import."
from . import config | ||
import nltk | ||
from nltk.tokenize import word_tokenize,sent_tokenize | ||
nltk.download('punkt') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This downloads a library to your system without explicit user approval. This is bad design, there should be an option in the config to configure where it downloads this. We don't want to be downloading files to the user's systems without them knowing ;)
esbmc_ai/esbmc_util.py
Outdated
esbmc_output = re.sub(r'.*assignments.*|.*assignment.*', '', esbmc_output, flags=re.MULTILINE) | ||
esbmc_output = re.sub(r'.*iteratively.*', '', esbmc_output, flags=re.MULTILINE) | ||
|
||
esbmc_output = re.sub(r'.*branches.*', '', esbmc_output, flags=re.MULTILINE) #The informations about branches are not necessary relevant to the understanding of bugs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider the following sample:
#include <string.h>
#include <assert.h>
int main() {
char* value = "branches";
assert(strcmp("branches", value) == 0);
return 0;
}
Would produce the following ESBMC output:
ESBMC version 7.5.0 64-bit x86_64 linux
Target: 64-bit little-endian x86_64-unknown-linux with esbmclibc
Parsing sample.c
Converting
Generating GOTO Program
GOTO program creation time: 0.097s
GOTO program processing time: 0.002s
Starting Bounded Model Checking
Unwinding loop 8 iteration 1 file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 2 file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 3 file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 4 file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 5 file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 6 file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 7 file string.c line 96 column 12 function strcmp
Unwinding loop 8 iteration 8 file string.c line 96 column 12 function strcmp
Symex completed in: 0.002s (92 assignments)
Slicing time: 0.000s (removed 90 assignments)
Generated 19 VCC(s), 1 remaining after simplification (2 assignments)
No solver specified; defaulting to Boolector
Encoding remaining VCC(s) using bit-vector/floating-point arithmetic
Encoding to solver time: 0.000s
Solving with solver Boolector 3.2.3
Runtime decision procedure: 0.000s
Building error trace
[Counterexample]
State 1 file sample.c line 6 column 2 function main thread 0
----------------------------------------------------
Violated property:
file sample.c line 6 column 2 function main
assertion strcmp("branches", value) == 0
return_value$_strcmp$1 == 0
VERIFICATION FAILED
Putting that through a regex viewer removes from the Violated property:
section the line assertion strcmp("branches", value) == 0
.
There are many more regex that are too generic to be included. You filter out ESBMC output that is useful.
myenv/bin/Activate.ps1
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this even do? Why was it included? You included it without a commit message explaining why it should be included in the project. Arguably, this could just be included as a generic script on the system.
So this is quite a messy pull request, the pull request has some big issues both on a technical level and an experimental level. Technical Issues
You shouldn't include everything in a commit that you work on. Only things that you would like included in the project. Practical IssueDo you have any figures that show any improvement in speed/accuracy of successful repairs? If the accuracy of repairs improves, we should know by how much. If it becomes worse, we need to know why, also it cannot be accepted. Also, how slow does this become due to all the regex? Also, there's already a merge that happened that can reduce the ESBMC output to a few tokens as it uses only the violated property section only, please look at: #117 Maybe this PR can be added as another mode of that? |
"1.Some regex operations could be done without regex. For example, filtering strings out would be faster without regex. |
The problem I have been trying to explain is that REGEX patterns used carelessly, like in this case where you are nuking lines with the word “branches” is not good design. I showed you an example where the useful output was altered. Also, REGEX is extremely slow. Direct string manipulation is more rigid but also miles faster. So I don't know what you mean by: "The regex pattern may prove to be more efficient", efficient in terms of what? |
Removed the 'flags=re.MULTILINE' argument for the regex patterns that work properly without it Used the re.compile method for some patterns to reduce time complexity Signed-off-by: mihai.state <mihaita.state@yahoo.com>
Created two functions in the esbmc_util file that use regex patterns to reduce the generated output of ESBMC-AI and another one that removes patterns by using the nltk library.
Added tests for the patterns that use specific ESBMC-AI phrases as input strings.
Changed the content of a system instruction in the config.json file to tell the LLM to avoid providing some informations that are not useful for debbuging and development puroposes, such as time measurements, assertions, assumptions or warnings. This approach leads to a more consistent and efficient output reduction alongside the three functions that use specific patterns.
Added new scenarious in the config.json file for some bugs.
The implementation has been tested on various SV-Benchmarks multiple times and they lead to consistent token consumption reductions.