Add support for running the test cases in subprocesses #74

Open
BergLucas opened this issue Jun 20, 2024 · 17 comments

@BergLucas
Contributor

Is your feature request related to a problem? Please describe.

Over the last four months, I've had the opportunity to write a master's thesis about improving automated test case generation for machine learning libraries. During this project, I discovered several limitations and bugs in Pynguin that I've already reported. However, some bugs could not be easily fixed: segmentation faults, memory leaks, floating point exceptions, and Python GIL deadlocks, which do not come from Pynguin but rather from the module under test. Unfortunately, with Pynguin's current architecture, which executes test cases in threads rather than subprocesses, these types of bugs cause the main process, and therefore Pynguin itself, to crash. I have observed these kinds of crashes on very popular libraries such as numpy, pandas, polars, scipy and sklearn, and they could also happen in other modules, as I've only focused on these few.

Describe the solution you'd like

To solve the problem, I propose changing some aspects of Pynguin's architecture so that test cases can be executed in subprocesses, controlled by a Pynguin parameter. I've already built a working prototype here, but because of the amount of data transferred between the main process and the subprocesses, execution in a subprocess is up to 40x slower than execution in a thread. I therefore think it would first be necessary to rethink the changes I've made in my prototype and increase speed by limiting the data transfer between the main process and the subprocesses.

Describe alternatives you've considered

To the best of my knowledge, the only way to detect segmentation faults, memory leaks, etc., is to use subprocesses, so I don't see any other alternative for dealing with these crashes.

Additional context

With this new architecture, it would also be possible to create error-revealing test cases, as Randoop does. Indeed, by checking the exit code of a subprocess, a crash can be detected and a test case created to reproduce it. This has already been implemented in my prototype and has already helped me find a few bugs in some libraries.
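
(For illustration only, not code from the prototype: a minimal sketch of how a crash such as a segmentation fault can be detected from a subprocess exit code.)

```python
import subprocess
import sys

# Hypothetical test case body that crashes the interpreter with a
# segmentation fault (reading memory at address 0 via ctypes).
test_case_code = "import ctypes; ctypes.string_at(0)"

# Run the test case in a fresh interpreter; a negative return code means
# the child process was killed by a signal (e.g. -11 for SIGSEGV on Linux).
result = subprocess.run([sys.executable, "-c", test_case_code])
if result.returncode < 0:
    print(f"crash detected (signal {-result.returncode}): keep as error-revealing test case")
else:
    print("no crash detected")
```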

@nickodell

nickodell commented Jun 20, 2024

> I've already built a working prototype here, but because of the amount of data transferred between the main process and the subprocesses, execution in a subprocess is up to 40x slower than execution in a thread. I therefore think it would first be necessary to rethink the changes I've made in my prototype and increase speed by limiting the data transfer between the main process and the subprocesses.

I see you're using the spawn start method. I wonder if you could improve performance here by using a forkserver start method, and using set_forkserver_preload() to preload the module under test.

I know that in SciPy, it can take quite a while to import some modules. For example, on my computer it can take 0.4 seconds to run import scipy.signal and nothing else. It seems like it could be possible to import the module once, and re-use that work, since the computation happening during import seems unlikely to be interesting from a testing perspective.

More info: https://bnikolic.co.uk/blog/python/parallelism/2019/11/13/python-forkserver-preload.html
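
A minimal sketch of the suggestion (assuming scipy is installed; the module name and worker body are placeholders, not the prototype's actual code):

```python
import multiprocessing as mp

def run_test_case(test_case_id: int) -> str:
    # The module under test is already imported in the forkserver process,
    # so every forked worker inherits that warm state instead of paying the
    # import cost on each execution.
    import scipy.signal  # cheap: already in sys.modules after preloading
    return f"executed test case {test_case_id}"

if __name__ == "__main__":
    ctx = mp.get_context("forkserver")
    # Import the expensive module once in the forkserver main process.
    ctx.set_forkserver_preload(["scipy.signal"])
    with ctx.Pool(processes=2) as pool:
        print(pool.map(run_test_case, range(4)))
```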

@stephanlukasczyk
Member

Wow, first of all, this is impressive!

I agree with all of your points; the way the current execution is built is probably not the best it could be. I am very much willing to integrate this into Pynguin: first, because I believe it could overcome the limitations and make Pynguin more flexible, and second, because the track record (five found bugs) is already quite nice—I am pretty sure there is more to follow, and I need to maintain a list of found bugs at some point.

Regarding the slow-down: do you have some average numbers? 40x is a massive slowdown, I agree, but if this is only a rare worst case the picture would probably look different. Also, could you perhaps try @nickodell 's suggestion (thanks @nickodell for suggesting the forkserver) to see whether it brings an improvement?

Random additional thought: even if execution in a subprocess is slower, do you see any potential to parallelise these executions? Currently, Pynguin uses threads to isolate executions but only executes the test cases sequentially. If subprocesses made it easy to parallelise test-case executions, the overhead might not be that critical any more.
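
(A rough illustration of that idea with concurrent.futures; the worker body is a placeholder and this is not how Pynguin's executor is structured today.)

```python
from concurrent.futures import ProcessPoolExecutor

def execute_test_case(test_case: str) -> str:
    # Placeholder for executing a single test case in an isolated process.
    return f"result of {test_case}"

if __name__ == "__main__":
    test_cases = [f"test_case_{i}" for i in range(8)]
    # Worker processes isolate crashes from the main process, and because
    # test cases are independent they can run concurrently, which may
    # amortise the per-process overhead.
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(execute_test_case, test_cases))
    print(results)
```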

@BergLucas
Contributor Author

Hi @stephanlukasczyk,

I do have some numbers regarding the slow-down, but an average is a bit hard to interpret because the 40x speed decrease was calculated using the few cases where Pynguin didn't crash, so I felt that averaging over all the cases wasn't very representative of the true slow-down. If you're interested, here's my master's thesis; chapters 5 and 7 contain averages of the number of iterations achieved per module. The thesis also implemented a plugin system to allow testers' knowledge to be easily incorporated into the test generation algorithm. Initially, the architectural change was just a necessary improvement to be able to run the plugin system on machine learning libraries, but I thought it was the most interesting change to add to Pynguin at the moment.

Regarding the forkserver, it might be interesting to check whether this has an impact. However, I noticed that even with just the spawn start method, what took the most time was not sending data to the subprocess but transferring data from the subprocess back to the main process. I haven't checked this in detail yet, but I think it's because the ExecutionResult class holds a lot of references and, in the end, the whole test cluster is transferred back to the main process.
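
(One way to confirm that suspicion would be to measure the serialized size of what the subprocess sends back, e.g. with dill; `result` below is just a stand-in for an ExecutionResult-like object, not Pynguin's actual class.)

```python
import dill

def serialized_size(obj) -> int:
    # Size in bytes of the payload that would travel back over the pipe.
    return len(dill.dumps(obj))

# Stand-in for an execution result that drags a large object graph along.
result = {"trace": list(range(100_000)), "exceptions": []}
print(f"payload size: {serialized_size(result)} bytes")
```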

Regarding the parallelization, I did try to implement it at one point and noticed that most of the time it was faster to start a single subprocess, run every test case in it, and fall back to running each test case in a separate subprocess only when a crash was detected. That's what I've done to improve the speed of the "TestSuiteChromosomeComputation" class. However, it's true that it could be interesting to parallelize the "TestCaseChromosomeComputation" class, but I think that would require a lot of changes, and I didn't do it because I didn't have much time.
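
(A toy sketch of that fallback strategy, with test cases represented as code strings; not the actual TestSuiteChromosomeComputation implementation.)

```python
import subprocess
import sys

def run_in_subprocess(test_codes: list[str]) -> int:
    """Run the given test cases in one fresh interpreter; return its exit code."""
    return subprocess.run([sys.executable, "-c", "\n".join(test_codes)]).returncode

def execute_suite(test_codes: list[str]) -> None:
    # Fast path: a single subprocess runs the whole suite.
    if run_in_subprocess(test_codes) == 0:
        print("suite executed without crashing")
        return
    # Fallback: re-run each test case in its own subprocess to isolate the crash.
    for code in test_codes:
        if run_in_subprocess([code]) != 0:
            print(f"crashing test case: {code!r}")

if __name__ == "__main__":
    execute_suite([
        "x = 1 + 1",
        "import ctypes; ctypes.string_at(0)",  # segfaults
        "print('ok')",
    ])
```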

@stephanlukasczyk
Member

Hi @BergLucas ,

What I could do is run Pynguin from your branch with both the current executor and your subprocess-based executor on a benchmark that I have. I'll set this up and run it—I'll report back numbers as soon as I have them.

Also, thank you for your thesis, I'll have a look. Your comments regarding the data that has to be transferred between processes are quite helpful; avoiding the transfer of large amounts of data might be achievable, but first I want to see how the executors behave on my benchmark. I'll add results to this issue as soon as I have any.

@BergLucas
Contributor Author

BergLucas commented Jun 21, 2024

In case you did not see it, the subprocess-based executor is only used when the --subprocess parameter is set, so don't forget to run Pynguin with this parameter, @stephanlukasczyk. Also, you should probably use simple assertions instead of mutation-analysis ones, or generating the assertions could currently take hours on some modules.

@stephanlukasczyk
Member

Finally found the time to run Pynguin on the latest version of your branch, @BergLucas. Configurations use DynaMOSA in its default settings, assertion generation deactivated, and a generation timeout of 600s. DEFAULT_EXECUTION refers to the current default execution; SUBPROCESS_EXECUTION is your proposed approach. First, we start with some statistics (basically the describe function of a Pandas DataFrame) from the raw data:

Coverage Overview
                       count      mean       std  min       25%       50%       75%  max
ConfigurationId                                                                         
DEFAULT_EXECUTION     9514.0  0.685906  0.308685  0.0  0.469880  0.750000  1.000000  1.0
SUBPROCESS_EXECUTION  9332.0  0.638031  0.327191  0.0  0.333333  0.666667  0.972973  1.0


Iterations Overview
                       count         mean          std  min    25%     50%     75%     max
ConfigurationId                                                                           
DEFAULT_EXECUTION     9514.0  1588.290309  1397.305685  0.0  57.25  1446.0  2665.0  7071.0
SUBPROCESS_EXECUTION  9332.0    72.648093    77.326008  0.0  11.00    52.0   106.0   467.0


Speed Overview
                       count      mean       std  min       25%       50%       75%        max
ConfigurationId                                                                               
DEFAULT_EXECUTION     9514.0  2.772500  2.289745  0.0  0.422727  2.617743  4.512063  11.765391
SUBPROCESS_EXECUTION  9332.0  0.124591  0.128730  0.0  0.022979  0.091211  0.181255   0.775748

Because I know that there are always some failures, I filter the raw data and remove all modules that did not produce a result in all repetitions (I did 15 repetitions per configuration):

Remove modules that did not yield 15 iterations
Before: 645
After:  574
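
(For reference, roughly how such a filter could be expressed; the file name and column names below are assumptions, not the actual evaluation scripts.)

```python
import pandas as pd

# Hypothetical raw-results file and schema.
df = pd.read_csv("raw_results.csv")

# Keep only modules for which every configuration produced all 15 repetitions.
runs = df.groupby(["TargetModule", "ConfigurationId"]).size().unstack(fill_value=0)
complete = runs[(runs == 15).all(axis=1)].index
filtered = df[df["TargetModule"].isin(complete)]

print(filtered.groupby("ConfigurationId")["Coverage"].describe())
```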


Coverage Overview
                       count      mean       std  min       25%       50%  75%  max
ConfigurationId                                                                    
DEFAULT_EXECUTION     8610.0  0.704771  0.302880  0.0  0.500000  0.774194  1.0  1.0
SUBPROCESS_EXECUTION  8610.0  0.658459  0.321583  0.0  0.383065  0.722222  1.0  1.0


Iterations Overview
                       count         mean          std  min  25%     50%     75%     max
ConfigurationId                                                                         
DEFAULT_EXECUTION     8610.0  1640.112544  1429.615295  0.0  9.0  1561.0  2759.0  7071.0
SUBPROCESS_EXECUTION  8610.0    74.888850    79.414983  0.0  7.0    53.0   111.0   467.0


Speed Overview
                       count      mean       std  min       25%       50%       75%        max
ConfigurationId                                                                               
DEFAULT_EXECUTION     8610.0  2.871112  2.335756  0.0  0.343301  2.819468  4.673877  11.765391
SUBPROCESS_EXECUTION  8610.0  0.128680  0.132070  0.0  0.019231  0.094059  0.188954   0.775748


Relative Coverage
                      BranchCoverage  RelativeCoverage
ConfigurationId                                       
DEFAULT_EXECUTION           0.704771          0.916433
SUBPROCESS_EXECUTION        0.658459          0.730026

Finally, I've also plotted coverage over the 600s generation time:

[Plot: coverage over time]

The plot and the results show what you've already noted, namely that there is a large slow-down, which also influences coverage significantly.

@BergLucas
Contributor Author

Hi @stephanlukasczyk ,

That's very interesting. I would have thought that the number of modules that did not yield 15 iterations would have been higher for the default execution because of the types of crashes I mentioned before, but that doesn't seem to be the case. In theory, the execution using subprocesses can't crash unless there's a bug in the implementation, so I guess the branch isn't super stable yet, as I suspected.

@stephanlukasczyk
Member

If you are interested and have time for debugging, I can provide you the full raw results, including log files etc.

@nickodell

Hello, I've fixed two of the crash bugs you identified in SciPy. I think this is very valuable in terms of identifying surprising corner-cases in the library.

By the way, I would be interested in looking into the performance issue with subprocess-based execution. Would you mind showing me how to set up your branch to test SciPy? I took a look at the docs, but I'm not sure how to apply them when the package under test needs to be built before I can test it.

@stephanlukasczyk
Member

Thank you @nickodell for offering to investigate the performance.

What I usually do is set up a new virtual environment (based on Python 3.10, because Pynguin only works with this version), install Pynguin into this virtual environment, and also install the respective library into it--scipy in this case. I can then run Pynguin from this environment with all the required dependencies available. This should also work for binaries that are part of a package.

If @BergLucas uses a different approach, he probably could elaborate, too.

@nickodell

> What I usually do is set up a new virtual environment (based on Python 3.10, because Pynguin only works with this version), install Pynguin into this virtual environment, and also install the respective library into it--scipy in this case. I can then run Pynguin from this environment with all the required dependencies available. This should also work for binaries that are part of a package.

Thanks - when you do this, what do you set the --project-path parameter to?

@stephanlukasczyk
Member

I usually also have a source-code checkout of the subject under test (here scipy) lying around, which I point the parameter to.

@BergLucas
Contributor Author

Hi @stephanlukasczyk,
I've been busy over the holidays, so I haven't had time to work on Pynguin, but I should now be able to carry on with my research and intend to keep looking for ways to improve Pynguin. So I would be very interested in getting the raw data you mentioned in your previous comment, if you still have it.

@BergLucas
Contributor Author

Hi @stephanlukasczyk and thanks for the raw data,

I analysed the modules that didn't work at all with my branch and found 2 small bugs that appeared when the data was serialized and sent back to the main process. I've fixed them in this branch: https://github.com/BergLucas/pynguin/tree/improvement/subprocess-execution.
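
(For context on the dill errors listed below: dill.detect.baditems lists the objects inside a container that dill cannot pickle, which is how the unpicklable parts of a result can be found before it is sent back to the main process. A minimal standalone example, not the branch's actual code:)

```python
import dill

# Generators are one of the object kinds dill cannot pickle; baditems
# returns the offending items so they can be stripped or replaced before
# the result is serialized and sent back to the main process.
unpicklable = (i for i in range(3))
result_exceptions = [ValueError("picklable"), unpicklable]
print(dill.detect.baditems(result_exceptions))  # -> [<generator object ...>]
```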

I also noticed a few things about the benchmark, such as modules that required git or dependencies that weren't installed, and modules that had circular imports. I don't know whether that is expected.

Now I'm also going to check the modules that ran at least once but also crashed at least once, and I'll post a comment afterwards. The table below lists the modules that didn't work at all:

| Module | Default run id | Error (default) | Subprocess run id | Error (subprocess) |
|---|---|---|---|---|
| isort.api | 82 | / | 751 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| isort.files | 87 | / | 756 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| isort.identify | 90 | / | 759 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| isort.place | 96 | / | 765 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| pypara.accounting.journaling | 181 | / | 850 | `if assertion_trace_bad_items := dill.detect.baditems(result.assertion_trace):` |
| pypara.accounting.ledger | 182 | / | 851 | `and not dill.detect.baditems(reference_bindings)` |
| pypara.currencies | 187 | Has some errors when executing the module | 856 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| semantic_release.changelog.changelog | 191 | Failed to initialize: Bad git executable. | 860 | Failed to initialize: Bad git executable. |
| semantic_release.changelog.compare | 192 | Failed to initialize: Bad git executable. | 861 | Failed to initialize: Bad git executable. |
| semantic_release.history.logs | 196 | Failed to initialize: Bad git executable. | 865 | Failed to initialize: Bad git executable. |
| semantic_release.history.parser_angular | 197 | Failed to initialize: Bad git executable. | 866 | Failed to initialize: Bad git executable. |
| semantic_release.history.parser_emoji | 198 | Failed to initialize: Bad git executable. | 867 | Failed to initialize: Bad git executable. |
| semantic_release.history.parser_helpers | 199 | Failed to initialize: Bad git executable. | 868 | Failed to initialize: Bad git executable. |
| semantic_release.history.parser_tag | 200 | Failed to initialize: Bad git executable. | 869 | Failed to initialize: Bad git executable. |
| semantic_release.vcs_helpers | 204 | Failed to initialize: Bad git executable. | 873 | Failed to initialize: Bad git executable. |
| sanic.exceptions | 232 | / | 901 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| typesystem.base | 259 | / | 928 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| untangle | 270 | / | 939 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| boltons.excutils | 337 | / | 1006 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| simplejson.encoder | 371 | ImportError: cannot import name 'JSONDecoder' from partially initialized module 'simplejson.decoder' (most likely due to a circular import) | 1040 | ImportError: cannot import name 'JSONDecoder' from partially initialized module 'simplejson.decoder' (most likely due to a circular import) |
| simplejson.errors | 372 | ImportError: cannot import name 'JSONDecoder' from partially initialized module 'simplejson.decoder' (most likely due to a circular import) | 1041 | ImportError: cannot import name 'JSONDecoder' from partially initialized module 'simplejson.decoder' (most likely due to a circular import) |
| simplejson.raw_json | 373 | ImportError: cannot import name 'JSONDecoder' from partially initialized module 'simplejson.decoder' (most likely due to a circular import) | 1042 | ImportError: cannot import name 'JSONDecoder' from partially initialized module 'simplejson.decoder' (most likely due to a circular import) |
| simplejson.scanner | 374 | ImportError: cannot import name 'JSONDecoder' from partially initialized module 'simplejson.decoder' (most likely due to a circular import) | 1043 | ImportError: cannot import name 'JSONDecoder' from partially initialized module 'simplejson.decoder' (most likely due to a circular import) |
| humanfriendly.deprecation | 382 | 3 OOM | 1051 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| requests.adapters | 555 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1224 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' |
| requests.api | 556 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1225 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' |
| requests.auth | 557 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1226 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) |
| requests.cookies | 558 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1227 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) |
| requests.exceptions | 559 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1228 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) |
| requests.help | 560 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1229 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) |
| requests.hooks | 561 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1230 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) |
| requests.models | 562 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1231 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) |
| requests.sessions | 563 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1232 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) |
| requests.structures | 564 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1233 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) |
| requests.utils | 565 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1234 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) |
| tqdm.contrib.telegram | 576 | ModuleNotFoundError: No module named 'requests' | 1245 | ModuleNotFoundError: No module named 'requests' |
| entrypoints | 579 | / | 1248 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| nmap.nmap | 580 | / | 1249 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |

@BergLucas
Contributor Author

BergLucas commented Dec 6, 2024

Here are the data concerning the modules that sometimes fail on my branch. Most of the failures come from the bugs mentioned in my previous comment and are therefore fixed, but some are related to a Bad file descriptor or a SystemError and would be interesting to study in more detail.

| Module | Subprocess run id | Error | Nb fails |
|---|---|---|---|
| boltons.dictutils | 1004 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 4 |
| boltons.gcutils | 1010 | SystemError: Objects/tupleobject.c:964: bad argument to internal function | 4 |
| boltons.iterutils | 1011 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 6 |
| boltons.queueutils | 1017 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 7 |
| boltons.socketutils | 1018 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 1 |
| codetiming._timers | 680 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 1 |
| flake8.api.legacy | 691 | Weird flask error | 1 |
| flake8.main.setuptools_command | 701 | Weird flask error | 1 |
| httpie.output.writer | 743 | While getting the types of exceptions in the handler, expected to find an ast.Name, ast.Tuple, or ast.Attribute,but got `<ast.Call object at 0x7fcc44b08400>` | 2 |
| httpie.uploads | 749 | While getting the types of exceptions in the handler, expected to find an ast.Name, ast.Tuple, or ast.Attribute,but got `<ast.Call object at 0x7ff824b1cb80>` | 1 |
| httpie.utils | 750 | While getting the types of exceptions in the handler, expected to find an ast.Name, ast.Tuple, or ast.Attribute,but got `<ast.Call object at 0x7fb2226939d0>` | 1 |
| humanfriendly.case | 1047 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 8 |
| isort.core | 753 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 2 |
| isort.deprecated.finders | 754 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 1 |
| isort.hooks | 758 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 1 |
| isort.literal | 761 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 2 |
| isort.parse | 764 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 9 |
| isort.settings | 766 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 6 |
| isort.sorting | 767 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 7 |
| isort.wrap | 769 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 4 |
| isort.wrap_modes | 770 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 1 |
| mimesis.providers.generic | 794 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 8 |
| mimesis.providers.internet | 796 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 1 |
| python3.httplib2.auth | 1028 | OSError: [Errno 9] Bad file descriptor | 1 |
| pytutils.sets | 891 | ? | 1 |
| pytutils.trees | 894 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 7 |
| runtime.Python3.src.antlr4.atn.ATNDeserializer | 975 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 6 |
| setuptools._distutils.command.install | 1282 | OSError: [Errno 9] Bad file descriptor | 1 |
| setuptools.archive_util | 1295 | OSError: [Errno 9] Bad file descriptor | 1 |
| sty.primitive | 915 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 8 |
| thonny.jedi_utils | 920 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 1 |
| xdg.Mime | 1221 | OSError: [Errno 9] Bad file descriptor | 2 |
| yaml.reader | 1259 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 2 |

@stephanlukasczyk
Member

Thanks for reporting back. I assume the dill errors are specific to your implementation? The git errors could probably be fixed by providing git inside Pynguin's Docker container.

The OSErrors are interesting, as is the SystemError. The ast-related exceptions from httpie, without having checked them in detail, look like some wrong type is being used there.
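
(For illustration, an except clause whose type expression is a function call yields exactly such an ast.Call node; a minimal reproduction, not code from httpie.)

```python
import ast

# The exception type in an except clause may be an arbitrary expression;
# when it is a call, the handler's type node is ast.Call rather than
# ast.Name, ast.Tuple, or ast.Attribute.
source = """
try:
    do_something()
except get_errors():
    pass
"""
handler = ast.parse(source).body[0].handlers[0]
print(type(handler.type).__name__)  # Call
```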

Anyway, I guess I need to have a look at the logs, too.

@BergLucas
Contributor Author

Yes, the dill errors are specific to my implementation and were the cause of the bugs.
