Add support for running the test cases in subprocesses #74

Open
BergLucas opened this issue Jun 20, 2024 · 17 comments

@BergLucas
Contributor

Is your feature request related to a problem? Please describe.

Over the last four months, I've had the opportunity to write a master's thesis about improving automated test case generation for machine learning libraries. During this project, I discovered several limitations and bugs in Pynguin that I've already reported. However, some bugs could not be easily fixed: segmentation faults, memory leaks, floating point exceptions, and Python GIL deadlocks, which do not come from Pynguin but rather from the module under test. Unfortunately, with Pynguin's current architecture, which executes test cases in threads rather than subprocesses, these types of bugs cause the main process, and therefore Pynguin itself, to crash. I have observed these kinds of crashes on very popular libraries such as numpy, pandas, polars, scipy and sklearn, and they could also happen in other modules, as I've only focused on these few.

Describe the solution you'd like

To solve the problem, I propose changing some aspects of Pynguin's architecture so that test cases can be executed in subprocesses, controlled by a Pynguin parameter. I've already built a working prototype here, but because of the amount of data transferred between the main process and the subprocesses, execution in a subprocess is up to 40x slower than execution in a thread. I therefore think it would first be necessary to rethink the changes I've made in my prototype and increase speed by limiting the data transfer between the main process and the subprocesses.

Describe alternatives you've considered

To the best of my knowledge, the only way to detect segmentation faults, memory leaks, etc., is to use subprocesses, so I don't see any other alternative for dealing with these crashes.

Additional context

With this new architecture, it would also be possible to create error-revealing test cases, as Randoop does. Indeed, by checking the exit code of a subprocess, a crash can be detected and a test case created to reproduce it. This has already been implemented in my prototype and has already helped me find a few bugs in some libraries.
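
(For illustration only, not code from the prototype: a minimal sketch of how a crash such as a segmentation fault can be detected from a subprocess exit code.)

```python
import subprocess
import sys

# Hypothetical test case body that crashes the interpreter with a
# segmentation fault (reading memory at address 0 via ctypes).
test_case_code = "import ctypes; ctypes.string_at(0)"

# Run the test case in a fresh interpreter; a negative return code means
# the child process was killed by a signal (e.g. -11 for SIGSEGV on Linux).
result = subprocess.run([sys.executable, "-c", test_case_code])
if result.returncode < 0:
    print(f"crash detected (signal {-result.returncode}): keep as error-revealing test case")
else:
    print("no crash detected")
```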

@nickodell

nickodell commented Jun 20, 2024

> I've already built a working prototype here, but because of the amount of data transferred between the main process and the subprocesses, execution in a subprocess is up to 40x slower than execution in a thread. I therefore think it would first be necessary to rethink the changes I've made in my prototype and increase speed by limiting the data transfer between the main process and the subprocesses.

I see you're using the spawn start method. I wonder if you could improve performance here by using a forkserver start method, and using set_forkserver_preload() to preload the module under test.

I know that in SciPy, it can take quite a while to import some modules. For example, on my computer it can take 0.4 seconds to run import scipy.signal and nothing else. It seems like it could be possible to import the module once, and re-use that work, since the computation happening during import seems unlikely to be interesting from a testing perspective.

More info: https://bnikolic.co.uk/blog/python/parallelism/2019/11/13/python-forkserver-preload.html
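
A minimal sketch of the suggestion (assuming scipy is installed; the module name and worker body are placeholders, not the prototype's actual code):

```python
import multiprocessing as mp

def run_test_case(test_case_id: int) -> str:
    # The module under test is already imported in the forkserver process,
    # so every forked worker inherits that warm state instead of paying the
    # import cost on each execution.
    import scipy.signal  # cheap: already in sys.modules after preloading
    return f"executed test case {test_case_id}"

if __name__ == "__main__":
    ctx = mp.get_context("forkserver")
    # Import the expensive module once in the forkserver main process.
    ctx.set_forkserver_preload(["scipy.signal"])
    with ctx.Pool(processes=2) as pool:
        print(pool.map(run_test_case, range(4)))
```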

@stephanlukasczyk
Member

Wow, first of all, this is impressive!

I agree with all of your points; the way the current execution is built is probably not the best it could be. I am very much willing to integrate this into Pynguin: first, because I believe it could overcome the limitations and make Pynguin more flexible, and second, because the track record (five found bugs) is already quite nice—I am pretty sure there is more to follow, and I need to maintain a list of found bugs at some point.

Regarding the slow-down: do you have some average numbers? 40x is a massive slowdown, I agree, but if this is only a rare worst case the picture would probably look different. Also, could you perhaps try @nickodell 's suggestion (thanks @nickodell for suggesting the forkserver) to see whether it brings an improvement?

Random additional thought: even if execution in a subprocess is slower, do you see any potential to parallelise these executions? Currently, Pynguin uses threads to isolate executions but only executes the test cases sequentially. If subprocesses made it easy to parallelise test-case executions, the overhead might not be that critical any more.
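
(A rough illustration of that idea with concurrent.futures; the worker body is a placeholder and this is not how Pynguin's executor is structured today.)

```python
from concurrent.futures import ProcessPoolExecutor

def execute_test_case(test_case: str) -> str:
    # Placeholder for executing a single test case in an isolated process.
    return f"result of {test_case}"

if __name__ == "__main__":
    test_cases = [f"test_case_{i}" for i in range(8)]
    # Worker processes isolate crashes from the main process, and because
    # test cases are independent they can run concurrently, which may
    # amortise the per-process overhead.
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(execute_test_case, test_cases))
    print(results)
```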

@BergLucas
Contributor Author

Hi @stephanlukasczyk,

I do have some numbers regarding the slow-down, but an average is a bit hard to interpret because the 40x speed decrease was calculated using the few cases where Pynguin didn't crash, so I felt that averaging over all the cases wasn't very representative of the true slow-down. If you're interested, here's my master's thesis; chapters 5 and 7 contain averages of the number of iterations achieved per module. The thesis also implemented a plugin system to allow testers' knowledge to be easily incorporated into the test generation algorithm. Initially, the architectural change was just a necessary improvement to be able to run the plugin system on machine learning libraries, but I thought it was the most interesting change to add to Pynguin at the moment.

Regarding the forkserver, it might be interesting to check whether this has an impact. However, I noticed that even with just the spawn start method, what took the most time was not sending data to the subprocess but transferring data from the subprocess back to the main process. I haven't checked this in detail yet, but I think it's because the ExecutionResult class holds a lot of references and, in the end, the whole test cluster is transferred back to the main process.
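
(One way to confirm that suspicion would be to measure the serialized size of what the subprocess sends back, e.g. with dill; `result` below is just a stand-in for an ExecutionResult-like object, not Pynguin's actual class.)

```python
import dill

def serialized_size(obj) -> int:
    # Size in bytes of the payload that would travel back over the pipe.
    return len(dill.dumps(obj))

# Stand-in for an execution result that drags a large object graph along.
result = {"trace": list(range(100_000)), "exceptions": []}
print(f"payload size: {serialized_size(result)} bytes")
```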

Regarding the parallelization, I did try to implement it at one point and noticed that most of the time it was faster to start a single subprocess, run every test case in it, and fall back to running each test case in a separate subprocess only when a crash was detected. That's what I've done to improve the speed of the "TestSuiteChromosomeComputation" class. However, it's true that it could be interesting to parallelize the "TestCaseChromosomeComputation" class, but I think that would require a lot of changes, and I didn't do it because I didn't have much time.
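
(A toy sketch of that fallback strategy, with test cases represented as code strings; not the actual TestSuiteChromosomeComputation implementation.)

```python
import subprocess
import sys

def run_in_subprocess(test_codes: list[str]) -> int:
    """Run the given test cases in one fresh interpreter; return its exit code."""
    return subprocess.run([sys.executable, "-c", "\n".join(test_codes)]).returncode

def execute_suite(test_codes: list[str]) -> None:
    # Fast path: a single subprocess runs the whole suite.
    if run_in_subprocess(test_codes) == 0:
        print("suite executed without crashing")
        return
    # Fallback: re-run each test case in its own subprocess to isolate the crash.
    for code in test_codes:
        if run_in_subprocess([code]) != 0:
            print(f"crashing test case: {code!r}")

if __name__ == "__main__":
    execute_suite([
        "x = 1 + 1",
        "import ctypes; ctypes.string_at(0)",  # segfaults
        "print('ok')",
    ])
```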

@stephanlukasczyk
Member

Hi @BergLucas ,

What I could do is run Pynguin from your branch with both the current executor and your subprocess-based executor on a benchmark that I have. I'll set this up and run it—I'll report back numbers as soon as I have them.

Also, thank you for your thesis, I'll have a look. Your comments regarding the data that has to be transferred between processes are quite helpful; avoiding the transfer of large amounts of data might be achievable, but first I want to see how the executors behave on my benchmark. I'll add results to this issue as soon as I have any.

@BergLucas
Contributor Author

BergLucas commented Jun 21, 2024

In case you did not see it, the subprocess-based executor is only used when the --subprocess parameter is set, so don't forget to run Pynguin with this parameter, @stephanlukasczyk. Also, you should probably use simple assertions instead of mutation-analysis ones, or generating the assertions could currently take hours on some modules.

@stephanlukasczyk
Member

Finally found the time to run Pynguin on the latest version of your branch, @BergLucas. Configurations use DynaMOSA in its default settings, assertion generation deactivated, and a generation timeout of 600s. DEFAULT_EXECUTION refers to the current default execution; SUBPROCESS_EXECUTION is your proposed approach. First, we start with some statistics (basically the describe function of a Pandas DataFrame) from the raw data:

Coverage Overview
                       count      mean       std  min       25%       50%       75%  max
ConfigurationId                                                                         
DEFAULT_EXECUTION     9514.0  0.685906  0.308685  0.0  0.469880  0.750000  1.000000  1.0
SUBPROCESS_EXECUTION  9332.0  0.638031  0.327191  0.0  0.333333  0.666667  0.972973  1.0


Iterations Overview
                       count         mean          std  min    25%     50%     75%     max
ConfigurationId                                                                           
DEFAULT_EXECUTION     9514.0  1588.290309  1397.305685  0.0  57.25  1446.0  2665.0  7071.0
SUBPROCESS_EXECUTION  9332.0    72.648093    77.326008  0.0  11.00    52.0   106.0   467.0


Speed Overview
                       count      mean       std  min       25%       50%       75%        max
ConfigurationId                                                                               
DEFAULT_EXECUTION     9514.0  2.772500  2.289745  0.0  0.422727  2.617743  4.512063  11.765391
SUBPROCESS_EXECUTION  9332.0  0.124591  0.128730  0.0  0.022979  0.091211  0.181255   0.775748

Because I know that there are always some failures, I filter the raw data and remove all modules that did not produce a result in all repetitions (I did 15 repetitions per configuration):

Remove modules that did not yield 15 iterations
Before: 645
After:  574
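
(For reference, roughly how such a filter could be expressed; the file name and column names below are assumptions, not the actual evaluation scripts.)

```python
import pandas as pd

# Hypothetical raw-results file and schema.
df = pd.read_csv("raw_results.csv")

# Keep only modules for which every configuration produced all 15 repetitions.
runs = df.groupby(["TargetModule", "ConfigurationId"]).size().unstack(fill_value=0)
complete = runs[(runs == 15).all(axis=1)].index
filtered = df[df["TargetModule"].isin(complete)]

print(filtered.groupby("ConfigurationId")["Coverage"].describe())
```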


Coverage Overview
                       count      mean       std  min       25%       50%  75%  max
ConfigurationId                                                                    
DEFAULT_EXECUTION     8610.0  0.704771  0.302880  0.0  0.500000  0.774194  1.0  1.0
SUBPROCESS_EXECUTION  8610.0  0.658459  0.321583  0.0  0.383065  0.722222  1.0  1.0


Iterations Overview
                       count         mean          std  min  25%     50%     75%     max
ConfigurationId                                                                         
DEFAULT_EXECUTION     8610.0  1640.112544  1429.615295  0.0  9.0  1561.0  2759.0  7071.0
SUBPROCESS_EXECUTION  8610.0    74.888850    79.414983  0.0  7.0    53.0   111.0   467.0


Speed Overview
                       count      mean       std  min       25%       50%       75%        max
ConfigurationId                                                                               
DEFAULT_EXECUTION     8610.0  2.871112  2.335756  0.0  0.343301  2.819468  4.673877  11.765391
SUBPROCESS_EXECUTION  8610.0  0.128680  0.132070  0.0  0.019231  0.094059  0.188954   0.775748


Relative Coverage
                      BranchCoverage  RelativeCoverage
ConfigurationId                                       
DEFAULT_EXECUTION           0.704771          0.916433
SUBPROCESS_EXECUTION        0.658459          0.730026

Finally, I've also plotted coverage over the 600s generation time:

[Plot: coverage over time]

The plot and the results show what you've already noted, namely that there is a large slow-down, which also influences coverage significantly.

@BergLucas
Contributor Author

Hi @stephanlukasczyk ,

That's very interesting. I would have thought that the number of modules that did not yield 15 iterations would have been higher for the default execution because of the types of crashes I mentioned before, but that doesn't seem to be the case. In theory, the execution using subprocesses can't crash unless there's a bug in the implementation, so I guess the branch isn't super stable yet, as I suspected.

@stephanlukasczyk
Member

If you are interested and have time for debugging, I can provide you the full raw results, including log files etc.

@nickodell

Hello, I've fixed two of the crash bugs you identified in SciPy. I think this is very valuable in terms of identifying surprising corner-cases in the library.

By the way, I would be interested in looking into the performance issue with subprocess-based execution. Would you mind showing me how to set up your branch to test SciPy? I took a look at the docs, but I'm not sure how to apply them when the package under test needs to be built before I can test it.

@stephanlukasczyk
Member

Thank you @nickodell for offering to investigate the performance.

What I usually do is set up a new virtual environment (based on Python 3.10, because Pynguin only works with this version), install Pynguin into this virtual environment, and also install the respective library into it--scipy in this case. I can then run Pynguin from this environment with all the required dependencies available. This should also work for binaries that are part of a package.

If @BergLucas uses a different approach, he probably could elaborate, too.

@nickodell

> What I usually do is set up a new virtual environment (based on Python 3.10, because Pynguin only works with this version), install Pynguin into this virtual environment, and also install the respective library into it--scipy in this case. I can then run Pynguin from this environment with all the required dependencies available. This should also work for binaries that are part of a package.

Thanks - when you do this, what do you set the --project-path parameter to?

@stephanlukasczyk
Member

I usually also have a source-code checkout of the subject under test (here scipy) lying around, which I point the parameter to.

@BergLucas
Contributor Author

Hi @stephanlukasczyk,
I've been busy over the holidays, so I haven't had time to work on Pynguin, but I should now be able to carry on with my research and intend to keep looking for ways to improve Pynguin. So I would be very interested in getting the raw data you mentioned in your previous comment, if you still have it.

@BergLucas
Contributor Author

Hi @stephanlukasczyk and thanks for the raw data,

I analysed the modules that didn't work at all with my branch and found 2 small bugs that appeared when the data was serialized and sent back to the main process. I've fixed them in this branch: https://github.com/BergLucas/pynguin/tree/improvement/subprocess-execution.
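
(For context on the dill errors listed below: dill.detect.baditems lists the objects inside a container that dill cannot pickle, which is how the unpicklable parts of a result can be found before it is sent back to the main process. A minimal standalone example, not the branch's actual code:)

```python
import dill

# Generators are one of the object kinds dill cannot pickle; baditems
# returns the offending items so they can be stripped or replaced before
# the result is serialized and sent back to the main process.
unpicklable = (i for i in range(3))
result_exceptions = [ValueError("picklable"), unpicklable]
print(dill.detect.baditems(result_exceptions))  # -> [<generator object ...>]
```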

I also noticed a few things about the benchmark, such as modules that required git or dependencies that weren't installed, and modules that had circular imports. I don't know whether that is expected.

Now I'm also going to check the modules that ran at least once but also crashed at least once, and I'll post a comment afterwards. The table below lists the modules that didn't work at all:

| Module | Default run id | Error (default) | Subprocess run id | Error (subprocess) |
|---|---|---|---|---|
| isort.api | 82 | / | 751 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| isort.files | 87 | / | 756 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| isort.identify | 90 | / | 759 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| isort.place | 96 | / | 765 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| pypara.accounting.journaling | 181 | / | 850 | `if assertion_trace_bad_items := dill.detect.baditems(result.assertion_trace):` |
| pypara.accounting.ledger | 182 | / | 851 | `and not dill.detect.baditems(reference_bindings)` |
| pypara.currencies | 187 | Has some errors when executing the module | 856 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| semantic_release.changelog.changelog | 191 | Failed to initialize: Bad git executable. | 860 | Failed to initialize: Bad git executable. |
| semantic_release.changelog.compare | 192 | Failed to initialize: Bad git executable. | 861 | Failed to initialize: Bad git executable. |
| semantic_release.history.logs | 196 | Failed to initialize: Bad git executable. | 865 | Failed to initialize: Bad git executable. |
| semantic_release.history.parser_angular | 197 | Failed to initialize: Bad git executable. | 866 | Failed to initialize: Bad git executable. |
| semantic_release.history.parser_emoji | 198 | Failed to initialize: Bad git executable. | 867 | Failed to initialize: Bad git executable. |
| semantic_release.history.parser_helpers | 199 | Failed to initialize: Bad git executable. | 868 | Failed to initialize: Bad git executable. |
| semantic_release.history.parser_tag | 200 | Failed to initialize: Bad git executable. | 869 | Failed to initialize: Bad git executable. |
| semantic_release.vcs_helpers | 204 | Failed to initialize: Bad git executable. | 873 | Failed to initialize: Bad git executable. |
| sanic.exceptions | 232 | / | 901 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| typesystem.base | 259 | / | 928 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| untangle | 270 | / | 939 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| boltons.excutils | 337 | / | 1006 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| simplejson.encoder | 371 | ImportError: cannot import name 'JSONDecoder' from partially initialized module 'simplejson.decoder' (most likely due to a circular import) | 1040 | ImportError: cannot import name 'JSONDecoder' from partially initialized module 'simplejson.decoder' (most likely due to a circular import) |
| simplejson.errors | 372 | ImportError: cannot import name 'JSONDecoder' from partially initialized module 'simplejson.decoder' (most likely due to a circular import) | 1041 | ImportError: cannot import name 'JSONDecoder' from partially initialized module 'simplejson.decoder' (most likely due to a circular import) |
| simplejson.raw_json | 373 | ImportError: cannot import name 'JSONDecoder' from partially initialized module 'simplejson.decoder' (most likely due to a circular import) | 1042 | ImportError: cannot import name 'JSONDecoder' from partially initialized module 'simplejson.decoder' (most likely due to a circular import) |
| simplejson.scanner | 374 | ImportError: cannot import name 'JSONDecoder' from partially initialized module 'simplejson.decoder' (most likely due to a circular import) | 1043 | ImportError: cannot import name 'JSONDecoder' from partially initialized module 'simplejson.decoder' (most likely due to a circular import) |
| humanfriendly.deprecation | 382 | 3 OOM | 1051 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| requests.adapters | 555 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1224 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' |
| requests.api | 556 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1225 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' |
| requests.auth | 557 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1226 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) |
| requests.cookies | 558 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1227 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) |
| requests.exceptions | 559 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1228 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) |
| requests.help | 560 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1229 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) |
| requests.hooks | 561 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1230 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) |
| requests.models | 562 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1231 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) |
| requests.sessions | 563 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1232 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) |
| requests.structures | 564 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1233 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) |
| requests.utils | 565 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) | 1234 | ImportError: cannot import name 'parse_dict_header' from partially initialized module 'requests.utils' (most likely due to a circular import) |
| tqdm.contrib.telegram | 576 | ModuleNotFoundError: No module named 'requests' | 1245 | ModuleNotFoundError: No module named 'requests' |
| entrypoints | 579 | / | 1248 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |
| nmap.nmap | 580 | / | 1249 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` |

@BergLucas
Contributor Author

BergLucas commented Dec 6, 2024

Here are the data concerning the modules that sometimes fail on my branch. Most of the failures come from the bugs mentioned in my previous comment and are therefore fixed, but some are related to a Bad file descriptor or a SystemError and would be interesting to study in more detail.

| Module | Subprocess run id | Error | Nb fails |
|---|---|---|---|
| boltons.dictutils | 1004 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 4 |
| boltons.gcutils | 1010 | SystemError: Objects/tupleobject.c:964: bad argument to internal function | 4 |
| boltons.iterutils | 1011 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 6 |
| boltons.queueutils | 1017 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 7 |
| boltons.socketutils | 1018 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 1 |
| codetiming._timers | 680 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 1 |
| flake8.api.legacy | 691 | Weird flask error | 1 |
| flake8.main.setuptools_command | 701 | Weird flask error | 1 |
| httpie.output.writer | 743 | While getting the types of exceptions in the handler, expected to find an ast.Name, ast.Tuple, or ast.Attribute,but got `<ast.Call object at 0x7fcc44b08400>` | 2 |
| httpie.uploads | 749 | While getting the types of exceptions in the handler, expected to find an ast.Name, ast.Tuple, or ast.Attribute,but got `<ast.Call object at 0x7ff824b1cb80>` | 1 |
| httpie.utils | 750 | While getting the types of exceptions in the handler, expected to find an ast.Name, ast.Tuple, or ast.Attribute,but got `<ast.Call object at 0x7fb2226939d0>` | 1 |
| humanfriendly.case | 1047 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 8 |
| isort.core | 753 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 2 |
| isort.deprecated.finders | 754 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 1 |
| isort.hooks | 758 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 1 |
| isort.literal | 761 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 2 |
| isort.parse | 764 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 9 |
| isort.settings | 766 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 6 |
| isort.sorting | 767 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 7 |
| isort.wrap | 769 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 4 |
| isort.wrap_modes | 770 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 1 |
| mimesis.providers.generic | 794 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 8 |
| mimesis.providers.internet | 796 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 1 |
| python3.httplib2.auth | 1028 | OSError: [Errno 9] Bad file descriptor | 1 |
| pytutils.sets | 891 | ? | 1 |
| pytutils.trees | 894 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 7 |
| runtime.Python3.src.antlr4.atn.ATNDeserializer | 975 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 6 |
| setuptools._distutils.command.install | 1282 | OSError: [Errno 9] Bad file descriptor | 1 |
| setuptools.archive_util | 1295 | OSError: [Errno 9] Bad file descriptor | 1 |
| sty.primitive | 915 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 8 |
| thonny.jedi_utils | 920 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 1 |
| xdg.Mime | 1221 | OSError: [Errno 9] Bad file descriptor | 2 |
| yaml.reader | 1259 | `if exception_bad_items := dill.detect.baditems(result.exceptions):` | 2 |

@stephanlukasczyk
Member

Thanks for reporting back. I assume the dill errors are specific to your implementation? The git errors could probably be fixed by providing git inside Pynguin's Docker container.

The OSErrors are interesting, as is the SystemError. The ast-related exceptions from httpie, without having checked them in detail, look like some wrong type is being used there.
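
(For illustration, an except clause whose type expression is a function call yields exactly such an ast.Call node; a minimal reproduction, not code from httpie.)

```python
import ast

# The exception type in an except clause may be an arbitrary expression;
# when it is a call, the handler's type node is ast.Call rather than
# ast.Name, ast.Tuple, or ast.Attribute.
source = """
try:
    do_something()
except get_errors():
    pass
"""
handler = ast.parse(source).body[0].handlers[0]
print(type(handler.type).__name__)  # Call
```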

Anyway, I guess I need to have a look at the logs, too.

@BergLucas
Contributor Author

Yes, the dill errors are specific to my implementation and were the cause of the bugs.
