Refactor the architecture to add support for running the test cases in subprocesses #82
Conversation
Hi @BergLucas, One question: if I want to run this, I just have to add the …
Hi @stephanlukasczyk, Yes, that's the only thing you have to do to use the subprocess executor.
Shall I do yet another run to collect some data on the behaviour, comparing the current execution and the one from this PR?
This PR just contains the code that had already been benchmarked against the classic execution that we discussed in issue #74, with the few bugs I had found corrected. So we could do another benchmark, but its usefulness would be limited for now. I think it would be more interesting to do a benchmark later, to check that the refactoring hasn't caused any loss of performance on classic execution, and to compare the performance of my prototype with the performance of the final code where data transfer has been optimized.
That's fine for me as well; I'd have suggested benchmarking this anyway at some point, to be sure we do not introduce any issues in existing code. Do you already have plans for how you want to integrate this into …
Hi @stephanlukasczyk, I've merged "main" into this branch to give us a better overview of what has changed. As I said above, merging the whole branch at once would not be the best approach. The smartest way to do it would be to incrementally cherry-pick some changes into other branches so that we could re-discuss and better test them. Then it would be possible to merge these changes back from "main" into this branch in order to reduce its size. If you want, I can already make a new branch that will only contain the changes concerning the tracers, so that you can better understand the way I was planning to work. Or, if you prefer, we could schedule a call to discuss it directly.
Hi,
I'm making this pull request to coordinate the implementation of test case execution in subprocesses. The goal is to highlight the prototype code I've already written and compare it to the current code. This will allow me to point out the problems I had to deal with and the solutions I found. I'm pretty sure there are still things that can be improved, since I was fairly new to the codebase when I started this branch, and there are surely things that can be optimized, so don't hesitate to leave comments.
An architecture based on observers
The current architecture of Pynguin is based on observers for computing various information. This makes Pynguin very flexible and configurable, but it complicates execution in subprocesses: calling a main-process observer from a subprocess is rather complicated, because a link between the processes has to be established at some point, and data has to be transferred with each call.
That's why I chose a different approach in my prototype. I noticed that all observers interacted mostly with test cases and results. This is because their specification already contained a constraint limiting the memory of certain methods to the thread they ran in. By extending this constraint slightly, so that it now applies to threads or subprocesses, the observers become almost ready to be sent to another process: all the data they manipulate stays within their own memory and in the results, which were going to be sent back to the main process anyway. The remaining problem was that some observers also used other data, but this was limited to the `after_test_case_execution_outside_thread` methods. This is why I came up with the idea of splitting the observers into two parts: the part executed on the main process (ExecutionObserver), where `after_test_case_execution_outside_thread` is called, and the part sent to and executed on the subprocess (RemoteExecutionObserver), with the rest of the methods. This has several advantages:
Firstly, this would allow Pynguin to continue working in the same way as before if we don't want to use subprocesses. In this case, we just need to run the RemoteExecutionObserver on the main process, and everything behaves as before.
Secondly, it limits the amount of information exchanged, as we only need to send the remote observers and don't need to send all the data used by the entire observer.
Finally, it retains Pynguin's flexibility while allowing observers to be executed in subprocesses.
(Note: I've renamed the observer methods to better match the new names)
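The split described above could be sketched roughly as follows. This is a hypothetical illustration, not Pynguin's actual API: everything except the ExecutionObserver/RemoteExecutionObserver class names (method signatures, the `EventRecording*` subclasses, the dict-based result) is invented for the example.

```python
from abc import ABC, abstractmethod


class RemoteExecutionObserver(ABC):
    """Part shipped to the subprocess.

    It may only touch the test case under execution and the execution
    result, so everything it produces travels back with the result.
    """

    @abstractmethod
    def before_test_case_execution(self, test_case) -> None:
        ...

    @abstractmethod
    def after_test_case_execution(self, test_case, result) -> None:
        ...


class ExecutionObserver(ABC):
    """Part that stays in the main process and owns its remote half."""

    def __init__(self, remote: RemoteExecutionObserver) -> None:
        self.remote_observer = remote  # only this half is sent away

    @abstractmethod
    def after_remote_observer_merged(self, test_case, result) -> None:
        """Runs in the main process (the old
        after_test_case_execution_outside_thread role); may freely
        access main-process-only data."""


# Minimal concrete pair: the remote half records events into the
# result, and the main-process half reads them back out of it.
class EventRecordingRemote(RemoteExecutionObserver):
    def before_test_case_execution(self, test_case) -> None:
        pass

    def after_test_case_execution(self, test_case, result) -> None:
        result.setdefault("events", []).append(test_case)


class EventRecordingObserver(ExecutionObserver):
    def after_remote_observer_merged(self, test_case, result) -> None:
        self.collected = result.get("events", [])
```

Running without subprocesses then simply means invoking the remote half in the main process, which is the backward-compatible behaviour described above.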
Current Architecture
Prototype Architecture
A fairly complex data transfer
The current class architecture was not designed to be sent from one process to another. As a result, there are references between classes all over the place, and pickling a single object can mean serializing almost all the data Pynguin manipulates. Fortunately, when a new process is created, the memory is copied directly without going through pickle, which lets us transfer data that pickle doesn't support and is faster. The problem arises when transferring data from the subprocess back to the main process: this transfer has to go through pickle and, because of all the cross-references, a lot of data has to be copied, which creates a big bottleneck and sometimes causes crashes, since some types simply can't be serialized with pickle. This can be partly mitigated by using dill, a pickle extension that supports more types, but it doesn't fix everything.
In my prototype, I managed to make everything necessary serializable using dill, but the best thing would be to redesign the data that needs to be sent so that this doesn't happen.
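For illustration, here is a minimal, hypothetical example of the kind of failure described above: a result object that (directly or through its references) holds something like a lambda cannot go through standard pickle at all, while dill handles it. The `result` dict is an invented stand-in, not Pynguin's actual result type.

```python
import pickle

# Hypothetical stand-in for an execution result that, through its web
# of references, ends up holding something pickle cannot serialize
# (here a lambda; in practice it could be a local function, a module,
# or similar).
result = {"check": lambda value: value > 0}

try:
    pickle.dumps(result)
    picklable = True
except Exception:  # pickle raises PicklingError for lambdas
    picklable = False

# picklable is False: one unserializable reference poisons the whole
# object graph. dill, a drop-in pickle extension, handles lambdas:
#   import dill
#   data = dill.dumps(result)      # works
#   restored = dill.loads(data)
```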
In addition, there's a special case concerning the tracer. At the moment, the tracer is installed when the target module is imported. During the memory copy that creates the subprocess, however, the tracer is also copied, so its reference becomes different from the one I pass as an argument to the function executed in the subprocess. Therefore, I have to patch it after the subprocess has finished in order to retrieve the information the tracer collected during the execution of the test case. It's annoying, but the other solution, re-instrumenting the module in the subprocess, created a big loss of performance, so it just wasn't acceptable.
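The copy-on-process-creation behaviour can be shown with a small self-contained sketch (using fork-style process creation, which matches the direct memory copy described above; the `trace_state` global is a hypothetical stand-in for the tracer's data, not Pynguin's actual API). Mutations made in the subprocess are invisible to the main process, so the collected data must be shipped back explicitly:

```python
import multiprocessing as mp

# Hypothetical stand-in for the tracer's collected data. When the
# subprocess is forked, it receives its *own copy* of this dict.
trace_state = {"covered_lines": set()}


def run_test_case(queue):
    # Runs in the subprocess: this mutates the subprocess's copy only.
    trace_state["covered_lines"].add(42)
    # The main process never sees that mutation, so the data has to be
    # sent back explicitly (here via a queue, pickled under the hood).
    queue.put(trace_state["covered_lines"])


def execute_in_subprocess():
    ctx = mp.get_context("fork")  # Unix-only; copies memory directly
    queue = ctx.Queue()
    proc = ctx.Process(target=run_test_case, args=(queue,))
    proc.start()
    covered = queue.get()  # drain before join to avoid a deadlock
    proc.join()
    # The main process's copy of the state is untouched by the child.
    assert trace_state["covered_lines"] == set()
    return covered
```

Calling `execute_in_subprocess()` returns the coverage data collected in the child, while the parent's `trace_state` stays empty, which is exactly why the results obtained by the copied tracer have to be merged back into the main process afterwards.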
Some necessary improvements
At the moment, the creation of error-revealing test cases is very ugly: they're created on the fly, and they're not minimized to make the bug easy to find. I think it would be interesting to handle them better, but that would mean tweaking the algorithms, and I didn't have the time back then.
Note
The `src/pynguin/instrumentation/tracer.py` file is mainly composed of classes that have been transferred from the `src/pynguin/testcase/execution.py` file to resolve circular imports, which is why there are so many line changes.

Refactoring plan
I don't think it would be a good idea to merge this entire pull request at once, so I suggest that we do certain steps in separate pull requests to better test and discuss each addition. In order, I suggest that we do the following:
What do you think of this plan?
Don't hesitate to criticize or validate certain parts so that I can implement them! 🙂