Force flake8 to use the single-threaded pool. #505
base: rolling
Conversation
The comment in the code has more information about why we want to do this.

Signed-off-by: Chris Lalancette <clalancette@gmail.com>
I will give an official lgtm because this turns everything green!
So it does indeed seem to make everything green, which is a good start.

In terms of job times, it looks like every job except one took the same amount of time as, or less than, its nightly counterpart from last night. The one exception was aarch64, which took 1 hr 34 min on the nightly but 1 hr 55 min here; I'm not sure why that differs. That said, the jobs run here do not have a 1-to-1 correspondence with the nightly jobs: the nightly jobs explicitly build in either "Release" or "Debug" mode, while the CI jobs use "None" (which is different from both). So what I'm going to do here is run another set of jobs, all in Release mode, and we'll see how that compares in terms of time.
Hm, so the results of the Release jobs show that this change is slower. Note that in order to compare apples to apples, I took the total time for each job, subtracted the time until we made it to the "Run Dockerfile" step, and used those as the numbers below (because the nightly jobs are at a disadvantage, since they always start from a fresh container):
So it is clear that this change has some impact on CI times, particularly on aarch64. However, aarch64 is the job that can most afford a CI time regression, since those workers aren't as heavily used as the amd64 workers, and it isn't ridiculously long like Windows is.

This one is a tough call to make. My personal opinion is that we should take the CI time hit in favor of making RHEL consistently pass CI, but pinging @cottsay @nuclearsandwich @Crola1702 @claraberendsen @ament/team for thoughts.
```python
# We've seen some problems, especially on RHEL-9, where using the multi-threaded
# pool in flake8 can cause the Python interpreter to crash. Force flake8 to use
# the single-threaded pool here. This has some performance implications for
# large codebases, but given the distributed nature of ROS packages this shouldn't
# generally be a large problem.
flake8_argv.append('-j=1')
```
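For reference, the same serialization can be reproduced outside of ament_lint by passing flake8's `--jobs` option directly. The sketch below is illustrative only (the package path is hypothetical); it just shows flake8 being driven with a single job instead of its multiprocessing pool:

```python
# Minimal illustration (not ament_lint code): invoke flake8 with a single job
# so files are checked serially rather than in a multiprocessing pool.
import subprocess
import sys

result = subprocess.run(
    [sys.executable, '-m', 'flake8', '--jobs=1', 'path/to/some_package'],  # hypothetical path
    capture_output=True,
    text=True,
)
print(result.stdout)
```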
@clalancette Just an idea: can we do something like this? We have never seen this issue on any other platform.
Suggested change:

```python
import platform
if platform.system() == "Linux" and "el9" in platform.version():
    # We've seen some problems, especially on RHEL-9, where using the multi-threaded
    # pool in flake8 can cause the Python interpreter to crash. Force flake8 to use
    # the single-threaded pool here. This has some performance implications for
    # large codebases, but given the distributed nature of ROS packages this shouldn't
    # generally be a large problem.
    flake8_argv.append('-j=1')
```
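As a possible variant of this suggestion (not something the PR implements): on EL-9 systems the "el9" tag typically shows up in `/etc/os-release` and in the kernel release string (`platform.release()`, e.g. `5.14.0-...el9_2.x86_64`) rather than in `platform.version()`, so a detection helper along these lines might be more robust. The helper name and the fallback logic are assumptions for illustration:

```python
import platform


def _is_el9() -> bool:
    # Hypothetical helper, not part of this PR. Prefer /etc/os-release, which
    # on RHEL 9 and its rebuilds usually carries PLATFORM_ID="platform:el9".
    try:
        return platform.freedesktop_os_release().get('PLATFORM_ID') == 'platform:el9'
    except (AttributeError, OSError):
        # platform.freedesktop_os_release() requires Python >= 3.10; fall back
        # to looking for the EL tag in the kernel release string.
        return 'el9' in platform.release()


if _is_el9():
    flake8_argv.append('-j=1')
```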
Have there been any attempts to force a different version of flake8 on RHEL 9? Or to force a Noble build to use the same flake8 that RHEL 9 currently has? It would be good to understand whether this is a problem with the interpreter, some particular version(s) of flake8, a dependency thereof, or something else entirely.
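One low-effort way to narrow that down would be to capture the interpreter and linter versions on both platforms and diff them. The snippet below is only a diagnostic sketch, not part of this PR:

```python
# Diagnostic sketch: print the interpreter build plus the installed versions
# of flake8 and its direct dependencies so two environments can be compared.
import sys
from importlib.metadata import PackageNotFoundError, version

print('python:', sys.version)
for dist in ('flake8', 'pycodestyle', 'pyflakes', 'mccabe'):
    try:
        print(dist, version(dist))
    except PackageNotFoundError:
        print(dist, 'not installed')
```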
Trying this PR again, as the problem on RHEL started happening reliably again. We closed ros2/ros2cli#932 because the problem had stopped happening, but it has since returned. A quick check of the changes in the repos and in the pip freeze showed that the change triggering the problem could be either ros2/rosidl_dynamic_typesupport_fastrtps#6 or the change in the pip version used, from pip==21.2.3 to pip==21.3.1. cc: @Crola1702
@clalancette Maybe we don't have to take the time hit on the other distributions? We could use @fujitatomoya's suggestion on your PR, or something else that only changes the behavior on RHEL for this problem.
The comment in the code has more information about why we want to do this.
This is a draft because a) I'm not 100% sure this fixes the issue, and b) I'm not sure what this does to our CI times. @fujitatomoya FYI.