Force flake8 to use the single-threaded pool. #505

Draft
clalancette wants to merge 1 commit into rolling

Conversation

clalancette (Contributor)

The comment in the code has more information about why we want to do this.

This is a draft because a) I'm not 100% sure this fixes the issue, and b) I'm not sure what this does to our CI times. @fujitatomoya FYI.

The comment in the code has more information about why we want to do this.

Signed-off-by: Chris Lalancette <clalancette@gmail.com>
clalancette (Contributor, PR author) commented Oct 17, 2024

CI (of everything):

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

fujitatomoya left a comment

👍 Thanks, I will keep an eye on this!

  • Linux-rhel Debug Build Status

fujitatomoya left a comment

I will give an official LGTM because this turns everything green!

clalancette (Contributor, PR author)

So it does indeed seem to make everything green, which is a good start.

In terms of job times, it looks like every job except one took the same amount of time as, or less than, its nightly counterpart from last night. The one exception was aarch64, which took 1 hr 34 min on the nightly but 1 hr 55 min here. I'm not sure why that is.

That said, the jobs run here do not have a one-to-one correspondence with the nightly jobs. The nightly jobs explicitly use either "Release" or "Debug", while these CI jobs use "None" (which is different from both of those). So what I'm going to do here is run another set of jobs, all in Release mode, and we'll see how that compares in terms of time.

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

clalancette (Contributor, PR author)

Hm, so the results of the Release jobs show that this is slower. Note that in order to compare apples to apples, I took the total time for each job, subtracted the time until we reached the "Run Dockerfile" step, and used the result as the numbers below (otherwise the nightly jobs would be at a disadvantage, since they always start from a fresh container):

Job             Nightly time   This PR time   Delta
Linux           2 hr 33 min    2 hr 34 min    +1 min
Linux aarch64   1 hr 29 min    1 hr 57 min    +28 min
RHEL            2 hr 14 min    2 hr 18 min    +4 min
Windows         4 hr 28 min    4 hr 36 min    +8 min
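
(For reference, a minimal Python sketch that reproduces the deltas above from the adjusted per-job times; the minute values are just the table entries converted to minutes.)

# Reproduce the deltas in the table from the adjusted times, in minutes
# (adjusted = total job time minus the time up to the "Run Dockerfile" step).
adjusted_minutes = {
    'Linux':         (2 * 60 + 33, 2 * 60 + 34),  # (nightly, this PR)
    'Linux aarch64': (1 * 60 + 29, 1 * 60 + 57),
    'RHEL':          (2 * 60 + 14, 2 * 60 + 18),
    'Windows':       (4 * 60 + 28, 4 * 60 + 36),
}
for job, (nightly, this_pr) in adjusted_minutes.items():
    print(f'{job}: +{this_pr - nightly} min')  # +1, +28, +4, +8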

So it is clear that this change has something of an impact on CI times, particularly on aarch64. However, aarch64 is the job that can most afford a CI time regression, since the workers aren't as heavily used as the amd64 workers, and it isn't ridiculously long like Windows is.

This one is a tough call to make. My personal opinion is that we should take the CI time hit in favor of making RHEL consistently pass CI. But pinging @cottsay @nuclearsandwich @Crola1702 @claraberendsen @ament/team for thoughts.

Comment on lines +267 to +273
# We've seen some problems, especially on RHEL-9, where using the multi-threaded
# pool in flake8 can cause the Python interpreter to crash. Force flake8 to use
# the single-threaded pool here. This has some performance implications for
# large codebases, but given the distributed nature of ROS packages this shouldn't
# generally be a large problem.
flake8_argv.append('-j=1')


fujitatomoya: @clalancette Just an idea, can we do something like this? We have never seen this issue on any other platform.

Suggested change (replacing the unconditional lines above with a platform check):

import platform
if platform.system() == "Linux" and "el9" in platform.version():
    # We've seen some problems, especially on RHEL-9, where using the multi-threaded
    # pool in flake8 can cause the Python interpreter to crash. Force flake8 to use
    # the single-threaded pool here. This has some performance implications for
    # large codebases, but given the distributed nature of ROS packages this shouldn't
    # generally be a large problem.
    flake8_argv.append('-j=1')
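
For illustration, a minimal sketch of another way such a check could be scoped, reading /etc/os-release instead of the kernel version string. This is not part of the PR; the field names and the "rhel"/"9" checks are assumptions about the CI images, and flake8_argv is the list from the diff above.

# Hypothetical alternative (assumption, not from the PR): detect EL9-based images
# via /etc/os-release rather than the kernel version string.
def _running_on_el9() -> bool:
    try:
        with open('/etc/os-release') as f:
            fields = dict(
                line.rstrip('\n').split('=', 1)
                for line in f if '=' in line)
    except OSError:
        return False
    # ID/ID_LIKE cover RHEL itself as well as EL9 rebuilds (Alma, Rocky, CentOS Stream).
    ids = (fields.get('ID', '') + ' ' + fields.get('ID_LIKE', '')).replace('"', '')
    return 'rhel' in ids.split() and fields.get('VERSION_ID', '').strip('"').startswith('9')

if _running_on_el9():
    # Same workaround as in the diff above, applied only where the crash has been seen.
    flake8_argv.append('-j=1')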

cottsay (Contributor) commented Oct 18, 2024

Have there been any attempts to force a different version of flake8 on RHEL 9, or to force a Noble build to use the same flake8 version that RHEL 9 currently has?

It would be good to understand whether this is a problem with the interpreter, some particular version(s) of flake8, one of its dependencies, or something else entirely.
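
As a sketch of the kind of comparison that question implies (not part of the PR), one could dump the interpreter and lint-stack versions on each worker and diff the RHEL 9 output against the Noble output; the package list below is an assumption about which components matter.

# Hypothetical diagnostic (assumption, not from the PR): print interpreter and
# lint-stack versions on a CI worker so RHEL 9 and Ubuntu Noble runs can be diffed.
import platform
from importlib.metadata import PackageNotFoundError, version

print('python:', platform.python_version(), 'on', platform.platform())
for pkg in ('flake8', 'pycodestyle', 'pyflakes', 'mccabe'):
    try:
        print(f'{pkg}: {version(pkg)}')
    except PackageNotFoundError:
        print(f'{pkg}: not installed')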

Blast545 (Contributor) commented Dec 4, 2024

Trying this PR again, as the problem on RHEL has started happening reliably again:

  • Linux-rhel Debug Build Status

We closed ros2/ros2cli#932 because the problem had stopped happening, but it has started happening again. A quick check of the changes in the repos and in pip freeze showed that the change triggering the problem could be either ros2/rosidl_dynamic_typesupport_fastrtps#6 or the change in the pip version used, from pip==21.2.3 to pip==21.3.1. cc: @Crola1702

Blast545 (Contributor) commented Dec 4, 2024

Quoting @clalancette: "My personal opinion is that we should take the CI time hit in favor of making RHEL consistently pass CI."

@clalancette Maybe we don't have to take the time hit on the other distributions? We could use fujitatomoya's suggestion on your PR, or something else that only changes the behavior on RHEL.
