feat: Investigate and fix issue with wrong CPU count for containers #623

MaxymVlasov · 2024-02-14T21:28:30Z

Put an x into the box if that applies:

This PR introduces breaking change.
This PR fixes a bug.
This PR adds new functionality.
This PR enhances existing functionality.

Description of your changes

So, I found that nproc always shows how many CPUs are available. K8s "limits" and docker --cpus are throttling mechanisms, which do not hide any cores from the container.
There are a few workarounds, but IMO, it is better to implement checks for that than to apply them:

Workaround for docker - set --cpuset-cpus
Workaround for K8s - somehow deal with kubelet static CPU management policy, as recommended on Hacker News
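
For illustration, a check of this kind could read the CPU quota from the container's cgroup files instead of trusting nproc. This is only a sketch, not the code from this PR: the function names `sketch_is_uint` and `sketch_effective_cpus` are made up, and it assumes the cgroup filesystem is mounted at /sys/fs/cgroup inside the container.

```shell
# Hypothetical helper: succeed only if $1 is a non-empty string of digits.
sketch_is_uint() {
  case "$1" in
    '' | *[!0-9]*) return 1 ;;
  esac
}

# Hypothetical helper (not the PR's actual code): print the number of CPUs
# the container is effectively allowed to use.
# $1 - cgroup root, defaults to /sys/fs/cgroup (assumed mount point).
sketch_effective_cpus() {
  local cgroup_root quota period
  cgroup_root="${1:-/sys/fs/cgroup}"
  quota=''
  period=''

  if [ -r "$cgroup_root/cpu.max" ]; then
    # cgroup v2: one file holding "<quota> <period>" or "max <period>"
    read -r quota period < "$cgroup_root/cpu.max" || true
  elif [ -r "$cgroup_root/cpu/cpu.cfs_quota_us" ]; then
    # cgroup v1: separate files; a quota of -1 means "no limit"
    quota=$(cat "$cgroup_root/cpu/cpu.cfs_quota_us")
    period=$(cat "$cgroup_root/cpu/cpu.cfs_period_us")
  fi

  if sketch_is_uint "$quota" && sketch_is_uint "$period" &&
    [ "$quota" -gt 0 ] && [ "$period" -gt 0 ]; then
    # Round up, so a quota of 1.5 CPUs counts as 2
    echo $(((quota + period - 1) / period))
  else
    # No quota set ("max", -1, or files missing): use the visible core count
    nproc
  fi
}
```

With docker --cpus=1.5 on a cgroup v2 host, `sketch_effective_cpus` should print 2, while nproc alone would still report every visible core.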

Send all "colorify" logs through stderr, which makes it possible to add user-facing logs to functions that also need to return a value to the function caller
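
To show why the stream matters, here is a minimal sketch of the pattern, not the hooks' real code. The names `sketch_log` and `sketch_pick_limit` are hypothetical. A shell function returns data through stdout, so any log line printed to stdout would end up inside the captured value; printing it to stderr keeps the two apart.

```shell
# Hypothetical logger: user-facing messages go to stderr (fd 2).
sketch_log() {
  printf '%s\n' "$*" >&2
}

# Hypothetical function that both warns the user and returns a value.
# $1 - detected CPU count; prints "CPU count - 1", but never less than 1.
sketch_pick_limit() {
  local cpus
  cpus="$1"
  if [ "$cpus" -le 1 ]; then
    sketch_log "WARN: only $cpus CPU detected, running without parallelism"
    echo 1
    return 0
  fi
  echo $((cpus - 1))
}

# The warning reaches the terminal; $limit holds only the number.
limit=$(sketch_pick_limit 1)
```

Had `sketch_log` written to stdout, `$limit` would contain the warning text followed by the number.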

hooks/_common.sh

README.md

Co-authored-by: George L. Yermulnik <yz@yz.kiev.ua>

…ction in README (#620)

### Reasoning

We have a GH workflow that runs lockfile updates every week (implementation and reasoning [here](https://grem1.in/post/terraform-lockfiles-maxymvlasov/)). It usually takes from 2h 30min to 3h 15min. That was fine for us until we found that our GH runners, based on AWS EC2s, started silently failing after 30min "without recent logs". That was fixed by a crutch which sends a dummy log every 10min.

However, during the debugging, I spent some time investigating why hooks were not utilizing all the provided resources. That means wasted time and money, not only in that corner case but for every huge commit, which can push people to opt out of running hooks locally via `git commit -n` for changes that affect many directories.

### Description of your changes

* Add per-hook `--parallelism-limit` setting to `--hook-config`. Defaults to `number of logical CPUs - 1`
* As quick tests show, ~5% of stacks face a race condition problem, no matter whether any locking mechanism exists or dirs try to init in parallel. I suppose the lock failed because it uses disk while hooks run in memory, so creating the lock can take some time as there are a bunch of caches between Mem and Disk. These milliseconds are enough to allow a few `t init` to run in parallel.
* Final implementation uses a retry mechanism for cases when a race condition caused `t init` of a directory to fail.

In quick tests, I can say that on big changes:

* Up to 2000% speed increase for `terraform_validate`, and up to 500% for other affected hooks.
* When `--parallelism-limit=1` I observed an insignificant increase in time (about 5-10%) compared to v1.86.0, which has no parallelism at all. This may be the cost of maintaining parallelism or the result of external factors, since the tests were not conducted in a vacuum.

For small changes, improvements are less significant.

-----

Other significant findings/solutions included in this PR:

* feat: Investigate and fix issue with wrong CPU count for containers (#623)

  So, I found that `nproc` always shows how many CPUs are available. K8s "limits" and docker `--cpus` are throttling mechanisms, which do not hide any cores from the container.
  There are a few workarounds, but IMO, it is better to implement checks for that than to apply them:

  >Workaround for docker - set `--cpuset-cpus`
  >Workaround for K8s - somehow deal with [kubelet static CPU management policy](https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/#cpu-management-policies), as [recommended on Hacker News](https://news.ycombinator.com/item?id=25224714)

  * Send all "colorify" logs through stderr, which makes it possible to add user-facing logs to functions that also need to return a value to the function caller. Needed for the `common::get_cpu_num` err_msg to show up

------

* Count --parallelism-ci-cpu-cores only in edge-cases

  Details: #620 (review)

---------

Co-authored-by: George L. Yermulnik <yz@yz.kiev.ua>
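
The retry approach mentioned in the description above could look roughly like the sketch below. It is not the implementation merged in #620: the function name `sketch_retry` and the `SKETCH_RETRY_DELAY` variable are hypothetical, and the one-second default pause is an assumed value.

```shell
# Hypothetical helper: run a command, retrying it when it fails.
# $1   - maximum number of attempts
# rest - the command and its arguments
sketch_retry() {
  local max_attempts attempt
  max_attempts="$1"
  shift
  attempt=1

  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    # Short pause so a competing init in another directory can finish
    # (the default delay is an assumption, not a measured value)
    sleep "${SKETCH_RETRY_DELAY:-1}"
  done
}

# Usage (assumes terraform is installed and the current directory is a stack):
# sketch_retry 3 terraform init
```

Retrying trades a little extra time on the roughly 5% of stacks that hit the race for not having to serialize every `t init` behind a lock.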

MaxymVlasov added 19 commits February 14, 2024 16:06

Update _common.sh

8f59b2c

Set right CPU count for containers

5f84408

More logging

4fc2219

t

3d31ac9

t

7e14cbb

t

c411f63

t

6ea0e0a

t

1f9c399

t

4ab8e0e

t

4b6767e

t

cdad120

Set right CPU count for containers

1aa3f50

f

4f9693c

f

d541815

parallelism_bypass_safety_check

9df913b

f

1b19f28

Simplify logic

b26e30b

Finally good colution

42959d6

docs

f70bddd

MaxymVlasov requested a review from yermulnik as a code owner February 14, 2024 21:28

MaxymVlasov changed the title ~~Feat/parallelizm debug logs~~ feat: Investigate and fix issue with wrong CPU count for containers Feb 14, 2024

MaxymVlasov mentioned this pull request Feb 14, 2024

feat: Add parallelism to major chunk of hooks. Check Parallelism section in README #620

Merged

4 tasks

polishing

d2ec4d1

MaxymVlasov commented Feb 14, 2024

hooks/_common.sh (outdated, resolved)

yermulnik reviewed Feb 14, 2024

MaxymVlasov and others added 3 commits February 15, 2024 16:04

Apply suggestions from code review

882c0f1

Co-authored-by: George L. Yermulnik <yz@yz.kiev.ua>

Apply suggestions from code review

8b898e5

Apply suggestions from code review

a2b2db3

Co-authored-by: George L. Yermulnik <yz@yz.kiev.ua>

MaxymVlasov commented Feb 15, 2024

hooks/_common.sh (outdated, resolved)

Apply review suggestions + simplify naming

56c9ff8

MaxymVlasov and others added 2 commits February 15, 2024 16:54

Update hooks/_common.sh

1851e11

Co-authored-by: George L. Yermulnik <yz@yz.kiev.ua>

Show msgs i sterr (needed for common::get_cpu_num err_msg show up)

0d0ce57

MaxymVlasov requested a review from yermulnik February 15, 2024 15:57

yermulnik approved these changes Feb 15, 2024

MaxymVlasov merged commit ec22d70 into feat/parallelizm Feb 15, 2024
6 checks passed

MaxymVlasov deleted the feat/parallelizm-debug-logs branch February 15, 2024 16:25
