[feature request or bug] Guarantee unique allocation index #3698

Closed · dukeland9 opened this issue Dec 29, 2017 · 10 comments

@dukeland9

This is a feature request or bug report following up on issue #3593.
Even after that issue, we still see duplicate indices for allocations from time to time.

For example, in the following job:
https://github.com/hashicorp/nomad/files/1574186/allocations_0.7.1.txt
Allocations 19, 25, and 36 were scheduled twice.

If the uniqueness of the alloc index can't be guaranteed, why offer this variable for interpolation at https://www.nomadproject.io/docs/runtime/interpolation.html? It is misleading and effectively useless.
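
For reference, here is a minimal Go sketch (an illustration, not part of the original report) of how a task might consume this index; Nomad also exposes the same value to the task as the NOMAD_ALLOC_INDEX environment variable, so two allocations sharing an index would end up doing the same work:

```go
// Minimal illustration: a task that shards its work by allocation index.
// Assumes Nomad's NOMAD_ALLOC_INDEX environment variable is set for the task.
package main

import (
	"fmt"
	"os"
	"strconv"
)

func main() {
	idx, err := strconv.Atoi(os.Getenv("NOMAD_ALLOC_INDEX"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "NOMAD_ALLOC_INDEX missing or not a number:", err)
		os.Exit(1)
	}
	// Use the index to pick a distinct shard of work. If two allocations
	// ever run with the same index, they would process the same shard,
	// which is why uniqueness of the index matters here.
	fmt.Printf("processing shard %d\n", idx)
}
```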

@preetapan
Contributor

@dukeland9 We are back after the break and looking at this again. I'd like to confirm one thing: in your job specification, did you ask for 100 allocs? It would be helpful if you posted the job specification here.

@preetapan
Contributor

preetapan commented Jan 4, 2018

@dukeland9 After some more investigation, we found that there are two different things happening here:

Reusing the same alloc index on a lost allocation that was replaced - this is expected behavior. Nomad uses indexes from 0 to desired_count - 1. When one of those allocations needs to be replaced, like 19, 25, and 36 in your example (they lost their connection to the node they were running on), the scheduler reuses that alloc index for the replacement.
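
As a rough illustration of that reuse behavior (a simplified sketch, not Nomad's actual scheduler code), a replacement takes the lowest index that is no longer in use rather than a brand-new one:

```go
// Simplified sketch of index reuse: indexes run from 0 to count-1, and a
// replacement for a lost allocation takes the first free index again.
package main

import "fmt"

// nextIndex returns the lowest index in [0, count) not present in inUse.
func nextIndex(inUse map[int]bool, count int) int {
	for i := 0; i < count; i++ {
		if !inUse[i] {
			return i
		}
	}
	return -1 // every slot is taken
}

func main() {
	count := 100
	inUse := make(map[int]bool)
	for i := 0; i < count; i++ {
		inUse[i] = true
	}
	// Allocations 19, 25 and 36 are lost; their indexes become free again.
	for _, lost := range []int{19, 25, 36} {
		delete(inUse, lost)
	}
	// Replacements reuse those indexes instead of getting 100, 101, 102.
	for j := 0; j < 3; j++ {
		idx := nextIndex(inUse, count)
		inUse[idx] = true
		fmt.Println("replacement gets index", idx)
	}
}
```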

Creating the right number of allocations - we did find a bug in how we count whether a batch job was successfully allocated. It resulted in the scheduler not creating allocations with indexes 97, 98, and 99 to reach the desired total count of 100. The bug was that the replaced allocations (19, 25, 36) were incorrectly counted against the total number of desired running allocations (100). We have a fix for this and will comment shortly with a binary for you to test.
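
A simplified sketch of that counting problem (an illustration only, not the actual change in #3717): if lost allocations that were already replaced still count toward the desired total, the scheduler believes it is done at 97 distinct slots and never places indexes 97, 98 and 99:

```go
// Simplified sketch of the counting bug: lost-and-replaced allocations are
// counted toward the desired total, so the scheduler stops placing early.
package main

import "fmt"

type alloc struct {
	index int
	lost  bool
}

func main() {
	desired := 100

	// 97 placements have been made (indexes 0..96) when three of them are
	// lost; each lost allocation gets a replacement with the same index.
	var allocs []alloc
	for i := 0; i < 97; i++ {
		allocs = append(allocs, alloc{index: i})
	}
	for _, i := range []int{19, 25, 36} {
		allocs[i].lost = true                    // original, now lost
		allocs = append(allocs, alloc{index: i}) // its replacement
	}

	buggy := len(allocs) // counts the lost originals too: reports 100
	running := 0
	for _, a := range allocs {
		if !a.lost {
			running++ // only allocations still expected to run: 97
		}
	}

	fmt.Printf("buggy count: %d, still missing: %d\n", buggy, desired-buggy)
	fmt.Printf("correct count: %d, still missing: %d\n", running, desired-running)
}
```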

Also note that this bug is a rare edge case: it only happens when a large enough batch is requested in a CPU-contended environment and allocations are lost before the initial set of placements has been made.

Thanks once again for stress testing this in your environment.

@dadgar
Contributor

dadgar commented Jan 4, 2018

Hey, here is a Linux AMD64 binary (nomad.zip, attached) that includes the changes from #3717. If you want to give it a test, that would be great!

@dukeland9
Author

@preetapan Thank you for investigating this issue!

@dukeland9
Author

@dadgar I tried to run your binary, but it failed with "error while loading shared libraries: liblxc.so.1: cannot open shared object file: No such file or directory". Can you confirm that the binary was compiled correctly? I don't think we had an lxc dependency before.

@preetapan
Contributor

preetapan commented Jan 5, 2018

@dukeland9 Can you try this binary? I built one for you on my Linux box and verified that it does not depend on liblxc.so (the ldd output below shows this):

preetha@preetha-work ~/nomad/bin (master) $ldd nomad
	linux-vdso.so.1 =>  (0x00007fff60bfd000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fe9a7b1c000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe9a7752000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fe9a7d39000)

preetapan self-assigned this Jan 5, 2018
@dukeland9
Author

dukeland9 commented Jan 6, 2018

@preetapan I can't access your file on Amazon S3 because it is blocked in China. Would you please provide one on GitHub like dadgar did? Thanks a lot.

BTW, I only have to update the binaries on the servers, right?

@dukeland9
Author

@preetapan Never mind, I managed to build one myself. I replaced the binary on the servers and tested running two jobs in a node-draining situation; the system seemed to work correctly.
I'll watch the system run for a few more days and then give more solid feedback.

@dukeland9
Author

The system has been running as expected for several days. Thanks to @preetapan and @dadgar for fixing this!

@github-actions

github-actions bot commented Dec 4, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 4, 2022