
Spread/affinity values in allocations may generate invalid JSON (and may not be working correctly) #5698

Closed
chrisboulton opened this issue May 14, 2019 · 10 comments · Fixed by #5713

Comments

@chrisboulton

Nomad version

Nomad v0.9.1 (4b2bdbd9ab68a27b10c2ee781cceaaf62e114399)

Issue

We've just rolled spreads out in our environment, and for some of our services, whenever we run nomad alloc status ... we receive the following:

Error querying allocation: invalid character '+' looking for beginning of value

I grabbed one of the evaluations via the API, and it looks like some of the score metadata is being returned as invalid JSON (snipped below; note the +Inf references):

"ScoreMetaData": [
{"NodeID":"0505eac9-c018-9fd2-db1f-fa73189acf72","NormScore":+Inf.0,"Scores":{"node-affinity":0.0,"allocation-spread":+Inf.0,"binpack":0.3753003015577075,"job-anti-affinity":0.0,"node-reschedule-penalty":0.0}},

{"NodeID":"53aa4ef9-3c24-0e13-b30e-d359febb10b7","NormScore":+Inf.0,"Scores":{"node-reschedule-penalty":0.0,"node-affinity":0.0,"allocation-spread":+Inf.0,"binpack":0.9337175385458898,"job-anti-affinity":0.0}},

{"NodeID":"f9b40239-f714-c4ee-ca56-0a7b793c3fbe","NormScore":+Inf.0,"Scores":{"node-affinity":0.0,"allocation-spread":+Inf.0,"binpack":0.8930095623039143,"job-anti-affinity":0.0,"node-reschedule-penalty":0.0}},

{"NodeID":"64948e5d-9084-8397-6afe-a4f60a6432db","NormScore":+Inf.0,"Scores":{"job-anti-affinity":0.0,"node-reschedule-penalty":-1.0,"node-affinity":0.0,"allocation-spread":+Inf.0,"binpack":0.8274427409048469}},

{"NodeID":"49153c3a-0ff3-f39b-d771-37a854d01033","NormScore":+Inf.0,"Scores":{"node-reschedule-penalty":0.0,"node-affinity":0.0,"allocation-spread":+Inf.0,"binpack":0.8667487781667723,"job-anti-affinity":0.0}}]

We're not sure how to reproduce this as yet - we just know it's happening consistently in our environment. All nomad-clients (and servers) are running Nomad 0.9.1.

Job file

Emailed along with full un-redacted allocation output

@chrisboulton chrisboulton changed the title Spread/affinity values in allocations may generate invalid JSON Spread/affinity values in allocations may generate invalid JSON (and may not be working correctly) May 14, 2019
@chrisboulton
Author

We've done some more digging - this doesn't seem to be impacting every allocation for a job, only some. We haven't been able to identify what the cause is yet.

@rsigrest

We've been seeing this too on v0.9.1, but, like you, not for every job. I noticed it come up when tasks are migrated while draining a client. Currently, the only thing that seems to fix it is nuking the data_dir files and deregistering the Nomad services from Consul (effectively adding a brand new client node).

@jonbodner

We ran into this problem as well, with the allocation-spread:

      "ScoreMetaData": [
        {
          "NodeID": "82249719-0569-b4e8-f6bb-da18a905a041",
          "NormScore":+Inf.0,
          "Scores": {
            "allocation-spread":+Inf.0,
            "binpack": 0.13675268338169427,
            "job-anti-affinity": 0.0,
            "node-reschedule-penalty": 0.0,
            "node-affinity": 0.0
          }
        },

I did a quick dive through the code and allocation-spread is only set in

func (iter *SpreadIterator) Next() *RankedNode in scheduler/spread.go.

There are a couple of things in that method that could cause a +Inf.

First is

spreadWeight := float64(spreadDetails.weight) / float64(iter.sumSpreadWeights)

That divides by zero if iter.sumSpreadWeights is 0. It looks like the field is only set in

func (iter *SpreadIterator) computeSpreadInfo(tg *structs.TaskGroup) in scheduler/spread.go

iter.sumSpreadWeights += int32(spread.Weight)

So if you never specify any weights, that would cause a +Inf.
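
To illustrate (a hypothetical standalone sketch with made-up values): float64 division by zero in Go doesn't panic, it quietly yields +Inf, which then flows into the node's score.

    package main

    import "fmt"

    func main() {
        // Hypothetical values: a single spread weight of 50, but
        // iter.sumSpreadWeights was never accumulated and is still 0.
        weight := int32(50)
        sumSpreadWeights := int32(0)

        // Mirrors: spreadWeight := float64(spreadDetails.weight) / float64(iter.sumSpreadWeights)
        spreadWeight := float64(weight) / float64(sumSpreadWeights)

        fmt.Println(spreadWeight) // +Inf: no panic, the value just propagates downstream
    }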

The second thing I found was a call to:

func evenSpreadScoreBoost(pset *propertySet, option *structs.Node) float64

In that function, there's a division by minCount in a situation where minCount could be 0:

	if minCount == 0 {
		deltaBoost = -1.0
	} else {
		delta := int(minCount - currentAttributeCount)
		deltaBoost = float64(delta) / float64(minCount)
	}
	if currentAttributeCount != minCount {
		// Boost based on delta between current and min
		return deltaBoost
	} else if minCount == maxCount {
		// Maximum possible penalty when the distribution is even
		return -1.0
	}
	// Penalty based on delta from max value
	delta := int(maxCount - minCount)
	deltaBoost = float64(delta) / float64(minCount)
	return deltaBoost

Note that if minCount is 0, currentAttributeCount is also 0 (so the early return isn't taken), and maxCount isn't 0, we fall through to the second-to-last line, which divides by 0 and results in a +Inf.
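
A tiny reproduction of that path (a hypothetical helper mirroring only the tail of evenSpreadScoreBoost, not the real function):

    package main

    import "fmt"

    // boostTail mirrors just the final fall-through of evenSpreadScoreBoost:
    // it assumes we already passed the currentAttributeCount == minCount and
    // minCount != maxCount checks above.
    func boostTail(minCount, maxCount uint64) float64 {
        delta := int(maxCount - minCount)
        return float64(delta) / float64(minCount)
    }

    func main() {
        // minCount == 0 (with maxCount != 0) reaches the division and yields +Inf.
        fmt.Println(boostTail(0, 3))
    }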

The iteration over combinedUseMap is suspect; map iteration is not in any particular order. If there is a value of 0 in the map, there's no guarantee that it will be seen before a non-zero value. If that happens, this code could assign minCount the value of 0:

	minCount := uint64(0)
	maxCount := uint64(0)
	for _, value := range combinedUseMap {
		if minCount == 0 || value < minCount {
			minCount = value
		}
		if maxCount == 0 || value > maxCount {
			maxCount = value
		}
	}
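
One way to make that scan order-independent (a hypothetical sketch only, not necessarily how a fix would look; the divide-by-zero above would still need its own guard) is to stop treating 0 as an "unset" sentinel:

    // minMaxCounts is a hypothetical helper: it initializes min/max from the
    // first element instead of using 0 as a sentinel, so genuine zero counts
    // are handled like any other value regardless of map iteration order.
    func minMaxCounts(counts map[string]uint64) (minCount, maxCount uint64) {
        first := true
        for _, value := range counts {
            if first {
                minCount, maxCount = value, value
                first = false
                continue
            }
            if value < minCount {
                minCount = value
            }
            if value > maxCount {
                maxCount = value
            }
        }
        return minCount, maxCount
    }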

@chrisboulton
Author

chrisboulton commented May 14, 2019

@rsigrest: interesting. We went straight from v0.8.7 to v0.9.1, skipping v0.9.0. I'm curious whether you were on v0.9.0 previously and didn't see the problem there? Same question about versions for you, @jonbodner.

We're actually in a much worse place today, and we're trying to figure out whether it's related to the spread/+Inf issue or not. All new allocations are stuck in pending with no reason given (there are placement metrics for some of them, though). We basically can't reschedule anything, can't drain nodes, etc. Not sure yet whether it's caused by spreads or is just a v0.9.x issue.

@alexdulin

(Speaking in regard to @jonbodner's comment; we work together.) @chrisboulton we moved from 0.8.7 to 0.9.0, experienced the malformed JSON issues, and then moved from 0.9.0 to 0.9.1 and continued to see them.

Eventually we removed the spread stanza from all jobs after this problem caused node drains to fail entirely. If you are still experiencing the problem I would advise removing spread from all jobs and seeing if other operations resume successful execution. It worked for us.

@rsigrest

I was also using 0.8.7 and upgraded directly to 0.9.1, specifically for the spread feature. This problem first happened after the upgrade when performing a drain on one of my clients. The node would drain successfully, but an adjacent client node would suddenly become unreachable: any request from the CLI (status, drain, etc.) returned "Error querying allocation: invalid character '+' looking for beginning of value", and the UI showed zero allocations and errored when trying to get details about the node. While in this bad state, the node still had dozens of Nomad containers running and the Consul agent had healthy services for each one.

Based on what @jonbodner posted earlier, I'm going to try some deployments that specify a weight value in the spread stanza and see if that works before simply removing any use of spread, since it's something I'd really like to start using if possible.

@chrisboulton
Author

chrisboulton commented May 15, 2019

For what it's worth - we were using the weight attribute. Our spread stanza (at a task group level) was:

"Spreads": [
    {
        "Attribute": "${meta.zone}", # we doubled checked to make sure this was defined on all client nodes
        "SpreadTarget": null,
        "Weight": 50
    },
    {
        "Attribute": "${node.unique.name}",
        "SpreadTarget": null,
        "Weight": 50
    }
],

@preetapan
Contributor

@jonbodner Thanks for your analysis. Wanted to clarify that the weight defaults to 50 here (the if s.Weight == nil check) when it's not set at job parse time.

PR #5713 fixes the divide by zero bug.

@chrisboulton
Author

Just wanted to drop a quick note - I cherry-picked #5713 and built a release on top of v0.9.1. We've had it in our test environments for 48 hours now, and have spreads enabled for the past 24 hours. Can confirm this is fixed now. 🎉

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 23, 2022