
Ability to truncate raft.db on nomad server nodes #4477

Open
tantra35 opened this issue Jul 6, 2018 · 12 comments

Comments

@tantra35
Contributor

tantra35 commented Jul 6, 2018

Nomad version

0.8.4

After launching some huge jobs we ended up with very big raft.db files on the server nodes, so restarting a Nomad server takes a huge amount of time and the server eats too much memory. When we replace a server that has this huge raft.db with a new one, raft.db on that node is not nearly as big and Nomad eats the expected amount of memory. So it would be very useful to be able to manually truncate raft.db.

I found a mention of what looks like the same problem in the Consul issue tracker:

hashicorp/consul#866

But it is not clear to me whether Nomad has the same behavior.

@jippi
Contributor

jippi commented Jul 6, 2018

huge jobs
huge amount of
too many memory
eats expected amount of memory

Can you please put some actual numbers behind these descriptions? :)

@tantra35
Contributor Author

tantra35 commented Jul 6, 2018

@jippi In our case the Nomad servers were on t2.micro instances, and we launched Spark jobs with 2000 executors (i.e. 2000 allocations). After the cluster hung due to insufficient hardware we upgraded the servers to t2.xlarge, but raft.db now takes 30GB and Nomad uses 65% of memory, even though these days only about 10 jobs with 1-3 allocations each run on that cluster, and restarting the Nomad server doesn't help reduce memory consumption.

@jippi
Contributor

jippi commented Jul 6, 2018

@tantra35 that seems quite underpowered; see the recommended production requirements, especially given the bursty capacity of the t2 family. Personally we use m4.xlarge and m4.2xlarge in prod.

@tantra35
Contributor Author

tantra35 commented Jul 6, 2018

@jippi I understand this (and we have already built a new, more powerful cluster), but it still seems strange that after we dramatically reduced the allocation count (i.e. stopped all Spark jobs), Nomad does not give the occupied memory back, not even after a server restart. It is not a problem to replace all the server nodes (in that case, as I mentioned above, a new server takes the expected amount of memory), but sometimes that is not so easy (for example on static dedicated hardware).

@jippi
Contributor

jippi commented Jul 7, 2018

did you try to force a gc?
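For example, something like:

nomad system gc

(or the equivalent PUT to the /v1/system/gc API endpoint) forces a server-side garbage collection run.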

@tantra35
Contributor Author

tantra35 commented Jul 8, 2018

@jippi That was the first thing we did.

@burdandrei
Contributor

burdandrei commented Jul 10, 2018

Adding my 2 cents:
On 0.8.4 with a couple of thousand allocations we get OOMs on a 16GB r4.large:
[screenshot: server memory usage]

@tantra35
Contributor Author

@burdandrei can you share how you solved this in your case?

@preetapan
Contributor

@tantra35 Until Nomad takes a snapshot, the raft DB is not compacted. Currently the threshold at which we take a snapshot is not tunable; it happens after 8192 raft writes.

What's the output of nomad agent-info (https://www.nomadproject.io/docs/commands/agent-info.html) on your server? The specific values of interest are last_snapshot_index and last_log_index. Usually once the gap between them approaches 8192, old logs are deleted, including the ones from jobs that are no longer running.

We could make this value tunable, but that is a tradeoff between disk I/O (since compacting the raft DB via snapshots is expensive) and disk utilization. I would recommend increasing disk capacity on your servers.
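For reference, assuming the default CLI output format, you can pull just those two counters out with something like:

nomad agent-info | grep -E 'last_log_index|last_snapshot_index'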

@burdandrei
Contributor

Well @tantra35, in the meantime I changed the servers to i3.xlarge to get 32 GB of RAM and NVMe for the data dir, but to be honest the raft.db file hardly reaches 200MB.
I found that the memory problem was caused by our job scaler. It submitted jobs too fast, while the previous deployment was still running, so we had a LOT of cancelled deployments.
As soon as I added a condition to the scaler so that it does not adjust the group count while a deployment is still running, the memory leak stopped (a sketch of that guard is below):
[screenshot: server memory graph after the scaler fix]
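The guard amounts to something like the following sketch (using the Nomad Go API client; the job ID, the literal "running" status string, and the surrounding scaler logic are illustrative placeholders, not our actual code):

package main

import (
	"log"

	"github.com/hashicorp/nomad/api"
)

// safeToScale reports whether it is safe to adjust the group count:
// it checks that the job's most recent deployment is not still running.
func safeToScale(client *api.Client, jobID string) (bool, error) {
	dep, _, err := client.Jobs().LatestDeployment(jobID, nil)
	if err != nil {
		return false, err
	}
	// No deployment yet, or the latest one already reached a terminal state.
	return dep == nil || dep.Status != "running", nil
}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	ok, err := safeToScale(client, "example-job") // placeholder job ID
	if err != nil {
		log.Fatal(err)
	}
	if !ok {
		log.Println("deployment still running, skipping this scale cycle")
		return
	}
	// ... adjust the group count and re-submit the job here ...
}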

@preetapan This shows that deep inside, Nomad is not foolproof. =(
I managed to bring the Nomad servers to OOM 4 times a day before adding the fix.
P.S. 0.8.3 handled it without a problem, so 0.8.4 probably has some regressions.

@tantra35
Contributor Author

tantra35 commented Jul 13, 2018

@preetapan

Here are our values:

raft
  applied_index = 523719
  commit_index = 523719
  fsm_pending = 0
  last_contact = 13.264716ms
  last_log_index = 523719
  last_log_term = 3180
  last_snapshot_index = 516962
  last_snapshot_term = 3180

Since the cluster is now in a static state (no allocations are added or removed), the gap between last_log_index and last_snapshot_index stays at 6757, below the 8192 threshold, so we have no way to move the situation off the dead point. It would help if we could send Nomad a signal (for example USR1) to trigger a cleanup, or if Nomad did some cleanup at start (which seems reasonable), or if there were a standalone tool to maintain raft.db.

@tantra35
Contributor Author

tantra35 commented Jul 30, 2018

@preetapan A few seconds ago on one of our clusters the diff between last_snapshot_index and last_log_index reached 8192 and a snapshot was taken, but raft.db is still huge, about 17GB (we expected this file to become much smaller after the snapshot, but it was not truncated):

root@ip-172-25-2-11:/var/lib/nomad/server/raft# ls -lah
total 17G
drwxr-xr-x 3 root root 4.0K Jul 30 16:27 .
drwxr-xr-x 4 root root 4.0K Jul 27 17:36 ..
-rwxr-xr-x 1 root root  971 Jul  4 16:29 peers.info
-rw------- 1 root root  17G Jul 30 18:10 raft.db
drwxr-xr-x 4 root root 4.0K Jul 30 18:08 snapshots

while the snapshot itself (state.bin) is only about 130MB:

root@ip-172-25-2-11:/var/lib/nomad/server/raft/snapshots/969-1786945-1532957246133# ls -lah
total 131M
drwxr-xr-x 2 root root 4.0K Jul 30 16:27 .
drwxr-xr-x 4 root root 4.0K Jul 30 18:08 ..
-rw-r--r-- 1 root root  454 Jul 30 16:27 meta.json
-rw-r--r-- 1 root root 131M Jul 30 16:27 state.bin

In the logs we see the following:

Jul 30 18:02:59 ip-172-25-2-169 nomad.sh[1408]:     2018/07/30 18:02:59 [INFO] raft: Starting snapshot up to 1794727
Jul 30 18:02:59 ip-172-25-2-169 nomad.sh[1408]:     2018/07/30 18:02:59 [INFO] snapshot: Creating new snapshot at /var/lib/nomad/server/raft/snapshots/1278-1794727-1532962979564
Jul 30 18:02:59 ip-172-25-2-169 nomad.sh[1408]:     2018/07/30 18:02:59 [INFO] raft: Compacting logs from 1786488 to 1784493
Jul 30 18:02:59 ip-172-25-2-169 nomad.sh[1408]:     2018/07/30 18:02:59 [INFO] raft: Snapshot to 1794727 complete

It is very strange and looks like a bug that the from index (1786488) in that log line is greater than the to index (1784493), so as I understand it nothing gets deleted. This happens because Nomad uses the default raft config, and the compactLogs function contains the following code:

	// Truncate up to the end of the snapshot, or `TrailingLogs`
	// back from the head, which ever is further back. This ensures
	// at least `TrailingLogs` entries, but does not allow logs
	// after the snapshot to be removed.
	maxLog := min(snapIdx, lastLogIdx-r.conf.TrailingLogs)

In Nomad's case r.conf.TrailingLogs is 10240, so here maxLog = min(snapIdx, lastLogIdx - 10240) = min(1794727, 1784493) = 1784493, which is below the first index 1786488, and DeleteRange over that range removes nothing, so raft.db is not actually truncated. The file also does not shrink on the second round:

Jul 30 18:37:06 ip-172-25-2-169 nomad.sh[1408]:     2018/07/30 18:37:06 [INFO] raft: Starting snapshot up to 1803290
Jul 30 18:37:06 ip-172-25-2-169 nomad.sh[1408]:     2018/07/30 18:37:06 [INFO] snapshot: Creating new snapshot at /var/lib/nomad/server/raft/snapshots/1278-1803290-1532965026178
Jul 30 18:37:06 ip-172-25-2-169 nomad.sh[1408]:     2018/07/30 18:37:06 [INFO] snapshot: reaping snapshot /var/lib/nomad/server/raft/snapshots/952-1786487-1532956909606
Jul 30 18:37:06 ip-172-25-2-169 nomad.sh[1408]:     2018/07/30 18:37:06 [INFO] raft: Compacting logs from 1786488 to 1793117
Jul 30 18:37:07 ip-172-25-2-169 nomad.sh[1408]:     2018/07/30 18:37:07 [INFO] raft: Snapshot to 1803290 complete

I think that in Nomad r.conf.TrailingLogs should be set to autopilot.MaxTrailingLogs rather than the predefined 10240.
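A minimal sketch of what that would look like (maxTrailingLogs here stands in for the autopilot max_trailing_logs setting; this is not Nomad's actual wiring, just an illustration of the suggestion):

package main

import (
	"fmt"

	"github.com/hashicorp/raft"
)

func main() {
	// Stand-in for an operator-configured autopilot max_trailing_logs value.
	var maxTrailingLogs uint64 = 250

	conf := raft.DefaultConfig()
	fmt.Println("library default TrailingLogs:", conf.TrailingLogs) // prints 10240

	// Override the hard-coded default with the autopilot-derived value.
	if maxTrailingLogs > 0 {
		conf.TrailingLogs = maxTrailingLogs
	}
	fmt.Println("effective TrailingLogs:", conf.TrailingLogs)
}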

After writing a little program that emulates the compactLogs functionality:

package main

import (
	"fmt"
	raftboltdb "github.com/hashicorp/raft-boltdb"
)

func main() {
	// Open the server's raft log store (BoltDB) directly.
	l_db, _ := raftboltdb.NewBoltStore("./raft.db")

	l_first, _ := l_db.FirstIndex()
	l_end, _ := l_db.LastIndex()

	fmt.Printf("[INFO] raft: Compacting logs from %d to %d\n", l_first, l_end)

	// Delete the whole index range, as compactLogs would do via DeleteRange.
	if l_err := l_db.DeleteRange(l_first, l_end); l_err != nil {
		fmt.Printf("[ERR] log compaction failed: %v\n", l_err)
	} else {
		fmt.Printf("[INFO] log compaction ok\n")
	}

	l_db.Close()
}

We found that raft.db is still not truncated (the file does not shrink even after DeleteRange), so we performed a manual compaction through the boltdb CLI compact command (boltdb/bolt#460):

root@ip-172-25-2-22:/home/ruslan# ./bolt-cli compact -o /var/lib/nomad/server/raft/raft.db /var/lib/nomad/server/raft/raft.old.db
20370898944 -> 109387776 bytes (gain=186.23x)

After that the Nomad servers stopped eating so much memory.
