Skip to content
This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

node disconnects when starting a main container and a data container unit (as a sidekick) #1395

Open
ncodes opened this issue Dec 7, 2015 · 10 comments

Comments

@ncodes
Copy link

ncodes commented Dec 7, 2015

I have been trying to start a postgres unit (main unit) and a sidekick unit which will be acting as a data container. Starting the main unit leads forces one of the nodes in cluster to exit/disconnect. Below are my unit files and fleet journal:

postgres-db@.service

[Unit]
Description=A postgres database. Uses postgres-db-data@.service as a data container.
Requires=etcd.service
Requires=docker.service
Requires=postgres-db-data@%i.service
After=etcd.service
After=docker.service
Before=postgres-db-data@%i.service


[Service]
TimeoutStartSec=0
KillMode=none
EnvironmentFile=/etc/environment
ExecStartPre=-/usr/bin/docker kill postgres%i
ExecStartPre=-/usr/bin/docker rm postgres%i
ExecStartPre=/usr/bin/docker pull ncodes/postgres
ExecStart=/usr/bin/docker run --volumes-from postgres-data-%i --name postgres%i -e POSTGRES_PASSWORD=mysecretpassword -e DB_NAME=database1 -p ${COREOS_PUBLIC_IPV4}:%i:5432 ncodes/postgres 
ExecStop=/usr/bin/docker stop postgres%i

[X-Fleet]
X-Conflicts=postgres-db@*.service

postgres-db-data@.service

[Unit]
Description=A postgres database
BindsTo=postgres-db@%i.service

# Requirements
Requires=etcd.service
Requires=docker.service
Requires=postgres-db@%i.service

# ordering
After=etcd.service
After=docker.service
After=postgres-db@%i.service

[Service]
TimeoutStartSec=0
KillMode=none
EnvironmentFile=/etc/environment
ExecStartPre=-/usr/bin/docker kill postgres-data-%i
ExecStartPre=-/usr/bin/docker rm postgres-data-%i
ExecStartPre=/usr/bin/docker pull ncodes/postgres
ExecStart=/usr/bin/docker run --name postgres-data-%i -e POSTGRES_PASSWORD=mysecretpassword -e DB_NAME=database2 ncodes/postgres 
ExecStop=/usr/bin/docker stop postgres-data-%i

[X-Fleet]
X-ConditionMachineOf=postgres-db@%i.service

journalctl -u fleet

Dec 07 10:53:52 coreos01 systemd[1]: Started fleet daemon.
Dec 07 10:53:52 coreos01 fleetd[813]: INFO fleetd.go:64: Starting fleetd version 0.11.5
Dec 07 10:53:52 coreos01 fleetd[813]: INFO fleetd.go:168: No provided or default config file found - proceeding without
Dec 07 10:53:52 coreos01 fleetd[813]: INFO server.go:157: Establishing etcd connectivity
Dec 07 10:55:00 coreos01 fleetd[813]: INFO server.go:168: Starting server components
Dec 07 10:55:00 coreos01 fleetd[813]: INFO engine.go:79: Engine leader is 6fbfd62531ce435ca99ed7aa318b7220
Dec 07 11:09:18 coreos01 fleetd[813]: INFO manager.go:246: Writing systemd unit postgres-db-alone@5433.service (561b)
Dec 07 11:09:18 coreos01 fleetd[813]: INFO manager.go:182: Instructing systemd to reload units
Dec 07 11:09:19 coreos01 fleetd[813]: INFO manager.go:127: Triggered systemd unit postgres-db-alone@5433.service start: job=3192
Dec 07 11:09:19 coreos01 fleetd[813]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=postgres-db-alone@5433.service reason="unit scheduled here but not loaded"
Dec 07 11:09:19 coreos01 fleetd[813]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Dec 07 11:09:19 coreos01 fleetd[813]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=postgres-db-alone@5433.service reason="unit currently loaded but desired state is
Dec 07 11:13:29 coreos01 fleetd[813]: INFO manager.go:138: Triggered systemd unit postgres-db-alone@5433.service stop: job=4131
Dec 07 11:13:29 coreos01 fleetd[813]: INFO reconcile.go:330: AgentReconciler completed task: type=StopUnit job=postgres-db-alone@5433.service reason="unit currently launched but desired state i
Dec 07 11:13:49 coreos01 fleetd[813]: INFO manager.go:138: Triggered systemd unit postgres-db-alone@5433.service stop: job=4219
Dec 07 11:13:49 coreos01 fleetd[813]: INFO manager.go:259: Removing systemd unit postgres-db-alone@5433.service
Dec 07 11:13:49 coreos01 fleetd[813]: INFO manager.go:182: Instructing systemd to reload units
Dec 07 11:13:49 coreos01 fleetd[813]: INFO reconcile.go:330: AgentReconciler completed task: type=UnloadUnit job=postgres-db-alone@5433.service reason="unit loaded but not scheduled here"
Dec 07 11:13:49 coreos01 fleetd[813]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Dec 07 12:08:21 coreos01 fleetd[813]: INFO engine.go:79: Engine leader is 6e0e59c83bd7450e8c8e22c36ee1cf3a
Dec 07 12:08:26 coreos01 fleetd[813]: INFO manager.go:246: Writing systemd unit postgres-db@5433.service (750b)
Dec 07 12:08:26 coreos01 fleetd[813]: INFO manager.go:182: Instructing systemd to reload units
Dec 07 12:08:26 coreos01 fleetd[813]: ERROR manager.go:129: Failed to trigger systemd unit postgres-db@5433.service start: Unit postgres-db-data@5433.service failed to load: No such file or dir
Dec 07 12:08:26 coreos01 fleetd[813]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=postgres-db@5433.service reason="unit scheduled here but not loaded"
Dec 07 12:08:26 coreos01 fleetd[813]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Dec 07 12:08:26 coreos01 fleetd[813]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=postgres-db@5433.service reason="unit currently loaded but desired state is launc
Dec 07 12:08:27 coreos01 fleetd[813]: INFO manager.go:246: Writing systemd unit postgres-db-data@5433.service (676b)
Dec 07 12:08:27 coreos01 fleetd[813]: INFO manager.go:182: Instructing systemd to reload units
Dec 07 12:08:27 coreos01 fleetd[813]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=postgres-db-data@5433.service reason="unit scheduled here but not loaded"
Dec 07 12:08:27 coreos01 fleetd[813]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Dec 07 12:29:37 coreos01 fleetd[813]: INFO manager.go:138: Triggered systemd unit postgres-db@5433.service stop: job=12320
Dec 07 12:29:37 coreos01 fleetd[813]: INFO reconcile.go:330: AgentReconciler completed task: type=StopUnit job=postgres-db@5433.service reason="unit currently launched but desired state is load
Dec 07 12:29:48 coreos01 fleetd[813]: INFO manager.go:127: Triggered systemd unit postgres-db@5433.service start: job=12322
Dec 07 12:29:48 coreos01 fleetd[813]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=postgres-db@5433.service reason="unit currently loaded but desired state is launc
Dec 07 12:29:50 coreos01 fleetd[813]: INFO engine.go:159: Updated cluster engine version from 0 to 1
Dec 07 12:29:50 coreos01 fleetd[813]: INFO engine.go:185: Engine leadership acquired
Dec 07 12:29:50 coreos01 fleetd[813]: INFO engine.go:81: Engine leadership changed from 6e0e59c83bd7450e8c8e22c36ee1cf3a to ca376ee32b42401087d3cb6acfd4d6fc
Dec 07 12:29:53 coreos01 fleetd[813]: INFO manager.go:138: Triggered systemd unit postgres-db-data@5433.service stop: job=12499
Dec 07 12:29:53 coreos01 fleetd[813]: INFO manager.go:259: Removing systemd unit postgres-db-data@5433.service
Dec 07 12:29:53 coreos01 fleetd[813]: INFO manager.go:138: Triggered systemd unit postgres-db@5433.service stop: job=12501
Dec 07 12:29:53 coreos01 fleetd[813]: INFO manager.go:259: Removing systemd unit postgres-db@5433.service
Dec 07 12:29:53 coreos01 fleetd[813]: INFO manager.go:182: Instructing systemd to reload units
Dec 07 12:29:53 coreos01 fleetd[813]: INFO reconcile.go:330: AgentReconciler completed task: type=UnloadUnit job=postgres-db-data@5433.service reason="unit loaded but not scheduled here"
Dec 07 12:29:53 coreos01 fleetd[813]: INFO reconcile.go:330: AgentReconciler completed task: type=UnloadUnit job=postgres-db@5433.service reason="unit loaded but not scheduled here"
Dec 07 12:29:53 coreos01 fleetd[813]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Dec 07 12:30:28 coreos01 fleetd[813]: INFO engine.go:271: Scheduled Unit(postgres-db@5433.service) to Machine(ca376ee32b42401087d3cb6acfd4d6fc)
Dec 07 12:30:28 coreos01 fleetd[813]: INFO reconciler.go:161: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: postgres-db@5433.service, MachineID: ca376ee32b42401087d3cb6a
Dec 07 12:30:33 coreos01 fleetd[813]: INFO manager.go:246: Writing systemd unit postgres-db@5433.service (750b)
Dec 07 12:30:33 coreos01 fleetd[813]: INFO manager.go:182: Instructing systemd to reload units
Dec 07 12:30:33 coreos01 fleetd[813]: ERROR manager.go:129: Failed to trigger systemd unit postgres-db@5433.service start: Unit postgres-db-data@5433.service failed to load: No such file or dir
Dec 07 12:30:33 coreos01 fleetd[813]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=postgres-db@5433.service reason="unit scheduled here but not loaded"
Dec 07 12:30:33 coreos01 fleetd[813]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Dec 07 12:30:33 coreos01 fleetd[813]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=postgres-db@5433.service reason="unit currently loaded but desired state is launc
@scruplelesswizard
Copy link

Your logs are truncated. Please see http://unix.stackexchange.com/questions/229188/journalctl-how-to-prevent-text-from-truncating-in-terminal and repost your logs.

@ncodes
Copy link
Author

ncodes commented Dec 7, 2015

I made corrections in the [Unit] section of both units so that the postgres-db-data@%i.service unit is started before postgres-db@%i.service. As a result, both units were loaded in the correct order. I proceeded to starting the unit on one of my machines (core-01) with this command:

fleetctl start postgres-db@5434.service

(Please note that I created the units instance from their templates using symbolic links.)

I got the following logs after running the above command. I was running infinitely, i had to do ctrl-c

2015/12/07 17:25:59 WARN fleetctl.go:803: Unit postgres-db@5434.service not found
2015/12/07 17:26:00 WARN fleetctl.go:803: Unit postgres-db@5434.service not found
2015/12/07 17:26:00 WARN fleetctl.go:803: Unit postgres-db@5434.service not found
2015/12/07 17:26:01 WARN fleetctl.go:803: Unit postgres-db@5434.service not found
2015/12/07 17:26:01 WARN fleetctl.go:803: Unit postgres-db@5434.service not found
2015/12/07 17:26:02 WARN fleetctl.go:803: Unit postgres-db@5434.service not found
2015/12/07 17:26:02 WARN fleetctl.go:803: Unit postgres-db@5434.service not found
2015/12/07 17:26:03 WARN fleetctl.go:803: Unit postgres-db@5434.service not found
2015/12/07 17:26:03 WARN fleetctl.go:803: Unit postgres-db@5434.service not found

I checked if the machine core-01 (machine-id: 05f35eb2d608440585bbfaa214fa36b2) was still part of the cluster but it was no longer a member. Also ran the check on the two other machines and they both saw each other.

fleetctl list-machines

MACHINE     IP      METADATA
05f35eb2... 10x.xxx.xxx.xxx -

The I fetched the fleet and etcd logs for this machine

-- Logs begin at Mon 2015-12-07 17:17:40 UTC, end at Mon 2015-12-07 18:17:28 UTC. --
Dec 07 17:18:03 coreos01 systemd[1]: Started fleet daemon.
Dec 07 17:18:03 coreos01 fleetd[808]: INFO fleetd.go:64: Starting fleetd version 0.11.5
Dec 07 17:18:03 coreos01 fleetd[808]: INFO fleetd.go:168: No provided or default config file found - proceeding without
Dec 07 17:18:03 coreos01 fleetd[808]: INFO server.go:157: Establishing etcd connectivity
Dec 07 17:18:36 coreos01 fleetd[808]: INFO server.go:168: Starting server components
Dec 07 17:18:36 coreos01 fleetd[808]: INFO engine.go:159: Updated cluster engine version from 0 to 1
Dec 07 17:18:36 coreos01 fleetd[808]: INFO engine.go:185: Engine leadership acquired
Dec 07 17:24:28 coreos01 fleetd[808]: INFO engine.go:271: Scheduled Unit(postgres-db@5434.service) to Machine(05f35eb2d608440585bbfaa214fa36b2)
Dec 07 17:24:28 coreos01 fleetd[808]: INFO reconciler.go:161: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: postgres-db@5434.service, MachineID: 05f35eb2d608440585bbfaa214fa36b2, Reason: "target state loaded and unit not scheduled"}
Dec 07 17:24:30 coreos01 fleetd[808]: INFO engine.go:271: Scheduled Unit(postgres-db-data@5434.service) to Machine(05f35eb2d608440585bbfaa214fa36b2)
Dec 07 17:24:30 coreos01 fleetd[808]: INFO reconciler.go:161: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: postgres-db-data@5434.service, MachineID: 05f35eb2d608440585bbfaa214fa36b2, Reason: "target state loaded and unit not scheduled"}
Dec 07 17:24:31 coreos01 fleetd[808]: INFO manager.go:246: Writing systemd unit postgres-db-data@5434.service (677b)
Dec 07 17:24:31 coreos01 fleetd[808]: INFO manager.go:246: Writing systemd unit postgres-db@5434.service (749b)
Dec 07 17:24:31 coreos01 fleetd[808]: INFO manager.go:182: Instructing systemd to reload units
Dec 07 17:24:31 coreos01 fleetd[808]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=postgres-db-data@5434.service reason="unit scheduled here but not loaded"
Dec 07 17:24:31 coreos01 fleetd[808]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=postgres-db@5434.service reason="unit scheduled here but not loaded"
Dec 07 17:24:31 coreos01 fleetd[808]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Dec 07 17:25:59 coreos01 systemd[1]: Started etcd.
Dec 07 17:25:59 coreos01 fleetd[808]: INFO manager.go:127: Triggered systemd unit postgres-db@5434.service start: job=2341
Dec 07 17:25:59 coreos01 fleetd[808]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=postgres-db@5434.service reason="unit currently loaded but desired state is launched"
Dec 07 17:25:59 coreos01 etcd[1003]: [etcd] Dec  7 17:25:59.433 INFO      | 05f35eb2d608440585bbfaa214fa36b2 is starting a new cluster
Dec 07 17:25:59 coreos01 etcd[1003]: [etcd] Dec  7 17:25:59.447 INFO      | etcd server [name 05f35eb2d608440585bbfaa214fa36b2, listen on :4001, advertised url http://127.0.0.1:4001]
Dec 07 17:25:59 coreos01 etcd[1003]: [etcd] Dec  7 17:25:59.453 INFO      | peer server [name 05f35eb2d608440585bbfaa214fa36b2, listen on :7001, advertised url http://127.0.0.1:7001]
Dec 07 17:25:59 coreos01 etcd[1003]: [etcd] Dec  7 17:25:59.457 INFO      | 05f35eb2d608440585bbfaa214fa36b2 starting in peer mode
Dec 07 17:25:59 coreos01 etcd[1003]: [etcd] Dec  7 17:25:59.459 INFO      | 05f35eb2d608440585bbfaa214fa36b2: state changed from 'initialized' to 'follower'.
Dec 07 17:25:59 coreos01 etcd[1003]: [etcd] Dec  7 17:25:59.460 INFO      | 05f35eb2d608440585bbfaa214fa36b2: state changed from 'follower' to 'leader'.
Dec 07 17:25:59 coreos01 etcd[1003]: [etcd] Dec  7 17:25:59.460 INFO      | 05f35eb2d608440585bbfaa214fa36b2: leader changed from '' to '05f35eb2d608440585bbfaa214fa36b2'.
Dec 07 17:26:01 coreos01 fleetd[808]: INFO engine.go:159: Updated cluster engine version from 0 to 1
Dec 07 17:26:01 coreos01 fleetd[808]: ERROR engine.go:217: Engine leadership lost, renewal failed: 100: Key not found (/_coreos.com/fleet/lease) [5]
Dec 07 17:26:03 coreos01 fleetd[808]: INFO engine.go:185: Engine leadership acquired
Dec 07 17:26:04 coreos01 fleetd[808]: INFO manager.go:138: Triggered systemd unit postgres-db-data@5434.service stop: job=2518
Dec 07 17:26:04 coreos01 fleetd[808]: INFO manager.go:259: Removing systemd unit postgres-db-data@5434.service
Dec 07 17:26:04 coreos01 fleetd[808]: INFO manager.go:138: Triggered systemd unit postgres-db@5434.service stop: job=2520
Dec 07 17:26:04 coreos01 fleetd[808]: INFO manager.go:259: Removing systemd unit postgres-db@5434.service
Dec 07 17:26:04 coreos01 fleetd[808]: INFO manager.go:182: Instructing systemd to reload units
Dec 07 17:26:04 coreos01 fleetd[808]: INFO reconcile.go:330: AgentReconciler completed task: type=UnloadUnit job=postgres-db-data@5434.service reason="unit loaded but not scheduled here"
Dec 07 17:26:04 coreos01 fleetd[808]: INFO reconcile.go:330: AgentReconciler completed task: type=UnloadUnit job=postgres-db@5434.service reason="unit loaded but not scheduled here"
Dec 07 17:26:04 coreos01 fleetd[808]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Dec 07 17:31:00 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 17:31:00 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: unexpected EOF
-- Logs begin at Mon 2015-12-07 17:17:40 UTC, end at Mon 2015-12-07 18:17:28 UTC. --
Dec 07 17:18:03 coreos01 systemd[1]: Started fleet daemon.
Dec 07 17:18:03 coreos01 fleetd[808]: INFO fleetd.go:64: Starting fleetd version 0.11.5
Dec 07 17:18:03 coreos01 fleetd[808]: INFO fleetd.go:168: No provided or default config file found - proceeding without
Dec 07 17:18:03 coreos01 fleetd[808]: INFO server.go:157: Establishing etcd connectivity
Dec 07 17:18:36 coreos01 fleetd[808]: INFO server.go:168: Starting server components
Dec 07 17:18:36 coreos01 fleetd[808]: INFO engine.go:159: Updated cluster engine version from 0 to 1
Dec 07 17:18:36 coreos01 fleetd[808]: INFO engine.go:185: Engine leadership acquired
Dec 07 17:24:28 coreos01 fleetd[808]: INFO engine.go:271: Scheduled Unit(postgres-db@5434.service) to Machine(05f35eb2d608440585bbfaa214fa36b2)
Dec 07 17:24:28 coreos01 fleetd[808]: INFO reconciler.go:161: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: postgres-db@5434.service, MachineID: 05f35eb2d608440585bbfaa2
14fa36b2, Reason: "target state loaded and unit not scheduled"}
Dec 07 17:24:30 coreos01 fleetd[808]: INFO engine.go:271: Scheduled Unit(postgres-db-data@5434.service) to Machine(05f35eb2d608440585bbfaa214fa36b2)
Dec 07 17:24:30 coreos01 fleetd[808]: INFO reconciler.go:161: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: postgres-db-data@5434.service, MachineID: 05f35eb2d608440585b
bfaa214fa36b2, Reason: "target state loaded and unit not scheduled"}
Dec 07 17:24:31 coreos01 fleetd[808]: INFO manager.go:246: Writing systemd unit postgres-db-data@5434.service (677b)
Dec 07 17:24:31 coreos01 fleetd[808]: INFO manager.go:246: Writing systemd unit postgres-db@5434.service (749b)
Dec 07 17:24:31 coreos01 fleetd[808]: INFO manager.go:182: Instructing systemd to reload units
Dec 07 17:24:31 coreos01 fleetd[808]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=postgres-db-data@5434.service reason="unit scheduled here but not loaded"
Dec 07 17:24:31 coreos01 fleetd[808]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=postgres-db@5434.service reason="unit scheduled here but not loaded"
Dec 07 17:24:31 coreos01 fleetd[808]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Dec 07 17:25:59 coreos01 systemd[1]: Started etcd.
Dec 07 17:25:59 coreos01 fleetd[808]: INFO manager.go:127: Triggered systemd unit postgres-db@5434.service start: job=2341
Dec 07 17:25:59 coreos01 fleetd[808]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=postgres-db@5434.service reason="unit currently loaded but desired state is launc
hed"
Dec 07 17:25:59 coreos01 etcd[1003]: [etcd] Dec  7 17:25:59.433 INFO      | 05f35eb2d608440585bbfaa214fa36b2 is starting a new cluster
Dec 07 17:25:59 coreos01 etcd[1003]: [etcd] Dec  7 17:25:59.447 INFO      | etcd server [name 05f35eb2d608440585bbfaa214fa36b2, listen on :4001, advertised url http://127.0.0.1:4001]
Dec 07 17:25:59 coreos01 etcd[1003]: [etcd] Dec  7 17:25:59.453 INFO      | peer server [name 05f35eb2d608440585bbfaa214fa36b2, listen on :7001, advertised url http://127.0.0.1:7001]
Dec 07 17:25:59 coreos01 etcd[1003]: [etcd] Dec  7 17:25:59.457 INFO      | 05f35eb2d608440585bbfaa214fa36b2 starting in peer mode
Dec 07 17:25:59 coreos01 etcd[1003]: [etcd] Dec  7 17:25:59.459 INFO      | 05f35eb2d608440585bbfaa214fa36b2: state changed from 'initialized' to 'follower'.
Dec 07 17:25:59 coreos01 etcd[1003]: [etcd] Dec  7 17:25:59.460 INFO      | 05f35eb2d608440585bbfaa214fa36b2: state changed from 'follower' to 'leader'.
Dec 07 17:25:59 coreos01 etcd[1003]: [etcd] Dec  7 17:25:59.460 INFO      | 05f35eb2d608440585bbfaa214fa36b2: leader changed from '' to '05f35eb2d608440585bbfaa214fa36b2'.
Dec 07 17:26:01 coreos01 fleetd[808]: INFO engine.go:159: Updated cluster engine version from 0 to 1
Dec 07 17:26:01 coreos01 fleetd[808]: ERROR engine.go:217: Engine leadership lost, renewal failed: 100: Key not found (/_coreos.com/fleet/lease) [5]
Dec 07 17:26:03 coreos01 fleetd[808]: INFO engine.go:185: Engine leadership acquired
Dec 07 17:26:04 coreos01 fleetd[808]: INFO manager.go:138: Triggered systemd unit postgres-db-data@5434.service stop: job=2518
Dec 07 17:26:04 coreos01 fleetd[808]: INFO manager.go:259: Removing systemd unit postgres-db-data@5434.service
Dec 07 17:26:04 coreos01 fleetd[808]: INFO manager.go:138: Triggered systemd unit postgres-db@5434.service stop: job=2520
Dec 07 17:26:04 coreos01 fleetd[808]: INFO manager.go:259: Removing systemd unit postgres-db@5434.service
Dec 07 17:26:04 coreos01 fleetd[808]: INFO manager.go:182: Instructing systemd to reload units
Dec 07 17:26:04 coreos01 fleetd[808]: INFO reconcile.go:330: AgentReconciler completed task: type=UnloadUnit job=postgres-db-data@5434.service reason="unit loaded but not scheduled here"
Dec 07 17:26:04 coreos01 fleetd[808]: INFO reconcile.go:330: AgentReconciler completed task: type=UnloadUnit job=postgres-db@5434.service reason="unit loaded but not scheduled here"
Dec 07 17:26:04 coreos01 fleetd[808]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Dec 07 17:31:00 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 17:31:00 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: unexpected EOF
Dec 07 17:31:00 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: unexpected EOF
Dec 07 17:35:59 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 17:36:01 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 17:40:59 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 17:41:01 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 17:45:59 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 17:46:01 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 17:50:59 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 17:51:01 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 17:55:59 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 17:56:01 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 18:00:59 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 18:01:01 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 18:05:59 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 18:06:01 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 18:10:59 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 18:11:01 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 18:15:59 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 18:16:01 coreos01 fleetd[808]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.

Here are the logs from one of the remaining two machines in the cluster:

-- Subject: Unit fleet.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit fleet.service has finished starting up.
--
-- The start-up result is done.
Dec 07 17:18:07 coreos01 fleetd[827]: INFO fleetd.go:64: Starting fleetd version 0.11.5
Dec 07 17:18:07 coreos01 fleetd[827]: INFO fleetd.go:168: No provided or default config file found - proceeding without
Dec 07 17:18:07 coreos01 fleetd[827]: INFO server.go:157: Establishing etcd connectivity
Dec 07 17:18:40 coreos01 fleetd[827]: INFO server.go:168: Starting server components
Dec 07 17:18:40 coreos01 fleetd[827]: INFO engine.go:79: Engine leader is 05f35eb2d608440585bbfaa214fa36b2
Dec 07 17:26:13 coreos01 fleetd[827]: INFO engine.go:79: Engine leader is e1e6eed49a8e4b698c29cfd39d3f3ce5
Dec 07 17:26:24 coreos01 fleetd[827]: INFO manager.go:246: Writing systemd unit postgres-db@5434.service (749b)
Dec 07 17:26:24 coreos01 fleetd[827]: INFO manager.go:182: Instructing systemd to reload units
Dec 07 17:26:24 coreos01 fleetd[827]: ERROR manager.go:129: Failed to trigger systemd unit postgres-db@5434.service start: Unit postgres-db-data@5434.service failed to load: No such file or directory.
Dec 07 17:26:24 coreos01 fleetd[827]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=postgres-db@5434.service reason="unit scheduled here but not loaded"
Dec 07 17:26:24 coreos01 fleetd[827]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Dec 07 17:26:24 coreos01 fleetd[827]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=postgres-db@5434.service reason="unit currently loaded but desired state is launched"
Dec 07 17:26:24 coreos01 fleetd[827]: INFO manager.go:246: Writing systemd unit postgres-db-data@5434.service (677b)
Dec 07 17:26:24 coreos01 fleetd[827]: INFO manager.go:182: Instructing systemd to reload units
Dec 07 17:26:24 coreos01 fleetd[827]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=postgres-db-data@5434.service reason="unit scheduled here but not loaded"
Dec 07 17:26:24 coreos01 fleetd[827]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"

@ncodes
Copy link
Author

ncodes commented Dec 7, 2015

I retried again with the following logs:

node-01

Dec 07 18:46:13 coreos01 etcd2[764]: bc399ca97836b0e1 restored progress of a099e32821cc3cb7 [next = 10002, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
Dec 07 18:46:13 coreos01 etcd2[764]: bc399ca97836b0e1 restored progress of bc399ca97836b0e1 [next = 10002, match = 10001, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]
Dec 07 18:46:13 coreos01 etcd2[764]: bc399ca97836b0e1 [commit: 1268] restored snapshot [index: 10001, term: 3]
Dec 07 18:46:13 coreos01 etcd2[764]: raft applied incoming snapshot at index 10001
Dec 07 18:46:13 coreos01 etcd2[764]: recovered from incoming snapshot at index 10001
Dec 07 18:46:13 coreos01 etcd2[764]: the connection with 8859b9c7c25010fd became active
Dec 07 18:46:14 coreos01 etcd2[764]: the connection with a099e32821cc3cb7 became active
Dec 07 18:46:14 coreos01 etcd2[764]: published {Name:05f35eb2d608440585bbfaa214fa36b2 ClientURLs:[http://10.134.197.190:2379 http://10.134.197.190:4001]} to cluster 9837d5f591eaf261
Dec 07 18:46:14 coreos01 fleetd[770]: INFO server.go:168: Starting server components
Dec 07 18:46:14 coreos01 fleetd[770]: INFO engine.go:79: Engine leader is e1e6eed49a8e4b698c29cfd39d3f3ce5
Dec 07 19:05:15 coreos01 fleetd[770]: INFO manager.go:246: Writing systemd unit postgres-db-data@54333.service (677b)
Dec 07 19:05:15 coreos01 fleetd[770]: INFO manager.go:246: Writing systemd unit postgres-db@54333.service (749b)
Dec 07 19:05:15 coreos01 fleetd[770]: INFO manager.go:182: Instructing systemd to reload units
Dec 07 19:05:15 coreos01 fleetd[770]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=postgres-db-data@54333.service reason="unit scheduled here but not loaded"
Dec 07 19:05:15 coreos01 fleetd[770]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=postgres-db@54333.service reason="unit scheduled here but not loaded"
Dec 07 19:05:15 coreos01 fleetd[770]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Dec 07 19:07:05 coreos01 systemd[1]: Stopping etcd2...
Dec 07 19:07:05 coreos01 etcd2[764]: received terminated signal, shutting down...
Dec 07 19:07:05 coreos01 systemd[1]: Started etcd.
Dec 07 19:07:05 coreos01 fleetd[770]: INFO manager.go:127: Triggered systemd unit postgres-db@54333.service start: job=3217
Dec 07 19:07:05 coreos01 fleetd[770]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=postgres-db@54333.service reason="unit currently loaded but desired state is launched"
Dec 07 19:07:05 coreos01 fleetd[770]: ERROR engine.go:149: Unable to determine cluster engine version
Dec 07 19:07:05 coreos01 systemd[1]: Stopped etcd2.
Dec 07 19:07:05 coreos01 fleetd[770]: ERROR units.go:275: Failed fetching Unit(postgres-db@54333.service) from Registry: dial tcp 127.0.0.1:4001: connection refused
Dec 07 19:07:05 coreos01 etcd[1084]: [etcd] Dec  7 19:07:05.497 INFO      | Peer URLs in log:  / 05f35eb2d608440585bbfaa214fa36b2 (http://127.0.0.1:7001)
Dec 07 19:07:05 coreos01 etcd[1084]: [etcd] Dec  7 19:07:05.504 INFO      | etcd server [name 05f35eb2d608440585bbfaa214fa36b2, listen on :4001, advertised url http://127.0.0.1:4001]
Dec 07 19:07:05 coreos01 etcd[1084]: [etcd] Dec  7 19:07:05.505 INFO      | peer server [name 05f35eb2d608440585bbfaa214fa36b2, listen on :7001, advertised url http://127.0.0.1:7001]
Dec 07 19:07:05 coreos01 etcd[1084]: [etcd] Dec  7 19:07:05.505 INFO      | 05f35eb2d608440585bbfaa214fa36b2 starting in peer mode
Dec 07 19:07:05 coreos01 etcd[1084]: [etcd] Dec  7 19:07:05.506 INFO      | 05f35eb2d608440585bbfaa214fa36b2: state changed from 'initialized' to 'follower'.
Dec 07 19:07:05 coreos01 etcd[1084]: [etcd] Dec  7 19:07:05.892 INFO      | 05f35eb2d608440585bbfaa214fa36b2: state changed from 'follower' to 'candidate'.
Dec 07 19:07:05 coreos01 etcd[1084]: [etcd] Dec  7 19:07:05.892 INFO      | 05f35eb2d608440585bbfaa214fa36b2: state changed from 'candidate' to 'leader'.
Dec 07 19:07:05 coreos01 etcd[1084]: [etcd] Dec  7 19:07:05.892 INFO      | 05f35eb2d608440585bbfaa214fa36b2: leader changed from '' to '05f35eb2d608440585bbfaa214fa36b2'.
Dec 07 19:07:05 coreos01 fleetd[770]: INFO engine.go:185: Engine leadership acquired
Dec 07 19:07:05 coreos01 fleetd[770]: INFO engine.go:81: Engine leadership changed from e1e6eed49a8e4b698c29cfd39d3f3ce5 to 05f35eb2d608440585bbfaa214fa36b2
Dec 07 19:07:05 coreos01 fleetd[770]: INFO manager.go:138: Triggered systemd unit postgres-db-data@54333.service stop: job=3394
Dec 07 19:07:05 coreos01 fleetd[770]: INFO manager.go:259: Removing systemd unit postgres-db-data@54333.service
Dec 07 19:07:06 coreos01 fleetd[770]: INFO manager.go:138: Triggered systemd unit postgres-db@54333.service stop: job=3396
Dec 07 19:07:06 coreos01 fleetd[770]: INFO manager.go:259: Removing systemd unit postgres-db@54333.service
Dec 07 19:07:06 coreos01 fleetd[770]: INFO manager.go:182: Instructing systemd to reload units
Dec 07 19:07:06 coreos01 fleetd[770]: INFO reconcile.go:330: AgentReconciler completed task: type=UnloadUnit job=postgres-db-data@54333.service reason="unit loaded but not scheduled here"
Dec 07 19:07:06 coreos01 fleetd[770]: INFO reconcile.go:330: AgentReconciler completed task: type=UnloadUnit job=postgres-db@54333.service reason="unit loaded but not scheduled here"
Dec 07 19:07:06 coreos01 fleetd[770]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Dec 07 19:07:08 coreos01 etcd[1084]: [etcd] Dec  7 19:07:08.538 INFO      | 05f35eb2d608440585bbfaa214fa36b2: snapshot of 12331 events at index 12331 completed
Dec 07 19:12:05 coreos01 fleetd[770]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.
Dec 07 19:12:05 coreos01 fleetd[770]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: unexpected EOF
Dec 07 19:12:15 coreos01 fleetd[770]: ERROR event.go:106: etcd watcher /_coreos.com/fleet/job returned error: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint.

node-01

-- Logs begin at Mon 2015-12-07 17:17:44 UTC. --
Dec 07 18:46:14 coreos01 etcd2[836]: the connection with bc399ca97836b0e1 became active
Dec 07 18:46:31 coreos01 etcd2[836]: the clock difference against peer bc399ca97836b0e1 is too high [1.097446793s > 1s]
Dec 07 18:53:24 coreos01 fleetd[827]: INFO manager.go:138: Triggered systemd unit postgres-db-data@5434.service stop: job=9910
Dec 07 18:53:24 coreos01 fleetd[827]: INFO manager.go:259: Removing systemd unit postgres-db-data@5434.service
Dec 07 18:53:24 coreos01 fleetd[827]: INFO manager.go:138: Triggered systemd unit postgres-db@5434.service stop: job=9912
Dec 07 18:53:24 coreos01 fleetd[827]: INFO manager.go:259: Removing systemd unit postgres-db@5434.service
Dec 07 18:53:24 coreos01 fleetd[827]: INFO manager.go:182: Instructing systemd to reload units
Dec 07 18:53:24 coreos01 fleetd[827]: INFO reconcile.go:330: AgentReconciler completed task: type=UnloadUnit job=postgres-db-data@5434.service reason="unit loaded but not scheduled here"
Dec 07 18:53:24 coreos01 fleetd[827]: INFO reconcile.go:330: AgentReconciler completed task: type=UnloadUnit job=postgres-db@5434.service reason="unit loaded but not scheduled here"
Dec 07 18:53:24 coreos01 fleetd[827]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Dec 07 19:07:05 coreos01 etcd2[836]: the connection with bc399ca97836b0e1 became inactive
Dec 07 19:07:05 coreos01 etcd2[836]: failed to dial bc399ca97836b0e1 on stream MsgApp v2 (dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:05 coreos01 etcd2[836]: failed to dial bc399ca97836b0e1 on stream Message (dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:10 coreos01 etcd2[836]: failed to heartbeat bc399ca97836b0e1 on stream MsgApp v2 (write tcp 10.134.197.190:59846: broken pipe)
Dec 07 19:07:10 coreos01 etcd2[836]: failed to heartbeat bc399ca97836b0e1 on stream Message (write tcp 10.134.197.190:59848: broken pipe)
Dec 07 19:07:31 coreos01 etcd2[836]: the connection to peer bc399ca97836b0e1 is unhealthy
Dec 07 19:07:32 coreos01 fleetd[827]: INFO manager.go:246: Writing systemd unit postgres-db@54333.service (749b)
Dec 07 19:07:32 coreos01 fleetd[827]: INFO manager.go:182: Instructing systemd to reload units
Dec 07 19:07:33 coreos01 fleetd[827]: ERROR manager.go:129: Failed to trigger systemd unit postgres-db@54333.service start: Unit postgres-db-data@54333.service failed to load: No such file or directory.
Dec 07 19:07:33 coreos01 fleetd[827]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=postgres-db@54333.service reason="unit scheduled here but not loaded"
Dec 07 19:07:33 coreos01 fleetd[827]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Dec 07 19:07:33 coreos01 fleetd[827]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=postgres-db@54333.service reason="unit currently loaded but desired state is launched"
Dec 07 19:07:34 coreos01 fleetd[827]: INFO manager.go:246: Writing systemd unit postgres-db-data@54333.service (677b)
Dec 07 19:07:34 coreos01 fleetd[827]: INFO manager.go:182: Instructing systemd to reload units
Dec 07 19:07:34 coreos01 fleetd[827]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=postgres-db-data@54333.service reason="unit scheduled here but not loaded"
Dec 07 19:07:34 coreos01 fleetd[827]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Dec 07 19:08:01 coreos01 etcd2[836]: the connection to peer bc399ca97836b0e1 is unhealthy
Dec 07 19:08:31 coreos01 etcd2[836]: the connection to peer bc399ca97836b0e1 is unhealthy
Dec 07 19:09:01 coreos01 etcd2[836]: the connection to peer bc399ca97836b0e1 is unhealthy
Dec 07 19:09:31 coreos01 etcd2[836]: the connection to peer bc399ca97836b0e1 is unhealthy
Dec 07 19:10:01 coreos01 etcd2[836]: the connection to peer bc399ca97836b0e1 is unhealthy
Dec 07 19:10:31 coreos01 etcd2[836]: the connection to peer bc399ca97836b0e1 is unhealthy
Dec 07 19:11:01 coreos01 etcd2[836]: the connection to peer bc399ca97836b0e1 is unhealthy
Dec 07 19:11:31 coreos01 etcd2[836]: the connection to peer bc399ca97836b0e1 is unhealthy
Dec 07 19:11:55 coreos01 etcd2[836]: start to snapshot (applied: 20002, lastsnap: 10001)
Dec 07 19:11:55 coreos01 etcd2[836]: saved snapshot at index 20002
Dec 07 19:11:55 coreos01 etcd2[836]: compacted raft log at 15002
Dec 07 19:12:01 coreos01 etcd2[836]: the connection to peer bc399ca97836b0e1 is unhealthy
Dec 07 19:12:31 coreos01 etcd2[836]: the connection to peer bc399ca97836b0e1 is unhealthy
Dec 07 19:13:01 coreos01 etcd2[836]: the connection to peer bc399ca97836b0e1 is unhealthy
Dec 07 19:13:31 coreos01 etcd2[836]: the connection to peer bc399ca97836b0e1 is unhealthy
Dec 07 19:14:01 coreos01 etcd2[836]: the connection to peer bc399ca97836b0e1 is unhealthy
Dec 07 19:14:31 coreos01 etcd2[836]: the connection to peer bc399ca97836b0e1 is unhealthy
Dec 07 19:15:01 coreos01 etcd2[836]: the connection to peer bc399ca97836b0e1 is unhealthy
Dec 07 19:15:31 coreos01 etcd2[836]: the connection to peer bc399ca97836b0e1 is unhealthy

node-03

Dec 07 18:46:14 coreos01 etcd2[835]: failed to write bc399ca97836b0e1 on pipeline (dial tcp 10.134.197.190:2380: connection refused)
Dec 07 18:46:14 coreos01 etcd2[835]: the connection with bc399ca97836b0e1 became active
Dec 07 18:46:15 coreos01 etcd2[835]: the connection with bc399ca97836b0e1 became inactive
Dec 07 18:46:15 coreos01 etcd2[835]: failed to write bc399ca97836b0e1 on stream MsgApp v2 (write tcp 10.134.197.190:45848: broken pipe)
Dec 07 18:46:15 coreos01 etcd2[835]: the connection with bc399ca97836b0e1 became active
Dec 07 18:46:30 coreos01 etcd2[835]: the clock difference against peer bc399ca97836b0e1 is too high [1.103146001s > 1s]
Dec 07 18:53:24 coreos01 fleetd[826]: INFO engine.go:256: Unscheduled Job(postgres-db-data@5434.service) from Machine(790889ce139d4222ad95ccf49805735e)
Dec 07 18:53:24 coreos01 fleetd[826]: INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: postgres-db-data@5434.service, MachineID: 790889ce139d4222ad95ccf49805735e, Reason: "target Machine(790889ce139d4222ad95ccf49805735e) unable to run unit"}
Dec 07 18:53:24 coreos01 fleetd[826]: INFO engine.go:256: Unscheduled Job(postgres-db@5434.service) from Machine(790889ce139d4222ad95ccf49805735e)
Dec 07 18:53:24 coreos01 fleetd[826]: INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: postgres-db@5434.service, MachineID: 790889ce139d4222ad95ccf49805735e, Reason: "target state inactive"}
Dec 07 19:05:13 coreos01 fleetd[826]: INFO engine.go:271: Scheduled Unit(postgres-db@54333.service) to Machine(05f35eb2d608440585bbfaa214fa36b2)
Dec 07 19:05:13 coreos01 fleetd[826]: INFO reconciler.go:161: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: postgres-db@54333.service, MachineID: 05f35eb2d608440585bbfaa214fa36b2, Reason: "target state loaded and unit not scheduled"}
Dec 07 19:05:13 coreos01 fleetd[826]: INFO engine.go:271: Scheduled Unit(postgres-db-data@54333.service) to Machine(05f35eb2d608440585bbfaa214fa36b2)
Dec 07 19:05:13 coreos01 fleetd[826]: INFO reconciler.go:161: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: postgres-db-data@54333.service, MachineID: 05f35eb2d608440585bbfaa214fa36b2, Reason: "target state loaded and unit not scheduled"}
Dec 07 19:07:05 coreos01 etcd2[835]: the connection with bc399ca97836b0e1 became inactive
Dec 07 19:07:05 coreos01 etcd2[835]: failed to dial bc399ca97836b0e1 on stream MsgApp v2 (dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:05 coreos01 etcd2[835]: failed to dial bc399ca97836b0e1 on stream Message (dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:05 coreos01 etcd2[835]: failed to write bc399ca97836b0e1 on stream MsgApp v2 (write tcp 10.134.197.190:45968: broken pipe)
Dec 07 19:07:05 coreos01 etcd2[835]: failed to heartbeat bc399ca97836b0e1 on stream Message (write tcp 10.134.197.190:45970: broken pipe)
Dec 07 19:07:05 coreos01 etcd2[835]: failed to write bc399ca97836b0e1 on pipeline (dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:06 coreos01 etcd2[835]: failed to reach the peerURL(http://10.134.197.190:2380) of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:06 coreos01 etcd2[835]: cannot get the version of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:11 coreos01 etcd2[835]: failed to reach the peerURL(http://10.134.197.190:2380) of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:11 coreos01 etcd2[835]: cannot get the version of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:16 coreos01 etcd2[835]: failed to reach the peerURL(http://10.134.197.190:2380) of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:16 coreos01 etcd2[835]: cannot get the version of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:21 coreos01 etcd2[835]: failed to reach the peerURL(http://10.134.197.190:2380) of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:21 coreos01 etcd2[835]: cannot get the version of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:26 coreos01 etcd2[835]: failed to reach the peerURL(http://10.134.197.190:2380) of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:26 coreos01 etcd2[835]: cannot get the version of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:30 coreos01 etcd2[835]: the connection to peer bc399ca97836b0e1 is unhealthy
Dec 07 19:07:31 coreos01 etcd2[835]: failed to reach the peerURL(http://10.134.197.190:2380) of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:31 coreos01 etcd2[835]: cannot get the version of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:31 coreos01 fleetd[826]: INFO engine.go:256: Unscheduled Job(postgres-db@54333.service) from Machine(05f35eb2d608440585bbfaa214fa36b2)
Dec 07 19:07:31 coreos01 fleetd[826]: INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: postgres-db@54333.service, MachineID: 05f35eb2d608440585bbfaa214fa36b2, Reason: "target Machine(05f35eb2d608440585bbfaa214fa36b2) went away"}
Dec 07 19:07:31 coreos01 fleetd[826]: INFO engine.go:256: Unscheduled Job(postgres-db-data@54333.service) from Machine(05f35eb2d608440585bbfaa214fa36b2)
Dec 07 19:07:31 coreos01 fleetd[826]: INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: postgres-db-data@54333.service, MachineID: 05f35eb2d608440585bbfaa214fa36b2, Reason: "target Machine(05f35eb2d608440585bbfaa214fa36b2) went away"}
Dec 07 19:07:31 coreos01 fleetd[826]: INFO engine.go:271: Scheduled Unit(postgres-db@54333.service) to Machine(790889ce139d4222ad95ccf49805735e)
Dec 07 19:07:31 coreos01 fleetd[826]: INFO reconciler.go:161: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: postgres-db@54333.service, MachineID: 790889ce139d4222ad95ccf49805735e, Reason: "target state launched and unit not scheduled"}
Dec 07 19:07:33 coreos01 fleetd[826]: INFO engine.go:271: Scheduled Unit(postgres-db-data@54333.service) to Machine(790889ce139d4222ad95ccf49805735e)
Dec 07 19:07:33 coreos01 fleetd[826]: INFO reconciler.go:161: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: postgres-db-data@54333.service, MachineID: 790889ce139d4222ad95ccf49805735e, Reason: "target state loaded and unit not scheduled"}
Dec 07 19:07:36 coreos01 etcd2[835]: failed to reach the peerURL(http://10.134.197.190:2380) of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:36 coreos01 etcd2[835]: cannot get the version of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:41 coreos01 etcd2[835]: failed to reach the peerURL(http://10.134.197.190:2380) of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:41 coreos01 etcd2[835]: cannot get the version of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:46 coreos01 etcd2[835]: failed to reach the peerURL(http://10.134.197.190:2380) of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:46 coreos01 etcd2[835]: cannot get the version of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:51 coreos01 etcd2[835]: failed to reach the peerURL(http://10.134.197.190:2380) of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:51 coreos01 etcd2[835]: cannot get the version of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:56 coreos01 etcd2[835]: failed to reach the peerURL(http://10.134.197.190:2380) of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:07:56 coreos01 etcd2[835]: cannot get the version of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:08:00 coreos01 etcd2[835]: the connection to peer bc399ca97836b0e1 is unhealthy
Dec 07 19:08:01 coreos01 etcd2[835]: failed to reach the peerURL(http://10.134.197.190:2380) of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:08:01 coreos01 etcd2[835]: cannot get the version of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:08:06 coreos01 etcd2[835]: failed to reach the peerURL(http://10.134.197.190:2380) of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:08:06 coreos01 etcd2[835]: cannot get the version of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:08:11 coreos01 etcd2[835]: failed to reach the peerURL(http://10.134.197.190:2380) of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:08:11 coreos01 etcd2[835]: cannot get the version of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:08:16 coreos01 etcd2[835]: failed to reach the peerURL(http://10.134.197.190:2380) of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:08:16 coreos01 etcd2[835]: cannot get the version of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:08:21 coreos01 etcd2[835]: failed to reach the peerURL(http://10.134.197.190:2380) of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:08:21 coreos01 etcd2[835]: cannot get the version of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:08:26 coreos01 etcd2[835]: failed to reach the peerURL(http://10.134.197.190:2380) of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:08:26 coreos01 etcd2[835]: cannot get the version of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:08:30 coreos01 etcd2[835]: the connection to peer bc399ca97836b0e1 is unhealthy
Dec 07 19:08:31 coreos01 etcd2[835]: failed to reach the peerURL(http://10.134.197.190:2380) of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:08:31 coreos01 etcd2[835]: cannot get the version of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)
Dec 07 19:08:36 coreos01 etcd2[835]: failed to reach the peerURL(http://10.134.197.190:2380) of member bc399ca97836b0e1 (Get http://10.134.197.190:2380/version: dial tcp 10.134.197.190:2380: connection refused)

The node where fleetctl start... is called ends up leaving the cluster. It appear member nodes are no longer able to reach this node.

@mischief
Copy link
Contributor

mischief commented Dec 7, 2015

@ncodes looks like you somehow start both etcd and etcd2. replace instances of etcd.service with etcd2.service and see if that helps.

@ncodes
Copy link
Author

ncodes commented Dec 7, 2015

@mischief All instances were started with etcd2. Here is my cloud config content.

#cloud-config

hostname: coreos01
coreos:
  etcd2:
    discovery: https://discovery.etcd.io/70109903408afc4c306883654ae51cff
    advertise-client-urls: http://$private_ipv4:2379,http://$private_ipv4:4001
    initial-advertise-peer-urls: http://$private_ipv4:2380
    listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001
    listen-peer-urls: http://$private_ipv4:2380

  units:
    - name: etcd2.service
      command: start
    - name: fleet.service
      command: start

CoreOS version: 835.8.0 (stable)

@scruplelesswizard
Copy link

I would suggest that if you don't require etcd that you mask the service in your future by using:

-name: etcd.service
 mask: yes

I know that I have accidentally run commands against etcd instead of etcd2 more than once. Also, if you don't use any etcd v1 clients then you can remove references to port 4001.

See https://coreos.com/os/docs/latest/cloud-config.html#units for more information

Are you running on a cloud provider?

@ncodes
Copy link
Author

ncodes commented Dec 7, 2015

@thecanadianbaker I am currently trying out CoreOS on DigitalOcean. I will update my cloud config and let you know how it goes

@mischief
Copy link
Contributor

mischief commented Dec 7, 2015

@ncodes in the units you first posted, you have

Requires=etcd.service
...
After=etcd.service

this should probably be etcd2.service, otherwise etcd.service will get started and your etcd2 cluster will split.

@scruplelesswizard
Copy link

And that is exactly why I mask etcd... I completely missed that. Thank's for actually reading @mischief

@ncodes
Copy link
Author

ncodes commented Dec 8, 2015

@thecanadianbaker @mischief Masking etcd solved the problem. The unit where able to start and the cluster did not split. However, I wasn't able to connect to the main unit from a postgres client.

fleetctl journal postgres-db@5434.service returns:

-- Logs begin at Mon 2015-12-07 21:18:59 UTC, end at Tue 2015-12-08 11:37:09 UTC. --
Dec 08 11:30:19 coreos-01 docker[13955]: Status: Image is up to date for ncodes/postgres:latest
Dec 08 11:30:19 coreos-01 systemd[1]: Started A postgres database. Uses postgres-db-data@.service as a data container..
Dec 08 11:30:19 coreos-01 docker[14001]: Warning: '-volumes-from' is deprecated, it will be replaced by '--volumes-from' soon. See usage.
Dec 08 11:30:20 coreos-01 docker[14001]: LOG:  database system was interrupted; last known up at 2015-12-08 11:30:20 UTC
Dec 08 11:30:20 coreos-01 docker[14001]: LOG:  database system was not properly shut down; automatic recovery in progress
Dec 08 11:30:20 coreos-01 docker[14001]: LOG:  record with zero length at 0/115E940
Dec 08 11:30:20 coreos-01 docker[14001]: LOG:  redo is not required
Dec 08 11:30:20 coreos-01 docker[14001]: LOG:  MultiXact member wraparound protections are now enabled
Dec 08 11:30:20 coreos-01 docker[14001]: LOG:  database system is ready to accept connections
Dec 08 11:30:20 coreos-01 docker[14001]: LOG:  autovacuum launcher started

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants