- Major revamp of watchbot internals. (refs #184). The system now:
  - Relies on an ECS service for scaling
  - Provides users with metrics on CPU and memory utilization of all containers
  - Re-uses the same containers to process multiple jobs, reducing overhead
- Clearer error messages from the CLI tool for bad user input.
- Adds a log message if the watcher receives an SQS message that it has already launched a task for, and is still waiting to learn whether that task succeeded or failed.
- Upon receiving a duplicate message, the watcher checks if the in-flight task is in `PENDING` state. If so, it stops the task and returns the message to SQS for a retry.
- Fixes `DeadLetterAlarm` thresholding: changes `ComparisonOperator` from `GreaterThanThreshold` to `GreaterThanOrEqualToThreshold` so that the alarm is triggered when a single message is sent to the DeadLetterQueue.
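
The effect of the operator change can be sketched in a few lines of JavaScript. This is an illustration only, not Watchbot's or CloudWatch's code: the `breaches` helper is hypothetical, and the threshold of 1 is an assumption, since the entry does not state the configured value.

```javascript
// Hypothetical helper that mimics how an alarm compares a metric to its threshold.
function breaches(operator, value, threshold) {
  switch (operator) {
    case 'GreaterThanThreshold':
      return value > threshold;
    case 'GreaterThanOrEqualToThreshold':
      return value >= threshold;
    default:
      throw new Error('Unsupported operator: ' + operator);
  }
}

// ASSUMPTION: a threshold of 1 visible message in the dead letter queue.
const threshold = 1;
const visibleMessages = 1;

// Old operator: a single message does not breach the threshold.
const before = breaches('GreaterThanThreshold', visibleMessages, threshold);
// New operator: a single message breaches the threshold.
const after = breaches('GreaterThanOrEqualToThreshold', visibleMessages, threshold);

console.log({ before, after });
```

Under that assumed threshold, the strict comparison would need a second message before the alarm could fire, which is the behavior this fix addresses.
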
- Makes `EvaluationPeriods` for `FailedWorkerPlacementAlarm` customizable
- Adds `.ref.notificationTopic` to the output from `watchbot.template()`
- Adjusts watcher permissions on RunTask so that it can only launch its own Worker tasks.
- Adds a configuration option to specify `placementConstraints` of watchbot's task definitions
- Adds a configuration option to specify a `Family` property of watchbot's task definitions
- Adjusts CloudWatch Event Rule names to allow stacks to include multiple sets of watchbot resources
- Adjusts log group names to allow stacks to include multiple sets of watchbot resources
  - LogGroup names are now `${stack-name}-${region}-${prefix}`, where `prefix` defaults to `watchbot` if not otherwise specified.
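
The naming pattern with a configurable prefix can be illustrated with a small sketch. The `logGroupName` helper and the example stack, region, and prefix values are hypothetical; the actual template presumably assembles the name with CloudFormation intrinsics rather than plain string joins.

```javascript
// Hypothetical helper; only the naming pattern and the default prefix
// ("watchbot") come from the changelog entry.
function logGroupName(stackName, region, prefix) {
  const effectivePrefix = prefix || 'watchbot';
  return [stackName, region, effectivePrefix].join('-');
}

// Two sets of watchbot resources in one stack get distinct log groups.
const defaultGroup = logGroupName('my-stack', 'us-east-1');
const customGroup = logGroupName('my-stack', 'us-east-1', 'ingest');

console.log(defaultGroup);
console.log(customGroup);
```
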
- BREAKING changes to the format with which CloudWatch LogGroups and streams are named. These should be considered breaking changes because upgrading a stack from v2.x to v3.x in-place will result in CloudFormation conflicts. Circumvent the conflicts by manually deleting the existing log group before running the CloudFormation update.
  - LogGroup names are now `${stack-name}-${region}`
  - Streams are now prefixed with `${service-version}` (a GitSha in most cases)
- More permissive engines.node
- Fixes a regression in 2.5.0, allowing watcher containers to launch workers with new family names.
- Task definitions created by Watchbot's `.template(options)` function will now use `options.service` as the task definition's family.
- Upgrade node.js runtime to 4.3 for webhook function
- Add quotes around `$@` operator in the watchbot-progress.sh script to preserve spaces in metadata arguments #142
- Add metric for the amount of time the task spent in `PENDING` state.
- find watchbot-progress's path using `require.resolve` to work with Yarn's flat dependency tree #131
- set ulimit to 10240 in the container definition
- always uses exponential backoff when returning work messages to SQS
- fixes error handling for `Cannot*ContainerError`
- no-op
- stale messages in the TaskEventQueue will be dropped after 20 minutes
- watcher runs on ubuntu 16.04 LTS
- `CannotStartContainerError`, `CannotPullContainerError` and `DockerTimeoutError` errors do not cause notifications when AlarmOnEveryError is set
- Removes `-event-target` from the ID of the cloudwatch events filter to make it shorter. refs #119
- fixes a bug in the changelog
- consolidates CLI commands into a single `watchbot` command
- adds a CLI command for interacting with the dead letter queue. Note that you cannot use the CLI unless you're working with a 2.1.0+ stack.
- fixes a bug that wouldn't have allowed you to disable exponential backoff
- returns `task.container[n].reason` as `reason` when task finishes, if available
- adds a second SQS queue used for the watcher's internal tracking of CloudWatch task state-change events
- adds ephemeral, or non-persistent, volume compatibility (see AWS's task data volume documentation)
- adds mount point object compatibility for cloudfriend operators, and any other operators that use semicolons and commas
- adds a `worker-capacity` script to estimate how many additional worker tasks can be placed in your service's cluster at its current capacity
- adds CloudWatch metrics for worker errors (non-zero exit codes), failed worker container placement, worker duration, watcher concurrency, and message receive counts
- adds an alarm for number of worker errors in 60s, configurable through `watchbot.template(options)` `.errorThreshold`. Defaults to alarms after 10 failures per minute.
- drops polling of DescribeTasks API to learn when workers are completed
- BREAKING removes cluster resource polling - workers will try to be placed and fail instead of avoiding placement attempts
- BREAKING by default, watchbot no longer sends notification emails each time a worker errors. You can opt-in to this behavior by setting `watchbot.template(options)` `.alarmOnEachFailure: true`.
- BREAKING no longer sends notifications on error interacting with SQS. Instead watchbot silently proceeds.
- BREAKING watcher log format has changed. Now watcher logs print JSON objects
- BREAKING removes `.notifyAfterRetries` option
- BREAKING removes `.backoff` option. Workers are always retried with exponential backoff
- BREAKING adds a dead letter queue. Messages received more than 14 times by a watcher container will be sent to this queue. Any visible messages in this queue will trip an alarm.
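
How always-on exponential backoff interacts with the 14-receive limit can be sketched as follows. The doubling formula and the use of seconds are assumptions made for illustration, since the changelog does not state the backoff base or unit. Both helper names are hypothetical, and only the limit of 14 receives comes from the entry.

```javascript
// ASSUMPTION: the retry delay doubles with each receive (2^n seconds).
// The changelog only says "exponential backoff"; base and unit are illustrative.
function backoffSeconds(receiveCount) {
  return Math.pow(2, receiveCount);
}

// From the changelog: messages received more than 14 times are sent
// to the dead letter queue.
const MAX_RECEIVES = 14;

function goesToDeadLetterQueue(receiveCount) {
  return receiveCount > MAX_RECEIVES;
}

// Print the assumed retry schedule for each allowed receive.
for (let n = 1; n <= MAX_RECEIVES; n++) {
  console.log('receive ' + n + ': retry after ' + backoffSeconds(n) + 's');
}
```
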
- adds `options.reservation.softMemory` which allows the caller to set up a soft memory reservation on worker tasks
- bump watchbot-progress to v1.1.1, handles a bug in checking part status on a completed job
- move to @mapbox/watchbot, use MemoryReservation soft limit for the Watcher task
- update and switch to namespaced package for `@mapbox/watchbot-progress`
- reimplement and fix `NotifyAfterRetries` as a watcher environment variable
- fix a bug where `NotifyAfterRetries` was still expected in watcher container environment
- adds duration (in seconds) to watcher log output when tasks complete
- fix bug with `NotifyAfterRetries` where the environment variable was set in the watcher container, not the worker.
- adds `options.privileged` parameter to watchbot's template
- Adds `.ref.queueName` to the output from `watchbot.template()`
- Clarifies watcher log messages conveying outcome when tasks finish
- Fixes a bug where task launching could fail due to a `startedBy` name longer than 36 characters
- Adds support for us-east-2 (Ohio)
- Allows `options.logAggregationFunction` to reference a potentially empty stack parameter
- Adds event emitter to signal when cluster instances have been identified
- Adds error emitter to signal when there are no cluster instances
- Adds readCapacityUnits & writeCapacityUnits configurable watchbot.template option params
- Adds error handling for log line >50kb edge case
- Exposes notifyAfterRetry concept to retry jobs before sending alarms
- Adds pagination for describeContainerInstances
- Adds watchbot-progress dependency
- Adds support for ap-* regions by adding regional mapping for worker/watcher images assuming ecs-conex is doing your image packaging.
- Fix bug where watchbot would not retry running a task if it encountered a RESOURCE:CPU constraint error.
- Breaking requires KMS key under the CF export `cloudformation-kms-production` to grant worker tasks permission to decrypt secure environment variables. See README and https://github.com/mapbox/cloudformation-kms, https://github.com/mapbox/decrypt-kms-env.
- Fix potential race condition when creating `LogForwarding`
- Adds EcsWatchbotVersion to template Metadata
- allow `workers` and `backoff` to be a ref
- adds `options.debugLogs` to enable verbose logging
- adds log stream prefix to organize worker/watcher logs better
- fix for worker role in reduce mode
- fixes a bug that could produce an invalid template if no memory reservation is specified. New default memory is 64MB
- fixes a bug that limited a watcher to maintaining at most 100 concurrent workers
- adds `reduce` option to `watchbot.template()` for tracking map-reduce operations
- adds example recipes for workers using `reduce` mode
- Breaking changes the `startedBy` attribute of worker tasks to the stack's name
- fixes a bug where `options.command` would break the watcher
- adds `.ref.queueUrl` and `.ref.queueArn` references to object returned by `watchbot.template()`
- automatically provide workers with permission to publish to watchbot's SNS topic
- adds `watchbot.logStream`, a node.js writable stream for prefixing logs
- Breaking changes the name of the SQS queue, making it a bit easier to find in the console
- Breaking switch to TaskRole instead of grafting permissions onto a predefined role
- fixes a template generation bug for callers that do not use mount points
- adds `logAggregationFunction` argument to watchbot.template
- allow caller to set container CMD
- template validation, cleanups, default watchbot version
- overhauls template building process, providing scripts that expose Watchbot's resources as JavaScript objects
- container logs are sent from Docker to CloudWatch Logs instead of syslog
- a watchbot stack creates its own CloudWatch LogGroup and sends all container logs to it
- on task failure, reads recent container logs from CloudWatch and includes them in notifications
- adds helper functions to run as part of the worker which help generate homogeneous, searchable log output
- silences `[status]` log messages unless logLevel is set to `debug`
- improved message body in notifications sent when tasks fail
- logs are sent to syslog instead of to a file assumed to be mounted from the host machine
- new template builder arguments to only include certain resources (e.g. webhooks) if you ask for them
- watcher pays attention to cluster resource reservation, avoids polling the queue when the cluster is fully utilized, and retries runTask requests if a request fails due to lack of memory.
- template sets up watcher permissions such that updates to the worker's task definition will not lead to permissions failures in the midst of a deploy
- watcher logs include message subject and body
- gracefully return messages to the queue if the ECS API fails to run a task
- handle situations where a single watcher receives the same message twice
- adjust alarm description in CloudFormation template
- First sketch of Watchbot on ECS