Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics: change vtbackup_duration_by_phase to binary-valued vtbackup_phase #12973

Merged
24 changes: 23 additions & 1 deletion changelog/18.0/18.0.0/summary.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,8 +16,10 @@
- [Deleted `V3` planner](#deleted-v3)
- [Deleted `k8stopo`](#deleted-k8stopo)
- [Deleted `vtgr`](#deleted-vtgr)
- [Deprecated VTBackup stat `DurationByPhase`](#deprecated-vtbackup-stat-duration-by-phase)
- **[New stats](#new-stats)**
- [VTGate Vindex unknown parameters](#vtgate-vindex-unknown-parameters)
- [VTBackup stat `Phase`](#vtbackup-stat-phase)
- [VTBackup stat `PhaseStatus`](#vtbackup-stat-phase-status)
- **[VTTablet](#vttablet)**
- [VTTablet: New ResetSequences RPC](#vttablet-new-rpc-reset-sequences)
Expand Down Expand Up @@ -115,12 +117,32 @@ the `k8stopo` has been removed.

The `vtgr` has been deprecated in Vitess 17, also see https://github.com/vitessio/vitess/issues/13300. With Vitess 18 `vtgr` has been removed.

#### <a id="deprecated-vtbackup-stat-duration-by-phase"/>Deprecated VTbackup stat `DurationByPhase`

VTBackup stat `DurationByPhase` is deprecated. Use the binary-valued `Phase` stat instead.

### <a id="new-stats"/>New stats

#### <a id="vtgate-vindex-unknown-parameters"/>VTGate Vindex unknown parameters

The VTGate stat `VindexUnknownParameters` gauges unknown Vindex parameters found in the latest VSchema pulled from the topology.

#### <a id="vtbackup-stat-phase"/>VTBackup `Phase` stat

In v17, the `vtbackup` stat `DurationByPhase` stat was added measuring the time spent by `vtbackup` in each phase. This stat turned out to be awkward to use in production, and has been replaced in v18 by a binary-valued `Phase` stat.

`Phase` reports a 1 (active) or a 0 (inactive) for each of the following phases:

* `CatchUpReplication`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would have made this Catchup instead of CatchUp.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good nit, done!

* `InitialBackup`
* `RestoreLastBackup`
* `TakeNewBackup`

To calculate how long `vtbackup` has spent in a given phase, sum the 1-valued data points over time and multiply by the data collection or reporting interval. For example, in Prometheus:

```
sum_over_time(vtbackup_phase{phase="TakeNewBackup"}) * <interval>
```
#### <a id="vtbackup-stat-phase-status"/>VTBackup `PhaseStatus` stat

`PhaseStatus` reports a 1 (active) or a 0 (inactive) for each of the following phases and statuses:
Expand Down Expand Up @@ -165,4 +187,4 @@ removing Vitess support.

#### <a id="new-durability-policies"/>New Durability Policies

2 new inbuilt durability policies have been added to Vitess in this release namely `semi_sync_with_rdonly_ack` and `cross_cell_with_rdonly_ack`. These policies are exactly like `semi_sync` and `cross_cell` respectively, and differ just in the part where the rdonly tablets can also send semi-sync ACKs.
2 new inbuilt durability policies have been added to Vitess in this release namely `semi_sync_with_rdonly_ack` and `cross_cell_with_rdonly_ack`. These policies are exactly like `semi_sync` and `cross_cell` respectively, and differ just in the part where the rdonly tablets can also send semi-sync ACKs.
25 changes: 25 additions & 0 deletions go/cmd/vtbackup/plugin_opentsdb.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
/*
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to test for vtbackup-specific metrics in an E2E test, used opentsdb plugin and write stats to file for later verification.

Copyright 2023 The Vitess Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package main

import "vitess.io/vitess/go/stats/opentsdb"

// This plugin imports opentsdb to register the opentsdb stats backend.

func init() {
opentsdb.Init("vtbackup")
}
59 changes: 51 additions & 8 deletions go/cmd/vtbackup/vtbackup.go
Original file line number Diff line number Diff line change
Expand Up @@ -101,6 +101,9 @@ const (
operationTimeout = 1 * time.Minute

phaseNameCatchUpReplication = "CatchUpReplication"
phaseNameInitialBackup = "InitialBackup"
phaseNameRestoreLastBackup = "RestoreLastBackup"
phaseNameTakeNewBackup = "TakeNewBackup"
phaseStatusCatchUpReplicationStalled = "Stalled"
phaseStatusCatchUpReplicationStopped = "Stopped"
)
Expand All @@ -127,11 +130,35 @@ var (
detachedMode bool
keepAliveTimeout = 0 * time.Second
disableRedoLog = false
durationByPhase = stats.NewGaugesWithSingleLabel(

// Deprecated, use "Phase" instead.
deprecatedDurationByPhase = stats.NewGaugesWithSingleLabel(
"DurationByPhaseSeconds",
"How long it took vtbackup to perform each phase (in seconds).",
"[DEPRECATED] How long it took vtbackup to perform each phase (in seconds).",
"phase",
)

// This gauge is updated 3*N times during the course of a vtbackup run,
// where N is the number of different phases vtbackup transitions through.
// Once to initialize to 0, another time to set the phase to active (1),
// and another to deactivate the phase (back to 0).
//
// At most a single phase is active at a given time.
//
// The sync gauge immediately reports changes to push-backed backends.
// The benefit of the sync gauge is that it makes verifying stats in
// integration tests a lot more tractable.
phase = stats.NewSyncGaugesWithSingleLabel(
"Phase",
"Active phase.",
"phase",
)
phaseNames = []string{
phaseNameCatchUpReplication,
phaseNameInitialBackup,
phaseNameRestoreLastBackup,
phaseNameTakeNewBackup,
}
phaseStatus = stats.NewGaugesWithMultiLabels(
"PhaseStatus",
"Internal state of vtbackup phase.",
Expand Down Expand Up @@ -219,6 +246,9 @@ func main() {
defer topoServer.Close()

// Initialize stats.
for _, phaseName := range phaseNames {
phase.Set(phaseName, int64(0))
}
for phaseName, statuses := range phaseStatuses {
for _, status := range statuses {
phaseStatus.Set([]string{phaseName, status}, 0)
Expand Down Expand Up @@ -294,7 +324,7 @@ func takeBackup(ctx context.Context, topoServer *topo.Server, backupStorage back
if err := mysqld.Init(initCtx, mycnf, initDBSQLFile); err != nil {
return fmt.Errorf("failed to initialize mysql data dir and start mysqld: %v", err)
}
durationByPhase.Set("InitMySQLd", int64(time.Since(initMysqldAt).Seconds()))
deprecatedDurationByPhase.Set("InitMySQLd", int64(time.Since(initMysqldAt).Seconds()))
// Shut down mysqld when we're done.
defer func() {
// Be careful not to use the original context, because we don't want to
Expand Down Expand Up @@ -358,14 +388,19 @@ func takeBackup(ctx context.Context, topoServer *topo.Server, backupStorage back

backupParams.BackupTime = time.Now()
// Now we're ready to take the backup.
phase.Set(phaseNameInitialBackup, int64(1))
defer phase.Set(phaseNameInitialBackup, int64(0))
if err := mysqlctl.Backup(ctx, backupParams); err != nil {
return fmt.Errorf("backup failed: %v", err)
}
durationByPhase.Set("InitialBackup", int64(time.Since(backupParams.BackupTime).Seconds()))
deprecatedDurationByPhase.Set("InitialBackup", int64(time.Since(backupParams.BackupTime).Seconds()))
log.Info("Initial backup successful.")
phase.Set(phaseNameInitialBackup, int64(0))
return nil
}

phase.Set(phaseNameRestoreLastBackup, int64(1))
defer phase.Set(phaseNameRestoreLastBackup, int64(0))
backupDir := mysqlctl.GetBackupDir(initKeyspace, initShard)
log.Infof("Restoring latest backup from directory %v", backupDir)
restoreAt := time.Now()
Expand Down Expand Up @@ -397,7 +432,8 @@ func takeBackup(ctx context.Context, topoServer *topo.Server, backupStorage back
default:
return fmt.Errorf("can't restore from backup: %v", err)
}
durationByPhase.Set("RestoreLastBackup", int64(time.Since(restoreAt).Seconds()))
deprecatedDurationByPhase.Set("RestoreLastBackup", int64(time.Since(restoreAt).Seconds()))
phase.Set(phaseNameRestoreLastBackup, int64(0))

// As of MySQL 8.0.21, you can disable redo logging using the ALTER INSTANCE
// DISABLE INNODB REDO_LOG statement. This functionality is intended for
Expand Down Expand Up @@ -455,6 +491,9 @@ func takeBackup(ctx context.Context, topoServer *topo.Server, backupStorage back
backupParams.BackupTime = time.Now()

// Wait for replication to catch up.
phase.Set(phaseNameCatchUpReplication, int64(1))
defer phase.Set(phaseNameCatchUpReplication, int64(0))

var (
lastStatus replication.ReplicationStatus
status replication.ReplicationStatus
Expand All @@ -479,7 +518,7 @@ func takeBackup(ctx context.Context, topoServer *topo.Server, backupStorage back
// We're caught up on replication to at least the point the primary
// was at when this vtbackup run started.
log.Infof("Replication caught up to %v after %v", status.Position, time.Since(waitStartTime))
durationByPhase.Set("CatchUpReplication", int64(time.Since(waitStartTime).Seconds()))
deprecatedDurationByPhase.Set("CatchUpReplication", int64(time.Since(waitStartTime).Seconds()))
break
}
if !lastStatus.Position.IsZero() {
Expand All @@ -499,6 +538,7 @@ func takeBackup(ctx context.Context, topoServer *topo.Server, backupStorage back
phaseStatus.Set([]string{phaseNameCatchUpReplication, phaseStatusCatchUpReplicationStopped}, 0)
}
}
phase.Set(phaseNameCatchUpReplication, int64(0))

// Stop replication and see where we are.
if err := mysqld.StopReplication(nil); err != nil {
Expand Down Expand Up @@ -539,15 +579,18 @@ func takeBackup(ctx context.Context, topoServer *topo.Server, backupStorage back
if err := mysqld.Start(ctx, mycnf); err != nil {
return fmt.Errorf("Could not start MySQL after full shutdown: %v", err)
}
durationByPhase.Set("RestartBeforeBackup", int64(time.Since(restartAt).Seconds()))
deprecatedDurationByPhase.Set("RestartBeforeBackup", int64(time.Since(restartAt).Seconds()))
}

// Now we can take a new backup.
backupAt := time.Now()
phase.Set(phaseNameTakeNewBackup, int64(1))
defer phase.Set(phaseNameTakeNewBackup, int64(0))
if err := mysqlctl.Backup(ctx, backupParams); err != nil {
return fmt.Errorf("error taking backup: %v", err)
}
durationByPhase.Set("TakeNewBackup", int64(time.Since(backupAt).Seconds()))
deprecatedDurationByPhase.Set("TakeNewBackup", int64(time.Since(backupAt).Seconds()))
phase.Set(phaseNameTakeNewBackup, int64(0))

// Return a non-zero exit code if we didn't meet the replication position
// goal, even though we took a backup that pushes the high-water mark up.
Expand Down
1 change: 1 addition & 0 deletions go/flags/endtoend/vtbackup.txt
Original file line number Diff line number Diff line change
Expand Up @@ -134,6 +134,7 @@ Usage of vtbackup:
--mysql_server_version string MySQL server version to advertise. (default "8.0.30-Vitess")
--mysql_socket string path to the mysql socket
--mysql_timeout duration how long to wait for mysqld startup (default 5m0s)
--opentsdb_uri string URI of opentsdb /api/put method
--port int port for the server
--pprof strings enable profiling
--purge_logs_interval duration how often try to remove old logs (default 1h0m0s)
Expand Down
23 changes: 23 additions & 0 deletions go/stats/counters.go
Original file line number Diff line number Diff line change
Expand Up @@ -321,6 +321,29 @@ func (g *GaugesWithSingleLabel) Set(name string, value int64) {
g.counters.set(name, value)
}

// SyncGaugesWithSingleLabel is a GaugesWithSingleLabel that proactively pushes
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is new

// stats to push-based backends when Set is called.
type SyncGaugesWithSingleLabel struct {
GaugesWithSingleLabel
name string
}

// NewSyncGaugesWithSingleLabel creates a new SyncGaugesWithSingleLabel.
func NewSyncGaugesWithSingleLabel(name, help, label string, tags ...string) *SyncGaugesWithSingleLabel {
return &SyncGaugesWithSingleLabel{
GaugesWithSingleLabel: *NewGaugesWithSingleLabel(name, help, label, tags...),
name: name,
}
}

// Set sets the value of a named gauge.
func (sg *SyncGaugesWithSingleLabel) Set(name string, value int64) {
sg.GaugesWithSingleLabel.Set(name, value)
if sg.name != "" {
_ = pushOne(sg.name, &sg.GaugesWithSingleLabel)
}
}

// GaugesWithMultiLabels is a CountersWithMultiLabels implementation where
// the values can go up and down.
type GaugesWithMultiLabels struct {
Expand Down
28 changes: 20 additions & 8 deletions go/stats/export.go
Original file line number Diff line number Diff line change
Expand Up @@ -121,6 +121,22 @@ func Publish(name string, v expvar.Var) {
publish(name, v)
}

func pushAll() error {
backend, ok := pushBackends[statsBackend]
if !ok {
return fmt.Errorf("no PushBackend registered with name %s", statsBackend)
}
return backend.PushAll()
}

func pushOne(name string, v Variable) error {
backend, ok := pushBackends[statsBackend]
if !ok {
return fmt.Errorf("no PushBackend registered with name %s", statsBackend)
}
return backend.PushOne(name, v)
}

// StringMapFuncWithMultiLabels is a multidimensional string map publisher.
//
// Map keys are compound names made with joining multiple strings with '.',
Expand Down Expand Up @@ -183,8 +199,10 @@ func publish(name string, v expvar.Var) {
// to be pushed to it. It's used to support push-based metrics backends, as expvar
// by default only supports pull-based ones.
type PushBackend interface {
// PushAll pushes all stats from expvar to the backend
// PushAll pushes all stats from expvar to the backend.
PushAll() error
// PushOne pushes a single stat from expvar to the backend.
PushOne(name string, v Variable) error
}

var pushBackends = make(map[string]PushBackend)
Expand Down Expand Up @@ -214,13 +232,7 @@ func emitToBackend(emitPeriod *time.Duration) {
ticker := time.NewTicker(*emitPeriod)
defer ticker.Stop()
for range ticker.C {
backend, ok := pushBackends[statsBackend]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to pushAll

if !ok {
log.Errorf("No PushBackend registered with name %s", statsBackend)
return
}
err := backend.PushAll()
if err != nil {
if err := pushAll(); err != nil {
// TODO(aaijazi): This might cause log spam...
log.Warningf("Pushing stats to backend %v failed: %v", statsBackend, err)
}
Expand Down
58 changes: 58 additions & 0 deletions go/stats/opentsdb/backend.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
/*
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ended up refactoring this package a bit, mostly moving stuff to their own files. Will call out which stuff was moved vs. new.

Copyright 2023 The Vitess Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package opentsdb

import (
"time"

"vitess.io/vitess/go/stats"
)

// backend implements stats.PushBackend
type backend struct {
// The prefix is the name of the binary (vtgate, vttablet, etc.) and will be
// prepended to all the stats reported.
prefix string
// Tags that should be included with every data point. If there's a tag name
// collision between the common tags and a single data point's tags, the data
// point tag will override the common tag.
commonTags map[string]string
// writer is used to send data points somewhere (file, http, ...).
writer writer
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is new. Factor HTTP-vs.-file specific stats-writing logic into writer implementations.

}

// PushAll pushes all stats to OpenTSDB
func (b *backend) PushAll() error {
collector := b.collector()
collector.collectAll()
return b.writer.Write(collector.data)
}

// PushOne pushes a single stat to OpenTSDB
func (b *backend) PushOne(name string, v stats.Variable) error {
collector := b.collector()
collector.collectOne(name, v)
return b.writer.Write(collector.data)
}

func (b *backend) collector() *collector {
Copy link
Collaborator Author

@maxenglander maxenglander May 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is new and supports multiple collection strategies (PushAll vs. PushOne).

return &collector{
commonTags: b.commonTags,
prefix: b.prefix,
timestamp: time.Now().Unix(),
}
}
Loading
Loading