Enhancement: improve MoveTables SwitchTraffic performance (getCopyState) #14325

Closed
maxenglander opened this issue Oct 20, 2023 · 0 comments · Fixed by #14375
Labels: Component: VReplication, Type: Enhancement


maxenglander commented Oct 20, 2023

Feature Description

During MoveTables SwitchTraffic, there is a phase where the wrangler queries the copy state of every VStream:

  1. First SwitchTraffic calls canSwitch
    func (vrw *VReplicationWorkflow) canSwitch(keyspace, workflowName string) (reason string, err error) {
  2. Which calls getStreams, which loads every VStream in the target keyspace
    query := `select
        id,
        source,
        pos,
        stop_pos,
        max_replication_lag,
        state,
        db_name,
        time_updated,
        transaction_timestamp,
        time_heartbeat,
        time_throttled,
        component_throttled,
        message,
        tags,
        workflow_type,
        workflow_sub_type,
        defer_secondary_keys,
        rows_copied
    from _vt.vreplication`
    results, err := wr.runVexec(ctx, workflow, keyspace, query, nil, false)
  3. And then iterates over every VStream to get the replication status and copy state
    for primary, result := range results {
        var rsrStatus []*ReplicationStatus
        nqr := sqltypes.Proto3ToResult(result).Named()
        if len(nqr.Rows) == 0 {
            continue
        }
        for _, row := range nqr.Rows {
            status, sk, err := wr.getReplicationStatusFromRow(ctx, row, primary)
  4. Each of which involves a VReplicationExec call
    func (wr *Wrangler) getCopyState(ctx context.Context, tablet *topo.TabletInfo, id int32) ([]copyState, error) {
        var cs []copyState
        query := fmt.Sprintf("select table_name, lastpk from _vt.copy_state where vrepl_id = %d and id in (select max(id) from _vt.copy_state where vrepl_id = %d group by vrepl_id, table_name)",
            id, id)
        qr, err := wr.VReplicationExec(ctx, tablet.Alias, query)

When there are a lot of VStreams, this takes so long that the entire SwitchTraffic times out and ends up in a partially completed state.
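
One possible way to cut the number of round trips, sketched here purely as an illustration, is to fetch the copy state for all streams on a tablet with a single query instead of one query per stream. The helpers below (`copyStateRow`, `buildBatchedCopyStateQuery`, `groupByStream`) are hypothetical names and are not part of Vitess; the SQL is adapted from the per-stream query shown above, with `vrepl_id` added to the select list so rows can be attributed to their stream.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// copyStateRow is a hypothetical stand-in for one row read from _vt.copy_state.
type copyStateRow struct {
	VReplID int32
	Table   string
	LastPK  string
}

// buildBatchedCopyStateQuery returns a single query covering every given
// stream ID on one tablet. It returns "" when there are no IDs.
func buildBatchedCopyStateQuery(ids []int32) string {
	if len(ids) == 0 {
		return ""
	}
	// Sort a copy so the generated SQL is deterministic.
	sorted := append([]int32(nil), ids...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	parts := make([]string, len(sorted))
	for i, id := range sorted {
		parts[i] = fmt.Sprintf("%d", id)
	}
	in := strings.Join(parts, ", ")
	return fmt.Sprintf("select vrepl_id, table_name, lastpk from _vt.copy_state "+
		"where vrepl_id in (%s) and id in (select max(id) from _vt.copy_state "+
		"where vrepl_id in (%s) group by vrepl_id, table_name)", in, in)
}

// groupByStream indexes batched result rows by stream ID so each stream's
// copy state can be looked up without another query.
func groupByStream(rows []copyStateRow) map[int32][]copyStateRow {
	out := make(map[int32][]copyStateRow)
	for _, r := range rows {
		out[r.VReplID] = append(out[r.VReplID], r)
	}
	return out
}

func main() {
	fmt.Println(buildBatchedCopyStateQuery([]int32{3, 1, 2}))
	rows := []copyStateRow{
		{VReplID: 1, Table: "customer", LastPK: "pk-a"},
		{VReplID: 1, Table: "orders", LastPK: "pk-b"},
		{VReplID: 2, Table: "customer", LastPK: "pk-c"},
	}
	fmt.Println(len(groupByStream(rows)[1]))
}
```

With this shape, the number of VReplicationExec calls would scale with the number of target tablets rather than with the number of VStreams.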

Use Case(s)

Until recently, Vitess did not support changing the primary Vindex during MoveTables. Now it does. For Vitess clusters with lots of shards, say 128, a MoveTables operation to change a primary Vindex would involve 16,384 (128 × 128) VStreams. While each of the calls to get the copy state of those VStreams would be fast, in aggregate they can result in a SwitchTraffic timeout.
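
To make the aggregate cost concrete, here is a rough back-of-the-envelope sketch. The 5 ms per-call latency is an assumed, illustrative figure, not a measurement, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// streamCount returns the number of VStreams when every target shard
// streams from every source shard.
func streamCount(sourceShards, targetShards int) int {
	return sourceShards * targetShards
}

// sequentialCost estimates the total time for a number of calls made one
// after another at a fixed per-call latency.
func sequentialCost(calls int, perCall time.Duration) time.Duration {
	return time.Duration(calls) * perCall
}

func main() {
	const perCall = 5 * time.Millisecond // assumed latency, for illustration only

	streams := streamCount(128, 128)
	fmt.Println("streams:", streams)
	fmt.Println("one call per stream:", sequentialCost(streams, perCall))
	fmt.Println("one call per target shard:", sequentialCost(128, perCall))
}
```

Under that assumption, one call per stream adds up to well over a minute, while one call per target shard stays under a second.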

@maxenglander maxenglander added Type: Feature Needs Triage This issue needs to be correctly labelled and triaged labels Oct 20, 2023
@maxenglander maxenglander self-assigned this Oct 20, 2023
@maxenglander maxenglander changed the title Feature Request: improve MoveTables SwitchTraffic performance (getCopyState) Feature Request: improve MoveTables SwitchTraffic performance (getCopyState) Oct 20, 2023
@maxenglander maxenglander added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: VReplication labels Oct 26, 2023
@maxenglander maxenglander changed the title Feature Request: improve MoveTables SwitchTraffic performance (getCopyState) Enhancement: improve MoveTables SwitchTraffic performance (getCopyState) Oct 26, 2023
@mattlord mattlord removed the Needs Triage This issue needs to be correctly labelled and triaged label Oct 26, 2023