Vitess VReplication Testing Framework #12136

Closed · 2 tasks
rohit-nayak-ps opened this issue Jan 24, 2023 · 8 comments

rohit-nayak-ps (Contributor) commented Jan 24, 2023

Notes for Potential LFX Mentees:

  • This project is part of the LFX program. More information at https://github.com/cncf/mentoring/tree/main/lfx-mentorship. For mentees, details on how to apply are at https://github.com/cncf/mentoring/tree/main/lfx-mentorship#how-to-apply.
  • This project will not, per se, require contributions to Vitess, but the contribution guide has information on learning both Golang and Vitess concepts for someone new to Vitess.
  • To learn HCL: https://github.com/hashicorp/hcl. This link also covers creating a custom DSL using HCL.

⚠️ If you're thinking about opening PRs to the project before the application period begins, please read the initial sections regarding contribution guidelines and advice from a previous gsoc project!

Feature Description

VReplication is a core component of Vitess. Production Vitess clusters regularly depend on workflows like Resharding, MoveTables, and Materialize, as well as the VStream API, which puts VReplication on the critical path. While we have good unit test and e2e test coverage, we do not measure performance. Some failures are also not easy to reproduce in local tests: reparenting operations, transient network and database failures, connection and memory leaks, and so on.

We propose creating a framework for defining test cases for different VReplication workflows, running them at partial scale, validating the results, and potentially storing benchmark output.

In the rest of this document we outline specific goals, challenges that will need to be addressed, and a proposed implementation architecture.

Practical Aspects

The tests will be fairly expensive in terms of CPU time and number of instances, so we will not run them on demand (as arewefastyet does, for example). At least initially, we will likely run on private infrastructure (until and unless we get free infra from CNCF or another source).
Tests will be run periodically, say every week, to catch performance and functionality regressions. They can also be run on specific PRs that are expected to improve or impact performance.

Specific Goals

Testing

We will run long-running workflows (on the order of hours) on different cluster configurations, with intermittent reparents and simulated common failures, against non-trivial data sizes and different table schemas. These are not intended to be comprehensive functionality tests but smoke tests for curated cluster and data configurations and specific workflows. The aim is to catch and surface existing bugs and regressions.

Benchmarks

For some of the test configurations we will publish performance results (rows per second, GiB per second, CPU and memory usage, etc.). These will act as reference benchmarks that give the community an idea of the approximate sizing required for Vitess clusters and of how long workflows will take to run.

Note that these numbers will be indicative only: actual performance is highly dependent on the nature of the data, network configuration, underlying hardware, etc.

Non-goals

This framework is NOT intended to replace unit and e2e tests in Vitess. In particular, these tests will NOT run for every PR or push.

Implementation

Workflow Configurations

  • Multiple types of workflows like MoveTables, Reshard, Materialize, VStream
  • Sharded and unsharded keyspaces; also VDiff
  • Different number of Source/Target keyspace shards
  • Table distributions: huge tables, lots of small tables
  • Data types and widths of columns
  • Primary Key configurations: simple, compound, data types
  • Flags/Options: vttablet and workflow parameters (e.g. vstream_packet_size) that affect VReplication

Approach

  • Small number of selected benchmark configs
  • Run all tests sequentially within <8-12?> hours to maintain a small infra footprint (see the driver sketch after this list)
  • Small (but not too small) infra config to reduce cost
  • Mini-runs for quick turnarounds (10% data, 1 hour?)
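
To make the sequential-run approach concrete, here is a minimal sketch of what the driver loop could look like in Go. The `TestConfig` type, the `RunWorkflow` function, and the timeouts are hypothetical placeholders for illustration, not existing Vitess APIs.

```go
package main

import (
	"context"
	"log"
	"time"
)

// TestConfig is a hypothetical placeholder for one parsed DSL test case.
type TestConfig struct {
	Name    string
	Timeout time.Duration // per-test budget so one run cannot eat the whole window
}

// RunWorkflow is a hypothetical hook that would drive one VReplication
// workflow (MoveTables, Reshard, ...) against the chosen backend adapter.
func RunWorkflow(ctx context.Context, tc TestConfig) error {
	// ... provision cluster, start workflow, validate, collect metrics ...
	return nil
}

func main() {
	// Overall budget for the whole suite (the <8-12?> hour window above).
	suite, cancel := context.WithTimeout(context.Background(), 12*time.Hour)
	defer cancel()

	tests := []TestConfig{
		{Name: "huge-table-copy", Timeout: 6 * time.Hour},
		{Name: "many-small-tables", Timeout: 3 * time.Hour},
	}

	// Run sequentially to keep the infra footprint small.
	for _, tc := range tests {
		ctx, cancelRun := context.WithTimeout(suite, tc.Timeout)
		if err := RunWorkflow(ctx, tc); err != nil {
			log.Printf("test %s failed: %v", tc.Name, err)
		}
		cancelRun()
	}
}
```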

Benchmark Measures

  • Time for workflow to “complete”
  • Replication lag for streams
  • CPU/Memory usage of vttablet
  • MySQL load
  • Network data usage (if possible)

Each benchmark run should also attach the full test configuration, including the schema, and all VReplication-related metrics.
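
As one way to collect these measures: vttablet serves its counters as expvar-style JSON on its /debug/vars HTTP endpoint, so a simple collector can poll it during a run. Below is a minimal sketch; the tablet address and the variable names to extract ("VReplicationLagSeconds" here) are assumptions that must be checked against the metric names exposed by the Vitess version under test.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// pollVars fetches one snapshot of a vttablet's /debug/vars JSON.
func pollVars(addr string) (map[string]interface{}, error) {
	resp, err := http.Get(addr + "/debug/vars")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	vars := map[string]interface{}{}
	if err := json.NewDecoder(resp.Body).Decode(&vars); err != nil {
		return nil, err
	}
	return vars, nil
}

func main() {
	// Sample every 10s during a benchmark run. The address and the variable
	// names below are assumptions for illustration.
	for range time.Tick(10 * time.Second) {
		vars, err := pollVars("http://127.0.0.1:15100")
		if err != nil {
			fmt.Println("poll failed:", err)
			continue
		}
		for _, name := range []string{"VReplicationLagSeconds", "memstats"} {
			if v, ok := vars[name]; ok {
				fmt.Printf("%s %s = %v\n", time.Now().Format(time.RFC3339), name, v)
			}
		}
	}
}
```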

Proposed Benchmark Configs

  • Single Huge Table, compound PK (int, binary), Unsharded to Two Shards, Only Copy, 100M rows/100GB data
  • Large table + lots of small tables, different PKs, some wide tables, Two Shards to Four Shards, 1 Copy + 1K QPS for 1 hour, use VDiff here (see the populator sketch after this list)
  • Large number of streams: 1000 materialize streams, unsharded to unsharded, small initial rows, 10 QPS for 1 hour
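
The sustained-QPS load in these configs would come from the data populator listed under Implementation Artifacts below. Here is a minimal sketch that issues single-row inserts through vtgate's MySQL protocol port; the connection string, table name, and port (15306, the default in the Vitess local examples) are assumptions, and a real populator would use concurrent workers rather than one sequential loop to reliably hit 1K QPS.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"math/rand"
	"time"

	_ "github.com/go-sql-driver/mysql" // vtgate speaks the MySQL protocol
)

func main() {
	// Assumed DSN: a vtgate listening on its MySQL port for keyspace "commerce".
	db, err := sql.Open("mysql", "user@tcp(127.0.0.1:15306)/commerce")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	const qps = 1000
	interval := time.Second / qps
	deadline := time.Now().Add(1 * time.Hour)

	// Issue roughly `qps` single-row inserts per second for one hour.
	for t := range time.Tick(interval) {
		if t.After(deadline) {
			break
		}
		if _, err := db.Exec(
			"INSERT INTO load_test (val) VALUES (?)", // hypothetical table
			fmt.Sprintf("row-%d", rand.Int63()),
		); err != nil {
			log.Println("insert failed:", err)
		}
	}
}
```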

Implementation Artifacts

  • Initial data file for huge/large table. We can base this on TPCC datasets
  • Data populator for generating streaming data
  • The DSL specification. The current thought is to use HCL, since it is highly customizable and well maintained (see the parser sketch after this list)
  • DSL parser
  • Driver that runs tests based on the DSL configurations
  • Backend adapters: first a Docker adapter for local development, followed by an adapter for AWS EC2
  • Result storage backends: YML / PlanetScale
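
To give a feel for the DSL direction, here is a hedged sketch of a test-case definition and its parser using HashiCorp's hclsimple package. The block and attribute names are invented for illustration, loosely modeled on proposed benchmark config #1 above; nothing here is a committed format.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/hcl/v2/hclsimple"
)

// Hypothetical DSL schema, loosely modeled on benchmark config #1.
type Suite struct {
	Workflows []Workflow `hcl:"workflow,block"`
}

type Workflow struct {
	Name         string `hcl:"name,label"`
	Type         string `hcl:"type"` // MoveTables | Reshard | Materialize | VStream
	SourceShards int    `hcl:"source_shards"`
	TargetShards int    `hcl:"target_shards"`
	Rows         int64  `hcl:"rows"`
	CopyOnly     bool   `hcl:"copy_only,optional"`
}

const example = `
workflow "huge-table-copy" {
  type          = "MoveTables"
  source_shards = 1
  target_shards = 2
  rows          = 100000000
  copy_only     = true
}
`

func main() {
	var suite Suite
	// hclsimple picks the HCL syntax from the file extension.
	if err := hclsimple.Decode("suite.hcl", []byte(example), nil, &suite); err != nil {
		log.Fatal(err)
	}
	for _, w := range suite.Workflows {
		fmt.Printf("%s: %s %d->%d shards, %d rows\n",
			w.Name, w.Type, w.SourceShards, w.TargetShards, w.Rows)
	}
}
```

The driver would decode each such file into the struct above and hand the result to the backend adapter, so adding a new test case means adding a new .hcl file rather than new Go code.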
@TheRealSibasishBehera

Hi @rohit-nayak-ps, this feature looks interesting to me. I would like to work on it as an LFX mentee. Can you give me a brief idea of the prerequisites to get started with this issue?

rohit-nayak-ps (Contributor, Author) commented Jan 25, 2023

> Hi @rohit-nayak-ps, this feature looks interesting to me. I would like to work on it as an LFX mentee. Can you give me a brief idea of the prerequisites to get started with this issue?

@TheRealSibasishBehera, good to hear that you are interested. I have added initial notes about the prerequisites at the head of this issue description, as well as links to the mentee application procedures. Let us know if you need more information or clarifications.

@PaarthAgarwal

That'll be a great addition to Vitess. Going through the description, it looks like it matches my skills. I'll apply for it.

@vishalvivekm

Hello @rohit-nayak-ps, I know the basics of Go and am currently referring to the resources you added above to get familiar with the project. Very excited to contribute to it as a Linux Foundation mentee for the upcoming spring term.

@abhinandanudupa

@rohit-nayak-ps

  1. Since this seems to be a project needing cloud resources, how will a mentee run the tests during development? What are the prerequisites for learning how to use the platform?

  2. Will we also have to develop a UI for configuring the benchmark, e.g. setting the number of shards, number of streams, etc.?

rohit-nayak-ps (Contributor, Author) commented

> 1. Since this seems to be a project needing cloud resources, how will a mentee run the tests during development? What are the prerequisites for learning how to use the platform?

Good question. We will start by building a local adapter so that we can run the different Vitess components in Docker on your local machine. We can run with a small amount of data so that we don't need a lot of local computing power. Once we have a working local setup, we will provide cloud resources.

Prerequisites are mentioned at the top of this issue. If you have any specific questions, feel free to ask.

> 2. Will we also have to develop a UI for configuring the benchmark, e.g. setting the number of shards, number of streams, etc.?

There is no plan for a UI for configuration. However, we do have plans for a UI to look at the results of the benchmark runs. That is not in the scope of the initial LFX project, though of course people are welcome to work on it as well if they have the time.

@frouioui frouioui added this to v18.0.0 Jun 30, 2023
@frouioui frouioui moved this to In Progress in v18.0.0 Jun 30, 2023
@frouioui frouioui removed this from v17.0.0 Jun 30, 2023
@nikzayn commented May 29, 2024

Hey @rohit-nayak-ps, I am interested in the LFX mentorship spring term for this project. I have the skillset needed to implement the goals mentioned in the description.

Excited to be a part of this, as it would be my first mentorship program.

rohit-nayak-ps (Contributor, Author) commented

We have decided not to pursue this at the moment, since it will take significant resources to build and maintain.

Projects
Status: Done

Development
No branches or pull requests

6 participants
6 participants