Databases in inDrive are subject to high loads, which imposes certain requirements on the development of queries and data structure migrations. This guideline provides the main recommendations for naming entities, making changes, and performance considerations that engineers should pay attention to during the development process.
Following this guideline will help pass SQL reviews faster. Prepared migrations should always be run through SQLLint.
The general recommendation for naming all entities is to use common words that do not allow ambiguous interpretation. For example, use the term gender for storing gender data. Avoid naming conflicts with reserved words from the SQL dialect being used.
Tables should be named with a singular English noun indicating the main entity. For compound names, words are separated by underscores.
Good:
- order
- vehicle_details (since each row in the table stores a set of attributes and options of the car, plural is allowed).

Bad:
- orders
- rides_orders
- VehicleDetails
- vehicle_detail
Table columns should be named with a singular English noun. For compound names, words are separated by underscores. For storing unstructured information, use the reserved field name jdoc. Do not use reserved words.
When naming table fields, use standard prefixes and suffixes:
- is_ for logical fields;
- _by for event sources;
- _at for timestamps.

Good:
- id
- uuid
- order_id (if it is a link to the order in another table)
- document_text
- created_at
- updated_by

Bad:
- order_ids
- text
- creation_date
- updated_by_user
It is recommended to compose the index name from the table name, the field or fields, and the _idx suffix. If the name is too long, abbreviations are allowed as long as they do not lose information about the purpose of the index.
For keys (for example, unique keys), it is recommended to compose the name from the field or fields and the _key suffix. If the name is too long, abbreviations are allowed as long as they do not lose information about the purpose of the key.
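For illustration, a hedged sketch of this naming convention (the table and columns are hypothetical):
-- index on user_id in the order table
KEY `order_user_id_idx` (`user_id`),
-- unique key on the pair (user_id, service)
UNIQUE KEY `user_id_service_key` (`user_id`, `service`)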
Despite certain advantages and data integrity control, we prohibit the use of foreign keys in DB schemas because they introduce a large number of restrictions on the execution of online DDL. You can read more here.
The table must always have a PRIMARY (UNIQUE) key. This is critically important for replication. The primary key must be compact. It is recommended to avoid NULL values in key columns, since a nullable field requires more space and additional calculations inside InnoDB.
It should be noted that a standard create index ... query is not performed as an online migration, but directly on the table. When creating an index on a large table, this can lead to the table being locked. It is safer to use a construction of the following type:
ALTER TABLE table_name ADD INDEX ...
The index being created will work well only if the column on which it is built has a sufficient number of distinct values (high cardinality). In that case the index will effectively reduce the volume of data being selected. You can check cardinality with the following command:
SHOW INDEX FROM tbl_name;
There is a rare case when low cardinality can be neglected: when the column on which the index is built has a strong skew in the data. For example, if the distribution is 95/5 and your queries only need the rarer values, you can create an index on such a column and still get high selectivity.
More details in this article.
B-tree indexes:
- well suited for most search operations (=, >, <, >=, <=, BETWEEN);
- work for LIKE searches when the pattern starts with a constant prefix (LIKE 'word%' but not LIKE '%word').
Hash indexes:
- work only for equality conditions (=, <=>), but are very fast;
- cannot be used in ORDER BY.
Prefer one composite key to several simple ones:
- the composite key is used from left to right;
- place the most selective fields of the composite key at the beginning;
- selectivity: a key that selects more than 50% of the table is useless. MySQL does not have partial indexes.
Avoid duplicates: when one index contains another index entirely.
Example — the first key is useless because the second contains everything needed:
KEY `city_id` (`city_id`),
KEY `city_id_type_status` (`city_id`,`type`,`status`)
When using OR, make sure that each group of the OR uses an indexed field from the same index.
Suppose there is a composite index on columns (index_part1, index_part2, index_part3).
These queries will use the index.
-- index_part1, index_part2 will be used
... WHERE index_part1=1 AND index_part2=2 AND other_column=3
/* index_part1 = 1 on the left and index_part2 = 2 on the right will be used */
... WHERE (index_part1=1 OR other_column=10) AND index_part2=2
/* only "index_part1='hello'" but not index_part3 will be used */
... WHERE index_part1='hello' AND index_part3=5
/* Only index_part1 will be used, as it is in both parts of OR */
... WHERE (index_part1=1 AND index_part2=2) OR (index_part1=3 AND index_part3=3);
And here the index will not be used:
/* index_part1 is not used, so the index does not work */
... WHERE index_part2=1 AND index_part3=2
/* there is no index for other_column=10, so the entire expression will not use the index */
... WHERE index_part1=1 OR other_column=10
/* there is no condition for index_part1 on the right */
... WHERE index_part1=1 OR index_part2=10
Always use EXPLAIN to check how your indexes are used. Keep in mind that EXPLAIN can be misleading on small data sets: with few rows, the optimizer may choose a full scan even when a suitable index exists.
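A minimal sketch, assuming a hypothetical order table with an index on user_id:
-- check the "key" and "rows" columns of the output: the expected index should appear in "key"
EXPLAIN SELECT id, status FROM `order` WHERE user_id = 123;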
Key selections:
- equality is the best choice: WHERE column = value; WHERE column IN (value1, value2).
Do not use functions on indexed columns — the index will not work:
Bad:
SELECT COUNT(*) FROM tbl_modercheck
WHERE TIMESTAMPDIFF(HOUR, created, NOW()) < 12
Good:
SELECT COUNT(*) FROM tbl_modercheck
WHERE created > NOW() - INTERVAL 12 HOUR
In general, it is recommended to use UUIDv7 or UUIDv8 in BINARY(16) format for the Primary Key of a table. Identifiers of both versions already contain the record creation time, so for data consistency we create a virtual created_at column whose values are derived from the identifier. For example:
`created_at` DATETIME GENERATED ALWAYS AS (from_unixtime(conv(hex(`user_uuid` >> 80),16,10) / 1000)) VIRTUAL
You can use a Primary Key in numeric format when creating reference books in which the data changes slowly, is changed in one place, and is common to all regions, which eliminates the need for further data merging and record identifier conflicts.
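A minimal sketch of such a reference book, assuming an auto-increment integer key (the table and column names are illustrative):
CREATE TABLE IF NOT EXISTS `payment_provider`
(
    `id`   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    `name` VARCHAR(100) NOT NULL COMMENT 'Human-readable provider name'
) COMMENT = 'Slowly changing reference book maintained in one place';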
We use partitioning in large tables to simplify data management: deleting old data, moving data to cold storage, etc. Partitions can also have a positive effect on the speed of some queries by reducing the amount of data being processed.
Since the most common way for us to partition data is by record creation date, and date and time are part of the UUIDv7 and UUIDv8 identifier, the correct way is to partition the table by ID. This method will simplify working with old records and allow you to find a record by its identifier as quickly as possible.
Other ways of partitioning tables are also possible depending on the data usage scenario.
Partitioning is applied by the DBA; developers should make it possible:
- use UUID v7 or v8 as PK;
- implement code that will work with partitions efficiently;
- be careful with JOINs.
Example of partitioning by UUID:
PARTITION BY RANGE COLUMNS(uuid) (
PARTITION p01 VALUES LESS THAN (_binary 0x018D61F7180000000000000000000000) ENGINE = InnoDB,
PARTITION p_max VALUES LESS THAN (MAXVALUE) ENGINE = InnoDB
)
To store string values with a limited number of options (about ten), it is recommended to use the ENUM or SET type. If you need to use a larger number of options, you should put them in a separate reference table.
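For illustration, hypothetical columns using these types:
`fuel_type` ENUM ('petrol', 'diesel', 'electric', 'hybrid') NOT NULL COMMENT 'Vehicle fuel type',
`options` SET ('air_conditioner', 'child_seat', 'wheelchair_access') NOT NULL DEFAULT '' COMMENT 'Vehicle options'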
It is recommended to store data in MySQL in normalized form. However, in some cases, when it is necessary to save a dynamic structure, or it is impossible to normalize data at the current level of knowledge about the subject area, it is allowed to use JSON fields in tables. It should be remembered that such fields tend to grow, collecting more and more attributes over time. When setting up data replication between DB instances, such fields create a large data flow, which is fraught with lagging replicas and other difficulties.
When adding JSON fields to the table structure, it is necessary to limit the set of stored attributes at least at the level of the API contract of the service being developed.
Sometimes a field in the DB structure is declared as NOT NULL, but when inserting data we do not yet have information about the related entity and its identifier. Since there is no Foreign Key verification for such tables, it becomes possible to pass some special “fake” identifier to the DB that acts as a missing value, for example the value 0 or a generated “zero” UUID, which will be safely saved in the table. Using this approach complicates the processing of such data for analysts and the Data Department, and also worsens the readability and maintainability of the code.
Using fake identifiers is prohibited. In this situation, it is recommended to ALTER such a field and make it nullable with a default value of NULL, which correctly describes such a relationship and natively allows you to perform a JOIN.
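A hedged sketch of such a change (the table and column are hypothetical; note that MODIFY COLUMN requires repeating the full column definition):
ALTER TABLE `ride`
    MODIFY COLUMN `promo_uuid` BINARY(16) NULL DEFAULT NULL COMMENT 'UUID from promo table, NULL until a promo is attached';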
To make it easier for other engineers to understand the purpose of tables and fields in the database, all migrations should include comments on the entities being created, except for obvious ones like the id field. Comments should be in English, begin with a capital letter, and briefly describe the purpose of the given entity. For example:
CREATE TABLE IF NOT EXISTS `user_vehicle_vertical_status`
(
`uuid` BINARY(16) NOT NULL PRIMARY KEY,
`user_vehicle_uuid` BINARY(16) NOT NULL COMMENT 'UUID from user_vehicle table',
`status` ENUM ('denied', 'approved') NOT NULL COMMENT 'WD1 application status on vertical',
`service` VARCHAR(20) NOT NULL COMMENT 'Name of service where user is authorized to work on vehicle',
`created_at` DATETIME NOT NULL DEFAULT NOW(),
`modified_at` DATETIME NOT NULL DEFAULT NOW() ON UPDATE CURRENT_TIMESTAMP,
UNIQUE KEY `user_vehicle_uuid` (`user_vehicle_uuid`, `service`)
) COMMENT = 'This table normalizes data with WD1 by containing additional info on vertical authorization';
If the column values use different numeric codes (magic numbers), the ENUM type with a list of possible values should be used for such a column. In that case, the comment should provide brief descriptions of these values.
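For example, instead of storing numeric codes (the column is hypothetical):
`payment_status` ENUM ('pending', 'paid', 'refunded') NOT NULL COMMENT 'Payment state: pending - awaiting provider response, paid - confirmed, refunded - money returned to the user'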
To simplify data governance tasks, comments can contain special tags. See the Sensitive Data section.
The select count(*) from ... queries are prohibited on the production environment, as this query creates an increased load and can lead to a significant drop in DB performance. To estimate the table size in records, we recommend using MySQL internal statistics, which allow a rough estimate of the table size without risks to performance:
select table_rows
from information_schema.tables
where table_schema = 'database_name'
and table_name = 'table_name' ;
Another option for indirectly estimating the table size is the EXPLAIN query. In this case, pay attention to the rows value: the number of rows that the query will have to process.
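For example (the table name is illustrative):
-- the "rows" column gives a rough estimate of how many rows the query would have to process
EXPLAIN SELECT id FROM `ride`;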
- never use an expression like select * from ...; always explicitly specify the fields to select, even if all are needed;
- the selection must always be limited by a WHERE condition;
- queries must use limits and pagination to protect against unexpectedly large selections (see the sketch after this list).
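A minimal sketch of keyset pagination following these rules, assuming a hypothetical ride table with a uuid primary key:
-- explicit field list, WHERE condition and LIMIT; ? is the last uuid from the previous page
SELECT `uuid`, `status`, `created_at`
FROM `ride`
WHERE `status` = 'active' AND `uuid` > ?
ORDER BY `uuid`
LIMIT 100;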
When performing data updates, you need to follow a few simple rules to avoid problems in a production environment. Large updates of a large number of records should be performed in parts by the DBA. It is recommended to perform any updates in a transaction with a manual COMMIT/ROLLBACK, controlling the number of changed records.
Performing update operations without WHERE is prohibited. Also, if you know in advance the number of records to be changed, it is good practice to use the LIMIT construct specifying this number to avoid massive data corruption in case of an error.
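A hedged sketch, assuming we know in advance that no more than 500 rows should be affected (table and column names are illustrative):
UPDATE `ride`
SET `status` = 'archived'
WHERE `status` = 'cancelled' AND `created_at` < '2024-01-01'
LIMIT 500;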
For example, we need to update only first_name.
Bad:
update user set last_name = ?, first_name = ?, balance = ? where id = ?;
Good:
update user set first_name = ? where id = ?;
There may be a situation where someone else changes the user's balance in parallel, and you then overwrite the old balance value, resulting in incorrect data in the DB.
Bad:
-- calculate the balance in the code
update user set balance = ? where id = ?;
Good:
-- increment balance in query
update user set balance = balance + ? where id = ?;
As in the previous case, it is necessary to take into account that the balance can change in parallel.
A transaction should take as little time as possible and affect only as much data as it needs. No blocking operations are allowed while the method has an open transaction. This is especially true for synchronous network requests in a transaction - this behavior is strictly prohibited! It is necessary to design the code in such a way that the transaction works as quickly as possible.
To avoid waiting on locks, use the NOWAIT clause: in this case the query immediately returns a "could not obtain lock" error. Don't forget to handle this error properly: retry it automatically or return an error to the user.
SELECT id, status FROM `order` WHERE id = ? FOR UPDATE NOWAIT
The general approach to changing the data structure can be described by the following points:
- migration should not break backward compatibility;
- migration should be idempotent if possible;
- migration should not lead to database degradation and should not affect the operation of the service.
You should always be aware that migration may lead to blocking or decreased service performance and take all necessary efforts to minimize possible consequences: use online migration tools, have an understanding of INSTANT/INPLACE operations and their applicability to our version of MySQL/Aurora. If in doubt, it is always worth consulting SQL Advocates.
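For example, a hedged sketch of requesting an explicit algorithm (hypothetical table and column); if MySQL cannot perform the change with the requested algorithm, the statement fails with an error instead of silently rebuilding the table:
ALTER TABLE `ride`
    ADD COLUMN `moderator_note` VARCHAR(255) NULL COMMENT 'Optional moderator note',
    ALGORITHM = INSTANT;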
When planning migrations, you need to consider how the planned changes are performed: do they make changes only to metadata or lead to a complete rebuild of the table; do they require locking operations or can they be performed concurrently.
Also, you need to consider the amount of data in the DB table that needs to be changed, since even if the operation does not lead to locks and the service continues to work normally, a long migration can lead to a Liquibase timeout.
More details.
Changes to the database must be made in such a way as to not break backward compatibility. If backward compatibility is not possible, consider using a multiphase migration. Examples of backward compatible changes:
- adding a table;
- adding a NULLABLE column;
- adding a NOT NULL column if there is a DEFAULT value;
- creating an index;
- increasing the size of a column;
- adding partitioning to a table;
- adding a new value to the list of possible values of a SET or ENUM.
Examples of operations that break backward compatibility:
- drop a table or a column;
- rename a table or a column;
It is recommended not to delete an object from the database directly, as this may result in the service not working and the inability to roll back to the previous version due to incompatibility of the data structure.
It is recommended to perform such operations in two phases:
- Renaming a column/table and launching a new version of the service to assess the functionality of the build. If problems were detected at this step, the original column name can be instantly returned back, and the data will not be lost.
- If no problems were detected at step 1, you can perform the DROP operation.
In general, it is better to refrain from placing DML queries in Liquibase migrations. Migrations can only be used to fill static reference books. INSERT operations are safe for migrations, but you should carefully check the number of records being changed if you need to perform an UPDATE or DELETE: this can lead to blocking.
It is better to perform large DML queries in batches using a DBA.
It is PROHIBITED to accumulate and process personal data in the service database without prior approval. The process of managing, protecting and coordinating the need to work with PD in all cases without exception is described in “Personal Information Handling”.
If the creation/addition of a column containing PD has been approved, these objects must be marked with a special #PPI
tag in the comment. Similarly, the creation/addition of a column containing geo data must be marked with a special #GEO
tag in the comment. Open Metadata Platform will automatically collect the markup from the comments, which will allow us
to track the path of PD through the inDrive infrastructure.
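For illustration, hypothetical columns carrying such tags (the comment wording is an assumption; only the #PPI / #GEO tags matter):
`phone` VARCHAR(20) NOT NULL COMMENT '#PPI User phone number',
`last_location` POINT NULL COMMENT '#GEO Last known device location'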
If change capture is configured for the objects being changed using CDC (Debezium), you should pay attention to whether the changes being made break the compatibility of the data structure with the DWH. Pay special attention and alert the data engineering team if:
- DROP of tables is performed
- DROP of columns with data is performed
- Change of the column data type is performed without backward compatibility (for example, varchar to int)
Data normalization works poorly in the case of a loaded database.
Bad:
- table region has columns (id, name);
- table city has columns (id, name, region_id);
- table user stores (id, user_name, city_id);
- to find out the region, we have to join city.
Good:
- table region has columns (id, name);
- table city has columns (id, name, region_id);
- table user stores (id, user_name, city_id, region_id);
- the user's region is already known.
So if it is known that there will be many typical requests, it is better to follow this principle: if the data changes simultaneously (in the example, when city_id changes, region_id changes and vice versa), it is better to store it together.
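A minimal sketch of the denormalized user table from the example above (column types are assumptions):
CREATE TABLE IF NOT EXISTS `user`
(
    `id`        INT UNSIGNED NOT NULL PRIMARY KEY,
    `user_name` VARCHAR(100) NOT NULL COMMENT 'Display name',
    `city_id`   INT UNSIGNED NOT NULL COMMENT 'ID from city table',
    `region_id` INT UNSIGNED NOT NULL COMMENT 'ID from region table, stored together with city_id to avoid a join'
) COMMENT = 'User profile with denormalized region_id';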
If the data belongs to one entity and will be used only within this entity, there is no need to allocate this data to a separate table.
Example: there is an entity user:

| user |
|---|
| id |
| first_name |
| last_name |
| city_id |
You need to add payment methods. You can add two tables:
| payment_provider |
|---|
| id |
| name |
Or you can add the link to the user table if you know that the payment methods will be used only with user.
| column | type | example |
|---|---|---|
| id | int | |
| first_name | varchar | |
| last_name | varchar | |
| city_id | int | |
| jdoc | json | {"providers": [1, 4]} |
As a result, there is no need to do a join to get the payment methods.
If the number of columns starts to exceed a certain critical threshold, consider two options:
- add a json field;
- still split the table into different entities.
MySQL allows you to create virtual columns and build indexes over them. For example, if you have a JSON field with an identifier inside (a contractor id in the example below), you can create a virtual column and an index over it.
`contractor_id` bigint GENERATED ALWAYS AS (json_unquote(json_extract(`contractor`, _utf8mb4'$.id'))) VIRTUAL,
…
KEY `idx_contractor_id` (`contractor_id`)
Writing to the master and then reading the same data from replicas is risky due to possible replication lag. Read from a replica only information that may be slightly outdated and does not affect the business logic. When you need the account balance or order status, read from the master.
Tables should be separated into Hot and Cold data. Hot data is data that is used often; Cold data is data that is used rarely. Hot data should be kept in a separate table so that queries are faster. For example, an order stays in the Hot table from the moment it is created and moves to Cold storage when it is fulfilled or cancelled.
Don't use an RDBMS for analytical tasks: it is a very expensive solution for them. Use ClickHouse, Kafka or similar tools.
ORM is a universal tool that is poorly compatible with high loads. It helps to write code quickly, but it slows down everything else. It is better to write simple queries and logic in the application. The only exception is using an ORM for an internal admin panel with low load.
Don't use triggers and stored procedures, because they:
- spread the logic between the app and the database;
- are hard to support.