title | summary |
---|---|
IMPORT INTO vs. TiDB Lightning |
Learn about the differences between `IMPORT INTO` and TiDB Lightning. |
Many users have provided feedback that the deployment, configuration, and maintenance of TiDB Lightning are complex, especially in scenarios involving parallel importing of large datasets.
Based on the feedback, TiDB has gradually integrated some functionalities of TiDB Lightning into the IMPORT INTO
SQL statement. You can directly import data by executing IMPORT INTO
, thereby improving the efficiency of data import. In addition, IMPORT INTO
supports some functionalities that TiDB Lightning does not, such as automatic distributed task scheduling and TiDB Global Sort.
IMPORT INTO
is introduced in v7.2.0 and becomes generally available (GA) in v7.5.0. It will continue to be improved and optimized in future versions. Once the IMPORT INTO
capability can fully replace TiDB Lightning, TiDB Lightning will be deprecated. At that time, relevant notification will be provided in advance in TiDB Release Notes and documentation.
The following sections describe the differences between IMPORT INTO
and TiDB Lightning in multiple dimensions.
IMPORT INTO
does not require separate deployment. You can run it directly on TiDB nodes, which eliminates additional deployment work.
By contrast, TiDB Lightning requires separate server deployment.
The IMPORT INTO
task and other business workloads can share TiDB resources or utilize them at different times to fully leverage the TiDB resources. To ensure stable operation of business workloads while maintaining the performance and stability of the IMPORT INTO
task, you can specify specific TiDB nodes dedicated to IMPORT INTO
for data import.
When you use TiDB Global Sort, there is no need to mount large local disks. TiDB Global Sort can use Amazon S3 as the storage. Once the import task is completed, the temporary data stored on Amazon S3 for global sorting will be automatically deleted to save storage costs.
You need separate servers to deploy and run TiDB Lightning. When no import tasks are executed, these resources remain idle. The total idle time is even longer in scenarios where import tasks are executed periodically, resulting in a waste of resources.
If the dataset to be imported is large, you also need to prepare large local disks to sort the data to be imported.
You can directly write SQL statements to submit import tasks, which are easy to call and integrate.
By contrast, TiDB Lightning requires you to write task configuration files. These configuration files are complex and not easily called by third parties.
IMPORT INTO
supports distributed execution. For example, when you import 40 TiB of source data files into one target table, after submitting the SQL statement, TiDB will automatically split the import task into multiple sub-tasks and schedule different TiDB nodes to execute these sub-tasks.
By contrast, the configuration for TiDB Lightning is complex, inefficient, and prone to errors.
Assume that you start 10 TiDB Lightning instances to import data in parallel, then you need to create 10 TiDB Lightning configuration files. In each file, you need to configure the range of source files to be read by the corresponding TiDB Lightning instance. For example, TiDB Lightning instance 1 reads the first 100 files, instance 2 reads the next 100 files, and so on.
In addition, you need to configure the shared metadata table and other configuration information for these 10 TiDB Lightning instances, which is complex.
With TiDB Global Sort, IMPORT INTO
can transmit tens of TiB of source data to multiple TiDB nodes, encode the data KV pairs and index KV pairs, and then transfer these pairs to Amazon S3 for global sorting before writing them into TiKV.
Because these KV pairs are globally sorted, data imported from various TiDB nodes into TiKV does not overlap, allowing it to be written directly into the RocksDB. This eliminates the need for TiKV to perform compaction operations, resulting in significant improvement in both write performance and stability of TiKV.
After the import is completed, the data used for global sorting on Amazon S3 will be automatically deleted, saving storage costs.
TiDB Lightning only supports local sort. For example, for tens of TiB of source data, if TiDB Lightning does not have large local disks configured, or if multiple TiDB Lightning instances are used for parallel import, each TiDB Lightning instance can only use local disks to sort the data to be imported. Due to the inability to perform global sort, there will be an overlap between the data imported into TiKV by multiple TiDB Lightning instances, especially in scenarios where index data is more prevalent, triggering TiKV to perform compaction operations. Compaction operations are resource-intensive, which will lead to a decrease in TiKV's write performance and stability.
If you want to continue importing data later, you will need to keep the TiDB Lightning server and the disks on the server for the next import. The cost of using preallocated disks is relatively high, compared with IMPORT INTO
using Amazon S3 on a pay-as-you-go basis.
Currently, there are no performance test comparison results under equivalent test environments between IMPORT INTO
and TiDB Lightning.
When Amazon S3 is used as the storage for global sorting, the performance test results for IMPORT INTO
are as follows:
Source dataset | Node configuration | Average import speed per TiDB node |
---|---|---|
40 TiB data (2.26 billion rows, 19 KiB per row) | 10 TiDB (16C32G) nodes and 20 TiKV (16C27G) nodes | 222 GiB/h |
10 TiB data (565 million rows, 19 KiB per row) | 5 TiDB (16C32G) nodes and 10 TiKV (16C27G) nodes | 307 GiB/h |
After a TiDB node fails, tasks on that node are automatically transferred to the remaining TiDB nodes to continue running.
After a TiDB Lightning instance node fails, you need to perform manual recovery of tasks on a new node based on previously recorded checkpoints.
Due to the use of Global Sort, data imported into TiKV does not overlap, resulting in better scalability compared with TiDB Lightning.
Due to only supporting local sort, data imported into TiKV might overlap when new TiDB Lightning instances are added, resulting in more compaction operations for TiKV and limiting scalability relative to IMPORT INTO
.
Currently, IMPORT INTO
still lacks some functionalities and cannot fully replace TiDB Lightning in some scenarios, such as:
-
Logical import
Before importing data with
IMPORT INTO
, the target table must be empty. If you need to import data into a table that already contains data, it is recommended to use methods such asLOAD DATA
or direct insertion. Starting from v8.0, TiDB supports bulk DML for executing large transactions. -
Conflict data handling
IMPORT INTO
currently does not support conflict data handling. Before the data import, you need to define the table schema properly to ensure that the data to be imported does not conflict with primary keys (PK) or unique keys (UK). Otherwise, it might cause task failures. -
Importing data into multiple target tables
Currently, only one target table is allowed for one
IMPORT INTO
SQL statement. If you want to import data into multiple target tables, you need to submit multipleIMPORT INTO
SQL statements.
In future versions, these functionalities will be supported by IMPORT INTO
, along with additional enhancements to its capabilities, such as allowing modification of concurrency during task execution and adjusting throughput for writing to TiKV. This will make it more convenient for you to manage tasks.
Compared with TiDB Lightning, IMPORT INTO
can be directly executed on TiDB nodes, supports automated distributed task scheduling and TiDB Global Sort, and offers significant improvements in deployment, resource utilization, task configuration convenience, ease of invocation and integration, high availability, and scalability. It is recommended that you consider using IMPORT INTO
instead of TiDB Lightning in appropriate scenarios.