Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

memory_usage_limit is deduced much less then expected #9745

Closed
CalvinNeo opened this issue Dec 26, 2024 · 4 comments · Fixed by pingcap/tidb-engine-ext#408 or #9753
Closed

memory_usage_limit is deduced much less then expected #9745

CalvinNeo opened this issue Dec 26, 2024 · 4 comments · Fixed by pingcap/tidb-engine-ext#408 or #9753
Labels
affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-6.1 This bug affects the 6.1.x(LTS) versions. affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. affects-8.5 This bug affects the 8.5.x(LTS) versions. component/storage report/customer Customers have encountered this bug. severity/major type/bug The issue is confirmed as a bug.

Comments

@CalvinNeo
Copy link
Member

CalvinNeo commented Dec 26, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

2. What did you expect to see? (Required)

3. What did you see instead (Required)

4. What is your TiFlash version? (Required)

If not set, memory_usage_limit is inferred by block_cache_cap

            let mut limit =
                (block_cache_cap.0 as f64 / BLOCK_CACHE_RATE * MEMORY_USAGE_LIMIT_RATE) as u64;

https://github.com/pingcap/tidb-engine-ext/blob/521fd9dbc55e58646045d88f91c3c35db50b5981/src/config/mod.rs#L3597-L3598
which is actually set by(I mean it's original value is not this, but the following code will adjust and take final effect)

        if let Some(a) = self.rocksdb.defaultcf.block_cache_size
            && let Some(b) = self.rocksdb.writecf.block_cache_size
            && let Some(c) = self.rocksdb.lockcf.block_cache_size
        {
            let d = self
                .raftdb
                .defaultcf
                .block_cache_size
                .map(|s| s.0)
                .unwrap_or_default();
            let sum = a.0 + b.0 + c.0 + d;
            self.storage.block_cache.capacity = Some(ReadableSize(sum));
        }

https://github.com/pingcap/tidb-engine-ext/blob/521fd9dbc55e58646045d88f91c3c35db50b5981/src/config/mod.rs#L3795-L3805

and modified by

    config.raftdb.defaultcf.block_cache_size = proxy_config.raftdb.defaultcf.block_cache_size;
    config.rocksdb.defaultcf.block_cache_size = proxy_config.rocksdb.defaultcf.block_cache_size;
    config.rocksdb.writecf.block_cache_size = proxy_config.rocksdb.writecf.block_cache_size;
    config.rocksdb.lockcf.block_cache_size = proxy_config.rocksdb.lockcf.block_cache_size;

https://github.com/pingcap/tidb-engine-ext/blob/521fd9dbc55e58646045d88f91c3c35db50b5981/proxy_components/proxy_server/src/config.rs#L405-L408

See pingcap/tidb-engine-ext@2f2900a.


And tiflash proxy limit the memory for CF to a small number.

pub fn memory_limit_for_cf(is_raft_db: bool, cf: &str, total_mem: u64) -> ReadableSize {
    let (ratio, min, max) = match (is_raft_db, cf) {
        (true, CF_DEFAULT) => (0.05, 256 * MIB as usize, usize::MAX),
        (false, CF_DEFAULT) => (0.25, 0, 128 * MIB as usize),
        (false, CF_LOCK) => (0.02, 0, 128 * MIB as usize),
        (false, CF_WRITE) => (0.15, 0, 128 * MIB as usize),
        _ => unreachable!(),
    };
    let mut size = (total_mem as f64 * ratio) as usize;
    size = size.clamp(min, max);
    ReadableSize::mb(size as u64 / MIB)
}

https://github.com/pingcap/tidb-engine-ext/blob/521fd9dbc55e58646045d88f91c3c35db50b5981/proxy_components/proxy_server/src/config.rs#L123-L134


The logic affects all released versions including LTS 6.1/6.5/7.1/7.5/8.1/8.5

As a result, consider an enough big memory, the limit could be:

  • kv cf: 128MB * 3 = 0.375GiB
  • raft cf: 0.05 * total available machine memory

So basicly, the memory limit is 0.05 * total available machine memory / 0.45 * 0.75.

  • If the machine memory is 376 GiB, then memory limit is 31.3GiB.
  • If the machine memory is 32 GiB, then memory limit is 2.66 GiB.

Note, even if the raft-engine is used, we still take the raft_db memory size into account.

And because there is a memory high water machanism, the memory usage on proxy will time another 0.1 factor

#[allow(clippy::derivable_impls)]
impl Default for ProxyConfig {
    fn default() -> Self {
        ProxyConfig {
            raft_store: RaftstoreConfig::default(),
            server: ServerConfig::default(),
            rocksdb: RocksDbConfig::default(),
            raftdb: RaftDbConfig::default(),
            storage: StorageConfig::default(),
            enable_io_snoop: false,
            memory_usage_high_water: 0.1,
            readpool: ReadPoolConfig::default(),
            import: ImportConfig::default(),
            engine_store: EngineStoreConfig::default(),
            memory: MemoryConfig::default(),
        }
    }
}

So as a result,

  • If the machine memory is 376 GiB, then memory limit is 31.3GiB. TiFlash would complain if the memory is more than 3.1GiB.
  • If the machine memory is 32 GiB, then memory limit is 2.66 GiB. TiFlash would complain if the memory usage is more than 0.26 GiB.
@CalvinNeo CalvinNeo added type/bug The issue is confirmed as a bug. component/storage severity/major labels Dec 26, 2024
@CalvinNeo CalvinNeo added affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-6.1 This bug affects the 6.1.x(LTS) versions. affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. affects-8.5 This bug affects the 8.5.x(LTS) versions. component/storage severity/major type/bug The issue is confirmed as a bug. and removed type/bug The issue is confirmed as a bug. severity/major may-affects-5.4 component/storage may-affects-6.1 may-affects-6.5 may-affects-7.1 may-affects-7.5 may-affects-8.1 may-affects-8.5 labels Dec 26, 2024
@JaySon-Huang
Copy link
Contributor

JaySon-Huang commented Dec 26, 2024

After the "memory_usage_high_water" is hit. And the memory usage of "raft_msg_usage + cached_entries + applying_entries" reaches reject_messages_on_memory_ratio. It may cause tiflash reject append msg.

In such case, we may see logging as below in the tikv leader side "Raft raft: cannot step raft local message":

[2024/12/02 05:15:23.768 +09:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=5569340825] [region_id=5569340823]
[2024/12/02 05:15:25.771 +09:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=5569340825] [region_id=5569340823]

https://github.com/tikv/tikv/blob/a2c58c94f89cbb410e66d8f85c236308d6fc64f0/src/server/service/kv.rs#L2590-L2615

@CalvinNeo
Copy link
Member Author

Also in #9743, will discuss the relationship between reject msgappend and evict cache

@CalvinNeo
Copy link
Member Author

The reject append logic lies


        let res = async move {
            let mut stream = stream.map_err(Error::from);
            while let Some(mut batch_msg) = stream.try_next().await? {
                let len = batch_msg.get_msgs().len();
                RAFT_MESSAGE_RECV_COUNTER.inc_by(len as u64);
                RAFT_MESSAGE_BATCH_SIZE.observe(len as f64);
                let reject = needs_reject_raft_append(reject_messages_on_memory_ratio);
                for msg in batch_msg.take_msgs().into_iter() {
                    if let Err(err @ RaftStoreError::StoreNotMatch { .. }) =
                        Self::handle_raft_message(store_id, &ch, msg, reject)
                    {
                        // Return an error here will break the connection, only do that for
                        // `StoreNotMatch` to let tikv to resolve a correct address from PD
                        return Err(Error::from(err));
                    }
                }
                if let Some(ref counter) = message_received {
                    counter.inc_by(len as u64);
                }
            }
            Ok::<(), Error>(())
        };

Note that only append msg will be rejected. A peer could be destroyed due to tombstone msg which could happen if the region is moved out of the memory-exhausted machine.

ti-chi-bot bot pushed a commit that referenced this issue Jan 3, 2025
close #9745

Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>

Co-authored-by: JaySon <tshent@qq.com>
JaySon-Huang added a commit to CalvinNeo/tiflash that referenced this issue Jan 9, 2025
close pingcap#9745

Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>

Co-authored-by: JaySon <tshent@qq.com>
ti-chi-bot bot pushed a commit that referenced this issue Jan 9, 2025
…9776)

ref #4982, close #9745

Signed-off-by: CalvinNeo <calvinneo1995@gmail.com>
Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>

Co-authored-by: JaySon <tshent@qq.com>
@seiya-annie
Copy link

/report customer

@ti-chi-bot ti-chi-bot bot added the report/customer Customers have encountered this bug. label Jan 13, 2025
ti-chi-bot bot pushed a commit that referenced this issue Feb 6, 2025
…9841)

ref #4982, close #9745

Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>
CalvinNeo added a commit to CalvinNeo/tiflash that referenced this issue Feb 8, 2025
close pingcap#9745

Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>

Co-authored-by: JaySon <tshent@qq.com>
ti-chi-bot bot pushed a commit that referenced this issue Feb 10, 2025
…9853)

ref #4982, close #9745

Signed-off-by: Calvin Neo <calvinneo1995@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-5.4 This bug affects the 5.4.x(LTS) versions. affects-6.1 This bug affects the 6.1.x(LTS) versions. affects-6.5 This bug affects the 6.5.x(LTS) versions. affects-7.1 This bug affects the 7.1.x(LTS) versions. affects-7.5 This bug affects the 7.5.x(LTS) versions. affects-8.1 This bug affects the 8.1.x(LTS) versions. affects-8.5 This bug affects the 8.5.x(LTS) versions. component/storage report/customer Customers have encountered this bug. severity/major type/bug The issue is confirmed as a bug.
Projects
None yet
3 participants