Description
Enhancement
A Dumpling export job failed with Error 1105 (HY000): rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout when the query was executed through the TiFlash MPP path. The issue was triggered by very high TiFlash network bandwidth usage at the time, which delayed the keepalive ACK and caused the MPP stream to fail.
Reproduction Steps
This issue is not easy to reproduce in a stable way, because it depends on specific runtime conditions rather than on a simple deterministic query pattern alone.
A similar failure is more likely to happen when all of the following conditions are met:
- the query is executed on the TiFlash MPP path;
- the workload involves disaggregated reads with a relatively high number of S3 requests;
- TiFlash is under very high network load at that time; and
- the keepalive ACK from TiFlash is not returned to TiDB within the timeout window.
In this incident, the Dumpling export query hit exactly this kind of runtime condition and failed with keepalive ping failed to receive ACK within timeout.
Analysis
The Dumpling log showed that the export failed with rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout. The same symptom was also observed on the TiDB side during MPP stream receiving, which indicates that the failure happened on the TiDB–TiFlash RPC path rather than in Dumpling logic itself.
Further investigation showed that the affected SQL was executed on TiFlash. At the incident time, the TiFlash node handling the query was under very high network load, with observed bandwidth reaching about 2.28 GiB/s. The instance type's documented network bandwidth is 12 Gbit/s (up to 25 Gbit/s), which converts to roughly 1.4 GiB/s (up to about 2.91 GiB/s), so the observed traffic was already close to the upper end of the available bandwidth.
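The unit conversion above (decimal Gbit/s to binary GiB/s) can be checked directly:

```go
package main

import "fmt"

// gbitToGiB converts a bandwidth figure from Gbit/s (decimal, 10^9 bits)
// to GiB/s (binary, 2^30 bytes), matching the figures in the analysis.
func gbitToGiB(gbit float64) float64 {
	bytesPerSec := gbit * 1e9 / 8 // bits/s -> bytes/s
	return bytesPerSec / (1 << 30)
}

func main() {
	fmt.Printf("12 Gbit/s = %.2f GiB/s\n", gbitToGiB(12)) // prints 1.40
	fmt.Printf("25 Gbit/s = %.2f GiB/s\n", gbitToGiB(25)) // prints 2.91
}
```

Against these limits, the observed 2.28 GiB/s sits between the documented baseline and the burst ceiling, i.e. well into saturation territory.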
The TiFlash metrics further showed that the high network traffic was mainly driven by a large number of S3 read requests at that time. This was related to an intentionally increased concurrency strategy on the disaggregated read path, which improves performance in some cache-miss scenarios (#10522) but does not sufficiently constrain network bandwidth under workloads such as wide-table reads, cold reads, and multi-segment access. In addition, the current synchronous I/O model and the lack of effective S3 bandwidth governance can further amplify the pressure on network bandwidth.
Taken together, the evidence indicates that the incident was caused by the TiFlash execution path under heavy S3-related network pressure, which eventually led to the keepalive timeout between TiDB and TiFlash.