Feature: batch size hint of stream consumption #16363

Closed · dantengsky opened this issue Sep 2, 2024 · 2 comments · Fixed by #16372

Labels: A-query Area: databend query · C-feature Category: feature

Comments

dantengsky (Member) commented Sep 2, 2024

Summary

When consuming streams, there are cases where the size of the change set needs to be constrained.

For example:

SELECT * FROM stream_s WITH CONSUME MAX_BATCH_SIZE_HINT 100;

While consuming stream_s, the query should aim to return a change set of at most 100 rows if possible; any remaining data is left for the next round of consumption.

@dantengsky dantengsky added the C-feature Category: feature label Sep 2, 2024
@dosubot dosubot bot added the A-query Area: databend query label Sep 2, 2024
sundy-li (Member) commented Sep 2, 2024

@dosu how to implement this feature?

dosubot (bot) commented Sep 2, 2024

Hi @sundy-li! I'm here to assist you with any bugs, questions, or contributions while you wait for a human maintainer. Let me know how I can help!

To implement the feature for batch size hint of stream consumption in SQL, such as SELECT * FROM stream_s WITH CONSUME MAX_BATCH_SIZE_HINT 100;, you need to modify the SelectInterpreter to handle the MAX_BATCH_SIZE_HINT parameter. This involves parsing the hint from the SQL query and then using it to configure the stream consumption.

Here is a high-level approach to implement this feature:

  1. Parse the MAX_BATCH_SIZE_HINT from the SQL query: Modify the SQL parser to recognize and extract the MAX_BATCH_SIZE_HINT parameter.

  2. Store the parsed hint in the SelectInterpreter: Add a field to the SelectInterpreter to store the batch size hint.

  3. Use the batch size hint during stream consumption: Modify the stream consumption logic to use the batch size hint.

Below is a simplified example of how you might implement this:

Step 1: Modify the SQL Parser

You need to modify the SQL parser to recognize the MAX_BATCH_SIZE_HINT parameter. This part is not shown in the provided code, but you would typically add this to your SQL grammar and parsing logic.
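To make the parsing step concrete, here is a minimal, self-contained sketch of the shape this change might take. The WithConsume type and the token-matching helper below are illustrative stand-ins for this discussion, not Databend's actual AST types or parser combinators:

// Hypothetical sketch: how a parsed `WITH CONSUME` clause might carry the hint.
// These names are illustrative stand-ins, not Databend's real AST.

#[derive(Debug, Clone)]
pub struct WithConsume {
    /// Populated when `MAX_BATCH_SIZE_HINT <n>` follows `WITH CONSUME`.
    pub max_batch_size_hint: Option<usize>,
}

/// Parse the tail of `... WITH CONSUME [MAX_BATCH_SIZE_HINT <n>]`
/// from a slice of already-lexed tokens.
fn parse_with_consume(tokens: &[&str]) -> Option<WithConsume> {
    match tokens {
        ["WITH", "CONSUME", "MAX_BATCH_SIZE_HINT", n, ..] => Some(WithConsume {
            max_batch_size_hint: n.parse().ok(),
        }),
        ["WITH", "CONSUME", ..] => Some(WithConsume {
            max_batch_size_hint: None,
        }),
        _ => None,
    }
}

fn main() {
    let clause = parse_with_consume(&["WITH", "CONSUME", "MAX_BATCH_SIZE_HINT", "100"]);
    assert_eq!(clause.unwrap().max_batch_size_hint, Some(100));
}

In the real parser this would be expressed in the existing grammar and combinator framework; the sketch only shows the intended mapping from tokens to an optional hint on the AST.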

Step 2: Store the Batch Size Hint in SelectInterpreter

Add a field to the SelectInterpreter to store the batch size hint.

pub struct SelectInterpreter {
    ctx: Arc<QueryContext>,
    s_expr: SExpr,
    bind_context: BindContext,
    metadata: MetadataRef,
    formatted_ast: Option<String>,
    ignore_result: bool,
    max_batch_size_hint: Option<usize>, // Add this field
}

impl SelectInterpreter {
    pub fn try_create(
        ctx: Arc<QueryContext>,
        bind_context: BindContext,
        s_expr: SExpr,
        metadata: MetadataRef,
        formatted_ast: Option<String>,
        ignore_result: bool,
        max_batch_size_hint: Option<usize>, // Add this parameter
    ) -> Result<Self> {
        Ok(SelectInterpreter {
            ctx,
            s_expr,
            bind_context,
            metadata,
            formatted_ast,
            ignore_result,
            max_batch_size_hint, // Initialize the field
        })
    }
}

Step 3: Use the Batch Size Hint During Stream Consumption

Modify the stream consumption logic to use the batch size hint.

#[async_backtrace::framed]
pub async fn build_pipeline(
    &self,
    mut physical_plan: PhysicalPlan,
) -> Result<PipelineBuildResult> {
    if let PhysicalPlan::Exchange(exchange) = &mut physical_plan {
        if exchange.kind == FragmentKind::Merge && self.ignore_result {
            exchange.ignore_exchange = self.ignore_result;
        }
    }

    let mut build_res = build_query_pipeline(
        &self.ctx,
        &self.bind_context.columns,
        &physical_plan,
        self.ignore_result,
    )
    .await?;

    // consume stream
    let update_stream_metas = query_build_update_stream_req(&self.ctx, &self.metadata).await?;

    let catalog = self.ctx.get_default_catalog()?;
    build_res
        .main_pipeline
        .set_on_finished(move |info: &ExecutionInfo| match &info.res {
            Ok(_) => GlobalIORuntime::instance().block_on(async move {
                match update_stream_metas {
                    Some(streams) => {
                        let r = UpdateMultiTableMetaReq {
                            update_table_metas: streams.update_table_metas,
                            ..Default::default()
                        };
                        info!("Updating the stream meta to consume data");
                        catalog.update_multi_table_meta(r).await.map(|_| ())
                    }
                    None => Ok(()),
                }
            }),
            Err(error_code) => Err(error_code.clone()),
        });

    // Use the batch size hint if provided
    if let Some(max_batch_size) = self.max_batch_size_hint {
        build_res.main_pipeline.set_max_batch_size(max_batch_size);
    }

    Ok(build_res)
}

In this example, set_max_batch_size is a hypothetical method that you would need to implement on the Pipeline struct to configure the batch size for stream consumption.
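If it helps to make that concrete, below is a minimal, self-contained sketch of what such a method and its enforcement could look like. Everything here — Pipeline, DataBlock, apply_batch_size_hint — is an illustrative stand-in for this discussion, not Databend's real API; in particular, a real implementation would slice a block's columns rather than just adjust a row counter.

// Hypothetical sketch of the pipeline-side change; names are assumptions.

#[derive(Default)]
pub struct Pipeline {
    /// Soft row budget for one round of stream consumption; None means unlimited.
    max_batch_size: Option<usize>,
}

impl Pipeline {
    /// Record the hint so that source operators can consult it.
    pub fn set_max_batch_size(&mut self, max_batch_size: usize) {
        self.max_batch_size = Some(max_batch_size);
    }

    pub fn max_batch_size(&self) -> Option<usize> {
        self.max_batch_size
    }
}

pub struct DataBlock {
    pub num_rows: usize,
}

/// Trim a sequence of blocks so the emitted rows stay within the budget;
/// anything beyond it is left for the next consumption round.
pub fn apply_batch_size_hint(blocks: Vec<DataBlock>, budget: usize) -> Vec<DataBlock> {
    let mut remaining = budget;
    let mut out = Vec::new();
    for mut block in blocks {
        if remaining == 0 {
            break; // leftover blocks stay unconsumed
        }
        if block.num_rows > remaining {
            block.num_rows = remaining; // a real impl would slice the block's columns here
        }
        remaining -= block.num_rows;
        out.push(block);
    }
    out
}

fn main() {
    let mut pipeline = Pipeline::default();
    pipeline.set_max_batch_size(100);

    let blocks = vec![DataBlock { num_rows: 60 }, DataBlock { num_rows: 60 }];
    let consumed = apply_batch_size_hint(blocks, pipeline.max_batch_size().unwrap());
    let total: usize = consumed.iter().map(|b| b.num_rows).sum();
    assert_eq!(total, 100); // the remaining 20 rows are left for the next round
}

Note that the hint is deliberately soft: it bounds how much of the stream is consumed in one round rather than guaranteeing an exact batch size.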

This is a high-level overview, and the actual implementation details may vary based on your specific codebase and requirements [1].

