fix go pipeline stop hang caused by improper component stop order #1914

Merged
merged 4 commits into from
Nov 26, 2024


henryzhx8
Collaborator

@henryzhx8 henryzhx8 commented Nov 26, 2024

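  • Background:

    • In the 2.0 era, a config change or a container info change triggered a full reload of all configs. For a pipeline with a Go input and an sls output, if the sender was blocked by network problems, the stop could occasionally hang in the aggregator flush. Ordinary pipelines have timeout protection, so a hang there does not affect the rest of the system. The self-monitoring pipeline, for reasons unknown, has no timeout protection, so once it hangs, Logtail as a whole stops working.
    • After the refactor to independent config hot reload, the self-monitoring pipeline is stopped only at process exit. At that point all data is written to disk, so the sender is not blocked and the pipeline does not hang. In practice the problem is therefore already gone, but only by coincidence: the root cause of the hang was never fixed.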
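  • Cause:
    When Go sends to C++, it first checks whether the queue is full. If it is, it then checks whether the pipeline is about to stop, and if so it temporarily stashes the pending data. The problem is that the "pipeline is stopping" flag is set only after the input, processor, and aggregator have been stopped. Since the aggregator is already stuck, the flag never gets a chance to be set, which causes the hang.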

  • Solution:
    Move the setting of the "pipeline is stopping" flag to the very beginning of the pipeline stop.
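
The ordering problem described above can be reproduced in a small, self-contained Go sketch. This is not the code from this PR: the `pipeline` type, its fields, and the `send`, `flushAggregator`, `stop`, and `stopsWithin` names are all hypothetical, and the C++ queue is modelled as a full buffered channel. It only illustrates why setting the flag before the aggregator flush lets the stop complete.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// pipeline is a hypothetical stand-in for a Go plugin pipeline; none of
// these names come from the real code base.
type pipeline struct {
	queue    chan string // bounded queue towards the C++ side
	stopping atomic.Bool // the "pipeline is stopping" flag
	stashed  []string    // data set aside while stopping
	buffered []string    // data still held by the aggregator
}

// send follows the described logic: try the queue; if it is full, check
// the stopping flag and stash the data, otherwise keep retrying.
func (p *pipeline) send(item string) {
	for {
		select {
		case p.queue <- item:
			return
		default: // queue is full
		}
		if p.stopping.Load() {
			p.stashed = append(p.stashed, item)
			return
		}
		time.Sleep(time.Millisecond)
	}
}

// flushAggregator pushes everything the aggregator still holds through send.
func (p *pipeline) flushAggregator() {
	for _, item := range p.buffered {
		p.send(item)
	}
	p.buffered = nil
}

// stop stops the pipeline. With flagFirst the flag is set before the
// aggregator flush (the fixed order); otherwise only after it (the old order).
func (p *pipeline) stop(flagFirst bool) {
	if flagFirst {
		p.stopping.Store(true)
	}
	p.flushAggregator()
	p.stopping.Store(true)
}

// stopsWithin reports whether stop returns within ms milliseconds while
// the queue is full, i.e. while the sender is blocked.
func stopsWithin(flagFirst bool, ms int) bool {
	p := &pipeline{queue: make(chan string, 1), buffered: []string{"a", "b"}}
	p.queue <- "blocked" // nothing drains the queue, so it stays full
	done := make(chan struct{})
	go func() {
		p.stop(flagFirst)
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(time.Duration(ms) * time.Millisecond):
		p.stopping.Store(true) // release the stuck goroutine so the demo can exit
		<-done
		return false
	}
}

func main() {
	fmt.Println("old order stops in time:", stopsWithin(false, 50))
	fmt.Println("flag-first stops in time:", stopsWithin(true, 50))
}
```

In this sketch the old order should time out, because `send` keeps retrying against a full queue and never sees the flag, while the flag-first order stashes the data and returns right away.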

@yyuuttaaoo yyuuttaaoo merged commit aab3058 into main Nov 26, 2024
15 checks passed
@yyuuttaaoo yyuuttaaoo deleted the fix/hang branch November 26, 2024 12:01
@yyuuttaaoo yyuuttaaoo added the bug Something isn't working label Nov 26, 2024
@yyuuttaaoo yyuuttaaoo added this to the v2.1 milestone Nov 26, 2024