[Metrics SDK] Segfault when short export period is used for metrics #1676

Closed
ahadnagy opened this issue Oct 12, 2022 · 4 comments · Fixed by #1682

@ahadnagy
Contributor

We're experiencing segfaults when using async instruments with a relatively short export period (e.g. OTEL_METRICS_EXPORT_INTERVAL_MILLIS=100 OTEL_METRICS_EXPORT_TIMEOUT_MILLIS=50).
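
For readers unfamiliar with how such settings are usually consumed, here is a minimal C++ sketch that turns the two variables above into an interval/timeout pair. It is stdlib-only and is not opentelemetry-cpp code: `ReadMillisFromEnv`, `ExportTiming` and `LoadExportTiming` are hypothetical names, and the fallback values are supplied by the caller rather than assumed. How the reporter's application actually reads these variables is not shown in the issue.

```cpp
#include <cassert>
#include <chrono>
#include <cstdlib>

// Hypothetical helper (not an opentelemetry-cpp function): read a positive
// millisecond count from an environment variable, or return `fallback` when
// the variable is unset, empty, non-numeric, or not positive.
inline std::chrono::milliseconds ReadMillisFromEnv(const char *name,
                                                   std::chrono::milliseconds fallback)
{
  assert(name != nullptr);
  const char *raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0')
  {
    return fallback;
  }
  char *end = nullptr;
  const long long value = std::strtoll(raw, &end, 10);
  if (*end != '\0' || value <= 0)
  {
    return fallback;  // trailing garbage or a non-positive value
  }
  return std::chrono::milliseconds(value);
}

// Hypothetical pair of settings mirroring the two variables in the report.
struct ExportTiming
{
  std::chrono::milliseconds interval;
  std::chrono::milliseconds timeout;
};

// The variable names are the ones quoted in the report; the fallbacks are
// whatever the caller considers sensible defaults.
inline ExportTiming LoadExportTiming(std::chrono::milliseconds fallback_interval,
                                     std::chrono::milliseconds fallback_timeout)
{
  return ExportTiming{
      ReadMillisFromEnv("OTEL_METRICS_EXPORT_INTERVAL_MILLIS", fallback_interval),
      ReadMillisFromEnv("OTEL_METRICS_EXPORT_TIMEOUT_MILLIS", fallback_timeout)};
}
```

With the values from the report, this yields a collection roughly every 100 ms with a 50 ms export budget, i.e. about ten collection cycles per second.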

To me, this looks like a race condition and strongly resembles #1663, but it's still present after bumping opentelemetry-cpp to the latest commit.

I was able to catch this with GDB; here's the stack trace:
(gdb)_backtrace_full.txt

If necessary, I'll try to put together a self-contained reproducer, but I'm hoping that it's something simple. :)

@ahadnagy ahadnagy added the bug Something isn't working label Oct 12, 2022
@ahadnagy ahadnagy changed the title Segfault when short export period is used for metrics [Metrics SDK] Segfault when short export period is used for metrics Oct 12, 2022
@lalitb lalitb self-assigned this Oct 12, 2022
@lalitb
Member

lalitb commented Oct 12, 2022

@ahadnagy Thanks for reporting. Looking at the stack trace, this seems different from #1663. Are you using the example code from https://github.com/open-telemetry/opentelemetry-cpp/blob/main/examples/otlp/grpc_metric_main.cc? I am not able to reproduce it with the given export periods. If possible, can you share your code? It needn't be self-contained for now. Also, for the observable instrument, ensure that the object is valid for the lifetime of metric collection.
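
The lifetime requirement in the last sentence can be illustrated with a small stdlib-only sketch. `PeriodicCollector`, `QueueStats` and `ObserveDepth` are hypothetical stand-ins, not opentelemetry-cpp types, and the function-pointer-plus-state-pointer callback shape is an assumption chosen to resemble a typical observable-callback registration. The point is only the ordering: the state the callback reads is declared before the collector, so it is destroyed after the collector's thread has been joined.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a periodic metric reader. NOT the opentelemetry-cpp
// API; it only models a background thread that invokes registered callbacks
// with an opaque, caller-owned state pointer.
class PeriodicCollector
{
public:
  using Callback = long (*)(void *state);

  explicit PeriodicCollector(std::chrono::milliseconds interval) : interval_(interval) {}

  // Joining here guarantees that no callback runs after the destructor returns.
  ~PeriodicCollector() { Stop(); }

  void Register(Callback callback, void *state)
  {
    assert(callback != nullptr);
    std::lock_guard<std::mutex> lock(mutex_);
    callbacks_.emplace_back(callback, state);
  }

  void Start() { worker_ = std::thread([this] { Run(); }); }

  void Stop()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    wake_.notify_all();
    if (worker_.joinable())
    {
      worker_.join();
    }
  }

  long CollectedTotal()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return total_;
  }

private:
  void Run()
  {
    std::unique_lock<std::mutex> lock(mutex_);
    do
    {
      for (const auto &entry : callbacks_)
      {
        total_ += entry.first(entry.second);  // dereferences caller-owned state
      }
      // Sleep for one interval, waking early if Stop() was requested.
    } while (!wake_.wait_for(lock, interval_, [this] { return stopping_; }));
  }

  std::chrono::milliseconds interval_;
  std::mutex mutex_;
  std::condition_variable wake_;
  bool stopping_ = false;
  long total_    = 0;
  std::vector<std::pair<Callback, void *>> callbacks_;
  std::thread worker_;
};

// Hypothetical application state observed by the callback.
struct QueueStats
{
  std::atomic<long> depth{0};
};

long ObserveDepth(void *state)
{
  return static_cast<QueueStats *>(state)->depth.load();
}

long RunLifetimeDemo()
{
  QueueStats stats;  // declared first, so destroyed last
  stats.depth = 7;
  PeriodicCollector collector(std::chrono::milliseconds(5));
  collector.Register(&ObserveDepth, &stats);
  collector.Start();
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  collector.Stop();  // after this, `stats` may safely go out of scope
  return collector.CollectedTotal();
}
```

Swapping the two declarations, or letting `stats` live in a narrower scope than the collector, would leave the background thread dereferencing a dangling pointer, which is the kind of misuse the comment above warns about.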

@ahadnagy
Contributor Author

@lalitb Thanks for looking into this! The code I'm using closely follows the mentioned example, with some utility abstractions. Before discovering this issue we were using commit 9e87a6eb5997bd923c3c3742727bd6bceff483e5 with the default values for period and timeout, and that worked.

There are two differences in our use case worth mentioning: the instruments might be registered a bit later than in the example (by a few hundred ms), and we're using more than 10 async instruments.

Tomorrow I'll try to reproduce this in the context of the example application, fingers crossed.

@ahadnagy
Contributor Author

ahadnagy commented Oct 13, 2022

@lalitb I did some more debugging today, and it seems that the issues we're having occur when the application terminates and its static objects are being destructed.

I was able to capture the failure with Valgrind as well, which gives a bit more context on where and when things were allocated and freed:
valgrind.txt

After figuring out the timing of the failure I was able to come up with a stand-alone reproducer:
https://github.com/ahadnagy/opentelemetry-cpp/tree/segfault-reproducer
(https://github.com/ahadnagy/opentelemetry-cpp/blob/segfault-reproducer/examples/otlp/reproducer.cc)

This fails with SIGABRT rather than a segfault, but the stack trace is almost identical.
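
One common cause of failures during static destruction is a background thread that is still running while objects it uses are being torn down. As a general mitigation (not necessarily the fix adopted for this issue), the thread can be stopped explicitly before `main()` returns. A minimal stdlib-only sketch, assuming the hypothetical names `BackgroundExporter`, `GlobalExporter` and `ScopedExporterShutdown`, none of which come from opentelemetry-cpp:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical background exporter; not an opentelemetry-cpp type.
class BackgroundExporter
{
public:
  ~BackgroundExporter() { Shutdown(); }

  void Start()
  {
    assert(!worker_.joinable());  // precondition: not already started
    running_.store(true);
    worker_ = std::thread([this] {
      do
      {
        ++ticks_;  // stands in for one collect-and-export cycle
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
      } while (running_.load());
    });
  }

  // Safe to call more than once; later calls find nothing to join.
  void Shutdown()
  {
    running_.store(false);
    if (worker_.joinable())
    {
      worker_.join();
    }
  }

  bool IsRunning() const { return worker_.joinable(); }
  long Ticks() const { return ticks_.load(); }

private:
  std::atomic<bool> running_{false};
  std::atomic<long> ticks_{0};
  std::thread worker_;
};

// Process-wide instance with static storage duration, as in the report.
BackgroundExporter &GlobalExporter()
{
  static BackgroundExporter exporter;
  return exporter;
}

// Scope guard: stops the exporter when the enclosing scope ends, i.e. before
// main() returns and destructors of static objects start to run.
class ScopedExporterShutdown
{
public:
  ScopedExporterShutdown() = default;
  ~ScopedExporterShutdown() { GlobalExporter().Shutdown(); }
  ScopedExporterShutdown(const ScopedExporterShutdown &)            = delete;
  ScopedExporterShutdown &operator=(const ScopedExporterShutdown &) = delete;
};

// Body of a hypothetical main(); returns true if the worker was stopped by
// the guard rather than being left for static destruction.
bool RunApplication()
{
  {
    ScopedExporterShutdown guard;
    GlobalExporter().Start();
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
  }  // guard's destructor joins the worker here
  return !GlobalExporter().IsRunning();
}
```

With the guard in place, the exporter's own destructor later finds no thread to join, so the order in which the remaining static objects are destroyed no longer matters to the worker.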

@lalitb
Member

lalitb commented Oct 13, 2022

Thanks @ahadnagy. I can reproduce the problem with your code, so it should be good to troubleshoot further :)
