Add persistent storage to exporter #1278

Closed
rajkumar-rangaraj opened this issue Sep 16, 2020 · 4 comments · Fixed by open-telemetry/opentelemetry-dotnet-contrib#171
Labels
enhancement New feature or request pkg:OpenTelemetry Issues related to OpenTelemetry NuGet package

Comments

@rajkumar-rangaraj
Contributor

rajkumar-rangaraj commented Sep 16, 2020

To avoid data loss from transient errors, enable OpenTelemetry exporters to store failed telemetry and retry sending it at a later time. The idea is based on the persistent storage option available in the Application Insights SDK and on the design principles @reyang discussed for OpenCensus Python. Adding persistent storage to exporters will cover the following scenarios.

  1. When the SDK fails to export data to the backend system due to networking issues, we need to avoid eating up all the memory: either discard the excess data (depending on the case, the latest or the oldest) or store it locally (e.g. file, log, reliable pipe, ETW).

  2. In case of application exit/restart/crash, we want to reduce data loss. Some data loss is unavoidable because this is not a fully transactional system (e.g. your code writes traces to a queue and the process is killed before the queue item is processed, so the data is lost). Still, the ability to store things locally and pick them up later (after a machine or application restart) would be useful in some cases.

  3. Console applications (backend jobs, periodic tasks, command line tools) might need to store traces during the exit grace period, since sending all the data across the network might not be possible within that period.

  4. There are cases where developers need more reliability, for example audit logs and QoS logs. We might need to provide an alternative path so developers can trade performance for reliability (e.g. bypass the queue and synchronously persist the log to local storage, or even transmit the data across the network).

The design principles:

  1. Needs to work in a multi-threaded environment.
  2. Needs to work in a multi-process environment (e.g. one application has multiple process instances running at the same time).
  3. Should leverage existing components where possible, rather than reinventing the wheel.
  4. Needs to provide a solution for both agent and agentless scenarios.

Storage folder:

  • By default, data goes to a subfolder of the current user's home directory. The location is customizable.
  • Every application gets its own folder, derived from its executable path.
    Storage folder = transmission root folder / application folder
    transmission root folder = HOME directory of the current user, or the path explicitly specified by the user
    application folder name = SHA256 hash of (identity of the user running the application's process + path of the current executable)
  • Transmissions are stored under unique file names of the form datetimestamp(ISO 8601)-GUID. Example: 2020-09-15T210909.267417-21ae34ceb5ee46888f04f9ceb437eec6.blob
  • Store only the failed items, especially when the service reports partial success.
  • Store only retriable failed transmissions, and honor the RetryAfter value sent by the service when retransmitting.
  • Use exponential retry, and drop the data (old data) once the limit is reached.
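
The folder and file naming rules above could be sketched roughly as follows. This is only an illustration in Python, not the proposed .NET implementation: the function names are hypothetical, and the way the user identity and executable path are concatenated before hashing is an assumption, since the issue does not pin that down.

```python
import getpass
import hashlib
import sys
import uuid
from datetime import datetime, timezone
from pathlib import Path


def application_folder_name(user: str, executable: str) -> str:
    # Assumption: identity and path are simply concatenated, then hashed.
    return hashlib.sha256((user + executable).encode("utf-8")).hexdigest()


def storage_folder(root=None, user=None, executable=None) -> Path:
    # Transmission root defaults to the current user's home directory.
    root = Path(root) if root is not None else Path.home()
    user = user if user is not None else getpass.getuser()
    executable = executable if executable is not None else sys.executable
    return root / application_folder_name(user, executable)


def blob_file_name(now=None, guid=None) -> str:
    # Timestamp (no colons, so it is a valid file name on Windows) plus a GUID.
    now = now if now is not None else datetime.now(timezone.utc)
    guid = guid if guid is not None else uuid.uuid4()
    return f"{now.strftime('%Y-%m-%dT%H%M%S.%f')}-{guid.hex}.blob"
```

Because the timestamp prefix is fixed-width, sorting file names alphabetically also sorts them from oldest to newest, which is convenient when deciding which data to retry or drop first.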

Store data:

  • On failure, write the failed telemetry items to a file.
  • If the application is shutting down, store the data to disk and then attempt to send it.
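
One possible way to write a failed batch so that other threads or processes never observe a half-written file is to write to a temporary file and rename it into place. The sketch below is in Python for illustration only; `store_failed_items` is a hypothetical name, and the payload is treated as opaque bytes because the serialization format is up to each exporter.

```python
import os
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path


def store_failed_items(folder: Path, payload: bytes) -> Path:
    """Persist one failed transmission as a .blob file and return its path."""
    folder.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%S.%f")
    final_path = folder / f"{stamp}-{uuid.uuid4().hex}.blob"

    # Write to a temp file in the same folder, then rename it into place.
    fd, tmp_name = tempfile.mkstemp(dir=folder, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, final_path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return final_path
```

A reader that only looks for `*.blob` files will then see either the complete file or nothing at all.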

Read data:
A thread wakes up at a configurable interval (for example, every 30 seconds), reads data from the folder, and retransmits it to the backend.

Delete data:
Delete the data only after the transmission succeeds or the data expires.
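
The read and delete rules could be sketched as a periodic pass over the storage folder. This is a Python illustration under stated assumptions: `send` is a hypothetical callable that returns True when the backend accepts the payload, the 48-hour expiry is a placeholder value, and file age is taken from the modification time. RetryAfter handling, exponential backoff, and coordination between multiple processes reading the same folder are left out.

```python
import threading
import time
from pathlib import Path
from typing import Callable, Optional


def retransmit_once(folder: Path, send: Callable[[bytes], bool],
                    max_age_seconds: float, now: Optional[float] = None) -> int:
    """Try every stored blob once; return how many were sent successfully."""
    now = time.time() if now is None else now
    sent = 0
    for blob in sorted(folder.glob("*.blob")):  # name order = oldest first
        try:
            if now - blob.stat().st_mtime > max_age_seconds:
                blob.unlink()  # expired: drop without sending
                continue
            if send(blob.read_bytes()):
                blob.unlink()  # delete only after a successful transmission
                sent += 1
        except FileNotFoundError:
            continue  # another thread or process already handled this file
    return sent


def run_retransmit_loop(folder: Path, send: Callable[[bytes], bool],
                        interval_seconds: float = 30.0,
                        max_age_seconds: float = 48 * 3600,
                        stop_event: Optional[threading.Event] = None) -> None:
    """Wake up every interval_seconds until stop_event is set."""
    stop_event = stop_event if stop_event is not None else threading.Event()
    while not stop_event.wait(interval_seconds):
        retransmit_once(folder, send, max_age_seconds)
```

Files whose transmission fails are left in place, so the next pass retries them until they either succeed or expire.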

rajkumar-rangaraj added the enhancement label Sep 16, 2020
@reyang
Member

reyang commented Sep 16, 2020

  • File will contain NDJSON content (with \n line separators). For example:
{"some":"thing"}
{"foo":17,"bar":false,"quux":true}
{"may":{"include":"nested","objects":["and","arrays"]}}
  • Store only the failed items in NDJSON format, especially in case we get partial success from service.

We can probably leave this to each exporter since they might want to use other formats?

@rajkumar-rangaraj
Contributor Author

  • File will contain NDJSON content (with \n line separators). For example:
{"some":"thing"}
{"foo":17,"bar":false,"quux":true}
{"may":{"include":"nested","objects":["and","arrays"]}}
  • Store only the failed items in NDJSON format, especially in case we get partial success from service.

We can probably leave this to each exporter since they might want to use other formats?

Yes, it makes sense for every exporter to use their own format.

@reyang
Member

reyang commented Sep 16, 2020

application folder name = SHA256 hash of Process Identity + Path of current executable

What is a Process Identity?

@rajkumar-rangaraj
Contributor Author

application folder name = SHA256 hash of Process Identity + Path of current executable

What is a Process Identity?

I should have clarified this earlier. Process Identity has different meanings in different environments. Here I meant "the identity of the user account that is running the application process". I will update the issue.
