Project Proposal: Audit Logging SIG #2409
Original file line number | Diff line number | Diff line change | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,197 @@ | ||||||||||||||||
# Audit Logging | ||||||||||||||||
|
||||||||||||||||
## Background and description | ||||||||||||||||
|
||||||||||||||||
Audit logging is the capability to capture audit-relevant events of a system in order to meet compliance requirements. Such events may originate anywhere from the infrastructure (e.g. a Kubernetes cluster) up to the application level. It is a capability that is particularly relevant for providers of enterprise software. | ||||||||||||||||
|
||||||||||||||||
Unlike regular application logs, audit logs are usually subject to long retention periods and software providers must guarantee their completeness (i.e. guarantee of delivery). | ||||||||||||||||
|
||||||||||||||||
Examples of audit logs include (see [Appendix B: Examples of Audit Log Events]): | ||||||||||||||||
- failed login attempts | ||||||||||||||||
- permission changes (e.g. of a service account or application user) | ||||||||||||||||
- accessing sensitive information | ||||||||||||||||
- modification of data | ||||||||||||||||
|
||||||||||||||||
### Current challenges | ||||||||||||||||
|
||||||||||||||||
OpenTelemetry does not currently have a good solution for audit logging: | ||||||||||||||||
|
||||||||||||||||
- no semantic conventions for audit logs in OTel | ||||||||||||||||
Review comment:
Suggested change
Can you provide some examples of what would be part of such semantic conventions? My knowledge of audit logs is very limited, so it would help me understand the problem much better.

Review comment: @svrnm our experience has shown that in order to analyze audit logs at scale, it is important to define an (extensible) event catalog. The event catalog standardizes audit log events across workloads/produces. For example, our internal event catalog currently consists of 50+ such events. Ideally, such a catalog would be part of semantic conventions. To make this more tangible, I've added some examples to the appendix of the document:

Review comment: another example from the security world is https://github.com/ocsf.

Review comment: thanks @mlenkeit. Makes it much clearer. The For the I am just making those things up to exemplify the difference; they will probably take a different form or shape eventually. So, to make a long story short, here is a suggestion to rephrase:
Suggested change
@renewelches thanks for calling out OCSF. If I remember correctly, there were conversations in the past between OTel and OCSF, cc @lmolkova

Review comment: Regarding I understand how "semantic conventions for audit logs" can be misleading. To me, the suggestion that you made seems to describe mainly logs that are "already there" (e.g. events emitted by a K8s cluster) and can be considered relevant for audit purposes. Especially in enterprise software, it's common that applications produce logs that are specifically meant to be audit logs (and nothing else). To me, it's important that we find wording that covers both of these types. How about the following?
Suggested change

Review comment: As mentioned in another comment, this all depends on which attributes are changeable and which must be immutable. As I understand it, an attribute could be altered by a processor in the collector, which is something we would want to avoid or prevent in the case of audit logs. If we conclude that we can or should only guarantee immutability for the log itself, then we must live with replication/duplication. Otherwise we might have to add the constraint that certain attributes must also be immutable.

Review comment: +1 to looking into OCSF for security events and borrowing relevant semantic conventions from there. |
||||||||||||||||
- OTel APIs/SDKs do not tell the application whether data (in particular logs) has been successfully delivered to a remote endpoint. To guarantee delivery, the SDK must either provide that guarantee itself or report delivery results back to the application so that the application can ensure delivery. | ||||||||||||||||
- OTel Collector instances may lose audit logs in transit (i.e. no guarantee of delivery) | ||||||||||||||||
|
||||||||||||||||
See [Appendix A: Guarantee of Delivery] for more details | ||||||||||||||||
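
To illustrate the second challenge, the following is a minimal sketch of what delivery feedback at the application level could enable. It is not an existing OTel API: `emit_audit`, `AuditDeliveryError`, and `change_salary` are hypothetical names, and the in-memory `db` dict stands in for a real database transaction. The point is only that the application can undo a business operation when the audit event is not acknowledged.

```python
from typing import Callable, Dict


class AuditDeliveryError(Exception):
    """Hypothetical error: the audit event was not acknowledged."""


def change_salary(
    db: Dict[str, int],
    user: str,
    new_salary: int,
    emit_audit: Callable[[dict], bool],
) -> None:
    """Apply a change only if the matching audit event is acknowledged.

    `emit_audit` is an assumed callback that returns True once the
    event has been accepted downstream and False otherwise.
    """
    old_salary = db[user]
    db[user] = new_salary  # tentative change (stands in for a transaction)

    event = {
        "event": "DataModification",
        "data": {
            "objectId": user,
            "attribute": "salary",
            "oldValue": old_salary,
            "newValue": new_salary,
        },
    }

    if not emit_audit(event):
        db[user] = old_salary  # roll back: no audit trail, no change
        raise AuditDeliveryError(f"audit event for {user!r} was not delivered")
```

With today's fire-and-forget behavior the application has no signal to branch on, which is why the sketch has to assume a callback that reports the outcome.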
|
||||||||||||||||
### Goals, objectives, and requirements | ||||||||||||||||
|
||||||||||||||||
The goal of this project is to make OTel fit for audit logging purposes that meet compliance requirements of enterprise software providers, in particular: | ||||||||||||||||
|
||||||||||||||||
- REQ-01: Semantic conventions for application-level audit logs are defined | ||||||||||||||||
- REQ-02: Semantic conventions for infrastructure-level audit logs are defined | ||||||||||||||||
- REQ-03: Guaranteed delivery of audit logs exported via OpenTelemetry SDK | ||||||||||||||||
- REQ-04: OTel Collector instances must provide guaranteed delivery of audit logs, including when the Collector process is interrupted | ||||||||||||||||
|
||||||||||||||||
See [Appendix A: Guarantee of Delivery] for more details | ||||||||||||||||
|
||||||||||||||||
## Deliverables | ||||||||||||||||
|
||||||||||||||||
- semantic convention for audit logs | ||||||||||||||||
- extend OTel APIs/SDKs for audit logging purposes (in collaboration with the respective SIG) | ||||||||||||||||
- extend OTel Collector for audit logging purposes (in collaboration with the respective SIG) | ||||||||||||||||
|
||||||||||||||||
## Staffing / Help Wanted | ||||||||||||||||
|
||||||||||||||||
The following vendors are interested in improving this area: | ||||||||||||||||
- SAP (@mlenkeit, @FWinkler79) | ||||||||||||||||
- Microsoft (@reyang) | ||||||||||||||||
|
||||||||||||||||
Other vendors are invited to join the discussion. | ||||||||||||||||
|
||||||||||||||||
### Required staffing | ||||||||||||||||
|
||||||||||||||||
* Project lead: SAP (name tbd) | ||||||||||||||||
* Sponsors: | ||||||||||||||||
- @reyang | ||||||||||||||||
- tbd | ||||||||||||||||
* GC liaison: @svrnm | ||||||||||||||||
* Engineers for API/SDK: | ||||||||||||||||
* SAP will provide a prototype in two languages (tbd; likely two of Java, JavaScript, Go) | ||||||||||||||||
Review comment: I think we need a prototype in two parts:

Review comment: Thanks for pointing this out! It's clear to us, but I'll work on making this clearer in the doc... |
||||||||||||||||
* Engineers for OTel Collector: tbd | ||||||||||||||||
* Maintainers/approvers: tbd | ||||||||||||||||
|
||||||||||||||||
## Timeline | ||||||||||||||||
|
||||||||||||||||
TBD based on community involvement. | ||||||||||||||||
|
||||||||||||||||
## Labels | ||||||||||||||||
|
||||||||||||||||
- audit-logging (tbc) | ||||||||||||||||
|
||||||||||||||||
## Project Board | ||||||||||||||||
|
||||||||||||||||
TODO: add link | ||||||||||||||||
|
||||||||||||||||
## SIG Meetings and Other Info | ||||||||||||||||
|
||||||||||||||||
TODO: add information | ||||||||||||||||
|
||||||||||||||||
## Appendix | ||||||||||||||||
|
||||||||||||||||
### Appendix A: Guarantee of Delivery | ||||||||||||||||
|
||||||||||||||||
In the context of this document, guarantee of delivery describes the ability of delivering audit logs from source to destination through OTel means while ensuring that all such signals arrive at the destination and/or providing the source with a means to handle failed delivery. | ||||||||||||||||
|
||||||||||||||||
Messaging protocols that support different levels of delivery guarantees may refer to this behavior as _at least once_ or _exactly once_, as opposed to _at most once_. | ||||||||||||||||
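
The difference between these delivery levels can be sketched with a small simulation. Everything here is an assumption made for illustration: `ScriptedChannel` is a hypothetical transport whose behavior per send is scripted (`"drop"` loses the event, `"ack_lost"` delivers it but loses the acknowledgement, `"ok"` delivers and acknowledges), and events are assumed to carry a unique `id`.

```python
from typing import Callable, Dict, Iterable, List


class ScriptedChannel:
    """Hypothetical transport with a scripted outcome for each send."""

    def __init__(self, outcomes: Iterable[str], receiver: Callable[[dict], None]) -> None:
        self._outcomes = iter(outcomes)
        self._receiver = receiver

    def send(self, event: dict) -> bool:
        """Return True only if the sender sees an acknowledgement."""
        outcome = next(self._outcomes, "ok")
        if outcome == "drop":
            return False  # event lost in transit
        self._receiver(event)  # event reached the receiver
        return outcome != "ack_lost"  # the ack itself may still be lost


def send_at_most_once(channel: ScriptedChannel, event: dict) -> bool:
    """Single attempt: the event may be lost."""
    return channel.send(event)


def send_at_least_once(channel: ScriptedChannel, event: dict, max_attempts: int = 5) -> bool:
    """Retry until acknowledged: no loss, but duplicates are possible."""
    for _ in range(max_attempts):
        if channel.send(event):
            return True
    return False


class DedupStore:
    """Receiver that ignores repeated ids, so retries have no extra effect."""

    def __init__(self) -> None:
        self.events: Dict[str, dict] = {}

    def __call__(self, event: dict) -> None:
        self.events.setdefault(event["id"], event)
```

Retrying after a lost acknowledgement delivers the same event twice, so an at-least-once sender is usually paired with a receiver that deduplicates on a stable event id, as `DedupStore` does here.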
|
||||||||||||||||
We assume that every component that is involved in the delivery of audit logs from source to destination must support guarantee of delivery individually, rather than assuming that this ability can be provided by e.g. only the collector or SDK. | ||||||||||||||||
|
||||||||||||||||
The implications of guarantee of delivery can be illustrated with an example consisting of a workload, an OTel Collector instance, and a durable storage. The workload acts as the source and produces audit logs via the OTel API/SDK. It writes the data via OTLP to the collector. The collector is configured to export audit logs to a durable storage that acts as the destination such as an S3 bucket. | ||||||||||||||||
|
||||||||||||||||
The following implications would apply: | ||||||||||||||||
|
||||||||||||||||
- workload produces an audit-relevant event: | ||||||||||||||||
|
||||||||||||||||
The workload emits the event via the OTel API/SDK. It may wait for acknowledgement of receipt from the collector before proceeding. If the event is rejected or receipt is not acknowledged in time, the workload or SDK may act accordingly, e.g. retry, roll back a database transaction, inform the user, etc. | ||||||||||||||||
|
||||||||||||||||
- OTel Collector receives the event: | ||||||||||||||||
Review comment: Take a look at the current requirements: https://github.com/open-telemetry/opentelemetry-collector/blob/b9ff1bc54c992bc76cc9ecb0a7ee1f0f591f6d23/receiver/doc.go#L31 This open issue tracks compliance with requirements: open-telemetry/opentelemetry-collector#7460 |
||||||||||||||||
|
||||||||||||||||
To ensure that the event is not lost even if the collector process is terminated or crashes, the collector may need to persist the event before acknowledging receipt to the workload or SDK. If the event cannot be persisted, receipt must be rejected. | ||||||||||||||||
Review comment: What is the expectation if the collector instance disappeared (e.g., the machine running the collector exploded / was stolen)?

Review comment: I think this is the trickiest part, or to put it as a question: do we need guarantee of delivery between two components (workload->collector, collector->S3) or end-to-end (workload->S3)? I would assume "end-to-end" unless the collector can guarantee that data is persisted according to the auditing requirements.

Review comment: If the solution for audit logging with OTel meant that the OTel Collector had its own persistence, I would argue that theft/explosion/etc. are rather the responsibility of Operations, in terms of configuring said persistence such that it is resilient "enough". Or to make this more concrete: if, for example, something such as the storage extension was used, Operations would need to make sure that the database/file/redis storage runs in an HA mode. I'm stressing the if here, because I think this is a detail that the SIG should work out. Or do you think that's something that should rather be clarified upfront?

Review comment: I suggest that we leave this for the SIG to figure out. In the OTEP, I suggest that we avoid "guaranteed delivery" and use something like "certain degree/level of data delivery guarantee". Not a blocker for this PR though (I'm good with the current version).

Review comment: +1 for what @reyang wrote. I think it is good to have this in the appendix and some wording around it, since there are many people (including myself) who have only superficial knowledge of audit logs, so it helps to contextualize and understand what this is all about. So no more details are needed in this doc; this would be for the SIG to figure out. |
||||||||||||||||
|
||||||||||||||||
- OTel Collector exports the event: | ||||||||||||||||
|
||||||||||||||||
Once the event is exported and the target (i.e. S3) acknowledges receipt, the event can be dropped from the collector's persistence. | ||||||||||||||||
|
||||||||||||||||
- the target (i.e. S3) receives the event: | ||||||||||||||||
|
||||||||||||||||
Acknowledges receipt after persisting the event. | ||||||||||||||||
|
||||||||||||||||
Note that this is outside the scope of OTel. More generally, when using OTel for audit logging purposes, it is the user's (e.g. Ops) responsibility to configure a suitable export target. | ||||||||||||||||
|
||||||||||||||||
Note that this example may contain implementation details for illustration purposes. The actual implementation may differ as long as the requirements are met. | ||||||||||||||||
|
||||||||||||||||
The example is kept simple for illustration purposes. Many edge cases need to be discussed by the SIG, such as batch-sending of signals or handling of multiple export targets. | ||||||||||||||||
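
The collector-side steps of the example above (persist before acknowledging, reject when persisting fails, drop only after the target acknowledges) can be sketched as follows. This is a toy model under stated assumptions, not Collector code: `Collector` and `DurableTarget` are hypothetical classes, the `pending` dict stands in for disk-backed persistence, a fixed `capacity` is the only modeled reason persisting can fail, and events are assumed to carry a unique `id`.

```python
from typing import Dict


class DurableTarget:
    """Stand-in for the export destination (e.g. an S3 bucket)."""

    def __init__(self) -> None:
        self.objects: Dict[str, dict] = {}
        self.available = True

    def put(self, event: dict) -> bool:
        """Acknowledge (return True) only after the event is stored."""
        if not self.available:
            return False
        self.objects[event["id"]] = event
        return True


class Collector:
    """Hypothetical collector with its own persistence layer."""

    def __init__(self, target: DurableTarget, capacity: int = 1000) -> None:
        self._target = target
        self._capacity = capacity
        self.pending: Dict[str, dict] = {}  # stands in for durable storage

    def receive(self, event: dict) -> bool:
        """Persist first, then acknowledge; reject if persisting fails."""
        is_new = event["id"] not in self.pending
        if is_new and len(self.pending) >= self._capacity:
            return False  # cannot persist, so receipt is rejected
        self.pending[event["id"]] = event
        return True

    def export(self) -> int:
        """Export pending events; drop each one only after the target acks."""
        exported = 0
        for event_id in list(self.pending):
            if self._target.put(self.pending[event_id]):
                del self.pending[event_id]
                exported += 1
        return exported
```

If the target is unavailable, `export` leaves the events in `pending`, so a later call can retry them; batching and multiple export targets are deliberately left out.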
|
||||||||||||||||
It may turn out that not all OTel receivers, processors, or exporters can be made compatible with guarantee of delivery for audit logging purposes. | ||||||||||||||||
|
||||||||||||||||
### Appendix B: Examples of Audit Log Events | ||||||||||||||||
|
||||||||||||||||
The following list contains sample audit log events. They are shown in YAML for readability and intentionally do not follow any OTel-related schema. | ||||||||||||||||
|
||||||||||||||||
An event consists of the event name, event-specific data, and general metadata. The individual properties of these events would ideally be reflected in common or audit log-specific semantic conventions. | ||||||||||||||||
|
||||||||||||||||
- failed login attempts | ||||||||||||||||
|
||||||||||||||||
```yaml
event: UserLoginFailure
data:
  loginMethod: oidc
  failureReason: userLocked
metadata:
  id: 50b925b5-0ba9-42f3-b476-8a6795000046
  timestamp: 1732193414483
  ip: 10.11.12.13
  initiator: john-doe
  application: payroll
  tenant: fab54af9-f978-463e-9c02-f92db1afc2b4
```
|
||||||||||||||||
- permission changes (e.g. of a service account or application user) | ||||||||||||||||
|
||||||||||||||||
```yaml
event: AuthnRoleToUserAdd
data:
  user: jane-doe
  role: editor
metadata:
  id: 50b925b5-0ba9-42f3-b476-8a6795000046
  timestamp: 1732193414483
  ip: 10.11.12.13
  initiator: john-doe
  application: payroll
  tenant: fab54af9-f978-463e-9c02-f92db1afc2b4
```
|
||||||||||||||||
- accessing sensitive information | ||||||||||||||||
|
||||||||||||||||
```yaml
event: DppDataAccess
data:
  channelType: web
  channelId: https://payroll.example.com/user/jane-doe/compensation
  dataSubjectType: employeeID
  dataSubjectId: jane-doe
  objectType: compensation
  objectId: 1196f42b-8f12-4df0-9b1f-01c98d2c7291
  attribute: salary
  value: 50000
metadata:
  id: 50b925b5-0ba9-42f3-b476-8a6795000046
  timestamp: 1732193414483
  ip: 10.11.12.13
  initiator: john-doe
  application: payroll
  tenant: fab54af9-f978-463e-9c02-f92db1afc2b4
```
|
||||||||||||||||
|
||||||||||||||||
- modification of data | ||||||||||||||||
|
||||||||||||||||
```yaml
event: DataModification
data:
  objectType: CronJob
  objectId: my-sample-cronjob
  attribute: schedule
  oldValue: 0 0 1 * * # monthly
  newValue: 0 0 1 1 * # annually
metadata:
  id: 50b925b5-0ba9-42f3-b476-8a6795000046
  timestamp: 1732193414483
  ip: 10.11.12.13
  initiator: john-doe
  k8sCluster: my-sample-cluster
```
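
The event shape used in these samples (event name, event-specific data, general metadata) could be modeled roughly as shown below. This is a sketch under assumptions: `AuditEvent`, `to_attributes`, and the `audit.` key prefix are hypothetical and are not part of any OTel semantic convention; they only show how such properties might be flattened into log-record attributes once conventions exist.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class AuditEvent:
    """One audit log event: a name, event-specific data, general metadata."""

    name: str
    data: Dict[str, Any] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_attributes(self, prefix: str = "audit") -> Dict[str, Any]:
        """Flatten the event into dotted attribute keys (hypothetical names)."""
        attrs: Dict[str, Any] = {f"{prefix}.event": self.name}
        for key, value in self.data.items():
            attrs[f"{prefix}.data.{key}"] = value
        for key, value in self.metadata.items():
            attrs[f"{prefix}.metadata.{key}"] = value
        return attrs


# Usage with values taken from the failed-login sample above.
login_failure = AuditEvent(
    name="UserLoginFailure",
    data={"loginMethod": "oidc", "failureReason": "userLocked"},
    metadata={"initiator": "john-doe", "application": "payroll"},
)
```

Keeping `data` and `metadata` in separate namespaces mirrors the split in the samples: metadata keys repeat across event types, while data keys depend on the event name.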
|
||||||||||||||||
<!-- links --> | ||||||||||||||||
[Appendix A: Guarantee of Delivery]: #appendix-a-guarantee-of-delivery | ||||||||||||||||
[Appendix B: Examples of Audit Log Events]: #appendix-b-examples-of-audit-log-events |
Review comment: Good points! In addition, these are some things we might want to consider:

Review comment: @reyang thanks for mentioning these points.
Especially the API behavior is something that we had thought about initially. However, when we first pitched audit logging on Slack, we received the following comment from Ted Young:
Based on this initial feedback, we decided to file this SIG proposal without proposing such API changes.