[Meta] Logging Projects #60391

Closed
22 of 30 tasks
joshdover opened this issue Mar 17, 2020 · 20 comments
Labels
  • enhancement (New value added to drive a business result)
  • Meta
  • Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)
  • Team:ResponseOps (Label for the ResponseOps team, formerly the Cases and Alerting teams)

Comments

@joshdover (Contributor) commented Mar 17, 2020

⚠️ This issue is WIP and not yet complete ⚠️

This issue is intended to be the source-of-truth for all things logging currently being planned or worked on in Kibana.

Some issues belong to multiple categories or tracks of work here, so there is some duplication.

Logging Projects

Kibana Platform Logger

The Kibana Platform (aka "New Platform") has a new logger and configuration that more closely matches Elasticsearch's usage of log4j. You can read more about its design here.
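
Here's a rough TypeScript sketch of how a plugin might obtain and use a context-scoped logger; the exact import path and interface names are a best guess and may differ from the current core APIs:

import { PluginInitializerContext, Logger } from 'src/core/server';

export class MyPlugin {
  private readonly logger: Logger;

  constructor(initializerContext: PluginInitializerContext) {
    // Loggers are hierarchical: this one logs under a plugin-scoped context,
    // so its level and appenders can be configured independently.
    this.logger = initializerContext.logger.get('ingestion');
  }

  public setup() {
    this.logger.debug('setting up ingestion');
    this.logger.error(new Error('something went wrong'));
  }
}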

Related issues:

Audit Logging

Audit logging is a security feature. While audit logging already exists in Kibana, it is currently quite limited. This is a high-priority project for Security across the Stack. The current plan is to build new audit logging features on top of the Kibana Platform logger.

Kibana Audit Logging Proposal

Related issues:

Blockers for security to begin building audit logging:

Future needs:

Alert History Log (Event Log)

The Alerting team has built an event_log plugin for recording specific events into a separate Elasticsearch index. Alerts do not integrate with this plugin yet, but integration is planned in the near term.
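
Roughly, a consumer registers a logger and writes ECS-shaped event documents. The sketch below is illustrative only; the method names (getLogger, logEvent) are assumptions rather than the plugin's confirmed API:

// Hypothetical shape of an event_log consumer, for illustration only.
interface EventLogger {
  logEvent(event: Record<string, unknown>): void;
}

interface EventLogStart {
  getLogger(defaultProperties: Record<string, unknown>): EventLogger;
}

export function recordActionExecuted(eventLog: EventLogStart, actionId: string) {
  const eventLogger = eventLog.getLogger({ event: { provider: 'actions' } });
  // Each event becomes an ECS-shaped document in a dedicated event-log index.
  eventLogger.logEvent({
    event: { action: 'execute' },
    kibana: { saved_objects: [{ type: 'action', id: actionId }] },
  });
}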

Related issues:

Log Ingestion

Monitoring is moving to metricbeat for ingestion of monitoring data to Elasticsearch. Operations would like to do the same with filebeat for ingesting normal logs into Elasticsearch.

It may also make sense to use filebeat for alerting's history log / event log, since it solves many of the problems that event log does not currently handle (buffering, exponential backoff, etc.).

If using filebeat fulfills the requirements for both use cases, it may make sense to combine these into a single log and/or a single index in Elasticsearch and filter for each use case at query time.

Pending Decisions

There are a number of decisions that affect more than one of these projects and need to be made in order to unblock them:

  • Should filebeat be used for normal logs, audit logs, and alerting history logs?
    • If so, should these logs have separate indices in Elasticsearch?
      • Each of these features follows different privilege models and having separate indices may make that simpler to enforce. However, it adds some additional complexity to Kibana installations and upgrades.
  • How important is it that all logging features go through the same or similar mechanisms?
    • Is the Platform logger flexible enough to support all these use cases?
@joshdover (Contributor, Author):

If so, should these logs have separate indices in Elasticsearch?

I've been thinking about this more, and I think there is quite a bit of value from having separate indices for different use cases. Beyond the security issue, we can also have different retention policies per use case. I think it's very likely that we'll want shorter retention for Alerting history than we would want for Audit logs.

If all indices use the same .kibana prefix, I believe we are covered by the existing index permissions granted to the kibana_system role.

The only real downside I can think of is the operational complexity added by needing to migrate or reindex multiple indices for major version upgrades. However, if all of these indices are using ECS, I believe schema migrations will be quite rare. This reduces the risk of failed upgrades quite a bit.

One outstanding question is whether these indices need to be "system indices" that are hidden from the user. If so, how much work is needed to support them in the Elasticsearch plugin?

@tylersmalley are there any other concerns that I am not thinking of?

@elasticmachine (Contributor):

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@elasticmachine (Contributor):

Pinging @elastic/kibana-operations (Team:Operations)

@elasticmachine (Contributor):

Pinging @elastic/kibana-security (Team:Security)

@elasticmachine (Contributor):

Pinging @elastic/kibana-platform (Team:Platform)

@kobelb (Contributor) commented Mar 18, 2020

Should filebeat be used for normal logs, audit logs, and alerting history logs?

I think we should be using filebeat for normal logs and audit logs, but not for populating the "alert history" that will be integrated within the Alerting application. From my understanding, users should be granted access to the history for an alert if they have access to the alert itself, so these should be stored in "system indices". Is there some benefit that I'm missing from using Filebeat to create these entries as opposed to having Alerting itself insert these documents?

One outstanding question is whether these indices need to be "system indices" that are hidden from the user. If so, how much work is needed to support them in the Elasticsearch plugin?

For the normal logs and the audit logs, I think these should be stored in either "hidden indices" or "data indices". Users should be granted access to them using normal Elasticsearch index privileges, and they should be available for use within applications like Discover, Dashboard, Visualize, Logging, etc. As far as I'm aware, data stored in "system indices" won't ever be accessible directly to end-users using the normal Elasticsearch document APIs.

If we want to ship Filebeat with Kibana and automatically start shipping the logs to Elasticsearch without any user intervention, we should use "hidden indices", as we can assume that the end-user isn't using them for something different.

If we optionally want to start shipping the logs to Elasticsearch with some user intervention, we can use "hidden indices" as well. However, we could also potentially use "data indices" and allow the end-user to specify the indices this data should be ingested into.

@jbudz (Member) commented Mar 18, 2020

cc @pmuellr for the alerting question above

@joshdover (Contributor, Author) commented Mar 18, 2020

Is there some benefit that I'm missing from using Filebeat to create these entries as opposed to having Alerting itself insert these documents?

The only benefits I'm aware of are the ones I listed above in the ingestion section:

[Filebeat] solves many of the problems that event log does not currently handle (buffering, exponential backoff, etc.).

We get some benefits out of the box; however, there are some drawbacks. For instance, it may be tricky to present feedback to the alerting plugin on the status of history ingestion.

That said, I'm supportive of treating alerting history differently. It seems to have sufficiently different requirements that it shouldn't use the same mechanics, at least for the time being. Maybe later down the line it will make sense to consolidate these systems, but I don't think we're there yet.

If we optionally want to start shipping the logs to Elasticsearch with some user intervention, we can use "hidden indices" as well. However, we could also potentially use "data indices" and allow the end-user to specify the indices this data should be ingested into.

I think what is key here is that we don't need to start ingesting logs right away for the Audit logging MVP. That can be an add-on feature in the future.

There are also some UI features we've talked about using log data for, but I think we should evaluate those from first principles. For example, SO history should probably be part of a larger versioning feature rather than just showing edit log events in the UI.

I think separating these efforts from the start, and then identifying overlap later may be the quicker path to delivering on all of these fronts. There are some obvious things that should share infrastructure and we should leverage those when we can, but I don't want to serialize everything artificially if it does not provide significant value right now in the short term.

@kobelb (Contributor) commented Mar 18, 2020

I think separating these efforts from the start, and then identifying overlap later may be the quicker path to delivering on all of these fronts. There are some obvious things that should share infrastructure and we should leverage those when we can, but I don't want to serialize everything artificially if it does not provide significant value right now in the short term.

I think this is a good path forward. You brought up some good points regarding the buffering and exponential backoff which Filebeat has implemented; I don't want to entirely gloss over them. @elastic/kibana-alerting-services do you know how many documents we're talking about being created every time that an alert runs?

@legrego (Member) commented Mar 19, 2020

If all indices use the same .kibana prefix, I believe we are covered by the existing index permissions granted to the kibana_system role.

Is the plan to have filebeat authenticate as the kibana_system user? I feel like that role is overly permissive for what filebeat requires. We might want to consider creating another user/role which is only able to append to these indices (via the create_doc index privilege). I think it's important to constrain the types of operations we are authorized to do. create_doc will allow us to ingest our logs, but prevent both updates and deletes.
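
As a sketch of what such a constrained role could look like (the role name and index patterns are made up, and the call assumes the @elastic/elasticsearch JS client):

import { Client } from '@elastic/elasticsearch';

async function createLogWriterRole() {
  const client = new Client({ node: 'http://localhost:9200' });

  // A role that can only append documents to the log indices: create_doc
  // permits indexing new documents but not updating or deleting existing ones.
  await client.security.putRole({
    name: 'kibana_log_writer',
    body: {
      indices: [
        {
          names: ['.kibana-audit-*', '.kibana-event-log-*'],
          privileges: ['create_doc'],
        },
      ],
    },
  });
}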

@mshustov (Contributor):

Blockers for security to begin building audit logging:
Allow plugins to register custom logging appenders #53256

@elastic/kibana-security are you going to use a custom appender?

Elasticsearch query log #58086

I'm not sure it's a blocker for Audit Logging. Does the Audit Logger depend on the built-in ES logging format? They overlap, but they have different requirements and output formats.

@joshdover, shouldn't we add the following issues to the blocker list:

@legrego (Member) commented Mar 19, 2020

@elastic/kibana-security are you going to use a custom appender?

To clarify, an appender is used to direct logs to a specific output (such as a file), right? Are we allowed to create our own instances of the built-in appenders? If so, that might be sufficient. We could direct audit logs to a specific file appender (separate from the "main" file appender the rest of Kibana is using).

I'm not sure it's a blocker for Audit Logging. Does the Audit Logger depend on the built-in ES logging format? They overlap, but they have different requirements and output formats.

Not that I'm aware of. @jportner what do you think?

@jportner (Contributor):

I'm not sure it's a blocker for Audit Logging. Does the Audit Logger depend on the built-in ES logging format? They overlap, but they have different requirements and output formats.

Not that I'm aware of. @jportner what do you think?

I don't think that's a blocker for security audit logging. We could certainly augment security audit logs if additional performance data was available, but that's not part of our MVP.

@joshdover (Contributor, Author):

Allow plugins to register custom logging appenders #53256

I was assuming that an extended version of the ECS layout may be necessary for adding the extra ECS fields. It's possible we can make the regular ECS layout support everything, but the data is only populated in the Audit log records.

Elasticsearch query log #58086

I'm not sure it's a blocker for Audit Logging. Does the Audit Logger depend on the built-in ES logging format? They overlap, but they have different requirements and output formats.

Good point, but audit logging will need the audit events to be emitted from the ES service. We should have separate issues for those.

👍

I don't think this is a hard requirement for the first phase of Audit Logging, but it will be needed in the second phase.

@mshustov (Contributor):

To clarify, an appender is used to direct logs to a specific output (such as a file), right? Are we allowed to create our own instances of the built-in appenders? If so, that might be sufficient. We could direct audit logs to a specific file appender (separate from the "main" file appender the rest of Kibana is using).

I thought we decided to configure the existing File / LogRotation appenders for this. Wouldn't the NP Logging hierarchical model allow us to pipe logs to the desired destination without introducing a new API?

I was assuming that an extended version of the ECS layout may be necessary for adding the extra ECS fields. It's possible we can make the regular ECS layout support everything, but the data is only populated in the Audit log records.

Yes, Layout defines the output format, not the content. So "the data is only populated in the Audit log records" sounds like the right move. And to note again: in Elasticsearch, the JSON layout follows the ECS format by default. We should refactor the existing JSON layout to ensure compatibility across the stack.

@pmuellr (Member) commented Mar 23, 2020

Some thoughts on the alerting event log, based on conversations above.

I think there is quite a bit of value from having separate indices for different use cases.

Ya, probably. I was thinking the shape of the logging service might be like saved objects, where I can create a separate ES index to serve as the storage but reuse the high-level APIs in the logging service. Even if the alerting event log doesn't end up needing a separate index, it's not hard to imagine some other solution wanting one in the future. Something to keep in mind anyway.

Should filebeat be used for normal logs, audit logs, and alerting history logs?

One of the design alternatives for the event log was to write to a file and ingest with filebeat, and I think this could be made to work. I ruled it out for the first version of our log, as I didn't want to be the plugin that added a 10MB binary to Kibana :-)

The main reason to use filebeat is to get our logger out of the business of buffering log events in case ES goes down. But that's about it. Our planned story is to buffer a small number (e.g., 100) of events in memory, throwing out the oldest events when the buffer gets full. So, kinda lossy. OTOH, if ES is actually down, then alerts and actions aren't going to be running either, so it's not even really clear we'd need this buffer for issues with ES being down. I'm expecting the primary benefit of the internal buffering of write events is that we can bulk-write them (every couple of seconds) rather than doing a write per event.
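
A tiny sketch of that buffering strategy (illustrative only, not the actual event log implementation): a bounded in-memory queue that drops the oldest entries when full and flushes on an interval, so writes go out as a single bulk request.

type LogEvent = Record<string, unknown>;

class BoundedEventBuffer {
  private events: LogEvent[] = [];

  constructor(
    private readonly maxSize: number,
    private readonly bulkWrite: (events: LogEvent[]) => Promise<void>,
    flushIntervalMs = 2000
  ) {
    // Flush every couple of seconds so we do one bulk write per interval
    // instead of a write per event.
    setInterval(() => void this.flush(), flushIntervalMs);
  }

  add(event: LogEvent) {
    if (this.events.length >= this.maxSize) {
      this.events.shift(); // lossy: discard the oldest event when the buffer is full
    }
    this.events.push(event);
  }

  private async flush() {
    if (this.events.length === 0) return;
    const batch = this.events;
    this.events = [];
    await this.bulkWrite(batch); // e.g. a single Elasticsearch _bulk request
  }
}

If the bulk write fails (e.g., ES really is down), a real implementation has to decide whether to re-queue or drop the batch, which is exactly the buffering/backoff territory filebeat already handles.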

I think alerting would be perfectly fine having a nice filebeat ingestion story, once we get there, and can get by with our current "write log entries with JS calls" till then.

The only real downside [of using separate ES indices] I can think of is the operational complexity added by needing to migrate or reindex multiple indices for major version upgrades.

I think the idea of migrating logging indices, or even re-indexing, is something we want to avoid. We are currently going with a story where we create per-stack-version indices, e.g., .kibana-event-log-8.0.0-000001 (the -000001 suffix is an ILM thing), copying what APM is currently doing. And there's an assumption that we will keep the schemas compatible enough that searching across old indices should work, except for searches that might involve new data added in newer versions of the log (i.e., only add new fields, never change or delete them). Worst case is that we'd need to introspect a bit on existing logs, look at their versions and the date ranges of the data contained in them, and use that to create elaborate searches, or do separate (clunky) searches across the different versions and join the results ourselves (not great, but for time series data, probably OK).

From my understanding, users should be granted access to the history for an alert if they have access to the alert itself, so these should be stored in "system indices".

Probably true that they should be in system indices; however, we're currently going to be in crunch mode for 7.8, where we REALLY need to have some amount of the event log operational, and it doesn't feel like we could ship this as system indices for 7.8.

System indices seem like the right way to go long-term. A purpose-built API would be nice. We'd need to figure out some kind of "ILM"-ish story; it could be pretty simple, like a map of time durations and states: "warm storage after 1 day, cold storage after 1 week, delete after 2 weeks" kind of thing. We'd manage an actual ILM policy from a constrained API surface we expose to users.
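
For reference, that maps pretty directly onto an ILM policy. A minimal sketch of the underlying policy (a TypeScript object; the policy name and rollover settings are illustrative), which would be installed with PUT _ilm/policy/kibana-event-log-policy:

const kibanaEventLogIlmPolicy = {
  policy: {
    phases: {
      hot: {
        actions: {
          // rollover is what produces the -000001, -000002, ... index suffixes
          rollover: { max_size: '50gb', max_age: '7d' },
        },
      },
      warm: { min_age: '1d', actions: {} },
      cold: { min_age: '7d', actions: {} },
      delete: { min_age: '14d', actions: { delete: {} } },
    },
  },
};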

I think separating these efforts from the start, and then identifying overlap later may be the quicker path to delivering on all of these fronts.

Concur. We're on a tight schedule for 7.8 anyway, so it seems unlikely we'd have a generic logger fully operational by then that would also be suitable for alerting.

We should start thinking about what it would take to converge these separate efforts (or maybe just alerting and everything else) into a single story. I think the "hardest" part will be converging on an extended ECS schema for Kibana. We probably want to start thinking about what our extension story is, so we can make sure it will work for alerting, etc. We already have a few Kibana extensions in our small ECS subset:

"kibana": {
"properties": {
"server_uuid": {
"type": "keyword",
"ignore_above": 1024
},
"namespace": {
"type": "keyword",
"ignore_above": 1024
},
"saved_objects": {
"properties": {
"store": {
"type": "keyword",
"ignore_above": 1024
},
"id": {
"type": "keyword",
"ignore_above": 1024
},
"type": {
"type": "keyword",
"ignore_above": 1024
}
},
"type": "nested",
"dynamic": "strict"
}
},
"dynamic": "strict"
}

Note that the saved_objects property is intended to be a primary search key through the event log: you would typically only be able to see the history of an alert/action if you can "SEE" the alert/action (security-wise), and so have access to the saved object type/id. We're not yet sure whether we really need multiple of these (hence the nested type), and if we do need multiple, whether we could simplify this down to a string type (url-ification of saved object references: a space/type/id kind of thing).
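
To make the url-ification idea concrete, here's a hypothetical helper (names made up) that flattens a reference into a single keyword-friendly string:

interface SavedObjectRef {
  space: string;
  type: string;
  id: string;
}

// A single "space/type/id" keyword is cheaper to index and query than a nested
// field, at the cost of losing per-sub-field queries and mappings.
export function savedObjectRefToKey(ref: SavedObjectRef): string {
  return [ref.space, ref.type, ref.id].map(encodeURIComponent).join('/');
}

// e.g. savedObjectRefToKey({ space: 'default', type: 'alert', id: 'abc-123' })
//      === 'default/alert/abc-123'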

@elastic/kibana-alerting-services do you know how many documents we're talking about being created every time that an alert runs?

How long is a piece of string? Users could have thousands of alerts that go off once per second, each scheduling thousands of actions to run. Or just a handful. No idea, really. There are some known customers who use a lot of ES watches; we've been keeping those in mind in terms of needing to support that kind of scale. In theory, alerting will be "simpler" for customers to use than Watcher, so you'd think we'll probably have customers creating more alerts than they created watches.

@gmmorris (Contributor) commented Apr 8, 2020

Hey, just pinging here as we're making a change to the top-level kibana object that @pmuellr describes above.

We're changing the saved_objects object so that each SO can have its own namespace, in preparation for #27004.

You can see this change in the PR here:

https://github.com/elastic/kibana/blob/a4f93abb557f4b2f2700271c32ef982f6b891fc4/x-pack/plugins/event_log/generated/mappings.json#L69-L104

We wanted to make sure this is visible here in case our top-level kibana key potentially clashes with the work being done on Kibana's ECS log usage.

@pmuellr (Member) commented Jun 9, 2020

One of the things I want to look into for the event log used by alerting is the new data streams support. I'm guessing the other logging uses referenced here aren't at the point of needing to think about this yet, but I figured I'd mention it and see if anyone else is looking into this.

The driver for using data streams for the event log is to make it easier to describe the relationship between the indices, aliases, templates, and ILM policies. It turns out to be tricky to get these all working together completely reliably, so it would be nice to get that additional reliability. Presumably, it also has some performance benefits for queries and maybe writes.

We'd target supporting this at a minor version level, as we currently have version-specific ES resources, so a new minor version can completely change the underlying implementation of these sorts of bits.
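
For anyone else looking into this, the wiring roughly reduces to a composable index template with a data_stream section plus an ILM policy. A sketch (template name, index pattern, and policy name are illustrative), installed with PUT _index_template/kibana-event-log-template:

const kibanaEventLogIndexTemplate = {
  index_patterns: ['kibana-event-log-*'],
  // The data_stream section tells ES to create a data stream (with hidden
  // backing indices) on first write, instead of us managing index/alias pairs.
  data_stream: {},
  template: {
    settings: {
      // Backing indices are rolled over and aged out by the ILM policy.
      'index.lifecycle.name': 'kibana-event-log-policy',
    },
  },
};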

@elasticmachine (Contributor):

Pinging @elastic/response-ops (Team:ResponseOps)

@lukeelmers (Member):

This issue hasn't been active in 2 years, and most of the items are completed, so I'll go ahead and close it.

If anyone feels we still need it, feel free to reopen. ❤️
