
Add data_stream fields to spec #38

Draft · wants to merge 4 commits into main
Conversation

@felixbarny (Member) commented Dec 7, 2020

Data streams are a new approach to index management used by Elastic Agent. See also elastic/ecs#1145.
To let users influence which data stream (and, consequently, which index) the data is sent to, ECS loggers should offer configuration options for data_stream.dataset and data_stream.namespace. The data stream for logs is logs-{data_stream.dataset}-{data_stream.namespace}. As the settings are optional, Filebeat will fall back to the defaults (generic and default, respectively), so the data stream would be logs-generic-default.
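For illustration, a minimal sketch of what a logger-emitted log line could look like with both options set (the field values here are hypothetical; the exact output format is defined by each logger's spec):

{
  "@timestamp": "2020-12-07T09:30:00.000Z",
  "log.level": "info",
  "message": "shopping cart updated",
  "ecs.version": "1.6.0",
  "data_stream.dataset": "myapp",
  "data_stream.namespace": "production"
}

Filebeat would route this event to the logs-myapp-production data stream; without the two data_stream fields, it would fall back to logs-generic-default.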

The event.dataset field is going to be deprecated in favor of data_stream.dataset and is scheduled for removal in 8.0.

As we're planning to GA Elastic Agent in the 7.x timeframe, we'll have to support both event.dataset and data_stream.dataset for a while. Therefore, ECS loggers should set both fields to the same value, ideally before their initial GA release. This will make 7.x ECS loggers compatible with the 8.x stack.

When 8.0 comes along, ECS loggers should drop support for the event.dataset field. This would still allow 8.x loggers to be used with a 7.x stack. The only implication would be that the ML categorization job, which relies on event.dataset, wouldn't work. To fix that, users can use an ingest node processor to copy the value of data_stream.dataset to event.dataset, or downgrade their ECS logger to 7.x.
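As a sketch of that workaround (the pipeline name and condition are assumptions, not part of this spec, and the example presumes the dotted fields have been expanded into nested objects at ingest time):

PUT _ingest/pipeline/restore-event-dataset
{
  "description": "Copy data_stream.dataset back into event.dataset for consumers that still rely on it",
  "processors": [
    {
      "set": {
        "if": "ctx.data_stream?.dataset != null",
        "field": "event.dataset",
        "value": "{{data_stream.dataset}}"
      }
    }
  ]
}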

@ebeahan (Member) commented Dec 7, 2020

The event.dataset field is going to be deprecated in favor of data_stream.dataset and is scheduled for removal in 8.0.

@felixbarny Deprecating event.dataset is a topic discussed previously, but I'm not aware of a finalized decision from the ECS side.

The current guidance included in the data_stream fields RFC notes that event.dataset and data_stream.dataset should use the same value: https://github.com/elastic/ecs/blob/master/rfcs/text/0009-data_stream-fields.md#L123

@ruflin (Member) commented Dec 8, 2020

@felixbarny I would also assume that event.dataset and data_stream.dataset will coexist.

  • On the defaults: the default for data_stream.dataset is `generic`.
  • Disallowed characters: we have similar logic on the Kibana and Elastic Agent side; I think we should follow it (see the sketch below). @michalpristas can you chime in here?
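For context, a purely hypothetical sketch of what such constraints might look like as a spec/spec.json entry; the concrete rules are whatever this PR and the Kibana/Elastic Agent code settle on, and the character list below mirrors Elasticsearch's data stream naming restrictions as I understand them:

{
  "data_stream.dataset": {
    "type": "string",
    "comment": [
      "Hypothetical constraints: lowercase only;",
      "must not contain \\, /, *, ?, \", <, >, |, space, comma, # or :;",
      "'-' is also disallowed because it separates the parts of the data stream name."
    ]
  }
}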

Review comment on spec/spec.json (outdated, resolved)
@felixbarny (Member, Author)

I would also assume that event.dataset and data_stream.dataset will coexist.

That's how I specified it. Both event.dataset and data_stream.dataset will be set by the loggers. But starting with version 8.0 of the loggers, we can stop sending event.dataset, provided ECS drops the field in 8.0 and the Logs UI uses data_stream.dataset for the log rate anomaly detection.

Default for data_stream.dataset is generic

I've updated the spec. But to be clear: the loggers would not set data_stream.dataset=generic by default. The default for the loggers is to not add the data_stream.dataset field. Instead, Filebeat will apply the defaults if they're missing.

@ruflin (Member) commented Dec 10, 2020

I'm torn on whether it should be set by default or not. The advantage of setting it in the log itself is that even if another tool picks the logs up, they will contain all the data needed. But it adds to each log line :-(

@felixbarny (Member, Author)

I really think it's the responsibility of the "other tool" to add the defaults. It already needs to add metadata about the host, pod, container, etc., anyway.

@weltenwort (Member)

To add some context from the Logs UI side:

It uses event.dataset for the foreseeable future because it supports log entries in plain indices as well as in data streams. If storage size is a concern, it should work fine if event.dataset were defined as an alias for data_stream.dataset.
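For illustration, a field alias of that kind would be declared in the index mappings roughly like this (the index name is a placeholder):

PUT logs-myapp-production/_mapping
{
  "properties": {
    "event": {
      "properties": {
        "dataset": {
          "type": "alias",
          "path": "data_stream.dataset"
        }
      }
    }
  }
}

Queries and aggregations on event.dataset would then transparently resolve to data_stream.dataset without storing the value twice.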

@felixbarny (Member, Author)

@ruflin is adding event.dataset as an alias for data_stream.dataset something you're considering in the index templates created by Elastic Agent?

Should ECS loggers add both event.dataset and data_stream.dataset fields?

The advantage is that this would work with the widest range of Filebeat versions and the Logs UI.

The downside is that there's a duplication of data. Also, if event.dataset is defined as an alias for data_stream.dataset, it might cause issues if the log records also contain event.dataset. Maybe newer versions of Filebeat should just drop event.dataset if they see that data_stream.dataset is present?
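That drop could be expressed as an ingest-side remove processor, sketched here under the assumption that the fields arrive as nested objects (nothing like this exists in Filebeat today):

{
  "remove": {
    "if": "ctx.data_stream?.dataset != null",
    "field": "event.dataset",
    "ignore_missing": true
  }
}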

@ruflin (Member) commented Jan 20, 2021

Elastic Agent does not have any templates; I assume you are referring to the templates shipped by Elasticsearch? As you mentioned, there is a problem around event.dataset already existing. Filebeat is the only shipper we control; there might be others.

Also, there are still parts which don't use the new indexing strategy yet and where event.dataset is used. Maybe that could be the differentiator? For indices which are logs-*-*, data_stream.dataset is used; for everything else, it falls back to event.dataset?

@felixbarny (Member, Author)

Elastic Agent does not have any templates, I assume you are referring to the templates shipped by Elasticsearch?

I was referring to the templates shipped with Filebeat that are installed by Elastic Agent, IIUC.

For indices which are logs-*-*, data_stream.dataset is used; for everything else, it falls back to event.dataset?

But where should that decision be made? The ECS loggers don't and shouldn't know who's shipping the logs.

If ECS loggers add data_stream fields, non-datastream-aware shippers (such as older Filebeat versions) would just add them to the index, resulting in unnecessary keyword fields.

Maybe ECS loggers should just add event.dataset and newer versions of Filebeat should add a data_stream.dataset field with the same value? But that means users wouldn't be able to set data_stream.namespace in their logging config. That would need to be done in the Filebeat config.

To serve the Logs UI's log rate anomaly ML job, even data-stream-based logs-*-* indices need event.dataset, it seems.

@ruflin (Member) commented Jan 21, 2021

The templates are loaded by Elasticsearch directly. Here is an open PR to modify them, which should indicate where they live: elastic/elasticsearch#64978

Agree, ECS loggers should not care who ships the logs. What if we use the data_stream.dataset field only? To keep it compatible with the older event.dataset, it would be up to the collector, or even the Logs UI / ML job, to add an alias / runtime field.
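A runtime-field variant of that compatibility shim could look roughly like this (a sketch only; the index name is a placeholder):

PUT logs-myapp-production/_mapping
{
  "runtime": {
    "event.dataset": {
      "type": "keyword",
      "script": {
        "source": "if (doc.containsKey('data_stream.dataset') && doc['data_stream.dataset'].size() > 0) { emit(doc['data_stream.dataset'].value); }"
      }
    }
  }
}

Unlike an alias, the runtime field also tolerates documents where data_stream.dataset is absent; the script simply emits nothing.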

Let's first discuss the end goal and then work backwards from there on what we can do to mitigate the issues that come up in the short and mid term.

@apmmachine commented Feb 8, 2021

💔 Build Failed


Build stats

  • Start Time: 2021-08-30T04:31:00.402+0000

  • Duration: 3 min 53 sec

  • Commit: 8b37b7e


Steps errors (1)

  • Read yaml from files in the workspace or text (took 0 min 0 sec)
  • Description: .ci/.jenkins-loggers.yml

Log output (relevant excerpt)

[2021-08-30T04:33:52.371Z] ERROR: Pull Requests to ECS logging repos failed
[2021-08-30T04:33:52.371Z] java.io.FileNotFoundException: /var/lib/jenkins/workspace/s-logging-update-specs-mbp_PR-38/src/github.com/elastic/ecs-logging/.ci/.jenkins-loggers.yml does not exist.

@felixbarny (Member, Author)

Let's try to get this one over the finishing line.

Let's first discuss the end goal and then work backwards from there on what we can do to mitigate the issues that come up in the short and mid term.

The goal of ECS loggers is to be compatible with the widest range of stack versions possible. For that reason, I think it makes sense to include both event.dataset and data_stream.dataset, and to make sure they are always in sync.
If ECS loggers only added data_stream.dataset, older Filebeat versions would not add an event.dataset field.
Future versions of Filebeat may drop the event.dataset field and create an alias in the mapping instead to optimize storage.

@ruflin (Member) commented Mar 17, 2021

++, let's move forward with both as you suggested, @felixbarny. We can always adjust / optimise later.

@felixbarny (Member, Author)

I hit an issue when testing this end-to-end with elastic/ecs-logging-java#124 and Elastic Agent 7.11.2: the data_stream.dataset field in the log files was overridden by the dataset name configured in the policy editor. I vaguely remember that an integration can currently only have a single, fixed dataset, but that this is going to change, as APM Server also needs to be able to write to different data streams (per service.name).

[Screenshot: the dataset name setting in the policy editor]

@ruflin (Member) commented Mar 22, 2021

I think we need to open an issue for this in Beats. It basically means we need some logic along the lines of:

if data_stream.dataset is set { do nothing } else { add the default }

It is possible that apm-server can already do this today, but not the logfile input, AFAIK.
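If that logic ended up on the ingest side rather than in the input itself, the set processor's override flag expresses exactly this keep-if-present behavior (a sketch; the value is a placeholder for whatever the policy configures):

{
  "set": {
    "field": "data_stream.dataset",
    "value": "nginx.access",
    "override": false
  }
}

With "override": false, the processor only fills the field when it is missing or null and leaves a data_stream.dataset that came from the log line untouched.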

@felixbarny (Member, Author)

I've created an issue: elastic/beats#24683

I'll mark this PR as blocked, at least until we have a consensus that this should be implemented in the near to mid-term.

@ruflin (Member) commented Apr 12, 2023

@felixbarny I assume the routing feature might bring a new twist on this discussion?

@felixbarny (Member, Author)

It definitely does. Reading this thread is almost nostalgic 😄. This thread was the reason I got pulled more and more into routing, and it seems elastic/elasticsearch#76511 is finally landing soon.

The reroute processor will also unblock this PR.
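For reference, the reroute processor lets an ingest pipeline redirect events to a different data stream, e.g. per service (a sketch based on the Elasticsearch reroute processor; routing on service.name is just an example):

{
  "reroute": {
    "dataset": "{{service.name}}"
  }
}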
