Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fleet] [Meta] Support for time series indexing, doc-value-only fields, and synthetic source #132818

Closed
11 of 14 tasks
joshdover opened this issue May 24, 2022 · 39 comments
Closed
11 of 14 tasks
Assignees
Labels
needs design Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@joshdover
Copy link
Contributor

joshdover commented May 24, 2022

We have three new indexing features in Elasticsearch that can reduce the overall storage size of data significantly:

We'd like to enable integration developers to start testing the ingest and query performance of enabling these features before we start making any changes in the integrations themselves or allowing end users to enable these from the Fleet UI.

Today, each of these can already be enabled by leveraging the *@custom component templates that Fleet installs for each integration data stream, to varying degrees of ease of use (details below). We could improve the UX around this for integration developers by adding an explicit API in Fleet to enable this, however it may not be necessary.

How to do this today

See https://github.com/elastic/integrations/blob/main/docs/how_to_test_new_indexing_features.md

@joshdover joshdover added Feature:EPM Fleet team's Elastic Package Manager (aka Integrations) project Team:Fleet Team label for Observability Data Collection Fleet team labels May 24, 2022
@elasticmachine
Copy link
Contributor

Pinging @elastic/fleet (Team:Fleet)

@ruflin
Copy link
Member

ruflin commented May 25, 2022

Should we be able to enable this on a package level or data stream level?

@joshdover
Copy link
Contributor Author

I'd prefer to keep this as simple as possible, and only do it on the package level if we don't need to be able to do it on a data stream level.

@joshdover joshdover changed the title [Fleet] Add opt-in support for time series indexing [Fleet] Add opt-in support for time series indexing, doc-value-only fields, and synthetic source Jun 16, 2022
@joshdover joshdover changed the title [Fleet] Add opt-in support for time series indexing, doc-value-only fields, and synthetic source [Fleet] Opt-in support for time series indexing, doc-value-only fields, and synthetic source Jun 16, 2022
@joshdover
Copy link
Contributor Author

joshdover commented Jul 8, 2022

@kpollich @jen-huang We could enable the ecosystem team to add each of these toggles if we first provided a basic framework for adding a package-level setting that is used at install time. I think that framework need to include:

  • Storing settings on the epm-packages object
  • Exposing an API field for updating the settings on a package (likely on the package install endpoint)
  • Exposing the settings to the various installation paths during install
  • Exposing a preconfiguration field for settings (optional)

This would probably be quite low effort to provide the basic plumbing and I think the Ecosystem folks would be able to use that to make the specific changes required for each feature.

@andresrc
Copy link

@joshdover for the specific case of synthetic source, given that the change does not depend on specific fields, etc. Do you think it would be feasible, as a first step, to introduce this toggle (package or data-stream level, whatever we think it's better) just on the Fleet side without requiring any new setting in the packages? This would remove the need to do an additional release of every existing package.

Once we add the setting, it can be used as the default value for the toggle for that package / data-stream that we should make sure is "use synthetic source" for every new package that we create.

@joshdover
Copy link
Contributor Author

@andresrc I think starting with synthetic source would make the most sense. We could use that to build out the plumbing I described in #132818 (comment) and then leverage it for other opt in features in the future.

One tricky thing about synthetic source is the limitation on keyword fields that have ignore_above configured. Synthetic source does not support this yet. This makes the feature a little unusable with the majority of our data streams. So if we were to enable support for this on the Fleet API, I think we'd want to put it on the package level but then only apply it to data streams that do not have any ignore_above keyword fields.

@andresrc
Copy link

andresrc commented Jul 27, 2022

(edited after #132818 (comment))

@joshdover @jsoriano I as we move forward with the testing we keep finding corner cases: the ignore_above parameter, the fact that some logs data streams seem to store event.original just in _source. There's also the possibility mentioned in different place that this might be a breaking change in some cases.

Given this, I would like to propose a gradual approach.

Phase I

Add a toggle at the data stream level with a default value of not using synthetic source. This would allow greater granularity at enabling the feature to test for potential breakages and leverage the benefits when possible. We have very big packages which mix logs and metrics data streams. Package-level granularity would not be very practical.

When the toggle is enabled:

  • Synthetic source will be enabled.
  • The ignore_above property would be removed from all fields that have it.

The toggle (or the screen containing the toggle) would also show some badge or similar warning that this is technical preview / beta feature.

With this, we could start gradually recommending the use of synthetic source in specific data streams without changes as we feel confident about them and without generating the risk of breaking changes.

Phase II

Depending on the results of the previous phase, we can consider different options:

  • Add a setting in the package. When it is enabled at the data stream level it will always use synthetic source with no option of disabling it. This would be used only for new data streams.
  • Enable synthetics by default for new integrations that are installed.
  • Add a package level / global level toggle that can override the individual data streams to enable synthetic source.

We will also need to consider the decisions made around the future of the ignore_above property.

Phase III

(probably in a major)

Always use synthetic source by default.

@jsoriano
Copy link
Member

@andresrc thanks, the plan sounds good to me. The only change I would do is to make the phase II conditional to the results of phase I. If things go well I think that we could go directly to the first point of phase III, enabling synthetic source by default in all new policies, and avoid adding a setting that would have expiration date. Such a setting may be confusing for package developers.

@andresrc
Copy link

Thanks @jsoriano , edited the comment

@joshdover
Copy link
Contributor Author

@andresrc

Phase I
Add the toggle at the data stream level with a default value of not using synthetic source.

Is this a UI toggle or an API toggle? If it's an API, I don't think we really need to create anything new as it seems the same as asking the users to set this on the custom component template using it's existing Elasticsearch API.

If we want UI support, we could add the toggle to component template editor UI to allow setting synthetic source there (right now you have to do it in Dev Tools). This would help us avoid needing to determine where to list these data steams and settings for them (we don't have any such UI today).

Phase II
Add a setting in the package. When it is enabled at the data stream level it will always use synthetic source with no option of disabling it. This would be used only for new data streams.

One tricky thing is that we may need to specify an ES version number requirement for the data stream since the restrictions around synthetic source are changing between releases (eg. support for ignore_above). Or would we simply bump the kibana.version constraint to be the same minimum version required for the data stream to use synthetic source?

@jsoriano
Copy link
Member

using it's existing Elasticsearch API

@joshdover I think these flags should be managed by Fleet, in case we later decide to enable these features everywhere automatically, or depending on certain conditions.
It may also happen that multiple settings are needed, as in the case of TSDB or doc-value-only fields. It could be also the case for synthetic source if we decide to remove ignore_above when it is enabled.

While Fleet is aware of the setting, I don't have a strong preference for API or UI. UI would be better to make it easier to recommend its use to users or not so experienced package developers, but it'd be nice if the flags are also exposed through the Fleet API in any case.

@andresrc
Copy link

@joshdover I have updated the comment with some additional considerations, including what to do with ignore_above

Is this a UI toggle or an API toggle? If it's an API, I don't think we really need to create anything new as it seems the same as asking the users to set this on the custom component template using it's existing Elasticsearch API.

I would prefer to start with an UI toggle as it would be easier for final users to try it for certain data streams where they might get the benefit.

If we want UI support, we could add the toggle to component template editor UI to allow setting synthetic source there (right now you have to do it in Dev Tools). This would help us avoid needing to determine where to list these data steams and settings for them (we don't have any such UI today).

If we are doing more "magic" here (i.e. removing the ignore_above) would the component template editor UI still be the right place? But yes, the issue that we don't have a place to list the data streams for a given integrations, probably something that we would need to add to the (installed) integration page.

@joshdover
Copy link
Contributor Author

joshdover commented Aug 8, 2022

If we are doing more "magic" here (i.e. removing the ignore_above) would the component template editor UI still be the right place? But yes, the issue that we don't have a place to list the data streams for a given integrations, probably something that we would need to add to the (installed) integration page.

If we need to modify existing mappings for this "magic" then this probably isn't the right place for it. Let's figure out a good place within the Integrations UI. I think having something on the "settings" tab for an installed integration could make sense.

Marking this issue as needing design, since that is the next step before we can do implementation work.

@tommyers-elastic
Copy link
Contributor

we have discovered an issue whilst testing this on integration data streams. fleet's 'final pipeline' component template contains the event.agent_id_status field, which is a keyword with ignore_above. this is automatically applied and currently causes enabling synthetic _source to fail.

after a brief discussion it was suggested that simply removing ignore_above from this mapping is safe, since the values of the field are discreet and well known (one of missing, auth_metadata_missing, mismatch, verified). the relevant code is here.

@tommyers-elastic
Copy link
Contributor

tommyers-elastic commented Aug 12, 2022

FYI - i have found another barrier to enabling this in certain integrations which declare fields with dynamic mappings. the keyword field type generated by the dynamic mapping processor applies ignore_above by default. https://github.com/elastic/beats/blob/ea207346d651448b8917b0791b2b117b9f9b9212/libbeat/template/processor.go#L293

I'm looking into ways around this now, but wanted to give you a heads up that currently this fails at index time.

edit:
just doing some reading around this and it looks like the following issue is closely related. at this point i'm not 100% sure how it is pieced together but i will update when i understand it better. #129344

@kpollich
Copy link
Member

I spoke with @mukeshelastic a bit offline about potential avenues for implementation here.

Generally, I am in favor of an implementation that consists of the following

  • A Fleet managed API for setting indexing options including tsdb, doc-value-only, and synthetic source on a per data stream basis
  • An interface on the integration settings page that allow users to opt in to each of these indexing feature on each data stream for a given package
  • (Future) Package spec changes that allow package maintainers to opt in to these settings by default for given data streams

There is one major caveat that @tommyers-elastic has begun broaching above: the incompatibility between the ignore_above and synthetic source. Currently, if an index contains a mapping with ignore_above set and it's updated to opt in to synthetic source, the operation will fail as Tom has mentioned in #132818 (comment).

We have a few options to work around this limitation, mentioned in various comments above:

  1. Implement logic in Fleet such that we can intelligently add/remove ignore_above from mappings based on the enabled/disabled state of synthetic source
  2. Petition the Elasticsearch team to alter the mappings API such that ignore_above is silently (or with some non-fatal warning) ignored when synthetic source is enabled.

I'll attempt to detail the tradeoffs for each approach below.

1. Fleet intelligently adds/removes ignore_above setting

With this approach, Fleet will need to update all mappings for a given data stream such that they do not include an ignore_above setting upon an opt-in to synthetic source. Then, if synthetic source is disabled at some point in the future, Fleet will need to re-apply the ignore_above setting to those same mappings, and any mappings that have since been added to index.

This means Fleet will also need to detect the ignore_above setting in the @custom component template for a given package when opting in or out of synthetic source.

The risks here are mainly around implementing a substantial amount of logic for setting/unsetting these mapping settings across component templates for a package. There's room for error here, potential performance costs to consider, and generally a lot of moving parts.

2. Elasticsearch accepts the ignore_above setting, but ignores it or warns in cases where it conflicts w/ synthetic source

This option would involve petitioning the Elasticsearch team to update the mapping API such that we can enable synthetic source even if an index contains mappings with ignore_above set. This would provide a much better UX for Fleet, as we wouldn't have to take so much action to opt in and out of synthetic source for users.

This has its own set of risks, though, as a cross-team effort that would be beholden to the appropriate ES team's priorities. This is also likely a special case that breaks assumptions users will make based on how other Elasticsearch APIs behave. For instance, the data streams APIs support a subset of the index APIs' features, but Elasticsearch doesn't silently ignore unsupported field, parameters, etc from the index APIs when they're provided to data streams APIs. It responds with errors. Bucking this trend and implementing a different pattern for this special case introduces unpredictability and inconsistency into Elasticsearch APIs and how they handle "settings conflicts" like this.


I think in general, the second option is probably preferable. It would likely provide a more stable experience around specifically the synthetic source workflow we're proposing here. It seems to me that the risks around having Fleet implement "magic" to update mapping settings when synthetic source is enabled/disabled are less than those associated with API ergonomics. I'm also not an Elasticsearch engineer, though, so we should try to get this proposal in front of an appropriate team on that side of things.

@joshdover I'm sure you will have some thoughts on these two approaches and their tradeoffs, and I'm sure you can correct anything I'm misunderstanding. I'd appreciate some input from you and from anyone else who might have further thoughts here.

@ruflin
Copy link
Member

ruflin commented Aug 15, 2022

Thanks for the detailed write up @kpollich and the trade offs mentioned. If I think long term, ideally having synthetic source just becomes a setting on the data stream and all the magic around it is just handled by Elasticsearch (Option 2). But this might take a bit longer.

My suggestion would be to start talking about Option 2 to the Elasticsearch team but do a basic implementation for Option one 1 Fleet on the data stream level. This would be an experimental feature or similar with limited support. For example @custom is not supported. If a users uses @custom it is up to the user to make sure it is also aligned. I'm simplifying but all that Fleet would have to do is take the fields.yml mapping + mappings in the Fleet source code, strip all ignore_above, put it into the template and trigger a rollover. My hope would be that the effort of this would be much smaller as not all the edge cases you mentioned above have to be covered. These would be known limitations.

@jsoriano
Copy link
Member

jsoriano commented Aug 15, 2022

I would also say to start with Option 1 at least for the initial implementation of the opt-in feature, this can help to validate the feature and the specifics can evolve over time with option 2 or other alternatives that may appear.

@nik9000 wdyt about the option 2 described in #132818 (comment) ?

@kpollich
Copy link
Member

My suggestion would be to start talking about Option 2 to the Elasticsearch team but do a basic implementation for Option one 1 Fleet on the data stream level. This would be an experimental feature or similar with limited support. For example @Custom is not supported. If a users uses @Custom it is up to the user to make sure it is also aligned.

I'm +1 on this implementation strategy. Flagging the settings UI for these index settings as experimental gives us some leeway on things like this. I think we should probably capture the caveat with @custom templates in a docs writeup about the experimental index settings, as well.

@nik9000
Copy link
Member

nik9000 commented Aug 15, 2022

2. Petition the Elasticsearch team to alter the mappings API such that ignore_above is silently (or with some non-fatal warning) ignored when synthetic source is enabled.

Petition accepted. Sort of. I think we'll actually be able to support ignore_above with it's original intention. I'd follow elastic/elasticsearch#87480 and, likely, a follow up PR specifically about ignore_above.

But, like, I think it wouldn't be super bad to start by removing ignore_above. I know some other folks tried that with a smaller data set and found that they had fields that were too long and caused rejected documents. That's why I had another think about supporting it properly and I think figured out how.

@nik9000
Copy link
Member

nik9000 commented Aug 15, 2022

If a users uses @Custom

I feel bad for the owner of this github username. Same for whoever owns @timestamp. I've talked with @nik because we've been mixed up a few times in the past. Github's a funny place.

@joshdover
Copy link
Contributor Author

Since we shipped the experimental support for synthetic source in 8.5, I think we should consider getting doc-value-only fields support in next, with a similar experimental UX. I believe the amount of effort involved in adding support it this way should be quite small and may have a very large impact (20% storage savings, 20% improved indexing perf), so I'd prefer we enable this to start being tested sooner than later.

@kpollich do you have a rough estimate of effort required to add support for this as an experimental toggle? My understanding is that when enabled, we'd need to modify the component templates to set index: false on any field types that support this feature. Should we open a dedicated issue for just this part, similar to #140095?

@kpollich
Copy link
Member

kpollich commented Nov 1, 2022

@kpollich do you have a rough estimate of effort required to add support for this as an experimental toggle? My understanding is that when enabled, we'd need to modify the component templates to set index: false on any field types that support this feature. Should we open a dedicated issue for just this part, similar to #140095?

A separate issue would be great. The way we've added synthetic source as a toggle is fairly extensible, so I think a lot of the groundwork is already laid here. This is probably a one week lift to implement a toggle.

@joshdover
Copy link
Contributor Author

joshdover commented Nov 1, 2022

@kpollich I opened this issue: #144357

@ruflin
Copy link
Member

ruflin commented Nov 3, 2022

On the priority side, could we get the enabling of time series in first instead of doc values only part? This would allow us to better test TSDB indices.

@ruflin
Copy link
Member

ruflin commented Nov 3, 2022

The feature for TSDB might even be split up into 2 parts: Support in Fleet if it is set in the package (elastic/package-spec#357). @kpollich Is this supported today? And second to enable it on demand by switching over. Note: Switching back for TSDB is not possible as far as I know.

@kpollich kpollich changed the title [Fleet] Opt-in support for time series indexing, doc-value-only fields, and synthetic source [Fleet] [Meta] Opt-in support for time series indexing, doc-value-only fields, and synthetic source Nov 4, 2022
@joshdover joshdover changed the title [Fleet] [Meta] Opt-in support for time series indexing, doc-value-only fields, and synthetic source [Fleet] [Meta] Support for time series indexing, doc-value-only fields, and synthetic source Dec 23, 2022
@joshdover
Copy link
Contributor Author

@kpollich @jen-huang I have updated this issue to track both the experimental toggles for testing as well as the long-term support for the real GA feature support. All tasks and bugs related to these features should be added here.

@joshdover
Copy link
Contributor Author

This meta issue also needs an owner on the Fleet team. I want to make sure someone has the time to fully understand the goals of these new indexing features and how integrations should leverage them. I think some of the discrepencies in behavior that have gotten implemented (see #147684 for examples) may have been avoided with fewer people working on these items.

@kpollich
Copy link
Member

Created a few issues based on offline discussions over on Google docs with @lucabelluccini

Something else that hasn't come up yet - are there licensing restrictions around these indexing features at all?

https://www.elastic.co/guide/en/elasticsearch//reference/master/tsds.html doesn't mention any licensing restrictions for TSDS.

https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html#synthetic-source mentions synthetic _source is still in preview, but no licensing restrictions.

https://www.elastic.co/guide/en/elasticsearch/reference/current/doc-values.html#doc-value-only-fields seems to imply doc value only fields are GA, but no licensing restrictions specified.

As far as I can tell, these features are all available with a basic license, so I don't see any issues here. I could be wrong though, though I'm not sure how to confirm.

@joshdover
Copy link
Contributor Author

@giladgal where could we find this information regarding licensing for TSDS related features?

@giladgal
Copy link

@giladgal where could we find this information regarding licensing for TSDS related features?

Licensing is not described in the documentation. The information about licensing is in the file headers and in the subscriptions web page.

@kpollich
Copy link
Member

Hi all. I'm closing this in favor of a new meta issue tracking the few outstanding long-term support/stability tasks around TSDS. See https://github.com/elastic/ingest-dev/issues/1773. Thanks for all your help here!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs design Team:Fleet Team label for Observability Data Collection Fleet team
Projects
None yet
Development

No branches or pull requests