Field aliases #23714

Closed
clintongormley opened this issue Mar 23, 2017 · 32 comments
Labels: >feature, :Search Foundations/Mapping, Team:Search Foundations

Comments

@clintongormley
Contributor

It is hard to rename a field when using time-based indices: searches, and especially aggregations, only work on either the new or the old field name, so during the transition period not all data is visible.

We can introduce a new field type called alias which simply points to another field, eg:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "host": {
          "properties": {
            "source_ip": {
              "type": "ip"
            }
          }
        },
        "sourceIP": {
          "type": "alias",
          "path": "host.source_ip"
        }
      }
    }
  }
}

This field type would work as follows:

  • Attempts to index into the alias field would result in an exception - it is read only
  • Queries, aggs, suggestions, scripts (using doc[]), highlighting, fielddata_fields, docvalue_fields, stored_fields would just get the data (and mapping) from the specified path
  • Source filtering would not work with the aliased field

This also works for users who want to expose a nicer name for fields in Kibana.
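
The read and write behaviour listed above can be sketched as a toy model. This is only an illustration of the proposed semantics, not Elasticsearch code; `MAPPING`, `resolve_field` and `check_indexable` are hypothetical names, and the mapping is flattened to dotted paths for brevity.

```python
# Toy model of the proposed alias semantics; not Elasticsearch code.
# The mapping is flattened to dotted field paths for brevity.
MAPPING = {
    "host.source_ip": {"type": "ip"},
    "sourceIP": {"type": "alias", "path": "host.source_ip"},
}


def resolve_field(mapping, name):
    """Return the concrete field name and its mapping for a read request."""
    field = mapping[name]
    if field["type"] == "alias":
        # Reads against the alias use the data and mapping of the target.
        target = field["path"]
        return target, mapping[target]
    return name, field


def check_indexable(mapping, name):
    """Reject writes to alias fields, which are read-only in the proposal."""
    if mapping[name]["type"] == "alias":
        raise ValueError(f"cannot index into alias field [{name}]")
```

A query on `sourceIP` would then be executed as if it had been written against `host.source_ip`, while an attempt to index a value into `sourceIP` fails.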

@clintongormley added the :Search Foundations/Mapping, discuss, and >feature labels on Mar 23, 2017
@skearns64
Contributor

++.

We may also want to consider supporting these field aliases in the field_stats and the (I think) upcoming field_capability APIs.

@clintongormley added help wanted adoptme and removed discuss labels on Mar 24, 2017
@colings86
Contributor

Discussed in FixItFriday. We can see this being a useful feature for transitioning to a new field name. However, if the implementation turns out not to be as clean as it currently appears, we should discuss this again: mappings are already complex, and we should not increase that complexity too much.

@rjernst
Member

rjernst commented Mar 24, 2017

If we are going to add (however much) complexity here, can we trade it for removal of some other complexity/leniency? One of the things that has bugged me for a long time is unmapped fields stuff. Could we be more strict about fields existing across indexes, and if there is not a concrete field, they can create an alias (maybe even have an "empty" type of alias which means "match anything against this field").

@clintongormley
Contributor Author

I really don't want to make this change contingent on anything else. It is a good solution in and of itself. Let's keep this issue on topic.

@mrec

mrec commented Apr 7, 2017

Another (non-transitional) use case for this: we have a lot of examples where the master datasets for some indices/types define separate FooVariant1 and FooVariant2 fields while others don't make a distinction and so only have data for one of these. To support consistent searching across indices we currently use copy_to rules, but the data duplication is obviously wasteful; zero-cost aliases like this would be much nicer.

@nik9000
Member

nik9000 commented Apr 7, 2017

> If we are going to add (however much) complexity here, can we trade it for removal of some other complexity/leniency? One of the things that has bugged me for a long time is unmapped fields stuff. Could we be more strict about fields existing across indexes, and if there is not a concrete field, they can create an alias (maybe even have an "empty" type of alias which means "match anything against this field").

@rjernst, are you saying that if we had field aliases we could clean up the code around unmapped fields by mapping them to a field that explicitly doesn't index them? Or are you thinking of something at query time?

@mrec I think this feature would work for what you want. As envisioned the fields would have to be the same types for this to work.

@clintongormley I wonder if duplicating the old structure with ingest is a technically better solution. It'd work in all cases without requiring new code, but it isn't a thing you could do after the fact and it isn't free from a storage standpoint. copy_to is a similar thing, but with less space used and working in fewer contexts. Both are more flexible than an alias in that you can define different mappings for the field so you can handle the cases where you changed mappings. I wonder if we're better off documenting a few "recipes" for migrating fields over time.

@rjernst
Member

rjernst commented Apr 10, 2017

> are you saying that if we had field aliases we could clean up the code around unmapped fields by mapping them to a field that explicitly doesn't index them? Or are you thinking of something at query time?

@nik9000 Possibly. I just realized my original thought could be done now, regardless of aliases. That is, instead of having the current "unmapped fields" logic on the coordinating node, we could require adding a dummy empty field on older indexes for that name.

@clintongormley
Contributor Author

> @clintongormley I wonder if duplicating the old structure with ingest is a technically better solution. It'd work in all cases without requiring new code, but it isn't a thing you could do after the fact and it isn't free from a storage standpoint. copy_to is a similar thing, but with less space used and working in fewer contexts. Both are more flexible than an alias in that you can define different mappings for the field so you can handle the cases where you changed mappings. I wonder if we're better off documenting a few "recipes" for migrating fields over time.

@nik9000 the point is that with copy_to and ingest you need to reindex. I'm trying to solve the case where you are transitioning from field Foo in old indices to field foo in new indices, and you want to be able to run aggs or searches across old and new indices. A field alias can be added after the fact (when you realise you have a problem) at zero cost.
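
To make the Foo/foo case concrete, here is a toy simulation of an aggregation that spans an old and a new index. It is only a sketch of the idea; `OLD_INDEX`, `NEW_INDEX` and `terms_agg` are hypothetical names and the data is made up for illustration.

```python
from collections import Counter

# Hypothetical per-index state: alias definitions plus a few documents.
# The old index stores "Foo" and gets an alias "foo" added after the fact.
OLD_INDEX = {"aliases": {"foo": "Foo"}, "docs": [{"Foo": "a"}, {"Foo": "b"}]}
NEW_INDEX = {"aliases": {}, "docs": [{"foo": "a"}, {"foo": "c"}]}


def terms_agg(indices, field):
    """Count values of `field` across indices, following per-index aliases."""
    counts = Counter()
    for index in indices:
        # Substitute the concrete field name if this index defines an alias.
        concrete = index["aliases"].get(field, field)
        for doc in index["docs"]:
            if concrete in doc:
                counts[doc[concrete]] += 1
    return dict(counts)
```

With the alias in place, `terms_agg([OLD_INDEX, NEW_INDEX], "foo")` sees the documents of both indices; without it, the documents in the old index are silently skipped. The stored documents are never touched, which is why no reindex is needed.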

@ppf2
Member

ppf2 commented May 5, 2017

+1. For example, field aliases would have helped Kibana users a lot with the raw -> keyword field name change between 2.x and 5.x: older indices have .raw references while newer indices have .keyword.

@seang-es

Further enhancement for this: Could we construct the alias so that it can consist of an 'OR' of two other aliases? First name/last name in individual aliases with 'Name' as a separate alias that could hit on either would be helpful in some use cases.

@clintongormley
Contributor Author

> Could we construct the alias so that it can consist of an 'OR' of two other aliases? First name/last name in individual aliases with 'Name' as a separate alias that could hit on either would be helpful in some use cases.

No, this introduces a huge amount of complexity, e.g. we would need to silently upgrade single-field queries to compound queries when they run against multi-field aliases.

@ppf2
Member

ppf2 commented Dec 13, 2017

Yet another breaking change in our stack that is going to make this feature useful. In 6.0+, Filebeat no longer uses the input_type output field for prospectors; it has been renamed to prospector.type. That means Kibana users who have visualizations against input_type will have to handle this field name change when querying both older and newer indices as part of the upgrade.

@josefschiefer

josefschiefer commented Dec 14, 2017

We have various types of log files stored in indexes with different names for the timestamp field. Field aliases would be a great way to map the timestamp field to a common alias which I could use for sorting and aggregations.

The alternatives for the above are to either 1) copy the field to a unified field (requires more storage and re-indexing), or 2) use scripting for sorting or aggregations, which makes the query much slower.

I think field aliases are an elegant solution to make queries across indexes more useful and faster.

@rpedela

rpedela commented Jan 11, 2018

@clintongormley Why isn't renaming a field directly possible?

@josefschiefer

Renaming a field would require re-indexing the data, and it would also affect the integrations (e.g. dashboards) built on the old indexes.

For my use case, I am creating a "view" across indexes and trying to combine two fields that have different names. Elasticsearch uses filtered index aliases for views across indexes. Would it maybe be easier to support field aliases as part of filtered index aliases instead of adding them directly to the mapping?

@rpedela

rpedela commented Jan 11, 2018

@josefschiefer I am assuming you are responding to my question. If so, it wasn't directed at you. I should have put @clintongormley to make it more clear. Sorry about that.

@colings86
Contributor

colings86 commented Jan 12, 2018

@rpedela What @josefsalyer said is correct and is the reason why renaming a field directly is not possible. This is an open community and we welcome anyone to respond to any questions asked on issues; thank you @josefsalyer for taking the time to respond. Once Lucene segments are written they are never modified, so the only way to change all values for an entire field is to re-index. You can do this already, but this issue is trying to come up with a solution for when re-indexing is not feasible or for the period until a re-index can be done.

@rpedela

rpedela commented Jan 12, 2018

@colings86 I know you need to reindex currently, but why is that the case? Why is the field name set in stone within the Lucene segment? Does the field name have to be set in stone?

@colings86
Contributor

@rpedela It is set in stone because Lucene works in an append-only way. It never modifies files and only ever adds new files. This is why, when you update a document, you actually delete the document and create a new one. It is also why, when you delete a document, it is only marked as deleted in that segment and the actual delete is deferred until the segment is merged (if it is ever merged). This principle is important in Lucene as it makes the segments work well with the OS filesystem cache, which keeps searches fast.

Because of the above, in order to rename a field we would need to rewrite every segment in the index to change the field name in that segment (effectively, we would need to delete every document in the index and re-index it with the new field name). This is the same as re-indexing the whole index, so it doesn't really buy anything.
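
As a rough illustration of the append-only behaviour described above, here is a toy model. It is a deliberate simplification and not how Lucene is implemented; `Segment`, `Index` and `rename_field` are hypothetical names, and one segment per added document is an assumption made to keep the sketch short.

```python
class Segment:
    """A batch of documents written once, plus separate deletion marks."""

    def __init__(self, docs):
        self.docs = tuple(docs)  # written once, never modified afterwards
        self.deleted = set()     # positions marked as deleted


class Index:
    def __init__(self):
        self.segments = []

    def add(self, doc):
        # Every write creates a new segment; existing ones stay untouched.
        self.segments.append(Segment([doc]))

    def delete(self, doc_id):
        # A delete only marks the document; its data stays in the segment.
        for seg in self.segments:
            for pos, doc in enumerate(seg.docs):
                if doc["id"] == doc_id:
                    seg.deleted.add(pos)

    def update(self, doc):
        # An update is a delete followed by an add.
        self.delete(doc["id"])
        self.add(doc)

    def live_docs(self):
        return [doc for seg in self.segments
                for pos, doc in enumerate(seg.docs) if pos not in seg.deleted]

    def merge(self):
        # Merging writes one new segment that holds only the live documents.
        self.segments = [Segment(self.live_docs())]


def rename_field(index, old, new):
    """Renaming a field means re-adding every live document."""
    for doc in index.live_docs():
        renamed = {new if key == old else key: value
                   for key, value in doc.items()}
        index.update(renamed)
```

In this model `rename_field` touches every document and doubles the number of segments until a merge happens, which is the same work as a full reindex.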

@rpedela

rpedela commented Jan 12, 2018

@josefsalyer It wasn't my intent to make you feel unwelcome. I apologize.

@colings86 According to the Lucene docs, the field names are stored in an FNM file and the names are mapped to numbers. If I am understanding correctly, the field number, rather than the name, is used throughout the other files to reference a field. Is it possible to modify that file? There is an old, open issue with a patch that does just that, but it screwed up ordering. However, the quote below from the latest docs suggests ordering should no longer be a problem.

> FieldNumber: the field's number. Note that unlike previous versions of Lucene, the fields are not numbered implicitly by their order in the file, instead explicitly.

If modifying the FNM file is not possible, can ES store its own field name mapping, i.e. a mapping between the source's name and some immutable, unique name generated by ES? The ES name is used in Lucene and in the rest of the ES codebase, and renaming a field is just updating that mapping. I currently do this myself using a Postgres table so I can avoid reindexing.

@clintongormley
Contributor Author

@rpedela while what you say is correct, the situation is more complex than that. The field name doesn't only exist in the mapping and in Lucene, it also exists in the _source. There would be no way to change that without reindexing every document.

On top of that, if the index is still accepting new documents or changes, it is likely that those new documents would use the old field name, so now you end up with two fields...

This is why a field alias seems the better route to me.

@rpedela

rpedela commented Jan 12, 2018

@clintongormley That is a good point regarding modifying the FNM file.

In the case where ES stores a mapping, couldn't the _source also be modified to use the ES-generated names? Then when the _source is returned to the user, it is modified again based on the mapping. Iterating through JSON keys is pretty fast so it shouldn't be a performance issue. I also can't think of any weird edge cases where the index would be out of sync. Like I said previously, I do exactly that myself and I haven't noticed any problems.

Some people on this thread have voiced use cases for aliases other than renaming, which suggests it is a good idea. However, from a user's perspective, aliases specifically for renaming seem more complicated and less intuitive than a _rename API. And actually the mapping idea is basically an alias, but it is hidden from the user. In other words, I think we may agree on the solution. I just disagree on the API.

@clintongormley
Contributor Author

@rpedela what happens if you rename a field, then add a new field with the old name and perhaps a different mapping? Now you have a conflict. Aliases prevent that.

@rpedela

rpedela commented Jan 12, 2018

@clintongormley Why would there be a conflict? Let's map this out.

  1. foo is indexed as es_field_1 inside _source and the Lucene segment.
  2. foo is renamed to bar in the mapping, but es_field_1 is still used for _source and the Lucene segment.
  3. A new foo is added and indexed as es_field_2 inside _source and the Lucene segment. And bar still points to es_field_1.

The es_field_* are immutable and increment as fields are added.
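
The three steps can be sketched as a small name table. This is a toy illustration of the scheme described in this comment, not anything that exists in Elasticsearch; `FieldNameMap` and its methods are hypothetical names.

```python
class FieldNameMap:
    """Maps user-facing field names to immutable internal names."""

    def __init__(self):
        self.by_name = {}
        self.counter = 0

    def add(self, name):
        if name in self.by_name:
            raise ValueError(f"field [{name}] already exists")
        # Internal names only ever increment and are never reused.
        self.counter += 1
        self.by_name[name] = f"es_field_{self.counter}"
        return self.by_name[name]

    def rename(self, old, new):
        if new in self.by_name:
            raise ValueError(f"field [{new}] already exists")
        # Only the user-facing key changes; the internal name stays put.
        self.by_name[new] = self.by_name.pop(old)
```

Adding `foo`, renaming it to `bar`, and then adding a new `foo` leaves `bar` on `es_field_1` and the new `foo` on `es_field_2`.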

@clintongormley
Contributor Author

That's not how the _source works. The source field is an untouched copy of the JSON document you index. Having to change the source to have a layer of redirection between "virtual" field names and the field names stored and returned from the source would have a huge overhead (plus would introduce a hundred possible bugs thanks to the added complexity).

That just ain't gonna happen :)

@rpedela

rpedela commented Jan 12, 2018

Fair enough. Thanks for listening.

@clintongormley
Contributor Author

No problem :) Thanks for bringing up the idea

@jpountz
Contributor

jpountz commented Mar 14, 2018

cc @elastic/es-search-aggs

@reardencode

This may belong in a separate issue, but I'll start here: Similar to the all_fields execution mode, another use-case for field aliases might be as an alternative to using copy_to to create custom all fields on earlier versions.

"properties": {
  "all_ips": {
    "type": "alias",
    "paths": ["source_ip", "dest_ip"]
  }
}

This would let users choose between the index-time cost of copy_to and the search-time cost of constructing the MultiMatchQuery as all_fields does.

@colings86
Contributor

@reardencode thanks for the suggestion. We had talked about something similar, but we would prefer to keep things simple, at least for the first version of this feature, and restrict aliases to point to a single concrete field. This makes the logic a simple substitution of the field name rather than requiring us to produce a boolean query over all the aliased fields.
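
The difference between the two approaches can be sketched on a term query. This is a simplified illustration over plain query DSL dictionaries, not the actual implementation; `rewrite_term_query` and `rewrite_multi_alias` are hypothetical names.

```python
def rewrite_term_query(query, aliases):
    """Single-target alias: substitute the field name, keep the query shape."""
    field, value = next(iter(query["term"].items()))
    return {"term": {aliases.get(field, field): value}}


def rewrite_multi_alias(query, multi_aliases):
    """Multi-target alias: the query must become a bool over every target."""
    field, value = next(iter(query["term"].items()))
    paths = multi_aliases.get(field)
    if paths is None:
        return query
    return {
        "bool": {
            "should": [{"term": {path: value}} for path in paths],
            "minimum_should_match": 1,
        }
    }
```

With a single target the query keeps its type and only the field name changes. With several targets every query type would need its own expansion into a compound query, which is the extra complexity being avoided here.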

@mP1

mP1 commented Apr 16, 2018

@clintongormley

Is there any particular reason that aliases must exist as a property (mappings/my_type/properties) rather than as a new sibling of "mappings/my_type/properties", something like:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "host": {
          "properties": {
            "source_ip": {
              "type": "ip"
            }
          }
        }
      },
      "property-aliases": {
        "sourceIP": {
          "path": "host.source_ip"
        }
      }
    }
  }
}

I'm thinking this would make for a smaller change, as all the places that loop over properties would remain unchanged and wouldn't need to know about type=alias and do special things, even if that is only skipping it.

This would then mean aliases are only ever considered in the code that needs to build a query (etc).

Naturally the code would still enforce that aliases must point to real properties, and there must be no clash between aliases and properties even if they are in different parts of the mapping json graph.

@mP1

mP1 commented Apr 17, 2018

Answer to self: my proposal would probably require updates to the mappings loader, saver, and verifier to handle the new "property-aliases" branch. Better to stick to aliases being a new type under properties.

@jtibshirani removed the help wanted adoptme label on Jun 28, 2018
ruflin added a commit to ruflin/beats that referenced this issue Jul 23, 2018
In elastic/elasticsearch#23714 Elasticsearch implemented the alias field type. This can be used in fields.yml as follows:

```
- name: a.b
  type: alias
  path: a.c
```

`a.b` will be the alias for `a.c`.
andrewkroh pushed a commit to elastic/beats that referenced this issue Jul 24, 2018
In elastic/elasticsearch#23714 Elasticsearch implemented the alias field type. This can be used in fields.yml as follows:

```
- name: a.b
  type: alias
  path: a.c
```

`a.b` will be the alias for `a.c`.
@jtibshirani self-assigned this on Jul 26, 2018
@javanna added the Team:Search Foundations label on Jul 16, 2024