This repository has been archived by the owner on Jun 3, 2024. It is now read-only.
# Elasticsearch Process

This document describes the process of configuring Elasticsearch templates for Mimirsbrunn.

We can picture Elasticsearch as a black box in which we store JSON documents. These documents are of different kinds, depending on our business. Since we deal with geospatial data, and Navitia in particular works with public transportation, the types of documents we store are:

- administrative regions
- addresses
- streets
- points of interest (POIs)
- stops (public transportation)

We first submit configuration files to Elasticsearch that describe how we want each document type to be handled. These are the so-called component templates and index templates, which include:

- settings: how the text should be analyzed. Do we want to use synonyms, lowercasing, stemming, …
- mappings: how each field of each type of document listed above is handled.
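As a concrete illustration, the settings part of such a template could define a custom analyzer combining lowercasing and synonyms. This is only a sketch: the analyzer and filter names, and the synonyms themselves, are made up for the example; the actual settings live in the template files under `config/elasticsearch/templates`.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "synonyms_example": {
          "type": "synonym",
          "synonyms": ["bd => boulevard", "st => saint"]
        }
      },
      "analyzer": {
        "label_analyzer_example": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "synonyms_example"]
        }
      }
    }
  }
}
```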

When the documents are indexed according to our settings and mappings, we can then query Elasticsearch, and play with lots of parameters to push the ranking of documents up or down.

This document describes how we establish a baseline for these templates, and the process of updating them.

Configuring Elasticsearch templates is an iterative process which, when done right, results in:

- reduced memory consumption in Elasticsearch, by reducing the size and number of indices;
- reduced search duration, by simplifying the queries;
- better ranking.

## Creating Templates

### Gathering Fields

We'll construct a table of all the fields for each type of document. The source of information is the document itself, which is a Rust structure serialized to JSON. When building this resource, be sure to exclude the fields the serializer skips (those marked `skip`).

#### Administrative Regions

| field | type | description |
|-------|------|-------------|
| administrative_regions | `Vec<Arc<Admin>>` | A list of parent administrative regions |
| approx_coord | `Option<Geometry>` | Coordinates of (the center??) of the region, similar to `coord`. Given in lat/lon |
| bbox | `Option<Rect<f64>>` | Bounding box |
| boundary | `Option<MultiPolygon<f64>>` | Describes the shape of the admin region |
| codes | `BTreeMap<String, String>` | Some codes used in OSM, like ISO3166, ref:nuts, wikidata |
| context | `Option<Context>` | Used for debugging |
| coord | `Coord` | Coordinates of the region |
| country_codes | `Vec<String>` | Country codes |
| id | `String` | Unique id created by cosmogony |
| insee | `String` | A code used to identify regions in France. From OSM |
| label | `String` | ?? |
| labels | `I18nProperties` | ?? |
| level | `u32` | Position of the region in the admin hierarchy |
| name | `String` | Name |
| names | `I18nProperties` | Name, but internationalized, eg `name:en`, `name:ru`, `name:es` |
| parent_id | `Option<String>` | id of the parent admin region (or none if root) |
| weight | `f64` | A number associated with the population of that region |
| zip_codes | `Vec<String>` | Zip codes (there can be more than one) |
| zone_type | `Option<ZoneType>` | Describes the type, eg city, suburb, country, … |

#### Addresses

Addresses, compared to administrative regions, have very few unique fields: just the house number and the street.

| field | type | description |
|-------|------|-------------|
| approx_coord | `Option<Geometry>` | |
| context | `Option<Context>` | |
| coord | `Coord` | |
| country_codes | `Vec<String>` | |
| house_number | `String` | Identifier in the street |
| id | `String` | Unique identifier |
| label | `String` | |
| name | `String` | |
| street | `Street` | Reference to the street the address belongs to |
| weight | `f64` | |
| zip_codes | `Vec<String>` | |

#### Streets

Streets have no particular fields of their own:

| field | type | description |
|-------|------|-------------|
| administrative_regions | `Vec<Arc<Admin>>` | |
| approx_coord | `Option<Geometry>` | |
| context | `Option<Context>` | |
| coord | `Coord` | |
| country_codes | `Vec<String>` | |
| id | `String` | |
| label | `String` | |
| name | `String` | |
| weight | `f64` | |
| zip_codes | `Vec<String>` | |
#### POIs

| field | type | description |
|-------|------|-------------|
| address | `Option` | Address associated with that POI. Can be an address or a street |
| administrative_regions | `Vec<Arc<Admin>>` | |
| approx_coord | `Option<Geometry>` | |
| context | `Option<Context>` | |
| coord | `Coord` | |
| country_codes | `Vec<String>` | |
| id | `String` | |
| label | `String` | |
| labels | `I18nProperties` | |
| name | `String` | |
| names | `I18nProperties` | |
| poi_type | `PoiType` | id / name references in NTFS |
| properties | `BTreeMap<String, String>` | |
| weight | `f64` | |
| zip_codes | `Vec<String>` | |

#### Stops (Public Transportation)

| field | type | description |
|-------|------|-------------|
| administrative_regions | `Vec<Arc<Admin>>` | |
| approx_coord | `Option<Geometry>` | |
| codes | `BTreeMap<String, String>` | |
| comments | `Vec` | |
| commercial_modes | `Vec` | |
| context | `Option<Context>` | |
| coord | `Coord` | |
| country_codes | `Vec<String>` | |
| coverages | `Vec` | |
| feed_publishers | `Vec` | |
| id | `String` | |
| label | `String` | |
| lines | `Vec` | |
| name | `String` | |
| physical_modes | `Vec` | |
| properties | `BTreeMap<String, String>` | |
| timezone | `String` | |
| weight | `f64` | The weight depends on the number of lines, among other parameters |
| zip_codes | `Vec<String>` | |

### Partitioning Templates

When we combine all the fields from the previous documents, we obtain the following table, which shows all the fields in use, and the type of document each is used by.

| field | type | description | adm | add | poi | stp | str |
|-------|------|-------------|-----|-----|-----|-----|-----|
| address | `Option` | Address associated with that POI | | | ✓ | | |
| administrative_regions | `Vec<Arc<Admin>>` | A list of parent administrative regions | ✓ | | ✓ | ✓ | ✓ |
| approx_coord | `Option<Geometry>` | Coordinates of the object, similar to `coord` | ✓ | ✓ | ✓ | ✓ | ✓ |
| bbox | `Option<Rect<f64>>` | Bounding box | ✓ | | | | |
| boundary | `Option<MultiPolygon<f64>>` | Describes the shape of the admin region | ✓ | | | | |
| codes | `BTreeMap<String, String>` | Some codes used in OSM, like ISO3166, ref:nuts, wikidata | ✓ | | | ✓ | |
| comments | `Vec` | | | | | ✓ | |
| commercial_modes | `Vec` | | | | | ✓ | |
| context | `Option<Context>` | Used to return information (debugging) | ✓ | ✓ | ✓ | ✓ | ✓ |
| coord | `Coord` | | ✓ | ✓ | ✓ | ✓ | ✓ |
| country_codes | `Vec<String>` | Country codes | ✓ | ✓ | ✓ | ✓ | ✓ |
| coverages | `Vec` | | | | | ✓ | |
| feed_publishers | `Vec` | | | | | ✓ | |
| house_number | `String` | Identifier in the street | | ✓ | | | |
| id | `String` | Unique identifier | ✓ | ✓ | ✓ | ✓ | ✓ |
| insee | `String` | A code used to identify regions in France | ✓ | | | | |
| label | `String` | ?? | ✓ | ✓ | ✓ | ✓ | ✓ |
| labels | `I18nProperties` | ?? | ✓ | | ✓ | | |
| level | `u32` | Position of the region in the admin hierarchy | ✓ | | | | |
| lines | `Vec` | | | | | ✓ | |
| name | `String` | Name | ✓ | ✓ | ✓ | ✓ | ✓ |
| names | `I18nProperties` | Name, but internationalized, eg `name:en`, `name:ru`, `name:es` | ✓ | | ✓ | | |
| parent_id | `Option<String>` | id of the parent admin region (or none if root) | ✓ | | | | |
| physical_modes | `Vec` | | | | | ✓ | |
| poi_type | `PoiType` | id / name references in NTFS | | | ✓ | | |
| properties | `BTreeMap<String, String>` | | | | ✓ | ✓ | |
| street | `Street` | Reference to the street the address belongs to | | ✓ | | | |
| timezone | `String` | | | | | ✓ | |
| weight | `f64` | | ✓ | ✓ | ✓ | ✓ | ✓ |
| zip_codes | `Vec<String>` | | ✓ | ✓ | ✓ | ✓ | ✓ |
| zone_type | `Option<ZoneType>` | Describes the type, eg city, suburb, country, … | ✓ | | | | |

*TODO: talk about `type`, `indexed_at` (and the pipeline).*

### Component Templates

We can extract from this table a list of fields that are (almost) common to all the documents. In this table of common fields, we indicate which type is used for Elasticsearch, whether we should index the field, and some comments.

| field | type | adm | add | poi | stp | str | Elasticsearch | Index | Comment |
|-------|------|-----|-----|-----|-----|-----|---------------|-------|---------|
| administrative_regions | `Vec<Arc<Admin>>` | ✓ | | ✓ | ✓ | ✓ | | | large object |
| approx_coord | `Option<Geometry>` | ✓ | ✓ | ✓ | ✓ | ✓ | ?? | | Improved geo_point in Elasticsearch may render approx_coord obsolete |
| context | `Option<Context>` | ✓ | ✓ | ✓ | ✓ | ✓ | | | Output |
| coord | `Coord` | ✓ | ✓ | ✓ | ✓ | ✓ | geo_point | ✓ | Index for reverse API |
| country_codes | `Vec<String>` | ✓ | ✓ | ✓ | ✓ | ✓ | ?? | | Are we searching with these? |
| id | `String` | ✓ | ✓ | ✓ | ✓ | ✓ | keyword | ✓ | Index for features API. Really need to index?? |
| label | `String` | ✓ | ✓ | ✓ | ✓ | ✓ | SAYT | ✓ | Field created by binaries (contains the name and other information, like admin, country code, …) |
| name | `String` | ✓ | ✓ | ✓ | ✓ | ✓ | text | | copy to full label |
| weight | `f64` | ✓ | ✓ | ✓ | ✓ | ✓ | float | | used for ranking |
| zip_codes | `Vec<String>` | ✓ | ✓ | ✓ | ✓ | ✓ | text | ?? | copy to full label |

Now we'll turn this table into an actual component template, responsible for handling all the common fields.

A few points are important to notice:

- The text-based search happens on the label. The label is created by the indexing program, and contains the name, some information about the administrative regions the document belongs to, and possibly a country code. So we don't index the name, because the search happens on the label.
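To make this concrete, the mappings section of the common component template could look roughly like the following sketch. The field parameters follow the table above (`name` stored but not indexed, `coord` as `geo_point`, …), but the real file under `config/elasticsearch/templates/components` is the authoritative source:

```json
{
  "template": {
    "mappings": {
      "properties": {
        "coord": { "type": "geo_point" },
        "id": { "type": "keyword" },
        "label": { "type": "text" },
        "name": { "type": "text", "index": false },
        "weight": { "type": "float" },
        "zip_codes": { "type": "text" }
      }
    }
  }
}
```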

The component template also contains additional fields that are not present in the documents sent by the binaries:

| field | Elasticsearch | Comment |
|-------|---------------|---------|
| indexed_at | date | Generated by an Elasticsearch pipeline |
| type | constant_keyword | Set in individual index templates |
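The `indexed_at` field can be filled by an Elasticsearch ingest pipeline using the `set` processor and the ingest timestamp. A minimal sketch of such a pipeline body (the pipeline name and its registration are left out):

```json
{
  "description": "Stamp each document with its indexing time",
  "processors": [
    {
      "set": {
        "field": "indexed_at",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}
```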

The search template has to reflect the information found in the common template.

### Index Templates

#### Admin

If we look back at the list of fields present in the administrative region document, and remove all the fields that are part of the common template, we have the following list of remaining fields:

| field | type | Elasticsearch | Index | Comment |
|-------|------|---------------|-------|---------|
| bbox | `Option<Rect<f64>>` | | | Bounding box |
| boundary | `Option<MultiPolygon<f64>>` | geo_shape | | |
| codes | `BTreeMap<String, String>` | | | |
| insee | `String` | | | |
| labels | `I18nProperties` | ?? | | used in dynamic templates |
| level | `u32` | | | used for ranking |
| names | `I18nProperties` | | | used in dynamic templates |
| parent_id | `Option<String>` | | | |
| zone_type | `Option<ZoneType>` | keyword | | used for filtering |

The treatment of labels and names is done in a separate template, using dynamic templates.
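Dynamic templates match fields by name or path at indexing time, which suits the open-ended keys of `names` and `labels` (`name:en`, `name:ru`, …). A sketch of what such a template could contain (the dynamic template names here are illustrative, not the ones used by Mimirsbrunn):

```json
{
  "template": {
    "mappings": {
      "dynamic_templates": [
        {
          "i18n_names": {
            "path_match": "names.*",
            "mapping": { "type": "text" }
          }
        },
        {
          "i18n_labels": {
            "path_match": "labels.*",
            "mapping": { "type": "text" }
          }
        }
      ]
    }
  }
}
```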

This leaves the remaining fields to be indexed with the mimir-admin.json index template.
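An index template ties everything together: it matches index names through `index_patterns`, pulls in the shared component templates through `composed_of`, and adds the type-specific fields. A sketch of what `mimir-admin.json` could amount to (the component template names and the `constant_keyword` value are assumptions for the example, and most fields are omitted):

```json
{
  "index_patterns": ["munin_admin*"],
  "composed_of": ["mimir-base", "mimir-dynamic-i18n"],
  "template": {
    "mappings": {
      "properties": {
        "boundary": { "type": "geo_shape" },
        "zone_type": { "type": "keyword" },
        "type": { "type": "constant_keyword", "value": "admin" }
      }
    }
  }
}
```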

#### Address

If we look back at the list of fields present in the address document, and remove all the fields that are part of the common template, we have the following list of remaining fields:

| field | type | Elasticsearch | Index | Comment |
|-------|------|---------------|-------|---------|
| house_number | `String` | text | ?? | Should we index it? |
| street | `Street` | | | Reference to the street the address belongs to |

This leaves the remaining fields to be indexed with the mimir-addr.json index template.

#### Streets

For streets, it's quite easy, because all the fields can be indexed with the base template, leaving a minimal mimir-street.json index template.

#### POIs

If we look back at the list of fields present in the poi document, and remove all the fields that are part of the common template, we have the following list of remaining fields:

| field | type | Elasticsearch | Index | Comment |
|-------|------|---------------|-------|---------|
| address | `Option` | object | | |
| boundary | `Option<MultiPolygon<f64>>` | geo_shape | | |
| labels | `I18nProperties` | ?? | | used in dynamic templates |
| names | `I18nProperties` | | | used in dynamic templates |
| poi_type | `PoiType` | keyword | | used for filtering |
| properties | `BTreeMap<String, String>` | object | | used for filtering |

This leaves the remaining fields to be indexed with the mimir-poi.json index template.

#### Stops

If we look back at the list of fields present in the stop document, and remove all the fields that are part of the common template, we have the following list of remaining fields:

| field | type | Elasticsearch | Index | Comment |
|-------|------|---------------|-------|---------|
| comments | `Vec` | | | |
| commercial_modes | `Vec` | | | |
| coverages | `Vec` | | | |
| feed_publishers | `Vec` | | | |
| lines | `Vec` | | | |
| physical_modes | `Vec` | | | |
| properties | `BTreeMap<String, String>` | flattened | | |
| timezone | `String` | | | |

This leaves the remaining fields to be indexed with the mimir-stop.json index template.
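The `flattened` type maps the whole `properties` object as a single field, which avoids a mapping explosion when stop properties have arbitrary keys. In a mappings sketch it would appear as follows (note that the field is itself named `properties`, hence the nesting):

```json
{
  "template": {
    "mappings": {
      "properties": {
        "properties": { "type": "flattened" }
      }
    }
  }
}
```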

## Using Templates

### Importing Templates

For now there is a single binary used to insert templates into Elasticsearch. It must be run prior to the creation of any index. This binary uses the same configuration files and command-line configuration as the other binaries.

```shell
./target/release/ctlmimir -c ./config -m testing run
```

This program will look for the directories `<config>/ctlmimir` and `<config>/elasticsearch` to read some configuration values. It will then scan `<config>/elasticsearch/templates/components` and import all the templates found there, and do the same for `<config>/elasticsearch/templates/indices`.

You can check all the templates directly in Elasticsearch. Since Mimirsbrunn's templates are prefixed with `mimir-`, you can run:

```shell
curl -X GET 'http://localhost:9200/_component_template/mimir-*' | jq '.'
```

And the same for index templates:

```shell
curl -X GET 'http://localhost:9200/_index_template/mimir-*' | jq '.'
```

### Overriding Templates

There are scenarios in which you may want to override certain values.

#### For a certain type of index

Let's say you want to make sure that all administrative region indices have a certain number of replicas, different from the default. Prior to importing the templates, you can edit the index template in `config/elasticsearch/templates/indices/mimir-admin.json` and change the settings:

```json
{
  "elasticsearch": {
    "index_patterns": ["munin_admin*"],
    "template": {
      "settings": {
        "number_of_replicas": "2"
      }
      ...
    }
  }
}
```

Then, when you run ctlmimir, all indices whose name matches `munin_admin*` will use that number of replicas. You can then check, when creating a new index with cosmogony2mimir, that it has the correct number of replicas.

#### For a certain index

Let's say that, following the previous scenario, you want to create a new admin index, but with a different number of replicas from that found in the index template.

In that case you can still use command-line overrides:

```shell
cosmogony2mimir -s elasticsearch.settings.number_of_replicas=9 ...
```

## Updating Templates

Updating templates is essentially an iterative process, and we try to use a TDD approach:

- For a new feature or a bug, we create a new scenario in the features directory.
- We run the end-to-end tests (`cargo test --test end_to_end`); they fail.
- We update the templates and run the tests again.

Playing with templates, analyzers, tokenizers, and so on, and boosting some results with regard to others, requires an intimate knowledge of how Elasticsearch analyzes text and ranks documents.

## Evaluating Templates

The following measures should be taken into account when modifying the templates. Like most iterative processes, we make a change, evaluate the results, estimate what needs to change to improve the measurements, and loop again.

Evaluating the templates can be done with:

- ctlmimir, a binary used to import the templates found in `config/elasticsearch/templates`. With this tool, we just check that the templates can actually be imported.
- import2mimir.sh, which can be used to evaluate the whole indexing process, using ctlmimir and the other indexing tools.
- end-to-end tests, which are used to make sure that the indexing process is correct, and that the results of predefined search queries are correct.
- benchmarks, which are used to estimate the time it takes to index or search.