This repository has been archived by the owner on Jun 3, 2024. It is now read-only.
# Elasticsearch Process

This document describes the process of configuring Elasticsearch templates for Mimirsbrunn.

We can picture Elasticsearch as a black box in which we store JSON documents. These documents are of different kinds, depending on our business. Since we deal with geospatial data, and Navitia in particular works with public transportation, the types of documents we store are:

- administrative regions
- addresses
- streets
- points of interest (POIs)
- stops (public transportation)

We first submit configuration files to Elasticsearch that describe how we want each document type to be handled. These are the so-called component templates and index templates, which include:

- settings: how the text should be analyzed. Do we want to use synonyms, lowercasing, stemming, …
- mappings: how each field of each type of document listed above is handled.
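As a concrete illustration, the settings part of such a template could define a custom analyzer combining lowercasing and synonyms. This is only a sketch: the analyzer and filter names, and the synonyms themselves, are made up for the example; the actual settings live in the template files under `config/elasticsearch/templates`.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "synonyms_example": {
          "type": "synonym",
          "synonyms": ["bd => boulevard", "st => saint"]
        }
      },
      "analyzer": {
        "label_analyzer_example": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "synonyms_example"]
        }
      }
    }
  }
}
```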

When the documents are indexed according to our settings and mappings, we can then query Elasticsearch, and play with lots of parameters to push the ranking of documents up or down.

This document describes how we establish a baseline for these templates, and the process of updating them.

Configuring Elasticsearch templates is an iterative process which, when done right, results in:

- reduced memory consumption in Elasticsearch, by reducing the size and number of indices;
- reduced search duration, by simplifying the queries;
- better ranking.

## Creating Templates

### Gathering Fields

We'll construct a table of all the fields for each type of document. The source of information is the document itself, which is a Rust structure serialized to JSON. When building this resource, be sure to exclude the fields the serializer skips (those marked `skip`).

#### Administrative Regions

| field | type | description |
|-------|------|-------------|
| administrative_regions | `Vec<Arc<Admin>>` | A list of parent administrative regions |
| approx_coord | `Option<Geometry>` | Coordinates of (the center??) of the region, similar to `coord`. Given in lat/lon |
| bbox | `Option<Rect<f64>>` | Bounding box |
| boundary | `Option<MultiPolygon<f64>>` | Describes the shape of the admin region |
| codes | `BTreeMap<String, String>` | Some codes used in OSM, like ISO3166, ref:nuts, wikidata |
| context | `Option<Context>` | Used for debugging |
| coord | `Coord` | Coordinates of the region |
| country_codes | `Vec<String>` | Country codes |
| id | `String` | Unique id created by cosmogony |
| insee | `String` | A code used to identify regions in France. From OSM |
| label | `String` | ?? |
| labels | `I18nProperties` | ?? |
| level | `u32` | Position of the region in the admin hierarchy |
| name | `String` | Name |
| names | `I18nProperties` | Name, but internationalized, eg `name:en`, `name:ru`, `name:es` |
| parent_id | `Option<String>` | id of the parent admin region (or none if root) |
| weight | `f64` | A number associated with the population of that region |
| zip_codes | `Vec<String>` | Zip codes (there can be more than one) |
| zone_type | `Option<ZoneType>` | Describes the type, eg city, suburb, country, … |

#### Addresses

Addresses, compared to administrative regions, have very few unique fields: just the house number and the street.

| field | type | description |
|-------|------|-------------|
| approx_coord | `Option<Geometry>` | |
| context | `Option<Context>` | |
| coord | `Coord` | |
| country_codes | `Vec<String>` | |
| house_number | `String` | Identifier in the street |
| id | `String` | Unique identifier |
| label | `String` | |
| name | `String` | |
| street | `Street` | Reference to the street the address belongs to |
| weight | `f64` | |
| zip_codes | `Vec<String>` | |

#### Streets

Streets have no particular fields of their own:

| field | type | description |
|-------|------|-------------|
| administrative_regions | `Vec<Arc<Admin>>` | |
| approx_coord | `Option<Geometry>` | |
| context | `Option<Context>` | |
| coord | `Coord` | |
| country_codes | `Vec<String>` | |
| id | `String` | |
| label | `String` | |
| name | `String` | |
| weight | `f64` | |
| zip_codes | `Vec<String>` | |
#### POIs

| field | type | description |
|-------|------|-------------|
| address | `Option` | Address associated with that POI. Can be an address or a street |
| administrative_regions | `Vec<Arc<Admin>>` | |
| approx_coord | `Option<Geometry>` | |
| context | `Option<Context>` | |
| coord | `Coord` | |
| country_codes | `Vec<String>` | |
| id | `String` | |
| label | `String` | |
| labels | `I18nProperties` | |
| name | `String` | |
| names | `I18nProperties` | |
| poi_type | `PoiType` | id / name references in NTFS |
| properties | `BTreeMap<String, String>` | |
| weight | `f64` | |
| zip_codes | `Vec<String>` | |

#### Stops (Public Transportation)

| field | type | description |
|-------|------|-------------|
| administrative_regions | `Vec<Arc<Admin>>` | |
| approx_coord | `Option<Geometry>` | |
| codes | `BTreeMap<String, String>` | |
| comments | `Vec` | |
| commercial_modes | `Vec` | |
| context | `Option<Context>` | |
| coord | `Coord` | |
| country_codes | `Vec<String>` | |
| coverages | `Vec` | |
| feed_publishers | `Vec` | |
| id | `String` | |
| label | `String` | |
| lines | `Vec` | |
| name | `String` | |
| physical_modes | `Vec` | |
| properties | `BTreeMap<String, String>` | |
| timezone | `String` | |
| weight | `f64` | The weight depends on the number of lines, among other parameters |
| zip_codes | `Vec<String>` | |

### Partitioning Templates

When we combine all the fields from the previous documents, we obtain the following table, which shows all the fields in use, and the type of document each is used by.

| field | type | description | adm | add | poi | stp | str |
|-------|------|-------------|-----|-----|-----|-----|-----|
| address | `Option` | Address associated with that POI | | | ✓ | | |
| administrative_regions | `Vec<Arc<Admin>>` | A list of parent administrative regions | ✓ | | ✓ | ✓ | ✓ |
| approx_coord | `Option<Geometry>` | Coordinates of the object, similar to `coord` | ✓ | ✓ | ✓ | ✓ | ✓ |
| bbox | `Option<Rect<f64>>` | Bounding box | ✓ | | | | |
| boundary | `Option<MultiPolygon<f64>>` | Describes the shape of the admin region | ✓ | | | | |
| codes | `BTreeMap<String, String>` | Some codes used in OSM, like ISO3166, ref:nuts, wikidata | ✓ | | | ✓ | |
| comments | `Vec` | | | | | ✓ | |
| commercial_modes | `Vec` | | | | | ✓ | |
| context | `Option<Context>` | Used to return information (debugging) | ✓ | ✓ | ✓ | ✓ | ✓ |
| coord | `Coord` | | ✓ | ✓ | ✓ | ✓ | ✓ |
| country_codes | `Vec<String>` | Country codes | ✓ | ✓ | ✓ | ✓ | ✓ |
| coverages | `Vec` | | | | | ✓ | |
| feed_publishers | `Vec` | | | | | ✓ | |
| house_number | `String` | Identifier in the street | | ✓ | | | |
| id | `String` | Unique identifier | ✓ | ✓ | ✓ | ✓ | ✓ |
| insee | `String` | A code used to identify regions in France | ✓ | | | | |
| label | `String` | ?? | ✓ | ✓ | ✓ | ✓ | ✓ |
| labels | `I18nProperties` | ?? | ✓ | | ✓ | | |
| level | `u32` | Position of the region in the admin hierarchy | ✓ | | | | |
| lines | `Vec` | | | | | ✓ | |
| name | `String` | Name | ✓ | ✓ | ✓ | ✓ | ✓ |
| names | `I18nProperties` | Name, but internationalized, eg `name:en`, `name:ru`, `name:es` | ✓ | | ✓ | | |
| parent_id | `Option<String>` | id of the parent admin region (or none if root) | ✓ | | | | |
| physical_modes | `Vec` | | | | | ✓ | |
| poi_type | `PoiType` | id / name references in NTFS | | | ✓ | | |
| properties | `BTreeMap<String, String>` | | | | ✓ | ✓ | |
| street | `Street` | Reference to the street the address belongs to | | ✓ | | | |
| timezone | `String` | | | | | ✓ | |
| weight | `f64` | | ✓ | ✓ | ✓ | ✓ | ✓ |
| zip_codes | `Vec<String>` | | ✓ | ✓ | ✓ | ✓ | ✓ |
| zone_type | `Option<ZoneType>` | Describes the type, eg city, suburb, country, … | ✓ | | | | |

*TODO: talk about `type`, `indexed_at` (and the pipeline).*

### Component Templates

We can extract from this table a list of fields that are (almost) common to all the documents. In this table of common fields, we indicate which type is used for Elasticsearch, whether we should index the field, and some comments.

| field | type | adm | add | poi | stp | str | Elasticsearch | Index | Comment |
|-------|------|-----|-----|-----|-----|-----|---------------|-------|---------|
| administrative_regions | `Vec<Arc<Admin>>` | ✓ | | ✓ | ✓ | ✓ | | | large object |
| approx_coord | `Option<Geometry>` | ✓ | ✓ | ✓ | ✓ | ✓ | ?? | | Improved geo_point in Elasticsearch may render approx_coord obsolete |
| context | `Option<Context>` | ✓ | ✓ | ✓ | ✓ | ✓ | | | Output |
| coord | `Coord` | ✓ | ✓ | ✓ | ✓ | ✓ | geo_point | ✓ | Index for reverse API |
| country_codes | `Vec<String>` | ✓ | ✓ | ✓ | ✓ | ✓ | ?? | | Are we searching with these? |
| id | `String` | ✓ | ✓ | ✓ | ✓ | ✓ | keyword | ✓ | Index for features API. Really need to index?? |
| label | `String` | ✓ | ✓ | ✓ | ✓ | ✓ | SAYT | ✓ | Field created by binaries (contains the name and other information, like admin, country code, …) |
| name | `String` | ✓ | ✓ | ✓ | ✓ | ✓ | text | | copy to full label |
| weight | `f64` | ✓ | ✓ | ✓ | ✓ | ✓ | float | | used for ranking |
| zip_codes | `Vec<String>` | ✓ | ✓ | ✓ | ✓ | ✓ | text | ?? | copy to full label |

Now we'll turn this table into an actual component template, responsible for handling all the common fields.

A few points are important to notice:

- The text-based search happens on the label. The label is created by the indexing program, and contains the name, some information about the administrative regions the document belongs to, and possibly a country code. So we don't index the name, because the search happens on the label.
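To make this concrete, the mappings section of the common component template could look roughly like the following sketch. The field parameters follow the table above (`name` stored but not indexed, `coord` as `geo_point`, …), but the real file under `config/elasticsearch/templates/components` is the authoritative source:

```json
{
  "template": {
    "mappings": {
      "properties": {
        "coord": { "type": "geo_point" },
        "id": { "type": "keyword" },
        "label": { "type": "text" },
        "name": { "type": "text", "index": false },
        "weight": { "type": "float" },
        "zip_codes": { "type": "text" }
      }
    }
  }
}
```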

The component template also contains additional fields that are not present in the documents sent by the binaries:

| field | Elasticsearch | Comment |
|-------|---------------|---------|
| indexed_at | date | Generated by an Elasticsearch pipeline |
| type | constant_keyword | Set in individual index templates |
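The `indexed_at` field can be filled by an Elasticsearch ingest pipeline using the `set` processor and the ingest timestamp. A minimal sketch of such a pipeline body (the pipeline name and its registration are left out):

```json
{
  "description": "Stamp each document with its indexing time",
  "processors": [
    {
      "set": {
        "field": "indexed_at",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}
```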

The search template has to reflect the information found in the common template.

### Index Templates

#### Admin

If we look back at the list of fields present in the administrative region document, and remove all the fields that are part of the common template, we have the following list of remaining fields:

| field | type | Elasticsearch | Index | Comment |
|-------|------|---------------|-------|---------|
| bbox | `Option<Rect<f64>>` | | | Bounding box |
| boundary | `Option<MultiPolygon<f64>>` | geo_shape | | |
| codes | `BTreeMap<String, String>` | | | |
| insee | `String` | | | |
| labels | `I18nProperties` | ?? | | used in dynamic templates |
| level | `u32` | | | used for ranking |
| names | `I18nProperties` | | | used in dynamic templates |
| parent_id | `Option<String>` | | | |
| zone_type | `Option<ZoneType>` | keyword | | used for filtering |

The treatment of labels and names is done in a separate template, using dynamic templates.
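Dynamic templates match fields by name or path at indexing time, which suits the open-ended keys of `names` and `labels` (`name:en`, `name:ru`, …). A sketch of what such a template could contain (the dynamic template names here are illustrative, not the ones used by Mimirsbrunn):

```json
{
  "template": {
    "mappings": {
      "dynamic_templates": [
        {
          "i18n_names": {
            "path_match": "names.*",
            "mapping": { "type": "text" }
          }
        },
        {
          "i18n_labels": {
            "path_match": "labels.*",
            "mapping": { "type": "text" }
          }
        }
      ]
    }
  }
}
```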

This leaves the remaining fields to be indexed with the mimir-admin.json index template.
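An index template ties everything together: it matches index names through `index_patterns`, pulls in the shared component templates through `composed_of`, and adds the type-specific fields. A sketch of what `mimir-admin.json` could amount to (the component template names and the `constant_keyword` value are assumptions for the example, and most fields are omitted):

```json
{
  "index_patterns": ["munin_admin*"],
  "composed_of": ["mimir-base", "mimir-dynamic-i18n"],
  "template": {
    "mappings": {
      "properties": {
        "boundary": { "type": "geo_shape" },
        "zone_type": { "type": "keyword" },
        "type": { "type": "constant_keyword", "value": "admin" }
      }
    }
  }
}
```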

#### Address

If we look back at the list of fields present in the address document, and remove all the fields that are part of the common template, we have the following list of remaining fields:

| field | type | Elasticsearch | Index | Comment |
|-------|------|---------------|-------|---------|
| house_number | `String` | text | ?? | Should we index it? |
| street | `Street` | | | Reference to the street the address belongs to |

This leaves the remaining fields to be indexed with the mimir-addr.json index template.

#### Streets

For streets, it's quite easy, because all the fields can be indexed with the base template, leaving a minimal mimir-street.json index template.

#### POIs

If we look back at the list of fields present in the poi document, and remove all the fields that are part of the common template, we have the following list of remaining fields:

| field | type | Elasticsearch | Index | Comment |
|-------|------|---------------|-------|---------|
| address | `Option` | object | | |
| boundary | `Option<MultiPolygon<f64>>` | geo_shape | | |
| labels | `I18nProperties` | ?? | | used in dynamic templates |
| names | `I18nProperties` | | | used in dynamic templates |
| poi_type | `PoiType` | keyword | | used for filtering |
| properties | `BTreeMap<String, String>` | object | | used for filtering |

This leaves the remaining fields to be indexed with the mimir-poi.json index template.

#### Stops

If we look back at the list of fields present in the stop document, and remove all the fields that are part of the common template, we have the following list of remaining fields:

| field | type | Elasticsearch | Index | Comment |
|-------|------|---------------|-------|---------|
| comments | `Vec` | | | |
| commercial_modes | `Vec` | | | |
| coverages | `Vec` | | | |
| feed_publishers | `Vec` | | | |
| lines | `Vec` | | | |
| physical_modes | `Vec` | | | |
| properties | `BTreeMap<String, String>` | flattened | | |
| timezone | `String` | | | |

This leaves the remaining fields to be indexed with the mimir-stop.json index template.
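The `flattened` type maps the whole `properties` object as a single field, which avoids a mapping explosion when stop properties have arbitrary keys. In a mappings sketch it would appear as follows (note that the field is itself named `properties`, hence the nesting):

```json
{
  "template": {
    "mappings": {
      "properties": {
        "properties": { "type": "flattened" }
      }
    }
  }
}
```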

## Using Templates

### Importing Templates

For now there is a single binary used to insert templates into Elasticsearch. It must be run prior to the creation of any index. This binary uses the same configuration files and command-line configuration as the other binaries.

```shell
./target/release/ctlmimir -c ./config -m testing run
```

This program will look for the directories `<config>/ctlmimir` and `<config>/elasticsearch` to read some configuration values. It will then scan `<config>/elasticsearch/templates/components` and import all the templates found there, and do the same for `<config>/elasticsearch/templates/indices`.

You can check all the templates directly in Elasticsearch. Since Mimirsbrunn's templates are prefixed with `mimir-`, you can run:

```shell
curl -X GET 'http://localhost:9200/_component_template/mimir-*' | jq '.'
```

And the same for index templates:

```shell
curl -X GET 'http://localhost:9200/_index_template/mimir-*' | jq '.'
```

### Overriding Templates

There are scenarios in which you may want to override certain values.

#### For a certain type of index

Let's say you want to make sure that all administrative region indices have a certain number of replicas, different from the default. Prior to importing the templates, you can edit the index template in `config/elasticsearch/templates/indices/mimir-admin.json` and change the settings:

```json
{
  "elasticsearch": {
    "index_patterns": ["munin_admin*"],
    "template": {
      "settings": {
        "number_of_replicas": "2"
      }
      ...
    }
  }
}
```

Then, when you run ctlmimir, all indices whose name matches `munin_admin*` will use that number of replicas. You can then check, when creating a new index with cosmogony2mimir, that it has the correct number of replicas.

#### For a certain index

Let's say that, following the previous scenario, you want to create a new admin index, but with a different number of replicas from that found in the index template.

In that case you can still use command-line overrides:

```shell
cosmogony2mimir -s elasticsearch.settings.number_of_replicas=9 ...
```

## Updating Templates

Updating templates is essentially an iterative process, and we try to use a TDD approach:

- For a new feature or a bug, we create a new scenario in the features directory.
- We run the end-to-end tests (`cargo test --test end_to_end`); they fail.
- We update the templates and run the tests again.

Playing with templates, analyzers, tokenizers, and so on, and boosting some results with regard to others, requires an intimate knowledge of how Elasticsearch analyzes text and ranks documents.

## Evaluating Templates

The following measures should be taken into account when modifying the templates. Like most iterative processes, we make a change, evaluate the results, estimate what needs to change to improve the measurements, and loop again.

Evaluating the templates can be done with:

- ctlmimir, a binary used to import the templates found in `config/elasticsearch/templates`. With this tool, we just check that the templates can actually be imported.
- import2mimir.sh, which can be used to evaluate the whole indexing process, using ctlmimir and the other indexing tools.
- end-to-end tests, which are used to make sure that the indexing process is correct, and that the results of predefined search queries are correct.
- benchmarks, which are used to estimate the time it takes to index or search.