Installing Elasticsearch:
For Mac users, the easiest way is to install it with Homebrew:
brew install elasticsearch
For Linux users:
Please follow this tutorial: https://www.digitalocean.com/community/tutorials/how-to-install-and-configure-elasticsearch-on-ubuntu-14-04
Run Elasticsearch and check that it is running at http://localhost:9200
On Mac:
elasticsearch
On Ubuntu:
sudo service elasticsearch start
If you have Docker and docker-compose installed, you can start Elasticsearch with:
docker-compose up
The default user is elastic and the password is changeme.
Download Sense from the Chrome Web Store. It's a plugin that will make our life easier during the workshop.
https://chrome.google.com/webstore/detail/sense-beta/lhjgkmllcaadmopgmanpapmpjgmfcfig
Create an index for movies. It will hold all the movie documents that we will import in a minute. Open Sense and type your first request to Elasticsearch; this one creates an index named "movies".
PUT movies
Now let's import data into our index. I prepared a JSON file with all the documents, ready to be loaded with the bulk API.
curl -s --header "Content-Type:application/json" -XPOST localhost:9200/_bulk --data-binary @movies.json
Use the -u option to pass the user and password when running with Docker.
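The _bulk endpoint expects newline-delimited JSON: an action line followed by the document on the next line, with a trailing newline at the end of the body. The sketch below shows how a file like movies.json could be generated. The sample documents and the function name are hypothetical, and the index and type names ("movies", "movie") are assumed from the sample response in this workshop; "_type" only applies to older Elasticsearch versions.

```python
import json

def build_bulk_payload(docs, index="movies", doc_type="movie"):
    """Return an NDJSON string in the shape the _bulk endpoint expects."""
    lines = []
    for doc_id, doc in enumerate(docs, start=1):
        # Action line: tells Elasticsearch where to index the next document.
        # "_type" is an assumption that fits older Elasticsearch versions.
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(doc_id)}}
        lines.append(json.dumps(action))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"

# Hypothetical sample documents, not the real workshop data.
sample = [{"title": "Scarface"}, {"title": "Captain America: The First Avenger"}]
payload = build_bulk_payload(sample)
print(payload)
```

Each document takes two lines, which is why the file is sent with --data-binary: that option keeps the newlines intact.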
Let's make the simplest possible query against our movies index: a query that returns all results, called the match_all query.
GET <name_of_index>/_search
{
"query": {
"match_all": {}
}
}
You should get a result like this in the response:
"hits": {
"total": 306,
"max_score": 1,
Type this query into Sense and see what results you get for the movies index.
Next, a still very simple query: we will search for a particular string.
GET /_search
{
"query": {
"query_string" : {
"default_field" : "content",
"query" : "this AND that OR thus"
}
}
}
Using this knowledge, find the movie Scarface in Elasticsearch. It should be returned as the first result.
Let's build on this and extend our search capabilities. Elasticsearch supports boolean operators, as in programming. By default query_string combines terms with 'OR', but we can set 'AND' so that every term in the query must match.
GET /_search
{
"query": {
"query_string" : {
"default_field" : "content",
"query" : "Strawberry pie with jello",
"default_operator": "AND"
}
}
}
Now we can be sure that we will only get the recipes we are interested in.
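To see why the operator matters, here is a toy model in Python. It is only a sketch: it splits text on whitespace and compares lowercase terms, whereas Elasticsearch runs a proper analyzer and scores the results. The documents and the matches function are made up for illustration.

```python
def matches(query, text, operator="OR"):
    """Toy term matching: lowercase, split on whitespace, compare term sets."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    if operator == "AND":
        # Every query term must be present in the text.
        return query_terms <= text_terms
    # OR: a single shared term is enough.
    return bool(query_terms & text_terms)

# Hypothetical documents for illustration.
docs = ["strawberry pie with jello", "apple pie", "chocolate cake"]
query = "Strawberry pie with jello"

or_hits = [d for d in docs if matches(query, d, "OR")]
and_hits = [d for d in docs if matches(query, d, "AND")]
print(or_hits)   # ['strawberry pie with jello', 'apple pie']
print(and_hits)  # ['strawberry pie with jello']
```

With OR, "apple pie" is returned because it shares the term "pie"; with AND it is dropped.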
Make a query to Elasticsearch that returns only 1 result for the query Captain America first avenger:
"hits": {
"total": 1,
"max_score": 11.263437,
"hits": [
{
"_index": "movies",
"_type": "movie",
"_id": "139",
"_score": 11.263437,
"_source": {
"title": "Captain America: The First Avenger",
"plot": "Predominantly set during World War II, Steve Rogers is a sickly man from Brooklyn who's transformed into super-soldier Captain America to aid in the war effort. Rogers must stop the Red Skull – Adolf Hitler's ruthless head of weaponry, and the leader of an organization that intends to use a mysterious device of untold powers for world domination.",
"genres": null,
But what about the case when users mistype their query? We should handle this too. We can use a match query with fuzziness here; the match query is a simpler cousin of the query_string query.
"query": {
"match": {
"text": {
"query": "jomped over me!",
"fuzziness": "AUTO",
"operator": "and"
}
}
}
"fuzziness": "AUTO"
generates an edit distance based on the length of the term. For lengths:
- 0..2: must match exactly
- 3..5: one edit allowed
- >5: two edits allowed
You can also use numeric values such as 0, 1, or 2. Fuzziness is interpreted as Levenshtein edit distance. More about: fuzziness
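The rule above can be sketched in a few lines of Python. This is an illustration, not how Elasticsearch implements it: it uses the plain Levenshtein distance, and it does not count a swap of two adjacent characters as a single edit the way Elasticsearch does by default. All function names are hypothetical.

```python
def auto_fuzziness(term):
    """Allowed edits for "fuzziness": "AUTO", based on the term length."""
    length = len(term)
    if length <= 2:
        return 0
    if length <= 5:
        return 1
    return 2

def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein edit distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def fuzzy_match(typed, indexed):
    """True if the indexed term is within the AUTO edit budget of the typed term."""
    return levenshtein(typed, indexed) <= auto_fuzziness(typed)

print(fuzzy_match("jomped", "jumped"))  # True: 6 letters allow 2 edits, 1 needed
print(fuzzy_match("me", "my"))          # False: 2 letters must match exactly
```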
Exercise.
Write a query that returns all Captain America movies for the mistyped query "Captaon America".
The range query matches documents with fields that have terms within a certain range. The type of the Lucene query depends on the field type: for string fields it is a TermRangeQuery, while for number/date fields it is a NumericRangeQuery. The following example returns all documents where age is between 10 and 20:
GET _search
{
"query": {
"range" : {
"age" : {
"gte" : 10,
"lte" : 20,
"boost" : 2.0
}
}
}
}
gte = Greater-than or equal to
gt = Greater-than
lte = Less-than or equal to
lt = Less-than
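The four operators map directly onto ordinary comparisons. A minimal sketch in Python, with a hypothetical in_range helper that only understands the four bounds listed above (not other parameters such as boost):

```python
import operator

# Map each range operator to the comparison it stands for.
RANGE_OPERATORS = {
    "gte": operator.ge,
    "gt": operator.gt,
    "lte": operator.le,
    "lt": operator.lt,
}

def in_range(value, bounds):
    """Return True if value satisfies every bound, e.g. {"gte": 10, "lte": 20}."""
    return all(RANGE_OPERATORS[name](value, limit) for name, limit in bounds.items())

ages = [5, 10, 15, 20, 25]
inclusive = [age for age in ages if in_range(age, {"gte": 10, "lte": 20})]
exclusive = [age for age in ages if in_range(age, {"gt": 10, "lt": 20})]
print(inclusive)  # [10, 15, 20]
print(exclusive)  # [15]
```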
Create a query that returns movies with a running time between 60 and 90 minutes.
It should return 57 results.
The bool query takes a more-matches-is-better approach, so the score from each matching must or should clause will be added together to provide the final _score for each document.
must - The clause (query) must appear in matching documents and will contribute to the score.
filter - Filter clauses are executed in filter context, meaning that scoring is ignored and clauses are considered for caching.
should - The clause (query) should appear in the matching document.
must_not - The clause (query) must not appear in the matching documents.
Example query:
POST _search
{
"query": {
"bool" : {
"must" : {
"term" : { "user" : "kimchy" }
},
"filter": {
"term" : { "tag" : "tech" }
},
"must_not" : {
"range" : {
"age" : { "gte" : 10, "lte" : 20 }
}
},
"should" : [
{ "term" : { "tag" : "wow" } },
{ "term" : { "tag" : "elasticsearch" } }
],
"minimum_should_match" : 1,
"boost" : 1.0
}
}
}
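How the clause types interact can be modeled with a toy scorer in Python. This is a simplified sketch under stated assumptions: every matching clause returns a fixed score of 1.0, whereas Elasticsearch computes real relevance scores. The bool_score and term helpers and the sample document are hypothetical.

```python
def bool_score(doc, must=(), filters=(), must_not=(), should=(), minimum_should_match=0):
    """Return a toy score for doc, or None if the document does not match.

    Each clause is a function doc -> score, where a falsy value means "no match".
    """
    # must_not and filter clauses only decide whether the document matches.
    if any(clause(doc) for clause in must_not):
        return None
    if not all(clause(doc) for clause in filters):
        return None
    must_scores = [clause(doc) for clause in must]
    if not all(must_scores):
        return None
    should_scores = [s for s in (clause(doc) for clause in should) if s]
    if len(should_scores) < minimum_should_match:
        return None
    # Scores of matching must and should clauses are added together.
    return sum(must_scores) + sum(should_scores)

def term(field, value, score=1.0):
    """Toy term clause: a fixed score when the field holds the value."""
    def clause(doc):
        values = doc.get(field)
        values = values if isinstance(values, list) else [values]
        return score if value in values else 0
    return clause

# Hypothetical document for illustration.
doc = {"user": "kimchy", "tag": ["tech", "wow"], "age": 30}
score = bool_score(
    doc,
    must=[term("user", "kimchy")],
    filters=[term("tag", "tech")],
    must_not=[lambda d: 10 <= d["age"] <= 20],
    should=[term("tag", "wow"), term("tag", "elasticsearch")],
    minimum_should_match=1,
)
print(score)  # 2.0: one must clause plus one matching should clause
```

The filter clause matched but added nothing to the score, which is the practical difference between filter and must.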
Create a query that finds superhero movies (keywords field: superhero) that are no longer than 120 minutes and no shorter than 60 minutes (runtime field) and do not have Robert Downey Jr. as a starring actor (actors field).
You should get 12 results for this query
Let's get some interesting stats for analytics: we want an overall view of how a value is distributed across the documents. The stats aggregation gives us a general insight: count, minimum value, maximum value, average, and sum.
{
"aggs" : {
"grades_stats" : { "stats" : { "field" : "grade" } }
}
}
and returns:
{
...
"aggregations": {
"grades_stats": {
"count": 6,
"min": 60,
"max": 98,
"avg": 78.5,
"sum": 471
}
}
}
Read more about aggregations here
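The stats aggregation computes the same figures you would compute by hand over the field values. A minimal Python sketch; the six grades are hypothetical values chosen so that they reproduce the sample response above.

```python
def stats(values):
    """Compute the figures reported by a stats aggregation."""
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }

# Hypothetical grades, picked to match the sample response.
grades = [60, 70, 75, 80, 88, 98]
result = stats(grades)
print(result)  # {'count': 6, 'min': 60, 'max': 98, 'avg': 78.5, 'sum': 471}
```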
Get overall data for rating in movies: min, max, average. Do that using the stats aggregation.
The range aggregation is a multi-bucket, value-source-based aggregation that lets the user define a set of ranges, each representing a bucket.
GET products/_search?size=0
{
"aggs": {
"weight_ranges": {
"range": {
"field": "weight",
"ranges": [
{
"to": 500
},
{
"from": 500,
"to": 1000
},
{
"from": 1000,
"to": 1500
}
]
}
}
}
}
and this will return aggregated data:
{
...
"aggregations": {
"weight_ranges" : {
"buckets": [
{
"to": 500,
"doc_count": 20
},
{
"from": 500,
"to": 1000,
"doc_count": 4
},
{
"from": 1000,
"to": 1500,
"doc_count": 4
}
]
}
}
}
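Each bucket includes its "from" value and excludes its "to" value, so a document with weight 500 lands in the second bucket, not the first. A small Python sketch of that bucketing; the weights and the range_buckets helper are hypothetical.

```python
def range_buckets(values, ranges):
    """Count values per range; "from" is inclusive and "to" is exclusive."""
    buckets = []
    for r in ranges:
        low = r.get("from", float("-inf"))
        high = r.get("to", float("inf"))
        count = sum(1 for v in values if low <= v < high)
        buckets.append({**r, "doc_count": count})
    return buckets

# Hypothetical weights for illustration.
weights = [120, 499, 500, 750, 1000, 1499, 1500]
ranges = [{"to": 500}, {"from": 500, "to": 1000}, {"from": 1000, "to": 1500}]
buckets = range_buckets(weights, ranges)
for bucket in buckets:
    print(bucket)
```

The value 1500 is not counted anywhere, because no range covers it.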
Using a range aggregation, count how many movies fall into these running times: below 60 minutes, between 60 and 75 minutes, and between 90 and 120 minutes.
We can also use a histogram to bucket data instead of explicit ranges. It's useful for prices in shops, so we can see how prices fall into fixed-width ranges such as $0-$10 and $10-$20.
POST /sales/_search?size=0
{
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price",
"interval" : 10
}
}
}
}
The response lists one bucket per interval. This sample response uses an interval of 50 rather than 10:
{
...
"aggregations": {
"prices" : {
"buckets": [
{
"key": 0.0,
"doc_count": 1
},
{
"key": 50.0,
"doc_count": 1
},
{
"key": 100.0,
"doc_count": 0
},
{
"key": 150.0,
"doc_count": 2
},
{
"key": 200.0,
"doc_count": 3
}
]
}
}
}
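A histogram bucket key is the value rounded down to a multiple of the interval. The Python sketch below reproduces the sample response above with an interval of 50, including the empty bucket at 100. The prices are hypothetical, and the histogram helper assumes integer values and an integer interval.

```python
import math
from collections import Counter

def histogram(values, interval):
    """Bucket values by floor(value / interval) * interval, keeping empty buckets."""
    counts = Counter(math.floor(v / interval) * interval for v in values)
    if not counts:
        return []
    buckets = []
    key = min(counts)
    # Walk from the lowest to the highest key so gaps appear with doc_count 0.
    while key <= max(counts):
        buckets.append({"key": float(key), "doc_count": counts.get(key, 0)})
        key += interval
    return buckets

# Hypothetical prices, picked to match the sample response.
prices = [5, 55, 150, 180, 210, 220, 245]
buckets = histogram(prices, 50)
for bucket in buckets:
    print(bucket)
```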
Create a histogram aggregation for rating in movies with an interval of 1.
Sorting allows you to add one or more sorts on specific fields. Each sort can be reversed as well. The sort is defined on a per-field level, with the special field name _score to sort by score and _doc to sort by index order.
GET /my_index/my_type/_search
{
"sort" : [
{ "post_date" : {"order" : "asc"}},
"user",
{ "name" : "desc" },
{ "age" : "desc" },
"_score"
],
"query" : {
"term" : { "user" : "kimchy" }
}
}
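The order of entries in "sort" matters: later entries only break ties left by earlier ones. A rough Python equivalent, using made-up hits and a hypothetical sort_hits helper:

```python
def sort_hits(hits, sort_spec):
    """sort_spec is a list of (field, order) pairs, highest priority first."""
    result = list(hits)
    # Python's sort is stable, so sorting by the lowest-priority key first
    # and the highest-priority key last yields a multi-level sort.
    for field, order in reversed(sort_spec):
        result.sort(key=lambda hit: hit[field], reverse=(order == "desc"))
    return result

# Hypothetical hits for illustration.
hits = [
    {"title": "B", "year": 2014, "_score": 1.2},
    {"title": "A", "year": 2011, "_score": 3.4},
    {"title": "C", "year": 2014, "_score": 2.5},
]
ordered = sort_hits(hits, [("year", "asc"), ("_score", "desc")])
print([hit["title"] for hit in ordered])  # ['A', 'C', 'B']
```

"C" comes before "B" because both share the year 2014 and "C" has the higher score.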
Sort Captain America movies by release date in ascending order, oldest movie first. Only Captain America movies should be displayed, so keep the results relevant.
Highlighting allows you to highlight search results on one or more fields. It's useful on a results page for showing where your query appeared in the searched field.
GET /_search
{
"query" : {
"match": { "content": "kimchy" }
},
"highlight" : {
"pre_tags" : ["<tag1>"],
"post_tags" : ["</tag1>"],
"fields" : {
"content" : {}
}
}
}
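What pre_tags and post_tags do can be imitated with a regular expression in Python. This is only a toy: Elasticsearch highlights analyzed terms and returns fragments, while this sketch wraps whole-word, case-insensitive matches in the full text. The highlight function is hypothetical.

```python
import re

def highlight(text, terms, pre_tag="<tag1>", post_tag="</tag1>"):
    """Wrap every whole-word occurrence of the given terms in the tags."""
    alternatives = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda match: pre_tag + match.group(0) + post_tag, text)

highlighted = highlight("kimchy wrote the content", ["kimchy"])
print(highlighted)  # <tag1>kimchy</tag1> wrote the content
```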
Add highlighting to a query that searches the plot of movies for 'terrorist attack'. It should return highlighted fields with tags like this:
"highlight": {
"plot": [
"Jack Ryan, as a young covert CIA analyst, uncovers a Russian plot to crash the U.S. economy with a <highlight>terrorist</highlight> <highlight>attack</highlight>."
]
}
You can create pagination by passing the size and from parameters to the query. size sets the number of elements per page and from works as the offset.
For pages 1 to 3:
GET /_search?size=5
GET /_search?size=5&from=5
GET /_search?size=5&from=10
They can also be passed in the request body:
{
"query": {
"match_all": {}
},
"size": 5
}
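The offset for a page follows from the page number and the page size. A small Python sketch with a hypothetical page_params helper, using 1-based page numbers:

```python
def page_params(page, size=5):
    """Translate a 1-based page number into size/from parameters."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {"size": size, "from": (page - 1) * size}

pages = [page_params(page) for page in (1, 2, 3)]
for params in pages:
    print(params)
# {'size': 5, 'from': 0}
# {'size': 5, 'from': 5}
# {'size': 5, 'from': 10}
```

Page 1 has from equal to 0, which is the default, so it can be left out of the request.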
Create pagination for movies with the genre action.
Put your knowledge to good use and create a movie recommendation query that takes text which could include: plot, actors, title, release date.
It should:
- Give movies with a higher rating a higher score, while keeping the results relevant.
- Prefer newer movies.
- Prefer shorter movies over longer movies.
You can also play around with it further and add extra powers to it.
Save your query on Google Drive and send it to me.