If you've found yourself on this page, you're most likely taking my basic introduction to Elasticsearch course and are trying to follow along. The resources in this project are nothing more than the minimal set of files and queries needed to walk through the course's "Chat log analysis" case study, which, for ease of use, I'll briefly review here.
Given a dump of Slack logs, I'd like to be able to answer the following questions:
- Which users are most commonly posting GIFs?
- What have people been saying about me?
- How does chat activity vary over time?
To follow along with this guide, you will need the following tools installed and available on your local machine:
- A terminal w/ basic command-line skills
- A running Elasticsearch node (or access to one)
To get started, you'll first need to clone this project:
cd /path/to/your/workspace
git clone https://github.com/rusnyder/intro-to-es
This guide assumes that you either already have an Elasticsearch node running and accessible or that you are able to create one without guidance.
NOTE: All commands from this point assume your Elasticsearch node has its HTTP service available at the location http://localhost:9200
To make sure that our indexed data gets analyzed the way we want, we need to give Elasticsearch the appropriate settings/mappings for our data prior to indexing anything. This can be done in one of two ways:
1. Create each of our indices with the settings/mappings specified directly
2. Create an index template that will apply the settings/mappings as any new indices get created
For simplicity, we COULD go with option #1, but for the sake of completeness and "best practices", this example will walk through #2.
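For reference, option #1 would mean attaching the settings/mappings to every index-creation request ourselves. A rough sketch (the index name, settings, and field types here are illustrative, not taken from this project, and the exact mapping syntax varies by Elasticsearch version):

```
PUT /intro-to-es-example
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "from": { "type": "keyword" },
      "message": { "type": "text" }
    }
  }
}
```

This gets tedious (and error-prone) as soon as there's more than one index, which is exactly what the template in option #2 avoids.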
Creating our Template
This project provides a template that matches on any indices named `intro-to-es-*` (note the wildcard), which means that any index created with a matching name will get created with all of the settings and mappings from that template. This is nice because once the template has been created, we can just start indexing data and new indices will automatically pick up their appropriate settings.
Let's go ahead and upload the template (using `intro-to-es-template.json`):
curl -XPUT http://localhost:9200/_template/intro-to-es \
  -H 'Content-Type: application/json' \
  -d @intro-to-es-template.json
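For context, a template for this use case takes roughly the following shape: an index pattern plus the settings/mappings to apply. This is an illustrative sketch, not the actual contents of `intro-to-es-template.json`, and on Elasticsearch 6+ the legacy `template` key became `index_patterns`:

```json
{
  "template": "intro-to-es-*",
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "properties": {
      "from": { "type": "keyword" },
      "message": { "type": "text" }
    }
  }
}
```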
This project provides some sample data in the file `bulk-index.jsonl` that is already in a format that Elasticsearch's Bulk API accepts, so indexing our data is as simple as POSTing that file to Elasticsearch:
# Delete any data from a previous run of these steps
curl -XDELETE 'http://localhost:9200/intro-to-es*'
# Index our events using the bulk API, refreshing the index
# to make them available for search immediately
curl -XPOST 'http://localhost:9200/_bulk?refresh=true' \
  -H 'Content-Type: application/x-ndjson' \
  --data-binary @bulk-index.jsonl
# Run a quick count just to see that your documents were
# indexed and that the "intro-to-es" alias was created
curl http://localhost:9200/intro-to-es/_count
NOTE: cURL will strip newlines from any data provided using the `-d`/`--data` option, so we must use the `--data-binary` option to preserve the newlines, which are necessary for proper interpretation of the bulk request body.
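Those newlines matter because the Bulk API body is newline-delimited JSON: each document takes two lines, an action line followed by the document source, and the body must end with a newline. A minimal, hand-built example of the format (the index name and field values here are made up, but the field names match those used later in this guide):

```shell
# Build a tiny bulk request body by hand: an action line,
# then the document itself, one JSON object per line.
cat > /tmp/sample-bulk.jsonl <<'EOF'
{"index":{"_index":"intro-to-es-demo"}}
{"from":"alice","message":"/gif party time"}
{"index":{"_index":"intro-to-es-demo"}}
{"from":"bob","message":"nice demo, @rusnyder!"}
EOF

# Two documents -> four lines; collapsing them (as -d would)
# leaves Elasticsearch unable to parse the request.
wc -l < /tmp/sample-bulk.jsonl
```

A file like this could then be sent with the same `--data-binary` invocation shown above.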
At this point, we've got our data indexed and are ready to start answering some questions!
Question 1: Which users are most commonly posting GIFs?
Here we just want to find the top senders using the `/gif` Slack directive. Since our analyzer only preserves slashes when they are at the beginning of a sentence, we know that if we match the token `/gif`, then it's the directive we're looking for and not just the string "/gif" occurring in the middle of a sentence. Once we've subset the events, aggregating over the `from` field gives us the top senders.
POST /intro-to-es/_search
{
  "query": {
    "term": {
      "message": "/gif"
    }
  },
  "aggs": {
    "top_senders": {
      "terms": {
        "field": "from",
        "size": 0
      }
    }
  }
}
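The answer then comes back in the `top_senders` buckets of the response, which look roughly like this (usernames and counts here are made up):

```json
{
  "aggregations": {
    "top_senders": {
      "buckets": [
        { "key": "alice", "doc_count": 42 },
        { "key": "bob", "doc_count": 17 }
      ]
    }
  }
}
```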
Question 2: What have people been saying about me?
To answer this, we want to take a look at the chat messages that are most relevant, with "relevance" here loosely defined as "how likely is it that this message pertains to or is about me?". For this, we use a couple of clauses with `boost` factors as a means of saying that occurrences of certain tokens/phrases can be treated as "more certainly me" than others.
POST /intro-to-es/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "message": {
              "value": "@rusnyder",
              "boost": 10
            }
          }
        },
        {
          "match": {
            "message": {
              "query": "russell,snyder",
              "boost": 2
            }
          }
        },
        {
          "prefix": {
            "message": {
              "value": "russ",
              "boost": 1
            }
          }
        }
      ]
    }
  }
}
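The third question from the intro (how does chat activity vary over time?) isn't walked through above, but the usual tool for it is a `date_histogram` aggregation. A sketch, assuming the documents carry a date-mapped field (called `timestamp` here, which may not match the actual field name in this project's data):

```
POST /intro-to-es/_search
{
  "size": 0,
  "aggs": {
    "activity_over_time": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1h"
      }
    }
  }
}
```

Note that on newer Elasticsearch versions, `interval` has been split into `calendar_interval` and `fixed_interval`.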