Skip to content

Latest commit

 

History

History
68 lines (55 loc) · 2.46 KB

README.md

File metadata and controls

68 lines (55 loc) · 2.46 KB

Goodbooks10K dataset search tool

A simple React Elasticsearch tool to explore the Goodbooks10k dataset allowing filtering by book title and user tags, built using light-bootstrap-dashboard-react. It is meant to be a simple showcase for how to use ElasticSearch to search for tagged documents.

Installation

  • Run npm install

Data loading

Currently the way to load the data to an elasticsearch instance is through the nested_create_es_index.py script, after decompressing the data/data.7z file. You will need to install the elasticsearch-py package for it to work. If you want to build the new_nested_tags.csv file yourself from the raw goodbooks-10k files you'll also need to install pandas into a python environment.

After running the index creation script, it is recommended to explore it interactively in Kibana. The basic query used by this project is an aggregation of the form:

GET /goodbooks10k/books/_search
{ 
  "aggs": {
        "byTag": {
          "terms": {
            "field": "tag_nested.name.keyword",
            "size": 40000
          }
        }
      },
  "size": 0
}

Where the size set to 40000 tells the aggregation to load every tag in the dataset, and the size set to 0 in the outer query tells ElasticSearch to omit search results, as we are only interested in the aggregated data.

Gotchas

  • react-select package is fixed at version 1.2.1 in package.json due to this issue, from reading this other issue.

  • Elasticsearch is prone to backwards-incompatible changes, so keep in mind that the aggregation currently used in this project might not work as expected in future versions of ElasticSearch (higher than 6). When trying the main aggregation in Kibana with size 40000, the response area will also show the following warning message.

#! Deprecation: This aggregation creates too many buckets (10001) and will throw an error in future versions. You should update the [search.max_buckets] cluster setting or use the [composite] aggregation to paginate all buckets in multiple requests.

If you are getting this message, try changing search.max_buckets as indicated. If it still doesn't work or other error comes up, please open an issue.