This task involved answering the following questions on a large dataset:
Task 1: Count the number of employees in each County, Region and City
Task 2: Generate Employee Summary
Task 3: Generate employee summary and ordering by Gender and Salary
Task 4: Summarize the number of employees who joined and the hikes granted, by month
Task 5: Generate employee summary and ordering by Salary
It was performed on a CSV dataset of 5 million records using PySpark and Elasticsearch (no Logstash!). I performed the task in two ways:
- Using only PySpark
- Using PySpark and Elasticsearch.
- Python 3 (a version below 3.8, to avoid compatibility issues)
$ sudo apt-get install python3
$ sudo apt-get install python3-pip
- Java JDK8 (required for Spark) and JDK11 (required for Elasticsearch)
- Apache Spark (v2.4.x preferred, to avoid compatibility issues), plus PySpark installed via pip
$ sudo pip3 install pyspark
- Elasticsearch, plus the Elasticsearch Python client installed via pip
$ sudo pip3 install elasticsearch
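Before going further, it can help to confirm that the interpreter and packages match the list above. The following is only a minimal sketch using the standard library; the function name `check_environment` is hypothetical, and including `findspark` in the package list is an assumption based on the imports used later in this guide.

```python
import importlib.util
import sys


def check_environment(max_version=(3, 8),
                      packages=("pyspark", "elasticsearch", "findspark")):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # The guide recommends a Python 3 version below 3.8.
    if sys.version_info[:2] >= max_version:
        problems.append(
            "Python %d.%d is not below %d.%d" % (sys.version_info[:2] + max_version)
        )
    # find_spec returns None when a top-level package cannot be imported.
    for name in packages:
        if importlib.util.find_spec(name) is None:
            problems.append("missing package: %s" % name)
    return problems


for problem in check_environment():
    print(problem)
```

If nothing is printed, the Python version and the pip packages look fine. The Java and Elasticsearch installations still have to be checked separately.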
Because the task was performed in two ways, the first few steps are shared before the two approaches branch out. The basic steps load the data into a DataFrame with PySpark. The dataset used for this task was already clean, so no data cleaning was needed. In most cases, however, data is noisy and dirty and cleaning is a necessity, so perform whatever steps your dataset requires.
- First import the necessary libraries required for the task.
import findspark
findspark.init("/usr/local/spark/") # locate the locally installed Spark
from pyspark.sql import SparkSession, functions as func
- Next create a SparkSession.
spark = SparkSession.builder.appName('task').getOrCreate()
- Keep in mind that the dataset is huge, and reading it can sometimes kill the kernel (as happens with Jupyter Notebooks), so load the data with Spark into a DataFrame and keep only the columns you need.
df = spark.read.format("csv").option("header","true").load("/your/dataset/path").fillna(0).select("the columns you need")
- Next, check the data type of each chosen column. Some columns may contain numeric data but have a string data type, so cast them to the appropriate type.
df.dtypes # for checking the data types
df = df.withColumn("your column", df["your column"].cast("your data type"))
- Verify that all the necessary changes have been made by checking the data types and the data itself.
After the basic steps, the PySpark-only version of the task is fairly easy: it only requires applying groupBy and aggregation functions. Once the task is complete, make sure to stop the SparkSession.
To do the task using PySpark and Elasticsearch, first extract the Elasticsearch tar file, then add a few extra imports to the basic steps above:
from elasticsearch import Elasticsearch
import requests
from pprint import pprint
- First, start Elasticsearch from the terminal.
$ cd /path/to/elasticsearch
$ ./bin/elasticsearch # run in the foreground to see all the details
$ ./bin/elasticsearch -d # start Elasticsearch as a daemon process
- Next, extract the ES-Hadoop zip archive and copy elasticsearch-hadoop-x.jar to the Spark jars folder.
$ cd /es-hadoop/dist
$ cp elasticsearch-hadoop-x.jar /path/to/spark/jars
- Next, check whether Elasticsearch is reachable by sending it a GET request with requests.
res = requests.get('http://localhost:9200')
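A slightly more defensive version of this reachability check might look like the sketch below. Note the swap: it uses the standard library's `urllib.request` in place of `requests` so that it runs without extra packages. The function name `es_is_reachable` and the timeout value are assumptions.

```python
import http.client
import json
import urllib.request


def es_is_reachable(url="http://localhost:9200", timeout=2.0):
    """Return True if the URL answers with HTTP 200 and a JSON body, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Elasticsearch answers its root endpoint with a JSON document.
            json.loads(response.read().decode("utf-8"))
            return response.status == 200
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers refused connections, timeouts and HTTP error statuses;
        # ValueError covers a non-JSON body or a malformed URL.
        return False


print("Elasticsearch reachable:", es_is_reachable())
```

Returning a boolean instead of raising lets a notebook cell decide whether to go on to the write step or stop early.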
- After the basic steps have been performed successfully, write the PySpark DataFrame directly to Elasticsearch. This takes much less time and avoids Logstash, which is a time and memory hog.
# write the DataFrame through the ES-Hadoop Spark SQL connector
df.write.format("org.elasticsearch.spark.sql")\
    .option("es.resource", '%s' % ('your indexname'))\
    .option("es.nodes", 'localhost')\
    .option("es.port", '9200')\
    .save()
- Next create an Elasticsearch object and perform the tasks as required.
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
- When performing aggregations on string columns in Elasticsearch, append ".keyword" to the field name, since analyzed text fields cannot be used in aggregations by default.
- Once the tasks have been performed successfully, close the connection to Elasticsearch and stop the SparkSession.
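To illustrate the ".keyword" point, here is a minimal sketch that builds a request body for counting employees per County, Region and City (Task 1) with nested terms aggregations. The helper name `nested_terms_agg`, the field names and the bucket size are assumptions. The block only builds and prints the body, so it runs without a cluster.

```python
import json


def nested_terms_agg(fields, size=10):
    """Build nested terms aggregations, one level per field, on the .keyword sub-fields."""
    body = {}
    level = body
    for field in fields:
        name = "by_%s" % field
        # Aggregate on the exact-value keyword sub-field, not the analyzed text field.
        level["aggs"] = {name: {"terms": {"field": "%s.keyword" % field, "size": size}}}
        level = level["aggs"][name]
    # size 0 suppresses the matching documents so that only buckets are returned.
    body["size"] = 0
    return body


query = nested_terms_agg(["County", "Region", "City"])
print(json.dumps(query, indent=2))

# With a running cluster and the client object created earlier, the body could be
# sent roughly like this (7.x-style call, shown as an assumption):
#   es.search(index="your indexname", body=query)
```

Each bucket in the response carries a `doc_count`, which is the number of employees for that County, Region and City combination.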