Get insights from Stack Overflow datasets using a Hadoop, Hive, and MySQL cluster running on Docker containers, so you can try it on your own machine.
The datasets in the `dataset` folder are retrieved from:
StackLite: A simple dataset of Stack Overflow questions and tags
It is available at https://github.com/dgrtwo/StackLite and https://www.kaggle.com/stackoverflow/stacklite
- Docker client installed
First off, we need to pull the necessary images for this cluster:
$ ./pull-image.sh
The Hadoop and Hive images we use this time are based on Newnius's work; you can check it out on his blog:
https://blog.newnius.com/setup-apache-hive-in-docker.html
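As a rough sketch of what pull-image.sh does (the exact images and tags are defined in the script itself; the tags below are assumptions), it pulls something like:
$ docker pull newnius/hadoop:2.7.4
$ docker pull newnius/hive:2.3.2
$ docker pull mysql:5.7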
Next up, we can create our cluster by starting the necessary containers and connecting them with a network:
$ ./start-container.sh && ./run-container.sh
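Once the scripts finish, you can confirm from the host that all containers are running:
$ docker ps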
We're all set; now we can use the cluster to process data with Hive.
On the `hadoop-master` node, we need to format the namenode first:
$ bin/hadoop namenode -format
Then, we can start the DFS and YARN daemons:
$ sbin/start-dfs.sh && sbin/start-yarn.sh
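To check that the daemons came up, running jps on `hadoop-master` should list processes such as NameNode and ResourceManager:
$ jps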
Initialize the Hive schema on the `hive` node:
$ schematool --dbType mysql --initSchema
Run the metastore service after the schema is initialized:
$ hive --service metastore
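Note that this command keeps the terminal occupied; if you prefer, the metastore can be started in the background instead (an alternative to the step above, not part of the original instructions):
$ nohup hive --service metastore &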
Copy the data to `hadoop-master` from the host:
$ docker cp data/. hadoop-master:/usr/local/hadoop
Put the data into HDFS:
$ bin/hadoop dfs -mkdir /<folder_name>
$ bin/hadoop dfs -put <your_data>.csv /<folder_name>
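For illustration, with a hypothetical folder name stacklite and the StackLite questions file, that would be:
$ bin/hadoop dfs -mkdir /stacklite
$ bin/hadoop dfs -put questions.csv /stacklite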
First, we need to create a table on the `hive` node.
Will be added...
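In the meantime, here is a minimal sketch, assuming the StackLite questions.csv columns (Id, CreationDate, ClosedDate, DeletionDate, Score, OwnerUserId, AnswerCount) and the hypothetical /stacklite HDFS folder from above:
$ hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS questions (
      id INT, creation_date STRING, closed_date STRING, deletion_date STRING,
      score INT, owner_user_id INT, answer_count INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/stacklite'
    TBLPROPERTIES ('skip.header.line.count'='1')"
An external table keeps the CSV files where they already are in HDFS instead of moving them into the Hive warehouse.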
Start running some queries, and export the results as CSV:
$ hive -e '<query>' > <result_file>.csv
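For example, with the questions table sketched above, a count of questions per year could be exported like this (questions_per_year.csv is just an illustrative file name):
$ hive -e "SELECT substr(creation_date, 1, 4) AS year, COUNT(*) AS total
    FROM questions
    GROUP BY substr(creation_date, 1, 4)
    ORDER BY year" > questions_per_year.csv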
On the host, copy the query result out of the `hive` container:
$ docker cp hive:/usr/local/hive/<query_result>.csv result/data/
Results and visualizations can be seen in the `result` folder.
- Python
In the `pyspark-analysis` directory, run:
$ pip install -r requirements.txt
Run
$ jupyter notebook