Objectives:
- To learn about the Spark framework
- To become familiar with the Databricks notebook environment
- To implement text mining techniques
- To work with social media data (tweets)
Plan:
Overview of the relevant data objects and structures [Databricks]:
- Tweet object (sample)
- JSON format
- Spark DataFrame
- Databricks table
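
As a rough illustration of how these objects relate, the sketch below parses one made-up tweet in JSON format and flattens it into the kind of row a DataFrame or table would hold. It uses only the Python standard library rather than Spark, and the sample tweet, the selected fields, and the `flatten_tweet` helper are assumptions for illustration.

```python
import json

# A made-up sample tweet; real tweet objects carry many more fields.
SAMPLE_TWEET_JSON = """
{
  "id": 1,
  "created_at": "Mon Jan 01 00:00:00 +0000 2018",
  "text": "Learning Spark on Databricks",
  "user": {"screen_name": "example_user", "followers_count": 10}
}
"""

def flatten_tweet(tweet):
    """Turn a nested tweet dict into a flat row (one value per column)."""
    return {
        "id": tweet["id"],
        "created_at": tweet["created_at"],
        "text": tweet["text"],
        "user_screen_name": tweet["user"]["screen_name"],
    }

tweet = json.loads(SAMPLE_TWEET_JSON)
row = flatten_tweet(tweet)
print(row)
```

The nested `user` object is the part that needs care: a tabular structure has to either keep it as a nested column or flatten it into separate columns, as done here.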
Sourcing [Databricks]:
- Define relevant search parameters, or run a random search.
- Get tweets using the REST API (prior step)
- Aggregate the tweets into a single JSON source file with a script (prior step)
- Export to S3
- Upload the JSON source file to a Databricks table
- Create a DataFrame from the Databricks table
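
The aggregation step can be sketched as follows. The original script is not reproduced here, so this is only one plausible approach under stated assumptions: the batches of tweets, the `aggregate_tweets` helper, and the `tweets.json` file name are all hypothetical. It writes one JSON object per line, which is the layout Spark's JSON reader accepts by default.

```python
import json
import os
import tempfile

def aggregate_tweets(batches, out_path):
    """Write every tweet from every batch to out_path, one JSON object per line."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for batch in batches:
            for tweet in batch:
                out.write(json.dumps(tweet) + "\n")
                count += 1
    return count

# Two made-up batches, standing in for separate API responses.
batches = [
    [{"id": 1, "text": "first tweet"}, {"id": 2, "text": "second tweet"}],
    [{"id": 3, "text": "third tweet"}],
]

# Hypothetical output location in a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "tweets.json")
written = aggregate_tweets(batches, out_path)

# Read the file back to check that each line is a standalone JSON object.
with open(out_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(written, len(loaded))
```

A single line-delimited file like this can then be exported and uploaded as one source, rather than tracking many small API responses.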
Exploration [Databricks]:
- Show the DataFrame and print its schema
- Basic SQL queries
- User tweet frequency bar graph
- Count tweets containing a given keyword
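
To make the last two exploration steps concrete, here is a minimal plain-Python sketch of the logic behind them. It is not the notebook's Spark code: the sample rows, the `user` and `text` field names, and both helper functions are assumptions for illustration, and the bar graph is reduced to a text rendering.

```python
from collections import Counter

# Made-up rows standing in for the tweet table.
tweets = [
    {"user": "alice", "text": "Spark makes big data easier"},
    {"user": "bob", "text": "Trying out Databricks notebooks"},
    {"user": "alice", "text": "More SPARK experiments today"},
    {"user": "carol", "text": "Nothing about data here"},
]

def tweets_per_user(rows):
    """Count how many tweets each user posted."""
    return Counter(row["user"] for row in rows)

def count_keyword(rows, keyword):
    """Count tweets whose text contains keyword, ignoring case."""
    needle = keyword.lower()
    return sum(1 for row in rows if needle in row["text"].lower())

# Text-only stand-in for the user tweet frequency bar graph.
frequency = tweets_per_user(tweets)
for user, n in frequency.most_common():
    print(f"{user:<6} {'#' * n} ({n})")

spark_count = count_keyword(tweets, "spark")
data_count = count_keyword(tweets, "data")
print(spark_count, data_count)
```

Note that this is a substring match, so the keyword "data" also matches "Databricks". Matching whole words only would require tokenizing the text first.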
Preparation [Databricks]:
- Tokenization
- Stop word removal
- Stemming
- N-grams
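
The four preparation steps can be sketched in plain Python as below. This is an illustration of the ideas rather than the notebook's implementation: the stop word list is a tiny made-up sample, and `naive_stem` is a toy suffix stripper, not the Porter algorithm or any library stemmer.

```python
import re

# A tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "on", "in", "of", "to", "with"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip a few common English suffixes. A toy rule, not the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "The users are tweeting and sharing links on Databricks"
tokens = tokenize(text)
filtered = remove_stop_words(tokens)
stems = [naive_stem(t) for t in filtered]
bigrams = ngrams(stems, 2)
print(stems)
print(bigrams)
```

In Spark itself, `pyspark.ml.feature` provides `Tokenizer`, `StopWordsRemover`, and `NGram` transformers for the corresponding steps. The order used here (tokenize, filter, stem, then build n-grams) is one reasonable choice, not the only one.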
Analysis (available soon!):
- Principal Component Analysis
- Cluster analysis
- ..
Doc & programming guides
Tutorials