Introduction to text mining with Spark

Objectives:

PLAN

Overview of the relevant data objects and structures, [ Databricks ]
- Tweet object (sample)
- JSON format
- Spark DataFrame
- Databricks table
Sourcing, [ Databricks ]
- Define relevant search parameter or make a random search.
- Get tweets using the REST API (prior)
- Aggregate tweets to this single JSON source file with this script (prior)
- Export to S3
- Upload the JSON source file to a Databricks table
- Create dataframe from Databricks table
Exploration, [ Databricks]
- Show dataframe, print schema
- Basic sql queries
- User tweet frequency bar graph
- Count tweets containing a given keyword
Preparation, [ Databricks ]
- tokenization
- stop word removal
- Stemming
- N-Grams
Analysis (available soon!)
- Principal Component Analysis
- Cluster analysis
- ..

Doc & programming guides

Tutorials

Provide feedback