Introduction to text mining with Spark

Objectives:

  1. To learn about the Spark framework
  2. To become familiar with the Databricks notebook environment
  3. To implement text mining techniques
  4. To work with social media data (tweets)

PLAN

  1. Overview of the relevant data objects and structures [Databricks]

    • Tweet object (sample)
    • JSON format
    • Spark DataFrame
    • Databricks table
  2. Sourcing [Databricks]

    • Define a relevant search parameter, or run a random search
    • Get tweets using the REST API (prior)
    • Aggregate tweets into this single JSON source file with this script (prior)
    • Export to S3
    • Upload the JSON source file to a Databricks table
    • Create a DataFrame from the Databricks table
  3. Exploration [Databricks]

    • Show the DataFrame and print its schema
    • Basic SQL queries
    • Bar graph of tweet frequency per user
    • Count tweets containing a given keyword
  4. Preparation [Databricks]

    • Tokenization
    • Stop word removal
    • Stemming
    • N-grams
  5. Analysis (available soon!)

    • Principal Component Analysis
    • Cluster analysis
    • ..
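
As a rough illustration of the exploration step in the plan above, the sketch below parses a few JSON-encoded tweets and computes tweets per user and a keyword count. It is plain Python rather than Spark, so it runs anywhere. The sample tweets are made up, and the helper names (`load_tweets`, `tweets_per_user`, `count_keyword`) are hypothetical. The only parts of the tweet object it assumes are a `text` field and a nested `user.screen_name` field.

```python
import json
from collections import Counter

# Made-up sample data: one JSON-encoded tweet per line (an assumption for
# illustration; the real source file is not reproduced here).
SAMPLE_JSON_LINES = """\
{"id": 1, "text": "Learning Spark on Databricks", "user": {"screen_name": "alice"}}
{"id": 2, "text": "Text mining tweets with spark", "user": {"screen_name": "bob"}}
{"id": 3, "text": "Coffee first, then data", "user": {"screen_name": "alice"}}
"""


def load_tweets(json_lines):
    """Parse one tweet object per non-empty line."""
    return [json.loads(line) for line in json_lines.splitlines() if line.strip()]


def tweets_per_user(tweets):
    """Count tweets by author screen name."""
    return Counter(t["user"]["screen_name"] for t in tweets)


def count_keyword(tweets, keyword):
    """Count tweets whose text contains the keyword, ignoring case."""
    needle = keyword.lower()
    return sum(1 for t in tweets if needle in t["text"].lower())


tweets = load_tweets(SAMPLE_JSON_LINES)
user_counts = tweets_per_user(tweets)
spark_count = count_keyword(tweets, "spark")

print(user_counts.most_common())  # data behind a per-user frequency bar graph
print(spark_count)
```

In a notebook, the same two questions would more likely be answered with a group-by and a filter on the DataFrame, or with SQL queries against the table.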

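The preparation steps can be sketched in the same plain-Python style. The block below shows tokenization, stop word removal, and N-gram generation on a single made-up sentence. The stop word list is a small illustrative set, not the one Spark ships with, and the regex tokenizer is a simplification. Spark ML offers `Tokenizer`, `StopWordsRemover`, and `NGram` in `pyspark.ml.feature` for the same steps on DataFrame columns. Stemming is left out of this sketch.

```python
import re

# Small illustrative stop word list (an assumption, not Spark's default list).
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "with", "on", "in"}


def tokenize(text):
    """Lowercase the text and extract alphanumeric word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word set."""
    return [t for t in tokens if t not in stop_words]


def ngrams(tokens, n=2):
    """Build space-joined N-grams from consecutive tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "Learning the basics of text mining with Spark"  # made-up example
tokens = tokenize(text)
filtered = remove_stop_words(tokens)
bigrams = ngrams(filtered, 2)

print(tokens)
print(filtered)
print(bigrams)
```

Each function takes and returns a plain list, so the steps can be chained in the same order as the plan: tokenize first, filter stop words next, then build N-grams from what remains.
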
Doc & programming guides

Tutorials