eolecvk/intro_spark_twitter

Introduction to text mining with Spark

Objectives:

  1. To learn about the Spark framework
  2. To become familiar with the Databricks notebook environment
  3. To implement text mining techniques
  4. To work with social media data (tweets)

PLAN

  1. Overview of the relevant data objects and structures, [ Databricks ]

    • Tweet object (sample)
    • JSON format
    • Spark DataFrame
    • Databricks table
  2. Sourcing, [ Databricks ]

    • Define relevant search parameters, or run a random search.
    • Get tweets using the REST API (done beforehand)
    • Aggregate the tweets into a single JSON source file with a script (done beforehand)
    • Export to S3
    • Upload the JSON source file to a Databricks table
    • Create dataframe from Databricks table
  3. Exploration, [ Databricks ]

    • Show dataframe, print schema
    • Basic sql queries
    • User tweet frequency bar graph
    • Count tweets containing a given keyword
  4. Preparation, [ Databricks ]

    • Tokenization
    • Stop word removal
    • Stemming
    • N-grams
  5. Analysis (available soon!)

    • Principal Component Analysis
    • Cluster analysis
    • ..
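
To make the exploration step of the plan more concrete, here is a minimal plain-Python sketch of counting tweets per user and counting tweets that contain a keyword. It is not the notebook's code: the sample tweets are hypothetical, trimmed to the `user.screen_name` and `text` fields, and the helper names (`parse_tweets`, `tweets_per_user`, `count_keyword`) are invented for illustration. In a Databricks notebook the same ideas would more likely be expressed on a Spark DataFrame, for example with `spark.read.json`, `groupBy(...).count()`, and `filter(...)`.

```python
import json
from collections import Counter

# Hypothetical sample data: one JSON-encoded tweet per line,
# trimmed to the two fields used below.
RAW_LINES = [
    '{"user": {"screen_name": "alice"}, "text": "Learning Spark today"}',
    '{"user": {"screen_name": "bob"}, "text": "Text mining is fun"}',
    '{"user": {"screen_name": "alice"}, "text": "Spark DataFrames and SQL"}',
]


def parse_tweets(lines):
    """Decode one JSON tweet per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def tweets_per_user(tweets):
    """Count tweets per screen name (the data behind a frequency bar graph)."""
    return Counter(t["user"]["screen_name"] for t in tweets)


def count_keyword(tweets, keyword):
    """Count tweets whose text contains the keyword, ignoring case."""
    needle = keyword.lower()
    return sum(1 for t in tweets if needle in t["text"].lower())


tweets = parse_tweets(RAW_LINES)
user_counts = tweets_per_user(tweets)
spark_count = count_keyword(tweets, "spark")

print(user_counts.most_common())
print(spark_count)
```

The keyword match here is a simple case-insensitive substring test, so "spark" would also match "sparkling"; matching on tokens instead is a reasonable refinement once the preparation step is in place.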

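The preparation step of the plan (tokenization, stop word removal, n-grams) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the notebook's implementation: the stop word list is a tiny hypothetical one, the regular expression is one of many possible tokenization rules, and the function names are invented here. Spark ML ships comparable transformers in `pyspark.ml.feature` (`Tokenizer`, `StopWordsRemover`, `NGram`). Stemming is left out because it usually relies on an external library.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word set."""
    return [t for t in tokens if t not in stop_words]


def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("Text mining with Spark is fun")
filtered = remove_stop_words(tokens)
bigrams = ngrams(filtered, 2)

print(tokens)
print(filtered)
print(bigrams)
```

Note that the n-grams are built after stop word removal, so "mining spark" appears as a bigram even though the words were not adjacent in the original sentence; building n-grams before filtering is the alternative if adjacency matters.
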
Doc & programming guides

Tutorials

