This is the repo for collaborating on the Thorn-Cloudera project at OSD-GHC16. Please create a folder with your team name, add your work there, and submit a pull request.
Here is the project description page:
Sanitized Chat Data over a few months for 2 chat rooms
'author': 'User1',
'content': '@User2 MESSAGE LINK1 ',
'id': u'ef66e354-aa47-42fc-92f3-dcbed2802510',
'scrape_datetime': u'2016-09-06 02:10:31.152588',
'site': 'SITE1'
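As a starting point for exploration, here is a minimal sketch of how one such record might be normalized in Python. It assumes each record is available as a Python dict with the five fields shown above; the actual file format of the dataset is not stated here, and `parse_record` is a hypothetical helper name.

```python
from datetime import datetime

# Sample record copied from the description above (values are sanitized placeholders).
record = {
    'author': 'User1',
    'content': '@User2 MESSAGE LINK1 ',
    'id': 'ef66e354-aa47-42fc-92f3-dcbed2802510',
    'scrape_datetime': '2016-09-06 02:10:31.152588',
    'site': 'SITE1',
}


def parse_record(rec):
    """Return a copy with a real datetime and whitespace-stripped content.

    Assumption: scrape_datetime always has the form
    'YYYY-MM-DD HH:MM:SS.ffffff', as in the sample.
    """
    parsed = dict(rec)
    parsed['scrape_datetime'] = datetime.strptime(
        rec['scrape_datetime'], '%Y-%m-%d %H:%M:%S.%f')
    parsed['content'] = rec['content'].strip()
    return parsed


parsed = parse_record(record)
print(parsed['scrape_datetime'].date(), parsed['site'], parsed['content'])
```

Converting the timestamp once up front makes the later time-based grouping simpler.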
- Landing page shows a high-level overview: chat room stats, a top-user table, and search
- Views: user profile, content, network analysis
- Data Insights
- Messages over time per chat room
- User Stats:
* Distributions by: # messages, # unique links posted
* Top Users: post patterns, number of original links posted
- Content Stats:
* Most popular content, distribution of content by popularity
* How often is new content posted?
- Network
* Who talks to whom? Graph analysis to rank users? Do sub-communities exist?
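Several of the stats listed above reduce to simple counting. The sketch below uses only the Python standard library so it runs anywhere; the same aggregations map onto pandas `groupby` if you prefer. The `messages` list is made-up data shaped like the sample record, `is_link` is a hypothetical helper, and the code assumes that sanitized links appear as tokens starting with `LINK` and that `scrape_datetime` is an acceptable proxy for when a message was posted.

```python
from collections import Counter, defaultdict

# Hypothetical records shaped like the sample record; not real data.
messages = [
    {'author': 'User1', 'content': '@User2 MESSAGE LINK1',
     'scrape_datetime': '2016-09-06 02:10:31.152588', 'site': 'SITE1'},
    {'author': 'User2', 'content': 'MESSAGE LINK1',
     'scrape_datetime': '2016-09-06 03:00:00.000000', 'site': 'SITE1'},
    {'author': 'User1', 'content': 'MESSAGE LINK2',
     'scrape_datetime': '2016-09-07 10:00:00.000000', 'site': 'SITE2'},
]


def is_link(token):
    # Assumption: sanitized links are tokens such as LINK1, LINK2, ...
    return token.startswith('LINK')


# Messages over time per chat room: count by (site, day).
# The first 10 characters of the timestamp are the YYYY-MM-DD date.
per_room_per_day = Counter(
    (m['site'], m['scrape_datetime'][:10]) for m in messages)

# User stats: number of messages per author.
messages_per_user = Counter(m['author'] for m in messages)

# User and content stats: unique links per author, and link popularity.
links_by_user = defaultdict(set)
link_popularity = Counter()
for m in messages:
    for token in m['content'].split():
        if is_link(token):
            links_by_user[m['author']].add(token)
            link_popularity[token] += 1

unique_links_per_user = {u: len(links) for u, links in links_by_user.items()}

print(per_room_per_day)
print(messages_per_user.most_common())
print(unique_links_per_user)
print(link_popularity.most_common(1))
```

Counting "original" links (who posted a link first) would additionally require sorting messages by timestamp, which is left out of this sketch.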
JavaScript, React, Python (numpy, scipy, pandas, networkx), Jupyter notebooks, git
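For the network questions, one possible approach is to treat each @mention as a directed edge from the author to the mentioned user. The sketch below is standard-library only and rests on assumptions: mentions appear as `@Name` tokens in `content` (as in the sample record), the `messages` list is made-up data, and counting mentions received is just one simple way to rank users.

```python
import re
from collections import Counter

# Assumption: a mention is "@" followed by word characters, e.g. "@User2".
MENTION = re.compile(r'@(\w+)')

# Hypothetical messages; only the fields needed for the graph are shown.
messages = [
    {'author': 'User1', 'content': '@User2 MESSAGE LINK1'},
    {'author': 'User3', 'content': '@User2 MESSAGE'},
    {'author': 'User2', 'content': '@User1 MESSAGE LINK2'},
    {'author': 'User1', 'content': '@User2 MESSAGE'},
]

# Weighted directed edges: (author, mentioned user) -> number of mentions.
edges = Counter()
for m in messages:
    for target in MENTION.findall(m['content']):
        if target != m['author']:  # ignore self-mentions
            edges[(m['author'], target)] += 1

# Simple ranking: total mentions received (weighted in-degree).
mentions_received = Counter()
for (source, target), weight in edges.items():
    mentions_received[target] += weight

ranking = mentions_received.most_common()
print(edges)
print(ranking)
```

The `edges` counter can be loaded into a `networkx.DiGraph` with `add_weighted_edges_from((u, v, w) for (u, v), w in edges.items())` if you want to try other ranking measures or look for sub-communities.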
- Which bootstrapping model do we want to choose for data hosting and analysis?
- Option 1: Provide the dataset at a download location. Participants download it, use their favourite tools for exploration and generating stats, and upload their code and insights to GitHub.
- Option 2: Hosted on Cloudera infrastructure
* Create an EDH cluster - Sravya
* Create Thorn dataset as a Hive table? - Sravya
* Make sure Spark is set up and we can run Spark jobs on this dataset
* User account setup: set up cluster accounts for all users.
* Analysis:
- Spark shell
- Hue console
- Sense
- Jupyter notebook
* Preparation before the hackathon is strongly encouraged:
- Get familiar with one of the analysis tools mentioned above.
- Option 3: Docker?
- We do not have mentor expertise in React and JavaScript
- Can we get more mentors to fill this expertise gap?
- If not, shall we remove this project?