ITC - Data mining project - StackExchange Analyse.
main focus - Stack Overflow
Nir Barazida and Inbar Shirizly
This project scrapes websites and analyses the data retrieved.
The analysed websites belong to the group of main StackExchange websites:
- https://stackoverflow.com/
- https://math.stackexchange.com/
- https://askubuntu.com/
- https://superuser.com/
The analysis focuses on data retrieved from the top individual users of several websites, according to each website's all-time rank (from the website's establishment until the scraper was run).
- Presents insights about each country and continent:
  - Reputation
  - Total answers, profile views and people reached
  - People reached
  - Top tags - posts and score
  - Reputation trends between 2017-2020
- Presents the number of top users over time - according to the registration date of current top users
- Requirements.txt - The project dependencies can be found in the requirements.txt file in the current repository
- First and foremost, anyone using the program must have a MySQL account. The program creates and commits information to a MySQL database. Don't worry if you don't have a dedicated database for this information - we'll create it for you.
- To connect to your MySQL account, create environment variables for the user name and password. This protects your sensitive information and makes sure that it won't show up on GitHub or in the command line arguments.
To create environment variables you can use this guide: Corey Schafer - Environment Variables.
Please store the variables under the names:
- MySQL_PASS - password.
- MySQL_USER - user name.
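As a minimal sketch of how the two variables could be read in Python (the function name `get_mysql_credentials` is hypothetical and not taken from the project code):

```python
import os


def get_mysql_credentials(environ=os.environ):
    """Return (user, password) read from MySQL_USER and MySQL_PASS."""
    try:
        return environ["MySQL_USER"], environ["MySQL_PASS"]
    except KeyError as missing:
        # Fail early with a readable message instead of a bare KeyError later on.
        raise RuntimeError(
            f"Environment variable {missing.args[0]} is not set"
        ) from None
```

Calling `get_mysql_credentials()` with no arguments reads from the real environment; passing a plain dict instead makes the function easy to try out without touching your shell configuration.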
For now, this project is at milestone 3: the program crawls the
input websites and commits the data to the user's MySQL database.
In the future, the program will store the data in a remote database
located on a server, and display the insights in a dashboard.
The project is implemented with OOP because of its versatility and the time it saves.
Being able to scrape different websites with the same project, with only minor changes to account for each website's HTML pages, gives the project a significant advantage.
To address the diversity of websites we decided to create four classes:
- Website
- UserAnalysis
- UserScraper
- User
The first two are general and have little dependence on the website HTML. The third and fourth classes are dedicated to scraping the information and are closely tailored to the specific website and objects being scraped.
- Class Website: general class for crawling websites that follow the Stack Exchange format.
- Class UserAnalysis: class for user analysis on a website; it gets the links to each individual user page.
- Class UserScraper: receives the user's URL and scrapes all the information into class variables.
- Class User: inherits from class UserScraper the methods that scrape information about a user into class variables. It receives the user's URL, scrapes all the information into class variables and will eventually commit all of it to a database using SQLAlchemy ORM.
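The relationship between the four classes could look roughly like the skeleton below. This is only a sketch: the method names (`user_links`, `scrape`, `to_row`), the `.com` URL scheme and the dict that stands in for a parsed HTML page are all assumptions made for illustration, and no real scraping or database access takes place.

```python
class Website:
    """General crawler for a site that follows the Stack Exchange layout."""

    def __init__(self, name):
        self.name = name
        # Assumption: every supported site lives under "<name>.com".
        self.base_url = f"https://{name}.com"


class UserAnalysis(Website):
    """Produces links to the individual user pages of one website."""

    def user_links(self, relative_links):
        # Hypothetical input: the real class would read these from ranking pages.
        for relative_link in relative_links:
            yield self.base_url + relative_link


class UserScraper:
    """Receives a user URL and stores the scraped fields as attributes."""

    def __init__(self, url):
        self.url = url
        self.name = None
        self.reputation = None

    def scrape(self, page):
        # 'page' is a plain dict standing in for a parsed HTML page.
        self.name = page.get("name")
        self.reputation = page.get("reputation")


class User(UserScraper):
    """Adds the step that prepares the scraped data for storage."""

    def to_row(self):
        return {"url": self.url, "name": self.name, "reputation": self.reputation}


links = UserAnalysis("stackoverflow").user_links(["/users/1/alice"])
user = User(next(links))
user.scrape({"name": "alice", "reputation": 100})
```

Because `user_links` is a generator, each user page link is produced only when the scraper asks for the next one.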
In addition to the scraping abilities, we added the ability to retrieve information about every scraped website through the StackExchange API. After creating all four classes that enable us to scrape the data, together with the API ability, we'll start working on the database that stores the information. To create the database we will use the SQLAlchemy ORM. This way we will be able to query and manipulate the database using object-oriented code instead of writing SQL. The implementation can be found in the ORM.py file.
Tables description:
- users - contains information about the individual user - contains users from all the websites together, distinguished by website_id.
- websites - table related to users (one website to many users) - stores general information about the website from the StackExchange API.
- tags - name of the tag - connected to a relation table, user_tags, which contains the score and number of posts of the tag for each user (users - tags = many to many).
- reputation - reputation of the user, including data from each year between 2017-2020; related to the users table (one user for one reputation entity).
- location - country and continent of users (one location for many users).
- stack_exchange_location - table that stores the last phrase of each user's location description. In the spirit of dynamic programming, these descriptions are reused to save API requests (if the description already exists during the scraping process). (one location_id for many website_location (the description))
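The idea behind stack_exchange_location, reusing a description that was already resolved instead of issuing another API request, can be sketched with an in-memory cache. Everything here is an assumption for illustration: the class name `LocationCache`, the rule that the "last phrase" is the text after the last comma, the lower-casing of keys, and the `resolver` callable that stands in for the real geolocation request.

```python
class LocationCache:
    """Memoizes location descriptions so each one is resolved only once."""

    def __init__(self, resolver):
        self._resolver = resolver  # callable: description -> (country, continent)
        self._known = {}           # stands in for the stack_exchange_location table
        self.lookups = 0           # counts how many times the resolver was called

    def resolve(self, description):
        # Assumption: the last phrase is whatever follows the last comma.
        key = description.split(",")[-1].strip().lower()
        if key not in self._known:
            self.lookups += 1
            self._known[key] = self._resolver(key)
        return self._known[key]


# A fake resolver keeps the sketch free of network calls.
cache = LocationCache(lambda key: ("Israel", "Asia"))
first = cache.resolve("Tel Aviv, Israel")
second = cache.resolve("Haifa, israel")
```

Both calls end in the same last phrase, so the resolver runs once and the second call is answered from the cache.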
In the command line arguments the user will be able to use the following features:
- NUM_USERS_TO_SCRAP - number of users that will be scraped in the current execution. The default is 10.
- WEBSITE_NAMES - list of websites that the program will scrape. Note that they must be part of the StackExchange group.
- Multiprocessing - when this mode is on, the program scrapes each of the input WEBSITE_NAMES concurrently using multiprocessing. When the mode is off, it runs over the list in a loop. The default value is False.
- DB_NAME - name of the database that the information will be stored in. If it does not exist, the program will create it under the given name. The default name is 'stack_exchange_db'.
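A rough sketch of how these options could be parsed and used to choose between a loop and multiprocessing is shown below. The flag spellings (`--num_users`, `--websites`, `--multi_process`, `--db_name`) and the placeholder `scrape_website` function are hypothetical; only the defaults (10, False, 'stack_exchange_db') come from the description above.

```python
import argparse
from multiprocessing import Pool


def parse_args(argv=None):
    """Parse the command line options (flag names are illustrative only)."""
    parser = argparse.ArgumentParser(description="Scrape top StackExchange users")
    parser.add_argument("--num_users", type=int, default=10,
                        help="number of users to scrape in this execution")
    parser.add_argument("--websites", nargs="+", required=True,
                        help="StackExchange websites to scrape")
    parser.add_argument("--multi_process", action="store_true",
                        help="scrape the websites concurrently")
    parser.add_argument("--db_name", default="stack_exchange_db",
                        help="name of the database to store the data in")
    return parser.parse_args(argv)


def scrape_website(name):
    # Placeholder for the real per-website scraping work.
    return f"scraped {name}"


def run(args):
    """Scrape every website, concurrently or one after the other."""
    if args.multi_process:
        # One worker process per website.
        with Pool(processes=len(args.websites)) as pool:
            return pool.map(scrape_website, args.websites)
    return [scrape_website(name) for name in args.websites]
```

On platforms that start worker processes by spawning a fresh interpreter, the call to `run` should sit under an `if __name__ == "__main__":` guard.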
- __init__.py - runs when the src package is imported into main and initializes general processes such as setting the connection and creating the engine.
- conf.py - configuration file; generates all important values from the json file and from the command line input.
- general.py - general functions that are needed in multiple Python files.
- geo_location.py - contains three nested functions that receive a general location string and retrieve a generic country and continent.
- logger.py - contains the class logger, which provides a general logger format.
- ORM.py - defines the schema using ORM: creates the tables and all relationships between the tables in the database.
- user.py - includes the class User(UserScraper) - extracts the data for the individual user and adds it to the relevant tables in the database.
- user_analysis.py - includes the class UserAnalysis(Website) - creates a generator of links to each individual user page.
- user_scraper.py - includes the class UserScraper(object) - receives the user's URL and scrapes all the information into class variables.
- website.py - includes the class Website(object) - creates soups of pages, finds the last page and creates soups for the main topic pages.
- working_with_database.py - contains most of the functions that perform CRUD operations on the database.
- create_json_file.py - Python file that generates mining_constants.json.
- mining_constants.json - JSON file that contains the constants for the whole program. conf.py parses the JSON into class variables. The program only uses conf.py to import the different variables.
- requirements.txt - file with all the packages and dependencies of the project.
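Parsing a JSON file into class variables, as described for conf.py, might look something like the sketch below. The class name `Config`, its `load` method and the example keys are assumptions for illustration, not the project's actual code or the real contents of mining_constants.json.

```python
import json


class Config:
    """Holds the constants from a JSON document as class variables."""

    @classmethod
    def load(cls, json_text):
        # Every top-level JSON key becomes a class attribute of the same name.
        for key, value in json.loads(json_text).items():
            setattr(cls, key, value)
        return cls


# Hypothetical file content; the real constants may differ.
Config.load('{"NUM_USERS_TO_SCRAP": 10, "DB_NAME": "stack_exchange_db"}')
```

Other modules can then read `Config.DB_NAME` directly, so the JSON file is parsed in one place only. In practice the text would come from reading the file from disk rather than from an inline string.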