main Library used:
request
numpy
pandas
mysqlConnector
SQLAlchemy
streamlit
Main page used is IranKetab
Creating a beautiful dashboard and do some null hypothesis 😄
All purpose is just showing our skills and knowledge about Analytics, So we don't have any mistrust to "IranKetab" Website
In two separate files named scrap_link
and crawl_page
, we have extracted all the necessary and unnecessary information from the target website.
In the scrap_link
file, only all the links to the books from all categories (tags) have been extracted and saved in a file named links.csv
. It takes approximately 60 minutes to extract all the links.
In the crawl_page
file, all the essential details such as the book's ISBN, discount amount, print series, tags, etc., have been extracted. Subsidiary information includes footnotes or golden quotes from the book. The final file has an approximate size of 100 megabytes (assuming all links are present) and is saved in CSV format.
instead of using classical method you can use pandas_readHTML method, but there is no way to get link of translators or publisher
In the cleaner_book
file, all the data is cleaned and then saved in a new CSV file. Following that, in the ExportDB
file, we segregated the data according to the diagram below, storing it locally. Subsequently, we obtained CSV outputs from the same data.
cleaner_book
: Cleans and saves data in a new CSV file.ExportDB
: Segregates data according to the diagram and saves locally.
Diagram:
In st_git
file we create a dynamic dashboard with streamlit library.
In this dashboard, graphs about the frequency of books of each genre, the number of books by top authors, translators and publishers, books published in different years, the frequency of books based on the type of cover, as well as the distribution of book prices based on the year of publication and book rating are displayed in has come
There is also a search section where the user can search for the desired book by specifying the maximum price, minimum score, author, publisher and book genre.
In the figure below, a view of the dashboard is displayed:
You can check final result here Streamlit_Dashboard 🕶️
Include Some statistics tests for finding relationships between features
Translation has a significant impact on the price of the book
There is a significant difference in prices between hardcover and paperback versions
for more details you can check first_hyp
& second_hyp