GitHub - KizMan-23/queries: Providing Insights, understanding and processing Big Data using SQL and PySpark

Queries is a repository of SQL and PySpark projects that use both tools to query and process big data under different conditions. SQL is a widely used domain-specific language for processing data stored in relational databases. PySpark is the Python API for Apache Spark, a powerful distributed computing framework designed for processing large-scale data. PySpark is widely used for data distributed across multiple storage systems and is known for its integration with SQL, machine learning, and pandas through Spark SQL, MLlib, Structured Streaming, and the Pandas API on Spark.

Employees SQL is a project that uses SQL on a company_employee dataset to analyze problems and answer questions that are important for understanding the scope of the data. Here, SQL is used primarily to answer questions that arise in business settings.
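As a rough illustration of the kind of business question this project answers, the sketch below runs a grouped aggregate in an in-memory SQLite database from Python. The `employees` table, its columns, the sample rows, and the `department_summary` helper are all hypothetical and invented for this example; they are not taken from employees_query.sql.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, name TEXT, department TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)",
    [
        ("Ada", "Engineering", 120000),
        ("Grace", "Engineering", 100000),
        ("Linus", "Sales", 70000),
        ("Margaret", "Sales", 90000),
        ("Alan", "Finance", 85000),
    ],
)


def department_summary(connection):
    """Return (department, headcount, average salary), highest average first."""
    query = """
        SELECT department,
               COUNT(*)    AS headcount,
               AVG(salary) AS avg_salary
        FROM employees
        GROUP BY department
        ORDER BY avg_salary DESC
    """
    return connection.execute(query).fetchall()


for department, headcount, avg_salary in department_summary(conn):
    print(f"{department}: {headcount} employees, average salary {avg_salary:.2f}")
```

The same GROUP BY pattern should carry over to other per-group questions (for example, headcount per manager), assuming the real tables expose comparable columns.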

Music Store Analysis: modeled on music platforms such as Spotify, this project showcases the use of SQL to answer questions about artists, albums, tracks, and related problems. The project follows a question-and-answer format and also provides insight into the complexities of writing complex SQL queries for business solutions.
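A question such as "which artist has the most tracks?" typically needs joins across several tables. The following is a minimal sketch of that idea using SQLite from Python. The `artist`, `album`, and `track` tables, their columns, and the sample rows are assumptions made for this example and may not match the schema used in music_store_analysis.sql.

```python
import sqlite3


def build_music_store():
    """Create a tiny in-memory music store with hypothetical tables and rows."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(
        """
        CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE album (
            album_id INTEGER PRIMARY KEY,
            title TEXT,
            artist_id INTEGER REFERENCES artist(artist_id)
        );
        CREATE TABLE track (
            track_id INTEGER PRIMARY KEY,
            name TEXT,
            album_id INTEGER REFERENCES album(album_id)
        );
        INSERT INTO artist VALUES (1, 'Artist A'), (2, 'Artist B');
        INSERT INTO album VALUES (1, 'First', 1), (2, 'Second', 1), (3, 'Debut', 2);
        INSERT INTO track VALUES
            (1, 'Track 1', 1), (2, 'Track 2', 1), (3, 'Track 3', 2), (4, 'Track 4', 3);
        """
    )
    return connection


def tracks_per_artist(connection):
    """Count tracks per artist by joining artist -> album -> track."""
    query = """
        SELECT ar.name, COUNT(t.track_id) AS track_count
        FROM artist AS ar
        JOIN album AS al ON al.artist_id = ar.artist_id
        JOIN track AS t ON t.album_id = al.album_id
        GROUP BY ar.artist_id, ar.name
        ORDER BY track_count DESC, ar.name
    """
    return connection.execute(query).fetchall()


store = build_music_store()
for name, track_count in tracks_per_artist(store):
    print(f"{name}: {track_count} tracks")
```

Inner joins are used here, so an artist with no albums or tracks would be left out of the result; a LEFT JOIN would keep such artists with a count of zero.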

sql-practice 1, 2, 3 are JSON files of SQL solutions I wrote for problems from the sql_practice website. The website offers business-related problems and expects an SQL solution for each, which helps in understanding a business better and in finding growth insights for it.

PySpark, as an Apache Spark API, is accessible through the data analytics platform Databricks. All PySpark projects were carried out in Databricks workspace notebooks.

Employee On PySpark is a replication and repurposing of the SQL version of the employee_sql problems, in which the company_employees relational problems were solved using PySpark. The project shows the similarities between SQL and PySpark, and the difficulties of each, in providing business solutions.

Spotify Streams on PySpark is an analysis of track, artist, and album data across streaming platforms such as Spotify, YouTube, and TikTok. The project showcases PySpark as an analytical tool for understanding and measuring the stream counts of tracks and the performance of artists.

Basic ML on PySpark is a continued project on the capabilities of PySpark. Using Spark's MLlib functions, classical regression and classification models can be trained on Resilient Distributed Datasets (RDDs), a core data structure of the Spark framework.

Folders and files (26 commits):
.ipynb_checkpoints
coffee shop
company_employee
employee data
music store data
README.md
basic_ml_on_pyspark.ipynb
clean_dataset.ipynb
employees_in_pyspark.ipynb
employees_query.sql
music_store_analysis.sql
music_store_analyze.sql
pyspark_on_spotify_streams.ipynb
sql-practice-1.com.json
sql-practice-2.com.json
sql-practice-2b.com.json
sql-practice-3.com.json
start_on_pyspark.ipynb
table_values.sql
