Queries is a repository of SQL and PySpark projects, the two frameworks used here for querying and processing big data under different conditions. SQL is a widely used domain-specific language for processing data stored in relational databases. PySpark is the Python API for Apache Spark, a powerful distributed computing framework designed for processing large-scale data. PySpark is widely used for data distributed across multiple storage locations and is known for its integration with SQL, machine learning, and pandas through Spark SQL, MLlib, Structured Streaming, and the pandas API on Spark.
Employees SQL is a project that uses SQL on a company_employee dataset to analyze problems and answer the questions that matter for understanding the scope of the data. SQL is used here primarily to answer questions that arise in business settings.
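As a rough illustration of the kind of question this project answers, the sketch below computes headcount and average salary per department. The `employees` table, its columns, and the rows are hypothetical and are not taken from the project's dataset. Python's built-in `sqlite3` module is used only so the query can be run without a database server.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, name TEXT, department TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)",
    [
        ("Ada", "Engineering", 120000.0),
        ("Grace", "Engineering", 110000.0),
        ("Linus", "Sales", 70000.0),
    ],
)

# Business question: how many people work in each department,
# and what is the average salary there?
avg_salary_by_department = conn.execute(
    """
    SELECT department, COUNT(*) AS headcount, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
    ORDER BY department
    """
).fetchall()

for row in avg_salary_by_department:
    print(row)

conn.close()
```

The same GROUP BY pattern extends to other aggregates such as MIN, MAX, and SUM.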
Music Store Analysis is modeled on music platforms such as Spotify and showcases the use of SQL to answer questions about artists, albums, tracks, and related topics. The project follows a question-and-answer format and also gives insight into the complexities of writing complex SQL queries for business solutions.
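A question in this style might be "which artist has the most tracks?". The sketch below answers it with a join across three tables. The `artists`, `albums`, and `tracks` tables, their columns, and the rows are assumptions made for illustration; the project's actual schema may differ.

```python
import sqlite3

# Hypothetical three-table schema, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE artists (artist_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE albums (album_id INTEGER PRIMARY KEY, title TEXT, artist_id INTEGER);
    CREATE TABLE tracks (track_id INTEGER PRIMARY KEY, name TEXT, album_id INTEGER);

    INSERT INTO artists VALUES (1, 'Artist A'), (2, 'Artist B');
    INSERT INTO albums VALUES (1, 'Album One', 1), (2, 'Album Two', 1), (3, 'Album Three', 2);
    INSERT INTO tracks VALUES (1, 't1', 1), (2, 't2', 1), (3, 't3', 2), (4, 't4', 3);
    """
)

# Question: how many tracks does each artist have, most prolific first?
tracks_per_artist = conn.execute(
    """
    SELECT ar.name, COUNT(t.track_id) AS track_count
    FROM artists AS ar
    JOIN albums AS al ON al.artist_id = ar.artist_id
    JOIN tracks AS t ON t.album_id = al.album_id
    GROUP BY ar.artist_id, ar.name
    ORDER BY track_count DESC, ar.name
    """
).fetchall()

for row in tracks_per_artist:
    print(row)

conn.close()
```

Note that the inner joins drop artists with no albums or tracks; a LEFT JOIN would keep them with a count of zero.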
sql-practice 1, 2, and 3 are JSON files containing my SQL solutions to problems from the sql_practice website. The website offers business-related problems and expects an SQL solution for each one, which helps in understanding a business better and in finding growth insights for it.
PySpark, as an Apache Spark API, is accessible through the Databricks data analytics platform. All PySpark projects were carried out in Databricks workspace notebooks.
Employee On PySpark is a replication and repurposing of the SQL version of the employee_sql problems, in which the company_employees relation problems were solved using PySpark. The project shows the similarities and difficulties between SQL and PySpark in providing business solutions.
Spotify Streams on PySpark is an analysis of track, artist, and album data across different streaming platforms such as Spotify, YouTube, and TikTok. The project showcases PySpark as an analytical tool for understanding and measuring the stream numbers and performance of tracks and artists.
Basic ML on PySpark continues the exploration of PySpark's capabilities. Using Spark's MLlib functions, classical regression and classification models can be trained on Resilient Distributed Datasets (RDDs), a core data structure of the Spark framework.