
subhanjandas/User-Occupation-and-Movies-Ratings-Data-Exploration-using-Apache-Hive


User, Occupation and Movies, Ratings Data Exploration using Apache Hive

Introduction

This project analyzes the "User, Occupation, Movies, and Ratings" dataset using Apache Hive. The data was queried with Hive's SQL-like query language (HiveQL) and processed with the MapReduce framework, which makes large datasets easier to handle. The analysis aims to break the data down by user attributes and rating activity.

The first step was to create a Hive table and load the data into it. The table contents were then displayed for a visual check, and the table schema was printed to show the columns and their data types.
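The load, display, and schema steps can be sketched as follows. This is a minimal illustration, not the project's actual code: it uses Python's built-in `sqlite3` module as a stand-in for Hive so that it runs anywhere, and the `users` table, its column names, the pipe delimiter, and the three sample rows are all assumptions made up for the example. In Hive itself, the corresponding statements would be `CREATE TABLE`, `LOAD DATA INPATH ... INTO TABLE`, and `DESCRIBE`.

```python
import sqlite3

# Hypothetical schema and toy rows; the real dataset's columns may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_id INTEGER, age INTEGER, gender TEXT, "
    "occupation TEXT, zip_code TEXT)"
)

# Pipe-delimited lines standing in for the raw user file (assumed format).
raw_lines = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",
    "3|17|M|student|32067",
]
rows = [line.split("|") for line in raw_lines]
conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?, ?)", rows)

# Display the table contents for a visual check.
table_rows = conn.execute("SELECT * FROM users ORDER BY user_id").fetchall()
for row in table_rows:
    print(row)

# Print the schema as (column name, declared type) pairs.
# PRAGMA table_info is SQLite-specific; Hive uses DESCRIBE instead.
schema = [(col[1], col[2]) for col in conn.execute("PRAGMA table_info(users)")]
print(schema)
```

The numeric columns are declared as `INTEGER` so that later comparisons on `age` work on numbers rather than text.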

Next, the data was filtered to keep only the rows where the user's age was greater than 25 and an occupation was specified.

The filtered data was then grouped by occupation, and the number of users in each occupation was counted. The result shows which occupations have the most users.

Finally, the ratings were used to find the user with the highest number of ratings, along with that user's ID and age. This identifies the most active rater in the dataset.

In conclusion, this project shows how Apache Hive can be used to explore the "User, Occupation, Movies, and Ratings" dataset. Hive's SQL-like query language was enough to answer each question below with a short query.

Tools Used

  • Ambari
  • Apache Hive

Project Screenshots

  • Question 1: Find the occupation of every user

image

image
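One way Question 1 could be expressed is sketched below. The query text is plain SQL that should carry over to HiveQL, but it is run here against Python's built-in `sqlite3` as a stand-in, and the `users` table, its columns, and the sample rows are assumptions rather than the project's real data.

```python
import sqlite3

# Toy stand-in for the users table (hypothetical columns and rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, age INTEGER, occupation TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, 24, "technician"), (2, 53, "other"), (3, 17, "student"), (4, 30, "technician")],
)

# Occupation of every user, one row per user.
user_occupations = conn.execute(
    "SELECT user_id, occupation FROM users ORDER BY user_id"
).fetchall()
print(user_occupations)

# The distinct set of occupations, if only the categories are of interest.
distinct_occupations = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT occupation FROM users ORDER BY occupation"
    )
]
print(distinct_occupations)
```

The question can be read either as "list each user's occupation" or "list the occupations that occur", so the sketch shows both.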

  • Question 2: Find the number of non-adults (users younger than 18) who have rated movies

image
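A possible shape for the Question 2 query is sketched below, assuming the ratings live in a separate table keyed by user ID. The `users` and `ratings` tables, their columns, and all rows are hypothetical, and Python's built-in `sqlite3` stands in for Hive. The join keeps only users who have at least one rating, and `COUNT(DISTINCT ...)` prevents a user with several ratings from being counted more than once.

```python
import sqlite3

# Toy stand-ins for the users and ratings tables (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, age INTEGER)")
conn.execute("CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, 24), (2, 15), (3, 17), (4, 16)],
)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(1, 10, 4), (2, 10, 5), (2, 11, 3), (3, 12, 2)],  # user 4 never rated
)

# Count each under-18 user once, however many movies they rated.
minor_raters = conn.execute(
    """
    SELECT COUNT(DISTINCT u.user_id)
    FROM users u
    JOIN ratings r ON u.user_id = r.user_id
    WHERE u.age < 18
    """
).fetchone()[0]
print(minor_raters)
```

In this toy data, users 2 and 3 are under 18 and have ratings, while user 4 is under 18 but has none, so only two users are counted.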

  • Question 3: Find the number of users per occupation among users older than 25

image

image

image
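The filter-then-group step for Question 3 might look like the sketch below. It runs on Python's built-in `sqlite3` as a stand-in for Hive, with a made-up `users` table and rows. Treating "occupation specified" as "not NULL and not an empty string" is an assumption about how missing occupations are encoded in the data.

```python
import sqlite3

# Toy stand-in for the users table (hypothetical columns and rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, age INTEGER, occupation TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, 24, "technician"),  # excluded: age is not above 25
        (2, 53, "other"),
        (3, 30, "technician"),
        (4, 41, "technician"),
        (5, 33, None),          # excluded: no occupation
        (6, 28, "writer"),
        (7, 45, "other"),
        (8, 29, ""),            # excluded: empty occupation
    ],
)

# Filter first (WHERE), then count users per occupation (GROUP BY).
counts_by_occupation = conn.execute(
    """
    SELECT occupation, COUNT(*) AS user_count
    FROM users
    WHERE age > 25
      AND occupation IS NOT NULL
      AND occupation <> ''
    GROUP BY occupation
    ORDER BY user_count DESC, occupation
    """
).fetchall()
print(counts_by_occupation)
```

Sorting by count and then by occupation name keeps the output order deterministic when two occupations have the same number of users.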

  • Question 4: Find the user ID and age of the user with the most ratings

image
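Question 4 can be approached by counting ratings per user and keeping the top row, as in the sketch below. Python's built-in `sqlite3` is again used as a stand-in for Hive, and the `users` and `ratings` tables, their columns, and the rows are hypothetical.

```python
import sqlite3

# Toy stand-ins for the users and ratings tables (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, age INTEGER)")
conn.execute("CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, 24), (2, 53), (3, 17)],
)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [
        (1, 10, 4),
        (2, 10, 5), (2, 11, 3), (2, 12, 4),
        (3, 10, 2), (3, 12, 5),
    ],
)

# Count ratings per user, sort descending, and keep the first row.
top_rater = conn.execute(
    """
    SELECT u.user_id, u.age, COUNT(*) AS rating_count
    FROM users u
    JOIN ratings r ON u.user_id = r.user_id
    GROUP BY u.user_id, u.age
    ORDER BY rating_count DESC
    LIMIT 1
    """
).fetchone()
print(top_rater)
```

If several users tie for the highest count, `LIMIT 1` returns only one of them; adding a secondary sort key such as `u.user_id` would make that choice deterministic.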
