PageRankAlgorithm

  • Final grade: 9.6/10.

Overview

This project implements the original PageRank algorithm, which laid the groundwork for the Google Search Engine. The algorithm assigns a weight to each webpage, based primarily on the number and rank of the external pages linking to it.
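For reference, the standard PageRank update rule (the basis for the sketches below; the damping factor value is the conventional choice, not something specified by the project) is:

$$PR(p) = \frac{1 - d}{N} + d \sum_{q \to p} \frac{PR(q)}{L(q)}$$

where N is the total number of pages, d ≈ 0.85 is the damping factor, the sum runs over all pages q that link to p, and L(q) is the number of outgoing links on q.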

Objective

The primary goal is to apply the PageRank algorithm to assign a weight to each page of the English Wikipedia, using the dataset available in the Databricks datasets collection at the URI dbfs:/databricks-datasets/wikipedia-datasets/data-001/en_wikipedia/articles-only-parquet. The dataset contains 5,823,210 entries stored in Parquet format.
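A minimal loading sketch, assuming a Databricks notebook where a SparkSession is available; the variable names and the sample call (matching the 0.01% fraction recommended under Implementation Details) are illustrative:

```python
from pyspark.sql import SparkSession

# On Databricks a session already exists; getOrCreate() reuses it.
spark = SparkSession.builder.appName("Spark_PageRank").getOrCreate()

wiki = spark.read.parquet(
    "dbfs:/databricks-datasets/wikipedia-datasets/data-001/en_wikipedia/articles-only-parquet"
)

# Only these three columns are needed to build the link graph.
pages = wiki.select("title", "id", "text")

# Roughly 0.01% of the records (~582 rows) for initial development.
pages_small = pages.sample(fraction=0.0001, seed=42)
```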

Dataset Analysis

The dataset comprises seven columns: title, id, revisionId, revisionTimestamp, revisionUsername, revisionUsernameId, and text. Only the title, id, and text columns are needed for this project. The first step is to build a forward-link matrix by extracting each page's outgoing links, which appear in the text enclosed in double square brackets ([[...]]), and mapping the link targets to their page ids. From the forward links, a backward-link matrix (the incoming links of each page) is then constructed, since PageRank is computed from the links pointing into a page.
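A hedged sketch of this step in PySpark; the regex, the helper names (extract_links, forward, backward), and the join strategy are illustrative assumptions, not the original author's code:

```python
import re
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Capture the link target inside [[...]], dropping any |label or #anchor part.
LINK_RE = re.compile(r"\[\[([^\]|#]+)")

@F.udf(returnType=ArrayType(StringType()))
def extract_links(text):
    return [] if text is None else [m.strip() for m in LINK_RE.findall(text)]

# One row per (source page id, linked title).
links = (pages
         .withColumn("linked_title", F.explode(extract_links("text")))
         .select(F.col("id").alias("src"), "linked_title"))

# Resolve linked titles to page ids: this yields the forward-link matrix.
forward = (links
           .join(pages.select(F.col("title").alias("linked_title"),
                              F.col("id").alias("dst")),
                 on="linked_title")
           .select("src", "dst"))

# Swapping the columns gives the backward-link view used by PageRank.
backward = forward.select(F.col("dst").alias("page"), F.col("src").alias("inlink"))
```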

Implementation Details

  • Apache Spark DataFrames will be used to manage the Wikipedia database and all intermediate results.
  • A smaller version of the dataset, approximately 0.01% of the records (around 582 records), is recommended for initial analysis and debugging.
  • A User Defined Function (UDF) should be implemented to parse the text field of each record and extract its outgoing links.
  • The final output should be a Pandas DataFrame with the columns title, id, and pagerank, giving the PageRank value of each Wikipedia page (see the iteration sketch after this list).
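A sketch of the iterative computation and the final conversion to Pandas, reusing the pages and forward DataFrames from the previous snippets; the damping factor, the fixed iteration count, and the handling of pages without in-links are assumptions:

```python
from pyspark.sql import functions as F

N = pages.count()
d = 0.85  # conventional damping factor

# Out-degree of each source page.
out_deg = forward.groupBy("src").agg(F.count("*").alias("deg"))

# Initialize every page with rank 1/N.
ranks = pages.select("id").withColumn("rank", F.lit(1.0 / N))

for _ in range(10):  # fixed iterations; a convergence test could replace this
    contribs = (forward
                .join(out_deg, "src")
                .join(ranks, F.col("src") == F.col("id"))
                .select("dst", (F.col("rank") / F.col("deg")).alias("contrib")))
    ranks = (contribs.groupBy("dst")
             .agg(F.sum("contrib").alias("s"))
             .select(F.col("dst").alias("id"),
                     ((1 - d) / N + d * F.col("s")).alias("rank")))

# Deliverable: a Pandas DataFrame with title, id, and pagerank.
result = (ranks.join(pages.select("title", "id"), "id")
          .select("title", "id", F.col("rank").alias("pagerank"))
          .toPandas())
```

Caching the ranks DataFrame between iterations (or checkpointing the lineage) is worth considering at full scale, since each loop iteration otherwise lengthens the query plan.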

Expected Results

The culmination of the project is a scalable, optimized implementation of the PageRank algorithm, capable of processing large datasets efficiently and accurately.

Evaluation Criteria

The project will be evaluated based on the following criteria:

  1. Code Execution: Your code must run without errors (20%).
  2. Scalability: Your implementation should be scalable, tested with datasets ten times larger on a cluster with 80 execution workers (20%).
  3. Optimization: Code optimization is crucial (20%).
  4. Documentation: Adequate code documentation is required (10%).
  5. Conclusions: Insightful conclusions based on the project findings (30%).
