Project Title

Spark- and HBase-based Apache Access Log Analyzer

Project Description

The Apache log analyzer parses Apache access log files and generates an ordered list of URL-method-count entries according to the following specification:

  • List the duplicate visits found in the Apache access log
  • Separate them by HTTP method
  • Count the duplicates for each URL-method pair
  • Sort them by URL, then by HTTP method
  • Output format: ((REQUEST_URL, REQUEST_METHOD), COUNT)

Output Example:

((/mailman,GET),6)
((/mailman/admin,GET),2)
((/mailman/admin/ppwc,GET),6)
((/mailman/admin/ppwc,POST),6)
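
As a rough illustration of the specification above, the following plain-Python sketch counts URL-method pairs, drops pairs seen only once, sorts by URL and then method, and prints rows in the format shown in the example. It is not the project's Spark code; the function names `duplicate_counts` and `format_row` are hypothetical, and it assumes that "duplicate" means a pair that occurs at least twice.

```python
from collections import Counter


def duplicate_counts(pairs):
    """Return [((url, method), count), ...] for pairs seen more than once.

    Assumption: a "duplicate" is any (url, method) pair that occurs at
    least twice in the input.
    """
    counts = Counter(pairs)
    duplicates = {pair: n for pair, n in counts.items() if n > 1}
    # Keys are (url, method) tuples, so sorting orders by URL, then method.
    return sorted(duplicates.items())


def format_row(pair, count):
    """Render one row in the ((URL,METHOD),COUNT) layout of the example."""
    url, method = pair
    return f"(({url},{method}),{count})"


if __name__ == "__main__":
    # Made-up sample visits, not taken from a real log.
    visits = (
        [("/mailman/admin", "GET")] * 2
        + [("/mailman", "GET")] * 6
        + [("/mailman/private", "GET")]  # seen once, so it is dropped
    )
    for pair, count in duplicate_counts(visits):
        print(format_row(pair, count))
```

Sorting the tuples directly works here because the URL is the first tuple element, which gives the "URL first, then method" ordering for free.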

Input

Place the Apache access log file in input/input.

Working Procedure

Spark

  • Separate the URL and HTTP method from each line of input
  • Create Record<URL:String, Method:String> for each pair
  • Count repetitions for each pair
  • Filter out non-duplicate pairs
  • Sort according to Record object
  • Partition across 5 reducers using hash partitioning on the Record object
  • Save as a text file
  • Prepare output data into format supported by HBase
  • Configure HBase
  • Save into HBase
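
The parsing, partitioning, and HBase-preparation steps above can be sketched in plain Python, without Spark or HBase installed. Everything here is an assumption made for illustration: the regular expression targets the common Apache access log layout with a quoted request such as `"GET /path HTTP/1.0"`, CRC32 stands in for whatever hash the real partitioner uses, and the row key layout and the `stats:count` column name are hypothetical rather than the project's actual schema.

```python
import re
import zlib
from typing import Dict, NamedTuple, Optional, Tuple

# Assumption: the request is the quoted field, e.g. "GET /path HTTP/1.0".
REQUEST_RE = re.compile(r'"([A-Z]+) (\S+)[^"]*"')


class Record(NamedTuple):
    """URL-method pair; url comes first so records sort by URL, then method."""
    url: str
    method: str


def parse_line(line: str) -> Optional[Record]:
    """Extract the URL and HTTP method, or return None if no request is found."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None
    method, url = match.groups()
    return Record(url=url, method=method)


def partition_for(record: Record, num_partitions: int = 5) -> int:
    """Map a record to one of num_partitions buckets with a stable hash.

    CRC32 is a stand-in chosen because Python's built-in hash() for
    strings varies between processes.
    """
    key = f"{record.url} {record.method}".encode("utf-8")
    return zlib.crc32(key) % num_partitions


def to_hbase_cells(record: Record, count: int) -> Tuple[bytes, Dict[bytes, bytes]]:
    """Build a (row key, {column: value}) pair of byte strings.

    The "url|method" row key and the "stats:count" column are hypothetical.
    """
    row_key = f"{record.url}|{record.method}".encode("utf-8")
    return row_key, {b"stats:count": str(count).encode("utf-8")}


if __name__ == "__main__":
    # Made-up sample line in the common access log layout.
    sample = '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /mailman HTTP/1.0" 200 2326'
    record = parse_line(sample)
    if record is not None:
        print(record, partition_for(record), to_hbase_cells(record, 6))
```

Using a deterministic hash means the same URL-method pair always lands in the same partition, which is the property the counting step relies on.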

HBase

  • A simple CRUD for HBase data
  • TODO: Create CRUD for Apache Spark output data

Getting Started

Change to the project source directory and run ./run.sh.

Prerequisites


Author

License

This project is licensed under the MIT License; see the LICENSE.md file for details.