LogProcessing Map-Reduce is a collection of Map-Reduce programs that process log files and
extract information. The project can be configured to work with log files of multiple
formats without any changes to the code base, just by modifying the applications.config
file.
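For example, applications.config might contain entries along these lines (the key names below are illustrative guesses based on the project description, not the project's actual schema):

```hocon
# Illustrative sketch only -- the real applications.config schema may differ.
logprocessing {
  # Regex patterns that jobs can look up by key
  patterns {
    pattern1 = ".*"        # default: match the entire string
    pattern4 = "[\\d]+"    # consecutive digits
  }
  # Bucket size for the interval-based jobs
  timeIntervalMinutes = 10
}
```

Typesafe Config reads such files as HOCON, so jobs can look up patterns at runtime without recompiling.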
- Author: Lakshmanan Meiyappan
- Email: lmeiya2@uic.edu
- Scala 3.0.2
- SBT 1.5.2
- hadoop-core 1.2.1
- slf4j-api 2.0.0
- typesafe config 1.4.1
LogProcessing Map-Reduce comprises four Map-Reduce tasks:
- LogLevel Frequency: computes the count of each log level across all input files.
- Most Error in TimeInterval: finds the time intervals with the most errors, sorted in descending order.
- Longest Substring matching Regex: computes the length of the longest substring that matches a regular expression.
- LogLevel Frequency Distribution in TimeIntervals: computes the distribution of log levels within the specified time intervals.
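Stripped of the Hadoop plumbing, the first task boils down to a classic word-count shape; a minimal plain-Scala sketch (the regex and helper names here are illustrative, not the project's actual code):

```scala
// Map phase: extract the log level from each line, emitting (level, 1).
// Reduce phase: sum the counts per level.
val LogLine = raw".*\b(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\b.*".r

def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
  lines.collect { case LogLine(level) => (level, 1) }

def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
  pairs.groupMapReduce(_._1)(_._2)(_ + _)

val logs = Seq(
  "12:01:05.311 INFO  app - started",
  "12:01:06.002 ERROR app - failed to open file",
  "12:01:07.910 INFO  app - retrying"
)
val counts = reducePhase(mapPhase(logs))
// counts: Map("INFO" -> 2, "ERROR" -> 1)
```

In the real job the two phases run as Hadoop Mapper and Reducer classes, with the framework handling the shuffle between them.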
Users can inject regex patterns through the config file, and the Map-Reduce jobs will search the log files for those patterns and produce results for the requested pattern.
See the How to Run LogProcessing MapReduce section for instructions on how to execute this program.
The documentation for this project is hosted on GitHub Pages: LogProcessing Documentation
Demo and Walk-through Video:
Running LogProcessing Map-Reduce on AWS EMR
A detailed report with the results of executing the Map-Reduce tasks can be found here: LogProcessing MapReduce Report.
git clone https://github.com/laxmena/LogProcessing-MapReduce.git
cd LogProcessing-MapReduce
sbt clean compile
sbt test
sbt assembly
This command generates a jar file at target/scala-3.0.2/LogProcessing-MapReduce-assembly-0.1.jar
For a detailed step-by-step guide on executing LogProcessing-MapReduce jobs on AWS or the Hortonworks Sandbox, refer to this guide: Deploying on AWS/Hortonworks Guide
- Connect to the remote Hadoop master using PuTTY or the command line.
- Transfer the input log files and the JAR file to the remote machine, then copy the input files to an HDFS directory. (See commands 1, 3 and 4 in the Useful commands section below.)
- Run the Hadoop Map-Reduce job by executing the following command:
hadoop jar LogProcessing-MapReduce-assembly-0.1.jar [input-path] [output-path] [job-key] [pattern-key]
- On successful completion of the Map-Reduce task, the results are written to [output-path]. See commands 5 and 6 in the Useful commands section below to read the output.
List of available [job-key] values and their associated Map-Reduce tasks:
job-key | Map-Reduce Task | Supports Regex Search? |
---|---|---|
log-frequency | LogLevel Frequency | ✘ |
most-error | Most Error in TimeInterval | ✔ |
longest-regex | Longest Substring matching Regex | ✔ |
log-freq-dist | LogLevel Distribution in TimeIntervals | ✔ |
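To make the longest-regex task concrete, here is a per-line sketch in plain Scala (the helper name is mine; the actual job aggregates the maximum across all lines and files):

```scala
import scala.util.matching.Regex

// For one log line: the length of the longest non-overlapping match of the
// regex, or 0 when nothing matches.
def longestMatch(line: String, pattern: Regex): Int =
  pattern.findAllIn(line).map(_.length).maxOption.getOrElse(0)

val longestRun = longestMatch("id=1234 retry=56", raw"\d+".r)
// longestRun == 4, from the digit run "1234"
```

In Map-Reduce terms, the mapper emits this per-line maximum and the reducer keeps the global maximum.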
List of pattern-key values available by default:
key | pattern | Description |
---|---|---|
pattern1 | .* | (Default) Matches Entire String |
pattern2 | \([^)\\n]*\) | String enclosed within parentheses |
pattern3 | [^\\s]+ | String without any spaces |
pattern4 | [\d]+ | Consecutive digits |
pattern5 | ([a-c][e-g][0-3]\|[A-Z][5-9][f-w]){5,15} | Either alternative, repeated 5 to 15 times, inclusive |
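These defaults are ordinary Java/Scala regexes and can be sanity-checked in a REPL; a quick sketch, assuming the pattern strings mirror the table above (the exact config values may be escaped differently):

```scala
import scala.util.matching.Regex

// Hypothetical pattern set mirroring the table above.
val patterns: Map[String, Regex] = Map(
  "pattern1" -> ".*".r,
  "pattern2" -> raw"\([^)\n]*\)".r,
  "pattern3" -> raw"[^\s]+".r,
  "pattern4" -> raw"[\d]+".r,
  "pattern5" -> raw"([a-c][e-g][0-3]|[A-Z][5-9][f-w]){5,15}".r
)

val line = "ERROR 2021 (disk full) ae1bf2X7g"
val digits = patterns("pattern4").findFirstIn(line)   // Some("2021")
val parens = patterns("pattern2").findFirstIn(line)   // Some("(disk full)")
```

Any pattern-key passed on the command line selects one of these regexes for the job to search with.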
Different combinations of job-key and pattern-key can be used to execute the Map-Reduce tasks.
Examples:
hadoop jar LogProcessing-MapReduce-assembly-0.1.jar logprocess/input logprocess/longest-regex-1 longest-regex pattern1
hadoop jar LogProcessing-MapReduce-assembly-0.1.jar logprocess/input logprocess/log-freq-dist-3 log-freq-dist pattern3
hadoop jar LogProcessing-MapReduce-assembly-0.1.jar logprocess/input logprocess/logfrequency
hadoop jar LogProcessing-MapReduce-assembly-0.1.jar logprocess/input logprocess/mosterror most-error pattern5
- Transfer file from Local Machine to a Remote machine
scp -P 2222 <path/to/local/file> <username@remote_machine_ip>:<path/to/save/files>
- Transfer directory from Local Machine to Remote machine
scp -P 2222 -r <path/to/local/directory> <username@remote_machine_ip>:<path/to/save/files>
- Create HDFS Directory
hadoop fs -mkdir <directory_name>
- Add Files to HDFS
hadoop fs -put <path/to/files> <hdfs/directory/path>
- Reading Hadoop Map-Reduce Output
hadoop fs -cat <hdfs/output/directory>/*
- Save Hadoop Map-Reduce output to Local file
hadoop fs -cat <hdfs/output/directory>/* > filename.extension
- Running JAR with multiple main classes
hadoop jar <name-of-jar> <full-class-name> <input-hdfs-directory> <output-hdfs-directory>
- List files in HDFS Directory
hdfs dfs -ls <directory/path>
- Remove file or directory in HDFS
hdfs dfs -rm -r <path/to/directory>
hdfs dfs -rm <path/to/file>