A collection of MapReduce programs that provide insight into HTML documents stored on an Apache Hadoop file system maintained by the University of Notre Dame.

johnedquinn/map-reduce-html

map-reduce-html

A collection of MapReduce programs that provide insight into HTML documents stored on the Hadoop Distributed File System (HDFS) cluster maintained by the University of Notre Dame.

Prerequisites

  • Hadoop
  • Python 3

File Descriptions

  • htmltohosts: Read HTML on the standard input, find the A HREF tags, and emit only the hostnames present in those tags, one per line.
  • htmltowords: Read HTML on the standard input, remove extraneous items such as tags and punctuation, and emit only simple lowercase words of three or more characters, one per line.
  • WordCount: Produce a listing of every word that appears in the documents, each with its frequency count, sorted by frequency.
  • Bigrams: Produce a listing of the top ten bigrams (pairs of adjacent words) in the dataset.
  • InvertedIndex: For each word encountered, produce a list of all hosts in which the word occurs.
  • Out-Links: For each host, produce a unique list of hosts that it links to.
  • InLinks: For each host, produce a unique list of hosts that link TO it.
  • NDegrees: Produce a listing of all hosts 1 hop from www.nd.edu. Then, produce a listing for 2 hops, 3 hops, and so forth, until the result converges.
  • run.sh: Helps run the Hadoop commands associated with the above programs. See below for more details.
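
The two preprocessing steps, htmltohosts and htmltowords, can be approximated with the Python standard library. The sketch below is not the repository's implementation: the function names `extract_hosts` and `extract_words` are hypothetical, and it assumes that a "word" is a run of three or more ASCII letters and that relative links (which carry no hostname) are skipped.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class _HrefCollector(HTMLParser):
    """Collect the hostname of every <a href="..."> tag."""

    def __init__(self):
        super().__init__()
        self.hosts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name != "href" or not value:
                continue
            try:
                host = urlparse(value).hostname
            except ValueError:  # malformed URL, e.g. a broken IPv6 literal
                host = None
            if host:  # relative links have no hostname
                self.hosts.append(host)


class _TextCollector(HTMLParser):
    """Collect everything that is not markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_hosts(html):
    """Return the hostnames found in A HREF tags, in document order."""
    parser = _HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hosts


def extract_words(html):
    """Return lowercase words of three or more letters, tags removed."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Assumption: a word is a run of ASCII letters; punctuation and digits split words.
    return re.findall(r"[a-z]{3,}", text)
```

To mimic the one-item-per-line output described above, a small wrapper could read `sys.stdin.read()` and print each returned item on its own line.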

Programs and Outputs

Word Count

To Run

-- Using the provided shell script
$ ./run.sh WordCount

-- Running it on your own
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files WordCountMap.py,WordCountReduce.py \
    -input /users/jquinn13/Words \
    -output /users/jquinn13/WordCount \
    -mapper WordCountMap.py \
    -reducer WordCountReduce.py

Example Output

the 202466
and 195977
for 79271
var 52680
with 31424
this 30190
more 29278
your 27984
you 27554
new 26347
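
For readers unfamiliar with Hadoop Streaming, the word count logic can be sketched as a mapper that emits one tab-separated `word 1` pair per input line and a reducer that sums the counts of consecutive identical keys. This is an illustrative sketch, not the contents of WordCountMap.py or WordCountReduce.py. The function names are hypothetical, the input is assumed to hold one word per line, and the final sort by frequency is done here with a plain in-memory sort because the source does not say how the repository performs it.

```python
from itertools import groupby


def map_words(lines):
    """Mapper: emit 'word<TAB>1' for every non-empty input line."""
    for line in lines:
        word = line.strip()
        if word:
            yield f"{word}\t1"


def reduce_counts(lines):
    """Reducer: sum the counts of consecutive lines that share a key.

    Hadoop Streaming sorts mapper output by key before the reducer
    sees it, so equal words arrive next to each other.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)


def by_frequency(counts):
    """Order (word, count) pairs from most to least frequent."""
    return sorted(counts, key=lambda wc: wc[1], reverse=True)
```

Locally, the shuffle phase can be imitated by sorting the mapper output before passing it to the reducer, for example `by_frequency(reduce_counts(sorted(map_words(lines))))`.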

Bigrams

To Run

-- Using the provided shell script
$ ./run.sh Bigrams

-- Running it on your own
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files BigramsMap.py,BigramsReduce.py \
    -input /users/jquinn13/Words \
    -output /users/jquinn13/Bigrams \
    -mapper BigramsMap.py \
    -reducer BigramsReduce.py

Example Output

var:var 10548
and:the 10281
and:screen 9414
for:the 8596
learn:more 6992
and:and 6668
more:read 6130
solid:solid 6009
all:and 5931
the:university 5189
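
The bigram logic can be sketched locally in a few lines. Every pair in the example output above lists its two words in alphabetical order (for instance `more:read`), so the sketch assumes each pair is normalized that way; this is an inference from the output, not something the source states. The function names are hypothetical, the input is assumed to hold one word per line, and the top-ten selection is done in a single process, which on a cluster would correspond to running a single reducer.

```python
from collections import Counter


def map_bigrams(words):
    """Emit one 'first:second' key for each pair of adjacent words."""
    previous = None
    for raw in words:
        word = raw.strip()
        if not word:
            continue
        if previous is not None:
            # Assumption: the two words are stored in alphabetical order,
            # so "read more" and "more read" count as the same bigram.
            first, second = sorted((previous, word))
            yield f"{first}:{second}"
        previous = word


def top_bigrams(keys, n=10):
    """Return the n most frequent bigram keys with their counts."""
    return Counter(keys).most_common(n)
```

For example, `top_bigrams(map_bigrams(open("words.txt")))` would list the ten most frequent pairs in a local file named `words.txt` (a placeholder name).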

Inverted Index