map-reduce-html

A collection of MapReduce programs that provide insight into HTML documents stored on the HDFS cluster maintained by the University of Notre Dame.

Prerequisites

Hadoop
Python 3

File Descriptions

htmltohosts: Read HTML on the standard input, find the A HREF tags, and emit only the hostnames present in those tags, one per line.
htmltowords: Read HTML on the standard input, remove extraneous items such as tags and punctuation, and emit only simple lowercase words of three or more characters, one per line.
WordCount: Produce a listing of every word that appears across all documents, each with its frequency count, sorted by frequency.
Bigrams: Produce a listing of the top ten bigrams (pairs of adjacent words) in the dataset.
InvertedIndex: For each word encountered, produce a list of all hosts in which the word occurs.
OutLinks: For each host, produce a unique list of hosts that it links to.
InLinks: For each host, produce a unique list of hosts that link to it.
NDegrees: Produce a listing of all hosts 1 hop from www.nd.edu. Then, produce a listing for 2 hops, 3 hops, and so forth, until the result converges.
run.sh: Helps run the Hadoop commands associated with the above programs. See below for more details.
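
The two HTML filters described above can be sketched with the Python standard library. This is a minimal illustration, not the repository's actual code: the class name `LinkAndTextCollector` and the functions `html_to_words` and `html_to_hosts` are hypothetical, and treating a "simple word" as a run of three or more ASCII letters is an assumption.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkAndTextCollector(HTMLParser):
    """Collect text outside of tags and the href values of <a> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> matches.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def _parse(html):
    parser = LinkAndTextCollector()
    parser.feed(html)
    parser.close()
    return parser


def html_to_words(html):
    """Return lowercase words of three or more letters (assumed definition)."""
    text = " ".join(_parse(html).text_parts).lower()
    return re.findall(r"[a-z]{3,}", text)


def html_to_hosts(html):
    """Return the hostname of every absolute link; relative links are skipped."""
    hosts = []
    for href in _parse(html).hrefs:
        try:
            host = urlparse(href).hostname
        except ValueError:  # malformed URL, e.g. a broken IPv6 literal
            continue
        if host:
            hosts.append(host)
    return hosts
```

Used as a streaming filter, either function would be applied to the text read from standard input, with each returned item printed on its own line.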

Programs and Outputs

Word Count

To Run

-- Using the provided shell script
$ ./run.sh WordCount

-- Running it on your own
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files WordCountMap.py,WordCountReduce.py -input /users/jquinn13/Words -output /users/jquinn13/WordCount -mapper WordCountMap.py -reducer WordCountReduce.py

Example Output

the 202466
and 195977
for 79271
var 52680
with 31424
this 30190
more 29278
your 27984
you 27554
new 26347
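
For readers without access to the cluster, the map, sort, and reduce steps of a streaming word count can be imitated locally. The sketch below is an assumption about how such a job is typically written, not a copy of WordCountMap.py or WordCountReduce.py; the function names are hypothetical, and it assumes the input holds one word per line, as htmltowords emits.

```python
from itertools import groupby


def map_words(lines):
    """Mapper: emit 'word<TAB>1' for every non-empty input line."""
    for line in lines:
        word = line.strip()
        if word:
            yield f"{word}\t1"


def reduce_counts(sorted_lines):
    """Reducer: sum the counts of consecutive lines that share a key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


def word_count(lines):
    """Run map, sort, and reduce in-process; return (word, count) by descending count."""
    shuffled = sorted(map_words(lines))  # stands in for Hadoop's shuffle and sort
    counts = reduce_counts(shuffled)
    return sorted(counts, key=lambda item: (-item[1], item[0]))
```

The reducer relies on its input being sorted by key, which Hadoop streaming guarantees between the map and reduce phases; the `sorted` call plays that role here.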

Bigrams

To Run

-- Using the provided shell script
$ ./run.sh Bigrams

-- Running it on your own
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files BigramsMap.py,BigramsReduce.py -input /users/jquinn13/Words -output /users/jquinn13/Bigrams -mapper BigramsMap.py -reducer BigramsReduce.py

Example Output

var:var 10548
and:the 10281
and:screen 9414
for:the 8596
learn:more 6992
and:and 6668
more:read 6130
solid:solid 6009
all:and 5931
the:university 5189
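
The bigram count can be illustrated in a few lines of plain Python. This is a local, in-memory sketch rather than the repository's mapper and reducer, and the function names are hypothetical. Keys such as `more:read` in the output above suggest that the two words of each pair may be stored in alphabetical order; the sketch assumes that normalization, which the README does not confirm.

```python
from collections import Counter


def bigram_keys(words):
    """Yield a 'first:second' key for each pair of adjacent words."""
    words = list(words)
    for left, right in zip(words, words[1:]):
        # Assumption: order within a pair is normalized alphabetically.
        first, second = sorted((left, right))
        yield f"{first}:{second}"


def top_bigrams(words, n=10):
    """Return the n most frequent bigram keys with their counts."""
    return Counter(bigram_keys(words)).most_common(n)
```

For example, `top_bigrams("learn more read more learn more".split())` counts "read more" and "more read" under the single key `more:read`.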
