Name	Name	Last commit message	Last commit date
parent directory ..
mapreduce	mapreduce
spark	spark
sparkwithlambda	sparkwithlambda
util	util
README.md	README.md
anagram.png	anagram.png

What is an Anagram?

Anagram is a word or phrase made by transposing the letters of another word or phrase.
An anagram is a type of word play, the result of rearranging the letters of a word or phrase to produce a new word or phrase, using all the original letters exactly once
For example, the word "anagram" can be rearranged into "nagaram". "Elvis" can be rearranged into "lives", therefore "Elvis" and "lives" are anagrams
Someone who creates anagrams may be called an "anagrammatist"
Here we work on anagrams, which are only single words.

Examples

The word "mary" can be rearranged into "army"
The word "secure" is an anagram of "rescue"
The word "elvis" can be rearranged into "lives"

Programs

The purpose of these MapReduce programs (Hadoop and Spark) is to find anagrams for a set of given documents.

We will ignore words if their length is less than N (we read N as a parameter).

Packages

Package Name	Description
`org.dataalgorithms.chapB05.anagram.spark`	Spark Programs without Lambda Expressions
`org.dataalgorithms.chapB05.anagram.sparkwithlambda`	Spark Programs with Lambda Expressions
`org.dataalgorithms.chapB05.anagram.util`	Groups common methods
`org.dataalgorithms.chapB05.anagram.mapreduce`	MapReduce/Hadoop Programs

Spark Programs without Lambda Expressions

These programs are in the org.dataalgorithms.chap05.anagram.spark package.
These programs do not use lambda expressions.

Spark Programs without Lambda Expressions	Description
Program-1: `AnagramFinder`	Basic Anagram Finder: using Spark's `groupByKey()`
Program-2: `AnagramUsingCombineByKey`	Anagram Finder & Frequency: using Spark's `combineByKey()`
Program-3: `AnagramUsingGroupByKey`	Anagram Finder & Frequency: using Spark's `groupByKey()`

Spark Programs with Lambda Expressions

These programs are in the org.dataalgorithms.chapB05.anagram.sparkwithlambda package.
With the addition of lambda expressions in Java 8, I have updated Spark’s API to transparently support these expressions.

Spark Programs with Lambda Expressions	Description
Program-1: `AnagramFinder`	Basic Anagram Finder: using Spark's `groupByKey()`
Program-2: `AnagramUsingCombineByKey`	Anagram Finder & Frequency: using Spark's `combineByKey()`
Program-3: `AnagramUsingGroupByKey`	Anagram Finder & Frequency: using Spark's `groupByKey()`

MapReduce/Hadoop Program

These programs are in the org.dataalgorithms.chapB05.anagram.mapreduce package.

MapReduce/Hadoop Program	Description
`AnagramDriver`	The driver program, which submits the job
`AnagramMapper`	Implements the `map()` function
`AnagramReducer`	Implements the `reduce()` function

Common Spark Performance Pitfalls

Not all Spark's transformations and operations are equal, however, and a few of the most common performance pitfalls for novice Spark developers arise from picking the wrong one. for details see the following:

How-to: Tune Your Apache Spark Jobs (Part 1) by Sandy Ryza
How-to: Tune Your Apache Spark Jobs (Part 2) by Sandy Ryza
Avoid using groupByKey() when performing an associative reductive operation (such as addition, multiplication, ...). In these situations, reduceByKey() is a better option than groupByKey().
Avoid reduceByKey() When the input and output value types are different. You may use combineByKey().
Avoid the flatMap-join-groupBy pattern. When two datasets are already grouped by key and you want to join them and keep them grouped, you can just use cogroup. That avoids all the overhead associated with unpacking and repacking the groups.

Input Files for Testing

Data Set: http://www.gutenberg.org/dirs/etext02/mword10.zip

Sample Input

# cat sample_anagram.txt 
Mary and Elvis lives in Detroit army Easter Listen 
a silent eaters Death Hated elvis Mary easter Silent
Mary and Elvis are in a army Listen Silent detroit

Sample Script to run AnagramCountUsingCombineByKey

Before running this script, you need to build the data_algorithms_book.jar file.

cat run_anagram_using_combineByKey.sh 
#!/bin/bash
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_72.jdk/Contents/Home/
echo "JAVA_HOME=$JAVA_HOME"
#
export BOOK_HOME=/Users/mparsian/zmp/github/data-algorithms-book
export SPARK_HOME=/Users/mparsian/spark-1.6.1-bin-hadoop2.6
export SPARK_MASTER=spark://localhost:7077
export SPARK_JAR=$BOOK_HOME/lib/spark-assembly-1.6.0-hadoop2.6.0.jar
export APP_JAR=$BOOK_HOME/dist/data_algorithms_book.jar
#
# define input parameters
N=2
INPUT="file://$BOOK_HOME/sample_anagram.txt"
OUTPUT="file://$BOOK_HOME/output"
#
prog=org.dataalgorithms.chapB05.anagram.spark.AnagramCountUsingCombineByKey
$SPARK_HOME/bin/spark-submit  \
    --class $prog \
    --master $SPARK_MASTER \
    $APP_JAR $N $INPUT $OUTPUT

Sample Output for AnagramCountUsingCombineByKey

Each output record has the following format:

<sorted-word><anagrams-and-associated-frequencies>

$ cat output/part-00000 
(adeht,{death=1, hated=1})
(eilnst,{silent=3, listen=2})
(eilsv,{lives=1, elvis=3})
(aeerst,{eaters=1, easter=2})
(amry,{army=2, mary=3})

Sample Script to run AnagramCountUsingGroupByKey

Before running this script, you need to build the data_algorithms_book.jar file.

$ cat run_anagram_using_combineByKey.sh 
#!/bin/bash
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_72.jdk/Contents/Home/
echo "JAVA_HOME=$JAVA_HOME"
#
export BOOK_HOME=/Users/mparsian/zmp/github/data-algorithms-book
export SPARK_HOME=/Users/mparsian/spark-1.6.1-bin-hadoop2.6
export SPARK_MASTER=spark://localhost:7077
export SPARK_JAR=$BOOK_HOME/lib/spark-assembly-1.6.0-hadoop2.6.0.jar
export APP_JAR=$BOOK_HOME/dist/data_algorithms_book.jar
# define input parameters
N=2
INPUT="file://$BOOK_HOME/sample_anagram.txt"
OUTPUT="file://$BOOK_HOME/output"
#
prog=org.dataalgorithms.chapB05.anagram.spark.AnagramCountUsingGroupByKey
$SPARK_HOME/bin/spark-submit  \
    --class $prog \
    --master $SPARK_MASTER \
    $APP_JAR $N $INPUT $OUTPUT

Sample Output for AnagramCountUsingGroupByKey

Each output record has the following format:

<sorted-word><anagrams-and-associated-frequencies>

$ cat output/part-00000 
(adeht,{death=1, hated=1})
(eilnst,{silent=3, listen=2})
(eilsv,{lives=1, elvis=3})
(aeerst,{eaters=1, easter=2})
(amry,{army=2, mary=3})

Sample Script to run AnagramFinder

Before running this script, you need to build the data_algorithms_book.jar file.

$ cat run_anagram_using_combineByKey.sh 
#!/bin/bash
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_72.jdk/Contents/Home/
echo "JAVA_HOME=$JAVA_HOME"
#
export BOOK_HOME=/Users/mparsian/zmp/github/data-algorithms-book
export SPARK_HOME=/Users/mparsian/spark-1.6.1-bin-hadoop2.6
export SPARK_MASTER=spark://localhost:7077
export SPARK_JAR=$BOOK_HOME/lib/spark-assembly-1.6.0-hadoop2.6.0.jar
export APP_JAR=$BOOK_HOME/dist/data_algorithms_book.jar
# define input parameters
N=2
INPUT="file://$BOOK_HOME/sample_anagram.txt"
OUTPUT="file://$BOOK_HOME/output"
#
prog=org.dataalgorithms.chapB05.anagram.spark.AnagramFinder
$SPARK_HOME/bin/spark-submit  \
    --class $prog \
    --master $SPARK_MASTER \
    $APP_JAR $N $INPUT $OUTPUT

Sample Output for AnagramFinder

Each output record has the following format:

<sorted-word><anagrams-and-associated-frequencies>

$ cat output/part-00000 
(adeht,[death, hated])
(eilnst,[silent, listen])
(eilsv,[lives, elvis])
(aeerst,[eaters, easter])
(amry,[army, mary])

References

best regards
Mahmoud Parsian

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

anagram

anagram

README.md

What is an Anagram?

Examples

Programs

Packages

Spark Programs without Lambda Expressions

Spark Programs with Lambda Expressions

MapReduce/Hadoop Program

Common Spark Performance Pitfalls

Input Files for Testing

Sample Input

Sample Script to run AnagramCountUsingCombineByKey

Sample Output for AnagramCountUsingCombineByKey

Sample Script to run AnagramCountUsingGroupByKey

Sample Output for AnagramCountUsingGroupByKey

Sample Script to run AnagramFinder

Sample Output for AnagramFinder

References

Files

anagram

Directory actions

More options

Directory actions

More options

Latest commit

History

anagram

Folders and files

parent directory

README.md

What is an Anagram?

Examples

Programs

Packages

Spark Programs without Lambda Expressions

Spark Programs with Lambda Expressions

MapReduce/Hadoop Program

Common Spark Performance Pitfalls

Input Files for Testing

Sample Input

Sample Script to run AnagramCountUsingCombineByKey

Sample Output for AnagramCountUsingCombineByKey

Sample Script to run AnagramCountUsingGroupByKey

Sample Output for AnagramCountUsingGroupByKey

Sample Script to run AnagramFinder

Sample Output for AnagramFinder

References