Experiments: Running the MapReduce code (Linux)
The following experiments were conducted on Linux Mint 20 Cinnamon with 2 processors, 5.5 GB RAM and 70 GB of storage. Unless mentioned otherwise, the commands in each experiment are agnostic of the platform on which Hadoop was set up. In addition, each of the experiments shown on this page makes the following assumptions:
- The username is burraabhishek
- The present working directory is ~/src. The entire src directory of this repository was cloned to the home directory of the Linux machine.
- The Hadoop version is Hadoop 3.3.0.
- The directories in the Hadoop Distributed File System differ across development environments.
These values differ across development environments; replace them wherever necessary.
NOTE: To run these experiments, a Hadoop Development Environment is required. This guide can help you get started if you do not have one.
Run each of these commands to start HDFS and YARN:
start-dfs.sh
start-yarn.sh
For these experiments, it is recommended to open the Terminal from the present working directory and then run the above commands.
start-dfs.sh: Starts the Distributed File System. This starts the following:
- namenode (on localhost, unless otherwise specified)
- datanodes
- secondary namenodes
start-yarn.sh: Starts Hadoop YARN (Yet Another Resource Negotiator). YARN manages computing resources in clusters. Running this command starts the following:
- resourcemanager
- nodemanagers
To check the status of the Hadoop daemons, type the command jps. jps is the Java Virtual Machine Process Status tool. For example:
$ jps
2560 NodeManager
2706 Jps
2453 ResourceManager
2021 DataNode
2168 SecondaryNameNode
1930 NameNode
Ensure that all five daemons and Jps are available. The numbers on the left are the process IDs and may differ across environments.
Hadoop Streaming runs executable mappers and reducers, but the Python scripts here are not executable by default. For each of the 4 Python files in the directory, add an interpreter directive (shebang) at the beginning, followed by a blank line (the interpreter path differs across platforms).
For example,
#!/usr/bin/python
# Rest of the Python code
The first two bytes, #!, indicate that the Unix/Linux program loader should interpret the rest of the line as the command that launches the interpreter with which the program is executed. For example, #!/usr/bin/python runs the script with the python executable located in /usr/bin.
Then, for each of the files, run the following commands (this marks the files as executable):
chmod +x mapper.py
chmod +x reducer.py
chmod +x nextpass.py
chmod +x reset.py
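If preferred, the shebang and the executable bit can be applied to all four scripts in one go. A minimal sketch, assuming GNU sed (as on Linux Mint) and that the scripts are in the present working directory:
for f in mapper.py reducer.py nextpass.py reset.py; do
    # Prepend the interpreter line plus a blank line only if no shebang is present yet
    if ! head -n 1 "$f" | grep -q '^#!'; then
        sed -i '1i #!/usr/bin/python\n' "$f"
    fi
    # Mark the script as executable
    chmod +x "$f"
done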
You can either use the dataset generator included here or download a dataset available online. (If you choose the latter, please abide by the licensing conditions if any).
Upload the dataset into HDFS.
For example, suppose the dataset is ~/src/csv_dataset.csv, the destination in HDFS is /dataset/ (a directory that does not exist yet), and ~/src is the present working directory where the commands are executed. The following commands create the directory and copy the dataset into HDFS:
hdfs dfs -mkdir /dataset
hdfs dfs -put csv_dataset.csv /dataset/csv_dataset.csv
Note that the source and destination file names need not be the same when using hdfs dfs -put.
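To confirm the upload, the destination directory used above can be listed:
hdfs dfs -ls /dataset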
HDFS can also be accessed using a web browser. With default settings, the URL localhost:9870 opens the HDFS web interface. This URL may differ for different Hadoop configurations.
To browse the file system, go to Utilities, then select 'Browse the file system'.
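The same information is available from the command line as well; for example, to list the root of HDFS:
hdfs dfs -ls /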
Hadoop MapReduce jobs are generally written in Java, so a jar (Java ARchive) file is required to run these jobs. In this case, the jobs are written in Python 3 and the jar file used is hadoop-streaming-x.y.z.jar, where x.y.z is the Hadoop version.
The command to execute the MapReduce code is:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar \
-libjars custom.jar \
-file apriori_settings.json \
-file discarded_items.txt \
-input /dataset/csv_dataset.csv \
-mapper mapper.py \
-reducer reducer.py \
-output /output1 \
-outputformat CustomMultiOutputFormat
Replace:
- mapper.py with the full path of the mapper file
- reducer.py with the full path of the reducer file
Replace if different:
- The version in the jar file name (3.3.0) with your Hadoop version.
- /dataset/csv_dataset.csv with the path of your dataset in HDFS
- /output1 with the location in HDFS where you want to store the output.
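Note that a MapReduce job fails if its output directory already exists in HDFS. If /output1 is left over from an earlier run, remove it first (this deletes its contents):
hdfs dfs -rm -r /output1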
When the MapReduce job completes successfully, the output directory in HDFS (/output1 in this case) contains three entries:
- _SUCCESS
- frequent
- discarded
To see the frequent itemsets: go to frequent and download part-00000.
To see the discarded itemsets: go to discarded and download part-00000.
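The part files can also be inspected directly from the terminal without downloading them, for example:
hdfs dfs -cat /output1/frequent/part-00000
hdfs dfs -cat /output1/discarded/part-00000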
To run the next pass more efficiently, it is recommended to copy the list of discarded itemsets to your present working directory. The steps are as follows:
- Either move or delete any existing part-00000 file in the present working directory. In this case, the file is deleted using:
rm part-00000
- Copy the file from HDFS using:
hdfs dfs -copyToLocal /output1/discarded/part-00000
This command copies /output1/discarded/part-00000 from HDFS to the present working directory.
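Alternatively, the local copy can be overwritten in one step using the -f flag of hdfs dfs -copyToLocal, which makes the move/delete step above unnecessary:
hdfs dfs -copyToLocal -f /output1/discarded/part-00000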
Now configure the next pass by running:
./nextpass.py
Repeat these steps (running the MapReduce code, copying the discarded itemsets, and running nextpass.py) until either of the following (a sketch of a subsequent pass follows this list):
- The desired output is obtained in the frequent itemsets.
- No more frequent itemsets are obtained.
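A minimal sketch of a subsequent pass, assuming the output directory /output2 (any HDFS path that does not exist yet will do, since a job will not write into an existing output directory) and otherwise reusing the command shown earlier:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar \
-libjars custom.jar \
-file apriori_settings.json \
-file discarded_items.txt \
-input /dataset/csv_dataset.csv \
-mapper mapper.py \
-reducer reducer.py \
-output /output2 \
-outputformat CustomMultiOutputFormat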
These two commands stop the YARN service and HDFS respectively:
stop-yarn.sh
stop-dfs.sh