Homework 1

The goal of this homework is for students to gain experience with solving a distributed computational problem using cloud computing technologies. The main textbook group (option 1) will design and implement an instance of the map/reduce computational model using AWS EMR, whereas the alternative textbook group (option 2) will use the CORBA model. You can check your textbook option in the corresponding column of the gradebook on Blackboard.

Grade: 15%

This Git repo contains the homework description that uses an open-source implementation of a network simulator in Scala. Students should invest some time to learn the implementation details of this random graph generation platform, specifically how mutation is largely avoided in the implementation and how recursion is used; however, this is not required for completing this homework.

Preliminaries

As part of this first homework assignment students are going to learn how to create and manage a Git project repository, create an application in Scala, create tests using the widely popular ScalaTest framework, and expand on the provided SBT build and run script for their applications. Your job as a student in this course is to create randomly generated graphs that represent big data, apply specially designed perturbation operators to produce modified graphs, and parse and analyze them to determine the differences between the original and perturbed graphs by processing them in the cloud using distributed big data analysis frameworks.

First things first: if you haven't done so, you must create your account at GitHub, a Git repo management system. Please make sure that you write your name in the README.md of your repo exactly as it is specified on the class roster. Since it is a large class, please use your UIC email address for communications and for signing your projects, and avoid emails from other accounts like funnybunny2000@gmail.com. As always, the class's Teams homework channel is the preferred way to exchange information and ask questions. If you don't receive a response within 12 hours, please contact your TA or me by tagging our names. If you use email, it may be the case that your messages went to the spam folder.

Next, if you haven't done so, you will install IntelliJ with your academic license, the JDK, the Scala runtime, the IntelliJ Scala plugin, and the Simple Build Tool (SBT), and make sure that you can create, compile, and run Java and Scala programs. Please make sure that you can run various Java tools from your chosen JDK between versions 8 and 19.

In this and all subsequent homeworks you will use logging and configuration management frameworks. You will comment your code extensively and supply logging statements at different logging levels (e.g., TRACE, INFO, WARN, ERROR) to record information at salient points in the execution of your programs. All input configuration variables/parameters must be supplied through configuration files -- hardcoding these values in the source code is prohibited and will be punished by taking a large percentage of points from your total grade! You are expected to use Logback and SLF4J for logging and the Typesafe Configuration library for managing configuration files. These and other libraries should be imported into your project using your build.sbt script. These libraries and frameworks are widely used in the industry, so learning them is time well spent to improve your resume. Also, please set up your account with AWS Educate. Using your UIC email address may enable you to receive free credits for running your jobs in the cloud. Preferably, you should create your developer account for $29 per month to enjoy the full range of AWS services. Some students I know created business accounts to receive better options from AWS, and some of them even started companies while taking this course using their AWS accounts and the applications they created and hosted there!
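
For illustration only, here is a minimal sketch of how SLF4J/Logback logging and the Typesafe Configuration library are typically wired together in Scala, assuming the corresponding artifacts (e.g., logback-classic, slf4j-api, and the Typesafe config library) are declared in build.sbt. The configuration key names below are hypothetical; use whatever names your own application.conf defines.

```scala
import com.typesafe.config.{Config, ConfigFactory}
import org.slf4j.{Logger, LoggerFactory}

object AnalysisSettings {
  // SLF4J delegates to Logback automatically when logback-classic is on the classpath
  private val logger: Logger = LoggerFactory.getLogger(getClass)

  // Loads application.conf from src/main/resources by default
  private val config: Config = ConfigFactory.load()

  // Hypothetical keys -- rename them to match your own application.conf
  val similarityThreshold: Double = config.getDouble("graphAnalysis.similarityThreshold")
  val outputPath: String = config.getString("graphAnalysis.outputPath")

  def logSettings(): Unit = {
    logger.info(s"Similarity threshold: $similarityThreshold")
    logger.debug(s"Output path: $outputPath")
  }
}
```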

From the implementation of the network simulator you can see how to use Scala to create a fully functional (not imperative) implementation with subprojects and tests. As you can see from the StackOverflow survey, knowledge of Scala is highly paid and in great demand, and it is expected that you pick it up relatively fast, especially since it is tightly integrated with Java. I recommend using the book Programming in Scala, Fourth and Fifth Editions, by Martin Odersky et al. You can obtain this book using the academic subscription on Safari Books Online. There are many other books and resources available on the Internet to learn Scala. Those who know more about functional programming can use the book Functional Programming in Scala, published in 2023 by Michael Pilquist, Rúnar Bjarnason, and Paul Chiusano.

When creating your map/reduce program code in Scala or your CORBA program code in C++ you should avoid using vars and while/for loops that iterate over collections using induction variables. Instead, you should learn to use collection methods like map, flatMap, foreach, and filter with lambda functions, which make your code linear and easy to understand, as we studied in class. Also, avoid mutable variables that expose the internal states of your modules at all costs. Points will be deducted for having unreasonable vars and induction-variable loops without an explanation of why mutation is needed in your code, unless it is confined to method scopes -- you can always do without it.
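
As a minimal illustration of this style, the following sketch (with hypothetical data and field names) computes per-node statistics from an edge list using collection methods and lambdas instead of mutable accumulators and induction-variable loops.

```scala
object FunctionalStyleExample {
  // Hypothetical data: an edge list with a weight attribute
  final case class Edge(from: Int, to: Int, weight: Double)

  val edges: List[Edge] = List(Edge(1, 2, 0.5), Edge(2, 3, 1.0), Edge(1, 3, 0.25))

  // groupBy + map replaces a mutable Map populated inside a while/for loop
  val outDegrees: Map[Int, Int] =
    edges.groupBy(_.from).map { case (node, outgoing) => node -> outgoing.size }

  // filter + map with lambdas instead of an accumulating var over the collection
  val heavyEdges: List[(Int, Int)] =
    edges.filter(_.weight >= 0.5).map(e => (e.from, e.to))
}
```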

Overview

In this homework, you will create a distributed program for parallel processing of large graphs that are generated using the network graph simulation platform (NetGraphSim). The Readme.md for NetGraphSim contains detailed instructions for cloning, building, and running the program. Depending on the power of your computer you can generate large-scale pairs of graphs with tens of thousands of nodes, where the original graph and its perturbed equivalent differ in some nodes and edges depending on the chosen parameters. The differences are produced in YAML format, which you will use in this homework to determine the precision of your approach to computing differences between the original and the perturbed graph.

Consider various real-world scenarios in which large graphs represent social media interactions, a web farm, or the network topology of some company. In these scenarios changes happen asynchronously: a user or a computing node may leave and rejoin the network as a node with some parameters different from the previous one and with some edges, i.e., connections, removed or added. Detecting these changes and relating them to the previously recorded graph is important for solving many problems, e.g., bot detection and removal in social media applications.

Of course, exact graph matching is computationally hard (no polynomial-time algorithm is known for graph isomorphism, and subgraph isomorphism is NP-complete); however, an approximate solution may be very useful in many real-world scenarios. Consider the following approximation algorithm for graph comparison, where all pairs of nodes are compared along with the edges incident to these nodes, and a matching score is computed using a formula of your choice. In general, it can be a linear combination of terms where the term coefficients represent the importance of certain node/edge parameters and the variables represent the values of these parameters. The result is not only a general match indicator value for the graph but also the determination, for each pair of nodes and edges, of how closely they match. This kind of computation is expensive, and your job is to accomplish it using a distributed cloud-based framework within a short period of time and within a given cloud resource budget.
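
To make the linear-combination idea concrete, here is a small sketch of such a matching score in Scala. The node parameters, field names, and coefficient values are hypothetical placeholders; substitute the properties that your NetGraphSim-generated graphs actually carry and fine-tune the weights as part of your own algorithm.

```scala
object NodeSimilarity {
  // Hypothetical node parameters; replace with the properties your generated graphs expose
  final case class NodeProps(children: Int, depth: Int, storedValue: Double)

  // Per-parameter closeness in [0, 1]; 1.0 means the values are identical
  private def closeness(x: Double, y: Double): Double = {
    val scale = math.max(math.abs(x), math.abs(y))
    if (scale == 0.0) 1.0 else 1.0 - math.abs(x - y) / scale
  }

  // A linear combination of terms: the coefficients reflect how important each
  // parameter is to your matching algorithm and are yours to fine-tune.
  def similarity(a: NodeProps, b: NodeProps): Double =
    0.4 * closeness(a.children.toDouble, b.children.toDouble) +
      0.3 * closeness(a.depth.toDouble, b.depth.toDouble) +
      0.3 * closeness(a.storedValue, b.storedValue)
}
```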

This and all future homework scripts are written using a retroscripting technique, in which the homework outlines are drawn generally and loosely, and the individual students improvise to create the implementation that fits their refined objectives. In doing so, students are expected to stay within the basic requirements of the homework, and they are free to experiment. Asking questions is important, so please ask away on MS Teams!

Functionality

Your homework assignment is to create a program for parallel distributed processing of large graphs. Your goal is to produce matching statistics about the generated graphs and their perturbed counterparts. First, you will compute a YAML or a CSV file that shows the distribution of nodes and edges by their similarity scores, which are computed based on your formula. Second, for each node and edge you will compute the likelihood that the node or edge was modified/added/removed in the perturbed graph or that it remained the same. Please note that a modified node may be viewed as an instance of the previous node deleted and a new node added with parameters similar to those of the deleted node. This determination will depend on your own uniquely fine-tuned graph matching algorithm. Then, for each detected modification you will compare it with the golden set of changes produced by NetGraphSim as a result of creating the perturbed graph. Finally, you will produce an estimate of the goodness of your algorithm, similar to precision and recall ratios, based on your experiments.
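
As a hypothetical sketch of the first required output, the snippet below buckets similarity scores of node (or edge) pairs into a histogram and serializes it as CSV rows; the bucket width, column names, and choice of CSV over YAML are illustrative assumptions, not requirements.

```scala
object SimilarityDistribution {
  // Produces a header row followed by one CSV row per similarity bucket, e.g. "0.8,0.9,42"
  def histogramCsv(scores: Seq[Double], buckets: Int = 10): Seq[String] = {
    val header = "bucketLow,bucketHigh,count"
    val rows = (0 until buckets).map { i =>
      val low = i.toDouble / buckets
      val high = (i + 1).toDouble / buckets
      // the last bucket is closed on the right so that a score of exactly 1.0 is counted
      val count = scores.count(s => s >= low && (s < high || (i == buckets - 1 && s <= high)))
      f"$low%.1f,$high%.1f,$count%d"
    }
    header +: rows
  }
}
```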

Your algorithm issues predictions for matching graph components, i.e., nodes and edges, some of which may be incorrect. A match between a graph component in the original and perturbed graphs is called a traceability link (TL). Thus, the total number of recovered traceability links (RTL) is the sum of Good Traceability Links (GTL) and Bad Traceability Links (BTL), RTL = GTL + BTL. BTL is the sum of CTL, which is the number of correct TLs that are mistakenly discarded by your algorithm, and WTL, which is the number of wrong TLs that your algorithm accepts, BTL = CTL + WTL. GTL is the sum of DTL, which is the number of correctly discarded TLs, and ATL, which is the number of correctly accepted TLs, GTL = DTL + ATL. The quality of your algorithm is thus measured using three ratios: ACC, VPR, and BTLR, where the accuracy ratio is computed as ACC = ATL/RTL, BTLR = WTL/RTL, and the algorithm's precision ratio is computed as VPR = (GTL - BTL)/(2×RTL) + 0.5. The ACC variable is the ratio of correctly recovered TLs, and the VPR ratio shows how mistaken your algorithm is when analyzing recovered TLs. Constants are used in the formula for VPR in order to normalize its values, VPR ∈ [0, 1], where VPR = 0 means that all recovered TLs are incorrect, and VPR = 1 means that they are all correct. The idea behind computing the precision VPR is to evaluate the difference between good and bad TLs, i.e., GTL and BTL. If all recovered TLs are good, i.e., GTL = RTL and BTL = 0, then VPR = 1. If all recovered TLs are bad, i.e., BTL = RTL and GTL = 0, then VPR = 0. The difference between the ACC and VPR measures is that ACC is used to evaluate the performance of your algorithm, while VPR shows the combined performance of your similarity formula and the resulting matchmaking part of your algorithm. The variable ACC is analogous to the recall parameter in information retrieval, which is the ratio of the number of relevant documents retrieved to the total number of documents.
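
The ratios above translate directly into code. Here is a minimal sketch, assuming you have already counted ATL, DTL, CTL, and WTL by comparing your predictions against the NetGraphSim golden set; the case class name and field names are illustrative.

```scala
// Counts of accepted/discarded traceability links, correct and incorrect
final case class TraceabilityCounts(atl: Long, dtl: Long, ctl: Long, wtl: Long) {
  val gtl: Long = atl + dtl // good TLs: correctly accepted + correctly discarded
  val btl: Long = ctl + wtl // bad TLs: mistakenly discarded + wrongly accepted
  val rtl: Long = gtl + btl // all recovered TLs

  def acc: Double  = atl.toDouble / rtl                       // ACC  = ATL / RTL
  def btlr: Double = wtl.toDouble / rtl                       // BTLR = WTL / RTL
  def vpr: Double  = (gtl - btl).toDouble / (2.0 * rtl) + 0.5 // VPR = (GTL - BTL)/(2*RTL) + 0.5
}
```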

Assignment for the main textbook group

Your job is to create the mapper and the reducer for each task, explain how they work, and then implement them and run them on the big data graphs that you will generate using your predefined configuration parameters for NetGraphSim. The output of your map/reduce is a YAML or a CSV file with the required statistics. The explanation of the map/reduce model is given in the main textbook and covered in the class lectures.
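
To make the mapper/reducer structure concrete, here is a minimal, hypothetical sketch of a Hadoop map/reduce pair in Scala: the mapper parses one line describing a candidate pair of graph components and emits a similarity score keyed by a component id, and the reducer keeps the best score per component. The record format, key choice, and placeholder scoring are assumptions you will replace with your own sharding scheme and matching formula, and the pair still needs a driver/job configuration to run on Hadoop or EMR.

```scala
import org.apache.hadoop.io.{DoubleWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import scala.jdk.CollectionConverters._ // use scala.collection.JavaConverters on Scala 2.12

// Hypothetical input record: "<componentId>;<originalProps>;<perturbedProps>", i.e., one
// candidate pair of graph components per line of the shard assigned to this mapper.
class SimilarityMapper extends Mapper[LongWritable, Text, Text, DoubleWritable] {
  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, DoubleWritable]#Context): Unit =
    value.toString.split(";") match {
      case Array(componentId, original, perturbed) =>
        // placeholder scoring -- substitute your own similarity formula here
        val score = if (original == perturbed) 1.0 else 0.0
        context.write(new Text(componentId), new DoubleWritable(score))
      case _ => // silently skip malformed records in this sketch
    }
}

// The reducer sees all scores emitted for one component and keeps the best match.
class SimilarityReducer extends Reducer[Text, DoubleWritable, Text, DoubleWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[DoubleWritable],
                      context: Reducer[Text, DoubleWritable, Text, DoubleWritable]#Context): Unit = {
    val scores = values.asScala.map(_.get()).toList
    if (scores.nonEmpty) context.write(key, new DoubleWritable(scores.max))
  }
}
```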

You will create and run your software application using Apache Hadoop, a framework for distributed processing of large data sets across multiple computers (or even on a single node) using the map/reduce model. Next, after creating and testing your map/reduce program locally, you will deploy it and run it on Amazon Elastic MapReduce (EMR) -- you can find plenty of documentation online. You will produce a short movie that documents all steps of the deployment and execution of your program with your narration, upload this movie to YouTube, and submit a link to your movie as part of your submission in the README.md file. To produce a movie, you may use an academic version of Camtasia or Zoom, some other cheap/free screen capture technology from the UIC webstore, or an application for movie capture of your choice. The captured web browser content should show your login name in the upper right corner of the AWS application, and you should introduce yourself at the beginning of the movie speaking into the camera. The display of your passwords and your credit card numbers should be avoided :-).

Assignment for the alternative textbook group

Your job is to create the distributed objects using the omniORB CORBA framework for each task, explain how they work, and then implement them and run them on the generated graph datasets. The output of your distributed system is a YAML or a CSV file with the required statistics. The explanation of CORBA is given in Chapter 7 of the alternative textbook, Guide to Reliable Distributed Systems: Building High-Assurance Applications and Cloud-Hosted Services (2012 edition) by Kenneth P. Birman. You can complete your implementation using C++ or Python.

Next, after creating and testing your program locally, you will deploy it and run it on AWS EC2 IaaS. You will produce a short movie that documents all steps of the deployment and execution of your program with your narration, upload this movie to YouTube, and submit a link to your movie as part of your submission in the README.md file. To produce a movie, you may use an academic version of Camtasia, some other cheap/free screen capture technology from the UIC webstore, or an application for movie capture of your choice. The captured web browser content should show your login name in the upper right corner of the AWS application, and you should introduce yourself at the beginning of the movie speaking into the camera.

Baseline Submission

Your baseline project submission should include your implementation, a conceptual explanation (in a document or in the comments in the source code) of how your mapper and reducer work to solve the problem for the Option 1 group or how your CORBA distributed objects work for the Option 2 group, and the documentation that describes the build and runtime process, to be considered for grading. You should use markdown for your project's README.md. Your project submission should include all your source code as well as non-code artifacts (e.g., configuration files), your project should be buildable using SBT, and your documentation must specify how you partitioned the data and what the inputs/outputs are.

Collaboration

You can post questions and replies, statements, comments, discussion, etc. on Teams using the corresponding channel. For this homework, feel free to share your ideas, mistakes, code fragments, commands from scripts, and some of your technical solutions with the rest of the class, and you can ask and advise others using Teams on where resources and sample programs can be found on the Internet and how to resolve dependencies and configuration issues. When posting questions and answers on Teams, please make sure that you select the appropriate channel to ensure that all discussion threads can be easily located. Active participants and problem solvers will receive bonuses from the big brother :-) who is watching your exchanges (i.e., your class instructor and your TA). However, you must not describe your mappers/reducers or the CORBA architecture or other specific details related to how you construct your models!

Git logistics

This is an individual homework. Please remember to grant read access to your repository to your TA and your instructor. You can commit and push your code as many times as you want. Your code will not be visible, and it should not be visible, to other students -- your repository should be private. Announcing a link to your public repo for this homework or inviting other students to join your fork for an individual homework before the submission deadline will result in losing your grade. For grading, only the latest commit timed before the deadline will be considered. If your first commit is pushed after the deadline, your grade for the homework will be zero. For those of you who struggle with Git, I recommend Ry's Git Tutorial by Ryan Hodson. Another book, Pro Git, written by Scott Chacon and Ben Straub and published by Apress, is freely available. There are multiple videos on YouTube that go into the details of Git organization and use.

Please follow this naming convention to designate your authorship when submitting your work in README.md: "Firstname Lastname" without quotes, where you specify your first and last names exactly as you are registered with the University system, as well as your UIC.EDU email address, so that we can easily recognize your submission. I repeat, make sure that you give both your TA and the course instructor read/write access to your private forked repository so that we can leave the file feedback.txt in the root of your repo with the explanation of the grade assigned to your homework.

Discussions and submission

As mentioned above, you can post questions and replies, statements, comments, discussion, etc. on Teams. Remember that you cannot share your code and your solutions privately, but you can ask and advise others using Teams, StackOverflow, or some other developer networks about where resources and sample programs can be found on the Internet and how to resolve dependencies and configuration issues. Still, your implementation should be your own and you cannot share it. Likewise, you cannot copy and paste someone else's implementation and put your name on it. Your submissions will be checked for plagiarism. Copying code from your classmates or from some sites on the Internet will result in severe academic penalties, up to the termination of your enrollment in the University.

Submission deadline and logistics

Sunday, October 1, 2023, at 10PM CST, by submitting the link to your homework repo in the Teams Assignments channel. Your submission repo will include the code for the program; your documentation with instructions and detailed explanations on how to assemble and deploy your program along with the results of your program execution; the link to the video; and a document that explains these results based on the characteristics and the configuration parameters of your graph generator, and what the limitations of your implementation are. Again, do not forget: please make sure that you give both your TAs and your instructor read access to your private repository. Your code should compile and run from the command line using the commands sbt clean compile test and sbt clean compile run. Also, your project should be IntelliJ friendly, i.e., your graders should be able to import your code into IntelliJ and run it from there. Use .gitignore to exclude files that should not be pushed into the repo.

Evaluation criteria

  • the maximum grade for this homework is 15%. Points are subtracted from this maximum grade: for example, saying that 2% is lost if some requirement is not completed means that the resulting grade will be 15%-2% => 13%; if the core homework functionality does not work or it is not implemented as specified in your documentation, your grade will be zero;
  • only some basic map/reduce or CORBA examples from some repos are given and nothing else is done: zero grade;
  • using Java instead of Scala: 5% penalty;
  • using unique generated node IDs for SimRank matches: 7% penalty;
  • homework submissions for an incorrectly chosen textbook assignment option will be desk-rejected with the grade zero;
  • having fewer than five unit and/or integration ScalaTest tests: up to 10% lost;
  • missing comments and explanations from your program with clarifications of your design rationale: up to 10% lost;
  • logging is not used in your programs: up to 5% lost;
  • hardcoding the input values in the source code instead of using the suggested configuration libraries: up to 5% lost;
  • for each used var for heap-based shared variables or mutable collections: 0.3% lost;
  • for each used while or for or other loops with induction variables to iterate over a collection: 0.5% lost;
  • no instructions in README.md on how to install and run your program: up to 10% lost;
  • the program crashes without completing the core functionality: up to 15% lost;
  • the documentation exists but it is insufficient to understand your program design and models and how you assembled and deployed all components of your mappers and reducers: up to 15% lost;
  • the minimum grade for this homework cannot be less than zero.

That's it, folks!