Final report
karlnyr committed Apr 26, 2019
1 parent ff9d950 commit 150798b
Showing 5 changed files with 1 addition and 123 deletions.
2 changes: 1 addition & 1 deletion A1_Karl_Nyren.md
@@ -85,4 +85,4 @@ but I count that every retweet regardless of added text is a non-unique tweet.

1. The initial language barrier is lower with Hive, although more serious queries do get more complicated. The declarative nature of Hive is attractive when you want a solution for repetitive queries, since Hive works through indexing and thus builds up speed over time. I unfortunately do not have timing statistics for the jobs in this example, but from experience the Hadoop/MapReduce approach was significantly faster than Hive for this particular task. This is probably due to the structure of the data we are working with, since HiveQL is optimized for structured data. One difficulty with Hadoop/MapReduce is that you really need to stick to its mode of thinking: you always want a tuple with a key and a value. With Hive, on the other hand, you can break free from that way of thinking and focus directly on the query you want to perform, which really shows the perks of having a higher-tier interface on top of Hadoop. What you give up with Hive is the in-depth optimization that can be done in the individual search steps; the number of ways a MapReduce job can be composed is near endless. I strongly believe that starting with Hadoop/MapReduce is a better way to learn how the framework is set up, but for commercial applications Hive shines because queries require less problem solving. Hive is also a better fit if you know that the format of your data will remain constant, so you never need to change queries, and user interaction stays seamless with the previous work process. One could argue that this is true for Hadoop/MapReduce too, but if you want to change the query even slightly, a whole new mapper and reducer must be written (see the word-count sketch after this list), whereas Hive simply requires a new query. In terms of setup, both services had their own issues, but in the end Hive was an odd user experience because of the database it sets up; without being careful, one can clutter one's folders with Hive metastores.
2. Pig is a script-based tool built on the language Pig Latin, which resembles Hive in its simplicity; one could say it is even simpler. Pig differs from Hive in that you can either interact with it directly, as in Hive, or submit batch scripts to perform tasks. It uses MapReduce under the hood but can run on Spark or Tez if that suits the task better. It is a bit closer to Hadoop/MapReduce in that you can transform the data inside the script and then produce a different output. Sure, one could write very intricate nested queries in Hive, but they become messy quite quickly. One might argue that Pig needs to load the data for every query, which could be troublesome in batch mode. However, Pig can use Spark's dynamic allocation, enabling it to run multiple processes at the same time and reallocate resources between scripts accordingly.
3. I believe that NoSQL solutions could be efficient if we are limited by main memory. Since the data we want to analyze is semi-structured, NoSQL should have no issues processing it, and since the data can be split into many smaller chunks, we would not have to worry about loading it all into main memory when searching through the tweets (see the document-store sketch after this list).
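
To make the point in item 1 concrete, here is a minimal sketch of what a Hadoop Streaming word-count job might look like in Python. This is not the code used for the assignment; the tab-separated key/value format is Hadoop Streaming's convention, and the file layout is an assumption for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical Hadoop Streaming word count: run as 'mapper' or 'reducer'."""
import sys

def mapper():
    # Emit one "word<TAB>1" pair per token read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so equal words arrive adjacently.
    current, count = None, 0
    for line in sys.stdin:
        word, _, n = line.rstrip("\n").partition("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["mapper"] else reducer()
```

Even for this trivial task, changing the question (say, counting only hashtags instead of all words) means editing the mapper and resubmitting the job, whereas in Hive the same change would be one edit to the query.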
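For item 3, a minimal sketch of how a document store could scan the tweets without holding the data set in main memory, assuming a MongoDB instance accessed via pymongo. The database and collection names and the tweet field layout are assumptions, not the assignment's actual setup; `retweeted_status` follows the Twitter API's convention for marking retweets.

```python
from pymongo import MongoClient

# Hypothetical local MongoDB instance holding one document per tweet.
client = MongoClient("mongodb://localhost:27017")
tweets = client["twitter"]["tweets"]  # assumed database/collection names

# Count tweets that are not retweets; the server does the work, nothing
# is loaded client-side.
unique = tweets.count_documents({"retweeted_status": {"$exists": False}})
print(f"tweets that are not retweets: {unique}")

# find() returns a cursor that fetches documents in batches, so only a
# small chunk of the collection is resident in main memory at any time.
for doc in tweets.find({}, {"text": 1, "_id": 0}).batch_size(1000):
    pass  # process one tweet at a time here
```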
53 changes: 0 additions & 53 deletions Data/part-00000

This file was deleted.

8 changes: 0 additions & 8 deletions Data/tweet_data_freq

This file was deleted.

31 changes: 0 additions & 31 deletions Scripts/plot_from_wordcount.py

This file was deleted.

30 changes: 0 additions & 30 deletions Scripts/plotter.py

This file was deleted.
