diff --git a/nlp_apps.ipynb b/nlp_apps.ipynb
index 089a50c26..458c55700 100644
--- a/nlp_apps.ipynb
+++ b/nlp_apps.ipynb
@@ -24,7 +24,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "## LANGUAGE RECOGNITION\n",
+ "# LANGUAGE RECOGNITION\n",
 "\n",
 "A very useful application of text models (you can read more on them on the [`text notebook`](https://github.com/aimacode/aima-python/blob/master/text.ipynb)) is categorizing text into a language. In fact, with enough data we can categorize correctly mostly any text. That is because different languages have certain characteristics that set them apart. For example, in German it is very usual for 'c' to be followed by 'h' while in English we see 't' followed by 'h' a lot.\n",
 "\n",
@@ -37,8 +37,10 @@
 },
 {
 "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
+ "execution_count": 2,
+ "metadata": {
+ "collapsed": true
+ },
 "outputs": [],
 "source": [
 "from utils import open_data\n",
@@ -66,8 +68,10 @@
 },
 {
 "cell_type": "code",
- "execution_count": 2,
- "metadata": {},
+ "execution_count": 3,
+ "metadata": {
+ "collapsed": true
+ },
 "outputs": [],
 "source": [
 "from learning import NaiveBayesLearner\n",
@@ -88,8 +92,10 @@
 },
 {
 "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
+ "execution_count": 4,
+ "metadata": {
+ "collapsed": true
+ },
 "outputs": [],
 "source": [
 "def recognize(sentence, nBS, n):\n",
@@ -116,7 +122,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 4,
+ "execution_count": 5,
 "metadata": {},
 "outputs": [
 {
@@ -132,7 +138,7 @@
 "'German'"
 ]
 },
- "execution_count": 4,
+ "execution_count": 5,
 "metadata": {},
 "output_type": "execute_result"
 }
@@ -143,7 +149,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 5,
+ "execution_count": 6,
 "metadata": {},
 "outputs": [
 {
@@ -159,7 +165,7 @@
 "'English'"
 ]
 },
- "execution_count": 5,
+ "execution_count": 6,
 "metadata": {},
 "output_type": "execute_result"
 }
@@ -170,7 +176,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 6,
+ "execution_count": 7,
 "metadata": {},
 "outputs": [
 {
@@ -186,7 +192,7 @@
 "'German'"
 ]
 },
- "execution_count": 6,
+ "execution_count": 7,
 "metadata": {},
 "output_type": "execute_result"
 }
@@ -197,7 +203,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 7,
+ "execution_count": 8,
 "metadata": {},
 "outputs": [
 {
@@ -213,7 +219,7 @@
 "'English'"
 ]
 },
- "execution_count": 7,
+ "execution_count": 8,
 "metadata": {},
 "output_type": "execute_result"
 }
@@ -248,8 +254,10 @@
 },
 {
 "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
+ "execution_count": 1,
+ "metadata": {
+ "collapsed": true
+ },
 "outputs": [],
 "source": [
 "from utils import open_data\n",
@@ -277,8 +285,10 @@
 },
 {
 "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
+ "execution_count": 2,
+ "metadata": {
+ "collapsed": true
+ },
 "outputs": [],
 "source": [
 "from learning import NaiveBayesLearner\n",
@@ -297,8 +307,10 @@
 },
 {
 "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
+ "execution_count": 3,
+ "metadata": {
+ "collapsed": true
+ },
 "outputs": [],
 "source": [
 "def recognize(sentence, nBS):\n",
@@ -317,7 +329,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 11,
+ "execution_count": 4,
 "metadata": {},
 "outputs": [
 {
@@ -326,7 +338,7 @@
 "'Abbott'"
 ]
 },
- "execution_count": 11,
+ "execution_count": 4,
 "metadata": {},
 "output_type": "execute_result"
 }
@@ -346,7 +358,7 @@
 },
 {
 "cell_type": "code",
- "execution_count": 12,
+ "execution_count": 5,
 "metadata": {},
 "outputs": [
 {
@@ -355,7 +367,7 @@
 "'Austen'"
 ]
 },
- "execution_count": 12,
"execution_count": 5, "metadata": {}, "output_type": "execute_result" } @@ -391,7 +403,9 @@ { "cell_type": "code", "execution_count": 1, - "metadata": {}, + "metadata": { + "collapsed": true + }, "outputs": [], "source": [ "from utils import open_data\n", @@ -437,7 +451,9 @@ { "cell_type": "code", "execution_count": 3, - "metadata": {}, + "metadata": { + "collapsed": true + }, "outputs": [], "source": [ "wordseq = words(federalist)\n", @@ -485,7 +501,9 @@ { "cell_type": "code", "execution_count": 5, - "metadata": {}, + "metadata": { + "collapsed": true + }, "outputs": [], "source": [ "wordseq = [w for w in wordseq if w != 'publius']" @@ -551,7 +569,9 @@ { "cell_type": "code", "execution_count": 7, - "metadata": {}, + "metadata": { + "collapsed": true + }, "outputs": [], "source": [ "hamilton = ''.join(hamilton)\n", @@ -571,19 +591,27 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now it is time to build our new Naive Bayes Learner. It is very similar to the one found in `learning.py`, but with an important difference: it doesn't classify an example, but instead returns the probability of the example belonging to each class. This will allow us to not only see to whom a paper belongs to, but also the probability of authorship as well.\n", + "Now it is time to build our new Naive Bayes Learner. It is very similar to the one found in `learning.py`, but with an important difference: it doesn't classify an example, but instead returns the probability of the example belonging to each class. This will allow us to not only see to whom a paper belongs to, but also the probability of authorship as well. \n", + "We will build two versions of Learners, one will multiply probabilities as is and other will add the logarithms of them.\n", "\n", - "Finally, since we are dealing with long text and the string of probability multiplications is long, we will end up with the results being rounded to 0 due to floating point underflow. To work around this problem we will use the built-in Python library `decimal`, which allows as to set decimal precision to much larger than normal." + "Finally, since we are dealing with long text and the string of probability multiplications is long, we will end up with the results being rounded to 0 due to floating point underflow. 
+ "\n",
+ "Note that the logarithmic learner will compute a negative likelihood, since the logarithm of a value less than 1 is negative.\n",
+ "Thus, the author whose normalized score has the smallest magnitude, i.e. is closest to zero, is the most likely author of that paper.\n",
+ "\n"
 ]
 },
 {
 "cell_type": "code",
- "execution_count": 8,
- "metadata": {},
+ "execution_count": 16,
+ "metadata": {
+ "collapsed": true
+ },
 "outputs": [],
 "source": [
 "import random\n",
 "import decimal\n",
+ "import math\n",
 "from decimal import Decimal\n",
 "\n",
 "decimal.getcontext().prec = 100\n",
@@ -594,6 +622,11 @@
 "        result *= Decimal(x)\n",
 "    return result\n",
 "\n",
+ "def log_product(numbers):\n",
+ "    result = 0.0\n",
+ "    for x in numbers:\n",
+ "        result += math.log(x)\n",
+ "    return result\n",
 "\n",
 "def NaiveBayesLearner(dist):\n",
 "    \"\"\"A simple naive bayes classifier that takes as input a dictionary of\n",
@@ -617,7 +650,32 @@
 "\n",
 "        return pred\n",
 "\n",
- "    return predict"
+ "    return predict\n",
+ "\n",
+ "def NaiveBayesLearnerLog(dist):\n",
+ "    \"\"\"A simple naive bayes classifier that takes as input a dictionary of\n",
+ "    Counter distributions and can then be used to find the probability\n",
+ "    of a given item belonging to each class. It will compute the likelihood by adding the logarithms of the probabilities.\n",
+ "    The input dictionary is in the following form:\n",
+ "    ClassName: Counter\"\"\"\n",
+ "    attr_dist = {c_name: count_prob for c_name, count_prob in dist.items()}\n",
+ "\n",
+ "    def predict(example):\n",
+ "        \"\"\"Predict the probabilities for each class.\"\"\"\n",
+ "        def class_prob(target, e):\n",
+ "            attr = attr_dist[target]\n",
+ "            return log_product([attr[a] for a in e])\n",
+ "\n",
+ "        pred = {t: class_prob(t, example) for t in dist.keys()}\n",
+ "\n",
+ "        total = -sum(pred.values())\n",
+ "        for k, v in pred.items():\n",
+ "            pred[k] = v/total\n",
+ "\n",
+ "        return pred\n",
+ "\n",
+ "    return predict\n",
+ "\n"
 ]
 },
 {
@@ -629,12 +687,15 @@
 },
 {
 "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
+ "execution_count": 17,
+ "metadata": {
+ "collapsed": true
+ },
 "outputs": [],
 "source": [
 "dist = {('Madison', 1): P_madison, ('Hamilton', 1): P_hamilton, ('Jay', 1): P_jay}\n",
- "nBS = NaiveBayesLearner(dist)"
+ "nBS = NaiveBayesLearner(dist)\n",
+ "nBSL = NaiveBayesLearnerLog(dist)"
 ]
 },
 {
@@ -646,8 +707,10 @@
 },
 {
 "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
+ "execution_count": 18,
+ "metadata": {
+ "collapsed": true
+ },
 "outputs": [],
 "source": [
 "def recognize(sentence, nBS):\n",
@@ -663,45 +726,84 @@
 },
 {
 "cell_type": "code",
- "execution_count": 11,
+ "execution_count": 19,
 "metadata": {},
 "outputs": [
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
- "Paper No. 49: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
- "Paper No. 50: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
- "Paper No. 51: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
- "Paper No. 52: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
- "Paper No. 53: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
- "Paper No. 54: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
- "Paper No. 55: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
- "Paper No. 56: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
- "Paper No. 57: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
- "Paper No. 58: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
- "Paper No. 18: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
- "Paper No. 19: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
- "Paper No. 20: Hamilton: 0.00 Madison: 1.00 Jay: 0.00\n",
- "Paper No. 64: Hamilton: 1.00 Madison: 0.00 Jay: 0.00\n"
+ "\n",
+ "Straightforward Naive Bayes Learner\n",
+ "\n",
+ "Paper No. 49: Hamilton: 0.0000 Madison: 1.0000 Jay: 0.0000\n",
+ "Paper No. 50: Hamilton: 0.0000 Madison: 1.0000 Jay: 0.0000\n",
+ "Paper No. 51: Hamilton: 0.0000 Madison: 1.0000 Jay: 0.0000\n",
+ "Paper No. 52: Hamilton: 0.0000 Madison: 1.0000 Jay: 0.0000\n",
+ "Paper No. 53: Hamilton: 0.0000 Madison: 1.0000 Jay: 0.0000\n",
+ "Paper No. 54: Hamilton: 0.0000 Madison: 1.0000 Jay: 0.0000\n",
+ "Paper No. 55: Hamilton: 0.0000 Madison: 1.0000 Jay: 0.0000\n",
+ "Paper No. 56: Hamilton: 0.0000 Madison: 1.0000 Jay: 0.0000\n",
+ "Paper No. 57: Hamilton: 0.0000 Madison: 1.0000 Jay: 0.0000\n",
+ "Paper No. 58: Hamilton: 0.0000 Madison: 1.0000 Jay: 0.0000\n",
+ "Paper No. 18: Hamilton: 0.0000 Madison: 1.0000 Jay: 0.0000\n",
+ "Paper No. 19: Hamilton: 0.0000 Madison: 1.0000 Jay: 0.0000\n",
+ "Paper No. 20: Hamilton: 0.0000 Madison: 1.0000 Jay: 0.0000\n",
+ "Paper No. 64: Hamilton: 1.0000 Madison: 0.0000 Jay: 0.0000\n",
+ "\n",
+ "Logarithmic Naive Bayes Learner\n",
+ "\n",
+ "Paper No. 49: Hamilton: -0.330591 Madison: -0.327717 Jay: -0.341692\n",
+ "Paper No. 50: Hamilton: -0.333119 Madison: -0.328454 Jay: -0.338427\n",
+ "Paper No. 51: Hamilton: -0.330246 Madison: -0.325758 Jay: -0.343996\n",
+ "Paper No. 52: Hamilton: -0.331094 Madison: -0.327491 Jay: -0.341415\n",
+ "Paper No. 53: Hamilton: -0.330942 Madison: -0.328364 Jay: -0.340693\n",
+ "Paper No. 54: Hamilton: -0.329566 Madison: -0.327157 Jay: -0.343277\n",
+ "Paper No. 55: Hamilton: -0.330821 Madison: -0.328143 Jay: -0.341036\n",
+ "Paper No. 56: Hamilton: -0.330333 Madison: -0.327496 Jay: -0.342171\n",
+ "Paper No. 57: Hamilton: -0.330625 Madison: -0.328602 Jay: -0.340772\n",
+ "Paper No. 58: Hamilton: -0.330271 Madison: -0.327215 Jay: -0.342515\n",
+ "Paper No. 18: Hamilton: -0.337781 Madison: -0.330932 Jay: -0.331287\n",
+ "Paper No. 19: Hamilton: -0.335635 Madison: -0.331774 Jay: -0.332590\n",
+ "Paper No. 20: Hamilton: -0.334911 Madison: -0.331866 Jay: -0.333223\n",
+ "Paper No. 64: Hamilton: -0.331004 Madison: -0.332968 Jay: -0.336028\n"
 ]
 }
 ],
 "source": [
+ "print('\\nStraightforward Naive Bayes Learner\\n')\n",
 "for d in disputed:\n",
 "    probs = recognize(papers[d], nBS)\n",
- "    results = ['{}: {:.2f}'.format(name, probs[(name, 1)]) for name in 'Hamilton Madison Jay'.split()]\n",
- "    print('Paper No. {}: {}'.format(d, ' '.join(results)))"
+ "    results = ['{}: {:.4f}'.format(name, probs[(name, 1)]) for name in 'Hamilton Madison Jay'.split()]\n",
+ "    print('Paper No. {}: {}'.format(d, ' '.join(results)))\n",
+ "\n",
+ "print('\\nLogarithmic Naive Bayes Learner\\n')\n",
+ "for d in disputed:\n",
+ "    probs = recognize(papers[d], nBSL)\n",
+ "    results = ['{}: {:.6f}'.format(name, probs[(name, 1)]) for name in 'Hamilton Madison Jay'.split()]\n",
+ "    print('Paper No. {}: {}'.format(d, ' '.join(results)))\n",
+ "\n"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
+ "We can see that both learners classify the papers identically. In the straightforward learner the probability products differ by so many orders of magnitude that, after normalization, a single author is left with essentially all of the probability mass, while the logarithmic learner preserves the small differences between all three authors.\n",
\n", + "\n", "This is a simple approach to the problem and thankfully researchers are fairly certain that papers 49-58 were all written by Madison, while 18-20 were written in collaboration between Hamilton and Madison, with Madison being credited for most of the work. Our classifier is not that far off. It correctly identifies the papers written by Madison, even the ones in collaboration with Hamilton.\n", "\n", "Unfortunately, it misses paper 64. Consensus is that the paper was written by John Jay, while our classifier believes it was written by Hamilton. The classifier is wrong there because it does not have much information on Jay's writing; only 4 papers. This is one of the problems with using unbalanced datasets such as this one, where information on some classes is sparser than information on the rest. To avoid this, we can add more writings for Jay and Madison to end up with an equal amount of data for each author." ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [] } ], "metadata": { @@ -720,7 +822,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.3" + "version": "3.6.1" } }, "nbformat": 4, diff --git a/notebook.py b/notebook.py index aafdf19e4..263f7a44b 100644 --- a/notebook.py +++ b/notebook.py @@ -888,6 +888,7 @@ def draw_table(self): self.text_n(self.table[self.context[0]][self.context[1]] if self.context else "Click for text", 0.025, 0.975) self.update() + ############################################################################################################ ##################### Functions to assist plotting in search.ipynb #################### diff --git a/tests/test_csp.py b/tests/test_csp.py index 0f282e3fe..2bc907b6c 100644 --- a/tests/test_csp.py +++ b/tests/test_csp.py @@ -437,6 +437,5 @@ def test_tree_csp_solver(): assert (tcs['NT'] == 'R' and tcs['WA'] == 'B' and tcs['Q'] == 'B' and tcs['NSW'] == 'R' and tcs['V'] == 'B') or \ (tcs['NT'] == 'B' and tcs['WA'] == 'R' and tcs['Q'] == 'R' and tcs['NSW'] == 'B' and tcs['V'] == 'R') - if __name__ == "__main__": pytest.main()