Using the NLTK Corpus   |   Positioning Words


4. Searching for Words

Let's start by analyzing Moby Dick, which is available in NLTK as text1.

The first function we will look at is concordance. A concordance, in this context, lists every occurrence of a word together with the text immediately surrounding it. Our text is behaving like one giant string, so the context is measured in characters (including spaces) rather than in words. NLTK centers the target word on each line and fills the rest of the line with the characters on either side. By default, each line is 79 characters wide in total and the first 25 matches are shown, but you can change both with the width and lines parameters.

In the Jupyter Notebook, type:

text1.concordance("whale")

The output shows each occurrence of the word "whale" in Moby Dick, centered within the characters that surround it. Let's try this with another word, "love." Just replace the word "whale" with "love," and we get the contexts in which Melville uses "love" in Moby Dick. The same idea of examining a word's surroundings underlies several other functions, including similar and common_contexts.
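To make the idea of a concordance concrete, here is a minimal keyword-in-context sketch in plain Python. It is not NLTK's implementation: the function name kwic, its window and max_lines parameters, and the sample sentence are all hypothetical and chosen only for illustration.

```python
def kwic(tokens, word, window=25, max_lines=25):
    """Return (left, match, right) tuples for each occurrence of `word`.

    Illustrative only: `window` is the number of characters kept on each
    side, and `max_lines` caps how many matches are returned.
    """
    target = word.lower()
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() != target:
            continue
        # Treat the text as one long string and slice characters around the match.
        left = " ".join(tokens[:i])[-window:] if window else ""
        right = " ".join(tokens[i + 1:])[:window]
        rows.append((left, token, right))
        if len(rows) == max_lines:
            break
    return rows


# A made-up sample, already split into tokens.
sample = "Call me Ishmael . The whale swam away , and the white whale returned .".split()

for left, match, right in kwic(sample, "whale", window=15):
    # Right-align the left context so the matches line up in a column.
    print(f"{left:>15} {match} {right}")
```

Lining up the matches in a single column is what makes a concordance easy to scan: the eye can run down the target word and read the contexts on either side.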

Let's now see which words appear in similar contexts as the word "love." NLTK has a built-in function for this as well: similar.

text1.similar("love")

Behind the scenes, NLTK finds all the contexts in which the word "love" appears, where a context is the pair of words immediately before and after it. It then looks for other words that occur in those same contexts and lists the ones that share the most of them. This gives a sense of which words are used in the same way as "love." This is somewhat interesting in itself, but more interesting if we compare it to something else. Let's take a look at another text. What about Sense and Sensibility (text2)? Let's see what words are similar to "love" in Jane Austen's writing. In the next cell, type:
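The shared-context idea described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not NLTK's code: the function name similar_words, the boundary markers, and the sample sentence are hypothetical, and ranking simply counts how many (previous word, next word) pairs each candidate shares with the target.

```python
from collections import Counter, defaultdict


def similar_words(tokens, word, num=5):
    """Rank words by how many (previous, next) contexts they share with `word`."""
    tokens = [t.lower() for t in tokens]
    target = word.lower()

    # Collect the set of contexts in which every word appears.
    contexts_of = defaultdict(set)
    for i, tok in enumerate(tokens):
        left = tokens[i - 1] if i > 0 else "<start>"
        right = tokens[i + 1] if i < len(tokens) - 1 else "<end>"
        contexts_of[tok].add((left, right))

    target_contexts = contexts_of.get(target, set())

    # Score every other word by the number of contexts it shares with the target.
    scores = Counter()
    for other, contexts in contexts_of.items():
        if other == target:
            continue
        shared = len(contexts & target_contexts)
        if shared:
            scores[other] = shared
    return [w for w, _ in scores.most_common(num)]


# A made-up sample in which "love", "hate", and "fear" all sit between "i" and "the".
sample = "i love the sea and i hate the sea but i love the ship and i fear the ship".split()
print(similar_words(sample, "love"))
```

Note that nothing here looks at spelling or meaning: two words count as similar only because the same neighbors surround them.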

text2.similar("love")

We can compare the two and see immediately that Melville and Austen use the word "love" differently.

Investigating "lol"

Let's step away from novels for a minute and take a look at the NLTK Chat Corpus (text5). In chats, text messages, and other digital communication, "lol" is exceedingly common. We know it doesn't simply mean "laughing out loud"; maybe the similar function can provide some insight into what it does mean.

text5.similar("lol")

The resulting list consists largely of greetings, which suggests that "lol" has more of a phatic function. Phatic language is language used primarily to communicate social closeness. Phatic words stand in contrast to semantic words, which contribute meaning to the utterance.

If you are interested in this type of analysis, take a look at the common_contexts function in the NLTK book or in the NLTK docs.
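As a rough picture of what a common-contexts analysis looks for, the sketch below finds the (previous word, next word) pairs that a set of words all share. It is a plain-Python illustration rather than NLTK's implementation: the function name shared_contexts and the chat-like sample are hypothetical, and for simplicity it ignores the first and last tokens of the text.

```python
def shared_contexts(tokens, words):
    """Return the (previous, next) word pairs in which every word in `words` occurs."""
    tokens = [t.lower() for t in tokens]
    per_word = {w.lower(): set() for w in words}

    # Skip the first and last tokens, which lack a neighbor on one side.
    for i in range(1, len(tokens) - 1):
        if tokens[i] in per_word:
            per_word[tokens[i]].add((tokens[i - 1], tokens[i + 1]))

    if not per_word:
        return set()
    # Keep only the contexts that all of the requested words have in common.
    return set.intersection(*per_word.values())


# A made-up, chat-like sample.
sample = "lol hi everyone . lol hey everyone . ok bye everyone".split()
print(shared_contexts(sample, ["hi", "hey"]))
```

Where similar starts from one word and asks which other words share its contexts, this kind of analysis starts from several words and asks which contexts they have in common.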

Evaluation

Check all sentences below that are correct:

  • The similar method produces a list of words that are similar in spelling, but not necessarily in meaning, like "whale" and "while".
  • Using the concordance method with a specific word, such as "whale", returns the words that surround "whale" in different sentences, helping us to get a glimpse of the contexts in which the word "whale" shows up.*

Keywords

Do you remember the glossary terms from this section?

