Patch 1 #52
Open · wants to merge 2 commits into master
13 changes: 2 additions & 11 deletions vignettes/exploration.Rmd
@@ -64,9 +64,7 @@ demo_vectors %>% closest_to("bad")
The tildes are necessary syntax where things get interesting--you can do **math** on these vectors. So if we want to find the words that are closest to the *combination* of "good" and "bad" (which is to say, words that get used in evaluation) we can write (see where the tilde is?):

```{r}
-
demo_vectors %>% closest_to(~"good"+"bad")
-
# The same thing could be written as:
# demo_vectors %>% closest_to(demo_vectors[["good"]]+demo_vectors[["bad"]])
```
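
Since the comment above claims the two spellings are equivalent, a quick check is possible; a minimal sketch, assuming `demo_vectors` is loaded as in this vignette:

```{r}
# Both calls should rank the same nearest words.
by_formula = demo_vectors %>% closest_to(~ "good" + "bad")
by_indexing = demo_vectors %>% closest_to(demo_vectors[["good"]] + demo_vectors[["bad"]])
identical(by_formula$word, by_indexing$word)
```
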
@@ -84,7 +82,6 @@ demo_vectors %>% closest_to(~"good" - "bad")
> a vector that describes the difference between positive and negative.
> Similarity to this vector means, technically, the portion of a word's vector
> whose multidimensional path lies largely along the direction between the two words.
-
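
To make the quoted claim concrete, here is a rough sketch of what similarity to a difference vector computes, assuming `demo_vectors` and the package's `cosineSimilarity` function are available as above:

```{r}
# The direction that separates "good" from "bad".
direction = demo_vectors[["good"]] - demo_vectors[["bad"]]
# Cosine similarity of every word in the model to that direction.
sims = cosineSimilarity(demo_vectors, direction)
# The ten words pointing most strongly toward the "good" end.
head(sims[order(-sims[, 1]), , drop = FALSE], 10)
```
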
Again, you can easily switch the order to the opposite: here are a bunch of bad words:

```{r}
@@ -127,11 +124,9 @@ demo_vectors %>% closest_to(~ "guy" + ("she" - "he"))

Principal components can let you plot a subset of these vectors to see how they relate. You can imagine an arrow from "he" to "she", from "guy" to "lady", and from "man" to "woman"; all run in roughly the same direction.

-```{r}
-
+```{r, fig.height = 7, fig.width = 7, fig.align = "center"}
demo_vectors[[c("lady","woman","man","he","she","guy","man"), average=F]] %>%
plot(method="pca")
-
```
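
The claim that these pair arrows run in roughly the same direction can be checked directly; a sketch, assuming the same `demo_vectors` model (cosine similarity near 1 means the arrows are nearly parallel):

```{r}
# Difference vectors for each of the three pairs.
she_he = demo_vectors[["she"]] - demo_vectors[["he"]]
woman_man = demo_vectors[["woman"]] - demo_vectors[["man"]]
lady_guy = demo_vectors[["lady"]] - demo_vectors[["guy"]]
# How aligned are the pair directions?
cosineSimilarity(she_he, woman_man)
cosineSimilarity(she_he, lady_guy)
```
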

These lists of ten words at a time are useful for interactive exploration, but sometimes we might want to say `n=Inf` to return the full list. For instance, we can combine these two methods to look at positive and negative words used to evaluate teachers.
@@ -141,10 +136,8 @@ First we build up three data_frames: first, a list of the 50 top evaluative word
```{r}
top_evaluative_words = demo_vectors %>%
closest_to(~ "good"+"bad",n=75)
-
goodness = demo_vectors %>%
closest_to(~ "good"-"bad",n=Inf)
-
femininity = demo_vectors %>%
closest_to(~ "she" - "he", n=Inf)
```
@@ -153,10 +146,9 @@ Then we can use tidyverse packages to join and plot these.
An `inner_join` restricts us down to just those top 50 words, and ggplot
can array the words on axes.

-```{r}
+```{r, fig.height = 7, fig.width = 7, fig.align = "center"}
library(ggplot2)
library(dplyr)
-
top_evaluative_words %>%
inner_join(goodness) %>%
inner_join(femininity) %>%
@@ -165,4 +157,3 @@ top_evaluative_words %>%
y=`similarity to "good" - "bad"`,
label=word))
```
-
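
The diff collapses the middle of this chunk. For reference, a plausible reconstruction of the complete pipeline; the `ggplot()` call, `geom_text`, and the x aesthetic are assumptions inferred from the visible lines, not the file's confirmed contents:

```{r}
library(ggplot2)
library(dplyr)

top_evaluative_words %>%
  inner_join(goodness) %>%
  inner_join(femininity) %>%
  ggplot() +
  # Assumed: x is the femininity score produced by closest_to(~ "she" - "he").
  geom_text(aes(x = `similarity to "she" - "he"`,
                y = `similarity to "good" - "bad"`,
                label = word))
```
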
24 changes: 5 additions & 19 deletions vignettes/introduction.Rmd
@@ -24,8 +24,6 @@ if (!require(wordVectors)) {
}
devtools::install_github("bmschmidt/wordVectors")
}
-
-
```

# Building test data
@@ -66,7 +64,6 @@ To train a word2vec model, use the function `train_word2vec`. This actually buil

```{r}
if (!file.exists("cookbook_vectors.bin")) {model = train_word2vec("cookbooks.txt","cookbook_vectors.bin",vectors=200,threads=4,window=12,iter=5,negative_samples=0)} else model = read.vectors("cookbook_vectors.bin")
-
```

A few notes:
@@ -104,7 +101,7 @@ Now we have a pretty expansive list of potential fish-related words from old coo

Or we can just arrange them somehow. In this case, it doesn't look like much of anything.

-```{r}
+```{r, fig.height = 7, fig.width = 7, fig.align = "center"}
some_fish = closest_to(model,model[[c("fish","salmon","trout","shad","flounder","carp","roe","eels")]],150)
fishy = model[[some_fish$word,average=F]]
plot(fishy,method="pca")
@@ -133,22 +130,19 @@ These can be useful for figuring out, at a glance, what some of the overall comm
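
The collapsed passage above derives clusters over the whole model. A minimal sketch of that idea, assuming the model object can be passed to `kmeans` as an ordinary matrix; the seed and number of centers here are illustrative guesses, not the vignette's actual values:

```{r}
set.seed(10)
# Partition every word vector in the model into 10 clusters.
clustering = kmeans(model, centers = 10, iter.max = 40)
# Peek at a few member words from each cluster.
sapply(1:10, function(n) {
  names(clustering$cluster[clustering$cluster == n])[1:5]
})
```
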
Clusters need not be derived at the level of the full model. We can take, for instance,
the 20 words closest to each of four different kinds of words.

-```{r}
+```{r, fig.height = 7, fig.width = 7, fig.align = "center"}
ingredients = c("madeira","beef","saucepan","carrots")
term_set = lapply(ingredients,
function(ingredient) {
nearest_words = model %>% closest_to(model[[ingredient]],20)
nearest_words$word
}) %>% unlist
-
subset = model[[term_set,average=F]]
-
subset %>%
cosineDist(subset) %>%
as.dist %>%
hclust %>%
plot
-
```


@@ -160,21 +154,17 @@ One of the basic strategies you can take is to try to project the high-dimension

For instance, we can take the words "sweet" and "salty," find the twenty words most similar to either of them, and plot those in a sweet-salty plane.

-```{r}
+```{r, fig.height = 7, fig.width = 7, fig.align = "center"}
tastes = model[[c("sweet","salty"),average=F]]
-
# model[1:3000,] here restricts to the 3000 most common words in the set.
sweet_and_saltiness = model[1:3000,] %>% cosineSimilarity(tastes)
-
# Filter to the top 20 sweet or salty.
sweet_and_saltiness = sweet_and_saltiness[
rank(-sweet_and_saltiness[,1])<20 |
rank(-sweet_and_saltiness[,2])<20,
]
-
plot(sweet_and_saltiness,type='n')
text(sweet_and_saltiness,labels=rownames(sweet_and_saltiness))
-
```


@@ -183,21 +173,17 @@ There's no limit to how complicated this can get. For instance, there are really
Rather than use a base matrix of the whole set, we can shrink down to just five dimensions: how similar every word in our set is to each of these five. (I'm using cosine similarity here, so the closer a number is to one, the more similar it is.)

```{r}
-
tastes = model[[c("sweet","salty","savory","bitter","sour"),average=F]]
-
# model[1:3000,] here restricts to the 3000 most common words in the set.
common_similarities_tastes = model[1:3000,] %>% cosineSimilarity(tastes)
-
common_similarities_tastes[20:30,]
```
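
To make the "apply-max" step mentioned next concrete, a small sketch of reading each word's dominant taste off this similarity matrix, assuming its columns are named for the five taste words:

```{r}
# For each word (row): its strongest taste similarity, and which taste that is.
strongest = apply(common_similarities_tastes, 1, max)
dominant = colnames(common_similarities_tastes)[
  apply(common_similarities_tastes, 1, which.max)]
head(data.frame(taste = dominant, similarity = strongest), 10)
```
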

Now we can filter down to the words that are closest to *any* of these (that's what the apply-max function below does), and
use a PCA biplot to look at them in a flavor plane.

-```{r}
+```{r, fig.height = 7, fig.width = 7, fig.align = "center"}
high_similarities_to_tastes = common_similarities_tastes[rank(-apply(common_similarities_tastes,1,max)) < 75,]
-
high_similarities_to_tastes %>%
prcomp %>%
biplot(main="Fifty words in a\nprojection of flavor space")
@@ -218,7 +204,7 @@ Just calling "plot" will display the equivalent of a word cloud with individual

"Perplexity" is the optimal number of neighbors for each word. By default it's 50; smaller numbers may cause clusters to appear more dramatically at the cost of overall coherence.

-```{r}
+```{r, fig.height = 7, fig.width = 7, fig.align = "center"}
plot(model,perplexity=50)
```
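
As a usage note, lowering the perplexity is a quick way to see the trade-off described above; an illustrative variation (the exact visual effect will depend on the model):

```{r}
# Tighter local neighborhoods, looser global structure.
plot(model, perplexity = 10)
```
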
