Plot similarity matrix using textplot_network() #7

Open
koheiw opened this issue Oct 28, 2018 · 8 comments

@koheiw
Collaborator

koheiw commented Oct 28, 2018

If we treat a similarity matrix as a type of adjacency matrix, we can plot a semantic network using textplot_network() in a few steps. Why don't we make this an official function?

require(quanteda)

mt <- dfm(data_corpus_inaugural, remove_punct = TRUE, remove = stopwords())
mt <- dfm_trim(mt, min_termfreq = 100)
sim <- textstat_proxy(mt, margin = "features")
textplot_network(quanteda:::as.fcm(as(sim, "dgTMatrix")), min_freq = 0.95)

[Image: rplot]

@kbenoit
Contributor

kbenoit commented Oct 28, 2018 via email

@jiongweilua

Hi, I wanted to chime in on this. I agree with Ken that when thinking about similarity, a heatmap-like visualisation is more intuitive to me.

However, I think that a collocation_network function might be a useful complement to textstat_collocations. The concept of collocation seems to me to lend itself naturally to a spatial expression. From a function design point of view, though, there are some non-trivial challenges, e.g. scalability, replicability of visualisations, and interactivity. Page 37 of this article has a nice overview: Towards Interactive Multidimensional Visualisations for Corpus Linguistics

Is this something we would like to explore?

@randomgambit

> A more common plot would be a heatmap of similarity called textplot_simil() that took a similarity input. This is a pretty common way to plot a matrix of correlations/similarities and easily available in ggplot2.

@kbenoit is there any hacky way to do what you are referring to?

@kbenoit
Contributor

kbenoit commented Feb 17, 2019

I wouldn't call it hacky, but the code below works. The simil measures will yield positive values so we would ideally figure out a way to remove the values < 1.0.

library("quanteda")
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

simmat <- corpus_subset(data_corpus_inaugural, Year > 1980) %>%
  dfm(remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("en")) %>%
  textstat_simil() %>%
  as.matrix()
simmat[1:5, 1:5]
##              1981-Reagan 1985-Reagan 1989-Bush 1993-Clinton 1997-Clinton
## 1981-Reagan    1.0000000   0.6503200 0.4750618    0.5159960    0.5181002
## 1985-Reagan    0.6503200   1.0000000 0.5043065    0.5558569    0.6074780
## 1989-Bush      0.4750618   0.5043065 1.0000000    0.5037529    0.5311117
## 1993-Clinton   0.5159960   0.5558569 0.5037529    1.0000000    0.5961274
## 1997-Clinton   0.5181002   0.6074780 0.5311117    0.5961274    1.0000000

ggcorrplot::ggcorrplot(simmat, hc.order = TRUE, type = "lower")

corrplot::corrplot.mixed(simmat, order = "hclust", tl.col = "black")
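
For comparison, a heatmap of the same kind can be drawn with ggplot2 alone. The block below is only a sketch, not an existing quanteda function: `simil_to_long()` is a hypothetical helper, the toy matrix copies the first three rows and columns of the output printed above, and the fill limits of 0 to 1 assume the similarities are non-negative.

```r
library("ggplot2")

# Hypothetical helper (not part of quanteda): reshape a square similarity
# matrix into long format, with one row per pair of documents.
simil_to_long <- function(simmat) {
  long <- as.data.frame(as.table(simmat), stringsAsFactors = FALSE)
  names(long) <- c("doc1", "doc2", "similarity")
  long
}

# Toy input: the first three rows and columns of the matrix printed above.
docs <- c("1981-Reagan", "1985-Reagan", "1989-Bush")
simmat_toy <- matrix(
  c(1.0000000, 0.6503200, 0.4750618,
    0.6503200, 1.0000000, 0.5043065,
    0.4750618, 0.5043065, 1.0000000),
  nrow = 3, dimnames = list(docs, docs)
)

long <- simil_to_long(simmat_toy)

# One tile per document pair; limits of c(0, 1) assume that the
# similarities are non-negative (values outside the range are left blank).
ggplot(long, aes(x = doc1, y = doc2, fill = similarity)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "steelblue", limits = c(0, 1)) +
  labs(x = NULL, y = NULL, fill = "Similarity")
```

Unlike the calls above, this sketch does not reorder the documents by hierarchical clustering; the tiles follow the default ordering of the axis labels.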

@kbenoit transferred this issue from quanteda/quanteda on Nov 28, 2020

@wang93312

Hope Dr. Benoit can offer some guidance on the following feedback --

install.packages("quanteda.textstat")
Installing package into ‘C:/Users/jwang/AppData/Local/R/win-library/4.2’
(as ‘lib’ is unspecified)
Warning in install.packages :
package ‘quanteda.textstat’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

library(quanteda.textstat)
Error in library(quanteda.textstat) :
there is no package called ‘quanteda.textstat’

I tried different versions of R. None of the attempts worked.
Thank you!

@wang93312

My apologies -- I missed the "s" in the package name. The correct call is install.packages("quanteda.textstats").

Regarding the following warning messages --
1: remove_punct, remove_numbers arguments are not used.
2: 'remove' is deprecated; use dfm_remove() instead

I am wondering if Dr. Benoit could provide an example of how to use dfm_remove() in place of the remove argument.

Thank you!

@kbenoit
Contributor

kbenoit commented Sep 17, 2023

example("dfm_remove", package = "quanteda")
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> dfm_rm> dfmat <- tokens(c("My Christmas was ruined by your opposition tax plan.",
#> dfm_rm+                "Does the United_States or Sweden have more progressive taxation?")) %>%
#> dfm_rm+     dfm(tolower = FALSE)
#> 
#> dfm_rm> dict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
#> dfm_rm+                         wordsEndingInY = c("by", "my"),
#> dfm_rm+                         notintext = "blahblah"))
#> 
#> dfm_rm> dfm_select(dfmat, pattern = dict)
#> Document-feature matrix of: 2 documents, 4 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    My by United_States Sweden
#>   text1  1  1             0      0
#>   text2  0  0             1      1
#> 
#> dfm_rm> dfm_select(dfmat, pattern = dict, case_insensitive = FALSE)
#> Document-feature matrix of: 2 documents, 1 feature (50.00% sparse) and 0 docvars.
#>        features
#> docs    by
#>   text1  1
#>   text2  0
#> 
#> dfm_rm> dfm_select(dfmat, pattern = c("s$", ".y"), selection = "keep", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 6 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    My Christmas was by Does United_States
#>   text1  1         1   1  1    0             0
#>   text2  0         0   0  0    1             1
#> 
#> dfm_rm> dfm_select(dfmat, pattern = c("s$", ".y"), selection = "remove", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 14 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    ruined your opposition tax plan . the or Sweden have
#>   text1      1    1          1   1    1 1   0  0      0    0
#>   text2      0    0          0   0    0 0   1  1      1    1
#> [ reached max_nfeat ... 4 more features ]
#> 
#> dfm_rm> dfm_select(dfmat, pattern = stopwords("english"), selection = "keep", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 9 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    My was by your Does the or have more
#>   text1  1   1  1    1    0   0  0    0    0
#>   text2  0   0  0    0    1   1  1    1    1
#> 
#> dfm_rm> dfm_select(dfmat, pattern = stopwords("english"), selection = "remove", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 11 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    Christmas ruined opposition tax plan . United_States Sweden progressive
#>   text1         1      1          1   1    1 1             0      0           0
#>   text2         0      0          0   0    0 0             1      1           1
#>        features
#> docs    taxation
#>   text1        0
#>   text2        1
#> [ reached max_nfeat ... 1 more feature ]
#> 
#> dfm_rm> # select based on character length
#> dfm_rm> dfm_select(dfmat, min_nchar = 5)
#> Document-feature matrix of: 2 documents, 7 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    Christmas ruined opposition United_States Sweden progressive taxation
#>   text1         1      1          1             0      0           0        0
#>   text2         0      0          0             1      1           1        1
#> 
#> dfm_rm> dfmat <- dfm(tokens(c("This is a document with lots of stopwords.",
#> dfm_rm+                       "No if, and, or but about it: lots of stopwords.")))
#> 
#> dfm_rm> dfmat
#> Document-feature matrix of: 2 documents, 18 features (38.89% sparse) and 0 docvars.
#>        features
#> docs    this is a document with lots of stopwords . no
#>   text1    1  1 1        1    1    1  1         1 1  0
#>   text2    0  0 0        0    0    1  1         1 1  1
#> [ reached max_nfeat ... 8 more features ]
#> 
#> dfm_rm> dfm_remove(dfmat, stopwords("english"))
#> Document-feature matrix of: 2 documents, 6 features (25.00% sparse) and 0 docvars.
#>        features
#> docs    document lots stopwords . , :
#>   text1        1    1         1 1 0 0
#>   text2        0    1         1 1 2 1
#> 
#> dfm_rm> toks <- tokens(c("this contains lots of stopwords",
#> dfm_rm+                  "no if, and, or but about it: lots"),
#> dfm_rm+                remove_punct = TRUE)
#> 
#> dfm_rm> fcmat <- fcm(toks)
#> 
#> dfm_rm> fcmat
#> Feature co-occurrence matrix of: 12 by 12 features.
#>            features
#> features    this contains lots of stopwords no if and or but
#>   this         0        1    1  1         1  0  0   0  0   0
#>   contains     0        0    1  1         1  0  0   0  0   0
#>   lots         0        0    0  1         1  1  1   1  1   1
#>   of           0        0    0  0         1  0  0   0  0   0
#>   stopwords    0        0    0  0         0  0  0   0  0   0
#>   no           0        0    0  0         0  0  1   1  1   1
#>   if           0        0    0  0         0  0  0   1  1   1
#>   and          0        0    0  0         0  0  0   0  1   1
#>   or           0        0    0  0         0  0  0   0  0   1
#>   but          0        0    0  0         0  0  0   0  0   0
#> [ reached max_feat ... 2 more features, reached max_nfeat ... 2 more features ]
#> 
#> dfm_rm> fcm_remove(fcmat, stopwords("english"))
#> Feature co-occurrence matrix of: 3 by 3 features.
#>            features
#> features    contains lots stopwords
#>   contains         0    1         1
#>   lots             0    0         1
#>   stopwords        0    0         0

Created on 2023-09-17 with reprex v2.0.2
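
Applied to the similarity example earlier in this thread, the two warnings can probably be avoided by passing remove_punct and remove_numbers to tokens() and doing the stopword removal with dfm_remove(). The following is a sketch under those assumptions, not code taken from the package documentation; it assumes quanteda 3.x or later with quanteda.textstats installed, which is where textstat_simil() now lives.

```r
library("quanteda")
library("quanteda.textstats")  # assumed installed; provides textstat_simil()

# Same subset of the inaugural corpus as in the earlier example.
corp <- corpus_subset(data_corpus_inaugural, Year > 1980)

# Punctuation and number removal now happen at the tokens() stage.
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)

# Build the dfm, then drop stopwords with dfm_remove() instead of `remove =`.
dfmat <- dfm(toks)
dfmat <- dfm_remove(dfmat, pattern = stopwords("en"))

# Document-by-document similarity, converted to a base matrix for plotting.
simmat <- as.matrix(textstat_simil(dfmat))
simmat[1:5, 1:5]
```

The resulting simmat can be passed to ggcorrplot::ggcorrplot() or corrplot::corrplot.mixed() as before. The exact values may differ slightly from the 2019 output because tokenisation changed between package versions.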

@wang93312

wang93312 commented Sep 17, 2023 via email
