Plot similarity matrix using textplot_network() #7

koheiw · 2018-10-28T09:20:41Z

If we treat a similarity matrix as a type of adjacency matrix, we can plot a semantic network using textplot_network() in a few steps. Why don't we make this an official function?

require(quanteda)

mt <- dfm(data_corpus_inaugural, remove_punct = TRUE, remove = stopwords())
mt <- dfm_trim(mt, min_termfreq = 100)
sim <- textstat_proxy(mt, margin = "features")
textplot_network(quanteda:::as.fcm(as(sim, "dgTMatrix")), min_freq = 0.95)
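The idea of reading a similarity matrix as a weighted adjacency matrix can be sketched in base R without any quanteda internals. This is only an illustration under stated assumptions: the toy counts, the feature names, and the 0.5 quantile cut-off are all made up here, and the thresholding is a stand-in for, not a copy of, what min_freq does inside textplot_network().

```r
# Toy document-feature counts (hypothetical data, for illustration only)
counts <- matrix(
  c(2, 0, 1, 3,
    1, 1, 0, 2,
    0, 4, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("war", "peace", "nation", "people"))
)

# Cosine similarity between features (columns)
norms <- sqrt(colSums(counts^2))
sim <- crossprod(counts) / outer(norms, norms)

# Treat the similarity matrix as an adjacency matrix:
# drop self-similarity, then zero out pairs below a quantile cut-off
diag(sim) <- 0
cutoff <- quantile(sim[upper.tri(sim)], 0.5)  # hypothetical threshold
adj <- sim * (sim >= cutoff)

# Edge list of the resulting network (upper triangle only, so no duplicates)
idx <- which(upper.tri(adj) & adj > 0, arr.ind = TRUE)
edges <- data.frame(
  from = rownames(adj)[idx[, "row"]],
  to = colnames(adj)[idx[, "col"]],
  weight = adj[idx]
)
edges
```

Each remaining non-zero cell becomes one weighted edge, which is the form a network plot needs.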


kbenoit · 2018-10-28T20:30:34Z

Networks typically reflect co-occurrence rather than similarity. Are there examples of networks that show similarity? Either way, it would be easy to write a method for textplot_network() that accepts the objects returned by textstat_simil() as input. A more common plot would be a heatmap of similarity called textplot_simil() that took a similarity input. This is a pretty common way to plot a matrix of correlations/similarities and easily available in ggplot2.

jiongweilua · 2019-01-19T21:54:18Z

Hi, I wanted to chime in on this. I agree with Ken that when thinking about similarity, a heatmap-like visualisation is more intuitive to me.

However, I think that a collocation_network function might be a useful complementary function to textstat_collocations. The concept of collocation seems to me to lend itself naturally to a spatial expression. But from a function design point of view, I think there are some non-trivial challenges - e.g. scalability, replicability of visualisations, interactivity etc. Page 37 of this article has a nice overview: Towards Interactive Multidimensional Visualisations for Corpus Linguistics

Is this something we would like to explore?

randomgambit · 2019-02-16T18:52:43Z

@kbenoit is there any hacky way to do what you are referring to?

"A more common plot would be a heatmap of similarity called textplot_simil() that took a similarity input. This is a pretty common way to plot a matrix of correlations/similarities and easily available in ggplot2."

kbenoit · 2019-02-17T22:46:14Z

I wouldn't call it hacky, but the code below works. The simil measures will yield positive values so we would ideally figure out a way to remove the values < 1.0.

library("quanteda")
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
## 
##     View

simmat <- corpus_subset(data_corpus_inaugural, Year > 1980) %>%
  dfm(remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("en")) %>%
  textstat_simil() %>%
  as.matrix()
simmat[1:5, 1:5]
##              1981-Reagan 1985-Reagan 1989-Bush 1993-Clinton 1997-Clinton
## 1981-Reagan    1.0000000   0.6503200 0.4750618    0.5159960    0.5181002
## 1985-Reagan    0.6503200   1.0000000 0.5043065    0.5558569    0.6074780
## 1989-Bush      0.4750618   0.5043065 1.0000000    0.5037529    0.5311117
## 1993-Clinton   0.5159960   0.5558569 0.5037529    1.0000000    0.5961274
## 1997-Clinton   0.5181002   0.6074780 0.5311117    0.5961274    1.0000000

ggcorrplot::ggcorrplot(simmat, hc.order = TRUE, type = "lower")

corrplot::corrplot.mixed(simmat, order = "hclust", tl.col = "black")
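For readers without ggcorrplot or corrplot, a roughly equivalent heatmap can be sketched with base R plus ggplot2. The similarity values are copied from the 5 x 5 excerpt printed above. The column names doc1 and doc2, the colour choices, and the fixed 0 to 1 fill scale are assumptions made for this illustration, not part of the original code.

```r
# Similarity values copied from the 5 x 5 excerpt printed above
labs <- c("1981-Reagan", "1985-Reagan", "1989-Bush", "1993-Clinton", "1997-Clinton")
simmat <- matrix(
  c(1.0000000, 0.6503200, 0.4750618, 0.5159960, 0.5181002,
    0.6503200, 1.0000000, 0.5043065, 0.5558569, 0.6074780,
    0.4750618, 0.5043065, 1.0000000, 0.5037529, 0.5311117,
    0.5159960, 0.5558569, 0.5037529, 1.0000000, 0.5961274,
    0.5181002, 0.6074780, 0.5311117, 0.5961274, 1.0000000),
  nrow = 5, byrow = TRUE, dimnames = list(labs, labs)
)

# Order documents by hierarchical clustering on the dissimilarity 1 - similarity
ord <- hclust(as.dist(1 - simmat))$order
simmat <- simmat[ord, ord]

# Long format: one row per (document, document) pair
long <- as.data.frame(as.table(simmat), responseName = "similarity")
names(long)[1:2] <- c("doc1", "doc2")

# Draw the heatmap only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(long, ggplot2::aes(doc1, doc2, fill = similarity)) +
    ggplot2::geom_tile() +
    ggplot2::scale_fill_gradient(low = "white", high = "steelblue", limits = c(0, 1)) +
    ggplot2::labs(x = NULL, y = NULL)
  print(p)
}
```

Reordering the matrix before reshaping matters because the factor levels in the long data frame, and therefore the axis order, follow the matrix dimnames.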

wang93312 · 2023-09-17T03:54:18Z

Hope Dr. Benoit can offer some guidance on the following feedback --

install.packages("quanteda.textstat")
Installing package into ‘C:/Users/jwang/AppData/Local/R/win-library/4.2’
(as ‘lib’ is unspecified)
Warning in install.packages :
package ‘quanteda.textstat’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

library(quanteda.textstat)
Error in library(quanteda.textstat) :
there is no package called ‘quanteda.textstat’

I tried different versions of R. None of the attempts worked.
Thank you!

wang93312 · 2023-09-17T04:14:14Z

My apologies -- I missed the "s" in install.packages("quanteda.textstats")

Regarding the following warning messages --
1: remove_punct, remove_numbers arguments are not used.
2: 'remove' is deprecated; use dfm_remove() instead

I am wondering if Dr. Benoit could provide an example of how to use dfm_remove() in place of the remove argument.

Thank you!

kbenoit · 2023-09-17T16:51:34Z

example("dfm_remove", package = "quanteda")
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
#> 
#> dfm_rm> dfmat <- tokens(c("My Christmas was ruined by your opposition tax plan.",
#> dfm_rm+                "Does the United_States or Sweden have more progressive taxation?")) %>%
#> dfm_rm+     dfm(tolower = FALSE)
#> 
#> dfm_rm> dict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
#> dfm_rm+                         wordsEndingInY = c("by", "my"),
#> dfm_rm+                         notintext = "blahblah"))
#> 
#> dfm_rm> dfm_select(dfmat, pattern = dict)
#> Document-feature matrix of: 2 documents, 4 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    My by United_States Sweden
#>   text1  1  1             0      0
#>   text2  0  0             1      1
#> 
#> dfm_rm> dfm_select(dfmat, pattern = dict, case_insensitive = FALSE)
#> Document-feature matrix of: 2 documents, 1 feature (50.00% sparse) and 0 docvars.
#>        features
#> docs    by
#>   text1  1
#>   text2  0
#> 
#> dfm_rm> dfm_select(dfmat, pattern = c("s$", ".y"), selection = "keep", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 6 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    My Christmas was by Does United_States
#>   text1  1         1   1  1    0             0
#>   text2  0         0   0  0    1             1
#> 
#> dfm_rm> dfm_select(dfmat, pattern = c("s$", ".y"), selection = "remove", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 14 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    ruined your opposition tax plan . the or Sweden have
#>   text1      1    1          1   1    1 1   0  0      0    0
#>   text2      0    0          0   0    0 0   1  1      1    1
#> [ reached max_nfeat ... 4 more features ]
#> 
#> dfm_rm> dfm_select(dfmat, pattern = stopwords("english"), selection = "keep", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 9 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    My was by your Does the or have more
#>   text1  1   1  1    1    0   0  0    0    0
#>   text2  0   0  0    0    1   1  1    1    1
#> 
#> dfm_rm> dfm_select(dfmat, pattern = stopwords("english"), selection = "remove", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 11 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    Christmas ruined opposition tax plan . United_States Sweden progressive
#>   text1         1      1          1   1    1 1             0      0           0
#>   text2         0      0          0   0    0 0             1      1           1
#>        features
#> docs    taxation
#>   text1        0
#>   text2        1
#> [ reached max_nfeat ... 1 more feature ]
#> 
#> dfm_rm> # select based on character length
#> dfm_rm> dfm_select(dfmat, min_nchar = 5)
#> Document-feature matrix of: 2 documents, 7 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    Christmas ruined opposition United_States Sweden progressive taxation
#>   text1         1      1          1             0      0           0        0
#>   text2         0      0          0             1      1           1        1
#> 
#> dfm_rm> dfmat <- dfm(tokens(c("This is a document with lots of stopwords.",
#> dfm_rm+                       "No if, and, or but about it: lots of stopwords.")))
#> 
#> dfm_rm> dfmat
#> Document-feature matrix of: 2 documents, 18 features (38.89% sparse) and 0 docvars.
#>        features
#> docs    this is a document with lots of stopwords . no
#>   text1    1  1 1        1    1    1  1         1 1  0
#>   text2    0  0 0        0    0    1  1         1 1  1
#> [ reached max_nfeat ... 8 more features ]
#> 
#> dfm_rm> dfm_remove(dfmat, stopwords("english"))
#> Document-feature matrix of: 2 documents, 6 features (25.00% sparse) and 0 docvars.
#>        features
#> docs    document lots stopwords . , :
#>   text1        1    1         1 1 0 0
#>   text2        0    1         1 1 2 1
#> 
#> dfm_rm> toks <- tokens(c("this contains lots of stopwords",
#> dfm_rm+                  "no if, and, or but about it: lots"),
#> dfm_rm+                remove_punct = TRUE)
#> 
#> dfm_rm> fcmat <- fcm(toks)
#> 
#> dfm_rm> fcmat
#> Feature co-occurrence matrix of: 12 by 12 features.
#>            features
#> features    this contains lots of stopwords no if and or but
#>   this         0        1    1  1         1  0  0   0  0   0
#>   contains     0        0    1  1         1  0  0   0  0   0
#>   lots         0        0    0  1         1  1  1   1  1   1
#>   of           0        0    0  0         1  0  0   0  0   0
#>   stopwords    0        0    0  0         0  0  0   0  0   0
#>   no           0        0    0  0         0  0  1   1  1   1
#>   if           0        0    0  0         0  0  0   1  1   1
#>   and          0        0    0  0         0  0  0   0  1   1
#>   or           0        0    0  0         0  0  0   0  0   1
#>   but          0        0    0  0         0  0  0   0  0   0
#> [ reached max_feat ... 2 more features, reached max_nfeat ... 2 more features ]
#> 
#> dfm_rm> fcm_remove(fcmat, stopwords("english"))
#> Feature co-occurrence matrix of: 3 by 3 features.
#>            features
#> features    contains lots stopwords
#>   contains         0    1         1
#>   lots             0    0         1
#>   stopwords        0    0         0

Created on 2023-09-17 with reprex v2.0.2
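Applied to the heatmap pipeline from earlier in this thread, the two warnings suggest a rewrite along the following lines. This is a sketch assuming quanteda 3.x with quanteda.textstats installed: punctuation and number removal move to tokens(), and stopword removal moves to dfm_remove(). The object names toks and dfmat are chosen here for illustration, and the resulting values may differ from the 2019 output because package defaults have changed between versions.

```r
library("quanteda")

# Tokenise first: punctuation and number removal are tokens() arguments
toks <- tokens(corpus_subset(data_corpus_inaugural, Year > 1980),
               remove_punct = TRUE, remove_numbers = TRUE)

# Build the dfm, then drop stopwords with dfm_remove() instead of remove =
dfmat <- dfm_remove(dfm(toks), pattern = stopwords("en"))

# textstat_simil() now lives in the quanteda.textstats package
simmat <- as.matrix(quanteda.textstats::textstat_simil(dfmat))
simmat[1:5, 1:5]
```

The resulting simmat can be passed to ggcorrplot::ggcorrplot() or corrplot::corrplot.mixed() as before.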

wang93312 · 2023-09-17T17:23:48Z

Thank you so much for your timely guidance, Dr. Benoit! Gratefully, JJ Wang

kbenoit transferred this issue from quanteda/quanteda Nov 28, 2020
