Plot similarity matrix using textplot_network() #7
Networks typically reflect co-occurrence rather than similarity - are there examples of networks used to show similarity? Either way, it would be easy to write a method for textplot_network() that accepts the return object of textstat_simil() as input.
A more common plot would be a heatmap of the similarities, say a textplot_simil() that takes a similarity object as input. This is a pretty common way to plot a matrix of correlations/similarities and is easily done in ggplot2.
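To make that concrete, here is a minimal sketch of what such a heatmap could look like in ggplot2. It assumes a symmetric similarity matrix such as the one returned by as.matrix(textstat_simil(x)); the function name plot_simil_heatmap is illustrative only and is not part of quanteda.

```r
library(ggplot2)

plot_simil_heatmap <- function(simmat) {
  # reshape the matrix into long form: one row per (doc1, doc2) pair,
  # matching R's column-major storage order
  df <- data.frame(
    doc1  = rep(rownames(simmat), times = ncol(simmat)),
    doc2  = rep(colnames(simmat), each  = nrow(simmat)),
    simil = as.vector(simmat)
  )
  ggplot(df, aes(x = doc1, y = doc2, fill = simil)) +
    geom_tile() +
    scale_fill_gradient(low = "white", high = "steelblue") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
}
```

A real textplot_simil() would presumably accept the textstat_simil() object directly and coerce it internally, but the tile-plot core would be much like this.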
On 28 Oct 2018, at 05:20, Kohei Watanabe <notifications@github.com> wrote:
If we treat a similarity matrix as a type of adjacency matrix, we can plot a semantic network using textplot_network() in a few steps. Why don't we make this an official function?
require(quanteda)
mt <- dfm(data_corpus_inaugural, remove_punct = TRUE, remove = stopwords())
mt <- dfm_trim(mt, min_termfreq = 100)
sim <- textstat_proxy(mt, margin = "features")
textplot_network(quanteda:::as.fcm(as(sim, "dgTMatrix")), min_freq = 0.95)
[rplot] https://user-images.githubusercontent.com/6572963/47614121-eafee100-dadd-11e8-89de-d9945baa3a5c.png
(Quoted from https://github.com/quanteda/quanteda/issues/1474)
Hi, I wanted to chime in on this. I agree with Ken that, when thinking about similarity, a heatmap-like visualisation is more intuitive to me. Is this something we would like to explore?
@kbenoit is there any hacky way to do what you are referring to?
I wouldn't call it hacky, but the code below works. The simil measures will all yield positive values, so ideally we would figure out a way to remove the self-similarity values of 1.0.
library("quanteda")
## Package version: 1.4.1
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
simmat <- corpus_subset(data_corpus_inaugural, Year > 1980) %>%
dfm(remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("en")) %>%
textstat_simil() %>%
as.matrix()
simmat[1:5, 1:5]
## 1981-Reagan 1985-Reagan 1989-Bush 1993-Clinton 1997-Clinton
## 1981-Reagan 1.0000000 0.6503200 0.4750618 0.5159960 0.5181002
## 1985-Reagan 0.6503200 1.0000000 0.5043065 0.5558569 0.6074780
## 1989-Bush 0.4750618 0.5043065 1.0000000 0.5037529 0.5311117
## 1993-Clinton 0.5159960 0.5558569 0.5037529 1.0000000 0.5961274
## 1997-Clinton 0.5181002 0.6074780 0.5311117 0.5961274 1.0000000
ggcorrplot::ggcorrplot(simmat, hc.order = TRUE, type = "lower")
corrplot::corrplot.mixed(simmat, order = "hclust", tl.col = "black")
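One way to handle the uninformative 1.0 self-similarity values is to blank out the diagonal before plotting, so the colour scale is driven only by the between-document similarities. A small sketch, using a stand-in matrix m rather than the full simmat above (the corrplot call is guarded so the masking itself runs regardless):

```r
# m stands in for a document similarity matrix like simmat above
m <- matrix(c(1.00, 0.65, 0.48,
              0.65, 1.00, 0.50,
              0.48, 0.50, 1.00),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:3), paste0("doc", 1:3)))

diag(m) <- NA  # blank the self-similarity diagonal (all 1.0)

# corrplot renders NA cells via na.label; hclust ordering still works
# because as.dist() ignores the diagonal
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(m, order = "hclust", tl.col = "black", na.label = " ")
}
```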
Hope Dr. Benoit can offer some guidance on the following feedback from install.packages():
A version of this package for your version of R might be available elsewhere.
I tried different versions of R, but none of the attempts worked.
My apologies -- I missed an "s" in install.packages("quanteda.textstats"). Regarding the following warning messages: I am wondering if Dr. Benoit could provide an example of how to use dfm_remove() in place of the deprecated remove argument. Thank you!
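For the deprecation warning specifically, here is a short sketch of the workflow that replaces the old remove argument of dfm(), assuming quanteda >= 3.0: remove the patterns either at the tokens stage with tokens_remove(), or after building the dfm with dfm_remove(). The two example texts are just placeholders.

```r
library(quanteda)

toks <- tokens(c(d1 = "My Christmas was ruined by your opposition tax plan.",
                 d2 = "Does the United_States or Sweden have more progressive taxation?"),
               remove_punct = TRUE)

# old style (now deprecated): dfm(toks, remove = stopwords("en"))

# option 1: drop the stopwords at the tokens stage, then build the dfm
dfmat1 <- dfm(tokens_remove(toks, stopwords("en")))

# option 2: build the dfm first, then drop the stopword features
dfmat2 <- dfm_remove(dfm(toks), stopwords("en"))

# both routes leave the same feature set
setequal(featnames(dfmat1), featnames(dfmat2))
```

Option 1 is generally preferred when the stopwords should never enter any downstream object; option 2 is handy when you want to keep the full dfm around and derive a reduced copy.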
example("dfm_remove", package = "quanteda")
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
#>
#> dfm_rm> dfmat <- tokens(c("My Christmas was ruined by your opposition tax plan.",
#> dfm_rm+ "Does the United_States or Sweden have more progressive taxation?")) %>%
#> dfm_rm+ dfm(tolower = FALSE)
#>
#> dfm_rm> dict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
#> dfm_rm+ wordsEndingInY = c("by", "my"),
#> dfm_rm+ notintext = "blahblah"))
#>
#> dfm_rm> dfm_select(dfmat, pattern = dict)
#> Document-feature matrix of: 2 documents, 4 features (50.00% sparse) and 0 docvars.
#> features
#> docs My by United_States Sweden
#> text1 1 1 0 0
#> text2 0 0 1 1
#>
#> dfm_rm> dfm_select(dfmat, pattern = dict, case_insensitive = FALSE)
#> Document-feature matrix of: 2 documents, 1 feature (50.00% sparse) and 0 docvars.
#> features
#> docs by
#> text1 1
#> text2 0
#>
#> dfm_rm> dfm_select(dfmat, pattern = c("s$", ".y"), selection = "keep", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 6 features (50.00% sparse) and 0 docvars.
#> features
#> docs My Christmas was by Does United_States
#> text1 1 1 1 1 0 0
#> text2 0 0 0 0 1 1
#>
#> dfm_rm> dfm_select(dfmat, pattern = c("s$", ".y"), selection = "remove", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 14 features (50.00% sparse) and 0 docvars.
#> features
#> docs ruined your opposition tax plan . the or Sweden have
#> text1 1 1 1 1 1 1 0 0 0 0
#> text2 0 0 0 0 0 0 1 1 1 1
#> [ reached max_nfeat ... 4 more features ]
#>
#> dfm_rm> dfm_select(dfmat, pattern = stopwords("english"), selection = "keep", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 9 features (50.00% sparse) and 0 docvars.
#> features
#> docs My was by your Does the or have more
#> text1 1 1 1 1 0 0 0 0 0
#> text2 0 0 0 0 1 1 1 1 1
#>
#> dfm_rm> dfm_select(dfmat, pattern = stopwords("english"), selection = "remove", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 11 features (50.00% sparse) and 0 docvars.
#> features
#> docs Christmas ruined opposition tax plan . United_States Sweden progressive
#> text1 1 1 1 1 1 1 0 0 0
#> text2 0 0 0 0 0 0 1 1 1
#> features
#> docs taxation
#> text1 0
#> text2 1
#> [ reached max_nfeat ... 1 more feature ]
#>
#> dfm_rm> # select based on character length
#> dfm_rm> dfm_select(dfmat, min_nchar = 5)
#> Document-feature matrix of: 2 documents, 7 features (50.00% sparse) and 0 docvars.
#> features
#> docs Christmas ruined opposition United_States Sweden progressive taxation
#> text1 1 1 1 0 0 0 0
#> text2 0 0 0 1 1 1 1
#>
#> dfm_rm> dfmat <- dfm(tokens(c("This is a document with lots of stopwords.",
#> dfm_rm+ "No if, and, or but about it: lots of stopwords.")))
#>
#> dfm_rm> dfmat
#> Document-feature matrix of: 2 documents, 18 features (38.89% sparse) and 0 docvars.
#> features
#> docs this is a document with lots of stopwords . no
#> text1 1 1 1 1 1 1 1 1 1 0
#> text2 0 0 0 0 0 1 1 1 1 1
#> [ reached max_nfeat ... 8 more features ]
#>
#> dfm_rm> dfm_remove(dfmat, stopwords("english"))
#> Document-feature matrix of: 2 documents, 6 features (25.00% sparse) and 0 docvars.
#> features
#> docs document lots stopwords . , :
#> text1 1 1 1 1 0 0
#> text2 0 1 1 1 2 1
#>
#> dfm_rm> toks <- tokens(c("this contains lots of stopwords",
#> dfm_rm+ "no if, and, or but about it: lots"),
#> dfm_rm+ remove_punct = TRUE)
#>
#> dfm_rm> fcmat <- fcm(toks)
#>
#> dfm_rm> fcmat
#> Feature co-occurrence matrix of: 12 by 12 features.
#> features
#> features this contains lots of stopwords no if and or but
#> this 0 1 1 1 1 0 0 0 0 0
#> contains 0 0 1 1 1 0 0 0 0 0
#> lots 0 0 0 1 1 1 1 1 1 1
#> of 0 0 0 0 1 0 0 0 0 0
#> stopwords 0 0 0 0 0 0 0 0 0 0
#> no 0 0 0 0 0 0 1 1 1 1
#> if 0 0 0 0 0 0 0 1 1 1
#> and 0 0 0 0 0 0 0 0 1 1
#> or 0 0 0 0 0 0 0 0 0 1
#> but 0 0 0 0 0 0 0 0 0 0
#> [ reached max_feat ... 2 more features, reached max_nfeat ... 2 more features ]
#>
#> dfm_rm> fcm_remove(fcmat, stopwords("english"))
#> Feature co-occurrence matrix of: 3 by 3 features.
#> features
#> features contains lots stopwords
#> contains 0 1 1
#> lots 0 0 1
#>   stopwords        0    0         0
Created on 2023-09-17 with reprex v2.0.2
Thank you so much for your timely guidance, Dr. Benoit!
Gratefully,
JJ Wang, Ph.D.
Professor
Advanced Educational Studies Department
(661) 654-3048
California State University, Bakersfield
9001 Stockdale Hwy, Mail Stop: 22 EDUC
Bakersfield, CA 93311
https://www.csub.edu/aes