feature coocurrence as a graph -> fcg or fcm_graph ? #3

Open
aourednik opened this issue Mar 9, 2018 · 4 comments

Comments

@aourednik

Many thanks for developing quanteda! The fcm feature is fast and very useful. This is an enhancement proposal.
Trying to apply textplot_network() or as.network() to a large fcm triggers an error: "fcm is too large for a network plot". This makes sense, since a visualization would take too many resources. But it would be nice to be able to convert the whole sparse matrix to a graph for graph-oriented processing. Since the fcm seems to be of class dgTMatrix, it should be possible to convert it to a graph with T2graph(), as documented here: https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/graph2T.html
This triggers an error, though. igraph's function graph_from_adjacency_matrix() makes the conversion possible.
But this solution takes us out of the context of quanteda. Maybe an fcg or fcm_graph object would be a useful new feature, where word co-occurrence would be a graph object instead of a sparse matrix. Features and their links could then be annotated with extra metadata. fcm_graph could also offer export functions to gml or json, for direct interoperability with Cytoscape, Gephi or D3. The following solution does this. Would a tighter integration of this functionality in quanteda be possible?

library("Matrix")
library("igraph")
library("data.table") # provides %chin%
# txtfcm is an fcm object produced with fcm(tokens(corpus(readtext(set_of_txt_files))))
# txtfcm.graph <- T2graph(txtfcm) # triggers an error: no slot of name "j" for this object of class "fcm"
txtfcm.graph <- graph_from_adjacency_matrix(txtfcm, weighted = TRUE) # works
V(txtfcm.graph)$freq <- rowSums(txtfcm) # gives word frequencies if the fcm is not weighted
# Examples of what a graph approach makes possible:
txtfcm.graph <- simplify(txtfcm.graph) # remove loops and duplicate edges
txtfcm.graph <- delete_edges(txtfcm.graph, which(E(txtfcm.graph)$weight < 0.5)) # delete weak co-occurrences
txtfcm.graph <- delete_vertices(txtfcm.graph, which(V(txtfcm.graph)$freq < 3)) # delete low-frequency words
# associate attributes with each vertex (words.fr.positive and words.fr.negative are character vectors of words)
V(txtfcm.graph)$negative <- V(txtfcm.graph)$name %chin% words.fr.negative
V(txtfcm.graph)$positive <- V(txtfcm.graph)$name %chin% words.fr.positive
write_graph(txtfcm.graph, "my_graph.gml", format = "gml") # export the graph to gml for use in Cytoscape or Gephi
@aourednik
Author

Here is a co-occurrence visualisation example produced with this approach. The fcm is converted to a graph, exported to gml, and post-processed in Cytoscape; Gephi also imports gml. A Fruchterman-Reingold force-directed layout has been applied. Sizes show feature frequencies. (The source is a set of diplomatic reports of the Swiss embassy in Stockholm around 1930.)
my_graph_stockholm1928_1932 gml_3

@koheiw
Collaborator

koheiw commented Mar 12, 2018

Hi @aourednik, thank you for the suggestion and the beautiful plot. The initial version of textplot_network() was actually based on igraph, but a bug in it (which might have been fixed by now) prevented us from using it in our package (we also preferred to base our visualization functions on ggplot2). We were sure that network analysis experts like you would find out how to convert an FCM into a set of edges.

I ran your code and noticed that T2graph() works with an FCM if it is coerced to triplets by as(txtfcm, 'dgTMatrix'), but it depends on the graph package, which is not on CRAN. So we could make a thin wrapper function:

as.igraph.fcm <- function(x) {
    igraph::graph_from_adjacency_matrix(x)
}

if you think this is useful. What kind of metadata do you want to pass to an igraph object? I was thinking of adding information about overall word frequency to the FCM constructor, but I don't know what more I can do than that. We might have a lot of metadata for documents, but not for features.
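
A slightly fuller version of the wrapper above might look like the sketch below. It is only an illustration under assumptions: `fcm_to_igraph()` and its `freq` argument are hypothetical names, not quanteda functions, and a small `Matrix` sparse matrix stands in for a real fcm so that the example runs without quanteda. It uses `mode = "max"` so that both upper-triangular and symmetric co-occurrence matrices yield an undirected weighted graph.

```r
library(Matrix)
library(igraph)

# Hypothetical helper (not part of quanteda): convert a square sparse
# co-occurrence matrix to an undirected, weighted igraph object and
# optionally store a per-feature frequency as a vertex attribute.
fcm_to_igraph <- function(x, freq = NULL) {
  stopifnot(nrow(x) == ncol(x), !is.null(rownames(x)))
  g <- graph_from_adjacency_matrix(x, mode = "max",
                                   weighted = TRUE, diag = FALSE)
  if (!is.null(freq)) {
    # look up by feature name so the order of `freq` does not matter
    V(g)$frequency <- unname(freq[V(g)$name])
  }
  g
}

# Toy stand-in for an fcm: upper-triangular co-occurrence counts
feats <- c("war", "peace", "treaty", "trade")
m <- sparseMatrix(i = c(1, 1, 2, 3), j = c(2, 3, 3, 4), x = c(4, 1, 3, 2),
                  dims = c(4, 4), dimnames = list(feats, feats))
freq <- c(war = 10, peace = 7, treaty = 5, trade = 3)

g <- fcm_to_igraph(m, freq)
```

Because the frequency is attached by name, it stays with its vertex when the graph is later filtered or subset.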

@aourednik
Author

aourednik commented Mar 12, 2018

Hi @koheiw, thank you for your answer. For feature metadata, I was thinking of frequency, and of topic or sentiment based on a dictionary, as with dfm_lookup() for a dfm. The aim is to be able to color and size the word nodes in a visualization.
Original frequencies stored as word-level metadata would also be useful when generating a weighted-by-distance fcm, since rowSums of the fcm matrix then yields floating point numbers and the original word occurrence counts in the overall tokens object are "lost".

The idea would be to be able to do something like the code below but with less code, by having things wrapped in a fcg or fcm_graph object.

library("data.table") # provides as.data.table(), setkey() and %chin%
feat <- names(topfeatures(txtfcm, 200)) # (in the example, I limit to 200 nodes, but it would be great if the graph object could contain as many nodes as the fcm can contain features)
topfeat <- fcm_select(txtfcm, feat, verbose = FALSE)
samplefreqs <- as.data.table(textstat_frequency(dfm(topfeat)))
setkey(samplefreqs,"feature")
vsize <- sapply(rownames(topfeat),function(x){return(sqrt(samplefreqs[x]$frequency))}) # alternative: vsize <- sqrt(rowSums(topfeat)) but when the frequencies are weighted by distance, this result is weird
vcolor <- sapply(rownames(topfeat),function(x){
  if (x %chin% words.fr.joy) {return("red")} else return("black")})
textplot_network(topfeat,min_freq = 0.5, vertex_color = vcolor, vertex_size=vsize / max(vsize) * 7)

In the last line, when I call textplot_network(), I basically rely on vsize, vcolor, and topfeat having the same number and order of rows. If I create a new subset with fcm_select(), I need to rerun all lines of code. Among the abilities of an fcm_graph object, I was imagining a function like fcm_select(fcm_graph, feat) that would be able to create a subset while retaining the associated word-level metadata.
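
To make the idea concrete, here is a minimal sketch of such a container, written without quanteda so that it runs on its own. Everything in it is an assumption made for illustration: the constructor `fcm_graph()`, the selector `fcm_graph_select()`, and the `feature`/`frequency` columns are hypothetical and do not exist in quanteda. The point is only that the matrix and the feature-level metadata are subset together, by name.

```r
library(Matrix)

# Hypothetical constructor: bundle a co-occurrence matrix with a data frame
# of feature-level metadata, checking that both are in the same order.
fcm_graph <- function(x, meta) {
  stopifnot(identical(rownames(x), colnames(x)),
            identical(rownames(x), meta$feature))
  structure(list(fcm = x, meta = meta), class = "fcm_graph")
}

# Hypothetical selector: subset matrix and metadata in one step
fcm_graph_select <- function(x, features) {
  keep <- x$meta$feature %in% features
  fcm_graph(x$fcm[keep, keep, drop = FALSE],
            x$meta[keep, , drop = FALSE])
}

# Toy data standing in for an fcm and its word-level metadata
feats <- c("war", "peace", "treaty", "trade")
m <- sparseMatrix(i = c(1, 1, 2, 3), j = c(2, 3, 3, 4), x = c(4, 1, 3, 2),
                  dims = c(4, 4), dimnames = list(feats, feats))
meta <- data.frame(feature = feats, frequency = c(10, 7, 5, 3),
                   stringsAsFactors = FALSE)

fg <- fcm_graph(m, meta)
sub <- fcm_graph_select(fg, c("war", "treaty"))
```

After the selection, `sub$meta$frequency` still lines up with the rows of `sub$fcm`, so sizes and colors derived from the metadata would not need to be recomputed by hand.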

The fcm_graph object could also have write_to_gml(), write_to_d3json() and/or as.igraph() methods.
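
As a rough sketch of what a JSON export could look like once the data is in igraph, the function below writes a node-link structure of the kind commonly used with D3 force layouts. The function name `write_to_d3json()` is taken from the wish list above and is hypothetical; the `id`/`source`/`target` field names and the use of the jsonlite package are assumptions, not something the thread specifies. GML export needs no extra code, since igraph's write_graph() already handles it.

```r
library(igraph)
library(jsonlite)

# Hypothetical helper: write an igraph object as node-link JSON
write_to_d3json <- function(g, path) {
  parts <- igraph::as_data_frame(g, what = "both")
  nodes <- parts$vertices
  rownames(nodes) <- NULL                       # keep row names out of the JSON
  names(nodes)[names(nodes) == "name"] <- "id"
  links <- parts$edges
  names(links)[1:2] <- c("source", "target")    # first two columns are from/to
  json <- toJSON(list(nodes = nodes, links = links),
                 dataframe = "rows", auto_unbox = TRUE)
  writeLines(json, path)
  invisible(path)
}

# Toy graph with a vertex attribute (frequency) and an edge attribute (weight)
g <- graph_from_data_frame(
  data.frame(from = c("war", "war", "peace"),
             to = c("peace", "treaty", "treaty"),
             weight = c(4, 1, 3)),
  directed = FALSE,
  vertices = data.frame(name = c("war", "peace", "treaty"),
                        frequency = c(10, 7, 5))
)
out <- write_to_d3json(g, tempfile(fileext = ".json"))
```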

@aourednik
Author

Another use I would see for a graph approach would be to select the co-occurrence neighborhood of specific words. For example, to select all negative words and their immediate neighbors, one needs to do something like this after conversion to igraph:

v_of_interest <- which(V(txtfcm.graph)$negative)
txtfcm.graph <- subgraph.edges(txtfcm.graph,E(txtfcm.graph)[inc(v_of_interest)])

Actually, the result is unsatisfying: the problem is that igraph does not provide a function for making a coherent subgraph based on a set of nodes. make_ego_graph() could be expected to do this, but it does not, as described here. Currently, only Cytoscape does this as needed.

Something like this would be great (with imaginary functions):
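
For reference, one way this might be approximated in igraph is to collect the first-order neighborhoods with ego() and pass the union of their vertices to induced_subgraph(). The sketch below runs on a toy graph with a made-up `negative` attribute; whether the result matches the "coherent" subgraph wanted here, or what Cytoscape produces, is an assumption that the thread does not settle.

```r
library(igraph)

# Toy co-occurrence graph; "war" is the only word flagged as negative
g <- graph_from_data_frame(
  data.frame(from = c("war", "war", "peace", "trade", "ship"),
             to = c("peace", "treaty", "treaty", "ship", "port")),
  directed = FALSE
)
V(g)$negative <- V(g)$name %in% c("war")

v_of_interest <- which(V(g)$negative)

# ego() returns one vertex sequence per seed: the seed plus its neighbors
hood <- ego(g, order = 1, nodes = v_of_interest)

# Union of all neighborhoods, as vertex names
keep <- unique(unlist(lapply(hood, as_ids)))

# Keep those vertices and every edge among them; attributes are retained
sub <- induced_subgraph(g, keep)
```

Unlike subgraph.edges(), this also keeps edges that link two neighbors of a seed word to each other, not only the edges incident to the seed.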

myfcgraph <- fcm_graph(mytokens) # pre-assigns "name" and the marginal frequency "frequency" as node-level attributes
# assuming that "polarity" is a data.table with two columns "word" and "pol" 
setkey(polarity,word) 
V(myfcgraph)$polarity <- sapply(V(myfcgraph)$name,function(x){return(polarity[x]$pol)}) 
v_of_interest <- myfcgraph[polarity != 0]
feat <- neighbors(v_of_interest,1) # gets first-order neighbors, like the ego() function in igraph 
myfcgraph <- fcm_graph_select(myfcgraph,feat)
textplot_network(myfcgraph,min_freq = 10, vertex_color = polarity, vertex_size=sqrt(frequency))

The code would yield something like this:

my_graph_stockholm_1928_1932_limited gml_1 1

koheiw referenced this issue in quanteda/quanteda Jun 7, 2018
- It works in the same way as as.network for #1260
- Both as.network() and as.igraph() return marginal feature frequency as vertex attribute
@kbenoit kbenoit transferred this issue from quanteda/quanteda Nov 28, 2020