feature co-occurrence as a graph -> fcg or fcm_graph ? #3
Comments
Here is a co-occurrence visualisation example produced with this approach. The fcm is converted to a graph, exported to GML, and post-processed in Cytoscape; Gephi also imports GML. A Fruchterman-Reingold force-directed layout has been applied. Node sizes show feature frequencies. (The source is a set of diplomatic reports of the Swiss embassy in Stockholm around 1930.)
Hi @aourednik, thank you for the suggestion and the beautiful plot. The initial version of … I ran your code and I noticed that …
as.igraph.fcm <- function(x) {
    igraph::graph_from_adjacency_matrix(x)
}
if you think this is useful. What kind of metadata do you want to pass to an igraph object? I was thinking of adding information about overall word frequency to the FCM constructor, but I don't know what I can do beyond that. We might have a lot of metadata for documents, but not for features.
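A minimal sketch of how such a conversion could carry a feature-level frequency along as a vertex attribute. The toy matrix only stands in for an fcm (assumed here to be upper triangular with feature names as dimnames); the helper `as_igraph_sketch()`, the `frequency` attribute name and the numbers are hypothetical and not part of quanteda.

```r
library(Matrix)
library(igraph)

# Toy stand-in for an fcm: upper-triangular sparse co-occurrence counts.
feats <- c("war", "peace", "trade")
m <- sparseMatrix(i = c(1, 1, 2), j = c(2, 3, 3), x = c(4, 1, 2),
                  dims = c(3, 3), dimnames = list(feats, feats))

# Hypothetical helper: convert the matrix to an undirected weighted graph
# and attach one frequency value per feature as a vertex attribute.
as_igraph_sketch <- function(x, freq) {
  g <- graph_from_adjacency_matrix(x, mode = "upper", weighted = TRUE, diag = FALSE)
  V(g)$frequency <- unname(freq[V(g)$name])
  g
}

freq <- c(war = 10, peace = 6, trade = 3)  # made-up marginal frequencies
g <- as_igraph_sketch(m, freq)
```

Looking the frequencies up by vertex name, rather than by position, keeps the attribute aligned with the features even if the two inputs are ordered differently.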
Hi @Koheiv, thank you for your answer. For feature metadata, I was thinking of frequency, and of topic or sentiment based on a dictionary, as with dfm_lookup() for a dfm. This would make it possible to color and size the word nodes in a visualization. The idea would be to do something like the code below, but with less code, by having things wrapped in a fcg or fcm_graph object.
# In the example I limit the graph to 200 nodes, but it would be great if the
# graph object could contain as many nodes as the fcm can contain features.
feat <- names(topfeatures(txtfcm, 200))
topfeat <- fcm_select(txtfcm, feat, verbose = FALSE)
samplefreqs <- as.data.table(textstat_frequency(dfm(topfeat)))
setkey(samplefreqs, "feature")
# Alternative: vsize <- sqrt(rowSums(topfeat)), but when the frequencies are
# weighted by distance, this result is weird.
vsize <- sapply(rownames(topfeat), function(x) sqrt(samplefreqs[x]$frequency))
vcolor <- sapply(rownames(topfeat), function(x) {
    if (x %chin% words.fr.joy) "red" else "black"
})
textplot_network(topfeat, min_freq = 0.5, vertex_color = vcolor, vertex_size = vsize / max(vsize) * 7)
In the last line, when I call textplot_network(), I basically rely on vsize, vcolor, and topfeat having the same number and order of rows. If I create a new subset with fcm_select(), I need to rerun all lines of code. Among the abilities of a fcm_graph object, I was imagining a function like fcm_select(fcm_graph, feat) that would be able to create a subset while retaining the associated word-level metadata. The fcm_graph object could also have a write_to_gml(), write_to_d3json() and/or as.igraph() method.
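To illustrate the point about keeping metadata aligned with a subset, here is a small sketch using plain igraph rather than any quanteda object. The words, frequencies and the `joy` flag are made up for illustration. Once the metadata lives on the vertices, `induced_subgraph()` carries it along, so colors and sizes can be derived from the subset itself.

```r
library(igraph)

# Toy co-occurrence graph with feature-level metadata stored on the vertices.
edges <- data.frame(from = c("joie", "joie", "guerre"),
                    to = c("paix", "guerre", "crise"),
                    weight = c(3, 1, 5))
nodes <- data.frame(name = c("joie", "paix", "guerre", "crise"),
                    frequency = c(12, 7, 20, 9),
                    joy = c(TRUE, FALSE, FALSE, FALSE))
g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)

# Subset by feature name; the vertex attributes travel with the vertices.
sub <- induced_subgraph(g, vids = c("joie", "guerre", "crise"))

# Derive plotting parameters from the subset, so they cannot get out of sync.
vcolor <- ifelse(V(sub)$joy, "red", "black")
vsize <- sqrt(V(sub)$frequency)
```

With this layout there is no separate vsize or vcolor vector to recompute by hand after each new selection.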
Another use I would see for a graph approach would be to select the co-occurrence neighborhood of specific words. For example, to select all negative words and their immediate neighbors, one needs to do something like this after conversion to igraph:
v_of_interest <- which(V(txtfcm.graph)$negative)
txtfcm.graph <- subgraph.edges(txtfcm.graph, E(txtfcm.graph)[inc(v_of_interest)])
Actually, the result is unsatisfying: the problem is that igraph does not provide a function for making a coherent subgraph based on a set of nodes. One could expect make_ego_graph() to do this, but it does not, as described here. Currently, only Cytoscape does this as needed. Something like this would be great (with imaginary functions):
# pre-assigns "name" and the marginal frequency "frequency" as node-level attributes
myfcgraph <- fcm_graph(mytokens)
# assuming that "polarity" is a data.table with two columns "word" and "pol"
setkey(polarity, word)
V(myfcgraph)$polarity <- sapply(V(myfcgraph)$name, function(x) polarity[x]$pol)
v_of_interest <- myfcgraph[polarity != 0]
# gets first-order neighbors, like the ego() function in igraph
feat <- neighbors(v_of_interest, 1)
myfcgraph <- fcm_graph_select(myfcgraph, feat)
textplot_network(myfcgraph, min_freq = 10, vertex_color = polarity, vertex_size = sqrt(frequency))
The code would yield something like this
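For comparison, one possible way to approximate this selection with existing igraph functions is to collect the first-order neighborhoods with `ego()` and then take the subgraph induced by those vertices with `induced_subgraph()`. The toy graph and its `polarity` values are made up, and whether the result matches the Cytoscape behavior mentioned above is an assumption that has not been checked here.

```r
library(igraph)

# Toy path graph a - b - c - d - e with a made-up polarity attribute.
edges <- data.frame(from = c("a", "b", "c", "d"),
                    to = c("b", "c", "d", "e"))
nodes <- data.frame(name = c("a", "b", "c", "d", "e"),
                    polarity = c(0, 0, -1, 0, 0))
g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)

# Vertices of interest: every word with a non-zero polarity.
v_of_interest <- V(g)[polarity != 0]

# Names of those vertices plus their immediate neighbors.
keep <- unique(unlist(lapply(ego(g, order = 1, nodes = v_of_interest), as_ids)))

# Subgraph induced by the selection; it keeps all edges among the kept
# vertices and their attributes.
sub <- induced_subgraph(g, vids = keep)
```

Unlike an edge-based selection, the induced subgraph also retains links between two neighbors of a word of interest, which may or may not be what is wanted.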
- It works in the same way as as.network for #1260
- Both as.network() and as.igraph() return marginal feature frequency as a vertex attribute
Many thanks for developing quanteda! The fcm feature is fast and very useful. This is an enhancement proposal.
Trying to apply textplot_network() or as.network() to a large fcm triggers an error: "fcm is too large for a network plot". This makes sense, since a visualization would take too many resources. But it would be nice to be able to convert the whole sparse matrix to a graph for graph-oriented processing. Since the fcm seems to be of class dgTMatrix, it should be possible to convert it to a graph with T2graph(), as documented here: https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/graph2T.html
This triggers an error, though. igraph's function graph_from_adjacency_matrix() makes it possible.
But this solution makes us leave the context of quanteda. Maybe a fcg or fcm_graph object would be a useful new feature, where word co-occurrence would be a graph object instead of a sparse matrix. Features and their links could then be annotated with extra metadata. fcm_graph could also offer export functions to GML or JSON, for direct interoperability with Cytoscape, Gephi or D3. The following solution does this. Would a tighter integration of this functionality in quanteda be possible?
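As a rough sketch of the export step, igraph's `write_graph()` can already write GML once the co-occurrences are held in a graph object. The toy graph, its `frequency` attribute and the temporary file are placeholders for illustration; a real fcm would first need to be converted to a graph.

```r
library(igraph)

# Toy weighted co-occurrence graph with a made-up frequency attribute.
edges <- data.frame(from = c("war", "war"),
                    to = c("peace", "trade"),
                    weight = c(4, 1))
nodes <- data.frame(name = c("war", "peace", "trade"),
                    frequency = c(10, 6, 3))
g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)

# Write the graph, including vertex and edge attributes, to a GML file.
gml_file <- tempfile(fileext = ".gml")
write_graph(g, gml_file, format = "gml")

# Read it back to check that the structure survived the round trip.
g2 <- read_graph(gml_file, format = "gml")
```

The resulting file is the kind of GML input that the thread describes importing into Cytoscape or Gephi.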