Word Cloud generating different clouds from the same data #149

scottkleinman · 2015-05-22T23:48:43Z

I have been sent some data from State of the Union speeches that is not resulting in consistent word clouds in the Word Cloud tool. Each time the graph is rendered, the JSON object in the dataset variable is the same, but the generated SVG code differs in which words are displayed. To give examples:

Graph 1: states: 6911, congress: 5862, etc.
Graph 2: government: 7549, states: 6911, congress: 5862, etc.

As you can see, Graph 1 skipped "government". I have submitted the same data to Jason Davies' original Word Cloud Generator (http://www.jasondavies.com/wordcloud/), and it does not produce the same effect. So something is going on in our rendering of the graph.

scottkleinman · 2015-05-23T01:47:07Z

Some extra information. I added console.log(cloud.length) after line 105 of scripts_wordcloud.js and then ran Word Cloud in Firefox. It consistently tells me that my data cloud has 27880 objects but it also flashes up a number of repeats, which is inconsistent. From what I can tell, "government" only shows up in the graph if the number of repeats is higher. I'm not sure what the repeats are (not words in the dataset--I've checked that).

scottkleinman · 2015-05-24T01:21:22Z

I have now identified the problem. High-frequency words are dropped if they cannot fit within the layout. Sometimes regenerating the cloud will reveal the words, since each new cloud has a different layout, but not always. Switching from the default log n scale to √n (sqrt) or n (linear) improves the results but still does not display high-frequency words 100% of the time. I have found that adjusting the size of the word cloud and the scale of the contents fits more words, but even that does not guarantee that every word will be included for every data set. And we'd have to build in user-defined size/scale functions.
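To illustrate why the choice of scale matters, here is a minimal sketch of how a count might be mapped to a font size under each of the three scales. The `fontSize` helper and the 10 to 100 px range are assumptions for illustration, not code from scripts_wordcloud.js.

```javascript
// Hypothetical helper: maps a word count to a font size in pixels.
// minSize and maxSize are assumed values, not taken from scripts_wordcloud.js.
// Counts must be >= 1 for the log scale to be defined.
function fontSize(count, minCount, maxCount, scale, minSize = 10, maxSize = 100) {
  const transforms = {
    log: Math.log,
    sqrt: Math.sqrt,
    linear: (x) => x,
  };
  const f = transforms[scale];
  if (!f) throw new Error(`Unknown scale: ${scale}`);
  if (maxCount === minCount) return maxSize;
  // Normalise the transformed count to [0, 1], then stretch to the size range.
  const t = (f(count) - f(minCount)) / (f(maxCount) - f(minCount));
  return minSize + t * (maxSize - minSize);
}

// A word with count 100 in a data set whose counts run from 1 to 7549.
for (const scale of ["log", "sqrt", "linear"]) {
  console.log(scale, fontSize(100, 1, 7549, scale).toFixed(1));
}
```

Under these assumptions the mid-frequency word comes out at roughly 56, 19, and 11 px respectively. The log scale keeps mid-frequency words large, so they take up space that the highest-frequency words then cannot find; sqrt and linear shrink them and leave more room.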

I think a better approach is to experiment with ideal scaling for different sized data sets and autodetect the best fit for the data. That would produce consistently better word clouds, but would not entirely solve the problem. In the Margins should discuss the implications of this limitation and direct users to BubbleViz, which is probably better suited to visualising large amounts of data.

Discussion of the issue can be found at the following links:

jasondavies/d3-cloud#36
jasondavies/d3-cloud#19
jasondavies/d3-cloud#17

scottkleinman · 2015-05-24T01:33:10Z

Belatedly, another idea is to include a small window where the user can scroll through a word count table. If the most frequent word(s) in the table do not appear in the cloud, then the user will immediately know and they can turn to In the Margins to find out why.

scottkleinman · 2015-05-26T19:31:36Z

A word counts table has been added to Word Cloud so that the user can see if a word has been omitted. Discussion of this issue and ways to customise the layout need to go in In the Margins. But, short of modifying the d3 algorithm, there's not much more we can do, so I'm closing this issue for now.
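The check the table supports by eye could also be done in code. The sketch below compares the input data with the words the layout actually placed. It assumes the placed words are available as an array (d3-cloud's README describes the "end" listener as receiving one), and the `text` and `count` property names are illustrative.

```javascript
// Returns the input words that are missing from the placed words,
// sorted by descending count so the most important omissions come first.
// `text` and `count` are assumed property names for illustration.
function findOmittedWords(inputWords, placedWords) {
  const placed = new Set(placedWords.map((w) => w.text));
  return inputWords
    .filter((w) => !placed.has(w.text))
    .sort((a, b) => b.count - a.count);
}

const input = [
  { text: "government", count: 7549 },
  { text: "states", count: 6911 },
  { text: "congress", count: 5862 },
];
// Placed words as in Graph 1 above, where "government" was skipped.
const placed = [{ text: "states" }, { text: "congress" }];

const omitted = findOmittedWords(input, placed);
console.log(omitted.map((w) => w.text)); // [ 'government' ]
```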

mleblanc321 · 2015-05-26T19:51:36Z

nice job

scottkleinman · 2016-05-21T16:52:38Z

As far as I can tell, there has been no progress on this in d3.layout.cloud(). We might try implementing one of the workarounds in the discussion at jasondavies/d3-cloud#36. Failing that, it might be good to go one better than my previous solution by automatically displaying a table with the 3 most frequent words and the 3 longest words. That way, the user can easily see if their word cloud is faulty.
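A minimal sketch of how the two lists for such a table might be picked, assuming the data is an array of objects with `text` and `count` properties. The `sentinelWords` name is hypothetical, and the counts for the two longest words in the example are made up.

```javascript
// Picks the n most frequent and the n longest words from the data set.
// The input array is copied before sorting, so it is not mutated.
function sentinelWords(words, n = 3) {
  const byCount = [...words].sort((a, b) => b.count - a.count);
  const byLength = [...words].sort((a, b) => b.text.length - a.text.length);
  return {
    mostFrequent: byCount.slice(0, n).map((w) => w.text),
    longest: byLength.slice(0, n).map((w) => w.text),
  };
}

const data = [
  { text: "states", count: 6911 },
  { text: "administration", count: 120 }, // made-up count
  { text: "government", count: 7549 },
  { text: "unemployment", count: 85 }, // made-up count
  { text: "congress", count: 5862 },
];

const table = sentinelWords(data);
console.log(table.mostFrequent); // [ 'government', 'states', 'congress' ]
console.log(table.longest); // [ 'administration', 'unemployment', 'government' ]
```

Long words matter here as well as frequent ones, because a long word at a modest font size can be as hard to place as a short word at a large one.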

Also, the problem may be more extreme in multicloud since the layout area is much smaller. I'm not exactly sure how we deal with this.

kreddy95 · 2016-06-30T16:46:43Z

Is this still a bug?

scottkleinman · 2016-06-30T17:46:02Z

Alas, yes. I've added a new In the Margins label to remind us to document this.

scottkleinman · 2016-07-15T16:23:52Z

I have added this issue to our In the Margins notes. Since it doesn't seem like this problem will be addressed in d3.js any time soon, I think we can close this issue.

scottkleinman added the bug label May 22, 2015

scottkleinman closed this as completed May 26, 2015

cesine mentioned this issue Jun 24, 2015

Fit all terms by recursively decreasing the size until it fits cesine/d3-cloud#74

Merged
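The idea named in that pull request title can be sketched as a retry loop. This is an iterative illustration written for this thread, not code from the pull request: `fitAllWords` and the `layout(words, maxFontSize)` callback are hypothetical names, and the stand-in layout below only imitates a real one.

```javascript
// Re-runs the layout with a smaller maximum font size until every word is
// placed or the size would drop below minFontSize.
// `layout(words, maxFontSize)` is a hypothetical callback that performs one
// layout pass and returns the array of words it managed to place.
function fitAllWords(words, layout, maxFontSize = 100, minFontSize = 10, shrink = 0.9) {
  let size = maxFontSize;
  let placed = layout(words, size);
  while (placed.length < words.length && size * shrink >= minFontSize) {
    size *= shrink;
    placed = layout(words, size);
  }
  return { placed, maxFontSize: size, complete: placed.length === words.length };
}

// Stand-in layout for demonstration only: a word "fits" if its estimated
// width (0.6 * font size per character) is at most 400 px.
const fakeLayout = (words, size) =>
  words.filter((w) => w.text.length * size * 0.6 <= 400);

const result = fitAllWords([{ text: "government" }, { text: "states" }], fakeLayout);
console.log(result.complete, result.maxFontSize.toFixed(2));
```

The `complete` flag still matters, because with a small enough layout area (as in multicloud) some words may not fit even at the minimum size.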

cesine added a commit to cesine/d3-cloud that referenced this issue Jun 24, 2015

test for ability to create same clouds using same random seed

aa92441

for jasondavies#1 jasondavies#8 jasondavies#14 jasondavies#35 jasondavies#45 WheatonCS/Lexos#149
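Reproducible clouds of the kind that commit message describes depend on a random number generator that can be seeded, which `Math.random` cannot. Below is a small seeded generator (the well-known mulberry32 algorithm) as a sketch. Whether the d3-cloud version in use accepts a replacement generator (the d3-cloud README documents a `random` setter) is an assumption to verify.

```javascript
// mulberry32: a small seeded pseudo-random number generator.
// Returns a function that yields numbers in [0, 1), like Math.random,
// but the sequence is fully determined by the seed.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two generators with the same seed produce the same sequence, so a layout
// driven by them would make the same placement decisions on every run.
const first = mulberry32(149);
const second = mulberry32(149);
const runA = Array.from({ length: 5 }, () => first());
const runB = Array.from({ length: 5 }, () => second());
console.log(runA.every((value, i) => value === runB[i])); // true
```

A fixed seed makes the output repeatable, but it would not by itself stop words being dropped: a word omitted under a given seed would be omitted every time.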

scottkleinman reopened this May 21, 2016

scottkleinman added In the Margins and removed In the Margins labels Jul 14, 2016

scottkleinman closed this as completed Jul 15, 2016

scottkleinman mentioned this issue Mar 16, 2024

some inconsistencies in generating word clouds? #1073

Open

jackMurray20 assigned jackMurray20 and unassigned jackMurray20 Mar 20, 2024
