Word Cloud generating different clouds from the same data #149

Closed
scottkleinman opened this issue May 22, 2015 · 9 comments

@scottkleinman
Contributor

I have been sent some data from State of the Union speeches that is not resulting in consistent word clouds in the Word Cloud tool. Each time the graph is rendered, the JSON object in the dataset variable is the same, but the generated SVG code differs in which words are displayed. To give examples:

Graph 1: states: 6911, congress: 5862, etc.
Graph 2: government: 7549, states: 6911, congress: 5862, etc.

As you can see, Graph 1 skipped "government". I have submitted the same data to Jason Davies' original Word Cloud Generator (http://www.jasondavies.com/wordcloud/), and it does not produce the same effect. So something is going on in our rendering of the graph.

@scottkleinman
Contributor Author

Some extra information. I added console.log(cloud.length) after line 105 of scripts_wordcloud.js and then ran Word Cloud in Firefox. It consistently tells me that my data cloud has 27880 objects, but it also flashes up a number of repeats, and that number is inconsistent from run to run. From what I can tell, "government" only shows up in the graph if the number of repeats is higher. I'm not sure what the repeats are (they are not words in the dataset; I've checked that).

@scottkleinman
Contributor Author

I have now identified the problem. High frequency words are dropped if they cannot fit within the layout. Sometimes re-generating the cloud will reveal the words, since each new cloud has a different layout, but not always. Switching from the default log n scale to √n (sqrt) and then to n (linear) improves the results, but high frequency words are still not displayed 100% of the time. I have found that adjusting the size of the word cloud and the scale of the contents fits more words, but even that is not a guarantee that every word will be included in every data set. And we'd have to build in user-defined size/scale functions.
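To make the effect of the three scales concrete, here is a minimal sketch of how a word count might be mapped to a font size under each one. The `fontSize` helper, the 10 to 100 pixel range, and the low-frequency count of 100 are assumptions for illustration only; this is not the code in scripts_wordcloud.js, and it assumes every count is at least 1.

```javascript
// Hypothetical helper (not from scripts_wordcloud.js): map a raw word count
// to a font size in pixels using one of the three scales discussed above.
function fontSize(count, maxCount, scale, minPx = 10, maxPx = 100) {
  const transforms = {
    log: Math.log,      // log n
    sqrt: Math.sqrt,    // √n
    linear: (n) => n,   // n
  };
  const f = transforms[scale];
  if (!f) throw new Error("unknown scale: " + scale);
  const top = f(maxCount);
  // Guard against a zero denominator (e.g. log 1 = 0).
  if (top === 0) return maxPx;
  // The most frequent word gets maxPx; every other word is scaled below it.
  return minPx + (f(count) / top) * (maxPx - minPx);
}

// 7549 and 5862 come from the example data; 100 is an assumed rare word.
for (const scale of ["log", "sqrt", "linear"]) {
  const sizes = [7549, 5862, 100].map((n) => Math.round(fontSize(n, 7549, scale)));
  console.log(scale, sizes);
}
```

Under the log scale the rarer words stay much closer in size to the most frequent word than under sqrt or linear, so more of the canvas is taken up by large text. That may be part of why the layout runs out of room for some high frequency words.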

I think a better approach is to experiment with ideal scaling for different sized data sets and autodetect the best fit for the data. That would produce consistently better word clouds, but would not entirely solve the problem. In the Margins should discuss the implications of this limitation and direct users to BubbleViz, which is probably better suited to visualising large amounts of data.

Discussion of the issue can be found at the following links:

jasondavies/d3-cloud#36
jasondavies/d3-cloud#19
jasondavies/d3-cloud#17

@scottkleinman
Contributor Author

Belatedly, another idea is to include a small window where the user can scroll through a word count table. If the most frequent word(s) in the table do not appear in the cloud, then the user will immediately know and they can turn to In the Margins to find out why.
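The same comparison could also be done in code rather than by eye. The sketch below assumes the layout reports which words it managed to place (I believe d3-cloud passes the placed words to its "end" callback, but treat that as an assumption). The `findOmittedWords` helper and its argument shapes are hypothetical.

```javascript
// Hypothetical helper: list the words that were in the data but not drawn.
// `counts` maps word -> frequency; `placedWords` is an array of the word
// strings that ended up in the cloud.
function findOmittedWords(counts, placedWords) {
  const placed = new Set(placedWords);
  return Object.entries(counts)
    .filter(([word]) => !placed.has(word))
    .sort((a, b) => b[1] - a[1]) // most frequent omissions first
    .map(([word, count]) => ({ word, count }));
}

// Counts from the example data; the placed list mimics "Graph 1".
const exampleCounts = { government: 7549, states: 6911, congress: 5862 };
const omitted = findOmittedWords(exampleCounts, ["states", "congress"]);
console.log(omitted);
```

With the "Graph 1" case this reports "government" as the only omission, which could be shown next to the table as a warning.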

@scottkleinman
Contributor Author

A word counts table has been added to Word Cloud so that the user can see if a word has been omitted. Discussion of this issue and ways to customise the layout need to go in In the Margins. But, short of modifying the d3 algorithm, there's not much more we can do, so I'm closing this issue for now.

@mleblanc321
Contributor

nice job

@scottkleinman
Contributor Author

As far as I can tell, there has been no progress on this in d3.layout.cloud(). We might try implementing one of the workarounds in the discussion at jasondavies/d3-cloud#36. Failing that, it might be good to go one better than my previous solution by automatically displaying a table with the 3 most frequent words and the 3 longest words. That way, the user can easily see if their word cloud is faulty.
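A rough sketch of that summary table, assuming we have the word counts as an object and a list of the words the layout actually drew. The `summaryWords` helper, its arguments, and the word "responsibilities" with its count are invented for illustration; only the other three counts come from the data above.

```javascript
// Hypothetical helper: pick the n most frequent and n longest words and
// flag whether each one was actually drawn in the cloud.
function summaryWords(counts, placedWords, n = 3) {
  const placed = new Set(placedWords);
  const rows = Object.entries(counts).map(([word, count]) => ({
    word,
    count,
    shown: placed.has(word),
  }));
  // Copy before sorting so the two rankings do not disturb each other.
  const mostFrequent = [...rows].sort((a, b) => b.count - a.count).slice(0, n);
  const longest = [...rows]
    .sort((a, b) => b.word.length - a.word.length)
    .slice(0, n);
  return { mostFrequent, longest };
}

const sampleCounts = {
  government: 7549,
  states: 6911,
  congress: 5862,
  responsibilities: 40, // invented entry, not from the real data
};
const summary = summaryWords(sampleCounts, ["states", "congress", "responsibilities"]);
console.log(summary.mostFrequent, summary.longest);
```

Any row with `shown: false` would tell the user at a glance that the cloud is missing one of its most prominent words.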

Also, the problem may be more extreme in multicloud since the layout area is much smaller. I'm not exactly sure how we deal with this.

@kreddy95
Contributor

Is this still a bug?

@scottkleinman
Contributor Author

scottkleinman commented Jun 30, 2016

Alas, yes. I've added a new In the Margins label to remind us to document this.

@scottkleinman
Contributor Author

I have added this issue to our In the Margins notes. Since it doesn't seem like this problem will be addressed in d3.js any time soon, I think we can close this issue.
