K-means with Silhouette scoring crashes on insufficient RAM #1502

Closed
klonuo opened this issue Aug 7, 2016 · 18 comments
Labels
bug A bug confirmed by the core team

Comments

@klonuo

klonuo commented Aug 7, 2016

Read a file with two integer columns, then select k-means and try silhouette scoring: the result is a crash.
The same happens with the silhouette plot.

Doing the same with sklearn directly shows no issues.

Latest Orange 3.3.8 nightly build on Windows.

@ajdapretnar
Contributor

@klonuo I've tried with several different data sets and k-means works fine for me (latest version, Win). Tested with zoo.tab and voting.tab.

@klonuo
Author

klonuo commented Aug 10, 2016

k-means works fine, but silhouette scoring and the silhouette plot do not; they crash Orange.
Here is my data: sample.csv

@kernc
Contributor

kernc commented Aug 10, 2016

The silhouette scoring fails because you have given it 23k rows and you don't have enough RAM (or not enough RAM is available to a 32-bit process) to compute the 23k**2 distance matrix (some 4.3 GB needed).
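As a rough check of that figure, here is a minimal sketch of the arithmetic. It assumes a dense square matrix of 8-byte floats (NumPy's default `float64`); the helper name `distance_matrix_bytes` is made up for illustration and is not part of Orange or sklearn.

```python
def distance_matrix_bytes(n_rows, itemsize=8):
    """Bytes needed for a dense n_rows x n_rows matrix of itemsize-byte values."""
    return n_rows * n_rows * itemsize


n = 23_000  # approximate row count of the reported data set
needed = distance_matrix_bytes(n)
print(f"{needed / 1e9:.2f} GB")  # 4.23 GB
```

That is in the same ballpark as the number quoted above, and it covers only the matrix itself, not the rest of the process. A 32-bit process cannot address more than 4 GiB in total, so an allocation of this size has essentially no chance there.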

If you have enough RAM physically (e.g. >= 6 GB), try installing 64-bit Python 3.5 Anaconda and their 64-bit build of Orange 3.

Alternatively, reduce your data set.
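One way to reduce the data without editing the file is to score only a random subset of rows. The sketch below uses the `sample_size` and `random_state` parameters of sklearn's `silhouette_score`; the data is synthetic (two made-up blobs standing in for the real file), and the subset size of 500 is an arbitrary choice, not a recommendation.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the real data: two well-separated 2-D blobs.
X = np.vstack([rng.normal(0, 1, (1000, 2)), rng.normal(10, 1, (1000, 2))])
labels = np.repeat([0, 1], 1000)

# Score a random subset of 500 rows, so the pairwise distance matrix is
# 500 x 500 instead of 2000 x 2000.
score = silhouette_score(X, labels, sample_size=500, random_state=0)
print(round(score, 2))
```

The result is an estimate of the full silhouette score, so it can vary with the chosen sample and seed.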

According to silhouette, the optimal number of clusters is 5. But it's pretty tight.
screenshot - 08102016 - 11 35 07 pm

@kernc kernc closed this as completed Aug 10, 2016
@kernc
Contributor

kernc commented Aug 10, 2016

What does the crash look like?

@kernc kernc reopened this Aug 10, 2016
@kernc kernc changed the title from "Orange crashes on silhouette scoring" to "Orange.evaluation.clustering.Silhouette should catch MemoryError" Aug 10, 2016
@kernc kernc added the bug A bug confirmed by the core team label Aug 10, 2016
@klonuo
Author

klonuo commented Aug 10, 2016

Thanks @kernc

Indeed, my laptop has 4 GB of RAM.
I used sklearn on 64-bit Python 3.5, while the Orange installer comes as a 32-bit build.

What does the crash look like?

A regular crash: a dialog pops up saying that Python has crashed, and the Orange process ends.

@kernc
Contributor

kernc commented Aug 10, 2016

So if Python's catching MemoryError works (which may not always be the case), catching MemoryError in Orange.evaluation.clustering.Silhouette.compute_score() should handle this case a bit more gracefully.

Can you test whether wrapping that .from_predicted() call in a try-except indeed catches it? Would you like to submit a pull request?

@klonuo
Author

klonuo commented Aug 10, 2016

No problem, but I could not catch the exception, either in ClusteringScore.from_predicted() or in Scoring.from_predicted(). Do you have a suggestion for where to put the try/except?

Also, I added a breakpoint in Orange.evaluation.clustering.Silhouette.compute_score(), but execution never got there.

@klonuo klonuo closed this as completed Aug 10, 2016
@kernc kernc reopened this Aug 10, 2016
@kernc
Contributor

kernc commented Aug 10, 2016

class Silhouette(ClusteringScore):
    separate_folds = True

    def compute_score(self, results):
        try:
            return self.from_predicted(results, silhouette_score)
        except MemoryError:
            return 'whatever'

If this still crashes instead of raising an error on 'whatever', then tough luck.

@klonuo
Author

klonuo commented Aug 10, 2016

Yep, as mentioned above, the crash seems to happen before reaching Silhouette.compute_score().

@kernc
Contributor

kernc commented Aug 10, 2016

Probably within compute_score() but before that except.

In either case, having some 20k examples is not unreasonable, whereas crashing is.


A potential fix, off the top of my head, is to replace all np.fromiter() calls with np.empty() and then fill the values in. This might raise MemoryError on insufficient RAM, which would be much more helpful.

@kernc
Contributor

kernc commented Aug 10, 2016

Never mind. This is not it. Orange.evaluation.clustering.Silhouette is only used in tests!

🎉

@kernc kernc changed the title from "Orange.evaluation.clustering.Silhouette should catch MemoryError" to "K-means with Silhouette scoring crashes on insufficient RAM" Aug 10, 2016
@klonuo
Author

klonuo commented Aug 10, 2016

Ok.

Python crashes on:

proj.silhouette = silhouette_score(X, proj.labels_)

Stepping inside, crash happens on: https://github.com/scikit-learn/scikit-learn/blob/3f37cb989af44c1f7ff8067cba176cf9b0c61eb7/sklearn/metrics/pairwise.py#L245
called from https://github.com/scikit-learn/scikit-learn/blob/3f37cb989af44c1f7ff8067cba176cf9b0c61eb7/sklearn/metrics/pairwise.py#L1078

But I couldn't analyze this loop...

@klonuo
Author

klonuo commented Aug 11, 2016

Just to add this...

I tried again with sklearn (0.17.1, from my 64-bit Python 3.5 shell) and my system froze. The Python process took exactly what you mentioned: 4.3 GB. Previously I must have used some sub-sample of the data...
I'll open an issue on sklearn and link it to this one.

IMHO this is worse than just crashing Python, as I had no other option than a system reset.

I don't know if it's feasible, but it would be nice if, for some demanding algorithms, we could check the user's available memory (with psutil, perhaps) against the size of the user's data.
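A minimal sketch of what such a check could look like, assuming the cost is dominated by a dense n × n `float64` distance matrix. The function names and the 0.5 safety factor are made up for illustration; the only real API used is `psutil.virtual_memory().available`, and psutil is treated as optional.

```python
def available_bytes():
    """Return available physical memory in bytes, or None if psutil is missing."""
    try:
        import psutil  # third-party; optional in this sketch
    except ImportError:
        return None
    return psutil.virtual_memory().available


def fits_in_memory(n_rows, available, itemsize=8, safety=0.5):
    """True if a dense n_rows x n_rows matrix fits in `safety` * `available` bytes."""
    needed = n_rows * n_rows * itemsize
    return needed <= available * safety


avail = available_bytes()
if avail is not None and not fits_in_memory(23_000, avail):
    print("Not enough free memory for silhouette scoring; consider subsampling.")
```

The available-memory figure is only a snapshot, so a check like this can reduce freezes but cannot rule them out.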

@ajdapretnar
Contributor

Pinged sklearn, hoping for a reply on the issue. @kernc Is this something we can fix or do we have to wait for sklearn?

@kernc
Contributor

kernc commented Mar 8, 2017

I guess, right before this line, we could add:

_ = np.empty((X.shape[0] + 10,) * 2)  # +10 overhead margin
del _

When X is large and RAM is insufficient, this really should fail with MemoryError, as the chunk is allocated in one piece. This should delay everything by a couple of milliseconds, but it should work. It's a hack.
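For illustration, the same probe can be wrapped in a helper that reports failure instead of raising. This is only a sketch: `preflight_alloc` and its `alloc` parameter are hypothetical names, with `alloc` there purely so the failure path can be exercised without actually exhausting memory.

```python
import numpy as np


def preflight_alloc(n_rows, margin=10, alloc=np.empty):
    """Try to allocate an (n_rows + margin)^2 float64 array, then free it.

    Returns True on success and False if the allocation raises MemoryError.
    """
    side = n_rows + margin  # small overhead margin, as in the snippet above
    try:
        probe = alloc((side, side))
    except MemoryError:
        return False
    del probe  # release the probe array immediately
    return True


print(preflight_alloc(100))  # small allocation, expected to succeed
```

One caveat: on systems that overcommit memory, `np.empty` may succeed even when physical RAM is insufficient, because the pages are not touched until they are written. The probe is therefore a best-effort guard, not a guarantee.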

@markotoplak
Member

Was this perhaps solved in #2073 (so version 3.4 should work OK)? Can someone verify? @thocevar

@markotoplak
Member

Furthermore, is trying to catch MemoryError still needed?

@kernc
Contributor

kernc commented Mar 9, 2017

Ah, seems like it's mitigated, thanks. It would probably still crash if <= 200 MB of RAM is available, and the except block, as is, wouldn't catch it.
