This repository has been archived by the owner on May 3, 2022. It is now read-only.

ALS app: java.lang.ClassCastException: java.lang.Object cannot be cast to java.lang.String #304

Closed
srowen opened this issue Aug 23, 2016 · 11 comments

Comments

@srowen
Member

srowen commented Aug 23, 2016

Reports of a strange ClassCastException in ALS in master / 2.3:

2016-08-18 17:17:18,768 INFO  ALSServingModelManager:97 ALSServingModel[features:30, implicit:true, X:(7640666 users), Y:(1613282 items, partitions: [0:104911, 1:10022, 2:44695, 3:26323, 4:50937, 5:36393, 6:99643, 7:54777, 8:17366, 9:28681, 10:131557, 11:31438, 12:33617, 13:24153, 14:111447, 15:43643, ...]...), fractionLoaded:0.99938]
2016-08-18 17:17:18,867 INFO  ALSServingModelManager:104 Loading new model
2016-08-18 17:17:24,141 INFO  AbstractOryxResource:86 Model loaded fraction: 0.9996865
2016-08-18 17:17:24,460 INFO  ALSServingModelManager:115 Updating model
2016-08-18 17:17:49,084 ERROR ModelManagerListener:144 Error while consuming updates
java.lang.ClassCastException: java.lang.Object cannot be cast to java.lang.String
        at net.openhft.koloboke.collect.impl.hash.MutableSeparateKVObjLHashGO.removeIf(MutableSeparateKVObjLHashGO.java:275)
        at com.cloudera.oryx.app.serving.als.model.ALSServingModel.lambda$retainRecentAndKnownItems$7(ALSServingModel.java:437)
        at net.openhft.koloboke.collect.impl.hash.MutableLHashParallelKVObjObjMapGO$ValueView.forEach(MutableLHashParallelKVObjObjMapGO.java:2228)
        at com.cloudera.oryx.app.serving.als.model.ALSServingModel.retainRecentAndKnownItems(ALSServingModel.java:435)
        at com.cloudera.oryx.app.serving.als.model.ALSServingModelManager.consume(ALSServingModelManager.java:119)
        at com.cloudera.oryx.lambda.serving.ModelManagerListener.lambda$contextInitialized$1(ModelManagerListener.java:142)
        at com.cloudera.oryx.common.lang.LoggingCallable.lambda$log$0(LoggingCallable.java:48)
        at com.cloudera.oryx.common.lang.LoggingCallable.lambda$asRunnable$1(LoggingCallable.java:66)
        at java.lang.Thread.run(Thread.java:745)
2016-08-18 17:17:49,086 INFO  ModelManagerListener:177 ModelManagerListener closing
2016-08-18 17:17:49,086 INFO  ModelManagerListener:179 Shutting down model manager
2016-08-18 17:17:49,086 INFO  ModelManagerListener:184 Shutting down input producer
2016-08-18 17:17:49,086 INFO  Producer:68 Shutting down producer
@srowen
Member Author

srowen commented Sep 9, 2016

@cimox and @flyingandrunning -- you say you have this same error? @cimox you commented that it only happens if data is in a wrong format, could you elaborate? Nicholas also has this error but I can't figure out how to reproduce it otherwise. We've looked at loads of theories.

@cimox
Contributor

cimox commented Sep 12, 2016

Sure @srowen, I will try to reproduce it on our dev environment. I'll keep in touch.

@cimox
Contributor

cimox commented Sep 13, 2016

Hi @srowen, I've talked with my colleague, and he says we can share the model we created from HDFS with you if that would help. Basically, we can share the whole HDFS directory from the project where this issue occurred. If you still need me to try to reproduce it, I will give it a try.

@srowen
Member Author

srowen commented Sep 20, 2016

This may be related to #312 in that I believe Nicolas is no longer seeing the problem after this change. If you're able to try a build from branch, you can check it out. It'll be in the next release. Let's reopen if anyone still sees it though.

@srowen srowen closed this as completed Sep 20, 2016
@srowen srowen reopened this Sep 22, 2016
srowen added a commit to srowen/oryx that referenced this issue Sep 22, 2016
srowen added a commit to srowen/oryx that referenced this issue Sep 22, 2016
srowen added a commit to srowen/oryx that referenced this issue Sep 30, 2016
srowen added a commit that referenced this issue Sep 30, 2016
@cimox
Contributor

cimox commented Oct 10, 2016

Hi @srowen, any news related to this issue? Can I help you somehow to fix this?

@srowen
Member Author

srowen commented Oct 25, 2017

We've had a workaround, at least, for a long while. I think that's the resolution for the foreseeable future.

@srowen
Member Author

srowen commented Sep 24, 2018

This is still an issue as seen in #353
@stiv-yakovenko has found a Koloboke issue which is probably related: leventov/Koloboke#66

The two issues occur in different places, but have some clear similarities:

            if ((key = (E) keys[i]) != FREE) {
                if (filter.test(key)) {

            if (tab[i] != FREE) {
                action.accept((V) tab[i + 1]);
            }

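In both, a slot is compared against the marker object FREE and, if it is not the marker, is passed to a user-supplied function. FREE is a plain Object, not an E or V, but the cast is unchecked and has no effect after erasure, so the comparison itself is harmless. It does matter once the value is passed to a function that expects the concrete type, such as String.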

The issue is that neither of these checks for the other marker, REMOVED.
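
To illustrate the mechanism described above, here is a minimal, self-contained sketch. It does not use Koloboke at all: `MarkerLeakDemo`, `forEachSlot`, and the two sentinel fields are hypothetical stand-ins that only mimic the shape of the quoted loops, under the assumption that a REMOVED marker is somehow left in a slot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class MarkerLeakDemo {
    // Hypothetical stand-ins for the library's sentinel objects.
    static final Object FREE = new Object();
    static final Object REMOVED = new Object();

    // Mimics the quoted loop shape: only FREE is filtered out.
    @SuppressWarnings("unchecked")
    static <E> void forEachSlot(Object[] slots, Consumer<? super E> action) {
        for (Object slot : slots) {
            E key;
            // The cast to E is unchecked and does nothing after erasure.
            if ((key = (E) slot) != FREE) {
                action.accept(key);
            }
        }
    }

    // Visits the slots with a String-typed lambda and reports what happened.
    static String run(Object[] slots) {
        List<Integer> lengths = new ArrayList<>();
        try {
            MarkerLeakDemo.<String>forEachSlot(slots, s -> lengths.add(s.length()));
            return "ok " + lengths;
        } catch (ClassCastException e) {
            // The failure surfaces only when the lambda receives the marker.
            return "ClassCastException after " + lengths;
        }
    }

    public static void main(String[] args) {
        System.out.println(run(new Object[] {"a", FREE, "bcd"}));
        System.out.println(run(new Object[] {"a", REMOVED, "bcd"}));
    }
}
```

With a FREE slot the loop skips it and finishes normally. With a leftover REMOVED slot the loop itself raises nothing; the exception appears only at the point where the String-typed lambda is invoked, which matches the shape of the stack trace in this issue.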

I don't see an obvious workaround. We can remove Koloboke for now or see if it can be fixed upstream.

@srowen srowen reopened this Sep 24, 2018
@srowen
Member Author

srowen commented Sep 24, 2018

Small update: I see that the code intends to never leave REMOVED in place after methods like removeIf are called. closeDelayedRemoved cleans those out. Either there is some bug there, or else still some concurrency issue in the caller here. I can't find any accesses that modify the state and aren't protected by a write lock; it's pretty straightforward code on this end.

@stiv-yakovenko

Yes, the intention was not to have REMOVED elements, but something went wrong :)
I don't think this is a concurrency problem, because another person observed this bug in a concurrency-free example.
If you want to rescue Koloboke, you will have to create some sort of fuzz-style load test that will a) find the crashing pattern and b) give some guarantee that you have fixed the problem without introducing a new one. I'd remove this Koloboke collection altogether instead. A 10% performance boost is not worth a ClassCastException.
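
A rough sketch of what such a randomized test could look like: apply the same random sequence of add, remove, and removeIf calls to the set under test and to a `java.util.HashSet` reference, and report the first step where they diverge or an exception is thrown. `SetFuzzer` and its parameters are hypothetical names; the demo plugs in `HashSet` as the candidate only so the example runs without extra dependencies, on the assumption that the Koloboke-backed set would be supplied instead.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.function.Supplier;

class SetFuzzer {
    // Returns -1 if every step agrees with the reference,
    // otherwise the index of the first step that diverged or threw.
    static int fuzz(Supplier<Set<String>> candidateFactory, long seed, int steps) {
        Random random = new Random(seed); // fixed seed keeps failures reproducible
        Set<String> candidate = candidateFactory.get();
        Set<String> reference = new HashSet<>();
        for (int step = 0; step < steps; step++) {
            String key = "k" + random.nextInt(64); // small key space forces collisions
            try {
                switch (random.nextInt(3)) {
                    case 0:
                        candidate.add(key);
                        reference.add(key);
                        break;
                    case 1:
                        candidate.remove(key);
                        reference.remove(key);
                        break;
                    default:
                        int bucket = random.nextInt(4);
                        candidate.removeIf(s -> Math.floorMod(s.hashCode(), 4) == bucket);
                        reference.removeIf(s -> Math.floorMod(s.hashCode(), 4) == bucket);
                        break;
                }
                if (!candidate.equals(reference)) {
                    return step;
                }
            } catch (RuntimeException e) {
                return step; // e.g. a ClassCastException from a leaked marker
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // HashSet stands in for the collection under test.
        System.out.println("first failing step: " + fuzz(HashSet::new, 42L, 10_000));
    }
}
```

Running it over many seeds would give some evidence for point a); rerunning the failing seeds after a candidate fix would address point b).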

@srowen
Member Author

srowen commented Sep 25, 2018

Yes, that's for sure. That's unfortunate. I was hoping to find there's another way around it.

While we can hack the lambda functions we pass to methods like forEach to cope with these unexpected Objects, I don't think that helps calls to retainAll and so on.
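
For reference, the kind of lambda hack meant here might look like the sketch below. `GuardedConsumers.guarded` is a hypothetical helper, not something in Oryx or Koloboke. Because the wrapper is a `Consumer<Object>`, no cast happens on entry, and values of the wrong runtime type are skipped instead of raising a ClassCastException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class GuardedConsumers {
    // Wraps a typed consumer so that values of an unexpected runtime type are ignored.
    // A Consumer<Object> is also a valid Consumer<? super String>, so it can be
    // passed wherever forEach expects a consumer of String values.
    static <T> Consumer<Object> guarded(Class<T> type, Consumer<? super T> delegate) {
        return o -> {
            if (type.isInstance(o)) {
                delegate.accept(type.cast(o));
            }
        };
    }

    // Simulates a collection handing over raw slot contents, markers included.
    static List<String> collect(Object... delivered) {
        List<String> seen = new ArrayList<>();
        Consumer<Object> safe = guarded(String.class, s -> seen.add(s));
        for (Object o : delivered) {
            safe.accept(o);
        }
        return seen;
    }

    public static void main(String[] args) {
        // The bare Object plays the role of a leaked marker.
        System.out.println(collect("a", new Object(), "b"));
    }
}
```

This only covers code paths where we supply the function; it does nothing for methods where the collection compares or casts elements internally.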

It might be possible to do things like copy collections at key points instead of updating them in place. That doesn't sound great, but might still be better than forgoing Koloboke entirely. The memory impact of using regular collections is, IIRC, quite significant. There are other primitive collection libraries but none more maintained or better than Koloboke.
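
A minimal sketch of the copy-instead-of-update idea, using plain `java.util` types so it runs standalone. `CopyOnFilter.retainMatching` is a hypothetical helper; the assumption is that a collection which is only ever filled, never removed from, does not get REMOVED markers in the first place, at the cost of briefly holding both the old and new copies in memory.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

class CopyOnFilter {
    // Builds a fresh map holding only the entries whose key passes the filter,
    // instead of calling removeIf or retainAll on the existing map.
    static <K, V> Map<K, V> retainMatching(Map<K, V> source, Predicate<? super K> keep) {
        Map<K, V> copy = new HashMap<>();
        source.forEach((k, v) -> {
            if (keep.test(k)) {
                copy.put(k, v);
            }
        });
        return copy;
    }

    public static void main(String[] args) {
        Map<String, Integer> items = new HashMap<>();
        items.put("recent", 1);
        items.put("stale", 2);
        // The caller would swap the reference while holding its write lock.
        items = retainMatching(items, k -> !k.equals("stale"));
        System.out.println(items);
    }
}
```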

We can fork some code from Koloboke if needed, temporarily, to get a fix in. Do we know what the fix even is? It's easy to check for REMOVED in the loop but I don't think it's even supposed to be there, and may cause other issues. This is the most likely way to address this, and I'll have to find time later to look into it.

@stiv-yakovenko

Well, Koloboke is dead; the author has ignored this critical bug since March. You can use Eclipse Collections, which is based on the Goldman Sachs collections implementation. Its memory and performance benefits seem to be comparable: https://github.com/eclipse/eclipse-collections

@srowen srowen modified the milestones: 2.3.0, 2.7.2 Oct 6, 2018
@srowen srowen closed this as completed Oct 6, 2018