TextAnalyzer not Serializable #77

Open
robgratz29 opened this issue Mar 15, 2024 · 2 comments
@robgratz29

Tim, a question: is there a reason we can't make the TextAnalyzer Serializable? I'm asking because I'm trying to hook the FTA stuff into Spark as a custom aggregator. The problem is that the "accumulator" is the TextAnalyzer, and it has to be serializable. I've gotten around it by storing the marshalled JSON representation of the analyzer, but performance grinds to a halt because I have to marshal and unmarshal for every row processed.

So I guess my issue is to make the TextAnalyzer instance Serializable.

Thanks,
Rob

@tsegall
Owner

tsegall commented Mar 19, 2024

Rob,

Not quite sure whether to call this an enhancement or a bug :-).

Firstly, there are already serialize() and deserialize() methods on the TextAnalyzer. You should be able to use these rather than rolling your own. Have a look at the new test exerciseSerialization() in TestMerge.java.

That said, serialize() and, in particular, deserialize() are slooow. The good news is that with the latest release, 15.5.2, serialize() performance has improved roughly 15x, from 662μs to 46μs. I also improved deserialize() from 2562μs to 1729μs, a much more modest improvement, so you are still looking at ~2ms per deserialize().

Historically all the focus has been on making train() as fast as possible. I am not sure how fast you need it to be?

Inherently, deserialize() is going to be significantly slower; OTOH, it seems to me I should be able to make it somewhat faster.

Do you have a sense of what would be required for it to be 'acceptable'?

Regards, Tim.

@robgratz29
Author

I have a workaround that gets around the problem. I wrapped the TextAnalyzer in a class that implements Externalizable and make the serialize/deserialize calls there. This way I only have to make those calls when the object is actually serialized or deserialized, rather than holding onto the JSON representation, which required a serialize/deserialize every time it was used. You can close out this bug/enhancement if you like.
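
For anyone landing here later, the wrapper approach can be sketched roughly as below. This is a minimal illustration, not the actual code from the Spark job: `StubAnalyzer` is a hypothetical stand-in for TextAnalyzer (it only counts samples), and `AnalyzerHolder` and `RoundTrip` are made-up names. In real code the two `Externalizable` methods would presumably call the analyzer's own serialize/deserialize methods instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Hypothetical stand-in for TextAnalyzer: not Serializable itself, but it can
// write its state to a String and be rebuilt from that String.
class StubAnalyzer {
    private long samples;

    void train(String value) { samples++; }

    long samples() { return samples; }

    String serialize() { return Long.toString(samples); }

    static StubAnalyzer deserialize(String serialized) {
        StubAnalyzer analyzer = new StubAnalyzer();
        analyzer.samples = Long.parseLong(serialized);
        return analyzer;
    }
}

// Holds the live analyzer; the String conversion happens only when Java
// serialization actually writes or reads the holder, not on every row.
class AnalyzerHolder implements Externalizable {
    private StubAnalyzer analyzer;

    // Externalizable requires a public no-arg constructor.
    public AnalyzerHolder() { this(new StubAnalyzer()); }

    AnalyzerHolder(StubAnalyzer analyzer) { this.analyzer = analyzer; }

    StubAnalyzer analyzer() { return analyzer; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeObject(analyzer.serialize());
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        analyzer = StubAnalyzer.deserialize((String) in.readObject());
    }
}

// Helper that pushes a holder through Java serialization and back.
final class RoundTrip {
    static AnalyzerHolder copy(AnalyzerHolder holder) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(holder);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (AnalyzerHolder) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The accumulator calls `holder.analyzer().train(...)` directly for each row, so the expensive conversion is paid only when the framework ships the holder between nodes. The state is written with `writeObject` rather than `writeUTF` because `writeUTF` is limited to 65535 encoded bytes, which a large serialized analyzer could exceed.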
