
pymongo.errors.OperationFailure: distinct too big, 16mb cap #686

Closed
monokal opened this issue Apr 13, 2017 · 21 comments
monokal commented Apr 13, 2017

When trying to train from and use the Ubuntu Dialog Corpus with the MongoDB Storage Adapter, I'm hitting the following exception. The code is pretty much identical to the Ubuntu Corpus example in this repo.

I believe the issue is that MongoDB caps a single BSON document at 16 MB: distinct returns all of its values inside one reply document, and the Ubuntu Corpus contains far more distinct response strings than fit under that cap. So this should either be somehow resolved, or support dropped, as it's currently "broken".

Traceback (most recent call last):
  File "./InfraBot.py", line 198, in <module>
    main()
  File "./InfraBot.py", line 194, in main
    bot(args)
  File "./InfraBot.py", line 91, in __call__
    r = self.bot.get_response("are you there?")
  File "/usr/local/lib/python3.6/site-packages/chatterbot/chatterbot.py", line 114, in get_response
    statement, response = self.generate_response(input_statement, session_id)
  File "/usr/local/lib/python3.6/site-packages/chatterbot/chatterbot.py", line 134, in generate_response
    response = self.logic.process(input_statement)
  File "/usr/local/lib/python3.6/site-packages/chatterbot/logic/multi_adapter.py", line 39, in process
    output = adapter.process(statement)
  File "/usr/local/lib/python3.6/site-packages/chatterbot/logic/best_match.py", line 54, in process
    closest_match = self.get(input_statement)
  File "/usr/local/lib/python3.6/site-packages/chatterbot/logic/best_match.py", line 16, in get
    statement_list = self.chatbot.storage.get_response_statements()
  File "/usr/local/lib/python3.6/site-packages/chatterbot/storage/mongodb.py", line 275, in get_response_statements
    response_query = self.statements.distinct('in_response_to.text')
  File "/usr/local/lib/python3.6/site-packages/pymongo/collection.py", line 2030, in distinct
    collation=collation)["values"]
  File "/usr/local/lib/python3.6/site-packages/pymongo/collection.py", line 232, in _command
    collation=collation)
  File "/usr/local/lib/python3.6/site-packages/pymongo/pool.py", line 419, in command
    collation=collation)
  File "/usr/local/lib/python3.6/site-packages/pymongo/network.py", line 116, in command
    parse_write_concern_error=parse_write_concern_error)
  File "/usr/local/lib/python3.6/site-packages/pymongo/helpers.py", line 210, in _check_command_response
    raise OperationFailure(msg % errmsg, code, response)
pymongo.errors.OperationFailure: distinct too big, 16mb cap
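For scale: distinct hands back every value inside a single command reply document, which is what runs into the 16 MB BSON cap. A minimal sketch for estimating the reply size up front; the connection details and the chatterbot-database name are assumptions, and $strLenBytes needs MongoDB 3.4+:

# Hedged sketch: count the distinct 'in_response_to.text' values and sum
# their byte lengths to predict whether distinct() would exceed 16 MB.
from pymongo import MongoClient

statements = MongoClient('localhost', 27017)['chatterbot-database'].statements

pipeline = [
    {'$unwind': '$in_response_to'},                # one doc per response entry
    {'$group': {'_id': '$in_response_to.text'}},   # one doc per distinct text
    {'$match': {'_id': {'$type': 'string'}}},      # drop entries without text
    {'$group': {
        '_id': None,
        'count': {'$sum': 1},
        'bytes': {'$sum': {'$strLenBytes': '$_id'}},
    }},
]

for result in statements.aggregate(pipeline, allowDiskUse=True):
    print('%d distinct values, ~%.1f MB' % (result['count'], result['bytes'] / 1e6))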
gunthercox added the bug label May 4, 2017
an0mali commented May 16, 2017

I got around the 16mb cap by pulling up the mongo shell and entering

db.collection.allowDiskUse="true"

but then you hit issues such as this in mongo, after entering an input and while waiting for a response:

2017-05-15T21:30:24.661-0400 I COMMAND [conn40] warning: log line attempted (5147kB) over max size (10kB)

I haven't figured out a workaround for that, but as of right now it's looking like that corpus is just way too large.
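One thing worth noting here: allowDiskUse is a per-command option, not a collection property, so the shell assignment above only sets a JavaScript attribute and never reaches the server. A minimal pymongo sketch of passing it where it actually takes effect; the connection details and database name are assumptions:

# Hedged sketch: allowDiskUse must be passed on each aggregate() call.
from pymongo import MongoClient

statements = MongoClient('localhost', 27017)['chatterbot-database'].statements

cursor = statements.aggregate(
    [{'$group': {'_id': '$in_response_to.text'}}],
    allowDiskUse=True,  # lets $group spill to disk past its in-memory limit
)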

telkomops commented May 23, 2017

I tried db.collection.allowDiskUse="true" in the mongo shell. It does not work in Mongo 3.4; is a mongo restart needed?

Even when the MongoDB aggregation framework is used, MongoDB restricts each BSON document to no more than 16 MB, so the following fails:

db.statements.aggregate(
   [
     {
       $group:
         {
           _id: null,
           // every distinct text is pushed into one array in a single
           // output document, which itself exceeds the 16 MB BSON cap
           distinctText: { $addToSet: "$in_response_to.text" }
         }
     },
     { $out : "aggResults" }
   ],
   {
       allowDiskUse: true
   }
);
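The failure above is structural rather than a configuration problem: with _id: null, $addToSet collects every distinct text into one result document, and that single document breaches the 16 MB cap regardless of allowDiskUse. A hedged pymongo sketch (same assumed connection and names as above) that instead emits one small document per distinct value:

# Hedged sketch: group on the text itself, so each distinct value becomes
# its own tiny document streamed back in a cursor, and no single document
# comes anywhere near the 16 MB BSON cap.
from pymongo import MongoClient

statements = MongoClient('localhost', 27017)['chatterbot-database'].statements

distinct_texts = statements.aggregate(
    [
        {'$unwind': '$in_response_to'},               # one doc per response entry
        {'$group': {'_id': '$in_response_to.text'}},  # one doc per distinct text
    ],
    allowDiskUse=True,
)

for doc in distinct_texts:
    print(doc['_id'])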

peterel commented May 23, 2017

Running into this as well now. So it's simply not possible to train on the Ubuntu corpus? @vkosuri

vkosuri (Collaborator) commented May 23, 2017

@peterel I never tried; I'll give it a try next week.

telkomops commented:

@peterel, I tried training on the Ubuntu Corpus; in 16 hours, about 70% of the corpus was indexed.
Currently the bottleneck is the 16 MB issue being discussed here around the MongoDB adapter.

peterel commented May 24, 2017

@telkomops Then we are in the same situation. I had it running for around 15 hours and am now running into the 16 MB issue. How did you proceed? I'm thinking of scrapping MongoDB altogether and trying either some SQL version or maybe some other "chatbot framework". The jsonAdapter is simply much too slow to use in my test runs.

monokal (Author) commented May 24, 2017

IMO, ChatterBot would really benefit from a rewrite to integrate Keras/TensorFlow as the machine/deep learning backend. It's not a difficult implementation now that Keras has been rewritten by Google, complete with high-level Python APIs.

It provides powerful, well-maintained machine learning models/algorithms with numerous NLP/chatbot examples on the web, will handle training storage, provides a web UI to dig into the bot's neural network (training progress, etc.), and far more.

I'd be very happy to contribute should the proposal be accepted, as I was thinking of migrating my project away from ChatterBot to Keras anyway, but I'd much rather make a good thing great here.

Some useful links:

monokal (Author) commented May 24, 2017

Opened proposal under #761

"+1" if you're interested.

peterel commented May 28, 2017

+1

peterel commented Jun 3, 2017

@vkosuri Did you find the time to give this a try and reproduce the issue?

vkosuri (Collaborator) commented Jun 5, 2017

@peterel I trained my bot with the offline Ubuntu corpus (https://arxiv.org/abs/1506.08909); instead of the above issue, I saw file-not-found errors. I'll give it another try today.

If possible, where exactly did you see the above error? I mean, on which .tsv file did you encounter it?

peterel commented Jun 5, 2017

@vkosuri Hmm, not sure I follow. This issue is about training on the Ubuntu corpus for a long time only to get a "pymongo.errors.OperationFailure: distinct too big, 16mb cap" error. In my case, I trained with the Ubuntu corpus for a couple of hours and all was well; it gave "decent" answers. Then I left it overnight, and when I tried a new conversation I hit this 16 MB cap error.

Or is your comment suggesting that it matters which version of the Ubuntu corpus one trains with (the "built-in" one or the offline one you are referring to)?

telkomops commented:

@peterel, I gave up on the Ubuntu corpus, as my ideal scenario was to train with my internal chat conversations, and I hit bug #759. Kind of giving up on ChatterBot for now. :(

peterel commented Jul 6, 2017

@telkomops I gave up on the Ubuntu corpus as well. I'm using the built-in one and hoping users will "fill in" some proper responses. Are you going with Keras or TensorFlow instead? Or is there any other framework similar to ChatterBot you'd recommend?

gunthercox (Owner) commented:

@telkomops / @peterel, the Ubuntu corpus is a massive data set. I think it may have been a mistake for me to add documentation and training support for it. ChatterBot isn't ready to handle that much data yet. I'm working to improve this, but the changes required to optimize these queries on large data sets are still a few releases away.

peterel commented Jul 7, 2017

@gunthercox Many thanks for your efforts. ChatterBot is very cool :) I do agree, though, about the Ubuntu corpus. Since it won't "work" with ChatterBot atm, it's probably better to remove it from the docs, or at least make folks aware that it won't work for now. The alternative is, as in my case, to spend days on training and then realize that it doesn't work. Again, thanks for your efforts!

vkosuri (Collaborator) commented Jul 7, 2017

Instead of removing it from ChatterBot, how about moving it to https://github.com/gunthercox/chatterbot-corpus? Any users who would like to use UbuntuCorpusTrainer could still import it:

from chatterbot_corpus.trainers import UbuntuCorpusTrainer

peterel commented Jul 10, 2017

@vkosuri Hmm, if the Ubuntu Corpus can't be used with ChatterBot, I think it's better not to include it at all, or at least show a "disclaimer". Otherwise, you'll end up with folks spending days in training only to see it crash... which is not so good :)

lesleslie (Contributor) commented Jul 30, 2017

The maximum BSON document size is 16 megabytes.

There is a lot of material on this, and workarounds, on Stack Overflow.

Most recommend using GridFS, which would mean switching from .distinct() to .aggregate() (as mentioned above in this thread too).

Untested, but it would probably look something like this (mongodb.py):

def get_response_statements(self):
    """
    Return only statements that are in response to another statement.
    A statement must exist which lists the closest matching statement in the
    in_response_to field. Otherwise, the logic adapter may find a closest
    matching statement that does not have a known response.
    """
    # response_query = self.statements.distinct('in_response_to.text')  # current

    # Group on the text itself so each distinct value is returned as its own
    # small document in a cursor, rather than inside one reply document
    # capped at 16 MB. allowDiskUse lets $group spill to disk.
    response_query = self.statements.aggregate(
        [
            {'$unwind': '$in_response_to'},
            {'$group': {'_id': '$in_response_to.text'}}
        ],
        allowDiskUse=True
    )

    _statement_query = {
        'text': {
            # '$in': response_query  # current
            # Extract the grouped values; skip null entries produced by
            # statements without response text.
            '$in': [doc['_id'] for doc in response_query if doc['_id'] is not None]
        }
    }

    _statement_query.update(self.base_query.value())

    statement_query = self.statements.find(_statement_query)

    statement_objects = []

    for statement in statement_query:
        statement_objects.append(self.mongo_to_object(statement))

    return statement_objects

http://api.mongodb.com/python/current/examples/aggregation.html
http://docs.mongodb.org/manual/reference/gridfs/
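A quick way to sanity-check the aggregate approach, as a hedged sketch with an assumed local connection and database name; it only makes sense on a collection small enough for distinct() to still succeed:

# Hedged sketch: on small data, distinct() and the $unwind + $group pipeline
# should agree; only the pipeline keeps working once the distinct() reply
# would exceed the 16 MB cap. (Null handling may differ slightly for
# in_response_to entries that lack a text field.)
from pymongo import MongoClient

statements = MongoClient('localhost', 27017)['chatterbot-database'].statements

via_distinct = set(statements.distinct('in_response_to.text'))
via_aggregate = {
    doc['_id']
    for doc in statements.aggregate(
        [
            {'$unwind': '$in_response_to'},
            {'$group': {'_id': '$in_response_to.text'}},
        ],
        allowDiskUse=True,
    )
}

assert via_distinct == via_aggregate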

anurag-arora commented:

Hi @gunthercox,

I trained my bot with the Ubuntu Dialog Corpus for one day, and when running it I am receiving this error:
pymongo.errors.OperationFailure: Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in.
Please look into this and tell me if it is possible to work around this issue. I terminated the training as it was taking too much time, and the chatterbot database size is around 0.390 GB.

Thanks
