Bad performance on big amount of training data #164
Comments
Commit e5a9869 makes one small change to start to address this by reducing the number of read and write transactions that are made to the database. I will continue to post updates on this ticket to track performance improvement changes.
Pull request #173 allows the storage adapter to override an expensive method to provide a more efficient implementation.
What about using SQLite? Will that speed up the process? Is there an SQLite adapter? I tried to do the same: I fed in a 3.5 MB training file with conversations from a social network, curious what kind of answers I'd get from that :D First it took about 40 minutes to train, and now it is just stuck trying to answer. I guess because I need to download it and run the server, huh?
Btw, why not use an SQL database?
Oh, okay, MongoDB works fine now. Much faster. But that Bulk error is annoying. And I still think having a standard SQLite option would be nice: it's much faster than JSON, but it is also just a single file and doesn't require you to install anything but Python. Just a thought, no rush though.
@Nixellion I'm glad you are getting better results with the MongoDB adapter. The JSON file adapter is really just meant for testing and development because it is limited by the fact that it has to write to the hard disk each time it needs to save. Still looking into the bulk insert error, and I've opened a ticket for tracking the addition of a new SQLite storage adapter: #241.
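For reference, later ChatterBot releases shipped an SQLAlchemy-backed storage adapter that can point at an SQLite file. A minimal sketch, assuming a version where `chatterbot.storage.SQLStorageAdapter` and the `database_uri` parameter exist:

```python
from chatterbot import ChatBot

# Assumes a ChatterBot release that includes the SQLAlchemy-backed adapter
# (the work tracked in #241). The adapter path and the database_uri parameter
# may differ between versions.
bot = ChatBot(
    'SQLiteExample',
    storage_adapter='chatterbot.storage.SQLStorageAdapter',
    database_uri='sqlite:///database.sqlite3'
)
```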
Cool, thanks! I also found some old discussions back from 2014-2015 about making this bot smart enough to pass at least some Turing test questions, building sentences from words, etc. I hope you're still on it :)
Does it support parallel training?
Parallel training is only supported if the database being used supports concurrent writes. The default file database that ChatterBot uses does not support concurrent writes, but MongoDB does.
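A minimal sketch of pointing ChatterBot at MongoDB instead of the file database; the exact parameter name (`database` vs. `database_uri`) depends on the ChatterBot version:

```python
from chatterbot import ChatBot

# Assumes a MongoDB instance running locally on the default port.
bot = ChatBot(
    'MongoExample',
    storage_adapter='chatterbot.storage.MongoDatabaseAdapter',
    database_uri='mongodb://localhost:27017/chatterbot-database'
)
```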
My data size is about 2 GB.
You will probably need to do a bit of work to get the import process ready to bring in 2 GB of data in parallel. I would recommend breaking it up, if possible, into a few files of manageable size. You will then have to use Python's multiprocessing capabilities to start training processes on each subset of the data, as in the sketch below. This functionality isn't built into ChatterBot at the moment, so if you are unsure how to accomplish this, feel free to ask any questions. Otherwise, I have opened a ticket to get support for this functionality added to ChatterBot (#354).
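A rough sketch of what that could look like, assuming the data has already been split into several files with one conversation turn per line, a MongoDB backend for concurrent writes, and a ChatterBot version with the `ListTrainer(bot)` API (older releases use `bot.set_trainer(ListTrainer)` instead). The chunk file names are placeholders:

```python
from multiprocessing import Pool

from chatterbot import ChatBot
from chatterbot.trainers import ListTrainer

# Hypothetical chunk files produced by splitting the 2 GB data set beforehand.
CHUNKS = ['data_part_01.txt', 'data_part_02.txt', 'data_part_03.txt']

def train_chunk(path):
    # Each worker process builds its own ChatBot instance; all of them write
    # to the same MongoDB database, which supports concurrent writes.
    bot = ChatBot(
        'ParallelTraining',
        storage_adapter='chatterbot.storage.MongoDatabaseAdapter',
        database_uri='mongodb://localhost:27017/chatterbot-database'
    )
    trainer = ListTrainer(bot)
    with open(path) as data_file:
        # Assumes one conversation turn per line in each chunk file.
        conversation = [line.strip() for line in data_file if line.strip()]
    trainer.train(conversation)
    return path

if __name__ == '__main__':
    with Pool(processes=len(CHUNKS)) as pool:
        for finished in pool.imap_unordered(train_chunk, CHUNKS):
            print('Finished training on', finished)
```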
I've noticed that #597, which switches to ujson, has sped up processing a lot, though my training data is only ~300 MB in size. I recommend trying it out to see how much faster it will go.
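If you want to try the same idea yourself, a drop-in swap looks roughly like this (assuming ujson is installed; it only covers the common `loads`/`dumps` calls, not every keyword argument of the stdlib module):

```python
# ujson exposes the same basic loads/dumps interface as the stdlib json
# module but is implemented in C and is typically much faster.
try:
    import ujson as json  # pip install ujson
except ImportError:
    import json

data = json.loads('{"text": "Hello"}')
print(json.dumps(data))
```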
@Martmists Hi, have you solved the efficiency problems with the bot's training and testing? Can you share some thoughts about improving efficiency? Thanks.
One thing to note is to NOT use the default JSON storage. It's slow due to constant I/O, it's relatively unoptimized, and it uses the stdlib JSON module. I recommend writing your own or trying to find one online.
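A very rough skeleton of what a custom adapter could look like; the exact abstract method set of `StorageAdapter` differs between ChatterBot versions, so treat the method names below as assumptions to check against the version you are running:

```python
from chatterbot.storage import StorageAdapter

class MyFastStorageAdapter(StorageAdapter):
    """Hypothetical adapter backed by some faster store (e.g. Redis or SQLite)."""

    def count(self):
        # Return the number of stored statements.
        raise NotImplementedError

    def filter(self, **kwargs):
        # Return statements matching the given field values.
        raise NotImplementedError

    def update(self, statement):
        # Insert or update a statement and its responses.
        raise NotImplementedError

    def remove(self, statement_text):
        # Delete a statement from the store.
        raise NotImplementedError

    def get_random(self):
        # Return a single random statement.
        raise NotImplementedError

    def drop(self):
        # Remove all data from the store.
        raise NotImplementedError
```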
@Martmists I have used MongoDB as the storage adapter. However, responses are still very slow: with about 70,000 (7w) records, a response takes about 41 seconds. I am working on finding other ways to improve efficiency. How about you?
I'm going to close this issue off; I don't believe there are any remaining actionable items here. Tickets have been created to implement changes that will help to improve response times. See #925 and its related tickets for further details.
I trained the bot with ~17k conversations and now it takes a long time to respond. Are there ways to avoid this?
Training data: https://gist.github.com/sntp/221f53c48bec929ac36d0951b496fcbd