Unable to load corpus #457
Comments
Hello @romanovzky, thanks for reporting this. This is in fact related to #426 and something that has been on our list for a while. Essentially, to fix this we need to implement an iterating data fetcher that does not keep the entire data set in memory, as is currently done. Since more and more people are training classifiers over large data sets, we need to make this a priority. I will add a 'feature' tag to this issue.
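To make the idea concrete, here is a minimal sketch of the lazy-loading approach described above: a plain Python generator that streams a FastText-format file one example at a time. This is only an illustration, not flair's actual implementation; the file name and label convention are taken from the format discussed in this issue.

```python
# Illustrative sketch only -- not flair's implementation. A generator
# streams (label, text) pairs from a FastText-format file so the whole
# data set never has to sit in memory at once.
from typing import Iterator, Tuple


def iter_classification_file(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (label, text) pairs one at a time from a FastText-format file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            label, _, text = line.partition("\t")
            yield label.replace("__label__", ""), text


# Memory stays flat no matter how large train.txt is; the trade-off is
# that the file is re-read on every pass (e.g. every training epoch).
for label, text in iter_classification_file("train.txt"):
    pass  # build a flair Sentence / mini-batch here instead of a list
```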
Thanks @alanakbik, I'll be eager to give this new feature a go!
@alanakbik, do we currently have any workaround for a loading issue like this? Additionally, is there any functionality that does not need to reload the data from text after formatting, and instead just batches the data in the loader?
The only current workaround is to reduce the training data set size, which is not great. We're making this a priority feature for version 0.5 (see #563)!
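For anyone needing that workaround, a minimal sketch follows; the helper name, file names, and the 5% keep ratio are placeholders, not part of flair:

```python
# Hypothetical helper for the downsampling workaround; names and the
# keep ratio are examples only.
import random


def downsample_file(src: str, dst: str, keep: float = 0.05, seed: int = 42) -> None:
    """Copy roughly a `keep` fraction of lines from src to dst."""
    rng = random.Random(seed)
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            if rng.random() < keep:
                fout.write(line)


downsample_file("train.txt", "train_small.txt", keep=0.05)
```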
Also looking forward to this feature in 0.5; currently we can't use Flair at all to train multilabel text classifiers. We're working on a dataset of ~400k lines (resulting in a roughly 600 MB training file, split into 10 files), training on a Tesla V100 with 16 GB of GPU RAM and 64 GB of RAM on the machine itself, with an SSD disk, and we can only load 2k lines, after which we run into the same issue described (we're loading the …). Using the GloVe WordEmbedding only, we can fit more data, however that defeats a bit the original purpose of using Flair. Is there anything we can help with? So far I really like the way Flair is being built; it reminds me of spaCy in terms of ease of use, with the best NLP techniques available. What could also help would be some progress indicator while loading the corpus. Currently we only see the following output:
Having some progress indication would make it easier to see what percentage of the corpus can fit into the available RAM, by monitoring memory usage while loading the corpus. It would also help when creating our own embeddings, as loading a very large corpus takes quite a bit of time. That would be a quick band-aid fix though (for 0.4.x), as the solution planned for 0.5 sounds much better.
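As a band-aid along those lines, one could wrap the corpus file in a byte-counting tqdm bar while reading it. This is just a sketch of the suggestion, not anything flair exposes, and the path is a placeholder:

```python
# Sketch of a progress indicator while reading a large corpus file:
# report progress in bytes, since the total file size is known up front.
import os

from tqdm import tqdm

path = "train.txt"  # placeholder
total = os.path.getsize(path)
with open(path, "rb") as f, tqdm(total=total, unit="B", unit_scale=True) as bar:
    for raw in f:
        bar.update(len(raw))
        line = raw.decode("utf-8")
        # ... parse the __label__X <tab> text line here ...
```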
Hello @superzadeh, good points - we should definitely add a progress bar to indicate how much of a corpus is loaded. I'll put in a ticket for this. With regard to large datasets, since many users have this problem, this feature will be a major priority for version 0.5. From our side, development on this will begin in the second week of April (when everybody is back from vacation). Until then, you could check out pull request #595, which implements an iterating data fetcher - some features such as randomization of the data are still missing and it's not yet fully tested, but maybe this could work for you.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
PR merged for this a while back.
Describe the bug
Following Tutorial 7, I cannot load my own corpus using the classification corpus loader from the tutorial. The data in the `.txt` files follow the convention `__label__X\ttext`. The total number of classes is two; the total number of entries is: …

The used memory rises continuously, ending up with all memory used by the loading code, leading to a non-responsive system, and then Python crashes (running inside a Jupyter notebook, so the kernel restarts).
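For reference, a sketch of the setup being described: the loader call assumes the `NLPTaskDataFetcher` API from the flair 0.4 line, and the data folder and split file names are placeholders.

```python
# Each line of the .txt files looks like (tab-separated):
#   __label__POS    some document text
from pathlib import Path

from flair.data_fetcher import NLPTaskDataFetcher

# Assumed flair 0.4-era API; paths and file names are placeholders.
corpus = NLPTaskDataFetcher.load_classification_corpus(
    Path("data"),
    train_file="train.txt",
    dev_file="dev.txt",
    test_file="test.txt",
)
```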
Reducing the number of documents (say, producing only 1000-1000-500 sets) seems to work around this. However, this is not the intended behaviour for a deep learning solution.
To Reproduce
As far as I know, any data with the above specifications reproduces this (I cannot divulge or share the data).
Expected behavior
To load a corpus to be used in a `TextClassifier`.
Screenshots
N/A
Environment (please complete the following information):
Additional context
N/A