
Unable to load corpus #457

Closed
romanovzky opened this issue Feb 5, 2019 · 8 comments
Labels
bug (Something isn't working), wontfix (This will not be worked on)

Comments

@romanovzky

Describe the bug
Following Tutorial 7, I cannot load my own corpus using

from pathlib import Path
from flair.data_fetcher import NLPTaskDataFetcher

corpus = NLPTaskDataFetcher.load_classification_corpus(Path('./'),
                                                       test_file='test.txt',
                                                       dev_file='val.txt',
                                                       train_file='train.txt')

The data in the txt files follow the convention __label__X\ttext. There are two classes in total, and the number of entries per split is:

  • train: 20387
  • val: 20387
  • test: 10194
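
For illustration, lines in those files look roughly like this (a minimal sketch; the class names below are made up, since I cannot show the real labels):

# Hypothetical example of the __label__X\ttext convention used in
# train.txt, val.txt and test.txt (class names are placeholders only).
with open('train.txt', 'w', encoding='utf-8') as f:
    f.write('__label__POSITIVE\tGreat product, would buy again.\n')
    f.write('__label__NEGATIVE\tStopped working after a week.\n')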

Memory usage rises continuously until the code above has consumed all available memory, making the system unresponsive; Python then crashes (I am running inside a Jupyter notebook, so the kernel restarts).

Reducing the number of documents (say, producing only 1000/1000/500 train/val/test sets) seems to work around this, but that is not the intended behaviour for a deep-learning solution.
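
For reference, a rough sketch of that kind of truncation workaround (the line counts and the output directory name are arbitrary):

import os

# Write truncated copies of the splits so that the corpus fits in memory.
# File names match the ones passed to load_classification_corpus above.
def truncate(src, dst, n_lines):
    with open(src, encoding='utf-8') as fin, open(dst, 'w', encoding='utf-8') as fout:
        for i, line in enumerate(fin):
            if i >= n_lines:
                break
            fout.write(line)

os.makedirs('small', exist_ok=True)
truncate('train.txt', 'small/train.txt', 1000)
truncate('val.txt', 'small/val.txt', 1000)
truncate('test.txt', 'small/test.txt', 500)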

To Reproduce
As far as I know, any data with the above specifications should reproduce this (I cannot divulge or share my own data).

Expected behavior
To load a corpus to be used in a TextClassifier.

Screenshots
N/A

Environment (please complete the following information):

  • OS: Linux
  • Version: 0.4.0
  • RAM: 16GB

Additional context
N/A

romanovzky added the bug label on Feb 5, 2019
@alanakbik
Collaborator

Hello @romanovzky, thanks for reporting this. This is in fact related to #426 and something that has been on our list for a while. Essentially, to fix this we need to implement an iterating data fetcher that does not keep the entire data set in memory, as is currently done. Since more and more people are training classifiers over large data sets, we need to make this a priority.
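
To sketch the idea (illustrative only, not the final API): instead of materialising every document up front, such a fetcher would yield one labelled document at a time, roughly like this:

# Illustrative sketch of lazy iteration over a FastText-formatted file;
# this is not Flair's actual implementation or API.
def iter_fasttext_file(path):
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue
            label, text = line.split('\t', 1)
            yield label.replace('__label__', ''), text

# Consumers would then build Sentences / mini-batches on the fly
# instead of holding the whole corpus in memory.
for label, text in iter_fasttext_file('train.txt'):
    pass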

I will add a 'feature' tag to this issue.

alanakbik added the feature label on Feb 5, 2019
@romanovzky
Author

Thanks @alanakbik, I'll be eager to give this new feature a go!

@ShaohongBai

ShaohongBai commented Feb 26, 2019

@alanakbik, do we currently have any workaround for a loading issue like this? Additionally, is there functionality that does NOT need to reload the data from text after formatting, and instead just batches the already-loaded data in the loader?

@alanakbik
Collaborator

The only current workaround is to reduce the training data set size, which is not great. We're making this a priority feature for version 0.5 (see #563)!

@superzadeh

superzadeh commented Mar 20, 2019

Also looking forward to this feature in 0.5; currently we can't use Flair at all to train multi-label text classifiers.

We're working on a dataset of ~400k lines (roughly a 600MB training file, split into 10 files), training on a Tesla V100 with 16GB of GPU RAM, 64GB of RAM on the machine itself, and an SSD disk. We can only load about 2k lines before we run into the same issue described here (we're loading the news-forward and news-backward FlairEmbeddings): we either run out of RAM while loading the corpus or run out of GPU RAM while training.

Using the GloVe WordEmbeddings only, we can fit more data, but that somewhat defeats the original purpose of using Flair.
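
For context, this is the kind of embedding stack we are comparing (a rough sketch, assuming a flair 0.4.x release that provides DocumentRNNEmbeddings):

from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings

# Lighter on memory: GloVe word embeddings only.
glove_only = DocumentRNNEmbeddings([WordEmbeddings('glove')])

# What we actually want: contextual FlairEmbeddings on top of GloVe,
# which is where we run out of RAM / GPU RAM.
flair_stack = DocumentRNNEmbeddings([
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
])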

Is there anything we can help with? So far I really like the way Flair is being built, reminds me of spaCy in terms of ease of use, with the best NLP techniques available.

What would also help is a progress indicator while loading the corpus. Currently we only see the following output:

2019-03-20 10:18:41,276 Reading data from /floyd/home/output/small
2019-03-20 10:18:41,277 Train: /floyd/home/output/small/train.csv
2019-03-20 10:18:41,278 Dev: /floyd/home/output/small/dev.csv
2019-03-20 10:18:41,279 Test: /floyd/home/output/small/test.csv

Having some progress indication would make it easier to see what percentage of the corpus fits into the available RAM, simply by monitoring memory usage while the corpus loads. It would also help when creating our own embeddings, as loading a very large corpus takes quite a bit of time.

That would be a quick band-aid fix though (for 0.4.x), as the solution planned for 0.5 sounds much better.
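
In the meantime, a user-side stopgap is to scan the files with a progress bar before loading, just to gauge corpus size and watch memory usage (a minimal sketch; assumes tqdm is installed):

from tqdm import tqdm

# Count lines with a progress bar, e.g. to estimate how much of the
# corpus will fit before handing the files to the data fetcher.
def count_lines(path):
    n = 0
    with open(path, encoding='utf-8') as f:
        for _ in tqdm(f, desc=path, unit='lines'):
            n += 1
    return n

count_lines('/floyd/home/output/small/train.csv')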


@alanakbik
Collaborator

Hello @superzadeh, good points - we should definitely add a progress bar to indicate how much of a corpus has been loaded. I'll put in a ticket for this.

Wrt large datasets: since many users have this problem, this feature will be a major priority for version 0.5. From our side, development on this will begin in the second week of April (when everybody is back from vacation). Until then, you could check out pull request #595, which implements an iterating data fetcher - some features such as randomization of the data are still missing and it's not yet fully tested, but maybe it could work for you.

alanakbik pushed commits that referenced this issue on May 10, May 13, May 15, May 16, and May 17, 2019
@stale

stale bot commented Apr 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Apr 30, 2020
@alanakbik
Collaborator

PR merged for this a while back
