
Unable to load corpus #457

Closed
romanovzky opened this issue Feb 5, 2019 · 8 comments
Labels
bug (Something isn't working), wontfix (This will not be worked on)

Comments

@romanovzky

Describe the bug
Following Tutorial 7, I cannot load my own corpus using

from pathlib import Path
from flair.data_fetcher import NLPTaskDataFetcher

corpus = NLPTaskDataFetcher.load_classification_corpus(Path('./'),
                                                       test_file='test.txt',
                                                       dev_file='val.txt',
                                                       train_file='train.txt')

The data in the txt files follow the convention __label__X\ttext. There are two classes in total, and the number of entries per split is:

  • train: 20387
  • val: 20387
  • test: 10194
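
For illustration, lines in those files look roughly like this (a minimal sketch; the class names below are made up, since I cannot show the real labels):

# Hypothetical example of the __label__X\ttext convention used in
# train.txt, val.txt and test.txt (class names are placeholders only).
with open('train.txt', 'w', encoding='utf-8') as f:
    f.write('__label__POSITIVE\tGreat product, would buy again.\n')
    f.write('__label__NEGATIVE\tStopped working after a week.\n')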

Memory usage rises continuously until the code above has consumed all available memory, making the system unresponsive; Python then crashes (I am running inside a Jupyter notebook, so the kernel restarts).

Reducing the number of documents (say, producing only 1000/1000/500 train/val/test sets) seems to work around this, but that is not the intended behaviour for a deep-learning solution.
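
For reference, a rough sketch of that kind of truncation workaround (the line counts and the output directory name are arbitrary):

import os

# Write truncated copies of the splits so that the corpus fits in memory.
# File names match the ones passed to load_classification_corpus above.
def truncate(src, dst, n_lines):
    with open(src, encoding='utf-8') as fin, open(dst, 'w', encoding='utf-8') as fout:
        for i, line in enumerate(fin):
            if i >= n_lines:
                break
            fout.write(line)

os.makedirs('small', exist_ok=True)
truncate('train.txt', 'small/train.txt', 1000)
truncate('val.txt', 'small/val.txt', 1000)
truncate('test.txt', 'small/test.txt', 500)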

To Reproduce
As far as I know, any data with the above specifications should reproduce this (I cannot divulge or share my own data).

Expected behavior
To load a corpus to be used in a TextClassifier.

Screenshots
N/A

Environment (please complete the following information):

  • OS: Linux
  • Version: 0.4.0
  • RAM: 16GB

Additional context
N/A

romanovzky added the bug label on Feb 5, 2019
@alanakbik
Collaborator

Hello @romanovzky, thanks for reporting this. This is in fact related to #426 and something that has been on our list for a while. Essentially, to fix this we need to implement an iterating data fetcher that does not keep the entire data set in memory, as is currently done. Since more and more people are training classifiers over large data sets, we need to make this a priority.
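
To sketch the idea (illustrative only, not the final API): instead of materialising every document up front, such a fetcher would yield one labelled document at a time, roughly like this:

# Illustrative sketch of lazy iteration over a FastText-formatted file;
# this is not Flair's actual implementation or API.
def iter_fasttext_file(path):
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue
            label, text = line.split('\t', 1)
            yield label.replace('__label__', ''), text

# Consumers would then build Sentences / mini-batches on the fly
# instead of holding the whole corpus in memory.
for label, text in iter_fasttext_file('train.txt'):
    pass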

I will add a 'feature' tag to this issue.

alanakbik added the feature label on Feb 5, 2019
@romanovzky
Author

Thanks @alanakbik, I'll be eager to give this new feature a go!

@ShaohongBai

ShaohongBai commented Feb 26, 2019

@alanakbik, do we currently have any workaround for a loading issue like this? Additionally, is there functionality that does NOT need to reload the data from text after formatting, and instead just batches the already-loaded data in the loader?

@alanakbik
Collaborator

The only current workaround is to reduce the training data set size, which is not great. We're making this a priority feature for version 0.5 (see #563)!

@superzadeh

superzadeh commented Mar 20, 2019

Also looking forward to this feature in 0.5; currently we can't use Flair at all to train multi-label text classifiers.

We're working on a dataset of ~400k lines (roughly a 600MB training file, split into 10 files), training on a Tesla V100 with 16GB of GPU RAM, 64GB of RAM on the machine itself, and an SSD disk. We can only load about 2k lines before we run into the same issue described here (we're loading the news-forward and news-backward FlairEmbeddings): we either run out of RAM while loading the corpus or run out of GPU RAM while training.

Using the GloVe WordEmbeddings only, we can fit more data, but that somewhat defeats the original purpose of using Flair.
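
For context, this is the kind of embedding stack we are comparing (a rough sketch, assuming a flair 0.4.x release that provides DocumentRNNEmbeddings):

from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings

# Lighter on memory: GloVe word embeddings only.
glove_only = DocumentRNNEmbeddings([WordEmbeddings('glove')])

# What we actually want: contextual FlairEmbeddings on top of GloVe,
# which is where we run out of RAM / GPU RAM.
flair_stack = DocumentRNNEmbeddings([
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
])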

Is there anything we can help with? So far I really like the way Flair is being built, reminds me of spaCy in terms of ease of use, with the best NLP techniques available.

What would also help is a progress indicator while loading the corpus. Currently we only see the following output:

2019-03-20 10:18:41,276 Reading data from /floyd/home/output/small
2019-03-20 10:18:41,277 Train: /floyd/home/output/small/train.csv
2019-03-20 10:18:41,278 Dev: /floyd/home/output/small/dev.csv
2019-03-20 10:18:41,279 Test: /floyd/home/output/small/test.csv

Having some progress indication would make it easier to see what percentage of the corpus fits into the available RAM, simply by monitoring memory usage while the corpus loads. It would also help when creating our own embeddings, as loading a very large corpus takes quite a bit of time.

That would be a quick band-aid fix though (for 0.4.x), as the solution planned for 0.5 sounds much better.
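
In the meantime, a user-side stopgap is to scan the files with a progress bar before loading, just to gauge corpus size and watch memory usage (a minimal sketch; assumes tqdm is installed):

from tqdm import tqdm

# Count lines with a progress bar, e.g. to estimate how much of the
# corpus will fit before handing the files to the data fetcher.
def count_lines(path):
    n = 0
    with open(path, encoding='utf-8') as f:
        for _ in tqdm(f, desc=path, unit='lines'):
            n += 1
    return n

count_lines('/floyd/home/output/small/train.csv')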


@alanakbik
Collaborator

Hello @superzadeh, good points - we should definitely add a progress bar to indicate how much of a corpus has been loaded. I'll put in a ticket for this.

Wrt large datasets: since many users have this problem, this feature will be a major priority for version 0.5. From our side, development on this will begin in the second week of April (when everybody is back from vacation). Until then, you could check out pull request #595, which implements an iterating data fetcher - some features such as randomization of the data are still missing and it's not yet fully tested, but maybe it could work for you.

alanakbik pushed commits that referenced this issue on May 10, May 13, May 15, May 16, and May 17, 2019
@stale

stale bot commented Apr 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Apr 30, 2020
@alanakbik
Collaborator

PR merged for this a while back
