[Question] Training data too big to fit in memory #108

chatnord · 2024-06-18T15:49:04Z

Hi,

thanks for your work guys!
I am trying to explore using your implementation for our use-case, but I am stuck a bit on how you would deal with cases where the training set is too big to fit in memory.
With normal tensorflow, we usually have a "from_generator" implementation, and we only read one batch at a time. I was reading through your documentation today, and I am not sure of how you would proceed here.

Can someone please point me to any relevant information?

Thanks a lot!

achoum · 2024-06-19T14:16:21Z

Hi Chatnord,

If the training dataset is too large to fit in memory, there are essentially two families of solutions.

The first solution is to approximate the training.

The simplest solution is to train on less data. Training on less data can lead to worst models, but this is not always the case and can be evaluated empirically. For example, if your dataset contains 1B examples but you can only fit 10M examples in memory. Try training a model with 9M examples and with 10M examples. If both models perform the same, this is a good indication that having more data won't improve significantly the model quality.

Another solution is to train multiple models on different subset of data and then to ensemble them. This is generally easy to do by hand. The subset sampling can be done on the examples (e.g., sample a subset of examples) or on the features (train each model on all the examples but a subset of features).

If you need to train on more data, you can use distributed training:
https://ydf.readthedocs.io/en/latest/tutorial/distributed_training/

We should publish example of YDF distribute training in Google Cloud soon.

YDF distributed training distributes both the data and the computation. So, if one machine can store 10M examples, you can train on 100M examples using 10 machines with the same speed, and 1B examples using 100 machines.

Currently, the most memory efficient way to feed examples for in-process training (i.e., non-distributed training) is to use numpy arrays. Using Tensorflow data generator works, but this is not as efficient.

How large if your dataset? (number of examples and number of features).

Hope this helps.
M.

chatnord · 2024-06-19T14:53:29Z

Hey M,
thanks for answering (I think both here and on the forum :)), you have been of HUGE help.
Ultimately, I understand that if you cannot use TFRecords or any way to stream the input dataset (makes sense), so I think that I will try with solution #1 in the meantime.
Thanks a lot

TonyCongqianWang · 2024-06-23T21:58:52Z

I also have a problem with memory but my data does fit into Memory. I supply both a Training and a Validation set (around 40GB in total). When I run the training, the process uses way to much memory (400GB) until it gets killed. Initially I had more than one parallel try, but I reduced it to 1 after the process got killed the first time.

achoum · 2024-07-04T12:59:13Z

You are right to note that the RAM usage during training is larger than the training dataset alone. During training, there are two other large memory allocations:

To speed-up training, the training dataset is indexed. For instance, numerical values are sorted and the order is stored in a 32bits index (by default). This means that if you have a float32 numerical column, the memory usage will be doubled. In some cases, the memory usage can be more than doubled. For instance, if you have a int8 numerical input features, the index is still stored with 32bits precision, making the index memory usage 4x larger than the original dataset. We do not currently offer a way to disable this index.

Note however, than using discretized numerical features (discretize_numerical_columns=True) will make the index smaller.

When training with multi-thread, each threads allocates a pool of working memory. If the model trains with 20 threads, 20 such pool will be allocated. This pool of memory is generally small compared to the actual dataset, but in some cases, it can be significant.

A large memory usage can also be a sign of a problem in the learner configuration. A semi-common problem arrises when a numerical column with a large amount of unique values is treated as a CATEGORICAL column. The memory usage of the learning algorithm for categorical features is linear with the number of unique values making the miss-match mentioned above possibly allocating a lot of memory (and training a poor model).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Question] Training data too big to fit in memory #108

[Question] Training data too big to fit in memory #108

chatnord commented Jun 18, 2024

achoum commented Jun 19, 2024

chatnord commented Jun 19, 2024

TonyCongqianWang commented Jun 23, 2024

achoum commented Jul 4, 2024

[Question] Training data too big to fit in memory #108

[Question] Training data too big to fit in memory #108

Comments

chatnord commented Jun 18, 2024

achoum commented Jun 19, 2024

chatnord commented Jun 19, 2024

TonyCongqianWang commented Jun 23, 2024

achoum commented Jul 4, 2024