memory usage (RAM) grows too fast #395

Open
viotemp1 opened this issue Aug 23, 2020 · 10 comments

@viotemp1

Hello,

I see that the Hyperband search is eating about 1 GB of memory every 20 trials or so. Is there anything I can do?
Regards,

Trial 2: (screenshot of memory usage, 2020-08-23 22:03)

Trial 20: (screenshot of memory usage, 2020-08-23 22:15)

@Derdin-datascience

Adding backend.clear_session() at the start of the model-building function worked for me:

import tensorflow as tf
from tensorflow.keras import backend
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def model_builder(hp):
    # Clear the global Keras state left over from the previous trial
    # before building a new model.
    backend.clear_session()
    model = Sequential()
    hp_drop = hp.Float('drop', min_value=0, max_value=0.2, step=0.025)
    model.add(Dense(128, activation="relu"))
    model.add(Dropout(hp_drop))
    model.add(Dense(1, activation="relu"))

    model.compile(
        loss='mean_absolute_error',
        optimizer=tf.keras.optimizers.Adam(0.001),
        metrics=["mean_absolute_percentage_error"]
    )
    return model
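
For context, a minimal sketch of how such a builder is usually handed to the tuner; the Hyperband settings, objective, and directory names below are illustrative placeholders, not taken from this thread:

import keras_tuner as kt

# `model_builder` is the function above; clear_session() runs at the
# start of every trial's build.
tuner = kt.Hyperband(
    model_builder,
    objective="val_loss",
    max_epochs=30,
    directory="tuning_dir",
    project_name="memory_issue_395",
)
# tuner.search(x_train, y_train, validation_split=0.2)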

@summelon

@Derdin-datascience This does not work under MirroredStrategy.

@JLPiper

JLPiper commented Mar 8, 2023

I am having this same issue three years later.
As the tuner search progresses through the trials, more and more RAM is consumed until either an OOM error occurs or the computer freezes entirely.
I have tried adding a clean-up function that runs at the start of every build_model call, consisting of clear_session(), del model, and gc.collect(), to no avail.
Has anyone found a reasonable fix for this yet?
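
For reference, a reconstruction of that clean-up attempt as described (the exact code was not posted; this is only a sketch, and as noted it did not stop the RAM growth):

import gc
import tensorflow as tf
from tensorflow.keras import backend as K

def clean_up(model=None):
    # Drop the reference to the previous trial's model, reset Keras'
    # global state, and force a garbage-collection pass.
    if model is not None:
        del model
    K.clear_session()
    gc.collect()

def build_model(hp):
    clean_up()  # runs at the start of every trial, as described above
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hp.Int("units", 32, 256, step=32), activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model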

@h4ck4l1

h4ck4l1 commented Sep 4, 2023

Having the same problem. Kaggle offers only 13 GB of RAM with their GPUs, so I am hitting the limit with WaveNet.
Has anyone found any solution?

@Furkan-rgb

Running into the same issue. I am currently using a very inelegant workaround: a main script that loops over the tuner script like so:

    import subprocess

    for i in range(5, 150, 3):
        print(f"Running tuner.py with argument {i}")
        subprocess.run(["python", "model/tuning/tuner.py", str(i)])

The tuner script then takes this argument as its max_trials parameter. Hoping there is a fix for the memory leak in the tuner itself!
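
For what it's worth, this is roughly what the tuner.py side of that loop could look like; the builder, data, and directory names are placeholders, and the key point is that keeping the same directory with overwrite=False lets every subprocess resume the same search and only run a few new trials before exiting:

import sys
import keras_tuner as kt

max_trials = int(sys.argv[1])  # the argument passed in by the outer loop

tuner = kt.BayesianOptimization(
    build_model,                 # placeholder: whatever hypermodel you tune
    objective="val_loss",
    max_trials=max_trials,
    directory="tuning_results",  # shared across runs so progress is resumed
    project_name="my_model",
    overwrite=False,             # keep the trials from earlier subprocesses
)
tuner.search(x_train, y_train, validation_split=0.2)  # placeholder data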

@haifeng-jin
Collaborator

I cannot find a feasible solution right now either. Anyone who has an idea of how to fix it is welcome to share it.

Thanks!

@OliverWeitman

Has anyone found a working solution to this? Neither backend.clear_session() nor gc.collect() worked for me...

@farhanhubble

farhanhubble commented Jan 17, 2024

I was training 16 models in a loop, and together they would devour 500 GB of memory on tensorflow = "2.5.0"! There was definitely a leak somewhere. Memory profiling did not help, and I was fairly sure the leak was somewhere outside the Python code. I switched to using a generator that produces one mini-batch of data at a time, and that seems to have completely plugged the leak.

I went from:

for i in range(y_train.shape[1]):
    model = _build_model()
    logger.info(f"Training model {i+1}")
    model.fit(
        x_train,
        y_train[:, i],
        validation_data=(x_test, y_test[:, i]),
        epochs=7,
        batch_size=16,
        callbacks=callbacks,
    )

to

def _data_generator(x, y, batch_size):
    for i in range(0, len(x), batch_size):
        yield x[i:i + batch_size], y[i:i + batch_size]


for i in range(y_train.shape[1]):
    logger.info(f"Creating model {i+1}")
    model = _build_model()
    model.fit(
        _data_generator(x_train, y_train[:, i], 16),
        validation_data=(x_test, y_test[:, i]),
        epochs=7,
        callbacks=callbacks,
        steps_per_epoch=int(np.ceil(len(x_train) / 16)),
    )
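
One caveat worth noting: a plain Python generator like _data_generator is exhausted after a single pass, so with epochs=7 Keras may warn that the input ran out of data after the first epoch. A tf.data pipeline (sketch below, assuming NumPy arrays as input) streams one batch at a time while remaining repeatable across epochs:

import tensorflow as tf

def make_dataset(x, y, batch_size=16):
    # Streams one mini-batch at a time; Keras re-iterates it for every epoch.
    return tf.data.Dataset.from_tensor_slices((x, y)).batch(batch_size)

# model.fit(make_dataset(x_train, y_train[:, i]),
#           validation_data=make_dataset(x_test, y_test[:, i]),
#           epochs=7, callbacks=callbacks)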

@jdkern11

jdkern11 commented Feb 22, 2024

I did some memory profiling, and if you look at the fit function in keras_tuner/src/engine/hypermodel.py, this is where memory starts leaking.

def fit(self, hp, model, *args, **kwargs):
    """Train the model.

    Args:
        hp: HyperParameters.
        model: `keras.Model` built in the `build()` function.
        **kwargs: All arguments passed to `Tuner.search()` are in the
            `kwargs` here. It always contains a `callbacks` argument, which
            is a list of default Keras callback functions for model
            checkpointing, tensorboard configuration, and other tuning
            utilities. If `callbacks` is passed by the user from
            `Tuner.search()`, these default callbacks will be appended to
            the user provided list.

    Returns:
        A `History` object, which is the return value of `model.fit()`, a
        dictionary, or a float.

        If return a dictionary, it should be a dictionary of the metrics to
        track. The keys are the metric names, which contains the
        `objective` name. The values should be the metric values.

        If return a float, it should be the `objective` value.
    """
    return model.fit(*args, **kwargs)

The model passed in here is the one you define with TensorFlow. As such, I do not think this is an issue with keras_tuner; I think it might actually be a memory leak in TensorFlow's model.fit.

I tried deleting the model, and that did not solve the issue, so I suspect the leakage happens somewhere inside this function. It may also be related to a custom train step?
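
In case it helps others reproduce the measurement, a small callback along these lines can be used to watch resident memory grow across trials (it assumes the third-party psutil package, which is not part of keras_tuner):

import os
import psutil
import tensorflow as tf

class MemoryLogger(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Resident set size of the current process, in megabytes.
        rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1e6
        print(f"epoch {epoch}: resident memory {rss_mb:.0f} MB")

# Passed through the search, it reaches every model.fit() call:
# tuner.search(x_train, y_train, callbacks=[MemoryLogger()])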

@JLPiper

JLPiper commented Feb 22, 2024

Until a proper fix is found: on #873, I described how I made a very rough workaround:

A. Use a less intensive hyperparameter search option that can feasibly complete its search before the memory consumption becomes too much. I found that switching from Bayesian optimization to Hyperband gave me a lot more leeway, at the cost of the benefits of Bayesian optimization.

B. Use a separate program to launch, monitor, and kill the main tuner program. A handful of libraries let you track resource usage relatively easily. Simply have that program launch the tuner, wait until its resource usage passes a certain threshold, and then kill the tuner (see the sketch below).

Keras Tuner naturally saves its progress, so it will pick up right where it left off. A word of warning, though: even then I have run into occasions where a single tuning step consumes too much memory by itself and gets stuck, because the watchdog kills the tuner before that step can finish.
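
A rough sketch of option B; the script name, memory threshold, and the use of psutil are my own illustrative choices, not a prescribed setup:

import subprocess
import time
import psutil

MEMORY_LIMIT_MB = 12_000            # restart the tuner once it exceeds this
TUNER_CMD = ["python", "tuner.py"]  # placeholder for the actual tuner script

while True:
    proc = subprocess.Popen(TUNER_CMD)
    ps = psutil.Process(proc.pid)
    killed = False
    while proc.poll() is None:       # tuner is still running
        try:
            rss_mb = ps.memory_info().rss / 1e6
        except psutil.NoSuchProcess:  # tuner exited between checks
            break
        if rss_mb > MEMORY_LIMIT_MB:
            proc.kill()              # Keras Tuner resumes from its saved trials
            proc.wait()
            killed = True
            break
        time.sleep(10)
    if not killed:                   # tuner exited on its own: search finished
        break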
