Alarming RAM Usage Spike during Hyperband Search #1031

Open · LorenzoMonti opened this issue Dec 2, 2024 · 0 comments

Labels: bug Something isn't working

While training a model with a Hyperband tuner on an HPC system with ~500 GB of RAM, the program's RAM usage grows continuously and memory is not released between trials, similar to what is described in #873. Despite the large system capacity, RAM usage eventually exceeds the available resources, crashing the program or the entire node.
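
For anyone trying to reproduce this, one way to watch the growth is to log the process's resident set size at the end of every epoch. This is a minimal sketch, assuming psutil is installed (the MemoryLogger callback is mine, not part of keras_tuner):

    import os
    import psutil
    import tensorflow as tf

    class MemoryLogger(tf.keras.callbacks.Callback):
        """Prints the current process RSS after each epoch."""
        def on_epoch_end(self, epoch, logs=None):
            rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
            print(f"epoch {epoch}: RSS = {rss_gb:.2f} GB")

Passing an instance of this callback to tuner.search makes the between-trial growth visible in the logs.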

This is the relevant code:

    # Imports needed by this snippet (the enclosing class definition is elided)
    import tensorflow as tf
    import keras_tuner as kt

    def build_model(self, hp):
        # Tunable hyperparameters
        num_transformer_blocks = hp.Int(
            'num_transformer_blocks', 
            min_value=2, 
            max_value=8, 
            step=2
        )
        
        head_size = hp.Int(
            'head_size', 
            min_value=32, 
            max_value=128, 
            step=32
        )
        
        num_heads = hp.Int(
            'num_heads', 
            min_value=2, 
            max_value=8, 
            step=2
        )
        
        ff_dim = hp.Int(
            'ff_dim', 
            min_value=64, 
            max_value=256, 
            step=64
        )
        
        learning_rate = hp.Float(
            'learning_rate', 
            min_value=1e-4, 
            max_value=1e-2, 
            sampling='log'
        )
        
        dropout_rate = hp.Float(
            'dropout_rate', 
            min_value=0.1, 
            max_value=0.5, 
            step=0.1
        )
        
        sparsity_rate = hp.Float(
            'sparsity_rate', 
            min_value=0.1, 
            max_value=0.5, 
            step=0.1
        )
        
        # Input layer
        inputs = tf.keras.layers.Input(shape=self.input_shape)
        
        # Positional Encoding
        positions = self._positional_encoding(
            self.input_shape[0], 
            self.input_shape[1]
        )
        positions = tf.expand_dims(positions, axis=0)
        
        x = tf.keras.layers.Add()([inputs, positions])
        
        for _ in range(num_transformer_blocks):
            x = self._informer_encoder(
                x, 
                head_size, 
                num_heads, 
                ff_dim, 
                dropout_rate,
                sparsity_rate
            )
        
        # Sequence Length Reduction
        x = tf.keras.layers.GlobalAveragePooling1D()(x)
        
        # MLP Layers
        x = tf.keras.layers.Dense(128, activation="gelu")(x)
        x = tf.keras.layers.Dropout(dropout_rate)(x)
        x = tf.keras.layers.Dense(64, activation="gelu")(x)
        x = tf.keras.layers.Dropout(dropout_rate)(x)
        
        # Output Layer
        outputs = tf.keras.layers.Dense(1)(x)
        
        # Create and compile model
        model = tf.keras.Model(inputs=inputs, outputs=outputs, name='informer_tuned')
        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
            loss='mean_squared_error',
            metrics=['mae']
        )
        
        return model

    def _positional_encoding(self, length, d_model):
        ...

    def _informer_encoder(self, inputs, head_size, num_heads, ff_dim, dropout, sparsity_rate):
        ...

    def tune_hyperparameters(self, X_train, X_val, y_train, y_val):
        tuner = kt.Hyperband(
            self.build_model,
            objective='val_mae',
            max_epochs=500,
            factor=3,
            directory=self.output_directory,
            project_name='informer_tuning',
            executions_per_trial=self.executions_per_trial
        )
        
        # Early stopping
        stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=100)
        
        # Run hyperparameter search
        tuner.search(
            X_train, y_train,
            epochs=500,
            validation_data=(X_val, y_val),
            callbacks=[stop_early]
        )
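
For scale, if I am reading Hyperband's bracket math correctly: with max_epochs=500 and factor=3, the number of brackets is floor(log_3(500)) + 1 = 5 + 1 = 6, and the largest bracket alone starts with 3^5 = 243 configurations, so a full search builds on the order of several hundred models. Even a small amount of memory retained per build_model call therefore adds up to hundreds of GB over the run.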

At this point, I would like to know whether there are any updates on a solution that does not rely on workarounds such as calling tf.keras.backend.clear_session() before every build_model call.
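
For reference, the workaround I mean looks roughly like this. It is only a sketch, assuming the standard keras_tuner Tuner API (the MemorySafeHyperband name is mine): it clears the Keras session before each trial instead of inside build_model:

    import gc
    import tensorflow as tf
    import keras_tuner as kt

    class MemorySafeHyperband(kt.Hyperband):
        def run_trial(self, trial, *args, **kwargs):
            # Drop graph state left over from the previous trial so it can
            # be garbage-collected before the next model is built.
            tf.keras.backend.clear_session()
            gc.collect()
            return super().run_trial(trial, *args, **kwargs)

This keeps build_model untouched, but it is still the same clear_session workaround, which is why I am asking whether a proper fix is planned.
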
Thanks in advance.
