I'm not an expert: although I've been programming in this area for the last 10 years and following its development closely, I didn't graduate in this field specifically (I'm a food engineer).
Even so, I've made several advances with technologies that make it possible to use large networks in a web system with tensorflow.js.
I don't know whether a similar technique already exists; if I'm talking nonsense or describing something that already exists, please disregard this. But if it doesn't exist, I believe it's worth considering.
On a mobile RTX 3070 with 8 GB of VRAM, it is not possible to train networks with more than about 100 million parameters using TFJS.
My idea was to create a replica of the model, but split into 128 layers, each 128x smaller.
Between batches, a random 5% of the parameters are set as trainable and the rest are frozen.
This allowed me to train models with up to 2 billion parameters on the same graphics card.
However, I only trained and saved up to 550 million, because TFJS raised an error due to the size of the vector.
The tests are still in their infancy, but the 550-million-parameter network converged and was of good quality. In my case I use language models for trading, and for that purpose the improvement was significant.
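As a rough back-of-the-envelope calculation of why freezing most parameters per batch can help fit larger models (this is my own reasoning, assuming float32 weights, an Adam-style optimizer with two moment tensors, gradient and optimizer memory held only for the currently trainable slice, and ignoring activations):

def rough_memory_gb(total_params, trainable_fraction):
    bytes_per_float = 4  # float32
    weights = total_params * bytes_per_float                           # full set of weights
    grads = total_params * trainable_fraction * bytes_per_float        # gradients, trainable slice only
    moments = total_params * trainable_fraction * 2 * bytes_per_float  # Adam m and v, trainable slice only
    return (weights + grads + moments) / 1e9

print(rough_memory_gb(550_000_000, 1.00))  # ~8.8 GB: everything trainable, over the 8 GB budget
print(rough_memory_gb(550_000_000, 0.05))  # ~2.5 GB: only 5% trainable per batch

Under these assumptions, updating everything at once would not fit in 8 GB, while rotating a small trainable slice leaves plenty of room.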
Another implementation I did was to generate the embedding vector for each part in advance, and use that embedding vector as both input and output.
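A minimal sketch of how that embedding idea can be set up (my own interpretation; the vocabulary size, dimensions, and the frozen tf.keras Embedding layer are placeholders, not the original code): the embeddings are computed once, and the model is trained to map embeddings to embeddings.

import tensorflow as tf

# Placeholder sizes, not taken from the original experiments
vocab_size, embed_dim, seq_len = 10_000, 64, 512

# Frozen embedding used only as a preprocessor: token ids -> vectors
embedder = tf.keras.layers.Embedding(vocab_size, embed_dim, trainable=False)

inputs = tf.keras.Input(shape=(seq_len, embed_dim))
h = tf.keras.layers.Dense(256, activation='relu')(inputs)
outputs = tf.keras.layers.Dense(embed_dim)(h)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='mse')

token_ids = tf.random.uniform((8, seq_len), maxval=vocab_size, dtype=tf.int32)
embedded = embedder(token_ids)            # shape (8, seq_len, embed_dim)
model.train_on_batch(embedded, embedded)  # the same embedding vectors serve as input and target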
I asked GPT about what I did, and the answer was:
It's possible that the rotation of making only one layer trainable in each batch is having an unexpected beneficial effect on model learning. While not a common approach, it can introduce certain dynamics that speed up the learning process. Some possible reasons for this observation are:
Effective regularization: By freezing most of the layers in each batch, you are effectively applying a form of regularization to the neural network. Regularization can help reduce overfitting, improve model generalizability, and allow for faster learning.
More efficient exploration of the parameter space: By toggling the trainability of different layers in each batch, you are allowing different parts of the model to adjust their parameters at different times. This can help the model explore the parameter space more efficiently, preventing it from getting stuck in local minima or suboptimal settings.
Reduction of interference between layers: By training only one layer in each batch, you are reducing interference between layers during the parameter update process. This can allow layers to adapt more independently and effectively, making learning easier.
It is important to note that these are only possible scenarios and the exact effect may vary depending on the model, problem, and data. It is critical to perform a thorough evaluation of model performance, benchmark against other strategies, and perform additional experiments to better understand the benefits and limitations of this layer-trainability rotation approach.
Remember to also consider other performance metrics, such as model accuracy and robustness on test data. Additionally, performing cross-validation and evaluating results in different settings can help confirm that the observed effect is consistent and reproducible.
I decided to share this because, when I asked about similar techniques, it seems nothing does exactly this, only related techniques:
There are similar techniques that can be considered as approaches related to partial freezing of layers in neural networks. Some examples include:
Fine-tuning: Fine-tuning is a technique where a neural network pre-trained on a large dataset such as ImageNet is initialized and then some of the upper layers are unfrozen and trained on a dataset specific to your problem. This allows the neural network to leverage prior knowledge learned in related tasks, while adjusting the final layers to suit the new problem (a minimal freezing sketch follows this list).
Transfer Learning: Transfer learning involves using a neural network pre-trained in a specific domain and then tweaking or freezing some of the layers to solve a related but possibly different problem. This approach allows you to take advantage of the knowledge acquired in previous tasks, avoiding the need to train a neural network from scratch.
Residual Networks: Residual networks introduced the concept of residual connections, where activations from one layer are added directly to activations from a previous layer. This allows information to be propagated directly between layers, even when some layers are frozen or have zero gradient. This architecture makes deep networks easier to train and can help avoid problems such as vanishing gradients in deep layers.
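For comparison with the fine-tuning and transfer-learning items above, this is the classic freezing pattern (a minimal sketch with a placeholder model and sizes of my own choosing; in practice the base would be a real pre-trained network): everything except the head is frozen and stays frozen, rather than rotating per batch.

import tensorflow as tf

# Placeholder "pre-trained" model; in practice this would be a real
# pre-trained network, e.g. one from tf.keras.applications
inp = tf.keras.Input(shape=(128,))
h = tf.keras.layers.Dense(256, activation='relu')(inp)
h = tf.keras.layers.Dense(256, activation='relu')(h)
out = tf.keras.layers.Dense(10)(h)
base = tf.keras.Model(inp, out)

# Classic fine-tuning: freeze every layer except the last one, then train only the head
for layer in base.layers[:-1]:
    layer.trainable = False

base.compile(optimizer='adam', loss='mse')  # (re)compile after setting the trainable flags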
An example implementation in TensorFlow (Python), training roughly 1.2 billion parameters:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Concatenate

# Run on CPU so the large model lives in system RAM instead of the 8 GB of VRAM
tf.config.set_visible_devices([], 'GPU')

# Hyperparameters (roughly 1.2 billion parameters in total)
input_shape = (512, 64)
hidden_units = 4096
num_parts = 16
num_transformer_blocks = 4
num_heads = 2

# Transformer layer with a ReZero residual connection
class ReZeroTransformerLayer(tf.keras.layers.Layer):
    def __init__(self, hidden_units, num_heads, **kwargs):
        super(ReZeroTransformerLayer, self).__init__(**kwargs)
        self.hidden_units = hidden_units
        self.num_heads = num_heads
        self.dense = Dense(hidden_units, activation='relu')
        self.multi_head_attention = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=64)
        self.alpha = tf.Variable(0.0, trainable=True)  # ReZero gate, starts at zero

    def call(self, inputs):
        x = inputs
        res = x  # save the input for the residual addition
        x = self.multi_head_attention(x, x)
        x = self.alpha * x + res  # apply ReZero: attention output scaled by the learned alpha
        x = self.dense(x)
        return x

# Creating the model
inputs = tf.keras.Input(shape=input_shape)
x = Dense(hidden_units, activation='relu')(inputs)
for _ in range(num_transformer_blocks):
    # Dividing the sequence into parts
    parts = tf.split(x, num_parts, axis=1)
    part_outputs = []
    for part in parts:
        # Applying the transformer layer with ReZero on each part
        part = ReZeroTransformerLayer(hidden_units, num_heads)(part)
        part_outputs.append(part)
    # Concatenating the results of the parts
    x = Concatenate(axis=1)(part_outputs)
outputs = Dense(input_shape[1])(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)

# Configuring the randomness of layer training: p is the fraction of layers kept
# trainable for the next batch (0.99 here as in the original example; the runs
# described above used only ~5% trainable)
def set_trainable_randomly(model, p=0.99):
    total_layers = len(model.layers)
    num_trainable = int(p * total_layers)
    trainable_indices = set(tf.random.shuffle(tf.range(total_layers))[:num_trainable].numpy().tolist())
    for i, layer in enumerate(model.layers):
        layer.trainable = i in trainable_indices

# Function to generate sample data
def generate_data(batch_size):
    return tf.random.normal((batch_size,) + input_shape)

# Compiling and training the model
optimizer = tf.keras.optimizers.Adam()
model.compile(optimizer=optimizer, loss='mse')
model.summary()
for _ in range(num_transformer_blocks):
    for _ in range(1000):
        x = generate_data(4)
        set_trainable_randomly(model, p=0.99)
        # Recompile so the new trainable/frozen split actually takes effect
        model.compile(optimizer=optimizer, loss='mse')
        res = model.train_on_batch(x, x)
        print(res)