# Model Performance

Below are the generative performance results under different training settings and model configurations.

Generative performance of unconditional models.

| Model Name | Type | Params | FID | IS | Training Data | BS | Iters |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GET-T | Uncond | 8.6M | 15.23 | 8.40 | 1M | 128 | 800k |
| GET-M | Uncond | 19.2M | 10.81 | 8.77 | 1M | 128 | 800k |
| GET-S | Uncond | 37.2M | 7.99 | 9.05 | 1M | 128 | 800k |
| GET-B | Uncond | 62.2M | 7.39 | 9.17 | 1M | 128 | 800k |
| GET-B+ | Uncond | 83.5M | 7.21 | 9.07 | 1M | 128 | 800k |

Generative performance of class-conditional models.

| Model Name | Type | Params | FID | IS | Training Data | BS | Iters |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GET-B | Cond | 62.2M | 6.23 | 9.42 | 1M | 256 | 800k |
| GET-B | Cond | 62.2M | 5.66 | 9.63 | 2M | 256 | 1.2M |

Generative Equilibrium Transformers (GET) show a clear scaling trend. In most cases, FID has a log-linear relation with respect to the invested resources (training FLOPs/time/data/params) when the other dimensions are fixed. Scaling up the training data, providing better supervision from the perceptual loss/teacher model, and increasing training FLOPs (larger batch size or longer training schedule) all lead to better results, as demonstrated above.
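As a rough illustration of this log-linear trend, the following sketch fits FID against the log of a resource proxy using the unconditional results in the table above, with parameter count standing in for the resource axis. This is only an assumed way of eyeballing the trend, not part of the repository.

```python
import numpy as np

# Unconditional GET results from the table above:
# parameter count (in millions) as a rough resource proxy, and the reported FID.
params_m = np.array([8.6, 19.2, 37.2, 62.2, 83.5])
fid = np.array([15.23, 10.81, 7.99, 7.39, 7.21])

# Fit FID ~ a + b * log(params); a log-linear trend appears as a straight line
# in this space, with b < 0 indicating improvement as the model grows.
b, a = np.polyfit(np.log(params_m), fid, deg=1)
print(f"FID ~ {a:.2f} + {b:.2f} * log(params in M)")

# Predicted FID at each model size, to check the fit against the table.
pred = a + b * np.log(params_m)
print(np.round(pred, 2))
```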

Ideally, when scaling up training data, the model size and training FLOPs should be adjusted accordingly to achieve the best training efficiency. Restricted by our computing resources, however, we cannot derive the exact scaling law for compute-optimal models as in Chinchilla. Despite this restriction, our observations still suggest that GET's scaling law favors much more compact compute-optimal models than ViTs, which is ideal for memory-bound deployment, such as today's LLMs.

In particular, this work shows that memorizing sufficient "regular" data pairs can lead to a good generative model, whether with GET or ViT. The differences lie in data efficiency and training efficiency (though we assume there are much better training strategies).

Here, the term "regular" means the latent-image pairs are easy to learn, e.g., sampled from a pretrained model. Note that randomly paired latent codes, for example obtained by shuffling the latent-image pairs in the training set, cannot be memorized by a model trained at a constant learning rate (we use a fixed learning rate to train models over massive pairs in the above experiments), as the loss curve is non-decreasing. This implies that the learnability of a pairing can be a strong measure of pairing quality.
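A minimal sketch of how such a learnability probe could look is given below. It trains a small hypothetical regressor on (latent, image) pairs at a fixed learning rate and compares early versus late loss; the function name, architecture, and the assumption that `latents` and `images` are flattened 2-D tensors are all illustrative, not the repository's actual procedure.

```python
import torch
from torch import nn

def pairing_learnability(latents, images, shuffle=False, steps=1000, lr=1e-4):
    """Train a small regressor on (latent, image) pairs with a constant
    learning rate and report early vs. late loss. A shuffled (random) pairing
    is expected to yield a non-decreasing loss curve."""
    if shuffle:
        # Break the pairing by randomly permuting the targets.
        images = images[torch.randperm(len(images))]

    model = nn.Sequential(
        nn.Linear(latents.shape[1], 512),
        nn.GELU(),
        nn.Linear(512, images.shape[1]),
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # constant learning rate

    losses = []
    for _ in range(steps):
        loss = nn.functional.mse_loss(model(latents), images)
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())

    # Crude learnability signal: average loss at the start vs. the end.
    early = sum(losses[:50]) / 50
    late = sum(losses[-50:]) / 50
    return early, late

# Usage (with hypothetical flattened tensors `latents` and `images`):
# early, late = pairing_learnability(latents, images, shuffle=True)
# A late loss close to (or above) the early loss suggests an unlearnable pairing.
```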