Neural scaling law
In machine learning, a neural scaling law is an empirical scaling law relating the characteristics of a family of neural networks, such as model size, dataset size, training cost, and resulting performance.[1][2]
Introduction
In general, a neural model can be characterized by four quantities: the size of the model, the size of the training dataset, the cost of training, and the performance after training. Each of these four variables can be precisely defined as a real number, and they are empirically found to be related by simple statistical laws, called "scaling laws".
Examples
Chinchilla scaling
One particular scaling law ("Chinchilla scaling") states that, for a large language model (LLM) autoregressively trained for one epoch, with a cosine learning rate schedule, we have:[3]
where the variables are
- $C$ is the cost of training the model, in FLOPs.
- $N$ is the number of parameters in the model.
- $D$ is the number of tokens in the training set.
- $L$ is the average negative log-likelihood loss per token (nats/token), achieved by the trained LLM on the test dataset.
and the statistical parameters are
- $C_0 = 6$, meaning that it costs 6 FLOPs per parameter to train on one token. This estimate is from.[4] Note that training cost is much higher than inference cost, where it costs 1 to 2 FLOPs per parameter to infer on one token.
- $\alpha = 0.34$, $\beta = 0.28$, $A = 406.4$, $B = 410.7$, $L_0 = 1.69$.
The statistical laws were fitted over experimental data with $N \in [7\times 10^{7}, 1.6\times 10^{10}]$ parameters and $D \in [5\times 10^{9}, 5\times 10^{11}]$ tokens.
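For concreteness, the parametric form above can be evaluated directly. The following is a minimal illustrative sketch in Python, using the fitted constants quoted above; it is not code from the Chinchilla paper.

```python
# Illustrative sketch of the Chinchilla parametric loss and training cost.
# Constants are the fitted values quoted above; this is not official code.
A, B, L0 = 406.4, 410.7, 1.69   # fitted statistical parameters
ALPHA, BETA = 0.34, 0.28
C0 = 6                          # FLOPs per parameter per training token

def training_flops(n_params, n_tokens):
    """Training cost C = C0 * N * D, in FLOPs."""
    return C0 * n_params * n_tokens

def chinchilla_loss(n_params, n_tokens):
    """Test loss L(N, D) = A / N^alpha + B / D^beta + L0, in nats/token."""
    return A / n_params**ALPHA + B / n_tokens**BETA + L0

# Example: a 70-billion-parameter model trained on 1.4 trillion tokens.
print(training_flops(70e9, 1.4e12))   # ~5.9e23 FLOPs
print(chinchilla_loss(70e9, 1.4e12))  # ~1.94 nats/token
```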
Since there are 4 variables related by 2 equations, imposing 1 additional constraint and 1 additional optimization objective allows us to solve for all four variables. In particular, for any fixed training cost $C$, there is a unique choice of the other variables that minimizes $L$. This provides the optimal $N_{opt}(C), D_{opt}(C)$ for any fixed $C$:

$$N_{opt}(C) = G\left(\frac{C}{6}\right)^{a}, \qquad D_{opt}(C) = G^{-1}\left(\frac{C}{6}\right)^{b}, \qquad \text{where } G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}},\ a = \frac{\beta}{\alpha+\beta},\ b = \frac{\alpha}{\alpha+\beta}.$$
Plugging in the numerical values, we obtain the "Chinchilla efficient" model size and training dataset size, as well as the test loss achievable:

$$N_{opt}(C) \approx 0.6\, C^{0.45}, \qquad D_{opt}(C) \approx 0.3\, C^{0.55}, \qquad L_{opt}(C) \approx 1070\, C^{-0.154} + 1.7$$
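The closed-form allocation can likewise be sketched in a few lines; the following is illustrative only, assuming the same fitted constants as above.

```python
# Illustrative sketch: compute-optimal N and D for a given FLOPs budget,
# using the closed-form solution above with the fitted constants quoted earlier.
A, B, L0 = 406.4, 410.7, 1.69
ALPHA, BETA = 0.34, 0.28

def chinchilla_optimal(compute_flops):
    """Return (N_opt, D_opt, L_opt) minimizing L subject to C = 6 * N * D."""
    G = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
    a = BETA / (ALPHA + BETA)    # exponent for N_opt, ~0.45
    b = ALPHA / (ALPHA + BETA)   # exponent for D_opt, ~0.55
    n_opt = G * (compute_flops / 6) ** a
    d_opt = (compute_flops / 6) ** b / G
    l_opt = A / n_opt**ALPHA + B / d_opt**BETA + L0
    return n_opt, d_opt, l_opt

# Example: the Gopher training budget of roughly 5.76e23 FLOPs.
n, d, l = chinchilla_optimal(5.76e23)
print(f"N_opt ~ {n:.2e} parameters, D_opt ~ {d:.2e} tokens, L_opt ~ {l:.2f}")
```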
Similarly, we may find the optimal training dataset size and training compute budget for any fixed model parameter size, and so on. There are other estimates for "Chinchilla efficient" model size and training dataset size. The above is based on a statistical model of $L(N, D)$. One can also directly fit a statistical law for $N_{opt}(C), D_{opt}(C)$ without going through the detour, for which one obtains:

$$N_{opt}(C) \propto C^{0.50}, \qquad D_{opt}(C) \propto C^{0.50},$$
or as tabulated:
| $N$ | $C$ / FLOP | $C$ / FLOPs of training Gopher | $D$ |
|---|---|---|---|
| 400 Million | 1.92e+19 | 1/29968 | 8.0 Billion |
| 1 Billion | 1.21e+20 | 1/5706 | 20.2 Billion |
| 10 Billion | 1.23e+22 | 1/2819 | 205.1 Billion |
| 67 Billion | 5.76e+23 | 1 | 1.5 Trillion |
| 175 Billion | 3.85e+24 | 6.7 | 3.7 Trillion |
| 280 Billion | 9.90e+24 | 17.2 | 5.9 Trillion |
| 520 Billion | 3.43e+25 | 59.5 | 11.0 Trillion |
| 1 Trillion | 1.27e+26 | 221.3 | 21.2 Trillion |
| 10 Trillion | 1.30e+28 | 22515.9 | 216.2 Trillion |
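The FLOP column of the table is (up to rounding of the tabulated values) consistent with the cost model $C = 6ND$; a quick illustrative check:

```python
# Quick check that the tabulated FLOP budgets approximately satisfy C = 6 * N * D
# (small discrepancies are due to rounding of the tabulated values).
rows = [
    (400e6, 8.0e9,  1.92e19),
    (1e9,   20.2e9, 1.21e20),
    (67e9,  1.5e12, 5.76e23),
    (280e9, 5.9e12, 9.90e24),
]
for n_params, n_tokens, flops in rows:
    print(f"6*N*D = {6 * n_params * n_tokens:.3g}  vs tabulated C = {flops:.3g}")
```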
(Henighan, Kaplan, et al., 2020)
A 2020 analysis [5] studied the statistical relations between $C$, $N$, $D$, and $L$ over a wide range of values and found similar scaling laws, across multiple modalities (text, video, image, text-to-image, etc.).[5]
In particular, the scaling laws it found are (Table 1 of [5]):
- For each modality, they fixed one of the two variables $C, N$ and varied the other one ($D$ is varied along with it using $D = C/6N$); the achievable test loss satisfies $$L = L_0 + \left(\frac{x_0}{x}\right)^{\alpha_x}$$ where $x$ is the varied variable and $L_0, x_0, \alpha_x$ are parameters found by statistical fitting. The exponent $\alpha_x$ is the most important one. (A fitting sketch is given after this list.)
- When $N$ is the varied variable, the fitted exponent depends on the model modality. It corresponds to the $\alpha$ from the Chinchilla scaling paper.
- When $C$ is the varied variable, the fitted exponent depends on the model modality. It corresponds to the $\beta$ from the Chinchilla scaling paper.
- Given a fixed computing budget, the optimal model parameter count grows as a power of compute, consistently around $N_{opt}(C) \propto C^{0.7}$. The proportionality constant varies by a factor of up to 10 across modalities, and the exponent also varies somewhat by modality. This exponent corresponds to the exponent $a$ (in $N_{opt}(C) \propto C^{a}$) from the Chinchilla scaling paper.
- It's "strongly suggested" (but not statistically checked) that . This exponent corresponds to the from the Chinchilla scaling paper.
The scaling law of loss as a function of training compute was confirmed during the training of GPT-3 (Figure 3.1 of [6]).
Vision transformers
Vision transformers, like language transformers, exhibit scaling laws. A 2022 study trained vision transformers with parameter counts ranging from roughly 5 million to 2 billion, on image datasets of up to 3 billion images, with training compute $C$ measured in TPUv3-core-days.[7]
After training, each model is finetuned on the ImageNet training set. Let $L$ be the error probability of the finetuned model on the ImageNet test set. They found that $L$ follows a saturating power law in compute of the form $L = a\,(C + d)^{-b} + c$, where $c$ is an irreducible error floor and $a$, $b$, $d$ are fitted constants.
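A sketch of this saturating form is below; the constants used are placeholders for illustration, not the fitted values reported in the paper.

```python
# Sketch of the saturating power law for the finetuned ImageNet error rate:
# err(C) = a * (C + d)**(-b) + c, where c is the irreducible error floor.
# All constants below are placeholders, not the paper's fitted values.
def vit_error(compute_core_days, a=0.5, b=0.35, c=0.09, d=0.01):
    return a * (compute_core_days + d) ** (-b) + c

for c_days in (1, 10, 100, 1000):   # compute in TPUv3-core-days
    print(c_days, round(vit_error(c_days), 4))
```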
Broken Neural Scaling Laws (BNSL)
A 2022 analysis [8] found that many scaling behaviors of artificial neural networks follow a smoothly broken power law functional form:

$$y = a + \left(b x^{-c_0}\right) \prod_{i=1}^{n} \left(1 + \left(\frac{x}{d_i}\right)^{1/f_i}\right)^{-c_i f_i}$$
in which $x$ refers to the quantity being scaled (e.g. $C$, $N$, $D$, number of training steps, or model input size) and $y$ refers to the downstream (or upstream) performance evaluation metric of interest (e.g. prediction error, cross entropy, calibration error, AUROC, BLEU score percentage, F1 score, reward, Elo rating, or FID score) in zero-shot, prompted, or fine-tuned settings. The parameters $a, b, c_0, c_1, \ldots, c_n, d_1, \ldots, d_n, f_1, \ldots, f_n$ are found by statistical fitting.
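A direct transcription of this functional form into code (with arbitrary placeholder parameter values) is shown below.

```python
# Smoothly broken power law (BNSL) functional form:
# y = a + b * x**(-c0) * prod_i (1 + (x / d_i)**(1 / f_i))**(-c_i * f_i)
import numpy as np

def bnsl(x, a, b, c0, c, d, f):
    """`c`, `d`, `f` are equal-length sequences, one entry per break."""
    x = np.asarray(x, dtype=float)
    y = b * x ** (-c0)
    for c_i, d_i, f_i in zip(c, d, f):
        y = y * (1 + (x / d_i) ** (1 / f_i)) ** (-c_i * f_i)
    return a + y

# Example with a single break at x = 1e6 (all parameter values are arbitrary).
xs = np.logspace(3, 9, num=7)
print(bnsl(xs, a=0.1, b=5.0, c0=0.05, c=[0.3], d=[1e6], f=[0.5]))
```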
The scenarios in which the scaling behaviors of artificial neural networks were found to follow this functional form include large-scale vision, language, audio, video, diffusion, generative modeling, multimodal learning, contrastive learning, AI alignment, AI capabilities, robotics, out-of-distribution (OOD) generalization, continual learning, transfer learning, uncertainty estimation / calibration, out-of-distribution detection, adversarial robustness, distillation, sparsity, retrieval, quantization, pruning, fairness, molecules, computer programming/coding, math word problems, arithmetic, emergent abilities, double descent, supervised learning, unsupervised/self-supervised learning, and reinforcement learning (single agent and multi-agent).
The architectures for which the scaling behaviors of artificial neural networks were found to follow this functional form include ResNets, Transformers, MLP-Mixers, Graph Neural Networks, U-Nets, Ensembles (and Non-Ensembles), MoE (Mixture of Experts) (and Non-MoE) Models, and Sparse Pruned (and Non-Sparse Unpruned) Models.
References
- Bahri, Yasaman; Dyer, Ethan; Kaplan, Jared; Lee, Jaehoon; Sharma, Utkarsh (2021-02-12). "Explaining Neural Scaling Laws". arXiv:2102.06701 [cond-mat, stat].
- Hestness, Joel; Narang, Sharan; Ardalani, Newsha; Diamos, Gregory; Jun, Heewoo; Kianinejad, Hassan; Patwary, Md Mostofa Ali; Yang, Yang; Zhou, Yanqi (2017-12-01). "Deep Learning Scaling is Predictable, Empirically". arXiv:1712.00409 [cs, stat].
- Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Buchatskaya, Elena; Cai, Trevor; Rutherford, Eliza; Casas, Diego de Las; Hendricks, Lisa Anne; Welbl, Johannes; Clark, Aidan; Hennigan, Tom; Noland, Eric; Millican, Katie; Driessche, George van den; Damoc, Bogdan (2022-03-29). "Training Compute-Optimal Large Language Models". arXiv:2203.15556 [cs].
- Kaplan, Jared; McCandlish, Sam; Henighan, Tom; Brown, Tom B.; Chess, Benjamin; Child, Rewon; Gray, Scott; Radford, Alec; Wu, Jeffrey; Amodei, Dario (2020). "Scaling Laws for Neural Language Models". CoRR. abs/2001.08361. arXiv:2001.08361.
- Henighan, Tom; Kaplan, Jared; Katz, Mor; Chen, Mark; Hesse, Christopher; Jackson, Jacob; Jun, Heewoo; Brown, Tom B.; Dhariwal, Prafulla; Gray, Scott; Hallacy, Chris; Mann, Benjamin; Radford, Alec; Ramesh, Aditya; Ryder, Nick; Ziegler, Daniel M.; Schulman, John; Amodei, Dario; McCandlish, Sam (2020-10-27). "Scaling Laws for Autoregressive Generative Modeling". arXiv:2010.14701. OCLC 1228442047.
- Brown, Tom; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared D; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon (2020). "Language Models are Few-Shot Learners". Advances in Neural Information Processing Systems. 33.
- Zhai, Xiaohua; Kolesnikov, Alexander; Houlsby, Neil; Beyer, Lucas (2022). "Scaling Vision Transformers". Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): 12104–12113.
- Caballero, Ethan; Gupta, Kshitij; Rish, Irina; Krueger, David (2022). "Broken Neural Scaling Laws". International Conference on Learning Representations (ICLR), 2023.