Neural scaling law
In machine learning, a neural scaling law is an empirical scaling law relating the characteristics of a family of neural networks, such as model size, dataset size, training cost, and resulting performance.[1][2]
Introduction
In general, a neural model can be characterized by four quantities: the size of the model, the size of the training dataset, the cost of training, and the performance after training. Each of these four variables can be precisely defined as a real number, and they are empirically found to be related by simple statistical laws, called "scaling laws".
Examples
Chinchilla scaling
One particular scaling law ("Chinchilla scaling") states that, for a large language model (LLM) autoregressively trained for one epoch, with a cosine learning rate schedule, we have:[3]
where the variables are
- $C$ is the cost of training the model, in FLOPs.
- $N$ is the number of parameters in the model.
- $D$ is the number of tokens in the training set.
- $L$ is the average negative log-likelihood loss per token (nats/token), achieved by the trained LLM on the test dataset.
and the statistical parameters are
- $C_0 = 6$, meaning that it costs 6 FLOPs per parameter to train on one token. This estimate is from.[4] Note that training cost is much higher than inference cost, where it costs 1 to 2 FLOPs per parameter to infer on one token.
- $\alpha = 0.34$, $\beta = 0.28$, $A = 406.4$, $B = 410.7$, $L_0 = 1.69$.
The statistical laws were fitted over experimental data with $N \in [7\times 10^{7}, 1.6\times 10^{10}]$ parameters and $D \in [5\times 10^{9}, 5\times 10^{11}]$ tokens.
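For concreteness, the parametric form above can be evaluated directly. The following is a minimal illustrative sketch in Python, using the fitted constants quoted above; it is not code from the Chinchilla paper.

```python
# Illustrative sketch of the Chinchilla parametric loss and training cost.
# Constants are the fitted values quoted above; this is not official code.
A, B, L0 = 406.4, 410.7, 1.69   # fitted statistical parameters
ALPHA, BETA = 0.34, 0.28
C0 = 6                          # FLOPs per parameter per training token

def training_flops(n_params, n_tokens):
    """Training cost C = C0 * N * D, in FLOPs."""
    return C0 * n_params * n_tokens

def chinchilla_loss(n_params, n_tokens):
    """Test loss L(N, D) = A / N^alpha + B / D^beta + L0, in nats/token."""
    return A / n_params**ALPHA + B / n_tokens**BETA + L0

# Example: a 70-billion-parameter model trained on 1.4 trillion tokens.
print(training_flops(70e9, 1.4e12))   # ~5.9e23 FLOPs
print(chinchilla_loss(70e9, 1.4e12))  # ~1.94 nats/token
```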
Since there are 4 variables related by 2 equations, imposing 1 additional constraint and 1 additional optimization objective allows us to solve for all four variables. In particular, for any fixed training cost $C$, there is a unique choice of the other variables that minimizes $L$. This provides the optimal $N_{opt}(C), D_{opt}(C)$ for any fixed $C$:

$$N_{opt}(C) = G\left(\frac{C}{6}\right)^{a}, \qquad D_{opt}(C) = G^{-1}\left(\frac{C}{6}\right)^{b}, \qquad \text{where } G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}},\ a = \frac{\beta}{\alpha+\beta},\ b = \frac{\alpha}{\alpha+\beta}.$$
Plugging in the numerical values, we obtain the "Chinchilla efficient" model size and training dataset size, as well as the test loss achievable:

$$N_{opt}(C) \approx 0.6\, C^{0.45}, \qquad D_{opt}(C) \approx 0.3\, C^{0.55}, \qquad L_{opt}(C) \approx 1070\, C^{-0.154} + 1.7$$
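The closed-form allocation can likewise be sketched in a few lines; the following is illustrative only, assuming the same fitted constants as above.

```python
# Illustrative sketch: compute-optimal N and D for a given FLOPs budget,
# using the closed-form solution above with the fitted constants quoted earlier.
A, B, L0 = 406.4, 410.7, 1.69
ALPHA, BETA = 0.34, 0.28

def chinchilla_optimal(compute_flops):
    """Return (N_opt, D_opt, L_opt) minimizing L subject to C = 6 * N * D."""
    G = (ALPHA * A / (BETA * B)) ** (1 / (ALPHA + BETA))
    a = BETA / (ALPHA + BETA)    # exponent for N_opt, ~0.45
    b = ALPHA / (ALPHA + BETA)   # exponent for D_opt, ~0.55
    n_opt = G * (compute_flops / 6) ** a
    d_opt = (compute_flops / 6) ** b / G
    l_opt = A / n_opt**ALPHA + B / d_opt**BETA + L0
    return n_opt, d_opt, l_opt

# Example: the Gopher training budget of roughly 5.76e23 FLOPs.
n, d, l = chinchilla_optimal(5.76e23)
print(f"N_opt ~ {n:.2e} parameters, D_opt ~ {d:.2e} tokens, L_opt ~ {l:.2f}")
```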
Similarly, we may find the optimal training dataset size and training compute budget for any fixed model parameter size, and so on. There are other estimates for "Chinchilla efficient" model size and training dataset size. The above is based on a statistical model of $L(N, D)$. One can also directly fit a statistical law for $N_{opt}(C), D_{opt}(C)$ without going through the detour, for which one obtains:

$$N_{opt}(C) \propto C^{0.50}, \qquad D_{opt}(C) \propto C^{0.50},$$
or as tabulated:
| $N$ | $C$ / FLOP | $C$ / FLOPs of training Gopher | $D$ |
|---|---|---|---|
| 400 Million | 1.92e+19 | 1/29968 | 8.0 Billion |
| 1 Billion | 1.21e+20 | 1/5706 | 20.2 Billion |
| 10 Billion | 1.23e+22 | 1/2819 | 205.1 Billion |
| 67 Billion | 5.76e+23 | 1 | 1.5 Trillion |
| 175 Billion | 3.85e+24 | 6.7 | 3.7 Trillion |
| 280 Billion | 9.90e+24 | 17.2 | 5.9 Trillion |
| 520 Billion | 3.43e+25 | 59.5 | 11.0 Trillion |
| 1 Trillion | 1.27e+26 | 221.3 | 21.2 Trillion |
| 10 Trillion | 1.30e+28 | 22515.9 | 216.2 Trillion |
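The FLOP column of the table is (up to rounding of the tabulated values) consistent with the cost model $C = 6ND$; a quick illustrative check:

```python
# Quick check that the tabulated FLOP budgets approximately satisfy C = 6 * N * D
# (small discrepancies are due to rounding of the tabulated values).
rows = [
    (400e6, 8.0e9,  1.92e19),
    (1e9,   20.2e9, 1.21e20),
    (67e9,  1.5e12, 5.76e23),
    (280e9, 5.9e12, 9.90e24),
]
for n_params, n_tokens, flops in rows:
    print(f"6*N*D = {6 * n_params * n_tokens:.3g}  vs tabulated C = {flops:.3g}")
```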
(Henighan, Kaplan, et al., 2020)
A 2020 analysis [5] studied the statistical relations between $C$, $N$, $D$, and $L$ over a wide range of values and found similar scaling laws, across multiple modalities (text, video, image, text-to-image, etc.).[5]
In particular, the scaling laws it found are (Table 1 of [5]):
- For each modality, they fixed one of the two variables $C, N$ and varied the other one ($D$ is varied along with it using $D = C/6N$); the achievable test loss satisfies $$L = L_0 + \left(\frac{x_0}{x}\right)^{\alpha_x}$$ where $x$ is the varied variable and $L_0, x_0, \alpha_x$ are parameters found by statistical fitting. The exponent $\alpha_x$ is the most important one. (A fitting sketch is given after this list.)
- When $N$ is the varied variable, the fitted exponent depends on the model modality. It corresponds to the $\alpha$ from the Chinchilla scaling paper.
- When $C$ is the varied variable, the fitted exponent depends on the model modality. It corresponds to the $\beta$ from the Chinchilla scaling paper.
- Given a fixed computing budget, the optimal model parameter count grows as a power of compute, consistently around $N_{opt}(C) \propto C^{0.7}$. The proportionality constant varies by a factor of up to 10 across modalities, and the exponent also varies somewhat by modality. This exponent corresponds to the exponent $a$ (in $N_{opt}(C) \propto C^{a}$) from the Chinchilla scaling paper.
- It's "strongly suggested" (but not statistically checked) that . This exponent corresponds to the from the Chinchilla scaling paper.
The scaling law of loss as a function of training compute was confirmed during the training of GPT-3 (Figure 3.1 of [6]).
Vision transformers
Vision transformers, like language transformers, exhibit scaling laws. A 2022 study trained vision transformers with parameter counts ranging from roughly 5 million to 2 billion, on image datasets of up to 3 billion images, with training compute $C$ measured in TPUv3-core-days.[7]
After training, each model is finetuned on the ImageNet training set. Let $L$ be the error probability of the finetuned model on the ImageNet test set. They found that $L$ follows a saturating power law in compute of the form $L = a\,(C + d)^{-b} + c$, where $c$ is an irreducible error floor and $a$, $b$, $d$ are fitted constants.
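A sketch of this saturating form is below; the constants used are placeholders for illustration, not the fitted values reported in the paper.

```python
# Sketch of the saturating power law for the finetuned ImageNet error rate:
# err(C) = a * (C + d)**(-b) + c, where c is the irreducible error floor.
# All constants below are placeholders, not the paper's fitted values.
def vit_error(compute_core_days, a=0.5, b=0.35, c=0.09, d=0.01):
    return a * (compute_core_days + d) ** (-b) + c

for c_days in (1, 10, 100, 1000):   # compute in TPUv3-core-days
    print(c_days, round(vit_error(c_days), 4))
```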
Broken Neural Scaling Laws (BNSL)
A 2022 analysis [8] found that many scaling behaviors of artificial neural networks follow a smoothly broken power law functional form:

$$y = a + \left(b x^{-c_0}\right) \prod_{i=1}^{n} \left(1 + \left(\frac{x}{d_i}\right)^{1/f_i}\right)^{-c_i f_i}$$
in which $x$ refers to the quantity being scaled (e.g. $C$, $N$, $D$, number of training steps, or model input size) and $y$ refers to the downstream (or upstream) performance evaluation metric of interest (e.g. prediction error, cross entropy, calibration error, AUROC, BLEU score percentage, F1 score, reward, Elo rating, or FID score) in zero-shot, prompted, or fine-tuned settings. The parameters $a, b, c_0, c_1, \ldots, c_n, d_1, \ldots, d_n, f_1, \ldots, f_n$ are found by statistical fitting.
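A direct transcription of this functional form into code (with arbitrary placeholder parameter values) is shown below.

```python
# Smoothly broken power law (BNSL) functional form:
# y = a + b * x**(-c0) * prod_i (1 + (x / d_i)**(1 / f_i))**(-c_i * f_i)
import numpy as np

def bnsl(x, a, b, c0, c, d, f):
    """`c`, `d`, `f` are equal-length sequences, one entry per break."""
    x = np.asarray(x, dtype=float)
    y = b * x ** (-c0)
    for c_i, d_i, f_i in zip(c, d, f):
        y = y * (1 + (x / d_i) ** (1 / f_i)) ** (-c_i * f_i)
    return a + y

# Example with a single break at x = 1e6 (all parameter values are arbitrary).
xs = np.logspace(3, 9, num=7)
print(bnsl(xs, a=0.1, b=5.0, c0=0.05, c=[0.3], d=[1e6], f=[0.5]))
```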
The scenarios in which the scaling behaviors of artificial neural networks were found to follow this functional form include large-scale vision, language, audio, video, diffusion, generative modeling, multimodal learning, contrastive learning, AI alignment, AI capabilities, robotics, out-of-distribution (OOD) generalization, continual learning, transfer learning, uncertainty estimation / calibration, out-of-distribution detection, adversarial robustness, distillation, sparsity, retrieval, quantization, pruning, fairness, molecules, computer programming/coding, math word problems, arithmetic, emergent abilities, double descent, supervised learning, unsupervised/self-supervised learning, and reinforcement learning (single agent and multi-agent).
The architectures for which the scaling behaviors of artificial neural networks were found to follow this functional form include ResNets, Transformers, MLP-Mixers, Graph Neural Networks, U-Nets, Ensembles (and Non-Ensembles), MoE (Mixture of Experts) (and Non-MoE) Models, and Sparse Pruned (and Non-Sparse Unpruned) Models.
References
- Bahri, Yasaman; Dyer, Ethan; Kaplan, Jared; Lee, Jaehoon; Sharma, Utkarsh (2021-02-12). "Explaining Neural Scaling Laws". arXiv:2102.06701 [cond-mat, stat].
- Hestness, Joel; Narang, Sharan; Ardalani, Newsha; Diamos, Gregory; Jun, Heewoo; Kianinejad, Hassan; Patwary, Md Mostofa Ali; Yang, Yang; Zhou, Yanqi (2017-12-01). "Deep Learning Scaling is Predictable, Empirically". arXiv:1712.00409 [cs, stat].
- Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Buchatskaya, Elena; Cai, Trevor; Rutherford, Eliza; Casas, Diego de Las; Hendricks, Lisa Anne; Welbl, Johannes; Clark, Aidan; Hennigan, Tom; Noland, Eric; Millican, Katie; Driessche, George van den; Damoc, Bogdan (2022-03-29). "Training Compute-Optimal Large Language Models". arXiv:2203.15556 [cs].
- Kaplan, Jared; McCandlish, Sam; Henighan, Tom; Brown, Tom B.; Chess, Benjamin; Child, Rewon; Gray, Scott; Radford, Alec; Wu, Jeffrey; Amodei, Dario (2020). "Scaling Laws for Neural Language Models". CoRR. abs/2001.08361. arXiv:2001.08361.
- Henighan, Tom; Kaplan, Jared; Katz, Mor; Chen, Mark; Hesse, Christopher; Jackson, Jacob; Jun, Heewoo; Brown, Tom B.; Dhariwal, Prafulla; Gray, Scott; Hallacy, Chris; Mann, Benjamin; Radford, Alec; Ramesh, Aditya; Ryder, Nick; Ziegler, Daniel M.; Schulman, John; Amodei, Dario; McCandlish, Sam (2020-10-27). "Scaling Laws for Autoregressive Generative Modeling". arXiv:2010.14701. OCLC 1228442047.
- Brown, Tom; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared D; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon (2020). "Language Models are Few-Shot Learners". Advances in Neural Information Processing Systems. 33.
- Zhai, Xiaohua; Kolesnikov, Alexander; Houlsby, Neil; Beyer, Lucas (2022). "Scaling Vision Transformers". Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): 12104–12113.
- Caballero, Ethan; Gupta, Kshitij; Rish, Irina; Krueger, David (2022). "Broken Neural Scaling Laws". International Conference on Learning Representations (ICLR), 2023.