Language model

A language model is a probability distribution over sequences of words.[1] Given any sequence of words of length m, a language model assigns a probability to the whole sequence. Language models generate probabilities by training on text corpora in one or many languages. Given that languages can be used to express an infinite variety of valid sentences (the property of digital infinity), language modeling faces the problem of assigning non-zero probabilities to linguistically valid sequences that may never be encountered in the training data. Several modelling approaches have been designed to surmount this problem, such as applying the Markov assumption or using neural architectures such as recurrent neural networks or transformers.

Language models are useful for a variety of problems in computational linguistics; from initial applications in speech recognition[2] to ensure nonsensical (i.e. low-probability) word sequences are not predicted, to wider use in machine translation[3] (e.g. scoring candidate translations), natural language generation (generating more human-like text), part-of-speech tagging, parsing,[3] optical character recognition, handwriting recognition,[4] grammar induction,[5] information retrieval,[6][7] and other applications.

Language models are used in information retrieval in the query likelihood model. There, a separate language model is associated with each document in a collection. Documents are ranked based on the probability of the query Q in the document's language model M_d: P(Q | M_d). Commonly, the unigram language model is used for this purpose.
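
The ranking can be illustrated with a short Python sketch (not taken from the cited sources): each document is scored by the probability its unigram language model assigns to the query. The interpolation with collection-wide statistics and the weight lam are illustrative assumptions, used here only to avoid zero probabilities for query terms absent from a document.

    from collections import Counter

    def unigram_query_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
        # Probability the document's unigram model assigns to the query,
        # smoothed against the whole collection so unseen terms do not
        # zero out the score (an illustrative choice, not part of the basic model).
        doc_counts = Counter(doc_terms)
        coll_counts = Counter(collection_terms)
        score = 1.0
        for term in query_terms:
            p_doc = doc_counts[term] / len(doc_terms)
            p_coll = coll_counts[term] / len(collection_terms)
            score *= lam * p_doc + (1 - lam) * p_coll
        return score

Documents with higher scores are ranked higher for the given query.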

Since 2018, large language models (LLMs) consisting of deep neural networks with billions of trainable parameters, trained on massive datasets of unlabelled text, have demonstrated impressive results on a wide variety of natural language processing tasks. This development has led to a shift in research focus toward the use of general-purpose LLMs.[8]

Model types

n-gram

An n-gram language model is a language model that models sequences of words as a Markov process. It makes use of the simplifying assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. A bigram model considers one previous word, a trigram model considers two, and in general, an n-gram model considers n−1 words of previous context.[9]

For example, a bigram language model models the probability of the sentence I saw the red house as:

P(I saw the red house) ≈ P(I | ⟨s⟩) P(saw | I) P(the | saw) P(red | the) P(house | red) P(⟨/s⟩ | house)

where ⟨s⟩ and ⟨/s⟩ are special tokens denoting the start and end of a sentence.

These conditional probabilities may be estimated based on frequency counts in some text corpus. For example, P(saw | I) can be naively estimated as the proportion of occurrences of the word I which are followed by saw in the corpus. The problem of sparsity (for example, if the bigram "red house" has zero occurrences in our corpus) may necessitate modifying the basic Markov model by smoothing techniques, particularly when using larger context windows.[9]
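
This counting procedure can be sketched in a few lines of Python (an illustrative, unsmoothed estimator; the function and token names are not from the cited text):

    from collections import Counter

    def bigram_probabilities(sentences):
        # Maximum-likelihood estimates P(w2 | w1) = count(w1 w2) / count(w1),
        # with <s> and </s> marking sentence boundaries as in the example above.
        unigram_counts = Counter()
        bigram_counts = Counter()
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            unigram_counts.update(padded[:-1])                 # context counts
            bigram_counts.update(zip(padded[:-1], padded[1:]))
        return {(w1, w2): count / unigram_counts[w1]
                for (w1, w2), count in bigram_counts.items()}

    probs = bigram_probabilities([["I", "saw", "the", "red", "house"]])
    print(probs[("I", "saw")])   # 1.0 in this one-sentence toy corpus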

n-gram models are no longer commonly used in natural language processing research and applications, as they have been supplanted by state-of-the-art deep learning methods, most recently large language models.

Exponential

Maximum entropy language models encode the relationship between a word and the n-gram history using feature functions. The equation is

P(w_m | w_1, …, w_(m−1)) = exp(aᵀ f(w_1, …, w_m)) / Z(w_1, …, w_(m−1))

where Z(w_1, …, w_(m−1)) is the partition function, a is the parameter vector, and f(w_1, …, w_m) is the feature function. In the simplest case, the feature function is just an indicator of the presence of a certain n-gram. It is helpful to use a prior on a or some form of regularization.
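
As a minimal sketch (not a reference implementation), the distribution can be computed by scoring every candidate word and normalizing by the partition function; the feature function and parameter vector below are illustrative assumptions.

    import numpy as np

    def maxent_distribution(history, vocab, a, f):
        # P(w | history) = exp(a . f(history, w)) / Z(history), where Z sums
        # the unnormalized scores over the whole vocabulary.
        scores = np.array([a @ f(history, w) for w in vocab])
        scores -= scores.max()                  # subtract the max for numerical stability
        unnormalized = np.exp(scores)
        return unnormalized / unnormalized.sum()

    # Example of the simplest case: an indicator feature for one bigram.
    def f_example(history, w):
        return np.array([1.0 if history and (history[-1], w) == ("red", "house") else 0.0])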

The log-bilinear model is another example of an exponential language model.

Neural network

Neural language models (or continuous space language models) use continuous representations or embeddings of words to make their predictions.[10] These models make use of neural networks.

Continuous space embeddings help to alleviate the curse of dimensionality in language modeling: as language models are trained on larger and larger texts, the number of unique words (the vocabulary) increases. The number of possible sequences of words increases exponentially with the size of the vocabulary, causing a data sparsity problem: most of the possible sequences are never observed in training, so their probabilities cannot be estimated reliably from counts alone. Neural networks avoid this problem by representing words in a distributed way, as non-linear combinations of weights in a neural net.[11] An alternate description is that a neural net approximates the language function. The neural net architecture might be feed-forward or recurrent, and while the former is simpler, the latter is more common.

Typically, neural net language models are constructed and trained as probabilistic classifiers that learn to predict a probability distribution

P(w_t | context)

That is, the network is trained to predict a probability distribution over the vocabulary, given some linguistic context. This is done using standard neural net training algorithms such as stochastic gradient descent with backpropagation.[11] The context might be a fixed-size window of previous words, so that the network predicts

P(w_t | w_(t−k), …, w_(t−1))

from a feature vector representing the previous k words.[11] Another option is to use "future" words as well as "past" words as features,[12] so that the estimated probability is

P(w_t | w_(t−k), …, w_(t−1), w_(t+1), …, w_(t+k))

This is called a bag-of-words model. When the feature vectors for the words in the context are combined by a continuous operation, this model is referred to as the continuous bag-of-words architecture (CBOW).[13]
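
The following Python sketch illustrates a CBOW-style classifier under stated assumptions (random initialization, averaging as the continuous combination, a single linear output layer); training by stochastic gradient descent with backpropagation is omitted.

    import numpy as np

    rng = np.random.default_rng(0)

    def init_cbow(vocab_size, dim):
        # Input embeddings and output projection, randomly initialized.
        return {"emb": rng.normal(0.0, 0.1, (vocab_size, dim)),
                "out": rng.normal(0.0, 0.1, (dim, vocab_size))}

    def cbow_predict(params, context_ids):
        # Average the context word embeddings (the continuous combination)
        # and apply a softmax to obtain a distribution over the vocabulary.
        hidden = params["emb"][context_ids].mean(axis=0)
        logits = hidden @ params["out"]
        logits -= logits.max()                  # numerical stability
        probs = np.exp(logits)
        return probs / probs.sum()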

A third option that trains slower than the CBOW but performs slightly better is to invert the previous problem and make a neural network learn the context, given a word.[13] More formally, given a sequence of training words w_1, w_2, …, w_T, one maximizes the average log-probability

(1/T) Σ_(t=1..T) Σ_(−k ≤ j ≤ k, j ≠ 0) log P(w_(t+j) | w_t)

where k, the size of the training context, can be a function of the center word w_t. This is called a skip-gram language model.[14] Bag-of-words and skip-gram models are the basis of the word2vec program.[15]
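
A hedged sketch of evaluating this objective over a token sequence, assuming some model supplies a callable log_prob(context_id, center_id), is:

    def skipgram_objective(token_ids, log_prob, k):
        # Average log-probability of the context words within distance k of
        # each center word; log_prob is an assumed model-supplied callable.
        T = len(token_ids)
        total = 0.0
        for t, center in enumerate(token_ids):
            for j in range(-k, k + 1):
                if j == 0 or not 0 <= t + j < T:
                    continue
                total += log_prob(token_ids[t + j], center)
        return total / T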

Instead of using neural net language models to produce actual probabilities, it is common to instead use the distributed representation encoded in the networks' "hidden" layers as representations of words; each word is then mapped onto an n-dimensional real vector called the word embedding, where n is the size of the layer just before the output layer. The representations in skip-gram models have the distinct characteristic that they model semantic relations between words as linear combinations, capturing a form of compositionality. For example, in some such models, if v is the function that maps a word w to its n-dimensional vector representation, then

v(king) − v(male) + v(female) ≈ v(queen)

where ≈ is made precise by stipulating that its right-hand side must be the nearest neighbor of the value of the left-hand side.[13][14]
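
The nearest-neighbor reading of ≈ can be illustrated with a short Python sketch; the embeddings dictionary is an assumed input mapping words to vectors, not part of any cited implementation.

    import numpy as np

    def nearest_word(vec, embeddings, exclude=()):
        # Word whose embedding has the highest cosine similarity to vec,
        # skipping the words used to form the analogy query itself.
        best_word, best_sim = None, -np.inf
        for word, e in embeddings.items():
            if word in exclude:
                continue
            sim = vec @ e / (np.linalg.norm(vec) * np.linalg.norm(e))
            if sim > best_sim:
                best_word, best_sim = word, sim
        return best_word

    # target = embeddings["king"] - embeddings["male"] + embeddings["female"]
    # nearest_word(target, embeddings, exclude={"king", "male", "female"})  # ideally "queen"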

Other

A positional language model[16] assesses the probability of given words occurring close to one another in a text, not necessarily immediately adjacent. Similarly, bag-of-concepts models[17] leverage the semantics associated with multi-word expressions such as buy_christmas_present, even when they are used in information-rich sentences like "today I bought a lot of very nice Christmas presents".

Despite the limited success of neural networks in modelling sign languages,[18] the authors of that work acknowledge the need for other techniques.

Evaluation and benchmarks

Evaluation of the quality of language models is mostly done by comparison to human-created sample benchmarks derived from typical language-oriented tasks. Other, less established, quality tests examine the intrinsic character of a language model or compare two such models. Since language models are typically intended to be dynamic and to learn from the data they see, some proposed evaluations investigate the rate of learning, e.g. through inspection of learning curves.[19]

Various data sets have been developed to use to evaluate language processing systems.[12] These include:

  • Corpus of Linguistic Acceptability[20]
  • GLUE benchmark[21]
  • Microsoft Research Paraphrase Corpus[22]
  • Multi-Genre Natural Language Inference
  • Question Natural Language Inference
  • Quora Question Pairs[23]
  • Recognizing Textual Entailment[24]
  • Semantic Textual Similarity Benchmark
  • SQuAD question answering Test[25]
  • Stanford Sentiment Treebank[26]
  • Winograd NLI
  • BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OpenBookQA, NaturalQuestions, TriviaQA, RACE, MMLU (Measuring Massive Multitask Language Understanding), BIG-bench hard, GSM8k, RealToxicityPrompts, WinoGender, CrowS-Pairs[27] (benchmarks used to evaluate the LLaMA models)

Criticism

Although contemporary language models, such as GPTs, can be shown to match human performance on some tasks, it is not clear they are plausible cognitive models. For instance, recurrent neural networks have been shown to learn patterns humans do not learn and fail to learn patterns that humans do learn.[28]

References

  1. Jurafsky, Dan; Martin, James H. (2021). "N-gram Language Models". Speech and Language Processing (3rd ed.). Archived from the original on 22 May 2022. Retrieved 24 May 2022.
  2. Kuhn, Roland, and Renato De Mori (1990). "A cache-based natural language model for speech recognition". IEEE transactions on pattern analysis and machine intelligence 12.6: 570–583.
  3. Andreas, Jacob, Andreas Vlachos, and Stephen Clark (2013). "Semantic parsing as machine translation" Archived 15 August 2020 at the Wayback Machine. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
  4. Pham, Vu, et al (2014). "Dropout improves recurrent neural networks for handwriting recognition" Archived 11 November 2020 at the Wayback Machine. 14th International Conference on Frontiers in Handwriting Recognition. IEEE.
  5. Htut, Phu Mon, Kyunghyun Cho, and Samuel R. Bowman (2018). "Grammar induction with neural language models: An unusual replication" Archived 14 August 2022 at the Wayback Machine. arXiv:1808.10000.
  6. Ponte, Jay M.; Croft, W. Bruce (1998). A language modeling approach to information retrieval. Proceedings of the 21st ACM SIGIR Conference. Melbourne, Australia: ACM. pp. 275–281. doi:10.1145/290941.291008.
  7. Hiemstra, Djoerd (1998). A linguistically motivated probabilistic model of information retrieval. Proceedings of the 2nd European conference on Research and Advanced Technology for Digital Libraries. LNCS, Springer. pp. 569–584. doi:10.1007/3-540-49653-X_34.
  8. Manning, Christopher D. (2022). "Human Language Understanding & Reasoning". Daedalus. Archived from the original on 9 March 2023. Retrieved 10 March 2023.
  9. Jurafsky, Dan; Martin, James H. (7 January 2023). "N-gram Language Models". Speech and Language Processing (PDF) (3rd edition draft ed.). Retrieved 24 May 2022.
  10. Karpathy, Andrej. "The Unreasonable Effectiveness of Recurrent Neural Networks". Archived from the original on 1 November 2020. Retrieved 27 January 2019.
  11. Bengio, Yoshua (2008). "Neural net language models". Scholarpedia. Vol. 3. p. 3881. Bibcode:2008SchpJ...3.3881B. doi:10.4249/scholarpedia.3881. Archived from the original on 26 October 2020. Retrieved 28 August 2015.
  12. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (10 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805 [cs.CL].
  13. Mikolov, Tomas; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013). "Efficient estimation of word representations in vector space". arXiv:1301.3781 [cs.CL].
  14. Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). Distributed Representations of Words and Phrases and their Compositionality (PDF). Advances in Neural Information Processing Systems. pp. 3111–3119. Archived (PDF) from the original on 29 October 2020. Retrieved 22 June 2015.
  15. Harris, Derrick (16 August 2013). "We're on the cusp of deep learning for the masses. You can thank Google later". Gigaom. Archived from the original on 11 November 2020. Retrieved 22 June 2015.
  16. Lv, Yuanhua; Zhai, ChengXiang (2009). "Positional Language Models for Information Retrieval" (PDF). Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (SIGIR). Archived from the original (PDF) on 24 November 2020. Retrieved 7 April 2012.
  17. Cambria, Erik; Hussain, Amir (28 July 2012). Sentic Computing: Techniques, Tools, and Applications. Springer Netherlands. ISBN 978-94-007-5069-2. Archived from the original on 16 April 2023. Retrieved 25 February 2019.
  18. Mocialov, Boris; Hastie, Helen; Turner, Graham (August 2018). "Transfer Learning for British Sign Language Modelling". Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018): 101–110. arXiv:2006.02144. Archived from the original on 5 December 2020. Retrieved 14 March 2020.
  19. Karlgren, Jussi; Schutze, Hinrich (2015), "Evaluating Learning Language Representations", International Conference of the Cross-Language Evaluation Forum, Lecture Notes in Computer Science, Springer International Publishing, pp. 254–260, doi:10.1007/978-3-319-64206-2_8, ISBN 9783319642055
  20. "The Corpus of Linguistic Acceptability (CoLA)". nyu-mll.github.io. Archived from the original on 7 December 2020. Retrieved 25 February 2019.
  21. "GLUE Benchmark". gluebenchmark.com. Archived from the original on 4 November 2020. Retrieved 25 February 2019.
  22. "Microsoft Research Paraphrase Corpus". Microsoft Download Center. Archived from the original on 25 October 2020. Retrieved 25 February 2019.
  23. Aghaebrahimian, Ahmad (2017), "Quora Question Answer Dataset", Text, Speech, and Dialogue, Lecture Notes in Computer Science, vol. 10415, Springer International Publishing, pp. 66–73, doi:10.1007/978-3-319-64206-2_8, ISBN 9783319642055
  24. Sammons, Mark; Vydiswaran, V.G.Vinod; Roth, Dan. "Recognizing Textual Entailment" (PDF). Archived from the original (PDF) on 9 August 2017. Retrieved 24 February 2019.
  25. "The Stanford Question Answering Dataset". rajpurkar.github.io. Archived from the original on 30 October 2020. Retrieved 25 February 2019.
  26. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank". nlp.stanford.edu. Archived from the original on 27 October 2020. Retrieved 25 February 2019.
  27. Hendrycks, Dan (14 March 2023), Measuring Massive Multitask Language Understanding, archived from the original on 15 March 2023, retrieved 15 March 2023
  28. Hornstein, Norbert; Lasnik, Howard; Patel-Grosz, Pritty; Yang, Charles (9 January 2018). Syntactic Structures after 60 Years: The Impact of the Chomskyan Revolution in Linguistics. Walter de Gruyter GmbH & Co KG. ISBN 978-1-5015-0692-5. Archived from the original on 16 April 2023. Retrieved 11 December 2021.

Further reading

  • J M Ponte; W B Croft (1998). "A Language Modeling Approach to Information Retrieval". Research and Development in Information Retrieval. pp. 275–281. CiteSeerX 10.1.1.117.4237.
  • F Song; W B Croft (1999). "A General Language Model for Information Retrieval". Research and Development in Information Retrieval. pp. 279–280. CiteSeerX 10.1.1.21.6467.
  • Chen, Stanley; Joshua Goodman (1998). An Empirical Study of Smoothing Techniques for Language Modeling (Technical report). Harvard University. CiteSeerX 10.1.1.131.5458.