A statistical '''language model''' assigns a [[probability]] to a sequence of ''m'' words, <math>P(w_1,\ldots,w_m)</math>, by means of a [[probability distribution]]. Language modeling is used in many [[natural language processing]] applications such as [[speech recognition]], [[machine translation]], [[part-of-speech tagging]], [[parsing]] and [[information retrieval]].

In [[speech recognition]] and [[data compression]], such a model tries to capture the properties of a language and to predict the next word given the words that precede it.

When used in information retrieval, a language model is associated with each [[document]] in a collection. Given a query ''Q'', the retrieved documents are ranked by the probability that the document's language model <math>M_d</math> would generate the terms of the query, <math>P(Q\mid M_d)</math>.

Estimating the probability of sequences is difficult because [[corpora]] contain [[phrase]]s and [[Sentence (linguistics)|sentence]]s that can be arbitrarily long, and hence some sequences are never observed during [[training]] of the language model (a [[data sparseness]] problem related to [[overfitting]]). For that reason these models are often approximated using smoothed [[N-gram]] models.

== N-gram models ==

In an n-gram model, the probability <math>P(w_1,\ldots,w_m)</math> of observing the sentence <math>w_1,\ldots,w_m</math> is approximated as

:<math>P(w_1,\ldots,w_m) = \prod^m_{i=1} P(w_i\mid w_1,\ldots,w_{i-1}) \approx \prod^m_{i=1} P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1}).</math>

Here, it is assumed that the probability of observing the ''i''th word <math>w_i</math> given the context history of the preceding ''i''&nbsp;&minus;&nbsp;1 words can be approximated by the probability of observing it given only the shortened context of the preceding ''n''&nbsp;&minus;&nbsp;1 words (an ''n''th-order [[Markov property]]).

The conditional probability can be estimated from n-gram frequency counts:

:<math>P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1})}.</math>

The terms '''bigram''' and '''trigram''' language model denote n-gram language models with ''n'' = 2 and ''n'' = 3, respectively.

=== Example ===

In a bigram (''n'' = 2) language model, the probability of the sentence ''I saw the red house'' is approximated as

:<math>P(\text{I},\text{saw},\text{the},\text{red},\text{house}) \approx P(\text{I})\, P(\text{saw}\mid \text{I})\, P(\text{the}\mid \text{saw})\, P(\text{red}\mid \text{the})\, P(\text{house}\mid \text{red}),</math>

whereas in a trigram (''n'' = 3) language model, the approximation is

:<math>P(\text{I},\text{saw},\text{the},\text{red},\text{house}) \approx P(\text{I})\, P(\text{saw}\mid \text{I})\, P(\text{the}\mid \text{I},\text{saw})\, P(\text{red}\mid \text{saw},\text{the})\, P(\text{house}\mid \text{the},\text{red}).</math>

== See also ==
* [[Factored language model]]

== References ==
*{{cite conference | author=J. M. Ponte and W. B. Croft | url=http://citeseer.ist.psu.edu/ponte98language.html | title=A Language Modeling Approach to Information Retrieval | booktitle=Research and Development in Information Retrieval | year=1998 | pages=275–281}}
*{{cite conference | author=F. Song and W. B. Croft | url=http://citeseer.ist.psu.edu/song99general.html | title=A General Language Model for Information Retrieval | booktitle=Research and Development in Information Retrieval | year=1999 | pages=279–280}}

[[Category:Statistical natural language processing]]

{{compu-AI-stub}}

[[ca:Model de llenguatge]]
[[zh:語言模型]]
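The bigram example above can be made concrete with a minimal sketch in Python, assuming a small toy corpus and unsmoothed maximum-likelihood counts; the corpus and the helper names (<code>unigram_prob</code>, <code>bigram_prob</code>, <code>sentence_prob</code>) are illustrative assumptions, not taken from the references above.

<syntaxhighlight lang="python">
from collections import Counter

# Toy corpus (an assumption made for this sketch), already tokenized.
corpus = [
    ["I", "saw", "the", "red", "house"],
    ["I", "saw", "the", "dog"],
    ["the", "red", "house", "is", "old"],
]

# count(w): unigram counts; count(w_{i-1}, w_i): bigram counts.
unigram_counts = Counter(w for sentence in corpus for w in sentence)
bigram_counts = Counter(
    pair for sentence in corpus for pair in zip(sentence, sentence[1:])
)
total_tokens = sum(unigram_counts.values())

def unigram_prob(word):
    """Relative-frequency estimate of P(w)."""
    return unigram_counts[word] / total_tokens

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    """P(w_1) * P(w_2 | w_1) * ... * P(w_m | w_{m-1}), as in the bigram example above."""
    prob = unigram_prob(sentence[0])
    for prev, word in zip(sentence, sentence[1:]):
        prob *= bigram_prob(prev, word)
    return prob

# P(I, saw, the, red, house) under the unsmoothed bigram model.
print(sentence_prob(["I", "saw", "the", "red", "house"]))
</syntaxhighlight>

Because these are raw relative frequencies, any bigram that never occurs in the corpus makes the whole product zero; practical n-gram models therefore apply smoothing, as noted in the introduction.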