A statistical '''language model''' assigns a [[probability]] to a sequence of ''m'' words, <math>P(w_1,\ldots,w_m)</math>, by means of a [[probability distribution]]. Language modeling is used in many [[natural language processing]] applications such as [[speech recognition]], [[machine translation]], [[part-of-speech tagging]], [[parsing]] and [[information retrieval]].

In [[speech recognition]] and [[data compression]], such a model tries to capture the properties of a language and to predict the next word given the words that precede it.

When used in information retrieval, a language model is associated with each [[document]] in a collection. Given a query ''Q'', the retrieved documents are ranked by the probability that the document's language model <math>M_d</math> would generate the terms of the query, <math>P(Q\mid M_d)</math>.

Estimating the probability of sequences is difficult because [[corpora]] contain [[phrase]]s and [[Sentence (linguistics)|sentence]]s that can be arbitrarily long, and hence some sequences are never observed during [[training]] of the language model (a [[data sparseness]] problem related to [[overfitting]]). For that reason these models are often approximated using smoothed [[N-gram]] models.

== N-gram models ==

In an n-gram model, the probability <math>P(w_1,\ldots,w_m)</math> of observing the sentence <math>w_1,\ldots,w_m</math> is approximated as

:<math>P(w_1,\ldots,w_m) = \prod^m_{i=1} P(w_i\mid w_1,\ldots,w_{i-1}) \approx \prod^m_{i=1} P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1}).</math>

Here, it is assumed that the probability of observing the ''i''th word <math>w_i</math> given the context history of the preceding ''i''&nbsp;&minus;&nbsp;1 words can be approximated by the probability of observing it given only the shortened context of the preceding ''n''&nbsp;&minus;&nbsp;1 words (an ''n''th-order [[Markov property]]).

The conditional probability can be estimated from n-gram frequency counts:

:<math>P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1})}.</math>

The terms '''bigram''' and '''trigram''' language model denote n-gram language models with ''n'' = 2 and ''n'' = 3, respectively.

=== Example ===

In a bigram (''n'' = 2) language model, the probability of the sentence ''I saw the red house'' is approximated as

:<math>P(\text{I},\text{saw},\text{the},\text{red},\text{house}) \approx P(\text{I})\, P(\text{saw}\mid \text{I})\, P(\text{the}\mid \text{saw})\, P(\text{red}\mid \text{the})\, P(\text{house}\mid \text{red}),</math>

whereas in a trigram (''n'' = 3) language model, the approximation is

:<math>P(\text{I},\text{saw},\text{the},\text{red},\text{house}) \approx P(\text{I})\, P(\text{saw}\mid \text{I})\, P(\text{the}\mid \text{I},\text{saw})\, P(\text{red}\mid \text{saw},\text{the})\, P(\text{house}\mid \text{the},\text{red}).</math>

== See also ==
* [[Factored language model]]

== References ==
*{{cite conference | author=J. M. Ponte and W. B. Croft | url=http://citeseer.ist.psu.edu/ponte98language.html | title=A Language Modeling Approach to Information Retrieval | booktitle=Research and Development in Information Retrieval | year=1998 | pages=275–281}}
*{{cite conference | author=F. Song and W. B. Croft | url=http://citeseer.ist.psu.edu/song99general.html | title=A General Language Model for Information Retrieval | booktitle=Research and Development in Information Retrieval | year=1999 | pages=279–280}}

[[Category:Statistical natural language processing]]

{{compu-AI-stub}}

[[ca:Model de llenguatge]]
[[zh:語言模型]]
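The bigram example above can be made concrete with a minimal sketch in Python, assuming a small toy corpus and unsmoothed maximum-likelihood counts; the corpus and the helper names (<code>unigram_prob</code>, <code>bigram_prob</code>, <code>sentence_prob</code>) are illustrative assumptions, not taken from the references above.

<syntaxhighlight lang="python">
from collections import Counter

# Toy corpus (an assumption made for this sketch), already tokenized.
corpus = [
    ["I", "saw", "the", "red", "house"],
    ["I", "saw", "the", "dog"],
    ["the", "red", "house", "is", "old"],
]

# count(w): unigram counts; count(w_{i-1}, w_i): bigram counts.
unigram_counts = Counter(w for sentence in corpus for w in sentence)
bigram_counts = Counter(
    pair for sentence in corpus for pair in zip(sentence, sentence[1:])
)
total_tokens = sum(unigram_counts.values())

def unigram_prob(word):
    """Relative-frequency estimate of P(w)."""
    return unigram_counts[word] / total_tokens

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    """P(w_1) * P(w_2 | w_1) * ... * P(w_m | w_{m-1}), as in the bigram example above."""
    prob = unigram_prob(sentence[0])
    for prev, word in zip(sentence, sentence[1:]):
        prob *= bigram_prob(prev, word)
    return prob

# P(I, saw, the, red, house) under the unsmoothed bigram model.
print(sentence_prob(["I", "saw", "the", "red", "house"]))
</syntaxhighlight>

Because these are raw relative frequencies, any bigram that never occurs in the corpus makes the whole product zero; practical n-gram models therefore apply smoothing, as noted in the introduction.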