N-gram {{distinguish|engram}} An '''''n''-gram''' is a sub-sequence of ''n'' items from a given [[sequence]]. ''n''-grams are used in various areas of statistical [[natural language processing]] and genetic sequence analysis. The items in question can be letters, words or [[base pairs]] according to the application. An ''n''-gram of size 1 is a "[[unigram]]"; size 2 is a "[[bigram]]" (or, more etymologically sound but less commonly used, a "digram"); size 3 is a "[[trigram]]"; and size 4 or more is simply called an "''n''-gram". Some [[language model]]s built from n-grams are "(''n'' − 1)-order [[Markov_chain|Markov model]]s". ==Examples== Here are examples of '''''word''''' level 3-grams and 4-grams (and counts of the number of times they appeared) from the [[N-gram#Google_use_of_N-gram|Google n-gram corpus]]. *ceramics collectables collectibles (55) *ceramics collectables fine (130) *ceramics collected by (52) *ceramics collectible pottery (50) *ceramics collectibles cooking (45) 4-grams *serve as the incoming (92) *serve as the incubator (99) *serve as the independent (794) *serve as the index (223) *serve as the indication (72) *serve as the indicator (120) ==''n''-gram models== An '''''n''-gram model''' models sequences, notably natural languages, using the statistical properties of ''n''-grams. This idea can be traced to an experiment by [[Claude Shannon]]'s work in [[information theory]]. His question was, given a sequence of letters (for example, the sequence "for ex"), what is the [[likelihood]] of the next letter? From training data, one can derive a [[probability distribution]] for the next letter given a history of size n: ''a'' = 0.4, ''b'' = 0.00001, ''c'' = 0, ....; where the probabilities of all possible "next-letters" sum to 1.0. More concisely, an ''n''-gram model predicts x_{i} based on x_{i-1}, x_{i-2}, \dots, x_{i-n}. In Probability terms, this is nothing but P(x_{i} | x_{i-1}, x_{i-2}, \dots, x_{i-n}). When used for [[language model|language modeling]] independence assumptions are made so that each word depends only on the last ''n'' words. This [[Markov model]] is used as an approximation of the true underlying language. This assumption is important because it massively simplifies the problem of learning the language model from data. In addition, because of the open nature of language, it is common to group words unknown to the language model together. ''n''-gram models are widely used in statistical [[natural language processing]]. In [[speech recognition]], [[phonemes]] and sequences of phonemes are modeled using a ''n''-gram distribution. For parsing, words are modeled such that each ''n''-gram is composed of ''n'' words. For [[language recognition]], sequences of letters are modeled for different languages. For a sequence of words, (for example "the dog smelled like a skunk"), the trigrams would be: "the dog smelled", "dog smelled like", "smelled like a", and "like a skunk". For sequences of characters, the 3-grams (sometimes referred to as "trigrams") that can be generated from "good morning" are "goo", "ood", "od ", "d m", " mo", "mor" and so forth. Some practitioners preprocess strings to remove spaces, most simply collapse whitespace to a single space while preserving paragraph marks. Punctuation is also commonly reduced or removed by preprocessing. ''n''-grams can also be used for sequences of words or, in fact, for almost any type of data. They have been used for example for extracting features for clustering large sets of satellite earth images and for determining what part of the Earth a particular image came from. They have also been very successful as the first pass in genetic sequence search and in the identification of which species short sequences of DNA were taken from. N-gram models are often criticized because they lack any explicit representation of long range dependency. While it is true that the only explicit dependency range is (n-1) tokens for an n-gram model, it is also true that the effective range of dependency is significantly longer than this although long range correlations drop exponentially with distance for any Markov model. Alternative Markov language models that incorporate some degree of local state can exhibit very long range dependencies. This is often done using hand-crafted state variables that represent, for instance, the position in a sentence, the general topic of discourse or a grammatical state variable. Some of the best parsers of English currently in existence are roughly of this form. Another criticism that has been leveled is that Markov models of language, including n-gram models, do not explicitly capture the performance/competence distinction introduced by [[Noam Chomsky]]. This criticism fails to explain why parsers that are the best at parsing text seem to uniformly lack any such distinction and most even lack any clear distinction between semantics and syntax. Most proponents of n-gram and related language models opt for a fairly pragmatic approach to language modeling that emphasizes empirical results over theoretical purity. ==''n''-grams for approximate matching== ''n''-grams can also be used for efficient approximate matching. By converting a sequence of items to a set of ''n''-grams, it can be embedded in a [[vector space]] (in other words, represented as a [[histogram]]), thus allowing the sequence to be compared to other sequences in an efficient manner. For example, if we convert strings with only letters in the English alphabet into 3-grams, we get a 26^3-dimensional space (the first dimension measures the number of occurrences of "aaa", the second "aab", and so forth for all possible combinations of three letters). Using this representation, we lose information about the string. For example, both the strings "abcba" and "bcbab" give rise to exactly the same 2-grams. However, we know empirically that if two strings of real text have a similar vector representation (as measured by [[dot product|cosine distance]]) then they are likely to be similar. Other metrics have also been applied to vectors of ''n''-grams with varying, sometimes better, results. For example [[z-score]]s have been used to compare documents by examining how many standard deviations each ''n''-gram differs from its mean occurrence in a large collection, or [[text corpus]], of documents (which form the "background" vector). In the event of small counts, the [[g-score]] may give better results for comparing alternative models. It is also possible to take a more principled approach to the statistics of ''n''-grams, modeling similarity as the likelihood that two strings came from the same source directly in terms of a problem in [[Bayesian inference]]. ==Other applications== ''n''-grams find use in several areas of computer science, [[computational linguistics]], and applied mathematics. They have been used to: * design [[kernel (mathematics)|kernels]] that allow [[machine learning]] algorithms such as [[support vector machine]]s to learn from string data * find likely candidates for the correct spelling of a misspelled word * improve compression in [[data compression|compression algorithms]] where a small area of data requires ''n''-grams of greater length * assess the probability of a given word sequence appearing in text of a language of interest in pattern recognition systems, [[speech recognition]], OCR ([[optical character recognition]]), [[Intelligent Character Recognition]] ([[ICR]]), [[machine translation]] and similar applications * improve retrieval in [[information retrieval]] systems when it is hoped to find similar "documents" (a term for which the conventional meaning is sometimes stretched, depending on the data set) given a single query document and a database of reference documents * improve retrieval performance in genetic sequence analysis as in the [[BLAST]] family of programs * identify the language a text is in or the species a small sequence of DNA was taken from * predict letters or words at random in order to create text, as in the [[dissociated press]] algorithm. == Bias-versus-variance trade-off == What goes into picking the ''n'' for the ''n''-gram? There are problems of balance weight between ''infrequent grams'' (for example, if a proper name appeared in the training data) and ''frequent grams''. Also, items not seen in the training data will be given a [[probability]] of 0.0 without [[smoothing]]. For unseen but plausible data from a sample, one can introduce [[pseudocount]]s. Pseudocounts are generally motivated on Bayesian grounds. === Smoothing techniques === * [[Linear interpolation]] (e.g., taking the [[weighted mean]] of the unigram, bigram, and trigram) * [[Good-Turing]] discounting * [[Witten-Bell discounting]] * [[Katz's back-off model]] (trigram) ==Google use of N-gram== [[Google]] uses n-gram models for a variety of R&D projects, such as [[statistical machine translation]], [[speech recognition]], [[Spell checker|checking spelling]], [[entity detection]], and [[information extraction|data mining]]. In September of 2006 [http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html Google announced] that they made their n-grams [http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13 public] at the [[Linguistic Data Consortium]] ([http://www.ldc.upenn.edu/ LDC]). == See also == * [[Hidden Markov model]] * [[Trigram]], [[Bigram]] * [[n-tuple]] * [[k-mer]] ==Bibliography== * Christopher D. Manning, Hinrich Schütze, ''Foundations of Statistical Natural Language Processing'', MIT Press: 1999. ISBN 0-262-13360-1. * Ted Dunning, ''Statistical Identification of Language''. Computing Research Laboratory Memorandum (1994) MCCS-94-273. * Owen White, Ted Dunning, Granger Sutton, Mark Adams, J.Craig Venter, and Chris Fields. A quality control algorithm for dna sequencing projects. Nucleic Acids Research, 21(16):3829--3838, 1993. * Frederick J. Damerau, ''Markov Models and Linguistic Theory''. Mouton. The Hague, 1971. == External links == * [http://n-gram-patterns.sourceforge.net/ Google n-gram Information Extracter] * [http://citeseer.ist.psu.edu/dunning94statistical.html N-gram language identification algorithm] * An online [http://jonathanwellons.com/n-grams/index.cgi N-Grams generator]. * SSN is a state of the art [http://homepages.inf.ed.ac.uk/jhender6/ statistical language parser] that is nearly Markov. * [http://ngram.sourceforge.net Ngram Statistics Package], open source package to identify statistically significant Ngrams * [http://www.w3.org/TR/ngram-spec/ Stochastic Language Models (N-Gram) Specification] [[Category:Natural language processing]] [[Category:Computational linguistics]] [[Category:Speech recognition]] [[Category:corpus linguistics]] [[ca:N-grama]] [[de:N-Gramm]] [[es:N-grama]] [[eu:N-grama]] [[fr:N-gramme]]