In [[linguistics]], a '''corpus''' (plural ''corpora'') or '''text corpus''' is a large and structured set of texts (now usually electronically stored and processed). Corpora are used for statistical analysis, for checking occurrences, and for validating linguistic rules within a specific domain.

A corpus may contain texts in a single language (''monolingual corpus'') or text data in multiple languages (''multilingual corpus''). Multilingual corpora that have been specially formatted for side-by-side comparison are called ''aligned parallel corpora''.

To make corpora more useful for linguistic research, they are often subjected to a process known as [[annotation]]. One example of annotation is [[part-of-speech tagging]], or ''POS-tagging'', in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of ''tags''. Another example is indicating the [[lemma (linguistics)|lemma]] (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear [[gloss]]ing is used to make the annotation bilingual.

Corpora are the main knowledge base in [[corpus linguistics]]. The analysis and processing of various types of corpora are also the subject of much work in [[computational linguistics]], [[speech recognition]] and [[machine translation]], where corpora are often used to create [[hidden Markov model]]s for POS-tagging and other purposes. Corpora and the [[frequency list]]s derived from them are useful for [[language teaching]].

==Archaeological corpora==
Text corpora are also used in the study of [[historical document]]s, for example in attempts to [[decipherment|decipher]] ancient scripts, or in [[Biblical scholarship]]. Some archaeological corpora span such a short period that they provide a snapshot in time. One of the shortest in time span may be the [[Amarna letters]] ([[1350 BC]]), which cover only 15–30 years.
The ''corpus'' of an ancient city (for example, the "[[Kültepe]] Texts" of Turkey) may comprise a series of corpora, distinguished by the dates of their find sites.

== Some notable text corpora ==
English language:
* [[American National Corpus]]
* [[Bank of English]]
* [[British National Corpus]]
* [[Corpus Juris Secundum]]
* [[Corpus of Contemporary American English (COCA)]]: 360 million words, 1990–2007; freely available online
* [[Brown Corpus]], forming part of the "Brown Family" of corpora, together with LOB, Frown and F-LOB
* [[Oxford English Corpus]]
* [[Scottish Corpus of Texts & Speech]]

Other languages:
* [[Amarna letters]] (for [[Akkadian language|Akkadian]], Egyptian, [[Sumerogram]]s, etc.)
* [[Bijankhan Corpus]]: a contemporary Persian corpus for NLP research
* [[Croatian National Corpus]]
* [[Hamshahri Corpus]]: a contemporary Persian corpus for IR research
* [[Neo-Assyrian Text Corpus Project]]
* [[Persian Today Corpus]]
* [[Thesaurus Linguae Graecae]] (Ancient Greek)

==See also==
* [[Concordance (publishing)|Concordance]]
* [[Corpus linguistics]]
* [[Linguistic Data Consortium]]
* [[Natural language processing]]
* [[Natural Language Toolkit]]
* [[Parallel text alignment]]
* [[Search engines]]: they access the "web corpus".
* [[Translation memory]]
* [[Treebank]]

== External links ==
* [http://corpus.byu.edu/ Freely available, web-based corpora (100 million – 360 million words each): American, British (BNC), TIME, Spanish, Portuguese]
* {{dmoz|/Science/Social_Sciences/Linguistics/Computational_Linguistics/|Computational Linguistics}}
* [http://www.clres.com/corp.html ACL SIGLEX Resource Links: Text Corpora]
* [http://www.eva.mpg.de/lingua/files/morpheme.html The Leipzig Glossing Rules]: Conventions for interlinear [[morpheme]]-by-morpheme [[gloss]]es
* [http://www.ahds.ac.uk/linguistic-corpora Developing Linguistic Corpora: a Guide to Good Practice]

[[Category:Discourse analysis]]
[[Category:Corpus linguistics]]
[[Category:Computational linguistics]]
[[Category:Data mining]]

[[cs:Jazykový korpus]]
[[de:Textkorpus]]
[[el:Σώμα κειμένων]]
[[es:Corpus lingüístico]]
[[eo:Korpuso]]
[[eu:Testu corpus]]
[[fr:Corpus]]
[[gl:Corpus lingüístico]]
[[ms:Korpus]]
[[nl:Corpus (taalkunde)]]
[[ja:コーパス]]
[[pl:Korpus (językoznawstwo)]]
[[pt:Corpus lingüístico]]
[[sk:Korpus (jazykoveda)]]
[[sl:Besedilni korpus]]
[[th:คลังข้อความ]]
[[zh-yue:語料庫]]
[[zh:语料库]]