Information
[10380020] |'''Information''' as a [[Conveyed concept|concept]] has a diversity of meanings, from everyday usage to technical settings. [10380030] |Generally speaking, the concept of information is closely related to notions of [[constraint]], [[communication]], [[control system|control]], [[data]], [[form]], [[instruction]], [[knowledge]], [[Meaning (linguistics)|meaning]], [[stimulation|mental stimulus]], [[pattern]], [[perception]], and [[knowledge representation|representation]]. [10380040] |Many people speak about the [[Information Age]] as the advent of the Knowledge Age or [[knowledge society]], the [[information society]], the [[Information revolution]], and [[Information technology|information technologies]], and even though [[informatics]], [[information science]] and [[computer science]] are often in the spotlight, the word "information" is often used without careful consideration of the various meanings it has acquired. [10380050] |== Etymology == [10380060] |According to the [[Oxford English Dictionary]], the earliest historical meaning of the word ''information'' in [[English language|English]] was the act of ''informing'', or giving form or shape to the mind, as in education, instruction, or training. [10380070] |A quote from 1387: "Five books come down from heaven for information of mankind." [10380080] |It was also used for an ''item'' of training, ''e.g.'' a particular instruction. [10380090] |"Melibee had heard the great skills and reasons of Dame Prudence, and her wise information and techniques." [10380100] |(1386) [10380110] |The English word was apparently derived by adding the common "noun of action" ending "''-ation''" (descended through Francais from Latin "''-tio''") to the earlier verb ''to inform'', in the sense of to give form to the mind, to discipline, instruct, teach: "Men so wise should go and inform their kings." [10380120] |(1330) ''Inform'' itself comes (via French) from the Latin verb ''informare'', to give form to, to form an idea of. [10380125] |Furthermore, Latin itself already even contained the word ''informatio'' meaning concept or idea, but the extent to which this may have influenced the development of the word ''information'' in English is unclear. [10380130] |As a final note, the ancient Greek word for ''form'' was [eidos], and this word was famously used in a technical philosophical sense by [Plato] (and later Aristotle) to denote the ideal identity or essence of something (see [Theory of forms]). [10380140] |"Eidos" can also be associated with [thought], [proposition] or even [concept]. [10380150] |== Information as a message == [10380160] |'''Information''' is the state of a system of interest. [10380170] |Message is the information materialized. [10380180] |Information is a quality of a [[message]] from a [[sender]] to one or more receivers. [10380190] |Information is always ''about'' something (size of a parameter, occurrence of an event, etc). [10380200] |Viewed in this manner, information does not have to be accurate. [10380210] |It may be a truth or a lie, or just the sound of a falling tree. [10380220] |Even a disruptive noise used to inhibit the flow of communication and create misunderstanding would in this view be a form of information. [10380230] |However, generally speaking, if the ''amount'' of information in the received message increases, the message is more accurate. [10380240] |This model assumes there is a definite [[sender]] and at least one receiver. 
[10380250] |Many refinements of the model assume the existence of a common language understood by the sender and at least one of the receivers. [10380260] |An important variation identifies information as that which would be communicated by a message if it were sent from a sender to a receiver capable of understanding the message. [10380270] |Notably, it is not required that the sender be capable of understanding the message, or even cognizant that there is a message. [10380280] |Thus, information is something that can be extracted from an environment, e.g., through observation, reading or measurement. [10380290] |Information is a term with many meanings depending on context, but is as a rule closely related to such concepts as meaning, knowledge, instruction, communication, representation, and mental stimulus. [10380300] |Simply stated, information is a message received and understood. [10380310] |In terms of data, it can be defined as a collection of facts from which conclusions may be drawn. [10380320] |There are many other aspects of information since it is the knowledge acquired through study or experience or instruction. [10380330] |But overall, information is the result of processing, manipulating and organizing data in a way that adds to the knowledge of the person receiving it. [10380340] |[[Communication theory]] provides a numerical measure of the uncertainty of an outcome. [10380350] |For example, we can say that "the signal contained thousands of bits of information". [10380360] |Communication theory tends to use the concept of [[information entropy]], generally attributed to [[C.E. Shannon]] (see below). [10380370] |Another form of information is [[Fisher information]], a concept of [[R.A. Fisher]]. [10380380] |This is used in application of statistics to [[estimation theory]] and to science in general. [10380390] |Fisher information is thought of as the amount of information that a message carries about an unobservable parameter. [10380400] |It can be computed from knowledge of the [[likelihood function]] defining the system. [10380410] |For example, with a normal likelihood function, the Fisher information is the reciprocal of the variance of the law. [10380420] |In the absence of knowledge of the likelihood law, the Fisher information may be computed from normally distributed score data as the reciprocal of their second moment. [10380430] |Even though information and data are often used interchangeably, they are actually very different. [10380440] |Data is a set of unrelated information, and as such is of no use until it is properly evaluated. [10380450] |Upon evaluation, once there is some significant relation between data, and they show some relevance, then they are converted into information. [10380460] |Now this same data can be used for different purposes. [10380470] |Thus, till the data convey some information, they are not useful. [10380480] |=== Measuring information entropy === [10380490] |The view of information as a message came into prominence with the publication in 1948 of an influential paper by [[Claude Shannon]], "[[A Mathematical Theory of Communication]]." [10380500] |This paper provides the foundations of [[information theory]] and endows the word ''information'' not only with a technical meaning but also a measure. 
[10380510] |If the sending device is equally likely to send any one of a set of N messages, then the preferred measure of "the information produced when one message is chosen from the set" is the base two [[logarithm]] of N (This measure is called ''[[self-information]]''). [10380520] |In this paper, Shannon continues: [10380530] |A complementary way of measuring information is provided by [[algorithmic information theory]]. [10380540] |In brief, this measures the information content of a list of symbols based on how predictable they are, or more specifically how easy it is to compute the list through a [[computer program|program]]: the information content of a sequence is the number of bits of the shortest program that computes it. [10380550] |The sequence below would have a very low algorithmic information measurement since it is a very predictable pattern, and as the pattern continues the measurement would not change. [10380560] |Shannon information would give the same information measurement for each symbol, since they are [[statistical randomness|statistically random]], and each new symbol would increase the measurement. [10380570] |:123456789101112131415161718192021 [10380580] |It is important to recognize the limitations of traditional information theory and algorithmic information theory from the perspective of human meaning. [10380590] |For example, when referring to the meaning content of a message Shannon noted “Frequently the messages have ''meaning…'' these semantic aspects of communication are irrelevant to the engineering problem. [10380600] |The significant aspect is that the actual message is one selected ''from a set of possible messages''” (emphasis in original). [10380610] |In information theory signals are part of a process, not a substance; they do something, they do not contain any specific meaning. [10380620] |Combining algorithmic information theory and information theory we can conclude that the most random signal contains the most information as it can be interpreted in any way and cannot be compressed. [10380630] |Michael Reddy noted that "'signals' of the [[mathematical theory]] are 'patterns that can be exchanged'. [10380640] |There is no message contained in the signal, the signals convey the ability to select from a set of possible messages." [10380650] |In information theory "the system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design". [10380660] |== Information as a pattern == [10380670] |Information is any represented [[pattern]]. [10380680] |This view assumes neither accuracy nor directly communicating parties, but instead assumes a separation between an object and its representation. [10380690] |Consider the following example: [[economic statistics]] represent an [[Economics|economy]], however inaccurately. [10380700] |What are commonly referred to as data in [[computing]], [[statistics]], and other fields, are forms of information in this sense. [10380710] |The [[electromagnetism|electro-magnetic]] patterns in a [[computer network]] and connected [[peripheral device|device]]s are related to something other than the pattern itself, such as [[Character (computing)|text characters]] to be displayed and [[Computer keyboard|keyboard]] input. [10380720] |[[Signal (information theory)|Signal]]s, [[Sign (linguistics)|sign]]s, and [[symbol]]s are also in this category. 
[10380730] |On the other hand, according to [[semiotics]], data is symbols with certain syntax and information is data with a certain semantic. [10380740] |[[Painting]] and [[drawing]] contain information to the extent that they represent something such as an assortment of objects on a table, a [[profile]], or a [[landscape]]. [10380750] |In other words, when a pattern of something is transposed to a pattern of something else, the latter is information. [10380760] |This would be the case whether or not there was anyone to perceive it. [10380770] |But if information can be defined merely as a pattern, does that mean that neither [[utility]] nor meaning are necessary components of information? [10380780] |Arguably a distinction must be made between raw unprocessed data and information which possesses utility, [[value (economics)|value]] or some quantum of meaning. [10380790] |On this view, information may indeed be characterized as a pattern; but this is a [[necessary]] condition, not a [[sufficient]] one. [10380800] |An individual entry in a telephone book, which follows a specific pattern formed by name, address and telephone number, does not become "informative" in some sense unless and until it possesses some degree of utility, value or meaning. [10380810] |For example, someone might look up a girlfriend's number, might order a take away etc. [10380820] |The vast majority of numbers will never be construed as "information" in any meaningful sense. [10380830] |The gap between data and information is only closed by a behavioral bridge whereby some value, utility or meaning is added to transform mere data or pattern into information. [10380840] |When one constructs a representation of an object, one can selectively extract from the object ([[sampling (case studies)|sampling]]) or use a [[system]] of signs to replace ([[encode|encoding]]), or both. [10380850] |The sampling and encoding result in representation. [10380860] |An example of the former is a "sample" of a product; an example of the latter is "verbal description" of a product. [10380870] |Both contain information of the product, however inaccurate. [10380880] |When one interprets representation, one can predict a broader pattern from a limited number of observations (inference) or understand the relation between patterns of two different things ([[decode|decoding]]). [10380890] |One example of the former is to sip a [[soup]] to know if it is spoiled; an example of the latter is examining footprints to determine the animal and its condition. [10380900] |In both cases, information sources are not constructed or presented by some "sender" of information. [10380910] |Regardless, information is dependent upon, but usually unrelated to and separate from, the medium or media used to express it. [10380920] |In other words, the position of a theoretical series of bits, or even the output once interpreted by a [[computer]] or similar device, is unimportant, except when someone or something is present to interpret the information. [10380930] |Therefore, a quantity of information is totally distinct from its medium. [10380940] |== Information as sensory input == [10380950] |Often information is viewed as a type of [[input]] to an [[organism]] or designed device. [10380960] |Inputs are of two kinds. [10380970] |Some inputs are important to the function of the organism (for example, food) or device ([[energy]]) by themselves. [10380980] |In his book ''Sensory Ecology,'' Dusenbery called these causal inputs. 
[10380990] |Other inputs (information) are important only because they are associated with causal inputs and can be used to predict the occurrence of a causal input at a later time (and perhaps another place). [10381000] |Some information is important because of association with other information but eventually there must be a connection to a causal input. [10381010] |In practice, information is usually carried by weak stimuli that must be detected by specialized sensory systems and amplified by energy inputs before they can be functional to the organism or device. [10381020] |For example, light is often a causal input to plants but provides information to animals. [10381030] |The colored light reflected from a flower is too weak to do much photosynthetic work but the visual system of the bee detects it and the bee's nervous system uses the information to guide the bee to the flower, where the bee often finds nectar or pollen, which are causal inputs, serving a nutritional function. [10381040] |Information is any type of sensory input. [10381050] |When an organism with a [[nervous system]] receives an input, it transforms the input into an electrical signal. [10381060] |This is regarded information by some. [10381070] |The idea of representation is still relevant, but in a slightly different manner. [10381080] |That is, while [[abstract painting]] does not represent anything concretely, when the viewer sees the painting, it is nevertheless transformed into electrical signals that create a representation of the painting. [10381090] |Defined this way, information does not have to be related to truth, communication, or representation of an object. [10381100] |[[Entertainment]] in general is not intended to be informative. [10381110] |[[Music]], the [[performing arts]], [[amusement park]]s, works of [[fiction]] and so on are thus forms of information in this sense, but they are not necessarily forms of information according to some definitions given above. [10381120] |Consider another example: food supplies both nutrition and taste for those who eat it. [10381130] |If information is equated to sensory input, then nutrition is not information but taste is. [10381140] |== Information as an influence which leads to a transformation == [10381150] |Information is any type of pattern that influences the formation or transformation of other patterns. [10381160] |In this sense, there is no need for a conscious mind to perceive, much less appreciate, the pattern. [10381170] |Consider, for example, [[DNA]]. [10381180] |The sequence of [[nucleotide]]s is a pattern that influences the formation and development of an organism without any need for a conscious mind. [10381190] |[[Systems theory]] at times seems to refer to information in this sense, assuming information does not necessarily involve any conscious mind, and patterns circulating (due to [[feedback]]) in the system can be called information. [10381200] |In other words, it can be said that information in this sense is something potentially perceived as representation, though not created or presented for that purpose. [10381210] |When [[Marshall McLuhan]] speaks of [[media (communication)|media]] and their effects on human cultures, he refers to the structure of [[cultural artifact|artifacts]] that in turn shape our behaviors and mindsets. [10381220] |Also, [[pheromone]]s are often said to be "information" in this sense. [10381230] |(See also [[Gregory Bateson]].) [10381240] |== Information as a property in physics == [10381250] |In 2003, J. D. 
Bekenstein claimed there is a growing trend in [[physics]] to define the physical world as being made of information itself (and thus information is defined in this way). [10381260] |Information has a well defined meaning in physics. [10381270] |Examples of this include the phenomenon of [[quantum entanglement]] where particles can interact without reference to their separation or the speed of light. [10381280] |Information itself cannot travel faster than light even if the information is transmitted indirectly. [10381290] |This could lead to the fact that all attempts at physically observing a particle with an "entangled" relationship to another are slowed down, even though the particles are not connected in any other way other than by the information they carry. [10381300] |Another link is demonstrated by the [[Maxwell's demon]] thought experiment. [10381310] |In this experiment, a direct relationship between information and another physical property, [[entropy]], is demonstrated. [10381320] |A consequence is that it is impossible to destroy information without increasing the entropy of a system; in practical terms this often means generating heat. [10381330] |Another, more philosophical, outcome is that information could be thought of as interchangeable with [[Energy#Transformations_of_energy|energy]]. [10381340] |Thus, in the study of [[logic gates]], the theoretical lower bound of thermal energy released by an ''AND gate'' is higher than for the ''NOT gate'' (because information is destroyed in an ''AND gate'' and simply converted in a ''NOT gate''). [10381350] |Physical information is of particular importance in the theory of [[quantum computers]]. [10381360] |== Information as records == [10381370] |Records are a specialized form of information. [10381380] |Essentially, records are information produced consciously or as by-products of business activities or transactions and retained because of their value. [10381390] |Primarily their value is as evidence of the activities of the organization but they may also be retained for their informational value. [10381400] |Sound [[records management]] ensures that the integrity of records is preserved for as long as they are required. [10381410] |The international standard on records management, ISO 15489, defines records as "information created, received, and maintained as evidence and information by an organization or person, in pursuance of legal obligations or in the transaction of business". [10381420] |The International Committee on Archives (ICA) Committee on electronic records defined a record as, "a specific piece of recorded information generated, collected or received in the initiation, conduct or completion of an activity and that comprises sufficient content, context and structure to provide proof or evidence of that activity". [10381430] |Records may be retained because of their business value, as part of the [[corporate memory]] of the organization or to meet legal, fiscal or accountability requirements imposed on the organization. [10381440] |Willis (2005) expressed the view that sound management of business records and information delivered "…six key requirements for good [[corporate governance]]…transparency; accountability; due process; compliance; meeting statutory and common law requirements; and security of personal and corporate information." [10381450] |== Information and semiotics == [10381460] |Beynon-Davies explains the multi-faceted concept of information in terms of that of signs and sign-systems. 
[10381470] |Signs themselves can be considered in terms of four inter-dependent levels, layers or branches of [[semiotics]]: pragmatics, semantics, syntactics and empirics. [10381480] |These four layers serve to connect the social world on the one hand with the physical or technical world on the other. [10381490] |[[Pragmatics]] is concerned with the purpose of communication. [10381500] |Pragmatics links the issue of signs with that of intention. [10381510] |The focus of pragmatics is on the intentions of human agents underlying communicative behaviour. [10381520] |In other words, intentions link language to action. [10381530] |[[Semantics]] is concerned with the meaning of a message conveyed in a communicative act. [10381535] |Semantics considers the content of communication. [10381540] |Semantics is the study of the meaning of signs - the association between signs and behaviour. [10381550] |Semantics can be considered as the study of the link between symbols and their referents or concepts; particularly the way in which signs relate to human behaviour. [10381560] |Syntactics is concerned with the formalism used to represent a message. [10381570] |Syntactics as an area studies the form of communication in terms of the logic and grammar of sign systems. [10381580] |Syntactics is devoted to the study of the form rather than the content of signs and sign-systems. [10381590] |Empirics is the study of the signals used to carry a message; the physical characteristics of the medium of communication. [10381600] |Empirics is devoted to the study of communication channels and their characteristics, e.g., sound, light, electronic transmission etc. [10381610] |Communication normally exists within the context of some social situation. [10381620] |The social situation sets the context for the intentions conveyed (pragmatics) and the form in which communication takes place. [10381630] |In a communicative situation intentions are expressed through messages which comprise collections of inter-related signs taken from a language which is mutually understood by the agents involved in the communication. [10381640] |Mutual understanding implies that agents involved understand the chosen language in terms of its agreed syntax (syntactics) and semantics. [10381650] |The sender codes the message in the language and sends the message as signals along some communication channel (empirics). [10381660] |The chosen communication channel will have inherent properties which determine outcomes such as the speed with which communication can take place and over what distance. [10390010] |
Information extraction
[10390020] |In [[natural language processing]], '''information extraction''' (IE) is a type of [[information retrieval]] whose goal is to automatically extract structured information, i.e. categorized and contextually and semantically well-defined data from a certain domain, from unstructured [[machine-readable]] documents. [10390030] |An example of information extraction is the extraction of instances of corporate mergers, more formally MergerBetween(company_1, company_2, date), from an online news sentence such as: "Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp." [10390040] |A broad goal of IE is to allow computation to be done on the previously unstructured data. [10390050] |A more specific goal is to allow logical reasoning to draw inferences based on the logical content of the input data. [10390060] |The significance of IE is determined by the growing amount of information available in unstructured (i.e. without [[metadata]]) form, for instance on the Internet. [10390070] |This knowledge can be made more accessible by means of transformation into [[relational database|relational form]], or by marking-up with [[XML]] tags. [10390080] |An intelligent agent monitoring a news data feed requires IE to transform unstructured data into something that can be reasoned with. [10390090] |A typical application of IE is to scan a set of documents written in a [[natural language]] and populate a database with the information extracted. [10390100] |Current approaches to IE use [[natural language processing]] techniques that focus on very restricted domains. [10390110] |For example, the ''[[Message Understanding Conference]]'' (MUC) is a competition-based conference that focused on the following domains in the past: [10390120] |*MUC-1 (1987), MUC-2 (1989): Naval operations messages. [10390130] |*MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries. [10390140] |*MUC-5 (1993): Joint ventures and microelectronics domain. [10390150] |*MUC-6 (1995): News articles on management changes. [10390160] |*MUC-7 (1998): Satellite launch reports. [10390170] |Natural Language texts may need to use some form of a [[Text simplification]] to create a more easily machine readable text to extract the sentences. [10390180] |Typical subtasks of IE are: [10390190] |* [[Named Entity Recognition]]: recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions. [10390200] |* [[Coreference]]: identification chains of [[noun phrase]]s that refer to the same object. [10390210] |For example, [[Anaphora (linguistics)|anaphora]] is a type of coreference. [10390220] |* [[Terminology extraction]]: finding the relevant terms for a given [[text corpus|corpus]] [10390230] |* Relation Extraction: identification of relations between entities, such as: [10390240] |**PERSON works for ORGANIZATION (extracted from the sentence "Bill works for IBM.") [10390250] |**PERSON located in LOCATION (extracted from the sentence "Bill is in France.") [10400010] |
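The relation patterns listed above lend themselves to a deliberately simple illustration. The Python sketch below is only a toy: it substitutes two hand-written regular expressions and a capitalised-word heuristic for the trained named-entity and relation classifiers a real IE system would use, and the pattern and function names are invented for the example.

```python
import re

# Toy surface patterns standing in for learned extractors (an assumption of
# this sketch, not how production IE systems work).
WORKS_FOR = re.compile(r"\b([A-Z][a-z]+) works for ([A-Z][A-Za-z]+)\b")
LOCATED_IN = re.compile(r"\b([A-Z][a-z]+) is in ([A-Z][a-z]+)\b")

def extract_relations(sentence):
    """Return (relation, arg1, arg2) tuples found by the hand-written patterns."""
    relations = []
    for person, org in WORKS_FOR.findall(sentence):
        relations.append(("works_for", person, org))
    for person, place in LOCATED_IN.findall(sentence):
        relations.append(("located_in", person, place))
    return relations

print(extract_relations("Bill works for IBM."))  # [('works_for', 'Bill', 'IBM')]
print(extract_relations("Bill is in France."))   # [('located_in', 'Bill', 'France')]
```

Even this toy version shows the essential shape of the task: unstructured text goes in, and typed, database-ready tuples come out.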
Information retrieval
[10400020] |'''Information retrieval''' ('''IR''') is the science of searching for documents, for [[information]] within documents and for [[Metadata (computing)|metadata]] about documents, as well as that of searching [[relational database]]s and the [[World Wide Web]]. [10400030] |There is overlap in the usage of the terms data retrieval, [[document retrieval]], information retrieval, and [[text retrieval]], but each also has its own body of literature, theory, [[Praxis (process)|praxis]] and technologies. [10400040] |IR is [[interdisciplinary]], based on [[computer science]], [[mathematics]], [[library science]], [[information science]], [[information architecture]], [[cognitive psychology]], [[linguistics]], [[statistics]] and [[physics]]. [10400050] |Automated information retrieval systems are used to reduce what has been called "[[information overload]]". [10400060] |Many universities and [[public library|public libraries]] use IR systems to provide access to books, journals and other documents. [10400070] |Web [[Web search engine|search engine]]s are the most visible [[Information retrieval applications|IR applications]]. [10400080] |== History == [10400090] |The idea of using computers to search for relevant pieces of information was popularized in an article ''[[As We May Think]]'' by [[Vannevar Bush]] in 1945. [10400100] |First implementations of information retrieval systems were introduced in the 1950s and 1960s. [10400110] |By 1990 several different techniques had been shown to perform well on small text corpora (several thousand documents). [10400120] |In 1992 the US Department of Defense, along with the [[National Institute of Standards and Technology]] (NIST), cosponsored the [[Text Retrieval Conference]] (TREC) as part of the TIPSTER text program. [10400130] |The aim of this was to look into the information retrieval community by supplying the infrastructure that was needed for evaluation of text retrieval methodologies on a very large text collection. [10400140] |This catalyzed research on methods that [[scalability|scale]] to huge corpora. [10400150] |The introduction of web [[Web search engine|search engine]]s has boosted the need for very large scale retrieval systems even further. [10400160] |The use of digital methods for storing and retrieving information has led to the phenomenon of [[digital obsolescence]], where a digital resource ceases to be readable because the physical media, the reader required to read the media, the hardware, or the software that runs on it, is no longer available. [10400170] |The information is initially easier to retrieve than if it were on paper, but is then effectively lost. [10400180] |=== Timeline === [10400190] |* 1890: Hollerith tabulating machines were used to analyze the US census. [10400200] |([[Herman Hollerith]]). [10400210] |* 1945: [[Vannevar Bush]]'s ''[[As We May Think]]'' appeared in ''[[Atlantic Monthly]]'' [10400220] |* Late 1940s: The US military confronted problems of indexing and retrieval of wartime scientific research documents captured from Germans. [10400230] |* 1947: [[Hans Peter Luhn]] (research engineer at IBM since 1941) began work on a mechanized, punch card based system for searching chemical compounds. [10400240] |* 1950: The term "information retrieval" may have been coined by [[Calvin Mooers]]. 
[10400250] |* 1950s: Growing concern in the US for a "science gap" with the USSR motivated, encouraged funding, and provided a backdrop for mechanized literature searching systems ([[Allen Kent]] et al) and the invention of citation indexing ([[Eugene Garfield]]). [10400260] |* 1955: Allen Kent joined [[Case Western Reserve University]], and eventually becomes associate director of the Center for Documentation and Communications Research. [10400270] |That same year, Kent and colleagues publish a paper in American Documentation describing the precision and recall measures, as well as detailing a proposed "framework" for evaluating an IR system, which includes statistical sampling methods for determining the number of relevant documents not retrieved. [10400280] |* 1958: International Conference on Scientific Information Washington DC included consideration of IR systems as a solution to problems identified. [10400290] |See: Proceedings of the International Conference on Scientific Information, 1958 (National Academy of Sciences, Washington, DC, 1959) [10400300] |* 1959: Hans Peter Luhn published "Auto-encoding of documents for information retrieval." [10400310] |* 1960: Melvin Earl (Bill) Maron and J. L. Kuhns published "On relevance, probabilistic indexing, and information retrieval" in Journal of the ACM 7(3):216-244, July 1960. [10400320] |* Early 1960s: [[Gerard Salton]] began work on IR at Harvard, later moved to Cornell. [10400330] |* 1962: [[Cyril W. Cleverdon]] published early findings of the Cranfield studies, developing a model for IR system evaluation. [10400340] |See: Cyril W. Cleverdon, "Report on the Testing and Analysis of an Investigation into the Comparative Efficiency of Indexing Systems". [10400350] |Cranfield Coll. of Aeronautics, Cranfield, England, 1962. [10400360] |* 1962: Kent published Information Analysis and Retrieval [10400370] |* 1963: Weinberg report "Science, Government and Information" gave a full articulation of the idea of a "crisis of scientific information." [10400380] |The report was named after Dr. [[Alvin Weinberg]]. [10400390] |* 1963: [[Joseph Becker]] and [[Robert M. Hayes]] published text on information retrieval. [10400400] |Becker, Joseph; Hayes, Robert Mayo. [10400410] |Information storage and retrieval: tools, elements, theories. [10400420] |New York, Wiley (1963). [10400430] |* 1964: [[Karen Spärck Jones]] finished her thesis at Cambridge, ''Synonymy and Semantic Classification'', and continued work on [[computational linguistics]] as it applies to IR [10400440] |* 1964: The [[National Bureau of Standards]] sponsored a symposium titled "Statistical Association Methods for Mechanized Documentation." [10400450] |Several highly significant papers, including G. Salton's first published reference (we believe) to the SMART system. [10400460] |* Mid-1960s: National Library of Medicine developed [[MEDLARS]] Medical Literature Analysis and Retrieval System, the first major machine-readable database and batch retrieval system [10400470] |* Mid-1960s: Project Intrex at MIT [10400480] |* 1965: [[J. C. R. Licklider]] published ''Libraries of the Future'' [10400490] |* 1966: [[Don Swanson]] was involved in studies at University of Chicago on Requirements for Future Catalogs [10400500] |* 1968: Gerard Salton published ''Automatic Information Organization and Retrieval''. [10400510] |* 1968: [[J. W. Sammon]]'s RADC Tech report "Some Mathematics of Information Storage and Retrieval..." outlined the vector model. 
[10400520] |* 1969: Sammon's "A nonlinear mapping for data structure analysis" (IEEE Transactions on Computers) was the first proposal for visualization interface to an IR system. [10400530] |* Late 1960s: [[F. W. Lancaster]] completed evaluation studies of the MEDLARS system and published the first edition of his text on information retrieval [10400540] |* Early 1970s: first online systems--NLM's AIM-TWX, MEDLINE; Lockheed's Dialog; SDC's ORBIT [10400550] |* Early 1970s: [[Theodor Nelson]] promoting concept of [[hypertext]], published Computer Lib/Dream Machines [10400560] |* 1971: [[N. Jardine]] and [[C. J. Van Rijsbergen]] published "The use of hierarchic clustering in information retrieval", which articulated the "cluster hypothesis." [10400570] |(Information Storage and Retrieval, 7(5), pp. 217-240, Dec 1971) [10400580] |*1975: Three highly influential publications by Salton fully articulated his vector processing framework and term discrimination model: [10400590] |** A Theory of Indexing (Society for Industrial and Applied Mathematics) [10400600] |** "A theory of term importance in automatic text analysis", (JASIS v. 26) [10400610] |** "A vector space model for automatic indexing", (CACM 18:11) [10400620] |* 1978: The First [[Association for Computing Machinery|ACM]] [[SIGIR]] conference. [10400630] |* 1979: C. J. Van Rijsbergen published ''Information Retrieval'' (Butterworths). [10400640] |Heavy emphasis on probabilistic models. [10400650] |* 1980: First international ACM SIGIR conference, joint with British Computer Society IR group in Cambridge [10400660] |* 1982: [[Nicholas J. Belkin|Belkin]], Oddy, and Brooks proposed the ASK (Anomalous State of Knowledge) viewpoint for information retrieval. [10400670] |This was an important concept, though their automated analysis tool proved ultimately disappointing. [10400680] |* 1983: Salton (and M. McGill) published Introduction to Modern Information Retrieval (McGraw-Hill), with heavy emphasis on vector models. [10400690] |* Mid-1980s: Efforts to develop end user versions of commercial IR systems. [10400700] |* 1985-1993: Key papers on and experimental systems for visualization interfaces. [10400710] |* Work by [[D. B. Crouch]], [[Robert R. Korfhage]], [[M. Chalmers]], [[A. Spoerri]] and others. [10400720] |* 1989: First [[World Wide Web]] proposals by [[Tim Berners-Lee]] at [[CERN]]. [10400730] |* 1992: First TREC conference. [10400740] |* 1997: Publication of [[Robert R. Korfhage|Korfhage]]'s ''Information Storage and Retrieval'' with emphasis on visualization and multi-reference point systems. [10400750] |* Late 1990s: Web [[Web search engine|search engine]] implementation of many features formerly found only in experimental IR systems [10400760] |== Overview == [10400770] |An information retrieval process begins when a user enters a query into the system. [10400780] |Queries are formal statements of [[information need]]s, for example search strings in web search engines. [10400790] |In information retrieval a query does not uniquely identify a single object in the collection. [10400800] |Instead, several objects may match the query, perhaps with different degrees of [[relevance|relevancy]]. [10400810] |An object is an entity which keeps or stores information in a database. [10400820] |User queries are matched to objects stored in the database. [10400830] |Depending on the [[Information retrieval applications|application]] the data objects may be, for example, text documents, images or videos. 
[10400840] |Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates. [10400850] |Most IR systems compute a numeric score on how well each object in the database match the query, and rank the objects according to this value. [10400860] |The top ranking objects are then shown to the user. [10400870] |The process may then be iterated if the user wishes to refine the query. [10400880] |== Performance measures == [10400890] |Many different measures for evaluating the performance of information retrieval systems have been proposed. [10400900] |The measures require a collection of documents and a query. [10400910] |All common measures described here assume a ground truth notion of relevancy: every document is known to be either relevant or non-relevant to a particular query. [10400920] |In practice queries may be [[ill-posed]] and there may be different shades of relevancy. [10400930] |=== Precision === [10400940] |Precision is the fraction of the documents retrieved that are [[Relevance (information retrieval)|relevant]] to the user's information need. [10400950] |: \mbox{precision}=\frac{|\{\mbox{relevant documents}\}\cap\{\mbox{retrieved documents}\}|}{|\{\mbox{retrieved documents}\}|} [10400960] |In [[binary classification]], precision is analogous to [[positive predictive value]]. [10400970] |Precision takes all retrieved documents into account. [10400980] |It can also be evaluated at a given cut-off rank, considering only the topmost results returned by the system. [10400990] |This measure is called ''precision at n'' or ''P@n''. [10401000] |Note that the meaning and usage of "precision" in the field of Information Retrieval differs from the definition of [[accuracy and precision]] within other branches of science and technology. [10401010] |=== Recall === [10401020] |Recall is the fraction of the documents that are relevant to the query that are successfully retrieved. [10401030] |:\mbox{recall}=\frac{|\{\mbox{relevant documents}\}\cap\{\mbox{retrieved documents}\}|}{|\{\mbox{relevant documents}\}|} [10401040] |In binary classification, recall is called [[sensitivity (tests)|sensitivity]]. [10401050] |So it can be looked at as ''the probability that a relevant document is retrieved by the query''. [10401060] |It is trivial to achieve recall of 100% by returning all documents in response to any query. [10401070] |Therefore recall alone is not enough but one needs to measure the number of non-relevant documents also, for example by computing the precision. [10401080] |=== Fall-Out === [10401090] |The proportion of non-relevant documents that are retrieved, out of all non-relevant documents available: [10401100] |: \mbox{fall-out}=\frac{|\{\mbox{non-relevant documents}\}\cap\{\mbox{retrieved documents}\}|}{|\{\mbox{non-relevant documents}\}|} [10401110] |In binary classification, fall-out is closely related to [[specificity (tests)|specificity]]. [10401120] |More precisely: \mbox{fall-out}=1-\mbox{specificity}. [10401130] |It can be looked at as ''the probability that a non-relevant document is retrieved by the query''. [10401140] |It is trivial to achieve fall-out of 0% by returning zero documents in response to any query. 
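Given ground-truth relevance judgments, the three measures defined so far reduce to a handful of set operations. The sketch below uses invented document identifiers and helper names; it illustrates the definitions themselves rather than any particular IR system.

```python
def precision(retrieved, relevant):
    # fraction of the retrieved documents that are relevant
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    # fraction of the relevant documents that were retrieved
    return len(retrieved & relevant) / len(relevant)

def fall_out(retrieved, relevant, collection):
    # fraction of the non-relevant documents that were (wrongly) retrieved
    non_relevant = collection - relevant
    return len(retrieved & non_relevant) / len(non_relevant)

collection = {"d1", "d2", "d3", "d4", "d5", "d6"}   # hypothetical corpus
relevant   = {"d1", "d2", "d3"}                     # ground-truth judgments
retrieved  = {"d1", "d2", "d5"}                     # system output for one query

print(precision(retrieved, relevant))               # 2/3
print(recall(retrieved, relevant))                  # 2/3
print(fall_out(retrieved, relevant, collection))    # 1/3
```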
[10401150] |=== F-measure === [10401160] |The weighted [[harmonic mean]] of precision and recall, the traditional F-measure or balanced F-score is: [10401170] |:F = 2 \cdot (\mathrm{precision} \cdot \mathrm{recall}) / (\mathrm{precision} + \mathrm{recall}).\, [10401180] |This is also known as the F_1 measure, because recall and precision are evenly weighted. [10401190] |The general formula for non-negative real ß is: [10401200] |:F_\beta = (1 + \beta^2) \cdot (\mathrm{precision} \cdot \mathrm{recall}) / (\beta^2 \cdot \mathrm{precision} + \mathrm{recall}).\, [10401210] |Two other commonly used F measures are the F_{2} measure, which weights recall twice as much as precision, and the F_{0.5} measure, which weights precision twice as much as recall. [10401220] |The F-measure was derived by van Rijsbergen (1979) so that F_\beta "measures the effectiveness of retrieval with respect to a user who attaches ß times as much importance to recall as precision". [10401230] |It is based on van Rijsbergen's effectiveness measure E = 1-(1/(\alpha/P + (1-\alpha)/R)). [10401240] |Their relationship is F_\beta = 1 - E where \alpha=1/(\beta^2+1). [10401250] |=== Average precision of precision and recall=== [10401260] |The precision and recall are based on the whole list of documents returned by the system. [10401270] |Average precision emphasizes returning more relevant documents earlier. [10401280] |It is average of precisions computed after truncating the list after each of the relevant documents in turn: [10401290] |: \operatorname{AveP} = \frac{\sum_{r=1}^N (P(r) \times \mathrm{rel}(r))}{\mbox{number of relevant documents}} \! [10401300] |where ''r'' is the rank, ''N'' the number retrieved, ''rel()'' a binary function on the relevance of a given rank, and ''P()'' precision at a given cut-off rank. [10401310] |== Model types == [10401320] |[[Image:Information-Retrieval-Models.png|thumb|500px|categorization of IR-models (translated from [http://de.wikipedia.org/wiki/Informationsrückgewinnung#Klassifikation_von_Modellen_zur_Repr.C3.A4sentation_nat.C3.BCrlichsprachlicher_Dokumente German entry], original source [http://www.logos-verlag.de/cgi-bin/engbuchmid?isbn=0514&lng=eng&id= Dominik Kuropka])]] [10401325] |For the information retrieval to be efficient, the documents are typically transformed into a suitable representation. [10401330] |There are several representations. [10401340] |The picture on the right illustrates the relationship of some common models. [10401350] |In the picture, the models are categorized according to two dimensions: the mathematical basis and the properties of the model. [10401360] |=== First dimension: mathematical basis === [10401370] |* ''Set-theoretic models'' represent documents as sets of words or phrases. [10401380] |Similarities are usually derived from set-theoretic operations on those sets. [10401390] |Common models are: [10401400] |** [[Standard Boolean model]] [10401410] |** [[Extended Boolean model]] [10401420] |** [[Fuzzy retrieval]] [10401430] |* ''Algebraic models'' represent documents and queries usually as vectors, matrices or tuples. [10401440] |The similarity of the query vector and document vector is represented as a scalar value. 
[10401450] |** [[Vector space model]] [10401460] |** [[Generalized vector space model]] [10401470] |** Topic-based vector space model (literature: [http://www.kuropka.net/files/TVSM.pdf], [http://www.logos-verlag.de/cgi-bin/engbuchmid?isbn=0514&lng=eng&id=]) [10401480] |** [[Extended Boolean model]] [10401490] |** Enhanced topic-based vector space model (literature: [http://kuropka.net/files/HPI_Evaluation_of_eTVSM.pdf], [http://www.logos-verlag.de/cgi-bin/engbuchmid?isbn=0514&lng=eng&id=]) [10401500] |** Latent semantic indexing aka [[latent semantic analysis]] [10401510] |* ''Probabilistic models'' treat the process of document retrieval as a probabilistic inference. [10401520] |Similarities are computed as probabilities that a document is relevant for a given query. [10401530] |Probabilistic theorems like the [[Bayes' theorem]] are often used in these models. [10401540] |** [[Binary independence retrieval]] [10401550] |** [[Probabilistic relevance model (BM25)]] [10401560] |** Uncertain inference [10401570] |** [[Language model]]s [10401580] |** [[Divergence-from-randomness model]] [10401590] |** [[Latent Dirichlet allocation]] [10401600] |=== Second dimension: properties of the model === [10401610] |* ''Models without term-interdependencies'' treat different terms/words as independent. [10401620] |This fact is usually represented in vector space models by the [[orthogonality]] assumption of term vectors or in probabilistic models by an [[independency]] assumption for term variables. [10401630] |* ''Models with immanent term interdependencies'' allow a representation of interdependencies between terms. [10401640] |However the degree of the interdependency between two terms is defined by the model itself. [10401650] |It is usually directly or indirectly derived (e.g. by [[dimension reduction|dimensional reduction]]) from the [[co-occurrence]] of those terms in the whole set of documents. [10401660] |* ''Models with transcendent term interdependencies'' allow a representation of interdependencies between terms, but they do not allege how the interdependency between two terms is defined. [10401670] |They relay an external source for the degree of interdependency between two terms. [10401680] |(For example a human or sophisticated algorithms.) [10401690] |== Major figures == [10401700] |* [[Gerard Salton]] [10401710] |* [[Hans Peter Luhn]] [10401720] |* [http://ciir.cs.umass.edu/personnel/croft.html W. Bruce Croft] [10401730] |* [[Karen Spärck Jones]] [10401740] |* [[C. J. van Rijsbergen]] [10401750] |* [http://www.soi.city.ac.uk/~ser/homepage.html Stephen E. Robertson] [10401760] |== Awards in the field == [10401770] |* [[Tony Kent Strix award]] [10401780] |* [[Gerard Salton Award]] [10410010] |
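As a concrete illustration of the algebraic family of models described above, the following minimal term-frequency sketch represents documents and a query as term-count vectors and ranks the documents by cosine similarity. It deliberately omits stemming, stop-word removal and tf-idf weighting, and the example documents and query are invented.

```python
import math
from collections import Counter

def tf_vector(text):
    """Raw term-frequency vector of a lower-cased, whitespace-tokenised text."""
    return Counter(text.lower().split())

def cosine(a, b):
    # scalar similarity between two sparse term-count vectors
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = {
    "d1": "information retrieval ranks documents by relevance",
    "d2": "the vector space model represents documents as vectors",
    "d3": "entropy measures uncertainty in information theory",
}
query = tf_vector("vector space model for documents")

ranking = sorted(docs, key=lambda d: cosine(tf_vector(docs[d]), query), reverse=True)
for d in ranking:
    print(d, round(cosine(tf_vector(docs[d]), query), 3))
```

A production system would add term weighting and an inverted index, but the ranking step has this same basic shape.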
Information theory
[10410020] |'''Information theory''' is a branch of [[applied mathematics]] and [[electrical engineering]] involving the quantification of [[information]]. [10410030] |Historically, information theory was developed to find fundamental limits on compressing and reliably [[communication|communicating]] data. [10410040] |Since its inception it has broadened to find applications in many other areas, including [[statistical inference]], [[natural language processing]], [[cryptography]] generally, [[networks]] other than communication networks -- as in [[neurobiology]], the evolution and function of molecular codes, model selection in ecology, thermal physics, [[quantum computing]], plagiarism detection and other forms of [[data analysis]]. [10410050] |A key measure of information in the theory is known as [[information entropy]], which is usually expressed by the average number of bits needed for storage or communication. [10410060] |Intuitively, entropy quantifies the uncertainty involved when encountering a [[random variable]]. [10410070] |For example, a fair coin flip (2 equally likely outcomes) will have less entropy than a roll of a die (6 equally likely outcomes). [10410080] |Applications of fundamental topics of information theory include [[lossless data compression]] (e.g. [[ZIP (file format)|ZIP files]]), [[lossy data compression]] (e.g. [[MP3]]s), and [[channel capacity|channel coding]] (e.g. for [[DSL]] lines). [10410110] |The field is at the intersection of [[mathematics]], [[statistics]], [[computer science]], [[physics]], [[neurobiology]], and [[electrical engineering]]. [10410120] |Its impact has been crucial to the success of the [[Voyager program|Voyager]] missions to deep space, the invention of the CD, the feasibility of mobile phones, the development of the [[Internet]], the study of [[linguistics]] and of human perception, the understanding of [[black hole]]s, and numerous other fields. [10410130] |Important sub-fields of information theory are source coding, channel coding, algorithmic complexity theory, algorithmic information theory, and measures of information. [10410140] |==Overview== [10410150] |The main concepts of information theory can be grasped by considering the most widespread means of human communication: language. [10410160] |Two important aspects of a good language are as follows: First, the most common words (e.g., "a", "the", "I") should be shorter than less common words (e.g., "benefit", "generation", "mediocre"), so that sentences will not be too long. [10410170] |Such a tradeoff in word length is analogous to [[data compression]] and is the essential aspect of [[source coding]]. [10410180] |Second, if part of a sentence is unheard or misheard due to noise -— e.g., a passing car -— the listener should still be able to glean the meaning of the underlying message. [10410190] |Such robustness is as essential for an electronic communication system as it is for a language; properly building such robustness into communications is done by [[Channel capacity|channel coding]]. [10410200] |Source coding and channel coding are the fundamental concerns of information theory. [10410210] |Note that these concerns have nothing to do with the ''importance'' of messages. [10410220] |For example, a platitude such as "Thank you; come again" takes about as long to say or write as the urgent plea, "Call an ambulance!" while clearly the latter is more important and more meaningful. 
[10410230] |Information theory, however, does not consider message importance or meaning, as these are matters of the quality of data rather than the quantity and readability of data, the latter of which is determined solely by probabilities. [10410240] |Information theory is generally considered to have been founded in 1948 by [[Claude Elwood Shannon|Claude Shannon]] in his seminal work, "[[A Mathematical Theory of Communication]]." [10410250] |The central paradigm of classical information theory is the engineering problem of the transmission of information over a noisy channel. [10410260] |The most fundamental results of this theory are Shannon's [[source coding theorem]], which establishes that, on average, the number of ''bits'' needed to represent the result of an uncertain event is given by its [[information entropy|entropy]]; and Shannon's [[noisy-channel coding theorem]], which states that ''reliable'' communication is possible over ''noisy'' channels provided that the rate of communication is below a certain threshold called the channel capacity. [10410270] |The channel capacity can be approached in practice by using appropriate encoding and decoding systems. [10410280] |Information theory is closely associated with a collection of pure and applied disciplines that have been investigated and reduced to engineering practice under a variety of rubrics throughout the world over the past half century or more: [[adaptive system]]s, [[anticipatory system]]s, [[artificial intelligence]], [[complex system]]s, [[complexity science]], [[cybernetics]], [[informatics]], [[machine learning]], along with [[systems science]]s of many descriptions. [10410290] |Information theory is a broad and deep mathematical theory, with equally broad and deep applications, amongst which is the vital field of [[coding theory]]. [10410300] |Coding theory is concerned with finding explicit methods, called ''codes'', of increasing the efficiency and reducing the net error rate of data communication over a noisy channel to near the limit that Shannon proved is the maximum possible for that channel. [10410310] |These codes can be roughly subdivided into [[data compression]] (source coding) and [[error-correction]] (channel coding) techniques. [10410320] |In the latter case, it took many years to find the methods Shannon's work proved were possible. [10410330] |A third class of information theory codes are cryptographic algorithms (both [[code (cryptography)|code]]s and [[cipher]]s). [10410340] |Concepts, methods and results from coding theory and information theory are widely used in [[cryptography]] and [[cryptanalysis]]. [10410350] |''See the article [[ban (information)]] for a historical application.'' [10410360] |Information theory is also used in [[information retrieval]], [[intelligence (information gathering)|intelligence gathering]], [[gambling]], [[statistics]], and even in [[musical composition]]. [10410370] |==Historical background== [10410380] |The landmark event that established the discipline of information theory, and brought it to immediate worldwide attention, was the publication of [[Claude E. Shannon]]'s classic paper "[[A Mathematical Theory of Communication]]" in the ''[[Bell System Technical Journal]]'' in July and October of 1948. [10410390] |Prior to this paper, limited information theoretic ideas had been developed at Bell Labs, all implicitly assuming events of equal probability. 
[10410400] |[[Harry Nyquist]]'s 1924 paper, ''Certain Factors Affecting Telegraph Speed,'' contains a theoretical section quantifying "intelligence" and the "line speed" at which it can be transmitted by a communication system, giving the relation W = K \log m, where ''W'' is the speed of transmission of intelligence, ''m'' is the number of different voltage levels to choose from at each time step, and ''K'' is a constant. [10410410] |[[Ralph Hartley]]'s 1928 paper, ''Transmission of Information,'' uses the word ''information'' as a measurable quantity, reflecting the receiver's ability to distinguish that one sequence of symbols from any other, thus quantifying information as H = \log S^n = n \log S, where ''S'' was the number of possible symbols, and ''n'' the number of symbols in a transmission. [10410420] |The natural unit of information was therefore the decimal digit, much later renamed the [[ban (information)|hartley]] in his honour as a unit or scale or measure of information. [10410430] |[[Alan Turing]] in 1940 used similar ideas as part of the statistical analysis of the breaking of the German second world war [[Cryptanalysis of the Enigma|Enigma]] ciphers. [10410440] |Much of the mathematics behind information theory with events of different probabilities was developed for the field of [[thermodynamics]] by [[Ludwig Boltzmann]] and [[J. Willard Gibbs]]. [10410450] |Connections between information-theoretic entropy and thermodynamic entropy, including the important contributions by [[Rolf Landauer]] in the 1960s, are explored in ''[[Entropy in thermodynamics and information theory]]''. [10410460] |In Shannon's revolutionary and groundbreaking paper, the work for which had been substantially completed at Bell Labs by the end of 1944, Shannon for the first time introduced the qualitative and quantitative model of communication as a statistical process underlying information theory, opening with the assertion that [10410470] |:"The fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point." [10410480] |With it came the ideas of [10410490] |* the [[information entropy]] and [[redundancy (information theory)|redundancy]] of a source, and its relevance through the [[source coding theorem]]; [10410500] |* the [[mutual information]], and the [[channel capacity]] of a noisy channel, including the promise of perfect loss-free communication given by the [[noisy-channel coding theorem]]; [10410510] |* the practical result of the [[Shannon–Hartley law]] for the channel capacity of a Gaussian channel; and of course [10410520] |* the [[bit]]—a new way of seeing the most fundamental unit of information [10410530] |==Ways of measuring information== [10410540] |Information theory is based on [[probability theory]] and [[statistics]]. [10410550] |The most important quantities of information are [[Information entropy|entropy]], the information in a [[random variable]], and [[mutual information]], the amount of information in common between two random variables. [10410560] |The former quantity indicates how easily message data can be [[data compression|compressed]] while the latter can be used to find the communication rate across a [[Channel (communications)|channel]]. [10410570] |The choice of logarithmic base in the following formulae determines the [[units of measurement|unit]] of [[information entropy]] that is used. 
[10410580] |The most common unit of information is the [[bit]], based on the [[binary logarithm]]. [10410590] |Other units include the [[nat (information)|nat]], which is based on the [[natural logarithm]], and the [[deciban|hartley]], which is based on the [[common logarithm]]. [10410600] |In what follows, an expression of the form p \log p \, is considered by convention to be equal to zero whenever p=0. [10410605] |This is justified because \lim_{p \rightarrow 0+} p \log p = 0 for any logarithmic base. [10410610] |===Entropy=== [10410620] |The '''[[information entropy|entropy]]''', H, of a discrete random variable X is a measure of the amount of ''uncertainty'' associated with the value of X. [10410630] |Suppose one transmits 1000 bits (0s and 1s). [10410640] |If these bits are known ahead of transmission (to be a certain value with absolute probability), logic dictates that no information has been transmitted. [10410650] |If, however, each is equally and independently likely to be 0 or 1, 1000 bits (in the information theoretic sense) have been transmitted. [10410660] |Between these two extremes, information can be quantified as follows. [10410670] |If \mathbb{X}\, is the set of all messages x that X could be, and p(x) is the probability of X given x, then the entropy of X is defined: [10410680] |: H(X) = \mathbb{E}_{X} [I(x)] = -\sum_{x \in \mathbb{X}} p(x) \log p(x). [10410690] |(Here, I(x) is the [[self-information]], which is the entropy contribution of an individual message.) [10410700] |An important property of entropy is that it is maximized when all the messages in the message space are equiprobable—i.e., most unpredictable—in which case H(X) = \log |\mathbb{X}|. [10410710] |The special case of information entropy for a random variable with two outcomes is the '''[[binary entropy function]]''': [10410720] |:H_\mbox{b}(p) = - p \log p - (1-p)\log (1-p).\, [10410730] |===Joint entropy=== [10410740] |The '''[[joint entropy]]''' of two discrete random variables X and Y is merely the entropy of their pairing: (X, Y). [10410750] |This implies that if X and Y are [[statistical independence|independent]], then their joint entropy is the sum of their individual entropies. [10410760] |For example, if (X,Y) represents the position of a [[chess]] piece — X the row and Y the column, then the joint entropy of the row of the piece and the column of the piece will be the entropy of the position of the piece. [10410770] |:H(X, Y) = \mathbb{E}_{X,Y} [-\log p(x,y)] = - \sum_{x, y} p(x, y) \log p(x, y) \, [10410780] |Despite similar notation, joint entropy should not be confused with '''[[cross entropy]]'''. [10410790] |===Conditional entropy (equivocation)=== [10410800] |The '''[[conditional entropy]]''' or '''conditional uncertainty''' of X given random variable Y (also called the '''equivocation''' of X about Y) is the average conditional entropy over Y: [10410810] |: H(X|Y) = \mathbb E_Y [H(X|y)] = -\sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log p(x|y) = -\sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(y)}. [10410820] |Because entropy can be conditioned on a random variable or on that random variable being a certain value, care should be taken not to confuse these two definitions of conditional entropy, the former of which is in more common use. 
[10410830] |A basic property of this form of conditional entropy is that: [10410840] |: H(X|Y) = H(X,Y) - H(Y) .\, [10410850] |===Mutual information (transinformation)=== [10410860] |'''[[Mutual information]]''' measures the amount of information that can be obtained about one random variable by observing another. [10410870] |It is important in communication where it can be used to maximize the amount of information shared between sent and received signals. [10410880] |The mutual information of X relative to Y is given by: [10410890] |:I(X;Y) = \mathbb{E}_{X,Y} [SI(x,y)] = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\, p(y)} [10410900] |where SI (''S''pecific mutual ''I''nformation) is the [[pointwise mutual information]]. [10410910] |A basic property of the mutual information is that [10410920] |: I(X;Y) = H(X) - H(X|Y).\, [10410930] |That is, knowing ''Y'', we can save an average of I(X; Y) bits in encoding ''X'' compared to not knowing ''Y''. [10410940] |Mutual information is [[symmetric function|symmetric]]: [10410950] |: I(X;Y) = I(Y;X) = H(X) + H(Y) - H(X,Y).\, [10410960] |Mutual information can be expressed as the average [[Kullback–Leibler divergence]] (information gain) of the [[posterior probability|posterior probability distribution]] of ''X'' given the value of ''Y'' from the [[prior probability|prior distribution]] on ''X'': [10410970] |: I(X;Y) = \mathbb E_{p(y)} [D_{\mathrm{KL}}( p(X|Y=y) \| p(X) )]. [10410980] |In other words, this is a measure of how much, on the average, the probability distribution on ''X'' will change if we are given the value of ''Y''. [10410990] |This is often recalculated as the divergence from the product of the marginal distributions to the actual joint distribution: [10411000] |: I(X; Y) = D_{\mathrm{KL}}(p(X,Y) \| p(X)p(Y)). [10411010] |Mutual information is closely related to the [[likelihood-ratio test|log-likelihood ratio test]] in the context of contingency tables and the [[multinomial distribution]] and to [[Pearson's chi-square test|Pearson's χ2 test]]: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution. [10411020] |===Kullback–Leibler divergence (information gain)=== [10411030] |The '''[[Kullback–Leibler divergence]]''' (or '''information divergence''', '''information gain''', or '''relative entropy''') is a way of comparing two distributions: a "true" [[probability distribution]] ''p(X)'', and an arbitrary probability distribution ''q(X)''. [10411040] |If we compress data in a manner that assumes ''q(X)'' is the distribution underlying some data, when, in reality, ''p(X)'' is the correct distribution, the Kullback–Leibler divergence is the average number of additional bits per datum necessary for compression. [10411050] |It is thus defined as [10411060] |:D_{\mathrm{KL}}(p(X) \| q(X)) = \sum_{x \in X} -p(x) \log {q(x)} \, - \, \left( -p(x) \log {p(x)}\right) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}. [10411070] |Although it is sometimes used as a 'distance metric', it is not a true [[Metric (mathematics)|metric]] since it is not symmetric and does not satisfy the [[triangle inequality]] (making it a semi-quasimetric). [10411080] |===Other quantities=== [10411090] |Other important information theoretic quantities include [[Rényi entropy]] (a generalization of entropy) and [[differential entropy]] (a generalization of quantities of information to continuous distributions).
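The quantities defined in the preceding sections can be checked numerically on a small discrete example. The following Python sketch is illustrative only and is not part of the original article: the function names and the toy joint distribution are invented here, and only the standard library is used. It computes entropy, joint entropy, conditional entropy, mutual information, and Kullback–Leibler divergence in bits, and confirms the identities I(X;Y) = H(X) - H(X|Y) = D_KL(p(X,Y) || p(X)p(Y)).

<syntaxhighlight lang="python">
# Minimal sketch (not from the article): information-theoretic quantities for a
# small joint distribution p(x, y), given as a dict mapping (x, y) -> probability.
from math import log2

def entropy(p):
    """Shannon entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(px * log2(px) for px in p.values() if px > 0)

def marginals(pxy):
    """Marginal distributions p(x) and p(y) of a joint distribution {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py

def kl_divergence(p, q):
    """D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(p[x] * log2(p[x] / q[x]) for x in p if p[x] > 0)

# A toy joint distribution of two correlated binary variables (invented example).
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px, py = marginals(pxy)

h_x, h_y, h_xy = entropy(px), entropy(py), entropy(pxy)
h_x_given_y = h_xy - h_y                       # H(X|Y) = H(X,Y) - H(Y)
product = {(x, y): px[x] * py[y] for (x, y) in pxy}
i_xy = kl_divergence(pxy, product)             # I(X;Y) = D_KL(p(X,Y) || p(X)p(Y))

print(f"H(X) = {h_x:.3f} bits, H(Y) = {h_y:.3f} bits, H(X,Y) = {h_xy:.3f} bits")
print(f"H(X|Y) = {h_x_given_y:.3f} bits")
print(f"I(X;Y) = {i_xy:.3f} bits = H(X) - H(X|Y) = {h_x - h_x_given_y:.3f} bits")
</syntaxhighlight>

For the toy distribution used here, both routes to the mutual information give the same value, roughly 0.28 bits.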
[10411100] |==Coding theory== [10411110] |[[Coding theory]] is one of the most important and direct applications of information theory. [10411120] |It can be subdivided into [[data compression|source coding]] theory and [[error correction|channel coding]] theory. [10411130] |Using a statistical description for data, information theory quantifies the number of bits needed to describe the data, which is the information entropy of the source. [10411140] |* Data compression (source coding): There are two formulations for the compression problem: [10411150] |#[[lossless data compression]]: the data must be reconstructed exactly; [10411160] |#[[lossy data compression]]: allocates bits needed to reconstruct the data, within a specified fidelity level measured by a distortion function. [10411170] |This subset of information theory is called [[rate–distortion theory]]. [10411180] |* Error-correcting codes (channel coding): While data compression removes as much [[redundancy (information theory)|redundancy]] as possible, an error-correcting code adds just the right kind of redundancy (i.e. [[error correction]] capability) needed to transmit the data efficiently and faithfully across a noisy channel. [10411190] |This division of coding theory into compression and transmission is justified by the information transmission theorems, or source–channel separation theorems, which justify the use of bits as the universal currency for information in many contexts. [10411200] |However, these theorems only hold in the situation where one transmitting user wishes to communicate to one receiving user. [10411210] |In scenarios with more than one transmitter (the multiple-access channel), more than one receiver (the [[broadcast channel]]) or intermediary "helpers" (the [[relay channel]]), or more general [[computer network|networks]], compression followed by transmission may no longer be optimal. [10411220] |[[Network information theory]] refers to these multi-agent communication models. [10411230] |===Source theory=== [10411240] |Any process that generates successive messages can be considered a '''[[Communication source|source]]''' of information. [10411250] |A memoryless source is one in which each message is an [[Independent identically-distributed random variables|independent identically-distributed random variable]], whereas the properties of [[ergodic theory|ergodicity]] and [[stationary process|stationarity]] impose more general constraints. [10411260] |All such sources are [[stochastic process|stochastic]]. [10411270] |These terms are well studied in their own right outside information theory. [10411280] |====Rate==== [10411290] |Information [[Entropy rate|'''rate''']] is the average entropy per symbol. [10411300] |For memoryless sources, this is merely the entropy of each symbol, while, in the case of a stationary stochastic process, it is [10411310] |:r = \lim_{n \to \infty} H(X_n|X_{n-1},X_{n-2},X_{n-3}, \ldots); [10411320] |that is, the conditional entropy of a symbol given all the previous symbols generated. [10411330] |For the more general case of a process that is not necessarily stationary, the ''average rate'' is [10411340] |:r = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \dots X_n); [10411350] |that is, the limit of the joint entropy per symbol. [10411360] |For stationary sources, these two expressions give the same result. [10411370] |It is common in information theory to speak of the "rate" or "entropy" of a language.
[10411380] |This is appropriate, for example, when the source of information is English prose. [10411390] |The rate of a source of information is related to its [[redundancy (information theory)|redundancy]] and how well it can be [[data compression|compressed]], the subject of '''source coding'''. [10411400] |===Channel capacity=== [10411410] |Communication over a channel—such as an [[ethernet]] wire—is the primary motivation of information theory. [10411420] |As anyone who has ever used a telephone (mobile or landline) knows, however, such channels often fail to produce exact reconstruction of a signal; noise, periods of silence, and other forms of signal corruption often degrade quality. [10411430] |How much information can one hope to communicate over a noisy (or otherwise imperfect) channel? [10411440] |Consider the communications process over a discrete channel. [10411450] |A simple model of the process is as follows. [10411460] |Here ''X'' represents the space of messages transmitted, and ''Y'' the space of messages received during a unit time over our channel. [10411470] |Let p(y|x) be the [[conditional probability]] distribution function of ''Y'' given ''X''. [10411480] |We will consider p(y|x) to be an inherent fixed property of our communications channel (representing the nature of the '''[[Signal noise|noise]]''' of our channel). [10411490] |Then the joint distribution of ''X'' and ''Y'' is completely determined by our channel and by our choice of f(x), the marginal distribution of messages we choose to send over the channel. [10411500] |Under these constraints, we would like to maximize the rate of information, or the '''[[Signal (electrical engineering)|signal]]''', we can communicate over the channel. [10411510] |The appropriate measure for this is the [[mutual information]], and this maximum mutual information is called the '''[[channel capacity]]''' and is given by: [10411520] |: C = \max_{f} I(X;Y).\! [10411530] |This capacity has the following property related to communicating at information rate ''R'' (where ''R'' is usually bits per symbol). [10411540] |For any information rate ''R < C'' and coding error ε > 0, for large enough ''N'', there exists a code of length ''N'' and rate ≥ R and a decoding algorithm, such that the maximal probability of block error is ≤ ε; that is, it is always possible to transmit with arbitrarily small block error. [10411550] |In addition, for any rate ''R > C'', it is impossible to transmit with arbitrarily small block error. [10411560] |'''[[Channel code|Channel coding]]''' is concerned with finding such nearly optimal [[error detection and correction|codes]] that can be used to transmit data over a noisy channel with a small coding error at a rate near the channel capacity. [10411570] |====Channel capacity of particular model channels==== [10411580] |* A continuous-time analog communications channel subject to Gaussian noise — see [[Shannon–Hartley theorem]]. [10411590] |* A [[binary symmetric channel]] (BSC) with crossover probability ''p'' is a binary input, binary output channel that flips the input bit with probability ''p''. [10411600] |The BSC has a capacity of 1 - H_\mbox{b}(p) bits per channel use, where H_\mbox{b} is the [[binary entropy function]]: [10411610] |:: H_\mbox{b}(p) = - p \log p - (1-p)\log (1-p).\, [10411620] |* A binary erasure channel (BEC) with erasure probability ''p'' is a binary input, ternary output channel. [10411630] |The possible channel outputs are ''0'', ''1'', and a third symbol 'e' called an erasure.
[10411640] |The erasure represents complete loss of information about an input bit. [10411650] |The capacity of the BEC is ''1 - p'' bits per channel use. [10411670] |==Applications to other fields== [10411680] |===Intelligence uses and secrecy applications=== [10411690] |Information theoretic concepts apply to [[cryptography]] and [[cryptanalysis]]. [10411700] |[[Turing]]'s information unit, the [[Ban (information)|ban]], was used in the [[Ultra]] project, breaking the German [[Enigma machine]] code and hastening the [[Victory in Europe Day|end of WWII in Europe]]. [10411710] |Shannon himself defined an important concept now called the [[unicity distance]]. [10411720] |Based on the [[redundancy (information theory)|redundancy]] of the [[plaintext]], it attempts to give a minimum amount of [[ciphertext]] necessary to ensure unique decipherability. [10411730] |Information theory leads us to believe it is much more difficult to keep secrets than it might first appear. [10411740] |A [[brute force attack]] can break systems based on [[public-key cryptography|asymmetric key algorithms]] or on most commonly used methods of [[symmetric-key algorithm|symmetric key algorithms]] (sometimes called secret key algorithms), such as [[block cipher]]s. [10411750] |The security of all such methods currently comes from the assumption that no known attack can break them in a practical amount of time. [10411760] |[[Information theoretic security]] refers to methods such as the [[one-time pad]] that are not vulnerable to such brute force attacks. [10411770] |In such cases, the positive conditional [[mutual information]] between the [[plaintext]] and [[ciphertext]] (conditioned on the [[key (cryptography)| key]]) can ensure proper transmission, while the unconditional mutual information between the plaintext and ciphertext remains zero, resulting in absolutely secure communications. [10411780] |In other words, an eavesdropper would not be able to improve his or her guess of the plaintext by gaining knowledge of the ciphertext but not of the key. [10411790] |However, as in any other cryptographic system, care must be used to correctly apply even information-theoretically secure methods; the [[Venona project]] was able to crack the one-time pads of the [[Soviet Union]] due to their improper reuse of key material. [10411800] |===Pseudorandom number generation=== [10411810] |[[Pseudorandom number generator]]s are widely available in computer language libraries and application programs. [10411820] |They are, almost universally, unsuited to cryptographic use as they do not evade the deterministic nature of modern computer equipment and software. [10411830] |A class of improved random number generators is termed [[Cryptographically secure pseudorandom number generator]]s, but even they require [[random seed]]s external to the software to work as intended. [10411840] |These can be obtained via [[extractor]]s, if done carefully. [10411850] |The measure of sufficient randomness in extractors is [[min-entropy]], a value related to Shannon entropy through [[Rényi entropy]]; Rényi entropy is also used in evaluating randomness in cryptographic systems. [10411860] |Although related, the distinctions among these measures mean that a [[random variable]] with high Shannon entropy is not necessarily satisfactory for use in an extractor and so for cryptographic use.
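As a small, hedged illustration of that distinction (the distribution and function names below are invented for the example and are not from the article), the following Python sketch compares Shannon entropy, Rényi entropy of order 2, and min-entropy for a skewed 16-outcome distribution: the Shannon entropy is close to 3 bits, while the min-entropy, on which extractor guarantees depend, is only 1 bit.

<syntaxhighlight lang="python">
# Minimal sketch (not from the article): comparing entropy measures of one distribution.
from math import log2

def shannon_entropy(p):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(x * log2(x) for x in p if x > 0)

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (alpha != 1); alpha -> 1 recovers Shannon entropy."""
    return log2(sum(x ** alpha for x in p)) / (1 - alpha)

def min_entropy(p):
    """H_min = -log2(max p): determined entirely by the single most likely outcome."""
    return -log2(max(p))

# A skewed 16-outcome distribution: one outcome has probability 1/2,
# the remaining 15 outcomes share the other half uniformly.
p = [0.5] + [0.5 / 15] * 15

print(f"Shannon entropy : {shannon_entropy(p):.2f} bits")   # ~2.95 bits
print(f"Renyi (order 2) : {renyi_entropy(p, 2):.2f} bits")  # ~1.91 bits
print(f"Min-entropy     : {min_entropy(p):.2f} bits")       # 1.00 bit
</syntaxhighlight>

Although a uniform distribution over 16 outcomes would carry 4 bits under every one of these measures, the skewed example keeps a relatively high Shannon entropy while its min-entropy collapses to a single bit, which is why min-entropy, not Shannon entropy, is the relevant yardstick for extractors.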
[10411870] |===Miscellaneous applications=== [10411880] |Information theory also has applications in [[Gambling and information theory|gambling and investing]], [[black hole information paradox|black holes]], [[bioinformatics]], and [[music]]. [10420010] |
Italian language
[10420020] |'''Italian''' (''italiano'', or ''lingua italiana'') is a [[Romance languages|Romance language]] spoken as a [[first language]] by about 63 million people, primarily in [[Italy]]. [10420030] |In [[Switzerland]], Italian is one of four [[Linguistic geography of Switzerland|official language]]s. [10420040] |It is also the official language of [[San Marino]]. [10420050] |It is the primary language of the [[Vatican City]]. [10420060] |Standard Italian, adopted by the state after the [[unification of Italy]], is based on [[Tuscan dialect|Tuscan]] and is somewhat intermediate between [[Italo-Western|Italo-Dalmatian languages]] of the [[Mezzogiorno|South]] and [[Northern Italian dialects]] of the [[Northern Italy|North]]. [10420070] |Unlike most other Romance languages, Italian has retained the contrast between short and [[consonant length|long consonants]] which existed in Latin. [10420080] |As in most [[Romance languages]], [[stress (linguistics)|stress]] is distinctive. [10420090] |Of the Romance languages, Italian is considered to be one of the closest to [[Latin]] in terms of [[vocabulary]]. [10420100] |According to Ethnologue, lexical similarity is 89% with [[French language|French]], 87% with [[Catalan language|Catalan]], 85% with [[Sardinian language|Sardinian]], 82% with [[Spanish language|Spanish]], 78% with Rhaeto-Romance, and 77% with Romanian. [10420110] |It is affectionately called ''il parlar gentile'' (the gentle language) by its speakers. [10420120] |==Writing system== [10420130] |Italian is written using the [[Latin alphabet]]. [10420140] |The letters ''J'', ''K'', ''W'', ''X'' and ''Y'' are not considered part of the standard [[Italian alphabet]], but appear in loanwords (such as ''jeans'', ''whisky'', ''taxi''). [10420150] |''X'' has become a commonly used letter in genuine Italian words with the prefix ''extra-''. [10420160] |''J'' in Italian is an old-fashioned orthographic variant of ''I'', appearing in the first name "Jacopo" as well as in some Italian place names, e.g., the towns of [[Bajardo]], [[Bojano]], [[Joppolo]], [[Jesolo]], [[Jesi]], among numerous others, and in the alternate spelling ''Mar Jonio'' (also spelled ''Mar Ionio'') for the [[Ionian Sea]]. [10420170] |''J'' may also appear in many words from different dialects, but its use is discouraged in contemporary Italian, and it is not part of the standard 21-letter contemporary Italian alphabet. [10420180] |Each of these foreign letters has an Italian equivalent spelling: ''gi'' for ''j'', ''c'' or ''ch'' for ''k'', ''u'' or ''v'' for ''w'' (depending on what sound it makes), ''s'', ''ss'', or ''cs'' for ''x'', and ''i'' for ''y''. [10420190] |* Italian uses the [[acute accent]] over the letter ''E'' (as in ''perché'', why/because) to indicate a front mid-close vowel, and the [[grave accent]] (as in ''tè'', tea) to indicate a front mid-open vowel. [10420200] |The [[grave accent]] is also used on letters ''A'', ''I'', ''O'', and ''U'' to mark [[stress (linguistics)|stress]] when it falls on the final vowel of a word (for instance ''gioventù'', youth). [10420210] |Typically, the penultimate syllable is stressed. [10420220] |If syllables other than the last one are stressed, the accent is not mandatory, unlike in [[Spanish language|Spanish]], and, in virtually all cases, it is omitted. [10420230] |In some cases, when the word is ambiguous (as ''principi''), the accent mark is used in order to disambiguate its meaning (in this case, ''prìncipi'', princes, or ''princìpi'', principles).
[10420240] |This is, however, not compulsory. [10420250] |Rare words with three or more syllables can confuse Italians themselves, and the pronunciation of [[Istanbul]] is a common example of a word in which placement of stress is not clearly established. [10420260] |Turkish, like French, tends to put the accent on the last syllable, but Italian does not. [10420270] |As a result, one may hear both "Istànbul" and "Ìstanbul". [10420280] |Another instance is the American state of [[Florida]]: the correct way to pronounce it in Italian is as in Spanish, "Florìda", but since there is an Italian word with the same spelling, "flòrida" (meaning "flourishing"), and because of the influence of English, most Italians pronounce it "Flòrida". [10420290] |Dictionaries give the latter as an alternative pronunciation. [10420300] |* The letter ''H'' at the beginning of a word is used to distinguish ''ho'', ''hai'', ''ha'', ''hanno'' (present indicative of ''avere'', 'to have') from ''o'' ('or'), ''ai'' ('to the'), ''a'' ('to'), ''anno'' ('year'). [10420310] |In the spoken language this letter is always silent for the cases given above. [10420320] |''H'' is also used in combinations with other letters (see below), but no [[phoneme]] {{IPA|[h]}} exists in Italian. [10420330] |In foreign words that have entered common use, like "hotel" or "hovercraft", the H is commonly silent, so they are pronounced as {{IPA|/oˈtɛl/}} and {{IPA|/ˈɔverkraft/}}. [10420340] |* The letter ''Z'' represents {{IPA|/ʣ/}}, for example: ''zanzara'' {{IPA|/dzan'dzaɾa/}} (mosquito), or {{IPA|/ʦ/}}, for example: ''nazione'' {{IPA|/naˈttsjone/}} (nation), depending on context, though there are few [[minimal pair]]s. [10420350] |The same goes for ''S'', which can represent {{IPA|/s/}} or {{IPA|/z/}}. [10420360] |However, these two phonemes are in [[complementary distribution]] everywhere except between two vowels in the same word, and even in that environment there are extremely few minimal pairs, so that this distinction is being lost in many varieties. [10420370] |* The letters ''C'' and ''G'' represent [[affricate]]s: [[Voiceless postalveolar affricate|{{IPA|/ʧ/}}]] as in "chair" and [[Voiced postalveolar affricate|{{IPA|/ʤ/}}]] as in "gem", respectively, before the [[front vowel]]s ''I'' and ''E''. [10420380] |They are pronounced as [[plosive]]s {{IPA|/k/}}, {{IPA|/g/}} (as in "call" and "gall") otherwise. [10420390] |Front/back vowel rules for ''C'' and ''G'' are similar in [[French language|French]], [[Romanian language|Romanian]], [[Spanish language|Spanish]], and to some extent [[English language|English]] (including [[Old English]]). [10420400] |[[swedish language|Swedish]] and [[Norwegian language|Norwegian]] have similar rules for ''K'' and ''G''. [10420410] |(See also [[palatalization]].) [10420420] |* However, an ''H'' can be added between ''C'' or ''G'' and ''E'' or ''I'' to represent a plosive, and an ''I'' can be added between ''C'' or ''G'' and ''A'', ''O'' or ''U'' to signal that the consonant is an affricate. [10420440] |:Note that the ''H'' is [[silent letter|silent]] in the digraphs ''[[ch (digraph)|CH]]'' and ''[[gh (digraph)|GH]]''; likewise, the ''I'' in ''cia'', ''cio'', ''ciu'' and even ''cie'' is not pronounced as a separate vowel, unless it carries the primary stress. [10420450] |For example, it is silent in ''[[ciao]]'' {{IPA|/ˈʧa.o/}} and ''cielo'' {{IPA|/ˈʧɛ.lo/}}, but it is pronounced in ''farmacia'' {{IPA|/ˌfaɾ.ma.ˈʧi.a/}} and ''farmacie'' {{IPA|/ˌfaɾ.ma.ˈʧi.e/}}.
[10420460] |* There are three other special [[digraph (orthography)|digraphs]] in Italian: ''[[gn (digraph)|GN]]'', ''GL'' and ''SC''. [10420470] |''GN'' represents [[Palatal nasal|{{IPA|/ɲ/}}]]. [10420480] |''GL'' represents [[Palatal lateral approximant|{{IPA|/ʎ/}}]] only before ''i'', and never at the beginning of a word, except in the [[personal pronoun]] and [[definite article]] ''gli''. [10420490] |(Compare with [[Spanish language|Spanish]] ''ñ'' and ''ll'', [[Portuguese language|Portuguese]] ''nh'' and ''lh''.) [10420500] |''SC'' represents fricative [[Voiceless postalveolar fricative|{{IPA|/ʃ/}}]] before ''i'' or ''e''. [10420510] |Except in the speech of some Northern Italians, all of these are normally [[geminate]] between vowels. [10420520] |* In general, all letters or digraphs represent phonemes rather clearly, and, in standard varieties of Italian, there is little allophonic variation. [10420530] |The most notable exceptions are assimilation of /n/ in point of articulation before consonants, assimilatory voicing of /s/ to following voiced consonants, and vowel length (vowels are long in stressed open syllables, and short elsewhere) — compare with the enormous number of [[allophone]]s of the English phoneme /t/. [10420540] |Spelling is clearly phonemic and difficult to mistake given a clear pronunciation. [10420550] |Exceptions are generally only found in foreign borrowings. [10420560] |There are fewer cases of [[dyslexia]] than among speakers of languages such as English, and the concept of a spelling bee is strange to Italians. [10420570] |==History== [10420580] |The history of the Italian language is long, but the modern standard of the language was largely shaped by relatively recent events. [10420590] |The earliest surviving texts which can definitely be called Italian (or more accurately, vernacular, as opposed to its predecessor [[Vulgar Latin]]) are legal formulae from the region of [[province of Benevento|Benevento]] dating from 960–963. [10420600] |What would come to be thought of as Italian was first formalized in the first years of the 14th century through the works of [[Dante Alighieri]], who mixed southern Italian languages, especially [[Sicilian language|Sicilian]], with his native Tuscan in his epic poems known collectively as the ''[[Divine Comedy|Commedia]],'' to which [[Giovanni Boccaccio]] later affixed the title ''Divina''. [10420610] |Dante's much-loved works were read throughout Italy and his written dialect became the "canonical standard" that all educated Italians could understand. [10420620] |Dante is still credited with standardizing the Italian language and, thus, the dialect of [[Tuscany]] became the basis for what would become the official language of Italy. [10420630] |Italy has always had a distinctive dialect for each city, since the cities were until recently thought of as [[city-state]]s. [10420640] |That official language now has considerable [[variety (linguistics)|variety]], however. [10420650] |As Tuscan-derived Italian came to be used throughout the nation, features of local speech were naturally adopted, producing various versions of Regional Italian. [10420660] |The most characteristic differences, for instance, between [[Romanesco|Roman Italian]] and [[Milanese|Milanese Italian]] are the [[consonant length|gemination]] of initial consonants and the pronunciation of stressed "e", and of "s" in some cases (e.g.
''va bene'' "all right": is pronounced {{IPA|[va ˈbːɛne]}} by a Roman, {{IPA|[va ˈbene]}} by a Milanese; ''a casa'' "at home": Roman {{IPA|[a ˈkːasa]}}, Milanese {{IPA|[a ˈkaza]}}). [10420670] |In contrast to the [[Northern Italian language|dialects of northern Italy]], [[southern Italian]] dialects were largely untouched by the Franco-[[Occitan language|Occitan]] influences introduced to Italy, mainly by [[bard]]s from [[France]], during the [[Middle Ages]]. [10420680] |Even in the case of Northern Italian dialects, however, scholars are careful not to overstate the effects of outsiders on the natural indigenous developments of the languages. [10420690] |(See [[La Spezia-Rimini Line]].) [10420700] |The economic might and relatively advanced development of [[Tuscany]] at the time (the [[Late Middle Ages]]) gave its dialect weight, though Venetian remained widespread in medieval Italian commercial life. [10420710] |Also, the increasing cultural relevance of [[Florence, Italy|Florence]] during the periods of '[[Humanism|Umanesimo (Humanism)]]' and the [[Renaissance|Rinascimento (Renaissance)]] made its ''volgare'' (dialect), or rather a refined version of it, a standard in the arts. [10420720] |The re-discovery of Dante's ''[[De vulgari eloquentia]]'' and a renewed interest in linguistics in the 16th century sparked a debate which raged throughout Italy concerning which criteria should be chosen to establish a modern Italian standard to be used as much a literary as a spoken language. [10420730] |Scholars were divided into three factions: the [[purism|purists]], headed by [[Pietro Bembo]], who in his ''[[Gli Asolani]]'' claimed that the language could only be based on the great literary classics (notably [[Petrarch]] and Boccaccio, but not Dante, as Bembo believed that the Divine Comedy was not dignified enough because it used elements from other dialects); [[Niccolò Machiavelli]] and other [[Florence|Florentine]]s, who preferred the version spoken by ordinary people in their own times; and the [[courtier]]s, like [[Baldassarre Castiglione]] and [[Gian Giorgio Trissino]], who insisted that each local vernacular should contribute to the new standard. [10420740] |Eventually Bembo's ideas prevailed, the result being the publication of the first Italian dictionary in 1612 and the foundation of the [[Accademia della Crusca]] in Florence (1582–3), the official legislative body of the Italian language. [10420750] |Italian literature's first modern novel, [[The Betrothed|''I Promessi Sposi'']] (The Betrothed), by [[Alessandro Manzoni]], further defined the standard by "rinsing" his Milanese "in the waters of the [[Arno River|Arno]]" ([[Florence]]'s river), as he states in the preface to his 1840 edition. [10420760] |After unification, a huge number of civil servants and soldiers recruited from all over the country introduced many more words and idioms from their home dialects ("[[ciao]]" is [[Venetian language|Venetian]], "[[panettone]]" is [[Milanese]], etc.). [10420770] |==Classification== [10420780] |Italian is most closely related to the other two Italo-Dalmatian languages, [[Sicilian language|Sicilian]] and the extinct [[Dalmatian language|Dalmatian]]. [10420790] |The three are part of the [[Italo-Western languages|Italo-Western]] grouping of the [[Romance languages]], which are a subgroup of the [[Italic languages|Italic]] branch of [[Indo-European language family|Indo-European]].
[10420800] |==Geographic distribution== [10420810] |The total number of speakers of Italian as a mother tongue is between 60 and 70 million. [10420820] |The number of speakers who use Italian as a second or cultural language is estimated at around 110–120 million. [10420830] |Italian is the official language of [[Italy]] and [[San Marino]], and one of the official languages of [[Switzerland]], spoken mainly in the [[Canton Ticino|Ticino]] and [[Graubünden|Grigioni]] cantons, a region referred to as [[Italian Switzerland]]. [10420840] |It is also the second official language in some areas of [[Istria]], in [[Slovenia]] and [[Croatia]], where an Italian minority exists. [10420850] |It is the primary language of the [[Vatican City]] and is widely used and taught in [[Monaco]] and [[Malta]]. [10420860] |It is also widely understood in France, with over one million speakers (especially in [[Corsica]] and the [[County of Nice]], areas that historically spoke [[Italian dialects]] before annexation to [[France]]), and in [[Albania]]. [10420870] |Italian is also spoken by some in former Italian colonies in [[Africa]] ([[Libya]], [[Somalia]] and [[Eritrea]]). [10420880] |However, its use has sharply dropped off since the colonial period. [10420890] |In [[Eritrea]], [[Italian Language|Italian]] is widely understood. [10420900] |In fact, for fifty years, during the colonial period, Italian was the language of instruction, but [[as of 1997]], there is only one Italian-language school remaining, with 470 pupils. [10420910] |In [[Somalia]], Italian used to be a major language, but due to the civil war and a lack of education, only the older generation still uses it. [10420920] |Italian and [[Italian dialects]] are widely used by Italian immigrants and many of their descendants (see ''[[Italians]]'') living throughout [[Western Europe]] (especially [[France]], [[Germany]], [[Belgium]], [[Switzerland]], the [[Britalian|United Kingdom]] and [[Luxembourg]]), the [[Italian Americans|United States]], [[Italian Canadians|Canada]], [[Italian Australians|Australia]], and [[Latin America]] (especially [[Uruguay]], [[Italian Brazilians|Brazil]], [[Argentina]], and [[Venezuela]]). [10420930] |In the United States, Italian speakers are most commonly found in four cities: [[Boston]] (7,000), [[Chicago]] (12,000), [[New York City]] (140,000), and [[Philadelphia]] (15,000). [10420940] |In Canada there are large Italian-speaking communities in [[Montreal]] (120,000) and [[Toronto]] (195,000). [10420950] |Italian is the second most commonly spoken language in Australia, where 353,605 [[Italian Australian]]s, or 1.9% of the population, reported speaking Italian at home in the 2001 [[Census in Australia|Census]]. [10420960] |In 2001 there were 130,000 Italian speakers in [[Melbourne]], and 90,000 in [[Sydney]]. [10420970] |===Italian language education=== [10420980] |Italian is widely taught in many schools around the world, but rarely as the first non-native language of pupils; in fact, Italian is generally the fourth or fifth most taught second language in the world. [10420990] |In [[anglophone]] parts of [[Canada]], Italian is, after [[French language|French]], the third most taught language. [10421000] |In [[francophone]] Canada it is third after [[English language|English]]. [10421010] |In the [[United States]] and the [[United Kingdom]], Italian ranks fourth (after [[Spanish language|Spanish]], French and [[German language|German]] in the former, and after French, German and Spanish in the latter).
[10421020] |Throughout the world, Italian is the fifth most taught non-native language, after [[English language|English]], French, Spanish, and German. [10421030] |In the [[European Union]], Italian is spoken as a mother tongue by 13% of the population (64 million, mainly in Italy itself) and as a second language by 3% (14 million); among EU member states, it is most likely to be desired (and therefore learned) as a second language in [[Malta]] (61%), [[Croatia]] (14%), [[Slovenia]] (12%), [[Austria]] (11%), [[Romania]] (8%), [[France]] (6%), and [[Greece]] (6%). [10421040] |It is also an important second language in [[Albania]] and [[Switzerland]], which are not EU members or candidates. [10421050] |===Influence and derived languages=== [10421060] |From the late 19th to the mid 20th century, thousands of Italians settled in Argentina, Uruguay and southern Brazil, where they formed a very strong physical and cultural presence (see the [[Italian diaspora]]). [10421070] |In some cases, colonies were established where variants of [[Italian dialects]] were used, and some continue to use a derived dialect. [10421080] |Examples are [[Rio Grande do Sul]], [[Brazil]], where [[Talian]] is used, and the town of [[Chipilo]] near Puebla, [[Mexico]], each of which continues to use a derived form of [[Venetian language|Venetian]] dating back to the 19th century. [10421090] |Other examples are [[Cocoliche]], an Italian-Spanish [[pidgin]] once spoken in [[Argentina]], especially in [[Buenos Aires]], and [[Lunfardo]]. [10421100] |[[Rioplatense Spanish]], and particularly the speech of the city of Buenos Aires, has intonation patterns that resemble those of Italian dialects, because Argentina had a constant, large influx of Italian settlers from the second half of the nineteenth century onward: initially primarily from Northern Italy and then, from the beginning of the twentieth century, mostly from Southern Italy. [10421110] |===Lingua Franca=== [10421120] |Starting in late [[medieval]] times, Italian language variants (especially the Tuscan and Venetian variants) replaced Latin to become the primary commercial language for much of Europe and the Mediterranean Sea. [10421130] |This became solidified during the [[Renaissance]] with the strength of Italian banking and the rise of [[Renaissance humanism|humanism]] in the arts. [10421140] |During the period of the Renaissance, Italy held artistic sway over the rest of Europe. [10421150] |All educated European gentlemen were expected to make the [[Grand Tour]], visiting Italy to see its great historical monuments and works of art. [10421160] |It thus became expected that educated Europeans would learn at least some Italian; the English poet [[John Milton]], for instance, wrote some of his early poetry in Italian. [10421170] |In England, Italian became the second most common modern language to be learned, after [[French language|French]] (though the classical languages, [[Latin]] and [[Greek language|Greek]], came first). [10421180] |However, by the late eighteenth century, Italian tended to be replaced by [[German language|German]] as the second modern language on the curriculum. [10421190] |Yet Italian [[loanword]]s continue to be used in most other [[European languages]] in matters of art and music. [10421200] |Today, the Italian language continues to be used as a [[lingua franca]] in some environments.
[10421210] |Within the [[Catholic church]], Italian is known by a large part of the ecclesiastical hierarchy and is used in place of [[Latin]] in some official documents. [10421220] |The presence of Italian as the primary language in the [[Vatican City]] indicates use not only within the [[Holy See]], but also throughout the world wherever an episcopal seat is present. [10421230] |It continues to be used in [[music]] and [[opera]]. [10421240] |Other contexts where Italian is sometimes used as a means of communication are some sports (occasionally in [[Football (association)|football]] and [[motorsports]]) and the [[design]] and [[fashion]] industries. [10421250] |==Dialects== [10421260] |In Italy, all [[Romance languages]] spoken as the vernacular, other than standard Italian and unrelated, non-Italian languages, are termed "Italian dialects". [10421270] |Many Italian dialects are, in fact, historical languages in their own right. [10421280] |These include recognized language groups such as [[Friulian language|Friulian]], [[Neapolitan language|Neapolitan]], [[Sardinian language|Sardinian]], [[Sicilian language|Sicilian]], [[Venetian language|Venetian]], and others, and regional variants of these languages such as [[Calabrian languages|Calabrian]]. [10421290] |The division between dialect and language has been used by scholars (such as [[Francesco Bruni]]) to distinguish between the languages that made up the Italian [[koine]], and those which had very little or no part in it, such as [[Albanian language|Albanian]], [[Greek language|Greek]], [[German language|German]], [[Ladin language|Ladin]], and [[Occitan language|Occitan]], which are still spoken by minorities. [10421300] |Dialects are generally not used for general mass communication and are usually limited to native speakers in informal contexts. [10421310] |In the past, speaking in dialect was often deprecated as a sign of poor education. [10421320] |Younger generations, especially those under 35 (though it may vary in different areas), speak almost exclusively standard Italian in all situations, usually with local accents and idioms. [10421330] |Regional differences can be recognized by various factors: the openness of vowels, the length of the consonants, and influence of the local dialect (for example, ''annà'' replaces ''andare'' in the area of Rome for the infinitive "to go"). [10421340] |==Sounds== [10421350] |{{IPA notice|lang=it}} [10421360] |===Vowels=== [10421370] |Italian has seven [[vowel]] phonemes: {{IPA|/a/}}, {{IPA|/e/}}, {{IPA|/ɛ/}}, {{IPA|/i/}}, {{IPA|/o/}}, {{IPA|/ɔ/}}, {{IPA|/u/}}. [10421380] |The pairs {{IPA|/e/}}-{{IPA|/ɛ/}} and {{IPA|/o/}}-{{IPA|/ɔ/}} are seldom distinguished in writing and often confused, even though most varieties of Italian employ both phonemes consistently. [10421390] |Compare, for example: "perché" {{IPA|[perˈkɛ]}} (why, because) and "senti" {{IPA|[ˈsenti]}} (you listen, you are listening, listen!), employed by some northern speakers, with {{IPA|[perˈke]}} and {{IPA|[ˈsɛnti]}}, as pronounced by most central and southern speakers. [10421400] |As a result, the usage is strongly indicative of a person's origin. [10421410] |The standard (Tuscan) usage of these vowels is listed in dictionaries, and is employed outside Tuscany mainly by specialists, especially actors and very few (television) journalists. [10421420] |These are truly different [[phonemes]], however: compare {{IPA|/ˈpeska/}} (fishing) and {{IPA|/ˈpɛska/}} (peach), both spelled ''pesca''.
[10421430] |Similarly, {{IPA|/ˈbotte/}} ('barrel') and {{IPA|/ˈbɔtte/}} ('beatings'), both spelled ''botte'', distinguish {{IPA|/o/}} and {{IPA|/ɔ/}}. [10421440] |In general, each vowel in a combination of vowels is usually pronounced separately. [10421450] |[[Diphthong]]s exist (e.g. ''uo'', ''iu'', ''ie'', ''ai''), but are limited to an unstressed ''u'' or ''i'' before or after a stressed vowel. [10421460] |The unstressed ''u'' in a diphthong approximates the English semivowel ''w'', the unstressed ''i'' approximates the semivowel ''y''. [10421470] |E.g.: ''buono'' {{IPA|[ˈbwɔno]}}, ''ieri'' {{IPA|[ˈjɛri]}}. [10421480] |[[Triphthong]]s exist in Italian as well, like "contin''uia''mo" ("we continue"). [10421490] |Combinations of three vowels exist only in the form of a semiconsonant ({{IPA|/j/}} or {{IPA|/w/}}) followed by a vowel and then a desinence vowel (usually {{IPA|/i/}}), as in ''miei'' and ''suoi'', or of two semiconsonants followed by a vowel, as in the group ''-uia-'' exemplified above, or ''-iuo-'' in the word ''aiuola''. [10421500] |===Mobile diphthongs=== [10421510] |Many Latin words with a short ''e'' or ''o'' have Italian counterparts with a mobile diphthong (''ie'' and ''uo'' respectively). [10421520] |When the vowel sound is stressed, it is pronounced and written as a diphthong; when not stressed, it is pronounced and written as a single vowel. [10421530] |So Latin ''focus'' gave rise to Italian ''fuoco'' (meaning both "fire" and "optical focus"): when unstressed, as in ''focale'' ("focal"), the "o" remains alone. [10421540] |Latin ''pes'' (more precisely its accusative form ''pedem'') is the source of Italian ''piede'' (foot), but the unstressed "e" was left unchanged in ''pedone'' (pedestrian) and ''pedale'' (pedal). [10421550] |From Latin ''iocus'' comes Italian ''giuoco'' ("play", "game"), though in this case ''gioco'' is more common: ''giocare'' means "to play (a game)". [10421560] |From Latin ''homo'' comes Italian ''uomo'' (man), but also ''umano'' (human) and ''ominide'' (hominid). [10421570] |From Latin ''ovum'' comes Italian ''uovo'' (egg) and ''ovaie'' (ovaries). [10421580] |(The same phenomenon occurs in [[Spanish language|Spanish]]: ''juego'' (play, game) and ''jugar'' (to play), ''nieve'' (snow) and ''nevar'' (to snow)). [10421590] |===Consonants=== [10421610] |Nasals undergo assimilation when followed by a consonant, e.g., when preceding a velar ({{IPA|/k/}} or {{IPA|/g/}}) only {{IPA|[ŋ]}} appears, etc. [10421620] |Italian has geminate, or double, consonants, which are distinguished by [[Consonant length|length]]. [10421630] |Length is distinctive for all consonants except for {{IPA|/ʃ/}}, {{IPA|/ʦ/}}, {{IPA|/ʣ/}}, {{IPA|/ʎ/}}, {{IPA|/ɲ/}}, which are always geminate, and {{IPA|/z/}}, which is always single. [10421640] |Geminate plosives and affricates are realised as lengthened closures. [10421650] |Geminate fricatives, nasals, and {{IPA|/l/}} are realized as lengthened [[continuant]]s. [10421660] |The flap consonant {{IPA|/ɾː/}} is typically dialectal, and it is called ''erre moscia''. [10421670] |The correct standard pronunciation is {{IPA|[r]}}. [10421680] |Of special interest to the linguistic study of Italian is the ''[[Tuscan gorgia|Gorgia Toscana]]'', or "Tuscan Throat", the weakening or [[lenition]] of certain [[:wiktionary:intervocalic|intervocalic]] consonants in [[Tuscan dialect]]s. [10421690] |See also [[Syntactic doubling]].
[10421700] |===Assimilation=== [10421710] |Italian has few diphthongs, so most unfamiliar diphthongs that are heard in foreign words (in particular, those beginning with vowel "a", "e", or "o") will be assimilated as the corresponding [[diaeresis]] (i.e., the vowel sounds will be pronounced separately). [10421720] |Italian [[phonotactics]] do not usually permit polysyllabic nouns and verbs to end with consonants, excepting poetry and song, so foreign words may receive extra terminal vowel sounds. [10421730] |==Grammar== [10421740] |===Common variations in the writing systems=== [10421750] |Some variations in the usage of the writing system may be present in practical use. [10421760] |These are scorned by educated people, but they are so common in certain contexts that knowledge of them may be useful. [10421770] |* Usage of ''x'' instead of ''per'': this is very common among teenagers and in [[Text messaging|SMS]] abbreviations. [10421780] |The multiplication operator is pronounced "per" in Italian, and so it is sometimes used to replace the word "per", which means "for"; thus, for example, "per te" ("for you") is shortened to "x te" (compare with English "4 U"). [10421790] |Words containing ''per'' can also have it replaced with ''x'': for example, ''perché'' (both "why" and "because") is often shortened as ''xché'' or ''xké'' or ''x' ''(see below). [10421800] |This usage might be useful to jot down quick notes or to fit more text into the low character limit of an SMS, but it is considered unacceptable in formal writing. [10421810] |* Usage of foreign letters such as ''k'', ''j'' and ''y'', especially in nicknames and SMS language: ''ke'' instead of ''che'', ''Giusy'' instead of ''Giuseppina'' (or sometimes ''Giuseppe''). [10421820] |This is curiously mirrored in the usage of ''i'' in English names such as ''Staci'' instead of ''Stacey'', or in the usage of ''c'' in [[Northern Europe]] (''Jacob'' instead of ''Jakob''). [10421830] |The use of "k" instead of "ch" or "c" to represent a plosive sound is documented in some historical texts from before the standardization of the Italian language; however, that usage is no longer standard in Italian. [10421840] |Possibly because it is associated with the [[German language]], the letter "k" has sometimes also been used in satire to suggest that a political figure is an authoritarian or even a "pseudo-nazi": [[Francesco Cossiga]] was famously nicknamed ''Kossiga'' by rioting students during his tenure as minister of internal affairs. [10421850] |[Cf. the [[alternative political spelling#"K" replacing "C"|politicized spelling ''Amerika'']] in the USA.] [10421860] |* Usage of the following abbreviations is limited to the electronic communications media and is deprecated in all other cases: '''nn''' instead of ''non'' (not), '''cmq''' instead of ''comunque'' (anyway, however), '''cm''' instead of ''come'' (how, like, as), '''d''' instead of ''di'' (of), '''(io/loro) sn''' instead of ''(io/loro) sono'' (I am/they are), '''(io) dv''' instead of ''(io) devo'' (I must/I have to) or instead of ''dove'' (where), '''(tu) 6''' instead of ''(tu) sei'' (you are). [10421870] |* Inexperienced typists often replace accents with apostrophes, such as in ''perche''' instead of ''perché''. 
[10421880] |Uppercase ''[[È]]'' is particularly rare, as it is absent from the [[Keyboard layout#Italian|Italian keyboard layout]], and is very often written as ''E''' (even though there are [[:it:Aiuto:Manuale di stile#Scrivere .C3.88|several ways]] of producing the uppercase È on a computer). [10421890] |This never happens in books or other professionally typeset material. [10421910] |==Examples== [10421920] |*Cheers: ''Salute!'' [10421930] |*English: ''inglese'' {{IPA|/iŋˈglese/}} [10421940] |*Good-bye: ''arrivederci'' {{IPA|/arriveˈdertʃi/}} [10421950] |*Hello: ''[[ciao]]'' {{IPA|/ˈtʃao/}} [10421960] |*Good day: ''buon giorno'' {{IPA|/bwɔnˈdʒorno/}} [10421970] |*Good evening: ''buona sera'' {{IPA|/bwɔnaˈsera/}} [10421980] |*Yes: ''sì'' {{IPA|/si/}} [10421990] |*No: ''no'' {{IPA|/nɔ/}} [10422000] |*How are you?: ''Come stai'' {{IPA|/ˈkome ˈstai/}} (informal); ''Come sta'' {{IPA|/ˈkome 'sta/}} (formal) [10422010] |*Sorry: ''mi dispiace'' {{IPA|/mi disˈpjatʃe/}} [10422020] |*Excuse me: ''scusa'' {{IPA|/ˈskuza/}} (informal); ''scusi'' {{IPA|/ˈskuzi/}} (formal) [10422030] |*Again: ''di nuovo'', /{{IPA|di ˈnwɔvo}}/; ''ancora'' /{{IPA|aŋˈkora}}/ [10422040] |*Always: ''sempre'' /{{IPA|ˈsɛmpre}}/ [10422050] |*When: ''quando'' {{IPA|/ˈkwando/}} [10422060] |*Where: ''dove'' {{IPA|/'dove/}} [10422070] |*Why/Because: ''perché'' {{IPA|/perˈke/}} [10422080] |*How: ''come'' {{IPA|/'kome/}} [10422090] |*How much is it?: ''quanto costa?'' [10422100] |{{IPA|/ˈkwanto/}} [10422110] |*Thank you!: ''grazie!'' [10422120] |{{IPA|/ˈgrattsie/}} [10422130] |*Bon appetit: ''buon appetito'' {{IPA|/ˌbwɔn appeˈtito/}} [10422140] |*You're welcome!: ''prego!'' [10422150] |{{IPA|/ˈprɛgo/}} [10422160] |*I love you: ''Ti amo'' {{IPA|/ti ˈamo/}}, ''Ti voglio bene'' {{IPA|/ti ˈvɔʎʎo ˈbɛne/}}. [10422170] |The difference is that ''Ti amo'' is used in a romantic relationship, while ''Ti voglio bene'' is used on any other occasion (to parents, relatives, friends...).
[10422180] |Counting to twenty: [10422190] |*One: ''uno'' {{IPA|/ˈuno/}} [10422200] |*Two: ''due'' {{IPA|/ˈdue/}} [10422210] |*Three: ''tre'' {{IPA|/tre/}} [10422220] |*Four: ''quattro'' {{IPA|/ˈkwattro/}} [10422230] |*Five: ''cinque'' {{IPA|/ˈʧiŋkwe/}} [10422240] |*Six: ''sei'' {{IPA|/ˈsɛi/}} [10422250] |*Seven: ''sette'' {{IPA|/ˈsɛtte/}} [10422260] |*Eight: ''otto'' {{IPA|/ˈɔtto/}} [10422270] |*Nine: ''nove'' {{IPA|/ˈnɔve/}} [10422280] |*Ten: ''dieci'' {{IPA|/ˈdjɛʧi/}} [10422290] |*Eleven: ''undici'' {{IPA|/ˈundiʧi/}} [10422300] |*Twelve: ''dodici'' {{IPA|/ˈdodiʧi/}} [10422310] |*Thirteen: ''tredici'' {{IPA|/ˈtrediʧi/}} [10422320] |*Fourteen: ''quattordici'' {{IPA|/kwat'tordiʧi/}} [10422330] |*Fifteen: ''quindici'' {{IPA|/ˈkwindiʧi/}} [10422340] |*Sixteen: ''sedici'' {{IPA|/ˈsediʧi/}} [10422350] |*Seventeen: ''diciassette'' {{IPA|/diʧas'sɛtte/}} [10422360] |*Eighteen: ''diciotto'' {{IPA|/di'ʧɔtto/}} [10422370] |*Nineteen: ''diciannove'' {{IPA|/diʧan'nɔve/}} [10422380] |*Twenty: ''venti'' {{IPA|/'venti/}} [10422390] |The days of the week: [10422400] |*Monday: ''lunedì'' {{IPA|/lune'di/}} [10422410] |*Tuesday: ''martedì'' {{IPA|/marte'di/}} [10422420] |*Wednesday: ''mercoledì'' {{IPA|/merkole'di/}} [10422430] |*Thursday: ''giovedì'' {{IPA|/dʒove'di/}} [10422440] |*Friday: ''venerdì'' {{IPA|/vener'di/}} [10422450] |*Saturday: ''sabato'' {{IPA|/ˈsabato/}} [10422460] |*Sunday: ''domenica'' {{IPA|/do'menika/}} [10422470] |==Sample texts== [10422480] |There is a recording of [[Dante]]'s [[Divine Comedy]] read by [[Lino Pertile]] available at http://etcweb.princeton.edu/dante/pdp/