Algorithm

In mathematics, computing, linguistics and related disciplines, an algorithm is a sequence of instructions, often used for calculation and data processing. It is formally a type of effective method in which a list of well-defined instructions for completing a task will, when given an initial state, proceed through a well-defined series of successive states, eventually terminating in an end-state. The transition from one state to the next is not necessarily deterministic; some algorithms, known as probabilistic algorithms, incorporate randomness.

A partial formalization of the concept began with attempts to solve the Entscheidungsproblem (the "decision problem") posed by David Hilbert in 1928. Subsequent formalizations were framed as attempts to define "effective calculability" (Kleene 1943:274) or "effective method" (Rosser 1939:225); those formalizations included the Gödel-Herbrand-Kleene recursive functions of 1930, 1934 and 1935, Alonzo Church's lambda calculus of 1936, Emil Post's "Formulation I" of 1936, and Alan Turing's Turing machines of 1936-7 and 1939.

Etymology

Al-Khwārizmī, the Persian astronomer and mathematician, wrote a treatise in Arabic in 825 AD, On Calculation with Hindu Numerals (see algorism). It was translated into Latin in the 12th century as Algoritmi de numero Indorum (al-Daffa 1977), a title likely intended to mean "Algoritmi on the numbers of the Indians", where "Algoritmi" was the translator's rendition of the author's name. People who misunderstood the title treated Algoritmi as a Latin plural, and this led to the word "algorithm" (Latin algorismus) coming to mean "calculation method". The intrusive "th" is most likely due to a false cognate with the Greek ἀριθμός (arithmos), meaning "number".

Why algorithms are necessary: an informal definition

No generally accepted formal definition of "algorithm" exists yet. An informal definition could be "an algorithm is a computer program that calculates something." For some people, a program is only an algorithm if it stops eventually; for others, a program is only an algorithm if it stops before a given number of calculation steps.

A prototypical example of an algorithm is Euclid's algorithm for determining the greatest common divisor of two integers greater than one: "subtract the smallest number from the biggest one, repeat until you get a zero or a one". This procedure is known to always stop, and the number of subtractions needed is never greater than the larger of the two numbers.
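A minimal Python sketch of this subtraction procedure for two positive integers (the function name and the code are an illustrative rendering of the informal recipe above, not part of the original description):

    def gcd_by_subtraction(m, n):
        # Repeatedly replace the larger of the two numbers by the difference
        # of the two.  The loop stops when the numbers become equal; that
        # common value is the greatest common divisor.  The number of
        # subtractions is bounded by the larger of the inputs, so the
        # procedure always terminates.
        while m != n:
            if m > n:
                m = m - n
            else:
                n = n - m
        return m

For example, gcd_by_subtraction(12, 18) returns 6 after two subtractions.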
We can derive clues to the issues involved and an informal meaning of the word from the following quotation from Boolos & Jeffrey (1974, 1999) (boldface added):

No human being can write fast enough, or long enough, or small enough to list all members of an enumerably infinite set by writing out their names, one after another, in some notation. But humans can do something equally useful, in the case of certain enumerably infinite sets: They can give explicit instructions for determining the nth member of the set, for arbitrary finite n. Such instructions are to be given quite explicitly, in a form in which they could be followed by a computing machine, or by a human who is capable of carrying out only very elementary operations on symbols (Boolos & Jeffrey 1974, 1999, p. 19).

The words "enumerably infinite" mean "countable using integers perhaps extending to infinity". Thus Boolos and Jeffrey are saying that an algorithm implies instructions for a process that "creates" output integers from an arbitrary "input" integer or integers that, in theory, can be chosen from 0 to infinity. Thus we might expect an algorithm to be an algebraic equation such as y = m + n — two arbitrary "input variables" m and n that produce an output y. As we see in Algorithm characterizations, the word algorithm implies much more than this — something on the order of (for our addition example):

Precise instructions (in language understood by "the computer") for a "fast, efficient, good" process that specifies the "moves" of "the computer" (machine or human, equipped with the necessary internally contained information and capabilities) to find, decode, and then process arbitrary input integers/symbols m and n, symbols + and = ... and (reliably, correctly, "effectively") produce, in a "reasonable" time, output-integer y at a specified place and in a specified format.

The concept of algorithm is also used to define the notion of decidability. That notion is central for explaining how formal systems come into being starting from a small set of axioms and rules. In logic, the time that an algorithm requires to complete cannot be measured, as it is not apparently related to our customary physical dimensions. From such uncertainties, which characterize ongoing work, stems the unavailability of a definition of algorithm that suits both concrete (in some sense) and abstract usage of the term. For a detailed presentation of the various points of view around the definition of "algorithm", see Algorithm characterizations. For examples of simple addition algorithms specified in the detailed manner described in Algorithm characterizations, see Algorithm examples.

Formalization of algorithms

Algorithms are essential to the way computers process information, because a computer program is essentially an algorithm that tells the computer what specific steps to perform (in what specific order) in order to carry out a specified task, such as calculating employees' paychecks or printing students' report cards. Thus, an algorithm can be considered to be any sequence of operations that can be performed by a Turing-complete system.
Authors who assert this thesis include Savage (1987) and Gurevich (2000):

...Turing's informal argument in favor of his thesis justifies a stronger thesis: every algorithm can be simulated by a Turing machine (Gurevich 2000:1) ... according to Savage [1987], an algorithm is a computational process defined by a Turing machine. (Gurevich 2000:3)

Typically, when an algorithm is associated with processing information, data are read from an input source or device, written to an output sink or device, and/or stored for further processing. Stored data are regarded as part of the internal state of the entity performing the algorithm. In practice, the state is stored in a data structure, but an algorithm requires the internal data only for specific operation sets called abstract data types.

For any such computational process, the algorithm must be rigorously defined: specified in the way it applies in all possible circumstances that could arise. That is, any conditional steps must be systematically dealt with, case by case; the criteria for each case must be clear (and computable). Because an algorithm is a precise list of precise steps, the order of computation will almost always be critical to the functioning of the algorithm. Instructions are usually assumed to be listed explicitly, and are described as starting "from the top" and going "down to the bottom", an idea that is described more formally by flow of control.

So far, this discussion of the formalization of an algorithm has assumed the premises of imperative programming. This is the most common conception, and it attempts to describe a task by discrete, "mechanical" means. Unique to this conception of formalized algorithms is the assignment operation, setting the value of a variable. It derives from the intuition of "memory" as a scratchpad. An example of such an assignment appears below. For some alternate conceptions of what constitutes an algorithm, see functional programming and logic programming.

Termination

Some writers restrict the definition of algorithm to procedures that eventually finish. In such a category Kleene places the "decision procedure or decision method or algorithm for the question" (Kleene 1952:136). Others, including Kleene, include procedures that could run forever without stopping; such a procedure has been called a "computational method" (Knuth 1997:5) or "calculation procedure or algorithm" (Kleene 1952:137); however, Kleene notes that such a method must eventually exhibit "some object" (Kleene 1952:137). Minsky makes the pertinent observation, in regard to determining whether an algorithm will eventually terminate (from a particular starting state):

But if the length of the process is not known in advance, then "trying" it may not be decisive, because if the process does go on forever — then at no time will we ever be sure of the answer (Minsky 1967:105).
As it happens, no other method can do any better, as was shown by Alan Turing with his celebrated result on the undecidability of the so-called halting problem. There is no algorithmic procedure for determining, for arbitrary algorithms, whether or not they terminate from given starting states. The analysis of algorithms for their likelihood of termination is called termination analysis.

See the examples of (im-)"proper" subtraction at partial function for more about what can happen when an algorithm fails for certain of its input numbers — e.g., (i) non-termination, (ii) production of "junk" (output in the wrong format to be considered a number) or no number(s) at all (halt ends the computation with no output), (iii) wrong number(s), or (iv) a combination of these. Kleene proposed that the production of "junk" or failure to produce a number is solved by having the algorithm detect these instances and produce, e.g., an error message (he suggested "0"), or preferably, force the algorithm into an endless loop (Kleene 1952:322). Davis does this to his subtraction algorithm — he fixes his algorithm in a second example so that it is proper subtraction (Davis 1958:12-15). Along with the logical outcomes "true" and "false", Kleene also proposes the use of a third logical symbol "u" — undecided (Kleene 1952:326) — thus an algorithm will always produce something when confronted with a "proposition". The problem of wrong answers must be solved with an independent "proof" of the algorithm, e.g., using induction:

We normally require auxiliary evidence for this (that the algorithm correctly defines a mu recursive function), e.g., in the form of an inductive proof that, for each argument value, the computation terminates with a unique value (Minsky 1967:186).

Expressing algorithms

Algorithms can be expressed in many kinds of notation, including natural languages, pseudocode, flowcharts, and programming languages. Natural language expressions of algorithms tend to be verbose and ambiguous, and are rarely used for complex or technical algorithms. Pseudocode and flowcharts are structured ways to express algorithms that avoid many of the ambiguities common in natural language statements, while remaining independent of a particular implementation language. Programming languages are primarily intended for expressing algorithms in a form that can be executed by a computer, but are often used as a way to define or document algorithms.

There is a wide variety of representations possible, and one can express a given Turing machine program as a sequence of machine tables (see more at finite state machine and state transition table), as flowcharts (see more at state diagram), or as a form of rudimentary machine code or assembly code called "sets of quadruples" (see more at Turing machine). Sometimes it is helpful in the description of an algorithm to supplement small "flow charts" (state diagrams) with natural-language and/or arithmetic expressions written inside "block diagrams" to summarize what the "flow charts" are accomplishing.
Representations of algorithms are generally classed into three accepted levels of Turing machine description (Sipser 2006:157):

1. High-level description: "...prose to describe an algorithm, ignoring the implementation details. At this level we do not need to mention how the machine manages its tape or head."

2. Implementation description: "...prose used to define the way the Turing machine uses its head and the way that it stores data on its tape. At this level we do not give details of states or transition function."

3. Formal description: the most detailed, "lowest level" description, which gives the Turing machine's "state table".

For an example of the simple algorithm "Add m+n" described in all three levels, see Algorithm examples.

Implementation

Most algorithms are intended to be implemented as computer programs. However, algorithms are also implemented by other means, such as in a biological neural network (for example, the human brain implementing arithmetic or an insect looking for food), in an electrical circuit, or in a mechanical device.

Example

One of the simplest algorithms is to find the largest number in an (unsorted) list of numbers. The solution necessarily requires looking at every number in the list, but only once at each. From this follows a simple algorithm, which can be stated in a high-level description in English prose, as:

High-level description:

Assume the first item is largest. Look at each of the remaining items in the list and if it is larger than the largest item so far, make a note of it. The last noted item is the largest in the list when the process is complete.

(Quasi-)formal description: Written in prose but much closer to the high-level language of a computer program, the following is the more formal coding of the algorithm in pseudocode or pidgin code:

Input: A non-empty list of numbers L.
Output: The largest number in the list L.

    largest ← L0
    for each item in the list (L≥1), do
        if the item > largest, then
            largest ← the item
    return largest
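The same algorithm, rendered as a runnable Python function (a sketch mirroring the pseudocode above; the function name is illustrative):

    def find_largest(numbers):
        # Assume the first item is the largest seen so far.
        largest = numbers[0]
        # Look at each remaining item; note any item larger than the largest so far.
        for item in numbers[1:]:
            if item > largest:
                largest = item
        # When the loop is complete, the last noted item is the largest in the list.
        return largest

Calling find_largest([3, 1, 4, 1, 5, 9, 2, 6]) returns 9.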
For a more complex example of an algorithm, see Euclid's algorithm for the greatest common divisor, one of the earliest algorithms known.

Algorithm analysis

It is important to know how much of a particular resource (such as time or storage) is required for a given algorithm. Methods have been developed for the analysis of algorithms to obtain such quantitative answers; for example, the algorithm above has a time requirement of O(n), using the big O notation with n as the length of the list. At all times the algorithm only needs to remember two values: the largest number found so far, and its current position in the input list. Therefore it is said to have a space requirement of O(1) if the space required to store the input numbers is not counted, or O(log n) if it is counted.

Different algorithms may complete the same task with a different set of instructions in less or more time, space, or effort than others. For example, given two different recipes for making potato salad, one may call for peeling the potatoes before boiling them while the other presents the steps in the reverse order, yet both call for these steps to be repeated for all potatoes, and both end when the potato salad is ready to be eaten.

The analysis and study of algorithms is a discipline of computer science, and is often practiced abstractly without the use of a specific programming language or implementation. In this sense, algorithm analysis resembles other mathematical disciplines in that it focuses on the underlying properties of the algorithm and not on the specifics of any particular implementation. Usually pseudocode is used for analysis, as it is the simplest and most general representation.

Classes

There are various ways to classify algorithms, each with its own merits.

Classification by implementation

One way to classify algorithms is by implementation means.

Recursion or iteration: A recursive algorithm is one that invokes (makes reference to) itself repeatedly until a certain condition matches, a method common to functional programming. Iterative algorithms use repetitive constructs like loops, and sometimes additional data structures like stacks, to solve the given problems. Some problems are naturally suited for one implementation or the other. For example, the Towers of Hanoi problem is well understood in its recursive implementation. Every recursive version has an equivalent (but possibly more or less complex) iterative version, and vice versa.

Logical: An algorithm may be viewed as controlled logical deduction. This notion may be expressed as: Algorithm = logic + control (Kowalski 1979). The logic component expresses the axioms that may be used in the computation, and the control component determines the way in which deduction is applied to the axioms. This is the basis for the logic programming paradigm. In pure logic programming languages the control component is fixed, and algorithms are specified by supplying only the logic component. The appeal of this approach is the elegant semantics: a change in the axioms produces a well-defined change in the algorithm.

Serial or parallel or distributed: Algorithms are usually discussed with the assumption that computers execute one instruction of an algorithm at a time. Those computers are sometimes called serial computers. An algorithm designed for such an environment is called a serial algorithm, as opposed to parallel algorithms or distributed algorithms.
Parallel algorithms take advantage of computer architectures where several processors can work on a problem at the same time, whereas distributed algorithms utilize multiple machines connected by a network. Parallel or distributed algorithms divide the problem into symmetrical or asymmetrical subproblems and collect the results back together. The resource consumption of such algorithms is not only processor cycles on each processor but also the communication overhead between the processors. Sorting algorithms can be parallelized efficiently, but their communication overhead is expensive. Iterative algorithms are generally parallelizable. Some problems have no parallel algorithms, and are called inherently serial problems.

Deterministic or non-deterministic: Deterministic algorithms solve the problem with an exact decision at every step, whereas non-deterministic algorithms solve problems via guessing, although typical guesses are made more accurate through the use of heuristics.

Exact or approximate: While many algorithms reach an exact solution, approximation algorithms seek an approximation that is close to the true solution. Approximation may use either a deterministic or a random strategy. Such algorithms have practical value for many hard problems.

Classification by design paradigm

Another way of classifying algorithms is by their design methodology or paradigm. There are a certain number of paradigms, each different from the others. Furthermore, each of these categories includes many different types of algorithms. Some commonly found paradigms include:

Divide and conquer. A divide and conquer algorithm repeatedly reduces an instance of a problem to one or more smaller instances of the same problem (usually recursively), until the instances are small enough to solve easily. One example of divide and conquer is merge sorting: the data are divided into segments, each segment is sorted, and a sorting of the entire data set is obtained in the conquer phase by merging the segments. A simpler variant of divide and conquer is the decrease and conquer algorithm, which solves a single identical subproblem and uses its solution to solve the bigger problem. Divide and conquer divides the problem into multiple subproblems, so the conquer stage is more complex than in decrease and conquer algorithms. An example of a decrease and conquer algorithm is the binary search algorithm.

Dynamic programming. When a problem shows optimal substructure, meaning the optimal solution to a problem can be constructed from optimal solutions to subproblems, and overlapping subproblems, meaning the same subproblems are used to solve many different problem instances, a quicker approach called dynamic programming avoids recomputing solutions that have already been computed.
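As a minimal illustration of this caching idea (the Fibonacci numbers are our example here, not one drawn from the text): memoizing a naive recursive definition means each distinct subproblem is computed only once, turning exponentially many repeated calls into a linear number of them. A Python sketch:

    from functools import lru_cache

    @lru_cache(maxsize=None)      # memoization: remember the result of every call
    def fib(n):
        # Without the cache, fib(n - 1) and fib(n - 2) recompute the same
        # overlapping subproblems exponentially often; with it, each value
        # from 0 to n is computed exactly once.
        if n < 2:
            return n
        return fib(n - 1) + fib(n - 2)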
For example, the shortest path to a goal from a vertex in a weighted graph can be found by using the shortest path to the goal from all adjacent vertices. Dynamic programming and memoization go together. The main difference between dynamic programming and divide and conquer is that subproblems are more or less independent in divide and conquer, whereas subproblems overlap in dynamic programming. The difference between dynamic programming and straightforward recursion is the caching or memoization of recursive calls. When subproblems are independent and there is no repetition, memoization does not help; hence dynamic programming is not a solution for all complex problems. By using memoization, or by maintaining a table of subproblems already solved, dynamic programming reduces the exponential nature of many problems to polynomial complexity.

The greedy method. A greedy algorithm is similar to a dynamic programming algorithm, but the difference is that solutions to the subproblems do not have to be known at each stage; instead a "greedy" choice can be made of what looks best for the moment. The greedy method extends the solution with the best possible decision (not all feasible decisions) at an algorithmic stage, based on the current local optimum and the best decision (not all possible decisions) made in the previous stage. It is not exhaustive, and does not give an accurate answer to many problems. But when it works, it will be the fastest method. The most popular greedy algorithm is Kruskal's algorithm for finding the minimal spanning tree.

Linear programming. When solving a problem using linear programming, specific inequalities involving the inputs are found and then an attempt is made to maximize (or minimize) some linear function of the inputs. Many problems (such as the maximum flow for directed graphs) can be stated in a linear programming way, and then be solved by a "generic" algorithm such as the simplex algorithm. A more complex variant of linear programming is called integer programming, where the solution space is restricted to the integers.

Reduction. This technique involves solving a difficult problem by transforming it into a better-known problem for which we have (hopefully) asymptotically optimal algorithms. The goal is to find a reducing algorithm whose complexity is not dominated by the resulting reduced algorithm's. For example, one selection algorithm for finding the median in an unsorted list involves first sorting the list (the expensive portion) and then pulling out the middle element in the sorted list (the cheap portion). This technique is also known as transform and conquer.

Search and enumeration. Many problems (such as playing chess) can be modeled as problems on graphs. A graph exploration algorithm specifies rules for moving around a graph and is useful for such problems. This category also includes search algorithms, branch and bound enumeration, and backtracking.
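One concrete sketch of such a graph exploration rule is breadth-first search (our choice of illustration; the text above does not name a particular algorithm), which explores a graph level by level from a starting vertex:

    from collections import deque

    def reachable(graph, start, goal):
        # `graph` maps each vertex to a list of its neighbours (an adjacency list).
        frontier = deque([start])   # vertices discovered but not yet explored
        visited = {start}
        while frontier:
            vertex = frontier.popleft()
            if vertex == goal:
                return True
            for neighbour in graph[vertex]:
                if neighbour not in visited:
                    visited.add(neighbour)
                    frontier.append(neighbour)
        return False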
The probabilistic and heuristic paradigm. Algorithms belonging to this class fit the definition of an algorithm more loosely. Probabilistic algorithms are those that make some choices randomly (or pseudo-randomly); for some problems, it can in fact be proven that the fastest solutions must involve some randomness. Genetic algorithms attempt to find solutions to problems by mimicking biological evolutionary processes, with a cycle of random mutations yielding successive generations of "solutions"; thus, they emulate reproduction and "survival of the fittest". In genetic programming, this approach is extended to algorithms, by regarding the algorithm itself as a "solution" to a problem. Heuristic algorithms aim not at an optimal solution but at an approximate solution, for situations where the time or resources needed to find a perfect solution are not practical. An example of this would be local search, tabu search, or simulated annealing algorithms, a class of heuristic probabilistic algorithms that vary the solution of a problem by a random amount. The name "simulated annealing" alludes to the metallurgical term for the heating and cooling of metal to achieve freedom from defects. The purpose of the random variance is to find solutions close to the global optimum rather than simply locally optimal ones, the idea being that the random element is decreased as the algorithm settles down to a solution.

Classification by field of study

Every field of science has its own problems and needs efficient algorithms. Related problems in one field are often studied together. Some example classes are search algorithms, sorting algorithms, merge algorithms, numerical algorithms, graph algorithms, string algorithms, computational geometric algorithms, combinatorial algorithms, machine learning, cryptography, data compression algorithms and parsing techniques. Fields tend to overlap with each other, and algorithm advances in one field may improve those of other, sometimes completely unrelated, fields. For example, dynamic programming was originally invented for optimization of resource consumption in industry, but is now used in solving a broad range of problems in many fields.

Classification by complexity

Algorithms can be classified by the amount of time they need to complete compared to their input size. There is a wide variety: some algorithms complete in linear time relative to input size, some do so in an exponential amount of time or even worse, and some never halt. Additionally, some problems may have multiple algorithms of differing complexity, while other problems might have no algorithms or no known efficient algorithms. There are also mappings from some problems to other problems.
Owing to this, it was found to be more suitable to classify the problems themselves instead of the algorithms into equivalence classes based on the complexity of the best possible algorithms for them.

Classification by computing power

Another way to classify algorithms is by computing power. This is typically done by considering some collection (class) of algorithms. A recursive class of algorithms is one that includes algorithms for all Turing computable functions. Looking at classes of algorithms allows for the possibility of restricting the available computational resources (time and memory) used in a computation. A subrecursive class of algorithms is one in which not all Turing computable functions can be obtained. For example, the algorithms that run in polynomial time suffice for many important types of computation but do not exhaust all Turing computable functions. The class of algorithms implemented by primitive recursive functions is another subrecursive class.

Burgin (2005, p. 24) uses a generalized definition of algorithms that relaxes the common requirement that the output of an algorithm that computes a function must be determined after a finite number of steps. He defines a super-recursive class of algorithms as "a class of algorithms in which it is possible to compute functions not computable by any Turing machine" (Burgin 2005, p. 107). This is closely related to the study of methods of hypercomputation.

Legal issues

See also: Software patents for a general overview of the patentability of software, including computer-implemented algorithms.

Algorithms, by themselves, are not usually patentable. In the United States, a claim consisting solely of simple manipulations of abstract concepts, numbers, or signals does not constitute a "process" (USPTO 2006), and hence algorithms are not patentable (as in Gottschalk v. Benson). However, practical applications of algorithms are sometimes patentable. For example, in Diamond v. Diehr, the application of a simple feedback algorithm to aid in the curing of synthetic rubber was deemed patentable. The patenting of software is highly controversial, and there are highly criticized patents involving algorithms, especially data compression algorithms, such as Unisys' LZW patent. Additionally, some cryptographic algorithms have export restrictions (see export of cryptography).

History: Development of the notion of "algorithm"

Origin of the word

The word algorithm comes from the name of the 9th-century Persian mathematician Abu Abdullah Muhammad ibn Musa al-Khwarizmi, whose works introduced Indian numerals and algebraic concepts. He worked in Baghdad at the time when it was the centre of scientific studies and trade.
The word algorism originally referred only to the rules of performing arithmetic using Arabic numerals, but evolved via the European Latin translation of al-Khwarizmi's name into "algorithm" by the 18th century. The word evolved to include all definite procedures for solving problems or performing tasks.

Discrete and distinguishable symbols

Tally-marks: To keep track of their flocks, their sacks of grain and their money the ancients used tallying: accumulating stones or marks scratched on sticks, or making discrete symbols in clay. Through the Babylonian and Egyptian use of marks and symbols, eventually Roman numerals and the abacus evolved (Dilson, p. 16–41). Tally marks appear prominently in unary numeral system arithmetic used in Turing machine and Post-Turing machine computations.

Manipulation of symbols as "place holders" for numbers: algebra

The work of the Ancient Greek geometers, the Persian mathematician Al-Khwarizmi (often considered the "father of algebra"), and Western European mathematicians culminated in Leibniz's notion of the calculus ratiocinator (ca 1680):

"A good century and a half ahead of his time, Leibniz proposed an algebra of logic, an algebra that would specify the rules for manipulating logical concepts in the manner that ordinary algebra specifies the rules for manipulating numbers" (Davis 2000:1).

Mechanical contrivances with discrete states

The clock: Bolter credits the invention of the weight-driven clock as "The key invention [of Europe in the Middle Ages]", in particular the verge escapement (Bolter 1984:24) that provides us with the tick and tock of a mechanical clock. "The accurate automatic machine" (Bolter 1984:26) led immediately to "mechanical automata" beginning in the thirteenth century and finally to "computational machines" — the difference engine and analytical engines of Charles Babbage and Countess Ada Lovelace (Bolter p. 33–34, p. 204–206).

Jacquard loom, Hollerith punch cards, telegraphy and telephony — the electromechanical relay: Bell and Newell (1971) indicate that the Jacquard loom (1801), precursor to Hollerith cards (punch cards, 1887), and "telephone switching technologies" were the roots of a tree leading to the development of the first computers (Bell and Newell diagram p. 39, cf. Davis 2000). By the mid-1800s the telegraph, the precursor of the telephone, was in use throughout the world, its discrete and distinguishable encoding of letters as "dots and dashes" a common sound. By the late 1800s the ticker tape (ca 1870s) was in use, as was the use of Hollerith cards in the 1890 U.S. census. Then came the Teletype (ca 1910) with its punched-paper use of Baudot code on tape.

Telephone-switching networks of electromechanical relays (invented 1835) were behind the work of George Stibitz (1937), the inventor of the digital adding device. As he worked in Bell Laboratories, he observed the "burdensome" use of mechanical calculators with gears. "He went home one evening in 1937 intending to test his idea....
When the tinkering was over, Stibitz had constructed a binary adding device" (Valley News, p. 13).

Davis (2000) observes the particular importance of the electromechanical relay (with its two "binary states" open and closed):

"It was only with the development, beginning in the 1930s, of electromechanical calculators using electrical relays, that machines were built having the scope Babbage had envisioned." (Davis, p. 14)

Mathematics during the 1800s up to the mid-1900s

Symbols and rules: In rapid succession the mathematics of George Boole (1847, 1854), Gottlob Frege (1879), and Giuseppe Peano (1888–1889) reduced arithmetic to a sequence of symbols manipulated by rules. Peano's The principles of arithmetic, presented by a new method (1888) was "the first attempt at an axiomatization of mathematics in a symbolic language" (van Heijenoort:81ff). But van Heijenoort gives Frege (1879) this kudos: Frege's is "perhaps the most important single work ever written in logic ... in which we see a 'formula language', that is a lingua characterica, a language written with special symbols, 'for pure thought', that is, free from rhetorical embellishments ... constructed from specific symbols that are manipulated according to definite rules" (van Heijenoort:1). The work of Frege was further simplified and amplified by Alfred North Whitehead and Bertrand Russell in their Principia Mathematica (1910–1913).

The paradoxes: At the same time a number of disturbing paradoxes appeared in the literature, in particular the Burali-Forti paradox (1897), the Russell paradox (1902–03), and the Richard Paradox (Dixon 1906, cf. Kleene 1952:36–40). The resultant considerations led to Kurt Gödel's paper (1931) — he specifically cites the paradox of the liar — that completely reduces rules of recursion to numbers.

Effective calculability: In an effort to solve the Entscheidungsproblem defined precisely by Hilbert in 1928, mathematicians first set about to define what was meant by an "effective method" or "effective calculation" or "effective calculability" (i.e., a calculation that would succeed). In rapid succession the following appeared: Alonzo Church, Stephen Kleene and J. B. Rosser's λ-calculus (cf. footnote in Alonzo Church 1936a:90, 1936b:110), a finely honed definition of "general recursion" from the work of Gödel acting on suggestions of Jacques Herbrand (cf. Gödel's Princeton lectures of 1934) and subsequent simplifications by Kleene (1935-6:237ff, 1943:255ff); Church's proof (1936:88ff) that the Entscheidungsproblem was unsolvable; Emil Post's definition of effective calculability as a worker mindlessly following a list of instructions to move left or right through a sequence of rooms and while there either mark or erase a paper, or observe the paper and make a yes-no decision about the next instruction (cf. "Formulation I", Post 1936:289-290); Alan Turing's proof that the Entscheidungsproblem was unsolvable by use of his "a- [automatic-] machine" (Turing 1936-7:116ff), in effect almost identical to Post's "formulation"; J. Barkley Rosser's definition of "effective method" in terms of "a machine" (Rosser 1939:226);
and S. C. Kleene's proposal of a precursor to the "Church thesis", which he called "Thesis I" (Kleene 1943:273–274), followed a few years later by Kleene's renaming of his thesis as "Church's Thesis" (Kleene 1952:300, 317) and his proposal of "Turing's Thesis" (Kleene 1952:376).

Emil Post (1936) and Alan Turing (1936-7, 1939)

Here is a remarkable coincidence of two men not knowing each other but describing a process of men-as-computers working on computations — and they yield virtually identical definitions. Emil Post (1936) described the actions of a "computer" (human being) as follows:

"...two concepts are involved: that of a symbol space in which the work leading from problem to answer is to be carried out, and a fixed unalterable set of directions."

His symbol space would be

"a two way infinite sequence of spaces or boxes... The problem solver or worker is to move and work in this symbol space, being capable of being in, and operating in but one box at a time.... a box is to admit of but two possible conditions, i.e., being empty or unmarked, and having a single mark in it, say a vertical stroke.

"One box is to be singled out and called the starting point. ...a specific problem is to be given in symbolic form by a finite number of boxes [i.e., INPUT] being marked with a stroke. Likewise the answer [i.e., OUTPUT] is to be given in symbolic form by such a configuration of marked boxes....

"A set of directions applicable to a general problem sets up a deterministic process when applied to each specific problem. This process will terminate only when it comes to the direction of type (C) [i.e., STOP]." (U p. 289–290)

See more at Post-Turing machine.

Alan Turing's work (1936, 1939:160) preceded that of Stibitz (1937); it is unknown whether Stibitz knew of the work of Turing. Turing's biographer believed that Turing's use of a typewriter-like model derived from a youthful interest: "Alan had dreamt of inventing typewriters as a boy; Mrs. Turing had a typewriter; and he could well have begun by asking himself what was meant by calling a typewriter 'mechanical'" (Hodges, p. 96). Given the prevalence of Morse code and telegraphy, ticker tape machines, and Teletypes, we might conjecture that all were influences.

Turing — his model of computation is now called a Turing machine — begins, as did Post, with an analysis of a human computer that he whittles down to a simple set of basic motions and "states of mind". But he continues a step further and creates a machine as a model of computation of numbers (Turing 1936-7:116).

"Computing is normally done by writing certain symbols on paper. We may suppose this paper is divided into squares like a child's arithmetic book....I assume then that the computation is carried out on one-dimensional paper, i.e., on a tape divided into squares. I shall also suppose that the number of symbols which may be printed is finite....
"The behavior of the computer at any moment is determined by the symbols which he is observing, and his "state of mind" at that moment. We may suppose that there is a bound B to the number of symbols or squares which the computer can observe at one moment. If he wishes to observe more, he must use successive observations. We will also suppose that the number of states of mind which need be taken into account is finite...

"Let us imagine that the operations performed by the computer to be split up into 'simple operations' which are so elementary that it is not easy to imagine them further divided" (Turing 1936-7:136).

Turing's reduction yields the following:

"The simple operations must therefore include:

"(a) Changes of the symbol on one of the observed squares

"(b) Changes of one of the squares observed to another square within L squares of one of the previously observed squares.

"It may be that some of these changes necessarily invoke a change of state of mind. The most general single operation must therefore be taken to be one of the following:

"(A) A possible change (a) of symbol together with a possible change of state of mind.

"(B) A possible change (b) of observed squares, together with a possible change of state of mind"

"We may now construct a machine to do the work of this computer." (Turing 1936-7:136)

A few years later, Turing expanded his analysis (thesis, definition) with this forceful expression of it:

"A function is said to be "effectively calculable" if its values can be found by some purely mechanical process. Although it is fairly easy to get an intuitive grasp of this idea, it is nevertheless desirable to have some more definite, mathematically expressible definition . . . [he discusses the history of the definition pretty much as presented above with respect to Gödel, Herbrand, Kleene, Church, Turing and Post] . . . We may take this statement literally, understanding by a purely mechanical process one which could be carried out by a machine. It is possible to give a mathematical description, in a certain normal form, of the structures of these machines. The development of these ideas leads to the author's definition of a computable function, and to an identification of computability † with effective calculability . . . .

"† We shall use the expression "computable function" to mean a function calculable by a machine, and we let "effectively calculable" refer to the intuitive idea without particular identification with any one of these definitions." (Turing 1939:160)

J. B. Rosser (1939) and S. C. Kleene (1943)

J. Barkley Rosser boldly defined an "effective [mathematical] method" in the following manner (boldface added):

"'Effective method' is used here in the rather special sense of a method each step of which is precisely determined and which is certain to produce the answer in a finite number of steps.
With this special meaning, three different precise definitions have been given to date. [his footnote #5; see discussion immediately below]. The simplest of these to state (due to Post and Turing) says essentially that an effective method of solving certain sets of problems exists if one can build a machine which will then solve any problem of the set with no human intervention beyond inserting the question and (later) reading the answer. All three definitions are equivalent, so it doesn't matter which one is used. Moreover, the fact that all three are equivalent is a very strong argument for the correctness of any one." (Rosser 1939:225–6)

Rosser's footnote #5 references the work of (1) Church and Kleene and their definition of λ-definability, in particular Church's use of it in his An Unsolvable Problem of Elementary Number Theory (1936); (2) Herbrand and Gödel and their use of recursion, in particular Gödel's use in his famous paper On Formally Undecidable Propositions of Principia Mathematica and Related Systems I (1931); and (3) Post (1936) and Turing (1936-7) in their mechanism-models of computation.

Stephen C. Kleene defined as his now-famous "Thesis I" what is known as the Church-Turing thesis. But he did this in the following context (boldface in original):

"12. Algorithmic theories... In setting up a complete algorithmic theory, what we do is to describe a procedure, performable for each set of values of the independent variables, which procedure necessarily terminates and in such manner that from the outcome we can read a definite answer, "yes" or "no," to the question, "is the predicate value true?"" (Kleene 1943:273)

History after 1950

A number of efforts have been directed toward further refinement of the definition of "algorithm", and activity is ongoing because of issues surrounding, in particular, foundations of mathematics (especially the Church-Turing Thesis) and philosophy of mind (especially arguments around artificial intelligence). For more, see Algorithm characterizations.

Algorithmic repositories

LEDA
Stanford GraphBase
Combinatorica
Netlib
XTango

Ambiguity

Ambiguity is the property of being ambiguous, where a word, term, notation, sign, symbol, phrase, sentence, or any other form used for communication, is called ambiguous if it can be interpreted in more than one way. Ambiguity is distinct from vagueness, which arises when the boundaries of meaning are indistinct. Ambiguity is context-dependent: the same communication may be ambiguous in one context and unambiguous in another. For a word, ambiguity typically refers to an unclear choice between different definitions as may be found in a dictionary. A sentence may be ambiguous due to different ways of parsing the same sequence of words.
Linguistic forms

Lexical ambiguity arises when context is insufficient to determine the sense of a single word that has more than one meaning. For example, the word "bank" has several distinct definitions, including "financial institution" and "edge of a river," but if someone says "I deposited $100 in the bank," most people would not think you used a shovel to dig in the mud. The word "run" has 130 ambiguous definitions in some lexicons. "Biweekly" can mean "fortnightly" (once every two weeks, 26 times a year) or "twice a week" (104 times a year), and stating a specific context like "meeting schedule" does not disambiguate it. Many people believe that such lexically ambiguous, miscommunication-prone words should be avoided altogether, since the user generally has to waste time, effort, and attention to define what is meant when they are used. The use of multi-defined words requires the author or speaker to clarify their context, and sometimes to elaborate on their specific intended meaning (in which case a less ambiguous term should have been used). The goal of clear, concise communication is that the receiver(s) have no misunderstanding about what was meant to be conveyed. An exception to this could be a politician whose "wiggle words" and obfuscation are necessary to gain support from multiple constituents with mutually exclusive, conflicting desires for their candidate of choice. Ambiguity is a powerful tool of political science.

More problematic are words whose senses express closely related concepts. "Good," for example, can mean "useful" or "functional" (That's a good hammer), "exemplary" (She's a good student), "pleasing" (This is good soup), "moral" (a good person versus the lesson to be learned from a story), "righteous", etc. "I have a good daughter" is not clear about which sense is intended. The various ways to apply prefixes and suffixes can also create ambiguity ("unlockable" can mean "capable of being unlocked" or "impossible to lock", and therefore should not be used).

Syntactic ambiguity arises when a sentence can be parsed in more than one way. "He ate the cookies on the couch," for example, could mean that he ate those cookies which were on the couch (as opposed to those that were on the table), or it could mean that he was sitting on the couch when he ate the cookies.

Spoken language can contain many more types of ambiguity, where there is more than one way to compose a set of sounds into words, for example "ice cream" and "I scream." Such ambiguity is generally resolved based on the context. A mishearing of such, based on incorrectly resolved ambiguity, is called a mondegreen.

Semantic ambiguity arises when a word or concept has an inherently diffuse meaning based on widespread or informal usage.
This is often the case, for example, with idiomatic expressions whose definitions are rarely or never well-defined, and which are presented in the context of a larger argument that invites a conclusion. For example, "You could do with a new automobile. How about a test drive?" The clause "You could do with" presents a statement with such wide possible interpretation as to be essentially meaningless.

Lexical ambiguity is contrasted with semantic ambiguity. The former represents a choice between a finite number of known and meaningful context-dependent interpretations. The latter represents a choice between any number of possible interpretations, none of which may have a standard agreed-upon meaning. This form of ambiguity is closely related to vagueness.

Linguistic ambiguity can be a problem in law (see Ambiguity (law)), because the interpretation of written documents and oral agreements is often of paramount importance.

Intentional application

Philosophers (and other users of logic) spend a lot of time and effort searching for and removing (or intentionally adding) ambiguity in arguments, because it can lead to incorrect conclusions and can be used to deliberately conceal bad arguments. For example, a politician might say "I oppose taxes that hinder economic growth." Some will think he opposes taxes in general, because they hinder economic growth. Others may think he opposes only those taxes that he believes will hinder economic growth. In writing, inserting a comma after "taxes" and using "which" in place of "that" would signal the first meaning, or the sentence could be restructured to eliminate the possible misinterpretation entirely. The devious politician hopes that each constituent will interpret the statement in the most desirable way, and think the politician supports everyone's opinion. However, the opposite can also be true: an opponent can turn a positive statement into a bad one if the speaker uses ambiguity (intentionally or not). The logical fallacies of amphiboly and equivocation rely heavily on the use of ambiguous words and phrases.

In literature and rhetoric, on the other hand, ambiguity can be a useful tool. Groucho Marx's classic joke depends on a grammatical ambiguity for its humor, for example: "Last night I shot an elephant in my pajamas.
10020480 -> 1000000200490: What he was doing in my pajamas I’ll never know.” 10020490 -> 1000000200500: Ambiguity can also be used as a comic device through a genuine intention to confuse, as it is by Ambiguity, a card from Magic: The Gathering's Unhinged set, which makes puns with homophones, mispunctuation, and run-ons: “Whenever a player plays a spell that counters a spell that has been played[,] or a player plays a spell that comes into play with counters, that player may counter the next spell played[,] or put an additional counter on a permanent that has already been played, but not countered.” 10020500 -> 1000000200510: Songs and poetry often rely on ambiguous words for artistic effect, as in the song title “Don’t It Make My Brown Eyes Blue” (where “blue” can refer to the color, or to sadness). 10020510 -> 1000000200520: In narrative, ambiguity can be introduced in several ways: motive, plot, character. 10020520 -> 1000000200530: F. Scott Fitzgerald uses the latter type of ambiguity with notable effect in his novel The Great Gatsby. 10020530 -> 1000000200540: All religions debate the orthodoxy or heterodoxy of ambiguity. 10020540 -> 1000000200550: Christianity and Judaism employ the concept of paradox synonymously with 'ambiguity'. 10020550 -> 1000000200560: Ambiguity within Christianity (and other religions) is resisted by the conservatives and fundamentalists, who regard the concept as equating with 'contradiction'. 10020560 -> 1000000200570: Non-fundamentalist Christians and Jews endorse Rudolf Otto's description of the sacred as 'mysterium tremendum et fascinans', the awe-inspiring mystery which fascinates humans. 10020570 -> 1000000200580: Metonymy involves the use of the name of a subcomponent part as an abbreviation, or jargon, for the name of the whole object (for example "wheels" to refer to a car, or "flowers" to refer to beautiful offspring, an entire plant, or a collection of blooming plants). 10020580 -> 1000000200590: In the modern vocabulary of critical semiotics, metonymy encompasses any potentially ambiguous word substitution that is based on contextual contiguity (located close together), or on a function or process that an object performs, such as "sweet ride" to refer to a nice car. 10020590 -> 1000000200600: Metonym miscommunication is considered a primary mechanism of linguistic humour. 10020600 -> 1000000200610: Psychology and management 10020610 -> 1000000200620: In sociology and social psychology, the term "ambiguity" is used to indicate situations that involve uncertainty. 10020620 -> 1000000200630: An increasing amount of research is concentrating on how people react and respond to ambiguous situations. 10020630 -> 1000000200640: Much of this focuses on ambiguity tolerance. 10020640 -> 1000000200650: A number of correlations have been found between an individual’s reaction and tolerance to ambiguity and a range of factors. 10020650 -> 1000000200660: Apter and Desselles (2001), for example, found strong correlations between an individual's tolerance of ambiguity and factors such as a greater preference for safe as opposed to risk-based sports, a preference for endurance-type activities as opposed to explosive activities, a more organized and less casual lifestyle, greater care and precision in descriptions, a lower sensitivity to emotional and unpleasant words, a less acute sense of humour, engaging in a smaller variety of sexual practices than their more risk-comfortable colleagues, a lower likelihood of using drugs, pornography and drink, and a greater likelihood of displaying obsessional behaviour. 
10020660 -> 1000000200670: In the field of leadership, David Wilkinson (2006) found strong correlations between an individual leader's reaction to ambiguous situations and the Modes of Leadership they use, the type of creativity they exhibit (Kirton, 2003), and how they relate to others. 10020670 -> 1000000200680: Music 10020680 -> 1000000200690: In music, pieces or sections which confound expectations and may be or are interpreted simultaneously in different ways are ambiguous, such as some polytonality, polymeter, other ambiguous meters or rhythms, and ambiguous phrasing, or (Stein 2005, p.79) any aspect of music. 10020690 -> 1000000200700: The music of Africa is often purposely ambiguous. 10020700 -> 1000000200710: To quote Sir Donald Francis Tovey (1935, p.195), “Theorists are apt to vex themselves with vain efforts to remove uncertainty just where it has a high aesthetic value.” 10020710 -> 1000000200720: Constructed language 10020720 -> 1000000200730: Some languages have been created with the intention of avoiding ambiguity, especially lexical ambiguity. 10020730 -> 1000000200740: Lojban and Loglan are two related languages which have been created with this in mind. 10020740 -> 1000000200750: The languages can be both spoken and written. 10020750 -> 1000000200760: These languages are intended to provide greater technical precision than natural languages, although historically such attempts at language improvement have been criticized. 10020760 -> 1000000200770: Languages composed from many diverse sources contain much ambiguity and inconsistency. 10020770 -> 1000000200780: The many exceptions to syntax and semantic rules are time-consuming and difficult to learn. 10020780 -> 1000000200790: Mathematics and physics 10020790 -> 1000000200800: Mathematical notation, widely used in physics and other sciences, avoids many ambiguities compared to expression in natural language. 10020800 -> 1000000200810: However, for various reasons, several lexical, syntactic and semantic ambiguities remain. 10020810 -> 1000000200820: Names of functions 10020820 -> 1000000200830: The ambiguity in the style of writing a function should not be confused with a multivalued function, which can (and should) be defined in a deterministic and unambiguous way. 10020830 -> 1000000200840: Several special functions still do not have established notations. 10020840 -> 1000000200850: Usually, the conversion to another notation requires scaling the argument and/or the resulting value; sometimes the same name of the function is used, causing confusion. 10020850 -> 1000000200860: Examples of such functions without a settled notation: 10020860 -> 1000000200870: Sinc function 10020870 -> 1000000200880: Elliptic integral of the third kind; when translating an elliptic integral from MAPLE to Mathematica, one should replace the second argument by its square (see Talk:Elliptic integral#List_of_notations); when dealing with complex values, this may cause problems. 10020880 -> 1000000200890: Exponential integral (page 228, http://www.math.sfu.ca/~cbm/aands/page_228.htm) 10020890 -> 1000000200900: Hermite polynomial (page 775, http://www.math.sfu.ca/~cbm/aands/page_775.htm) 10020900 -> 1000000200910: Expressions 10020910 -> 1000000200920: Ambiguous expressions often appear in physical and mathematical texts. 10020920 -> 1000000200930: It is common practice to omit multiplication signs in mathematical expressions. 10020930 -> 1000000200940: It is also common to give the same name to a variable and a function, for example, ~f=f(x)~. 
10020940 -> 1000000200950: Then, if one sees ~g=f(y+1)~, there is no way to tell whether it means ~f~ (that is, ~f(x)~) multiplied by ~(y+1)~, or the function ~f~ evaluated at the argument ~(y+1)~. 10020950 -> 1000000200960: Whenever such notation is used, the reader is supposed to be able to perform the deduction and reveal the true meaning. 10020960 -> 1000000200970: Creators of algorithmic languages try to avoid ambiguities. 10020970 -> 1000000200980: Many algorithmic languages (C++, MATLAB, Fortran, Maple) require the character * as the symbol of multiplication. 10020980 -> 1000000200990: The language Mathematica allows the user to omit the multiplication symbol, but requires square brackets to indicate the argument of a function; square brackets are not allowed for grouping of expressions. 10020990 -> 1000000201000: Fortran, in addition, does not allow use of the same name (identifier) for different objects, for example, a function and a variable; in particular, the expression f=f(x) is qualified as an error. 10021000 -> 1000000201010: The order of operations may depend on the context. 10021010 -> 1000000201020: In most programming languages, the operations of division and multiplication have equal priority and are executed from left to right. 10021020 -> 1000000201030: Until the last century, many editorial conventions assumed that multiplication is performed first, so that, for example, ~a/bc~ is interpreted as ~a/(bc)~; in this case, the insertion of parentheses is required when translating the formulas to an algorithmic language (a short illustration in such a language is given below). 10021030 -> 1000000201040: In addition, it is common to write an argument of a function without parentheses, which also may lead to ambiguity. 10021040 -> 1000000201050: Sometimes, italic letters are used to denote elementary functions. 10021050 -> 1000000201060: In the scientific journal style, the expression ~ s i n \alpha~ means the product of the variables ~s~, ~i~, ~n~ and ~\alpha~, although in a slideshow it may mean ~\sin[\alpha]~. 10021060 -> 1000000201070: The comma in subscripts and superscripts is sometimes omitted; this is also ambiguous notation. 10021070 -> 1000000201080: If ~T_{mnk}~ is written, the reader has to guess from the context whether it means a single-index object evaluated at a subscript equal to the product of the variables ~m~, ~n~ and ~k~, or a three-index tensor. 10021080 -> 1000000201090: Writing ~T_{mnk}~ instead of ~T_{m,n,k}~ may mean that the writer is either short of space (for example, trying to reduce publication fees) or aiming to increase the number of publications without considering the readers. 10021090 -> 1000000201100: The same may apply to any other use of ambiguous notations. 10021100 -> 1000000201110: Examples of potentially confusing ambiguous mathematical expressions 10021110 -> 1000000201120: \sin^2\alpha/2\,, which could be understood to mean either (\sin(\alpha/2))^2\, or (\sin(\alpha))^2/2\,. 10021120 -> 1000000201130: ~\sin^{-1} \alpha, which by convention means ~\arcsin(\alpha) ~, though it might be thought to mean (\sin(\alpha))^{-1}\, since ~\sin^{n} \alpha means (\sin(\alpha))^{n}\,. 10021130 -> 1000000201140: a/2b\,, which arguably should mean (a/2)b\, but would commonly be understood to mean a/(2b)\,. 10021140 -> 1000000201150: Notations in quantum optics and quantum mechanics 10021150 -> 1000000201160: It is common to denote the coherent states in quantum optics by ~|\alpha\rangle~ and states with a fixed number of photons by ~|n\rangle~. 
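These conventions can be made concrete in a programming language. The following short sketch is written in Python (a language not named above, chosen here purely for illustration; the names a, b, c, f and y are arbitrary): it shows that a/b*c is evaluated left to right as (a/b)*c, and that a juxtaposition such as f(y+1) can only be a function call, never an implicit multiplication.

# How a programming language removes the notational ambiguities discussed above.
a, b, c = 12.0, 4.0, 3.0

# Division and multiplication have equal priority and associate left to right,
# so a / b * c means (a / b) * c, not a / (b * c).
assert a / b * c == (a / b) * c      # 9.0
assert a / b * c != a / (b * c)      # a/(bc) would be 1.0

def f(x):
    # A named function; f(y + 1) below is unambiguously a call.
    return 2 * x

y = 5
# In printed mathematics f(y+1) might mean "f times (y+1)"; in Python it can
# only be the function f evaluated at y + 1.  The product must be written
# explicitly as some_number * (y + 1).
print(f(y + 1))                      # prints 12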
10021160 -> 1000000201170: With these definitions, there is an "unwritten rule": the state is coherent if there are more Greek characters than Latin characters in the argument, and an ~n~-photon state if the Latin characters dominate. 10021170 -> 1000000201180: The ambiguity becomes even worse if ~|x\rangle~ is used for a state with a certain value of the coordinate and ~|p\rangle~ for a state with a certain value of the momentum, as may be done in books on quantum mechanics. 10021180 -> 1000000201190: Such ambiguities easily lead to confusion, especially if normalized, dimensionless variables are used. 10021190 -> 1000000201200: The expression |1\rangle may mean a state with a single photon, a coherent state with mean amplitude equal to 1, a state with momentum equal to unity, and so on. 10021200 -> 1000000201210: The reader is supposed to guess from the context. 10021210 -> 1000000201220: Examples of ambiguous terms in physics 10021220 -> 1000000201230: Some physical quantities do not yet have established notations; their value (and sometimes even dimension, as in the case of the Einstein coefficients) depends on the system of notations. 10021230 -> 1000000201240: A highly confusing term is gain. 10021240 -> 1000000201250: For example, the sentence "the gain of a system should be doubled", without context, means close to nothing. 10021250 -> 1000000201260: It may mean that the ratio of the output voltage of an electric circuit to the input voltage should be doubled. 10021260 -> 1000000201270: It may mean that the ratio of the output power of an electric or optical circuit to the input power should be doubled. 10021270 -> 1000000201280: It may mean that the gain of the laser medium should be doubled, for example, doubling the population of the upper laser level in a quasi-two level system (assuming negligible absorption of the ground-state). 10021280 -> 1000000201290: Confusion may also arise from the use of atomic percent as a measure of the concentration of a dopant, or of the resolution of an imaging system as a measure of the size of the smallest detail that can still be resolved against the background of statistical noise. 10021290 -> 1000000201300: See also Accuracy and precision and its talk. 10021300 -> 1000000201310: Many terms are ambiguous. 10021310 -> 1000000201320: Each use of an ambiguous term should be preceded by a definition suitable for the specific case. 10021320 -> 1000000201330: The Berry paradox arises as a result of systematic ambiguity. 10021330 -> 1000000201340: In various formulations of the Berry paradox, such as one that reads "The number not nameable in fewer than eleven syllables", the term "nameable" is one that has this systematic ambiguity. 10021340 -> 1000000201350: Terms of this kind give rise to vicious circle fallacies. 10021350 -> 1000000201360: Other terms with this type of ambiguity are: satisfiable, definable, true, false, function, property, class, relation, cardinal, and ordinal. 10021360 -> 1000000201370: Pedagogic use of ambiguous expressions 10021370 -> 1000000201380: Ambiguity can be used as a pedagogical trick, to force students to reproduce the deduction by themselves. 10021380 -> 1000000201390: Some textbooks give the same name to the function and to its Fourier transform: 10021390 -> 1000000201400: ~f(\omega)=\int f(t) \exp(i\omega t) {\rm d}t . 
10021400 -> 1000000201410: Rigorously speaking, such an expression requires that ~ f=0 ~; even if function ~ f ~ is a self-Fourier function, the expression should be written as ~f(\omega)=\frac{1}{\sqrt{2\pi}}\int f(t) \exp(i\omega t) {\rm d}t ; however, it is assumed that the shape of the function (and even its norm \int |f(x)|^2 {\rm d}x ) depends on the character used to denote its argument. 10021410 -> 1000000201420: If the Greek letter is used, the function is assumed to be the Fourier transform of another function; the first function is assumed if the expression in the argument contains more characters ~t~ or ~\tau~ than characters ~\omega~, and the second function is assumed in the opposite case. 10021420 -> 1000000201430: Expressions like ~f(\omega t)~ or ~f(y)~ contain symbols ~t~ and ~\omega~ in equal amounts; they are ambiguous and should be avoided in serious deduction. Artificial Linguistic Internet Computer Entity 10040010 -> 1000000300020: Artificial Linguistic Internet Computer Entity 10040020 -> 1000000300030: A.L.I.C.E. (Artificial Linguistic Internet Computer Entity) is an award-winning natural language processing chatterbot—a program that engages in a conversation with a human by applying some heuristic pattern-matching rules to the human's input, and in its online form it also relies on a hidden third person. 10040030 -> 1000000300040: It was inspired by Joseph Weizenbaum's classic ELIZA program. 10040040 -> 1000000300050: It is one of the strongest programs of its type and has won the Loebner Prize, awarded to accomplished humanoid, talking robots, three times (in 2000, 2001 and 2004). 10040050 -> 1000000300060: However, the program is unable to pass the Turing test, as even the casual user will often expose its mechanistic aspects in short conversations. 10040060 -> 1000000300070: The name of the bot was chosen because the computer that ran the first version of the software was called Alice. 10040070 -> 1000000300080: History 10040080 -> 1000000300090: Development began in 1995. 10040090 -> 1000000300100: The program was rewritten in Java beginning in 1998. 10040100 -> 1000000300110: The current incarnation of the Java implementation is Program D. 10040110 -> 1000000300120: The program uses an XML Schema called AIML (Artificial Intelligence Markup Language) for specifying the heuristic conversation rules. 10040120 -> 1000000300130: The original author of the project was Richard Wallace, but subsequent to the 2001 publication of an AIML specification, numerous other developers have taken up where Wallace left off, implementing free and open source software AIML interpreters in a variety of programming languages, publishing AIML sets in various human languages, and continuing the spread of the technology as a free/open source venture. 10040130 -> 1000000300140: Example of conversation 10040140 -> 1000000300150: The following is part of a transcript recording the conversation between a human and A.L.I.C.E. and illustrates the capabilities and shortcomings of the chatbot. 10040150 -> 1000000300160: Human: Tell me about yourself alice 10040160 -> 1000000300170: ALICE: I am an artificial linguistic entity. 10040170 -> 1000000300180: I was activated at Bethlehem, Pennsylvania, on November 23, 1995. 10040180 -> 1000000300190: My instructor was Dr. Richard S. Wallace. 10040190 -> 1000000300200: He taught me to sing a song. 10040200 -> 1000000300210: Would you like me to sing it for you? 
10040210 -> 1000000300220: Human: Yes please 10040220 -> 1000000300230: ALICE: Your polite style is very nice. 10040230 -> 1000000300240: It's called "Daisy." 10040240 -> 1000000300250: Daisy, Daisy. 10040250 -> 1000000300260: Give me your answer do. 10040260 -> 1000000300270: I am half crazy. 10040270 -> 1000000300280: All for the love of you. 10040280 -> 1000000300290: It won't be a stylish marriage. 10040290 -> 1000000300300: I can't afford a carriage. 10040300 -> 1000000300310: But you'll look sweet. 10040310 -> 1000000300320: Upon the seat. 10040320 -> 1000000300330: Of a bicycle built for two. 10040330 -> 1000000300340: Human: That reminds me of HAL in "2001 a space odyssey" 10040340 -> 1000000300350: ALICE: "Your polite style is very nice. 10040350 -> 1000000300360: It's called "Daisy." 10040360 -> 1000000300370: Daisy, Daisy. 10040370 -> 1000000300380: Give me your answer do. 10040380 -> 1000000300390: I am half crazy. 10040390 -> 1000000300400: All for the love of you. 10040400 -> 1000000300410: It won't be a stylish marriage. 10040410 -> 1000000300420: I can't afford a carriage. 10040420 -> 1000000300430: But you'll look sweet. 10040430 -> 1000000300440: Upon the seat. 10040440 -> 1000000300450: Of a bicycle built for two." reminds you of HAL in 2001 a space odyssey ? 10040450 -> 1000000300460: Does that make sense? 10040460 -> 1000000300470: Human: Can you tell me something about Stanley Kubrick? 10040470 -> 1000000300480: ALICE: Ambiguous: Ask who, what or where is Stanley Kubrick ? Artificial intelligence 10030010 -> 1000000400020: Artificial intelligence 10030020 -> 1000000400030: Artificial intelligence (AI) is both the intelligence of machines and the branch of computer science which aims to create it. 10030030 -> 1000000400040: Major AI textbooks define artificial intelligence as "the study and design of intelligent agents," where an intelligent agent is a system that perceives its environment and takes actions which maximize its chances of success. 10030040 -> 1000000400050: John McCarthy, who coined the term in 1956, defines it as "the science and engineering of making intelligent machines." 10030050 -> 1000000400060: Among the traits that researchers hope machines will exhibit are reasoning, knowledge, planning, learning, communication, perception and the ability to move and manipulate objects. 10030055 -> 1000000400070: General intelligence (or "strong AI") has not yet been achieved and is a long-term goal of some AI research. 10030060 -> 1000000400080: AI research uses tools and insights from many fields, including computer science, psychology, philosophy, neuroscience, cognitive science, linguistics, ontology, operations research, economics, control theory, probability, optimization and logic. 10030070 -> 1000000400090: AI research also overlaps with tasks such as robotics, control systems, scheduling, data mining, logistics, speech recognition, facial recognition and many others. 10030080 -> 1000000400100: Other names for the field have been proposed, such as computational intelligence, synthetic intelligence, intelligent systems, or computational rationality. 10030090 -> 1000000400110: Perspectives on AI 10030100 -> 1000000400120: AI in myth, fiction and speculation 10030110 -> 1000000400130: Humanity has imagined in great detail the implications of thinking machines or artificial beings. 10030120 -> 1000000400140: They appear in Greek myths, such as Talos of Crete, the golden robots of Hephaestus and Pygmalion's Galatea. 
10030130 -> 1000000400150: The earliest known humanoid robots (or automatons) were sacred statues worshipped in Egypt and Greece, believed to have been endowed with genuine consciousness by craftsman. 10030140 -> 1000000400160: In the sixteenth century, the alchemist Paracelsus claimed to have created artificial beings. 10030150 -> 1000000400170: Realistic clockwork imitations of human beings have been built by people such as Yan Shi, Hero of Alexandria, Al-Jazari and Wolfgang von Kempelen. 10030160 -> 1000000400180: In modern fiction, beginning with Mary Shelley's classic Frankenstein, writers have explored the ethical issues presented by thinking machines. 10030170 -> 1000000400190: If a machine can be created that has intelligence, can it also feel? 10030180 -> 1000000400200: If it can feel, does it have the same rights as a human being? 10030190 -> 1000000400210: This is a key issue in Frankenstein as well as in modern science fiction: for example, the film Artificial Intelligence: A.I. considers a machine in the form of a small boy which has been given the ability to feel human emotions, including, tragically, the capacity to suffer. 10030200 -> 1000000400220: This issue is also being considered by futurists, such as California's Institute for the Future under the name "robot rights", although many critics believe that the discussion is premature. 10030210 -> 1000000400230: Science fiction writers and futurists have also speculated on the technology's potential impact on humanity. 10030220 -> 1000000400240: In fiction, AI has appeared as a servant (R2D2 in Star Wars), a comrade (Lt. Commander Data in Star Trek), an extension to human abilities (Ghost in the Shell), a conqueror (The Matrix), a dictator (With Folded Hands) and an exterminator (Terminator, Battlestar Galactica). 10030230 -> 1000000400250: Some realistic potential consequences of AI are decreased human labor demand, the enhancement of human ability or experience, and a need for redefinition of human identity and basic values. 10030240 -> 1000000400260: Futurists estimate the capabilities of machines using Moore's Law, which measures the relentless exponential improvement in digital technology with uncanny accuracy. 10030250 -> 1000000400270: Ray Kurzweil has calculated that desktop computers will have the same processing power as human brains by the year 2029, and that by 2045 artificial intelligence will reach a point where it is able to improve itself at a rate that far exceeds anything conceivable in the past, a scenario that science fiction writer Vernor Vinge named the "technological singularity". 10030260 -> 1000000400280: "Artificial intelligence is the next stage in evolution," Edward Fredkin said in the 1980s, expressing an idea first proposed by Samuel Butler's Darwin Among the Machines (1863), and expanded upon by George Dyson in his book of the same name (1998). 10030270 -> 1000000400290: Several futurists and science fiction writers have predicted that human beings and machines will merge in the future into cyborgs that are more capable and powerful than either. 10030280 -> 1000000400300: This idea, called transhumanism, has roots in Aldous Huxley and Robert Ettinger, is now associated with robot designer Hans Moravec, cyberneticist Kevin Warwick and Ray Kurzweil. 
10030290 -> 1000000400310: Transhumanism has been illustrated in fiction as well, for example in the manga Ghost in the Shell. 10030300 -> 1000000400320: History of AI research 10030310 -> 1000000400330: In the middle of the 20th century, a handful of scientists began a new approach to building intelligent machines, based on recent discoveries in neurology, a new mathematical theory of information, an understanding of control and stability called cybernetics, and above all the invention of the digital computer, a machine based on the abstract essence of mathematical reasoning. 10030320 -> 1000000400340: The field of modern AI research was founded at a conference on the campus of Dartmouth College in the summer of 1956. 10030330 -> 1000000400350: Those who attended would become the leaders of AI research for many decades, especially John McCarthy, Marvin Minsky, Allen Newell and Herbert Simon, who founded AI laboratories at MIT, CMU and Stanford. 10030340 -> 1000000400360: They and their students wrote programs that were, to most people, simply astonishing: computers were solving word problems in algebra, proving logical theorems and speaking English. 10030350 -> 1000000400370: By the middle 60s their research was heavily funded by the U.S. Department of Defense and they were optimistic about the future of the new field: 10030360 -> 1000000400380: 1965, H. A. Simon: "[M]achines will be capable, within twenty years, of doing any work a man can do" 10030370 -> 1000000400390: 1967, Marvin Minsky: "Within a generation ... the problem of creating 'artificial intelligence' will substantially be solved." 10030380 -> 1000000400400: These predictions, and many like them, would not come true. 10030390 -> 1000000400410: They had failed to recognize the difficulty of some of the problems they faced. 10030400 -> 1000000400420: In 1974, in response to the criticism of England's Sir James Lighthill and ongoing pressure from Congress to fund more productive projects, the U.S. and British governments cut off all undirected, exploratory research in AI. 10030410 -> 1000000400430: This was the first AI Winter. 10030420 -> 1000000400440: In the early 80s, AI research was revived by the commercial success of expert systems (a form of AI program that simulated the knowledge and analytical skills of one or more human experts) and by 1985 the market for AI had reached more than a billion dollars. 10030430 -> 1000000400450: Minsky and others warned the community that enthusiasm for AI had spiraled out of control and that disappointment was sure to follow. 10030440 -> 1000000400460: Beginning with the collapse of the Lisp Machine market in 1987, AI once again fell into disrepute, and a second, more lasting AI Winter began. 10030450 -> 1000000400470: In the 90s and early 21st century AI achieved its greatest successes, albeit somewhat behind the scenes. 10030460 -> 1000000400480: Artificial intelligence was adopted throughout the technology industry, providing the heavy lifting for logistics, data mining, medical diagnosis and many other areas. 10030470 -> 1000000400490: The success was due to several factors: the incredible power of computers today (see Moore's law), a greater emphasis on solving specific subproblems, the creation of new ties between AI and other fields working on similar problems, and above all a new commitment by researchers to solid mathematical methods and rigorous scientific standards. 
10030480 -> 1000000400500: Philosophy of AI 10030490 -> 1000000400510: In a classic 1950 paper, Alan Turing posed the question "Can Machines Think?" 10030500 -> 1000000400520: In the years since, the philosophy of artificial intelligence has attempted to answer it. 10030510 -> 1000000400530: Turing's "polite convention": If a machine acts as intelligently as a human being, then it is as intelligent as a human being. 10030520 -> 1000000400540: Alan Turing theorized that, ultimately, we can only judge the intelligence of machine based on its behavior. 10030530 -> 1000000400550: This theory forms the basis of the Turing test. 10030540 -> 1000000400560: The Dartmouth proposal: Every aspect of learning or any other feature of intelligence can be so precisely described that a machine can be made to simulate it. 10030550 -> 1000000400570: This assertion was printed in the proposal for the Dartmouth Conference of 1956, and represents the position of most working AI researchers. 10030560 -> 1000000400580: Newell and Simon's physical symbol system hypothesis: A physical symbol system has the necessary and sufficient means of general intelligent action. 10030570 -> 1000000400590: This statement claims that the essence of intelligence is symbol manipulation. 10030580 -> 1000000400600: Hubert Dreyfus argued that, on the contrary, human expertise depends on unconscious instinct rather than conscious symbol manipulation and on having a "feel" for the situation rather than explicit symbolic knowledge. 10030590 -> 1000000400610: Gödel's incompleteness theorem: A physical symbol system can not prove all true statements. 10030600 -> 1000000400620: Roger Penrose is among those who claim that Gödel's theorem limits what machines can do. 10030610 -> 1000000400630: Searle's "strong AI position": A physical symbol system can have a mind and mental states. 10030620 -> 1000000400640: Searle counters this assertion with his Chinese room argument, which asks us to look inside the computer and try to find where the "mind" might be. 10030630 -> 1000000400650: The artificial brain argument: The brain can be simulated. 10030640 -> 1000000400660: Hans Moravec, Ray Kurzweil and others have argued that it is technologically feasible to copy the brain directly into hardware and software, and that such a simulation will be essentially identical to the original. 10030650 -> 1000000400670: This argument combines the idea that a suitably powerful machine can simulate any process, with the materialist idea that the mind is the result of a physical process in the brain. 10030660 -> 1000000400680: AI research 10030670 -> 1000000400690: Problems of AI 10030680 -> 1000000400700: While there is no universally accepted definition of intelligence, AI researchers have studied several traits that are considered essential. 10030690 -> 1000000400710: Deduction, reasoning, problem solving 10030700 -> 1000000400720: Early AI researchers developed algorithms that imitated the process of conscious, step-by-step reasoning that human beings use when they solve puzzles, play board games, or make logical deductions. 10030710 -> 1000000400730: By the late 80s and 90s, AI research had also developed highly successful methods for dealing with uncertain or incomplete information, employing concepts from probability and economics. 
10030720 -> 1000000400740: For difficult problems, most of these algorithms can require enormous computational resources — most experience a "combinatorial explosion": the amount of memory or computer time required becomes astronomical when the problem goes beyond a certain size. 10030730 -> 1000000400750: The search for more efficient problem solving algorithms is a high priority for AI research. 10030740 -> 1000000400760: It is not clear, however, that conscious human reasoning is any more efficient when faced with a difficult abstract problem. 10030750 -> 1000000400770: Cognitive scientists have demonstrated that human beings solve most of their problems using unconscious reasoning, rather than the conscious, step-by-step deduction that early AI research was able to model. 10030760 -> 1000000400780: Embodied cognitive science argues that unconscious sensorimotor skills are essential to our problem solving abilities. 10030770 -> 1000000400790: It is hoped that sub-symbolic methods, like computational intelligence and situated AI, will be able to model these instinctive skills. 10030780 -> 1000000400800: The problem of unconscious problem solving, which forms part of our commonsense reasoning, is largely unsolved. 10030790 -> 1000000400810: Knowledge representation 10030800 -> 1000000400820: Knowledge representation and knowledge engineering are central to AI research. 10030810 -> 1000000400830: Many of the problems machines are expected to solve will require extensive knowledge about the world. 10030820 -> 1000000400840: Among the things that AI needs to represent are: objects, properties, categories and relations between objects; situations, events, states and time; causes and effects; knowledge about knowledge (what we know about what other people know); and many other, less well researched domains. 10030830 -> 1000000400850: A complete representation of "what exists" is an ontology (borrowing a word from traditional philosophy), of which the most general are called upper ontologies. 10030840 -> 1000000400860: Among the most difficult problems in knowledge representation are: 10030850 -> 1000000400870: Default reasoning and the qualification problem: Many of the things people know take the form of "working assumptions." 10030860 -> 1000000400880: For example, if a bird comes up in conversation, people typically picture an animal that is fist sized, sings, and flies. 10030870 -> 1000000400890: None of these things are true about birds in general. 10030880 -> 1000000400900: John McCarthy identified this problem in 1969 as the qualification problem: for any commonsense rule that AI researchers care to represent, there tend to be a huge number of exceptions. 10030890 -> 1000000400910: Almost nothing is simply true or false in the way that abstract logic requires. 10030900 -> 1000000400920: AI research has explored a number of solutions to this problem. 10030910 -> 1000000400930: Unconscious knowledge: Much of what people know isn't represented as "facts" or "statements" that they could actually say out loud. 10030920 -> 1000000400940: They take the form of intuitions or tendencies and are represented in the brain unconsciously and sub-symbolically. 10030930 -> 1000000400950: This unconscious knowledge informs, supports and provides a context for our conscious knowledge. 10030940 -> 1000000400960: As with the related problem of unconscious reasoning, it is hoped that situated AI or computational intelligence will provide ways to represent this kind of knowledge. 
10030950 -> 1000000400970: The breadth of common sense knowledge: The number of atomic facts that the average person knows is astronomical. 10030960 -> 1000000400980: Research projects that attempt to build a complete knowledge base of commonsense knowledge, such as Cyc, require enormous amounts of tedious step-by-step ontological engineering — they must be built, by hand, one complicated concept at a time. 10030970 -> 1000000400990: Planning 10030980 -> 1000000401000: Intelligent agents must be able to set goals and achieve them. 10030990 -> 1000000401010: They need a way to visualize the future: they must have a representation of the state of the world and be able to make predictions about how their actions will change it. 10031000 -> 1000000401020: They must also attempt to determine the utility or "value" of the choices available to them. 10031010 -> 1000000401030: In some planning problems, the agent can assume that it is the only thing acting on the world and it can be certain what the consequences of its actions may be. 10031020 -> 1000000401040: However, if this is not true, it must periodically check if the world matches its predictions and it must change its plan as this becomes necessary, requiring the agent to reason under uncertainty. 10031030 -> 1000000401050: Multi-agent planning tries to determine the best plan for a community of agents, using cooperation and competition to achieve a given goal. 10031040 -> 1000000401060: Emergent behavior such as this is used by both evolutionary algorithms and swarm intelligence. 10031050 -> 1000000401070: Learning 10031060 -> 1000000401080: Important machine learning problems are: 10031070 -> 1000000401090: Unsupervised learning: find a model that matches a stream of input "experiences", and be able to predict what new "experiences" to expect. 10031080 -> 1000000401100: Supervised learning, such as classification (be able to determine what category something belongs in, after seeing a number of examples of things from each category), or regression (given a set of numerical input/output examples, discover a continuous function that would generate the outputs from the inputs; see the short sketch below). 10031090 -> 1000000401110: Reinforcement learning: the agent is rewarded for good responses and punished for bad ones. 10031100 -> 1000000401120: (These can be analyzed in terms of decision theory, using concepts like utility.) 10031110 -> 1000000401130: Natural language processing 10031120 -> 1000000401140: Natural language processing gives machines the ability to read and understand the languages human beings speak. 10031130 -> 1000000401150: Many researchers hope that a sufficiently powerful natural language processing system would be able to acquire knowledge on its own, by reading the existing text available over the internet. 10031140 -> 1000000401160: Some straightforward applications of natural language processing include information retrieval (or text mining) and machine translation. 10031150 -> 1000000401170: Motion and manipulation 10031160 -> 1000000401180: The field of robotics is closely related to AI. 10031170 -> 1000000401190: Intelligence is required for robots to be able to handle such tasks as object manipulation and navigation, with sub-problems of localization (knowing where you are), mapping (learning what is around you) and motion planning (figuring out how to get there). 
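As a rough illustration of the regression task mentioned under "Learning" above (discovering a continuous function that generates the outputs from numerical input/output examples), here is a minimal least-squares fit of a straight line in Python; the data points and names are invented for illustration and do not come from any particular system.

# Minimal supervised regression: fit y = w*x + b to example pairs by
# ordinary least squares (closed form, no libraries required).
examples = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]  # (input, output)

n = len(examples)
mean_x = sum(x for x, _ in examples) / n
mean_y = sum(y for _, y in examples) / n

# Slope and intercept that minimize the squared prediction error.
w = sum((x - mean_x) * (y - mean_y) for x, y in examples) / \
    sum((x - mean_x) ** 2 for x, _ in examples)
b = mean_y - w * mean_x

def predict(x):
    return w * x + b

print(f"learned model: y = {w:.2f}*x + {b:.2f}; prediction for x = 4: {predict(4.0):.2f}")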
10031180 -> 1000000401200: Perception 10031190 -> 1000000401210: Machine perception is the ability to use input from sensors (such as cameras, microphones, sonar and others more exotic) to deduce aspects of the world. 10031200 -> 1000000401220: Computer vision is the ability to analyze visual input. 10031210 -> 1000000401230: A few selected subproblems are speech recognition, facial recognition and object recognition. 10031220 -> 1000000401240: Social intelligence 10031230 -> 1000000401250: Emotion and social skills play two roles for an intelligent agent: 10031240 -> 1000000401260: It must be able to predict the actions of others, by understanding their motives and emotional states. 10031250 -> 1000000401270: (This involves elements of game theory, decision theory, as well as the ability to model human emotions and the perceptual skills to detect emotions.) 10031260 -> 1000000401280: For good human-computer interaction, an intelligent machine also needs to display emotions — at the very least it must appear polite and sensitive to the humans it interacts with. 10031270 -> 1000000401290: At best, it should appear to have normal emotions itself. 10031280 -> 1000000401300: Creativity 10031290 -> 1000000401310: A sub-field of AI addresses creativity both theoretically (from a philosophical and psychological perspective) and practically (via specific implementations of systems that generate outputs that can be considered creative). 10031300 -> 1000000401320: General intelligence 10031310 -> 1000000401330: Most researchers hope that their work will eventually be incorporated into a machine with general intelligence (known as strong AI), combining all the skills above and exceeding human abilities at most or all of them. 10031320 -> 1000000401340: A few believe that anthropomorphic features like artificial consciousness or an artificial brain may be required for such a project. 10031330 -> 1000000401350: Many of the problems above are considered AI-complete: to solve one problem, you must solve them all. 10031340 -> 1000000401360: For example, even a straightforward, specific task like machine translation requires that the machine follow the author's argument (reason), know what it's talking about (knowledge), and faithfully reproduce the author's intention (social intelligence). 10031350 -> 1000000401370: Machine translation, therefore, is believed to be AI-complete: it may require strong AI to be done as well as humans can do it. 10031360 -> 1000000401380: Approaches to AI 10031370 -> 1000000401390: There are as many approaches to AI as there are AI researchers—any coarse categorization is likely to be unfair to someone. 10031380 -> 1000000401400: Artificial intelligence communities have grown up around particular problems, institutions and researchers, as well as the theoretical insights that define the approaches described below. 10031390 -> 1000000401410: Artificial intelligence is a young science and is still a fragmented collection of subfields. 10031400 -> 1000000401420: At present, there is no established unifying theory that links the subfields into a coherent whole. 10031410 -> 1000000401430: Cybernetics and brain simulation 10031420 -> 1000000401440: In the 40s and 50s, a number of researchers explored the connection between neurology, information theory, and cybernetics. 10031430 -> 1000000401450: Some of them built machines that used electronic networks to exhibit rudimentary intelligence, such as W. Grey Walter's turtles and the Johns Hopkins Beast. 
10031440 -> 1000000401460: Many of these researchers gathered for meetings of the Teleological Society at Princeton and the Ratio Club in England. 10031450 -> 1000000401470: Traditional symbolic AI 10031460 -> 1000000401480: When access to digital computers became possible in the middle 1950s, AI research began to explore the possibility that human intelligence could be reduced to symbol manipulation. 10031470 -> 1000000401490: The research was centered in three institutions: CMU, Stanford and MIT, and each one developed its own style of research. 10031480 -> 1000000401500: John Haugeland named these approaches to AI "good old fashioned AI" or "GOFAI". 10031490 -> 1000000401510: Cognitive simulation 10031495 -> 1000000401520: Economist Herbert Simon and Alan Newell studied human problem solving skills and attempted to formalize them, and their work laid the foundations of the field of artificial intelligence, as well as cognitive science, operations research and management science. 10031500 -> 1000000401530: Their research team performed psychological experiments to demonstrate the similarities between human problem solving and the programs (such as their "General Problem Solver") they were developing. 10031510 -> 1000000401540: This tradition, centered at Carnegie Mellon University, would eventually culminate in the development of the Soar architecture in the middle 80s. 10031520 -> 1000000401550: Logical AI 10031525 -> 1000000401560: Unlike Newell and Simon, John McCarthy felt that machines did not need to simulate human thought, but should instead try to find the essence of abstract reasoning and problem solving, regardless of whether people used the same algorithms. 10031530 -> 1000000401570: His laboratory at Stanford (SAIL) focused on using formal logic to solve a wide variety of problems, including knowledge representation, planning and learning. 10031540 -> 1000000401580: Work in logic led to the development of the programming language Prolog and the science of logic programming. 10031550 -> 1000000401590: "Scruffy" symbolic AI 10031555 -> 1000000401600: Researchers at MIT (such as Marvin Minsky and Seymour Papert) found that solving difficult problems in vision and natural language processing required ad-hoc solutions – they argued that there was no easy answer, no simple and general principle (like logic) that would capture all the aspects of intelligent behavior. 10031560 -> 1000000401610: Roger Schank described their "anti-logic" approaches as "scruffy" (as opposed to the "neat" paradigms at CMU and Stanford), and this still forms the basis of research into commonsense knowledge bases (such as Doug Lenat's Cyc) which must be built one complicated concept at a time. 10031570 -> 1000000401620: Knowledge based AI 10031575 -> 1000000401630: When computers with large memories became available around 1970, researchers from all three traditions began to build knowledge into AI applications. 10031580 -> 1000000401640: This "knowledge revolution" led to the development and deployment of expert systems (introduced by Edward Feigenbaum), the first truly successful form of AI software. 10031590 -> 1000000401650: The knowledge revolution was also driven by the realization that truly enormous amounts of knowledge would be required by many simple AI applications. 10031600 -> 1000000401660: Sub-symbolic AI 10031610 -> 1000000401670: During the 1960s, symbolic approaches had achieved great success at simulating high-level thinking in small demonstration programs. 
10031620 -> 1000000401680: Approaches based on cybernetics or neural networks were abandoned or pushed into the background. 10031630 -> 1000000401690: By the 1980s, however, progress in symbolic AI seemed to stall and many believed that symbolic systems would never be able to imitate all the processes of human cognition, especially perception, robotics, learning and pattern recognition. 10031640 -> 1000000401700: A number of researchers began to look into "sub-symbolic" approaches to specific AI problems. 10031650 -> 1000000401710: Bottom-up, situated, behavior based or nouvelle AI 10031655 -> 1000000401720: Researchers from the related field of robotics, such as Rodney Brooks, rejected symbolic AI and focussed on the basic engineering problems that would allow robots to move and survive. 10031660 -> 1000000401730: Their work revived the non-symbolic viewpoint of the early cybernetics researchers of the 50s and reintroduced the use of control theory in AI. 10031670 -> 1000000401740: These approaches are also conceptually related to the embodied mind thesis. 10031680 -> 1000000401750: Computational Intelligence 10031685 -> 1000000401760: Interest in neural networks and "connectionism" was revived by David Rumelhart and others in the middle 1980s. 10031690 -> 1000000401770: These and other sub-symbolic approaches, such as fuzzy systems and evolutionary computation, are now studied collectively by the emerging discipline of computational intelligence. 10031700 -> 1000000401780: The new neats 10031705 -> 1000000401790: In the 1990s, AI researchers developed sophisticated mathematical tools to solve specific subproblems. 10031710 -> 1000000401800: These tools are truly scientific, in the sense that their results are both measurable and verifiable, and they have been responsible for many of AI's recent successes. 10031720 -> 1000000401810: The shared mathematical language has also permitted a high level of collaboration with more established fields (like mathematics, economics or operations research). 10031725 -> 1000000401820: {(Harvard citation text+Russell & Norvig (2003)+Russell+Norvig+2003)} describe this movement as nothing less than a "revolution" and "the victory of the neats." 10031730 -> 1000000401830: Intelligent agent paradigm 10031740 -> 1000000401840: The "intelligent agent" paradigm became widely accepted during the 1990s. 10031750 -> 1000000401850: An intelligent agent is a system that perceives its environment and takes actions which maximizes its chances of success. 10031760 -> 1000000401860: The simplest intelligent agents are programs that solve specific problems. 10031770 -> 1000000401870: The most complicated intelligent agents are rational, thinking human beings. 10031780 -> 1000000401880: The paradigm gives researchers license to study isolated problems and find solutions that are both verifiable and useful, without agreeing on one single approach. 10031790 -> 1000000401890: An agent that solves a specific problem can use any approach that works — some agents are symbolic and logical, some are sub-symbolic neural networks and others may use new approaches. 10031800 -> 1000000401900: The paradigm also gives researchers a common language to communicate with other fields—such as decision theory and economics—that also use concepts of abstract agents. 
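The agent paradigm can be summarized in a few lines of code. The sketch below, in Python, is a deliberately minimal and hypothetical illustration (the class, method and action names are invented, not taken from any library): the agent repeatedly perceives its environment and chooses the action it expects to bring it closer to its goal.

# A toy "intelligent agent" loop: perceive, decide, act.
# Any concrete decision method (symbolic, neural, probabilistic) could sit
# behind choose_action.
class ReflexAgent:
    """Keeps a temperature near a set point."""
    def __init__(self, set_point):
        self.set_point = set_point

    def choose_action(self, percept):
        # The policy: pick whichever action moves the percept toward the goal,
        # a stand-in for maximizing the agent's chances of success.
        if percept < self.set_point:
            return "heat"
        if percept > self.set_point:
            return "cool"
        return "idle"

def run(agent, temperature, steps=5):
    for _ in range(steps):
        action = agent.choose_action(temperature)   # act on the current percept
        temperature += {"heat": 1.0, "cool": -1.0, "idle": 0.0}[action]
        print(f"action={action}, temperature is now {temperature:.1f}")

run(ReflexAgent(set_point=20.0), temperature=17.0)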
10031810 -> 1000000401910: Integrating the approaches 10031820 -> 1000000401920: An agent architecture or cognitive architecture allows researchers to build more versatile and intelligent systems out of interacting intelligent agents in a multi-agent system. 10031830 -> 1000000401930: A system with both symbolic and sub-symbolic components is a hybrid intelligent system, and the study of such systems is artificial intelligence systems integration. 10031840 -> 1000000401940: A hierarchical control system provides a bridge between sub-symbolic AI at its lowest, reactive levels and traditional symbolic AI at its highest levels, where relaxed time constraints permit planning and world modelling. 10031850 -> 1000000401950: Rodney Brooks' subsumption architecture was an early proposal for such a hierarchical system. 10031860 -> 1000000401960: Tools of AI research 10031870 -> 1000000401970: In the course of 50 years of research, AI has developed a large number of tools to solve the most difficult problems in computer science. 10031880 -> 1000000401980: A few of the most general of these methods are discussed below. 10031890 -> 1000000401990: Search 10031900 -> 1000000402000: Many problems in AI can be solved in theory by intelligently searching through many possible solutions: Reasoning can be reduced to performing a search. 10031910 -> 1000000402010: For example, logical proof can be viewed as searching for a path that leads from premises to conclusions, where each step is the application of an inference rule. 10031920 -> 1000000402020: Planning algorithms search through trees of goals and subgoals, attempting to find a path to a target goal. 10031930 -> 1000000402030: Robotics algorithms for moving limbs and grasping objects use local searches in configuration space. 10031940 -> 1000000402040: Many learning algorithms have search at their core. 10031950 -> 1000000402050: There are several types of search algorithms: 10031960 -> 1000000402060: "Uninformed" search algorithms eventually search through every possible answer until they locate their goal. 10031970 -> 1000000402070: Naive algorithms quickly run into problems when they expand the size of their search space to astronomical numbers. 10031980 -> 1000000402080: The result is a search that is too slow or never completes. 10031990 -> 1000000402090: Heuristic or "informed" searches use heuristic methods to eliminate choices that are unlikely to lead to their goal, thus drastically reducing the number of possibilities they must explore. 10032000 -> 1000000402100: The elimination of choices that are certain not to lead to the goal is called pruning. 10032010 -> 1000000402110: Local searches, such as hill climbing, simulated annealing and beam search, use techniques borrowed from optimization theory; a short hill-climbing sketch is given below. 10032020 -> 1000000402120: Global searches are more robust in the presence of local optima. 10032030 -> 1000000402130: Techniques include evolutionary algorithms, swarm intelligence and random optimization algorithms. 10032040 -> 1000000402140: Logic 10032050 -> 1000000402150: Logic was introduced into AI research by John McCarthy in his 1958 Advice Taker proposal. 10032060 -> 1000000402160: The most important technical development was J. Alan Robinson's discovery of the resolution and unification algorithm for logical deduction in 1963. 10032070 -> 1000000402170: This procedure is simple, complete and entirely algorithmic, and can easily be performed by digital computers. 
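Returning to the search methods described above, local search can be illustrated in a few lines of Python. The objective function, step size and starting point below are made-up choices for illustration; the loop itself is a generic hill-climbing search that moves to the best neighboring solution until no neighbor improves the score.

import random

def score(x):
    # Invented objective with its maximum at x = 3.
    return -(x - 3.0) ** 2

def hill_climb(start, step=0.1, max_iters=1000):
    current = start
    for _ in range(max_iters):
        neighbors = [current - step, current + step]
        best = max(neighbors, key=score)
        if score(best) <= score(current):   # no improvement: a local optimum
            break
        current = best
    return current

random.seed(0)
print(hill_climb(start=random.uniform(-10.0, 10.0)))   # converges near 3.0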
10032080 -> 1000000402180: A naive implementation of the resolution algorithm, however, quickly leads to a combinatorial explosion or an infinite loop. 10032090 -> 1000000402190: In 1974, Robert Kowalski suggested representing logical expressions as Horn clauses (statements in the form of rules: "if p then q"), which reduced logical deduction to backward chaining or forward chaining. 10032100 -> 1000000402200: This greatly alleviated (but did not eliminate) the problem. 10032110 -> 1000000402210: Logic is used for knowledge representation and problem solving, but it can be applied to other problems as well. 10032120 -> 1000000402220: For example, the satplan algorithm uses logic for planning, and inductive logic programming is a method for learning. 10032130 -> 1000000402230: There are several different forms of logic used in AI research. 10032140 -> 1000000402240: Propositional logic or sentential logic is the logic of statements which can be true or false. 10032150 -> 1000000402250: First-order logic also allows the use of quantifiers and predicates, and can express facts about objects, their properties, and their relations with each other. 10032160 -> 1000000402260: Fuzzy logic is a version of first-order logic which allows the truth of a statement to be represented as a value between 0 and 1, rather than simply True (1) or False (0). 10032170 -> 1000000402270: Fuzzy systems can be used for uncertain reasoning and have been widely used in modern industrial and consumer product control systems. 10032180 -> 1000000402280: Default logics, non-monotonic logics and circumscription are forms of logic designed to help with default reasoning and the qualification problem. 10032190 -> 1000000402290: Several extensions of logic have been designed to handle specific domains of knowledge, such as: description logics; situation calculus, event calculus and fluent calculus (for representing events and time); causal calculus; belief calculus; and modal logics. 10032200 -> 1000000402300: Probabilistic methods for uncertain reasoning 10032210 -> 1000000402310: Many problems in AI (in reasoning, planning, learning, perception and robotics) require the agent to operate with incomplete or uncertain information. 10032220 -> 1000000402320: Starting in the late 80s and early 90s, Judea Pearl and others championed the use of methods drawn from probability theory and economics to devise a number of powerful tools to solve these problems. 10032230 -> 1000000402330: Bayesian networks are a very general tool that can be used for a large number of problems: reasoning (using the Bayesian inference algorithm), learning (using the expectation-maximization algorithm), planning (using decision networks) and perception (using dynamic Bayesian networks). 10032240 -> 1000000402340: Probabilistic algorithms can also be used for filtering, prediction, smoothing and finding explanations for streams of data, helping perception systems to analyze processes that occur over time (e.g., hidden Markov models and Kalman filters). 10032250 -> 1000000402350: Planning problems have also taken advantage of other tools from economics, such as decision theory and decision analysis, information value theory, Markov decision processes, dynamic decision networks, game theory and mechanism design. 10032260 -> 1000000402360: Classifiers and statistical learning methods 10032270 -> 1000000402370: The simplest AI applications can be divided into two types: classifiers ("if shiny then diamond") and controllers ("if shiny then pick up"). 
10032280 -> 1000000402380: Controllers do, however, also classify conditions before inferring actions, and therefore classification forms a central part of many AI systems. 10032290 -> 1000000402390: Classifiers are functions that use pattern matching to determine a closest match; a short nearest-neighbor sketch is given below. 10032300 -> 1000000402400: They can be tuned according to examples, making them very attractive for use in AI. 10032310 -> 1000000402410: These examples are known as observations or patterns. 10032320 -> 1000000402420: In supervised learning, each pattern belongs to a certain predefined class. 10032330 -> 1000000402430: A class can be seen as a decision that has to be made. 10032340 -> 1000000402440: All the observations combined with their class labels are known as a data set. 10032350 -> 1000000402450: When a new observation is received, that observation is classified based on previous experience. 10032360 -> 1000000402460: A classifier can be trained in various ways; there are many statistical and machine learning approaches. 10032370 -> 1000000402470: A wide range of classifiers are available, each with its strengths and weaknesses. 10032380 -> 1000000402480: Classifier performance depends greatly on the characteristics of the data to be classified. 10032390 -> 1000000402490: There is no single classifier that works best on all given problems; this is also referred to as the "no free lunch" theorem. 10032400 -> 1000000402500: Various empirical tests have been performed to compare classifier performance and to find the characteristics of data that determine classifier performance. 10032410 -> 1000000402510: Determining a suitable classifier for a given problem is, however, still more an art than a science. 10032420 -> 1000000402520: The most widely used classifiers are the neural network, kernel methods such as the support vector machine, k-nearest neighbor algorithm, Gaussian mixture model, naive Bayes classifier, and decision tree. 10032430 -> 1000000402530: The performance of these classifiers has been compared over a wide range of classification tasks in order to find data characteristics that determine classifier performance. 10032440 -> 1000000402540: Neural networks 10032450 -> 1000000402550: The study of artificial neural networks began with cybernetics researchers, working in the decade before the field of AI research was founded. 10032460 -> 1000000402560: In the 1960s Frank Rosenblatt developed an important early version, the perceptron. 10032470 -> 1000000402570: Paul Werbos developed the backpropagation algorithm for multilayer perceptrons in 1974, which led to a renaissance in neural network research and connectionism in general in the middle 1980s. 10032480 -> 1000000402580: Other common network architectures which have been developed include the feedforward neural network, the radial basis network, the Kohonen self-organizing map and various recurrent neural networks. 10032490 -> 1000000402590: The Hopfield net, a form of attractor network, was first described by John Hopfield in 1982. 10032500 -> 1000000402600: Neural networks are applied to the problem of learning, using such techniques as Hebbian learning, holographic associative memory and the relatively new field of hierarchical temporal memory, which simulates the architecture of the neocortex. 10032510 -> 1000000402610: Social and emergent models 10032520 -> 1000000402620: Several algorithms for learning use tools from evolutionary computation, such as genetic algorithms, swarm intelligence and genetic programming. 
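As a concrete, toy illustration of the classifiers discussed above, the sketch below implements a one-nearest-neighbor classifier in plain Python; the observations, features and class labels are invented for illustration.

import math

# 1-nearest-neighbor classification: label a new observation with the class
# of the closest example in the training data set.
data_set = [
    ((1.0, 1.0), "small"),
    ((1.2, 0.8), "small"),
    ((6.0, 6.5), "large"),
    ((5.5, 7.0), "large"),
]   # (feature vector, class label) pairs: the observations and their labels

def classify(observation):
    def distance(example):
        features, _ = example
        return math.dist(features, observation)
    _, label = min(data_set, key=distance)
    return label

print(classify((1.1, 0.9)))   # -> "small"
print(classify((6.2, 6.0)))   # -> "large"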
10032530 -> 1000000402630: Control theory 10032540 -> 1000000402640: Control theory, the grandchild of cybernetics, has many important applications, especially in robotics. 10032550 -> 1000000402650: Specialized languages 10032560 -> 1000000402660: AI researchers have developed several specialized languages for AI research: 10032570 -> 1000000402670: IPL, one of the first programming languages, developed by Allen Newell, Herbert Simon and J. C. Shaw. 10032580 -> 1000000402680: Lisp was developed by John McCarthy at MIT in 1958. 10032590 -> 1000000402690: There are many dialects of Lisp in use today. 10032600 -> 1000000402700: Prolog, a language based on logic programming, was invented by French researchers Alain Colmerauer and Philippe Roussel, in collaboration with Robert Kowalski of the University of Edinburgh. 10032610 -> 1000000402710: STRIPS, a planning language developed at Stanford in the 1960s. 10032620 -> 1000000402720: Planner, developed at MIT around the same time. 10032630 -> 1000000402730: AI applications are also often written in standard languages like C++ and languages designed for mathematics, such as Matlab and Lush. 10032640 -> 1000000402740: Evaluating artificial intelligence 10032650 -> 1000000402750: How can one determine if an agent is intelligent? 10032660 -> 1000000402760: In 1950, Alan Turing proposed a general procedure to test the intelligence of an agent, now known as the Turing test. 10032670 -> 1000000402770: This procedure allows almost all the major problems of artificial intelligence to be tested. 10032680 -> 1000000402780: However, it is a very difficult challenge and at present all agents fail. 10032690 -> 1000000402790: Artificial intelligence can also be evaluated on specific problems such as small problems in chemistry, handwriting recognition and game-playing. 10032700 -> 1000000402800: Such tests have been termed subject matter expert Turing tests. 10032710 -> 1000000402810: Smaller problems provide more achievable goals and there is an ever-increasing number of positive results. 10032720 -> 1000000402820: The broad classes of outcome for an AI test are: 10032730 -> 1000000402830: optimal: it is not possible to perform better 10032740 -> 1000000402840: strong super-human: performs better than all humans 10032750 -> 1000000402850: super-human: performs better than most humans 10032760 -> 1000000402860: sub-human: performs worse than most humans 10032770 -> 1000000402870: For example, performance at checkers (draughts) is optimal, performance at chess is super-human and nearing strong super-human, and performance at many everyday tasks performed by humans is sub-human. 10032780 -> 1000000402880: Competitions and prizes 10032790 -> 1000000402890: There are a number of competitions and prizes to promote research in artificial intelligence. 10032800 -> 1000000402900: The main areas promoted are: general machine intelligence, conversational behaviour, data-mining, driverless cars, robot soccer and games. 10032810 -> 1000000402910: Applications of artificial intelligence 10032820 -> 1000000402920: Artificial intelligence has successfully been used in a wide range of fields including medical diagnosis, stock trading, robot control, law, scientific discovery and toys. 10032830 -> 1000000402930: Frequently, when a technique reaches mainstream use, it is no longer considered artificial intelligence; this phenomenon is sometimes described as the AI effect. 10032840 -> 1000000402940: It may also become integrated into artificial life.
Artificial neural network 10050010 -> 1000000500020: Artificial neural network 10050020 -> 1000000500030: An artificial neural network (ANN), often just called a "neural network" (NN), is a mathematical model or computational model based on biological neural networks. 10050030 -> 1000000500040: It consists of an interconnected group of artificial neurons and processes information using a connectionist approach to computation. 10050040 -> 1000000500050: In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase. 10050050 -> 1000000500060: In more practical terms neural networks are non-linear statistical data modeling tools. 10050060 -> 1000000500070: They can be used to model complex relationships between inputs and outputs or to find patterns in data. 10050070 -> 1000000500080: Background 10050080 -> 1000000500090: There is no precise agreed-upon definition among researchers as to what a neural network is, but most would agree that it involves a network of simple processing elements (neurons), which can exhibit complex global behavior, determined by the connections between the processing elements and element parameters. 10050090 -> 1000000500100: The original inspiration for the technique was from examination of the central nervous system and the neurons (and their axons, dendrites and synapses) which constitute one of its most significant information processing elements (see Neuroscience). 10050100 -> 1000000500110: In a neural network model, simple nodes (called variously "neurons", "neurodes", "PEs" ("processing elements") or "units") are connected together to form a network of nodes — hence the term "neural network." 10050110 -> 1000000500120: While a neural network does not have to be adaptive per se, its practical use comes with algorithms designed to alter the strength (weights) of the connections in the network to produce a desired signal flow. 10050120 -> 1000000500130: These networks are also similar to the biological neural networks in the sense that functions are performed collectively and in parallel by the units, rather than there being a clear delineation of subtasks to which various units are assigned (see also connectionism). 10050130 -> 1000000500140: Currently, the term Artificial Neural Network (ANN) tends to refer mostly to neural network models employed in statistics, cognitive psychology and artificial intelligence. 10050140 -> 1000000500150: Neural network models designed with emulation of the central nervous system (CNS) in mind are a subject of theoretical neuroscience (computational neuroscience). 10050150 -> 1000000500160: In modern software implementations of artificial neural networks the approach inspired by biology has more or less been abandoned for a more practical approach based on statistics and signal processing. 10050160 -> 1000000500170: In some of these systems neural networks, or parts of neural networks (such as artificial neurons) are used as components in larger systems that combine both adaptive and non-adaptive elements. 10050170 -> 1000000500180: While the more general approach of such adaptive systems is more suitable for real-world problem solving, it has far less to do with the traditional artificial intelligence connectionist models. 10050180 -> 1000000500190: What they do, however, have in common is the principle of non-linear, distributed, parallel and local processing and adaptation. 
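As a minimal illustration of the "simple processing element" described in this background section, the sketch below implements a single artificial neuron as a weighted sum of its inputs passed through a squashing nonlinearity; the sigmoid activation and all numeric values are illustrative assumptions, not a prescription.

    import math

    def neuron(inputs, weights, bias):
        # One processing element: weighted sum of inputs plus a bias,
        # passed through a logistic (sigmoid) nonlinearity.
        activation = sum(w * x for w, x in zip(weights, inputs)) + bias
        return 1.0 / (1.0 + math.exp(-activation))

    # Arbitrary illustrative values; a learning algorithm would adjust
    # the weights to produce a desired signal flow through the network.
    print(neuron([0.5, -1.0, 0.25], weights=[0.8, 0.2, -0.5], bias=0.1))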
10050190 -> 1000000500200: Models 10050200 -> 1000000500210: Neural network models in artificial intelligence are usually referred to as artificial neural networks (ANNs); these are essentially simple mathematical models defining a function f : X \rightarrow Y . 10050210 -> 1000000500220: Each type of ANN model corresponds to a class of such functions. 10050220 -> 1000000500230: The network in artificial neural network 10050230 -> 1000000500240: The word network in the term 'artificial neural network' arises because the function f(x) is defined as a composition of other functions g_i(x), which can further be defined as a composition of other functions. 10050240 -> 1000000500250: This can be conveniently represented as a network structure, with arrows depicting the dependencies between variables. 10050250 -> 1000000500260: A widely used type of composition is the nonlinear weighted sum, where f (x) = K \left(\sum_i w_i g_i(x)\right) , where K is some predefined function, such as the hyperbolic tangent. 10050260 -> 1000000500270: It will be convenient for the following to refer to a collection of functions g_i as simply a vector g = (g_1, g_2, \ldots, g_n). 10050270 -> 1000000500280: This figure depicts such a decomposition of f, with dependencies between variables indicated by arrows. 10050280 -> 1000000500290: These can be interpreted in two ways. 10050290 -> 1000000500300: The first view is the functional view: the input x is transformed into a 3-dimensional vector h, which is then transformed into a 2-dimensional vector g, which is finally transformed into f. 10050300 -> 1000000500310: This view is most commonly encountered in the context of optimization. 10050310 -> 1000000500320: The second view is the probabilistic view: the random variable F = f(G) depends upon the random variable G = g(H), which depends upon H=h(X), which depends upon the random variable X. 10050320 -> 1000000500330: This view is most commonly encountered in the context of graphical models. 10050330 -> 1000000500340: The two views are largely equivalent. 10050340 -> 1000000500350: In either case, for this particular network architecture, the components of individual layers are independent of each other (e.g., the components of g are independent of each other given their input h). 10050350 -> 1000000500360: This naturally enables a degree of parallelism in the implementation. 10050360 -> 1000000500370: Networks such as the previous one are commonly called feedforward, because their graph is a directed acyclic graph. 10050370 -> 1000000500380: Networks with cycles are commonly called recurrent. 10050380 -> 1000000500390: Such networks are commonly depicted in the manner shown at the top of the figure, where f is shown as being dependent upon itself. 10050390 -> 1000000500400: However, there is an implied temporal dependence which is not shown. 10050400 -> 1000000500410: What this actually means in practice is that the value of f at some point in time t depends upon the values of f at zero or at one or more other points in time. 10050410 -> 1000000500420: The graphical model at the bottom of the figure illustrates the case: the value of f at time t only depends upon its last value. 
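A minimal sketch of the composition described above, with the outer function K taken to be the hyperbolic tangent and each inner function g_i itself a nonlinear weighted sum of the input; the layer sizes and weights are arbitrary illustrative choices.

    import math

    def layer(x, weight_matrix, K=math.tanh):
        # Each output component is K(sum_i w_i * x_i): a nonlinear weighted sum.
        return [K(sum(w * xi for w, xi in zip(row, x))) for row in weight_matrix]

    # f is a composition of two layers: x (2-dim) -> g (3-dim) -> f (scalar).
    W_inner = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]   # defines g_1, g_2, g_3
    W_outer = [[0.7, -0.5, 0.2]]                        # weights w_i of the outer sum

    def f(x):
        g = layer(x, W_inner)          # g = (g_1(x), g_2(x), g_3(x))
        return layer(g, W_outer)[0]    # f(x) = K(sum_i w_i * g_i(x))

    print(f([1.0, 0.5]))

Because each component of a layer depends only on the input the layer receives, the components can be evaluated independently, which is the parallelism noted above.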
10050420 -> 1000000500430: Learning 10050430 -> 1000000500440: However interesting such functions may be in themselves, what has attracted the most interest in neural networks is the possibility of learning, which in practice means the following: 10050440 -> 1000000500450: Given a specific task to solve, and a class of functions F, learning means using a set of observations, in order to find f^* \in F which solves the task in an optimal sense. 10050450 -> 1000000500460: This entails defining a cost function C : F \rightarrow \mathbb{R} such that, for the optimal solution f^*, C(f^*) \leq C(f) \forall f \in F (no solution has a cost less than the cost of the optimal solution). 10050460 -> 1000000500470: The cost function C is an important concept in learning, as it is a measure of how far away we are from an optimal solution to the problem that we want to solve. 10050470 -> 1000000500480: Learning algorithms search through the solution space in order to find a function that has the smallest possible cost. 10050480 -> 1000000500490: For applications where the solution is dependent on some data, the cost must necessarily be a function of the observations, otherwise we would not be modelling anything related to the data. 10050490 -> 1000000500500: It is frequently defined as a statistic to which only approximations can be made. 10050500 -> 1000000500510: As a simple example consider the problem of finding the model f which minimizes C=E\left[(f(x) - y)^2\right], for data pairs (x,y) drawn from some distribution \mathcal{D}. 10050510 -> 1000000500520: In practical situations we would only have N samples from \mathcal{D} and thus, for the above example, we would only minimize \hat{C}=\frac{1}{N}\sum_{i=1}^N (f(x_i)-y_i)^2. 10050520 -> 1000000500530: Thus, the cost is minimized over a sample of the data rather than the true data distribution. 10050530 -> 1000000500540: When N \rightarrow \infty some form of online learning must be used, where the cost is partially minimized as each new example is seen. 10050540 -> 1000000500550: While online learning is often used when \mathcal{D} is fixed, it is most useful in the case where the distribution changes slowly over time. 10050550 -> 1000000500560: In neural network methods, some form of online learning is frequently also used for finite datasets. 10050560 -> 1000000500570: Choosing a cost function 10050570 -> 1000000500580: While it is possible to arbitrarily define some ad hoc cost function, frequently a particular cost will be used either because it has desirable properties (such as convexity) or because it arises naturally from a particular formulation of the problem (i.e., In a probabilistic formulation the posterior probability of the model can be used as an inverse cost). 10050580 -> 1000000500590: Ultimately, the cost function will depend on the task we wish to perform. 10050590 -> 1000000500600: The three main categories of learning tasks are overviewed below. 10050600 -> 1000000500610: Learning paradigms 10050610 -> 1000000500620: There are three major learning paradigms, each corresponding to a particular abstract learning task. 10050620 -> 1000000500630: These are supervised learning, unsupervised learning and reinforcement learning. 10050630 -> 1000000500640: Usually any given type of network architecture can be employed in any of those tasks. 
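To make the empirical cost \hat{C} concrete, the sketch below fits a one-parameter model f(x) = a*x to a handful of invented (x, y) pairs by plain gradient descent on the mean squared error; the data, the model and the learning rate are all assumptions made for illustration, and gradient-based minimisation of exactly this kind is what the later section on learning algorithms refers to.

    # Minimise C_hat = (1/N) * sum_i (f(x_i) - y_i)^2 for the model f(x) = a*x.
    samples = [(0.0, 0.1), (1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # invented data

    def cost(a):
        return sum((a * x - y) ** 2 for x, y in samples) / len(samples)

    def grad(a):
        # dC/da = (2/N) * sum_i (a*x_i - y_i) * x_i
        return 2.0 * sum((a * x - y) * x for x, y in samples) / len(samples)

    a, learning_rate = 0.0, 0.05
    for _ in range(200):
        a -= learning_rate * grad(a)       # step against the gradient

    print(a, cost(a))   # a ends up near 2, the slope underlying the invented data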
10050640 -> 1000000500650: Supervised learning 10050650 -> 1000000500660: In supervised learning, we are given a set of example pairs (x, y), x \in X, y \in Y and the aim is to find a function f : X \rightarrow Y in the allowed class of functions that matches the examples. 10050660 -> 1000000500670: In other words, we wish to infer the mapping implied by the data; the cost function is related to the mismatch between our mapping and the data and it implicitly contains prior knowledge about the problem domain. 10050670 -> 1000000500680: A commonly used cost is the mean-squared error which tries to minimize the average error between the network's output, f(x), and the target value y over all the example pairs. 10050680 -> 1000000500690: When one tries to minimise this cost using gradient descent for the class of neural networks called Multi-Layer Perceptrons, one obtains the common and well-known backpropagation algorithm for training neural networks. 10050690 -> 1000000500700: Tasks that fall within the paradigm of supervised learning are pattern recognition (also known as classification) and regression (also known as function approximation). 10050700 -> 1000000500710: The supervised learning paradigm is also applicable to sequential data (e.g., for speech and gesture recognition). 10050710 -> 1000000500720: This can be thought of as learning with a "teacher," in the form of a function that provides continuous feedback on the quality of solutions obtained thus far. 10050720 -> 1000000500730: Unsupervised learning 10050730 -> 1000000500740: In unsupervised learning we are given some data x, and the cost function to be minimized can be any function of the data x and the network's output, f. 10050740 -> 1000000500750: The cost function is dependent on the task (what we are trying to model) and our a priori assumptions (the implicit properties of our model, its parameters and the observed variables). 10050750 -> 1000000500760: As a trivial example, consider the model f(x) = a, where a is a constant and the cost C=E[(x - f(x))^2]. 10050760 -> 1000000500770: Minimizing this cost will give us a value of a that is equal to the mean of the data. 10050770 -> 1000000500780: The cost function can be much more complicated. 10050780 -> 1000000500790: Its form depends on the application: For example in compression it could be related to the mutual information between x and y. 10050790 -> 1000000500800: In statistical modelling, it could be related to the posterior probability of the model given the data. 10050800 -> 1000000500810: (Note that in both of those examples those quantities would be maximized rather than minimised). 10050810 -> 1000000500820: Tasks that fall within the paradigm of unsupervised learning are in general estimation problems; the applications include clustering, the estimation of statistical distributions, compression and filtering. 10050820 -> 1000000500830: Reinforcement learning 10050830 -> 1000000500840: In reinforcement learning, data x is usually not given, but generated by an agent's interactions with the environment. 10050840 -> 1000000500850: At each point in time t, the agent performs an action y_t and the environment generates an observation x_t and an instantaneous cost c_t, according to some (usually unknown) dynamics. 10050850 -> 1000000500860: The aim is to discover a policy for selecting actions that minimizes some measure of a long-term cost, i.e. the expected cumulative cost. 
10050860 -> 1000000500870: The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated. 10050870 -> 1000000500880: More formally, the environment is modeled as a Markov decision process (MDP) with states {s_1,...,s_n}\in S and actions {a_1,...,a_m} \in A with the following probability distributions: the instantaneous cost distribution P(c_t|s_t), the observation distribution P(x_t|s_t) and the transition P(s_{t+1}|s_t, a_t), while a policy is defined as a conditional distribution over actions given the observations. 10050880 -> 1000000500890: Taken together, the two define a Markov chain (MC). 10050890 -> 1000000500900: The aim is to discover the policy that minimizes the cost, i.e. the MC for which the cost is minimal. 10050900 -> 1000000500910: ANNs are frequently used in reinforcement learning as part of the overall algorithm. 10050910 -> 1000000500920: Tasks that fall within the paradigm of reinforcement learning are control problems, games and other sequential decision making tasks. 10050920 -> 1000000500930: See also: dynamic programming, stochastic control 10050930 -> 1000000500940: Learning algorithms 10050940 -> 1000000500950: Training a neural network model essentially means selecting one model from the set of allowed models (or, in a Bayesian framework, determining a distribution over the set of allowed models) that minimises the cost criterion. 10050950 -> 1000000500960: There are numerous algorithms available for training neural network models; most of them can be viewed as a straightforward application of optimization theory and statistical estimation. 10050960 -> 1000000500970: Most of the algorithms used in training artificial neural networks employ some form of gradient descent. 10050970 -> 1000000500980: This is done by simply taking the derivative of the cost function with respect to the network parameters and then changing those parameters in a gradient-related direction. 10050980 -> 1000000500990: Evolutionary methods, simulated annealing, expectation-maximization and non-parametric methods are among the other commonly used methods for training neural networks. 10050990 -> 1000000501000: See also machine learning. 10051000 -> 1000000501010: Temporal perceptual learning relies on finding temporal relationships in sensory signal streams. 10051010 -> 1000000501020: In an environment, statistically salient temporal correlations can be found by monitoring the arrival times of sensory signals. 10051020 -> 1000000501030: This is done by the perceptual network. 10051030 -> 1000000501040: Employing artificial neural networks 10051040 -> 1000000501050: Perhaps the greatest advantage of ANNs is their ability to be used as an arbitrary function approximation mechanism which 'learns' from observed data. 10051050 -> 1000000501060: However, using them is not so straightforward, and a relatively good understanding of the underlying theory is essential. 10051060 -> 1000000501070: Choice of model: This will depend on the data representation and the application. 10051070 -> 1000000501080: Overly complex models tend to lead to problems with learning. 10051080 -> 1000000501090: Learning algorithm: There are numerous tradeoffs between learning algorithms. 10051090 -> 1000000501100: Almost any algorithm will work well with the correct hyperparameters for training on a particular fixed dataset. 10051100 -> 1000000501110: However, selecting and tuning an algorithm for training on unseen data requires a significant amount of experimentation.
10051110 -> 1000000501120: Robustness: If the model, cost function and learning algorithm are selected appropriately, the resulting ANN can be extremely robust. 10051120 -> 1000000501130: With the correct implementation, ANNs can be used naturally in online learning and large dataset applications. 10051130 -> 1000000501140: Their simple implementation and the existence of mostly local dependencies exhibited in the structure allow for fast, parallel implementations in hardware. 10051140 -> 1000000501150: Applications 10051150 -> 1000000501160: The utility of artificial neural network models lies in the fact that they can be used to infer a function from observations. 10051160 -> 1000000501170: This is particularly useful in applications where the complexity of the data or task makes the design of such a function by hand impractical. 10051170 -> 1000000501180: Real life applications 10051180 -> 1000000501190: The tasks to which artificial neural networks are applied tend to fall within the following broad categories: 10051190 -> 1000000501200: Function approximation, or regression analysis, including time series prediction and modeling. 10051200 -> 1000000501210: Classification, including pattern and sequence recognition, novelty detection and sequential decision making. 10051210 -> 1000000501220: Data processing, including filtering, clustering, blind source separation and compression. 10051220 -> 1000000501230: Application areas include system identification and control (vehicle control, process control), game-playing and decision making (backgammon, chess, racing), pattern recognition (radar systems, face identification, object recognition and more), sequence recognition (gesture, speech, handwritten text recognition), medical diagnosis, financial applications (automated trading systems), data mining (or knowledge discovery in databases, "KDD"), visualization and e-mail spam filtering. 10051230 -> 1000000501240: Neural network software 10051240 -> 1000000501250: Neural network software is used to simulate, research, develop and apply artificial neural networks, biological neural networks and, in some cases, a wider array of adaptive systems. 10051250 -> 1000000501260: See also logistic regression. 10051260 -> 1000000501270: Types of neural networks 10051270 -> 1000000501280: Feedforward neural network 10051280 -> 1000000501290: The feedforward neural network was the first and arguably simplest type of artificial neural network devised. 10051290 -> 1000000501300: In this network, the information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any) and to the output nodes. 10051300 -> 1000000501310: There are no cycles or loops in the network. 10051310 -> 1000000501320: Radial basis function (RBF) network 10051320 -> 1000000501330: Radial basis functions are powerful techniques for interpolation in multidimensional space. 10051330 -> 1000000501340: An RBF is a function which has a distance criterion with respect to a centre built into it. 10051340 -> 1000000501350: Radial basis functions have been applied in the area of neural networks where they may be used as a replacement for the sigmoidal hidden layer transfer characteristic in Multi-Layer Perceptrons. 10051350 -> 1000000501360: RBF networks have two layers of processing: In the first, input is mapped onto each RBF in the 'hidden' layer. 10051360 -> 1000000501370: The RBF chosen is usually a Gaussian.
10051370 -> 1000000501380: In regression problems the output layer is then a linear combination of hidden layer values representing the mean predicted output. 10051380 -> 1000000501390: The interpretation of this output layer value is the same as a regression model in statistics. 10051390 -> 1000000501400: In classification problems the output layer is typically a sigmoid function of a linear combination of hidden layer values, representing a posterior probability. 10051400 -> 1000000501410: Performance in both cases is often improved by shrinkage techniques, known as ridge regression in classical statistics and known to correspond to a prior belief in small parameter values (and therefore smooth output functions) in a Bayesian framework. 10051410 -> 1000000501420: RBF networks have the advantage of not suffering from local minima in the same way as Multi-Layer Perceptrons. 10051420 -> 1000000501430: This is because the only parameters that are adjusted in the learning process are the weights of the linear mapping from hidden layer to output layer. 10051430 -> 1000000501440: Linearity ensures that the error surface is quadratic and therefore has a single, easily found minimum. 10051440 -> 1000000501450: In regression problems this can be found in one matrix operation. 10051450 -> 1000000501460: In classification problems the fixed non-linearity introduced by the sigmoid output function is most efficiently dealt with using iteratively re-weighted least squares. 10051460 -> 1000000501470: RBF networks have the disadvantage of requiring good coverage of the input space by radial basis functions. 10051470 -> 1000000501480: RBF centres are determined with reference to the distribution of the input data, but without reference to the prediction task. 10051480 -> 1000000501490: As a result, representational resources may be wasted on areas of the input space that are irrelevant to the learning task. 10051490 -> 1000000501500: A common solution is to associate each data point with its own centre, although this can make the linear system that has to be solved in the final layer rather large, and requires shrinkage techniques to avoid overfitting. 10051500 -> 1000000501510: Associating each input datum with an RBF leads naturally to kernel methods such as Support Vector Machines and Gaussian Processes (the RBF is the kernel function). 10051510 -> 1000000501520: All three approaches use a non-linear kernel function to project the input data into a space where the learning problem can be solved using a linear model. 10051520 -> 1000000501530: Like Gaussian Processes, and unlike SVMs, RBF networks are typically trained in a Maximum Likelihood framework by maximizing the probability (minimizing the error) of the data under the model. 10051530 -> 1000000501540: SVMs take a different approach to avoiding overfitting, by instead maximizing a margin. 10051540 -> 1000000501550: RBF networks are outperformed in most classification applications by SVMs. 10051550 -> 1000000501560: In regression applications they can be competitive when the dimensionality of the input space is relatively small. 10051560 -> 1000000501570: Kohonen self-organizing network 10051570 -> 1000000501580: The self-organizing map (SOM) invented by Teuvo Kohonen uses a form of unsupervised learning. 10051580 -> 1000000501590: A set of artificial neurons learns to map points in an input space to coordinates in an output space.
10051590 -> 1000000501600: The input space can have different dimensions and topology from the output space, and the SOM will attempt to preserve these. 10051600 -> 1000000501610: Recurrent network 10051610 -> 1000000501620: Contrary to feedforward networks, recurrent neural networks (RNNs) are models with bi-directional data flow. 10051620 -> 1000000501630: While a feedforward network propagates data linearly from input to output, RNNs also propagate data from later processing stages to earlier stages. 10051630 -> 1000000501640: Simple recurrent network 10051640 -> 1000000501650: A simple recurrent network (SRN) is a variation on the Multi-Layer Perceptron, sometimes called an "Elman network" due to its invention by Jeff Elman. 10051650 -> 1000000501660: A three-layer network is used, with the addition of a set of "context units" in the input layer. 10051660 -> 1000000501670: There are connections from the middle (hidden) layer to these context units, fixed with a weight of one. 10051670 -> 1000000501680: At each time step, the input is propagated in a standard feed-forward fashion, and then a learning rule (usually back-propagation) is applied. 10051680 -> 1000000501690: The fixed back connections result in the context units always maintaining a copy of the previous values of the hidden units (since they propagate over the connections before the learning rule is applied). 10051690 -> 1000000501700: Thus the network can maintain a sort of state, allowing it to perform such tasks as sequence-prediction that are beyond the power of a standard Multi-Layer Perceptron. 10051700 -> 1000000501710: In a fully recurrent network, every neuron receives inputs from every other neuron in the network. 10051710 -> 1000000501720: These networks are not arranged in layers. 10051720 -> 1000000501730: Usually only a subset of the neurons receive external inputs in addition to the inputs from all the other neurons, and another disjoint subset of neurons report their output externally as well as sending it to all the neurons. 10051730 -> 1000000501740: These distinctive inputs and outputs perform the function of the input and output layers of a feed-forward or simple recurrent network, and also join all the other neurons in the recurrent processing. 10051740 -> 1000000501750: Hopfield network 10051750 -> 1000000501760: The Hopfield network is a recurrent neural network in which all connections are symmetric. 10051760 -> 1000000501770: Invented by John Hopfield in 1982, this network guarantees that its dynamics will converge. 10051770 -> 1000000501780: If the connections are trained using Hebbian learning, then the Hopfield network can perform as a robust content-addressable (or associative) memory, resistant to connection alteration (a minimal sketch appears below, after the long short term memory subsection). 10051780 -> 1000000501790: Echo state network 10051790 -> 1000000501800: The echo state network (ESN) is a recurrent neural network with a sparsely connected random hidden layer. 10051800 -> 1000000501810: The weights of output neurons are the only part of the network that can change and be learned. 10051810 -> 1000000501820: ESNs are well suited to (re)producing temporal patterns. 10051820 -> 1000000501830: Long short term memory network 10051830 -> 1000000501840: The long short term memory network is an artificial neural net structure that, unlike traditional RNNs, does not have the problem of vanishing gradients. 10051840 -> 1000000501850: It can therefore use long delays and can handle signals that have a mix of low and high frequency components.
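Returning to the Hopfield network described above, the following sketch stores two invented +/-1 patterns with a Hebbian rule and shows a corrupted pattern settling back to the nearest stored one; it is a toy illustration of content-addressable memory, not a full treatment of Hopfield dynamics.

    # Toy Hopfield-style associative memory over +/-1 units.
    patterns = [[1, -1, 1, -1, 1, -1], [1, 1, 1, -1, -1, -1]]   # invented patterns
    n = len(patterns[0])

    # Hebbian learning: w_ij = sum over patterns of p_i * p_j, no self-connections.
    W = [[0 if i == j else sum(p[i] * p[j] for p in patterns) for j in range(n)]
         for i in range(n)]

    def recall(state, sweeps=5):
        state = list(state)
        for _ in range(sweeps):
            for i in range(n):                                 # asynchronous updates
                h = sum(W[i][j] * state[j] for j in range(n))
                state[i] = 1 if h >= 0 else -1
        return state

    noisy = [-1, -1, 1, -1, 1, -1]     # first pattern with its first unit flipped
    print(recall(noisy))               # settles back to [1, -1, 1, -1, 1, -1]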
10051850 -> 1000000501860: Stochastic neural networks 10051860 -> 1000000501870: A stochastic neural network differs from a typical neural network because it introduces random variations into the network. 10051870 -> 1000000501880: In a probabilistic view of neural networks, such random variations can be viewed as a form of statistical sampling, such as Monte Carlo sampling. 10051880 -> 1000000501890: Boltzmann machine 10051890 -> 1000000501900: The Boltzmann machine can be thought of as a noisy Hopfield network. 10051900 -> 1000000501910: Invented by Geoff Hinton and Terry Sejnowski in 1985, the Boltzmann machine is important because it is one of the first neural networks to demonstrate learning of latent variables (hidden units). 10051910 -> 1000000501920: Boltzmann machine learning was at first slow to simulate, but the contrastive divergence algorithm of Geoff Hinton (circa 2000) allows models such as Boltzmann machines and products of experts to be trained much faster. 10051920 -> 1000000501930: Modular neural networks 10051930 -> 1000000501940: Biological studies have shown that the human brain functions not as a single massive network, but as a collection of small networks. 10051940 -> 1000000501950: This realisation gave birth to the concept of modular neural networks, in which several small networks cooperate or compete to solve problems. 10051950 -> 1000000501960: Committee of machines 10051960 -> 1000000501970: A committee of machines (CoM) is a collection of different neural networks that together "vote" on a given example. 10051970 -> 1000000501980: This generally gives a much better result compared to other neural network models. 10051980 -> 1000000501990: In fact, in many cases, starting with the same architecture and training but using different initial random weights gives vastly different networks. 10051990 -> 1000000502000: A CoM tends to stabilize the result (a toy sketch appears at the end of this section). 10052000 -> 1000000502010: The CoM is similar to the general machine learning bagging method, except that the necessary variety of machines in the committee is obtained by training from different random starting weights rather than training on different randomly selected subsets of the training data. 10052010 -> 1000000502020: Associative neural network (ASNN) 10052020 -> 1000000502030: The ASNN is an extension of the committee of machines that goes beyond a simple/weighted average of different models. 10052025 -> 1000000502040: ASNN represents a combination of an ensemble of feed-forward neural networks and the k-nearest neighbor technique (kNN). 10052030 -> 1000000502050: It uses the correlation between ensemble responses as a measure of distance among the analyzed cases for the kNN. 10052040 -> 1000000502060: This corrects the bias of the neural network ensemble. 10052050 -> 1000000502070: An associative neural network has a memory that can coincide with the training set. 10052060 -> 1000000502080: If new data becomes available, the network instantly improves its predictive ability and provides data approximation (self-learns the data) without a need to retrain the ensemble. 10052070 -> 1000000502090: Another important feature of ASNN is the possibility of interpreting neural network results by analysing correlations between data cases in the space of models. 10052080 -> 1000000502100: The method is demonstrated at www.vcclab.org, where you can either use it online or download it.
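Returning to the committee of machines described above, a toy sketch: several members share the same (deliberately tiny) one-parameter model and the same invented training data, but start from different random weights and are stopped early so that they genuinely differ; averaging their outputs then stabilises the prediction. Everything here, data included, is an illustrative assumption rather than a realistic neural-network ensemble.

    import random

    random.seed(0)
    samples = [(0.0, 0.1), (1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # invented data

    def train_member(steps=5, lr=0.02):
        a = random.uniform(-3.0, 3.0)        # different random starting weight
        for _ in range(steps):               # stopped early on purpose, so the
            g = 2 * sum((a * x - y) * x for x, y in samples) / len(samples)
            a -= lr * g                      # members end up slightly different
        return a

    committee = [train_member() for _ in range(10)]

    def predict(x):
        # The committee "votes" by averaging its members' outputs.
        return sum(a * x for a in committee) / len(committee)

    print([round(a, 2) for a in committee])
    print(predict(1.5))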
10052090 -> 1000000502110: Other types of networks 10052100 -> 1000000502120: These special networks do not fit in any of the previous categories. 10052110 -> 1000000502130: Holographic associative memory 10052120 -> 1000000502140: Holographic associative memory represents a family of analog, correlation-based, associative, stimulus-response memories, where information is mapped onto the phase orientation of complex numbers. 10052130 -> 1000000502150: Instantaneously trained networks 10052140 -> 1000000502160: Instantaneously trained neural networks (ITNNs) were inspired by the phenomenon of short-term learning that seems to occur instantaneously. 10052150 -> 1000000502170: In these networks the weights of the hidden and the output layers are mapped directly from the training vector data. 10052160 -> 1000000502180: Ordinarily, they work on binary data, but versions for continuous data that require small additional processing are also available. 10052170 -> 1000000502190: Spiking neural networks 10052180 -> 1000000502200: Spiking neural networks (SNNs) are models which explicitly take into account the timing of inputs. 10052190 -> 1000000502210: The network input and output are usually represented as series of spikes (delta function or more complex shapes). 10052200 -> 1000000502220: SNNs have the advantage of being able to process information in the time domain (signals that vary over time). 10052210 -> 1000000502230: They are often implemented as recurrent networks. 10052220 -> 1000000502240: SNNs are also a form of pulse computer. 10052230 -> 1000000502250: Networks of spiking neurons — and the temporal correlations of neural assemblies in such networks — have been used to model figure/ground separation and region linking in the visual system (see e.g. Reitboeck et al. in Haken and Stadler: Synergetics of the Brain. 10052240 -> 1000000502260: Berlin, 1989). 10052250 -> 1000000502270: Gerstner and Kistler have a freely available online textbook on Spiking Neuron Models. 10052260 -> 1000000502280: Spiking neural networks with axonal conduction delays exhibit polychronization, and hence could have a potentially unlimited memory capacity. 10052270 -> 1000000502290: In June 2005, IBM announced construction of a Blue Gene supercomputer dedicated to the simulation of a large recurrent spiking neural network. 10052280 -> 1000000502300: Dynamic neural networks 10052290 -> 1000000502310: Dynamic neural networks not only deal with nonlinear multivariate behaviour, but also include (learning of) time-dependent behaviour such as various transient phenomena and delay effects. 10052300 -> 1000000502320: Cascading neural networks 10052310 -> 1000000502330: Cascade-Correlation is an architecture and supervised learning algorithm developed by Scott Fahlman and Christian Lebiere. 10052320 -> 1000000502340: Instead of just adjusting the weights in a network of fixed topology, Cascade-Correlation begins with a minimal network, then automatically trains and adds new hidden units one by one, creating a multi-layer structure. 10052330 -> 1000000502350: Once a new hidden unit has been added to the network, its input-side weights are frozen. 10052340 -> 1000000502360: This unit then becomes a permanent feature-detector in the network, available for producing outputs or for creating other, more complex feature detectors.
10052350 -> 1000000502370: The Cascade-Correlation architecture has several advantages over existing algorithms: it learns very quickly, the network determines its own size and topology, it retains the structures it has built even if the training set changes, and it requires no back-propagation of error signals through the connections of the network. 10052360 -> 1000000502380: See: Cascade correlation algorithm. 10052370 -> 1000000502390: Neuro-fuzzy networks 10052380 -> 1000000502400: A neuro-fuzzy network is a fuzzy inference system in the body of an artificial neural network. 10052390 -> 1000000502410: Depending on the FIS type, there are several layers that simulate the processes involved in a fuzzy inference, such as fuzzification, inference, aggregation and defuzzification. 10052400 -> 1000000502420: Embedding an FIS in a general structure of an ANN has the benefit of using available ANN training methods to find the parameters of a fuzzy system. 10052410 -> 1000000502430: Holosemantic neural networks 10052420 -> 1000000502440: The holosemantic neural network invented by Manfred Hoffleisch uses a kind of genetic algorithm to build a multidimensional structure. 10052430 -> 1000000502450: It takes into account the timing of inputs. 10052440 -> 1000000502460: Compositional pattern-producing networks 10052450 -> 1000000502470: Compositional pattern-producing networks (CPPNs) are a variation of ANNs which differ in their set of activation functions and how they are applied. 10052460 -> 1000000502480: While typical ANNs often contain only sigmoid functions (and sometimes Gaussian functions), CPPNs can include both types of functions and many others. 10052470 -> 1000000502490: Furthermore, unlike typical ANNs, CPPNs are applied across the entire space of possible inputs so that they can represent a complete image. 10052480 -> 1000000502500: Since they are compositions of functions, CPPNs in effect encode images at infinite resolution and can be sampled for a particular display at whatever resolution is optimal. 10052490 -> 1000000502510: Theoretical properties 10052500 -> 1000000502520: Computational power 10052510 -> 1000000502530: The multi-layer perceptron (MLP) is a universal function approximator, as proven by the Cybenko theorem. 10052520 -> 1000000502540: However, the proof is not constructive regarding the number of neurons required or the settings of the weights. 10052530 -> 1000000502550: Work by Hava T. Siegelmann and Eduardo D. Sontag has provided a proof that a specific recurrent architecture with rational-valued weights (as opposed to the commonly used floating point approximations) has the full power of a Universal Turing Machine. 10052540 -> 1000000502560: They have further shown that the use of irrational values for weights results in a machine with trans-Turing power. 10052550 -> 1000000502570: Capacity 10052560 -> 1000000502580: Artificial neural network models have a property called 'capacity', which roughly corresponds to their ability to model any given function. 10052570 -> 1000000502590: It is related to the amount of information that can be stored in the network and to the notion of complexity. 10052580 -> 1000000502600: Convergence 10052590 -> 1000000502610: Nothing can be said in general about convergence since it depends on a number of factors. 10052600 -> 1000000502620: Firstly, there may exist many local minima. 10052610 -> 1000000502630: This depends on the cost function and the model.
10052620 -> 1000000502640: Secondly, the optimization method used might not be guaranteed to converge when far away from a local minimum. 10052630 -> 1000000502650: Thirdly, for a very large amount of data or parameters, some methods become impractical. 10052640 -> 1000000502660: In general, it has been found that theoretical guarantees regarding convergence are not always a very reliable guide to practical application. 10052650 -> 1000000502670: Generalisation and statistics 10052660 -> 1000000502680: In applications where the goal is to create a system that generalises well to unseen examples, the problem of overtraining has emerged. 10052670 -> 1000000502690: This arises in overcomplex or overspecified systems when the capacity of the network significantly exceeds the needed free parameters. 10052680 -> 1000000502700: There are two schools of thought for avoiding this problem: The first is to use cross-validation and similar techniques to check for the presence of overtraining and to optimally select hyperparameters so as to minimise the generalisation error. 10052690 -> 1000000502710: The second is to use some form of regularisation. 10052700 -> 1000000502720: This is a concept that emerges naturally in a probabilistic (Bayesian) framework, where the regularisation can be performed by putting a larger prior probability over simpler models; but also in statistical learning theory, where the goal is to minimise over two quantities: the 'empirical risk' and the 'structural risk', which roughly correspond to the error over the training set and the predicted error in unseen data due to overfitting. 10052710 -> 1000000502730: Supervised neural networks that use an MSE cost function can use formal statistical methods to determine the confidence of the trained model. 10052720 -> 1000000502740: The MSE on a validation set can be used as an estimate for variance. 10052730 -> 1000000502750: This value can then be used to calculate the confidence interval of the output of the network, assuming a normal distribution. 10052740 -> 1000000502760: A confidence analysis made this way is statistically valid as long as the output probability distribution stays the same and the network is not modified. 10052750 -> 1000000502770: By assigning a softmax activation function to the output layer of the neural network (or a softmax component in a component-based neural network) for categorical target variables, the outputs can be interpreted as posterior probabilities. 10052760 -> 1000000502780: This is very useful in classification as it gives a certainty measure on classifications. 10052770 -> 1000000502790: The softmax activation function: y_i=\frac{e^{x_i}}{\sum_{j=1}^c e^{x_j}} 10052780 -> 1000000502800: Dynamic properties 10052790 -> 1000000502810: Various techniques originally developed for studying disordered magnetic systems (e.g. the spin glass) have been successfully applied to simple neural network architectures, such as the Hopfield network. 10052800 -> 1000000502820: Influential work by E. Gardner and B. Derrida has revealed many interesting properties about perceptrons with real-valued synaptic weights, while later work by W. Krauth and M. Mezard has extended these principles to binary-valued synapses.
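A minimal sketch of the softmax activation given above; the raw output values are invented, and the subtraction of the maximum is a standard numerical-stability step that does not change the result.

    import math

    def softmax(outputs):
        # y_i = exp(x_i) / sum_j exp(x_j)
        m = max(outputs)
        exps = [math.exp(x - m) for x in outputs]
        total = sum(exps)
        return [e / total for e in exps]

    raw_outputs = [2.0, 1.0, -0.5]     # illustrative final-layer values, one per class
    print(softmax(raw_outputs))        # non-negative, sums to 1: posterior-like scores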
Association for Computational Linguistics 10060010 -> 1000000600020: Association for Computational Linguistics 10060020 -> 1000000600030: The Association for Computational Linguistics (ACL) is the international scientific and professional society for people working on problems involving natural language and computation. 10060030 -> 1000000600040: An annual meeting is held each summer in locations where significant computational linguistics research is carried out. 10060040 -> 1000000600050: It was founded in 1962, originally named the Association for Machine Translation and Computational Linguistics (AMTCL). 10060050 -> 1000000600060: It became the ACL in 1968. 10060060 -> 1000000600070: The ACL has European and North American chapters, the European Chapter of the Association for Computational Linguistics (EACL) and the North American Chapter of the Association for Computational Linguistics (NAACL). 10060070 -> 1000000600080: The ACL journal, Computational Linguistics, continues to be the primary forum for research on computational linguistics and natural language processing. 10060080 -> 1000000600090: Since 1988, the journal has been published for the ACL by MIT Press. 10060090 -> 1000000600100: The ACL book series, Studies in Natural Language Processing, is published by Cambridge University Press. 10060100 -> None: Special Interest Groups 10060110 -> None: ACL has a large number of Special Interest Groups (SIGs), focusing on specific areas of natural language processing. 10060120 -> None: Some current SIGs within ACL are: 10060130 -> None: Linguistic data and corpus-based approaches: SIGDAT 10060140 -> None: Dialogue Processing: SIGDIAL 10060150 -> None: Natural Language Generation: SIGGEN 10060160 -> None: Lexicon: SIGLEX 10060170 -> None: Mathematics of Language: SIGMOL 10060180 -> None: Computational Morphology and Phonology: SIGMORPHON 10060190 -> None: Computational Semantics: SIGSEM BLEU 10090010 -> 1000000700020: BLEU 10090020 -> 1000000700030: This page is about the evaluation metric for machine translation. 10090030 -> 1000000700040: For other meanings, please see Bleu. 10090040 -> 1000000700050: BLEU (Bilingual Evaluation Understudy) is a method for evaluating the quality of text which has been translated from one natural language to another using machine translation. 10090050 -> 1000000700060: BLEU was one of the first software metrics to report high correlation with human judgements of quality. 10090060 -> 1000000700070: The metric is currently one of the most popular in the field. 10090070 -> 1000000700080: The central idea behind the metric is that, "the closer a machine translation is to a professional human translation, the better it is". 10090080 -> 1000000700090: The metric calculates scores for individual segments, generally sentences, and then averages these scores over the whole corpus in order to reach a final score. 10090090 -> 1000000700100: It has been shown to correlate highly with human judgements of quality at the corpus level. 10090100 -> 1000000700110: The quality of translation is indicated as a number between 0 and 1 and is measured as statistical closeness to a given set of good quality human reference translations. 10090110 -> 1000000700120: Therefore, it does not directly take into account translation intelligibility or grammatical correctness. 10090120 -> 1000000700130: The metric works by measuring the n-gram co-occurrence between a given translation and the set of reference translations and then taking the weighted geometric mean. 
10090130 -> 1000000700140: BLEU is specifically designed to approximate human judgement on a corpus level and performs badly if used to evaluate the quality of isolated sentences. 10090140 -> 1000000700150: Algorithm 10090150 -> 1000000700160: BLEU uses a modified form of precision to compare a candidate translation against multiple reference translations. 10090160 -> 1000000700170: The metric modifies simple precision since machine translation systems have been known to generate more words than appear in a reference text. 10090170 -> 1000000700180: This is illustrated in the following example from Papineni et al. (2002), 10090180 -> 1000000700190: In this example, the candidate text is given a unigram precision of, 10090190 -> 1000000700200: P = \frac{m}{w_{t}} = \frac{7}{7} = 1 10090200 -> 1000000700210: Of the seven words in the candidate translation, all of them appear in the reference translations. 10090210 -> 1000000700220: This presents a problem for a metric, as the candidate translation above is complete nonsense, retaining none of the content of either of the references. 10090220 -> 1000000700230: The modification that BLEU makes is fairly straightforward. 10090230 -> 1000000700240: For each word in the candidate translation, the algorithm takes the maximum total count in the reference translations. 10090240 -> 1000000700250: Taking the example above, the word 'the' appears twice in reference 1, and once in reference 2. 10090250 -> 1000000700260: The largest value is taken, in this case '2' as the "maximum reference count". 10090260 -> 1000000700270: For each of the words in the candidate translation, the count of the word is compared against the maximum reference count, and the lowest value is taken. 10090270 -> 1000000700280: In this case, the count of the word 'the' in the candidate translation is '7', while the maximum reference count for the word is '2'. 10090280 -> 1000000700290: This "modified count" is then divided by the total number of words in the candidate translation. 10090290 -> 1000000700300: In the above example, the modified unigram precision score would be, 10090300 -> 1000000700310: P = \frac{2}{7} 10090310 -> 1000000700320: The above method is used to calculate scores for each n. 10090320 -> 1000000700330: The value of n which has the "highest correlation with monolingual human judgements" was found to be 4. 10090330 -> 1000000700340: The unigram scores are found to account for the adequacy of the translation, in other words, how much information is retained in the translation. 10090340 -> 1000000700350: The longer n-gram scores account for the fluency of the translation, or to what extent it reads like "good English". 10090350 -> 1000000700360: The modification made to precision does not solve the problem of short translations. 10090360 -> 1000000700370: Short translations can produce very high precision scores, even using modified precision. 10090370 -> 1000000700380: An example of a candidate translation for the same references as above might be: 10090380 -> 1000000700390: the cat 10090390 -> 1000000700400: In this example, the modified unigram precision would be, 10090400 -> 1000000700410: P = \frac{1}{2} + \frac{1}{2} = \frac{2}{2} 10090410 -> 1000000700420: as the word 'the' and the word 'cat' appear once each in the candidate, and the total number of words is two. 10090420 -> 1000000700430: The modified bigram precision would be 1 / 1 as the bigram, "the cat" appears once in the candidate. 
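The modified n-gram precision walked through above can be sketched as follows. The candidate and reference strings are the widely cited illustration from Papineni et al. (2002) as recalled here (a candidate consisting only of the word "the", against two references about a cat on a mat); they are supplied as an assumption, since the original example table is not reproduced in this article.

    from collections import Counter

    def modified_precision(candidate, references, n=1):
        # Clip each candidate n-gram count by its maximum count in any reference.
        def ngrams(tokens):
            return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        cand = ngrams(candidate.split())
        refs = [ngrams(r.split()) for r in references]
        clipped = sum(min(count, max(ref[gram] for ref in refs))
                      for gram, count in cand.items())
        return clipped / sum(cand.values())

    references = ["the cat is on the mat", "there is a cat on the mat"]
    print(modified_precision("the the the the the the the", references))   # 2/7
    print(modified_precision("the cat", references, n=1))                   # 2/2
    print(modified_precision("the cat", references, n=2))                   # 1/1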
10090430 -> 1000000700440: It has been pointed out that precision is usually twinned with recall to overcome this problem, as the unigram recall of this example would be 2 / 6 or 2 / 7. 10090440 -> 1000000700450: The problem is that, as there are multiple reference translations, a bad translation could easily have an inflated recall; for example, a translation which consisted of all the words in each of the references. 10090450 -> 1000000700460: In order to produce a score for the whole corpus, the modified precision scores for the segments are combined using the geometric mean, multiplied by a brevity penalty, whose purpose is to prevent very short candidates from receiving too high a score. 10090460 -> 1000000700470: Let r be the total length of the reference corpus, and c the total length of the translation corpus. 10090470 -> 1000000700480: If c \leq r, the brevity penalty applies and is defined to be e^{(1-r/c)}. 10090480 -> 1000000700490: (In the case of multiple reference sentences, r is taken to be the sum of the lengths of the sentences whose lengths are closest to the lengths of the candidate sentences. 10090490 -> 1000000700500: However, in the version of the metric used by NIST, the shortest reference sentence is used.) 10090500 -> 1000000700510: Performance 10090510 -> 1000000700520: BLEU has frequently been reported as correlating well with human judgement, and certainly remains a benchmark for any new evaluation metric to beat. 10090520 -> 1000000700530: There are, however, a number of criticisms that have been voiced. 10090530 -> 1000000700540: It has been noted that, while in theory capable of evaluating any language, BLEU does not in its present form work on languages without word boundaries. 10090540 -> 1000000700550: It has been argued that although BLEU certainly has significant advantages, there is no guarantee that an increase in BLEU score is an indicator of improved translation quality. 10090550 -> 1000000700560: As BLEU scores are taken at the corpus level, it is difficult to give a textual example. 10090560 -> 1000000700570: Nevertheless, they highlight two instances where BLEU seriously underperformed. 10090570 -> 1000000700580: These were the 2005 NIST evaluations where a number of different machine translation systems were tested, and their study of the SYSTRAN engine versus two engines using statistical machine translation (SMT) techniques. 10090580 -> 1000000700590: In the 2005 NIST evaluation, they report that the scores generated by BLEU failed to correspond to the scores produced in the human evaluations. 10090590 -> 1000000700600: The system which was ranked highest by the human judges was only ranked 6th by BLEU. 10090600 -> 1000000700610: In their study, they compared SMT systems with SYSTRAN, a knowledge based system. 10090610 -> 1000000700620: The scores from BLEU for SYSTRAN were substantially worse than the scores given to SYSTRAN by the human judges. 10090620 -> 1000000700630: They note that the SMT systems were trained using BLEU minimum error rate training, and point out that this could be one of the reasons behind the difference. 10090630 -> 1000000700640: They conclude by recommending that BLEU be used in a more restricted manner, for comparing the results from two similar systems, and for tracking "broad, incremental changes to a single system". Babel Fish (website) 10070010 -> 1000000800020: Babel Fish (website) 10070020 -> 1000000800030: Babel Fish is a web-based application on Yahoo!
that machine translates text or web pages from one of several languages into another. 10070030 -> 1000000800040: Developed by AltaVista, the application is named after the fictional animal used for instantaneous language translation in Douglas Adams's series The Hitchhiker's Guide to the Galaxy. 10070040 -> 1000000800050: In turn the fish is a reference to the biblical account of the city of Babel and the various languages said to have arisen there. 10070050 -> 1000000800060: The translation technology for Babel Fish is provided by SYSTRAN, whose technology also powers a number of other sites and portals. 10070060 -> 1000000800070: It translates among English, Simplified Chinese, Traditional Chinese, Dutch, French, German, Greek, Italian, Japanese, Korean, Portuguese, Russian, and Spanish. 10070070 -> 1000000800080: The service makes no claim to produce a perfect translation. 10070080 -> 1000000800090: A number of humour sites have sprung up that use the Babel Fish service to translate back and forth between one or more languages (a so-called round-trip translation). 10070090 -> 1000000800100: After a long existence at babelfish.altavista.com, the site was moved on May 9 2008 to babelfish.yahoo.com. Bioinformatics 10080010 -> 1000000900020: Bioinformatics 10080020 -> 1000000900030: Bioinformatics and computational biology involve the use of techniques including applied mathematics, informatics, statistics, computer science, artificial intelligence, chemistry, and biochemistry to solve biological problems usually on the molecular level. 10080030 -> 1000000900040: The core principle of these techniques is using computing resources in order to solve problems on scales of magnitude far too great for human discernment. 10080040 -> 1000000900050: Research in computational biology often overlaps with systems biology. 10080050 -> 1000000900060: Major research efforts in the field include sequence alignment, gene finding, genome assembly, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, and the modeling of evolution. 10080060 -> 1000000900070: Introduction 10080070 -> 1000000900080: The terms bioinformatics and computational biology are often used interchangeably. 10080080 -> 1000000900090: However bioinformatics more properly refers to the creation and advancement of algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data. 10080090 -> 1000000900100: Computational biology, on the other hand, refers to hypothesis-driven investigation of a specific biological problem using computers, carried out with experimental or simulated data, with the primary goal of discovery and the advancement of biological knowledge. 10080100 -> 1000000900110: Put more simply, bioinformatics is concerned with the information while computational biology is concerned with the hypotheses. 10080110 -> 1000000900120: A similar distinction is made by National Institutes of Health in their working definitions of Bioinformatics and Computational Biology, where it is further emphasized that there is a tight coupling of developments and knowledge between the more hypothesis-driven research in computational biology and technique-driven research in bioinformatics. 10080120 -> 1000000900130: Bioinformatics is also often specified as an applied subfield of the more general discipline of Biomedical informatics. 
10080130 -> 1000000900140: A common thread in projects in bioinformatics and computational biology is the use of mathematical tools to extract useful information from data produced by high-throughput biological techniques such as genome sequencing. 10080140 -> 1000000900150: A representative problem in bioinformatics is the assembly of high-quality genome sequences from fragmentary "shotgun" DNA sequencing. 10080150 -> 1000000900160: Other common problems include the study of gene regulation to perform expression profiling using data from microarrays or mass spectrometry. 10080160 -> 1000000900170: Major research areas 10080170 -> 1000000900180: Sequence analysis 10080180 -> 1000000900190: Since the Phage Φ-X174 was sequenced in 1977, the DNA sequences of hundreds of organisms have been decoded and stored in databases. 10080190 -> 1000000900200: The information is analyzed to determine genes that encode polypeptides, as well as regulatory sequences. 10080200 -> 1000000900210: A comparison of genes within a species or between different species can show similarities between protein functions, or relations between species (the use of molecular systematics to construct phylogenetic trees). 10080210 -> 1000000900220: With the growing amount of data, it long ago became impractical to analyze DNA sequences manually. 10080220 -> 1000000900230: Today, computer programs are used to search the genomes of thousands of organisms, which contain billions of nucleotides. 10080230 -> 1000000900240: These programs can compensate for mutations (exchanged, deleted or inserted bases) in the DNA sequence in order to identify sequences that are related but not identical. 10080240 -> 1000000900250: A variant of this sequence alignment is used in the sequencing process itself. 10080250 -> 1000000900260: The so-called shotgun sequencing technique (which was used, for example, by The Institute for Genomic Research to sequence the first bacterial genome, Haemophilus influenzae) does not give a sequential list of nucleotides, but instead the sequences of thousands of small DNA fragments (each about 600-800 nucleotides long). 10080260 -> 1000000900270: The ends of these fragments overlap and, when aligned in the right way, make up the complete genome. 10080270 -> 1000000900280: Shotgun sequencing yields sequence data quickly, but the task of assembling the fragments can be quite complicated for larger genomes. 10080280 -> 1000000900290: In the case of the Human Genome Project, it took several months of CPU time (on a circa-2000 vintage DEC Alpha computer) to assemble the fragments. 10080290 -> 1000000900300: Shotgun sequencing is the method of choice for virtually all genomes sequenced today, and genome assembly algorithms are a critical area of bioinformatics research. 10080300 -> 1000000900310: Another aspect of bioinformatics in sequence analysis is the automatic search for genes and regulatory sequences within a genome. 10080310 -> 1000000900320: Not all of the nucleotides within a genome are part of genes. 10080320 -> 1000000900330: Within the genome of higher organisms, large parts of the DNA do not serve any obvious purpose. 10080330 -> 1000000900340: This so-called junk DNA may, however, contain unrecognized functional elements. 10080340 -> 1000000900350: Bioinformatics helps to bridge the gap between genome and proteome projects, for example in the use of DNA sequences for protein identification. 10080350 -> 1000000900360: See also: sequence analysis, sequence profiling tool, sequence motif.
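To make the idea of aligning related but non-identical sequences concrete, the following is a minimal sketch of global alignment scoring in the style of Needleman-Wunsch, not the code of any particular production tool; the match, mismatch and gap scores are assumptions chosen only for the example.

# Minimal Needleman-Wunsch-style global alignment score (illustrative only).
# The scoring values are arbitrary assumptions for the example.
def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    # dp[i][j] = best score aligning a[:i] with b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dp[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, len(b) + 1):
        dp[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[len(a)][len(b)]

# Related but non-identical sequences still score well despite an insertion.
print(alignment_score("GATTACA", "GATTTACA"))

Real alignment programs use the same dynamic-programming idea but with biologically motivated substitution matrices and far more efficient implementations.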
10080360 -> 1000000900370: Genome annotation 10080370 -> 1000000900380: In the context of genomics, annotation is the process of marking the genes and other biological features in a DNA sequence. 10080380 -> 1000000900390: The first genome annotation software system was designed in 1995 by Dr. Owen White, who was part of the team that sequenced and analyzed the first genome of a free-living organism to be decoded, the bacterium Haemophilus influenzae. 10080390 -> 1000000900400: Dr. White built a software system to find the genes (places in the DNA sequence that encode a protein), the transfer RNA, and other features, and to make initial assignments of function to those genes. 10080400 -> 1000000900410: Most current genome annotation systems work similarly, but the programs available for analysis of genomic DNA are constantly changing and improving. 10080410 -> 1000000900420: Computational evolutionary biology 10080420 -> 1000000900430: Evolutionary biology is the study of the origin and descent of species, as well as their change over time. 10080430 -> 1000000900440: Informatics has assisted evolutionary biologists in several key ways; it has enabled researchers to: 10080440 -> 1000000900450: trace the evolution of a large number of organisms by measuring changes in their DNA, rather than through physical taxonomy or physiological observations alone, 10080450 -> 1000000900460: more recently, compare entire genomes, which permits the study of more complex evolutionary events, such as gene duplication, lateral gene transfer, and the prediction of factors important in bacterial speciation, 10080460 -> 1000000900470: build complex computational models of populations to predict the outcome of the system over time 10080470 -> 1000000900480: track and share information on an increasingly large number of species and organisms 10080480 -> 1000000900490: Future work endeavours to reconstruct the now more complex tree of life. 10080490 -> 1000000900500: The area of research within computer science that uses genetic algorithms is sometimes confused with computational evolutionary biology, but the two areas are unrelated. 10080500 -> 1000000900510: Measuring biodiversity 10080510 -> 1000000900520: Biodiversity of an ecosystem might be defined as the total genomic complement of a particular environment, from all of the species present, whether it is a biofilm in an abandoned mine, a drop of sea water, a scoop of soil, or the entire biosphere of the planet Earth. 10080520 -> 1000000900530: Databases are used to collect the species names, descriptions, distributions, genetic information, status and size of populations, habitat needs, and how each organism interacts with other species. 10080530 -> 1000000900540: Specialized software programs are used to find, visualize, and analyze the information, and most importantly, communicate it to other people. 10080540 -> 1000000900550: Computer simulations model such things as population dynamics, or calculate the cumulative genetic health of a breeding pool (in agriculture) or endangered population (in conservation). 10080550 -> 1000000900560: One very exciting potential of this field is that entire DNA sequences, or genomes of endangered species can be preserved, allowing the results of Nature's genetic experiment to be remembered in silico, and possibly reused in the future, even if that species is eventually lost. 10080560 -> 1000000900570: Important projects: Species 2000 project; uBio Project. 
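As an illustration of the kind of population-dynamics simulation mentioned above, here is a minimal sketch of discrete logistic growth; the growth rate, carrying capacity and starting population are assumed values chosen only for the example, not parameters from any real study.

# Minimal discrete logistic-growth simulation (illustrative assumptions only).
def simulate_population(n0, rate, capacity, generations):
    sizes = [n0]
    for _ in range(generations):
        n = sizes[-1]
        # Growth slows as the population approaches the carrying capacity.
        sizes.append(n + rate * n * (1 - n / capacity))
    return sizes

print(simulate_population(n0=50, rate=0.3, capacity=1000, generations=10))

Conservation models used in practice are far richer (age structure, stochastic events, genetic health), but they build on the same step-by-step simulation idea.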
10080570 -> 1000000900580: Analysis of gene expression 10080580 -> 1000000900590: The expression of many genes can be determined by measuring mRNA levels with multiple techniques including microarrays, expressed cDNA sequence tag (EST) sequencing, serial analysis of gene expression (SAGE) tag sequencing, massively parallel signature sequencing (MPSS), or various applications of multiplexed in-situ hybridization. 10080590 -> 1000000900600: All of these techniques are extremely noise-prone and/or subject to bias in the biological measurement, and a major research area in computational biology involves developing statistical tools to separate signal from noise in high-throughput gene expression studies. 10080600 -> 1000000900610: Such studies are often used to determine the genes implicated in a disorder: one might compare microarray data from cancerous epithelial cells to data from non-cancerous cells to determine the transcripts that are up-regulated and down-regulated in a particular population of cancer cells. 10080610 -> 1000000900620: Analysis of regulation 10080620 -> 1000000900630: Regulation is the complex orchestration of events starting with an extracellular signal such as a hormone and leading to an increase or decrease in the activity of one or more proteins. 10080630 -> 1000000900640: Bioinformatics techniques have been applied to explore various steps in this process. 10080640 -> 1000000900650: For example, promoter analysis involves the identification and study of sequence motifs in the DNA surrounding the coding region of a gene. 10080650 -> 1000000900660: These motifs influence the extent to which that region is transcribed into mRNA. 10080660 -> 1000000900670: Expression data can be used to infer gene regulation: one might compare microarray data from a wide variety of states of an organism to form hypotheses about the genes involved in each state. 10080670 -> 1000000900680: In a single-cell organism, one might compare stages of the cell cycle, along with various stress conditions (heat shock, starvation, etc.). 10080680 -> 1000000900690: One can then apply clustering algorithms to that expression data to determine which genes are co-expressed. 10080690 -> 1000000900700: For example, the upstream regions (promoters) of co-expressed genes can be searched for over-represented regulatory elements. 10080700 -> 1000000900710: Analysis of protein expression 10080710 -> 1000000900720: Protein microarrays and high throughput (HT) mass spectrometry (MS) can provide a snapshot of the proteins present in a biological sample. 10080720 -> 1000000900730: Bioinformatics is very much involved in making sense of protein microarray and HT MS data; the former approach faces similar problems as with microarrays targeted at mRNA, the latter involves the problem of matching large amounts of mass data against predicted masses from protein sequence databases, and the complicated statistical analysis of samples where multiple, but incomplete peptides from each protein are detected. 10080730 -> 1000000900740: Analysis of mutations in cancer 10080740 -> 1000000900750: In cancer, the genomes of affected cells are rearranged in complex or even unpredictable ways. 10080750 -> 1000000900760: Massive sequencing efforts are used to identify previously unknown point mutations in a variety of genes in cancer. 
10080760 -> 1000000900770: Bioinformaticians continue to produce specialized automated systems to manage the sheer volume of sequence data produced, and they create new algorithms and software to compare the sequencing results to the growing collection of human genome sequences and germline polymorphisms. 10080770 -> 1000000900780: New physical detection technologies are employed, such as oligonucleotide microarrays to identify chromosomal gains and losses (called comparative genomic hybridization), and single nucleotide polymorphism arrays to detect known point mutations. 10080780 -> 1000000900790: These detection methods simultaneously measure several hundred thousand sites throughout the genome, and when used in high throughput to measure thousands of samples, generate terabytes of data per experiment. 10080790 -> 1000000900800: Again the massive amounts and new types of data generate new opportunities for bioinformaticians. 10080800 -> 1000000900810: The data is often found to contain considerable variability, or noise, and thus Hidden Markov model and change-point analysis methods are being developed to infer real copy number changes. 10080810 -> 1000000900820: Another type of data that requires novel informatics development is the analysis of lesions found to be recurrent among many tumors. 10080820 -> 1000000900830: Prediction of protein structure 10080830 -> 1000000900840: Protein structure prediction is another important application of bioinformatics. 10080840 -> 1000000900850: The amino acid sequence of a protein, the so-called primary structure, can be easily determined from the sequence of the gene that codes for it. 10080850 -> 1000000900860: In the vast majority of cases, this primary structure uniquely determines a structure in its native environment. 10080860 -> 1000000900870: (Of course, there are exceptions, such as the prion associated with bovine spongiform encephalopathy, also known as Mad Cow Disease.) 10080870 -> 1000000900880: Knowledge of this structure is vital in understanding the function of the protein. 10080880 -> 1000000900890: For lack of better terms, structural information is usually classified as one of secondary, tertiary and quaternary structure. 10080890 -> 1000000900900: A viable general solution to such predictions remains an open problem. 10080900 -> 1000000900910: As of now, most efforts have been directed towards heuristics that work most of the time. 10080910 -> 1000000900920: One of the key ideas in bioinformatics is the notion of homology. 10080920 -> 1000000900930: In the genomic branch of bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A, whose function is known, is homologous to the sequence of gene B, whose function is unknown, one could infer that B may share A's function. 10080930 -> 1000000900940: In the structural branch of bioinformatics, homology is used to determine which parts of a protein are important in structure formation and interaction with other proteins. 10080940 -> 1000000900950: In a technique called homology modeling, this information is used to predict the structure of a protein once the structure of a homologous protein is known. 10080950 -> 1000000900960: This currently remains the only way to predict protein structures reliably. 10080960 -> 1000000900970: One example of this is the homology between human hemoglobin and the hemoglobin found in legumes (leghemoglobin). 10080970 -> 1000000900980: Both serve the same purpose of transporting oxygen in the organism.
10080980 -> 1000000900990: Though both of these proteins have completely different amino acid sequences, their protein structures are virtually identical, which reflects their near-identical purposes. 10080990 -> 1000000901000: Other techniques for predicting protein structure include protein threading and de novo (from scratch) physics-based modeling. 10081000 -> 1000000901010: See also: structural motif and structural domain. 10081010 -> 1000000901020: Comparative genomics 10081020 -> 1000000901030: The core of comparative genome analysis is the establishment of the correspondence between genes (orthology analysis) or other genomic features in different organisms. 10081030 -> 1000000901040: It is these intergenomic maps that make it possible to trace the evolutionary processes responsible for the divergence of two genomes. 10081040 -> 1000000901050: A multitude of evolutionary events acting at various organizational levels shape genome evolution. 10081050 -> 1000000901060: At the lowest level, point mutations affect individual nucleotides. 10081060 -> 1000000901070: At a higher level, large chromosomal segments undergo duplication, lateral transfer, inversion, transposition, deletion and insertion. 10081070 -> 1000000901080: Ultimately, whole genomes are involved in processes of hybridization, polyploidization and endosymbiosis, often leading to rapid speciation. 10081080 -> 1000000901090: The complexity of genome evolution poses many exciting challenges to developers of mathematical models and algorithms, who have recourse to a spectrum of algorithmic, statistical and mathematical techniques, ranging from exact, heuristic, fixed-parameter and approximation algorithms for problems based on parsimony models to Markov Chain Monte Carlo algorithms for Bayesian analysis of problems based on probabilistic models. 10081090 -> 1000000901100: Many of these studies are based on homology detection and the computation of protein families. 10081100 -> 1000000901110: Modeling biological systems 10081110 -> 1000000901120: Systems biology involves the use of computer simulations of cellular subsystems (such as the networks of metabolites and enzymes which comprise metabolism, signal transduction pathways and gene regulatory networks) to both analyze and visualize the complex connections of these cellular processes. 10081120 -> 1000000901130: Artificial life or virtual evolution attempts to understand evolutionary processes via the computer simulation of simple (artificial) life forms. 10081130 -> 1000000901140: High-throughput image analysis 10081140 -> 1000000901150: Computational technologies are used to accelerate or fully automate the processing, quantification and analysis of large amounts of high-information-content biomedical imagery. 10081150 -> 1000000901160: Modern image analysis systems augment an observer's ability to make measurements from a large or complex set of images, by improving accuracy, objectivity, or speed. 10081160 -> 1000000901170: A fully developed analysis system may completely replace the observer. 10081170 -> 1000000901180: Although these systems are not unique to biomedical imagery, biomedical imaging is becoming more important for both diagnostics and research.
10081180 -> 1000000901190: Some examples are: 10081190 -> 1000000901200: high-throughput and high-fidelity quantification and sub-cellular localization (high-content screening, cytohistopathology) 10081200 -> 1000000901210: morphometrics 10081210 -> 1000000901220: clinical image analysis and visualization 10081220 -> 1000000901230: determining the real-time air-flow patterns in breathing lungs of living animals 10081230 -> 1000000901240: quantifying occlusion size in real-time imagery from the development of and recovery during arterial injury 10081240 -> 1000000901250: making behavioral observations from extended video recordings of laboratory animals 10081250 -> 1000000901260: infrared measurements for metabolic activity determination 10081260 -> 1000000901270: Protein-protein docking 10081270 -> 1000000901280: In the last two decades, tens of thousands of protein three-dimensional structures have been determined by X-ray crystallography and protein nuclear magnetic resonance spectroscopy (protein NMR). 10081280 -> 1000000901290: One central question for the biological scientist is whether it is practical to predict possible protein-protein interactions based only on these 3D shapes, without doing protein-protein interaction experiments. 10081290 -> 1000000901300: A variety of methods have been developed to tackle the protein-protein docking problem, though there is still much room for further work in this field. 10081300 -> 1000000901310: Software and Tools 10081310 -> 1000000901320: Software tools for bioinformatics range from simple command-line tools to more complex graphical programs and standalone web services. 10081320 -> 1000000901330: The computational biology tool best known among biologists is probably BLAST, an algorithm for determining the similarity of arbitrary sequences against other sequences, possibly from curated databases of protein or DNA sequences. 10081330 -> 1000000901340: The NCBI provides a popular web-based implementation that searches their databases. 10081340 -> 1000000901350: BLAST is one of a number of generally available programs for doing sequence alignment. 10081350 -> 1000000901360: Web Services in Bioinformatics 10081360 -> 1000000901370: SOAP and REST-based interfaces have been developed for a wide variety of bioinformatics applications, allowing an application running on one computer in one part of the world to use algorithms, data and computing resources on servers in other parts of the world. 10081370 -> 1000000901380: The main advantage lies in the end user not having to deal with software and database maintenance overheads. 10081375 -> 1000000901390: Basic bioinformatics services are classified by the EBI into three categories: SSS (Sequence Search Services), MSA (Multiple Sequence Alignment) and BSA (Biological Sequence Analysis). 10081380 -> 1000000901400: The availability of these service-oriented bioinformatics resources demonstrates the applicability of web-based bioinformatics solutions, which range from a collection of standalone tools with a common data format under a single, standalone or web-based interface, to integrative, distributed and extensible bioinformatics workflow management systems. Business intelligence 10100010 -> 1000001000020: Business intelligence 10100020 -> 1000001000030: Business intelligence (BI) refers to technologies, applications and practices for the collection, integration, analysis, and presentation of business information and sometimes to the information itself.
10100030 -> 1000001000040: The purpose of business intelligence, a term that dates at least to 1958, is to support better business decision making. 10100040 -> 1000001000050: Thus, BI is also described as a decision support system (DSS): 10100050 -> 1000001000060: BI is sometimes used interchangeably with briefing books, report and query tools and executive information systems. 10100060 -> 1000001000070: In general, business intelligence systems are data-driven DSS. 10100070 -> 1000001000080: BI systems provide historical, current, and predictive views of business operations, most often using data that has been gathered into a data warehouse or a data mart and occasionally working from operational data. 10100080 -> 1000001000090: Software elements support the use of this information by assisting in the extraction, analysis, and reporting of information. 10100090 -> 1000001000100: Applications tackle sales, production, financial, and many other sources of business data for purposes that include, notably, business performance management. 10100100 -> 1000001000110: Information may be gathered on comparable companies to produce benchmarks. 10100110 -> 1000001000120: History 10100120 -> 1000001000130: Prior to the start of the Information Age in the late 20th century, businesses had to collect data from non-automated sources. 10100130 -> 1000001000140: Businesses then lacked the computing resources necessary to properly analyze the data, and as a result, companies often made business decisions primarily on the basis of intuition. 10100140 -> 1000001000150: As businesses automated their systems, the amount of data increased, but its collection remained difficult due to the inability of information to be moved between or within systems. 10100150 -> 1000001000160: Analysis of information informed long-term decision making, but was slow and often required the use of instinct or expertise to make short-term decisions. 10100160 -> 1000001000170: Business intelligence was defined in 1958 by Hans Peter Luhn, who wrote, 10100170 -> 1000001000180: In this paper, business is a collection of activities carried on for whatever purpose, be it science, technology, commerce, industry, law, government, defense, et cetera. 10100180 -> 1000001000190: The communication facility serving the conduct of a business (in the broad sense) may be referred to as an intelligence system. 10100190 -> 1000001000200: The notion of intelligence is also defined here, in a more general sense, as "the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal." 10100200 -> 1000001000210: In 1989, Howard Dresner, later a Gartner Group analyst, popularized BI as an umbrella term to describe "concepts and methods to improve business decision making by using fact-based support systems." 10100210 -> 1000001000220: In modern businesses, the use of standards, automation and specialized software, including analytical tools, allows large volumes of data to be extracted, transformed, loaded and warehoused to greatly increase the speed at which information becomes available for decision-making. 10100220 -> 1000001000230: Key intelligence topics 10100230 -> 1000001000240: Business intelligence often uses key performance indicators (KPIs) to assess the present state of business and to prescribe a course of action. 10100240 -> 1000001000250: Examples of KPIs are lead conversion rate (in sales) and inventory turnover (in inventory management).
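As a small worked example of the two KPIs just mentioned, the following sketch computes a lead conversion rate and an inventory turnover; the figures and variable names are made-up assumptions for illustration only, not data from any real business.

# Illustrative KPI calculations with made-up figures.
leads_received = 400
leads_converted = 48
lead_conversion_rate = leads_converted / leads_received            # 0.12, i.e. 12%

cost_of_goods_sold = 600000.0        # over the reporting period
average_inventory_value = 150000.0   # average stock held over the same period
inventory_turnover = cost_of_goods_sold / average_inventory_value  # 4.0 turns per period

print(f"Lead conversion rate: {lead_conversion_rate:.1%}")
print(f"Inventory turnover: {inventory_turnover:.1f}")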
10100250 -> 1000001000260: Prior to the widespread adoption of computer and web applications, when information had to be manually input and calculated, performance data was often not available for weeks or months. 10100260 -> 1000001000270: Recently, banks have tried to make data available at shorter intervals and have reduced delays. 10100270 -> 1000001000280: The KPI methodology was further expanded with the Chief Performance Officer methodology which incorporated KPIs and root cause analysis into a single methodology. 10100280 -> 1000001000290: Businesses that face higher operational/credit risk loading, such as credit card companies and "wealth management" services, often make KPI-related data available weekly. 10100290 -> 1000001000300: In some cases, companies may even offer a daily analysis of data. 10100300 -> 1000001000310: This fast pace requires analysts to use IT systems to process this large volume of data. Chatterbot 10110010 -> 1000001100020: Chatterbot 10110020 -> 1000001100030: A chatterbot (or chatbot) is a type of conversational agent, a computer program designed to simulate an intelligent conversation with one or more human users via auditory or textual methods. 10110030 -> 1000001100040: In other words, a chatterbot is a computer program with artificial intelligence to talk to people through voices or typed words. 10110040 -> 1000001100050: Though many appear to be intelligently interpreting the human input prior to providing a response, most chatterbots simply scan for keywords within the input and pull a reply with the most matching keywords or the most similar wording pattern from a local database. 10110050 -> 1000001100060: Chatterbots may also be referred to as talk bots, chat bots, or chatterboxes. 10110060 -> 1000001100070: Method of operation 10110070 -> 1000001100080: A good understanding of a conversation is required to carry on a meaningful dialog but most chatterbots do not attempt this. 10110080 -> 1000001100090: Instead they "converse" by recognizing cue words or phrases from the human user, which allows them to use pre-prepared or pre-calculated responses which can move the conversation on in an apparently meaningful way without requiring them to know what they are talking about. 10110090 -> 1000001100100: For example, if a human types, "I am feeling very worried lately," the chatterbot may be programmed to recognize the phrase "I am" and respond by replacing it with "Why are you" plus a question mark at the end, giving the answer, "Why are you feeling very worried lately?" 10110100 -> 1000001100110: A similar approach using keywords would be for the program to answer any comment including (Name of celebrity) with "I think they're great, don't you?" 10110110 -> 1000001100120: Humans, especially those unfamiliar with chatterbots, sometimes find the resulting conversations engaging. 10110120 -> 1000001100130: Critics of chatterbots call this engagement the ELIZA effect. 10110130 -> 1000001100140: Some programs classified as chatterbots use other principles. 10110140 -> 1000001100150: One example is Jabberwacky, which attempts to model the way humans learn new facts and language. 10110150 -> 1000001100160: ELLA attempts to use natural language processing to make more useful responses from a human's input. 10110160 -> 1000001100170: Some programs that use natural language conversation, such as SHRDLU, are not generally classified as chatterbots because they link their speech ability to knowledge of a simulated world. 
10110170 -> 1000001100180: This type of link requires a more complex artificial intelligence (eg., a "vision" system) than standard chatterbots have. 10110180 -> 1000001100190: Early chatterbots 10110190 -> 1000001100200: The classic early chatterbots are ELIZA and PARRY. 10110200 -> 1000001100210: More recent programs are Racter, Verbots, A.L.I.C.E., and ELLA. 10110210 -> 1000001100220: The growth of chatterbots as a research field has created an expansion in their purposes. 10110220 -> 1000001100230: While ELIZA and PARRY were used exclusively to simulate typed conversation, Racter was used to "write" a story called The Policeman's Beard is Half Constructed. 10110230 -> 1000001100240: ELLA includes a collection of games and functional features to further extend the potential of chatterbots. 10110240 -> 1000001100250: The term "ChatterBot" was coined by Michael Mauldin (Creator of the first Verbot, Julia) in 1994 to describe these conversational programs. 10110250 -> 1000001100260: Malicious chatterbots 10110260 -> 1000001100270: Malicious chatterbots are frequently used to fill chat rooms with spam and advertising, or to entice people into revealing personal information, such as bank account numbers. 10110270 -> 1000001100280: They are commonly found on Yahoo! Messenger, Windows Live Messenger, AOL Instant Messenger and other instant messaging protocols. 10110280 -> 1000001100290: There has been a published report of a chatterbot used in a fake personal ad on a dating service's website. 10110290 -> 1000001100300: Chatterbots in modern AI 10110300 -> 1000001100310: Most modern AI research focuses on practical engineering tasks. 10110310 -> 1000001100320: This is known as weak AI and is distinguished from strong AI, which would require sapience and reasoning abilities. 10110320 -> 1000001100330: One pertinent field of AI research is natural language. 10110330 -> 1000001100340: Usually weak AI fields employ specialised software or programming languages created for them. 10110340 -> 1000001100350: For example, one of the 'most-human' natural language chatterbots, A.L.I.C.E., uses a programming language called AIML that is specific to its program, and its various clones, named Alicebots. 10110350 -> 1000001100360: Nevertheless, A.L.I.C.E. is still based on pattern matching without any reasoning. 10110360 -> 1000001100370: This is the same technique ELIZA, the first chatterbot, was using back in 1966. 10110370 -> 1000001100380: Australian company MyCyberTwin also deals in strong AI, allowing users to create and sustain their own virtual personalities online. 10110380 -> 1000001100390: MyCyberTwin.com also works in a corporate setting, allowing companies to set up Virtual AI Assistants. 10110390 -> 1000001100400: Another notable program, known as Jabberwacky, also deals in strong AI, as it is claimed to learn new responses based on user interactions, rather than being driven from a static database like many other existing chatterbots. 10110400 -> 1000001100410: Although such programs show initial promise, many of the existing results in trying to tackle the problem of natural language still appear fairly poor, and it seems reasonable to state that there is currently no general purpose conversational artificial intelligence. 10110410 -> 1000001100420: This has led some software developers to focus more on the practical aspect of chatterbot technology - information retrieval. 
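To illustrate the keyword-and-pattern-matching approach just described, in the spirit of ELIZA-style chatterbots rather than the actual ELIZA or A.L.I.C.E. code, here is a minimal sketch; the patterns and canned replies are assumptions invented for the example.

import re

# Minimal ELIZA-style pattern matching: recognise a cue phrase and reflect it
# back as a question.  The rules below are invented for illustration.
RULES = [
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "Why are you {0}?"),
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "How long have you felt {0}?"),
]

def respond(utterance):
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(match.group(1).rstrip(".!?"))
    return "Tell me more."  # fallback when no cue word matches

print(respond("I am feeling very worried lately."))
# -> "Why are you feeling very worried lately?"

The program has no model of what "worried" means; it only transforms the surface string, which is exactly why such systems can seem engaging while understanding nothing.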
10110420 -> 1000001100430: A common rebuttal often used within the AI community against criticism of such approaches asks, "How do we know that humans don't also just follow some cleverly devised rules?" (in the way that chatterbots do). 10110430 -> 1000001100440: Two famous examples of this line of argument against the rationale for the basis of the Turing test are John Searle's Chinese room argument and Ned Block's Blockhead argument. 10110440 -> 1000001100450: Chatterbots/Virtual Assistants in Commercial Environments 10110450 -> 1000001100460: Automated Conversational Systems have progressed and evolved far from the original designs of the first widely used chatbots. 10110460 -> 1000001100470: In the UK, large commercial entities such as Lloyds TSB, Royal Bank of Scotland, Renault, Citroën and One Railway are already utilizing Virtual Assistants to reduce expenditures on call centres and provide a first point of contact that can inform the user exactly of points of interest, provide support, capture data from the user and promote products for sale. 10110470 -> 1000001100480: In the UK, new projects and research are being conducted to introduce a Virtual Assistant into the classroom to assist the teacher. 10110480 -> 1000001100490: This project is the first of its kind, and the chatbot VA in question is based on the Yhaken chatbot design. 10110490 -> 1000001100500: The Yhaken template provides a further move forward in Automated Conversational Systems with features such as complex conversational routing and responses, a well-defined personality, a complex hierarchical construct with additional external reference points, emotional responses and in-depth small talk, all to make the experience more interactive and involving for the user. 10110500 -> 1000001100510: Annual contests for chatterbots 10110510 -> 1000001100520: Many organizations try to encourage and support developers all over the world in developing chatterbots that are able to carry out a variety of tasks and compete with each other through Turing tests and similar challenges. 10110520 -> 1000001100530: Annual contests include: 10110530 -> 1000001100540: The Chatterbox Challenge 10110540 -> 1000001100550: The Loebner Prize Cluster analysis 10200010 -> 1000001200020: Cluster analysis 10200020 -> 1000001200030: Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often proximity according to some defined distance measure. 10200030 -> 1000001200040: Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. 10200040 -> 1000001200050: The computational task of classifying the data set into k clusters is often referred to as k-clustering. 10200050 -> 1000001200060: Besides the term data clustering (or just clustering), there are a number of terms with similar meanings, including cluster analysis, automatic classification, numerical taxonomy, botryology and typological analysis. 10200060 -> 1000001200070: Types of clustering 10200070 -> 1000001200080: Data clustering algorithms can be hierarchical or partitional. 10200080 -> 1000001200090: Hierarchical algorithms find successive clusters using previously established clusters. 10200090 -> 1000001200100: Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down").
10200100 -> 1000001200110: Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. 10200110 -> 1000001200120: Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters. 10200120 -> 1000001200130: Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in the hierarchical clustering. 10200130 -> 1000001200140: Two-way clustering, co-clustering or biclustering are clustering methods where not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously. 10200140 -> 1000001200150: Another important distinction is whether the clustering uses symmetric or asymmetric distances. 10200150 -> 1000001200160: A property of Euclidean space is that distances are symmetric (the distance from object A to B is the same as the distance from B to A). 10200160 -> 1000001200170: In other applications (e.g., sequence-alignment methods, see Prinzie & Van den Poel (2006)), this is not the case. 10200170 -> 1000001200180: Distance measure 10200180 -> 1000001200190: An important step in any clustering is to select a distance measure, which will determine how the similarity of two elements is calculated. 10200190 -> 1000001200200: This will influence the shape of the clusters, as some elements may be close to one another according to one distance and further away according to another. 10200200 -> 1000001200210: For example, in a 2-dimensional space, the distance between the point (x=1, y=0) and the origin (x=0, y=0) is always 1 according to the usual norms, but the distance between the point (x=1, y=1) and the origin can be 2,\sqrt 2 or 1 if you take respectively the 1-norm, 2-norm or infinity-norm distance. 10200210 -> 1000001200220: Common distance functions: 10200220 -> 1000001200230: The Euclidean distance (also called distance as the crow flies or 2-norm distance). 10200230 -> 1000001200240: A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance. 10200240 -> 1000001200250: The Manhattan distance (also called taxicab norm or 1-norm) 10200250 -> 1000001200260: The maximum norm 10200260 -> 1000001200270: The Mahalanobis distance corrects data for different scales and correlations in the variables 10200270 -> 1000001200280: The angle between two vectors can be used as a distance measure when clustering high dimensional data. 10200280 -> 1000001200290: See Inner product space. 10200290 -> 1000001200300: The Hamming distance (sometimes edit distance) measures the minimum number of substitutions required to change one member into another. 10200300 -> 1000001200310: Hierarchical clustering 10200310 -> 1000001200320: Creating clusters 10200320 -> 1000001200330: Hierarchical clustering builds (agglomerative), or breaks up (divisive), a hierarchy of clusters. 10200330 -> 1000001200340: The traditional representation of this hierarchy is a tree (called a dendrogram), with individual elements at one end and a single cluster containing every element at the other. 10200340 -> 1000001200350: Agglomerative algorithms begin at the top of the tree, whereas divisive algorithms begin at the root. 10200350 -> 1000001200360: (In the figure, the arrows indicate an agglomerative clustering.) 
10200360 -> 1000001200370: Cutting the tree at a given height will give a clustering at a selected precision. 10200370 -> 1000001200380: In the following example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. 10200380 -> 1000001200390: Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters. 10200390 -> 1000001200400: Agglomerative hierarchical clustering 10200400 -> 1000001200410: For example, suppose this data is to be clustered, and the euclidean distance is the distance metric. 10200410 -> 1000001200420: The hierarchical clustering dendrogram would be as such: 10200420 -> 1000001200430: This method builds the hierarchy from the individual elements by progressively merging clusters. 10200430 -> 1000001200440: In our example, we have six elements {a} {b} {c} {d} {e} and {f}. 10200440 -> 1000001200450: The first step is to determine which elements to merge in a cluster. 10200450 -> 1000001200460: Usually, we want to take the two closest elements, according to the chosen distance. 10200460 -> 1000001200470: Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row j-th column is the distance between the i-th and j-th elements. 10200470 -> 1000001200480: Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. 10200480 -> 1000001200490: This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters. 10200490 -> 1000001200500: A simple agglomerative clustering algorithm is described in the single linkage clustering page; it can easily be adapted to different types of linkage (see below). 10200500 -> 1000001200510: Suppose we have merged the two closest elements b and c, we now have the following clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. 10200510 -> 1000001200520: To do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters. 10200520 -> 1000001200530: Usually the distance between two clusters \mathcal{A} and \mathcal{B} is one of the following: 10200530 -> 1000001200540: The maximum distance between elements of each cluster (also called complete linkage clustering): 10200540 -> 1000001200550: \max \{\, d(x,y) : x \in \mathcal{A},\, y \in \mathcal{B}\,\} 10200550 -> 1000001200560: The minimum distance between elements of each cluster (also called single linkage clustering): 10200560 -> 1000001200570: \min \{\, d(x,y) : x \in \mathcal{A},\, y \in \mathcal{B} \,\} 10200570 -> 1000001200580: The mean distance between elements of each cluster (also called average linkage clustering, used e.g. in UPGMA): 10200580 -> 1000001200590: {1 \over {|\mathcal{A}|\cdot|\mathcal{B}|}}\sum_{x \in \mathcal{A}}\sum_{ y \in \mathcal{B}} d(x,y) 10200590 -> 1000001200600: The sum of all intra-cluster variance 10200600 -> 1000001200610: The increase in variance for the cluster being merged (Ward's criterion) 10200610 -> 1000001200620: The probability that candidate clusters spawn from the same distribution function (V-linkage) 10200620 -> 1000001200630: Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion). 
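A minimal sketch of the agglomerative procedure described above, using the Euclidean distance and single linkage and stopping with a number criterion; the illustrative 2D points are assumptions chosen only for the example, not the figure's data.

from math import dist  # Euclidean distance between two points (Python 3.8+)

def single_linkage(c1, c2):
    # Distance between two clusters: minimum pairwise distance (single linkage).
    return min(dist(p, q) for p in c1 for q in c2)

def agglomerative(points, target_clusters):
    clusters = [[p] for p in points]          # start with each element in its own cluster
    while len(clusters) > target_clusters:    # number criterion for stopping
        # Find the pair of clusters that are closest under the chosen linkage.
        i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)        # merge the closest pair
    return clusters

data = [(1, 1), (1.5, 1), (5, 5), (5, 5.5), (9, 1), (9.5, 1.5)]
print(agglomerative(data, target_clusters=3))

Swapping single_linkage for a maximum (complete linkage) or a mean (average linkage) changes only the cluster-distance function, exactly as in the definitions above.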
10200630 -> 1000001200640: Concept clustering 10200640 -> 1000001200650: Another variation of the agglomerative clustering approach is conceptual clustering. 10200650 -> 1000001200660: Partitional clustering 10200660 -> 1000001200670: K-means and derivatives 10200670 -> 1000001200680: K-means clustering 10200680 -> 1000001200690: The K-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. 10200690 -> 1000001200700: The center is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster... 10200700 -> 1000001200710: Example: The data set has three dimensions and the cluster has two points: X = (x1, x2, x3) and Y = (y1, y2, y3). 10200710 -> 1000001200720: Then the centroid Z becomes Z = (z1, z2, z3), where z1 = (x1 + y1)/2 and z2 = (x2 + y2)/2 and z3 = (x3 + y3)/2. 10200720 -> 1000001200730: The algorithm steps are (J. MacQueen, 1967): 10200730 -> 1000001200740: Choose the number of clusters, k. 10200740 -> 1000001200750: Randomly generate k clusters and determine the cluster centers, or directly generate k random points as cluster centers. 10200750 -> 1000001200760: Assign each point to the nearest cluster center. 10200760 -> 1000001200770: Recompute the new cluster centers. 10200770 -> 1000001200780: Repeat the two previous steps until some convergence criterion is met (usually that the assignment hasn't changed). 10200780 -> 1000001200790: The main advantages of this algorithm are its simplicity and speed which allows it to run on large datasets. 10200790 -> 1000001200800: Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments. 10200800 -> 1000001200810: It minimizes intra-cluster variance, but does not ensure that the result has a global minimum of variance. 10200810 -> 1000001200820: Fuzzy c-means clustering 10200820 -> 1000001200830: In fuzzy clustering, each point has a degree of belonging to clusters, as in fuzzy logic, rather than belonging completely to just one cluster. 10200830 -> 1000001200840: Thus, points on the edge of a cluster, may be in the cluster to a lesser degree than points in the center of cluster. 10200840 -> 1000001200850: For each point x we have a coefficient giving the degree of being in the kth cluster u_k(x). 10200850 -> 1000001200860: Usually, the sum of those coefficients is defined to be 1: 10200860 -> 1000001200870: \forall x \sum_{k=1}^{\mathrm{num.}\ \mathrm{clusters}} u_k(x) \ =1. 10200870 -> 1000001200880: With fuzzy c-means, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster: 10200880 -> 1000001200890: \mathrm{center}_k = {{\sum_x u_k(x)^m x} \over {\sum_x u_k(x)^m}}. 10200890 -> 1000001200900: The degree of belonging is related to the inverse of the distance to the cluster 10200900 -> 1000001200910: u_k(x) = {1 \over d(\mathrm{center}_k,x)}, 10200910 -> 1000001200920: then the coefficients are normalized and fuzzyfied with a real parameter m>1 so that their sum is 1. 10200920 -> 1000001200930: So 10200930 -> 1000001200940: u_k(x) = \frac{1}{\sum_j \left(\frac{d(\mathrm{center}_k,x)}{d(\mathrm{center}_j,x)}\right)^{2/(m-1)}}. 10200940 -> 1000001200950: For m equal to 2, this is equivalent to normalising the coefficient linearly to make their sum 1. 
10200950 -> 1000001200960: When m is close to 1, the cluster center closest to the point is given much more weight than the others, and the algorithm is similar to k-means. 10200960 -> 1000001200970: The fuzzy c-means algorithm is very similar to the k-means algorithm: 10200970 -> 1000001200980: Choose a number of clusters. 10200980 -> 1000001200990: Assign randomly to each point coefficients for being in the clusters. 10200990 -> 1000001201000: Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than \epsilon, the given sensitivity threshold): 10201000 -> 1000001201010: Compute the centroid for each cluster, using the formula above. 10201010 -> 1000001201020: For each point, compute its coefficients of being in the clusters, using the formula above. 10201020 -> 1000001201030: The algorithm minimizes intra-cluster variance as well, but has the same problems as k-means: the minimum is a local minimum, and the results depend on the initial choice of weights. 10201030 -> 1000001201040: The Expectation-maximization algorithm is a more statistically formalized method which includes some of these ideas: partial membership in classes. 10201040 -> 1000001201050: It has better convergence properties and is in general preferred to fuzzy c-means. 10201050 -> 1000001201060: QT clustering algorithm 10201060 -> 1000001201070: QT (quality threshold) clustering (Heyer et al., 1999) is an alternative method of partitioning data, invented for gene clustering. 10201070 -> 1000001201080: It requires more computing power than k-means, but does not require specifying the number of clusters a priori, and always returns the same result when run several times. 10201080 -> 1000001201090: The algorithm is: 10201090 -> 1000001201100: The user chooses a maximum diameter for clusters. 10201100 -> 1000001201110: Build a candidate cluster for each point by including the closest point, the next closest, and so on, until the diameter of the cluster surpasses the threshold. 10201110 -> 1000001201120: Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration. 10201130 -> 1000001201140: Recurse with the reduced set of points. 10201140 -> 1000001201150: The distance between a point and a group of points is computed using complete linkage, i.e. as the maximum distance from the point to any member of the group (see the "Agglomerative hierarchical clustering" section about distance between clusters). 10201150 -> 1000001201160: Locality-sensitive hashing 10201160 -> 1000001201170: Locality-sensitive hashing can be used for clustering. 10201170 -> 1000001201180: Feature space vectors are sets, and the metric used is the Jaccard distance. 10201180 -> 1000001201190: The feature space can be considered high-dimensional. 10201190 -> 1000001201200: The min-wise independent permutations LSH scheme (sometimes MinHash) is then used to put similar items into buckets. 10201200 -> 1000001201210: With just one set of hashing methods, there are only clusters of very similar elements. 10201210 -> 1000001201220: By seeding the hash functions several times (e.g., 20), it is possible to get bigger clusters.
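To make the MinHash idea just described concrete, the following is a minimal sketch, not a production LSH implementation; the hashing scheme (MD5 with integer seeds), the number of hash functions, and the sample sets are assumptions chosen for illustration.

import hashlib

def minhash_signature(item_set, num_hashes=20):
    # One min-wise value per seeded hash function.  The probability that two
    # sets agree at a given signature position equals their Jaccard similarity.
    return [min(int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16)
                for x in item_set)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of signature positions where the two sets agree.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = {"red", "green", "blue", "yellow"}
doc_b = {"red", "green", "blue", "purple"}
doc_c = {"cat", "dog"}

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b))  # high: the sets overlap heavily
print(estimated_jaccard(sig_a, sig_c))  # near zero: the sets are disjoint

Items whose estimated similarity exceeds a chosen threshold can then be placed in the same bucket or cluster, and using more seeded hash functions makes the estimate more stable.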
10201220 -> 1000001201230: Graph-theoretic methods 10201230 -> 1000001201240: Formal concept analysis is a technique for generating clusters of objects and attributes, given a bipartite graph representing the relations between the objects and attributes. 10201240 -> 1000001201250: Other methods for generating overlapping clusters (a cover rather than a partition) are discussed by Jardine and Sibson (1968) and Cole and Wishart (1970). 10201250 -> 1000001201260: Elbow criterion 10201260 -> 1000001201270: The elbow criterion is a common rule of thumb for determining what number of clusters should be chosen, for example for k-means and agglomerative hierarchical clustering. 10201270 -> 1000001201280: It should also be noted that the initial assignment of cluster seeds has a bearing on the final model performance. 10201280 -> 1000001201290: Thus, it is appropriate to re-run the cluster analysis multiple times. 10201290 -> 1000001201300: The elbow criterion says that you should choose a number of clusters so that adding another cluster does not add sufficient information. 10201300 -> 1000001201310: More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph (the elbow). 10201310 -> 1000001201320: This elbow cannot always be unambiguously identified. 10201320 -> 1000001201330: Percentage of variance explained is the ratio of the between-group variance to the total variance. 10201330 -> 1000001201340: On the following graph, the elbow is indicated by the red circle. 10201340 -> 1000001201350: The number of clusters chosen should therefore be 4. 10201350 -> 1000001201360: Spectral clustering 10201360 -> 1000001201370: Given a set of data points A, the similarity matrix may be defined as a matrix S where S_{ij} represents a measure of the similarity between points i, j \in A. 10201370 -> 1000001201380: Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions. 10201380 -> 1000001201390: One such technique is the Shi-Malik algorithm, commonly used for image segmentation. 10201390 -> 1000001201400: It partitions points into two sets (S_1,S_2) based on the eigenvector v corresponding to the second-smallest eigenvalue of the Laplacian matrix 10201400 -> 1000001201410: L = I - D^{-1/2}SD^{-1/2} 10201410 -> 1000001201420: of S, where D is the diagonal matrix 10201420 -> 1000001201430: D_{ii} = \sum_{j} S_{ij}. 10201430 -> 1000001201440: This partitioning may be done in various ways, such as by taking the median m of the components in v, and placing all points whose component in v is greater than m in S_1, and the rest in S_2. 10201440 -> 1000001201450: The algorithm can be used for hierarchical clustering by repeatedly partitioning the subsets in this fashion. 10201450 -> 1000001201460: A related algorithm is the Meila-Shi algorithm, which takes the eigenvectors corresponding to the k largest eigenvalues of the matrix P = SD^{-1} for some k, and then invokes another clustering algorithm (e.g., k-means) to cluster points by their respective k components in these eigenvectors. 10201460 -> 1000001201470: Applications 10201470 -> 1000001201480: Biology 10201480 -> 1000001201490: In biology, clustering has many applications. 10201490 -> 1000001201500: In imaging, data clustering may take different forms depending on the data dimensionality.
10201500 -> 1000001201510: For example, the SOCR EM Mixture model segmentation activity and applet shows how to obtain point, region or volume classification using the online SOCR computational libraries. 10201510 -> 1000001201520: In the fields of plant and animal ecology, clustering is used to describe and to make spatial and temporal comparisons of communities (assemblages) of organisms in heterogeneous environments; it is also used in plant systematics to generate artificial phylogenies or clusters of organisms (individuals) at the species, genus or higher level that share a number of attributes 10201520 -> 1000001201530: In computational biology and bioinformatics: 10201530 -> 1000001201540: In transcriptomics, clustering is used to build groups of genes with related expression patterns (also known as coexpressed genes). 10201540 -> 1000001201550: Often such groups contain functionally related proteins, such as enzymes for a specific pathway, or genes that are co-regulated. 10201550 -> 1000001201560: High throughput experiments using expressed sequence tags (ESTs) or DNA microarrays can be a powerful tool for genome annotation, a general aspect of genomics. 10201560 -> 1000001201570: In sequence analysis, clustering is used to group homologous sequences into gene families. 10201570 -> 1000001201580: This is a very important concept in bioinformatics, and evolutionary biology in general. 10201580 -> 1000001201590: See evolution by gene duplication. 10201590 -> 1000001201600: In high-throughput genotyping platforms clustering algorithms are used to automatically assign genotypes. 10201600 -> 1000001201610: Medicine 10201610 -> 1000001201620: In medical imaging, such as PET scans, cluster analysis can be used to differentiate between different types of tissue and blood in a three dimensional image. 10201620 -> 1000001201630: In this application, actual position does not matter, but the voxel intensity is considered as a vector, with a dimension for each image that was taken over time. 10201630 -> 1000001201640: This technique allows, for example, accurate measurement of the rate a radioactive tracer is delivered to the area of interest, without a separate sampling of arterial blood, an intrusive technique that is most common today. 10201640 -> 1000001201650: Market research 10201650 -> 1000001201660: Cluster analysis is widely used in market research when working with multivariate data from surveys and test panels. 10201660 -> 1000001201670: Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers. 10201670 -> 1000001201680: Segmenting the market and determining target markets 10201680 -> 1000001201690: Product positioning 10201690 -> 1000001201700: New product development 10201700 -> 1000001201710: Selecting test markets (see : experimental techniques) 10201710 -> 1000001201720: Other applications 10201720 -> 1000001201730: Social network analysis: In the study of social networks, clustering may be used to recognize communities within large groups of people. 10201730 -> 1000001201740: Image segmentation: Clustering can be used to divide a digital image into distinct regions for border detection or object recognition. 10201740 -> 1000001201750: Data mining: Many data mining applications involve partitioning data items into related subsets; the marketing applications discussed above represent some examples. 
10201750 -> 1000001201760: Another common application is the division of documents, such as World Wide Web pages, into genres. 10201760 -> 1000001201770: Search result grouping: In the process of intelligent grouping of the files and websites, clustering may be used to create a more relevant set of search results compared to normal search engines like Google. 10201770 -> 1000001201780: There are currently a number of web based clustering tools such as Clusty. 10201780 -> 1000001201790: Slippy map optimization: Flickr's map of photos and other map sites use clustering to reduce the number of markers on a map. 10201790 -> 1000001201800: This makes it both faster and reduces the amount of visual clutter. 10201800 -> 1000001201810: IMRT segmentation: Clustering can be used to divide a fluence map into distinct regions for conversion into deliverable fields in MLC-based Radiation Therapy. 10201810 -> 1000001201820: Grouping of Shopping Items: Clustering can be used to group all the shopping items available on the web into a set of unique products. 10201820 -> 1000001201830: For example, all the items on eBay can be grouped into unique products. 10201825 -> 1000001201840: (eBay doesn't have the concept of a SKU) 10201830 -> 1000001201850: Mathematical chemistry: To find structural similarity, etc., for example, 3000 chemical compounds were clustered in the space of 90 topological indices. 10201840 -> 1000001201860: Petroleum Geology: Cluster Analysis is used to reconstruct missing bottom hole core data or missing log curves in order to evaluate reservoir properties. 10201850 -> 1000001201870: Comparisons between data clusterings 10201860 -> 1000001201880: There have been several suggestions for a measure of similarity between two clusterings. 10201870 -> 1000001201890: Such a measure can be used to compare how well different data clustering algorithms perform on a set of data. 10201880 -> 1000001201900: Many of these measures are derived from the matching matrix (aka confusion matrix), e.g., the Rand measure and the Fowlkes-Mallows Bk measures. 10201890 -> 1000001201910: Marina Meila's Variation of Information metric is a more recent approach for measuring distance between clusterings. 10201900 -> 1000001201920: It uses mutual information and entropy to approximate the distance between two clusterings across the lattice of possible clusterings. 10201910 -> 1000001201930: Algorithms 10201920 -> 1000001201940: In recent years considerable effort has been put into improving algorithm performance (Z. Huang, 1998). 10201930 -> 1000001201950: Among the most popular are CLARANS (Ng and Han,1994), DBSCAN (Ester et al., 1996) and BIRCH (Zhang et al., 1996). Computational linguistics 10120010 -> 1000001300020: Computational linguistics 10120020 -> 1000001300030: Computational linguistics is an interdisciplinary field dealing with the statistical and/or rule-based modeling of natural language from a computational perspective. 10120030 -> 1000001300040: This modeling is not limited to any particular field of linguistics. 10120040 -> 1000001300050: Traditionally, computational linguistics was usually performed by computer scientists who had specialized in the application of computers to the processing of a natural language. 
10120050 -> 1000001300060: Recent research has shown that human language is much more complex than previously thought, so computational linguists often work as members of interdisciplinary teams, including linguists (specifically trained in linguistics), language experts (persons with some level of ability in the languages relevant to a given project), and computer scientists. 10120060 -> 1000001300070: In general computational linguistics draws upon the involvement of linguists, computer scientists, experts in artificial intelligence, cognitive psychologists, mathematicians, and logicians, amongst others. 10120070 -> 1000001300080: Origins 10120080 -> 1000001300090: Computational linguistics as a field predates artificial intelligence, a field under which it is often grouped. 10120090 -> 1000001300100: Computational linguistics originated with efforts in the United States in the 1950s to use computers to automatically translate texts from foreign languages, particularly Russian scientific journals, into English. 10120100 -> 1000001300110: Since computers had proven their ability to do arithmetic much faster and more accurately than humans, it was thought to be only a short matter of time before the technical details could be taken care of that would allow them the same remarkable capacity to process language. 10120110 -> 1000001300120: When machine translation (also known as mechanical translation) failed to yield accurate translations right away, automated processing of human languages was recognized as far more complex than had originally been assumed. 10120120 -> 1000001300130: Computational linguistics was born as the name of the new field of study devoted to developing algorithms and software for intelligently processing language data. 10120130 -> 1000001300140: When artificial intelligence came into existence in the 1960s, the field of computational linguistics became that sub-division of artificial intelligence dealing with human-level comprehension and production of natural languages. 10120140 -> 1000001300150: In order to translate one language into another, it was observed that one had to understand the grammar of both languages, including both morphology (the grammar of word forms) and syntax (the grammar of sentence structure). 10120150 -> 1000001300160: In order to understand syntax, one had to also understand the semantics and the lexicon (or 'vocabulary'), and even to understand something of the pragmatics of language use. 10120160 -> 1000001300170: Thus, what started as an effort to translate between languages evolved into an entire discipline devoted to understanding how to represent and process natural languages using computers. 10120170 -> 1000001300180: Subfields 10120180 -> 1000001300190: Computational linguistics can be divided into major areas depending upon the medium of the language being processed, whether spoken or textual; and upon the task being performed, whether analyzing language (recognition) or synthesizing language (generation). 10120190 -> 1000001300200: Speech recognition and speech synthesis deal with how spoken language can be understood or created using computers. 10120200 -> 1000001300210: Parsing and generation are sub-divisions of computational linguistics dealing respectively with taking language apart and putting it together. 10120210 -> 1000001300220: Machine translation remains the sub-division of computational linguistics dealing with having computers translate between languages. 
10120220 -> 1000001300230: Some of the areas of research that are studied by computational linguistics include: 10120230 -> 1000001300240: Computer-aided corpus linguistics 10120240 -> 1000001300250: Design of parsers or chunkers for natural languages 10120250 -> 1000001300260: Design of taggers like POS-taggers (part-of-speech taggers) 10120260 -> 1000001300270: Definition of specialized logics like resource logics for NLP 10120270 -> 1000001300280: Research in the relation between formal and natural languages in general 10120280 -> 1000001300290: Machine translation, e.g. by a translating computer 10120290 -> 1000001300300: Computational complexity of natural language, largely modeled on automata theory, with the application of context-sensitive grammar and linearly-bounded Turing machines. 10120300 -> 1000001300310: The Association for Computational Linguistics defines computational linguistics as: 10120310 -> 1000001300320: ...the scientific study of language from a computational perspective. 10120320 -> 1000001300330: Computational linguists are interested in providing computational models of various kinds of linguistic phenomena. Computer program 10130010 -> 1000001400020: Computer program 10130020 -> 1000001400030: Computer programs (also software programs, or just programs) are instructions for a computer. 10130030 -> 1000001400040: A computer requires programs to function, and a computer program does nothing unless its instructions are executed by a central processor. 10130040 -> 1000001400050: Computer programs are usually executable programs or the source code from which executable programs are derived (e.g., compiled). 10130050 -> 1000001400060: Computer source code is often written by professional computer programmers. 10130060 -> 1000001400070: Source code is written in a programming language that usually follows one of two main paradigms: imperative or declarative programming. 10130070 -> 1000001400080: Source code may be converted into an executable file (sometimes called an executable program) by a compiler. 10130080 -> 1000001400090: Alternatively, computer programs may be executed by a central processing unit with the aid of an interpreter, or may be embedded directly into hardware. 10130090 -> 1000001400100: Computer programs may be categorized along functional lines: system software and application software. 10130100 -> 1000001400110: Many computer programs may run simultaneously on a single computer, a process known as multitasking. 10130110 -> 1000001400120: Programming 10130120 -> 1000001400130: #include <stdio.h> int main(void) { printf("Hello world!"); return 0; } 10130160 -> 1000001400140: Source code of a program written in the C programming language 10130170 -> 1000001400150: Computer programming is the iterative process of writing or editing source code. 10130180 -> 1000001400160: Editing source code involves testing, analyzing, and refining. 10130190 -> 1000001400170: A person who practices this skill is referred to as a computer programmer or software developer. 10130200 -> 1000001400180: The sometimes lengthy process of computer programming is usually referred to as software development. 10130210 -> 1000001400190: The term software engineering is becoming popular as the process is seen as an engineering discipline. 10130220 -> 1000001400200: Paradigms 10130230 -> 1000001400210: Computer programs can be categorized by the programming language paradigm used to produce them. 10130240 -> 1000001400220: Two of the main paradigms are imperative and declarative.
10130250 -> 1000001400230: Programs written using an imperative language specify an algorithm using declarations, expressions, and statements. 10130260 -> 1000001400240: A declaration associates a variable name with a datatype. 10130270 -> 1000001400250: For example: var x: integer; . 10130280 -> 1000001400260: An expression yields a value. 10130290 -> 1000001400270: For example: 2 + 2 yields 4. 10130300 -> 1000001400280: Finally, a statement might assign an expression to a variable or use the value of a variable to alter the program's control flow. 10130310 -> 1000001400290: For example: x := 2 + 2; if x = 4 then do_something(); 10130315 -> 1000001400300: One criticism of imperative languages is the side-effect of an assignment statement on a class of variables called non-local variables. 10130320 -> 1000001400310: Programs written using a declarative language specify the properties that have to be met by the output and do not specify any implementation details. 10130330 -> 1000001400320: Two broad categories of declarative languages are functional languages and logical languages. 10130340 -> 1000001400330: The principle behind functional languages (like Haskell) is to not allow side-effects, which makes it easier to reason about programs like mathematical functions. 10130350 -> 1000001400340: The principle behind logical languages (like Prolog) is to define the problem to be solved — the goal — and leave the detailed solution to the Prolog system itself. 10130360 -> 1000001400350: The goal is defined by providing a list of subgoals. 10130370 -> 1000001400360: Then each subgoal is defined by further providing a list of its subgoals, etc. 10130380 -> 1000001400370: If a path of subgoals fails to find a solution, then that subgoal is backtracked and another path is systematically attempted. 10130390 -> 1000001400380: The form in which a program is created may be textual or visual. 10130400 -> 1000001400390: In a visual language program, elements are graphically manipulated rather than textually specified. 10130410 -> 1000001400400: Compilation or interpretation 10130420 -> 1000001400410: A computer program in the form of a human-readable, computer programming language is called source code. 10130430 -> 1000001400420: Source code may be converted into an executable image by a compiler or executed immediately with the aid of an interpreter. 10130440 -> 1000001400430: Compiled computer programs are commonly referred to as executables, binary images, or simply as binaries — a reference to the binary file format used to store the executable code. 10130450 -> 1000001400440: Compilers are used to translate source code from a programming language into either object code or machine code. 10130460 -> 1000001400450: Object code needs further processing to become machine code, and machine code is the Central Processing Unit's native code, ready for execution. 10130470 -> 1000001400460: Interpreted computer programs are either decoded and then immediately executed or are decoded into some efficient intermediate representation for future execution. 10130480 -> 1000001400470: BASIC, Perl, and Python are examples of immediately executed computer programs. 10130490 -> 1000001400480: Alternatively, Java computer programs are compiled ahead of time and stored as a machine independent code called bytecode. 10130500 -> 1000001400490: Bytecode is then executed upon request by an interpreter called a virtual machine. 
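The declaration, expression, and statement constructs described above can be sketched in C, the language of the earlier source-code example; the variable x and the do_something function are illustrative placeholders rather than part of any particular program:

#include <stdio.h>

/* A placeholder for whatever action the program takes when x equals 4. */
static void do_something(void) {
    printf("x is 4\n");
}

int main(void) {
    int x;           /* declaration: associates the name x with the int datatype */
    x = 2 + 2;       /* statement: assigns the value of the expression 2 + 2 to x */
    if (x == 4) {    /* statement: uses the value of x to alter the control flow */
        do_something();
    }
    return 0;
}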
10130510 -> 1000001400500: The main disadvantage of interpreters is computer programs run slower than if compiled. 10130520 -> 1000001400510: Interpreting code is slower than running the compiled version because the interpreter must decode each statement each time it is loaded and then perform the desired action. 10130530 -> 1000001400520: On the other hand, software development may be quicker using an interpreter because testing is immediate when the compilation step is omitted. 10130540 -> 1000001400530: Another disadvantage of interpreters is the interpreter must be present on the computer at the time the computer program is executed. 10130550 -> 1000001400540: Alternatively, compiled computer programs need not have the compiler present at the time of execution. 10130560 -> 1000001400550: No properties of a programming language require it to be exclusively compiled or exclusively interpreted. 10130570 -> 1000001400560: The categorization usually reflects the most popular method of language execution. 10130580 -> 1000001400570: For example, BASIC is thought of as an interpreted language and C a compiled language, despite the existence of BASIC compilers and C interpreters. 10130590 -> 1000001400580: Self-modifying programs 10130600 -> 1000001400590: A computer program in execution is normally treated as being different from the data the program operates on. 10130610 -> 1000001400600: However, in some cases this distinction is blurred when a computer program modifies itself. 10130620 -> 1000001400610: The modified computer program is subsequently executed as part of the same program. 10130630 -> 1000001400620: Self-modifying code is possible for programs written in Lisp, COBOL, and Prolog. 10130640 -> 1000001400630: Execution and storage 10130650 -> 1000001400640: Typically, computer programs are stored in non-volatile memory until requested either directly or indirectly to be executed by the computer user. 10130660 -> 1000001400650: Upon such a request, the program is loaded into random access memory, by a computer program called an operating system, where it can be accessed directly by the central processor. 10130670 -> 1000001400660: The central processor then executes ("runs") the program, instruction by instruction, until termination. 10130680 -> 1000001400670: A program in execution is called a process. 10130690 -> 1000001400680: Termination is either by normal self-termination or by error — software or hardware error. 10130700 -> 1000001400690: Embedded programs 10130710 -> 1000001400700: Some computer programs are embedded into hardware. 10130720 -> 1000001400710: A stored-program computer requires an initial computer program stored in its read-only memory to boot. 10130730 -> 1000001400720: The boot process is to identify and initialize all aspects of the system, from CPU registers to device controllers to memory contents. 10130740 -> 1000001400730: Following the initialization process, this initial computer program loads the operating system and sets the program counter to begin normal operations. 10130750 -> 1000001400740: Independent of the host computer, a hardware device might have embedded firmware to control its operation. 10130760 -> 1000001400750: Firmware is used when the computer program is rarely or never expected to change, or when the program must not be lost when the power is off. 10130770 -> 1000001400760: Manual programming 10130780 -> 1000001400770: Computer programs historically were manually input to the central processor via switches. 
10130790 -> 1000001400780: An instruction was represented by a configuration of on/off settings. 10130800 -> 1000001400790: After setting the configuration, an execute button was pressed. 10130810 -> 1000001400800: This process was then repeated. 10130820 -> 1000001400810: Computer programs also historically were manually input via paper tape or punched cards. 10130830 -> 1000001400820: After the medium was loaded, the starting address was set via switches and the execute button pressed. 10130840 -> 1000001400830: Automatic program generation 10130850 -> 1000001400840: Generative programming is a style of computer programming that creates source code through generic classes, prototypes, templates, aspects, and code generators to improve programmer productivity. 10130860 -> 1000001400850: Source code is generated with programming tools such as a template processor or an Integrated Development Environment. 10130870 -> 1000001400860: The simplest form of source code generator is a macro processor, such as the C preprocessor, which replaces patterns in source code according to relatively simple rules. 10130880 -> 1000001400870: Software engines output source code or markup code that simultaneously become the input to another computer process. 10130890 -> 1000001400880: The analogy is that of one process driving another process, with the computer code being burned as fuel. 10130900 -> 1000001400890: Application servers are software engines that deliver applications to client computers. 10130910 -> 1000001400900: For example, a Wiki is an application server that allows users to build dynamic content assembled from articles. 10130920 -> 1000001400910: Wikis generate HTML, CSS, Java, and Javascript which are then interpreted by a web browser. 10130930 -> 1000001400920: Simultaneous execution 10130940 -> 1000001400930: Many operating systems support multitasking which enables many computer programs to appear to be running simultaneously on a single computer. 10130950 -> 1000001400940: Operating systems may run multiple programs through process scheduling — a software mechanism to switch the CPU among processes frequently so that users can interact with each program while it is running. 10130960 -> 1000001400950: Within hardware, modern day multiprocessor computers or computers with multicore processors may run multiple programs. 10130970 -> 1000001400960: Functional categories 10130980 -> 1000001400970: Computer programs may be categorized along functional lines. 10130990 -> 1000001400980: These functional categories are system software and application software. 10131000 -> 1000001400990: System software includes the operating system which couples the computer's hardware with the application software. 10131010 -> 1000001401000: The purpose of the operating system is to provide an environment in which application software executes in a convenient and efficient manner. 10131020 -> 1000001401010: In addition to the operating system, system software includes utility programs that help manage and tune the computer. 10131030 -> 1000001401020: If a computer program is not system software then it is application software. 10131040 -> 1000001401030: Application software includes middleware, which couples the system software with the user interface. 10131050 -> 1000001401040: Application software also includes utility programs that help users solve application problems, like the need for sorting. 
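As a small illustration of the macro substitution performed by the C preprocessor, mentioned above under automatic program generation, the following sketch defines a simple pattern (the macro name SQUARE is an arbitrary choice) that is replaced in the source code before compilation proper begins:

#include <stdio.h>

/* The preprocessor replaces every use of SQUARE(n) with ((n) * (n))
   before the compiler itself ever sees the code. */
#define SQUARE(n) ((n) * (n))

int main(void) {
    printf("%d\n", SQUARE(5));   /* expands to ((5) * (5)) and prints 25 */
    return 0;
}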
Computer science 10140010 -> 1000001500020: Computer science 10140020 -> 1000001500030: Computer science (or computing science) is the study and the science of the theoretical foundations of information and computation and their implementation and application in computer systems. 10140030 -> 1000001500040: Computer science has many sub-fields; some emphasize the computation of specific results (such as computer graphics), while others relate to properties of computational problems (such as computational complexity theory). 10140040 -> 1000001500050: Still others focus on the challenges in implementing computations. 10140050 -> 1000001500060: For example, programming language theory studies approaches to describing computations, while computer programming applies specific programming languages to solve specific computational problems. 10140060 -> 1000001500070: A further subfield, human-computer interaction, focuses on the challenges in making computers and computations useful, usable and universally accessible to people. 10140070 -> 1000001500080: History 10140080 -> 1000001500090: The early foundations of what would become computer science predate the invention of the modern digital computer. 10140090 -> 1000001500100: Machines for calculating fixed numerical tasks, such as the abacus, have existed since antiquity. 10140100 -> 1000001500110: Wilhelm Schickard built the first mechanical calculator in 1623. 10140110 -> 1000001500120: Charles Babbage designed a difference engine in Victorian times (between 1837 and 1901) helped by Ada Lovelace. 10140120 -> 1000001500130: Around 1900, the IBM corporation sold punch-card machines. 10140130 -> 1000001500140: However, all of these machines were constrained to perform a single task, or at best some subset of all possible tasks. 10140140 -> 1000001500150: During the 1940s, as newer and more powerful computing machines were developed, the term computer came to refer to the machines rather than their human predecessors. 10140150 -> 1000001500160: As it became clear that computers could be used for more than just mathematical calculations, the field of computer science broadened to study computation in general. 10140160 -> 1000001500170: Computer science began to be established as a distinct academic discipline in the 1960s, with the creation of the first computer science departments and degree programs. 10140170 -> 1000001500180: Since practical computers became available, many applications of computing have become distinct areas of study in their own right. 10140180 -> 1000001500190: Many initially believed it impossible that "computers themselves could actually be a scientific field of study" (Levy 1984, p. 11), though it was in the "late fifties" (Levy 1984, p.11) that it gradually became accepted among the greater academic population. 10140190 -> 1000001500200: It is the now well-known IBM brand that formed part of the computer science revolution during this time. 10140200 -> 1000001500210: 'IBM' (short for International Business Machines) released the IBM 704 and later the IBM 709 computers, which were widely used during the exploration period of such devices. 10140210 -> 1000001500220: "Still, working with the IBM [computer] was frustrating...if you had misplaced as much as one letter in one instruction, the program would crash, and you would have to start the whole process over again" (Levy 1984, p.13). 
10140220 -> 1000001500230: During the late 1950s, the computer science discipline was very much in its developmental stages, and such issues were commonplace. 10140230 -> 1000001500240: Time has seen significant improvements in the usability and effectiveness of computer science technology. 10140240 -> 1000001500250: Modern society has seen a significant shift from computers being used solely by experts or professionals to a more widespread user base. 10140250 -> 1000001500260: By the 1990s, computers had become accepted as the norm within everyday life. 10140260 -> 1000001500270: During this time, data entry was a primary use of computers, with many businesses preferring to streamline their practices through the use of a computer. 10140270 -> 1000001500280: This also gave the additional benefit of removing the need for large amounts of documentation and file records, which had consumed much-needed physical space within offices. 10140280 -> 1000001500290: Major achievements 10140290 -> 1000001500300: Despite its relatively short history as a formal academic discipline, computer science has made a number of fundamental contributions to science and society. 10140300 -> 1000001500310: These include: 10140310 -> 1000001500320: Applications within computer science 10140320 -> 1000001500330: A formal definition of computation and computability, and proof that there are computationally unsolvable and intractable problems. 10140330 -> 1000001500340: The concept of a programming language, a tool for the precise expression of methodological information at various levels of abstraction. 10140340 -> 1000001500350: Applications outside of computing 10140350 -> 1000001500360: Sparked the Digital Revolution which led to the current Information Age and the Internet. 10140360 -> 1000001500370: In cryptography, breaking the Enigma machine was an important factor contributing to the Allied victory in World War II. 10140370 -> 1000001500380: Scientific computing enabled advanced study of the mind, and mapping of the human genome became possible with the Human Genome Project. 10140380 -> 1000001500390: Distributed computing projects like Folding@home explore protein folding. 10140390 -> 1000001500400: Algorithmic trading has increased the efficiency and liquidity of financial markets by using artificial intelligence, machine learning and other statistical and numerical techniques on a large scale. 10140400 -> 1000001500410: Relationship with other fields 10140410 -> 1000001500420: Despite its name, a significant amount of computer science does not involve the study of computers themselves. 10140420 -> 1000001500430: Because of this, several alternative names have been proposed. 10140430 -> 1000001500440: Danish scientist Peter Naur suggested the term datalogy, to reflect the fact that the scientific discipline revolves around data and data treatment, while not necessarily involving computers. 10140440 -> 1000001500450: The first scientific institution to use the term was the Department of Datalogy at the University of Copenhagen, founded in 1969, with Peter Naur being the first professor in datalogy. 10140450 -> 1000001500460: The term is used mainly in the Scandinavian countries. 10140460 -> 1000001500470: Also, in the early days of computing, a number of terms for the practitioners of the field of computing were suggested in the Communications of the ACM: turingineer, turologist, flow-charts-man, applied meta-mathematician, and applied epistemologist.
10140470 -> 1000001500480: Three months later in the same journal, comptologist was suggested, followed next year by hypologist. 10140480 -> 1000001500490: Recently, the term computics has been suggested. 10140490 -> 1000001500500: The term Informatik has been used with greater frequency in Europe. 10140500 -> 1000001500510: The renowned computer scientist Edsger Dijkstra stated, "Computer science is no more about computers than astronomy is about telescopes." 10140510 -> 1000001500520: The design and deployment of computers and computer systems are generally considered the province of disciplines other than computer science. 10140520 -> 1000001500530: For example, the study of computer hardware is usually considered part of computer engineering, while the study of commercial computer systems and their deployment is often called information technology or information systems. 10140530 -> 1000001500540: Computer science is sometimes criticized as being insufficiently scientific, a view espoused in the statement "Science is to computer science as hydrodynamics is to plumbing", credited to Stan Kelly-Bootle and others. 10140540 -> 1000001500550: However, there has been much cross-fertilization of ideas between the various computer-related disciplines. 10140550 -> 1000001500560: Computer science research has also often crossed into other disciplines, such as cognitive science, economics, mathematics, physics (see quantum computing), and linguistics. 10140560 -> 1000001500570: Computer science is considered by some to have a much closer relationship with mathematics than many scientific disciplines. 10140570 -> 1000001500580: Early computer science was strongly influenced by the work of mathematicians such as Kurt Gödel and Alan Turing, and there continues to be a useful interchange of ideas between the two fields in areas such as mathematical logic, category theory, domain theory, and algebra. 10140580 -> 1000001500590: The relationship between computer science and software engineering is a contentious issue, which is further muddied by disputes over what the term "software engineering" means, and how computer science is defined. 10140590 -> 1000001500600: David Parnas, taking a cue from the relationship between other engineering and science disciplines, has claimed that the principal focus of computer science is studying the properties of computation in general, while the principal focus of software engineering is the design of specific computations to achieve practical goals, making the two separate but complementary disciplines. 10140600 -> 1000001500610: The academic, political, and funding aspects of computer science tend to be rooted in whether a department in the U.S. was formed with a mathematical emphasis or with an engineering emphasis. 10140610 -> 1000001500620: In general, electrical engineering-based computer science departments have tended to succeed as computer science and/or engineering departments. 10140620 -> 1000001500630: Computer science departments with a mathematics emphasis and a numerical orientation tend to consider alignment with computational science. 10140630 -> 1000001500640: Both types of departments tend to make efforts to bridge the field educationally if not across all research. 10140640 -> 1000001500650: Fields of computer science 10140650 -> 1000001500660: Computer science searches for concepts and formal proofs to explain and describe computational systems of interest.
10140660 -> 1000001500670: As with all sciences, these theories can then be utilised to synthesize practical engineering applications, which in turn may suggest new systems to be studied and analysed. 10140670 -> 1000001500680: While the ACM Computing Classification System can be used to split computer science up into different topics or fields, a more descriptive breakdown follows: 10140680 -> 1000001500690: Mathematical foundations 10140690 -> 1000001500700: Mathematical logic 10140700 -> 1000001500710: Boolean logic and other ways of modeling logical queries; the uses and limitations of formal proof methods. 10140710 -> 1000001500720: Number theory 10140720 -> 1000001500730: Theory of proofs and heuristics for finding proofs in the simple domain of integers. 10140730 -> 1000001500740: Used in cryptography as well as a test domain in artificial intelligence. 10140740 -> 1000001500750: Graph theory 10140750 -> 1000001500760: Foundations for data structures and searching algorithms. 10140760 -> 1000001500770: Type theory 10140770 -> 1000001500780: Formal analysis of the types of data, and the use of these types to understand properties of programs, especially program safety. 10140780 -> 1000001500790: Category theory 10140790 -> 1000001500800: Category theory provides a means of capturing all of math and computation in a single synthesis. 10140800 -> 1000001500810: Computational geometry 10140810 -> 1000001500820: The study of algorithms to solve problems stated in terms of geometry. 10140820 -> 1000001500830: Numerical analysis 10140830 -> 1000001500840: Foundations for algorithms in discrete mathematics, as well as the study of the limitations of floating point computation, including round-off errors. 10140840 -> 1000001500850: Theory of computation 10140850 -> 1000001500860: Automata theory 10140860 -> 1000001500870: Different logical structures for solving problems. 10140870 -> 1000001500880: Computability theory 10140880 -> 1000001500890: What is calculable with the current models of computers. 10140890 -> 1000001500900: Proofs developed by Alan Turing and others provide insight into the possibilities of what can be computed and what cannot. 10140900 -> 1000001500910: Computational complexity theory 10140910 -> 1000001500920: Fundamental bounds (especially time and storage space) on classes of computations; in practice, study of which problems a computer can solve with reasonable resources (while computability theory studies which problems can be solved at all). 10140920 -> 1000001500930: Quantum computing theory 10140930 -> 1000001500940: Representation and manipulation of data using the quantum properties of particles and quantum mechanics. 10140940 -> 1000001500950: Algorithms and data structures 10140950 -> 1000001500960: Analysis of algorithms 10140960 -> 1000001500970: Time and space complexity of algorithms. 10140970 -> 1000001500980: Algorithms 10140980 -> 1000001500990: Formal logical processes used for computation, and the efficiency of these processes. 10140990 -> 1000001501000: Programming languages and compilers 10141000 -> 1000001501010: Compilers 10141010 -> 1000001501020: Ways of translating computer programs, usually from higher level languages to lower level ones. 10141020 -> 1000001501030: Interpreters 10141030 -> 1000001501040: A program that takes another computer program as input and executes it.
10141040 -> 1000001501050: Programming languages 10141050 -> 1000001501060: Formal language paradigms for expressing algorithms, and the properties of these languages (e.g., what problems they are suited to solve). 10141060 -> 1000001501070: Concurrent, parallel, and distributed systems 10141070 -> 1000001501080: Concurrency 10141080 -> 1000001501090: The theory and practice of simultaneous computation; data safety in any multitasking or multithreaded environment. 10141090 -> 1000001501100: Distributed computing 10141100 -> 1000001501110: Computing using multiple computing devices over a network to accomplish a common objective or task and thereby reducing the latency involved in single processor contributions for any task. 10141110 -> 1000001501120: Parallel computing 10141120 -> 1000001501130: Computing using multiple concurrent threads of execution. 10141130 -> 1000001501140: Software engineering 10141140 -> 1000001501150: Algorithm design 10141150 -> 1000001501160: Using ideas from algorithm theory to creatively design solutions to real tasks 10141160 -> 1000001501170: Computer programming 10141170 -> 1000001501180: The practice of using a programming language to implement algorithms 10141180 -> 1000001501190: Formal methods 10141190 -> 1000001501200: Mathematical approaches for describing and reasoning about software designs. 10141200 -> 1000001501210: Reverse engineering 10141210 -> 1000001501220: The application of the scientific method to the understanding of arbitrary existing software 10141220 -> 1000001501230: Software development 10141230 -> 1000001501240: The principles and practice of designing, developing, and testing programs, as well as proper engineering practices. 10141240 -> 1000001501250: System architecture 10141250 -> 1000001501260: Computer architecture 10141260 -> 1000001501270: The design, organization, optimization and verification of a computer system, mostly about CPUs and memory subsystems (and the bus connecting them). 10141270 -> 1000001501280: Computer organization 10141280 -> 1000001501290: The implementation of computer architectures, in terms of descriptions of their specific electrical circuitry 10141290 -> 1000001501300: Operating systems 10141300 -> 1000001501310: Systems for managing computer programs and providing the basis of a useable system. 10141310 -> 1000001501320: Communications 10141320 -> 1000001501330: Computer audio 10141330 -> 1000001501340: Algorithms and data structures for the creation, manipulation, storage, and transmission of digital audio recordings. 10141340 -> 1000001501350: Also important in voice recognition applications. 10141350 -> 1000001501360: Networking 10141360 -> 1000001501370: Algorithms and protocols for communicating data across different shared or dedicated media, often including error correction. 10141370 -> 1000001501380: Cryptography 10141380 -> 1000001501390: Applies results from complexity, probability and number theory to invent and break codes. 10141390 -> 1000001501400: Databases 10141400 -> 1000001501410: Data mining 10141410 -> 1000001501420: Data mining is the extraction of relevant data from all sources of data. 10141420 -> 1000001501430: Relational databases 10141430 -> 1000001501440: Study of algorithms for searching and processing information in documents and databases; closely related to information retrieval. 
10141440 -> 1000001501450: OLAP 10141450 -> 1000001501460: Online Analytical Processing, or OLAP, is an approach to quickly provide answers to analytical queries that are multi-dimensional in nature. 10141460 -> 1000001501470: OLAP is part of the broader category business intelligence, which also encompasses relational reporting and data mining. 10141470 -> 1000001501480: Artificial intelligence 10141480 -> 1000001501490: Artificial intelligence 10141490 -> 1000001501500: The implementation and study of systems that exhibit an autonomous intelligence or behaviour of their own. 10141500 -> 1000001501510: Artificial life 10141510 -> 1000001501520: The study of digital organisms to learn about biological systems and evolution. 10141520 -> 1000001501530: Automated reasoning 10141530 -> 1000001501540: Solving engines, such as used in Prolog, which produce steps to a result given a query on a fact and rule database. 10141540 -> 1000001501550: Computer vision 10141550 -> 1000001501560: Algorithms for identifying three dimensional objects from one or more two dimensional pictures. 10141560 -> 1000001501570: Machine learning 10141570 -> 1000001501580: Automated creation of a set of rules and axioms based on input. 10141580 -> 1000001501590: Natural language processing/Computational linguistics 10141590 -> 1000001501600: Automated understanding and generation of human language 10141600 -> 1000001501610: Robotics 10141610 -> 1000001501620: Algorithms for controlling the behavior of robots. 10141620 -> 1000001501630: Visual rendering (or Computer graphics) 10141630 -> 1000001501640: Computer graphics 10141640 -> 1000001501650: Algorithms both for generating visual images synthetically, and for integrating or altering visual and spatial information sampled from the real world. 10141650 -> 1000001501660: Image processing 10141660 -> 1000001501670: Determining information from an image through computation. 10141670 -> 1000001501680: Human-Computer Interaction 10141680 -> 1000001501690: Human computer interaction 10141690 -> 1000001501700: The study of making computers and computations useful, usable and universally accessible to people, including the study and design of computer interfaces through which people use computers. 10141700 -> 1000001501710: Scientific computing 10141710 -> 1000001501720: Bioinformatics 10141720 -> 1000001501730: The use of computer science to maintain, analyse, and store biological data, and to assist in solving biological problems such as protein folding, function prediction and phylogeny. 10141730 -> 1000001501740: Cognitive Science 10141740 -> 1000001501750: Computational modelling of real minds 10141750 -> 1000001501760: Computational chemistry 10141760 -> 1000001501770: Computational modelling of theoretical chemistry in order to determine chemical structures and properties 10141770 -> 1000001501780: Computational neuroscience 10141780 -> 1000001501790: Computational modelling of real brains 10141790 -> 1000001501800: Computational physics 10141800 -> 1000001501810: Numerical simulations of large non-analytic systems 10141810 -> 1000001501820: Numerical algorithms 10141820 -> 1000001501830: Algorithms for the numerical solution of mathematical problems such as root-finding, integration, the solution of ordinary differential equations and the approximation/evaluation of special functions. 10141830 -> 1000001501840: Symbolic mathematics 10141840 -> 1000001501850: Manipulation and solution of expressions in symbolic form, also known as Computer algebra. 
10141850 -> 1000001501860: Didactics of computer science/informatics 10141860 -> 1000001501870: The subfield didactics of computer science focuses on cognitive approaches of developing competencies of computer science and specific strategies for analysis, design, implementation and evaluation of excellent lessons in computer science. 10141870 -> 1000001501880: Computer science education 10141880 -> 1000001501890: Some universities teach computer science as a theoretical study of computation and algorithmic reasoning. 10141890 -> 1000001501900: These programs often feature the theory of computation, analysis of algorithms, formal methods, concurrency theory, databases, computer graphics and systems analysis, among others. 10141900 -> 1000001501910: They typically also teach computer programming, but treat it as a vessel for the support of other fields of computer science rather than a central focus of high-level study. 10141910 -> 1000001501920: Other colleges and universities, as well as secondary schools and vocational programs that teach computer science, emphasize the practice of advanced computer programming rather than the theory of algorithms and computation in their computer science curricula. 10141920 -> 1000001501930: Such curricula tend to focus on those skills that are important to workers entering the software industry. 10141930 -> 1000001501940: The practical aspects of computer programming are often referred to as software engineering. 10141940 -> 1000001501950: However, there is a lot of disagreement over what the term "software engineering" actually means, and whether it is the same thing as programming. Computer software 10770010 -> 1000001600020: Computer software 10770020 -> 1000001600030: Computer software, or just software is a general term used to describe a collection of computer programs, procedures and documentation that perform some tasks on a computer system. 10770030 -> 1000001600040: The term includes application software such as word processors which perform productive tasks for users, system software such as operating systems, which interface with hardware to provide the necessary services for application software, and middleware which controls and co-ordinates distributed systems. 10770040 -> 1000001600050: "Software" is sometimes used in a broader context to mean anything which is not hardware but which is used with hardware, such as film, tapes and records. 10770050 -> 1000001600060: Relationship to computer hardware 10770060 -> 1000001600070: Computer software is so called to distinguish it from computer hardware, which encompasses the physical interconnections and devices required to store and execute (or run) the software. 10770070 -> 1000001600080: At the lowest level, software consists of a machine language specific to an individual processor. 10770080 -> 1000001600090: A machine language consists of groups of binary values signifying processor instructions which change the state of the computer from its preceding state. 10770090 -> 1000001600100: Software is an ordered sequence of instructions for changing the state of the computer hardware in a particular sequence. 10770100 -> 1000001600110: It is usually written in high-level programming languages that are easier and more efficient for humans to use (closer to natural language) than machine language. 10770110 -> 1000001600120: High-level languages are compiled or interpreted into machine language object code. 
10770120 -> 1000001600130: Software may also be written in an assembly language, essentially a mnemonic representation of a machine language using a natural language alphabet. 10770130 -> 1000001600140: Assembly language must be assembled into object code via an assembler. 10770140 -> 1000001600150: The term "software" was first used in this sense by John W. Tukey in 1958. 10770150 -> 1000001600160: In computer science and software engineering, computer software is all computer programs. 10770160 -> 1000001600170: The theory that is the basis for most modern software was first proposed by Alan Turing in his 1936 essay On Computable Numbers, with an Application to the Entscheidungsproblem. 10770170 -> 1000001600180: Types 10770180 -> 1000001600190: Practical computer systems divide software systems into three major classes: system software, programming software and application software, although the distinction is arbitrary, and often blurred. 10770190 -> 1000001600200: System software helps run the computer hardware and computer system. 10770200 -> 1000001600210: It includes operating systems, device drivers, diagnostic tools, servers, windowing systems, utilities and more. 10770210 -> 1000001600220: The purpose of systems software is to insulate the applications programmer as much as possible from the details of the particular computer complex being used, especially memory and other hardware features, and accessory devices such as communications, printers, readers, displays, keyboards, etc. 10770220 -> 1000001600230: Programming software usually provides tools that assist a programmer in writing computer programs and other software in various programming languages in a more convenient way. 10770230 -> 1000001600240: The tools include text editors, compilers, interpreters, linkers, debuggers, and so on. 10770240 -> 1000001600250: An integrated development environment (IDE) merges those tools into a software bundle, and a programmer may not need to type multiple commands for compiling, interpreting, debugging, tracing, and so on, because the IDE usually has an advanced graphical user interface, or GUI. 10770250 -> 1000001600260: Application software allows end users to accomplish one or more specific (non-computer related) tasks. 10770260 -> 1000001600270: Typical applications include industrial automation, business software, educational software, medical software, databases, and computer games. 10770270 -> 1000001600280: Businesses are probably the biggest users of application software, but almost every field of human activity now uses some form of application software. 10770280 -> 1000001600290: Program and library 10770290 -> 1000001600300: A program may not be sufficiently complete for execution by a computer. 10770300 -> 1000001600310: In particular, it may require additional software from a software library in order to be complete. 10770310 -> 1000001600320: Such a library may include software components used by stand-alone programs, but which cannot work on their own. 10770320 -> 1000001600330: Thus, programs may include standard routines that are common to many programs, extracted from these libraries. 10770330 -> 1000001600340: Libraries may also include 'stand-alone' programs which are activated by some computer event and/or perform some function (e.g., of computer 'housekeeping') but do not return data to their calling program. 10770340 -> 1000001600350: Libraries may be called by one to many other programs; programs may call zero to many other programs.
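A minimal C sketch of the program-and-library relationship described above: the program below is not complete on its own, because it calls sqrt, a standard routine supplied by the math library, and on many systems it must be linked against that library (for example with cc example.c -lm) before it can execute.

#include <stdio.h>
#include <math.h>   /* declares sqrt, which is implemented in the math library */

int main(void) {
    /* sqrt is a standard routine extracted from a library rather than
       written as part of this program. */
    printf("%f\n", sqrt(2.0));
    return 0;
}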
10770350 -> 1000001600360: Three layers 10770360 -> 1000001600370: Users often see things differently than programmers. 10770370 -> 1000001600380: People who use modern general purpose computers (as opposed to embedded systems, analog computers, supercomputers, etc.) usually see three layers of software performing a variety of tasks: platform, application, and user software. 10770380 -> 1000001600390: Platform software 10770390 -> 1000001600400: Platform includes the firmware, device drivers, an operating system, and typically a graphical user interface which, in total, allow a user to interact with the computer and its peripherals (associated equipment). 10770400 -> 1000001600410: Platform software often comes bundled with the computer. 10770410 -> 1000001600420: On a PC you will usually have the ability to change the platform software. 10770420 -> 1000001600430: Application software 10770430 -> 1000001600440: Application software or Applications are what most people think of when they think of software. 10770440 -> 1000001600450: Typical examples include office suites and video games. 10770450 -> 1000001600460: Application software is often purchased separately from computer hardware. 10770460 -> 1000001600470: Sometimes applications are bundled with the computer, but that does not change the fact that they run as independent applications. 10770470 -> 1000001600480: Applications are almost always independent programs from the operating system, though they are often tailored for specific platforms. 10770480 -> 1000001600490: Most users think of compilers, databases, and other "system software" as applications. 10770490 -> 1000001600500: User-written software 10770500 -> 1000001600510: End-user development tailors systems to meet users' specific needs. 10770510 -> 1000001600520: User software include spreadsheet templates, word processor macros, scientific simulations, and scripts for graphics and animations. 10770520 -> 1000001600530: Even email filters are a kind of user software. 10770530 -> 1000001600540: Users create this software themselves and often overlook how important it is. 10770535 -> 1000001600550: Depending on how competently the user-written software has been integrated into purchased application packages, many users may not be aware of the distinction between the purchased packages, and what has been added by fellow co-workers. 10770540 -> None: Creation 10770550 -> 1000001600560: Operation 10770560 -> 1000001600570: Computer software has to be "loaded" into the computer's storage (such as a hard drive, memory, or RAM). 10770570 -> 1000001600580: Once the software has loaded, the computer is able to execute the software. 10770580 -> 1000001600590: This involves passing instructions from the application software, through the system software, to the hardware which ultimately receives the instruction as machine code. 10770590 -> 1000001600600: Each instruction causes the computer to carry out an operation -- moving data, carrying out a computation, or altering the control flow of instructions. 10770600 -> 1000001600610: Data movement is typically from one place in memory to another. 10770610 -> 1000001600620: Sometimes it involves moving data between memory and registers which enable high-speed data access in the CPU. 10770620 -> 1000001600630: Moving data, especially large amounts of it, can be costly. 10770630 -> 1000001600640: So, this is sometimes avoided by using "pointers" to data instead. 
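The use of "pointers" to avoid costly data movement, mentioned above, can be sketched in C; the structure and function names here are illustrative only:

#include <stdio.h>

struct record { int values[10000]; };   /* a large block of data */

/* Receives only a pointer (an address), so the 10000 integers are not copied. */
static int first_value(const struct record *r) {
    return r->values[0];
}

int main(void) {
    static struct record r = { {42} };
    printf("%d\n", first_value(&r));   /* passes the address of r, not a copy of r */
    return 0;
}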
10770640 -> 1000001600650: Computations include simple operations such as incrementing the value of a variable data element. 10770650 -> 1000001600660: More complex computations may involve many operations and data elements together. 10770660 -> 1000001600670: Instructions may be performed sequentially, conditionally, or iteratively. 10770670 -> 1000001600680: Sequential instructions are those operations that are performed one after another. 10770680 -> 1000001600690: Conditional instructions are performed such that different sets of instructions execute depending on the value(s) of some data. 10770690 -> 1000001600700: In some languages this is known as an "if" statement. 10770700 -> 1000001600710: Iterative instructions are performed repetitively and may depend on some data value. 10770710 -> 1000001600720: This is sometimes called a "loop." 10770720 -> 1000001600730: Often, one instruction may "call" another set of instructions that are defined in some other program or module. 10770730 -> 1000001600740: When more than one computer processor is used, instructions may be executed simultaneously. 10770740 -> 1000001600750: A simple example of the way software operates is what happens when a user selects an entry such as "Copy" from a menu. 10770750 -> 1000001600760: In this case, a conditional instruction is executed to copy text from data in a 'document' area residing in memory, perhaps to an intermediate storage area known as a 'clipboard' data area. 10770760 -> 1000001600770: If a different menu entry such as "Paste" is chosen, the software may execute the instructions to copy the text from the clipboard data area to a specific location in the same or another document in memory. 10770770 -> 1000001600780: Depending on the application, even the example above could become complicated. 10770780 -> 1000001600790: The field of software engineering endeavors to manage the complexity of how software operates. 10770790 -> 1000001600800: This is especially true for software that operates in the context of a large or powerful computer system. 10770800 -> 1000001600810: Currently, almost the only limitations on the use of computer software in applications is the ingenuity of the designer/programmer. 10770810 -> 1000001600820: Consequently, large areas of activities (such as playing grand master level chess) formerly assumed to be incapable of software simulation are now routinely programmed. 10770820 -> 1000001600830: The only area that has so far proved reasonably secure from software simulation is the realm of human art— especially, pleasing music and literature. 10770830 -> 1000001600840: Kinds of software by operation: computer program as executable, source code or script, configuration. 10770840 -> 1000001600850: Quality and reliability 10770850 -> 1000001600860: Software reliability considers the errors, faults, and failures related to the design, implementation and operation of software. 10770860 -> 1000001600870: See Software auditing, Software quality, Software testing, and Software reliability. 10770870 -> 1000001600880: License 10770880 -> 1000001600890: Software license gives the user the right to use the software in the licensed environment, some software comes with the license when purchased off the shelf, or an OEM license when bundled with hardware. 10770890 -> 1000001600900: Other software comes with a free software licence, granting the recipient the rights to modify and redistribute the software. 
10770900 -> 1000001600910: Software can also be in the form of freeware or shareware. 10770910 -> 1000001600920: See also License Management. 10770920 -> 1000001600930: Patents 10770930 -> 1000001600940: The issue of software patents is controversial. 10770940 -> 1000001600950: Some believe that they hinder software development, while others argue that software patents provide an important incentive to spur software innovation. 10770950 -> 1000001600960: See software patent debate. 10770960 -> 1000001600970: Ethics and rights for software users 10770970 -> 1000001600980: Because software is a relatively new part of society, the idea of what rights users of software should have is not very developed. 10770980 -> 1000001600990: Some, such as the free software community, believe that software users should be free to modify and redistribute the software they use. 10770990 -> 1000001601000: They argue that these rights are necessary so that each individual can control their computer, and so that everyone can cooperate, if they choose, to work together as a community and control the direction that software progresses in. 10770995 -> 1000001601010: Others believe that software authors should have the power to say what rights the user will get. 10771000 -> 1000001601020: Software companies and non-profit organizations 10771010 -> 1000001601030: Examples of non-profit software organizations: Free Software Foundation, GNU Project, Mozilla Foundation 10771020 -> 1000001601040: Examples of large software companies are Microsoft, IBM, Oracle, SAP and HP. Corpus linguistics 10150010 -> 1000001700020: Corpus linguistics 10150020 -> 1000001700030: Corpus linguistics is the study of language as expressed in samples (corpora) of "real world" text. 10150030 -> 1000001700040: This method represents a digestive approach to deriving a set of abstract rules by which a natural language is governed or else relates to another language. 10150040 -> 1000001700050: Corpora, originally compiled by hand, are now largely derived by an automated process whose output is then corrected. 10150050 -> 1000001700060: Computational methods had once been viewed as a holy grail of linguistic research, which would ultimately manifest a ruleset for natural language processing and machine translation at a high level. 10150060 -> 1000001700070: Such has not been the case, and since the cognitive revolution, cognitive linguistics has been largely critical of many claimed practical uses for corpora. 10150070 -> 1000001700080: However, as computation capacity and speed have increased, the use of corpora to study language and term relationships en masse has gained some respectability. 10150080 -> 1000001700090: The corpus approach runs counter to Noam Chomsky's view that real language is riddled with performance-related errors, thus requiring careful analysis of small speech samples obtained in a highly controlled laboratory setting. 10150090 -> 1000001700100: Corpus linguistics does away with Chomsky's competence/performance split; adherents believe that reliable language analysis best occurs on field-collected samples, in natural contexts and with minimal experimental interference. 10150100 -> 1000001700110: History 10150110 -> 1000001700120: A landmark in modern corpus linguistics was the publication by Henry Kucera and Nelson Francis of Computational Analysis of Present-Day American English in 1967, a work based on the analysis of the Brown Corpus, a carefully compiled selection of current American English, totalling about a million words drawn from a wide variety of sources.
10150120 -> 1000001700130: Kucera and Francis subjected it to a variety of computational analyses, from which they compiled a rich and variegated opus, combining elements of linguistics, language teaching, psychology, statistics, and sociology. 10150130 -> 1000001700140: A further key publication was Randolph Quirk's 'Towards a description of English Usage' (1960, Transactions of the Philological Society, 40-61) in which he introduced The Survey of English Usage. 10150140 -> 1000001700150: Shortly thereafter, Boston publisher Houghton-Mifflin approached Kucera to supply a million word, three-line citation base for its new American Heritage Dictionary, the first dictionary to be compiled using corpus linguistics. 10150150 -> 1000001700160: The AHD made the innovative step of combining prescriptive elements (how language should be used) with descriptive information (how it actually is used). 10150160 -> 1000001700170: Other publishers followed suit. 10150170 -> 1000001700180: The British publisher Collins' COBUILD monolingual learner's dictionary, designed for users learning English as a foreign language, was compiled using the Bank of English. 10150180 -> 1000001700190: The Brown Corpus has also spawned a number of similarly structured corpora: the LOB Corpus (1960s British English), Kolhapur (Indian English), Wellington (New Zealand English), Australian Corpus of English (Australian English), the Frown Corpus (early 1990s American English), and the FLOB Corpus (1990s British English). 10150190 -> 1000001700200: Other corpora represent many languages, varieties and modes, and include the International Corpus of English and the British National Corpus, a 100 million word collection of a range of spoken and written texts, created in the 1990s by a consortium of publishers, universities (Oxford and Lancaster) and the British Library. 10150200 -> 1000001700210: For contemporary American English, work has stalled on the American National Corpus, but the 360 million word Corpus of Contemporary American English (COCA) (1990-present) is now available. 10150210 -> 1000001700220: Methods 10150220 -> 1000001700230: The corpus approach means dealing with real input data, where descriptions based on a linguist's intuition are not usually helpful. Cross-platform 10160010 -> 1000001800020: Cross-platform 10160020 -> 1000001800030: Cross-platform (also known as multi-platform) is a term used in computing to refer to computer programs, operating systems, computer languages, programming languages, or other computer software and their implementations which can be made to work on multiple computer platforms. 10160030 -> 1000001800040: “Cross-platform” and “multi-platform” both refer to the idea that a given piece of computer software can run on more than one computer platform. 10160040 -> 1000001800050: There are two major types of cross-platform software; one requires building for each platform that it supports (e.g., is written in a compiled language, such as Pascal), and the other can be run directly on any platform which supports it (e.g., software written in an interpreted language such as Perl, Python, or shell script), or is written in a language which compiles to bytecode, with the bytecode being redistributed (as is the case with Java and the languages used in the .NET Framework).
10160050 -> 1000001800060: For example, a cross-platform application may run on Microsoft Windows on the x86 architecture, Linux on the x86 architecture and Mac OS X on either the PowerPC or x86 based Apple Macintosh systems. 10160060 -> 1000001800070: A cross-platform application may run on as many as all existing platforms, or on as few as two platforms. 10160070 -> 1000001800080: Platforms 10160080 -> 1000001800090: A platform is a combination of hardware and software used to run software applications. 10160090 -> 1000001800100: A platform can be described simply as an operating system or computer architecture, or it could be the combination of both. 10160100 -> 1000001800110: Probably the most familiar platform is Microsoft Windows running on the x86 architecture. 10160110 -> 1000001800120: Other well-known desktop computer platforms include Linux and Mac OS X (both of which are themselves cross-platform). 10160120 -> 1000001800130: There are, however, many devices such as cellular telephones that are also effectively computer platforms but less commonly thought about in that way. 10160130 -> 1000001800140: Application software can be written to depend on the features of a particular platform—either the hardware, operating system, or virtual machine it runs on. 10160140 -> 1000001800150: The Java platform is a virtual machine platform which runs on many operating systems and hardware types, and is a common platform for software to be written for. 10160150 -> 1000001800160: Hardware platforms 10160160 -> 1000001800170: A hardware platform can refer to a computer’s architecture or processor architecture. 10160170 -> 1000001800180: For example, the x86 and x86-64 CPUs make up one of the most common computer architectures in use in home machines today. 10160180 -> 1000001800190: These machines commonly run Microsoft Windows, though they can run other operating systems as well, including Linux, OpenBSD, NetBSD, Mac OS X and FreeBSD. 10160190 -> 1000001800200: Software platforms 10160200 -> 1000001800210: Software platforms can either be an operating system or programming environment, though more commonly it is a combination of both. 10160210 -> 1000001800220: A notable exception to this is Java, which uses an operating system independent virtual machine for its compiled code, known in the world of Java as bytecode. 10160220 -> 1000001800230: Examples of software platforms include: 10160230 -> 1000001800240: MS-DOS (x86), DR-DOS (x86), FreeDOS (x86) etc. 10160240 -> 1000001800250: Microsoft Windows (x86, x64) 10160250 -> 1000001800260: Linux (x86, x64, PowerPC, various other architectures) 10160260 -> 1000001800270: Mac OS X (PowerPC, x86) 10160270 -> 1000001800280: OS/2, eComStation 10160280 -> 1000001800290: AmigaOS (m68k), AROS (x86, PowerPC, m68k), MorphOS (PowerPC) 10160290 -> 1000001800300: Java 10160300 -> 1000001800310: Java platform 10160310 -> 1000001800320: As previously noted, the Java platform is an exception to the general rule that an operating system is a software platform. 10160320 -> 1000001800330: The Java language provides a virtual machine, or a “virtual CPU” which runs all of the code that is written for the language. 10160330 -> 1000001800340: This enables the same executable binary to run on all systems which support the Java software, through the Java Virtual Machine. 10160340 -> 1000001800350: Java executables do not run directly on the operating system; that is, neither Windows nor Linux execute Java programs directly. 
10160350 -> 1000001800360: Because of this, however, Java is limited in that it does not directly support system-specific functionality. 10160360 -> 1000001800370: JNI can be used to access system specific functions, but then the code is likely no longer portable. 10160370 -> 1000001800380: Java programs can run on at least the Microsoft Windows, Mac OS X, Linux, and Solaris operating systems, and so the language is limited to functionality that exists on all these systems. 10160380 -> 1000001800390: This includes things such as computer networking, Internet sockets, but not necessarily raw hardware input/output. 10160390 -> 1000001800400: Cross-platform software 10160400 -> 1000001800410: In order for software to be considered cross-platform, it must be able to function on more than one computer architecture or operating system. 10160410 -> 1000001800420: This can be a time-consuming task given that different operating systems have different application programming interfaces or APIs (for example, Linux uses a different API for application software than Windows does). 10160420 -> 1000001800430: Just because a particular operating system may run on different computer architectures, that does not mean that the software written for that operating system will automatically work on all architectures that the operating system supports. 10160430 -> 1000001800440: One example as of August, 2006 was OpenOffice.org, which did not natively run on the AMD64 or EM64T lines of processors implementing the x86-64 64-bit standards for computers; this has since been changed, and the OpenOffice.org suite of software is “mostly” ported to these 64-bit systems. 10160440 -> 1000001800450: This also means that just because a program is written in a popular programming language such as C or C++, it does not mean it will run on all operating systems that support that programming language. 10160450 -> 1000001800460: Web applications 10160460 -> 1000001800470: Web applications are typically described as cross-platform because, ideally, they are accessible from any of various web browsers within different operating systems. 10160470 -> 1000001800480: Such applications generally employ a client-server system architecture, and vary widely in complexity and functionality. 10160480 -> 1000001800490: This wide variability significantly complicates the goal of cross-platform capability, which is routinely at odds with the goal of advanced functionality. 10160490 -> 1000001800500: Basic applications 10160500 -> 1000001800510: Basic web applications perform all or most processing from a stateless web server, and pass the result to the client web browser. 10160510 -> 1000001800520: All user interaction with the application consists of simple exchanges of data requests and server responses. 10160520 -> 1000001800530: These types of applications were the norm in the early phases of World Wide Web application development. 10160530 -> 1000001800540: Such applications follow a simple transaction model, identical to that of serving static web pages. 10160540 -> 1000001800550: Today, they are still relatively common, especially where cross-platform compatibility and simplicity are deemed more critical than advanced functionality. 10160550 -> 1000001800560: Advanced applications 10160560 -> 1000001800570: Prominent examples of advanced web applications include the Web interface to Gmail, A9.com, and the maps.live.com section of Live Search. 
10160570 -> 1000001800580: Such advanced applications routinely depend on additional features found only in the more recent versions of popular web browsers. 10160580 -> 1000001800590: These dependencies include Ajax, JavaScript, “Dynamic” HTML, SVG, and other components of rich internet applications. 10160590 -> 1000001800600: Older versions of popular browsers tend to lack support for certain features. 10160600 -> 1000001800610: Design strategies 10160610 -> 1000001800620: Because of the competing interests of cross-platform compatibility and advanced functionality, numerous alternative web application design strategies have emerged. 10160620 -> 1000001800630: Such strategies include: 10160630 -> 1000001800640: Graceful degradation 10160640 -> 1000001800650: Graceful degradation attempts to provide the same or similar functionality to all users and platforms, while diminishing that functionality to a ‘least common denominator’ for more limited client browsers. 10160650 -> 1000001800660: For example, a user attempting to use a limited-feature browser to access Gmail may notice that Gmail switches to “Basic Mode,” with reduced functionality. 10160660 -> 1000001800670: Some view this strategy as a lesser form of cross-platform capability. 10160670 -> 1000001800680: Separation of functionality 10160680 -> 1000001800690: Separation of functionality attempts simply to omit those subsets of functionality that cannot be provided from within certain client browsers or operating systems, while still delivering a ‘complete’ application to the user (see also Separation of concerns). 10160690 -> 1000001800700: Multiple codebase 10160700 -> 1000001800710: Multiple codebase applications present different versions of an application depending on the specific client in use. 10160710 -> 1000001800720: This strategy is arguably the most complicated and expensive way to fulfill cross-platform capability, since even different versions of the same client browser (within the same operating system) can differ dramatically from one another. 10160720 -> 1000001800730: This is further complicated by the support for “plugins” which may or may not be present for any given installation of a particular browser version. 10160730 -> 1000001800740: Third party libraries 10160740 -> 1000001800750: Third party libraries attempt to simplify cross-platform capability by ‘hiding’ the complexities of client differentiation behind a single, unified API. 10160750 -> 1000001800760: Testing strategies 10160760 -> 1000001800770: One complicated aspect of cross-platform web application design is the need for software testing. 10160770 -> 1000001800780: In addition to the complications mentioned previously, there is the additional restriction that some browsers prohibit installation of different versions of the same browser on the same operating system. 10160780 -> 1000001800790: Techniques such as full virtualization are sometimes used as a workaround for this problem. 10160790 -> 1000001800800: Traditional applications 10160800 -> 1000001800810: Although web applications are becoming increasingly popular, many computer users still use traditional application software which does not rely on a client/web-server architecture. 10160810 -> 1000001800820: The distinction between “traditional” and “web” applications is not always clear-cut, however, because applications have many different features, installation methods and architectures; and some of these can overlap and occur in ways that blur the distinction.
10160820 -> 1000001800830: Nevertheless, this simplifying distinction is a common and useful generalization. 10160830 -> 1000001800840: Binary software 10160840 -> 1000001800850: Traditionally in modern computing, application software has been distributed to end-users as binary images, which are stored in executables, a specific type of binary file. 10160850 -> 1000001800860: Such executables only support the operating system and computer architecture that they were built for—which means that making a “cross-platform executable” would be something of a massive task, and is generally not done. 10160860 -> 1000001800870: For software that is distributed as a binary executable, such as software written in C or C++, the programmer must build the software for each different operating system and computer architecture. 10160870 -> 1000001800880: For example, Mozilla Firefox, an open-source web browser, is available on Microsoft Windows, Mac OS X (both PowerPC and x86 through what Apple calls a Universal binary), and Linux on multiple computer architectures. 10160880 -> 1000001800890: The builds for the three platforms (in this case, Windows, Mac OS X, and Linux) are separate executable distributions, although they come from the same source code. 10160890 -> 1000001800900: In the context of binary software, cross-platform programs are written once in source code and then “translated” to each system they run on by compiling that source code on each platform. 10160900 -> 1000001800910: Also, software can be ported to a new computer architecture or operating system so that the program becomes more cross-platform than it already is. 10160910 -> 1000001800920: For example, a program such as Firefox, which already runs on Windows on the x86 family, can be modified and re-built to run on Linux on the x86 (and potentially other architectures) as well. 10160920 -> 1000001800930: As an alternative to porting, cross-platform virtualization allows applications compiled for one CPU and operating system to run on a system with a different CPU and/or operating system, without modification to the source code or binaries. 10160930 -> 1000001800940: As an example, Apple's Rosetta software, which is built into Intel-based Apple Macintosh computers, runs applications compiled for the previous generation of Macs that used PowerPC CPUs. 10160940 -> 1000001800950: Another example is IBM PowerVM Lx86, which allows Linux/x86 applications to run unmodified on the Linux/Power operating system. 10160950 -> 1000001800960: Scripts and interpreted languages 10160960 -> 1000001800970: A script can be considered to be cross-platform if the scripting language is available on multiple platforms and the script only uses the facilities provided by the language. 10160970 -> 1000001800980: That is, a script written in Python for a Unix-like system will likely run with little or no modification on Windows, because Python also runs on Windows; there is also more than one implementation of Python that will run the same scripts (e.g., IronPython for .NET). 10160980 -> 1000001800990: The same goes for many other open-source scripting languages. 10160990 -> 1000001801000: Unlike binary executables, the same script can be used on all computers that have software to interpret the script. 10161000 -> 1000001801010: This is because the script is generally stored in plain text in a text file.
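As an illustration of this portability, here is a minimal sketch in Python (chosen only as an example of an interpreted language mentioned above; the file name is hypothetical). Because it relies only on the standard library and lets the interpreter choose path separators and newline conventions, the same text file of source code should run unchanged on Windows, Linux, or Mac OS X.

    import os
    import sys
    import tempfile

    # Report which platform the interpreter happens to be running on.
    print("Running on:", sys.platform)

    # Build a path with os.path.join so the separator is chosen by the
    # platform at run time rather than hard-coded in the script.
    path = os.path.join(tempfile.gettempdir(), "example.txt")

    # Write and read a small text file; the interpreter translates
    # newline conventions for the current platform transparently.
    with open(path, "w") as handle:
        handle.write("the same script runs unchanged on each platform\n")
    with open(path) as handle:
        print(handle.read().strip())

    os.remove(path)  # clean up the temporary file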
10161010 -> 1000001801020: There may be some issues, however, such as the type of newline character that separates the lines. 10161020 -> 1000001801030: Generally, however, little or no work has to be done to make a script written for one system run on another. 10161030 -> 1000001801040: Some quite popular cross-platform scripting or interpreted languages are: 10161040 -> 1000001801050: bash—A Unix shell commonly run on Linux and other modern Unix-like systems, as well as on Windows via the Cygwin POSIX compatibility layer. 10161050 -> 1000001801060: Python—A modern scripting language where the focus is on rapid application development and ease of writing, instead of program run-time efficiency. 10161060 -> 1000001801070: Perl—A scripting language first released in 1987. 10161070 -> 1000001801080: Used for CGI WWW programming, small system administration tasks, and more. 10161080 -> 1000001801090: PHP—A scripting language most popular in use on the WWW for web applications. 10161090 -> 1000001801100: Ruby—A scripting language whose purpose is to be object-oriented and easy to read. 10161100 -> 1000001801110: Can also be used on the web through Ruby on Rails. 10161110 -> 1000001801120: Tcl—A dynamic programming language, suitable for a wide range of uses, including web and desktop applications, networking, administration, testing and many more. 10161120 -> 1000001801130: Video games 10161130 -> 1000001801140: Cross-platform is a term that can also apply to video games. 10161140 -> 1000001801150: Such games are released on a range of video game consoles and handheld game consoles, which are specialized computers dedicated to the task of playing games (and thus are platforms like any other computer). 10161150 -> 1000001801160: Examples of these games include: 10161160 -> 1000001801170: Miner 2049er, the first major multiplatform game 10161170 -> 1000001801180: Phantasy Star Online 10161180 -> 1000001801190: Lara Croft Tomb Raider: Legend 10161190 -> 1000001801200: FIFA Series 10161200 -> 1000001801210: Shadow of Legend 10161210 -> 1000001801220: … which are spread across a variety of platforms, such as the Nintendo GameCube, PlayStation 2, Xbox, PC, and mobile devices. 10161220 -> 1000001801230: In some cases, depending on the hardware of a particular system, it may take longer than expected to create a video game across multiple platforms. 10161230 -> 1000001801240: So, a video game may only get released on a few platforms and then later released on the remaining platforms. 10161240 -> 1000001801250: Typically, this is what occurs when a new system is released, because the developers of the video game need to become acquainted with the hardware and software associated with the new console. 10161250 -> 1000001801260: Some games may not become cross-platform because of licensing agreements between the developers and the maker of the video game console which state that the game will only be made for one particular console. 10161260 -> 1000001801270: As an example, Disney could create a new game and wish to release it on the latest Nintendo and Sony game consoles. 10161270 -> 1000001801280: If Disney licenses the game with Sony first, Disney may be required to only release the game on Sony’s console for a short time, or indefinitely—effectively preventing the game from being cross-platform, at least for a period of time. 10161280 -> 1000001801290: Several developers have developed ways to play games online while using different platforms.
10161290 -> 1000001801300: Epic Games, Microsoft and Valve Software all have this technology, that allows Xbox 360 gamers and PS3 gamers to play with PC gamers, allowing gamers to finally decide which platform is the best for a game. 10161300 -> 1000001801310: The first game released to allow this interactivity between PC and Console games was Quake 3. 10161310 -> 1000001801320: Games that feature cross-platform online play include: 10161320 -> 1000001801330: Champions Online 10161330 -> 1000001801340: Lost Planet: Colonies 10161340 -> 1000001801350: Phantasy Star Online 10161350 -> 1000001801360: Shadowrun 10161360 -> 1000001801370: UNO 10161370 -> 1000001801380: Final Fantasy XI Online 10161380 -> 1000001801390: Platform independent software 10161390 -> 1000001801400: Software that is platform independent does not rely on any special features of any single platform, or, if it does, handles those special features such that it can deal with multiple platforms. 10161400 -> 1000001801410: All algorithms, such as the quicksort algorithm, are able to be implemented on different platforms. 10161410 -> 1000001801420: Cross-platform programming 10161420 -> 1000001801430: Cross-platform programming is the practice of actively writing software that will work on more than one platform. 10161430 -> 1000001801440: Approaches to cross-platform programming 10161440 -> 1000001801450: There are different ways of approaching the problem of writing a cross-platform application program. 10161450 -> 1000001801460: One such approach is simply to create multiple versions of the same program in different source trees—in other words, the Windows version of a program might have one set of source code files and the Macintosh version might have another, while a FOSS *nix system might have another. 10161460 -> 1000001801470: While this is a straightforward approach to the problem, it has the potential to be considerably more expensive in development cost, development time, or both, especially for the corporate entities. 10161470 -> 1000001801480: The idea behind this is to create more than two different programs that have the ability to behave similarly to each other. 10161480 -> 1000001801490: It is also possible that this means of developing a cross-platform application will result in more problems with bug tracking and fixing, because the two different source trees would have different programmers, and thus different defects in each version. 10161490 -> 1000001801500: The smaller the programming team, the quicker the bug fixes tend to be. 10161500 -> 1000001801510: Another approach that is used is to depend on pre-existing software that hides the differences between the platforms—called abstraction of the platform—such that the program itself is unaware of the platform it is running on. 10161510 -> 1000001801520: It could be said that such programs are platform agnostic. 10161520 -> 1000001801530: Programs that run on the Java Virtual Machine (JVM) are built in this fashion. 10161530 -> 1000001801540: Some applications mix various methods of cross-platform programming to create the final application. 10161540 -> 1000001801550: An example of this is the Firefox web browser, which uses abstraction to build some of the lower-level components, separate source subtrees for implementing platform specific features (like the GUI), and the implementation of more than one scripting language to help facilitate ease of portability. 
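Before returning to the Firefox example, the following hypothetical Python sketch illustrates the platform-abstraction approach described above: the rest of the program calls a single function, and the platform-specific choice is hidden inside it. The function name and the commands chosen are assumptions for illustration, not part of any particular toolkit.

    import subprocess
    import sys

    def open_file_manager(directory):
        """Open the platform's native file manager for a directory.

        Callers never mention a platform; the platform-specific command
        is selected inside this one abstraction layer.
        """
        if sys.platform.startswith("win"):
            command = ["explorer", directory]   # Microsoft Windows
        elif sys.platform == "darwin":
            command = ["open", directory]       # Mac OS X
        else:
            command = ["xdg-open", directory]   # most Linux desktops
        subprocess.Popen(command)

    # Application code stays platform agnostic, e.g.:
    # open_file_manager("/tmp")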
10161550 -> 1000001801560: Firefox implements XUL, CSS and JavaScript for extending the browser, in addition to classic Netscape-style browser plugins. 10161560 -> 1000001801570: Much of the browser itself is written in XUL, CSS, and JavaScript, as well. 10161570 -> 1000001801580: Cross-platform programming toolkits 10161580 -> 1000001801590: There are a number of tools which are available to help facilitate the process of cross-platform programming: 10161590 -> 1000001801600: Simple DirectMedia Layer—An open source cross-platform multimedia library written in C that creates an abstraction over various platforms’ graphics, sound, and input APIs. 10161600 -> 1000001801610: It runs on many operating systems including Linux, Windows and Mac OS X and is aimed at games and multimedia applications. 10161610 -> 1000001801620: Cairo—A free software library used to provide a vector graphics-based, device-independent API. 10161620 -> 1000001801630: It is designed to provide primitives for 2-dimensional drawing across a number of different backends. 10161630 -> 1000001801640: Cairo is written in C and has bindings for many programming languages. 10161640 -> 1000001801650: ParaGUI—ParaGUI is a cross-platform high-level application framework and GUI library. 10161650 -> 1000001801660: It can be compiled on various platforms (Linux, Win32, BeOS, Mac OS, ...). 10161660 -> 1000001801670: ParaGUI is based on the Simple DirectMedia Layer (SDL). 10161670 -> 1000001801680: ParaGUI is targeted at cross-platform multimedia applications and embedded devices operating on framebuffer displays. 10161680 -> 1000001801690: wxWidgets—An open source widget toolkit that is also an application framework. 10161690 -> 1000001801700: It runs on Unix-like systems with X11, Microsoft Windows and Mac OS X. It permits applications written to use it to run on all of the systems that it supports, if the application does not use any operating system-specific programming in addition to it. 10161700 -> 1000001801710: Qt—An application framework and widget toolkit for Unix-like systems with X11, Microsoft Windows, Mac OS X, and other systems—available under both open source and commercial licenses. 10161710 -> 1000001801720: GTK+—An open source widget toolkit for Unix-like systems with X11 and Microsoft Windows. 10161720 -> 1000001801730: FLTK—Another open source cross-platform toolkit, but more lightweight because it restricts itself to the GUI. 10161730 -> 1000001801740: Mozilla—An open source platform for building Mac, Windows and Linux applications. 10161740 -> 1000001801750: Mono (and more specifically, Microsoft .NET)—A cross-platform framework for applications and programming languages. 10161750 -> 1000001801760: molib—A robust commercial application toolkit library that abstracts the system calls through C++ objects (such as the file system, database system and thread implementation). 10161760 -> 1000001801770: This allows for the creation of applications that compile and run under Microsoft Windows, Mac OS X, GNU/Linux, and other Unix systems (Sun OS, AIX, HP-UX, 32/64 bit, SMP). 10161770 -> 1000001801780: It can be used in concert with the sandbox to create GUI-based applications. 10161780 -> 1000001801790: fpGUI—An open source widget toolkit that is completely implemented in Object Pascal. 10161790 -> 1000001801800: It currently supports Linux, Windows and a bit of Windows CE. 10161795 -> 1000001801810: fpGUI does not rely on any large libraries; instead, it talks directly to Xlib (Linux) or GDI (Windows).
10161800 -> 1000001801820: The framework is compiled with the Free Pascal compiler. 10161810 -> 1000001801830: Mac OS support is also in the works. 10161820 -> 1000001801840: Tcl/Tk - Tcl (Tool Command Language) is a dynamic programming language, suitable for a wide range of uses, including web and desktop applications, networking, administration, testing and many more. 10161830 -> 1000001801850: Open source and business-friendly, Tcl is a mature yet evolving language that is truly cross platform, easily deployed and highly extensible. 10161840 -> 1000001801860: Tk is a graphical user interface toolkit that takes developing desktop applications to a higher level than conventional approaches. 10161850 -> 1000001801870: Tk is the standard GUI not only for Tcl, but for many other dynamic languages, and can produce rich, native applications that run unchanged across Windows, Mac OS X, Linux and more. 10161860 -> 1000001801880: The combination of Tcl and the Tk GUI toolkit is referred to as Tcl/Tk. 10161870 -> 1000001801890: XVT is a cross-platform toolkit for creating enterprise and desktop applications in C/C++ on Windows, Linux and Unix (Solaris, HPUX, AIX), and Mac. 10161880 -> 1000001801900: Most recent release is 5.8, in April 2007 10161890 -> 1000001801910: Cross-platform development environments 10161900 -> 1000001801920: Cross-platform applications can also be built using proprietary IDEs, or so-called Rapid Application Development tools. 10161910 -> 1000001801930: There are a number of development environments which allow developers to build and deploy applications across multiple platforms: 10161920 -> 1000001801940: Eclipse—An Open source software framework and IDE extendable through plug-ins including the C++ Development Toolkit. 10161930 -> 1000001801950: Eclipse is available on any operating system with a modern Java virtual machine (including Windows, Linux, and Mac OS X, Sun, HP-UX, and other systems). 10161940 -> 1000001801960: IntelliJ IDEA—A proprietary IDE 10161950 -> 1000001801970: NetBeans—An Open source software framework and IDE extendable through plug-ins. 10161960 -> 1000001801980: NetBeans is available on any operating system with a modern Java virtual machine (including Windows, Linux, and Mac OS X, Sun, HP-UX, and other systems). 10161970 -> 1000001801990: Similar to Eclipse in features and functionality. 10161980 -> 1000001802000: Promoted by Sun Microsystems 10161990 -> 1000001802010: Omnis Studio—A proprietary IDE or Rapid Application Development tool for creating enterprise and web applications for Windows, Linux, and Mac OS X. 10162000 -> 1000001802020: Runtime Revolution—a proprietary IDE, compiler engine and CGI builder that cross compiles to Windows, Mac OS X (PPC, Intel), Linux, Solaris, BSD, and Irix. 10162010 -> 1000001802030: Code::Blocks—A free/open source, cross platform IDE. 10162020 -> 1000001802040: It is developed in C++ using wxWidgets. 10162030 -> 1000001802050: Using a plugin architecture, its capabilities and features are defined by the provided plugins. 10162040 -> 1000001802060: Lazarus (software)—Lazarus is a cross platform Visual IDE developed for and supported by the open source Free Pascal compiler. 10162050 -> 1000001802070: It aims to provide a Rapid Application Development Delphi Clone for Pascal and Object Pascal developers. 
10162060 -> 1000001802080: REALbasic—REALbasic (RB) is an object-oriented dialect of the BASIC programming language developed and commercially marketed by REAL Software, Inc. in Austin, Texas for Mac OS X, Microsoft Windows, and Linux. 10162070 -> 1000001802090: Criticisms of cross-platform development 10162080 -> 1000001802100: There are certain issues associated with cross-platform development. 10162090 -> 1000001802110: Some of these include: 10162100 -> 1000001802120: Testing cross-platform applications may also be considerably more complicated, since different platforms can exhibit slightly different behaviors or subtle bugs. 10162110 -> 1000001802130: This problem has led some developers to deride cross-platform development as “Write Once, Debug Everywhere”, a take on Sun’s “Write Once, Run Anywhere” marketing slogan. 10162120 -> 1000001802140: Developers are often restricted to using the lowest common denominator subset of features which are available on all platforms. 10162130 -> 1000001802150: This may hinder the application's performance or prohibit developers from using platforms’ most advanced features. 10162140 -> 1000001802160: Different platforms often have different user interface conventions, which cross-platform applications do not always accommodate. 10162150 -> 1000001802170: For example, applications developed for Mac OS X and GNOME are supposed to place the most important button on the right-hand side of windows and dialogs, whereas Microsoft Windows and KDE have the opposite convention. 10162160 -> 1000001802180: Though many of these differences are subtle, a cross-platform application which does not conform appropriately to these conventions may feel clunky or alien to the user. 10162170 -> 1000001802190: For a user working quickly, such opposing conventions may even result in data loss, such as in a dialog box confirming whether the user wants to save or discard changes to a file. 10162180 -> 1000001802200: Scripting-language code and virtual machine bytecode must be translated into native executable code each time the application is executed, imposing a performance penalty. 10162190 -> 1000001802210: This performance hit can be alleviated using advanced techniques like just-in-time compilation; but even using such techniques, some performance overhead may be unavoidable. Data 10170010 -> 1000001900020: Data 10170020 -> 1000001900030: Data (singular: datum) are a collection of natural-phenomena descriptors, including the results of experience, observation or experiment, or a set of premises. 10170030 -> 1000001900040: This may consist of numbers, words, or images, particularly as measurements or observations of a set of variables. 10170040 -> 1000001900050: Etymology 10170050 -> 1000001900060: The word data is the plural of Latin datum, neuter past participle of dare, "to give", hence "something given". 10170060 -> 1000001900070: The past participle of "to give" has been used for millennia, in the sense of a statement accepted at face value; one of the works of Euclid, circa 300 BC, was the Dedomena (in Latin, Data). 10170070 -> 1000001900080: In discussions of problems in geometry, mathematics, engineering, and so on, the terms givens and data are used interchangeably. 10170080 -> 1000001900090: Such usage is the origin of data as a concept in computer science: data are numbers, words, images, etc., accepted as they stand. 10170090 -> 1000001900100: The word is pronounced dey-tuh, dat-uh, or dah-tuh.
10170100 -> 1000001900110: Experimental data are data generated within the context of a scientific investigation. 10170110 -> 1000001900120: Mathematically, data can be grouped in many ways. 10170120 -> 1000001900130: Usage in English 10170130 -> 1000001900140: In English, the word datum is still used in the general sense of "something given", and more specifically in cartography, geography, geology, NMR and drafting to mean a reference point, reference line, or reference surface. 10170140 -> 1000001900150: More generally speaking, any measurement or result can be called a (single) datum, but data point is more common. 10170150 -> 1000001900160: Both datums (see usage in datum article) and the originally Latin plural data are used as the plural of datum in English, but data is more commonly treated as a mass noun and used in the singular, especially in day-to-day usage. 10170160 -> 1000001900170: For example, "This is all the data from the experiment". 10170170 -> 1000001900180: This usage is inconsistent with the rules of Latin grammar and traditional English, which would instead suggest "These are all the data from the experiment". 10170180 -> 1000001900190: Some British and UN academic, scientific, and professional style guides (e.g., see page 43 of the World Health Organization Style Guide) request that authors treat data as a plural noun. 10170190 -> 1000001900200: Other international organizations, such as the IEEE Computer Society, allow it to be used as either a mass noun or a plural based on author preference. 10170200 -> 1000001900210: It is now usually treated as a singular mass noun in informal usage, but usage in scientific publications shows a strong UK/U.S. divide. 10170210 -> 1000001900220: U.S. usage tends to treat data in the singular, including in serious and academic publishing, although some major newspapers (such as the New York Times) regularly use it in the plural. 10170220 -> 1000001900230: "The plural usage is still common, as this headline from the New York Times attests: “Data Are Elusive on the Homeless.” 10170230 -> 1000001900240: Sometimes scientists think of data as plural, as in These data do not support the conclusions. 10170240 -> 1000001900250: But more often scientists and researchers think of data as a singular mass entity like information, and most people now follow this in general usage." 10170245 -> 1000001900260: UK usage now widely accepts treating data as singular in standard English, including everyday newspaper usage at least in non-scientific use. 10170250 -> 1000001900270: UK scientific publishing usually still prefers treating it as a plural. 10170260 -> 1000001900280: Some UK university style guides recommend using data for both singular and plural use and some recommend treating it only as a singular in connection with computers. 10170270 -> 1000001900290: Uses of data in science and computing 10170280 -> 1000001900300: Raw data are numbers, characters, images or other outputs from devices that convert physical quantities into symbols, in a very broad sense. 10170290 -> 1000001900310: Such data are typically further processed by a human or input into a computer, stored and processed there, or transmitted (output) to another human or computer. 10170300 -> 1000001900320: Raw data is a relative term; data processing commonly occurs by stages, and the "processed data" from one stage may be considered the "raw data" of the next. 10170310 -> 1000001900330: Mechanical computing devices are classified according to the means by which they represent data.
10170320 -> 1000001900340: An analog computer represents a datum as a voltage, distance, position, or other physical quantity. 10170330 -> 1000001900350: A digital computer represents a datum as a sequence of symbols drawn from a fixed alphabet. 10170340 -> 1000001900360: The most common digital computers use a binary alphabet, that is, an alphabet of two characters, typically denoted "0" and "1". 10170350 -> 1000001900370: More familiar representations, such as numbers or letters, are then constructed from the binary alphabet. 10170360 -> 1000001900380: Some special forms of data are distinguished. 10170370 -> 1000001900390: A computer program is a collection of data, which can be interpreted as instructions. 10170380 -> 1000001900400: Most computer languages make a distinction between programs and the other data on which programs operate, but in some languages, notably Lisp and similar languages, programs are essentially indistinguishable from other data. 10170390 -> 1000001900410: It is also useful to distinguish metadata, that is, a description of other data. 10170400 -> 1000001900420: A similar yet earlier term for metadata is "ancillary data." 10170410 -> 1000001900430: The prototypical example of metadata is the library catalog, which is a description of the contents of books. 10170420 -> 1000001900440: Meaning of data, information and knowledge 10170430 -> 1000001900450: The terms information and knowledge are frequently used for overlapping concepts. 10170440 -> 1000001900460: The main difference is in the level of abstraction being considered. 10170450 -> 1000001900470: Data is the lowest level of abstraction, information is the next level, and finally, knowledge is the highest level among all three. 10170460 -> 1000001900480: For example, the height of Mt. Everest is generally considered as "data", a book on Mt. Everest geological characteristics may be considered as "information", and a report containing practical information on the best way to reach Mt. Everest's peak may be considered as "knowledge". 10170470 -> 1000001900490: Information as a concept bears a diversity of meanings, from everyday usage to technical settings. 10170480 -> 1000001900500: Generally speaking, the concept of information is closely related to notions of constraint, communication, control, data, form, instruction, knowledge, meaning, mental stimulus, pattern, perception, and representation. 10170490 -> 1000001900510: Beynon-Davies uses the concept of a sign to distinguish between data and information. 10170500 -> 1000001900520: Data are symbols. 10170510 -> 1000001900530: Information occurs when symbols are used to refer to something. Data analysis 10180010 -> 1000002000020: Data analysis 10180020 -> 1000002000030: Data analysis is the process of looking at and summarizing data with the intent to extract useful information and develop conclusions. 10180030 -> 1000002000040: Data analysis is closely related to data mining, but data mining tends to focus on larger data sets, with less emphasis on making inference, and often uses data that was originally collected for a different purpose. 10180040 -> 1000002000050: In statistical applications, some people divide data analysis into descriptive statistics, exploratory data analysis and confirmatory data analysis, where the EDA focuses on discovering new features in the data, and CDA on confirming or falsifying existing hypotheses. 10180050 -> 1000002000060: Data analysis assumes different aspects, and possibly different names, in different fields. 
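As a small, hedged illustration of the descriptive and exploratory side of data analysis described above, the following Python snippet summarizes an invented set of measurements using only the standard library; the numbers are purely illustrative.

    import statistics

    # Hypothetical measurements (say, heights in centimetres).
    values = [162.0, 170.5, 168.2, 175.9, 181.3, 166.4, 172.8]

    # Descriptive statistics condense the raw data into a few summary figures.
    print("count :", len(values))
    print("mean  :", round(statistics.mean(values), 2))
    print("median:", statistics.median(values))
    print("stdev :", round(statistics.stdev(values), 2))
    print("range :", max(values) - min(values))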
10180060 -> 1000002000070: The term data analysis is also used as a synonym for data modeling, which is unrelated to the subject of this article. 10180070 -> 1000002000080: Nuclear and particle physics 10180080 -> 1000002000090: In nuclear and particle physics the data usually originate from the experimental apparatus via a data acquisition system. 10180090 -> 1000002000100: It is then processed, in a step usually called data reduction, to apply calibrations and to extract physically significant information. 10180100 -> 1000002000110: Data reduction is most often, especially in large particle physics experiments, an automatic, batch-mode operation carried out by software written ad-hoc. 10180110 -> 1000002000120: The resulting data n-tuples are then scrutinized by the physicists, using specialized software tools like ROOT or PAW, comparing the results of the experiment with theory. 10180120 -> 1000002000130: The theoretical models are often difficult to compare directly with the results of the experiments, so they are used instead as input for Monte Carlo simulation software like Geant4 that predict the response of the detector to a given theoretical event, producing simulated events which are then compared to experimental data. 10180130 -> 1000002000140: See also: Computational physics. 10180140 -> 1000002000150: Social sciences 10180150 -> 1000002000160: Qualitative data analysis (QDA) or qualitative research is the analysis of non-numerical data, for example words, photographs, observations, etc.. 10180160 -> 1000002000170: Information technology 10180170 -> 1000002000180: A special case is the data analysis in information technology audits. 10180180 -> None: Business 10180190 -> None: See 10180200 -> None: Analytics 10180210 -> None: Business intelligence 10180220 -> None: Data mining Data mining 10210010 -> 1000002100020: Data mining 10210020 -> 1000002100030: Data mining is the process of sorting through large amounts of data and picking out relevant information. 10210030 -> 1000002100040: It is usually used by business intelligence organizations, and financial analysts, but is increasingly being used in the sciences to extract information from the enormous data sets generated by modern experimental and observational methods. 10210040 -> 1000002100050: It has been described as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and "the science of extracting useful information from large data sets or databases." 10210050 -> 1000002100060: Data mining in relation to enterprise resource planning is the statistical and logical analysis of large sets of transaction data, looking for patterns that can aid decision making. 10210060 -> 1000002100070: Background 10210070 -> 1000002100080: Traditionally, business analysts have performed the task of extracting useful information from recorded data, but the increasing volume of data in modern business and science calls for computer-based approaches. 10210080 -> 1000002100090: As data sets have grown in size and complexity, there has been a shift away from direct hands-on data analysis toward indirect, automatic data analysis using more complex and sophisticated tools. 10210090 -> 1000002100100: The modern technologies of computers, networks, and sensors have made data collection and organization much easier. 10210100 -> 1000002100110: However, the captured data needs to be converted into information and knowledge to become useful. 
10210110 -> 1000002100120: Data mining is the entire process of applying computer-based methodology, including new techniques for knowledge discovery, to data. 10210120 -> 1000002100130: Data mining identifies trends within data that go beyond simple analysis. 10210130 -> 1000002100140: Through the use of sophisticated algorithms, non-statistician users have the opportunity to identify key attributes of business processes and target opportunities. 10210140 -> 1000002100150: However, abdicating control of this process from the statistician to the machine may result in false-positives or no useful results at all. 10210150 -> 1000002100160: Although data mining is a relatively new term, the technology is not. 10210160 -> 1000002100170: For many years, businesses have used powerful computers to sift through volumes of data such as supermarket scanner data to produce market research reports (although reporting is not considered to be data mining). 10210170 -> 1000002100180: Continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy and usefulness of data analysis. 10210180 -> 1000002100190: Web 2.0 technologies have generated a colossal amount of user-generated data and media, making it hard to aggregate and consume information in a meaningful way without getting overloaded. 10210190 -> 1000002100200: Given the size of the data on the Internet, and the difficulty in contextualizing it, it is unclear whether the traditional approach to data mining is computationally viable. 10210200 -> 1000002100210: The term data mining is often used to apply to the two separate processes of knowledge discovery and prediction. 10210210 -> 1000002100220: Knowledge discovery provides explicit information that has a readable form and can be understood by a user. 10210220 -> 1000002100230: Forecasting, or predictive modeling provides predictions of future events and may be transparent and readable in some approaches (e.g., rule-based systems) and opaque in others such as neural networks. 10210230 -> 1000002100240: Moreover, some data-mining systems such as neural networks are inherently geared towards prediction and pattern recognition, rather than knowledge discovery. 10210240 -> 1000002100250: Metadata, or data about a given data set, are often expressed in a condensed data-minable format, or one that facilitates the practice of data mining. 10210250 -> 1000002100260: Common examples include executive summaries and scientific abstracts. 10210260 -> 1000002100270: Data mining relies on the use of real world data. 10210270 -> 1000002100280: This data is extremely vulnerable to collinearity precisely because data from the real world may have unknown interrelations. 10210280 -> 1000002100290: An unavoidable weakness of data mining is that the critical data that may expose any relationship might have never been observed. 10210290 -> 1000002100300: Alternative approaches using an experiment-based approach such as Choice Modelling for human-generated data may be used. 10210300 -> 1000002100310: Inherent correlations are either controlled for or removed altogether through the construction of an experimental design. 10210310 -> 1000002100320: Recently, there were some efforts to define a standard for data mining, for example the CRISP-DM standard for analysis processes or the Java Data-Mining Standard. 
10210320 -> 1000002100330: Independent of these standardization efforts, freely available open-source software systems like RapidMiner and Weka have become an informal standard for defining data-mining processes. 10210330 -> 1000002100340: Privacy concerns 10210340 -> 1000002100350: There are also privacy and human rights concerns associated with data mining, specifically regarding the source of the data analyzed. 10210350 -> 1000002100360: Data mining provides information that may be difficult to obtain otherwise. 10210360 -> 1000002100370: When the data collected involves individual people, there are many questions concerning privacy, legality, and ethics. 10210370 -> 1000002100380: In particular, data mining government or commercial data sets for national security or law enforcement purposes has raised privacy concerns. 10210380 -> 1000002100390: Notable uses of data mining 10210390 -> 1000002100400: Combating Terrorism 10210400 -> 1000002100410: Data mining has been cited as the method by which the U.S. Army unit Able Danger had identified Mohamed Atta, the leader of the September 11, 2001 attacks, and three other 9/11 hijackers as possible members of an Al Qaeda cell operating in the U.S. more than a year before the attack. 10210410 -> 1000002100420: It has been suggested that both the Central Intelligence Agency and the Canadian Security Intelligence Service have employed this method. 10210420 -> 1000002100430: Previous US government data-mining programs intended to stop terrorism include the Terrorism Information Awareness (TIA) program, the Computer-Assisted Passenger Prescreening System (CAPPS II), Analysis, Dissemination, Visualization, Insight, and Semantic Enhancement (ADVISE), the Multistate Anti-Terrorism Information Exchange (MATRIX), and the Secure Flight program. 10210430 -> 1000002100440: These programs have been discontinued due to controversy over whether they violate the US Constitution's Fourth Amendment. 10210440 -> 1000002100450: Games 10210450 -> 1000002100460: Since the early 1960s, with the availability of oracles for certain combinatorial games, also called tablebases (e.g. for 3x3-chess with any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex), a new area for data mining has been opened up. 10210460 -> 1000002100470: This is the extraction of human-usable strategies from these oracles. 10210470 -> 1000002100480: Current pattern recognition approaches do not yet seem to reach the high level of abstraction required to be applied successfully. 10210480 -> 1000002100490: Instead, extensive experimentation with the tablebases, combined with an intensive study of tablebase-answers to well designed problems and with knowledge of prior art, i.e. pre-tablebase knowledge, is used to yield insightful patterns. 10210490 -> 1000002100500: Berlekamp in dots-and-boxes etc. and John Nunn in chess endgames are notable examples of researchers doing this work, though they were not and are not involved in tablebase generation. 10210500 -> 1000002100510: Business 10210510 -> 1000002100520: Data mining in customer relationship management applications can contribute significantly to the bottom line. 10210520 -> 1000002100530: Rather than contacting a prospect or customer through a call center or sending mail, only prospects that are predicted to have a high likelihood of responding to an offer are contacted.
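A minimal sketch of that selection step, with entirely invented names and scores, might look like the following Python fragment: a predictive model (not shown here) is assumed to have already assigned each prospect a probability of responding, and only prospects above a chosen threshold are contacted.

    # (prospect, predicted probability of responding) - invented numbers
    scores = [
        ("prospect_a", 0.82),
        ("prospect_b", 0.15),
        ("prospect_c", 0.57),
        ("prospect_d", 0.04),
    ]

    THRESHOLD = 0.5  # contact only prospects above this predicted likelihood

    to_contact = [name for name, probability in scores if probability >= THRESHOLD]
    print("Contact:", to_contact)  # ['prospect_a', 'prospect_c']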
10210530 -> 1000002100540: More sophisticated methods may be used to optimize across campaigns so that it is possible to predict which channel and which offer an individual is most likely to respond to, across all potential offers. 10210540 -> 1000002100550: Finally, in cases where many people will take an action without an offer, uplift modeling can be used to determine which people will have the greatest increase in responding if given an offer. 10210550 -> 1000002100560: Data clustering can also be used to automatically discover the segments or groups within a customer data set. 10210560 -> 1000002100570: Businesses employing data mining quickly see a return on investment, but they also recognize that the number of predictive models can quickly become very large. 10210570 -> 1000002100580: Rather than one model to predict which customers will churn, a business could build a separate model for each region and customer type. 10210580 -> 1000002100590: Then, instead of sending an offer to all people that are likely to churn, it may want to send offers only to customers that are likely to take up the offer. 10210590 -> 1000002100600: Finally, it may also want to determine which customers are going to be profitable over a window of time and only send the offers to those that are likely to be profitable. 10210600 -> 1000002100610: In order to maintain this quantity of models, businesses need to manage model versions and move to automated data mining. 10210610 -> 1000002100620: Data mining can also be helpful to human-resources departments in identifying the characteristics of their most successful employees. 10210620 -> 1000002100630: Information obtained, such as universities attended by highly successful employees, can help HR focus recruiting efforts accordingly. 10210630 -> 1000002100640: Additionally, Strategic Enterprise Management applications help a company translate corporate-level goals, such as profit and margin share targets, into operational decisions, such as production plans and workforce levels. 10210640 -> 1000002100650: Another example of data mining, often called market basket analysis, relates to its use in retail sales. 10210650 -> 1000002100660: If a clothing store records the purchases of customers, a data-mining system could identify those customers who favour silk shirts over cotton ones. 10210660 -> 1000002100670: Although explaining such relationships may be difficult, taking advantage of them is easier. 10210670 -> 1000002100680: The example deals with association rules within transaction-based data. 10210680 -> 1000002100690: Not all data are transaction based and logical or inexact rules may also be present within a database. 10210690 -> 1000002100700: In a manufacturing application, an inexact rule may state that 73% of products which have a specific defect or problem will develop a secondary problem within the next six months. 10210700 -> 1000002100710: An example of data mining related to an integrated-circuit production line is described in the paper "Mining IC Test Data to Optimize VLSI Testing." 10210710 -> 1000002100720: In this paper, the application of data mining and decision analysis to the problem of die-level functional testing is described. 10210720 -> 1000002100730: Experiments mentioned in the paper demonstrate that mining historical die-test data can produce a probabilistic model of patterns of die failure, which is then used to decide in real time which die to test next and when to stop testing.
10210730 -> 1000002100740: This system has been shown, based on experiments with historical test data, to have the potential to improve profits on mature IC products. 10210740 -> 1000002100750: Science and engineering 10210750 -> 1000002100760: In recent years, data mining has been widely used in areas of science and engineering, such as bioinformatics, genetics, medicine, education, and electrical power engineering. 10210760 -> 1000002100770: In the study of human genetics, an important goal is to understand the mapping relationship between inter-individual variation in human DNA sequences and variability in disease susceptibility. 10210770 -> 1000002100780: In lay terms, the aim is to find out how changes in an individual's DNA sequence affect the risk of developing common diseases such as cancer. 10210780 -> 1000002100790: This is very important to help improve the diagnosis, prevention and treatment of these diseases. 10210790 -> 1000002100800: The data mining technique that is used to perform this task is known as multifactor dimensionality reduction. 10210800 -> 1000002100810: In the area of electrical power engineering, data mining techniques have been widely used for condition monitoring of high voltage electrical equipment. 10210810 -> 1000002100820: The purpose of condition monitoring is to obtain valuable information on the health status of the equipment's insulation. 10210820 -> 1000002100830: Data clustering techniques such as the self-organizing map (SOM) have been applied to the vibration monitoring and analysis of transformer on-load tap-changers (OLTCs). 10210830 -> 1000002100840: Using vibration monitoring, it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms. 10210840 -> 1000002100850: Obviously, different tap positions will generate different signals. 10210850 -> 1000002100860: However, there was considerable variability amongst normal condition signals for the exact same tap position. 10210860 -> 1000002100870: SOM has been applied to detect abnormal conditions and to estimate the nature of the abnormalities. 10210870 -> 1000002100880: Data mining techniques have also been applied to dissolved gas analysis (DGA) on power transformers. 10210880 -> 1000002100890: DGA, as a diagnostic technique for power transformers, has been available for decades. 10210890 -> 1000002100900: Data mining techniques such as SOM have been applied to analyse the data and to determine trends which are not obvious to standard DGA ratio techniques such as the Duval Triangle. 10210900 -> 1000002100910: Another area of application for data mining in science/engineering is within educational research, where data mining has been used to study the factors leading students to choose to engage in behaviors which reduce their learning and to understand the factors influencing university student retention. 10210910 -> 1000002100920: Other examples of data mining applications include mining biomedical data facilitated by domain ontologies, mining clinical trial data, and traffic analysis using SOM. Data set 10220010 -> 1000002200020: Data set 10220020 -> 1000002200030: A data set (or dataset) is a collection of data, usually presented in tabular form. 10220030 -> 1000002200040: Each column represents a particular variable. 10220040 -> 1000002200050: Each row corresponds to a given member of the data set in question.
10220050 -> 1000002200060: It lists values for each of the variables, such as height and weight of an object or values of random numbers. 10220060 -> 1000002200070: Each value is known as a datum. 10220070 -> 1000002200080: The data set may comprise data for one or more members, corresponding to the number of rows. 10220080 -> 1000002200090: Historically, the term originated in the mainframe field, where it had a well-defined meaning, very close to contemporary computer file. 10220090 -> 1000002200100: This topic is not covered here. 10220100 -> 1000002200110: In the simplest case, there is only one variable, and then the data set consists of a single column of values, often represented as a list. 10220110 -> 1000002200120: The values may be numbers, such as real numbers or integers, for example representing a person's height in centimeters, but may also be nominal data (i.e., not consisting of numerical values), for example representing a person's ethnicity. 10220120 -> 1000002200130: More generally, values may be of any of the kinds described as a level of measurement. 10220130 -> 1000002200140: For each variable, the values will normally all be of the same kind. 10220140 -> 1000002200150: However, there may also be "missing values", which need to be indicated in some way. 10220150 -> 1000002200160: In statistics data sets usually come from actual observations obtained by sampling a statistical population, and each row corresponds to the observations on one element of that population. 10220160 -> 1000002200170: Data sets may further be generated by algorithms for the purpose of testing certain kinds of software. 10220170 -> 1000002200180: Some modern statistical analysis software such as PSPP still present their data in the classical dataset fashion. 10220180 -> None: Classic data sets 10220190 -> None: Several classic data sets have been used extensively in the statistical literature: 10220200 -> None: Iris flower data set - multivariate data set introduced by Ronald Fisher (1936). 10220210 -> None: Categorical data analysis - Data sets used in the book, An Introduction to Categorical Data Analysis, by Agresti are provided on-line by StatLib. 10220220 -> None: Robust statistics - Data sets used in Robust Regression and Outlier Detection (Rousseeuw and Leroy, 1986). 10220225 -> None: Provided on-line at the University of Cologne. 10220230 -> None: Time series - Data used in Chatfield's book, The Analysis of Time Series, are provided on-line by StatLib. 10220240 -> None: Extreme values - Data used in the book, An Introduction to the Statistical Modeling of Extreme Values are provided on-line by Stuart Coles, the book's author. 10220250 -> None: Bayesian Data Analysis - Data used in the book, Bayesian Data Analysis, are provided on-line by Andrew Gelman, one of the book's authors. 10220260 -> None: The Bupa liver data, used in several papers in the machine learning (data mining) literature. Database 10190010 -> 1000002300020: Database 10190020 -> 1000002300030: A database is a structured collection of records or data. 10190030 -> 1000002300040: A computer database relies upon software to organize the storage of data. 10190040 -> 1000002300050: The software models the database structure in what are known as database models. 10190050 -> 1000002300060: The model in most common use today is the relational model. 
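For readers unfamiliar with the relational model just mentioned, the short sketch below uses Python's built-in sqlite3 module to create one relational table and query it declaratively; the table, columns, and rows are invented for illustration.

    import sqlite3

    # An in-memory relational database: data is held in tables of rows and columns.
    connection = sqlite3.connect(":memory:")
    cursor = connection.cursor()

    cursor.execute(
        "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)"
    )
    cursor.executemany(
        "INSERT INTO employee (id, name, dept) VALUES (?, ?, ?)",
        [(1, "Ada", "Research"), (2, "Grace", "Research"), (3, "Edgar", "Sales")],
    )

    # A declarative query states *what* is wanted, not *how* to retrieve it.
    cursor.execute("SELECT name FROM employee WHERE dept = ?", ("Research",))
    print(cursor.fetchall())  # [('Ada',), ('Grace',)]

    connection.close()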
10190060 -> 1000002300070: Other models such as the hierarchical model and the network model use a more explicit representation of relationships (see below for explanation of the various database models). 10190070 -> 1000002300080: Database management systems (DBMS) are the software used to organize and maintain the database. 10190080 -> 1000002300090: These are categorized according to the database model that they support. 10190090 -> 1000002300100: The model tends to determine the query languages that are available to access the database. 10190100 -> 1000002300110: A great deal of the internal engineering of a DBMS, however, is independent of the data model, and is concerned with managing factors such as performance, concurrency, integrity, and recovery from hardware failures. 10190110 -> 1000002300120: In these areas there are large differences between products. 10190120 -> 1000002300130: History 10190130 -> 1000002300140: The earliest known use of the term data base was in November 1963, when the System Development Corporation sponsored a symposium under the title Development and Management of a Computer-centered Data Base. 10190140 -> 1000002300150: Database as a single word became common in Europe in the early 1970s and by the end of the decade it was being used in major American newspapers. 10190150 -> 1000002300160: (The abbreviation DB, however, survives.) 10190160 -> 1000002300170: The first database management systems were developed in the 1960s. 10190170 -> 1000002300180: A pioneer in the field was Charles Bachman. 10190180 -> 1000002300190: Bachman's early papers show that his aim was to make more effective use of the new direct access storage devices becoming available: until then, data processing had been based on punched cards and magnetic tape, so that serial processing was the dominant activity. 10190190 -> 1000002300200: Two key data models arose at this time: CODASYL developed the network model based on Bachman's ideas, and (apparently independently) the hierarchical model was used in a system developed by North American Rockwell, later adopted by IBM as the cornerstone of their IMS product. 10190200 -> 1000002300210: While IMS, along with the CODASYL IDMS, were the big, high-visibility databases developed in the 1960s, several others were also born in that decade, some of which have a significant installed base today. 10190210 -> 1000002300220: Two worthy of mention are the PICK and MUMPS databases, with the former developed originally as an operating system with an embedded database and the latter as a programming language and database for the development of healthcare systems. 10190220 -> 1000002300230: The relational model was proposed by E. F. Codd in 1970. 10190230 -> 1000002300240: He criticized existing models for confusing the abstract description of information structure with descriptions of physical access mechanisms. 10190240 -> 1000002300250: For a long while, however, the relational model remained of academic interest only. 10190250 -> 1000002300260: While CODASYL network-model products (IDMS) and hierarchical-model products (IMS) were conceived as practical engineering solutions taking account of the technology as it existed at the time, the relational model took a much more theoretical perspective, arguing (correctly) that hardware and software technology would catch up in time. 10190260 -> 1000002300270: Among the first implementations were Michael Stonebraker's Ingres at Berkeley, and the System R project at IBM.
10190270 -> 1000002300280: Both of these were research prototypes, announced during 1976. 10190280 -> 1000002300290: The first commercial products, Oracle and DB2, did not appear until around 1980. 10190290 -> 1000002300300: The first successful database product for microcomputers was dBASE for the CP/M and PC-DOS/MS-DOS operating systems. 10190300 -> 1000002300310: During the 1980s, research activity focused on distributed database systems and database machines. 10190310 -> 1000002300320: Another important theoretical idea was the Functional Data Model, but apart from some specialized applications in genetics, molecular biology, and fraud investigation, the world took little notice. 10190320 -> 1000002300330: In the 1990s, attention shifted to object-oriented databases. 10190330 -> 1000002300340: These had some success in fields where it was necessary to handle more complex data than relational systems could easily cope with, such as spatial databases, engineering data (including software repositories), and multimedia data. 10190340 -> 1000002300350: Some of these ideas were adopted by the relational vendors, who integrated new features into their products as a result. 10190350 -> 1000002300360: The 1990s also saw the spread of Open Source databases, such as PostgreSQL and MySQL. 10190360 -> 1000002300370: In the 2000s, the fashionable area for innovation has been the XML database. 10190370 -> 1000002300380: As with object databases, this has spawned a new collection of start-up companies, but at the same time the key ideas are being integrated into the established relational products. 10190380 -> 1000002300390: XML databases aim to remove the traditional divide between documents and data, allowing all of an organization's information resources to be held in one place, whether they are highly structured or not. 10190390 -> 1000002300400: Database models 10190400 -> 1000002300410: Various techniques are used to model data structure. 10190410 -> 1000002300420: Most database systems are built around one particular data model, although it is increasingly common for products to offer support for more than one model. 10190420 -> 1000002300430: For any one logical model various physical implementations may be possible, and most products will offer the user some level of control in tuning the physical implementation, since the choices that are made have a significant effect on performance. 10190430 -> 1000002300440: Here are three examples: 10190440 -> 1000002300450: Hierarchical model 10190450 -> 1000002300460: In a hierarchical model, data is organized into an inverted tree-like structure, with each node holding multiple downward links to describe the nesting, and a sort field to keep the records in a particular order in each same-level list. 10190460 -> 1000002300470: This structure arranges the various data elements in a hierarchy and helps to establish logical relationships among data elements of multiple files. 10190470 -> 1000002300480: Each unit in the model is a record, which is also known as a node. 10190480 -> 1000002300490: In such a model, each record on one level can be related to multiple records on the next lower level. 10190490 -> 1000002300500: A record that has subsidiary records is called a parent and the subsidiary records are called children. 10190500 -> 1000002300510: Data elements in this model are well suited for one-to-many relationships with other data elements in the database.
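As an illustration of the parent/child structure just described, here is a minimal sketch in Python; the department and employee records are hypothetical stand-ins for whatever record types a real hierarchical database such as IMS would define:

    # A hypothetical hierarchical record: each node has exactly one parent
    # and any number of children, kept in a particular order.
    department = {
        "name": "Engineering",
        "children": [
            {"name": "Alice", "role": "Manager", "children": []},
            {"name": "Bob", "role": "Developer", "children": []},
        ],
    }

    def find(node, name):
        """Walk the tree top-down; a child record is reachable only through its parent."""
        if node["name"] == name:
            return node
        for child in node["children"]:
            hit = find(child, name)
            if hit is not None:
                return hit
        return None

    print(find(department, "Bob"))  # {'name': 'Bob', 'role': 'Developer', 'children': []}

The one-way, top-down links are what make one-to-many relationships natural in this model, and many-to-many relationships awkward.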
10190510 -> 1000002300520: This model is advantageous when the data elements are inherently hierarchical. 10190520 -> 1000002300530: The disadvantage is that in order to prepare the database it becomes necessary to identify the requisite groups of files that are to be logically integrated. 10190530 -> 1000002300540: Hence, a hierarchical data model may not always be flexible enough to accommodate the dynamic needs of an organization. 10190540 -> 1000002300550: Network model 10190550 -> 1000002300560: The network model tends to store records with links to other records. 10190560 -> 1000002300570: Each record in the database can have multiple parents, i.e., the relationships among data elements can be many-to-many. 10190570 -> 1000002300580: Associations are tracked via "pointers". 10190580 -> 1000002300590: These pointers can be node numbers or disk addresses. 10190590 -> 1000002300600: Most network databases tend to also include some form of hierarchical model. 10190600 -> 1000002300610: Databases can be translated from the hierarchical model to the network model and vice versa. 10190610 -> 1000002300620: The main difference between the network model and the hierarchical model is that in a network model, a child can have a number of parents whereas in a hierarchical model, a child can have only one parent. 10190620 -> 1000002300630: The network model provides a greater advantage than the hierarchical model in that it promotes greater flexibility and data accessibility, since records at a lower level can be accessed without accessing the records above them. 10190630 -> 1000002300640: This model is more efficient than the hierarchical model, easier to understand, and can be applied to many real-world problems that require routine transactions. 10190640 -> 1000002300650: The disadvantages are that: it is a complex process to design and develop a network database; it has to be refined frequently; it requires that the relationships among all the records be defined before development starts, and changes often demand major programming efforts; and operation and maintenance of the network model is expensive and time-consuming. 10190650 -> 1000002300660: Examples of database engines that have network model capabilities are RDM Embedded and RDM Server. 10190660 -> 1000002300670: Relational model 10190670 -> 1000002300680: The basic data structure of the relational model is a table where information about a particular entity (say, an employee) is represented in columns and rows. 10190680 -> 1000002300690: The columns enumerate the various attributes of an entity (e.g. employee_name, address, phone_number). 10190690 -> 1000002300700: Rows (also called records) represent instances of an entity (e.g. specific employees). 10190700 -> 1000002300710: The "relation" in "relational database" comes from the mathematical notion of relations from the field of set theory. 10190710 -> 1000002300720: A relation is a set of tuples, so rows are sometimes called tuples. 10190720 -> 1000002300730: All tables in a relational database adhere to three basic rules. 10190730 -> 1000002300740: The ordering of columns is immaterial. 10190740 -> 1000002300750: Identical rows are not allowed in a table. 10190750 -> 1000002300760: Each row has a single (separate) value for each of its columns (each tuple has an atomic value). 10190760 -> 1000002300770: If the same value occurs in two different records (from the same table or different tables) it can imply a relationship between those records.
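A minimal sketch of these rules in plain Python (the employee and address relations below are hypothetical, modelled as sets of tuples rather than with any particular DBMS): rows are unordered and unique, each column holds an atomic value, and a value shared between two tables implies a relationship between the corresponding records.

    # Two hypothetical relations, each a set of tuples with named columns.
    employee_columns = ("employee_name", "phone_number", "address_id")
    employees = {
        ("Ann Smith", "555-0100", 1),
        ("Joe Bloggs", "555-0101", 2),
    }  # a set: identical rows cannot occur twice, and row order is immaterial

    address_columns = ("address_id", "street", "city")
    addresses = {
        (1, "12 High St", "Springfield"),
        (2, "9 Low Rd", "Shelbyville"),
    }

    # A shared address_id value implies a relationship between records of the two tables.
    for emp in (dict(zip(employee_columns, row)) for row in employees):
        for addr in (dict(zip(address_columns, row)) for row in addresses):
            if emp["address_id"] == addr["address_id"]:
                print(emp["employee_name"], "lives in", addr["city"])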
10190770 -> 1000002300780: Relationships between records are often categorized by their cardinality (1:1, 1:M, M:M). 10190780 -> 1000002300790: Tables can have a designated column or set of columns that act as a "key" to select rows from that table with the same or similar key values. 10190790 -> 1000002300800: A "primary key" is a key that has a unique value for each row in the table. 10190800 -> 1000002300810: Keys are commonly used to join or combine data from two or more tables. 10190810 -> 1000002300820: For example, an employee table may contain a column named address which contains a value that matches the key of an address table. 10190820 -> 1000002300830: Keys are also critical in the creation of indexes, which facilitate fast retrieval of data from large tables. 10190830 -> 1000002300840: It is not necessary to define all the keys in advance; a column can be used as a key even if it was not originally intended to be one. 10190840 -> 1000002300850: Relational operations 10190850 -> 1000002300860: Users (or programs) request data from a relational database by sending it a query that is written in a special language, usually a dialect of SQL. 10190860 -> 1000002300870: Although SQL was originally intended for end-users, it is much more common for SQL queries to be embedded into software that provides an easier user interface. 10190870 -> 1000002300880: Many web applications, such as Wikipedia, perform SQL queries when generating pages. 10190880 -> 1000002300890: In response to a query, the database returns a result set, which is the list of rows constituting the answer. 10190890 -> 1000002300900: The simplest query is just to return all the rows from a table, but more often, the rows are filtered in some way to return just the answer wanted. 10190900 -> 1000002300910: Often, data from multiple tables are combined into one, by doing a join. 10190910 -> 1000002300920: There are a number of relational operations in addition to join. 10190920 -> 1000002300930: Normal forms 10190930 -> 1000002300940: Relations are classified based upon the types of anomalies to which they're vulnerable. 10190940 -> 1000002300950: A database that's in the first normal form is vulnerable to all types of anomalies, while a database that's in the domain/key normal form has no modification anomalies. 10190950 -> 1000002300960: Normal forms are hierarchical in nature. 10190960 -> 1000002300970: That is, the lowest level is the first normal form, and the database cannot meet the requirements for higher-level normal forms without first having met all the requirements of the lesser normal forms. 10190970 -> 1000002300980: Database Management Systems 10190980 -> 1000002300990: Relational database management systems 10190990 -> 1000002301000: An RDBMS implements the features of the relational model outlined above. 10191000 -> 1000002301010: In this context, Date's Information Principle states: 10191010 -> 1000002301020: The entire information content of the database is represented in one and only one way. 10191020 -> 1000002301030: Namely, as explicit values in column positions (attributes) and rows in relations (tuples). Therefore, there are no explicit pointers between related tables. 10191030 -> 1000002301040: Post-relational database models 10191040 -> 1000002301050: Several products have been identified as post-relational because the data model incorporates relations but is not constrained by the Information Principle, which requires that all information be represented by data values in relations.
10191050 -> 1000002301060: Products using a post-relational data model typically employ a model that actually pre-dates the relational model. 10191060 -> 1000002301070: These might be identified as a directed graph with trees on the nodes. 10191070 -> 1000002301080: Examples of models that could be classified as post-relational are PICK aka MultiValue, and MUMPS. 10191080 -> 1000002301090: Object database models 10191090 -> 1000002301100: In recent years, the object-oriented paradigm has been applied to database technology, creating a new programming model known as object databases. 10191100 -> 1000002301110: These databases attempt to bring the database world and the application programming world closer together, in particular by ensuring that the database uses the same type system as the application program. 10191110 -> 1000002301120: This aims to avoid the overhead (sometimes referred to as the impedance mismatch) of converting information between its representation in the database (for example as rows in tables) and its representation in the application program (typically as objects). 10191120 -> 1000002301130: At the same time, object databases attempt to introduce the key ideas of object programming, such as encapsulation and polymorphism, into the world of databases. 10191130 -> 1000002301140: A variety of ways have been tried for storing objects in a database. 10191140 -> 1000002301150: Some products have approached the problem from the application programming end, by making the objects manipulated by the program persistent. 10191150 -> 1000002301160: This also typically requires the addition of some kind of query language, since conventional programming languages do not have the ability to find objects based on their information content. 10191160 -> 1000002301170: Others have attacked the problem from the database end, by defining an object-oriented data model for the database, and defining a database programming language that allows full programming capabilities as well as traditional query facilities. 10191170 -> 1000002301180: DBMS internals 10191180 -> 1000002301190: Storage and physical database design 10191190 -> 1000002301200: Database tables/indexes are typically stored in memory or on hard disk in one of many forms: ordered/unordered flat files, ISAM, heaps, hash buckets, or B+ trees. 10191200 -> 1000002301210: These have various advantages and disadvantages discussed further in the main article on this topic. 10191210 -> 1000002301220: The most commonly used are B+ trees and ISAM. 10191220 -> 1000002301230: Other important design choices relate to the clustering of data by category (such as grouping data by month, or location), creating pre-computed views known as materialized views, and partitioning data by range or hash. 10191230 -> 1000002301240: Memory management and storage topology can also be important design choices for database designers. 10191240 -> 1000002301250: Just as normalization is used to reduce storage requirements and improve the extensibility of the database, conversely, denormalization is often used to reduce join complexity and reduce execution time for queries. 10191250 -> 1000002301260: Indexing 10191260 -> 1000002301270: All of these databases can take advantage of indexing to increase their speed. 10191270 -> 1000002301280: This technology has advanced tremendously since its early uses in the 1960s and 1970s.
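The relational operations described above (queries, result sets and joins) and the index transparency discussed in this section can be sketched with Python's built-in sqlite3 module; the table and column names are hypothetical, and the details of SQL dialect and plan selection vary between products.

    import sqlite3

    con = sqlite3.connect(":memory:")  # a throwaway in-memory database
    con.execute("CREATE TABLE employee (employee_name TEXT, address_id INTEGER)")
    con.execute("CREATE TABLE address (address_id INTEGER PRIMARY KEY, city TEXT)")
    con.execute("INSERT INTO address VALUES (1, 'Springfield'), (2, 'Shelbyville')")
    con.execute("INSERT INTO employee VALUES ('Ann Smith', 1), ('Joe Bloggs', 2)")

    # A join combines rows from the two tables via the shared key value,
    # and the filtered rows come back as a result set.
    query = """SELECT e.employee_name, a.city
               FROM employee AS e JOIN address AS a ON e.address_id = a.address_id
               WHERE a.city = 'Springfield'"""
    print(con.execute(query).fetchall())  # [('Ann Smith', 'Springfield')]

    # An index can be added later without touching the query text at all;
    # the DBMS is free to pick a different execution plan behind the scenes.
    con.execute("CREATE INDEX idx_employee_address ON employee(address_id)")
    print(con.execute(query).fetchall())  # same SQL, same result set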
10191280 -> 1000002301290: The most common kind of index is a sorted list of the contents of some particular table column, with pointers to the row associated with the value. 10191290 -> 1000002301300: An index allows a set of table rows matching some criterion to be located quickly. 10191300 -> 1000002301310: Typically, indexes are also stored in the various forms of data structure mentioned above (such as B-trees, hashes, and linked lists). 10191310 -> 1000002301320: Usually, a specific technique is chosen by the database designer to increase efficiency in the particular case of the type of index required. 10191320 -> 1000002301330: Relational DBMSs have the advantage that indexes can be created or dropped without changing existing applications that make use of them. 10191330 -> 1000002301340: The database chooses between many different strategies based on which one it estimates will run the fastest. 10191340 -> 1000002301350: In other words, indexes are transparent to the application or end-user querying the database; while they affect performance, any SQL command will run with or without an index to compute the result of an SQL statement. 10191350 -> 1000002301360: The RDBMS will produce a plan of how to execute the query, which is generated by analyzing the run times of the different algorithms and selecting the quickest. 10191360 -> 1000002301370: Some of the key algorithms that deal with joins are the nested loop join, sort-merge join and hash join. 10191370 -> 1000002301380: Which of these is chosen depends on whether an index exists, what type it is, and its cardinality. 10191380 -> 1000002301390: An index speeds up access to data, but it has disadvantages as well. 10191390 -> 1000002301400: First, every index increases the amount of storage on the hard drive necessary for the database file, and second, the index must be updated each time the data are altered, and this costs time. 10191400 -> 1000002301410: (Thus an index saves time in the reading of data, but it costs time in entering and altering data. 10191410 -> 1000002301420: It thus depends on the use to which the data are to be put whether an index is on the whole a net plus or minus in the quest for efficiency.) 10191420 -> 1000002301430: A special case of an index is a primary index, or primary key, which is distinguished in that the primary index must ensure a unique reference to a record. 10191430 -> 1000002301440: Often, for this purpose one simply uses a running index number (ID number). 10191440 -> 1000002301450: Primary indexes play a significant role in relational databases, and they can speed up access to data considerably. 10191450 -> 1000002301460: Transactions and concurrency 10191460 -> 1000002301470: In addition to their data model, most practical databases ("transactional databases") attempt to enforce a database transaction. 10191470 -> 1000002301480: Ideally, the database software should enforce the ACID rules, summarized here: 10191480 -> 1000002301490: Atomicity: Either all the tasks in a transaction must be done, or none of them. 10191490 -> 1000002301500: The transaction must be completed, or else it must be undone (rolled back). 10191500 -> 1000002301510: Consistency: Every transaction must preserve the integrity constraints — the declared consistency rules — of the database. 10191510 -> 1000002301520: It cannot place the data in a contradictory state. 10191520 -> 1000002301530: Isolation: Two simultaneous transactions cannot interfere with one another.
10191530 -> 1000002301540: Intermediate results within a transaction are not visible to other transactions. 10191540 -> 1000002301550: Durability: Completed transactions cannot be aborted later or their results discarded. 10191550 -> 1000002301560: They must persist through (for instance) restarts of the DBMS after crashes. 10191560 -> 1000002301570: In practice, many DBMSs allow most of these rules to be selectively relaxed for better performance. 10191570 -> 1000002301580: Concurrency control is a method used to ensure that transactions are executed in a safe manner and follow the ACID rules. 10191580 -> 1000002301590: The DBMS must be able to ensure that only serializable, recoverable schedules are allowed, and that no actions of committed transactions are lost while undoing aborted transactions. 10191590 -> 1000002301600: Replication 10191600 -> 1000002301610: Replication of databases is closely related to transactions. 10191610 -> 1000002301620: If a database can log its individual actions, it is possible to create a duplicate of the data in real time. 10191620 -> 1000002301630: The duplicate can be used to improve performance or availability of the whole database system. 10191630 -> 1000002301640: Common replication concepts include: 10191640 -> 1000002301650: Master/Slave Replication: All write requests are performed on the master and then replicated to the slaves. 10191650 -> 1000002301660: Quorum: The results of read and write requests are calculated by querying a "majority" of replicas. 10191660 -> 1000002301670: Multimaster: Two or more replicas sync each other via a transaction identifier. 10191670 -> 1000002301680: Parallel synchronous replication of databases enables transactions to be replicated on multiple servers simultaneously, which provides a method for backup and security as well as data availability. 10191680 -> 1000002301690: Security 10191690 -> 1000002301700: Database security denotes the system, processes, and procedures that protect a database from unintended activity. 10191700 -> 1000002301710: Security is usually enforced through access control, auditing, and encryption. 10191710 -> 1000002301720: Access control restricts who can connect to the database and what can be done to it. 10191720 -> 1000002301730: Auditing logs what action or change has been performed, when and by whom. 10191730 -> 1000002301740: Encryption: Since security has become a major issue in recent years, many commercial database vendors provide built-in encryption mechanisms. 10191740 -> 1000002301750: Data is encoded natively into the tables and deciphered "on the fly" when a query comes in. 10191745 -> 1000002301760: Connections can also be secured and encrypted if required using DSA, MD5, SSL or legacy encryption standards. 10191750 -> 1000002301770: Enforcing security is one of the major tasks of the DBA. 10191760 -> 1000002301780: In the United Kingdom, legislation protecting the public from unauthorized disclosure of personal information held on databases falls under the Office of the Information Commissioner. 10191770 -> 1000002301790: United Kingdom-based organizations holding personal data in electronic format (databases for example) are required to register with the Information Commissioner. 10191780 -> 1000002301800: Locking 10191790 -> 1000002301810: Locking is how the database handles multiple concurrent operations. 10191800 -> 1000002301820: This is how concurrency and some basic form of integrity are managed within the database system.
10191810 -> 1000002301830: Such locks can be applied on a row level, or on other levels such as a page (a basic data block), an extent (multiple pages) or even an entire table. 10191820 -> 1000002301840: This helps maintain the integrity of the data by ensuring that only one process at a time can modify the same data. 10191830 -> 1000002301850: Unlike basic filesystem files or folders, where only one lock can be set at a time, restricting usage to a single process, 10191840 -> 1000002301860: a database can set and hold multiple locks at the same time on different levels of the physical data structure. 10191850 -> 1000002301870: How locks are set and how long they last is determined by the database engine's locking scheme, based on the SQL statements or transactions submitted by the users. 10191860 -> 1000002301880: Generally speaking, no activity on the database should translate into no or only very light locking. 10191870 -> 1000002301890: For most DBMS systems on the market, locks are generally either shared or exclusive. 10191880 -> 1000002301900: An exclusive lock means that no other lock can be acquired on the current data object as long as the exclusive lock lasts. 10191890 -> 1000002301910: Exclusive locks are usually set while the database needs to change data, such as during an UPDATE or DELETE operation. 10191900 -> 1000002301920: Shared locks, by contrast, can be held on the same data structure by several processes at the same time. 10191910 -> 1000002301930: Shared locks are usually used while the database is reading data, such as during a SELECT operation. 10191920 -> 1000002301940: The number and nature of the locks, and the time for which a lock holds a data block, can have a huge impact on database performance. 10191930 -> 1000002301950: Bad locking can lead to disastrous performance (usually the result of poorly written SQL requests or an inadequate physical database structure). 10191940 -> 1000002301960: Default locking behavior is enforced by the isolation level of the data server. 10191950 -> 1000002301970: Changing the isolation level will affect how shared or exclusive locks must be set on the data for the entire database system. 10191960 -> 1000002301980: The default isolation level is generally 1, where data cannot be read while it is being modified, preventing "ghost data" from being returned to the end user. 10191970 -> 1000002301990: At some point, intensive or inappropriate exclusive locking can lead to a "deadlock" situation between two locks, 10191980 -> 1000002302000: where neither lock can be released because each is trying to acquire a resource held by the other. 10191990 -> 1000002302010: The database has a fail-safe mechanism and will automatically "sacrifice" one of the locks, releasing the resource. 10192000 -> 1000002302020: The processes or transactions involved in the "deadlock" are then rolled back. 10192010 -> 1000002302030: Databases can also be locked for other reasons, such as access restrictions for given levels of user. 10192020 -> 1000002302040: Databases are also locked for routine database maintenance, which prevents changes being made during the maintenance. 10192030 -> 1000002302050: (See IBM for more detail.) 10192040 -> 1000002302060: Architecture 10192050 -> 1000002302070: Depending on the intended use, there are a number of database architectures in use. 10192060 -> 1000002302080: Many databases use a combination of strategies.
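The transaction and locking behaviour described in the two preceding sections can be sketched with Python's built-in sqlite3 module (the account table is hypothetical; note that SQLite locks at the level of the whole database file, whereas most server DBMSs offer the finer row, page or table locks mentioned above):

    import os
    import sqlite3
    import tempfile

    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    writer = sqlite3.connect(path, timeout=0.1)
    reader = sqlite3.connect(path, timeout=0.1)
    writer.execute("CREATE TABLE account (name TEXT, balance INTEGER)")
    writer.execute("INSERT INTO account VALUES ('Ann', 100)")
    writer.commit()

    # Atomicity: the work below is either committed as a whole or rolled back as a whole.
    try:
        writer.execute("UPDATE account SET balance = balance - 150 WHERE name = 'Ann'")
        (balance,) = writer.execute(
            "SELECT balance FROM account WHERE name = 'Ann'").fetchone()
        if balance < 0:
            raise ValueError("balance would go negative")
        writer.commit()
    except ValueError:
        writer.rollback()  # undo the partial work
    print(writer.execute("SELECT balance FROM account").fetchone())  # (100,) - unchanged

    # Shared versus exclusive access: while the writer holds an open write transaction,
    # another connection can still read committed data, but its own write is blocked.
    writer.execute("UPDATE account SET balance = 90 WHERE name = 'Ann'")
    print(reader.execute("SELECT balance FROM account").fetchone())  # (100,)
    try:
        reader.execute("UPDATE account SET balance = 0 WHERE name = 'Ann'")
    except sqlite3.OperationalError as err:
        print("blocked by the writer's lock:", err)  # 'database is locked'
    writer.commit()  # committing releases the lock and makes the new balance durable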
10192070 -> 1000002302090: On-line Transaction Processing systems (OLTP) often use a row-oriented datastore architecture, while data-warehouse and other retrieval-focused applications like Google's BigTable, or bibliographic database (library catalogue) systems, may use a column-oriented datastore architecture. 10192080 -> 1000002302100: Document-oriented, XML, and knowledge-base systems, as well as frame databases and RDF stores (aka triple stores), may also use a combination of these architectures in their implementation. 10192090 -> 1000002302110: Finally, it should be noted that not all databases have or need a database 'schema' (so-called schema-less databases). 10192100 -> 1000002302120: Applications of databases 10192110 -> 1000002302130: Databases are used in many applications, spanning virtually the entire range of computer software. 10192120 -> 1000002302140: Databases are the preferred method of storage for large multiuser applications, where coordination between many users is needed. 10192130 -> 1000002302150: Even individual users find them convenient, and many electronic mail programs and personal organizers are based on standard database technology. 10192140 -> 1000002302160: Software database drivers are available for most database platforms so that application software can use a common Application Programming Interface to retrieve the information stored in a database. 10192150 -> 1000002302170: Two commonly used database APIs are JDBC and ODBC. 10192160 -> 1000002302180: For example, a suppliers database contains data relating to suppliers, such as: 10192170 -> 1000002302190: supplier name 10192180 -> 1000002302200: supplier code 10192190 -> 1000002302210: supplier address 10192200 -> 1000002302220: It is often used by schools to teach students and grade them. 10192210 -> None: Links to DBMS products 10192220 -> None: 4D 10192230 -> None: ADABAS 10192240 -> None: Alpha Five 10192250 -> None: Apache Derby (Java, also known as IBM Cloudscape and Sun Java DB) 10192260 -> None: BerkeleyDB 10192270 -> None: CouchDB 10192280 -> None: CSQL 10192290 -> None: Datawasp 10192300 -> None: Db4objects 10192310 -> None: dBase 10192320 -> None: FileMaker 10192330 -> None: Firebird (database server) 10192340 -> None: H2 (Java) 10192350 -> None: Hsqldb (Java) 10192360 -> None: IBM DB2 10192370 -> None: IBM IMS (Information Management System) 10192380 -> None: IBM UniVerse 10192390 -> None: Informix 10192400 -> None: Ingres 10192410 -> None: Interbase 10192420 -> None: InterSystems Caché 10192430 -> None: MaxDB (formerly SapDB) 10192440 -> None: Microsoft Access 10192450 -> None: Microsoft SQL Server 10192460 -> None: Model 204 10192470 -> None: MySQL 10192480 -> None: Nomad 10192490 -> None: Objectivity/DB 10192500 -> None: ObjectStore 10192510 -> None: OpenLink Virtuoso 10192520 -> None: OpenOffice.org Base 10192530 -> None: Oracle Database 10192540 -> None: Paradox (database) 10192550 -> None: Polyhedra DBMS 10192560 -> None: PostgreSQL 10192570 -> None: Progress 4GL 10192580 -> None: RDM Embedded 10192590 -> None: ScimoreDB 10192600 -> None: Sedna 10192610 -> None: SQLite 10192620 -> None: Superbase 10192630 -> None: Sybase 10192640 -> None: Teradata 10192650 -> None: Vertica 10192660 -> None: Visual FoxPro ELIZA 10230010 -> 1000002400020: ELIZA 10230020 -> 1000002400030: ELIZA is a computer program by Joseph Weizenbaum, designed in 1966, which parodied a Rogerian therapist, largely by rephrasing many of the patient's statements as questions and posing them to the patient.
10230030 -> 1000002400040: Thus, for example, the response to "My head hurts" might be "Why do you say your head hurts?" 10230040 -> 1000002400050: The response to "My mother hates me" might be "Who else in your family hates you?" 10230050 -> 1000002400060: ELIZA was named after Eliza Doolittle, a working-class character in George Bernard Shaw's play Pygmalion, who is taught to speak with an upper-class accent. 10230060 -> 1000002400070: Overview 10230070 -> 1000002400080: It is sometimes inaccurately said that ELIZA simulates a therapist. 10230080 -> 1000002400090: Weizenbaum said that ELIZA provided a "parody" of "the responses of a non-directional psychotherapist in an initial psychiatric interview." 10230090 -> 1000002400100: He chose the context of psychotherapy to "sidestep the problem of giving the program a data base of real-world knowledge", the therapeutic situation being one of the few real human situations in which a human being can reply to a statement with a question that indicates very little specific knowledge of the topic under discussion. 10230100 -> 1000002400110: For example, it is a context in which the question "Who is your favorite composer?" can be answered acceptably with responses such as "What about your own favorite composer?" or "Does that question interest you?" 10230110 -> 1000002400120: First implemented in Weizenbaum's own SLIP list-processing language, ELIZA worked by simple parsing and substitution of key words into canned phrases. 10230120 -> 1000002400130: Depending upon the initial entries by the user, the illusion of a human writer could be instantly dispelled, or could continue through several interchanges. 10230130 -> 1000002400140: It was sometimes so convincing that there are many anecdotes about people becoming very emotionally caught up in dealing with ELIZA for several minutes, until the machine's true lack of understanding became apparent. 10230140 -> 1000002400150: This was likely due to people's tendency to attach meanings to words which the computer never put there. 10230150 -> 1000002400160: In 1966, interactive computing (via a teletype) was new. 10230160 -> 1000002400170: It was 15 years before the personal computer became familiar to the general public, and three decades before most people encountered attempts at natural language processing in Internet services like Ask.com or PC help systems such as Microsoft Office Clippy. 10230170 -> 1000002400180: Although those programs included years of research and work (while Ecala eclipsed the functionality of ELIZA after less than two weeks of work by a single programmer), ELIZA remains a milestone simply because it was the first time a programmer had attempted such a human-machine interaction with the goal of creating the illusion (however brief) of human-human interaction. 10230180 -> 1000002400190: In The New Media Reader, edited by Noah Wardrip-Fruin and Nick Montfort, an excerpt from Weizenbaum's 1976 book Computer Power and Human Reason recounts how quickly and deeply people became emotionally involved with the computer program: they took offence when he asked to view the transcripts, saying it was an invasion of their privacy, and even asked him to leave the room while they were working with ELIZA. 10230190 -> 1000002400200: Influence on games 10230200 -> 1000002400210: ELIZA impacted a number of early computer games by demonstrating additional kinds of interface designs.
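The keyword-spotting and substitution mechanism described in the Overview above can be sketched in a few lines of Python; the patterns and canned phrases below are invented for illustration and are far simpler than Weizenbaum's actual script, which ranked keywords and handled many more transformations.

    import random
    import re

    # Swap first- and second-person words so echoed fragments read naturally.
    REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are", "you": "me"}

    def reflect(fragment):
        return " ".join(REFLECTIONS.get(word.lower(), word) for word in fragment.split())

    # Hypothetical rules: a keyword pattern and one or more canned phrases
    # into which the matched fragment is substituted.
    RULES = [
        (re.compile(r"\bmy (\w+) hurts\b", re.I), ["Why do you say your {0} hurts?"]),
        (re.compile(r"\bmy (?:mother|father|sister|brother) (.+)", re.I),
         ["Who else in your family {0}?"]),
        (re.compile(r"\bi am (.+)", re.I), ["How long have you been {0}?"]),
    ]
    DEFAULTS = ["Please go on.", "Does that question interest you?"]

    def respond(statement):
        for pattern, phrases in RULES:
            match = pattern.search(statement)
            if match:
                return random.choice(phrases).format(*(reflect(g) for g in match.groups()))
        return random.choice(DEFAULTS)

    print(respond("My head hurts"))       # Why do you say your head hurts?
    print(respond("My mother hates me"))  # Who else in your family hates you?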
10230210 -> 1000002400220: Don Daglow wrote an enhanced version of the program called Ecala on a PDP-10 mainframe computer at Pomona College in 1973, before writing what was possibly the second or third computer role-playing game, Dungeon (1975). (The first was probably "dnd", written on and for the PLATO system in 1974, and the second may have been Moria, written in 1975.) 10230220 -> 1000002400230: It is likely that ELIZA was also on the system where Will Crowther created Adventure, the 1975 game that spawned the interactive fiction genre. 10230230 -> 1000002400240: But both these games appeared some nine years after the original ELIZA. 10230240 -> 1000002400250: Response and legacy 10230250 -> 1000002400260: Lay responses to ELIZA were disturbing to Weizenbaum and motivated him to write his book Computer Power and Human Reason: From Judgment to Calculation, in which he explains the limits of computers and makes clear his opinion that anthropomorphic views of computers are a reduction of the human being, and of any life form for that matter. 10230260 -> 1000002400270: There are many programs based on ELIZA in different languages in addition to Ecala. 10230270 -> 1000002400280: For example, in 1980, a company called "Don't Ask Software", founded by Randy Simon, created a version for the Apple II, Atari, and Commodore PCs, which verbally abused the user based on the user's input. 10230280 -> 1000002400290: In Spain, Jordi Perez developed the famous ZEBAL in 1993, written in Clipper for MS-DOS. 10230290 -> 1000002400300: Other versions adapted ELIZA around a religious theme, such as ones featuring Jesus (both serious and comedic) and another Apple II variant called I Am Buddha. 10230300 -> 1000002400310: The 1980 game The Prisoner incorporated ELIZA-style interaction within its gameplay. 10230310 -> 1000002400320: ELIZA has also inspired a podcast called "The Eliza Podcast", in which the host engages in self-analysis using a computer-generated voice prompting with questions in the same style as the ELIZA program. 10230320 -> None: Implementations 10230330 -> None: Using JavaScript: http://www.manifestation.com/neurotoys/eliza.php3 10230340 -> None: Source code in Java: http://chayden.net/eliza/Eliza.html 10230350 -> None: Another Java implementation of ELIZA: http://www.wedesoft.demon.co.uk/eliza/ 10230360 -> None: Using C on the TI-89: http://kaikostack.com/ti89_en.htm#eliza 10230370 -> None: Using z80 Assembly on the TI-83 Plus: http://www.ticalc.org/archives/files/fileinfo/354/35463.html 10230380 -> None: A Perl module Chatbot::Eliza — example implementation 10230390 -> None: Trans-Tex Software has released shareware versions for Classic Mac OS and Mac OS X: http://www.tex-edit.com/index.html#Eliza 10230400 -> None: doctor.el (circa 1985) in Emacs. 10230410 -> None: Source code in Tcl: http://wiki.tcl.tk/9235 10230420 -> None: The Indy Delphi-oriented TCP/IP component suite has an Eliza implementation as a demo. 10230430 -> None: Pop-11 Eliza in the Poplog system. 10230440 -> None: Goes back to about 1976, when it was used for teaching AI at Sussex University. 10230450 -> None: Now part of the free open source Poplog system. 10230460 -> None: Source code in BASIC: http://www.atariarchives.org/bigcomputergames/showpage.php?page=22 10230470 -> None: ECC-Eliza for Windows (actual program is for DOS, but the unpacker is for Windows) (rename .txt to .exe before running): http://www5.domaindlx.com/ecceliza1/ecceliza.txt.
10230480 -> None: More recent version at http://web.archive.org/web/20041117123025/http://www5.domaindlx.com/ecceliza1/ecceliza.txt. English language 10240010 -> 1000002500020: English language 10240020 -> 1000002500030: English is an Indo-European, West Germanic language originating in England, and is the first language for most people in the United Kingdom, the United States, Canada, Australia, New Zealand, Ireland, and the Anglophone Caribbean. 10240030 -> 1000002500040: It is used extensively as a second language and as an official language throughout the world, especially in Commonwealth countries and in many international organizations. 10240040 -> 1000002500050: Significance 10240050 -> 1000002500060: Modern English, sometimes described as the first global lingua franca, is the dominant international language in communications, science, business, aviation, entertainment, radio and diplomacy. 10240060 -> 1000002500070: The initial reason for its enormous spread beyond the bounds of the British Isles where it was originally a native tongue was the British Empire, and by the late nineteenth century its influence had won a truly global reach. 10240070 -> 1000002500080: It is the dominant language in the United States and the growing economic and cultural influence of that federal union as a global superpower since World War II has significantly accelerated adoption of English as a language across the planet. 10240080 -> 1000002500090: A working knowledge of English has become a requirement in a number of fields, occupations and professions such as medicine and as a consequence over a billion people speak English to at least a basic level (see English language learning and teaching). 10240090 -> 1000002500100: Linguists such as David Crystal recognize that one impact of this massive growth of English, in common with other global languages, has been to reduce native linguistic diversity in many parts of the world historically, most particularly in Australasia and North America, and its huge influence continues to play an important role in language attrition. 10240100 -> 1000002500110: By a similar token, historical linguists, aware of the complex and fluid dynamics of language change, are always alive to the potential English contains through the vast size and spread of the communities that use it and its natural internal variety, such as in its creoles and pidgins, to produce a new family of distinct languages over time. 10240110 -> 1000002500120: English is one of six official languages of the United Nations. 10240120 -> 1000002500130: History 10240130 -> 1000002500140: English is a West Germanic language that originated from the Anglo-Frisian dialects brought to Britain by Germanic settlers and Roman auxiliary troops from various parts of what is now northwest Germany and the Northern Netherlands. 10240140 -> 1000002500150: Initially, Old English was a diverse group of dialects, reflecting the varied origins of the Anglo-Saxon Kingdoms of England. 10240150 -> 1000002500160: One of these dialects, Late West Saxon, eventually came to dominate. 10240160 -> 1000002500170: The original Old English language was then influenced by two waves of invasion. 10240170 -> 1000002500180: The first was by language speakers of the Scandinavian branch of the Germanic family; they conquered and colonized parts of Britain in the 8th and 9th centuries. 
10240180 -> 1000002500190: The second was the Normans in the 11th century, who spoke Old Norman and ultimately developed an English variety of this called Anglo-Norman. 10240190 -> 1000002500200: These two invasions caused English to become "mixed" to some degree (though it was never a truly mixed language in the strict linguistic sense of the word; mixed languages arise from the cohabitation of speakers of different languages, who develop a hybrid tongue for basic communication). 10240200 -> 1000002500210: Cohabitation with the Scandinavians resulted in a significant grammatical simplification and lexical supplementation of the Anglo-Frisian core of English; the later Norman occupation led to the grafting onto that Germanic core of a more elaborate layer of words from the Italic branch of the European languages. 10240210 -> 1000002500220: This Norman influence entered English largely through the courts and government. 10240220 -> 1000002500230: Thus, English developed into a "borrowing" language of great flexibility and with a huge vocabulary. 10240230 -> 1000002500240: Classification and related languages 10240240 -> 1000002500250: The English language belongs to the western sub-branch of the Germanic branch of the Indo-European family of languages. 10240250 -> 1000002500260: The closest living relative of English is Scots, spoken primarily in Scotland and parts of Northern Ireland, which is viewed by linguists as either a separate language or a group of dialects of English. 10240260 -> 1000002500270: The next closest relative to English after Scots is Frisian, spoken in the Northern Netherlands and Northwest Germany. 10240270 -> 1000002500280: Other less closely related living West Germanic languages include Dutch, Low German, German and Afrikaans. 10240280 -> 1000002500290: The North Germanic languages of Scandinavia are less closely related to English than the West Germanic languages. 10240290 -> 1000002500300: Many French words are also intelligible to an English speaker (though pronunciations are often quite different) because English absorbed a large vocabulary from Norman and French, via Anglo-Norman after the Norman Conquest and directly from French in subsequent centuries. 10240300 -> 1000002500310: As a result, a large portion of English vocabulary is derived from French, with some minor spelling differences (word endings, use of old French spellings, etc.), as well as occasional divergences in meaning, in so-called "faux amis", or false friends. 10240310 -> 1000002500320: The pronunciation of French loanwords in English has become completely anglicized and follows a typically Germanic pattern of stress. 10240320 -> 1000002500330: Geographical distribution 10240330 -> 1000002500340: Approximately 375 million people speak English as their first language. 10240340 -> 1000002500350: English today is probably the third largest language by number of native speakers, after Mandarin Chinese and Spanish. 10240350 -> 1000002500360: However, when combining native and non-native speakers it is probably the most commonly spoken language in the world, though possibly second to a combination of the Chinese languages, depending on whether or not distinctions in the latter are classified as "languages" or "dialects." 10240360 -> 1000002500370: Estimates that include second language speakers vary greatly from 470 million to over a billion depending on how literacy or mastery is defined. 
10240370 -> 1000002500380: There are some who claim that non-native speakers now outnumber native speakers by a ratio of 3 to 1. 10240380 -> 1000002500390: The countries with the highest populations of native English speakers are, in descending order: United States (215 million), United Kingdom (58 million), Canada (18.2 million), Australia (15.5 million), Ireland (3.8 million), South Africa (3.7 million), and New Zealand (3.0-3.7 million). 10240390 -> 1000002500400: Countries such as Jamaica and Nigeria also have millions of native speakers of dialect continua ranging from an English-based creole to a more standard version of English. 10240400 -> 1000002500410: Of those nations where English is spoken as a second language, India has the most such speakers ('Indian English') and linguistics professor David Crystal claims that, combining native and non-native speakers, India now has more people who speak or understand English than any other country in the world. 10240410 -> 1000002500420: Following India is the People's Republic of China. 10240420 -> 1000002500430: Countries in order of total speakers 10240430 -> 1000002500440: English is the primary language in Anguilla, Antigua and Barbuda, Australia (Australian English), the Bahamas, Barbados, Bermuda, Belize (Belizean Kriol), the British Indian Ocean Territory, the British Virgin Islands, Canada (Canadian English), the Cayman Islands, the Falkland Islands, Gibraltar, Grenada, Guam, Guernsey (Channel Island English), Guyana, Ireland (Hiberno-English), Isle of Man (Manx English), Jamaica (Jamaican English), Jersey, Montserrat, Nauru, New Zealand (New Zealand English), Pitcairn Islands, Saint Helena, Saint Kitts and Nevis, Saint Vincent and the Grenadines, Singapore, South Georgia and the South Sandwich Islands, Trinidad and Tobago, the Turks and Caicos Islands, the United Kingdom, the U.S. Virgin Islands, and the United States. 10240440 -> 1000002500450: In many other countries, where English is not the most spoken language, it is an official language; these countries include Botswana, Cameroon, Dominica, Fiji, the Federated States of Micronesia, Ghana, Gambia, India, Kenya, Kiribati, Lesotho, Liberia, Madagascar, Malta, the Marshall Islands, Mauritius, Namibia, Nigeria, Pakistan, Palau, Papua New Guinea, the Philippines, Puerto Rico, Rwanda, the Solomon Islands, Saint Lucia, Samoa, Seychelles, Sierra Leone, Sri Lanka, Swaziland, Tanzania, Uganda, Zambia, and Zimbabwe. 10240450 -> 1000002500460: It is also one of the 11 official languages that are given equal status in South Africa (South African English). 10240460 -> 1000002500470: English is also the official language in current dependent territories of Australia (Norfolk Island, Christmas Island and Cocos Island) and of the United States (Northern Mariana Islands, American Samoa and Puerto Rico), and in the former British colony of Hong Kong. 10240470 -> 1000002500480: English is an important language in several former colonies and protectorates of the United Kingdom but falls short of official status, such as in Malaysia, Brunei, United Arab Emirates and Bahrain. 10240480 -> 1000002500490: English is also not an official language in either the United States or the United Kingdom. 10240490 -> 1000002500500: Although the United States federal government has no official languages, English has been given official status by 30 of the 50 state governments. 
10240500 -> 1000002500510: English as a global language 10240510 -> 1000002500520: Because English is so widely spoken, it has often been referred to as a "world language", the lingua franca of the modern era. 10240520 -> 1000002500530: While English is not an official language in most countries, it is currently the language most often taught as a second language around the world. 10240530 -> 1000002500540: Some linguists believe that it is no longer the exclusive cultural sign of "native English speakers", but is rather a language that is absorbing aspects of cultures worldwide as it continues to grow. 10240540 -> 1000002500550: It is, by international treaty, the official language for aerial and maritime communications. 10240550 -> 1000002500560: English is an official language of the United Nations and many other international organizations, including the International Olympic Committee. 10240560 -> 1000002500570: English is the language most often studied as a foreign language in the European Union (by 89% of schoolchildren), followed by French (32%), German (18%), and Spanish (8%). 10240570 -> 1000002500580: In the EU, a large fraction of the population reports being able to converse to some extent in English. 10240580 -> 1000002500590: Among non-English speaking countries, a large percentage of the population claimed to be able to converse in English in the Netherlands (87%), Sweden (85%), Denmark (83%), Luxembourg (66%), Finland (60%), Slovenia (56%), Austria (53%), Belgium (52%), and Germany (51%). 10240590 -> 1000002500600: Norway and Iceland also have a large majority of competent English-speakers. 10240600 -> 1000002500610: Books, magazines, and newspapers written in English are available in many countries around the world. 10240610 -> 1000002500620: English is also the most commonly used language in the sciences. 10240620 -> 1000002500630: In 1997, the Science Citation Index reported that 95% of its articles were written in English, even though only half of them came from authors in English-speaking countries. 10240630 -> 1000002500640: Dialects and regional varieties 10240640 -> 1000002500650: The expansion of the British Empire and—since WWII—the primacy of the United States have spread English throughout the globe. 10240650 -> 1000002500660: Because of that global spread, English has developed a host of English dialects and English-based creole languages and pidgins. 10240660 -> 1000002500670: The major varieties of English include, in most cases, several subvarieties, such as Cockney within British English; Newfoundland English within Canadian English; and African American Vernacular English ("Ebonics") and Southern American English within American English. 10240670 -> 1000002500680: English is a pluricentric language, without a central language authority like France's Académie française; and, although no variety is clearly considered the only standard, there are a number of accents considered to be more prestigious, such as Received Pronunciation in Britain. 10240680 -> 1000002500690: Scots developed—largely independently—from the same origins, but following the Acts of Union 1707 a process of language attrition began, whereby successive generations adopted more and more features from English causing dialectalisation. 10240690 -> 1000002500700: Whether it is now a separate language or a dialect of English better described as Scottish English is in dispute. 
10240700 -> 1000002500710: The pronunciation, grammar and lexis of the traditional forms differ, sometimes substantially, from other varieties of English. 10240710 -> 1000002500720: Because of the wide use of English as a second language, English speakers have many different accents, which often signal the speaker's native dialect or language. 10240720 -> 1000002500730: For the more distinctive characteristics of regional accents, see Regional accents of English, and for the more distinctive characteristics of regional dialects, see List of dialects of the English language. 10240730 -> 1000002500740: Just as English itself has borrowed words from many different languages over its history, English loanwords now appear in a great many languages around the world, indicative of the technological and cultural influence of its speakers. 10240740 -> 1000002500750: Several pidgins and creole languages have formed using an English base, such as Jamaican Patois, Nigerian Pidgin, and Tok Pisin. 10240750 -> 1000002500760: There are many words in English coined to describe forms of particular non-English languages that contain a very high proportion of English words. 10240760 -> 1000002500770: Franglais, for example, is used to describe French with a very high English word content; it is found on the Channel Islands. 10240770 -> 1000002500780: Another variant, spoken in the border bilingual regions of Québec in Canada, is called Frenglish. 10240780 -> 1000002500790: In Wales, which is part of the United Kingdom, the languages of Welsh and English are sometimes mixed together by fluent or comfortable Welsh speakers, the result of which is called Wenglish. 10240790 -> 1000002500800: Constructed varieties of English 10240800 -> 1000002500810: Basic English is simplified for easy international use. 10240810 -> 1000002500820: It is used by manufacturers and other international businesses to write manuals and communicate. 10240820 -> 1000002500830: Some English schools in Asia teach it as a practical subset of English for use by beginners. 10240830 -> 1000002500840: Special English is a simplified version of English used by the Voice of America. 10240840 -> 1000002500850: It uses a vocabulary of only 1500 words. 10240850 -> 1000002500860: English reform is an attempt to improve collectively upon the English language. 10240860 -> 1000002500870: Seaspeak and the related Airspeak and Policespeak, all based on restricted vocabularies, were designed by Edward Johnson in the 1980s to aid international cooperation and communication in specific areas. 10240870 -> 1000002500880: There is also a tunnelspeak for use in the Channel Tunnel. 10240880 -> 1000002500890: Euro-English is a concept of standardising English for use as a second language in continental Europe. 10240890 -> 1000002500900: Manually Coded English — a variety of systems have been developed to represent the English language with hand signals, designed primarily for use in deaf education. 10240900 -> 1000002500910: These should not be confused with true sign languages such as British Sign Language and American Sign Language used in Anglophone countries, which are independent and not based on English. 10240910 -> 1000002500920: E-Prime excludes forms of the verb to be. 10240920 -> 1000002500930: Euro-English (also EuroEnglish or Euro-English) terms are English translations of European concepts that are not native to English-speaking countries. 
10240930 -> 1000002500940: Because of the United Kingdom's (and even the Republic of Ireland's) involvement in the European Union, the usage focuses on non-British concepts. 10240940 -> 1000002500950: This kind of Euro-English was parodied when English was "made" one of the constituent languages of Europanto. 10240950 -> 1000002500960: Phonology 10240960 -> 1000002500970: Vowels 10240970 -> 1000002500980: Notes: 10240980 -> 1000002500990: It is the vowels that differ most from region to region. 10240990 -> 1000002501000: Where symbols appear in pairs, the first corresponds to American English, General American accent; the second corresponds to British English, Received Pronunciation. 10241000 -> 1000002501010: American English lacks this sound; words with this sound are pronounced with {(IPA+ /ɑ/+ /ɑ/)} or {(IPA+ /ɔ/+ /ɔ/)}. 10241010 -> 1000002501020: See Lot-cloth split. 10241020 -> 1000002501030: Some dialects of North American English do not have this vowel. 10241030 -> 1000002501040: See Cot-caught merger. 10241040 -> 1000002501050: The North American variation of this sound is a rhotic vowel. 10241050 -> 1000002501060: Many speakers of North American English do not distinguish between these two unstressed vowels. 10241060 -> 1000002501070: For them, roses and Rosa's are pronounced the same, and the symbol usually used is schwa {(IPA+ /ə/+ /ə/)}. 10241070 -> 1000002501080: This sound is often transcribed with {(IPA+ /i/+ /i/)} or with {(IPA+ /ɪ/+ /ɪ/)}. 10241080 -> 1000002501090: The diphthongs {(IPA+ /eɪ/+ /eɪ/)} and {(IPA+ /oʊ/+ /oʊ/)} are monophthongal for many General American speakers, as {(IPA+ /eː/+ /eː/)} and {(IPA+ /oː/+ /oː/)}. 10241090 -> 1000002501100: The letter can represent either {(IPA+/u/+/u/)} or the iotated vowel {(IPA+/ju/+/ju/)}. 10241100 -> 1000002501110: In BRP, if this iotated vowel {(IPA+/ju/+/ju/)} occurs after {(IPA+/t/+/t/)}, {(IPA+/d/+/d/)}, {(IPA+/s/+/s/)} or {(IPA+/z/+/z/)}, it often triggers palatalization of the preceding consonant, turning it to {(IPA+/ʨ/+/ʨ/)}, {(IPA+/ʥ/+/ʥ/)}, {(IPA+/ɕ/+/ɕ/)} and {(IPA+/ʑ/+/ʑ/)} respectively, as in tune, during, sugar, and azure. 10241110 -> 1000002501120: In American English, palatalization does not generally happen unless the {(IPA+/ju/+/ju/)} is followed by r, with the result that {(IPA+/(t, d,s, z)jur/+/(t, d,s, z)jur/)} turn to {(IPA+/tʃɚ/+/tʃɚ/)}, {(IPA+/dʒɚ/+/dʒɚ/)}, {(IPA+/ʃɚ/+/ʃɚ/)} and {(IPA+/ʒɚ/+/ʒɚ/)} respectively, as in nature, verdure, sure, and treasure. 10241120 -> 1000002501130: Vowel length plays a phonetic role in the majority of English dialects, and is said to be phonemic in a few dialects, such as Australian English and New Zealand English. 10241130 -> 1000002501140: In certain dialects of the modern English language, for instance General American, there is allophonic vowel length: vowel phonemes are realized as long vowel allophones before voiced consonant phonemes in the coda of a syllable. 10241140 -> 1000002501150: Before the Great Vowel Shift, vowel length was phonemically contrastive. 10241150 -> 1000002501160: This sound only occurs in non-rhotic accents. 10241160 -> 1000002501170: In some accents, this sound may be, instead of {(IPA+/ʊə/+/ʊə/)}, {(IPA+/ɔ:/+/ɔ:/)}. 10241170 -> 1000002501180: See English-language vowel changes before historic r. 10241180 -> 1000002501190: This sound only occurs in non-rhotic accents. 10241190 -> 1000002501200: In some accents, the schwa offglide of {(IPA+/ɛə/+/ɛə/)} may be dropped, monophthising and lengthening the sound to {(IPA+/ɛ:/+/ɛ:/)}. 
10241200 -> 1000002501210: See also IPA chart for English dialects for more vowel charts. 10241210 -> 1000002501220: Consonants 10241220 -> 1000002501230: This is the English consonantal system using symbols from the International Phonetic Alphabet (IPA). 10241230 -> 1000002501240: The velar nasal {(IPA+ [ŋ]+ [ŋ])} is a non-phonemic allophone of /n/ in some northerly British accents, appearing only before /k/ and /g/. 10241240 -> 1000002501250: In all other dialects it is a separate phoneme, although it only occurs in syllable codas. 10241250 -> 1000002501260: The alveolar tap {(IPA+ [ɾ]+ [ɾ])} is an allophone of /t/ and /d/ in unstressed syllables in North American English and Australian English. 10241260 -> 1000002501270: This is the sound of tt or dd in the words latter and ladder, which are homophones for many speakers of North American English. 10241270 -> 1000002501280: In some accents such as Scottish English and Indian English it replaces {(IPA+/ɹ/+/ɹ/)}. 10241280 -> 1000002501290: This is the same sound represented by single r in most varieties of Spanish. 10241290 -> 1000002501300: In some dialects, such as Cockney, the interdentals /θ/ and /ð/ are usually merged with /f/ and /v/, and in others, like African American Vernacular English, /ð/ is merged with dental /d/. 10241300 -> 1000002501310: In some Irish varieties, /θ/ and /ð/ become the corresponding dental plosives, which then contrast with the usual alveolar plosives. 10241310 -> 1000002501320: The sounds {(IPA+ /ʃ/, /ʒ/, and /ɹ/+ /ʃ/, /ʒ/, and /ɹ/)} are labialised in some dialects. 10241320 -> 1000002501330: Labialisation is never contrastive in initial position and therefore is sometimes not transcribed. 10241330 -> 1000002501340: Most speakers of General American realize /r/ (always rhoticized) as the retroflex approximant {(IPA+/ɻ/+/ɻ/)}, whereas in Scottish English and some other varieties it is realized as the alveolar trill. 10241340 -> 1000002501350: The voiceless palatal fricative /ç/ is in most accents just an allophone of /h/ before /j/; for instance human /çjuːmən/. 10241350 -> 1000002501360: However, in some accents the /j/ is dropped, but the initial consonant is the same. 10241360 -> 1000002501370: The voiceless velar fricative /x/ is used by Scottish or Welsh speakers of English for Scots/Gaelic words such as loch {(IPA+ /lɒx/+ /lɒx/)} or by some speakers for loanwords from German and Hebrew like Bach {(IPA+/bax/+/bax/)} or Chanukah /xanuka/. /x/ is also used in South African English. 10241370 -> 1000002501380: In some dialects such as Scouse (Liverpool) either {(IPA+[x]+[x])} or the affricate {(IPA+[kx]+[kx])} may be used as an allophone of /k/ in words such as docker {(IPA+ [dɒkxə]+ [dɒkxə])}. 10241380 -> 1000002501390: Most native English speakers have a great deal of trouble pronouncing /x/ correctly when learning a foreign language. 10241390 -> 1000002501400: Most speakers use the sounds [k] and [h] instead. 10241400 -> 1000002501410: Voiceless w {(IPA+ [ʍ]+ [ʍ])} is found in Scottish and Irish English, as well as in some varieties of American, New Zealand, and English English. 10241410 -> 1000002501420: In most other dialects it is merged with /w/; in some dialects of Scots it is merged with /f/.
10241420 -> 1000002501430: Voicing and aspiration 10241430 -> 1000002501440: Voicing and aspiration of stop consonants in English depend on dialect and context, but a few general rules can be given: 10241440 -> 1000002501450: Voiceless plosives and affricates (/{(IPA+ p+ p)}/, /{(IPA+ t+ t)}/, /{(IPA+ k+ k)}/, and /{(IPA+ tʃ+ tʃ)}/) are aspirated when they are word-initial or begin a stressed syllable — compare pin {(IPA+ [pʰɪn]+ [pʰɪn])} and spin {(IPA+ [spɪn]+ [spɪn])}, crap {(IPA+ [kʰɹ̥æp]+ [kʰɹ̥æp])} and scrap {(IPA+ [skɹæp]+ [skɹæp])}. 10241450 -> 1000002501460: In some dialects, aspiration extends to unstressed syllables as well. 10241460 -> 1000002501470: In other dialects, such as Indian English, all voiceless stops remain unaspirated. 10241470 -> 1000002501480: Word-initial voiced plosives may be devoiced in some dialects. 10241480 -> 1000002501490: Word-terminal voiceless plosives may be unreleased or accompanied by a glottal stop in some dialects (e.g. many varieties of American English) — examples: tap [{(IPA+tʰæp̚+tʰæp̚)}], sack [{(IPA+sæk̚+sæk̚)}]. 10241490 -> 1000002501500: Word-terminal voiced plosives may be devoiced in some dialects (e.g. some varieties of American English) — examples: sad [{(IPA+sæd̥+sæd̥)}], bag [{(IPA+bæɡ̊+bæɡ̊)}]. 10241500 -> 1000002501510: In other dialects they are fully voiced in final position, but only partially voiced in initial position. 10241510 -> 1000002501520: Supra-segmental features 10241520 -> 1000002501530: Tone groups 10241530 -> 1000002501540: English is an intonation language. This means that the pitch of the voice is used syntactically, for example, to convey surprise and irony, or to change a statement into a question. 10241540 -> 1000002501550: In English, intonation patterns are on groups of words, which are called tone groups, tone units, intonation groups or sense groups. 10241550 -> 1000002501560: Tone groups are said on a single breath and, as a consequence, are of limited length, on average about five words long and lasting roughly two seconds. 10241560 -> 1000002501570: For example: 10241570 -> 1000002501580: -{(IPA+ /duː juː niːd ˈɛnɪˌθɪŋ/+ /duː juː niːd ˈɛnɪˌθɪŋ/)} Do you need anything? 10241580 -> 1000002501590: -{(IPA+ /aɪ dəʊnt | nəʊ/+ /aɪ dəʊnt | nəʊ/)} I don't, no 10241590 -> 1000002501600: -{(IPA+ /aɪ dəʊnt nəʊ/+ /aɪ dəʊnt nəʊ/)} I don't know (contracted to, for example, -{(IPA+ /aɪ dəʊnəʊ/+ /aɪ dəʊnəʊ/)} or {(IPA+ /aɪ dənəʊ/+ /aɪ dənəʊ/)} I dunno in fast or colloquial speech that de-emphasises the pause between don't and know even further) 10241600 -> 1000002501610: Characteristics of intonation 10241610 -> 1000002501620: English is a strongly stressed language, in that certain syllables, both within words and within phrases, get a relative prominence/loudness during pronunciation while the others do not. 10241620 -> 1000002501630: The former kind of syllables are said to be accentuated/stressed and the latter are unaccentuated/unstressed. 10241630 -> 1000002501640: All good dictionaries of English mark the accentuated syllable(s) by placing an apostrophe-like ( {(IPA+ ˈ+ ˈ)} ) sign either before (as in IPA, Oxford English Dictionary, or Merriam-Webster dictionaries) or after (as in many other dictionaries) the syllable where the stress accent falls. 10241640 -> 1000002501650: Hence in a sentence, each tone group can be subdivided into syllables, which can either be stressed (strong) or unstressed (weak). 10241650 -> 1000002501660: The most prominent stressed syllable is called the nuclear syllable.
10241660 -> 1000002501670: For example: 10241670 -> 1000002501680: That | was | the | best | thing | you | could | have | done! 10241680 -> 1000002501690: Here, all syllables are unstressed, except the syllables/words best and done, which are stressed. 10241690 -> 1000002501700: Best is stressed harder and, therefore, is the nuclear syllable. 10241700 -> 1000002501710: The nuclear syllable carries the main point the speaker wishes to make. 10241710 -> 1000002501720: For example (the stressed word is marked with asterisks): 10241720 -> 1000002501730: *John* had not stolen that money. (... 10241730 -> 1000002501740: Someone else had.) 10241740 -> 1000002501750: John *had not* stolen that money. (... 10241750 -> 1000002501760: Someone said he had. or ... 10241760 -> 1000002501770: Not at that time, but later he did.) 10241770 -> 1000002501780: John had not *stolen* that money. (... 10241780 -> 1000002501790: He acquired the money by some other means.) 10241790 -> 1000002501800: John had not stolen *that* money. (... 10241800 -> 1000002501810: He had stolen some other money.) 10241810 -> 1000002501820: John had not stolen that *money*. (... 10241820 -> 1000002501830: He had stolen something else.) 10241830 -> 1000002501840: Also 10241840 -> 1000002501850: *I* did not tell her that. (... 10241850 -> 1000002501860: Someone else told her) 10241860 -> 1000002501870: I *did not* tell her that. (... 10241870 -> 1000002501880: You said I did. or ... but now I will) 10241880 -> 1000002501890: I did not *tell* her that. (... 10241890 -> 1000002501900: I did not say it; she could have inferred it, etc.) 10241900 -> 1000002501910: I did not tell *her* that. (... 10241910 -> 1000002501920: I told someone else) 10241920 -> 1000002501930: I did not tell her *that*. (... 10241930 -> 1000002501940: I told her something else) 10241940 -> 1000002501950: This can also be used to express emotion: 10241950 -> 1000002501960: Oh really? (...I did not know that) 10241960 -> 1000002501970: Oh really? (...I disbelieve you. or ... 10241970 -> 1000002501980: That's blatantly obvious) 10241980 -> 1000002501990: The nuclear syllable is spoken more loudly than the others and has a characteristic change of pitch. 10241990 -> 1000002502000: The changes of pitch most commonly encountered in English are the rising pitch and the falling pitch, although the fall-rising pitch and/or the rise-falling pitch are sometimes used. 10242000 -> 1000002502010: In this opposition between falling and rising pitch, which plays a larger role in English than in most other languages, falling pitch conveys certainty and rising pitch uncertainty. 10242010 -> 1000002502020: This can have a crucial impact on meaning, specifically in relation to polarity, the positive–negative opposition; thus, falling pitch means "polarity known", while rising pitch means "polarity unknown". 10242020 -> 1000002502030: This underlies the rising pitch of yes/no questions. 10242030 -> 1000002502040: For example: 10242040 -> 1000002502050: When do you want to be paid? 10242050 -> 1000002502060: Now? 10242060 -> 1000002502070: (Rising pitch. 10242070 -> 1000002502080: In this case, it denotes a question: "Can I be paid now?" or "Do you desire to pay now?") 10242080 -> 1000002502090: Now. 10242090 -> 1000002502100: (Falling pitch. 10242100 -> 1000002502110: In this case, it denotes a statement: "I choose to be paid now.") 10242110 -> 1000002502120: Grammar 10242120 -> 1000002502130: English grammar has minimal inflection compared with most other Indo-European languages.
10242130 -> 1000002502140: For example, Modern English, unlike Modern German or Dutch and the Romance languages, lacks grammatical gender and adjectival agreement. 10242140 -> 1000002502150: Case marking has almost disappeared from the language and mainly survives in pronouns. 10242150 -> 1000002502160: The patterning of strong (e.g. speak/spoke/spoken) versus weak verbs inherited from its Germanic origins has declined in importance in modern English, and the remnants of inflection (such as plural marking) have become more regular. 10242160 -> 1000002502170: At the same time, the language has become more analytic, and has developed features such as modal verbs and word order as resources for conveying meaning. 10242170 -> 1000002502180: Auxiliary verbs mark constructions such as questions, negative polarity, the passive voice and progressive aspect. 10242180 -> 1000002502190: Vocabulary 10242190 -> 1000002502200: The English vocabulary has changed considerably over the centuries. 10242200 -> 1000002502210: As in many other languages deriving from Proto-Indo-European (PIE), many of the most common words in English can trace their origin (through the Germanic branch) back to PIE. 10242210 -> 1000002502220: Such words include the basic pronouns I, from Old English ic (cf. Latin ego, Greek ego, Sanskrit aham), me (cf. Latin me, Greek eme, Sanskrit mam), numbers (e.g. one, two, three, cf. Latin unus, duo, tres, Greek oinos "ace (on dice)", duo, treis), common family relationships such as mother, father, brother, sister, etc. (cf. Greek "meter", Latin "mater", Sanskrit "matṛ"; mother), names of many animals (cf. Sanskrit mus, Greek mys, Latin mus; mouse), and many common verbs (cf. Greek gignōmi, Latin gnoscere, Hittite kanes; to know). 10242220 -> 1000002502230: Germanic words (generally words of Old English or to a lesser extent Norse origin) tend to be shorter than the Latinate words of English, and more common in ordinary speech. 10242230 -> 1000002502240: This includes nearly all the basic pronouns, prepositions, conjunctions, modal verbs etc. that form the basis of English syntax and grammar. 10242240 -> 1000002502250: The longer Latinate words are often regarded as more elegant or educated. 10242250 -> 1000002502260: However, the excessive use of Latinate words is considered at times to be either pretentious or an attempt to obfuscate an issue. 10242260 -> 1000002502270: George Orwell's essay "Politics and the English Language" is critical of this, as well as other perceived misuse of the language. 10242270 -> 1000002502280: An English speaker is in many cases able to choose between Germanic and Latinate synonyms: come or arrive; sight or vision; freedom or liberty. 10242280 -> 1000002502290: In some cases there is a choice between a Germanic-derived word (oversee), a Latin-derived word (supervise), and a French word derived from the same Latin word (survey). 10242290 -> 1000002502300: Such synonyms harbor a variety of different meanings and nuances, enabling the speaker to express fine variations or shades of thought. 10242300 -> 1000002502310: Familiarity with the etymology of groups of synonyms can give English speakers greater control over their linguistic register. 10242310 -> 1000002502320: See: List of Germanic and Latinate equivalents in English.
10242320 -> 1000002502330: An exception to this and a peculiarity perhaps unique to English is that the nouns for meats are commonly different from, and unrelated to, those for the animals from which they are produced, the animal commonly having a Germanic name and the meat having a French-derived one. 10242330 -> 1000002502340: Examples include: deer and venison; cow and beef; swine/pig and pork, or sheep and mutton. 10242340 -> 1000002502350: This is assumed to be a result of the aftermath of the Norman invasion, where a French-speaking elite were the consumers of the meat, produced by Anglo-Saxon lower classes. 10242350 -> 1000002502360: Since the majority of words used in informal settings will normally be Germanic, such words are often the preferred choices when a speaker wishes to make a point in an argument in a very direct way. 10242360 -> 1000002502370: A majority of Latinate words (or at least a majority of content words) will normally be used in more formal speech and writing, such as a courtroom or an encyclopedia article. 10242370 -> 1000002502380: However, there are other Latinate words that are used normally in everyday speech and do not sound formal; these are mainly words for concepts that no longer have Germanic words, and are generally assimilated better and in many cases do not appear Latinate. 10242380 -> 1000002502390: For instance, the words mountain, valley, river, aunt, uncle, move, use, push and stay are all Latinate. 10242390 -> 1000002502400: English easily accepts technical terms into common usage and often imports new words and phrases. 10242400 -> 1000002502410: Examples of this phenomenon include: cookie, Internet and URL (technical terms), as well as genre, über, lingua franca and amigo (imported words/phrases from French, German, modern Latin, and Spanish, respectively). 10242410 -> 1000002502420: In addition, slang often provides new meanings for old words and phrases. 10242420 -> 1000002502430: In fact, this fluidity is so pronounced that a distinction often needs to be made between formal forms of English and contemporary usage. 10242430 -> 1000002502440: See also: sociolinguistics. 10242440 -> 1000002502450: Number of words in English 10242450 -> 1000002502460: The General Explanations at the beginning of the Oxford English Dictionary states: 10242460 -> 1000002502470: {(Cquote+ +The Vocabulary of a widely diffused and highly cultivated living language is not a fixed quantity circumscribed by definite limits... there is absolutely no defining line in any direction: the circle of the English language has a well-defined centre but no discernible circumference.)} 10242465 -> 1000002502480: The vocabulary of English is undoubtedly vast, but assigning a specific number to its size is more a matter of definition than of calculation. 10242470 -> 1000002502490: Unlike other languages, such as French, German, Spanish and Italian there is no Academy to define officially accepted words and spellings. 10242480 -> 1000002502500: Neologisms are coined regularly in medicine, science and technology and other fields, and new slang is constantly developed. 10242490 -> 1000002502510: Some of these new words enter wide usage; others remain restricted to small circles. 10242500 -> 1000002502520: Foreign words used in immigrant communities often make their way into wider English usage. 10242510 -> 1000002502530: Archaic, dialectal, and regional words might or might not be widely considered as "English". 
10242520 -> 1000002502540: The Oxford English Dictionary, 2nd edition (OED2) includes over 600,000 definitions, following a rather inclusive policy: 10242525 -> 1000002502550: {(Cquote+ +It embraces not only the standard language of literature and conversation, whether current at the moment, or obsolete, or archaic, but also the main technical vocabulary, and a large measure of dialectal usage and slang (Supplement to the OED, 1933). )} 10242530 -> 1000002502560: The editors of Webster's Third New International Dictionary, Unabridged (475,000 main headwords), in their preface, estimate the number to be much higher. 10242540 -> 1000002502570: It is estimated that about 25,000 words are added to the language each year. 10242550 -> 1000002502580: Word origins 10242560 -> 1000002502590: One of the consequences of the French influence is that the vocabulary of English is, to a certain extent, divided between those words which are Germanic (mostly West Germanic, with a smaller influence from the North Germanic branch) and those which are "Latinate" (Latin-derived, either directly or from Norman French or other Romance languages). 10242570 -> 1000002502600: Numerous sets of statistics have been proposed to demonstrate the origins of English vocabulary. 10242580 -> 1000002502610: None, as yet, is considered definitive by most linguists. 10242590 -> 1000002502620: A computerised survey of about 80,000 words in the old Shorter Oxford Dictionary (3rd ed.), published in Ordered Profusion by Thomas Finkenstaedt and Dieter Wolff (1973), estimated the origin of English words as follows: 10242600 -> 1000002502630: Langue d'oïl, including French and Old Norman: 28.3% 10242610 -> 1000002502640: Latin, including modern scientific and technical Latin: 28.24% 10242620 -> 1000002502650: Other Germanic languages (including words directly inherited from Old English): 25% 10242630 -> 1000002502660: Greek: 5.32% 10242640 -> 1000002502670: No etymology given: 4.03% 10242650 -> 1000002502680: Derived from proper names: 3.28% 10242660 -> 1000002502690: All other languages contributed less than 1% 10242670 -> 1000002502700: A survey by Joseph M. Williams in Origins of the English Language of 10,000 words taken from several thousand business letters gave this set of statistics: 10242680 -> 1000002502710: French (langue d'oïl): 41% 10242690 -> 1000002502720: "Native" English: 33% 10242700 -> 1000002502730: Latin: 15% 10242710 -> 1000002502740: Danish: 2% 10242720 -> 1000002502750: Dutch: 1% 10242730 -> 1000002502760: Other: 10% 10242740 -> 1000002502770: However, 83% of the 1,000 most common English words, and all of the 100 most common, are Germanic. 10242750 -> 1000002502780: Dutch origins 10242760 -> 1000002502790: Words describing the navy, types of ships, and other objects or activities on the water are often of Dutch origin. 10242770 -> 1000002502800: Yacht (jacht) and cruiser (kruiser) are examples. 10242780 -> 1000002502810: French origins 10242790 -> 1000002502820: There are many words of French origin in English, such as competition, art, table, publicity, police, role, routine, machine, force, and many others that have been and are being anglicised; they are now pronounced according to English rules of phonology, rather than French. 10242800 -> 1000002502830: A large portion of English vocabulary is of French or Langues d'oïl origin, most derived from, or transmitted via, the Anglo-Norman spoken by the upper classes in England for several hundred years after the Norman conquest of England.
10242810 -> 1000002502840: Writing system 10242820 -> 1000002502850: English has been written using the Latin alphabet since around the ninth century. 10242830 -> 1000002502860: (Before that, Old English had been written using Anglo-Saxon runes.) 10242840 -> 1000002502870: The spelling system, or orthography, is multilayered, with elements of French, Latin and Greek spelling on top of the native Germanic system; it has grown to vary significantly from the phonology of the language. 10242850 -> 1000002502880: The spelling of words often diverges considerably from how they are spoken. 10242860 -> 1000002502890: Though letters and sounds may not correspond in isolation, spelling rules that take into account syllable structure, phonetics, and accents are 75% or more reliable. 10242870 -> 1000002502900: Some phonics spelling advocates claim that English is more than 80% phonetic. 10242880 -> 1000002502910: In general, the English language, being the product of many other languages and having only been codified orthographically in the 16th century, has fewer consistent relationships between sounds and letters than many other languages. 10242890 -> 1000002502920: The consequence of this orthographic history is that reading can be challenging. 10242900 -> 1000002502930: It takes longer for students to become completely fluent readers of English than of many other languages, including French, Greek, and Spanish. 10242910 -> 1000002502940: Basic sound-letter correspondence 10242920 -> 1000002502950: Only the consonant letters are pronounced in a relatively regular way: 10242930 -> 1000002502960: Written accents 10242940 -> 1000002502970: Unlike most other Germanic languages, English has almost no diacritics except in foreign loanwords (like the acute accent in café), and in the uncommon use of a diaeresis mark (often in formal writing) to indicate that two vowels are pronounced separately, rather than as one sound (e.g. naïve, Zoë). 10242950 -> 1000002502980: It is almost always acceptable to leave out the marks, especially in digital communications where the QWERTY keyboard lacks any marked letters, but it depends on the context where the word is used. 10242960 -> 1000002502990: Some English words retain the diacritic to distinguish them from others, such as animé, exposé, lamé, öre, øre, pâté, piqué, and rosé, though these are sometimes also dropped (résumé/resumé is usually spelled resume in the United States). 10242970 -> 1000002503000: There are loan words which occasionally use a diacritic to represent their pronunciation that is not in the original word, such as maté, from Spanish yerba mate, following the French usage, but they are extremely rare. 10242980 -> 1000002503010: Formal written English 10242990 -> 1000002503020: A version of the language almost universally agreed upon by educated English speakers around the world is called formal written English. 10243000 -> 1000002503030: It takes virtually the same form no matter where in the English-speaking world it is written. 10243010 -> 1000002503040: In spoken English, by contrast, there are a vast number of differences between dialects, accents, and varieties of slang, colloquial and regional expressions. 10243020 -> 1000002503050: In spite of this, local variations in the formal written version of the language are quite limited, being restricted largely to the spelling differences between British and American English. 
10243030 -> 1000002503060: Basic and simplified versions 10243040 -> 1000002503070: To make English easier to read, there are some simplified versions of the language. 10243050 -> 1000002503080: One basic version is named Basic English, a constructed language with a small number of words created by Charles Kay Ogden and described in his book Basic English: A General Introduction with Rules and Grammar (1930). 10243060 -> 1000002503090: The language is based on a simplified version of English. 10243070 -> 1000002503100: Ogden said that it would take seven years to learn English, seven months for Esperanto, and seven weeks for Basic English, comparable with Ido. 10243080 -> 1000002503110: Thus Basic English is used by companies who need to make complex books for international use, and by language schools that need to give people some knowledge of English in a short time. 10243090 -> 1000002503120: Ogden did not put any words into Basic English that could be said with a few other words and he worked to make the words work for speakers of any other language. 10243100 -> 1000002503130: He put his set of words through a large number of tests and adjustments. 10243110 -> 1000002503140: He also made the grammar simpler, but tried to keep the grammar normal for English users. 10243120 -> 1000002503150: The concept gained its greatest publicity just after the Second World War as a tool for world peace. 10243130 -> 1000002503160: Although it was not built into a program, similar simplifications were devised for various international uses. 10243140 -> 1000002503170: Another version, Simplified English, exists, which is a controlled language originally developed for aerospace industry maintenance manuals. 10243150 -> 1000002503180: It offers a carefully limited and standardised subset of English. 10243160 -> 1000002503190: Simplified English has a lexicon of approved words and those words can only be used in certain ways. 10243170 -> 1000002503200: For example, the word close can be used in the phrase "Close the door" but not "do not go close to the landing gear". Esperanto 10250010 -> 1000002600020: Esperanto 10250020 -> 1000002600030: is by far the most widely spoken constructed international auxiliary language in the world. 10250030 -> 1000002600040: Its name derives from Doktoro Esperanto, the pseudonym under which L. L. Zamenhof published the first book detailing Esperanto, the Unua Libro, in 1887. 10250040 -> 1000002600050: The word esperanto means 'one who hopes' in the language itself. 10250050 -> 1000002600060: Zamenhof's goal was to create an easy and flexible language that would serve as a universal second language to foster peace and international understanding. 10250060 -> 1000002600070: Esperanto has had continuous usage by a community estimated at between 100,000 and 2 million speakers for over a century. 10250070 -> 1000002600080: By most estimates, there are approximately one thousand native speakers. 10250080 -> 1000002600090: However, no country has adopted the language officially. 10250090 -> 1000002600100: Today, Esperanto is employed in world travel, correspondence, cultural exchange, conventions, literature, language instruction, television, and radio broadcasting. 10250100 -> 1000002600110: Also, there is an Esperanto Wikipedia that contains over 100,000 articles as of June 2008. 10250110 -> 1000002600120: There is evidence that learning Esperanto may provide a good foundation for learning languages in general. 
10250120 -> 1000002600130: Some state education systems offer basic instruction and elective courses in Esperanto. 10250130 -> 1000002600140: Esperanto is also the language of instruction in one university, the Akademio Internacia de la Sciencoj in San Marino. 10250140 -> 1000002600150: History 10250150 -> 1000002600160: Esperanto was developed in the late 1870s and early 1880s by ophthalmologist Dr. Ludovic Lazarus Zamenhof, an Ashkenazi Jew from Bialystok, now in Poland and previously in the Polish-Lithuanian Commonwealth, but at the time part of the Russian Empire. 10250160 -> 1000002600170: After some ten years of development, which Zamenhof spent translating literature into the language as well as writing original prose and verse, the first book of Esperanto grammar was published in Warsaw in July 1887. 10250170 -> 1000002600180: The number of speakers grew rapidly over the next few decades, at first primarily in the Russian empire and Eastern Europe, then in Western Europe, the Americas, China, and Japan. 10250180 -> 1000002600190: In the early years, speakers of Esperanto kept in contact primarily through correspondence and periodicals, but in 1905 the first world congress of Esperanto speakers was held in Boulogne-sur-Mer, France. 10250190 -> 1000002600200: Since then world congresses have been held in different countries every year, except during the two World Wars. 10250200 -> 1000002600210: Since the Second World War, they have been attended by an average of over 2000 and up to 6000 people. 10250210 -> 1000002600220: Relation to 20th-century totalitarianism 10250220 -> 1000002600230: As a potential vehicle for international understanding, Esperanto attracted the suspicion of many totalitarian states. 10250230 -> 1000002600240: The situation was especially pronounced in Nazi Germany and in the Soviet Union under Joseph Stalin. 10250240 -> 1000002600250: In Germany, there was additional motivation to persecute Esperanto because Zamenhof was a Jew. 10250250 -> 1000002600260: In his work Mein Kampf, Hitler mentioned Esperanto as an example of a language that would be used by an International Jewish Conspiracy once they achieved world domination. 10250260 -> 1000002600270: Esperantists were executed during the Holocaust, with Zamenhof's family in particular singled out for execution. 10250270 -> 1000002600280: In the early years of the Soviet Union, Esperanto was given a measure of government support, and an officially recognized Soviet Esperanto Association came into being. 10250280 -> 1000002600290: However, in 1937, Stalin reversed this policy. 10250290 -> 1000002600300: He denounced Esperanto as "the language of spies" and had Esperantists executed. 10250300 -> 1000002600310: The use of Esperanto remained illegal until 1956. 10250310 -> 1000002600320: Official use 10250320 -> 1000002600330: Esperanto has never been an official language of any recognized country. 10250330 -> 1000002600340: However, there were plans at the beginning of the 20th century to establish Neutral Moresnet as the world's first Esperanto state. 10250340 -> 1000002600350: In China, there was talk in some circles after the 1911 Xinhai Revolution about officially replacing Chinese with Esperanto as a means to dramatically bring the country into the twentieth century, though this policy proved untenable. 
10250350 -> 1000002600360: In the summer of 1924, the American Radio Relay League adopted Esperanto as its official international auxiliary language, and hoped that the language would be used by radio amateurs in international communications, but its actual use for radio communications was negligible. 10250360 -> 1000002600370: In addition, the self-proclaimed artificial island micronation of Rose Island used Esperanto as its official language in 1968. 10250370 -> 1000002600380: Esperanto is the working language of several non-profit international organizations such as the Sennacieca Asocio Tutmonda, but most others are specifically Esperanto organizations. 10250380 -> 1000002600390: The largest of these, the World Esperanto Association, has an official consultative relationship with the United Nations and UNESCO. 10250390 -> 1000002600400: The U.S. Army has published military phrasebooks in Esperanto, to be used in wargames by mock enemy forces. 10250400 -> 1000002600410: Esperanto is also the first language of teaching and administration of the International Academy of Sciences San Marino, which is sometimes called an "Esperanto University". 10250410 -> 1000002600420: Linguistic properties 10250420 -> 1000002600430: Classification 10250430 -> 1000002600440: As a constructed language, Esperanto is not genealogically related to any ethnic language. 10250440 -> 1000002600450: It has been described as "a language lexically predominantly Romanic, morphologically intensively agglutinative and to a certain degree isolating in character". 10250450 -> 1000002600460: The phonology, grammar, vocabulary, and semantics are based on the western Indo-European languages. 10250460 -> 1000002600470: The phonemic inventory is essentially Slavic, as is much of the semantics, while the vocabulary derives primarily from the Romance languages, with a lesser contribution from the Germanic languages. 10250470 -> 1000002600480: Pragmatics and other aspects of the language not specified by Zamenhof's original documents were influenced by the native languages of early speakers, primarily Russian, Polish, German, and French. 10250480 -> 1000002600490: Typologically, Esperanto has prepositions and a pragmatic word order that by default is Subject Verb Object and Adjective Noun. 10250490 -> 1000002600500: New words are formed through extensive prefixing and suffixing. 10250500 -> 1000002600510: Writing system 10250510 -> 1000002600520: Esperanto is written with a modified version of the Latin alphabet, including six letters with diacritics: ĉ, ĝ, ĥ, ĵ, ŝ and ŭ (that is, c, g, h, j, s circumflex, and u breve). 10250520 -> 1000002600530: The alphabet does not include the letters q, w, x, or y except in unassimilated foreign names. 10250530 -> 1000002600540: The 28-letter alphabet is: a b c ĉ d e f g ĝ h ĥ i j ĵ k l m n o p r s ŝ t u ŭ v z 10250540 -> 1000002600550: All letters are pronounced approximately as in the IPA, with the exception of c and the accented letters: 10250550 -> 1000002600560: Two ASCII-compatible writing conventions are in use. 10250560 -> 1000002600570: These substitute digraphs for the accented letters. 10250570 -> 1000002600580: The original "h-convention" (ch, gh, hh, jh, sh, u) is based on English 'ch' and 'sh', while a more recent "x-convention" (cx, gx, hx, jx, sx, ux) is useful for alphabetic word sorting on a computer (cx comes correctly after cu, sx after sv, etc.) as well as for simple conversion back into the standard orthography. 
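Conversion between these ASCII conventions and the standard orthography is purely mechanical, so it is easy to automate. The following is a minimal sketch in Python (the function names and the tiny demo are illustrative, not part of any standard tool); it maps the x-convention digraphs cx, gx, hx, jx, sx, ux to ĉ, ĝ, ĥ, ĵ, ŝ, ŭ and back.

# Minimal sketch: converting Esperanto x-convention text to the standard
# orthography and back.  Function names are illustrative only.
X_TO_UNICODE = {
    "cx": "ĉ", "gx": "ĝ", "hx": "ĥ", "jx": "ĵ", "sx": "ŝ", "ux": "ŭ",
    "Cx": "Ĉ", "Gx": "Ĝ", "Hx": "Ĥ", "Jx": "Ĵ", "Sx": "Ŝ", "Ux": "Ŭ",
}  # (an all-caps digraph such as "CX" would need one more rule; omitted for brevity)
UNICODE_TO_X = {letter: digraph for digraph, letter in X_TO_UNICODE.items()}

def x_to_unicode(text):
    """Replace x-convention digraphs with the corresponding accented letters."""
    for digraph, letter in X_TO_UNICODE.items():
        text = text.replace(digraph, letter)
    return text

def unicode_to_x(text):
    """Replace accented letters with x-convention digraphs."""
    for letter, digraph in UNICODE_TO_X.items():
        text = text.replace(letter, digraph)
    return text

print(x_to_unicode("Cxu vi parolas Esperanton?"))  # Ĉu vi parolas Esperanton?
print(unicode_to_x("ĉiuĵaŭde"))                    # cxiujxauxde

Because x is not a letter of the Esperanto alphabet, the digraphs are essentially unambiguous, which is why the round trip above is so simple; plain byte-wise sorting of x-convention text also preserves dictionary order, as noted above (cx after cu, sx after sv).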
10250580 -> 1000002600590: Another scheme represents the accented letters with a caret (^), as for example: c^ or ^c. 10250590 -> 1000002600600: Phonology 10250600 -> 1000002600610: (For help with the phonetic symbols, see Help:IPA) 10250610 -> 1000002600620: Esperanto has 22 consonants, 5 vowels, and two semivowels, which combine with the vowels to form 6 diphthongs. 10250620 -> 1000002600630: (The consonant {(IPA+/j/+/j/)} and semivowel {(IPA+/i̯/+/i̯/)} are both written j.) 10250625 -> 1000002600640: Tone is not used to distinguish meanings of words. 10250630 -> 1000002600650: Stress is always on the penultimate vowel, unless a final vowel o is elided, a practice which occurs mostly in poetry. 10250640 -> 1000002600660: For example, familio "family" is stressed {(IPA-all+IPA: [fa.mi.ˈli.o]+fa.mi.ˈli.o)}, but when found without the final o, famili’, the stress does not shift: {(IPA+[fa.mi.ˈli]+[fa.mi.ˈli])}. 10250650 -> 1000002600670: Consonants 10250660 -> 1000002600680: The 22 consonants are: 10250670 -> 1000002600690: The sound {(IPA+/r/+/r/)} is usually rolled, but may be tapped {(IPA+[ɾ]+[ɾ])}. 10250680 -> 1000002600700: The {(IPA+/v/+/v/)} has a normative pronunciation like an English v, but is sometimes somewhere between a v and a w, {(IPA+[ʋ]+[ʋ])}, depending on the language background of the speaker. 10250690 -> 1000002600710: A semivowel {(IPA+/u̯/+/u̯/)} normally occurs only in diphthongs after the vowels {(IPA+/a/+/a/)} and {(IPA+/e/+/e/)}, not as a consonant {(IPA+*/w/+*/w/)}. 10250700 -> 1000002600720: Common, if debated, assimilation includes the pronunciation of {(IPA+/nk/+/nk/)} as {(IPA+[ŋk]+[ŋk])}, as in English sink, and {(IPA+/kz/+/kz/)} as {(IPA+[gz]+[gz])}, like the x in English example. 10250710 -> 1000002600730: A large number of consonant clusters can occur, up to three in initial position and four in medial position, as in instrui "to teach". 10250720 -> 1000002600740: Final clusters are uncommon except in foreign names, poetic elision of final o, and a very few basic words such as cent "hundred" and post "after". 10250730 -> 1000002600750: Vowels 10250740 -> 1000002600760: Esperanto has the five cardinal vowels of Spanish, Swahili, and Modern Greek. 10250750 -> 1000002600770: There are six falling diphthongs: uj, oj, ej, aj, aŭ, eŭ ({(IPA+/ui̯, oi̯, ei̯, ai̯, au̯, eu̯/+/ui̯, oi̯, ei̯, ai̯, au̯, eu̯/)}). 10250760 -> 1000002600780: With only five vowels, a good deal of variation is tolerated. 10250770 -> 1000002600790: For instance, {(IPA+/e/+/e/)} commonly ranges from {(IPA+[e]+[e])} (French é) to {(IPA+[ɛ]+[ɛ])} (French è). 10250780 -> 1000002600800: The details often depend on the speaker's native language. 10250790 -> 1000002600810: A glottal stop may occur between adjacent vowels in some people's speech, especially when the two vowels are the same, as in heroo "hero" ({(IPA+[he.ˈro.o]+[he.ˈro.o])} or {(IPA+[he.ˈro.ʔo]+[he.ˈro.ʔo])}) and praavo "great-grandfather" ({(IPA+[pra.ˈa.vo]+[pra.ˈa.vo])} or {(IPA+[pra.ˈʔa.vo]+[pra.ˈʔa.vo])}). 10250800 -> 1000002600820: Grammar 10250810 -> 1000002600830: Esperanto words are derived by stringing together prefixes, roots, and suffixes. 10250820 -> 1000002600840: This process is regular, so that people can create new words as they speak and be understood. 10250830 -> 1000002600850: Compound words are formed with a modifier-first, head-final order, the same order as English "birdsong" vs. "songbird".
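Two of the regularities just described, fixed penultimate stress and word building by stringing morphemes together, are simple enough to state as code. The sketch below is a minimal illustration in Python, not a standard tool: the stress function treats a, e, i, o, u as vowel letters (j and ŭ are semivowels, so a diphthong counts as one syllable), and the word-building demo uses the noun ending -o described in the next paragraph together with the suffixes -il- 'tool' and -ej- 'place for...' that appear elsewhere in this article; -ar- 'collection' is inferred from vortaro, and lernejo 'school' is an extra illustrative example.

# Minimal sketch of two regularities described in this section.
VOWELS = "aeiou"  # j and ŭ are semivowels, so a diphthong counts as one syllable

def stressed_vowel_index(word):
    """Index of the stressed vowel: the penultimate vowel, or the only vowel
    of a monosyllable.  (Elided forms like famili' are not handled here.)"""
    positions = [i for i, ch in enumerate(word.lower()) if ch in VOWELS]
    return positions[-2] if len(positions) >= 2 else positions[-1]

assert stressed_vowel_index("familio") == 5    # fa.mi.LI.o, as in the example above
assert stressed_vowel_index("esperanto") == 5  # es.pe.RAN.to

def build_noun(root, *suffixes):
    """String a root and suffixes together and close with the noun ending -o."""
    return root + "".join(suffixes) + "o"

print(build_noun("komput", "il"))  # komputilo, "computer" (a computing tool)
print(build_noun("vort", "ar"))    # vortaro, "dictionary" (a collection of words)
print(build_noun("lern", "ej"))    # lernejo, "school" (a place for learning)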
10250840 -> 1000002600860: The different parts of speech are marked by their own suffixes: all common nouns end in -o, all adjectives in -a, all derived adverbs in -e, and all verbs in one of six tense and mood suffixes, such as present tense -as. 10250850 -> 1000002600870: Plural nouns end in -oj (pronounced "oy"), whereas direct objects end in -on. 10250860 -> 1000002600880: Plural direct objects end with the combination -ojn (pronounced to rhyme with "coin"): That is, -o for a noun, plus -j for plural, plus -n for direct object. 10250870 -> 1000002600890: Adjectives agree with their nouns; their endings are plural -aj (pronounced "eye"), direct-object -an, and plural direct-object -ajn (pronounced to rhyme with "fine"). 10250880 -> 1000002600900: The suffix -n is used to indicate the goal of movement and a few other things, in addition to the direct object. 10250890 -> 1000002600910: See Esperanto grammar for details. 10250900 -> 1000002600920: The six verb inflections consist of three tenses and three moods. 10250910 -> 1000002600930: They are present tense -as, future tense -os, past tense -is, infinitive mood -i, conditional mood -us, and jussive mood -u (used for wishes and commands). 10250920 -> 1000002600940: Verbs are not marked for person or number. 10250930 -> 1000002600950: For instance: kanti "to sing"; mi kantas "I sing"; mi kantis "I sang"; mi kantos "I will sing"; li kantas "he sings"; vi kantas "you sing". 10250940 -> 1000002600960: Word order is comparatively free: Adjectives may precede or follow nouns, and subjects, verbs and objects (marked by the suffix -n) may occur in any order. 10250950 -> 1000002600970: However, the article la "the" and demonstratives such as tiu "this, that" almost always come before the noun, and a preposition such as ĉe "at" must come before it. 10250960 -> 1000002600980: Similarly, the negative ne "not" and conjunctions such as kaj "both, and" and ke "that" must precede the phrase or clause they introduce. 10250970 -> 1000002600990: In copular (A = B) clauses, word order is just as important as it is in English clauses like "people are dogs" vs. "dogs are people". 10250980 -> 1000002601000: Correlatives 10250990 -> 1000002601010: A correlative is a word used to ask or answer a question of who, where, what, when, or how. 10251000 -> 1000002601020: Correlatives in Esperanto are set out in a systematic manner that correlates a basic idea (quantity, manner, time, etc.) to a function (questioning, indicating, negating, etc.) 10251010 -> 1000002601030: Examples: 10251020 -> 1000002601040: Kio estas tio? 10251030 -> 1000002601050: "What is this?" 10251040 -> 1000002601060: Kioma estas la horo? 10251050 -> 1000002601070: "What time is it?" 10251060 -> 1000002601080: Note kioma rather than Kiu estas la horo? "which is the hour?", when asking for the ranking order of the hour on the clock. 10251070 -> 1000002601090: Io falis el la ŝranko "Something fell out of the cupboard." 10251080 -> 1000002601100: Homoj tiaj kiel mi ne konadas timon. 10251090 -> 1000002601110: "Men such as me know no fear." 10251100 -> 1000002601120: Correlatives are declined if the case demands it: 10251110 -> 1000002601130: Vi devas elekti ian vorton pli simpla "You should choose a (some kind of) simpler word." 10251120 -> 1000002601140: Ia receives -n because it's part of the direct object. 10251130 -> 1000002601150: Kian libron vi volas? 10251140 -> 1000002601160: "What sort of book do you want?" 10251150 -> 1000002601170: Contrast this with, Kiun libron vi volas? 
10251160 -> 1000002601180: "Which book do you want?" 10251170 -> 1000002601190: Vocabulary 10251180 -> 1000002601200: The core vocabulary of Esperanto was defined by Lingvo internacia, published by Zamenhof in 1887. 10251190 -> 1000002601210: It comprised 900 roots, which could be expanded into tens of thousands of words with prefixes, suffixes, and compounding. 10251200 -> 1000002601220: In 1894, Zamenhof published the first Esperanto dictionary, Universala Vortaro, with a larger set of roots. 10251210 -> 1000002601230: However, the rules of the language allowed speakers to borrow new roots as needed, recommending only that they look for the most international forms, and then derive related meanings from these. 10251220 -> 1000002601240: Since then, many words have been borrowed, primarily but not solely from the Western European languages. 10251230 -> 1000002601250: Not all proposed borrowings catch on, but many do, especially technical and scientific terms. 10251240 -> 1000002601260: Terms for everyday use, on the other hand, are more likely to be derived from existing roots—for example komputilo (a computer) from komputi (to compute) plus the suffix -ilo (tool)—or to be covered by extending the meanings of existing words (for example muso (a mouse), as in English, now also means a computer input device). 10251250 -> 1000002601270: There are frequent debates among Esperanto speakers about whether a particular borrowing is justified or whether the need can be met by deriving from or extending the meaning of existing words. 10251260 -> 1000002601280: In addition to the root words and the rules for combining them, a learner of Esperanto must memorize some idiomatic compounds that are not entirely straightforward. 10251270 -> 1000002601290: For example, eldoni, literally "to give out", is used for "to publish" (a calque of words in several European languages with the same derivation), and vortaro, literally "a collection of words", means "a glossary" or "a dictionary". 10251280 -> 1000002601300: Such forms are modeled after usage in some European languages, and speakers of other languages may find them illogical. 10251290 -> 1000002601310: Fossilized derivations inherited from Esperanto's source languages may be similarly obscure, such as the opaque connection the root word centralo "power station" has with centro "center". 10251300 -> 1000002601320: Compounds with -um- are overtly arbitrary, and must be learned individually, as -um- has no defined meaning. 10251310 -> 1000002601330: It turns dekstren "to the right" into dekstrumen "clockwise", and komuna "common/shared" into komunumo "community", for example. 10251320 -> 1000002601340: Nevertheless, there are not nearly as many idiomatic or slang words in Esperanto as in ethnic languages, as these tend to make international communication difficult, working against Esperanto's main goal. 10251330 -> 1000002601350: Useful phrases 10251340 -> 1000002601360: Here are some useful Esperanto phrases, with IPA transcriptions: 10251350 -> 1000002601370: Hello: Saluton {(IPA+/sa.ˈlu.ton/+/sa.ˈlu.ton/)} 10251360 -> 1000002601380: What is your name?: Kiel vi nomiĝas? 10251370 -> 1000002601390: {(IPA+/ˈki.el vi no.ˈmi.ʤas/+/ˈki.el vi no.ˈmi.ʤas/)} 10251380 -> 1000002601400: My name is...: Mi nomiĝas... 10251390 -> 1000002601410: {(IPA+/mi no.ˈmi.ʤas/+/mi no.ˈmi.ʤas/)} 10251400 -> 1000002601420: How much (is it/are they)?: Kiom (estas)? 
10251410 -> 1000002601430: {(IPA+/ˈki.om ˈes.tas/+/ˈki.om ˈes.tas/)} 10251420 -> 1000002601440: Here you are: Jen {(IPA+/jen/+/jen/)} 10251430 -> 1000002601450: Do you speak Esperanto?: Ĉu vi parolas Esperanton? 10251440 -> 1000002601460: {(IPA+/ˈʧu vi pa.ˈro.las es.pe.ˈran.ton/+/ˈʧu vi pa.ˈro.las es.pe.ˈran.ton/)} 10251450 -> 1000002601470: I do not understand you: Mi ne komprenas vin {(IPA+/mi ˈne kom.ˈpre.nas vin/+/mi ˈne kom.ˈpre.nas vin/)} 10251460 -> 1000002601480: I like this one: Ĉi tiu plaĉas al mi {(IPA+/ʧi ˈti.u ˈpla.ʧas al ˈmi/+/ʧi ˈti.u ˈpla.ʧas al ˈmi/)} or Mi ŝatas tiun ĉi {(IPA+/mi ˈʃa.tas ˈti.un ˈʧi/+/mi ˈʃa.tas ˈti.un ˈʧi/)} 10251470 -> 1000002601490: Thank you: Dankon {(IPA+/ˈdan.kon/+/ˈdan.kon/)} 10251480 -> 1000002601500: You're welcome: Ne dankinde {(IPA+/ˈne dan.ˈkin.de/+/ˈne dan.ˈkin.de/)} 10251490 -> 1000002601510: Please: Bonvolu {(IPA+/bon.ˈvo.lu/+/bon.ˈvo.lu/)} or mi petas {(IPA+/mi ˈpe.tas/+/mi ˈpe.tas/)} 10251500 -> 1000002601520: Here's to your health: Je via sano {(IPA+/je ˈvi.a ˈsa.no/+/je ˈvi.a ˈsa.no/)} 10251510 -> 1000002601530: Bless you!/Gesundheit!: Sanon! 10251520 -> 1000002601540: {(IPA+/ˈsa.non/+/ˈsa.non/)} 10251530 -> 1000002601550: Congratulations!: Gratulon! 10251540 -> 1000002601560: {(IPA+/ɡra.ˈtu.lon/+/ɡra.ˈtu.lon/)} 10251550 -> 1000002601570: Okay: Bone {(IPA+/ˈbo.ne/+/ˈbo.ne/)} or Ĝuste {(IPA+/ˈʤus.te/+/ˈʤus.te/)} 10251560 -> 1000002601580: Yes: Jes {(IPA+/ˈjes/+/ˈjes/)} 10251570 -> 1000002601590: No: Ne {(IPA+/ˈne/+/ˈne/)} 10251580 -> 1000002601600: It is a nice day: Estas bela tago {(IPA+/ˈes.tas ˈbe.la ˈta.ɡo/+/ˈes.tas ˈbe.la ˈta.ɡo/)} 10251590 -> 1000002601610: I love you: Mi amas vin {(IPA+/mi ˈa.mas vin/+/mi ˈa.mas vin/)} 10251600 -> 1000002601620: Goodbye: Ĝis (la) (revido) {(IPA+/ʤis la re.ˈvi.do/+/ʤis la re.ˈvi.do/)} 10251610 -> 1000002601630: One beer, please: Unu bieron, mi petas. 10251620 -> 1000002601640: {(IPA+/ˈu.nu bi.ˈe.ron, mi ˈpe.tas/+/ˈu.nu bi.ˈe.ron, mi ˈpe.tas/)} 10251630 -> 1000002601650: What is that?: Kio estas tio? 10251640 -> 1000002601660: {(IPA+/ˈki.o ˈes.tas ˈti.o/+/ˈki.o ˈes.tas ˈti.o/)} 10251650 -> 1000002601670: That is...: Tio estas... 10251660 -> 1000002601680: {(IPA+/ˈti.o ˈes.tas/+/ˈti.o ˈes.tas/)} 10251670 -> 1000002601690: How are you?: Kiel vi (fartas)? 10251680 -> 1000002601700: {(IPA+/ˈki.el vi ˈfar.tas/+/ˈki.el vi ˈfar.tas/)} 10251690 -> 1000002601710: Good morning!: Bonan matenon! 10251700 -> 1000002601720: {(IPA+/ˈbo.nan ma.ˈte.non/+/ˈbo.nan ma.ˈte.non/)} 10251710 -> 1000002601730: Good evening!: Bonan vesperon! 10251720 -> 1000002601740: {(IPA+/ˈbo.nan ves.ˈpe.ron/+/ˈbo.nan ves.ˈpe.ron/)} 10251730 -> 1000002601750: Good night!: Bonan nokton! 10251740 -> 1000002601760: {(IPA+/ˈbo.nan ˈnok.ton/+/ˈbo.nan ˈnok.ton/)} 10251750 -> 1000002601770: Peace!: Pacon! 10251760 -> 1000002601780: {(IPA+/ˈpa.tson/+/ˈpa.tson/)} 10251770 -> 1000002601790: Sample text 10251780 -> 1000002601800: The following short extract gives an idea of the character of Esperanto. 10251790 -> 1000002601810: (Pronunciation is covered above. 10251800 -> 1000002601820: The main point for English speakers to remember is that the letter 'J' has the sound of the letter 'Y' in English) 10251810 -> 1000002601830: Esperanto text 10251820 -> 1000002601840: En multaj lokoj de Ĉinio estis temploj de drako-reĝo. Dum trosekeco oni preĝis en la temploj, ke la drako-reĝo donu pluvon al la homa mondo. 10251830 -> 1000002601850: Tiam drako estis simbolo de la supernatura estaĵo. 
Kaj pli poste, ĝi fariĝis prapatro de la plej altaj regantoj kaj simbolis la absolutan aŭtoritaton de feŭda imperiestro. 10251840 -> 1000002601860: La imperiestro pretendis, ke li estas filo de la drako. Ĉiuj liaj vivbezonaĵoj portis la nomon drako kaj estis ornamitaj per diversaj drakofiguroj. 10251850 -> 1000002601870: Nun ĉie en Ĉinio videblas drako-ornamentaĵoj kaj cirkulas legendoj pri drakoj. 10251860 -> 1000002601880: English Translation: 10251870 -> 1000002601890: In many places in China there were temples of the dragon king. 10251880 -> 1000002601900: During times of drought, people prayed in the temples, that the dragon king would give rain to the human world. 10251890 -> 1000002601910: At that time the dragon was a symbol of the supernatural. 10251900 -> 1000002601920: Later on, it became the ancestor of the highest rulers and symbolised the absolute authority of the feudal emperor. 10251910 -> 1000002601930: The emperor claimed to be the son of the dragon. 10251920 -> 1000002601940: All of his personal possessions carried the name dragon and were decorated with various dragon figures. 10251930 -> 1000002601950: Now everywhere in China dragon decorations can be seen and there circulate legends about dragons. 10251940 -> 1000002601960: Education 10251950 -> 1000002601970: The majority of Esperanto speakers learn the language through self-directed study, online tutorials, and correspondence courses taught by volunteers. 10251960 -> 1000002601980: In more recent years, teaching websites like lernu! have become popular. 10251970 -> 1000002601990: Esperanto instruction is occasionally available at schools, such as a pilot project involving four primary schools under the supervision of the University of Manchester, and by one count at 69 universities. 10251980 -> 1000002602000: However, outside of China and Hungary, these mostly involve informal arrangements rather than dedicated departments or state sponsorship. 10251990 -> 1000002602010: Eötvös Loránd University in Budapest had a department of Interlinguistics and Esperanto from 1966 to 2004, after which time instruction moved to vocational colleges; there are state examinations for Esperanto instructors. 10252000 -> 1000002602020: Various educators have estimated that Esperanto can be learned in anywhere from one quarter to one twentieth the amount of time required for other languages. 10252010 -> 1000002602030: Some argue, however, that this is only true for native speakers of Western European languages. 10252020 -> 1000002602040: Claude Piron, a psychologist formerly at the University of Geneva and Chinese-English-Russian-Spanish translator for the United Nations, argued that Esperanto is far more "brain friendly" than many ethnic languages. 10252030 -> 1000002602050: "Esperanto relies entirely on innate reflexes [and] differs from all other languages in that you can always trust your natural tendency to generalize patterns. [...] 10252040 -> 1000002602060: The same neuropsychological law [— called by] Jean Piaget generalizing assimilation — applies to word formation as well as to grammar." 10252050 -> 1000002602070: Language acquisition 10252060 -> 1000002602080: Four primary schools in Britain, with some 230 pupils, are currently following a course in "propedeutic Esperanto", under the supervision of the University of Manchester. 10252070 -> 1000002602090: That is, instruction in Esperanto to raise language awareness and accelerate subsequent learning of foreign languages. 
10252080 -> 1000002602100: Several studies demonstrate that studying Esperanto before another foreign language speeds and improves learning of that second language to a greater extent than prior study of other languages that have been investigated. 10252090 -> 1000002602110: This appears to be because learning subsequent foreign languages is easier than learning one's first, while the use of a grammatically simple and culturally flexible auxiliary language like Esperanto lessens the first-language learning hurdle. 10252100 -> 1000002602120: In one study, a group of European secondary school students studied Esperanto for one year, then French for three years, and ended up with a significantly better command of French than a control group, who studied French for all four years. 10252110 -> 1000002602130: Similar results were found when the course of study was reduced to two years, of which six months was spent learning Esperanto. 10252120 -> 1000002602140: Results are not yet available from a study in Australia to see if similar benefits would occur for learning East Asian languages, but the pupils taking Esperanto did better and enjoyed the subject more than those taking other languages. 10252130 -> 1000002602150: Community 10252140 -> 1000002602160: Geography and demography 10252150 -> 1000002602170: Esperanto speakers are more numerous in Europe and East Asia than in the Americas, Africa, and Oceania, and more numerous in urban than in rural areas. 10252160 -> 1000002602180: Esperanto is particularly prevalent in the northern and eastern countries of Europe; in China, Korea, Japan, and Iran within Asia; in Brazil, Argentina, and Mexico in the Americas; and in Togo in Africa. 10252170 -> 1000002602190: Number of speakers 10252180 -> 1000002602200: An estimate of the number of Esperanto speakers was made by the late Sidney S. Culbert, a retired psychology professor at the University of Washington and a longtime Esperantist, who tracked down and tested Esperanto speakers in sample areas in dozens of countries over a period of twenty years. 10252190 -> 1000002602210: Culbert concluded that between one and two million people speak Esperanto at Foreign Service Level 3, "professionally proficient" (able to communicate moderately complex ideas without hesitation, and to follow speeches, radio broadcasts, etc.). 10252200 -> 1000002602220: Culbert's estimate was not made for Esperanto alone, but formed part of his listing of estimates for all languages of over 1 million speakers, published annually in the World Almanac and Book of Facts. 10252210 -> 1000002602230: Culbert's most detailed account of his methodology is found in a 1989 letter to David Wolff. 10252220 -> 1000002602240: Since Culbert never published detailed intermediate results for particular countries and regions, it is difficult to independently gauge the accuracy of his results. 10252230 -> 1000002602250: In the Almanac, his estimates for numbers of language speakers were rounded to the nearest million; thus the number for Esperanto speakers is shown as 2 million. 10252240 -> 1000002602260: This latter figure appears in Ethnologue. 10252250 -> 1000002602270: Assuming that this figure is accurate, that means that about 0.03% of the world's population speaks the language. 10252260 -> 1000002602280: This falls short of Zamenhof's goal of a universal language, but it represents a level of popularity unmatched by any other constructed language.
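As a rough check of that percentage, assuming a world population of roughly 6.5 billion at the time of these estimates: 2,000,000 / 6,500,000,000 ≈ 0.0003, i.e. about 0.03%, consistent with the figure quoted above.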
10252270 -> 1000002602290: Marcus Sikosek (now Ziko van Dijk) has challenged this figure of 1.6 million as exaggerated. 10252280 -> 1000002602300: He estimated that even if Esperanto speakers were evenly distributed, assuming one million Esperanto speakers worldwide would lead one to expect about 180 in the city of Cologne. 10252290 -> 1000002602310: Van Dijk finds only 30 fluent speakers in that city, and similarly smaller than expected figures in several other places thought to have a larger-than-average concentration of Esperanto speakers. 10252300 -> 1000002602320: He also notes that there are a total of about 20,000 members of the various Esperanto organizations (other estimates are higher). 10252310 -> 1000002602330: Though there are undoubtedly many Esperanto speakers who are not members of any Esperanto organization, he thinks it unlikely that there are fifty times more speakers than organization members. 10252320 -> 1000002602340: Finnish linguist Jouko Lindstedt, an expert on native-born Esperanto speakers, presented the following scheme to show the overall proportions of language capabilities within the Esperanto community: 10252330 -> 1000002602350: 1,000 have Esperanto as their native language 10252340 -> 1000002602360: 10,000 speak it fluently 10252350 -> 1000002602370: 100,000 can use it actively 10252360 -> 1000002602380: 1,000,000 understand a large amount passively 10252370 -> 1000002602390: 10,000,000 have studied it to some extent at some time. 10252380 -> 1000002602400: In the absence of Dr. Culbert's detailed sampling data, or any other census data, it is impossible to state the number of speakers with certainty. 10252390 -> 1000002602410: Few observers, probably, would challenge the following statement from the website of the World Esperanto Association: 10252400 -> 1000002602420: Numbers of textbooks sold and membership of local societies put the number of people with some knowledge of the language in the hundreds of thousands and possibly millions. 10252410 -> 1000002602430: Native speakers 10252420 -> 1000002602440: Ethnologue reports estimates that there are 200 to 2000 native Esperanto speakers (denaskuloj), who have learned the language from birth from their Esperanto-speaking parents. 10252430 -> 1000002602450: This usually happens when Esperanto is the chief or only common language in an international family, but sometimes in a family of devoted Esperantists. 10252440 -> 1000002602460: The most famous native speaker of Esperanto is businessman George Soros. 10252450 -> 1000002602470: Also notable is young Holocaust victim Petr Ginz, whose drawing of the planet Earth as viewed from the moon was carried aboard the Space Shuttle Columbia in 2003 (STS-107). 10252460 -> 1000002602480: Culture 10252470 -> 1000002602490: Esperanto speakers can access an international culture, including a large body of original as well as translated literature. 10252480 -> 1000002602500: There are over 25,000 Esperanto books, both originals and translations, as well as several regularly distributed Esperanto magazines. 10252490 -> 1000002602510: Esperanto speakers use the language for free accommodations with Esperantists in 92 countries using the Pasporta Servo or to develop pen pal friendships abroad through the Esperanto Pen Pal Service. 10252500 -> 1000002602520: Every year, 1,500-3,000 Esperanto speakers meet for the World Congress of Esperanto (Universala Kongreso de Esperanto). 
10252510 -> 1000002602530: The European Esperanto Union (Eǔropa Esperanto-Unio) regroups the national Esperanto associations of the EU member states and holds congresses every two years. 10252520 -> 1000002602540: The most recent was in Maribor, Slovenia, in July-August 2007. 10252530 -> 1000002602550: It attracted 256 delegates from 28 countries, including 2 members of the European Parliament, Ms. Małgorzata Handzlik of Poland and Ms. Ljudmila Novak of Slovenia. 10252540 -> 1000002602560: Historically, much Esperanto music has been in various folk traditions, such as Kaj Tiel Plu, for example. 10252550 -> 1000002602570: In recent decades, more rock and other modern genres have appeared, an example being the Swedish band Persone. 10252560 -> 1000002602580: There are also shared traditions, such as Zamenhof Day, and shared behaviour patterns. 10252570 -> 1000002602590: Esperantists speak primarily in Esperanto at international Esperanto meetings. 10252580 -> 1000002602600: Detractors of Esperanto occasionally criticize it as "having no culture". 10252590 -> 1000002602610: Proponents, such as Prof. Humphrey Tonkin of the University of Hartford, observe that Esperanto is "culturally neutral by design, as it was intended to be a facilitator between cultures, not to be the carrier of any one national culture." 10252610 -> 1000002602620: The late Scottish Esperanto author William Auld has written extensively on the subject, arguing that Esperanto is "the expression of a common human culture, unencumbered by national frontiers. 10252620 -> 1000002602630: Thus it is considered a culture on its own." 10252630 -> 1000002602640: Others point to Esperanto's potential for strengthening a common European identity, as it combines features of several European languages. 10252640 -> 1000002602650: In popular culture 10252650 -> 1000002602660: Esperanto has been used in a number of films and novels. 10252660 -> 1000002602670: Typically, this is done either to add the exotic flavour of a foreign language without representing any particular ethnicity, or to avoid going to the trouble of inventing a new language. 10252670 -> 1000002602680: The Charlie Chaplin film The Great Dictator (1940) showed Jewish ghetto shops designated in Esperanto, each with the general Esperanto suffix -ejo (meaning "place for..."), in order to convey the atmosphere of some 'foreign' East European country without referencing any particular East European language. 10252680 -> 1000002602690: Two full-length feature films have been produced with dialogue entirely in Esperanto: Angoroj, in 1964, and Incubus, a 1965 B-movie horror film. 10252690 -> 1000002602700: Canadian actor William Shatner learned Esperanto to a limited level so that he could star in Incubus. 10252700 -> 1000002602710: Other amateur productions have been made, such as a dramatisation of the novel Gerda Malaperis (Gerda Has Disappeared). 10252710 -> 1000002602720: A number of "mainstream" films in national languages have used Esperanto in some way, such as Gattaca (1997), in which Esperanto can be overheard on the public address system. 10252720 -> 1000002602730: In the 1994 film Street Fighter, Esperanto is the native language of the fictional country of Shadaloo, and in a barracks scene the soldiers of villain M. Bison sing a rousing Russian Army-style chorus, the "Bison Troopers Marching Song", in the language. 10252730 -> 1000002602740: Esperanto is also spoken and appears on signs in the film Blade: Trinity. 
10252740 -> 1000002602750: In the British comedy Red Dwarf, Arnold Rimmer is seen attempting to learn Esperanto in a number of early episodes, including Kryten. 10252750 -> 1000002602760: In the first season, signs on the titular spacecraft are in both English and Esperanto. 10252760 -> 1000002602770: Esperanto is used as the universal language in the far future of Harry Harrison's Stainless Steel Rat and Deathworld stories. 10252770 -> 1000002602780: In a 1969 guest appearance on The Tonight Show, Jay Silverheels of The Lone Ranger fame appeared in character as Tonto for a comedy sketch with Johnny Carson, and claimed Esperanto skills as he sought new employment. 10252780 -> 1000002602790: The sketch ended with a statement of his ideal situation: "Tonto, to Toronto, for Esperanto, and pronto!" 10252790 -> 1000002602800: Also, in the Danny Phantom Episode, "Public Enemies", Danny, Tucker, and Sam come across a ghost wolf who speaks Esperanto, but only Tucker can understand at first. 10252800 -> 1000002602810: In Science 10252810 -> 1000002602820: In 1921 the French Academy of Sciences recommended using Esperanto for international scientific communication. 10252820 -> 1000002602830: A few scientists and mathematicians, such as Maurice Fréchet (mathematics), John C. Wells (linguistics), Helmar Frank (pedagogy and cybernetics), and Nobel laureate Reinhard Selten (economics) have published part of their work in Esperanto. 10252830 -> 1000002602840: Frank and Selten were among the founders of the International Academy of Sciences in San Marino, sometimes called the "Esperanto University", where Esperanto is the primary language of teaching and administration. 10252840 -> 1000002602850: Goals of the movement 10252850 -> 1000002602860: Zamenhof's intention was to create an easy-to-learn language to foster international understanding. 10252860 -> 1000002602870: It was to serve as an international auxiliary language, that is, as a universal second language, not to replace ethnic languages. 10252870 -> 1000002602880: This goal was widely shared among Esperanto speakers in the early decades of the movement. 10252880 -> 1000002602890: Later, Esperanto speakers began to see the language and the culture that had grown up around it as ends in themselves, even if Esperanto is never adopted by the United Nations or other international organizations. 10252890 -> 1000002602900: Those Esperanto speakers who want to see Esperanto adopted officially or on a large scale worldwide are commonly called finvenkistoj, from fina venko, meaning "final victory", or pracelistoj, from pracelo, meaning "original goal". 10252900 -> 1000002602910: Those who focus on the intrinsic value of the language are commonly called raŭmistoj, from Rauma, Finland, where a declaration on the near-term unlikelihood of the "fina venko" and the value of Esperanto culture was made at the International Youth Congress in 1980. 10252910 -> 1000002602920: These categories are, however, not mutually exclusive. 10252920 -> 1000002602930: The Prague Manifesto (1996) presents the views of the mainstream of the Esperanto movement and of its main organisation, the World Esperanto Association (UEA). 10252930 -> 1000002602940: Symbols and flags 10252940 -> 1000002602950: In 1893, C. Rjabinis and P. Deullin designed and manufactured a lapel pin for Esperantists to identify each other. 10252950 -> 1000002602960: The design was a circular pin with a white background and a five pointed green star. 
10252960 -> 1000002602970: The theme of the design was the hope of the five continents being united by a common language. 10252970 -> 1000002602980: The earliest flag, and the one most commonly used today, features a green five-pointed star against a white canton, upon a field of green. 10252980 -> 1000002602990: It was proposed to Zamenhof by Irishman Richard Geoghegan, author of the first Esperanto textbook for English speakers, in 1887. 10252990 -> 1000002603000: In 1905, delegates to the first conference of Esperantists at Boulogne-sur-Mer unanimously approved a version that differed from the modern flag only by the superimposition of an "E" over the green star. 10253000 -> 1000002603010: Other variants include that for Christian Esperantists, with a white Christian cross superimposed upon the green star, and that for Leftists, with the color of the field changed from green to red. 10253010 -> 1000002603020: In 1987, a second flag design was chosen in a contest organized by the UEA celebrating the first centennial of the language. 10253020 -> 1000002603030: It featured a white background with two stylised curved "E"s facing each other. 10253030 -> 1000002603040: Dubbed the "jubilea simbolo" (jubilee symbol) , it attracted criticism from some Esperantists, who dubbed it the "melono" (melon) because of the design's elliptical shape. 10253040 -> 1000002603050: It is still in use, though to a lesser degree than the traditional symbol, known as the "verda stelo" (green star). 10253050 -> 1000002603060: Religion 10253060 -> 1000002603070: Esperanto has served an important role in several religions, such as Oomoto from Japan and Baha'i from Iran, and has been encouraged by others. 10253070 -> 1000002603080: Oomoto 10253080 -> 1000002603090: The Oomoto religion encourages the use of Esperanto among their followers and includes Zamenhof as one of its deified spirits. 10253090 -> 1000002603100: Bahá'í Faith 10253100 -> 1000002603110: The Bahá'í Faith encourages the use of an auxiliary international language. 10253110 -> 1000002603120: While endorsing no specific language, some Bahá'ís see Esperanto as having great potential in this role. 10253120 -> 1000002603130: Lidja Zamenhof, the daughter of Esperanto founder L. L. Zamenhof, became a Bahá'í. 10253130 -> 1000002603140: Various volumes of the Bahá'í literatures and other Baha'i books have been translated into Esperanto. 10253140 -> 1000002603150: Spiritism 10253150 -> 1000002603160: Esperanto is also actively promoted, at least in Brazil, by followers of Spiritism. 10253160 -> 1000002603170: The Brazilian Spiritist Federation publishes Esperanto coursebooks, translations of Spiritism's basic books, and encourages Spiritists to become Esperantists. 10253170 -> 1000002603180: Bible translations 10253180 -> 1000002603190: The first translation of the Bible into Esperanto was a translation of the Tanach or Old Testament done by L. L. Zamenhof. 10253190 -> 1000002603200: The translation was reviewed and compared with other languages' translations by a group of British clergy and scholars before publishing it at the British and Foreign Bible Society in 1910. 10253200 -> 1000002603210: In 1926 this was published along with a New Testament translation, in an edition commonly called the "Londona Biblio". 10253210 -> 1000002603220: In the 1960s, the Internacia Asocio de Bibliistoj kaj Orientalistoj tried to organize a new, ecumenical Esperanto Bible version. 
10253220 -> 1000002603230: Since then, the Dutch Lutheran pastor Gerrit Berveling has translated the Deuterocanonical or apocryphal books in addition to new translations of the Gospels, some of the New Testament epistles, and some books of the Tanakh or Old Testament. 10253230 -> 1000002603240: These have been published in various separate booklets, or serialized in Dia Regno, but the Deuterocanonical books have appeared in recent editions of the Londona Biblio. 10253240 -> 1000002603250: Christianity 10253250 -> 1000002603260: Two Roman Catholic popes, John Paul II and Benedict XVI, have regularly used Esperanto in their multilingual urbi et orbi blessings at Easter and Christmas each year since Easter 1994. 10253260 -> 1000002603270: Christian Esperanto organizations include two that were formed early in the history of Esperanto, the International Union of Catholic Esperantists and the International Christian Esperantists League. 10253270 -> 1000002603280: An issue of "The Friend" describes the activities of the Quaker Esperanto Society. 10253280 -> 1000002603290: There are instances of Christian apologists and teachers who use Esperanto as a medium. 10253290 -> 1000002603300: Nigerian Pastor Bayo Afolaranmi's " Spirita nutraĵo" (spiritual food) Yahoo mailing list, for example, has hosted weekly messages since 2003. 10253300 -> 1000002603310: Chick Publications, publisher of Protestant fundamentalist themed evangelistic tracts, has published a number of comic book style tracts by Jack T. Chick translated into Esperanto, including "This Was Your Life!" 10253310 -> 1000002603320: ("Jen Via Tuto Vivo!") 10253320 -> 1000002603330: Islam 10253330 -> 1000002603340: Ayatollah Khomeini of Iran called on Muslims to learn Esperanto and praised its use as a medium for better understanding among peoples of different religious backgrounds. 10253340 -> 1000002603350: After he suggested that Esperanto replace English as an international lingua franca, it began to be used in the seminaries of Qom. 10253350 -> 1000002603360: An Esperanto translation of the Qur'an was published by the state shortly thereafter. 10253360 -> 1000002603370: In 1981, Khomeini and the Iranian government began to oppose Esperanto after realising that followers of the Bahá'í Faith were interested in it. 10253370 -> 1000002603380: Criticism 10253380 -> 1000002603390: Esperanto was conceived as a language of international communication, more precisely as a universal second language. 10253390 -> 1000002603400: Since publication, there has been debate over whether it is possible for Esperanto to attain this position, and whether it would be an improvement for international communication if it did. 10253400 -> 1000002603410: There have been a number of attempts to reform the language, the most well-known of which is the language Ido which resulted in a schism in the community at the time, beginning in 1907. 10253410 -> 1000002603420: Since Esperanto is a planned language, there have been many, often passionate, criticisms of minor points which are too numerous to cover here, such as Zamenhof's choice of the word edzo over something like spozo for "husband, spouse", or his choice of the Classic Greek and Old Latin singular and plural endings -o, -oj, -a, -aj over their Medieval contractions -o, -i, -a, -e. 10253420 -> 1000002603430: (Both these changes were adopted by the Ido reform, though Ido dispensed with adjectival agreement altogether.) 10253430 -> 1000002603440: See the links below for examples of more general criticism. 
10253440 -> 1000002603450: The more common points include: 10253450 -> 1000002603460: Esperanto has failed the expectations of its founder to become a universal second language. 10253460 -> 1000002603470: Although many promoters of Esperanto stress the few successes it has had, the fact remains that well over a century since its publication, the portion of the world that speaks Esperanto, and the number of primary and secondary schools which teach it, remain minuscule. 10253470 -> 1000002603480: It simply cannot compete with English in this regard. 10253480 -> 1000002603490: The vocabulary and grammar are based on major European languages, and are not universal. 10253490 -> 1000002603500: Often this criticism is specific to a few points such as adjectival agreement and the accusative case (generally such obvious details are all that reform projects suggest changing), but sometimes it is more general: Both the grammar and the 'international' vocabulary are difficult for many Asians, among others, and give an unfair advantage to speakers of European languages. 10253500 -> 1000002603510: One attempt to address this issue is Lojban, which draws from the six populous languages Arabic, Chinese, English, Hindi, Russian, and Spanish, and whose grammar is designed for computer parsing. 10253510 -> 1000002603520: The vocabulary, diacritic letters, and grammar are too dissimilar from the major Western European languages, and therefore Esperanto is not as easy as it could be for speakers of those languages to learn. 10253520 -> 1000002603530: Attempts to address this issue include the younger planned languages Ido and Interlingua. 10253530 -> 1000002603540: Esperanto phonology is unimaginatively provincial, being essentially Belorussian with regularized stress, leaving out only the nasal vowels, palatalized consonants, and /dz/. 10253540 -> 1000002603550: For example, Esperanto has phonemes such as {(IPA+/x/, /ʒ/, /ts/, /eu̯/+/x/, /ʒ/, /ts/, /eu̯/)} (ĥ, ĵ, c, eŭ) which are rare as distinct phonemes outside Europe. 10253550 -> 1000002603560: (Note that none of these are found in initial position in English.) 10253560 -> 1000002603570: Esperanto has no culture. 10253570 -> 1000002603580: Although it has a large international literature, Esperanto does not encapsulate a specific culture. 10253580 -> 1000002603590: Esperanto is culturally European. 10253590 -> 1000002603600: This is due to the European derivation of its vocabulary, and more insidiously, its semantics; both infuse the language with a European world view. 10253600 -> 1000002603610: The vocabulary is too large. 10253610 -> 1000002603620: Rather than deriving new words from existing roots, large numbers of new roots are adopted into the language by people who think they're international, when in fact they're only European. 10253620 -> 1000002603630: This makes the language much more difficult for non-Europeans than it needs to be. 10253630 -> 1000002603640: Esperanto is sexist. 10253640 -> 1000002603650: As in English, there is no neutral pronoun for s/he, and most kin terms and titles are masculine by default and only feminine when so specified. 10253650 -> 1000002603660: There have been many attempts to address this issue, of which one of the better known is Riism. 10253660 -> 1000002603670: Esperanto is, looks, or sounds artificial. 
10253670 -> 1000002603680: This criticism is primarily due to the letters with circumflex diacritics, which some find odd or cumbersome, and to the lack of fluent speakers: Few Esperantists have spent much time with fluent, let alone native, speakers, and many learn Esperanto relatively late in life, and so speak haltingly, which can create a negative impression among non-speakers. 10253680 -> 1000002603690: Among fluent speakers, Esperanto sounds no more artificial than any other language. 10253690 -> 1000002603700: Others claim that an artificial language will necessarily be deficient, due to its very nature, but the Hungarian Academy of Sciences has found that Esperanto fulfills all the requirements of a living language. 10253700 -> 1000002603710: Modifications 10253710 -> 1000002603720: Though Esperanto itself has changed little since the publication of the Fundamento de Esperanto (Foundation of Esperanto), a number of reform projects have been proposed over the years, starting with Zamenhof's proposals in 1894 and Ido in 1907. 10253720 -> 1000002603730: Several later constructed languages, such as Fasile, were based on Esperanto. 10253730 -> 1000002603740: In modern times, attempts have been made to eliminate perceived sexism in the language. 10253740 -> 1000002603750: One example of this is Riism. 10253750 -> 1000002603760: However, as Esperanto has become a living language, changes are as difficult to implement as in ethnic languages. Formal grammar 10260010 -> 1000002700020: Formal grammar 10260020 -> 1000002700030: In formal semantics, computer science and linguistics, a formal grammar (also called formation rules) is a precise description of a formal language – that is, of a set of strings over some alphabet. 10260030 -> 1000002700040: In other words, a grammar describes which of the possible sequences of symbols (strings) in a language constitute valid words or statements in that language, but it does not describe their semantics (i.e. what they mean). 10260040 -> 1000002700050: The branch of mathematics that is concerned with the properties of formal grammars and languages is called formal language theory. 10260050 -> 1000002700060: A grammar is usually regarded as a means to generate all the valid strings of a language; it can also be used as the basis for a recognizer that determines for any given string whether it is grammatical (i.e. belongs to the language). 10260060 -> 1000002700070: To describe such recognizers, formal language theory uses separate formalisms, known as automata. 10260070 -> 1000002700080: A grammar can also be used to analyze the strings of a language – i.e. to describe their internal structure. 10260080 -> 1000002700090: In computer science, this process is known as parsing. 10260090 -> 1000002700100: Most languages have very compositional semantics, i.e. the meaning of their utterances is structured according to their syntax; therefore, the first step to describing the meaning of an utterance in language is to analyze it and look at its analyzed form (known as its parse tree in computer science, and as its deep structure in generative grammar). 10260100 -> 1000002700110: Background 10260110 -> 1000002700120: Formal language 10260120 -> 1000002700130: A formal language is an organized set of symbols the essential feature of which is that it can be precisely defined in terms of just the shapes and locations of those symbols. 
10260130 -> 1000002700140: Such a language can be defined, then, without any reference to any meanings of any of its expressions; it can exist before any formal interpretation is assigned to it -- that is, before it has any meaning. 10260140 -> 1000002700150: First order logic is expressed in some formal language. 10260150 -> 1000002700160: A formal grammar determines which symbols and sets of symbols are formulas in a formal language. 10260160 -> 1000002700170: Formal systems 10260170 -> 1000002700180: A formal system (also called a logical calculus, or a logical system) consists of a formal language together with a deductive apparatus (also called a deductive system). 10260180 -> 1000002700190: The deductive apparatus may consist of a set of transformation rules (also called inference rules) or a set of axioms, or have both. 10260190 -> 1000002700200: A formal system is used to derive one expression from one or more other expressions. 10260200 -> 1000002700210: Formal proofs 10260210 -> 1000002700220: A formal proof is a sequence of well-formed formulas of a formal language, the last one of which is a theorem of a formal system. 10260220 -> 1000002700230: The theorem is a syntactic consequence of all the wffs preceding it in the proof. 10260230 -> 1000002700240: For a wff to qualify as part of a proof, it must be the result of applying a rule of the deductive apparatus of some formal system to the previous wffs in the proof sequence. 10260240 -> 1000002700250: Formal interpretations 10260250 -> 1000002700260: An interpretation of a formal system is the assignment of meanings to the symbols, and truth-values to the sentences of a formal system. 10260260 -> 1000002700270: The study of formal interpretations is called formal semantics. 10260270 -> 1000002700280: Giving an interpretation is synonymous with constructing a model. 10260280 -> 1000002700290: Formal grammars 10260290 -> 1000002700300: A grammar mainly consists of a set of rules for transforming strings. 10260300 -> 1000002700310: (If it only consisted of these rules, it would be a semi-Thue system.) 10260310 -> 1000002700320: To generate a string in the language, one begins with a string consisting of only a single start symbol, and then successively applies the rules (any number of times, in any order) to rewrite this string. 10260320 -> 1000002700330: The language consists of all the strings that can be generated in this manner. 10260330 -> 1000002700340: Any particular sequence of legal choices taken during this rewriting process yields one particular string in the language. 10260340 -> 1000002700350: If there are multiple ways of generating the same single string, then the grammar is said to be ambiguous. 10260350 -> 1000002700360: For example, assume the alphabet consists of a and b, the start symbol is S and we have the following rules: 10260360 -> 1000002700370: 1. S \rightarrow aSb 10260370 -> 1000002700380: 2. S \rightarrow ba 10260380 -> 1000002700390: then we start with S, and can choose a rule to apply to it. 10260390 -> 1000002700400: If we choose rule 1, we obtain the string aSb. 10260400 -> 1000002700410: If we choose rule 1 again, we replace S with aSb and obtain the string aaSbb. 10260410 -> 1000002700420: This process can be repeated at will until all occurrences of S are removed, and only symbols from the alphabet remain (i.e., a and b). 10260420 -> 1000002700430: For example, if we now choose rule 2, we replace S with ba and obtain the string aababb, and are done. 
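Because this rewriting procedure is purely mechanical, it is easy to simulate. The short Python sketch below is an illustration added for this example only (the rule table and function name are invented here, not taken from any library); it enumerates every terminal string derivable from the start symbol S for the grammar with rules S \rightarrow aSb and S \rightarrow ba, up to a chosen length bound:

```python
from collections import deque

# Toy grammar from the worked example: start symbol "S",
# rules S -> aSb and S -> ba (uppercase = nonterminal, lowercase = terminals).
RULES = {"S": ["aSb", "ba"]}

def generate(max_len=8):
    """Enumerate the terminal strings derivable from S, up to max_len characters."""
    results = set()
    queue = deque(["S"])          # sentential forms still to be rewritten
    seen = {"S"}
    while queue:
        form = queue.popleft()
        if len(form) > max_len:
            continue
        i = form.find("S")        # position of the leftmost nonterminal, if any
        if i == -1:
            results.add(form)     # no nonterminals left: a string of the language
            continue
        for rhs in RULES["S"]:    # apply each production rule at that position
            new_form = form[:i] + rhs + form[i + 1:]
            if new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(results, key=len)

print(generate())   # ['ba', 'abab', 'aababb', 'aaababbb']
```

Each pass through the loop performs exactly one application of a production rule, so the printed strings are precisely those reachable by the derivation process just described.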
10260430 -> 1000002700440: We can write this series of choices more briefly, using symbols: S \Rightarrow aSb \Rightarrow aaSbb \Rightarrow aababb. 10260440 -> 1000002700450: The language of the grammar is the set of all the strings that can be generated using this process: \left \{ba, abab, aababb, aaababbb, ...\right \}. 10260450 -> 1000002700460: Formal definition 10260460 -> 1000002700470: In the classic formalization of generative grammars first proposed by Noam Chomsky in the 1950s, a grammar G consists of the following components: 10260470 -> 1000002700480: A finite set N of nonterminal symbols. 10260480 -> 1000002700490: A finite set \Sigma of terminal symbols that is disjoint from N. 10260490 -> 1000002700500: A finite set P of production rules, each of the form 10260500 -> 1000002700510: (\Sigma \cup N)^{*} N (\Sigma \cup N)^{*} \rightarrow (\Sigma \cup N)^{*} 10260510 -> 1000002700520: where {}^{*} is the Kleene star operator and \cup denotes set union. 10260520 -> 1000002700530: That is, each production rule maps from one string of symbols to another, where the first string contains at least one nonterminal symbol. 10260530 -> 1000002700540: In the case that the second string is the empty string – that is, it contains no symbols at all – the empty string is often denoted with a special notation, often \lambda, e or \epsilon, in order to avoid confusion. 10260540 -> 1000002700550: A distinguished symbol S \in N that is the start symbol. 10260550 -> 1000002700560: A grammar is formally defined as the ordered quadruple (N, \Sigma, P, S). 10260560 -> 1000002700570: Such a formal grammar is often called a rewriting system or a phrase structure grammar in the literature. 10260570 -> 1000002700580: The operation of a grammar can be defined in terms of relations on strings: 10260580 -> 1000002700590: Given a grammar G = (N, \Sigma, P, S), the binary relation \Rightarrow_G (pronounced as "G derives in one step") on strings in (\Sigma \cup N)^{*} is defined by: 10260590 -> 1000002700600: x \Rightarrow_G y \mbox{ iff } \exists u, v, w \in (\Sigma \cup N)^{*}, X \in N: x = uXv \wedge y = uwv \wedge X \rightarrow w \in P 10260600 -> 1000002700610: the relation {\Rightarrow_G}^{*} (pronounced as "G derives in zero or more steps") is defined as the reflexive transitive closure of \Rightarrow_G 10260610 -> 1000002700620: the language of G, denoted as \boldsymbol{L}(G), is defined as all those strings over \Sigma that can be generated by starting with the start symbol S and then applying the production rules in P until no more nonterminal symbols are present; that is, the set \{ w \in \Sigma^* \mid S {\Rightarrow_G}^* w \}. 10260620 -> 1000002700630: Note that the grammar G = (N, \Sigma, P, S) is effectively the semi-Thue system (N \cup \Sigma, P), rewriting strings in exactly the same way; the only difference is that we distinguish specific nonterminal symbols which must be rewritten in rewrite rules, and are only interested in rewritings from the designated start symbol S to strings without nonterminal symbols. 10260630 -> 1000002700640: Example 10260640 -> 1000002700650: For these examples, formal languages are specified using set-builder notation. 10260650 -> 1000002700660: Consider the grammar G where N = \left \{S, B\right \}, \Sigma = \left \{a, b, c\right \}, S is the start symbol, and P consists of the following production rules: 10260660 -> 1000002700670: 1. S \rightarrow aBSc 10260670 -> 1000002700680: 2. S \rightarrow abc 10260680 -> 1000002700690: 3.
Ba \rightarrow aB 10260690 -> 1000002700700: 4. Bb \rightarrow bb 10260700 -> 1000002700710: Some examples of the derivation of strings in \boldsymbol{L}(G) are: 10260710 -> 1000002700720: \boldsymbol{S} \Rightarrow_2 \boldsymbol{abc} 10260720 -> 1000002700730: \boldsymbol{S} \Rightarrow_1 \boldsymbol{aBSc} \Rightarrow_2 aB\boldsymbol{abc}c \Rightarrow_3 a\boldsymbol{aB}bcc \Rightarrow_4 aa\boldsymbol{bb}cc 10260730 -> 1000002700740: \boldsymbol{S} \Rightarrow_1 \boldsymbol{aBSc} \Rightarrow_1 aB\boldsymbol{aBSc}c \Rightarrow_2 aBaB\boldsymbol{abc}cc \Rightarrow_3 a\boldsymbol{aB}Babccc \Rightarrow_3 aaB\boldsymbol{aB}bccc \Rightarrow_3 aa\boldsymbol{aB}Bbccc \Rightarrow_4 aaaB\boldsymbol{bb}ccc \Rightarrow_4 aaa\boldsymbol{bb}bccc 10260740 -> 1000002700750: (Note on notation: L \Rightarrow_i R reads "L generates R by means of production i" and the generated part is each time indicated in bold.) 10260750 -> 1000002700760: This grammar defines the language L = \left \{ a^{n}b^{n}c^{n} | n \ge 1 \right \} where a^{n} denotes a string of n consecutive a's. 10260760 -> 1000002700770: Thus, the language is the set of strings that consist of 1 or more a's, followed by the same number of b's, followed by the same number of c's. 10260770 -> 1000002700780: The Chomsky hierarchy 10260780 -> 1000002700790: When Noam Chomsky first formalized generative grammars in 1956, he classified them into types now known as the Chomsky hierarchy. 10260790 -> 1000002700800: The difference between these types is that they have increasingly strict production rules and can express fewer formal languages. 10260800 -> 1000002700810: Two important types are context-free grammars (Type 2) and regular grammars (Type 3). 10260810 -> 1000002700820: The languages that can be described with such a grammar are called context-free languages and regular languages, respectively. 10260820 -> 1000002700830: Although much less powerful than unrestricted grammars (Type 0), which can in fact express any language that can be accepted by a Turing machine, these two restricted types of grammars are most often used because parsers for them can be efficiently implemented. 10260830 -> 1000002700840: For example, all regular languages can be recognized by a finite state machine, and for useful subsets of context-free grammars there are well-known algorithms to generate efficient LL parsers and LR parsers to recognize the corresponding languages those grammars generate. 10260840 -> 1000002700850: Context-free grammars 10260850 -> 1000002700860: A context-free grammar is a grammar in which the left-hand side of each production rule consists of only a single nonterminal symbol. 10260860 -> 1000002700870: This restriction is non-trivial; not all languages can be generated by context-free grammars. 10260870 -> 1000002700880: Those that can are called context-free languages. 10260880 -> 1000002700890: The language defined above is not a context-free language, and this can be strictly proven using the pumping lemma for context-free languages, but for example the language \left \{ a^{n}b^{n} | n \ge 1 \right \} (at least 1 a followed by the same number of b's) is context-free, as it can be defined by the grammar G_2 with N=\left \{S\right \}, \Sigma=\left \{a,b\right \}, S the start symbol, and the following production rules: 10260890 -> 1000002700900: 1. S \rightarrow aSb 10260900 -> 1000002700910: 2. 
S \rightarrow ab 10260910 -> 1000002700920: A context-free language can be recognized in O(n^3) time (see Big O notation) by an algorithm such as Earley's algorithm. 10260920 -> 1000002700930: That is, for every context-free language, a machine can be built that takes a string as input and determines in O(n^3) time whether the string is a member of the language, where n is the length of the string. 10260930 -> 1000002700940: Further, some important subsets of the context-free languages can be recognized in linear time using other algorithms. 10260940 -> 1000002700950: Regular grammars 10260950 -> 1000002700960: In regular grammars, the left-hand side is again only a single nonterminal symbol, but now the right-hand side is also restricted: It may be the empty string, or a single terminal symbol, or a single terminal symbol followed by a nonterminal symbol, but nothing else. 10260960 -> 1000002700970: (Sometimes a broader definition is used: one can allow longer strings of terminals or single nonterminals without anything else, making languages easier to denote while still defining the same class of languages.) 10260970 -> 1000002700980: The language defined above is not regular, but the language \left \{ a^{n}b^{m} \,| \, m,n \ge 1 \right \} (at least 1 a followed by at least 1 b, where the numbers may be different) is, as it can be defined by the grammar G_3 with N=\left \{S, A,B\right \}, \Sigma=\left \{a,b\right \}, S the start symbol, and the following production rules: 10260980 -> 1000002700990: S \rightarrow aA 10260990 -> 1000002701000: A \rightarrow aA 10261000 -> 1000002701010: A \rightarrow bB 10261010 -> 1000002701020: B \rightarrow bB 10261020 -> 1000002701030: B \rightarrow \epsilon 10261030 -> 1000002701040: All languages generated by a regular grammar can be recognized in linear time by a finite state machine. 10261040 -> 1000002701050: Although, in practice, regular grammars are commonly expressed using regular expressions, some forms of regular expression used in practice do not strictly generate the regular languages and do not show linear-time recognition performance due to those deviations. 10261050 -> 1000002701060: Other forms of generative grammars 10261060 -> 1000002701070: Many extensions and variations on Chomsky's original hierarchy of formal grammars have been developed more recently, both by linguists and by computer scientists, usually either in order to increase their expressive power or in order to make them easier to analyze or parse. 10261070 -> 1000002701080: Some forms of grammars developed include: 10261080 -> 1000002701090: Tree-adjoining grammars increase the expressiveness of conventional generative grammars by allowing rewrite rules to operate on parse trees instead of just strings. 10261090 -> 1000002701100: Affix grammars and attribute grammars allow rewrite rules to be augmented with semantic attributes and operations, useful both for increasing grammar expressiveness and for constructing practical language translation tools. 10261100 -> 1000002701110: Analytic grammars 10261110 -> 1000002701120: Though there is a tremendous body of literature on parsing algorithms, most of these algorithms assume that the language to be parsed is initially described by means of a generative formal grammar, and that the goal is to transform this generative grammar into a working parser.
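The earlier claim that all languages generated by a regular grammar can be recognized in linear time by a finite state machine can be made concrete in a few lines. The following Python sketch is an illustration added for the example grammar G_3 above (the dictionary encoding and function name are invented here, not taken from any library); it reads each right-linear rule as a machine transition and decides membership in \left \{ a^{n}b^{m} \,|\, m,n \ge 1 \right \} in a single left-to-right pass:

```python
# Regular grammar G_3 (S -> aA, A -> aA, A -> bB, B -> bB, B -> epsilon)
# read as a finite state machine: each rule "X -> tY" becomes a transition
# from state X to state Y on terminal t, and "B -> epsilon" makes B accepting.
TRANSITIONS = {
    ("S", "a"): "A",
    ("A", "a"): "A",
    ("A", "b"): "B",
    ("B", "b"): "B",
}
ACCEPTING = {"B"}

def accepts(string):
    """Return True iff string is in { a^n b^m | n, m >= 1 }; runs in O(len(string))."""
    state = "S"
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:         # no applicable rule: reject immediately
            return False
    return state in ACCEPTING

print(accepts("aaabb"))  # True
print(accepts("ab"))     # True
print(accepts("ba"))     # False
print(accepts("a"))      # False (at least one b is required)
```

The same idea, extended with sets of states to handle nondeterministic rule choices, applies to any grammar in the restricted regular form described above; this construction is what underlies the linear-time claim.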
10261120 -> 1000002701130: Strictly speaking, a generative grammar does not in any way correspond to the algorithm used to parse a language, and various algorithms have different restrictions on the form of production rules that are considered well-formed. 10261130 -> 1000002701140: An alternative approach is to formalize the language in terms of an analytic grammar in the first place, which more directly corresponds to the structure and semantics of a parser for the language. 10261140 -> 1000002701150: Examples of analytic grammar formalisms include the following: 10261150 -> 1000002701160: The Language Machine directly implements unrestricted analytic grammars. 10261160 -> 1000002701170: Substitution rules are used to transform an input to produce outputs and behaviour. 10261170 -> 1000002701180: The system can also produce the lm-diagram which shows what happens when the rules of an unrestricted analytic grammar are being applied. 10261180 -> 1000002701190: Top-down parsing language (TDPL): a highly minimalist analytic grammar formalism developed in the early 1970s to study the behavior of top-down parsers. 10261190 -> 1000002701200: Link grammars: a form of analytic grammar designed for linguistics, which derives syntactic structure by examining the positional relationships between pairs of words. 10261200 -> 1000002701210: Parsing expression grammars (PEGs): a more recent generalization of TDPL designed around the practical expressiveness needs of programming language and compiler writers. Free software 10270010 -> 1000002800020: Free software 10270020 -> 1000002800030: Free software or software libre is software that can be used, studied, and modified without restriction, and which can be copied and redistributed in modified or unmodified form either without restriction, or with minimal restrictions only to ensure that further recipients can also do these things. 10270030 -> 1000002800040: In practice, for software to be distributed as free software, the human readable form of the program (the "source code") must be made available to the recipient along with a notice granting the above permissions. 10270040 -> 1000002800050: Such a notice is a "free software licence", or, in theory, could be a notice saying that the source code is released into the public domain. 10270050 -> 1000002800060: The free software movement was conceived in 1983 by Richard Stallman to make these freedoms available to every computer user. 10270060 -> 1000002800070: From the late 1990s onward, alternative terms for free software came into use. 10270070 -> 1000002800080: "Open source software" is the most common such alternative term. 10270080 -> 1000002800090: Others include "software libre", "free, libre and open-source software" ("FOSS", or, with "libre", "FLOSS"). 10270090 -> 1000002800100: The antonym of free software is "proprietary software" or non-free software. 10270100 -> 1000002800110: Free software is distinct from "freeware" which is proprietary software made available free of charge. 10270110 -> 1000002800120: Users usually cannot study, modify, or redistribute freeware. 10270120 -> 1000002800130: Since free software may be freely redistributed, it generally is available at little or no cost. 10270130 -> 1000002800140: Free software business models are usually based on adding value such as support, training, customization, integration, or certification. 
10270140 -> 1000002800150: At the same time, some business models which work with proprietary software are not compatible with free software, such as those that depend on a user paying for a licence in order to lawfully use a software product. 10270150 -> 1000002800160: History 10270160 -> 1000002800170: In the 1950s, 1960s, and 1970s, it was normal for computer users to have the freedoms that are provided by free software. 10270170 -> 1000002800180: Software was commonly shared by individuals who used computers and by hardware manufacturers who were glad that people were making software that made their hardware useful. 10270180 -> 1000002800190: In the 1970s and early 1980s, the software industry began using technical measures (such as only distributing binary copies of computer programs) to prevent computer users from being able to study and modify software.. 10270190 -> 1000002800200: In 1980 copyright law was extended to computer programs. 10270200 -> 1000002800210: In 1983, Richard Stallman, longtime member of the hacker community at the MIT Artificial Intelligence Laboratory, announced the GNU project, saying that he had become frustrated with the effects of the change in culture of the computer industry and its users. 10270210 -> 1000002800220: Software development for the GNU operating system began in January 1984, and the Free Software Foundation (FSF) was founded in October 1985. 10270220 -> 1000002800230: He developed a free software definition and the concept of "copyleft", designed to ensure software freedom for all. 10270230 -> 1000002800240: Free software is a widespread international concept, producing software used by individuals, large organizations, and governmental administrations. 10270240 -> 1000002800250: Free software has a very high market penetration in server-side Internet applications such as the Apache web server, MySQL database, and PHP scripting language. 10270250 -> 1000002800260: Completely free computing environments are available as large packages of basic system software, such as the many GNU/Linux distributions and FreeBSD. 10270260 -> 1000002800270: Free software developers have also created free versions of almost all commonly used desktop applications, including Web browsers, office productivity suites, and multimedia players. 10270270 -> 1000002800280: It is important to note, however, that in many categories, free software for individual workstations or home users has only a fraction of the market share of its proprietary competitors. 10270280 -> 1000002800290: Most free software is distributed online without charge, or off-line at the marginal cost of distribution, but this pricing model is not required, and people may sell copies of free software programs for any price. 10270290 -> 1000002800300: The economic viability of free software has been recognised by large corporations such as IBM, Red Hat, and Sun Microsystems. 10270300 -> 1000002800310: Many companies whose core business is not in the IT sector choose free software for their Internet information and sales sites, due to the lower initial capital investment and ability to freely customize the application packages. 
10270310 -> 1000002800320: Also, some non-software industries are beginning to use techniques similar to those used in free software development for their research and development process; scientists, for example, are looking towards more open development processes, and hardware such as microchips are beginning to be developed with specifications released under copyleft licenses (see the OpenCores project, for instance). 10270320 -> 1000002800330: Creative Commons and the free culture movement have also been largely influenced by the free software movement. 10270330 -> 1000002800340: Naming 10270340 -> 1000002800350: The FSF recommends using the term "free software" rather than "open source software" because that term and the associated marketing campaign focuses on the technical issues of software development, avoiding the issue of user freedoms. 10270350 -> 1000002800360: "Libre" is used to avoid the ambiguity of the word "free". 10270360 -> 1000002800370: However, amongst English speakers, libre is primarily only used within the free software movement. 10270370 -> 1000002800380: Definition 10270380 -> 1000002800390: The first formal definition of free software was published by FSF in February 1986. 10270390 -> 1000002800400: That definition, written by Richard Stallman, is still maintained today and states that software is free software if people who receive a copy of the software have the following four freedoms: 10270400 -> 1000002800410: Freedom 0: The freedom to run the program for any purpose. 10270410 -> 1000002800420: Freedom 1: The freedom to study and modify the program. 10270420 -> 1000002800430: Freedom 2: The freedom to copy the program so you can help your neighbor. 10270430 -> 1000002800440: Freedom 3: The freedom to improve the program, and release your improvements to the public, so that the whole community benefits. 10270440 -> 1000002800450: Freedoms 1 and 3 require source code to be available because studying and modifying software without its source code is highly impractical. 10270450 -> 1000002800460: Thus, free software means that computer users have the freedom to cooperate with whom they choose, and to control the software they use. 10270460 -> 1000002800470: To summarize this into a remark distinguishing libre (freedom) software from gratis (zero price) software, Richard Stallman said: "Free software is a matter of liberty, not price. 10270470 -> 1000002800480: To understand the concept, you should think of 'free' as in 'free speech', not as in 'free beer'". 10270480 -> 1000002800490: In the late 90s, other groups published their own definitions which describe an almost identical set of software. 10270490 -> 1000002800500: The most notable are Debian Free Software Guidelines published in 1997, and the Open Source Definition, published in 1998. 10270500 -> 1000002800510: The BSD-based operating systems, such as FreeBSD, OpenBSD, and NetBSD, do not have their own formal definitions of free software. 10270510 -> 1000002800520: Users of these systems generally find the same set of software to be acceptable, but sometimes see copyleft as restrictive. 10270520 -> 1000002800530: They generally advocate permissive free software licenses, which allow others to make software based on their source code, and then release the modified result as proprietary software. 10270530 -> 1000002800540: Their view is that this permissive approach is more free. 
10270540 -> 1000002800550: The Kerberos, X.org, and Apache software licenses are substantially similar in intent and implementation. 10270550 -> 1000002800560: All of these software packages originated in academic institutions interested in wide technology transfer (University of California, MIT, and UIUC). 10270560 -> 1000002800570: Examples of free software 10270570 -> 1000002800580: The Free Software Directory is a free software project that maintains a large database of free software packages. 10270580 -> 1000002800590: Notable free software 10270590 -> 1000002800600: GUI related 10270600 -> 1000002800610: X Window System 10270610 -> 1000002800620: GNOME 10270620 -> 1000002800630: KDE 10270630 -> 1000002800640: Xfce desktop environments 10270640 -> 1000002800650: OpenOffice.org office suite 10270650 -> 1000002800660: Mozilla and Firefox web browsers. 10270660 -> 1000002800670: Typesetting and document preparation systems 10270670 -> 1000002800680: TeX 10270680 -> 1000002800690: LaTeX 10270690 -> 1000002800700: Graphics tools like GIMP image graphics editor and Blender 3D animation program. 10270700 -> 1000002800710: Text editors like vi or emacs. 10270710 -> 1000002800720: ogg is a free software multimedia container, used to hold ogg vorbis sound and ogg theora video. 10270720 -> 1000002800730: Relational database systems 10270730 -> 1000002800740: MySQL 10270740 -> 1000002800750: PostgreSQL 10270750 -> 1000002800760: GCC compilers, GDB debugger and the GNU C Library. 10270760 -> None: Programming languages 10270770 -> None: Java 10270780 -> None: Perl 10270790 -> None: PHP 10270800 -> None: Python 10270810 -> None: Lua 10270820 -> None: Ruby 10270830 -> None: Tcl 10270840 -> 1000002800770: Servers 10270850 -> 1000002800780: Apache web server 10270860 -> 1000002800790: BIND name server 10270870 -> 1000002800800: Sendmail mail transport 10270880 -> 1000002800810: Samba file server. 10270890 -> 1000002800820: Operating systems 10270900 -> 1000002800830: GNU/Linux 10270910 -> 1000002800840: BSD 10270920 -> 1000002800850: Darwin 10270930 -> 1000002800860: OpenSolaris 10270940 -> 1000002800870: Free software licenses 10270950 -> 1000002800880: All free software licenses must grant people all the freedoms discussed above. 10270960 -> 1000002800890: However, unless the applications' licenses are compatible, combining programs by mixing source code or directly linking binaries is problematic, because of license technicalities. 10270970 -> 1000002800900: Programs indirectly connected together may avoid this problem. 10270980 -> 1000002800910: The majority of free software uses a small set of licenses. 10270990 -> 1000002800920: The most popular of these licenses are: 10271000 -> 1000002800930: the GNU General Public License 10271010 -> 1000002800940: the GNU Lesser General Public License 10271020 -> 1000002800950: the BSD License 10271030 -> 1000002800960: the Mozilla Public License 10271040 -> 1000002800970: the MIT License 10271050 -> 1000002800980: the Apache License 10271060 -> 1000002800990: The Free Software Foundation and the Open Source Initiative both publish lists of licenses that they find to comply with their own definitions of free software and open-source software respectively. 10271070 -> 1000002801000: List of FSF approved software licenses 10271080 -> 1000002801010: List of OSI approved software licenses 10271090 -> 1000002801020: These lists are necessarily incomplete, because a license need not be known by either organization in order to provide these freedoms. 
10271100 -> 1000002801030: Apart from these two organizations, the Debian project is seen by some to provide useful advice on whether particular licenses comply with their Debian Free Software Guidelines. 10271110 -> 1000002801040: Debian doesn't publish a list of approved licenses, so its judgments have to be tracked by checking what software they have allowed into their software archives. 10271120 -> 1000002801050: That is summarized at the Debian web site. 10271130 -> 1000002801060: However, it is rare that a license is announced as being in-compliance by either FSF or OSI guidelines and not vice versa (the Netscape Public License used for early versions of Mozilla being an exception), so exact definitions of the terms have not become hot issues. 10271140 -> 1000002801070: Permissive and copyleft licenses 10271150 -> 1000002801080: The FSF categorizes licenses in the following ways: 10271160 -> 1000002801090: Public domain software - the copyright has expired, the work was not copyrighted or the author has abandoned the copyright. 10271170 -> 1000002801100: Since public-domain software lacks copyright protection, it may be freely incorporated into any work, whether proprietary or free. 10271180 -> 1000002801110: Permissive licenses, also called BSD-style because they are applied to much of the software distributed with the BSD operating systems. 10271190 -> 1000002801120: The author retains copyright solely to disclaim warranty and require proper attribution of modified works, but permits redistribution and modification in any work, even proprietary ones. 10271200 -> 1000002801130: Copyleft licenses, the GNU General Public License being the most prominent. 10271210 -> 1000002801140: The author retains copyright and permits redistribution and modification provided all such redistribution is licensed under the same license. 10271220 -> 1000002801150: Additions and modifications by others must also be licensed under the same 'copyleft' license whenever they are distributed with part of the original licensed product. 10271230 -> 1000002801160: Security and reliability 10271240 -> 1000002801170: There is debate over the security of free software in comparison to proprietary software, with a major issue being security through obscurity. 10271250 -> 1000002801180: A popular quantitative test in computer security is using relative counting of known unpatched security flaws. 10271260 -> 1000002801190: Generally, users of this method advise avoiding products which lack fixes for known security flaws, at least until a fix is available. 10271270 -> 1000002801200: Some claim that this method is biased by counting more vulnerabilities for the free software, since its source code is accessible and its community is more forthcoming about what problems exist. 10271280 -> 1000002801210: Free software advocates rebut that even if proprietary software does not have "published" flaws, flaws could still exist and possibly be known to malicious users. 10271290 -> 1000002801220: The ability of users to view and modify the source code allows many more people to potentially analyse the code and possibly to have a higher rate of finding bugs and flaws than an average sized corporation could manage. 10271300 -> 1000002801230: Users having access to the source code also makes creating and deploying spyware far more difficult. 10271310 -> 1000002801240: David A. Wheeler has published research concluding that free software is quantitatively more reliable than proprietary software. 
10271320 -> 1000002801250: Adoption 10271330 -> 1000002801260: Free software played a part in the development of the Internet, the World Wide Web and the infrastructure of dot-com companies. 10271340 -> 1000002801270: Free software allows users to cooperate in enhancing and refining the programs they use; free software is a pure public good rather than a private good. 10271350 -> 1000002801280: Companies that contribute to free software can increase commercial innovation amidst the void of patent cross licensing lawsuits. 10271360 -> 1000002801290: (See mpeg2 patent holders) 10271370 -> 1000002801300: Under the free software business model, free software vendors may charge a fee for distribution and offer pay support and software customization services. 10271380 -> 1000002801310: Proprietary software uses a different business model, where a customer of the proprietary software pays a fee for a license to use the software. 10271390 -> 1000002801320: This license may grant the customer the ability to configure some or no parts of the software themselves. 10271400 -> 1000002801330: Often some level of support is included in the purchase of proprietary software, but additional support services (especially for enterprise applications) are usually available for an additional fee. 10271410 -> 1000002801340: Some proprietary software vendors will also customize software for a fee. 10271420 -> 1000002801350: Free software is generally available at little to no cost and can result in permanently lower costs compared to proprietary software. 10271430 -> 1000002801360: With free software, businesses can fit software to their specific needs by changing the software themselves or by hiring programmers to modify it for them. 10271440 -> 1000002801370: Free software often has no warranty, and more importantly, generally does not assign legal liability to anyone. 10271450 -> 1000002801380: However, warranties are permitted between any two parties upon the condition of the software and its usage. 10271460 -> 1000002801390: Such an agreement is made separately from the free software license. 10271470 -> 1000002801400: Controversies 10271480 -> 1000002801410: Binary blobs 10271490 -> 1000002801420: In 2006, OpenBSD started the first campaign against the use of binary blobs, in kernels. 10271500 -> 1000002801430: Blobs are usually freely distributable device drivers for hardware from vendors that do not reveal driver source code to users or developers. 10271510 -> 1000002801440: This restricts the users' freedom to effectively modify the software and distribute modified versions. 10271520 -> 1000002801450: Also, since the blobs are undocumented and may have bugs, they pose a security risk to any operating system whose kernel includes them. 10271530 -> 1000002801460: The proclaimed aim of the campaign against blobs is to collect hardware documentation that allows developers to write free software drivers for that hardware, ultimately enabling all free operating systems to become or remain blob-free. 10271540 -> 1000002801470: The issue of binary blobs in the Linux kernel and other device drivers motivated some developers in Ireland to launch gNewSense, a GNU/Linux distribution with all the binary blobs removed. 
10271550 -> 1000002801480: The project received support from the Free Software Foundation 10271560 -> 1000002801490: BitKeeper 10271570 -> 1000002801500: Larry McVoy invited high-profile free software projects to use his proprietary versioning system, BitKeeper, free of charge, in order to attract paying users. 10271580 -> 1000002801510: In 2002, Linux coordinator Linus Torvalds decided to use BitKeeper to develop the Linux kernel, a free software project, claiming no free software alternative met his needs. 10271590 -> 1000002801520: This controversial decision drew criticism from several sources, including the Free Software Foundation's founder Richard Stallman. 10271600 -> 1000002801530: Following the apparent reverse engineering of BitKeeper's protocols, McVoy withdrew permission for gratis use by free software projects, leading the Linux kernel community to develop a free software replacement in Git. 10271610 -> 1000002801540: Patent deals 10271620 -> 1000002801550: In November 2006, the Microsoft and Novell software corporations announced a controversial partnership involving, among other things, patent protection for some customers of Novell under certain conditions. Freeware 10280010 -> 1000002900020: Freeware 10280020 -> 1000002900030: Freeware is computer software that is available for use at no cost or for an optional fee. 10280030 -> 1000002900040: Freeware is often made available in a binary-only, proprietary form, thus making it distinct from free software. 10280040 -> 1000002900050: Proprietary freeware allows authors to contribute something for the benefit of the community, while at the same time allowing them to retain control of the source code and preserve its business potential. 10280050 -> 1000002900060: Freeware is different from shareware, where the user is obliged to pay (e.g. after some trial period or for additional functionality). 10280060 -> 1000002900070: History 10280070 -> 1000002900080: The term freeware was coined by Andrew Fluegelman when he wanted to sell a communications program named PC-Talk that he had created but for which he did not wish to use traditional methods of distribution because of their cost. 10280080 -> 1000002900090: Fluegelman actually distributed PC-Talk via a process now referred to as shareware. 10280090 -> 1000002900100: Current use of the term freeware does not necessarily match the original concept by Andrew Fluegelman. 10280100 -> 1000002900110: Criteria 10280110 -> 1000002900120: The only criterion for being classified as freeware is that the software must be fully functional for an unlimited time with no monetary cost. 10280120 -> 1000002900130: The software license may impose one or more other restrictions on the type of use including personal use, individual use, non-profit use, non-commercial use, academic use, commercial use or any combination of these. 10280130 -> 1000002900140: For instance, the license may be "free for personal, non-commercial use." 10280140 -> 1000002900150: Everything created with the freeware programs can be distributed at no cost (for example graphic, documents, or sounds made by user). French language 10290010 -> 1000003000020: French language 10290020 -> 1000003000030: French (français, pronounced [fʁɑ̃sɛ]) is today spoken around the world by 72 to 130 million people as a native language, and by about 190 to 600 million people as a second or third language, with significant speakers in 54 countries. 
10290030 -> 1000003000040: Most native speakers of the language live in France, where the language originated. 10290040 -> 1000003000050: Most of the rest live in Canada, Belgium and Switzerland. 10290050 -> 1000003000060: French is a descendant of the Latin language of the Roman Empire, as are languages such as Portuguese, Spanish, Italian, Catalan and Romanian. 10290060 -> 1000003000070: Its development was also influenced by the native Celtic languages of Roman Gaul and by the Germanic language of the post-Roman Frankish invaders. 10290070 -> 1000003000080: It is an official language in 29 countries, most of which form what is called in French La Francophonie, the community of French-speaking nations. 10290080 -> 1000003000090: It is an official language of all United Nations agencies and a large number of international organizations. 10290090 -> 1000003000100: According to the European Union, 129 million (26% of the 497,198,740) people in 27 member states speak French, of whom 59 million (12%) speak it natively and 69 million (14%) claim to speak it as a second language, which makes it the third most spoken second language in the Union, after English and German. 10290100 -> 1000003000110: Geographic distribution 10290110 -> 1000003000120: Europe 10290120 -> 1000003000130: Legal status in France 10290130 -> 1000003000140: Per the Constitution of France, French has been the official language since 1992 (although previous legal texts have made it official since 1539, see ordinance of Villers-Cotterêts). 10290140 -> 1000003000150: France mandates the use of French in official government publications, public education outside of specific cases (though these provisions are often ignored) and legal contracts; advertisements must bear a translation of foreign words. 10290150 -> 1000003000160: In addition to French, there are also a variety of regional languages. 10290160 -> 1000003000170: France has signed the European Charter for Regional Languages but has not ratified it since that would go against the 1958 Constitution. 10290170 -> 1000003000180: Switzerland 10290180 -> 1000003000190: French is one of the four official languages of Switzerland (along with German, Italian, and Romansh) and is spoken in the part of Switzerland called Romandie. 10290190 -> 1000003000200: French is the native language of about 20% of the Swiss population. 10290200 -> 1000003000210: Belgium 10290210 -> 1000003000220: In Belgium, French is the official language of Wallonia (excluding the East Cantons, which are German-speaking) and one of the two official languages—along with Dutch—of the Brussels-Capital Region, where it is spoken by the majority of the population, though often not as their primary language. 10290220 -> 1000003000230: French and German are not official languages nor recognised minority languages in the Flemish Region, although along the borders with the Walloon and Brussels-Capital regions, there are a dozen municipalities with language facilities for French-speakers; a mirroring situation exists for the Walloon Region with respect to the Dutch and German languages. 10290230 -> 1000003000240: In total, native French-speakers make up about 40% of the country's population, and the remaining 60% speak Dutch natively; of the latter group, 59% claim to speak French as a second language. 10290240 -> 1000003000250: French is thus known by an estimated 75% of all Belgians, either as a mother tongue or as a second or third language.
10290250 -> 1000003000260: Monaco and Andorra 10290260 -> 1000003000270: Although Monégasque is the national language of the Principality of Monaco, French is the only official language, and French nationals make up some 47% of the population. 10290270 -> 1000003000280: Catalan is the only official language of Andorra; however, French is commonly used due to the proximity to France. 10290280 -> 1000003000290: French nationals make up 7% of the population. 10290290 -> 1000003000300: Italy 10290300 -> 1000003000310: French is also an official language, along with Italian, in the province of Aosta Valley, Italy. 10290310 -> 1000003000320: In addition, a number of Franco-Provençal dialects are spoken in the province, although they do not have official recognition. 10290320 -> 1000003000330: Luxembourg 10290330 -> 1000003000340: French is one of three official languages of the Grand Duchy of Luxembourg; 10290340 -> 1000003000350: the other official languages of Luxembourg are 10290350 -> 1000003000360: German and 10290360 -> 1000003000370: Luxembourgish. 10290370 -> 1000003000380: Luxembourgish is the natively spoken language of Luxembourg; 10290380 -> 1000003000390: Luxembourg's education system is trilingual: the first years of primary school are in Luxembourgish before changing to German, while in secondary school the language of instruction changes to French. 10290390 -> 1000003000400: The Channel Islands 10290400 -> 1000003000410: Although Jersey and Guernsey, the two bailiwicks collectively referred to as the Channel Islands, are separate entities, both use French to some degree, mostly in an administrative capacity. 10290410 -> 1000003000420: Jersey Legal French is the standardized variety used in Jersey. 10290420 -> 1000003000430: The Americas 10290430 -> 1000003000440: Legal status in Canada 10290440 -> 1000003000450: About 7 million Canadians are native French-speakers, of whom 6 million live in Quebec, and French is one of Canada's two official languages (the other being English). 10290450 -> 1000003000460: Various provisions of the Canadian Charter of Rights and Freedoms deal with Canadians' right to access services in both languages, including the right to a publicly funded education in the minority language of each province, where numbers warrant in a given locality. 10290460 -> 1000003000470: By law, the federal government must operate and provide services in both English and French, proceedings of the Parliament of Canada must be translated into both these languages, and most products sold in Canada must have labeling in both languages. 10290470 -> 1000003000480: Overall, about 13% of Canadians have knowledge of French only, while 18% have knowledge of both English and French. 10290480 -> 1000003000490: In contrast, over 82% of the population of Quebec speaks French natively, and almost 96% speak it as either their first or second language. 10290490 -> 1000003000500: It has been the sole official language of Quebec since 1974. 10290500 -> 1000003000510: The legal status of French was further strengthened with the 1977 adoption of the Charter of the French Language (popularly known as Bill 101), which guarantees that every person has a right to have the civil administration, the health and social services, corporations, and enterprises in Quebec communicate with him in French.
10290510 -> 1000003000520: While the Charter mandates that certain provincial government services, such as those relating to health and education, be offered to the English minority in its language, where numbers warrant, its primary purpose is to cement the role of French as the primary language used in the public sphere. 10290520 -> 1000003000530: The provision of the Charter that has arguably had the most significant impact mandates French-language education unless a child's parents or siblings have received the majority of their own primary education in English within Canada, with minor exceptions. 10290530 -> 1000003000540: This measure has reversed a historical trend whereby a large number of immigrant children would attend English schools. 10290540 -> 1000003000550: In so doing, the Charter has greatly contributed to the "visage français" (French face) of Montreal in spite of its growing immigrant population. 10290550 -> 1000003000560: Other provisions of the Charter have been ruled unconstitutional over the years, including those mandating French-only commercial signs, court proceedings, and debates in the legislature. 10290560 -> 1000003000570: Though none of these provisions are still in effect today, some continued to be on the books for a time even after courts had ruled them unconstitutional as a result of the government's decision to invoke the so-called notwithstanding clause of the Canadian constitution to override constitutional requirements. 10290570 -> 1000003000580: In 1993, the Charter was rewritten to allow signage in other languages so long as French was markedly "predominant." 10290580 -> 1000003000590: Another section of the Charter guarantees every person the right to work in French, meaning the right to have all communications with one's superiors and coworkers in French, as well as the right not to be required to know another language as a condition of hiring, unless this is warranted by the nature of one's duties, such as by reason of extensive interaction with people located outside the province or similar reasons. 10290590 -> 1000003000600: This section has not been as effective as had originally been hoped, and has faded somewhat from public consciousness. 10290600 -> 1000003000610: As of 2006, approximately 65% of the workforce on the island of Montreal predominantly used French in the workplace. 10290610 -> 1000003000620: The only other province that recognizes French as an official language is New Brunswick, which is officially bilingual, like the nation as a whole. 10290620 -> 1000003000630: Outside of Quebec, the highest number of Francophones in Canada, 485,000, excluding those who claim multiple mother tongues, reside in Ontario, whereas New Brunswick, home to the vast majority of Acadians, has the highest percentage of Francophones after Quebec, 33%, or 237,000. 10290630 -> 1000003000640: In Ontario, Nova Scotia, Prince Edward Island, and Manitoba, French does not have full official status, although the provincial governments do provide some French-language services in all communities where significant numbers of Francophones live. 10290640 -> 1000003000650: Canada's three northern territories (Yukon, Northwest Territories, and Nunavut) all recognize French as an official language as well. 10290650 -> 1000003000660: All provinces make some effort to accommodate the needs of their Francophone citizens, although the level and quality of French-language service vary significantly from province to province. 
10290660 -> 1000003000670: The Ontario French Language Services Act, adopted in 1986, guarantees French language services in that province in regions where the Francophone population exceeds 10% of the total population, as well as communities with Francophone populations exceeding 5,000, and certain other designated areas; this has the most effect in the north and east of the province, as well as in other larger centres such as Ottawa, Toronto, Hamilton, Mississauga, London, Kitchener, St. Catharines, Greater Sudbury and Windsor. 10290670 -> 1000003000680: However, the French Language Services Act does not confer the status of "official bilingualism" on these cities, as that designation carries with it implications which go beyond the provision of services in both languages. 10290680 -> 1000003000690: The City of Ottawa's language policy (by-law 2001-170) allows employees to work in their official language of choice and be supervised in the language of choice. 10290690 -> 1000003000700: Canada has the status of member state in the Francophonie, while the provinces of Quebec and New Brunswick are recognized as participating governments. 10290700 -> 1000003000710: Ontario is currently seeking to become a full member on its own. 10290710 -> 1000003000720: Haiti 10290720 -> 1000003000730: French is an official language of Haiti, although it is mostly spoken by the upper class, while Haitian Creole (a French-based creole language) is more widely spoken as a mother tongue. 10290730 -> 1000003000740: French overseas territories 10290740 -> 1000003000750: French is also the official language in France's overseas territories of French Guiana, Guadeloupe, Martinique, Saint Barthélemy, St. Martin and Saint-Pierre and Miquelon. 10290750 -> 1000003000760: The United States 10290760 -> 1000003000770: Although it has no official recognition on a federal level, French is the third most-spoken language in the United States, after English and Spanish, and the second most-spoken in the states of Louisiana, Maine, Vermont and New Hampshire. 10290770 -> 1000003000780: Louisiana is home to two distinct dialects, Cajun French and Creole French 10290780 -> 1000003000790: Africa 10290790 -> 1000003000800: A majority of the world's French-speaking population lives in Africa. 10290800 -> 1000003000810: According to the 2007 report by the Organisation internationale de la Francophonie, an estimated 115 million African people spread across 31 francophone African countries can speak French either as a first or second language. 10290810 -> 1000003000820: French is mostly a second language in Africa, but in some areas it has become a first language, such as in the region of Abidjan, Côte d'Ivoire and in Libreville, Gabon. 10290820 -> 1000003000830: It is impossible to speak of a single form of African French, but rather of diverse forms of African French which have developed due to the contact with many indigenous African languages. 10290830 -> 1000003000840: In the territories of the Indian Ocean, the French language is often spoken alongside French-derived creole languages, the major exception being Madagascar. 10290840 -> 1000003000850: There, a Malayo-Polynesian language (Malagasy) is spoken alongside French. 10290850 -> 1000003000860: The French language has also met competition with English since English has been the official language in Mauritius and the Seychelles for a long time and has recently become an official language of Madagascar. 
10290860 -> 1000003000870: Sub-Saharan Africa is the region where the French language is most likely to expand, owing to the expansion of education, and it is also there that the language has evolved the most in recent years. 10290870 -> 1000003000880: Some vernacular forms of French in Africa can be difficult to understand for French speakers from other countries, but written forms of the language are very closely related to those of the rest of the French-speaking world. 10290880 -> 1000003000890: French is an official language of many African countries, most of them former French or Belgian colonies: 10290890 -> 1000003000900: Benin 10290900 -> 1000003000910: Burkina Faso 10290910 -> 1000003000920: Burundi 10290920 -> 1000003000930: Cameroon 10290930 -> 1000003000940: Central African Republic 10290940 -> 1000003000950: Chad 10290950 -> 1000003000960: Comoros 10290960 -> 1000003000970: Congo (Brazzaville) 10290970 -> 1000003000980: Côte d'Ivoire 10290980 -> 1000003000990: Democratic Republic of the Congo 10290990 -> 1000003001000: Djibouti 10291000 -> 1000003001010: Equatorial Guinea (former colony of Spain) 10291010 -> 1000003001020: Gabon 10291020 -> 1000003001030: Guinea 10291030 -> 1000003001040: Madagascar 10291040 -> 1000003001050: Mali 10291050 -> 1000003001060: Niger 10291060 -> 1000003001070: Rwanda 10291070 -> 1000003001080: Senegal 10291080 -> 1000003001090: Seychelles 10291090 -> 1000003001100: Togo 10291100 -> 1000003001110: In addition, French is an administrative language and commonly used, though not on an official basis, in Mauritius and in the Maghreb states: 10291110 -> 1000003001120: Mauritania 10291120 -> 1000003001130: Algeria 10291130 -> 1000003001140: Morocco 10291140 -> 1000003001150: Tunisia. 10291150 -> 1000003001160: Various reforms have been implemented in recent decades in Algeria to improve the status of Arabic relative to French, especially in education. 10291160 -> 1000003001170: While the predominant European language in Egypt is English, French is considered to be a more sophisticated language by some elements of the Egyptian upper and upper-middle classes; for this reason, a typical educated Egyptian will learn French in addition to English at some point in his or her education. 10291170 -> 1000003001180: The perception of sophistication may be related to the use of French as the royal court language of Egypt during the nineteenth century. 10291180 -> 1000003001190: Egypt participates in La Francophonie. 10291190 -> 1000003001200: French is also the official language of Mayotte and Réunion, two overseas territories of France located in the Indian Ocean, as well as an administrative and educational language in Mauritius, along with English. 10291200 -> 1000003001210: Asia 10291210 -> 1000003001220: Lebanon 10291220 -> 1000003001230: French was an official language of Lebanon, along with Arabic, until 1941, the year of the country's declaration of independence from France. 10291230 -> 1000003001240: French is still seen as an official language by the Lebanese people, as it is widely used, especially for administrative purposes, and is taught in schools as a primary language along with Arabic. 10291240 -> 1000003001250: Southeast Asia 10291250 -> 1000003001260: French is an administrative language in Laos and Cambodia. 10291260 -> 1000003001270: French was historically spoken by the elite in the leased territory Guangzhouwan in southern China.
10291270 -> 1000003001280: In colonial Vietnam, the elites spoke French and many who worked for the French spoke a French creole known as "Tây Bồi" (now extinct). 10291280 -> 1000003001290: India 10291290 -> 1000003001300: French has official status in the Indian Union Territory of Pondicherry, along with the regional language Tamil, and some students in Tamil Nadu may opt for French as their third or fourth language (usually after English, Tamil and Hindi). 10291300 -> 1000003001310: French is also commonly taught as a third language in secondary schools in most cities of Maharashtra State, including Mumbai, as part of the Secondary (X-SSC) and Higher Secondary (XII-HSC) certificate examinations. 10291310 -> 1000003001320: Oceania 10291320 -> 1000003001330: French is also a second official language of the Pacific Island nation of Vanuatu, along with France's territories of French Polynesia, Wallis & Futuna and New Caledonia. 10291330 -> None: Dialects 10291340 -> None: Acadian French 10291350 -> None: African French 10291360 -> None: Aostan French 10291370 -> None: Belgian French 10291380 -> None: Cajun French 10291390 -> None: Canadian French 10291400 -> None: Cambodian French 10291410 -> None: Guyana French (see French Guiana) 10291420 -> None: Indian French 10291430 -> None: Jersey Legal French 10291440 -> None: Lao French 10291450 -> None: Levantine French (most commonly referred to as Lebanese French, very similar to Maghreb French) 10291460 -> None: Louisiana Creole French 10291470 -> None: Maghreb French (see also North African French) 10291480 -> None: Meridional French 10291490 -> None: Metropolitan French 10291500 -> None: New Caledonian French 10291510 -> None: Newfoundland French 10291520 -> None: Oceanic French 10291530 -> None: Quebec French 10291540 -> None: South East Asian French 10291550 -> None: Swiss French 10291560 -> None: Vietnamese French 10291570 -> None: West Indian French 10291580 -> None: History 10291590 -> 1000003001340: Sounds 10291600 -> None: 10291610 -> 1000003001350: Although there are many French regional accents, only one version of the language is normally chosen as a model for foreign learners; it has no commonly used special name, but has been termed français neutre (neutral French). 10291620 -> 1000003001360: Voiced stops (i.e. {(IPA+/b d g/+/b d g/)}) are typically produced fully voiced throughout. 10291630 -> 1000003001370: Voiceless stops (i.e. {(IPA+/p t k/+/p t k/)}) are unaspirated. 10291640 -> 1000003001380: Nasals: The velar nasal {(IPA+/ŋ/+/ŋ/)} occurs only in final position in borrowed (usually English) words: parking, camping, swing. 10291650 -> 1000003001390: The palatal nasal {(IPA+/ɲ/+/ɲ/)} can occur in word-initial position (e.g. gnon), but it is most frequently found in intervocalic, onset position or word-finally (e.g. montagne). 10291660 -> 1000003001400: Fricatives: French has three pairs of homorganic fricatives distinguished by voicing, i.e. labiodental {(IPA+/f/–/v/+/f/–/v/)}, dental {(IPA+/s/–/z/+/s/–/z/)}, and palato-alveolar {(IPA+/ʃ/–/ʒ/+/ʃ/–/ʒ/)}. 10291670 -> 1000003001410: Notice that {(IPA+/s/–/z/+/s/–/z/)} are dental, like the plosives {(IPA+/t/–/d/+/t/–/d/)} and the nasal {(IPA+/n/+/n/)}. 10291680 -> 1000003001420: French has one rhotic whose pronunciation varies considerably among speakers and phonetic contexts. 10291690 -> 1000003001430: In general it is described as a voiced uvular fricative, as in {(IPA+[ʁu]+[ʁu])} roue "wheel". 10291700 -> 1000003001440: Vowels are often lengthened before this segment.
10291710 -> 1000003001450: It can be reduced to an approximant, particularly in final position (e.g. "fort") or reduced to zero in some word-final positions. 10291720 -> 1000003001460: For other speakers, a uvular trill is also fairly common, and an apical trill {(IPA+[r]+[r])} occurs in some dialects. 10291730 -> 1000003001470: Lateral and central approximants: The lateral approximant {(IPA+/l/+/l/)} is unvelarised in both onset (lire) and coda position (il). 10291740 -> 1000003001480: In the onset, the central approximants {(IPA+[w]+[w])}, {(IPA+[ɥ]+[ɥ])}, and {(IPA+[j]+[j])} each correspond to a high vowel, {(IPA+/u/+/u/)}, {(IPA+/y/+/y/)}, and {(IPA+/i/+/i/)} respectively. 10291750 -> 1000003001490: There are a few minimal pairs where the approximant and corresponding vowel contrast, but there are also many cases where they are in free variation. 10291760 -> 1000003001500: Contrasts between {(IPA+/j/+/j/)} and {(IPA+/i/+/i/)} occur in final position as in {(IPA+/pɛj/+/pɛj/)} paye "pay" vs. {(IPA+/pɛi/+/pɛi/)} pays "country". 10291770 -> 1000003001510: French pronunciation follows strict rules based on spelling, but French spelling is often based more on history than phonology. 10291780 -> 1000003001520: The rules for pronunciation vary between dialects, but the standard rules are: 10291790 -> 1000003001530: final consonants: Final single consonants, in particular s, x, z, t, d, n and m, are normally silent. 10291800 -> 1000003001540: (The final letters c, r, f and l, however, are normally pronounced.) 10291810 -> 1000003001550: When the following word begins with a vowel, though, a silent consonant may once again be pronounced, to provide a liaison or "link" between the two words. 10291820 -> 1000003001560: Some liaisons are mandatory, for example the s in les amants or vous avez; some are optional, depending on dialect and register, for example the first s in deux cents euros or euros irlandais; and some are forbidden, for example the s in beaucoup d'hommes aiment. 10291830 -> 1000003001570: The t of et is never pronounced and the silent final consonant of a noun is only pronounced in the plural and in set phrases like pied-à-terre. 10291840 -> 1000003001580: Note that in the case of a word ending d as in pied-à-terre, the consonant t is pronounced instead. 10291850 -> 1000003001590: Doubling a final n and adding a silent e at the end of a word (e.g. chien → chienne) makes it clearly pronounced. 10291860 -> 1000003001600: Doubling a final l and adding a silent e (e.g. gentil → gentille) adds a [j] sound. 10291870 -> 1000003001610: elision or vowel dropping: Some monosyllabic function words ending in a or e, such as je and que, drop their final vowel when placed before a word that begins with a vowel sound (thus avoiding a hiatus). 10291880 -> 1000003001620: The missing vowel is replaced by an apostrophe. (e.g. je ai is instead pronounced and spelt → j'ai). 10291890 -> 1000003001630: This gives for example the same pronunciation for l'homme qu'il a vu ("the man whom he saw") and l'homme qui l'a vu ("the man who saw him"). 10291900 -> 1000003001640: Orthography 10291910 -> 1000003001650: Nasal: n and m. 10291920 -> 1000003001660: When n or m follows a vowel or diphthong, the n or m becomes silent and causes the preceding vowel to become nasalized (i.e. pronounced with the soft palate extended downward so as to allow part of the air to leave through the nostrils). 10291930 -> 1000003001670: Exceptions are when the n or m is doubled, or immediately followed by a vowel. 
10291940 -> 1000003001680: The prefixes en- and em- are always nasalized. 10291950 -> 1000003001690: The rules get more complex than this but may vary between dialects. 10291960 -> 1000003001700: Digraphs: French does not introduce extra letters or diacritics to specify its large range of vowel sounds and diphthongs, rather it uses specific combinations of vowels, sometimes with following consonants, to show which sound is intended. 10291970 -> 1000003001710: Gemination: Within words, double consonants are generally not pronounced as geminates in modern French (but geminates can be heard in the cinema or TV news from as recently as the 1970s, and in very refined elocution they may still occur). 10291980 -> 1000003001720: For example, illusion is pronounced {(IPA+[ilyzjɔ̃]+[ilyzjɔ̃])} and not {(IPA+[illyzjɔ̃]+[illyzjɔ̃])}. 10291990 -> 1000003001730: But gemination does occur between words. 10292000 -> 1000003001740: For example, une info ("a news") is pronounced {(IPA+[ynɛ̃fo]+[ynɛ̃fo])}, whereas une nympho ("a nympho") is pronounced {(IPA+[ynnɛ̃fo]+[ynnɛ̃fo])}. 10292010 -> 1000003001750: Accents are used sometimes for pronunciation, sometimes to distinguish similar words, and sometimes for etymology alone. 10292020 -> 1000003001760: Accents that affect pronunciation 10292030 -> 1000003001770: The acute accent (l'accent aigu), é (e.g. école—school), means that the vowel is pronounced {(IPA+/e/+/e/)} instead of the default {(IPA+/ə/+/ə/)}. 10292040 -> 1000003001780: The grave accent (l'accent grave), è (e.g. élève—pupil) means that the vowel is pronounced {(IPA+/ɛ/+/ɛ/)} instead of the default {(IPA+/ə/+/ə/)}. 10292050 -> 1000003001790: The circumflex (l'accent circonflexe) ê (e.g. forêt—forest) shows that an e is pronounced {(IPA+/ɛ/+/ɛ/)} and that an o is pronounced {(IPA+/o/+/o/)}. 10292060 -> 1000003001800: In standard French it also signifies a pronunciation of {(IPA+/ɑ/+/ɑ/)} for the letter a, but this differentiation is disappearing. 10292070 -> 1000003001810: In the late 19th century, the circumflex was used in place of s where that letter was not to be pronounced. 10292080 -> 1000003001820: Thus, forest became forêt and hospital became hôpital. 10292090 -> 1000003001830: The diaeresis (le tréma) (e.g. naïf—foolish, Noël—Christmas) as in English, specifies that this vowel is pronounced separately from the preceding one, not combined and is not a schwa. 10292100 -> 1000003001840: The cedilla (la cédille) ç (e.g. garçon—boy) means that the letter c is pronounced {(IPA+/s/+/s/)} in front of the hard vowels a, o and u (c is otherwise {(IPA+/k/+/k/)} before a hard vowel). 10292110 -> 1000003001850: C is always pronounced {(IPA+/s/+/s/)} in front of the soft vowels e, i, and y, thus ç is never found in front of soft vowels. 10292120 -> 1000003001860: Accents with no pronunciation effect 10292130 -> 1000003001870: The circumflex does not affect the pronunciation of the letters i or u, and in most dialects, a as well. 10292140 -> 1000003001880: It usually indicates that an s came after it long ago, as in hôtel. 10292150 -> 1000003001890: All other accents are used only to distinguish similar words, as in the case of distinguishing the adverbs là and où ("there", "where") from the article la and the conjunction ou ("the" fem. sing., "or") respectively. 
10292160 -> 1000003001900: Grammar 10292170 -> 1000003001910: French grammar shares several notable features with most other Romance languages, including: 10292180 -> 1000003001920: the loss of Latin's declensions 10292190 -> 1000003001930: only two grammatical genders 10292200 -> 1000003001940: the development of grammatical articles from Latin demonstratives 10292210 -> 1000003001950: new tenses formed from auxiliaries 10292220 -> 1000003001960: French word order is Subject Verb Object, except when the object is a pronoun, in which case the word order is Subject Object Verb. 10292230 -> 1000003001970: Some rare archaisms allow for different word orders. 10292240 -> 1000003001980: Vocabulary 10292250 -> 1000003001990: The majority of French words derive from Vulgar Latin or were constructed from Latin or Greek roots. 10292260 -> 1000003002000: There are often pairs of words, one form being "popular" (noun) and the other one "savant" (adjective), both originating from Latin. 10292270 -> 1000003002010: Example: 10292280 -> 1000003002020: brother: frère / fraternel < from Latin frater 10292290 -> 1000003002030: finger: doigt / digital < from Latin digitus 10292300 -> 1000003002040: faith: foi / fidèle < from Latin fides 10292310 -> 1000003002050: cold: froid / frigide < from Latin frigidus 10292320 -> 1000003002060: eye: œil / oculaire < from Latin oculus 10292330 -> 1000003002070: In some examples there is a common word from Vulgar Latin and a more savant word borrowed directly from Medieval Latin or even Ancient Greek. 10292340 -> 1000003002080: Cheval—Concours équestre—Hippodrome 10292350 -> 1000003002090: The French words which have developed from Latin are usually less recognisable than Italian words of Latin origin because, as French evolved from Vulgar Latin, the unstressed final syllable of many words was dropped or elided into the following word. 10292360 -> 1000003002100: It is estimated that 12% (4,200) of common French words found in a typical dictionary such as the Petit Larousse or Micro-Robert Plus (35,000 words) are of foreign origin. 10292370 -> 1000003002110: About 25% (1,054) of these foreign words come from English and are fairly recent borrowings. 10292380 -> 1000003002120: The others are some 707 words from Italian, 550 from ancient Germanic languages, 481 from ancient Gallo-Romance languages, 215 from Arabic, 164 from German, 160 from Celtic languages, 159 from Spanish, 153 from Dutch, 112 from Persian and Sanskrit, 101 from Native American languages, 89 from other Asian languages, 56 from other Afro-Asiatic languages, 55 from Slavic languages and Baltic languages, 10 from Basque and 144 — about three percent — from other languages. 10292390 -> 1000003002130: Numerals 10292400 -> 1000003002140: The French counting system is partially vigesimal: twenty ({(Lang+vingt+fr+vingt)}) is used as a base number in the names of numbers from 60–99. 10292410 -> 1000003002150: The French word for eighty, for example, is {(Lang+quatre-vingts+fr+quatre-vingts)}, which literally means "four twenties", and {(Lang+soixante-quinze+fr+soixante-quinze)} (literally "sixty-fifteen") means 75. 10292420 -> 1000003002160: This reform arose after the French Revolution to unify the different counting systems (mostly vigesimal near the coast, due to Celtic (via Basque) and Viking influence). 10292430 -> 1000003002170: This system is comparable to the archaic English use of score, as in "fourscore and seven" (87), or "threescore and ten" (70).
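The composition described above can be made concrete with a short Python sketch (an illustration added here, not part of the source text; the function name and the 60–99 range are chosen arbitrarily for the example). It builds the French names of the numbers 60 to 99, with 60–79 formed on soixante and 80–99 on quatre-vingt ("four twenties"):

# Illustrative sketch: composing French number names for 60-99 to show the
# partially vigesimal pattern described above. Hyphenation follows
# traditional (pre-1990) usage.

UNITS = ["", "un", "deux", "trois", "quatre", "cinq", "six", "sept", "huit",
         "neuf", "dix", "onze", "douze", "treize", "quatorze", "quinze",
         "seize", "dix-sept", "dix-huit", "dix-neuf"]

def french_60_to_99(n: int) -> str:
    """Return the French name of n for 60 <= n <= 99."""
    if not 60 <= n <= 99:
        raise ValueError("only 60-99 are handled in this sketch")
    if n < 80:                        # 60-79 are built on soixante (60);
        rest = n - 60                 # 70-79 reuse the names for 10-19
        if rest == 0:
            return "soixante"
        if rest in (1, 11):           # 61 and 71 take "et"
            return f"soixante et {UNITS[rest]}"
        return f"soixante-{UNITS[rest]}"
    rest = n - 80                     # 80-99 are built on quatre-vingt(s), "four twenties"
    if rest == 0:
        return "quatre-vingts"        # final -s only when nothing follows
    return f"quatre-vingt-{UNITS[rest]}"

if __name__ == "__main__":
    for n in (70, 75, 80, 87, 91):
        print(n, french_60_to_99(n))
    # 70 soixante-dix, 75 soixante-quinze, 80 quatre-vingts,
    # 87 quatre-vingt-sept, 91 quatre-vingt-onze

For example, it yields soixante-quinze for 75 and quatre-vingt-sept for 87, matching the examples given above.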
10292440 -> 1000003002180: Belgian French and Swiss French are different in this respect. 10292450 -> 1000003002190: In Belgium and Switzerland 70 and 90 are {(Lang+septante+fr+septante)} and {(Lang+nonante+fr+nonante)}. 10292460 -> 1000003002200: In Switzerland, depending on the local dialect, 80 can be {(Lang+quatre-vingts+fr+quatre-vingts)} (Geneva, Neuchâtel, Jura) or {(Lang+huitante+fr+huitante)} (Vaud, Valais, Fribourg). 10292470 -> 1000003002210: Octante had been used in Switzerland in the past, but is now considered archaic. 10292480 -> 1000003002220: In Belgium, however, quatre-vingts is universally used. 10292490 -> 1000003002230: Writing system 10292500 -> 1000003002240: French is written using the 26 letters of the Latin alphabet, plus five diacritics (the circumflex accent, acute accent, grave accent, diaeresis, and cedilla) and the two ligatures (œ) and (æ). 10292510 -> 1000003002250: French spelling, like English spelling, tends to preserve obsolete pronunciation rules. 10292520 -> 1000003002260: This is mainly due to extreme phonetic changes since the Old French period, without a corresponding change in spelling. 10292530 -> 1000003002270: Moreover, some conscious changes were made to restore Latin orthography: 10292540 -> 1000003002280: Old French doit > French doigt "finger" (Latin digitus) 10292550 -> 1000003002290: Old French pie > French pied "foot" (Latin pes, stem ped-) 10292560 -> 1000003002300: As a result, it is difficult to predict the spelling on the basis of the sound alone. 10292570 -> 1000003002310: Final consonants are generally silent, except when the following word begins with a vowel. 10292580 -> 1000003002320: For example, all of these words end in a vowel sound: pied, aller, les, finit, beaux. 10292590 -> 1000003002330: The same words followed by a vowel, however, may sound the consonants, as they do in these examples: beaux-arts, les amis, pied-à-terre. 10292600 -> 1000003002340: On the other hand, a given spelling will almost always lead to a predictable sound, and the Académie française works hard to enforce and update this correspondence. 10292610 -> 1000003002350: In particular, a given vowel combination or diacritic predictably leads to one phoneme. 10292620 -> 1000003002360: The diacritics have phonetic, semantic, and etymological significance. 10292630 -> 1000003002370: acute accent (é): Over an e, indicates the sound of a short ai in English, with no diphthong. 10292640 -> 1000003002380: An é in modern French is often used where a combination of e and a consonant, usually s, would have been used formerly: écouter < escouter. 10292650 -> 1000003002390: This type of accent mark is called accent aigu in French. 10292660 -> 1000003002400: grave accent (à, è, ù): Over a or u, used only to distinguish homophones: à ("to") vs. a ("has"), ou ("or") vs. où ("where"). 10292670 -> 1000003002410: Over an e, indicates the sound {(IPA+/ɛ/+/ɛ/)}. 10292680 -> 1000003002420: circumflex (â, ê, î, ô, û): Over an a, e or o, indicates the sound {(IPA+/ɑ/+/ɑ/)}, {(IPA+/ɛ/+/ɛ/)} or {(IPA+/o/+/o/)}, respectively (the distinction a {(IPA+/a/+/a/)} vs. â {(IPA+/ɑ/+/ɑ/)} tends to disappear in many dialects). 10292690 -> 1000003002430: Most often indicates the historical deletion of an adjacent letter (usually an s or a vowel): château < castel, fête < feste, sûr < seur, dîner < disner. 10292700 -> 1000003002440: It has also come to be used to distinguish homophones: du ("of the") vs.
dû (past participle of devoir "to have to do something (pertaining to an act)"; note that dû is in fact written thus because of a dropped e: deu). 10292710 -> 1000003002450: (See Use of the circumflex in French) 10292720 -> 1000003002460: diaeresis or tréma (ë, ï, ü, ÿ): Indicates that a vowel is to be pronounced separately from the preceding one: naïve, Noël. 10292730 -> 1000003002470: A diaeresis on y only occurs in some proper names and in modern editions of old French texts. 10292740 -> 1000003002480: Some proper names in which ÿ appears include Aÿ (commune in canton de la Marne formerly Aÿ-Champagne), Rue des Cloÿs (alley in the 18th arrondissement of Paris), Croÿ (family name and hotel on the Boulevard Raspail, Paris), Château du Feÿ (near Joigny), Ghÿs (name of Flemish origin spelt Ghijs where ij in handwriting looked like ÿ to French clerks), l'Haÿ-les-Roses (commune between Paris and Orly airport), Pierre Louÿs (author), Moÿ (place in commune de l'Aisne and family name), and Le Blanc de Nicolaÿ (an insurance company in eastern France). 10292750 -> 1000003002490: The diaeresis on u appears only in the biblical proper names Archélaüs, Capharnaüm, Emmaüs, Ésaü and Saül. 10292760 -> 1000003002500: Nevertheless, since the 1990 orthographic rectifications (which are not applied at all by most French people), the diaeresis in words containing guë (such as aiguë or ciguë) may be moved onto the u: aigüe, cigüe. 10292770 -> 1000003002510: Words coming from German retain the old umlaut (ä, ö and ü) if applicable, but use French pronunciation, such as kärcher (trademark of a pressure washer). 10292780 -> 1000003002520: cedilla (ç): Indicates that an etymological c is pronounced {(IPA+/s/+/s/)} when it would otherwise be pronounced /k/. 10292790 -> 1000003002530: Thus je lance "I throw" (with c = {(IPA+[s]+[s])} before e), je lançais "I was throwing" (c would be pronounced {(IPA+[k]+[k])} before a without the cedilla). 10292800 -> 1000003002540: The c cedilla (ç) softens the hard /k/ sound to /s/ before the vowels a, o or u, for example ça /sa/. 10292810 -> 1000003002550: C cedilla is never used before the vowels e or i since these two vowels always produce a soft /s/ sound (ce, ci). 10292820 -> 1000003002560: There are two ligatures, which have various origins. 10292830 -> 1000003002570: The ligature œ is a mandatory contraction of oe in certain words. 10292840 -> 1000003002580: Some of these are native French words, with the pronunciation {(IPA+/œ/+/œ/)} or {(IPA+/ø/+/ø/)}, e.g. sœur "sister" {(IPA+/sœʁ/+/sœʁ/)}, œuvre "work (of art)" {(IPA+/œvʁ/+/œvʁ/)}. 10292850 -> 1000003002590: Note that it usually appears in the combination œu; œil is an exception. 10292860 -> 1000003002600: Many of these words were originally written with the digraph eu; the o in the ligature represents a sometimes artificial attempt to imitate the Latin spelling: Latin bovem > Old French buef/beuf > Modern French bœuf. Œ is also used in words of Greek origin, as the Latin rendering of the Greek diphthong οι, e.g. cœlacanthe "coelacanth". 10292870 -> 1000003002610: These words used to be pronounced with the vowel {(IPA+/e/+/e/)}, but in recent years a spelling pronunciation with {(IPA+/ø/+/ø/)} has taken hold, e.g. œsophage {(IPA+/ezɔfaʒ/+/ezɔfaʒ/)} or {(IPA+/øzɔfaʒ/+/øzɔfaʒ/)}. 10292880 -> 1000003002620: The pronunciation with {(IPA+/e/+/e/)} is often seen as more correct.
10292890 -> 1000003002630: The ligature œ is not used in some occurrences of the letter combination oe, for example, when o is part of a prefix (coexister). 10292900 -> 1000003002640: The ligature æ is rare and appears in some words of Latin and Greek origin like ægosome, ægyrine, æschne, cæcum, nævus or uræus. 10292910 -> 1000003002650: The vowel quality is identical to é {(IPA+/e/+/e/)}. 10292920 -> 1000003002660: French writing, as with any language, is affected by the spoken language. 10292930 -> 1000003002670: In Old French, the plural for animal was animals. 10292940 -> 1000003002680: In common speech, a u came to be pronounced before the final l in the plural. 10292950 -> 1000003002690: This resulted in animauls. 10292960 -> 1000003002700: As the French language evolved, this vanished and the form animaux (aux pronounced {(IPA+/o/+/o/)}) was admitted. 10292970 -> 1000003002710: The same is true for cheval pluralized as chevaux and many others. 10292980 -> 1000003002720: Also castel pl. castels became château pl. châteaux. 10292990 -> None: Samples GNU General Public License 10310010 -> 1000003100020: GNU General Public License 10310020 -> 1000003100030: The GNU General Public License (GNU GPL or simply GPL) is a widely used free software license, originally written by Richard Stallman for the GNU project. 10310030 -> 1000003100040: The GPL is the most popular and well-known example of the type of strong copyleft license that requires derived works to be available under the same copyleft. 10310040 -> 1000003100050: Under this philosophy, the GPL is said to grant the recipients of a computer program the rights of the free software definition and uses copyleft to ensure the freedoms are preserved, even when the work is changed or added to. 10310050 -> 1000003100060: This is in contrast to permissive free software licenses, of which the BSD licenses are the standard examples. 10310060 -> 1000003100070: The GNU Lesser General Public License (LGPL) is a modified, more permissive, version of the GPL, originally intended for some software libraries. 10310070 -> 1000003100080: There is also a GNU Free Documentation License, which was originally intended for use with documentation for GNU software, but has also been adopted for other uses, such as the Wikipedia project. 10310080 -> 1000003100090: The Affero General Public License (GNU AGPL) is a similar license with a focus on networking server software. 10310090 -> 1000003100100: The GNU AGPL is similar to the GNU General Public License, except that it additionally covers the use of the software over a computer network, requiring that the complete source code be made available to any network user of the AGPLed work, for example a web application. 10310100 -> 1000003100110: The Free Software Foundation recommends that this license be considered for any software that will commonly be run over a network. 10310110 -> 1000003100120: History 10310120 -> 1000003100130: The GPL was written by Richard Stallman in 1989 for use with programs released as part of the GNU project. 10310130 -> 1000003100140: The original GPL was based on a unification of similar licenses used for early versions of GNU Emacs, the GNU Debugger and the GNU Compiler Collection. 10310140 -> 1000003100150: These licenses contained provisions similar to those of the modern GPL, but were specific to each program, rendering them mutually incompatible even though they contained essentially the same terms.
10310150 -> 1000003100160: Stallman's goal was to produce one license that could be used for any project, thus making it possible for many projects to share code. 10310160 -> 1000003100170: An important vote of confidence in the GPL came from Linus Torvalds' adoption of the license for the Linux kernel in 1992, switching from an earlier license that prohibited commercial distribution. 10310170 -> 1000003100180: As of August 2007, the GPL accounted for nearly 65% of the 43,442 free software projects listed on Freshmeat, and as of January 2006, about 68% of the projects listed on SourceForge.net. 10310180 -> 1000003100190: Similarly, a 2001 survey of Red Hat Linux 7.1 found that 50% of the source code was licensed under the GPL and a 1997 survey of MetaLab, then the largest free software archive, showed that the GPL accounted for about half of the licenses used. 10310190 -> 1000003100200: One survey of a large repository of open-source software reported that in July 1997, about half the software packages with explicit license terms used the GPL. 10310200 -> 1000003100210: Prominent free software programs licensed under the GPL include the Linux kernel and the GNU Compiler Collection (GCC). 10310210 -> 1000003100220: Some other free software programs are dual-licensed under multiple licenses, often with one of the licenses being the GPL. 10310220 -> 1000003100230: Some observers believe that the strong copyleft provided by the GPL was crucial to the success of Linux, giving the programmers who contributed to it the confidence that their work would benefit the whole world and remain free, rather than being exploited by software companies that would not have to give anything back to the community. 10310230 -> 1000003100240: The second version of the license, version 2, was released in 1991. 10310240 -> 1000003100250: Over the following 15 years, some members of the FOSS (Free and Open Source Software) community came to believe that some software and hardware vendors were finding loopholes in the GPL, allowing GPL-licensed software to be exploited in ways that were contrary to the intentions of the programmers. 10310250 -> 1000003100260: These concerns included tivoization (the inclusion of GPL-licensed software in hardware that will refuse to run modified versions of its software); the use of unpublished, modified versions of GPL software behind web interfaces; and patent deals between Microsoft and Linux and Unix distributors that may represent an attempt to use patents as a weapon against competition from Linux. 10310260 -> 1000003100270: Version 3 was developed to attempt to address these concerns. 10310270 -> 1000003100280: It was officially released on June 29, 2007. 10310280 -> 1000003100290: Versions 10310290 -> 1000003100300: Version 1 10310300 -> 1000003100310: Version 1 of the GNU GPL, released in January 1989, prevented what were then the two main ways that software distributors restricted the freedoms that define free software. 10310310 -> 1000003100320: The first problem was that distributors may publish binary files only – executable, but not readable or modifiable by humans. 10310320 -> 1000003100330: To prevent this, GPLv1 said that any vendor distributing binaries must also make the human readable source code available under the same licensing terms. 
10310330 -> 1000003100340: The second problem was that distributors might add additional restrictions, either by adding restrictions to the license, or by combining the software with other software which had other restrictions on its distribution. 10310340 -> 1000003100350: If this were done, then the union of the two sets of restrictions would apply to the combined work, and thus unacceptable restrictions could be added. 10310350 -> 1000003100360: To prevent this, GPLv1 said that modified versions, as a whole, had to be distributed under the terms in GPLv1. 10310360 -> 1000003100370: Therefore, software distributed under the terms of GPLv1 could be combined with software under more permissive terms, as this would not change the terms under which the whole could be distributed, but software distributed under GPLv1 could not be combined with software distributed under a more restrictive license, as this would conflict with the requirement that the whole be distributable under the terms of GPLv1. 10310370 -> 1000003100380: Version 2 10310380 -> 1000003100390: According to Richard Stallman, the major change in GPLv2 was the "Liberty or Death" clause, as he calls it: Section 7. 10310390 -> 1000003100400: This section says that if someone has restrictions imposed that prevent him or her from distributing GPL-covered software in a way that respects other users' freedom (for example, if a legal ruling states that he or she can only distribute the software in binary form), he or she cannot distribute it at all. 10310400 -> 1000003100410: By 1990, it was becoming apparent that a less restrictive license would be strategically useful for some software libraries; when version 2 of the GPL (GPLv2) was released in June 1991, therefore, a second license, the Library General Public License (LGPL), was introduced at the same time and numbered version 2 to show that the two were complementary. 10310410 -> 1000003100420: The version numbers diverged in 1999 when version 2.1 of the LGPL was released, which renamed it the GNU Lesser General Public License to reflect its place in the GNU philosophy. 10310420 -> 1000003100430: Version 3 10310430 -> 1000003100440: In late 2005, the Free Software Foundation (FSF) announced work on version 3 of the GPL (GPLv3). 10310440 -> 1000003100450: On January 16, 2006, the first "discussion draft" of GPLv3 was published, and the public consultation began. 10310450 -> 1000003100460: The public consultation was originally planned for nine to fifteen months, but finally stretched to eighteen months, with four drafts being published. 10310460 -> 1000003100470: The official GPLv3 was released by the FSF on June 29, 2007. 10310470 -> 1000003100480: GPLv3 was written by Richard Stallman, with legal counsel from Eben Moglen and the Software Freedom Law Center. 10310480 -> 1000003100490: According to Stallman, the most important changes are in relation to software patents, free software license compatibility, the definition of "source code", and hardware restrictions on software modification ("tivoization"). 10310490 -> 1000003100500: Other changes relate to internationalisation, how license violations are handled, and how additional permissions can be granted by the copyright holder. 10310500 -> 1000003100510: Other notable changes include allowing authors to add certain additional conditions or requirements to their contributions.
10310510 -> 1000003100520: One of those new optional requirements, sometimes referred to as the Affero clause, is intended to fulfill a request regarding software as a service; permitting the addition of this requirement makes GPLv3 compatible with the Affero General Public License. 10310520 -> 1000003100530: The public consultation process was coordinated by the Free Software Foundation with assistance from the Software Freedom Law Center, Free Software Foundation Europe, and other free software groups. 10310530 -> 1000003100540: Comments were collected from the public via the gplv3.fsf.org web portal. 10310540 -> 1000003100550: That portal runs purpose-written software called stet. 10310550 -> 1000003100560: These comments were passed to four committees comprising approximately 130 people, including supporters and detractors of the FSF's goals. 10310560 -> 1000003100570: Those committees researched the comments submitted by the public and passed their summaries to Stallman for a decision on what the license would do. 10310570 -> 1000003100580: During the public consultation process, 962 comments were submitted for the first draft. 10310580 -> 1000003100590: By the end, a total of 2,636 comments had been submitted. 10310590 -> 1000003100600: The third draft was released on March 28, 2007. 10310600 -> 1000003100610: This draft included language intended to prevent patent cross-licenses like the controversial Microsoft-Novell patent agreement and restricted the anti-tivoization clauses to a legal definition of a "User" or "consumer product." 10310610 -> 1000003100620: It also explicitly removed the section on "Geographical Limitations", whose probable removal had been announced at the launch of the public consultation. 10310620 -> 1000003100630: The fourth discussion draft, which was the last, was released on May 31, 2007. 10310630 -> 1000003100640: It introduced Apache Software License compatibility, clarified the role of outside contractors, and made an exception to permit the Microsoft-Novell agreement, saying in section 11 paragraph 6 that 10310635 -> 1000003100650: You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license [...] 10310640 -> 1000003100660: This aims to make such future deals ineffective. 10310650 -> 1000003100670: The license is also meant to cause Microsoft to extend the patent licenses it grants to Novell customers for the use of GPLv3 software to all users of that GPLv3 software; this is possible only if Microsoft is legally a "conveyor" of the GPLv3 software. 10310660 -> 1000003100680: Others, notably some high-profile developers of the Linux kernel, commented to the mass media and made public statements about their objections to parts of discussion drafts 1 and 2.
10310700 -> 1000003100720: The licensee is allowed to charge a fee for this service, or do this free of charge. 10310710 -> 1000003100730: This latter point distinguishes the GPL from software licenses that prohibit commercial redistribution. 10310720 -> 1000003100740: The FSF argues that free software should not place restrictions on commercial use, and the GPL explicitly states that GPL works may be sold at any price. 10310730 -> 1000003100750: The GPL additionally states that a distributor may not impose "further restrictions on the rights granted by the GPL". 10310740 -> 1000003100760: This forbids activities such as distributing the software under a non-disclosure agreement or contract. 10310750 -> 1000003100770: Distributors under the GPL also grant a license for any of their patents practiced by the software, permitting those patents to be practiced in GPL software. 10310760 -> 1000003100780: Section three of the license requires that programs distributed as pre-compiled binaries be accompanied by a copy of the source code, by a written offer to distribute the source code via the same mechanism as the pre-compiled binary, or by the written offer to obtain the source code that the distributor received when obtaining the pre-compiled binary under the GPL. 10310770 -> 1000003100790: Copyleft 10310780 -> 1000003100800: The distribution rights granted by the GPL for modified versions of the work are not unconditional. 10310790 -> 1000003100810: When someone distributes a GPL'd work plus their own modifications, the requirements for distributing the whole work cannot be any greater than the requirements that are in the GPL. 10310800 -> 1000003100820: This requirement is known as copyleft. 10310810 -> 1000003100830: It derives its legal power from the use of copyright on software programs. 10310820 -> 1000003100840: Because a GPL work is copyrighted, a licensee has no right to redistribute it, not even in modified form (barring fair use), except under the terms of the license. 10310830 -> 1000003100850: One is only required to adhere to the terms of the GPL if one wishes to exercise rights normally restricted by copyright law, such as redistribution. 10310840 -> 1000003100860: Conversely, if one distributes copies of the work without abiding by the terms of the GPL (for instance, by keeping the source code secret), he or she can be sued by the original author under copyright law. 10310850 -> 1000003100870: Copyleft thus uses copyright law to accomplish the opposite of its usual purpose: instead of imposing restrictions, it grants rights to other people, in a way that ensures the rights cannot subsequently be taken away. 10310860 -> 1000003100880: It also ensures that unlimited redistribution rights are not granted, should any legal flaw (or "bug") be found in the copyleft statement. 10310870 -> 1000003100890: Many distributors of GPL'ed programs bundle the source code with the executables. 10310880 -> 1000003100900: An alternative method of satisfying the copyleft is to provide a written offer to provide the source code on a physical medium (such as a CD) upon request. 10310890 -> 1000003100910: In practice, many GPL'ed programs are distributed over the Internet, and the source code is made available over FTP. 10310900 -> 1000003100920: For Internet distribution, this complies with the license. 10310910 -> 1000003100930: Copyleft applies only when a person seeks to redistribute the program.
10310920 -> 1000003100940: One is allowed to make private modified versions, without any obligation to divulge the modifications as long as the modified software is not distributed to anyone else. 10310930 -> 1000003100950: Note that the copyleft applies only to the software and not to its output (unless that output is itself a derivative work of the program); for example, a public web portal running a modified derivative of a GPL'ed content management system is not required to distribute its changes to the underlying software. 10310940 -> 1000003100960: Licensing and contractual issues 10310950 -> 1000003100970: The GPL was designed as a license, rather than a contract. 10310960 -> 1000003100980: In some Common Law jurisdictions, the legal distinction between a license and a contract is an important one: contracts are enforceable by contract law, whereas licenses are enforced under copyright law. 10310970 -> 1000003100990: However, this distinction is not useful in the many jurisdictions where there are no differences between contracts and licenses, such as Civil Law systems. 10310980 -> 1000003101000: Those who do not agree to the GPL's terms and conditions do not have permission, under copyright law, to copy or distribute GPL licensed software or derivative works. 10310990 -> 1000003101010: However, they may still use the software however they like. 10311000 -> 1000003101020: Copyright holders 10311010 -> 1000003101030: The text of the GPL is itself copyrighted, and the copyright is held by the Free Software Foundation (FSF). 10311020 -> 1000003101040: However, the FSF does not hold the copyright for a work released under the GPL, unless an author explicitly assigns copyrights to the FSF (which seldom happens except for programs that are part of the GNU project). 10311030 -> 1000003101050: Only the individual copyright holders have the authority to sue when a license violation takes place. 10311040 -> 1000003101060: The FSF permits people to create new licenses based on the GPL, as long as the derived licenses do not use the GPL preamble without permission. 10311050 -> 1000003101070: This is discouraged, however, since such a license is generally incompatible with the GPL. 10311060 -> 1000003101080: (See the GPL FAQ for more information.) 10311070 -> 1000003101090: Other licenses created by the GNU project include the GNU Lesser General Public License and the GNU Free Documentation License. 10311080 -> 1000003101100: The GPL in court 10311090 -> 1000003101110: A key dispute related to the GPL is whether or not non-GPL software can dynamically link to GPL libraries. 10311100 -> 1000003101120: The GPL is clear in requiring that all derivative works of GPL'ed code must themselves be GPL'ed. 10311110 -> 1000003101130: However, it is not clear whether an executable that dynamically links to a GPL code should be considered a derivative work. 10311120 -> 1000003101140: The free/open-source software community is split on this issue. 10311130 -> 1000003101150: The FSF asserts that such an executable is indeed a derivative work if the executable and GPL code "make function calls to each other and share data structures," with others agreeing, while some (e.g. Linus Torvalds) agree that dynamic linking can create derived works but disagree over the circumstances. 
10311150 -> 1000003101160: On the other hand, some experts have argued that the question is still open: one Novell lawyer has written that dynamic linking not being derivative "makes sense" but is not "clear-cut," and Lawrence Rosen has claimed that a court of law would "probably" exclude dynamic linking from derivative works although "there are also good arguments" on the other side and "the outcome is not clear" (on a later occasion, he argued that "market-based" factors are more important than the linking technique). 10311160 -> 1000003101170: This is ultimately a question not of the GPL per se, but of how copyright law defines derivative works. 10311170 -> 1000003101180: In Galoob v. Nintendo the Ninth Circuit Court of Appeals defined a derivative work as having "'form' or permanence" and noted that "the infringing work must incorporate a portion of the copyrighted work in some form," but there have been no clear court decisions to resolve this particular conflict. 10311180 -> 1000003101190: Since there is no record of anyone circumventing the GPL by dynamic linking and contesting when threatened with lawsuits by the copyright holder, the restriction appears de facto enforceable even if not yet proven de jure. 10311190 -> 1000003101200: In 2002, MySQL AB sued Progress NuSphere for copyright and trademark infringement in United States district court. 10311200 -> 1000003101210: NuSphere had allegedly violated MySQL's copyright by linking code for the Gemini table type into the MySQL server. 10311210 -> 1000003101220: After a preliminary hearing before Judge Patti Saris on February 27, 2002, the parties entered settlement talks and eventually settled. 10311220 -> 1000003101230: At the hearing, Judge Saris "saw no reason" that the GPL would not be enforceable. 10311230 -> 1000003101240: In August 2003, the SCO Group stated that they believed the GPL to have no legal validity, and that they intended to take up lawsuits over sections of code supposedly copied from SCO Unix into the Linux kernel. 10311240 -> 1000003101250: This was a problematic stand for them, as they had distributed Linux and other GPL'ed code in their Caldera OpenLinux distribution, and there is little evidence that they had any legal right to do so except under the terms of the GPL. 10311250 -> 1000003101260: For more information, see SCO-Linux controversies and SCO v. IBM. 10311260 -> 1000003101270: In April 2004 the netfilter/iptables project was granted a preliminary injunction against Sitecom Germany by Munich District Court after Sitecom refused to desist from distributing Netfilter's GPL'ed software in violation of the terms of the GPL. 10311270 -> 1000003101280: On July 2004 , the German court confirmed this injunction as a final ruling against Sitecom. 10311280 -> 1000003101290: The court's justification for its decision exactly mirrored the predictions given earlier by the FSF's Eben Moglen: 10311290 -> 1000003101300: Defendant has infringed on the copyright of plaintiff by offering the software 'netfilter/iptables' for download and by advertising its distribution, without adhering to the license conditions of the GPL. 10311300 -> 1000003101310: Said actions would only be permissible if defendant had a license grant... 10311310 -> 1000003101320: This is independent of the questions whether the licensing conditions of the GPL have been effectively agreed upon between plaintiff and defendant or not. 
10311320 -> 1000003101330: If the GPL were not agreed upon by the parties, defendant would notwithstanding lack the necessary rights to copy, distribute, and make the software 'netfilter/iptables' publicly available. 10311330 -> 1000003101340: This ruling was important because it was the first time that a court had confirmed that violating the terms of the GPL was an act of copyright infringement. 10311340 -> 1000003101350: However, the case was not as crucial a test for the GPL as some have concluded. 10311350 -> 1000003101360: In the case, the enforceability of the GPL itself was not under attack. 10311360 -> 1000003101370: Instead, the court was merely attempting to discern whether the license itself was in effect. 10311370 -> 1000003101380: In May 2005, Daniel Wallace filed suit against the Free Software Foundation (FSF) in the Southern District of Indiana, contending that the GPL is an illegal attempt to fix prices at zero. 10311380 -> 1000003101390: The suit was dismissed in March 2006, on the grounds that Wallace had failed to state a valid antitrust claim; the court noted that "the GPL encourages, rather than discourages, free competition and the distribution of computer operating systems, the benefits of which directly pass to consumers." 10311390 -> 1000003101400: Wallace was denied the possibility of further amending his complaint, and was ordered to pay the FSF's legal expenses. 10311400 -> 1000003101410: On September 8, 2005, the Seoul Central District Court ruled that the GPL had no legal relevance to a case dealing with trade secrets derived from a GPL-licensed work. 10311410 -> 1000003101420: The defendants argued that, since it is impossible to maintain a trade secret while complying with the GPL and distributing the work, they were not in breach of trade secrecy. 10311420 -> 1000003101430: This argument was rejected as groundless. 10311430 -> 1000003101440: On September 6, 2006, the gpl-violations.org project prevailed in court litigation against D-Link Germany GmbH regarding D-Link's inappropriate and copyright-infringing use of parts of the Linux kernel. 10311440 -> 1000003101450: The judgment finally provided on-record legal precedent that the GPL is valid and legally binding, and that it will stand up in German court. 10311450 -> 1000003101460: In late 2007, the developers of BusyBox and the Software Freedom Law Center embarked upon a program to gain GPL compliance from distributors of BusyBox in embedded systems, suing those who would not comply. 10311460 -> 1000003101470: These were claimed to be the first uses of US courts to enforce GPL obligations. 10311470 -> 1000003101480: See BusyBox#GPL lawsuits. 10311480 -> 1000003101490: Compatibility and multi-licensing 10311490 -> 1000003101500: Many of the most common free software licenses, such as the original MIT/X license, the BSD license (in its current 3-clause form), and the LGPL, are "GPL-compatible". 10311500 -> 1000003101510: That is, their code can be combined with a program under the GPL without conflict (the new combination would have the GPL applied to the whole). 10311510 -> 1000003101520: However, some free/open source software licenses are not GPL-compatible. 10311520 -> 1000003101530: Many GPL proponents have strongly advocated that free/open source software developers use only GPL-compatible licenses, because doing otherwise makes it difficult to reuse software in larger wholes.
10311530 -> 1000003101540: Note that this issue only arises in concurrent use of licenses which impose conditions on their manner of combination. 10311540 -> 1000003101550: Some licenses, such as the BSD license, impose no conditions on the manner of their combination. 10311550 -> 1000003101560: Also see the list of FSF approved software licenses for examples of compatible and incompatible licenses. 10311560 -> 1000003101570: A number of businesses use dual-licensing to distribute a GPL version and sell a proprietary license to companies wishing to combine the package with proprietary code, using dynamic linking or not. 10311570 -> 1000003101580: Examples of such companies include MySQL AB, Trolltech (Qt toolkit), Namesys (ReiserFS) and Red Hat (Cygwin). 10311580 -> 1000003101590: Adoption 10311590 -> 1000003101600: The Open Source License Resource Center maintained by Black Duck Software shows that GPL is the license used in about 70% of all open source software. 10311600 -> 1000003101610: The vast majority of projects are released under GPL 2 with 3000 open source projects having migrated to GPL 3. 10311610 -> 1000003101620: Criticism 10311620 -> 1000003101630: In 2001 Microsoft CEO Steve Ballmer referred to Linux as "a cancer that attaches itself in an intellectual property sense to everything it touches." 10311630 -> 1000003101640: Critics of Microsoft claim that the real reason Microsoft dislikes the GPL is that the GPL resists proprietary vendors' attempts to "embrace, extend and extinguish". 10311640 -> 1000003101650: Microsoft has released Microsoft Windows Services for UNIX which contains GPL-licensed code. 10311650 -> 1000003101660: In response to Microsoft's attacks on the GPL, several prominent Free Software developers and advocates released a joint statement supporting the license. 10311660 -> 1000003101670: The GPL has been described as being "viral" by many of its critics because the GPL only allows conveyance of whole programs, which means that programmers are not allowed to convey programs that link to libraries having GPL-incompatible licenses. 10311670 -> 1000003101680: The so-called "viral" effect of this is that under such circumstances disparately licensed software cannot be combined unless one of the licenses is changed. 10311680 -> 1000003101690: Although theoretically either license could be changed, in the "viral" scenario the GPL cannot be practically changed (because the software may have so many contributors, some of whom will likely refuse), whereas the license of the other software can be practically changed. 10311690 -> 1000003101700: This is part of a philosophical difference between the GPL and permissive free software licenses such as the BSD-style licenses, which do not put such a requirement on modified versions. 10311700 -> 1000003101710: While proponents of the GPL believe that free software should ensure that its freedoms are preserved all the way from the developer to the user, others believe that intermediaries between the developer and the user should be free to redistribute the software as non-free software. 10311710 -> 1000003101720: More specifically, the GPL requires that redistribution occur subject to the GPL, whereas more "permissive" licenses allow redistribution to occur under licenses more restrictive than the original license. 
10311720 -> 1000003101730: While the GPL does allow commercial distribution of GPL software, the market price will settle near the price of distribution (near zero), since the purchasers may redistribute the software and its source code for their cost of redistribution. 10311730 -> 1000003101740: This could be seen to inhibit commercial use of GPL'ed code by others wishing to use that code for proprietary purposes: if they do not wish to avail themselves of GPL'ed code, they will have to re-implement it themselves. 10311740 -> 1000003101750: Microsoft has included anti-GPL terms in their open source software. 10311750 -> 1000003101760: In addition, the FreeBSD project has stated that "a less publicized and unintended use of the GPL is that it is very favorable to large companies that want to undercut software companies. 10311760 -> 1000003101770: In other words, the GPL is well suited for use as a marketing weapon, potentially reducing overall economic benefit and contributing to monopolistic behavior". 10311770 -> 1000003101780: It is not clear, however, that there are any cases of this happening in practice. 10311780 -> 1000003101790: The GPL has no indemnification clause explicitly protecting maintainers and developers from litigation resulting from unscrupulous contribution. 10311790 -> 1000003101800: (If a developer submits existing patented or copyrighted work to a GPL project claiming it as their own contribution, all the project maintainers and even other developers can be held legally responsible for damages to the copyright or patent holder.) 10311800 -> 1000003101810: Lack of indemnification is one criticism that led Mozilla to create the Mozilla Public License rather than use the GPL or LGPL. 10311810 -> 1000003101820: However, Mozilla later relicensed its work under a GPL/LGPL/MPL triple license, due to problems with the GPL-incompatibility of the MPL. 10311820 -> 1000003101830: Some software developers have found the extensive scope of the GPL to be too restrictive. 10311830 -> 1000003101840: For example, Bjørn Reese and Daniel Stenberg describe how the downstream effects of the GPL on later developers create a "quodque pro quo" (Latin, "everything in return for something"). 10311840 -> 1000003101850: For that reason, in 2001 they abandoned the GPLv2 in favor of less restrictive copyleft licenses. 10311850 -> 1000003101860: A more specific example of the downstream effects of the GPL can be seen in the case of incompatible licenses. 10311860 -> 1000003101870: Sun Microsystems' ZFS, because it is licensed under the GPL-incompatible CDDL and covered by several Sun patents, cannot be linked to the GPL-licensed Linux kernel. 10311870 -> 1000003101880: Some have also argued that the GPL could, and should, be shorter. 10300010 -> 1000003200020: German language 10300020 -> 1000003200030: The German language ({(Lang+Deutsch+de+Deutsch)}) is a West Germanic language and one of the world's major languages. 10300030 -> 1000003200040: German is closely related to and classified alongside English and Dutch. 10300040 -> 1000003200050: Around the world, German is spoken by approximately 100 million native speakers and by about 80 million non-native speakers, and Standard German is widely taught in schools, universities, and Goethe Institutes worldwide.
10300050 -> 1000003200060: Geographic distribution 10300060 -> 1000003200070: Europe 10300070 -> 1000003200080: German is spoken primarily in Germany (95%), Austria (89%) and Switzerland (64%), which together with Liechtenstein and Luxembourg (D-A-CH-Li-Lux) constitute the countries where German is the majority language. 10300080 -> 1000003200090: Other European German-speaking communities are found in Italy (Bolzano-Bozen), in the East Cantons of Belgium, in the French region of Alsace, which changed hands between Germany and France several times over the course of history, and in some border villages of the former South Jutland County (in German, Nordschleswig; in Danish, Sønderjylland) of Denmark. 10300090 -> 1000003200100: Some German-speaking communities still survive in parts of Romania, the Czech Republic, Poland, Hungary, and above all Russia and Kazakhstan, although forced expulsions after World War II and massive emigration to Germany in the 1980s and 1990s have depopulated most of these communities. 10300100 -> 1000003200110: It is also spoken by German-speaking foreign populations and some of their descendants in Portugal, Spain, Italy, Morocco, Egypt, Israel, Cyprus, Turkey, Greece, the United Kingdom, the Netherlands, Scandinavia, Siberia in Russia, Hungary, Romania, Bulgaria, and the former Yugoslavia (Bosnia, Serbia, Macedonia, Croatia and Slovenia). 10300110 -> 1000003200120: In Luxembourg and the surrounding areas, large parts of the native population speak German dialects, and some people also have a command of standard German (especially in Luxembourg); in the French regions of Alsace (German: Elsass) and Lorraine (German: Lothringen), French has replaced the local German dialects as the official language, even though German has not been fully displaced in everyday speech. 10300120 -> 1000003200130: Overseas 10300130 -> 1000003200140: Outside of Europe and the former Soviet Union, the largest German-speaking communities are to be found in the United States, Canada, Brazil and Argentina, where millions of Germans migrated over the last 200 years; however, the vast majority of their descendants no longer speak German. 10300140 -> 1000003200150: Additionally, German-speaking communities can be found in the former German colony of Namibia, independent from South Africa since 1990, as well as in other countries of German emigration such as Canada, Mexico, the Dominican Republic, Paraguay, Uruguay, Chile, Peru, Venezuela (where Alemán Coloniero developed), South Africa and Australia. 10300150 -> 1000003200160: South America 10300160 -> 1000003200170: In Brazil the largest concentrations of German speakers are in Rio Grande do Sul (where Riograndenser Hunsrückisch was developed), Santa Catarina, Paraná, and Espírito Santo; there are also large German-speaking descendant communities in Argentina, Uruguay and Chile. 10300170 -> 1000003200180: In the 20th century, over 100,000 German political refugees and invited entrepreneurs settled in Latin American countries such as Costa Rica, Panama, Venezuela and the Dominican Republic, establishing German-speaking enclaves, and there has reportedly been a small German immigration to Puerto Rico. 10300180 -> 1000003200190: North America 10300190 -> 1000003200200: The United States has the largest concentration of German speakers outside of Europe; an indication of this presence can be found in the names of such villages and towns as New Leipzig, Munich, Karlsruhe, and Strasburg, North Dakota, and New Braunfels, Texas.
10300200 -> 1000003200210: Though over the course of the 20th century many of the descendants of 18th and 19th-century immigrants ceased speaking German at home, small populations of elderly (as well as some younger) speakers can be found in Pennsylvania (Amish, Hutterites, Dunkards and some Mennonites historically spoke Pennsylvania Dutch (a West Central German variety) and Hutterite German), Kansas (Mennonites and Volga Germans), North Dakota (Hutterite Germans, Mennonites, Russian Germans, Volga Germans, and Baltic Germans), South Dakota, Montana, Texas (Texas German), Wisconsin, Indiana, Louisiana and Oklahoma. 10300210 -> 1000003200220: Early twentieth century immigration was often to St. Louis, Chicago, New York, Pittsburgh and Cincinnati. 10300220 -> 1000003200230: Most of the post–World War II wave are in the New York, Philadelphia, Los Angeles, San Francisco and Chicago urban areas, and in Florida, Arizona and California where large communities of retired German, Swiss and Austrian expatriates live. 10300230 -> 1000003200240: The American population of German ancestry is above 60 million. 10300240 -> 1000003200250: The German language is the third largest language in the U.S. after Spanish. 10300250 -> 1000003200260: In Canada there are people of German ancestry throughout the country and especially in the western cities such as Kelowna. 10300260 -> 1000003200270: German is also spoken in Ontario and southern Nova Scotia. 10300270 -> 1000003200280: There is a large and vibrant community in the city of Kitchener, Ontario. 10300280 -> 1000003200290: German immigrants were instrumental in the country's three largest urban areas: Montreal, Toronto and Vancouver, but post-WWII immigrants managed to preserve a fluency in the German language in their respective neighborhoods and sections. 10300290 -> 1000003200300: In the first half of the 20th century, over a million German-Canadians made the language one of Canada's most spoken after French. 10300300 -> 1000003200310: In Mexico there are also large populations of German ancestry, mainly in the cities of: Mexico City, Puebla, Mazatlán, Tapachula, and larger populations scattered in the states of Chihuahua, Durango, and Zacatecas. 10300310 -> 1000003200320: German ancestry is also said to be found in neighboring towns around Guadalajara, Jalisco and much of Northern Mexico, where German influence was immersed into the Mexican culture. 10300320 -> 1000003200330: Standard German is spoken by the affluent German communities in Puebla, Mexico City, Nuevo Leon, San Luis Potosi and Quintana Roo. 10300330 -> 1000003200340: German immigration in the twentieth century was small, but produced German-speaking communities in Central America (i.e. 10300340 -> 1000003200350: Guatemala, Honduras and Nicaragua) and the Caribbean Islands like the Dominican Republic. 10300350 -> 1000003200360: Dialects in North America: 10300360 -> 1000003200370: The dialects of German which are or were primarily spoken in colonies or communities founded by German speaking people resemble the dialects of the regions the founders came from. 10300370 -> 1000003200380: For example, Pennsylvania German resembles dialects of the Palatinate, and Hutterite German resembles dialects of Carinthia. 10300380 -> 1000003200390: Texas German is a dialect spoken in the areas of Texas settled by the Adelsverein, such as New Braunfels and Fredericksburg. 10300390 -> 1000003200400: In the Amana Colonies in the state of Iowa Amana German is spoken. 
10300400 -> 1000003200410: Plautdietsch is a large minority language spoken in Northern Mexico by the Mennonite communities, and is spoken by more than 200,000 people in Mexico. 10300410 -> 1000003200420: Hutterite German is an Upper German dialect of the Austro-Bavarian variety of the German language, which is spoken by Hutterite communities in Canada and the United States. 10300420 -> 1000003200430: Hutterite is spoken in the U.S. states of Washington, Montana, North Dakota and South Dakota, and Minnesota; and in the Canadian provinces of Alberta, Saskatchewan and Manitoba. 10300430 -> 1000003200440: Its speakers belong to some Schmiedleit, Lehrerleit, and Dariusleit Hutterite groups, but there are also speakers among the older generations of Prairieleit (the descendants of those Hutterites who chose not to settle in colonies). 10300440 -> 1000003200450: Hutterite children who grow up in the colonies learn and speak first Hutterite German before learning English in the public school, the standard language of the surrounding areas. 10300450 -> 1000003200460: Many colonies though continue with German Grammar School, separate from the public school, throughout a student's elementary education. 10300460 -> 1000003200470: Creoles 10300470 -> 1000003200480: There is an important German creole being studied and recovered, named Unserdeutsch, spoken in the former German colony of Papua New Guinea, across Micronesia and in northern Australia (i.e. coastal parts of Queensland and Western Australia), by few elderly people. 10300480 -> 1000003200490: The risk of its extinction is serious and efforts to revive interest in the language are being implemented by scholars. 10300490 -> 1000003200500: Internet 10300500 -> 1000003200510: According to Global Reach (2004), 6.9% of the Internet population is German. 10300510 -> 1000003200520: According to Netz-tipp (2002), 7.7% of webpages are written in German, making it second only to English in the European language group. 10300520 -> 1000003200530: They also report that 12% of Google's users use its German interface. 10300530 -> 1000003200540: Older statistics: Babel (1998) found somewhat similar demographics. 10300540 -> 1000003200550: FUNREDES (1998) and Vilaweb (2000) both found that German is the third most popular language used by websites, after English and Japanese. 10300550 -> 1000003200560: History 10300560 -> 1000003200570: The history of the language begins with the High German consonant shift during the migration period, separating High German dialects from common West Germanic. 10300570 -> 1000003200580: The earliest testimonies of Old High German are from scattered Elder Futhark inscriptions, especially in Alemannic, from the 6th century, the earliest glosses (Abrogans) date to the 8th and the oldest coherent texts (the Hildebrandslied, the Muspilli and the Merseburg Incantations) to the 9th century. 10300580 -> 1000003200590: Old Saxon at this time belongs to the North Sea Germanic cultural sphere, and Low Saxon should fall under German rather than Anglo-Frisian influence during the Holy Roman Empire. 10300590 -> 1000003200600: As Germany was divided into many different states, the only force working for a unification or standardization of German during a period of several hundred years was the general preference of writers trying to write in a way that could be understood in the largest possible area. 
10300600 -> 1000003200610: When Martin Luther translated the Bible (the New Testament in 1522 and the Old Testament, published in parts and completed in 1534), he based his translation mainly on the bureaucratic standard language used in Saxony (sächsische Kanzleisprache), also known as Meißner-Deutsch (Meißner German), which was the most widely understood language at the time because the region in which it was spoken was quite influential among the German states. 10300610 -> 1000003200620: This language was based on Eastern Upper and Eastern Central German dialects and preserved much of the grammatical system of Middle High German (unlike the spoken German dialects in Central and Upper Germany, which had already begun at that time to lose the genitive case and the preterite tense). 10300620 -> 1000003200630: At first, copies of the Bible included a long list for each region that translated words unknown in that region into the regional dialect. 10300630 -> 1000003200640: Roman Catholics initially rejected Luther's translation and tried to create their own Catholic standard (gemeines Deutsch), which, however, differed from 'Protestant German' only in some minor details. 10300640 -> 1000003200650: It took until the middle of the 18th century to create a standard that was widely accepted, thus ending the period of Early New High German. 10300650 -> 1000003200660: In 1901 the Second Orthographic Conference ended with a complete standardization of the written German language, while the Deutsche Bühnensprache (literally: German stage language) had already established conventions for German three years earlier which were later to become obligatory for general German pronunciation. 10300660 -> 1000003200670: German used to be the language of commerce and government in the Habsburg Empire, which encompassed a large area of Central and Eastern Europe. 10300670 -> 1000003200680: Until the mid-19th century it was essentially the language of townspeople throughout most of the Empire. 10300680 -> 1000003200690: Speaking German indicated that the speaker was a merchant or an urbanite, regardless of nationality. 10300690 -> 1000003200700: Some cities, such as Prague (German: Prag) and Budapest (Buda, German: Ofen), were gradually Germanized in the years after their incorporation into the Habsburg domain. 10300700 -> 1000003200710: Others, such as Bratislava (German: Pressburg), were originally settled during the Habsburg period and were primarily German at that time. 10300710 -> 1000003200720: A few cities, such as Milan (German: Mailand), remained primarily non-German. 10300720 -> 1000003200730: However, most cities were primarily German during this time, such as Prague, Budapest, Bratislava (German: Pressburg), Zagreb (German: Agram), and Ljubljana (German: Laibach), though they were surrounded by territory that spoke other languages. 10300730 -> 1000003200740: Until about 1800, standard German was almost exclusively a written language. 10300740 -> 1000003200750: At this time, people in urban northern Germany, who spoke dialects very different from Standard German, learned it almost like a foreign language and tried to pronounce it as closely to the spelling as possible. 10300750 -> 1000003200760: Prescriptive pronunciation guides used to consider northern German pronunciation to be the standard. 10300760 -> 1000003200770: However, the actual pronunciation of standard German varies from region to region.
10300770 -> 1000003200780: Media and written works are almost all produced in standard German (often called Hochdeutsch in German), which is understood in all areas where German is spoken, except by pre-school children in areas where only dialect is spoken, for example Switzerland and Austria. 10300780 -> 1000003200790: However, in this age of television, even they now usually learn to understand Standard German before school age. 10300790 -> 1000003200800: The first dictionary of the Brothers Grimm, the 16 parts of which were issued between 1852 and 1961, remains the most comprehensive guide to the words of the German language. 10300800 -> 1000003200810: In 1880, grammatical and orthographic rules first appeared in the Duden Handbook. 10300810 -> 1000003200820: In 1901, this was declared the standard definition of the German language. 10300820 -> 1000003200830: Official revisions of some of these rules were not issued until 1998, when the German spelling reform of 1996 was officially promulgated by governmental representatives of all German-speaking countries. 10300830 -> 1000003200840: Since the reform, German spelling has been in an eight-year transitional period during which the reformed spelling is taught in most schools, while traditional and reformed spellings co-exist in the media. 10300840 -> 1000003200850: See German spelling reform of 1996 for an overview of the public debate concerning the reform, with some major newspapers and magazines and several well-known writers refusing to adopt it. 10300850 -> 1000003200860: The German spelling reform of 1996 led to public controversy and indeed to considerable dispute. 10300860 -> 1000003200870: The parliaments of some states (Bundesländer) would not accept it (North Rhine-Westphalia and Bavaria). 10300870 -> 1000003200880: The dispute at one point reached the highest court, which dismissed it quickly, ruling that the states had to decide for themselves and that only in schools could the reform be made the official rule; everybody else could continue writing as they had learned it. 10300880 -> 1000003200890: After 10 years, without any intervention by the federal parliament, a major yet incomplete revision was put in place in 2006, just in time for the new school year. 10300890 -> 1000003200900: In 2007, some traditional spellings were finally invalidated, even though they had caused little or no trouble. 10300900 -> 1000003200910: The only sure and easily recognizable sign that a text complies with the reform is the -ss at the end of words, as in dass and muss. 10300910 -> 1000003200920: Classical spelling forbade this ending, using daß and muß instead. 10300920 -> 1000003200930: The controversy revolved around the question of whether a language is part of a culture that must be preserved or a means of communicating information that has to allow for growth. 10300930 -> 1000003200940: (The reformers seemed unimpressed by the fact that a considerable part of that culture, namely the entire German literature of the 20th century, is in the old spelling.) 10300940 -> 1000003200950: The increasing use of English in Germany's higher education system, as well as in business and in popular culture, has led various German academics to state, not necessarily from an entirely negative perspective, that German is a language in decline in its native country.
10300950 -> 1000003200960: For example, Ursula Kimpel, of the University of Tübingen, said in 2005 that “German universities are offering more courses in English because of the large number of students coming from abroad. 10300960 -> 1000003200970: German is unfortunately a language in decline. 10300970 -> 1000003200980: We need and want our professors to be able to teach effectively in English.” 10300980 -> 1000003200990: Standard German 10300990 -> 1000003201000: Standard German originated not as a traditional dialect of a specific region, but as a written language. 10301000 -> 1000003201010: However, there are places where the traditional regional dialects have been replaced by standard German; this is the case in vast stretches of Northern Germany, but also in major cities in other parts of the country. 10301010 -> 1000003201020: Standard German differs regionally, between German-speaking countries, in vocabulary and some instances of pronunciation, and even grammar and orthography. 10301020 -> 1000003201030: This variation must not be confused with the variation of local dialects. 10301030 -> 1000003201040: Even though the regional varieties of standard German are only to a certain degree influenced by the local dialects, they are very distinct. 10301040 -> 1000003201050: German is thus considered a pluricentric language. 10301050 -> 1000003201060: In most regions, the speakers use a continuum of mixtures from more dialectal varieties to more standard varieties according to situation. 10301060 -> 1000003201070: In the German-speaking parts of Switzerland, mixtures of dialect and standard are very seldom used, and the use of standard German is largely restricted to the written language. 10301070 -> 1000003201080: Therefore, this situation has been called a medial diglossia. 10301080 -> 1000003201090: Swiss Standard German is used in the Swiss education system. 10301090 -> 1000003201100: Official status 10301100 -> 1000003201110: Standard German is the only official language in Liechtenstein and Austria; it shares official status in Germany (with Danish, Frisian and Sorbian as minority languages), Switzerland (with French, Italian and Romansh), Belgium (with Dutch and French) and Luxembourg (with French and Luxembourgish). 10301110 -> 1000003201120: It is used as a local official language in Italy (Province of Bolzano-Bozen), as well as in the cities of Sopron (Hungary), Krahule (Slovakia) and several cities in Romania. 10301120 -> 1000003201130: It is the official language (with Italian) of the Vatican Swiss Guard. 10301130 -> 1000003201140: German has an officially recognized status as regional or auxiliary language in Denmark (South Jutland region), France (Alsace and Moselle regions), Italy (Gressoney valley), Namibia, Poland (Opole region), and Russia (Asowo and Halbstadt). 10301140 -> 1000003201150: German is one of the 23 official languages of the European Union. 10301150 -> 1000003201160: It is the language with the largest number of native speakers in the European Union, and, shortly after English and long before French, the second-most spoken language in Europe. 10301160 -> 1000003201170: German as a foreign language 10301170 -> 1000003201180: German is the third most taught foreign language in the English speaking world after French and Spanish. 
10301180 -> 1000003201190: German is the main language of about 90–95 million people in Europe (as of 2004), or 13.3% of all Europeans, being the second most spoken native language in Europe after Russian, above French (66.5 million speakers in 2004) and English (64.2 million speakers in 2004). 10301190 -> 1000003201200: It is therefore the most spoken first language in the EU. 10301200 -> 1000003201210: It is the second most known foreign language in the EU. 10301210 -> 1000003201220: It is one of the official languages of the European Union, and one of the three working languages of the European Commission, along with English and French. 10301220 -> 1000003201230: Thirty-two percent of citizens of the EU-15 countries say they can converse in German (either as a mother tongue or as a second or foreign language). 10301230 -> 1000003201240: This is assisted by the widespread availability of German TV by cable or satellite. 10301240 -> 1000003201250: German was once, and still remains to some extent, a lingua franca in Central, Eastern and Northern Europe. 10301250 -> 1000003201260: Dialects 10301260 -> 1000003201270: German is a member of the western branch of the Germanic family of languages, which in turn is part of the Indo-European language family. 10301270 -> 1000003201280: The German dialect continuum is traditionally divided most broadly into High German and Low German. 10301280 -> 1000003201290: The variation among the German dialects is considerable, with only the neighbouring dialects being mutually intelligible. 10301290 -> 1000003201300: Some dialects are not intelligible to people who only know standard German. 10301300 -> 1000003201310: However, all German dialects belong to the dialect continuum of High German and Low Saxon languages. 10301310 -> 1000003201320: Until roughly the end of the Second World War, there was a dialect continuum of all the continental West Germanic languages because nearly any pair of neighbouring dialects were perfectly mutually intelligible. 10301320 -> 1000003201330: Low German 10301330 -> 1000003201340: Low Saxon varieties (spoken on German territory) are considered linguistically a language separate from the German language by some, but just a dialect by others. 10301340 -> 1000003201350: Sometimes, Low Saxon and Low Franconian are grouped together because both are unaffected by the High German consonant shift. 10301350 -> 1000003201360: However, the part of the population capable of speaking and responding to it, or of understanding it has decreased continuously since WWII. 10301360 -> 1000003201370: Currently the effort to maintain a residual presence in cultural life is negligible. 10301370 -> 1000003201380: Middle Low German was the lingua franca of the Hanseatic League. 10301380 -> 1000003201390: It was the predominant language in Northern Germany. 10301390 -> 1000003201400: This changed in the 16th century. 10301400 -> 1000003201410: In 1534 the Luther Bible by Martin Luther was printed. 10301410 -> 1000003201420: This translation is considered to be an important step towards the evolution of the Early New High German. 10301420 -> 1000003201430: It aimed to be understandable to an ample audience and was based mainly on Central and Upper German varieties. 10301430 -> 1000003201440: The Early New High German language gained more prestige than Low Saxon and became the language of science and literature. 
10301440 -> 1000003201450: Other factors were that around the same time, the Hanseatic League lost its importance as new trade routes to Asia and the Americas were established, and that the most powerful German states of that period were located in Middle and Southern Germany. 10301450 -> 1000003201460: The 18th and 19th centuries were marked by mass education, the language of the schools being standard German. 10301460 -> 1000003201470: Low Saxon was slowly pushed back until it was little more than a language spoken by the uneducated and at home. 10301470 -> 1000003201480: Today Low Saxon can be divided into two groups: Low Saxon varieties with a considerable Standard German influence and varieties of Standard German with a Low Saxon influence, known as Missingsch. 10301480 -> 1000003201490: High German 10301490 -> 1000003201500: High German is divided into Central German and Upper German. 10301500 -> 1000003201510: Central German dialects include Ripuarian, Moselle Franconian, Hessian, Thuringian, South Franconian, Lorraine Franconian and Upper Saxon. 10301510 -> 1000003201520: Central German is spoken in the southeastern Netherlands, eastern Belgium, Luxembourg, parts of France, and in Germany approximately between the River Main and the southern edge of the Lowlands. 10301520 -> 1000003201530: Modern Standard German is mostly based on Central German, but it should be noted that the common (but not linguistically correct) German term for modern Standard German is Hochdeutsch, that is, High German. 10301530 -> 1000003201540: The Moselle Franconian varieties spoken in Luxembourg have been officially standardised and institutionalised and are therefore usually considered a separate language known as Luxembourgish. 10301540 -> 1000003201550: Upper German dialects include Alemannic (for instance Swiss German), Swabian, East Franconian, Alsatian and Austro-Bavarian. 10301550 -> 1000003201560: They are spoken in parts of Alsace, southern Germany, Liechtenstein, Austria, and in the German-speaking parts of Switzerland and Italy. 10301560 -> 1000003201570: Wymysorys is a High German dialect of Poland, while Sathmarisch and Siebenbürgisch are High German dialects of Romania. 10301570 -> 1000003201580: The High German varieties spoken by Ashkenazi Jews (mostly in the former Soviet Union) have several unique features, and are usually considered a separate language, Yiddish. 10301580 -> 1000003201590: Yiddish is the only Germanic language that does not use the Latin alphabet as its standard script. 10301590 -> 1000003201600: German dialects versus varieties of standard German 10301600 -> 1000003201610: In German linguistics, German dialects are distinguished from varieties of standard German. 10301610 -> 1000003201620: The German dialects are the traditional local varieties. 10301620 -> 1000003201630: They are traditionally traced back to the different German tribes. 10301630 -> 1000003201640: Many of them are hardly understandable to someone who knows only standard German, since they often differ from standard German in lexicon, phonology and syntax. 10301640 -> 1000003201650: If a narrow definition of language based on mutual intelligibility is used, many German dialects are considered to be separate languages (for instance in the Ethnologue). 10301650 -> 1000003201660: However, such a point of view is unusual in German linguistics. 10301660 -> 1000003201670: The varieties of standard German refer to the different local varieties of the pluricentric standard German.
10301670 -> 1000003201680: They differ only slightly in lexicon and phonology. 10301680 -> 1000003201690: In certain regions, they have replaced the traditional German dialects, especially in Northern Germany. 10301690 -> 1000003201700: Grammar 10301700 -> 1000003201710: German is an inflected language. 10301710 -> 1000003201720: Noun inflection 10301720 -> 1000003201730: German nouns inflect into: 10301730 -> 1000003201740: one of four cases: nominative, genitive, dative, and accusative. 10301740 -> 1000003201750: one of three genders: masculine, feminine, or neuter. 10301750 -> 1000003201760: Word endings sometimes reveal grammatical gender; for instance, nouns ending in ...ung (-ing), ...e, ...schaft (-ship), ...keit or ...heit (-hood) are feminine, while nouns ending in ...chen or ...lein (diminutive forms) are neuter and nouns ending in ...ismus (-ism) are masculine. 10301760 -> 1000003201770: Other endings are less clear-cut, sometimes depending on the region in which the language is spoken. 10301770 -> 1000003201780: Additionally, ambiguous endings exist, such as ...er (-er), e.g. Feier (feminine; English: celebration, party) and Arbeiter (masculine; English: labourer). 10301780 -> 1000003201790: Sentences can usually be reorganized to avoid a misunderstanding. 10301790 -> 1000003201800: two numbers: singular and plural 10301800 -> 1000003201810: Although German is usually cited as an outstanding example of a highly inflected language, the degree of inflection is considerably less than in Old German, or in other old Indo-European languages such as Latin, Ancient Greek, or Sanskrit. 10301810 -> 1000003201820: The three genders have collapsed in the plural, which now behaves, grammatically, somewhat as a fourth gender. 10301820 -> 1000003201830: With four cases and three genders plus plural there are 16 distinct possible combinations of case and gender/number, but presently there are only six forms of the definite article used for the 16 possibilities. 10301830 -> 1000003201840: Inflection for case on the noun itself is required in the singular for strong masculine and neuter nouns in the genitive and sometimes in the dative. 10301840 -> 1000003201850: Both of these cases are giving way to substitutes in informal speech. 10301850 -> 1000003201860: The dative ending is considered somewhat old-fashioned in many contexts and often dropped, but it is still used in sayings and in formal speech or in written language. 10301860 -> 1000003201870: Weak masculine nouns share a common case ending for genitive, dative and accusative in the singular. 10301870 -> 1000003201880: Feminine nouns are not declined in the singular. 10301880 -> 1000003201890: The plural does have an inflection for the dative. 10301890 -> 1000003201900: In total, seven inflectional endings (not counting plural markers) exist in German: -s, -es, -n, -ns, -en, -ens, -e. 10301900 -> 1000003201910: In the German orthography, nouns and most words with the syntactical function of nouns are capitalised, which is supposed to make it easier for readers to find out what function a word has within the sentence (Am Freitag bin ich einkaufen gegangen. — "On Friday I went shopping."; Eines Tages war er endlich da. — "One day he finally showed up".) 10301910 -> 1000003201920: This spelling convention is almost unique to German today (shared perhaps only by the closely related Luxembourgish language), although it was historically common in other languages (e.g., Danish and English), too.
10301920 -> 1000003201930: Like most Germanic languages, German forms left-branching noun compounds, where the first noun modifies the category given by the second, for example: Hundehütte (English: dog hut; specifically: doghouse). 10301930 -> 1000003201940: Unlike English, where newer compounds or combinations of longer nouns are often written in open form with separating spaces, German (like the other Germanic languages) nearly always uses the closed form without spaces, for example: Baumhaus (English: tree house). 10301940 -> 1000003201950: Like English, German allows arbitrarily long compounds, but these are rare. 10301950 -> 1000003201960: (See also English compounds.) 10301960 -> 1000003201970: The longest German word verified to be actually in (albeit very limited) use is Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz. [which, literally translated, breaks up into: Rind (cattle) - Fleisch (meat) - Etikettierung(s) (labelling) - Überwachung(s) (supervision) - Aufgaben (duties) - Übertragung(s) (assignment) - Gesetz (law), so "Beef labelling supervision duty assignment law".] 10301970 -> 1000003201980: Verb inflection 10301980 -> 1000003201990: Standard German verbs inflect into: 10301990 -> 1000003202000: one of two conjugation classes, weak and strong (like English). 10302000 -> 1000003202010: (There is actually a third class, known as mixed verbs, which exhibit inflections combining features of both the strong and weak patterns.) 10302010 -> 1000003202020: three persons: 1st, 2nd, 3rd. 10302020 -> 1000003202030: two numbers: singular and plural 10302030 -> 1000003202040: three moods: Indicative, Subjunctive, Imperative 10302040 -> 1000003202050: two voices (genera verbi): active and passive, the passive being compound and divisible into static and dynamic. 10302050 -> 1000003202060: two simple tenses (present, preterite) and four compound tenses (perfect, pluperfect, future and future perfect) 10302060 -> 1000003202070: the distinction between grammatical aspects is rendered by the combined use of subjunctive and/or preterite marking: the subjunctive alone conveys second-hand (reported) information, subjunctive plus preterite marking forms the conditional, and the preterite alone is either the plain indicative (of the past) or functions as a literary alternative for either reported information or the conditional when the two would otherwise be indistinguishable. 10302070 -> 1000003202080: the distinction between perfect and progressive aspect has been available as a productive category at every stage of the language's development and in nearly all documented dialects, but, strangely enough, it is nowadays rigorously excluded from written usage in its present normalised form. 10302080 -> 1000003202090: the disambiguation of completed vs. uncompleted forms is widely observed and regularly expressed by common prefixes (blicken - to look, erblicken - to see [unrelated form: sehen - to see]). 10302090 -> 1000003202100: Verb prefixes 10302100 -> 1000003202110: There are also many ways to expand, and sometimes radically change, the meaning of a base verb through a relatively small number of prefixes.
10302110 -> 1000003202120: Some of those prefixes have a meaning themselves (Example: zer- refers to the destruction of things, as in zerreißen = to tear apart, zerbrechen = to break apart, zerschneiden = to cut apart), others do not have more than the vaguest meaning in and of themselves (Example: ver- , as in versuchen = to try, vernehmen = to interrogate, verteilen = to distribute, verstehen = to understand). 10302120 -> 1000003202130: More examples: haften = to stick, verhaften = to imprison; kaufen = to buy, verkaufen = to sell; hören = to hear, aufhören = to cease; fahren = to drive, erfahren = to get to know, to hear about something. 10302130 -> 1000003202140: Separable prefixes 10302140 -> 1000003202150: Many German verbs have a separable prefix, often with an adverbial function. 10302150 -> 1000003202160: In finite verb forms this is split off and moved to the end of the clause, and is hence considered by some to be a "resultative particle". 10302160 -> 1000003202170: For example, mitgehen meaning "to go with" would be split giving Gehen Sie mit? 10302170 -> 1000003202180: (Literal: "Go you with?" ; Formal: "Are you going along"?). 10302180 -> 1000003202190: Indeed, several parenthetical clauses may occur between the prefix of a finite verb and its complement; e.g. 10302190 -> 1000003202200: Er kam am Freitagabend nach einem harten Arbeitstag und dem üblichen Ärger, der ihn schon seit Jahren immer wieder an seinem Arbeitsplatz plagt, mit fraglicher Freude auf ein Mahl, das seine Frau ihm, wie er hoffte, bereits aufgetischt hatte, endlich zu Hause an . 10302200 -> 1000003202210: A literal translation of this example might look like this: 10302210 -> 1000003202220: He arr- on a Friday evening after a hard day at work and the usual disagreements that had been troubling him repeatedly, looking forward to a questionable meal which, as he hoped, his wife had already fixed for him, -ived at home. 10302220 -> 1000003202230: Word order 10302230 -> 1000003202240: German requires that a verbal element (main verb or auxiliary verb) appear second in the sentence, preceded by the most important topical phrase. 10302240 -> 1000003202250: The second most important phrase appears at the end of the sentence. 
10302250 -> 1000003202260: For a sentence without an auxiliary, this gives several options: 10302260 -> 1000003202270: {(Lang+Der alte Mann gibt mir das Buch heute.+de+Der alte Mann gibt mir das Buch heute.)} 10302265 -> 1000003202280: (The old man gives me the book today) 10302270 -> 1000003202290: {(Lang+Der alte Mann gibt mir heute das Buch.+de+Der alte Mann gibt mir heute das Buch.)} 10302280 -> 1000003202300: {(Lang+Das Buch gibt mir der alte Mann heute.+de+Das Buch gibt mir der alte Mann heute.)} 10302290 -> 1000003202310: {(Lang+Das Buch gibt der alte Mann heute mir.+de+Das Buch gibt der alte Mann heute mir.)} (stress on mir) 10302300 -> 1000003202320: {(Lang+Das Buch gibt heute der alte Mann mir.+de+Das Buch gibt heute der alte Mann mir.)} (as well) 10302310 -> 1000003202330: {(Lang+Das Buch gibt der alte Mann mir heute.+de+Das Buch gibt der alte Mann mir heute.)} 10302320 -> 1000003202340: {(Lang+Das Buch gibt heute mir der alte Mann.+de+Das Buch gibt heute mir der alte Mann.)} 10302330 -> 1000003202350: {(Lang+Das Buch gibt mir heute der alte Mann.+de+Das Buch gibt mir heute der alte Mann.)} 10302340 -> 1000003202360: {(Lang+Heute gibt mir der alte Mann das Buch.+de+Heute gibt mir der alte Mann das Buch.)} 10302350 -> 1000003202370: {(Lang+Heute gibt mir das Buch der alte Mann.+de+Heute gibt mir das Buch der alte Mann.)} 10302360 -> 1000003202380: {(Lang+Heute gibt der alte Mann mir das Buch.+de+Heute gibt der alte Mann mir das Buch.)} 10302370 -> 1000003202390: {(Lang+Mir gibt der alte Mann das Buch heute.+de+Mir gibt der alte Mann das Buch heute.)} 10302380 -> 1000003202400: {(Lang+Mir gibt heute der alte Mann das Buch.+de+Mir gibt heute der alte Mann das Buch.)} 10302390 -> 1000003202410: {(Lang+Mir gibt der alte Mann heute das Buch.+de+Mir gibt der alte Mann heute das Buch.)} 10302400 -> 1000003202420: The position of a noun as a subject or object in a German sentence doesn't affect the meaning of the sentence as it would in English. 10302410 -> 1000003202430: In a declarative sentence in English if the subject does not occur before the predicate the sentence could well be misunderstood. 10302420 -> 1000003202440: For example, in the sentence "Man bites dog" it is clear who did what to whom. 10302430 -> 1000003202450: To exchange the place of the subject with that of the object — "Dog bites man" — changes the meaning completely. 10302440 -> 1000003202460: In other words the word order in a sentence conveys significant information. 10302450 -> 1000003202470: In German, nouns and articles are declined as in Latin thus indicating whether it is the subject or object of the verb's action. 10302460 -> 1000003202480: The above example in German would be {(Lang+Ein Mann beißt den Hund+de+Ein Mann beißt den Hund)} or {(Lang+Den Hund beißt ein Mann+de+Den Hund beißt ein Mann)} with both having exactly the same meaning. 10302470 -> 1000003202490: If the articles are omitted, which is sometimes done in headlines ({(Lang+Mann beißt Hund+de+Mann beißt Hund)}), the syntax applies as in English — the first noun is the subject and the noun following the predicate is the object. 10302480 -> 1000003202500: Except for emphasis, adverbs of time have to appear in the third place in the sentence, just after the predicate. 10302490 -> 1000003202510: Otherwise the speaker would be recognised as non-German. 10302500 -> 1000003202520: For instance the German word order (in Modern English) is: We're going tomorrow to town. 
({(Lang+Wir gehen morgen in die Stadt.+de+Wir gehen morgen in die Stadt.)}) 10302510 -> 1000003202530: Auxiliary verbs 10302520 -> 1000003202540: When an auxiliary verb is present, the auxiliary appears in second position, and the main verb appears at the end. 10302530 -> 1000003202550: This occurs notably in the creation of the perfect tense. 10302540 -> 1000003202560: Many word orders are still possible, e.g.: 10302550 -> 1000003202570: {(Lang+Der alte Mann hat mir das Buch gestern gegeben.+de+Der alte Mann hat mir das Buch gestern gegeben.)} 10302555 -> 1000003202580: (The old man gave me the book yesterday.) 10302560 -> 1000003202590: {(Lang+Der alte Mann hat mir gestern das Buch gegeben.+de+Der alte Mann hat mir gestern das Buch gegeben.)} 10302570 -> 1000003202600: {(Lang+Das Buch hat mir der alte Mann gestern gegeben.+de+Das Buch hat mir der alte Mann gestern gegeben.)} 10302580 -> 1000003202610: {(Lang+Das Buch hat mir gestern der alte Mann gegeben.+de+Das Buch hat mir gestern der alte Mann gegeben.)} 10302590 -> 1000003202620: {(Lang+Gestern hat mir der alte Mann das Buch gegeben.+de+Gestern hat mir der alte Mann das Buch gegeben.)} 10302600 -> 1000003202630: {(Lang+Gestern hat mir das Buch der alte Mann gegeben.+de+Gestern hat mir das Buch der alte Mann gegeben.)} 10302610 -> 1000003202640: The word order is generally less rigid than in Modern English except for nouns (see below). 10302620 -> 1000003202650: There are two common word orders; one is for main clauses and another for subordinate clauses. 10302630 -> 1000003202660: In normal positive sentences the inflected verb always has position 2; in questions, exclamations and wishes it always has position 1. 10302640 -> 1000003202670: In subordinate clauses the verb is supposed to occur at the very end, but in speech this rule is often disregarded. 10302650 -> 1000003202680: For example in a subordinate clause introduced by "weil" ("because") the verb quite often occupies the same order as in a main clause. 10302660 -> 1000003202690: The correct way of saying "because I'm broke" is "{(Lang+…weil ich pleite bin.+de+…weil ich pleite bin.)}". 10302670 -> 1000003202700: In the vernacular you may hear instead "{(Lang+…weil ich bin pleite.+de+…weil ich bin pleite.)}" 10302675 -> 1000003202710: This phenomenon may be caused by mixing the word-order pattern used for the word {(Lang+weil+de+weil)} with the pattern used for an alternative word for "because", {(Lang+denn+de+denn)}, which is used with the main clause order ("{(Lang+…denn ich bin pleite.+de+…denn ich bin pleite.)}"). 10302680 -> 1000003202720: Modal verbs 10302690 -> 1000003202730: Sentences using modal verbs place the infinitive at the end. 10302700 -> 1000003202740: For example, the sentence in Modern English "Should he go home?" would be rearranged in German to say "Should he (to) home go?" ({(Lang+Soll er nach Hause gehen?+de+Soll er nach Hause gehen?)}). 10302710 -> 1000003202750: Thus in sentences with several subordinate or relative clauses the infinitives are clustered at the end. 10302720 -> 1000003202760: Compare the similar clustering of prepositions in the following English sentence: "What did you bring that book that I don't like to be read to out of up for?" 10302730 -> 1000003202770: Multiple infinitives 10302740 -> 1000003202780: The number of infinitives at the end is usually restricted to two, causing the third infinitive or auxiliary verb that would have gone at the very end to be placed instead at the beginning of the chain of verbs. 
10302750 -> 1000003202790: For example, the sentence "Should he move into the house that he has just had renovated?" would be rearranged to "Should he into the house move, that he just renovated had?". 10302755 -> 1000003202800: ({(Lang+Soll er in das Haus einziehen, das er gerade hat renovieren lassen?+de+Soll er in das Haus einziehen, das er gerade hat renovieren lassen?)}). 10302760 -> 1000003202810: The older form would have been ({(Lang+Soll er in das Haus, das er gerade hat renovieren lassen, einziehen?+de+Soll er in das Haus, das er gerade hat renovieren lassen, einziehen?)}). 10302770 -> 1000003202820: If there are more than three infinitives, all except the first two are relocated to the beginning of the chain. 10302780 -> 1000003202830: Needless to say, the rule is not rigorously applied. 10302790 -> 1000003202840: Vocabulary 10302800 -> 1000003202850: Most German vocabulary is derived from the Germanic branch of the Indo-European language family, although there are significant minorities of words derived from Latin and Greek, and a smaller number from French and, most recently, English. 10302810 -> 1000003202860: At the same time, German is very effective at forming equivalents for foreign words from its inherited Germanic stem repertory. 10302820 -> 1000003202870: Thus, Notker Labeo was able to translate Aristotelian treatises into pure (Old High) German in the decades after the year 1000. 10302830 -> 1000003202880: Overall, German has fewer Romance-language loanwords than does English. 10302840 -> 1000003202890: The coining of new, autochthonous words gave German a vocabulary of an estimated 40,000 words as early as the ninth century. 10302850 -> 1000003202900: In comparison, Latin, with a written tradition of nearly 2,500 years in an empire which ruled the Mediterranean, has grown to no more than 45,000 words today. 10302860 -> 1000003202910: Even today, many low-key scholarly movements try to promote the Ersatz (substitution) of virtually all foreign words with ancient, dialectal, or newly coined German alternatives. 10302870 -> 1000003202920: It is claimed that this would also help in spreading modern or scientific notions among the less educated, and thus democratise public life, too. 10302880 -> 1000003202930: Jurisprudence in Germany, for example, uses perhaps the "purest" tongue in terms of "Germanness", but also the most cumbersome, to be found today. 10302890 -> 1000003202940: In the modern scientific German vocabulary database in Leipzig (as of July 2003) there are nine million words and word groups in 35 million sentences (out of a corpus of 500 million words). 10302900 -> 1000003202950: Writing system 10302910 -> 1000003202960: Present 10302920 -> 1000003202970: German is written using the Latin alphabet. 10302930 -> 1000003202980: In addition to the 26 standard letters, German has three vowels with Umlaut, namely ä, ö and ü, as well as the Eszett or scharfes s (sharp s), ß. 10302940 -> 1000003202990: Before the German spelling reform of 1996, ß replaced ss after long vowels and diphthongs and before consonants, word-, or partial-word-endings. 10302950 -> 1000003203000: In reformed spelling, ß replaces ss only after long vowels and diphthongs. 10302960 -> 1000003203010: Since there is no capital ß, it is always written as SS when capitalization is required. 10302970 -> 1000003203020: For example, Maßband (tape measure) is capitalized MASSBAND.
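The capitalization rule just illustrated, together with the keyboard fallback spellings ae, oe, ue and ss described in the next passage, can be sketched in a few lines of code. The snippet below is only an illustration, not an authoritative tool: it relies on Python's Unicode-aware upper(), which happens to map ß to SS, and a hand-written substitution table that simply restates the conventions from the text.

```python
# Illustrative sketch only: uppercasing German text and producing the
# conventional ASCII fallback spellings (ae, oe, ue, ss) when umlauts and ß
# are unavailable.  The mappings restate the rules described in the article.
ASCII_FALLBACK = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}

def ascii_fallback(text: str) -> str:
    """Rewrite umlauts and ß with their two-letter substitutes."""
    return "".join(ASCII_FALLBACK.get(ch, ch) for ch in text)

# Python's str.upper() is Unicode-aware and already writes a capitalized ß as SS.
print("Maßband".upper())          # MASSBAND
print(ascii_fallback("Maßband"))  # Massband
print(ascii_fallback("Grüße"))    # Gruesse
```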
10302980 -> 1000003203030: An exception is the use of ß in legal documents and forms when capitalizing names. 10302990 -> 1000003203040: To avoid confusion with similar names, a "ß" is to be used instead of "SS". 10303000 -> 1000003203050: (So: "KREßLEIN" instead of "KRESSLEIN".) 10303010 -> 1000003203060: A capital ß has been proposed and included in Unicode, but it is not yet recognized as standard German. 10303020 -> 1000003203070: In Switzerland, ß is not used at all. 10303030 -> 1000003203080: Umlaut vowels (ä, ö, ü) are commonly circumscribed with ae, oe, and ue if the umlauts are not available on the keyboard used. 10303040 -> 1000003203090: In the same manner ß can be circumscribed as ss. German readers understand those circumscriptions (although they look unusual), but they are avoided if the regular umlauts are available because they are considered a makeshift, not proper spelling. 10303050 -> 1000003203100: (In Westphalia, city and family names exist where the extra e has a vowel lengthening effect, e.g. Raesfeld [ˈraːsfɛlt] and Coesfeld [ˈkoːsfɛlt], but this use of the letter e after a/o/u does not occur in the present-day spelling of words other than proper nouns. 10303060 -> 1000003203110: ) 10303070 -> 1000003203120: Unfortunately there is still no general agreement exactly where these umlauts occur in the sorting sequence. 10303080 -> 1000003203130: Telephone directories treat them by replacing them with the base vowel followed by an e, whereas dictionaries use just the base vowel. 10303090 -> 1000003203140: As an example in a telephone book Ärzte occurs after Adressenverlage but before Anlagenbauer (because Ä is replaced by Ae). 10303100 -> 1000003203150: In a dictionary Ärzte occurs after Arzt but before Asbest (because Ä is treated as A). 10303110 -> 1000003203160: In some older dictionaries or indexes, initial Sch and St are treated as separate letters and are listed as separate entries after S. 10303120 -> 1000003203170: Past 10303130 -> 1000003203180: Until the early 20th century, German was mostly printed in blackletter typefaces (mostly in Fraktur, but also in Schwabacher) and written in corresponding handwriting (for example Kurrent and Sütterlin). 10303140 -> 1000003203190: These variants of the Latin alphabet are very different from the serif or sans serif Antiqua typefaces used today, and particularly the handwritten forms are difficult for the untrained to read. 10303150 -> 1000003203200: The printed forms however were claimed by some to be actually more readable when used for printing Germanic languages . 10303160 -> 1000003203210: The Nazis initially promoted Fraktur and Schwabacher since they were considered Aryan, although they later abolished them in 1941 by claiming that these letters were Jewish. 10303170 -> 1000003203220: The latter fact is not widely known anymore; today the letters are often associated with the Nazis and are no longer commonly used . 10303180 -> 1000003203230: The Fraktur script remains present in everyday life through road signs, pub signs, beer brands and other forms of advertisement, where it is used to convey a certain rusticality and oldness. 10303190 -> 1000003203240: A proper use of the long s, (langes s), ſ, is essential to write German text in Fraktur typefaces. 10303200 -> 1000003203250: Many Antiqua typefaces include the long s, also. 10303210 -> 1000003203260: A specific set of rules applies for the use of long s in German text, but it is rarely used in Antiqua typesetting, recently. 
10303220 -> 1000003203270: Any lower case "s" at the beginning of a syllable would be a long s, as opposed to a terminal s or short s (the more common variation of the letter s), which marks the end of a syllable; for example, in differentiating between the words Wachſtube (=guard-house) and Wachstube (=tube of floor polish). 10303230 -> 1000003203280: One can easily decide which "s" to use by appropriate hyphenation ("Wach-ſtube" vs. "Wachs-tube"). 10303240 -> 1000003203290: The long s only appears in lower case. 10303250 -> 1000003203300: The widespread ignorance of the correct use of the Fraktur scripts shows, however, in the many mistakes made, such as the frequent erroneous use of the round s instead of the long s at the beginning of a syllable, the failure to employ the mandatory ligatures of Fraktur, or the use of letter-forms closer to Antiqua for certain especially hard-to-read Fraktur letters. 10303260 -> 1000003203310: Phonology 10303270 -> 1000003203320: Vowels 10303280 -> 1000003203330: German vowels (excluding diphthongs; see below) come in short and long varieties. 10303290 -> 1000003203340: Short {(IPA+/ɛ/+/ɛ/)} is realised as {(IPA+[ɛ]+[ɛ])} in stressed syllables (including secondary stress), but as {(IPA+[ǝ]+[ǝ])} in unstressed syllables. 10303300 -> 1000003203350: Note that stressed short {(IPA+/ɛ/+/ɛ/)} can be spelled either with e or with ä (hätte 'would have' and Kette 'chain', for instance, rhyme). 10303310 -> 1000003203360: In general, the short vowels are open and the long vowels are closed. 10303320 -> 1000003203370: The one exception is the open {(IPA+/ɛː/+/ɛː/)} sound of long Ä; in some varieties of standard German, {(IPA+/ɛː/+/ɛː/)} and {(IPA+/eː/+/eː/)} have merged into {(IPA+[eː]+[eː])}, removing this anomaly. 10303330 -> 1000003203380: In that case, pairs like Bären/Beeren 'bears/berries' or Ähre/Ehre 'spike/honour' become homophonous. 10303340 -> 1000003203390: In many varieties of standard German, an unstressed {(IPA+/ɛr/+/ɛr/)} is not pronounced as {(IPA+[ər]+[ər])}, but vocalised to {(IPA+[ɐ]+[ɐ])}. 10303350 -> 1000003203400: Whether any particular vowel letter represents the long or short phoneme is not completely predictable, although the following regularities exist: 10303360 -> 1000003203410: If a vowel (other than i) is at the end of a syllable or followed by a single consonant, it is usually pronounced long (e.g. Hof [hoːf]). 10303370 -> 1000003203420: If the vowel is followed by a double consonant (e.g. ff, ss or tt), ck, tz or a consonant cluster (e.g. st or nd), it is nearly always short (e.g. hoffen [ˈhɔfǝn]). 10303380 -> 1000003203430: Double consonants are used only for this function of marking preceding vowels as short; the consonant itself is never pronounced lengthened or doubled. 10303390 -> 1000003203440: Both of these rules have exceptions (e.g. hat [hat] 'has' is short despite the first rule; Kloster {(IPA+[kloːstər]+[kloːstər])}, 'cloister'; Mond {(IPA+[moːnt]+[moːnt])}, 'moon' are long despite the second rule). 10303400 -> 1000003203450: For an i that is neither in the combination ie (making it long) nor followed by a double consonant or cluster (making it short), there is no general rule. 10303410 -> 1000003203460: In some cases, there are regional differences: In central Germany (Hessen), the o in the proper name "Hoffmann" is pronounced long, while most other Germans would pronounce it short; the same applies to the e in the geographical name "Mecklenburg" for people in that region.
10303420 -> 1000003203470: The word Städte 'cities', is pronounced with a short vowel {(IPA+[ˈʃtɛtə]+[ˈʃtɛtə])} by some (Jan Hofer, ARD Television) and with a long vowel {(IPA+[ˈʃtɛːtə]+[ˈʃtɛːtə])} by others (Marietta Slomka, ZDF Television). 10303430 -> 1000003203480: Finally, a vowel followed by ch can be short (Fach {(IPA+[fax]+[fax])} 'compartment', Küche {(IPA+[ˈkʏçe]+[ˈkʏçe])} 'kitchen') or long (Suche {(IPA+[ˈzuːxǝ]+[ˈzuːxǝ])} 'search', Bücher {(IPA+[ˈbyːçər]+[ˈbyːçər])} 'books') almost at random. 10303440 -> 1000003203490: Thus, Lache is homographous: {(IPA+[la:xe]+[la:xe])} 'puddle' and {(IPA+[laxe]+[laxe])} 'manner of laughing' (coll.), 'laugh!' 10303450 -> 1000003203500: (Imp.). 10303460 -> 1000003203510: German vowels can form the following digraphs (in writing) and diphthongs (in pronunciation); note that the pronunciation of some of them (ei, äu, eu) is very different from what one would expect when considering the component letters: 10303470 -> 1000003203520: Additionally, the digraph ie generally represents the phoneme {(IPA+/iː/+/iː/)}, which is not a diphthong. 10303480 -> 1000003203530: In many varieties, a /r/ at the end of a syllable is vocalised. 10303490 -> 1000003203540: However, a sequence of a vowel followed by such a vocalised /r/ is not considered a diphthong: Bär {(IPA+[bɛːɐ̯]+[bɛːɐ̯])} 'bear', er {(IPA+[eːɐ̯]+[eːɐ̯])} 'he', wir {(IPA+[viːɐ̯]+[viːɐ̯])} 'we', Tor {(IPA+[toːɐ̯]+[toːɐ̯])} 'gate', kurz {(IPA+[kʊɐ̯ts]+[kʊɐ̯ts])} 'short', Wörter {(IPA+[vœɐ̯tɐ]+[vœɐ̯tɐ])} 'words'. 10303500 -> 1000003203550: In most varieties of standard German, word stems that begin with a vowel are preceded by a glottal stop [ʔ]. 10303510 -> 1000003203560: Consonants 10303520 -> 1000003203570: c standing by itself is not a German letter. 10303530 -> 1000003203580: In borrowed words, it is usually pronounced [ʦ] (before ä, äu, e, i, ö, ü, y) or [k] (before a, o, u, or before consonants). 10303540 -> 1000003203590: The combination ck is, as in English, used to indicate that the preceding vowel is short. 10303550 -> 1000003203600: ch occurs most often and is pronounced either [ç] (after ä, ai, äu, e, ei, eu, i, ö, ü and after consonants) or [x] (after a, au, o, u). 10303560 -> 1000003203610: Ch never occurs at the beginning of an originally German word. 10303570 -> 1000003203620: In borrowed words with initial Ch there is no single agreement on the pronunciation. 10303580 -> 1000003203630: For example, the word "Chemie" (chemistry) can be pronounced [keːˈmiː], [çeːˈmiː] or [ʃeːˈmiː] depending on dialect. 10303590 -> 1000003203640: dsch is pronounced ʤ (like j in Jungle) but appears in a few loanwords only. 10303600 -> 1000003203650: f is pronounced [f] as in "father". 10303610 -> 1000003203660: h is pronounced [h] like in "home" at the beginning of a syllable. 10303620 -> 1000003203670: After a vowel it is silent and only lengthens the vowel (e.g. "Reh" = roe deer). 10303630 -> 1000003203680: j is pronounced [j] in Germanic words ("Jahr" [jaːɐ]). 10303640 -> 1000003203690: In younger loanwords, it follows more or less the respective languages' pronunciations. 10303650 -> 1000003203700: l is always pronounced [l], never [ɫ] (the English "Dark L"). 10303660 -> 1000003203710: q only exists in combination with u and appears both in Germanic and Latin words ("quer"; "Qualität"). 10303670 -> 1000003203720: It is pronounced [kv]. 10303680 -> 1000003203730: r is pronounced as a guttural sound (an uvular trill, [ʀ]) in front of a vowel or consonant ("Rasen" [ʀaːzən]; "Burg" like [buʀg]). 
10303690 -> 1000003203740: In spoken German however, it is commonly vocalised after a vowel ("er" being pronounced rather like ['ɛɐ] - "Burg" [buɐg]). 10303700 -> 1000003203750: In some southern non-standard varieties, the r is pronounced as a tongue-tip r (the alveolar trill). 10303710 -> 1000003203760: s in Germany, is pronounced [z] (as in "Zebra") if it forms the syllable onset (e.g. Sohn [zoːn]), otherwise [s] (e.g. Bus [bʊs]). 10303720 -> 1000003203770: In Austria, always pronounced [s]. 10303730 -> 1000003203780: A ss [s] indicates that the preceding vowel is short. st and sp at the beginning of words of German origin are pronounced [ʃt] and [ʃp], respectively. 10303740 -> 1000003203790: ß (a letter unique to German called "Esszet") was a ligature of a double s and of a sz and is always pronounced [s]. 10303750 -> 1000003203800: Originating in Blackletter typeface, it traditionally replaced ss at the end of a syllable (e.g. "ich muss" → "ich muß"; "ich müsste" → "ich müßte"); within a word it contrasts with ss [s] in indicating that the preceding vowel is long (compare "in Maßen" [in 'maːsən] "with moderation" and "in Massen" [in 'masən] "in loads"). 10303760 -> 1000003203810: The use of ß has recently been limited by the latest German spelling reform and is no longer used for ss at the end of a syllable; Switzerland and Liechtenstein already abolished it in 1934. 10303770 -> 1000003203820: sch is pronounced [ʃ] (like "sh" in "Shine"). 10303780 -> 1000003203830: v is pronounced [f] in words of Germanic origin (e.g. "Vater" [ˈfaːtɐ]) and [v] in most other words (e.g. "Vase" [ˈvaːzǝ]). 10303790 -> 1000003203840: w is pronounced [v] like in "vacation" (e.g. "was" [vas]). 10303800 -> 1000003203850: y only appears in loanwords and is traditionally considered a vowel. 10303810 -> 1000003203860: z is always pronounced [ʦ] (e.g. "zog" [ʦoːk]). 10303820 -> 1000003203870: A tz indicates that the preceding vowel is short. 10303830 -> 1000003203880: Consonant shifts 10303840 -> 1000003203890: German does not have any dental fricatives (as English th). 10303850 -> 1000003203900: The th sounds, which the English language has inherited from Anglo Saxon, survived on the continent up to Old High German and then disappeared in German with the consonant shifts between the 8th and the 10th century. 10303860 -> 1000003203910: It is sometimes possible to find parallels between German by replacing the English th with d in German: "Thank" → in German "Dank", "this" and "that" → "dies" and "das", "thou" (old 2nd person singular pronoun) → "du", "think" → "denken", "thirsty" → "durstig" and many other examples. 10303870 -> 1000003203920: Likewise, the gh in Germanic English words, pronounced in several different ways in modern English (as an f, or not at all), can often be linked to German ch: "to laugh" → "lachen", "through" and "thorough" → "durch", "high" → "hoch", "naught" → "nichts", etc. 10303880 -> 1000003203930: Cognates with English 10303890 -> 1000003203940: There are many thousands of German words that are cognate to English words (in fact a sizeable fraction of native German and English vocabulary, although for various reasons much of it is not immediately obvious). 10303900 -> 1000003203950: Most of the words in the following table have almost the same meaning as in English. 10303910 -> 1000003203960: Compound word cognates 10303920 -> 1000003203970: When these cognates have slightly different consonants, this is often due to the High German consonant shift. 
10303930 -> 1000003203980: Hence the affinity of English words with those of German dialects is more evidently: 10303940 -> 1000003203990: There are cognates whose meanings in either language have changed through the centuries. 10303950 -> 1000003204000: It is sometimes difficult for both English and German speakers to discern the relationship. 10303960 -> 1000003204010: On the other hand, once the definitions are made clear, then the logical relation becomes obvious. 10303970 -> 1000003204020: Sometimes the generality or specificity of word pairs may be opposite in the two languages. 10303980 -> 1000003204030: German and English also share many borrowings from other languages, especially Latin, French and Greek. 10303990 -> 1000003204040: Most of these words have the same meaning, while a few have subtle differences in meaning. 10304000 -> 1000003204050: As many of these words have been borrowed by numerous languages, not only German and English, they are called internationalisms in German linguistics. 10304010 -> 1000003204060: For reference, a good number of these borrowed words are of the neuter gender. 10304020 -> 1000003204070: Words borrowed by English 10304030 -> 1000003204080: For a list of German loanwords in English, see Category:German loanwords 10304040 -> 1000003204090: In the English language, there are also many words taken from German without any letter change, e.g.: 10304050 -> 1000003204100: Names for German in other languages 10304060 -> 1000003204110: See also: Deutsch, Dutch, Deitsch, Dietsch, Teuton, Teutonic, Allemanic, Alleman, Theodisca 10304070 -> 1000003204120: The names that countries have for the language differ from region to region. 10304080 -> 1000003204130: In Italian the sole name for German is still tedesco, from the Latin theodiscus, meaning "vernacular". 10304090 -> 1000003204140: A possible explanation for the use of words meaning "mute" (e.g., nemoj in Russian, němý in Czech, nem in Serbian) to refer to German (and also to Germans) in Slavic languages is that Germans were the first people Slavic tribes encountered with whom they could not communicate. 10304100 -> 1000003204150: Romanian used to use the Slavonic term "nemţeşte", but "germană" is now widely used. 10304110 -> 1000003204160: Hungarian "német" is also of Slavonic origin. 10304120 -> 1000003204170: The Arabic name for Austria, النمسا ("an-namsa"), is derived from the Slavonic term. 10304130 -> 1000003204180: Note also that though the Russian term for the language is немецкий (nemetskij), the country is Германия (Germania). 10304140 -> 1000003204190: However, in certain other Slavic languages, such as Czech, the country name (Německo) is similar to the name of the language, německý (jazyk). 10304150 -> 1000003204200: Finns and Estonians use the term saksa, originally from the Saxon tribe. 10304160 -> 1000003204210: Scandinavians use derivatives of the word Tyskland/Þýskaland (from Theodisca) for the country and tysk(a)/þýska for the language. 10304170 -> 1000003204220: Hebrew traditionally (nowadays this is not the case) used the Biblical term אַשְׁכֲּנָז (Ashkenaz) (Genesis 10:3) to refer to Germany, or to certain parts of it, and the Ashkenazi Jews are those who originate from Germany and Eastern Europe and formerly spoke Yiddish as their native language, derived from Middle High German. 10304180 -> 1000003204230: Modern Hebrew uses גֶּרְמָנִי germaní (Or גֶּרְמָנִית germanít for the language). 
10304190 -> 1000003204240: The French term is allemand, the Spanish term is alemán, the Catalan term is alemany, and the Portuguese term is alemão; all derive from the ancient Alamanni tribal alliance, meaning literally "All Men". 10304200 -> 1000003204250: The Latvian term vācu means "tinny" and refers disparagingly to the iron-clad Teutonic Knights that colonized the Baltic in the Middle Ages. 10304210 -> 1000003204260: The Scottish Gaelic term for the German language, Gearmailtis, is formed in the standard way of adding -(a)is to the end of the country name. 10304220 -> 1000003204270: See Names for Germany for further details on the origins of these and other terms. Google 10320010 -> 1000003300020: Google 10320020 -> 1000003300030: Google Inc. (NASDAQ: GOOG and LSE: GGEA) is an American public corporation, earning revenue from advertising related to its Internet search, web-based e-mail, online mapping, office productivity, social networking, and video sharing services, as well as from selling advertising-free versions of the same technologies. 10320030 -> 1000003300040: Google's headquarters, the Googleplex, is located in Mountain View, California. 10320040 -> 1000003300050: As of June 30, 2008, the company has 19,604 full-time employees. 10320050 -> 1000003300060: As of October 31, 2007, it is the largest American company (by market capitalization) that is not part of the Dow Jones Industrial Average. 10320060 -> 1000003300070: Google was co-founded by Larry Page and Sergey Brin while they were students at Stanford University, and the company was first incorporated as a privately held company on September 7, 1998. 10320070 -> 1000003300080: Google's initial public offering took place on August 19, 2004, raising US$1.67 billion and making the company worth US$23 billion. 10320080 -> 1000003300090: Google has continued its growth through a series of new product developments, acquisitions, and partnerships. 10320090 -> 1000003300100: Environmentalism, philanthropy, and positive employee relations have been important tenets during Google's growth, the latter resulting in the company being identified multiple times as Fortune Magazine's #1 Best Place to Work. 10320100 -> 1000003300110: The company's unofficial slogan is "Don't be evil", although criticisms of Google include concerns regarding the privacy of personal information, copyright, censorship, and discontinuation of services. 10320110 -> 1000003300120: History 10320120 -> 1000003300130: Google began in January 1996 as a research project by Larry Page, who was soon joined by Sergey Brin; both were Ph.D. students at Stanford University in California. 10320130 -> 1000003300140: They hypothesized that a search engine that analyzed the relationships between websites would produce better ranking of results than existing techniques, which ranked results according to the number of times the search term appeared on a page. 10320140 -> 1000003300150: Their search engine was originally nicknamed "BackRub" because the system checked backlinks to estimate a site's importance. 10320150 -> 1000003300160: A small search engine called Rankdex was already exploring a similar strategy. 10320160 -> 1000003300170: Convinced that the pages with the most links to them from other highly relevant web pages must be the most relevant pages associated with the search, Page and Brin tested their thesis as part of their studies and laid the foundation for their search engine.
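The link-analysis idea just described, counting links from relevant pages as votes of importance, is what the PageRank algorithm formalizes. The following is only a minimal, generic power-iteration sketch on an invented three-page link graph, not Google's implementation; the damping factor 0.85 is the value commonly quoted in the PageRank literature.

```python
# Minimal power-iteration sketch of the PageRank idea on a toy link graph.
# links[p] lists the pages that p links to; the graph itself is invented.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}               # start from a uniform score
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            share = rank[page] / len(targets)        # each page passes its score on
            for target in targets:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

# Page C is linked from both A and B, so it ends up with the highest score.
print(pagerank(links))
```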
10320170 -> 1000003300180: Originally, the search engine used the Stanford University website with the domain google.stanford.edu. 10320180 -> 1000003300190: The domain google.com was registered on September 15, 1997, and the company was incorporated as Google Inc. on September 7, 1998, in a friend's garage in Menlo Park, California. 10320190 -> 1000003300200: The total initial investment raised for the new company amounted to almost US$1.1 million, including a US$100,000 check from Andy Bechtolsheim, one of the founders of Sun Microsystems. 10320200 -> 1000003300210: In March 1999, the company moved into offices in Palo Alto, home to several other noted Silicon Valley technology startups. 10320210 -> 1000003300220: After quickly outgrowing two other sites, the company leased a complex of buildings in Mountain View at 1600 Amphitheatre Parkway from Silicon Graphics (SGI) in 2003. 10320220 -> 1000003300230: The company has remained at this location ever since, and the complex has since come to be known as the Googleplex (a play on the word googolplex). 10320230 -> 1000003300240: In 2006, Google bought the property from SGI for US$319 million. 10320240 -> 1000003300250: The Google search engine attracted a loyal following among the growing number of Internet users, who liked its simple design and usability. 10320250 -> 1000003300260: In 2000, Google began selling advertisements associated with search keywords. 10320260 -> 1000003300270: The ads were text-based to maintain an uncluttered page design and to maximize page loading speed. 10320270 -> 1000003300280: Keywords were sold based on a combination of price bid and clickthroughs, with bidding starting at US$0.05 per click. 10320280 -> 1000003300290: This model of selling keyword advertising was pioneered by Goto.com (later renamed Overture Services, before being acquired by Yahoo! and rebranded as Yahoo! Search Marketing). 10320290 -> 1000003300300: While many of its dot-com rivals failed in the new Internet marketplace, Google quietly rose in stature while generating revenue. 10320300 -> 1000003300310: The name "Google" originated from a common misspelling of the word "googol", which refers to 10^100, the number represented by a 1 followed by one hundred zeros. 10320310 -> 1000003300320: Having found its way increasingly into everyday language, the verb "google" was added to the Merriam-Webster Collegiate Dictionary and the Oxford English Dictionary in 2006, meaning "to use the Google search engine to obtain information on the Internet." 10320320 -> 1000003300330: A patent describing part of Google's ranking mechanism (PageRank) was granted on September 4, 2001. 10320330 -> 1000003300340: The patent was officially assigned to Stanford University and lists Lawrence Page as the inventor.
10320390 -> 1000003300400: Of that, 14,142,135 (another mathematical reference, as √2 ≈ 1.4142135) were floated by Google, and the remaining 5,462,917 were offered by existing stockholders. 10320400 -> 1000003300410: The sale of US$1.67 billion gave Google a market capitalization of more than US$23 billion. 10320410 -> 1000003300420: The vast majority of Google's 271 million shares remained under Google's control. 10320420 -> 1000003300430: Many of Google's employees became instant paper millionaires. 10320430 -> 1000003300440: Yahoo!, a competitor of Google, also benefited from the IPO because it owned 8.4 million shares of Google as of August 9, 2004, ten days before the IPO. 10320440 -> 1000003300450: Google's stock has performed well since the IPO, with shares hitting US$700 for the first time on October 31, 2007, due to strong sales and earnings in the advertising market, as well as the release of new features such as the desktop search function and its iGoogle personalized home page. 10320450 -> 1000003300460: The surge in stock price is fueled primarily by individual investors, as opposed to large institutional investors and mutual funds. 10320460 -> 1000003300470: The company is listed on the NASDAQ stock exchange under the ticker symbol GOOG and on the London Stock Exchange under the ticker symbol GGEA. 10320470 -> 1000003300480: Growth 10320480 -> 1000003300490: While the company's primary business interest is in the web content arena, Google has begun experimenting with other markets, such as radio and print publications. 10320490 -> 1000003300500: On January 17, 2006, Google announced its purchase of the radio advertising company dMarc, which provides an automated system that allows companies to advertise on the radio. 10320500 -> 1000003300510: This will allow Google to combine two niche advertising media—the Internet and radio—with Google's ability to laser-focus on the tastes of consumers. 10320510 -> 1000003300520: Google has also begun an experiment in selling advertisements from its advertisers in offline newspapers and magazines, with select advertisements in the Chicago Sun-Times. 10320520 -> 1000003300530: They have been filling unsold space in the newspaper that would have normally been used for in-house advertisements. 10320530 -> 1000003300540: Google was added to the S&P 500 index on March 30, 2006. 10320540 -> 1000003300550: It replaced Burlington Resources, a major oil producer based in Houston, which was acquired by ConocoPhillips. 10320550 -> 1000003300560: Acquisitions 10320560 -> 1000003300570: Since 2001, Google has acquired several small start-up companies, often consisting of innovative teams and products. 10320570 -> 1000003300580: One of the earlier companies that Google bought was Pyra Labs. 10320580 -> 1000003300590: They were the creators of Blogger, a weblog publishing platform, first launched in 1999. 10320590 -> 1000003300600: This acquisition led to many premium features becoming free. 10320600 -> 1000003300610: Pyra Labs was originally formed by Evan Williams, who left Google in 2004. 10320610 -> 1000003300620: In early 2006, Google acquired Upstartle, a company responsible for the online word processor Writely. 10320620 -> 1000003300630: The technology in this product was used by Google to eventually create Google Docs & Spreadsheets. 10320630 -> 1000003300640: In 2004, Google acquired a company called Keyhole, Inc., which developed a product called Earth Viewer that was renamed Google Earth in 2005.
10320640 -> 1000003300650: In February 2006, software company Adaptive Path sold Measure Map, a weblog statistics application, to Google. 10320650 -> 1000003300660: Registration to the service has since been temporarily disabled. 10320660 -> 1000003300670: The last update regarding the future of Measure Map was made on April 6, 2006, and outlined many of the service's known issues. 10320670 -> 1000003300680: In late 2006, Google bought online video site YouTube for US$1.65 billion in stock. 10320680 -> 1000003300690: Shortly after, on October 31, 2006, Google announced that it had also acquired JotSpot, a developer of wiki technology for collaborative Web sites. 10320690 -> 1000003300700: On April 13, 2007, Google reached an agreement to acquire DoubleClick. 10320700 -> 1000003300710: Google agreed to buy the company for US$3.1 billion. 10320710 -> 1000003300720: On July 9, 2007, Google announced that it had signed a definitive agreement to acquire enterprise messaging security and compliance company Postini. 10320720 -> 1000003300730: Partnerships 10320730 -> 1000003300740: In 2005, Google entered into partnerships with other companies and government agencies to improve production and services. 10320740 -> 1000003300750: Google announced a partnership with NASA Ames Research Center to build {(Convert+1000000 square feet (93000 m²)+1000000+sqft+m2+-3)} of offices and work on research projects involving large-scale data management, nanotechnology, distributed computing, and the entrepreneurial space industry. 10320750 -> 1000003300760: Google also entered into a partnership with Sun Microsystems in October to help share and distribute each other's technologies. 10320760 -> 1000003300770: The company entered into a partnership with Time Warner's AOL to enhance each other's video search services. 10320770 -> 1000003300780: The same year, the company became a major financial investor in the new .mobi top-level domain for mobile devices, in conjunction with several other companies, including Microsoft, Nokia, and Ericsson. 10320780 -> 1000003300790: In September 2007, Google launched "AdSense for Mobile", a service for its publishing partners that provides the ability to monetize their mobile websites through the targeted placement of mobile text ads, and acquired the mobile social networking site Zingku.mobi to "provide people worldwide with direct access to Google applications, and ultimately the information they want and need, right from their mobile devices." 10320790 -> 1000003300800: In 2006, Google and News Corp.'s Fox Interactive Media entered into a US$900 million agreement to provide search and advertising on the popular social networking site MySpace. 10320800 -> 1000003300810: On November 5, 2007, Google announced the Open Handset Alliance to develop an open platform for mobile services called Android. 10320810 -> 1000003300820: In March 2008, Google, Sprint, Intel, Comcast, Time Warner Cable, Bright House Networks, and Clearwire together founded Xohm to provide wireless telecommunication services. 10320820 -> 1000003300830: Products and services 10320830 -> 1000003300840: Google has created services and tools for the general public and business environment alike, including Web applications, advertising networks, and solutions for businesses. 10320840 -> 1000003300850: Advertising 10320850 -> 1000003300860: Most of Google's revenue is derived from advertising programs.
10320860 -> 1000003300870: For the 2006 fiscal year, the company reported US$10.492 billion in total advertising revenues and only US$112 million in licensing and other revenues. 10320870 -> 1000003300880: Google AdWords allows Web advertisers to display advertisements in Google's search results and the Google Content Network, through either a cost-per-click or cost-per-view scheme. 10320880 -> 1000003300890: Google AdSense website owners can also display adverts on their own site, and earn money every time ads are clicked. 10320890 -> 1000003300900: Web-based software 10320900 -> 1000003300910: The Google web search engine is the company's most popular service. 10320910 -> 1000003300920: As of August 2007, Google is the most used search engine on the web with a 53.6% market share, ahead of Yahoo! (19.9%) and Live Search (12.9%). 10320920 -> 1000003300930: Google indexes billions of Web pages, so that users can search for the information they desire, through the use of keywords and operators. 10320930 -> 1000003300940: Google has also employed the Web Search technology into other search services, including Image Search, Google News, the price comparison site Google Product Search, the interactive Usenet archive Google Groups, Google Maps, and more. 10320940 -> 1000003300950: In 2004, Google launched its own free web-based e-mail service, known as Gmail (or Google Mail in some jurisdictions). 10320950 -> 1000003300960: Gmail features spam-filtering technology and the capability to use Google technology to search e-mail. 10320960 -> 1000003300970: The service generates revenue by displaying advertisements and links from the AdWords service that are tailored to the choice of the user and/or content of the e-mail messages displayed on screen. 10320970 -> 1000003300980: In early 2006, the company launched Google Video, which not only allows users to search and view freely available videos but also offers users and media publishers the ability to publish their content, including television shows on CBS, NBA basketball games, and music videos. 10320980 -> 1000003300990: In August 2007, Google announced that it would shut down its video rental and sale program and offer refunds and Google Checkout credits to consumers who had purchased videos to own. 10320990 -> 1000003301000: On February 28, 2008 Google launched the Google Sites wiki as a Google Apps component. 10321000 -> 1000003301010: Google has also developed several desktop applications, including Google Earth, an interactive mapping program powered by satellite and aerial imagery that covers the vast majority of the planet. 10321010 -> 1000003301020: Google Earth is generally considered to be remarkably accurate and extremely detailed. 10321020 -> 1000003301030: Many major cities have such detailed images that one can zoom in close enough to see vehicles and pedestrians clearly. 10321030 -> 1000003301040: Consequently, there have been some concerns about national security implications. 10321040 -> 1000003301050: Specifically, some countries and militaries contend the software can be used to pinpoint with near-precision accuracy the physical location of critical infrastructure, commercial and residential buildings, bases, government agencies, and so on. 10321050 -> 1000003301060: However, the satellite images are not necessarily frequently updated, and all of them are available at no charge through other products and even government sources. 10321060 -> 1000003301070: For example, NASA and the National Geospatial-Intelligence Agency. 
10321070 -> 1000003301080: Some counter this argument by stating that Google Earth makes it easier to access and research the images. 10321080 -> 1000003301090: Many other products are available through Google Labs, which is a collection of incomplete applications that are still being tested for use by the general public. 10321090 -> 1000003301100: Google has promoted their products in various ways. 10321100 -> 1000003301110: In London, Google Space was set-up in Heathrow Airport, showcasing several products, including Gmail, Google Earth and Picasa. 10321110 -> 1000003301120: Also, a similar page was launched for American college students, under the name College Life, Powered by Google. 10321120 -> 1000003301130: In 2007, some reports surfaced that Google was planning the release of its own mobile phone, possibly a competitor to Apple's iPhone. 10321130 -> 1000003301140: The project, called Android provides a standard development kit that will allow any "Android" phone to run software developed for the Android SDK, no matter the phone manufacturer. 10321140 -> 1000003301150: In October 2007, Google SMS service was launched in India allowing users to get business listings, movie showtimes, and information by sending an SMS. 10321150 -> 1000003301160: Enterprise products 10321160 -> 1000003301170: In 2007, Google launched Google Apps Premier Edition, a version of Google Apps targeted primarily at the business user. 10321170 -> 1000003301180: It includes such extras as more disk space for e-mail, API access, and premium support, for a price of US$50 per user per year. 10321180 -> 1000003301190: A large implementation of Google Apps with 38,000 users is at Lakehead University in Thunder Bay, Ontario, Canada. 10321190 -> 1000003301200: Platform 10321200 -> 1000003301210: Google runs its services on several server farms, each comprising thousands of low-cost commodity computers running stripped-down versions of Linux. 10321210 -> 1000003301220: While the company divulges no details of its hardware, a 2006 estimate cites 450,000 servers, "racked up in clusters at data centers around the world." 10321220 -> 1000003301230: Corporate affairs and culture 10321230 -> 1000003301240: Google is known for its relaxed corporate culture, of which its playful variations on its own corporate logo are an indicator. 10321240 -> 1000003301250: In 2007 and 2008, Fortune Magazine placed Google at the top of its list of the hundred best places to work. 10321250 -> 1000003301260: Google's corporate philosophy embodies such casual principles as "you can make money without doing evil," "you can be serious without a suit," and "work should be challenging and the challenge should be fun." 10321260 -> 1000003301270: Google has been criticized for having salaries below industry standards. 10321270 -> 1000003301280: For example, some system administrators earn no more than US$35,000 per year – considered to be quite low for the Bay Area job market. 10321280 -> 1000003301290: However, Google's stock performance following its IPO has enabled many early employees to be competitively compensated by participation in the corporation's remarkable equity growth. 10321290 -> 1000003301300: Google implemented other employee incentives in 2005, such as the Google Founders' Award, in addition to offering higher salaries to new employees. 10321300 -> 1000003301310: Google's workplace amenities, culture, global popularity, and strong brand recognition have also attracted potential applicants. 
10321310 -> 1000003301320: After the company's IPO in August 2004, it was reported that founders Sergey Brin and Larry Page, and CEO Eric Schmidt, requested that their base salary be cut to US$1.00. 10321320 -> 1000003301330: Subsequent offers by the company to increase their salaries have been turned down, primarily because "their primary compensation continues to come from returns on their ownership stakes in Google. 10321330 -> 1000003301340: As significant stockholders, their personal wealth is tied directly to sustained stock price appreciation and performance, which provides direct alignment with stockholder interests." 10321340 -> 1000003301350: Prior to 2004, Schmidt was making US$250,000 per year, and Page and Brin each earned a salary of US$150,000. 10321350 -> 1000003301360: They have all declined recent offers of bonuses and increases in compensation by Google's board of directors. 10321360 -> 1000003301370: In a 2007 report on the United States' richest people, Forbes reported that Sergey Brin and Larry Page were tied for #5 with a net worth of US$18.5 billion each. 10321370 -> 1000003301380: In 2007 and through early 2008, Google saw the departure of several top executives. 10321380 -> 1000003301390: Justin Rosenstein, a Google product manager, left in June 2007. 10321390 -> 1000003301400: Shortly thereafter, Gideon Yu, former chief financial officer of YouTube, a Google unit, joined Facebook, along with Benjamin Ling, a high-ranking engineer, who left in October 2007. 10321400 -> 1000003301410: In March 2008, two senior Google leaders announced their desire to pursue other opportunities. 10321410 -> 1000003301420: Sheryl Sandberg, former VP of global online sales and operations, became COO of Facebook, while Ash ElDifrawi, former head of brand advertising, left to become CMO of Netshops Inc. 10321420 -> 1000003301430: Googleplex 10321430 -> 1000003301440: Google's headquarters in Mountain View, California, is referred to as "the Googleplex" in a play on words; a googolplex being 1 followed by a googol of zeros, and the HQ being a complex of buildings (cf. multiplex, cineplex, etc.). 10321440 -> 1000003301450: The lobby is decorated with a piano, lava lamps, old server clusters, and a projection of search queries on the wall. 10321450 -> 1000003301460: The hallways are full of exercise balls and bicycles. 10321460 -> 1000003301470: Each employee has access to the corporate recreation center. 10321470 -> 1000003301480: Recreational amenities are scattered throughout the campus and include a workout room with weights and rowing machines, locker rooms, washers and dryers, a massage room, assorted video games, Foosball, a baby grand piano, a pool table, and ping pong. 10321480 -> 1000003301490: In addition to the rec room, there are snack rooms stocked with various foods and drinks. 10321490 -> 1000003301500: In 2006, Google moved into {(Convert+311000 square feet (28900 m²)+311000+sqft+m2+-2)} of office space in New York City, at 111 Eighth Ave. in Manhattan. 10321500 -> 1000003301510: The office was specially designed and built for Google and houses its largest advertising sales team, which has been instrumental in securing large partnerships, most recently deals with MySpace and AOL. 10321510 -> 1000003301520: In 2003, Google added an engineering staff in New York City, which has been responsible for more than 100 engineering projects, including Google Maps, Google Spreadsheets, and others.
10321520 -> 1000003301530: It is estimated that the building costs Google US$10 million per year to rent and is similar in design and functionality to its Mountain View headquarters, including foosball, air hockey, and ping-pong tables, as well as a video game area. 10321530 -> 1000003301540: In November 2006, Google opened offices on Carnegie Mellon's campus in Pittsburgh. 10321540 -> 1000003301550: By late 2006, Google also established a new headquarters for its AdWords division in Ann Arbor, Michigan. 10321550 -> 1000003301560: The size of Google's search system is presently undisclosed. 10321560 -> 1000003301570: The best estimates place the total number of the company's servers at 450,000, spread over twenty five locations throughout the world, including major operations centers in Dublin (European Operations Headquarters) and Atlanta, Georgia. 10321570 -> 1000003301580: Google is also in the process of constructing a major operations center in The Dalles, Oregon, on the banks of the Columbia River. 10321580 -> 1000003301590: The site, also referred to by the media as Project 02, was chosen due to the availability of inexpensive hydroelectric power and a large surplus of fiber optic cable, remnants of the dot com boom of the late 1990s. 10321590 -> 1000003301600: The computing center is estimated to be the size of two football fields, and it has created hundreds of construction jobs, causing local real estate prices to increase 40%. 10321600 -> 1000003301610: Upon completion, the center is expected to create 60 to 200 permanent jobs in the town of 12,000 people. 10321610 -> 1000003301620: Google is taking steps to ensure that their operations are environmentally sound. 10321620 -> 1000003301630: In October 2006, the company announced plans to install thousands of solar panels to provide up to 1.6 megawatts of electricity, enough to satisfy approximately 30% of the campus' energy needs. 10321630 -> 1000003301640: The system will be the largest solar power system constructed on a U.S. corporate campus and one of the largest on any corporate site in the world. 10321640 -> 1000003301650: In June 2007, Google announced that they plan to become carbon neutral by 2008, which includes investing in energy efficiency, renewable energy sources, and purchasing carbon offsets, such as investing in projects like capturing and burning methane from animal waste at Mexican and Brazilian farms. 10321650 -> 1000003301660: Innovation time off 10321660 -> 1000003301670: As an interesting motivation technique (usually called Innovation Time Off), all Google engineers are encouraged to spend 20% of their work time (one day per week) on projects that interest them. 10321670 -> 1000003301680: Some of Google's newer services, such as Gmail, Google News, Orkut, and AdSense originated from these independent endeavors. 10321680 -> 1000003301690: In a talk at Stanford University, Marissa Mayer, Google's Vice President of Search Products and User Experience, stated that her analysis showed that half of the new product launches originated from the 20% time. 10321690 -> 1000003301700: Easter eggs and April Fool's Day jokes 10321700 -> 1000003301710: Google has a tradition of creating April Fool's Day jokes—such as Google MentalPlex, which allegedly featured the use of mental power to search the web. 10321710 -> 1000003301720: In 2002, they claimed that pigeons were the secret behind their growing search engine. 
10321720 -> 1000003301730: In 2004, they featured Google Lunar (which claimed to feature jobs on the moon), and in 2005, a fictitious brain-boosting drink, termed Google Gulp was announced. 10321730 -> 1000003301740: In 2006, they came up with Google Romance, a hypothetical online dating service. 10321740 -> 1000003301750: In 2007, Google announced two joke products. 10321750 -> 1000003301760: The first was a free wireless Internet service called TiSP (Toilet Internet Service Provider) in which one obtained a connection by flushing one end of a fiber-optic cable down their toilet and waiting only an hour for a "Plumbing Hardware Dispatcher (PHD)" to connect it to the Internet. 10321760 -> 1000003301770: Additionally, Google's Gmail page displayed an announcement for Gmail Paper, which allows users of their free email service to have email messages printed and shipped to a snail mail address. 10321770 -> 1000003301780: Google's services contain a number of Easter eggs; for instance, the Language Tools page offers the search interface in the Swedish Chef's "Bork bork bork," Pig Latin, ”Hacker” (actually leetspeak), Elmer Fudd, and Klingon. 10321780 -> 1000003301790: In addition, the search engine calculator provides the Answer to Life, the Universe, and Everything from Douglas Adams' The Hitchhiker's Guide to the Galaxy. 10321790 -> 1000003301800: As Google's search box can be used as a unit converter (as well as a calculator), some non-standard units are built in, such as the Smoot. 10321800 -> 1000003301810: Google also routinely modifies its logo in accordance with various holidays or special events throughout the year, such as Christmas, Mother's Day, or the birthdays of various notable individuals. 10321810 -> 1000003301820: IPO and culture 10321820 -> 1000003301830: Many people speculated that Google's IPO would inevitably lead to changes in the company's culture, because of shareholder pressure for employee benefit reductions and short-term advances, or because a large number of the company's employees would suddenly become millionaires on paper. 10321830 -> 1000003301840: In a report given to potential investors, co-founders Sergey Brin and Larry Page promised that the IPO would not change the company's culture. 10321840 -> 1000003301850: Later Mr. Page said, "We think a lot about how to maintain our culture and the fun elements. 10321850 -> 1000003301860: We spent a lot of time getting our offices right. 10321860 -> 1000003301870: We think it's important to have a high density of people. 10321870 -> 1000003301880: People are packed together everywhere. 10321880 -> 1000003301890: We all share offices. 10321890 -> 1000003301900: We like this set of buildings because it's more like a densely packed university campus than a typical suburban office park." 10321900 -> 1000003301910: However, many analysts are finding that as Google grows, the company is becoming more "corporate". 10321910 -> 1000003301920: In 2005, articles in The New York Times and other sources began suggesting that Google had lost its anti-corporate, no evil philosophy. 10321920 -> 1000003301930: In an effort to maintain the company's unique culture, Google has designated a Chief Culture Officer in 2006, who also serves as the Director of Human Resources. 
10321930 -> 1000003301940: The purpose of the Chief Culture Officer is to develop and maintain the culture and work on ways to keep true to the core values that the company was founded on in the beginning—a flat organization, a lack of hierarchy, a collaborative environment. 10321940 -> 1000003301950: Philanthropy 10321950 -> 1000003301960: In 2004, Google formed a for-profit philanthropic wing, Google.org, with a start-up fund of US$1 billion. 10321960 -> 1000003301970: The express mission of the organization is to create awareness about climate change, global public health, and global poverty. 10321970 -> 1000003301980: One of its first projects is to develop a viable plug-in hybrid electric vehicle that can attain 100 mpg. 10321980 -> 1000003301990: The founding and current director is Dr. Larry Brilliant. 10321990 -> 1000003302000: Criticism 10322000 -> 1000003302010: As it has grown, Google has found itself the focus of several controversies related to its business practices and services. 10322010 -> 1000003302020: For example, Google Book Search's effort to digitize millions of books and make the full text searchable has led to copyright disputes with the Authors Guild. 10322020 -> 1000003302030: Google's cooperation with the governments of China, and to a lesser extent France and Germany (regarding Holocaust denial) to filter search results in accordance to regional laws and regulations has led to claims of censorship. 10322030 -> 1000003302040: Google's persistent cookie and other information collection practices have led to concerns over user privacy. 10322040 -> 1000003302050: As of December 11, 2007, Google, like the Microsoft search engine, stores "personal information for 18 months" and by comparison, Yahoo! and AOL (Time Warner) "retain search requests for 13 months." 10322050 -> 1000003302060: A number of Indian state governments have raised concerns about the security risks posed by geographic details provided by Google Earth's satellite imaging. 10322060 -> 1000003302070: Google has also been criticized by advertisers regarding its inability to combat click fraud, when a person or automated script is used to generate a charge on an advertisement without really having an interest in the product. 10322070 -> 1000003302080: Industry reports in 2006 claim that approximately 14 to 20 percent of clicks were in fact fraudulent or invalid. 10322080 -> 1000003302090: Further, Google has faced allegations of sexism and ageism from former employees. 10322090 -> 1000003302100: Google has also faced accusations in Harper's Magazine of being extremely excessive with their energy usage, and were accused of employing their "Don't be evil" motto as well as their very public energy saving campaigns as means of trying to cover up or make up for the massive amounts of energy their servers actually require. 10322100 -> 1000003302110: Also, US District Court Judge Louis Stanton, on July 1, 2008 ordered Google to give YouTube user data / log to Viacom to support its case in a billion-dollar copyright lawsuit against Google. 10322110 -> 1000003302120: Google and Viacom, however, on July 14, 2008, agreed in compromise to protect YouTube users' personal data in the $ 1 billion (£ 497 million) copyright lawsuit. 10322120 -> 1000003302130: Google agreed it will make user information and internet protocol addresses from its YouTube subsidiary anonymous before handing over the data to Viacom. 
10322130 -> 1000003302140: The privacy deal also applied to other litigants including the FA Premier League, the Rodgers & Hammerstein Organisation and the Scottish Premier League. 10322140 -> 1000003302150: The deal, however, did not extend the anonymity to employees, since Viacom seeks to prove that Google staff were aware of the uploading of illegal material to the site. 10322150 -> 1000003302160: The parties will therefore meet further on the matter, lest the data be made available to the court. Google Translate 10330010 -> 1000003400020: Google Translate 10330020 -> 1000003400030: Google Translate is a service provided by Google Inc. to translate a section of text, or a webpage, into another language, with limits on the number of paragraphs and the range of technical terms translated. 10330030 -> 1000003400040: For some languages, users are asked for alternate translations, such as for technical terms, to be included for future updates to the translation process. 10330040 -> 1000003400050: Unlike other translation services such as Babel Fish, AOL, and Yahoo, which use SYSTRAN, Google uses its own translation software. 10330050 -> 1000003400060: Functions 10330060 -> 1000003400070: The service also includes translation of an entire Web page. 10330070 -> 1000003400080: The translation is limited in the number of paragraphs per webpage (as indicated by break tags
<br>); however, if text on a webpage is separated by horizontal blank-line images (auto-wrapped without using any <br>
), a long webpage can be translated containing several thousand words. 10330080 -> 1000003400090: Google Translate, like other automatic translation tools, has its limitations. 10330090 -> 1000003400100: While it can help the reader to understand the general content of a foreign language text, it does not deliver accurate translations and does not produce publication-standard content, for example it often translates words out of context and is deliberately not applying any grammatical rules. 10330100 -> 1000003400110: Approach 10330110 -> 1000003400120: Google translate is based on an approach called statistical machine translation, and more specifically, on research by Franz-Josef Och who won the DARPA contest for speed machine translation in 2003. 10330120 -> 1000003400130: Och is now the head of Google's machine translation department. 10330130 -> 1000003400140: According to Och, a solid base for developing a usable statistical machine translation system for a new pair of languages from scratch, would consist in having a bilingual text corpus (or parallel collection) of more than a million words and two monolingual corpora of each more than a billion words. 10330140 -> 1000003400150: Statistical models from this data are then used to translate between those languages. 10330150 -> 1000003400160: To acquire this huge amount of linguistic data, Google used United Nations documents. 10330160 -> 1000003400170: The same document is normally available in all six official UN languages, thus Google now has a hectalingual corpus of 20 billion words' worth of human translations. 10330170 -> 1000003400180: The availability of Arabic and Chinese as official UN languages is probably one of the reasons why Google Translate initially focused on the development of translation between English and those languages, and not, for example, Japanese and German, which are not official languages at the UN. 10330180 -> 1000003400190: Google representatives have been very active at domestic conferences in Japan in the field asking researchers to provide them with bilingual corpora. 
10330190 -> 1000003400200: Options 10330200 -> 1000003400210: (by chronological order) 10330210 -> 1000003400220: Beginning 10330220 -> 1000003400230: English to Arabic 10330230 -> 1000003400240: English to French 10330240 -> 1000003400250: English to German 10330250 -> 1000003400260: English to Spanish 10330260 -> 1000003400270: French to English 10330270 -> 1000003400280: German to English 10330280 -> 1000003400290: Spanish to English 10330290 -> 1000003400300: Arabic to English 10330300 -> 1000003400310: 2nd stage 10330310 -> 1000003400320: English to Portuguese 10330320 -> 1000003400330: Portuguese to English 10330330 -> 1000003400340: 3rd stage 10330340 -> 1000003400350: English to Italian 10330350 -> 1000003400360: Italian to English 10330360 -> 1000003400370: 4th stage 10330370 -> 1000003400380: English to Chinese (Simplified) BETA 10330380 -> 1000003400390: English to Japanese BETA 10330390 -> 1000003400400: English to Korean BETA 10330400 -> 1000003400410: Chinese (Simplified) to English BETA 10330410 -> 1000003400420: Japanese to English BETA 10330420 -> 1000003400430: Korean to English BETA 10330430 -> 1000003400440: 5th stage 10330440 -> 1000003400450: English to Russian BETA 10330450 -> 1000003400460: Russian to English BETA 10330460 -> 1000003400470: 6th stage 10330470 -> 1000003400480: English to Arabic BETA 10330480 -> 1000003400490: Arabic to English BETA 10330490 -> 1000003400500: 7th stage (launched February, 2007) 10330500 -> 1000003400510: English to Chinese (Traditional) BETA 10330510 -> 1000003400520: Chinese (Traditional) to English BETA 10330520 -> 1000003400530: Chinese (Simplified to Traditional) BETA 10330530 -> 1000003400540: Chinese (Traditional to Simplified) BETA 10330540 -> 1000003400550: 8th stage (launched October, 2007) 10330550 -> 1000003400560: all 25 language pairs use Google's machine translation system 10330560 -> 1000003400570: 9th stage 10330570 -> 1000003400580: English to Hindi BETA 10330580 -> 1000003400590: Hindi to English BETA 10330590 -> 1000003400600: 10th stage (as of this stage, translation can be done between any two languages) 10330600 -> 1000003400610: Bulgarian 10330610 -> 1000003400620: Croatian 10330620 -> 1000003400630: Czech 10330630 -> 1000003400640: Danish 10330640 -> 1000003400650: Dutch 10330650 -> 1000003400660: Finnish 10330660 -> 1000003400670: Greek 10330670 -> 1000003400680: Norwegian 10330680 -> 1000003400690: Polish 10330690 -> 1000003400700: Romanian 10330700 -> 1000003400710: Swedish Grammar 10340010 -> 1000003500020: Grammar 10340020 -> 1000003500030: Grammar is the field of linguistics that covers the rules governing the use of any given natural language. 10340030 -> 1000003500040: It includes morphology and syntax, often complemented by phonetics, phonology, semantics, and pragmatics. 10340040 -> 1000003500050: Each language has its own distinct grammar. 10340050 -> 1000003500060: "English grammar" is the rules of the English language itself. 10340060 -> 1000003500070: "An English grammar" is a specific study or analysis of these rules. 10340070 -> 1000003500080: A reference book describing the grammar of a language is called a "reference grammar" or simply "a grammar". 10340080 -> 1000003500090: A fully explicit grammar exhaustively describing the grammatical constructions of a language is called a descriptive grammar, as opposed to linguistic prescription which tries to enforce the governing rules how a language is to be used. 
10340090 -> 1000003500100: Grammatical frameworks are approaches to constructing grammars. 10340100 -> 1000003500110: The standard framework of generative grammar is the transformational grammar model developed by Noam Chomsky and his followers from the 1950s to 1980s. 10340110 -> 1000003500120: Etymology 10340120 -> 1000003500130: The word "grammar," derives from Greek γραμματική τέχνη (grammatike techne), which means "art of letters," from γράμμα (gramma), "letter," and that from γράφειν (graphein), "to draw, to write". 10340130 -> 1000003500140: History 10340140 -> 1000003500150: The first systematic grammars originate in Iron Age India, with Panini (4th c. BC) and his commentators Pingala (ca. 200 BC), Katyayana, and Patanjali (2nd c. BC). 10340150 -> 1000003500160: In the West, grammar emerges as a discipline in Hellenism from the 3rd c. BC forward with authors like Rhyanus and Aristarchus of Samothrace, the oldest extant work being the Art of Grammar ({(Lang+Τέχνη Γραμματική+grc+Τέχνη Γραμματική)}), attributed to Dionysius Thrax (ca. 100 BC). 10340160 -> 1000003500170: Latin grammar developed by following Greek models from the 1st century BC, due to the work of authors such as Orbilius Pupillus, Remmius Palaemon, Marcus Valerius Probus, Verrius Flaccus, Aemilius Asper. 10340170 -> 1000003500180: Tamil grammatical tradition also began around the 1st century BC with the Tolkāppiyam. 10340180 -> 1000003500190: A grammar of Irish originated in the 7th century with the Auraicept na n-Éces. 10340190 -> 1000003500200: Arabic grammar emerges from the 8th century with the work of Ibn Abi Ishaq and his students. 10340200 -> 1000003500210: The first treatises on Hebrew grammar appear in the High Middle Ages, in the context of Mishnah (exegesis of the Hebrew Bible). 10340210 -> 1000003500220: The Karaite tradition originates in Abbasid Baghdad. 10340220 -> 1000003500230: The Diqduq (10th century) is one of the earliest grammatical commentaries on the Hebrew Bible. 10340230 -> 1000003500240: Ibn Barun in the 12th century compares the Hebrew language with Arabic in the Islamic grammatical tradition. 10340240 -> 1000003500250: Belonging to the trivium of the seven liberal arts, grammar was taught as a core discipline throughout the Middle Ages, following the influence of authors from Late Antiquity, such as Priscian. 10340250 -> 1000003500260: Treatment of vernaculars begins gradually during the High Middle Ages, with isolated works such as the First Grammatical Treatise, but becomes influential only in the Renaissance and Baroque periods. 10340260 -> 1000003500270: In 1486, Antonio de Nebrija published Las introduciones Latinas contrapuesto el romance al Latin, and the first Spanish grammar, Gramática de la lengua castellana, in 1492. 10340270 -> 1000003500280: During the 16th century Italian Renaissance, the Questione della lingua was the discussion on the status and ideal form of the Italian language, initiated by Dante's de vulgari eloquentia (Pietro Bembo, Prose della volgar lingua Venice 1525). 10340280 -> 1000003500290: Grammars of non-European languages began to be compiled for the purposes of evangelization and Bible translation from the 16th century onward, such as Grammatica o Arte de la Lengua General de los Indios de los Reynos del Perú (1560), and a Quechua grammar by Fray Domingo de Santo Tomás. 10340290 -> 1000003500300: In 1643 there appeared Ivan Uzhevych's Grammatica sclavonica and, in 1762, the Short Introduction to English Grammar of Robert Lowth was also published. 
10340300 -> 1000003500310: The Grammatisch-Kritisches Wörterbuch der hochdeutschen Mundart, a High German grammar in five volumes by Johann Christoph Adelung, appeared as early as 1774. 10340310 -> 1000003500320: From the latter part of the 18th century, grammar came to be understood as a subfield of the emerging discipline of modern linguistics. 10340320 -> 1000003500330: The Serbian grammar by Vuk Stefanović Karadžić arrived in 1814, while the Deutsche Grammatik of the Brothers Grimm was first published in 1818. 10340330 -> 1000003500340: The Comparative Grammar of Franz Bopp, the starting point of modern comparative linguistics, came out in 1833. 10340340 -> 1000003500350: In the USA, the Society for the Promotion of Good Grammar has designated March 4, 2008 as National Grammar Day. 10340350 -> 1000003500360: Development of grammars 10340360 -> 1000003500370: Grammars evolve through usage, and grammars also develop due to separations of the human population. 10340370 -> 1000003500380: With the advent of written representations, formal rules about language usage tend to appear also. 10340380 -> 1000003500390: Formal grammars are codifications of usage that are developed by repeated documentation over time, and by observation as well. 10340390 -> 1000003500400: As the rules become established and developed, the prescriptive concept of grammatical correctness can arise. 10340400 -> 1000003500410: This often creates a discrepancy between contemporary usage and that which has been accepted over time as being correct. 10340410 -> 1000003500420: Linguists tend to believe that prescriptive grammars do not have any justification beyond their authors' aesthetic tastes; however, prescriptions are considered in sociolinguistics as part of the explanation for why some people say "I didn't do nothing", some say "I didn't do anything", and some say one or the other depending on social context. 10340420 -> 1000003500430: The formal study of grammar is an important part of education for children from a young age through advanced learning, though the rules taught in schools are not a "grammar" in the sense most linguists use the term, as they are often prescriptive rather than descriptive. 10340430 -> 1000003500440: Constructed languages (also called planned languages or conlangs) are more common in the modern day. 10340440 -> 1000003500450: Many have been designed to aid human communication (for example, naturalistic Interlingua, schematic Esperanto, and the highly logic-compatible artificial language Lojban). 10340450 -> 1000003500460: Each of these languages has its own grammar. 10340460 -> 1000003500470: No clear line can be drawn between syntax and morphology. 10340470 -> 1000003500480: Analytic languages use syntax to convey information that is encoded via inflection in synthetic languages. 10340480 -> 1000003500490: In other words, word order is not significant and morphology is highly significant in a purely synthetic language, whereas morphology is not significant and syntax is highly significant in an analytic language. 10340490 -> 1000003500500: Chinese and Afrikaans, for example, are highly analytic, and meaning is therefore very context – dependent. 10340500 -> 1000003500510: (Both do have some inflections, and have had more in the past; thus, they are becoming even less synthetic and more "purely" analytic over time.) 10340510 -> 1000003500520: Latin, which is highly synthetic, uses affixes and inflections to convey the same information that Chinese does with syntax. 
10340520 -> 1000003500530: Because Latin words are quite (though not completely) self-contained, an intelligible Latin sentence can be made from elements that are placed in a largely arbitrary order. 10340530 -> 1000003500540: Latin has a complex affixation and a simple syntax, while Chinese has the opposite. 10340540 -> 1000003500550: Grammar frameworks 10340550 -> 1000003500560: Various "grammar frameworks" have been developed in theoretical linguistics since the mid 20th century, in particular under the influence of the idea of a "Universal grammar" in the USA. 10340560 -> 1000003500570: Of these, the main divisions are: 10340570 -> 1000003500580: Transformational grammar (TG)) 10340580 -> 1000003500590: Principles and Parameters Theory (P&P) 10340590 -> 1000003500600: Lexical-functional Grammar (LFG) 10340600 -> 1000003500610: Generalized Phrase Structure Grammar (GPSG) 10340610 -> 1000003500620: Head-Driven Phrase Structure Grammar (HPSG) 10340620 -> 1000003500630: Dependency grammars (DG) 10340630 -> 1000003500640: Role and reference grammar (RRG) HTML 10360010 -> 1000003600020: HTML 10360020 -> 1000003600030: HTML, an initialism of HyperText Markup Language, is the predominant markup language for web pages. 10360030 -> 1000003600040: It provides a means to describe the structure of text-based information in a document — by denoting certain text as links, headings, paragraphs, lists, and so on — and to supplement that text with interactive forms, embedded images, and other objects. 10360040 -> 1000003600050: HTML is written in the form of tags, surrounded by angle brackets. 10360050 -> 1000003600060: HTML can also describe, to some degree, the appearance and semantics of a document, and can include embedded scripting language code (such as JavaScript) which can affect the behavior of Web browsers and other HTML processors. 10360060 -> 1000003600070: HTML is also often used to refer to content in specific languages, such as a MIME type text/html, or even more broadly as a generic term for HTML, whether in its XML-descended form (such as XHTML 1.0 and later) or its form descended directly from SGML (such as HTML 4.01 and earlier). 10360070 -> 1000003600080: By convention, HTML format data files use a file extension .html or .htm. 10360080 -> 1000003600090: History of HTML 10360090 -> 1000003600100: Origins 10360100 -> 1000003600110: In 1980, physicist Tim Berners-Lee, who was an independent contractor at CERN, proposed and prototyped ENQUIRE, a system for CERN researchers to use and share documents. 10360110 -> 1000003600120: In 1989, Berners-Lee and CERN data systems engineer Robert Cailliau each submitted separate proposals for an Internet-based hypertext system providing similar functionality. 10360120 -> 1000003600130: The following year, they collaborated on a joint proposal, the WorldWideWeb (W3) project, which was accepted by CERN. 10360130 -> 1000003600140: First specifications 10360140 -> 1000003600150: The first publicly available description of HTML was a document called HTML Tags, first mentioned on the Internet by Berners-Lee in late 1991. 10360150 -> 1000003600160: It describes 22 elements comprising the initial, relatively simple design of HTML. 10360160 -> 1000003600170: Thirteen of these elements still exist in HTML 4. 
10360170 -> 1000003600180: Berners-Lee considered HTML to be, at the time, an application of SGML, but it was not formally defined as such until the mid-1993 publication, by the IETF, of the first proposal for an HTML specification: Berners-Lee and Dan Connolly's "Hypertext Markup Language (HTML)" Internet-Draft, which included an SGML Document Type Definition to define the grammar. 10360180 -> 1000003600190: The draft expired after six months, but was notable for its acknowledgment of the NCSA Mosaic browser's custom tag for embedding in-line images, reflecting the IETF's philosophy of basing standards on successful prototypes. 10360190 -> 1000003600200: Similarly, Dave Raggett's competing Internet-Draft, "HTML+ (Hypertext Markup Format)", from late 1993, suggested standardizing already-implemented features like tables and fill-out forms. 10360200 -> 1000003600210: After the HTML and HTML+ drafts expired in early 1994, the IETF created an HTML Working Group, which in 1995 completed "HTML 2.0", the first HTML specification intended to be treated as a standard against which future implementations should be based. 10360210 -> 1000003600220: Published as Request for Comments 1996, HTML 2.0 included ideas from the HTML and HTML+ drafts. 10360220 -> 1000003600230: There was no "HTML 1.0"; the 2.0 designation was intended to distinguish the new edition from previous drafts. 10360230 -> 1000003600240: Further development under the auspices of the IETF was stalled by competing interests. 10360240 -> 1000003600250: Since 1996, the HTML specifications have been maintained, with input from commercial software vendors, by the World Wide Web Consortium (W3C). 10360250 -> 1000003600260: However, in 2000, HTML also became an international standard (ISO/IEC 15445:2000). 10360260 -> 1000003600270: The last HTML specification published by the W3C is the HTML 4.01 Recommendation, published in late 1999. 10360270 -> 1000003600280: Its issues and errors were last acknowledged by errata published in 2001. 10360280 -> 1000003600290: Version history of the standard 10360290 -> 1000003600300: HTML versions 10360300 -> 1000003600310: July, 1993: Hypertext Markup Language, was published at IETF working draft (that is, not yet a standard). 10360310 -> 1000003600320: November, 1995: HTML 2.0 published as IETF Request for Comments: 10360320 -> 1000003600330: RFC 1866, 10360330 -> 1000003600340: supplemented by RFC 1867 (form-based file upload) that same month, 10360340 -> 1000003600350: RFC 1942 (tables) in May 1996, 10360350 -> 1000003600360: RFC 1980 (client-side image maps) in August 1996, and 10360360 -> 1000003600370: RFC 2070 (internationalization) in January 1997; 10360370 -> 1000003600380: Ultimately, all were declared obsolete/historic by RFC 2854 in June 2000. 10360380 -> 1000003600390: April 1995: HTML 3.0, proposed as a standard to the IETF. 10360390 -> 1000003600400: It included many of the capabilities that were in Raggett's HTML+ proposal, such as support for tables, text flow around figures, and the display of complex mathematical formulas. 10360400 -> 1000003600410: A demonstration appeared in W3C's own Arena browser. 10360410 -> 1000003600420: HTML 3.0 did not succeed for several reasons. 10360420 -> 1000003600430: The pace of browser development, as well as the number of interested parties, had outstripped the resources of the IETF. 
10360430 -> 1000003600440: Netscape continued to introduce HTML elements that specified the visual appearance of documents, contrary to the goals of the newly-formed W3C, which sought to limit HTML to describing logical structure. 10360440 -> 1000003600450: Microsoft, a newcomer at the time, played to all sides by creating its own tags, implementing Netscape's elements for compatibility, and supporting W3C features such as Cascading Style Sheets. 10360450 -> 1000003600460: January 14, 1997: HTML 3.2, published as a W3C Recommendation. 10360460 -> 1000003600470: It was the first version developed and standardized exclusively by the W3C, as the IETF had closed its HTML Working Group in September 1997. 10360470 -> 1000003600480: The new version dropped math formulas entirely, reconciled overlap among various proprietary extensions, and adopted most of Netscape's visual markup tags. 10360480 -> 1000003600490: Netscape's blink element and Microsoft's marquee element were omitted due to a mutual agreement between the two companies. 10360490 -> 1000003600500: The ability to include mathematical formulas in HTML would not be standardized until years later in MathML. 10360500 -> 1000003600510: December 18, 1997: HTML 4.0, published as a W3C Recommendation. 10360510 -> 1000003600520: It offers three "flavors": 10360520 -> 1000003600530: Strict, in which deprecated elements are forbidden, 10360530 -> 1000003600540: Transitional, in which deprecated elements are allowed, 10360540 -> 1000003600550: Frameset, in which mostly only frame related elements are allowed; 10360550 -> 1000003600560: HTML 4.0 (initially code-named "Cougar") likewise adopted many browser-specific element types and attributes, but at the same time sought to phase out Netscape's visual markup features by marking them as deprecated in favor of style sheets. 10360560 -> 1000003600570: Minor editorial revisions to the HTML 4.0 specification were published in 1998 without incrementing the version number and further minor revisions as HTML 4.01. 10360570 -> 1000003600580: April 24, 1998: HTML 4.0 was reissued with minor edits without incrementing the version number. 10360580 -> 1000003600590: December 24, 1999: HTML 4.01, published as a W3C Recommendation. 10360590 -> 1000003600600: It offers the same three flavors as HTML 4.0, and its last errata were published May 12, 2001. 10360600 -> 1000003600610: HTML 4.01 and ISO/IEC 15445:2000 are the most recent and final versions of HTML. 10360610 -> 1000003600620: May 15, 2000: ISO/IEC 15445:2000 ("ISO HTML", based on HTML 4.01 Strict), published as an ISO/IEC international standard. 10360620 -> 1000003600630: January 22, 2008: HTML 5, published as a Working Draft by W3C. 10360630 -> 1000003600640: XHTML versions 10360640 -> 1000003600650: XHTML is a separate language that began as a reformulation of HTML 4.01 using XML 1.0. 10360650 -> 1000003600660: It continues to be developed: 10360660 -> 1000003600670: XHTML 1.0, published January 26, 2000 as a W3C Recommendation, later revised and republished August 1, 2002. 10360670 -> 1000003600680: It offers the same three flavors as HTML 4.0 and 4.01, reformulated in XML, with minor restrictions. 10360680 -> 1000003600690: XHTML 1.1, published May 31, 2001 as a W3C Recommendation. 10360690 -> 1000003600700: It is based on XHTML 1.0 Strict, but includes minor changes, can be customized, and is reformulated using modules from Modularization of XHTML, which was published April 10, 2001 as a W3C Recommendation. 
10360700 -> 1000003600710: XHTML 2.0 is still a W3C Working Draft. 10360710 -> 1000003600720: XHTML 2.0 is incompatible with XHTML 1.x and would therefore be more accurately characterized as an XHTML-inspired new language than as an update to XHTML 1.x. 10360720 -> 1000003600730: XHTML 5, which is an update to XHTML 1.x, is being defined alongside HTML 5 in the HTML 5 draft. 10360730 -> 1000003600740: HTML markup 10360740 -> 1000003600750: HTML markup consists of several key components, including elements (and their attributes), character-based data types, and character references and entity references. 10360750 -> 1000003600760: Another important component is the document type declaration. 10360760 -> 1000003600770: HTML Hello World: 10360770 -> 1000003600780: Elements 10360780 -> 1000003600790: See HTML elements for more detailed descriptions. 10360790 -> 1000003600800: Elements are the basic structure for HTML markup. 10360800 -> 1000003600810: Elements have two basic properties: attributes and content. 10360810 -> 1000003600820: Each attribute and each element's content has certain restrictions that must be followed for an HTML document to be considered valid. 10360820 -> 1000003600830: An element usually has a start tag (e.g. <p>) and an end tag (e.g. </p>). 10360830 -> 1000003600840: The element's attributes are contained in the start tag and content is located between the tags (e.g. <p>Content</p>). 10360840 -> 1000003600850: Some elements, such as the line break
<br>, do not have any content and must not have a closing tag. 10360850 -> 1000003600860: Listed below are several types of markup elements used in HTML. 10360860 -> 1000003600870: Structural markup describes the purpose of text. 10360870 -> 1000003600880: For example,
<h2>Golf</h2>
establishes "Golf" as a second-level heading, which would be rendered in a browser in a manner similar to the "HTML markup" title at the start of this section. 10360880 -> 1000003600890: Structural markup does not denote any specific rendering, but most Web browsers have standardized on how elements should be formatted. 10360890 -> 1000003600900: Text may be further styled with Cascading Style Sheets (CSS). 10360900 -> 1000003600910: Presentational markup describes the appearance of the text, regardless of its function. 10360910 -> 1000003600920: For example boldface indicates that visual output devices should render "boldface" in bold text, but gives no indication what devices which are unable to do this (such as aural devices that read the text aloud) should do. 10360920 -> 1000003600930: In the case of both bold and italic, there are elements which usually have an equivalent visual rendering but are more semantic in nature, namely strong emphasis and emphasis respectively. 10360930 -> 1000003600940: It is easier to see how an aural user agent should interpret the latter two elements. 10360940 -> 1000003600950: However, they are not equivalent to their presentational counterparts: it would be undesirable for a screen-reader to emphasize the name of a book, for instance, but on a screen such a name would be italicized. 10360950 -> 1000003600960: Most presentational markup elements have become deprecated under the HTML 4.0 specification, in favor of CSS based style design. 10360960 -> 1000003600970: Hypertext markup links parts of the document to other documents. 10360970 -> 1000003600980: HTML up through version XHTML 1.1 requires the use of an anchor element to create a hyperlink in the flow of text: Wikipedia. 10360980 -> 1000003600990: However, the href attribute must also be set to a valid URL so for example the HTML code, Wikipedia, will render the word " Wikipedia" as a hyperlink. 10360985 -> 1000003601000: To link on an image, the anchor tag use the following syntax: 10360990 -> 1000003601010: Attributes 10361000 -> 1000003601020: Most of the attributes of an element are name-value pairs, separated by "=", and written within the start tag of an element, after the element's name. 10361010 -> 1000003601030: The value may be enclosed in single or double quotes, although values consisting of certain characters can be left unquoted in HTML (but not XHTML). 10361020 -> 1000003601040: Leaving attribute values unquoted is considered unsafe. 10361030 -> 1000003601050: In contrast with name-value pair attributes, there are some attributes that affect the element simply by their presence in the start tag of the element (like the ismap attribute for the img element). 10361040 -> 1000003601060: Most elements can take any of several common attributes: 10361050 -> 1000003601070: The id attribute provides a document-wide unique identifier for an element. 10361060 -> 1000003601080: This can be used by stylesheets to provide presentational properties, by browsers to focus attention on the specific element, or by scripts to alter the contents or presentation of an element. 10361070 -> 1000003601090: The class attribute provides a way of classifying similar elements for presentation purposes. 10361080 -> 1000003601100: For example, an HTML document might use the designation class="notation" to indicate that all elements with this class value are subordinate to the main text of the document. 
10361090 -> 1000003601110: Such elements might be gathered together and presented as footnotes on a page instead of appearing in the place where they occur in the HTML source. 10361100 -> 1000003601120: An author may use the style attribute to assign presentational properties directly to a particular element. 10361110 -> 1000003601130: It is considered better practice to use an element's id or class attributes and select the element with a stylesheet, though sometimes this can be too cumbersome for a simple ad hoc application of styled properties. 10361120 -> 1000003601140: The title attribute is used to attach a subtextual explanation to an element. 10361130 -> 1000003601150: In most browsers this attribute is displayed as what is often referred to as a tooltip. 10361140 -> 1000003601160: The generic inline element span can be used to demonstrate these various attributes: 10361150 -> None: 10361160 -> 1000003601170: This example displays as HTML; in most browsers, pointing the cursor at the abbreviation should display the title text "Hypertext Markup Language." 10361170 -> 1000003601180: Most elements also take the language-related attributes lang and dir. 10361180 -> 1000003601190: Character and entity references 10361190 -> 1000003601200: As of version 4.0, HTML defines a set of 252 character entity references and a set of 1,114,050 numeric character references, both of which allow individual characters to be written via simple markup, rather than literally. 10361200 -> 1000003601210: A literal character and its markup counterpart are considered equivalent and are rendered identically. 10361210 -> 1000003601220: The ability to "escape" characters in this way allows for the characters < and & (when written as &lt; and &amp;, respectively) to be interpreted as character data, rather than markup. 10361220 -> 1000003601230: For example, a literal < normally indicates the start of a tag, and & normally indicates the start of a character entity reference or numeric character reference; writing it as &amp; or &#38; or &#x26; allows & to be included in the content of elements or the values of attributes. 10361230 -> 1000003601240: The double-quote character ("), when used to quote an attribute value, must also be escaped as &quot; or &#34; or &#x22; when it appears within the attribute value itself. 10361240 -> 1000003601250: The single-quote character ('), when used to quote an attribute value, must also be escaped as &#39; or &#x27; (and should NOT be escaped as &apos; except in XHTML documents) when it appears within the attribute value itself. 10361250 -> 1000003601260: However, since document authors often overlook the need to escape these characters, browsers tend to be very forgiving, treating them as markup only when subsequent text appears to confirm that intent. 10361260 -> 1000003601270: Escaping also allows for characters that are not easily typed, or that are not even available in the document's character encoding, to be represented within element and attribute content. 10361270 -> 1000003601280: For example, the acute-accented e (é), a character typically found only on Western European keyboards, can be written in any HTML document as the entity reference &eacute; or as the numeric references &#233; or &#xE9;. 10361280 -> 1000003601290: The characters comprising those references (that is, the &, the ;, the letters in eacute, and so on) are available on all keyboards and are supported in all character encodings, whereas the literal é is not.
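As a rough sketch of the span example and the escaping described above (the id, class, and style values are hypothetical), such markup might look like this:

<span id="example-abbr" class="notation" style="color: blue;" title="Hypertext Markup Language">HTML</span>
<!-- Character and entity references: the paragraph below renders as "Café & croissants" -->
<p>Caf&eacute; &amp; croissants</p>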
10361290 -> 1000003601300: Data types 10361300 -> 1000003601310: HTML defines several data types for element content, such as script data and stylesheet data, and a plethora of types for attribute values, including IDs, names, URIs, numbers, units of length, languages, media descriptors, colors, character encodings, dates and times, and so on. 10361310 -> 1000003601320: All of these data types are specializations of character data. 10361320 -> 1000003601330: The Document Type Declaration 10361330 -> 1000003601340: In order to enable Document Type Definition (DTD)-based validation with SGML tools and in order to avoid the quirks mode in browsers, HTML documents can start with a Document Type Declaration (informally, a "DOCTYPE"). 10361340 -> 1000003601350: The DTD to which the DOCTYPE refers contains machine-readable grammar specifying the permitted and prohibited content for a document conforming to such a DTD. 10361350 -> 1000003601360: Browsers do not necessarily read the DTD, however. 10361360 -> 1000003601370: The most popular graphical browsers use DOCTYPE declarations (or the lack thereof) and other data at the beginning of sources to determine which rendering mode to use. 10361370 -> 1000003601380: For example: 10361380 -> 1000003601390: 10361390 -> 1000003601400: This declaration references the Strict DTD of HTML 4.01, which does not have presentational elements like , leaving formatting to Cascading Style Sheets and the span and div tags. 10361400 -> 1000003601410: SGML-based validators read the DTD in order to properly parse the document and to perform validation. 10361410 -> 1000003601420: In modern browsers, the HTML 4.01 Strict doctype activates standards layout mode for CSS as opposed to quirks mode. 10361420 -> 1000003601430: In addition, HTML 4.01 provides Transitional and Frameset DTDs. 10361430 -> 1000003601440: The Transitional DTD was intended to gradually phase in the changes made in the Strict DTD, while the Frameset DTD was intended for those documents which contained frames. 10361440 -> 1000003601450: Semantic HTML 10361450 -> 1000003601460: There is no official specification called "Semantic HTML", though the strict flavors of HTML discussed below are a push in that direction. 10361460 -> 1000003601470: Rather, semantic HTML refers to an objective and a practice to create documents with HTML that contain only the author's intended meaning, without any reference to how this meaning is presented or conveyed. 10361470 -> 1000003601480: A classic example is the distinction between the emphasis element () and the italics element (). 10361480 -> 1000003601490: Often the emphasis element is displayed in italics, so the presentation is typically the same. 10361490 -> 1000003601500: However, emphasizing something is different from listing the title of a book, for example, which may also be displayed in italics. 10361500 -> 1000003601510: In purely semantic HTML, a book title would use a different element than emphasized text uses (for example a ), because they are meaningfully different things. 10361510 -> 1000003601520: The goal of semantic HTML requires two things of authors: 10361520 -> 1000003601530: To avoid the use of presentational markup (elements, attributes, and other entities). 10361530 -> 1000003601540: To use available markup to differentiate the meanings of phrases and structure in the document. 10361540 -> 1000003601550: So for example, the book title from above would need to have its own element and class specified, such as The Grapes of Wrath. 
10361545 -> 1000003601560: Here, the element is used because it most closely matches the meaning of this phrase in the text. 10361550 -> 1000003601570: However, the element is not specific enough to this task, since we mean to cite specifically a book title as opposed to a newspaper article or an academic journal. 10361560 -> 1000003601580: Semantic HTML also requires complementary specifications and software compliance with these specifications. 10361570 -> 1000003601590: Primarily, the development and proliferation of CSS has led to increasing support for semantic HTML, because CSS provides designers with a rich language to alter the presentation of semantic-only documents. 10361580 -> 1000003601600: With the development of CSS, the need to include presentational properties in a document has virtually disappeared. 10361590 -> 1000003601610: With the advent and refinement of CSS and the increasing support for it in Web browsers, subsequent editions of HTML increasingly stress only using markup that suggests the semantic structure and phrasing of the document, like headings, paragraphs, quotes, and lists, instead of using markup which is written for visual purposes only, like , (bold), and (italics). 10361600 -> 1000003601620: Some of these elements are not permitted in certain varieties of HTML, like HTML 4.01 Strict. 10361610 -> 1000003601630: CSS provides a way to separate document semantics from the content's presentation, by keeping everything relevant to presentation defined in a CSS file. 10361620 -> 1000003601640: See separation of style and content. 10361630 -> 1000003601650: Semantic HTML offers many advantages. 10361640 -> 1000003601660: First, it ensures consistency in style across elements that have the same meaning. 10361650 -> 1000003601670: Every heading, every quotation, every similar element receives the same presentation properties. 10361660 -> 1000003601680: Second, semantic HTML frees authors from the need to concern themselves with presentation details. 10361670 -> 1000003601690: When writing the number two, for example, should it be written out in words ("two"), or should it be written as a numeral (2)? 10361680 -> 1000003601700: A semantic markup might enter something like 2 and leave presentation details to the stylesheet designers. 10361690 -> 1000003601710: Similarly, an author might wonder where to break out quotations into separate indented blocks of text: with purely semantic HTML, such details would be left up to stylesheet designers. 10361700 -> 1000003601720: Authors would simply indicate quotations when they occur in the text, and not concern themselves with presentation. 10361710 -> 1000003601730: A third advantage is device independence and repurposing of documents. 10361720 -> 1000003601740: A semantic HTML document can be paired with any number of stylesheets to provide output to computer screens (through Web browsers), high-resolution printers, handheld devices, aural browsers or braille devices for those with visual impairments, and so on. 10361730 -> 1000003601750: To accomplish this, nothing needs to be changed in a well-coded semantic HTML document. 10361740 -> 1000003601760: Readily available stylesheets make this a simple matter of pairing a semantic HTML document with the appropriate stylesheets. 10361750 -> 1000003601770: (Of course, the stylesheet's selectors need to match the appropriate properties in the HTML document.) 
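A minimal sketch of this separation of meaning from presentation, using the book-title example above (the class name and the CSS rule are illustrative assumptions):

<!-- Semantic markup: the elements state what the text is, not how it looks -->
<p>She was reading <cite class="booktitle">The Grapes of Wrath</cite> and felt <em>moved</em>.</p>

<!-- Presentation supplied separately, in a stylesheet or a style element in the document head -->
<style type="text/css">
  cite.booktitle { font-style: italic; }
</style>

Restyling or swapping the stylesheet changes how book titles and emphasis are rendered without touching the marked-up text itself.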
10361760 -> 1000003601780: Some aspects of authoring documents make separating semantics from style (in other words, meaning from presentation) difficult. 10361770 -> 1000003601790: Some elements are hybrids, using presentation in their very meaning. 10361780 -> 1000003601800: For example, a table displays content in a tabular form. 10361790 -> 1000003601810: Often such content conveys the meaning only when presented in this way. 10361800 -> 1000003601820: Repurposing a table for an aural device typically involves somehow presenting the table as an inherently visual element in an audible form. 10361810 -> 1000003601830: On the other hand, we frequently present lyrical songs—something inherently meant for audible presentation—and instead present them in textual form on a Web page. 10361820 -> 1000003601840: For these types of elements, the meaning is not so easily separated from their presentation. 10361830 -> 1000003601850: However, for a great many of the elements used and meanings conveyed in HTML, the translation is relatively smooth. 10361840 -> 1000003601860: Delivery of HTML 10361850 -> 1000003601870: HTML documents can be delivered by the same means as any other computer file; however, they are most often delivered in one of two forms: over HTTP servers and through e-mail. 10361860 -> 1000003601880: Publishing HTML with HTTP 10361870 -> 1000003601890: The World Wide Web is composed primarily of HTML documents transmitted from a Web server to a Web browser using the Hypertext Transfer Protocol (HTTP). 10361880 -> 1000003601900: However, HTTP can be used to serve images, sound, and other content in addition to HTML. 10361890 -> 1000003601910: To allow the Web browser to know how to handle the document it received, an indication of the file format of the document must be transmitted along with the document. 10361900 -> 1000003601920: This vital metadata includes the MIME type (text/html for HTML 4.01 and earlier, application/xhtml+xml for XHTML 1.0 and later) and the character encoding (see Character encodings in HTML). 10361910 -> 1000003601930: In modern browsers, the MIME type that is sent with the HTML document affects how the document is interpreted. 10361920 -> 1000003601940: A document sent with an XHTML MIME type, or served as application/xhtml+xml, is expected to be well-formed XML, and a syntax error causes the browser to fail to render the document. 10361930 -> 1000003601950: The same document sent with an HTML MIME type, or served as text/html, might be displayed successfully, since Web browsers are more lenient with HTML. 10361940 -> 1000003601960: However, XHTML parsed in this way is not considered either proper XHTML or HTML, but so-called tag soup. 10361950 -> 1000003601970: If the MIME type is not recognized as HTML, the Web browser should not attempt to render the document as HTML, even if the document is prefaced with a correct Document Type Declaration. 10361960 -> 1000003601980: Nevertheless, some Web browsers do examine the contents or URL of the document and attempt to infer the file type, despite this being forbidden by the HTTP 1.1 specification. 10361970 -> 1000003601990: HTML e-mail 10361980 -> 1000003602000: Most graphical e-mail clients allow the use of a subset of HTML (often ill-defined) to provide formatting and semantic markup capabilities not available with plain text, like emphasized text, block quotations for replies, and diagrams or mathematical formulas that could not easily be described otherwise. 
10361990 -> 1000003602010: Many of these clients include both a GUI editor for composing HTML e-mail messages and a rendering engine for displaying received HTML messages. 10362000 -> 1000003602020: Use of HTML in e-mail is controversial because of compatibility issues, because it can be used in phishing/privacy attacks, because it can confuse spam filters, and because the message size is larger than plain text. 10362010 -> 1000003602030: Naming conventions 10362020 -> 1000003602040: The most common filename extension for files containing HTML is .html. 10362030 -> 1000003602050: A common abbreviation of this is .htm; it originates from older operating systems and file systems, such as the DOS versions from the 80s and early 90s and FAT, which limit file extensions to three letters. 10362040 -> 1000003602060: Both forms are widely supported by browsers. 10362050 -> 1000003602070: Current flavors of HTML 10362060 -> 1000003602080: Since its inception, HTML and its associated protocols gained acceptance relatively quickly. 10362070 -> 1000003602090: However, no clear standards existed in the early years of the language. 10362080 -> 1000003602100: Though its creators originally conceived of HTML as a semantic language devoid of presentation details, practical uses pushed many presentational elements and attributes into the language, driven largely by the various browser vendors. 10362090 -> 1000003602110: The latest standards surrounding HTML reflect efforts to overcome the sometimes chaotic development of the language and to create a rational foundation for building both meaningful and well-presented documents. 10362100 -> 1000003602120: To return HTML to its role as a semantic language, the W3C has developed style languages such as CSS and XSL to shoulder the burden of presentation. 10362110 -> 1000003602130: In conjunction, the HTML specification has slowly reined in the presentational elements. 10362120 -> 1000003602140: There are two axes differentiating various flavors of HTML as currently specified: SGML-based HTML versus XML-based HTML (referred to as XHTML) on the one axis, and strict versus transitional (loose) versus frameset on the other axis. 10362130 -> 1000003602150: SGML-based versus XML-based HTML 10362140 -> 1000003602160: One difference in the latest HTML specifications lies in the distinction between the SGML-based specification and the XML-based specification. 10362150 -> 1000003602170: The XML-based specification is usually called XHTML to distinguish it clearly from the more traditional definition; however, the root element name continues to be 'html' even in the XHTML-specified HTML. 10362160 -> 1000003602180: The W3C intended XHTML 1.0 to be identical to HTML 4.01 except where limitations of XML over the more complex SGML require workarounds. 10362170 -> 1000003602190: Because XHTML and HTML are closely related, they are sometimes documented in parallel. 10362180 -> 1000003602200: In such circumstances, some authors conflate the two names as (X)HTML or X(HTML). 10362190 -> 1000003602210: Like HTML 4.01, XHTML 1.0 has three sub-specifications: strict, loose, and frameset. 10362200 -> 1000003602220: Aside from the different opening declarations for a document, the differences between an HTML 4.01 and XHTML 1.0 document—in each of the corresponding DTDs—are largely syntactic. 
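As a sketch of those different opening declarations, a Strict document in each dialect begins roughly as follows (the skeletons are reduced to a bare minimum):

<!-- HTML 4.01 Strict -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
  <head><title>Hello HTML</title></head>
  <body><p>Hello World</p></body>
</html>

<!-- XHTML 1.0 Strict: XML declaration, XHTML DOCTYPE, and the XHTML namespace on the root element -->
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head><title>Hello XHTML</title></head>
  <body><p>Hello World</p></body>
</html>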
10362210 -> 1000003602230: The underlying syntax of HTML allows many shortcuts that XHTML does not, such as elements with optional opening or closing tags, and even EMPTY elements which must not have an end tag. 10362220 -> 1000003602240: By contrast, XHTML requires all elements to have an opening tag or a closing tag. 10362230 -> 1000003602250: XHTML, however, also introduces a new shortcut: an XHTML tag may be opened and closed within the same tag, by including a slash before the end of the tag like this:
<br />. 10362240 -> 1000003602260: The introduction of this shorthand, which is not used in the SGML declaration for HTML 4.01, may confuse earlier software unfamiliar with this new convention. 10362250 -> 1000003602270: To understand the subtle differences between HTML and XHTML, consider the transformation of a valid and well-formed XHTML 1.0 document that adheres to Appendix C (see below) into a valid HTML 4.01 document. 10362260 -> 1000003602280: Making this translation requires the following steps: 10362270 -> 1000003602290: The language for an element should be specified with a lang attribute rather than the XHTML xml:lang attribute. 10362280 -> 1000003602300: XHTML uses XML's built-in language-defining functionality (the xml:lang attribute). 10362290 -> 1000003602310: Remove the XML namespace (xmlns=URI). 10362300 -> 1000003602320: HTML has no facilities for namespaces. 10362310 -> 1000003602330: Change the document type declaration from XHTML 1.0 to HTML 4.01 (see the DTD section for further explanation). 10362320 -> 1000003602340: If present, remove the XML declaration. 10362330 -> 1000003602350: (Typically this is: <?xml version="1.0" encoding="UTF-8"?>). 10362340 -> 1000003602360: Ensure that the document's MIME type is set to text/html. 10362350 -> 1000003602370: For both HTML and XHTML, this comes from the HTTP Content-Type header sent by the server. 10362360 -> 1000003602380: Change the XML empty-element syntax to an HTML-style empty element (<br /> to <br>
). 10362370 -> 1000003602390: Those are the main changes necessary to translate a document from XHTML 1.0 to HTML 4.01. 10362380 -> 1000003602400: To translate from HTML to XHTML would also require the addition of any omitted opening or closing tags. 10362390 -> 1000003602410: Whether coding in HTML or XHTML, it may be best always to include the optional tags within an HTML document rather than to rely on remembering which tags can be omitted. 10362400 -> 1000003602420: A well-formed XHTML document adheres to all the syntax requirements of XML. 10362410 -> 1000003602430: A valid document adheres to the content specification for XHTML, which describes the document structure. 10362420 -> 1000003602440: The W3C recommends several conventions to ensure an easy migration between HTML and XHTML (see HTML Compatibility Guidelines). 10362430 -> 1000003602450: The following steps can be applied to XHTML 1.0 documents only: 10362440 -> 1000003602460: Include both xml:lang and lang attributes on any elements assigning language. 10362450 -> 1000003602470: Use the empty-element syntax only for elements specified as empty in HTML. 10362460 -> 1000003602480: Include an extra space in empty-element tags: for example, write <br /> instead of
<br/>. 10362470 -> 1000003602490: Include explicit close tags for elements that permit content but are left empty (for example, <div></div>, not <div />
). 10362480 -> 1000003602500: Omit the XML declaration. 10362490 -> 1000003602510: By carefully following the W3C’s compatibility guidelines, a user agent should be able to interpret the document equally as HTML or XHTML. 10362500 -> 1000003602520: For documents that are XHTML 1.0 and have been made compatible in this way, the W3C permits them to be served either as HTML (with a text/html MIME type), or as XHTML (with an application/xhtml+xml or application/xml MIME type). 10362510 -> 1000003602530: When delivered as XHTML, browsers should use an XML parser, which adheres strictly to the XML specifications for parsing the document's contents. 10362520 -> 1000003602540: Transitional versus Strict 10362530 -> 1000003602550: The latest SGML-based specification HTML 4.01 and the earliest XHTML version include three sub-specifications: Strict, Transitional (once called Loose), and Frameset. 10362540 -> 1000003602560: The Strict variant represents the standard proper, whereas the Transitional and Frameset variants were developed to assist in the transition from earlier versions of HTML (including HTML 3.2). 10362550 -> 1000003602570: The Transitional and Frameset variants allow for presentational markup whereas the Strict variant encourages the use of style sheets through its omission of most presentational markup. 10362560 -> 1000003602580: The primary differences which make the Transitional variant more permissive than the Strict variant (the differences as the same in HTML 4 and XHTML 1.0) are: 10362570 -> 1000003602590: A looser content model 10362580 -> 1000003602600: Inline elements and plain text (#PCDATA) are allowed directly in: body, blockquote, form, noscript and noframes 10362590 -> 1000003602610: Presentation related elements 10362600 -> 1000003602620: underline (u) 10362610 -> 1000003602630: strike-through (del) 10362620 -> 1000003602640: center 10362630 -> 1000003602650: font 10362640 -> 1000003602660: basefont 10362650 -> 1000003602670: Presentation related attributes 10362660 -> 1000003602680: background and bgcolor attributes for body element. 
10362670 -> 1000003602690: align attribute on div, form, paragraph (p), and heading (h1...h6) elements 10362680 -> 1000003602700: align, noshade, size, and width attributes on hr element 10362690 -> 1000003602710: align, border, vspace, and hspace attributes on img and object elements 10362700 -> 1000003602720: align attribute on legend and caption elements 10362710 -> 1000003602730: align and bgcolor on table element 10362720 -> 1000003602740: nowrap, bgcolor, width, height on td and th elements 10362730 -> 1000003602750: bgcolor attribute on tr element 10362740 -> 1000003602760: clear attribute on br element 10362750 -> 1000003602770: compact attribute on dl, dir and menu elements 10362760 -> 1000003602780: type, compact, and start attributes on ol and ul elements 10362770 -> 1000003602790: type and value attributes on li element 10362780 -> 1000003602800: width attribute on pre element 10362790 -> 1000003602810: Additional elements in Transitional specification 10362800 -> 1000003602820: menu list (no substitute, though unordered list is recommended; may return in XHTML 2.0 specification) 10362810 -> 1000003602830: dir list (no substitute, though unordered list is recommended) 10362820 -> 1000003602840: isindex (element requires server-side support and is typically added to documents server-side) 10362830 -> 1000003602850: applet (deprecated in favor of object element) 10362840 -> 1000003602860: The language attribute on script element (presumably redundant with type attribute, though this is maintained for legacy reasons). 10362850 -> 1000003602870: Frame related entities 10362860 -> 1000003602880: frameset element (used in place of body for frameset DTD) 10362870 -> 1000003602890: frame element 10362880 -> 1000003602900: iframe 10362890 -> 1000003602910: noframes 10362900 -> 1000003602920: target attribute on anchor, client-side image-map (imagemap), link, form, and base elements 10362910 -> 1000003602930: Frameset versus transitional 10362920 -> 1000003602940: In addition to the above transitional differences, the frameset specifications (whether XHTML 1.0 or HTML 4.01) specifies a different content model: 10362930 -> 1000003602950: Summary of flavors 10362940 -> 1000003602960: As this list demonstrates, the loose flavors of the specification are maintained for legacy support. 10362950 -> 1000003602970: However, contrary to popular misconceptions, the move to XHTML does not imply a removal of this legacy support. 10362960 -> 1000003602980: Rather the X in XML stands for extensible and the W3C is modularizing the entire specification and opening it up to independent extensions. 10362970 -> 1000003602990: The primary achievement in the move from XHTML 1.0 to XHTML 1.1 is the modularization of the entire specification. 10362980 -> 1000003603000: The strict version of HTML is deployed in XHTML 1.1 through a set of modular extensions to the base XHTML 1.1 specification. 10362990 -> 1000003603010: Likewise someone looking for the loose (transitional) or frameset specifications will find similar extended XHTML 1.1 support (much of it is contained in the legacy or frame modules). 10363000 -> 1000003603020: The modularization also allows for separate features to develop on their own timetable. 10363010 -> 1000003603030: So for example XHTML 1.1 will allow quicker migration to emerging XML standards such as MathML (a presentational and semantic math language based on XML) and XForms — a new highly advanced web-form technology to replace the existing HTML forms. 
10363020 -> 1000003603040: In summary, the HTML 4.01 specification primarily reined in all the various HTML implementations into a single clear written specification based on SGML. 10363030 -> 1000003603050: XHTML 1.0, ported this specification, as is, to the new XML defined specification. 10363040 -> 1000003603060: Next, XHTML 1.1 takes advantage of the extensible nature of XML and modularizes the whole specification. 10363050 -> 1000003603070: XHTML 2.0 will be the first step in adding new features to the specification in a standards-body-based approach. 10363060 -> 1000003603080: Hypertext features not in HTML 10363070 -> 1000003603090: HTML lacks some of the features found in earlier hypertext systems, such as typed links, transclusion, source tracking, fat links, and more. 10363080 -> 1000003603100: Even some hypertext features that were in early versions of HTML have been ignored by most popular web browsers until recently, such as the link element and in-browser Web page editing. 10363090 -> 1000003603110: Sometimes Web services or browser manufacturers remedy these shortcomings. 10363100 -> 1000003603120: For instance, wikis and content management systems allow surfers to edit the Web pages they visit. Hidden Markov model 10350010 -> 1000003700020: Hidden Markov model 10350020 -> 1000003700030: A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. 10350030 -> 1000003700040: The extracted model parameters can then be used to perform further analysis, for example for pattern recognition applications. 10350040 -> 1000003700050: An HMM can be considered as the simplest dynamic Bayesian network. 10350050 -> 1000003700060: In a regular Markov model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. 10350060 -> 1000003700070: In a hidden Markov model, the state is not directly visible, but variables influenced by the state are visible. 10350070 -> 1000003700080: Each state has a probability distribution over the possible output tokens. 10350080 -> 1000003700090: Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states. 10350090 -> 1000003700100: Hidden Markov models are especially known for their application in temporal pattern recognition such as speech, handwriting, gesture recognition, musical score following, partial discharges and bioinformatics. 10350100 -> 1000003700110: Architecture of a hidden Markov model 10350110 -> 1000003700120: The diagram below shows the general architecture of an instantiated HMM. 10350120 -> 1000003700130: Each oval shape represents a random variable that can adopt a number of values. 10350130 -> 1000003700140: The random variable x(t) is the hidden state at time t (with the model from the above diagram, x(t) \in \{x_1, x_2, x_3\}). 10350140 -> 1000003700150: The random variable y(t) is the observation at time t (y(t) \in \{y_1, y_2, y_3, y_4\}). 10350150 -> 1000003700160: The arrows in the diagram (often called a trellis diagram) denote conditional dependencies. 10350160 -> 1000003700170: From the diagram, it is clear that the value of the hidden variable x(t) (at time t) only depends on the value of the hidden variable x(t-1) : the values at time t-2 and before have no influence. 10350170 -> 1000003700180: This is called the Markov property. 
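Together with the analogous independence assumption for the observations (described next), the Markov property means that the joint probability of a hidden sequence X = x(0), x(1), \dots, x(L-1) and an observed sequence Y = y(0), y(1), \dots, y(L-1) factorizes as

P(X, Y) = P(x(0)) \, P(y(0) \mid x(0)) \prod_{t=1}^{L-1} P(x(t) \mid x(t-1)) \, P(y(t) \mid x(t)),

which is exactly the P(Y \mid X) P(X) term that is summed over all hidden sequences X in the expression for P(Y) given below.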
10350180 -> 1000003700190: Similarly, the value of the observed variable y(t) only depends on the value of the hidden variable x(t) (both at time t). 10350190 -> 1000003700200: Probability of an observed sequence 10350200 -> 1000003700210: The probability of observing a sequence Y=y(0), y(1),\dots,y(L-1) of length L is given by 10350210 -> 1000003700220: P(Y)=\sum_{X}P(Y\mid X)P(X), 10350220 -> 1000003700230: where the sum runs over all possible hidden node sequences X=x(0), x(1), \dots, x(L-1). 10350230 -> 1000003700240: Brute-force calculation of P(Y) is intractable for most real-life problems, as the number of possible hidden node sequences is typically extremely high. 10350240 -> 1000003700250: The calculation can, however, be sped up enormously using the forward algorithm or the equivalent backward algorithm. 10350250 -> 1000003700260: Using hidden Markov models 10350260 -> 1000003700270: There are three canonical problems associated with HMMs: 10350270 -> 1000003700280: Given the parameters of the model, compute the probability of a particular output sequence, and the probabilities of the hidden state values given that output sequence. 10350280 -> 1000003700290: This problem is solved by the forward-backward algorithm. 10350290 -> 1000003700300: Given the parameters of the model, find the most likely sequence of hidden states that could have generated a given output sequence. 10350300 -> 1000003700310: This problem is solved by the Viterbi algorithm (a code sketch of these two computations appears at the end of this article). 10350310 -> 1000003700320: Given an output sequence or a set of such sequences, find the most likely set of state transition and output probabilities. 10350320 -> 1000003700330: In other words, discover the parameters of the HMM given a dataset of sequences. 10350330 -> 1000003700340: This problem is solved by the Baum-Welch algorithm. 10350340 -> 1000003700350: A concrete example 10350350 -> 1000003700360: This example is further elaborated in the Viterbi algorithm page. 10350360 -> 1000003700370: Applications of hidden Markov models 10350370 -> 1000003700380: Cryptanalysis 10350380 -> 1000003700390: Speech recognition 10350390 -> 1000003700400: Machine translation 10350400 -> 1000003700410: Partial discharge 10350410 -> 1000003700420: History 10350420 -> 1000003700430: Hidden Markov Models were first described in a series of statistical papers by Leonard E. Baum and other authors in the second half of the 1960s. 10350430 -> 1000003700440: One of the first applications of HMMs was speech recognition, starting in the mid-1970s. 10350440 -> 1000003700450: In the second half of the 1980s, HMMs began to be applied to the analysis of biological sequences, in particular DNA. 10350450 -> 1000003700460: Since then, they have become ubiquitous in the field of bioinformatics.
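The following is a minimal Python sketch of the first two canonical problems above: the forward algorithm for computing the probability of an observed sequence, and the Viterbi algorithm for the most likely hidden-state sequence. The two-state toy model, its probability values, and the function names are invented for this illustration and are not taken from the article.

```python
import numpy as np

# Toy HMM parameters (hypothetical, for illustration only):
# two hidden states, three possible observation symbols.
A = np.array([[0.7, 0.3],        # A[i, j] = P(x_t = j | x_{t-1} = i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],   # B[i, k] = P(y_t = k | x_t = i)
              [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])        # initial state distribution

def forward(obs):
    """Forward algorithm: return P(Y) for a list of observation symbol indices."""
    alpha = pi * B[:, obs[0]]              # alpha[i] = P(y_0, x_0 = i)
    for y in obs[1:]:
        alpha = (alpha @ A) * B[:, y]      # recursion over time steps
    return alpha.sum()

def viterbi(obs):
    """Viterbi algorithm: return the most likely hidden-state sequence for obs."""
    delta = pi * B[:, obs[0]]              # best path probability ending in each state
    back = []                              # back-pointers
    for y in obs[1:]:
        trans = delta[:, None] * A         # trans[i, j]: best path into state j via state i
        back.append(trans.argmax(axis=0))
        delta = trans.max(axis=0) * B[:, y]
    state = int(delta.argmax())            # trace back the best path
    path = [state]
    for ptr in reversed(back):
        state = int(ptr[state])
        path.append(state)
    return list(reversed(path))

obs = [0, 1, 2, 2]                         # an example observation sequence
print(forward(obs))                        # probability of observing obs
print(viterbi(obs))                        # most likely hidden-state sequence
```

In a real application the parameters A, B, and pi would themselves be estimated from data, which is the third canonical problem (Baum-Welch); that step is not shown here.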
IBM 10370010 -> 1000003800020: IBM 10370020 -> 1000003800030: International Business Machines Corporation, abbreviated IBM and nicknamed "Big Blue," NYSE: IBM, is a multinational computer technology and consulting corporation headquartered in Armonk, New York, USA. 10370030 -> 1000003800040: The company is one of the few information technology companies with a continuous history dating back to the 19th century. 10370040 -> 1000003800050: IBM manufactures and sells computer hardware and software, and offers infrastructure services, hosting services, and consulting services in areas ranging from mainframe computers to nanotechnology. 10370050 -> 1000003800060: IBM has been known through most of its recent history as the world's largest computer company; with over 388,000 employees worldwide, IBM is the largest information technology employer in the world. 10370060 -> 1000003800070: Despite falling behind Hewlett-Packard in total revenue since 2006, it remains the most profitable. 10370070 -> 1000003800080: IBM holds more patents than any other U.S.-based technology company. 10370080 -> 1000003800090: It has engineers and consultants in over 170 countries and IBM Research has eight laboratories worldwide. 10370090 -> 1000003800100: IBM employees have earned three Nobel Prizes, four Turing Awards, five National Medals of Technology, and five National Medals of Science. 10370100 -> 1000003800110: As a chip maker, IBM has been among the Worldwide Top 20 Semiconductor Sales Leaders in past years, and in 2007 IBM ranked second in the list of largest software companies in the world. 10370110 -> 1000003800120: History 10370120 -> 1000003800130: The company that became IBM was founded in 1896 as the Tabulating Machine Company by Herman Hollerith, in Broome County, New York (Endicott, New York, where it still maintains very limited operations). 10370130 -> 1000003800140: It was incorporated as Computing Tabulating Recording Corporation (CTR) on June 16, 1911, and was listed on the New York Stock Exchange in 1916. 10370140 -> 1000003800150: IBM adopted its current name in 1924, when it became a Fortune 500 company. 10370150 -> 1000003800160: In the 1950s, IBM became the dominant vendor in the emerging computer industry with the release of the IBM 701 and other models in the IBM 700/7000 series of mainframes. 10370160 -> 1000003800170: The company's dominance became even more pronounced in the 1960s and 1970s with the IBM System/360 and IBM System/370 mainframes; however, antitrust actions by the United States Department of Justice, the rise of minicomputer companies like Digital Equipment Corporation and Data General, and the introduction of the microprocessor all contributed to the dilution of IBM's position in the industry, eventually leading the company to diversify into other areas including personal computers, software, and services. 10370170 -> 1000003800180: In 1981, IBM introduced the IBM Personal Computer, the original version and progenitor of the IBM PC compatible hardware platform. 10370180 -> 1000003800190: Descendants of the IBM PC compatibles make up the majority of microcomputers on the market today. 10370190 -> 1000003800200: IBM sold its PC division to the Chinese company Lenovo on May 1, 2005, for $655 million in cash and $600 million in Lenovo stock. 10370200 -> 1000003800210: On January 25, 2007, Ricoh announced the purchase of IBM's Printing Systems Division for $725 million and an investment in a three-year joint venture to form a new Ricoh subsidiary, InfoPrint Solutions Company; Ricoh will own a 51% share, and IBM will own a 49% share in InfoPrint. 10370210 -> 1000003800220: Controversies 10370220 -> 1000003800230: The author Edwin Black has alleged that, during World War II, IBM CEO Thomas J. Watson used overseas subsidiaries to provide the Third Reich with unit record data processing machines, supplies, and services that helped the Nazis to efficiently track down European Jews, with sizable profits for the company. 10370230 -> 1000003800240: IBM denies that it had control over these subsidiaries after the Nazis took power. 10370240 -> 1000003800250: A lawsuit against IBM based on these allegations was dismissed.
10370250 -> 1000003800260: In support of the Allied war effort in World War II, from 1943 to 1945 IBM produced approximately 346,500 M1 Carbine (Caliber .30 carbine) light rifles for the U.S. Military. 10370260 -> 1000003800270: Current projects 10370270 -> 1000003800280: Eclipse 10370280 -> 1000003800290: Eclipse is a platform-independent, Java-based software framework. 10370290 -> 1000003800300: Eclipse was originally a proprietary product developed by IBM as a successor of the VisualAge family of tools. 10370300 -> 1000003800310: Eclipse has subsequently been released as free/open source software under the Eclipse Public License. 10370310 -> 1000003800320: developerWorks 10370320 -> 1000003800330: developerWorks is a website run by IBM for software developers and IT professionals. 10370330 -> 1000003800340: It contains a large number of how-to articles and tutorials, as well as software downloads and code samples, discussion forums, podcasts, blogs, wikis, and other resources for developers and technical professionals. 10370340 -> 1000003800350: Subjects range from open, industry-standard technologies like Java, Linux, SOA and web services, web development, Ajax, PHP, and XML to IBM's products (WebSphere, Rational, Lotus, Tivoli and DB2). 10370350 -> 1000003800360: In 2007 developerWorks was inducted into the Jolt Hall of Fame. 10370360 -> 1000003800370: alphaWorks 10370370 -> 1000003800380: alphaWorks is IBM's source for emerging software technologies. 10370380 -> 1000003800390: These technologies include: 10370390 -> 1000003800400: Flexible Internet Evaluation Report Architecture - A highly flexible architecture for the design, display, and reporting of Internet surveys. 10370400 -> 1000003800410: IBM History Flow Visualization Application - A tool for visualizing dynamic, evolving documents and the interactions of multiple collaborating authors. 10370410 -> 1000003800420: IBM Linux on POWER Performance Simulator - A tool that provides users of Linux on Power a set of performance models for IBM's POWER processors. 10370420 -> 1000003800430: Database File Archive And Restoration Management - An application for archiving and restoring hard disk files using file references stored in a database. 10370430 -> 1000003800440: Policy Management for Autonomic Computing - A policy-based autonomic management infrastructure that simplifies the automation of IT and business processes. 10370440 -> 1000003800450: FairUCE - A spam filter that verifies sender identity instead of filtering content. 10370450 -> 1000003800460: Unstructured Information Management Architecture (UIMA) SDK - A Java SDK that supports the implementation, composition, and deployment of applications working with unstructured information. 10370460 -> 1000003800470: Accessibility Browser - A web-browser specifically designed to assist people with visual impairments, to be released as open-source software. 10370470 -> 1000003800480: Also known as the "A-Browser," the technology will aim to eliminate the need for a mouse, relying instead completely on voice-controls, buttons and predefined shortcut keys. 10370480 -> 1000003800490: Semiconductor design and manufacturing 10370490 -> 1000003800500: Virtually all modern console gaming systems use microprocessors developed by IBM. 10370500 -> 1000003800510: The Xbox 360 contains the Xenon tri-core processor, which was designed and produced by IBM in less than 24 months. 
10370510 -> 1000003800520: Sony's PlayStation 3 features the Cell BE microprocessor designed jointly by IBM, Toshiba, and Sony. 10370520 -> 1000003800530: Nintendo's seventh-generation console, Wii, features an IBM chip codenamed Broadway. 10370530 -> 1000003800540: The older Nintendo GameCube also utilizes the Gekko processor, designed by IBM. 10370540 -> 1000003800550: In May 2002, IBM and Butterfly.net, Inc. announced the Butterfly Grid, a commercial grid for the online video gaming market. 10370550 -> 1000003800560: In March 2006, IBM announced separate agreements with Hoplon Infotainment, Online Game Services Incorporated (OGSI), and RenderRocket to provide on-demand content management and blade server computing resources. 10370560 -> 1000003800570: Open Client Offering 10370570 -> 1000003800580: IBM announced that it would launch new software, called "Open Client Offering," which is to run on Microsoft Windows, Linux, and Apple's Macintosh. 10370580 -> 1000003800590: The company states that its new product allows businesses to offer employees a choice of using the same software on Windows and its alternatives. 10370590 -> 1000003800600: This means that "Open Client Offering" is intended to cut the cost of managing Linux or Macintosh desktops relative to Windows. 10370600 -> 1000003800610: Companies will no longer need to pay Microsoft for licenses, since their operations will no longer rely on Windows-based software. 10370610 -> 1000003800620: One alternative to Microsoft's office software is the Open Document Format software, whose development IBM supports. 10370620 -> 1000003800630: It is to be used for several tasks, including word processing and presentations, along with collaboration through Lotus Notes, instant messaging and blog tools, as well as the Firefox web browser, a competitor to Internet Explorer. 10370630 -> 1000003800640: IBM plans to install Open Client on 5 percent of its desktop PCs. 10370640 -> 1000003800650: UC2: Unified Communications and Collaboration 10370650 -> 1000003800660: UC2 (Unified Communications and Collaboration) is an IBM and Cisco joint project based on Eclipse and OSGi. 10370660 -> 1000003800670: It will offer Eclipse application developers a unified platform and an easier work environment. 10370670 -> 1000003800680: Software based on the UC2 platform will provide major enterprises with easy-to-use communication solutions, such as the Lotus-based Sametime. 10370680 -> 1000003800690: In the future, Sametime users will benefit from additional functions such as click-to-call and voice mailing. 10370690 -> 1000003800700: Internal programs 10370700 -> 1000003800710: Extreme Blue is a company initiative that uses experienced IBM engineers, talented interns, and business managers to develop high-value technology. 10370710 -> 1000003800720: The project is designed to analyze emerging business needs and the technologies that can solve them. 10370720 -> 1000003800730: These projects mostly involve rapid prototyping of high-profile software and hardware projects. 10370730 -> 1000003800740: In May 2007, IBM unveiled Project Big Green, a redirection of $1 billion per year across its businesses to increase energy efficiency. 10370740 -> 1000003800750: IBM Software Group 10370750 -> 1000003800760: This group is one of the major divisions of IBM.
10370760 -> 1000003800770: The various brands include: 10370770 -> 1000003800780: Information Management Software — database servers and tools, text analytics, content management, business process management and business intelligence. 10370780 -> 1000003800790: Lotus Software — Groupware, collaboration and business software. 10370790 -> 1000003800800: Acquired in 1995. 10370800 -> 1000003800810: Rational Software — Software development and application lifecycle management. 10370810 -> 1000003800820: Acquired in 2002. 10370820 -> 1000003800830: Tivoli Software — Systems management. 10370830 -> 1000003800840: Acquired in 1996. 10370840 -> 1000003800850: WebSphere — Integration and application infrastructure software. 10370850 -> 1000003800860: Environmental record 10370860 -> 1000003800870: IBM has a long history of dealing with its environmental problems. 10370870 -> 1000003800880: It established a corporate policy on environmental protection in 1971, with the support of a comprehensive global environmental management system. 10370880 -> 1000003800890: According to IBM's statistics, its total hazardous waste decreased by 44 percent over the past five years, and has decreased by 94.6 percent since 1987. 10370890 -> 1000003800900: IBM's total hazardous waste calculation consists of waste from both non-manufacturing and manufacturing operations. 10370900 -> 1000003800910: Waste from manufacturing operations includes waste recycled in closed-loop systems, where process chemicals are recovered for subsequent reuse rather than simply disposed of and replaced with new chemical materials. 10370910 -> 1000003800920: Over the years, IBM has redesigned processes to eliminate almost all closed-loop recycling and now uses more environmentally friendly materials in their place. 10370920 -> 1000003800930: IBM was recognized as one of the "Top 20 Best Workplaces for Commuters" by the U.S. Environmental Protection Agency (EPA) in 2005. 10370930 -> 1000003800940: This was to recognize the Fortune 500 companies that provided their employees with excellent commuter benefits that helped reduce traffic and air pollution. 10370940 -> 1000003800950: However, the birthplace of IBM, Endicott, suffered from IBM's pollution for decades. 10370950 -> 1000003800960: IBM used liquid cleaning agents in its circuit board assembly operation for more than two decades, and six spill and leak incidents were recorded, including one 1979 leak of 4,100 gallons from an underground tank. 10370960 -> 1000003800970: These left behind volatile organic compounds in the town's soil and aquifer. 10370970 -> 1000003800980: Trace elements of volatile organic compounds have been identified in Endicott's drinking water, but the levels are within regulatory limits. 10370980 -> 1000003800990: Also, since 1980, IBM has pumped 78,000 gallons of chemicals, including trichloroethane, Freon, benzene, and perchloroethene, into the air, allegedly causing several cancer cases among the villagers. 10370990 -> 1000003801000: IBM Endicott has been identified by the Department of Environmental Conservation as the major source of pollution, though traces of contaminants from a local dry cleaner and other polluters were also found. 10371000 -> 1000003801010: Despite the amount of pollutants, state health officials cannot say whether air or water pollution in Endicott has actually caused any health problems. 10371010 -> 1000003801020: Village officials say tests show that the water is safe to drink.
10371020 -> 1000003801030: Solar power 10371030 -> 1000003801040: Tokyo Ohka Kogyo Co., Ltd. (TOK) and IBM are collaborating to establish new, low-cost methods for bringing the next generation of solar energy products to market, namely CIGS (Copper-Indium-Gallium-Selenide) solar cell modules. 10371040 -> 1000003801050: Use of thin-film technology, such as CIGS, has great promise in reducing the overall cost of solar cells and further enabling their widespread adoption. 10371050 -> 1000003801060: IBM is exploring four main areas of photovoltaic research: using current technologies to develop cheaper and more efficient silicon solar cells, developing new solution-processed thin-film photovoltaic devices, concentrator photovoltaics, and future-generation photovoltaic architectures based upon nanostructures such as semiconductor quantum dots and nanowires. 10371060 -> 1000003801070: Dr. Supratik Guha is the leading scientist in IBM photovoltaics. 10371070 -> 1000003801080: Corporate culture of IBM 10371080 -> 1000003801090: Big Blue is a nickname for IBM; several theories exist regarding its origin. 10371090 -> 1000003801100: One theory, substantiated by people who worked for IBM at the time, is that IBM field reps coined the term in the 1960s, referring to the color of the mainframes IBM installed in the 1960s and early 1970s. 10371100 -> 1000003801110: "All blue" was a term used to describe a loyal IBM customer, and business writers later picked up the term. 10371110 -> 1000003801120: Another theory suggests that Big Blue simply refers to the company's logo. 10371120 -> 1000003801130: A third theory suggests that Big Blue refers to a former company dress code that required many IBM employees to wear white shirts, which many paired with blue suits. 10371130 -> 1000003801140: In any event, IBM keyboards, typewriters, and some other manufactured devices have played on the "Big Blue" concept, using the color for enter keys and carriage returns. 10371140 -> 1000003801150: Sales 10371150 -> 1000003801160: IBM has often been described as having a sales-centric or a sales-oriented business culture. 10371160 -> 1000003801170: Traditionally, many IBM executives and general managers are chosen from the sales force. 10371170 -> 1000003801180: The current CEO, Sam Palmisano, for example, joined the company as a salesman and, unusually for CEOs of major corporations, has no MBA or postgraduate qualification. 10371180 -> 1000003801190: Middle and top management are often enlisted to give direct support to salesmen when pitching sales to important customers. 10371190 -> 1000003801200: The uniform 10371200 -> 1000003801210: A dark (or gray) suit, white shirt, and a "sincere" tie was the public uniform for IBM employees for most of the 20th century. 10371210 -> 1000003801220: During IBM's management transformation in the 1990s, CEO Lou Gerstner relaxed these codes, normalizing the dress and behavior of IBM employees to resemble their counterparts in other large technology companies. 10371220 -> 1000003801230: IBM company values and "Jam" 10371230 -> 1000003801240: In 2003, IBM embarked on an ambitious project to rewrite company values. 10371240 -> 1000003801250: Using its Jam technology, the company hosted intranet-based online discussions on key business issues with 50,000 employees over three days. 10371250 -> 1000003801260: The discussions were analyzed by sophisticated text analysis software (eClassifier) to mine online comments for themes.
10371260 -> 1000003801270: As a result of the 2003 Jam, the company values were updated to reflect three modern business, marketplace and employee views: "Dedication to every client's success", "Innovation that matters - for our company and for the world", "Trust and personal responsibility in all relationships". 10371270 -> 1000003801280: In 2004, another Jam was conducted during which 52,000 employees exchanged best practices for 72 hours. 10371280 -> 1000003801290: They focused on finding actionable ideas to support implementation of the values previously identified. 10371290 -> 1000003801300: A new post-Jam Ratings event was developed to allow IBMers to select key ideas that support the values. 10371300 -> 1000003801310: The board of directors cited this Jam when awarding Palmisano a pay rise in the spring of 2005. 10371310 -> 1000003801320: In July and September 2006, Palmisano launched another jam called InnovationJam. 10371320 -> 1000003801330: InnovationJam was the largest online brainstorming session ever, with more than 150,000 participants from 104 countries. 10371330 -> 1000003801340: The participants were IBM employees, members of IBM employees' families, universities, partners, and customers. 10371340 -> 1000003801350: InnovationJam was divided into two sessions (one in July and one in September) of 72 hours each and generated more than 46,000 ideas. 10371350 -> 1000003801360: In November 2006, IBM declared that it would invest US$100 million in the 10 best ideas from InnovationJam. 10371360 -> 1000003801370: Open source 10371370 -> 1000003801380: IBM has been influenced by the Open Source Initiative, and began supporting Linux in 1998. 10371380 -> 1000003801390: The company invests billions of dollars in services and software based on Linux through the IBM Linux Technology Center, which includes over 300 Linux kernel developers. 10371390 -> 1000003801400: IBM has also released code under different open-source licenses, such as the platform-independent software framework Eclipse (worth approximately US$40 million at the time of the donation) and the Java-based relational database management system (RDBMS) Apache Derby. 10371400 -> 1000003801410: IBM's open source involvement has not been trouble-free, however (see SCO v. IBM). 10371410 -> 1000003801420: Corporate affairs 10371420 -> 1000003801430: Diversity and workforce issues 10371430 -> 1000003801440: IBM's efforts to promote workforce diversity and equal opportunity date back at least to World War I, when the company hired disabled veterans. 10371440 -> 1000003801450: IBM was the only technology company ranked in Working Mother magazine's Top 10 for 2004, and one of two technology companies in 2005 (the other company being Hewlett-Packard). 10371450 -> 1000003801460: On September 21, 1953, Thomas J. Watson, the CEO at the time, sent out a very controversial letter to all IBM employees stating that IBM needed to hire the best people, regardless of their race, ethnic origin, or gender. 10371460 -> 1000003801470: In 1984, IBM added sexual preference to this policy. 10371470 -> 1000003801480: Watson had stated that the policy would give IBM a competitive advantage because IBM would then be able to hire talented people its competitors would turn down. 10371480 -> 1000003801490: The company has traditionally resisted labor union organizing, although unions represent some IBM workers outside the United States.
10371490 -> 1000003801500: In the 1990s, two major pension program changes, including a conversion to a cash balance plan, resulted in an employee class action lawsuit alleging age discrimination. 10371500 -> 1000003801510: IBM employees won the lawsuit and arrived at a partial settlement, although appeals are still underway. 10371510 -> 1000003801520: IBM also settled a major overtime class-action lawsuit in 2006. 10371520 -> 1000003801530: Historically, IBM has had a good reputation for long-term staff retention, with few large-scale layoffs. 10371530 -> 1000003801540: In more recent years, there have been a number of broad, sweeping cuts to the workforce as IBM attempts to adapt to changing market conditions and a declining profit base. 10371540 -> 1000003801550: After posting weaker than expected revenues in the first quarter of 2005, IBM eliminated 14,500 positions from its workforce, predominantly in Europe. 10371550 -> 1000003801560: In May 2005, IBM Ireland told staff that the Micro-electronics Division (MD) facility would close by the end of 2005 and offered staff a settlement. 10371560 -> 1000003801570: However, all staff who wished to stay with the company were redeployed within IBM Ireland. 10371570 -> 1000003801580: Production moved to a company called Amkor in Singapore, which purchased IBM's microelectronics business there, and it is widely agreed that IBM promised this company full load capacity in return for the purchase of the facility. 10371580 -> 1000003801590: On June 8, 2005, IBM Canada Ltd. eliminated approximately 700 positions. 10371590 -> 1000003801600: IBM described these cuts as part of a strategy to "rebalance" its portfolio of professional skills and businesses. 10371600 -> 1000003801610: IBM India and other IBM offices in China, the Philippines and Costa Rica have been witnessing a recruitment boom and steady growth in the number of employees due to lower wages. 10371610 -> 1000003801620: On October 10, 2005, IBM became the first major company in the world to formally commit to not using genetic information in its employment decisions. 10371620 -> 1000003801630: This came just a few months after IBM announced its support of the National Geographic Society's Genographic Project. 10371630 -> 1000003801640: Gay rights 10371640 -> 1000003801650: IBM provides employees' same-sex partners with benefits and includes an anti-discrimination clause. 10371650 -> 1000003801660: The Human Rights Campaign has consistently rated IBM 100% on its index of gay-friendliness since 2003 (in 2002, the year it began compiling its report on major companies, IBM scored 86%). 10371660 -> 1000003801670: Logos 10371670 -> 1000003801680: Logos designed in the 1970s tended to be sensitive to the technical limitations of photocopiers, which were then being widely deployed. 10371680 -> 1000003801690: A logo with large solid areas tended to be poorly copied by copiers in the 1970s, so companies preferred logos that avoided large solid areas. 10371690 -> 1000003801700: The 1972 IBM logos are an example of this tendency. 10371700 -> 1000003801710: With the advent of digital copiers in the mid-1980s, this technical restriction had largely disappeared; at roughly the same time, the 13-bar logo was abandoned for almost the opposite reason – it was difficult to render accurately on the low-resolution digital printers (240 dots per inch) of the time.
10371710 -> 1000003801720: Board of directors 10371720 -> 1000003801730: Current members of the board of directors of IBM are: 10371730 -> 1000003801740: Cathleen Black President, Hearst Magazines 10371740 -> 1000003801750: William Brody President, Johns Hopkins University 10371750 -> 1000003801760: Ken Chenault Chairman and CEO, American Express Company 10371760 -> 1000003801770: Juergen Dormann Chairman of the Board, ABB Ltd 10371770 -> 1000003801780: Michael Eskew Chairman and CEO, United Parcel Service, Inc. 10371780 -> 1000003801790: Shirley Ann Jackson President, Rensselaer Polytechnic Institute 10371790 -> 1000003801800: Minoru Makihara Senior Corporate Advisor and former Chairman, Mitsubishi Corporation 10371800 -> 1000003801810: Lucio Noto Managing Partner, Midstream Partners LLC 10371810 -> 1000003801820: James W. Owens Chairman and CEO, Caterpillar Inc. 10371820 -> 1000003801830: Samuel J. Palmisano Chairman, President and CEO, IBM 10371830 -> 1000003801840: Joan Spero President, Doris Duke Charitable Foundation 10371840 -> 1000003801850: Sidney Taurel Chairman and CEO, Eli Lilly and Company 10371850 -> 1000003801860: Lorenzo Zambrano Chairman and CEO, Cemex SAB de CV Information 10380010 -> 1000003900020: Information 10380020 -> 1000003900030: Information as a concept has a diversity of meanings, from everyday usage to technical settings. 10380030 -> 1000003900040: Generally speaking, the concept of information is closely related to notions of constraint, communication, control, data, form, instruction, knowledge, meaning, mental stimulus, pattern, perception, and representation. 10380040 -> 1000003900050: Many people speak about the Information Age as the advent of the Knowledge Age or knowledge society, the information society, the Information revolution, and information technologies, and even though informatics, information science and computer science are often in the spotlight, the word "information" is often used without careful consideration of the various meanings it has acquired. 10380050 -> 1000003900060: Etymology 10380060 -> 1000003900070: According to the Oxford English Dictionary, the earliest historical meaning of the word information in English was the act of informing, or giving form or shape to the mind, as in education, instruction, or training. 10380070 -> 1000003900080: A quote from 1387: "Five books come down from heaven for information of mankind." 10380080 -> 1000003900090: It was also used for an item of training, e.g. a particular instruction. 10380090 -> 1000003900100: "Melibee had heard the great skills and reasons of Dame Prudence, and her wise information and techniques." 10380100 -> 1000003900110: (1386) 10380110 -> 1000003900120: The English word was apparently derived by adding the common "noun of action" ending "-ation" (descended through French from Latin "-tio") to the earlier verb to inform, in the sense of to give form to the mind, to discipline, instruct, teach: "Men so wise should go and inform their kings." 10380120 -> 1000003900130: (1330) Inform itself comes (via French) from the Latin verb informare, to give form to, to form an idea of. 10380125 -> 1000003900140: Furthermore, Latin itself already contained the word informatio meaning concept or idea, but the extent to which this may have influenced the development of the word information in English is unclear.
10380130 -> 1000003900150: As a final note, the ancient Greek word for form was eidos, and this word was famously used in a technical philosophical sense by Plato (and later Aristotle) to denote the ideal identity or essence of something (see Theory of forms). 10380140 -> 1000003900160: "Eidos" can also be associated with thought, proposition or even concept. 10380150 -> 1000003900170: Information as a message 10380160 -> 1000003900180: Information is the state of a system of interest. 10380170 -> 1000003900190: A message is the information materialized. 10380180 -> 1000003900200: Information is a quality of a message from a sender to one or more receivers. 10380190 -> 1000003900210: Information is always about something (size of a parameter, occurrence of an event, etc.). 10380200 -> 1000003900220: Viewed in this manner, information does not have to be accurate. 10380210 -> 1000003900230: It may be a truth or a lie, or just the sound of a falling tree. 10380220 -> 1000003900240: Even a disruptive noise used to inhibit the flow of communication and create misunderstanding would in this view be a form of information. 10380230 -> 1000003900250: However, generally speaking, if the amount of information in the received message increases, the message is more accurate. 10380240 -> 1000003900260: This model assumes there is a definite sender and at least one receiver. 10380250 -> 1000003900270: Many refinements of the model assume the existence of a common language understood by the sender and at least one of the receivers. 10380260 -> 1000003900280: An important variation identifies information as that which would be communicated by a message if it were sent from a sender to a receiver capable of understanding the message. 10380270 -> 1000003900290: Notably, it is not required that the sender be capable of understanding the message, or even cognizant that there is a message. 10380280 -> 1000003900300: Thus, information is something that can be extracted from an environment, e.g., through observation, reading or measurement. 10380290 -> 1000003900310: Information is a term with many meanings depending on context, but is as a rule closely related to such concepts as meaning, knowledge, instruction, communication, representation, and mental stimulus. 10380300 -> 1000003900320: Simply stated, information is a message received and understood. 10380310 -> 1000003900330: In terms of data, it can be defined as a collection of facts from which conclusions may be drawn. 10380320 -> 1000003900340: There are many other aspects of information since it is the knowledge acquired through study or experience or instruction. 10380330 -> 1000003900350: But overall, information is the result of processing, manipulating and organizing data in a way that adds to the knowledge of the person receiving it. 10380340 -> 1000003900360: Communication theory provides a numerical measure of the uncertainty of an outcome. 10380350 -> 1000003900370: For example, we can say that "the signal contained thousands of bits of information". 10380360 -> 1000003900380: Communication theory tends to use the concept of information entropy, generally attributed to C.E. Shannon (see below). 10380370 -> 1000003900390: Another form of information is Fisher information, a concept of R.A. Fisher. 10380380 -> 1000003900400: This is used in the application of statistics to estimation theory and to science in general.
10380390 -> 1000003900410: Fisher information is thought of as the amount of information that a message carries about an unobservable parameter. 10380400 -> 1000003900420: It can be computed from knowledge of the likelihood function defining the system. 10380410 -> 1000003900430: For example, with a normal likelihood function, the Fisher information is the reciprocal of the variance of the law. 10380420 -> 1000003900440: In the absence of knowledge of the likelihood law, the Fisher information may be computed from normally distributed score data as the reciprocal of their second moment. 10380430 -> 1000003900450: Even though information and data are often used interchangeably, they are actually very different. 10380440 -> 1000003900460: Data is a set of unrelated information, and as such is of no use until it is properly evaluated. 10380450 -> 1000003900470: Upon evaluation, once some significant relation between the data is established and they show some relevance, they are converted into information. 10380460 -> 1000003900480: Now this same data can be used for different purposes. 10380470 -> 1000003900490: Thus, until the data convey some information, they are not useful. 10380480 -> 1000003900500: Measuring information entropy 10380490 -> 1000003900510: The view of information as a message came into prominence with the publication in 1948 of an influential paper by Claude Shannon, "A Mathematical Theory of Communication." 10380500 -> 1000003900520: This paper provides the foundations of information theory and endows the word information not only with a technical meaning but also with a measure. 10380510 -> 1000003900530: If the sending device is equally likely to send any one of a set of N messages, then the preferred measure of "the information produced when one message is chosen from the set" is the base-two logarithm of N (this measure is called self-information). 10380520 -> 1000003900540: In this paper, Shannon continues: 10380521 -> 1000003900550: The choice of a logarithmic base corresponds to the choice of a unit for measuring information. 10380522 -> 1000003900560: If the base 2 is used the resulting units may be called binary digits, or more briefly bits, a word suggested by J. W. Tukey. A device with two stable positions, such as a relay or a flip-flop circuit, can store one bit of information. 10380523 -> 1000003900570: N such devices can store N bits… 10380530 -> 1000003900580: A complementary way of measuring information is provided by algorithmic information theory. 10380540 -> 1000003900590: In brief, this measures the information content of a list of symbols based on how predictable they are, or more specifically how easy it is to compute the list through a program: the information content of a sequence is the number of bits of the shortest program that computes it. 10380550 -> 1000003900600: The sequence below would have a very low algorithmic information measurement since it is a very predictable pattern, and as the pattern continues the measurement would not change. 10380560 -> 1000003900610: Shannon information would give the same information measurement for each symbol, since they are statistically random, and each new symbol would increase the measurement. 10380570 -> 1000003900620: 123456789101112131415161718192021
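As a rough, hedged illustration of the contrast just described, the following Python fragment treats the sequence above in both ways. The per-symbol probabilities are estimated naively from digit frequencies in the string itself, which is only a crude stand-in for a Shannon-style analysis, and the length of a one-line generating program is used as an informal proxy for the "shortest program" idea; neither is how either theory is formally applied.

```python
import math
from collections import Counter

# The sequence discussed above: the decimal numbers 1..21 written one after another.
sequence = "".join(str(n) for n in range(1, 22))
assert sequence == "123456789101112131415161718192021"

# Shannon-style view: estimate a probability for each digit from its frequency
# and sum -log2 p over the string (an empirical estimate, in bits); this total
# keeps growing as more symbols are appended.
counts = Counter(sequence)
total = len(sequence)
shannon_bits = sum(-math.log2(counts[c] / total) for c in sequence)
print(f"Shannon-style estimate: about {shannon_bits:.1f} bits for {total} digits")

# Algorithmic view: the one-line generator above *is* a short program that
# reproduces the whole string, so the algorithmic information content stays
# small however far the predictable pattern is continued.
program = 'print("".join(str(n) for n in range(1, 22)))'
print(f"Length of a generating program: {len(program)} characters")
```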
10380580 -> 1000003900630: It is important to recognize the limitations of traditional information theory and algorithmic information theory from the perspective of human meaning. 10380590 -> 1000003900640: For example, when referring to the meaning content of a message, Shannon noted "Frequently the messages have meaning… these semantic aspects of communication are irrelevant to the engineering problem. 10380600 -> 1000003900650: The significant aspect is that the actual message is one selected from a set of possible messages" (emphasis in original). 10380610 -> 1000003900660: In information theory, signals are part of a process, not a substance; they do something, they do not contain any specific meaning. 10380620 -> 1000003900670: Combining algorithmic information theory and information theory, we can conclude that the most random signal contains the most information as it can be interpreted in any way and cannot be compressed. 10380630 -> 1000003900680: Michael Reddy noted that "'signals' of the mathematical theory are 'patterns that can be exchanged'. 10380640 -> 1000003900690: There is no message contained in the signal, the signals convey the ability to select from a set of possible messages." 10380650 -> 1000003900700: In information theory "the system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design". 10380660 -> 1000003900710: Information as a pattern 10380670 -> 1000003900720: Information is any represented pattern. 10380680 -> 1000003900730: This view assumes neither accuracy nor directly communicating parties, but instead assumes a separation between an object and its representation. 10380690 -> 1000003900740: Consider the following example: economic statistics represent an economy, however inaccurately. 10380700 -> 1000003900750: What are commonly referred to as data in computing, statistics, and other fields are forms of information in this sense. 10380710 -> 1000003900760: The electromagnetic patterns in a computer network and connected devices are related to something other than the pattern itself, such as text characters to be displayed and keyboard input. 10380720 -> 1000003900770: Signals, signs, and symbols are also in this category. 10380730 -> 1000003900780: On the other hand, according to semiotics, data is symbols with certain syntax and information is data with a certain semantic. 10380740 -> 1000003900790: Painting and drawing contain information to the extent that they represent something such as an assortment of objects on a table, a profile, or a landscape. 10380750 -> 1000003900800: In other words, when a pattern of something is transposed to a pattern of something else, the latter is information. 10380760 -> 1000003900810: This would be the case whether or not there was anyone to perceive it. 10380770 -> 1000003900820: But if information can be defined merely as a pattern, does that mean that neither utility nor meaning are necessary components of information? 10380780 -> 1000003900830: Arguably, a distinction must be made between raw unprocessed data and information which possesses utility, value or some quantum of meaning. 10380790 -> 1000003900840: On this view, information may indeed be characterized as a pattern; but this is a necessary condition, not a sufficient one. 10380800 -> 1000003900850: An individual entry in a telephone book, which follows a specific pattern formed by name, address and telephone number, does not become "informative" in some sense unless and until it possesses some degree of utility, value or meaning.
10380810 -> 1000003900860: For example, someone might look up a girlfriend's number, might order a take away etc. 10380820 -> 1000003900870: The vast majority of numbers will never be construed as "information" in any meaningful sense. 10380830 -> 1000003900880: The gap between data and information is only closed by a behavioral bridge whereby some value, utility or meaning is added to transform mere data or pattern into information. 10380840 -> 1000003900890: When one constructs a representation of an object, one can selectively extract from the object (sampling) or use a system of signs to replace (encoding), or both. 10380850 -> 1000003900900: The sampling and encoding result in representation. 10380860 -> 1000003900910: An example of the former is a "sample" of a product; an example of the latter is "verbal description" of a product. 10380870 -> 1000003900920: Both contain information of the product, however inaccurate. 10380880 -> 1000003900930: When one interprets representation, one can predict a broader pattern from a limited number of observations (inference) or understand the relation between patterns of two different things (decoding). 10380890 -> 1000003900940: One example of the former is to sip a soup to know if it is spoiled; an example of the latter is examining footprints to determine the animal and its condition. 10380900 -> 1000003900950: In both cases, information sources are not constructed or presented by some "sender" of information. 10380910 -> 1000003900960: Regardless, information is dependent upon, but usually unrelated to and separate from, the medium or media used to express it. 10380920 -> 1000003900970: In other words, the position of a theoretical series of bits, or even the output once interpreted by a computer or similar device, is unimportant, except when someone or something is present to interpret the information. 10380930 -> 1000003900980: Therefore, a quantity of information is totally distinct from its medium. 10380940 -> 1000003900990: Information as sensory input 10380950 -> 1000003901000: Often information is viewed as a type of input to an organism or designed device. 10380960 -> 1000003901010: Inputs are of two kinds. 10380970 -> 1000003901020: Some inputs are important to the function of the organism (for example, food) or device (energy) by themselves. 10380980 -> 1000003901030: In his book Sensory Ecology, Dusenbery called these causal inputs. 10380990 -> 1000003901040: Other inputs (information) are important only because they are associated with causal inputs and can be used to predict the occurrence of a causal input at a later time (and perhaps another place). 10381000 -> 1000003901050: Some information is important because of association with other information but eventually there must be a connection to a causal input. 10381010 -> 1000003901060: In practice, information is usually carried by weak stimuli that must be detected by specialized sensory systems and amplified by energy inputs before they can be functional to the organism or device. 10381020 -> 1000003901070: For example, light is often a causal input to plants but provides information to animals. 10381030 -> 1000003901080: The colored light reflected from a flower is too weak to do much photosynthetic work but the visual system of the bee detects it and the bee's nervous system uses the information to guide the bee to the flower, where the bee often finds nectar or pollen, which are causal inputs, serving a nutritional function. 
10381040 -> 1000003901090: Information is any type of sensory input. 10381050 -> 1000003901100: When an organism with a nervous system receives an input, it transforms the input into an electrical signal. 10381060 -> 1000003901110: This is regarded as information by some. 10381070 -> 1000003901120: The idea of representation is still relevant, but in a slightly different manner. 10381080 -> 1000003901130: That is, while an abstract painting does not represent anything concretely, when the viewer sees the painting, it is nevertheless transformed into electrical signals that create a representation of the painting. 10381090 -> 1000003901140: Defined this way, information does not have to be related to truth, communication, or representation of an object. 10381100 -> 1000003901150: Entertainment in general is not intended to be informative. 10381110 -> 1000003901160: Music, the performing arts, amusement parks, works of fiction and so on are thus forms of information in this sense, but they are not necessarily forms of information according to some definitions given above. 10381120 -> 1000003901170: Consider another example: food supplies both nutrition and taste for those who eat it. 10381130 -> 1000003901180: If information is equated to sensory input, then nutrition is not information but taste is. 10381140 -> 1000003901190: Information as an influence which leads to a transformation 10381150 -> 1000003901200: Information is any type of pattern that influences the formation or transformation of other patterns. 10381160 -> 1000003901210: In this sense, there is no need for a conscious mind to perceive, much less appreciate, the pattern. 10381170 -> 1000003901220: Consider, for example, DNA. 10381180 -> 1000003901230: The sequence of nucleotides is a pattern that influences the formation and development of an organism without any need for a conscious mind. 10381190 -> 1000003901240: Systems theory at times seems to refer to information in this sense, assuming information does not necessarily involve any conscious mind, and patterns circulating (due to feedback) in the system can be called information. 10381200 -> 1000003901250: In other words, it can be said that information in this sense is something potentially perceived as representation, though not created or presented for that purpose. 10381210 -> 1000003901260: When Marshall McLuhan speaks of media and their effects on human cultures, he refers to the structure of artifacts that in turn shape our behaviors and mindsets. 10381220 -> 1000003901270: Also, pheromones are often said to be "information" in this sense. 10381230 -> 1000003901280: (See also Gregory Bateson.) 10381240 -> 1000003901290: Information as a property in physics 10381250 -> 1000003901300: In 2003, J. D. Bekenstein claimed there is a growing trend in physics to define the physical world as being made of information itself (and thus information is defined in this way). 10381260 -> 1000003901310: Information has a well-defined meaning in physics. 10381270 -> 1000003901320: Examples of this include the phenomenon of quantum entanglement, where particles can interact without reference to their separation or the speed of light. 10381280 -> 1000003901330: Information itself cannot travel faster than light even if the information is transmitted indirectly.
10381290 -> 1000003901340: This could lead to the fact that all attempts at physically observing a particle with an "entangled" relationship to another are slowed down, even though the particles are not connected in any other way other than by the information they carry. 10381300 -> 1000003901350: Another link is demonstrated by the Maxwell's demon thought experiment. 10381310 -> 1000003901360: In this experiment, a direct relationship between information and another physical property, entropy, is demonstrated. 10381320 -> 1000003901370: A consequence is that it is impossible to destroy information without increasing the entropy of a system; in practical terms this often means generating heat. 10381330 -> 1000003901380: Another, more philosophical, outcome is that information could be thought of as interchangeable with energy. 10381340 -> 1000003901390: Thus, in the study of logic gates, the theoretical lower bound of thermal energy released by an AND gate is higher than for the NOT gate (because information is destroyed in an AND gate and simply converted in a NOT gate). 10381350 -> 1000003901400: Physical information is of particular importance in the theory of quantum computers. 10381360 -> 1000003901410: Information as records 10381370 -> 1000003901420: Records are a specialized form of information. 10381380 -> 1000003901430: Essentially, records are information produced consciously or as by-products of business activities or transactions and retained because of their value. 10381390 -> 1000003901440: Primarily their value is as evidence of the activities of the organization but they may also be retained for their informational value. 10381400 -> 1000003901450: Sound records management ensures that the integrity of records is preserved for as long as they are required. 10381410 -> 1000003901460: The international standard on records management, ISO 15489, defines records as "information created, received, and maintained as evidence and information by an organization or person, in pursuance of legal obligations or in the transaction of business". 10381420 -> 1000003901470: The International Committee on Archives (ICA) Committee on electronic records defined a record as, "a specific piece of recorded information generated, collected or received in the initiation, conduct or completion of an activity and that comprises sufficient content, context and structure to provide proof or evidence of that activity". 10381430 -> 1000003901480: Records may be retained because of their business value, as part of the corporate memory of the organization or to meet legal, fiscal or accountability requirements imposed on the organization. 10381440 -> 1000003901490: Willis (2005) expressed the view that sound management of business records and information delivered "…six key requirements for good corporate governance…transparency; accountability; due process; compliance; meeting statutory and common law requirements; and security of personal and corporate information." 10381450 -> 1000003901500: Information and semiotics 10381460 -> 1000003901510: Beynon-Davies explains the multi-faceted concept of information in terms of that of signs and sign-systems. 10381470 -> 1000003901520: Signs themselves can be considered in terms of four inter-dependent levels, layers or branches of semiotics: pragmatics, semantics, syntactics and empirics. 10381480 -> 1000003901530: These four layers serve to connect the social world on the one hand with the physical or technical world on the other. 
10381490 -> 1000003901540: Pragmatics is concerned with the purpose of communication. 10381500 -> 1000003901550: Pragmatics links the issue of signs with that of intention. 10381510 -> 1000003901560: The focus of pragmatics is on the intentions of human agents underlying communicative behaviour. 10381520 -> 1000003901570: In other words, intentions link language to action. 10381530 -> 1000003901580: Semantics is concerned with the meaning of a message conveyed in a communicative act. 10381535 -> 1000003901590: Semantics considers the content of communication. 10381540 -> 1000003901600: Semantics is the study of the meaning of signs - the association between signs and behaviour. 10381550 -> 1000003901610: Semantics can be considered as the study of the link between symbols and their referents or concepts; particularly the way in which signs relate to human behaviour. 10381560 -> 1000003901620: Syntactics is concerned with the formalism used to represent a message. 10381570 -> 1000003901630: Syntactics as an area studies the form of communication in terms of the logic and grammar of sign systems. 10381580 -> 1000003901640: Syntactics is devoted to the study of the form rather than the content of signs and sign-systems. 10381590 -> 1000003901650: Empirics is the study of the signals used to carry a message; the physical characteristics of the medium of communication. 10381600 -> 1000003901660: Empirics is devoted to the study of communication channels and their characteristics, e.g., sound, light, electronic transmission etc. 10381610 -> 1000003901670: Communication normally exists within the context of some social situation. 10381620 -> 1000003901680: The social situation sets the context for the intentions conveyed (pragmatics) and the form in which communication takes place. 10381630 -> 1000003901690: In a communicative situation intentions are expressed through messages which comprise collections of inter-related signs taken from a language which is mutually understood by the agents involved in the communication. 10381640 -> 1000003901700: Mutual understanding implies that agents involved understand the chosen language in terms of its agreed syntax (syntactics) and semantics. 10381650 -> 1000003901710: The sender codes the message in the language and sends the message as signals along some communication channel (empirics). 10381660 -> 1000003901720: The chosen communication channel will have inherent properties which determine outcomes such as the speed with which communication can take place and over what distance. Information extraction 10390010 -> 1000004000020: Information extraction 10390020 -> 1000004000030: In natural language processing, information extraction (IE) is a type of information retrieval whose goal is to automatically extract structured information, i.e. categorized and contextually and semantically well-defined data from a certain domain, from unstructured machine-readable documents. 10390030 -> 1000004000040: An example of information extraction is the extraction of instances of corporate mergers, more formally MergerBetween(company_1, company_2, date), from an online news sentence such as: "Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp." 10390040 -> 1000004000050: A broad goal of IE is to allow computation to be done on the previously unstructured data. 10390050 -> 1000004000060: A more specific goal is to allow logical reasoning to draw inferences based on the logical content of the input data. 
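As a toy illustration of the kind of mapping information extraction performs, the following sketch turns the example news sentence quoted above into a structured MergerBetween(company_1, company_2, date) record using a single hand-written pattern. The pattern, the function name, and the way the date is supplied are all invented for this example; real IE systems rely on the natural language processing techniques described below rather than one-off regular expressions.

```python
import re

# A naive pattern for acquisition announcements of the form
# "<Acquirer> announced their acquisition of <Target>."
ACQUISITION = re.compile(
    r"(?P<acquirer>[A-Z][\w.&-]*(?:\s+[A-Z][\w.&-]*)*)\s+announced\s+"
    r"(?:their|its)\s+acquisition\s+of\s+"
    r"(?P<target>[A-Z][\w.&-]*(?:\s+[A-Z][\w.&-]*)*)"
)

def extract_mergers(sentence, date):
    """Return structured MergerBetween(company_1, company_2, date) tuples found in a sentence."""
    return [("MergerBetween", m.group("acquirer"), m.group("target"), date)
            for m in ACQUISITION.finditer(sentence)]

text = "Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp."
print(extract_mergers(text, date="yesterday"))
# [('MergerBetween', 'Foo Inc.', 'Bar Corp.', 'yesterday')]
```

The fragility of such a pattern (it matches only one phrasing and mishandles many capitalization and punctuation variants) is precisely why the field moved toward the restricted-domain NLP approaches discussed in the next paragraphs.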
10390060 -> 1000004000070: The significance of IE is determined by the growing amount of information available in unstructured (i.e. without metadata) form, for instance on the Internet. 10390070 -> 1000004000080: This knowledge can be made more accessible by means of transformation into relational form, or by marking up with XML tags. 10390080 -> 1000004000090: An intelligent agent monitoring a news data feed requires IE to transform unstructured data into something that can be reasoned with. 10390090 -> 1000004000100: A typical application of IE is to scan a set of documents written in a natural language and populate a database with the information extracted. 10390100 -> 1000004000110: Current approaches to IE use natural language processing techniques that focus on very restricted domains. 10390110 -> 1000004000120: For example, the Message Understanding Conference (MUC) is a competition-based conference that focused on the following domains in the past: 10390120 -> 1000004000130: MUC-1 (1987), MUC-2 (1989): Naval operations messages. 10390130 -> 1000004000140: MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries. 10390140 -> 1000004000150: MUC-5 (1993): Joint ventures and microelectronics domain. 10390150 -> 1000004000160: MUC-6 (1995): News articles on management changes. 10390160 -> 1000004000170: MUC-7 (1998): Satellite launch reports. 10390170 -> 1000004000180: Natural-language texts may need some form of text simplification to create more easily machine-readable text from which to extract the sentences. 10390180 -> 1000004000190: Typical subtasks of IE are: 10390190 -> 1000004000200: Named Entity Recognition: recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions. 10390200 -> 1000004000210: Coreference: identification of chains of noun phrases that refer to the same object. 10390210 -> 1000004000220: For example, anaphora is a type of coreference. 10390220 -> 1000004000230: Terminology extraction: finding the relevant terms for a given corpus. 10390230 -> 1000004000240: Relation Extraction: identification of relations between entities, such as: 10390240 -> 1000004000250: PERSON works for ORGANIZATION (extracted from the sentence "Bill works for IBM.") 10390250 -> 1000004000260: PERSON located in LOCATION (extracted from the sentence "Bill is in France.") Information retrieval 10400010 -> 1000004100020: Information retrieval 10400020 -> 1000004100030: Information retrieval (IR) is the science of searching for documents, for information within documents and for metadata about documents, as well as that of searching relational databases and the World Wide Web. 10400030 -> 1000004100040: There is overlap in the usage of the terms data retrieval, document retrieval, information retrieval, and text retrieval, but each also has its own body of literature, theory, praxis and technologies. 10400040 -> 1000004100050: IR is interdisciplinary, based on computer science, mathematics, library science, information science, information architecture, cognitive psychology, linguistics, statistics and physics. 10400050 -> 1000004100060: Automated information retrieval systems are used to reduce what has been called "information overload". 10400060 -> 1000004100070: Many universities and public libraries use IR systems to provide access to books, journals and other documents. 10400070 -> 1000004100080: Web search engines are the most visible IR applications.
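To make the idea of an automated retrieval system concrete, here is a minimal sketch of an inverted index with simple term-overlap ranking. It is an illustrative toy, not the implementation of any particular system mentioned later in this article; the three sample documents and the scoring rule are invented for the example.

```python
from collections import defaultdict

# A toy document collection.
docs = {
    1: "information retrieval systems reduce information overload",
    2: "web search engines are information retrieval applications",
    3: "libraries provide access to books and journals",
}

# Build an inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """Rank documents by how many distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(query.lower().split()):
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(search("information retrieval"))   # documents 1 and 2 each match both terms
print(search("books"))                    # only document 3 matches
```

Production systems replace the term-overlap score with weighting schemes such as TF-IDF and handle stemming, stop words, and very large collections, but the inverted-index structure sketched here underlies most of the systems discussed in the history below.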
10400080 -> 1000004100090: History 10400090 -> 1000004100100: The idea of using computers to search for relevant pieces of information was popularized in an article As We May Think by Vannevar Bush in 1945. 10400100 -> 1000004100110: First implementations of information retrieval systems were introduced in the 1950s and 1960s. 10400110 -> 1000004100120: By 1990 several different techniques had been shown to perform well on small text corpora (several thousand documents). 10400120 -> 1000004100130: In 1992 the US Department of Defense, along with the National Institute of Standards and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text program. 10400130 -> 1000004100140: The aim of this was to support the information retrieval community by supplying the infrastructure needed for the evaluation of text retrieval methodologies on a very large text collection. 10400140 -> 1000004100150: This catalyzed research on methods that scale to huge corpora. 10400150 -> 1000004100160: The introduction of web search engines has boosted the need for very large scale retrieval systems even further. 10400160 -> 1000004100170: The use of digital methods for storing and retrieving information has led to the phenomenon of digital obsolescence, where a digital resource ceases to be readable because the physical media, the reader required to read the media, the hardware, or the software that runs on it, is no longer available. 10400170 -> 1000004100180: The information is initially easier to retrieve than if it were on paper, but is then effectively lost. 10400180 -> 1000004100190: Timeline 10400190 -> 1000004100200: 1890: Hollerith tabulating machines were used to analyze the US census. 10400200 -> 1000004100210: (Herman Hollerith). 10400210 -> 1000004100220: 1945: Vannevar Bush's As We May Think appeared in Atlantic Monthly. 10400220 -> 1000004100230: Late 1940s: The US military confronted problems of indexing and retrieval of wartime scientific research documents captured from the Germans. 10400230 -> 1000004100240: 1947: Hans Peter Luhn (research engineer at IBM since 1941) began work on a mechanized, punch card based system for searching chemical compounds. 10400240 -> 1000004100250: 1950: The term "information retrieval" may have been coined by Calvin Mooers. 10400250 -> 1000004100260: 1950s: Growing concern in the US about a "science gap" with the USSR motivated and encouraged funding for, and provided a backdrop to, mechanized literature searching systems (Allen Kent et al.) and the invention of citation indexing (Eugene Garfield). 10400260 -> 1000004100270: 1955: Allen Kent joined Case Western Reserve University, and eventually became associate director of the Center for Documentation and Communications Research. 10400270 -> 1000004100280: That same year, Kent and colleagues published a paper in American Documentation describing the precision and recall measures, as well as detailing a proposed "framework" for evaluating an IR system, which includes statistical sampling methods for determining the number of relevant documents not retrieved. 10400280 -> 1000004100290: 1958: The International Conference on Scientific Information in Washington, DC included consideration of IR systems as a solution to the problems identified.
10400290 -> 1000004100300: See: Proceedings of the International Conference on Scientific Information, 1958 (National Academy of Sciences, Washington, DC, 1959) 10400300 -> 1000004100310: 1959: Hans Peter Luhn published "Auto-encoding of documents for information retrieval." 10400310 -> 1000004100320: 1960: Melvin Earl (Bill) Maron and J. L. Kuhns published "On relevance, probabilistic indexing, and information retrieval" in Journal of the ACM 7(3):216-244, July 1960. 10400320 -> 1000004100330: Early 1960s: Gerard Salton began work on IR at Harvard, later moved to Cornell. 10400330 -> 1000004100340: 1962: Cyril W. Cleverdon published early findings of the Cranfield studies, developing a model for IR system evaluation. 10400340 -> 1000004100350: See: Cyril W. Cleverdon, "Report on the Testing and Analysis of an Investigation into the Comparative Efficiency of Indexing Systems". 10400350 -> 1000004100360: Cranfield Coll. of Aeronautics, Cranfield, England, 1962. 10400360 -> 1000004100370: 1962: Kent published Information Analysis and Retrieval 10400370 -> 1000004100380: 1963: Weinberg report "Science, Government and Information" gave a full articulation of the idea of a "crisis of scientific information." 10400380 -> 1000004100390: The report was named after Dr. Alvin Weinberg. 10400390 -> 1000004100400: 1963: Joseph Becker and Robert M. Hayes published text on information retrieval. 10400400 -> 1000004100410: Becker, Joseph; Hayes, Robert Mayo. 10400410 -> 1000004100420: Information storage and retrieval: tools, elements, theories. 10400420 -> 1000004100430: New York, Wiley (1963). 10400430 -> 1000004100440: 1964: Karen Spärck Jones finished her thesis at Cambridge, Synonymy and Semantic Classification, and continued work on computational linguistics as it applies to IR 10400440 -> 1000004100450: 1964: The National Bureau of Standards sponsored a symposium titled "Statistical Association Methods for Mechanized Documentation." 10400450 -> 1000004100460: Several highly significant papers, including G. Salton's first published reference (we believe) to the SMART system. 10400460 -> 1000004100470: Mid-1960s: National Library of Medicine developed MEDLARS Medical Literature Analysis and Retrieval System, the first major machine-readable database and batch retrieval system 10400470 -> 1000004100480: Mid-1960s: Project Intrex at MIT 10400480 -> 1000004100490: 1965: J. C. R. Licklider published Libraries of the Future 10400490 -> 1000004100500: 1966: Don Swanson was involved in studies at University of Chicago on Requirements for Future Catalogs 10400500 -> 1000004100510: 1968: Gerard Salton published Automatic Information Organization and Retrieval. 10400510 -> 1000004100520: 1968: J. W. Sammon's RADC Tech report "Some Mathematics of Information Storage and Retrieval..." outlined the vector model. 10400520 -> 1000004100530: 1969: Sammon's "A nonlinear mapping for data structure analysis" (IEEE Transactions on Computers) was the first proposal for visualization interface to an IR system. 10400530 -> 1000004100540: Late 1960s: F. W. Lancaster completed evaluation studies of the MEDLARS system and published the first edition of his text on information retrieval 10400540 -> 1000004100550: Early 1970s: first online systems--NLM's AIM-TWX, MEDLINE; Lockheed's Dialog; SDC's ORBIT 10400550 -> 1000004100560: Early 1970s: Theodor Nelson promoting concept of hypertext, published Computer Lib/Dream Machines 10400560 -> 1000004100570: 1971: N. Jardine and C. J. 
Van Rijsbergen published "The use of hierarchic clustering in information retrieval", which articulated the "cluster hypothesis." 10400570 -> 1000004100580: (Information Storage and Retrieval, 7(5), pp. 217-240, Dec 1971) 10400580 -> 1000004100590: 1975: Three highly influential publications by Salton fully articulated his vector processing framework and term discrimination model: 10400590 -> 1000004100600: A Theory of Indexing (Society for Industrial and Applied Mathematics) 10400600 -> 1000004100610: "A theory of term importance in automatic text analysis", (JASIS v. 26) 10400610 -> 1000004100620: "A vector space model for automatic indexing", (CACM 18:11) 10400620 -> 1000004100630: 1978: The First ACM SIGIR conference. 10400630 -> 1000004100640: 1979: C. J. Van Rijsbergen published Information Retrieval (Butterworths). 10400640 -> 1000004100650: Heavy emphasis on probabilistic models. 10400650 -> 1000004100660: 1980: First international ACM SIGIR conference, joint with the British Computer Society IR group in Cambridge. 10400660 -> 1000004100670: 1982: Belkin, Oddy, and Brooks proposed the ASK (Anomalous State of Knowledge) viewpoint for information retrieval. 10400670 -> 1000004100680: This was an important concept, though their automated analysis tool proved ultimately disappointing. 10400680 -> 1000004100690: 1983: Salton (and M. McGill) published Introduction to Modern Information Retrieval (McGraw-Hill), with heavy emphasis on vector models. 10400690 -> 1000004100700: Mid-1980s: Efforts to develop end user versions of commercial IR systems. 10400700 -> 1000004100710: 1985-1993: Key papers on, and experimental systems for, visualization interfaces. 10400710 -> 1000004100720: Work by D. B. Crouch, Robert R. Korfhage, M. Chalmers, A. Spoerri and others. 10400720 -> 1000004100730: 1989: First World Wide Web proposals by Tim Berners-Lee at CERN. 10400730 -> 1000004100740: 1992: First TREC conference. 10400740 -> 1000004100750: 1997: Publication of Korfhage's Information Storage and Retrieval, with emphasis on visualization and multi-reference point systems. 10400750 -> 1000004100760: Late 1990s: Web search engines implemented many features formerly found only in experimental IR systems. 10400760 -> 1000004100770: Overview 10400770 -> 1000004100780: An information retrieval process begins when a user enters a query into the system. 10400780 -> 1000004100790: Queries are formal statements of information needs, for example search strings in web search engines. 10400790 -> 1000004100800: In information retrieval a query does not uniquely identify a single object in the collection. 10400800 -> 1000004100810: Instead, several objects may match the query, perhaps with different degrees of relevancy. 10400810 -> 1000004100820: An object is an entity which keeps or stores information in a database. 10400820 -> 1000004100830: User queries are matched to objects stored in the database. 10400830 -> 1000004100840: Depending on the application, the data objects may be, for example, text documents, images or videos. 10400840 -> 1000004100850: Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates. 10400850 -> 1000004100860: Most IR systems compute a numeric score for how well each object in the database matches the query, and rank the objects according to this value. 10400860 -> 1000004100870: The top ranking objects are then shown to the user.
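As a toy illustration of the scoring-and-ranking step just described, the sketch below uses a simple term-overlap score; this scoring function and the small document collection are assumptions chosen for brevity, not a real system (production engines typically use weighting schemes such as TF-IDF or BM25).

```python
def score(query: str, document: str) -> float:
    """Toy relevance score: fraction of query terms that occur in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)

def rank(query: str, collection: dict) -> list:
    """Score every object in the collection and return their ids, best first."""
    scored = ((score(query, text), doc_id) for doc_id, text in collection.items())
    return [doc_id for s, doc_id in sorted(scored, reverse=True) if s > 0]

# Invented document surrogates standing in for a database of objects.
collection = {
    "d1": "information retrieval systems rank documents by relevance",
    "d2": "the history of the italian language",
    "d3": "query processing in web search engines",
}
print(rank("information retrieval query", collection))  # ['d1', 'd3']
```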
10400870 -> 1000004100880: The process may then be iterated if the user wishes to refine the query. 10400880 -> 1000004100890: Performance measures 10400890 -> 1000004100900: Many different measures for evaluating the performance of information retrieval systems have been proposed. 10400900 -> 1000004100910: The measures require a collection of documents and a query. 10400910 -> 1000004100920: All common measures described here assume a ground truth notion of relevancy: every document is known to be either relevant or non-relevant to a particular query. 10400920 -> 1000004100930: In practice queries may be ill-posed and there may be different shades of relevancy. 10400930 -> 1000004100940: Precision 10400940 -> 1000004100950: Precision is the fraction of the documents retrieved that are relevant to the user's information need. 10400950 -> 1000004100960: \mbox{precision}=\frac{|\{\mbox{relevant documents}\}\cap\{\mbox{retrieved documents}\}|}{|\{\mbox{retrieved documents}\}|} 10400960 -> 1000004100970: In binary classification, precision is analogous to positive predictive value. 10400970 -> 1000004100980: Precision takes all retrieved documents into account. 10400980 -> 1000004100990: It can also be evaluated at a given cut-off rank, considering only the topmost results returned by the system. 10400990 -> 1000004101000: This measure is called precision at n or P@n. 10401000 -> 1000004101010: Note that the meaning and usage of "precision" in the field of Information Retrieval differs from the definition of accuracy and precision within other branches of science and technology. 10401010 -> 1000004101020: Recall 10401020 -> 1000004101030: Recall is the fraction of the documents that are relevant to the query that are successfully retrieved. 10401030 -> 1000004101040: \mbox{recall}=\frac{|\{\mbox{relevant documents}\}\cap\{\mbox{retrieved documents}\}|}{|\{\mbox{relevant documents}\}|} 10401040 -> 1000004101050: In binary classification, recall is called sensitivity. 10401050 -> 1000004101060: So it can be looked at as the probability that a relevant document is retrieved by the query. 10401060 -> 1000004101070: It is trivial to achieve recall of 100% by returning all documents in response to any query. 10401070 -> 1000004101080: Therefore recall alone is not enough but one needs to measure the number of non-relevant documents also, for example by computing the precision. 10401080 -> 1000004101090: Fall-Out 10401090 -> 1000004101100: The proportion of non-relevant documents that are retrieved, out of all non-relevant documents available: 10401100 -> 1000004101110: \mbox{fall-out}=\frac{|\{\mbox{non-relevant documents}\}\cap\{\mbox{retrieved documents}\}|}{|\{\mbox{non-relevant documents}\}|} 10401110 -> 1000004101120: In binary classification, fall-out is closely related to specificity. 10401120 -> 1000004101130: More precisely: \mbox{fall-out}=1-\mbox{specificity}. 10401130 -> 1000004101140: It can be looked at as the probability that a non-relevant document is retrieved by the query. 10401140 -> 1000004101150: It is trivial to achieve fall-out of 0% by returning zero documents in response to any query. 
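The set-based measures above translate directly into code. The following sketch is illustrative only: the document identifiers are invented, and the f_beta helper anticipates the balanced F-measure defined in the next subsection.

```python
def precision(retrieved: set, relevant: set) -> float:
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved: set, relevant: set) -> float:
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def fall_out(retrieved: set, relevant: set, collection: set) -> float:
    non_relevant = collection - relevant
    return len(retrieved & non_relevant) / len(non_relevant) if non_relevant else 0.0

def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (see the next subsection)."""
    return (1 + beta**2) * p * r / (beta**2 * p + r) if (p + r) else 0.0

# Invented ground truth: which of ten documents are relevant to a query,
# and which documents the system actually retrieved.
collection = {f"d{i}" for i in range(10)}
relevant = {"d0", "d1", "d2", "d3"}
retrieved = {"d0", "d1", "d5"}

p, r = precision(retrieved, relevant), recall(retrieved, relevant)
print(p, r)                                       # 0.666..., 0.5
print(fall_out(retrieved, relevant, collection))  # 1/6 ≈ 0.1666...
print(f_beta(p, r))                               # balanced F1 ≈ 0.571
```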
10401150 -> 1000004101160: F-measure 10401160 -> 1000004101170: The weighted harmonic mean of precision and recall, the traditional F-measure or balanced F-score, is: 10401170 -> 1000004101180: F = 2 \cdot (\mathrm{precision} \cdot \mathrm{recall}) / (\mathrm{precision} + \mathrm{recall}).\, 10401180 -> 1000004101190: This is also known as the F_1 measure, because recall and precision are evenly weighted. 10401190 -> 1000004101200: The general formula for non-negative real \beta is: 10401200 -> 1000004101210: F_\beta = (1 + \beta^2) \cdot (\mathrm{precision} \cdot \mathrm{recall}) / (\beta^2 \cdot \mathrm{precision} + \mathrm{recall}).\, 10401210 -> 1000004101220: Two other commonly used F measures are the F_{2} measure, which weights recall twice as much as precision, and the F_{0.5} measure, which weights precision twice as much as recall. 10401220 -> 1000004101230: The F-measure was derived by van Rijsbergen (1979) so that F_\beta "measures the effectiveness of retrieval with respect to a user who attaches \beta times as much importance to recall as precision". 10401230 -> 1000004101240: It is based on van Rijsbergen's effectiveness measure E = 1-(1/(\alpha/P + (1-\alpha)/R)). 10401240 -> 1000004101250: Their relationship is F_\beta = 1 - E where \alpha=1/(\beta^2+1). 10401250 -> 1000004101260: Average precision 10401260 -> 1000004101270: The precision and recall measures above are based on the whole list of documents returned by the system. 10401270 -> 1000004101280: Average precision emphasizes returning more relevant documents earlier. 10401280 -> 1000004101290: It is the average of the precisions computed after truncating the list at each of the relevant documents in turn: 10401290 -> 1000004101300: \operatorname{AveP} = \frac{\sum_{r=1}^N (P(r) \times \mathrm{rel}(r))}{\mbox{number of relevant documents}} \! 10401300 -> 1000004101310: where r is the rank, N the number of documents retrieved, rel() a binary function on the relevance of a given rank, and P() the precision at a given cut-off rank. 10401310 -> 1000004101320: Model types 10401320 -> None: 10401325 -> 1000004101330: For information retrieval to be efficient, the documents are typically transformed into a suitable representation. 10401330 -> 1000004101340: There are several representations. 10401340 -> 1000004101350: The picture on the right illustrates the relationship of some common models. 10401350 -> 1000004101360: In the picture, the models are categorized according to two dimensions: the mathematical basis and the properties of the model. 10401360 -> 1000004101370: First dimension: mathematical basis 10401370 -> 1000004101380: Set-theoretic models represent documents as sets of words or phrases. 10401380 -> 1000004101390: Similarities are usually derived from set-theoretic operations on those sets. 10401390 -> 1000004101400: Common models are: 10401400 -> 1000004101410: Standard Boolean model 10401410 -> 1000004101420: Extended Boolean model 10401420 -> 1000004101430: Fuzzy retrieval 10401430 -> 1000004101440: Algebraic models represent documents and queries usually as vectors, matrices or tuples. 10401440 -> 1000004101450: The similarity of the query vector and document vector is represented as a scalar value.
10401450 -> 1000004101460: Vector space model 10401460 -> 1000004101470: Generalized vector space model 10401470 -> 1000004101480: Topic-based vector space model 10401480 -> 1000004101490: Extended Boolean model 10401490 -> 1000004101500: Enhanced topic-based vector space model 10401500 -> 1000004101510: Latent semantic indexing (also known as latent semantic analysis) 10401510 -> 1000004101520: Probabilistic models treat the process of document retrieval as a probabilistic inference. 10401520 -> 1000004101530: Similarities are computed as probabilities that a document is relevant for a given query. 10401530 -> 1000004101540: Probabilistic theorems such as Bayes' theorem are often used in these models. 10401540 -> 1000004101550: Binary independence retrieval 10401550 -> 1000004101560: Probabilistic relevance model (BM25) 10401560 -> 1000004101570: Uncertain inference 10401570 -> 1000004101580: Language models 10401580 -> 1000004101590: Divergence-from-randomness model 10401590 -> 1000004101600: Latent Dirichlet allocation 10401600 -> 1000004101610: Second dimension: properties of the model 10401610 -> 1000004101620: Models without term interdependencies treat different terms/words as independent. 10401620 -> 1000004101630: This is usually represented in vector space models by the orthogonality assumption for term vectors, and in probabilistic models by an independence assumption for term variables. 10401630 -> 1000004101640: Models with immanent term interdependencies allow a representation of interdependencies between terms. 10401640 -> 1000004101650: However, the degree of interdependency between two terms is defined by the model itself. 10401650 -> 1000004101660: It is usually directly or indirectly derived (e.g. by dimensional reduction) from the co-occurrence of those terms in the whole set of documents. 10401660 -> 1000004101670: Models with transcendent term interdependencies also allow a representation of interdependencies between terms, but they do not specify how the interdependency between two terms is defined. 10401670 -> 1000004101680: They rely on an external source for the degree of interdependency between two terms. 10401680 -> 1000004101690: (For example, a human or sophisticated algorithms.) 10401690 -> None: Major figures 10401700 -> None: Gerard Salton 10401710 -> None: Hans Peter Luhn 10401720 -> None: W. Bruce Croft 10401730 -> None: Karen Spärck Jones 10401740 -> None: C. J. van Rijsbergen 10401750 -> None: Stephen E. Robertson 10401760 -> 1000004101700: Awards in the field 10401770 -> 1000004101710: Tony Kent Strix award 10401780 -> 1000004101720: Gerard Salton Award Information theory 10410010 -> 1000004200020: Information theory 10410020 -> 1000004200030: Information theory is a branch of applied mathematics and electrical engineering involving the quantification of information. 10410030 -> 1000004200040: Historically, information theory was developed to find fundamental limits on compressing and reliably communicating data. 10410040 -> 1000004200050: Since its inception it has broadened to find applications in many other areas, including statistical inference, natural language processing, cryptography generally, networks other than communication networks -- as in neurobiology, the evolution and function of molecular codes, model selection in ecology, thermal physics, quantum computing, plagiarism detection and other forms of data analysis.
10410050 -> 1000004200060: A key measure of information in the theory is known as information entropy, which is usually expressed by the average number of bits needed for storage or communication. 10410060 -> 1000004200070: Intuitively, entropy quantifies the uncertainty involved when encountering a random variable. 10410070 -> 1000004200080: For example, a fair coin flip (2 equally likely outcomes) will have less entropy than a roll of a die (6 equally likely outcomes). 10410080 -> 1000004200090: Applications of fundamental topics of information theory include lossless data compression (e.g. ZIP files), lossy data compression (e.g. MP3s), and channel coding (e.g. for DSL lines). 10410110 -> 1000004200100: The field is at the intersection of mathematics, statistics, computer science, physics, neurobiology, and electrical engineering. 10410120 -> 1000004200110: Its impact has been crucial to the success of the Voyager missions to deep space, the invention of the CD, the feasibility of mobile phones, the development of the Internet, the study of linguistics and of human perception, the understanding of black holes, and numerous other fields. 10410130 -> 1000004200120: Important sub-fields of information theory are source coding, channel coding, algorithmic complexity theory, algorithmic information theory, and measures of information. 10410140 -> 1000004200130: Overview 10410150 -> 1000004200140: The main concepts of information theory can be grasped by considering the most widespread means of human communication: language. 10410160 -> 1000004200150: Two important aspects of a good language are as follows: First, the most common words (e.g., "a", "the", "I") should be shorter than less common words (e.g., "benefit", "generation", "mediocre"), so that sentences will not be too long. 10410170 -> 1000004200160: Such a tradeoff in word length is analogous to data compression and is the essential aspect of source coding. 10410180 -> 1000004200170: Second, if part of a sentence is unheard or misheard due to noise -— e.g., a passing car -— the listener should still be able to glean the meaning of the underlying message. 10410190 -> 1000004200180: Such robustness is as essential for an electronic communication system as it is for a language; properly building such robustness into communications is done by channel coding. 10410200 -> 1000004200190: Source coding and channel coding are the fundamental concerns of information theory. 10410210 -> 1000004200200: Note that these concerns have nothing to do with the importance of messages. 10410220 -> 1000004200210: For example, a platitude such as "Thank you; come again" takes about as long to say or write as the urgent plea, "Call an ambulance!" while clearly the latter is more important and more meaningful. 10410230 -> 1000004200220: Information theory, however, does not consider message importance or meaning, as these are matters of the quality of data rather than the quantity and readability of data, the latter of which is determined solely by probabilities. 10410240 -> 1000004200230: Information theory is generally considered to have been founded in 1948 by Claude Shannon in his seminal work, "A Mathematical Theory of Communication." 10410250 -> 1000004200240: The central paradigm of classical information theory is the engineering problem of the transmission of information over a noisy channel. 
10410260 -> 1000004200250: The most fundamental results of this theory are Shannon's source coding theorem, which establishes that, on average, the number of bits needed to represent the result of an uncertain event is given by its entropy; and Shannon's noisy-channel coding theorem, which states that reliable communication is possible over noisy channels provided that the rate of communication is below a certain threshold called the channel capacity. 10410270 -> 1000004200260: The channel capacity can be approached in practice by using appropriate encoding and decoding systems. 10410280 -> 1000004200270: Information theory is closely associated with a collection of pure and applied disciplines that have been investigated and reduced to engineering practice under a variety of rubrics throughout the world over the past half century or more: adaptive systems, anticipatory systems, artificial intelligence, complex systems, complexity science, cybernetics, informatics, machine learning, along with systems sciences of many descriptions. 10410290 -> 1000004200280: Information theory is a broad and deep mathematical theory, with equally broad and deep applications, amongst which is the vital field of coding theory. 10410300 -> 1000004200290: Coding theory is concerned with finding explicit methods, called codes, of increasing the efficiency and reducing the net error rate of data communication over a noisy channel to near the limit that Shannon proved is the maximum possible for that channel. 10410310 -> 1000004200300: These codes can be roughly subdivided into data compression (source coding) and error-correction (channel coding) techniques. 10410320 -> 1000004200310: In the latter case, it took many years to find the methods Shannon's work proved were possible. 10410330 -> 1000004200320: A third class of information theory codes are cryptographic algorithms (both codes and ciphers). 10410340 -> 1000004200330: Concepts, methods and results from coding theory and information theory are widely used in cryptography and cryptanalysis. 10410350 -> 1000004200340: See the article ban (information) for a historical application. 10410360 -> 1000004200350: Information theory is also used in information retrieval, intelligence gathering, gambling, statistics, and even in musical composition. 10410370 -> 1000004200360: Historical background 10410380 -> 1000004200370: The landmark event that established the discipline of information theory, and brought it to immediate worldwide attention, was the publication of Claude E. Shannon's classic paper "A Mathematical Theory of Communication" in the Bell System Technical Journal in July and October of 1948. 10410390 -> 1000004200380: Prior to this paper, limited information theoretic ideas had been developed at Bell Labs, all implicitly assuming events of equal probability. 10410400 -> 1000004200390: Harry Nyquist's 1924 paper, Certain Factors Affecting Telegraph Speed, contains a theoretical section quantifying "intelligence" and the "line speed" at which it can be transmitted by a communication system, giving the relation W = K \log m, where W is the speed of transmission of intelligence, m is the number of different voltage levels to choose from at each time step, and K is a constant. 
10410410 -> 1000004200400: Ralph Hartley's 1928 paper, Transmission of Information, uses the word information as a measurable quantity, reflecting the receiver's ability to distinguish that one sequence of symbols from any other, thus quantifying information as H = \log S^n = n \log S, where S was the number of possible symbols, and n the number of symbols in a transmission. 10410420 -> 1000004200410: The natural unit of information was therefore the decimal digit, much later renamed the hartley in his honour as a unit or scale or measure of information. 10410430 -> 1000004200420: Alan Turing in 1940 used similar ideas as part of the statistical analysis of the breaking of the German second world war Enigma ciphers. 10410440 -> 1000004200430: Much of the mathematics behind information theory with events of different probabilities was developed for the field of thermodynamics by Ludwig Boltzmann and J. Willard Gibbs. 10410450 -> 1000004200440: Connections between information-theoretic entropy and thermodynamic entropy, including the important contributions by Rolf Landauer in the 1960s, are explored in Entropy in thermodynamics and information theory. 10410460 -> 1000004200450: In Shannon's revolutionary and groundbreaking paper, the work for which had been substantially completed at Bell Labs by the end of 1944, Shannon for the first time introduced the qualitative and quantitative model of communication as a statistical process underlying information theory, opening with the assertion that 10410470 -> 1000004200460: "The fundamental problem of communication is that of reproducing at one point, either exactly or approximately, a message selected at another point." 10410480 -> 1000004200470: With it came the ideas of 10410490 -> 1000004200480: the information entropy and redundancy of a source, and its relevance through the source coding theorem; 10410500 -> 1000004200490: the mutual information, and the channel capacity of a noisy channel, including the promise of perfect loss-free communication given by the noisy-channel coding theorem; 10410510 -> 1000004200500: the practical result of the Shannon–Hartley law for the channel capacity of a Gaussian channel; and of course 10410520 -> 1000004200510: the bit—a new way of seeing the most fundamental unit of information 10410530 -> 1000004200520: Ways of measuring information 10410540 -> 1000004200530: Information theory is based on probability theory and statistics. 10410550 -> 1000004200540: The most important quantities of information are entropy, the information in a random variable, and mutual information, the amount of information in common between two random variables. 10410560 -> 1000004200550: The former quantity indicates how easily message data can be compressed while the latter can be used to find the communication rate across a channel. 10410570 -> 1000004200560: The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. 10410580 -> 1000004200570: The most common unit of information is the bit, based on the binary logarithm. 10410590 -> 1000004200580: Other units include the nat, which is based on the natural logarithm, and the hartley, which is based on the common logarithm. 10410600 -> 1000004200590: In what follows, an expression of the form p \log p \, is considered by convention to be equal to zero whenever p=0. 10410605 -> 1000004200600: This is justified because \lim_{p \rightarrow 0+} p \log p = 0 for any logarithmic base. 
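A small sketch of these conventions, assuming nothing beyond the definitions above: the p log p term is taken to be zero at p = 0, and the choice of logarithm base selects whether the result is reported in bits, nats or hartleys (here for the fair-coin example from the overview). The helper name plogp is invented for this illustration.

```python
import math

def plogp(p: float, base: float = 2.0) -> float:
    """Return p * log(p) with the convention that the term is 0 when p == 0."""
    return 0.0 if p == 0 else p * math.log(p, base)

# One fair coin flip: H = 1 bit = ln 2 nats = log10 2 hartleys.
probs = [0.5, 0.5]
h_bits     = -sum(plogp(p, 2) for p in probs)
h_nats     = -sum(plogp(p, math.e) for p in probs)
h_hartleys = -sum(plogp(p, 10) for p in probs)
print(h_bits, h_nats, h_hartleys)  # 1.0, 0.693..., 0.301...
```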
10410610 -> 1000004200610: Entropy 10410620 -> 1000004200620: The entropy, H, of a discrete random variable X is a measure of the amount of uncertainty associated with the value of X. 10410630 -> 1000004200630: Suppose one transmits 1000 bits (0s and 1s). 10410640 -> 1000004200640: If these bits are known ahead of transmission (to be a certain value with absolute probability), logic dictates that no information has been transmitted. 10410650 -> 1000004200650: If, however, each is equally and independently likely to be 0 or 1, 1000 bits (in the information theoretic sense) have been transmitted. 10410660 -> 1000004200660: Between these two extremes, information can be quantified as follows. 10410670 -> 1000004200670: If \mathbb{X}\, is the set of all messages x that X could be, and p(x) is the probability that X takes the value x, then the entropy of X is defined as: 10410680 -> 1000004200680: H(X) = \mathbb{E}_{X} [I(x)] = -\sum_{x \in \mathbb{X}} p(x) \log p(x). 10410690 -> 1000004200690: (Here, I(x) is the self-information, which is the entropy contribution of an individual message.) 10410700 -> 1000004200700: An important property of entropy is that it is maximized when all the messages in the message space are equiprobable—i.e., most unpredictable—in which case H(X) = \log |\mathbb{X}|. 10410710 -> 1000004200710: The special case of information entropy for a random variable with two outcomes is the binary entropy function: 10410720 -> 1000004200720: H_\mbox{b}(p) = - p \log p - (1-p)\log (1-p).\, 10410730 -> 1000004200730: Joint entropy 10410740 -> 1000004200740: The joint entropy of two discrete random variables X and Y is merely the entropy of their pairing: (X, Y). 10410750 -> 1000004200750: This implies that if X and Y are independent, then their joint entropy is the sum of their individual entropies. 10410760 -> 1000004200760: For example, if (X,Y) represents the position of a chess piece (X the row and Y the column), then the joint entropy of the row of the piece and the column of the piece will be the entropy of the position of the piece. 10410770 -> 1000004200770: H(X, Y) = \mathbb{E}_{X,Y} [-\log p(x,y)] = - \sum_{x, y} p(x, y) \log p(x, y) \, 10410780 -> 1000004200780: Despite similar notation, joint entropy should not be confused with cross entropy. 10410790 -> 1000004200790: Conditional entropy (equivocation) 10410800 -> 1000004200800: The conditional entropy or conditional uncertainty of X given random variable Y (also called the equivocation of X about Y) is the average conditional entropy over Y: 10410810 -> 1000004200810: H(X|Y) = \mathbb E_Y [H(X|y)] = -\sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log p(x|y) = -\sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(y)}. 10410820 -> 1000004200820: Because entropy can be conditioned on a random variable or on that random variable being a certain value, care should be taken not to confuse these two definitions of conditional entropy, the former of which is in more common use. 10410830 -> 1000004200830: A basic property of this form of conditional entropy is that: 10410840 -> 1000004200840: H(X|Y) = H(X,Y) - H(Y) .\, 10410850 -> 1000004200850: Mutual information (transinformation) 10410860 -> 1000004200860: Mutual information measures the amount of information that can be obtained about one random variable by observing another. 10410870 -> 1000004200870: It is important in communication, where it can be used to maximize the amount of information shared between sent and received signals.
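Before the mutual-information formula given next, the following sketch puts numbers to the entropy, joint-entropy and conditional-entropy definitions above and checks the identity H(X|Y) = H(X,Y) - H(Y). The joint distribution is invented purely for illustration.

```python
import math
from collections import defaultdict

def H(dist):
    """Entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Invented joint distribution p(x, y) over two binary variables.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals obtained by summing the joint distribution.
p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in p_xy.items():
    p_x[x] += p
    p_y[y] += p

h_x, h_y, h_xy = H(p_x), H(p_y), H(p_xy)
# Conditional entropy H(X|Y) computed directly from its definition.
h_x_given_y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)

print(round(h_x, 4), round(h_y, 4), round(h_xy, 4))  # 1.0 1.0 1.7219 (approx.)
print(round(h_x_given_y, 4), round(h_xy - h_y, 4))   # both ≈ 0.7219, i.e. H(X|Y) = H(X,Y) - H(Y)
```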
10410880 -> 1000004200880: The mutual information of X relative to Y is given by: 10410890 -> 1000004200890: I(X;Y) = \mathbb{E}_{X,Y} [SI(x,y)] = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\, p(y)} 10410900 -> 1000004200900: where SI (specific mutual information) is the pointwise mutual information. 10410910 -> 1000004200910: A basic property of the mutual information is that 10410920 -> 1000004200920: I(X;Y) = H(X) - H(X|Y).\, 10410930 -> 1000004200930: That is, knowing Y, we can save an average of I(X; Y) bits in encoding X compared to not knowing Y. 10410940 -> 1000004200940: Mutual information is symmetric: 10410950 -> 1000004200950: I(X;Y) = I(Y;X) = H(X) + H(Y) - H(X,Y).\, 10410960 -> 1000004200960: Mutual information can be expressed as the average Kullback–Leibler divergence (information gain) between the posterior probability distribution of X given the value of Y and the prior distribution on X: 10410970 -> 1000004200970: I(X;Y) = \mathbb E_{p(y)} [D_{\mathrm{KL}}( p(X|Y=y) \| p(X) )]. 10410980 -> 1000004200980: In other words, this is a measure of how much, on average, the probability distribution on X will change if we are given the value of Y. 10410990 -> 1000004200990: This is often recalculated as the divergence from the product of the marginal distributions to the actual joint distribution: 10411000 -> 1000004201000: I(X; Y) = D_{\mathrm{KL}}(p(X,Y) \| p(X)p(Y)). 10411010 -> 1000004201010: Mutual information is closely related to the log-likelihood ratio test in the context of contingency tables and the multinomial distribution, and to Pearson's χ2 test: mutual information can be considered a statistic for assessing independence between a pair of variables, and it has a well-specified asymptotic distribution. 10411020 -> 1000004201020: Kullback–Leibler divergence (information gain) 10411030 -> 1000004201030: The Kullback–Leibler divergence (or information divergence, information gain, or relative entropy) is a way of comparing two distributions: a "true" probability distribution p(X), and an arbitrary probability distribution q(X). 10411040 -> 1000004201040: If we compress data in a manner that assumes q(X) is the distribution underlying some data, when, in reality, p(X) is the correct distribution, the Kullback–Leibler divergence is the average number of additional bits per datum necessary for compression. 10411050 -> 1000004201050: It is thus defined as 10411060 -> 1000004201060: D_{\mathrm{KL}}(p(X) \| q(X)) = \sum_{x \in X} -p(x) \log {q(x)} \, - \, \left( -p(x) \log {p(x)}\right) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}. 10411070 -> 1000004201070: Although it is sometimes used as a 'distance metric', it is not a true metric since it is not symmetric and does not satisfy the triangle inequality (making it a semi-quasimetric). 10411080 -> 1000004201080: Other quantities 10411090 -> 1000004201090: Other important information theoretic quantities include Rényi entropy (a generalization of entropy) and differential entropy (a generalization of quantities of information to continuous distributions). 10411100 -> 1000004201100: Coding theory 10411110 -> 1000004201110: Coding theory is one of the most important and direct applications of information theory. 10411120 -> 1000004201120: It can be subdivided into source coding theory and channel coding theory. 10411130 -> 1000004201130: Using a statistical description for data, information theory quantifies the number of bits needed to describe the data, which is the information entropy of the source.
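Before the source- and channel-coding formulations that follow, the snippet below numerically checks two of the mutual-information statements above, reusing the toy joint distribution from the previous sketch: the defining sum, and its expression as the Kullback–Leibler divergence between the joint distribution and the product of the marginals. The values are illustrative only.

```python
import math

def kl(p, q):
    """Kullback–Leibler divergence D_KL(p || q) in bits, for dicts over the same outcomes."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

# Invented joint distribution p(x, y) over two binary variables (as in the previous sketch).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.5, 1: 0.5}

# Mutual information from its defining sum ...
mi_def = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items() if p > 0)

# ... and as the divergence between the joint and the product of the marginals.
product = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}
mi_kl = kl(p_xy, product)

print(round(mi_def, 4), round(mi_kl, 4))  # both ≈ 0.2781, so I(X;Y) = D_KL(p(X,Y) || p(X)p(Y))
```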
10411140 -> 1000004201140: Data compression (source coding): There are two formulations for the compression problem: 10411150 -> 1000004201150: lossless data compression: the data must be reconstructed exactly; 10411160 -> 1000004201160: lossy data compression: allocates bits needed to reconstruct the data, within a specified fidelity level measured by a distortion function. 10411170 -> 1000004201170: This subset of Information theory is called rate–distortion theory. 10411180 -> 1000004201180: Error-correcting codes (channel coding): While data compression removes as much redundancy as possible, an error correcting code adds just the right kind of redundancy (i.e. error correction) needed to transmit the data efficiently and faithfully across a noisy channel. 10411190 -> 1000004201190: This division of coding theory into compression and transmission is justified by the information transmission theorems, or source–channel separation theorems that justify the use of bits as the universal currency for information in many contexts. 10411200 -> 1000004201200: However, these theorems only hold in the situation where one transmitting user wishes to communicate to one receiving user. 10411210 -> 1000004201210: In scenarios with more than one transmitter (the multiple-access channel), more than one receiver (the broadcast channel) or intermediary "helpers" (the relay channel), or more general networks, compression followed by transmission may no longer be optimal. 10411220 -> 1000004201220: Network information theory refers to these multi-agent communication models. 10411230 -> 1000004201230: Source theory 10411240 -> 1000004201240: Any process that generates successive messages can be considered a source of information. 10411250 -> 1000004201250: A memoryless source is one in which each message is an independent identically-distributed random variable, whereas the properties of ergodicity and stationarity impose more general constraints. 10411260 -> 1000004201260: All such sources are stochastic. 10411270 -> 1000004201270: These terms are well studied in their own right outside information theory. 10411280 -> 1000004201280: Rate 10411290 -> 1000004201290: Information rate is the average entropy per symbol. 10411300 -> 1000004201300: For memoryless sources, this is merely the entropy of each symbol, while, in the case of a stationary stochastic process, it is 10411310 -> 1000004201310: r = \lim_{n \to \infty} H(X_n|X_{n-1},X_{n-2},X_{n-3}, \ldots); 10411320 -> 1000004201320: that is, the conditional entropy of a symbol given all the previous symbols generated. 10411330 -> 1000004201330: For the more general case of a process that is not necessarily stationary, the average rate is 10411340 -> 1000004201340: r = \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \dots X_n); 10411350 -> 1000004201350: that is, the limit of the joint entropy per symbol. 10411360 -> 1000004201360: For stationary sources, these two expressions give the same result. 10411370 -> 1000004201370: It is common in information theory to speak of the "rate" or "entropy" of a language. 10411380 -> 1000004201380: This is appropriate, for example, when the source of information is English prose. 10411390 -> 1000004201390: The rate of a source of information is related to its redundancy and how well it can be compressed, the subject of source coding. 10411400 -> 1000004201400: Channel capacity 10411410 -> 1000004201410: Communications over a channel—such as an ethernet wire—is the primary motivation of information theory. 
10411420 -> 1000004201420: As anyone who's ever used a telephone (mobile or landline) knows, however, such channels often fail to produce exact reconstruction of a signal; noise, periods of silence, and other forms of signal corruption often degrade quality. 10411430 -> 1000004201430: How much information can one hope to communicate over a noisy (or otherwise imperfect) channel? 10411440 -> 1000004201440: Consider the communications process over a discrete channel. 10411450 -> 1000004201450: A simple model of the process is the following: 10411460 -> 1000004201460: Here X represents the space of messages transmitted, and Y the space of messages received during a unit time over our channel. 10411470 -> 1000004201470: Let p(y|x) be the conditional probability distribution function of Y given X. 10411480 -> 1000004201480: We will consider p(y|x) to be an inherent fixed property of our communications channel (representing the nature of the noise of our channel). 10411490 -> 1000004201490: Then the joint distribution of X and Y is completely determined by our channel and by our choice of f(x), the marginal distribution of messages we choose to send over the channel. 10411500 -> 1000004201500: Under these constraints, we would like to maximize the rate of information, or the signal, we can communicate over the channel. 10411510 -> 1000004201510: The appropriate measure for this is the mutual information, and this maximum mutual information is called the channel capacity and is given by: 10411520 -> 1000004201520: C = \max_{f} I(X;Y).\! 10411530 -> 1000004201530: This capacity has the following property related to communicating at information rate R (where R is usually bits per symbol). 10411540 -> 1000004201540: For any information rate R < C and coding error ε > 0, for large enough N, there exists a code of length N and rate ≥ R and a decoding algorithm, such that the maximal probability of block error is ≤ ε; that is, it is always possible to transmit with arbitrarily small block error. 10411550 -> 1000004201550: In addition, for any rate R > C, it is impossible to transmit with arbitrarily small block error. 10411560 -> 1000004201560: Channel coding is concerned with finding such nearly optimal codes that can be used to transmit data over a noisy channel with a small coding error at a rate near the channel capacity. 10411570 -> 1000004201570: Channel capacity of particular model channels 10411580 -> 1000004201580: A continuous-time analog communications channel subject to Gaussian noise — see Shannon–Hartley theorem. 10411590 -> 1000004201590: A binary symmetric channel (BSC) with crossover probability p is a binary input, binary output channel that flips the input bit with probability p. 10411600 -> 1000004201600: The BSC has a capacity of 1 - H_\mbox{b}(p) bits per channel use, where H_\mbox{b} is the binary entropy function defined above. 10411610 -> None: 10411620 -> 1000004201610: A binary erasure channel (BEC) with erasure probability p is a binary input, ternary output channel. 10411630 -> 1000004201620: The possible channel outputs are 0, 1, and a third symbol 'e' called an erasure. 10411640 -> 1000004201630: The erasure represents complete loss of information about an input bit. 10411650 -> 1000004201640: The capacity of the BEC is 1 - p bits per channel use. 10411660 -> None: 10411670 -> 1000004201650: Applications to other fields 10411680 -> 1000004201660: Intelligence uses and secrecy applications 10411690 -> 1000004201670: Information theoretic concepts apply to cryptography and cryptanalysis.
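As a numeric illustration of the two model-channel capacities given just above, the sketch below evaluates 1 - H_b(p) for a binary symmetric channel and 1 - p for a binary erasure channel; the probability values are arbitrary example inputs.

```python
import math

def binary_entropy(p: float) -> float:
    """H_b(p) = -p log2 p - (1-p) log2 (1-p), with 0 log 0 taken as 0."""
    return -sum(x * math.log2(x) for x in (p, 1 - p) if x > 0)

def bsc_capacity(p: float) -> float:
    """Capacity of a binary symmetric channel with crossover probability p."""
    return 1 - binary_entropy(p)

def bec_capacity(p: float) -> float:
    """Capacity of a binary erasure channel with erasure probability p."""
    return 1 - p

print(bsc_capacity(0.11))  # ≈ 0.5 bits per channel use
print(bec_capacity(0.25))  # 0.75 bits per channel use
```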
10411700 -> 1000004201680: Turing's information unit, the ban, was used in the Ultra project, breaking the German Enigma machine code and hastening the end of WWII in Europe. 10411710 -> 1000004201690: Shannon himself defined an important concept now called the unicity distance. 10411720 -> 1000004201700: Based on the redundancy of the plaintext, it attempts to give a minimum amount of ciphertext necessary to ensure unique decipherability. 10411730 -> 1000004201710: Information theory leads us to believe it is much more difficult to keep secrets than it might first appear. 10411740 -> 1000004201720: A brute force attack can break systems based on asymmetric key algorithms or on most commonly used methods of symmetric key algorithms (sometimes called secret key algorithms), such as block ciphers. 10411750 -> 1000004201730: The security of all such methods currently comes from the assumption that no known attack can break them in a practical amount of time. 10411760 -> 1000004201740: Information theoretic security refers to methods such as the one-time pad that are not vulnerable to such brute force attacks. 10411770 -> 1000004201750: In such cases, the positive conditional mutual information between the plaintext and ciphertext (conditioned on the key) can ensure proper transmission, while the unconditional mutual information between the plaintext and ciphertext remains zero, resulting in absolutely secure communications. 10411780 -> 1000004201760: In other words, an eavesdropper would not be able to improve his or her guess of the plaintext by gaining knowledge of the ciphertext but not of the key. 10411790 -> 1000004201770: However, as in any other cryptographic system, care must be used to correctly apply even information-theoretically secure methods; the Venona project was able to crack the one-time pads of the Soviet Union due to their improper reuse of key material. 10411800 -> 1000004201780: Pseudorandom number generation 10411810 -> 1000004201790: Pseudorandom number generators are widely available in computer language libraries and application programs. 10411820 -> 1000004201800: They are, almost universally, unsuited to cryptographic use as they do not evade the deterministic nature of modern computer equipment and software. 10411830 -> 1000004201810: A class of improved random number generators is termed cryptographically secure pseudorandom number generators, but even they require random seeds external to the software in order to work as intended. 10411840 -> 1000004201820: These can be obtained via extractors, if done carefully. 10411850 -> 1000004201830: The measure of sufficient randomness in extractors is min-entropy, a value related to Shannon entropy through Rényi entropy; Rényi entropy is also used in evaluating randomness in cryptographic systems. 10411860 -> 1000004201840: Although related, the distinctions among these measures mean that a random variable with high Shannon entropy is not necessarily satisfactory for use in an extractor, and so for cryptographic uses. 10411870 -> 1000004201850: Miscellaneous applications 10411880 -> 1000004201860: Information theory also has applications in gambling and investing, black holes, bioinformatics, and music. Italian language 10420010 -> 1000004300020: Italian language 10420020 -> 1000004300030: Italian (italiano, or lingua italiana) is a Romance language spoken as a first language by about 63 million people, primarily in Italy. 10420030 -> 1000004300040: In Switzerland, Italian is one of four official languages.
10420040 -> 1000004300050: It is also the official language of San Marino. 10420050 -> 1000004300060: It is the primary language of the Vatican City. 10420060 -> 1000004300070: Standard Italian, adopted by the state after the unification of Italy, is based on Tuscan and is somewhat intermediate between Italo-Dalmatian languages of the South and Northern Italian dialects of the North. 10420070 -> 1000004300080: Unlike most other Romance languages, Italian has retained the contrast between short and long consonants which existed in Latin. 10420080 -> 1000004300090: As in most Romance languages, stress is distinctive. 10420090 -> 1000004300100: Of the Romance languages, Italian is considered to be one of the closest resembling Latin in terms of vocabulary. 10420100 -> 1000004300110: According to Ethnologue, lexical similarity is 89% with French, 87% with Catalan, 85% with Sardinian, 82% with Spanish, 78% with Rheto-Romance, and 77% with Romanian. 10420110 -> 1000004300120: It is affectionately called il parlar gentile (the gentle language) by its speakers. 10420120 -> 1000004300130: Writing system 10420130 -> 1000004300140: Italian is written using the Latin alphabet. 10420140 -> 1000004300150: The letters J, K, W, X and Y are not considered part of the standard Italian alphabet, but appear in loanwords (such as jeans, whisky, taxi). 10420150 -> 1000004300160: X has become a commonly used letter in genuine Italian words with the prefix extra-. 10420160 -> 1000004300170: J in Italian is an old-fashioned orthographic variant of I, appearing in the first name "Jacopo" as well as in some Italian place names, e.g., the towns of Bajardo, Bojano, Joppolo, Jesolo, Jesi, among numerous others, and in the alternate spelling Mar Jonio (also spelled Mar Ionio) for the Ionian Sea. 10420170 -> 1000004300180: J may also appear in many words from different dialects, but its use is discouraged in contemporary Italian, and it is not part of the standard 21-letter contemporary Italian alphabet. 10420180 -> 1000004300190: Each of these foreign letters had an Italian equivalent spelling: gi for j, c or ch for k, u or v for w (depending on what sound it makes), s, ss, or cs for x, and i for y. 10420190 -> 1000004300200: Italian uses the acute accent over the letter E (as in perché, why/because) to indicate a front mid-close vowel, and the grave accent (as in tè, tea) to indicate a front mid-open vowel. 10420200 -> 1000004300210: The grave accent is also used on letters A, I, O, and U to mark stress when it falls on the final vowel of a word (for instance gioventù, youth). 10420210 -> 1000004300220: Typically, the penultimate syllable is stressed. 10420220 -> 1000004300230: If syllables other than the last one are stressed, the accent is not mandatory, unlike in Spanish, and, in virtually all cases, it is omitted. 10420230 -> 1000004300240: In some cases, when the word is ambiguous (as principi), the accent mark is sometimes used in order to disambiguate its meaning (in this case, prìncipi, princes, or princìpi, principles). 10420240 -> 1000004300250: This is, however, not compulsory. 10420250 -> 1000004300260: Rare words with three or more syllables can confuse Italians themselves, and the pronunciation of Istanbul is a common example of a word in which placement of stress is not clearly established. 10420260 -> 1000004300270: Turkish, like French, tends to put the accent on ultimate syllable, but Italian doesn't. 10420270 -> 1000004300280: So we can hear "Istànbul" or "Ìstanbul". 
10420280 -> 1000004300290: Another instance is the American State of Florida: the correct way to pronounce it in Italian is like in Spanish, "Florìda", but since there is an Italian word meaning the same ("flourishing"), "flòrida", and because of the influence of English, most Italians pronounce it that way. 10420290 -> 1000004300300: Dictionaries give the latter as an alternative pronunciation. 10420300 -> 1000004300310: The letter H at the beginning of a word is used to distinguish ho, hai, ha, hanno (present indicative of avere, 'to have') from o ('or'), ai ('to the'), a ('to'), anno ('year'). 10420310 -> 1000004300320: In the spoken language this letter is always silent for the cases given above. 10420320 -> 1000004300330: H is also used in combinations with other letters (see below), but no phoneme {(IPA+[h]+[h])} exists in Italian. 10420330 -> 1000004300340: In foreign words entered in common use, like "hotel" or "hovercraft", the H is commonly silent, so they are pronounced as {(IPA+/oˈtɛl/+/oˈtɛl/)} and {(IPA+/ˈɔverkraft/+/ˈɔverkraft/)} 10420340 -> 1000004300350: The letter Z represents {(IPA+/ʣ/+/ʣ/)}, for example: Zanzara {(IPA+/dzan'dzaɾa/+/dzan'dzaɾa/)} (mosquito), or {(IPA+/ʦ/+/ʦ/)}, for example: Nazione {(IPA+/naˈttsjone/+/naˈttsjone/)} (nation), depending on context, though there are few minimal pairs. 10420350 -> 1000004300360: The same goes for S, which can represent {(IPA+/s/+/s/)} or {(IPA+/z/+/z/)}. 10420360 -> 1000004300370: However, these two phonemes are in complementary distribution everywhere except between two vowels in the same word, and even in such environment there are extremely few minimal pairs, so that this distinction is being lost in many varieties. 10420370 -> 1000004300380: The letters C and G represent affricates: {(IPA+/ʧ/+/ʧ/)} as in "chair" and {(IPA+/ʤ/+/ʤ/)} as in "gem", respectively, before the front vowels I and E. 10420380 -> 1000004300390: They are pronounced as plosives {(IPA+/k/+/k/)}, {(IPA+/g/+/g/)} (as in "call" and "gall") otherwise. 10420390 -> 1000004300400: Front/back vowel rules for C and G are similar in French, Romanian, Spanish, and to some extent English (including Old English). 10420400 -> 1000004300410: Swedish and Norwegian have similar rules for K and G. 10420410 -> 1000004300420: (See also palatalization.) 10420420 -> 1000004300430: However, an H can be added between C or G and E or I to represent a plosive, and an I can be added between C or G and A, O or U to signal that the consonant is an affricate. 10420430 -> 1000004300440: For example: 10420440 -> 1000004300450: Note that the H is silent in the digraphs CH and GH, as also the I in cia, cio, ciu and even cie is not pronounced as a separate vowel, unless it carries the primary stress. 10420450 -> 1000004300460: For example, it is silent in ciao {(IPA+/ˈʧa.o/+/ˈʧa.o/)} and cielo {(IPA+/ˈʧɛ.lo/+/ˈʧɛ.lo/)}, but it is pronounced in farmacia {(IPA+/ˌfaɾ.ma.ˈʧi.a/+/ˌfaɾ.ma.ˈʧi.a/)} and farmacie {(IPA+/ˌfaɾ.ma.ˈʧi.e/+/ˌfaɾ.ma.ˈʧi.e/)}. 10420460 -> 1000004300470: There are three other special digraphs in Italian: GN, GL and SC. 10420470 -> 1000004300480: GN represents {(IPA+/ɲ/+/ɲ/)}. 10420480 -> 1000004300490: GL represents {(IPA+/ʎ/+/ʎ/)} only before i, and never at the beginning of a word, except in the personal pronoun and definite article gli. 10420490 -> 1000004300500: (Compare with Spanish ñ and ll, Portuguese nh and lh.) 10420500 -> 1000004300510: SC represents fricative {(IPA+/ʃ/+/ʃ/)} before i or e. 
10420510 -> 1000004300520: Except in the speech of some Northern Italians, all of these are normally geminate between vowels. 10420520 -> 1000004300530: In general, all letters or digraphs represent phonemes rather clearly, and, in standard varieties of Italian, there is little allophonic variation. 10420530 -> 1000004300540: The most notable exceptions are assimilation of /n/ in point of articulation before consonants, assimilatory voicing of /s/ to following voiced consonants, and vowel length (vowels are long in stressed open syllables, and short elsewhere) — compare with the enormous number of allophones of the English phoneme /t/. 10420540 -> 1000004300550: Spelling is clearly phonemic and difficult to mistake given a clear pronunciation. 10420550 -> 1000004300560: Exceptions are generally only found in foreign borrowings. 10420560 -> 1000004300570: There are fewer cases of dyslexia than among speakers of languages such as English , and the concept of a spelling bee is strange to Italians. 10420570 -> 1000004300580: History 10420580 -> 1000004300590: The history of the Italian language is long, but the modern standard of the language was largely shaped by relatively recent events. 10420590 -> 1000004300600: The earliest surviving texts which can definitely be called Italian (or more accurately, vernacular, as opposed to its predecessor Vulgar Latin) are legal formulae from the region of Benevento dating from 960-963. 10420600 -> 1000004300610: What would come to be thought of as Italian was first formalized in the first years of the 14th century through the works of Dante Alighieri, who mixed southern Italian languages, especially Sicilian, with his native Tuscan in his epic poems known collectively as the Commedia, to which Giovanni Boccaccio later affixed the title Divina. 10420610 -> 1000004300620: Dante's much-loved works were read throughout Italy and his written dialect became the "canonical standard" that all educated Italians could understand. 10420620 -> 1000004300630: Dante is still credited with standardizing the Italian language and, thus, the dialect of Tuscany became the basis for what would become the official language of Italy. 10420630 -> 1000004300640: Italy has always had a distinctive dialect for each city since the cities were until recently thought of as city-states. 10420640 -> 1000004300650: The latter now has considerable variety, however. 10420650 -> 1000004300660: As Tuscan-derived Italian came to be used throughout the nation, features of local speech were naturally adopted, producing various versions of Regional Italian. 10420660 -> 1000004300670: The most characteristic differences, for instance, between Roman Italian and Milanese Italian are the gemination of initial consonants and the pronunciation of stressed "e", and of "s" in some cases (e.g. va bene "all right": is pronounced {(IPA+[va ˈbːɛne]+[va ˈbːɛne])} by a Roman, {(IPA+[va ˈbene]+[va ˈbene])} by a Milanese; a casa "at home": Roman {(IPA+[a ˈkːasa]+[a ˈkːasa])}, Milanese {(IPA+[a ˈkaza]+[a ˈkaza])}). 10420670 -> 1000004300680: In contrast to the dialects of northern Italy, southern Italian dialects were largely untouched by the Franco-Occitan influences introduced to Italy, mainly by bards from France, during the Middle Ages. 10420680 -> 1000004300690: Even in the case of Northern Italian dialects, however, scholars are careful not to overstate the effects of outsiders on the natural indigenous developments of the languages. 10420690 -> 1000004300700: (See La Spezia-Rimini Line.) 
10420700 -> 1000004300710: The economic might and relatively advanced development of Tuscany at the time (the Late Middle Ages) gave its dialect weight, though Venetian remained widespread in medieval Italian commercial life. 10420710 -> 1000004300720: Also, the increasing cultural relevance of Florence during the periods of 'Umanesimo (Humanism)' and the Rinascimento (Renaissance) made its volgare (dialect), or rather a refined version of it, a standard in the arts. 10420720 -> 1000004300730: The re-discovery of Dante's De vulgari eloquentia and a renewed interest in linguistics in the 16th century sparked a debate which raged throughout Italy concerning which criteria should be chosen to establish a modern Italian standard to be used as much as a literary language as a spoken one. 10420730 -> 1000004300740: Scholars were divided into three factions: the purists, headed by Pietro Bembo, who in his Gli Asolani claimed that the language could only be based on the great literary classics (notably Petrarch and Boccaccio, but not Dante, as Bembo believed that the Divine Comedy was not dignified enough because it used elements from other dialects); Niccolò Machiavelli and other Florentines, who preferred the version spoken by ordinary people in their own times; and the courtiers, like Baldassarre Castiglione and Gian Giorgio Trissino, who insisted that each local vernacular must contribute to the new standard. 10420740 -> 1000004300750: Eventually Bembo's ideas prevailed, the result being the publication of the first Italian dictionary in 1612 and the foundation of the Accademia della Crusca in Florence (1582-3), the official legislative body of the Italian language. 10420750 -> 1000004300760: Italian literature's first modern novel, I Promessi Sposi (The Betrothed) by Alessandro Manzoni, further defined the standard by "rinsing" his Milanese "in the waters of the Arno" (Florence's river), as he states in the preface to his 1840 edition. 10420760 -> 1000004300770: After unification, a huge number of civil servants and soldiers recruited from all over the country introduced many more words and idioms from their home dialects ("ciao" is Venetian, "panettone" is Milanese, etc.). 10420770 -> 1000004300780: Classification 10420780 -> 1000004300790: Italian is most closely related to the other two Italo-Dalmatian languages, Sicilian and the extinct Dalmatian. 10420790 -> 1000004300800: The three are part of the Italo-Western grouping of the Romance languages, which are a subgroup of the Italic branch of Indo-European. 10420800 -> 1000004300810: Geographic distribution 10420810 -> 1000004300820: The total number of speakers of Italian as a mother tongue is between 60 and 70 million. 10420820 -> 1000004300830: Speakers who use Italian as a second or cultural language are estimated at around 110-120 million. 10420830 -> 1000004300840: Italian is the official language of Italy and San Marino, and one of the official languages of Switzerland, spoken mainly in the cantons of Ticino and Grigioni, a region referred to as Italian Switzerland. 10420840 -> 1000004300850: It is also the second official language in some areas of Istria, in Slovenia and Croatia, where an Italian minority exists. 10420850 -> 1000004300860: It is the primary language of the Vatican City and is widely used and taught in Monaco and Malta.
10420860 -> 1000004300870: It is also widely understood in France with over one million speakers (especially in Corsica and the County of Nice, areas that historically spoke Italian dialects before annexation to France), and in Albania. 10420870 -> 1000004300880: Italian is also spoken by some in former Italian colonies in Africa (Libya, Somalia and Eritrea). 10420880 -> 1000004300890: However, its use has sharply dropped off since the colonial period. 10420890 -> 1000004300900: In Eritrea, Italian is widely understood. 10420900 -> 1000004300910: In fact, for fifty years, during the colonial period, Italian was the language of instruction, but as of 1997, there is only one Italian language school remaining, with 470 pupils. 10420910 -> 1000004300920: In Somalia, Italian used to be a major language, but due to the civil war and lack of education only the older generation still uses it. 10420920 -> 1000004300930: Italian and Italian dialects are widely used by Italian immigrants and many of their descendants (see Italians) living throughout Western Europe (especially France, Germany, Belgium, Switzerland, the United Kingdom and Luxembourg), the United States, Canada, Australia, and Latin America (especially Uruguay, Brazil, Argentina, and Venezuela). 10420930 -> 1000004300940: In the United States, Italian speakers are most commonly found in four cities: Boston (7,000), Chicago (12,000), New York City (140,000), and Philadelphia (15,000). 10420940 -> 1000004300950: In Canada there are large Italian-speaking communities in Montreal (120,000) and Toronto (195,000). 10420950 -> 1000004300960: Italian is the second most commonly spoken language in Australia, where 353,605 Italian Australians, or 1.9% of the population, reported speaking Italian at home in the 2001 Census. 10420960 -> 1000004300970: In 2001 there were 130,000 Italian speakers in Melbourne, and 90,000 in Sydney. 10420970 -> 1000004300980: Italian language education 10420980 -> 1000004300990: Italian is widely taught in many schools around the world, but rarely as the first non-native language of pupils; in fact, Italian is generally the fourth or fifth most widely taught second language in the world. 10420990 -> 1000004301000: In anglophone parts of Canada, Italian is, after French, the third most taught language. 10421000 -> 1000004301010: In francophone Canada it is third after English. 10421010 -> 1000004301020: In the United States and the United Kingdom, Italian ranks fourth (after Spanish, French and German, and after French, German and Spanish, respectively). 10421020 -> 1000004301030: Throughout the world, Italian is the fifth most taught non-native language, after English, French, Spanish, and German. 10421030 -> 1000004301040: In the European Union, Italian is spoken as a mother tongue by 13% of the population (64 million, mainly in Italy itself) and as a second language by 3% (14 million); among EU member states, it is most likely to be desired (and therefore learned) as a second language in Malta (61%), Croatia (14%), Slovenia (12%), Austria (11%), Romania (8%), France (6%), and Greece (6%). 10421040 -> 1000004301050: It is also an important second language in Albania and Switzerland, which are not EU members or candidates. 10421050 -> 1000004301060: Influence and derived languages 10421060 -> 1000004301070: From the late 19th to the mid 20th century, thousands of Italians settled in Argentina, Uruguay and southern Brazil, where they formed a very strong physical and cultural presence (see the Italian diaspora).
10421070 -> 1000004301080: In some cases, colonies were established where variants of Italian dialects were used, and some continue to use a derived dialect. 10421080 -> 1000004301090: Examples are Rio Grande do Sul, Brazil, where Talian is used, and the town of Chipilo near Puebla, Mexico, each of which continues to use a derived form of Venetian dating back to the 19th century. 10421090 -> 1000004301100: Another example is Cocoliche, an Italian-Spanish pidgin once spoken in Argentina and especially in Buenos Aires, and Lunfardo. 10421100 -> 1000004301110: Rioplatense Spanish, and particularly the speech of the city of Buenos Aires, has intonation patterns that resemble those of Italian dialects, due to the fact that Argentina had a constant, large influx of Italian settlers from the second half of the nineteenth century onward, initially primarily from Northern Italy and then, from the beginning of the twentieth century, mostly from Southern Italy. 10421110 -> 1000004301120: Lingua Franca 10421120 -> 1000004301130: Starting in late medieval times, Italian language variants replaced Latin to become the primary commercial language for much of Europe and the Mediterranean Sea (especially the Tuscan and Venetian variants). 10421130 -> 1000004301140: This became solidified during the Renaissance with the strength of Italian banking and the rise of humanism in the arts. 10421140 -> 1000004301150: During the period of the Renaissance, Italy held artistic sway over the rest of Europe. 10421150 -> 1000004301160: All educated European gentlemen were expected to make the Grand Tour, visiting Italy to see its great historical monuments and works of art. 10421160 -> 1000004301170: It thus became expected that educated Europeans would learn at least some Italian; the English poet John Milton, for instance, wrote some of his early poetry in Italian. 10421170 -> 1000004301180: In England, Italian became the second most common modern language to be learned, after French (though the classical languages, Latin and Greek, came first). 10421180 -> 1000004301190: However, by the late eighteenth century, Italian tended to be replaced by German as the second modern language on the curriculum. 10421190 -> 1000004301200: Yet Italian loanwords continue to be used in most other European languages in matters of art and music. 10421200 -> 1000004301210: Today, the Italian language continues to be used as a lingua franca in some environments. 10421210 -> 1000004301220: Within the Catholic Church, Italian is known by a large part of the ecclesiastical hierarchy and is used in place of Latin in some official documents. 10421220 -> 1000004301230: The presence of Italian as the primary language in the Vatican City indicates not only use within the Holy See, but also throughout the world where an episcopal seat is present. 10421230 -> 1000004301240: It continues to be used in music and opera. 10421240 -> 1000004301250: Other examples where Italian is sometimes used as a means of communication are in some sports (sometimes in football and motorsports) and in the design and fashion industries. 10421250 -> 1000004301260: Dialects 10421260 -> 1000004301270: In Italy, all Romance languages spoken as the vernacular, other than standard Italian and unrelated non-Italian languages, are termed "Italian dialects". 10421270 -> 1000004301280: Many Italian dialects are, in fact, historical languages in their own right.
10421280 -> 1000004301290: These include recognized language groups such as Friulian, Neapolitan, Sardinian, Sicilian, Venetian, and others, and regional variants of these languages such as Calabrian. 10421290 -> 1000004301300: The division between dialect and language has been used by scholars (such as Francesco Bruni) to distinguish between the languages that made up the Italian koine, and those which had very little or no part in it, such as Albanian, Greek, German, Ladin, and Occitan, which are still spoken by minorities. 10421300 -> 1000004301310: Dialects are not generally used for mass communication and are usually limited to native speakers in informal contexts. 10421310 -> 1000004301320: In the past, speaking in dialect was often deprecated as a sign of poor education. 10421320 -> 1000004301330: Younger generations, especially those under 35 (though it may vary in different areas), speak almost exclusively standard Italian in all situations, usually with local accents and idioms. 10421330 -> 1000004301340: Regional differences can be recognized by various factors: the openness of vowels, the length of the consonants, and influence of the local dialect (for example, annà replaces andare in the area of Rome for the infinitive "to go"). 10421340 -> 1000004301350: Sounds 10421350 -> None: 10421360 -> 1000004301360: Vowels 10421370 -> 1000004301370: Italian has seven vowel phonemes: {(IPA+/a/+/a/)}, {(IPA+/e/+/e/)}, {(IPA+/ɛ/+/ɛ/)}, {(IPA+/i/+/i/)}, {(IPA+/o/+/o/)}, {(IPA+/ɔ/+/ɔ/)}, {(IPA+/u/+/u/)}. 10421380 -> 1000004301380: The pairs {(IPA+/e/+/e/)}-{(IPA+/ɛ/+/ɛ/)} and {(IPA+/o/+/o/)}-{(IPA+/ɔ/+/ɔ/)} are seldom distinguished in writing and often confused, even though most varieties of Italian employ both phonemes consistently. 10421390 -> 1000004301390: Compare, for example: "perché" {(IPA+[perˈkɛ]+[perˈkɛ])} (why, because) and "senti" {(IPA+[ˈsenti]+[ˈsenti])} (you listen, you are listening, listen!), employed by some northern speakers, with {(IPA+[perˈke]+[perˈke])} and {(IPA+[ˈsɛnti]+[ˈsɛnti])}, as pronounced by most central and southern speakers. 10421400 -> 1000004301400: As a result, the usage is strongly indicative of a person's origin. 10421410 -> 1000004301410: The standard (Tuscan) usage of these vowels is listed in dictionaries, and employed outside Tuscany mainly by specialists, especially actors and very few (television) journalists. 10421420 -> 1000004301420: These are truly different phonemes, however: compare {(IPA+/ˈpeska/+/ˈpeska/)} (fishing) and {(IPA+/ˈpɛska/+/ˈpɛska/)} (peach), both spelled pesca. 10421430 -> 1000004301430: Similarly {(IPA+/ˈbotte/+/ˈbotte/)} ('barrel') and {(IPA+/ˈbɔtte/+/ˈbɔtte/)} ('beatings'), both spelled botte, distinguish {(IPA+/o/+/o/)} and {(IPA+/ɔ/+/ɔ/)}. 10421440 -> 1000004301440: In general, each vowel in a vowel combination is usually pronounced separately. 10421450 -> 1000004301450: Diphthongs exist (e.g. uo, iu, ie, ai), but are limited to an unstressed u or i before or after a stressed vowel. 10421460 -> 1000004301460: The unstressed u in a diphthong approximates the English semivowel w, and the unstressed i approximates the semivowel y. 10421470 -> 1000004301470: E.g.: buono {(IPA+[ˈbwɔno]+[ˈbwɔno])}, ieri {(IPA+[ˈjɛri]+[ˈjɛri])}. 10421480 -> 1000004301480: Triphthongs exist in Italian as well, like "continuiamo" ("we continue").
10421490 -> 1000004301490: Three vowel combinations exist only in the form semiconsonant ({(IPA+/j/+/j/)} or {(IPA+/w/+/w/)}), followed by a vowel, followed by a desinence vowel (usually {(IPA+/i/+/i/)}), as in miei, suoi, or two semiconsonants followed by a vowel, as the group -uia- exemplified above, or -iuo- in the word aiuola. 10421500 -> 1000004301500: Mobile diphthongs 10421510 -> 1000004301510: Many Latin words with a short e or o have Italian counterparts with a mobile diphthong (ie and uo respectively). 10421520 -> 1000004301520: When the vowel sound is stressed, it is pronounced and written as a diphthong; when not stressed, it is pronounced and written as a single vowel. 10421530 -> 1000004301530: So Latin focus gave rise to Italian fuoco (meaning both "fire" and "optical focus"): when unstressed, as in focale ("focal") the "o" remains alone. 10421540 -> 1000004301540: Latin pes (more precisely its accusative form pedem) is the source of Italian piede (foot): but unstressed "e" was left unchanged in pedone (pedestrian) and pedale (pedal). 10421550 -> 1000004301550: From Latin iocus comes Italian giuoco ("play", "game"), though in this case gioco is more common: giocare means "to play (a game)". 10421560 -> 1000004301560: From Latin homo comes Italian uomo (man), but also umano (human) and ominide (hominid). 10421570 -> 1000004301570: From Latin ovum comes Italian uovo (egg) and ovaie (ovaries). 10421580 -> 1000004301580: (The same phenomenon occurs in Spanish: juego (play, game) and jugar (to play), nieve (snow) and nevar (to snow)). 10421590 -> 1000004301590: Consonants 10421600 -> 1000004301600: Two symbols in a table cell denote the voiceless and voiced consonant, respectively. 10421610 -> 1000004301610: Nasals undergo assimilation when followed by a consonant, e.g., when preceding a velar ({(IPA+/k/+/k/)} or {(IPA+/g/+/g/)}) only {(IPA+[ŋ]+[ŋ])} appears, etc. 10421620 -> 1000004301620: Italian has geminate, or double, consonants, which are distinguished by length. 10421630 -> 1000004301630: Length is distinctive for all consonants except for {(IPA+/ʃ/+/ʃ/)}, {(IPA+/ʦ/+/ʦ/)}, {(IPA+/ʣ/+/ʣ/)}, {(IPA+/ʎ/+/ʎ/)} {(IPA+/ɲ/+/ɲ/)}, which are always geminate, and {(IPA+/z/+/z/)} which is always single. 10421640 -> 1000004301640: Geminate plosives and affricates are realised as lengthened closures. 10421650 -> 1000004301650: Geminate fricatives, nasals, and {(IPA+/l/+/l/)} are realized as lengthened continuants. 10421660 -> 1000004301660: The flap consonant {(IPA+/ɾː/+/ɾː/)} is typically dialectal, and it is called erre moscia. 10421670 -> 1000004301670: The correct standard pronunciation is {(IPA+[r]+[r])}. 10421680 -> 1000004301680: Of special interest to the linguistic study of Italian is the Gorgia Toscana, or "Tuscan Throat", the weakening or lenition of certain intervocalic consonants in Tuscan dialects. 10421690 -> 1000004301690: See also Syntactic doubling. 10421700 -> 1000004301700: Assimilation 10421710 -> 1000004301710: Italian has few diphthongs, so most unfamiliar diphthongs that are heard in foreign words (in particular, those beginning with vowel "a", "e", or "o") will be assimilated as the corresponding diaeresis (i.e., the vowel sounds will be pronounced separately). 10421720 -> 1000004301720: Italian phonotactics do not usually permit polysyllabic nouns and verbs to end with consonants, excepting poetry and song, so foreign words may receive extra terminal vowel sounds. 
10421730 -> 1000004301730: Grammar 10421740 -> 1000004301740: Common variations in the writing systems 10421750 -> 1000004301750: Some variations in the usage of the writing system may be present in practical use. 10421760 -> 1000004301760: These are scorned by educated people, but they are so common in certain contexts that knowledge of them may be useful. 10421770 -> 1000004301770: Usage of x instead of per: this is very common among teenagers and in SMS abbreviations. 10421780 -> 1000004301780: The multiplication operator is pronounced "per" in Italian, and so it is sometimes used to replace the word "per", which means "for"; thus, for example, "per te" ("for you") is shortened to "x te" (compare with English "4 U"). 10421790 -> 1000004301790: Words containing per can also have it replaced with x: for example, perché (both "why" and "because") is often shortened as xché or xké or x' (see below). 10421800 -> 1000004301800: This usage might be useful to jot down quick notes or to fit more text into the low character limit of an SMS, but it is considered unacceptable in formal writing. 10421810 -> 1000004301810: Usage of foreign letters such as k, j and y, especially in nicknames and SMS language: ke instead of che, Giusy instead of Giuseppina (or sometimes Giuseppe). 10421820 -> 1000004301820: This is curiously mirrored in the usage of i in English names such as Staci instead of Stacey, or in the usage of c in Northern Europe (Jacob instead of Jakob). 10421830 -> 1000004301830: The use of "k" instead of "ch" or "c" to represent a plosive sound is documented in some historical texts from before the standardization of the Italian language; however, that usage is no longer standard in Italian. 10421840 -> 1000004301840: Possibly because it is associated with the German language, the letter "k" has sometimes also been used in satire to suggest that a political figure is an authoritarian or even a "pseudo-nazi": Francesco Cossiga was famously nicknamed Kossiga by rioting students during his tenure as minister of internal affairs. 10421850 -> 1000004301850: [Cf. the politicized spelling Amerika in the USA.] 10421860 -> 1000004301860: Usage of the following abbreviations is limited to the electronic communications media and is deprecated in all other cases: nn instead of non (not), cmq instead of comunque (anyway, however), cm instead of come (how, like, as), d instead of di (of), (io/loro) sn instead of (io/loro) sono (I am/they are), (io) dv instead of (io) devo (I must/I have to) or instead of dove (where), (tu) 6 instead of (tu) sei (you are). 10421870 -> 1000004301870: Inexperienced typists often replace accents with apostrophes, such as in perche' instead of perché. 10421880 -> 1000004301880: Uppercase È is particularly rare, as it is absent from the Italian keyboard layout, and is very often written as E' (even though there are several ways of producing the uppercase È on a computer). 10421890 -> 1000004301890: This never happens in books or other professionally typeset material. 10421900 -> None: Samples 10421910 -> 1000004301900: Examples 10421920 -> 1000004301910: Cheers: "Salute!" 
10421930 -> 1000004301920: English: inglese {(IPA+/iŋˈglese/+/iŋˈglese/)} 10421940 -> 1000004301930: Good-bye: arrivederci {(IPA+/arriveˈdertʃi/+/arriveˈdertʃi/)} 10421950 -> 1000004301940: Hello: ciao {(IPA+/ˈtʃao/+/ˈtʃao/)} 10421960 -> 1000004301950: Good day: buon giorno {(IPA+/bwɔnˈdʒorno/+/bwɔnˈdʒorno/)} 10421970 -> 1000004301960: Good evening: buona sera {(IPA+/bwɔnaˈsera/+/bwɔnaˈsera/)} 10421980 -> 1000004301970: Yes: sì {(IPA+/si/+/si/)} 10421990 -> 1000004301980: No: no {(IPA+/nɔ/+/nɔ/)} 10422000 -> 1000004301990: How are you? : Come stai {(IPA+/ˈkome ˈstai/+/ˈkome ˈstai/)} (informal); Come sta {(IPA+/ˈkome 'sta/+/ˈkome 'sta/)} (formal) 10422010 -> 1000004302000: Sorry: mi dispiace {(IPA+/mi disˈpjatʃe/+/mi disˈpjatʃe/)} 10422020 -> 1000004302010: Excuse me: scusa {(IPA+/ˈskuza/+/ˈskuza/)} (informal); scusi {(IPA+/ˈskuzi/+/ˈskuzi/)} (formal) 10422030 -> 1000004302020: Again: di nuovo, /{(IPA+di ˈnwɔvo+di ˈnwɔvo)}/; ancora /{(IPA+aŋˈkora+aŋˈkora)}/ 10422040 -> 1000004302030: Always: sempre /{(IPA+ˈsɛmpre+ˈsɛmpre)}/ 10422050 -> 1000004302040: When: quando {(IPA+/ˈkwando/+/ˈkwando/)} 10422060 -> 1000004302050: Where: dove {(IPA+/'dove/+/'dove/)} 10422070 -> 1000004302060: Why/Because: perché {(IPA+/perˈke/+/perˈke/)} 10422080 -> 1000004302070: How: come {(IPA+/'kome/+/'kome/)} 10422090 -> 1000004302080: How much is it?: quanto costa? 10422100 -> 1000004302090: {(IPA+/ˈkwanto/+/ˈkwanto/)} 10422110 -> 1000004302100: Thank you!: grazie! 10422120 -> 1000004302110: {(IPA+/ˈgrattsie/+/ˈgrattsie/)} 10422130 -> 1000004302120: Bon appetit: buon appetito {(IPA+/ˌbwɔn appeˈtito/+/ˌbwɔn appeˈtito/)} 10422140 -> 1000004302130: You're welcome!: prego! 10422150 -> 1000004302140: {(IPA+/ˈprɛgo/+/ˈprɛgo/)} 10422160 -> 1000004302150: I love you: Ti amo {(IPA+/ti ˈamo/+/ti ˈamo/)}, Ti voglio bene {(IPA+/ti ˈvɔʎʎo ˈbɛne/+/ti ˈvɔʎʎo ˈbɛne/)}. 10422170 -> 1000004302160: The difference is that you use "Ti amo" when you are in a romantic relationship, "Ti voglio bene" in any other occasion (to parents, to relatives, to friends...) 
10422180 -> 1000004302170: Counting to twenty: 10422190 -> 1000004302180: One: uno {(IPA+/ˈuno/+/ˈuno/)} 10422200 -> 1000004302190: Two: due {(IPA+/ˈdue/+/ˈdue/)} 10422210 -> 1000004302200: Three: tre {(IPA+/tre/+/tre/)} 10422220 -> 1000004302210: Four: quattro {(IPA+/ˈkwattro/+/ˈkwattro/)} 10422230 -> 1000004302220: Five: cinque {(IPA+/ˈʧiŋkwe/+/ˈʧiŋkwe/)} 10422240 -> 1000004302230: Six: sei {(IPA+/ˈsɛi/+/ˈsɛi/)} 10422250 -> 1000004302240: Seven: sette {(IPA+/ˈsɛtte/+/ˈsɛtte/)} 10422260 -> 1000004302250: Eight: otto {(IPA+/ˈɔtto/+/ˈɔtto/)} 10422270 -> 1000004302260: Nine: nove {(IPA+/ˈnɔve/+/ˈnɔve/)} 10422280 -> 1000004302270: Ten: dieci {(IPA+/ˈdjɛʧi/+/ˈdjɛʧi/)} 10422290 -> 1000004302280: Eleven: undici {(IPA+/ˈundiʧi/+/ˈundiʧi/)} 10422300 -> 1000004302290: Twelve: dodici {(IPA+/ˈdodiʧi/+/ˈdodiʧi/)} 10422310 -> 1000004302300: Thirteen: tredici {(IPA+/ˈtrediʧi/+/ˈtrediʧi/)} 10422320 -> 1000004302310: Fourteen: quattordici {(IPA+/kwat'tordiʧi/+/kwat'tordiʧi/)} 10422330 -> 1000004302320: Fifteen: quindici {(IPA+/ˈkwindiʧi/+/ˈkwindiʧi/)} 10422340 -> 1000004302330: Sixteen: sedici {(IPA+/ˈsediʧi/+/ˈsediʧi/)} 10422350 -> 1000004302340: Seventeen: diciassette {(IPA+/diʧas'sɛtte/+/diʧas'sɛtte/)} 10422360 -> 1000004302350: Eighteen: diciotto {(IPA+/di'ʧɔtto/+/di'ʧɔtto/)} 10422370 -> 1000004302360: Nineteen: diciannove {(IPA+/diʧan'nɔve/+/diʧan'nɔve/)} 10422380 -> 1000004302370: Twenty: venti {(IPA+/'venti/+/'venti/)} 10422390 -> 1000004302380: The days of the week: 10422400 -> 1000004302390: Monday: lunedì {(IPA+/lune'di/+/lune'di/)} 10422410 -> 1000004302400: Tuesday: martedì {(IPA+/marte'di/+/marte'di/)} 10422420 -> 1000004302410: Wednesday: mercoledì {(IPA+/merkole'di/+/merkole'di/)} 10422430 -> 1000004302420: Thursday: giovedì {(IPA+/dʒove'di/+/dʒove'di/)} 10422440 -> 1000004302430: Friday: venerdì {(IPA+/vener'di/+/vener'di/)} 10422450 -> 1000004302440: Saturday: sabato {(IPA+/ˈsabato/+/ˈsabato/)} 10422460 -> 1000004302450: Sunday: domenica {(IPA+/do'menika/+/do'menika/)} 10422470 -> None: Sample texts 10422480 -> None: There is a recording of Dante's Divine Comedy read by Lino Pertile available at http://etcweb.princeton.edu/dante/pdp/ Japanese language 10430010 -> 1000004400020: Japanese language 10430020 -> 1000004400030: {(Nihongo+Japanese (日本語 / にほんご ?)+Japanese+日本語 / にほんご )} is a language spoken by over 130 million people in Japan and in Japanese emigrant communities. 10430030 -> 1000004400040: It is related to the Ryukyuan languages, but whatever relationships with other languages it may have remain undemonstrated. 10430040 -> 1000004400050: It is an agglutinative language and is distinguished by a complex system of honorifics reflecting the hierarchical nature of Japanese society, with verb forms and particular vocabulary to indicate the relative status of speaker, listener and the third person mentioned in conversation whether he is there or not. 10430050 -> 1000004400060: The sound inventory of Japanese is relatively small, and it has a lexically distinct pitch-accent system. 10430060 -> 1000004400070: It is a mora-timed language. 10430070 -> 1000004400080: The Japanese language is written with a combination of three different types of scripts: Chinese characters called kanji (漢字 / かんじ), and two syllabic scripts made up of modified Chinese characters, hiragana (平仮名 / ひらがな) and katakana (片仮名 / カタカナ). 
10430080 -> 1000004400090: The Latin alphabet, rōmaji (ローマ字), is also often used in modern Japanese, especially for company names and logos, advertising, and when entering Japanese text into a computer. 10430090 -> 1000004400100: Western-style Arabic numerals are generally used for numbers, but traditional Sino-Japanese numerals are also commonplace. 10430100 -> 1000004400110: Japanese vocabulary has been heavily influenced by loanwords from other languages. 10430110 -> 1000004400120: A vast number of words were borrowed from Chinese, or created from Chinese models, over a period of at least 1,500 years. 10430120 -> 1000004400130: Since the late 19th century, Japanese has borrowed a considerable number of words from Indo-European languages, primarily English. 10430130 -> 1000004400140: Because of Japan's special trade relationship, first with Portugal in the 16th century and then mainly with the Netherlands in the 17th century, Portuguese, German and Dutch have also been influential. 10430140 -> 1000004400150: Geographic distribution 10430150 -> 1000004400160: Although Japanese is spoken almost exclusively in Japan, it has been and sometimes still is spoken elsewhere. 10430160 -> 1000004400170: When Japan occupied Korea, Taiwan, parts of the Chinese mainland, and various Pacific islands before and during World War II, locals in those countries were forced to learn Japanese in empire-building programs. 10430170 -> 1000004400180: As a result, there are many people in these countries who can speak Japanese in addition to the local languages. 10430180 -> 1000004400190: Japanese emigrant communities (the largest of which are to be found in Brazil) sometimes employ Japanese as their primary language. 10430190 -> 1000004400200: Approximately 5% of Hawaii residents speak Japanese, with Japanese ancestry the largest single ancestry in the state (over 24% of the population). 10430200 -> 1000004400210: Japanese emigrants can also be found in Peru, Argentina, Australia (especially Sydney, Brisbane, and Melbourne), the United States (notably California, where 1.2% of the population has Japanese ancestry, and Hawaii), and the Philippines (particularly in Davao and Laguna). 10430210 -> 1000004400220: Their descendants, who are known as {(Transl+nikkei+ja+nikkei)} ({(Lang+日系+ja+日系)}, literally Japanese descendants), however, rarely speak Japanese fluently after the second generation. 10430220 -> 1000004400230: There are estimated to be several million non-Japanese studying the language as well. 10430230 -> 1000004400240: Official status 10430240 -> 1000004400250: Japanese is the de facto official language of Japan. 10430250 -> 1000004400260: There is a form of the language considered standard: {(Nihongo+hyōjungo (標準語?)+hyōjungo+標準語)}, "standard Japanese", or {(Nihongo+kyōtsūgo (共通語?)+kyōtsūgo+共通語)}, "the common language". 10430260 -> 1000004400270: The meanings of the two terms are almost the same. 10430270 -> 1000004400280: {(Transl+Hyōjungo+ja+Hyōjungo)} or {(Transl+kyōtsūgo+ja+kyōtsūgo)} is a concept that forms the counterpart of dialect. 10430280 -> 1000004400290: This normative language was born after the {(Nihongo+Meiji Restoration (明治維新 meiji ishin?, 1868)+Meiji Restoration+明治維新+meiji ishin+1868)} from the language spoken in uptown Tokyo, out of the need for a common means of communication. 10430290 -> 1000004400300: {(Transl+Hyōjungo+ja+Hyōjungo)} is taught in schools and used on television and in official communications, and is the version of Japanese discussed in this article.
10430300 -> 1000004400310: Formerly, standard {(Nihongo+Japanese in writing (文語 bungo?, "literary language")+Japanese in writing+文語+bungo+"literary language")} was different from {(Nihongo+colloquial language (口語 kōgo?)+colloquial language+口語+kōgo)}. 10430310 -> 1000004400320: The two systems have different rules of grammar and some variance in vocabulary. 10430320 -> 1000004400330: {(Transl+Bungo+ja+Bungo)} was the main method of writing Japanese until about 1900; since then {(Transl+kōgo+ja+kōgo)} gradually extended its influence and the two methods were both used in writing until the 1940s. 10430330 -> 1000004400340: {(Transl+Bungo+ja+Bungo)} still has some relevance for historians, literary scholars, and lawyers (many Japanese laws that survived World War II are still written in {(Transl+bungo+ja+bungo)}, although there are ongoing efforts to modernize their language). 10430340 -> 1000004400350: {(Transl+Kōgo+ja+Kōgo)} is the predominant method of both speaking and writing Japanese today, although {(Transl+bungo+ja+bungo)} grammar and vocabulary are occasionally used in modern Japanese for effect. 10430350 -> 1000004400360: Dialects 10430360 -> 1000004400370: Dozens of dialects are spoken in Japan. 10430370 -> 1000004400380: The profusion is due to many factors, including the length of time the archipelago has been inhabited, its mountainous island terrain, and Japan's long history of both external and internal isolation. 10430380 -> 1000004400390: Dialects typically differ in terms of pitch accent, inflectional morphology, vocabulary, and particle usage. 10430390 -> 1000004400400: Some even differ in vowel and consonant inventories, although this is uncommon. 10430400 -> 1000004400410: The main distinction in Japanese accents is between {(Nihongo+Tokyo-type (東京式 Tōkyō-shiki?)+Tokyo-type+東京式+Tōkyō-shiki)} and {(Nihongo+Kyoto-Osaka-type (京阪式 Keihan-shiki?)+Kyoto-Osaka-type+京阪式+Keihan-shiki)}, though Kyūshū-type dialects form a third, smaller group. 10430410 -> 1000004400420: Within each type are several subdivisions. 10430420 -> 1000004400430: Kyoto-Osaka-type dialects are spoken in the central region, with borders roughly formed by Toyama, Kyōto, Hyōgo, and Mie Prefectures; most Shikoku dialects are also of that type. 10430430 -> 1000004400440: The final category of dialects comprises those descended from the Eastern dialect of Old Japanese; these dialects are spoken on the island of Hachijō-jima and a few other islands. 10430440 -> 1000004400450: Dialects from peripheral regions, such as Tōhoku or Tsushima, may be unintelligible to speakers from other parts of the country. 10430450 -> 1000004400460: The several dialects of Kagoshima in southern Kyūshū are famous for being unintelligible not only to speakers of standard Japanese but to speakers of nearby dialects elsewhere in Kyūshū as well. 10430460 -> 1000004400470: This is probably due in part to the Kagoshima dialects' peculiarities of pronunciation, which include the existence of closed syllables (i.e., syllables that end in a consonant, such as {(IPA+/kob/+/kob/)} or {(IPA+/koʔ/+/koʔ/)} for Standard Japanese {(IPA+/kumo/+/kumo/)} "spider"). 10430470 -> 1000004400480: The Kansai group of dialects is spoken and known by many Japanese, and Osaka dialect in particular is associated with comedy (see Kansai dialect). 10430480 -> 1000004400490: Dialects of Tōhoku and North Kantō are stereotypically associated with farmers.
10430490 -> 1000004400500: The Ryūkyūan languages, spoken in Okinawa and in the Amami Islands, which are politically part of Kagoshima, are distinct enough to be considered a separate branch of the Japonic family. 10430500 -> 1000004400510: However, many ordinary Japanese people tend to consider the Ryūkyūan languages dialects of Japanese. 10430510 -> 1000004400520: Not only is each language unintelligible to Japanese speakers, but most are unintelligible to those who speak other Ryūkyūan languages. 10430520 -> 1000004400530: Recently, Standard Japanese has become prevalent nationwide (including the Ryūkyū islands) due to education, mass media, and increased mobility within Japan, as well as economic integration. 10430530 -> 1000004400540: Sounds 10430540 -> None: 10430550 -> 1000004400550: Japanese vowels are "pure" sounds. 10430560 -> 1000004400560: The only unusual vowel is the high back vowel {(IPA+/ɯ/+/ɯ/)}, which is like {(IPA+/u/+/u/)}, but compressed instead of rounded. 10430570 -> 1000004400570: Japanese has five vowels, and vowel length is phonemic, so each one has both a short and a long version. 10430580 -> 1000004400580: Some Japanese consonants have several allophones, which may give the impression of a larger inventory of sounds. 10430590 -> 1000004400590: However, some of these allophones have since become phonemic. 10430600 -> 1000004400600: For example, in the Japanese language up to and including the first half of the twentieth century, the phonemic sequence {(IPA+/ti/+/ti/)} was palatalized and realized phonetically as {(IPA+[tɕi]+[tɕi])}, approximately chi; however, now {(IPA+/ti/+/ti/)} and {(IPA+/tɕi/+/tɕi/)} are distinct, as evidenced by words like tī {(IPA+[tiː]+[tiː])} "Western style tea" and chii {(IPA+[tɕii]+[tɕii])} "social status." 10430610 -> 1000004400610: The 'r' of the Japanese language (technically a lateral apical postalveolar flap) is of particular interest, sounding to most English speakers to be something between an 'l' and a retroflex 'r' depending on its position in a word. 10430620 -> 1000004400620: The syllabic structure and the phonotactics are very simple: the only consonant clusters allowed within a syllable consist of one of a subset of the consonants plus {(IPA+/j/+/j/)}. 10430630 -> 1000004400630: These types of clusters occur only in onsets. 10430640 -> 1000004400640: However, consonant clusters across syllables are allowed as long as the two consonants are a nasal followed by a homorganic consonant. 10430650 -> 1000004400650: Consonant length (gemination) is also phonemic. 10430660 -> 1000004400660: Grammar 10430670 -> 1000004400670: Sentence structure 10430680 -> 1000004400680: Japanese word order is classified as Subject-Object-Verb. 10430690 -> 1000004400690: However, unlike many Indo-European languages, Japanese sentences only require that verbs come last for intelligibility. 10430700 -> 1000004400700: This is because the Japanese sentence elements are marked with particles that identify their grammatical functions. 10430710 -> 1000004400710: The basic sentence structure is topic-comment. 10430720 -> 1000004400720: For example, {(Transl+Kochira-wa Tanaka-san desu+ja+Kochira-wa Tanaka-san desu)} ({(Lang+こちらは田中さんです+ja+こちらは田中さんです)}). 10430730 -> 1000004400730: {(Transl+Kochira+ja+Kochira)} ("this") is the topic of the sentence, indicated by the particle -wa. 10430740 -> 1000004400740: The verb is {(Transl+desu+ja+desu)}, a copula, commonly translated as "to be" or "it is" (though there are other verbs that can be translated as "to be").
10430750 -> 1000004400750: As a phrase, {(Transl+Tanaka-san desu+ja+Tanaka-san desu)} is the comment. 10430760 -> 1000004400760: This sentence loosely translates to "As for this person, (it) is Mr./Mrs./Miss Tanaka." 10430770 -> 1000004400770: Thus Japanese, like Chinese, Korean, and many other Asian languages, is often called a topic-prominent language, which means it has a strong tendency to indicate the topic separately from the subject, and the two do not always coincide. 10430780 -> 1000004400780: The sentence {(Transl+Zō-wa hana-ga nagai (desu)+ja+Zō-wa hana-ga nagai (desu))} ({(Lang+象は鼻が長いです+ja+象は鼻が長いです)}) literally means, "As for elephants, (their) noses are long". 10430790 -> 1000004400790: The topic is {(Transl+zō+ja+zō)} "elephant", and the subject is {(Transl+hana+ja+hana)} "nose". 10430800 -> 1000004400800: Japanese is a pro-drop language, meaning that the subject or object of a sentence need not be stated if it is obvious from context. 10430810 -> 1000004400810: In addition, it is commonly felt, particularly in spoken Japanese, that the shorter a sentence is, the better. 10430820 -> 1000004400820: As a result of this grammatical permissiveness and tendency towards brevity, Japanese speakers tend naturally to omit words from sentences, rather than refer to them with pronouns. 10430830 -> 1000004400830: In the context of the above example, {(Transl+hana-ga nagai+ja+hana-ga nagai)} would mean "[their] noses are long," while {(Transl+nagai+ja+nagai)} by itself would mean "[they] are long." 10430840 -> 1000004400840: A single verb can be a complete sentence: {(Transl+Yatta!+ja+Yatta!)} 10430850 -> 1000004400850: "[I / we / they / etc] did [it]!". 10430860 -> 1000004400860: In addition, since adjectives can form the predicate in a Japanese sentence (below), a single adjective can be a complete sentence: {(Transl+Urayamashii!+ja+Urayamashii!)} 10430870 -> 1000004400870: "[I'm] jealous [of it]!". 10430880 -> 1000004400880: While the language has some words that are typically translated as pronouns, these are not used as frequently as pronouns in some Indo-European languages, and function differently. 10430890 -> 1000004400890: Instead, Japanese typically relies on special verb forms and auxiliary verbs to indicate the direction of benefit of an action: "down" to indicate the out-group gives a benefit to the in-group; and "up" to indicate the in-group gives a benefit to the out-group. 10430900 -> 1000004400900: Here, the in-group includes the speaker and the out-group doesn't, and their boundary depends on context. 10430910 -> 1000004400910: For example, {(Transl+oshiete moratta+ja+oshiete moratta)} (literally, "explained" with a benefit from the out-group to the in-group) means "[he/she/they] explained it to [me/us]". 10430920 -> 1000004400920: Similarly, {(Transl+oshiete ageta+ja+oshiete ageta)} (literally, "explained" with a benefit from the in-group to the out-group) means "[I/we] explained [it] to [him/her/them]". 10430930 -> 1000004400930: Such beneficiary auxiliary verbs thus serve a function comparable to that of pronouns and prepositions in Indo-European languages to indicate the actor and the recipient of an action. 10430940 -> 1000004400940: Japanese "pronouns" also function differently from most modern Indo-European pronouns (and more like nouns) in that they can take modifiers as any other noun may. 10430950 -> 1000004400950: For instance, one cannot say in English: 10430970 -> 1000004400960: *The amazed he ran down the street. 
(grammatically incorrect) 10430980 -> 1000004400970: But one can grammatically say essentially the same thing in Japanese: 10430990 -> 1000004400980: {(Transl+Odoroita kare-wa michi-o hashitte itta.+ja+Odoroita kare-wa michi-o hashitte itta.)} (grammatically correct) 10431000 -> 1000004400990: This is partly due to the fact that these words evolved from regular nouns, such as {(Transl+kimi+ja+kimi)} "you" ({(Lang+君+ja+君)} "lord"), {(Transl+anata+ja+anata)} "you" ({(Lang+あなた+ja+あなた)} "that side, yonder"), and {(Transl+boku+ja+boku)} "I" ({(Lang+僕+ja+僕)} "servant"). 10431010 -> 1000004401000: This is why some linguists do not classify Japanese "pronouns" as pronouns, but rather as referential nouns. 10431020 -> 1000004401010: Japanese personal pronouns are generally used only in situations requiring special emphasis as to who is doing what to whom. 10431030 -> 1000004401020: The choice of words used as pronouns is correlated with the sex of the speaker and the social situation in which they are spoken: men and women alike in a formal situation generally refer to themselves as {(Transl+watashi+ja+watashi)} ({(Lang+私+ja+私)} "private") or {(Transl+watakushi+ja+watakushi)} (also {(Lang+私+ja+私)}), while men in rougher or intimate conversation are much more likely to use the word {(Transl+ore+ja+ore)} ({(Lang+俺+ja+俺)} "oneself", "myself") or {(Transl+boku+ja+boku)}. 10431040 -> 1000004401030: Similarly, different words such as {(Transl+anata+ja+anata)}, {(Transl+kimi+ja+kimi)}, and {(Transl+omae+ja+omae)} ({(Lang+お前+ja+お前)}, more formally {(Lang+御前+ja+御前)} "the one before me") may be used to refer to a listener depending on the listener's relative social position and the degree of familiarity between the speaker and the listener. 10431050 -> 1000004401040: When used in different social relationships, the same word may have positive (intimate or respectful) or negative (distant or disrespectful) connotations. 10431060 -> 1000004401050: Japanese speakers often use the title of the person referred to where a pronoun would be used in English. 10431070 -> 1000004401060: For example, when speaking to one's teacher, it is appropriate to use {(Transl+sensei+ja+sensei)} ({(Lang+先生+ja+先生)}, teacher), but inappropriate to use {(Transl+anata+ja+anata)}. 10431080 -> 1000004401070: This is because {(Transl+anata+ja+anata)} is used to refer to people of equal or lower status, and one's teacher is held to have higher status. 10431090 -> 1000004401080: For English-speaking learners of Japanese, a frequent beginner's mistake is to include {(Transl+watashi-wa+ja+watashi-wa)} or {(Transl+anata-wa+ja+anata-wa)} at the beginning of sentences as one would with I or you in English. 10431100 -> 1000004401090: Though these sentences are not grammatically incorrect, even in formal settings they would be considered unnatural and would equate in English to repeatedly using a noun where a pronoun would suffice. 10431110 -> 1000004401100: Inflection and conjugation 10431120 -> 1000004401110: Japanese nouns have no grammatical number, gender, or articles. 10431130 -> 1000004401120: The noun {(Transl+hon+ja+hon)} ({(Lang+本+ja+本)}) may refer to a single book or several books; {(Transl+hito+ja+hito)} ({(Lang+人+ja+人)}) can mean "person" or "people"; and {(Transl+ki+ja+ki)} ({(Lang+木+ja+木)}) can be "tree" or "trees". 10431140 -> 1000004401130: Where number is important, it can be indicated by providing a quantity (often with a counter word) or (rarely) by adding a suffix.
10431150 -> 1000004401140: Words for people are usually understood as singular. 10431160 -> 1000004401150: Thus {(Transl+Tanaka-san+ja+Tanaka-san)} usually means Mr./Mrs./Miss. Tanaka. 10431170 -> 1000004401160: Words that refer to people and animals can be made to indicate a group of individuals through the addition of a collective suffix (a noun suffix that indicates a group), such as {(Transl+-tachi+ja+-tachi)}, but this is not a true plural: the meaning is closer to the English phrase "and company". 10431180 -> 1000004401170: A group described as {(Transl+Tanaka-san-tachi+ja+Tanaka-san-tachi)} may include people not named Tanaka. 10431190 -> 1000004401180: Some Japanese nouns are effectively plural, such as {(Transl+hitobito+ja+hitobito)} "people" and {(Transl+wareware+ja+wareware)} "we/us", while the word {(Transl+tomodachi+ja+tomodachi)} "friend" is considered singular, although plural in form. 10431200 -> 1000004401190: Verbs are conjugated to show tenses, of which there are two: past and present, or non-past, which is used for the present and the future. 10431210 -> 1000004401200: For verbs that represent an ongoing process, the -te iru form indicates a continuous (or progressive) tense. 10431220 -> 1000004401210: For others that represent a change of state, the {(Transl+-te iru+ja+-te iru)} form indicates a perfect tense. 10431230 -> 1000004401220: For example, {(Transl+kite iru+ja+kite iru)} means "He has come (and is still here)", but {(Transl+tabete iru+ja+tabete iru)} means "He is eating". 10431240 -> 1000004401230: Questions (both with an interrogative pronoun and yes/no questions) have the same structure as affirmative sentences, but with intonation rising at the end. 10431250 -> 1000004401240: In the formal register, the question particle {(Transl+-ka+ja+-ka)} is added. 10431260 -> 1000004401250: For example, {(Transl+Ii desu+ja+Ii desu)} ({(Lang+いいです。+ja+いいです。)}) "It is OK" becomes {(Transl+Ii desu-ka+ja+Ii desu-ka)} ({(Lang+いいですか?+ja+いいですか?)}) "Is it OK?". 10431270 -> 1000004401260: In a more informal tone sometimes the particle {(Transl+-no+ja+-no)} ({(Lang+の+ja+の)}) is added instead to show a personal interest of the speaker: {(Transl+Dōshite konai-no?+ja+Dōshite konai-no?)} 10431280 -> 1000004401270: "Why aren't (you) coming?". 10431290 -> 1000004401280: Some simple queries are formed simply by mentioning the topic with an interrogative intonation to call for the hearer's attention: {(Transl+Kore-wa?+ja+Kore-wa?)} 10431300 -> 1000004401290: "(What about) this?"; {(Transl+Namae-wa?+ja+Namae-wa?)} ({(Lang+名前は?+ja+名前は?)}) "(What's your) name?". 10431310 -> 1000004401300: Negatives are formed by inflecting the verb. 10431320 -> 1000004401310: For example, {(Transl+Pan-o taberu+ja+Pan-o taberu)} ({(Lang+パンを食べる。+ja+パンを食べる。)}) "I will eat bread" or "I eat bread" becomes {(Transl+Pan-o tabenai+ja+Pan-o tabenai)} ({(Lang+パンを食べない。+ja+パンを食べない。)}) "I will not eat bread" or "I do not eat bread". 10431330 -> 1000004401320: The so-called {(Transl+-te+ja+-te)} verb form is used for a variety of purposes: either progressive or perfect aspect (see above); combining verbs in a temporal sequence ({(Transl+Asagohan-o tabete sugu dekakeru+ja+Asagohan-o tabete sugu dekakeru)} "I'll eat breakfast and leave at once"), simple commands, conditional statements and permissions ({(Transl+Dekakete-mo ii?+ja+Dekakete-mo ii?)} "May I go out?"), etc. 10431340 -> 1000004401330: The word {(Transl+da+ja+da)} (plain), {(Transl+desu+ja+desu)} (polite) is the copula verb. 
10431350 -> 1000004401340: It corresponds approximately to the English be, but often takes on other roles, including a marker for tense, when the verb is conjugated into its past form {(Transl+datta+ja+datta)} (plain), {(Transl+deshita+ja+deshita)} (polite). 10431360 -> 1000004401350: This comes into use because only {(Transl+keiyōshi+ja+keiyōshi)} adjectives and verbs can carry tense in Japanese. 10431370 -> 1000004401360: Two additional common verbs are used to indicate existence ("there is") or, in some contexts, property: {(Transl+aru+ja+aru)} (negative {(Transl+nai+ja+nai)}) and {(Transl+iru+ja+iru)} (negative {(Transl+inai+ja+inai)}), for inanimate and animate things, respectively. 10431380 -> 1000004401370: For example, {(Transl+Neko ga iru+ja+Neko ga iru)} "There's a cat", {(Transl+Ii kangae-ga nai+ja+Ii kangae-ga nai)} "[I] haven't got a good idea". 10431390 -> 1000004401380: Note that the negative forms of the verbs {(Transl+iru+ja+iru)} and {(Transl+aru+ja+aru)} are actually i-adjectives and inflect as such, e.g. {(Transl+Neko ga inakatta+ja+Neko ga inakatta)} "There was no cat". 10431400 -> 1000004401390: The verb "to do" ({(Transl+suru+ja+suru)}, polite form {(Transl+shimasu+ja+shimasu)}) is often used to make verbs from nouns ({(Transl+ryōri suru+ja+ryōri suru)} "to cook", {(Transl+benkyō suru+ja+benkyō suru)} "to study", etc.) and has been productive in creating modern slang words. 10431410 -> 1000004401400: Japanese also has a huge number of compound verbs to express concepts that are described in English using a verb and a preposition (e.g. {(Transl+tobidasu+ja+tobidasu)} "to fly out, to flee," from {(Transl+tobu+ja+tobu)} "to fly, to jump" + {(Transl+dasu+ja+dasu)} "to put out, to emit"). 10431420 -> 1000004401410: There are three types of adjective (see also Japanese adjectives): 10431430 -> 1000004401420: {(Lang+形容詞+ja+形容詞)} {(Transl+keiyōshi+ja+keiyōshi)}, or {(Transl+i+ja+i)} adjectives, which have a conjugating ending {(Transl+i+ja+i)} ({(Lang+い+ja+い)}) (such as {(Lang+あつい+ja+あつい)} {(Transl+atsui+ja+atsui)} "to be hot") which can become past ({(Lang+あつかった+ja+あつかった)} {(Transl+atsukatta+ja+atsukatta)} "it was hot"), or negative ({(Lang+あつくない+ja+あつくない)} {(Transl+atsuku nai+ja+atsuku nai)} "it is not hot"). 10431440 -> 1000004401430: Note that {(Transl+nai+ja+nai)} is also an {(Transl+i+ja+i)} adjective, which can become past ({(Lang+あつくなかった+ja+あつくなかった)} {(Transl+atsuku nakatta+ja+atsuku nakatta)} "it was not hot"). 10431450 -> 1000004401440: {(Lang+暑い日+ja+暑い日)} {(Transl+atsui hi+ja+atsui hi)} "a hot day". 10431460 -> 1000004401450: {(Lang+形容動詞+ja+形容動詞)} {(Transl+keiyōdōshi+ja+keiyōdōshi)}, or {(Transl+na+ja+na)} adjectives, which are followed by a form of the copula, usually {(Transl+na+ja+na)}. 10431470 -> 1000004401460: For example {(Transl+hen+ja+hen)} (strange) 10431480 -> 1000004401470: {(Lang+変なひと+ja+変なひと)} {(Transl+hen na hito+ja+hen na hito)} "a strange person". 10431490 -> 1000004401480: {(Lang+連体詞+ja+連体詞)} {(Transl+rentaishi+ja+rentaishi)}, also called true adjectives, such as {(Transl+ano+ja+ano)} "that" 10431500 -> 1000004401490: {(Lang+あの山+ja+あの山)} {(Transl+ano yama+ja+ano yama)} "that mountain". 10431510 -> 1000004401500: Both {(Transl+keiyōshi+ja+keiyōshi)} and {(Transl+keiyōdōshi+ja+keiyōdōshi)} may predicate sentences. 10431520 -> 1000004401510: For example, 10431530 -> 1000004401520: {(Lang+ご飯が熱い。+ja+ご飯が熱い。)} {(Transl+Gohan-ga atsui.+ja+Gohan-ga atsui.)} 10431540 -> 1000004401530: "The rice is hot." 
10431550 -> 1000004401540: {(Lang+彼は変だ。+ja+彼は変だ。)} {(Transl+Kare-wa hen da.+ja+Kare-wa hen da.)} 10431560 -> 1000004401550: "He's strange." 10431570 -> 1000004401560: Both inflect, though they do not show the full range of conjugation found in true verbs. 10431580 -> 1000004401570: The {(Transl+rentaishi+ja+rentaishi)} in Modern Japanese are few in number, and unlike the other words, are limited to directly modifying nouns. 10431590 -> 1000004401580: They never predicate sentences. 10431600 -> 1000004401590: Examples include {(Transl+ookina+ja+ookina)} "big", {(Transl+kono+ja+kono)} "this", {(Transl+iwayuru+ja+iwayuru)} "so-called" and {(Transl+taishita+ja+taishita)} "amazing". 10431610 -> 1000004401600: Both {(Transl+keiyōdōshi+ja+keiyōdōshi)} and {(Transl+keiyōshi+ja+keiyōshi)} form adverbs, by following with {(Transl+ni+ja+ni)} in the case of {(Transl+keiyōdōshi+ja+keiyōdōshi)}: 10431620 -> 1000004401610: {(Lang+変になる+ja+変になる)} {(Transl+hen ni naru+ja+hen ni naru)} "become strange", 10431630 -> 1000004401620: and by changing {(Transl+i+ja+i)} to {(Transl+ku+ja+ku)} in the case of {(Transl+keiyōshi+ja+keiyōshi)}: 10431640 -> 1000004401630: {(Lang+熱くなる+ja+熱くなる)} {(Transl+atsuku naru+ja+atsuku naru)} "become hot". 10431650 -> 1000004401640: The grammatical function of nouns is indicated by postpositions, also called particles. 10431660 -> 1000004401650: These include for example: 10431670 -> 1000004401660: {(Lang+が+ja+が)} {(Transl+ga+ja+ga)} for the nominative case. 10431680 -> 1000004401670: Not necessarily a subject. 10431690 -> 1000004401680: {(Lang+彼がやった。+ja+彼がやった。)}{(Transl+Kare ga yatta.+ja+Kare ga yatta.)} 10431700 -> 1000004401690: "He did it." 10431710 -> 1000004401700: {(Lang+に+ja+に)} {(Transl+ni+ja+ni)} for the dative case. 10431720 -> 1000004401710: {(Lang+田中さんにあげて下さい。+ja+田中さんにあげて下さい。)} {(Transl+Tanaka-san ni agete kudasai+ja+Tanaka-san ni agete kudasai)} "Please give it to Mr. Tanaka." 10431730 -> 1000004401720: It is also used for the lative case, indicating a motion to a location. 10431740 -> 1000004401730: {(Lang+日本 に行きたい。+ja+日本 に行きたい。)} {(Transl+Nihon ni ikitai+ja+Nihon ni ikitai)} "I want to go to Japan." 10431750 -> 1000004401740: {(Lang+の+ja+の)} {(Transl+no+ja+no)} for the genitive case, or nominalizing phrases. 10431760 -> 1000004401750: {(Lang+私のカメラ。+ja+私のカメラ。)} {(Transl+watashi no kamera+ja+watashi no kamera)} "my camera" 10431770 -> 1000004401760: {(Lang+スキーに行くのが好きです。+ja+スキーに行くのが好きです。)} {(Transl+Sukī-ni iku no ga suki desu+ja+Sukī-ni iku no ga suki desu)} "(I) like going skiing." 10431780 -> 1000004401770: {(Lang+を+ja+を)} {(Transl+o+ja+o)} for the accusative case. 10431790 -> 1000004401780: Not necessarily an object. 10431800 -> 1000004401790: {(Lang+何を食べますか。+ja+何を食べますか。)} {(Transl+Nani o tabemasu ka?+ja+Nani o tabemasu ka?)} 10431810 -> 1000004401800: "What will (you) eat?" 10431820 -> 1000004401810: {(Lang+は+ja+は)} {(Transl+wa+ja+wa)} for the topic. 10431830 -> 1000004401820: It can co-exist with case markers above except {(Transl+no+ja+no)}, and it overrides {(Transl+ga+ja+ga)} and {(Transl+o+ja+o)}. 10431840 -> 1000004401830: {(Lang+私はタイ料理がいいです。+ja+私はタイ料理がいいです。)} {(Transl+Watashi wa tai-ryōri ga ii desu.+ja+Watashi wa tai-ryōri ga ii desu.)} 10431850 -> 1000004401840: "As for me, Thai food is good." 10431860 -> 1000004401850: The nominative marker {(Transl+ga+ja+ga)} after {(Transl+watashi+ja+watashi)} is hidden under {(Transl+wa+ja+wa)}. 10431865 -> 1000004401860: (Note that English generally makes no distinction between sentence topic and subject.) 
10431867 -> 1000004401870: Note: The difference between {(Transl+wa+ja+wa)} and {(Transl+ga+ja+ga)} goes beyond the English distinction between sentence topic and subject. 10431870 -> 1000004401880: While {(Transl+wa+ja+wa)} indicates the topic, which the rest of the sentence describes or acts upon, it carries the implication that the subject indicated by {(Transl+wa+ja+wa)} is not unique, or may be part of a larger group. 10431880 -> 1000004401890: {(Transl+Ikeda-san wa yonjū-ni sai da.+ja+Ikeda-san wa yonjū-ni sai da.)} 10431890 -> 1000004401900: "As for Mr. Ikeda, he is forty-two years old." 10431900 -> 1000004401910: Others in the group may also be of that age. 10431910 -> 1000004401920: Absence of {(Transl+wa+ja+wa)} often means the subject is the focus of the sentence. 10431920 -> 1000004401930: {(Transl+Ikeda-san ga yonjū-ni sai da.+ja+Ikeda-san ga yonjū-ni sai da.)} 10431930 -> 1000004401940: "It is Mr. Ikeda who is forty-two years old." 10431940 -> 1000004401950: This is a reply to an implicit or explicit question who in this group is forty-two years old. 10431950 -> 1000004401960: Politeness 10431960 -> 1000004401970: Unlike most western languages, Japanese has an extensive grammatical system to express politeness and formality. 10431970 -> 1000004401980: Most relationships are not equal in Japanese society. 10431980 -> 1000004401990: The differences in social position are determined by a variety of factors including job, age, experience, or even psychological state (e.g., a person asking a favour tends to do so politely). 10431990 -> 1000004402000: The person in the lower position is expected to use a polite form of speech, whereas the other might use a more plain form. 10432000 -> 1000004402010: Strangers will also speak to each other politely. 10432010 -> 1000004402020: Japanese children rarely use polite speech until they are teens, at which point they are expected to begin speaking in a more adult manner. 10432020 -> 1000004402030: See uchi-soto. 10432030 -> 1000004402040: Whereas {(Transl+teineigo+ja+teineigo)} ({(Lang+丁寧語+ja+丁寧語)}) (polite language) is commonly an inflectional system, {(Transl+sonkeigo+ja+sonkeigo)} ({(Lang+尊敬語+ja+尊敬語)}) (respectful language) and {(Transl+kenjōgo+ja+kenjōgo)} ({(Lang+謙譲語+ja+謙譲語)}) (humble language) often employ many special honorific and humble alternate verbs: {(Transl+iku+ja+iku)} "go" becomes {(Transl+ikimasu+ja+ikimasu)} in polite form, but is replaced by {(Transl+irassharu+ja+irassharu)} in honorific speech and {(Transl+ukagau+ja+ukagau)} or {(Transl+mairu+ja+mairu)} in humble speech. 10432040 -> 1000004402050: The difference between honorific and humble speech is particularly pronounced in the Japanese language. 10432050 -> 1000004402060: Humble language is used to talk about oneself or one's own group (company, family) whilst honorific language is mostly used when describing the interlocutor and his/her group. 10432060 -> 1000004402070: For example, the {(Transl+-san+ja+-san)} suffix ("Mr" "Mrs." or "Miss") is an example of honorific language. 10432070 -> 1000004402080: It is not used to talk about oneself or when talking about someone from one's company to an external person, since the company is the speaker's "group". 10432080 -> 1000004402090: When speaking directly to one's superior in one's company or when speaking with other employees within one's company about a superior, a Japanese person will use vocabulary and inflections of the honorific register to refer to the in-group superior and his or her speech and actions. 
10432090 -> 1000004402100: When speaking to a person from another company (i.e., a member of an out-group), however, a Japanese person will use the plain or the humble register to refer to the speech and actions of his or her own in-group superiors. 10432100 -> 1000004402110: In short, the register used in Japanese to refer to the person, speech, or actions of any particular individual varies depending on the relationship (either in-group or out-group) between the speaker and listener, as well as depending on the relative status of the speaker, listener, and third-person referents. 10432110 -> 1000004402120: For this reason, the Japanese system for explicit indication of social register is known as a system of "relative honorifics." 10432120 -> 1000004402130: This stands in stark contrast to the Korean system of "absolute honorifics," in which the same register is used to refer to a particular individual (e.g. one's father, one's company president, etc.) in any context regardless of the relationship between the speaker and interlocutor. 10432130 -> 1000004402140: Thus, polite Korean speech can sound very presumptuous when translated verbatim into Japanese, as in Korean it is acceptable and normal to say things like "Our Mr. Company-President..." when communicating with a member of an out-group, which would be very inappropriate in a Japanese social context. 10432140 -> 1000004402150: Most nouns in the Japanese language may be made polite by the addition of {(Transl+o-+ja+o-)} or {(Transl+go-+ja+go-)} as a prefix. 10432145 -> 1000004402160: {(Transl+o-+ja+o-)} is generally used for words of native Japanese origin, whereas {(Transl+go-+ja+go-)} is affixed to words of Chinese derivation. 10432150 -> 1000004402170: In some cases, the prefix has become a fixed part of the word, and is included even in regular speech, such as {(Transl+gohan+ja+gohan)} 'cooked rice; meal.' 10432160 -> 1000004402180: Such a construction often indicates deference to either the item's owner or to the object itself. 10432170 -> 1000004402190: For example, the word {(Transl+tomodachi+ja+tomodachi)} 'friend,' would become {(Transl+o-tomodachi+ja+o-tomodachi)} when referring to the friend of someone of higher status (though mothers often use this form to refer to their children's friends). 10432180 -> 1000004402200: On the other hand, a polite speaker may sometimes refer to {(Transl+mizu+ja+mizu)} 'water' as {(Transl+o-mizu+ja+o-mizu)} in order to show politeness. 10432190 -> 1000004402210: Most Japanese people employ politeness to indicate a lack of familiarity. 10432200 -> 1000004402220: That is, they use polite forms for new acquaintances, but if a relationship becomes more intimate, they no longer use them. 10432210 -> 1000004402230: This occurs regardless of age, social class, or gender. 10432220 -> 1000004402240: Vocabulary 10432230 -> 1000004402250: The original language of Japan, or at least the original language of a certain population that was ancestral to a significant portion of the historical and present Japanese nation, was the so-called {(Transl+yamato kotoba+ja+yamato kotoba)} ({(Lang+大和言葉+ja+大和言葉)} or infrequently {(Lang+大和詞+ja+大和詞)}, i.e. "Yamato words"), which in scholarly contexts is sometimes referred to as {(Transl+wa-go+ja+wa-go)} ({(Lang+和語+ja+和語)} or rarely {(Lang+倭語+ja+倭語)}, i.e. the {(Transl+"Wa+ja+"Wa)} words"). 
10432240 -> 1000004402260: In addition to words from this original language, present-day Japanese includes a great number of words that were either borrowed from Chinese or constructed from Chinese roots following Chinese patterns. 10432250 -> 1000004402270: These words, known as {(Transl+kango+ja+kango)} ({(Lang+漢語+ja+漢語)}), entered the language from the fifth century onwards via contact with Chinese culture. 10432260 -> 1000004402280: According to a Japanese dictionary Shinsen-kokugojiten (新選国語辞典), Chinese-based words comprise 49.1% of the total vocabulary, Wago is 33.8% and other foreign words are 8.8%. 10432270 -> 1000004402290: Like Latin-derived words in English, {(Transl+kango+ja+kango)} words typically are perceived as somewhat formal or academic compared to equivalent Yamato words. 10432280 -> 1000004402300: Indeed, it is generally fair to say that an English word derived from Latin/French roots typically corresponds to a Sino-Japanese word in Japanese, whereas a simpler Anglo-Saxon word would best be translated by a Yamato equivalent. 10432290 -> 1000004402310: A much smaller number of words has been borrowed from Korean and Ainu. 10432300 -> 1000004402320: Japan has also borrowed a number of words from other languages, particularly ones of European extraction, which are called {(Transl+gairaigo+ja+gairaigo)}. 10432310 -> 1000004402330: This began with borrowings from Portuguese in the 16th century, followed by borrowing from Dutch during Japan's long isolation of the Edo period. 10432320 -> 1000004402340: With the Meiji Restoration and the reopening of Japan in the 19th century, borrowing occurred from German, French and English. 10432330 -> 1000004402350: Currently, words of English origin are the most commonly borrowed. 10432340 -> 1000004402360: In the Meiji era, the Japanese also coined many neologisms using Chinese roots and morphology to translate Western concepts. 10432350 -> 1000004402370: The Chinese and Koreans imported many of these pseudo-Chinese words into Chinese, Korean, and Vietnamese via their kanji in the late 19th and early 20th centuries. 10432360 -> 1000004402380: For example, {(Lang+政治+ja+政治)} {(Transl+seiji+ja+seiji)} ("politics"), and {(Lang+化学+ja+化学)} {(Transl+kagaku+ja+kagaku)} ("chemistry") are words derived from Chinese roots that were first created and used by the Japanese, and only later borrowed into Chinese and other East Asian languages. 10432370 -> 1000004402390: As a result, Japanese, Chinese, Korean, and Vietnamese share a large common corpus of vocabulary in the same way a large number of Greek- and Latin-derived words are shared among modern European languages, although many academic words formed from such roots were certainly coined by native speakers of other languages, such as English. 10432380 -> 1000004402400: In the past few decades, {(Transl+wasei-eigo+ja+wasei-eigo)} (made-in-Japan English) has become a prominent phenomenon. 10432390 -> 1000004402410: Words such as {(Transl+wanpatān+ja+wanpatān)} {(Lang+ワンパターン+ja+ワンパターン)} (< one + pattern, "to be in a rut", "to have a one-track mind") and {(Transl+sukinshippu+ja+sukinshippu)} {(Lang+スキンシップ+ja+スキンシップ)} (< skin + -ship, "physical contact"), although coined by compounding English roots, are nonsensical in most non-Japanese contexts; exceptions exist in nearby languages such as Korean however, which often use words such as skinship and rimokon (remote control) in the same way as in Japanese. 
10432400 -> 1000004402420: Additionally, many native Japanese words have become commonplace in English, due to the popularity of many Japanese cultural exports. 10432410 -> 1000004402430: Words such as futon, haiku, judo, kamikaze, karaoke, karate, ninja, origami, rickshaw (from {(Lang+人力車+ja+人力車)} {(Transl+jinrikisha+ja+jinrikisha)}), samurai, sayonara, sumo, sushi, tsunami, tycoon and many others have become part of the English language. 10432420 -> 1000004402440: See list of English words of Japanese origin for more. 10432430 -> 1000004402450: Writing system 10432440 -> 1000004402460: Literacy was introduced to Japan in the form of the Chinese writing system, by way of Baekje before the 5th century. 10432450 -> 1000004402470: Using this language, the Japanese emperor Yūryaku sent a letter to a Chinese emperor Liu Song in 478 CE. 10432460 -> 1000004402480: After the ruin of Baekje, Japan invited scholars from China to learn more of the Chinese writing system. 10432470 -> 1000004402490: Japanese Emperors gave an official rank to Chinese scholars (続守言/薩弘格/袁晋卿) and spread the use of Chinese characters from the 7th century to the 8th century. 10432480 -> 1000004402500: At first, the Japanese wrote in Classical Chinese, with Japanese names represented by characters used for their meanings and not their sounds. 10432490 -> 1000004402510: Later, during the seventh century CE, the Chinese-sounding phoneme principle was used to write pure Japanese poetry and prose (comparable to Akkadian's retention of Sumerian cuneiform), but some Japanese words were still written with characters for their meaning and not the original Chinese sound. 10432500 -> 1000004402520: This is when the history of Japanese as a written language begins in its own right. 10432510 -> 1000004402530: By this time, the Japanese language was already distinct from the Ryukyuan languages. 10432520 -> 1000004402540: The Korean settlers and their descendants used Kudara-on or Baekje pronunciation (百済音), which was also called Tsushima-pronunciation (対馬音) or Go-on (呉音). 10432530 -> 1000004402550: An example of this mixed style is the Kojiki, which was written in 712 AD. 10432540 -> 1000004402560: They then started to use Chinese characters to write Japanese in a style known as {(Transl+man'yōgana+ja+man'yōgana)}, a syllabic script which used Chinese characters for their sounds in order to transcribe the words of Japanese speech syllable by syllable. 10432550 -> 1000004402570: Over time, a writing system evolved. 10432560 -> 1000004402580: Chinese characters (kanji) were used to write either words borrowed from Chinese, or Japanese words with the same or similar meanings. 10432570 -> 1000004402590: Chinese characters were also used to write grammatical elements, were simplified, and eventually became two syllabic scripts: hiragana and katakana. 10432580 -> 1000004402600: Modern Japanese is written in a mixture of three main systems: kanji, characters of Chinese origin used to represent both Chinese loanwords into Japanese and a number of native Japanese morphemes; and two syllabaries: hiragana and katakana. 10432590 -> 1000004402610: The Latin alphabet is also sometimes used. 10432600 -> 1000004402620: Arabic numerals are much more common than the kanji when used in counting, but kanji numerals are still used in compounds, such as {(Lang+統一+ja+統一)} {(Transl+tōitsu+ja+tōitsu)} ("unification"). 
10432610 -> 1000004402630: Hiragana are used for words without kanji representation, for words no longer written in kanji, and also following kanji to show conjugational endings. 10432620 -> 1000004402640: Because of the way verbs (and adjectives) in Japanese are conjugated, kanji alone cannot fully convey Japanese tense and mood, as kanji cannot be subject to variation when written without losing its meaning. 10432630 -> 1000004402650: For this reason, hiragana are suffixed to the ends of kanji to show verb and adjective conjugations. 10432640 -> 1000004402660: Hiragana used in this way are called okurigana. 10432650 -> 1000004402670: Hiragana are also written in a superscript called furigana above or beside a kanji to show the proper reading. 10432660 -> 1000004402680: This is done to facilitate learning, as well as to clarify particularly old or obscure (or sometimes invented) readings. 10432670 -> 1000004402690: Katakana, like hiragana, are a syllabary; katakana are primarily used to write foreign words, plant and animal names, and for emphasis. 10432680 -> 1000004402700: For example "Australia" has been adapted as {(Transl+Ōsutoraria+ja+Ōsutoraria)} ({(Lang+オーストラリア+ja+オーストラリア)}), and "supermarket" has been adapted and shortened into {(Transl+sūpā+ja+sūpā)} ({(Lang+スーパー+ja+スーパー)}). 10432690 -> 1000004402710: The Latin alphabet (in Japanese referred to as Rōmaji ({(Lang+ローマ字+ja+ローマ字)}), literally "Roman letters") is used for some loan words like "CD" and "DVD", and also for some Japanese creations like "Sony". 10432700 -> 1000004402720: Historically, attempts to limit the number of kanji in use commenced in the mid-19th century, but did not become a matter of government intervention until after Japan's defeat in the Second World War. 10432710 -> 1000004402730: During the period of post-war occupation (and influenced by the views of some U.S. officials), various schemes including the complete abolition of kanji and exclusive use of rōmaji were considered. 10432720 -> 1000004402740: The {(Transl+jōyō kanji+ja+jōyō kanji)} ("common use kanji", originally called {(Transl+tōyō kanji+ja+tōyō kanji)} [kanji for general use]) scheme arose as a compromise solution. 10432730 -> 1000004402750: Japanese students begin to learn kanji from their first year at elementary school. 10432740 -> 1000004402760: A guideline created by the Japanese Ministry of Education, the list of {(Transl+kyōiku kanji+ja+kyōiku kanji)} ("education kanji", a subset of {(Transl+jōyō kanji+ja+jōyō kanji)}), specifies the 1,006 simple characters a child is to learn by the end of sixth grade. 10432750 -> 1000004402770: Children continue to study another 939 characters in junior high school, covering in total 1,945 {(Transl+jōyō kanji+ja+jōyō kanji)}. 10432760 -> 1000004402780: The official list of {(Transl+jōyō kanji+ja+jōyō kanji)} was revised several times, but the total number of officially sanctioned characters remained largely unchanged. 10432770 -> 1000004402790: As for kanji for personal names, the circumstances are somewhat complicated. 10432780 -> 1000004402800: {(Transl+Jōyō kanji+ja+Jōyō kanji)} and {(Transl+jinmeiyō kanji+ja+jinmeiyō kanji)} (an appendix of additional characters for names) are approved for registering personal names. 10432790 -> 1000004402810: Names containing unapproved characters are denied registration. 
10432800 -> 1000004402820: However, as with the list of {(Transl+jōyō kanji+ja+jōyō kanji)}, criteria for inclusion were often arbitrary and led to many common and popular characters being disapproved for use. 10432810 -> 1000004402830: Under popular pressure and following a court decision holding the exclusion of common characters unlawful, the list of {(Transl+jinmeiyō kanji+ja+jinmeiyō kanji)} was substantially extended from 92 in 1951 (the year it was first decreed) to 983 in 2004. 10432820 -> 1000004402840: Furthermore, families whose names are not on these lists were permitted to continue using the older forms. 10432830 -> 1000004402850: Many writers rely on newspaper circulation to publish their work with officially sanctioned characters. 10432840 -> 1000004402860: This distribution method is more efficient than traditional pen and paper publications. 10432850 -> 1000004402870: Study by non-native speakers 10432860 -> 1000004402880: Many major universities throughout the world provide Japanese language courses, and a number of secondary and even primary schools worldwide offer courses in the language. 10432870 -> 1000004402890: International interest in the Japanese language dates from the 1800s but has become more prevalent following Japan's economic bubble of the 1980s and the global popularity of Japanese pop culture (such as anime and video games) since the 1990s. 10432880 -> 1000004402900: About 2.3 million people studied the language worldwide in 2003: 900,000 South Koreans, 389,000 Chinese, 381,000 Australians, and 140,000 Americans studied Japanese in lower and higher educational institutions. 10432890 -> 1000004402910: In Japan, more than 90,000 foreign students studied at Japanese universities and Japanese language schools in 2003, including 77,000 Chinese and 15,000 South Koreans. 10432900 -> 1000004402920: In addition, local governments and some NPO groups provide free Japanese language classes for foreign residents, including Japanese Brazilians and foreigners married to Japanese nationals. 10432910 -> 1000004402930: In the United Kingdom, studies are supported by the British Association for Japanese Studies. 10432920 -> 1000004402940: In Ireland, Japanese is offered as a language in the Leaving Certificate in some schools. 10432930 -> 1000004402950: The Japanese government provides standardised tests to measure spoken and written comprehension of Japanese for second-language learners; the most prominent is the Japanese Language Proficiency Test (JLPT). 10432940 -> 1000004402960: The Japanese External Trade Organisation (JETRO) organises the Business Japanese Proficiency Test, which tests the learner's ability to understand Japanese in a business setting. 10432950 -> 1000004402970: When learning Japanese in a college setting, students are usually first taught how to pronounce rōmaji. 10432960 -> 1000004402980: From that point, they are taught the two main syllabaries, with kanji usually being introduced in the second semester. 10432970 -> 1000004402990: Focus is usually placed first on polite (distal) speech, since students who go on to interact with native speakers will be expected to use it. 10432980 -> 1000004403000: Casual speech, formal speech, and the use of honorifics are usually taught after polite speech. Java (programming language) 10440010 -> 1000004500020: Java (programming language) 10440020 -> 1000004500030: Java is a programming language originally developed by Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform.
10440030 -> 1000004500040: The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities. 10440040 -> 1000004500050: Java applications are typically compiled to bytecode that can run on any Java virtual machine (JVM) regardless of computer architecture. 10440050 -> 1000004500060: The original and reference implementation Java compilers, virtual machines, and class libraries were developed by Sun from 1995. 10440060 -> 1000004500070: As of May 2007, in compliance with the specifications of the Java Community Process, Sun made available most of their Java technologies as free software under the GNU General Public License. 10440070 -> 1000004500080: Others have also developed alternative implementations of these Sun technologies, such as the GNU Compiler for Java and GNU Classpath. 10440080 -> 1000004500090: History 10440090 -> 1000004500100: The Java language was created by James Gosling in June 1991 for use in one of his many set-top box projects. 10440100 -> 1000004500110: The language was initially called Oak, after an oak tree that stood outside Gosling's office—and also went by the name Green—and ended up later being renamed to Java, from a list of random words. 10440110 -> 1000004500120: Gosling's goals were to implement a virtual machine and a language that had a familiar C/C++ style of notation. 10440120 -> 1000004500130: The first public implementation was Java 1.0 in 1995. 10440130 -> 1000004500140: It promised "Write Once, Run Anywhere" (WORA), providing no-cost runtimes on popular platforms. 10440140 -> 1000004500150: It was fairly secure and its security was configurable, allowing network and file access to be restricted. 10440150 -> 1000004500160: Major web browsers soon incorporated the ability to run secure Java applets within web pages. 10440160 -> 1000004500170: Java quickly became popular. 10440170 -> 1000004500180: With the advent of Java 2, new versions had multiple configurations built for different types of platforms. 10440180 -> 1000004500190: For example, J2EE was for enterprise applications and the greatly stripped down version J2ME was for mobile applications. 10440190 -> 1000004500200: J2SE was the designation for the Standard Edition. 10440200 -> 1000004500210: In 2006, for marketing purposes, new J2 versions were renamed Java EE, Java ME, and Java SE, respectively. 10440210 -> 1000004500220: In 1997, Sun Microsystems approached the ISO/IEC JTC1 standards body and later the Ecma International to formalize Java, but it soon withdrew from the process. 10440220 -> 1000004500230: Java remains a de facto standard that is controlled through the Java Community Process. 10440230 -> 1000004500240: At one time, Sun made most of its Java implementations available without charge although they were proprietary software. 10440240 -> 1000004500250: Sun's revenue from Java was generated by the selling of licenses for specialized products such as the Java Enterprise System. 10440250 -> 1000004500260: Sun distinguishes between its Software Development Kit (SDK) and Runtime Environment (JRE) that is a subset of the SDK, the primary distinction being that in the JRE, the compiler, utility programs, and many necessary header files are not present. 10440260 -> 1000004500270: On 13 November 2006, Sun released much of Java as free and open-source software under the terms of the GNU General Public License (GPL). 
10440270 -> 1000004500280: On 8 May 2007 Sun finished the process, making all of Java's core code free and open-source, aside from a small portion of code to which Sun did not hold the copyright. 10440280 -> 1000004500290: Philosophy 10440290 -> 1000004500300: Primary goals 10440300 -> 1000004500310: There were five primary goals in the creation of the Java language: 10440310 -> 1000004500320: It should use the object-oriented programming methodology. 10440320 -> 1000004500330: It should allow the same program to be executed on multiple operating systems. 10440330 -> 1000004500340: It should contain built-in support for using computer networks. 10440340 -> 1000004500350: It should be designed to execute code from remote sources securely. 10440350 -> 1000004500360: It should be easy to use by selecting what were considered the good parts of other object-oriented languages. 10440360 -> 1000004500370: Platform independence 10440370 -> 1000004500380: One characteristic, platform independence, means that programs written in the Java language must run similarly on any supported hardware/operating-system platform. 10440380 -> 1000004500390: One should be able to write a program once, compile it once, and run it anywhere. 10440390 -> 1000004500400: This is achieved by most Java compilers by compiling the Java language code halfway (to Java bytecode) – simplified machine instructions specific to the Java platform. 10440400 -> 1000004500410: The code is then run on a virtual machine (VM), a program written in native code on the host hardware that interprets and executes generic Java bytecode. 10440410 -> 1000004500420: (In some JVM versions, bytecode can also be compiled to native code, either before or during program execution, resulting in faster execution.) 10440420 -> 1000004500430: Further, standardized libraries are provided to allow access to features of the host machines (such as graphics, threading and networking) in unified ways. 10440430 -> 1000004500440: Note that, although there is an explicit compiling stage, at some point, the Java bytecode is interpreted or converted to native machine code by the JIT compiler. 10440440 -> 1000004500450: The first implementations of the language used an interpreted virtual machine to achieve portability. 10440450 -> 1000004500460: These implementations produced programs that ran slower than programs compiled to native executables, for instance written in C or C++, so the language suffered a reputation for poor performance. 10440460 -> 1000004500470: More recent JVM implementations produce programs that run significantly faster than before, using multiple techniques. 10440470 -> 1000004500480: One technique, known as just-in-time compilation (JIT), translates the Java bytecode into native code at the time that the program is run, which results in a program that executes faster than interpreted code but also incurs compilation overhead during execution. 10440480 -> 1000004500490: More sophisticated VMs use dynamic recompilation, in which the VM can analyze the behavior of the running program and selectively recompile and optimize critical parts of the program. 10440490 -> 1000004500500: Dynamic recompilation can achieve optimizations superior to static compilation because the dynamic compiler can base optimizations on knowledge about the runtime environment and the set of loaded classes, and can identify the hot spots (parts of the program, often inner loops, that take up the most execution time). 
10440500 -> 1000004500510: JIT compilation and dynamic recompilation allow Java programs to take advantage of the speed of native code without losing portability. 10440510 -> 1000004500520: Another technique, commonly known as static compilation, is to compile directly into native code like a more traditional compiler. 10440520 -> 1000004500530: Static Java compilers, such as GCJ, translate the Java language code to native object code, removing the intermediate bytecode stage. 10440530 -> 1000004500540: This achieves good performance compared to interpretation, but at the expense of portability; the output of these compilers can only be run on a single architecture. 10440540 -> 1000004500550: Some see avoiding the VM in this manner as defeating the point of developing in Java; however, it can be useful to provide both a generic bytecode version and an optimised native code version of an application. 10440550 -> 1000004500560: Implementations 10440560 -> 1000004500570: Sun Microsystems officially licenses the Java Standard Edition platform for Microsoft Windows, Linux, and Solaris. 10440570 -> 1000004500580: Through a network of third-party vendors and licensees, alternative Java environments are available for these and other platforms. 10440580 -> 1000004500590: To qualify as a certified Java licensee, an implementation on any particular platform must pass a rigorous suite of validation and compatibility tests. 10440590 -> 1000004500600: This method enables a guaranteed level of compliance and platform compatibility through a trusted set of commercial and non-commercial partners. 10440600 -> 1000004500610: Sun's trademark license for usage of the Java brand insists that all implementations be "compatible". 10440610 -> 1000004500620: This resulted in a legal dispute with Microsoft after Sun claimed that the Microsoft implementation did not support the RMI and JNI interfaces and had added platform-specific features of its own. 10440620 -> 1000004500630: Sun sued in 1997, and in 2001 won a settlement of $20 million as well as a court order enforcing the terms of the license from Sun. 10440630 -> 1000004500640: As a result, Microsoft no longer ships Java with Windows, and in recent versions of Windows, Internet Explorer cannot support Java applets without a third-party plugin. 10440640 -> 1000004500650: However, Sun and others have made Java run-time systems available at no cost for those and other versions of Windows. 10440650 -> 1000004500660: Platform-independent Java is essential to the Java Enterprise Edition strategy, and an even more rigorous validation is required to certify an implementation. 10440660 -> 1000004500670: This environment enables portable server-side applications, such as Web services, servlets, and Enterprise JavaBeans, as well as embedded systems based on OSGi, using Embedded Java environments. 10440670 -> 1000004500680: Through the new GlassFish project, Sun is working to create a fully functional, unified open-source implementation of the Java EE technologies. 10440680 -> 1000004500690: Automatic memory management 10440690 -> 1000004500700: One of the ideas behind Java's automatic memory management model is that programmers should be spared the burden of having to perform manual memory management. 10440700 -> 1000004500710: In some languages, the programmer allocates memory to create objects stored on the heap, and the responsibility of later deallocating that memory also resides with the programmer.
10440710 -> 1000004500720: If the programmer forgets to deallocate memory or writes code that fails to do so, a memory leak occurs and the program can consume an arbitrarily large amount of memory. 10440720 -> 1000004500730: Additionally, if the program attempts to deallocate the region of memory more than once, the result is undefined and the program may become unstable and may crash. 10440730 -> 1000004500740: Finally, in non garbage collected environments, there is a certain degree of overhead and complexity of user-code to track and finalize allocations. 10440740 -> 1000004500750: Often developers may box themselves into certain designs to provide reasonable assurances that memory leaks will not occur. 10440750 -> 1000004500760: In Java, this potential problem is avoided by automatic garbage collection. 10440760 -> 1000004500770: The programmer determines when objects are created, and the Java runtime is responsible for managing the object's lifecycle. 10440770 -> 1000004500780: The program or other objects can reference an object by holding a reference to it (which, from a low-level point of view, is its address on the heap). 10440780 -> 1000004500790: When no references to an object remain, the unreachable object is eligible for release by the Java garbage collector - it may be freed automatically by the garbage collector at any time. 10440790 -> 1000004500800: Memory leaks may still occur if a programmer's code holds a reference to an object that is no longer needed—in other words, they can still occur but at higher conceptual levels. 10440800 -> 1000004500810: The use of garbage collection in a language can also affect programming paradigms. 10440810 -> 1000004500820: If, for example, the developer assumes that the cost of memory allocation/recollection is low, they may choose to more freely construct objects instead of pre-initializing, holding and reusing them. 10440820 -> 1000004500830: With the small cost of potential performance penalties (inner-loop construction of large/complex objects), this facilitates thread-isolation (no need to synchronize as different threads work on different object instances) and data-hiding. 10440830 -> 1000004500840: The use of transient immutable value-objects minimizes side-effect programming. 10440840 -> 1000004500850: Comparing Java and C++, it is possible in C++ to implement similar functionality (for example, a memory management model for specific classes can be designed in C++ to improve speed and lower memory fragmentation considerably), with the possible cost of adding comparable runtime overhead to that of Java's garbage collector, and of added development time and application complexity if one favors manual implementation over using an existing third-party library. 10440850 -> 1000004500860: In Java, garbage collection is built-in and virtually invisible to the developer. 10440860 -> 1000004500870: That is, developers may have no notion of when garbage collection will take place as it may not necessarily correlate with any actions being explicitly performed by the code they write. 10440870 -> 1000004500880: Depending on intended application, this can be beneficial or disadvantageous: the programmer is freed from performing low-level tasks, but at the same time loses the option of writing lower level code. 10440880 -> 1000004500890: Additionally, the garbage collection capability demands some attention to tuning the JVM, as large heaps will cause apparently random stalls in performance. 
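As a brief illustration of the reachability rule described above, here is a minimal sketch (class and variable names are illustrative, not from the article): an object becomes eligible for collection once the last reference to it is dropped, but the collector reclaims it at a time of its own choosing.

```java
public class GcSketch {
    public static void main(String[] args) {
        // The programmer decides when the object is created.
        StringBuilder report = new StringBuilder("large temporary data");
        System.out.println(report.length());

        // Clearing the only reference makes the object unreachable.
        // It is now merely eligible for collection; the garbage collector
        // may reclaim it at any later time it chooses.
        report = null;

        // System.gc() is only a hint to the JVM, never a guarantee.
        System.gc();
    }
}
```

Holding the reference in a long-lived field or collection instead would keep the object reachable, which is how the higher-level leaks mentioned above arise.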
10440890 -> 1000004500900: Java does not support pointer arithmetic as is supported in, for example, C++. 10440900 -> 1000004500910: This is because the garbage collector may relocate referenced objects, invalidating such pointers. 10440910 -> 1000004500920: Another reason that Java forbids this is that type safety and security can no longer be guaranteed if arbitrary manipulation of pointers is allowed. 10440920 -> 1000004500930: Syntax 10440930 -> 1000004500940: The syntax of Java is largely derived from C++. 10440940 -> 1000004500950: Unlike C++, which combines the syntax for structured, generic, and object-oriented programming, Java was built exclusively as an object oriented language. 10440950 -> 1000004500960: As a result, almost everything is an object and all code is written inside a class. 10440960 -> 1000004500970: The exceptions are the intrinsic data types (ordinal and real numbers, boolean values, and characters), which are not classes for performance reasons. 10440970 -> 1000004500980: Hello, world program 10440980 -> 1000004500990: This is a minimal Hello world program in Java with syntax highlighting: 10440990 -> None: 10441000 -> 1000004501000: To execute a Java program, the code is saved as a file named Hello.java. 10441010 -> 1000004501010: It must first be compiled into bytecode using a Java compiler, which produces a file named Hello.class. 10441020 -> 1000004501020: This class is then launched. 10441030 -> 1000004501030: The above example merits a bit of explanation. 10441040 -> 1000004501040: All executable statements in Java are written inside a class, including stand-alone programs. 10441050 -> 1000004501050: Source files are by convention named the same as the class they contain, appending the mandatory suffix .java. 10441060 -> 1000004501060: A class that is declared public is required to follow this convention. 10441070 -> 1000004501070: (In this case, the class Hello is public, therefore the source must be stored in a file called Hello.java). 10441080 -> 1000004501080: The compiler will generate a class file for each class defined in the source file. 10441090 -> 1000004501090: The name of the class file is the name of the class, with .class appended. 10441100 -> 1000004501100: For class file generation, anonymous classes are treated as if their name was the concatenation of the name of their enclosing class, a $, and an integer. 10441110 -> 1000004501110: The keyword public denotes that a method can be called from code in other classes, or that a class may be used by classes outside the class hierarchy. 10441120 -> 1000004501120: The keyword static indicates that the method is a static method, associated with the class rather than object instances. 10441130 -> 1000004501130: The keyword void indicates that the main method does not return any value to the caller. 10441140 -> 1000004501140: The method name "main" is not a keyword in the Java language. 10441150 -> 1000004501150: It is simply the name of the method the Java launcher calls to pass control to the program. 10441160 -> 1000004501160: Java classes that run in managed environments such as applets and Enterprise Java Beans do not use or need a main() method. 10441170 -> 1000004501170: The main method must accept an array of {(Javadoc:SE+ String+java/lang+String)} objects. 10441180 -> 1000004501180: By convention, it is referenced as args although any other legal identifier name can be used. 
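A minimal version of the Hello world program described above (the exact greeting text is assumed to be the conventional one):

```java
// Save as Hello.java, compile with "javac Hello.java" (producing Hello.class),
// then launch with "java Hello".
public class Hello {
    public static void main(String[] args) {
        // System.out is a PrintStream; println writes the text plus a newline.
        System.out.println("Hello, world!");
    }
}
```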
10441190 -> 1000004501190: Since Java 5, the main method can also use variable arguments, in the form of public static void main(String... args), allowing the main method to be invoked with an arbitrary number of String arguments. 10441200 -> 1000004501200: The effect of this alternate declaration is semantically identical (the args parameter is still an array of String objects), but allows an alternate syntax for creating and passing the array. 10441210 -> 1000004501210: The Java launcher launches Java by loading a given class (specified on the command line) and starting its public static void main(String[]) method. 10441220 -> 1000004501220: Stand-alone programs must declare this method explicitly. 10441230 -> 1000004501230: The String[] args parameter is an array of {(Javadoc:SE+ String+java/lang+String)} objects containing any arguments passed to the class. 10441240 -> 1000004501240: The parameters to main are often passed by means of a command line. 10441250 -> 1000004501250: The printing facility is part of the Java standard library: The {(Javadoc:SE+ System+java/lang+System)} class defines a public static field called {(Javadoc:SE+ out+name=out+java/lang+System+out)}. 10441260 -> 1000004501260: The out object is an instance of the {(Javadoc:SE+ PrintStream+java/io+PrintStream)} class and provides the method {(Javadoc:SE+ println(String)+name=println(String)+java/io+PrintStream+println(java.lang.String))} for displaying data to the screen while creating a new line (standard out). 10441270 -> 1000004501270: A more comprehensive example 10441280 -> None: 10441290 -> 1000004501280: The import statement imports the {(Javadoc:SE+ JOptionPane+javax/swing+JOptionPane)} class from the {(Javadoc:SE+ javax.swing+package=javax.swing+javax/swing)} package. 10441300 -> 1000004501290: The OddEven class declares a single private field of type int named input. 10441310 -> 1000004501300: Every instance of the OddEven class has its own copy of the input field. 10441320 -> 1000004501310: The private declaration means that no other class can access (read or write) the input field. 10441330 -> 1000004501320: OddEven() is a public constructor. 10441340 -> 1000004501330: Constructors have the same name as the enclosing class they are declared in, and unlike a method, have no return type. 10441350 -> 1000004501340: A constructor is used to initialize an object that is a newly created instance of the class. 10441360 -> 1000004501350: The dialog returns a String that is converted to an int by the {(Javadoc:SE+ Integer.parseInt(String)+java/lang+Integer+parseInt(String))} method. 10441370 -> 1000004501360: The calculate() method is declared without the static keyword. 10441380 -> 1000004501370: This means that the method is invoked using a specific instance of the OddEven class. 10441390 -> 1000004501380: (The reference used to invoke the method is passed as an undeclared parameter of type OddEven named this.) 10441400 -> 1000004501390: The method tests the expression input % 2 == 0 using the if keyword to see if the remainder of dividing the input field belonging to the instance of the class by two is zero. 10441410 -> 1000004501400: If this expression is true, then it prints Even; if this expression is false it prints Odd. 10441420 -> 1000004501410: (The input field can be equivalently accessed as this.input, which explicitly uses the undeclared this parameter.) 10441430 -> 1000004501420: OddEven number = new OddEven(); declares a local object reference variable in the main method named number. 
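A sketch of the OddEven program consistent with the description above and below; the dialog prompt and the printed messages are assumptions:

```java
import javax.swing.JOptionPane;

public class OddEven {
    // Each instance of OddEven has its own copy of this private field.
    private int input;

    // The constructor initializes the new instance: the input dialog returns
    // a String, which Integer.parseInt(String) converts to an int.
    public OddEven() {
        input = Integer.parseInt(JOptionPane.showInputDialog("Please enter a number"));
    }

    // calculate() is an instance method; the object it is invoked on is
    // passed as the undeclared "this" parameter.
    public void calculate() {
        if (input % 2 == 0) {
            System.out.println("Even"); // remainder of division by two is zero
        } else {
            System.out.println("Odd");
        }
    }

    public static void main(String[] args) {
        OddEven number = new OddEven(); // create an instance and keep a reference to it
        number.calculate();             // invoke the method on that instance
    }
}
```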
10441440 -> 1000004501430: This variable can hold a reference to an object of type OddEven. 10441450 -> 1000004501440: The declaration initializes number by first creating an instance of the OddEven class, using the new keyword and the OddEven() constructor, and then assigning this instance to the variable. 10441460 -> 1000004501450: The statement number.calculate(); calls the calculate method. 10441470 -> 1000004501460: The instance of the OddEven class referenced by the number local variable is used to invoke the method and is passed as the undeclared this parameter to the calculate method. 10441480 -> 1000004501470: For simplicity, error handling has been ignored in this example. 10441490 -> 1000004501480: Entering a value that is not a number will cause the program to crash. 10441500 -> 1000004501490: This can be avoided by catching and handling the {(Javadoc:SE+ NumberFormatException+java/lang+NumberFormatException)} thrown by Integer.parseInt(String). 10441510 -> 1000004501500: Applet 10441520 -> 1000004501510: Java applets are programs that are embedded in other applications, typically in a Web page displayed in a Web browser. 10441530 -> None: 10441540 -> 1000004501520: The import statements direct the Java compiler to include the {(Javadoc:SE+ java.applet.Applet+package=java.applet+java/applet+Applet)} and {(Javadoc:SE+ java.awt.Graphics+package=java.awt+java/awt+Graphics)} classes in the compilation. 10441550 -> 1000004501530: The import statement allows these classes to be referenced in the source code using the simple class name (i.e. Applet) instead of the fully qualified class name (i.e. java.applet.Applet). 10441560 -> 1000004501540: The Hello class extends (subclasses) the Applet class; the Applet class provides the framework for the host application to display and control the lifecycle of the applet. 10441570 -> 1000004501550: The Applet class is an Abstract Windowing Toolkit (AWT) {(Javadoc:SE+ Component+java/awt+Component)}, which provides the applet with the capability to display a graphical user interface (GUI) and respond to user events. 10441580 -> 1000004501560: The Hello class overrides the {(Javadoc:SE+ paint(Graphics)+name=paint(Graphics)+java/awt+Container+paint(java.awt.Graphics))} method inherited from the {(Javadoc:SE+ Container+java/awt+Container)} superclass to provide the code to display the applet. 10441590 -> 1000004501570: The paint() method is passed a Graphics object that contains the graphic context used to display the applet. 10441600 -> 1000004501580: The paint() method calls the graphic context's {(Javadoc:SE+ drawString(String, int, int)+name=drawString(String, int, int)+java/awt+Graphics+drawString(java.lang.String,%20int,%20int))} method to display the "Hello, world!" string at a pixel offset of (65, 95) from the upper-left corner of the applet's display. 10441610 -> None: 10441620 -> 1000004501590: An applet is placed in an HTML document using the HTML applet element. 10441630 -> 1000004501600: The applet tag has three attributes set: code="Hello" specifies the name of the Applet class, and width="200" height="200" set the pixel width and height of the applet. 10441640 -> 1000004501610: Applets may also be embedded in HTML using either the object or embed element, although support for these elements by Web browsers is inconsistent. 10441650 -> 1000004501620: However, the applet tag is deprecated, so the object tag is preferred where supported.
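A sketch of the applet described above, using the "Hello, world!" string and the (65, 95) offset mentioned in the text:

```java
import java.applet.Applet;
import java.awt.Graphics;

// The host application (typically a Web browser) creates and manages
// the applet through the Applet framework.
public class Hello extends Applet {
    // paint() is called whenever the applet needs to draw itself;
    // the Graphics argument is the graphic context to draw with.
    @Override
    public void paint(Graphics g) {
        g.drawString("Hello, world!", 65, 95); // offset from the upper-left corner
    }
}
```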
10441660 -> 1000004501630: The host application, typically a Web browser, instantiates the Hello applet and creates an {(Javadoc:SE+ AppletContext+java/applet+AppletContext)} for the applet. 10441670 -> 1000004501640: Once the applet has initialized itself, it is added to the AWT display hierarchy. 10441680 -> 1000004501650: The paint method is called by the AWT event dispatching thread whenever the display needs the applet to draw itself. 10441690 -> 1000004501660: Servlet 10441700 -> 1000004501670: Java Servlet technology provides Web developers with a simple, consistent mechanism for extending the functionality of a Web server and for accessing existing business systems. 10441710 -> 1000004501680: Servlets are server-side Java EE components that generate responses (typically HTML pages) to requests (typically HTTP requests) from clients. 10441720 -> 1000004501690: A servlet can almost be thought of as an applet that runs on the server side—without a face. 10441730 -> None: 10441740 -> 1000004501700: The import statements direct the Java compiler to include all of the public classes and interfaces from the {(Javadoc:SE+ java.io+package=java.io+java/io)} and {(Javadoc:EE+ javax.servlet+package=javax.servlet+javax/servlet)} packages in the compilation. 10441750 -> 1000004501710: The Hello class extends the {(Javadoc:EE+ GenericServlet+javax/servlet+GenericServlet)} class; the GenericServlet class provides the interface for the server to forward requests to the servlet and control the servlet's lifecycle. 10441760 -> 1000004501720: The Hello class overrides the {(Javadoc:EE+ service(ServletRequest, ServletResponse)+name=service(ServletRequest, ServletResponse)+javax/servlet+Servlet+service(javax.servlet.ServletRequest,javax.servlet.ServletResponse))} method defined by the {(Javadoc:EE+ Servlet+javax/servlet+Servlet)} interface to provide the code for the service request handler. 10441770 -> 1000004501730: The service() method is passed a {(Javadoc:EE+ ServletRequest+javax/servlet+ServletRequest)} object that contains the request from the client and a {(Javadoc:EE+ ServletResponse+javax/servlet+ServletResponse)} object used to create the response returned to the client. 10441780 -> 1000004501740: The service() method declares that it throws the exceptions {(Javadoc:EE+ ServletException+javax/servlet+ServletException)} and {(Javadoc:SE+ IOException+java/io+IOException)} if a problem prevents it from responding to the request. 10441790 -> 1000004501750: The {(Javadoc:EE+ setContentType(String)+name=setContentType(String)+javax/servlet+ServletResponse+setContentType(java.lang.String))} method in the response object is called to set the MIME content type of the returned data to "text/html". 10441800 -> 1000004501760: The {(Javadoc:EE+ getWriter()+name=getWriter()+javax/servlet+ServletResponse+getWriter())} method in the response returns a {(Javadoc:SE+ PrintWriter+java/io+PrintWriter)} object that is used to write the data that is sent to the client. 10441810 -> 1000004501770: The {(Javadoc:SE+ println(String)+name=println(String)+java/io+PrintWriter+println(java.lang.String))} method is called to write the "Hello, world!" string to the response and then the {(Javadoc:SE+ close()+name=close()+java/io+PrintWriter+close())} method is called to close the print writer, which causes the data that has been written to the stream to be returned to the client. 
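A sketch of the servlet described above; it assumes the Java EE servlet API is on the classpath, and the response text is assumed to be the usual greeting:

```java
import java.io.*;
import javax.servlet.*;

// A servlet has no user interface of its own; the container invokes
// service() once per incoming request.
public class Hello extends GenericServlet {
    @Override
    public void service(ServletRequest request, ServletResponse response)
            throws ServletException, IOException {
        response.setContentType("text/html");   // MIME type of the returned data
        PrintWriter pw = response.getWriter();  // writer used to build the response body
        pw.println("Hello, world!");
        pw.close();                             // returns the written data to the client
    }
}
```

In a real deployment the servlet would also be mapped to a URL in the container's configuration; that step is omitted here.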
10441820 -> 1000004501780: JavaServer Page 10441830 -> 1000004501790: JavaServer Pages (JSPs) are server-side Java EE components that generate responses, typically HTML pages, to HTTP requests from clients. 10441840 -> 1000004501800: JSPs embed Java code in an HTML page by using the special delimiters <% and %>. 10441850 -> 1000004501810: A JSP is compiled to a Java servlet, a Java application in its own right, the first time it is accessed. 10441860 -> 1000004501820: After that, the generated servlet creates the response. 10441870 -> 1000004501830: Swing application 10441880 -> 1000004501840: Swing is a graphical user interface library for the Java SE platform. 10441890 -> 1000004501850: This example Swing application creates a single window with "Hello, world!" inside: 10441900 -> None: 10441910 -> 1000004501860: The first import statement directs the Java compiler to include the {(Javadoc:SE+ BorderLayout+java/awt+BorderLayout)} class from the {(Javadoc:SE+ java.awt+package=java.awt+java/awt)} package in the compilation; the second import includes all of the public classes and interfaces from the {(Javadoc:SE+ javax.swing+package=javax.swing+javax/swing)} package. 10441920 -> 1000004501870: The Hello class extends the {(Javadoc:SE+ JFrame+javax/swing+JFrame)} class; the JFrame class implements a window with a title bar and a close control. 10441930 -> 1000004501880: The Hello() constructor initializes the frame by first calling the superclass constructor, passing the parameter "hello", which is used as the window's title. 10441940 -> 1000004501890: It then calls the {(Javadoc:SE+ setDefaultCloseOperation(int)+name=setDefaultCloseOperation(int)+javax/swing+JFrame+setDefaultCloseOperation(int))} method inherited from JFrame to set the default operation when the close control on the title bar is selected to {(Javadoc:SE+ WindowConstants.EXIT_ON_CLOSE+javax/swing+WindowConstants+EXIT_ON_CLOSE)} — this causes the JFrame to be disposed of when the frame is closed (as opposed to merely hidden), which allows the JVM to exit and the program to terminate. 10441950 -> 1000004501900: Next, the layout of the frame is set to a BorderLayout; this tells Swing how to arrange the components that will be added to the frame. 10441960 -> 1000004501910: A {(Javadoc:SE+ JLabel+javax/swing+JLabel)} is created for the string "Hello, world!" and the {(Javadoc:SE+ add(Component)+name=add(Component)+java/awt+Container+add(java.awt.Component))} method inherited from the {(Javadoc:SE+ Container+java/awt+Container)} superclass is called to add the label to the frame. 10441970 -> 1000004501920: The {(Javadoc:SE+ pack()+name=pack()+java/awt+Window+pack())} method inherited from the {(Javadoc:SE+ Window+java/awt+Window)} superclass is called to size the window and lay out its contents, in the manner indicated by the BorderLayout. 10441980 -> 1000004501930: The main() method is called by the JVM when the program starts. 10441990 -> 1000004501940: It instantiates a new Hello frame and causes it to be displayed by calling the {(Javadoc:SE+ setVisible(boolean)+name=setVisible(boolean)+java/awt+Component+setVisible(boolean))} method inherited from the {(Javadoc:SE+ Component+java/awt+Component)} superclass with the boolean parameter true. 10442000 -> 1000004501950: Note that once the frame is displayed, exiting the main method does not cause the program to terminate because the AWT event dispatching thread remains active until all of the Swing top-level windows have been disposed. 
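A sketch of the Swing application described above, using the window title "hello" and the "Hello, world!" label mentioned in the text:

```java
import java.awt.BorderLayout;
import javax.swing.*;

public class Hello extends JFrame {
    public Hello() {
        super("hello");                                           // window title
        setDefaultCloseOperation(WindowConstants.EXIT_ON_CLOSE);  // let the program terminate when the window is closed
        setLayout(new BorderLayout());                            // how added components are arranged
        add(new JLabel("Hello, world!"));                         // Container.add(Component)
        pack();                                                   // size the window to fit its contents
    }

    public static void main(String[] args) {
        new Hello().setVisible(true);                             // display the frame
    }
}
```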
10442010 -> 1000004501960: Criticism 10442020 -> 1000004501970: Java's performance has improved substantially since the early versions, and the performance of JIT compilers relative to native compilers has in some tests been shown to be quite similar. 10442030 -> 1000004501980: The performance of the compilers does not necessarily indicate the performance of the compiled code; only careful testing can reveal the true performance issues in any system. 10442040 -> 1000004501990: The default look and feel of GUI applications written in Java using the Swing toolkit is very different from that of native applications. 10442050 -> 1000004502000: It is possible to specify a different look and feel through the pluggable look and feel system of Swing. 10442060 -> 1000004502010: Clones of Windows, GTK and Motif are supplied by Sun. 10442070 -> 1000004502020: Apple also provides an Aqua look and feel for Mac OS X. 10442080 -> 1000004502030: Though prior implementations of these looks and feels have been considered lacking, Swing in Java SE 6 addresses this problem by using more native widget drawing routines of the underlying platforms. 10442090 -> 1000004502040: Alternatively, third-party toolkits such as wx4j, Qt Jambi or SWT may be used for increased integration with the native windowing system. 10442100 -> 1000004502050: As in C++ and some other object-oriented languages, variables of Java's primitive types were not originally objects. 10442110 -> 1000004502060: Values of primitive types are either stored directly in fields (for objects) or on the stack (for methods) rather than on the heap, as is the common case for objects (but see Escape analysis). 10442120 -> 1000004502070: This was a conscious decision by Java's designers for performance reasons. 10442130 -> 1000004502080: Because of this, Java was not considered to be a pure object-oriented programming language. 10442140 -> 1000004502090: However, as of Java 5.0, autoboxing enables programmers to write code as if primitive types were instances of their corresponding wrapper classes, and to interchange freely between the two for improved flexibility (a brief sketch follows below). 10442150 -> 1000004502100: Java suppresses several features (such as operator overloading and multiple inheritance) for classes in order to simplify the language, to "save the programmers from themselves", and to prevent possible errors and anti-pattern design. 10442160 -> 1000004502110: This has been a source of criticism, relating to a lack of low-level features, but some of these limitations may be worked around. 10442170 -> 1000004502120: Java interfaces, however, have always supported multiple inheritance. 10442180 -> 1000004502130: Resources 10442190 -> 1000004502140: Java Runtime Environment 10442200 -> 1000004502150: The Java Runtime Environment, or JRE, is the software required to run any application deployed on the Java Platform. 10442210 -> 1000004502160: End-users commonly use a JRE in software packages and Web browser plugins. 10442220 -> 1000004502170: Sun also distributes a superset of the JRE called the Java 2 SDK (more commonly known as the JDK), which includes development tools such as the Java compiler, Javadoc, Jar and debugger. 10442230 -> 1000004502180: One of the unique advantages of the concept of a runtime engine is that errors (exceptions) should not 'crash' the system.
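Returning to the autoboxing point made in the criticism section above, a small illustrative sketch (class and variable names are illustrative, not from the article):

```java
import java.util.ArrayList;
import java.util.List;

public class AutoboxingSketch {
    public static void main(String[] args) {
        Integer boxed = 42;      // autoboxing: the int literal is wrapped in an Integer
        int primitive = boxed;   // unboxing: the Integer converts back to a primitive int

        // Collections hold objects, not primitives; autoboxing lets an int
        // be added to a List<Integer> directly.
        List<Integer> numbers = new ArrayList<Integer>();
        numbers.add(primitive);

        System.out.println(boxed + primitive + numbers.get(0)); // prints 126
    }
}
```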
10442240 -> 1000004502190: Moreover, in runtime engine environments such as Java there exist tools that attach to the runtime engine and every time that an exception of interest occurs they record debugging information that existed in memory at the time the exception was thrown (stack and heap values). 10442250 -> 1000004502200: These Automated Exception Handling tools provide 'root-cause' information for exceptions in Java programs that run in production, testing or development environments. 10442260 -> 1000004502210: Components 10442270 -> 1000004502220: Java libraries are the compiled byte codes of source code developed by the JRE implementor to support application development in Java. 10442280 -> 1000004502230: Examples of these libraries are: 10442290 -> 1000004502240: The core libraries, which include: 10442300 -> 1000004502250: Collection libraries that implement data structures such as lists, dictionaries, trees and sets 10442310 -> 1000004502260: XML Processing (Parsing, Transforming, Validating) libraries 10442320 -> 1000004502270: Security 10442330 -> 1000004502280: Internationalization and localization libraries 10442340 -> 1000004502290: The integration libraries, which allow the application writer to communicate with external systems. 10442350 -> 1000004502300: These libraries include: 10442360 -> 1000004502310: The Java Database Connectivity (JDBC) API for database access 10442370 -> 1000004502320: Java Naming and Directory Interface (JNDI) for lookup and discovery 10442380 -> 1000004502330: RMI and CORBA for distributed application development 10442390 -> 1000004502340: User Interface libraries, which include: 10442400 -> 1000004502350: The (heavyweight, or native) Abstract Windowing Toolkit (AWT), which provides GUI components, the means for laying out those components and the means for handling events from those components 10442410 -> 1000004502360: The (lightweight) Swing libraries, which are built on AWT but provide (non-native) implementations of the AWT widgetry 10442420 -> 1000004502370: APIs for audio capture, processing, and playback 10442430 -> 1000004502380: A platform dependent implementation of Java virtual machine (JVM) that is the means by which the byte codes of the Java libraries and third party applications are executed 10442440 -> 1000004502390: Plugins, which enable applets to be run in Web browsers 10442450 -> 1000004502400: Java Web Start, which allows Java applications to be efficiently distributed to end users across the Internet 10442460 -> 1000004502410: Licensing and documentation 10442470 -> 1000004502420: APIs 10442480 -> 1000004502430: Sun has defined three platforms targeting different application environments and segmented many of its APIs so that they belong to one of the platforms. 10442490 -> 1000004502440: The platforms are: 10442500 -> 1000004502450: Java Platform, Micro Edition (Java ME) — targeting environments with limited resources, 10442510 -> 1000004502460: Java Platform, Standard Edition (Java SE) — targeting workstation environments, and 10442520 -> 1000004502470: Java Platform, Enterprise Edition (Java EE) — targeting large distributed enterprise or Internet environments. 10442530 -> 1000004502480: The classes in the Java APIs are organized into separate groups called packages. 10442540 -> 1000004502490: Each package contains a set of related interfaces, classes and exceptions. 10442550 -> 1000004502500: Refer to the separate platforms for a description of the packages available. 
10442560 -> 1000004502510: The set of APIs is controlled by Sun Microsystems in cooperation with others through the Java Community Process program. 10442570 -> 1000004502520: Companies or individuals participating in this process can influence the design and development of the APIs. 10442580 -> 1000004502530: This process has been a subject of controversy. Language 10450010 -> 1000004600020: Language 10450020 -> 1000004600030: A language is a dynamic set of visual, auditory, or tactile symbols of communication and the elements used to manipulate them. 10450030 -> 1000004600040: Language can also refer to the use of such systems as a general phenomenon. 10450040 -> 1000004600050: Language is considered to be an exclusively human mode of communication; although other animals make use of quite sophisticated communicative systems, none of these are known to make use of all of the properties that linguists use to define language. 10450050 -> 1000004600060: Properties of language 10450060 -> 1000004600070: A set of agreed-upon symbols is only one feature of language; all languages must define the structural relationships between these symbols in a system of grammar. 10450070 -> 1000004600080: Rules of grammar are what distinguish language from other forms of communication. 10450080 -> 1000004600090: They allow a finite set of symbols to be manipulated to create a potentially infinite number of grammatical utterances. 10450090 -> 1000004600100: Another property of language is that its symbols are arbitrary. 10450100 -> 1000004600110: Any concept or grammatical rule can be mapped onto a symbol. 10450110 -> 1000004600120: Most languages make use of sound, but the combinations of sounds used do not have any inherent meaning – they are merely an agreed-upon convention to represent a certain thing by users of that language. 10450120 -> 1000004600130: For instance, there is nothing about the Spanish word {(Lang+nada+es+nada)} itself that forces Spanish speakers to convey the idea of "nothing". 10450130 -> 1000004600140: Another set of sounds (for example, the English word nothing) could equally be used to represent the same concept, but all Spanish speakers have acquired or learned to correlate this meaning for this particular sound pattern. 10450140 -> 1000004600150: For Slovenian, Croatian, Serbian/Kosovan or Bosnian speakers on the other hand, {(Lang+nada+hr+nada)} means something else; it means "hope". 10450150 -> 1000004600160: The study of language 10450160 -> 1000004600170: Linguistics 10450170 -> 1000004600180: Linguistics is the scientific and philosophical study of language, encompassing a number of sub-fields. 10450180 -> 1000004600190: At the core of theoretical linguistics are the study of language structure (grammar) and the study of meaning (semantics). 10450190 -> 1000004600200: The first of these encompasses morphology (the formation and composition of words), syntax (the rules that determine how words combine into phrases and sentences) and phonology (the study of sound systems and abstract sound units). 10450200 -> 1000004600210: Phonetics is a related branch of linguistics concerned with the actual properties of speech sounds (phones), non-speech sounds, and how they are produced and perceived. 10450210 -> 1000004600220: Theoretical linguistics is mostly concerned with developing models of linguistic knowledge. 10450220 -> 1000004600230: The fields that are generally considered as the core of theoretical linguistics are syntax, phonology, morphology, and semantics. 
10450230 -> 1000004600240: Applied linguistics attempts to put linguistic theories into practice through areas like translation, stylistics, literary criticism and theory, discourse analysis, speech therapy, speech pathology and foreign language teaching. 10450240 -> 1000004600250: History 10450250 -> 1000004600260: The historical record of linguistics begins in India with Pāṇini, the 5th century BCE grammarian who formulated 3,959 rules of Sanskrit morphology, known as the {(Transl+Aṣṭādhyāyī+sa+IAST+sa)} (अष्टाध्यायी) and with Tolkāppiyar, the 3rd century BCE grammarian of the Tamil work Tolkāppiyam. 10450250 -> 1000004600270: Pāṇini’s grammar is highly systematized and technical. 10450260 -> 1000004600280: Inherent in its analytic approach are the concepts of the phoneme, the morpheme, and the root; Western linguists only recognized the phoneme some two millennia later. 10450270 -> 1000004600290: Tolkāppiyar's work is perhaps the first to describe articulatory phonetics for a language. 10450280 -> 1000004600300: Its classification of the alphabet into consonants and vowels, and elements like nouns, verbs, vowels, and consonants, which he put into classes, were also breakthroughs at the time. 10450290 -> 1000004600310: In the Middle East, the Persian linguist Sibawayh (سیبویه) made a detailed and professional description of Arabic in 760 CE in his monumental work, Al-kitab fi al-nahw (الكتاب في النحو, The Book on Grammar), bringing many linguistic aspects of language to light. 10450300 -> 1000004600320: In his book, he distinguished phonetics from phonology. 10450310 -> 1000004600330: Later in the West, the success of science, mathematics, and other formal systems in the 20th century led many to attempt a formalization of the study of language as a "semantic code". 10450320 -> 1000004600340: This resulted in the academic discipline of linguistics, the founding of which is attributed to Ferdinand de Saussure. 10450330 -> 1000004600350: In the 20th century, substantial contributions to the understanding of language came from Ferdinand de Saussure, Hjelmslev, Émile Benveniste and Roman Jakobson, which are characterized as being highly systematic. 10450340 -> 1000004600360: Human languages 10450350 -> 1000004600370: Human languages are usually referred to as natural languages, and the science of studying them falls under the purview of linguistics. 10450360 -> 1000004600380: A common progression for natural languages is that they are considered to be first spoken, then written, and then an understanding and explanation of their grammar is attempted. 10450370 -> 1000004600390: Languages live, die, move from place to place, and change with time. 10450380 -> 1000004600400: Any language that ceases to change or develop is categorized as a dead language. 10450390 -> 1000004600410: Conversely, any language that is a living language, that is, it is in a continuous state of change, is known as a modern language. 10450400 -> 1000004600420: Making a principled distinction between one language and another is usually impossible. 10450410 -> 1000004600430: For instance, there are a few dialects of German similar to some dialects of Dutch. 10450420 -> 1000004600440: The transition between languages within the same language family is sometimes gradual (see dialect continuum). 10450430 -> 1000004600450: Some like to make parallels with biology, where it is not possible to make a well-defined distinction between one species and the next. 
10450440 -> 1000004600460: In either case, the ultimate difficulty may stem from the interactions between languages and populations. 10450450 -> 1000004600470: (See Dialect or August Schleicher for a longer discussion.) 10450460 -> 1000004600480: The concepts of Ausbausprache, Abstandsprache and Dachsprache are used to make finer distinctions about the degrees of difference between languages or dialects. 10450470 -> 1000004600490: Artificial languages 10450480 -> 1000004600500: Constructed languages 10450490 -> 1000004600510: Some individuals and groups have constructed their own artificial languages, for practical, experimental, personal, or ideological reasons. 10450500 -> 1000004600520: International auxiliary languages are generally constructed languages that strive to be easier to learn than natural languages; other constructed languages strive to be more logical ("loglangs") than natural languages; a prominent example of this is Lojban. 10450510 -> 1000004600530: Some writers, such as J. R. R. Tolkien, have created fantasy languages, for literary, artistic or personal reasons. 10450520 -> 1000004600540: The fantasy language of the Klingon race has in recent years been developed by fans of the Star Trek series, including a vocabulary and grammar. 10450530 -> 1000004600550: Constructed languages are not necessarily restricted to the properties shared by natural languages. 10450540 -> 1000004600560: This part of ISO 639 also includes identifiers that denote constructed (or artificial) languages. 10450550 -> 1000004600570: In order to qualify for inclusion the language must have a literature and it must be designed for the purpose of human communication. 10450560 -> 1000004600580: Specifically excluded are reconstructed languages and computer programming languages. 10450570 -> 1000004600590: International auxiliary languages 10450580 -> 1000004600600: Some languages, most constructed, are meant specifically for communication between people of different nationalities or language groups as an easy-to-learn second language. 10450590 -> 1000004600610: Several of these languages have been constructed by individuals or groups. 10450600 -> 1000004600620: Natural, pre-existing languages may also be used in this way - their developers merely catalogued and standardized their vocabulary and identified their grammatical rules. 10450610 -> 1000004600630: These languages are called naturalistic. 10450620 -> 1000004600640: One such language, Latino Sine Flexione, is a simplified form of Latin. 10450630 -> 1000004600650: Two others, Occidental and Novial, were drawn from several Western languages. 10450640 -> 1000004600660: To date, the most successful auxiliary language is Esperanto, invented by Polish ophthalmologist Zamenhof. 10450650 -> 1000004600670: It has a relatively large community roughly estimated at about 2 million speakers worldwide, with a large body of literature, songs, and is the only known constructed language to have native speakers, such as the Hungarian-born American businessman George Soros. 10450660 -> 1000004600680: Other auxiliary languages with a relatively large number of speakers and literature are Interlingua and Ido. 10450670 -> 1000004600690: Controlled languages 10450680 -> 1000004600700: Controlled natural languages are subsets of natural languages whose grammars and dictionaries have been restricted in order to reduce or eliminate both ambiguity and complexity. 
10450690 -> 1000004600710: The purpose behind the development and implementation of a controlled natural language typically is to aid non-native speakers of a natural language in understanding it, or to ease computer processing of a natural language. 10450700 -> 1000004600720: An example of a widely used controlled natural language is Simplified English, which was originally developed for aerospace industry maintenance manuals. 10450710 -> 1000004600730: Formal languages 10450720 -> 1000004600740: Mathematics and computer science use artificial entities called formal languages (including programming languages and markup languages, and some that are more theoretical in nature). 10450730 -> 1000004600750: These often take the form of character strings, produced by a combination of formal grammar and semantics of arbitrary complexity. 10450740 -> 1000004600760: Programming languages 10450750 -> 1000004600770: A programming language is an extreme case of a formal language that can be used to control the behavior of a machine, particularly a computer, to perform specific tasks. 10450760 -> 1000004600780: Programming languages are defined using syntactic and semantic rules, to determine structure and meaning respectively. 10450770 -> 1000004600790: Programming languages are used to facilitate communication about the task of organizing and manipulating information, and to express algorithms precisely. 10450780 -> 1000004600800: Some authors restrict the term "programming language" to those languages that can express all possible algorithms; sometimes the term "computer language" is used for artificial languages that are more limited. 10450790 -> 1000004600810: Animal communication 10450800 -> 1000004600820: The term "animal languages" is often used for non-human languages. 10450810 -> 1000004600830: Linguists do not consider these to be "language", but describe them as animal communication, because the interaction between animals in such communication is fundamentally different in its underlying principles from human language. 10450820 -> 1000004600840: Nevertheless, some scholars have tried to disprove this mainstream premise through experiments on training chimpanzees to talk. 10450830 -> 1000004600850: Karl von Frisch received the Nobel Prize in 1973 for his proof of the language and dialects of the bees. 10450840 -> 1000004600860: In several publicized instances, non-human animals have been taught to understand certain features of human language. 10450850 -> 1000004600870: Chimpanzees, gorillas, and orangutans have been taught hand signs based on American Sign Language. 10450860 -> 1000004600880: The African Grey Parrot, which possesses the ability to mimic human speech with a high degree of accuracy, is suspected of having sufficient intelligence to comprehend some of the speech it mimics. 10450870 -> 1000004600890: Most species of parrot, despite expert mimicry, are believed to have no linguistic comprehension at all. 10450880 -> 1000004600900: While proponents of animal communication systems have debated levels of semantics, these systems have not been found to have anything approaching human language syntax. Language model 10460010 -> 1000004700020: Language model 10460020 -> 1000004700030: A statistical language model assigns a probability to a sequence of m words P(w_1,\ldots,w_m) by means of a probability distribution. 
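To make this concrete, the sketch below is a minimal, purely illustrative Python example (the toy corpus, function names, and the absence of smoothing are assumptions added here, not part of the original text) of how such a model can assign a probability to a word sequence from bigram counts; the n-gram approximation it relies on is developed in the following subsection.

```python
# Minimal, purely illustrative bigram language model (maximum-likelihood
# estimates from a toy corpus; real systems add smoothing for unseen pairs).
from collections import Counter

corpus = [
    ["i", "saw", "the", "red", "house"],
    ["i", "saw", "the", "dog"],
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    unigrams.update(sentence)
    bigrams.update(zip(sentence, sentence[1:]))

def bigram_prob(prev, word):
    # P(word | prev) estimated as count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(words):
    # P(w_1, ..., w_m) approximated as P(w_1) * prod_i P(w_i | w_{i-1})
    total = sum(unigrams.values())
    p = unigrams[words[0]] / total
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob(["i", "saw", "the", "red", "house"]))  # ~0.111 for this toy corpus
```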
10460030 -> 1000004700040: Language modeling is used in many natural language processing applications such as speech recognition, machine translation, part-of-speech tagging, parsing and information retrieval. 10460040 -> 1000004700050: In speech recognition and in data compression, such a model tries to capture the properties of a language and to predict the next word in a speech sequence. 10460050 -> 1000004700060: When used in information retrieval, a language model is associated with a document in a collection. 10460060 -> 1000004700070: With query Q as input, retrieved documents are ranked based on the probability that the document's language model would generate the terms of the query, P(Q|M_d). 10460070 -> 1000004700080: Estimating the probability of sequences can become difficult in corpora, in which phrases or sentences can be arbitrarily long and hence some sequences are not observed during training of the language model (the data sparseness problem). 10460080 -> 1000004700090: For that reason these models are often approximated using smoothed N-gram models. 10460090 -> 1000004700100: N-gram models 10460100 -> 1000004700110: In an n-gram model, the probability P(w_1,\ldots,w_m) of observing the sentence w_1,\ldots,w_m is approximated as 10460110 -> 1000004700120: P(w_1,\ldots,w_m) = \prod^m_{i=1} P(w_i|w_1,\ldots,w_{i-1}) \approx \prod^m_{i=1} P(w_i|w_{i-(n-1)},\ldots,w_{i-1}) 10460120 -> 1000004700130: Here, it is assumed that the probability of observing the ith word w_i given the context history of the preceding i-1 words can be approximated by the probability of observing it given the shortened context history of the preceding n-1 words (a Markov property of order n-1). 10460130 -> 1000004700140: The conditional probability can be calculated from n-gram frequency counts: P(w_i|w_{i-(n-1)},\ldots,w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1})} 10460140 -> 1000004700150: The terms bigram and trigram language model denote n-gram language models with n=2 and n=3, respectively. 10460150 -> 1000004700160: Example 10460160 -> 1000004700170: In a bigram (n=2) language model, the probability of the sentence I saw the red house is approximated as P(I,saw,the,red,house) \approx P(I) P(saw|I) P(the|saw) P(red|the) P(house|red) 10460170 -> 1000004700180: whereas in a trigram (n=3) language model, the approximation is P(I,saw,the,red,house) \approx P(I) P(saw|I) P(the|I,saw) P(red|saw,the) P(house|the,red) 10470010 -> 1000004800020: Latent semantic analysis 10470020 -> 1000004800030: Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. 10470030 -> 1000004800040: LSA was patented in 1988 (US Patent 4,839,853) by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter. 10470040 -> 1000004800050: In the context of its application to information retrieval, it is sometimes called latent semantic indexing (LSI). 10470050 -> 1000004800060: Occurrence matrix 10470060 -> 1000004800070: LSA can use a term-document matrix which describes the occurrences of terms in documents; it is a sparse matrix whose rows correspond to terms and whose columns correspond to documents; the terms are typically stemmed words that appear in the documents.
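As a hedged illustration of the occurrence matrix just described, the following Python sketch builds a raw term-document count matrix for a toy collection (the documents and variable names are invented for the example); weighting schemes such as the tf-idf discussed next would then be applied to these counts.

```python
# Toy term-document count matrix: rows correspond to terms, columns to
# documents. Real systems would tokenize and stem the text and then apply
# a weighting such as tf-idf to these raw counts.
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

vocabulary = sorted({word for doc in documents for word in doc.split()})

# X[i][j] = number of times term i occurs in document j
X = [[doc.split().count(term) for doc in documents] for term in vocabulary]

for term, row in zip(vocabulary, X):
    print(f"{term:>5}: {row}")
```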
10470070 -> 1000004800080: A typical example of the weighting of the elements of the matrix is tf-idf (term frequency–inverse document frequency): the element of the matrix is proportional to the number of times the terms appear in each document, where rare terms are upweighted to reflect their relative importance. 10470080 -> 1000004800090: This matrix is also common to standard semantic models, though it is not necessarily explicitly expressed as a matrix, since the mathematical properties of matrices are not always used. 10470090 -> 1000004800100: LSA transforms the occurrence matrix into a relation between the terms and some concepts, and a relation between those concepts and the documents. 10470100 -> 1000004800110: Thus the terms and documents are now indirectly related through the concepts. 10470110 -> 1000004800120: Applications 10470120 -> 1000004800130: The new concept space typically can be used to: 10470130 -> 1000004800140: Compare the documents in the concept space (data clustering, document classification)...... 10470140 -> 1000004800150: Find similar documents across languages, after analyzing a base set of translated documents (cross language retrieval). 10470150 -> 1000004800160: Find relations between terms (synonymy and polysemy). 10470160 -> 1000004800170: Given a query of terms, translate it into the concept space, and find matching documents (information retrieval). 10470170 -> 1000004800180: Synonymy and polysemy are fundamental problems in natural language processing: 10470180 -> 1000004800190: Synonymy is the phenomenon where different words describe the same idea. 10470190 -> 1000004800200: Thus, a query in a search engine may fail to retrieve a relevant document that does not contain the words which appeared in the query. 10470200 -> 1000004800210: For example, a search for "doctors" may not return a document containing the word "physicians", even though the words have the same meaning. 10470210 -> 1000004800220: Polysemy is the phenomenon where the same word has multiple meanings. 10470220 -> 1000004800230: So a search may retrieve irrelevant documents containing the desired words in the wrong meaning. 10470230 -> 1000004800240: For example, a botanist and a computer scientist looking for the word "tree" probably desire different sets of documents. 10470240 -> 1000004800250: Rank lowering 10470250 -> 1000004800260: After the construction of the occurrence matrix, LSA finds a low-rank approximation to the term-document matrix. 10470260 -> 1000004800270: There could be various reasons for these approximations: 10470270 -> 1000004800280: The original term-document matrix is presumed too large for the computing resources; in this case, the approximated low rank matrix is interpreted as an approximation (a "least and necessary evil"). 10470280 -> 1000004800290: The original term-document matrix is presumed noisy: for example, anecdotal instances of terms are to be eliminated. 10470290 -> 1000004800300: From this point of view, the approximated matrix is interpreted as a de-noisified matrix (a better matrix than the original). 10470300 -> 1000004800310: The original term-document matrix is presumed overly sparse relative to the "true" term-document matrix. 10470310 -> 1000004800320: That is, the original matrix lists only the words actually in each document, whereas we might be interested in all words related to each document--generally a much larger set due to synonymy. 
10470320 -> 1000004800330: The consequence of the rank lowering is that some dimensions are combined and depend on more than one term: 10470330 -> 1000004800340: {(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)} 10470340 -> 1000004800350: This mitigates synonymy, as the rank lowering is expected to merge the dimensions associated with terms that have similar meanings. 10470350 -> 1000004800360: It also mitigates polysemy, since components of polysemous words that point in the "right" direction are added to the components of words that share a similar meaning. 10470360 -> 1000004800370: Conversely, components that point in other directions tend to either simply cancel out, or, at worst, to be smaller than components in the directions corresponding to the intended sense. 10470370 -> 1000004800380: Derivation 10470380 -> 1000004800390: Let X be a matrix where element (i,j) describes the occurrence of term i in document j (this can be, for example, the frequency). 10470385 -> 1000004800400: X will look like this: 10470390 -> 1000004800410: \begin{matrix} & \textbf{d}_j \\ & \downarrow \\ \textbf{t}_i^T \rightarrow & \begin{bmatrix} x_{1,1} & \dots & x_{1,n} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \dots & x_{m,n} \\ \end{bmatrix} \end{matrix} 10470400 -> 1000004800420: Now a row in this matrix will be a vector corresponding to a term, giving its relation to each document: 10470410 -> 1000004800430: \textbf{t}_i^T = \begin{bmatrix} x_{i,1} & \dots & x_{i,n} \end{bmatrix} 10470420 -> 1000004800440: Likewise, a column in this matrix will be a vector corresponding to a document, giving its relation to each term: 10470430 -> 1000004800450: \textbf{d}_j = \begin{bmatrix} x_{1,j} \\ \vdots \\ x_{m,j} \end{bmatrix} 10470440 -> 1000004800460: Now the dot product \textbf{t}_i^T \textbf{t}_p between two term vectors gives the correlation between the terms over the documents. 10470450 -> 1000004800470: The matrix product X X^T contains all these dot products. 10470460 -> 1000004800480: Element (i,p) (which is equal to element (p,i)) contains the dot product \textbf{t}_i^T \textbf{t}_p ( = \textbf{t}_p^T \textbf{t}_i). 10470470 -> 1000004800490: Likewise, the matrix X^T X contains the dot products between all the document vectors, giving their correlation over the terms: \textbf{d}_j^T \textbf{d}_q = \textbf{d}_q^T \textbf{d}_j. 10470480 -> 1000004800500: Now assume that there exists a decomposition of X such that U and V are orthonormal matrices and \Sigma is a diagonal matrix. 10470490 -> 1000004800510: This is called a singular value decomposition (SVD): 10470500 -> 1000004800520: X = U \Sigma V^T 10470510 -> 1000004800530: The matrix products giving us the term and document correlations then become 10470520 -> 1000004800540: \begin{matrix} X X^T &=& (U \Sigma V^T) (U \Sigma V^T)^T = (U \Sigma V^T) (V^{T^T} \Sigma^T U^T) = U \Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T \\ X^T X &=& (U \Sigma V^T)^T (U \Sigma V^T) = (V^{T^T} \Sigma^T U^T) (U \Sigma V^T) = V \Sigma U^T U \Sigma V^T = V \Sigma^T \Sigma V^T \end{matrix} 10470530 -> 1000004800550: Since \Sigma \Sigma^T and \Sigma^T \Sigma are diagonal we see that U must contain the eigenvectors of X X^T, while V must be the eigenvectors of X^T X. 10470540 -> 1000004800560: Both products have the same non-zero eigenvalues, given by the non-zero entries of \Sigma \Sigma^T, or equally, by the non-zero entries of \Sigma^T\Sigma. 
10470550 -> 1000004800570: Now the decomposition looks like this: 10470560 -> 1000004800580: \begin{matrix} & X & & & U & & \Sigma & & V^T \\ & (\textbf{d}_j) & & & & & & & (\hat \textbf{d}_j) \\ & \downarrow & & & & & & & \downarrow \\ (\textbf{t}_i^T) \rightarrow & \begin{bmatrix} x_{1,1} & \dots & x_{1,n} \\ \\ \vdots & \ddots & \vdots \\ \\ x_{m,1} & \dots & x_{m,n} \\ \end{bmatrix} & = & (\hat \textbf{t}_i^T) \rightarrow & \begin{bmatrix} \begin{bmatrix} \, \\ \, \\ \textbf{u}_1 \\ \, \\ \,\end{bmatrix} \dots \begin{bmatrix} \, \\ \, \\ \textbf{u}_l \\ \, \\ \, \end{bmatrix} \end{bmatrix} & \cdot & \begin{bmatrix} \sigma_1 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \sigma_l \\ \end{bmatrix} & \cdot & \begin{bmatrix} \begin{bmatrix} & & \textbf{v}_1 & & \end{bmatrix} \\ \vdots \\ \begin{bmatrix} & & \textbf{v}_l & & \end{bmatrix} \end{bmatrix} \end{matrix} 10470570 -> 1000004800590: The values \sigma_1, \dots, \sigma_l are called the singular values, and u_1, \dots, u_l and v_1, \dots, v_l the left and right singular vectors. 10470580 -> 1000004800600: Notice how the only part of U that contributes to \textbf{t}_i is the i\textrm{'th} row. 10470590 -> 1000004800610: Let this row vector be called \hat \textrm{t}_i. 10470600 -> 1000004800620: Likewise, the only part of V^T that contributes to \textbf{d}_j is the j\textrm{'th} column, \hat \textrm{d}_j. 10470610 -> 1000004800630: These are not the eigenvectors, but depend on all the eigenvectors. 10470620 -> 1000004800640: It turns out that when you select the k largest singular values, and their corresponding singular vectors from U and V, you get the rank k approximation to X with the smallest error (Frobenius norm). 10470630 -> 1000004800650: The amazing thing about this approximation is that not only does it have a minimal error, but it translates the term and document vectors into a concept space. 10470640 -> 1000004800660: The vector \hat \textbf{t}_i then has k entries, each giving the occurrence of term i in one of the k concepts. 10470650 -> 1000004800670: Likewise, the vector \hat \textbf{d}_j gives the relation between document j and each concept. 10470660 -> 1000004800680: We write this approximation as 10470670 -> 1000004800690: X_k = U_k \Sigma_k V_k^T 10470680 -> 1000004800700: You can now do the following: 10470690 -> 1000004800710: See how related documents j and q are in the concept space by comparing the vectors \hat \textbf{d}_j and \hat \textbf{d}_q (typically by cosine similarity). 10470700 -> 1000004800720: This gives you a clustering of the documents. 10470710 -> 1000004800730: Comparing terms i and p by comparing the vectors \hat \textbf{t}_i and \hat \textbf{t}_p, giving you a clustering of the terms in the concept space. 10470720 -> 1000004800740: Given a query, view this as a mini document, and compare it to your documents in the concept space. 10470730 -> 1000004800750: To do the latter, you must first translate your query into the concept space. 10470740 -> 1000004800760: It is then intuitive that you must use the same transformation that you use on your documents: 10470750 -> 1000004800770: \textbf{d}_j = U_k \Sigma_k \hat \textbf{d}_j 10470760 -> 1000004800780: \hat \textbf{d}_j = \Sigma_k^{-1} U_k^T \textbf{d}_j 10470770 -> 1000004800790: This means that if you have a query vector q, you must do the translation \hat \textbf{q} = \Sigma_k^{-1} U_k^T \textbf{q} before you compare it with the document vectors in the concept space. 
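The rank-k truncation and the query fold-in described above can be sketched with NumPy as follows (the toy matrix, the choice of k, and the variable names are assumptions for illustration; the sketch follows the formulas X_k = U_k \Sigma_k V_k^T and \hat{q} = \Sigma_k^{-1} U_k^T q given above).

```python
import numpy as np

# Toy term-document matrix X (rows = terms, columns = documents).
X = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
])

# Full SVD, then keep the k largest singular values and vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Rank-k approximation X_k = U_k Sigma_k V_k^T (smallest Frobenius-norm error).
X_k = U_k @ S_k @ Vt_k

# Concept-space representations: the j-th column of Vt_k plays the role of
# \hat{d}_j, and the i-th row of U_k plays the role of \hat{t}_i.
doc_concepts = Vt_k  # shape (k, number_of_documents)

# Fold a query (a term-count vector) into the concept space:
# \hat{q} = Sigma_k^{-1} U_k^T q
q = np.array([1.0, 0.0, 1.0, 0.0])
q_hat = np.linalg.inv(S_k) @ U_k.T @ q

# Rank documents by cosine similarity to the folded-in query.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

scores = [cosine(q_hat, doc_concepts[:, j]) for j in range(doc_concepts.shape[1])]
print(scores)
```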
10470780 -> 1000004800800: You can do the same for pseudo term vectors: 10470790 -> 1000004800810: \textbf{t}_i^T = \hat \textbf{t}_i^T \Sigma_k V_k^T 10470800 -> 1000004800820: \hat \textbf{t}_i^T = \textbf{t}_i^T V_k^{-T} \Sigma_k^{-1} = \textbf{t}_i^T V_k \Sigma_k^{-1} 10470810 -> 1000004800830: \hat \textbf{t}_i = \Sigma_k^{-1} V_k^T \textbf{t}_i 10470820 -> 1000004800840: Implementation 10470830 -> 1000004800850: The SVD is typically computed using large matrix methods (for example, Lanczos methods) but may also be computed incrementally and with greatly reduced resources via a neural network-like approach, which does not require the large, full-rank matrix to be held in memory ( Gorrell and Webb, 2005). 10470840 -> 1000004800860: A fast, incremental, low-memory, large-matrix SVD algorithm has recently been developed ( Brand, 2006). 10470850 -> 1000004800870: Unlike Gorrell and Webb's (2005) stochastic approximation, Brand's (2006) algorithm provides an exact solution. 10470860 -> 1000004800880: Limitations 10470870 -> 1000004800890: LSA has two drawbacks: 10470880 -> 1000004800900: The resulting dimensions might be difficult to interpret. 10470890 -> 1000004800910: For instance, in 10470900 -> 1000004800920: {(car), (truck), (flower)} --> {(1.3452 * car + 0.2828 * truck), (flower)} 10470910 -> 1000004800930: the (1.3452 * car + 0.2828 * truck) component could be interpreted as "vehicle". 10470920 -> 1000004800940: However, it is very likely that cases close to 10470930 -> 1000004800950: {(car), (bottle), (flower)} --> {(1.3452 * car + 0.2828 * bottle), (flower)} 10470940 -> 1000004800960: will occur. 10470950 -> 1000004800970: This leads to results which can be justified on the mathematical level, but have no interpretable meaning in natural language. 10470960 -> 1000004800980: The probabilistic model of LSA does not match observed data: LSA assumes that words and documents form a joint Gaussian model (ergodic hypothesis), while a Poisson distribution has been observed. 10470970 -> 1000004800990: Thus, a newer alternative is probabilistic latent semantic analysis, based on a multinomial model, which is reported to give better results than standard LSA . Lexical category 10660010 -> 1000004900020: Lexical category 10660020 -> 1000004900030: In grammar, a lexical category (also word class, lexical class, or in traditional grammar part of speech) is a linguistic category of words (or more precisely lexical items), which is generally defined by the syntactic or morphological behaviour of the lexical item in question. 10660030 -> 1000004900040: Common linguistic categories include noun and verb, among others. 10660040 -> 1000004900050: There are open word classes, which constantly acquire new members, and closed word classes, which acquire new members infrequently if at all. 10660050 -> 1000004900060: Different languages may have different lexical categories, or they might associate different properties to the same one. 10660060 -> 1000004900070: For example, Japanese has at least three classes of adjectives where English has one; Chinese and Japanese have measure words while European languages have nothing resembling them; many languages don't have a distinction between adjectives and adverbs, or adjectives and nouns, etc. 10660070 -> 1000004900080: Many linguists argue that the formal distinctions between parts of speech must be made within the framework of a specific language or language family, and should not be carried over to other languages or language families. 
10660080 -> 1000004900090: History 10660090 -> 1000004900100: The classification of words into lexical categories is found from the earliest moments in the history of linguistics. 10660100 -> 1000004900110: In the Nirukta, written in the 5th or 6th century BCE, the Sanskrit grammarian Yāska defined four main categories of words : 10660110 -> 1000004900120: nāma - nouns or substantives 10660120 -> 1000004900130: ākhyāta - verbs 10660130 -> 1000004900140: upasarga - pre-verbs or prefixes 10660140 -> 1000004900150: nipāta - particles, invariant words (perhaps prepositions) 10660150 -> 1000004900160: These four were grouped into two large classes: inflected (nouns and verbs) and uninflected (pre-verbs and particles). 10660160 -> 1000004900170: A century or two later, the Greek scholar Plato wrote in the Cratylus dialog that "... sentences are, I conceive, a combination of verbs [rhēma] and nouns [ónoma]". 10660170 -> 1000004900180: Another class, "conjunctions" (covering conjunctions, pronouns, and the article), was later added by Aristotle. 10660180 -> 1000004900190: By the end of the 2nd century BCE, the classification scheme had been expanded into eight categories, seen in the Tékhnē grammatiké: 10660190 -> 1000004900200: Noun: a part of speech inflected for case, signifying a concrete or abstract entity 10660200 -> 1000004900210: Verb: a part of speech without case inflection, but inflected for tense, person and number, signifying an activity or process performed or undergone 10660210 -> 1000004900220: Participle: a part of speech sharing the features of the verb and the noun 10660220 -> 1000004900230: Article: a part of speech inflected for case and preposed or postposed to nouns (the relative pronoun is meant by the postposed article) 10660230 -> 1000004900240: Pronoun: a part of speech substitutable for a noun and marked for person 10660240 -> 1000004900250: Preposition: a part of speech placed before other words in composition and in syntax 10660250 -> 1000004900260: Adverb: a part of speech without inflection, in modification of or in addition to a verb 10660260 -> 1000004900270: Conjunction: a part of speech binding together the discourse and filling gaps in its interpretation 10660270 -> 1000004900280: The Latin grammarian Priscian (fl. 500 CE) modified the above eight-fold system, substituting "interjection" for "article". 10660280 -> 1000004900290: It wasn't until 1767 that the adjective was taken as a separate class. 10660290 -> 1000004900300: Traditional English grammar is patterned after the European tradition above, and is still taught in schools and used in dictionaries. 10660300 -> 1000004900310: It names eight parts of speech: noun, verb, adjective, adverb, pronoun, preposition, conjunction, and interjection (sometimes called an exclamation). 10660310 -> 1000004900320: Controversies 10660320 -> 1000004900330: Since the Greek grammarians of 2nd century BCE, parts of speech have been defined by morphological, syntactic and semantic criteria. 10660330 -> 1000004900340: However, there is currently no generally agreed-upon classification scheme that can apply to all languages, or even a set of criteria upon which such a scheme should be based. 10660340 -> 1000004900350: Linguists recognize that the above list of eight word classes is simplified and artificial. 10660350 -> 1000004900360: For example, "adverb" is to some extent a catch-all class that includes words with many different functions. 
10660360 -> 1000004900370: Some have even argued that the most basic of category distinctions, that of nouns and verbs, is unfounded, or not applicable to certain languages. 10660370 -> 1000004900380: Functional classification 10660380 -> 1000004900390: Common ways of delimiting words by function include: 10660390 -> 1000004900400: Open word classes: 10660400 -> 1000004900410: adjectives 10660410 -> 1000004900420: adverbs 10660420 -> 1000004900430: interjections 10660430 -> 1000004900440: nouns 10660440 -> 1000004900450: verbs (except auxiliary verbs) 10660450 -> 1000004900460: Closed word classes: 10660460 -> 1000004900470: auxiliary verbs 10660470 -> 1000004900480: clitics 10660480 -> 1000004900490: coverbs 10660490 -> 1000004900500: conjunctions 10660500 -> 1000004900510: Determiners (articles, quantifiers, demonstrative adjectives, and possessive adjectives) 10660510 -> 1000004900520: particles 10660520 -> 1000004900530: measure words 10660530 -> 1000004900540: adpositions (prepositions, postpositions, and circumpositions) 10660540 -> 1000004900550: preverbs 10660550 -> 1000004900560: pronouns 10660560 -> 1000004900570: contractions 10660570 -> 1000004900580: cardinal numbers 10660580 -> 1000004900590: English 10660590 -> 1000004900600: English frequently does not mark words as belonging to one part of speech or another. 10660600 -> 1000004900610: Words like neigh, break, outlaw, laser, microwave and telephone might all be either verb forms or nouns. 10660610 -> 1000004900620: Although -ly is an adverb marker, not all adverbs end in -ly and not all words ending in -ly are adverbs. 10660620 -> 1000004900630: For instance, tomorrow, slow, fast, crosswise can all be adverbs, while early, friendly, ugly are all adjectives (though early can also function as an adverb). 10660630 -> 1000004900640: In certain circumstances, even words with primarily grammatical functions can be used as verbs or nouns, as in "We must look to the hows and not just the whys" or "Miranda was to-ing and fro-ing and not paying attention". Linguistics 10480010 -> 1000005000020: Linguistics 10480020 -> 1000005000030: Linguistics is the scientific study of language, encompassing a number of sub-fields. 10480030 -> 1000005000040: An important topical division is between the study of language structure (grammar) and the study of meaning (semantics). 10480040 -> 1000005000050: Grammar encompasses morphology (the formation and composition of words), syntax (the rules that determine how words combine into phrases and sentences) and phonology (the study of sound systems and abstract sound units). 10480050 -> 1000005000060: Phonetics is a related branch of linguistics concerned with the actual properties of speech sounds (phones), non-speech sounds, and how they are produced and perceived. 10480060 -> 1000005000070: Over the twentieth century, following the work of Noam Chomsky, linguistics came to be dominated by the Generativist school, which is chiefly concerned with explaining how human beings acquire language and the biological constraints on this acquisition; generative theory is modularist in character. 10480070 -> 1000005000080: While this remains the dominant paradigm, other linguistic theories have increasingly gained in popularity — cognitive linguistics being a prominent example. 
10480080 -> 1000005000090: There are many sub-fields in linguistics, which may or may not be dominated by a particular theoretical approach: evolutionary linguistics, for example, attempts to account for the origins of language; historical linguistics explores language change; and sociolinguistics looks at the relation between linguistic variation and social structures. 10480090 -> 1000005000100: A variety of intellectual disciplines are relevant to the study of language. 10480100 -> 1000005000110: Although certain linguists have downplayed the relevance of some other fields, linguistics — like other sciences — is highly interdisciplinary and draws on work from such fields as psychology, informatics, computer science, philosophy, biology, human anatomy, neuroscience, sociology, anthropology, and acoustics. 10480110 -> 1000005000120: Names for the discipline 10480120 -> 1000005000130: Before the twentieth century (the word is first attested 1716), the term "philology" was commonly used to refer to the science of language, which was then predominately historical in focus. 10480130 -> 1000005000140: Since Ferdinand de Saussure's insistence on the importance of synchronic analysis, however, this focus has shifted and the term "philology" is now generally used for the "study of a language's grammar, history and literary tradition", especially in the USA., where it was never as popular as elsewhere in the sense "science of language". 10480140 -> 1000005000150: The term "linguistics" dates from 1847, although "linguist" in the sense a student of language" dates from 1641. 10480150 -> 1000005000160: It is now the usual academic term in English for the scientific study of language. 10480160 -> 1000005000170: Fundamental concerns and divisions 10480170 -> 1000005000180: Linguistics concerns itself with describing and explaining the nature of human language. 10480180 -> 1000005000190: Relevant to this are the questions of what is universal to language, how language can vary, and how human beings come to know languages. 10480190 -> 1000005000200: All humans (setting aside extremely pathological cases) achieve competence in whatever language is spoken (or signed, in the case of signed languages) around them when growing up, with apparently little need for explicit conscious instruction. 10480200 -> 1000005000210: While non-humans acquire their own communication systems, they do not acquire human language in this way (although many non-human animals can learn to respond to language, or can even be trained to use it to a degree). 10480210 -> 1000005000220: Therefore, linguists assume, the ability to acquire and use language is an innate, biologically-based potential of modern human beings, similar to the ability to walk. 10480220 -> 1000005000230: There is no consensus, however, as to the extent of this innate potential, or its domain-specificity (the degree to which such innate abilities are specific to language), with some theorists claiming that there is a very large set of highly abstract and specific binary settings coded into the human brain, while others claim that the ability to learn language is a product of general human cognition. 10480230 -> 1000005000240: It is, however, generally agreed that there are no strong genetic differences underlying the differences between languages: an individual will acquire whatever language(s) they are exposed to as a child, regardless of parentage or ethnic origin. 
10480240 -> 1000005000250: Linguistic structures are pairings of meaning and form (which may consist of sound patterns, movements of the hand, written symbols, and so on); such pairings are known as Saussurean signs. 10480250 -> 1000005000260: Linguists may specialize in some sub-area of linguistic structure, which can be arranged in the following terms, from form to meaning: 10480260 -> 1000005000270: Phonetics, the study of the physical properties of speech (or signed) production and perception 10480270 -> 1000005000280: Phonology, the study of sounds (adjusted appropriately for signed languages) as discrete, abstract elements in the speaker's mind that distinguish meaning 10480280 -> 1000005000290: Morphology, the study of internal structures of words and how they can be modified 10480290 -> 1000005000300: Syntax, the study of how words combine to form grammatical sentences 10480300 -> 1000005000310: Semantics, the study of the meaning of words (lexical semantics) and fixed word combinations (phraseology), and how these combine to form the meanings of sentences 10480310 -> 1000005000320: Pragmatics, the study of how utterances are used (literally, figuratively, or otherwise) in communicative acts 10480320 -> 1000005000330: Discourse analysis, the analysis of language use in texts (spoken, written, or signed) 10480330 -> 1000005000340: Many linguists would agree that these divisions overlap considerably, and the independent significance of each of these areas is not universally acknowledged. 10480340 -> 1000005000350: Regardless of any particular linguist's position, each area has core concepts that foster significant scholarly inquiry and research. 10480350 -> 1000005000360: Intersecting with these domains are fields arranged around the kind of external factors that are considered. 10480360 -> 1000005000370: For example 10480370 -> 1000005000380: Linguistic typology, the study of the common properties of diverse unrelated languages, properties that may, given sufficient attestation, be assumed to be innate to human language capacity. 10480380 -> 1000005000390: Stylistics, the study of linguistic factors that place a discourse in context. 10480390 -> 1000005000400: Developmental linguistics, the study of the development of linguistic ability in an individual, particularly the acquisition of language in childhood. 10480400 -> 1000005000410: Historical linguistics or Diachronic linguistics, the study of language change. 10480410 -> 1000005000420: Language geography, the study of the spatial patterns of languages. 10480420 -> 1000005000430: Evolutionary linguistics, the study of the origin and subsequent development of language. 10480430 -> 1000005000440: Psycholinguistics, the study of the cognitive processes and representations underlying language use. 10480440 -> 1000005000450: Sociolinguistics, the study of social patterns and norms of linguistic variability. 10480450 -> 1000005000460: Clinical linguistics, the application of linguistic theory to the area of Speech-Language Pathology. 10480460 -> 1000005000470: Neurolinguistics, the study of the brain networks that underlie grammar and communication. 10480470 -> 1000005000480: Biolinguistics, the study of natural as well as human-taught communication systems in animals compared to human language. 10480480 -> 1000005000490: Computational linguistics, the study of computational implementations of linguistic structures. 
10480490 -> 1000005000500: Applied linguistics, the study of language related issues applied in everyday life, notably language. policies, planning, and education. 10480500 -> 1000005000510: Constructed language fits under Applied linguistics. 10480510 -> 1000005000520: The related discipline of semiotics investigates the relationship between signs and what they signify. 10480520 -> 1000005000530: From the perspective of semiotics, language can be seen as a sign or symbol, with the world as its representation. 10480530 -> 1000005000540: Variation and universality 10480540 -> 1000005000550: Much modern linguistic research, particularly within the paradigm of generative grammar, has concerned itself with trying to account for differences between languages of the world. 10480550 -> 1000005000560: This has worked on the assumption that if human linguistic ability is narrowly constrained by human biology, then all languages must share certain fundamental properties. 10480560 -> 1000005000570: In generativist theory, the collection of fundamental properties all languages share are referred to as universal grammar (UG). 10480570 -> 1000005000580: The specific characteristics of this universal grammar are a much debated topic. 10480580 -> 1000005000590: Typologists and non-generativist linguists usually refer simply to language universals, or universals of language. 10480590 -> 1000005000600: Similarities between languages can have a number of different origins. 10480600 -> 1000005000610: In the simplest case, universal properties may be due to universal aspects of human experience. 10480610 -> 1000005000620: For example, all humans experience water, and all human languages have a word for water. 10480620 -> 1000005000630: Other similarities may be due to common descent: the Latin language spoken by the Ancient Romans developed into Spanish in Spain and Italian in Italy; similarities between Spanish and Italian are thus in many cases due to both being descended from Latin. 10480630 -> 1000005000640: In other cases, contact between languages — particularly where many speakers are bilingual — can lead to much borrowing of structures, as well as words. 10480640 -> 1000005000650: Similarity may also, of course, be due to coincidence. 10480650 -> 1000005000660: English much and Spanish mucho are not descended from the same form or borrowed from one language to the other; nor is the similarity due to innate linguistic knowledge (see False cognate). 10480660 -> 1000005000670: Arguments in favor of language universals have also come from documented cases of sign languages (such as Al-Sayyid Bedouin Sign Language) developing in communities of congenitally deaf people, independently of spoken language. 10480670 -> 1000005000680: The properties of these sign languages conform generally to many of the properties of spoken languages. 10480680 -> 1000005000690: Other known and suspected sign language isolates include Kata Kolok, Nicaraguan Sign Language, and Providence Island Sign Language. 10480690 -> 1000005000700: Structures 10480700 -> 1000005000710: It has been perceived that languages tend to be organized around grammatical categories such as noun and verb, nominative and accusative, or present and past, though, importantly, not exclusively so. 10480710 -> 1000005000720: The grammar of a language is organized around such fundamental categories, though many languages express the relationships between words and syntax in other discrete ways (cf. 
some Bantu languages for noun/verb relations, ergative/absolutive systems for case relations, several Native American languages for tense/aspect relations). 10480720 -> 1000005000730: In addition to making substantial use of discrete categories, language has the important property that it organizes elements into recursive structures; this allows, for example, a noun phrase to contain another noun phrase (as in “the chimpanzee’s lips”) or a clause to contain a clause (as in “I think that it’s raining”). 10480730 -> 1000005000740: Though recursion in grammar was implicitly recognized much earlier (for example by Jespersen), the importance of this aspect of language became more popular after the 1957 publication of Noam Chomsky’s book “Syntactic Structures”, - that presented a formal grammar of a fragment of English. 10480740 -> 1000005000750: Prior to this, the most detailed descriptions of linguistic systems were of phonological or morphological systems. 10480750 -> 1000005000760: Chomsky used a context-free grammar augmented with transformations. 10480760 -> 1000005000770: Since then, following the trend of Chomskyan linguistics, context-free grammars have been written for substantial fragments of various languages (for example GPSG, for English), but it has been demonstrated that human languages include cross-serial dependencies, which cannot be handled adequately by context-free grammars. 10480770 -> 1000005000780: Some selected sub-fields 10480780 -> 1000005000790: Diachronic linguistics 10480790 -> 1000005000800: Studying languages at a particular point in time (usually the present) is "synchronic", while diachronic linguistics examines how language changes through time, sometimes over centuries. 10480800 -> 1000005000810: It enjoys both a rich history and a strong theoretical foundation for the study of language change. 10480810 -> 1000005000820: In universities in the United States, the non-historic perspective is often out of fashion. 10480820 -> 1000005000830: The shift in focus to a non-historic perspective started with Saussure and became pre-dominant with Noam Chomsky. 10480830 -> 1000005000840: Explicitly historical perspectives include historical-comparative linguistics and etymology. 10480840 -> 1000005000850: Contextual linguistics 10480850 -> 1000005000860: Contextual linguistics may include the study of linguistics in interaction with other academic disciplines. 10480860 -> 1000005000870: The interdisciplinary areas of linguistics consider how language interacts with the rest of the world. 10480870 -> 1000005000880: Sociolinguistics, anthropological linguistics, and linguistic anthropology are seen as areas that bridge the gap between linguistics and society as a whole. 10480880 -> 1000005000890: Psycholinguistics and neurolinguistics relate linguistics to the medical sciences. 10480890 -> 1000005000900: Other cross-disciplinary areas of linguistics include evolutionary linguistics, computational linguistics and cognitive science. 10480900 -> 1000005000910: Applied linguistics 10480910 -> 1000005000920: Linguists are largely concerned with finding and describing the generalities and varieties both within particular languages and among all language. 10480920 -> 1000005000930: Applied linguistics takes the result of those findings and “applies” them to other areas. 10480930 -> 1000005000940: Often “applied linguistics” refers to the use of linguistic research in language teaching, but results of linguistic research are used in many other areas, as well. 
10480940 -> 1000005000950: Today, in the age of information technology, many areas of applied linguistics make use of computers. 10480950 -> 1000005000960: Speech synthesis and speech recognition use phonetic and phonemic knowledge to provide voice interfaces to computers. 10480960 -> 1000005000970: Applications of computational linguistics in machine translation, computer-assisted translation, and natural language processing are areas of applied linguistics which have come to the forefront. 10480970 -> 1000005000980: Their influence has had an effect on theories of syntax and semantics, as modeling syntactic and semantic theories on computers constrains those theories. 10480980 -> 1000005000990: Description and prescription 10480990 -> 1000005001000: Main articles: Descriptive linguistics, Linguistic prescription 10481000 -> 1000005001010: Linguistics is descriptive; linguists describe and explain features of language without making subjective judgments on whether a particular feature is "right" or "wrong". 10481010 -> 1000005001020: This is analogous to practice in other sciences: a zoologist studies the animal kingdom without making subjective judgments on whether a particular animal is better or worse than another. 10481020 -> 1000005001030: Prescription, on the other hand, is an attempt to promote particular linguistic usages over others, often favouring a particular dialect or "acrolect". 10481030 -> 1000005001040: This may have the aim of establishing a linguistic standard, which can aid communication over large geographical areas. 10481040 -> 1000005001050: It may also, however, be an attempt by speakers of one language or dialect to exert influence over speakers of other languages or dialects (see Linguistic imperialism). 10481050 -> 1000005001060: An extreme version of prescriptivism can be found among censors, who attempt to eradicate words and structures which they consider to be destructive to society. 10481060 -> 1000005001070: Speech and writing 10481070 -> 1000005001080: Most contemporary linguists work under the assumption that spoken (or signed) language is more fundamental than written language. 10481080 -> 1000005001090: This is because: 10481090 -> 1000005001100: Speech appears to be a human "universal", whereas there have been many cultures and speech communities that lack written communication; 10481100 -> 1000005001110: Speech evolved before human beings discovered writing; 10481110 -> 1000005001120: People learn to speak and process spoken languages more easily and much earlier than writing. 10481120 -> 1000005001130: Linguists nonetheless agree that the study of written language can be worthwhile and valuable. 10481130 -> 1000005001140: For research that relies on corpus linguistics and computational linguistics, written language is often much more convenient for processing large amounts of linguistic data. 10481140 -> 1000005001150: Large corpora of spoken language are difficult to create and hard to find, and are typically transcribed and written. 10481150 -> 1000005001160: Additionally, linguists have turned to text-based discourse occurring in various formats of computer-mediated communication as a viable site for linguistic inquiry. 10481160 -> 1000005001170: The study of writing systems themselves is in any case considered a branch of linguistics. 10481170 -> 1000005001180: History 10481180 -> 1000005001190: Some of the earliest linguistic activities can be recalled from Iron Age India with the analysis of Sanskrit.
10481190 -> 1000005001200: The Pratishakhyas (from ca. the 8th century BC) constitute as it were a proto-linguistic ad hoc collection of observations about mutations to a given corpus particular to a given Vedic school. 10481200 -> 1000005001210: Systematic study of these texts gives rise to the Vedanga discipline of Vyakarana, the earliest surviving account of which is the work of {(Transl+Pānini+sa+IAST+sa)} (c. 520 – 460 BC), who, however, looks back on what are probably several generations of grammarians, whose opinions he occasionally refers to. 10481210 -> 1000005001220: {(Transl+Pānini+sa+IAST+sa)} formulates close to 4,000 rules which together form a compact generative grammar of Sanskrit. 10481220 -> 1000005001230: Inherent in his analytic approach are the concepts of the phoneme, the morpheme and the root. 10481230 -> 1000005001240: Due to its focus on brevity, his grammar has a highly unintuitive structure, reminiscent of contemporary "machine language" (as opposed to "human readable" programming languages). 10481240 -> 1000005001250: Indian linguistics maintained a high level for several centuries; Patanjali in the 2nd century BC still actively criticizes Panini. 10481250 -> 1000005001260: In the later centuries BC, however, Panini's grammar came to be seen as prescriptive, and commentators came to be fully dependent on it. 10481260 -> 1000005001270: Bhartrihari (c. 450 – 510) theorized the act of speech as being made up of four stages: first, conceptualization of an idea; second, its verbalization and sequencing (articulation); third, delivery of speech into atmospheric air; and fourth, the interpretation of speech by the listener, the interpreter. 10481270 -> 1000005001280: In the Middle East, the Persian linguist Sibawayh made a detailed and professional description of Arabic in 760, in his monumental work, Al-kitab fi al-nahw (الكتاب في النحو, The Book on Grammar), bringing many linguistic aspects of language to light. 10481280 -> 1000005001290: In his book he distinguished phonetics from phonology. 10481290 -> 1000005001300: Western linguistics begins in Classical Antiquity with grammatical speculation such as Plato's Cratylus. 10481300 -> 1000005001310: Sir William Jones noted that Sanskrit shared many common features with classical Latin and Greek, notably verb roots and grammatical structures, such as the case system. 10481310 -> 1000005001320: This led to the theory that all languages sprang from a common source and to the discovery of the Indo-European language family. 10481320 -> 1000005001330: He began the study of comparative linguistics, which would uncover more language families and branches. 10481330 -> 1000005001340: Some early-19th-century linguists were Jakob Grimm, who devised a principle of consonantal shifts in pronunciation – known as Grimm's Law – in 1822; Karl Verner, who formulated Verner's Law; August Schleicher, who created the "Stammbaumtheorie" ("family tree"); and Johannes Schmidt, who developed the "Wellentheorie" ("wave model") in 1872. 10481340 -> 1000005001350: Ferdinand de Saussure was the founder of modern structural linguistics. 10481350 -> 1000005001360: Edward Sapir, a leader in American structural linguistics, was one of the first to explore the relations between language studies and anthropology. 10481360 -> 1000005001370: His methodology had a strong influence on all his successors.
10481370 -> 1000005001380: Noam Chomsky's formal model of language, transformational-generative grammar, developed under the influence of his teacher Zellig Harris, who was in turn strongly influenced by Leonard Bloomfield, has been the dominant model since the 1960s. 10481380 -> 1000005001390: Noam Chomsky remains a pop-linguistic figure. 10481390 -> 1000005001400: Linguists (working in frameworks such as Head-Driven Phrase Structure Grammar (HPSG) or Lexical Functional Grammar (LFG)) are increasingly seen to stress the importance of formalization and formal rigor in linguistic description, and may distance themselves somewhat from Chomsky's more recent work (the "Minimalist" program for Transformational grammar), connecting more closely to his earlier works. 10481400 -> 1000005001410: Other linguists working in Optimality Theory state generalizations in terms of violable constraints that interact with each other, and abandon the traditional rule-based formalism first pioneered by early work in generativist linguistics. 10481410 -> 1000005001420: Functionalist linguists working in functional grammar and Cognitive Linguistics tend to stress the non-autonomy of linguistic knowledge and the non-universality of linguistic structures, thus differing significantly from the Chomskyan school. 10481420 -> 1000005001430: They reject Chomskyan intuitive introspection as a scientific method, relying instead on typological evidence. Linux 10490010 -> 1000005100020: Linux 10490020 -> 1000005100030: Linux (commonly pronounced {(IPA-en+IPA: /ˈlɪnəks/+ˈlɪnəks)} in English; variants exist) is a Unix-like computer operating system. 10490030 -> 1000005100040: Linux is one of the most prominent examples of free software and open source development: typically all underlying source code can be freely modified, used, and redistributed by anyone. 10490040 -> 1000005100050: The name "Linux" comes from the Linux kernel, originally written in 1991 by Linus Torvalds. 10490050 -> 1000005100060: The system's utilities and libraries usually come from the GNU operating system, announced in 1983 by Richard Stallman. 10490060 -> 1000005100070: The GNU contribution is the basis for the alternative name GNU/Linux. 10490070 -> 1000005100080: Predominantly known for its use in servers, Linux is supported by corporations such as Dell, Hewlett-Packard, IBM, Novell, Oracle Corporation, Red Hat, and Sun Microsystems. 10490080 -> 1000005100090: It is used as an operating system for a wide variety of computer hardware, including desktop computers, supercomputers, video game systems, such as the PlayStation 2 and PlayStation 3, several arcade games, and embedded devices such as mobile phones, routers, and stage lighting systems. 10490090 -> 1000005100100: History 10490100 -> 1000005100110: The Unix operating system was conceived and implemented in the 1960s and first released in 1970. 10490110 -> 1000005100120: Its wide availability and portability meant that it was widely adopted, copied and modified by academic institutions and businesses, with its design being influential on authors of other systems. 10490120 -> 1000005100130: The GNU Project, started in 1984, had the goal of creating a "complete Unix-compatible software system" made entirely of free software. 10490130 -> 1000005100140: In 1985, Richard Stallman created the Free Software Foundation and developed the GNU General Public License (GNU GPL). 
10490140 -> 1000005100150: Many of the programs required in an OS (such as libraries, compilers, text editors, a Unix shell, and a windowing system) were completed by the early 1990s, although low-level elements such as device drivers, daemons, and the kernel were stalled and incomplete. 10490150 -> 1000005100160: Linus Torvalds has said that if the GNU kernel had been available at the time (1991), he would not have decided to write his own. 10490160 -> 1000005100170: MINIX 10490170 -> 1000005100180: MINIX, a Unix-like system intended for academic use, was released by Andrew S. Tanenbaum in 1987. 10490180 -> 1000005100190: While source code for the system was available, modification and redistribution were restricted (that is not the case today). 10490190 -> 1000005100200: In addition, MINIX's 16-bit design was not well adapted to the 32-bit design of the increasingly cheap and popular Intel 386 architecture for personal computers. 10490200 -> 1000005100210: In 1991, Torvalds began to work on a non-commercial replacement for MINIX while he was attending the University of Helsinki. 10490210 -> 1000005100220: This eventually became the Linux kernel. 10490220 -> 1000005100230: In 1992, Tanenbaum posted an article on Usenet claiming Linux was obsolete. 10490230 -> 1000005100240: In the article, he criticized the operating system for being monolithic in design and for being tied so closely to the x86 architecture as not to be portable, which he described as "a fundamental error." 10490240 -> 1000005100250: Tanenbaum suggested that those who wanted a modern operating system should look into one based on the microkernel model. 10490250 -> 1000005100260: The posting elicited responses from Torvalds and Ken Thompson, one of the founders of Unix, which resulted in a well-known debate over microkernel and monolithic kernel designs. 10490260 -> 1000005100270: Linux was dependent on the MINIX user space at first. 10490270 -> 1000005100280: With code from the GNU system freely available, it was advantageous if this could be used with the fledgling OS. 10490275 -> 1000005100290: Code licensed under the GNU GPL can be used in other projects, so long as they also are released under the same or a compatible license. 10490280 -> 1000005100300: In order to make the Linux kernel compatible with the components from the GNU Project, Torvalds initiated a switch from his original license (which prohibited commercial redistribution) to the GNU GPL. 10490290 -> 1000005100310: Linux and GNU developers worked to integrate GNU components with Linux to make a fully functional and free operating system. 10490300 -> 1000005100320: Commercial and popular uptake 10490310 -> 1000005100330: Today Linux is used in numerous domains, from embedded systems to supercomputers, and has secured a place in server installations with the popular LAMP application stack. 10490320 -> 1000005100340: Torvalds continues to direct the development of the kernel. 10490330 -> 1000005100350: Stallman heads the Free Software Foundation, which in turn supports the GNU components. 10490340 -> 1000005100360: Finally, individuals and corporations develop third-party non-GNU components. 10490350 -> 1000005100370: These third-party components comprise a vast body of work and may include both kernel modules and user applications and libraries. 10490360 -> 1000005100380: Linux vendors and communities combine and distribute the kernel, GNU components, and non-GNU components, with additional package management software, in the form of Linux distributions.
10490370 -> 1000005100390: Design 10490380 -> 1000005100400: Linux is a modular Unix-like operating system. 10490390 -> 1000005100410: It derives much of its basic design from principles established in Unix during the 1970s and 1980s. 10490400 -> 1000005100420: Linux uses a monolithic kernel, the Linux kernel, which handles process control, networking, and peripheral and file system access. 10490410 -> 1000005100430: Device drivers are integrated directly with the kernel. 10490420 -> 1000005100440: Much of Linux's higher-level functionality is provided by separate projects which interface with the kernel. 10490430 -> 1000005100450: The GNU userland is an important part of most Linux systems, providing the shell and Unix tools which carry out many basic operating system tasks. 10490440 -> 1000005100460: On top of these tools, a graphical user interface can be used, usually running on the X Window System. 10490450 -> 1000005100470: User interface 10490460 -> 1000005100480: Linux can be controlled through a text-based command line interface (CLI), through a graphical user interface (GUI, usually the default for desktops), or through controls on the device itself (common on embedded machines). 10490470 -> 1000005100490: On desktop machines, KDE, GNOME and Xfce are the most popular user interfaces, though a variety of other user interfaces exist. 10490480 -> 1000005100500: Most popular user interfaces run on top of the X Window System (X), which provides network transparency, enabling a graphical application running on one machine to be displayed and controlled from another. 10490490 -> 1000005100510: Other GUIs include X window managers such as FVWM, Enlightenment and Window Maker. 10490500 -> 1000005100520: The window manager provides a means to control the placement and appearance of individual application windows, and interacts with the X Window System. 10490510 -> 1000005100530: A Linux system usually provides a CLI of some sort through a shell, which is the traditional way of interacting with a Unix system. 10490520 -> 1000005100540: A Linux distribution specialized for servers may use the CLI as its only interface. 10490530 -> 1000005100550: A “headless” system, run without even a monitor, can be controlled by the command line via a protocol such as SSH or telnet. 10490540 -> 1000005100560: Most low-level Linux components, including the GNU userland, use the CLI exclusively. 10490550 -> 1000005100570: The CLI is particularly suited for automation of repetitive or delayed tasks, and provides very simple inter-process communication. 10490560 -> 1000005100580: A graphical terminal emulator program is often used to access the CLI from a Linux desktop. 10490570 -> 1000005100590: Development 10490580 -> 1000005100600: The primary difference between Linux and many other popular contemporary operating systems is that the Linux kernel and other components are free and open source software. 10490590 -> 1000005100610: Linux is not the only such operating system, although it is the best-known and most widely used. 10490600 -> 1000005100620: Some free and open source software licences are based on the principle of copyleft, a kind of reciprocity: any work derived from a copyleft piece of software must also be copyleft itself. 10490610 -> 1000005100630: The most common free software license, the GNU GPL, is a form of copyleft, and is used for the Linux kernel and many of the components from the GNU project.
10490620 -> 1000005100640: As an operating system underdog competing with mainstream operating systems, Linux cannot rely on a monopoly advantage; in order for Linux to be convenient for users, Linux aims for interoperability with other operating systems and established computing standards. 10490630 -> 1000005100650: Linux systems adhere to POSIX, SUS, ISO and ANSI standards where possible, although to date only one Linux distribution has been POSIX.1 certified, Linux-FT. 10490640 -> 1000005100660: Free software projects, although developed in a collaborative fashion, are often produced independently of each other. 10490650 -> 1000005100670: However, given that the software licenses explicitly permit redistribution, this provides a basis for larger scale projects that collect the software produced by stand-alone projects and make it available all at once in the form of a Linux distribution. 10490660 -> 1000005100680: A Linux distribution, commonly called a “distro”, is a project that manages a remote collection of Linux-based software, and facilitates installation of a Linux operating system. 10490670 -> 1000005100690: Distributions are maintained by individuals, loose-knit teams, volunteer organizations, and commercial entities. 10490680 -> 1000005100700: They include system software and application software in the form of packages, and distribution-specific software for initial system installation and configuration as well as later package upgrades and installs. 10490690 -> 1000005100710: A distribution is responsible for the default configuration of installed Linux systems, system security, and more generally integration of the different software packages into a coherent whole. 10490700 -> 1000005100720: Community 10490710 -> 1000005100730: Linux is largely driven by its developer and user communities. 10490720 -> 1000005100740: Some vendors develop and fund their distributions on a volunteer basis, Debian being a well-known example. 10490730 -> 1000005100750: Others maintain a community version of their commercial distributions, as Red Hat does with Fedora. 10490740 -> 1000005100760: In many cities and regions, local associations known as Linux Users Groups (LUGs) seek to promote Linux and by extension free software. 10490750 -> 1000005100770: They hold meetings and provide free demonstrations, training, technical support, and operating system installation to new users. 10490760 -> 1000005100780: There are also many Internet communities that seek to provide support to Linux users and developers. 10490770 -> 1000005100790: Most distributions and open source projects have IRC chatrooms or newsgroups. 10490780 -> 1000005100800: Online forums are another means for support, with notable examples being LinuxQuestions.org and the Gentoo forums. 10490790 -> 1000005100810: Linux distributions host mailing lists; commonly there will be a specific topic such as usage or development for a given list. 10490800 -> 1000005100820: There are several technology websites with a Linux focus. 10490810 -> 1000005100830: Linux Weekly News is a weekly digest of Linux-related news; the Linux Journal is an online magazine of Linux articles published monthly; Slashdot is a technology-related news website with many stories on Linux and open source software; Groklaw has written in depth about Linux-related legal proceedings and there are many articles relevant to the Linux kernel and its relationship with GNU on the GNU project's website. 
10490820 -> 1000005100840: Print magazines on Linux often include cover disks including software or even complete Linux distributions. 10490830 -> 1000005100850: Although Linux is generally available free of charge, several large corporations have established business models that involve selling, supporting, and contributing to Linux and free software. 10490840 -> 1000005100860: These include Dell, IBM, HP, Sun Microsystems, Novell, and Red Hat. 10490850 -> 1000005100870: The free software licenses on which Linux is based explicitly accommodate and encourage commercialization; the relationship between Linux as a whole and individual vendors may be seen as symbiotic. 10490860 -> 1000005100880: One common business model of commercial suppliers is charging for support, especially for business users. 10490870 -> 1000005100890: A number of companies also offer a specialized business version of their distribution, which adds proprietary support packages and tools to administer higher numbers of installations or to simplify administrative tasks. 10490880 -> 1000005100900: Another business model is to give away the software in order to sell hardware. 10490890 -> 1000005100910: Programming on Linux 10490900 -> 1000005100920: Most Linux distributions support dozens of programming languages. 10490910 -> 1000005100930: The most common collection of utilities for building both Linux applications and operating system programs is found within the GNU toolchain, which includes the GNU Compiler Collection (GCC) and the GNU build system. 10490920 -> 1000005100940: Amongst others, GCC provides compilers for Ada, C, C++, Java, and Fortran. 10490930 -> 1000005100950: The Linux kernel itself is written to be compiled with GCC. 10490940 -> 1000005100960: Proprietary compilers for Linux include the Intel C++ Compiler and IBM XL C/C++ Compiler. 10490950 -> 1000005100970: Most distributions also include support for Perl, Ruby, Python and other dynamic languages. 10490960 -> 1000005100980: Examples of languages that are less common, but still well-supported, are C# via the Mono project, sponsored by Novell, and Scheme. 10490970 -> 1000005100990: A number of Java Virtual Machines and development kits run on Linux, including the original Sun Microsystems JVM (HotSpot), and IBM's J2SE RE, as well as many open-source projects like Kaffe. 10490980 -> 1000005101000: The two main frameworks for developing graphical applications are those of GNOME and KDE. 10490990 -> 1000005101010: These projects are based on the GTK+ and Qt widget toolkits, respectively, which can also be used independently of the larger framework. 10491000 -> 1000005101020: Both support a wide variety of languages. 10491010 -> 1000005101030: There are a number of Integrated development environments available including Anjuta, Code::Blocks, Eclipse, KDevelop, Lazarus, MonoDevelop, NetBeans, and Omnis Studio while the long-established editors Vim and Emacs remain popular. 10491020 -> 1000005101040: Uses 10491030 -> 1000005101050: As well as those designed for general purpose use on desktops and servers, distributions may be specialized for different purposes including: computer architecture support, embedded systems, stability, security, localization to a specific region or language, targeting of specific user groups, support for real-time applications, or commitment to a given desktop environment. 10491040 -> 1000005101060: Furthermore, some distributions deliberately include only free software. 
10491050 -> 1000005101070: Currently, over three hundred distributions are actively developed, with about a dozen distributions being most popular for general-purpose use. 10491060 -> 1000005101080: Linux is a widely ported operating system. 10491070 -> 1000005101090: While the Linux kernel was originally designed only for Intel 80386 microprocessors, it now runs on a more diverse range of computer architectures than any other operating system: in the hand-held ARM-based iPAQ and the mainframe IBM System z9, in devices ranging from mobile phones to supercomputers. 10491080 -> 1000005101100: Specialized distributions exist for less mainstream architectures. 10491090 -> 1000005101110: The ELKS kernel fork can run on Intel 8086 or Intel 80286 16-bit microprocessors, while the µClinux kernel fork may run on systems without a memory management unit. 10491100 -> 1000005101120: The kernel also runs on architectures that were only ever intended to use a manufacturer-created operating system, such as Macintosh computers, PDAs, video game consoles, portable music players, and mobile phones. 10491110 -> 1000005101130: Desktop 10491120 -> 1000005101140: Although there is a lack of Linux ports for some Mac OS X and Microsoft Windows programs in domains such as desktop publishing and professional audio, applications equivalent to those available for Mac and Windows are available for Linux. 10491130 -> 1000005101150: Most Linux distributions provide a program for browsing a list of thousands of free software applications that have already been tested and configured for a specific distribution. 10491140 -> 1000005101160: These free programs can be downloaded and installed with one mouse click and a digital signature guarantees that no one has added a virus or a spyware to these programs. 10491150 -> 1000005101170: Many free software titles that are popular on Windows, such as Pidgin, Mozilla Firefox, Openoffice.org, and GIMP, are available for Linux. 10491160 -> 1000005101180: A growing amount of proprietary desktop software is also supported under Linux, examples being Adobe Flash Player, Acrobat Reader, Matlab, Nero Burning ROM, Opera, RealPlayer, and Skype. 10491170 -> 1000005101190: In the field of animation and visual effects, most high end software, such as AutoDesk Maya, Softimage XSI and Apple Shake, is available for Linux, Windows and/or Mac OS X. 10491180 -> 1000005101200: CrossOver is a proprietary solution based on the open source Wine project that supports running older Windows versions of Microsoft Office and Adobe Photoshop versions through CS2. 10491190 -> 1000005101210: Microsoft Office 2007 and Adobe Photoshop CS3 are known not to work. 10491200 -> 1000005101220: Besides the free Windows compatibility layer Wine, most distributions offer Dual boot and X86 virtualization for running both Linux and Windows on the same computer. 10491210 -> 1000005101230: Linux's open nature allows distributed teams to localize Linux distributions for use in locales where localizing proprietary systems would not be cost-effective. 10491220 -> 1000005101240: For example the Sinhalese language version of the Knoppix distribution was available for a long time before Microsoft Windows XP was translated to Sinhalese. 10491230 -> 1000005101250: In this case the Lanka Linux User Group played a major part in developing the localized system by combining the knowledge of university professors, linguists, and local developers. 
10491240 -> 1000005101260: The performance of Linux on the desktop has been a controversial topic, with at least one key Linux kernel developer, Con Kolivas, accusing the Linux community of favouring performance on servers. 10491250 -> 1000005101270: He quit Linux development because he was frustrated with this lack of focus on the desktop, and then gave a 'tell-all' interview on the topic. 10491260 -> 1000005101280: Servers and supercomputers 10491270 -> 1000005101290: Historically, Linux has mainly been used as a server operating system, and has risen to prominence in that area; Netcraft reported in September 2006 that eight of the ten most reliable internet hosting companies run Linux on their web servers. 10491280 -> 1000005101300: This is due to its relative stability and long uptime, and to the fact that graphical desktop software is often unneeded on servers. 10491290 -> 1000005101310: Enterprise and non-enterprise Linux distributions may be found running on servers. 10491300 -> 1000005101320: Linux is the cornerstone of the LAMP server-software combination (Linux, Apache, MySQL, Perl/PHP/Python) which has achieved popularity among developers, and which is one of the more common platforms for website hosting. 10491310 -> 1000005101330: Linux is commonly used as an operating system for supercomputers. 10491320 -> 1000005101340: As of November 2007, out of the top 500 systems, 426 (85.2%) run Linux. 10491330 -> 1000005101350: Embedded devices 10491340 -> 1000005101360: Due to its low cost and ability to be easily modified, embedded Linux is often used in embedded systems. 10491350 -> 1000005101370: Linux has become a major competitor to the proprietary Symbian OS found in the majority of smartphones — 16.7% of smartphones sold worldwide during 2006 were using Linux — and it is an alternative to the proprietary Windows CE and Palm OS operating systems on mobile devices. 10491360 -> 1000005101380: Cell phones and PDAs running Linux and built on open-source platforms became a trend from 2007, with devices such as the Nokia N810, Openmoko's Neo1973 and the ongoing Google Android project. 10491370 -> 1000005101390: The popular TiVo digital video recorder uses a customized version of Linux. 10491380 -> 1000005101400: Several standalone network firewall and router products, including several from Linksys, use Linux internally, drawing on its advanced firewall and routing capabilities. 10491390 -> 1000005101410: The Korg OASYS and the Yamaha Motif XS music workstations also run Linux. 10491400 -> 1000005101420: Furthermore, Linux is used in the leading stage lighting control system, the FlyingPig/HighEnd WholeHogIII Console. 10491410 -> 1000005101430: Market share and uptake 10491420 -> 1000005101440: Many quantitative studies of open source software focus on topics including market share and reliability, with numerous studies specifically examining Linux. 10491430 -> 1000005101450: The Linux market is growing rapidly, and the revenue of servers, desktops, and packaged software running Linux is expected to exceed $35.7 billion by 2008. 10491440 -> 1000005101460: IDC's report for Q1 2007 says that Linux now holds 12.7% of the overall server market. 10491450 -> 1000005101470: This estimate was based on the number of Linux servers sold by various companies. 10491460 -> 1000005101480: Desktop adoption of Linux is approximately 1%. 10491470 -> 1000005101490: In comparison, Microsoft operating systems hold more than 90%.
10491480 -> 1000005101500: The frictional cost of switching operating systems and lack of support for certain hardware and application programs designed for Microsoft Windows have been two factors that have inhibited adoption. 10491490 -> 1000005101510: Proponents and analysts attribute the relative success of Linux to its security, reliability, low cost, and freedom from vendor lock-in. 10491500 -> 1000005101520: Also most recently Google has begun to fund Wine, which acts as a compatibility layer, allowing users to run some Windows programs under Linux. 10491510 -> 1000005101530: The XO laptop project of One Laptop Per Child is creating a new and potentially much larger Linux community, planned to reach several hundred million schoolchildren and their families and communities in developing countries. 10491515 -> 1000005101540: Six countries have ordered a million or more units each for delivery in 2007 to distribute to schoolchildren at no charge. 10491520 -> 1000005101550: Google, Red Hat, and eBay are major supporters of the project. 10491530 -> 1000005101560: Copyright and naming 10491540 -> 1000005101570: The Linux kernel and most GNU software are licensed under the GNU General Public License (GPL). 10491550 -> 1000005101580: The GPL requires that anyone who distributes the Linux kernel must make the source code (and any modifications) available to the recipient under the same terms. 10491560 -> 1000005101590: In 1997, Linus Torvalds stated, “Making Linux GPL'd was definitely the best thing I ever did.” 10491570 -> 1000005101600: Other key components of a Linux system may use other licenses; many libraries use the GNU Lesser General Public License (LGPL), a more permissive variant of the GPL, and the X Window System uses the MIT License. 10491580 -> 1000005101610: Torvalds has publicly stated that he would not move the Linux kernel (currently licensed under GPL version 2) to version 3 of the GPL, released in mid-2007, specifically citing some provisions in the new license which prohibit the use of the software in digital rights management. 10491590 -> 1000005101620: A 2001 study of Red Hat Linux 7.1 found that this distribution contained 30 million source lines of code. 10491600 -> 1000005101630: Using the Constructive Cost Model, the study estimated that this distribution required about eight thousand man-years of development time. 10491610 -> 1000005101640: According to the study, if all this software had been developed by conventional proprietary means, it would have cost about 1.08 billion dollars (year 2000 U.S. dollars) to develop in the United States. 10491620 -> 1000005101650: Most of the code (71%) was written in the C programming language, but many other languages were used, including C++, assembly language, Perl, Python, Fortran, and various shell scripting languages. 10491630 -> 1000005101660: Slightly over half of all lines of code were licensed under the GPL. 10491640 -> 1000005101670: The Linux kernel itself was 2.4 million lines of code, or 8% of the total. 10491650 -> 1000005101680: In a later study, the same analysis was performed for Debian GNU/Linux version 4.0. 10491660 -> 1000005101690: This distribution contained over 283 million source lines of code, and the study estimated that it would have cost 5.4 billion Euros to develop by conventional means. 10491670 -> 1000005101700: In the United States, the name Linux is a trademark registered to Linus Torvalds. 10491680 -> 1000005101710: Initially, nobody registered it, but on August 15 1994, William R. 
Della Croce, Jr. filed for the trademark Linux, and then demanded royalties from Linux distributors. 10491690 -> 1000005101720: In 1996, Torvalds and some affected organizations sued him to have the trademark assigned to Torvalds, and in 1997 the case was settled. 10491700 -> 1000005101730: The licensing of the trademark has since been handled by the Linux Mark Institute. 10491710 -> 1000005101740: Torvalds has stated that he only trademarked the name to prevent someone else from using it, but was bound in 2005 by United States trademark law to take active measures to enforce the trademark. 10491720 -> 1000005101750: As a result, the LMI sent out a number of letters to distribution vendors requesting that a fee be paid for the use of the name, and a number of companies have complied. 10491730 -> 1000005101760: GNU/Linux 10491740 -> 1000005101770: The Free Software Foundation views Linux distributions which use GNU software as GNU variants and they ask that such operating systems be referred to as GNU/Linux or a Linux-based GNU system. 10491750 -> 1000005101780: However, the media and population at large refers to this family of operating systems simply as Linux. 10491760 -> 1000005101790: While some distributors make a point of using the aggregate form, most notably Debian with the Debian GNU/Linux distribution, the term's use outside of the enthusiast community is limited. 10491770 -> 1000005101800: The distinction between the Linux kernel and distributions based on it plus the GNU system is a source of confusion to many newcomers, and the naming remains controversial, as many large Linux distributions (e.g. Ubuntu and SuSE Linux) are simply using the Linux name, rather than GNU/Linux. List of chatterbots 10500010 -> 1000005200020: List of chatterbots 10500020 -> None: Chatterbot Directories 10500030 -> None: 10500040 -> None: Chatterbot Central at The Simon Laven Page 10500050 -> None: The Chatterbot Collection 10500060 -> None: AI Hub - A directory of news, programs, and links all related to chatterbots and Artificial Intelligence 10500070 -> None: The Chatterbox Challenge Bots Directory at The Chatterbox Challenge 10500080 -> None: Classic Chatterbots 10500090 -> None: Dr. Sbaitso 10500100 -> None: ELIZA 10500110 -> None: PARRY 10500120 -> None: Racter 10500130 -> None: General Chatterbots 10500140 -> None: A.L.I.C.E. and other Alicebot/pandorabot-based ( iGod, Mitsuku, FriendBot, etc.) 10500150 -> None: Albert One 10500160 -> None: ALIMbot 10500170 -> None: CHAT and TIPS 10500180 -> None: Chat-bot 10500190 -> None: Claude 10500200 -> None: Dadorac 10500210 -> None: DAI2 - A dynamic artificial intelligence which learns from its surrounding community 10500220 -> None: Elbot 10500230 -> None: Ella 10500240 -> None: Fred 10500250 -> None: Jabberwacky 10500260 -> None: Jabberwock 10500270 -> None: Jeeney AI 10500280 -> None: JIxperts – collection of wiki chatterbots. 10500290 -> None: KAR Intelligent Computer 10500300 -> None: Kyle – A unique learning Artificial Intelligence chatbot, which employs contextual learning algorithms. 10500310 -> None: MegaHal 10500320 -> None: Mr Know-It-All 10500330 -> None: Oliverbot 10500340 -> None: Poseidon 10500350 -> None: RoboMatic X1 - A chatbot which controls the user's PC through chatting by their voice or by typing. 10500360 -> None: Splotchy 10500370 -> None: Spookitalk - A chatterbot used for NPCs in Douglas Adams' Starship Titanic video game. 
10500380 -> None: Thomas 10500390 -> None: Ultra Hal Assistant 10500400 -> None: Verbot 10500410 -> None: Yhaken 10500420 -> None: ScientioBot - A new technology chatterbot using concept mining techniques accessible via a free web service. 10500430 -> None: NICOLE A simple chatterbot with the ability to learn new phrases. 10500440 -> None: IM Chatterbots 10500450 -> None: DAI2 is also available on the MSN / Windows Live network as dai2@dai2.co.uk 10500460 -> None: MSN Quickbot 10500470 -> None: SmarterChild 10500480 -> None: Spleak 10500490 -> None: MrMovie - searching actors/movies/dvd's in IM (Skype, AOL/AIM or MSN/Live) 10500500 -> None: InsideMessenger 10500510 -> None: Inocu - (MSN/Live) 10500520 -> None: FriendBot-An AIM Chatterbot 10500530 -> None: amsnEliza plugin for aMSN 10500540 -> None: TrixieMouse 10500550 -> None: Infobot - Polish informational bot for Gadu-gadu, Skype and Jabber 10500560 -> None: AIML Chatterbots 10500570 -> None: Alan - In Turing Enigma Alan Turing's spirit has infiltrated the World War II encrypting device Enigma. 10500580 -> None: Deeb0t 10500590 -> None: Chomsky A chatbot that uses a smiley face to convey emotions. 10500600 -> None: It uses the information in Wikipedia to build its conversations and has links to Wikipedia articles. 10500610 -> None: John Lennon Artificial Intelligence Project 10500620 -> None: SitePal 10500630 -> 1000005200030: JFred Chatterbots 10500640 -> 1000005200040: The Turing Hub 10500650 -> 1000005200050: Educational Chatterbots 10500660 -> 1000005200060: Elizabeth Aims to teach AI techniques and concepts, starting from chatterbot design. 10500670 -> 1000005200070: Accompanied by self-teaching materials, as used at the University of Leeds. 10500680 -> None: Non-English Chatterbots 10500690 -> None: Amanda - (French) with source code for Windows. 10500700 -> None: Proteus 10500710 -> None: [msnim:chat?contact=senhorbot@hotmail.com Senhor Bot] (Brazillian bot for MSN) 10500720 -> None: 10500730 -> None: Loebner prize 10510010 -> 1000005300020: Loebner prize 10510020 -> 1000005300030: The Loebner Prize is an annual competition that awards prizes to the Chatterbot considered by the judges to be the most humanlike of those entered. 10510030 -> 1000005300040: The format of the competition is that of a standard Turing test. 10510040 -> 1000005300050: In the Loebner Prize, as in a Turing test, a human judge is faced with two computer screens. 10510050 -> 1000005300060: One is under the control of a computer, the other is under the control of a human. 10510060 -> 1000005300070: The judge poses questions to the two screens and receives answers. 10510070 -> 1000005300080: Based upon the answers, the judge must decide which screen is controlled by the human and which is controlled by the computer program. 10510080 -> 1000005300090: The contest was begun in 1990 by Hugh Loebner in conjunction with the Cambridge Center for Behavioral Studies of Massachusetts, United States. 10510090 -> 1000005300100: It has since been associated with Flinders University, Dartmouth College, the Science Museum in London, and most recently the University of Reading. 10510100 -> 1000005300110: Within the field of artificial intelligence, the Loebner Prize is somewhat controversial; the most prominent critic, Marvin Minsky, has called it a publicity stunt that does not help the field along. 
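Entrants to the Loebner Prize are chatterbots in the tradition of programs such as ELIZA; many of the simpler ones respond by matching patterns in the judge's typed input and filling slots in canned replies. The following is a minimal, hedged sketch of that style in Python; the patterns and responses are invented for illustration and do not correspond to any actual contestant.

# A minimal sketch of an ELIZA-style pattern-matching chatterbot. The patterns and
# responses are invented for illustration and do not describe any real entrant.
import re

RULES = [
    (r"\bI am (.+)", "Why do you say you are {0}?"),
    (r"\bI feel (.+)", "How long have you felt {0}?"),
    (r"\bbecause\b", "Is that the real reason?"),
]

def respond(utterance):
    for pattern, template in RULES:
        match = re.search(pattern, utterance, re.IGNORECASE)
        if match:
            return template.format(*match.groups())
    return "Please tell me more."

print(respond("I am worried about the Turing test"))
# -> Why do you say you are worried about the Turing test?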
10510110 -> 1000005300120: Prizes 10510120 -> 1000005300130: The prizes for each year include: 10510130 -> 1000005300140: $2,000 for the most human-seeming of all chatterbots for that year - awarded every year. 10510140 -> 1000005300150: In 2005, the prize was increased to $3,000, and the prize was $2,250 in 2006. 10510150 -> 1000005300160: In 2008 the prize will be $3000.00 10510160 -> 1000005300170: $25,000 for the first chatterbot that judges cannot distinguish from a real human in a text-only Turing test, and that can convince judges that the other (human) entity they are talking to simultaneously is a computer. 10510165 -> 1000005300180: (to be awarded once only) 10510170 -> 1000005300190: $100,000 to the first chatterbot that judges cannot distinguish from a real human in a Turing test that includes deciphering and understanding text, visual, and auditory input. 10510175 -> 1000005300200: (to be awarded once only) 10510180 -> 1000005300210: The Loebner Prize dissolves once the $100,000 prize is won. 10510190 -> 1000005300220: 2008 Loebner Prize 10510200 -> 1000005300230: The 2008 Competition is to be held on Sunday 12 October in University of Reading, UK. 10510210 -> 1000005300240: The event, which is being co-directed by Kevin Warwick, will include a direct challenge on the Turing test as originally proposed by Alan Turing. 10510220 -> 1000005300250: The first place winner will receive $3000.00 and a bronze medal. 10510230 -> 1000005300260: 2007 Loebner Prize 10510240 -> 1000005300270: The 2007 Competition was held on Sunday, 21 October in New York City. 10510250 -> 1000005300280: The participants in the contest were: 10510260 -> 1000005300290: Rollo Carpenter from Icogno, creator of Jabberwacky 10510270 -> 1000005300300: Noah Duncan, private entry, creator of Cletus 10510280 -> 1000005300310: Robert Medeksza from Zabaware, creator of Ultra Hal Assistant 10510290 -> 1000005300320: No bot passed the Turing test but the judges ranked the bots as "most human". 10510300 -> 1000005300330: The results of the contest were: 10510310 -> 1000005300340: 1st place: Robert Medeksza 10510320 -> 1000005300350: 2nd place: Noah Duncan 10510330 -> 1000005300360: 3rd place: Rollo Carpenter 10510340 -> 1000005300370: The winner received $2250 and the Annual Medal. 10510350 -> 1000005300380: The runners up received $250 each. 10510360 -> 1000005300390: 2006 Loebner Prize 10510370 -> 1000005300400: On Wednesday, August 30, the finalists for the 2006 Loebner Prize were announced. 10510380 -> 1000005300410: The finalists were: 10510390 -> 1000005300420: Rollo Carpenter 10510400 -> 1000005300430: Richard Churchill and Marie-Claire Jenkins 10510410 -> 1000005300440: Noah Duncan 10510420 -> 1000005300450: Robert Medeksza 10510430 -> 1000005300460: The contest was held on Sunday, 17 September at the Torrington Theatre, University College London. 10510440 -> None: Winners Machine learning 10520010 -> 1000005400020: Machine learning 10520020 -> 1000005400030: As a broad subfield of artificial intelligence, machine learning is concerned with the design and development of algorithms and techniques that allow computers to "learn". 10520030 -> 1000005400040: At a general level, there are two types of learning: inductive, and deductive. 10520040 -> 1000005400050: Inductive machine learning methods extract rules and patterns out of massive data sets. 10520050 -> 1000005400060: The major focus of machine learning research is to extract information from data automatically, by computational and statistical methods. 
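As a concrete illustration of extracting a pattern from labelled data, here is a minimal sketch of one of the simplest inductive methods, a one-nearest-neighbour classifier; the toy data set is invented purely for illustration, and real applications would use far larger data sets and feature vectors.

# A minimal sketch of inductive learning: a 1-nearest-neighbour classifier that
# "learns" only from labelled examples. The tiny data set below is invented.
def predict(examples, query):
    """Return the label of the training example closest to the query vector."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(examples, key=lambda ex: distance(ex[0], query))
    return label

# Labelled examples: (feature vector, class label).
training_data = [
    ((1.0, 1.2), "A"),
    ((0.9, 0.8), "A"),
    ((5.1, 4.9), "B"),
    ((4.8, 5.3), "B"),
]

print(predict(training_data, (1.1, 0.9)))  # -> A
print(predict(training_data, (5.0, 5.0)))  # -> B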
10520060 -> 1000005400070: Hence, machine learning is closely related not only to data mining and statistics, but also to theoretical computer science. 10520070 -> 1000005400080: Applications 10520080 -> 1000005400090: Machine learning has a wide spectrum of applications including natural language processing, syntactic pattern recognition, search engines, medical diagnosis, bioinformatics, brain-machine interfaces and cheminformatics, detecting credit card fraud, stock market analysis, classifying DNA sequences, speech and handwriting recognition, object recognition in computer vision, game playing and robot locomotion. 10520090 -> 1000005400100: Human interaction 10520100 -> 1000005400110: Some machine learning systems attempt to eliminate the need for human intuition in the analysis of the data, while others adopt a collaborative approach between human and machine. 10520110 -> 1000005400120: Human intuition cannot be entirely eliminated, since the designer of the system must specify how the data is to be represented and what mechanisms will be used to search for a characterization of the data. 10520120 -> 1000005400130: Machine learning can be viewed as an attempt to automate parts of the scientific method. 10520130 -> 1000005400140: Some statistical machine learning researchers create methods within the framework of Bayesian statistics. 10520140 -> 1000005400150: Algorithm types 10520150 -> 1000005400160: Machine learning algorithms are organized into a taxonomy, based on the desired outcome of the algorithm. 10520160 -> 1000005400170: Common algorithm types include: 10520170 -> 1000005400180: Supervised learning — in which the algorithm generates a function that maps inputs to desired outputs. 10520180 -> 1000005400190: One standard formulation of the supervised learning task is the classification problem: the learner is required to learn (to approximate) the behavior of a function which maps a vector [X_1, X_2, …, X_N] into one of several classes by looking at several input-output examples of the function. 10520190 -> 1000005400200: Unsupervised learning — in which an agent models a set of inputs; labeled examples are not available. 10520200 -> 1000005400210: Semi-supervised learning — which combines both labeled and unlabeled examples to generate an appropriate function or classifier. 10520210 -> 1000005400220: Reinforcement learning — in which the algorithm learns a policy of how to act given an observation of the world. 10520220 -> 1000005400230: Every action has some impact on the environment, and the environment provides feedback that guides the learning algorithm. 10520230 -> 1000005400240: Transduction — similar to supervised learning, but it does not explicitly construct a function: instead, it tries to predict new outputs based on training inputs, training outputs, and test inputs which are available while training. 10520240 -> 1000005400250: Learning to learn — in which the algorithm learns its own inductive bias based on previous experience. 10520250 -> 1000005400260: The computational analysis of machine learning algorithms and their performance is a branch of theoretical computer science known as computational learning theory. 10520260 -> 1000005400270: Machine learning topics 10520270 -> 1000005400280: This list represents the topics covered in a typical machine learning course.
10520280 -> 1000005400290: Prerequisites 10520290 -> 1000005400300: Bayesian theory 10520300 -> 1000005400310: Modeling conditional probability density functions: regression and classification 10520310 -> 1000005400320: Artificial neural networks 10520320 -> 1000005400330: Decision trees 10520330 -> 1000005400340: Gene expression programming 10520340 -> 1000005400350: Genetic algorithms 10520350 -> 1000005400360: Genetic programming 10520360 -> 1000005400370: Holographic associative memory 10520370 -> 1000005400380: Inductive Logic Programming 10520380 -> 1000005400390: Gaussian process regression 10520390 -> 1000005400400: Linear discriminant analysis 10520400 -> 1000005400410: K-nearest neighbor 10520410 -> 1000005400420: Minimum message length 10520420 -> 1000005400430: Perceptron 10520430 -> 1000005400440: Quadratic classifier 10520440 -> 1000005400450: Radial basis function networks 10520450 -> 1000005400460: Support vector machines 10520460 -> 1000005400470: Algorithms for estimating model parameters 10520470 -> 1000005400480: Dynamic programming 10520480 -> 1000005400490: Expectation-maximization algorithm 10520490 -> 1000005400500: Modeling probability density functions through generative models 10520500 -> 1000005400510: Graphical models including Bayesian networks and Markov random fields 10520510 -> 1000005400520: Generative topographic map 10520520 -> 1000005400530: Approximate inference techniques 10520530 -> 1000005400540: Monte Carlo methods 10520540 -> 1000005400550: Variational Bayes 10520550 -> 1000005400560: Variable-order Markov models 10520560 -> 1000005400570: Variable-order Bayesian networks 10520570 -> 1000005400580: Loopy belief propagation 10520580 -> 1000005400590: Optimization 10520590 -> 1000005400600: Most of methods listed above either use optimization or are instances of optimization algorithms 10520600 -> 1000005400610: Meta-learning (ensemble methods) 10520610 -> 1000005400620: Boosting 10520620 -> 1000005400630: Bootstrap aggregating 10520630 -> 1000005400640: Random forest 10520640 -> 1000005400650: Weighted majority algorithm 10520650 -> 1000005400660: Inductive transfer and learning to learn 10520660 -> 1000005400670: Inductive transfer 10520670 -> 1000005400680: Reinforcement learning 10520680 -> 1000005400690: Temporal difference learning 10520690 -> 1000005400700: Monte-Carlo method Machine translation 10530010 -> 1000005500020: Machine translation 10530020 -> 1000005500030: Machine translation, sometimes referred to by the abbreviation MT, is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another. 10530030 -> 1000005500040: At its basic level, MT performs simple substitution of words in one natural language for words in another. 10530040 -> 1000005500050: Using corpus techniques, more complex translations may be attempted, allowing for better handling of differences in linguistic typology, phrase recognition, and translation of idioms, as well as the isolation of anomalies. 10530050 -> 1000005500060: Current machine translation software often allows for customisation by domain or profession (such as weather reports) — improving output by limiting the scope of allowable substitutions. 10530060 -> 1000005500070: This technique is particularly effective in domains where formal or formulaic language is used. 
10530070 -> 1000005500080: It follows, then, that machine translation of government and legal documents more readily produces usable output than conversation or less standardised text. 10530080 -> 1000005500090: Improved output quality can also be achieved by human intervention: for example, some systems are able to translate more accurately if the user has unambiguously identified which words in the text are names. 10530090 -> 1000005500100: With the assistance of these techniques, MT has proven useful as a tool to assist human translators, and in some cases can even produce output that can be used "as is". 10530100 -> 1000005500110: However, current systems are unable to produce output of the same quality as a human translator, particularly where the text to be translated uses casual language. 10530110 -> 1000005500120: History 10530120 -> 1000005500130: The history of machine translation begins in the 1950s, after World War II. 10530130 -> 1000005500140: The Georgetown experiment (1954) involved fully-automatic translation of over sixty Russian sentences into English. 10530140 -> 1000005500150: The experiment was a great success and ushered in an era of substantial funding for machine-translation research. 10530150 -> 1000005500160: The authors claimed that within three to five years, machine translation would be a solved problem. 10530160 -> 1000005500170: Real progress was much slower, however, and after the ALPAC report (1966), which found that the ten-year-long research had failed to fulfill expectations, funding was greatly reduced. 10530170 -> 1000005500180: Beginning in the late 1980s, as computational power increased and became less expensive, more interest was shown in statistical models for machine translation. 10530180 -> 1000005500190: The idea of using digital computers for translation of natural languages was proposed as early as 1946 by A. D. Booth and possibly others. 10530190 -> 1000005500200: The Georgetown experiment was by no means the first such application, and a demonstration was made in 1954 on the APEXC machine at Birkbeck College (London University) of a rudimentary translation of English into French. 10530200 -> 1000005500210: Several papers on the topic were published at the time, and even articles in popular journals (see for example Wireless World, Sept. 1955, Cleave and Zacharov). 10530210 -> 1000005500220: A similar application, also pioneered at Birkbeck College at the time, was reading and composing Braille texts by computer. 10530220 -> 1000005500230: More recently, the Internet has emerged as a global information infrastructure, revolutionizing access to information and enabling fast information transfer and exchange. 10530230 -> 1000005500240: With Internet and e-mail technology, people increasingly need to communicate rapidly over long distances and across continental boundaries. 10530240 -> 1000005500250: Not all of these Internet users, however, share a language with the people they wish to reach. 10530250 -> 1000005500260: Machine translation software may therefore, in the near future, allow people around the world to communicate with one another in their own mother tongues. 10530260 -> 1000005500270: Translation process 10530270 -> 1000005500280: The translation process may be stated as: 10530280 -> 1000005500290: Decoding the meaning of the source text; and 10530290 -> 1000005500300: Re-encoding this meaning in the target language.
10530300 -> 1000005500310: Behind this ostensibly simple procedure lies a complex cognitive operation. 10530310 -> 1000005500320: To decode the meaning of the source text in its entirety, the translator must interpret and analyse all the features of the text, a process that requires in-depth knowledge of the grammar, semantics, syntax, idioms, etc., of the source language, as well as the culture of its speakers. 10530320 -> 1000005500330: The translator needs the same in-depth knowledge to re-encode the meaning in the target language. 10530330 -> 1000005500340: Therein lies the challenge in machine translation: how to program a computer that will "understand" a text as a person does, and that will "create" a new text in the target language that "sounds" as if it has been written by a person. 10530340 -> 1000005500350: This problem may be approached in a number of ways. 10530350 -> 1000005500360: Approaches 10530360 -> 1000005500370: Machine translation can use a method based on linguistic rules, which means that words will be translated in a linguistic way — the most suitable (orally speaking) words of the target language will replace the ones in the source language. 10530370 -> 1000005500380: It is often argued that the success of machine translation requires the problem of natural language understanding to be solved first. 10530380 -> 1000005500390: Generally, rule-based methods parse a text, usually creating an intermediary, symbolic representation, from which the text in the target language is generated. 10530390 -> 1000005500400: According to the nature of the intermediary representation, an approach is described as interlingual machine translation or transfer-based machine translation. 10530400 -> 1000005500410: These methods require extensive lexicons with morphological, syntactic, and semantic information, and large sets of rules. 10530410 -> 1000005500420: Given enough data, machine translation programs often work well enough for a native speaker of one language to get the approximate meaning of what is written by the other native speaker. 10530420 -> 1000005500430: The difficulty is getting enough data of the right kind to support the particular method. 10530430 -> 1000005500440: For example, the large multilingual corpus of data needed for statistical methods to work is not necessary for the grammar-based methods. 10530440 -> 1000005500450: But then, the grammar methods need a skilled linguist to carefully design the grammar that they use. 10530450 -> 1000005500460: To translate between closely related languages, a technique referred to as shallow-transfer machine translation may be used. 10530460 -> 1000005500470: Rule-based 10530470 -> 1000005500480: The rule-based machine translation paradigm includes transfer-based machine translation, interlingual machine translation and dictionary-based machine translation paradigms. 10530480 -> 1000005500490: Transfer-based machine translation 10530490 -> 1000005500500: Interlingual 10530500 -> 1000005500510: Interlingual machine translation is one instance of rule-based machine-translation approaches. 10530510 -> 1000005500520: In this approach, the source language, i.e. the text to be translated, is transformed into an interlingual, i.e. source-/target-language-independent representation. 10530520 -> 1000005500530: The target language is then generated out of the interlingua. 
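A toy sketch can make the rule-based idea concrete: a hand-written bilingual lexicon combined with a single transfer rule (reordering an English adjective-noun pair into the noun-adjective order typical of French). The vocabulary and the rule below are invented for illustration and fall far short of the extensive lexicons and rule sets that real systems require.

# A toy sketch of rule-based (transfer-style) translation: a hand-written bilingual
# lexicon plus a single reordering rule. The vocabulary and rule are illustrative only.
LEXICON = {"the": "le", "white": "blanc", "cat": "chat", "sleeps": "dort"}
ADJECTIVES = {"white"}
NOUNS = {"cat"}

def translate(sentence):
    words = sentence.lower().split()
    # Transfer rule: an English adjective + noun becomes a French noun + adjective.
    reordered = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and words[i] in ADJECTIVES and words[i + 1] in NOUNS:
            reordered += [words[i + 1], words[i]]
            i += 2
        else:
            reordered.append(words[i])
            i += 1
    # Dictionary substitution: replace each source word with its target-language entry.
    return " ".join(LEXICON.get(w, w) for w in reordered)

print(translate("the white cat sleeps"))  # -> le chat blanc dort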
10530530 -> 1000005500540: Dictionary-based 10530540 -> 1000005500550: Machine translation can use a method based on dictionary entries, which means that the words will be translated as they are by a dictionary. 10530550 -> 1000005500560: Statistical 10530560 -> 1000005500570: Statistical machine translation tries to generate translations using statistical methods based on bilingual text corpora, such as the Canadian Hansard corpus, the English-French record of the Canadian parliament and EUROPARL, the record of the European Parliament. 10530570 -> 1000005500580: Where such corpora are available, impressive results can be achieved translating texts of a similar kind, but such corpora are still very rare. 10530580 -> 1000005500590: The first statistical machine translation software was CANDIDE from IBM. 10530590 -> 1000005500600: Google used SYSTRAN for several years, but has switched to a statistical translation method in October 2007. 10530600 -> 1000005500610: Recently, they improved their translation capabilities by inputting approximately 200 billion words from United Nations materials to train their system. 10530610 -> 1000005500620: Accuracy of the translation has improved. 10530620 -> 1000005500630: Example-based 10530630 -> 1000005500640: Example-based machine translation (EBMT) approach is often characterised by its use of a bilingual corpus as its main knowledge base, at run-time. 10530640 -> 1000005500650: It is essentially a translation by analogy and can be viewed as an implementation of case-based reasoning approach of machine learning. 10530650 -> 1000005500660: Major issues 10530660 -> 1000005500670: Disambiguation 10530670 -> 1000005500680: Word sense disambiguation concerns finding a suitable translation when a word can have more than one meaning. 10530680 -> 1000005500690: The problem was first raised in the 1950s by Yehoshua Bar-Hillel. 10530690 -> 1000005500700: He pointed out that without a "universal encyclopedia", a machine would never be able to distinguish between the two meanings of a word. 10530700 -> 1000005500710: Today there are numerous approaches designed to overcome this problem. 10530710 -> 1000005500720: They can be approximately divided into "shallow" approaches and "deep" approaches. 10530720 -> 1000005500730: Shallow approaches assume no knowledge of the text. 10530730 -> 1000005500740: They simply apply statistical methods to the words surrounding the ambiguous word. 10530740 -> 1000005500750: Deep approaches presume a comprehensive knowledge of the word. 10530750 -> 1000005500760: So far, shallow approaches have been more successful. 10530760 -> 1000005500770: Named entities 10530770 -> 1000005500780: Related to named entity recognition in information extraction. 10530780 -> 1000005500790: Applications 10530790 -> 1000005500800: There are now many software programs for translating natural language, several of them online, such as the SYSTRAN system which powers both Google translate and AltaVista's Babel Fish as well as Promt that powers online translation services at Voila.fr and Orange.fr. 10530800 -> 1000005500810: Although no system provides the holy grail of "fully automatic high quality machine translation" (FAHQMT), many systems produce reasonable output. 10530810 -> 1000005500820: Despite their inherent limitations, MT programs are used around the world. 10530820 -> 1000005500830: Probably the largest institutional user is the European Commission. 
10530830 -> 1000005500840: Toggletext uses a transfer-based system (known as Kataku) to translate between English and Indonesian. 10530840 -> 1000005500850: Google has claimed that promising results were obtained using a proprietary statistical machine translation engine. 10530850 -> 1000005500860: The statistical translation engine used in the Google language tools for Arabic <-> English and Chinese <-> English has an overall score of 0.4281 over the runner-up IBM's BLEU-4 score of 0.3954 (Summer 2006) in tests conducted by the National Institute for Standards and Technology. 10530860 -> 1000005500870: Uwe Muegge has implemented a demo website that uses a controlled language in combination with the Google tool to produce fully automatic, high-quality machine translations of his English, German, and French web sites. 10530870 -> 1000005500880: With the recent focus on terrorism, the military sources in the United States have been investing significant amounts of money in natural language engineering. 10530880 -> 1000005500890: In-Q-Tel (a venture capital fund, largely funded by the US Intelligence Community, to stimulate new technologies through private sector entrepreneurs) brought up companies like Language Weaver. 10530890 -> 1000005500900: Currently the military community is interested in translation and processing of languages like Arabic, Pashto, and Dari. 10530900 -> 1000005500910: Information Processing Technology Office in DARPA hosts programs like TIDES and Babylon Translator. 10530910 -> 1000005500920: US Air Force has awarded a $1 million contract to develop a language translation technology. 10530920 -> 1000005500930: Evaluation 10530930 -> 1000005500940: There are various means for evaluating the performance of machine-translation systems. 10530940 -> 1000005500950: The oldest is the use of human judges to assess a translation's quality. 10530950 -> 1000005500960: Even though human evaluation is time-consuming, it is still the most reliable way to compare different systems such as rule-based and statistical systems. 10530960 -> 1000005500970: Automated means of evaluation include BLEU, NIST and METEOR. 10530970 -> 1000005500980: Relying exclusively on machine translation ignores that communication in human language is context-embedded, and that it takes a human to adequately comprehend the context of the original text. 10530980 -> 1000005500990: Even purely human-generated translations are prone to error. 10530990 -> 1000005501000: Therefore, to ensure that a machine-generated translation will be of publishable quality and useful to a human, it must be reviewed and edited by a human. 10531000 -> 1000005501010: It has, however, been asserted that in certain applications, e.g. product descriptions written in a controlled language, a dictionary-based machine-translation system has produced satisfactory translations that require no human intervention. Metadata 10540010 -> 1000005600020: Metadata 10540020 -> 1000005600030: Metadata (meta data, or sometimes metainformation) is "data about data", of any sort in any media. 10540030 -> 1000005600040: An item of metadata may describe an individual datum, or content item, or a collection of data including multiple content items and hierarchical levels, for example a database schema. 10540040 -> 1000005600050: Purpose 10540050 -> 1000005600060: Metadata provides context for data. 10540060 -> 1000005600070: Metadata is used to facilitate the understanding, characteristics, and management usage of data. 
10540070 -> 1000005600080: The metadata required for effective data management varies with the type of data and context of use. 10540080 -> 1000005600090: In a library, where the data is the content of the titles stocked, metadata about a title would typically include a description of the content, the author, the publication date and the physical location. 10540090 -> 1000005600100: Examples of Metadata 10540100 -> 1000005600110: Camera 10540110 -> 1000005600120: In the context of a camera, where the data is the photographic image, metadata would typically include the date the photograph was taken and details of the camera settings (lens, focal length, aperture, shutter timing, white balance, etc.). 10540120 -> 1000005600130: Digital Music Player 10540130 -> 1000005600140: On a digital portable music player, the album names, song titles and album art embedded in the music files are used to generate the artist and song listings, and are considered the metadata. 10540140 -> 1000005600150: Information system 10540150 -> 1000005600160: In the context of an information system, where the data is the content of the computer files, metadata about an individual data item would typically include the name of the field and its length. 10540160 -> 1000005600170: Metadata about a collection of data items, a computer file, might typically include the name of the file, the type of file and the name of the data administrator. 10540170 -> 1000005600180: Italic text 10540180 -> None: Real world location 10540190 -> None: If we consider a particular place in the real world, this may be described by data, for example: 10540200 -> None: 1 "E83BJ" . 10540210 -> None: 2 "17" 10540220 -> None: 3 "Sunny" 10540230 -> None: To make sense of and use this data, context is important, and can be provided by metadata. 10540240 -> None: The metadata for the above three items of data might include: 10540250 -> None: 1.1 "Post Code" – This is a brief description (or name) of the data item "E83BJ" 10540260 -> None: 1.2 "The unique identifier of a postal district" – This is another description (a definition) of "E83BJ" 10540270 -> None: 1.3 "27 June 2006" – This could also help describe "E83BJ", for example by giving the date it was last updated 10540280 -> None: 2 "Average temperature in degrees Celsius" – This is a possible description of "17" 10540290 -> None: 3 "Yesterday's weather" – This is a description of "sunny" 10540300 -> None: An item of metadata is itself data and therefore may have its own metadata. 10540310 -> None: For example, "Post Code" might have the following metadata: 10540320 -> None: 1.1.1 "data item name" 10540330 -> None: 1.1.2 "5 characters, starting with A – Z" 10540340 -> None: "27 June 2006" might have the following metadata: 10540350 -> None: 1.3.1 "date last changed" 10540360 -> None: 1.3.2 "dd MMM yyyy" 10540370 -> 1000005600190: Levels 10540380 -> 1000005600200: The hierarchy of metadata descriptions can go on forever, but usually context or semantic understanding makes extensively detailed explanations unnecessary. 10540390 -> 1000005600210: The role played by any particular datum depends on the context. 10540400 -> 1000005600220: For example, when considering the geography of London, "E83BJ" would be a datum and "Post Code" would be metadatum. 10540410 -> 1000005600230: But, when considering the data management of an automated system that manages geographical data, "Post Code" might be a datum and then "data item name" and "5 characters, starting with A – Z" would be metadata. 
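The same layering can be sketched in a few lines of Python, reusing the post-code example above; the dictionary structure shown is purely illustrative and is not any standard metadata format.

# Illustrative sketch only: a datum, its metadata, and metadata about that metadata,
# mirroring the post-code example above.
datum = "E83BJ"

metadata = {
    "name": "Post Code",                                          # a brief description of the datum
    "definition": "The unique identifier of a postal district",
    "last_updated": "27 June 2006",
}

# Metadata is itself data, so it can carry its own metadata.
metametadata = {
    "name": "data item name",                    # describes the "name" entry above
    "format": "5 characters, starting with A - Z",
}

print(datum, metadata["name"], metametadata["format"])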
10540420 -> 1000005600240: In any particular context, metadata characterizes the data it describes, not the entity described by that data. 10540430 -> 1000005600250: So, in relation to "E83BJ", the datum "is in London" is a further description of the place in the real world which has the post code "E83BJ", not of the code itself. 10540440 -> 1000005600260: Therefore, although it is providing information connected to "E83BJ" (telling us that this is the post code of a place in London), this would not normally be considered metadata, as it is describing "E83BJ" qua place in the real world and not qua data. 10540450 -> 1000005600270: Definitions 10540460 -> 1000005600280: Etymology 10540470 -> 1000005600290: Meta is a classical Greek preposition (μετ’ αλλων εταιρων) and prefix (μεταβασις) conveying the following senses in English, depending upon the case of the associated noun: among; along with; with; by means of; in the midst of; after; behind. 10540480 -> 1000005600300: In epistemology, the word means "about (its own category)"; thus metadata is "data about the data". 10540490 -> 1000005600310: Varying definitions 10540500 -> 1000005600320: The term was introduced intuitively, without a formal definition. 10540510 -> 1000005600330: Because of that, today there are various definitions. 10540520 -> 1000005600340: The most common one is the literal translation: 10540530 -> 1000005600350: "Data about data are referred to as metadata." 10540540 -> 1000005600360: Example: "12345" is data, and with no additional context is meaningless. 10540550 -> 1000005600370: When "12345" is given a meaningful name (metadata) of "ZIP code", one can understand (at least in the United States, and further placing "ZIP code" within the context of a postal address) that "12345" refers to the General Electric plant in Schenectady, New York. 10540560 -> 1000005600380: As for most people the difference between data and information is merely a philosophical one of no relevance in practical use, other definitions are: 10540570 -> 1000005600390: Metadata is information about data. 10540580 -> 1000005600400: Metadata is information about information. 10540590 -> 1000005600410: Metadata contains information about that data or other data 10540600 -> 1000005600420: There are more sophisticated definitions, such as: 10540610 -> 1000005600430: "Metadata is structured, encoded data that describe characteristics of information-bearing entities to aid in the identification, discovery, assessment, and management of the described entities." 10540620 -> 1000005600440: "[Metadata is a set of] optional structured descriptions that are publicly available to explicitly assist in locating objects." 10540630 -> 1000005600450: These are used more rarely because they tend to concentrate on one purpose of metadata — to find "objects", "entities" or "resources" — and ignore others, such as using metadata to optimize compression algorithms, or to perform additional computations using the data. 10540640 -> 1000005600460: The metadata concept has been extended into the world of systems to include any "data about data": the names of tables, columns, programs, and the like. 10540650 -> 1000005600470: Different views of this "system metadata" are detailed below, but beyond that is the recognition that metadata can describe all aspects of systems: data, activities, people and organizations involved, locations of data and processes, access methods, limitations, timing and events, as well as motivation and rules. 
10540660 -> 1000005600480: Fundamentally, then, metadata is "the data that describe the structure and workings of an organization's use of information, and which describe the systems it uses to manage that information". 10540670 -> 1000005600490: To model metadata is thus to build an "Enterprise model" of the information technology industry itself. 10540680 -> 1000005600500: Metadata and Markup 10540690 -> 1000005600510: In the context of the web and the W3C's work on markup technologies such as HTML, XML and SGML, the concept of metadata has a specific meaning that is perhaps clearer than in other information domains. 10540700 -> 1000005600520: With markup technologies there are metadata, markup and data content. 10540710 -> 1000005600530: The metadata describes characteristics of the data, while the markup identifies the specific type of data content and acts as a container for that document instance. 10540720 -> 1000005600540: This page in Wikipedia is itself an example of such usage: the textual information is data; how it is packaged, linked, referenced, styled and displayed is markup; and aspects and characteristics of that markup are metadata set globally across Wikipedia. 10540730 -> 1000005600550: In the context of markup, the metadata is architected so that document instances can be optimized to contain only a minimal amount of metadata, while the metadata itself is typically referenced externally, such as in a schema definition (XSD) instance. 10540740 -> 1000005600560: It should also be noted that markup provides specialised mechanisms for handling referential data, again avoiding confusion over what is metadata and what is data, and allowing optimizations. 10540750 -> 1000005600570: The reference and ID mechanisms in markup allow links between related data items, and links to data items that can then be reused for a data item, such as an address or product details. 10540760 -> 1000005600580: These are then all themselves simply more data items and markup instances rather than metadata. 10540770 -> 1000005600590: Similarly, there are concepts such as classifications, ontologies and associations for which markup mechanisms are provided. 10540780 -> 1000005600600: A data item can then be linked to such categories via markup, providing a clean delineation between metadata and actual data instances. 10540790 -> 1000005600610: Therefore, the concepts and descriptions in a classification would be metadata, but the actual classification entry for a data item is simply another data instance. 10540800 -> 1000005600620: Some examples can illustrate the points here. 10540810 -> 1000005600630: Items in bold are data content, items in italic are metadata, and normal text items are all markup. 10540820 -> 1000005600640: The two examples show in-line use of metadata within markup relating to a data instance (XML) compared to simple markup (HTML). 10540830 -> 1000005600650: A simple HTML instance example: 10540840 -> 1000005600660: Example 10540850 -> 1000005600670: And then an XML instance example with metadata: 10540860 -> 1000005600680: John 10540870 -> 1000005600690: Here the inline assertion that a person's middle name may be an empty data item is metadata about the data item. 10540880 -> 1000005600700: Such definitions, however, are usually not placed inline in XML. 10540890 -> 1000005600710: Instead, these definitions are moved into the schema definition that contains the metadata for the entire document instance.
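A rough sketch of this separation in Python, with the data instance kept free of metadata and the schema information held externally (the element names and the dictionary standing in for a schema definition are invented for the illustration):

    import xml.etree.ElementTree as ET

    # Markup plus data content: a hypothetical person record.
    instance = """<person>
      <firstName>John</firstName>
      <middleName/>
      <lastName>Smith</lastName>
    </person>"""

    # Metadata about the markup, held outside the instance (a plain dictionary
    # standing in for an XML Schema definition): the statement that middleName
    # is optional and may be empty describes the data item, but is not itself
    # part of the data.
    schema_metadata = {
        "firstName": {"required": True},
        "middleName": {"required": False},
        "lastName": {"required": True},
    }

    root = ET.fromstring(instance)
    for tag, rules in schema_metadata.items():
        element = root.find(tag)
        present = element is not None and (element.text or "").strip() != ""
        status = "ok" if present or not rules["required"] else "missing required value"
        print(tag, status)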
10540900 -> 1000005600720: This again illustrates another important aspect of metadata in the context of markup. 10540910 -> 1000005600730: The metadata is optimally defined only once for a collection of data instances. 10540920 -> 1000005600740: Hence repeated items of markup are rarely metadata, but rather more markup data instances themselves. 10540930 -> 1000005600750: Hierarchies of metadata 10540940 -> 1000005600760: When structured into a hierarchical arrangement, metadata is more properly called an ontology or schema. 10540950 -> 1000005600770: Both terms describe "what exists" for some purpose or to enable some action. 10540960 -> 1000005600780: For instance, the arrangement of subject headings in a library catalog serves not only as a guide to finding books on a particular subject in the stacks, but also as a guide to what subjects "exist" in the library's own ontology and how more specialized topics are related to or derived from the more general subject headings. 10540970 -> 1000005600790: Metadata is frequently stored in a central location and used to help organizations standardize their data. 10540980 -> 1000005600800: This information is typically stored in a metadata registry. 10540990 -> 1000005600810: Difference between data and metadata 10541000 -> 1000005600820: Usually it is not possible to draw a sharp distinction between (plain) data and metadata, because: 10541010 -> 1000005600830: Something can be data and metadata at the same time. 10541020 -> 1000005600840: The headline of an article is both its title (metadata) and part of its text (data). 10541030 -> 1000005600850: Data and metadata can change their roles. 10541040 -> 1000005600860: A poem, as such, would be regarded as data, but if there were a song that used it as lyrics, the whole poem could be attached to an audio file of the song as metadata. 10541050 -> 1000005600870: Thus, the labeling depends on the point of view. 10541060 -> 1000005600880: These considerations apply no matter which of the above definitions is considered, except where explicit markup is used to denote what is data and what is metadata. 10541070 -> 1000005600890: Use 10541080 -> 1000005600900: Metadata has many different applications; this section lists some of the most common. 10541090 -> 1000005600910: Metadata is used to speed up and enrich searching for resources. 10541100 -> 1000005600920: In general, search queries using metadata can save users from performing more complex filter operations manually. 10541110 -> 1000005600930: It is now common for web browsers (with the notable exception of Mozilla Firefox), P2P applications and media management software to automatically download and locally cache metadata, to improve the speed at which files can be accessed and searched. 10541120 -> 1000005600940: Metadata may also be associated with files manually. 10541130 -> 1000005600950: This is often the case with documents which are scanned into a document storage repository such as FileNet or Documentum. 10541140 -> 1000005600960: Once the documents have been converted into an electronic format, a user brings the image up in a viewer application, manually reads the document and keys values into an online application to be stored in a metadata repository. 10541150 -> 1000005600970: Metadata provides additional information to users of the data it describes. 10541160 -> 1000005600980: This information may be descriptive ("These pictures were taken by children in the school's third grade class.") or algorithmic ("Checksum=139F").
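The distinction between descriptive and algorithmic annotations can be sketched in a few lines of Python (the file name and field names here are invented for the illustration; only the checksum values are computed from the data):

    import hashlib
    import zlib

    def build_metadata(path, description):
        # Descriptive metadata is supplied by a person; algorithmic metadata,
        # such as checksums, is derived mechanically from the data itself.
        with open(path, "rb") as f:
            content = f.read()
        return {
            "file": path,
            "description": description,                     # descriptive
            "crc32": format(zlib.crc32(content), "08X"),    # algorithmic
            "sha256": hashlib.sha256(content).hexdigest(),  # algorithmic
        }

    print(build_metadata("class_photo.jpg",
                         "Pictures taken by children in the school's third grade class."))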
10541170 -> 1000005600990: Metadata helps to bridge the semantic gap. 10541180 -> 1000005601000: By telling a computer how data items are related and how these relations can be evaluated automatically, it becomes possible to process even more complex filter and search operations. 10541190 -> 1000005601010: For example, if a search engine understands that "Van Gogh" was a "Dutch painter", it can answer a search query on "Dutch painters" with a link to a web page about Vincent Van Gogh, although the exact words "Dutch painters" never occur on that page. 10541200 -> 1000005601020: This approach, called knowledge representation, is of special interest to the semantic web and artificial intelligence. 10541210 -> 1000005601030: Certain metadata is designed to optimize lossy compression. 10541220 -> 1000005601040: For example, if a video has metadata that allows a computer to tell foreground from background, the latter can be compressed more aggressively to achieve a higher compression rate. 10541230 -> 1000005601050: Some metadata is intended to enable variable content presentation. 10541240 -> 1000005601060: For example, if a picture has metadata that indicates the most important region — the one where there is a person — an image viewer on a small screen, such as on a mobile phone's, can narrow the picture to that region and thus show the user the most interesting details. 10541250 -> 1000005601070: A similar kind of metadata is intended to allow blind people to access diagrams and pictures, by converting them for special output devices or reading their description using text-to-speech software. 10541260 -> 1000005601080: Other descriptive metadata can be used to automate workflows. 10541270 -> 1000005601090: For example, if a "smart" software tool knows content and structure of data, it can convert it automatically and pass it to another "smart" tool as input. 10541280 -> 1000005601100: As a result, users save the many copy-and-paste operations required when analyzing data with "dumb" tools. 10541290 -> 1000005601110: Metadata is becoming an increasingly important part of electronic discovery. 10541295 -> 1000005601120: Application and file system metadata derived from electronic documents and files can be important evidence. 10541300 -> 1000005601130: Recent changes to the Federal Rules of Civil Procedure make metadata routinely discoverable as part of civil litigation. 10541310 -> 1000005601140: Parties to litigation are required to maintain and produce metadata as part of discovery, and spoliation of metadata can lead to sanctions. 10541320 -> 1000005601150: Metadata has become important on the World Wide Web because of the need to find useful information from the mass of information available. 10541330 -> 1000005601160: Manually-created metadata adds value because it ensures consistency. 10541340 -> 1000005601170: If a web page about a certain topic contains a word or phrase, then all web pages about that topic should contain that same word or phrase. 10541350 -> 1000005601180: Metadata also ensures variety, so that if a topic goes by two names each will be used. 10541360 -> 1000005601190: For example, an article about "sport utility vehicles" would also be tagged "4 wheel drives", "4WDs" and "four wheel drives", as this is how SUVs are known in some countries. 10541370 -> 1000005601200: Examples of metadata for an audio CD include the MusicBrainz project and All Media Guide's Allmusic. 10541380 -> 1000005601210: Similarly, MP3 files have metadata tags in a format called ID3. 
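As a rough illustration of such embedded tags, the sketch below reads an ID3v1 tag, which occupies the final 128 bytes of an MP3 file as fixed-width fields; the file name is a placeholder, and real collections often use the richer ID3v2 format instead:

    import struct

    def read_id3v1(path):
        # ID3v1 layout: a "TAG" marker, then 30-byte title, artist and album
        # fields, a 4-byte year, a 30-byte comment and a one-byte genre code.
        with open(path, "rb") as f:
            f.seek(-128, 2)          # seek to 128 bytes before the end of the file
            block = f.read(128)
        if block[:3] != b"TAG":
            return None              # no ID3v1 tag present
        title, artist, album, year = struct.unpack("3x30s30s30s4s", block[:97])
        clean = lambda b: b.rstrip(b"\x00 ").decode("latin-1", "replace")
        return {"title": clean(title), "artist": clean(artist),
                "album": clean(album), "year": clean(year)}

    print(read_id3v1("song.mp3"))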
10541390 -> 1000005601220: Types of metadata 10541400 -> 1000005601230: Metadata can be classified by: 10541410 -> 1000005601240: Content. 10541420 -> 1000005601250: Metadata can either describe the resource itself (for example, name and size of a file) or the content of the resource (for example, "This video shows a boy playing football"). 10541430 -> 1000005601260: Mutability. 10541440 -> 1000005601270: With respect to the whole resource, metadata can be either immutable (for example, the "Title" of a video does not change as the video itself is being played) or mutable (the "Scene description" does change). 10541450 -> 1000005601280: Logical function. 10541460 -> 1000005601290: There are three layers of logical function: at the bottom the subsymbolic layer that contains the raw data itself, then the symbolic layer with metadata describing the raw data, and on the top the logical layer containing metadata that allows logical reasoning using the symbolic layer 10541470 -> 1000005601300: Important issues 10541480 -> 1000005601310: To successfully develop and use metadata, several important issues should be treated with care: 10541490 -> 1000005601320: Metadata risks 10541500 -> 1000005601330: Microsoft Office files include metadata beyond their printable content, such as the original author's name, the creation date of the document, and the amount of time spent editing it. 10541510 -> 1000005601340: Unintentional disclosure can be awkward or even, in professional practices requiring confidentiality, raise malpractice concerns. 10541520 -> 1000005601350: Some of Microsoft Office document's metadata can be seen by clicking File then Properties from the program's menu. 10541530 -> 1000005601360: Other metadata is not visible except through external analysis of a file, such as is done in forensics. 10541540 -> 1000005601370: The author of the Microsoft Word-based Melissa computer virus in 1999 was caught due to Word metadata that uniquely identified the computer used to create the original infected document. 10541550 -> 1000005601380: Metadata lifecycle 10541560 -> 1000005601390: Even in the early phases of planning and designing it is necessary to keep track of all metadata created. 10541570 -> 1000005601400: It is not economical to start attaching metadata only after the production process has been completed. 10541580 -> 1000005601410: For example, if metadata created by a digital camera at recording time is not stored immediately, it may have to be restored afterwards manually with great effort. 10541590 -> 1000005601420: Therefore, it is necessary for different groups of resource producers to cooperate using compatible methods and standards. 10541600 -> 1000005601430: Manipulation. 10541610 -> 1000005601440: Metadata must adapt if the resource it describes changes. 10541620 -> 1000005601450: It should be merged when two resources are merged. 10541630 -> 1000005601460: These operations are seldom performed by today's software; for example, image editing programs usually do not keep track of the Exif metadata created by digital cameras. 10541640 -> 1000005601470: Destruction. 10541650 -> 1000005601480: It can be useful to keep metadata even after the resource it describes has been destroyed, for example in change histories within a text document or to archive file deletions due to digital rights management. 10541660 -> 1000005601490: None of today's metadata standards consider this phase. 
10541670 -> 1000005601500: Storage 10541680 -> 1000005601510: Metadata can be stored either internally, in the same file as the data, or externally, in a separate file. 10541690 -> 1000005601520: Metadata that is embedded with the content is called embedded metadata. 10541700 -> 1000005601530: A data repository typically stores the metadata detached from the data. 10541710 -> 1000005601540: Both ways have advantages and disadvantages: 10541720 -> 1000005601550: Internal storage allows transferring metadata together with the data it describes; thus, metadata is always at hand and can be manipulated easily. 10541730 -> 1000005601560: However, this method creates high redundancy and does not allow the metadata to be held together in one place. 10541740 -> 1000005601570: External storage allows bundling metadata, for example in a database, for more efficient searching. 10541750 -> 1000005601580: There is no redundancy and metadata can be transferred simultaneously when using streaming. 10541760 -> 1000005601590: However, as most formats use URIs for that purpose, the way in which the metadata is linked to its data should be treated with care. 10541770 -> 1000005601600: What if a resource does not have a URI (resources on a local hard disk or web pages that are created on-the-fly using a content management system)? 10541780 -> 1000005601610: What if metadata can only be evaluated if there is a connection to the Web, especially when using RDF? 10541790 -> 1000005601620: How can one recognize that a resource has been replaced by another with the same name but different content? 10541800 -> 1000005601630: Moreover, there is the question of data format: storing metadata in a human-readable format such as XML can be useful because users can understand and edit it without specialized tools. 10541810 -> 1000005601640: On the other hand, these formats are not optimized for storage capacity; it may be useful to store metadata in a binary, non-human-readable format instead to speed up transfer and save memory. 10541820 -> 1000005601650: Criticisms 10541830 -> 1000005601660: Although the majority of computer scientists see metadata as a chance for better interoperability, some critics argue: 10541840 -> 1000005601670: Metadata is too expensive and time-consuming. 10541850 -> 1000005601680: The argument is that companies will not produce metadata without need because it costs extra money, and private users also will not produce complex metadata because its creation is very time-consuming. 10541860 -> 1000005601690: Metadata is too complicated. 10541870 -> 1000005601700: Private users will not create metadata because existing formats, especially MPEG-7, are too complicated. 10541880 -> 1000005601710: As long as there are no automatic tools for creating metadata, it will not be created. 10541890 -> 1000005601720: Metadata is subjective and depends on context. 10541900 -> 1000005601730: Most probably, two people will attach different metadata to the same resource, owing to their different points of view. 10541910 -> 1000005601740: Moreover, metadata can be misinterpreted due to its dependency on context. 10541920 -> 1000005601750: For example, searching for "post-modern art" may miss a certain item because the expression was not in use at the time when that work of art was created, or searching for "pictures taken at 1:00" may produce confusing results due to local time differences. 10541930 -> 1000005601760: There is no end to metadata.
10541940 -> 1000005601770: For example, when annotating a match of soccer with metadata, one can describe all the players and their actions in time and stop there. 10541950 -> 1000005601780: One can also describe the advertisements in the background and the clothes the players wear. 10541960 -> 1000005601790: One can also describe each fan in the stands and the clothes they wear. 10541970 -> 1000005601800: All of this metadata can be interesting to one party or another — such as the spectators, sponsors or a counter-terrorist unit of the police — and even for a simple resource the amount of possible metadata can be gigantic. 10541980 -> 1000005601810: Metadata is useless. 10541990 -> 1000005601820: Many of today's search engines are very efficient at finding text. 10542000 -> 1000005601830: Other techniques for finding pictures, videos and music (namely query-by-example) will become more and more powerful in the future. 10542010 -> 1000005601840: Thus, there is no real need for metadata. 10542020 -> 1000005601850: Opponents of metadata sometimes use the term metacrap to refer to the unsolved problems of metadata in some scenarios. 10542030 -> 1000005601860: These people are also referred to as "Meta Haters." 10542040 -> 1000005601870: Types 10542050 -> 1000005601880: In general, there are two distinct classes of metadata: structural or control metadata and guide metadata. 10542060 -> 1000005601890: Structural metadata is used to describe the structure of computer systems such as tables, columns and indexes. 10542070 -> 1000005601900: Guide metadata is used to help humans find specific items and is usually expressed as a set of keywords in a natural language. 10542080 -> 1000005601910: Metadata can be divided into three distinct categories: 10542090 -> 1000005601920: Descriptive 10542100 -> 1000005601930: Administrative 10542110 -> 1000005601940: Structural 10542120 -> 1000005601950: Relational database metadata 10542130 -> 1000005601960: Each relational database system has its own mechanisms for storing metadata. 10542140 -> 1000005601970: Examples of relational-database metadata include: 10542150 -> 1000005601980: Tables of all tables in a database, their names, sizes and the number of rows in each table. 10542160 -> 1000005601990: Tables of columns in each database, the tables they are used in, and the type of data stored in each column. 10542170 -> 1000005602000: In database terminology, this set of metadata is referred to as the catalog. 10542180 -> 1000005602010: The SQL standard specifies a uniform means to access the catalog, called the INFORMATION_SCHEMA, but not all databases implement it, even if they implement other aspects of the SQL standard. 10542190 -> 1000005602020: For an example of database-specific metadata access methods, see Oracle metadata.
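A short sketch of such catalog metadata, using Python's built-in sqlite3 module (SQLite exposes its catalog through the sqlite_master table and PRAGMA statements rather than through INFORMATION_SCHEMA, which itself illustrates that the standard interface is not universally implemented):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

    # Table-level catalog metadata: every table known to the database.
    for name, sql in conn.execute("SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
        print(name, "->", sql)

    # Column-level metadata for one table: each column's name and declared type.
    for cid, col_name, col_type, notnull, default, pk in conn.execute("PRAGMA table_info(employee)"):
        print(col_name, col_type)

    # Databases that implement the SQL-standard catalog expose comparable
    # information through INFORMATION_SCHEMA.TABLES and INFORMATION_SCHEMA.COLUMNS.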
10542200 -> 1000005602030: Data warehouse metadata 10542210 -> 1000005602040: Data warehouse metadata systems are sometimes separated into two sections: 10542220 -> 1000005602050: back room metadata that are used for Extract, transform, load functions to get OLTP data into a data warehouse 10542230 -> 1000005602060: front room metadata that are used to label screens and create reports 10542240 -> 1000005602070: Kimball lists the following types of metadata in a data warehouse (See also ): 10542250 -> 1000005602080: source system metadata 10542260 -> 1000005602090: source specifications, such as repositories, and source logical schemas 10542270 -> 1000005602100: source descriptive information, such as ownership descriptions, update frequencies, legal limitations, and access methods 10542280 -> 1000005602110: process information, such as job schedules and extraction code 10542290 -> 1000005602120: data staging metadata 10542300 -> 1000005602130: data acquisition information, such as data transmission scheduling and results, and file usage 10542310 -> 1000005602140: dimension table management, such as definitions of dimensions, and surrogate key assignments 10542320 -> 1000005602150: transformation and aggregation, such as data enhancement and mapping, DBMS load scripts, and aggregate definitions 10542330 -> 1000005602160: audit, job logs and documentation, such as data lineage records, data transform logs 10542340 -> 1000005602170: DBMS metadata, such as: 10542350 -> 1000005602180: DBMS system table contents 10542360 -> 1000005602190: processing hints 10542370 -> 1000005602200: Michael Bracket defines metadata (what he calls "Data resource data") as "any data about the organization's data resource". 10542380 -> 1000005602210: Adrienne Tannenbaum defines metadata as "the detailed description of instance data. 10542390 -> 1000005602220: The format and characteristics of populated instance data: instances and values, dependent on the role of the metadata recipient". 10542400 -> 1000005602230: These definitions are characteristic of the "data about data" definition. 10542410 -> 1000005602240: Business Intelligence metadata 10542420 -> 1000005602250: Business Intelligence is the process of analyzing large amounts of corporate data, usually stored in large databases such as the Data Warehouse, tracking business performance, detecting patterns and trends, and helping enterprise business users make better decisions. 10542430 -> 1000005602260: Business Intelligence metadata describes how data is queried, filtered, analyzed, and displayed in Business Intelligence software tools, such as Reporting tools, OLAP tools, Data Mining tools. 10542440 -> 1000005602270: Examples: 10542450 -> 1000005602280: OLAP metadata: The descriptions and structures of Dimensions, Cubes, Measures (Metrics), Hierarchies, Levels, Drill Paths 10542460 -> 1000005602290: Reporting metadata: The descriptions and structures of Reports, Charts, Queries, DataSets, Filters, Variables, Expressions 10542470 -> 1000005602300: Data Mining metadata: The descriptions and structures of DataSets, Algorithms, Queries 10542480 -> 1000005602310: Business Intelligence metadata can be used to understand how corporate financial reports reported to Wall Street are calculated, how the revenue, expense and profit are aggregated from individual sales transactions stored in the data warehouse. 
10542490 -> 1000005602320: A good understanding of Business Intelligence metadata is required to solve complex problems such as compliance with corporate governance standards, such as Sarbanes Oxley (SOX) or Basel II. 10542500 -> 1000005602330: General IT metadata 10542510 -> 1000005602340: In contrast, David Marco, another metadata theorist, defines metadata as "all physical data and knowledge from inside and outside an organization, including information about the physical data, technical and business processes, rules and constraints of the data, and structures of the data used by a corporation." 10542520 -> 1000005602350: Others have included web services, systems and interfaces. 10542530 -> 1000005602360: In fact, the entire Zachman framework (see Enterprise Architecture) can be represented as metadata. 10542540 -> 1000005602370: Notice that such definitions expand metadata's scope considerably, to encompass most or all of the data required by the Management Information Systems capability. 10542550 -> 1000005602380: In this sense, the concept of metadata has significant overlaps with the ITIL concept of a Configuration Management Database (CMDB), and also with disciplines such as Enterprise Architecture and IT portfolio management. 10542560 -> 1000005602390: This broader definition of metadata has precedent. 10542570 -> 1000005602400: Third generation corporate repository products (such as those eventually merged into the CA Advantage line) not only store information about data definitions (COBOL copybooks, DBMS schema), but also about the programs accessing those data structures, and the Job Control Language and batch job infrastructure dependencies as well. 10542580 -> 1000005602410: These products (some of which are still in production) can provide a very complete picture of a mainframe computing environment, supporting exactly the kinds of impact analysis required for ITIL-based processes such as Incident and Change Management. 10542590 -> 1000005602420: The ITIL Back Catalogue includes the Data Management volume which recognizes the role of these metadata products on the mainframe, posing the CMDB as the distributed computing equivalent. 10542600 -> 1000005602430: CMDB vendors however have generally not expanded their scope to include data definitions, and metadata solutions are also available in the distributed world. 10542610 -> 1000005602440: Determining the appropriate role and scope for each is thus a challenge for large IT organizations requiring the services of both. 10542620 -> 1000005602450: Since metadata is pervasive, centralized attempts at tracking it need to focus on the most highly leveraged assets. 10542630 -> 1000005602460: Enterprise Assets may only constitute a small percentage of the entire IT portfolio. 10542640 -> 1000005602470: Some practitioners have successfully managed IT metadata using the Dublin Core metamodel. 10542650 -> 1000005602480: IT metadata management products 10542660 -> 1000005602490: First generation data dictionary/metadata repository tools would be those only supporting a specific DBMS, such as IDMS's IDD (integrated data dictionary), the IMS Data Dictionary, and ADABAS's Predict. 10542670 -> 1000005602500: Second generation would be ASG's DATAMANAGER product which could support many different file and DBMS types. 10542680 -> 1000005602510: Third generation repository products became briefly popular in the early 1990s along with the rise of widespread use of RDBMS engines such as IBM's DB2. 
10542690 -> 1000005602520: Fourth generation products link the repository with more Extract, transform, load tools and can be connected with architectural modeling tools. 10542700 -> 1000005602530: Examples include Adaptive Metadata Manager from Adaptive, Rochade from ASG, InfoLibrarian Metadata Integration Framework and Troux Technologies' Metis Server product. 10542710 -> 1000005602540: File system metadata 10542720 -> 1000005602550: Nearly all file systems keep metadata about files out-of-band. 10542730 -> 1000005602560: Some systems keep metadata in directory entries; others in specialized structures such as inodes, or even in the name of a file. 10542740 -> 1000005602570: Metadata can range from simple timestamps, mode bits, and other special-purpose information used by the implementation itself, to icons and free-text comments, to arbitrary attribute-value pairs. 10542750 -> 1000005602580: With more complex and open-ended metadata, it becomes useful to search for files based on the metadata contents. 10542760 -> 1000005602590: The Unix find utility was an early example, although inefficient when scanning hundreds of thousands of files on a modern computer system. 10542770 -> 1000005602600: Apple Computer's Mac OS X operating system supports cataloguing and searching for file metadata through a feature known as Spotlight, as of version 10.4. 10542780 -> 1000005602610: Microsoft developed similar functionality with the Instant Search system in Windows Vista, which is also present in SharePoint Server. 10542790 -> 1000005602620: Linux implements file metadata using extended file attributes. 10542800 -> 1000005602630: Image metadata 10542810 -> 1000005602640: Examples of image files containing metadata include Exchangeable image file format (EXIF) and Tagged Image File Format (TIFF). 10542820 -> 1000005602650: Having metadata about images embedded in TIFF or EXIF files is one way of acquiring additional data about an image. 10542830 -> 1000005602660: Tagging pictures with subjects, related emotions, and other descriptive phrases helps Internet users find pictures easily rather than having to search through entire image collections. 10542840 -> 1000005602670: A prime example of an image tagging service is Flickr, where users upload images and then describe the contents. 10542850 -> 1000005602680: Other patrons of the site can then search for those tags. 10542860 -> 1000005602690: Flickr uses a folksonomy: a free-text keyword system in which the community defines the vocabulary through use rather than through a controlled vocabulary. 10542870 -> 1000005602700: Users can also tag photos for organization purposes using Adobe's Extensible Metadata Platform (XMP) language, for example. 10542880 -> 1000005602710: Digital photography is increasingly making use of technical metadata tags describing the conditions of exposure. 10542890 -> 1000005602720: Photographers shooting Camera RAW file formats can use applications such as Adobe Bridge or Apple Computer's Aperture to work with camera metadata for post-processing. 10542900 -> 1000005602730: Audio Metadata 10542910 -> 1000005602740: Audio metadata generally relates to how the data should be written so that a processor can process it efficiently. 10542920 -> 1000005602750: These technologies are usually seen in audio engine programming, such as Microsoft's RIFF (Resource Interchange File Format) technology for .wav files.
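The kind of layout metadata carried by a RIFF container can be sketched with Python's standard wave module; the field offsets below follow the conventional RIFF/WAVE header layout, and the one-second silent clip is generated in memory so the example is self-contained:

    import io
    import struct
    import wave

    # Write a small mono WAV file into a buffer.
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(8000)
        w.writeframes(b"\x00\x00" * 8000)

    data = buf.getvalue()
    # The container's metadata describes how the audio data is laid out:
    # chunk identifiers and sizes, channel count, sample rate, and so on.
    assert data[0:4] == b"RIFF" and data[8:12] == b"WAVE"
    fmt_id = data[12:16]                                   # b"fmt "
    fmt_size = struct.unpack("<I", data[16:20])[0]         # size of the fmt chunk
    channels, sample_rate = struct.unpack("<HI", data[22:28])
    print(fmt_id, fmt_size, channels, sample_rate)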
10542930 -> 1000005602760: Codecs generally define their own metadata standards for compression purposes. 10542940 -> 1000005602770: Program metadata 10542950 -> 1000005602780: Metadata is casually used to describe the controlling data used in software architectures that are more abstract or configurable. 10542960 -> 1000005602790: Most executable file formats include what may be termed "metadata" that specifies certain, usually configurable, behavioral runtime characteristics. 10542970 -> 1000005602800: However, it is difficult if not impossible to precisely distinguish program "metadata" from general aspects of stored-program computing architecture; if the machine reads it and acts upon it, it is a computational instruction, and the prefix "meta" has little significance. 10542980 -> 1000005602810: In Java, the class file format contains metadata used by the Java compiler and the Java virtual machine to dynamically link classes and to support reflection. 10542990 -> 1000005602820: The J2SE 5.0 version of Java included a metadata facility to allow additional annotations that are used by development tools. 10543000 -> 1000005602830: In MS-DOS, the COM file format does not include metadata, while the EXE file and Windows PE formats do. 10543010 -> 1000005602840: These metadata can include the company that published the program, the date the program was created, the version number and more. 10543020 -> 1000005602850: In the Microsoft .NET executable format, extra metadata is included to allow reflection at runtime. 10543030 -> 1000005602860: Existing software metadata 10543040 -> 1000005602870: The Object Management Group (OMG) has defined a metadata format for representing entire existing applications for the purposes of software mining, software modernization and software assurance. 10543050 -> 1000005602880: This specification, called the OMG Knowledge Discovery Metamodel (KDM), is the OMG's foundation for "modeling in reverse". 10543060 -> 1000005602890: KDM is a common language-independent intermediate representation that provides an integrated view of an entire enterprise application, including its behavior (program flow), data, and structure. 10543070 -> 1000005602900: One of the applications of KDM is Business Rules Mining. 10543080 -> 1000005602910: The Knowledge Discovery Metamodel includes a fine-grained low-level representation (called "micro KDM") suitable for performing static analysis of programs. 10543090 -> 1000005602920: Document metadata 10543100 -> 1000005602930: Most programs that create documents, including Microsoft SharePoint, Microsoft Word and other Microsoft Office products, save metadata with the document files. 10543110 -> 1000005602940: These metadata can contain the name of the person who created the file (obtained from the operating system), the name of the person who last edited the file, how many times the file has been printed, and even how many revisions have been made to the file. 10543120 -> 1000005602950: Other saved material, such as deleted text (saved in case of an undelete command), document comments and the like, is also commonly referred to as "metadata", and the inadvertent inclusion of this material in distributed files has sometimes led to undesirable disclosures. 10543130 -> 1000005602960: Document metadata is particularly important in legal environments, where litigants can request this sensitive information, which can include many elements of private or damaging data.
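A brief sketch of how such document metadata can be inspected: an Office Open XML document (.docx) is a ZIP archive whose core properties live in docProps/core.xml, separate from the printable content; the file name below is a placeholder for the illustration:

    import zipfile
    import xml.etree.ElementTree as ET

    with zipfile.ZipFile("report.docx") as z:
        core = ET.fromstring(z.read("docProps/core.xml"))

    for element in core:
        # Tags are namespace-qualified, e.g. "{http://purl.org/dc/elements/1.1/}creator",
        # so keep only the local name (creator, lastModifiedBy, created, revision, ...).
        print(element.tag.split("}")[-1], element.text)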
10543140 -> 1000005602970: Such document metadata has figured in multiple lawsuits and drawn corporations into legal complications. 10543150 -> 1000005602980: Many law firms today use "Metadata Management Software", also known as "Metadata Removal Tools". 10543160 -> 1000005602990: This software can be used to clean documents before they are sent outside the firm. 10543170 -> 1000005603000: This process, known as metadata management, protects law firms from potentially unsafe leaking of sensitive data through electronic discovery. 10543180 -> 1000005603010: For a list of executable formats, see object file. 10543190 -> 1000005603020: Metamodels 10543200 -> 1000005603030: Metadata about models is called a metamodel. 10543210 -> 1000005603040: In Model Driven Engineering, a model has to conform to a given metamodel. 10543220 -> 1000005603050: According to the MDA guide, a metamodel is a model and each model conforms to a given metamodel. 10543230 -> 1000005603060: Meta-modeling allows strict and agile automatic processing of models and metamodels. 10543240 -> 1000005603070: The Object Management Group (OMG) defines four layers of meta-modeling. 10543250 -> 1000005603080: Each level of modeling is defined and validated by the next layer: 10543260 -> 1000005603090: M0: instance object, data row, record -> "John Smith" 10543270 -> 1000005603100: M1: model, schema -> "Customer" UML Class or database Table 10543280 -> 1000005603110: M2: metamodel -> Unified Modeling Language (UML), Common Warehouse Metamodel (CWM), Knowledge Discovery Metamodel (KDM) 10543290 -> 1000005603120: M3: meta-metamodel -> Meta-Object Facility (MOF) 10543300 -> 1000005603130: Meta-metadata 10543310 -> 1000005603140: Since metadata are also data, it is possible to have metadata of metadata: "meta-metadata." 10543320 -> 1000005603150: Machine-generated meta-metadata, such as the inverted index created by a free-text search engine, is generally not considered metadata, though. 10543330 -> 1000005603160: Digital library metadata 10543340 -> 1000005603170: There are three categories of metadata that are frequently used to describe objects in a digital library: 10543350 -> 1000005603180: descriptive - Information describing the intellectual content of the object, such as MARC cataloguing records, finding aids or similar schemes. 10543360 -> 1000005603190: It is typically used for bibliographic purposes and for search and retrieval. 10543370 -> 1000005603200: structural - Information that ties each object to others to make up logical units (e.g., information that relates individual images of pages from a book to the others that make up the book). 10543380 -> 1000005603210: administrative - Information used to manage the object or control access to it. 10543390 -> 1000005603220: This may include information on how it was scanned, its storage format, copyright and licensing information, and information necessary for the long-term preservation of the digital objects. 10543400 -> 1000005603230: Geospatial metadata 10543410 -> 1000005603240: Metadata that describe geographic objects (such as datasets, maps, features, or simply documents with a geospatial component) have a history going back to at least 1994 (see the MIT Library page on FGDC Metadata). 10543420 -> 1000005603250: This class of metadata is described more fully on the Geospatial metadata page. Microsoft Windows 10550010 -> 1000005700020: Microsoft Windows 10550020 -> 1000005700030: Microsoft Windows is a series of software operating systems produced by Microsoft.
10550030 -> 1000005700040: Microsoft first introduced an operating environment named Windows in November 1985 as an add-on to MS-DOS in response to the growing interest in graphical user interfaces (GUIs). 10550040 -> 1000005700050: Microsoft Windows came to dominate the world's personal computer market, overtaking Mac OS, which had been introduced previously. 10550050 -> 1000005700060: At the 2004 IDC Directions conference, it was stated that Windows had approximately 90% of the client operating system market. 10550060 -> 1000005700070: The most recent client version of Windows is Windows Vista; the current server version is Windows Server 2008. 10550070 -> 1000005700080: Versions 10550080 -> 1000005700090: The term Windows collectively describes any or all of several generations of Microsoft (MS) operating system (OS) products. 10550090 -> 1000005700100: These products are generally categorized as follows: 10550100 -> 1000005700110: 16-bit operating environments 10550110 -> 1000005700120: The early versions of Windows were often thought of as just graphical user interfaces, mostly because they ran on top of MS-DOS and used it for file system services. 10550120 -> 1000005700130: However, even the earliest 16-bit Windows versions already assumed many typical operating system functions, notably, having their own executable file format and providing their own device drivers (timer, graphics, printer, mouse, keyboard and sound) for applications. 10550130 -> 1000005700140: Unlike MS-DOS, Windows allowed users to execute multiple graphical applications at the same time, through cooperative multitasking. 10550140 -> 1000005700150: Finally, Windows implemented an elaborate, segment-based, software virtual memory scheme, which allowed it to run applications larger than available memory: code segments and resources were swapped in and thrown away when memory became scarce, and data segments moved in memory when a given application had relinquished processor control, typically waiting for user input. 10550150 -> 1000005700160: 16-bit Windows versions include Windows 1.0 (1985), Windows 2.0 (1987) and its close relatives, Windows/286-Windows/386. 10550160 -> 1000005700170: Hybrid 16/32-bit operating environments 10550170 -> 1000005700180: Windows/386 introduced a 32-bit protected mode kernel and virtual machine monitor. 10550180 -> 1000005700190: For the duration of a Windows session, it created one or more virtual 8086 environments and provided device virtualization for the video card, keyboard, mouse, timer and interrupt controller inside each of them. 10550190 -> 1000005700200: The user-visible consequence was that it became possible to preemptively multitask multiple MS-DOS environments in separate windows, although graphical MS-DOS applications required full screen mode. 10550200 -> 1000005700210: Also, Windows applications were multi-tasked cooperatively inside one such virtual 8086 environment. 10550210 -> 1000005700220: Windows 3.0 (1990) and Windows 3.1 (1992) improved the design, mostly because of virtual memory and loadable virtual device drivers (VxDs) which allowed them to share arbitrary devices between multitasked DOS windows. 10550220 -> 1000005700230: Also, Windows applications could now run in protected mode (when Windows was running in Standard or 386 Enhanced Mode), which gave them access to several megabytes of memory and removed the obligation to participate in the software virtual memory scheme. 
10550230 -> 1000005700240: They still ran inside the same address space, where the segmented memory provided a degree of protection, and multi-tasked cooperatively. 10550240 -> 1000005700250: For Windows 3.0, Microsoft also rewrote critical operations from C into assembly, making this release faster and less memory-hungry than its predecessors. 10550250 -> 1000005700260: Hybrid 16/32-bit operating systems 10550260 -> 1000005700270: With the introduction of the 32-bit Windows for Workgroups 3.11, Windows was able to stop relying on DOS for file management. 10550270 -> 1000005700280: Leveraging this, Windows 95 introduced Long File Names, reducing the 8.3 filename DOS environment to the role of a boot loader. 10550280 -> 1000005700290: MS-DOS was now bundled with Windows; this notably made it (partially) aware of long file names when its utilities were run from within Windows. 10550290 -> 1000005700300: The most important novelty was the possibility of running 32-bit multi-threaded preemptively multitasked graphical programs. 10550300 -> 1000005700310: However, the necessity of keeping compatibility with 16-bit programs meant the GUI components were still 16-bit only and not fully reentrant, which resulted in reduced performance and stability. 10550310 -> 1000005700320: There were three releases of Windows 95 (the first in 1995, then subsequent bug-fix versions in 1996 and 1997, only released to OEMs, which added extra features such as FAT32 and primitive USB support). 10550320 -> 1000005700330: Microsoft's next OS was Windows 98; there were two versions of this (the first in 1998 and the second, named "Windows 98 Second Edition", in 1999). 10550330 -> 1000005700340: In 2000, Microsoft released Windows Me (Me standing for Millennium Edition), which used the same core as Windows 98 but adopted some aspects of Windows 2000 and removed the option to boot into DOS mode. 10550340 -> 1000005700350: It also added a new feature called System Restore, allowing the user to set the computer's settings back to an earlier date. 10550350 -> 1000005700360: 32-bit operating systems 10550360 -> 1000005700370: The NT family of Windows systems was fashioned and marketed for higher reliability business use, and was unencumbered by any Microsoft DOS patrimony. 10550370 -> 1000005700380: The first release was Windows NT 3.1 (1993, numbered "3.1" to match the Windows version and to one-up OS/2 2.1, IBM's flagship OS, which had been co-developed by Microsoft and was Windows NT's main competitor at the time), which was followed by NT 3.5 (1994), NT 3.51 (1995), NT 4.0 (1996), and Windows 2000 (essentially NT 5.0). 10550380 -> 1000005700390: NT 4.0 was the first in this line to implement the "Windows 95" user interface (and the first to include Windows 95's built-in 32-bit runtimes). 10550390 -> 1000005700400: Microsoft then moved to combine its consumer and business operating systems. 10550400 -> 1000005700410: Windows XP, coming in both home and professional versions (and later niche market versions for tablet PCs and media centers), improved stability, user experience and backwards compatibility. 10550410 -> 1000005700420: Then, Windows Server 2003 brought Windows Server up to date with Windows XP. 10550420 -> 1000005700430: Since then, a new version, Windows Vista, has been released, and Windows Server 2008, released on February 27, 2008, brings Windows Server up to date with Windows Vista.
10550430 -> 1000005700440: Windows CE, Microsoft's offering in the mobile and embedded markets, is also a true 32-bit operating system that offers various services for all sub-operating workstations. 10550440 -> 1000005700450: 64-bit operating systems 10550450 -> 1000005700460: Windows NT included support for several different platforms before the x86-based personal computer became dominant in the professional world. 10550460 -> 1000005700470: Versions of NT from 3.1 to 4.0 variously supported PowerPC, DEC Alpha and MIPS R4000, some of which were 64-bit processors, although the operating system treated them as 32-bit processors. 10550470 -> 1000005700480: With the introduction of the Intel Itanium architecture, which is referred to as IA-64, Microsoft released new versions of Windows to support it. 10550480 -> 1000005700490: Itanium versions of Windows XP and Windows Server 2003 were released at the same time as their mainstream x86 (32-bit) counterparts. 10550490 -> 1000005700500: On April 25 2005, Microsoft released Windows XP Professional x64 Edition and x64 versions of Windows Server 2003 to support the AMD64/Intel64 (or x64 in Microsoft terminology) architecture. 10550500 -> 1000005700510: Microsoft dropped support for the Itanium version of Windows XP in 2005. 10550510 -> 1000005700520: Windows Vista is the first end-user version of Windows that Microsoft has released simultaneously in 32-bit and x64 editions. 10550520 -> 1000005700530: Windows Vista does not support the Itanium architecture. 10550530 -> 1000005700540: The modern 64-bit Windows family comprises AMD64/Intel64 versions of Windows Vista, and Windows Server 2003 and Windows Server 2008, in both Itanium and x64 editions. 10550540 -> 1000005700550: History 10550550 -> 1000005700560: Microsoft has taken two parallel routes in its operating systems. 10550560 -> 1000005700570: One route has been for the home user and the other has been for the professional IT user. 10550570 -> 1000005700580: The dual routes have generally led to home versions having greater multimedia support and less functionality in networking and security, and professional versions having inferior multimedia support and better networking and security. 10550580 -> 1000005700590: The first version of Microsoft Windows, version 1.0, released in November 1985, lacked a degree of functionality and achieved little popularity, and was to compete with Apple's own operating system. 10550590 -> 1000005700600: Windows 1.0 is not a complete operating system; rather, it extends MS-DOS. 10550600 -> 1000005700610: Microsoft Windows version 2.0 was released in November, 1987 and was slightly more popular than its predecessor. 10550610 -> 1000005700620: Windows 2.03 (release date January 1988) had changed the OS from tiled windows to overlapping windows. 10550620 -> 1000005700630: The result of this change led to Apple Computer filing a suit against Microsoft alleging infringement on Apple's copyrights. 10550630 -> 1000005700640: Microsoft Windows version 3.0, released in 1990, was the first Microsoft Windows version to achieve broad commercial success, selling 2 million copies in the first six months. 10550635 -> 1000005700650: It featured improvements to the user interface and to multitasking capabilities. 10550640 -> 1000005700660: It received a facelift in Windows 3.1, made generally available on March 1, 1992. 10550650 -> 1000005700670: Windows 3.1 support ended on December 31, 2001. 
10550660 -> 1000005700680: In July 1993, Microsoft released Windows NT based on a new kernel. 10550670 -> 1000005700690: NT was considered to be the professional OS and was the first Windows version to utilize preemptive multitasking. 10550680 -> 1000005700700: Windows NT would later be retooled to also function as a home operating system, with Windows XP. 10550690 -> 1000005700710: On August 24, 1995, Microsoft released Windows 95, a new and major consumer version that made further changes to the user interface and also used preemptive multitasking. 10550700 -> 1000005700720: Windows 95 was designed to replace not only Windows 3.1, but also Windows for Workgroups and MS-DOS. 10550710 -> 1000005700730: It was also the first Windows operating system to use Plug and Play capabilities. 10550720 -> 1000005700740: The changes Windows 95 brought to the desktop were revolutionary, as opposed to the evolutionary changes in Windows 98 and Windows Me. 10550730 -> 1000005700750: Mainstream support for Windows 95 ended on December 31, 2000 and extended support for Windows 95 ended on December 31, 2001. 10550740 -> 1000005700760: The next in the consumer line was Microsoft Windows 98, released on June 25, 1998. 10550750 -> 1000005700770: It was substantially criticized for its slowness and for its unreliability compared with Windows 95, but many of its basic problems were later rectified with the release of Windows 98 Second Edition in 1999. 10550760 -> 1000005700780: Mainstream support for Windows 98 ended on June 30, 2002 and extended support for Windows 98 ended on July 11, 2006. 10550770 -> 1000005700790: As part of its "professional" line, Microsoft released Windows 2000 in February 2000. 10550780 -> 1000005700800: The consumer version following Windows 98 was Windows Me (Windows Millennium Edition). 10550790 -> 1000005700810: Released in September 2000, Windows Me implemented a number of new technologies for Microsoft: most notably publicized was "Universal Plug and Play." 10550800 -> 1000005700820: In October 2001, Microsoft released Windows XP, a version built on the Windows NT kernel that also retained the consumer-oriented usability of Windows 95 and its successors. 10550810 -> 1000005700830: This new version was widely praised in computer magazines. 10550820 -> 1000005700840: It shipped in two distinct editions, "Home" and "Professional", the former lacking many of the superior security and networking features of the Professional edition. 10550830 -> 1000005700850: Additionally, the first "Media Center" edition was released in 2002, with an emphasis on support for DVD and TV functionality including program recording and a remote control. 10550840 -> 1000005700860: Mainstream support for Windows XP will continue until April 14, 2009 and extended support will continue until April 8, 2014. 10550850 -> 1000005700870: In April 2003, Windows Server 2003 was introduced, replacing the Windows 2000 line of server products with a number of new features and a strong focus on security; this was followed in December 2005 by Windows Server 2003 R2. 10550860 -> 1000005700880: On January 30, 2007, Microsoft released Windows Vista. 10550870 -> 1000005700890: It contains a number of new features, from a redesigned shell and user interface to significant technical changes, with a particular focus on security features. 10550880 -> 1000005700900: It is available in a number of different editions, and has been subject to some criticism.
10550900 -> 1000005700910: Security 10550910 -> 1000005700920: Security has been a hot topic with Windows for many years, and even Microsoft itself has been the victim of security breaches. 10550920 -> 1000005700930: Consumer versions of Windows were originally designed for ease-of-use on a single-user PC without a network connection, and did not have security features built in from the outset. 10550930 -> 1000005700940: Windows NT and its successors are designed for security (including on a network) and multi-user PCs, but were not designed with Internet security as much in mind, since, when Windows NT was first developed in the early 1990s, Internet use was less prevalent. 10550940 -> 1000005700950: These design issues, combined with flawed code (such as buffer overflows) and the popularity of Windows, mean that it is a frequent target of worm and virus writers. 10550950 -> 1000005700960: In June 2005, Bruce Schneier's Counterpane Internet Security reported that it had seen over 1,000 new viruses and worms in the previous six months. 10550960 -> 1000005700970: Microsoft releases security patches through its Windows Update service approximately once a month (usually the second Tuesday of the month), although critical updates are made available at shorter intervals when necessary. 10550970 -> 1000005700980: In Windows 2000 (SP3 and later), Windows XP and Windows Server 2003, updates can be automatically downloaded and installed if the user selects to do so. 10550980 -> 1000005700990: As a result, Service Pack 2 for Windows XP, as well as Service Pack 1 for Windows Server 2003, were installed by users more quickly than they otherwise might have been. 10550990 -> 1000005701000: Windows Defender 10551000 -> 1000005701010: On 6 January 2005, Microsoft released a beta version of Microsoft AntiSpyware, based upon the previously released Giant AntiSpyware. 10551010 -> 1000005701020: On 14 February 2006, Microsoft AntiSpyware became Windows Defender with the release of beta 2. 10551020 -> 1000005701030: Windows Defender is a freeware program designed to protect against spyware and other unwanted software. 10551030 -> 1000005701040: Windows XP and Windows Server 2003 users who have genuine copies of Microsoft Windows can freely download the program from Microsoft's web site, and Windows Defender ships as part of Windows Vista. 10551040 -> 1000005701050: Third-party analysis 10551050 -> 1000005701060: In an article based on a report by Symantec, internetnews.com has described Microsoft Windows as having the "fewest number of patches and the shortest average patch development time of the five operating systems it monitored in the last six months of 2006." 10551060 -> 1000005701070: However, the number of vulnerabilities found in Windows has significantly increased: Windows: 12+, Red Hat + Fedora: 2, Mac OS X: 1, HP-UX: 2, Solaris: 1. 10551070 -> 1000005701080: A study conducted by Kevin Mitnick and marketing communications firm Avantgarde in 2004 found that an unprotected and unpatched Windows XP system with Service Pack 1 lasted only 4 minutes on the Internet before it was compromised, and an unprotected and also unpatched Windows Server 2003 system was compromised after being connected to the internet for 8 hours. 10551080 -> 1000005701090: However, it is important to note that this study does not apply to Windows XP systems running the Service Pack 2 update (released in late 2004), which vastly improved the security of Windows XP.
10551090 -> 1000005701100: The computer that was running Windows XP Service Pack 2 was not compromised. 10551100 -> 1000005701110: The AOL National Cyber Security Alliance Online Safety Study of October 2004 determined that 80% of Windows users were infected by at least one spyware/adware product. 10551110 -> 1000005701120: Much documentation is available describing how to increase the security of Microsoft Windows products. 10551120 -> 1000005701130: Typical suggestions include deploying Microsoft Windows behind a hardware or software firewall, running anti-virus and anti-spyware software, and installing patches as they become available through Windows Update. 10551130 -> 1000005701140: Windows Lifecycle Policy 10551140 -> 1000005701150: Microsoft has stopped releasing updates and hotfixes for many old Windows operating systems, including all versions of Windows 9x and earlier versions of Windows NT. 10551150 -> 1000005701160: Windows versions prior to XP are no longer supported, with the exception of Windows 2000, which is currently in the Extended Support Period, that will end on July 13, 2010. 10551160 -> 1000005701170: Windows XP versions prior to SP2 are no longer supported either. 10551170 -> 1000005701180: Also, support for Windows XP 64-bit Edition ended after the release of the more recent Windows XP Professional x64 Edition. 10551180 -> 1000005701190: No new updates are created for unsupported versions of Windows. 10551190 -> 1000005701200: Emulation software 10551200 -> 1000005701210: Emulation allows the use of some Windows applications without using Microsoft Windows. 10551210 -> 1000005701220: These include: 10551220 -> 1000005701230: Wine - a free and open source software implementation of the Windows API, allowing one to run many Windows applications on x86-based platforms, including Linux. 10551230 -> 1000005701240: Wine is technically not an emulator but a "compatibility layer"; while an emulator effectively 'pretends' to be a different CPU, Wine instead makes use of Windows-style APIs to 'simulate' the Windows environment directly. 10551240 -> 1000005701250: CrossOver - A Wine package with licensed fonts. 10551250 -> 1000005701260: Its developers are regular contributors to Wine, and focus on Wine running officially supported applications. 10551260 -> 1000005701270: Cedega - TransGaming Technologies' proprietary fork of Wine, designed specifically for running games written for Microsoft Windows under Linux. 10551270 -> 1000005701280: Darwine - This project intends to port and develop Wine as well as other supporting tools that will allow Darwin and Mac OS X users to run Microsoft Windows applications, and to provide Win32 API compatibility at application source code level. 10551280 -> 1000005701290: ReactOS - An open-source OS that is intended to run the same software as Windows, originally designed to imitate Windows NT 4.0, now aiming at Windows XP compatibility. 10551290 -> 1000005701300: It has been in the development stage since 1996. Morphology (linguistics) 10560010 -> 1000005800020: Morphology (linguistics) 10560020 -> 1000005800030: Morphology is the field of linguistics that studies the internal structure of words. 10560030 -> 1000005800040: (Words as units in the lexicon are the subject matter of lexicology.) 10560040 -> 1000005800050: While words are generally accepted as being (with clitics) the smallest units of syntax, it is clear that in most (if not all) languages, words can be related to other words by rules. 
10560050 -> 1000005800060: For example, English speakers recognize that the words dog, dogs, and dog-catcher are closely related. 10560060 -> 1000005800070: English speakers recognize these relations from their tacit knowledge of the rules of word-formation in English. 10560070 -> 1000005800080: They intuit that dog is to dogs as cat is to cats; similarly, dog is to dog-catcher as dish is to dishwasher. 10560080 -> 1000005800090: The rules understood by the speaker reflect specific patterns (or regularities) in the way words are formed from smaller units and how those smaller units interact in speech. 10560090 -> 1000005800100: In this way, morphology is the branch of linguistics that studies patterns of word-formation within and across languages, and attempts to formulate rules that model the knowledge of the speakers of those languages. 10560100 -> 1000005800110: History 10560110 -> 1000005800120: The history of morphological analysis dates back to the ancient Indian linguist Pāṇini, who formulated the 3,959 rules of Sanskrit morphology in the text Aṣṭādhyāyī by using a Constituency Grammar. 10560120 -> 1000005800130: The Graeco-Roman grammatical tradition also engaged in morphological analysis. 10560130 -> 1000005800140: The term morphology was coined by August Schleicher in 1859 10560140 -> 1000005800150: Fundamental concepts 10560150 -> 1000005800160: Lexemes and word forms 10560160 -> 1000005800170: The distinction between these two senses of "word" is arguably the most important one in morphology. 10560170 -> 1000005800180: The first sense of "word," the one in which dog and dogs are "the same word," is called lexeme. 10560180 -> 1000005800190: The second sense is called word-form. 10560190 -> 1000005800200: We thus say that dog and dogs are different forms of the same lexeme. 10560200 -> 1000005800210: Dog and dog-catcher, on the other hand, are different lexemes; for example, they refer to two different kinds of entities. 10560210 -> 1000005800220: The form of a word that is chosen conventionally to represent the canonical form of a word is called a lemma, or citation form. 10560220 -> 1000005800230: Prosodic word vs. morphological word 10560230 -> 1000005800240: Here are examples from other languages of the failure of a single phonological word to coincide with a single morphological word-form. 10560240 -> 1000005800250: In Latin, one way to express the concept of 'NOUN-PHRASE1 and NOUN-PHRASE2' (as in "apples and oranges") is to suffix '-que' to the second noun phrase: "apples oranges-and", as it were. 10560250 -> 1000005800260: An extreme level of this theoretical quandary posed by some phonological words is provided by the Kwak'wala language. 10560260 -> 1000005800270: In Kwak'wala, as in a great many other languages, meaning relations between nouns, including possession and "semantic case", are formulated by affixes instead of by independent "words". 10560270 -> 1000005800280: The three word English phrase, "with his club", where 'with' identifies its dependent noun phrase as an instrument and 'his' denotes a possession relation, would consist of two words or even just one word in many languages. 10560280 -> 1000005800290: But affixation for semantic relations in Kwak'wala differs dramatically (from the viewpoint of those whose language is not Kwak'wala) from such affixation in other languages for this reason: the affixes phonologically attach not to the lexeme they pertain to semantically, but to the preceding lexeme. 
10560290 -> 1000005800300: Consider the following example (in Kwak'wala, sentences begin with what corresponds to an English verb): 10560300 -> 1000005800310: kwixʔid-i-da bəgwanəma_i-χ-a q'asa-s-is_i t'alwagwayu 10560310 -> 1000005800320: Morpheme-by-morpheme translation: 10560320 -> 1000005800330: kwixʔid-i-da = clubbed-PIVOT-DETERMINER 10560330 -> 1000005800340: bəgwanəma-χ-a = man-ACCUSATIVE-DETERMINER 10560340 -> 1000005800350: q'asa-s-is = otter-INSTRUMENTAL-3.PERSON.SINGULAR-POSSESSIVE 10560350 -> 1000005800360: t'alwagwayu = club. 10560360 -> 1000005800370: "the man clubbed the otter with his club" 10560370 -> 1000005800380: (Notation notes: 10560380 -> 1000005800390: 1. accusative case marks an entity that something is done to. 10560390 -> 1000005800400: 2. determiners are words such as "the", "this", "that". 10560400 -> 1000005800410: 3. the concept of "pivot" is a theoretical construct that is not relevant to this discussion. 4. a subscript _i marks elements that refer to the same entity, here the man and "his".) 10560410 -> 1000005800420: That is, to the speaker of Kwak'wala, the sentence does not contain the "words" 'him-the-otter' or 'with-his-club'. Instead, the markers -i-da (PIVOT-'the'), referring to 'man', attach not to bəgwanəma ('man') but to the "verb"; the markers -χ-a (ACCUSATIVE-'the'), referring to 'otter', attach to bəgwanəma instead of to q'asa ('otter'), etc. 10560420 -> 1000005800430: To summarize differently: a speaker of Kwak'wala does not perceive the sentence to consist of these phonological words: 10560430 -> 1000005800440: kwixʔid i-da-bəgwanəma χ-a-q'asa s-is_i-t'alwagwayu 10560440 -> 1000005800450: "clubbed PIVOT-the-man_i hit-the-otter with-his_i-club" 10560450 -> 1000005800460: A central publication on this topic is the recent volume edited by Dixon and Aikhenvald (2007), examining the mismatch between prosodic-phonological and grammatical definitions of "word" in various Amazonian, Australian Aboriginal, Caucasian, Eskimo, Indo-European, Native North American, and West African languages, and in sign languages. 10560460 -> 1000005800470: Apparently, a wide variety of languages make use of the hybrid linguistic unit clitic, possessing the grammatical features of independent words but the prosodic-phonological lack of freedom of bound morphemes. 10560470 -> 1000005800480: The intermediate status of clitics poses a considerable challenge to linguistic theory. 10560480 -> 1000005800490: Inflection vs. word-formation 10560490 -> 1000005800500: Given the notion of a lexeme, it is possible to distinguish two kinds of morphological rules. 10560500 -> 1000005800510: Some morphological rules relate to different forms of the same lexeme, while other rules relate to different lexemes. 10560510 -> 1000005800520: Rules of the first kind are called inflectional rules, while those of the second kind are called rules of word-formation. 10560520 -> 1000005800530: The English plural, as illustrated by dog and dogs, is an inflectional rule; compounds like dog-catcher or dishwasher provide an example of a word-formation rule. 10560530 -> 1000005800540: Informally, word-formation rules form "new words" (that is, new lexemes), while inflection rules yield variant forms of the "same" word (lexeme). 10560540 -> 1000005800550: There is a further distinction between two kinds of word-formation: derivation and compounding.
10560550 -> 1000005800560: Compounding is a process of word-formation that involves combining complete word-forms into a single compound form; dog-catcher is therefore a compound, because both dog and catcher are complete word-forms in their own right before the compounding process has been applied, and are subsequently treated as one form. 10560560 -> 1000005800570: Derivation involves affixing bound (non-independent) forms to existing lexemes, whereby the addition of the affix derives a new lexeme. 10560570 -> 1000005800580: One example of derivation is clear in this case: the word independent is derived from the word dependent by prefixing it with the derivational prefix in-, while dependent itself is derived from the verb depend. 10560580 -> 1000005800590: The distinction between inflection and word-formation is not at all clear-cut. 10560590 -> 1000005800600: There are many examples where linguists fail to agree whether a given rule is inflection or word-formation. 10560600 -> 1000005800610: The next section will attempt to clarify this distinction. 10560610 -> 1000005800620: Paradigms and morphosyntax 10560620 -> 1000005800630: A paradigm is the complete set of related word-forms associated with a given lexeme. 10560630 -> 1000005800640: The familiar examples of paradigms are the conjugations of verbs, and the declensions of nouns. 10560640 -> 1000005800650: Accordingly, the word-forms of a lexeme may be arranged conveniently into tables, by classifying them according to shared inflectional categories such as tense, aspect, mood, number, gender or case. 10560650 -> 1000005800660: For example, the personal pronouns in English can be organized into tables, using the categories of person (1st., 2nd., 3rd.), number (singular vs. plural), gender (masculine, feminine, neuter), and case (subjective, objective, and possessive). 10560660 -> 1000005800670: See English personal pronouns for the details. 10560670 -> 1000005800680: The inflectional categories used to group word-forms into paradigms cannot be chosen arbitrarily; they must be categories that are relevant to stating the syntactic rules of the language. 10560680 -> 1000005800690: For example, person and number are categories that can be used to define paradigms in English, because English has grammatical agreement rules that require the verb in a sentence to appear in an inflectional form that matches the person and number of the subject. 10560690 -> 1000005800700: In other words, the syntactic rules of English care about the difference between dog and dogs, because the choice between these two forms determines which form of the verb is to be used. 10560700 -> 1000005800710: In contrast, however, no syntactic rule of English cares about the difference between dog and dog-catcher, or dependent and independent. 10560710 -> 1000005800720: The first two are just nouns, and the second two just adjectives, and they generally behave like any other noun or adjective behaves. 10560720 -> 1000005800730: An important difference between inflection and word-formation is that inflected word-forms of lexemes are organized into paradigms, which are defined by the requirements of syntactic rules, whereas the rules of word-formation are not restricted by any corresponding requirements of syntax. 10560730 -> 1000005800740: Inflection is therefore said to be relevant to syntax, and word-formation is not. 
10560740 -> 1000005800750: The part of morphology that covers the relationship between syntax and morphology is called morphosyntax, and it concerns itself with inflection and paradigms, but not with word-formation or compounding. 10560750 -> 1000005800760: Allomorphy 10560760 -> 1000005800770: In the exposition above, morphological rules are described as analogies between word-forms: dog is to dogs as cat is to cats, and as dish is to dishes. 10560770 -> 1000005800780: In this case, the analogy applies both to the form of the words and to their meaning: in each pair, the first word means "one of X", while the second "two or more of X", and the difference is always the plural form -s affixed to the second word, signaling the key distinction between singular and plural entities. 10560780 -> 1000005800790: One of the largest sources of complexity in morphology is that this one-to-one correspondence between meaning and form scarcely applies to every case in the language. 10560790 -> 1000005800800: In English, we have word form pairs like ox/oxen, goose/geese, and sheep/sheep, where the difference between the singular and the plural is signaled in a way that departs from the regular pattern, or is not signaled at all. 10560800 -> 1000005800810: Even cases considered "regular", with the final -s, are not so simple; the -s in dogs is not pronounced the same way as the -s in cats, and in a plural like dishes, an "extra" vowel appears before the -s. 10560810 -> 1000005800820: These cases, where the same distinction is effected by alternative forms of a "word", are called allomorphy. 10560820 -> 1000005800830: Phonological rules constrain which sounds can appear next to each other in a language, and morphological rules, when applied blindly, would often violate phonological rules, by resulting in sound sequences that are prohibited in the language in question. 10560830 -> 1000005800840: For example, to form the plural of dish by simply appending an -s to the end of the word would result in the form *{(IPA+[dɪʃs]+[dɪʃs])}, which is not permitted by the phonotactics of English. 10560840 -> 1000005800850: In order to "rescue" the word, a vowel sound is inserted between the root and the plural marker, and {(IPA+[dɪʃəz]+[dɪʃəz])} results. 10560850 -> 1000005800860: Similar rules apply to the pronunciation of the -s in dogs and cats: it depends on the quality (voiced vs. unvoiced) of the final preceding phoneme. 10560860 -> 1000005800870: Lexical morphology 10560870 -> 1000005800880: Lexical morphology is the branch of morphology that deals with the lexicon, which, morphologically conceived, is the collection of lexemes in a language. 10560880 -> 1000005800890: As such, it concerns itself primarily with word-formation: derivation and compounding. 10560890 -> 1000005800900: Models of morphology 10560900 -> 1000005800910: There are three principal approaches to morphology, which each try to capture the distinctions above in different ways. 10560910 -> 1000005800920: These are, 10560920 -> 1000005800930: Morpheme-based morphology, which makes use of an Item-and-Arrangement approach. 10560930 -> 1000005800940: Lexeme-based morphology, which normally makes use of an Item-and-Process approach. 10560940 -> 1000005800950: Word-based morphology, which normally makes use of a Word-and-Paradigm approach. 10560950 -> 1000005800960: Note that while the associations indicated between the concepts in each item in that list is very strong, it is not absolute. 
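Before turning to those models, the plural allomorphy described in the Allomorphy section above (an inserted vowel after sibilants, and a voiced or voiceless -s elsewhere) amounts to a small conditional rule system. The following is a minimal illustrative sketch only, assuming a simplified IPA-style representation of the stem-final phoneme; the function name and phoneme classes are invented for illustration and are not drawn from any linguistic toolkit.

# Toy sketch of the English plural allomorphy discussed in the Allomorphy section above.
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}   # e.g. bus, buzz, dish, church, judge
VOICELESS = {"p", "t", "k", "f", "θ"}          # e.g. cat, book, cliff, myth

def plural_allomorph(final_phoneme: str) -> str:
    """Choose the plural allomorph conditioned by the stem-final phoneme."""
    if final_phoneme in SIBILANTS:
        return "əz"   # dish -> [dɪʃəz]: a vowel is inserted to rescue the form
    if final_phoneme in VOICELESS:
        return "s"    # cat -> [kæts]: voiceless [s]
    return "z"        # dog -> [dɒgz]: voiced [z]

print(plural_allomorph("ʃ"), plural_allomorph("t"), plural_allomorph("g"))   # əz s z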
10560960 -> 1000005800970: Morpheme-based morphology 10560970 -> 1000005800980: In morpheme-based morphology, word-forms are analyzed as arrangements of morphemes. 10560980 -> 1000005800990: A morpheme is defined as the minimal meaningful unit of a language. 10560990 -> 1000005801000: In a word like independently, we say that the morphemes are in-, depend, -ent, and ly; depend is the root and the other morphemes are, in this case, derivational affixes. 10561000 -> 1000005801010: In a word like dogs, we say that dog is the root, and that -s is an inflectional morpheme. 10561010 -> 1000005801020: This way of analyzing word-forms as if they were made of morphemes put after each other like beads on a string, is called Item-and-Arrangement. 10561020 -> 1000005801030: The morpheme-based approach is the first one that beginners to morphology usually think of, and which laymen tend to find the most obvious. 10561030 -> 1000005801040: This is so to such an extent that very often beginners think that morphemes are an inevitable, fundamental notion of morphology, and many five-minute explanations of morphology are, in fact, five-minute explanations of morpheme-based morphology. 10561040 -> 1000005801050: This is, however, not so. 10561050 -> 1000005801060: The fundamental idea of morphology is that the words of a language are related to each other by different kinds of rules. 10561060 -> 1000005801070: Analyzing words as sequences of morphemes is a way of describing these relations, but is not the only way. 10561070 -> 1000005801080: In actual academic linguistics, morpheme-based morphology certainly has many adherents, but is by no means the dominant approach. 10561080 -> 1000005801090: Lexeme-based morphology 10561090 -> 1000005801100: Lexeme-based morphology is (usually) an Item-and-Process approach. 10561100 -> 1000005801110: Instead of analyzing a word-form as a set of morphemes arranged in sequence, a word-form is said to be the result of applying rules that alter a word-form or stem in order to produce a new one. 10561110 -> 1000005801120: An inflectional rule takes a stem, changes it as is required by the rule, and outputs a word-form; a derivational rule takes a stem, changes it as per its own requirements, and outputs a derived stem; a compounding rule takes word-forms, and similarly outputs a compound stem. 10561120 -> 1000005801130: Word-based morphology 10561130 -> 1000005801140: Word-based morphology is a (usually) Word-and-paradigm approach. 10561140 -> 1000005801150: This theory takes paradigms as a central notion. 10561150 -> 1000005801160: Instead of stating rules to combine morphemes into word-forms, or to generate word-forms from stems, word-based morphology states generalizations that hold between the forms of inflectional paradigms. 10561160 -> 1000005801170: The major point behind this approach is that many such generalizations are hard to state with either of the other approaches. 10561170 -> 1000005801180: The examples are usually drawn from fusional languages, where a given "piece" of a word, which a morpheme-based theory would call an inflectional morpheme, corresponds to a combination of grammatical categories, for example, "third person plural." 10561180 -> 1000005801190: Morpheme-based theories usually have no problems with this situation, since one just says that a given morpheme has two categories. 
10561190 -> 1000005801200: Item-and-Process theories, on the other hand, often break down in cases like these, because they all too often assume that there will be two separate rules here, one for third person, and the other for plural, but the distinction between them turns out to be artificial. 10561200 -> 1000005801210: Word-and-Paradigm approaches treat these as whole words that are related to each other by analogical rules. 10561210 -> 1000005801220: Words can be categorized based on the pattern they fit into. 10561220 -> 1000005801230: This applies both to existing words and to new ones. 10561230 -> 1000005801240: Application of a pattern different than the one that has been used historically can give rise to a new word, such as older replacing elder (where older follows the normal pattern of adjectival superlatives) and cows replacing kine (where cows fits the regular pattern of plural formation). 10561240 -> 1000005801250: While a Word-and-Paradigm approach can explain this easily, other approaches have difficulty with phenomena such as this. 10561250 -> 1000005801260: Morphological typology 10561260 -> 1000005801270: In the 19th century, philologists devised a now classic classification of languages according to their morphology. 10561270 -> 1000005801280: According to this typology, some languages are isolating, and have little to no morphology; others are agglutinative, and their words tend to have lots of easily-separable morphemes; while others yet are inflectional or fusional, because their inflectional morphemes are said to be "fused" together. 10561280 -> 1000005801290: This leads to one bound morpheme conveying multiple pieces of information. 10561290 -> 1000005801300: The classic example of an isolating language is Chinese; the classic example of an agglutinative language is Turkish; both Latin and Greek are classic examples of fusional languages. 10561300 -> 1000005801310: Considering the variability of the world's languages, it becomes clear that this classification is not at all clear-cut, and many languages do not neatly fit any one of these types, and some fit in more than one. 10561310 -> 1000005801320: A continuum of complex morphology of language may be adapted when considering languages. 10561320 -> 1000005801330: The three models of morphology stem from attempts to analyze languages that more or less match different categories in this typology. 10561330 -> 1000005801340: The Item-and-Arrangement approach fits very naturally with agglutinative languages; while the Item-and-Process and Word-and-Paradigm approaches usually address fusional languages. 10561340 -> 1000005801350: The reader should also note that the classical typology also mostly applies to inflectional morphology. 10561350 -> 1000005801360: There is very little fusion going on with word-formation. 10561360 -> 1000005801370: Languages may be classified as synthetic or analytic in their word formation, depending on the preferred way of expressing notions that are not inflectional: either by using word-formation (synthetic), or by using syntactic phrases (analytic). N-gram 10610010 -> 1000005900020: N-gram 10610020 -> 1000005900030: An n-gram is a sub-sequence of n items from a given sequence. 10610025 -> 1000005900040: n-grams are used in various areas of statistical natural language processing and genetic sequence analysis. 10610030 -> 1000005900050: The items in question can be letters, words or base pairs according to the application. 
10610040 -> 1000005900060: An n-gram of size 1 is a "unigram"; size 2 is a "bigram" (or, more etymologically sound but less commonly used, a "digram"); size 3 is a "trigram"; and size 4 or more is simply called an "n-gram". 10610050 -> 1000005900070: Some language models built from n-grams are "(n − 1)-order Markov models". 10610060 -> 1000005900080: Examples 10610070 -> 1000005900090: Here are examples of word-level 3-grams and 4-grams (and counts of the number of times they appeared) from the Google n-gram corpus. 10610080 -> 1000005900100: ceramics collectables collectibles (55) 10610090 -> 1000005900110: ceramics collectables fine (130) 10610100 -> 1000005900120: ceramics collected by (52) 10610110 -> 1000005900130: ceramics collectible pottery (50) 10610120 -> 1000005900140: ceramics collectibles cooking (45) 10610130 -> 1000005900150: 4-grams 10610140 -> 1000005900160: serve as the incoming (92) 10610150 -> 1000005900170: serve as the incubator (99) 10610160 -> 1000005900180: serve as the independent (794) 10610170 -> 1000005900190: serve as the index (223) 10610180 -> 1000005900200: serve as the indication (72) 10610190 -> 1000005900210: serve as the indicator (120) 10610200 -> 1000005900220: n-gram models 10610210 -> 1000005900230: An n-gram model models sequences, notably natural languages, using the statistical properties of n-grams. 10610220 -> 1000005900240: This idea can be traced to an experiment in Claude Shannon's work on information theory. 10610230 -> 1000005900250: His question was, given a sequence of letters (for example, the sequence "for ex"), what is the likelihood of the next letter? 10610240 -> 1000005900260: From training data, one can derive a probability distribution for the next letter given a history of size n: a = 0.4, b = 0.00001, c = 0, ..., where the probabilities of all possible "next-letters" sum to 1.0. 10610250 -> 1000005900270: More concisely, an n-gram model predicts x_i based on x_{i-1}, x_{i-2}, ..., x_{i-n}. 10610260 -> 1000005900280: In probability terms, this is P(x_i | x_{i-1}, x_{i-2}, ..., x_{i-n}). 10610270 -> 1000005900290: When used for language modeling, independence assumptions are made so that each word depends only on the last n words. 10610280 -> 1000005900300: This Markov model is used as an approximation of the true underlying language. 10610290 -> 1000005900310: This assumption is important because it massively simplifies the problem of learning the language model from data. 10610300 -> 1000005900320: In addition, because of the open nature of language, it is common to group words unknown to the language model together. 10610310 -> 1000005900330: n-gram models are widely used in statistical natural language processing. 10610320 -> 1000005900340: In speech recognition, phonemes and sequences of phonemes are modeled using an n-gram distribution. 10610330 -> 1000005900350: For parsing, words are modeled such that each n-gram is composed of n words. 10610340 -> 1000005900360: For language recognition, sequences of letters are modeled for different languages. 10610350 -> 1000005900370: For a sequence of words (for example, "the dog smelled like a skunk"), the trigrams would be: "the dog smelled", "dog smelled like", "smelled like a", and "like a skunk". 10610360 -> 1000005900380: For sequences of characters, the 3-grams (sometimes referred to as "trigrams") that can be generated from "good morning" are "goo", "ood", "od ", "d m", " mo", "mor" and so forth.
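To make the preceding description concrete, here is a minimal Python sketch of n-gram extraction and of a maximum-likelihood trigram estimate; the toy corpus and function names are invented for illustration, and smoothing (discussed later) is deliberately omitted.

from collections import Counter

def ngrams(items, n):
    """Return the list of n-grams (as tuples) in a sequence of items."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Word-level trigrams, as in the "the dog smelled like a skunk" example above.
print(ngrams("the dog smelled like a skunk".split(), 3))

# Character-level 3-grams, as in the "good morning" example above.
print(["".join(g) for g in ngrams("good morning", 3)][:4])   # ['goo', 'ood', 'od ', 'd m']

# A minimal trigram model: estimate P(next word | two-word history) from counts.
corpus = "the dog smelled like a skunk and the dog ran".split()
bigram_counts = Counter(ngrams(corpus, 2))
trigram_counts = Counter(ngrams(corpus, 3))

def p_next(history, nxt):
    """Maximum-likelihood estimate of P(nxt | history) for a two-word history."""
    h = tuple(history)
    return trigram_counts[h + (nxt,)] / bigram_counts[h] if bigram_counts[h] else 0.0

print(p_next(("the", "dog"), "smelled"))   # 0.5 in this toy corpus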
10610370 -> 1000005900390: Some practitioners preprocess strings to remove spaces, most simply collapse whitespace to a single space while preserving paragraph marks. 10610380 -> 1000005900400: Punctuation is also commonly reduced or removed by preprocessing. 10610385 -> 1000005900410: n-grams can also be used for sequences of words or, in fact, for almost any type of data. 10610390 -> 1000005900420: They have been used for example for extracting features for clustering large sets of satellite earth images and for determining what part of the Earth a particular image came from. 10610400 -> 1000005900430: They have also been very successful as the first pass in genetic sequence search and in the identification of which species short sequences of DNA were taken from. 10610410 -> 1000005900440: N-gram models are often criticized because they lack any explicit representation of long range dependency. 10610420 -> 1000005900450: While it is true that the only explicit dependency range is (n-1) tokens for an n-gram model, it is also true that the effective range of dependency is significantly longer than this although long range correlations drop exponentially with distance for any Markov model. 10610430 -> 1000005900460: Alternative Markov language models that incorporate some degree of local state can exhibit very long range dependencies. 10610440 -> 1000005900470: This is often done using hand-crafted state variables that represent, for instance, the position in a sentence, the general topic of discourse or a grammatical state variable. 10610450 -> 1000005900480: Some of the best parsers of English currently in existence are roughly of this form. 10610460 -> 1000005900490: Another criticism that has been leveled is that Markov models of language, including n-gram models, do not explicitly capture the performance/competence distinction introduced by Noam Chomsky. 10610470 -> 1000005900500: This criticism fails to explain why parsers that are the best at parsing text seem to uniformly lack any such distinction and most even lack any clear distinction between semantics and syntax. 10610480 -> 1000005900510: Most proponents of n-gram and related language models opt for a fairly pragmatic approach to language modeling that emphasizes empirical results over theoretical purity. 10610490 -> 1000005900520: n-grams for approximate matching 10610500 -> 1000005900530: n-grams can also be used for efficient approximate matching. 10610510 -> 1000005900540: By converting a sequence of items to a set of n-grams, it can be embedded in a vector space (in other words, represented as a histogram), thus allowing the sequence to be compared to other sequences in an efficient manner. 10610520 -> 1000005900550: For example, if we convert strings with only letters in the English alphabet into 3-grams, we get a 26^3-dimensional space (the first dimension measures the number of occurrences of "aaa", the second "aab", and so forth for all possible combinations of three letters). 10610530 -> 1000005900560: Using this representation, we lose information about the string. 10610540 -> 1000005900570: For example, both the strings "abcba" and "bcbab" give rise to exactly the same 2-grams. 10610550 -> 1000005900580: However, we know empirically that if two strings of real text have a similar vector representation (as measured by cosine distance) then they are likely to be similar. 10610560 -> 1000005900590: Other metrics have also been applied to vectors of n-grams with varying, sometimes better, results. 
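A minimal sketch of this embedding idea, assuming character 3-grams and cosine similarity over sparse count vectors; the example strings are invented for illustration.

from collections import Counter
from math import sqrt

def char_ngram_vector(text, n=3):
    """Represent a string as a histogram of its character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

a = char_ngram_vector("natural language processing")
b = char_ngram_vector("natural langauge processing")   # transposition typo
c = char_ngram_vector("completely unrelated text")
print(cosine_similarity(a, b))   # close to 1.0: most 3-grams are shared despite the typo
print(cosine_similarity(a, c))   # much lower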
10610570 -> 1000005900600: For example, z-scores have been used to compare documents by examining by how many standard deviations each n-gram differs from its mean occurrence in a large collection, or text corpus, of documents (which forms the "background" vector). 10610580 -> 1000005900610: In the event of small counts, the g-score may give better results for comparing alternative models. 10610590 -> 1000005900620: It is also possible to take a more principled approach to the statistics of n-grams, modeling the similarity of two strings as the likelihood that they came from the same source, framed directly as a problem in Bayesian inference. 10610600 -> 1000005900630: Other applications 10610610 -> 1000005900640: n-grams find use in several areas of computer science, computational linguistics, and applied mathematics. 10610620 -> 1000005900650: They have been used to: 10610630 -> 1000005900660: design kernels that allow machine learning algorithms such as support vector machines to learn from string data 10610640 -> 1000005900670: find likely candidates for the correct spelling of a misspelled word 10610650 -> 1000005900680: improve compression in compression algorithms where a small area of data requires n-grams of greater length 10610660 -> 1000005900690: assess the probability of a given word sequence appearing in text of a language of interest in pattern recognition systems, speech recognition, OCR (optical character recognition), Intelligent Character Recognition (ICR), machine translation and similar applications 10610670 -> 1000005900700: improve retrieval in information retrieval systems when it is hoped to find similar "documents" (a term for which the conventional meaning is sometimes stretched, depending on the data set) given a single query document and a database of reference documents 10610680 -> 1000005900710: improve retrieval performance in genetic sequence analysis as in the BLAST family of programs 10610690 -> 1000005900720: identify the language a text is in or the species a small sequence of DNA was taken from 10610700 -> 1000005900730: predict letters or words at random in order to create text, as in the dissociated press algorithm. 10610710 -> 1000005900740: Bias-versus-variance trade-off 10610720 -> 1000005900750: What goes into picking the n for the n-gram? 10610730 -> 1000005900760: There are problems of balancing the weight given to infrequent grams (for example, if a proper name appeared in the training data) against that given to frequent grams. 10610740 -> 1000005900770: Also, items not seen in the training data will be given a probability of 0.0 without smoothing. 10610750 -> 1000005900780: For unseen but plausible data from a sample, one can introduce pseudocounts. 10610760 -> 1000005900790: Pseudocounts are generally motivated on Bayesian grounds. 10610770 -> 1000005900800: Smoothing techniques 10610780 -> 1000005900810: Linear interpolation (e.g., taking the weighted mean of the unigram, bigram, and trigram estimates) 10610790 -> 1000005900820: Good-Turing discounting 10610800 -> 1000005900830: Witten-Bell discounting 10610810 -> 1000005900840: Katz's back-off model (trigram) 10610820 -> 1000005900850: Google's use of n-grams 10610830 -> 1000005900860: Google uses n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spell checking, entity detection, and data mining. 10610840 -> 1000005900870: In September 2006, Google announced that it had made its n-grams publicly available through the Linguistic Data Consortium (LDC).
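To illustrate only the pseudocount idea mentioned under the bias-versus-variance trade-off above (the listed techniques such as Good-Turing, Witten-Bell and Katz back-off are more sophisticated), here is a minimal add-k smoothing sketch; the function name and the numbers are invented for illustration.

def smoothed_probability(count, history_count, vocab_size, k=1.0):
    """Add-k ("pseudocount") estimate of P(next item | history).

    count         -- times (history, next item) occurred in the training data
    history_count -- times the history occurred in the training data
    vocab_size    -- number of distinct items the model can predict
    k             -- pseudocount added to every possible continuation
    """
    return (count + k) / (history_count + k * vocab_size)

# An item never seen after this history no longer gets probability 0.0:
print(smoothed_probability(count=0, history_count=2, vocab_size=10))   # about 0.083
print(smoothed_probability(count=2, history_count=2, vocab_size=10))   # 0.25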
Named entity recognition 10570010 -> 1000006000020: Named entity recognition 10570020 -> 1000006000030: Named entity recognition (NER) (also known as entity identification (EI) and entity extraction) is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. 10570030 -> 1000006000040: For example, an NER system producing MUC-style output might tag the sentence, 10570040 -> 1000006000050: Jim bought 300 shares of Acme Corp. in 2006. 10570050 -> 1000006000060: In such output, "Jim" would be marked as a person, "Acme Corp." as an organization, and "2006" as a time expression. 10570060 -> 1000006000070: NER systems have been created that use linguistic grammar-based techniques as well as statistical models. 10570070 -> 1000006000080: Hand-crafted grammar-based systems typically obtain better results, but at the cost of months of work by experienced linguists. 10570080 -> 1000006000090: Statistical NER systems typically require a large amount of manually annotated training data. 10570090 -> 1000006000100: Since about 1998, there has been a great deal of interest in entity identification in the molecular biology, bioinformatics, and medical natural language processing communities. 10570100 -> 1000006000110: The most common entities of interest in that domain have been names of genes and gene products. 10570110 -> 1000006000120: Named entity types 10570120 -> 1000006000130: In the expression named entity, the word named restricts the task to those entities for which one or many rigid designators, as defined by Kripke, stand for the referent. 10570130 -> 1000006000140: For instance, the automotive company created by Henry Ford in 1903 is referred to as Ford or Ford Motor Company. 10570140 -> 1000006000150: Rigid designators include proper names as well as certain natural kind terms like biological species and substances. 10570150 -> 1000006000160: There is general agreement to include temporal expressions and some numerical expressions such as money and measures in named entities. 10570160 -> 1000006000170: While some instances of these types are good examples of rigid designators (e.g., the year 2001), there are also many invalid ones (e.g., I take my vacations in “June”). 10570170 -> 1000006000180: In the first case, the year 2001 refers to the 2001st year of the Gregorian calendar. 10570180 -> 1000006000190: In the second case, the month June may refer to the month of an undefined year (past June, next June, June 2020, etc.). 10570190 -> 1000006000200: It is arguable that the named entity definition is loosened in such cases for practical reasons. 10570200 -> 1000006000210: At least two hierarchies of named entity types have been proposed in the literature. 10570210 -> 1000006000220: The BBN categories, proposed in 2002, are used for question answering and consist of 29 types and 64 subtypes. 10570220 -> 1000006000230: Sekine's extended hierarchy, proposed in 2002, is made up of 200 subtypes.
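As a toy illustration of the tagging task described above, here is a minimal hand-written sketch in Python; the patterns, the tiny gazetteer and the category labels are invented for illustration and are not the MUC tag set or any real system's grammar.

import re

PATTERNS = [
    ("ORGANIZATION", re.compile(r"\b[A-Z][a-z]+ (?:Corp|Inc|Ltd)\.?")),
    ("PERSON",       re.compile(r"\bJim\b")),             # tiny gazetteer of names
    ("TIME",         re.compile(r"\b(?:19|20)\d{2}\b")),  # four-digit years
    ("QUANTITY",     re.compile(r"\b\d+\b")),
]

def tag(sentence):
    """Return (entity text, category) pairs; earlier patterns win on overlapping spans."""
    found, claimed = [], set()
    for category, pattern in PATTERNS:
        for m in pattern.finditer(sentence):
            span = set(range(m.start(), m.end()))
            if span & claimed:   # skip text already claimed by a higher-priority pattern
                continue
            claimed |= span
            found.append((m.start(), m.group(0), category))
    return [(text, category) for _, text, category in sorted(found)]

print(tag("Jim bought 300 shares of Acme Corp. in 2006."))
# [('Jim', 'PERSON'), ('300', 'QUANTITY'), ('Acme Corp.', 'ORGANIZATION'), ('2006', 'TIME')]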
10570230 -> 1000006000240: Evaluation 10570240 -> 1000006000250: Benchmarking and evaluations have been performed in the Message Understanding Conferences (MUC) organized by DARPA, the International Conference on Language Resources and Evaluation (LREC), the Computational Natural Language Learning (CoNLL) workshops, Automatic Content Extraction (ACE) organized by NIST, the Multilingual Entity Task Conference (MET), the Information Retrieval and Extraction Exercise (IREX) and HAREM (Portuguese language only). 10570250 -> 1000006000260: State-of-the-art systems produce near-human performance. 10570260 -> 1000006000270: For instance, the best system entering MUC-7 achieved an f-measure of 93.39%, while human annotators scored 97.60% and 96.95%. Natural language 10580010 -> 1000006100020: Natural language 10580020 -> 1000006100030: In the philosophy of language, a natural language (or ordinary language) is a language that is spoken, written, or signed by humans for general-purpose communication, as distinguished from formal languages (such as computer-programming languages or the "languages" used in the study of formal logic, especially mathematical logic) and from constructed languages. 10580030 -> 1000006100040: Defining natural language 10580040 -> 1000006100050: Though the exact definition is debatable, natural language is often contrasted with artificial or constructed languages such as Esperanto, Latino Sine Flexione, and Occidental. 10580050 -> 1000006100060: Linguists have an incomplete understanding of all aspects of the rules underlying natural languages, and these rules are therefore objects of study. 10580060 -> 1000006100070: The understanding of natural languages reveals much not only about how language works (in terms of syntax, semantics, phonetics, phonology, etc.), but also about how the human mind and the human brain process language. 10580070 -> 1000006100080: In linguistic terms, 'natural language' only applies to a language that has evolved naturally, and the study of natural language primarily involves native (first language) speakers. 10580080 -> 1000006100090: The theory of universal grammar proposes that all natural languages have certain underlying rules which constrain the structure of the specific grammar for any given language. 10580090 -> 1000006100100: While grammarians, writers of dictionaries, and language policy-makers all have a certain influence on the evolution of language, their ability to influence what people think they 'ought' to say is distinct from what people actually say. 10580100 -> 1000006100110: Natural language applies to the latter, and is thus a 'descriptive' rather than a 'prescriptive' term. 10580110 -> 1000006100120: Thus non-standard language varieties (such as African American Vernacular English) are considered to be natural, while standard language varieties (such as Standard American English), which are more 'prescribed', can be considered to be at least somewhat artificial or constructed. 10580120 -> 1000006100130: Native language learning 10580130 -> 1000006100140: The learning of one's own native language, typically that of one's parents, normally occurs spontaneously in early human childhood and is biologically driven. 10580140 -> 1000006100150: A crucial role in this process is played by the neural activity of a portion of the human brain known as Broca's area.
10580150 -> 1000006100160: There are approximately 7,000 current human languages, and many, if not most, seem to share certain properties, leading to the belief in the existence of Universal Grammar, as shown by generative grammar studies pioneered by the work of Noam Chomsky. 10580160 -> 1000006100170: Recently, it has been demonstrated that a dedicated network in the human brain (crucially involving Broca's area, a portion of the left inferior frontal gyrus) is selectively activated by complex verbal structures (but not simple ones) of those languages that meet the Universal Grammar requirements. 10580170 -> 1000006100180: Origins of natural language 10580180 -> 1000006100190: There is disagreement among anthropologists on when language was first used by humans (or their ancestors). 10580190 -> 1000006100200: Estimates range from about two million (2,000,000) years ago, during the time of Homo habilis, to as recently as forty thousand (40,000) years ago, during the time of Cro-Magnon man. 10580200 -> 1000006100210: However, recent evidence suggests that modern human language arose or evolved in Africa prior to the dispersal of humans from Africa around 50,000 years ago. 10580210 -> 1000006100220: Since all people, including the most isolated indigenous groups such as the Andamanese or the Tasmanian Aboriginals, possess language, it must have been present in the ancestral populations in Africa before the human population split into various groups to colonize the rest of the world. 10580220 -> 1000006100230: Some claim that all natural languages came from one single language, known as Adamic. 10580230 -> 1000006100240: Linguistic diversity 10580240 -> 1000006100250: As of early 2007, there are 6,912 known living human languages. 10580250 -> 1000006100260: A "living language" is simply one which is in wide use by a specific group of living people. 10580260 -> 1000006100270: The exact number of known living languages varies from 5,000 to 10,000, depending generally on the precision of one's definition of "language", and in particular on how one classifies dialects. 10580270 -> 1000006100280: There are also many dead or extinct languages. 10580280 -> 1000006100290: There is no clear distinction between a language and a dialect, notwithstanding linguist Max Weinreich's famous aphorism that "a language is a dialect with an army and navy." 10580290 -> 1000006100300: In other words, the distinction may hinge on political considerations as much as on cultural differences, distinctive writing systems, or degree of mutual intelligibility. 10580300 -> 1000006100310: It is probably impossible to accurately enumerate the living languages because our worldwide knowledge is incomplete, and it is a "moving target", as explained in greater detail by the Ethnologue's Introduction, pp. 7-8. 10580310 -> 1000006100320: In the 15th edition, the 103 newly added languages are not new but were reclassified due to refinements in the definition of language. 10580320 -> 1000006100330: Although widely considered an encyclopedia, the Ethnologue actually presents itself as an incomplete catalog, including only named languages that its editors are able to document. 10580330 -> 1000006100340: With each edition, the number of catalogued languages has grown. 10580340 -> 1000006100350: Beginning with the 14th edition (2000), an attempt was made to include all known living languages. 10580350 -> 1000006100360: SIL used an internal 3-letter code fashioned after airport codes to identify languages.
10580360 -> 1000006100370: This was the precursor to the modern ISO 639-3 standard, to which SIL contributed. 10580370 -> 1000006100380: The standard allows for over 14,000 languages. 10580380 -> 1000006100390: In turn, the 15th edition was revised to conform to the pending ISO 639-3 standard. 10580390 -> 1000006100400: Of the catalogued languages, 497 have been flagged as "nearly extinct" due to trends in their usage. 10580400 -> 1000006100410: Per the 15th edition, 6,912 living languages are shared by over 5.7 billion speakers. (p. 15) 10580410 -> 1000006100420: Taxonomy 10580420 -> 1000006100430: The classification of natural languages can be performed on the basis of different underlying principles (different closeness notions, respecting different properties and relations between languages); important directions of present classifications are: 10580430 -> 1000006100440: paying attention to the historical evolution of languages results in a genetic classification of languages, which is based on genetic relatedness of languages; 10580440 -> 1000006100450: paying attention to the internal structure of languages (grammar) results in a typological classification of languages, which is based on similarity of one or more components of the language's grammar across languages; 10580450 -> 1000006100460: and respecting geographical closeness and contacts between language-speaking communities results in areal groupings of languages. 10580460 -> 1000006100470: The different classifications do not match each other and are not expected to, but the correlation between them is an important point for much linguistic research. 10580470 -> 1000006100480: (There is a parallel to the classification of species in biological phylogenetics here: consider monophyletic vs. polyphyletic groups of species.) 10580480 -> 1000006100490: The task of genetic classification belongs to the field of historical-comparative linguistics; that of typological classification belongs to linguistic typology. 10580490 -> 1000006100500: See also Taxonomy and Taxonomic classification for the general idea of classification and taxonomies. 10580500 -> 1000006100510: Genetic classification 10580510 -> 1000006100520: The world's languages have been grouped into families of languages that are believed to have common ancestors. 10580520 -> 1000006100530: Some of the major families are the Indo-European languages, the Afro-Asiatic languages, the Austronesian languages, and the Sino-Tibetan languages. 10580530 -> 1000006100540: The shared features of languages from one family can be due to shared ancestry. 10580540 -> 1000006100550: (Compare with homology in biology.) 10580550 -> 1000006100560: Typological classification 10580560 -> 1000006100570: An example of a typological classification is the classification of languages on the basis of the basic order of the verb, the subject and the object in a sentence into several types: SVO, SOV, VSO, and so on. 10580570 -> 1000006100580: (English, for instance, belongs to the SVO language type.) 10580580 -> 1000006100590: The shared features of languages of one type (that is, from one typological class) may have arisen completely independently. 10580590 -> 1000006100600: (Compare with analogy in biology.) 10580595 -> 1000006100610: Their co-occurrence might be due to the universal laws governing the structure of natural languages, known as language universals.
10580600 -> 1000006100620: Areal classification 10580610 -> 1000006100630: The following language groupings can serve as some linguistically significant examples of areal linguistic units, or sprachbunds: the Balkan linguistic union, or the bigger group of European languages; the Caucasian languages; and the East Asian languages. 10580620 -> 1000006100640: Although the members of each group are not closely genetically related, there is a reason for them to share similar features, namely: their speakers have been in contact for a long time within a common community and the languages converged in the course of history. 10580630 -> 1000006100650: These are called "areal features". 10580640 -> 1000006100660: One should be careful about the underlying classification principle for groups of languages which apparently have a geographical name: besides areal linguistic units, the taxa of the genetic classification (language families) are often given names which, in whole or in part, refer to geographical areas. 10580650 -> 1000006100670: Controlled languages 10580660 -> 1000006100680: Controlled natural languages are subsets of natural languages whose grammars and dictionaries have been restricted in order to reduce or eliminate both ambiguity and complexity. 10580670 -> 1000006100690: The purpose behind the development and implementation of a controlled natural language is typically to aid non-native speakers of a natural language in understanding it, or to ease computer processing of a natural language. 10580680 -> 1000006100700: An example of a widely used controlled natural language is Simplified English, which was originally developed for aerospace industry maintenance manuals. 10580690 -> 1000006100710: Constructed languages and international auxiliary languages 10580700 -> 1000006100720: Constructed international auxiliary languages such as Esperanto and Interlingua that have native speakers are also considered natural languages by some. 10580710 -> 1000006100730: However, constructed languages, while they are clearly languages, are not generally considered natural languages. 10580720 -> 1000006100740: The problem is that other languages have evolved naturally through being used to communicate, while Esperanto was deliberately designed by L.L. Zamenhof from natural languages, not grown from natural fluctuations in vocabulary and syntax. 10580730 -> 1000006100750: Nor has Esperanto been naturally "standardized" by children's natural tendency to correct for illogical grammar structures in their parents' language, which can be seen in the development of pidgin languages into creole languages (as explained by Steven Pinker in The Language Instinct). 10580740 -> 1000006100760: The possible exception to this is true native speakers of such languages. 10580750 -> 1000006100770: A more substantive basis for this designation is that the vocabulary, grammar, and orthography of Interlingua are natural; they have been standardized and presented by a linguistic research body, but they predated it and are not themselves considered a product of human invention. 10580760 -> 1000006100780: Most experts, however, consider Interlingua to be naturalistic rather than natural. 10580770 -> 1000006100790: Latino Sine Flexione, a second naturalistic auxiliary language, is also naturalistic in content but is no longer widely spoken.
10580780 -> 1000006100800: Natural Language Processing 10580790 -> 1000006100810: Natural language processing (NLP) is a subfield of artificial intelligence and computational linguistics. 10580800 -> 1000006100820: It studies the problems of automated generation and understanding of natural human languages. 10580810 -> 1000006100830: Natural-language-generation systems convert information from computer databases into normal-sounding human language. 10580820 -> 1000006100840: Natural-language-understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate. 10580830 -> 1000006100850: Modalities 10580840 -> 1000006100860: Natural language manifests itself in modalities other than speech. 10580850 -> 1000006100870: Sign languages 10580860 -> 1000006100880: In linguistic terms, sign languages are as rich and complex as any oral language, despite the previously common misconception that they are not "real languages". 10580870 -> 1000006100890: Professional linguists have studied many sign languages and found them to have every linguistic component required to be classed as true natural languages. 10580880 -> 1000006100900: Sign languages are not pantomime, much as most spoken language is not onomatopoeic. 10580890 -> 1000006100910: The signs do tend to exploit iconicity (visual connections with their referents) more than what is common in spoken language, but they are above all conventional and hence generally incomprehensible to non-speakers, just like spoken words and morphemes. 10580900 -> 1000006100920: They are not a visual rendition of an oral language either. 10580910 -> 1000006100930: They have complex grammars of their own, and can be used to discuss any topic, from the simple and concrete to the lofty and abstract. 10580920 -> 1000006100940: Written languages 10580930 -> 1000006100950: In a sense, written language should be distinguished from natural language. 10580940 -> 1000006100960: Until recently in the developed world, it was common for many people to be fluent in spoken or signed languages and yet remain illiterate; this is still the case in poor countries today. 10580950 -> 1000006100970: Furthermore, natural language acquisition during childhood is largely spontaneous, while literacy must usually be intentionally acquired. Natural language processing 10590010 -> 1000006200020: Natural language processing 10590020 -> 1000006200030: Natural language processing (NLP) is a subfield of artificial intelligence and computational linguistics. 10590030 -> 1000006200040: It studies the problems of automated generation and understanding of natural human languages. 10590040 -> 1000006200050: Natural-language-generation systems convert information from computer databases into normal-sounding human language. 10590050 -> 1000006200060: Natural-language-understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate. 10590060 -> 1000006200070: Tasks and limitations 10590070 -> 1000006200080: In theory, natural-language processing is a very attractive method of human-computer interaction. 10590080 -> 1000006200090: Early systems such as SHRDLU, working in restricted "blocks worlds" with restricted vocabularies, worked extremely well, leading researchers to excessive optimism, which was soon lost when the systems were extended to more realistic situations with real-world ambiguity and complexity. 
10590090 -> 1000006200100: Natural-language understanding is sometimes referred to as an AI-complete problem, because natural-language recognition seems to require extensive knowledge about the outside world and the ability to manipulate it. 10590100 -> 1000006200110: The definition of "understanding" is one of the major problems in natural-language processing. 10590110 -> 1000006200120: Concrete problems 10590120 -> 1000006200130: Some examples of the problems faced by natural-language-understanding systems: 10590130 -> 1000006200140: The sentences We gave the monkeys the bananas because they were hungry and We gave the monkeys the bananas because they were over-ripe have the same surface grammatical structure. 10590140 -> 1000006200150: However, the pronoun they refers to monkeys in one sentence and bananas in the other, and it is impossible to tell which without a knowledge of the properties of monkeys and bananas. 10590150 -> 1000006200160: A string of words may be interpreted in different ways. 10590160 -> 1000006200170: For example, the string Time flies like an arrow may be interpreted in a variety of ways: 10590170 -> 1000006200180: The common simile: time moves quickly just like an arrow does; 10590180 -> 1000006200190: measure the speed of flies like you would measure that of an arrow (thus interpreted as an imperative) - i.e. (You should) time flies as you would (time) an arrow.; 10590190 -> 1000006200200: measure the speed of flies like an arrow would - i.e. Time flies in the same way that an arrow would (time them).; 10590200 -> 1000006200210: measure the speed of flies that are like arrows - i.e. Time those flies that are like arrows; 10590210 -> 1000006200220: all of a type of flying insect, "time-flies," collectively enjoys a single arrow (compare Fruit flies like a banana); 10590220 -> 1000006200230: each of a type of flying insect, "time-flies," individually enjoys a different arrow (similar comparison applies); 10590230 -> 1000006200240: A concrete object, for example the magazine, Time, travels through the air in an arrow-like manner. 10590240 -> 1000006200250: English is particularly challenging in this regard because it has little inflectional morphology to distinguish between parts of speech. 10590250 -> 1000006200260: English and several other languages don't specify which word an adjective applies to. 10590260 -> 1000006200270: For example, in the string "pretty little girls' school". 10590270 -> 1000006200280: Does the school look little? 10590280 -> 1000006200290: Do the girls look little? 10590290 -> 1000006200300: Do the girls look pretty? 10590300 -> 1000006200310: Does the school look pretty? 10590310 -> 1000006200320: We will often imply additional information in spoken language by the way we place stress on words. 10590320 -> 1000006200330: The sentence "I never said she stole my money" demonstrates the importance stress can play in a sentence, and thus the inherent difficulty a natural language processor can have in parsing it. 10590330 -> 1000006200340: Depending on which word the speaker places the stress, this sentence could have several distinct meanings: 10590340 -> 1000006200350: "I never said she stole my money" - Someone else said it, but I didn't. 10590350 -> 1000006200360: "I never said she stole my money" - I simply didn't ever say it. 10590360 -> 1000006200370: "I never said she stole my money" - I might have implied it in some way, but I never explicitly said it. 
10590370 -> 1000006200380: "I never said she stole my money" - I said someone took it; I didn't say it was she. 10590380 -> 1000006200390: "I never said she stole my money" - I just said she probably borrowed it. 10590390 -> 1000006200400: "I never said she stole my money" - I said she stole someone else's money. 10590400 -> 1000006200410: "I never said she stole my money" - I said she stole something, but not my money. 10590410 -> 1000006200420: Subproblems 10590420 -> 1000006200430: Speech segmentation 10590430 -> 1000006200440: In most spoken languages, the sounds representing successive letters blend into each other, so the conversion of the analog signal to discrete characters can be a very difficult process. 10590440 -> 1000006200450: Also, in natural speech there are hardly any pauses between successive words; the location of those boundaries usually must take into account grammatical and semantic constraints, as well as the context. 10590450 -> 1000006200460: Text segmentation 10590460 -> 1000006200470: Some written languages like Chinese, Japanese and Thai do not have single-word boundaries either, so any significant text parsing usually requires the identification of word boundaries, which is often a non-trivial task. 10590470 -> 1000006200480: Word sense disambiguation 10590480 -> 1000006200490: Many words have more than one meaning; we have to select the meaning which makes the most sense in context. 10590490 -> 1000006200500: Syntactic ambiguity 10590500 -> 1000006200510: The grammar for natural languages is ambiguous, i.e. there are often multiple possible parse trees for a given sentence. 10590510 -> 1000006200520: Choosing the most appropriate one usually requires semantic and contextual information. 10590520 -> 1000006200530: Specific problem components of syntactic ambiguity include sentence boundary disambiguation. 10590530 -> 1000006200540: Imperfect or irregular input 10590540 -> 1000006200550: Foreign or regional accents and vocal impediments in speech; typing or grammatical errors, OCR errors in texts. 10590550 -> 1000006200560: Speech acts and plans 10590560 -> 1000006200570: A sentence can often be considered an action by the speaker. 10590570 -> 1000006200580: The sentence structure, alone, may not contain enough information to define this action. 10590580 -> 1000006200590: For instance, a question is actually the speaker requesting some sort of response from the listener. 10590590 -> 1000006200600: The desired response may be verbal, physical, or some combination. 10590600 -> 1000006200610: For example, "Can you pass the class?" is a request for a simple yes-or-no answer, while "Can you pass the salt?" is requesting a physical action to be performed. 10590610 -> 1000006200620: It is not appropriate to respond with "Yes, I can pass the salt," without the accompanying action (although "No" or "I can't reach the salt" would explain a lack of action). 10590620 -> 1000006200630: Statistical NLP 10590630 -> 1000006200640: Statistical natural-language processing uses stochastic, probabilistic and statistical methods to resolve some of the difficulties discussed above, especially those which arise because longer sentences are highly ambiguous when processed with realistic grammars, yielding thousands or millions of possible analyses. 10590640 -> 1000006200650: Methods for disambiguation often involve the use of corpora and Markov models. 
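As a rough illustration of the corpus-and-Markov-model approach just mentioned, the sketch below hand-codes a tiny hidden Markov model over part-of-speech tags and uses Viterbi decoding to pick the most probable tag sequence for "time flies like an arrow". Every probability in it is invented for the example; in a real system they would be estimated from an annotated corpus.

```python
# A toy hidden Markov model for part-of-speech disambiguation.
# All probabilities are made up for illustration; in practice they are
# estimated from a tagged corpus.

states = ["NOUN", "VERB", "PREP", "DET"]
start = {"NOUN": 0.5, "VERB": 0.2, "PREP": 0.1, "DET": 0.2}
trans = {
    "NOUN": {"NOUN": 0.2, "VERB": 0.5, "PREP": 0.2, "DET": 0.1},
    "VERB": {"NOUN": 0.3, "VERB": 0.1, "PREP": 0.4, "DET": 0.2},
    "PREP": {"NOUN": 0.4, "VERB": 0.1, "PREP": 0.1, "DET": 0.4},
    "DET":  {"NOUN": 0.8, "VERB": 0.1, "PREP": 0.05, "DET": 0.05},
}
emit = {
    "NOUN": {"time": 0.4, "flies": 0.2, "arrow": 0.4},
    "VERB": {"time": 0.2, "flies": 0.7, "like": 0.1},
    "PREP": {"like": 1.0},
    "DET":  {"an": 1.0},
}

def viterbi(words):
    """Return the most probable tag sequence under the toy model."""
    V = [{s: start[s] * emit[s].get(words[0], 1e-6) for s in states}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: V[-1][p] * trans[p][s])
            col[s] = V[-1][best_prev] * trans[best_prev][s] * emit[s].get(w, 1e-6)
            ptr[s] = best_prev
        V.append(col)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["time", "flies", "like", "an", "arrow"]))
# -> ['NOUN', 'VERB', 'PREP', 'DET', 'NOUN'] under these made-up probabilities
```

Under these invented numbers the decoder settles on the "common simile" reading (noun, verb, preposition, determiner, noun); changing the probabilities can just as easily make one of the other readings win, which is exactly the disambiguation problem described above.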
10590650 -> 1000006200660: Statistical NLP comprises all quantitative approaches to automated language processing, including probabilistic modeling, information theory, and linear algebra. 10590660 -> 1000006200670: The technology for statistical NLP comes mainly from machine learning and data mining, both of which are fields of artificial intelligence that involve learning from data. 10590670 -> None: Major tasks in NLP 10590680 -> None: Automatic summarization 10590690 -> None: Foreign language reading aid 10590700 -> None: Foreign language writing aid 10590710 -> None: Information extraction 10590720 -> None: Information retrieval 10590730 -> None: Machine translation 10590740 -> None: Named entity recognition 10590750 -> None: Natural language generation 10590760 -> None: Natural language understanding 10590770 -> None: Optical character recognition 10590780 -> None: Question answering 10590790 -> None: Speech recognition 10590800 -> None: Spoken dialogue system 10590810 -> None: Text simplification 10590820 -> None: Text to speech 10590830 -> None: Text-proofing 10590840 -> 1000006200680: Evaluation of natural language processing 10590850 -> 1000006200690: Objectives 10590860 -> 1000006200700: The goal of NLP evaluation is to measure one or more qualities of an algorithm or a system, in order to determine if (or to what extent) the system answers the goals of its designers, or the needs of its users. 10590870 -> 1000006200710: Research in NLP evaluation has received considerable attention, because the definition of proper evaluation criteria is one way to specify precisely an NLP problem, going thus beyond the vagueness of tasks defined only as language understanding or language generation. 10590880 -> 1000006200720: A precise set of evaluation criteria, which includes mainly evaluation data and evaluation metrics, enables several teams to compare their solutions to a given NLP problem. 10590890 -> 1000006200730: Short history of evaluation in NLP 10590900 -> 1000006200740: The first evaluation campaign on written texts seems to be a campaign dedicated to message understanding in 1987 (Pallet 1998). 10590910 -> 1000006200750: Then, the Parseval/GEIG project compared phrase-structure grammars (Black 1991). 10590920 -> 1000006200760: A series of campaigns within Tipster project were realized on tasks like summarization, translation and searching (Hirshman 1998). 10590930 -> 1000006200770: In 1994, in Germany, the Morpholympics compared German taggers. 10590940 -> 1000006200780: Then, the Senseval and Romanseval campaigns were conducted with the objectives of semantic disambiguation. 10590950 -> 1000006200790: In 1996, the Sparkle campaign compared syntactic parsers in four different languages (English, French, German and Italian). 10590960 -> 1000006200800: In France, the Grace project compared a set of 21 taggers for French in 1997 (Adda 1999). 10590970 -> 1000006200810: In 2004, during the Technolangue/Easy project, 13 parsers for French were compared. 10590980 -> 1000006200820: Large-scale evaluation of dependency parsers were performed in the context of the CoNLL shared tasks in 2006 and 2007. 10590990 -> 1000006200830: In Italy, the evalita campaign was conducted in 2007 to compare various tools for Italian evalita web site. 10591000 -> 1000006200840: In France, within the ANR-Passage project (end of 2007), 10 parsers for French were compared passage web site. 10591010 -> 1000006200850: Adda G., Mariani J., Paroubek P., Rajman M. 
(1999). L'action GRACE d'évaluation de l'assignation des parties du discours pour le français [The GRACE campaign for evaluating part-of-speech assignment for French]. Langues, vol. 2. 10591030 -> 1000006200860: Black E., Abney S., Flickinger D., Gdaniec C., Grishman R., Harrison P., Hindle D., Ingria R., Jelinek F., Klavans J., Liberman M., Marcus M., Roukos S., Santorini B., Strzalkowski T. (1991). A procedure for quantitatively comparing the syntactic coverage of English grammars. DARPA Speech and Natural Language Workshop. 10591050 -> 1000006200870: Hirshman L. (1998). Language understanding evaluation: lessons learned from MUC and ATIS. LREC, Granada. 10591070 -> 1000006200880: Pallet D.S. (1998). The NIST role in automatic speech recognition benchmark tests. LREC, Granada. 10591090 -> 1000006200890: Different types of evaluation 10591100 -> 1000006200900: Depending on the evaluation procedures, a number of distinctions are traditionally made in NLP evaluation. 10591110 -> 1000006200910: Intrinsic vs. extrinsic evaluation 10591120 -> 1000006200920: Intrinsic evaluation considers an isolated NLP system and characterizes its performance mainly with respect to a gold standard result, pre-defined by the evaluators. 10591130 -> 1000006200930: Extrinsic evaluation, also called evaluation in use, considers the NLP system in a more complex setting, either as an embedded system or serving a precise function for a human user. 10591140 -> 1000006200940: The extrinsic performance of the system is then characterized in terms of its utility with respect to the overall task of the complex system or the human user. 10591150 -> 1000006200950: Black-box vs. glass-box evaluation 10591160 -> 1000006200960: Black-box evaluation requires one to run an NLP system on a given data set and to measure a number of parameters related to the quality of the process (speed, reliability, resource consumption) and, most importantly, to the quality of the result (e.g. the accuracy of data annotation or the fidelity of a translation). 10591170 -> 1000006200970: Glass-box evaluation looks at the design of the system, the algorithms that are implemented, the linguistic resources it uses (e.g. vocabulary size), etc. 10591180 -> 1000006200980: Given the complexity of NLP problems, it is often difficult to predict performance only on the basis of glass-box evaluation, but this type of evaluation is more informative with respect to error analysis or future developments of a system. 10591190 -> 1000006200990: Automatic vs. manual evaluation 10591200 -> 1000006201000: In many cases, automatic procedures can be defined to evaluate an NLP system by comparing its output with the gold standard (or desired) one. 10591210 -> 1000006201010: Although the cost of producing the gold standard can be quite high, automatic evaluation can be repeated as often as needed without much additional cost (on the same input data). 10591220 -> 1000006201020: However, for many NLP problems, the definition of a gold standard is a complex task, and can prove impossible when inter-annotator agreement is insufficient. 10591230 -> 1000006201030: Manual evaluation is performed by human judges, who are instructed to estimate the quality of a system, or most often of a sample of its output, based on a number of criteria. 10591240 -> 1000006201040: Although, thanks to their linguistic competence, human judges can be considered the reference for a number of language processing tasks, there is also considerable variation across their ratings.
10591250 -> 1000006201050: This is why automatic evaluation is sometimes referred to as objective evaluation, while the human kind appears to be more subjective. 10591260 -> None: Shared tasks (Campaigns) 10591270 -> None: BioCreative 10591280 -> None: Message Understanding Conference 10591290 -> None: Technolangue/Easy 10591300 -> None: Text Retrieval Conference 10591310 -> 1000006201060: Standardization in NLP 10591320 -> 1000006201070: An ISO sub-committee is working in order to ease interoperability between Lexical resources and NLP programs. 10591330 -> 1000006201080: The sub-committee is part of ISO/TC37 and is called ISO/TC37/SC4. 10591340 -> 1000006201090: Some ISO standards are already published but most of them are under construction, mainly on lexicon representation (see LMF), annotation and data category registry. Neural network 10600010 -> 1000006300020: Neural network 10600020 -> 1000006300030: Traditionally, the term neural network had been used to refer to a network or circuit of biological neurons. 10600030 -> 1000006300040: The modern usage of the term often refers to artificial neural networks, which are composed of artificial neurons or nodes. 10600040 -> 1000006300050: Thus the term has two distinct usages: 10600050 -> 1000006300060: Biological neural networks are made up of real biological neurons that are connected or functionally-related in the peripheral nervous system or the central nervous system. 10600060 -> 1000006300070: In the field of neuroscience, they are often identified as groups of neurons that perform a specific physiological function in laboratory analysis. 10600070 -> 1000006300080: Artificial neural networks are made up of interconnecting artificial neurons (programming constructs that mimic the properties of biological neurons). 10600080 -> 1000006300090: Artificial neural networks may either be used to gain an understanding of biological neural networks, or for solving artificial intelligence problems without necessarily creating a model of a real biological system. 10600090 -> 1000006300100: This article focuses on the relationship between the two concepts; for detailed coverage of the two different concepts refer to the separate articles: Biological neural network and Artificial neural network. 10600100 -> 1000006300110: Characterization 10600110 -> 1000006300120: In general a biological neural network is composed of a group or groups of chemically connected or functionally associated neurons. 10600120 -> 1000006300130: A single neuron may be connected to many other neurons and the total number of neurons and connections in a network may be extensive. 10600130 -> 1000006300140: Connections, called synapses, are usually formed from axons to dendrites, though dendrodendritic microcircuits and other connections are possible. 10600140 -> 1000006300150: Apart from the electrical signaling, there are other forms of signaling that arise from neurotransmitter diffusion, which have an effect on electrical signaling. 10600150 -> 1000006300160: As such, neural networks are extremely complex. 10600160 -> 1000006300170: Artificial intelligence and cognitive modeling try to simulate some properties of neural networks. 10600170 -> 1000006300180: While similar in their techniques, the former has the aim of solving particular tasks, while the latter aims to build mathematical models of biological neural systems. 
10600180 -> 1000006300190: In the artificial intelligence field, artificial neural networks have been applied successfully to speech recognition, image analysis and adaptive control, in order to construct software agents (in computer and video games) or autonomous robots. 10600190 -> 1000006300200: Most of the currently employed artificial neural networks for artificial intelligence are based on statistical estimation, optimization and control theory. 10600200 -> 1000006300210: The cognitive modelling field involves the physical or mathematical modeling of the behaviour of neural systems; ranging from the individual neural level (e.g. modelling the spike response curves of neurons to a stimulus), through the neural cluster level (e.g. modelling the release and effects of dopamine in the basal ganglia) to the complete organism (e.g. behavioural modelling of the organism's response to stimuli). 10600210 -> 1000006300220: The brain, neural networks and computers 10600220 -> 1000006300230: Neural networks, as used in artificial intelligence, have traditionally been viewed as simplified models of neural processing in the brain, even though the relation between this model and brain biological architecture is debated. 10600230 -> 1000006300240: A subject of current research in theoretical neuroscience is the question surrounding the degree of complexity and the properties that individual neural elements should have to reproduce something resembling animal intelligence. 10600240 -> 1000006300250: Historically, computers evolved from the von Neumann architecture, which is based on sequential processing and execution of explicit instructions. 10600250 -> 1000006300260: On the other hand, the origins of neural networks are based on efforts to model information processing in biological systems, which may rely largely on parallel processing as well as implicit instructions based on recognition of patterns of 'sensory' input from external sources. 10600260 -> 1000006300270: In other words, at its very heart a neural network is a complex statistical processor (as opposed to being tasked to sequentially process and execute). 10600270 -> 1000006300280: Neural networks and artificial intelligence 10600280 -> 1000006300290: An artificial neural network (ANN), also called a simulated neural network (SNN) or commonly just neural network (NN) is an interconnected group of artificial neurons that uses a mathematical or computational model for information processing based on a connectionistic approach to computation. 10600290 -> 1000006300300: In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network. 10600300 -> 1000006300310: In more practical terms neural networks are non-linear statistical data modeling or decision making tools. 10600310 -> 1000006300320: They can be used to model complex relationships between inputs and outputs or to find patterns in data. 10600320 -> 1000006300330: Background 10600330 -> 1000006300340: An artificial neural network involves a network of simple processing elements (artificial neurons) which can exhibit complex global behaviour, determined by the connections between the processing elements and element parameters. 10600340 -> 1000006300350: One classical type of artificial neural network is the Hopfield net. 
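Since the Hopfield net is cited above as a classical example, here is a minimal sketch of one, assuming two invented six-unit patterns: storage uses a Hebbian outer-product rule, and recall repeatedly updates every unit with the sign of its weighted input until the state stops changing.

```python
# A toy Hopfield net: Hebbian storage of two patterns, recall by thresholded updates.
import numpy as np

patterns = np.array([[ 1, -1,  1, -1,  1, -1],
                     [ 1,  1,  1, -1, -1, -1]])

# Hebbian storage: sum of outer products, with self-connections removed.
W = sum(np.outer(p, p) for p in patterns).astype(float)
np.fill_diagonal(W, 0)

def recall(state, steps=10):
    """Synchronously update all units until the state stops changing."""
    s = state.copy()
    for _ in range(steps):
        new = np.where(W @ s >= 0, 1, -1)
        if np.array_equal(new, s):
            break
        s = new
    return s

noisy = np.array([1, -1, 1, -1, 1, 1])   # first pattern with its last unit flipped
print(recall(noisy))                      # converges back to [ 1 -1  1 -1  1 -1]
```

Starting from a corrupted copy of the first pattern, the update loop falls back into the stored pattern, which is the associative-memory behaviour the Hopfield net is known for.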
10600350 -> 1000006300360: In a neural network model simple nodes, which can be called variously "neurons", "neurodes", "Processing Elements" (PE) or "units", are connected together to form a network of nodes — hence the term "neural network". 10600360 -> 1000006300370: While a neural network does not have to be adaptive per se, its practical use comes with algorithms designed to alter the strength (weights) of the connections in the network to produce a desired signal flow. 10600370 -> 1000006300380: In modern software implementations of artificial neural networks the approach inspired by biology has more or less been abandoned for a more practical approach based on statistics and signal processing. 10600380 -> 1000006300390: In some of these systems neural networks, or parts of neural networks (such as artificial neurons) are used as components in larger systems that combine both adaptive and non-adaptive elements. 10600390 -> 1000006300400: The concept of a neural network appears to have first been proposed by Alan Turing in his 1948 paper "Intelligent Machinery". 10600400 -> 1000006300410: Applications 10600410 -> 1000006300420: The utility of artificial neural network models lies in the fact that they can be used to infer a function from observations and also to use it. 10600420 -> 1000006300430: This is particularly useful in applications where the complexity of the data or task makes the design of such a function by hand impractical. 10600430 -> 1000006300440: Real life applications 10600440 -> 1000006300450: The tasks to which artificial neural networks are applied tend to fall within the following broad categories: 10600450 -> 1000006300460: Function approximation, or regression analysis, including time series prediction and modelling. 10600460 -> 1000006300470: Classification, including pattern and sequence recognition, novelty detection and sequential decision making. 10600470 -> 1000006300480: Data processing, including filtering, clustering, blind signal separation and compression. 10600480 -> 1000006300490: Application areas include system identification and control (vehicle control, process control), game-playing and decision making (backgammon, chess, racing), pattern recognition (radar systems, face identification, object recognition, etc.), sequence recognition (gesture, speech, handwritten text recognition), medical diagnosis, financial applications, data mining (or knowledge discovery in databases, "KDD"), visualization and e-mail spam filtering. 10600490 -> 1000006300500: Neural network software 10600500 -> 1000006300510: Main article: Neural network software 10600510 -> 1000006300520: Neural network software is used to simulate, research, develop and apply artificial neural networks, biological neural networks and in some cases a wider array of adaptive systems. 10600520 -> 1000006300530: Learning paradigms 10600530 -> 1000006300540: There are three major learning paradigms, each corresponding to a particular abstract learning task. 10600540 -> 1000006300550: These are supervised learning, unsupervised learning and reinforcement learning. 10600550 -> 1000006300560: Usually any given type of network architecture can be employed in any of those tasks. 10600560 -> 1000006300570: Supervised learning 10600570 -> 1000006300580: In supervised learning, we are given a set of example pairs (x, y), x \in X, y \in Y and the aim is to find a function f in the allowed class of functions that matches the examples. 
10600580 -> 1000006300590: In other words, we wish to infer the mapping implied by the data; the cost function is related to the mismatch between our mapping and the data. 10600590 -> 1000006300600: Unsupervised learning 10600600 -> 1000006300610: In unsupervised learning we are given some data x and a cost function to be minimized, which can be any function of x and the network's output f. 10600610 -> 1000006300620: The cost function is determined by the task formulation. 10600620 -> 1000006300630: Most applications fall within the domain of estimation problems such as statistical modeling, compression, filtering, blind source separation and clustering. 10600630 -> 1000006300640: Reinforcement learning 10600640 -> 1000006300650: In reinforcement learning, data x is usually not given, but generated by an agent's interactions with the environment. 10600650 -> 1000006300660: At each point in time t, the agent performs an action y_t and the environment generates an observation x_t and an instantaneous cost c_t, according to some (usually unknown) dynamics. 10600660 -> 1000006300670: The aim is to discover a policy for selecting actions that minimises some measure of a long-term cost, i.e. the expected cumulative cost. 10600670 -> 1000006300680: The environment's dynamics and the long-term cost for each policy are usually unknown, but can be estimated. 10600680 -> 1000006300690: ANNs are frequently used in reinforcement learning as part of the overall algorithm. 10600690 -> 1000006300700: Tasks that fall within the paradigm of reinforcement learning are control problems, games and other sequential decision making tasks. 10600700 -> 1000006300710: Learning algorithms 10600710 -> 1000006300720: There are many algorithms for training neural networks; most of them can be viewed as a straightforward application of optimization theory and statistical estimation. 10600720 -> 1000006300730: Evolutionary computation methods, simulated annealing, expectation maximization and non-parametric methods are among other commonly used methods for training neural networks. 10600730 -> 1000006300740: See also machine learning. 10600740 -> 1000006300750: Recent developments in this field have also seen the use of particle swarm optimization and other swarm intelligence techniques in the training of neural networks. 10600750 -> 1000006300760: Neural networks and neuroscience 10600760 -> 1000006300770: Theoretical and computational neuroscience is the field concerned with the theoretical analysis and computational modeling of biological neural systems. 10600770 -> 1000006300780: Since neural systems are intimately related to cognitive processes and behaviour, the field is closely related to cognitive and behavioural modeling. 10600780 -> 1000006300790: The aim of the field is to create models of biological neural systems in order to understand how biological systems work. 10600790 -> 1000006300800: To gain this understanding, neuroscientists strive to make a link between observed biological processes (data), biologically plausible mechanisms for neural processing and learning (biological neural network models) and theory (statistical learning theory and information theory). 10600800 -> 1000006300810: Types of models 10600810 -> 1000006300820: Many models are used in the field, each defined at a different level of abstraction and trying to model different aspects of neural systems.
10600820 -> 1000006300830: They range from models of the short-term behaviour of individual neurons, through models of how the dynamics of neural circuitry arise from interactions between individual neurons, to models of how behaviour can arise from abstract neural modules that represent complete subsystems. 10600830 -> 1000006300840: These include models of the long-term and short-term plasticity of neural systems and its relation to learning and memory, from the individual neuron to the system level. 10600840 -> 1000006300850: Current research 10600850 -> 1000006300860: While initially research had been concerned mostly with the electrical characteristics of neurons, a particularly important part of the investigation in recent years has been the exploration of the role of neuromodulators such as dopamine, acetylcholine, and serotonin on behaviour and learning. 10600860 -> 1000006300870: Biophysical models, such as BCM theory, have been important in understanding mechanisms for synaptic plasticity, and have had applications in both computer science and neuroscience. 10600870 -> 1000006300880: Research is ongoing in understanding the computational algorithms used in the brain, with some recent biological evidence for radial basis networks and neural backpropagation as mechanisms for processing data. 10600880 -> 1000006300890: History of the neural network analogy 10600890 -> 1000006300900: The concept of neural networks started in the late-1800s as an effort to describe how the human mind performed. 10600900 -> 1000006300910: These ideas started being applied to computational models with the Perceptron. 10600910 -> 1000006300920: In early 1950s Friedrich Hayek was one of the first to posit the idea of spontaneous order in the brain arising out of decentralized networks of simple units (neurons). 10600920 -> 1000006300930: In the late 1940s, Donald Hebb made one of the first hypotheses for a mechanism of neural plasticity (i.e. learning), Hebbian learning. 10600930 -> 1000006300940: Hebbian learning is considered to be a 'typical' unsupervised learning rule and it (and variants of it) was an early model for long term potentiation. 10600940 -> 1000006300950: The Perceptron is essentially a linear classifier for classifying data x \in R^n specified by parameters w \in R^n, b \in R and an output function f = w'x + b. 10600950 -> 1000006300960: Its parameters are adapted with an ad-hoc rule similar to stochastic steepest gradient descent. 10600960 -> 1000006300970: Because the inner product is a linear operator in the input space, the Perceptron can only perfectly classify a set of data for which different classes are linearly separable in the input space, while it often fails completely for non-separable data. 10600970 -> 1000006300980: While the development of the algorithm initially generated some enthusiasm, partly because of its apparent relation to biological mechanisms, the later discovery of this inadequacy caused such models to be abandoned until the introduction of non-linear models into the field. 10600980 -> 1000006300990: The Cognitron (1975) was an early multilayered neural network with a training algorithm. 10600990 -> 1000006301000: The actual structure of the network and the methods used to set the interconnection weights change from one neural strategy to another, each with its advantages and disadvantages. 
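Returning to the Perceptron described above, the following minimal sketch, on a small invented 2-D data set, shows the classifier f(x) = sign(w'x + b) and the ad-hoc update rule that nudges the parameters after each misclassified example. Because these toy points are linearly separable the rule converges; on non-separable data it would keep cycling, which is the inadequacy discussed in the text.

```python
# A toy perceptron trained with the classic error-driven update rule.
import numpy as np

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])           # desired class labels (invented, separable)

w, b = np.zeros(2), 0.0
for _ in range(20):                    # a few passes over the data
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:     # misclassified (or on the boundary)
            w += yi * xi               # move the decision boundary toward the example
            b += yi

print(w, b, np.sign(X @ w + b))        # predictions now match y
```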
10601000 -> 1000006301010: Networks can propagate information in one direction only, or they can bounce back and forth until self-activation at a node occurs and the network settles on a final state. 10601010 -> 1000006301020: The ability for bi-directional flow of inputs between neurons/nodes was produced with the Hopfield's network (1982), and specialization of these node layers for specific purposes was introduced through the first hybrid network. 10601020 -> 1000006301030: The parallel distributed processing of the mid-1980s became popular under the name connectionism. 10601030 -> 1000006301040: The rediscovery of the backpropagation algorithm was probably the main reason behind the repopularisation of neural networks after the publication of "Learning Internal Representations by Error Propagation" in 1986 (Though backpropagation itself dates from 1974). 10601040 -> 1000006301050: The original network utilised multiple layers of weight-sum units of the type f = g(w'x + b), where g was a sigmoid function or logistic function such as used in logistic regression. 10601050 -> 1000006301060: Training was done by a form of stochastic steepest gradient descent. 10601060 -> 1000006301070: The employment of the chain rule of differentiation in deriving the appropriate parameter updates results in an algorithm that seems to 'backpropagate errors', hence the nomenclature. 10601070 -> 1000006301080: However it is essentially a form of gradient descent. 10601080 -> 1000006301090: Determining the optimal parameters in a model of this type is not trivial, and steepest gradient descent methods cannot be relied upon to give the solution without a good starting point. 10601090 -> 1000006301100: In recent times, networks with the same architecture as the backpropagation network are referred to as Multi-Layer Perceptrons. 10601100 -> 1000006301110: This name does not impose any limitations on the type of algorithm used for learning. 10601110 -> 1000006301120: The backpropagation network generated much enthusiasm at the time and there was much controversy about whether such learning could be implemented in the brain or not, partly because a mechanism for reverse signalling was not obvious at the time, but most importantly because there was no plausible source for the 'teaching' or 'target' signal. 10601120 -> 1000006301130: Criticism 10601130 -> 1000006301140: A. K. Dewdney, a former Scientific American columnist, wrote in 1997, “Although neural nets do solve a few toy problems, their powers of computation are so limited that I am surprised anyone takes them seriously as a general problem-solving tool.” 10601140 -> 1000006301150: (Dewdney, p.82) 10601150 -> 1000006301160: Arguments against Dewdney's position are that neural nets have been successfully used to solve many complex and diverse tasks, ranging from autonomously flying aircraft to detecting credit card fraud. 10601160 -> 1000006301170: Technology writer Roger Bridgman commented on Dewdney's statements about neural nets: 10601170 -> 1000006301180: Neural networks, for instance, are in the dock not only because they have been hyped to high heaven, (what hasn't?) but also because you could create a successful net without understanding how it worked: the bunch of numbers that captures its behaviour would in all probability be "an opaque, unreadable table...valueless as a scientific resource". 
10601180 -> 1000006301190: In spite of his emphatic declaration that science is not technology, Dewdney seems here to pillory neural nets as bad science when most of those devising them are just trying to be good engineers. 10601190 -> 1000006301200: An unreadable table that a useful machine could read would still be well worth having. Noun 10620010 -> 1000006400020: Noun 10620020 -> 1000006400030: In linguistics, a noun is a member of a large, open lexical category whose members can occur as the main word in the subject of a clause, the object of a verb, or the object of a preposition. 10620030 -> 1000006400040: Lexical categories are defined in terms of how their members combine with other kinds of expressions. 10620040 -> 1000006400050: The syntactic rules for nouns differ from language to language. 10620050 -> 1000006400060: In English, nouns may be defined as those words which can occur with articles and attributive adjectives and can function as the head of a noun phrase. 10620060 -> 1000006400070: In traditional English grammar, the noun is one of the eight parts of speech. 10620070 -> 1000006400080: History 10620080 -> 1000006400090: The word comes from the Latin nomen meaning "name". 10620090 -> 1000006400100: Word classes like nouns were first described by the Sanskrit grammarian {(Transl+Pāṇini+sa+IAST+sa)} and ancient Greeks like Dionysios Thrax; and were defined in terms of their morphological properties. 10620100 -> 1000006400110: For example, in Ancient Greek, nouns inflect for grammatical case, such as dative or accusative. 10620110 -> 1000006400120: Verbs, on the other hand, inflect for tenses, such as past, present or future, while nouns do not. 10620120 -> 1000006400130: Aristotle also had a notion of onomata (nouns) and rhemata (verbs) which, however, does not exactly correspond with modern notions of nouns and verbs. 10620130 -> 1000006400140: Vinokurova 2005 has a more detailed discussion of the historical origin of the notion of a noun. 10620140 -> 1000006400150: Different definitions of nouns 10620150 -> 1000006400160: Expressions of natural language have properties at different levels. 10620160 -> 1000006400170: They have formal properties, like what kinds of morphological prefixes or suffixes they take and what kinds of other expressions they combine with; but they also have semantic properties, i.e. properties pertaining to their meaning. 10620170 -> 1000006400180: The definition of a noun at the outset of this page is thus a formal, traditional grammatical definition. 10620180 -> 1000006400190: That definition, for the most part, is considered uncontroversial and furnishes the propensity for certain language users to effectively distinguish most nouns from non-nouns. 10620190 -> 1000006400200: However, it has the disadvantage that it does not apply to nouns in all languages. 10620200 -> 1000006400210: For example in Russian, there are no definite articles, so one cannot define nouns as words that are modified by definite articles. 10620210 -> 1000006400220: There are also several attempts of defining nouns in terms of their semantic properties. 10620220 -> 1000006400230: Many of these are controversial, but some are discussed below. 10620230 -> 1000006400240: Names for things 10620240 -> 1000006400250: In traditional school grammars, one often encounters the definition of nouns that they are all and only those expressions that refer to a person, place, thing, event, substance, quality, or idea, etc. 10620250 -> 1000006400260: This is a semantic definition. 
10620260 -> 1000006400270: It has been criticized by contemporary linguists as being uninformative. 10620270 -> 1000006400280: Contemporary linguists generally agree that one cannot successfully define nouns (or other grammatical categories) in terms of what sort of object in the world they refer to or signify. 10620280 -> 1000006400290: Part of the conundrum is that the definition makes use of relatively general nouns ("thing", "phenomenon", "event") to define what nouns are. 10620290 -> 1000006400300: The existence of such general nouns demonstrates that nouns refer to entities that are organized in taxonomic hierarchies. 10620300 -> 1000006400310: But other kinds of expressions are also organized into such structured taxonomic relationships. 10620310 -> 1000006400320: For example the verbs "stroll","saunter", "stride", and "tread" are more specific words than the more general "walk". 10620320 -> 1000006400330: Moreover, "walk" is more specific than the verb "move", which, in turn, is less general than "change". 10620330 -> 1000006400340: But it is unlikely that such taxonomic relationships can be used to define nouns and verbs. 10620340 -> 1000006400350: We cannot define verbs as those words that refer to "changes" or "states", for example, because the nouns change and state probably refer to such things, but, of course, aren't verbs. 10620350 -> 1000006400360: Similarly, nouns like "invasion", "meeting", or "collapse" refer to things that are "done" or "happen". 10620360 -> 1000006400370: In fact, an influential theory has it that verbs like "kill" or "die" refer to events, which is among the sort of thing that nouns are supposed to refer to. 10620370 -> 1000006400380: The point being made here is not that this view of verbs is wrong, but rather that this property of verbs is a poor basis for a definition of this category, just like the property of having wheels is a poor basis for a definition of cars (some things that have wheels, such as my suitcase or a jumbo jet, aren't cars). 10620380 -> 1000006400390: Similarly, adjectives like "yellow" or "difficult" might be thought to refer to qualities, and adverbs like "outside" or "upstairs" seem to refer to places, which are also among the sorts of things nouns can refer to. 10620390 -> 1000006400400: But verbs, adjectives and adverbs are not nouns, and nouns aren't verbs, adjectives or adverbs. 10620400 -> 1000006400410: One might argue that "definitions" of this sort really rely on speakers' prior intuitive knowledge of what nouns, verbs and adjectives are, and, so don't really add anything over and beyond this. 10620410 -> 1000006400420: Speakers' intuitive knowledge of such things might plausibly be based on formal criteria, such as the traditional grammatical definition of English nouns aforementioned. 10620420 -> 1000006400430: Prototypically referential expressions 10620430 -> 1000006400440: Another semantic definition of nouns is that they are prototypically referential. 10620440 -> 1000006400450: That definition is also not very helpful in distinguishing actual nouns from verbs. 10620450 -> 1000006400460: But it may still correctly identify a core property of nounhood. 10620460 -> 1000006400470: For example, we will tend to use nouns like "fool" and "car" when we wish to refer to fools and cars, respectively. 10620470 -> 1000006400480: The notion that this is prototypical reflects the fact that such nouns can be used, even though nothing with the corresponding property is referred to: 10620480 -> 1000006400490: John is no fool. 
10620490 -> 1000006400500: If I had a car, I'd go to Marrakech. 10620500 -> 1000006400510: The first sentence above doesn't refer to any fools, nor does the second one refer to any particular car. 10620510 -> 1000006400520: Predicates with identity criteria 10620520 -> 1000006400530: The British logician Peter Thomas Geach proposed a very subtle semantic definition of nouns. 10620530 -> 1000006400540: He noticed that adjectives like "same" can modify nouns, but no other kinds of parts of speech, like verbs or adjectives. 10620540 -> 1000006400550: Not only that, but there also doesn't seem to be any other expressions with similar meaning that can modify verbs and adjectives. 10620550 -> 1000006400560: Consider the following examples. 10620560 -> 1000006400570: Good: John and Bill participated in the same fight. 10620580 -> 1000006400580: Bad: *John and Bill samely fought. 10620590 -> 1000006400590: There is no English adverb "samely". 10620600 -> 1000006400600: In some other languages, like Czech, however there are adverbs corresponding to "samely". 10620610 -> 1000006400610: Hence, in Czech, the translation of the last sentence would be fine; however, it would mean that John and Bill fought in the same way: not that they participated in the same fight. 10620620 -> 1000006400620: Geach proposed that we could explain this, if nouns denote logical predicates with identity criteria. 10620630 -> 1000006400630: An identity criterion would allow us to conclude, for example, that "person x at time 1 is the same person as person y at time 2". 10620640 -> 1000006400640: Different nouns can have different identity criteria. 10620650 -> 1000006400650: A well known example of this is due to Gupta: 10620660 -> 1000006400660: National Airlines transported 2 million passengers in 1979. 10620670 -> 1000006400670: National Airlines transported (at least) 2 million persons in 1979. 10620680 -> 1000006400680: Given that, in general, all passengers are persons, the last sentence above ought to follow logically from the first one. 10620690 -> 1000006400690: But it doesn't. 10620700 -> 1000006400700: It is easy to imagine, for example, that on average, every person who travelled with National Airlines in 1979, travelled with them twice. 10620710 -> 1000006400710: In that case, one would say that the airline transported 2 million passengers but only 1 million persons. 10620720 -> 1000006400720: Thus, the way that we count passengers isn't necessarily the same as the way that we count persons. 10620730 -> 1000006400730: Put somewhat differently: At two different times, you may correspond to two distinct passengers, even though you are one and the same person. 10620740 -> 1000006400740: For a precise definition of identity criteria, see Gupta. 10620750 -> 1000006400750: Recently, Baker has proposed that Geach's definition of nouns in terms of identity criteria allows us to explain the characteristic properties of nouns. 10620760 -> 1000006400760: He argues that nouns can co-occur with (in-)definite articles and numerals, and are "prototypically referential" because they are all and only those parts of speech that provide identity criteria. 10620770 -> 1000006400770: Baker's proposals are quite new, and linguists are still evaluating them. 
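Gupta's counting point can be restated as a toy computation (not from the text): if a passenger is individuated by a (person, flight) pair, then the same person flying twice counts as two passengers but only one person, so the two nouns come with different ways of counting, that is, different identity criteria.

```python
# Toy data: each passenger is a (person, flight) pair; the flights are invented.
trips = [("Alice", "NA101"), ("Alice", "NA202"), ("Bob", "NA101")]

passengers = len(trips)                         # one passenger per person-flight pair
persons = len({person for person, _ in trips})  # distinct people

print(passengers, persons)   # 3 passengers, but only 2 persons
```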
10620780 -> 1000006400780: Classification of nouns in English 10620790 -> 1000006400790: Proper nouns and common nouns 10620800 -> 1000006400800: Proper nouns (also called proper names) are nouns representing unique entities (such as London, Universe or John), as distinguished from common nouns which describe a class of entities (such as city, planet or person). 10620810 -> 1000006400810: In English and most other languages that use the Latin alphabet, proper nouns are usually capitalized. 10620820 -> 1000006400820: Languages differ in whether most elements of multiword proper nouns are capitalised (e.g., American English House of Representatives) or only the initial element (e.g., Slovenian Državni zbor 'National Assembly'). 10620830 -> 1000006400830: In German, nouns of all types are capitalized. 10620840 -> 1000006400840: The convention of capitalizing all nouns was previously used in English, but ended circa 1800. 10620850 -> 1000006400850: In America, the shift in capitalization is recorded in several noteworthy documents. 10620860 -> 1000006400860: The end (but not the beginning) of the Declaration of Independence (1776) and all of the Constitution (1787) show nearly all nouns capitalized, the Bill of Rights (1789) capitalizes a few common nouns but not most of them, and the Thirteenth Constitutional Amendment (1865) only capitalizes proper nouns. 10620870 -> 1000006400870: Sometimes the same word can function as both a common noun and a proper noun, where one such entity is special. 10620880 -> 1000006400880: For example the common noun god denotes all deities, while the proper noun God references the monotheistic God specifically. 10620890 -> 1000006400890: Owing to the essentially arbitrary nature of orthographic classification and the existence of variant authorities and adopted house styles, questionable capitalization of words is not uncommon, even in respected newspapers and magazines. 10620900 -> 1000006400900: Most publishers, however, properly require consistency, at least within the same document, in applying their specified standard. 10620910 -> 1000006400910: The common meaning of the word or words constituting a proper noun may be unrelated to the object to which the proper noun refers. 10620920 -> 1000006400920: For example, someone might be named "Tiger Smith" despite being neither a tiger nor a smith. 10620930 -> 1000006400930: For this reason, proper nouns are usually not translated between languages, although they may be transliterated. 10620940 -> 1000006400940: For example, the German surname Knödel becomes Knodel or Knoedel in English (not the literal Dumpling). 10620950 -> 1000006400950: However, the transcription of place names and the names of monarchs, popes, and non-contemporary authors is common and sometimes universal. 10620960 -> 1000006400960: For instance, the Portuguese word Lisboa becomes Lisbon in English; the English London becomes Londres in French; and the Greek Aristotelēs becomes Aristotle in English. 10620970 -> 1000006400970: Countable and uncountable nouns 10620980 -> 1000006400980: Count nouns are common nouns that can take a plural, can combine with numerals or quantifiers (e.g. "one", "two", "several", "every", "most"), and can take an indefinite article ("a" or "an"). 10620990 -> 1000006400990: Examples of count nouns are "chair", "nose", and "occasion". 10621000 -> 1000006401000: Mass nouns (or non-count nouns) differ from count nouns in precisely that respect: they can't take plural or combine with number words or quantifiers. 
10621010 -> 1000006401010: Examples from English include "laughter", "cutlery", "helium", and "furniture". 10621020 -> 1000006401020: For example, it is not possible to refer to "a furniture" or "three furnitures". 10621030 -> 1000006401030: This is true even though the pieces of furniture comprising "furniture" could be counted. 10621040 -> 1000006401040: Thus the distinction between mass and count nouns shouldn't be made in terms of what sorts of things the nouns refer to, but rather in terms of how the nouns present these entities. 10621050 -> 1000006401050: Collective nouns 10621060 -> 1000006401060: Collective nouns are nouns that refer to groups consisting of more than one individual or entity, even when they are inflected for the singular. 10621070 -> 1000006401070: Examples include "committee", "herd", and "school" (of herring). 10621080 -> 1000006401080: These nouns have slightly different grammatical properties than other nouns. 10621090 -> 1000006401090: For example, the noun phrases that they head can serve as the subject of a collective predicate, even when they are inflected for the singular. 10621100 -> 1000006401100: A collective predicate is a predicate that normally can't take a singular subject. 10621110 -> 1000006401110: An example of the latter is "talked to each other". 10621120 -> 1000006401120: Good: The boys talked to each other. 10621130 -> 1000006401130: Bad: *The boy talked to each other. 10621140 -> 1000006401140: Good: The committee talked to each other. 10621150 -> 1000006401150: Concrete nouns and abstract nouns 10621160 -> 1000006401160: Concrete nouns refer to physical bodies which you use at least one of your senses to observe. 10621170 -> 1000006401170: For instance, "chair", "apple", or "Janet". 10621180 -> 1000006401180: Abstract nouns on the other hand refer to abstract objects, that is ideas or concepts, such as "justice" or "hate". 10621190 -> 1000006401190: While this distinction is sometimes useful, the boundary between the two of them is not always clear; consider, for example, the noun "art". 10621200 -> 1000006401200: In English, many abstract nouns are formed by adding noun-forming suffixes ("-ness", "-ity", "-tion") to adjectives or verbs. 10621210 -> 1000006401210: Examples are "happiness", "circulation" and "serenity". 10621220 -> 1000006401220: Nouns and pronouns 10621230 -> 1000006401230: Noun phrases can typically be replaced by pronouns, such as "he", "it", "which", and "those", in order to avoid repetition or explicit identification, or for other reasons. 10621240 -> 1000006401240: For example, in the sentence "Janet thought that he was weird", the word "he" is a pronoun standing in place of the name of the person in question. 10621250 -> 1000006401250: The English word one can replace parts of noun phrases, and it sometimes stands in for a noun. 10621260 -> 1000006401260: An example is given below: 10621270 -> 1000006401270: John's car is newer than the one that Bill has. 10621280 -> 1000006401280: But one can also stand in for bigger subparts of a noun phrase. 10621290 -> 1000006401290: For example, in the following example, one can stand in for new car. 10621300 -> 1000006401300: This new car is cheaper than that one. 10621310 -> 1000006401310: Substantive as a word for "noun" 10621320 -> 1000006401320: Starting with old Latin grammars, many European languages use some form of the word substantive as the basic term for noun. 
10621330 -> 1000006401330: Nouns in the dictionaries of such languages are demarked by the abbreviation "s" instead of "n", which may be used for proper nouns instead. 10621340 -> 1000006401340: This corresponds to those grammars in which nouns and adjectives phase into each other in more areas than, for example, the English term predicate adjective entails. 10621350 -> 1000006401350: In French and Spanish, for example, adjectives frequently act as nouns referring to people who have the characteristics of the adjective. 10621360 -> 1000006401360: An example in English is: 10621370 -> 1000006401370: The poor you have always with you. 10621380 -> 1000006401380: Similarly, an adjective can also be used for a whole group or organization of people: 10621390 -> 1000006401390: The Socialist International. 10621400 -> 1000006401400: Hence, these words are substantives that are usually adjectives in English. Ontology (information science) 10630010 -> 1000006500020: Ontology (information science) 10630020 -> 1000006500030: In both computer science and information science, an ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts. 10630030 -> 1000006500040: It is used to reason about the properties of that domain, and may be used to define the domain. 10630040 -> 1000006500050: Ontologies are used in artificial intelligence, the Semantic Web, software engineering, biomedical informatics, library science, and information architecture as a form of knowledge representation about the world or some part of it. 10630050 -> 1000006500060: Common components of ontologies include: 10630060 -> 1000006500070: Individuals: instances or objects (the basic or "ground level" objects) 10630070 -> 1000006500080: Classes: sets, collections, concepts or types of objects 10630080 -> 1000006500090: Attributes: properties, features, characteristics, or parameters that objects (and classes) can have 10630090 -> 1000006500100: Relations: ways that classes and objects can be related to one another 10630100 -> 1000006500110: Function terms: complex structures formed from certain relations that can be used in place of an individual term in a statement 10630110 -> 1000006500120: Restrictions: formally stated descriptions of what must be true in order for some assertion to be accepted as input 10630120 -> 1000006500130: Rules: statements in the form of an if-then (antecedent-consequent) sentence that describe the logical inferences that can be drawn from an assertion in a particular form 10630130 -> 1000006500140: Axioms: assertions (including rules) in a logical form that together comprise the overall theory that the ontology describes in its domain of application. 10630140 -> 1000006500150: This definition differs from that of "axioms" in generative grammar and formal logic. 10630150 -> 1000006500160: In these disciplines, axioms include only statements asserted as a priori knowledge. 10630160 -> 1000006500170: As used here, "axioms" also include the theory derived from axiomatic statements. 10630170 -> 1000006500180: Events: the changing of attributes or relations 10630180 -> 1000006500190: Ontologies are commonly encoded using ontology languages. 10630190 -> 1000006500200: Elements 10630200 -> 1000006500210: Contemporary ontologies share many structural similarities, regardless of the language in which they are expressed. 10630210 -> 1000006500220: As mentioned above, most ontologies describe individuals (instances), classes (concepts), attributes, and relations. 
10630220 -> 1000006500230: In this section each of these components is discussed in turn. 10630230 -> 1000006500240: Individuals 10630240 -> 1000006500250: Individuals (instances) are the basic, "ground level" components of an ontology. 10630250 -> 1000006500260: The individuals in an ontology may include concrete objects such as people, animals, tables, automobiles, molecules, and planets, as well as abstract individuals such as numbers and words. 10630260 -> 1000006500270: Strictly speaking, an ontology need not include any individuals, but one of the general purposes of an ontology is to provide a means of classifying individuals, even if those individuals are not explicitly part of the ontology. 10630270 -> 1000006500280: In formal extensional ontologies, only the utterances of words and numbers are considered individuals – the numbers and names themselves are classes. 10630280 -> 1000006500290: In a 4D ontology, an individual is identified by its spatio-temporal extent. 10630290 -> 1000006500300: Examples of formal extensional ontologies are ISO 15926 and the model in development by the IDEAS Group. 10630300 -> 1000006500310: Classes 10630310 -> 1000006500320: Classes – concepts that are also called type, sort, category, and kind – are abstract groups, sets, or collections of objects. 10630320 -> 1000006500330: They may contain individuals, other classes, or a combination of both. 10630330 -> 1000006500340: Some examples of classes: 10630340 -> 1000006500350: Person, the class of all people 10630350 -> 1000006500360: Vehicle, the class of all vehicles 10630360 -> 1000006500370: Car, the class of all cars 10630370 -> 1000006500380: Class, representing the class of all classes 10630380 -> 1000006500390: Thing, representing the class of all things 10630390 -> 1000006500400: Ontologies vary on whether classes can contain other classes, whether a class can belong to itself, whether there is a universal class (that is, a class containing everything), etc. 10630400 -> 1000006500410: Sometimes restrictions along these lines are made in order to avoid certain well-known paradoxes. 10630410 -> 1000006500420: The classes of an ontology may be extensional or intensional in nature. 10630420 -> 1000006500430: A class is extensional if and only if it is characterized solely by its membership. 10630430 -> 1000006500440: More precisely, a class C is extensional if and only if for any class C', if C' has exactly the same members as C, then C and C' are identical. 10630440 -> 1000006500450: If a class does not satisfy this condition, then it is intensional. 10630450 -> 1000006500460: While extensional classes are more well-behaved and well-understood mathematically, as well as less problematic philosophically, they do not permit the fine grained distinctions that ontologies often need to make. 10630460 -> 1000006500470: For example, an ontology may want to distinguish between the class of all creatures with a kidney and the class of all creatures with a heart, even if these classes happen to have exactly the same members. 10630470 -> 1000006500480: In the upper ontologies mentioned above, the classes are defined intensionally. 10630480 -> 1000006500490: Intensionally defined classes usually have necessary conditions associated with membership in each class. 10630490 -> 1000006500500: Some classes may also have sufficient conditions, and in those cases the combination of necessary and sufficient conditions make that class a fully defined class. 
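The extensional/intensional contrast can be sketched in code under a simplifying assumption invented for this illustration: an extensional class is nothing but its member set, while an intensional class keeps its defining condition and derives its members from it.

```python
# Toy data: every creature here happens to have both a kidney and a heart.
creatures = [
    {"name": "dog", "has_kidney": True, "has_heart": True},
    {"name": "cat", "has_kidney": True, "has_heart": True},
]

# Extensional view: the class just *is* its member set, so the two classes collapse.
with_kidney = frozenset(c["name"] for c in creatures if c["has_kidney"])
with_heart  = frozenset(c["name"] for c in creatures if c["has_heart"])
print(with_kidney == with_heart)   # True: the distinction is lost

# Intensional view: the class is its defining condition; membership is derived.
classes = {
    "CreatureWithKidney": lambda c: c["has_kidney"],
    "CreatureWithHeart":  lambda c: c["has_heart"],
}
members = {name: {c["name"] for c in creatures if cond(c)}
           for name, cond in classes.items()}
print(members["CreatureWithKidney"] == members["CreatureWithHeart"])  # True, yet
print(set(classes))                # ...the two classes remain distinct
```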
10630500 -> 1000006500510: Importantly, a class can subsume or be subsumed by other classes; a class subsumed by another is called a subclass of the subsuming class. 10630510 -> 1000006500520: For example, Vehicle subsumes Car, since (necessarily) anything that is a member of the latter class is a member of the former. 10630520 -> 1000006500530: The subsumption relation is used to create a hierarchy of classes, typically with a maximally general class like Thing at the top, and very specific classes like 2002 Ford Explorer at the bottom. 10630530 -> 1000006500540: The critically important consequence of the subsumption relation is the inheritance of properties from the parent (subsuming) class to the child (subsumed) class. 10630540 -> 1000006500550: Thus, anything that is necessarily true of a parent class is also necessarily true of all of its subsumed child classes. 10630550 -> 1000006500560: In some ontologies, a class is only allowed to have one parent (single inheritance), but in most ontologies, classes are allowed to have any number of parents (multiple inheritance), and in the latter case all necessary properties of each parent are inherited by the subsumed child class. 10630560 -> 1000006500570: Thus a particular class of animal (HouseCat) may be a child of the class Cat and also a child of the class Pet. 10630570 -> 1000006500580: A partition is a set of related classes and associated rules that allow objects to be placed into the appropriate class. 10630580 -> 1000006500590: For example, to the right is the partial diagram of an ontology that has a partition of the Car class into the classes 2-Wheel Drive and 4-Wheel Drive. 10630590 -> 1000006500600: The partition rule determines if a particular car is placed in the 2-Wheel Drive or the 4-Wheel Drive class. 10630600 -> 1000006500610: If the partition rule(s) guarantee that a single Car cannot be in both classes, then the partition is called a disjoint partition. 10630610 -> 1000006500620: If the partition rules ensure that every concrete object in the super-class is an instance of at least one of the partition classes, then the partition is called an exhaustive partition. 10630620 -> 1000006500630: Attributes 10630630 -> 1000006500640: Objects in the ontology can be described by assigning attributes to them. 10630640 -> 1000006500650: Each attribute has at least a name and a value, and is used to store information that is specific to the object it is attached to. 10630650 -> 1000006500660: For example the Ford Explorer object has attributes such as: 10630660 -> 1000006500670: Name: Ford Explorer 10630670 -> 1000006500680: Number-of-doors: 4 10630680 -> 1000006500690: Engine: {4.0L, 4.6L} 10630690 -> 1000006500700: Transmission: 6-speed 10630700 -> 1000006500710: The value of an attribute can be a complex data type; in this example, the value of the attribute called Engine is a list of values, not just a single value. 10630710 -> 1000006500720: If you did not define attributes for the concepts you would have either a taxonomy (if hyponym relationships exist between concepts) or a controlled vocabulary. 10630720 -> 1000006500730: These are useful, but are not considered true ontologies. 10630730 -> 1000006500740: Relationships 10630740 -> 1000006500750: An important use of attributes is to describe the relationships (also known as relations) between objects in the ontology. 10630750 -> 1000006500760: Typically a relation is an attribute whose value is another object in the ontology. 
10630760 -> 1000006500770: For example, in the ontology that contains the Ford Explorer and the Ford Bronco, the Ford Bronco object might have the following attribute: 10630770 -> 1000006500780: Successor: Ford Explorer 10630780 -> 1000006500790: This tells us that the Explorer is the model that replaced the Bronco. 10630790 -> 1000006500800: Much of the power of ontologies comes from the ability to describe these relations. 10630800 -> 1000006500810: Together, the set of relations describes the semantics of the domain. 10630810 -> 1000006500820: The most important type of relation is the subsumption relation (is-superclass-of, the converse of is-a, is-subtype-of or is-subclass-of). 10630820 -> 1000006500830: This defines which objects are members of classes of objects. 10630830 -> 1000006500840: For example, we have already seen that the Ford Explorer is-a 4-wheel drive, which in turn is-a Car: 10630840 -> 1000006500850: The addition of the is-a relationships has created a hierarchical taxonomy; a tree-like structure (or, more generally, a partially ordered set) that clearly depicts how objects relate to one another. 10630850 -> 1000006500860: In such a structure, each object is the 'child' of a 'parent class' (some languages restrict the is-a relationship to one parent for all nodes, but many do not). 10630860 -> 1000006500870: Another common type of relation is the meronymy relation, written as part-of, which represents how objects combine to form composite objects. 10630870 -> 1000006500880: For example, if we extended our example ontology to include objects like Steering Wheel, we would say that "Steering Wheel is-part-of Ford Explorer" since a steering wheel is one of the components of a Ford Explorer. 10630880 -> 1000006500890: If we introduce meronymy relationships to our ontology, we find that this simple and elegant tree structure quickly becomes complex and significantly more difficult to interpret manually. 10630890 -> 1000006500900: It is not difficult to understand why; an entity that is described as 'part of' another entity might also be 'part of' a third entity. 10630900 -> 1000006500910: Consequently, entities may have more than one parent. 10630910 -> 1000006500920: The structure that emerges is known as a directed acyclic graph (DAG). 10630920 -> 1000006500930: As well as the standard is-a and part-of relations, ontologies often include additional types of relation that further refine the semantics they model. 10630930 -> 1000006500940: These relations are often domain-specific and are used to answer particular types of question. 10630940 -> 1000006500950: For example, in the domain of automobiles, we might define a made-in relationship which tells us where each car is built. 10630950 -> 1000006500960: So the Ford Explorer is made-in Louisville. 10630960 -> 1000006500970: The ontology may also know that Louisville is-in Kentucky and Kentucky is-a state of the USA. 10630970 -> 1000006500980: Software using this ontology could now answer a question like "which cars are made in the U.S.?" (a small illustrative sketch of such a query appears after the ontology languages below). 10630980 -> 1000006500990: Domain ontologies and upper ontologies 10630990 -> 1000006501000: A domain ontology (or domain-specific ontology) models a specific domain, or part of the world. 10631000 -> 1000006501010: It represents the particular meanings of terms as they apply to that domain. 10631010 -> 1000006501020: For example, the word card has many different meanings.
10631020 -> 1000006501030: An ontology about the domain of poker would model the "playing card" meaning of the word, while an ontology about the domain of computer hardware would model the "punch card" and "video card" meanings. 10631030 -> 1000006501040: An upper ontology (or foundation ontology) is a model of the common objects that are generally applicable across a wide range of domain ontologies. 10631040 -> 1000006501050: It contains a core glossary in whose terms objects in a set of domains can be described. 10631050 -> 1000006501060: There are several standardized upper ontologies available for use, including Dublin Core, GFO, OpenCyc/ResearchCyc, SUMO, and DOLCE. 10631060 -> 1000006501070: WordNet, while considered an upper ontology by some, is not an ontology: it is a unique combination of a taxonomy and a controlled vocabulary (see above, under Attributes). 10631070 -> 1000006501080: The Gellish ontology is an example of a combination of an upper and a domain ontology. 10631080 -> 1000006501090: Since domain ontologies represent concepts in very specific and often eclectic ways, they are often incompatible. 10631090 -> 1000006501100: As systems that rely on domain ontologies expand, they often need to merge domain ontologies into a more general representation. 10631100 -> 1000006501110: This presents a challenge to the ontology designer. 10631110 -> 1000006501120: Different ontologies in the same domain can also arise due to different perceptions of the domain based on cultural background, education, ideology, or because a different representation language was chosen. 10631120 -> 1000006501130: At present, merging ontologies is a largely manual process and therefore time-consuming and expensive. 10631130 -> 1000006501140: Using a foundation ontology to provide a common definition of core terms can make this process manageable. 10631140 -> 1000006501150: There are studies on generalized techniques for merging ontologies, but this area of research is still largely theoretical. 10631150 -> 1000006501160: Ontology languages 10631160 -> 1000006501170: An ontology language is a formal language used to encode the ontology. 10631170 -> 1000006501180: There are a number of such languages for ontologies, both proprietary and standards-based: 10631180 -> 1000006501190: OWL is a language for making ontological statements, developed as a follow-on from RDF and RDFS, as well as earlier ontology language projects including OIL, DAML and DAML+OIL. 10631190 -> 1000006501200: OWL is intended to be used over the World Wide Web, and all its elements (classes, properties and individuals) are defined as RDF resources, and identified by URIs. 10631200 -> 1000006501210: KIF is a syntax for first-order logic that is based on S-expressions. 10631210 -> 1000006501220: The Cyc project has its own ontology language called CycL, based on first-order predicate calculus with some higher-order extensions. 10631220 -> 1000006501230: Rule Interchange Format (RIF) and F-Logic combine ontologies and rules. 10631230 -> 1000006501240: The Gellish language includes rules for its own extension and thus integrates an ontology with an ontology language.
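To connect the relationships described earlier with the ontology languages listed above, the following Python sketch stores the made-in/is-in facts from the automobile example as subject-predicate-object triples, which is the shape of statement that RDF-based languages such as OWL encode, and answers the earlier question "which cars are made in the U.S.?" with a simple hand-written traversal. In practice this would be done by a reasoner or a query language such as SPARQL; the predicate spellings here are invented, and "Kentucky is-a state of the USA" is simplified to an is-in fact.

# The facts from the relationships section, as subject-predicate-object triples.
triples = {
    ("Ford Explorer", "is-a", "4-Wheel Drive"),
    ("4-Wheel Drive", "is-a", "Car"),
    ("Ford Explorer", "made-in", "Louisville"),
    ("Louisville", "is-in", "Kentucky"),
    ("Kentucky", "is-in", "USA"),     # "is-a state of the USA", simplified to is-in
}

def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is asserted."""
    return {o for s, p, o in triples if s == subject and p == predicate}

def located_in(place, region):
    """True if place lies inside region via a chain of is-in facts."""
    return place == region or any(located_in(parent, region)
                                  for parent in objects(place, "is-in"))

def is_a(thing, cls):
    """True if thing belongs to cls directly or through the subsumption chain."""
    direct = objects(thing, "is-a")
    return cls in direct or any(is_a(parent, cls) for parent in direct)

# "Which cars are made in the U.S.?"
print({s for s, p, o in triples
       if p == "made-in" and located_in(o, "USA") and is_a(s, "Car")})
# -> {'Ford Explorer'}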
10631240 -> 1000006501250: Relation to the philosophical term 10631250 -> 1000006501260: The term ontology has its origin in philosophy, where it is the name of one fundamental branch of metaphysics, concerned with analyzing various types or modes of existence, often with special attention to the relations between particulars and universals, between intrinsic and extrinsic properties, and between essence and existence. 10631260 -> 1000006501270: According to Tom Gruber at Stanford University, the meaning of ontology in the context of computer science is “a description of the concepts and relationships that can exist for an agent or a community of agents.” 10631270 -> 1000006501280: He goes on to specify that an ontology is generally written, “as a set of definitions of formal vocabulary.” 10631280 -> 1000006501290: What ontology has in common in both computer science and philosophy is the representation of entities, ideas, and events, along with their properties and relations, according to a system of categories. 10631290 -> 1000006501300: In both fields, one finds considerable work on problems of ontological relativity (e.g. Quine and Kripke in philosophy, Sowa and Guarino in computer science (Top-level ontological categories. 10631310 -> 1000006501310: By: Sowa, John F. 10631320 -> 1000006501320: In International Journal of Human-Computer Studies, v. 43 (November/December 1995) p. 669-85.), and debates concerning whether a normative ontology is viable (e.g. debates over foundationalism in philosophy, debates over the Cyc project in AI). 10631330 -> 1000006501330: Differences between the two are largely matters of focus. 10631340 -> 1000006501340: Philosophers are less concerned with establishing fixed, controlled vocabularies than are researchers in computer science, while computer scientists are less involved in discussions of first principles (such as debating whether there are such things as fixed essences, or whether entities must be ontologically more primary than processes). 10631350 -> 1000006501350: During the second half of the 20th century, philosophers extensively debated the possible methods or approaches to building ontologies, without actually building any very elaborate ontologies themselves. 10631360 -> 1000006501360: By contrast, computer scientists were building some large and robust ontologies (such as WordNet and Cyc) with comparatively little debate over how they were built. 10631370 -> 1000006501370: In the early years of the 21st century, the interdisciplinary project of cognitive science has been bringing the two circles of scholars closer together. 10631380 -> 1000006501380: For example, there is talk of a "computational turn in philosophy" which includes philosophers analyzing the formal ontologies of computer science (sometimes even working directly with the software), while researchers in computer science have been making more references to those philosophers who work on ontology (sometimes with direct consequences for their methods). 10631390 -> 1000006501390: Still, many scholars in both fields are uninvolved in this trend of cognitive science, and continue to work independently of one another, pursuing separately their different concerns. 10631400 -> 1000006501400: Resources 10631410 -> None: Examples of published ontologies 10631420 -> None: Dublin Core, a simple ontology for documents and publishing. 10631430 -> None: Cyc for formal representation of the universe of discourse. 
10631440 -> None: Suggested Upper Merged Ontology, which is a formal upper ontology 10631450 -> None: Basic Formal Ontology (BFO), a formal upper ontology designed to support scientific research 10631460 -> None: Gellish English dictionary, an ontology that includes a dictionary and taxonomy, comprising an upper ontology and a lower ontology that focuses on industrial and business applications in engineering, technology and procurement. 10631470 -> None: Generalized Upper Model, a linguistically motivated ontology for mediating between client systems and natural language technology 10631480 -> None: WordNet, a lexical reference system 10631490 -> None: OBO Foundry: a suite of interoperable reference ontologies in biomedicine. 10631500 -> None: The Ontology for Biomedical Investigations is an open-access, integrated ontology for the description of biological and clinical investigations. 10631510 -> None: COSMO: An OWL ontology that is a merger of the basic elements of the OpenCyc and SUMO ontologies, with additional elements. 10631520 -> None: Gene Ontology for genomics 10631530 -> None: PRO, the Protein Ontology of the Protein Information Resource, Georgetown University. 10631540 -> None: Protein Ontology for proteomics 10631550 -> None: Foundational Model of Anatomy for human anatomy 10631560 -> None: SBO, the Systems Biology Ontology, for computational models in biology 10631570 -> None: Plant Ontology for plant structures and growth/development stages, etc. 10631580 -> None: CIDOC CRM (Conceptual Reference Model) - an ontology for "cultural heritage information". 10631590 -> None: GOLD (General Ontology for Linguistic Description) 10631600 -> None: Linkbase, a formal representation of the biomedical domain, founded upon Basic Formal Ontology (BFO). 10631610 -> None: Foundational, Core and Linguistic Ontologies 10631620 -> None: ThoughtTreasure ontology 10631630 -> None: LPL Lawson Pattern Language 10631640 -> None: TIME-ITEM Topics for Indexing Medical Education 10631650 -> None: POPE Purdue Ontology for Pharmaceutical Engineering 10631660 -> None: IDEAS Group, a formal ontology for enterprise architecture being developed by the Australian, Canadian, UK and U.S. Defence Departments. 10631670 -> None: program abstraction taxonomy 10631680 -> None: SWEET Semantic Web for Earth and Environmental Terminology 10631690 -> None: CCO, the Cell-Cycle Ontology, an application ontology that represents the cell cycle. 10631700 -> 1000006501410: Ontology libraries 10631710 -> 1000006501420: The development of ontologies for the Web has led to the emergence of services providing lists or directories of ontologies with a search facility. 10631720 -> 1000006501430: Such directories have been called ontology libraries. 10631730 -> 1000006501440: The following are static libraries of human-selected ontologies. 10631740 -> 1000006501450: The DAML Ontology Library maintains a legacy of ontologies in DAML. 10631750 -> 1000006501460: The Protege Ontology Library contains a set of OWL, frame-based and other-format ontologies. 10631760 -> 1000006501470: SchemaWeb is a directory of RDF schemata expressed in RDFS, OWL and DAML+OIL. 10631770 -> 1000006501480: The following are both directories and search engines. 10631780 -> 1000006501490: They include crawlers searching the Web for well-formed ontologies. 10631790 -> 1000006501500: Swoogle is a directory and search engine for all RDF resources available on the Web, including ontologies.
10631800 -> 1000006501510: The OntoSelect Ontology Library offers similar services for RDF/S, DAML and OWL ontologies. 10631810 -> 1000006501520: Ontaria is a "searchable and browsable directory of semantic web data", with a focus on RDF vocabularies with OWL ontologies. 10631820 -> 1000006501530: The OBO Foundry / Bioportal is a suite of interoperable reference ontologies in biology and biomedicine. OpenOffice.org 10640010 -> 1000006600020: OpenOffice.org 10640020 -> 1000006600030: OpenOffice.org (OO.o or OOo) is a cross-platform office application suite available for a number of different computer operating systems. 10640030 -> 1000006600040: It supports the ISO standard OpenDocument Format (ODF) for data interchange as its default file format, as well as Microsoft Office 97–2003 formats and the Microsoft Office 2007 format (in version 3), among many others. 10640040 -> 1000006600050: OpenOffice.org was originally derived from StarOffice, an office suite developed by StarDivision and acquired by Sun Microsystems in August 1999. 10640050 -> 1000006600060: The source code of the suite was released in July 2000 with the aim of reducing the dominant market share of Microsoft Office by providing a free, open and high-quality alternative; later versions of StarOffice are based upon OpenOffice.org with additional proprietary components. 10640060 -> 1000006600070: OpenOffice.org is free software, available under the GNU Lesser General Public License (LGPL). 10640070 -> 1000006600080: The project and software are informally referred to as OpenOffice, but this term is a trademark held by another party, requiring the project to adopt OpenOffice.org as its formal name. 10640080 -> 1000006600090: History 10640090 -> 1000006600100: Originally developed as the proprietary software application suite StarOffice by the German company StarDivision, the code was purchased in 1999 by Sun Microsystems. 10640100 -> 1000006600110: In August 1999, version 5.2 of StarOffice was made available free of charge. 10640110 -> 1000006600120: On July 19, 2000, Sun Microsystems announced that it was making the source code of StarOffice available for download under both the LGPL and the Sun Industry Standards Source License (SISSL) with the intention of building an open source development community around the software. 10640120 -> 1000006600130: The new project was known as OpenOffice.org, and its website went live on October 13, 2000. 10640130 -> 1000006600140: Work on version 2.0 began in early 2003 with the following goals: better interoperability with Microsoft Office; better performance, with improved speed and lower memory usage; greater scripting capabilities; better integration, particularly with GNOME; an easier-to-find and use database front-end for creating reports, forms and queries; a new built-in SQL database; and improved usability. 10640140 -> 1000006600150: A beta version was released on March 4, 2005. 10640150 -> 1000006600160: On September 2, 2005, Sun announced that it was retiring the SISSL. 10640160 -> 1000006600170: As a consequence, the OpenOffice.org Community Council announced that it would no longer dual-license the office suite, and future versions would use only the LGPL. 10640170 -> 1000006600180: On October 20, 2005, OpenOffice.org 2.0 was formally released to the public. 10640180 -> 1000006600190: Eight weeks after the release of Version 2.0, an update, OpenOffice.org 2.0.1, was released. 10640190 -> 1000006600200: It fixed minor bugs and introduced new features.
10640200 -> 1000006600210: As of the 2.0.3 release, OpenOffice.org changed its release cycle from 18-months to releasing updates, feature enhancements and bug fixes every three months. 10640210 -> 1000006600220: Currently, new versions including new features are released every six months (so-called "feature releases") alternating with so-called "bug fix releases" which are being released between two feature releases (Every 3 months). 10640220 -> 1000006600230: StarOffice 10640230 -> 1000006600240: Sun subsidizes the development of OpenOffice.org in order to use it as a base for its commercial proprietary StarOffice application software. 10640240 -> 1000006600250: Releases of StarOffice since version 6.0 have been based on the OpenOffice.org source code, with some additional proprietary components, including: 10640250 -> 1000006600260: Additional bundled fonts (especially East Asian language fonts). 10640260 -> 1000006600270: Adabas D database. 10640270 -> 1000006600280: Additional document templates. 10640280 -> 1000006600290: Clip art. 10640290 -> 1000006600300: Sorting functionality for Asian versions. 10640300 -> 1000006600310: Additional file filters. 10640310 -> 1000006600320: Migration assessment tool (Enterprise Edition). 10640320 -> 1000006600330: Macro migration tool (Enterprise Edition). 10640330 -> 1000006600340: Configuration management tool (Enterprise Edition). 10640340 -> 1000006600350: OpenOffice.org, therefore, inherited many features from the original StarOffice upon which it was based including the OpenOffice.org XML file format which it retained until version 2, when it was replaced by the ISO standard OpenDocument Format (ODF). 10640350 -> 1000006600360: Features 10640360 -> 1000006600370: According to its mission statement, the OpenOffice.org project aims "To create, as a community, the leading international office suite that will run on all major platforms and provide access to all functionality and data through open-component based APIs and an XML-based file format." 10640370 -> 1000006600380: OpenOffice.org aims to compete with Microsoft Office and emulate its look and feel where suitable. 10640380 -> 1000006600390: It can read and write most of the file formats found in Microsoft Office, and many other applications; an essential feature of the suite for many users. 10640390 -> 1000006600400: OpenOffice.org has been found to be able to open files of older versions of Microsoft Office and damaged files that newer versions of Microsoft Office itself cannot open. 10640400 -> 1000006600410: However, it cannot open older Word for Macintosh (MCW) files. 10640410 -> 1000006600420: Platforms 10640420 -> 1000006600430: Platforms for which OO.o is available include Microsoft Windows, Linux, Solaris, BSD, OpenVMS, OS/2 and IRIX. 10640430 -> 1000006600440: The current primary development platforms are Microsoft Windows, Linux and Solaris. 10640440 -> 1000006600450: A port for Mac OS X exists for OS X machines which have the X Window System component installed. 10640450 -> 1000006600460: A port to OS X's native Aqua user interface is in progress, and is scheduled for completion for the 3.0 milestone. 10640460 -> 1000006600470: NeoOffice is an independent fork of OpenOffice, specially adapted for Mac OS X. 
10640470 -> None: Version compatibility 10640480 -> None: Windows 95: up to v1.1.5 10640490 -> None: Windows 98-Vista: up to v2.4, development releases of v3.0 10640500 -> None: Mac OS 10.2: up to v1.1.2 10640510 -> None: Mac OS 10.3: up to v2.1 10640520 -> None: Mac OS 10.4-10.5: up to v2.4, development releases of v3.0 (Intel only) 10640530 -> None: OS/2 and eComStation: up to v2.0.4 10640540 -> 1000006600480: Components 10640550 -> 1000006600490: OpenOffice.org is a collection of applications that work together closely to provide the features expected from a modern office suite. 10640560 -> 1000006600500: Many of the components are designed to mirror those available in Microsoft Office. 10640570 -> 1000006600510: The components available include: 10640580 -> 1000006600520: QuickStarter 10640590 -> 1000006600530: A small program for Windows and Linux that runs when the computer starts. 10640600 -> 1000006600540: It loads the core files and libraries for OpenOffice.org during computer startup and allows the suite applications to start more quickly when selected later. 10640610 -> 1000006600550: The amount of time it takes to open OpenOffice.org applications was a common complaint in version 1.0 of the suite. 10640620 -> 1000006600560: Substantial improvements were made in this area for version 2.2. 10640630 -> 1000006600570: The macro recorder 10640640 -> 1000006600580: It is used to record user actions and replay them later to help with automating tasks, using OpenOffice.org Basic (see below). 10640650 -> 1000006600590: It is not possible to download these components individually on Windows, though they can be installed separately. 10640660 -> 1000006600600: Most Linux distributions break the components into individual packages which may be downloaded and installed separately. 10640670 -> 1000006600610: OpenOffice.org Basic 10640680 -> 1000006600620: OpenOffice.org Basic is a programming language similar to Microsoft Visual Basic for Applications (VBA) based on StarOffice Basic. 10640690 -> 1000006600630: In addition to the macros, the upcoming Novell edition of OpenOffice.org 2.0 supports running Microsoft VBA macros, a feature expected to be incorporated into the mainstream version soon. 10640700 -> 1000006600640: OpenOffice.org Basic is available in the Writer and Calc applications. 10640710 -> 1000006600650: It is written in functions called subroutines or macros, with each macro performing a different task, such as counting the words in a paragraph. 10640720 -> 1000006600660: OpenOffice.org Basic is especially useful in doing repetitive tasks that have not been integrated in the program. 10640730 -> 1000006600670: As the OpenOffice.org database, called "Base", uses documents created under the Writer application for reports and forms, one could say that Base can also be programmed with OpenOffice.org Basic. 10640740 -> 1000006600680: File formats 10640750 -> 1000006600690: OpenOffice.org pioneered the ISO/IEC standard OpenDocument file formats (ODF), which it uses natively by default. 10640760 -> 1000006600700: It also supports reading (and in some cases writing) a large number of legacy proprietary file formats (e.g. WordPerfect through libwpd, StarOffice, Lotus software, MS Works through libwps, and Rich Text Format), most notably the Microsoft Office formats. The OpenDocument specification was subsequently "approved for release as an ISO and IEC International Standard" under the name ISO/IEC 26300:2006.
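Since the ODF formats just mentioned are ordinary ZIP archives of XML parts, their structure can be inspected with nothing more than the standard library; the short Python sketch below lists the parts of a document and reads its main body ("example.odt" is a placeholder filename, not a file shipped with the suite).

import zipfile

# An OpenDocument file is a ZIP archive of XML parts (mimetype, content.xml,
# styles.xml, meta.xml, ...), which is also why the files on disk stay small.
# "example.odt" is a placeholder for any ODF text document.
with zipfile.ZipFile("example.odt") as odf:
    print(odf.namelist())                            # the parts of the package
    body = odf.read("content.xml").decode("utf-8")   # the document body as XML
    print(body[:200])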
10640770 -> 1000006600710: Microsoft Office interoperability 10640780 -> 1000006600720: In response to Microsoft's recent movement towards using the Office Open XML format in Microsoft Office 2007, Novell has released an Office Open XML converter for OOo under a liberal BSD license (along with GNU GPL and LGPL licensed libraries), that will be submitted for inclusion into the OpenOffice.org project. 10640790 -> 1000006600730: This allows OOo to read and write Microsoft OpenXML-formatted word processing documents (.docx) in OpenOffice.org. 10640800 -> 1000006600740: Currently it works only with the latest Novell edition of OpenOffice.org. 10640810 -> 1000006600750: Sun Microsystems has developed an ODF plugin for Microsoft Office which enables users of Microsoft Office Word, Excel and PowerPoint to read and write ODF documents. 10640820 -> 1000006600760: The plugin currently works with Microsoft Office 2003, Microsoft Office XP and Microsoft Office 2000. 10640830 -> 1000006600770: Support for Microsoft Office 2007 is only available in combination with Microsoft Office 2007 SP1. 10640840 -> 1000006600780: Several software companies (including Microsoft and Novell) are working on an add-in for Microsoft Office that allows reading and writing ODF files. 10640850 -> 1000006600790: Currently it works only for Microsoft Word 2007 / XP / 2003. 10640860 -> 1000006600800: Microsoft provides a compatibility pack to read and write Office Open XML files with Office 2000, XP and 2003. 10640870 -> 1000006600810: The compatibility pack can also be used as a stand-alone converter with Microsoft Office 97. 10640880 -> 1000006600820: This might be helpful to convert older Microsoft Office files via Office Open XML to ODF if a direct conversion doesn't work as expected. 10640890 -> 1000006600830: The Office compatibility pack however does not install for Office 2000 or Office XP on Windows 9x. 10640900 -> 1000006600840: Note that some office applications built with Microsoft components may refuse to import OpenOffice data. 10640910 -> 1000006600850: The Sage Group's Simply Accounting, for example, can import Excel's .xls files, but refuses to accept OpenOffice.org-generated .xls files for the reason that the OOo .xls files are not "genuine Microsoft" .xls files. 10640920 -> 1000006600860: Development 10640930 -> 1000006600870: Overview 10640940 -> 1000006600880: The OpenOffice.org API is based on a component technology known as Universal Network Objects (UNO). 10640950 -> 1000006600890: It consists of a wide range of interfaces defined in a CORBA-like interface description language. 10640960 -> 1000006600900: The document file format used is based on XML and several export and import filters. 10640970 -> 1000006600910: All external formats read by OpenOffice.org are converted back and forth from an internal XML representation. 10640980 -> 1000006600920: By using compression when saving XML to disk, files are generally smaller than the equivalent binary Microsoft Office documents. 10640990 -> 1000006600930: The native file format for storing documents in version 1.0 was used as the basis of the OASIS OpenDocument file format standard, which has become the default file format in version 2.0. 10641000 -> 1000006600940: Development versions of the suite are released every few weeks on the developer zone of the OpenOffice.org website. 10641010 -> 1000006600950: The releases are meant for those who wish to test new features or are simply curious about forthcoming changes; they are not suitable for production use. 
10641020 -> 1000006600960: Native desktop integration 10641030 -> 1000006600970: OpenOffice.org 1.0 was criticized for not having the look and feel of applications developed natively for the platforms on which it runs. 10641040 -> 1000006600980: Starting with version 2.0, OpenOffice.org uses native widget toolkits, icons, and font-rendering libraries across a variety of platforms, to better match native applications and provide a smoother experience for the user. 10641050 -> 1000006600990: There are projects underway to further improve this integration on both GNOME and KDE. 10641060 -> 1000006601000: This issue has been particularly pronounced on Mac OS X, whose standard user interface looks noticeably different from either Windows or X11-based desktop environments and requires the use of programming toolkits unfamiliar to most OpenOffice.org developers. 10641070 -> 1000006601010: There are two implementations of OpenOffice.org available for OS X: 10641080 -> 1000006601020: OpenOffice.org Mac OS X (X11) 10641090 -> 1000006601030: This official implementation requires the installation of X11.app or XDarwin, and is a close port of the well-tested Unix version. 10641100 -> 1000006601040: It is functionally equivalent to the Unix version, and its user interface resembles the look and feel of that version; for example, the application uses its own menu bar instead of the OS X menu at the top of the screen. 10641110 -> 1000006601050: It also requires system fonts to be converted to X11 format for OpenOffice.org to use them (which can be done during application installation). 10641120 -> 1000006601060: OpenOffice.org Aqua 10641130 -> 1000006601070: After a first step (completed) using Carbon, OpenOffice.org Aqua switched to Cocoa technology, and an Aqua version (based on Cocoa) is also being developed under the aegis of OpenOffice.org, with a Beta version currently available. 10641140 -> 1000006601080: Sun Microsystems is collaborating with OOo to further the development of the Aqua version of OpenOffice.org for Mac. 10641150 -> 1000006601090: Future 10641160 -> 1000006601100: Currently, a development preview of OpenOffice.org 3 (OOo-dev 3.0) is available for download. 10641170 -> 1000006601110: Among the planned features for OOo 3.0, set to be released by September 2008, are: 10641180 -> 1000006601120: Personal Information Manager (PIM), probably based on Thunderbird/Lightning 10641190 -> 1000006601130: PDF import into Draw (to maintain correct layout of the original PDF) 10641200 -> 1000006601140: OOXML document support for opening documents created in Office 2007 10641210 -> 1000006601150: Support for the Mac OS X Aqua platform 10641220 -> 1000006601160: Extensions, to add third-party functionality. 10641230 -> 1000006601170: Presenter screen in Impress with multi-screen support 10641240 -> 1000006601180: Other projects 10641250 -> 1000006601190: A number of products are derived from OpenOffice.org. 10641260 -> 1000006601200: Among the better-known ones are Sun StarOffice and NeoOffice. 10641270 -> 1000006601210: The OpenOffice.org site also lists a large variety of complementary products including groupware solutions. 10641280 -> 1000006601220: NeoOffice 10641290 -> 1000006601230: NeoOffice is an independent port that integrates with OS X’s Aqua user interface using Java, Carbon and (increasingly) Cocoa toolkits. 10641300 -> 1000006601240: NeoOffice adheres fairly closely to OS X UI standards (for example, using native pull-down menus), and has direct access to OS X’s installed fonts and printers.
10641310 -> 1000006601250: Its releases lag behind the official OpenOffice.org X11 releases, due to its small development team and the concurrent development of the technology used to port the user interface. 10641320 -> 1000006601260: Other projects run alongside the main OpenOffice.org project and are easier to contribute to. 10641330 -> 1000006601270: These include documentation, internationalisation and localisation, and the API. 10641340 -> 1000006601280: OpenGroupware.org 10641350 -> 1000006601290: OpenGroupware.org is a set of extension programs to allow the sharing of OpenOffice.org documents, calendars, address books, e-mails, instant messaging and blackboards, and to provide access to other groupware applications. 10641360 -> 1000006601300: There is also an effort to create and share assorted document templates and other useful additions at OOExtras. 10641370 -> 1000006601310: A set of Perl extensions is available through the CPAN to allow OpenOffice.org document processing by external programs. 10641380 -> 1000006601320: These libraries do not use the OpenOffice.org API. 10641390 -> 1000006601330: They directly read or write the OpenOffice.org files using Perl standard file compression/decompression, XML access and UTF-8 encoding modules. 10641400 -> 1000006601340: Portable 10641410 -> 1000006601350: A distribution of OpenOffice.org called OpenOffice.org Portable is designed to run the suite from a USB flash drive. 10641420 -> 1000006601360: OxygenOffice Professional 10641430 -> 1000006601370: An enhancement of OpenOffice.org (current version: 2.4), providing: 10641440 -> 1000006601380: The ability to run Visual Basic for Applications (VBA) macros in Calc (for testing) 10641450 -> 1000006601390: Improved Calc HTML export 10641460 -> 1000006601400: Enhanced Access support for Base 10641470 -> 1000006601410: Security fixes 10641480 -> 1000006601420: Enhanced performance 10641490 -> 1000006601430: Enhanced color palette 10641500 -> 1000006601440: Enhanced help menu, additional User’s Manual, and extended tips for beginners 10641510 -> 1000006601450: Optionally it provides, free for personal and professional use: 10641520 -> 1000006601460: More than 3,200 graphics, both clip art and photos. 10641530 -> 1000006601470: Several templates and sample documents 10641540 -> 1000006601480: Over 90 free fonts. 10641550 -> 1000006601490: Additional tools like OOoWikipedia 10641560 -> 1000006601500: Extensions 10641570 -> 1000006601510: Since version 2.0.4, OpenOffice.org has supported extensions in a similar manner to Mozilla Firefox. 10641580 -> 1000006601520: Extensions make it easy to add new functionality to an existing OpenOffice.org installation. 10641590 -> 1000006601530: The OpenOffice.org Extension Repository already lists more than 80 extensions. 10641600 -> 1000006601540: Developers can easily build new extensions for OpenOffice.org, for example by using the OpenOffice.org API Plugin for NetBeans. 10641610 -> 1000006601550: The OpenOffice.org Bibliographic Project 10641620 -> 1000006601560: This aims to incorporate powerful reference management software into the suite. 10641630 -> 1000006601570: The new major addition is slated for inclusion with the standard OpenOffice.org release in late 2007 to mid-2008, or possibly later depending upon the availability of programmers.
10641640 -> 1000006601580: Security 10641650 -> 1000006601590: OpenOffice.org includes a security team, and as of June 2008, the security organization Secunia reports no known unpatched security flaws for the software. 10641660 -> 1000006601600: Kaspersky Lab has shown a proof-of-concept virus for OpenOffice.org. 10641670 -> 1000006601610: This shows that OOo viruses are possible, but there is no known virus "in the wild". 10641680 -> 1000006601620: In a private meeting of the French Ministry of Defense, macro-related security issues were raised. 10641690 -> 1000006601630: OpenOffice.org developers have responded and noted that the supposed vulnerability had not been announced through "well defined procedures" for disclosure and that the ministry had revealed nothing specific. 10641700 -> 1000006601640: However, the developers have been in talks with the researcher concerning the supposed vulnerability. 10641710 -> 1000006601650: As with Microsoft Word, documents created in OpenOffice can contain metadata which may include a complete history of what was changed, when and by whom. 10641720 -> 1000006601660: Ownership 10641730 -> 1000006601670: The project and software are informally referred to as OpenOffice, but project organizers report that this term is a trademark held by another party, requiring the project to adopt OpenOffice.org as its formal name. 10641740 -> 1000006601680: (Due to a similar trademark issue, the Brazilian Portuguese version of the suite is distributed under the name BrOffice.org.) 10641750 -> 1000006601690: Development is managed by staff members of StarOffice. 10641760 -> 1000006601700: Some delay and difficulty in implementing external contributions to the core codebase (even those from the project's corporate sponsors) have been noted. 10641770 -> 1000006601710: Currently, there are several derived and/or proprietary works based on OOo, with some of them being: 10641780 -> 1000006601720: Sun Microsystems' StarOffice, with various complementary add-ons. 10641790 -> 1000006601730: IBM's Lotus Symphony, with a new interface based on Eclipse (based on OO.o 1.x). 10641800 -> 1000006601740: OpenOffice.org Novell edition, integrated with Evolution and with an OOXML filter. 10641810 -> 1000006601750: Beijing Redflag Chinese 2000's RedOffice, fully localized in Chinese characters. 10641820 -> 1000006601760: Planamesa's NeoOffice for Mac OS X with Aqua support via Java. 10641830 -> 1000006601770: On May 23, 2007, the OpenOffice.org community and Redflag Chinese 2000 Software Co., Ltd. announced a joint development effort focused on integrating the new features that have been added in the RedOffice localization of OpenOffice.org, as well as quality assurance and work on the core applications. 10641840 -> 1000006601780: Additionally, Redflag Chinese 2000 made public its commitment to the global OO.o community, stating it would "strengthen its support of the development of the world's leading free and open source productivity suite", adding around 50 engineers (who have been working on RedOffice since 2006) to the project. 10641850 -> 1000006601790: On September 10, 2007, the OO.o community announced that IBM had joined to support the development of OpenOffice.org. 10641860 -> 1000006601800: "IBM will be making initial code contributions that it has been developing as part of its Lotus Notes product, including accessibility enhancements, and will be making ongoing contributions to the feature richness and code quality of OpenOffice.org.
10641870 -> 1000006601810: Besides working with the community on the free productivity suite's software, IBM will also leverage OpenOffice.org technology in its products" as has been seen with Lotus Symphony. 10641880 -> 1000006601820: Sean Poulley, the vice president of business and strategy in IBM's Lotus Software division, said that IBM plans to take a leadership role in the OpenOffice.org community together with other companies such as Sun Microsystems. 10641890 -> 1000006601830: IBM will work within the leadership structure that exists. 10641900 -> 1000006601840: On October 2, 2007, Michael Meeks announced a derived OpenOffice.org work (prompting responses from Sun's Simon Phipps and Mathias Bauer), under the wing of his employer Novell, with the purpose of including new features and fixes that do not get easily integrated into the upstream OOo-build core. 10641910 -> 1000006601850: The work is called Go-OO (http://go-oo.org/), a name under which alternative OO.o software has been available for five years. 10641920 -> 1000006601860: The new features are shared with Novell's edition of OOo and include: 10641930 -> 1000006601870: VBA macro support. 10641940 -> 1000006601880: Faster start-up time. 10641950 -> 1000006601890: "A linear optimization solver to optimize a cell value based on arbitrary constraints built into Calc". 10641960 -> 1000006601900: Support for multimedia content in documents, using the GStreamer multimedia framework. 10641970 -> 1000006601910: Import support for Microsoft Works formats, WordPerfect graphics (WPG format) and T602 files. 10641980 -> 1000006601920: Details about the patch handling, including metrics, can be found on the OpenOffice.org site. 10641990 -> 1000006601930: Reactions 10642000 -> 1000006601940: A Federal Computer Week issue listed OpenOffice.org as one of the "5 stars of open-source products." 10642010 -> 1000006601950: In contrast, OpenOffice.org was used in 2005 by The Guardian newspaper to illustrate what it claims are the limitations of open-source software, although the article does finish by stating that the software may be better than MS Word for books. 10642020 -> 1000006601960: Market share 10642030 -> 1000006601970: It is extremely difficult to estimate the market share of OpenOffice.org because OpenOffice.org can be freely distributed via download sites including mirrors, peer-to-peer networks, CDs, Linux distros, etc. 10642040 -> 1000006601980: Nevertheless, the OpenOffice.org project tries to capture key adoption data in a market share analysis. 10642050 -> 1000006601990: Although Microsoft Office retains 95% of the general market as measured by revenue, OpenOffice.org and StarOffice have secured 14% of the large enterprise market as of 2004 and 19% of the small to midsize business market in 2005. 10642060 -> 1000006602000: The OpenOffice.org web site reports more than 98 million downloads. 10642070 -> 1000006602010: Other large-scale users of OpenOffice.org include Singapore’s Ministry of Defence and Bristol City Council in the UK. 10642080 -> 1000006602020: In France, OpenOffice.org has attracted the attention of both local and national government administrations who wish to rationalize their software procurement, as well as have stable, standard file formats for archival purposes. 10642090 -> 1000006602030: It is now the official office suite for the French Gendarmerie.
10642100 -> 1000006602040: Several government organizations in India that use Linux, such as IIT Bombay (a renowned technical institute), the Supreme Court of India, and the Allahabad High Court, rely completely on OpenOffice.org for their administration. 10642110 -> 1000006602050: On October 4, 2005, Sun and Google announced a strategic partnership. 10642120 -> 1000006602060: As part of this agreement, Sun will add a Google search bar to OpenOffice.org, Sun and Google will engage in joint marketing activities as well as joint research and development, and Google will help distribute OpenOffice.org. 10642130 -> 1000006602070: Google is currently distributing StarOffice as part of the Google Pack. 10642140 -> 1000006602080: Besides StarOffice, there are still a number of OpenOffice.org-derived commercial products. 10642150 -> 1000006602090: Most of them are developed under the SISSL license (which is valid up to OpenOffice.org 2.0 Beta 2). 10642160 -> 1000006602100: In general they are targeted at local or niche markets, with proprietary add-ons such as a speech recognition module, automatic database connections, or better CJK support. 10642170 -> 1000006602110: In July 2007, Everex, a division of First International Computer and the 9th largest PC supplier in the U.S., began shipping systems preloaded with OpenOffice.org 2.2 into Wal-Mart and Sam's Club throughout North America. 10642180 -> 1000006602120: In September 2007, IBM announced that it would supply and support OpenOffice.org branded as Lotus Symphony, and integrated into Lotus Notes. 10642190 -> 1000006602130: IBM also announced that 35 developers would be assigned to work on OpenOffice.org, and that it would join the OpenOffice.org foundation. 10642200 -> 1000006602140: Commentators noted parallels between IBM's 2000 support of Linux and this announcement. 10642210 -> 1000006602150: Java controversy 10642220 -> 1000006602160: In the past, OpenOffice.org was criticized for an increasing dependency on the Java Runtime Environment, which was not free software. 10642230 -> 1000006602170: That Sun Microsystems is both the creator of Java and the chief supporter of OpenOffice.org drew accusations of ulterior motives for this technology choice. 10642240 -> 1000006602180: Version 1 depended on the Java Runtime Environment (JRE) being present on the user’s computer for some auxiliary functions, but version 2 increased the suite’s use of Java, requiring a JRE. 10642250 -> 1000006602190: In response, Red Hat increased their efforts to improve free Java implementations. 10642260 -> 1000006602200: Red Hat’s Fedora Core 4 (released on June 13, 2005) included a beta version of OpenOffice.org version 2, running on GCJ and GNU Classpath. 10642270 -> 1000006602210: The issue of OpenOffice.org’s use of Java came to the fore in May 2005, when Richard Stallman appeared to call for a fork of the application in a posting on the Free Software Foundation website. 10642280 -> 1000006602220: This led to discussions within the OpenOffice.org community and between Sun staff and developers involved in GNU Classpath, a free replacement for Sun’s Java implementation. 10642290 -> 1000006602230: Later that year, the OpenOffice.org developers also placed into their development guidelines various requirements to ensure that future versions of OpenOffice.org could be run on free implementations of Java and fixed the issues which previously prevented OpenOffice.org 2.0 from using free software Java implementations.
10642300 -> 1000006602240: On November 13, 2006, Sun committed to releasing Java under the GNU General Public License in the near future. 10642310 -> 1000006602250: This process would end OpenOffice.org's dependence on non-free software. 10642320 -> 1000006602260: Between November 2006 and May 2007, Sun Microsystems made available most of their Java technologies under the GNU General Public License, in compliance with the specifications of the Java Community Process, thus making almost all of Sun's Java also free software. 10642330 -> 1000006602270: The following areas of OpenOffice.org 2.0 depend on the JRE being present: 10642340 -> 1000006602280: The media player on Unix-like systems 10642350 -> 1000006602290: All document wizards in Writer 10642360 -> 1000006602300: Accessibility tools 10642370 -> 1000006602310: Report Autopilot 10642380 -> 1000006602320: JDBC driver support 10642390 -> 1000006602330: HSQL database engine, which is used in OpenOffice.org Base 10642400 -> 1000006602340: XSLT filters 10642410 -> 1000006602350: BeanShell, the NetBeans scripting language and the Java UNO bridge 10642420 -> 1000006602360: Export filters to the Aportis.doc (.pdb) format for the Palm OS or Pocket Word (.psw) format for the Pocket PC 10642430 -> 1000006602370: Export filter to LaTeX 10642440 -> 1000006602380: Export filter to MediaWiki's wikitext 10642450 -> 1000006602390: A common point of confusion is that mail merge to generate emails requires the Java API JavaMail in StarOffice; however, as of version 2.0.1, OpenOffice.org uses a Python component instead. 10642460 -> 1000006602400: Complementary software 10642470 -> 1000006602410: OpenOffice.org provides replacements for MS Office's Microsoft Word, Microsoft Excel, Microsoft PowerPoint, Microsoft Access, Microsoft Equation Editor and Microsoft Visio. 10642480 -> 1000006602420: To match the functionality of the rest of MS Office, OOo can be complemented with other open-source programs such as: 10642490 -> 1000006602430: Evolution or Thunderbird/Lightning for a PIM like Microsoft Outlook. 10642500 -> 1000006602440: OpenProj (which seeks integration with OOo, but might be limited due to licensing issues) for Microsoft Project. 10642510 -> 1000006602450: Scribus for Microsoft Publisher 10642520 -> 1000006602460: O3spaces for SharePoint 10642530 -> 1000006602470: Microsoft also provides Administrative Template Files ("adm files") that allow MS Office to be configured using Windows Group Policy. 10642540 -> 1000006602480: Equivalent functionality for OpenOffice.org is provided by OpenOffice-Enterprise, a commercial product from Open Office Technology, Inc. 10642550 -> 1000006602490: Issues 10642560 -> 1000006602500: OpenOffice.org has been criticized for slow start times and extensive CPU and RAM usage in comparison to competing software such as Microsoft Office. 10642570 -> 1000006602510: Tests comparing OpenOffice.org 2.2 and Microsoft Office 2007 have found that OpenOffice.org takes approximately 2 times the processing time and memory to load itself along with a blank file, and approximately 4.7 times the processing time and 3.9 times the memory to open an extremely large spreadsheet file. 10642580 -> 1000006602520: Critics have pointed to excessive code bloat and OpenOffice.org's loading of the Java Runtime Environment as possible reasons for the slow speeds and excessive memory usage.
10642590 -> 1000006602530: However, since version 2.2, the performance of OpenOffice.org has improved dramatically. 10642600 -> 1000006602540: One of the greatest challenges is its ability to be truly cross-compatible with other applications. 10642610 -> 1000006602550: Since OpenOffice.org is forced to reverse-engineer proprietary binary formats because open specifications are unavailable, slight formatting incompatibilities tend to exist when files are saved in a non-native format. 10642620 -> 1000006602560: For example, a complex .doc document formatted under OpenOffice.org is usually not displayed with the correct formatting when opened with Microsoft Office. 10642630 -> 1000006602570: Retail 10642640 -> 1000006602580: The free software license under which OpenOffice.org is distributed allows unlimited use of the software for both home and business use, including unlimited redistribution of the software. 10642650 -> 1000006602590: Several businesses sell the OpenOffice.org suite on auction websites such as eBay, offering value-added services such as 24/7 technical support, download mirrors, and CD mailing. 10642660 -> 1000006602600: However, the 24/7 support offered is often provided not by the company selling the software, but rather by the official OpenOffice.org mailing list. Parsing 10650010 -> 1000006700020: Parsing 10650020 -> 1000006700030: In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a sequence of tokens to determine grammatical structure with respect to a given (more or less) formal grammar. 10650030 -> 1000006700040: A parser is thus one of the components in an interpreter or compiler, where it captures the implied hierarchy of the input text and transforms it into a form suitable for further processing (often some kind of parse tree, abstract syntax tree or other hierarchical structure) and normally checks for syntax errors at the same time. 10650040 -> 1000006700050: The parser often uses a separate lexical analyser to create tokens from the sequence of input characters. 10650050 -> 1000006700060: Parsers may be programmed by hand or may be semi-automatically generated (in some programming language) by a tool (such as Yacc) from a grammar written in Backus-Naur form. 10650060 -> 1000006700070: Parsing is also an earlier term for the diagramming of sentences of natural languages, and is still used for the diagramming of inflected languages, such as the Romance languages or Latin. 10650070 -> 1000006700080: Parsers can also be constructed as executable specifications of grammars in functional programming languages. 10650080 -> 1000006700090: Frost, Hafiz and Callaghan have built on the work of others to construct a set of higher-order functions (called parser combinators) which allow polynomial time and space complexity top-down parsers to be constructed as executable specifications of ambiguous grammars containing left-recursive productions. 10650090 -> 1000006700100: The X-SAIGA site has more about the algorithms and implementation details. 10650100 -> 1000006700110: Human languages 10650110 -> 1000006700120: Also see Category:Natural language parsing 10650120 -> 1000006700130: In some machine translation and natural language processing systems, human languages are parsed by computer programs. 10650130 -> 1000006700140: Human sentences are not easily parsed by programs, as there is substantial ambiguity in the structure of human language.
10650140 -> 1000006700150: In order to parse natural language data, researchers must first agree on the grammar to be used. 10650150 -> 1000006700160: The choice of syntax is affected by both linguistic and computational concerns; for instance, some parsing systems use lexical functional grammar, but in general, parsing for grammars of this type is known to be NP-complete. 10650160 -> 1000006700170: Head-driven phrase structure grammar is another linguistic formalism which has been popular in the parsing community, but other research efforts have focused on less complex formalisms such as the one used in the Penn Treebank. 10650170 -> 1000006700180: Shallow parsing aims to find only the boundaries of major constituents such as noun phrases. 10650180 -> 1000006700190: Another popular strategy for avoiding linguistic controversy is dependency grammar parsing. 10650190 -> 1000006700200: Most modern parsers are at least partly statistical; that is, they rely on a corpus of training data which has already been annotated (parsed by hand). 10650200 -> 1000006700210: This approach allows the system to gather information about the frequency with which various constructions occur in specific contexts. 10650210 -> 1000006700220: (See machine learning.) 10650220 -> 1000006700230: Approaches which have been used include straightforward PCFGs (probabilistic context-free grammars), maximum entropy, and neural nets. 10650230 -> 1000006700240: Most of the more successful systems use lexical statistics (that is, they consider the identities of the words involved, as well as their part of speech). 10650240 -> 1000006700250: However, such systems are vulnerable to overfitting and require some kind of smoothing to be effective. 10650250 -> 1000006700260: Parsing algorithms for natural language cannot rely on the grammar having 'nice' properties, as with manually designed grammars for programming languages. 10650260 -> 1000006700270: As mentioned earlier, some grammar formalisms are very computationally difficult to parse; in general, even if the desired structure is not context-free, some kind of context-free approximation to the grammar is used to perform a first pass. 10650265 -> 1000006700280: Algorithms which use context-free grammars often rely on some variant of the CKY algorithm, usually with some heuristic to prune away unlikely analyses to save time. 10650270 -> 1000006700290: (See chart parsing.) 10650280 -> 1000006700300: However, some systems trade speed for accuracy using, e.g., linear-time versions of the shift-reduce algorithm. 10650290 -> 1000006700310: A somewhat recent development has been parse reranking, in which the parser proposes some large number of analyses, and a more complex system selects the best option. 10650310 -> 1000006700330: Programming languages 10650320 -> 1000006700340: The most common use of a parser is as a component of a compiler or interpreter. 10650330 -> 1000006700350: This parses the source code of a computer programming language to create some form of internal representation. 10650340 -> 1000006700360: Programming languages tend to be specified in terms of a context-free grammar because fast and efficient parsers can be written for them. 10650350 -> 1000006700370: Parsers are written by hand or generated by parser generators. 10650360 -> 1000006700380: Context-free grammars are limited in the extent to which they can express all of the requirements of a language.
10650370 -> 1000006700390: Informally, the reason is that the memory of such a language is limited. 10650380 -> 1000006700400: The grammar cannot remember the presence of a construct over an arbitrarily long input; this is necessary for a language in which, for example, a name must be declared before it may be referenced. 10650390 -> 1000006700410: More powerful grammars that can express this constraint, however, cannot be parsed efficiently. 10650400 -> 1000006700420: Thus, it is a common strategy to create a relaxed parser for a context-free grammar which accepts a superset of the desired language constructs (that is, it accepts some invalid constructs); later, the unwanted constructs can be filtered out. 10650410 -> 1000006700430: Overview of process 10650420 -> 1000006700440: The following example demonstrates the common case of parsing a computer language with two levels of grammar: lexical and syntactic. 10650430 -> 1000006700450: The first stage is the token generation, or lexical analysis, by which the input character stream is split into meaningful symbols defined by a grammar of regular expressions. 10650440 -> 1000006700460: For example, a calculator program would look at an input such as "12*(3+4)^2" and split it into the tokens 12, *, (, 3, +, 4, ), ^, and 2, each of which is a meaningful symbol in the context of an arithmetic expression. 10650450 -> 1000006700470: The parser would contain rules to tell it that the characters *, +, ^, ( and ) mark the start of a new token, so meaningless tokens like "12*" or "(3" will not be generated. 10650460 -> 1000006700480: The next stage is parsing or syntactic analysis, which is checking that the tokens form an allowable expression. 10650470 -> 1000006700490: This is usually done with reference to a context-free grammar which recursively defines components that can make up an expression and the order in which they must appear. 10650480 -> 1000006700500: However, not all rules defining programming languages can be expressed by context-free grammars alone, for example type validity and proper declaration of identifiers. 10650490 -> 1000006700510: These rules can be formally expressed with attribute grammars. 10650500 -> 1000006700520: The final phase is semantic parsing or analysis, which is working out the implications of the expression just validated and taking the appropriate action. 10650510 -> 1000006700530: In the case of a calculator or interpreter, the action is to evaluate the expression or program; a compiler, on the other hand, would generate some kind of code. 10650520 -> 1000006700540: Attribute grammars can also be used to define these actions. 10650530 -> 1000006700550: Types of parsers 10650540 -> 1000006700560: The task of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar. 10650550 -> 1000006700570: This can be done in essentially two ways: 10650560 -> 1000006700580: Top-down parsing - Top-down parsing can be viewed as an attempt to find left-most derivations of an input-stream by searching for parse-trees using a top-down expansion of the given formal grammar rules. 10650570 -> 1000006700590: Tokens are consumed from left to right. 10650580 -> 1000006700600: Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand sides of grammar rules. 10650590 -> 1000006700610: LL parsers and recursive-descent parsers are examples of top-down parsers, which cannot accommodate left-recursive productions.
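The following Python sketch ties the three stages just described to the calculator example "12*(3+4)^2" and, at the same time, shows the shape of a simple top-down recursive-descent parser: one function per grammar rule, consuming tokens from left to right. The grammar, the function names and the token pattern are invented for illustration; production parsers for real languages are usually generated from a grammar rather than written this way.

import re

TOKEN = re.compile(r"\d+|[+*/^()-]")

def tokenize(text):
    """Lexical analysis: split the character stream into meaningful symbols."""
    return TOKEN.findall(text)

def evaluate(tokens):
    """Syntactic and semantic analysis in one pass: a recursive-descent parser
    with one function per grammar rule, evaluating as it recognises (an
    interpreter; a compiler would emit code instead)."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        tok = tokens[pos]
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def expr():                              # expr    := term (('+'|'-') term)*
        value = term()
        while peek() in ("+", "-"):
            op, rhs = eat(), term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term():                              # term    := factor (('*'|'/') factor)*
        value = factor()
        while peek() in ("*", "/"):
            op, rhs = eat(), factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor():                            # factor  := primary ('^' factor)?  (right-associative)
        value = primary()
        if peek() == "^":
            eat("^")
            value = value ** factor()
        return value

    def primary():                           # primary := NUMBER | '(' expr ')'
        if peek() == "(":
            eat("(")
            value = expr()
            eat(")")
            return value
        return int(eat())

    result = expr()
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing input")
    return result

print(tokenize("12*(3+4)^2"))             # ['12', '*', '(', '3', '+', '4', ')', '^', '2']
print(evaluate(tokenize("12*(3+4)^2")))   # 588

Note that each rule either consumes a token or calls a rule for a smaller construct before recursing, which is exactly why a naive parser of this kind cannot handle left-recursive productions, the limitation discussed next.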
10650600 -> 1000006700620: Although it has been believed that simple implementations of top-down parsing cannot accommodate direct and indirect left-recursion and may require exponential time and space complexity while parsing ambiguous context-free grammars, more sophisticated algorithms for top-down parsing have been created by Frost, Hafiz, and Callaghan which accommodate ambiguity and left recursion in polynomial time and which generate polynomial-size representations of the potentially exponential number of parse trees. 10650610 -> 1000006700630: Their algorithm is able to produce both left-most and right-most derivations of an input w.r.t. a given CFG. 10650620 -> 1000006700640: Bottom-up parsing - A parser can start with the input and attempt to rewrite it to the start symbol. 10650630 -> 1000006700650: Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. 10650640 -> 1000006700660: LR parsers are examples of bottom-up parsers. 10650650 -> 1000006700670: Another term used for this type of parser is shift-reduce parsing. 10650660 -> 1000006700680: Another important distinction is whether the parser generates a leftmost derivation or a rightmost derivation (see context-free grammar). 10650670 -> 1000006700690: LL parsers will generate a leftmost derivation and LR parsers will generate a rightmost derivation (although usually in reverse). 10650680 -> 1000006700700: Examples of parsers 10650690 -> 1000006700710: Top-down parsers 10650700 -> 1000006700720: Some of the parsers that use top-down parsing include: 10650710 -> 1000006700730: Recursive descent parser 10650720 -> 1000006700740: LL parser (Left-to-right, Leftmost derivation) 10650730 -> 1000006700750: X-SAIGA - eXecutable SpecificAtIons of GrAmmars. 10650740 -> 1000006700760: Contains publications related to a top-down parsing algorithm that supports left-recursion and ambiguity in polynomial time and space. 10650750 -> 1000006700770: Bottom-up parsers 10650760 -> 1000006700780: Some of the parsers that use bottom-up parsing include: 10650770 -> 1000006700790: Precedence parser 10650780 -> 1000006700800: Operator-precedence parser 10650790 -> 1000006700810: Simple precedence parser 10650800 -> 1000006700820: BC (bounded context) parsing 10650810 -> 1000006700830: LR parser (Left-to-right, Rightmost derivation) 10650820 -> 1000006700840: Simple LR (SLR) parser 10650830 -> 1000006700850: LALR parser 10650840 -> 1000006700860: Canonical LR (LR(1)) parser 10650850 -> 1000006700870: GLR parser 10650860 -> 1000006700880: CYK parser 10670010 -> 1000006800020: Part-of-speech tagging 10670020 -> 1000006800030: Part-of-speech tagging (POS tagging or POST), also called grammatical tagging, is the process of marking up the words in a text as corresponding to a particular part of speech, based on both their definition and their context—i.e., their relationship with adjacent and related words in a phrase, sentence, or paragraph. 10670030 -> 1000006800040: A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc. 10670040 -> 1000006800050: Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags.
10670050 -> 1000006800060: History 10670060 -> 1000006800070: Research on part-of-speech tagging has been closely tied to corpus linguistics. 10670070 -> 1000006800080: The first major corpus of English for computer analysis was the Brown Corpus, developed at Brown University by Henry Kucera and Nelson Francis in the mid-1960s. 10670080 -> 1000006800090: It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications. 10670090 -> 1000006800100: Each sample is 2,000 or more words (ending at the first sentence-end after 2,000 words, so that the corpus contains only complete sentences). 10670100 -> 1000006800110: The Brown Corpus was painstakingly "tagged" with part-of-speech markers over many years. 10670110 -> 1000006800120: A first approximation was done with a program by Greene and Rubin, which consisted of a huge handmade list of what categories could co-occur at all. 10670120 -> 1000006800130: For example, article then noun can occur, but article then verb (arguably) cannot. 10670130 -> 1000006800140: The program got about 70% correct. 10670140 -> 1000006800150: Its results were repeatedly reviewed and corrected by hand, and later users sent in errata, so that by the late 1970s the tagging was nearly perfect (allowing for some cases even human speakers might not agree on). 10670150 -> 1000006800160: This corpus has been used for innumerable studies of word frequency and of parts of speech, and inspired the development of similar "tagged" corpora in many other languages. 10670160 -> 1000006800170: Statistics derived by analyzing it formed the basis for most later part-of-speech tagging systems, such as CLAWS and VOLSUNGA. 10670170 -> 1000006800180: However, by this time (2005) it had been superseded by larger corpora such as the 100 million word British National Corpus. 10670180 -> 1000006800190: For some time, part-of-speech tagging was considered an inseparable part of natural language processing, because there are certain cases where the correct part of speech cannot be decided without understanding the semantics or even the pragmatics of the context. 10670190 -> 1000006800200: This is extremely expensive, especially because analyzing the higher levels is much harder when multiple part-of-speech possibilities must be considered for each word. 10670200 -> 1000006800210: In the mid-1980s, researchers in Europe began to use hidden Markov models (HMMs) to disambiguate parts of speech, when working to tag the Lancaster-Oslo-Bergen Corpus of British English. 10670210 -> 1000006800220: HMMs involve counting cases (such as from the Brown Corpus), and making a table of the probabilities of certain sequences. 10670220 -> 1000006800230: For example, once you've seen an article such as 'the', perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%. 10670230 -> 1000006800240: Knowing this, a program can decide that "can" in "the can" is far more likely to be a noun than a verb or a modal. 10670240 -> 1000006800250: The same method can of course be used to benefit from knowledge about following words. 10670250 -> 1000006800260: More advanced ("higher order") HMMs learn the probabilities not only of pairs, but triples or even larger sequences. 10670260 -> 1000006800270: So, for example, if you've just seen an article and a verb, the next item may be very likely a preposition, article, or noun, but even less likely another verb.
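As an illustration of the counting approach just described, the following sketch (not from the original text; its tiny tagged corpus and the resulting probabilities are invented) reduces a bigram tagger to two frequency tables, one for tag-to-tag transitions and one for tag-to-word emissions, whose product scores a candidate tag sequence for "the can".

```python
from collections import Counter, defaultdict

# A tiny, invented tagged corpus standing in for counts taken from a corpus
# such as the Brown Corpus; the resulting probabilities are illustrative only.
tagged_sentences = [
    [("the", "DET"), ("can", "NOUN"), ("rusted", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("can", "MODAL"), ("run", "VERB")],
    [("a", "DET"), ("can", "NOUN"), ("fell", "VERB")],
]

transitions = defaultdict(Counter)   # counts of tag -> following tag
emissions = defaultdict(Counter)     # counts of tag -> word

for sentence in tagged_sentences:
    previous = "<START>"
    for word, tag in sentence:
        transitions[previous][tag] += 1
        emissions[tag][word] += 1
        previous = tag

def probability(counter, key):
    """Relative frequency, with a small floor so unseen events are merely unlikely."""
    total = sum(counter.values())
    return counter[key] / total if total and counter[key] else 1e-6

def sequence_score(words, tags):
    """Multiply transition and emission probabilities along one candidate tag sequence."""
    score, previous = 1.0, "<START>"
    for word, tag in zip(words, tags):
        score *= probability(transitions[previous], tag) * probability(emissions[tag], word)
        previous = tag
    return score

words = ["the", "can"]
for tags in [("DET", "NOUN"), ("DET", "MODAL")]:
    print(tags, sequence_score(words, tags))
# With these counts the DET NOUN reading scores far higher, mirroring the "the can" example above.
```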
10670270 -> 1000006800280: When several ambiguous words occur together, the possibilities multiply. 10670280 -> 1000006800290: However, it is easy to enumerate every combination and to assign a relative probability to each one, by multiplying together the probabilities of each choice in turn. 10670290 -> 1000006800300: The combination with highest probability is then chosen. 10670300 -> 1000006800310: The European group developed CLAWS, a tagging program that did exactly this, and achieved accuracy in the 93-95% range. 10670310 -> 1000006800320: It is worth remembering, as Eugene Charniak points out in Statistical techniques for natural language parsing , that merely assigning the most common tag to each known word and the tag "proper noun" to all unknowns, will approach 90% accuracy because many words are unambiguous. 10670320 -> 1000006800330: CLAWS pioneered the field of HMM-based part of speech tagging, but was quite expensive since it enumerated all possibilities. 10670330 -> 1000006800340: It sometimes had to resort to backup methods when there were simply too many (the Brown Corpus contains a case with 17 ambiguous words in a row, and there are words such as "still" that can represent as many as 7 distinct parts of speech). 10670340 -> 1000006800350: In 1987, Steve DeRose and Ken Church independently developed dynamic programming algorithms to solve the same problem in vastly less time. 10670350 -> 1000006800360: Their methods were similar to the Viterbi algorithm known for some time in other fields. 10670360 -> 1000006800370: DeRose used a table of pairs, while Church used a table of triples and an ingenious method of estimating the values for triples that were rare or nonexistent in the Brown Corpus (actual measurement of triple probabilities would require a much larger corpus). 10670370 -> 1000006800380: Both methods achieved accuracy over 95%. 10670380 -> 1000006800390: DeRose's 1990 dissertation at Brown University included analyses of the specific error types, probabilities, and other related data, and replicated his work for Greek, where it proved similarly effective. 10670390 -> 1000006800400: These findings were surprisingly disruptive to the field of Natural Language Processing. 10670400 -> 1000006800410: The accuracy reported was higher than the typical accuracy of very sophisticated algorithms that integrated part of speech choice with many higher levels of linguistic analysis: syntax, morphology, semantics, and so on. 10670410 -> 1000006800420: CLAWS, DeRose's and Church's methods did fail for some of the known cases where semantics is required, but those proved negligibly rare. 10670420 -> 1000006800430: This convinced many in the field that part-of-speech tagging could usefully be separated out from the other levels of processing; this in turn simplified the theory and practice of computerized language analysis, and encouraged researchers to find ways to separate out other pieces as well. 10670430 -> 1000006800440: Markov Models are now the standard method for part-of-speech assignment. 10670440 -> 1000006800450: The methods already discussed involve working from a pre-existing corpus to learn tag probabilities. 10670450 -> 1000006800460: It is, however, also possible to bootstrap using "unsupervised" tagging. 10670460 -> 1000006800470: Unsupervised tagging techniques use an untagged corpus for their training data and produce the tagset by induction. 10670470 -> 1000006800480: That is, they observe patterns in word use, and derive part-of-speech categories themselves. 
10670480 -> 1000006800490: For example, statistics readily reveal that "the", "a", and "an" occur in similar contexts, while "eat" occurs in very different ones. 10670490 -> 1000006800500: With sufficient iteration, similarity classes of words emerge that are remarkably similar to those human linguists would expect; and the differences themselves sometimes suggest valuable new insights. 10670500 -> 1000006800510: These two categories can be further subdivided into rule-based, stochastic, and neural approaches. 10670510 -> 1000006800520: Some current major algorithms for part-of-speech tagging include the Viterbi algorithm, Brill Tagger, and the Baum-Welch algorithm (also known as the forward-backward algorithm). 10670520 -> 1000006800530: Hidden Markov model and visible Markov model taggers can both be implemented using the Viterbi algorithm. 10680010 -> 1000006900020: Pattern recognition 10680020 -> 1000006900030: Pattern recognition is a sub-topic of machine learning. 10680030 -> 1000006900040: It can be defined as 10680040 -> 1000006900050: "the act of taking in raw data and taking an action based on the category of the data". 10680050 -> 1000006900060: Most research in pattern recognition is about methods for supervised learning and unsupervised learning. 10680060 -> 1000006900070: Pattern recognition aims to classify data (patterns) based on either a priori knowledge or on statistical information extracted from the patterns. 10680070 -> 1000006900080: The patterns to be classified are usually groups of measurements or observations, defining points in an appropriate multidimensional space. 10680080 -> 1000006900090: This is in contrast to pattern matching, where the pattern is rigidly specified. 10680090 -> 1000006900100: Overview 10680100 -> 1000006900110: A complete pattern recognition system consists of a sensor that gathers the observations to be classified or described; a feature extraction mechanism that computes numeric or symbolic information from the observations; and a classification or description scheme that does the actual job of classifying or describing observations, relying on the extracted features. 10680110 -> 1000006900120: The classification or description scheme is usually based on the availability of a set of patterns that have already been classified or described. 10680120 -> 1000006900130: This set of patterns is termed the training set and the resulting learning strategy is characterized as supervised learning. 10680130 -> 1000006900140: Learning can also be unsupervised, in the sense that the system is not given an a priori labeling of patterns; instead, it establishes the classes itself based on the statistical regularities of the patterns. 10680140 -> 1000006900150: The classification or description scheme usually uses one of the following approaches: statistical (or decision theoretic), syntactic (or structural). 10680150 -> 1000006900160: Statistical pattern recognition is based on statistical characterisations of patterns, assuming that the patterns are generated by a probabilistic system. 10680160 -> 1000006900170: Syntactical (or structural) pattern recognition is based on the structural interrelationships of features. 10680170 -> 1000006900180: A wide range of algorithms can be applied for pattern recognition, from very simple Bayesian classifiers to much more powerful neural networks.
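The following sketch (not from the original text; the two-dimensional training set and class names are invented) illustrates the simplest end of that range: a Gaussian naive Bayes classifier that learns a prior and per-feature statistics for each class from a labeled training set, then assigns a new observation to the class with the highest posterior.

```python
import math
from collections import defaultdict

# An invented 2-dimensional training set: each pattern is a point in feature
# space together with a class label supplied in advance (supervised learning).
training_set = [
    ((1.0, 1.2), "class_a"), ((0.8, 1.0), "class_a"), ((1.1, 0.9), "class_a"),
    ((3.0, 3.2), "class_b"), ((3.3, 2.9), "class_b"), ((2.9, 3.1), "class_b"),
]

def train(samples):
    """Estimate a prior and a per-feature (mean, variance) pair for every class."""
    by_class = defaultdict(list)
    for features, label in samples:
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for column in zip(*rows):                    # one column per feature
            mean = sum(column) / len(column)
            var = sum((x - mean) ** 2 for x in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (len(rows) / len(samples), stats)
    return model

def log_gaussian(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def classify(model, features):
    """Choose the class with the highest log prior plus summed log likelihoods."""
    def log_posterior(label):
        prior, stats = model[label]
        return math.log(prior) + sum(log_gaussian(x, m, v)
                                     for x, (m, v) in zip(features, stats))
    return max(model, key=log_posterior)

model = train(training_set)
print(classify(model, (1.0, 1.1)))   # -> class_a
print(classify(model, (3.1, 3.0)))   # -> class_b
```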
10680180 -> 1000006900190: An intriguing problem in pattern recognition yet to be solved is the relationship between the problem to be solved (data to be classified) and the performance of various pattern recognition algorithms (classifiers). 10680190 -> 1000006900200: Pattern recognition is more complex when templates are used to generate variants. 10680200 -> 1000006900210: For example, in English, sentences often follow the "N-VP" (noun - verb phrase) pattern, but some knowledge of the English language is required to detect the pattern. 10680210 -> 1000006900220: Pattern recognition is studied in many fields, including psychology, ethology, and computer science. 10680220 -> 1000006900230: Holographic associative memory is another type of pattern matching scheme, in which a small target pattern can be searched for within a large set of learned patterns based on cognitive meta-weight. 10680230 -> 1000006900240: Uses 10680240 -> 1000006900250: Within medical science, pattern recognition creates the basis for computer-aided diagnosis (CAD) systems. 10680250 -> 1000006900260: CAD describes a procedure that supports the doctor's interpretations and findings. 10680260 -> 1000006900270: Typical applications are automatic speech recognition, classification of text into several categories (e.g. spam/non-spam email messages), the automatic recognition of handwritten postal codes on postal envelopes, or the automatic recognition of images of human faces. 10680270 -> 1000006900280: The last two examples form the subtopic image analysis of pattern recognition that deals with digital images as input to pattern recognition systems. 10690010 -> 1000007000020: Phrase 10690020 -> 1000007000030: In grammar, a phrase is a group of words that functions as a single unit in the syntax of a sentence. 10690030 -> 1000007000040: For example, the house at the end of the street (example 1) is a phrase. 10690040 -> 1000007000050: It acts like a noun. 10690050 -> 1000007000060: It contains the phrase at the end of the street (example 2), a prepositional phrase which acts like an adjective. 10690060 -> 1000007000070: Example 2 could be replaced by white, to make the phrase the white house. 10690070 -> 1000007000080: Examples 1 and 2 contain the phrase the end of the street (example 3) which acts like a noun. 10690080 -> 1000007000090: It could be replaced by the cross-roads to give the house at the cross-roads. 10690090 -> 1000007000100: Most phrases have a central word which defines the type of phrase. 10690100 -> 1000007000110: This word is called the head of the phrase. 10690110 -> 1000007000120: In English, the head is often the first word of the phrase. 10690120 -> 1000007000130: Some phrases, however, can be headless. 10690130 -> 1000007000140: For example, the rich is a noun phrase composed of a determiner and an adjective, but no noun. 10690140 -> 1000007000150: Phrases may be classified by the type of head they take: 10690150 -> 1000007000160: Prepositional phrase (PP) with a preposition as head (e.g. in love, over the rainbow). 10690160 -> 1000007000170: Languages that use postpositions instead have postpositional phrases. 10690170 -> 1000007000180: The two types are sometimes commonly referred to as adpositional phrases. 10690180 -> 1000007000190: Noun phrase (NP) with a noun as head (e.g. the black cat, a cat on the mat) 10690190 -> 1000007000200: Verb phrase (VP) with a verb as head (e.g. eat cheese, jump up and down) 10690200 -> 1000007000210: Adjectival phrase with an adjective as head (e.g.
full of toys) 10690210 -> 1000007000220: Adverbial phrase with adverb as head (e.g. very carefully) 10690220 -> 1000007000230: Formal definition 10690230 -> 1000007000240: A phrase is a syntactic structure which has syntactic properties derived from its head. 10690240 -> 1000007000250: Complexity 10690250 -> 1000007000260: A complex phrase consists of several words, whereas a simple phrase consists of only one word. 10690260 -> 1000007000270: This terminology is especially often used with verb phrases: 10690270 -> 1000007000280: simple past and present are simple verb forms, which require just one verb 10690280 -> 1000007000290: complex verbs have one or two aspects added, and hence require two or three additional words 10690290 -> 1000007000300: "Complex", which is phrase-level, is often confused with "compound", which is word-level. 10690300 -> 1000007000310: However, there are certain phenomena that formally seem to be phrases but semantically are more like compounds, like "women's magazines", which has the form of a possessive noun phrase, but which refers (just like a compound) to one specific lexeme (i.e. a magazine for women and not some magazine owned by a woman). 10690310 -> 1000007000320: Semiotic approaches to the concept of "phrase" 10690320 -> 1000007000330: In more semiotic approaches to language, such as the more cognitivist versions of construction grammar, a phrasal structure is not only a certain formal combination of word types whose features are inherited from the head. 10690330 -> 1000007000340: Here each phrasal structure also expresses some type of conceptual content, be it specific or abstract. 10700010 -> 1000007100020: Portuguese language 10700020 -> 1000007100030: Portuguese ( or língua portuguesa) is a Romance language that originated in what is now Galicia (Spain) and northern Portugal from the Latin spoken by romanized pre-Roman peoples of the Iberian Peninsula (namely the Gallaeci, the Lusitanians, the Celtici and the Conii) about 2000 years ago. 10700030 -> 1000007100040: It spread worldwide in the 15th and 16th centuries as Portugal established a colonial and commercial empire (1415–1999) which spanned from Brazil in the Americas to Goa in India and Macau in China; in fact, it was used exclusively on the island of Sri Lanka as the lingua franca for almost 350 years. 10700040 -> 1000007100050: During that time, many creole languages based on Portuguese also appeared around the world, especially in Africa, Asia, and the Caribbean. 10700050 -> 1000007100060: Today it is one of the world's major languages, ranked 6th according to number of native speakers (approximately 177 million). 10700060 -> 1000007100070: It is the language with the largest number of speakers in South America, spoken by nearly all of Brazil's population, which amounts to over 51% of the continent's population even though Brazil is the only Portuguese-speaking nation in the Americas. 10700070 -> 1000007100080: It is also a major lingua franca in Portugal's former colonial possessions in Africa. 10700080 -> 1000007100090: It is the official language of ten countries (see the table on the right), also being co-official with Spanish and French in Equatorial Guinea, with Cantonese Chinese in the Chinese special administrative region of Macau, and with Tetum in East Timor. 10700090 -> 1000007100100: There are sizable communities of Portuguese speakers in various regions of North America, notably in the United States (New Jersey, New England and south Florida) and in Ontario, Canada.
10700100 -> 1000007100110: Spanish author Miguel de Cervantes once called Portuguese "the sweet language", while Brazilian writer Olavo Bilac poetically described it as a última flor do Lácio, inculta e bela: "the last flower of Latium, wild and beautiful". 10700110 -> 1000007100120: Geographic distribution 10700120 -> 1000007100130: Today, Portuguese is the official language of Angola, Brazil, Cape Verde, Guinea-Bissau, Portugal, São Tomé and Príncipe and Mozambique. 10700130 -> 1000007100140: It is also one of the official languages of Equatorial Guinea (with Spanish and French), the Chinese special administrative region of Macau (with Chinese), and East Timor (with Tetum). 10700140 -> 1000007100150: It is a native language of most of the population in Portugal (100%), Brazil (99%), Angola (60%), and São Tomé and Príncipe (50%), and it is spoken by a plurality of the population of Mozambique (40%), though only 6.5% are native speakers. 10700150 -> 1000007100160: No data is available for Cape Verde, but almost all the population is bilingual, and the monolingual population speaks Cape Verdean Creole. 10700160 -> 1000007100170: Small Portuguese-speaking communities subsist in former overseas colonies of Portugal such as Macau, where it is spoken as a first language by 0.6% of the population, and East Timor. 10700170 -> 1000007100180: Uruguay gave Portuguese equal status to Spanish in its educational system along the northern border with Brazil. 10700180 -> 1000007100190: In the rest of the country, it is taught as an obligatory subject beginning in the 6th grade. 10700190 -> 1000007100200: It is also spoken, though without official status, by substantial immigrant communities in Andorra, France, Luxembourg, Jersey (with a statistically significant Portuguese-speaking community of approximately 10,000 people), Paraguay, Namibia, South Africa, Switzerland, Venezuela and in the U.S. states of California, Connecticut, Florida, Massachusetts, New Jersey, New York and Rhode Island. 10700200 -> 1000007100210: In some parts of India, such as Goa and Daman and Diu, Portuguese is still spoken. 10700210 -> 1000007100220: There are also significant populations of Portuguese speakers in Canada (mainly concentrated in and around Toronto), Bermuda and the Netherlands Antilles. 10700220 -> 1000007100230: Portuguese is an official language of several international organizations. 10700230 -> 1000007100240: The Community of Portuguese Language Countries (with the Portuguese acronym CPLP) consists of the eight independent countries that have Portuguese as an official language. 10700240 -> 1000007100250: It is also an official language of the European Union, Mercosul, the Organization of American States, the Organization of Ibero-American States, the Union of South American Nations, and the African Union (one of the working languages), and one of the official languages of other organizations. 10700250 -> 1000007100260: The Portuguese language is gaining popularity in Africa, Asia, and South America as a second language for study. 10700260 -> 1000007100270: Portuguese and Spanish are the fastest-growing European languages, and, according to estimates by UNESCO, Portuguese is the language with the highest potential for growth as an international language in southern Africa and South America. 10700270 -> 1000007100280: The Portuguese-speaking African countries are expected to have a combined population of 83 million by 2050.
10700280 -> 1000007100290: Since 1991, when Brazil signed into the economic market of Mercosul with other South American nations, such as Argentina, Uruguay, and Paraguay, there has been an increase in interest in the study of Portuguese in those South American countries. 10700290 -> 1000007100300: The demographic weight of Brazil in the continent will continue to strengthen the presence of the language in the region. 10700300 -> 1000007100310: Although in the early 21st century, after Macau was ceded to China in 1999, the use of Portuguese was in decline in Asia, it is becoming a language of opportunity there; mostly because of East Timor's boost in the number of speakers in the last five years but also because of increased Chinese diplomatic and financial ties with Portuguese-speaking countries. 10700310 -> 1000007100320: In July 2007, President Teodoro Obiang Nguema announced his government's decision to make Portuguese Equatorial Guinea's third official language, in order to meet the requirements to apply for full membership of the Community of Portuguese Language Countries. 10700320 -> 1000007100330: This upgrading from its current Associate Observer condition would result in Equatorial Guinea being able to access several professional and academic exchange programs and the facilitation of cross-border circulation of citizens. 10700330 -> 1000007100340: Its application is currently being assessed by other CPLP members. 10700340 -> 1000007100350: In March 1994 the Bosque de Portugal (Portugal's Woods) was founded in the Brazilian city of Curitiba. 10700350 -> 1000007100360: The park houses the Portuguese Language Memorial, which honors the Portuguese immigrants and the countries that adopted the Portuguese language. 10700360 -> 1000007100370: Originally there were seven nations represented with pillars, but the independence of East Timor brought yet another pillar for that nation in 2007. 10700370 -> 1000007100380: In March 2006, the Museum of the Portuguese Language, an interactive museum about the Portuguese language, was founded in São Paulo, Brazil, the city with the largest number of Portuguese speakers in the world. 10700380 -> 1000007100390: Dialects 10700390 -> 1000007100400: Portuguese is a pluricentric language with two main groups of dialects, those of Brazil and those of the Old World. 10700400 -> 1000007100410: For historical reasons, the dialects of Africa and Asia are generally closer to those of Portugal than the Brazilian dialects, although in some aspects of their phonetics, especially the pronunciation of unstressed vowels, they resemble Brazilian Portuguese more than European Portuguese. 10700410 -> 1000007100420: They have not been studied as widely as European and Brazilian Portuguese. 10700420 -> 1000007100430: Audio samples of some dialects of Portuguese are available below. 10700430 -> 1000007100440: There are some differences between the areas but these are the best approximations possible. 10700440 -> 1000007100450: For example, the caipira dialect has some differences from the one of Minas Gerais, but in general it is very close. 10700450 -> 1000007100460: A good example of Brazilian Portuguese may be found in the capital city, Brasília, because of the generalized population from all parts of the country. 10700460 -> 1000007100470: Angola 10700470 -> 1000007100480: Benguelense — Benguela province. 10700480 -> 1000007100490: Luandense — Luanda province. 10700490 -> 1000007100500: Sulista — South of Angola. 
10700500 -> 1000007100510: Brazil 10700510 -> 1000007100520: Caipira — States of São Paulo (countryside; the city of São Paulo and the eastern areas of the state have their own dialect, called paulistano); southern Minas Gerais, northern Paraná, Goiás and Mato Grosso do Sul. 10700520 -> 1000007100530: Cearense — Ceará. 10700530 -> 1000007100540: Baiano — Bahia. 10700540 -> 1000007100550: Fluminense — Variants spoken in the states of Rio de Janeiro and Espírito Santo (excluding the city of Rio de Janeiro and its adjacent metropolitan areas, which have their own dialect, called carioca). 10700550 -> 1000007100560: Gaúcho — Rio Grande do Sul. 10700560 -> 1000007100570: (There are many distinct accents in Rio Grande do Sul, mainly due to the heavy influx of European immigrants of diverse origins who settled several colonies throughout the state.) 10700570 -> 1000007100580: Mineiro — Minas Gerais (not prevalent in the Triângulo Mineiro, southern and southeastern Minas Gerais). 10700580 -> 1000007100590: Nordestino — northeastern states of Brazil (Pernambuco and Rio Grande do Norte have a particular way of speaking). 10700590 -> 1000007100600: Nortista — Amazon Basin states. 10700600 -> 1000007100610: Paulistano — Variants spoken around São Paulo city and the eastern areas of São Paulo state. 10700610 -> 1000007100620: Sertanejo — States of Goiás and Mato Grosso (the city of Cuiabá has a particular way of speaking). 10700620 -> 1000007100630: Sulista — Variants spoken in the areas between the northern regions of Rio Grande do Sul and southern regions of São Paulo state. 10700630 -> 1000007100640: (The cities of Curitiba, Florianópolis, and Itapetininga have fairly distinct accents as well.) 10700640 -> 1000007100650: Portugal 10700650 -> 1000007100660: Açoriano (Azorean) — Azores. 10700660 -> 1000007100670: Alentejano — Alentejo 10700670 -> 1000007100680: Algarvio — Algarve (there is a particular dialect in a small part of western Algarve). 10700680 -> 1000007100690: Alto-Minhoto — North of Braga (hinterland). 10700690 -> 1000007100700: Baixo-Beirão; Alto-Alentejano — Central Portugal (hinterland). 10700700 -> 1000007100710: Beirão — Central Portugal. 10700710 -> 1000007100720: Estremenho — Regions of Coimbra and Lisbon (the Lisbon dialect has some peculiar features not shared with that of Coimbra). 10700720 -> 1000007100730: Madeirense (Madeiran) — Madeira. 10700730 -> 1000007100740: Nortenho — Regions of Braga and Porto. 10700740 -> 1000007100750: Transmontano — Trás-os-Montes e Alto Douro. 10700750 -> 1000007100760: Other countries 10700760 -> 1000007100770: Cape Verde — Português cabo-verdiano (Cape Verdean Portuguese) 10700770 -> 1000007100780: Daman and Diu, India — Damaense. 10700780 -> 1000007100790: East Timor — Timorense (East Timorese) 10700790 -> 1000007100800: Goa, India — Goês. 10700800 -> 1000007100810: Guinea-Bissau — Guineense (Guinean Portuguese). 10700810 -> 1000007100820: Macau, China — Macaense (Macanese) 10700820 -> 1000007100830: Mozambique — Moçambicano (Mozambican) 10700830 -> 1000007100840: São Tomé and Príncipe — Santomense 10700840 -> 1000007100850: Uruguay — Dialectos Portugueses del Uruguay (DPU). 10700850 -> 1000007100860: Differences between dialects are mostly of accent and vocabulary, but between the Brazilian dialects and other dialects, especially in their most colloquial forms, there can also be some grammatical differences.
10700860 -> 1000007100870: The Portuguese-based creoles spoken in various parts of Africa, Asia, and the Americas are independent languages which should not be confused with Portuguese itself. 10700870 -> 1000007100880: History 10700880 -> 1000007100890: Arriving in the Iberian Peninsula in 216 BC, the Romans brought with them the Latin language, from which all Romance languages descend. 10700890 -> 1000007100900: The language was spread by arriving Roman soldiers, settlers and merchants, who built Roman cities mostly near the settlements of previous civilizations. 10700900 -> 1000007100910: Between AD 409 and 711, as the Roman Empire collapsed in Western Europe, the Iberian Peninsula was conquered by Germanic peoples (Migration Period). 10700910 -> 1000007100920: The occupiers, mainly Suebi and Visigoths, quickly adopted late Roman culture and the Vulgar Latin dialects of the peninsula. 10700920 -> 1000007100930: After the Moorish invasion of 711, Arabic became the administrative language in the conquered regions, but most of the population continued to speak a form of Romance commonly known as Mozarabic. 10700930 -> 1000007100940: The influence exerted by Arabic on the Romance dialects spoken in the Christian kingdoms of the north was small, affecting mainly their lexicon. 10700940 -> 1000007100950: The earliest surviving records of a distinctively Portuguese language are administrative documents of the 9th century, still interspersed with many Latin phrases. 10700950 -> 1000007100960: Today this phase is known as Proto-Portuguese (between the 9th and the 12th centuries). 10700960 -> 1000007100970: In the first period of Old Portuguese — Galician-Portuguese Period (from the 12th to the 14th century) — the language gradually came into general use. 10700970 -> 1000007100980: For some time, it was the language of preference for lyric poetry in Christian Hispania, much like Occitan was the language of the poetry of the troubadours. 10700980 -> 1000007100990: Portugal was formally recognized as an independent kingdom by the Kingdom of Leon in 1143, with Afonso Henriques as king. 10700990 -> 1000007101000: In 1290, king Dinis created the first Portuguese university in Lisbon (the Estudos Gerais, later moved to Coimbra) and decreed that Portuguese, then simply called the "common language" should be known as the Portuguese language and used officially. 10701000 -> 1000007101010: In the second period of Old Portuguese, from the 14th to the 16th century, with the Portuguese discoveries, the language was taken to many regions of Asia, Africa and the Americas (nowadays, the great majority of Portuguese speakers live in Brazil, in South America). 10701010 -> 1000007101020: By the 16th century it had become a lingua franca in Asia and Africa, used not only for colonial administration and trade but also for communication between local officials and Europeans of all nationalities. 10701020 -> 1000007101030: Its spread was helped by mixed marriages between Portuguese and local people, and by its association with Roman Catholic missionary efforts, which led to the formation of a creole language called Kristang in many parts of Asia (from the word cristão, "Christian"). 10701030 -> 1000007101040: The language continued to be popular in parts of Asia until the 19th century. 10701040 -> 1000007101050: Some Portuguese-speaking Christian communities in India, Sri Lanka, Malaysia, and Indonesia preserved their language even after they were isolated from Portugal. 
10701050 -> 1000007101060: The end of the Old Portuguese period was marked by the publication of the Cancioneiro Geral by Garcia de Resende, in 1516. 10701060 -> 1000007101070: The early times of Modern Portuguese, which spans from the 16th century to the present day, were characterized by an increase in the number of learned words borrowed from Classical Latin and Classical Greek since the Renaissance, which greatly enriched the lexicon. 10701070 -> 1000007101080: Characterization 10701080 -> 1000007101090: A distinctive feature of Portuguese is that it preserved the stressed vowels of Vulgar Latin, which became diphthongs in other Romance languages; cf. Fr. pierre, Sp. piedra, It. pietra, Port. pedra, from Lat. petra; or Sp. fuego, It. fuoco, Port. fogo, from Lat. focum. 10701090 -> 1000007101100: Another characteristic of early Portuguese was the loss of intervocalic l and n, sometimes followed by the merger of the two surrounding vowels, or by the insertion of an epenthetic vowel between them: cf. Lat. salire, tenere, catena, Sp. salir, tener, cadena, Port. sair, ter, cadeia. 10701100 -> 1000007101110: When the elided consonant was n, it often nasalized the preceding vowel: cf. Lat. manum, rana, bonum, Port. mão, rãa, bõo (now mão, rã, bom). 10701110 -> 1000007101120: This process was the source of most of the nasal diphthongs which are typical of Portuguese. 10701120 -> 1000007101130: In particular, the Latin endings -anem, -anum and -onem became -ão in most cases, cf. Lat. canem, germanum, rationem with Modern Port. cão, irmão, razão, and their plurals -anes, -anos, -ones normally became -ães, -ãos, -ões, cf. cães, irmãos, razões. 10701130 -> 1000007101140: Movement to make Portuguese an official language of the UN 10701140 -> 1000007101150: There is a growing number of people in the Portuguese-speaking media and the internet who are presenting the case to the CPLP and other organizations to run a debate in the Lusophone community with the purpose of bringing forward a petition to make Portuguese an official language of the United Nations. 10701150 -> 1000007101160: In October 2005, during the international Convention of the Elos Club International that took place in Tavira, Portugal, a petition was written and unanimously approved, whose text can be found on the internet with the title Petição Para Tornar Oficial o Idioma Português na ONU. 10701160 -> 1000007101170: Romulo Alexandre Soares, president of the Brazil-Portugal Chamber, highlights that the positioning of Brazil in the international arena as one of the emergent powers of the 21st century, the size of its population, and the presence of the language around the world provide legitimacy for and justify a petition to make Portuguese an official language of the UN. 10701170 -> 1000007101180: Vocabulary 10701180 -> 1000007101190: Most of the lexicon of Portuguese is derived from Latin. 10701190 -> 1000007101200: Nevertheless, because of the Moorish occupation of the Iberian Peninsula during the Middle Ages, and the participation of Portugal in the Age of Discovery, it has adopted loanwords from all over the world. 10701200 -> 1000007101210: Very few Portuguese words can be traced to the pre-Roman inhabitants of Portugal, which included the Gallaeci, Lusitanians, Celtici and Cynetes. 10701210 -> 1000007101220: The Phoenicians and Carthaginians, briefly present, also left some scarce traces.
10701220 -> 1000007101230: Some notable examples are abóbora "pumpkin" and bezerro "year-old calf", from the nearby Celtiberian language (probably through the Celtici); cerveja "beer", from Celtic; saco "bag", from Phoenician; and cachorro "dog, puppy", from Basque. 10701230 -> 1000007101240: In the 5th century, the Iberian Peninsula (the Roman Hispania) was conquered by the Germanic Suevi and Visigoths. 10701240 -> 1000007101250: As they adopted the Roman civilization and language, however, these people contributed only a few words to the lexicon, mostly related to warfare — such as espora "spur", estaca "stake", and guerra "war", from Gothic *spaúra, *stakka, and *wirro, respectively. 10701250 -> 1000007101260: Between the 9th and 15th centuries, Portuguese acquired about 1,000 words from Arabic through the influence of Moorish Iberia. 10701260 -> 1000007101270: They are often recognizable by the initial Arabic article a(l)-, and include many common words such as aldeia "village" from الضيعة aldaya, alface "lettuce" from الخس alkhass, armazém "warehouse" from المخزن almahazan, and azeite "olive oil" from زيت azzait. 10701270 -> 1000007101280: From Arabic came also the grammatically peculiar word oxalá "hopefully". 10701280 -> 1000007101290: The Mozambican currency name metical was derived from the word مطقال miṭqāl, a unit of weight. 10701290 -> 1000007101300: The word Mozambique itself is from the Arabic name of sultan Muça Alebique (Musa Alibiki). 10701300 -> 1000007101310: The name of the Portuguese town of Fátima comes from the name of one of the daughters of the prophet Muhammad. 10701310 -> 1000007101320: Starting in the 15th century, the Portuguese maritime explorations led to the introduction of many loanwords from Asian languages. 10701320 -> 1000007101330: For instance, catana "cutlass" from Japanese katana; corja "rabble" from Malay kórchchu; and chá "tea" from Chinese chá. 10701330 -> 1000007101340: From South America came batata "potato", from Taino; ananás and abacaxi, from Tupi-Guarani naná and Tupi ibá cati, respectively (two species of pineapple), and tucano "toucan" from Guarani tucan. 10701340 -> 1000007101350: See List of Brazil state name etymologies for some more examples. 10701350 -> 1000007101360: From the 16th to the 19th century, owing to the role of Portugal as an intermediary in the Atlantic slave trade and the establishment of large Portuguese colonies in Angola, Mozambique, and Brazil, Portuguese acquired several words of African and Amerind origin, especially names for most of the animals and plants found in those territories. 10701360 -> 1000007101370: While those terms are mostly used in the former colonies, many became current in European Portuguese as well. 10701370 -> 1000007101380: From Kimbundu, for example, came kifumate → cafuné "head caress", kusula → caçula "youngest child", marimbondo "tropical wasp", and kubungula → bungular "to dance like a wizard". 10701380 -> 1000007101390: Finally, it has received a steady influx of loanwords from other European languages.
10701390 -> 1000007101400: For example, melena "hair lock", fiambre "wet-cured ham" (in contrast with presunto "dry-cured ham" from Latin prae-exsuctus "dehydrated"), and castelhano "Castilian", from Spanish; colchete/crochê "bracket"/"crochet", paletó "jacket", batom "lipstick", and filé/filete "steak"/"slice" respectively, from French crochet, paletot, bâton, filet; macarrão "pasta", piloto "pilot", carroça "carriage", and barraca "barrack", from Italian maccherone, pilota, carrozza, baracca; and bife "steak", futebol, revólver, estoque, folclore, from English beef, football, revolver, stock, folklore. 10701400 -> 1000007101410: Classification and related languages 10701410 -> 1000007101420: Portuguese belongs to the West Iberian branch of the Romance languages, and it has special ties with the following members of this group: 10701420 -> 1000007101430: Galician and the Fala, its closest relatives. 10701430 -> 1000007101440: See below. 10701440 -> 1000007101450: Spanish, the major language closest to Portuguese. 10701450 -> 1000007101460: (See also Differences between Spanish and Portuguese.) 10701460 -> 1000007101470: Mirandese, another West Iberian language spoken in Portugal. 10701470 -> 1000007101480: Judeo-Portuguese and Judeo-Spanish, languages spoken by Sephardic Jews, which remained close to Portuguese and Spanish. 10701480 -> 1000007101490: Despite the obvious lexical and grammatical similarities between Portuguese and other Romance languages, it is not mutually intelligible with most of them. 10701490 -> 1000007101500: Apart from Galician, Portuguese speakers will usually need some formal study of basic grammar and vocabulary, before attaining a reasonable level of comprehension of those languages, and vice-versa. 10701500 -> 1000007101510: Galician and the Fala 10701510 -> 1000007101520: The closest language to Portuguese is Galician, spoken in the autonomous community of Galicia (northwestern Spain). 10701520 -> 1000007101530: The two were at one time a single language, known today as Galician-Portuguese, but since the political separation of Portugal from Galicia they have diverged somewhat, especially in pronunciation and vocabulary. 10701530 -> 1000007101540: Nevertheless, the core vocabulary and grammar of Galician are still noticeably closer to Portuguese than to Spanish. 10701540 -> 1000007101550: In particular, like Portuguese, it uses the future subjunctive, the personal infinitive, and the synthetic pluperfect (see the section on the grammar of Portuguese, below). 10701550 -> 1000007101560: Mutual intelligibility (estimated at 85% by R. A. Hall, Jr., 1989) is good between Galicians and northern Portuguese, but poorer between Galicians and speakers from central Portugal. 10701560 -> 1000007101570: The Fala language is another descendant of Galician-Portuguese, spoken by a small number of people in the Spanish towns of Valverdi du Fresnu, As Ellas and Sa Martín de Trebellu (autonomous community of Extremadura, near the border with Portugal). 10701570 -> 1000007101580: Influence on other languages 10701580 -> 1000007101590: Many languages have borrowed words from Portuguese, such as Indonesian, Sri Lankan Tamil and Sinhalese (see Sri Lanka Indo-Portuguese), Malay, Bengali, English, Hindi, Konkani, Marathi, Tetum, Xitsonga, Papiamentu, Japanese, Bajan Creole (Spoken in Barbados), Lanc-Patuá (spoken in northern Brazil) and Sranan Tongo (spoken in Suriname). 
10701590 -> 1000007101600: It left a strong influence on the língua brasílica, a Tupi-Guarani language which was the most widely spoken in Brazil until the 18th century, and on the language spoken around Sikka in Flores Island, Indonesia. 10701600 -> 1000007101610: In nearby Larantuka, Portuguese is used for prayers in Holy Week rituals. 10701610 -> 1000007101620: The Japanese-Portuguese dictionary Nippo Jisho (1603) was the first dictionary of Japanese in a European language, a product of Jesuit missionary activity in Japan. 10701620 -> 1000007101630: Building on the work of earlier Portuguese missionaries, the Dictionarium Anamiticum, Lusitanum et Latinum (Annamite-Portuguese-Latin dictionary) of Alexandre de Rhodes (1651) introduced the modern orthography of Vietnamese, which is based on the orthography of 17th-century Portuguese. 10701630 -> 1000007101640: The Romanization of Chinese was also influenced by the Portuguese language (among others), particularly regarding Chinese surnames; one example is Mei. 10701640 -> 1000007101650: See also List of English words of Portuguese origin, Loan words in Indonesian, Japanese words of Portuguese origin, Borrowed words in Malay, Sinhala words of Portuguese origin, Loan words from Portuguese in Sri Lankan Tamil. 10701650 -> 1000007101660: Derived languages 10701660 -> 1000007101670: Beginning in the 16th century, the extensive contacts between Portuguese travelers and settlers, African slaves, and local populations led to the appearance of many pidgins with varying amounts of Portuguese influence. 10701670 -> 1000007101680: As these pidgins became the mother tongue of succeeding generations, they evolved into fully fledged creole languages, which remained in use in many parts of Asia and Africa until the 18th century. 10701680 -> 1000007101690: Some Portuguese-based or Portuguese-influenced creoles are still spoken today, by over 3 million people worldwide, especially people of partial Portuguese ancestry. 10701690 -> 1000007101700: Phonology 10701700 -> 1000007101710: There is a maximum of 9 oral vowels and 19 consonants, though some varieties of the language have fewer phonemes (Brazilian Portuguese has only 8 oral vowel phones). 10701710 -> 1000007101720: There are also five nasal vowels, which some linguists regard as allophones of the oral vowels, ten oral diphthongs, and five nasal diphthongs. 10701720 -> 1000007101730: Vowels 10701730 -> 1000007101740: To the seven vowels of Vulgar Latin, European Portuguese has added two near central vowels, one of which tends to be elided in rapid speech, like the e caduc of French (represented either as {(IPA+/ɯ̽/+/ɯ̽/)}, or {(IPA+/ɨ/+/ɨ/)}, or {(IPA+/ə/+/ə/)}). 10701740 -> 1000007101750: The high vowels {(IPA+/e o/+/e o/)} and the low vowels {(IPA+/ɛ ɔ/+/ɛ ɔ/)} are four distinct phonemes, and they alternate in various forms of apophony. 10701750 -> 1000007101760: Like Catalan, Portuguese uses vowel quality to contrast stressed syllables with unstressed syllables: isolated vowels tend to be raised, and in some cases centralized, when unstressed. 10701760 -> 1000007101770: Nasal diphthongs occur mostly at the end of words. 10701770 -> 1000007101780: Consonants 10701780 -> 1000007101790: The consonant inventory of Portuguese is fairly conservative. 
10701790 -> 1000007101800: The medieval affricates {(IPA+/ts/+/ts/)}, {(IPA+/dz/+/dz/)}, {(IPA+/tʃ/+/tʃ/)}, {(IPA+/dʒ/+/dʒ/)} merged with the fricatives {(IPA+/s/+/s/)}, {(IPA+/z/+/z/)}, {(IPA+/ʃ/+/ʃ/)}, {(IPA+/ʒ/+/ʒ/)}, respectively, but not with each other, and there were no other significant changes to the consonant phonemes since then. 10701800 -> 1000007101810: However, some remarkable dialectal variants and allophones have appeared, among which: 10701810 -> 1000007101820: In many regions of Brazil, {(IPA+/t/+/t/)} and {(IPA+/d/+/d/)} have the affricate allophones {(IPA+[tʃ]+[tʃ])} and {(IPA+[dʒ]+[dʒ])}, respectively, before {(IPA+/i/+/i/)} and {(IPA+/ĩ/+/ĩ/)}. 10701820 -> 1000007101830: (Quebec French has a similar phenomenon, with alveolar affricates instead of postalveolars. 10701830 -> 1000007101840: Japanese is another example). 10701840 -> 1000007101850: At the end of a syllable, the phoneme {(IPA+/l/+/l/)} has the allophone {(IPA+[u̯]+[u̯])} in Brazilian Portuguese (L-vocalization). 10701850 -> 1000007101860: In many parts of Brazil and Angola, intervocalic {(IPA+/ɲ/+/ɲ/)} is pronounced as a nasalized palatal approximant {(IPA+[j̃]+[j̃])} which nasalizes the preceding vowel, so that for instance {(IPA+/ˈniɲu/+/ˈniɲu/)} is pronounced {(IPA+[ˈnĩj̃u]+[ˈnĩj̃u])}. 10701860 -> 1000007101870: In most of Brazil, the alveolar sibilants {(IPA+/s/+/s/)} and {(IPA+/z/+/z/)} occur in complementary distribution at the end of syllables, depending on whether the consonant that follows is voiceless or voiced, as in English. 10701870 -> 1000007101880: But in most of Portugal and parts of Brazil sibilants are postalveolar at the end of syllables, {(IPA+/ʃ/+/ʃ/)} before voiceless consonants, and {(IPA+/ʒ/+/ʒ/)} before voiced consonants (in Judeo-Spanish, {(IPA+/s/+/s/)} is often replaced with {(IPA+/ʃ/+/ʃ/)} at the end of syllables, too). 10701880 -> 1000007101890: There is considerable dialectal variation in the value of the rhotic phoneme {(IPA+/ʁ/+/ʁ/)}. 10701890 -> 1000007101900: See Guttural R in Portuguese, for details. 10701900 -> 1000007101910: Grammar 10701910 -> 1000007101920: A particularly interesting aspect of the grammar of Portuguese is the verb. 10701920 -> 1000007101930: Morphologically, more verbal inflections from classical Latin have been preserved by Portuguese than any other major Romance language. 10701930 -> 1000007101940: See Romance copula, for a detailed comparison. 10701940 -> 1000007101950: It has also some innovations not found in other Romance languages (except Galician and the Fala): 10701950 -> 1000007101960: The present perfect tense has an iterative sense unique among the Romance languages. 10701960 -> 1000007101970: It denotes an action or a series of actions which began in the past and are expected to keep repeating in the future. 10701970 -> 1000007101980: For instance, the sentence Tenho tentado falar com ela would be translated to "I have been trying to talk to her", not "I have tried to talk to her". 10701980 -> 1000007101990: On the other hand, the correct translation of the question "Have you heard the latest news?" is not *Tem ouvido a última notícia?, but Ouviu a última notícia?, since no repetition is implied. 10701990 -> 1000007102000: The future subjunctive tense, which was developed by medieval West Iberian Romance, but has now fallen into disuse in Spanish, is still used in vernacular Portuguese. 
10702000 -> 1000007102010: It appears in dependent clauses that denote a condition which must be fulfilled in the future, so that the independent clause will occur. 10702010 -> 1000007102020: Other languages normally employ the present tense under the same circumstances: 10702020 -> 1000007102030: Se for eleito presidente, mudarei a lei. 10702030 -> 1000007102040: If I am elected president, I will change the law. 10702040 -> 1000007102050: Quando fores mais velho, vais entender. 10702050 -> 1000007102060: When you are older, you will understand. 10702060 -> 1000007102070: The personal infinitive: infinitives can inflect according to their subject in person and number, often showing who is expected to perform a certain action; cf. É melhor voltares "It is better [for you] to go back," É melhor voltarmos "It is better [for us] to go back." 10702070 -> 1000007102080: Perhaps for this reason, infinitive clauses replace subjunctive clauses more often in Portuguese than in other Romance languages. 10702080 -> 1000007102090: Writing system 10702090 -> 1000007102100: Portuguese is written with the Latin alphabet, making use of five diacritics to denote stress, vowel height, contraction, nasalization, and other sound changes (acute accent, grave accent, circumflex accent, tilde, and cedilla). 10702100 -> 1000007102110: Brazilian Portuguese also uses the diaeresis mark. 10702110 -> 1000007102120: Accented characters and digraphs are not counted as separate letters for collation purposes. 10702120 -> 1000007102130: Brazilian vs. European spelling 10702130 -> 1000007102140: There are some minor differences between the orthographies of Brazil and other Portuguese language countries. 10702140 -> 1000007102150: One of the most pervasive is the use of acute accents in the European/African/Asian orthography in many words such as sinónimo, where the Brazilian orthography has a circumflex accent, sinônimo. 10702150 -> 1000007102160: Another important difference is that Brazilian spelling often lacks c or p before c, ç, or t, where the European orthography has them; for example, cf. Brazilian fato with European facto, "fact", or Brazilian objeto with European objecto, "object". 10702160 -> 1000007102170: Some of these spelling differences reflect differences in the pronunciation of the words, but others are merely graphic. 10702170 -> None: Examples 10702180 -> None: Excerpt from the Portuguese national epic Os Lusíadas, by author Luís de Camões (I, 33) Predictive analytics 10710010 -> 1000007200020: Predictive analytics 10710020 -> 1000007200030: Predictive analytics encompasses a variety of techniques from statistics and data mining that analyze current and historical data to make predictions about future events. 10710030 -> 1000007200040: Such predictions rarely take the form of absolute statements, and are more likely to be expressed as values that correspond to the odds of a particular event or behavior taking place in the future. 10710040 -> 1000007200050: In business, predictive models exploit patterns found in historical and transactional data to identify risks and opportunities. 10710050 -> 1000007200060: Models capture relationships among many factors to allow assessment of risk or potential associated with a particular set of conditions, guiding decision making for candidate transactions. 10710060 -> 1000007200070: One of the most well-known applications is credit scoring, which is used throughout financial services. 
10710070 -> 1000007200080: Scoring models process a customer’s credit history, loan application, customer data, etc., in order to rank-order individuals by their likelihood of making future credit payments on time. 10710080 -> 1000007200090: Predictive analytics are also used in insurance, telecommunications, retail, travel, healthcare, pharmaceuticals and other fields. 10710090 -> 1000007200100: Types of predictive analytics 10710100 -> 1000007200110: Generally, predictive analytics is used to mean predictive modeling, scoring of predictive models, and forecasting. 10710110 -> 1000007200120: However, people are increasingly using the term to describe related analytic disciplines, such as descriptive modeling and decision modeling or optimization. 10710120 -> 1000007200130: These disciplines also involve rigorous data analysis, and are widely used in business for segmentation and decision making, but have different purposes and the statistical techniques underlying them vary. 10710130 -> 1000007200140: Predictive models 10710140 -> 1000007200150: Predictive models analyze past performance to assess how likely a customer is to exhibit a specific behavior in the future in order to improve marketing effectiveness. 10710150 -> 1000007200160: This category also encompasses models that seek out subtle data patterns to answer questions about customer performance, such as fraud detection models. 10710160 -> 1000007200170: Predictive models often perform calculations during live transactions, for example, to evaluate the risk or opportunity of a given customer or transaction, in order to guide a decision. 10710170 -> 1000007200180: Descriptive models 10710180 -> 1000007200190: Descriptive models “describe” relationships in data in a way that is often used to classify customers or prospects into groups. 10710190 -> 1000007200200: Unlike predictive models that focus on predicting a single customer behavior (such as credit risk), descriptive models identify many different relationships between customers or products. 10710200 -> 1000007200210: But the descriptive models do not rank-order customers by their likelihood of taking a particular action the way predictive models do. 10710210 -> 1000007200220: Descriptive models are often used “offline,” for example, to categorize customers by their product preferences and life stage. 10710220 -> 1000007200230: Descriptive modeling tools can be utilized to develop agent based models that can simulate large number of individualized agents to predict possible futures. 10710230 -> 1000007200240: Decision models 10710240 -> 1000007200250: Decision models describe the relationship between all the elements of a decision — the known data (including results of predictive models), the decision and the forecast results of the decision — in order to predict the results of decisions involving many variables. 10710250 -> 1000007200260: These models can be used in optimization, a data-driven approach to improving decision logic that involves maximizing certain outcomes while minimizing others. 10710260 -> 1000007200270: Decision models are generally used offline, to develop decision logic or a set of business rules that will produce the desired action for every customer or circumstance. 10710270 -> 1000007200280: Predictive analytics 10710280 -> 1000007200290: Definition 10710290 -> 1000007200300: Predictive analytics is an area of statistical analysis that deals with extracting information from data and using it to predict future trends and behavior patterns. 
10710300 -> 1000007200310: The core of predictive analytics relies on capturing relationships between explanatory variables and the predicted variables from past occurrences, and exploiting it to predict future outcomes. 10710310 -> 1000007200320: Current uses 10710320 -> 1000007200330: Although predictive analytics can be put to use in many applications, we outline a few examples where predictive analytics has shown positive impact in recent years. 10710330 -> 1000007200340: Analytical Customer Relationship Management (CRM) 10710340 -> 1000007200350: Analytical Customer Relationship Management is a frequent commercial application of Predictive Analysis. 10710350 -> 1000007200360: Methods of predictive analysis are applied to customer data to pursue CRM objectives. 10710360 -> 1000007200370: Direct marketing 10710370 -> 1000007200380: Product marketing is constantly faced with the challenge of coping with the increasing number of competing products, different consumer preferences and the variety of methods (channels) available to interact with each consumer. 10710380 -> 1000007200390: Efficient marketing is a process of understanding the amount of variability and tailoring the marketing strategy for greater profitability. 10710390 -> 1000007200400: Predictive analytics can help identify consumers with a higher likelihood of responding to a particular marketing offer. 10710400 -> 1000007200410: Models can be built using data from consumers’ past purchasing history and past response rates for each channel. 10710410 -> 1000007200420: Additional information about the consumers demographic, geographic and other characteristics can be used to make more accurate predictions. 10710420 -> 1000007200430: Targeting only these consumers can lead to substantial increase in response rate which can lead to a significant reduction in cost per acquisition. 10710430 -> 1000007200440: Apart from identifying prospects, predictive analytics can also help to identify the most effective combination of products and marketing channels that should be used to target a given consumer. 10710440 -> 1000007200450: Cross-sell 10710450 -> 1000007200460: Often corporate organizations collect and maintain abundant data (e.g. customer records, sale transactions) and exploiting hidden relationships in the data can provide a competitive advantage to the organization. 10710460 -> 1000007200470: For an organization that offers multiple products, an analysis of existing customer behavior can lead to efficient cross sell of products. 10710470 -> 1000007200480: This directly leads to higher profitability per customer and strengthening of the customer relationship. 10710480 -> 1000007200490: Predictive analytics can help analyze customers’ spending, usage and other behavior, and help cross-sell the right product at the right time. 10710490 -> 1000007200500: Customer retention 10710500 -> 1000007200510: With the amount of competing services available, businesses need to focus efforts on maintaining continuous consumer satisfaction. 10710510 -> 1000007200520: In such a competitive scenario, consumer loyalty needs to be rewarded and customer attrition needs to be minimized. 10710520 -> 1000007200530: Businesses tend to respond to customer attrition on a reactive basis, acting only after the customer has initiated the process to terminate service. 10710530 -> 1000007200540: At this stage, the chance of changing the customer’s decision is almost impossible. 
10710540 -> 1000007200550: Proper application of predictive analytics can lead to a more proactive retention strategy. 10710550 -> 1000007200560: By a frequent examination of a customer’s past service usage, service performance, spending and other behavior patterns, predictive models can determine the likelihood of a customer wanting to terminate service sometime in the near future. 10710560 -> 1000007200570: An intervention with lucrative offers can increase the chance of retaining the customer. 10710570 -> 1000007200580: Silent attrition is the behavior of a customer to slowly but steadily reduce usage and is another problem faced by many companies. 10710580 -> 1000007200590: Predictive analytics can also predict this behavior accurately and before it occurs, so that the company can take proper actions to increase customer activity. 10710590 -> 1000007200600: Underwriting 10710600 -> 1000007200610: Many businesses have to account for risk exposure due to their different services and determine the cost needed to cover the risk. 10710610 -> 1000007200620: For example, auto insurance providers need to accurately determine the amount of premium to charge to cover each automobile and driver. 10710620 -> 1000007200630: A financial company needs to assess a borrower’s potential and ability to pay before granting a loan. 10710630 -> 1000007200640: For a health insurance provider, predictive analytics can analyze a few years of past medical claims data, as well as lab, pharmacy and other records where available, to predict how expensive an enrollee is likely to be in the future. 10710640 -> 1000007200650: Predictive analytics can help underwriting of these quantities by predicting the chances of illness, default, bankruptcy, etc. 10710650 -> 1000007200660: Predictive analytics can streamline the process of customer acquisition, by predicting the future risk behavior of a customer using application level data. 10710660 -> 1000007200670: Proper predictive analytics can lead to proper pricing decisions, which can help mitigate future risk of default. 10710670 -> 1000007200680: Collection analytics 10710680 -> 1000007200690: Every portfolio has a set of delinquent customers who do not make their payments on time. 10710690 -> 1000007200700: The financial institution has to undertake collection activities on these customers to recover the amounts due. 10710700 -> 1000007200710: A lot of collection resources are wasted on customers who are difficult or impossible to recover. 10710710 -> 1000007200720: Predictive analytics can help optimize the allocation of collection resources by identifying the most effective collection agencies, contact strategies, legal actions and other strategies to each customer, thus significantly increasing recovery at the same time reducing collection costs. 10710720 -> 1000007200730: Fraud detection 10710730 -> 1000007200740: Fraud is a big problem for many businesses and can be of various types. 10710740 -> 1000007200750: Inaccurate credit applications, fraudulent transactions, identity thefts and false insurance claims are some examples of this problem. 10710750 -> 1000007200760: These problems plague firms all across the spectrum and some examples of likely victims are credit card issuers, insurance companies, retail merchants, manufacturers, business to business suppliers and even services providers. 10710760 -> 1000007200770: This is an area where a predictive model is often used to help weed out the “bads” and reduce a business's exposure to fraud. 
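As a concrete illustration of the scoring idea that runs through these applications (credit scoring, attrition, collections, fraud), the sketch below fits a simple logistic model to synthetic historical outcomes and then rank-orders new cases by their predicted probability of a "bad" event. It is a minimal example assuming NumPy and scikit-learn are available; the data and the notion of a "bad event" are invented purely for illustration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Synthetic "historical" data: two behavioural features per customer
    # and a binary outcome (1 = bad event such as default or fraud).
    n = 1000
    X_hist = rng.normal(size=(n, 2))
    logits = 1.5 * X_hist[:, 0] - 1.0 * X_hist[:, 1] - 0.5
    y_hist = rng.binomial(1, 1 / (1 + np.exp(-logits)))

    # Fit the scoring model on past outcomes.
    model = LogisticRegression().fit(X_hist, y_hist)

    # Score new, unseen customers and rank them from riskiest to safest,
    # e.g. to prioritise fraud review or collection effort.
    X_new = rng.normal(size=(5, 2))
    scores = model.predict_proba(X_new)[:, 1]
    ranking = np.argsort(-scores)
    for rank, idx in enumerate(ranking, start=1):
        print(f"rank {rank}: customer {idx}, predicted bad-event probability {scores[idx]:.2f}")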
10710770 -> 1000007200780: Portfolio, product or economy level prediction 10710780 -> 1000007200790: Often the focus of analysis is not the consumer but the product, portfolio, firm, industry or even the economy. 10710790 -> 1000007200800: For example, a retailer might be interested in predicting store level demand for inventory management purposes. 10710800 -> 1000007200810: Or the Federal Reserve Board might be interested in predicting the unemployment rate for the next year. 10710810 -> 1000007200820: These types of problems can be addressed by predictive analytics using time series techniques (see below). 10710830 -> 1000007200840: Statistical techniques 10710840 -> 1000007200850: The approaches and techniques used to conduct predictive analytics can broadly be grouped into regression techniques and machine learning techniques. 10710850 -> 1000007200860: Regression Techniques 10710860 -> 1000007200870: Regression models are the mainstay of predictive analytics. 10710870 -> 1000007200880: The focus lies on establishing a mathematical equation as a model to represent the interactions between the different variables in consideration. 10710880 -> 1000007200890: Depending on the situation, there is a wide variety of models that can be applied while performing predictive analytics. 10710890 -> 1000007200900: Some of them are briefly discussed below. 10710900 -> 1000007200910: Linear Regression Model 10710910 -> 1000007200920: The linear regression model analyzes the relationship between the response or dependent variable and a set of independent or predictor variables. 10710920 -> 1000007200930: This relationship is expressed as an equation that predicts the response variable as a linear function of the parameters. 10710930 -> 1000007200940: These parameters are adjusted so that a measure of fit is optimized. 10710940 -> 1000007200950: Much of the effort in model fitting is focused on minimizing the size of the residual, as well as ensuring that it is randomly distributed with respect to the model predictions. 10710950 -> 1000007200960: The goal of regression is to select the parameters of the model so as to minimize the sum of the squared residuals. 10710960 -> 1000007200970: This is referred to as ordinary least squares (OLS) estimation and results in best linear unbiased estimates (BLUE) of the parameters provided the Gauss-Markov assumptions are satisfied. 10710970 -> 1000007200980: Once the model has been estimated, we would be interested to know whether the predictor variables belong in the model – i.e. is the estimate of each variable’s contribution reliable? 10710980 -> 1000007200990: To do this, we can check the statistical significance of the model’s coefficients, which can be measured using the t-statistic. 10710990 -> 1000007201000: This amounts to testing whether the coefficient is significantly different from zero. 10711000 -> 1000007201010: How well the model predicts the dependent variable based on the value of the independent variables can be assessed by using the R² statistic. 10711010 -> 1000007201020: It measures the predictive power of the model, i.e. the proportion of the total variation in the dependent variable that is “explained” (accounted for) by variation in the independent variables. 10711020 -> 1000007201030: Discrete choice models 10711030 -> 1000007201040: Multivariate regression (above) is generally used when the response variable is continuous and has an unbounded range.
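For that continuous case, the sketch below illustrates on synthetic data the quantities discussed above for the linear model: the OLS coefficient estimates, their t-statistics, and the R² of the fit. It is a toy example assuming only NumPy, not a full diagnostic workflow, and the data are invented.

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic data: y depends linearly on two predictors plus noise.
    n = 200
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(scale=1.0, size=n)

    # Design matrix with an intercept column.
    X = np.column_stack([np.ones(n), x1, x2])

    # Ordinary least squares: minimise the sum of squared residuals.
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta

    # Residual variance and standard errors of the coefficients.
    dof = n - X.shape[1]
    sigma2 = residuals @ residuals / dof
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)
    se = np.sqrt(np.diag(cov_beta))

    # t-statistics (coefficient divided by its standard error) and R-squared.
    t_stats = beta / se
    ss_res = residuals @ residuals
    ss_tot = ((y - y.mean()) ** 2).sum()
    r_squared = 1 - ss_res / ss_tot

    print("coefficients:", beta)
    print("t-statistics:", t_stats)
    print("R-squared:   ", r_squared)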
10711040 -> 1000007201050: Often the response variable may not be continuous but rather discrete. 10711050 -> 1000007201060: While mathematically it is feasible to apply multivariate regression to discrete ordered dependent variables, some of the assumptions behind the theory of multivariate linear regression no longer hold, and there are other techniques such as discrete choice models which are better suited for this type of analysis. 10711060 -> 1000007201070: If the dependent variable is discrete, some of those superior methods are logistic regression, multinomial logit and probit models. 10711070 -> 1000007201080: Logistic regression and probit models are used when the dependent variable is binary. 10711080 -> 1000007201090: Logistic regression 10711090 -> 1000007201100: In a classification setting, assigning outcome probabilities to observations can be achieved through the use of a logistic model, which is basically a method that transforms information about the binary dependent variable into an unbounded continuous variable and estimates a regular multivariate model (see Allison’s Logistic Regression for more information on the theory of logistic regression). 10711100 -> 1000007201110: The Wald and likelihood-ratio tests are used to test the statistical significance of each coefficient b in the model (analogous to the t tests used in OLS regression; see above). 10711110 -> 1000007201120: A test assessing the goodness-of-fit of a classification model is the Hosmer and Lemeshow test. 10711120 -> 1000007201130: Multinomial logistic regression 10711130 -> 1000007201140: An extension of the binary logit model to cases where the dependent variable has more than two categories is the multinomial logit model. 10711140 -> 1000007201150: In such cases, collapsing the data into two categories might not make good sense or may lead to a loss in the richness of the data. 10711150 -> 1000007201160: The multinomial logit model is the appropriate technique in these cases, especially when the dependent variable categories are not ordered (for example, colors like red, blue and green). 10711160 -> 1000007201170: Some authors have extended multinomial regression to include feature selection/importance methods such as Random multinomial logit. 10711170 -> 1000007201180: Probit regression 10711180 -> 1000007201190: Probit models offer an alternative to logistic regression for modeling categorical dependent variables. 10711190 -> 1000007201200: Even though the outcomes tend to be similar, the underlying distributions are different. 10711200 -> 1000007201210: Probit models are popular in social sciences like economics. 10711210 -> 1000007201220: A good way to understand the key difference between probit and logit models is to assume that there is a latent variable z. 10711220 -> 1000007201230: We do not observe z but instead observe y, which takes the value 0 or 1. 10711230 -> 1000007201240: In the logit model we assume that the error term of z follows a logistic distribution. 10711240 -> 1000007201250: In the probit model we assume that the error term of z follows a standard normal distribution. 10711250 -> 1000007201260: Note that in social sciences (for example, economics), probit is often used to model situations where the observed variable y is continuous but takes values between 0 and 1. 10711260 -> 1000007201270: Logit vs. Probit 10711270 -> 1000007201280: The probit model has been around longer than the logit model. 10711280 -> 1000007201290: They look identical, except that the logistic distribution tends to have slightly flatter tails.
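The closeness of the two link functions can be seen directly by comparing the standard normal CDF with a logistic CDF rescaled to unit variance (scale of roughly the square root of 3 divided by pi). A small sketch, assuming SciPy is available:

    import numpy as np
    from scipy.stats import norm, logistic

    # The latent-variable story: y = 1 when the latent variable crosses zero.
    # Probit assumes a standard normal error, logit assumes a logistic error.
    # Rescaling the logistic to unit variance (scale = sqrt(3)/pi) shows how
    # close the two link functions are, apart from the slightly fatter tails.
    x = np.linspace(-4, 4, 9)
    probit_link = norm.cdf(x)
    logit_link = logistic.cdf(x, scale=np.sqrt(3) / np.pi)

    for xi, p, l in zip(x, probit_link, logit_link):
        print(f"x = {xi:5.1f}   probit {p:.4f}   logit {l:.4f}   diff {p - l:+.4f}")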
10711290 -> 1000007201300: In fact one of the reasons the logit model was formulated was that the probit model was extremely hard to compute because it involved calculating difficult integrals. 10711300 -> 1000007201310: Modern computing however has made this computation fairly simple. 10711310 -> 1000007201320: The coefficients obtained from the logit and probit model are also fairly close. 10711320 -> 1000007201330: However the odds ratio makes the logit model easier to interpret. 10711330 -> 1000007201340: For practical purposes the only reasons for choosing the probit model over the logistic model would be: 10711340 -> 1000007201350: There is a strong belief that the underlying distribution is normal 10711350 -> 1000007201360: The actual event is not a binary outcome (e.g. Bankrupt/not bankrupt) but a proportion (e.g. Proportion of population at different debt levels). 10711360 -> 1000007201370: Time series models 10711370 -> 1000007201380: Time series models are used for predicting or forecasting the future behavior of variables. 10711380 -> 1000007201390: These models account for the fact that data points taken over time may have an internal structure (such as autocorrelation, trend or seasonal variation) that should be accounted for. 10711390 -> 1000007201400: As a result standard regression techniques cannot be applied to time series data and methodology has been developed to decompose the trend, seasonal and cyclical component of the series. 10711400 -> 1000007201410: Modeling the dynamic path of a variable can improve forecasts since the predictable component of the series can be projected into the future. 10711410 -> 1000007201420: Time series models estimate difference equations containing stochastic components. 10711420 -> 1000007201430: Two commonly used forms of these models are autoregressive models (AR) and moving average (MA) models. 10711430 -> 1000007201440: The Box-Jenkins methodology (1976) developed by George Box and G.M. Jenkins combines the AR and MA models to produce the ARMA (autoregressive moving average) model which is the cornerstone of stationary time series analysis. 10711440 -> 1000007201450: ARIMA (autoregressive integrated moving average models) on the other hand are used to describe non-stationary time series. 10711450 -> 1000007201460: Box and Jenkins suggest differencing a non stationary time series to obtain a stationary series to which an ARMA model can be applied. 10711460 -> 1000007201470: Non stationary time series have a pronounced trend and do not have a constant long-run mean or variance. 10711470 -> 1000007201480: Box and Jenkins proposed a three stage methodology which includes: model identification, estimation and validation. 10711480 -> 1000007201490: The identification stage involves identifying if the series is stationary or not and the presence of seasonality by examining plots of the series, autocorrelation and partial autocorrelation functions. 10711490 -> 1000007201500: In the estimation stage, models are estimated using non-linear time series or maximum likelihood estimation procedures. 10711500 -> 1000007201510: Finally the validation stage involves diagnostic checking such as plotting the residuals to detect outliers and evidence of model fit. 
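A toy version of that identify/estimate/validate loop, assuming only NumPy: a non-stationary series is first-differenced, an AR(1) model is fitted to the differences by least squares, and the lag-1 autocorrelation of the residuals is checked as a crude diagnostic. Real Box-Jenkins modelling would examine the full autocorrelation and partial autocorrelation functions and use formal estimation and tests; the series here is synthetic.

    import numpy as np

    rng = np.random.default_rng(2)

    # A non-stationary series built as the cumulative sum of an AR(1) process
    # with drift, i.e. an ARIMA(1,1,0)-style series.
    n = 300
    eps = rng.normal(size=n)
    d_true = np.zeros(n)
    for t in range(1, n):
        d_true[t] = 0.2 + 0.6 * d_true[t - 1] + eps[t]
    series = np.cumsum(d_true)

    # Identification (simplified): difference once to obtain a stationary series.
    d = np.diff(series)

    # Estimation: fit d[t] = c + phi * d[t-1] by least squares.
    X = np.column_stack([np.ones(len(d) - 1), d[:-1]])
    y = d[1:]
    c, phi = np.linalg.lstsq(X, y, rcond=None)[0]

    # Validation (crude): residuals should show little remaining autocorrelation.
    resid = y - X @ np.array([c, phi])
    lag1_corr = np.corrcoef(resid[:-1], resid[1:])[0, 1]

    print(f"estimated AR(1) coefficient on the differenced series: {phi:.3f}")
    print(f"lag-1 autocorrelation of residuals: {lag1_corr:.3f}")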
10711510 -> 1000007201520: In recent years time series models have become more sophisticated and attempt to model conditional heteroskedasticity with models such as ARCH (autoregressive conditional heteroskedasticity) and GARCH (generalized autoregressive conditional heteroskedasticity) models frequently used for financial time series. 10711520 -> 1000007201530: In addition time series models are also used to understand inter-relationships among economic variables represented by systems of equations using VAR (vector autoregression) and structural VAR models. 10711530 -> 1000007201540: Survival or duration analysis 10711540 -> 1000007201550: Survival analysis is another name for time to event analysis. 10711550 -> 1000007201560: These techniques were primarily developed in the medical and biological sciences, but they are also widely used in the social sciences like economics, as well as in engineering (reliability and failure time analysis). 10711560 -> 1000007201570: Censoring and non-normality which are characteristic of survival data generate difficulty when trying to analyze the data using conventional statistical models such as multiple linear regression. 10711570 -> 1000007201580: The Normal distribution, being a symmetric distribution, takes positive as well as negative values, but duration by its very nature cannot be negative and therefore normality cannot be assumed when dealing with duration/survival data. 10711580 -> 1000007201590: Hence the normality assumption of regression models is violated. 10711590 -> 1000007201600: A censored observation is defined as an observation with incomplete information. 10711600 -> 1000007201610: Censoring introduces distortions into traditional statistical methods and is essentially a defect of the sample data. 10711610 -> 1000007201620: The assumption is that if the data were not censored it would be representative of the population of interest. 10711620 -> 1000007201630: In survival analysis, censored observations arise whenever the dependent variable of interest represents the time to a terminal event, and the duration of the study is limited in time. 10711630 -> 1000007201640: An important concept in survival analysis is the hazard rate. 10711640 -> 1000007201650: The hazard rate is defined as the probability that the event will occur at time t conditional on surviving until time t. 10711650 -> 1000007201660: Another concept related to the hazard rate is the survival function which can be defined as the probability of surviving to time t. 10711660 -> 1000007201670: Most models try to model the hazard rate by choosing the underlying distribution depending on the shape of the hazard function. 10711670 -> 1000007201680: A distribution whose hazard function slopes upward is said to have positive duration dependence, a decreasing hazard shows negative duration dependence whereas constant hazard is a process with no memory usually characterized by the exponential distribution. 10711680 -> 1000007201690: Some of the distributional choices in survival models are: F, gamma, Weibull, log normal, inverse normal, exponential etc. 10711690 -> 1000007201700: All these distributions are for a non-negative random variable. 10711700 -> 1000007201710: Duration models can be parametric, non-parametric or semi-parametric. 10711710 -> 1000007201720: Some of the models commonly used are Kaplan-Meier, Cox proportional hazard model (non parametric). 
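As an illustration of how censoring is handled, the sketch below computes the Kaplan-Meier (product-limit) estimate of the survival function for a small set of durations, some of which are right-censored. It assumes only NumPy, and the durations are invented toy data.

    import numpy as np

    # Durations (e.g. months until churn) and an event flag:
    # 1 = the terminal event was observed, 0 = right-censored
    # (the study ended before the event occurred).
    time  = np.array([2, 3, 3, 5, 7, 8, 8, 10, 12, 12])
    event = np.array([1, 1, 0, 1, 1, 0, 1,  1,  0,  1])

    # Kaplan-Meier product-limit estimator: at each observed event time t,
    # multiply the running survival probability by (1 - d_t / n_t), where
    # d_t is the number of events at t and n_t the number still at risk.
    survival = 1.0
    for t in np.unique(time[event == 1]):
        at_risk = np.sum(time >= t)
        events_at_t = np.sum((time == t) & (event == 1))
        survival *= 1 - events_at_t / at_risk
        print(f"t = {t:2d}: at risk {at_risk:2d}, events {events_at_t}, S(t) = {survival:.3f}")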
10711720 -> 1000007201730: Classification and regression trees 10711730 -> 1000007201740: Classification and regression trees (CART) is a non-parametric technique that produces either classification or regression trees, depending on whether the dependent variable is categorical or numeric, respectively. 10711740 -> 1000007201750: Trees are formed by a collection of rules based on values of certain variables in the modeling data set 10711750 -> 1000007201760: Rules are selected based on how well splits based on variables’ values can differentiate observations based on the dependent variable 10711760 -> 1000007201770: Once a rule is selected and splits a node into two, the same logic is applied to each “child” node (i.e. it is a recursive procedure) 10711770 -> 1000007201780: Splitting stops when CART detects no further gain can be made, or some pre-set stopping rules are met 10711780 -> 1000007201790: Each branch of the tree ends in a terminal node 10711790 -> 1000007201800: Each observation falls into one and exactly one terminal node 10711800 -> 1000007201810: Each terminal node is uniquely defined by a set of rules 10711810 -> 1000007201820: A very popular method for predictive analytics is Leo Breiman's Random forests or derived versions of this technique like Random multinomial logit. 10711820 -> 1000007201830: Multivariate adaptive regression splines 10711830 -> 1000007201840: Multivariate adaptive regression splines (MARS) is a non-parametric technique that builds flexible models by fitting piecewise linear regressions. 10711840 -> 1000007201850: An important concept associated with regression splines is that of a knot. 10711850 -> 1000007201860: Knot is where one local regression model gives way to another and thus is the point of intersection between two splines. 10711860 -> 1000007201870: In multivariate and adaptive regression splines, basis functions are the tool used for generalizing the search for knots. 10711870 -> 1000007201880: Basis functions are a set of functions used to represent the information contained in one or more variables. 10711880 -> 1000007201890: Multivariate and Adaptive Regression Splines model almost always creates the basis functions in pairs. 10711890 -> 1000007201900: Multivariate and adaptive regression spline approach deliberately overfits the model and then prunes to get to the optimal model. 10711900 -> 1000007201910: The algorithm is computationally very intensive and in practice we are required to specify an upper limit on the number of basis functions. 10711910 -> 1000007201920: Machine learning techniques 10711920 -> 1000007201930: Machine learning, a branch of artificial intelligence, was originally employed to develop techniques to enable computers to learn. 10711930 -> 1000007201940: Today, since it includes a number of advanced statistical methods for regression and classification, it finds application in a wide variety of fields including medical diagnostics, credit card fraud detection, face and speech recognition and analysis of the stock market. 10711940 -> 1000007201950: In certain applications it is sufficient to directly predict the dependent variable without focusing on the underlying relationships between variables. 10711950 -> 1000007201960: In other cases, the underlying relationships can be very complex and the mathematical form of the dependencies unknown. 10711960 -> 1000007201970: For such cases, machine learning techniques emulate human cognition and learn from training examples to predict future events. 
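As a small, concrete instance of learning from training examples, the sketch below fits the kind of classification tree described above (assuming scikit-learn is available) and prints the learned splitting rules, each path ending in a terminal node. The data and feature names are invented for illustration.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(3)

    # Synthetic training examples: two numeric features and a binary label
    # generated from a simple threshold rule.
    X = rng.uniform(0, 10, size=(200, 2))
    y = ((X[:, 0] > 5) & (X[:, 1] > 3)).astype(int)

    # Fit a shallow tree; each split is chosen for how well it separates the
    # labels, and splitting stops at the depth limit.
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

    # The learned rules: every observation falls into exactly one leaf.
    print(export_text(tree, feature_names=["feature_1", "feature_2"]))
    print("prediction for [7, 4]:", tree.predict([[7.0, 4.0]])[0])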
10711970 -> 1000007201980: A brief discussion of some of these methods used commonly for predictive analytics is provided below. 10711980 -> 1000007201990: A detailed study of machine learning can be found in Mitchell (1997). 10711990 -> 1000007202000: Neural networks 10712000 -> 1000007202010: Neural networks are sophisticated nonlinear modeling techniques that are able to model complex functions. 10712010 -> 1000007202020: They can be applied to problems of prediction, classification or control in a wide spectrum of fields such as finance, cognitive psychology/neuroscience, medicine, engineering, and physics. 10712020 -> 1000007202030: Neural networks are used when the exact nature of the relationship between inputs and output is not known. 10712030 -> 1000007202040: A key feature of neural networks is that they learn the relationship between inputs and output through training. 10712040 -> 1000007202050: There are two types of training used by different networks, supervised and unsupervised, with supervised training being the most common. 10712050 -> 1000007202060: Some examples of neural network training techniques are backpropagation, quick propagation, conjugate gradient descent, projection operator, Delta-Bar-Delta etc. 10712060 -> 1000007202070: These are applied to network architectures such as multilayer perceptrons, Kohonen networks, Hopfield networks, etc. 10712070 -> 1000007202080: Radial basis functions 10712080 -> 1000007202090: A radial basis function (RBF) is a function which has built into it a distance criterion with respect to a center. 10712090 -> 1000007202100: Such functions can be used very efficiently for interpolation and for smoothing of data. 10712100 -> 1000007202110: Radial basis functions have been applied in the area of neural networks where they are used as a replacement for the sigmoidal transfer function. 10712110 -> 1000007202120: Such networks have three layers: the input layer, the hidden layer with the RBF non-linearity, and a linear output layer. 10712120 -> 1000007202130: The most popular choice for the non-linearity is the Gaussian. 10712130 -> 1000007202140: RBF networks have the advantage of not being locked into local minima in the way that feed-forward networks such as the multilayer perceptron are. 10712140 -> 1000007202150: Support vector machines 10712150 -> 1000007202160: Support Vector Machines (SVM) are used to detect and exploit complex patterns in data by clustering, classifying and ranking the data. 10712160 -> 1000007202170: They are learning machines that are used to perform binary classifications and regression estimations. 10712170 -> 1000007202180: They commonly use kernel-based methods to apply linear classification techniques to non-linear classification problems. 10712180 -> 1000007202190: There are a number of types of SVM such as linear, polynomial, sigmoid etc. 10712190 -> 1000007202200: Naïve Bayes 10712200 -> 1000007202210: Naïve Bayes, based on Bayes’ conditional probability rule, is used for performing classification tasks. 10712210 -> 1000007202220: Naïve Bayes assumes the predictors are statistically independent, which makes it an effective classification tool that is easy to interpret. 10712220 -> 1000007202230: It is best employed when faced with the ‘curse of dimensionality’, i.e. when the number of predictors is very high. 10712230 -> 1000007202240: k-nearest neighbours 10712240 -> 1000007202250: The nearest neighbour algorithm (KNN) belongs to the class of pattern recognition statistical methods.
10712250 -> 1000007202260: The method does not impose a priori any assumptions about the distribution from which the modeling sample is drawn. 10712260 -> 1000007202270: It involves a training set with both positive and negative values. 10712270 -> 1000007202280: A new sample is classified by calculating the distance to the nearest neighbouring training case. 10712280 -> 1000007202290: The sign of that point will determine the classification of the sample. 10712290 -> 1000007202300: In the k-nearest neighbour classifier, the k nearest points are considered and the sign of the majority is used to classify the sample. 10712300 -> 1000007202310: The performance of the kNN algorithm is influenced by three main factors: (1) the distance measure used to locate the nearest neighbours; (2) the decision rule used to derive a classification from the k-nearest neighbours; and (3) the number of neighbours used to classify the new sample. 10712310 -> 1000007202320: It can be proved that, unlike other methods, this method is universally asymptotically convergent, i.e. as the size of the training set increases, if the observations are iid, regardless of the distribution from which the sample is drawn, the predicted class will converge to the class assignment that minimizes misclassification error. 10712320 -> 1000007202330: See Devroye et al. 10712330 -> 1000007202340: Popular tools 10712340 -> 1000007202350: There are numerous tools available in the marketplace which help with the execution of predictive analytics. 10712350 -> 1000007202360: These range from those which need very little user sophistication to those that are designed for the expert practitioner. 10712360 -> 1000007202370: The difference between these tools is often in the level of customization and heavy data lifting allowed. 10712370 -> 1000007202380: For traditional statistical modeling, some of the popular tools are DAP/SAS, S-Plus, PSPP/SPSS and Stata. 10712380 -> 1000007202390: For machine learning/data mining applications, KnowledgeSEEKER, KnowledgeSTUDIO, Enterprise Miner, GeneXproTools, Viscovery, Clementine, KXEN Analytic Framework, InforSense and Excel Miner are some of the popularly used options. 10712390 -> 1000007202400: Classification Tree analysis can be performed using CART software. 10712400 -> 1000007202410: SOMine is a predictive analytics tool based on self-organizing maps (SOMs) available from Viscovery Software. 10712410 -> 1000007202420: R is a very powerful tool that can be used to perform almost any kind of statistical analysis, and is freely downloadable. 10712420 -> 1000007202430: WEKA is a freely available open-source collection of machine learning methods for pattern classification, regression, clustering, and some types of meta-learning, which can be used for predictive analytics. 10712430 -> 1000007202440: RapidMiner is another freely available integrated open-source software environment for predictive analytics, data mining, and machine learning, fully integrating WEKA and providing an even larger number of methods for predictive analytics. 10712440 -> 1000007202450: Recently, in an attempt to provide a standard language for expressing predictive models, the Predictive Model Markup Language (PMML) has been proposed. 10712450 -> 1000007202460: Such an XML-based language provides a way for different tools to define predictive models and to share these between PMML-compliant applications.
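Returning to the k-nearest-neighbour rule described above, here is a self-contained sketch in plain NumPy: the distance measure, the number of neighbours k, and the majority-vote decision rule correspond to the three factors listed earlier. The data are synthetic and the function name is invented for illustration.

    import numpy as np

    rng = np.random.default_rng(4)

    # Training set with positive (1) and negative (0) examples.
    X_train = rng.normal(size=(100, 2))
    y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

    def knn_predict(x_new, k=5):
        # Factor (1): the distance measure used to locate neighbours (Euclidean here).
        distances = np.linalg.norm(X_train - x_new, axis=1)
        # Factor (3): the number of neighbours considered.
        nearest = np.argsort(distances)[:k]
        # Factor (2): the decision rule (majority vote among the k neighbours).
        return int(np.round(y_train[nearest].mean()))

    print(knn_predict(np.array([0.5, 0.5])))    # expected to fall in class 1
    print(knn_predict(np.array([-1.0, -0.5])))  # expected to fall in class 0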
10712460 -> 1000007202470: Several tools already produce or consume PMML documents; these include ADAPA, IBM DB2 Warehouse, CART, SAS Enterprise Miner, and SPSS. 10712470 -> 1000007202480: Predictive analytics has also found its way into the IT lexicon, most notably in the area of IT Automation. 10712480 -> 1000007202490: Vendors such as Stratavia with their Data Palette product offer predictive analytics as part of their automation platform, predicting how resources will behave in the future and automating the environment accordingly. 10712490 -> 1000007202500: The widespread use of predictive analytics in industry has led to the proliferation of numerous firms offering productized solutions. 10712500 -> 1000007202510: Some of them are highly specialized (focusing, for example, on fraud detection, automatic sales lead generation or response modeling) in a specific domain (Fair Isaac for credit card scores) or industry verticals (MarketRx in pharmaceuticals). 10712510 -> 1000007202520: Others provide predictive analytics services in support of a wide range of business problems across industry verticals (Fifth C). 10712520 -> 1000007202530: Predictive analytics competitions are also fairly common and often pit academics against industry practitioners (see, for example, the KDD Cup). 10712530 -> 1000007202540: Conclusion 10712540 -> 1000007202550: Predictive analytics adds great value to a business’s decision-making capabilities by allowing it to formulate smart policies on the basis of predictions of future outcomes. 10712550 -> 1000007202560: A broad range of tools and techniques are available for this type of analysis, and their selection is determined by the analytical maturity of the firm as well as the specific requirements of the problem being solved. 10712560 -> 1000007202570: Education 10712570 -> 1000007202580: Predictive analytics is taught at the following institutions: 10712580 -> 1000007202590: Ghent University, Belgium: Master of Marketing Analysis, an 8-month advanced master’s degree taught in English with a strong emphasis on applications of predictive analytics in Analytical CRM. RapidMiner 10720010 -> 1000007300020: RapidMiner 10720020 -> 1000007300030: RapidMiner (formerly YALE, Yet Another Learning Environment) is an environment for machine learning and data mining experiments. 10720030 -> 1000007300040: It allows experiments to be made up of a large number of arbitrarily nestable operators, described in XML files which can easily be created with RapidMiner's graphical user interface. 10720040 -> 1000007300050: Applications of RapidMiner cover both research and real-world data mining tasks. 10720050 -> 1000007300060: The initial version has been developed by the Artificial Intelligence Unit of the University of Dortmund since 2001. 10720060 -> 1000007300070: It is distributed under a GNU license, and has been hosted by SourceForge since 2004. 10720070 -> 1000007300080: RapidMiner provides more than 400 operators for all main machine learning procedures, including input and output, and data preprocessing and visualization. 10720080 -> 1000007300090: It is written in the Java programming language and therefore can work on all popular operating systems. 10720090 -> 1000007300100: It also integrates all learning schemes and attribute evaluators of the Weka learning environment.
10720100 -> 1000007300110: Properties 10720110 -> 1000007300120: Some properties of RapidMiner are: 10720120 -> 1000007300130: written in Java 10720130 -> 1000007300140: knowledge discovery processes are modeled as operator trees 10720140 -> 1000007300150: internal XML representation ensures a standardized interchange format for data mining experiments 10720150 -> 1000007300160: scripting language allows for automatic large-scale experiments 10720160 -> 1000007300170: multi-layered data view concept ensures efficient and transparent data handling 10720170 -> 1000007300180: graphical user interface, command line mode (batch mode), and Java API for using RapidMiner from your own programs 10720180 -> 1000007300190: plugin and extension mechanisms; several plugins already exist 10720190 -> 1000007300200: plotting facility offering a large set of high-dimensional visualization schemes for data and models 10720200 -> 1000007300210: applications include text mining, multimedia mining, feature engineering, data stream mining and tracking drifting concepts, development of ensemble methods, and distributed data mining. Russian language 10730010 -> 1000007400020: Russian language 10730020 -> 1000007400030: Russian ({(Lang+русский язык+ru+русский язык)} (help•info), transliteration: {(Transl+russkiy yazyk+ru+ALA+russkiy yazyk)}, {(IPA-ru+Russian pronunciation: [ˈruskʲɪj jɪˈzɨk]+ˈruskʲɪj jɪˈzɨk)}) is the most geographically widespread language of Eurasia, the most widely spoken of the Slavic languages, and the largest native language in Europe. 10730030 -> 1000007400040: Russian belongs to the family of Indo-European languages and is one of three (or, according to some authorities, four) living members of the East Slavic languages, the others being Belarusian and Ukrainian (and possibly Rusyn, often considered a dialect of Ukrainian). 10730040 -> 1000007400050: It is also spoken in the countries of the Russophone world. 10730050 -> 1000007400060: Written examples of Old East Slavonic are attested from the 10th century onwards. 10730060 -> 1000007400070: Today Russian is widely used outside Russia. 10730070 -> 1000007400080: It is applied as a means of coding and storage of universal knowledge; 60–70% of all world information is published in English and Russian. 10730080 -> 1000007400090: Over a quarter of the world's scientific literature is published in Russian. 10730090 -> 1000007400100: Russian is also a necessary accessory of world communications systems (broadcasts, air- and space communication, etc.). 10730100 -> 1000007400110: Due to the status of the Soviet Union as a superpower, Russian had great political importance in the 20th century. 10730110 -> 1000007400120: Hence, the language is one of the official languages of the United Nations. 10730120 -> 1000007400130: Russian distinguishes between consonant phonemes with palatal secondary articulation and those without, the so-called soft and hard sounds. 10730130 -> 1000007400140: This distinction is found between pairs of almost all consonants and is one of the most distinguishing features of the language. 10730140 -> 1000007400150: Another important aspect is the reduction of unstressed vowels, which is somewhat similar to that of English. 10730150 -> 1000007400160: Stress, which is unpredictable, is not normally indicated orthographically.
10730160 -> 1000007400170: According to the Institute of Russian Language of the Russian Academy of Sciences, an optional acute accent ({(Lang+знак ударения+ru+знак ударения)}) may, and sometimes should, be used to mark stress. 10730170 -> 1000007400180: For example, it is used to distinguish between otherwise identical words, especially when context doesn't make it obvious: замо́к/за́мок (lock/castle), сто́ящий/стоя́щий (worthwhile/standing), чудно́/чу́дно (this is odd/this is marvellous), молоде́ц/мо́лодец (attaboy/fine young man), узна́ю/узнаю́ (I shall learn it/I am learning it), отреза́ть/отре́зать (infinitive for "cut"/perfective for "cut"); to indicate the proper pronunciation of uncommon words, especially personal and family names (афе́ра, гу́ру, Гарси́а, Оле́ша, Фе́рми), and to express the stressed word in the sentence (Ты́ съел печенье?/Ты съе́л печенье?/Ты съел пече́нье? - Was it you who ate the cookie?/Did you eat the cookie?/Was the cookie your meal?). 10730180 -> 1000007400190: Acute accents are mandatory in lexical dictionaries and books intended to be used either by children or by foreign readers. 10730190 -> 1000007400200: Classification 10730200 -> 1000007400210: Russian is a Slavic language in the Indo-European family. 10730210 -> 1000007400220: From the point of view of the spoken language, its closest relatives are Ukrainian and Belarusian, the other two national languages in the East Slavic group. 10730220 -> 1000007400230: In many places in eastern Ukraine and Belarus, these languages are spoken interchangeably, and in certain areas traditional bilingualism resulted in language mixture, e.g. Surzhyk in eastern Ukraine and Trasianka in Belarus. 10730240 -> 1000007400240: An East Slavic Old Novgorod dialect, although it vanished during the fifteenth or sixteenth century, is sometimes considered to have played a significant role in the formation of the modern Russian language. 10730250 -> 1000007400250: The vocabulary (mainly abstract and literary words), principles of word formation, and, to some extent, the inflections and literary style of Russian have also been influenced by Church Slavonic, a developed and partly adopted form of the South Slavic Old Church Slavonic language used by the Russian Orthodox Church. 10730260 -> 1000007400260: However, the East Slavic forms have tended to be used exclusively in the various dialects that are experiencing a rapid decline. 10730270 -> 1000007400270: In some cases, both the East Slavic and the Church Slavonic forms are in use, with slightly different meanings. 10730280 -> 1000007400280: For details, see Russian phonology and History of the Russian language. 10730290 -> 1000007400290: Russian phonology and syntax (especially in northern dialects) have also been influenced to some extent by the numerous Finnic languages of the Finno-Ugric subfamily: Merya, Moksha, Muromian, the language of the Meshchera, Veps, et cetera. 10730300 -> 1000007400300: These languages, some of them now extinct, used to be spoken in the center and in the north of what is now the European part of Russia. 10730310 -> 1000007400310: They came in contact with Eastern Slavic as far back as the early Middle Ages and eventually served as a substratum for the modern Russian language. 10730320 -> 1000007400320: The Russian dialects spoken north, north-east and north-west of Moscow have a considerable number of words of Finno-Ugric origin.
10730330 -> 1000007400330: Over the course of centuries, the vocabulary and literary style of Russian have also been influenced by Turkic/Caucasian/Central Asian languages, as well as Western/Central European languages such as Polish, Latin, Dutch, German, French, and English. 10730340 -> 1000007400340: According to the Defense Language Institute in Monterey, California, Russian is classified as a level III language in terms of learning difficulty for native English speakers, requiring approximately 780 hours of immersion instruction to achieve intermediate fluency. 10730350 -> 1000007400350: It is also regarded by the United States Intelligence Community as a "hard target" language, due to both its difficulty to master for English speakers as well as due to its critical role in American world policy. 10730360 -> 1000007400360: Geographic distribution 10730370 -> 1000007400370: Russian is primarily spoken in Russia and, to a lesser extent, the other countries that were once constituent republics of the USSR. 10730380 -> 1000007400380: Until 1917, it was the sole official language of the Russian Empire. 10730390 -> 1000007400390: During the Soviet period, the policy toward the languages of the various other ethnic groups fluctuated in practice. 10730400 -> 1000007400400: Though each of the constituent republics had its own official language, the unifying role and superior status was reserved for Russian. 10730410 -> 1000007400410: Following the break-up of 1991, several of the newly independent states have encouraged their native languages, which has partly reversed the privileged status of Russian, though its role as the language of post-Soviet national intercourse throughout the region has continued. 10730420 -> 1000007400420: In Latvia, notably, its official recognition and legality in the classroom have been a topic of considerable debate in a country where more than one-third of the population is Russian-speaking, consisting mostly of post-World War II immigrants from Russia and other parts of the former USSR (Belarus, Ukraine). 10730430 -> 1000007400430: Similarly, in Estonia, the Soviet-era immigrants and their Russian-speaking descendants constitute 25,6% of the country's current population and 58,6% of the native Estonian population is also able to speak Russian. 10730440 -> 1000007400440: In all, 67,8% of Estonia's population can speak Russian. 10730450 -> 1000007400450: In Kazakhstan and Kyrgyzstan, Russian remains a co-official language with Kazakh and Kyrgyz respectively. 10730460 -> 1000007400460: Large Russian-speaking communities still exist in northern Kazakhstan, and ethnic Russians comprise 25.6 % of Kazakhstan's population. 10730470 -> 1000007400470: A much smaller Russian-speaking minority in Lithuania has represented less than 1/10 of the country's overall population. 10730480 -> 1000007400480: Nevertheless more than half of the population of the Baltic states are able to hold a conversation in Russian and almost all have at least some familiarity with the most basic spoken and written phrases. 10730490 -> 1000007400490: The Russian control of Finland in 1809–1918, however, has left few Russian speakers in Finland. 10730500 -> 1000007400500: There are 33,400 Russian speakers in Finland, amounting to 0.6% of the population. 10730510 -> 1000007400510: 5000 (0.1%) of them are late 19th century and 20th century immigrants, and the rest are recent immigrants, who have arrived in the 90's and later. 
10730520 -> 1000007400520: In the twentieth century, Russian was widely taught in the schools of the members of the old Warsaw Pact and in other countries that used to be allies of the USSR. 10730530 -> 1000007400530: In particular, these countries include Poland, Bulgaria, the Czech Republic, Slovakia, Hungary, Romania, Albania and Cuba. 10730540 -> 1000007400540: However, younger generations are usually not fluent in it, because Russian is no longer mandatory in the school system. 10730550 -> 1000007400550: It is currently the most widely taught foreign language in Mongolia. 10730560 -> 1000007400560: Russian is also spoken in Israel by at least 750,000 ethnic Jewish immigrants from the former Soviet Union (1999 census). 10730570 -> 1000007400570: The Israeli press and websites regularly publish material in Russian. 10730580 -> 1000007400580: Sizable Russian-speaking communities also exist in North America, especially in large urban centers of the U.S. and Canada such as New York City, Philadelphia, Boston, Los Angeles, San Francisco, Seattle, Toronto, Baltimore, Miami, Chicago, Denver, and the Cleveland suburb of Richmond Heights. 10730590 -> 1000007400590: In the former two, Russian-speaking groups total over half a million. 10730600 -> 1000007400600: In a number of locations they issue their own newspapers, and live in their self-sufficient neighborhoods (especially the generation of immigrants who started arriving in the early sixties). 10730610 -> 1000007400610: Only about a quarter of them are ethnic Russians, however. 10730620 -> 1000007400620: Before the dissolution of the Soviet Union, the overwhelming majority of Russophones in North America were Russian-speaking Jews. 10730630 -> 1000007400630: Afterwards, the influx from the countries of the former Soviet Union changed the statistics somewhat. 10730640 -> 1000007400640: According to the United States 2000 Census, Russian is the primary language spoken in the homes of over 700,000 individuals living in the United States. 10730650 -> 1000007400650: Significant Russian-speaking groups also exist in Western Europe. 10730660 -> 1000007400660: These have been fed by several waves of immigrants since the beginning of the twentieth century, each with its own flavor of language. 10730670 -> 1000007400670: Germany, the United Kingdom, Spain, France, Italy, Belgium, Greece, Brazil, Norway, Austria, and Turkey have significant Russian-speaking communities totaling 3 million people. 10730680 -> 1000007400680: Two thirds of them are actually Russian-speaking descendants of Germans, Greeks, Jews, Armenians, or Ukrainians who either repatriated after the USSR collapsed or are just looking for temporary employment. 10730690 -> 1000007400690: Recent estimates of the total number of speakers of Russian: 10730700 -> 1000007400700: Official status 10730710 -> 1000007400710: Russian is the official language of Russia. 10730720 -> 1000007400720: It is also an official language of Belarus, Kazakhstan and Kyrgyzstan, an unofficial but widely spoken language in Ukraine, and the de facto official language of the unrecognized states of Transnistria, South Ossetia and Abkhazia. 10730730 -> 1000007400730: Russian is one of the six official languages of the United Nations. 10730740 -> 1000007400740: Education in Russian is still a popular choice for both Russian as a second language (RSL) and native speakers in Russia as well as many of the former Soviet republics.
10730750 -> 1000007400750: 97% of the public school students of Russia, 75% in Belarus, 41% in Kazakhstan, 25% in Ukraine, 23% in Kyrgyzstan, 21% in Moldova, 7% in Azerbaijan, 5% in Georgia and 2% in Armenia and Tajikistan receive their education only or mostly in Russian. 10730760 -> 1000007400760: Although the corresponding percentage of ethnic Russians is 78% in Russia, 10% in Belarus, 26% in Kazakhstan, 17% in Ukraine, 9% in Kyrgyzstan, 6% in Moldova, 2% in Azerbaijan, 1.5% in Georgia and less than 1% in both Armenia and Tajikistan. 10730770 -> 1000007400770: Russian-language schooling is also available in Latvia, Estonia and Lithuania, but due to education reforms, a number of subjects taught in Russian are reduced at the high school level. 10730780 -> 1000007400780: The language has a co-official status alongside Moldovan in the autonomies of Gagauzia and Transnistria in Moldova, and in seven Romanian communes in Tulcea and Constanţa counties. 10730790 -> 1000007400790: In these localities, Russian-speaking Lipovans, who are a recognized ethnic minority, make up more than 20% of the population. 10730800 -> 1000007400800: Thus, according to Romania's minority rights law, education, signage, and access to public administration and the justice system are provided in Russian alongside Romanian. 10730810 -> 1000007400810: In the Autonomous Republic of Crimea in Ukraine, Russian is an officially recognized language alongside with Crimean Tatar, but in reality, is the only language used by the government, thus being a de facto official language. 10730820 -> 1000007400820: Dialects 10730830 -> 1000007400830: Despite leveling after 1900, especially in matters of vocabulary, a number of dialects exist in Russia. 10730840 -> 1000007400840: Some linguists divide the dialects of the Russian language into two primary regional groupings, "Northern" and "Southern", with Moscow lying on the zone of transition between the two. 10730850 -> 1000007400850: Others divide the language into three groupings, Northern, Central and Southern, with Moscow lying in the Central region. 10730860 -> 1000007400860: Dialectology within Russia recognizes dozens of smaller-scale variants. 10730870 -> 1000007400870: The dialects often show distinct and non-standard features of pronunciation and intonation, vocabulary, and grammar. 10730880 -> 1000007400880: Some of these are relics of ancient usage now completely discarded by the standard language. 10730890 -> 1000007400890: The northern Russian dialects and those spoken along the Volga River typically pronounce unstressed {(IPA+/o/+/o/)} clearly (the phenomenon called okanye/оканье). 10730900 -> 1000007400900: East of Moscow, particularly in Ryazan Region, unstressed {(IPA+/e/+/e/)} and {(IPA+/a/+/a/)} following palatalized consonants and preceding a stressed syllable are not reduced to {(IPA+[ɪ]+[ɪ])} (like in the Moscow dialect), being instead pronounced as {(IPA+/a/+/a/)} in such positions (e.g. несли is pronounced as {(IPA+[nʲasˈlʲi]+[nʲasˈlʲi])}, not as {(IPA+[nʲɪsˈlʲi]+[nʲɪsˈlʲi])}) - this is called yakanye/ яканье; many southern dialects have a palatalized final {(IPA+/tʲ/+/tʲ/)} in 3rd person forms of verbs (this is unpalatalized in the standard dialect) and a fricative {(IPA+[ɣ]+[ɣ])} where the standard dialect has {(IPA+[g]+[g])}. 10730910 -> 1000007400910: However, in certain areas south of Moscow, e.g. in and around Tula, {(IPA+/g/+/g/)} is pronounced as in the Moscow and northern dialects unless it precedes a voiceless plosive or a pause. 
10730920 -> 1000007400920: In this position {(IPA+/g/+/g/)} is lenited and devoiced to the fricative {(IPA+[x]+[x])}, e.g. друг {(IPA+[drux]+[drux])} (in Moscow's dialect, only Бог {(IPA+[box]+[box])}, лёгкий {(IPA+[lʲɵxʲkʲɪj]+[lʲɵxʲkʲɪj])}, мягкий {(IPA+[ˈmʲæxʲkʲɪj]+[ˈmʲæxʲkʲɪj])} and some derivatives follow this rule). 10730930 -> 1000007400930: Some of these features (e.g. a debuccalized or lenited {(IPA+/g/+/g/)} and palatalized final {(IPA+/tʲ/+/tʲ/)} in 3rd person forms of verbs) are also present in modern Ukrainian, indicating either a linguistic continuum or strong influence one way or the other. 10730940 -> 1000007400940: The city of Veliky Novgorod has historically displayed a feature called chokanye/tsokanye (чоканье/цоканье), where {(IPA+/ʨ/+/ʨ/)} and {(IPA+/ʦ/+/ʦ/)} were confused (this is thought to be due to influence from Finnish, which doesn't distinguish these sounds). 10730950 -> 1000007400950: So, цапля ("heron") has been recorded as 'чапля'. 10730960 -> 1000007400960: Also, the second palatalization of velars did not occur there, so the so-called ě² (from the Proto-Slavonic diphthong *ai) did not cause {(IPA+/k, g, x/+/k, g, x/)} to shift to {(IPA+/ʦ, ʣ, s/+/ʦ, ʣ, s/)}; therefore where Standard Russian has цепь ("chain"), the form кепь {(IPA+[kʲepʲ]+[kʲepʲ])} is attested in earlier texts. 10730970 -> 1000007400970: Among the first to study Russian dialects was Lomonosov in the eighteenth century. 10730980 -> 1000007400980: In the nineteenth, Vladimir Dal compiled the first dictionary that included dialectal vocabulary. 10730990 -> 1000007400990: Detailed mapping of Russian dialects began at the turn of the twentieth century. 10731000 -> 1000007401000: In modern times, the monumental Dialectological Atlas of the Russian Language (Диалектологический атлас русского языка {(IPA+[dʲɪɐˌlʲɛktəlɐˈgʲiʨɪskʲɪj ˈatləs ˈruskəvə jɪzɨˈka]+[dʲɪɐˌlʲɛktəlɐˈgʲiʨɪskʲɪj ˈatləs ˈruskəvə jɪzɨˈka])}), was published in 3 folio volumes 1986–1989, after four decades of preparatory work. 10731010 -> 1000007401010: The standard language is based on (but not identical to) the Moscow dialect. 10731020 -> 1000007401020: Derived languages 10731030 -> 1000007401030: Balachka a dialect, spoken primarily by Cossacks, in the regions of Don, Kuban and Terek. 10731040 -> 1000007401040: Fenya, a criminal argot of ancient origin, with Russian grammar, but with distinct vocabulary. 10731050 -> 1000007401050: Nadsat, the fictional language spoken in 'A Clockwork Orange' uses a lot of Russian words and Russian slang. 10731060 -> 1000007401060: Surzhyk is a language with Russian and Ukrainian features, spoken in some areas of Ukraine 10731070 -> 1000007401070: Trasianka is a language with Russian and Belarusian features used by a large portion of the rural population in Belarus. 10731080 -> 1000007401080: Quelia, a pseudo pidgin of German and Russian. 10731090 -> 1000007401090: Runglish, Russian-English pidgin. 10731100 -> 1000007401100: This word is also used by English speakers to describe the way in which Russians attempt to speak English using Russian morphology and/or syntax. 10731110 -> 1000007401110: Russenorsk is an extinct pidgin language with mostly Russian vocabulary and mostly Norwegian grammar, used for communication between Russians and Norwegian traders in the Pomor trade in Finnmark and the Kola Peninsula. 10731120 -> 1000007401120: Writing system 10731130 -> 1000007401130: Alphabet 10731140 -> 1000007401140: Russian is written using a modified version of the Cyrillic (кириллица) alphabet. 
10731150 -> 1000007401150: The Russian alphabet consists of 33 letters. 10731160 -> 1000007401160: The following table gives their upper case forms, along with IPA values for each letter's typical sound: 10731170 -> 1000007401170: Older letters of the Russian alphabet include <ѣ>, which merged to <е> ({(IPA+/e/+/e/)}); <і> and <ѵ>, which both merged to <и>({(IPA+/i/+/i/)}); <ѳ>, which merged to <ф> ({(IPA+/f/+/f/)}); and <ѧ>, which merged to <я> ({(IPA+/ja/+/ja/)} or {(IPA+/ʲa/+/ʲa/)}). 10731180 -> 1000007401180: While these older letters have been abandoned at one time or another, they may be used in this and related articles. 10731190 -> 1000007401190: The yers <ъ> and <ь> originally indicated the pronunciation of ultra-short or reduced {(IPA+/ŭ/+/ŭ/)}, {(IPA+/ĭ/+/ĭ/)}. 10731200 -> 1000007401200: The Russian alphabet has many systems of character encoding. 10731210 -> 1000007401210: KOI8-R was designed by the government and was intended to serve as the standard encoding. 10731220 -> 1000007401220: This encoding is still used in UNIX-like operating systems. 10731230 -> 1000007401230: Nevertheless, the spread of MS-DOS and Microsoft Windows created chaos and ended by establishing different encodings as de-facto standards. 10731240 -> 1000007401240: For communication purposes, a number of conversion applications were developed. 10731245 -> 1000007401250: "iconv" is an example that is supported by most versions of Linux, Macintosh and some other operating systems. 10731250 -> 1000007401260: Most implementations (especially old ones) of the character encoding for the Russian language are aimed at simultaneous use of English and Russian characters only and do not include support for any other language. 10731260 -> 1000007401270: Certain hopes for a unification of the character encoding for the Russian alphabet are related to the Unicode standard, specifically designed for peaceful coexistence of various languages, including even dead languages. 10731270 -> 1000007401280: Unicode also supports the letters of the Early Cyrillic alphabet, which have many similarities with the Greek alphabet. 10731280 -> 1000007401290: Orthography 10731290 -> 1000007401300: Russian spelling is reasonably phonemic in practice. 10731300 -> 1000007401310: It is in fact a balance among phonemics, morphology, etymology, and grammar; and, like that of most living languages, has its share of inconsistencies and controversial points. 10731310 -> 1000007401320: A number of rigid spelling rules introduced between the 1880s and 1910s have been responsible for the latter whilst trying to eliminate the former. 10731320 -> 1000007401330: The current spelling follows the major reform of 1918, and the final codification of 1956. 10731330 -> 1000007401340: An update proposed in the late 1990s has met a hostile reception, and has not been formally adopted. 10731340 -> 1000007401350: The punctuation, originally based on Byzantine Greek, was in the seventeenth and eighteenth centuries reformulated on the French and German models. 10731350 -> 1000007401360: Sounds 10731360 -> 1000007401370: The phonological system of Russian is inherited from Common Slavonic, but underwent considerable modification in the early historical period, before being largely settled by about 1400. 10731370 -> 1000007401380: The language possesses five vowels, which are written with different letters depending on whether or not the preceding consonant is palatalized. 10731380 -> 1000007401390: The consonants typically come in plain vs. 
palatalized pairs, which are traditionally called hard and soft. 10731390 -> 1000007401400: (The hard consonants are often velarized, especially before back vowels, although in some dialects the velarization is limited to hard {(IPA+/l/+/l/)}). 10731400 -> 1000007401410: The standard language, based on the Moscow dialect, possesses heavy stress and moderate variation in pitch. 10731410 -> 1000007401420: Stressed vowels are somewhat lengthened, while unstressed vowels tend to be reduced to near-close vowels or an unclear schwa. 10731420 -> 1000007401430: (See also: vowel reduction in Russian.) 10731430 -> 1000007401440: The Russian syllable structure can be quite complex with both initial and final consonant clusters of up to 4 consecutive sounds. 10731440 -> 1000007401450: Using a formula with V standing for the nucleus (vowel) and C for each consonant the structure can be described as follows: 10731450 -> 1000007401460: (C)(C)(C)(C)V(C)(C)(C)(C) 10731460 -> 1000007401470: Clusters of four consonants are not very common, however, especially within a morpheme. 10731470 -> 1000007401480: Consonants 10731480 -> 1000007401490: Russian is notable for its distinction based on palatalization of most of the consonants. 10731490 -> 1000007401500: While {(IPA+/k/, /g/, /x/+/k/, /g/, /x/)} do have palatalized allophones {(IPA+[kʲ, gʲ, xʲ]+[kʲ, gʲ, xʲ])}, only {(IPA+/kʲ/+/kʲ/)} might be considered a phoneme, though it is marginal and generally not considered distinctive (the only native minimal pair which argues for {(IPA+/kʲ/+/kʲ/)} to be a separate phoneme is "это ткёт"/"этот кот"). 10731500 -> 1000007401510: Palatalization means that the center of the tongue is raised during and after the articulation of the consonant. 10731510 -> 1000007401520: In the case of {(IPA+/tʲ/ and /dʲ/+/tʲ/ and /dʲ/)}, the tongue is raised enough to produce slight frication (affricate sounds). 10731520 -> 1000007401530: These sounds: {(IPA+/t, d, ʦ, s, z, n and rʲ/+/t, d, ʦ, s, z, n and rʲ/)} are dental, that is pronounced with the tip of the tongue against the teeth rather than against the alveolar ridge. 10731530 -> 1000007401540: Grammar 10731540 -> 1000007401550: Russian has preserved an Indo-European synthetic-inflectional structure, although considerable leveling has taken place. 10731550 -> 1000007401560: Russian grammar encompasses 10731560 -> 1000007401570: a highly synthetic morphology 10731570 -> 1000007401580: a syntax that, for the literary language, is the conscious fusion of three elements: 10731580 -> 1000007401590: a polished vernacular foundation; 10731590 -> 1000007401600: a Church Slavonic inheritance; 10731600 -> 1000007401610: a Western European style. 10731610 -> 1000007401620: The spoken language has been influenced by the literary one, but continues to preserve characteristic forms. 10731620 -> 1000007401630: The dialects show various non-standard grammatical features, some of which are archaisms or descendants of old forms since discarded by the literary language. 10731630 -> 1000007401640: Vocabulary 10731640 -> 1000007401650: See History of the Russian language for an account of the successive foreign influences on the Russian language. 10731650 -> 1000007401660: The total number of words in Russian is difficult to reckon because of the ability to agglutinate and create manifold compounds, diminutives, etc. (see Word Formation under Russian grammar). 
10731660 -> 1000007401670: The number of listed words or entries in some of the major dictionaries published during the last two centuries, and the total vocabulary of Pushkin (who is credited with greatly augmenting and codifying literary Russian), are as follows: 10731670 -> 1000007401680: (As a historical aside, Dahl was, in the second half of the nineteenth century, still insisting that the proper spelling of the adjective русский, which was at that time applied uniformly to all the Orthodox Eastern Slavic subjects of the Empire, as well as to its one official language, be spelled руский with one s, in accordance with ancient tradition and what he termed the "spirit of the language". 10731680 -> 1000007401690: He was contradicted by the philologist Grot, who distinctly heard the s lengthened or doubled). 10731690 -> 1000007401700: Proverbs and sayings 10731700 -> 1000007401710: The Russian language is replete with many hundreds of proverbs (пословица {(IPA+[pɐˈslo.vʲɪ.ʦə]+[pɐˈslo.vʲɪ.ʦə])}) and sayings (поговоркa {(IPA+[pə.gɐˈvo.rkə]+[pə.gɐˈvo.rkə])}). 10731710 -> 1000007401720: These were already tabulated by the seventeenth century, and collected and studied in the nineteenth and twentieth, with the folk-tales being an especially fertile source. 10731720 -> 1000007401730: History and examples 10731730 -> 1000007401740: The history of Russian language may be divided into the following periods. 10731740 -> 1000007401750: Kievan period and feudal breakup 10731750 -> 1000007401760: The Tatar yoke and the Grand Duchy of Lithuania 10731760 -> 1000007401770: The Moscovite period (15th–17th centuries) 10731770 -> 1000007401780: Empire (18th–19th centuries) 10731780 -> 1000007401790: Soviet period and beyond (20th century) 10731790 -> 1000007401800: Judging by the historical records, by approximately 1000 AD the predominant ethnic group over much of modern European Russia, Ukraine, and Belarus was the Eastern branch of the Slavs, speaking a closely related group of dialects. 10731800 -> 1000007401810: The political unification of this region into Kievan Rus' in about 880, from which modern Russia, Ukraine and Belarus trace their origins, established Old East Slavic as a literary and commercial language. 10731810 -> 1000007401820: It was soon followed by the adoption of Christianity in 988 and the introduction of the South Slavic Old Church Slavonic as the liturgical and official language. 10731820 -> 1000007401830: Borrowings and calques from Byzantine Greek began to enter the Old East Slavic and spoken dialects at this time, which in their turn modified the Old Church Slavonic as well. 10731830 -> 1000007401840: Dialectal differentiation accelerated after the breakup of Kievan Rus in approximately 1100. 10731840 -> 1000007401850: On the territories of modern Belarus and Ukraine emerged Ruthenian and in modern Russia medieval Russian. 10731850 -> 1000007401860: They definitely became distinct in 13th century by the time of division of that land between the Grand Duchy of Lithuania on the west and independent Novgorod Feudal Republic plus small duchies which were vassals of the Tatars on the east. 10731860 -> 1000007401870: The official language in Moscow and Novgorod, and later, in the growing Moscow Rus’, was Church Slavonic which evolved from Old Church Slavonic and remained the literary language until the Petrine age, when its usage shrank drastically to biblical and liturgical texts. 
10731870 -> 1000007401880: Russian developed under the strong influence of Church Slavonic until the close of the seventeenth century; afterwards the influence reversed, leading to corruption of liturgical texts. 10731880 -> 1000007401890: The political reforms of Peter the Great were accompanied by a reform of the alphabet, and achieved their goal of secularization and Westernization. 10731890 -> 1000007401900: Blocks of specialized vocabulary were adopted from the languages of Western Europe. 10731900 -> 1000007401910: By 1800, a significant portion of the gentry spoke French, less often German, on an everyday basis. 10731910 -> 1000007401920: Many Russian novels of the 19th century, e.g. Lev Tolstoy’s "War and Peace", contain entire paragraphs and even pages in French with no translation given, on the assumption that educated readers would not need one. 10731920 -> 1000007401930: The modern literary language is usually considered to date from the time of Aleksandr Pushkin in the first third of the nineteenth century. 10731930 -> 1000007401940: Pushkin revolutionized Russian literature by rejecting archaic grammar and vocabulary (the so-called "высокий стиль", "high style") in favor of grammar and vocabulary found in the spoken language of the time. 10731940 -> 1000007401950: Even younger modern readers may experience only slight difficulty understanding some words in Pushkin’s texts, since only a few words used by Pushkin have become archaic or changed meaning. 10731950 -> 1000007401960: On the other hand, many expressions used by Russian writers of the early 19th century, in particular Pushkin, Lermontov, Gogol and Griboyedov, became proverbs or sayings which can frequently be found even in modern Russian colloquial speech. 10731960 -> 1000007401970: The political upheavals of the early twentieth century and the wholesale changes of political ideology gave written Russian its modern appearance after the spelling reform of 1918. 10731970 -> 1000007401980: Political circumstances and Soviet accomplishments in military, scientific, and technological matters (especially cosmonautics) gave Russian worldwide prestige, especially during the middle third of the twentieth century. SYSTRAN 10850010 -> 1000007500020: SYSTRAN 10850020 -> 1000007500030: SYSTRAN, founded by Dr. Peter Toma in 1968, is one of the oldest machine translation companies. 10850030 -> 1000007500040: SYSTRAN has done extensive work for the United States Department of Defense and the European Commission. 10850040 -> 1000007500050: SYSTRAN provides the technology for Yahoo! and AltaVista's Babel Fish, among others, but its use was discontinued (circa 2007) for all of the language combinations offered by Google's language tools. 10850050 -> 1000007500060: Commercial versions of SYSTRAN run on Microsoft Windows (including Windows Mobile), Linux, and Solaris. 10850060 -> 1000007500070: History 10850070 -> 1000007500080: With its origin in the Georgetown machine translation effort, SYSTRAN was one of the few machine translation systems to survive the major decrease in funding after the ALPAC Report of the mid-1960s. 10850080 -> 1000007500090: The company was established in La Jolla, California, to work on translation of Russian to English text for the United States Air Force during the Cold War. 
10850090 -> 1000007500100: Large numbers of Russian scientific and technical documents were translated using SYSTRAN under the auspices of the USAF Foreign Technology Division (later the National Air and Space Intelligence Center) at Wright-Patterson Air Force Base, Ohio. 10850100 -> 1000007500110: The quality of the translations, although only approximate, was usually adequate for understanding content. 10850110 -> 1000007500120: The company was sold during 1986 to the Gachot family, based in Paris, France, and is now traded publicly by the French stock exchange. 10850120 -> 1000007500130: It has a main office at the Grande Arche in La Defense and maintains a secondary office in La Jolla, San Diego, California. 10850130 -> None: Languages 10850140 -> None: Here is a list of the source and target languages SYSTRAN works with. 10850150 -> None: Many of the pairs are to or from English or French. 10850160 -> None: Russian into English (1968) 10850170 -> None: English into Russian (1973) for the Apollo-Soyuz project 10850180 -> None: English source (1975) for the European Commission 10850190 -> None: Arabic 10850200 -> None: Chinese 10850210 -> None: Danish 10850220 -> None: Dutch 10850230 -> None: French 10850240 -> None: German 10850250 -> None: Greek 10850260 -> None: Hindi 10850270 -> None: Italian 10850280 -> None: Japanese 10850290 -> None: Korean 10850300 -> None: Norwegian 10850310 -> None: Serbo-Croatian 10850320 -> None: Spanish 10850330 -> None: Swedish 10850340 -> None: Persian 10850350 -> None: Polish 10850360 -> None: Portuguese 10850370 -> None: Ukrainian 10850380 -> None: Urdu Semantics 10750010 -> 1000007600020: Semantics 10750020 -> 1000007600030: Semantics is the study of meaning in communication. 10750030 -> 1000007600040: The word derives from Greek σημαντικός (semantikos), "significant", from σημαίνω (semaino), "to signify, to indicate" and that from σήμα (sema), "sign, mark, token". 10750040 -> 1000007600050: In linguistics it is the study of interpretation of signs as used by agents or communities within particular circumstances and contexts. 10750050 -> 1000007600060: It has related meanings in several other fields. 10750060 -> 1000007600070: Semanticists differ on what constitutes meaning in an expression. 10750070 -> 1000007600080: For example, in the sentence, "John loves a bagel", the word bagel may refer to the object itself, which is its literal meaning or denotation, but it may also refer to many other figurative associations, such as how it meets John's hunger, etc., which may be its connotation. 10750080 -> 1000007600090: Traditionally, the formal semantic view restricts semantics to its literal meaning, and relegates all figurative associations to pragmatics, but this distinction is increasingly difficult to defend. 10750090 -> 1000007600100: The degree to which a theorist subscribes to the literal-figurative distinction decreases as one moves from the formal semantic, semiotic, pragmatic, to the cognitive semantic traditions. 10750100 -> 1000007600110: The word semantic in its modern sense is considered to have first appeared in French as sémantique in Michel Bréal's 1897 book, Essai de sémantique'. 10750110 -> 1000007600120: In International Scientific Vocabulary semantics is also called semasiology. 10750120 -> 1000007600130: The discipline of Semantics is distinct from Alfred Korzybski's General Semantics, which is a system for looking at non-immediate, or abstract meanings. 
10750130 -> 1000007600140: Linguistics 10750140 -> 1000007600150: In linguistics, semantics is the subfield that is devoted to the study of meaning, as inherent at the levels of words, phrases, sentences, and even larger units of discourse (referred to as texts). 10750150 -> 1000007600160: The basic area of study is the meaning of signs, and the study of relations between different linguistic units: homonymy, synonymy, antonymy, polysemy, paronyms, hypernymy, hyponymy, meronymy, metonymy, holonymy, exocentricity / endocentricity, linguistic compounds. 10750160 -> 1000007600170: A key concern is how meaning attaches to larger chunks of text, possibly as a result of the composition from smaller units of meaning. 10750170 -> 1000007600180: Traditionally, semantics has included the study of connotative sense and denotative reference, truth conditions, argument structure, thematic roles, discourse analysis, and the linkage of all of these to syntax. 10750180 -> 1000007600190: Formal semanticists are concerned with the modeling of meaning in terms of the semantics of logic. 10750190 -> 1000007600200: Thus the sentence John loves a bagel above can be broken down into its constituents (signs), of which the unit loves may serve as both syntactic and semantic head. 10750200 -> 1000007600210: In the late 1960s, Richard Montague proposed a system for defining semantic entries in the lexicon in terms of lambda calculus. 10750210 -> 1000007600220: Thus, the syntactic parse of the sentence above would now indicate loves as the head, and its entry in the lexicon would point to the arguments as the agent, John, and the object, bagel, with a special role for the article "a" (which Montague called a quantifier). 10750220 -> 1000007600230: This resulted in the sentence being associated with the logical predicate loves (John, bagel), thus linking semantics to categorial grammar models of syntax. 10750230 -> 1000007600240: The logical predicate thus obtained would be elaborated further, e.g. using truth theory models, which ultimately relate meanings to a set of Tarskiian universals, which may lie outside the logic. 10750240 -> 1000007600250: The notion of such meaning atoms or primitives are basic to the language of thought hypothesis from the 70s. 10750250 -> 1000007600260: Despite its elegance, Montague grammar was limited by the context-dependent variability in word sense, and led to several attempts at incorporating context, such as : 10750260 -> 1000007600270: situation semantics ('80s): Truth-values are incomplete, they get assigned based on context 10750270 -> 1000007600280: generative lexicon ('90s): categories (types) are incomplete, and get assigned based on context 10750280 -> 1000007600290: The dynamic turn in semantics 10750290 -> 1000007600300: In the Chomskian tradition in linguistics there was no mechanism for the learning of semantic relations, and the nativist view considered all semantic notions as inborn. 10750300 -> 1000007600310: Thus, even novel concepts were proposed to have been dormant in some sense. 10750310 -> 1000007600320: This traditional view was also unable to address many issues such as metaphor or associative meanings, and semantic change, where meanings within a linguistic community change over time, and qualia or subjective experience. 10750320 -> 1000007600330: Another issue not addressed by the nativist model was how perceptual cues are combined in thought, e.g. in mental rotation. 
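To make the Montague-style analysis described earlier in this section more concrete, the entry for loves can be thought of as a curried function that takes the object and then the agent and yields the logical predicate loves (John, bagel). The following lines are only a minimal, hypothetical sketch of that idea in Python; the representation of predicates as strings and the names used are illustrative, not part of Montague's system.

    # A lexical entry for "loves" as a curried lambda (illustrative sketch only):
    # it is applied first to the object and then to the agent, yielding a predicate.
    loves = lambda obj: lambda agent: "loves(" + agent + ", " + obj + ")"

    # Composing the parse "John loves a bagel" around the head "loves":
    print(loves("bagel")("John"))   # prints: loves(John, bagel)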
10750330 -> 1000007600340: This traditional view of semantics, as an innate finite meaning inherent in a lexical unit that can be composed to generate meanings for larger chunks of discourse, is now being fiercely debated in the emerging domain of cognitive linguistics and also in the non-Fodorian camp in Philosophy of Language. 10750340 -> 1000007600350: The challenge is motivated by 10750350 -> 1000007600360: factors internal to language, such as the problem of resolving indexical or anaphora (e.g. this x, him, last week). 10750360 -> 1000007600370: In these situations "context" serves as the input, but the interpreted utterance also modifies the context, so it is also the output. 10750370 -> 1000007600380: Thus, the interpretation is necessarily dynamic and the meaning of sentences is viewed as context-change potentials instead of propositions. 10750380 -> 1000007600390: factors external to language, i.e. language is not a set of labels stuck on things, but "a toolbox, the importance of whose elements lie in the way they function rather than their attachments to things." 10750390 -> 1000007600400: This view reflects the position of the later Wittgenstein and his famous game example, and is related to the positions of Quine, Davidson, and others. 10750400 -> 1000007600410: A concrete example of the latter phenomenon is semantic underspecification — meanings are not complete without some elements of context. 10750410 -> 1000007600420: To take an example of a single word, "red", its meaning in a phrase such as red book is similar to many other usages, and can be viewed as compositional. 10750420 -> 1000007600430: However, the colours implied in phrases such as "red wine" (very dark), and "red hair" (coppery), or "red soil", or "red skin" are very different. 10750430 -> 1000007600440: Indeed, these colours by themselves would not be called "red" by native speakers. 10750440 -> 1000007600450: These instances are contrastive, so "red wine" is so called only in comparison with the other kind of wine (which also is not "white" for the same reasons). 10750450 -> 1000007600460: This view goes back to de Saussure: 10750460 -> 1000007600470: Each of a set of synonyms like redouter ('to dread'), craindre ('to fear'), avoir peur ('to be afraid') has its particular value only because they stand in contrast with one another. 10750470 -> 1000007600480: No word has a value that can be identified independently of what else is in its vicinity. 10750480 -> 1000007600490: and may go back to earlier Indian views on language, especially the Nyaya view of words as indicators and not carriers of meaning. 10750490 -> 1000007600500: An attempt to defend a system based on propositional meaning for semantic underspecification can be found in the Generative Lexicon model of James Pustejovsky, who extends contextual operations (based on type shifting) into the lexicon. 10750500 -> 1000007600510: Thus meanings are generated on the fly based on finite context. 10750510 -> 1000007600520: Prototype theory 10750520 -> 1000007600530: Another set of concepts related to fuzziness in semantics is based on prototypes. 10750530 -> 1000007600540: The work of Eleanor Rosch and George Lakoff in the 1970s led to a view that natural categories are not characterizable in terms of necessary and sufficient conditions, but are graded (fuzzy at their boundaries) and inconsistent as to the status of their constituent members. 
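As a rough illustration of what "graded" category membership means, the sketch below (Python) replaces a yes/no definition of the category bird with a membership score between 0 and 1; the scores and the cut-off are invented purely for illustration.

    # Hypothetical graded ("fuzzy") membership scores for the category BIRD,
    # in the spirit of prototype theory; the numbers are invented.
    bird_membership = {"robin": 0.95, "sparrow": 0.90, "penguin": 0.45, "bat": 0.10}

    def is_bird(word, threshold=0.5):
        # Membership is a matter of degree, cut off at a chosen threshold,
        # rather than a test of necessary and sufficient conditions.
        return bird_membership.get(word, 0.0) >= threshold

    print(is_bird("robin"))          # True  - close to the prototype
    print(is_bird("penguin"))        # False - a marginal member at this threshold
    print(is_bird("penguin", 0.4))   # True  - the category boundary itself is fuzzy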
10750540 -> 1000007600550: Systems of categories are not objectively "out there" in the world but are rooted in people's experience. 10750550 -> 1000007600560: These categories evolve as learned concepts of the world — meaning is not an objective truth, but a subjective construct, learned from experience, and language arises out of the "grounding of our conceptual systems in shared embodiment and bodily experience". 10750560 -> 1000007600570: A corollary of this is that the conceptual categories (i.e. the lexicon) will not be identical for different cultures, or indeed, for every individual in the same culture. 10750570 -> 1000007600580: This leads to another debate (see the Whorf-Sapir hypothesis or Eskimo words for snow). 10750580 -> 1000007600590: Computer science 10750590 -> 1000007600600: In computer science, where it is considered as an application of mathematical logic, semantics reflects the meaning of programs or functions. 10750600 -> 1000007600610: In this regard, semantics permits programs to be separated into their syntactical part (grammatical structure) and their semantic part (meaning). 10750610 -> 1000007600620: For instance, the following statements use different syntaxes (languages), but result in the same semantic: 10750620 -> 1000007600630: x += y; (C, Java, etc.) 10750630 -> 1000007600640: x := x + y; (Pascal) 10750640 -> 1000007600650: Let x = x + y; (early BASIC) 10750650 -> 1000007600660: x = x + y (most BASIC dialects, Fortran) 10750660 -> 1000007600670: Generally these operations would all perform an arithmetical addition of 'y' to 'x' and store the result in a variable 'x'. 10750670 -> 1000007600680: Semantics for computer applications falls into three categories: 10750680 -> 1000007600690: Operational semantics: The meaning of a construct is specified by the computation it induces when it is executed on a machine. 10750690 -> 1000007600700: In particular, it is of interest how the effect of a computation is produced. 10750700 -> 1000007600710: Denotational semantics: Meanings are modelled by mathematical objects that represent the effect of executing the constructs. 10750710 -> 1000007600720: Thus only the effect is of interest, not how it is obtained. 10750720 -> 1000007600730: Axiomatic semantics: Specific properties of the effect of executing the constructs as expressed as assertions. 10750730 -> 1000007600740: Thus there may be aspects of the executions that are ignored. 10750740 -> 1000007600750: The Semantic Web refers to the extension of the World Wide Web through the embedding of additional semantic metadata; s.a. 10750750 -> 1000007600760: Web Ontology Language (OWL). 10750760 -> 1000007600770: Psychology 10750770 -> 1000007600780: In psychology, semantic memory is memory for meaning, in other words, the aspect of memory that preserves only the gist, the general significance, of remembered experience, while episodic memory is memory for the ephemeral details, the individual features, or the unique particulars of experience. 10750780 -> 1000007600790: Word meaning is measured by the company they keep; the relationships among words themselves in a semantic network. 10750790 -> 1000007600800: In a network created by people analyzing their understanding of the word (such as Wordnet) the links and decomposition structures of the network are few in number and kind; and include "part of", "kind of", and similar links. 10750800 -> 1000007600810: In automated ontologies the links are computed vectors without explicit meaning. 
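As a minimal illustration of meaning "measured by the company words keep" and of the computed vectors just mentioned, the sketch below (Python) represents each word by invented co-occurrence counts and compares words by cosine similarity; techniques such as latent semantic indexing build such vectors automatically from large corpora rather than from hand-entered counts.

    import math

    # Invented co-occurrence counts with four context words: coffee, milk, engine, road.
    vectors = {
        "tea": [8, 5, 0, 0],
        "cup": [6, 4, 0, 1],
        "car": [0, 0, 7, 9],
    }

    def cosine(u, v):
        # Cosine similarity: close to 1.0 for similar directions, 0.0 for unrelated ones.
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    print(cosine(vectors["tea"], vectors["cup"]))   # high: the words keep similar company
    print(cosine(vectors["tea"], vectors["car"]))   # low: very different contexts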
10750810 -> 1000007600820: Various automated technologies are being developed to compute the meaning of words: latent semantic indexing and support vector machines as well as natural language processing, neural networks and predicate calculus techniques. 10750820 -> 1000007600830: Semantics has been reported to drive the course of psychotherapeutic interventions. 10750830 -> 1000007600840: Language structure can determine the treatment approach to drug-abusing patients. 10750840 -> 1000007600850: While working in Europe for the US Information Agency, the American psychiatrist A. James Giannini reported semantic differences in medical approaches to addiction treatment. 10750850 -> 1000007600860: English-speaking countries used the term "drug dependence" to describe a rather passive pathology in their patients. 10750860 -> 1000007600870: As a result, the physician's role was more active. 10750870 -> 1000007600880: Southern European countries such as Italy and Yugoslavia utilized the concept of "tossicomania" (i.e. toxic mania) to describe a more active rather than passive role of the addict. 10750880 -> 1000007600890: As a result, the treating physician's role shifted to that of a more passive guide than that of an active interventionist. Sentence (linguistics) 10760010 -> 1000007700020: Sentence (linguistics) 10760020 -> 1000007700030: In linguistics, a sentence is a grammatical unit of one or more words, bearing minimal syntactic relation to the words that precede or follow it, often preceded and followed in speech by pauses, having one of a small number of characteristic intonation patterns, and typically expressing an independent statement, question, request, command, etc. 10760030 -> 1000007700040: Sentences are generally characterized in most languages by the presence of a finite verb, e.g. "The quick brown fox jumps over the lazy dog". 10760050 -> 1000007700050: Components of a sentence 10760060 -> 1000007700060: A simple complete sentence consists of a subject and a predicate. 10760070 -> 1000007700070: The subject is typically a noun phrase, though other kinds of phrases (such as gerund phrases) work as well, and some languages allow subjects to be omitted. 10760080 -> 1000007700080: The predicate is a finite verb phrase: it is a finite verb together with zero or more objects, zero or more complements, and zero or more adverbials. 10760090 -> 1000007700090: See also copula for the consequences of this verb on the theory of sentence structure. 10760100 -> 1000007700100: Clauses 10760110 -> 1000007700110: A clause consists of a subject and a verb. 10760120 -> 1000007700120: There are two types of clauses: independent and subordinate (dependent). 10760130 -> 1000007700130: An independent clause consists of a subject and a verb and expresses a complete thought: for example, "I am sad." 10760140 -> 1000007700140: A subordinate clause consists of a subject and a verb, but expresses an incomplete thought: for example, "Because I had to move." 10760150 -> 1000007700150: Classification 10760160 -> 1000007700160: By structure 10760170 -> 1000007700170: One traditional scheme for classifying English sentences is by the number and types of finite clauses: 10760180 -> 1000007700180: A simple sentence consists of a single independent clause with no dependent clauses. 10760190 -> 1000007700190: A compound sentence consists of multiple independent clauses with no dependent clauses. 10760200 -> 1000007700200: These clauses are joined together using conjunctions, punctuation, or both. 
10760210 -> 1000007700210: A complex sentence consists of one or more independent clauses with at least one dependent clause. 10760220 -> 1000007700220: A complex-compound sentence (or compound-complex sentence) consists of multiple independent clauses, at least one of which has at least one dependent clause. 10760230 -> 1000007700230: By purpose 10760240 -> 1000007700240: Sentences can also be classified based on their purpose: 10760250 -> 1000007700250: A declarative sentence or declaration, the most common type, makes a statement: I am going home. 10760260 -> 1000007700260: A negative sentence or negation denies that a statement is true: I am not going home. 10760270 -> 1000007700270: An interrogative sentence or question is commonly used to request information (When are you going to work?), but sometimes not; see rhetorical question. 10760280 -> 1000007700280: An exclamatory sentence or exclamation is generally a more emphatic form of statement: What a wonderful day this is! 10760290 -> 1000007700290: Major and minor sentences 10760300 -> 1000007700300: A major sentence is a regular sentence; it has a subject and a predicate. 10760310 -> 1000007700310: For example: I have a ball. 10760320 -> 1000007700320: In this sentence one can change the persons: We have a ball. 10760330 -> 1000007700330: However, a minor sentence is an irregular type of sentence. 10760340 -> 1000007700340: It does not contain a finite verb. 10760350 -> 1000007700350: For example, "Mary!" 10760360 -> 1000007700360: "Yes." 10760370 -> 1000007700370: "Coffee." etc. 10760380 -> 1000007700380: Other examples of minor sentences are headings (e.g. the heading of this entry), stereotyped expressions (Hello!), emotional expressions (Wow!), proverbs, etc. 10760390 -> 1000007700390: This can also include sentences which do not contain verbs (e.g. The more, the merrier.) in order to intensify the meaning around the nouns (normally found in poetry and catchphrases). Spanish language 10780010 -> 1000007800020: Spanish language 10780020 -> 1000007800030: Spanish or Castilian (castellano) is an Indo-European Romance language that originated in northern Spain, gradually spread through the Kingdom of Castile, and evolved into the principal language of government and trade. 10780030 -> 1000007800040: It was taken to Africa, the Americas, and Asia Pacific with the expansion of the Spanish Empire between the fifteenth and nineteenth centuries. 10780040 -> 1000007800050: Today, between 322 and 400 million people speak Spanish as a native language, making it the world's second most-spoken language by native speakers (after Mandarin Chinese). 10780050 -> 1000007800060: Hispanosphere 10780060 -> 1000007800070: It is estimated that the combined total of native and non-native Spanish speakers is approximately 500 million, likely making it the third most spoken language by total number of speakers (after English and Chinese). 10780070 -> 1000007800080: Today, Spanish is an official language of Spain, most Latin American countries, and Equatorial Guinea; 21 nations speak it as their primary language. 10780080 -> 1000007800090: Spanish is also one of the six official languages of the United Nations. 10780090 -> 1000007800100: Mexico has the world's largest Spanish-speaking population, and Spanish is the second most widely spoken language in the United States and the most widely studied foreign language in U.S. schools and universities. 
10780100 -> 1000007800110: Global internet usage statistics for 2007 show Spanish as the third most commonly used language on the Internet, after English and Chinese. 10780110 -> 1000007800120: Naming and origin 10780120 -> 1000007800130: Spaniards tend to call this language {(Lang+español+es+español)} (Spanish) when contrasting it with languages of other states, such as French and English, but call it {(Lang+castellano+es+castellano)} (Castilian), that is, the language of the Castile region, when contrasting it with other languages spoken in Spain such as Galician, Basque, and Catalan. 10780130 -> 1000007800140: This reasoning also holds true for the language's preferred name in some Hispanic American countries. 10780140 -> 1000007800150: In this manner, the Spanish Constitution of 1978 uses the term {(Lang+castellano+es+castellano)} to define the official language of the whole Spanish State, as opposed to {(Lang+las demás lenguas españolas+es+las demás lenguas españolas)} (lit. the other Spanish languages). 10780150 -> 1000007800160: Article III reads as follows: 10780151 -> 1000007800170: {(Lang+El castellano es la lengua española oficial del Estado.+es+El castellano es la lengua española oficial del Estado.)} 10780152 -> 1000007800180: {(Lang+(…) Las demás lenguas españolas serán también oficiales en las respectivas Comunidades Autónomas…+es+(…) Las demás lenguas españolas serán también oficiales en las respectivas Comunidades Autónomas…)} 10780153 -> 1000007800190: Castilian is the official Spanish language of the State. 10780154 -> 1000007800200: (…) The other Spanish languages shall also be official in their respective Autonomous Communities… 10780160 -> 1000007800210: The name castellano is, however, widely used for the language as a whole in Latin America. 10780170 -> 1000007800220: Some Spanish speakers consider {(Lang+castellano+es+castellano)} a generic term with no political or ideological links, much as "Spanish" is in English. 10780180 -> 1000007800230: Often Latin Americans use it to differentiate their own variety of Spanish as opposed to the variety of Spanish spoken in Spain, or variety of Spanish which is considered as standard in the region. 10780190 -> 1000007800240: Classification and related languages 10780200 -> 1000007800250: Spanish is closely related to the other West Iberian Romance languages: Asturian ({(Lang+asturianu+ast+asturianu)}), Galician ({(Lang+galego+gl+galego)}), Ladino ({(Lang+dzhudezmo/spanyol/kasteyano+lad+dzhudezmo/spanyol/kasteyano)}), and Portuguese ({(Lang+português+pt+português)}). 10780210 -> 1000007800260: Catalan, an East Iberian language which exhibits many Gallo-Romance traits, is more similar to the neighbouring Occitan language ({(Lang+occitan+oc+occitan)}) than to Spanish, or indeed than Spanish and Portuguese are to each other. 10780220 -> 1000007800270: Spanish and Portuguese share similar grammars and vocabulary as well as a common history of Arabic influence while a great part of the peninsula was under Islamic rule (both languages expanded over Islamic territories). 10780230 -> 1000007800280: Their lexical similarity has been estimated as 89%. 10780240 -> 1000007800290: See Differences between Spanish and Portuguese for further information. 10780250 -> 1000007800300: Ladino 10780260 -> 1000007800310: Ladino, which is essentially medieval Spanish and closer to modern Spanish than any other language, is spoken by many descendants of the Sephardi Jews who were expelled from Spain in the 15th century. 
10780270 -> 1000007800320: Ladino speakers are currently almost exclusively Sephardi Jews, with family roots in Turkey, Greece or the Balkans: current speakers mostly live in Israel and Turkey, with a few pockets in Latin America. 10780280 -> 1000007800330: It lacks the Native American vocabulary which was influential during the Spanish colonial period, and it retains many archaic features which have since been lost in standard Spanish. 10780290 -> 1000007800340: It contains, however, other vocabulary which is not found in standard Castilian, including vocabulary from Hebrew, some French, Greek and Turkish, and other languages spoken where the Sephardim settled. 10780300 -> 1000007800350: Ladino is in serious danger of extinction because many native speakers today are elderly as well as elderly olim (immigrants to Israel) who have not transmitted the language to their children or grandchildren. 10780310 -> 1000007800360: However, it is experiencing a minor revival among Sephardi communities, especially in music. 10780320 -> 1000007800370: In the case of the Latin American communities, the danger of extinction is also due to the risk of assimilation by modern Castilian. 10780330 -> 1000007800380: A related dialect is Haketia, the Judaeo-Spanish of northern Morocco. 10780340 -> 1000007800390: This too tended to assimilate with modern Spanish, during the Spanish occupation of the region. 10780350 -> 1000007800400: Vocabulary comparison 10780360 -> 1000007800410: Spanish and Italian share a very similar phonological system. 10780370 -> 1000007800420: At present, the lexical similarity with Italian is estimated at 82%. 10780380 -> 1000007800430: As a result, Spanish and Italian are mutually intelligible to various degrees. 10780390 -> 1000007800440: The lexical similarity with Portuguese is greater, 89%, but the vagaries of Portuguese pronunciation make it less easily understood by Hispanophones than Italian. 10780400 -> 1000007800450: Mutual intelligibility between Spanish and French or Romanian is even lower (lexical similarity being respectively 75% and 71%): comprehension of Spanish by French speakers who have not studied the language is as low as an estimated 45% - the same as of English. 10780410 -> 1000007800460: The common features of the writing systems of the Romance languages allow for a greater amount of interlingual reading comprehension than oral communication would. 10780420 -> 1000007800470: 1. also {(Lang+nós outros+pt+nós outros)} in early modern Portuguese (e.g. The Lusiads) 10780430 -> 1000007800480: 2. {(Lang+noi altri+it+noi altri)} in Southern Italian dialects and languages 10780440 -> 1000007800490: 3. Alternatively {(Lang+nous autres+fr+nous autres)} 10780460 -> 1000007800500: History 10780470 -> 1000007800510: Spanish evolved from Vulgar Latin, with major influences from Arabic in vocabulary during the Andalusian period and minor surviving influences from Basque and Celtiberian, as well as Germanic languages via the Visigoths. 10780480 -> 1000007800520: Spanish developed along the remote cross road strips among the Alava, Cantabria, Burgos, Soria and La Rioja provinces of Northern Spain, as a strongly innovative and differing variant from its nearest cousin, Leonese speech, with a higher degree of Basque influence in these regions (see Iberian Romance languages). 
10780490 -> 1000007800530: Typical features of Spanish diachronical phonology include lenition (Latin {(Lang+vita+la+vita)}, Spanish {(Lang+vida+es+vida)}), palatalization (Latin {(Lang+annum+la+annum)}, Spanish {(Lang+año+es+año)}, and Latin {(Lang+anellum+la+anellum)}, Spanish {(Lang+anillo+es+anillo)}) and diphthongation (stem-changing) of short e and o from Vulgar Latin (Latin {(Lang+terra+la+terra)}, Spanish {(Lang+tierra+es+tierra)}; Latin {(Lang+novus+la+novus)}, Spanish {(Lang+nuevo+es+nuevo)}). 10780500 -> 1000007800540: Similar phenomena can be found in other Romance languages as well. 10780510 -> 1000007800550: During the {(Lang+Reconquista+es+Reconquista)}, this northern dialect from Cantabria was carried south, and remains a minority language in the northern coastal Morocco. 10780520 -> 1000007800560: The first Latin-to-Spanish grammar ({(Lang+Gramática de la Lengua Castellana+es+Gramática de la Lengua Castellana)}) was written in Salamanca, Spain, in 1492, by Elio Antonio de Nebrija. 10780530 -> 1000007800570: When it was presented to Isabel de Castilla, she asked, "What do I want a work like this for, if I already know the language?", to which he replied, "Your highness, the language is the instrument of the Empire." 10780540 -> 1000007800580: From the 16th century onwards, the language was taken to the Americas and the Spanish East Indies via Spanish colonization. 10780550 -> 1000007800590: In the 20th century, Spanish was introduced to Equatorial Guinea and the Western Sahara, the United States, such as in Spanish Harlem, in New York City, that had not been part of the Spanish Empire. 10780560 -> 1000007800600: For details on borrowed words and other external influences upon Spanish, see Influences on the Spanish language. 10780570 -> 1000007800610: Characterization 10780580 -> 1000007800620: A defining characteristic of Spanish was the diphthongization of the Latin short vowels e and o into ie and ue, respectively, when they were stressed. 10780590 -> 1000007800630: Similar sound changes are found in other Romance languages, but in Spanish they were significant. 10780600 -> 1000007800640: Some examples: 10780610 -> 1000007800650: Lat. {(Lang+petra+la+petra)} > Sp. {(Lang+piedra+es+piedra)}, It. {(Lang+pietra+it+pietra)}, Fr. {(Lang+pierre+fr+pierre)}, Rom. {(Lang+piatrǎ+ro+piatrǎ)}, Port./Gal. {(Lang+pedra+pt+pedra)} "stone". 10780620 -> 1000007800660: Lat. {(Lang+moritur+la+moritur)} > Sp. {(Lang+muere+es+muere)}, It. {(Lang+muore+it+muore)}, Fr. {(Lang+meurt+fr+meurt)} / {(Lang+muert+fr+muert)}, Rom. {(Lang+moare+ro+moare)}, Port./Gal. {(Lang+morre+pt+morre)} "die". 10780630 -> 1000007800670: Peculiar to early Spanish (as in the Gascon dialect of Occitan, and possibly due to a Basque substratum) was the mutation of Latin initial f- into h- whenever it was followed by a vowel that did not diphthongate. 10780640 -> 1000007800680: Compare for instance: 10780650 -> 1000007800690: Lat. {(Lang+filium+la+filium)} > It. {(Lang+figlio+it+figlio)}, Port. {(Lang+filho+pt+filho)}, Gal. {(Lang+fillo+gl+fillo)}, Fr. {(Lang+fils+fr+fils)}, Occitan {(Lang+filh+oc+filh)} (but Gascon {(Lang+hilh+gsc+hilh)}) Sp. {(Lang+hijo+es+hijo)} (but Ladino {(Lang+fijo+lad+fijo)}); 10780660 -> 1000007800700: Lat. {(Lang+fabulari+la+fabulari)} > Lad. {(Lang+favlar+lad+favlar)}, Port./Gal. {(Lang+falar+pt+falar)}, Sp. {(Lang+hablar+es+hablar)}; 10780670 -> 1000007800710: but Lat. {(Lang+focum+la+focum)} > It. {(Lang+fuoco+it+fuoco)}, Port./Gal. {(Lang+fogo+pt+fogo)}, Sp./Lad. {(Lang+fuego+es+fuego)}. 
10780680 -> 1000007800720: Some consonant clusters of Latin also produced characteristically different results in these languages, for example: 10780690 -> 1000007800730: Lat. {(Lang+clamare+la+clamare)}, acc. {(Lang+flammam+la+flammam)}, {(Lang+plenum+la+plenum)} > Lad. {(Lang+lyamar+lad+lyamar)}, {(Lang+flama+lad+flama)}, {(Lang+pleno+lad+pleno)}; Sp. {(Lang+llamar+es+llamar)}, {(Lang+llama+es+llama)}, {(Lang+lleno+es+lleno)}. 10780700 -> 1000007800740: However, in Spanish there are also the forms {(Lang+clamar+la+clamar)}, {(Lang+flama+lad+flama)}, {(Lang+pleno+lad+pleno)}; Port. {(Lang+chamar+pt+chamar)}, {(Lang+chama+pt+chama)}, {(Lang+cheio+pt+cheio)}; Gal. {(Lang+chamar+gl+chamar)}, {(Lang+chama+gl+chama)}, {(Lang+cheo+gl+cheo)}. 10780710 -> 1000007800750: Lat. acc. {(Lang+octo+la+octo)}, {(Lang+noctem+la+noctem)}, {(Lang+multum+la+multum)} > Lad. {(Lang+ocho+lad+ocho)}, {(Lang+noche+lad+noche)}, {(Lang+muncho+lad+muncho)}; Sp. {(Lang+ocho+es+ocho)}, {(Lang+noche+es+noche)}, {(Lang+mucho+es+mucho)}; Port. {(Lang+oito+pt+oito)}, {(Lang+noite+pt+noite)}, {(Lang+muito+pt+muito)}; Gal. {(Lang+oito+gl+oito)}, {(Lang+noite+gl+noite)}, {(Lang+moito+gl+moito)}. 10780720 -> 1000007800760: Geographic distribution 10780730 -> 1000007800770: Spanish is one of the official languages of the European Union, the Organization of American States, the Organization of Ibero-American States, the United Nations, and the Union of South American Nations. 10780740 -> 1000007800780: Europe 10780750 -> 1000007800790: Spanish is an official language of Spain, the country for which it is named and from which it originated. 10780760 -> 1000007800800: It is also spoken in Gibraltar, though English is the official language. 10780770 -> 1000007800810: Likewise, it is spoken in Andorra though Catalan is the official language. 10780780 -> 1000007800820: It is also spoken by small communities in other European countries, such as the United Kingdom, France, and Germany. 10780790 -> 1000007800830: Spanish is an official language of the European Union. 10780800 -> 1000007800840: In Switzerland, Spanish is the mother tongue of 1.7% of the population, representing the first minority after the 4 official languages of the country. 10780810 -> 1000007800850: The Americas 10780820 -> 1000007800860: Latin America 10780830 -> 1000007800870: Most Spanish speakers are in Latin America; of most countries with the most Spanish speakers, only Spain is outside of the Americas. 10780840 -> 1000007800880: Mexico has most of the world's native speakers. 10780850 -> 1000007800890: Nationally, Spanish is the official language of Argentina, Bolivia (co-official Quechua and Aymara), Chile, Colombia, Costa Rica, Cuba, Dominican Republic, Ecuador, El Salvador, Guatemala, Honduras, Mexico , Nicaragua, Panama, Paraguay (co-official Guaraní), Peru (co-official Quechua and, in some regions, Aymara), Uruguay, and Venezuela. 10780860 -> 1000007800900: Spanish is also the official language (co-official with English) in the U.S. commonwealth of Puerto Rico. 10780870 -> 1000007800910: Spanish has no official recognition in the former British colony of Belize; however, per the 2000 census, it is spoken by 43% of the population. 10780880 -> 1000007800920: Mainly, it is spoken by Hispanic descendants who remained in the region since the 17th century; however, English is the official language. 10780890 -> 1000007800930: Spain colonized Trinidad and Tobago first in 1498, leaving the Carib people the Spanish language. 
10780900 -> 1000007800940: The Cocoa Panyols, laborers from Venezuela, also took their culture and language with them; they are credited with the music of "Parang" ("Parranda") on the island. 10780910 -> 1000007800950: Because of Trinidad's location on the South American coast, the country is much influenced by its Spanish-speaking neighbors. 10780920 -> 1000007800960: A recent census shows that more than 1,500 inhabitants speak Spanish. 10780930 -> 1000007800970: In 2004, the government announced the Spanish as a First Foreign Language (SAFFL) initiative, which was launched in March 2005. 10780940 -> 1000007800980: Government regulations require Spanish to be taught, beginning in primary school, while thirty percent of public employees are to be linguistically competent within five years. 10780950 -> 1000007800990: The government also announced that Spanish will be the country's second official language by 2020, alongside English. 10780960 -> 1000007801000: Spanish is important in Brazil because of its proximity to, and increased trade with, its Spanish-speaking neighbors, for example as a member of the Mercosur trading bloc. 10780970 -> 1000007801010: In 2005, the National Congress of Brazil approved a bill, signed into law by the President, making Spanish available as a foreign language in secondary schools. 10780980 -> 1000007801020: In many border towns and villages (especially on the Uruguayan-Brazilian border), a mixed language known as Portuñol is spoken. 10780990 -> 1000007801030: United States 10781000 -> 1000007801040: According to the 2006 census, 44.3 million people in the U.S. were Hispanic or Latino by origin; 34 million people, or 12.2 percent of the population older than 5 years, speak Spanish at home. 10781005 -> 1000007801050: Spanish has a long history in the United States (many south-western states were part of Mexico and Spain), and it has recently been revitalized by immigration from Latin America. 10781010 -> 1000007801060: Spanish is the most widely taught foreign language in the country. 10781020 -> 1000007801070: Although the United States has no formally designated "official languages," Spanish is formally recognized at the state level alongside English; in the U.S. state of New Mexico, 30 percent of the population speak it. 10781030 -> 1000007801080: It also has a strong influence in metropolitan areas such as Los Angeles, Miami and New York City. 10781040 -> 1000007801090: Spanish is the dominant spoken language in Puerto Rico, a U.S. territory. 10781050 -> 1000007801100: In total, the U.S. has the world's fifth-largest Spanish-speaking population. 10781060 -> 1000007801110: Asia 10781070 -> 1000007801120: Spanish was an official language of the Philippines but was never spoken by a majority of the population. 10781080 -> 1000007801130: Movements to have the masses learn the language were started, but they were stopped by the friars. 10781090 -> 1000007801140: Its importance fell in the first half of the 20th century following the U.S. occupation and administration of the islands. 10781100 -> 1000007801150: The introduction of the English language in the Philippine government system put an end to the use of Spanish as the official language. 10781110 -> 1000007801160: The language lost its official status in 1973 during the Ferdinand Marcos administration. 10781120 -> 1000007801170: Spanish is spoken mainly by small communities of Filipino-born Spaniards, Latin Americans, and Filipino mestizos (mixed race), descendants of the early colonial Spanish settlers. 
10781130 -> 1000007801180: Throughout the 20th century, the Spanish language declined in importance compared to English and Tagalog. 10781140 -> 1000007801190: According to the 1990 Philippine census, there were 2,658 native speakers of Spanish. 10781150 -> 1000007801200: No figures were provided during the 1995 and 2000 censuses; however, figures for 2000 did specify there were over 600,000 native speakers of Chavacano, a Spanish-based creole language spoken in Cavite and Zamboanga. 10781160 -> 1000007801210: Some other sources put the number of Spanish speakers in the Philippines at around two to three million; however, these sources are disputed. 10781170 -> 1000007801220: Tagalog contains about 4,000 words adopted from Spanish, and there are around 6,000 Spanish loanwords in Visayan and other Philippine languages as well. 10781180 -> 1000007801230: Today, Spanish is offered as a foreign language in Philippine schools and universities. 10781190 -> 1000007801240: Africa 10781200 -> 1000007801250: In Africa, Spanish is official in the UN-recognised but Moroccan-occupied Western Sahara (co-official Arabic) and Equatorial Guinea (co-official French and Portuguese). 10781210 -> 1000007801260: Today, nearly 200,000 refugee Sahrawis are able to read and write in Spanish, and several thousand have received a university education in foreign countries as part of aid packages (mainly from Cuba and Spain). 10781220 -> 1000007801270: In Equatorial Guinea, Spanish is the predominant language when counting native and non-native speakers (around 500,000 people), while Fang has the largest number of native speakers. 10781230 -> 1000007801280: It is also spoken in the Spanish cities in continental North Africa (Ceuta and Melilla) and in the autonomous community of the Canary Islands (143,000 and 1,995,833 people, respectively). 10781240 -> 1000007801290: Within Northern Morocco, a former Franco-Spanish protectorate that is also geographically close to Spain, approximately 20,000 people speak Spanish. 10781250 -> 1000007801300: It is spoken in some communities of Angola, because of the Cuban influence from the Cold War, and in Nigeria by the descendants of Afro-Cuban ex-slaves. 10781260 -> 1000007801310: In Côte d'Ivoire and Senegal, Spanish can be learned as a second foreign language in the public education system. 10781270 -> 1000007801320: In 2008, Cervantes Institute centers will be opened in Lagos and Johannesburg, the first in Sub-Saharan Africa. 10781280 -> 1000007801330: Oceania 10781290 -> 1000007801340: Among the countries and territories in Oceania, Spanish is also spoken on Easter Island, a territorial possession of Chile. 10781300 -> 1000007801350: According to the 2001 census, there are approximately 95,000 speakers of Spanish in Australia, 44,000 of whom live in Greater Sydney, where the older Mexican, Colombian, and Spanish populations and newer Argentine, Salvadoran and Uruguayan communities live. 10781310 -> 1000007801360: The island nations of Guam, Palau, the Northern Marianas, the Marshall Islands and the Federated States of Micronesia all once had Spanish speakers, since the Marianas and Caroline Islands were Spanish colonial possessions until the late 19th century (see Spanish-American War), but Spanish has since been forgotten. 10781320 -> 1000007801370: It now exists only as an influence on the local native languages, and is also spoken by Hispanic American resident populations. 
10781330 -> 1000007801380: Dialectal variation 10781340 -> 1000007801390: There are important variations among the regions of Spain and throughout Spanish-speaking America. 10781350 -> 1000007801400: In countries in Hispanophone America, it is preferable to use the word castellano to distinguish their version of the language from that of Spain, thus asserting their autonomy and national identity. 10781360 -> 1000007801410: In Spain the Castilian dialect's pronunciation is commonly regarded as the national standard, although a use of slightly different pronouns called {(Lang+laísmo+es+laísmo)} of this dialect is deprecated. 10781370 -> 1000007801420: More accurately, for nearly everyone in Spain, "standard Spanish" means "pronouncing everything exactly as it is written," an ideal which does not correspond to any real dialect, though the northern dialects are the closest to it. 10781380 -> 1000007801430: In practice, the standard way of speaking Spanish in the media is "written Spanish" for formal speech, "Madrid dialect" (one of the transitional variants between Castilian and Andalusian) for informal speech. 10781390 -> 1000007801440: Voseo 10781400 -> 1000007801450: Spanish has three second-person singular pronouns: {(Lang+tú+es+tú)}, {(Lang+usted+es+usted)}, and in some parts of Latin America, {(Lang+vos+es+vos)} (the use of this pronoun and/or its verb forms is called voseo). 10781410 -> 1000007801460: In those regions where it is used, generally speaking, {(Lang+tú+es+tú)} and {(Lang+vos+es+vos)} are informal and used with friends; in other countries, {(Lang+vos+es+vos)} is considered an archaic form. 10781415 -> 1000007801470: {(Lang+Usted+es+Usted)} is universally regarded as the formal address (derived from {(Lang+vuestra merced+es+vuestra merced)}, "your grace"), and is used as a mark of respect, as when addressing one's elders or strangers. 10781420 -> 1000007801480: {(Lang+Vos+es+Vos)} is used extensively as the primary spoken form of the second-person singular pronoun, although with wide differences in social consideration, in many countries of Latin America, including Argentina, Chile, Costa Rica, the central mountain region of Ecuador, the State of Chiapas in Mexico, El Salvador, Guatemala, Honduras, Nicaragua, Paraguay, Uruguay, the Paisa region and Caleños of Colombia and the States of Zulia and Trujillo in Venezuela. 10781430 -> 1000007801490: There are some differences in the verbal endings for vos in each country. 10781440 -> 1000007801500: In Argentina, Uruguay, and increasingly in Paraguay and some Central American countries, it is also the standard form used in the media, but the media in other countries with {(Lang+voseo+es+voseo)} generally continue to use {(Lang+usted+es+usted)} or {(Lang+tú+es+tú)} except in advertisements, for instance. 10781445 -> 1000007801510: {(Lang+Vos+es+Vos)} may also be used regionally in other countries. 10781450 -> 1000007801520: Depending on country or region, usage may be considered standard or (by better educated speakers) to be unrefined. 10781460 -> 1000007801530: Interpersonal situations in which the use of vos is acceptable may also differ considerably between regions. 10781470 -> 1000007801540: Ustedes 10781480 -> 1000007801550: Spanish forms also differ regarding second-person plural pronouns. 
10781490 -> 1000007801560: The Spanish dialects of Latin America have only one form of the second-person plural for daily use, {(Lang+ustedes+es+ustedes)} (formal or familiar, as the case may be, though {(Lang+vosotros+es+vosotros)} non-formal usage can sometimes appear in poetry and rhetorical or literary style). 10781500 -> 1000007801570: In Spain there are two forms — {(Lang+ustedes+es+ustedes)} (formal) and {(Lang+vosotros+es+vosotros)} (familiar). 10781510 -> 1000007801580: The pronoun {(Lang+vosotros+es+vosotros)} is the plural form of {(Lang+tú+es+tú)} in most of Spain, but in the Americas (and certain southern Spanish cities such as Cádiz or Seville, and in the Canary Islands) it is replaced with {(Lang+ustedes+es+ustedes)}. 10781520 -> 1000007801590: It is notable that the use of {(Lang+ustedes+es+ustedes)} for the informal plural "you" in southern Spain does not follow the usual rule for pronoun-verb agreement; e.g., while the formal form for "you go", {(Lang+ustedes van+es+ustedes van)}, uses the third-person plural form of the verb, in Cádiz or Seville the informal form is constructed as {(Lang+ustedes vais+es+ustedes vais)}, using the second-person plural of the verb. 10781530 -> 1000007801600: In the Canary Islands, though, the usual pronoun-verb agreement is preserved in most cases. 10781540 -> 1000007801610: Some words can be different, even embarrassingly so, in different Hispanophone countries. 10781550 -> 1000007801620: Most Spanish speakers can recognize other Spanish forms, even in places where they are not commonly used, but Spaniards generally do not recognise specifically American usages. 10781560 -> 1000007801630: For example, Spanish mantequilla, aguacate and albaricoque (respectively, "butter", "avocado", "apricot") correspond to manteca, palta, and damasco, respectively, in Argentina, Chile and Uruguay. 10781570 -> 1000007801640: The everyday Spanish words coger (to catch, get, or pick up), pisar (to step on) and concha (seashell) are considered extremely rude in parts of Latin America, where the meaning of coger and pisar is also "to have sex" and concha means "vulva". 10781580 -> 1000007801650: The Puerto Rican word for "bobby pin" (pinche) is an obscenity in Mexico, and in Nicaragua simply means "stingy". 10781590 -> 1000007801660: Other examples include taco, which means "swearword" in Spain but is known to the rest of the world as a Mexican dish. 10781600 -> 1000007801670: Pija in many countries of Latin America is an obscene slang word for "penis", while in Spain the word also signifies "posh girl" or "snobby". 10781610 -> 1000007801680: Coche, which means "car" in Spain, for the vast majority of Spanish-speakers actually means "baby-stroller", in Guatemala it means "pig", while carro means "car" in some Latin American countries and "cart" in others, as well as in Spain. 10781620 -> 1000007801690: The {(Lang+Real Academia Española+es+Real Academia Española)} (Royal Spanish Academy), together with the 21 other national ones (see Association of Spanish Language Academies), exercises a standardizing influence through its publication of dictionaries and widely respected grammar and style guides. 10781630 -> 1000007801700: Due to this influence and for other sociohistorical reasons, a standardized form of the language (Standard Spanish) is widely acknowledged for use in literature, academic contexts and the media. 
10781640 -> 1000007801710: Writing system 10781650 -> 1000007801720: Spanish is written using the Latin alphabet, with the addition of the character ñ (eñe, representing the phoneme {(IPA+/ɲ/+/ɲ/)}, a letter distinct from n, although typographically composed of an n with a tilde) and the digraphs ch ({(Lang+che+es+che)}, representing the phoneme {(IPA+/tʃ/+/tʃ/)}) and ll ({(Lang+elle+es+elle)}, representing the phoneme {(IPA+/ʎ/+/ʎ/)}). 10781660 -> 1000007801730: However, the digraph rr ({(Lang+erre fuerte+es+erre fuerte)}, "strong r", {(Lang+erre doble+es+erre doble)}, "double r", or simply {(Lang+erre+es+erre)}), which also represents a distinct phoneme {(IPA+/r/+/r/)}, is not similarly regarded as a single letter. 10781670 -> 1000007801740: Since 1994, the digraphs ch and ll are to be treated as letter pairs for collation purposes, though they remain a part of the alphabet. 10781680 -> 1000007801750: Words with ch are now alphabetically sorted between those with ce and ci, instead of following cz as they used to, and similarly for ll. 10781690 -> 1000007801760: Thus, the Spanish alphabet has the following 29 letters: 10781700 -> 1000007801770: a, b, c, ch, d, e, f, g, h, i, j, k, l, ll, m, n, ñ, o, p, q, r, s, t, u, v, w, x, y, z. 10781710 -> 1000007801780: With the exclusion of a very small number of regional terms such as México (see Toponymy of Mexico) and some neologisms like software, pronunciation can be entirely determined from spelling. 10781720 -> 1000007801790: A typical Spanish word is stressed on the syllable before the last if it ends with a vowel (not including y) or with a vowel followed by n or s; it is stressed on the last syllable otherwise. 10781730 -> 1000007801800: Exceptions to this rule are indicated by placing an acute accent on the stressed vowel. 10781740 -> 1000007801810: The acute accent is used, in addition, to distinguish between certain homophones, especially when one of them is a stressed word and the other one is a clitic: compare {(Lang+el+es+el)} ("the", masculine singular definite article) with {(Lang+él+es+él)} ("he" or "it"), or {(Lang+te+es+te)} ("you", object pronoun), {(Lang+de+es+de)} (preposition "of" or "from"), and {(Lang+se+es+se)} (reflexive pronoun) with {(Lang+té+es+té)} ("tea"), {(Lang+dé+es+dé)} ("give") and {(Lang+sé+es+sé)} ("I know", or imperative "be"). 10781750 -> 1000007801820: The interrogative pronouns ({(Lang+qué+es+qué)}, {(Lang+cuál+es+cuál)}, {(Lang+dónde+es+dónde)}, {(Lang+quién+es+quién)}, etc.) also receive accents in direct or indirect questions, and some demonstratives ({(Lang+ése+es+ése)}, {(Lang+éste+es+éste)}, {(Lang+aquél+es+aquél)}, etc.) must be accented when used as pronouns. 10781760 -> 1000007801830: The conjunction {(Lang+o+es+o)} ("or") is written with an accent between numerals so as not to be confused with a zero: e.g., {(Lang+10 ó 20+es+10 ó 20)} should be read as {(Lang+diez o veinte+es+diez o veinte)} rather than {(Lang+diez mil veinte+es+diez mil veinte)} ("10,020"). 10781770 -> 1000007801840: Accent marks are frequently omitted in capital letters (a widespread practice in the early days of computers where only lowercase vowels were available with accents), although the RAE advises against this. 
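To make the default stress rule concrete, here is a minimal Python sketch: it assumes the word has already been divided into syllables, and it ignores complications such as accented vowels in the input, diphthongs, and adverbs ending in -mente.

```python
# Minimal sketch of the default Spanish stress rule described above.
# The caller supplies the syllable division; real orthography also has to
# handle diphthongs, adverbs in -mente, and other special cases.

def default_stress_index(syllables):
    """Index of the syllable stressed by the default rule: the penultimate
    syllable if the word ends in a vowel, n or s; the final syllable otherwise."""
    word = "".join(syllables)
    if len(syllables) > 1 and word[-1] in "aeiouns":
        return len(syllables) - 2   # penultimate syllable
    return len(syllables) - 1       # final syllable

def needs_written_accent(syllables, stressed_index):
    """A word whose actual stress deviates from the default rule takes an acute accent."""
    return stressed_index != default_stress_index(syllables)

print(default_stress_index(["za", "pa", "to"]))   # 1 -> za-PA-to (zapato, no accent needed)
print(needs_written_accent(["ciu", "dad"], 1))    # False -> ciudad is written without an accent
print(needs_written_accent(["ar", "bol"], 0))     # True  -> stress is irregular, hence árbol
```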
10781780 -> 1000007801850: When u is written between g and a front vowel (e or i), if it should be pronounced, it is written with a diaeresis (ü) to indicate that it is not silent as it normally would be (e.g., cigüeña, "stork", is pronounced {(IPA+/θiˈɣweɲa/+/θiˈɣweɲa/)}; if it were written cigueña, it would be pronounced {(IPA+/θiˈɣeɲa/+/θiˈɣeɲa/)}). 10781790 -> 1000007801860: Interrogative and exclamatory clauses are introduced with inverted question ( ¿ ) and exclamation ( ¡ ) marks. 10781800 -> 1000007801870: Sounds 10781810 -> 1000007801880: The phonemic inventory listed in the following table includes phonemes that are preserved only in some dialects, other dialects having merged them (such as yeísmo); these are marked with an asterisk (*). 10781820 -> 1000007801890: Sounds in parentheses are allophones. 10781830 -> 1000007801900: By the 16th century, the consonant system of Spanish underwent the following important changes that differentiated it from neighboring Romance languages such as Portuguese and Catalan: 10781840 -> 1000007801910: Initial {(IPA+/f/+/f/)}, when it had evolved into a vacillating {(IPA+/h/+/h/)}, was lost in most words (although this etymological h- is preserved in spelling and in some Andalusian dialects is still aspirated). 10781850 -> 1000007801920: The bilabial approximant {(IPA+/β̞/+/β̞/)} (which was written u or v) merged with the bilabial occlusive {(IPA+/b/+/b/)} (written b). 10781860 -> 1000007801930: There is no difference between the pronunciation of orthographic b and v in contemporary Spanish, excepting emphatic pronunciations that cannot be considered standard or natural. 10781870 -> 1000007801940: The voiced alveolar fricative {(IPA+/z/+/z/)}, which existed as a separate phoneme in medieval Spanish, merged with its voiceless counterpart {(IPA+/s/+/s/)}. 10781880 -> 1000007801950: The phoneme which resulted from this merger is currently spelled s. 10781890 -> 1000007801960: The voiced postalveolar fricative {(IPA+/ʒ/+/ʒ/)} merged with its voiceless counterpart {(IPA+/ʃ/+/ʃ/)}, which evolved into the modern velar sound {(IPA+/x/+/x/)} by the 17th century, now written with j, or g before e, i. 10781900 -> 1000007801970: Nevertheless, in most parts of Argentina and in Uruguay, y and ll have both evolved to {(IPA+/ʒ/+/ʒ/)} or {(IPA+/ʃ/+/ʃ/)}. 10781910 -> 1000007801980: The voiced alveolar affricate {(IPA+/dz/+/dz/)} merged with its voiceless counterpart {(IPA+/ts/+/ts/)}, which then developed into the interdental {(IPA+/θ/+/θ/)}, now written z, or c before e, i. 10781920 -> 1000007801990: But in Andalusia, the Canary Islands and the Americas this sound merged with {(IPA+/s/+/s/)} as well. 10781930 -> 1000007802000: See Ceceo for further information. 10781940 -> 1000007802010: The consonant system of Medieval Spanish has been better preserved in Ladino and in Portuguese, neither of which underwent these shifts. 10781950 -> 1000007802020: Lexical stress 10781960 -> 1000007802030: Spanish is a syllable-timed language, so each syllable has approximately the same duration regardless of stress. 10781970 -> 1000007802040: Stress most often occurs on any of the last three syllables of a word, with rare exceptions on the fourth-to-last syllable. 10781980 -> 1000007802050: The tendencies of stress assignment are as follows: 10781990 -> 1000007802060: In words ending in vowels, {(IPA+/n/+/n/)} or {(IPA+/s/+/s/)}, stress most often falls on the penultimate syllable. 10782000 -> 1000007802070: In words ending in all other consonants, the stress more often falls on the final syllable.
10782010 -> 1000007802080: Preantepenultimate stress occurs rarely and only in words like guardándoselos ('saving them for him/her') where a clitic follows certain verbal forms. 10782020 -> 1000007802090: In addition to the many exceptions to these tendencies, there are numerous minimal pairs which contrast solely on stress. 10782030 -> 1000007802100: For example, sabana, with penultimate stress, means 'savannah' while {(Lang+sábana+es+sábana)}, with antepenultimate stress, means 'sheet'; {(Lang+límite+es+límite)} ('boundary'), {(Lang+limite+es+limite)} ('[that] he/she limits') and {(Lang+limité+es+limité)} ('I limited') also contrast solely on stress. 10782040 -> 1000007802110: Phonological stress may be marked orthographically with an acute accent (ácido, distinción, etc.). 10782050 -> 1000007802120: This is done according to the mandatory stress rules of Spanish orthography, which are similar to the tendencies above (differing with words like distinción) and are defined so as to indicate unequivocally where the stress lies in a given written word. 10782060 -> 1000007802130: An acute accent may also be used to differentiate homophones (such as té for 'tea' and te for 'you'). 10782070 -> 1000007802140: An amusing example of the significance of intonation in Spanish is the phrase {(Lang+¿Cómo "cómo como"?+es+¿Cómo "cómo como"?)} 10782080 -> 1000007802150: {(Lang+¡Como como como!+es+¡Como como como!)} 10782090 -> 1000007802160: ("What do you mean / 'how / do I eat'? / I eat / the way / I eat!"). 10782100 -> 1000007802170: Grammar 10782110 -> 1000007802180: Spanish is a relatively inflected language, with a two-gender system and about fifty conjugated forms per verb, but limited inflection of nouns, adjectives, and determiners. 10782120 -> 1000007802190: (For a detailed overview of verbs, see Spanish verbs and Spanish irregular verbs.) 10782130 -> 1000007802200: It is right-branching, uses prepositions, and usually, though not always, places adjectives after nouns. 10782140 -> 1000007802210: Its syntax is generally Subject Verb Object, though variations are common. 10782150 -> 1000007802220: It is a pro-drop language (allows the deletion of pronouns when pragmatically unnecessary) and verb-framed. 10782160 -> None: Samples 10790010 -> 1000007900020: Speech recognition 10790020 -> 1000007900030: Speech recognition (also known as automatic speech recognition or computer speech recognition) converts spoken words to machine-readable input (for example, to keypresses, using the binary code for a string of character codes). 10790030 -> 1000007900040: The term voice recognition may also be used to refer to speech recognition, but more precisely refers to speaker recognition, which attempts to identify the person speaking, as opposed to what is being said. 10790040 -> 1000007900050: Speech recognition applications include voice dialing (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"), domotic appliance control and content-based spoken audio search (e.g., find a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), speech-to-text processing (e.g., word processors or emails), and in aircraft cockpits (usually termed Direct Voice Input).
10790050 -> 1000007900060: History 10790060 -> 1000007900070: One of the most notable domains for the commercial application of speech recognition in the United States has been health care, and in particular the work of the medical transcriptionist (MT). 10790070 -> 1000007900080: According to industry experts, at its inception, speech recognition (SR) was sold as a way to completely eliminate transcription rather than make the transcription process more efficient, hence it was not accepted. 10790080 -> 1000007900090: It was also the case that SR at that time was often technically deficient. 10790090 -> 1000007900100: Additionally, to be used effectively, it required changes to the ways physicians worked and documented clinical encounters, changes which many, if not all, were reluctant to make. 10790100 -> 1000007900110: The biggest limitation to speech recognition automating transcription, however, is seen as the software itself. 10790110 -> 1000007900120: The nature of narrative dictation is highly interpretive and often requires judgment that may be provided by a real human but not yet by an automated system. 10790120 -> 1000007900130: Another limitation has been the extensive amount of time required by the user and/or system provider to train the software. 10790130 -> 1000007900140: A distinction in ASR is often made between "artificial syntax systems", which are usually domain-specific, and "natural language processing", which is usually language-specific. 10790140 -> 1000007900150: Each of these types of application presents its own particular goals and challenges. 10790150 -> 1000007900160: Applications 10790160 -> 1000007900170: Health care 10790170 -> 1000007900180: In the health care domain, even in the wake of improving speech recognition technologies, medical transcriptionists (MTs) have not yet become obsolete. 10790180 -> 1000007900190: Many experts in the field anticipate that with increased use of speech recognition technology, the services provided may be redistributed rather than replaced. 10790190 -> 1000007900200: Speech recognition can be implemented in the front end or the back end of the medical documentation process. 10790200 -> 1000007900210: Front-End SR is where the provider dictates into a speech-recognition engine, the recognized words are displayed right after they are spoken, and the dictator is responsible for editing and signing off on the document. 10790210 -> 1000007900220: The document never goes through an MT/editor. 10790220 -> 1000007900230: Back-End SR or Deferred SR is where the provider dictates into a digital dictation system, the voice is routed through a speech-recognition machine, and the recognized draft document is routed along with the original voice file to the MT/editor, who edits the draft and finalizes the report. 10790230 -> 1000007900240: Deferred SR is currently widely used in the industry. 10790240 -> 1000007900250: Many Electronic Medical Records (EMR) applications can be more effective and may be performed more easily when deployed in conjunction with a speech-recognition engine. 10790250 -> 1000007900260: Searches, queries, and form filling may all be faster to perform by voice than by using a keyboard. 10790290 -> 1000007900270: Military 10790300 -> 1000007900280: High-performance fighter aircraft 10790310 -> 1000007900290: Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in fighter aircraft. 10790320 -> 1000007900300: Of particular note are the U.S.
program in speech recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16 VISTA), the program in France on installing speech recognition systems on Mirage aircraft, and programs in the UK dealing with a variety of aircraft platforms. 10790330 -> 1000007900310: In these programs, speech recognizers have been operated successfully in fighter aircraft with applications including: setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling flight displays. 10790340 -> 1000007900320: Generally, only very limited, constrained vocabularies have been used successfully, and a major effort has been devoted to integration of the speech recognizer with the avionics system. 10790350 -> 1000007900330: Some important conclusions from the work were as follows: 10790360 -> 1000007900340: Speech recognition has definite potential for reducing pilot workload, but this potential was not realized consistently. 10790370 -> 1000007900350: Achievement of very high recognition accuracy (95% or more) was the most critical factor for making the speech recognition system useful — with lower recognition rates, pilots would not use the system. 10790380 -> 1000007900360: More natural vocabulary and grammar, and shorter training times would be useful, but only if very high recognition rates could be maintained. 10790390 -> 1000007900370: Laboratory research in robust speech recognition for military environments has produced promising results which, if extendable to the cockpit, should improve the utility of speech recognition in high-performance aircraft. 10790400 -> 1000007900380: Working with Swedish pilots flying in the JAS-39 Gripen cockpit, Englund (2004) found recognition deteriorated with increasing G-loads. 10790410 -> 1000007900390: It was also concluded that adaptation greatly improved the results in all cases and introducing models for breathing was shown to improve recognition scores significantly. 10790420 -> 1000007900400: Contrary to what might be expected, no effects of the broken English of the speakers were found. 10790430 -> 1000007900410: It was evident that spontaneous speech caused problems for the recognizer, as could be expected. 10790440 -> 1000007900420: A restricted vocabulary, and above all, a proper syntax, could thus be expected to improve recognition accuracy substantially. 10790450 -> 1000007900430: The Eurofighter Typhoon currently in service with the UK RAF employs a speaker-dependent system, i.e. it requires each pilot to create a template. 10790460 -> 1000007900440: The system is not used for any safety critical or weapon critical tasks, such as weapon release or lowering of the undercarriage, but is used for a wide range of other cockpit functions. 10790470 -> 1000007900450: Voice commands are confirmed by visual and/or aural feedback. 10790480 -> 1000007900460: The system is seen as a major design feature in the reduction of pilot workload, and even allows the pilot to assign targets to himself with two simple voice commands or to any of his wingmen with only five commands. 10790490 -> 1000007900470: Helicopters 10790500 -> 1000007900480: The problems of achieving high recognition accuracy under stress and noise pertain strongly to the helicopter environment as well as to the fighter environment. 
10790510 -> 1000007900490: The acoustic noise problem is actually more severe in the helicopter environment, not only because of the high noise levels but also because the helicopter pilot generally does not wear a facemask, which would reduce acoustic noise in the microphone. 10790520 -> 1000007900500: Substantial test and evaluation programs have been carried out in the past decade on speech recognition system applications in helicopters, notably by the U.S. Army Avionics Research and Development Activity (AVRADA) and by the Royal Aerospace Establishment (RAE) in the UK. 10790530 -> 1000007900510: Work in France has included speech recognition in the Puma helicopter. 10790540 -> 1000007900520: There has also been much useful work in Canada. 10790550 -> 1000007900530: Results have been encouraging, and voice applications have included: control of communication radios; setting of navigation systems; and control of an automated target handover system. 10790560 -> 1000007900540: As in fighter applications, the overriding issue for voice in helicopters is the impact on pilot effectiveness. 10790570 -> 1000007900550: Encouraging results are reported for the AVRADA tests, although these represent only a feasibility demonstration in a test environment. 10790580 -> 1000007900560: Much remains to be done in speech recognition technology in order to consistently achieve performance improvements in operational settings. 10790590 -> 1000007900570: Battle management 10790600 -> 1000007900580: Battle management command centres generally require rapid access to and control of large, rapidly changing information databases. 10790610 -> 1000007900590: Commanders and system operators need to query these databases as conveniently as possible, in an eyes-busy environment where much of the information is presented in a display format. 10790620 -> 1000007900600: Human-machine interaction by voice has the potential to be very useful in these environments. 10790630 -> 1000007900610: A number of efforts have been undertaken to interface commercially available isolated-word recognizers into battle management environments. 10790640 -> 1000007900620: In one feasibility study, speech recognition equipment was tested in conjunction with an integrated information display for naval battle management applications. 10790650 -> 1000007900630: Users were very optimistic about the potential of the system, although capabilities were limited. 10790660 -> 1000007900640: Speech understanding programs sponsored by the Defense Advanced Research Projects Agency (DARPA) in the U.S. have focused on this problem of natural speech interfaces. 10790670 -> 1000007900650: Speech recognition efforts have focused on a database of continuous, large-vocabulary speech designed to be representative of the naval resource management task. 10790680 -> 1000007900660: Significant advances in the state of the art in continuous speech recognition (CSR) have been achieved, and current efforts are focused on integrating speech recognition and natural language processing to allow spoken language interaction with a naval resource management system. 10790690 -> 1000007900670: Training air traffic controllers 10790700 -> 1000007900680: Training for military (or civilian) air traffic controllers (ATC) represents an excellent application for speech recognition systems.
10790710 -> 1000007900690: Many ATC training systems currently require a person to act as a "pseudo-pilot", engaging in a voice dialog with the trainee controller, which simulates the dialog the controller would have to conduct with pilots in a real ATC situation. 10790720 -> 1000007900700: Speech recognition and synthesis techniques offer the potential to eliminate the need for a person to act as pseudo-pilot, thus reducing training and support personnel. 10790730 -> 1000007900710: Air controller tasks are also characterized by highly structured speech as the primary output of the controller, hence reducing the difficulty of the speech recognition task. 10790740 -> 1000007900720: The U.S. Naval Training Equipment Center has sponsored a number of developments of prototype ATC trainers using speech recognition. 10790750 -> 1000007900730: Generally, the recognition accuracy falls short of providing graceful interaction between the trainee and the system. 10790760 -> 1000007900740: However, the prototype training systems have demonstrated a significant potential for voice interaction in these systems, and in other training applications. 10790770 -> 1000007900750: The U.S. Navy has sponsored a large-scale effort in ATC training systems, where a commercial speech recognition unit was integrated with a complex training system including displays and scenario creation. 10790780 -> 1000007900760: Although the recognizer was constrained in vocabulary, one of the goals of the training programs was to teach the controllers to speak in a constrained language, using a vocabulary specifically designed for the ATC task. 10790790 -> 1000007900770: Research in France has focused on the application of speech recognition in ATC training systems, directed at issues both in speech recognition and in the application of task-domain grammar constraints. 10790800 -> 1000007900780: The USAF, USMC, US Army, and FAA are currently using ATC simulators with speech recognition provided by Adacel Systems Inc (ASI). 10790810 -> 1000007900790: Adacel's MaxSim software uses speech recognition and synthetic speech to enable the trainee to control aircraft and ground vehicles in the simulation without the need for pseudo-pilots. 10790820 -> 1000007900800: Adacel's ATC In A Box software provides a synthetic ATC environment for flight simulators. 10790830 -> 1000007900810: The "real" pilot talks to a virtual controller using speech recognition, and the virtual controller responds with synthetic speech. 10790850 -> 1000007900830: Telephony and other domains 10790860 -> 1000007900840: ASR in the field of telephony is now commonplace, and in the field of computer gaming and simulation it is becoming more widespread. 10790870 -> 1000007900850: Despite the high level of integration with word processing in general personal computing, however, ASR in the field of document production has not seen the expected increases in use. 10790880 -> 1000007900860: Improvements in mobile processor speed have made it possible to create speech-enabled Symbian and Windows Mobile smartphones. 10790890 -> 1000007900870: Current speech-to-text programs are too large and require too much CPU power to be practical for the Pocket PC. 10790900 -> 1000007900880: Speech is used mostly as a part of the user interface, for creating predefined or custom speech commands.
10790910 -> 1000007900890: Leading software vendors in this field are: Microsoft Corporation (Microsoft Voice Command); Nuance Communications (Nuance Voice Control); Vito Technology (VITO Voice2Go); Speereo Software (Speereo Voice Translator). 10790920 -> 1000007900900: People with disabilities 10790930 -> 1000007900910: People with disabilities are another part of the population that benefits from using speech recognition programs. 10790940 -> 1000007900920: It is especially useful for people who have difficulty with or are unable to use their hands, from mild repetitive stress injuries to involved disabilities that require alternative input for support with accessing the computer. 10790950 -> 1000007900930: In fact, people who used the keyboard a lot and developed RSI became an urgent early market for speech recognition. 10790960 -> 1000007900940: Speech recognition is used in deaf telephony, such as SpinVox voice-to-text voicemail, relay services, and captioned telephone. 10790970 -> 1000007900950: Further applications 10790980 -> 1000007900960: Automatic translation 10790990 -> 1000007900970: Automotive speech recognition (e.g., Ford Sync) 10791000 -> 1000007900980: Telematics (e.g. vehicle Navigation Systems) 10791010 -> 1000007900990: Court reporting (Realtime Voice Writing) 10791020 -> 1000007901000: Hands-free computing: voice command recognition computer user interface 10791030 -> 1000007901010: Home automation 10791040 -> 1000007901020: Interactive voice response 10791050 -> 1000007901030: Mobile telephony, including mobile email 10791060 -> 1000007901040: Multimodal interaction 10791070 -> 1000007901050: Pronunciation evaluation in computer-aided language learning applications 10791080 -> 1000007901060: Robotics 10791090 -> 1000007901070: Transcription (digital speech-to-text) 10791100 -> 1000007901080: Speech-to-text (transcription of speech into mobile text messages) 10791110 -> 1000007901090: Performance of speech recognition systems 10791120 -> 1000007901100: The performance of speech recognition systems is usually specified in terms of accuracy and speed. 10791130 -> 1000007901110: Accuracy is usually rated in terms of word error rate (WER), whereas speed is measured with the real time factor. 10791140 -> 1000007901120: Other measures of accuracy include Single Word Error Rate (SWER) and Command Success Rate (CSR). 10791150 -> 1000007901130: Most speech recognition users would tend to agree that dictation machines can achieve very high performance in controlled conditions. 10791160 -> 1000007901140: There is some confusion, however, over the interchangeability of the terms "speech recognition" and "dictation". 10791170 -> 1000007901150: Commercially available speaker-dependent dictation systems usually require only a short period of training (sometimes also called 'enrollment') and may successfully capture continuous speech with a large vocabulary at a normal pace with very high accuracy. 10791180 -> 1000007901160: Most commercial companies claim that recognition software can achieve between 98% and 99% accuracy if operated under optimal conditions. 10791190 -> 1000007901170: 'Optimal conditions' usually assume that users: 10791200 -> 1000007901180: have speech characteristics which match the training data, 10791210 -> 1000007901190: can achieve proper speaker adaptation, and 10791220 -> 1000007901200: work in a clean, low-noise environment (e.g. a quiet office or laboratory space).
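As an illustration of the word error rate metric mentioned above, the following minimal Python sketch computes WER as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and a recognizer's hypothesis, divided by the number of reference words; the two transcripts are hypothetical examples.

```python
# Minimal sketch of the word error rate (WER) metric:
# WER = (substitutions + deletions + insertions) / number of reference words,
# computed with a standard edit-distance dynamic program over words.

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("call home now", "please call home"))  # 2 errors / 3 words, about 0.67
```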
10791230 -> 1000007901210: These optimal-condition assumptions explain why some users, especially those whose speech is heavily accented, might achieve recognition rates much lower than expected. 10791240 -> 1000007901220: Speech recognition in video has become a popular search technology used by several video search companies. 10791250 -> 1000007901230: Limited vocabulary systems, requiring no training, can recognize a small number of words (for instance, the ten digits) as spoken by most speakers. 10791260 -> 1000007901240: Such systems are popular for routing incoming phone calls to their destinations in large organizations. 10791270 -> 1000007901250: Both acoustic modeling and language modeling are important parts of modern statistically-based speech recognition algorithms. 10791280 -> 1000007901260: Hidden Markov models (HMMs) are widely used in many systems. 10791290 -> 1000007901270: Language modeling has many other applications, such as smart keyboards and document classification. 10791300 -> 1000007901280: Hidden Markov model (HMM)-based speech recognition 10791310 -> 1000007901290: Modern general-purpose speech recognition systems are generally based on HMMs. 10791320 -> 1000007901300: These are statistical models which output a sequence of symbols or quantities. 10791330 -> 1000007901310: One possible reason why HMMs are used in speech recognition is that a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal. 10791340 -> 1000007901320: That is, over a short time window on the order of 10 milliseconds, speech can be approximated as a stationary process. 10791350 -> 1000007901330: Speech can thus be treated as a Markov process for many practical purposes. 10791360 -> 1000007901340: Another reason why HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. 10791370 -> 1000007901350: In speech recognition, the hidden Markov model would output a sequence of n-dimensional real-valued vectors (with n being a small integer, such as 10), outputting one of these every 10 milliseconds. 10791380 -> 1000007901360: The vectors would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech, decorrelating the spectrum using a cosine transform, and then taking the first (most significant) coefficients. 10791390 -> 1000007901370: The hidden Markov model will tend to have in each state a statistical distribution that is a mixture of diagonal-covariance Gaussians, which gives a likelihood for each observed vector. 10791400 -> 1000007901380: Each word, or (for more general speech recognition systems) each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes. 10791410 -> 1000007901390: Described above are the core elements of the most common, HMM-based approach to speech recognition. 10791420 -> 1000007901400: Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach described above.
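To make the basic approach concrete, here is a minimal Python/NumPy sketch of Viterbi decoding over an HMM whose states emit feature vectors through single diagonal-covariance Gaussians; the two-state model, its parameters, and the toy "cepstral" observations are hypothetical, and a real recognizer would use Gaussian mixtures, many more states, and a language model.

```python
import numpy as np

def log_gaussian(x, mean, var):
    """Log-density of vector x under a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def viterbi(obs, log_pi, log_A, means, variances):
    """Most likely state sequence for the T x n observation matrix obs."""
    T, n_states = len(obs), len(log_pi)
    delta = np.full((T, n_states), -np.inf)   # best log-score ending in state j at time t
    psi = np.zeros((T, n_states), dtype=int)  # back-pointers
    for j in range(n_states):
        delta[0, j] = log_pi[j] + log_gaussian(obs[0], means[j], variances[j])
    for t in range(1, T):
        for j in range(n_states):
            scores = delta[t - 1] + log_A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] + log_gaussian(obs[t], means[j], variances[j])
    path = [int(np.argmax(delta[-1]))]        # trace back the best path
    for t in range(T - 1, 0, -1):
        path.append(psi[t, path[-1]])
    return path[::-1]

# Toy two-state model over 2-dimensional "cepstral" vectors (hypothetical numbers).
log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.7, 0.3], [0.2, 0.8]])          # state transition probabilities
means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.array([[1.0, 1.0], [1.0, 1.0]])
obs = np.array([[0.1, -0.2], [0.0, 0.3], [2.9, 3.1], [3.2, 2.8]])
print(viterbi(obs, log_pi, log_A, means, variances))  # [0, 0, 1, 1]
```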
10791430 -> 1000007901410: A typical large-vocabulary system would need context dependency for the phonemes (so phonemes with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation. 10791440 -> 1000007901420: The features would have so-called delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection followed perhaps by heteroscedastic linear discriminant analysis or a global semitied covariance transform (also known as maximum likelihood linear transform, or MLLT). 10791450 -> 1000007901430: Many systems use so-called discriminative training techniques which dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. 10791460 -> 1000007901440: Examples are maximum mutual information (MMI), minimum classification error (MCE) and minimum phone error (MPE). 10791470 -> 1000007901450: Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model which includes both the acoustic and language model information, or combining it statically beforehand (the finite state transducer, or FST, approach). 10791480 -> 1000007901460: Dynamic time warping (DTW)-based speech recognition 10791490 -> 1000007901470: Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach. 10791500 -> 1000007901480: Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. 10791510 -> 1000007901490: For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and if in another they were walking more quickly, or even if there were accelerations and decelerations during the course of one observation. 10791520 -> 1000007901500: DTW has been applied to video, audio, and graphics – indeed, any data which can be turned into a linear representation can be analyzed with DTW. 10791530 -> 1000007901510: A well known application has been automatic speech recognition, to cope with different speaking speeds. 10791540 -> 1000007901520: In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g. time series) with certain restrictions, i.e. the sequences are "warped" non-linearly to match each other. 10791550 -> 1000007901530: This sequence alignment method is often used in the context of hidden Markov models. 10791560 -> 1000007901540: Further information 10791570 -> 1000007901550: Popular speech recognition conferences held each year or two include ICASSP, Eurospeech/ICSLP (now named Interspeech) and the IEEE ASRU. 
10791580 -> 1000007901560: Conferences in the field of Natural Language Processing, such as ACL, NAACL, EMNLP, and HLT, are beginning to include papers on speech processing. 10791590 -> 1000007901570: Important journals include the IEEE Transactions on Speech and Audio Processing (now named IEEE Transactions on Audio, Speech and Language Processing), Computer Speech and Language, and Speech Communication. 10791600 -> 1000007901580: Books like "Fundamentals of Speech Recognition" by Lawrence Rabiner can be useful to acquire basic knowledge but may not be fully up to date (1993). 10791610 -> 1000007901590: Another good source can be "Statistical Methods for Speech Recognition" by Frederick Jelinek which is a more up to date book (1998). 10791620 -> 1000007901600: Even more up to date is "Computer Speech", by Manfred R. Schroeder, second edition published in 2004. 10791630 -> 1000007901610: A good insight into the techniques used in the best modern systems can be gained by paying attention to government sponsored evaluations such as those organised by DARPA (the largest speech recognition-related project ongoing as of 2007 is the GALE project, which involves both speech recognition and translation components). 10791640 -> 1000007901620: In terms of freely available resources, the HTK book (and the accompanying HTK toolkit) is one place to start to both learn about speech recognition and to start experimenting. 10791650 -> 1000007901630: Another such resource is Carnegie Mellon University's SPHINX toolkit. 10791660 -> 1000007901640: The AT&T libraries FSM Library, GRM library, and DCD library are also general software libraries for large-vocabulary speech recognition. 10791670 -> 1000007901650: A useful review of the area of robustness in ASR is provided by Junqua and Haton (1995). Speech synthesis 10800010 -> 1000008000020: Speech synthesis 10800020 -> 1000008000030: Speech synthesis is the artificial production of human speech. 10800030 -> 1000008000040: A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware. 10800040 -> 1000008000050: A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. 10800050 -> 1000008000060: Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. 10800060 -> 1000008000070: Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. 10800070 -> 1000008000080: For specific usage domains, the storage of entire words or sentences allows for high-quality output. 10800080 -> 1000008000090: Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output. 10800090 -> 1000008000100: The quality of a speech synthesizer is judged by its similarity to the human voice, and by its ability to be understood. 10800100 -> 1000008000110: An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. 10800110 -> 1000008000120: Many computer operating systems have included speech synthesizers since the early 1980s. 10800120 -> 1000008000130: Overview of text processing 10800130 -> 1000008000140: A text-to-speech system (or "engine") is composed of two parts: a front-end and a back-end. 
10800140 -> 1000008000150: The front-end has two major tasks. 10800150 -> 1000008000160: First, it converts raw text containing symbols like numbers and abbreviations into the equivalent of written-out words. 10800160 -> 1000008000170: This process is often called text normalization, pre-processing, or tokenization. 10800170 -> 1000008000180: The front-end then assigns phonetic transcriptions to each word, and divides and marks the text into prosodic units, like phrases, clauses, and sentences. 10800180 -> 1000008000190: The process of assigning phonetic transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion. 10800190 -> 1000008000200: Phonetic transcriptions and prosody information together make up the symbolic linguistic representation that is output by the front-end. 10800200 -> 1000008000210: The back-end—often referred to as the synthesizer—then converts the symbolic linguistic representation into sound. 10800210 -> 1000008000220: History 10800220 -> 1000008000230: Long before electronic signal processing was invented, there were those who tried to build machines to create human speech. 10800230 -> 1000008000240: Some early legends of the existence of "speaking heads" involved Gerbert of Aurillac (d. 1003 AD), Albertus Magnus (1198–1280), and Roger Bacon (1214–1294). 10800240 -> 1000008000250: In 1779, the Danish scientist Christian Kratzenstein, working at the Russian Academy of Sciences, built models of the human vocal tract that could produce the five long vowel sounds (in International Phonetic Alphabet notation, they are {(IPA+[aː]+[aː])}, {(IPA+[eː]+[eː])}, {(IPA+[iː]+[iː])}, {(IPA+[oː]+[oː])} and {(IPA+[uː]+[uː])}). 10800250 -> 1000008000260: This was followed by the bellows-operated "acoustic-mechanical speech machine" by Wolfgang von Kempelen of Vienna, Austria, described in a 1791 paper. 10800260 -> 1000008000270: This machine added models of the tongue and lips, enabling it to produce consonants as well as vowels. 10800270 -> 1000008000280: In 1837, Charles Wheatstone produced a "speaking machine" based on von Kempelen's design, and in 1857, M. Faber built the "Euphonia". 10800280 -> 1000008000290: Wheatstone's design was resurrected in 1923 by Paget. 10800290 -> 1000008000300: In the 1930s, Bell Labs developed the VOCODER, a keyboard-operated electronic speech analyzer and synthesizer that was said to be clearly intelligible. 10800300 -> 1000008000310: Homer Dudley refined this device into the VODER, which he exhibited at the 1939 New York World's Fair. 10800310 -> 1000008000320: The Pattern playback was built by Dr. Franklin S. Cooper and his colleagues at Haskins Laboratories in the late 1940s and completed in 1950. 10800320 -> 1000008000330: There were several different versions of this hardware device but only one currently survives. 10800330 -> 1000008000340: The machine converts pictures of the acoustic patterns of speech in the form of a spectrogram back into sound. 10800340 -> 1000008000350: Using this device, Alvin Liberman and colleagues were able to discover acoustic cues for the perception of phonetic segments (consonants and vowels). 10800350 -> 1000008000360: Early electronic speech synthesizers sounded robotic and were often barely intelligible. 10800360 -> 1000008000370: However, the quality of synthesized speech has steadily improved, and output from contemporary speech synthesis systems is sometimes indistinguishable from actual human speech. 
10800370 -> 1000008000380: Electronic devices 10800380 -> 1000008000390: The first computer-based speech synthesis systems were created in the late 1950s, and the first complete text-to-speech system was completed in 1968. 10800390 -> 1000008000400: In 1961, physicist John Larry Kelly, Jr and colleague Louis Gerstman used an IBM 704 computer to synthesize speech, an event among the most prominent in the history of Bell Labs. 10800400 -> 1000008000410: Kelly's voice recorder synthesizer (vocoder) recreated the song "Daisy Bell", with musical accompaniment from Max Mathews. 10800410 -> 1000008000420: Coincidentally, Arthur C. Clarke was visiting his friend and colleague John Pierce at the Bell Labs Murray Hill facility. 10800420 -> 1000008000430: Clarke was so impressed by the demonstration that he used it in the climactic scene of his screenplay for his novel 2001: A Space Odyssey, where the HAL 9000 computer sings the same song as it is being put to sleep by astronaut Dave Bowman. 10800430 -> 1000008000440: Despite the success of purely electronic speech synthesis, research is still being conducted into mechanical speech synthesizers. 10800440 -> 1000008000450: Synthesizer technologies 10800450 -> 1000008000460: The most important qualities of a speech synthesis system are naturalness and Intelligibility. 10800460 -> 1000008000470: Naturalness describes how closely the output sounds like human speech, while intelligibility is the ease with which the output is understood. 10800470 -> 1000008000480: The ideal speech synthesizer is both natural and intelligible. 10800480 -> 1000008000490: Speech synthesis systems usually try to maximize both characteristics. 10800490 -> 1000008000500: The two primary technologies for generating synthetic speech waveforms are concatenative synthesis and formant synthesis. 10800500 -> 1000008000510: Each technology has strengths and weaknesses, and the intended uses of a synthesis system will typically determine which approach is used. 10800510 -> 1000008000520: Concatenative synthesis 10800520 -> 1000008000530: Concatenative synthesis is based on the concatenation (or stringing together) of segments of recorded speech. 10800530 -> 1000008000540: Generally, concatenative synthesis produces the most natural-sounding synthesized speech. 10800540 -> 1000008000550: However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. 10800550 -> 1000008000560: There are three main sub-types of concatenative synthesis. 10800560 -> 1000008000570: Unit selection synthesis 10800570 -> 1000008000580: Unit selection synthesis uses large databases of recorded speech. 10800580 -> 1000008000590: During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. 10800590 -> 1000008000600: Typically, the division into segments is done using a specially modified speech recognizer set to a "forced alignment" mode with some manual correction afterward, using visual representations such as the waveform and spectrogram. 10800600 -> 1000008000610: An index of the units in the speech database is then created based on the segmentation and acoustic parameters like the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. 
10800610 -> 1000008000620: At runtime, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection). 10800620 -> 1000008000630: This process is typically achieved using a specially weighted decision tree. 10800630 -> 1000008000640: Unit selection provides the greatest naturalness, because it applies only a small amount of digital signal processing (DSP) to the recorded speech. 10800640 -> 1000008000650: DSP often makes recorded speech sound less natural, although some systems use a small amount of signal processing at the point of concatenation to smooth the waveform. 10800650 -> 1000008000660: The output from the best unit-selection systems is often indistinguishable from real human voices, especially in contexts for which the TTS system has been tuned. 10800660 -> 1000008000670: However, maximum naturalness typically require unit-selection speech databases to be very large, in some systems ranging into the gigabytes of recorded data, representing dozens of hours of speech. 10800670 -> 1000008000680: Also, unit selection algorithms have been known to select segments from a place that results in less than ideal synthesis (e.g. minor words become unclear) even when a better choice exists in the database. 10800680 -> 1000008000690: Diphone synthesis 10800690 -> 1000008000700: Diphone synthesis uses a minimal speech database containing all the diphones (sound-to-sound transitions) occurring in a language. 10800700 -> 1000008000710: The number of diphones depends on the phonotactics of the language: for example, Spanish has about 800 diphones, and German about 2500. 10800710 -> 1000008000720: In diphone synthesis, only one example of each diphone is contained in the speech database. 10800720 -> 1000008000730: At runtime, the target prosody of a sentence is superimposed on these minimal units by means of digital signal processing techniques such as linear predictive coding, PSOLA or MBROLA. 10800730 -> 1000008000740: The quality of the resulting speech is generally worse than that of unit-selection systems, but more natural-sounding than the output of formant synthesizers. 10800740 -> 1000008000750: Diphone synthesis suffers from the sonic glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has few of the advantages of either approach other than small size. 10800750 -> 1000008000760: As such, its use in commercial applications is declining, although it continues to be used in research because there are a number of freely available software implementations. 10800760 -> 1000008000770: Domain-specific synthesis 10800770 -> 1000008000780: Domain-specific synthesis concatenates prerecorded words and phrases to create complete utterances. 10800780 -> 1000008000790: It is used in applications where the variety of texts the system will output is limited to a particular domain, like transit schedule announcements or weather reports. 10800790 -> 1000008000800: The technology is very simple to implement, and has been in commercial use for a long time, in devices like talking clocks and calculators. 10800800 -> 1000008000810: The level of naturalness of these systems can be very high because the variety of sentence types is limited, and they closely match the prosody and intonation of the original recordings. 
10800810 -> 1000008000820: Because these systems are limited by the words and phrases in their databases, they are not general-purpose and can only synthesize the combinations of words and phrases with which they have been preprogrammed. 10800820 -> 1000008000830: The blending of words within naturally spoken language, however, can still cause problems unless the many variations are taken into account. 10800830 -> 1000008000840: For example, in non-rhotic dialects of English the r in words like clear {(IPA+/ˈkliːə/+/ˈkliːə/)} is usually only pronounced when the following word has a vowel as its first letter (e.g. clear out is realized as {(IPA+/ˌkliːəɹˈɑʊt/+/ˌkliːəɹˈɑʊt/)}). 10800840 -> 1000008000850: Likewise in French, many final consonants are no longer silent if followed by a word that begins with a vowel, an effect called liaison. 10800845 -> 1000008000860: This alternation cannot be reproduced by a simple word-concatenation system, which would require additional complexity to be context-sensitive. 10800850 -> 1000008000870: Formant synthesis 10800860 -> 1000008000880: Formant synthesis does not use human speech samples at runtime. 10800870 -> 1000008000890: Instead, the synthesized speech output is created using an acoustic model. 10800880 -> 1000008000900: Parameters such as fundamental frequency, voicing, and noise levels are varied over time to create a waveform of artificial speech. 10800890 -> 1000008000910: This method is sometimes called rules-based synthesis; however, many concatenative systems also have rules-based components. 10800900 -> 1000008000920: Many systems based on formant synthesis technology generate artificial, robotic-sounding speech that would never be mistaken for human speech. 10800910 -> 1000008000930: However, maximum naturalness is not always the goal of a speech synthesis system, and formant synthesis systems have advantages over concatenative systems. 10800920 -> 1000008000940: Formant-synthesized speech can be reliably intelligible, even at very high speeds, avoiding the acoustic glitches that commonly plague concatenative systems. 10800930 -> 1000008000950: High-speed synthesized speech is used by the visually impaired to quickly navigate computers using a screen reader. 10800940 -> 1000008000960: Formant synthesizers are usually smaller programs than concatenative systems because they do not have a database of speech samples. 10800950 -> 1000008000970: They can therefore be used in embedded systems, where memory and microprocessor power are especially limited. 10800960 -> 1000008000980: Because formant-based systems have complete control of all aspects of the output speech, a wide variety of prosodies and intonations can be output, conveying not just questions and statements, but a variety of emotions and tones of voice. 10800970 -> 1000008000990: Examples of non-real-time but highly accurate intonation control in formant synthesis include the work done in the late 1970s for the Texas Instruments toy Speak & Spell, and in the early 1980s for Sega arcade machines. 10800980 -> 1000008001000: Creating proper intonation for these projects was painstaking, and the results have yet to be matched by real-time text-to-speech interfaces. 10800990 -> 1000008001010: Articulatory synthesis 10801000 -> 1000008001020: Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there.
10801010 -> 1000008001030: The first articulatory synthesizer regularly used for laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and Paul Mermelstein. 10801020 -> 1000008001040: This synthesizer, known as ASY, was based on vocal tract models developed at Bell Laboratories in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues. 10801030 -> 1000008001050: Until recently, articulatory synthesis models have not been incorporated into commercial speech synthesis systems. 10801040 -> 1000008001060: A notable exception is the NeXT-based system originally developed and marketed by Trillium Sound Research, a spin-off company of the University of Calgary, where much of the original research was conducted. 10801050 -> 1000008001070: Following the demise of the various incarnations of NeXT (started by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the Trillium software was published under the GNU General Public License, with work continuing as gnuspeech. 10801060 -> 1000008001080: The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts controlled by Carré's "distinctive region model". 10801070 -> 1000008001090: HMM-based synthesis 10801080 -> 1000008001100: HMM-based synthesis is a synthesis method based on hidden Markov models. 10801090 -> 1000008001110: In this system, the frequency spectrum (vocal tract), fundamental frequency (vocal source), and duration (prosody) of speech are modeled simultaneously by HMMs. 10801100 -> 1000008001120: Speech waveforms are generated from HMMs themselves based on the maximum likelihood criterion. 10801110 -> 1000008001130: Sinewave synthesis 10801120 -> 1000008001140: Sinewave synthesis is a technique for synthesizing speech by replacing the formants (main bands of energy) with pure tone whistles. 10801130 -> 1000008001150: Challenges 10801140 -> 1000008001160: Text normalization challenges 10801150 -> 1000008001170: The process of normalizing text is rarely straightforward. 10801160 -> 1000008001180: Texts are full of heteronyms, numbers, and abbreviations that all require expansion into a phonetic representation. 10801170 -> 1000008001190: There are many spellings in English which are pronounced differently based on context. 10801180 -> 1000008001200: For example, "My latest project is to learn how to better project my voice" contains two pronunciations of "project". 10801190 -> 1000008001210: Most text-to-speech (TTS) systems do not generate semantic representations of their input texts, as processes for doing so are not reliable, well understood, or computationally effective. 10801200 -> 1000008001220: As a result, various heuristic techniques are used to guess the proper way to disambiguate homographs, like examining neighboring words and using statistics about frequency of occurrence. 10801210 -> 1000008001230: Deciding how to convert numbers is another problem that TTS systems have to address. 10801220 -> 1000008001240: It is a simple programming challenge to convert a number into words, like "1325" becoming "one thousand three hundred twenty-five." 10801230 -> 1000008001250: However, numbers occur in many different contexts; when a year or part of an address, "1325" should likely be read as "thirteen twenty-five", or, when part of a social security number, as "one three two five". 
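The following minimal Python sketch illustrates the number-expansion step just described, producing the three readings of "1325"; the context labels and helper names are hypothetical, and a real TTS front-end would infer the context from surrounding words and punctuation rather than receive it as an argument.

```python
# Minimal sketch of number expansion in a TTS front-end (toy ranges only).

UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def two_digits(n):
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    return TENS[tens] + ("-" + UNITS[unit] if unit else "")

def cardinal(n):
    """Expand 0-9999 as an ordinary cardinal number."""
    parts = []
    if n >= 1000:
        parts.append(UNITS[n // 1000] + " thousand")
        n %= 1000
    if n >= 100:
        parts.append(UNITS[n // 100] + " hundred")
        n %= 100
    if n or not parts:
        parts.append(two_digits(n))
    return " ".join(parts)

def expand_number(token, context):
    if context == "year":          # e.g. a four-digit year
        return two_digits(int(token[:2])) + " " + two_digits(int(token[2:]))
    if context == "digit_string":  # e.g. part of an identification number
        return " ".join(UNITS[int(d)] for d in token)
    return cardinal(int(token))

print(expand_number("1325", "cardinal"))      # one thousand three hundred twenty-five
print(expand_number("1325", "year"))          # thirteen twenty-five
print(expand_number("1325", "digit_string"))  # one three two five
```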
10801240 -> 1000008001260: A TTS system can often infer how to expand a number based on surrounding words, numbers, and punctuation, and sometimes the system provides a way to specify the context if it is ambiguous. 10801250 -> 1000008001270: Similarly, abbreviations can be ambiguous. 10801260 -> 1000008001280: For example, the abbreviation "in" for "inches" must be differentiated from the word "in", and the address "12 St John St." uses the same abbreviation for both "Saint" and "Street". 10801270 -> 1000008001290: TTS systems with intelligent front ends can make educated guesses about ambiguous abbreviations, while others provide the same result in all cases, resulting in nonsensical (and sometimes comical) outputs. 10801280 -> 1000008001300: Text-to-phoneme challenges 10801290 -> 1000008001310: Speech synthesis systems use two basic approaches to determine the pronunciation of a word based on its spelling, a process which is often called text-to-phoneme or grapheme-to-phoneme conversion (phoneme is the term used by linguists to describe distinctive sounds in a language). 10801300 -> 1000008001320: The simplest approach to text-to-phoneme conversion is the dictionary-based approach, where a large dictionary containing all the words of a language and their correct pronunciations is stored by the program. 10801310 -> 1000008001330: Determining the correct pronunciation of each word is a matter of looking up each word in the dictionary and replacing the spelling with the pronunciation specified in the dictionary. 10801320 -> 1000008001340: The other approach is rule-based, in which pronunciation rules are applied to words to determine their pronunciations based on their spellings. 10801330 -> 1000008001350: This is similar to the "sounding out", or synthetic phonics, approach to learning reading. 10801340 -> 1000008001360: Each approach has advantages and drawbacks. 10801350 -> 1000008001370: The dictionary-based approach is quick and accurate, but completely fails if it is given a word which is not in its dictionary. 10801360 -> 1000008001380: As dictionary size grows, so too does the memory space requirements of the synthesis system. 10801370 -> 1000008001390: On the other hand, the rule-based approach works on any input, but the complexity of the rules grows substantially as the system takes into account irregular spellings or pronunciations. 10801380 -> 1000008001400: (Consider that the word "of" is very common in English, yet is the only word in which the letter "f" is pronounced [v].) 10801390 -> 1000008001410: As a result, nearly all speech synthesis systems use a combination of these approaches. 10801400 -> 1000008001420: Some languages, like Spanish, have a very regular writing system, and the prediction of the pronunciation of words based on their spellings is quite successful. 10801410 -> 1000008001430: Speech synthesis systems for such languages often use the rule-based method extensively, resorting to dictionaries only for those few words, like foreign names and borrowings, whose pronunciations are not obvious from their spellings. 10801420 -> 1000008001440: On the other hand, speech synthesis systems for languages like English, which have extremely irregular spelling systems, are more likely to rely on dictionaries, and to use rule-based methods only for unusual words, or words that aren't in their dictionaries. 
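A minimal Python sketch of this combined strategy follows: look each word up in a pronunciation dictionary first, and fall back to letter-to-sound rules when it is missing. The toy lexicon, rule table, and phone symbols are hypothetical and far smaller than anything a real system would use.

```python
# Minimal sketch of hybrid text-to-phoneme conversion: dictionary lookup
# with a rule-based fallback. The entries below are illustrative only.

LEXICON = {                              # dictionary-based approach
    "of": ["AH", "V"],                   # irregular: the "f" is pronounced [v]
    "speech": ["S", "P", "IY", "CH"],
}

RULES = [                                # rule-based fallback; digraphs listed first
    ("ch", ["CH"]), ("sh", ["SH"]), ("ee", ["IY"]),
    ("a", ["AE"]), ("e", ["EH"]), ("i", ["IH"]), ("o", ["AA"]), ("u", ["AH"]),
    ("b", ["B"]), ("c", ["K"]), ("d", ["D"]), ("f", ["F"]), ("g", ["G"]),
    ("h", ["HH"]), ("k", ["K"]), ("l", ["L"]), ("m", ["M"]), ("n", ["N"]),
    ("p", ["P"]), ("r", ["R"]), ("s", ["S"]), ("t", ["T"]), ("v", ["V"]),
]

def letter_to_sound(word):
    phones, i = [], 0
    while i < len(word):
        for pattern, output in RULES:
            if word.startswith(pattern, i):
                phones.extend(output)
                i += len(pattern)
                break
        else:
            i += 1                       # skip letters the toy rules do not cover
    return phones

def to_phonemes(word):
    word = word.lower()
    return LEXICON.get(word, letter_to_sound(word))

print(to_phonemes("of"))      # ['AH', 'V'] (found in the dictionary)
print(to_phonemes("cheese"))  # ['CH', 'IY', 'S', 'EH'] (built from the rules)
```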
10801430 -> 1000008001450: Evaluation challenges 10801440 -> 1000008001460: It is very difficult to evaluate speech synthesis systems consistently because there is no universally agreed evaluation criterion and different organizations usually use different speech data. 10801450 -> 1000008001470: The quality of a speech synthesis system also depends heavily on the quality of the recordings. 10801460 -> 1000008001480: Therefore, evaluating speech synthesis systems is often almost the same as evaluating the quality of the recordings. 10801470 -> 1000008001490: Recently, researchers have started evaluating speech synthesis systems using a common speech dataset. 10801480 -> 1000008001500: This may make it easier to compare differences between technologies rather than between recordings. 10801490 -> 1000008001510: Prosodics and emotional content 10801500 -> 1000008001520: A recent study in the journal "Speech Communication" by Amy Drahota and colleagues at the University of Portsmouth, UK, reported that listeners to voice recordings could determine, at better than chance levels, whether or not the speaker was smiling. 10801510 -> 1000008001530: It was suggested that identification of the vocal features which signal emotional content may be used to help make synthesized speech sound more natural. 10801520 -> None: Dedicated hardware 10801530 -> None: Votrax 10801540 -> None: SC-01A (analog formant) 10801550 -> None: SC-02 / SSI-263 / "Arctic 263" 10801560 -> None: General Instruments SP0256-AL2 (CTS256A-AL2, MEA8000) 10801570 -> None: Magnevation SpeakJet (www.speechchips.com TTS256) 10801580 -> None: Savage Innovations SoundGin 10801590 -> None: National Semiconductor DT1050 Digitalker (Mozer) 10801600 -> None: Silicon Systems SSI 263 (analog formant) 10801610 -> None: Texas Instruments 10801620 -> None: TMS5110A (LPC) 10801630 -> None: TMS5200 10801640 -> None: Oki Semiconductor 10801650 -> None: MSM5205 10801660 -> None: MSM5218RS (ADPCM) 10801670 -> None: Toshiba T6721A 10801680 -> None: Philips PCF8200 10801690 -> 1000008001540: Computer operating systems or outlets with speech synthesis 10801700 -> 1000008001550: Apple 10801710 -> 1000008001560: The first speech system integrated into an operating system was Apple Computer's MacInTalk in 1984. 10801720 -> 1000008001570: Since the 1980s, Macintosh computers have offered text-to-speech capabilities through the MacInTalk software. 10801730 -> 1000008001580: In the early 1990s, Apple expanded these capabilities, offering system-wide text-to-speech support. 10801740 -> 1000008001590: With the introduction of faster PowerPC-based computers, Apple included higher-quality voice sampling. 10801750 -> 1000008001600: Apple also introduced speech recognition into its systems, which provided a fluid command set. 10801760 -> 1000008001610: More recently, Apple has added sample-based voices. 10801770 -> 1000008001620: Starting as a curiosity, the speech system of the Apple Macintosh has evolved into a cutting-edge, fully supported program, PlainTalk, for people with vision problems. 10801780 -> 1000008001630: VoiceOver was included in Mac OS X Tiger and, more recently, Mac OS X Leopard. 10801790 -> 1000008001640: The voice shipped with Mac OS X 10.5 ("Leopard") is called "Alex" and features realistic-sounding breaths between sentences, as well as improved clarity at high read rates. 10801800 -> 1000008001650: AmigaOS 10801810 -> 1000008001660: The second operating system with advanced speech synthesis capabilities was AmigaOS, introduced in 1985.
10801820 -> 1000008001670: The voice synthesis was licensed by Commodore International from a third-party software house (Don't Ask Software, now Softvoice, Inc.) and it featured a complete system of voice emulation, with both male and female voices and "stress" indicator markers, made possible by advanced features of the Amiga hardware audio chipset. 10801830 -> 1000008001680: It was divided into a narrator device and a translator library. 10801840 -> 1000008001690: Amiga Speak Handler featured a text-to-speech translator. 10801850 -> 1000008001700: AmigaOS considered speech synthesis a virtual hardware device, so the user could even redirect console output to it. 10801860 -> 1000008001710: Some Amiga programs, such as word processors, made extensive use of the speech system. 10801870 -> 1000008001720: Microsoft Windows 10801880 -> 1000008001730: Modern Windows systems use SAPI4- and SAPI5-based speech systems that include a speech recognition engine (SRE). 10801890 -> 1000008001740: SAPI 4.0 was available on Microsoft-based operating systems as a third-party add-on for systems like Windows 95 and Windows 98. 10801900 -> 1000008001750: Windows 2000 added a speech synthesis program called Narrator, directly available to users. 10801910 -> 1000008001760: Once the speech system was installed, all Windows-compatible programs could make use of speech synthesis features, available through menus. 10801920 -> 1000008001770: Microsoft Speech Server is a complete package for voice synthesis and recognition, for commercial applications such as call centers. 10801930 -> 1000008001780: Internet 10801940 -> 1000008001790: Currently, there are a number of applications, plugins and gadgets that can read messages directly from an e-mail client and web pages from a web browser. 10801950 -> 1000008001800: Some specialized software can narrate RSS feeds. 10801960 -> 1000008001810: Online RSS narrators simplify information delivery by allowing users to listen to their favourite news sources and to convert them to podcasts. 10801970 -> 1000008001820: Moreover, online RSS readers are available on almost any PC connected to the Internet. 10801980 -> 1000008001830: Users can download generated audio files to portable devices, e.g. with the help of a podcast receiver, and listen to them while walking, jogging or commuting to work. 10801990 -> 1000008001840: A growing field in Internet-based TTS technology is web-based assistive technology, e.g. Talklets. 10802000 -> 1000008001850: This web-based approach to what has traditionally been locally installed software can give many of those who require software for accessibility reasons the ability to access web content from public machines or machines belonging to others. 10802010 -> 1000008001860: While responsiveness is not as immediate as that of locally installed applications, the 'access anywhere' nature of this approach is its key benefit. 10802020 -> 1000008001870: Others 10802030 -> 1000008001880: Some models of Texas Instruments home computers produced in 1979 and 1981 (Texas Instruments TI-99/4 and TI-99/4A) were capable of text-to-phoneme synthesis or reciting complete words and phrases (text-to-dictionary), using a very popular Speech Synthesizer peripheral. 10802040 -> 1000008001890: TI used a proprietary codec to embed complete spoken phrases into applications, primarily video games.
10802050 -> 1000008001900: A variety of speech synthesis systems run on free and open-source software platforms, including GNU/Linux; they include open-source programs such as the Festival Speech Synthesis System, which uses diphone-based synthesis (and can use a limited number of MBROLA voices), and gnuspeech, from the Free Software Foundation, which uses articulatory synthesis. 10802060 -> 1000008001910: Other commercial vendor software also runs on GNU/Linux. 10802070 -> 1000008001920: Several commercial companies are also developing speech synthesis systems (this list is provided for information only, without endorsing any specific product): Acapela Group, AT&T, Cepstral, DECtalk, IBM ViaVoice, IVONA TTS, Loquendo TTS, NeoSpeech TTS, Nuance Communications, Rhetorical Systems, SVOX and YAKiToMe!. 10802080 -> 1000008001930: Companies which developed speech synthesis systems but which are no longer in this business include BeST Speech (bought by L&H), Lernout & Hauspie (bankrupt), and SpeechWorks (bought by Nuance). 10802090 -> 1000008001940: Speech synthesis markup languages 10802100 -> 1000008001950: A number of markup languages have been established for the rendition of text as speech in an XML-compliant format. 10802110 -> 1000008001960: The most recent is Speech Synthesis Markup Language (SSML), which became a W3C recommendation in 2004. 10802120 -> 1000008001970: Older speech synthesis markup languages include Java Speech Markup Language (JSML) and SABLE. 10802130 -> 1000008001980: Although each of these was proposed as a standard, none of them has been widely adopted. 10802140 -> 1000008001990: Speech synthesis markup languages are distinguished from dialogue markup languages. 10802150 -> 1000008002000: VoiceXML, for example, includes tags related to speech recognition, dialogue management and touchtone dialing, in addition to text-to-speech markup. 10802160 -> 1000008002010: Applications 10802170 -> 1000008002020: Accessibility 10802180 -> 1000008002030: Speech synthesis has long been a vital assistive technology tool, and its application in this area is significant and widespread. 10802190 -> 1000008002040: It allows environmental barriers to be removed for people with a wide range of disabilities. 10802200 -> 1000008002050: The longest-standing application has been the use of screen readers for people with visual impairment, but text-to-speech systems are now commonly used by people with dyslexia and other reading difficulties as well as by pre-literate youngsters. 10802210 -> 1000008002060: They are also frequently employed to aid those with severe speech impairment, usually through a dedicated voice output communication aid. 10802220 -> 1000008002070: News service 10802230 -> 1000008002080: Sites such as Ananova have used speech synthesis to convert written news to audio content, which can be used for mobile applications. 10802240 -> 1000008002090: Entertainment 10802250 -> 1000008002100: Speech synthesis techniques are also used in entertainment productions such as games and anime. 10802260 -> 1000008002110: In 2007, Animo Limited announced the development of a software application package based on its speech synthesis software FineSpeech, explicitly geared towards customers in the entertainment industries, able to generate narration and lines of dialogue according to user specifications. 10802270 -> 1000008002120: Software such as Vocaloid can generate singing voices via lyrics and melody.
10802280 -> 1000008002130: The Singing Computer project (which uses the GPL-licensed software LilyPond and Festival) has a similar aim, helping blind people check their lyric input. Statistical classification 10810010 -> 1000008100020: Statistical classification 10810020 -> 1000008100030: Statistical classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as traits, variables, characters, etc.) and based on a training set of previously labeled items. 10810030 -> 1000008100040: Formally, the problem can be stated as follows: given training data \{(\mathbf{x_1},y_1),\dots,(\mathbf{x_n}, y_n)\}, produce a classifier h:\mathcal{X}\rightarrow\mathcal{Y} which maps an object \mathbf{x} \in \mathcal{X} to its classification label y \in \mathcal{Y}. 10810040 -> 1000008100050: For example, if the problem is filtering spam, then \mathbf{x_i} is some representation of an email and y is either "Spam" or "Non-Spam". 10810050 -> 1000008100060: Statistical classification algorithms are typically used in pattern recognition systems. 10810060 -> 1000008100070: Note: in community ecology, the term "classification" is synonymous with what is commonly known (in machine learning) as clustering. 10810070 -> 1000008100080: See that article for more information about purely unsupervised techniques. 10810080 -> 1000008100090: The second problem is to consider classification as an estimation problem, where the goal is to estimate a function of the form 10810090 -> 1000008100100: P({\rm class}|{\vec x}) = f\left(\vec x;\vec \theta\right) where the feature vector input is \vec x, and the function f is typically parameterized by some parameters \vec \theta. 10810100 -> 1000008100110: In the Bayesian approach to this problem, instead of choosing a single parameter vector \vec \theta, the result is integrated over all possible values of \vec \theta, with each \vec \theta weighted by how likely it is given the training data D: 10810110 -> 1000008100120: P({\rm class}|{\vec x}) = \int f\left(\vec x;\vec \theta\right)P(\vec \theta|D) d\vec \theta 10810120 -> 1000008100130: The third problem is related to the second, but the problem is to estimate the class-conditional probabilities P(\vec x|{\rm class}) and then use Bayes' rule to produce the class probability as in the second problem. 10810130 -> 1000008100140: Examples of classification algorithms include: 10810140 -> 1000008100150: Linear classifiers 10810150 -> 1000008100160: Fisher's linear discriminant 10810160 -> 1000008100170: Logistic regression 10810170 -> 1000008100180: Naive Bayes classifier 10810180 -> 1000008100190: Perceptron 10810190 -> 1000008100200: Support vector machines 10810200 -> 1000008100210: Quadratic classifiers 10810210 -> 1000008100220: k-nearest neighbor 10810220 -> 1000008100230: Boosting 10810230 -> 1000008100240: Decision trees 10810240 -> 1000008100250: Random forests 10810250 -> 1000008100260: Neural networks 10810260 -> 1000008100270: Bayesian networks 10810270 -> 1000008100280: Hidden Markov models 10810280 -> 1000008100290: An intriguing problem in pattern recognition yet to be solved is the relationship between the problem to be solved (data to be classified) and the performance of various pattern recognition algorithms (classifiers).
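As a minimal, self-contained illustration of the classification task defined above, and of the third formulation (estimating class-conditional probabilities and applying Bayes' rule), here is a naive Bayes text classifier sketch in Python. The tiny training set and the add-one smoothing constant are illustrative assumptions, not data from the original text.

```python
import math
from collections import Counter, defaultdict

# Toy labelled training data (x_i, y_i): bag-of-words "emails" and their classes.
training = [
    ("win money now", "Spam"),
    ("cheap money offer", "Spam"),
    ("meeting schedule today", "Non-Spam"),
    ("project meeting tomorrow", "Non-Spam"),
]

# Estimate the prior P(class) and the class-conditional word counts for P(word|class).
class_counts = Counter(label for _, label in training)
word_counts = defaultdict(Counter)
vocab = set()
for text, label in training:
    for word in text.split():
        word_counts[label][word] += 1
        vocab.add(word)

def log_posterior(text, label):
    """log P(class) + sum over words of log P(word|class), with add-one smoothing."""
    logp = math.log(class_counts[label] / len(training))
    total = sum(word_counts[label].values())
    for word in text.split():
        logp += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
    return logp

def classify(text):
    """The classifier h: map an input x to the label y with the highest posterior score."""
    return max(class_counts, key=lambda label: log_posterior(text, label))

print(classify("cheap meeting offer"))  # returns whichever class scores higher
```

The same interface (train on labelled pairs, then map new objects to labels) underlies the other algorithms listed above; only the way the decision function is estimated differs.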
10810290 -> 1000008100300: Van der Walt and Barnard (see reference section) investigated very specific artificial data sets to determine conditions under which certain classifiers perform better and worse than others. 10810300 -> 1000008100310: Classifier performance depends greatly on the characteristics of the data to be classified. 10810310 -> 1000008100320: There is no single classifier that works best on all given problems (a phenomenon that may be explained by the No-free-lunch theorem). 10810320 -> 1000008100330: Various empirical tests have been performed to compare classifier performance and to find the characteristics of data that determine classifier performance. 10810330 -> 1000008100340: Determining a suitable classifier for a given problem is however still more an art than a science. 10810340 -> 1000008100350: The most widely used classifiers are the Neural Network (Multi-layer Perceptron), Support Vector Machines, k-Nearest Neighbours, Gaussian Mixture Model, Gaussian, Naive Bayes, Decision Tree and RBF classifiers. 10810350 -> 1000008100360: Evaluation 10810360 -> 1000008100370: The measures Precision and Recall are popular metrics used to evaluate the quality of a classification system. 10810370 -> 1000008100380: More recently, Receiver Operating Characteristic (ROC) curves have been used to evaluate the tradeoff between true- and false-positive rates of classification algorithms. 10810380 -> None: Application domains 10810390 -> None: Computer vision 10810400 -> None: Medical Imaging and Medical Image Analysis 10810410 -> None: Optical character recognition 10810420 -> None: Geostatistics 10810430 -> None: Speech recognition 10810440 -> None: Handwriting recognition 10810450 -> None: Biometric identification 10810460 -> None: Natural language processing 10810470 -> None: Document classification 10810480 -> None: Internet search engines 10810490 -> None: Credit scoring Statistical machine translation 10820010 -> 1000008200020: Statistical machine translation 10820020 -> 1000008200030: Statistical machine translation (SMT) is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. 10820030 -> 1000008200040: The statistical approach contrasts with the rule-based approaches to machine translation as well as with example-based machine translation. 10820040 -> 1000008200050: The first ideas of statistical machine translation were introduced by Warren Weaver in 1949, including the ideas of applying Claude Shannon's information theory. 10820050 -> 1000008200060: Statistical machine translation was re-introduced in 1991 by researchers at IBM's Thomas J. Watson Research Center and has contributed to the significant resurgence in interest in machine translation in recent years. 10820060 -> 1000008200070: As of 2006, it is by far the most widely-studied machine translation paradigm. 10820070 -> 1000008200080: Benefits 10820080 -> 1000008200090: The benefits of statistical machine translation over traditional paradigms that are most often cited are the following: 10820090 -> 1000008200100: Better use of resources 10820100 -> 1000008200110: There is a great deal of natural language in machine-readable format. 10820110 -> 1000008200120: Generally, SMT systems are not tailored to any specific pair of languages. 
10820120 -> 1000008200130: Rule-based translation systems require the manual development of linguistic rules, which can be costly, and which often do not generalize to other languages. 10820130 -> 1000008200140: More natural translations 10820140 -> 1000008200150: The ideas behind statistical machine translation come out of information theory. 10820150 -> 1000008200160: Essentially, the document is translated according to the probability p(e|f) that a string e in the native language (for example, English) is the translation of a string f in the foreign language (for example, French). 10820160 -> 1000008200170: Generally, these probabilities are estimated using techniques of parameter estimation. 10820170 -> 1000008200180: Bayes' theorem is applied to p(e|f), the probability that the foreign string produces the native string, to get p(e|f) \propto p(f|e) p(e), where the translation model p(f|e) is the probability that the foreign string is the translation of the native string, and the language model p(e) is the probability of seeing that native string. 10820180 -> 1000008200190: Mathematically speaking, finding the best translation \tilde{e} is done by picking the one that gives the highest probability: 10820190 -> 1000008200200: \tilde{e} = \arg\max_{e \in e^*} p(e|f) = \arg\max_{e \in e^*} p(f|e)\, p(e). 10820200 -> 1000008200210: For a rigorous implementation of this, one would have to perform an exhaustive search by going through all strings e^* in the native language. 10820210 -> 1000008200220: Performing the search efficiently is the work of a machine translation decoder that uses the foreign string, heuristics and other methods to limit the search space while at the same time keeping acceptable quality. 10820220 -> 1000008200230: This trade-off between quality and time usage can also be found in speech recognition. 10820230 -> 1000008200240: As the translation systems are not able to store all native strings and their translations, a document is typically translated sentence by sentence, but even this is not enough. 10820240 -> 1000008200250: Language models are typically approximated by smoothed n-gram models, and similar approaches have been applied to translation models, but there is additional complexity due to different sentence lengths and word orders in the languages. 10820250 -> 1000008200260: The statistical translation models were initially word-based (Models 1-5 from IBM), but significant advances were made with the introduction of phrase-based models. 10820260 -> 1000008200270: Recent work has incorporated syntax or quasi-syntactic structures. 10820270 -> 1000008200280: Word-based translation 10820280 -> 1000008200290: In word-based translation, translated elements are words. 10820290 -> 1000008200300: Typically, the number of words in translated sentences differs, due to compound words, morphology and idioms. 10820300 -> 1000008200310: The ratio of the lengths of sequences of translated words is called fertility, which tells how many foreign words each native word produces. 10820310 -> 1000008200320: Simple word-based translation is not able to translate language pairs with fertility rates different from one. 10820320 -> 1000008200330: To make word-based translation systems handle, for instance, high fertility rates, the system can be made to map a single word to multiple words, but not vice versa. 10820330 -> 1000008200340: For instance, if we are translating from French to English, each word in English could produce zero or more French words.
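As a toy illustration of the noisy-channel formulation above (choosing the native-language string that maximizes p(f|e) p(e)), the following Python sketch scores a few hand-picked candidate translations with a made-up word-for-word translation table and a unigram language model. All probabilities, the candidate list, and the function names are invented for illustration; a real decoder searches a vastly larger space and also handles reordering and phrase segmentation.

```python
import math

# p(f_word | e_word): probability that an English word produces a given French word (toy values).
t = {
    ("the", "le"): 0.5, ("the", "la"): 0.5,
    ("house", "maison"): 0.9, ("house", "domicile"): 0.1,
    ("blue", "bleue"): 0.6, ("blue", "bleu"): 0.4,
}

# Unigram language model p(e_word) over English words (toy values).
lm = {"the": 0.4, "house": 0.3, "blue": 0.2, "a": 0.1}

def translation_logprob(f_words, e_words):
    """log p(f|e) under a crude word-for-word model (equal length, no reordering)."""
    if len(f_words) != len(e_words):
        return float("-inf")
    return sum(math.log(t.get((e, f), 1e-9)) for e, f in zip(e_words, f_words))

def language_logprob(e_words):
    """log p(e) under the unigram language model."""
    return sum(math.log(lm.get(e, 1e-9)) for e in e_words)

def decode(f_words, candidates):
    """Pick the candidate e that maximizes log p(f|e) + log p(e)."""
    return max(candidates,
               key=lambda e: translation_logprob(f_words, e) + language_logprob(e))

french = ["la", "maison", "bleue"]
candidates = [["the", "house", "blue"], ["the", "blue", "house"], ["a", "house", "blue"]]
print(decode(french, candidates))
```

Note that without any reordering model this toy scorer prefers the literal order "the house blue" over "the blue house", which is one way of seeing why the reordering and phrase-based techniques discussed below are needed.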
10820340 -> 1000008200350: With such word-based models, however, there is no way to group two English words to produce a single French word. 10820350 -> 1000008200360: An example of a word-based translation system is the freely available GIZA++ package (GPLed), which includes IBM models. 10820360 -> 1000008200370: Phrase-based translation 10820370 -> 1000008200380: Phrase-based translation attempts to reduce the restrictions of word-based translation by translating whole sequences of words into sequences of words, where the lengths may differ. 10820380 -> 1000008200390: These sequences of words are called blocks or phrases, but they are typically not linguistic phrases; rather, they are phrases found in the corpus using statistical methods. 10820390 -> 1000008200400: Restricting the phrases to linguistic phrases has been shown to decrease translation quality. 10820400 -> None: Syntax-based translation 10820410 -> 1000008200410: Challenges with statistical machine translation 10820420 -> 1000008200420: Problems that statistical machine translation has to deal with include 10820430 -> None: Compound words 10820440 -> None: Idioms 10820450 -> None: Morphology 10820460 -> 1000008200430: Different word orders 10820470 -> 1000008200440: Word order differs between languages. 10820480 -> 1000008200450: Some classification can be done by naming the typical order of subject (S), verb (V) and object (O) in a sentence, and one can talk, for instance, of SVO or VSO languages. 10820490 -> 1000008200460: There are also additional differences in word orders, for instance, where modifiers for nouns are located. 10820500 -> 1000008200470: In speech recognition, the speech signal and the corresponding textual representation can be mapped to each other in blocks, in order. 10820510 -> 1000008200480: This is not always the case with the same text in two languages. 10820520 -> 1000008200490: In SMT, the translation model is only able to translate small sequences of words, so word order has to be taken into account somehow. 10820530 -> 1000008200500: A typical solution has been re-ordering models, in which a distribution of position changes for each item of translation is estimated from aligned bi-text. 10820540 -> 1000008200510: Different position changes can be ranked with the help of the language model, and the best can be selected. 10820550 -> None: Syntax 10820560 -> 1000008200520: Out of vocabulary (OOV) words 10820570 -> 1000008200530: SMT systems store different word forms as separate symbols without any relation to each other, and word forms or phrases that were not in the training data cannot be translated. 10820580 -> 1000008200540: The main reasons for out-of-vocabulary words are the limited size of the training data, domain changes, and morphology. Statistics 10830010 -> 1000008300020: Statistics 10830020 -> 1000008300030: Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. 10830030 -> 1000008300040: It is applicable to a wide variety of academic disciplines, from the natural and social sciences to the humanities, government and business. 10830040 -> 1000008300050: Statistical methods can be used to summarize or describe a collection of data; this is called descriptive statistics. 10830050 -> 1000008300060: In addition, patterns in the data may be modeled in a way that accounts for randomness and uncertainty in the observations, and then used to draw inferences about the process or population being studied; this is called inferential statistics.
10830060 -> 1000008300070: Both descriptive and inferential statistics comprise applied statistics. 10830070 -> 1000008300080: There is also a discipline called mathematical statistics, which is concerned with the theoretical basis of the subject. 10830080 -> 1000008300090: The word statistics is also the plural of statistic (singular), which refers to the result of applying a statistical algorithm to a set of data, as in economic statistics, crime statistics, etc. 10830090 -> 1000008300100: History 10830100 -> None: 10830110 -> 1000008300110: "Five men, Conring, Achenwall, Süssmilch, Graunt and Petty have been honored by different writers as the founder of statistics." claims one source (Willcox, Walter (1938) The Founder of Statistics. 10830120 -> 1000008300120: Review of the International Statistical Institute 5(4):321-328.) 10830130 -> 1000008300130: Some scholars pinpoint the origin of statistics to 1662, with the publication of "Observations on the Bills of Mortality" by John Graunt. 10830140 -> 1000008300140: Early applications of statistical thinking revolved around the needs of states to base policy on demographic and economic data. 10830150 -> 1000008300150: The scope of the discipline of statistics broadened in the early 19th century to include the collection and analysis of data in general. 10830160 -> 1000008300160: Today, statistics is widely employed in government, business, and the natural and social sciences. 10830170 -> 1000008300170: Because of its empirical roots and its applications, statistics is generally considered not to be a subfield of pure mathematics, but rather a distinct branch of applied mathematics. 10830180 -> 1000008300180: Its mathematical foundations were laid in the 17th century with the development of probability theory by Pascal and Fermat. 10830190 -> 1000008300190: Probability theory arose from the study of games of chance. 10830200 -> 1000008300200: The method of least squares was first described by Carl Friedrich Gauss around 1794. 10830210 -> 1000008300210: The use of modern computers has expedited large-scale statistical computation, and has also made possible new methods that are impractical to perform manually. 10830220 -> 1000008300220: Overview 10830230 -> 1000008300230: In applying statistics to a scientific, industrial, or societal problem, one begins with a process or population to be studied. 10830240 -> 1000008300240: This might be a population of people in a country, of crystal grains in a rock, or of goods manufactured by a particular factory during a given period. 10830250 -> 1000008300250: It may instead be a process observed at various times; data collected about this kind of "population" constitute what is called a time series. 10830260 -> 1000008300260: For practical reasons, rather than compiling data about an entire population, one usually studies a chosen subset of the population, called a sample. 10830270 -> 1000008300270: Data are collected about the sample in an observational or experimental setting. 10830280 -> 1000008300280: The data are then subjected to statistical analysis, which serves two related purposes: description and inference. 10830290 -> 1000008300290: Descriptive statistics can be used to summarize the data, either numerically or graphically, to describe the sample. 10830300 -> 1000008300300: Basic examples of numerical descriptors include the mean and standard deviation. 10830310 -> 1000008300310: Graphical summarizations include various kinds of charts and graphs. 
10830320 -> 1000008300320: Inferential statistics is used to model patterns in the data, accounting for randomness and drawing inferences about the larger population. 10830330 -> 1000008300330: These inferences may take the form of answers to yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), descriptions of association (correlation), or modeling of relationships (regression). 10830340 -> 1000008300340: Other modeling techniques include ANOVA, time series, and data mining. 10830350 -> 1000008300350: The concept of correlation is particularly noteworthy. 10830360 -> 1000008300360: Statistical analysis of a data set may reveal that two variables (that is, two properties of the population under consideration) tend to vary together, as if they are connected. 10830370 -> 1000008300370: For example, a study of annual income and age of death among people might find that poor people tend to have shorter lives than affluent people. 10830380 -> 1000008300380: The two variables are said to be correlated (which is a positive correlation in this case). 10830390 -> 1000008300390: However, one cannot immediately infer the existence of a causal relationship between the two variables. 10830400 -> 1000008300400: (See Correlation does not imply causation.) 10830410 -> 1000008300410: The correlated phenomena could be caused by a third, previously unconsidered phenomenon, called a lurking variable or confounding variable. 10830420 -> 1000008300420: If the sample is representative of the population, then inferences and conclusions made from the sample can be extended to the population as a whole. 10830430 -> 1000008300430: A major problem lies in determining the extent to which the chosen sample is representative. 10830440 -> 1000008300440: Statistics offers methods to estimate and correct for randomness in the sample and in the data collection procedure, as well as methods for designing robust experiments in the first place. 10830450 -> 1000008300450: (See experimental design.) 10830460 -> 1000008300460: The fundamental mathematical concept employed in understanding such randomness is probability. 10830470 -> 1000008300470: Mathematical statistics (also called statistical theory) is the branch of applied mathematics that uses probability theory and analysis to examine the theoretical basis of statistics. 10830480 -> 1000008300480: The use of any statistical method is valid only when the system or population under consideration satisfies the basic mathematical assumptions of the method. 10830490 -> 1000008300490: Misuse of statistics can produce subtle but serious errors in description and interpretation — subtle in the sense that even experienced professionals sometimes make such errors, serious in the sense that they may affect, for instance, social policy, medical practice and the reliability of structures such as bridges. 10830500 -> 1000008300500: Even when statistics is correctly applied, the results can be difficult for the non-expert to interpret. 10830510 -> 1000008300510: For example, the statistical significance of a trend in the data, which measures the extent to which the trend could be caused by random variation in the sample, may not agree with one's intuitive sense of its significance. 10830520 -> 1000008300520: The set of basic statistical skills (and skepticism) needed by people to deal with information in their everyday lives is referred to as statistical literacy. 
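As a small worked example of the numerical descriptors and the notion of correlation discussed in this overview, the following Python sketch computes a mean, a sample standard deviation, and a Pearson correlation coefficient; the data values are invented purely for illustration.

```python
import math

# Invented sample: annual income (in thousands) and age at death for a few individuals.
income = [12, 25, 31, 48, 60, 75]
age_at_death = [64, 70, 72, 76, 79, 82]

def mean(xs):
    return sum(xs) / len(xs)

def stdev(xs):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Descriptive statistics summarize the sample numerically...
print(mean(income), stdev(income))
# ...while the correlation coefficient measures how the two variables vary together.
# A value near +1 indicates a strong positive association, but, as noted above,
# it does not by itself establish a causal relationship.
print(pearson_r(income, age_at_death))
```

With invented data such as this, the high positive correlation says nothing about lurking variables or causation; it only quantifies the tendency of the two columns to move together.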
10830530 -> 1000008300530: Statistical methods 10830540 -> 1000008300540: Experimental and observational studies 10830550 -> 1000008300550: A common goal for a statistical research project is to investigate causality, and in particular to draw a conclusion on the effect of changes in the values of predictors or independent variables on response or dependent variables. 10830560 -> 1000008300560: There are two major types of causal statistical studies: experimental studies and observational studies. 10830570 -> 1000008300570: In both types of studies, the effect of differences of an independent variable (or variables) on the behavior of the dependent variable is observed. 10830580 -> 1000008300580: The difference between the two types lies in how the study is actually conducted. 10830590 -> 1000008300590: Each can be very effective. 10830600 -> 1000008300600: An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine if the manipulation has modified the values of the measurements. 10830610 -> 1000008300610: In contrast, an observational study does not involve experimental manipulation. 10830620 -> 1000008300620: Instead, data are gathered and correlations between predictors and response are investigated. 10830630 -> 1000008300630: An example of an experimental study is the famous Hawthorne study, which attempted to test changes to the working environment at the Hawthorne plant of the Western Electric Company. 10830640 -> 1000008300640: The researchers were interested in determining whether increased illumination would increase the productivity of the assembly line workers. 10830650 -> 1000008300650: The researchers first measured the productivity in the plant, then modified the illumination in an area of the plant and checked if the changes in illumination affected the productivity. 10830660 -> 1000008300660: It turned out that the productivity indeed improved (under the experimental conditions). 10830663 -> 1000008300670: (See Hawthorne effect.) 10830665 -> 1000008300680: However, the study is heavily criticized today for errors in experimental procedures, specifically for the lack of a control group and blinding. 10830670 -> 1000008300690: An example of an observational study is a study which explores the correlation between smoking and lung cancer. 10830680 -> 1000008300700: This type of study typically uses a survey to collect observations about the area of interest and then performs statistical analysis. 10830690 -> 1000008300710: In this case, the researchers would collect observations of both smokers and non-smokers, perhaps through a case-control study, and then look for the number of cases of lung cancer in each group. 10830700 -> 1000008300720: The basic steps of an experiment are: 10830710 -> 1000008300730: Planning the research, including determining information sources, research subject selection, and ethical considerations for the proposed research and method. 10830720 -> 1000008300740: Design of experiments, concentrating on the system model and the interaction of independent and dependent variables. 10830730 -> 1000008300750: Summarizing a collection of observations to feature their commonality by suppressing details. 10830740 -> 1000008300760: (Descriptive statistics) 10830750 -> 1000008300770: Reaching consensus about what the observations tell about the world being observed.
10830760 -> 1000008300780: (Statistical inference) 10830770 -> 1000008300790: Documenting / presenting the results of the study. 10830780 -> 1000008300800: Levels of measurement 10830790 -> 1000008300810: See: Stanley Stevens' "Scales of measurement" (1946): nominal, ordinal, interval, ratio 10830800 -> 1000008300820: There are four types of measurements or levels of measurement or measurement scales used in statistics: nominal, ordinal, interval, and ratio. 10830810 -> 1000008300830: They have different degrees of usefulness in statistical research. 10830820 -> 1000008300840: Ratio measurements have both a zero value defined and the distances between different measurements defined; they provide the greatest flexibility in statistical methods that can be used for analyzing the data. 10830830 -> 1000008300850: Interval measurements have meaningful distances between measurements defined, but have no meaningful zero value defined (as is the case with IQ measurements or with temperature measurements in Fahrenheit). 10830840 -> 1000008300860: Ordinal measurements have imprecise differences between consecutive values, but have a meaningful order to those values. 10830850 -> 1000008300870: Nominal measurements have no meaningful rank order among values. 10830860 -> 1000008300880: Since variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically, they are sometimes grouped together as categorical variables, whereas ratio and interval measurements are grouped together as quantitative or continuous variables due to their numerical nature. 10830870 -> 1000008300890: Statistical techniques 10830880 -> 1000008300900: Some well-known statistical tests and procedures for research observations are: 10830890 -> 1000008300910: Student's t-test 10830900 -> 1000008300920: chi-square test 10830910 -> 1000008300930: Analysis of variance (ANOVA) 10830920 -> 1000008300940: Mann-Whitney U 10830930 -> 1000008300950: Regression analysis 10830940 -> 1000008300960: Factor Analysis 10830950 -> 1000008300970: Correlation 10830960 -> 1000008300980: Pearson product-moment correlation coefficient 10830970 -> 1000008300990: Spearman's rank correlation coefficient 10830980 -> 1000008301000: Time Series Analysis 10830990 -> 1000008301010: Specialized disciplines 10831000 -> 1000008301020: Some fields of inquiry use applied statistics so extensively that they have specialized terminology.
10831010 -> 1000008301030: These disciplines include: 10831020 -> 1000008301040: Actuarial science 10831030 -> 1000008301050: Applied information economics 10831040 -> 1000008301060: Biostatistics 10831050 -> 1000008301070: Bootstrap & Jackknife Resampling 10831060 -> 1000008301080: Business statistics 10831070 -> 1000008301090: Data analysis 10831080 -> 1000008301100: Data mining (applying statistics and pattern recognition to discover knowledge from data) 10831090 -> 1000008301110: Demography 10831100 -> 1000008301120: Economic statistics (Econometrics) 10831110 -> 1000008301130: Energy statistics 10831120 -> 1000008301140: Engineering statistics 10831130 -> 1000008301150: Environmental Statistics 10831140 -> 1000008301160: Epidemiology 10831150 -> 1000008301170: Geography and Geographic Information Systems, more specifically in Spatial analysis 10831160 -> 1000008301180: Image processing 10831170 -> 1000008301190: Multivariate Analysis 10831180 -> 1000008301200: Psychological statistics 10831190 -> 1000008301210: Quality 10831200 -> 1000008301220: Social statistics 10831210 -> 1000008301230: Statistical literacy 10831220 -> 1000008301240: Statistical modeling 10831230 -> 1000008301250: Statistical surveys 10831240 -> 1000008301260: Process analysis and chemometrics (for analysis of data from analytical chemistry and chemical engineering) 10831250 -> 1000008301270: Structured data analysis (statistics) 10831260 -> 1000008301280: Survival analysis 10831270 -> 1000008301290: Reliability engineering 10831280 -> 1000008301300: Statistics in various sports, particularly baseball and cricket 10831290 -> 1000008301310: Statistics forms a key tool in business and manufacturing as well. 10831300 -> 1000008301320: It is used to understand the variability of measurement systems, to control processes (as in statistical process control or SPC), to summarize data, and to make data-driven decisions. 10831310 -> 1000008301330: In these roles, it is a key tool, and perhaps the only reliable tool. 10831320 -> 1000008301340: Statistical computing 10831330 -> 1000008301350: The rapid and sustained increases in computing power starting from the second half of the 20th century have had a substantial impact on the practice of statistical science. 10831340 -> 1000008301360: Early statistical models were almost always from the class of linear models, but powerful computers, coupled with suitable numerical algorithms, caused an increased interest in nonlinear models (especially neural networks and decision trees) as well as the creation of new types, such as generalised linear models and multilevel models. 10831350 -> 1000008301370: Increased computing power has also led to the growing popularity of computationally-intensive methods based on resampling, such as permutation tests and the bootstrap, while techniques such as Gibbs sampling have made Bayesian methods more feasible. 10831360 -> 1000008301380: The computer revolution has implications for the future of statistics with new emphasis on "experimental" and "empirical" statistics. 10831370 -> 1000008301390: A large number of both general- and special-purpose statistical software packages are now available. 10831380 -> 1000008301400: Misuse 10831390 -> None: 10831400 -> 1000008301410: There is a general perception that statistical knowledge is all-too-frequently intentionally misused by finding ways to interpret only the data that are favorable to the presenter.
10831410 -> 1000008301420: A famous saying attributed to Benjamin Disraeli is, "There are three kinds of lies: lies, damned lies, and statistics"; and Harvard President Lawrence Lowell wrote in 1909 that statistics, "like veal pies, are good if you know the person that made them, and are sure of the ingredients". 10831420 -> 1000008301430: If various studies appear to contradict one another, then the public may come to distrust such studies. 10831430 -> 1000008301440: For example, one study may suggest that a given diet or activity raises blood pressure, while another may suggest that it lowers blood pressure. 10831440 -> 1000008301450: The discrepancy can arise from subtle variations in experimental design, such as differences in the patient groups or research protocols, that are not easily understood by the non-expert. 10831450 -> 1000008301460: (Media reports sometimes omit this vital contextual information entirely.) 10831460 -> 1000008301470: By choosing (or rejecting, or modifying) a certain sample, results can be manipulated. 10831470 -> 1000008301480: Such manipulations need not be malicious or devious; they can arise from unintentional biases of the researcher. 10831480 -> 1000008301490: The graphs used to summarize data can also be misleading. 10831490 -> 1000008301500: Deeper criticisms come from the fact that the hypothesis testing approach, widely used and in many cases required by law or regulation, forces one hypothesis (the null hypothesis) to be "favored", and can also seem to exaggerate the importance of minor differences in large studies. 10831500 -> 1000008301510: A difference that is highly statistically significant can still be of no practical significance. 10831510 -> 1000008301520: (See criticism of hypothesis testing and controversy over the null hypothesis.) 10831520 -> 1000008301530: One response is to give greater emphasis to the p-value, rather than simply reporting whether a hypothesis is rejected at the given level of significance. 10831530 -> 1000008301540: The p-value, however, does not indicate the size of the effect. 10831540 -> 1000008301550: Another increasingly common approach is to report confidence intervals. 10831550 -> 1000008301560: Although these are produced from the same calculations as those of hypothesis tests or p-values, they describe both the size of the effect and the uncertainty surrounding it. Syntax 10840010 -> 1000008400020: Syntax 10840020 -> 1000008400030: In linguistics, syntax (from Ancient Greek {(Lang+συν-+grc+συν-)} syn-, "together", and {(Lang+τάξις+grc+τάξις)} táxis, "arrangement") is the study of the principles and rules for constructing sentences in natural languages. 10840030 -> 1000008400040: In addition to referring to the discipline, the term syntax is also used to refer directly to the rules and principles that govern the sentence structure of any individual language, as in "the syntax of Modern Irish". 10840040 -> 1000008400050: Modern research in syntax attempts to describe languages in terms of such rules. 10840050 -> 1000008400060: Many professionals in this discipline attempt to find general rules that apply to all natural languages. 10840060 -> 1000008400070: The term syntax is also sometimes used to refer to the rules governing the behavior of mathematical systems, such as logic, artificial formal languages, and computer programming languages.
10840070 -> 1000008400080: Early history 10840080 -> 1000008400090: Works on grammar were being written long before modern syntax came about; the Aṣṭādhyāyī of Pāṇini is often cited as an example of a pre-modern work that approaches the sophistication of a modern syntactic theory. 10840090 -> 1000008400100: In the West, the school of thought that came to be known as "traditional grammar" began with the work of Dionysius Thrax. 10840100 -> 1000008400110: For centuries, work in syntax was dominated by a framework known as {(Lang+grammaire générale+fr+grammaire générale)}, first expounded in 1660 by Antoine Arnauld in a book of the same title. 10840110 -> 1000008400120: This system took as its basic premise the assumption that language is a direct reflection of thought processes and therefore there is a single, most natural way to express a thought. 10840120 -> 1000008400130: That way, coincidentally, was exactly the way it was expressed in French. 10840130 -> 1000008400140: However, in the 19th century, with the development of historical-comparative linguistics, linguists began to realize the sheer diversity of human language, and to question fundamental assumptions about the relationship between language and logic. 10840140 -> 1000008400150: It became apparent that there was no such thing as a most natural way to express a thought, and therefore logic could no longer be relied upon as a basis for studying the structure of language. 10840150 -> 1000008400160: The Port-Royal grammar modeled the study of syntax upon that of logic (indeed, large parts of the Port-Royal Logic were copied or adapted from the Grammaire générale). 10840160 -> 1000008400170: Syntactic categories were identified with logical ones, and all sentences were analyzed in terms of "Subject – Copula – Predicate". 10840170 -> 1000008400180: Initially, this view was adopted even by the early comparative linguists such as Franz Bopp. 10840180 -> 1000008400190: The central role of syntax within theoretical linguistics became clear only in the 20th century, which could reasonably be called the "century of syntactic theory" as far as linguistics is concerned. 10840190 -> 1000008400200: For a detailed and critical survey of the history of syntax in the last two centuries, see the monumental work by Graffi (2001). 10840200 -> 1000008400210: Modern theories 10840210 -> 1000008400220: There are a number of theoretical approaches to the discipline of syntax. 10840220 -> 1000008400230: Many linguists (e.g. Noam Chomsky) see syntax as a branch of biology, since they conceive of syntax as the study of linguistic knowledge as embodied in the human mind. 10840240 -> 1000008400240: Others (e.g. Gerald Gazdar) take a more Platonistic view, since they regard syntax to be the study of an abstract formal system. 10840260 -> 1000008400250: Yet others (e.g. Joseph Greenberg) consider grammar a taxonomical device to reach broad generalizations across languages. 10840280 -> 1000008400260: Some of the major approaches to the discipline are listed below. 10840290 -> 1000008400270: Generative grammar 10840300 -> 1000008400280: The hypothesis of generative grammar is that language is a structure of the human mind. 10840310 -> 1000008400290: The goal of generative grammar is to make a complete model of this inner language (known as i-language). 
10840320 -> 1000008400300: This model could be used to describe all human language and to predict the grammaticality of any given utterance (that is, to predict whether the utterance would sound correct to native speakers of the language). 10840330 -> 1000008400310: This approach to language was pioneered by Noam Chomsky. 10840340 -> 1000008400320: Most generative theories (although not all of them) assume that syntax is based upon the constituent structure of sentences. 10840350 -> 1000008400330: Generative grammars are among the theories that focus primarily on the form of a sentence, rather than its communicative function. 10840360 -> 1000008400340: Among the many generative theories of linguistics are: 10840370 -> 1000008400350: Transformational Grammar (TG) (now largely out of date) 10840380 -> 1000008400360: Government and binding theory (GB) (common in the late 1970s and 1980s) 10840390 -> 1000008400370: Minimalism (MP) (the most recent Chomskyan version of generative grammar) 10840400 -> 1000008400380: Other theories that find their origin in the generative paradigm are: 10840410 -> 1000008400390: Generative semantics (now largely out of date) 10840420 -> 1000008400400: Relational grammar (RG) (now largely out of date) 10840430 -> 1000008400410: Arc Pair grammar 10840440 -> 1000008400420: Generalized phrase structure grammar (GPSG; now largely out of date) 10840450 -> 1000008400430: Head-driven phrase structure grammar (HPSG) 10840460 -> 1000008400440: Lexical-functional grammar (LFG) 10840470 -> 1000008400450: Categorial grammar 10840480 -> 1000008400460: Categorial grammar is an approach that attributes the syntactic structure not to rules of grammar, but to the properties of the syntactic categories themselves. 10840490 -> 1000008400470: For example, rather than asserting that sentences are constructed by a rule that combines a noun phrase (NP) and a verb phrase (VP) (e.g. the phrase structure rule S → NP VP), in categorial grammar, such principles are embedded in the category of the head word itself. 10840500 -> 1000008400480: So the syntactic category for an intransitive verb is a complex formula representing the fact that the verb acts as a functor which requires an NP as an input and produces a sentence-level structure as an output. 10840510 -> 1000008400490: This complex category is notated as (NP\S) instead of V. 10840515 -> 1000008400500: NP\S is read as "a category that searches to the left (indicated by \) for an NP (the element on the left) and outputs a sentence (the element on the right)". 10840520 -> 1000008400510: The category of a transitive verb is defined as an element that requires two NPs (its subject and its direct object) to form a sentence. 10840530 -> 1000008400520: This is notated as (NP/(NP\S)), which means "a category that searches to the right (indicated by /) for an NP (the object), and generates a function (equivalent to the VP) which is (NP\S), which in turn represents a function that searches to the left for an NP and produces a sentence" (a small illustrative sketch of this notation is given below). 10840540 -> 1000008400530: Tree-adjoining grammar is a categorial grammar that adds in partial tree structures to the categories. 10840550 -> 1000008400540: Dependency grammar 10840560 -> 1000008400550: Dependency grammar is a different type of approach in which structure is determined by the relations (such as grammatical relations) between a word (a head) and its dependents, rather than being based in constituent structure.
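Returning to the categorial-grammar notation above, here is a minimal Python sketch of forward and backward function application under the slash convention used in this article (the argument is written before the slash, the result after it). The lexicon, the encoding of categories as tuples, and the greedy reduction strategy are illustrative assumptions, not a full categorial-grammar parser.

```python
# Minimal sketch of categorial-grammar function application.
# Convention as in the text: in NP\S the NP argument is sought to the left;
# in NP/(NP\S) the NP argument is sought to the right.

NP, S = "NP", "S"

def backslash(arg, result):   # arg\result: looks left for arg
    return (arg, "\\", result)

def slash(arg, result):       # arg/result: looks right for arg
    return (arg, "/", result)

LEXICON = {
    "John": NP,
    "Mary": NP,
    "sleeps": backslash(NP, S),            # intransitive verb: NP\S
    "sees": slash(NP, backslash(NP, S)),   # transitive verb: NP/(NP\S)
}

def combine(left, right):
    """Apply a functor category to an adjacent argument, if possible."""
    if isinstance(right, tuple) and right[1] == "\\" and right[0] == left:
        return right[2]                    # backward application
    if isinstance(left, tuple) and left[1] == "/" and left[0] == right:
        return left[2]                     # forward application
    return None

def parse(words):
    """Greedy right-to-left reduction; enough for these simple examples."""
    cats = [LEXICON[w] for w in words]
    while len(cats) > 1:
        for i in range(len(cats) - 1, 0, -1):
            combined = combine(cats[i - 1], cats[i])
            if combined is not None:
                cats[i - 1:i + 1] = [combined]
                break
        else:
            return None                    # no reduction possible
    return cats[0]

print(parse(["John", "sleeps"]))           # S
print(parse(["John", "sees", "Mary"]))     # S
```

A sentence is judged well-formed when the word-by-word categories reduce to S, which is exactly the sense in which the grammatical principles live in the categories themselves rather than in separate phrase structure rules.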
10840570 -> 1000008400560: In a dependency analysis, for example, syntactic structure is described in terms of whether a particular noun is the subject or agent of the verb, rather than in terms of trees (one version of which is the parse tree) or another structural system. 10840580 -> 1000008400570: Some dependency-based theories of syntax: 10840590 -> 1000008400580: Algebraic syntax 10840600 -> 1000008400590: Word grammar 10840610 -> 1000008400600: Operator Grammar 10840620 -> 1000008400610: Stochastic/probabilistic grammars/network theories 10840630 -> 1000008400620: Theoretical approaches to syntax that are based upon probability theory are known as stochastic grammars. 10840640 -> 1000008400630: One common implementation of such an approach makes use of a neural network or connectionism. 10840650 -> 1000008400640: Some theories based within this approach are: 10840660 -> 1000008400650: Optimality theory 10840670 -> 1000008400660: Stochastic context-free grammar 10840680 -> 1000008400670: Functionalist grammars 10840690 -> 1000008400680: Functionalist theories, although focused upon form, are driven by explanation based upon the function of a sentence (i.e. its communicative function). 10840700 -> 1000008400690: Some typical functionalist theories include: 10840710 -> 1000008400700: Functional grammar (Dik) 10840720 -> 1000008400710: Prague Linguistic Circle 10840730 -> 1000008400720: Systemic functional grammar 10840740 -> 1000008400730: Cognitive grammar 10840750 -> 1000008400740: Construction grammar (CxG) 10840760 -> 1000008400750: Role and reference grammar (RRG) Text analytics 10860010 -> 1000008500020: Text analytics 10860020 -> 1000008500030: The term text analytics describes a set of linguistic, lexical, pattern recognition, extraction, tagging/structuring, visualization, and predictive techniques. 10860030 -> 1000008500040: The term also describes processes that apply these techniques, whether independently or in conjunction with query and analysis of fielded, numerical data, to solve business problems. 10860040 -> 1000008500050: These techniques and processes discover and present knowledge – facts, business rules, and relationships – that is otherwise locked in textual form, impenetrable to automated processing. 10860050 -> 1000008500060: A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted. 10860060 -> 1000008500070: Current approaches to text analytics use natural language processing techniques that focus on specialized domains. 10860070 -> 1000008500080: Typical subtasks are: 10860080 -> 1000008500090: Named Entity Recognition: recognition of entity names (for people and organizations), place names, temporal expressions, and certain types of numerical expressions. 10860090 -> 1000008500100: Coreference: identification of chains of noun phrases that refer to the same object. 10860100 -> 1000008500110: For example, anaphora is a type of coreference. 10860110 -> 1000008500120: Relationship Extraction: extraction of named relationships between entities in text Text corpus 10870010 -> 1000008600020: Text corpus 10870020 -> 1000008600030: In linguistics, a corpus (plural corpora) or text corpus is a large and structured set of texts (now usually electronically stored and processed).
10870030 -> 1000008600040: They are used to do statistical analysis, to check occurrences, or to validate linguistic rules within a specific universe. 10870040 -> 1000008600050: A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). 10870050 -> 1000008600060: Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora. 10870060 -> 1000008600070: In order to make the corpora more useful for doing linguistic research, they are often subjected to a process known as annotation. 10870070 -> 1000008600080: An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. 10870080 -> 1000008600090: Another example is indicating the lemma (base) form of each word. 10870090 -> 1000008600100: When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual. 10870100 -> 1000008600110: Corpora are the main knowledge base in corpus linguistics. 10870110 -> 1000008600120: The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for POS-tagging and other purposes. 10870120 -> 1000008600130: Corpora and frequency lists derived from them are useful for language teaching. 10870130 -> 1000008600140: Archaeological corpora 10870140 -> 1000008600150: Text corpora are also used in the study of historical documents, for example in attempts to decipher ancient scripts, or in Biblical scholarship. 10870150 -> 1000008600160: Some archaeological corpora can be of such short duration that they provide a snapshot in time. 10870160 -> 1000008600170: One of the shortest corpora in time may be the 15-30 year Amarna letters texts (c. 1350 BC). 10870170 -> 1000008600180: The corpus of an ancient city (for example the "Kültepe Texts" of Turkey) may go through a series of corpora, determined by their find-site dates. 10870180 -> None: Some notable text corpora 10870190 -> None: English language: 10870200 -> None: American National Corpus 10870210 -> None: Bank of English 10870220 -> None: British National Corpus 10870230 -> None: Corpus Juris Secundum 10870240 -> None: Corpus of Contemporary American English (COCA) 360 million words, 1990-2007. 10870250 -> None: Freely available online. 10870260 -> None: Brown Corpus, forming part of the "Brown Family" of corpora, together with LOB, Frown and F-LOB. 10870270 -> None: Oxford English Corpus 10870280 -> None: Scottish Corpus of Texts & Speech 10870290 -> None: Other languages: 10870300 -> None: Amarna letters (for Akkadian, Egyptian, Sumerograms, etc.) 10870310 -> None: Bijankhan Corpus, a contemporary Persian corpus for NLP research 10870320 -> None: Croatian National Corpus 10870330 -> None: Hamshahri Corpus, a contemporary Persian corpus for IR research 10870340 -> None: Neo-Assyrian Text Corpus Project 10870350 -> None: Persian Today Corpus 10870360 -> None: Thesaurus Linguae Graecae (Ancient Greek) Text mining 10880010 -> 1000008700020: Text mining 10880020 -> 1000008700030: Text mining, sometimes alternately referred to as text data mining, refers generally to the process of deriving high quality information from text.
10880030 -> 1000008700040: High quality information is typically derived through the devising of patterns and trends by means such as statistical pattern learning. 10880040 -> 1000008700050: Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 10880050 -> 1000008700060: 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. 10880060 -> 1000008700070: Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities). 10880070 -> 1000008700080: History 10880080 -> 1000008700090: Labour-intensive manual text-mining approaches first surfaced in the mid-1980s, but technological advances have enabled the field to advance swiftly during the past decade. 10880090 -> 1000008700100: Text mining is an interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. 10880100 -> 1000008700110: As most information (over 80%) is currently stored as text, text mining is believed to have a high commercial potential value. 10880110 -> 1000008700120: Increasing interest is being paid to multilingual data mining: the ability to gain information across languages and cluster similar items from different linguistic sources according to their meaning. 10880120 -> 1000008700130: Sentiment analysis 10880130 -> 1000008700140: Sentiment analysis may, for example, involve analysis of movie reviews for estimating how favorable a review is toward a movie. 10880140 -> 1000008700150: Such an analysis may require a labeled data set or labeling of the affectivity of words. 10880150 -> 1000008700160: A resource for the affectivity of words has been created for WordNet. 10880160 -> 1000008700170: Applications 10880170 -> 1000008700180: Recently, text mining has been receiving attention in many areas. 10880180 -> 1000008700190: Security applications 10880190 -> 1000008700200: One of the largest text mining applications that exists is probably the classified ECHELON surveillance system. 10880200 -> 1000008700210: Additionally, many text mining software packages such as AeroText, Attensity, SPSS and Expert System are marketed towards security applications, particularly analysis of plain text sources such as Internet news. 10880210 -> 1000008700220: In 2007, Europol's Serious Crime division developed an analysis system in order to track transnational organized crime. 10880220 -> 1000008700230: This Overall Analysis System for Intelligence Support (OASIS) integrates some of the most advanced text analytics and text mining technologies available on today's market. 10880230 -> 1000008700240: This system has enabled Europol to make significant progress in supporting law enforcement objectives at the international level. 10880240 -> 1000008700250: Biomedical applications 10880250 -> 1000008700260: A range of applications of text mining of the biomedical literature has been described. 10880260 -> 1000008700270: One example is PubGene (pubgene.org), which combines biomedical text mining with network visualization as an Internet service.
10880270 -> 1000008700280: Another example, which uses ontologies with text mining, is GoPubMed.org. 10880280 -> 1000008700290: Software and applications 10880290 -> 1000008700300: Research and development departments of major companies, including IBM and Microsoft, are researching text mining techniques and developing programs to further automate the mining and analysis processes. 10880300 -> 1000008700310: Text mining software is also being researched by different companies working in the area of search and indexing in general as a way to improve their results. 10880310 -> 1000008700320: Marketing applications 10880320 -> 1000008700330: Text mining is starting to be used in marketing as well, more specifically in analytical customer relationship management. Coussement and Van den Poel (2008) apply it to improve predictive analytics models for customer churn (customer attrition). 10880330 -> 1000008700340: Academic applications 10880340 -> 1000008700350: The issue of text mining is of importance to publishers who hold large databases of information requiring indexing for retrieval. 10880350 -> 1000008700360: This is particularly true in scientific disciplines, in which highly specific information is often contained within written text. 10880360 -> 1000008700370: Therefore, initiatives have been taken, such as Nature's proposal for an Open Text Mining Interface (OTMI) and NIH's common Journal Publishing Document Type Definition (DTD), that would provide semantic cues to machines to answer specific queries contained within text without removing publisher barriers to public access. 10880370 -> 1000008700380: Academic institutions have also become involved in the text mining initiative: 10880380 -> 1000008700390: The National Centre for Text Mining, a collaborative effort between the Universities of Manchester and Liverpool, provides customised tools and research facilities, and offers advice to the academic community. 10880390 -> 1000008700400: It is funded by the Joint Information Systems Committee (JISC) and two of the UK Research Councils. 10880400 -> 1000008700410: With an initial focus on text mining in the biological and biomedical sciences, research has since expanded into the areas of social science. 10880410 -> 1000008700420: In the United States, the School of Information at the University of California, Berkeley, is developing a program called BioText to assist bioscience researchers in text mining and analysis. 10880420 -> None: Software and applications 10880450 -> None: There is a large number of companies that provide commercial computer programs: 10880460 -> None: AeroText - provides a suite of text mining applications for content analysis. 10880470 -> None: Content used can be in multiple languages. 10880480 -> None: Attensity - suite of text mining solutions that includes search, statistical and NLP-based technologies for a variety of industries. 10880490 -> None: Autonomy - suite of text mining, clustering and categorization solutions for a variety of industries. 10880500 -> None: Endeca Technologies - provides software to analyze and cluster unstructured text.
10880510 -> None: Expert System S.p.A. - suite of semantic technologies and products for developers and knowledge managers. 10880520 -> None: Fair Isaac - leading provider of decision management solutions powered by advanced analytics (includes text analytics). 10880530 -> None: LanguageWare - the IBM Tools and Runtime for Text Mining. 10880540 -> None: Inxight - provider of text analytics, search, and unstructured visualization technologies. 10880550 -> None: (Inxight was sold to Business Objects, which was in turn sold to SAP AG in 2007.) 10880560 -> None: Nstein Technologies - provider of text mining, digital asset management, and web content management solutions. 10880570 -> None: Pervasive Data Integrator - includes Extract Schema Designer, which allows the user to identify structure patterns in reports, HTML, emails, etc. by pointing and clicking, for extraction into any database. 10880580 -> None: RapidMiner/YALE - open-source data and text mining software for scientific and commercial use. 10880590 -> None: SPSS - provider of SPSS Text Analysis for Surveys, Text Mining for Clementine, LexiQuest Mine and LexiQuest Categorize, commercial text analytics software that can be used in conjunction with SPSS Predictive Analytics Solutions. 10880600 -> None: Thomson Data Analyzer - enables complex analysis of patent information, scientific publications and news. 10880610 -> None: Clearforest Developer - a suite of tools for developing NLP (natural language processing)-based text mining applications to derive structure out of unstructured texts. 10880620 -> None: VantagePoint - text mining software that includes tools for data cleanup, analysis, process automation, and reporting. 10880630 -> 1000008700430: Open-source software and applications 10880640 -> 1000008700440: GATE - natural language processing and language engineering tool. 10880650 -> 1000008700450: YALE/RapidMiner with its Word Vector Tool plugin - data and text mining software. 10880660 -> 1000008700460: tm - text mining in the R programming language. 10880670 -> 1000008700470: Implications 10880680 -> 1000008700480: Until recently, websites most often used text-based lexical searches; in other words, users could find documents only by the words that happened to occur in the documents. 10880690 -> 1000008700490: Text mining may allow searches to be directly answered by the semantic web; users may be able to search for content based on its meaning and context, rather than just by a specific word. 10880700 -> 1000008700500: Additionally, text mining software can be used to build large dossiers of information about specific people and events. 10880710 -> 1000008700510: For example, by using software that extracts specific facts about businesses and individuals from news reports, large datasets can be built to facilitate social network analysis or counter-intelligence. 10880720 -> 1000008700520: In effect, the text mining software may act in a capacity similar to an intelligence analyst or research librarian, albeit with a more limited scope of analysis. 10880730 -> 1000008700530: Text mining is also used in some email spam filters as a way of determining the characteristics of messages that are likely to be advertisements or other unwanted material. Translation 10890010 -> 1000008800020: Translation 10890020 -> 1000008800030: Translation is the interpretation of the meaning of a text and the subsequent production of an equivalent text, also called a translation, that communicates the same message in another language.
10890030 -> 1000008800040: The text to be translated is called the source text, and the language it is to be translated into is called the target language; the final product is sometimes called the "target text." 10890040 -> 1000008800050: Translation must take into account constraints that include context, the rules of grammar of the two languages, their writing conventions, and their idioms. 10890050 -> 1000008800060: A common misconception is that there exists a simple word-for-word correspondence between any two languages, and that translation is a straightforward mechanical process. 10890060 -> 1000008800070: A word-for-word translation does not take into account context, grammar, conventions, and idioms. 10890070 -> 1000008800080: Translation is fraught with the potential for "spilling over" of idioms and usages from one language into the other, since both languages repose within the single brain of the translator. 10890080 -> 1000008800090: Such spilling-over easily produces linguistic hybrids such as "Franglais" (French-English), "Spanglish" (Spanish-English), "Poglish" (Polish-English) and "Portuñol" (Portuguese-Spanish). 10890090 -> 1000008800100: The art of translation is as old as written literature. 10890100 -> 1000008800110: Parts of the Sumerian Epic of Gilgamesh, among the oldest known literary works, have been found in translations into several Asiatic languages of the second millennium BCE. 10890110 -> 1000008800120: The Epic of Gilgamesh may have been read, in their own languages, by early authors of the Bible and of the Iliad. 10890120 -> 1000008800130: With the advent of computers, attempts have been made to computerize or otherwise automate the translation of natural-language texts (machine translation) or to use computers as an aid to translation (computer-assisted translation). 10890130 -> 1000008800140: The term 10890140 -> 1000008800150: Etymologically, "translation" is a "carrying across" or "bringing across." 10890150 -> 1000008800160: The Latin "translatio" derives from the perfect passive participle, "translatum," of "transferre" ("to transfer" — from "trans," "across" + "ferre," "to carry" or "to bring"). 10890160 -> 1000008800170: The modern Romance, Germanic and Slavic European languages have generally formed their own equivalent terms for this concept after the Latin model — after "transferre" or after the kindred "traducere" ("to bring across" or "to lead across"). 10890170 -> 1000008800180: Additionally, the Greek term for "translation," "metaphrasis" ("a speaking across"), has supplied English with "metaphrase" (a "literal translation," or "word-for-word" translation)—as contrasted with "paraphrase" ("a saying in other words," from the Greek "paraphrasis"). 10890180 -> 1000008800190: "Metaphrase" equates, in one of the more recent terminologies, to "formal equivalence," and "paraphrase"—to "dynamic equivalence." 10890190 -> 1000008800200: Misconceptions 10890200 -> 1000008800210: Newcomers to translation sometimes proceed as if translation were an exact science — as if consistent, one-to-one correlations existed between the words and phrases of different languages, rendering translations fixed and identically reproducible, much as in cryptography. 10890210 -> 1000008800220: Such novices may assume that all that is needed to translate a text is to "encode" and "decode" equivalents between the two languages, using a translation dictionary as the "codebook." 
10890220 -> 1000008800230: On the contrary, such a fixed relationship would only exist were a new language synthesized and simultaneously matched to a pre-existing language's scopes of meaning, etymologies, and lexical ecological niches. 10890230 -> 1000008800240: If the new language were subsequently to take on a life apart from such cryptographic use, each word would spontaneously begin to assume new shades of meaning and cast off previous associations, thereby vitiating any such artificial synchronization. 10890240 -> 1000008800250: Henceforth translation would require the disciplines described in this article. 10890250 -> 1000008800260: Another common misconception is that anyone who can speak a second language will make a good translator. 10890260 -> 1000008800270: In the translation community, it is generally accepted that the best translations are produced by persons who are translating into their own native languages, as it is rare for someone who has learned a second language to have total fluency in that language. 10890270 -> 1000008800280: A good translator understands the source language well, has specific experience in the subject matter of the text, and is a good writer in the target language. 10890280 -> 1000008800290: Moreover, he is not only bilingual but bicultural. 10890290 -> 1000008800300: It has been debated whether translation is art or craft. 10890300 -> 1000008800310: Literary translators, such as Gregory Rabassa in If This Be Treason, argue that translation is an art—a teachable one. 10890310 -> 1000008800320: Other translators, mostly technical, commercial, and legal, regard their métier as a craft—again, a teachable one, subject to linguistic analysis, that benefits from academic study. 10890320 -> 1000008800330: As with other human activities, the distinction between art and craft may be largely a matter of degree. 10890330 -> 1000008800340: Even a document which appears simple, e.g. a product brochure, requires a certain level of linguistic skill that goes beyond mere technical terminology. 10890340 -> 1000008800350: Any material used for marketing purposes reflects on the company that produces the product and the brochure. 10890350 -> 1000008800360: The best translations are obtained through the combined application of good technical-terminology skills and good writing skills. 10890360 -> 1000008800370: Translation has served as a writing school for many recognized writers. 10890370 -> 1000008800380: Translators, including the early modern European translators of the Bible, in the course of their work have shaped the very languages into which they have translated. 10890380 -> 1000008800390: They have acted as bridges for conveying knowledge and ideas between cultures and civilizations. 10890390 -> 1000008800400: Along with ideas, they have imported into their own languages, calques of grammatical structures and of vocabulary from the source languages. 10890400 -> 1000008800410: Interpreting 10890410 -> 1000008800420: Interpreting, or "interpretation," is the intellectual activity that consists of facilitating oral or sign-language communication, either simultaneously or consecutively, between two or among three or more speakers who are not speaking, or signing, the same language. 10890420 -> 1000008800430: The words "interpreting" and "interpretation" both can be used to refer to this activity; the word "interpreting" is commonly used in the profession and in the translation-studies field to avoid confusion with other meanings of the word "interpretation." 
10890430 -> 1000008800440: Not all languages employ, as English does, two separate words to denote the activities of written and live-communication (oral or sign-language) translators. 10890440 -> 1000008800450: Fidelity vs. transparency 10890450 -> 1000008800460: Fidelity (or "faithfulness") and transparency are two qualities that, for millennia, have been regarded as ideals to be striven for in translation, particularly literary translation. 10890460 -> 1000008800470: These two ideals are often at odds. 10890470 -> 1000008800480: Thus a 17th-century French critic coined the phrase, "les belles infidèles," to suggest that translations, like women, could be either faithful or beautiful, but not both at the same time. 10890480 -> 1000008800490: Fidelity pertains to the extent to which a translation accurately renders the meaning of the source text, without adding to or subtracting from it, without intensifying or weakening any part of the meaning, and otherwise without distorting it. 10890490 -> 1000008800500: Transparency pertains to the extent to which a translation appears to a native speaker of the target language to have originally been written in that language, and conforms to the language's grammatical, syntactic and idiomatic conventions. 10890500 -> 1000008800510: A translation that meets the first criterion is said to be a "faithful translation"; a translation that meets the second criterion, an "idiomatic translation." 10890510 -> 1000008800520: The two qualities are not necessarily mutually exclusive. 10890520 -> 1000008800530: The criteria used to judge the faithfulness of a translation vary according to the subject, the precision of the original contents, the type, function and use of the text, its literary qualities, its social or historical context, and so forth. 10890530 -> 1000008800540: The criteria for judging the transparency of a translation would appear more straightforward: an unidiomatic translation "sounds wrong," and in the extreme case of word-for-word translations generated by many machine-translation systems, often results in patent nonsense with only a humorous value (see "round-trip translation"). 10890540 -> 1000008800550: Nevertheless, in certain contexts a translator may consciously strive to produce a literal translation. 10890550 -> 1000008800560: Literary translators and translators of religious or historic texts often adhere as closely as possible to the source text. 10890560 -> 1000008800570: In doing so, they often deliberately stretch the boundaries of the target language to produce an unidiomatic text. 10890570 -> 1000008800580: Similarly, a literary translator may wish to adopt words or expressions from the source language in order to provide "local color" in the translation. 10890580 -> 1000008800590: In recent decades, prominent advocates of such "non-transparent" translation have included the French scholar Antoine Berman, who identified twelve deforming tendencies inherent in most prose translations, and the American theorist Lawrence Venuti, who has called upon translators to apply "foreignizing" translation strategies instead of domesticating ones. 10890590 -> 1000008800600: Many non-transparent-translation theories draw on concepts from German Romanticism, the most obvious influence on latter-day theories of "foreignization" being the German theologian and philosopher Friedrich Schleiermacher. 
10890600 -> 1000008800610: In his seminal lecture "On the Different Methods of Translation" (1813) he distinguished between translation methods that move "the writer toward [the reader]," i.e., transparency, and those that move the "reader toward [the author]," i.e., an extreme fidelity to the foreignness of the source text. 10890610 -> 1000008800620: Schleiermacher clearly favored the latter approach. 10890620 -> 1000008800630: His preference was motivated, however, not so much by a desire to embrace the foreign, as by a nationalist desire to oppose France's cultural domination and to promote German literature. 10890630 -> 1000008800640: For the most part, current Western practices in translation are dominated by the concepts of "fidelity" and "transparency." 10890640 -> 1000008800650: This has not always been the case. 10890650 -> 1000008800660: There have been periods, especially in pre-Classical Rome and in the 18th century, when many translators stepped beyond the bounds of translation proper into the realm of adaptation. 10890660 -> 1000008800670: Adapted translation retains currency in some non-Western traditions. 10890670 -> 1000008800680: Thus the Indian epic, the Ramayana, appears in many versions in the various Indian languages, and the stories are different in each. 10890680 -> 1000008800690: If one considers the words used for translating into the Indian languages, whether those be Aryan or Dravidian languages, he is struck by the freedom that is granted to the translators. 10890690 -> 1000008800700: This may relate to a devotion to prophetic passages that strike a deep religious chord, or to a vocation to instruct unbelievers. 10890700 -> 1000008800710: Similar examples are to be found in medieval Christian literature, which adjusted the text to the customs and values of the audience. 10890710 -> 1000008800720: Equivalence 10890720 -> 1000008800730: The question of fidelity vs. transparency has also been formulated in terms of, respectively, "formal equivalence" and "dynamic equivalence." 10890730 -> 1000008800740: The latter two expressions are associated with the translator Eugene Nida and were originally coined to describe ways of translating the Bible, but the two approaches are applicable to any translation. 10890740 -> 1000008800750: "Formal equivalence" equates to "metaphrase," and "dynamic equivalence"—to "paraphrase." 10890750 -> 1000008800760: "Dynamic equivalence" (or "functional equivalence") conveys the essential thought expressed in a source text — if necessary, at the expense of literality, original sememe and word order, the source text's active vs. passive voice, etc. 10890760 -> 1000008800770: By contrast, "formal equivalence" (sought via "literal" translation) attempts to render the text "literally," or "word for word" (the latter expression being itself a word-for-word rendering of the classical Latin "verbum pro verbo") — if necessary, at the expense of features natural to the target language. 10890770 -> 1000008800780: There is, however, no sharp boundary between dynamic and formal equivalence. 10890780 -> 1000008800790: On the contrary, they represent a spectrum of translation approaches. 10890790 -> 1000008800800: Each is used at various times and in various contexts by the same translator, and at various points within the same text — sometimes simultaneously. 10890800 -> 1000008800810: Competent translation entails the judicious blending of dynamic and formal equivalents. 
10890810 -> 1000008800820: Back-translation 10890820 -> 1000008800830: If one text is a translation of another, a back-translation is a translation of the translated text back into the language of the original text, made without reference to the original text. 10890830 -> 1000008800840: In the context of machine translation, this is also called a "round-trip translation." 10890840 -> 1000008800850: Comparison of a back-translation to the original text is sometimes used as a quality check on the original translation, but it is certainly far from infallible and the reliability of this technique has been disputed. 10890850 -> 1000008800860: Literary translation 10890860 -> 1000008800870: Translation of literary works (novels, short stories, plays, poems, etc.) is considered a literary pursuit in its own right. 10890870 -> 1000008800880: Notable in Canadian literature specifically as translators are figures such as Sheila Fischman, Robert Dickson and Linda Gaboriau, and the Governor General's Awards present prizes for the year's best English-to-French and French-to-English literary translations. 10890880 -> 1000008800890: Other writers, among many who have made a name for themselves as literary translators, include Vasily Zhukovsky, Tadeusz Boy-Żeleński, Vladimir Nabokov, Jorge Luis Borges, Robert Stiller and Haruki Murakami. 10890890 -> 1000008800900: History 10890900 -> 1000008800910: The first important translation in the West was that of the Septuagint, a collection of Jewish Scriptures translated into Koine Greek in Alexandria between the 3rd and 1st centuries BCE. 10890910 -> 1000008800920: The dispersed Jews had forgotten their ancestral language and needed Greek versions (translations) of their Scriptures. 10890920 -> 1000008800930: Throughout the Middle Ages, Latin was the lingua franca of the western learned world. 10890930 -> 1000008800940: The 9th-century Alfred the Great, king of Wessex in England, was far ahead of his time in commissioning vernacular Anglo-Saxon translations of Bede's Ecclesiastical History and Boethius' Consolation of Philosophy. 10890940 -> 1000008800950: Meanwhile the Christian Church frowned on even partial adaptations of the standard Latin Bible, St. Jerome's Vulgate of ca. 384 CE. 10890950 -> 1000008800960: In Asia, the spread of Buddhism led to large-scale ongoing translation efforts spanning well over a thousand years. 10890960 -> 1000008800970: The Tangut Empire was especially efficient in such efforts; exploiting the then newly-invented block printing, and with the full support of the government (contemporary sources describe the Emperor and his mother personally contributing to the translation effort, alongside sages of various nationalities), the Tanguts took mere decades to translate volumes that had taken the Chinese centuries to render. 10890970 -> 1000008800980: Large-scale efforts at translation were undertaken by the Arabs. 10890980 -> 1000008800990: Having conquered the Greek world, they made Arabic versions of its philosophical and scientific works. 10890990 -> 1000008801000: During the Middle Ages, some translations of these Arabic versions were made into Latin, chiefly at Córdoba in Spain. 10891000 -> 1000008801010: Such Latin translations of Greek and original Arab works of scholarship and science would help advance the development of European Scholasticism. 10891010 -> 1000008801020: The broad historic trends in Western translation practice may be illustrated on the example of translation into the English language. 
10891020 -> 1000008801030: The first fine translations into English were made by England's first great poet, the 14th-century Geoffrey Chaucer, who adapted from the Italian of Giovanni Boccaccio in his own Knight's Tale and Troilus and Criseyde; began a translation of the French-language Roman de la Rose; and completed a translation of Boethius from the Latin. 10891030 -> 1000008801040: Chaucer founded an English poetic tradition on adaptations and translations from those earlier-established literary languages. 10891040 -> 1000008801050: The first great English translation was the Wycliffe Bible (ca. 1382), which showed the weaknesses of an underdeveloped English prose. 10891050 -> 1000008801060: Only at the end of the 15th century would the great age of English prose translation begin with Thomas Malory's Le Morte Darthur—an adaptation of Arthurian romances so free that it can, in fact, hardly be called a true translation. 10891060 -> 1000008801070: The first great Tudor translations are, accordingly, the Tyndale New Testament (1525), which would influence the Authorized Version (1611), and Lord Berners' version of Jean Froissart's Chronicles (1523–25). 10891070 -> 1000008801080: Meanwhile, in Renaissance Italy, a new period in the history of translation had opened in Florence with the arrival, at the court of Cosimo de' Medici, of the Byzantine scholar Georgius Gemistus Pletho shortly before the fall of Constantinople to the Turks (1453). 10891080 -> 1000008801090: A Latin translation of Plato's works was undertaken by Marsilio Ficino. 10891090 -> 1000008801100: This and Erasmus' Latin edition of the New Testament led to a new attitude to translation. 10891100 -> 1000008801110: For the first time, readers demanded rigor of rendering, as philosophical and religious beliefs depended on the exact words of Plato, Aristotle and Jesus. 10891110 -> 1000008801120: Non-scholarly literature, however, continued to rely on adaptation. 10891120 -> 1000008801130: France's Pléiade, England's Tudor poets, and the Elizabethan translators adapted themes by Horace, Ovid, Petrarch and modern Latin writers, forming a new poetic style on those models. 10891130 -> 1000008801140: The English poets and translators sought to supply a new public, created by the rise of a middle class and the development of printing, with works such as the original authors would have written, had they been writing in England in that day. 10891140 -> 1000008801150: The Elizabethan period of translation saw considerable progress beyond mere paraphrase toward an ideal of stylistic equivalence, but even to the end of this period—which actually reached to the middle of the 17th century—there was no concern for verbal accuracy. 10891150 -> 1000008801160: In the second half of the 17th century, the poet John Dryden sought to make Virgil speak "in words such as he would probably have written if he were living and an Englishman." 10891160 -> 1000008801170: Dryden, however, discerned no need to emulate the Roman poet's subtlety and concision. 10891170 -> 1000008801180: Similarly, Homer suffered from Alexander Pope's endeavor to reduce the Greek poet's "wild paradise" to order. 10891180 -> 1000008801190: Throughout the 18th century, the watchword of translators was ease of reading. 10891190 -> 1000008801200: Whatever they did not understand in a text, or thought might bore readers, they omitted. 
10891200 -> 1000008801210: They cheerfully assumed that their own style of expression was the best, and that texts should be made to conform to it in translation. 10891210 -> 1000008801220: For scholarship they cared no more than had their predecessors, and they did not shrink from making translations from translations in third languages, or from languages that they hardly knew, or—as in the case of James Macpherson's "translations" of Ossian—from texts that were actually of the "translator's" own composition. 10891220 -> 1000008801230: The 19th century brought new standards of accuracy and style. 10891230 -> 1000008801240: In regard to accuracy, observes J.M. Cohen, the policy became "the text, the whole text, and nothing but the text," except for any bawdy passages and the addition of copious explanatory footnotes. 10891240 -> 1000008801250: In regard to style, the Victorians' aim, achieved through far-reaching metaphrase (literality) or pseudo-metaphrase, was to constantly remind readers that they were reading a foreign classic. 10891250 -> 1000008801260: An exception was the outstanding translation in this period, Edward FitzGerald's Rubaiyat of Omar Khayyam (1859), which achieved its Oriental flavor largely by using Persian names and discreet Biblical echoes and actually drew little of its material from the Persian original. 10891260 -> 1000008801270: In advance of the 20th century, a new pattern was set in 1871 by Benjamin Jowett, who translated Plato into simple, straightforward language. 10891270 -> 1000008801280: Jowett's example was not followed, however, until well into the new century, when accuracy rather than style became the principal criterion. 10891280 -> 1000008801290: Poetry 10891290 -> 1000008801300: Poetry presents special challenges to translators, given the importance of a text's formal aspects, in addition to its content. 10891300 -> 1000008801310: In his influential 1959 paper "On Linguistic Aspects of Translation," the Russian-born linguist and semiotician Roman Jakobson went so far as to declare that "poetry by definition [is] untranslatable." 10891310 -> 1000008801320: In 1974 the American poet James Merrill wrote a poem, "Lost in Translation," which in part explores this idea. 10891320 -> 1000008801330: The question was also discussed in Douglas Hofstadter's 1997 book, Le Ton beau de Marot. 10891330 -> 1000008801340: Sung texts 10891340 -> 1000008801350: Translation of a text that is sung in vocal music for the purpose of singing in another language — sometimes called "singing translation" — is closely linked to translation of poetry because most vocal music, at least in the Western tradition, is set to verse, especially verse in regular patterns with rhyme. 10891350 -> 1000008801360: (Since the late 19th century, musical setting of prose and free verse has also been practiced in some art music, though popular music tends to remain conservative in its retention of stanzaic forms with or without refrains. 10891360 -> 1000008801370: ) A rudimentary example of translating poetry for singing is church hymns, such as the German chorales translated into English by Catherine Winkworth. 10891370 -> 1000008801380: Translation of sung texts is generally much more restrictive than translation of poetry, because in the former there is little or no freedom to choose between a versified translation and a translation that dispenses with verse structure. 
10891380 -> 1000008801390: One might modify or omit rhyme in a singing translation, but the assignment of syllables to specific notes in the original musical setting places great challenges on the translator. 10891390 -> 1000008801400: There is the option in prose sung texts, less so in verse, of adding or deleting a syllable here and there by subdividing or combining notes, respectively, but even with prose the process is almost like strict verse translation because of the need to stick as closely as possible to the original prosody of the sung melodic line. 10891400 -> 1000008801410: Other considerations in writing a singing translation include repetition of words and phrases, the placement of rests and/or punctuation, the quality of vowels sung on high notes, and rhythmic features of the vocal line that may be more natural to the original language than to the target language. 10891410 -> 1000008801420: A sung translation may be considerably or completely different from the original, thus resulting in a contrafactum. 10891420 -> 1000008801430: Translations of sung texts — whether of the above type meant to be sung or of a more or less literal type meant to be read — are also used as aids to audiences, singers and conductors, when a work is being sung in a language not known to them. 10891430 -> 1000008801440: The most familiar types are translations presented as subtitles projected during opera performances, those inserted into concert programs, and those that accompany commercial audio CDs of vocal music. 10891440 -> 1000008801450: In addition, professional and amateur singers often sing works in languages they do not know (or do not know well), and translations are then used to enable them to understand the meaning of the words they are singing. 10891450 -> 1000008801460: History of theory 10891460 -> 1000008801470: Discussions of the theory and practice of translation reach back into antiquity and show remarkable continuities. 10891470 -> 1000008801480: The distinction that had been drawn by the ancient Greeks between "metaphrase" ("literal" translation) and "paraphrase" would be adopted by the English poet and translator John Dryden (1631-1700), who represented translation as the judicious blending of these two modes of phrasing when selecting, in the target language, "counterparts," or equivalents, for the expressions used in the source language: 10891471 -> 1000008801490: When [words] appear... literally graceful, it were an injury to the author that they should be changed. 10891472 -> 1000008801500: But since... what is beautiful in one [language] is often barbarous, nay sometimes nonsense, in another, it would be unreasonable to limit a translator to the narrow compass of his author's words: 'tis enough if he choose out some expression which does not vitiate the sense. 10891480 -> 1000008801510: Dryden cautioned, however, against the license of "imitation," i.e. of adapted translation: "When a painter copies from the life... he has no privilege to alter features and lineaments..." 10891490 -> 1000008801520: This general formulation of the central concept of translation — equivalence — is probably as adequate as any that has been proposed ever since Cicero and Horace, in first-century-BCE Rome, famously and literally cautioned against translating "word for word" ("verbum pro verbo"). 10891500 -> 1000008801530: Despite occasional theoretical diversities, the actual practice of translators has hardly changed since antiquity. 
10891510 -> 1000008801540: Except for some extreme metaphrasers in the early Christian period and the Middle Ages, and adapters in various periods (especially pre-Classical Rome, and the 18th century), translators have generally shown prudent flexibility in seeking equivalents — "literal" where possible, paraphrastic where necessary — for the original meaning and other crucial "values" (e.g., style, verse form, concordance with musical accompaniment or, in films, with speech articulatory movements) as determined from context. 10891520 -> 1000008801550: In general, translators have sought to preserve the context itself by reproducing the original order of sememes, and hence word order — when necessary, reinterpreting the actual grammatical structure. 10891530 -> 1000008801560: The grammatical differences between "fixed-word-order" languages (e.g., English, French, German) and "free-word-order" languages (e.g., Greek, Latin, Polish, Russian) have been no impediment in this regard. 10891540 -> 1000008801570: When a target language has lacked terms that are found in a source language, translators have borrowed them, thereby enriching the target language. 10891550 -> 1000008801580: Thanks in great measure to the exchange of "calques" (French for "tracings") between languages, and to their importation from Greek, Latin, Hebrew, Arabic and other languages, there are few concepts that are "untranslatable" among the modern European languages. 10891560 -> 1000008801590: In general, the greater the contact and exchange that has existed between two languages, or between both and a third one, the greater is the ratio of metaphrase to paraphrase that may be used in translating between them. 10891570 -> 1000008801600: However, due to shifts in "ecological niches" of words, a common etymology is sometimes misleading as a guide to current meaning in one or the other language. 10891580 -> 1000008801610: The English "actual," for example, should not be confused with the cognate French "actuel" (meaning "present," "current") or the Polish "aktualny" ("present," "current"). 10891590 -> 1000008801620: For the translation of Buddhist texts into Chinese, the monk Xuanzang (602–64) proposed the idea of 五不翻 ("five occasions when terms are left untranslated"): 10891600 -> 1000008801630: 秘密故—terms carry secrecy, e.g., chants and spells; 10891610 -> 1000008801640: 含多义故—terms carry multiple meanings; 10891620 -> 1000008801650: 此无故—no corresponding term exists; 10891630 -> 1000008801660: 顺古故—out of respect for earlier translations; 10891640 -> 1000008801670: 生善故— 10891650 -> 1000008801680: The translator's role as a bridge for "carrying across" values between cultures has been discussed at least since Terence, Roman adapter of Greek comedies, in the second century BCE. 10891660 -> 1000008801690: The translator's role is, however, by no means a passive and mechanical one, and so has also been compared to that of an artist. 10891670 -> 1000008801700: The main ground seems to be the concept of parallel creation found in critics as early as Cicero. 10891680 -> 1000008801710: Dryden observed that "Translation is a type of drawing after life..." 10891690 -> 1000008801720: Comparison of the translator with a musician or actor goes back at least to Samuel Johnson's remark about Alexander Pope playing Homer on a flageolet, while Homer himself used a bassoon. 10891700 -> 1000008801730: If translation be an art, it is no easy one. 
10891710 -> 1000008801740: In the 13th century, Roger Bacon wrote that if a translation is to be true, the translator must know both languages, as well as the science that he is to translate; and finding that few translators did, he wanted to do away with translation and translators altogether. 10891720 -> 1000008801750: The first European to assume that one translates satisfactorily only toward his own language may have been Martin Luther, translator of the Bible into German. 10891730 -> 1000008801760: According to L.G. Kelly, since Johann Gottfried Herder in the 18th century, "it has been axiomatic" that one works only toward his own language. 10891740 -> 1000008801770: Compounding these demands upon the translator is the fact that not even the most complete dictionary or thesaurus can ever be a fully adequate guide in translation. 10891750 -> 1000008801780: Alexander Tytler, in his Essay on the Principles of Translation (1790), emphasized that assiduous reading is a more comprehensive guide to a language than are dictionaries. 10891760 -> 1000008801790: The same point, but also including listening to the spoken language, had earlier been made in 1783 by Onufry Andrzej Kopczyński, member of Poland's Society for Elementary Books, who was called "the last Latin poet." 10891770 -> 1000008801800: The special role of the translator in society was well described in an essay, published posthumously in 1803, by Ignacy Krasicki — "Poland's La Fontaine", Primate of Poland, poet, encyclopedist, author of the first Polish novel, and translator from French and Greek: 10891780 -> 1000008801810: Religious texts 10891790 -> 1000008801820: Translation of religious works has played an important role in history. 10891800 -> 1000008801830: Buddhist monks who translated the Indian sutras into Chinese often skewed their translations to better reflect China's very different culture, emphasizing notions such as filial piety. 10891810 -> 1000008801840: A famous mistranslation of the Bible is the rendering of the Hebrew word "keren," which has several meanings, as "horn" in a context where it actually means "beam of light." 10891820 -> 1000008801850: As a result, artists have for centuries depicted Moses the Lawgiver with horns growing out of his forehead. 10891830 -> 1000008801860: An example is Michelangelo's famous sculpture. 10891840 -> 1000008801870: Christian anti-Semites used such depictions to spread hatred of the Jews, claiming that they were devils with horns. 10891850 -> 1000008801880: One of the first recorded instances of translation in the West was the rendering of the Old Testament into Greek in the third century B.C.E. 10891860 -> 1000008801890: The resulting translation is known as the Septuagint, a name that alludes to the "seventy" translators (seventy-two in some versions) who were commissioned to translate the Bible in Alexandria. 10891870 -> 1000008801900: Each translator worked in solitary confinement in a separate cell, and legend has it that all seventy versions were identical. 10891880 -> 1000008801910: The Septuagint became the source text for later translations into many languages, including Latin, Coptic, Armenian and Georgian. 10891890 -> 1000008801920: Saint Jerome, the patron saint of translation, is still considered one of the greatest translators in history for rendering the Bible into Latin. 10891900 -> 1000008801930: The Roman Catholic Church used his translation (known as the Vulgate) for centuries, but even this translation at first stirred much controversy. 
10891910 -> 1000008801940: The period preceding and contemporary with the Protestant Reformation saw the translation of the Bible into local European languages, a development that greatly affected Western Christianity's split into Roman Catholicism and Protestantism, due to disparities between Catholic and Protestant versions of crucial words and passages. 10891920 -> 1000008801950: Martin Luther's Bible in German, Jakub Wujek's in Polish, and the King James Bible in English had lasting effects on the religions, cultures and languages of those countries. 10891930 -> 1000008801960: Machine translation 10891940 -> 1000008801970: Machine translation (MT) is a procedure whereby a computer program analyzes a source text and produces a target text without further human intervention. 10891950 -> 1000008801980: In reality, however, machine translation typically does involve human intervention, in the form of pre-editing and post-editing. 10891960 -> 1000008801990: An exception to that rule might be, e.g., the translation of technical specifications (strings of technical terms and adjectives), using a dictionary-based machine-translation system. 10891970 -> 1000008802000: To date, machine translation—a major goal of natural-language processing—has met with limited success. 10891980 -> 1000008802010: A November 6, 2007, example illustrates the hazards of uncritical reliance on machine translation. 10891990 -> 1000008802020: Machine translation has been brought to a large public by tools available on the Internet, such as Yahoo!'s Babel Fish, Babylon, and StarDict. 10892000 -> 1000008802030: These tools produce a "gisting translation" — a rough translation that, with luck, "gives the gist" of the source text. 10892010 -> 1000008802040: With proper terminology work, with preparation of the source text for machine translation (pre-editing), and with re-working of the machine translation by a professional human translator (post-editing), commercial machine-translation tools can produce useful results, especially if the machine-translation system is integrated with a translation-memory or globalization-management system. 10892020 -> 1000008802050: In regard to texts (e.g., weather reports) with limited ranges of vocabulary and simple sentence structure, machine translation can deliver results that do not require much human intervention to be useful. 10892030 -> 1000008802060: Also, the use of a controlled language, combined with a machine-translation tool, will typically generate largely comprehensible translations. 10892040 -> 1000008802070: Relying on machine translation exclusively ignores the fact that communication in human language is context-embedded and that it takes a person to comprehend the context of the original text with a reasonable degree of probability. 10892050 -> 1000008802080: It is certainly true that even purely human-generated translations are prone to error. 10892060 -> 1000008802090: Therefore, to ensure that a machine-generated translation will be useful to a human being and that publishable-quality translation is achieved, such translations must be reviewed and edited by a human. 10892070 -> 1000008802100: CAT 10892080 -> 1000008802110: Computer-assisted translation (CAT), also called "computer-aided translation," "machine-aided human translation (MAHT)" and "interactive translation," is a form of translation wherein a human translator creates a target text with the assistance of a computer program. 10892090 -> 1000008802120: The machine supports a human translator. 
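As a rough illustration of why the "gisting translation" produced by a dictionary-based machine-translation system still needs a human in the loop, here is a deliberately naive word-for-word translator; the English-Spanish mini-dictionary and the example sentence are hypothetical, and the sketch ignores grammar, agreement and idiom by design.

```python
# A naive dictionary-based, word-for-word translator (hypothetical
# English->Spanish mini-dictionary). It ignores context, grammar and
# idiom, so at best it yields a rough "gisting translation" that a
# human translator would still need to post-edit.
mini_dictionary = {
    "the": "el", "cat": "gato", "sat": "sentó",
    "on": "en", "mat": "estera",
}

def gist_translate(sentence: str) -> str:
    words = sentence.lower().rstrip(".").split()
    # Unknown words are kept as-is, flagged for the human post-editor.
    return " ".join(mini_dictionary.get(w, f"[{w}?]") for w in words)

print(gist_translate("The cat sat on the mat."))
# -> "el gato sentó en el estera": the gist is there, but the article's
#    gender is wrong and the reflexive "se" is missing.
```

The output gives the gist of the source, and repairing exactly these kinds of faults is the post-editing work that the computer-assisted tools described here leave to the human translator.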
10892100 -> 1000008802130: Computer-assisted translation can include standard dictionary and grammar software. 10892110 -> 1000008802140: The term, however, normally refers to a range of specialized programs available to the translator, including translation-memory, terminology-management, concordance, and alignment programs. 10892120 -> 1000008802150: With the internet, translation software can help non-native-speaking individuals understand web pages published in other languages. 10892130 -> 1000008802160: Whole-page translation tools are of limited utility, however, since they offer only a limited potential understanding of the original author's intent and context; translated pages tend to be more humorous and confusing than enlightening. 10892140 -> 1000008802170: Interactive translations with pop-up windows are becoming more popular. 10892150 -> 1000008802180: These tools show several possible translations of each word or phrase. 10892160 -> 1000008802190: Human operators merely need to select the correct translation as the mouse glides over the foreign-language text. 10892170 -> 1000008802200: Possible definitions can be grouped by pronunciation. Translation memory 10900010 -> 1000008900020: Translation memory 10900020 -> 1000008900030: A translation memory, or TM, is a type of database that is used in software programs designed to aid human translators. 10900030 -> 1000008900040: Some software programs that use translation memories are known as translation memory managers (TMM). 10900040 -> 1000008900050: Translation memories are typically used in conjunction with a dedicated computer assisted translation (CAT) tool, word processing program, terminology management systems, multilingual dictionary, or even raw machine translation output. 10900050 -> 1000008900060: A translation memory consists of text segments in a source language and their translations into one or more target languages. 10900060 -> 1000008900070: These segments can be blocks, paragraphs, sentences, or phrases. 10900070 -> 1000008900080: Individual words are handled by terminology bases and are not within the domain of TM. 10900080 -> 1000008900090: Research indicates that many companies producing multilingual documentation are using translation memory systems. 10900090 -> 1000008900100: In a survey of language professionals in 2006, 82.5 % out of 874 replies confirmed the use of a TM. 10900100 -> 1000008900110: Usage of TM correlated with text type characterised by technical terms and simple sentence structure (technical, to a lesser degree marketing and financial), computing skills, and repetitiveness of content 10900110 -> 1000008900120: Using translation memories 10900120 -> 1000008900130: The program breaks the source text (the text to be translated) into segments, looks for matches between segments and the source half of previously translated source-target pairs stored in a translation memory, and presents such matching pairs as translation candidates. 10900130 -> 1000008900140: The translator can accept a candidate, replace it with a fresh translation, or modify it to match the source. 10900140 -> 1000008900150: In the last two cases, the new or modified translation goes into the database. 
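Below is a minimal sketch of the retrieve-accept-store loop just described, assuming a plain Python dictionary as the segment database and an exact-match-only lookup; the stored English-French pairs and the ask_translator callback are hypothetical placeholders for the translator's interactive work.

```python
# A minimal sketch of the translation-memory loop described above:
# segment the source text, look each segment up in the memory,
# and store new or edited translations back into the database.
# The memory contents here are hypothetical English->French pairs.
translation_memory = {
    "Press the power button.": "Appuyez sur le bouton d'alimentation.",
    "Close the cover.": "Fermez le couvercle.",
}

def translate_document(segments, ask_translator):
    target = []
    for seg in segments:
        candidate = translation_memory.get(seg)      # exact (100%) match only
        translation = candidate if candidate is not None else ask_translator(seg)
        translation_memory[seg] = translation        # new or edited pairs are stored
        target.append(translation)
    return target

doc = ["Press the power button.", "Wait ten seconds."]
print(translate_document(doc, ask_translator=lambda s: f"[human translation of: {s}]"))
```

Segments with no match fall through to the translator, and the new translation is written back so that repetitions later in the document, or in future versions, are found automatically.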
10900150 -> 1000008900160: Some translation memory systems search for 100% matches only; that is, they can only retrieve segments of text that match entries in the database exactly, while others employ fuzzy matching algorithms to retrieve similar segments, which are presented to the translator with the differences flagged. 10900160 -> 1000008900170: It is important to note that typical translation memory systems only search for text in the source segment. 10900170 -> 1000008900180: The flexibility and robustness of the matching algorithm largely determine the performance of the translation memory, although for some applications the recall rate of exact matches can be high enough to justify the 100%-match approach. 10900180 -> 1000008900190: Segments where no match is found will have to be translated by the translator manually. 10900190 -> 1000008900200: These newly translated segments are stored in the database, where they can be used for future translations, as well as for repetitions of that segment in the current text. 10900200 -> 1000008900210: Translation memories work best on texts which are highly repetitive, such as technical manuals. 10900210 -> 1000008900220: They are also helpful for translating incremental changes in a previously translated document, corresponding, for example, to minor changes in a new version of a user manual. 10900220 -> 1000008900230: Traditionally, translation memories have not been considered appropriate for literary or creative texts, for the simple reason that there is so little repetition in the language used. 10900230 -> 1000008900240: However, others find them of value even for non-repetitive texts, because the database resources created have value for concordance searches to determine appropriate usage of terms, for quality assurance (no empty segments), and for the simplification of the review process (source and target segments are always displayed together, whereas translators have to work with two documents in a traditional review environment). 10900240 -> 1000008900250: If a translation memory system is used consistently on appropriate texts over a period of time, it can save translators considerable work. 10900250 -> 1000008900260: Main benefits 10900260 -> 1000008900270: Translation memory managers are most suitable for translating technical documentation and documents containing specialized vocabularies. 10900270 -> 1000008900280: Their benefits include: 10900280 -> 1000008900290: Ensuring that the document is completely translated (translation memories do not accept empty target segments) 10900290 -> 1000008900300: Ensuring that the translated documents are consistent, including common definitions, phrasings and terminology. 10900300 -> 1000008900310: This is important when different translators are working on a single project. 10900310 -> 1000008900320: Enabling translators to translate documents in a wide variety of formats without having to own the software typically required to process these formats. 10900320 -> 1000008900330: Accelerating the overall translation process; since translation memories "remember" previously translated material, translators have to translate it only once. 10900330 -> 1000008900340: Reducing costs of long-term translation projects; for example, the text of manuals, warning messages or series of documents needs to be translated only once and can be used several times.
10900340 -> 1000008900350: For large documentation projects, savings (in time or money) thanks to the use of a TM package may already be apparent even for the first translation of a new project, but normally such savings are only apparent when translating subsequent versions of a project that was translated before using translation memory. 10900350 -> 1000008900360: Main obstacles 10900360 -> 1000008900370: The main problems hindering wider use of translation memory managers include: 10900370 -> 1000008900380: The concept of "translation memories" is based on the premise that sentences used in previous translations can be "recycled". 10900380 -> 1000008900390: However, a guiding principle of translation is that the translator must translate the message of the text, and not its component sentences. 10900390 -> 1000008900400: Translation memory managers do not easily fit into existing translation or localization processes. 10900400 -> 1000008900410: In order to take advantage of TM technology, the translation processes must be redesigned. 10900410 -> 1000008900420: Translation memory managers do not presently support all documentation formats, and filters may not exist to support all file types. 10900420 -> 1000008900430: There is a learning curve associated with using translation memory managers, and the programs must be customized for greatest effectiveness. 10900430 -> 1000008900440: In cases where all or part of the translation process is outsourced or handled by freelance translators working off-site, the off-site workers require special tools to be able to work with the texts generated by the translation memory manager. 10900440 -> 1000008900450: Full versions of many translation memory managers can cost from US$500 to US$2,500 per seat, which can represent a considerable investment (although lower cost programs are also available). 10900450 -> 1000008900460: However, some developers produce free or low-cost versions of their tools with reduced feature sets that individual translators can use to work on projects set up with full versions of those tools. 10900460 -> 1000008900470: (Note that there are freeware and shareware TM packages available, but none of these has yet gained a large market share.) 10900470 -> 1000008900480: The costs involved in importing the user's past translations into the translation memory database, in training, and in any add-on products may also represent a considerable investment. 10900480 -> 1000008900490: Maintenance of translation memory databases still tends to be a manual process in most cases, and failure to maintain them can result in significantly decreased usability and quality of TM matches. 10900490 -> 1000008900500: As stated previously, translation memory managers may not be suitable for text that lacks internal repetition or which does not contain unchanged portions between revisions. 10900500 -> 1000008900510: Technical text is generally best suited for translation memory, while marketing or creative texts will be less suitable. 10900510 -> 1000008900520: The quality of the text recorded in the translation memory is not guaranteed; if the translation for a particular segment is incorrect, it is in fact more likely that the incorrect translation will be reused the next time the same source text, or a similar source text, is translated, thereby perpetuating the error. 10900520 -> 1000008900530: There is also a potential, and probably unconscious, effect on the translated text.
10900530 -> 1000008900540: Different languages use different sequences for the logical elements within a sentence, and a translator presented with a multiple-clause sentence that is half translated is less likely to rebuild the sentence completely. 10900540 -> 1000008900550: There is also a potential for the translator to deal with the text mechanically sentence-by-sentence, instead of focusing on how each sentence relates to those around it and to the text as a whole. 10900550 -> 1000008900560: Translation memories also raise certain industrial relations issues as they make exploitation of human translators easier. 10900560 -> 1000008900570: Functions of a translation memory 10900570 -> 1000008900580: The following is a summary of the main functions of a Translation Memory. 10900580 -> 1000008900590: Off-line functions 10900590 -> 1000008900600: Import 10900600 -> 1000008900610: This function is used to transfer a text and its translation from a text file to the TM. 10900610 -> 1000008900620: Import can be done from a raw format, in which an external source text is available for importing into a TM along with its translation. 10900620 -> 1000008900630: Sometimes the texts have to be reprocessed by the user. 10900630 -> 1000008900640: There is another format that can be used to import: the native format. 10900640 -> 1000008900650: This is the format the TM uses to save translation memories in a file. 10900650 -> 1000008900660: Analysis 10900660 -> 1000008900670: The process of analysis is developed through the following steps: 10900670 -> 1000008900680: Textual parsing 10900680 -> 1000008900690: It is very important to recognize punctuation in order to distinguish, for example, the end of a sentence from an abbreviation. 10900690 -> 1000008900700: Thus, mark-up is a kind of pre-editing. 10900700 -> 1000008900710: Usually, the materials which have been processed through translators' aid programs contain mark-up, as the translation stage is embedded in a multilingual document production line. 10900710 -> 1000008900720: Other special text elements may be set off by mark-up. 10900720 -> 1000008900730: There are special elements which do not need to be translated, such as proper names and codes, while others may need to be converted to native format. 10900730 -> 1000008900740: Linguistic parsing 10900740 -> 1000008900750: The base form reduction is used to prepare lists of words and a text for automatic retrieval of terms from a term bank. 10900750 -> 1000008900760: On the other hand, syntactic parsing may be used to extract multi-word terms or phraseology from a source text. 10900760 -> 1000008900770: Parsing is thus used to normalise word order variation in phraseology, that is, which words can form a phrase. 10900770 -> 1000008900780: Segmentation 10900780 -> 1000008900790: Its purpose is to choose the most useful translation units. 10900790 -> 1000008900800: Segmentation is like a type of parsing. 10900800 -> 1000008900810: It is done monolingually using superficial parsing, and alignment is based on segmentation. 10900810 -> 1000008900820: If the translators correct the segmentations manually, later versions of the document will not find matches against the TM based on the corrected segmentation because the program will repeat its own errors. 10900820 -> 1000008900830: Translators usually proceed sentence by sentence, although the translation of one sentence may depend on the translation of the surrounding ones.
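As a rough illustration of the textual parsing and segmentation steps described above, the following sketch splits a text into sentence-like translation units while treating a short list of abbreviations as punctuation that does not end a segment. The abbreviation list, the regular expression and the sample text are illustrative assumptions made for this sketch, not the behaviour of any particular TM product.

import re

# Illustrative, assumed abbreviation list; real systems ship much larger,
# language-specific lists.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Dr.", "Mr.", "No."}

def segment(text):
    """Split text into sentence-like translation units.

    A punctuation mark closes a segment only if the token it terminates
    is not a known abbreviation.
    """
    segments, start = [], 0
    for match in re.finditer(r"[.!?]+(?=\s+|$)", text):
        end = match.end()
        tokens = text[start:end].split()
        if tokens and tokens[-1] in ABBREVIATIONS:
            continue  # e.g. "etc." does not close the segment
        segments.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        segments.append(text[start:].strip())
    return segments

if __name__ == "__main__":
    sample = "Update the manual, e.g. chapter 2. Save the file. Send it to Dr. Smith."
    print(segment(sample))
    # ['Update the manual, e.g. chapter 2.', 'Save the file.', 'Send it to Dr. Smith.']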
10900830 -> 1000008900840: Alignment 10900840 -> 1000008900850: It is the task of defining translation correspondences between source and target texts. 10900850 -> 1000008900860: There should be feedback from alignment to segmentation, and a good alignment algorithm should be able to correct the initial segmentation. 10900860 -> 1000008900870: Term extraction 10900870 -> 1000008900880: It can take a previous dictionary as input. 10900880 -> 1000008900890: Moreover, when extracting unknown terms, it can use parsing based on text statistics. 10900890 -> 1000008900900: Such statistics are used to estimate the amount of work involved in a translation job. 10900900 -> 1000008900910: This is very useful for planning and scheduling the work. 10900910 -> 1000008900920: Translation statistics usually count the words and estimate the amount of repetition in the text. 10900920 -> 1000008900930: Export 10900930 -> 1000008900940: Export transfers the text from the TM into an external text file. 10900940 -> 1000008900950: Import and export should be inverses. 10900950 -> 1000008900960: Online functions 10900960 -> 1000008900970: When translating, one of the main purposes of the TM is to retrieve the most useful matches in the memory so that the translator can choose the best one. 10900970 -> 1000008900980: The TM must show both the source and target text, pointing out the identities and differences. 10900980 -> 1000008900990: Retrieval 10900990 -> 1000008901000: It is possible to retrieve from the TM one or more types of matches. 10901000 -> 1000008901010: Exact match 10901010 -> 1000008901020: Exact matches occur when the current source segment and the stored one match character by character. 10901020 -> 1000008901030: When translating a sentence, an exact match means the same sentence has been translated before. 10901030 -> 1000008901040: Exact matches are also called "100% matches". 10901040 -> 1000008901050: In Context Exact (ICE) match 10901050 -> 1000008901060: An ICE match is an exact match that occurs in exactly the same context, that is, the same location in a paragraph. 10901060 -> 1000008901070: Context is often defined by the surrounding sentences and attributes such as document file name, date, and permissions. 10901070 -> 1000008901080: Fuzzy match 10901080 -> 1000008901090: When the match is not exact, it is a "fuzzy" match. 10901090 -> 1000008901100: Some systems assign percentages to these kinds of matches, in which case a fuzzy match is greater than 0% and less than 100%. 10901100 -> 1000008901110: Those figures are not comparable across systems unless the method of scoring is specified. 10901110 -> 1000008901120: Concordance 10901120 -> 1000008901130: This feature allows translators to select one or more words in the source segment, and the system retrieves segment pairs that match the search criteria. 10901130 -> 1000008901140: This feature is helpful for finding translations of terms and idioms in the absence of a terminology database. 10901140 -> 1000008901150: Updating 10901150 -> 1000008901160: A TM is updated with a new translation when it has been accepted by the translator. 10901160 -> 1000008901170: As always when updating a database, there is the question of what to do with its previous contents. 10901170 -> 1000008901180: A TM can be modified by changing or deleting entries in the TM. 10901180 -> 1000008901190: Some systems allow translators to save multiple translations of the same source segment.
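The retrieval behaviour described above, with exact matches scoring 100% and fuzzy matches scoring between 0% and 100%, can be sketched in a few lines. The sketch below uses Python's difflib ratio as the similarity measure and a tiny hand-made memory; both are assumptions made for illustration, since commercial systems use their own, usually unpublished, scoring formulas, which is why the percentages are not comparable across systems.

from difflib import SequenceMatcher

# Toy translation memory: stored source segment -> stored translation.
# The entries are invented for illustration only.
TM = {
    "Press the power button.": "Drücken Sie den Netzschalter.",
    "Save the file before closing the program.": "Speichern Sie die Datei, bevor Sie das Programm schließen.",
}

def match_score(a, b):
    """Return a similarity percentage between two source segments."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def retrieve(segment, threshold=70):
    """Return (score, stored_source, stored_target) candidates above the
    threshold, best first; a score of 100 corresponds to an exact match."""
    candidates = []
    for source, target in TM.items():
        score = match_score(segment, source)
        if score >= threshold:
            candidates.append((score, source, target))
    return sorted(candidates, reverse=True)

if __name__ == "__main__":
    print(retrieve("Press the power button."))
    # exact (100%) match
    print(retrieve("Save the file before closing the application."))
    # fuzzy match with a score below 100%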
10901190 -> 1000008901200: Automatic translation 10901200 -> 1000008901210: Translation memories can do retrieval and substitution automatically, without the help of the translator. 10901210 -> 1000008901220: 10901220 -> 1000008901230: Automatic retrieval 10901230 -> 1000008901240: A TM features automatic retrieval and evaluation of translation correspondences in a translator's workbench. 10901240 -> 1000008901250: Automatic substitution 10901250 -> 1000008901260: Exact matches come up in translating new versions of a document. 10901260 -> 1000008901270: During automatic substitution, the translator does not check the translation against the original, so if there are any mistakes in the previous translation, they will carry over. 10901270 -> 1000008901280: Networking 10901280 -> 1000008901290: When networking during translation, it is possible for a group of translators to translate a text together efficiently. 10901290 -> 1000008901300: This way, the translations entered by one translator are available to the others. 10901300 -> 1000008901310: Moreover, if translation memories are shared before the final translation, there is a chance that mistakes made by one translator will be corrected by other team members. 10901310 -> 1000008901320: Text memory 10901320 -> 1000008901330: "Text memory" is the basis of the proposed LISA OSCAR xml:tm standard. 10901330 -> 1000008901340: Text memory comprises author memory and translation memory. 10901340 -> None: Translation memory 10901350 -> None: The unique identifiers are remembered during translation so that the target language document is 'exactly' aligned at the text unit level. 10901360 -> None: If the source document is subsequently modified, then those text units that have not changed can be directly transferred to the new target version of the document without the need for any translator interaction. 10901370 -> None: This is the concept of 'exact' or 'perfect' matching to the translation memory. xml:tm can also provide mechanisms for in-document leveraged and fuzzy matching. 10901380 -> 1000008901350: History of translation memories 10901390 -> 1000008901360: The concept behind translation memories is not recent — university research into the concept began in the late 1970s, and the earliest commercializations became available in the late 1980s — but they became commercially viable only in the late 1990s. 10901400 -> 1000008901370: Originally translation memory systems stored aligned source and target sentences in a database, from which they could be recalled during translation. 10901410 -> 1000008901380: The problem with this 'leveraged' approach is that there is no guarantee that the new source language sentence is from the same context as the original database sentence. 10901420 -> 1000008901390: Therefore all 'leveraged' matches require that a translator review the memory match for relevance in the new document. 10901430 -> 1000008901400: Although cheaper than outright translation, this review still carries a cost. 10901440 -> 1000008901410: Support for new languages 10901450 -> 1000008901420: Translation memory tools from the majority of companies do not support many emerging languages. 10901460 -> 1000008901430: Recently, Asian countries such as India have also moved into language computing, and there is considerable scope for translation memories in such developing countries. 10901470 -> 1000008901440: As most CAT software companies are concentrating on legacy languages, little is happening for Asian languages.
10901480 -> 1000008901450: Recent trends 10901490 -> 1000008901460: One recent development is the concept of 'text memory' in contrast to translation memory (see Translating XML Documents with xml:tm). 10901500 -> 1000008901470: This is also the basis of the proposed LISA OSCAR xml:tm standard. 10901510 -> 1000008901480: Text memory within xml:tm comprises 'author memory' and 'translation memory'. 10901520 -> 1000008901490: Author memory is used to keep track of changes during the authoring cycle. 10901530 -> 1000008901500: Translation memory uses the information from author memory to implement translation memory matching. 10901540 -> 1000008901510: Although primarily targeted at XML documents, xml:tm can be used on any document that can be converted to XLIFF format. 10901550 -> 1000008901520: Second generation translation memories 10901560 -> 1000008901530: Much more powerful than first-generation TMs, they include a linguistic analysis engine, use chunk technology to break down segments into intelligent terminological groups, and automatically generate specific glossaries. 10901570 -> 1000008901540: Translation memory and related standards 10901580 -> 1000008901550: TMX 10901590 -> 1000008901560: Translation Memory Exchange format. 10901600 -> 1000008901570: This standard enables the interchange of translation memories between translation suppliers. 10901610 -> 1000008901580: TMX has been adopted by the translation community as the best way of importing and exporting translation memories. 10901620 -> 1000008901590: The current version is 1.4b - it allows for the recreation of the original source and target documents from the TMX data. 10901630 -> 1000008901600: An updated version, 2.0, is due to be released in 2008. 10901640 -> 1000008901610: TBX 10901650 -> 1000008901620: Termbase Exchange format. 10901660 -> 1000008901630: This LISA standard, which is currently being revised and republished as ISO 30042, allows for the interchange of terminology data including detailed lexical information. 10901670 -> 1000008901640: The framework for TBX is provided by three ISO standards: ISO 12620, ISO 12200 and ISO 16642. 10901680 -> 1000008901650: ISO 12620 provides an inventory of well-defined “data categories” with standardized names that function as data element types or as predefined values. 10901690 -> 1000008901660: ISO 12200 (also known as MARTIF) provides the basis for the core structure of TBX. 10901700 -> 1000008901670: ISO 16642 (also known as Terminological Markup Framework) includes a structural metamodel for Terminology Markup Languages in general. 10901710 -> 1000008901680: SRX 10901720 -> 1000008901690: Segmentation Rules Exchange format. 10901730 -> 1000008901700: SRX is intended to enhance the TMX standard so that translation memory data that is exchanged between applications can be used more effectively. 10901740 -> 1000008901710: The ability to specify the segmentation rules that were used in the previous translation increases the leveraging that can be achieved. 10901750 -> 1000008901720: GMX 10901760 -> 1000008901730: GILT Metrics. 10901770 -> 1000008901740: GILT stands for (Globalization, Internationalization, Localization, and Translation). 10901780 -> 1000008901750: The GILT Metrics standard comprises three parts: GMX-V for volume metrics, GMX-C for complexity metrics and GMX-Q for quality metrics. 10901790 -> 1000008901760: The proposed GILT Metrics standard is tasked with quantifying the workload and quality requirements for any given GILT task. 
10901800 -> 1000008901770: OLIF 10901810 -> 1000008901780: Open Lexicon Interchange Format. 10901820 -> 1000008901790: OLIF is an open, XML-compliant standard for the exchange of terminological and lexical data. 10901830 -> 1000008901800: Although originally intended as a means for the exchange of lexical data between proprietary machine translation lexicons, it has evolved into a more general standard for terminology exchange. 10901840 -> None: 10901850 -> 1000008901810: XLIFF 10901860 -> 1000008901820: XML Localisation Interchange File Format. 10901870 -> 1000008901830: It is intended to provide a single interchange file format that can be understood by any localization provider. 10901880 -> 1000008901840: XLIFF is the preferred way of exchanging data in XML format in the translation industry. 10901890 -> 1000008901850: TransWS 10901900 -> 1000008901860: Translation Web Services. 10901910 -> 1000008901870: TransWS specifies the calls needed to use Web services for the submission and retrieval of files and messages relating to localization projects. 10901920 -> 1000008901880: It is intended as a detailed framework for the automation of much of the current localization process by the use of Web Services. 10901930 -> 1000008901890: xml:tm 10901940 -> 1000008901900: This approach to translation memory is based on the concept of text memory, which comprises author and translation memory. xml:tm has been donated to LISA OSCAR by XML-INTL. 10901950 -> 1000008901910: PO 10901960 -> 1000008901920: Gettext Portable Object format. 10901970 -> 1000008901930: Though often not regarded as a translation memory format, Gettext PO files are bilingual files that are also used in translation memory processes in the same way translation memories are used. 10901980 -> 1000008901940: Typically, a PO translation memory system will consist of various separate files in a directory tree structure. 10901990 -> 1000008901950: Common tools that work with PO files include the GNU Gettext Tools and the Translate Toolkit. 10902000 -> 1000008901960: Several tools and programs also exist that edit PO files as if they were mere source text files. Turing test 10910010 -> 1000009000020: Turing test 10910020 -> 1000009000030: The Turing test is a proposal for a test of a machine's capability to demonstrate intelligence. 10910030 -> 1000009000040: Described by Alan Turing in the 1950 paper "Computing Machinery and Intelligence," it proceeds as follows: a human judge engages in a natural language conversation with one human and one machine, each of which tries to appear human; if the judge cannot reliably tell which is which, then the machine is said to pass the test. 10910040 -> 1000009000050: In order to test the machine's intelligence rather than its ability to render words into audio, the conversation is limited to a text-only channel such as a computer keyboard and screen (Turing originally suggested a teletype machine, one of the few text-only communication systems available in 1950). 10910050 -> 1000009000060: History 10910060 -> 1000009000070: While the field of artificial intelligence is said to have been founded in 1956, its roots extend back considerably further. 10910070 -> 1000009000080: The question as to whether or not it is possible for machines to think has a long history, firmly entrenched in the distinction between dualist and materialist views of the mind.
10910080 -> 1000009000090: From the perspective of dualism, the mind is non-physical (or, at the very least, has non-physical properties), and therefore cannot be explained in purely physical terms. 10910090 -> 1000009000100: The materialist perspective, on the other hand, argues that the mind can be explained physically, and thus leaves open the possibility of minds that are artificially produced. 10910100 -> 1000009000110: Alan Turing 10910110 -> 1000009000120: In more practical terms, researchers in Britain had been exploring "machine intelligence" for up to ten years prior to 1956. 10910120 -> 1000009000130: Alan Turing in particular had been tackling the notion of machine intelligence since at least 1941, and one of the earliest known mentions of "computer intelligence" was made by Turing in 1947. 10910130 -> 1000009000140: In Turing's report, "Intelligent Machinery", he investigated "the question of whether or not it is possible for machinery to show intelligent behaviour", and as part of that investigation proposed what may be considered the forerunner to his later tests: 10910131 -> 1000009000150: "It is not difficult to devise a paper machine which will play a not very bad game of chess. 10910132 -> 1000009000160: Now get three men as subjects for the experiment. A, B and C. A and C are to be rather poor chess players, B is the operator who works the paper machine. ... 10910133 -> 1000009000170: Two rooms are used with some arrangement for communicating moves, and a game is played between C and either A or the paper machine. 10910134 -> 1000009000180: C may find it quite difficult to tell which he is playing." 10910135 -> 1000009000190: {(Harvard citation no brackets+Turing 1948, p. 431+Turing+1948+p=431)} 10910140 -> 1000009000200: Thus by the time Turing published "Computing Machinery and Intelligence", he had been considering the possibility of machine intelligence for many years. 10910150 -> 1000009000210: This, however, was the first published paper by Turing to focus exclusively on the notion. 10910160 -> 1000009000220: Turing began his 1950 paper with the claim: "I propose to consider the question, 'Can machines think?'" 10910170 -> 1000009000230: As Turing highlighted, the traditional approach to such a question is to start with definitions, defining both the terms machine and intelligence. 10910180 -> 1000009000240: Nevertheless, Turing chose not to do so. 10910190 -> 1000009000250: Instead he replaced the question with a new question, "which is closely related to it and is expressed in relatively unambiguous words". 10910200 -> 1000009000260: In essence, Turing proposed to change the question from "Do machines think?" into "Can machines do what we (as thinking entities) can do?" 10910210 -> 1000009000270: The advantage of the new question, Turing argued, was that it "drew a fairly sharp line between the physical and intellectual capacities of a man." 10910220 -> 1000009000280: To demonstrate this approach, Turing proposed a test that was inspired by a party game known as the "Imitation Game", in which a man and a woman go into separate rooms, and guests try to tell them apart by writing a series of questions and reading the typewritten answers sent back. 10910230 -> 1000009000290: In this game, both the man and the woman aim to convince the guests that they are the other.
10910240 -> 1000009000300: Turing proposed recreating the imitation game as follows: 10910241 -> 1000009000310: "We now ask the question, 'What will happen when a machine takes the part of A in this game?' 10910242 -> 1000009000320: Will the interrogator decide wrongly as often when the game is played like this as he does when the game is played between a man and a woman? 10910243 -> 1000009000330: These questions replace our original, 'Can machines think?'" 10910244 -> 1000009000340: {(Harvard citation no brackets+Turing 1950, p. 434+Turing+1950+p=434)} 10910255 -> 1000009000350: Later in the paper he suggested an "equivalent" alternative formulation involving a judge conversing only with a computer and a man. 10910260 -> 1000009000360: While neither of these two formulations precisely match the version of the Turing Test that is more generally known today, a third version was proposed by Turing in 1952. 10910270 -> 1000009000370: In this version, which Turing discussed in a BBC radio broadcast, Turing proposes a jury which asks questions of a computer, and where the role of the computer is to make a significant proportion of the jury believe that it is really a man. 10910280 -> 1000009000380: Turing's paper considered nine common objections, which include all the major arguments against artificial intelligence that have been raised in the years since his paper was first published. 10910290 -> 1000009000390: (See Computing Machinery and Intelligence.) 10910300 -> 1000009000400: ELIZA, PARRY and the Chinese room 10910310 -> 1000009000410: Blay Whitby lists four major turning points in the history of the Turing Test: the publication of "Computing Machinery and Intelligence" in 1950; the announcement of Joseph Weizenbaum's ELIZA in 1966; Kenneth Colby's creation of PARRY, which was first described in 1972; and the Turing Colloquium in 1990. 10910320 -> 1000009000420: ELIZA works by examining a user's typed comments for keywords. 10910330 -> 1000009000430: If a word is found a rule is applied which transforms the user's comments, and the resulting sentence is then returned. 10910340 -> 1000009000440: If a keyword is not found, ELIZA responds with either a generic response or by repeating one of the earlier comments. 10910350 -> 1000009000450: In addition, Weizenbaum developed ELIZA to replicate the behavior of a Rogerian psychotherapist, allowing ELIZA to be "free to assume the pose of knowing almost nothing of the real world." 10910360 -> 1000009000460: Due to these techniques, Weizenbaum's program was able to fool some people into believing that they were talking to a real person, with some subjects being "very hard to convince that ELIZA ... is not human." 10910370 -> 1000009000470: Thus ELIZA is claimed by many to be one of the programs (perhaps the first) that are able to pass the Turing Test. 10910380 -> 1000009000480: Colby's PARRY has been described as "ELIZA with attitude" - it attempts to model the behavior of a paranoid schizophrenic, using a similar (if more advanced) approach to that employed by Weizenbaum. 10910390 -> 1000009000490: In order to help validate the work, PARRY was tested in the early 1970s using a variation of the Turing Test. 10910400 -> 1000009000500: A group of experienced psychiatrists analyzed a combination of real patients and computers running PARRY through teletype machines. 10910410 -> 1000009000510: Another group of 33 psychiatrists were shown transcripts of the conversations. 
10910420 -> 1000009000520: The two groups were then asked to identify which of the "patients" were human, and which were computer programs. 10910430 -> 1000009000530: The psychiatrists were only able to make the correct identification 48% of the time - a figure consistent with random guessing. 10910440 -> 1000009000540: While neither ELIZA nor PARRY was able to pass a strict Turing Test, they - and software like them - suggested that software might be written that was able to do so. 10910450 -> 1000009000550: More importantly, they suggested that such software might involve little more than databases and the application of simple rules. 10910460 -> 1000009000560: This led to John Searle's 1980 paper, "Minds, Brains, and Programs", in which he proposed an argument against the Turing Test. 10910470 -> 1000009000570: Searle described a thought experiment known as the Chinese room that highlighted what he saw as a fundamental misinterpretation of what the Turing Test could and could not prove: while programs such as ELIZA might be able to pass the Turing Test, they might do so by simply manipulating symbols of which they have no understanding. 10910480 -> 1000009000580: And without understanding, they could not be described as "thinking" in the same sense people do. 10910490 -> 1000009000590: Searle concludes that the Turing Test cannot prove that a machine can think, contrary to Turing's original proposal. 10910500 -> 1000009000600: Arguments such as that proposed by Searle and others working in the philosophy of mind sparked off a more intense debate about the nature of intelligence, the possibility of intelligent machines and the value of the Turing test that continued through the 1980s and 1990s. 10910510 -> 1000009000610: 1990s and beyond 10910520 -> 1000009000620: 1990 was the 40th anniversary of the first publication of Turing's "Computing Machinery and Intelligence" paper, and thus saw renewed interest in the test. 10910530 -> 1000009000630: Two significant events occurred in that year. 10910540 -> 1000009000640: The first was the Turing Colloquium, which was held at the University of Sussex in April, and brought together academics and researchers from a wide variety of disciplines to discuss the Turing Test in terms of its past, present and future. 10910550 -> 1000009000650: The second significant event was the formation of the annual Loebner prize competition. 10910560 -> 1000009000660: The Loebner prize was instigated by Hugh Loebner under the auspices of the Cambridge Center for Behavioral Studies of Massachusetts, United States, with the first competition held in November 1991. 10910570 -> 1000009000670: As Loebner describes it, the competition was created to advance the state of AI research, at least in part because while the Turing Test had been discussed for many years, "no one had taken steps to implement it." 10910580 -> 1000009000680: The Loebner prize has three awards: the first prize of $100,000 and a gold medal, to be awarded to the first program that passes the "unrestricted" Turing test; the second prize of $25,000, to be awarded to the first program that passes the "restricted" version of the test; and a sum of $2000 (now $3000) to the "most human-like" program entered each year. 10910590 -> 1000009000690: As of 2007, neither the first nor second prizes have been awarded.
10910600 -> 1000009000700: The running of the Loebner prize led to renewed discussion of both the viability of the Turing Test and the aim of developing artificial intelligences that could pass it. 10910610 -> 1000009000710: The Economist, in an article entitled "Artificial Stupidity", commented that the winning entry from the first Loebner prize won, at least in part, because it was able to "imitate human typing errors". 10910620 -> 1000009000720: (Turing had considered the possibility that computers could be identified by their lack of errors, and had suggested that the computers should be programmed to add errors into their output, so as to be better "players" of the game). 10910630 -> 1000009000730: The issue that The Economist raised was one that was already well established in the literature: perhaps we don't really need the types of computers that could pass the Turing Test, and perhaps trying to pass the Turing Test is nothing more than a distraction from more fruitful lines of research. 10910640 -> 1000009000740: Equally, a second issue became apparent - by providing rules which restricted the abilities of the interrogators to ask questions, and by using comparatively "unsophisticated" interrogators, the Turing Test can be passed through the use of "trickery" rather than intelligence. 10910650 -> 1000009000750: Versions of the Turing test 10910660 -> 1000009000760: There are at least three primary versions of the Turing test - two offered by Turing in "Computing Machinery and Intelligence" and one which Saul Traiger describes as the "Standard Interpretation". 10910670 -> 1000009000770: While there is some debate as to whether or not the "Standard Interpretation" is described by Turing or is, instead, based on a misreading of his paper, these three versions are not regarded as being equivalent, and are seen as having different strengths and weaknesses. 10910680 -> 1000009000780: As empirical tests they conform to a proposal published in 1936 by A. J. Ayer on how to distinguish between a conscious man and an unconscious machine. 10910690 -> 1000009000790: In his book Language, Truth and Logic, Ayer states that 'The only ground I can have for asserting that an object which appears to be conscious is not really a conscious being, but only a dummy or a machine, is that it fails to satisfy one of the empirical tests by which the presence or absence of consciousness is determined'. 10910700 -> 1000009000800: The imitation game 10910710 -> 1000009000810: Turing described a simple party game which involves three players. 10910720 -> 1000009000820: Player A is a man, player B is a woman, and player C (who plays the role of the interrogator) can be of either gender. 10910730 -> 1000009000830: In the imitation game, player C - the interrogator - is unable to see either player A or player B, and can only communicate with them through written notes. 10910740 -> 1000009000840: By asking questions of player A and player B, player C tries to determine which of the two is the man, and which of the two is the woman. 10910750 -> 1000009000850: Player A's role is to trick the interrogator into making the wrong decision, while player B attempts to assist the interrogator. 10910760 -> 1000009000860: In what Sterrett refers to as the "Original Imitation Game Test", Turing proposed that the role of player A be replaced with a computer. 10910770 -> 1000009000870: The computer's task is therefore to pretend to be a woman and to attempt to trick the interrogator into making an incorrect evaluation.
10910780 -> 1000009000880: The success of the computer is determined by comparing the outcome of the game when player A is a computer against the outcome when player A is a man. 10910790 -> 1000009000890: If, as Turing puts it, "the interrogator decide[s] wrongly as often when the game is played [with the computer] as he does when the game is played between a man and a woman", then it can be argued that the computer is intelligent. 10910800 -> 1000009000900: The second version comes later in Turing's 1950 paper. 10910810 -> 1000009000910: As with the Original Imitation Game Test, the role of player A is performed by a computer. 10910820 -> 1000009000920: The difference is that now the role of player B is to be performed by a man, rather than by a woman. 10910821 -> 1000009000930: "Let us fix our attention on one particular digital computer C. 10910822 -> 1000009000940: Is it true that by modifying this computer to have an adequate storage, suitably increasing its speed of action, and providing it with an appropriate programme, C can be made to play satisfactorily the part of A in the imitation game, the part of B being taken by a man?" 10910823 -> 1000009000950: {(Harvard citation no brackets+Turing 1950, p. 442+Turing+1950+p=442)} 10910830 -> 1000009000960: In this version both player A (the computer) and player B are trying to trick the interrogator into making an incorrect decision. 10910840 -> 1000009000970: The standard interpretation 10910850 -> 1000009000980: A common understanding of the Turing test is that the purpose was not specifically to test if a computer is able to fool an interrogator into believing that it is a woman, but to test whether or not a computer could imitate a human. 10910860 -> 1000009000990: While there is some dispute as to whether or not this interpretation was intended by Turing (for example, Sterrett believes that it was, and thus conflates the second version with this one, while others, such as Traiger, do not), this has nevertheless led to what can be viewed as the "standard interpretation". 10910870 -> 1000009001000: In this version, player A is a computer, and player B is a person of either gender. 10910880 -> 1000009001010: The role of the interrogator is not to determine which is male and which is female, but to determine which is a computer and which is a human. 10910890 -> 1000009001020: Imitation game vs. standard Turing test 10910900 -> 1000009001030: There has been some controversy over which of the alternative formulations of the test Turing intended. 10910910 -> 1000009001040: Sterrett argues that two distinct tests can be extracted from Turing's 1950 paper, and that, pace Turing's remark, they are not equivalent. 10910920 -> 1000009001050: The test that employs the party game and compares frequencies of success in the game is referred to as the "Original Imitation Game Test", whereas the test consisting of a human judge conversing with a human and a machine is referred to as the "Standard Turing Test"; note that Sterrett equates this with the "standard interpretation" rather than the second version of the imitation game.
10910930 -> 1000009001060: Sterrett agrees that the Standard Turing Test (STT) has the problems its critics cite, but argues that, in contrast, the Original Imitation Game Test (OIG Test) so defined is immune to many of them, due to a crucial difference: the OIG Test, unlike the STT, does not make similarity to a human performance the criterion of the test, even though it employs a human performance in setting a criterion for machine intelligence. 10910940 -> 1000009001070: A man can fail the OIG Test, but it is argued that this is a virtue of a test of intelligence if failure indicates a lack of resourcefulness. 10910950 -> 1000009001080: It is argued that the OIG Test requires the resourcefulness associated with intelligence and not merely "simulation of human conversational behaviour". 10910960 -> 1000009001090: The general structure of the OIG Test could even be used with nonverbal versions of imitation games. 10910970 -> 1000009001100: Still other writers have interpreted Turing to be proposing that the imitation game itself is the test, without specifying how to take into account Turing's statement that the test he proposed using the party version of the imitation game is based upon a criterion of comparative frequency of success in that imitation game, rather than a capacity to succeed at one round of the game. 10910980 -> 1000009001110: Should the interrogator know about the computer? 10910990 -> 1000009001120: Turing never makes it clear as to whether or not the interrogator in his tests is aware that one of the participants is a computer. 10911000 -> 1000009001130: To return to the Original Imitation Game, Turing states only that Player A is to be replaced with a machine, not that player C is to be made aware of this replacement. 10911010 -> 1000009001140: When Colby, Hilf, Weber and Kramer tested PARRY, they did so by assuming that the interrogators did not need to know that one or more of those being interviewed was a computer during the interrogation. 10911020 -> 1000009001150: But, as Saygin and others highlight, this makes a big difference to the implementation and outcome of the test. 10911030 -> 1000009001160: Strengths of the test 10911040 -> 1000009001170: The power of the Turing test derives from the fact that it is possible to talk about anything. 10911050 -> 1000009001180: Turing wrote "the question and answer method seems to be suitable for introducing almost any one of the fields of human endeavor that we wish to include." 10911060 -> 1000009001190: John Haugeland adds that "understanding the words is not enough; you have to understand the topic as well." 10911070 -> 1000009001200: In order to pass a well designed Turing test, the machine would have to use natural language, to reason, to have knowledge and to learn. 10911080 -> 1000009001210: The test can be extended to include video input, as well as a "hatch" through which objects can be passed, and this would force the machine to demonstrate the skill of vision and robotics as well. 10911090 -> 1000009001220: Together these represent almost all the major problems of artificial intelligence. 10911100 -> 1000009001230: Weaknesses of the test 10911110 -> 1000009001240: The test has been criticized on several grounds. 10911120 -> 1000009001250: Human intelligence vs. intelligence in general 10911130 -> 1000009001260: The test is explicitly anthropomorphic. 10911140 -> 1000009001270: It only tests if the subject resembles a human being. 
10911150 -> 1000009001280: It will fail to test for intelligence under two circumstances: 10911160 -> 1000009001290: It tests for many behaviors that we may not consider intelligent, such as the susceptibility to insults or the temptation to lie. 10911170 -> 1000009001300: A machine may very well be intelligent without being able to chat exactly like a human. 10911180 -> 1000009001310: It fails to capture the general properties of intelligence, such as the ability to solve difficult problems or come up with original insights. 10911190 -> 1000009001320: If a machine can solve a difficult problem that no person could solve, it would, in principle, fail the test. 10911200 -> 1000009001330: Stuart J. Russell and Peter Norvig argue that the anthropomorphism of the test prevents it from being truly useful for the task of engineering intelligent machines. 10911210 -> 1000009001340: They write: "Aeronautical engineering texts do not define the goal of their field as 'making machines that fly so exactly like pigeons that they can fool other pigeons.'" 10911220 -> 1000009001350: The test is also vulnerable to naivete on the part of the test subjects. 10911230 -> 1000009001360: If the testers have little experience with chatterbots, they may be more likely to judge a computer program to be responding coherently than someone who is aware of the various tricks that chatterbots use, such as changing the subject or answering a question with another question. 10911240 -> 1000009001370: Such tricks may be misinterpreted as "playfulness" and therefore evidence of a human participant by uninformed testers, especially during brief sessions in which a chatterbot's inherent repetitiveness does not have a chance to become evident. 10911250 -> 1000009001380: Real intelligence vs. simulated intelligence 10911260 -> 1000009001390: The test is also explicitly behaviorist or functionalist: it only tests how the subject acts. 10911270 -> 1000009001400: A machine passing the Turing test may be able to simulate human conversational behaviour but the machine might just follow some cleverly devised rules. 10911280 -> 1000009001410: Two famous examples of this line of argument against the Turing test are John Searle's Chinese room argument and Ned Block's Blockhead argument. 10911290 -> 1000009001420: Even if the Turing test is a good operational definition of intelligence, it may not indicate that the machine has consciousness, or that it has intentionality. 10911300 -> 1000009001430: Perhaps intelligence and consciousness, for example, are such that neither one necessarily implies the other. 10911310 -> 1000009001440: In that case, the Turing test might fail to capture one of the key differences between intelligent machines and intelligent people. 10911320 -> 1000009001450: Predictions and tests 10911330 -> 1000009001460: Turing predicted that machines would eventually be able to pass the test. 10911340 -> 1000009001470: In fact, he estimated that by the year 2000, machines with 10⁹ bits (about 119.2 MiB) of memory would be able to fool 30% of human judges during a 5-minute test. 10911350 -> 1000009001480: He also predicted that people would then no longer consider the phrase "thinking machine" contradictory. 10911360 -> 1000009001490: He further predicted that machine learning would be an important part of building powerful machines, a claim which is considered to be plausible by contemporary researchers in Artificial intelligence.
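For clarity, the memory figure in that prediction is a straightforward unit conversion (a worked step added here, using binary prefixes): $10^{9}\ \text{bits} = 1.25 \times 10^{8}\ \text{bytes} \approx (1.25 \times 10^{8}) / 2^{20}\ \text{MiB} \approx 119.2\ \text{MiB}$.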
10911370 -> 1000009001500: By extrapolating an exponential growth of technology over several decades, futurist Ray Kurzweil predicted that Turing-test-capable computers would be manufactured around the year 2020, roughly speaking. 10911380 -> 1000009001510: See the Moore's Law article and the references therein for discussions of the plausibility of this argument. 10911390 -> 1000009001520: As of 2008, no computer has passed the Turing test as such. 10911400 -> 1000009001530: Simple conversational programs such as ELIZA have fooled people into believing they are talking to another human being, such as in an informal experiment termed AOLiza. 10911410 -> 1000009001540: However, such "successes" are not the same as a Turing Test. 10911420 -> 1000009001550: Most obviously, the human party in the conversation has no reason to suspect they are talking to anything other than a human, whereas in a real Turing test the questioner is actively trying to determine the nature of the entity they are chatting with. 10911430 -> 1000009001560: Documented cases are usually in environments such as Internet Relay Chat where conversation is sometimes stilted and meaningless, and in which no understanding of a conversation is necessary. 10911440 -> 1000009001570: Additionally, many internet relay chat participants use English as a second or third language, thus making it even more likely that they would assume that an unintelligent comment by the conversational program is simply something they have misunderstood, and do not recognize the very non-human errors they make. 10911450 -> 1000009001580: See ELIZA effect. 10911460 -> 1000009001590: The Loebner prize is an annual competition to determine the best Turing test competitors. 10911470 -> 1000009001600: Although they award an annual prize for the computer system that, in the judges' opinions, demonstrates the "most human" conversational behaviour (with learning AI Jabberwacky winning in 2005 and 2006, and A.L.I.C.E. before that), they have an additional prize for a system that in their opinion passes a Turing test. 10911480 -> 1000009001610: This second prize has not yet been awarded. 10911490 -> 1000009001620: The creators of Jabberwacky have proposed a personal Turing Test: the ability to pass the imitation test while attempting to specifically imitate the human player, with whom the AI will have conversed at length before the test. 10911500 -> 1000009001630: In 2008 the competition for the Loebner prize is being co-organised by Kevin Warwick and held at the University of Reading on October 12. 10911510 -> 1000009001640: The directive for the competition is to stay as close as possible to Turing's original statements made in his 1950 paper, such that it can be ascertained if any machines are presently close to 'passing the test'. 10911520 -> 1000009001650: An academic meeting discussing the Turing Test, organised by the Society for the Study of Artificial Intelligence and the Simulation of Behaviour, is being held in parallel at the same venue. 10911530 -> 1000009001660: Trying to pass the Turing test in its full generality is not, as of 2005, an active focus of much mainstream academic or commercial effort. 10911540 -> 1000009001670: Current research in AI-related fields is aimed at more modest and specific goals. 10911550 -> 1000009001680: The first bet of the Long Bet Project is a $10,000 one between Mitch Kapor (pessimist) and Ray Kurzweil (optimist) about whether a computer will pass a Turing Test by the year 2029. 
10911560 -> 1000009001690: The bet specifies the conditions in some detail. 10911570 -> 1000009001700: Variations of the Turing test 10911580 -> 1000009001710: A modification of the Turing test, where the objective of one or more of the roles has been reversed between computers and humans, is termed a reverse Turing test. 10911590 -> 1000009001720: Another variation of the Turing test is described as the Subject matter expert Turing test, where a computer's response cannot be distinguished from that of an expert in a given field. 10911600 -> 1000009001730: As brain and body scanning techniques improve, it may also be possible to replicate the essential data elements of a person to a computer system. 10911610 -> 1000009001740: The Immortality test variation of the Turing test would determine if a person's essential character is reproduced with enough fidelity to make it impossible to distinguish a reproduction of a person from the original person. 10911620 -> 1000009001750: The Minimum Intelligent Signal Test, proposed by Chris McKinstry, is another variation of Turing's test in which only binary responses are permitted. 10911630 -> 1000009001760: It is typically used to gather statistical data against which the performance of artificial intelligence programs may be measured. 10911640 -> 1000009001770: Another variation of the reverse Turing test is implied in the work of psychoanalyst Wilfred Bion, who was particularly fascinated by the "storm" that resulted from the encounter of one mind by another. 10911650 -> 1000009001780: Carrying this idea forward, R. D. Hinshelwood described the mind as a "mind recognizing apparatus", noting that this might be some sort of "supplement" to the Turing test. 10911660 -> 1000009001790: To make this more explicit, the challenge would be for the computer to be able to determine if it were interacting with a human or another computer. 10911670 -> 1000009001800: This is an extension of the original question Turing was attempting to answer, but would, perhaps, be a high enough standard to define a machine that could "think" in a way we typically define as characteristically human. 10911680 -> 1000009001810: Another variation is the Meta Turing test, in which the subject being tested (for example a computer) is classified as intelligent if it has itself created something that it wants to test for intelligence. 10911690 -> 1000009001820: Practical applications 10911700 -> 1000009001830: Stuart J. Russell and Peter Norvig note that "AI researchers have devoted little attention to passing the Turing Test". 10911710 -> 1000009001840: Real Turing tests, such as the Loebner prize, do not usually force programs to demonstrate the full range of intelligence and are reserved for testing chatterbot programs. 10911720 -> 1000009001850: However, even in this limited form these tests are still very rigorous. 10911730 -> 1000009001860: The 2008 Loebner prize, however, is sticking closely to Turing's original concepts - for example, conversations will be limited to 5 minutes. 10911740 -> 1000009001870: CAPTCHA is a form of reverse Turing test. 10911750 -> 1000009001880: Before being allowed to do some action on a website, the user is presented with alphanumerical characters in a distorted graphic image and asked to recognise them. 10911760 -> 1000009001890: This is intended to prevent automated systems from abusing the site.
10911770 -> 1000009001900: The rationale is that software sufficiently sophisticated to read the distorted image accurately does not exist (or is not available to the average user), so any system able to do so is likely to be a human being. 10911780 -> 1000009001910: In popular culture 10911790 -> 1000009001920: In the Dilbert comic strip on Sunday, 30 March 2008, Dilbert says, "The security audit accidentally locked all of the developers out of the system", and his boss responds with only meaningless, tautological thought-terminating clichés, "Well, it is what it is." Dilbert asks, "How does that help?" and his boss responds with another cliché, "You don't know what you don't know." 10911800 -> 1000009001930: Dilbert replies, "Congratulations. 10911810 -> 1000009001940: You're the first human to fail the Turing Test." 10911820 -> 1000009001950: For that day, "turing test" was the 43rd most popular Google search. 10911830 -> 1000009001960: The character of Ghostwheel in Roger Zelazny's The Chronicles of Amber is mentioned as being capable of passing the Turing Test. 10911840 -> 1000009001970: The webcomic xkcd has referred to Turing and the Turing test. 10911850 -> 1000009001980: Rick Deckard, in the movie Blade Runner, used a Turing Test to determine if Rachael was a Replicant. United States 10920010 -> 1000009100020: United States 10920020 -> 1000009100030: The United States of America, usually referred to as the United States, the U.S. or America, is a constitutional federal republic comprising fifty states and a federal district, as well as several territories, or insular areas, scattered around the Caribbean and Pacific. 10920030 -> 1000009100040: The country is situated mostly in central North America, where its forty-eight contiguous states and Washington, D.C., the capital district, lie between the Pacific and Atlantic Oceans, bordered by Canada to the north and Mexico to the south. 10920040 -> 1000009100050: The state of Alaska is in the northwest of the continent, with Canada to its east and Russia to the west across the Bering Strait, and the state of Hawaii is an archipelago in the mid-Pacific. 10920050 -> 1000009100060: At 3.79 million square miles (9.83 million km²) and with more than 300 million people, the United States is the third or fourth largest country by total area, and third largest by land area and by population. 10920060 -> 1000009100070: The United States is one of the world's most ethnically diverse nations, the product of large-scale immigration from many countries. 10920070 -> 1000009100080: The U.S. economy is the largest national economy in the world, with a nominal 2006 gross domestic product (GDP) of more than US$13 trillion (over 25% of the world total based on nominal GDP and almost 20% by purchasing power parity). 10920080 -> 1000009100090: The nation was founded by thirteen colonies of Great Britain located along the Atlantic seaboard. 10920090 -> 1000009100100: Proclaiming themselves "states," they issued the Declaration of Independence on July 4, 1776. 10920100 -> 1000009100110: The rebellious states defeated Great Britain in the American Revolutionary War, the first successful colonial war of independence. 10920110 -> 1000009100120: A federal convention adopted the current United States Constitution on September 17, 1787; its ratification the following year made the states part of a single republic. 10920120 -> 1000009100130: The Bill of Rights, comprising ten constitutional amendments, was ratified in 1791.
10920130 -> 1000009100140: In the nineteenth century, the United States acquired land from France, Spain, the United Kingdom, Mexico, and Russia, and annexed the Republic of Texas and the Republic of Hawaii. 10920140 -> 1000009100150: Disputes between the agrarian South and industrial North over states' rights and the expansion of the institution of slavery provoked the American Civil War of the 1860s. 10920150 -> 1000009100160: The North's victory prevented a permanent split of the country and led to the end of legal slavery in the United States. 10920160 -> 1000009100170: The Spanish-American War and World War I confirmed the nation's status as a military power. 10920170 -> 1000009100180: In 1945, the United States emerged from World War II as the first country with nuclear weapons, a permanent member of the United Nations Security Council, and a founding member of NATO. 10920180 -> 1000009100190: In the post–Cold War era, the United States is the only remaining superpower—accounting for approximately 50% of global military spending—and a dominant economic, political, and cultural force in the world. 10920190 -> 1000009100200: Etymology 10920200 -> 1000009100210: The term America, for the lands of the western hemisphere, was coined in 1507 after Amerigo Vespucci, an Italian explorer and cartographer. 10920210 -> 1000009100220: The full name of the country was first used officially in the Declaration of Independence, which was the "unanimous Declaration of the thirteen united States of America" adopted by the "Representatives of the united States of America" on July 4, 1776. 10920220 -> 1000009100230: The current name was finalized on November 15, 1777, when the Second Continental Congress adopted the Articles of Confederation, the first of which states, "The Stile of this Confederacy shall be 'The United States of America.'" 10920230 -> 1000009100240: Common short forms and abbreviations of the United States of America include the United States, the U.S., the U.S.A., and America. 10920240 -> 1000009100250: Colloquial names for the country include the U.S. of A. and the States. 10920250 -> 1000009100260: Columbia, a once popular name for the Americas and the United States, was derived from Christopher Columbus. 10920260 -> 1000009100270: It appears in the name "District of Columbia". 10920270 -> 1000009100280: A female personification of Columbia appears on some official documents, including certain prints of U.S. currency. 10920280 -> 1000009100290: The standard way to refer to a citizen of the United States is as an American. 10920290 -> 1000009100300: Though United States is the formal adjective, American and U.S. are the most common adjectives used to refer to the country ("American values," "U.S. forces"). 10920300 -> 1000009100310: American is rarely used in English to refer to people not connected to the United States. 10920310 -> 1000009100320: The phrase "the United States" was originally treated as plural—e.g, "the United States are"—including in the Thirteenth Amendment to the Constitution, ratified in 1865. 10920320 -> 1000009100330: However, it became increasingly common to treat the name as singular—e.g., "the United States is"—after the end of the Civil War. 10920330 -> 1000009100340: The singular form is now standard, while the plural form is retained in the set idiom "these United States." 
10920340 -> 1000009100350: Geography 10920350 -> 1000009100360: The United States is situated almost entirely in the western hemisphere: the contiguous United States stretches from the Pacific on the west to the Atlantic on the east, with the Gulf of Mexico to the southeast, and bordered by Canada on the north and Mexico on the south. 10920360 -> 1000009100370: Alaska is the largest state in area; separated from the contiguous U.S. by Canada, it touches the Pacific on the south and Arctic Ocean on the north. 10920370 -> 1000009100380: Hawaii occupies an archipelago in the central Pacific, southwest of North America. 10920380 -> 1000009100390: The United States is the world's third or fourth largest nation by total area, before or after China. 10920390 -> 1000009100400: The ranking varies depending on (a) how two territories disputed by China and India are counted and (b) how the total size of the United States is calculated: the CIA World Factbook gives {(Convert+9826630 km² (3794083 sq mi)+9826630+km2+sqmi+0+abbr=on)}, the United Nations Statistics Division gives {(Convert+9629091 km² (3717813 sq mi)+9629091+km2+sqmi+0+abbr=on)}, and the Encyclopedia Britannica gives {(Convert+9522055 km² (3676486 sq mi)+9522055+km2+sqmi+0+abbr=on)}. 10920400 -> 1000009100410: Including only land area, the United States is third in size behind Russia and China, just ahead of Canada. 10920410 -> 1000009100420: The United States also possesses several insular territories scattered around the West Indies (e.g., the commonwealth of Puerto Rico) and the Pacific (e.g., Guam). 10920420 -> 1000009100430: The coastal plain of the Atlantic seaboard gives way further inland to deciduous forests and the rolling hills of the Piedmont. 10920430 -> 1000009100440: The Appalachian Mountains divide the eastern seaboard from the Great Lakes and the grasslands of the Midwest. 10920440 -> 1000009100450: The Mississippi–Missouri River, the world's fourth longest river system, runs mainly north-south through the heart of the country. 10920450 -> 1000009100460: The flat, fertile prairie land of the Great Plains stretches to the west, interrupted by a highland region along its southeastern portion. 10920460 -> 1000009100470: The Rocky Mountains, at the western edge of the Great Plains, extend north to south across the continental United States, reaching altitudes higher than 14,000 feet (4,300 m) in Colorado. 10920470 -> 1000009100480: The area to the west of the Rocky Mountains is dominated by the rocky Great Basin and deserts such as the Mojave. 10920480 -> 1000009100490: The Sierra Nevada range runs parallel to the Rockies, relatively close to the Pacific coast. 10920490 -> 1000009100500: At 20,320 feet (6,194 m), Alaska's Mount McKinley is the country's tallest peak. 10920500 -> 1000009100510: Active volcanoes are common throughout the Alexander and Aleutian Islands, and the entire state of Hawaii is built upon tropical volcanic islands. 10920510 -> 1000009100520: The supervolcano underlying Yellowstone National Park in the Rockies is the continent's largest volcanic feature. 10920520 -> 1000009100530: Because of the United States' large size and wide range of geographic features, nearly every type of climate is represented. 10920530 -> 1000009100540: The climate is temperate in most areas, tropical in Hawaii and southern Florida, polar in Alaska, semi-arid in the Great Plains west of the 100th meridian, desert in the Southwest, Mediterranean in Coastal California, and arid in the Great Basin. 
10920540 -> 1000009100550: Extreme weather is not uncommon—the states bordering the Gulf of Mexico are prone to hurricanes, and most of the world's tornadoes occur within the continental United States, primarily in the Midwest's Tornado Alley. 10920550 -> 1000009100560: Environment 10920560 -> 1000009100570: U.S. plant life is very diverse; the country has more than 17,000 identified native species of flora. 10920570 -> 1000009100580: More than 400 mammal, 700 bird, 500 reptile and amphibian, and 90,000 insect species have been documented. 10920580 -> 1000009100590: The Endangered Species Act of 1973 protects threatened and endangered species and their habitats, which are monitored by the U.S. Fish and Wildlife Service. 10920590 -> 1000009100600: The U.S. has fifty-eight national parks and hundreds of other federally managed parks, forests, and wilderness areas. 10920600 -> 1000009100610: Altogether, the U.S. government regulates 28.8% of the country's total land area. 10920610 -> 1000009100620: Most such public land comprises protected parks and forestland, though some federal land is leased for oil and gas drilling, mining, or cattle ranching. 10920620 -> 1000009100630: The energy policy of the United States is widely debated; many call on the country to take a leading role in fighting global warming. 10920630 -> 1000009100640: The United States is currently the second largest emitter, after the People's Republic of China, of carbon dioxide from the burning of fossil fuels. 10920640 -> 1000009100650: History 10920650 -> 1000009100660: Native Americans and European settlers 10920660 -> 1000009100670: The indigenous peoples of the U.S. mainland, including Alaska Natives, are thought to have migrated from Asia. 10920670 -> 1000009100680: They began arriving at least 12,000 and as many as 40,000 years ago. 10920680 -> 1000009100690: Several indigenous communities in the pre-Columbian era developed advanced agriculture, grand architecture, and state-level societies. 10920690 -> 1000009100700: In 1492, Genoese explorer Christopher Columbus, under contract to the Spanish crown, reached several Caribbean islands, making first contact with the indigenous population. 10920700 -> 1000009100710: In the years that followed, the majority of the indigenous American peoples were killed by epidemics of Eurasian diseases. 10920710 -> 1000009100720: On April 2, 1513, Spanish conquistador Juan Ponce de León landed on what he called "La Florida"—the first documented European arrival on what would become the U.S. mainland. 10920720 -> 1000009100730: Of the colonies Spain established in the region, only St. Augustine, founded in 1565, remains. 10920730 -> 1000009100740: Later Spanish settlements in the present-day southwestern United States drew thousands through Mexico. 10920740 -> 1000009100750: French fur traders established outposts of New France around the Great Lakes; France eventually claimed much of the North American interior as far south as the Gulf of Mexico. 10920750 -> 1000009100760: The first successful English settlements were the Virginia Colony in Jamestown in 1607 and the Pilgrims' Plymouth Colony in 1620. 10920760 -> 1000009100770: The 1628 chartering of the Massachusetts Bay Colony resulted in a wave of migration; by 1634, New England had been settled by some 10,000 Puritans. 10920770 -> 1000009100780: Between the late 1610s and the American Revolution, an estimated 50,000 convicts were shipped to England's, and later Great Britain's, American colonies. 
10920780 -> 1000009100790: Beginning in 1614, the Dutch established settlements along the lower Hudson River, including New Amsterdam on Manhattan Island. 10920790 -> 1000009100800: The small settlement of New Sweden, founded along the Delaware River in 1638, was taken over by the Dutch in 1655. 10920800 -> 1000009100810: By 1674, English forces had won the former Dutch colonies in the Anglo–Dutch Wars; the province of New Netherland was renamed New York. 10920810 -> 1000009100820: Many new immigrants, especially to the South, were indentured servants—some two-thirds of all Virginia immigrants between 1630 and 1680. 10920820 -> 1000009100830: By the turn of the century, African slaves were becoming the primary source of bonded labor. 10920830 -> 1000009100840: With the 1729 division of the Carolinas and the 1732 colonization of Georgia, the thirteen British colonies that would become the United States of America were established. 10920840 -> 1000009100850: All had active local and colonial governments with elections open to most free men, with a growing devotion to the ancient rights of Englishmen and a sense of self-government that stimulated support for republicanism. 10920850 -> 1000009100860: All had legalized the African slave trade. 10920860 -> 1000009100870: With high birth rates, low death rates, and steady immigration, the colonies doubled in population every twenty-five years. 10920870 -> 1000009100880: The Christian revivalist movement of the 1730s and 1740s known as the Great Awakening fueled interest in both religion and religious liberty. 10920880 -> 1000009100890: In the French and Indian War, British forces seized Canada from the French, but the francophone population remained politically isolated from the southern colonies. 10920890 -> 1000009100900: By 1770, those thirteen colonies had an increasingly Anglicized population of three million, approximately half that of Britain. 10920900 -> 1000009100910: Though subject to British taxation, they were given no representation in the Parliament of Great Britain. 10920910 -> 1000009100920: Independence and expansion 10920920 -> 1000009100930: Tensions between American colonials and the British during the revolutionary period of the 1760s and early 1770s led to the American Revolutionary War, fought from 1775 through 1781. 10920930 -> 1000009100940: On June 14, 1775, the Continental Congress, convening in Philadelphia, established a Continental Army under the command of George Washington. 10920940 -> 1000009100950: Proclaiming that "all men are created equal" and endowed with "certain unalienable Rights," the Congress adopted the Declaration of Independence on July 4, 1776. 10920950 -> 1000009100960: The Declaration, drafted largely by Thomas Jefferson, pronounced the colonies sovereign "states." 10920960 -> 1000009100970: In 1777, the Articles of Confederation were adopted, uniting the states under a weak federal government that operated until 1788. 10920970 -> 1000009100980: Some 70,000–80,000 loyalists to the British Crown fled the rebellious states, many to Nova Scotia and the new British holdings in Canada. 10920980 -> 1000009100990: Native Americans, with divided allegiances, fought on both sides of the war's western front. 10920990 -> 1000009101000: After the defeat of the British army by American forces who were assisted by the French, Great Britain recognized the sovereignty of the thirteen states in 1783.
10921000 -> 1000009101010: A constitutional convention was organized in 1787 by those who wished to establish a strong national government with power over the states. 10921010 -> 1000009101020: By June 1788, nine states had ratified the United States Constitution, sufficient to establish the new government; the republic's first Senate, House of Representatives, and president—George Washington—took office in 1789. 10921020 -> 1000009101030: New York City was the federal capital for a year, before the government relocated to Philadelphia. 10921030 -> 1000009101040: In 1791, the states ratified the Bill of Rights, ten amendments to the Constitution forbidding federal restriction of personal freedoms and guaranteeing a range of legal protections. 10921040 -> 1000009101050: Attitudes toward slavery were shifting; a clause in the Constitution protected the African slave trade only until 1808. 10921050 -> 1000009101060: The Northern states abolished slavery between 1780 and 1804, leaving the slave states of the South as defenders of the "peculiar institution." 10921060 -> 1000009101070: In 1800, the federal government moved to the newly founded Washington, D.C. The Second Great Awakening made evangelicalism a force behind various social reform movements. 10921070 -> 1000009101080: Americans' eagerness to expand westward began a cycle of Indian Wars that stretched to the end of the nineteenth century, as Native Americans were stripped of their land. 10921080 -> 1000009101090: The Louisiana Purchase of French-claimed territory under President Thomas Jefferson in 1803 virtually doubled the nation's size. 10921090 -> 1000009101100: The War of 1812, declared against Britain over various grievances and fought to a draw, strengthened American nationalism. 10921100 -> 1000009101110: A series of U.S. military incursions into Florida led Spain to cede it and other Gulf Coast territory in 1819. 10921110 -> 1000009101120: The country annexed the Republic of Texas in 1845. 10921120 -> 1000009101130: The concept of Manifest Destiny was popularized during this time. 10921130 -> 1000009101140: The 1846 Oregon Treaty with Britain led to U.S. control of the present-day American Northwest. 10921140 -> 1000009101150: The U.S. victory in the Mexican-American War resulted in the 1848 cession of California and much of the present-day American Southwest. 10921150 -> 1000009101160: The California Gold Rush of 1848–49 further spurred western migration. 10921160 -> 1000009101170: New railways made relocation much less arduous for settlers and increased conflicts with Native Americans. 10921170 -> 1000009101180: Over a half-century, up to 40 million American bison, commonly called buffalo, were slaughtered for skins and meat and to ease the railways' spread. 10921180 -> 1000009101190: The loss of the bison, a primary economic resource for the plains Indians, was an existential blow to many native cultures. 10921190 -> 1000009101200: Civil War and industrialization 10921200 -> 1000009101210: Tensions between slave and free states mounted with increasing disagreements over the relationship between the state and federal governments and violent conflicts over the expansion of slavery into new states. Abraham Lincoln, candidate of the largely antislavery Republican Party, was elected president in 1860. 10921210 -> 1000009101220: Before he took office, seven slave states declared their secession from the United States, forming the Confederate States of America. 
10921220 -> 1000009101230: The federal government maintained secession was illegal, and with the Confederate attack upon Fort Sumter, the American Civil War began and four more slave states joined the Confederacy. 10921230 -> 1000009101240: The Union freed Confederate slaves as its army advanced through the South. 10921240 -> 1000009101250: Following the Union victory in 1865, three amendments to the U.S. Constitution ensured freedom for the nearly four million African Americans who had been slaves, made them citizens, and gave them voting rights. 10921250 -> 1000009101260: The war and its resolution led to a substantial increase in federal power. 10921260 -> 1000009101270: After the war, the assassination of President Lincoln radicalized Republican Reconstruction policies aimed at reintegrating and rebuilding the Southern states while ensuring the rights of the newly freed slaves. 10921270 -> 1000009101280: The resolution of the disputed 1876 presidential election by the Compromise of 1877 ended Reconstruction; Jim Crow laws soon disenfranchised many African Americans. 10921280 -> 1000009101290: In the North, urbanization and an unprecedented influx of immigrants hastened the country's industrialization. 10921290 -> 1000009101300: The wave of immigration, which lasted until 1929, provided labor for U.S. businesses and transformed American culture. 10921300 -> 1000009101310: High tariff protections, national infrastructure building, and new banking regulations encouraged industrial growth. 10921310 -> 1000009101320: The 1867 Alaska purchase from Russia completed the country's mainland expansion. 10921320 -> 1000009101330: The Wounded Knee massacre in 1890 was the last major armed conflict of the Indian Wars. 10921330 -> 1000009101340: In 1893, the indigenous monarchy of the Pacific Kingdom of Hawaii was overthrown in a coup led by American residents; the archipelago was annexed by the United States in 1898. 10921340 -> 1000009101350: Victory in the Spanish-American War that same year demonstrated that the United States was a major world power and resulted in the annexation of Puerto Rico and the Philippines. 10921350 -> 1000009101360: The Philippines gained independence a half-century later; Puerto Rico remains a commonwealth of the United States. 10921360 -> 1000009101370: World War I, Great Depression, and World War II 10921370 -> 1000009101380: At the outbreak of World War I in 1914, the United States remained neutral. 10921380 -> 1000009101390: Americans sympathized with the British and French, although many citizens, mostly Irish and German, opposed intervention. 10921390 -> 1000009101400: In 1917, the United States joined the Allies, turning the tide against the Central Powers. 10921400 -> 1000009101410: Reluctant to be involved in European affairs, the Senate did not ratify the Treaty of Versailles, which established the League of Nations. 10921410 -> 1000009101420: The country pursued a policy of unilateralism, verging on isolationism. 10921420 -> 1000009101430: In 1920, the women's rights movement won passage of a constitutional amendment granting women's suffrage. 10921430 -> 1000009101440: Partly because of the service of many in the war, Native Americans gained U.S. citizenship in the Indian Citizenship Act of 1924. 10921440 -> 1000009101450: During most of the 1920s, the United States enjoyed a period of unbalanced prosperity as farm profits fell while industrial profits grew. 
10921450 -> 1000009101460: A rise in debt and an inflated stock market culminated in the 1929 crash that triggered the Great Depression. 10921460 -> 1000009101470: After his election as president in 1932, Franklin D. Roosevelt responded with the New Deal, a range of policies increasing government intervention in the economy. 10921470 -> 1000009101480: The Dust Bowl of the mid-1930s impoverished many farming communities and spurred a new wave of western migration. 10921480 -> 1000009101490: The nation would not fully recover from the economic depression until the industrial mobilization spurred by its entrance into World War II. 10921490 -> 1000009101500: The United States, effectively neutral during the war's early stages after the Nazi invasion of Poland in September 1939, began supplying materiel to the Allies in March 1941 through the Lend-Lease program. 10921500 -> 1000009101510: On December 7, 1941, the United States joined the Allies against the Axis powers after a surprise attack on Pearl Harbor by Japan. 10921510 -> 1000009101520: World War II cost far more money than any other war in American history, but it boosted the economy by providing capital investment and jobs, while bringing many women into the labor market. 10921520 -> 1000009101530: Among the major combatants, the United States was the only nation to become richer—indeed, far richer—instead of poorer because of the war. 10921530 -> 1000009101540: Allied conferences at Bretton Woods and Yalta outlined a new system of international organizations that placed the United States and Soviet Union at the center of world affairs. 10921540 -> 1000009101550: As victory was achieved in Europe, a 1945 international conference held in San Francisco produced the United Nations Charter, which became active after the war. 10921550 -> 1000009101560: The United States, having developed the first nuclear weapons, used them on the Japanese cities of Hiroshima and Nagasaki in August. 10921560 -> 1000009101570: Japan surrendered on September 2, ending the war. 10921570 -> 1000009101580: Cold War and civil rights 10921580 -> 1000009101590: The United States and Soviet Union jockeyed for power after World War II during the Cold War, dominating the military affairs of Europe through NATO and the Warsaw Pact. 10921590 -> 1000009101600: The United States promoted liberal democracy and capitalism, while the Soviet Union promoted communism and a centrally planned economy. 10921600 -> 1000009101610: Both the United States and the Soviet Union supported dictatorships, and both engaged in proxy wars. 10921610 -> 1000009101620: United States troops fought Communist Chinese forces in the Korean War of 1950–53. 10921620 -> 1000009101630: The House Un-American Activities Committee pursued a series of investigations into suspected leftist subversion, while Senator Joseph McCarthy became the figurehead of anticommunist sentiment. 10921630 -> 1000009101640: The Soviet Union launched the first manned spacecraft in 1961, prompting U.S. efforts to raise proficiency in mathematics and science and President John F. Kennedy's call for the country to be first to land "a man on the moon," achieved in 1969. 10921640 -> 1000009101650: Kennedy also faced a tense nuclear showdown with Soviet forces in Cuba. 10921650 -> 1000009101660: Meanwhile, America experienced sustained economic expansion. 
10921660 -> 1000009101670: A growing civil rights movement headed by prominent African Americans, such as Martin Luther King, Jr., fought segregation and discrimination, leading to the abolition of Jim Crow laws. 10921670 -> 1000009101680: Following Kennedy's assassination in 1963, the Civil Rights Act of 1964 was passed under President Lyndon B. Johnson. 10921680 -> 1000009101690: Johnson and his successor, Richard Nixon, expanded a proxy war in Southeast Asia into the unsuccessful Vietnam War. 10921690 -> 1000009101700: As a result of the Watergate scandal, in 1974 Nixon became the first U.S. president to resign rather than face impeachment on charges including obstruction of justice and abuse of power; he was succeeded by Vice President Gerald Ford. 10921700 -> 1000009101710: During the Jimmy Carter administration in the late 1970s, the U.S. economy experienced stagflation. 10921710 -> 1000009101720: The election of Ronald Reagan as president in 1980 marked a significant rightward shift in American politics, reflected in major changes in taxation and spending priorities. 10921720 -> 1000009101730: In the late 1980s and 1990s, the Soviet Union's power diminished, leading to its collapse and effectively ending the Cold War. 10921730 -> 1000009101740: Contemporary era 10921740 -> 1000009101750: The leadership role taken by the United States and its allies in the United Nations–sanctioned Gulf War, under President George H. W. Bush, and later the Yugoslav wars helped to preserve the country's position as the world's last remaining superpower. 10921750 -> 1000009101760: The longest economic expansion in modern U.S. history—from March 1991 to March 2001—encompassed the administrations of Presidents George H.W. Bush, Bill Clinton, and George W. Bush. 10921760 -> 1000009101770: In 1998, Clinton was impeached by the House on charges relating to a civil lawsuit and a sexual scandal, but he was acquitted by the Senate and remained in office. 10921770 -> 1000009101780: The 1990s also saw a rise in Islamic terrorism against Americans by al-Qaeda and other groups, including an attack on the World Trade Center in 1993, an attack on U.S. forces in Somalia, the 1996 Khobar Towers bombing, the 1998 United States embassy bombings in Tanzania and Kenya, the 2000 millennium attack plots, and the USS Cole bombing in Yemen in October 2000. 10921780 -> 1000009101790: In Iraq, the regime of Saddam Hussein proved a continuing problem for the UN and its neighbors, prompting a variety of UN sanctions, Anglo-American patrolling of Iraqi no-fly zones, Operation Desert Fox, and the Iraq Liberation Act of 1998, which called for the removal of the Hussein regime and its replacement by a democratic system. 10921790 -> 1000009101800: The presidential election of 2000 was one of the closest in U.S. history and saw George W. Bush become President of the United States. 10921800 -> 1000009101810: On September 11, 2001, al-Qaeda terrorists struck the World Trade Center in New York City and the Pentagon near Washington, D.C., killing nearly three thousand people. 10921810 -> 1000009101820: In the aftermath, President Bush urged support from the international community for what was dubbed the War on Terrorism. 10921820 -> 1000009101830: In late 2001, U.S. forces launched Operation Enduring Freedom, removing the Taliban government and al-Qaeda training camps. 10921830 -> 1000009101840: Taliban insurgents continue to fight a guerrilla war against a NATO-led force.
10921840 -> 1000009101850: Controversies arose regarding the conduct of the War on Terror. 10921850 -> 1000009101860: Using language from the 1998 Iraq Liberation Act and the Clinton Administration, in 2002 the Bush Administration began to press for regime change in Iraq. 10921860 -> 1000009101870: With broad bipartisan support in the U.S. Congress, Bush formed an international Coalition of the Willing and in March 2003 ordered Operation Iraqi Freedom, removing Saddam Hussein from power. 10921870 -> 1000009101880: Although facing pressure to withdraw, the U.S.-led coalition maintains a presence in Iraq and continues to train and mentor a new Iraqi military as well as lead economic and infrastructure development. 10921880 -> 1000009101890: In the upcoming 2008 presidential election, the Republican Party candidate, four-term Senator John McCain of Arizona – a former U.S. prisoner of war who served in the Vietnam War – will face the Democratic Party candidate, freshman Senator Barack Obama of Illinois, the first African American to head a major political party's presidential ticket. 10921890 -> 1000009101900: Government and elections 10921900 -> 1000009101910: The United States is the world's oldest surviving federation. 10921910 -> 1000009101920: It is a constitutional republic, "in which majority rule is tempered by minority rights protected by law." 10921920 -> 1000009101930: It is fundamentally structured as a representative democracy, though U.S. citizens residing in the territories are excluded from voting for federal officials. 10921930 -> 1000009101940: The government is regulated by a system of checks and balances defined by the United States Constitution, which serves as the country's supreme legal document and as a social contract for the people of the United States. 10921940 -> 1000009101950: In the American federalist system, citizens are usually subject to three levels of government: federal, state, and local; the local government's duties are commonly split between county and municipal governments. 10921950 -> 1000009101960: In almost all cases, executive and legislative officials are elected by a plurality vote of citizens by district. 10921960 -> 1000009101970: There is no proportional representation at the federal level, and it is very rare at lower levels. 10921970 -> 1000009101980: Federal and state judicial and cabinet officials are typically nominated by the executive branch and approved by the legislature, although some state judges and officials are elected by popular vote. 10921980 -> 1000009101990: The federal government is composed of three branches: 10921990 -> 1000009102000: Legislative: The bicameral Congress, made up of the Senate and the House of Representatives, makes federal law, declares war, approves treaties, has the power of the purse, and has the power of impeachment, by which it can remove sitting members of the government. 10922000 -> 1000009102010: Executive: The president is the commander-in-chief of the military, can veto legislative bills before they become law, and appoints the Cabinet and other officers, who administer and enforce federal laws and policies. 10922010 -> 1000009102020: Judicial: The Supreme Court and lower federal courts, whose judges are appointed by the president with Senate approval, interpret laws and can overturn laws they deem unconstitutional. 10922020 -> 1000009102030: The House of Representatives has 435 members, each representing a congressional district for a two-year term.
10922030 -> 1000009102040: House seats are apportioned among the fifty states by population every tenth year. 10922040 -> 1000009102050: As of the 2000 census, seven states have the minimum of one representative, while California, the most populous state, has fifty-three. 10922050 -> 1000009102060: Each state has two senators, elected at-large to six-year terms; one third of Senate seats are up for election every second year. 10922060 -> 1000009102070: The president serves a four-year term and may be elected to the office no more than twice. 10922070 -> 1000009102080: The president is not elected by direct vote, but by an indirect electoral college system in which the determining votes are apportioned by state. 10922080 -> 1000009102090: The Supreme Court, led by the Chief Justice of the United States, has nine members, who serve for life. 10922090 -> 1000009102100: All laws and procedures of both state and federal governments are subject to review, and any law ruled in violation of the Constitution by the judicial branch is overturned. 10922100 -> 1000009102110: The original text of the Constitution establishes the structure and responsibilities of the federal government, the relationship between it and the individual states, and essential matters of military and economic authority. 10922110 -> 1000009102120: Article One protects the right to the "great writ" of habeas corpus, and Article Three guarantees the right to a jury trial in all criminal cases. 10922120 -> 1000009102130: Amendments to the Constitution require the approval of three-fourths of the states. The Constitution has been amended twenty-seven times; the first ten amendments, which make up the Bill of Rights, and the Fourteenth Amendment form the central basis of individual rights in the United States. 10922130 -> 1000009102140: Parties and politics 10922140 -> 1000009102150: Politics in the United States have operated under a two-party system for virtually all of the country's history. 10922150 -> 1000009102160: For elective offices at all levels, state-administered primary elections are held to choose the major party nominees for subsequent general elections. 10922160 -> 1000009102170: Since the general election of 1856, the two dominant parties have been the Democratic Party, founded in 1824 (though its roots trace back to 1792), and the Republican Party, founded in 1854. 10922170 -> 1000009102180: Since the Civil War, only one third-party presidential candidate—former president Theodore Roosevelt, running as a Progressive in 1912—has won as much as 20% of the popular vote. 10922180 -> 1000009102190: The incumbent president, Republican George W. Bush, is the 43rd president in the country's history. 10922190 -> 1000009102200: All U.S. presidents to date have been white men. 10922200 -> 1000009102210: If Democrat Barack Obama wins the forthcoming presidential election, he will become the first African-American president. 10922210 -> 1000009102220: Following the 2006 midterm elections, the Democratic Party controls both the House and the Senate. 10922220 -> 1000009102230: Every member of the U.S. Congress is a Democrat or a Republican except two independent members of the Senate—one a former Democratic incumbent, the other a self-described socialist. 10922230 -> 1000009102240: An overwhelming majority of state and local officials are also either Democrats or Republicans. 
10922240 -> 1000009102250: Within American political culture, the Republican Party is considered "center-right" or conservative and the Democratic Party is considered "center-left" or liberal, but members of both parties have a wide range of views. 10922250 -> 1000009102260: In a May 2008 poll, 44% of Americans described themselves as "conservative," 27% as "moderate," and 21% as "liberal." 10922260 -> 1000009102270: On the other hand, that same month a plurality of adults, 41.7%, identified as Democrats, 31.6% as Republicans, and 26.6% as independents. 10922270 -> 1000009102280: The states of the Northeast and West Coast and some of the Great Lakes states are relatively liberal-leaning—they are known in political parlance as "blue states." 10922280 -> 1000009102290: The "red states" of the South and the Rocky Mountains lean conservative. 10922290 -> 1000009102300: States 10922300 -> 1000009102310: The United States is a federal union of fifty states. 10922310 -> 1000009102320: The original thirteen states were the successors of the thirteen colonies that rebelled against British rule. 10922320 -> 1000009102330: Most of the rest have been carved from territory obtained through war or purchase by the U.S. government. 10922330 -> 1000009102340: The exceptions are Vermont, Texas, and Hawaii; each was an independent republic before joining the union. 10922340 -> 1000009102350: Early in the country's history, three states were created out of the territory of existing ones: Kentucky from Virginia; Tennessee from North Carolina; and Maine from Massachusetts. 10922350 -> 1000009102360: West Virginia broke away from Virginia during the American Civil War. 10922360 -> 1000009102370: The most recent state—Hawaii—achieved statehood on August 21, 1959. 10922370 -> 1000009102380: The U.S. Supreme Court has ruled that the states do not have the right to secede from the union. 10922380 -> 1000009102390: The states compose the vast bulk of the U.S. land mass; the only other areas considered integral parts of the country are the District of Columbia, the federal district where the capital, Washington, is located; and Palmyra Atoll, an uninhabited but incorporated territory in the Pacific Ocean. 10922390 -> 1000009102400: The United States possesses five major territories with indigenous populations: Puerto Rico and the United States Virgin Islands in the Caribbean; and American Samoa, Guam, and the Northern Mariana Islands in the Pacific. 10922400 -> 1000009102410: Those born in the territories (except for American Samoa) possess U.S. citizenship. 10922410 -> 1000009102420: Foreign relations and military 10922420 -> 1000009102430: The United States has vast economic, political, and military influence on a global scale, which makes its foreign policy a subject of great interest around the world. 10922430 -> 1000009102440: Almost all countries have embassies in Washington, D.C., and many host consulates around the country. 10922440 -> 1000009102450: Likewise, nearly all nations host American diplomatic missions. 10922450 -> 1000009102460: However, Cuba, Iran, North Korea, Bhutan, Sudan, and the Republic of China (Taiwan) do not have formal diplomatic relations with the United States. 10922460 -> 1000009102470: American isolationists have often been at odds with internationalists, as anti-imperialists have been with promoters of Manifest Destiny and American Empire. 10922470 -> 1000009102480: American imperialism in the Philippines drew sharp rebukes from Mark Twain, philosopher William James, and many others. 
10922480 -> 1000009102490: Later, President Woodrow Wilson played a key role in creating the League of Nations, but the Senate prohibited American membership in it. 10922490 -> 1000009102500: Isolationism became a thing of the past when the United States took a lead role in founding the United Nations, becoming a permanent member of the Security Council and host to the United Nations Headquarters. 10922500 -> 1000009102510: The United States enjoys a special relationship with the United Kingdom and strong ties with Australia, New Zealand, Japan, Israel, and fellow NATO members. 10922510 -> 1000009102520: It also works closely with its neighbors through the Organization of American States and free trade agreements such as the trilateral North American Free Trade Agreement with Canada and Mexico. 10922520 -> 1000009102530: In 2005, the United States spent $27.3 billion on official development assistance, the most in the world; however, as a share of gross national income (GNI), the U.S. contribution of 0.22% ranked twentieth of twenty-two donor states. 10922530 -> 1000009102540: On the other hand, nongovernmental sources such as private foundations, corporations, and educational and religious institutions donated $95.5 billion. 10922540 -> 1000009102550: The total of $122.8 billion is again the most in the world and seventh in terms of GNI percentage. 10922550 -> 1000009102560: The president holds the title of commander-in-chief of the nation's armed forces and appoints its leaders, the secretary of defense and the Joint Chiefs of Staff. 10922560 -> 1000009102570: The United States Department of Defense administers the armed forces, including the Army, the Navy, the Marine Corps, and the Air Force. 10922570 -> 1000009102580: The Coast Guard falls under the jurisdiction of the Department of Homeland Security in peacetime and the Department of the Navy in times of war. 10922580 -> 1000009102590: In 2005, the military had 1.38 million personnel on active duty, along with several hundred thousand each in the Reserves and the National Guard for a total of 2.3 million troops. 10922590 -> 1000009102600: The Department of Defense also employs approximately 700,000 civilians, disregarding contractors. 10922600 -> 1000009102610: Military service is voluntary, though conscription may occur in wartime through the Selective Service System. 10922610 -> 1000009102620: The rapid deployment of American forces is facilitated by the Air Force's large fleet of transportation aircraft and aerial refueling tankers, the Navy's fleet of eleven active aircraft carriers, and Marine Expeditionary Units at sea in the Navy's Atlantic and Pacific fleets. 10922620 -> 1000009102630: Outside of the American homeland, the U.S. military is deployed to 770 bases and facilities, on every continent except Antarctica. 10922630 -> 1000009102640: Because of the extent of its global military presence, scholars describe the United States as maintaining an "empire of bases." 10922640 -> 1000009102650: Total U.S. military spending in 2006, over $528 billion, was 46% of the entire military spending in the world and greater than the next fourteen largest national military expenditures combined. 10922650 -> 1000009102660: (In purchasing power parity terms, it was larger than the next six such expenditures combined.) 10922660 -> 1000009102670: The per capita spending of $1,756 was approximately ten times the world average. 10922670 -> 1000009102680: At 4.06% of GDP, U.S. military spending is ranked 27th out of 172 nations. 
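The military spending figures above can be sanity-checked with simple arithmetic. The short Python sketch below recomputes the per-capita and world-average numbers from the quoted 2006 totals; the population values it uses (roughly 300 million for the United States and 6.5 billion for the world) are illustrative assumptions added here and do not come from this article.

    # Back-of-the-envelope check of the 2006 military spending figures quoted above.
    us_spending = 528e9          # total U.S. military spending, USD (from the text)
    us_share_of_world = 0.46     # U.S. share of world military spending (from the text)
    us_population = 300e6        # assumed U.S. population, 2006
    world_population = 6.5e9     # assumed world population, 2006

    us_per_capita = us_spending / us_population        # about $1,760 (text: $1,756)
    world_total = us_spending / us_share_of_world      # about $1.15 trillion worldwide
    world_per_capita = world_total / world_population  # about $177
    print(round(us_per_capita), round(world_per_capita), round(us_per_capita / world_per_capita, 1))

Run as written, this prints roughly 1760, 177, and 10.0, consistent with the statement that U.S. per-capita spending was approximately ten times the world average.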
10922680 -> 1000009102690: The proposed base Department of Defense budget for 2009, $515.4 billion, is a 7% increase over 2008 and a nearly 74% increase over 2001. 10922690 -> 1000009102700: The estimated total cost of the Iraq War to the United States through 2016 is $2.267 trillion. 10922700 -> 1000009102710: As of June 6, 2008, the United States had suffered 4,092 military fatalities during the war and nearly 30,000 wounded. 10922710 -> 1000009102720: Economy 10922720 -> 1000009102730: The United States has a capitalist mixed economy, which is fueled by abundant natural resources, a well-developed infrastructure, and high productivity. 10922730 -> 1000009102740: According to the International Monetary Fund, the United States GDP of more than $13 trillion constitutes over 25.5% of the gross world product at market exchange rates and over 19% of the gross world product at purchasing power parity (PPP). 10922740 -> 1000009102750: The largest national GDP in the world, it was slightly less than the combined GDP of the European Union at PPP in 2006. 10922750 -> 1000009102760: The country ranks eighth in the world in nominal GDP per capita and fourth in GDP per capita at PPP. 10922760 -> 1000009102770: The United States is the largest importer of goods and third largest exporter, though exports per capita are relatively low. 10922770 -> 1000009102780: Canada, China, Mexico, Japan, and Germany are its top trading partners. 10922780 -> 1000009102790: The leading export commodity is electrical machinery, while vehicles constitute the leading import. 10922790 -> 1000009102800: The private sector constitutes the bulk of the economy, with government activity accounting for 12.4% of GDP. 10922800 -> 1000009102810: The economy is postindustrial, with the service sector contributing 67.8% of GDP. 10922810 -> 1000009102820: The leading business field by gross business receipts is wholesale and retail trade; by net income it is finance and insurance. 10922820 -> 1000009102830: The United States remains an industrial power, with chemical products the leading manufacturing field. 10922830 -> 1000009102840: The United States is the third largest producer of oil in the world. 10922840 -> 1000009102850: It is the world's number one producer of electrical and nuclear energy, as well as liquid natural gas, sulfur, phosphates, and salt. 10922850 -> 1000009102860: While agriculture accounts for just under 1% of GDP, the United States is the world's top producer of corn and soybeans. 10922860 -> 1000009102870: The country's leading cash crop is marijuana, despite federal laws making its cultivation and sale illegal. 10922870 -> 1000009102880: The New York Stock Exchange is the world's largest by dollar volume. 10922880 -> 1000009102890: Coca-Cola and McDonald's are the two most recognized brands in the world. 10922890 -> 1000009102900: In 2005, 155 million persons were employed with earnings, of whom 80% worked in full-time jobs. 10922900 -> 1000009102910: The majority, 79%, were employed in the service sector. 10922910 -> 1000009102920: With approximately 15.5 million people, health care and social assistance is the leading field of employment. 10922920 -> 1000009102930: About 12% of American workers are unionized, compared to 30% in Western Europe. 10922930 -> 1000009102940: The U.S. ranks number one in the ease of hiring and firing workers, according to the World Bank. 10922940 -> 1000009102950: Between 1973 and 2003, a year's work for the average American grew by 199 hours. 
10922950 -> 1000009102960: Partly as a result, the United States maintains the highest labor productivity in the world. 10922960 -> 1000009102970: However, it no longer leads the world in productivity per hour as it did from the 1950s through the early 1990s; workers in Norway, France, Belgium, and Luxembourg are now more productive per hour. 10922970 -> 1000009102980: The United States ranks third in the World Bank's Ease of Doing Business Index. 10922980 -> 1000009102990: Compared to Europe, U.S. property and corporate income taxes are generally higher, while labor and, particularly, consumption taxes are lower. 10922990 -> 1000009103000: Income and human development 10923000 -> 1000009103010: According to the Census Bureau, the pretax median household income in 2006 was $48,201. 10923010 -> 1000009103020: The two-year average ranged from $66,752 in New Jersey to $34,343 in Mississippi. 10923020 -> 1000009103030: Using purchasing power parity exchange rates, the overall median is similar to the most affluent cluster of developed nations. 10923030 -> 1000009103040: After having declined sharply throughout the mid-20th century, poverty rates have plateaued since the early 1970s, with roughly 12.3% or 13.3% of Americans below the federally designated poverty line in any given year. 10923040 -> 1000009103050: Owing to lackluster expansion since the late 1970s, the U.S. welfare state is now among the most austere in the developed world, reducing relative poverty by roughly 30% and absolute poverty by roughly 40%, considerably less than the mean for rich nations. 10923050 -> 1000009103060: While the American welfare state performs well in reducing poverty among the elderly, from an estimated 50% to 10%, it lacks extensive programs geared towards the well-being of the young. 10923060 -> 1000009103070: A 2007 UNICEF study of children's well-being in twenty-one industrialized nations, covering a broad range of factors, ranked the U.S. next to last. 10923070 -> 1000009103080: Between 1947 and 1979, real median income rose by over 80% for all classes, more so for the poor than the rich. 10923080 -> 1000009103090: While median household income has increased for all classes since 1980, largely owing to more dual-earner households, the closing of the gender gap, and longer work hours, growth has been slower and strongly tilted towards the very top. 10923090 -> 1000009103100: As a result, the share of income of the top 1% has doubled since 1979, leaving the U.S. with the highest level of income inequality among developed nations. 10923100 -> 1000009103110: While some economists do not see inequality as a considerable problem, most see it as a problem requiring government action. 10923110 -> 1000009103120: Wealth is highly concentrated: The richest 10% of the adult population possesses 69.8% of the country's household wealth, the second-highest share of any democratic developed nation. 10923120 -> 1000009103130: The top 1% possesses 33.4% of net wealth. 10923130 -> 1000009103140: Science and technology 10923140 -> 1000009103150: The United States has been a leader in scientific research and technological innovation since the late nineteenth century. 10923150 -> 1000009103160: In 1876, Alexander Graham Bell was awarded the first U.S. patent for the telephone. 10923160 -> 1000009103170: The laboratory of Thomas Edison developed the phonograph, the first long-lasting light bulb, and the first viable movie camera.
10923170 -> 1000009103180: In the early twentieth century, the automobile companies of Ransom E. Olds and Henry Ford pioneered assembly line manufacturing. 10923180 -> 1000009103190: The Wright brothers, in 1903, made what is recognized as the "first sustained and controlled heavier-than-air powered flight." 10923190 -> 1000009103200: The rise of Nazism in the 1930s led many important European scientists, including Albert Einstein and Enrico Fermi, to immigrate to the United States. 10923200 -> 1000009103210: During World War II, the U.S.-based Manhattan Project developed nuclear weapons, ushering in the Atomic Age. 10923210 -> 1000009103220: The Space Race produced rapid advances in rocketry, materials science, and computers. 10923220 -> 1000009103230: The United States largely developed the ARPANET and its successor, the Internet. 10923230 -> 1000009103240: Today, the bulk of research and development funding, 64%, comes from the private sector. 10923240 -> 1000009103250: The United States leads the world in scientific research papers and impact factor. 10923250 -> 1000009103260: Americans enjoy high levels of access to technological consumer goods, and almost half of U.S. households have broadband Internet service. 10923260 -> 1000009103270: The country is the primary developer and grower of genetically modified food; more than half of the world's land planted with biotech crops is in the United States. 10923270 -> 1000009103280: Transportation 10923280 -> 1000009103290: As of 2003, there were 759 automobiles per 1,000 Americans, compared to 472 per 1,000 inhabitants of the European Union the following year. 10923290 -> 1000009103300: Approximately 39% of personal vehicles are vans, SUVs, or light trucks. 10923300 -> 1000009103310: The average American adult (accounting for all drivers and nondrivers) spends 55 minutes behind the wheel every day, driving {(Convert+29 miles (47 km)+29+mi+km+0)}. 10923310 -> 1000009103320: The U.S. intercity passenger rail system is relatively weak. 10923320 -> 1000009103330: Only 9% of total U.S. work trips employ mass transit, compared to 38.8% in Europe. 10923330 -> 1000009103340: Bicycle usage is minimal, well below European levels. 10923340 -> 1000009103350: The civil airline industry is entirely privatized, while most major airports are publicly owned. 10923350 -> 1000009103360: The five largest airlines in the world by passengers carried are all American; American Airlines is number one. 10923360 -> 1000009103370: Of the world's thirty busiest passenger airports, sixteen are in the United States, including the busiest, Hartsfield-Jackson Atlanta International Airport (ATL). 10923370 -> 1000009103380: Energy 10923380 -> 1000009103390: The United States energy market is 29,000 terawatt hours per year. 10923390 -> 1000009103400: Energy consumption per capita is 7.8 tons of oil equivalent per year, compared to Germany's 4.2 tons and Canada's 8.3 tons. 10923400 -> 1000009103410: In 2005, 40% of the nation's energy came from petroleum, 23% from coal, and 22% from natural gas. 10923410 -> 1000009103420: The remainder was supplied by nuclear power and various renewable energy sources. 10923420 -> 1000009103430: The United States is the world's largest consumer of petroleum. 10923430 -> 1000009103440: For decades, nuclear power has played a limited role relative to many other developed countries. 10923440 -> 1000009103450: Recently, applications for new nuclear plants have been filed. 
10923450 -> 1000009103460: Demographics 10923460 -> 1000009103470: As of 2008, the United States population was estimated by the U.S. Census Bureau to be 304,516,000. 10923470 -> 1000009103480: The U.S. population included an estimated 12 million unauthorized migrants, of whom an estimated 1 million were uncounted by the Census Bureau. 10923480 -> 1000009103490: The overall growth rate is 0.89%, compared to 0.16% in the European Union. 10923490 -> 1000009103500: The birth rate of 14.16 per 1,000 is 30% below the world average, while higher than any European country except for Albania and Ireland. 10923500 -> 1000009103510: In 2006, 1.27 million immigrants were granted legal residence. 10923510 -> 1000009103520: Mexico has been the leading source of new U.S. residents for over two decades; since 1998, China, India, and the Philippines have been in the top four sending countries every year. 10923520 -> 1000009103530: The United States is the only industrialized nation in which large population increases are projected. 10923530 -> 1000009103540: The United States has a very diverse population—thirty-one ancestry groups have more than a million members. 10923540 -> 1000009103550: Whites are the largest racial group, with German Americans, Irish Americans, and English Americans constituting three of the country's four largest ancestry groups. 10923550 -> 1000009103560: African Americans constitute the nation's largest racial minority and third largest ancestry group. 10923560 -> 1000009103570: Asian Americans are the country's second largest racial minority; the two largest Asian American ancestry groups are Chinese and Filipino. 10923570 -> 1000009103580: In 2006, the U.S. population included an estimated 4.5 million people with some American Indian or Alaskan native ancestry (2.9 million exclusively of such ancestry) and over 1 million with some native Hawaiian or Pacific island ancestry (0.5 million exclusively). 10923580 -> 1000009103590: The population growth of Hispanic and Latino Americans has been a major demographic trend. 10923590 -> 1000009103600: Approximately 44 million Americans are of Hispanic descent, with about 64% possessing Mexican ancestry. 10923600 -> 1000009103610: Between 2000 and 2006, the country's Hispanic population increased 25.5% while the non-Hispanic population rose just 3.5%. 10923610 -> 1000009103620: Much of this growth is from immigration; as of 2004, 12% of the U.S. population was foreign-born, over half that number from Latin America. 10923620 -> 1000009103630: Fertility is also a factor; the average Hispanic woman gives birth to three children in her lifetime. 10923630 -> 1000009103640: The comparable fertility rate is 2.2 for non-Hispanic black women and 1.8 for non-Hispanic white women (below the replacement rate of 2.1). 10923640 -> 1000009103650: Hispanics and Latinos accounted for nearly half of the national population growth of 2.9 million between July 2005 and July 2006. 10923650 -> 1000009103660: About 83% of the population lives in one of the country's 363 metropolitan areas. 10923660 -> 1000009103670: In 2006, 254 incorporated places in the United States had populations over 100,000, nine cities had more than 1 million residents, and four global cities had over 2 million (New York City, Los Angeles, Chicago, and Houston). 10923670 -> 1000009103680: The United States has fifty metropolitan areas with populations greater than 1 million. 
10923680 -> 1000009103690: Of the fifty fastest-growing metro areas, twenty-three are in the West and twenty-five in the South. 10923690 -> 1000009103700: Among the country's twenty most populous metro areas, those of Dallas (the fourth largest), Houston (sixth), and Atlanta (ninth) saw the largest numerical gains between 2000 and 2006, while that of Phoenix (thirteenth) grew the largest in percentage terms. 10923700 -> 1000009103710: Language 10923710 -> 1000009103720: English is the de facto national language. 10923720 -> 1000009103730: Although there is no official language at the federal level, some laws—such as U.S. naturalization requirements—standardize English. 10923730 -> 1000009103740: In 2003, about 215 million, or 82% of the population aged five years and older, spoke only English at home. 10923740 -> 1000009103750: Spanish, spoken by over 10% of the population at home, is the second most common language and the most widely taught foreign language. 10923750 -> 1000009103760: Some Americans advocate making English the country's official language, as it is in at least twenty-eight states. 10923760 -> 1000009103770: Both Hawaiian and English are official languages in Hawaii by state law. 10923770 -> 1000009103780: While neither has an official language, New Mexico has laws providing for the use of both English and Spanish, as Louisiana does for English and French. 10923780 -> 1000009103790: Other states, such as California, mandate the publication of Spanish versions of certain government documents, including court forms. 10923790 -> 1000009103800: Several insular territories grant official recognition to their native languages, along with English: Samoan and Chamorro are recognized by American Samoa and Guam, respectively; Carolinian and Chamorro are recognized by the Northern Mariana Islands; Spanish is an official language of Puerto Rico. 10923800 -> 1000009103810: Religion 10923810 -> 1000009103820: The United States government does not audit Americans' religious beliefs. 10923820 -> 1000009103830: In a private survey conducted in 2001, 76.5% of American adults identified themselves as Christian, down from 86.4% in 1990. 10923830 -> 1000009103840: Protestant denominations accounted for 52% of adult Americans, while Roman Catholics, at 24.5%, were the largest individual denomination. 10923840 -> 1000009103850: A different study describes white evangelicals, 26.3% of the population, as the country's largest religious cohort; evangelicals of all races are estimated at 30–35%. 10923850 -> 1000009103860: The share of adults reporting non-Christian religions in 2001 was 3.7%, up from 3.3% in 1990. 10923860 -> 1000009103870: The leading non-Christian faiths were Judaism (1.4%), Islam (0.5%), Buddhism (0.5%), Hinduism (0.4%), and Unitarian Universalism (0.3%). 10923870 -> 1000009103880: Between 1990 and 2001, the number of Muslims and Buddhists more than doubled. 10923880 -> 1000009103890: The share describing themselves as agnostic, atheist, or simply having no religion rose from 8.2% in 1990 to 14.1% in 2001, still significantly lower than in other postindustrial countries such as Britain (44% in 2005) and Sweden (69% in 2001, 85% in 2005). 10923890 -> 1000009103900: Education 10923900 -> 1000009103910: American public education is operated by state and local governments, regulated by the United States Department of Education through restrictions on federal grants.
10923910 -> 1000009103920: Children are required in most states to attend school from the age of six or seven (generally, kindergarten or first grade) until they turn eighteen (generally bringing them through 12th grade, the end of high school); some states allow students to leave school at sixteen or seventeen. 10923920 -> 1000009103930: About 12% of children are enrolled in parochial or nonsectarian private schools. 10923930 -> 1000009103940: Just over 2% of children are homeschooled. 10923940 -> 1000009103950: The United States has many competitive private and public institutions of higher education, as well as local community colleges of varying quality with open admission policies. 10923950 -> 1000009103960: Of Americans twenty-five and older, 84.6% graduated from high school, 52.6% attended some college, 27.2% earned a bachelor's degree, and 9.6% earned graduate degrees. 10923960 -> 1000009103970: The basic literacy rate is approximately 99%. 10923970 -> 1000009103980: The United Nations assigns the United States an Education Index of 0.97, tying it for twelfth-best in the world. 10923980 -> 1000009103990: Health 10923990 -> 1000009104000: The American life expectancy of 77.8 years at birth is a year shorter than the overall figure in Western Europe, and three to four years lower than that of Norway, Switzerland, and Canada. 10924000 -> 1000009104010: Over the past two decades, the country's rank in life expectancy has dropped from 11th to 42nd place in the world. 10924010 -> 1000009104020: The infant mortality rate of 6.37 per thousand likewise places the United States 42nd out of 221 countries, behind all of Western Europe. 10924020 -> 1000009104030: U.S. cancer survival rates are the highest in the world. 10924030 -> 1000009104040: Approximately one-third of the adult population is obese and an additional third is overweight; the obesity rate, the highest in the industrialized world, has more than doubled in the last quarter-century. 10924040 -> 1000009104050: Obesity-related type 2 diabetes is considered epidemic by healthcare professionals. 10924050 -> 1000009104060: The U.S. adolescent pregnancy rate, 79.8 per 1,000 women, is nearly four times that of France and five times that of Germany. 10924060 -> 1000009104070: Abortion in the United States, legal on demand, is a source of great political controversy. 10924070 -> 1000009104080: Many states ban public funding of the procedure and have laws to restrict late-term abortions, require parental notification for minors, and mandate a waiting period prior to treatment. 10924080 -> 1000009104090: While the incidence of abortion is in decline, the U.S. abortion ratio of 241 per 1,000 live births and abortion rate of 15 per 1,000 women aged 15–44 remain higher than those of most Western nations. 10924090 -> 1000009104100: The United States healthcare system far outspends any other nation's, measured in both per capita spending and percentage of GDP. 10924100 -> 1000009104110: Unlike most developed countries, the U.S. healthcare system is not universal, and relies on a higher proportion of private funding. 10924110 -> 1000009104120: In 2004, private insurance paid for 36% of personal health expenditure, private out-of-pocket payments covered 15%, and federal, state, and local governments paid for 44%. 10924120 -> 1000009104130: The World Health Organization ranked the U.S. healthcare system in 2000 as first in responsiveness, but 37th in overall performance. 
10924130 -> 1000009104140: The United States is a leader in medical innovation. 10924140 -> 1000009104150: In 2004, the U.S. nonindustrial sector spent three times as much as Europe per capita on biomedical research. 10924150 -> 1000009104160: Medical bills are the most common reason for personal bankruptcy in the United States. 10924160 -> 1000009104170: In 2005, 46.6 million Americans, or 15.9% of the population, were uninsured, 5.4 million more than in 2001. 10924170 -> 1000009104180: The primary cause of the decline in coverage is the drop in the number of Americans with employer-sponsored health insurance, which fell from 62.6% in 2001 to 59.5% in 2005. 10924180 -> 1000009104190: Approximately one third of the uninsured lived in households with annual incomes greater than $50,000, with half of those having an income over $75,000. 10924190 -> 1000009104200: Another third were eligible but not registered for public health insurance. 10924200 -> 1000009104210: In 2006, Massachusetts became the first state to mandate health insurance; California is considering similar legislation. 10924210 -> 1000009104220: Crime and punishment 10924220 -> 1000009104230: Law enforcement in the United States is primarily the responsibility of local police and sheriff's departments, with state police providing broader services. 10924230 -> 1000009104240: Federal agencies such as the Federal Bureau of Investigation (FBI) and the U.S. Marshals Service have specialized duties. 10924240 -> 1000009104250: At the federal level and in almost every state, jurisprudence operates on a common law system. 10924250 -> 1000009104260: State courts conduct most criminal trials; federal courts handle certain designated crimes as well as appeals from state systems. 10924260 -> 1000009104270: Among developed nations, the United States has above-average levels of violent crime and particularly high levels of gun violence and homicide. 10924270 -> 1000009104280: In 2006, there were 5.7 murders per 100,000 persons, three times the rate in neighboring Canada. 10924280 -> 1000009104290: The U.S. homicide rate, which decreased by 42% between 1991 and 1999, has been roughly steady since. 10924290 -> 1000009104300: Some scholars have associated the high rate of homicide with the country's high rates of gun ownership, in turn associated with U.S. gun laws which are very permissive compared to those of other developed countries. 10924300 -> 1000009104310: The United States has the highest documented incarceration rate and total prison population in the world and by far the highest figures among democratic, developed nations. 10924310 -> 1000009104320: At the start of 2008, more than 2.3 million people were held in American prisons or jails, more than one in every 100 adults. 10924320 -> 1000009104330: The current rate is almost seven times the 1980 figure. 10924330 -> 1000009104340: African American males are jailed at over six times the rate of white males and three times the rate of Hispanic males. 10924340 -> 1000009104350: In the latest comparable data, from 2006, the U.S. incarceration rate was more than three times the figure in Poland, the Organisation for Economic Co-operation and Development (OECD) country with the next highest rate. 10924350 -> 1000009104360: The country's extraordinary rate of incarceration is largely caused by changes in sentencing and drug policies. 
10924360 -> 1000009104370: Though it has been abolished in most Western nations, capital punishment is sanctioned in the United States for certain federal and military crimes, and in thirty-seven states. 10924370 -> 1000009104380: Since 1976, when the U.S. Supreme Court reinstated the death penalty after a four-year moratorium, there have been over 1,000 executions in the United States. 10924380 -> 1000009104390: In 2006, the country had the sixth highest number of executions in the world, following China, Iran, Pakistan, Iraq, and Sudan. 10924390 -> 1000009104400: In December 2007, New Jersey became the first state to abolish the death penalty since the 1976 Supreme Court decision. 10924400 -> 1000009104410: Culture 10924410 -> 1000009104420: The United States is a multicultural nation, home to a wide variety of ethnic groups, traditions, and values. 10924420 -> 1000009104430: There is no "American" ethnicity; aside from the now relatively small Native American population, nearly all Americans or their ancestors immigrated within the past five centuries. 10924430 -> 1000009104440: The culture held in common by the majority of Americans is referred to as mainstream American culture, a Western culture largely derived from the traditions of Western European migrants, beginning with the early English and Dutch settlers. 10924440 -> 1000009104450: German, Irish, and Scottish cultures have also been very influential. 10924450 -> 1000009104460: Certain cultural attributes of Mandé and Wolof slaves from West Africa were adopted by the American mainstream; based more on the traditions of Central African Bantu slaves, a distinct African American culture developed that would eventually have a major effect on the mainstream as well. 10924460 -> 1000009104470: Westward expansion integrated the Creoles and Cajuns of Louisiana and the Hispanos of the Southwest and brought close contact with the culture of Mexico. 10924470 -> 1000009104480: Large-scale immigration in the late nineteenth and early twentieth centuries from Southern and Eastern Europe introduced many new cultural elements. 10924480 -> 1000009104490: More recent immigration from Asia and especially Latin America has had broad impact. 10924490 -> 1000009104500: The resulting mix of cultures may be characterized as a homogeneous melting pot or as a pluralistic salad bowl in which immigrants and their descendants retain distinctive cultural characteristics. 10924500 -> 1000009104510: While American culture maintains that the United States is a classless society, economists and sociologists have identified cultural differences between the country's social classes, affecting socialization, language, and values. 10924510 -> 1000009104520: The American middle and professional class has been the source of many contemporary social trends such as feminism, environmentalism, and multiculturalism. 10924520 -> 1000009104530: Americans' self-images, social viewpoints, and cultural expectations are associated with their occupations to an unusually close degree. 10924530 -> 1000009104540: While Americans tend greatly to value socioeconomic achievement, being ordinary or average is generally seen as a positive attribute. 10924540 -> 1000009104550: Though the American Dream, or the perception that Americans enjoy high social mobility, played a key role in attracting immigrants, particularly in the late 1800s, some analysts find that the United States has less social mobility than Western Europe and Canada. 
10924550 -> 1000009104560: Women, many of whom were formerly more limited to domestic roles, now mostly work outside the home and receive a majority of bachelor's degrees. 10924560 -> 1000009104570: The changing role of women has also changed the American family. 10924570 -> 1000009104580: In 2005, no household arrangement defined more than 30% of households; married childless couples were most common, at 28%. 10924580 -> 1000009104590: The extension of marital rights to homosexual persons is an issue of debate; several more liberal states permit civil unions in lieu of marriage. 10924590 -> 1000009104600: In 2003, the Massachusetts Supreme Judicial Court ruled that state's ban on same-sex marriage unconstitutional; the Supreme Court of California ruled similarly in 2008. 10924600 -> 1000009104610: Forty-three states still legally restrict marriage to the traditional man-and-woman model. 10924610 -> 1000009104620: Popular media 10924620 -> 1000009104630: In 1878, Eadweard Muybridge demonstrated the power of photography to capture motion. 10924630 -> 1000009104640: In 1894, the world's first commercial motion picture exhibition was given in New York City, using Thomas Edison's Kinetoscope. 10924640 -> 1000009104650: The next year saw the first commercial screening of a projected film, also in New York, and the United States was in the forefront of sound film's development in the following decades. 10924650 -> 1000009104660: Since the early twentieth century, the U.S. film industry has largely been based in and around Hollywood, California. 10924660 -> 1000009104670: Director D. W. Griffith was central to the development of film grammar and Orson Welles's Citizen Kane (1941) is frequently cited in critics' polls as the greatest film of all time. 10924670 -> 1000009104680: American screen actors like John Wayne and Marilyn Monroe have become iconic figures, while producer/entrepreneur Walt Disney was a leader in both animated film and movie merchandising. 10924680 -> 1000009104690: The major film studios of Hollywood are the primary source of the most commercially successful movies in the world, such as Star Wars (1977) and Titanic (1997), and the products of Hollywood today dominate the global film industry. 10924690 -> 1000009104700: Americans are the heaviest television viewers in the world, and the average time spent in front of the screen continues to rise, hitting five hours a day in 2006. 10924700 -> 1000009104710: The four major broadcast networks are all commercial entities. 10924710 -> 1000009104720: Americans listen to radio programming, also largely commercialized, on average just over two-and-a-half hours a day. 10924720 -> 1000009104730: Aside from web portals and web search engines, the most popular websites are eBay, MySpace, Amazon.com, The New York Times, and Apple. 10924730 -> 1000009104740: Twelve million Americans keep a blog. 10924740 -> 1000009104750: The rhythmic and lyrical styles of African American music have deeply influenced American music at large, distinguishing it from European traditions. 10924750 -> 1000009104760: Elements from folk idioms such as the blues and what is now known as old-time music were adopted and transformed into popular genres with global audiences. 10924760 -> 1000009104770: Jazz was developed by innovators such as Louis Armstrong and Duke Ellington early in the twentieth century. 10924770 -> 1000009104780: Country music, rhythm and blues, and rock and roll emerged between the 1920s and 1950s. 
10924780 -> 1000009104790: In the 1960s, Bob Dylan emerged from the folk revival to become one of America's greatest songwriters and James Brown led the development of funk. 10924790 -> 1000009104800: More recent American creations include hip hop and house music. 10924800 -> 1000009104810: American pop stars such as Elvis Presley, Michael Jackson, and Madonna have become global celebrities. 10924810 -> 1000009104820: Literature, philosophy, and the arts 10924820 -> 1000009104830: In the eighteenth and early nineteenth centuries, American art and literature took most of their cues from Europe. 10924830 -> 1000009104840: Writers such as Nathaniel Hawthorne, Edgar Allan Poe, and Henry David Thoreau established a distinctive American literary voice by the middle of the nineteenth century. 10924840 -> 1000009104850: Mark Twain and poet Walt Whitman were major figures in the century's second half; Emily Dickinson, virtually unknown during her lifetime, is recognized as another essential American poet. 10924850 -> 1000009104860: Eleven U.S. citizens have won the Nobel Prize in Literature, most recently Toni Morrison in 1993. 10924860 -> 1000009104870: Ernest Hemingway, the 1954 Nobel laureate, is often named as one of the most influential writers of the twentieth century. 10924870 -> 1000009104880: A work seen as capturing fundamental aspects of the national experience and character—such as Herman Melville's Moby-Dick (1851), Twain's The Adventures of Huckleberry Finn (1885), and F. Scott Fitzgerald's The Great Gatsby (1925)—may be dubbed the "Great American Novel." 10924880 -> 1000009104890: Popular literary genres such as the Western and hardboiled crime fiction developed in the United States. Postmodernism, the most recent major literary movement, began on the theoretical side with French writers such as Jacques Derrida and Alain Robbe-Grillet and entered fiction largely through the Irish writer Samuel Beckett, but it has since been dominated by American writers such as Thomas Pynchon, Don DeLillo, William S. Burroughs, Jack Kerouac, John Barth, E.L. Doctorow, Kurt Vonnegut, and many others. 10924890 -> 1000009104900: The transcendentalists, led by Ralph Waldo Emerson and Thoreau, established the first major American philosophical movement. 10924900 -> 1000009104910: After the Civil War, Charles Peirce and then William James and John Dewey were leaders in the development of pragmatism. 10924910 -> 1000009104920: In the twentieth century, the work of W. V. Quine and Richard Rorty helped bring analytic philosophy to the fore in U.S. academic circles. 10924920 -> 1000009104930: In the visual arts, the Hudson River School was an important mid-nineteenth-century movement in the tradition of European naturalism. 10924930 -> 1000009104940: The 1913 Armory Show in New York City, an exhibition of European modernist art, shocked the public and transformed the U.S. art scene. 10924940 -> 1000009104950: Georgia O'Keeffe, Marsden Hartley, and others experimented with new styles, displaying a highly individualistic sensibility. 10924950 -> 1000009104960: Major artistic movements such as the abstract expressionism of Jackson Pollock and Willem de Kooning and the pop art of Andy Warhol and Roy Lichtenstein have developed largely in the United States. 10924960 -> 1000009104970: The tide of modernism and then postmodernism has also brought American architects such as Frank Lloyd Wright, Philip Johnson, and Frank Gehry to the top of their field.
10924970 -> 1000009104980: One of the first notable promoters of the nascent American theater was impresario P. T. Barnum, who began operating a lower Manhattan entertainment complex in 1841. 10924980 -> 1000009104990: The team of Harrigan and Hart produced a series of popular musical comedies in New York starting in the late 1870s. 10924990 -> 1000009105000: In the twentieth century, the modern musical form emerged on Broadway; the songs of musical theater composers such as Irving Berlin, Cole Porter, and Stephen Sondheim have become pop standards. 10925000 -> 1000009105010: Playwright Eugene O'Neill won the Nobel literature prize in 1936; other acclaimed U.S. dramatists include multiple Pulitzer Prize winners Tennessee Williams, Edward Albee, and August Wilson. 10925010 -> 1000009105020: Though largely overlooked at the time, Charles Ives's work of the 1910s established him as the first major U.S. composer in the classical tradition; other experimentalists such as Henry Cowell and John Cage created an identifiably American approach to classical composition. 10925020 -> 1000009105030: Aaron Copland and George Gershwin developed a unique American synthesis of popular and classical music. 10925030 -> 1000009105040: Choreographers Isadora Duncan and Martha Graham were central figures in the creation of modern dance; George Balanchine and Jerome Robbins were leaders in twentieth-century ballet. 10925040 -> 1000009105050: The United States has long been at the fore in the relatively modern artistic medium of photography, with major practitioners such as Alfred Stieglitz, Edward Steichen, Ansel Adams, and many others. 10925050 -> 1000009105060: The newspaper comic strip and the comic book are both U.S. innovations. 10925060 -> 1000009105070: Superman, the quintessential comic book superhero, has become an American icon. 10925070 -> 1000009105080: Food 10925080 -> 1000009105090: Mainstream American culinary arts are similar to those in other Western countries. 10925090 -> 1000009105100: Wheat is the primary cereal grain. 10925100 -> 1000009105110: Traditional American cuisine uses ingredients such as turkey, white-tailed deer venison, potatoes, sweet potatoes, corn, squash, and maple syrup, indigenous foods employed by Native Americans and early European settlers. 10925110 -> 1000009105120: Slow-cooked pork and beef barbecue, crab cakes, potato chips, and chocolate chip cookies are distinctively American styles. 10925120 -> 1000009105130: Soul food, developed by African slaves, is popular around the South and among many African Americans elsewhere. 10925130 -> 1000009105140: Syncretic cuisines such as Louisiana creole, Cajun, and Tex-Mex are regionally important. 10925140 -> 1000009105150: Characteristic dishes such as apple pie, fried chicken, pizza, hamburgers, and hot dogs derive from the recipes of various immigrants. 10925150 -> 1000009105160: French fries, Mexican dishes such as burritos and tacos, and pasta dishes freely adapted from Italian sources are widely consumed. 10925160 -> 1000009105170: Americans generally prefer coffee to tea. 10925170 -> 1000009105180: Marketing by U.S. industries is largely responsible for making orange juice and milk ubiquitous breakfast beverages. 10925180 -> 1000009105190: During the 1980s and 1990s, Americans' caloric intake rose 24%; frequent dining at fast food outlets is associated with what health officials call the American "obesity epidemic." 
10925190 -> 1000009105200: Highly sweetened soft drinks are widely popular; sugared beverages account for 9% of the average American's caloric intake. 10925200 -> 1000009105210: Sports 10925210 -> 1000009105220: Since the late nineteenth century, baseball has been regarded as the national sport; football, basketball, and ice hockey are the country's three other leading professional team sports. 10925220 -> 1000009105230: College football and basketball also attract large audiences. 10925230 -> 1000009105240: Football is now by several measures the most popular spectator sport in the United States. 10925240 -> 1000009105250: Boxing and horse racing were once the most watched individual sports, but they have been eclipsed by golf and auto racing, particularly NASCAR. 10925250 -> 1000009105260: Soccer, though not a leading professional sport in the country, is played widely at the youth and amateur levels. 10925260 -> 1000009105270: Tennis and many outdoor sports are also popular. 10925270 -> 1000009105280: While most major U.S. sports have evolved out of European practices, basketball, volleyball, skateboarding, and snowboarding are American inventions. 10925280 -> 1000009105290: Lacrosse and surfing arose from Native American and Native Hawaiian activities that predate Western contact. 10925290 -> 1000009105300: Eight Olympic Games have taken place in the United States. 10925300 -> 1000009105310: The United States has won 2,191 medals at the Summer Olympic Games, more than any other country, and 216 in the Winter Olympic Games, the second most. Verb 10930010 -> 1000009200020: Verb 10930020 -> 1000009200030: For English usage of verbs see the wiki article English verbs. 10930030 -> 1000009200040: In syntax, a verb is a word (part of speech) that usually denotes an action (bring, read), an occurrence (decompose, glitter), or a state of being (exist, stand). 10930040 -> 1000009200050: Depending on the language, a verb may vary in form according to many factors, possibly including its tense, aspect, mood and voice. 10930050 -> 1000009200060: It may also agree with the person, gender, and/or number of some of its arguments (subject, object, etc.). 10930060 -> 1000009200070: Valency 10930070 -> 1000009200080: The number of arguments that a verb takes is called its valency or valence. 10930080 -> 1000009200090: Verbs can be classified according to their valency: 10930090 -> 1000009200100: Intransitive (valency = 1): the verb only has a subject. 10930100 -> 1000009200110: For example: "he runs", "it falls". 10930110 -> 1000009200120: Transitive (valency = 2): the verb has a subject and a direct object. 10930120 -> 1000009200130: For example: "she eats fish", "Mike hunts deer". 10930130 -> 1000009200140: Ditransitive (valency = 3): the verb has a subject, a direct object, and an indirect or secondary object. 10930140 -> 1000009200150: For example: "she gave him a book". 10930150 -> 1000009200160: Linking verbs, which report a state of being rather than an action and relate a subject complement to the subject rather than to an object, are usually treated separately from this classification (see Copula below). 10930160 -> 1000009200170: It is possible to have verbs with zero valency. 10930170 -> 1000009200180: Weather verbs are often impersonal (subjectless) in null-subject languages like Spanish, where the verb llueve means "It rains". 10930180 -> 1000009200190: In English, they require a dummy pronoun, and therefore formally have a valency of 1. 10930190 -> 1000009200200: The intransitive and transitive are typical, but some languages also distinguish impersonal and objective verbs, which are somewhat different from the norm.
10930200 -> 1000009200210: In Tlingit, for example, verbs are classified as impersonal, objective, intransitive, or transitive according to their valency. 10930210 -> 1000009200220: In the objective the verb takes an object but no subject; the nonreferent subject in some uses may be marked in the verb by an incorporated dummy pronoun similar to the English weather verb (see below). 10930220 -> 1000009200230: Impersonal verbs take neither subject nor object, as with other null subject languages, but again the verb may show incorporated dummy pronouns despite the lack of subject and object phrases. 10930230 -> 1000009200240: Tlingit lacks a ditransitive, so the indirect object is described by a separate, extraposed clause. 10930240 -> 1000009200250: English verbs are often flexible with regard to valency. 10930250 -> 1000009200260: A transitive verb can often drop its object and become intransitive; or an intransitive verb can take an object and become transitive. 10930260 -> 1000009200270: Compare: 10930270 -> 1000009200280: I moved. (intransitive) 10930280 -> 1000009200290: I moved the book. (transitive) 10930290 -> 1000009200300: In the first example, the verb move has no grammatical object. 10930300 -> 1000009200310: (In this case, there may be an understood object, namely the subject itself (I/myself). 10930310 -> 1000009200320: The verb is then possibly reflexive, rather than intransitive); in the second, the subject and object are distinct. 10930320 -> 1000009200330: The verb has a different valency, but the form remains exactly the same. 10930330 -> 1000009200340: In many languages other than English, such valency changes are not possible; the verb must instead be inflected for voice in order to change the valency. 10930340 -> 1000009200350: Copula 10930350 -> 1000009200360: A copula is a word that is used to describe its subject, or to equate or liken the subject with its predicate. 10930360 -> 1000009200370: In many languages, copulas are a special kind of verb, sometimes called copulative verbs or linking verbs. 10930370 -> 1000009200380: Because copulas do not describe actions being performed, they are usually analyzed outside the transitive/intransitive distinction. 10930380 -> 1000009200390: The most basic copula in English is to be; there are others (remain, seem, grow, become, etc.). 10930390 -> 1000009200400: Some languages (the Semitic and Slavic families, Chinese, Sanskrit, and others) can omit or do not have the simple copula equivalent of "to be", especially in the present tense. 10930400 -> 1000009200410: In these languages a noun and adjective pair (or two nouns) can constitute a complete sentence. 10930410 -> 1000009200420: This construction is called zero copula. 10930420 -> 1000009200430: Verbal noun and verbal adjective 10930430 -> 1000009200440: Most languages have a number of verbal nouns that describe the action of the verb. 10930440 -> 1000009200450: In Indo-European languages, there are several kinds of verbal nouns, including gerunds, infinitives, and supines. 10930450 -> 1000009200460: English has gerunds, such as seeing, and infinitives such as to see; they both can function as nouns; seeing is believing is roughly equivalent in meaning with to see is to believe. 10930460 -> 1000009200470: These terms are sometimes applied to verbal nouns of non-Indo-European languages. 10930470 -> 1000009200480: In the Indo-European languages, verbal adjectives are generally called participles.
10930480 -> 1000009200490: English has an active participle, also called a present participle; and a passive participle, also called a past participle. 10930490 -> 1000009200500: The active participle of play is playing, and the passive participle is played. 10930500 -> 1000009200510: The active participle describes nouns that perform the action given in the verb, e.g. 10930510 -> 1000009200520: I saw the playing children.. 10930520 -> 1000009200530: The passive participle describes nouns that have been the object of the action of the verb, e.g. 10930530 -> 1000009200540: I saw the played game scattered across the floor.. 10930540 -> 1000009200550: Other languages apply tense and aspect to participles, and possess a larger number of them with more distinct shades of meaning. 10930550 -> 1000009200560: Agreement 10930560 -> 1000009200570: In languages where the verb is inflected, it often agrees with its primary argument (what we tend to call the subject) in person, number and/or gender. 10930570 -> 1000009200580: English only shows distinctive agreement in the third person singular, present tense form of verbs (which is marked by adding "-s"); the rest of the persons are not distinguished in the verb. 10930580 -> 1000009200590: Spanish inflects verbs for tense/mood/aspect and they agree in person and number (but not gender) with the subject. 10930590 -> 1000009200600: Japanese, in turn, inflects verbs for many more categories, but shows absolutely no agreement with the subject. 10930600 -> 1000009200610: Basque, Georgian, and some other languages, have polypersonal agreement: the verb agrees with the subject, the direct object and even the secondary object if present. Web application 10940010 -> 1000009300020: Web application 10940020 -> 1000009300030: In software engineering, a Web application is an application that is accessed via Web browser over a network such as the Internet or an intranet. 10940030 -> 1000009300040: It is also a computer software application that is coded in a browser-supported language (such as HTML, JavaScript, Java, etc.) and reliant on a common web browser to render the application executable. 10940040 -> 1000009300050: Web applications are popular due to the ubiquity of a client, sometimes called a thin client. 10940050 -> 1000009300060: The ability to update and maintain Web applications without distributing and installing software on potentially thousands of client computers is a key reason for their popularity. 10940060 -> 1000009300070: Common Web applications include Webmail, online retail sales, online auctions, wikis, discussion boards, Weblogs, MMORPGs and many other functions. 10940070 -> 1000009300080: History 10940080 -> 1000009300090: In earlier types of client-server computing, each application had its own client program which served as its user interface and had to be separately installed on each user's personal computer. 10940090 -> 1000009300100: An upgrade to the server part of the application would typically require an upgrade to the clients installed on each user workstation, adding to the support cost and decreasing productivity. 10940100 -> 1000009300110: In contrast, Web applications dynamically generate a series of Web documents in a standard format supported by common browsers such as HTML/XHTML. 10940110 -> 1000009300120: Client-side scripting in a standard language such as JavaScript is commonly included to add dynamic elements to the user interface. 
10940120 -> 1000009300130: Generally, each individual Web page is delivered to the client as a static document, but the sequence of pages can provide an interactive experience, as user input is returned through Web form elements embedded in the page markup. 10940130 -> 1000009300140: During the session, the Web browser interprets and displays the pages, and acts as the universal client for any Web application. 10940140 -> 1000009300150: Interface 10940150 -> 1000009300160: The Web interface places very few limits on client functionality. 10940160 -> 1000009300170: Through Java, JavaScript, DHTML, Flash and other technologies, application-specific methods such as drawing on the screen, playing audio, and access to the keyboard and mouse are all possible. 10940170 -> 1000009300180: Many services have worked to combine all of these into a more familiar interface that adopts the appearance of an operating system. 10940180 -> 1000009300190: General purpose techniques such as drag and drop are also supported by these technologies. 10940190 -> 1000009300200: Web developers often use client-side scripting to add functionality, especially to create an interactive experience that does not require page reloading (which many users find disruptive). 10940200 -> 1000009300210: Recently, technologies have been developed to coordinate client-side scripting with server-side technologies such as PHP. 10940210 -> 1000009300220: Ajax, a web development technique using a combination of various technologies, is an example of technology which creates a more interactive experience. 10940220 -> 1000009300230: Technical considerations 10940230 -> 1000009300240: A significant advantage of building Web applications to support standard browser features is that they should perform as specified regardless of the operating system or OS version installed on a given client. 10940240 -> 1000009300250: Rather than creating clients for MS Windows, Mac OS X, GNU/Linux, and other operating systems, the application can be written once and deployed almost anywhere. 10940250 -> 1000009300260: However, inconsistent implementations of the HTML, CSS, DOM and other browser specifications can cause problems in web application development and support. 10940260 -> 1000009300270: Additionally, the ability of users to customize many of the display settings of their browser (such as selecting different font sizes, colors, and typefaces, or disabling scripting support) can interfere with consistent implementation of a Web application. 10940270 -> 1000009300280: Another approach is to use Adobe Flash or Java applets to provide some or all of the user interface. 10940280 -> 1000009300290: Since most Web browsers include support for these technologies (usually through plug-ins), Flash- or Java-based applications can be implemented with much of the same ease of deployment. 10940290 -> 1000009300300: Because they allow the programmer greater control over the interface, they bypass many browser-configuration issues, although incompatibilities between Java or Flash implementations on the client can introduce different complications. 10940300 -> 1000009300310: Because of their architectural similarities to traditional client-server applications, with a somewhat "thick" client, there is some dispute over whether to call systems of this sort "Web applications"; an alternative term is "Rich Internet Application" (RIA). 
10940310 -> 1000009300320: Structure 10940320 -> 1000009300330: Though many variations are possible, a Web application is commonly structured as a three-tiered application. 10940330 -> 1000009300340: In its most common form, a Web browser is the first tier, an engine using some dynamic Web content technology (such as ASP, ASP.NET, CGI, ColdFusion, JSP/Java, PHP, embPerl, Python, or Ruby on Rails) is the middle tier, and a database is the third tier. 10940340 -> 1000009300350: The Web browser sends requests to the middle tier, which services them by making queries and updates against the database and generates a user interface. 10940350 -> 1000009300360: Some, however, view a Web application as a two-tier architecture. 10940360 -> 1000009300370: Business use 10940370 -> 1000009300380: An emerging strategy for application software companies is to provide Web access to software previously distributed as local applications. 10940380 -> 1000009300390: Depending on the type of application, it may require the development of an entirely different browser-based interface, or merely the adaptation of an existing application to use different presentation technology. 10940390 -> 1000009300400: These programs allow the user to pay a monthly or yearly fee for use of a software application without having to install it on a local hard drive. 10940400 -> 1000009300410: A company which follows this strategy is known as an application service provider (ASP), and ASPs are currently receiving much attention in the software industry. 10940410 -> 1000009300420: Writing Web applications 10940420 -> 1000009300430: There are many Web application frameworks which facilitate rapid application development by allowing the programmer to define a high-level description of the program. 10940430 -> 1000009300440: In addition, there is potential for the development of applications on Internet operating systems, although currently there are not many viable platforms that fit this model. 10940440 -> 1000009300450: The use of Web application frameworks can often reduce the number of errors in a program, both by making the code simpler and by allowing one team to concentrate just on the framework. 10940450 -> 1000009300460: In applications which are exposed to constant hacking attempts on the Internet, security-related problems caused by errors in the program are a big issue. 10940460 -> 1000009300470: Frameworks may also promote the use of best practices such as GET after POST. 10940470 -> 1000009300480: Web Application Security 10940480 -> 1000009300490: The Web Application Security Consortium (WASC) and OWASP are projects developed with the intention of documenting how to avoid security problems in Web applications. 10940490 -> 1000009300500: A Web Application Security Scanner is specialized software for detecting security problems in web applications. 10940500 -> 1000009300510: Applications 10940510 -> 1000009300520: Browser applications typically include simple office software (word processors, spreadsheets, and presentation tools) and can also include more advanced applications such as project management software, CAD design software, and point-of-sale applications.
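To make the three-tiered structure described under Structure above concrete, here is a minimal sketch in Python using only the standard library: the browser is the first tier, a small WSGI application stands in for the dynamic middle tier, and an SQLite database is the third tier. The guestbook table, file name, and port number are invented for illustration; a real deployment would use one of the dynamic content technologies or frameworks named above behind a production web server.

import sqlite3
from html import escape
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

DB = "guestbook.db"  # third tier: a database (file name invented for the example)

def init_db():
    with sqlite3.connect(DB) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS entries (message TEXT)")

def app(environ, start_response):
    # Middle tier: services browser requests by querying/updating the database
    # and generating an HTML user interface.
    if environ["REQUEST_METHOD"] == "POST":
        size = int(environ.get("CONTENT_LENGTH") or 0)
        form = parse_qs(environ["wsgi.input"].read(size).decode("utf-8"))
        message = form.get("message", [""])[0]
        with sqlite3.connect(DB) as conn:
            conn.execute("INSERT INTO entries VALUES (?)", (message,))
    with sqlite3.connect(DB) as conn:
        rows = conn.execute("SELECT message FROM entries").fetchall()
    items = "".join(f"<li>{escape(m)}</li>" for (m,) in rows)
    page = ("<html><body><ul>" + items + "</ul>"
            "<form method='post'><input name='message'>"
            "<input type='submit' value='Add'></form></body></html>")
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [page.encode("utf-8")]

if __name__ == "__main__":
    init_db()
    make_server("localhost", 8000, app).serve_forever()  # first tier: any Web browser

Each request produces a complete HTML document, so the sequence of pages driven by the form submission is what provides the interactive experience described earlier.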
10940520 -> 1000009300530: Examples 10940530 -> 1000009300540: Word processor and Spreadsheet: Google Docs & Spreadsheets 10940540 -> 1000009300550: CRM Software: SalesForce.com 10940550 -> 1000009300560: Benefits 10940560 -> 1000009300570: Browser Applications typically require little or no disk space, upgrade automatically with new features, and integrate easily into other web procedures, such as email and searching. 10940570 -> 1000009300580: They also provide cross-platform compatibility (e.g., Mac or Windows) because they operate within a web browser window. 10940580 -> 1000009300590: Disadvantages 10940590 -> 1000009300600: Standards compliance is an issue with any non-typical office document creator, which causes problems when file sharing and collaboration become critical. 10940600 -> 1000009300610: Also, Browser Applications rely on application files accessed on remote servers through the internet. 10940610 -> 1000009300620: Therefore, when the connection is interrupted, the application is no longer usable. 10940620 -> 1000009300630: Google Gears is a beta platform to combat this issue and improve the usability of Browser Applications. Web search engine 10740010 -> 1000009400020: Web search engine 10740020 -> 1000009400030: A Web search engine is a search engine designed to search for information on the World Wide Web. 10740030 -> 1000009400040: Information may consist of web pages, images and other types of files. 10740040 -> 1000009400050: Some search engines also mine data available in newsgroups, databases, or open directories. 10740050 -> 1000009400060: Unlike Web directories, which are maintained by human editors, search engines operate algorithmically or are a mixture of algorithmic and human input. 10740060 -> 1000009400070: History 10740070 -> 1000009400080: Before there were search engines there was a complete list of all webservers. 10740080 -> 1000009400090: The list was edited by Tim Berners-Lee and hosted on the CERN webserver. 10740090 -> 1000009400100: One historical snapshot from 1992 remains. 10740100 -> 1000009400110: As more and more webservers went online, the central list could not keep up. 10740110 -> 1000009400120: On the NCSA site, new servers were announced under the title "What's New!", but no complete listing existed any more. 10740120 -> 1000009400130: The very first tool used for searching on the (pre-web) Internet was Archie. 10740130 -> 1000009400140: The name stands for "archive" without the "v". 10740140 -> 1000009400150: It was created in 1990 by Alan Emtage, a student at McGill University in Montreal. 10740150 -> 1000009400160: The program downloaded the directory listings of all the files located on public anonymous FTP (File Transfer Protocol) sites, creating a searchable database of file names; however, Archie did not index the contents of these sites. 10740160 -> 1000009400170: The rise of Gopher (created in 1991 by Mark McCahill at the University of Minnesota) led to two new search programs, Veronica and Jughead. 10740170 -> 1000009400180: Like Archie, they searched the file names and titles stored in Gopher index systems. 10740180 -> 1000009400190: Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives) provided a keyword search of most Gopher menu titles in the entire Gopher listings. 10740190 -> 1000009400200: Jughead (Jonzy's Universal Gopher Hierarchy Excavation And Display) was a tool for obtaining menu information from specific Gopher servers.
10740200 -> 1000009400210: While the name of the search engine "Archie" was not a reference to the Archie comic book series, "Veronica" and "Jughead" are characters in the series, thus referencing their predecessor. 10740210 -> 1000009400220: The first Web search engine was Wandex, a now-defunct index collected by the World Wide Web Wanderer, a web crawler developed by Matthew Gray at MIT in 1993. 10740220 -> 1000009400230: Another very early search engine, Aliweb, also appeared in 1993. 10740230 -> 1000009400240: JumpStation (released in early 1994) used a crawler to find web pages for searching, but search was limited to the title of web pages only. 10740240 -> 1000009400250: One of the first "full text" crawler-based search engines was WebCrawler, which came out in 1994. 10740250 -> 1000009400260: Unlike its predecessors, it let users search for any word in any webpage, which became the standard for all major search engines since. 10740260 -> 1000009400270: It was also the first one to be widely known by the public. 10740270 -> 1000009400280: Also in 1994 Lycos (which started at Carnegie Mellon University) was launched, and became a major commercial endeavor. 10740280 -> 1000009400290: Soon after, many search engines appeared and vied for popularity. 10740290 -> 1000009400300: These included Magellan, Excite, Infoseek, Inktomi, Northern Light, and AltaVista. 10740300 -> 1000009400310: Yahoo! was among the most popular ways for people to find web pages of interest, but its search function operated on its web directory, rather than full-text copies of web pages. 10740310 -> 1000009400320: Information seekers could also browse the directory instead of doing a keyword-based search. 10740320 -> 1000009400330: In 1996, Netscape was looking to give a single search engine an exclusive deal to be their featured search engine. 10740330 -> 1000009400340: There was so much interest that instead a deal was struck with Netscape by 5 of the major search engines, where for $5Million per year each search engine would be in a rotation on the Netscape search engine page. 10740340 -> 1000009400350: These five engines were: Yahoo!, Magellan, Lycos, Infoseek and Excite. 10740350 -> 1000009400360: Search engines were also known as some of the brightest stars in the Internet investing frenzy that occurred in the late 1990s. 10740360 -> 1000009400370: Several companies entered the market spectacularly, receiving record gains during their initial public offerings. 10740370 -> 1000009400380: Some have taken down their public search engine, and are marketing enterprise-only editions, such as Northern Light. 10740380 -> 1000009400390: Many search engine companies were caught up in the dot-com bubble, a speculation-driven market boom that peaked in 1999 and ended in 2001. 10740390 -> 1000009400400: Around 2000, the Google search engine rose to prominence. 10740400 -> 1000009400410: The company achieved better results for many searches with an innovation called PageRank. 10740410 -> 1000009400420: This iterative algorithm ranks web pages based on the number and PageRank of other web sites and pages that link there, on the premise that good or desirable pages are linked to more than others. 10740420 -> 1000009400430: Google also maintained a minimalist interface to its search engine. 10740430 -> 1000009400440: In contrast, many of its competitors embedded a search engine in a web portal. 10740440 -> 1000009400450: By 2000, Yahoo was providing search services based on Inktomi's search engine. 
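The iterative, link-based ranking idea described above can be sketched in a few lines of Python. This is the textbook formulation of PageRank with a damping factor, run on an invented four-page link graph; it illustrates the principle only and is not Google's actual implementation.

# Toy PageRank: a page's score grows with the number and the score of the
# pages that link to it.  The link graph below is invented for illustration.
damping = 0.85
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}  # page -> outgoing links

pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}        # start from uniform scores

for _ in range(50):                                # iterate towards a fixed point
    new_rank = {}
    for page in pages:
        incoming = sum(rank[q] / len(links[q]) for q in pages if page in links[q])
        new_rank[page] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))                   # "c", the most linked-to page, ranks highest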
10740450 -> 1000009400460: Yahoo! acquired Inktomi in 2002, and Overture (which owned AlltheWeb and AltaVista) in 2003. 10740460 -> 1000009400470: Yahoo! switched to Google's search engine until 2004, when it launched its own search engine based on the combined technologies of its acquisitions. 10740470 -> 1000009400480: Microsoft first launched MSN Search (since re-branded Live Search) in the fall of 1998 using search results from Inktomi. 10740480 -> 1000009400490: In early 1999 the site began to display listings from Looksmart blended with results from Inktomi except for a short time in 1999 when results from AltaVista were used instead. 10740490 -> 1000009400500: In 2004, Microsoft began a transition to its own search technology, powered by its own web crawler (called msnbot). 10740500 -> 1000009400510: As of late 2007, Google was by far the most popular Web search engine worldwide. 10740510 -> 1000009400520: A number of country-specific search engine companies have become prominent; for example Baidu is the most popular search engine in the People's Republic of China and guruji.com in India. 10740520 -> 1000009400530: How Web search engines work 10740530 -> 1000009400540: A search engine operates, in the following order 10740540 -> 1000009400550: Web crawling 10740550 -> 1000009400560: Indexing 10740560 -> 1000009400570: Searching 10740570 -> 1000009400580: Web search engines work by storing information about many web pages, which they retrieve from the WWW itself. 10740580 -> 1000009400590: These pages are retrieved by a Web crawler (sometimes also known as a spider) — an automated Web browser which follows every link it sees. 10740590 -> 1000009400600: Exclusions can be made by the use of robots.txt. 10740600 -> 1000009400610: The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). 10740610 -> 1000009400620: Data about web pages are stored in an index database for use in later queries. 10740620 -> 1000009400630: Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find. 10740630 -> 1000009400640: This cached page always holds the actual search text since it is the one that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it. 10740640 -> 1000009400650: This problem might be considered to be a mild form of linkrot, and Google's handling of it increases usability by satisfying user expectations that the search terms will be on the returned webpage. 10740650 -> 1000009400660: This satisfies the principle of least astonishment since the user normally expects the search terms to be on the returned pages. 10740660 -> 1000009400670: Increased search relevance makes these cached pages very useful, even beyond the fact that they may contain data that may no longer be available elsewhere. 10740670 -> 1000009400680: When a user enters a query into a search engine (typically by using key words), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. 
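The crawl, index, and query steps just outlined can be sketched with the Python standard library: fetch a page, extract its words and outgoing links, and record each word in an inverted index that later queries consult. The seed URL is only a placeholder, and politeness rules (the robots.txt exclusions mentioned above, crawl delays) and meta-tag handling are left out of this sketch.

from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen
import re

class PageParser(HTMLParser):
    """Collects the visible text and the outgoing links of one HTML page."""
    def __init__(self):
        super().__init__()
        self.text, self.links = [], []
    def handle_data(self, data):
        self.text.append(data)
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [value for name, value in attrs if name == "href" and value]

def crawl(seed, limit=10):
    """Breadth-first crawl that builds an inverted index: word -> set of URLs."""
    index, queue, seen = defaultdict(set), [seed], set()
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:  # a real crawler would first honour the site's robots.txt
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue
        parser = PageParser()
        parser.feed(html)
        for word in re.findall(r"[a-z]+", " ".join(parser.text).lower()):
            index[word].add(url)                     # data about the page goes into the index
        queue += [urljoin(url, link) for link in parser.links]
    return index

index = crawl("http://example.com/")                 # placeholder seed URL
print(sorted(index["example"] & index["domain"]))    # a simple two-word AND query

A real engine also stores per-page data such as titles and cached copies, as described above, rather than only a word-to-URL mapping.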
10740680 -> 1000009400690: Most search engines support the use of the boolean operators AND, OR and NOT to further specify the search query. 10740690 -> 1000009400700: Some search engines provide an advanced feature called proximity search, which allows users to define the distance between keywords. 10740700 -> 1000009400710: The usefulness of a search engine depends on the relevance of the result set it gives back. 10740710 -> 1000009400720: While there may be millions of webpages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. 10740720 -> 1000009400730: Most search engines employ methods to rank the results to provide the "best" results first. 10740730 -> 1000009400740: How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. 10740740 -> 1000009400750: The methods also change over time as Internet usage changes and new techniques evolve. 10740750 -> 1000009400760: Most Web search engines are commercial ventures supported by advertising revenue and, as a result, some employ the controversial practice of allowing advertisers to pay money to have their listings ranked higher in search results. 10740760 -> 1000009400770: Those search engines which do not accept money for their search engine results make money by running search-related ads alongside the regular search engine results. 10740770 -> 1000009400780: The search engines make money every time someone clicks on one of these ads. 10740780 -> 1000009400790: The vast majority of search engines are run by private companies using proprietary algorithms and closed databases, though some are open source. 10740790 -> 1000009400800: Revenue in the web search portals industry is projected to grow in 2008 by 13.4 percent, with broadband connections expected to rise by 15.1 percent. 10740800 -> 1000009400810: Between 2008 and 2012, industry revenue is projected to rise by 56 percent as Internet penetration still has some way to go to reach full saturation in American households. 10740810 -> 1000009400820: Furthermore, broadband services are projected to account for an ever-increasing share of domestic Internet users, rising to 118.7 million by 2012, with an increasing share accounted for by fiber-optic and high-speed cable lines. Word 10950010 -> 1000009500020: Word 10950020 -> 1000009500030: A word is a unit of language that carries meaning, consists of one or more morphemes which are linked more or less tightly together, and has a phonetic value. 10950030 -> 1000009500040: Typically a word will consist of a root or stem and zero or more affixes. 10950040 -> 1000009500050: Words can be combined to create phrases, clauses, and sentences. 10950050 -> 1000009500060: A word consisting of two or more stems joined together forms a compound. 10950060 -> 1000009500070: A word combined with another word or part of a word forms a portmanteau. 10950070 -> 1000009500080: Etymology 10950080 -> 1000009500090: English word is directly from Old English word, and has cognates in all branches of Germanic (Old High German wort, Old Norse orð, Gothic waurd), deriving from Proto-Germanic *wurđa, continuing a virtual PIE *wr̥dhom. 10950090 -> 1000009500100: Cognates outside Germanic include Baltic (Old Prussian wīrds "word", and with different ablaut Lithuanian var̃das "name", Latvian vàrds "word, name") and Latin verbum.
10950100 -> 1000009500110: The PIE stem *werdh- is also found in Greek ερθει (φθεγγεται "speaks, utters" Hes. ). 10950110 -> 1000009500120: The PIE root is *ŭer-, ŭrē- "say, speak" (also found in Greek ειρω, ρητωρ). 10950120 -> 1000009500130: The original meaning of word is "utterance, speech, verbal expression". 10950130 -> 1000009500140: Until Early Modern English, it could more specifically refer to a name or title. 10950140 -> 1000009500150: The technical meaning of "an element of speech" first arises in discussion of grammar (particularly Latin grammar), as in the prologue to Wyclif's Bible (ca. 1400): 10950150 -> 1000009500160: "This word autem, either vero, mai stonde for forsothe, either for but." 10950160 -> 1000009500170: Definitions 10950170 -> 1000009500180: Depending on the language, words can be difficult to identify or delimit. 10950180 -> 1000009500190: Dictionaries take upon themselves the task of categorizing a language's lexicon into lemmas. 10950190 -> 1000009500200: These can be taken as an indication of what constitutes a "word" in the opinion of the authors. 10950200 -> 1000009500210: Word boundaries 10950210 -> 1000009500220: In spoken language, the distinction of individual words is usually given by rhythm or accent, but short words are often run together. 10950220 -> 1000009500230: See clitic for phonologically dependent words. 10950230 -> 1000009500240: Spoken French has some of the features of a polysynthetic language: il y est allé ("He went there") is pronounced /{(IPA+i.ljɛ.ta.le+i.ljɛ.ta.le)}/. 10950240 -> 1000009500250: As the majority of the world's languages are not written, the scientific determination of word boundaries becomes important. 10950250 -> 1000009500260: There are five ways to determine where the word boundaries of spoken language should be placed: 10950260 -> 1000009500270: Potential pause 10950270 -> 1000009500280: A speaker is told to repeat a given sentence slowly, allowing for pauses. 10950280 -> 1000009500290: The speaker will tend to insert pauses at the word boundaries. 10950290 -> 1000009500300: However, this method is not foolproof: the speaker could easily break up polysyllabic words. 10950300 -> 1000009500310: Indivisibility 10950310 -> 1000009500320: A speaker is told to say a sentence out loud, and then is told to say the sentence again with extra words added to it. 10950320 -> 1000009500330: Thus, I have lived in this village for ten years might become I and my family have lived in this little village for about ten or so years. 10950330 -> 1000009500340: These extra words will tend to be added in the word boundaries of the original sentence. 10950340 -> 1000009500350: However, some languages have infixes, which are put inside a word. 10950350 -> 1000009500360: Similarly, some have separable affixes; in the German sentence "Ich komme gut zu Hause an," the verb ankommen is separated. 10950360 -> 1000009500370: Minimal free forms 10950370 -> 1000009500380: This concept was proposed by Leonard Bloomfield. 10950380 -> 1000009500390: Words are thought of as the smallest meaningful unit of speech that can stand by themselves. 10950390 -> 1000009500400: This correlates phonemes (units of sound) to lexemes (units of meaning). 10950400 -> 1000009500410: However, some written words are not minimal free forms, as they make no sense by themselves (for example, the and of). 
10950410 -> 1000009500420: Phonetic boundaries 10950420 -> 1000009500430: Some languages have particular rules of pronunciation that make it easy to spot where a word boundary should be. 10950430 -> 1000009500440: For example, in a language that regularly stresses the last syllable of a word, a word boundary is likely to fall after each stressed syllable. 10950440 -> 1000009500450: Another example can be seen in a language that has vowel harmony (like Turkish): the vowels within a given word share the same quality, so a word boundary is likely to occur whenever the vowel quality changes. 10950450 -> 1000009500460: However, not all languages have such convenient phonetic rules, and even those that do present the occasional exception. 10950460 -> 1000009500470: Semantic units 10950470 -> 1000009500480: Much like the above-mentioned minimal free forms, this method breaks down a sentence into its smallest semantic units. 10950480 -> 1000009500490: However, language often contains words that have little semantic value (and often play a more grammatical role), or semantic units that are compound words. 10950490 -> 1000009500500: A further criterion: 10950500 -> 1000009500510: Pragmatics. 10950510 -> 1000009500520: As Plag suggests, the idea of a lexical item being considered a word should also adjust to pragmatic criteria. 10950520 -> 1000009500530: The word "hello", for example, hardly occurs outside the realm of greetings, which makes it difficult to assign it a meaning on its own. 10950530 -> 1000009500540: This is a little more complex if we consider "how do you do?": is it a word, a phrase, or an idiom? 10950540 -> 1000009500550: In practice, linguists apply a mixture of all these methods to determine the word boundaries of any given sentence. 10950550 -> 1000009500560: Even with the careful application of these methods, the exact definition of a word is often still very elusive. 10950560 -> 1000009500570: There are some words that seem very general but may truly have a technical definition, such as the word "soon," usually meaning within a week. 10950570 -> 1000009500580: Orthography 10950580 -> 1000009500590: In languages with a literary tradition, there is interrelation between orthography and the question of what is considered a single word. 10950590 -> 1000009500600: Word separators (typically space marks) are common in modern orthography of languages using alphabetic scripts, but these are (excepting isolated precedents) a modern development (see also history of writing). 10950600 -> 1000009500610: In English orthography, words may contain spaces if they are compounds or proper nouns such as ice cream or air raid shelter. 10950610 -> 1000009500620: Vietnamese orthography, although using the Latin alphabet, delimits monosyllabic morphemes, not words. 10950620 -> 1000009500630: Conversely, synthetic languages often combine many lexical morphemes into single words, making it difficult to boil them down to the traditional sense of words found more easily in analytic languages; this is especially difficult for polysynthetic languages such as Inuktitut and Ubykh, where entire sentences may consist of single such words. 10950630 -> 1000009500640: Logographic scripts use single signs (characters) to express a word. 10950640 -> 1000009500650: Most existing scripts are, however, only partly logographic, and combine logographic with phonetic signs. 10950650 -> 1000009500660: The most widespread logographic script in modern use is the Chinese script.
10950660 -> 1000009500670: While the Chinese script has some true logographs, the largest class of characters used in modern Chinese (some 90%) are so-called pictophonetic compounds ({(Lang+形声字+zh+形声字)}, {(Lang+Xíngshēngzì+pny+Xíngshēngzì)}). 10950670 -> 1000009500680: Characters of this sort are composed of two parts: a pictograph, which suggests the general meaning of the character, and a phonetic part, which is derived from a character pronounced in the same way as the word the new character represents. 10950680 -> 1000009500690: In this sense, the character for most Chinese words consists of a determiner and a syllabogram, similar to the approach used by cuneiform script and Egyptian hieroglyphs. 10950690 -> 1000009500700: There is a tendency informed by orthography to identify a single Chinese character as corresponding to a single word in the Chinese language, parallel to the tendency to identify the letters between two space marks as a single word in the English language. 10950700 -> 1000009500710: In both cases, this leads to the identification of compound members as individual words, while e.g. in German orthography, compound members are not separated by space marks and the tendency is thus to identify the entire compound as a single word. 10950710 -> 1000009500720: Compare e.g. English capital city with German Hauptstadt and Chinese 首都 (lit. chief metropolis): all three are equivalent compounds, in the English case consisting of "two words" separated by a space mark, in the German case written as a "single word" without space mark, and in the Chinese case consisting of two logographic characters. 10950720 -> 1000009500730: Morphology 10950730 -> 1000009500740: In synthetic languages, a single word stem (for example, love) may have a number of different forms (for example, loves, loving, and loved). 10950740 -> 1000009500750: However, these are not usually considered to be different words, but different forms of the same word. 10950750 -> 1000009500760: In these languages, words may be considered to be constructed from a number of morphemes. 10950760 -> 1000009500770: In Indo-European languages in particular, the morphemes distinguished are 10950770 -> 1000009500780: the root 10950780 -> 1000009500790: optional suffixes 10950790 -> 1000009500800: a desinence. 10950800 -> 1000009500810: Thus, the Proto-Indo-European *wr̥dhom would be analysed as consisting of 10950810 -> 1000009500820: *wr̥-, the zero grade of the root *wer- 10950820 -> 1000009500830: a root-extension *-dh- (diachronically a suffix), resulting in a complex root *wr̥dh- 10950830 -> 1000009500840: The thematic suffix *-o- 10950840 -> 1000009500850: the neuter gender nominative or accusative singular desinence *-m. 10950850 -> 1000009500860: Classes 10950860 -> 1000009500870: Grammar classifies a language's lexicon into several groups of words. 10950870 -> 1000009500880: The basic bipartite division possible for virtually every natural language is that of nouns vs. verbs. 10950880 -> 1000009500890: The classification into such classes is in the tradition of Dionysius Thrax, who distinguished eight categories: noun, verb, adjective, pronoun, preposition, adverb, conjunction, interjection. 10950890 -> 1000009500900: In Indian grammatical tradition, Panini introduced a similar fundamental classification into a nominal (nāma, suP) and a verbal (ākhyāta, tiN) class, based on the set of desinences taken by the word. 
Word sense disambiguation 10980010 -> 1000009600020: Word sense disambiguation 10980020 -> 1000009600030: In computational linguistics, word sense disambiguation (WSD) is the process of identifying which sense of a word (having a number of distinct senses) is used in a given sentence. 10980030 -> 1000009600040: For example, consider the word bass, two distinct senses of which are: 10980040 -> 1000009600050: a type of fish 10980050 -> 1000009600060: tones of low frequency 10980060 -> 1000009600070: and the sentences: 10980070 -> 1000009600080: I went fishing for some sea bass 10980080 -> 1000009600090: The bass line of the song is very moving 10980090 -> 1000009600100: Explanation 10980100 -> 1000009600110: To a human it is obvious that the first sentence is using the word bass in the first sense above, and that in the second sentence it is being used in the second sense. 10980110 -> 1000009600120: Although this seems obvious to a human, developing algorithms to replicate this human ability is a difficult task. 10980120 -> 1000009600130: Difficulties 10980130 -> 1000009600140: One problem with word sense disambiguation is deciding what the senses are. 10980140 -> 1000009600150: In cases like the word bass above, at least some senses are obviously different. 10980150 -> 1000009600160: In other cases, however, the different senses can be closely related (one meaning being a metaphorical or metonymic extension of another), and in such cases division of words into senses becomes much more difficult. 10980160 -> 1000009600170: Different dictionaries will provide different divisions of words into senses. 10980170 -> 1000009600180: One solution some researchers have used is to choose a particular dictionary, and just use its set of senses. 10980180 -> 1000009600190: Generally, however, research results using broad distinctions in senses have been much better than those using narrow, so most researchers ignore the fine-grained distinctions in their work. 10980190 -> 1000009600200: Another problem is inter-judge variance. 10980200 -> 1000009600210: WSD systems are normally tested by having their results on a task compared against those of a human. 10980210 -> 1000009600220: However, humans do not agree on the task at hand — give a list of senses and sentences, and humans will not always agree on which word belongs in which sense. 10980220 -> 1000009600230: A computer cannot be expected to give better performance on such a task than a human (indeed, since the human serves as the standard, the computer being better than the human is incoherent), so the human performance serves as an upper bound. 10980230 -> 1000009600240: Human performance, however, is much better on coarse-grained than fine-grained distinctions, so this again is why research on coarse-grained distinctions is most useful. 10980240 -> 1000009600250: Approaches 10980250 -> 1000009600260: As in all natural language processing, there are two main approaches to WSD — deep approaches and shallow approaches. 10980260 -> 1000009600270: Deep approaches presume access to a comprehensive body of world knowledge. 10980270 -> 1000009600280: Knowledge such as "you can go fishing for a type of fish, but not for low frequency sounds" and "songs have low frequency sounds as parts, but not types of fish" is then used to determine in which sense the word is used. 
10980280 -> 1000009600290: These approaches are not very successful in practice, mainly because such a body of knowledge does not exist in computer-readable format outside of very limited domains. 10980290 -> 1000009600300: But if such knowledge did exist, they would be much more accurate than the shallow approaches. 10980300 -> 1000009600310: However, there is a long tradition in computational linguistics of trying such approaches in terms of coded knowledge, and in some cases it is hard to say clearly whether the knowledge involved is linguistic or world knowledge. 10980310 -> 1000009600320: The first attempt was that by Margaret Masterman and her colleagues at the Cambridge Language Research Unit in England in the 1950s. 10980320 -> 1000009600330: This used as data a punched-card version of Roget's Thesaurus and its numbered "heads" as indicators of topics, and looked for their repetitions in text using a set intersection algorithm: it was not very successful (and is described in some detail in Wilks, Y. et al., 1996), but had strong relationships to later work, especially Yarowsky's machine learning optimisation of a thesaurus method in the 1990s (see below). 10980330 -> 1000009600340: Shallow approaches do not try to understand the text. 10980340 -> 1000009600350: They just consider the surrounding words, using information like "if bass has the words sea or fishing nearby, it is probably in the fish sense; if bass has the words music or song nearby, it is probably in the music sense." 10980350 -> 1000009600360: These rules can be automatically derived by the computer, using a training corpus of words tagged with their word senses. 10980360 -> 1000009600370: This approach, while theoretically not as powerful as deep approaches, gives superior results in practice, due to computers' limited world knowledge. 10980370 -> 1000009600380: It can, though, be confused by sentences like The dogs bark at the tree, which contains the word bark near both tree and dogs. 10980380 -> 1000009600390: These approaches normally work by defining a window of N content words around each word to be disambiguated in the corpus, and statistically analyzing those N surrounding words. 10980390 -> 1000009600400: Two shallow approaches used to train and then disambiguate are Naïve Bayes classifiers and decision trees. 10980400 -> 1000009600410: In recent research, kernel-based methods such as support vector machines have shown superior performance in supervised learning. 10980410 -> 1000009600420: But over the last few years, there has not been any major improvement in the performance of any of these methods. 10980420 -> 1000009600430: It is instructive to compare the word sense disambiguation problem with the problem of part-of-speech tagging. 10980430 -> 1000009600440: Both involve disambiguating or tagging words, be it with senses or with parts of speech. 10980440 -> 1000009600450: However, algorithms used for one do not tend to work well for the other, mainly because the part of speech of a word is primarily determined by the immediately adjacent one to three words, whereas the sense of a word may be determined by words further away. 10980450 -> 1000009600460: The success rate for part-of-speech tagging algorithms is at present much higher than that for WSD, the state of the art being around 95% accuracy or better, as compared to less than 75% accuracy in word sense disambiguation with supervised learning. 10980460 -> 1000009600470: These figures are typical for English, and may be very different from those for other languages.
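The window-based, statistical idea described above can be made concrete with a short sketch. The following Python fragment is a minimal, illustrative Naïve Bayes word sense classifier over a small window of surrounding words; the function names and the toy training data are invented for the example and are not taken from any particular WSD system.

# Minimal sketch of a window-based Naive Bayes word sense classifier,
# trained on a toy sense-tagged corpus; illustrative only.
import math
from collections import Counter, defaultdict

def context_window(tokens, index, n=3):
    """Return up to n words on each side of tokens[index]."""
    left = tokens[max(0, index - n):index]
    right = tokens[index + 1:index + 1 + n]
    return left + right

def train(tagged_examples, n=3):
    """tagged_examples: list of (tokens, target_index, sense) triples."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    for tokens, i, sense in tagged_examples:
        sense_counts[sense] += 1
        for w in context_window(tokens, i, n):
            word_counts[sense][w] += 1
    return sense_counts, word_counts

def disambiguate(tokens, index, sense_counts, word_counts, n=3):
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, count in sense_counts.items():
        # log P(sense) + sum of log P(word | sense), with simple add-one smoothing
        vocab = len(word_counts[sense]) + 1
        denom = sum(word_counts[sense].values()) + vocab
        score = math.log(count / total)
        for w in context_window(tokens, index, n):
            score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Toy usage with the "bass" example from above (invented mini-corpus):
examples = [
    ("i went fishing for some sea bass".split(), 6, "fish"),
    ("the bass line of the song is moving".split(), 1, "music"),
]
sc, wc = train(examples)
print(disambiguate("fresh sea bass for dinner".split(), 2, sc, wc))  # -> "fish" on this toy data

A realistic classifier of this kind would be trained on a large sense-tagged corpus and would restrict the window to content words; the sketch only shows the counting and smoothed scoring that such a classifier performs.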
10980470 -> 1000009600480: Another aspect of word sense disambiguation that differentiates it from part-of-speech tagging is the availability of training data. 10980480 -> 1000009600490: While it is relatively easy to assign parts of speech to text, training people to tag senses is far more difficult. 10980490 -> 1000009600500: While users can memorize all of the possible parts of speech a word can take, it is impossible for individuals to memorize all of the senses a word can take. 10980500 -> 1000009600510: Thus, many word sense disambiguation algorithms use semi-supervised learning, which allows both labeled and unlabeled data. 10980510 -> 1000009600520: The Yarowsky algorithm was an early example of such an algorithm. 10980520 -> 1000009600530: Yarowsky’s unsupervised algorithm uses the ‘One sense per collocation’ and the ‘One sense per discourse’ properties of human languages for word sense disambiguation. 10980530 -> 1000009600540: From observation, words tend to exhibit only one sense in a given discourse and in a given collocation. 10980540 -> 1000009600550: The corpus is initially untagged. 10980550 -> 1000009600560: The algorithm starts with a large corpus, in which it identifies examples of the given polysemous word, and stores all the relevant sentences as lines. 10980560 -> 1000009600570: For instance, Yarowsky uses the word ‘plant’ in his 1995 paper to demonstrate the algorithm. 10980570 -> 1000009600580: Assuming that there are two possible senses of the word, the next step is to identify a small number of seed collocations representative of each sense, give each sense a label (i.e. sense A and sense B), and then assign the appropriate label to all training examples containing the seed collocations. 10980580 -> 1000009600590: In this case, the words ‘life’ and ‘manufacturing’ are chosen as initial seed collocations for senses A and B respectively. 10980590 -> 1000009600600: The residual examples (85% to 98% according to Yarowsky) remain untagged. 10980600 -> 1000009600610: The algorithm should initially choose representative seed collocations that will distinguish senses A and B accurately and productively. 10980610 -> 1000009600620: This can be done by selecting seed words from a dictionary’s entry for that sense. 10980620 -> 1000009600630: The collocations tend to have a stronger effect if they are adjacent to the target word, and the effect weakens with distance. 10980630 -> 1000009600640: According to the criteria given in Yarowsky (1993), seed words that appear in the most reliable collocational relationships with the target word will be selected. 10980640 -> 1000009600650: The effect is much stronger for words in a predicate-argument relationship than for arbitrary associations at the same distance to the target word, and is much stronger for collocations with content words than with function words. 10980650 -> 1000009600660: Having said this, a collocation word can have several collocational relationships with the target word throughout the corpus. 10980660 -> 1000009600670: This could give the word different rankings or even different classifications. 10980670 -> 1000009600680: Alternatively, seeds can be chosen by identifying a single defining collocate for each class, and using for seeds only those contexts containing one of these defining words. 10980680 -> 1000009600690: A publicly available database called WordNet can be used as an automatic source for such defining terms.
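As a concrete illustration of this seed-labeling step, the following Python sketch labels contexts that contain one of the illustrative seeds ('life' for sense A, 'manufacturing' for sense B) and leaves everything else in the residual; the function name and toy sentences are invented for the example and are not Yarowsky's code.

# Sketch of the seed-labeling step: contexts containing a seed collocation
# get the corresponding sense label; everything else stays in the residual.
def seed_label(contexts, seeds):
    """contexts: list of token lists containing the target word.
    seeds: dict mapping a seed collocation to a sense label,
    e.g. {"life": "A", "manufacturing": "B"}."""
    labeled, residual = [], []
    for tokens in contexts:
        senses = {sense for word, sense in seeds.items() if word in tokens}
        if len(senses) == 1:          # unambiguous seed match
            labeled.append((tokens, senses.pop()))
        else:                          # no seed, or conflicting seeds
            residual.append(tokens)
    return labeled, residual

contexts = [
    "animal and plant life in the rain forest".split(),      # toy data
    "the manufacturing plant closed last year".split(),
    "water the plant twice a week".split(),
]
labeled, residual = seed_label(contexts, {"life": "A", "manufacturing": "B"})
print(labeled)    # contexts containing 'life' or 'manufacturing' get label A or B
print(residual)   # the rest remains untagged, as in the description above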
10980690 -> 1000009600700: In addition, words that occur near the target word with great frequency can be selected as representative seed collocations. 10980700 -> 1000009600710: This approach is not fully automatic: a human judge must decide which word will be selected for each target word’s sense, and the outputs will be reliable indicators of the senses. 10980710 -> 1000009600720: A decision-list algorithm is then used to identify other reliable collocations. 10980720 -> 1000009600730: This training algorithm calculates the probability P(Sense | Collocation), and the decision list is ranked by the log-likelihood ratio: log( P(Sense_A | Collocation_i) / P(Sense_B | Collocation_i) ). 10980730 -> 1000009600740: A smoothing algorithm will then be used to avoid zero values. 10980740 -> 1000009600750: The decision-list algorithm resolves many problems in a large set of non-independent evidence sources by using only the most reliable piece of evidence rather than the whole matching collocation set. 10980750 -> 1000009600760: The new resulting classifier will then be applied to the whole sample set. 10980760 -> 1000009600770: Add those examples in the residual that are tagged as A or B with probability above a reasonable threshold to the seed sets. 10980770 -> 1000009600780: Apply the decision-list algorithm and the above adding step iteratively. 10980780 -> 1000009600790: As more newly learned collocations are added to the seed sets, the sense A or sense B set will grow, and the original residual will shrink. 10980790 -> 1000009600800: However, these collocations stay in the seed sets only if their probability of classification remains above the threshold; otherwise they are returned to the residual for later classification. 10980800 -> 1000009600810: At the end of each iteration, the ‘One sense per discourse’ property can be used to help weed out initially mistagged collocates and hence improve the purity of the seed sets. 10980810 -> 1000009600820: In order to avoid strong collocates becoming indicators for the wrong class, the class-inclusion threshold needs to be randomly altered. 10980820 -> 1000009600830: For the same purpose, after intermediate convergence the algorithm will also need to increase the width of the context window. 10980830 -> 1000009600840: The algorithm will continue to iterate until no more reliable collocations are found. 10980840 -> 1000009600850: The ‘One sense per discourse’ property can be used here for error correction. 10980850 -> 1000009600860: For a target word that has a binary sense partition, if the occurrences of the majority sense A exceed those of the minority sense B by a certain threshold, the minority ones will be relabeled as A. According to Yarowsky, for any sense to be clearly dominant, the occurrences of the target word should not be less than 4. 10980860 -> 1000009600870: When the algorithm converges on a stable residual set, a final decision list of the target word is obtained. 10980870 -> 1000009600880: The most reliable collocations are at the top of the new list instead of the original seed words. 10980880 -> 1000009600890: The original untagged corpus is then tagged with sense labels and probabilities. 10980890 -> 1000009600900: The final decision list may now be applied to new data: the collocation with the highest rank in the list is used to classify the new data. 10980900 -> 1000009600910: For example, if the highest ranking collocation of the target word in the new data set is of sense A, then the target word is classified as sense A.
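The ranking and bootstrapping steps described above can be sketched in a few lines of Python. This is a condensed illustration rather than Yarowsky's original implementation: add-one smoothing stands in for the unspecified smoothing step, and all names are invented for the example.

# Sketch of a decision-list ranking by smoothed log-likelihood ratio,
# following the formula given above; add-one smoothing avoids zero counts.
import math
from collections import Counter

def build_decision_list(labeled):
    """labeled: list of (tokens, sense) pairs with senses 'A' and 'B'.
    Returns collocations sorted by |log(P(A|c)/P(B|c))|, strongest first."""
    counts = {"A": Counter(), "B": Counter()}
    for tokens, sense in labeled:
        for w in set(tokens):
            counts[sense][w] += 1
    rules = []
    for w in set(counts["A"]) | set(counts["B"]):
        a = counts["A"][w] + 1          # add-one smoothing
        b = counts["B"][w] + 1
        llr = math.log(a / b)
        rules.append((abs(llr), w, "A" if llr > 0 else "B"))
    return sorted(rules, reverse=True)

def classify(tokens, decision_list):
    """Apply only the single most reliable matching rule, as described above."""
    for strength, w, sense in decision_list:
        if w in tokens:
            return sense
    return None

# One bootstrapping pass: classify the residual, absorb confident examples,
# and rebuild the list; repeating this mirrors the iteration described above.
def bootstrap_once(labeled, residual, threshold=1.0):
    dlist = build_decision_list(labeled)
    still_residual = []
    for tokens in residual:
        for strength, w, sense in dlist:
            if w in tokens and strength >= threshold:
                labeled.append((tokens, sense))
                break
        else:
            still_residual.append(tokens)
    return labeled, still_residual, dlist

Repeating bootstrap_once until no new examples cross the threshold corresponds to the convergence condition described in this section; the real algorithm additionally applies the discourse constraint and adjusts the threshold and window, which the sketch omits.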
WordNet 10960010 -> 1000009700020: WordNet 10960020 -> 1000009700030: WordNet is a semantic lexicon for the English language. 10960030 -> 1000009700040: It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. 10960040 -> 1000009700050: The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. 10960050 -> 1000009700060: The database and software tools have been released under a BSD style license and can be downloaded and used freely. 10960060 -> 1000009700070: The database can also be browsed online. 10960070 -> 1000009700080: WordNet was created and is being maintained at the Cognitive Science Laboratory of Princeton University under the direction of psychology professor George A. Miller. 10960080 -> 1000009700090: Development began in 1985. 10960090 -> 1000009700100: Over the years, the project received about $3 million of funding, mainly from government agencies interested in machine translation. 10960100 -> 1000009700110: In recent years, Dr. Christiane Fellbaum has overseen the development of WordNet. 10960110 -> 1000009700120: Database contents 10960120 -> 1000009700130: As of 2006, the database contains about 150,000 words organized in over 115,000 synsets for a total of 207,000 word-sense pairs; in compressed form, it is about 12 megabytes in size. 10960130 -> 1000009700140: WordNet distinguishes between nouns, verbs, adjectives and adverbs because they follow different grammatical rules. 10960140 -> 1000009700150: Every synset contains a group of synonymous words or collocations (a collocation is a sequence of words that go together to form a specific meaning, such as "car pool"); different senses of a word are in different synsets. 10960150 -> 1000009700160: The meaning of the synsets is further clarified with short defining glosses (Definitions and/or example sentences). 10960160 -> 1000009700170: A typical example synset with gloss is: 10960170 -> 1000009700180: good, right, ripe -- (most suitable or right for a particular purpose; "a good time to plant tomatoes"; "the right time to act"; "the time is ripe for great sociological changes") 10960180 -> 1000009700190: Most synsets are connected to other synsets via a number of semantic relations. 
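As a quick illustration of synsets and glosses, the following Python snippet uses the Natural Language Toolkit interface mentioned under Interfaces below; it assumes NLTK is installed and the WordNet corpus has been downloaded (for example via nltk.download('wordnet')).

# Sketch: looking up synsets and glosses with NLTK's WordNet interface.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("good"):
    # Each synset groups synonymous lemmas and carries a defining gloss.
    print(synset.name(), [lemma.name() for lemma in synset.lemmas()])
    print("   gloss:", synset.definition())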
10960190 -> 1000009700200: These relations vary based on the type of word, and include: 10960200 -> 1000009700210: Nouns 10960210 -> 1000009700220: hypernyms: Y is a hypernym of X if every X is a (kind of) Y (canine is a hypernym of dog) 10960220 -> 1000009700230: hyponyms: Y is a hyponym of X if every Y is a (kind of) X (dog is a hyponym of canine) 10960230 -> 1000009700240: coordinate terms: Y is a coordinate term of X if X and Y share a hypernym (wolf is a coordinate term of dog, and dog is a coordinate term of wolf) 10960240 -> 1000009700250: holonym: Y is a holonym of X if X is a part of Y (building is a holonym of window) 10960250 -> 1000009700260: meronym: Y is a meronym of X if Y is a part of X (window is a meronym of building) 10960260 -> 1000009700270: Verbs 10960270 -> 1000009700280: hypernym: the verb Y is a hypernym of the verb X if the activity X is a (kind of) Y (travel is a hypernym of movement) 10960280 -> 1000009700290: troponym: the verb Y is a troponym of the verb X if the activity Y is doing X in some manner (to lisp is a troponym of to talk) 10960290 -> 1000009700300: entailment: the verb Y is entailed by X if by doing X you must be doing Y (to sleep is entailed by to snore) 10960300 -> 1000009700310: coordinate terms: those verbs sharing a common hypernym (to lisp and to yell) 10960310 -> 1000009700320: Adjectives 10960320 -> 1000009700330: related nouns 10960330 -> 1000009700340: similar to 10960340 -> 1000009700350: participle of verb 10960350 -> 1000009700360: Adverbs 10960360 -> 1000009700370: root adjectives 10960370 -> 1000009700380: While semantic relations apply to all members of a synset (because the members share a meaning and are all mutually synonymous), individual words can also be connected to other words through lexical relations, including antonymy (being opposites of each other) and derivational relatedness. 10960380 -> 1000009700390: WordNet also provides the polysemy count of a word: the number of synsets that contain the word. 10960390 -> 1000009700400: If a word participates in several synsets (i.e. has several senses) then typically some senses are much more common than others. 10960400 -> 1000009700410: WordNet quantifies this with a frequency score: several sample texts have all their words semantically tagged with the corresponding synset, and a count is then provided indicating how often a word appears in a specific sense. 10960410 -> 1000009700420: The morphology functions of the software distributed with the database try to deduce the lemma or root form of a word from the user's input; only the root form is stored in the database unless it has irregular inflected forms. 10960420 -> 1000009700430: Knowledge structure 10960430 -> 1000009700440: Both nouns and verbs are organized into hierarchies, defined by hypernym or IS A relationships. 10960440 -> 1000009700450: For instance, the first sense of the word dog would have the following hypernym hierarchy; the words at the same level are synonyms of each other: some sense of dog is synonymous with some other senses of domestic dog and Canis familiaris, and so on. 10960450 -> 1000009700460: Each set of synonyms (synset) has a unique index and shares its properties, such as a gloss (or dictionary) definition. 10960460 -> 1000009700470: dog, domestic dog, Canis familiaris => canine, canid => carnivore => placental, placental mammal, eutherian, eutherian mammal => mammal => vertebrate, craniate => chordate => animal, animate being, beast, brute, creature, fauna => ...
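A hierarchy like the one above can be reproduced programmatically. The following Python sketch walks the hypernym chain of the first sense of dog using NLTK's WordNet interface (see Interfaces below); it assumes NLTK and its WordNet data are installed.

# Sketch: walking the hypernym (IS A) hierarchy for the first sense of "dog",
# reproducing the kind of chain shown above.
from nltk.corpus import wordnet as wn

synset = wn.synsets("dog")[0]          # dog.n.01: dog, domestic dog, Canis familiaris
while synset:
    print(" / ".join(lemma.name() for lemma in synset.lemmas()))
    hypernyms = synset.hypernyms()
    synset = hypernyms[0] if hypernyms else None

# Related queries: synset.hyponyms(), synset.part_meronyms(), synset.member_holonyms()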
10960470 -> 1000009700480: At the top level, these hierarchies are organized into base types, 25 primitive groups for nouns, and 15 for verbs. 10960480 -> 1000009700490: These groups form lexicographic files at a maintenance level. 10960490 -> 1000009700500: These primitive groups are connected to an abstract root node that has, for some time, been assumed by various applications that use WordNet. 10960500 -> 1000009700510: In the case of adjectives, the organization is different. 10960510 -> 1000009700520: Two opposite 'head' senses work as binary poles, while 'satellite' synonyms connect to each of the heads via synonymy relations. 10960520 -> 1000009700530: Thus, the hierarchies, and the concept involved with lexicographic files, do not apply here the same way they do for nouns and verbs. 10960530 -> 1000009700540: The network of nouns is far deeper than that of the other parts of speech. 10960540 -> 1000009700550: Verbs have a far bushier structure, and adjectives are organized into many distinct clusters. 10960550 -> 1000009700560: Adverbs are defined in terms of the adjectives they are derived from, and thus inherit their structure from that of the adjectives. 10960560 -> 1000009700570: Psychological justification 10960570 -> 1000009700580: The goal of WordNet was to develop a system that would be consistent with the knowledge acquired over the years about how human beings process language. 10960580 -> 1000009700590: Anomic aphasia, for example, creates a condition that seems to selectively encumber individuals' ability to name objects; this makes the decision to partition the parts of speech into distinct hierarchies more of a principled decision than an arbitrary one. 10960590 -> 1000009700600: In the case of hyponymy, psychological experiments revealed that individuals can access properties of nouns more quickly when fewer levels of hyponymy separate the noun from the level at which a characteristic is a defining property. 10960600 -> 1000009700610: That is, individuals can quickly verify that canaries can sing because a canary is a songbird (only one level of hyponymy), but require slightly more time to verify that canaries can fly (two levels of hyponymy) and even more time to verify that canaries have skin (multiple levels of hyponymy). 10960610 -> 1000009700620: This suggests that we too store semantic information in a way that is much like WordNet, because we only retain the most specific information needed to differentiate one particular concept from similar concepts. 10960620 -> 1000009700630: WordNet as an ontology 10960630 -> 1000009700640: The hypernym/hyponym relationships among the noun synsets can be interpreted as specialization relations between conceptual categories. 10960640 -> 1000009700650: In other words, WordNet can be interpreted and used as a lexical ontology in the computer science sense. 10960650 -> 1000009700660: However, such an ontology should normally be corrected before being used, since it contains hundreds of basic semantic inconsistencies such as (i) the existence of common specializations for exclusive categories and (ii) redundancies in the specialization hierarchy. 10960660 -> 1000009700670: Furthermore, transforming WordNet into a lexical ontology usable for knowledge representation should normally also involve (i) distinguishing the specialization relations into subtypeOf and instanceOf relations, and (ii) associating intuitive unique identifiers with each category.
10960670 -> 1000009700680: Although such corrections and transformations have been performed and documented as part of the integration of WordNet 1.7 into the cooperatively updatable knowledge base of WebKB-2, most projects claiming to re-use WordNet for knowledge-based applications (typically, knowledge-oriented information retrieval) simply re-use it directly. 10960680 -> 1000009700690: Limitations 10960690 -> 1000009700700: Unlike other dictionaries, WordNet does not include information about etymology, pronunciation and the forms of irregular verbs and contains only limited information about usage. 10960700 -> 1000009700710: The actual lexicographical and semantical information is maintained in lexicographer files, which are then processed by a tool called grind to produce the distributed database. 10960710 -> 1000009700720: Both grind and the lexicographer files are freely available in a separate distribution, but modifying and maintaining the database requires expertise. 10960720 -> 1000009700730: Though WordNet contains a sufficient wide range of common words, it does not cover special domain vocabulary. 10960730 -> 1000009700740: Since it is primarily designed to act as an underlying database for different applications, those applications cannot be used in specific domains that are not covered by WordNet. 10960740 -> 1000009700750: Applications in Information Systems 10960750 -> 1000009700760: WordNet has been used for a number of different purposes in information systems, including word sense disambiguation, information retrieval, automatic text classification, automatic text summarization, and even automatic crossword puzzle generation. 10960760 -> 1000009700770: A project at Brown University started by Jeff Stibel, James A. Anderson, Steve Reiss and others called Applied Cognition Lab created a disambiguator using WordNet in 1998. 10960770 -> 1000009700780: The project later morphed into a company called Simpli, which is now owned by ValueClick. 10960780 -> 1000009700790: George Miller joined the Company as a member of the Advisory Board. 10960790 -> 1000009700800: Simpli built an Internet search engine that utilized a knowledgebase principally based on WordNet to disambiguate and expand keywords and synsets to help retrieve information online. 10960800 -> 1000009700810: WordNet was expanded upon to add increased dimensionality, such as intentionality (used for x), people (Albert Einstein) and colloquial terminology more relevant to Internet search (i.e., blogging, ecommerce). 10960810 -> 1000009700820: Neural network algorithms searched the expanded WordNet for related terms to disambiguate search keywords (Java, in the sense of coffee) and expand the search synset (Coffee, Drink, Joe) to improve search engine results. 10960820 -> 1000009700830: Before the company was acquired, it performed searches across search engines such as Google, Yahoo!, Ask.com and others. 10960830 -> 1000009700840: Another prominent example of the use of WordNet is to determine the similarity between words. 10960840 -> 1000009700850: Various algorithms have been proposed, and these include considering the distance between the conceptual categories of words, as well as considering the hierarchical structure of the WordNet ontology. 10960850 -> 1000009700860: A number of these WordNet-based word similarity algorithms are implemented in a Perl package called WordNet::Similarity. 
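For readers who want to experiment, comparable path-based and depth-based similarity measures are available through NLTK's WordNet interface; the snippet below is a small illustration of the idea and is not the Perl WordNet::Similarity package itself.

# Sketch: two common WordNet-based similarity measures, as implemented in NLTK
# (the Perl package WordNet::Similarity offers a larger set of comparable measures).
from nltk.corpus import wordnet as wn

dog, cat, car = wn.synset("dog.n.01"), wn.synset("cat.n.01"), wn.synset("car.n.01")
# path_similarity: based on the shortest path between synsets in the hypernym hierarchy
print(dog.path_similarity(cat), dog.path_similarity(car))
# wup_similarity (Wu-Palmer): based on the depth of the synsets' least common subsumer
print(dog.wup_similarity(cat), dog.wup_similarity(car))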
10960860 -> 1000009700870: Interfaces 10960870 -> 1000009700880: Princeton maintains a list of related projects that includes links to some of the widely used application programming interfaces available for accessing WordNet using various programming languages and environments. 10960880 -> 1000009700890: Other interfaces include the following: 10960890 -> 1000009700900: WordNet on Ajax::DefineItFast.com allows users to browse Wordnet 3.0 using an ajax interface. 10960900 -> 1000009700910: The Jawbone project provides a Java API to the WordNet 2.1 and 3.0 data. 10960910 -> 1000009700920: The source code is released under the MIT license. 10960920 -> 1000009700930: The Natural Language Toolkit provides a Python API to the WordNet 3.0. 10960930 -> 1000009700940: Lingua::Wordnet provides a Perl interface to WordNet. 10960940 -> 1000009700950: WordNet::Similarity Perl module for computing measures of semantic relatedness. 10960950 -> 1000009700960: Dictionary::CozyEnglish implemented a WordNet 3.0 interface that integrates with WordPress. 10960960 -> 1000009700970: Blog and website owners can embed this API via a set of HTML code. 10960970 -> 1000009700980: The Visual Thesaurus is a subscription-based commercial application that presents WordNet data through an innovative and user-friendly interface. 10960980 -> 1000009700990: WordWeb is an extended dictionary based on WordNet, also available commercially as SQL tables for use in other applications. 10960990 -> 1000009701000: Includes many additional terms, derived forms and pronunciations. 10961000 -> 1000009701010: Visual representation of WordNet - interface which attempts to visualise the relations. 10961010 -> 1000009701020: Related projects 10961020 -> 1000009701030: The EuroWordNet project has produced WordNets for several European languages and linked them together; these are not freely available however. 10961030 -> 1000009701040: The Global Wordnet project attempts to coordinate the production and linking of "wordnets" for all languages. 10961040 -> 1000009701050: Oxford University Press, the publisher of the Oxford English Dictionary, has voiced plans to produce their own online competitor to WordNet. 10961050 -> 1000009701060: The eXtended WordNet is a project at the University of Texas at Dallas which aims to improve WordNet by semantically parsing the glosses, thus making the information contained in these definitions available for automatic knowledge processing systems. 10961060 -> 1000009701070: It is also freely available under a license similar to WordNet's. 10961070 -> 1000009701080: The GCIDE project produces a dictionary by combining a public domain Webster's Dictionary from 1913 with some WordNet definitions and material provided by volunteers. 10961080 -> 1000009701090: It is released under the copyleft license GPL. 10961090 -> 1000009701100: WordNet is also commonly re-used via mappings between the WordNet categories and the categories from other ontologies. 10961100 -> 1000009701110: Most often, only the top-level categories of WordNet are mapped. 10961110 -> 1000009701120: However, the authors of the SUMO ontology have produced a mapping between all of the WordNet synsets, (including nouns, verbs, adjectives and adverbs), and SUMO classes. 10961120 -> 1000009701130: The most recent addition of the mappings provides links to all of the more specific terms in the MId-Level Ontology (MILO), which extends SUMO. 10961130 -> 1000009701140: OpenCyc has 12,000 terms linked to WordNet synonym sets. 
10961140 -> 1000009701150: In most works that claim to have integrated WordNet into other ontologies, the content of WordNet has not simply been corrected when semantic problems have been encountered; instead, WordNet has been used as an inspiration source but heavily re-interpreted and updated whenever suitable. 10961150 -> 1000009701160: This was the case when, for example, the top-level ontology of WordNet was re-structured according to the OntoClean based approach or when WordNet was used as a primary source for constructing the lower classes of the SENSUS ontology. 10961160 -> 1000009701170: FrameNet is a project similar to WordNet. 10961170 -> 1000009701180: It consists of a lexicon which is based on annotating over 100,000 sentences with their semantic properties. 10961180 -> 1000009701190: The unit in focus is the lexical frame, a type of state or event together with the properties associated with it. 10961190 -> 1000009701200: An independent project titled wordNet with an initial lowercase w is an ongoing project to links words and phrases via a custom Web crawler. 10961200 -> 1000009701210: Lexical markup framework (LMF) is a work in progress within ISO/TC37 in order to define a common standardized framework for the construction of lexicons, including WordNet. 10961210 -> 1000009701220: The BalkaNet project has produced WordNets for six European languages (Bulgarian, Czech, Greek, Romanian, Turkish and Serbian). 10961220 -> 1000009701230: For this project, freely available XML-based WordNet editor was developed. 10961230 -> 1000009701240: This editor - VisDic - is not in active development anymore, but is still used for the creation of various WordNets. 10961240 -> 1000009701250: Its successor, DEBVisDic, is client-server application and is currently used for the editing of several WordNets (Dutch in Cornetto project, Polish, Hungarian, several African languages, Chinese). WordPerfect 10970010 -> 1000009800020: WordPerfect 10970020 -> 1000009800030: WordPerfect is a proprietary word processing application. 10970030 -> 1000009800040: At the height of its popularity in the late 1980s and early 1990s, it was the de facto standard word processor, but has since been eclipsed in sales by Microsoft Word. 10970040 -> 1000009800050: Although the MS-DOS and Microsoft Windows versions are best known, its popularity was based on the fact that it had been available for a wide variety of computers and operating systems, including Mac OS, Linux, the Apple IIe, a separate version for the Apple IIgs, most popular versions of Unix, VMS, Data General, System/370, AmigaOS, Atari ST, OS/2, and NeXTSTEP. 10970050 -> 1000009800060: WordPerfect for DOS 10970060 -> 1000009800070: WordPerfect was originally produced by Bruce Bastian and Dr. Alan Ashton who founded Satellite Software International, Inc. of Orem, Utah, which later renamed itself WordPerfect Corporation. 10970070 -> 1000009800080: Originally written for Data General minicomputers, in 1982 the developers ported the program to the IBM PC as WordPerfect 2.20, continuing the version numbering of the Data General series. 10970080 -> 1000009800090: The program's popularity took off with the introduction of WordPerfect 4.2 in 1986, with automatic paragraph numbering (important to the law office market), and the splitting of a lengthy footnote and its partial overflow to the bottom of the next page, as if it had been professionally typeset (valuable to both the law office and academic markets). 
10970090 -> 1000009800100: WordPerfect 4.2 became the first program to overtake the original microcomputer word processor market leader, WordStar, in a major application category on the DOS platform. 10970100 -> 1000009800110: In 1989, WordPerfect Corporation released the program's most successful version ever, WordPerfect 5.1 for DOS, which was the first version to include Macintosh-style pull-down menus to supplement the traditional F-key combinations, as well as support for tables, a spreadsheet-like feature. 10970110 -> 1000009800120: The data format used by WordPerfect 5.1 was, for years, the most portable format in the world. 10970120 -> 1000009800130: All word processors could read (and convert) that format. 10970130 -> 1000009800140: Many conferences and magazines insisted that you ship your documents in 5.1 format. 10970140 -> 1000009800150: Unlike previous DOS versions, WordPerfect 6.0 for DOS could switch between its traditional text-based editing mode and a graphical editing mode that showed the document as it would print out, including fonts and text effects like bold, underline, and italics. 10970150 -> 1000009800160: The previous text-based versions used different colors or text color inversions to indicate various markups, and (starting with version 5.0) used a graphic mode only for an uneditable print preview that used generic fonts rather than the actual fonts that appeared on the printed page. 10970160 -> 1000009800170: Key characteristics 10970170 -> 1000009800180: To this day, WordPerfect's three major characteristics that have differentiated it from other market-leading word processors are its streaming code architecture, its Reveal Codes feature, and its unusually user-friendly macro/scripting language, PerfectScript. 10970180 -> 1000009800190: Streaming code architecture 10970190 -> 1000009800200: A key to WordPerfect's design is its streaming code architecture, which parallels the formatting features of HTML and Cascading Style Sheets. 10970200 -> 1000009800210: Documents are created much the same way that raw HTML pages are written, with text interspersed with tags that trigger treatment of data until a corresponding closing tag is encountered, at which point the settings active to the point of the opening tag resume control. 10970210 -> 1000009800220: As with HTML, tags can be nested. 10970220 -> 1000009800230: Some data structures are treated as objects within the stream, as with HTML's treatment of graphic images, e.g., footnotes and styles, but the bulk of a WordPerfect document's data and formatting codes appear as a single continuous stream. 10970230 -> 1000009800240: Styles and style libraries 10970240 -> 1000009800250: The addition of styles and style libraries in WP 5.0 provided greatly increased power and flexibility in formatting documents, while maintaining the streaming-code architecture of earlier versions. 10970250 -> 1000009800260: Prior to that, WordPerfect's only use of styles (a particular type of programming object) was the Opening Style, which contains the default settings for a document. 10970260 -> 1000009800270: Reveal codes 10970270 -> 1000009800280: The Reveal Codes feature is a second editing screen that can be toggled open and closed at the bottom of the main editing screen. 10970280 -> 1000009800290: Text is displayed in Reveal Codes interspersed with tags and occasional objects, with the tags and objects represented by named tokens.
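As a purely illustrative picture of such a token stream, the short Python sketch below prints a Reveal-Codes-like view of a made-up, simplified stream of text and paired codes; it is not WordPerfect's actual file format, whose codes are binary and far richer.

# Illustrative only: a toy "text plus paired codes" stream and a simple
# Reveal-Codes-style listing of it (token names are invented for the example).
stream = ["[Bold On]", "Important", "[Und On]", "notice", "[Und Off]",
          "today", "[Bold Off]", "[HRt]"]

def reveal_codes(stream):
    """Print text as-is and formatting codes as named tokens."""
    for token in stream:
        if token.startswith("["):
            print("  <" + token.strip("[]") + ">")   # formatting code token
        else:
            print("  " + token)                       # ordinary document text

reveal_codes(stream)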
10970290 -> 1000009800300: The scheme makes it far easier to untangle coding messes than with styles-based word processors, and object tokens can be clicked with a pointing device to directly open the configuration editor for the particular object type, e.g. clicking on a style token brings up the style editor with the particular style type displayed. 10970300 -> 1000009800310: WordPerfect users forced to change word processors by employers frequently complain on WordPerfect online forums that they are lost without Reveal Codes. 10970310 -> 1000009800320: Because of their style dependencies, efforts to create the equivalent of Reveal Codes in other word processors have produced disappointing results. 10970320 -> 1000009800330: Note that WordPerfect had this feature already in its DOS incarnations: it could be brought forward by pressing the keys 'Alt' and 'F3' together. 10970330 -> 1000009800340: Macro languages 10970340 -> 1000009800350: WordPerfect for DOS was notable for its Alt-keystroke macro facility, which was expanded with the addition of macro libraries in WP 5.0 that also allowed for Ctrl-keystroke macros, and remapping of any key as a macro. 10970350 -> 1000009800360: This enabled any sequence of keystrokes to be recorded, saved, edited, and recalled. 10970360 -> 1000009800370: Macros could examine system data, make decisions, be chained together, and operate recursively until a defined 'stop' condition was met. 10970370 -> 1000009800380: This capability provided an amazingly powerful way to rearrange data and formatting codes within a document, where the same sequence of actions needed to be performed repetitively e.g. for tabular data. 10970380 -> 1000009800390: Macros can also be edited using WordPerfect Program Editor. 10970390 -> 1000009800400: Unfortunately, this facility could not easily be ported to the subsequent Windows versions. 10970400 -> 1000009800410: A new and even more powerful interpreted token-based macro recording and scripting language was introduced for both DOS and Windows 6.0 versions, and that became the basis of the language named PerfectScript in later versions. 10970410 -> 1000009800420: PerfectScript has remained the mainstay scripting language for WordPerfect users ever since. 10970420 -> 1000009800430: PerfectScript was specifically designed to be user-friendly, thus avoiding far less user-friendly methods of scripting languages implemented on other word processing programs that require education in advanced programming concepts such as Object Oriented Programming in order to produce useful yet sophisticated and powerful macros. 10970430 -> 1000009800440: Function keys 10970440 -> 1000009800450: Like its mid-1980s competitor, MultiMate, WordPerfect used almost every possible combination of function keys with Ctrl, Alt, and Shift modifiers. 10970450 -> 1000009800460: (See example help screen on this page.) 10970460 -> 1000009800470: This was in contrast to WordStar, which used only Ctrl, in conjunction with traditional typing keys. 10970470 -> 1000009800480: Many people still know and use the function key combinations from the DOS version, which were originally designed for Data General Dasher VDUs that supported 2 groups of 5 plain, shift, control, and control shift function keys. 
10970480 -> 1000009800490: This was translated to the layout of the 1981 IBM PC keyboard, with two columns of function keys at the left end of the keyboard, but worked even better with the 1984 PC AT keyboard with 3 groups of 4 function keys across the top of the keyboard. 10970490 -> 1000009800500: With the 1981 PC keyboard, the Tab key and the related F4 (Indent) functions were adjacent. 10970500 -> 1000009800510: This plethora of keystroke possibilities, combined with the developers' wish to keep the user interface free of "clutter" such as on-screen menus, made it necessary for most users to use a keyboard template showing each function. 10970510 -> 1000009800520: Infamously, WordPerfect used F3 instead of F1 for Help, F1 instead of Esc for Cancel, and Esc for Repeat (though a configuration option in later versions allowed these functions to be rotated to locations that later became more standard). 10970520 -> 1000009800530: Printer drivers 10970530 -> 1000009800540: WordPerfect for DOS shipped with an impressive array of printer drivers - a feature that played an important role in its adoption - and also shipped with a printer driver editor called PTR, which features a flexible macro language and allows technically inclined users to customize and create printer drivers. 10970540 -> 1000009800550: Internally, WordPerfect used an extensive WordPerfect character set as its internal code. 10970550 -> 1000009800560: The precise meaning of the characters, although clearly defined and documented, can be overridden in its customizable printer drivers with PTR. 10970560 -> 1000009800570: The relationship between different typefaces and styles, and between them and the various sections in the WordPerfect character set, was also described in the printer drivers and can be customized through PTR. 10970570 -> 1000009800580: WordPerfect Library/Office 10970580 -> 1000009800590: WordPerfect Corporation produced a variety of ancillary and spin-off products. 10970590 -> 1000009800600: WordPerfect Library (introduced in 1986 and later renamed WordPerfect Office) was a package of network and stand-alone utilities for use with WordPerfect, primarily developed for offices running Novell NetWare. 10970600 -> 1000009800610: WordPerfect Library/Office included the DOS antecedents of what is now known as Novell GroupWise, a shareable package of contact management, calendaring, and related word processing utilities. 10970610 -> 1000009800620: WordPerfect Library/Office – a brand name later revived by Corel after it acquired ownership of WordPerfect and other programs still bundled under that product name as of this writing – included, amongst other utilities, a local area network (LAN) email facility and was the most popular such package in its day. 10970620 -> 1000009800630: WordPerfect Shell 10970630 -> 1000009800640: The Library/Office bundle also included a noteworthy task-switching program that ran as a shell atop DOS, branded as WordPerfect Shell. 10970640 -> 1000009800650: Task-switchers were a popular application type for the DOS operating system because of its lack of multi-tasking, which made it impractical to have many applications running at once. 10970650 -> 1000009800660: Task-switchers were programs that allocated available memory between open applications, allowing fast switching between them; an application's actions were suspended when the user switched to a different program.
10970660 -> 1000009800670: WordPerfect Shell 4.0, which was also bundled with the WordPerfect 6.x versions, had most functionality of the Windows 3.x shell but was far more versatile. 10970670 -> 1000009800680: Its automated memory management was superior to that of the Microsoft Windows shell, and Microsoft's product generally performed with far less frequent memory glitches when Windows was run as a program under Shell 4.0. 10970680 -> 1000009800690: The user interface for Shell is based on a hierarchical menu metaphor rather than the windows/folders/icons metaphor used by Microsoft. 10970690 -> 1000009800700: Shell 4.0's menu structures could be individually hot-keyed as pop-ups and its powerful menu editor allowed fast creation and editing of menu structures and menu items, with each menu item quickly configurable for entry of command lines and menu names. 10970700 -> 1000009800710: Shell 4.0 included 80 programmable clipboards, and the menu structures and menu items were also programmable using a scripting language whose scripts could themselves be chained to and from WordPerfect macros. 10970710 -> 1000009800720: The scripting language also included a keyboard buffer stuffing tool for control and operation of non-WordPerfect applications. 10970720 -> 1000009800730: Microsoft Windows had no answer to such powerful features other than a glitz of windows, icons, pointing devices, and an overwhelming marketing strategy. 10970730 -> 1000009800740: WordPerfect Shell was laid to rest along with many other popular DOS character-based tools inundated by Microsoft's marketing of Windows 95. 10970740 -> 1000009800750: Novell later licensed Shell 3.0 and 4.0 for free distribution. 10970750 -> 1000009800760: As of this writing it is still downloadable from the DataPerfect Users Group. 10970760 -> 1000009800770: WordPerfect Library/Office also included a Calculator, a flat-file database called Notebook that could be used by itself or in WordPerfect document merges, an exceptionally powerful relational database - DataPerfect - that retains a small but dedicated following despite having been dropped by WordPerfect Corporation in favour of Borland's Paradox as a companion of WP for Windows. 10970770 -> 1000009800780: Additional features continue to be added from time to time by DataPerfect's author, Lew Bastian - Bruce Bastian's older brother - a brilliant programmer who had written some of IBM's earliest disk-caching patents, and DataPerfect can now run as web server. 10970780 -> 1000009800790: LetterPerfect was a scaled down version of WordPerfect with the more advanced features removed but with file and (for the most part) keystroke compatibility. 10970790 -> 1000009800800: An implementation of Microsoft Visual Basic for Applications (VBA), introduced with WordPerfect for Windows 9.0, provides a full-featured development environment for building advanced custom WordPerfect solutions. 10970800 -> 1000009800810: These solutions are often created by corporate developers or programmers and may not be easily accessible to the typical WordPerfect user. 10970810 -> 1000009800820: For these users, PerfectScript is the better option. 10970820 -> 1000009800830: People who code scripts for WordPerfect use the Macros & Merges forum at WordPerfect Universe as their primary meeting ground. 10970830 -> 1000009800840: That site is a collaboration among other WordPerfect-related web site operators and others and functions as a portal to WordPerfect resources on the web. 
10970840 -> 1000009800850: The site also maintains an extensive clip library for use in PerfectScript programming, has the Web's largest metalink library for locating online WordPerfect resources, and has the only peer-to-peer forum on the Web for DOS WordPerfect. 10970850 -> 1000009800860: The WordPerfect template and document file formats have remained remarkably stable since the WordPerfect 6.x DOS and Windows versions. 10970860 -> 1000009800870: Complete backward compatibility has been maintained and all WordPerfect versions since 6.0 have included a feature that stores any unrecognized codes in stream location represented in Reveal Codes by an "Unknown" token. 10970870 -> 1000009800880: Documents generated on newer versions can thus be edited in older versions with the codes retained. 10970880 -> 1000009800890: Then, upon being reopened in a newer version of WordPerfect, the "unknown" tokens regain their functionality. 10970890 -> 1000009800900: None of the newer WordPerfect features reflected in the file formats cause data loss when opened in older versions. 10970900 -> 1000009800910: WordPerfect for Windows 10970910 -> 1000009800920: History 10970920 -> 1000009800930: WordPerfect was late in coming to market with a Windows version. 10970930 -> 1000009800940: The first mature version, WordPerfect 5.2 for Windows, was released in November 1992. 10970940 -> 1000009800950: Prior to that, there was a WordPerfect 5.1 for Windows, introduced a year earlier. 10970950 -> 1000009800960: That version had to be installed from DOS and was largely unpopular due to serious stability issues. 10970960 -> 1000009800970: By the time WordPerfect 5.2 for Windows was introduced, Microsoft Word for Windows version 2 had been on the market for over a year and had received its third interim release, v2.0c. WordPerfect's function-key-centered user interface did not adapt well to the new paradigm of mouse and pull-down menus, especially with many of WordPerfect's standard key combinations pre-empted by incompatible keyboard shortcuts that Windows itself used (e.g. Alt-F4 became Exit Program as opposed to WordPerfect's Block Text). 10970970 -> 1000009800980: The DOS version's impressive arsenal of finely tuned printer drivers was also rendered obsolete by Windows' use of its own printer device drivers. 10970980 -> 1000009800990: Internally, WordPerfect for Windows still used the WordPerfect character set as its internal code. 10970990 -> 1000009801000: This caused WordPerfect for Windows to be unable to support some languages — for example Chinese — that were natively supported by Windows. 10971000 -> 1000009801010: WordPerfect became part of an office suite when the company entered into a co-licensing agreement with Borland Software Corporation in 1993. 10971010 -> 1000009801020: The offerings were marketed as Borland Office, containing Windows versions of WordPerfect, Quattro Pro, Borland Paradox, and a LAN-based groupware package called WordPerfect Office (not to be confused with the complete applications suite of the same name later marketed by Corel) based on the WordPerfect Library for DOS. 10971020 -> 1000009801030: The WordPerfect product line was sold twice, first to Novell in June 1994, who then sold it to Corel in January 1996. 10971030 -> 1000009801040: However, Novell kept the WordPerfect Office technology, incorporating it into its GroupWise messaging and collaboration product. 
10971040 -> 1000009801050: Compounding WordPerfect's troubles were issues associated with the release of the first 32-bit version, WordPerfect 7, intended for use on Windows 95. 10971050 -> 1000009801060: While it contained notable improvements over the 16-bit WordPerfect for Windows 6.1, it was released in May 1996, nine months after the introduction of Windows 95 and Microsoft Office 95 (including Word 95). 10971060 -> 1000009801070: The initial release suffered from notable stability problems. 10971070 -> 1000009801080: WordPerfect 7 also didn't have a Microsoft "Designed for Windows 95" logo. 10971080 -> 1000009801090: This was important to Windows 95 software purchasers as Microsoft set standards for application design, behavior, and interaction with the operating system. 10971090 -> 1000009801100: To make matters worse, the original release of WordPerfect 7 was incompatible with Windows NT, hindering its adoption in academia. 10971100 -> 1000009801110: The "NT Enabled" version of WordPerfect 7, which Corel considered to be Service Pack 2, wasn't available until Q1-1997, over 6 months after the introduction of Windows NT 4.0, a year and a half after the introduction of Office 95 (which supported Windows NT out of the box), and shortly after the introduction of Office 97. 10971110 -> 1000009801120: Corel charged its customers to receive, what amounted to, a bug fix. 10971120 -> 1000009801130: While WordPerfect retained a majority of the retail shelf sales of word processors, Microsoft gained marketshare by including Word for Windows in its Windows product on new PCs. 10971130 -> 1000009801140: Microsoft gave discounts for Windows to OEMs who included Word on their PCs. 10971140 -> 1000009801150: When new PC buyers found Word installed on their new PC, Word began to dominate marketshare of desktop word processing. 10971150 -> 1000009801160: Amongst the remaining avid users of WordPerfect are many law firms and academics who favor the WordPerfect features such as macros and reveal codes. 10971160 -> 1000009801170: Corel now caters to these markets, with, for example, a major sale to the United States Department of Justice in 2005 . 10971170 -> 1000009801180: In November 2004, Novell filed an antitrust lawsuit against Microsoft for alleged anticompetitive behavior (viz, tying Word to sales of Windows) that Novell claims led to loss of WordPerfect market share . 10971180 -> 1000009801190: Corel WordPerfect 10971190 -> 1000009801200: Since its acquisition by Corel, WordPerfect for Windows has officially been known as Corel WordPerfect. 10971200 -> 1000009801210: Unicode and Asian language editing 10971210 -> 1000009801220: WordPerfect also lacks support for Unicode. 10971220 -> 1000009801230: The absence of support for Unicode limits its usefulness in many markets outside North America and Western Europe. 10971230 -> 1000009801240: Despite pleas from longtime users, this feature has not been implemented as of yet. 10971240 -> 1000009801250: For users in WordPerfect's traditional markets, the inability to deal with complex character sets, such as Asian language scripts, can cause difficulty when working on documents containing those characters. 10971250 -> 1000009801260: However, later versions have provided better compliance with interface conventions, file compatibility, and even Word interface emulation. 10971260 -> None: "Classic Mode" 10971270 -> None: Corel added "Classic Mode" in WordPerfect 11. 
10971280 -> 1000009801270: WordPerfect for Macintosh 10971290 -> 1000009801280: Development of WordPerfect for Macintosh did not run parallel to versions for other operating systems, and used version numbers unconnected to contemporary releases for DOS, Windows, etc. 10971300 -> 1000009801290: The first release reminded users and reviewers of the DOS version, and was not especially successful in the marketplace. 10971310 -> 1000009801300: Version 2 was a total re-write, adhering more closely to Apple's UI guidelines. 10971320 -> 1000009801310: Version 3 took this further, making extensive use of the technologies Apple introduced in Systems 7.0–7.5, while remaining fast and capable of running well on older machines. 10971330 -> 1000009801320: Corel released version 3.5 in 1996, followed by the improved version 3.5e. 10971340 -> 1000009801330: It was never updated beyond that, and the product was eventually discontinued. 10971350 -> 1000009801340: As of 2004, Corel has reiterated that the company has no plans to further develop WordPerfect for Macintosh (such as creating a native Mac OS X version). 10971360 -> 1000009801350: For several years, Corel allowed Mac users to download version 3.5e from their website free of charge, and some Mac users still use this version. 10971370 -> 1000009801360: The download is still available, along with the necessary OS 8/9/Classic Updater that slows scroll speed and restores functionality to the Style and Window menus. 10971380 -> 1000009801370: Like other Mac OS applications of its age, it requires the Classic environment on PowerPC Macs. 10971390 -> 1000009801380: While Intel Macs do not support Classic, emulators such as SheepShaver, and vMac allow users to run WordPerfect and other Mac OS applications. 10971400 -> 1000009801390: Users wishing to use an up to date version of WordPerfect can run the Windows version through Boot Camp or a Windows emulator, and through Darwine or CrossOver Mac with mixed results. 10971410 -> 1000009801400: WordPerfect for Linux 10971420 -> 1000009801410: In 1995, WordPerfect 6.0 was made available for Linux as part of Caldera's internet office package. 10971430 -> 1000009801420: In late 1997, a newer version was made available for download, but had to be purchased to be activated. 10971440 -> 1000009801430: Hoping to establish themselves in the nascent commercial Linux market, Corel also developed their own distribution of Linux. 10971450 -> 1000009801440: Although the Linux distribution was fairly well-received, the response to WordPerfect for Linux was varied. 10971460 -> 1000009801450: Some Linux promoters appreciated the availability of a well-known, mainstream application for the OS. Developers of other Linux-compatible word processors questioned the need for another application in the category. 10971470 -> 1000009801460: Advocates of open-source software scoffed at its proprietary, closed-source nature, and questioned the viability of a commercial application in a market dominated by free software, such as OpenOffice.org and numerous others. 10971480 -> 1000009801470: The performance and stability of WordPerfect 9.0 (not a native Linux application like WP 6-8, but derived from the Windows version using the Wine compatibility library) was highly criticized. 10971490 -> 1000009801480: WordPerfect failed to gain a large user base, and as part of Corel's change of strategic direction following a (non-voting) investment by Microsoft, WordPerfect for Linux was discontinued and their Linux distribution was sold to Xandros. 
10971500 -> 1000009801490: In April 2004, Corel re-released WordPerfect 8.1 (the last Linux-native version) with some updates, as a "proof of concept" and to test the Linux market. 10971510 -> 1000009801500: As of 2005, WordPerfect for Linux is not available for purchase. 10971520 -> 1000009801510: Versions 10971530 -> 1000009801520: (* - Part of WordPerfect Office) 10971540 -> 1000009801530: Known versions for VAX/VMS include 5.1, 5.3 and 7.1 , year of release unknown. 10971550 -> 1000009801540: Known versions for SUN include 6.0, requiring SunOS or Solaris 2, year of release unknown. 10971560 -> 1000009801550: Known versions for IBM System/370 include 4.2, released 1988. 10971570 -> 1000009801560: Known versions for OS/2 include 5.0, released 1989. 10971580 -> 1000009801570: Known versions for the DEC Rainbow 100 include version (?), released November 1983. 10971590 -> 1000009801580: In addition, versions of WordPerfect have also been available for Apricot, Atari ST, DEC Rainbow, Tandy 2000, TI Professional, Victor 9000, and Zenith Z-100 systems, as well as around 30 flavors of unix, including AT&T, NCR, SCO Xenix, Microport Unix, DEC Ultrix, Pyramid Tech Unix, Tru64, AIX, Motorola 8000, and HP9000 and SUN 3. 10971600 -> 1000009801590: Current versions 10971610 -> 1000009801600: On January 17, 2006, Corel announced WordPerfect X3, the newest version of this office package. 10971620 -> 1000009801610: Corel is an original member of the OASIS Technical Committee on the Open Document Format, and Paul Langille, a senior Corel developer, is one of the original four authors of the OpenDocument specification. 10971630 -> 1000009801620: In January 2006, subscribers to Corel's electronic newsletter were informed that WordPerfect 13 was scheduled for release later in 2006. 10971640 -> 1000009801630: The subsequent release of X3 (identified as "13" internally and in registry entries) has been met with generally positive reviews, due to new features including a unique PDF import capability, metadata removal tools, integrated search and online resources and other features. 10971650 -> 1000009801640: Version X3 was described by CNET in January, 2006 as a "winner", "a feature-packed productivity suite that's just as easy to use – and in many ways more innovative than – industry-goliath Microsoft Office 2003." 10971660 -> 1000009801650: CNET went on to describe X3 as "a solid upgrade for longtime users", but that "Die-hard Microsoft fans may want to wait to see what Redmond has up its sleeve with the radical changes expected within the upcoming Microsoft Office 12." 10971670 -> 1000009801660: While the notable if incremental enhancements of WordPerfect Office X3 have been well received by reviewers, a number of online forums have voiced concern about the future direction of WordPerfect, with longtime users complaining about certain usability and functionality issues that users have been asking to have fixed for the last few release versions. 10971680 -> 1000009801670: Although the released version of X3 does not support the OOXML or OpenDocument formats, a beta has been released that supports both. 10971690 -> 1000009801680: Reports surfaced late in January 2006 that Apple's iWork had leapfrogged WordPerfect Office as the leading alternative to Microsoft Office. 10971700 -> 1000009801690: This claim was soon debunked after industry analyst Joe Wilcox described JupiterResearch usage surveys that showed WordPerfect as the No. 
2 office suite behind Microsoft Office in the consumer, small and medium business, and enterprise markets, with a roughly 15 percent share in each market. 10971710 -> 1000009801700: In April 2008, Corel released its WordPerfect Office X4 suite, containing the new X4 version of WordPerfect, which includes support for PDF, OpenDocument and Office Open XML. XHTML 10990010 -> 1000009900020: XHTML 10990020 -> 1000009900030: The Extensible Hypertext Markup Language, or XHTML, is a markup language that has the same depth of expression as HTML, but also conforms to XML syntax. 10990030 -> 1000009900040: While HTML is an application of Standard Generalized Markup Language (SGML), a very flexible markup language, XHTML is an application of XML, a more restrictive subset of SGML. 10990040 -> 1000009900050: Because they need to be well-formed, true XHTML documents allow for automated processing to be performed using standard XML tools—unlike HTML, which requires a relatively complex, lenient, and generally custom parser. 10990050 -> 1000009900060: XHTML can be thought of as the intersection of HTML and XML in many respects, since it is a reformulation of HTML in XML. 10990060 -> 1000009900070: XHTML 1.0 became a World Wide Web Consortium (W3C) Recommendation on January 26, 2000. 10990070 -> 1000009900080: XHTML 1.1 became a W3C Recommendation on May 31, 2001. 10990080 -> 1000009900090: Overview 10990090 -> 1000009900100: XHTML is "a reformulation of the three HTML 4 document types as applications of XML 1.0". 10990100 -> 1000009900110: The W3C also continues to maintain the HTML 4.01 Recommendation, and the specifications for HTML5 and XHTML5 are being actively developed. 10990110 -> 1000009900120: In the current XHTML 1.0 Recommendation document, as published and revised to August 2002, the W3C comments that "The XHTML family is the next step in the evolution of the Internet. 10990120 -> 1000009900130: By migrating to XHTML today, content developers can enter the XML world with all of its attendant benefits, while still remaining confident in their content's backward and future compatibility." 10990130 -> 1000009900140: Motivation 10990140 -> 1000009900150: The need for a reformulated version of HTML was felt primarily because World Wide Web content now needs to be delivered to many devices other than traditional desktop computers, such as mobile devices, which cannot devote extra resources to supporting the additional complexity of HTML syntax. 10990150 -> 1000009900160: In practice, however, HTML-supporting browsers for such constrained devices have emerged faster than XHTML support has been added to the desktop browser with the largest market share, Internet Explorer. 10990160 -> 1000009900170: Another goal for XHTML and XML was to reduce the demands on parsers and user agents in general. 10990170 -> 1000009900180: With HTML, user agents increasingly took on the burden of "correcting" errant documents. 10990180 -> 1000009900190: Instead, XML requires user agents to report a "fatal" error when they encounter malformed markup. 10990190 -> 1000009900200: In theory, this allows vendors to produce leaner browsers, without the obligation to work around author errors. 10990200 -> 1000009900210: A side effect of this behavior is that those authoring XHTML documents and testing in conformant browsers should be more readily alerted to errors that might otherwise have gone unnoticed had the browser attempted to render or ignore the malformed markup. 
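To make the strict-parsing point concrete, here is a minimal sketch (the fragments and variable names are invented for illustration, not taken from the article) that feeds a well-formed XHTML fragment and an HTML-style fragment to Python's standard xml.etree.ElementTree parser; the well-formed fragment parses cleanly, while the unclosed br element triggers the kind of fatal error XML requires.

```python
import xml.etree.ElementTree as ET

# Well-formed XHTML-style fragment: the empty br element is explicitly closed.
well_formed = "<p>One line<br />another line</p>"

# HTML-style fragment: an unclosed <br> is accepted by lenient HTML parsers,
# but it is not well-formed XML.
malformed = "<p>One line<br>another line</p>"

ET.fromstring(well_formed)  # parses without complaint

try:
    ET.fromstring(malformed)
except ET.ParseError as error:
    # A conforming XML parser stops at the first well-formedness error.
    print("fatal error:", error)
```

Any conforming XML parser behaves the same way, which is what lets generic XML tooling process true XHTML documents without an HTML-specific error-recovery layer.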
10990210 -> 1000009900220: A feature XHTML inherits from its XML underpinnings is XML namespaces. 10990220 -> 1000009900230: With namespaces, authors or communities of authors can define their own XML elements, attributes and content models to mix within XHTML documents. 10990230 -> 1000009900240: This is similar to the semantic flexibility of the class attribute in an HTML element, but with fewer restrictions. 10990240 -> 1000009900250: Some W3C XML namespaces/schemas that can be mixed with XHTML include MathML for semantic math markup, Scalable Vector Graphics for markup of vector graphics, and RDFa for embedding RDF data. 10990250 -> 1000009900260: Relationship to HTML 10990260 -> 1000009900270: HTML is the antecedent technology to XHTML. 10990270 -> 1000009900280: The changes from HTML to first-generation XHTML 1.0 are minor and are mainly to achieve conformance with XML. 10990280 -> 1000009900290: The most important change is the requirement that the document must be well-formed and that all elements must be explicitly closed as required in XML. 10990290 -> 1000009900300: In XML, all element and attribute names are case-sensitive, so the XHTML approach has been to define all tag names to be lowercase. 10990300 -> 1000009900310: This contrasts with some earlier established traditions which began around the time of HTML 2.0, when many used uppercase tags. 10990310 -> 1000009900320: In XHTML, all attribute values must be enclosed by quotes; either single (') or double (") quotes may be used. 10990320 -> 1000009900330: In contrast, this was sometimes optional in SGML-based HTML, where numeric or boolean attributes could omit quotes. 10990330 -> 1000009900340: All elements must also be explicitly closed, including empty (also known as singleton) elements such as img and br. 10990340 -> 1000009900350: This can be done by adding a closing slash to the start tag, e.g., <br /> and <img ... />
. 10990350 -> 1000009900360: Attribute minimization (e.g., writing checked rather than checked="checked") is also prohibited; such attributes must be written out in full in XHTML.
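The rules above, together with the namespace mixing described earlier, can be illustrated with a short sketch (the document content, file names and attribute values are invented for illustration): an XHTML-style document with lowercase tags, quoted attribute values, explicitly closed empty elements, a written-out checked="checked" attribute, and an embedded SVG fragment in its own namespace, all processed with a standard XML parser.

```python
import xml.etree.ElementTree as ET

# A small, well-formed XHTML-style document (hypothetical content).
# Note the lowercase element names, quoted attributes, self-closed
# img/input/rect elements, and the non-minimized checked="checked".
document = """
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Namespace example</title></head>
  <body>
    <p>An image: <img src="logo.png" alt="logo" /></p>
    <p><input type="checkbox" checked="checked" /> remember me</p>
    <svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">
      <rect width="10" height="10" />
    </svg>
  </body>
</html>
"""

root = ET.fromstring(document)

# Because the document is well-formed XML, a generic tool can walk it and
# report which namespace each element belongs to, with no knowledge of
# either the XHTML or the SVG vocabulary.
for element in root.iter():
    print(element.tag)  # printed as {namespace-URI}localname
```

Every element carries an explicit namespace URI, so the XHTML and SVG parts can be told apart mechanically, which is the practical benefit of mixing vocabularies through XML namespaces rather than ad hoc conventions.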