Cross-platform
'''Cross-platform''' (also known as '''multi-platform''') is a term used in computing to refer to [[computer program]]s, [[operating system]]s, [[programming language]]s, or other [[computer software]] and their implementations, which can be made to work on multiple [[computer platform]]s. "Cross-platform" and "multi-platform" both refer to the idea that a given piece of computer software can run on more than one computer platform. There are two major types of cross-platform software: one must be built separately for each platform it supports (e.g., software written in a compiled language such as [[Pascal (programming language)|Pascal]]); the other can be run directly on any platform that supports it, such as software written in an [[interpreted language]] (e.g., [[Perl]], [[Python (programming language)|Python]], or a [[shell script]]) or software compiled to [[bytecode]] that is then redistributed, as is the case with [[Java (programming language)|Java]] and the languages of the [[.NET Framework]], such as [[Chrome (programming language)|Chrome]].

For example, a cross-platform [[application software|application]] may run on [[Microsoft Windows]] on the [[x86 architecture]], [[Linux]] on the x86 architecture, and [[Mac OS X]] on either [[PowerPC]]- or [[x86]]-based [[Apple Macintosh]] systems. A cross-platform application may run on as many as all existing platforms, or on as few as two.

== Platforms ==

A platform is a combination of hardware and software used to run software applications. A platform can be described simply as an operating system or computer architecture, or as the combination of both. Probably the most familiar platform is [[Microsoft Windows]] running on the [[x86 architecture]].
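The operating-system-plus-architecture pairing that defines a platform can be inspected from inside a running program. A minimal sketch using Python's standard `platform` module (the printed values are examples and differ per machine):

```python
import platform

# A platform, as described above, combines an operating system with
# a hardware architecture; both halves can be queried at run time.
print(platform.system())   # operating system, e.g. 'Windows' or 'Linux'
print(platform.machine())  # architecture, e.g. 'x86_64' or 'AMD64'
```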
Other well-known desktop computer platforms include [[Linux]] and [[Mac OS X]] (both of which are themselves cross-platform). There are, however, many devices, such as [[cellular telephones]], that are also effectively computer platforms but are less commonly thought of that way. [[Application software]] can be written to depend on the features of a particular platform—whether the hardware, the operating system, or the virtual machine it runs on. The [[Java Platform|Java platform]] is a [[virtual machine]] platform that runs on many operating systems and hardware types, and is a common platform for software to be written for.

=== Hardware platforms ===

A '''hardware platform''' can refer to a computer's [[computer architecture|architecture]] or [[processor architecture]]. For example, the [[x86]] and [[x86-64]] [[CPU]]s make up one of the most common [[computer architecture]]s in use in home machines today. These machines commonly run [[Microsoft Windows]], though they can run other [[operating system]]s as well, including [[Linux]], [[OpenBSD]], [[NetBSD]], [[Mac OS X]], and [[FreeBSD]].

=== Software platforms ===

A software platform can be either an [[operating system]] or a programming environment, though more commonly it is a combination of both. A notable exception is [[Java (programming language)|Java]], which uses an [[operating system]]-independent [[virtual machine]] to run its [[compiled]] code, known in the Java world as [[bytecode]]. Examples of software platforms include:

* [[MS-DOS]] ([[x86]]), [[DR-DOS]] ([[x86]]), [[FreeDOS]] ([[x86]]), etc.
* [[Microsoft Windows]] ([[x86]], [[x64]])
* [[Linux]] (x86, x64, [[PowerPC]], various other architectures)
* [[Mac OS X]] (PowerPC, x86)
* [[OS/2]], [[eComStation]]
* [[AmigaOS]] ([[m68k]]), [[AROS]] (x86, PowerPC, m68k), [[MorphOS]] (PowerPC)
* [[Java (programming language)|Java]]

==== Java platform ====

As previously noted, the [[Java platform]] is an exception to the general rule that an [[operating system]] is a software platform. The Java language provides a [[virtual machine]], or “virtual CPU”, which runs all of the code written for the language. This enables the same [[executable]] [[binary file|binary]] to run on all systems that support the Java software, through the [[Java Virtual Machine]]. Java [[executable]]s do not run directly on the [[operating system]]; that is, neither [[Microsoft Windows|Windows]] nor [[Linux]] executes Java programs directly. Because of this, however, Java is limited in that it does not directly support system-specific functionality. [[Java Native Interface|JNI]] can be used to access system-specific functions, but then the code is likely no longer portable. Java programs run on at least the [[Microsoft Windows]], [[Mac OS X]], [[Linux]], and [[Solaris Operating System|Solaris]] operating systems, so the language is limited to functionality that exists on all of these systems. This includes things such as [[computer networking]] and [[Internet socket]]s, but not necessarily raw hardware [[input/output]].

== Cross-platform software ==

In order for software to be considered '''cross-platform''', it must be able to function on more than one [[computer architecture]] or [[operating system]].
This can be a time-consuming task, given that different [[operating system]]s have different [[application programming interface]]s (APIs); for example, [[Linux]] uses a different API for [[application software]] than [[Microsoft Windows|Windows]] does. The fact that a particular [[operating system]] runs on different [[computer architecture]]s does not mean that software written for that operating system will automatically work on all [[computer architecture|architecture]]s the operating system supports. One example, as of August 2006, was [[OpenOffice.org]], which did not natively run on the [[AMD64]] or [[EM64T]] lines of processors implementing the [[x86-64]] [[64-bit]] standards; this has since changed, and the OpenOffice.org suite of software is “mostly” ported to these 64-bit systems[http://wiki.services.openoffice.org/wiki/Porting_to_x86-64_(AMD64,_EM64T)]. Likewise, the fact that a program is written in a popular programming language such as [[C (programming language)|C]] or [[C++]] does not mean it will run on all [[operating systems]] that support that [[programming language]].

=== Web applications ===

[[Web application]]s are typically described as cross-platform because, ideally, they are accessible from any of various [[web browser]]s on different operating systems. Such applications generally employ a [[client-server]] architecture and vary widely in complexity and functionality. This wide variability significantly complicates the goal of cross-platform capability, which is routinely at odds with the goal of advanced functionality.
==== Basic applications ====

Basic web applications perform all or most processing on a [[Stateless server|stateless]] [[web server]] and pass the result to the client web browser. All user interaction with the application consists of simple exchanges of data requests and server responses. These types of applications were the norm in the early phases of [[World Wide Web]] application development. Such applications follow a simple [[Transaction processing|transaction]] model, identical to that of serving [[static web page]]s. Today, they are still relatively common, especially where cross-platform compatibility and simplicity are deemed more critical than advanced functionality.

==== Advanced applications ====

Prominent examples of advanced web applications include the Web interface to [[Gmail]], [[A9.com]], and the maps.live.com section of [[Live Search]]. Such advanced applications routinely depend on additional features found only in the more recent versions of popular web browsers. These dependencies include [[Ajax (programming)|Ajax]], [[JavaScript]], [[Dynamic HTML|“Dynamic” HTML]], [[SVG]], and other components of [[rich internet application]]s. Older versions of popular browsers tend to lack support for certain of these features.

==== Design strategies ====

Because of the competing interests of cross-platform compatibility and advanced functionality, numerous alternative web application design strategies have emerged. Such strategies include:

===== Graceful degradation =====

Graceful degradation attempts to provide the same or similar functionality to all users and platforms, while diminishing that functionality to a "least common denominator" for more limited client browsers.
For example, a user attempting to use a limited-feature browser to access Gmail may notice that Gmail switches to “Basic Mode,” with reduced functionality. Some view this strategy as a lesser form of cross-platform capability.

===== Separation of functionality =====

Separation of functionality attempts to simply omit those subsets of functionality that certain client browsers or operating systems cannot support, while still delivering a "complete" application to the user (see also [[Separation of concerns]]).

===== Multiple codebase =====

Multiple-codebase applications present different versions of an application depending on the specific client in use. This strategy is arguably the most complicated and expensive way to fulfill cross-platform capability, since even different versions of the same client browser (within the same operating system) can differ dramatically from one another. This is further complicated by the support for “plugins”, which may or may not be present for any given installation of a particular browser version.

===== Third-party libraries =====

Third-party libraries attempt to simplify cross-platform capability by "hiding" the complexities of client differentiation behind a single, unified API.

==== Testing strategies ====

One complicated aspect of cross-platform web application design is the need for [[software testing]]. In addition to the complications mentioned previously, there is the additional restriction that some browsers prohibit the installation of different versions of the same browser on the same operating system. Techniques such as [[full virtualization]] are sometimes used as a workaround for this problem.
=== Traditional applications ===

Although web applications are becoming increasingly popular, many computer users still use traditional [[application software]], which does not rely on a client/web-server architecture. The distinction between “traditional” and “web” applications is not always clear-cut, however, because applications have many different features, installation methods, and architectures, some of which can overlap and blur the distinction. Nevertheless, this simplifying distinction is a common and useful generalization.

==== Binary software ====

Traditionally in modern computing, application software has been distributed to end users as '''binary images''', which are stored in [[executable]]s, a specific type of [[binary file]]. Such [[executable]]s support only the [[operating system]] and [[computer architecture]] that they were built for, which means that making a single “cross-platform executable” would be a massive task and is generally not done. For software that is distributed as a [[binary file|binary]] [[executable]], such as software written in [[C (programming language)|C]] or [[C++]], the programmer must [[software build|build the software]] for each different [[operating system]] and [[computer architecture]]. For example, [[Mozilla]] [[Mozilla Firefox|Firefox]], an open-source web browser, is available on [[Microsoft Windows]], [[Mac OS X]] (both [[PowerPC]] and [[x86]], through what Apple calls a '''[[Universal binary]]'''), and [[Linux]] on multiple computer architectures. The three platforms (in this case, [[Microsoft Windows|Windows]], [[Mac OS X]], and [[Linux]]) are separate [[executable]] distributions, although they come from the same [[source code]].
In the context of binary software, a cross-platform program is written once as source code and then “translated” for each system it runs on by compiling it on each platform. Software can also be [[porting|ported]] to a new [[computer architecture]] or [[operating system]] so that the program becomes more cross-platform than it already was. For example, a program such as Firefox, which already runs on Windows on the x86 family, can be modified and rebuilt to run on Linux on the x86 (and potentially other architectures) as well. As an alternative to porting, cross-platform virtualization allows applications compiled for one CPU and operating system to run on a system with a different CPU and/or operating system, without modification to the source code or binaries. As an example, [[Apple Computer|Apple's]] [[Rosetta (software)|Rosetta]] software, which is built into [[Intel]]-based Apple Macintosh computers, runs applications compiled for the previous generation of Macs that used [[PowerPC]] CPUs. Another example is IBM PowerVM Lx86, which allows Linux/x86 applications to run unmodified on Linux/Power.

==== Scripts and [[interpreted language]]s ====

A script can be considered cross-platform if the [[scripting language]] is available on multiple platforms and the script uses only the facilities provided by the language. That is, a script written in [[Python (programming language)|Python]] for a [[Unix-like]] system will likely run with little or no modification on [[Microsoft Windows|Windows]], because Python also runs on Windows; there is also more than one implementation of Python that will run the same scripts (e.g., [[IronPython]] for [[.NET Framework|.NET]]). The same goes for many of the [[open source]] [[scripting language]]s that are available.
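The portability claim above can be illustrated with a short sketch: this Python script touches two platform-dependent details, file path separators and newline characters, only through facilities the language itself makes portable (the file names are illustrative):

```python
import os
import tempfile
from pathlib import Path

# Build a path without hard-coding "/" or "\\"; pathlib inserts the
# separator appropriate to the host platform when converted to a string.
config = Path("settings") / "app.conf"

# Text mode translates "\n" to the platform's newline on write and
# back again on read ("universal newlines"), hiding one of the few
# per-platform differences a plain-text script normally meets.
with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "demo.txt"
    p.write_text("line one\nline two\n")
    lines = p.read_text().splitlines()

print(lines)  # ['line one', 'line two'] on every platform
```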
Unlike [[binary file|binary]] [[executable]]s, the same script can be used on all computers that have software to interpret it. This is because the script is generally stored as [[plain text]] in a [[text file]]. There may be some trivial issues, however, such as the type of [[newline|newline character]] used between lines. Generally, however, little or no work has to be done to make a script written for one system run on another. Some popular cross-platform scripting or [[interpreted language]]s are:

* [[bash]]—A [[Unix shell]] commonly run on [[Linux]] and other modern [[Unix-like]] systems, as well as on [[Microsoft Windows|Windows]] via the [[Cygwin]] [[POSIX]] compatibility layer.
* [[Python (programming language)|Python]]—A modern [[scripting language]] focused on [[rapid application development]] and ease of writing rather than run-time efficiency.
* [[Perl]]—A scripting language first released in 1987, used for [[Common Gateway Interface|CGI]] [[WWW]] programming, small [[system administration]] tasks, and more.
* [[PHP]]—A [[scripting language]] most popular on the [[WWW]] for [[web application]]s.
* [[Ruby (programming language)|Ruby]]—A scripting language whose purpose is to be object-oriented and easy to read. It can also be used on the web through [[Ruby on Rails]].
* [[Tcl]]—A dynamic programming language suitable for a wide range of uses, including web and desktop applications, networking, administration, and testing.

==== Video games ====

Cross-platform is a term that can also apply to [[video game]]s.
Such games are released on a range of [[video game console]]s and [[handheld game console]]s, which are specialized [[computer]]s dedicated to the task of playing games (and are thus platforms like any other computer). Examples of these games include:

* [[Miner 2049er]], the first major multiplatform game
* [[Phantasy Star Online]]
* [[Lara Croft Tomb Raider: Legend]]
* [[FIFA Series]]
* [[Shadow of Legend]]

… which are spread across a variety of platforms, such as the [[Nintendo GameCube]], [[PlayStation 2]], [[Xbox]], [[Personal computer|PC]], and [[mobile devices]].

Depending on the hardware of a particular system, it may take longer than expected to create a video game for multiple platforms. As a result, a video game may initially be released on only a few platforms and later on the remaining ones. Typically, this occurs when a new system is released, because the [[Video game developer|developer]]s of the video game need to become acquainted with the hardware and software associated with the new console. Some games may not become cross-platform because of licensing agreements between the [[Video game developer|developer]]s and the maker of a [[video game console]] which state that the game will be made only for that particular console. As an example, [[Disney]] could create a new game and wish to release it on the latest [[Nintendo]] and [[Sony]] game consoles. If [[Disney]] licenses the game with [[Sony]] first, [[Disney]] may be required to release the game only on [[Sony|Sony's]] console for a short time, or indefinitely—effectively prohibiting a cross-platform release, at least for a period of time.

Several developers have developed ways to play games online while using different platforms.
Epic Games, Microsoft, and Valve Software all have technology of this kind, which allows Xbox 360 and PS3 gamers to play with PC gamers, leaving gamers free to choose whichever platform they prefer for a game. The first game released to allow this interactivity between the PC and console versions was [[Quake 3]]. Games that feature cross-platform online play include:

*[[Champions Online]]
*[[Lost Planet: Colonies]]
*[[Phantasy Star Online]]
*[[Shadowrun (2007 video game)|Shadowrun]]
*[[UNO (Xbox Live Arcade)|UNO]]
*[[Final Fantasy XI Online]]

== Platform independent software ==

Software that is platform independent does not rely on any special features of any single platform, or, if it does, it handles those special features in a way that lets it deal with multiple platforms. [[Algorithm]]s, such as [[quicksort]], are inherently platform independent: they can be implemented on any platform.

== Cross-platform programming ==

Cross-platform programming is the practice of actively writing software that will work on more than one platform.

=== Approaches to cross-platform programming ===

There are different ways of approaching the problem of writing a cross-platform application program. One such approach is simply to create multiple versions of the same program in different ''source trees''—in other words, the [[Microsoft Windows|Windows]] version of a program might have one set of source code files, the [[Apple Macintosh|Macintosh]] version another, and a FOSS *nix version yet another. While this is a straightforward approach to the problem, it has the potential to be considerably more expensive in development cost, development time, or both, especially for corporate entities.
The idea behind this approach is to create two or more different programs that behave similarly to one another. It is also possible that this means of developing a cross-platform application will result in more problems with bug tracking and fixing, because the different ''source trees'' would have different programmers, and thus different defects in each version. The smaller the programming team, the quicker the bug fixes tend to be.

Another approach is to depend on pre-existing software that hides the differences between the platforms—called [[abstraction]] of the platform—such that the program itself is unaware of the platform it is running on. It could be said that such programs are ''platform agnostic''. Programs that run on the [[Java (Sun)|Java]] [[Java Virtual Machine|Virtual Machine (JVM)]] are built in this fashion.

Some applications mix various methods of cross-platform programming to create the final application. An example of this is the [[Firefox]] [[web browser]], which uses [[abstraction]] to build some of its lower-level components, separate source subtrees for platform-specific features (such as the GUI), and more than one [[scripting language]] to help facilitate portability. [[Firefox]] implements [[XUL]], [[Cascading Style Sheets|CSS]], and [[JavaScript]] for extending the browser, in addition to classic [[Netscape]]-style browser plugins. Much of the browser itself is written in XUL, CSS, and JavaScript as well.
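The abstraction approach described above can be sketched in miniature. In this hedged Python example the backend classes and the `home_hint` function are invented for illustration; real abstraction layers such as the JVM or a widget toolkit are far larger, but the shape is the same: callers see one API, and the platform choice is made once, behind it.

```python
import sys

# Toy platform-abstraction layer: one public function, with the
# per-platform backend selected once at import time. All names here
# are illustrative, not any real library's API.
class _PosixBackend:
    def home_hint(self) -> str:
        return "$HOME"

class _WindowsBackend:
    def home_hint(self) -> str:
        return "%USERPROFILE%"

_backend = _WindowsBackend() if sys.platform == "win32" else _PosixBackend()

def home_hint() -> str:
    # Callers never branch on the platform themselves; in that sense
    # code written against this function is "platform agnostic".
    return _backend.home_hint()

print(home_hint())
```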
=== Cross-platform programming toolkits ===

There are a number of tools available to help facilitate the process of cross-platform programming:

* [[Simple DirectMedia Layer]]—An [[open source]] cross-platform multimedia library written in C that creates an abstraction over various platforms' graphics, sound, and input [[Application programming interface|API]]s. It runs on many operating systems, including Linux, Windows, and Mac OS X, and is aimed at games and multimedia applications.
* [[Cairo (graphics)|Cairo]]—A [[free software]] library used to provide a vector-graphics-based, device-independent API. It is designed to provide primitives for two-dimensional drawing across a number of different backends. Cairo is written in C and has bindings for many programming languages.
* ''ParaGUI''—A cross-platform high-level application framework and GUI library. It can be compiled on various platforms (Linux, Win32, BeOS, Mac OS, ...). ParaGUI is based on the Simple DirectMedia Layer (SDL) and is targeted at cross-platform multimedia applications and embedded devices operating on framebuffer displays.
* [[wxWidgets]]—An open source widget toolkit that is also an [[application framework]]. It runs on [[Unix-like]] systems with [[X11]], Microsoft Windows, and Mac OS X. Applications written with it run on all of the systems it supports, provided they do not use any [[operating system]]-specific programming in addition to it.
* [[Qt (toolkit)|Qt]]—An application framework and [[widget toolkit]] for [[Unix-like]] systems with [[X11]], Microsoft Windows, Mac OS X, and other systems—available under both [[open source]] and commercial licenses.
* [[GTK+]]—An open source widget toolkit for Unix-like systems with X11 and Microsoft Windows.
* [[FLTK]]—Another open source cross-platform toolkit, but more lightweight because it restricts itself to the GUI.
* [[Mozilla application framework|Mozilla]]—An open source platform for building Mac, Windows, and Linux applications.
* [[Mono (software)|Mono]] (and more specifically, [[Microsoft .NET]])—A cross-platform framework for applications and programming languages.
* ''molib''—A commercial application toolkit library that abstracts system calls through C++ objects (such as the file system, database system, and thread implementation). This allows for the creation of applications that compile and run under Microsoft Windows, Mac OS X, GNU/Linux, and other systems (SunOS, AIX, HP-UX; 32/64-bit; SMP). It can be used in concert with ''the sandbox'' to create GUI-based applications.
* [[fpGUI]]—An open source widget toolkit completely implemented in Object Pascal. It currently supports Linux, Windows, and, partially, Windows CE. fpGUI does not rely on any large libraries; instead it talks directly to Xlib (Linux) or GDI (Windows). The framework is compiled with the Free Pascal compiler. Mac OS support is also in the works.
* [[Tcl/Tk]]—Tcl (Tool Command Language) is a dynamic programming language suitable for a wide range of uses, including web and desktop applications, networking, administration, and testing. Open source and business-friendly, Tcl is a mature yet evolving language that is cross-platform, easily deployed, and highly extensible. Tk is a graphical user interface toolkit; it is the standard GUI not only for Tcl but for many other dynamic languages, and can produce native applications that run unchanged across Windows, Mac OS X, Linux, and more. The combination of Tcl and the Tk GUI toolkit is referred to as Tcl/Tk.
* [[XVT]]—A cross-platform toolkit for creating enterprise and desktop applications in C/C++ on Windows, Linux, Unix (Solaris, HP-UX, AIX), and Mac. The most recent release is 5.8, from April 2007.

=== Cross-platform development environments ===

Cross-platform applications can also be built using proprietary [[Integrated development environment|IDE]]s, or so-called [[Rapid Application Development]] tools. There are a number of development environments which allow developers to build and deploy applications across multiple platforms:

* [[Eclipse (software)|Eclipse]]—An open source [[software framework]] and [[Integrated development environment|IDE]], extendable through plug-ins, including the C++ Development Toolkit. Eclipse is available on any operating system with a modern Java virtual machine (including Windows, Linux, Mac OS X, Solaris, HP-UX, and other systems).
* [[IntelliJ IDEA]]—A proprietary [[Integrated development environment|IDE]].
* [[NetBeans]]—An open source [[software framework]] and [[Integrated development environment|IDE]], extendable through plug-ins. NetBeans is available on any operating system with a modern Java virtual machine, and is similar to Eclipse in features and functionality. It is promoted by [[Sun Microsystems]].
* [[Omnis Studio]]—A proprietary [[Integrated development environment|IDE]] and Rapid Application Development tool for creating enterprise and web applications for Windows, Linux, and Mac OS X.
* [[Runtime Revolution]]—A proprietary [[Integrated development environment|IDE]], compiler engine, and CGI builder that [[cross compile]]s to [[Microsoft Windows|Windows]], [[Mac OS X]] ([[PowerPC|PPC]], [[Intel]]), [[Linux]], [[Solaris Operating System|Solaris]], [[BSD]], and [[Irix]].
* [[Code::Blocks]]—A free/open source, cross-platform IDE. It is developed in C++ using wxWidgets. Using a plugin architecture, its capabilities and features are defined by the provided plugins.
* [[Lazarus (software)|Lazarus]]—A cross-platform visual IDE developed for and supported by the open source Free Pascal compiler. It aims to provide a Delphi-like Rapid Application Development environment for Pascal and Object Pascal developers.
* [[REALbasic]]—An object-oriented dialect of the BASIC programming language, developed and commercially marketed by REAL Software, Inc. of Austin, Texas, for Mac OS X, Microsoft Windows, and Linux.

== Criticisms of cross-platform development ==

There are certain issues associated with cross-platform development. Some of these include:

* Testing cross-platform applications may be considerably more complicated, since different platforms can exhibit slightly different behaviors or subtle bugs. This problem has led some developers to deride cross-platform development as “Write Once, Debug Everywhere”, a take on Sun's [[Write once, run anywhere|“Write Once, Run Anywhere”]] marketing slogan.
* Developers are often restricted to using the [[lowest common denominator]] subset of features which are available on all platforms. This may hinder the application's performance or prohibit developers from using platforms' most advanced features.
* Different platforms often have different user interface conventions, which cross-platform applications do not always accommodate. For example, applications developed for Mac OS X and [[GNOME]] are supposed to place the most important button on the right-hand side of a window or dialog, whereas Microsoft Windows and [[KDE]] have the opposite convention. Though many of these differences are subtle, a cross-platform application which does not conform appropriately to these conventions may feel clunky or alien to the user. For a user working quickly, such opposing conventions may even result in [[data loss]], such as in a [[dialog box]] confirming whether the user wants to save or discard changes to a file.
* Code written in scripting languages or run on virtual machines must be translated into native executable code each time the application is executed, imposing a performance penalty. This penalty can be alleviated using advanced techniques like [[just-in-time compilation]], but even with such techniques some performance overhead may be unavoidable.
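The interpretation overhead mentioned in the last point can be made concrete with CPython, the reference Python implementation: source code is first compiled to bytecode for Python's virtual machine, and each bytecode instruction is then dispatched by the interpreter loop at run time. A small sketch using the standard `dis` module (exact instruction names vary between CPython versions):

```python
import dis

def add(a, b):
    return a + b

# The function body is stored as virtual-machine bytecode, not native
# machine code; listing its instructions shows what the interpreter
# must dispatch on every call. JIT implementations such as PyPy
# translate hot paths to native code instead, reducing this penalty.
instructions = [ins.opname for ins in dis.get_instructions(add)]
print(instructions)
```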
Data
'''Data''' (singular: '''datum''') are collected descriptors of natural phenomena, including the results of [[experience]], [[observation]] or [[experiment]], or a set of [[premise]]s. Data may consist of [[number]]s, [[word]]s, or [[image]]s, particularly as [[measurement]]s or observations of a set of [[variable]]s.

==Etymology==

The word ''data'' is the plural of [[Latin]] ''[[datum]]'', [[Grammatical gender|neuter]] past [[participle]] of ''dare'', "to give", hence "something given". The [[past participle]] of "to give" has been used for millennia in the sense of a statement accepted at face value; one of the works of [[Euclid]], circa 300 BC, was the ''Dedomena'' (in Latin, ''Data''). In discussions of problems in [[geometry]], [[mathematics]], [[engineering]], and so on, the terms ''givens'' and ''data'' are used interchangeably. Such usage is the origin of ''data'' as a concept in [[computer science]]: ''data'' are numbers, words, images, etc., accepted as they stand. The word is pronounced ''dey-tuh'', ''dat-uh'', or ''dah-tuh''.

[[Experimental data]] are data generated within the context of a scientific investigation. Mathematically, data can be grouped in many ways.

==Usage in English==

In [[English language|English]], the word ''datum'' is still used in the general sense of "something given", and more specifically in [[cartography]], [[geography]], [[geology]], [[NMR]] and [[technical drawing|drafting]] to mean a reference point, reference line, or reference surface. More generally, any measurement or result can be called a (single) ''datum'', though ''data point'' is more common.
[10170150] |Both ''datums'' (see usage in [[datum]] article) and the originally Latin plural ''data'' are used as the plural of ''datum'' in English, but ''data'' is more commonly treated as a [[mass noun]] and used in the [[Grammatical number|singular]], especially in day-to-day usage. [10170160] |For example, "This is all the data from the experiment". [10170170] |This usage is inconsistent with the rules of Latin grammar and traditional English, which would instead suggest "These are all the data from the experiment". [10170180] |Some British and UN academic, scientific, and professional [[style guides]] (e.g., see page 43 of the [http://whqlibdoc.who.int/hq/2004/WHO_IMD_PUB_04.1.pdf World Health Organization Style Guide]) request that authors treat ''data'' as a plural noun. [10170190] |Other international organizations, such as the IEEE Computer Society, allow its usage as either a mass noun or plural based on author preference. [10170200] |It is now usually treated as a singular mass noun in informal usage, but usage in scientific publications shows a strong UK/U.S. divide. [10170210] |U.S. usage tends to treat ''data'' in the singular, including in serious and academic publishing, although some major newspapers (such as the [[New York Times]]) regularly use it in the plural. [10170220] |"The plural usage is still common, as this headline from the New York Times attests: “Data Are Elusive on the Homeless.” [10170230] |Sometimes scientists think of data as plural, as in ''These data do not support the conclusions.'' [10170240] |But more often scientists and researchers think of data as a singular mass entity like information, and most people now follow this in general usage. [10170245] |"[http://www.bartleby.com/61/51/D0035100.html] UK usage now widely accepts treating ''data'' as singular in standard English, including everyday newspaper usage at least in non-scientific use. [10170250] |UK scientific publishing usually still prefers treating it as a plural.
[10170260] |Some UK university style guides recommend using ''data'' for both singular and plural use and some recommend treating it only as a singular in connection with computers. [10170270] |==Uses of ''data'' in science and computing== [10170280] |''Raw data'' are [[number]]s, [[character (computing)|characters]], [[image]]s or other outputs from devices that convert physical quantities into symbols, in a very broad sense. [10170290] |Such data are typically further [[data processing|processed]] by a human or [[input]] into a [[computer]], [[Computer storage|stored]] and processed there, or transmitted ([[output]]) to another human or computer. [10170300] |''Raw data'' is a relative term; data processing commonly occurs by stages, and the "processed data" from one stage may be considered the "raw data" of the next. [10170310] |Mechanical computing devices are classified according to the means by which they represent data. [10170320] |An [[analog computer]] represents a datum as a voltage, distance, position, or other physical quantity. [10170330] |A [[digital computer]] represents a datum as a sequence of symbols drawn from a fixed [[alphabet]]. [10170340] |The most common digital computers use a binary alphabet, that is, an alphabet of two characters, typically denoted "0" and "1". [10170350] |More familiar representations, such as numbers or letters, are then constructed from the binary alphabet. [10170360] |Some special forms of data are distinguished. [10170370] |A [[computer program]] is a collection of data, which can be interpreted as instructions. [10170380] |Most computer languages make a distinction between programs and the other data on which programs operate, but in some languages, notably [[Lisp programming language|Lisp]] and similar languages, programs are essentially indistinguishable from other data. [10170390] |It is also useful to distinguish [[metadata]], that is, a description of other data.
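The construction of familiar representations from the binary alphabet can be illustrated in one line of Python; the choice of the letter "A" and of the ASCII encoding is purely illustrative.

```python
# The letter "A" as a digital computer's binary alphabet sees it:
# ASCII assigns "A" the number 65, which in eight binary symbols is 01000001.
bits = format(ord("A"), "08b")
print(bits)  # 01000001

# The round trip back from the binary alphabet to the familiar letter.
letter = chr(int(bits, 2))
print(letter)  # A
```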
[10170400] |A similar yet earlier term for metadata is "ancillary data." [10170410] |The prototypical example of metadata is the library catalog, which is a description of the contents of books. [10170420] |==Meaning of data, information and knowledge== [10170430] |The terms [[information]] and [[knowledge]] are frequently used for overlapping concepts. [10170440] |The main difference is in the level of [[abstraction]] being considered. [10170450] |Data is the lowest level of abstraction, information is the next level, and finally, knowledge is the highest level among all three. [10170460] |For example, the height of Mt. Everest is generally considered as "data", a book on Mt. Everest geological characteristics may be considered as "information", and a report containing practical information on the best way to reach Mt. Everest's peak may be considered as "knowledge". [10170470] |Information as a concept bears a diversity of meanings, from everyday usage to technical settings. [10170480] |Generally speaking, the concept of information is closely related to notions of constraint, communication, control, data, form, instruction, knowledge, meaning, mental stimulus, pattern, perception, and representation. [10170490] |Beynon-Davies uses the concept of a [[sign]] to distinguish between [[data]] and [[information]]. [10170500] |Data are symbols. [10170510] |Information occurs when symbols are used to refer to something. [10180010] |
Data analysis
[10180020] |'''Data analysis''' is the process of looking at and summarizing '''[[data]]''' with the intent to extract useful [[information]] and develop conclusions. [10180030] |Data analysis is closely related to [[data mining]], but data mining tends to focus on larger data sets, with less emphasis on making [[inference]], and often uses data that was originally collected for a different purpose. [10180040] |In [[statistics|statistical applications]], some people divide data analysis into [[descriptive statistics]], [[exploratory data analysis]] and [[confirmatory data analysis]], where the EDA focuses on discovering new features in the data, and CDA on confirming or falsifying existing hypotheses. [10180050] |Data analysis assumes different aspects, and possibly different names, in different fields. [10180060] |The term ''data analysis'' is also used as a synonym for [[data modeling]], which is unrelated to the subject of this article. [10180070] |==Nuclear and particle physics== [10180080] |In [[nuclear physics|nuclear]] and [[particle physics]] the data usually originate from the [[particle detector|experimental apparatus]] via a [[data acquisition]] system. [10180090] |It is then processed, in a step usually called ''data reduction'', to apply calibrations and to extract physically significant information. [10180100] |Data reduction is most often, especially in large particle physics experiments, an automatic, batch-mode operation carried out by software written ad-hoc. [10180110] |The resulting data ''n-tuples'' are then scrutinized by the physicists, using specialized software tools like [[ROOT]] or [[Physics Analysis Workstation|PAW]], comparing the results of the experiment with theory. 
[10180120] |The theoretical models are often difficult to compare directly with the results of the experiments, so they are used instead as input for [[Monte Carlo method|Monte Carlo simulation]] software like [[Geant4]] that predicts the response of the detector to a given theoretical event, producing '''simulated events''' which are then compared to experimental data. [10180130] |See also: [[Computational physics]]. [10180140] |==Social sciences== [10180150] |[[Qualitative data analysis]] (QDA) or [[qualitative research]] is the analysis of non-numerical data, for example words, photographs, observations, etc. [10180160] |==Information technology== [10180170] |A special case is the [[Data analysis (information technology)|data analysis in information technology audits]]. [10180180] |==Business== [10180190] |See [10180200] |* [[Analytics]] [10180210] |* [[Business intelligence]] [10180220] |* [[Data mining]] [10190010] |
Database
[10190020] |A '''database''' is a [[structure]]d collection of records or [[data]]. [10190030] |A [[computer]] database relies upon [[software]] to organize the storage of data. [10190040] |The software models the database structure in what are known as [[database model]]s. [10190050] |The model in most common use today is the [[relational model]]. [10190060] |Other models such as the [[hierarchical model]] and the [[network model]] use a more explicit representation of relationships (see below for explanation of the various database models). [10190070] |Database management systems (DBMS) are the software used to organize and maintain the database. [10190080] |These are categorized according to the [[database model]] that they support. [10190090] |The model tends to determine the query languages that are available to access the database. [10190100] |A great deal of the internal engineering of a DBMS, however, is independent of the data model, and is concerned with managing factors such as performance, concurrency, integrity, and recovery from [[hardware failure]]s. [10190110] |In these areas there are large differences between products. [10190120] |==History== [10190130] |The earliest known use of the term '''''data base''''' was in November 1963, when the [[System Development Corporation]] sponsored a symposium under the title ''Development and Management of a Computer-centered Data Base''. [10190140] |'''Database''' as a single word became common in Europe in the early 1970s and by the end of the decade it was being used in major American newspapers. [10190150] |(The abbreviation DB, however, survives.) [10190160] |The first database management systems were developed in the 1960s. [10190170] |A pioneer in the field was [[Charles Bachman]]. 
[10190180] |Bachman's early papers show that his aim was to make more effective use of the new direct access storage devices becoming available: until then, data processing had been based on [[punch card|punched cards]] and [[magnetic tape]], so that serial processing was the dominant activity. [10190190] |Two key [[data model]]s arose at this time: [[CODASYL]] developed the [[network model]] based on Bachman's ideas, and (apparently independently) the [[hierarchical model]] was used in a system developed by [[North American Rockwell]], later adopted by [[IBM]] as the cornerstone of their [[Information Management System|IMS]] product. [10190200] |While IMS along with the CODASYL [[IDMS]] were the big, high visibility databases developed in the 1960s, several others were also born in that decade, some of which have a significant installed base today. [10190210] |Two worthy of mention are the [[Pick operating system|PICK]] and [[MUMPS]] databases, with the former developed originally as an operating system with an embedded database and the latter as a programming language and database for the development of healthcare systems. [10190220] |The [[relational model]] was proposed by [[Edgar F. Codd|E. F. Codd]] in 1970. [10190230] |He criticized existing models for confusing the abstract description of information structure with descriptions of physical access mechanisms. [10190240] |For a long while, however, the relational model remained of academic interest only. [10190250] |While network model products (IDMS) and hierarchical products (IMS) were conceived as practical engineering solutions taking account of the technology as it existed at the time, the relational model took a much more theoretical perspective, arguing (correctly) that hardware and software technology would catch up in time.
[10190260] |Among the first implementations were [[Michael Stonebraker]]'s [[Ingres (database)|Ingres]] at [[University of California, Berkeley|Berkeley]], and the [[System R]] project at IBM. [10190270] |Both of these were research prototypes, announced during 1976. [10190280] |The first commercial products, [[Oracle database|Oracle]] and [[IBM DB2|DB2]], did not appear until around 1980. [10190290] |The first successful database product for microcomputers was [[dBASE]] for the [[CP/M]] and [[PC-DOS]]/[[MS-DOS]] operating systems. [10190300] |During the 1980s, research activity focused on [[distributed database]] systems and [[database machine]]s. [10190310] |Another important theoretical idea was the [[Functional Data Model]], but apart from some specialized applications in genetics, molecular biology, and fraud investigation, the world took little notice. [10190320] |In the 1990s, attention shifted to [[OODB|object-oriented databases]]. [10190330] |These had some success in fields where it was necessary to handle more complex data than relational systems could easily cope with, such as [[spatial database]]s, engineering data (including software [[Software repository|repositories]]), and multimedia data. [10190340] |Some of these ideas were adopted by the relational vendors, who integrated new features into their products as a result. [10190350] |The 1990s also saw the spread of [[Open Source]] databases, such as [[PostgreSQL]] and [[MySQL]]. [10190360] |In the 2000s, the fashionable area for innovation is the [[XML database]]. [10190370] |As with object databases, this has spawned a new collection of start-up companies, but at the same time the key ideas are being integrated into the established relational products. [10190380] |[[XML databases]] aim to remove the traditional divide between documents and data, allowing all of an organization's information resources to be held in one place, whether they are highly structured or not. 
[10190390] |==Database models== [10190400] |Various techniques are used to model data structure. [10190410] |Most database systems are built around one particular data model, although it is increasingly common for products to offer support for more than one model. [10190420] |For any one [[logical model]] various physical implementations may be possible, and most products will offer the user some level of control in tuning the [[physical implementation]], since the choices that are made have a significant effect on performance. [10190430] |Here are three examples: [10190440] |===Hierarchical model=== [10190450] |In a [[hierarchical model]], data is organized into an inverted tree-like structure, with multiple downward links from each node to describe the nesting, and a sort field to keep the records in a particular order in each same-level list. [10190460] |This structure arranges the various data elements in a hierarchy and helps to establish logical relationships among data elements of multiple files. [10190470] |Each unit in the model is a record which is also known as a node. [10190480] |In such a model, each record on one level can be related to multiple records on the next lower level. [10190490] |A record that has subsidiary records is called a parent and the subsidiary records are called children. [10190500] |Data elements in this model are well suited for one-to-many relationships with other data elements in the database. [10190510] |This model is advantageous when the data elements are inherently hierarchical. [10190520] |The disadvantage is that in order to prepare the database it becomes necessary to identify the requisite groups of files that are to be logically integrated. [10190530] |Hence, a hierarchical data model may not always be flexible enough to accommodate the dynamic needs of an organization. [10190540] |===Network model=== [10190550] |The [[network model]] tends to store records with links to other records.
[10190560] |Each record in the database can have multiple parents, i.e., the relationships among data elements can be many-to-many. [10190570] |Associations are tracked via "pointers". [10190580] |These pointers can be node numbers or disk addresses. [10190590] |Most network databases tend to also include some form of hierarchical model. [10190600] |Databases can be translated from hierarchical model to network and vice versa. [10190610] |The main difference between the network model and hierarchical model is that in a network model, a child can have a number of parents whereas in a hierarchical model, a child can have only one parent. [10190620] |The network model offers an advantage over the hierarchical model in that it promotes greater flexibility and data accessibility, since records at a lower level can be accessed without accessing the records above them. [10190630] |This model is more efficient than the hierarchical model, easier to understand and can be applied to many real world problems that require routine transactions. [10190640] |The disadvantages are that it is a complex process to design and develop a network database; it has to be refined frequently; it requires that the relationships among all the records be defined before development starts, and changes often demand major programming efforts; and operation and maintenance of the network model is expensive and time-consuming. [10190650] |Examples of database engines that have network model capabilities are [[RDM Embedded]] and [[RDM Server]].
[10190700] |The "relation" in "relational database" comes from the mathematical notion of [[Relation (mathematics)|relations]] from the field of [[set theory]]. [10190710] |A relation is a set of [[tuple]]s, so rows are sometimes called tuples. [10190720] |All tables in a relational database adhere to three basic rules. [10190730] |* The ordering of columns is immaterial [10190740] |* Identical rows are not allowed in a table [10190750] |* Each row has a single (separate) value for each of its columns (each tuple has an atomic value). [10190760] |If the same value occurs in two different records (from the same table or different tables) it can imply a relationship between those records. [10190770] |Relationships between records are often categorized by their [[Cardinality (data modeling)|cardinality]] (1:1, 1:M, M:M). [10190780] |Tables can have a designated column or set of columns that act as a "key" to select rows from that table with the same or similar key values. [10190790] |A "primary key" is a key that has a unique value for each row in the table. [10190800] |Keys are commonly used to join or combine data from two or more tables. [10190810] |For example, an ''employee'' table may contain a column named ''address'' which contains a value that matches the key of an ''address'' table. [10190820] |Keys are also critical in the creation of indexes, which facilitate fast retrieval of data from large tables. [10190830] |It is not necessary to define all the keys in advance; a column can be used as a key even if it was not originally intended to be one. [10190840] |====Relational operations==== [10190850] |Users (or programs) request data from a relational database by sending it a [[query]] that is written in a special language, usually a dialect of [[SQL]]. [10190860] |Although SQL was originally intended for end-users, it is much more common for SQL queries to be embedded into software that provides an easier user interface.
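The employee/address key example above can be sketched with Python's built-in sqlite3 module; the table contents and exact column names are illustrative, not taken from the text.

```python
import sqlite3

# A minimal sketch of a key relationship between two tables:
# employee.address_id holds values matching the primary key of address.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE address (id INTEGER PRIMARY KEY, city TEXT)")
con.execute("CREATE TABLE employee (name TEXT, address_id INTEGER)")
con.execute("INSERT INTO address VALUES (1, 'London')")
con.execute("INSERT INTO employee VALUES ('Ada', 1)")

# A join combines rows from both tables through the key relationship.
row = con.execute(
    "SELECT employee.name, address.city "
    "FROM employee JOIN address ON employee.address_id = address.id"
).fetchone()
print(row)  # ('Ada', 'London')
```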
[10190870] |Many web applications, such as [[Wikipedia]], perform SQL queries when generating pages. [10190880] |In response to a query, the database returns a result set, which is the list of rows constituting the answer. [10190890] |The simplest query is just to return all the rows from a table, but more often, the rows are filtered in some way to return just the answer wanted. [10190900] |Often, data from multiple tables are combined into one, by doing a [[Join (SQL)|join]]. [10190910] |There are a number of relational operations in addition to join. [10190920] |====Normal forms==== [10190930] |Relations are classified based upon the types of anomalies to which they're vulnerable. [10190940] |A database that's in the first normal form is vulnerable to all types of anomalies, while a database that's in the domain/key normal form has no modification anomalies. [10190950] |Normal forms are hierarchical in nature. [10190960] |That is, the lowest level is the first normal form, and the database cannot meet the requirements for higher level normal forms without first having met all the requirements of the lesser normal form. [10190970] |==Database Management Systems== [10190980] |===Relational database management systems=== [10190990] |An RDBMS implements the features of the relational model outlined above. [10191000] |In this context, [[Christopher J. Date|Date]]'s '''Information Principle''' states: [10191010] |
The entire information content of the database is represented in one and only one way. [10191020] |Namely, as explicit values in column positions (attributes) and rows in relations ([[tuple]]s). Therefore, there are no explicit pointers between related tables.
[10191030] |===Post-relational database models=== [10191040] |Several products have been identified as [[post-relational]] because the data model incorporates [[relations]] but is not constrained by the Information Principle, which requires that all information be represented by [[data values]] in relations. [10191050] |Products using a post-relational data model typically employ a model that actually pre-dates the [[relational model]]. [10191060] |These might be identified as a [[directed graph]] with [[tree data structure|trees]] on the [[data structure|nodes]]. [10191070] |Examples of models that could be classified as post-relational are [[Pick operating system|PICK]] aka [[Multidimensional database|MultiValue]], and [[MUMPS]]. [10191080] |===Object database models=== [10191090] |In recent years, the [[object-oriented]] paradigm has been applied to database technology, creating a new programming model known as [[object database]]s. [10191100] |These databases attempt to bring the database world and the application programming world closer together, in particular by ensuring that the database uses the same [[type system]] as the application program. [10191110] |This aims to avoid the overhead (sometimes referred to as the ''[[Object-Relational impedance mismatch|impedance mismatch]]'') of converting information between its representation in the database (for example as rows in tables) and its representation in the application program (typically as objects). [10191120] |At the same time, object databases attempt to introduce the key ideas of object programming, such as [[encapsulation]] and [[polymorphism (computer science)|polymorphism]], into the world of databases. [10191130] |A variety of ways have been tried for storing objects in a database. [10191140] |Some products have approached the problem from the application programming end, by making the objects manipulated by the program [[Persistence (computer science)|persistent]].
[10191150] |This also typically requires the addition of some kind of query language, since conventional programming languages do not have the ability to find objects based on their information content. [10191160] |Others have attacked the problem from the database end, by defining an object-oriented data model for the database, and defining a database programming language that allows full programming capabilities as well as traditional query facilities. [10191170] |==DBMS internals== [10191180] |===Storage and physical database design=== [10191190] |Database tables/indexes are typically stored in memory or on hard disk in one of many forms: ordered/unordered [[flat file database|flat files]], [[ISAM]], [[heap (data structure)|heaps]], [[hash table|hash buckets]] or [[B+ tree]]s. [10191200] |These have various advantages and disadvantages discussed further in the main article on this topic. [10191210] |The most commonly used are B+ trees and ISAM. [10191220] |Other important design choices relate to the clustering of data by category (such as grouping data by month, or location), creating pre-computed views known as materialized views, and partitioning data by range or hash. [10191230] |Memory management and storage topology can also be important design choices for database designers. [10191240] |Just as normalization is used to reduce storage requirements and improve the extensibility of the database, denormalization is conversely often used to reduce join complexity and reduce execution time for queries.
[10191290] |An index allows a set of table rows matching some criterion to be located quickly. [10191300] |Typically, indexes are also stored in the various forms of data-structure mentioned above (such as [[B-tree]]s, [[hash table|hash]]es, and [[linked lists]]). [10191310] |Usually, a specific technique is chosen by the database designer to increase efficiency in the particular case of the type of index required. [10191320] |Relational DBMS's have the advantage that indexes can be created or dropped without changing existing applications making use of them. [10191330] |The database chooses between many different strategies based on which one it estimates will run the fastest. [10191340] |In other words, indexes are transparent to the application or end-user querying the database; while they affect performance, any SQL command will run with or without an index to compute the result of an [[SQL]] statement. [10191350] |The RDBMS will produce a plan of how to execute the query, which is generated by analyzing the run times of the different algorithms and selecting the quickest. [10191360] |Some of the key algorithms that deal with [[join (SQL)|joins]] are [[nested loop join]], [[sort-merge join]] and [[hash join]]. [10191370] |Which of these is chosen depends on whether an index exists, what type it is, and its [[Cardinality (SQL statements)|cardinality]]. [10191380] |An index speeds up access to data, but it has disadvantages as well. [10191390] |First, every index increases the amount of storage on the hard drive necessary for the database file, and second, the index must be updated each time the data are altered, and this costs time. [10191400] |(Thus an index saves time in the reading of data, but it costs time in entering and altering data. [10191410] |It thus depends on the use to which the data are to be put whether an index is on the whole a net plus or minus in the quest for efficiency.)
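The transparency of indexes to the query, and the planner's choice of strategy, can be sketched with SQLite, whose EXPLAIN QUERY PLAN statement reports the chosen strategy; the table and index names here are made up for illustration.

```python
import sqlite3

# The same query before and after an index exists: the result is identical,
# but the plan changes from a full table scan to an index search.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (val INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])

plan_before = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE val = 500").fetchone()[-1]
con.execute("CREATE INDEX idx_val ON t (val)")
plan_after = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM t WHERE val = 500").fetchone()[-1]

print(plan_before)  # e.g. a full scan of t
print(plan_after)   # e.g. a search of t using idx_val
```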
[10191420] |A special case of an index is a primary index, or primary key, which is distinguished in that the primary index must ensure a unique reference to a record. [10191430] |Often, for this purpose one simply uses a running index number (ID number). [10191440] |Primary indexes play a significant role in relational databases, and they can speed up access to data considerably. [10191450] |===Transactions and concurrency=== [10191460] |In addition to their data model, most practical databases ("transactional databases") attempt to support [[database transaction]]s. [10191470] |Ideally, the database software should enforce the [[ACID]] rules, summarized here: [10191480] |* [[Atomicity]]: Either all the tasks in a transaction must be done, or none of them. [10191490] |The transaction must be completed, or else it must be undone (rolled back). [10191500] |* [[Database consistency|Consistency]]: Every transaction must preserve the integrity constraints — the declared consistency rules — of the database. [10191510] |It cannot place the data in a contradictory state. [10191520] |* [[Isolation]]: Two simultaneous transactions cannot interfere with one another. [10191530] |Intermediate results within a transaction are not visible to other transactions. [10191540] |* [[Durability (computer science)|Durability]]: Completed transactions cannot be aborted later or their results discarded. [10191550] |They must persist through (for instance) restarts of the DBMS after crashes. [10191560] |In practice, many DBMS's allow most of these rules to be selectively relaxed for better performance. [10191570] |[[Concurrency control]] is a method used to ensure that transactions are executed in a safe manner and follow the ACID rules.
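The atomicity rule can be sketched with SQLite: a rollback undoes every statement issued inside the transaction. The account table and the transfer amounts are illustrative.

```python
import sqlite3

# Atomicity: either all statements in the transaction take effect, or none.
con = sqlite3.connect(":memory:", isolation_level=None)  # explicit BEGIN/COMMIT
con.execute("CREATE TABLE account (name TEXT, balance INTEGER)")
con.execute("INSERT INTO account VALUES ('a', 100), ('b', 100)")

con.execute("BEGIN")
con.execute("UPDATE account SET balance = balance - 50 WHERE name = 'a'")
con.execute("ROLLBACK")  # the transaction is undone: the debit never happened

balance = con.execute(
    "SELECT balance FROM account WHERE name = 'a'").fetchone()[0]
print(balance)  # 100
```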
[10191580] |The DBMS must be able to ensure that only [[serializability|serializable]], [[serializability#correctness - recoverability|recoverable]] schedules are allowed, and that no actions of committed transactions are lost while undoing aborted transactions. [10191590] |===Replication=== [10191600] |Replication of databases is closely related to transactions. [10191610] |If a database can log its individual actions, it is possible to create a duplicate of the data in real time. [10191620] |The duplicate can be used to improve performance or availability of the whole database system. [10191630] |Common replication concepts include: [10191640] |* Master/Slave Replication: All write requests are performed on the master and then replicated to the slaves. [10191650] |* Quorum: The results of read and write requests are calculated by querying a "majority" of replicas. [10191660] |* Multimaster: Two or more replicas synchronize with each other via a transaction identifier. [10191670] |Parallel synchronous replication of databases enables transactions to be replicated on multiple servers simultaneously, which provides a method for backup and security as well as data availability. [10191680] |===Security=== [10191690] |[[Database security]] denotes the system, processes, and procedures that protect a database from unintended activity. [10191700] |Security is usually enforced through '''access control''', '''auditing''', and '''encryption'''. [10191710] |* Access control restricts who can connect and what can be done to the database. [10191720] |* Auditing logs what action or change has been performed, when and by whom. [10191730] |* Encryption: Since security has become a major issue in recent years, many commercial database vendors provide built-in encryption mechanisms. [10191740] |Data is encoded natively into the tables and deciphered "on the fly" when a query comes in.
[10191745] |Connections can also be secured and encrypted if required, using mechanisms such as DSA, MD5, SSL or legacy encryption standards. [10191750] |Enforcing security is one of the major tasks of the DBA. [10191760] |In the United Kingdom, legislation protecting the public from unauthorized disclosure of personal information held on databases falls under the Office of the Information Commissioner. [10191770] |United Kingdom based organizations holding personal data in electronic format (databases for example) are required to register with the Data Commissioner. [10191780] |===Locking=== [10191790] |[[Lock (computer science)|Locking]] is how the database handles multiple concurrent operations. [10191800] |This is how concurrency and some basic form of integrity are managed within the database system. [10191810] |Such locks can be applied on a row level, or on other levels such as a page (a basic data block), an extent (an array of multiple pages) or even an entire table. [10191820] |This helps maintain the integrity of the data by ensuring that only one process at a time can modify the '''same''' data. [10191830] |Unlike basic filesystem files or folders, where only one lock can be set at a time, restricting usage to a single process, [10191840] |a database can set and hold multiple locks at the same time on different levels of the physical data structure. [10191850] |How locks are set and how long they last is determined by the database engine's locking scheme, based on the SQL statements or transactions submitted by the users. [10191860] |Generally speaking, no activity on the database should translate into no or only very light locking. [10191870] |For most DBMS systems on the market, locks are generally '''shared''' or '''exclusive'''. [10191880] |An exclusive lock means that no other lock can be acquired on the current data object as long as the exclusive lock lasts. [10191890] |Exclusive locks are usually set while the database needs to change data, such as during an UPDATE or DELETE operation.
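The effect of an exclusive write lock can be sketched with SQLite, which locks at the whole-database level rather than per row or per page; the file name and timeout value here are illustrative.

```python
import os
import sqlite3
import tempfile

# Two connections to the same database file; while the first holds the
# write lock, the second writer is blocked and its wait times out.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
con1 = sqlite3.connect(path, timeout=0.1, isolation_level=None)
con2 = sqlite3.connect(path, timeout=0.1, isolation_level=None)
con1.execute("CREATE TABLE t (x INTEGER)")

con1.execute("BEGIN IMMEDIATE")  # take the write lock
con1.execute("INSERT INTO t VALUES (1)")
msg = ""
try:
    con2.execute("BEGIN IMMEDIATE")  # second writer must wait...
except sqlite3.OperationalError as e:
    msg = str(e)                     # ...and times out
print(msg)  # database is locked
con1.execute("COMMIT")  # releasing the lock lets other writers proceed
```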
[10191900] |Shared locks can coexist with each other on the current data structure. [10191910] |Shared locks are usually used while the database is reading data, such as during a SELECT operation. [10191920] |The number and nature of locks, and the time a lock holds a data block, can have a huge impact on database performance. [10191930] |Bad locking can lead to disastrous performance (usually the result of poor SQL requests or an inadequate physical database structure). [10191940] |Default locking behavior is enforced by the '''isolation level''' of the data server. [10191950] |Changing the isolation level affects how shared or exclusive locks must be set on the data for the entire database system. [10191960] |The default isolation level is generally 1, where data cannot be read while it is being modified, which prevents "ghost data" from being returned to the end user. [10191970] |At some point, intensive or inappropriate exclusive locking can lead to a "deadlock" situation between two locks, [10191980] |where neither lock can be released because each is trying to acquire resources held by the other. [10191990] |The database has a fail-safe mechanism and will automatically "sacrifice" one of the locks, releasing the resource. [10192000] |In doing so, the processes or transactions involved in the deadlock will be rolled back. [10192010] |Databases can also be locked for other reasons, such as access restrictions for given levels of user. [10192020] |Databases are also locked for routine database maintenance, which prevents changes being made during the maintenance. [10192030] |See [http://publib.boulder.ibm.com/infocenter/rbhelp/v6r3/index.jsp?topic=/com.ibm.redbrick.doc6.3/wag/wag80.htm IBM] for more detail. [10192040] |===Architecture=== [10192050] |Depending on the intended use, there are a number of database architectures in use. [10192060] |Many databases use a combination of strategies. 
[10192070] |On-line Transaction Processing (OLTP) systems often use a row-oriented datastore architecture, while data-warehouse and other retrieval-focused applications like [[Google]]'s [[BigTable]], or bibliographic database (library catalogue) systems, may use a column-oriented datastore architecture. [10192080] |Document-oriented, XML and knowledge-base systems, as well as frame databases and RDF stores (a.k.a. triple stores), may also use a combination of these architectures in their implementation. [10192090] |Finally, not all databases have or need a database 'schema' (so-called schema-less databases). [10192100] |==Applications of databases== [10192110] |Databases are used in many applications, spanning virtually the entire range of [[computer software]]. [10192120] |Databases are the preferred method of storage for large multiuser applications, where coordination between many users is needed. [10192130] |Even individual users find them convenient, and many electronic mail programs and personal organizers are based on standard database technology. [10192140] |Software database drivers are available for most database platforms so that [[application software]] can use a common [[Application Programming Interface]] to retrieve the information stored in a database. [10192150] |Two commonly used database APIs are [[Java Database Connectivity|JDBC]] and [[ODBC]]. [10192160] |For example, a suppliers database contains data relating to suppliers, such as: [10192170] |*supplier name [10192180] |*supplier code [10192190] |*supplier address [10192200] |Databases are also often used by schools to teach students and to grade them. 
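As an illustration of such a common database API, here is a minimal sketch using Python's built-in `sqlite3` driver (the table and column names are hypothetical, mirroring the suppliers example above); the connect/execute/fetch pattern is the same style of interface that JDBC and ODBC expose:

```python
import sqlite3

# In-memory database for illustration; a real application would use a file
# or a server-based DBMS reached through a driver such as JDBC or ODBC.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE suppliers (code TEXT PRIMARY KEY, name TEXT, address TEXT)")
conn.executemany(
    "INSERT INTO suppliers VALUES (?, ?, ?)",
    [("S01", "Acme Ltd", "1 High St"), ("S02", "Globex", "2 Low Rd")],
)
conn.commit()

# Query through the same generic API: the application never touches the
# storage format directly.
rows = conn.execute("SELECT name FROM suppliers ORDER BY code").fetchall()
print(rows)  # [('Acme Ltd',), ('Globex',)]
```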
[10192210] |==Links to DBMS products== [10192220] |*[[4th Dimension (Software)|4D]] [10192230] |*[[ADABAS]] [10192240] |*[[Alpha Five]] [10192250] |*[[Apache Derby]] (Java, also known as IBM Cloudscape and Sun Java DB) [10192260] |*[[BerkeleyDB]] [10192270] |*[[CouchDB]] [10192280] |*[[CSQL]] [10192290] |*[[Datawasp]] [10192300] |*[[Db4objects]] [10192310] |*[[dBase]] [10192320] |*[[FileMaker]] [10192330] |*[[Firebird (database server)]] [10192340] |*[[H2 (DBMS)|H2]] (Java) [10192350] |*[[Hsqldb]] (Java) [10192360] |*[[IBM DB2]] [10192370] |*[[Information Management System|IBM IMS (Information Management System)]] [10192380] |*[[IBM UniVerse]] [10192390] |*[[Informix]] [10192400] |*[[Ingres (database)|Ingres]] [10192410] |*[[Interbase]] [10192420] |*[[InterSystems Caché]] [10192430] |*[[MaxDB]] (formerly SapDB) [10192440] |*[[Microsoft Access]] [10192450] |*[[Microsoft SQL Server]] [10192460] |*[[Model 204]] [10192470] |*[[MySQL]] [10192480] |*[[Nomad software|Nomad]] [10192490] |*[[Objectivity/DB]] [10192500] |*[[ObjectStore]] [10192510] |*[[Virtuoso Universal Server|OpenLink Virtuoso]] [10192520] |*[[OpenOffice.org Base]] [10192530] |*[[Oracle Database]] [10192540] |*[[Paradox (database)]] [10192550] |*[[Polyhedra DBMS]] [10192560] |*[[PostgreSQL]] [10192570] |*[[Progress 4GL]] [10192580] |*[[RDM Embedded]] [10192590] |*[[ScimoreDB]] [10192600] |*[[Sedna (database)|Sedna]] [10192610] |*[[SQLite]] [10192620] |*[[Superbase database|Superbase]] [10192630] |*[[Sybase]] [10192640] |*[[Teradata]] [10192650] |*[[Vertica]] [10192660] |*[[Visual FoxPro]] [10200010] |
Cluster analysis
[10200020] |'''Clustering''' is the [[Statistical classification|classification]] of objects into different groups, or more precisely, the [[partition of a set|partitioning]] of a [[data set]] into [[subset]]s (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined [[metric (mathematics)|distance measure]]. [10200030] |Data clustering is a common technique for [[statistics|statistical]] [[data analysis]], which is used in many fields, including [[machine learning]], [[data mining]], [[pattern recognition]], [[image analysis]] and [[bioinformatics]]. [10200040] |The computational task of classifying the data set into ''k'' clusters is often referred to as '''''k''-clustering'''''. [10200050] |Besides the term ''data clustering'' (or just ''clustering''), there are a number of terms with similar meanings, including ''cluster analysis'', ''automatic classification'', ''numerical taxonomy'', ''botryology'' and ''typological analysis''. [10200060] |== Types of clustering == [10200070] |Data clustering algorithms can be [[hierarchical]] or partitional. [10200080] |Hierarchical algorithms find successive clusters using previously established clusters. [10200090] |Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). [10200100] |Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. [10200110] |Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters. [10200120] |[[partition of a set|Partitional]] algorithms typically determine all clusters at once, but can also be used as divisive algorithms in [[hierarchical]] clustering. 
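The agglomerative ("bottom-up") approach described above can be sketched in a few lines of plain Python; this is a minimal single-linkage sketch on hypothetical one-dimensional toy data, not a production implementation:

```python
# Single-linkage agglomerative clustering on 1-D toy points (hypothetical data).
# Each point starts as its own cluster; the two closest clusters are merged
# repeatedly until the requested number of clusters remains.

def single_linkage(points, k):
    clusters = [[p] for p in points]          # bottom-up: one cluster per element
    while len(clusters) > k:
        # Find the pair of clusters with the smallest minimum inter-point distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)        # merge the closest pair
    return [sorted(c) for c in clusters]

print(single_linkage([1.0, 1.1, 5.0, 5.2, 9.9], 3))  # [[1.0, 1.1], [5.0, 5.2], [9.9]]
```

A divisive algorithm would run the other way, starting from the single cluster of all five points and splitting it.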
[10200130] |''Two-way clustering'', ''co-clustering'' or [[biclustering]] are clustering methods where not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a [[data matrix (statistics)|data matrix]], the rows and columns are clustered simultaneously. [10200140] |Another important distinction is whether the clustering uses symmetric or asymmetric distances. [10200150] |A property of [[Euclidean space]] is that distances are symmetric (the distance from object ''A'' to ''B'' is the same as the distance from ''B'' to ''A''). [10200160] |In other applications (e.g., sequence-alignment methods, see Prinzie & Van den Poel (2006)), this is not the case. [10200170] |== Distance measure == [10200180] |An important step in any clustering is to select a [[Distance|distance measure]], which will determine how the ''similarity'' of two elements is calculated. [10200190] |This will influence the shape of the clusters, as some elements may be close to one another according to one distance and further away according to another. [10200200] |For example, in a 2-dimensional space, the distance between the point (x=1, y=0) and the origin (x=0, y=0) is always 1 according to the usual norms, but the distance between the point (x=1, y=1) and the origin can be 2, \sqrt 2 or 1 if you take respectively the 1-norm, 2-norm or infinity-norm distance. [10200210] |Common distance functions: [10200220] |* The [[Euclidean distance]] (also called distance [[as the crow flies]] or 2-norm distance). [10200230] |A review of cluster analysis in health psychology research found that the most common distance measure in published studies in that research area is the Euclidean distance or the squared Euclidean distance. 
[10200240] |* The [[Manhattan distance]] (also called taxicab norm or 1-norm) [10200250] |* The [[Maximum_norm|maximum norm]] [10200260] |* The [[Mahalanobis distance]] corrects data for different scales and correlations in the variables [10200270] |* The angle between two vectors can be used as a distance measure when clustering high dimensional data. [10200280] |See [[Inner product space]]. [10200290] |* The [[Hamming distance]] (sometimes edit distance) measures the minimum number of substitutions required to change one member into another. [10200300] |==Hierarchical clustering== [10200310] |===Creating clusters=== [10200320] |Hierarchical clustering builds (agglomerative), or breaks up (divisive), a hierarchy of clusters. [10200330] |The traditional representation of this hierarchy is a [[tree data structure|tree]] (called a [[dendrogram]]), with individual elements at one end and a single cluster containing every element at the other. [10200340] |Agglomerative algorithms begin at the leaves of the tree, whereas divisive algorithms begin at the root. [10200350] |(In the figure, the arrows indicate an agglomerative clustering.) [10200360] |Cutting the tree at a given height will give a clustering at a selected precision. [10200370] |In the following example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. [10200380] |Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters. [10200390] |===Agglomerative hierarchical clustering=== [10200400] |For example, suppose this data is to be clustered, and the [[euclidean distance]] is the [[Metric (mathematics)|distance metric]]. [10200410] |The hierarchical clustering [[dendrogram]] would be as follows: [10200420] |This method builds the hierarchy from the individual elements by progressively merging clusters. [10200430] |In our example, we have six elements {a} {b} {c} {d} {e} and {f}. 
[10200440] |The first step is to determine which elements to merge in a cluster. [10200450] |Usually, we want to take the two closest elements, according to the chosen distance. [10200460] |Optionally, one can also construct a [[distance matrix]] at this stage, where the number in the ''i''-th row and ''j''-th column is the distance between the ''i''-th and ''j''-th elements. [10200470] |Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. [10200480] |This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters. [10200490] |A simple agglomerative clustering algorithm is described in the [[single linkage clustering]] page; it can easily be adapted to different types of linkage (see below). [10200500] |Suppose we have merged the two closest elements ''b'' and ''c''; we now have the following clusters {''a''}, {''b'', ''c''}, {''d''}, {''e''} and {''f''}, and want to merge them further. [10200510] |To do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters. [10200520] |Usually the distance between two clusters \mathcal{A} and \mathcal{B} is one of the following: [10200530] |* The maximum distance between elements of each cluster (also called complete linkage clustering): [10200540] |:: \max \{\, d(x,y) : x \in \mathcal{A},\, y \in \mathcal{B}\,\} [10200550] |* The minimum distance between elements of each cluster (also called [[single linkage clustering]]): [10200560] |:: \min \{\, d(x,y) : x \in \mathcal{A},\, y \in \mathcal{B} \,\} [10200570] |* The mean distance between elements of each cluster (also called average linkage clustering, used e.g. 
in [[UPGMA]]): [10200580] |:: {1 \over {|\mathcal{A}|\cdot|\mathcal{B}|}}\sum_{x \in \mathcal{A}}\sum_{ y \in \mathcal{B}} d(x,y) [10200590] |* The sum of all intra-cluster variance [10200600] |* The increase in variance for the cluster being merged ([[Ward's criterion]]) [10200610] |* The probability that candidate clusters spawn from the same distribution function (V-linkage) [10200620] |Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion). [10200630] |=== Concept clustering === [10200640] |Another variation of the agglomerative clustering approach is [[conceptual clustering]]. [10200650] |==Partitional clustering== [10200660] |===''K''-means and derivatives=== [10200670] |====''K''-means clustering==== [10200680] |The [[K-means algorithm|''K''-means algorithm]] assigns each point to the cluster whose center (also called centroid) is nearest. [10200690] |The center is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster. [10200700] |:''Example:'' The data set has three dimensions and the cluster has two points: ''X'' = (''x''1, ''x''2, ''x''3) and ''Y'' = (''y''1, ''y''2, ''y''3). [10200710] |Then the centroid ''Z'' becomes ''Z'' = (''z''1, ''z''2, ''z''3), where ''z''1 = (''x''1 + ''y''1)/2 and ''z''2 = (''x''2 + ''y''2)/2 and ''z''3 = (''x''3 + ''y''3)/2. [10200720] |The algorithm steps are (J. MacQueen, 1967): [10200730] |* Choose the number of clusters, ''k''. [10200740] |* Randomly generate ''k'' clusters and determine the cluster centers, or directly generate ''k'' random points as cluster centers. [10200750] |* Assign each point to the nearest cluster center. [10200760] |* Recompute the new cluster centers. 
[10200770] |* Repeat the two previous steps until some convergence criterion is met (usually that the assignment hasn't changed). [10200780] |The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. [10200790] |Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments. [10200800] |It minimizes intra-cluster variance, but does not ensure that the result has a global minimum of variance. [10200810] |====Fuzzy ''c''-means clustering==== [10200820] |In [[fuzzy clustering]], each point has a degree of belonging to clusters, as in [[fuzzy logic]], rather than belonging completely to just one cluster. [10200830] |Thus, points on the edge of a cluster may be ''in the cluster'' to a lesser degree than points in the center of the cluster. [10200840] |For each point ''x'' we have a coefficient giving the degree of being in the ''k''th cluster, u_k(x). [10200850] |Usually, the sum of those coefficients is defined to be 1: [10200860] |: \forall x \sum_{k=1}^{\mathrm{num.}\ \mathrm{clusters}} u_k(x) \ =1. [10200870] |With fuzzy ''c''-means, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster: [10200880] |:\mathrm{center}_k = {{\sum_x u_k(x)^m x} \over {\sum_x u_k(x)^m}}. [10200890] |The degree of belonging is related to the inverse of the distance to the cluster center, [10200900] |:u_k(x) = {1 \over d(\mathrm{center}_k,x)}, [10200910] |then the coefficients are normalized and fuzzified with a real parameter m>1 so that their sum is 1. [10200920] |So [10200930] |:u_k(x) = \frac{1}{\sum_j \left(\frac{d(\mathrm{center}_k,x)}{d(\mathrm{center}_j,x)}\right)^{2/(m-1)}}. [10200940] |For ''m'' equal to 2, this is equivalent to normalising the coefficients linearly to make their sum 1. 
[10200950] |When ''m'' is close to 1, the cluster center closest to the point is given much more weight than the others, and the algorithm is similar to ''k''-means. [10200960] |The fuzzy ''c''-means algorithm is very similar to the ''k''-means algorithm: [10200970] |* Choose a number of clusters. [10200980] |* Randomly assign to each point coefficients for being in the clusters. [10200990] |* Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than \epsilon, the given sensitivity threshold): [10201000] |** Compute the centroid for each cluster, using the formula above. [10201010] |** For each point, compute its coefficients of being in the clusters, using the formula above. [10201020] |The algorithm minimizes intra-cluster variance as well, but has the same problems as ''k''-means: the minimum is a local minimum, and the results depend on the initial choice of weights. [10201030] |The [[Expectation-maximization algorithm]] is a more statistically formalized method which includes some of these ideas: partial membership in classes. [10201040] |It has better convergence properties and is in general preferred to fuzzy ''c''-means. [10201050] |====QT clustering algorithm==== [10201060] |QT (quality threshold) clustering (Heyer et al., 1999) is an alternative method of partitioning data, invented for gene clustering. [10201070] |It requires more computing power than ''k''-means, but does not require specifying the number of clusters ''a priori'', and always returns the same result when run several times. [10201080] |The algorithm is: [10201090] |* The user chooses a maximum diameter for clusters. [10201100] |* Build a candidate cluster for each point by including the closest point, the next closest, and so on, until the diameter of the cluster surpasses the threshold. 
[10201110] |* Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration. [10201130] |* [[Recursion|Recurse]] with the reduced set of points. [10201140] |The distance between a point and a group of points is computed using complete linkage, i.e. as the maximum distance from the point to any member of the group (see the "Agglomerative hierarchical clustering" section about distance between clusters). [10201150] |=== Locality-sensitive hashing === [10201160] |[[Locality-sensitive hashing]] can be used for clustering. [10201170] |Feature space vectors are sets, and the metric used is the [[Jaccard distance]]. [10201180] |The feature space can be considered high-dimensional. [10201190] |The ''min-wise independent permutations'' LSH scheme (sometimes MinHash) is then used to put similar items into buckets. [10201200] |With just one set of hashing methods, there are only clusters of very similar elements. [10201210] |By seeding the hash functions several times (e.g. 20), it is possible to get bigger clusters. [10201220] |=== Graph-theoretic methods === [10201230] |[[Formal concept analysis]] is a technique for generating clusters of objects and attributes, given a [[bipartite graph]] representing the relations between the objects and attributes. [10201240] |Other methods for generating ''overlapping clusters'' (a [[Cover (topology)|cover]] rather than a [[partition of a set|partition]]) are discussed by Jardine and Sibson (1968) and Cole and Wishart (1970). [10201250] |== Elbow criterion == [10201260] |The elbow criterion is a common [[rule of thumb]] to determine what number of clusters should be chosen, for example for ''k''-means and agglomerative hierarchical clustering. [10201270] |The initial assignment of cluster seeds also has a bearing on the final model performance. 
[10201280] |Thus, it is appropriate to re-run the cluster analysis multiple times. [10201290] |The elbow criterion says that you should choose a number of clusters so that adding another cluster doesn't add sufficient information. [10201300] |More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph (the elbow). [10201310] |This elbow cannot always be unambiguously identified. [10201320] |Percentage of variance explained is the ratio of the between-group variance to the total variance. [10201330] |On the following graph, the elbow is indicated by the red circle. [10201340] |The number of clusters chosen should therefore be 4. [10201350] |== Spectral clustering == [10201360] |Given a set of data points A, the [[similarity matrix]] may be defined as a matrix S where S_{ij} represents a measure of the similarity between points i, j\in A. [10201370] |Spectral clustering techniques make use of the [[Spectrum of a matrix|spectrum]] of the similarity matrix of the data to perform [[dimensionality reduction]] for clustering in fewer dimensions. [10201380] |One such technique is the ''[[Shi-Malik algorithm]]'', commonly used for [[segmentation (image processing)|image segmentation]]. [10201390] |It partitions points into two sets (S_1,S_2) based on the [[eigenvector]] v corresponding to the second-smallest [[eigenvalue]] of the [[Laplacian matrix]] [10201400] |:L = I - D^{-1/2}SD^{-1/2} [10201410] |of S, where D is the diagonal matrix [10201420] |:D_{ii} = \sum_{j} S_{ij}. [10201430] |This partitioning may be done in various ways, such as by taking the median m of the components in v, and placing all points whose component in v is greater than m in S_1, and the rest in S_2. 
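The bipartition step described above can be sketched with numpy (the similarity matrix below is hypothetical toy data with two obvious blocks; this is an illustrative sketch, not a full Shi-Malik implementation):

```python
import numpy as np

# Hypothetical similarity matrix for four points: two obvious groups
# (indices {0,1} and {2,3}), with weak cross-group similarity.
S = np.array([
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.9],
    [0.0, 0.1, 0.9, 1.0],
])

# Normalized Laplacian L = I - D^(-1/2) S D^(-1/2), with D the degree matrix.
d = S.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(len(S)) - D_inv_sqrt @ S @ D_inv_sqrt

# Eigenvector for the second-smallest eigenvalue (eigh sorts eigenvalues ascending).
vals, vecs = np.linalg.eigh(L)
v = vecs[:, 1]

# Bipartition by the median of v's components, as described above.
m = np.median(v)
S1 = np.where(v > m)[0]
S2 = np.where(v <= m)[0]
print(sorted(S1.tolist()), sorted(S2.tolist()))  # the blocks {0,1} and {2,3}, in either order
```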
[10201440] |The algorithm can be used for hierarchical clustering by repeatedly partitioning the subsets in this fashion. [10201450] |A related algorithm is the ''[[Meila-Shi algorithm]]'', which takes the [[eigenvector]]s corresponding to the ''k'' largest [[eigenvalue]]s of the matrix P = SD^{-1} for some ''k'', and then invokes another algorithm (e.g. ''k''-means) to cluster points by their respective ''k'' components in these eigenvectors. [10201460] |==Applications== [10201470] |=== Biology === [10201480] |In [[biology]], '''clustering''' has many applications: [10201490] |*In imaging, data clustering may take different forms based on the data dimensionality. [10201500] |For example, the [http://wiki.stat.ucla.edu/socr/index.php/SOCR_EduMaterials_Activities_2D_PointSegmentation_EM_Mixture SOCR EM Mixture model segmentation activity and applet] shows how to obtain point, region or volume classification using the online [[SOCR]] computational libraries. [10201510] |*In the fields of [[plant]] and [[animal]] [[ecology]], clustering is used to describe and to make spatial and temporal comparisons of communities (assemblages) of organisms in heterogeneous environments; it is also used in [[Systematics|plant systematics]] to generate artificial [[Phylogeny|phylogenies]] or clusters of organisms (individuals) at the species, genus or higher level that share a number of attributes [10201520] |*In computational biology and [[bioinformatics]]: [10201530] |** In [[transcriptome|transcriptomics]], clustering is used to build groups of [[genes]] with related expression patterns (also known as coexpressed genes). [10201540] |Often such groups contain functionally related proteins, such as [[enzyme]]s for a specific [[metabolic pathway|pathway]], or genes that are co-regulated. [10201550] |High throughput experiments using [[expressed sequence tag]]s (ESTs) or [[DNA microarray]]s can be a powerful tool for [[genome annotation]], a general aspect of [[genomics]]. 
[10201560] |** In [[sequence analysis]], clustering is used to group homologous sequences into [[list of gene families|gene families]]. [10201570] |This is a very important concept in bioinformatics, and [[evolutionary biology]] in general. [10201580] |See evolution by [[gene duplication]]. [10201590] |** In high-throughput genotyping platforms, clustering algorithms are used to automatically assign [[genotypes]]. [10201600] |=== Medicine === [10201610] |In [[medical imaging]], such as [[PET scan|PET scans]], cluster analysis can be used to differentiate between different types of [[tissue (biology)|tissue]] and [[blood]] in a three dimensional image. [10201620] |In this application, actual position does not matter, but the [[voxel]] intensity is considered as a [[coordinate vector|vector]], with a dimension for each image that was taken over time. [10201630] |This technique allows, for example, accurate measurement of the rate at which a radioactive tracer is delivered to the area of interest, without a separate sampling of [[arterial]] blood, an invasive technique that is most common today. [10201640] |=== Market research === [10201650] |Cluster analysis is widely used in [[market research]] when working with multivariate data from [[Statistical survey|surveys]] and test panels. [10201660] |Market researchers use cluster analysis to partition the general [[population]] of [[consumers]] into market segments and to better understand the relationships between different groups of consumers/potential [[customers]]. [10201670] |* Segmenting the market and determining [[target market]]s [10201680] |* [[positioning (marketing)|Product positioning]] [10201690] |* [[New product development]] [10201700] |* Selecting test markets (see : [[experimental techniques]]) [10201710] |=== Other applications === [10201720] |'''Social network analysis''': In the study of [[social networks]], clustering may be used to recognize [[communities]] within large groups of people. 
[10201730] |'''Image segmentation''': Clustering can be used to divide a [[digital]] [[image]] into distinct regions for [[border detection]] or [[object recognition]]. [10201740] |'''Data mining''': Many [[data mining]] applications involve partitioning data items into related subsets; the marketing applications discussed above represent some examples. [10201750] |Another common application is the division of documents, such as [[World Wide Web]] pages, into genres. [10201760] |'''Search result grouping''': In the process of intelligent grouping of the files and websites, clustering may be used to create a more relevant set of search results compared to normal search engines like [[Google]]. [10201770] |There are currently a number of web based clustering tools such as [[Clusty]]. [10201780] |'''Slippy map optimization''': [[Flickr]]'s map of photos and other map sites use clustering to reduce the number of markers on a map. [10201790] |This both makes the map faster and reduces the amount of visual clutter. [10201800] |'''IMRT segmentation''': Clustering can be used to divide a fluence map into distinct regions for conversion into deliverable fields in MLC-based Radiation Therapy. [10201810] |'''Grouping of shopping items''': Clustering can be used to group all the shopping items available on the web into a set of unique products. [10201820] |For example, all the items on eBay can be grouped into unique products. [10201825] |(eBay does not have the concept of a SKU.) [10201830] |'''[[Mathematical chemistry]]''': To find structural similarity, etc., for example, 3000 chemical compounds were clustered in the space of 90 [[topological index|topological indices]]. [10201840] |'''Petroleum geology''': Cluster analysis is used to reconstruct missing bottom hole core data or missing log curves in order to evaluate reservoir properties. 
[10201850] |== Comparisons between data clusterings == [10201860] |There have been several suggestions for a measure of similarity between two clusterings. [10201870] |Such a measure can be used to compare how well different data clustering algorithms perform on a set of data. [10201880] |Many of these measures are derived from the [[matching matrix]] (aka [[confusion matrix]]), e.g., the [[Rand index|Rand measure]] and the Fowlkes-Mallows ''B''<sub>''k''</sub> measures. [10201890] |[[Marina Meila]]'s Variation of Information metric is a more recent approach for measuring distance between clusterings. [10201900] |It uses [[Mutual information|mutual information]] and [[entropy]] to approximate the distance between two clusterings across the lattice of possible clusterings. [10201910] |==Algorithms== [10201920] |In recent years considerable effort has been put into improving algorithm performance (Z. Huang, 1998). [10201930] |Among the most popular are ''CLARANS'' (Ng and Han, 1994), ''[[DBSCAN]]'' (Ester et al., 1996) and ''BIRCH'' (Zhang et al., 1996). [10210010] |
Data mining
[10210020] |'''Data mining''' is the process of [[sorting]] through large amounts of data and picking out relevant information. [10210030] |It is usually used by [[business intelligence]] organizations and [[financial analyst]]s, but is increasingly being used in the sciences to extract information from the enormous [[data set]]s generated by modern experimental and observational methods. [10210040] |It has been described as "the nontrivial extraction of implicit, previously unknown, and potentially useful [[information]] from [[data]]" and "the science of extracting useful information from large [[data set]]s or [[database]]s." [10210050] |Data mining in relation to [[enterprise resource planning]] is the statistical and logical analysis of large sets of transaction data, looking for patterns that can aid decision making. [10210060] |==Background== [10210070] |Traditionally, business analysts have performed the task of extracting useful [[information]] from recorded [[data]], but the increasing volume of data in modern business and science calls for computer-based approaches. [10210080] |As [[data set]]s have grown in size and complexity, there has been a shift away from direct hands-on data analysis toward indirect, automatic data analysis using more complex and sophisticated tools. [10210090] |The modern technologies of [[computers]], [[networks]], and [[sensors]] have made [[data collection]] and organization much easier. [10210100] |However, the captured data needs to be converted into [[information]] and [[knowledge]] to become useful. [10210110] |Data mining is the entire process of applying computer-based [[methodology]], including new techniques for [[knowledge discovery]], to data. [10210120] |Data mining identifies trends within data that go beyond simple analysis. [10210130] |Through the use of sophisticated algorithms, non-statistician users have the opportunity to identify key attributes of business processes and target opportunities. 
[10210140] |However, abdicating control of this process from the statistician to the machine may result in false-positives or no useful results at all. [10210150] |Although data mining is a relatively new term, the technology is not. [10210160] |For many years, businesses have used powerful computers to sift through volumes of data such as supermarket scanner data to produce market research reports (although reporting is not considered to be data mining). [10210170] |Continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy and usefulness of data analysis. [10210180] |Web 2.0 technologies have generated a colossal amount of user-generated data and media, making it hard to aggregate and consume information in a meaningful way without getting overloaded. [10210190] |Given the size of the data on the Internet, and the difficulty in contextualizing it, it is unclear whether the traditional approach to data mining is computationally viable. [10210200] |The term data mining is often used to apply to the two separate processes of knowledge discovery and [[prediction]]. [10210210] |Knowledge discovery provides explicit information that has a readable form and can be understood by a user. [10210220] |[[Forecasting]], or [[predictive modeling]] provides predictions of future events and may be transparent and readable in some approaches (e.g., rule-based systems) and opaque in others such as [[neural network]]s. [10210230] |Moreover, some data-mining systems such as neural networks are inherently geared towards prediction and pattern recognition, rather than knowledge discovery. [10210240] |[[Metadata]], or data about a given data set, are often expressed in a condensed ''data-minable'' format, or one that facilitates the practice of data mining. [10210250] |Common examples include executive summaries and scientific abstracts. [10210260] |Data mining relies on the use of real world data. 
[10210270] |This data is extremely vulnerable to [[collinearity]] precisely because data from the real world may have unknown interrelations. [10210280] |An unavoidable weakness of data mining is that the critical data that might expose a relationship may never have been observed. [10210290] |Alternative approaches using an experiment-based approach, such as [[Choice Modelling]] for human-generated data, may be used. [10210300] |Inherent correlations are either controlled for or removed altogether through the construction of an [[experimental design]]. [10210310] |Recently, there have been efforts to define a standard for data mining, for example the [[CRISP-DM]] standard for analysis processes or the [[Java Data-Mining]] standard. [10210320] |Independent of these standardization efforts, freely available open-source software systems like [[RapidMiner]] and [[Weka (machine learning)|Weka]] have become an informal standard for defining data-mining processes. [10210330] |==Privacy concerns== [10210340] |There are also [[privacy]] and [[human rights]] concerns associated with data mining, specifically regarding the source of the data analyzed. [10210350] |Data mining provides information that may be difficult to obtain otherwise. [10210360] |When the data collected involves individual people, there are many questions concerning privacy, legality, and ethics. [10210370] |In particular, data mining of government or commercial data sets for national security or law enforcement purposes has raised privacy concerns. [10210380] |==Notable uses of data mining== [10210390] |===Combating terrorism=== [10210400] |Data mining has been cited as the method by which the U.S. Army unit [[Able Danger]] identified the [[September 11, 2001 attacks]] leader, [[Mohamed Atta]], and three other 9/11 hijackers as possible members of an [[Al Qaeda]] cell operating in the U.S. more than a year before the attack. 
[10210410] |It has been suggested that both the [[Central Intelligence Agency]] and the [[Canadian Security Intelligence Service]] have employed this method. [10210420] |Previous data-mining programs intended to stop terrorism under the US government include the Terrorism Information Awareness (TIA) program, the Computer-Assisted Passenger Prescreening System (CAPPS II), Analysis, Dissemination, Visualization, Insight, and Semantic Enhancement (ADVISE), the Multistate Anti-Terrorism Information Exchange (MATRIX), and the Secure Flight program [http://www.msnbc.msn.com/id/20604775/ Security-MSNBC]. [10210430] |These programs have been discontinued due to controversy over whether they violate the Fourth Amendment to the US Constitution. [10210440] |===Games=== [10210450] |Since the early 1960s, with the availability of [[Oracle machine|oracle]]s for certain [[combinatorial game]]s, also called [[tablebase]]s (e.g. for 3x3 chess with any beginning configuration, small-board [[dots-and-boxes]], small-board hex, and certain endgames in chess, dots-and-boxes, and hex), a new area for data mining has been opened up. [10210460] |This is the extraction of human-usable strategies from these oracles. [10210470] |Current pattern recognition approaches do not seem to fully attain the high level of abstraction required to be applied successfully. [10210480] |Instead, extensive experimentation with the tablebases, combined with an intensive study of tablebase answers to well-designed problems and with knowledge of prior art, i.e. pre-tablebase knowledge, is used to yield insightful patterns. [10210490] |[[Berlekamp]] in dots-and-boxes and [[John Nunn]] in [[chess]] [[Chess endgame|endgames]] are notable examples of researchers doing this work, though they were not and are not involved in tablebase generation. [10210500] |===Business=== [10210510] |Data mining in [[customer relationship management]] applications can contribute significantly to the bottom line. 
[10210520] |Rather than contacting a prospect or customer through a call center or sending mail, only prospects that are predicted to have a high likelihood of responding to an offer are contacted. [10210530] |More sophisticated methods may be used to optimize across campaigns so as to predict which channel and which offer an individual is most likely to respond to, across all potential offers. [10210540] |Finally, in cases where many people will take an action without an offer, uplift modeling can be used to determine which people will have the greatest increase in responding if given an offer. [10210550] |[[Data clustering]] can also be used to automatically discover the segments or groups within a customer data set. [10210560] |Businesses employing data mining quickly see a return on investment, but they also recognize that the number of predictive models can quickly become very large. [10210570] |Rather than one model to predict which customers will [[Churning (stock trade)|churn]], a business could build a separate model for each region and customer type. [10210580] |Then, instead of sending an offer to all people that are likely to churn, it may only want to send offers to customers that are likely to take the offer. [10210590] |And finally, it may also want to determine which customers are going to be profitable over a window of time and only send the offers to those that are likely to be profitable. [10210600] |To maintain this quantity of models, businesses need to manage model versions and move to ''automated data mining''. [10210610] |Data mining can also be helpful to human-resources departments in identifying the characteristics of their most successful employees. [10210620] |Information obtained, such as universities attended by highly successful employees, can help HR focus recruiting efforts accordingly. 
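The use of [[data clustering]] to discover customer segments, mentioned above, can be illustrated with a short, self-contained sketch. The customer figures are invented, and a real system would use a library implementation rather than this minimal k-means:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """A minimal k-means: returns (centroids, cluster label per point)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, centroids[c])))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, labels

# Hypothetical customers as (annual spend in $1000s, store visits per month).
customers = [(1.0, 2), (1.2, 3), (0.9, 2), (8.0, 12), (8.5, 11), (7.9, 13)]
centroids, labels = kmeans(customers, k=2)
# The labels separate a low-spend segment from a high-spend segment.
```

Each discovered segment could then get its own churn or response model, as described above.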
[10210630] |Additionally, Strategic Enterprise Management applications help a company translate corporate-level goals, such as profit and margin share targets, into operational decisions, such as production plans and workforce levels. [10210640] |Another example of data mining, often called [[market basket analysis]], relates to its use in retail sales. [10210650] |If a clothing store records the purchases of customers, a data-mining system could identify those customers who favour silk shirts over cotton ones. [10210660] |Although some explanations of such relationships may be difficult to find, taking advantage of them is easier. [10210670] |The example deals with [[association rule]]s within transaction-based data. [10210680] |Not all data are transaction-based, and logical or inexact [[rule]]s may also be present within a [[database]]. [10210690] |In a manufacturing application, an inexact rule may state that 73% of products which have a specific defect or problem will develop a secondary problem within the next six months. [10210700] |An example of data mining related to an integrated-circuit production line is described in the paper "Mining IC Test Data to Optimize VLSI Testing." [10210710] |In this paper, the application of data mining and decision analysis to the problem of die-level functional testing is described. [10210720] |Experiments mentioned in this paper demonstrate that mining historical die-test data can create a probabilistic model of die-failure patterns, which is then used to decide in real time which die to test next and when to stop testing. [10210730] |This system has been shown, based on experiments with historical test data, to have the potential to improve profits on mature IC products. 
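The [[association rule]]s in the market-basket example can be made concrete with a short sketch that computes the two standard measures: ''support'' (how often the items occur together) and ''confidence'' (how often the consequent appears given the antecedent). The baskets below are invented for illustration:

```python
def rule_stats(transactions, antecedent, consequent):
    """Support and confidence of the rule: antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    n_ac = sum(1 for t in transactions if (a | c) <= t)
    support = n_ac / len(transactions)       # P(antecedent and consequent)
    confidence = n_ac / n_a if n_a else 0.0  # P(consequent | antecedent)
    return support, confidence

# Invented clothing-store baskets, one set of items per transaction.
baskets = [
    {"silk shirt", "tie"},
    {"silk shirt", "tie", "belt"},
    {"cotton shirt", "jeans"},
    {"silk shirt", "belt"},
    {"cotton shirt", "tie"},
]
support, confidence = rule_stats(baskets, {"silk shirt"}, {"tie"})
# 2 of 5 baskets contain both items; 2 of the 3 silk-shirt baskets have a tie.
```

Full association-rule miners such as Apriori enumerate many candidate rules and keep only those whose support and confidence exceed chosen thresholds.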
[10210740] |===Science and engineering=== [10210750] |In recent years, data mining has been widely used in areas of science and engineering, such as [[bioinformatic]]s, [[genetic]]s, [[medicine]], [[education]], and [[electrical power]] engineering. [10210760] |In the study of human genetics, an important goal is to understand the mapping relationship between the inter-individual variation in human [[DNA]] sequences and variability in disease susceptibility. [10210770] |In lay terms, the aim is to find out how the changes in an individual's DNA sequence affect the risk of developing common diseases such as [[cancer]]. [10210780] |This is very important for improving the diagnosis, prevention, and treatment of these diseases. [10210790] |The data mining technique that is used to perform this task is known as [[multifactor dimensionality reduction]]. [10210800] |In the area of electrical power engineering, data mining techniques have been widely used for [[condition monitoring]] of high-voltage electrical equipment. [10210810] |The purpose of condition monitoring is to obtain valuable information on the health status of the equipment's [[insulation]]. [10210820] |[[Data clustering]] techniques such as the [[self-organizing map]] (SOM) have been applied to the vibration monitoring and analysis of transformer on-load tap changers (OLTCs). [10210830] |Using vibration monitoring, it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms. [10210840] |Obviously, different tap positions will generate different signals. [10210850] |However, there was considerable variability amongst normal-condition signals for the exact same tap position. [10210860] |SOM has been applied to detect abnormal conditions and to estimate the nature of the abnormalities. [10210870] |Data mining techniques have also been applied for [[dissolved gas analysis]] (DGA) on [[power transformer]]s. 
[10210880] |DGA, as a diagnostic technique for power transformers, has been available for decades. [10210890] |Data mining techniques such as SOM have been applied to analyse the data and to determine trends which are not obvious to standard DGA ratio techniques such as the Duval Triangle. [10210900] |Another area of application for data mining in science/engineering is within educational research, where data mining has been used to study the factors leading students to choose to engage in behaviors which reduce their learning, and to understand the factors influencing university student retention. [10210910] |Other examples of data mining applications include mining [[biomedical]] data facilitated by domain ontologies, mining clinical trial data, and [[traffic analysis]] using SOM. [10220010] |
Data set
[10220020] |A '''data set''' (or '''dataset''') is a collection of [[data]], usually presented in tabular form. [10220030] |Each column represents a particular variable. [10220040] |Each row corresponds to a given member of the data set in question. [10220050] |It lists values for each of the variables, such as height and weight of an object or values of random numbers. [10220060] |Each value is known as a [[datum]]. [10220070] |The data set may comprise data for one or more members, corresponding to the number of rows. [10220080] |Historically, the term originated in the [[mainframe computer|mainframe field]], where it had a [[Data set (IBM mainframe)|well-defined meaning]], very close to contemporary ''[[computer file]]''. [10220090] |This topic is not covered here. [10220100] |In the simplest case, there is only one variable, and then the data set consists of a single column of values, often represented as a list. [10220110] |The values may be numbers, such as [[real number]]s or [[integer]]s, for example representing a person's height in centimeters, but may also be [[nominal data]] (i.e., not consisting of [[numerical]] values), for example representing a person's ethnicity. [10220120] |More generally, values may be of any of the kinds described as a [[level of measurement]]. [10220130] |For each variable, the values will normally all be of the same kind. [10220140] |However, there may also be "[[missing values]]", which need to be indicated in some way. [10220150] |In [[statistics]], data sets usually come from actual observations obtained by [[sampling (statistics)|sampling]] a [[statistical population]], and each row corresponds to the observations on one element of that population. [10220160] |Data sets may further be generated by [[algorithms]] for the purpose of testing certain kinds of [[software]]. [10220170] |Some modern statistical analysis software, such as [[PSPP]], still presents data in the classical dataset fashion. 
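The tabular structure described above, including the need to indicate missing values, can be illustrated with a small sketch; the column names and rows are invented:

```python
# One variable per column, one member per row; None marks a missing value.
columns = ("height_cm", "weight_kg", "ethnicity")
rows = [
    (172, 68, "A"),
    (181, None, "B"),   # weight was not recorded for this member
    (165, 54, "A"),
]

def column_mean(rows, index):
    """Mean of a numeric column, skipping missing values."""
    observed = [row[index] for row in rows if row[index] is not None]
    return sum(observed) / len(observed)

print(column_mean(rows, 0))   # height: averaged over all three rows
print(column_mean(rows, 1))   # weight: averaged over the two observed rows
```

Note that the last column is nominal data, so a mean is not defined for it; only counts and frequencies would make sense there.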
[10220180] |== Classic data sets == [10220190] |Several classic [[data set]]s have been used extensively in the [[statistical]] literature: [10220200] |* [[Iris flower data set]] - multivariate data set introduced by [[Ronald Fisher]] (1936). [10220210] |* ''[[Categorical data analysis]]'' - Data sets used in the book ''An Introduction to Categorical Data Analysis'', by Agresti, are [http://lib.stat.cmu.edu/datasets/agresti provided on-line by StatLib.] [10220220] |* ''[[Robust statistics]]'' - Data sets used in ''Robust Regression and Outlier Detection'' (Rousseeuw and Leroy, 1986) are [http://www.uni-koeln.de/themen/Statistik/data/rousseeuw/ provided on-line at the University of Cologne.] [10220230] |* ''[[Time series]]'' - Data used in Chatfield's book ''The Analysis of Time Series'' are [http://lib.stat.cmu.edu/modules.php?op=modload&name=PostWrap&file=index&page=datasets/ provided on-line by StatLib.] [10220240] |* ''Extreme values'' - Data used in the book ''An Introduction to the Statistical Modeling of Extreme Values'' are [http://homes.stat.unipd.it/coles/public_html/ismev/ismev.dat provided on-line by Stuart Coles], the book's author. [10220250] |* ''Bayesian Data Analysis'' - Data used in the book ''[[Bayesian]] Data Analysis'' are [http://www.stat.columbia.edu/~gelman/book/data/ provided on-line by Andrew Gelman], one of the book's authors. [10220260] |* The [ftp://ftp.ics.uci.edu/pub/machine-learning-databases/liver-disorders Bupa liver data], used in several papers in the machine learning (data mining) literature. [10230010] |
ELIZA
[10230020] |'''ELIZA''' is a [[computer program]] by [[Joseph Weizenbaum]], designed in [[1966]], which parodied a [[Rogerian psychotherapy|Rogerian therapist]], largely by rephrasing many of the patient's statements as questions and posing them to the patient. [10230030] |Thus, for example, the response to "My head hurts" might be "Why do you say your head hurts?" [10230040] |The response to "My mother hates me" might be "Who else in your family hates you?" [10230050] |ELIZA was named after Eliza Doolittle, a working-class character in [[George Bernard Shaw|George Bernard Shaw's]] play ''[[Pygmalion (play)|Pygmalion]]'', who is taught to speak with an [[upper class]] [[accent (linguistics)|accent]]. [10230060] |==Overview== [10230070] |It is sometimes inaccurately said that ELIZA simulates a therapist. [10230080] |Weizenbaum said that ELIZA provided a "[[parody]]" of "the responses of a non-directional psychotherapist in an initial psychiatric interview." [10230090] |He chose the context of psychotherapy to "sidestep the problem of giving the program a data base of real-world knowledge", the therapeutic situation being one of the few real human situations in which a human being can reply to a statement with a question that indicates very little specific knowledge of the topic under discussion. [10230100] |For example, it is a context in which the question "Who is your favorite composer?" can be answered acceptably with responses such as "What about your own favorite composer?" or "Does that question interest you?" [10230110] |First implemented in Weizenbaum's own [[SLIP (programming language)|SLIP]] list-processing language, ELIZA worked by simple [[parsing]] and substitution of key words into canned phrases. [10230120] |Depending upon the initial entries by the user, the illusion of a human writer could be instantly dispelled, or could continue through several interchanges. 
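The keyword parsing and substitution scheme described above can be sketched in a few lines. This is a loose illustration, not Weizenbaum's SLIP implementation; the rules, templates, and pronoun-reflection table are invented:

```python
import random
import re

# Hypothetical keyword -> template rules in the spirit of ELIZA; captured
# text is reflected (me -> you, my -> your, ...) and substituted back in.
RULES = [
    (re.compile(r"my (.*) hurts", re.I),
     ["Why do you say your {0} hurts?"]),
    (re.compile(r"my (mother|father) (.*)", re.I),
     ["Who else in your family {1}?", "Tell me more about your {0}."]),
    (re.compile(r"i am (.*)", re.I),
     ["How long have you been {0}?"]),
]
DEFAULTS = ["Please go on.", "What does that suggest to you?"]
REFLECT = {"i": "you", "me": "you", "my": "your", "am": "are"}

def reflect(fragment):
    """Swap first- and second-person words in a captured fragment."""
    return " ".join(REFLECT.get(w.lower(), w) for w in fragment.split())

def respond(statement, rng=random.Random(0)):
    """Return a canned phrase with the patient's key words substituted in."""
    for pattern, templates in RULES:
        match = pattern.search(statement)
        if match:
            return rng.choice(templates).format(*map(reflect, match.groups()))
    return rng.choice(DEFAULTS)

print(respond("My head hurts"))   # -> Why do you say your head hurts?
```

The real ELIZA scripts also ranked keywords by importance and kept a memory of earlier statements to fall back on, which this sketch omits.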
[10230130] |It was sometimes so convincing that there are many anecdotes about people becoming very emotionally caught up in dealing with ELIZA for several minutes until the machine's true lack of understanding became apparent. [10230140] |This was likely due to people's tendency to attach meanings to words which the computer never put there. [10230150] |In 1966, interactive computing (via a teletype) was new. [10230160] |It was 15 years before the personal computer became familiar to the general public, and two decades before most people encountered attempts at [[natural language processing]] in Internet services like [[Ask.com]] or PC help systems such as Microsoft Office [[Office Assistant|Clippy]]. [10230170] |Although those programs included years of research and work (while ''[[Ecala]]'' eclipsed the functionality of ''ELIZA'' after less than two weeks of work by a single programmer), ''ELIZA'' remains a milestone simply because it was the first time a programmer had attempted such a human-machine interaction with the goal of creating the illusion (however brief) of human-''human'' interaction. [10230180] |In ''The New Media Reader'', an excerpt from Weizenbaum's 1976 book ''Computer Power and Human Reason'', edited by Noah Wardrip-Fruin and Nick Montfort, recounts how quickly and deeply people became emotionally involved with the computer program: some took offence when Weizenbaum asked to view the transcripts, saying it was an invasion of their privacy, and even asked him to leave the room while they were working with ELIZA. [10230190] |==Influence on games== [10230200] |ELIZA impacted a number of early [[computer games]] by demonstrating additional kinds of [[interface design]]s. 
[10230210] |[[Don Daglow]] wrote an enhanced version of the program, called ''Ecala'', on a [[PDP-10]] [[mainframe computer]] at [[Pomona College]] in [[1973]], before writing what was possibly the second or third computer [[role-playing game]], ''[[Dungeon (computer game)|Dungeon]]'' ([[1975]]). (The first was probably ''[[dnd (computer game)|dnd]]'', written on and for the PLATO system in 1974, and the second may have been [[Moria]], written in 1975.) [10230220] |It is likely that ''ELIZA'' was also on the system where [[Will Crowther]] created ''[[Colossal Cave Adventure|Adventure]]'', the 1975 game that spawned the [[interactive fiction]] genre. [10230230] |But both these games appeared some nine years after the original ''ELIZA''. [10230240] |==Response and legacy== [10230250] |Lay responses to ELIZA were disturbing to Weizenbaum and motivated him to write his book ''Computer Power and Human Reason: From Judgment to Calculation'', in which he explains the limits of computers and argues that anthropomorphic views of computers are a reduction of the human being, and of any life form for that matter. [10230260] |There are many programs based on ELIZA in different languages, in addition to ''Ecala''. [10230270] |For example, in 1980, a company called "Don't Ask Software", founded by Randy Simon, created a version for the Apple II, Atari, and Commodore PCs, which verbally abused the user based on the user's input. [10230280] |In Spain, Jordi Perez developed the famous ZEBAL in 1993, written in [[Clipper programming language|Clipper]] for MS-DOS. [10230290] |Other versions adapted ELIZA around a religious theme, such as ones featuring Jesus (both serious and comedic) and another Apple II variant called ''I Am Buddha''. [10230300] |The 1980 game ''[[The Prisoner (computer game)|The Prisoner]]'' incorporated ELIZA-style interaction within its gameplay. 
[10230310] |ELIZA has also inspired a [[podcast]] called "The Eliza Podcast", in which the host engages in self-analysis using a computer-generated voice prompting with questions in the same style as the ELIZA program. [10230320] |==Implementations== [10230330] |* Using [[JavaScript]]: http://www.manifestation.com/neurotoys/eliza.php3 [10230340] |* Source code in [[Java (programming language)|Java]]: http://chayden.net/eliza/Eliza.html [10230350] |* Another [[Java (programming language)|Java]] implementation of ELIZA: http://www.wedesoft.demon.co.uk/eliza/ [10230360] |* Using [[C (programming language)|C]] on the [[TI-89]]: http://kaikostack.com/ti89_en.htm#eliza [10230370] |* Using [[z80#The Z80 assembly language|z80 Assembly]] on the [[TI-83#TI-83 Plus|TI-83 Plus]]: http://www.ticalc.org/archives/files/fileinfo/354/35463.html [10230380] |* A [[perl module]] [http://search.cpan.org/dist/Chatbot-Eliza/ Chatbot::Eliza] — [http://www.terrence.com/perl/eliza/eliza.cgi example implementation] [10230390] |* Trans-Tex Software has released shareware versions for Classic Mac OS and Mac OS X: http://www.tex-edit.com/index.html#Eliza [10230400] |* doctor.el (circa [[1985]]) in [[Emacs]]. [10230410] |* Source code in [[Tcl]]: [http://wiki.tcl.tk/9235 http://wiki.tcl.tk/9235] [10230420] |* The [http://www.indyproject.org Indy] [[Delphi]]-oriented TCP/IP components suite has an Eliza implementation as a demo. [10230430] |* [http://www.cs.bham.ac.uk/research/projects/cogaff/eliza Pop-11 Eliza] in the [[poplog]] system. [10230440] |Goes back to about 1976, when it was used for teaching AI at [[Sussex University]]. [10230450] |Now part of the free open-source Poplog system. [10230460] |* Source code in [[BASIC]]: http://www.atariarchives.org/bigcomputergames/showpage.php?page=22 [10230470] |* ECC-Eliza for Windows (actual program is for DOS, but unpacker is for Windows) (rename .txt to .exe before running): http://www5.domaindlx.com/ecceliza1/ecceliza.txt. 
[10230480] |More recent version at http://web.archive.org/web/20041117123025/http://www5.domaindlx.com/ecceliza1/ecceliza.txt.