'''Data mining''' is the process of [[sorting]] through large amounts of data and picking out relevant information. It is usually used by [[business intelligence]] organizations and [[financial analyst]]s, but is increasingly being used in the sciences to extract information from the enormous [[data set]]s generated by modern experimental and observational methods. It has been described as "the nontrivial extraction of implicit, previously unknown, and potentially useful [[information]] from [[data]]"{{cite journal |author=W. Frawley and G. Piatetsky-Shapiro and C. Matheus |title=Knowledge Discovery in Databases: An Overview |journal=[[AI Magazine]] |date=Fall 1992 |pages=pp. 213–228 |id={{ISSN|0738-4602}}}} and "the science of extracting useful information from large [[data set]]s or [[database]]s."{{cite book |author=D. Hand, H. Mannila, P. Smyth |title=Principles of Data Mining |publisher=MIT Press, Cambridge, MA |year=2001 |id=ISBN 0-262-08290-X}}

Data mining in relation to [[enterprise resource planning]] is the statistical and logical analysis of large sets of transaction data, looking for patterns that can aid decision making.{{cite book |author=Ellen Monk, Bret Wagner |title=Concepts in Enterprise Resource Planning, Second Edition |publisher=Thomson Course Technology, Boston, MA |year=2006 |id=ISBN 0-619-21663-8}}

==Background==
Traditionally, business analysts have performed the task of extracting useful [[information]] from recorded [[data]], but the increasing volume of data in modern business and science calls for computer-based approaches. As [[data set]]s have grown in size and complexity, there has been a shift away from direct hands-on data analysis toward indirect, automatic data analysis using more complex and sophisticated tools. The modern technologies of [[computers]], [[networks]], and [[sensors]] have made [[data collection]] and organization much easier.
However, the captured data needs to be converted into [[information]] and [[knowledge]] to become useful. Data mining is the entire process of applying computer-based [[methodology]], including new techniques for [[knowledge discovery]], to data.{{cite book |last= Kantardzic |first= Mehmed |title= Data Mining: Concepts, Models, Methods, and Algorithms|year= 2003|publisher= John Wiley & Sons |location= |isbn= 0471228524}}

Data mining identifies trends within data that go beyond simple analysis. Through the use of sophisticated algorithms, non-statistician users have the opportunity to identify key attributes of business processes and target opportunities. However, ceding control of this process from the statistician to the machine may result in false positives or no useful results at all.

Although data mining is a relatively new term, the technology is not. For many years, businesses have used powerful computers to sift through volumes of data such as supermarket scanner data to produce market research reports (although reporting is not considered to be data mining). Continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy and usefulness of data analysis.

Web 2.0 technologies have generated a colossal amount of user-generated data and media, making it hard to aggregate and consume information in a meaningful way without getting overloaded. Given the size of the data on the Internet, and the difficulty in contextualizing it, it is unclear whether the traditional approach to data mining is computationally viable. [http://www.internetevolution.com/author.asp?section_id=644&doc_id=157077&F_src=flftwo Internet Evolution (www.internetevolution.com): Data Mining in the Age of Web 2.0]

The term data mining is often applied to two separate processes: knowledge discovery and [[prediction]].
Knowledge discovery provides explicit information that has a readable form and can be understood by a user. [[Forecasting]], or [[predictive modeling]], provides predictions of future events and may be transparent and readable in some approaches (e.g., rule-based systems) and opaque in others such as [[neural network]]s. Moreover, some data-mining systems such as neural networks are inherently geared towards prediction and pattern recognition, rather than knowledge discovery.

[[Metadata]], or data about a given data set, are often expressed in a condensed ''data-minable'' format, or one that facilitates the practice of data mining. Common examples include executive summaries and scientific abstracts.

Data mining relies on the use of real-world data. This data is extremely vulnerable to [[collinearity]] precisely because data from the real world may have unknown interrelations. An unavoidable weakness of data mining is that the critical data that might expose a relationship may never have been observed. Alternatively, an experiment-based approach such as [[Choice Modelling]] may be used for human-generated data. In such an approach, inherent correlations are either controlled for or removed altogether through the construction of an [[experimental design]].

There have been some efforts to define a standard for data mining, for example the [[CRISP-DM]] standard for analysis processes or the [[Java Data-Mining]] Standard. Independent of these standardization efforts, freely available open-source software systems like [[RapidMiner]] and [[Weka (machine learning)|Weka]] have become an informal standard for defining data-mining processes.

==Privacy concerns==
There are also [[privacy]] and [[human rights]] concerns associated with data mining, specifically regarding the source of the data analyzed. Data mining provides information that may be difficult to obtain otherwise.
When the data collected involves individual people, there are many questions concerning privacy, legality, and ethics.{{cite journal| author=Chip Pitts| title=The End of Illegal Domestic Spying? Don't Count on It| url= http://www.washingtonspectator.com/articles/20070315surveillance_1.cfm|journal=[[Washington Spectator|Wash. Spec.]]|date=March 15, 2007}} In particular, data mining government or commercial data sets for national security or law enforcement purposes has raised privacy concerns.{{cite journal| author=K.A. Taipale| title=Data Mining and Domestic Security: Connecting the Dots to Make Sense of Data |url=http://www.stlr.org/cite.cgi?volume=5&article=2|volume=5|journal=[[Columbia Science and Technology Law Review|Colum. Sci. & Tech. L. Rev.]]|issue=2|date=December 15, 2003 |id = {{SSRN|546782}} / {{OCLC|45263753}} }}{{cite journal| author=John Resig, Ankur Teredesai|year= 2004| title=A Framework for Mining Instant Messaging Services| url= http://citeseer.ist.psu.edu/resig04framework.html|journal=In Proceedings of the 2004 SIAM DM Conference}}

==Notable uses of data mining==
===Combating terrorism===
Data mining has been cited as the method by which the U.S. Army unit [[Able Danger]] identified the leader of the [[September 11, 2001 attacks]], [[Mohamed Atta]], and three other 9/11 hijackers as possible members of an [[Al Qaeda]] cell operating in the U.S. more than a year before the attack.
It has been suggested that both the [[Central Intelligence Agency]] and the [[Canadian Security Intelligence Service]] have employed this method.{{cite book|author=Stephen Haag et al.|title=Management Information Systems for the information age|pages=pp 28|id=ISBN 0-07-095569-7}}

Previous US government data-mining programs intended to stop terrorism include the Terrorism Information Awareness (TIA) program, Computer-Assisted Passenger Prescreening System (CAPPS II), Analysis, Dissemination, Visualization, Insight, and Semantic Enhancement (ADVISE), Multistate Anti-Terrorism Information Exchange (MATRIX), and the Secure Flight program [http://www.msnbc.msn.com/id/20604775/ Security-MSNBC]. These programs have been discontinued due to controversy over whether they violate the Fourth Amendment to the US Constitution.

===Games===
Since the early 1960s, with the availability of [[Oracle machine|oracle]]s for certain [[combinatorial game]]s, also called [[tablebase]]s (e.g. for 3x3-chess) with any beginning configuration, small-board [[dots-and-boxes]], small-board-hex, and certain endgames in chess, dots-and-boxes, and hex, a new area for data mining has been opened up. This is the extraction of human-usable strategies from these oracles. Current pattern recognition approaches do not seem to have the high level of abstraction required to be applied successfully. Instead, extensive experimentation with the tablebases, combined with an intensive study of tablebase answers to well-designed problems and with knowledge of prior art, i.e. pre-tablebase knowledge, is used to yield insightful patterns. [[Berlekamp]] in dots-and-boxes etc. and [[John Nunn]] in [[chess]] [[Chess endgame|endgames]] are notable examples of researchers doing this work, though they were not and are not involved in tablebase generation.
===Business===
{{Refimprove|date=July 2008}}
Data mining in [[customer relationship management]] applications can contribute significantly to the bottom line.{{Fact|date=July 2008}} Rather than contacting prospects or customers indiscriminately through a call center or by mail, only those predicted to have a high likelihood of responding to an offer are contacted. More sophisticated methods may be used to optimize across campaigns, predicting which channel and which offer an individual is most likely to respond to across all potential offers. Finally, in cases where many people will take an action without an offer, uplift modeling can be used to determine which people will show the greatest increase in response if given an offer. [[Data clustering]] can also be used to automatically discover the segments or groups within a customer data set.

Businesses employing data mining may see a return on investment, but they also find that the number of predictive models can quickly become very large. Rather than one model to predict which customers will [[Churning (stock trade)|churn]], a business could build a separate model for each region and customer type. Then, instead of sending an offer to all people who are likely to churn, it may want to send offers only to customers who are likely to accept the offer. Finally, it may also want to determine which customers are going to be profitable over a window of time and send the offers only to those who are likely to be profitable. To maintain this quantity of models, businesses need to manage model versions and move to ''automated data mining''.

Data mining can also be helpful to human-resources departments in identifying the characteristics of their most successful employees. Information obtained, such as universities attended by highly successful employees, can help HR focus recruiting efforts accordingly.

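The uplift modeling idea described earlier in this section can be illustrated with a minimal Python sketch. It is not taken from any cited source: the segment names, the campaign history and the function <code>uplift_by_segment</code> are hypothetical, and the estimate used here (response rate with an offer minus response rate without one, per segment) is only the simplest possible form of uplift estimation.

```python
from collections import defaultdict

def uplift_by_segment(records):
    """Estimate uplift per segment.

    records: iterable of (segment, got_offer, responded) tuples.
    Returns {segment: response rate with offer - response rate without offer}.
    """
    # For each segment keep [responders, total] for offered and not-offered groups.
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, got_offer, responded in records:
        cell = counts[segment][got_offer]
        cell[0] += int(responded)
        cell[1] += 1

    uplift = {}
    for segment, cells in counts.items():
        treated, control = cells[True], cells[False]
        if treated[1] == 0 or control[1] == 0:
            continue  # uplift cannot be estimated without both groups
        uplift[segment] = treated[0] / treated[1] - control[0] / control[1]
    return uplift

# Hypothetical campaign history: (segment, got_offer, responded)
history = (
    [("loyal", True, True)] * 9 + [("loyal", True, False)] * 1
    + [("loyal", False, True)] * 8 + [("loyal", False, False)] * 2
    + [("lapsed", True, True)] * 4 + [("lapsed", True, False)] * 6
    + [("lapsed", False, True)] * 1 + [("lapsed", False, False)] * 9
)

result = uplift_by_segment(history)
best_segment = max(result, key=result.get)
```

In this made-up history the "loyal" segment responds at a high rate with or without an offer, so its estimated uplift is small, while the "lapsed" segment responds far more often when given an offer. Targeting by uplift would therefore favour the "lapsed" segment even though its raw response rate is lower.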
Strategic Enterprise Management applications also help a company translate corporate-level goals, such as profit and margin share targets, into operational decisions, such as production plans and workforce levels.

Another example of data mining, often called [[market basket analysis]], relates to its use in retail sales. If a clothing store records the purchases of customers, a data-mining system could identify those customers who favour silk shirts over cotton ones. Although such relationships may be difficult to explain, taking advantage of them is easier. The example deals with [[association rule]]s within transaction-based data. Not all data are transaction based, and logical or inexact [[rule]]s may also be present within a [[database]]. In a manufacturing application, an inexact rule may state that 73% of products which have a specific defect or problem will develop a secondary problem within the next six months.

An example of data mining on an integrated-circuit production line is described in the paper "Mining IC Test Data to Optimize VLSI Testing."[http://web.engr.oregonstate.edu/~tgd/publications/kdd2000-dlft.pdf] The paper describes the application of data mining and decision analysis to the problem of die-level functional test. The experiments it reports demonstrate that mining historical die-test data can yield a probabilistic model of patterns of die failure, which is then used to decide in real time which die to test next and when to stop testing. This system has been shown, based on experiments with historical test data, to have the potential to improve profits on mature IC products.

===Science and engineering===
In recent years, data mining has been widely used in areas of science and engineering such as [[bioinformatic]]s, [[genetic]]s, [[medicine]], [[education]], and [[electrical power]] engineering.
In the study of human genetics, an important goal is to understand the mapping relationship between inter-individual variation in human [[DNA]] sequences and variability in disease susceptibility. In lay terms, the goal is to find out how changes in an individual's DNA sequence affect the risk of developing common diseases such as [[cancer]]. This is very important for improving the diagnosis, prevention and treatment of these diseases. A data mining technique used for this task is [[multifactor dimensionality reduction]].{{cite book|author=Xingquan Zhu, Ian Davidson|title=Knowledge Discovery and Data Mining: Challenges and Realities|publisher= Hershey, New York| year =2007 |pages=pp 18|id=ISBN 978-159904252-7}}

In electrical power engineering, data mining techniques have been widely used for [[condition monitoring]] of high-voltage electrical equipment. The purpose of condition monitoring is to obtain valuable information on the health status of the equipment's [[insulation]]. [[Data clustering]] techniques such as the [[self-organizing map]] (SOM) have been applied to the vibration monitoring and analysis of transformer on-load tap-changers (OLTCs). Using vibration monitoring, it can be observed that each tap change operation generates a signal that contains information about the condition of the tap changer contacts and the drive mechanisms. Different tap positions generate different signals. However, there is considerable variability amongst normal-condition signals for the exact same tap position. SOM has been applied to detect abnormal conditions and to estimate the nature of the abnormalities.{{cite Journal| author=A.J. McGrail, E.Gulski, et al.|title=Data Mining Techniques to Assess the Condition of High Voltage Electrical Plant| journal=CIGRE WG 15.11 of Study Committee 15}}

Data mining techniques have also been applied for [[dissolved gas analysis]] (DGA) on [[power transformer]]s.
DGA, as a diagnostic technique for power transformers, has been available for many years. Data mining techniques such as SOM have been applied to analyse data and to determine trends which are not obvious to the standard DGA ratio techniques such as the Duval Triangle.{{cite Journal| author=A.J. McGrail, E.Gulski, et al.|title=Data Mining Techniques to Assess the Condition of High Voltage Electrical Plant|journal=CIGRE WG 15.11 of Study Committee 15}}

A fourth area of application for data mining in science and engineering is educational research, where data mining has been used to study the factors leading students to choose to engage in behaviors which reduce their learning{{cite Journal| author=R.Baker|title=Is Gaming the System State-or-Trait? Educational Data Mining Through the Multi-Contextual Application of a Validated Behavioral Model|journal=Workshop on Data Mining for User Modeling 2007}} and to understand the factors influencing university student retention.{{cite Journal| author=J.F. Superby, J-P. Vandamme, N. Meskens |title=Determination of factors influencing the achievement of the first-year university students using data mining methods|journal=Workshop on Educational Data Mining 2006}}

Other examples of data mining applications include the mining of [[biomedical]] data facilitated by domain ontologies,{{cite book|author=Xingquan Zhu, Ian Davidson|title=Knowledge Discovery and Data Mining: Challenges and Realities|publisher= Hershey, New York| year =2007 |pages=pp 163-189|id=ISBN 978-159904252-7}} the mining of clinical trial data,{{cite book|author=Xingquan Zhu, Ian Davidson|title=Knowledge Discovery and Data Mining: Challenges and Realities|publisher= Hershey, New York| year =2007 |pages=pp 31-48|id=ISBN 978-159904252-7}} and [[traffic analysis]] using SOM.{{cite Journal| author=Yudong Chen, Yi Zhang, Jianming Hu, Xiang Li |title=Traffic Data Analysis Using Kernel PCA and Self-Organizing Map|journal=Intelligent Vehicles Symposium, 2006 IEEE}}

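The way a self-organizing map can separate normal from abnormal signals, as mentioned in this section for tap-changer vibration monitoring, can be sketched in a few lines of Python. This is only an illustration under stated assumptions and not the method of the cited CIGRE work: the two-feature "signal summaries" are randomly generated, hypothetical data, the map is a small one-dimensional chain of units, and the rule "flag anything that fits the map worse than the worst normal sample" is one simple choice of threshold among many.

```python
import math
import random

def train_som(samples, n_units=4, epochs=30, seed=0):
    """Train a one-dimensional self-organizing map and return its unit vectors."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # Start each unit at a randomly chosen training sample.
    units = [list(rng.choice(samples)) for _ in range(n_units)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs               # shrinks from 1 towards 0
        rate = 0.5 * decay                         # learning rate
        radius = max(0.5, (n_units / 2) * decay)   # neighbourhood width
        for x in rng.sample(samples, len(samples)):  # one shuffled pass
            # Best-matching unit: the unit closest to the sample.
            best = min(range(n_units), key=lambda i: math.dist(units[i], x))
            for i in range(n_units):
                # Units near the winner on the map are pulled more strongly.
                pull = rate * math.exp(-((i - best) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    units[i][d] += pull * (x[d] - units[i][d])
    return units

def quantization_error(units, x):
    """Distance from x to its best-matching unit."""
    return min(math.dist(u, x) for u in units)

# Hypothetical two-feature summaries of normal-condition signals.
data_rng = random.Random(1)
normal = [(data_rng.uniform(0.8, 1.4), data_rng.uniform(0.1, 0.4))
          for _ in range(40)]

som = train_som(normal)
# Flag anything that fits the map worse than the worst normal sample.
threshold = max(quantization_error(som, x) for x in normal)

def is_abnormal(x):
    return quantization_error(som, x) > threshold
```

With this setup, a signal summary far from anything seen in training, for example <code>is_abnormal((3.0, 2.0))</code>, is flagged, while every training sample is by construction treated as normal. A practical system would need a more careful threshold and real feature extraction from the measured signals.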
==See also==
* [[Data analysis]]
* [[Data warehouse]]
* [[Pattern mining]]
* [[R Project]]
* [[Structured data analysis (statistics)]]

==References==
{{refs|2}}

== Further reading ==
{{refbegin|2}}
* Peter Cabena, Pablo Hadjnian, Rolf Stadler, Jaap Verhees, Alessandro Zanasi, ''Discovering Data Mining: From Concept to Implementation'' (1997), Prentice Hall, ISBN 0137439806
* Ronen Feldman and James Sanger, ''The Text Mining Handbook'', Cambridge University Press, ISBN 9780521836579
* Phiroz Bhagat, ''Pattern Recognition in Industry'', Elsevier, ISBN 0-08-044538-1
* Ian Witten and Eibe Frank, ''Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations'' (2000), ISBN 1-55860-552-5 (see also [[Weka (machine learning)|Free Weka software]])
* Mark F. Hornick, Erik Marcade, Sunil Venkayala, ''Java Data Mining: Strategy, Standard, And Practice: A Practical Guide for Architecture, Design, And Implementation'' (paperback)
* Weiss and Indurkhya, ''Predictive Data Mining'', Morgan Kaufmann
* Yike Guo and Robert Grossman (editors), ''High Performance Data Mining: Scaling Algorithms, Applications and Systems'', Kluwer Academic Publishers, 1999
* Trevor Hastie, Robert Tibshirani and Jerome Friedman (2001). ''The Elements of Statistical Learning'', Springer. ISBN 0387952845 ([http://www-stat.stanford.edu/~tibs/ElemStatLearn/ companion book site])
* Pascal Poncelet, Florent Masseglia and Maguelonne Teisseire (editors). ''Data Mining Patterns: New Methods and Applications'', Information Science Reference, ISBN 978-1599041629 (October 2007).
* Mierswa, Ingo and Wurst, Michael and [[Ralf Klinkenberg|Klinkenberg, Ralf]] and Scholz, Martin and Euler, Timm: ''YALE: Rapid Prototyping for Complex Data Mining Tasks'', in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-06), 2006.
* Peng, Y., Kou, G., Shi, Y. and Chen, Z. "A Systemic Framework for the Field of Data Mining and Knowledge Discovery", in Proceeding of workshops on The Sixth IEEE International Conference on Data Mining Technique (ICDM), 2006
{{refend}}

==External links==
* {{dmoz|Computers/Software/Databases/Data_Mining/|Data Mining}}
* [http://www.softcomputing.es/en/home.php European Centre for Soft Computing]

[[Category:Data mining| ]]
[[Category:Data analysis]]
[[Category:Formal sciences]]

[[ar:تنقيب في البيانات]]
[[cs:Data mining]]
[[da:Data mining]]
[[de:Data-Mining]]
[[es:Minería de datos]]
[[eu:Datu-meatzaritza]]
[[fa:داده‌کاوی]]
[[fr:Exploration de données]]
[[ko:데이터 마이닝]]
[[id:Penggalian data]]
[[it:Data mining]]
[[he:כריית מידע]]
[[lv:Datizrace]]
[[lt:Duomenų išgavimas]]
[[hu:Adatbányászat]]
[[nl:Data mining]]
[[ja:データマイニング]]
[[no:Data mining]]
[[pl:Eksploracja danych]]
[[pt:Mineração de dados]]
[[ro:Data mining]]
[[ru:Интеллектуальный анализ данных]]
[[simple:Data mining]]
[[sk:Hĺbková analýza dát]]
[[sl:Podatkovno rudarjenje]]
[[su:Data mining]]
[[sv:Data mining]]
[[th:การทำเหมืองข้อมูล]]
[[vi:Khai phá dữ liệu]]
[[tr:Veri madenciliği]]
[[uk:Добування даних]]
[[zh:数据挖掘]]