Encyclopedia of Database Systems
Ling Liu, M. Tamer Özsu (Eds.)
Encyclopedia of Database Systems
With 3,067 Entries, 871 Authors, 1,176 Figures and 101 Tables, 6,900 Cross-references, and 10,696 Bibliographic references
LING LIU Professor College of Computing Georgia Institute of Technology 266 Ferst Drive Atlanta, GA 30332-0765 USA M. TAMER ÖZSU Professor and Director, University Research Chair Database Research Group David R. Cheriton School of Computer Science University of Waterloo 200 University Avenue West Waterloo, ON Canada N2L 3G1
Library of Congress Control Number: 2009931217 ISBN: 978-0-387-35544-3 This publication is also available as: Electronic publication under ISBN: 978-0-387-39940-9 and Print and electronic bundle under ISBN: 978-0-387-49616-0 © Springer Science+Business Media, LLC 2009 (USA) All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC., 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. springer.com Printed on acid-free paper
SPIN: 11752127 2109SPi– 5 4 3 2 1 0
To our families
Preface

We are in an information era where generating and storing large amounts of data are commonplace. A growing number of organizations routinely handle terabytes and exabytes of data, and individual digital data collections easily reach multiple gigabytes. Along with the increase in volume, the modalities of digitized data that require efficient management, and the modes of access to these data, have become more varied. It is increasingly common for business and personal data collections to include images, video, voice, and unstructured text; the retrieval of these data takes various forms, including structured queries, keyword search, and visual access. Data have become a highly valued asset for governments, industries, and individuals, and the management of these data collections remains a critical technical challenge.

Database technology has matured over the past four decades and is now quite ubiquitous in many applications that deal with more traditional business data. The challenges of expanding data management to include other data modalities while maintaining the fundamental tenets of database management (data independence, data integrity, data consistency, etc.) are issues that the community continues to work on. The lines between database management and other fields, such as information retrieval, multimedia retrieval, and data visualization, are increasingly blurred.

This multi-volume Encyclopedia of Database Systems provides easy access to important concepts on all aspects of database systems, including areas of current interest and research results of historical significance. It is a comprehensive collection of over 1,250 in-depth entries (3,067 including synonyms) that cover the important concepts, issues, emerging technologies, and future trends in the field of database technologies, systems, and applications.

The content of the Encyclopedia was determined through wide consultations. We were assisted by an Advisory Board in coming up with the overall structure and content. Each area was put under the control of Area Editors (70 in total), who further developed the content for their area, solicited experts in the field as contributors to write the entries, and performed the necessary technical editing. Some of them even wrote entries themselves. Nearly 1,000 authors were involved in writing entries.

The intended audience for the Encyclopedia is technically broad and diverse. It includes anyone concerned with database system technology and its applications. Specifically, the Encyclopedia can serve as a valuable and authoritative reference for students, researchers, and practitioners who need a quick and authoritative reference to the subject of databases, data management, and database systems. We anticipate that many people will benefit from this reference work, including database specialists, software developers, and scientists and engineers who need to deal with (structured, semi-structured, or unstructured) large datasets. In addition, database and data mining researchers and scholars in the many areas that apply database technologies, such as artificial intelligence, software engineering, robotics and computer vision, machine learning, and finance and marketing, are expected to benefit from the Encyclopedia.

We would like to thank the members of the Advisory Board, the Editorial Board, and the individual contributors for their help in creating this Encyclopedia.
The success of the Encyclopedia could not have been achieved without the expertise and the effort of the many contributors. Our sincere thanks also go to Springer's editors and staff, including Jennifer Carlson, Susan Lagerstrom-Fife, Oona Schmid, and Susan Bednarczyk, for their support throughout the project.

Finally, we would very much like to hear from readers with any suggestions regarding the Encyclopedia's content. With a project of this size and scope, it is quite possible that we may have missed some concepts. It is also possible that some entries may benefit from revisions and clarifications. We are committed to issuing periodic updates, and we look forward to feedback from the community to improve the Encyclopedia.

Ling Liu
M. Tamer Özsu
Editors-in-Chief
Ling Liu is a Professor in the School of Computer Science, College of Computing, at the Georgia Institute of Technology. Dr. Liu directs the research programs of the Distributed Data Intensive Systems Lab (DiSL), examining various aspects of data-intensive systems, spanning database and Internet data management, data storage, network computing, and mobile and wireless computing, with a focus on performance, availability, security, privacy, and energy efficiency in building very large database and data management systems and services. She has published over 200 international journal and conference articles in the areas of databases, data engineering, and distributed computing systems. She is a recipient of the best paper award of ICDCS 2003, the best paper award of WWW 2004, the 2005 Pat Goldberg Memorial Best Paper Award, and the best data engineering paper award of the Int. Conf. on Software Engineering and Data Engineering (2008). Dr. Liu served on the editorial boards of IEEE Transactions on Knowledge and Data Engineering and the International Journal of Very Large Databases from 2004 to 2008 and is currently serving on the editorial boards of several international journals, including the Distributed and Parallel Databases Journal, IEEE Transactions on Services Computing (TSC), the International Journal of Peer-to-Peer Networking and Applications (Springer), and Wireless Networks (Springer). Dr. Liu's current research is primarily sponsored by NSF, IBM, and Intel.
M. Tamer Özsu is a Professor of Computer Science and Director of the David R. Cheriton School of Computer Science at the University of Waterloo. He holds a Ph.D. (1983) and an M.S. (1981) in Computer and Information Science from The Ohio State University, and a B.S. (1974) and M.S. (1978) in Industrial Engineering from the Middle East Technical University, Turkey.

Dr. Özsu's current research focuses on three areas: (a) Internet-scale data distribution that emphasizes stream data management, peer-to-peer databases, and Web data management; (b) multimedia data management, concentrating on similarity-based retrieval of time series and trajectory data; and (c) the integration of database and information retrieval technologies, focusing on XML query processing and optimization. His previous research focused on distributed databases, interoperable information systems, object database systems, and image databases. He is the co-author of the book Principles of Distributed Database Systems (Prentice Hall), which is now in its second edition (with the third edition to be published in 2009). He currently holds a University Research Chair and has held a Faculty Research Fellowship at the University of Waterloo (2000-2003) and a McCalla Research Professorship (1993-1994) at the University of Alberta, where he was a faculty member between 1984 and 2000. He is a fellow of the Association for Computing Machinery (ACM), a senior member of the Institute of Electrical and Electronics Engineers (IEEE), and a member of Sigma Xi. He was awarded the ACM SIGMOD Contributions Award in 2006 and is the 2008 recipient of The Ohio State University College of Engineering Distinguished Alumnus Award. He has held visiting positions at GTE Laboratories (USA), INRIA Rocquencourt (France), GMD-IPSI (Germany), University of Jyväskylä (Finland), Technical University of Darmstadt (Germany), University of Udine (Italy), University of Milano (Italy), ETH Zürich (Switzerland), and National University of Singapore (Singapore).

Dr. Özsu serves on the editorial boards of ACM Computing Surveys, Distributed and Parallel Databases Journal, World Wide Web Journal, Information Technology and Management, and the Springer book series on Advanced Information & Knowledge Processing. Previously he was the Coordinating Editor-in-Chief of The VLDB Journal (1997-2005) and was on the Editorial Board of the Encyclopedia of Database Technology and Applications (Idea Group). He has served as the Program Chair of the VLDB (2004), WISE (2001), IDEAS (2003), and CIKM (1996) conferences and the General Chair of CAiSE (2002), as well as serving on the Program Committees of many conferences including SIGMOD, VLDB, and ICDE. He is also a member of the Association for Computing Machinery's (ACM) Publications Board and is its Vice-Chair for New Publications.

Dr. Özsu was the Chair of the ACM Special Interest Group on Management of Data (SIGMOD) from 2001 to 2005 and a past trustee of the VLDB Endowment (1996-2002). He was a member and chair of the Computer and Information Science Grant Selection Committee of the Natural Sciences and Engineering Research Council of Canada during 1991-94, and served on the Management Committee of the Canadian Genome Analysis and Technology Program during 1992-93. He was Acting Chair of the Department of Computing Science at the University of Alberta during 1994-95, and again, for a brief period, in 2000.
Advisory Board

Serge Abiteboul INRIA-Futurs INRIA, Saclay Orsay, Cedex France
Jiawei Han University of Illinois at Urbana-Champaign Urbana, IL USA
Gustavo Alonso ETH Zürich Zürich Switzerland
Theo Härder University of Kaiserslautern Kaiserslautern Germany
Peter M. G. Apers University of Twente Enschede The Netherlands
Joseph M. Hellerstein University of California-Berkeley Berkeley, CA USA
Ricardo Baeza-Yates Yahoo! Research Barcelona Spain
Ramesh Jain University of California-Irvine Irvine, CA USA
Catriel Beeri Hebrew University of Jerusalem Jerusalem Israel
Matthias Jarke RWTH-Aachen Aachen Germany
Elisa Bertino Purdue University West Lafayette, IN USA
Jai Menon IBM Systems and Technology Group San Jose, CA USA
Stefano Ceri Politecnico di Milano Milan Italy
John Mylopoulos University of Toronto Toronto, ON Canada
Asuman Dogac Middle East Technical University Ankara Turkey
Beng Chin Ooi National University of Singapore Singapore Singapore
Alon Halevy Google, Inc. Mountain View, CA USA
Erhard Rahm University of Leipzig Leipzig Germany
Krithi Ramamritham IIT Bombay Mumbai India
Patrick Valduriez INRIA and LINA Nantes France
Hans-Jörg Schek ETH Zürich Zürich Switzerland
Gerhard Weikum Max Planck Institute for Informatics Saarbrücken Germany
Timos Sellis National Technical University of Athens Athens Greece
Jennifer Widom Stanford University Stanford, CA USA
Frank Wm. Tompa University of Waterloo Waterloo, ON Canada
Lizhu Zhou Tsinghua University Beijing China
Area Editors

Peer-to-Peer Data Management
XML Data Management
KARL ABERER EPFL-IC-IIF-LSIR Lausanne Switzerland
SIHEM AMER-YAHIA Yahoo! Research New York, NY USA
Database Management System Architectures
Database Middleware
ANASTASIA AILAMAKI EPF Lausanne Lausanne Switzerland
CRISTIANA AMZA University of Toronto Toronto, ON Canada
Information Retrieval Models
Database Tools Database Tuning
GIAMBATTISTA AMATI Fondazione Ugo Bordoni Rome Italy
PHILIPPE BONNET University of Copenhagen Copenhagen Denmark
Visual Interfaces
Self Management
TIZIANA CATARCI University of Rome Rome Italy
Stream Data Management
SURAJIT CHAUDHURI Microsoft Research Redmond, WA USA
Text Mining
UGUR CETINTEMEL Brown University Providence, RI USA
Querying Over Data Integration Systems
KEVIN CHANG University of Illinois at Urbana-Champaign Urbana, IL USA
ZHENG CHEN Microsoft Research Asia Beijing China
Extended Transaction Models (Advanced Concurrency Control Theory)
PANOS K. CHRYSANTHIS University of Pittsburgh Pittsburgh, PA USA
Privacy-Preserving Data Mining
CHRIS CLIFTON Purdue University West Lafayette, IN USA
Active Databases
KLAUS DITTRICH University of Zürich Zürich Switzerland
Data Models (Including Semantic Data Models)
DAVID EMBLEY Brigham Young University Provo, UT USA
Complex Event Processing
OPHER ETZION IBM Research Lab in Haifa Haifa Israel
Digital Libraries
Database Security and Privacy
AMR EL ABBADI University of California-Santa Barbara Santa Barbara, CA USA
ELENA FERRARI University of Insubria Varese Italy
Semantic Web and Ontologies
Sensor Networks
AVIGDOR GAL Technion - Israel Institute of Technology Haifa Israel
LE GRUENWALD The University of Oklahoma Norman, OK USA
Data Clustering Data Cleaning
VENKATESH GANTI Microsoft Research Redmond, WA USA
DIMITRIOS GUNOPULOS University of Athens Athens Greece University of California – Riverside Riverside, CA USA
Web Data Extraction
Scientific Databases
GEORG GOTTLOB Oxford University Oxford UK
AMARNATH GUPTA University of California-San Diego La Jolla, CA USA
Geographic Information Systems
Temporal Databases
RALF HARTMUT GÜTING University of Hagen Hagen Germany
CHRISTIAN JENSEN Aalborg University Aalborg Denmark
Data Visualization
HANS HINTERBERGER ETH Zürich Zürich Switzerland
Metadata Management
MANFRED JEUSFELD Tilburg University Tilburg The Netherlands
Web Services and Service Oriented Architectures
Health Informatics Databases
HANS-ARNO JACOBSEN University of Toronto Toronto, ON Canada
VIPUL KASHYAP Partners Health Care System Wellesley, MA USA
Visual Data Mining
Views and View Management
DANIEL KEIM University of Konstanz Konstanz Germany
YANNIS KOTIDIS Athens University of Economics and Business Athens Greece
Data Replication
BETTINA KEMME McGill University Montreal, QC Canada
Semi-Structured Text Retrieval
MOUNIA LALMAS University of Glasgow Glasgow UK
Advanced Storage Systems Storage Structures and Systems
Information Quality
MASARU KITSUREGAWA The University of Tokyo Tokyo Japan
YANG LEE Northeastern University Boston, MA USA
Relational Theory
Database Design
LEONID LIBKIN University of Edinburgh Edinburgh UK
JOHN MYLOPOULOS University of Toronto Toronto, ON Canada
Information Retrieval Evaluation Measures
Text Indexing Techniques
WEIYI MENG State University of New York at Binghamton Binghamton, NY USA
MARIO NASCIMENTO University of Alberta Edmonton, AB Canada
Data Integration
Data Quality
RENÉE MILLER University of Toronto Toronto, ON Canada
FELIX NAUMANN Hasso Plattner Institute Potsdam Germany
Web Search and Crawl
Data Warehouse
CHRISTOPHER OLSTON Yahoo! Research Santa Clara, CA USA
TORBEN BACH PEDERSEN Aalborg University Aalborg Denmark
Multimedia Databases
Association Rule Mining
VINCENT ORIA New Jersey Institute of Technology Newark, NJ USA
JIAN PEI Simon Fraser University Burnaby, BC Canada
Spatial, Spatiotemporal, and Multidimensional Databases
Workflow Management
DIMITRIS PAPADIAS Hong Kong University of Science and Technology Hong Kong China
BARBARA PERNICI Politecnico di Milano Milan Italy
Query Processing and Optimization
Query Languages
EVAGGELIA PITOURA University of Ioannina Ioannina Greece
TORE RISCH Uppsala University Uppsala Sweden
Data Management for the Life Sciences
Data Warehouse
LOUIQA RASCHID University of Maryland College Park, MD USA
STEFANO RIZZI University of Bologna Bologna Italy
Information Retrieval Operations
Multimedia Databases
EDIE RASMUSSEN The University of British Columbia Vancouver, BC Canada
SHIN’ICHI SATOH National Institute of Informatics Tokyo Japan
Spatial, Spatiotemporal, and Multidimensional Databases
TIMOS SELLIS National Technical University of Athens Athens Greece
Database Tools Database Tuning
DENNIS SHASHA New York University New York, NY USA
Temporal Databases
RICK SNODGRASS University of Arizona Tucson, AZ USA
Stream Mining
DIVESH SRIVASTAVA AT&T Labs – Research Florham Park, NJ USA
Classification and Decision Trees
Distributed Database Systems
KYUSEOK SHIM Seoul National University Seoul Republic of Korea
KIAN-LEE TAN National University of Singapore Singapore Singapore
Logics and Databases
Parallel Database Systems
VAL TANNEN University of Pennsylvania Philadelphia, PA USA
PATRICK VALDURIEZ INRIA and LINA Nantes France
Structured and Semi-Structured Document Databases
Advanced Storage Systems Storage Structures and Systems
FRANK WM. TOMPA University of Waterloo Waterloo, ON Canada
KALADHAR VORUGANTI Network Appliance Sunnyvale, CA USA
Indexing
Transaction Management
VASSILIS TSOTRAS University of California – Riverside Riverside, CA USA
GOTTFRIED VOSSEN University of Münster Münster Germany
Self Management
Multimedia Information Retrieval
GERHARD WEIKUM Max Planck Institute for Informatics Saarbrücken Germany
JEFFREY XU YU Chinese University of Hong Kong Hong Kong China
Mobile and Ubiquitous Data Management
Approximation and Data Reduction Techniques
OURI WOLFSON University of Illinois at Chicago Chicago, IL USA
XIAOFANG ZHOU The University of Queensland Brisbane, QLD Australia
List of Contributors

W. M. P. van der Aalst Eindhoven University of Technology Eindhoven The Netherlands
Gail-Joon Ahn Arizona State University Tempe, AZ USA
Daniel Abadi Yale University New Haven, CT USA
Anastasia Ailamaki EPFL Lausanne Switzerland
Alberto Abelló Polytechnic University of Catalonia Barcelona Spain Serge Abiteboul INRIA, Saclay Orsay, Cedex France Ioannis Aekaterinidis University of Patras Rio Patras Greece Nitin Agarwal Arizona State University Tempe, AZ USA Charu C. Aggarwal IBM T. J. Watson Research Center Yorktown Heights, NY USA Lalitha Agnihotri Philips Research Eindhoven The Netherlands Yanif Ahmad Brown University Providence, RI USA
Yousef J. Al-Houmaily Institute of Public Administration Riyadh Saudi Arabia Robert B. Allen Drexel University Philadelphia, PA USA Gustavo Alonso ETH Zurich Zurich Switzerland Omar Alonso University of California at Davis Davis, CA USA Bernd Amann Pierre & Marie Curie University (UPMC) Paris France Giambattista Amati Fondazione Ugo Bordoni Rome Italy Rainer von Ammon Center for Information Technology Transfer GmbH (CITT) Regensburg Germany
Robert A. Amsler CSC Falls Church, VA USA
Samuel Aronson Harvard Medical School Boston, MA USA
Cristiana Amza University of Toronto Toronto, ON Canada
Paavo Arvola University of Tampere Tampere Finland
George Anadiotis VU University Amsterdam Amsterdam The Netherlands
Noboru Babaguchi Osaka University Osaka Japan
Mihael Ankerst Allianz Munich Germany
Shivnath Babu Duke University Durham, NC USA
Sameer Antani National Institutes of Health Bethesda, MD USA
Kenneth Paul Baclawski Northeastern University Boston, MA USA
Grigoris Antoniou Foundation for Research and Technology-Hellas (FORTH) Heraklion Greece
Ricardo Baeza-Yates Yahoo! Research Barcelona Spain
Arvind Arasu Microsoft Research Redmond, WA USA
James Bailey University of Melbourne Melbourne, VIC Australia
Danilo Ardagna Politecnico di Milano Milan Italy
Peter Bak University of Konstanz Konstanz Germany
Walid G. Aref Purdue University West Lafayette, IN USA
Magdalena Balazinska University of Washington Seattle, WA USA
Marcelo Arenas Pontifical Catholic University of Chile Santiago Chile
Farnoush Banaei-Kashani University of Southern California Los Angeles, CA USA
Stefano Baraldi University of Florence Florence Italy Mauro Barbieri Philips Research Eindhoven The Netherlands Denilson Barbosa University of Alberta Edmonton, AB Canada Pablo Barceló University of Chile Santiago Chile Luciano Baresi Politecnico di Milano Milan Italy Ilaria Bartolini University of Bologna Bologna Italy Sugato Basu Google Inc. Mountain View, CA USA Carlo Batini University of Milan Bicocca Milan Italy
Robert Baumgartner Vienna University of Technology Vienna, Austria Lixto Software GmbH Vienna Austria Sean Bechhofer University of Manchester Manchester UK Steven M. Beitzel Telcordia Technologies Piscataway, NJ USA Ladjel Bellatreche LISI/ENSMA–Poitiers University Futuroscope Cedex France Omar Benjelloun Google Inc. Mountain View, CA USA Véronique Benzaken University Paris 11 Orsay Cedex France Mikael Berndtsson University of Skövde Skövde Sweden
Michal Batko Masaryk University Brno Czech Republic
Philip A. Bernstein Microsoft Corporation Redmond, WA USA
Peter Baumann Jacobs University Bremen Germany
Damon Andrew Berry University of Massachusetts Lowell, MA USA
Leopoldo Bertossi Carleton University Ottawa, ON Canada
Philip Bohannon Yahoo! Research Santa Clara, CA USA
Claudio Bettini University of Milan Milan Italy
Michael H. Böhlen Free University of Bozen-Bolzano Bozen-Bolzano Italy
Nigel Bevan Professional Usability Services London UK
Christian Böhm University of Munich Munich Germany
Bharat Bhargava Purdue University West Lafayette, IN USA
Peter Boncz CWI Amsterdam The Netherlands
Arnab Bhattacharya Indian Institute of Technology Kanpur India
Philippe Bonnet University of Copenhagen Copenhagen Denmark
Ernst Biersack Eurecom Sophia Antipolis France
Alexander Borgida Rutgers University New Brunswick, NJ USA
Alberto Del Bimbo University of Florence Florence Italy
Chavdar Botev Yahoo Research! and Cornell University Ithaca, NY USA
Alan F. Blackwell University of Cambridge Cambridge UK
Sara Bouchenak University of Grenoble I — INRIA Grenoble France
Carlos Blanco University of Castilla-La Mancha Ciudad Real Spain
Luc Bouganim INRIA Paris-Rocquencourt Le Chesnay Cedex France
Marina Blanton University of Notre Dame Notre Dame, IN USA
Nozha Boujemaa INRIA Paris-Rocquencourt Le Chesnay Cedex France
Shawn Bowers University of California-Davis Davis, CA USA
Alejandro Buchmann Darmstadt University of Technology Darmstadt Germany
Stéphane Bressan National University of Singapore Singapore Singapore
Chiranjeeb Buragohain Amazon.com Seattle, WA USA
Martin Breunig University of Osnabrueck Osnabrueck Germany
Thorsten Büring Ludwig-Maximilians-University Munich Munich Germany
Scott A. Bridwell University of Utah Salt Lake City, UT USA
Benjamin Bustos Department of Computer Science University of Chile Santiago Chile
Thomas Brinkhoff Institute for Applied Photogrammetry and Geoinformatics (IAPG) Oldenburg Germany
David Buttler Lawrence Livermore National Laboratory Livermore, CA USA
Andrei Broder Yahoo! Research Santa Clara, CA USA
Yanli Cai Shanghai Jiao Tong University Shanghai China
Nicolas Bruno Microsoft Corporation Redmond, WA USA
Guadalupe Canahuate The Ohio State University Columbus, OH USA
François Bry University of Munich Munich Germany
K. Selcuk Candan Arizona State University Tempe, AZ USA
Yingyi Bu Chinese University of Hong Kong Hong Kong China
Turkmen Canli University of Illinois at Chicago Chicago, IL USA
Alan Cannon Napier University Edinburgh UK
Wojciech Cellary Poznan University of Economics Poznan Poland
Cornelia Caragea Iowa State University Ames, IA USA
Michal Ceresna Lixto Software GmbH Vienna Austria
Barbara Carminati University of Insubria Varese Italy
Uğur Çetintemel Brown University Providence, RI USA
Michael W. Carroll Villanova University School of Law Villanova, PA USA
Soumen Chakrabarti Indian Institute of Technology Bombay Mumbai India
Ben Carterette University of Massachusetts Amherst Amherst, MA USA
Don Chamberlin IBM Almaden Research Center San Jose, CA USA
Marco A. Casanova Pontifical Catholic University of Rio de Janeiro Rio de Janeiro Brazil
Allen Chan IBM Toronto Software Lab Markham, ON Canada
Giuseppe Castagna C.N.R.S. and University Paris 7 Paris France
Chee Yong Chan National University of Singapore Singapore Singapore
Tiziana Catarci University of Rome Rome Italy
K. Mani Chandy California Institute of Technology Pasadena, CA USA
James Caverlee Texas A&M University College Station, TX USA
Edward Y. Chang Google Research Mountain View, CA USA
Emmanuel Cecchet EPFL Lausanne Switzerland
Kevin C. Chang University of Illinois at Urbana-Champaign Urbana, IL USA
Surajit Chaudhuri Microsoft Research Redmond, WA USA
InduShobha N. Chengalur-Smith University at Albany – SUNY Albany, NY USA
Elizabeth S. Chen Partners HealthCare System Boston, MA USA
Mitch Cherniack Brandeis University Waltham, MA USA
James L. Chen University of Illinois at Chicago Chicago, IL USA
Yun Chi NEC Laboratories America Cupertino, CA USA
Jinjun Chen Swinburne University of Technology Melbourne, VIC Australia
Rada Chirkova North Carolina State University Raleigh, NC USA
Lei Chen Hong Kong University of Science and Technology Hong Kong China Peter P. Chen Louisiana State University Baton Rouge, LA USA Hong Cheng University of Illinois at Urbana-Champaign Urbana, IL USA Chinese University of Hong Kong Hong Kong China Reynold Cheng The University of Hong Kong Hong Kong China Vivying S. Y. Cheng Hong Kong University of Science and Technology (HKUST) Hong Kong China
Jan Chomicki State University of New York at Buffalo Buffalo, NY USA Stephanie Chow University of Ontario Institute of Technology (UOIT) Oshawa, ON Canada Vassilis Christophides University of Crete Heraklion Greece Panos K. Chrysanthis University of Pittsburgh Pittsburgh, PA USA Paolo Ciaccia University of Bologna Bologna Italy John Cieslewicz Columbia University New York, NY USA
Gianluigi Ciocca University of Milano-Bicocca Milan Italy
Dianne Cook Iowa State University Ames, IA USA
Eugene Clark Harvard Medical School Boston, MA USA
Graham Cormode AT&T Labs–Research Florham Park, NJ USA
Charles L. A. Clarke University of Waterloo Waterloo, ON Canada Eliseo Clementini University of L'Aquila L'Aquila Italy Chris Clifton Purdue University West Lafayette, IN USA Edith Cohen AT&T Labs-Research Florham Park, NJ USA Sara Cohen The Hebrew University of Jerusalem Jerusalem Israel Sarah Cohen-Boulakia University of Pennsylvania Philadelphia, PA USA
Antonio Corral University of Almeria Almeria Spain Maria Francesca Costabile University of Bari Bari Italy Nick Craswell Microsoft Research Cambridge Cambridge UK Fabio Crestani University of Lugano Lugano Switzerland Marco Antonio Cristo FUCAPI Manaus Brazil
Carlo Combi University of Verona Verona Italy
Maxime Crochemore King’s College London London UK University of Paris-East Paris France
Mariano P. Consens University of Toronto Toronto, ON Canada
Matthew G. Crowson University of Illinois at Chicago Chicago, IL USA
Michel Crucianu National Conservatory of Arts and Crafts Paris France
Alex Delis University of Athens Athens Greece
Philippe Cudré-Mauroux Massachusetts Institute of Technology Cambridge, MA USA
Alan Demers Cornell University Ithaca, NY USA
Francisco Curbera IBM T.J. Watson Research Center Hawthorne, NY USA
Ke Deng University of Queensland Brisbane, QLD Australia
Peter Dadam University of Ulm Ulm Germany
Amol Deshpande University of Maryland College Park, MD USA
Mehmet M. Dalkiliç Indiana University Bloomington, IN USA
Zoran Despotovic NTT DoCoMo Communications Laboratories Europe Munich Germany
Nilesh Dalvi Yahoo! Research Santa Clara, CA USA
Alin Deutsch University of California-San Diego La Jolla, CA USA
Manoranjan Dash Nanyang Technological University Singapore Singapore
Yanlei Diao University of Massachusetts Amherst, MA USA
Anwitaman Datta Nanyang Technological University Singapore Singapore
Suzanne W. Dietrich Arizona State University Phoenix, AZ USA
Ian Davidson University of California-Davis Davis, CA USA
Nevenka Dimitrova Philips Research Eindhoven The Netherlands
Antonios Deligiannakis University of Athens Athens Greece
Bolin Ding University of Illinois at Urbana-Champaign Champaign, IL USA
Chris Ding University of Texas at Arlington Arlington, TX USA
Xin Luna Dong AT&T Labs–Research Florham Park, NJ USA
Alan Dix Lancaster University Lancaster UK
Chitra Dorai IBM T. J. Watson Research Center Hawthorne, NY USA
Hong-Hai Do SAP AG Dresden Germany
Zhicheng Dou Nankai University Tianjin China
Gillian Dobbie University of Auckland Auckland New Zealand
Yang Du Northeastern University Boston, MA USA
Alin Dobra University of Florida Gainesville, FL USA
Marlon Dumas University of Tartu Tartu Estonia
Vlastislav Dohnal Masaryk University Brno Czech Republic
Susan Dumais Microsoft Research Redmond, WA USA
Mario Döller University of Passau Passau Germany
Schahram Dustdar Technical University of Vienna Vienna Austria
Carlotta Domeniconi George Mason University Fairfax, VA USA
Curtis Dyreson Utah State University Logan, UT USA
Josep Domingo-Ferrer Universitat Rovira i Virgili Tarragona Spain
Todd Eavis Concordia University Montreal, QC Canada
Guozhu Dong Wright State University Dayton, OH USA
Johann Eder University of Vienna Vienna Austria
Ibrahim Abu El-Khair Minia University Minia Egypt
Hui Fang University of Delaware Newark, DE USA
Ahmed K. Elmagarmid Purdue University West Lafayette, IN USA
Wei Fan IBM T.J. Watson Research Hawthorne, NY USA
Sameh Elnikety Microsoft Research Cambridge UK
Wenfei Fan University of Edinburgh Edinburgh UK
David W. Embley Brigham Young University Provo, UT USA
Alan Fekete University of Sydney Sydney, NSW Australia
Vincent Englebert University of Namur Namur Belgium
Jean-Daniel Fekete INRIA, LRI University Paris Sud Orsay Cedex France
AnnMarie Ericsson University of Skövde Skövde Sweden
Pascal Felber University of Neuchatel Neuchatel Switzerland
Martin Ester Simon Fraser University Burnaby, BC Canada
Paolino Di Felice University of L'Aquila L'Aquila Italy
Opher Etzion IBM Research Labs-Haifa Haifa Israel
Hakan Ferhatosmanoglu The Ohio State University Columbus, OH USA
Patrick Eugster Purdue University West Lafayette, IN USA
Eduardo B. Fernandez Florida Atlantic University Boca Raton, FL USA
Ronald Fagin IBM Almaden Research Center San Jose, CA USA
Eduardo Fernández-Medina University of Castilla-La Mancha Ciudad Real Spain
Paolo Ferragina University of Pisa Pisa Italy
Chiara Francalanci Politecnico di Milano University Milan Italy
Elena Ferrari University of Insubria Varese Italy
Andrew U. Frank Vienna University of Technology Vienna Austria
Dennis Fetterly Microsoft Research Mountain View, CA USA
Michael J. Franklin University of California-Berkeley Berkeley, CA USA
Stephen E. Fienberg Carnegie Mellon University Pittsburgh, PA USA
Keir Fraser University of Cambridge Cambridge UK
Peter M. Fischer ETH Zurich Zurich Switzerland
Juliana Freire University of Utah Salt Lake City, UT USA
Simone Fischer-Hübner Karlstad University Karlstad Sweden
Elias Frentzos University of Piraeus Piraeus Greece
Leila De Floriani University of Genova Genova Italy
Johann-Christoph Freytag Humboldt University of Berlin Berlin Germany
Christian Fluhr CEA LIST, Fontenay-aux-Roses France
Ophir Frieder Georgetown University Washington, DC USA
Greg Flurry IBM SOA Advanced Technology Armonk, NY USA
Oliver Frölich Lixto Software GmbH Vienna Austria
Edward A. Fox Virginia Tech Blacksburg, VA USA
Tim Furche University of Munich Munich Germany
Ariel Fuxman Microsoft Research Mountain View, CA USA
Like Gao Teradata Corporation San Diego, CA USA
Ada Wai-Chee Fu Hong Kong University of Science and Technology Hong Kong China
Wei Gao The Chinese University of Hong Kong Hong Kong China
Silvia Gabrielli Bruno Kessler Foundation Trento Italy
Minos Garofalakis Technical University of Crete Chania Greece
Isabella Gagliardi National Research Council (CNR) Milan Italy
Wolfgang Gatterbauer University of Washington Seattle, WA USA
Avigdor Gal Technion – Israel Institute of Technology Haifa Israel
Bugra Gedik IBM T.J. Watson Research Center Hawthorne, NY USA
Wojciech Galuba École Polytechnique Fédérale de Lausanne (EPFL) Lausanne Switzerland
Floris Geerts University of Edinburgh Edinburgh UK
Johann Gamper Free University of Bozen-Bolzano Bolzano Italy
Johannes Gehrke Cornell University Ithaca, NY USA
Vijay Gandhi University of Minnesota Minneapolis, MN USA
Betsy George University of Minnesota Minneapolis, MN USA
Venkatesh Ganti Microsoft Research Redmond, WA USA
Lawrence Gerstley PSMI Consulting San Francisco, CA USA
Dengfeng Gao IBM Silicon Valley Lab San Jose, CA USA
Michael Gertz University of California - Davis Davis, CA USA
Giorgio Ghelli University of Pisa Pisa Italy
Lukasz Golab AT&T Labs-Research Florham Park, NJ USA
Gabriel Ghinita National University of Singapore Singapore Singapore
Matteo Golfarelli University of Bologna Bologna Italy
Phillip B. Gibbons Intel Research Pittsburgh, PA USA
Michael F. Goodchild University of California-Santa Barbara Santa Barbara, CA USA
Sarunas Girdzijauskas EPFL Lausanne Switzerland
Georg Gottlob Oxford University Oxford UK
Fausto Giunchiglia University of Trento Trento Italy
Valerie Gouet-Brunet CNAM Paris Paris France
Kazuo Goda The University of Tokyo Tokyo Japan
Ramesh Govindan University of Southern California Los Angeles, CA USA
Max Goebel Vienna University of Technology Vienna Austria
Goetz Graefe Hewlett-Packard Laboratories Palo Alto, CA USA
Bart Goethals University of Antwerp Antwerp Belgium
Gösta Grahne Concordia University Montreal, QC Canada
Martin Gogolla University of Bremen Bremen Germany
Fabio Grandi University of Bologna Bologna Italy
Aniruddha Gokhale Vanderbilt University Nashville, TN USA
Tyrone Grandison IBM Almaden Research Center San Jose, CA USA
Peter M. D. Gray University of Aberdeen Aberdeen UK
Amarnath Gupta University of California-San Diego La Jolla, CA USA
Todd J. Green University of Pennsylvania Philadelphia, PA USA
Himanshu Gupta Stony Brook University Stony Brook, NY USA
Georges Grinstein University of Massachusetts Lowell, MA USA Tom Gruber RealTravel Emerald Hills, CA USA Le Gruenwald The University of Oklahoma Norman, OK USA Torsten Grust University of Tübingen Tübingen Germany Ralf Hartmut Güting University of Hagen Hagen Germany Dirk Van Gucht Indiana University Bloomington, IN USA Carlos Guestrin Carnegie Mellon University Pittsburgh, PA USA Dimitrios Gunopulos University of California-Riverside Riverside, CA USA University of Athens Athens Greece
Cathal Gurrin Dublin City University Dublin Ireland Marc Gyssens University of Hasselt & Transnational University of Limburg Diepenbeek Belgium Karl Hahn BMW AG Munich Germany Jean-Luc Hainaut University of Namur Namur Belgium Alon Halevy Google Inc. Mountain View, CA USA Maria Halkidi University of Piraeus Piraeus Greece Terry Halpin Neumont University South Jordan, UT USA Jiawei Han University of Illinois at Urbana-Champaign Urbana, IL USA
Alan Hanjalic Delft University of Technology Delft The Netherlands
Alexander Hauptmann Carnegie Mellon University Pittsburgh, PA USA
David Hansen The Australian e-Health Research Centre Brisbane, QLD Australia
Helwig Hauser University of Bergen Bergen Norway
Jörgen Hansson Carnegie Mellon University Pittsburgh, PA USA Nikos Hardavellas Carnegie Mellon University Pittsburgh, PA USA Theo Härder University of Kaiserslautern Kaiserslautern Germany David Harel The Weizmann Institute of Science Rehovot Israel Jayant R. Haritsa Indian Institute of Science Bangalore India Stavros Harizopoulos HP Labs Palo Alto, CA USA Per F. V. Hasle Aalborg University Aalborg Denmark Jordan T. Hastings University of California-Santa Barbara Santa Barbara, CA USA
Ben He University of Glasgow Glasgow UK Pat Helland Microsoft Corporation Redmond, WA USA Joseph M. Hellerstein University of California-Berkeley Berkeley, CA USA Jean Henrard University of Namur Namur Belgium John Herring Oracle Corporation Nashua, NH USA Nicolas Hervé INRIA Paris-Rocquencourt Le Chesnay Cedex France Marcus Herzog Vienna University of Technology Vienna Austria Lixto Software GmbH Vienna Austria
Jean-Marc Hick University of Namur Namur Belgium
Wynne Hsu National University of Singapore Singapore Singapore
Jan Hidders University of Antwerp Antwerpen Belgium
Jian Hu Microsoft Research Asia Haidian China
Djoerd Hiemstra University of Twente Enschede The Netherlands
Kien A. Hua University of Central Florida Orlando, FL USA
Linda L. Hill University of California-Santa Barbara Santa Barbara, CA USA
Xian-Sheng Hua Microsoft Research Asia Beijing China
Alexander Hinneburg Martin-Luther-University Halle-Wittenberg Halle/Saale Germany
Jun Huan University of Kansas Lawrence, KS USA
Hans Hinterberger ETH Zurich Zurich Switzerland
Haoda Huang Microsoft Research Asia Beijing China
Erik Hoel Environmental Systems Research Institute Redlands, CA USA
Michael Huggett University of British Columbia Vancouver, BC Canada
Vasant Honavar Iowa State University Ames, IA USA
Patrick C. K. Hung University of Ontario Institute of Technology (UOIT) Oshawa, ON Canada
Mingsheng Hong Cornell University Ithaca, NY USA
Jeong-Hyon Hwang Brown University Providence, RI USA
Haruo Hosoya The University of Tokyo Tokyo Japan
Ichiro Ide Nagoya University Nagoya Japan
Alfred Inselberg Tel Aviv University Tel Aviv Israel
Kalervo Järvelin University of Tampere Tampere Finland
Yannis Ioannidis University of Athens Athens Greece
Christian S. Jensen Aalborg University Aalborg Denmark
Panagiotis G. Ipeirotis New York University New York, NY USA
Eric C. Jensen Twitter, Inc. San Francisco, CA USA
Zachary Ives University of Pennsylvania Philadelphia, PA USA
Manfred A. Jeusfeld Tilburg University Tilburg The Netherlands
Hans-Arno Jacobsen University of Toronto Toronto, ON Canada
Heng Ji New York University New York, NY USA
H. V. Jagadish University of Michigan Ann Arbor, MI USA
Ricardo Jimenez-Peris Universidad Politecnica de Madrid Madrid Spain
Alejandro Jaimes Telefonica R&D Madrid Spain
Jiashun Jin Carnegie Mellon University Pittsburgh, PA USA
Ramesh Jain University of California-Irvine Irvine, CA USA
Ryan Johnson Carnegie Mellon University Pittsburgh, PA USA
Sushil Jajodia George Mason University Fairfax, VA USA
Theodore Johnson AT&T Labs Research Florham Park, NJ USA
Greg Janée University of California-Santa Barbara Santa Barbara, CA USA
Christopher B. Jones Cardiff University Cardiff UK
Rosie Jones Yahoo! Research Burbank, CA USA
Carl-Christian Kanne University of Mannheim Mannheim Germany
James B. D. Joshi University of Pittsburgh Pittsburgh, PA USA
Aman Kansal Microsoft Research Redmond, WA USA
Vanja Josifovski Uppsala University Uppsala Sweden
Murat Kantarcioglu University of Texas at Dallas Dallas, TX USA
Marko Junkkari University of Tampere Tampere Finland
George Karabatis University of Maryland Baltimore County (UMBC) Baltimore, MD USA
Jan Jurjens The Open University Buckinghamshire UK
Grigoris Karvounarakis University of Pennsylvania Philadelphia, PA USA
Mouna Kacimi Max-Planck Institute for Informatics Saarbrücken Germany
George Karypis University of Minnesota Minneapolis, MN USA
Tamer Kahveci University of Florida Gainesville, FL USA
Vipul Kashyap Partners Healthcare System Wellesley, MA USA
Panos Kalnis National University of Singapore Singapore Singapore
Yannis Katsis University of California-San Diego La Jolla, CA USA
Jaap Kamps University of Amsterdam Amsterdam The Netherlands
Raghav Kaushik Microsoft Research Redmond, WA USA
James Kang University of Minnesota Minneapolis, MN USA
Gabriella Kazai Microsoft Research Cambridge Cambridge UK
Daniel A. Keim University of Konstanz Konstanz Germany
Christoph Koch Cornell University Ithaca, NY USA
Jaana Kekäläinen University of Tampere Tampere Finland
Solmaz Kolahi University of British Columbia Vancouver, BC Canada
Anastasios Kementsietsidis IBM T.J. Watson Research Center Hawthorne, NY USA
George Kollios Boston University Boston, MA USA
Bettina Kemme McGill University Montreal, QC Canada
Poon Wei Koot Nanyang Technological University Singapore Singapore
Jessie Kennedy Napier University Edinburgh UK
Flip R. Korn AT&T Labs–Research Florham Park, NJ USA
Vijay Khatri Indiana University Bloomington, IN USA
Harald Kosch University of Passau Passau Germany
Ashfaq Khokhar University of Illinois at Chicago Chicago, IL USA
Cartik R. Kothari University of British Columbia Vancouver, BC Canada
Daniel Kifer Yahoo! Research Santa Clara, CA USA
Yannis Kotidis Athens University of Economics and Business Athens Greece
Stephen Kimani CSIRO Tasmanian ICT Centre Hobart, TAS Australia
Spyros Kotoulas VU University Amsterdam Amsterdam The Netherlands
Craig A. Knoblock University of Southern California Marina del Rey, CA USA
Manolis Koubarakis University of Athens Athens Greece
Konstantinos Koutroumbas Institute for Space Applications and Remote Sensing Athens Greece
Zoé Lacroix Arizona State University Tempe, AZ USA
Bernd J. Krämer University of Hagen Hagen Germany
Alberto H. F. Laender Federal University of Minas Gerais Belo Horizonte Brazil
Peer Kröger Ludwig-Maximilians University of Munich Munich Germany
Bibudh Lahiri Iowa State University Ames, IA USA
Werner Kriechbaum IBM Development Lab Böblingen Germany
Laks V. S. Lakshmanan University of British Columbia Vancouver, BC Canada
Hans-Peter Kriegel Ludwig-Maximilians-University Munich Germany
Mounia Lalmas University of Glasgow Glasgow UK
Rajasekar Krishnamurthy IBM Almaden Research Center San Jose, CA USA
Lea Landucci University of Florence Florence Italy
Ravi Kumar Yahoo Research Santa Clara, CA USA
Birger Larsen Royal School of Library and Information Science Copenhagen Denmark
Nicholas Kushmerick Decho Corporation Seattle, WA USA
Per-Åke Larson Microsoft Corporation Redmond, WA USA
Mary Laarsgard University of California-Santa Barbara Santa Barbara, CA USA
Robert Laurini LIRIS, INSA-Lyon Lyon France
Alexandros Labrinidis University of Pittsburgh Pittsburgh, PA USA
Georg Lausen University of Freiburg Freiburg Germany
Jens Lechtenbörger University of Münster Münster Germany
Stefano Levialdi Sapienza University of Rome Rome Italy
Thierry Lecroq University of Rouen Rouen France
Brian Levine University of Massachusetts Amherst, MA USA
Dongwon Lee The Pennsylvania State University University Park, PA USA
Changqing Li Duke University Durham, NC USA
Yang W. Lee Northeastern University Boston, MA USA
Chen Li University of California-Irvine Irvine, CA USA
Pieter De Leenheer Vrije Universiteit Brussel, Collibra nv Brussels Belgium
Chengkai Li University of Texas at Arlington Arlington, TX USA
Wolfgang Lehner Dresden University of Technology Dresden Germany
Hua Li Microsoft Research Asia Beijing China
Ronny Lempel Yahoo! Research Haifa Israel
Jinyan Li Nanyang Technological University Singapore Singapore
Kristina Lerman University of Southern California Marina del Rey, CA USA
Ninghui Li Purdue University West Lafayette, IN USA
Ulf Leser Humboldt University of Berlin Berlin Germany
Ping Li Cornell University Ithaca, NY USA
Carson Kai-Sang Leung University of Manitoba Winnipeg, MB Canada
Qing Li City University of Hong Kong Hong Kong China
Xue Li The University of Queensland Brisbane, QLD Australia
Danzhou Liu University of Central Florida Orlando, FL USA
Ying Li IBM T.J. Watson Research Center Hawthorne, NY USA
Guimei Liu National University of Singapore Singapore Singapore
Yunyao Li IBM Almaden Research Center San Jose, CA USA
Huan Liu Arizona State University Tempe, AZ USA
Leonid Libkin University of Edinburgh Edinburgh UK
Jinze Liu University of Kentucky Lexington, KY USA
Sam S Lightstone IBM, Canada Ltd. Markham, ON Canada
Ning Liu Microsoft Research Asia Beijing China
Jimmy Lin University of Maryland College Park, MD USA
Qing Liu CSIRO Tasmanian ICT Centre Hobart, TAS Australia
Tsau Young (T.Y.) Lin San Jose State University San Jose, CA USA
Vebjorn Ljosa Broad Institute of MIT and Harvard Cambridge, MA USA
Xuemin Lin University of New South Wales Sydney, NSW Australia
David Lomet Microsoft Research Redmond, WA USA
Tok Wang Ling National University of Singapore Singapore Singapore
Phillip Lord Newcastle University Newcastle-Upon-Tyne UK
Bing Liu University of Illinois at Chicago Chicago, IL USA
Nikos A. Lorentzos Agricultural University of Athens Athens Greece
Lie Lu Microsoft Research Asia Beijing China
Nikos Mamoulis University of Hong Kong Hong Kong China
Bertram Ludäscher University of California-Davis Davis, CA USA
Stefan Manegold CWI Amsterdam The Netherlands
Yan Luo University of Illinois at Chicago Chicago, IL USA
Murali Mani Worcester Polytechnic Institute Worcester, MA USA
Yves A. Lussier University of Chicago Chicago, IL USA
Serge Mankovski CA Labs, CA Inc. Thornhill, ON Canada
Craig MacDonald University of Glasgow Glasgow UK
Ioana Manolescu INRIA, Saclay–Île-de-France Orsay France
Ashwin Machanavajjhala Cornell University Ithaca, NY USA
Yannis Manolopoulos Aristotle University of Thessaloniki Thessaloniki Greece
Sam Madden Massachusetts Institute of Technology Cambridge, MA USA
Svetlana Mansmann University of Konstanz Konstanz Germany
Paola Magillo University of Genova Genova Italy
Florian Mansmann University of Konstanz Konstanz Germany
David Maier Portland State University Portland, OR USA
Shahar Maoz The Weizmann Institute of Science Rehovot Israel
Paul Maier Technical University of Munich Munich Germany
Amélie Marian Rutgers University Piscataway, NJ USA
Volker Markl IBM Almaden Research Center San Jose, CA USA
Andrew McGregor Microsoft Research Mountain View, CA USA
Maria De Marsico Sapienza University of Rome Rome Italy
Timothy McPhillips University of California-Davis Davis, CA USA
David Martin SRI International Menlo Park, CA USA
Brahim Medjahed The University of Michigan–Dearborn Dearborn, MI USA
Maria Vanina Martinez University of Maryland College Park, MD USA
Carlo Meghini The Italian National Research Council Pisa Italy
Maristella Matera Politecnico di Milano Milan Italy
Tao Mei Microsoft Research Asia Beijing China
Marta Mattoso Federal University of Rio de Janeiro Rio de Janeiro Brazil
Jonas Mellin University of Sko¨vde Sko¨vde Sweden
Andrea Maurino University of Milan Bicocca Milan Italy
Massimo Melucci University of Padua Padua Italy
Jan Małuszyński Linköping University Linköping Sweden
Weiyi Meng State University of New York at Binghamton Binghamton, NY USA
Jose-Norberto Mazón University of Alicante Alicante Spain
Ahmed Metwally Google Inc. Mountain View, CA USA
Kevin S. McCurley Google Research Mountain View, CA USA
Gerome Miklau University of Massachusetts Amherst, MA USA
Harvey J. Miller University of Utah Salt Lake City, UT USA
Peter Mork The MITRE Corporation McLean, VA USA
Rene´e J. Miller University of Toronto Toronto, ON Canada
Mirella M. Moro Federal University of Rio Grande do Sul Porto Alegre Brazil
Tova Milo Tel Aviv University Tel Aviv Israel
Edleno Silva de Moura Federal University of Amazonas Manaus Brazil
Prasenjit Mitra The Pennsylvania State University University Park, PA USA
Kyriakos Mouratidis Singapore Management University Singapore Singapore
Michael Mitzenmacher Harvard University Boston, MA USA
Kamesh Munagala Duke University Durham, NC USA
Mukesh Mohania IBM India Research Lab New Delhi India
Ethan V. Munson University of Wisconsin-Milwaukee Milwaukee, WI USA
Mohamed F. Mokbel University of Minnesota Minneapolis, MN USA
Shawn Murphy Massachusetts General Hospital Boston, MA USA
Angelo Montanari University of Udine Udine Italy
John Mylopoulos University of Toronto Toronto, ON Canada
Reagan W. Moore University of California - San Diego La Jolla, CA USA
Frank Nack University of Amsterdam Amsterdam The Netherlands
Konstantinos Morfonios University of Athens Athens Greece
Marc Najork Microsoft Research Mountain View, CA USA
Ullas Nambiar IBM India Research Lab New Delhi India
Chong-Wah Ngo City University of Hong Kong Hong Kong China
Alexandros Nanopoulos Aristotle University Thessaloniki Greece
Peter Niblett IBM United Kingdom Limited Winchester UK
Vivek Narasayya Microsoft Corporation Redmond, WA USA
Naoko Nitta Osaka University Osaka Japan
Mario A. Nascimento University of Alberta Edmonton, AB Canada
Igor Nitto University of Pisa Pisa Italy
Alan Nash Aleph One LLC La Jolla, CA USA
Cheng Niu Microsoft Research Asia Beijing China
Harald Naumann Vienna University of Technology Vienna Austria
Vilém Novák University of Ostrava Ostrava Czech Republic
Gonzalo Navarro University of Chile Santiago Chile
Chimezie Ogbuji Cleveland Clinic Foundation Cleveland, OH USA
Wolfgang Nejdl University of Hannover Hannover Germany
Peter Øhrstrøm Aalborg University Aalborg Denmark
Thomas Neumann Max-Planck Institute for Informatics Saarbrücken Germany
Christine M. O’Keefe CSIRO Preventative Health National Research Flagship Acton, ACT Australia
Frank Neven Hasselt University and Transnational University of Limburg Diepenbeek Belgium
Patrick O’Neil University of Massachusetts Boston, MA USA
Iadh Ounis University of Glasgow Glasgow UK
Dimitris Papadias Hong Kong University of Science and Technology Hong Kong China
Mourad Ouzzani Purdue University West Lafayette, IN USA
Spiros Papadimitriou IBM T.J. Watson Research Center Hawthorne, NY USA
Fatma Özcan IBM Almaden Research Center San Jose, CA USA
Apostolos N. Papadopoulos Aristotle University Thessaloniki Greece
M. Tamer Özsu University of Waterloo Waterloo, ON Canada
Yannis Papakonstantinou University of California-San Diego La Jolla, CA USA
Esther Pacitti University of Nantes Nantes France
Jan Paredaens University of Antwerp Antwerpen Belgium
Chris D. Paice Lancaster University Lancaster UK
Christine Parent University of Lausanne Lausanne Switzerland
Noël De Palma INPG – INRIA Grenoble France
Gabriella Pasi University of Milano-Bicocca Milan Italy
Nathaniel Palmer Workflow Management Coalition Hingham, MA USA
Chintan Patel Columbia University New York, NY USA
Biswanath Panda Cornell University Ithaca, NY USA
Jignesh M. Patel University of Wisconsin-Madison Madison, WI USA
Ippokratis Pandis Carnegie Mellon University Pittsburgh, PA USA
Marta Patiño-Martinez Universidad Politecnica de Madrid Madrid Spain
Norman W. Paton University of Manchester Manchester UK
Mario Piattini University of Castilla-La Mancha Ciudad Real Spain
Cesare Pautasso University of Lugano Lugano Switzerland
Benjamin C. Pierce University of Pennsylvania Philadelphia, PA USA
Torben Bach Pedersen Aalborg University Aalborg Denmark
Karen Pinel-Sauvagnat IRIT-SIG Toulouse Cedex France
Fernando Pedone University of Lugano Lugano Switzerland
Leo L. Pipino University of Massachusetts Lowell, MA USA
Jovan Pehcevski INRIA Paris-Rocquencourt Le Chesnay Cedex France
Peter Pirolli Palo Alto Research Center Palo Alto, CA USA
Jian Pei Simon Fraser University Burnaby, BC Canada
Evaggelia Pitoura University of Ioannina Ioannina Greece
Ronald Peikert ETH Zurich Zurich Switzerland
Benjamin Piwowarski University of Glasgow Glasgow UK
Mor Peleg University of Haifa Haifa Israel
Vassilis Plachouras Yahoo! Research Barcelona Spain
Fuchun Peng Yahoo! Inc. Sunnyvale, CA USA
Catherine Plaisant University of Maryland College Park, MD USA
Liam Peyton University of Ottawa Ottawa, ON Canada
Claudia Plant University of Munich Munich Germany
Christian Platzer Technical University of Vienna Vienna Austria
Vivien Quéma CNRS, INRIA Saint-Ismier Cedex France
Dimitris Plexousakis Foundation for Research and Technology-Hellas (FORTH) Heraklion Greece
Christoph Quix RWTH Aachen University Aachen Germany
Neoklis Polyzotis University of California Santa Cruz Santa Cruz, CA USA
Sriram Raghavan IBM Almaden Research Center San Jose, CA USA
Raymond K. Pon University of California - Los Angeles Los Angeles, CA USA
Erhard Rahm University of Leipzig Leipzig Germany
Lucian Popa IBM Almaden Research Center San Jose, CA USA
Krithi Ramamritham IIT Bombay Mumbai India
Alexandra Poulovassilis University of London London UK
Maya Ramanath Max-Planck Institute for Informatics Saarbrücken Germany
Sunil Prabhakar Purdue University West Lafayette, IN USA
Georgina Ramírez Yahoo! Research Barcelona Barcelona Spain
Cecilia M. Procopiuc AT&T Labs Florham Park, NJ USA
Edie Rasmussen University of British Columbia Vancouver, BC Canada
Enrico Puppo University of Genova Genova Italy
Indrakshi Ray Colorado State University Fort Collins, CO USA
Ross S. Purves University of Zurich Zurich Switzerland
Diego Reforgiato Recupero University of Maryland College Park, MD USA
Colin R. Reeves Coventry University Coventry UK
Rami Rifaieh University of California-San Diego San Diego, CA USA
Payam Refaeilzadeh Arizona State University Tempe, AZ USA
Stefanie Rinderle University of Ulm Ulm Germany
Bernd Reiner Technical University of Munich Munich Germany Frederick Reiss IBM Almaden Research Center San Jose, CA USA Harald Reiterer University of Konstanz Konstanz Germany Matthias Renz Ludwig Maximilian University of Munich Munich Germany Andreas Reuter EML Research gGmbH Villa Bosch Heidelberg Germany Technical University Kaiserslautern Kaiserslautern Germany
Tore Risch Uppsala University Uppsala Sweden Thomas Rist University of Applied Sciences Augsburg Germany Stefano Rizzi University of Bologna Bologna Italy Stephen Robertson Microsoft Research Cambridge Cambridge UK Roberto A. Rocha Partners Healthcare System, Inc. Boston, MA USA John F. Roddick Flinders University Adelaide, SA Australia
Peter Revesz University of Nebraska-Lincoln Lincoln, NE USA
Thomas Roelleke Queen Mary University of London London UK
Mirek Riedewald Cornell University Ithaca, NY USA
Didier Roland University of Namur Namur Belgium
Oscar Romero Polytechnic University of Catalonia Barcelona Spain
Kenneth Salem University of Waterloo Waterloo, ON Canada
Rafael Romero University of Alicante Alicante Spain
George Samaras University of Cyprus Nicosia Cyprus
Timothy Roscoe ETH Zurich Zurich Switzerland
Giuseppe Santucci University of Rome Roma Italy
Kenneth A. Ross Columbia University New York, NY USA
Maria Luisa Sapino University of Turin Turin Italy
Prasan Roy Aster Data Systems, Inc. Redwood City, CA USA
Sunita Sarawagi IIT Bombay Mumbai India
Yong Rui Microsoft China R&D Group Redmond, WA USA
Anatol Sargin University of Augsburg Augsburg Germany
Dan Russler Oracle Health Sciences Redwood Shores, CA USA
Kai-Uwe Sattler Technical University of Ilmenau llmenau Germany
Michael Rys Microsoft Corporation Sammamish, WA USA
Monica Scannapieco University of Rome Rome Italy
Giovanni Maria Sacco University of Torino Torino Italy
Matthias Schäfer University of Konstanz Konstanz Germany
Simonas Šaltenis Aalborg University Aalborg Denmark
Sebastian Schaffert Salzburg Research Salzburg Austria
Ralf Schenkel Max-Planck Institute for Informatics Saarbrücken Germany
Heiko Schuldt University of Basel Basel Switzerland
Raimondo Schettini University of Milano-Bicocca Milan Italy
Heidrun Schumann University of Rostock Rostock Germany
Peter Scheuermann Northwestern University Evanston, IL USA Ulrich Schiel Federal University of Campina Grande Campina Grande Brazil Markus Schneider University of Florida Gainesville, FL USA Marc H. Scholl University of Konstanz Konstanz Germany Michel Scholl Cedric-CNAM Paris France Tobias Schreck Darmstadt University of Technology Darmstadt Germany
Felix Schwagereit University of Koblenz-Landau Koblenz Germany Nicole Schweikardt Johann Wolfgang Goethe-University Frankfurt Germany Fabrizio Sebastiani The Italian National Research Council Pisa Italy Nicu Sebe University of Amsterdam Amsterdam The Netherlands University of Trento Trento Italy Monica Sebillo University of Salerno Salerno Italy
Michael Schrefl University of Linz Linz Austria
Thomas Seidl RWTH Aachen University Aachen Germany
Matthias Schubert Ludwig-Maximilians-University Munich Germany
Manuel Serrano University of Castilla – La Mancha Ciudad Real Spain
Amnon Shabo (Shvo) IBM Research Lab-Haifa Haifa Israel
Dennis Shasha New York University New York, NY USA
Mehul A. Shah HP Labs Palo Alto, CA USA
Carpendale Sheelagh University of Calgary Calgary, AB Canada
Nigam Shah Stanford University Stanford, CA USA
Shashi Shekhar University of Minnesota Minneapolis, MN USA
Cyrus Shahabi University of Southern California Los Angeles, CA USA
Dou Shen Microsoft Corporation Redmond, WA USA
Jayavel Shanmugasundaram Yahoo! Research Santa Clara, CA USA
Heng Tao Shen The University of Queensland Brisbane, QLD Australia
Marc Shapiro INRIA Paris-Rocquencourt and LIP6 Paris France
Jialie Shen Singapore Management University Singapore Singapore
Mohamed A. Sharaf University of Toronto Toronto, ON Canada
Rao Shen Yahoo! Sunnyvale, CA USA
Mehdi Sharifzadeh Google Santa Monica, CA USA
Xuehua Shen Google, Inc. Mountain View, CA USA
Jayant Sharma Oracle Corporation Nashua, NH USA
Frank Y. Shih New Jersey Institute of Technology Newark, NJ USA
Guy Sharon IBM Research Labs-Haifa Haifa Israel
Arie Shoshani Lawrence Berkeley National Laboratory Berkeley, CA USA
Pavel Shvaiko University of Trento Trento Italy
Cristina Sirangelo University of Edinburgh Edinburgh UK
Wolf Siberski University of Hannover Hannover Germany
Yannis Sismanis IBM Almaden Research Center Almaden, CA USA
Ronny Siebes VU University Amsterdam Amsterdam The Netherlands
Spiros Skiadopoulos University of Peloponnese Tripoli Greece
Adam Silberstein Yahoo! Research Silicon Valley Santa Clara, CA USA
Richard T. Snodgrass University of Arizona Tucson, AZ USA
Sonia Fernandes Silva Etruria Telematica Srl Siena Italy
Cees Snoek University of Amsterdam Amsterdam The Netherlands
Fabrizio Silvestri ISTI-CNR Pisa Italy
Il-Yeol Song Drexel University Philadelphia, PA USA
Alkis Simitsis IBM Almaden Research Center San Jose, CA USA
Ruihua Song Microsoft Research Asia Beijing China
Simeon J. Simoff University of Western Sydney Sydney, NSW Australia
Stefano Spaccapietra EPFL Lausanne Switzerland
Radu Sion Stony Brook University Stony Brook, NY USA
Greg Speegle Baylor University Waco, TX USA
Mike Sips Stanford University Stanford, CA USA
Padmini Srinivasan The University of Iowa Iowa City, IA USA
Venkat Srinivasan Virginia Tech Blacksburg, VA USA
Diane M. Strong Worcester Polytechnic Institute Worcester, MA USA
Divesh Srivastava AT&T Labs–Research Florham Park, NJ USA
Jianwen Su University of California-Santa Barbara Santa Barbara, CA USA
Steffen Staab University of Koblenz-Landau Koblenz Germany
Kazimierz Subieta Polish-Japanese Institute of Information Technology Warsaw Poland
Maarten van Steen VU University Amsterdam The Netherlands
V. S. Subrahmanian University of Maryland College Park, MD USA
Constantine Stephanidis Foundation for Research and Technology – Hellas (FORTH) Heraklion Greece
Dan Suciu University of Washington Seattle, WA USA
Robert Stevens University of Manchester Manchester UK
S. Sudarshan Indian Institute of Technology Bombay India
Andreas Stoffel University of Konstanz Konstanz Germany
Torsten Suel Yahoo! Research Sunnyvale, CA USA
Michael Stonebraker Massachusetts Institute of Technology Cambridge, MA USA
Jian-Tao Sun Microsoft Research Asia Beijing China
Umberto Straccia The Italian National Research Council Pisa Italy
Subhash Suri University of California-Santa Barbara Santa Barbara, CA USA
Martin J. Strauss University of Michigan Ann Arbor, MI USA
Stefan Tai University of Karlsruhe Karlsruhe Germany
Kian-Lee Tan National University of Singapore Singapore Singapore
Sandeep Tata IBM Almaden Research Center San Jose, CA USA
Pang-Ning Tan Michigan State University East Lansing, MI USA
Nesime Tatbul ETH Zurich Zurich Switzerland
Wang-Chiew Tan University of California-Santa Cruz Santa Cruz, CA USA
Christophe Taton INPG – INRIA Grenoble France
Letizia Tanca Politecnico di Milano University Milan Italy
Paolo Terenziani University of Turin Turin Italy
Lei Tang Arizona State University Tempe, AZ USA
Evimaria Terzi IBM Almaden Research Center San Jose, CA USA
Wei Tang Teradata Corporation El Segundo, CA USA
Bernhard Thalheim Christian-Albrechts University Kiel Kiel Germany
Egemen Tanin University of Melbourne Melbourne, VIC Australia
Martin Theobald Stanford University Stanford, CA USA
Val Tannen University of Pennsylvania Philadelphia, PA USA
Sergios Theodoridis University of Athens Athens Greece
Abdullah Uz Tansel Baruch College – CUNY New York, NY USA
Yannis Theodoridis University of Piraeus Piraeus Greece
Yufei Tao Chinese University of Hong Kong Hong Kong China
Alexander Thomasian Thomasian and Associates Pleasantville, NY USA
Bhavani Thuraisingham The University of Texas at Dallas Richardson, TX USA
Goce Trajcevski Northwestern University Evanston, IL USA
Srikanta Tirthapura Iowa State University Ames, IA USA
Peter Triantafillou University of Patras Rio Patras Greece
Wee Hyong Tok National University of Singapore Singapore Singapore
Silke Trißl Humboldt University of Berlin Berlin Germany
David Toman University of Waterloo Waterloo, ON Canada
Andrew Trotman University of Otago Dunedin New Zealand
Frank Wm. Tompa University of Waterloo Waterloo, ON Canada
Juan Trujillo University of Alicante Alicante Spain
Rodney Topor Griffith University Nathan, QLD Australia
Theodora Tsikrika Center for Mathematics and Computer Science Amsterdam The Netherlands
Riccardo Torlone University of Rome Rome Italy
Vassilis J. Tsotras University of California-Riverside Riverside, CA USA
Kristian Torp Aalborg University Aalborg Denmark
Peter A. Tucker Whitworth University Spokane, WA USA
Nicola Torpei University of Florence Florence Italy
Anthony K. H. Tung National University of Singapore Singapore Singapore
Nerius Tradišauskas Aalborg University Aalborg Denmark
Theodoros Tzouramanis University of the Aegean Samos Greece
Antti Ukkonen Helsinki University of Technology Helsinki Finland Mollie Ullman-Cullere Harvard Medical School Boston, MA USA Antony Unwin Augsburg University Augsburg Germany
Stijn Vansummeren Hasselt University and Transnational University of Limburg Diepenbeek Belgium Vasilis Vassalos Athens University of Economics and Business Athens Greece Michael Vassilakopoulos University of Central Greece Lamia Greece
Ali Ünlü University of Augsburg Augsburg Germany
Panos Vassiliadis University of Ioannina Ioannina Greece
Susan D. Urban Texas Tech University Lubbock, TX USA
Michalis Vazirgiannis Athens University of Economics & Business Athens Greece
Jaideep Vaidya Rutgers University Newark, NJ USA
Olga Vechtomova University of Waterloo Waterloo, ON Canada
Shivakumar Vaithyanathan IBM Almaden Research Center San Jose, CA USA
Erik Vee Yahoo! Research Silicon Valley, CA USA
Athena Vakali Aristotle University Thessaloniki Greece
Jari Veijalainen University of Jyvaskyla Jyvaskyla Finland
Patrick Valduriez INRIA and LINA Nantes France
Yannis Velegrakis University of Trento Trento Italy
Christelle Vangenot EPFL Lausanne Switzerland
Suresh Venkatasubramanian University of Utah Salt Lake City, UT USA
Rossano Venturini University of Pisa Pisa Italy
Feng Wang City University of Hong Kong Hong Kong China
Victor Vianu University of California-San Diego La Jolla, CA USA
Jianyong Wang Tsinghua University Beijing China
K. Vidyasankar Memorial University of Newfoundland St. John’s, NL Canada
Jun Wang Queen Mary University of London London UK
Millist Vincent University of South Australia Adelaide, SA Australia Giuliana Vitiello University of Salerno Salerno Italy Michail Vlachos IBM T.J. Watson Research Center Hawthorne, NY USA
Meng Wang Microsoft Research Asia Beijing China X. Sean Wang University of Vermont Burlington, VT USA Xin-Jing Wang Microsoft Research Asia Beijing China
Agnès Voisard Fraunhofer Institute for Software and Systems Engineering (ISST) Berlin Germany
Matthew O. Ward Worcester Polytechnic Institute Worcester, MA USA
Kaladhar Voruganti Network Appliance Sunnyvale, CA USA
Segev Wasserkrug IBM Research Haifa Israel
Gottfried Vossen University of Münster Münster Germany
Hans Weda Philips Research Eindhoven The Netherlands
Kenichi Wada Hitachi Limited Tokyo Japan
Gerhard Weikum Max-Planck Institute for Informatics Saarbrücken Germany
Michael Weiner Indiana University School of Medicine Indianapolis, IN USA
Ian H. Witten University of Waikato Hamilton New Zealand
Michael Weiss Carleton University Ottawa, ON Canada
Kent Wittenburg Mitsubishi Electric Research Laboratories, Inc. Cambridge, MA USA
Ji-Rong Wen Microsoft Research Asia Beijing China
Eric Wohlstadter University of British Columbia Vancouver, BC Canada
Chunhua Weng Columbia University New York, NY USA
Dietmar Wolfram University of Wisconsin-Milwaukee Milwaukee, WI USA
Mathias Weske University of Potsdam Potsdam Germany
Ouri Wolfson University of Illinois at Chicago Chicago, IL USA
Thijs Westerveld Teezir Search Solutions Ede The Netherlands
Janette Wong IBM Canada Ltd. Markham, ON Canada
Karl Wiggisser University of Klagenfurt Klagenfurt Austria
Raymond Chi-Wing Wong Hong Kong University of Science and Technology Hong Kong China
Jef Wijsen University of Mons-Hainaut Mons Belgium
Peter T. Wood Birkbeck, University of London London UK
Mark D. Wilkinson University of British Columbia Vancouver, BC Canada
David Woodruff IBM Almaden Research Center San Jose, CA USA
Graham Wills SPSS Inc. Chicago, IL USA
Marcel Worring University of Amsterdam Amsterdam The Netherlands
Adam Wright Partners HealthCare Boston, MA USA Yuqing Wu Indiana University Bloomington, IN USA Alex Wun University of Toronto Toronto, ON Canada Ming Xiong Bell Labs Murray Hill, NJ USA Guandong Xu Victoria University Melbourne, VIC Australia Hua Xu Columbia University New York, NY USA Jun Yan Microsoft Research Asia Haidian China Xifeng Yan IBM T. J. Watson Research Center Hawthorne, NY USA Jun Yang Duke University Durham, NC USA Li Yang Western Michigan University Kalamazoo, MI USA
Ming-Hsuan Yang University of California at Merced Merced, CA USA Seungwon Yang Virginia Tech Blacksburg, VA USA Yu Yang City University of Hong Kong Hong Kong China Yun Yang Swinburne University of Technology Melbourne, VIC Australia Yong Yao Cornell University Ithaca, NY USA Mikalai Yatskevich University of Trento Trento Italy Hiroshi Yoshida Fujitsu Limited Yokohama Japan Masatoshi Yoshikawa University of Kyoto Kyoto Japan Matthew Young-Lai Sybase iAnywhere Waterloo, ON Canada
Cong Yu Yahoo! Research New York, NY USA
Hugo Zaragoza Yahoo! Research Barcelona Spain
Hwanjo Yu University of Iowa Iowa City, IA USA
Stan Zdonik Brown University Providence, RI USA
Jeffrey Xu Yu Chinese University of Hong Kong Hong Kong China Philip S. Yu IBM T.J. Watson Research Center Yorktown Heights, NY USA Ting Yu North Carolina State University Raleigh, NC USA Vladimir Zadorozhny University of Pittsburgh Pittsburgh, PA USA Ilya Zaihrayeu University of Trento Trento Italy Mohammed J. Zaki Rensselaer Polytechnic Institute Troy, NY USA Carlo Zaniolo University of California-Los Angeles Los Angeles, CA USA
Demetrios Zeinalipour-Yazti University of Cyprus Nicosia Cyprus Hans Zeller Hewlett-Packard Laboratories Palo Alto, CA USA Pavel Zezula Masaryk University Brno Czech Republic ChengXiang Zhai University of Illinois at Urbana-Champaign Urbana, IL USA Aidong Zhang State University of New York at Buffalo Buffalo, NY USA Benyu Zhang Microsoft Research Asia Beijing China Donghui Zhang Northeastern University Boston, MA USA Ethan Zhang University of California-Santa Cruz and Yahoo! Inc. Santa Cruz, CA USA
Jin Zhang University of Wisconsin-Milwaukee Milwaukee, WI USA
Feng Zhao Microsoft Research Redmond, WA USA
Kun Zhang Xavier University of Louisiana New Orleans, LA USA
Ying Zhao Tsinghua University Beijing China
Lei Zhang Microsoft Research Asia Beijing China Li Zhang Peking University Beijing China Qing Zhang The Australian e-health Research Center Brisbane, QLD Australia Rui Zhang University of Melbourne Melbourne, VIC Australia Yanchun Zhang Victoria University Melbourne, VIC Australia Yi Zhang University of California-Santa Cruz Santa Cruz, CA USA
Baihua Zheng Singapore Management University Singapore Singapore Yi Zheng University of Ontario Institute of Technology (UOIT) Oshawa, ON Canada Jingren Zhou Microsoft Research Redmond, WA USA Li Zhou Partners HealthCare System Inc. and Harvard Medical School Boston, MA USA Zhi-Hua Zhou Nanjing University Nanjing China
Yue Zhang University of Pittsburgh Pittsburgh, PA USA
Huaiyu Zhu IBM Almaden Research Center San Jose, CA USA
Zhen Zhang University of Illinois at Urbana-Champaign Urbana, IL USA
Xingquan Zhu Florida Atlantic University Boca Raton, FL USA
Cai-Nicolas Ziegler Siemens AG Munich Germany
Arthur Zimek Ludwig-Maximilians University of Munich Munich Germany
Hartmut Ziegler University of Konstanz Konstanz Germany
Esteban Zimányi Free University of Brussels Brussels Belgium
A
Absolute Time
CHRISTIAN S. JENSEN 1, RICHARD T. SNODGRASS 2
1 Aalborg University, Aalborg, Denmark
2 University of Arizona, Tucson, AZ, USA
Definition A temporal database contains time-referenced, or timestamped, facts. A time reference in such a database is absolute if its value is independent of the context, including the current time, now.
Key Points
An example is "Mary's salary was raised on March 30, 2007." The fact here is that Mary's salary was raised. The absolute time reference is March 30, 2007, which is a time instant at the granularity of day. Another example is "Mary's monthly salary was $15,000 from January 1, 2006 to November 30, 2007." In this example, the absolute time reference is the time period [January 1, 2006 – November 30, 2007]. Absolute time can be contrasted with relative time.
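For concreteness, such time-referenced facts are often stored in a valid-time relation whose period endpoints are absolute time references. The following sketch uses an invented table and invented column names; it is an illustration only, not a schema prescribed by the entry.

-- Hypothetical valid-time table; FromDate and ToDate hold absolute time
-- references at the granularity of day.
CREATE TABLE Salary (
  Name     VARCHAR(64),
  Monthly  DECIMAL(10,2),
  FromDate DATE,
  ToDate   DATE
);

-- The second fact above, with both period endpoints given as absolute dates.
INSERT INTO Salary VALUES ('Mary', 15000.00, DATE '2006-01-01', DATE '2007-11-30');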
Cross-references
▶ Now in Temporal Databases ▶ Relative Time ▶ Time Instant ▶ Time Period ▶ Temporal Database ▶ Temporal Granularity
Recommended Reading
1. Bettini C., Dyreson C.E., Evans W.S., Snodgrass R.T., and Wang X.S. A glossary of time granularity concepts. In Temporal Databases: Research and Practice, O. Etzion, S. Jajodia, S. Sripada (eds.). LNCS, vol. 1399. Springer, 1998, pp. 406–413.
2. Jensen C.S. and Dyreson C.E. (eds.). A consensus glossary of temporal database concepts – February 1998 version. In Temporal Databases: Research and Practice, O. Etzion, S. Jajodia, S. Sripada (eds.). LNCS, vol. 1399. Springer, 1998, pp. 367–405.
Abstract Versus Concrete Temporal Query Languages
JAN CHOMICKI 1, DAVID TOMAN 2
1 State University of New York at Buffalo, Buffalo, NY, USA
2 University of Waterloo, Waterloo, ON, Canada
Synonyms Historical query languages
Definition Temporal query languages are a family of query languages designed to query (and access in general) time-dependent information stored in temporal databases. The languages are commonly defined as extensions of standard query languages for non-temporal databases with temporal features. The additional features reflect the way dependencies of data on time are captured by and represented in the underlying temporal data model.
Historical Background Most databases store time-varying information. On the other hand, SQL is often the language of choice for developing applications that utilize the information in these databases. Plain SQL, however, does not seem to provide adequate support for temporal applications. Example. To represent the employment histories of persons, a common relational design would use a schema Employment(FromDate, ToDate, EID, Company),
with the intended meaning that a person identified by EID worked for Company continuously from FromDate to ToDate. Note that while the above schema is a standard relational schema, the additional assumption that the values of the attributes FromDate and ToDate represent continuous periods of time is itself not a part of the relational model. Formulating even simple queries over such a schema is non-trivial. For example, the query GAPS: “List all persons with gaps in their employment history, together
with the gaps’’ leads to a rather complex formulation in, e.g., SQL over the above schema (this is left as a challenge to readers who consider themselves SQL experts; for a list of appealing, but incorrect solutions, including the reasons why, see [9]). The difficulty arises because a single tuple in the relation is conceptually a compact representation of a set of tuples, each tuple stating that an employment fact was true on a particular day. The tension between the conceptual abstract temporal data model (in the example, the property that employment facts are associated with individual time instants) and the need for an efficient and compact representation of temporal data (in the example, the representation of continuous periods by their start and end instants) has been reflected in the development of numerous temporal data models and temporal query languages [3].
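To appreciate the difficulty, the following is one possible plain-SQL formulation over the interval-encoded schema; it is offered only as an illustrative sketch, not as the entry's own solution. It returns, for each person with a gap, the endpoints of a maximal gap, and it writes one-day date arithmetic with INTERVAL '1' DAY, whose exact spelling is DBMS-specific.

-- Sketch: a person has a gap starting on the day after e1.ToDate if that day is
-- not covered by any employment period but some later period exists.
SELECT DISTINCT e1.EID,
       e1.ToDate + INTERVAL '1' DAY AS GapStart,
       (SELECT MIN(e2.FromDate)
        FROM Employment e2
        WHERE e2.EID = e1.EID
          AND e2.FromDate > e1.ToDate) - INTERVAL '1' DAY AS GapEnd
FROM Employment e1
WHERE EXISTS (SELECT *                     -- some employment period starts later
              FROM Employment e2
              WHERE e2.EID = e1.EID
                AND e2.FromDate > e1.ToDate)
  AND NOT EXISTS (SELECT *                 -- but the day right after e1 ends is uncovered
                  FROM Employment e3
                  WHERE e3.EID = e1.EID
                    AND e3.FromDate <= e1.ToDate + INTERVAL '1' DAY
                    AND e1.ToDate + INTERVAL '1' DAY <= e3.ToDate);

Even this version requires a careful correctness argument, since the subqueries must range over all periods of the same person and not only adjacent ones, which illustrates why the point-based view discussed below is attractive.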
Foundations Temporal query languages are commonly defined using temporal extensions of existing non-temporal query languages, such as relational calculus, relational algebra, or SQL. The temporal extensions can be categorized in two, mostly orthogonal, ways:
The choice of the actual temporal values manipulated by the language. This choice is primarily determined by the underlying temporal data model. The model also determines the associated operations on these values. The meaning of temporal queries is then defined in terms of temporal values and operations on them, and their interactions with data (non-temporal) values in a temporal database.
The choice of syntactic constructs to manipulate temporal values in the language. This distinction determines whether the temporal values in the language are accessed and manipulated explicitly, in a way similar to other values stored in the database, or whether the access is implicit, based primarily on temporally extending the meaning of constructs that already exist in the underlying non-temporal language (while still using the operations defined by the temporal data model).
Additional design considerations relate to compatibility with existing query languages, e.g., the notion of temporal upward compatibility. However, as illustrated above, an additional hurdle stems from the fact that many (early) temporal query
languages allowed the users to manipulate a finite underlying representation of temporal databases rather than the actual temporal values/objects in the associated temporal data model. A typical example of this situation would be an approach in which the temporal data model is based on time instants, while the query language introduces interval-valued attributes. Such a discrepancy often leads to a complex and unintuitive semantics of queries. In order to clarify this issue, Chomicki has introduced the notions of abstract and concrete temporal databases and query languages [2]. Intuitively, abstract temporal query languages are defined at the conceptual level of the temporal data model, while their concrete counterparts operate directly on an actual compact encoding of temporal databases. The relationship between abstract and concrete temporal query languages is also implicitly present in the notion of snapshot equivalence [7]. Moreover, Bettini et al. [1] proposed to distinguish between explicit and implicit information in a temporal database. The explicit information is stored in the database and used to derive the implicit information through semantic assumptions. Semantic assumptions related to fact persistence play a role similar to mappings between concrete and abstract databases, while other assumptions are used to address time-granularity issues.
Abstract Temporal Query Languages
Most temporal query languages derived by temporally extending the relational calculus can be classified as abstract temporal query languages. Their semantics are defined in terms of abstract temporal databases which, in turn, are typically defined within the point-stamped temporal data model, in particular without any additional hidden assumptions about the meaning of tuples in instances of temporal relations. Example. The employment histories in an abstract temporal data model would most likely be captured by a simpler schema ‘‘Employment(Date, EID, Company)’’, with the intended meaning that a person identified by EID was working for Company on a particular Date. While instances of such a schema can potentially be very large (especially when a fine granularity of time is used), formulating queries is now much more natural. Choosing abstract temporal query languages over concrete ones resolves the first design issue: the temporal values used by the former languages are time instants equipped with an appropriate temporal ordering (which
is typically a linear order over the instants), and possibly other predicates such as temporal distance. The second design issue – access to temporal values – may be resolved in two different ways, as exemplified by two different query languages. They are as follows:
Temporal Relational Calculus (TRC): a two-sorted first-order logic with variables and quantifiers explicitly ranging over the time and data domains.
First-order Temporal Logic (FOTL): a language with an implicit access to timestamps using temporal connectives.
Example. The GAPS query is formulated as follows:
TRC: ∃t1,t3. t1 < t2 < t3 ∧ ∃c.Employment(t1, x, c) ∧ (¬∃c.Employment(t2, x, c)) ∧ ∃c.Employment(t3, x, c);
FOTL: ◆∃c.Employment(x, c) ∧ (¬∃c.Employment(x, c)) ∧ ◇∃c.Employment(x, c)
Here, the explicit access to temporal values (in TRC) using the variables t1, t2, and t3 can be contrasted with the implicit access (in FOTL) using the temporal operators ◆ (read "sometime in the past") and ◇ (read "sometime in the future"). The conjunction in the FOTL query represents an implicit temporal join. The formulation in TRC leads immediately to an equivalent way of expressing the query in SQL/TP [9], an extension of SQL based on TRC.
Example. The above query can be formulated in SQL/TP as follows:
SELECT t.Date, e1.EID
FROM Employment e1, Time t, Employment e2
WHERE e1.EID = e2.EID
  AND e1.Date < t.Date AND t.Date < e2.Date
  AND NOT EXISTS ( SELECT *
                   FROM Employment e3
                   WHERE e1.EID = e3.EID
                     AND t.Date = e3.Date
                     AND e1.Date < e3.Date
                     AND e3.Date < e2.Date )
The unary constant relation Time contains all time instants in the time domain (in our case, all Dates) and is only needed to fulfill syntactic SQL-style requirements on attribute ranges. However, despite the fact that the instance of this relation is not finite, the query can be efficiently evaluated [9].
Note also that in all of the above cases, the formulation is exactly the same as if the underlying temporal database used the plain relational model (allowing for attributes ranging over time instants). The two languages, FOTL and TRC, are the counterparts of the snapshot and timestamp models (cf. the entry Point-stamped Data Models) and are the roots of many other temporal query languages, ranging from the more TRC-like temporal extensions of SQL to more FOTL-like temporal relational algebras (e.g., the conjunction in temporal logic directly corresponds to a temporal join in a temporal relational algebra, as both of them induce an implicit equality on the associated time attributes). Temporal integrity constraints over point-stamped temporal databases can also be conveniently expressed in TRC or FOTL.
Multiple Temporal Dimensions and Complex Values.
While the abstract temporal query languages are typically defined in terms of the point-based temporal data model, they can similarly be defined with respect to complex temporal values, e.g., pairs (or tuples) of time instants or even sets of time instants. In these cases, particularly in the case of set-valued attributes, it is important to remember that the set values are treated as indivisible objects, and hence truth (i.e., query semantics) is associated with the entire objects, but not necessarily with their components/subparts.
Concrete Temporal Query Languages
Although abstract temporal query languages provide a convenient and clean way of specifying queries, they are not immediately amenable to implementation. The main problem is that, in practice, the facts in temporal databases persist over periods of time. Storing all true facts individually for every time instant during a period would be prohibitively expensive or, in the case of infinite time domains such as dense time, even impossible. Concrete temporal query languages avoid these problems by operating directly on the compact encodings of temporal databases. The most commonly used encoding is the one that uses intervals. However, in this setting, a tuple that associates a fact with such an interval is a compact representation of the association between the same fact and all the time instants that belong to this interval. This observation leads to the design choices that are commonly present in such languages:
Coalescing is used, explicitly or implicitly, to consolidate representations of (sets of) time instants associated with the same fact. In the case of interval-based encodings, this leads to coalescing adjoining or overlapping intervals into a single interval. Note that coalescing only changes the concrete representation of a temporal relation, not its meaning (i.e., the abstract temporal relation); hence it has no counterpart in abstract temporal query languages.
Implicit set operations on time values are used in relational operations. For example, conjunction (join) typically uses set intersection to generate a compact representation of the time instants attached to the facts in the result of such an operation.
Example. For the running example, a concrete schema for the employment histories would typically be defined as "Employment(VT, EID, Company)" where VT is a valid time attribute ranging over periods (intervals). The GAPS query can be formulated in a calculus-style language corresponding to TSQL2 (see the entry on TSQL2) along the following lines:
∃I1, I2. [∃c.Employment(I1, x, c)] ∧ [∃c.Employment(I2, x, c)] ∧ I1 precedes I2 ∧ I = [end(I1) + 1, begin(I2) − 1].
In particular, the variables I1 and I2 range over periods and the precedes relationship is one of Allen's interval relationships. The final conjunct, I = [end(I1) + 1, begin(I2) − 1], creates a new period corresponding to the time instants related to a person's gap in employment; this interval value is explicitly constructed from the end and start points of I1 and I2, respectively. For the query to be correct, however, the results of evaluating the bracketed subexpressions, e.g., "[∃c.Employment(I1, x, c)]," have to be coalesced. Without the insertion of the explicit coalescing operators, the query is incorrect. To see that, consider a situation in which a person p0 is first employed by a company c1, then by c2, and finally by c3, without any gaps in employment. Then without coalescing of the bracketed subexpressions of the above query, p0 will be returned as a part of the result of the query, which is incorrect. Note also that it is not enough for the underlying (concrete) database to be coalesced.
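To make the role of coalescing concrete, the following sketch coalesces the projection of the concrete relation on the person and the validity period, i.e., the relation denoted by the bracketed subexpressions above. It assumes the period attribute VT is stored as day-granularity endpoints FromDate and ToDate, a DBMS with SQL window functions, and INTERVAL-style date arithmetic; none of this syntax is part of TSQL2.

-- Sketch: merge adjoining or overlapping employment periods of each person.
WITH ordered AS (
  SELECT EID, FromDate, ToDate,
         -- latest end point among all earlier periods of the same person
         MAX(ToDate) OVER (PARTITION BY EID
                           ORDER BY FromDate, ToDate
                           ROWS BETWEEN UNBOUNDED PRECEDING
                                    AND 1 PRECEDING) AS prev_end
  FROM Employment
),
flagged AS (
  SELECT EID, FromDate, ToDate,
         -- a new coalesced period starts where a genuine gap precedes the tuple
         CASE WHEN prev_end IS NULL
                OR FromDate > prev_end + INTERVAL '1' DAY
              THEN 1 ELSE 0 END AS starts_new_period
  FROM ordered
),
grouped AS (
  SELECT EID, FromDate, ToDate,
         SUM(starts_new_period) OVER (PARTITION BY EID
                                      ORDER BY FromDate, ToDate) AS grp
  FROM flagged
)
SELECT EID, MIN(FromDate) AS FromDate, MAX(ToDate) AS ToDate
FROM grouped
GROUP BY EID, grp;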
The need for an explicit use of coalescing often makes the formulation of queries in some concrete SQL-based temporal query languages cumbersome and error-prone. An orthogonal issue is the difference between explicit and implicit access to temporal values. This distinction also carries over to the concrete temporal languages. Typically, the various temporal extensions of SQL are based on the assumption of an explicit access to temporal values (often employing a built-in valid time attribute ranging over intervals or temporal elements), while many temporal relational algebras have chosen to use the implicit access based on temporally extending standard relational operators such as temporal join or temporal projection.
Compilation and Query Evaluation.
An alternative to allowing users direct access to the encodings of temporal databases is to develop techniques that allow the evaluation of abstract temporal queries over these encodings. The main approaches are based on query compilation techniques that map abstract queries to concrete queries, while preserving query answers. More formally:
Q(||E||) = ||eval(Q)(E)||,
where Q is an abstract query, eval(Q) is the corresponding concrete query, E is a concrete temporal database, and ||.|| is a mapping that associates encodings (concrete temporal databases) with their abstract counterparts (cf. Fig. 1). Note that a single abstract temporal database, D, can be encoded using several different instances of the corresponding concrete database, e.g., E1 and E2 in Fig. 1. Most of the practical temporal data models adopt a common approach to physical representation of temporal databases: with every fact (usually represented as a tuple), a concise encoding of the set of time points at which the fact holds is associated. The encoding is commonly realized by intervals [6,7] or temporal elements (finite unions of intervals). For such an encoding it has been shown that both First-Order Temporal Logic [4] and Temporal Relational Calculus [8] queries can be compiled to first-order queries over a natural relational representation of the interval encoding of the database. Evaluating the resulting queries yields the interval encodings of the answers to the original queries, as if the queries were directly evaluated on the point-stamped temporal database. Similar results can be obtained for more complex encodings, e.g., periodic
Abstract Versus Concrete Temporal Query Languages. Figure 1. Query evaluation over interval encodings of point-stamped temporal databases.
sets, and for abstract temporal query languages that adopt the duplicate semantics matching the SQL standard, such as SQL/TP [9].
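As an illustration of the kind of concrete query such a compilation may produce, an implicit temporal join can be evaluated over interval-encoded relations by intersecting validity periods. The Assignment relation, its columns, and the GREATEST/LEAST functions are assumptions made for this sketch; they are not part of SQL/TP or of the compilation results cited above.

-- Sketch: joining two interval-encoded relations; the validity period of each
-- result tuple is the intersection of the two input periods. The result may
-- still require coalescing to be a canonical encoding of the abstract answer.
SELECT e.EID, e.Company, a.Project,
       GREATEST(e.FromDate, a.FromDate) AS FromDate,
       LEAST(e.ToDate, a.ToDate)        AS ToDate
FROM Employment e, Assignment a
WHERE e.EID = a.EID
  AND e.FromDate <= a.ToDate    -- the two periods overlap,
  AND a.FromDate <= e.ToDate;   -- so their intersection is non-empty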
Key Applications Temporal query languages are primarily used for querying temporal databases. However, because of their generality they can be applied in other contexts as well, e.g., as an underlying conceptual foundation for querying sequences and data streams [5].
Cross-references
▶ Allen’s Relations ▶ Bitemporal Relation ▶ Constraint Databases ▶ Key ▶ Nested Transaction Models ▶ Non First Normal Form ▶ Point-Stamped Temporal Models ▶ Relational Model ▶ Snapshot Equivalence ▶ SQL ▶ Telic Distinction in Temporal Databases
▶ Temporal Coalescing ▶ Temporal Data Models ▶ Temporal Element ▶ Temporal Granularity ▶ Temporal Integrity Constraints ▶ Temporal Joins ▶ Temporal Logic in Database Query Languages ▶ Temporal Relational Calculus ▶ Time Domain ▶ Time Instant ▶ Transaction Time ▶ TSQL2 ▶ Valid Time
Recommended Reading
1. Bettini C., Wang X.S., and Jajodia S. Temporal semantic assumptions and their use in databases. Knowl. Data Eng., 10(2):277–296, 1998.
2. Chomicki J. Temporal query languages: a survey. In Proc. 1st Int. Conf. on Temporal Logic, 1994, pp. 506–534.
3. Chomicki J. and Toman D. Temporal databases. In Handbook of Temporal Reasoning in Artificial Intelligence, Fischer M., Gabbay D., and Villa L. (eds.). Elsevier Foundations of Artificial Intelligence, 2005, pp. 429–467.
4. Chomicki J., Toman D., and Böhlen M.H. Querying ATSQL databases with temporal logic. ACM Trans. Database Syst., 26(2):145–178, 2001.
5. Law Y.-N., Wang H., and Zaniolo C. Query languages and data models for database sequences and data streams. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004, pp. 492–503.
6. Navathe S.B. and Ahmed R. Temporal extensions to the relational model and SQL. In Tansel A., Clifford J., Gadia S., Jajodia S., Segev A., and Snodgrass R.T. (eds.). Temporal Databases: Theory, Design, and Implementation. Benjamin/Cummings, Menlo Park, CA, 1993, pp. 92–109.
7. Snodgrass R.T. The temporal query language TQuel. ACM Trans. Database Syst., 12(2):247–298, 1987.
8. Toman D. Point vs. interval-based query languages for temporal databases. In Proc. 15th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 1996, pp. 58–67.
9. Toman D. Point-based temporal extensions of SQL. In Proc. 5th Int. Conf. on Deductive and Object-Oriented Databases, 1997, pp. 103–121.
Abstraction
BERNHARD THALHEIM
Christian-Albrechts University Kiel, Kiel, Germany
Synonyms Component abstraction; Implementation abstraction; Association; Aggregation; Composition; Grouping; Specialization; Generalisation; Classification
Definition Abstraction allows developers to concentrate on the essential, relevant, or important parts of an application. It uses a mapping to a model from things in reality or from virtual things. The model has the truncation property, i.e., it lacks some of the details in the original, and a pragmatic property, i.e., the model use is only justified for particular model users, tools of investigation, and periods of time. Database engineering uses construction abstraction, context abstraction, and refinement abstraction. Construction abstraction is based on the principles of hierarchical structuring, constructor composition, and generalization. Context abstraction assumes that the surroundings of a concept are commonly understood by a community or within a culture and focuses on the concept, turning away attention from its surroundings such as the environment and setting. Refinement abstraction uses the principle of modularization and information hiding. Developers typically use conceptual models or languages for
representing and conceptualizing abstractions. An enhanced entity-relationship model schema is typically depicted by an EER diagram.
Key Points Database engineering distinguishes three kinds of abstraction: construction abstraction, context abstraction, and refinement abstraction. Constructor composition depends on the constructors as originally introduced by J. M. Smith and D.C.W. Smith. Composition constructors must be well founded and their semantics must be derivable by inductive construction. There are three main methods for construction: development of ordered structures on the basis of hierarchies, construction by combination or association, and construction by classification into groups or collections. The set constructors ⊆ (subset), × (product), and P (powerset) for subset, product, and nesting are complete for the construction of sets. Subset constructors support hierarchies of object sets in which one set of objects is a subset of some other set of objects. Subset hierarchies usually form rooted trees. Product constructors support associations between object sets. The schema is decomposed into object sets related to each other by association or relationship types. Power set constructors support a classification of object sets into clusters or groups of sets – typically according to their properties. Context abstraction allows developers to commonly concentrate on those parts of an application that are essential for some perspectives during development and deployment of systems. Typical types of context abstraction are component abstraction, separation of concern, interaction abstraction, summarization, scoping, and focusing on typical application cases. Component abstraction factors out repeating, shared or local patterns of components or functions from individual concepts. It allows developers to concentrate on structural or behavioral aspects of similar elements of components. Separation of concern allows developers to concentrate on those concepts under development and to neglect all other concepts that are stable or not under consideration. Interaction abstraction allows developers to concentrate on parts of the model that are essential for interaction with other systems or users. Summarization maps the conceptualizations within the scope to more abstract concepts. Scoping is typically used to select those concepts that are necessary for current development and removes
those concepts that do not have an impact on the necessary concepts. Database models may cover a large variety of different application cases. Some of them reflect exceptional, abnormal, infrequent and untypical application situations. Focusing on typical application cases explicitly separates models intended for the normal or typical application case from those that are atypical. Atypical application cases are not neglected but can be folded into the model whenever atypical situations are considered. The context abstraction concept is the main concept behind federated databases. Context of databases can be characterized by schemata, version, time, and security requirements. Sub-schemata, types of the schemata or views on the schemata, are associated with explicit import/export bindings based on a name space. Parametrization lets developers consider collections of objects. Objects are identifiable under certain assumptions and completely identifiable after instantiation of all parameters. Interaction abstraction allows developers to display the same set of objects in different forms. The view concept supports this visibility concept. Data is abstracted and displayed in various levels of granularity. Summarization abstraction allows developers to abstract from details that are irrelevant at a certain point. Scope abstraction allows developers to concentrate on a number of aspects. Names or aliases can be multiply used with varying structure, functionality and semantics. Refinement abstraction mainly concerns implementation and modularization. It allows developers to selectively retain information about structures. Refinement abstraction is defined on the basis of the development cycle (refinement of implementations). It refines, summarizes and views conceptualizations, hides or encapsulates details, or manages collections of versions. Each refinement step transforms a schema to a schema of finer granularity. Refinement abstraction may be modeled by refinement theory and infomorphisms. Encapsulation removes internal aspects and concentrates on interface components. Blackbox or graybox approaches hide all aspects of the objects being considered. Partial visibility may be supported by modularization concepts. Hiding supports differentiation of concepts into public, private (with the possibility to be visible as "friends") and protected (with visibility to subconcepts). It is possible to define a number of visibility conceptualizations based on inflection. Inflection is used for the injection of
combinable views into the given view, for tailoring, ordering and restructuring of views, and for enhancement of views by database functionality. Behavioral transparency is supported by the glassbox approach. Security views are based on hiding. Versioning allows developers to manage a number of concepts which can be considered to be versions of each other.
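As a small illustration of the three construction principles discussed above (hierarchies, associations, and grouping), the following relational sketch shows one way they can surface in a schema. The table and column names are invented for this example and are not taken from the entry.

-- Subset (hierarchy): every Employee is a Person.
CREATE TABLE Person   (PID INT PRIMARY KEY, Name VARCHAR(100));
CREATE TABLE Employee (PID INT PRIMARY KEY REFERENCES Person(PID),
                       Salary DECIMAL(10,2));

-- Product (association): a relationship type between object sets.
CREATE TABLE WorksFor (PID     INT REFERENCES Employee(PID),
                       Company VARCHAR(100),
                       PRIMARY KEY (PID, Company));

-- Grouping (classification): named collections of objects.
CREATE TABLE Team       (TeamID INT PRIMARY KEY, TeamName VARCHAR(100));
CREATE TABLE TeamMember (TeamID INT REFERENCES Team(TeamID),
                         PID    INT REFERENCES Person(PID),
                         PRIMARY KEY (TeamID, PID));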
Cross-references
▶ Entity Relationship Model ▶ Extended Entity-Relationship Model ▶ Language Models ▶ Object Data Models ▶ Object-Role Modeling ▶ Specialization and Generalization
Recommended Reading
1. Börger E. The ASM refinement method. Formal Aspect. Comput., 15:237–257, 2003.
2. Smith J.M. and Smith D.C.W. Data base abstractions: aggregation and generalization. ACM Trans. Database Syst., 2(2):105–133, 1977.
3. Thalheim B. Entity-Relationship Modeling – Foundations of Database Technology. Springer, 2000.
Access Control
ELENA FERRARI
University of Insubria, Varese, Italy
Synonyms Authorization verification
Definition Access control deals with preventing unauthorized operations on the managed data. Access control is usually performed against a set of authorizations stated by Security Administrators (SAs) or users according to the access control policies of the organization. Authorizations are then processed by the access control mechanism (or reference monitor) to decide whether each access request can be authorized or should be denied.
Historical Background Access control models for DBMSs have been greatly influenced by the models developed for the protection of operating system resources. For instance, the model
proposed by Lampson [16] is also known as the access matrix model since authorizations are represented as a matrix. However, much of the early work on database protection was on inference control in statistical databases. Then, in the 1970s, as research in relational databases began, attention was directed towards access control issues. As part of the research on System R at IBM Almaden Research Center, there was much work on access control for relational database systems [11,15], which strongly influenced access control models and mechanisms of current commercial relational DBMSs. Around the same time, some early work on multilevel secure database management systems (MLS/DBMSs) was reported. However, it was only after the Air Force Summer Study in 1982 [1] that developments on MLS/DBMSs began. For instance, early prototypes were based on the integrity lock mechanisms developed at the MITRE Corporation. Later, in the mid-1980s, pioneering research was carried out at SRI International and Honeywell Inc. on systems such as SeaView and LOCK Data Views [9]. Some of the technologies developed by these research efforts were transferred to commercial products by corporations such as Oracle, Sybase, and Informix. In the 1990s, numerous other developments were made to meet the access control requirements of new applications and environments, such as the World Wide Web, data warehouses, data mining systems, multimedia systems, sensor systems, workflow management systems, and collaborative systems. This resulted in several extensions to the basic access control models previously developed, including support for temporal constraints, derivation rules, positive and negative authorizations, strong and weak authorizations, and content and context-dependent authorizations [14]. Role-based access control has been proposed [12] to simplify authorization management within companies and organizations. Recently, there have been numerous developments in access control, mainly driven by developments in web data management. For example, standards such as XML (eXtensible Markup Language) and RDF (Resource Description Framework) require proper access control mechanisms [7]. Also, web services and the semantic web are becoming extremely popular and therefore research is currently carried out to address the related access control issues [13]. Access control is currently being examined for new application areas, such as knowledge management [4], data outsourcing, GIS
[10], peer-to-peer computing and stream data management [8]. For example, in the case of knowledge management applications, it is important to protect the intellectual property of an organization, whereas when data are outsourced, it is necessary to allow the owner to enforce its access control policies, even if data are managed by a third party.
Foundations The basic building block on which access control relies is a set of authorizations, which state who can access which resource, and under which mode. Authorizations are specified according to a set of access control policies, which define the high-level rules according to which access control must occur. In its basic form, an authorization is, in general, specified on the basis of three components (s,o,p), and specifies that subject s is authorized to exercise privilege p on object o. The three main components of an authorization have the following meaning:
Authorization subjects: They are the "active" entities in the system to which authorizations are granted. Subjects can be further classified into the following, not mutually exclusive, categories: users, that is, single individuals connecting to the system; groups, that is, sets of users; roles, that is, named collections of privileges needed to perform specific activities within the system; and processes, executing programs on behalf of users.
Authorization objects: They are the "passive" components (i.e., resources) of the system to which protection from unauthorized accesses should be given. The set of objects to be protected clearly depends on the considered environment. For instance, files and directories are examples of objects of an operating system environment, whereas in a relational DBMS, examples of resources to be protected are relations, views and attributes. Authorizations can be specified at different granularity levels, that is, on a whole object or only on some of its components. This is a useful feature when an object (e.g., a relation) contains information (e.g., tuples) of different sensitivity levels and therefore requires a differentiated protection.
Authorization privileges: They state the types of operations (or access modes) that a subject can exercise on the objects in the system. As for objects, the set of privileges also depends on the resources
Access Control
to be protected. For instance, read, write, and execute privileges are typical of an operating system environment, whereas in a relational DBMS privileges refer to SQL commands (e.g., select, insert, update, delete). Moreover, new environments such as digital libraries are characterized by new access modes, for instance, usage or copying access rights. Depending on the considered domain and the way in which access control is enforced, objects, subjects and/or privileges can be hierarchically organized. The hierarchy can be exploited to propagate authorizations and therefore to simplify authorization management by limiting the set of authorizations that must be explicitly specified. For instance, when objects are hierarchically organized, the hierarchy usually represents a ‘‘part-of ’’ relation, that is, the hierarchy reflects the way objects are organized in terms of other objects. In contrast, the privilege hierarchy usually represents a subsumption relation among privileges. Privileges towards the bottom of the hierarchy are subsumed by privileges towards the top (for instance, the write privilege is at a higher level in the hierarchy with respect to the read privilege, since write subsumes read operations). Also roles and groups can be hierarchically organized. The group hierarchy usually reflects the membership of a group to another group. In contrast, the role hierarchy usually reflects the relative position of roles within an organization. The higher the level of a role in the hierarchy, the higher its position in the organization.
A
Authorizations are stored into the system and are then used to verify whether an access request can be authorized or not. How to represent and store authorizations depends on the protected resources. For instance, in a relational DBMS, authorizations are modeled as tuples stored into system catalogs. In contrast, when resources to be protected are XML documents, authorizations are usually encoded using XML itself. Finally, the last key component of the access control infrastructure is the access control mechanism (or reference monitor), which is a trusted software module in charge of enforcing access control. It intercepts each access request submitted to the system (for instance, SQL statements in case of relational DBMSs) and, on the basis of the specified authorizations, it determines whether the access can be partially or totally authorized or should be denied. The reference monitor should be non-bypassable. Additionally, the hardware and software architecture should ensure that the reference monitor is tamper proof, that is, it cannot be maliciously modified (or at least that any improper modification can be detected). The main components of access control are illustrated in Fig. 1. A basic distinction when dealing with access control is between discretionary and mandatory access control. Discretionary access control (DAC) governs the access of subjects to objects on the basis of subjects’ identity and a set of explicitly specified authorizations that specify, for each subject, the set of objects that
Access Control. Figure 1. Access control: main components.
he/she can access in the system and the allowed access modes. When an access request is submitted to the system, the access control mechanism verifies whether or not the access can be authorized according to the specified authorizations. The system is discretionary in the sense that a subject, by properly configuring the set of authorizations, is both able to enforce various access control requirements and to dynamically change them when needed (simply by updating the authorization state). In contrast, mandatory access control (MAC) specifies the accesses that subjects can exercise on the objects in the system, on the basis of the security classification of subjects and objects [14]. Security classes usually form a partially ordered set. This type of security has also been referred to as multilevel security, and database systems that enforce multilevel access control are called Multilevel Secure Database Management Systems (MLS/DBMSs). When mandatory access control is enforced, authorizations are implicitly specified, by assigning subjects and objects proper security classes. The decision on whether or not to grant an access depends on the access mode and the relation existing between the classification of the subject requesting the access and that of the requested object. In addition to DAC and MAC, role-based access control (RBAC) has been more recently proposed [12]. RBAC is an alternative to discretionary and mandatory access control, mainly conceived for regulating accesses within companies and organizations. In RBAC, permissions are associated with roles, instead of with users, and users acquire permissions through their membership in roles. The set of authorizations can be inferred by the sets of user-role and role-permission assignments.
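A few SQL statements (with invented object, user, and role names) sketch how these models surface in SQL-based DBMSs: under DAC the owner of a table grants a privilege, possibly with the right to delegate it, while in the RBAC style permissions are attached to a role and users acquire them through role membership.

-- DAC with delegation: the owner of Employee lets ann read it and pass
-- the read privilege on to others.
GRANT SELECT ON Employee TO ann WITH GRANT OPTION;

-- ann, in turn, delegates the privilege (without further delegation rights).
GRANT SELECT ON Employee TO bob;

-- RBAC style: permissions are granted to a role, and users obtain them
-- through membership in the role.
CREATE ROLE payroll_clerk;
GRANT SELECT, UPDATE ON Employee TO payroll_clerk;
GRANT payroll_clerk TO carol;

-- Revoking the role withdraws the permissions carol held through it.
REVOKE payroll_clerk FROM carol;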
Key Applications Access control techniques are applied in almost all environments that need to grant a controlled access to their resources, including, but not limited to, the following: DBMSs, Data Stream Management Systems, Operating Systems, Workflow Management Systems, Digital Libraries, GIS, Multimedia DBMSs, E-commerce services, Publish-subscribe systems, Data warehouses.
Future Directions Although access control is a mature area with consolidated results, the evolution of DBMSs and the requirements of new applications and environments pose new challenges to the research community. An interesting
discussion on open research issues in the field can be found in [6]. Some research issues which complement those presented in [6] are discussed below. Social networks. Web-based social networks (WBSNs) are online communities where participants can establish relationships and share resources across the web with other users. In recent years, several WBSNs have been adopting semantic web technologies, such as FOAF, for representing users’ data and relationships, making it possible to enforce information interchange across multiple WBSNs. Despite its advantages in terms of information diffusion, this raised the need for giving content owners more control on the distribution of their resources, which may be accessed by a community far wider than they expected. So far, this issue has been mainly addressed in a very simple way, by some of the available WBSNs, by only allowing users to state whether a specific information (e.g., personal data and resources) should be public or accessible only by the users with whom the owner of such information has a direct relationship. Such simple access control strategies have the advantage of being straightforward, but they are not flexible enough in denoting authorized users. In fact, they do not take into account the type of the relationships existing between users and, consequently, it is not possible to state that only, say, my ‘‘friends’’ can access a given information. Moreover, they do not allow to grant access to users who have an indirect relationship with the resource owner (e.g., the ‘‘friends of my friends’’). Therefore, more flexible mechanisms are needed, making a user able to decide which network participants are authorized to access his/her resources and personal information. Additionally, since the number of social network users is considerably higher than those in conventional DBMSs, the traditional server-side way of enforcing access control, that is, the one relying on a centralized trusted reference monitor, should be revised and more efficient and distributed strategies should be devised for WBSNs. Until now, apart from [3], most of the security research on WBSNs has focused on privacy-preserving mining of social network data. The definition of a comprehensive framework for efficiently enforcing access control in social networks is therefore still an issue to be investigated. Data streams. In many applications, such as telecommunication, battle field monitoring, network
Access Control
monitoring, financial monitoring, sensor networks, data arrive in the form of high speed data streams. These data typically contain sensitive information (e.g., health information, credit card numbers) and thus unauthorized accesses should be avoided. Although many data stream processing systems have been developed so far (e.g., Aurora, Borealis, STREAM, TelegraphCQ, and StreamBase), the focus of these systems has been mainly on performance issues rather than on access control. On the other hand, though the data security community has a very rich history in developing access control models [9], these models are largely tailored to traditional DBMSs and therefore they cannot be readily applied to data stream management systems [8]. This is mainly because: (i) traditional data are static and bounded, while data streams are unbounded and infinite; (ii) queries in traditional DBMSs are one time and ad-hoc, whereas queries over data streams are typically continuous and long running; (iii) in traditional DBMSs, access control is enforced when users access the data; (iv) in data stream applications access control enforcement is data-driven (i.e., whenever data arrive), as such access control is more computational intensive in data stream applications and specific techniques to handle it efficiently should be devised; (v) temporal constraints (e.g., sliding windows) are more critical in data stream applications than in traditional DBMSs. Semantic web. The web is now evolving into the semantic web. The semantic web [5] is a web that is intelligent with machine-readable web pages. The major components of the semantic web include web infrastructures, web databases and services, ontology management and information integration. There has been much work on each of these areas. However, very little work has been devoted to access control. If the semantic web is to be effective, it is necessary to ensure that the information on the web is protected from unauthorized accesses and malicious modifications. Also, it must be ensured that individual’s privacy is maintained. To cope with these issues, it is necessary to secure all the semantic web related technologies, such as XML, RDF, Agents, Databases, web services, and Ontologies and ensure the secure interoperation of all these technologies [13].
A
Cross-references
▶ Access Control Policy Languages ▶ Discretionary Access Control ▶ Mandatory Access Control ▶ Multilevel Secure Database Management System ▶ Role Based Access Control ▶ Storage Security
Recommended Reading 1. Air Force Studies Board, Committee on Multilevel Data Management Security. Multilevel data management security. National Research Council, 1983. 2. Berners-Lee T. et al. The semantic web. Scientific American, 2001. 3. Bertino E., and Sandhu R.S. Database security: concepts, approaches, and challenges. IEEE Trans. Dependable and Secure Computing, 2(1):2–19, 2005. 4. Bertino E., Khan L.R., Sandhu R.S., and Thuraisingham B.M. Secure knowledge management: confidentiality, trust, and privacy. IEEE Trans. Syst. Man Cybern. A, 36(3):429–438, 2006. 5. Carminati B., Ferrari E., and Perego A. Enforcing access control in web-based social networks. ACM trans. Inf. Syst. Secur., to appear. 6. Carminati B., Ferrari E., and Tan K.L. A framework to enforce access control over Data Streams. ACM Trans. Inf. Syst. Secur., to appear. 7. Carminati B., Ferrari E., and Thuraisingham B.M. Access control for web data: models and policy languages. Ann. Telecomm., 61 (3–4):245–266, 2006. 8. Carminati B., Ferrari E., and Bertino E. Securing XML data in third party distribution systems. In Proc. of the ACM Fourteenth Conference on Information and Knowledge Management, 2005. 9. Castano S., Fugini M.G., Martella G., and Samarati P. Database security. Addison Wesley, 1995. 10. Damiani M.L. and Bertino E. Access control systems for geo-spatial data and applications. In Modelling and management of geographical data over distributed architectures, A. Belussi, B. Catania, E. Clementini, E. Ferrari (eds.). Springer, 2007. 11. Fagin R. On an authorization mechanism. ACM Trans. Database Syst., 3(3):310–319, 1978. 12. Ferraiolo D.F., Sandhu R.S., Gavrila S.I., Kuhn D.R., and Chandramouli R. Proposed NIST standard for role-based access control. ACM Trans. Inf. Syst. Secur., 4(3):224–274, 2001. 13. Ferrari E. and Thuraisingham B.M. Security and privacy for web databases and services. In Advances in Database Technology, Proc. 9th Int. Conf. on Extending Database Technology, 2004, pp. 17–28. 14. Ferrari E. and Thuraisingham B.M. Secure database systems. In O. Diaz, M. Piattini (eds.). Advanced databases: technology and design. Artech House, 2000. 15. Griffiths P.P. and Wade B.W. An authorization mechanism for a relational database system. ACM Trans. Database Syst., 1 (3):242–255, 1976. 16. Lampson B.W. Protection. Fifth Princeton Symposium on Information Science and Systems, Reprinted in ACM Oper. Sys. Rev., 8(1):18–24, 1974.
Access Control Administration Policies
Elena Ferrari
University of Insubria, Varese, Italy
Synonyms Authorization administration policies; Authorization administration privileges
Definition Administration policies regulate who can modify the authorization state, that is, who has the right to grant and revoke authorizations.
Historical Background Authorization management is an important issue when dealing with access control and, as such, research on this topic is strongly related to the developments in access control. A milestone in the field is represented by the research carried out in the 1970s at IBM in the framework of the System R project. In particular, the work by Griffiths and Wade [9] defines a semantics for authorization revocation, which has greatly influenced the way in which authorization revocation has been implemented in commercial relational DBMSs. Administrative policies for object-oriented DBMSs have been studied in [8]. Later on, some extensions to the System R access control administration model have been defined [3], with the aim of making it more flexible and adaptable to a variety of access control requirements. Additionally, as the research on extending the System R access control model with enhanced functionalities progressed, authorization administration has been studied for these extensions, such as temporal authorizations [2], and strong and weak, positive and negative authorizations [4]. Also, administrative policies for new environments and data models, such as WFMSs [1] and XML data [12], have been investigated. Back in the 1990s, when research on role-based access control began, administration policies for RBAC were investigated [6,11,10,13]. Some of the ideas developed as part of this research were adopted by the current SQL:2003 standard [7].
Foundations Access control administration deals with granting and revoking of authorizations. This function is usually
regulated by proper administration policies. Usually, if mandatory access control is enforced, the adopted administration policies are very simple, so that the Security Administrator (SA) is the only one authorized to change the classification level of subjects and objects. In contrast, discretionary and role-based access control are characterized by more articulated administration policies, which can be classified according to the following categories [3]: SA administration. According to this policy, only the SA can grant and revoke authorizations. Although the SA administration policy has the advantage of being very simple and easily implemented, it has the disadvantage of being highly centralized (even though different SAs can manage different portions of the database) and is seldom used in current DBMSs, apart from very simple systems. Object owner administration. This is the policy commonly adopted by DBMSs and operating systems. Under this policy, whoever creates an object becomes its owner and he/she is the only one authorized to grant and revoke authorizations on the object. Joint administration. Under this policy, particularly suited for collaborative environments, several subjects are jointly responsible for administering specific authorizations. For instance, under the joint administration policy it can be a requirement that the authorization to write a certain document be given by two different users, such as two different job functions within an organization. An authorization for a subject to access a data object then requires that all the administrators of the object issue a grant request. The object owner administration policy can be further combined with administration delegation, according to which the administrator of an object can grant other subjects the right to grant and revoke authorizations on the object. Delegation can be specified for selected privileges, for example only for read operations. Most current DBMSs support the owner administration policy with delegation. For instance, the Grant command provided by the SQL:2003 standard [7] supports an optional Grant Option clause. If a privilege p is granted with the grant option on an object o, the subject receiving it is not only authorized to exercise p on object o but he/she is also authorized to grant other subjects authorizations for p on object o, with or without the grant option. Moreover, SQL:2003 provides an optional Admin Option clause, which has
the same meaning as the Grant option clause but applies to roles instead of standard authorizations. If a subject is granted the authorization to play a role with the admin option, he/she not only receives all the authorizations associated with the role, but can also authorize other subjects to play that role. If administration delegation is supported, different administrators can grant the same authorization to the same subject. A subject can therefore receive an authorization for the same privilege on the same object from different sources. An important issue is therefore related to the management of revoke operations, that is, what happens when a subject revokes some of the authorizations he/she previously granted. For instance, consider three users: Ann, Tom, and Alice. Suppose that Ann grants Tom the privilege to select tuples from the Employee relation with the grant option and that, by having this authorization, Tom grants Alice the same privilege on the Employee relation. What happens to the authorization of Alice when Ann revokes from Tom the privilege to select tuples from the Employee relation? The System R authorization model [9] adopts the most cautious approach with respect to security by enforcing recursive revocation: whenever a subject revokes an authorization on a relation from another subject, all the authorizations that the revokee had granted because of the revoked authorization are recursively removed from the system. The
revocation is iteratively applied to all the subjects that received an authorization from the revokee. In the example above, Alice will lose the privilege to select tuples from the Employee relation when Ann revokes this privilege from Tom. Implementing recursive revocation requires keeping track of the grantor of each authorization, that is, the subject who specifies the authorization, since the same authorization can be granted by different subjects, as well as of its timestamp, that is, the time when it was specified. To understand why the timestamp is important in correctly implementing recursive revocation, consider the graph in Fig. 1a, which represents the authorization state for a specific privilege p on a specific object o. Nodes represent subjects, and an edge from node n1 to node n2 means that n1 has granted privilege p on object o to n2. The edge is labeled with the timestamp of the granted privilege and, optionally, with the symbol "g" if the privilege has been granted with the grant option. Suppose that Tom revokes the authorization from Alice. As a result, the authorizations also held by Matt and Ann are recursively revoked, because they could not have been granted if Alice had not received the authorization from Tom at time 32. In contrast, the authorization held by Paul is not revoked, since it could have been granted even without the authorization granted by Tom to Alice at time 32, because of the privilege Alice had received from Helen at time 47.
Access Control Administration Policies. Figure 1. Recursive revocation.
The authorization state resulting from the revoke operation is illustrated in Fig. 1b. Although recursive revocation has the advantage of being the most conservative solution with regard to security, it has the drawback of unnecessarily revoking too many authorizations in some cases. For instance, in an organization, the authorizations a user possesses are usually related to his/her job functions within the organization, rather than to his/her identity. If a user changes his/her tasks (for instance, because of a promotion), it is desirable to remove only the authorizations of the user, without revoking all the authorizations granted by the user before changing his/her job function. For this reason, research has been carried out to devise alternative semantics for the revoke operation. Bertino et al. [5] have proposed an alternative type of revoke operation, called noncascading revocation. According to this semantics, no recursive revocation is performed upon the execution of a revoke operation. Whenever a subject revokes a privilege on an object from another subject, all the authorizations that the revokee may have granted using the privilege received from the revoker are not removed. Instead, they are restated as if they had been granted by the revoker. SQL:2003 [7] adopts the object owner administration policy with delegation. A revoke request can either be issued to revoke an authorization from a subject for a particular privilege on a given object, or to revoke the authorization to play a given role. SQL:2003 supports two different options for the revoke operation. If the revoke operation is requested with the Restrict clause, then the revocation is not allowed if it causes the revocation of other privileges and/or the deletion of some objects from the database schema. In contrast, if the Cascade option is specified, then the system implements a revoke operation similar to the recursive revocation of System R, but without taking into account authorization timestamps. Therefore, an authorization is recursively revoked only if, because of the requested revoke operation, the grantor no longer holds the grant/admin option for it. Otherwise, the authorization is not deleted, regardless of the time at which the grantor had received the grant/admin option for that authorization. To illustrate the differences with regard to recursive revocation, consider once again Fig. 1a, and suppose that Tom revokes privilege p on object o from Alice with the Cascade option. In contrast to the System R access control model, this revoke operation does not cause any other changes to the authorization state. The
authorization granted by Alice to Matt is not deleted, because Alice still holds the grant option for that access (received from Helen).
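The SQL statements below sketch the delegation and revocation behavior just described for the Ann/Tom/Alice example. They are illustrative only: the Employee relation and the clerk role are hypothetical names, each statement is assumed to be issued by the user named in the comment, and the exact effect of REVOKE depends on the particular DBMS.

-- Issued by Ann, the owner of Employee: delegate administration of SELECT to Tom.
GRANT SELECT ON Employee TO Tom WITH GRANT OPTION;

-- Issued by Tom, who holds the grant option: pass the privilege on to Alice.
GRANT SELECT ON Employee TO Alice;

-- Issued by Ann. With the Cascade option, Tom loses SELECT and Alice's
-- privilege is also removed, since Tom was its only grantor.
REVOKE SELECT ON Employee FROM Tom CASCADE;

-- With the Restrict option, the same request would instead be rejected,
-- because it would force the removal of a dependent privilege:
-- REVOKE SELECT ON Employee FROM Tom RESTRICT;

-- The analogous construct for roles uses the admin option.
GRANT clerk TO Tom WITH ADMIN OPTION;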
Key Applications Access control administration policies are fundamental in every environment where access control services are provided.
Cross-references
▶ Access Control ▶ Discretionary Access Control ▶ Role Based Access Control
Recommended Reading
1. Atluri V., Bertino E., Ferrari E., and Mazzoleni P. Supporting delegation in secure workflow management systems. In Proc. 17th IFIP WG 11.3 Conf. on Data and Application Security, 2003, pp. 190–202.
2. Bertino E., Bettini C., Ferrari E., and Samarati P. Decentralized administration for a temporal access control model. Inf. Syst., 22(4):223–248, 1997.
3. Bertino E. and Ferrari E. Administration policies in a multipolicy authorization system. In Proc. 11th IFIP WG 11.3 Conference on Database Security, 1997, pp. 341–355.
4. Bertino E., Jajodia S., and Samarati P. A flexible authorization mechanism for relational data management systems. ACM Trans. Inf. Syst., 17(2):101–140, 1999.
5. Bertino E., Samarati P., and Jajodia S. An extended authorization model. IEEE Trans. Knowl. Data Eng., 9(1):85–101, 1997.
6. Crampton J. and Loizou G. Administrative scope: a foundation for role-based administrative models. ACM Trans. Inf. Syst. Secur., 6(2):201–231, 2003.
7. Database Languages – SQL, ISO/IEC 9075-*, 2003.
8. Fernandez E.B., Gudes E., and Song H. A model for evaluation and administration of security in object-oriented databases. IEEE Trans. Knowl. Data Eng., 6(2):275–292, 1994.
9. Griffiths P.P. and Wade B.W. An authorization mechanism for a relational database system. ACM Trans. Database Syst., 1(3):242–255, 1976.
10. Oh S., Sandhu R.S., and Zhang X. An effective role administration model using organization structure. ACM Trans. Inf. Syst. Secur., 9(2):113–137, 2006.
11. Sandhu R.S., Bhamidipati V., and Munawer Q. The ARBAC97 model for role-based administration of roles. ACM Trans. Inf. Syst. Secur., 2(1):105–135, 1999.
12. Seitz L., Rissanen E., Sandholm T., Sadighi Firozabadi B., and Mulmo O. Policy administration control and delegation using XACML and Delegent. In Proc. 6th IEEE/ACM Int. Workshop on Grid Computing, 2005, pp. 49–54.
13. Zhang L., Ahn G., and Chu B. A rule-based framework for role-based delegation and revocation. ACM Trans. Inf. Syst. Secur., 6(3):404–441, 2003.
Access Control Policy Languages
Athena Vakali
Aristotle University, Thessaloniki, Greece
Synonyms Authorization policy languages
Definition An access control policy language is a particular set of grammar, syntax rules (logical and mathematical), and operators which provides an abstraction-layer for access control policy specifications. Such languages combine individual rules into a single policy set, which is the basis for (user/subject) authorization decisions on accessing content (object) stored in various information resources. The operators of an access control policy language are used on attributes of the subject, resource (object), and their underlying application framework to facilitate identifying the policy that (most appropriately) applies to a given action.
Historical Background The evolution of access control policy languages is in line with the evolution of large-scale, highly distributed information systems and the Internet, which has made the tasks of authorizing and controlling access in a global enterprise (or Internet-wide) framework increasingly challenging and difficult. The need to obtain a solid and accurate view of the policy in effect across many diverse systems and devices has guided the development of access control policy languages accordingly. Access control policy languages followed the Digital Rights Management (DRM) standardization efforts, which had focused on introducing DRM technology into commercial and mainstream products. Originally, access control was practiced in the most popular RDBMSs by policy languages that were SQL-based. Certainly, the evolution of access control policy languages was highly influenced by the wide adoption of XML (late 1990s), mainly in the enterprise world, and by its suitability for supporting access control acts. XML's popularity resulted in an increasing need to support more flexible provisional access decisions than the initial simplistic authorization acts, which were limited to an accept/deny decision. In this context, proposals of various access control policy languages were very
active starting around the year 2000. This trend seemed to stabilize around 2005. The historical pathway of such languages highlights the following popular and general-scope access control policy languages:
1998: the Digital Property Rights Language (DPRL, Digital Property Rights Language, http://xml.coverpages.org/dprl.html), mostly addressed to commercial and enterprise communities, was specified for describing rights, conditions, and fees to support commerce acts.
2000: the XML Access Control Language (XACL, XML Access Control Language, http://xml.coverpages.org/xacl.html) was the first XML-based access control language for the provisional authorization model.
2001: two languages were publicized:
▶ the eXtensible rights Markup Language (XrML, The Digital Rights Language for Trusted Content and Services, http://www.xrml.org/), promoted as the digital rights language for trusted content and services;
▶ the Open Digital Rights Language (ODRL, Open Digital Rights Language, http://odrl.net/), for developing and promoting an open standard for rights expressions for the transparent use of digital content in all sectors and communities.
2002: the eXtensible Media Commerce Language (XMCL, eXtensible Media Commerce Language, http://www.w3.org/TR/xmcl/) was introduced to communicate usage rules in an implementation-independent manner for interchange between business systems and DRM implementations.
2003: the eXtensible Access Control Markup Language (XACML, eXtensible Access Control Markup Language, http://www.oasis-open.org/committees/xacml/) was accepted as a new OASIS (Organization for the Advancement of Structured Information Standards, http://www.oasis-open.org/) Open Standard language, designed as an XML specification with emphasis on expressing policies for information access over the Internet.
2005: the latest version, XACML 2.0, appeared, and policy languages mostly suited for Web services appeared. These include WS-SecurityPolicy (http://www-128.ibm.com/developerworks/library/specification/ws-secpol/), which defines general security policy assertions to be applied within Web services security frameworks.
Foundations Since the Internet and networks in general are currently the core media for data and knowledge exchange, a primary issue is to assure authorized access to (protected) resources located in such infrastructures. To support access control policies and mechanisms, the use of an appropriate language is the core requirement in order to express all of the various components of access control policies, such as subjects, objects, constraints, etc. Initial attempts at expressing access control policies (consisting of authorizations) involved the primary "participants" in a policy, namely the subject (client requesting access), the object (protected resource), and the action (right or type of access). To understand access control policy languages, the context in which they are applied must be explained. Hence, the following notions, which appear under varying terminology, must be noted: Content/objects: Any physical or digital content, which may be of different formats, may be divided into subparts, and must be uniquely identified. Objects may also be encrypted to enable secure distribution of content.
Permissions/rights/actions: Any task that will enforce permissions for accessing, using, and acting over a particular content/object. They may contain constraints (limits), requirements (obligations), and conditions (such as exceptions, negotiations). Subjects/users/parties: Can be humans (end users), organizations, and defined roles which aim at consuming (accessing) content. Based on these three core entities, policies are formed under a particular language to express offers and agreements. Therefore, the initial authorization format of such languages was (subject, object, action), defining which subject can conduct what type of action over what object. However, with the advent of databases, networking, and distributed computing, users have witnessed (as presented in the section "Historical Background") a phenomenal increase in the automation of organizational tasks covering several physical locations, as well as the computerization of information-related services [6,7]. Therefore, new ideas have been added into modern access control models, such as time, tasks, origin, etc.
Access Control Policy Languages. Table 1. Summary of most popular access control policy languages

Language/technology | Subject types | Object types | Protection granularity | Accessing core formats | Focus
DPRL / XML DTDs | Registered users | Digital XML data sources, stored on repositories | Fine-grained | Digital licenses assigned for a time-limited period | –
XACL / XML syntax | Group or organization members | Particular XML documents | Fine-grained | Set of particular specified privileges | –
XrML / XML schema | Registered users and/or parties | Digital XML data sources | Fine-grained | Granted rights under specified conditions | –
ODRL / open-source schema-valid XML syntax | Any user | Trusted or untrusted content | Coarse-grained | Digital or physical rights | –
XMCL / XML namespaces | Registered users | Trusted multimedia content | Coarse-grained | Specified keyword-based business models | Particular licenses
XACML / XML schema | Any users | Domain-specific input organized in categories | Fine-grained | Rule-based permissions | –
WS-SecurityPolicy / XML, SOAP | Any Web users/Web services | Digital data sources | Fine-grained | Protection acts at SOAP messages level | Web services security
This was evident in the evolution of languages, which initially supported an original syntax for policies limited to a 3-tuple (subject, object, action): the subject primitive allows user IDs, groups, and/or role names; the object primitive allows granularity as fine as a single element within an XML document; and the action primitive consists of four kinds of actions: read, write, create, and delete. This syntax was then found quite simplistic and limited, and it was extended to include non-XML documents, to allow roles and collections as subjects, and to support more actions (such as approve, execute, etc.).
Table 1 summarizes the most important characteristics of the popular general-scope access control policy languages. It is evident that these languages differ in the types of subjects/users, in the type of protected object/content (which is considered trusted when it is addressed to a trusted audience/users), and in the capabilities of the access control acts, which are presented under various terms and formats (rights, permissions, privileges, etc.). Moreover, the table highlights the level at which access control may be in effect for each language, i.e., the broad categorization into fine- and coarse-grained protection granularity, which respectively refers to either partition-level/detailed or full document/object protection capability. The extensibility of languages which support Web-based objects and content is also noted.
To expand on the above, specific-scope languages have also emerged, mainly to support research-oriented applications and tools. The most representative of such languages include:
X-Sec [1]: Designed to support the specification of subject credentials and security policies in Author-X and Decentral Author-X [2]. X-Sec adopts the idea of credentials, which are similar to roles in that one user can be characterized by more than one credential.
Access Control Policy Languages. Table 2. Specific-scope access control languages characteristics

 | X-Sec | XACL | RBXAC | XAS syntax
Objects
Protected resources | XML documents and DTDs | XML documents and DTDs | XML documents | XML documents and DTDs
Identification | XPath | XPath | XPath | XPath
Protection granularity | Content, attribute | Element | Content, attribute | Element
Subjects
Identification | XML-expressed credentials | Roles, UIDs, groups | Roles | User ID, location
Grouping of subjects | No | Yes | No | Yes
Subjects hierarchy | No | Yes | Role trees | Yes
Support public subject | No | Yes | No | Yes
Policies
Expressed in | Policy base | XACL policy file | Access control files | XAS
Closed/open | Closed | Both | Closed | Closed
Permissions/denials | Both | Both | Permissions | Both
Access modes | Authoring, browsing | Read, write, create, delete | RI, WI, RC, WC | Read
Propagation | No-prop, first-level, cascade | No/up/down | According to role tree | Local, recursive
Priority | Implicit rules | Hard, soft | – | ntp, ptp, dtd
Conflict resolution | Yes | According to priorities and implicit rules | – | Implicitly, explicitly
Other issues
Subscription-based | Yes | Yes | Yes | Yes
Ownership | No | No | Yes | No
XAS Syntax: Designed to support the ACP (Access Control Processor) tool [3]. It is a simplified XML-based syntax for expressing authorizations.
RBXAC: A specification XML-based language supporting the role-based access control model [4].
XACL: Originally based on a provisional authorization model, it has been designed to support the ProvAuth (Provisional Authorizations) tool. Its main function is to specify security policies to be enforced upon accesses to XML documents.
Cred-XACL [5]: A recent access control policy language focusing on credentials support in distributed systems and on the Internet.
The core characteristics of these specific-scope languages are given in Table 2, which summarizes them with respect to their approach to objects and subjects management, their policy practices, and their subscription and ownership mechanisms. Such a summary is important in order to understand the "nature" of each such language in terms of objects and subjects identification, protection (sources) granularity and (subject) hierarchies, policies expression and accessing modes under prioritization, and conflict resolution constraints. Finally, it should be noted that these highlighted characteristics are important in implementing security service tasks which support several security requirements from both the system and the sources perspective.

Key Applications
Access control policy languages are involved in the transparent and innovative use of digital resources which are accessed in applications related to key areas such as publishing, distributing, and consuming electronic publications, digital images, audio and movies, learning objects, computer software, and other creations in digital form.

Future Directions
From the evolution of access control policy languages, it appears that, in the future, emphasis will be given to languages that are mostly suited for Web-accessed repositories, databases, and information sources. This trend is now apparent from the increasing interest in languages that control access to Web services and Web data sources, while at the same time managing the challenges posed by acknowledging and identifying users/subjects on the Web.

URL to Code
Code, examples, and application scenarios may be found for: ODRL application scenarios at http://www.w3.org/TR/odrl/#46354 and http://odrl.net/, XrML at http://www.xrml.org/, XMCL at http://www.w3.org/TR/xmcl/, XACML at http://www.oasis-open.org/committees/xacml/, and WS-SecurityPolicy at http://www-128.ibm.com/developerworks/library/specification/ws-secpol/.

Cross-references
▶ Access Control ▶ Database Security ▶ Role-Based Access Control ▶ Secure Database Development

Recommended Reading
1. Bertino E., Castano S., and Ferrari E. On specifying security policies for web documents with an XML-based language. In Proc. 6th ACM Symp. on Access Control Models and Technologies, 2001, pp. 57–65.
2. Bertino E., Castano S., and Ferrari E. Securing XML documents with Author-X. IEEE Internet Computing, May–June 2001, pp. 21–31.
3. Damiani E., De Capitani di Vimercati S., Paraboschi S., and Samarati P. Design and implementation of an access control processor for XML documents. In Proc. 9th Int. World Wide Web Conference, 2000, pp. 59–75.
4. He H. and Wong R.K. A role-based access control model for XML repositories. In Proc. 1st Int. Conf. on Web Information Systems Eng., 2000, pp. 138–145.
5. Stoupa K. Access Control Techniques in Distributed Systems and the Internet. Ph.D. Thesis, Aristotle University, Department of Informatics, 2007.
6. Stoupa K. and Vakali A. Policies for web security services, Chapter III. In Web and Information Security, E. Ferrari, B. Thuraisingham (eds.), Idea-Group Publishing, USA, 2006.
7. Vuong N.N., Smith G.S., and Deng Y. Managing security policies in a distributed environment using eXtensible markup language (XML). In Proc. 16th ACM Symp. on Applied Computing, 2001, pp. 405–411.
Access Methods ▶ Access Path
Access Path
Evaggelia Pitoura
University of Ioannina, Ioannina, Greece
Synonyms Access path; Access methods
Definition An access path specifies the path chosen by a database management system to retrieve the requested tuples from a relation. An access path may be either (i) a sequential scan of the data file or (ii) an index scan with a matching selection condition when there are indexes that match the selection conditions in the query. In general, an index matches a selection condition, if the index can be used to retrieve all tuples that satisfy the condition.
Key Points Access paths are the alternative ways for retrieving specific tuples from a relation. Typically, there is more than one way to retrieve tuples because of the availability of indexes and the potential presence of conditions specified in the query for selecting the tuples. Typical access methods include sequential access of unordered data files (heaps) as well as various kinds of indexes. All commercial database systems implement heaps and B+ tree indexes. Most of them also support hash indexes for equality conditions. To choose an access path, the optimizer first determines which matching access paths are available by examining the conditions specified by the query. Then, it estimates the selectivity of each access path using any available statistics for the index and data file. The selectivity of an access path is the number of pages (both index and data pages) accessed when the specific access path is used to retrieve the requested tuples. The access path having the smallest selectivity is called the most selective access path. Clearly, using the most selective access path minimizes the cost of data retrieval. Additional information can be found in [1].
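A small SQL sketch of this idea is given below; the Employee table, its index, and the query are hypothetical, and whether the optimizer actually picks the index scan depends on the available statistics and the estimated selectivities.

-- Hypothetical schema: a heap file plus one secondary index.
CREATE TABLE Employee (eid INT PRIMARY KEY, dept VARCHAR(20), salary INT);
CREATE INDEX emp_dept_idx ON Employee (dept);

-- Two access paths are available for the query below: a sequential scan of
-- the heap, or an index scan on emp_dept_idx (the condition dept = 'Sales'
-- matches that index); the salary predicate is then applied as a filter.
-- The optimizer picks the path with the smallest estimated number of index
-- and data pages accessed, i.e., the most selective access path.
SELECT eid
FROM   Employee
WHERE  dept = 'Sales' AND salary > 50000;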
Cross-references
▶ Index Structures for Biological Sequences ▶ Query Optimization ▶ Selectivity Estimation

Recommended Reading
1. Selinger P.G., Astrahan M.M., Chamberlin D.D., Lorie R.A., and Price T.G. Access path selection in a relational database management system. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1979, pp. 23–34.
Accountability ▶ Auditing and Forensic Analysis
ACID Properties
Gottfried Vossen
University of Münster, Münster, Germany
Synonyms ACID properties; Atomicity; Isolation; Consistency preservation; Durability; Persistence
Definition The conceptual ACID properties (short for atomicity, consistency preservation, isolation, and durability) of a transaction together provide the key abstraction which allows application developers to disregard irregular or even malicious effects from concurrency or failures of transaction executions, as the transactional server in charge guarantees the consistency of the underlying data and ultimately the correctness of the application [1–3]. For example, in a banking context where debit/credit transactions are executed, this means that no money is ever lost in electronic funds transfers and customers can rely on electronic receipts and balance statements. These cornerstones for building highly dependable information systems can be successfully applied outside the scope of online transaction processing and classical database applications as well.
Key Points The ACID properties are what a database server guarantees for transaction executions, in particular in the presence of multiple concurrently running transactions and in the face of failure situations; they comprise the following four properties (whose initial letters form the word "ACID"):
Atomicity: From the perspective of a client and an application program, a transaction is executed completely or not at all, i.e., in an all-or-nothing fashion. So the effects of a program under execution on the underlying data server(s) will only become visible to the outside world or to other program executions if and when the transaction reaches its "commit" operation. This case implies that the transaction could be processed completely, and no errors whatsoever were discovered while it was processed. On the other hand, if the program is abnormally terminated before reaching its commit operation, the data in the underlying data servers will be left in or automatically brought back to the state in which it was before the transaction started, i.e., the data appears as if the transaction had never been invoked at all.
Consistency preservation: Consistency constraints that are defined on the underlying data servers (e.g., keys, foreign keys) are preserved by a transaction; so a transaction leads from one consistent state to another. Upon the commit of a transaction, all integrity constraints defined for the underlying database(s) must be satisfied; however, between the beginning and the end of a transaction, inconsistent intermediate states are tolerated and may even be unavoidable. This property generally cannot be ensured in a completely automatic manner. Rather, it is necessary that the application is programmed such that the code between the beginning and the commit of a transaction will eventually reach a consistent state.
Isolation: A transaction is isolated from other transactions, i.e., each transaction behaves as if it were operating alone with all resources to itself. In particular, each transaction will "see" only consistent data in the underlying data sources. More specifically, it will see only data modifications that result from committed transactions, and it will see them only in their entirety, and never any effects of an incomplete transaction. This is the decisive property that allows the fallacies and pitfalls of concurrency to be hidden from the application developers. A sufficient condition for isolation is that concurrent executions are equivalent to sequential ones, so that all transactions appear as if they were executed one after the other rather than in an interleaved manner; this condition is made precise through serializability.
Durability: When the application program from which a transaction derives is notified that the transaction has been successfully completed (i.e., when
the commit point of the transaction has been reached), all updates the transaction has made in the underlying data servers are guaranteed to survive subsequent software or hardware failures. Thus, updates of committed transactions are durable (until another transaction later modifies the same data items) in that they persist even across failures of the affected data server(s).
Therefore, a transaction is a set of operations executed on one or more data servers which are issued by an application program and are guaranteed to have the ACID properties by the runtime system of the involved servers. The "ACID contract" between the application program and the data servers requires the program to demarcate the boundaries of the transaction as well as the desired outcome – successful or abnormal termination – of the transaction, both in a dynamic manner. There are two ways a transaction can finish: it can commit, or it can abort. If it commits, all its changes to the database are installed, and they will remain in the database until some other application makes further changes. Furthermore, the changes will seem to other programs to take place together. If the transaction aborts, none of its changes will take effect, and the DBMS will roll back by restoring previous values to all the data that was updated by the application program. A programming interface of a transactional system consequently needs to offer three types of calls: (i) "begin transaction" to specify the beginning of a transaction, (ii) "commit transaction" to specify the successful end of a transaction, and (iii) "rollback transaction" to specify the unsuccessful end of a transaction with the request to abort the transaction.
The core requirement for a transactional server is to provide the ACID guarantees for sets of operations that belong to the same transaction issued by an application program. This requires a concurrency control component to guarantee the isolation properties of transactions, for both committed and aborted transactions, and a recovery component to guarantee the atomicity and durability of transactions. The server may or may not provide explicit support for consistency preservation. In addition to the ACID contract, a transactional server should meet a number of technical requirements: a transactional data server (which most often will be a database system) must provide good performance with a given hardware/software configuration, or more generally, a good cost/performance
ratio when the configuration is not yet fixed. Performance typically refers to the two metrics of high throughput, which is defined as the number of successfully processed transactions per time unit, and of short response times, where the response time of a transaction is defined as the time span between issuing the transaction and its successful completion as perceived by the client. While the ACID properties are crucial for many applications in which the transaction concept arises, some of them are too restrictive when the transaction model is extended beyond the read/write context. For example, business processes can be cast into various forms of business transactions, i.e., long-running transactions for which atomicity and isolation are generally too strict. In these situations, additional or alternative guarantees need to be employed.
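The three demarcation calls can be sketched in SQL for the debit/credit example used in the definition above; the Accounts table and the transferred amount are illustrative assumptions, and the exact statement names (e.g., START TRANSACTION) vary slightly across systems.

-- Transfer 100 from account 1 to account 2 as one atomic unit of work.
START TRANSACTION;
UPDATE Accounts SET balance = balance - 100 WHERE acct_no = 1;  -- debit
UPDATE Accounts SET balance = balance + 100 WHERE acct_no = 2;  -- credit
COMMIT;       -- both updates become visible together and survive failures

-- If the application detects a problem before the commit point (e.g., an
-- insufficient balance), it requests abortion instead, and neither update
-- takes effect:
-- ROLLBACK;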
Cross-references
▶ Atomicity ▶ Concurrency Control ▶ Extended Transaction Models ▶ Multi-Level Recovery and the ARIES Algorithm ▶ Serializability ▶ Snapshot Isolation ▶ SQL Isolation Levels ▶ Transaction Model
Recommended Reading
1. Bernstein P.A., Hadzilacos V., and Goodman N. Concurrency Control and Recovery in Database Systems. Addison-Wesley, Reading, MA, 1987.
2. Bernstein P.A. and Newcomer E. Principles of Transaction Processing for the Systems Professional. Morgan Kaufmann, San Francisco, CA, 1997.
3. Gray J. and Reuter A. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA, 1993.
ACID Transaction ▶ Transaction
Acquisitional Query Languages ▶ Database Languages for Sensor Networks

Active and Real-Time Data Warehousing
Mukesh Mohania 1, Ullas Nambiar 1, Michael Schrefl 2, Millist Vincent 3
1 IBM India Research Lab, New Delhi, India
2 University of Linz, Linz, Austria
3 University of South Australia, Adelaide, SA, Australia
Synonyms Right-time data warehousing
Definition Active Data Warehousing is the technical ability to capture transactions when they change and integrate them into the warehouse, along with maintaining batch or scheduled cycle refreshes. An active data warehouse offers the possibility of automating routine tasks and decisions. The active data warehouse exports decisions automatically to the On-Line Transaction Processing (OLTP) systems. Real-time Data Warehousing describes a system that reflects the state of the warehouse in real time. If a query is run against the real-time data warehouse to understand a particular facet of the business or entity described by the warehouse, the answer reflects the state of that entity at the time the query was run. Most data warehouses have data that are highly latent, i.e., data that reflect the business at a point in the past. A real-time data warehouse has low-latency data and provides current (or real-time) data. Simply put, a real-time data warehouse can be built using an active data warehouse with a very low latency constraint added to it. An alternate view is to consider active data warehousing as a design methodology suited to tactical decision-making based on very current data, while real-time data warehousing is a collection of technologies that refresh a data warehouse frequently. A real-time data warehouse is one that acquires, cleanses, transforms, stores, and disseminates information in real time. An active data warehouse, on the other hand, operates in a non-real-time response mode with one or more OLTP systems.
Historical Background
A data warehouse is a decision support database that is periodically updated by extracting, transforming, and loading operational data from several OLTP databases.
In the data warehouse, OLTP data is arranged using the (multi)dimensional data modeling approach (see [1] for a basic approach and [2] for details on translating an OLTP data model into a dimensional model), which classifies data into measures and dimensions. In recent years, several multidimensional data models have been proposed [3–6]. An in-depth comparison is provided by Pedersen and Jensen in [5]. The basic unit of interest in a data warehouse is a measure or fact (e.g., sales), which represents countable, semi-summable, or summable information concerning a business process. An instance of a measure is called a measure value. A measure can be analyzed from different perspectives, which are called the dimensions (e.g., location, product, time) of the data warehouse [7]. A dimension consists of a set of dimension levels (e.g., time: Day, Week, Month, Quarter, Season, Year, ALLTimes), which are organized in multiple hierarchies or dimension paths [6] (e.g., Time[Day] → Time[Month] → Time[Quarter] → Time[Year] → Time[ALLTimes]; Time[Day] → Time[Week] → Time[Season] → Time[ALLTimes]). The hierarchies of a dimension form a lattice having at least one top dimension level and one bottom dimension level. The measures that can be analyzed by the same set of dimensions are described by a base cube or fact table. A base cube uses level instances of the lowest dimension levels of each of its dimensions to identify a measure value. The relationship between a set of measure values and the set of identifying level instances is called a cell. Loading data into the data warehouse means that new cells will be added to base cubes and new level instances will be added to dimension levels. If a dimension D is related to a measure m by means of a base cube, then the hierarchies of D can be used to aggregate the measure values of m using operators like SUM, COUNT, or AVG. Aggregating measure values along the hierarchies of different dimensions (i.e., rollup) creates a multidimensional view on the data, which is known as a data cube or cube. Deaggregating the measures of a cube to a lower dimension level (i.e., drilldown) creates a more detailed cube. Selecting the subset of a cube's cells that satisfy a certain selection condition (i.e., slicing) also creates a more detailed cube. Data warehouses are used by analysts to find solutions for decision tasks by using OLAP (On-Line Analytical Processing) [7] systems. The decision tasks can be split into three categories, viz. non-routine, semi-routine, and routine. Non-routine tasks occur infrequently and/or do not have generally accepted decision
criteria. For example, strategic business decisions such as introducing a new brand or changing an existing business policy are non-routine tasks. Routine tasks, on the other hand, are well-structured problems for which generally accepted procedures exist, and they occur frequently and at predictable intervals. Examples can be found in the areas of product assortment (change price, withdraw product, etc.), customer relationship management (grant loyalty discounts, etc.), and in many administrative areas (accept/reject a paper based on review scores). Semi-routine tasks are tasks that require a non-routine solution – e.g., a paper with contradictory ratings must be discussed by the program committee. Since most tasks are likely to be routine, it is logical to automate the processing of such tasks to reduce the delay in decision-making. Active data warehouses [8] were designed to enable data warehouses to support automatic decision-making when faced with routine decision tasks and routinizable elements of semi-routine decision tasks. The active data warehouse design extends the technology behind active database systems. Active database technology transforms passive database systems into reactive systems that respond to database and external events through the use of rule processing features [9,10]. Limited versions of active rules exist in commercial database products [11,12]. Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly. In other words, real-time data warehousing is a framework for deriving information from data as the data becomes available. Traditionally, data warehouses were regarded as an environment for analyzing historic data, either to understand what has happened or simply to log the changes as they happened. However, of late, businesses want to use them to predict the future, e.g., to predict customers likely to churn, and thereby seek better control of the business. However, until recently, it was not practical to have zero-latency data warehouses – the process of extracting data had too much of an impact on the source systems concerned, and the various steps needed to cleanse and transform the data required multiple temporary tables and took several hours to run. However, the increased visibility of (the value of) warehouse data, and the take-up by a wider audience within the organization, has led to a number
of product developments by IBM [13], Oracle [14], and other vendors that make real-time data warehousing now possible.
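As a concrete illustration of aggregating a measure along one path of a dimension hierarchy, the following query sketch rolls sales quantities up from days to months, quarters, and years; the star-schema tables (Sales, TimeDim) and their columns are assumptions for the example, not part of the entry.

-- Hypothetical star schema: fact table Sales(product_id, day_id, qty_sold)
-- and time dimension TimeDim(day_id, month, quarter, year).
SELECT   t.year, t.quarter, t.month, SUM(s.qty_sold) AS total_qty
FROM     Sales s JOIN TimeDim t ON s.day_id = t.day_id
GROUP BY ROLLUP (t.year, t.quarter, t.month);
-- ROLLUP produces aggregates at the month, quarter, year, and grand-total
-- levels, i.e., it materializes one path (Day -> Month -> Quarter -> Year)
-- of the time hierarchy of the cube.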
Foundations The two example scenarios below describe typical situations in which active rules can be used to automate decision-making. Scenario 1: Reducing the price of an article. Twenty days after a soft drink has been launched on a market, analysts compare the quantities sold during this period with a standardized indicator. This indicator requires that the total quantity sold during the 20-day period does not drop below a threshold of 10,000 sold items. If the analyzed sales figures are below this threshold, the price of the newly launched soft drink will be reduced by 15. Scenario 2: Withdrawing articles from a market. At the end of every quarter, high-priced soft drinks which are sold in Upper Austrian stores will be analyzed. If the sales figures of a high-priced soft drink have continuously dropped, the article will be withdrawn from the Upper Austrian market. Analysts inspect sales figures at different granularities of the time dimension and at different granularities of the location dimension. Trend, average, and variance measures are used as indicators in decision-making. Rules that mimic the analytical work of a business analyst are called analysis rules [8]. The components of analysis rules constitute the knowledge model of an active data warehouse (and also of a real-time data warehouse). The knowledge model determines what an analyst must consider when he or she specifies an active rule to automate a routine decision task. An analysis rule consists of (i) the primary dimension level and (ii) the primary condition, which identify the objects for which decision-making is necessary, (iii) the event, which triggers rule processing, (iv) the analysis graph, which specifies the cubes for analysis, (v) the decision steps, which represent the conditions under which a decision can be made, and (vi) the action, which represents the rule's decision task. Below is a brief description of the components of an analysis rule; a detailed discussion is given in [8]. Event: Events are used to specify the timepoints at which analysis rules should be carried out. Active data warehouses provide three kinds of events: (i) OLTP method events, (ii) relative temporal events, and (iii) calendar events. OLTP method events describe basic happenings in the data warehouse's sources. Relative
temporal events are used to define a temporal distance between such a basic happening and carrying out an analysis rule. Calendar events represent fixed points in time at which an analysis rule may be carried out. Structurally, every event instance is characterized by an occurrence time and by an event identifier. In its event part, an analysis rule refers to a calendar event or to a relative temporal event. An OLTP method event describes a happening in the data warehouse's source systems that is of interest to analysis rules in the active data warehouse. Besides occurrence time and event identifier, the attributes of an OLTP method event are a reference to the dimension level for which the OLTP method event occurred and the parameters of the method invocation. To make OLTP method events available in data warehouses, a data warehouse designer has to define the schema of OLTP method events and extend the data warehouse's extract/transform/load mechanism. Since instances of OLTP method events are loaded some time after their occurrence, analysis rules cannot be triggered directly by OLTP method events. Temporal events determine the timepoints at which decision-making has to be initiated. Scenario 1 uses the relative temporal event "twenty days after launch" while Scenario 2 uses the periodic temporal event "end of quarter." The conditions for decision-making are based on indicators, which have been established in manual decision-making. Each condition refers to a multidimensional cube, and therefore "analyzing" means to evaluate the condition on this cube. Scenario 1 uses a quantity-based indicator, whereas Scenario 2 uses value-based indicators for decision-making. The decision whether to carry out the rule's action depends on the result of evaluating the conditions. The action of Scenario 1 is to reduce the price of an article, whereas the action of Scenario 2 is to withdraw an article from a market. Primary Condition: Several analysis rules may share the same OLTP method as their action. These rules may be carried out at different timepoints and may utilize different multidimensional analyses. Thus, a certain analysis rule usually analyzes only a subset of the level instances that belong to the rule's primary dimension level. The primary condition is used to determine for a level instance of the primary dimension level whether multidimensional analysis should be carried out by the analysis rule. The primary condition is specified as a Boolean expression, which refers to the
describing attributes of the primary dimension level. If omitted, the primary condition evaluates to TRUE. Action: The purpose of an analysis rule is to automate decision-making for objects that are available in OLTP systems and in the data warehouse. A decision means to invoke (or not to invoke) a method on a certain object in an OLTP system. In its action part, an analysis rule may refer to a single OLTP method of the primary dimension level, which represents a transaction in an OLTP system. These methods represent the decision space of an active data warehouse. To make the transactional behavior of an OLTP object type available in the active data warehouse, the data warehouse designer must provide (i) the specifications of the OLTP object type's methods together with the required parameters, (ii) the preconditions that must be satisfied before the OLTP method can be invoked in the OLTP system, and (iii) a conflict resolution mechanism, which resolves contradictory decisions of different analysis rules. Since different analysis rules can make a decision for the same level instance of the rules' primary dimension level during the same active data warehouse cycle, a decision conflict may occur. Such conflicts are considered interrule conflicts. To detect interrule conflicts, a conflict table covering the OLTP methods of the decision space is used. The tuples of the conflict table have the form <m1, m2, m3>, where m1 and m2 identify two conflicting methods and m3 specifies the conflict resolution method that will finally be executed in the OLTP systems. If a conflict cannot be resolved automatically, it has to be reported to analysts for manual conflict resolution. Analysis Graph: When an analyst queries the data warehouse to make a decision, he or she follows an incremental top-down approach in creating and analyzing cubes. Analysis rules follow the same approach. To automate decision-making, an analysis rule must "know" the cubes that are needed for multidimensional analysis. These cubes constitute the analysis graph, which is specified once by the analyst. The n dimensions of each cube of the analysis graph are classified into one primary dimension, which represents the level instances of the primary dimension level, and n − 1 analysis dimensions, which represent the multidimensional space for analysis. Since a level instance of the primary dimension level is described by one or more cells of a cube, multidimensional analysis means to compare, aggregate, transform, etc., the measure values of these cells. Two kinds of multidimensional analysis are carried out at each cube of the analysis graph: (i) select the level instances of the primary dimension level whose cells comply with the decision-making condition (e.g., withdraw an article if the sales total of the last quarter is below USD 10,000), and (ii) select the level instances of the primary dimension level whose cells comply with the condition under which more detailed analyses (at finer grained cubes) are necessary (e.g., continue the analysis if the sales total of the last quarter is below USD 500,000).
are carried out at each cube of the analysis graph: (i) select the level instances of the primary dimension level whose cells comply with the decision-making condition (e.g., withdraw an article if the sales total of the last quarter is below USD 10,000) and (ii) select the level instances of the primary dimension level whose cells comply with the condition under which more detailed analysis (at finer grained cubes) are necessary (e.g., continue analysis if the sales total of the last quarter is below USD 500,000). The multidimensional analysis that is carried out on the cubes of the analysis graph are called decision steps. Each decision step analyzes the data of exactly one cube of the analysis graph. Hence, analysis graph and decision steps represent the knowledge for multidimensional analysis and decision-making of an analysis rule. Enabling real-time data warehousing: As mentioned earlier, real-time data warehouses are active data warehouses that are loaded with data having (near) zero latency. Data warehouse vendors have used multiple approaches such as hand-coded scripting and data extraction, transformation, and loading (ETL) [15] solutions to serve the data acquisition needs of a data warehouse. However, as users move toward real-time data warehousing, there is a limited choice of technologies that facilitate real-time data delivery. The challenge is to determine the right technology approach or combination of solutions that best meets the data delivery needs. Selection criteria should include considerations for frequency of data, acceptable latency, data volumes, data integrity, transformation requirements and processing overhead. To solve the real-time challenge, businesses are turning to technologies such as enterprise application integration (EAI) [16] and transactional data management (TDM) [17], which offer high-performance, low impact movement of data, even at large volumes with sub-second speed. EAI has a greater implementation complexity and cost of maintenance, and handles smaller volumes of data. TDM provides the ability to capture transactions from OLTP systems, apply mapping, filtering, and basic transformations and delivers to the data warehouse directly. A more detailed study of the challenges involved in implementing a real-time data warehouse is given in [18].
Key Applications Active and Real-time data warehouses enable businesses across all industry verticals to gain competitive advantage by allowing them to run analytics solutions over the
most recent data of interest that is captured in the warehouse. This will provide them with the ability to make intelligent business decisions and better understand and predict customer and business trends based on accurate, up-to-the-second data. By introducing real-time flows of information to data warehouses, companies can increase supply chain visibility, gain a complete view of business performance, and increase service levels, ultimately increasing customer retention and brand value. The following are some additional business benefits of active and real-time data warehousing: Real-time Analytics: Real-time analytics is the ability to use all available data to improve performance and quality of service at the moment they are required. It consists of dynamic analysis and reporting, right at the moment (or very soon after) the resource (or information) entered the system. In a practical sense, real time is defined by the need of the consumer (business) and can vary from a few seconds to few minutes. In other words, more frequent than daily can be considered real-time, because it crosses the overnight-update barrier. With increasing availability of active and real-time data warehouses, the technology for capturing and analyzing real-time data is increasingly becoming available. Learning how to apply it effectively becomes the differentiator. Implementing real-time analytics requires the integration of a number of technologies that are not interoperable off-theshelf. There are no established best practices. Early detection of fraudulent activity in financial transactions is a potential environment for applying realtime analytics. For example, credit card companies monitor transactions and activate counter measures when a customer’s credit transactions fall outside the range of expected patterns. However, being able to correctly identify fraud while not offending a wellintentioned valuable customer is a critical necessity that adds complexity to the potential solution. Maximize ERP Investments: With a real-time data warehouse in place, companies can maximize their Enterprise Resource Planning (ERP) technology investment by turning integrated data into business intelligence. ETL solutions act as an integral bridge between ERP systems that collect high volumes of transactions and business analytics to create data reports.
Increase Supply Chain Visibility: Real-time data warehousing helps streamline supply chains through highly effective business-to-business communications and identifies any weak links or bottlenecks, enabling companies to enhance service levels and gain a competitive edge. Live 360° View of Customers: Active database solutions enable companies to capture, transform, and flow all types of customer data into a data warehouse, creating one seamless database that provides a 360° view of the customer. By tracking and analyzing all modes of interaction with a customer, companies can tailor new product offerings, enhance service levels, and ensure customer loyalty and retention.
Future Directions Data warehousing has greatly matured as a technology discipline; however, enterprises that undertake data warehousing initiatives continue to face fresh challenges that evolve with the changing business and technology environment. Most future needs and challenges will come in the areas of active and real-time data warehousing solutions. Listed below are some future challenges: Integrating Heterogeneous Data Sources: The number of enterprise data sources is growing rapidly, with new types of sources emerging every year. Enterprises want to integrate the unstructured data generated from customer emails, chat and voice call transcripts, feedback, and surveys with other internal data in order to get a complete picture of their customers and integrate internal processes. Other sources of valuable data include ERP programs, operational data stores, packaged and homegrown analytic applications, and existing data marts. The process of integrating these sources into a data warehouse can be complicated and is made even more difficult when an enterprise merges with or acquires another enterprise. Integrating with CRM tools: Customer relationship management (CRM) is one of the most popular business initiatives in enterprises today. CRM helps enterprises attract new customers and develop loyalty among existing customers, with the end result of increasing sales and improving profitability. Increasingly, enterprises want to use the holistic view of the customer to deliver value-added services to
the customer based on her overall value to the enterprise. This would include automatically identifying when an important life event is happening and sending out emails with the necessary information and/or relevant products; gauging the mood of the customer based on recent interactions and alerting the enterprise before it is too late to retain the customer; and, most important of all, identifying customers who are likely to accept suggestions about upgrades of existing products/services or be interested in newer versions. The data warehouse is essential in this integration process, as it collects data from all channels and customer touch points, and presents a unified view of the customer to sales, marketing, and customer-care employees. Going forward, data warehouses will have to provide support for analytics tools that are embedded into the warehouse, analyze the various customer interactions continuously, and then use the insights to trigger actions that enable delivery of the abovementioned value-added services. Clearly, this requires an active data warehouse that is tightly integrated with the CRM systems. If the enterprise requires low latency for insight detection and value-added service delivery, then a real-time data warehouse is needed. In-built data mining and analytics tools: Users are also demanding more sophisticated business intelligence tools. For example, if a telecom customer calls to cancel his call-waiting feature, real-time analytic software can detect this and trigger a special offer of a lower price in order to retain the customer. The need is to develop a new generation of data mining algorithms that work over data warehouses that integrate heterogeneous data and that have self-learning features. These new algorithms must automate data mining and make it more accessible to mainstream data warehouse users by providing explanations with results, indicating when results are not reliable, and automatically adapting to changes in the underlying predictive models.
Cross-references
▶ Cube Implementations ▶ Data Warehouse Interoperability ▶ Data Warehousing Systems: Foundations and Architectures ▶ ETL
▶ Multidimensional Modeling ▶ On-Line Analytical Processing ▶ Query Processing in Data Warehouses ▶ Transformation
Recommended Reading
1. Kimball R. and Strehlo K. Why decision support fails and how to fix it. ACM SIGMOD Rec., 24(3):91–97, 1995.
2. Golfarelli M., Maio D., and Rizzi S. Conceptual design of data warehouses from E/R schemes. In Proc. 31st Annual Hawaii Int. Conf. on System Sciences, Vol. VII, 1998, pp. 334–343.
3. Lehner W. Modeling large scale OLAP scenarios. In Advances in Database Technology, Proc. 6th Int. Conf. on Extending Database Technology, 1998, pp. 153–167.
4. Li C. and Wang X.S. A data model for supporting on-line analytical processing. In Proc. Int. Conf. on Information and Knowledge Management, 1996, pp. 81–88.
5. Pedersen T.B. and Jensen C.S. Multidimensional data modeling for complex data. In Proc. 15th Int. Conf. on Data Engineering, 1999, pp. 336–345.
6. Vassiliadis P. Modeling multidimensional databases, cubes and cube operations. In Proc. 10th Int. Conf. on Scientific and Statistical Database Management, 1998, pp. 53–62.
7. Chaudhuri S. and Dayal U. An overview of data warehousing and OLAP technology. ACM SIGMOD Rec., 26(1):65–74, 1997.
8. Thalhammer T., Schrefl M., and Mohania M. Active data warehouses: complementing OLAP with analysis rules. Data Knowl. Eng., 39(3):241–269, 2001.
9. ACT-NET Consortium. The active database management system manifesto: a rulebase of ADBMS features. ACM SIGMOD Rec., 25(3), 1996.
10. Simon E. and Dittrich A. Promises and realities of active database systems. In Proc. 21st Int. Conf. on Very Large Data Bases, 1995, pp. 642–653.
11. Brobst S. Active data warehousing: a new breed of decision support. In Proc. 13th Int. Workshop on Data and Expert System Applications, 2002, pp. 769–772.
12. Brobst S. and Rarey J. The five stages of an active data warehouse evolution. Teradata Mag., 38–44, 2001.
13. IBM DB2 Data Warehouse Edition. http://www-306.ibm.com/software/data/db2/dwe/.
14. Rittman M. Implementing Real-Time Data Warehousing Using Oracle 10g. Dbazine.com. http://www.dbazine.com/datawarehouse/dw-articles/rittman5.
15. Kimball R. and Caserta J. The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data. Wiley, 2004.
16. Linthicum D.S. Enterprise Application Integration. Addison-Wesley, 1999.
17. Improving SOA with GoldenGate TDM Technology. GoldenGate White Paper, October 2007.
18. Langseth J. Real-Time Data Warehousing: Challenges and Solutions. DSSResources.COM, 2004.
19. Paton N.W. and Diaz O. Active Database Systems. ACM Comput. Surv., 31(1):63–103, 1999.
Active Database, Active Database (Management) System
Mikael Berndtsson, Jonas Mellin, University of Skövde, Skövde, Sweden
Definition An active database (aDB) or active database (management) system (aDBS/aDBMS) is a database (management) system that supports reactive behavior through ECA-rules.
Historical Background The term active database was first used in the early 1980s [12]. Some related active database work was also done within the area of expert database systems in the mid-1980s, but it was not until the mid/late 1980s that research on supporting ECA rules in database systems took off, for example [10,12,18]. During the 1990s the area was extensively explored through more than twenty suggested active database prototypes and a large body of publications:
Seven workshops were held between 1993 and 1997: RIDS [12,16,17], RIDE-ADS [20], the Dagstuhl Seminar [5], and ARTDB [3,4].
Two special issues of journals [8,9] and one special issue of ACM SIGMOD Record [1] were published.
Two textbooks [13,19] and one ACM Computing Surveys paper [15] appeared.
In addition, the groups within the ACT-NET consortium (a European research network of excellence on active databases, 1993–1996) reached a consensus on what constitutes an active database management system with the publication of the Active Database System Manifesto [2]. Most of the active databases are monolithic and assume a centralized environment; consequently, the majority of the prototype implementations do not consider distributed issues. Initial work on how active databases are affected by distributed issues is reported in [7].
Foundations An active database can automatically react to events such as database transitions, time events, and external signals in a timely and efficient manner. This is in contrast to traditional database systems, which are passive in their behavior: they only execute queries and transactions when they are explicitly requested to do so.
Previous approaches to support reactive behavior can broadly be classified into:
Periodically polling the database.
Embedding or encoding event detection and related action execution in the application code.
The first approach implies that the queries must be run exactly when the event occurs. The frequency of polling can be increased in order to detect such an event, but if the polling is too frequent, then the database is overloaded with queries and will most often fail. On the other hand, if the frequency is too low, the event will be missed. The second approach implies that every application that updates the database needs to be augmented with condition checks in order to detect events. For example, an application may be extended with code to detect whether the quantity of certain items has fallen below a given level. From a software engineering point of view, this approach is inappropriate, since a change in a condition specification implies that every application that uses the modified condition needs to be updated. Neither of the two previous approaches can satisfactorily support reactive behavior in a database context [10]. An active database system avoids these disadvantages by moving the support for reactive behavior inside the database (management) system. Reactive behavior in an active database is supported by ECA rules that have the following semantics: when an event is detected, evaluate a condition, and if the condition is true, execute an action. Similar to describing an object by its static features and dynamic features, an active database can be described by its knowledge model (static features) and execution model (dynamic features). Thus, by investigating the knowledge model and execution model of an active database, one can identify what types of ECA rules can be defined and how the active database behaves at run-time.
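To make these event-condition-action semantics concrete, the following minimal sketch shows an ECA-style rule engine. The class names and the inventory example are illustrative and not taken from any particular system.

```python
# A minimal, illustrative ECA rule engine (not tied to any particular system).
# Rules subscribe to event types; when an event is detected the condition is
# evaluated and, if true, the action is executed (immediate coupling).
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ECARule:
    event_type: str
    condition: Callable[[Any], bool]
    action: Callable[[Any], None]

class ActiveEngine:
    def __init__(self):
        self.rules = []

    def register(self, rule: ECARule):
        self.rules.append(rule)

    def signal(self, event_type: str, payload: Any):
        for rule in self.rules:
            if rule.event_type == event_type and rule.condition(payload):
                rule.action(payload)

# Hypothetical inventory example: alert when the quantity of an item falls
# below a threshold after an update.
engine = ActiveEngine()
engine.register(ECARule(
    event_type="item_updated",
    condition=lambda item: item["quantity"] < 10,
    action=lambda item: print(f"reorder {item['name']} (quantity={item['quantity']})"),
))
engine.signal("item_updated", {"name": "widget", "quantity": 4})
```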
Key Applications An aDB or aDBS/aDBMS is useful for any non-mission-critical application that requires reactive behavior.
Future Directions Looking back, the RIDS’97 workshop marks the end of the active database period, since there are very few
active database publications after 1997. However, the concept of ECA rules has resurfaced and has been picked up by other research communities such as Complex Event Processing and the Semantic Web. In contrast to typical active database approaches, which assume a centralized environment, current research on ECA rules within Complex Event Processing and the Semantic Web assumes that the environment is distributed and heterogeneous. Thus, as suggested within the REWERSE project [3], one cannot assume that the event, condition, and action parts of an ECA rule are defined in one single ECA rule language. For example, the event part of an ECA rule can be defined in one language (e.g., Snoop), whereas the condition part and action part are defined in a completely different rule language. The popularity of using XML for manipulating data has also led to proposals of ECA rule markup languages. These ECA rule markup languages are used for storing information about ECA rules and facilitate the exchange of ECA rules between different rule engines and applications. One research question that remains from the active database period is how to model and develop applications that use ECA rules. Some research on modeling ECA rules has been carried out, but there is no widely agreed approach for modeling ECA rules explicitly in UML, or for deriving ECA rules from existing UML diagrams.
Cross-references
▶ Active Database Execution Model ▶ Active Database Knowledge Model ▶ Complex Event Processing ▶ ECA Rules
Recommended Reading
1. ACM SIGMOD Record. Special Issue on Rule Management and Processing in Expert Databases, 1989.
2. ACT-NET Consortium. The active database management system manifesto: a rulebase of ADBMS features. ACM SIGMOD Rec., 25(3):40–49, 1996.
3. Alferes J.J., Amador R., and May W. A general language for evolution and reactivity in the semantic Web. In Proc. 3rd Workshop on Principles and Practice of Semantic Web Reasoning, 2005, pp. 101–115.
4. Andler S.F. and Hansson J. (eds.). In Proc. 2nd International Workshop on Active, Real-Time, and Temporal Database Systems, LNCS, vol. 1553, Springer, 1998.
5. Berndtsson M. and Hansson J. Workshop report: the first international workshop on active and real-time database systems. SIGMOD Rec., 25(1):64–66, 1996.
6. Buchmann A., Chakravarthy S., and Dittrich K. Active Databases. Dagstuhl Seminar No. 9412, Report No. 86, 1994.
7. Bültzingsloewen G., Koschel A., Lockemann P.C., and Walter H.D. ECA Functionality in a Distributed Environment. Monographs in Computer Science, chap. 8, Springer, 1999, pp. 147–175.
8. Chakravarthy S. (ed.). Special Issue on Active Databases, vol. 15, IEEE Quarterly Bulletin on Data Engineering, 1992.
9. Chakravarthy S. and Widom J. (eds.). Special Issue on the Active Database Systems, Journal of Intelligent Information Systems 7, 1996.
10. Dayal U., Blaustein B., Buchmann A., et al. HiPAC: A Research Project in Active, Time-Constrained Database Management. Tech. Rep. CCA-88-02, Xerox Advanced Information Technology, Cambridge, 1988.
11. Dittrich K.R., Kotz A.M., and Mulle J.A. An Event/Trigger Mechanism to Enforce Complex Consistency Constraints in Design Databases. ACM SIGMOD Rec., 15(3):22–36, 1986.
12. Geppert A. and Berndtsson M. (eds.). Proc. 3rd International Workshop on Rules in Database Systems, LNCS, vol. 1312, Springer, 1997.
13. Morgenstern M. Active Databases as a Paradigm for Enhanced Computing Environments. In Proc. 9th Int. Conf. on Very Large Data Bases, 1983, pp. 34–42.
14. Paton N.W. (ed.). Active Rules in Database Systems. Monographs in Computer Science, Springer, 1999.
15. Paton N.W. and Diaz O. Active Database Systems. ACM Comput. Surv., 31(1):63–103, 1999.
16. Paton N.W. and Williams M.W. (eds.). In Proc. 1st International Workshop on Rules in Database Systems, Springer, Berlin, 1994.
17. Sellis T. (ed.). In Proc. 2nd International Workshop on Rules in Database Systems, vol. 905, Springer, 1995.
18. Stonebraker M., Hearst M., and Potamianos S. Commentary on the POSTGRES Rules System. SIGMOD Rec., 18(3):5–11, 1989.
19. Widom J. and Ceri S. (eds.). Active Database Systems: Triggers and Rules For Advanced Database Processing. Morgan Kaufmann, 1996.
20. Widom J. and Chakravarthy S. (eds.). In Proc. 4th International Workshop on Research Issues in Data Engineering – Active Database Systems, 1994.
Active Database Management System Architecture
Jonas Mellin, Mikael Berndtsson, University of Skövde, Skövde, Sweden
Synonyms ADBMS infrastructure; ADBMS framework; ADBMS
Definition The active database management system (ADBMS) architecture is the software organization of a DBMS
with active capabilities. That is, the architecture defines support for active capabilities expressed in terms of services, significant components providing the services as well as critical interaction among these services.
Historical Background Several architectures have been proposed: HiPAC [5,8], REACH [4], ODE [14], SAMOS [10], SMILE [15], and DeeDS [1]. Each of these architectures emphasizes particular issues concerning the actual DBMS that it is based on as well as the type of support for active capabilities. Paton and Diaz [18] provide an excellent survey on this topic. Essentially, these architectures propose that the active capabilities of an ADBMS require the services specified in Table 1. It is assumed that queries to the database are encompassed in transactions; hence transactions imply queries as well as database manipulation operations such as insertions, updates, and deletions of tuples. The services in Table 1 interact as depicted in Fig. 1. Briefly, transactions are submitted to the scheduling service, which updates the dispatch table read by the transaction processing service. When these transactions are processed by the transaction processing service, events are generated. These events are signaled to the event monitoring service, which analyzes them. Events that are associated with rules (subscribed events) are signaled to the rule evaluation service, which evaluates the conditions of triggered rules (i.e., rules associated with signaled events). The actions of the rules whose conditions are true are submitted for scheduling and are executed as dictated by the scheduling policy. These actions execute as part of some transaction according to the coupling mode and can, in turn, generate events. This is a general description of the service interaction and it can be optimized by refining it for a specific purpose; for example, in
immediate coupling mode no queues between the services are actually needed. In more detail, transactions are submitted to the scheduling service via a queue of schedulable activities; this queue of schedulable activities is processed and a dispatch table of schedulable activities is updated. This scheduling service encompasses scheduling of transactions as well as ECA rule actions in addition to other necessary schedulable activities. It is desirable for the scheduling service to encompass all these types of schedulable activities, because they impact each other, since they compete for the same resources. The next step in the processing chain is the monitored transaction processing service, which includes the transaction management, lock management, and log management [11, Chap. 5], as well as a database query engine (cf. query processor [20, Chap. 1]), but not the scheduling service. Another way to view the transaction processing service is as a passive database management system without the transaction scheduling service. The transaction processing service is denoted ‘‘monitored,’’ since it generates events that are handled by the active capabilities. The monitored transaction processing service executes transactions and ECA rule actions according to the dispatch table. When transactions execute, event occurrences are signaled to the event monitoring service via a filtered event log. When event monitoring executes, it updates the filtered event log and submits subscribed events to the rule evaluation service. An example of event log filtering is that if a composite event occurrence is detected, then for optimization reasons (cf. dynamic programming) this event occurrence is stored in the filtered event log. Another example is that when events are no longer needed, then they are pruned; for example, when a transaction is aborted, then all event occurrences can be pruned unless intertransaction events are allowed (implying that dirty reads may occur). The rule evaluation service reads the
Active Database Management System Architecture. Table 1. Services in active database management systems
Event monitoring: The event monitoring service is responsible for collecting events, analyzing events, and disseminating the results of the analysis (in terms of events) to subscribers, in particular ECA rules.
Rule evaluation: The rule evaluation service is responsible for invoking condition evaluation of triggered ECA rules and submitting actions for execution to the scheduler.
Scheduling: The scheduling service is responsible for readying and ordering schedulable activities such as ECA rule actions and transactions for execution.
Active Database Management System Architecture. Figure 1. Service interaction view of architecture (based on architecture by Paton and Diaz [18]).
queue of subscribed events, finds the triggered rules, and evaluates their conditions. These conditions may be queries, logical expressions, or arbitrary code depending on the active database system [9]. The rule evaluation results in a set of actions that is submitted to the scheduling service for execution. The general view of active capabilities (in Fig. 1) can be refined and implemented in different ways. As mentioned, it is possible to optimize an implementation by removing the queues between the services if only immediate coupling mode is considered; this results in less overhead, but restricts the expressibility of ECA rules significantly. A service can be implemented via one or more servers. These servers can be replicated to different physical nodes for performance or dependability reasons (e.g., availability, reliability). In active databases, a set of issues has a major impact on the refinement and implementation of the general service-oriented view depicted in Fig. 1. These
issues are: (i) coupling modes; (ii) interaction with typical database management services such as transaction management, lock management, and recovery management (both pre-crash, such as logging and checkpointing, and post-crash, such as the actual recovery) (cf., for example, transaction processing by Gray and Reuter [11, Chap. 4]); (iii) when and how to invoke services; and (iv) active capabilities in distributed active databases. The coupling modes control how rule evaluation is invoked in response to events and how the ECA rule actions are submitted, scheduled, dispatched, and executed for rules whose conditions are true (see the entry "Coupling modes" for a more detailed description). There are different alternatives for interaction with a database system. One alternative is to place active database services on top of an existing database management system. However, this is problematic if the database management system is not extended with active
capabilities [4]. For example, the deferred coupling mode requires that, when a transaction is requested to commit, queued actions are evaluated. This requires the transaction management to interact with the rule evaluation and scheduling services during commit processing (e.g., by using back hooks in the database management system). Further, to be useful, the detached coupling mode has a set of significant varieties [4] that require the possibility of expressing constraints between transactions. The nested transaction model [16] is a sound basis for active capabilities. For example, deferred actions can be executed as subtransactions that can be committed or aborted independently of the parent transaction. Nested transactions still require that existing services be modified. Alternatively, rule evaluation can be performed as subtransactions. To achieve implicit events, the database schema translation process needs to automatically instrument the monitored systems. An inferior solution is to extend an existing schema with instrumented entities; for example, each class in an object-oriented database can be specialized into an instrumented subclass. In this example, there is no way to enforce that the instrumented classes are actually used. The problem is to modify the database schema translation process, since this is typically an intrinsic part of commercial DBMSs. Concerning issue (iii), the services must be allocated to existing resources and scheduled together with the transactions. Typically, the services are implemented as a set of server processes, and transactions are performed by transaction programs running as processes (cf. [11]). These processes are typically scheduled, dispatched, and executed as a response to requests from outside the database management system or as a direct or indirect response to a timeout. Each service is either invoked when something is stored in the queue or table, or explicitly invoked, for example, when the system clock is updated to reflect the new time. The issues concerning scheduling are paramount in any database management system for real-time systems [2]. Event monitoring can either be (i) implicitly invoked whenever an event occurs, or (ii) explicitly invoked. This is similar to coupling modes, but it is between the event sources (e.g., the transaction processing service and the application) and the
event monitoring service rather than between the services of the active capabilities. Case (i) is prevalent in most active database research, but it has a negative impact in terms of determinism of the result of event monitoring. For example, the problem addressed in the event specification and event detection entries concerning the unintuitive semantics of the disjunctive event operator is a result of implicit invocation. In distributed and real-time systems, explicit invocation (case (ii)) is preferable, since it provides the operating system with control over when something should be evaluated. Explicit invocation solves the problem of the disjunction operator (see the event specification and event detection entries), since the event expressions defining composite event types can be explicitly evaluated when all events have been delivered to event monitoring, rather than implicitly evaluated whenever an event is delivered. In explicit invocation of event monitoring, the different event contexts can be treated in different ways. For example, in the recent event context, only the most recent result is of interest in implicit invocation. However, in terms of explicit invocation, all possible most recent event occurrences may be of interest, not only the last one. For example, it may be desirable to keep the most recent event occurrence per time slot rather than per terminating event. Issue (iv) has been addressed in, for example, DeeDS [1], COBEA [15], Hermes [19], and X2TS [5]. Further, it has been addressed in event-based systems for mobile networks by Mühl et al. [17]. Essentially, it is necessary to perform event detection in a moving time window, where the end of the time window is the current time. All events that are older than the beginning of the time window can be removed and ignored. Further, heterogeneity must be addressed, and there are XML-based solutions (e.g., Common Base Events [6]). Another issue that is significant in distributed active databases is the time and order of events. For example, in Snoop [8] it is suggested to separate global and local event detection, because of the difference in time granularity between the local view of time and the global (distributed) view of time.
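The following is a minimal sketch of the queue-based service interaction described above. The class names and the single rule are hypothetical, and it omits coupling modes, the filtered event log, and scheduling details covered in the text.

```python
# Illustrative sketch of the interaction between the event monitoring and rule
# evaluation services described above. Names are hypothetical; real ADBMSs
# refine this considerably (coupling modes, filtered event logs, schedulers).
from collections import deque

class EventMonitoring:
    def __init__(self, subscriptions):
        self.subscriptions = subscriptions      # event types with registered rules

    def analyze(self, events):
        return [e for e in events if e["type"] in self.subscriptions]

class RuleEvaluation:
    def __init__(self, rules):
        self.rules = rules                      # {event_type: (condition, action)}

    def evaluate(self, subscribed_events):
        actions = deque()
        for e in subscribed_events:
            condition, action = self.rules[e["type"]]
            if condition(e):
                actions.append(lambda e=e, a=action: a(e))
        return actions                          # submitted to the scheduling service

# Hypothetical usage: one rule on "low_stock" events generated by transactions.
rules = {"low_stock": (lambda e: e["qty"] < 5,
                       lambda e: print("schedule reorder for", e["item"]))}
monitoring = EventMonitoring(subscriptions=set(rules))
evaluation = RuleEvaluation(rules)
generated = [{"type": "low_stock", "item": "widget", "qty": 2},
             {"type": "audit", "item": "gadget", "qty": 50}]
for scheduled_action in evaluation.evaluate(monitoring.analyze(generated)):
    scheduled_action()                          # executed as dictated by the scheduler
```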
Foundations For a particular application domain, common significant requirements and properties as well as pre-requisites of
available resources need to be considered to refine the general architecture. Depending on the requirements, properties, and pre-requisites, different compromises are reached. One example is the use of composite event detection in active real-time databases. In REACH [4], composite event detection is disallowed for real-time transactions. The reason for this is that during composite event detection, contributing events are locked, and this locking affects other transactions in a harmful way with respect to meeting deadlines. A different approach is proposed in DeeDS [1], where events are stored in the database and cached in a special filtered event log; during event composition, events are not locked, thus enabling the use of composite event detection for transactions with critical deadlines. The cost is that isolation of transactions can be violated unless it is handled by the active capabilities. Availability is an example of a property that significantly affects the software architecture. For example, availability is often considered significant in distributed systems; that is, even though physical nodes may fail, communication links may be down, or other physical nodes may be overloaded, one should get at least some defined level of service from the system. An example of an availability requirement is that emergency calls in phone switches should be prioritized over non-emergency calls, which entails that existing phone call connections can be disconnected to let an emergency call through. Another approach to improving availability is pursued in DeeDS [1], where eventual consistency is investigated as a means to improve the availability of data. The cost is that data can temporarily be inconsistent. As addressed in the aforementioned examples, different settings affect the architecture. Essentially, there are two approaches that can be mixed: (i) refine or invent new methods, tools, and techniques to solve a problem, where these methods, tools, and techniques can stem from different but relevant research areas; (ii) refine the requirements or pre-requisites to solve the problem (e.g., weaken the ACID properties of transactions).
Key Applications The architecture of ADBMSs is of special interest to developers of database management systems and their applications. In particular, software engineering issues are of major interest. Researchers performing experiments can make use of this architecture to enable valid experiments, study effects of optimizations etc.
Concerning real examples of applications, only simple things have been realized, such as using rules to implement alerters, for example when an integrity constraint is violated. SQL triggers implement simple ECA rules in immediate coupling mode between event monitoring and rule evaluation as well as between rule evaluation and action execution. Researchers have aimed for various application domains such as:
Stock market
Inventory control
Bank applications
Essentially, any application domain in which there is an interest in moving functionality from the applications to the database schema, in order to reduce the interdependence between applications and databases, is a candidate.
Future Directions There are no silver bullets in computer science or software engineering, and each refinement of the architecture (in Fig. 1) is a compromise providing or enabling certain features and properties. For example, by allowing only detached coupling mode it is easier to achieve timeliness, an important property of real-time systems; however, the trade-off is that it is difficult to specify integrity rules in terms of ECA rules, since the integrity checks are performed in a different transaction. The consequence is that dirty transactions, as well as compensating transactions that perform recovery from violated integrity rules, must be allowed. It is desirable to study architectures addressing how to meet the specific requirements of the application area (e.g., accounting information in mobile ad-hoc networks) and of the specific environment in which the active database is used (e.g., distributed systems, real-time systems, mobile ad-hoc networks, limited-resource equipment). The major criterion for a successful architecture (e.g., obtained by refining an existing architecture) is whether anyone can gain something from using it. For example, Borr [3] reported that by refining their architecture to employ transaction processing they improved productivity, reliability, and average throughput in their heterogeneous, distributed, reliable applications. An area that has received little attention in active databases is optimization of processing. For example, how can queries to the database be optimized together with condition evaluation if conditions are expressed as
arbitrary queries? Another question is how to group actions to optimize performance. So far, the emphasis has been on expressibility as well as on techniques for enabling active support in different settings. Another area that has received little attention is recovery processing, both pre-crash and post-crash recovery. For example, how should recovery with respect to detached but dependent transactions be managed? Intertransaction events and rules have been proposed by, for example, Buchmann et al. [4]. How should these be managed with respect to the isolation levels proposed by Gray and Reuter [11, Chap. 7]? There are several other areas with which active database technology can be combined. Historical examples include real-time databases, temporal databases, main-memory databases, and geographical information systems. One area that has received little attention is how to enable reuse of database schemas.
Cross-references
▶ Active Database Coupling Modes ▶ Active Database Execution Model ▶ Active Database Knowledge Model ▶ Event Detection ▶ Event Specification
Recommended Reading
1. Andler S., Hansson J., Eriksson J., Mellin J., Berndtsson M., and Eftring B. DeeDS Towards a Distributed Active and Real-Time Database System. ACM SIGMOD Rec., 25(1), 1996.
2. Berndtsson M. and Hansson J. Issues in active real-time databases. In Proc. 1st Int. Workshop on Active and Real-Time Database Systems, 1995, pp. 142–150.
3. Borr A.J. Robustness to crash in a distributed database: A non shared-memory multi-processor approach. In Proc. 10th Int. Conf. on Very Large Data Bases, 1984, pp. 445–453.
4. Buchmann A.P., Zimmermann J., Blakeley J.A., and Wells D.L. Building an Integrated Active OODBMS: Requirements, Architecture, and Design Decisions. In Proc. 11th Int. Conf. on Data Engineering, 1995, pp. 117–128.
5. Chakravarthy S., Blaustein B., Buchmann A.P., Carey M., Dayal U., Goldhirsch D., Hsu M., Jauhari R., Ladin R., Livny M., McCarthy D., McKee R., and Rosenthal A. HiPAC: A Research Project In Active Time-Constrained Database Management. Tech. Rep. XAIT-89-02, Xerox Advanced Information Technology, 1989.
6. Common Base Events. http://www.ibm.com/developerworks/library/specification/ws-cbe/.
7. Chakravarthy S., Krishnaprasad V., Anwar E., and Kim S.K. Composite Events for Active Databases: Semantics, Contexts, and Detection. In Proc. 20th Int. Conf. on Very Large Data Bases, 1994, pp. 606–617.
8. Dayal U., Blaustein B., Buchmann A., Chakravarthy S., Hsu M., Ladin R., McCarthy D., Rosenthal A., Sarin S., Carey M.J., Livny M., and Jauhari R. The HiPAC Project: Combining active databases and timing constraints. ACM SIGMOD Rec., 17(1), 1988.
9. Eriksson J. Real-Time and Active Databases: A Survey. In Proc. 2nd Int. Workshop on Active, Real-Time, and Temporal Database Systems, 1997, pp. 1–23.
10. Gatziu S. Events in an Active Object-Oriented Database System. Ph.D. thesis, University of Zurich, Switzerland, 1994.
11. Gray J. and Reuter A. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, Los Altos, CA, 1994.
12. Jaeger U. Event Detection in Active Databases. Ph.D. thesis, University of Berlin, 1997.
13. Liebig C., Malva M., and Buchmann A. Integrating Notifications and Transactions: Concepts and X2TS Prototype. In Proc. 2nd International Workshop on Engineering Distributed Objects, 2000, pp. 194–214.
14. Lieuwen D.F., Gehani N., and Arlein R. The ODE active database: Trigger semantics and implementation. In Proc. 12th Int. Conf. on Data Engineering, 1996, pp. 412–420.
15. Ma C. and Bacon J. COBEA: A CORBA-based Event Architecture. In Proc. 4th USENIX Conf. on Object-Oriented Technologies and Systems, 1998, pp. 117–132.
16. Moss J.E.B. Nested Transactions: An Approach to Reliable Distributed Computing. MIT Press, 1985.
17. Mühl G., Fiege L., and Pietzuch P.R. Distributed Event-Based Systems. Springer, Berlin, 2006.
18. Paton N. and Diaz O. Active database systems. ACM Comput. Surv., 31(1):63–103, 1999.
19. Pietzuch P. and Bacon J. Hermes: A Distributed Event-Based Middleware Architecture. In Proc. 22nd Int. Conf. on Distributed Computing Systems Workshops, Vienna, Austria, 2002, pp. 611–618.
20. Ullman J.D. Principles of Database Systems. Computer Science Press, 1982.
Active Database Coupling Modes
Mikael Berndtsson, Jonas Mellin, University of Skövde, Skövde, Sweden
Definition Coupling modes specify execution points for ECA rule conditions and ECA rule actions with respect to the triggering event and the transaction model.
Historical Background Coupling modes for ECA rules were first suggested in the HiPAC project [2,3].
Foundations Coupling modes are specified for event-condition couplings and for condition-action couplings. In detail, the event-condition coupling specifies when the condition should be evaluated with respect to the triggering event, and the condition-action coupling specifies when the rule action should be executed with respect to the evaluated rule condition (if the condition evaluates to true). The three most common coupling modes are immediate, deferred, and decoupled. The immediate coupling mode preempts the execution of the transaction and immediately initiates condition evaluation and action execution. In the deferred coupling mode, condition evaluation and action execution are deferred to the end of the transaction (before transaction commit). Finally, in the decoupled (also referred to as detached) coupling mode, condition evaluation and action execution are performed in separate transactions. Specifying event-condition couplings and condition-action couplings in total isolation from each other is not a good idea. What might at first seem to be one valid coupling mode for event-condition and one valid coupling mode for condition-action can be an invalid combination when used together. Thus, when combining event-condition couplings and condition-action couplings, not all combinations of coupling modes are valid. The HiPAC project [2,3] proposed seven valid coupling modes, see Table 1.
Immediate, immediate: the rule condition is evaluated immediately after the event, and the rule action is executed immediately after the rule condition.
Immediate, deferred: the rule condition is evaluated immediately after the event, and the execution of the rule action is deferred to the end of the transaction.
Immediate, decoupled: the rule condition is evaluated immediately after the event, and the rule action is decoupled in a totally separate and parallel transaction.
Deferred, deferred: both the evaluation of the rule condition and the execution of the rule action are deferred to the end of the transaction.
Deferred, decoupled: the evaluation of the rule condition is deferred to the end of the transaction, and the rule action is decoupled in a totally separate and parallel transaction.
Decoupled, immediate: the rule condition is decoupled in a totally separate and parallel transaction, and the rule action is executed (in the same parallel transaction) immediately after the rule condition.
Decoupled, decoupled: the rule condition is decoupled in a totally separate and parallel transaction, and the rule action is decoupled in another totally separate and parallel transaction.
The two invalid coupling modes are:
Deferred, immediate: this combination violates the semantics of ECA rules. That is, rule conditions must be evaluated before rule actions are executed. One cannot preempt the execution of the transaction immediately after the event and execute the rule action while at the same time postponing the condition evaluation to the end of the transaction.
Decoupled, deferred: this combination violates transaction boundaries. That is, one cannot decouple the condition evaluation in a separate and parallel transaction and at the same time postpone the execution of the rule action to the end of the original transaction, since one cannot know when the condition evaluation will take place. Thus, there is a risk that the action execution in the original transaction will run before the condition has been evaluated in the parallel transaction.
Active Database Coupling Modes. Table 1. Coupling modes (event-condition coupling / condition-action coupling):
Immediate / Immediate: condition evaluated and action executed after event.
Immediate / Deferred: condition evaluated after event, action executed at end of transaction.
Immediate / Decoupled: condition evaluated after event, action executed in a separate transaction.
Deferred / Immediate: not valid.
Deferred / Deferred: condition evaluated and action executed at end of transaction.
Deferred / Decoupled: condition evaluated at end of transaction, action executed in a separate transaction.
Decoupled / Immediate: in a separate transaction, condition evaluated and action executed after event.
Decoupled / Deferred: not valid.
Decoupled / Decoupled: condition evaluated in one separate transaction, action executed in another separate transaction.
Rule actions executed in decoupled transactions can either be dependent upon or independent of the transaction in which the event took place. The research project REACH (REal-time ACtive Heterogeneous System) [1] introduced two additional coupling modes for supporting side effects of rule actions that are irreversible. The new coupling modes are variants of the detached causally dependent coupling mode: sequential causally dependent and exclusive causally dependent. In sequential causally dependent mode, a rule is executed in a separate transaction; however, the rule execution can only begin once the triggering transaction has committed. In exclusive causally dependent mode, a rule is executed in a detached parallel transaction and it can commit only if the triggering transaction failed.
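The following sketch illustrates, with hypothetical classes and no real DBMS API, how the three basic coupling modes place condition evaluation and action execution relative to the triggering transaction; the decoupled case is simplified to running sequentially after commit.

```python
# Illustrative sketch (hypothetical classes, no real DBMS API) of how the three
# basic coupling modes place condition evaluation and action execution relative
# to the triggering transaction.
class Transaction:
    def __init__(self, name):
        self.name = name
        self.deferred = []          # work postponed to the end of the transaction
        self.detached = []          # work handed to separate transactions

    def signal(self, rule, event):
        if rule.mode == "immediate":
            rule.fire(event)                                  # run inside the transaction, now
        elif rule.mode == "deferred":
            self.deferred.append(lambda: rule.fire(event))    # run just before commit
        elif rule.mode == "decoupled":
            self.detached.append(lambda: rule.fire(event))    # run in a separate transaction

    def commit(self):
        for work in self.deferred:   # deferred coupling: before commit completes
            work()
        print(f"{self.name} committed")
        for work in self.detached:   # decoupled coupling: simplified to run after commit
            work()

class Rule:
    def __init__(self, mode, condition, action):
        self.mode, self.condition, self.action = mode, condition, action

    def fire(self, event):
        if self.condition(event):
            self.action(event)

tx = Transaction("T1")
rule = Rule("deferred", lambda e: e["qty"] < 5, lambda e: print("reorder", e["item"]))
tx.signal(rule, {"item": "widget", "qty": 2})
tx.commit()
```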
Cross-references
▶ Active Database Execution Model ▶ ECA-rules
Recommended Reading
1. Branding H., Buchmann A., Kudrass T., and Zimmermann J. Rules in an Open System: The REACH Rule System. In Proc. 1st International Workshop on Rules in Database Systems, Workshops in Computing, 1994, pp. 111–126.
2. Dayal U., Blaustein B., Buchmann A., Chakravarthy S., et al. The HiPAC project: Combining active databases and timing constraints. ACM SIGMOD Rec., 17(1):51–70, 1988.
3. Dayal U., Blaustein B., Buchmann A., Chakravarthy S., et al. HiPAC: A Research Project in Active, Time-Constrained Database Management. Tech. Rep. CCA-88-02, Xerox Advanced Information Technology, Cambridge, 1988.
Active Database Execution Model
Mikael Berndtsson, Jonas Mellin, University of Skövde, Skövde, Sweden
Definition The execution model of an active database describes how a set of ECA rules behave at run time.
Key Points The execution model describes how a set of ECA rules (i.e., an active database rulebase) behaves at run time [2,4]. Any execution model of an active database must have support for: (i) detecting event occurrences, (ii) evaluating conditions, and (iii) executing actions. If an active database supports composite event detection, it needs a policy that describes how a composite event is computed. A typical approach is to use the event consumption modes described in Snoop [1]: recent, chronicle, continuous, and cumulative. In the recent event context, only the most recent constituent events are used to form composite events. In the chronicle event context, events are consumed in chronological order: the earliest unused initiator/terminator pair is used to form the composite event. In the continuous event context, each initiator starts the detection of a new composite event and a terminator may terminate one or more composite event occurrences. The difference between the continuous and chronicle event contexts is that in the continuous event context, one terminator can detect more than one occurrence of the composite event. In the cumulative event context, all events contributing to a composite event are accumulated until the composite event is detected; when the composite event is detected, all contributing events are consumed. Another approach to these event consumption modes is to specify a finer semantics for each event by using logical events, as suggested in [3]. Once an event has been detected, there are several execution policies related to rule conditions and rule actions that must be in place in the execution model. Thus, an execution model for an active database should provide answers to the following questions [2,4,5]:
When should the condition be evaluated and when should the action be executed with respect to the triggering event and the transaction model? This is usually specified by coupling modes.
What happens if an event triggers several rules?
– Are all rules evaluated, a subset, or only one rule?
– Are rules executed in parallel, according to rule priority, or non-deterministically?
What happens if the execution of one rule triggers another set of rules?
– What happens if the rule action of one rule negates the rule condition of an already triggered rule?
– Can cycles appear? For example, can a rule trigger itself?
The answers to the above questions are important to know, as they dictate how an ECA rule system will behave at run time. If the answers to the above questions are not known, then the behavior of the ECA rule application becomes unpredictable.
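To make the consumption modes described above concrete, the following sketch contrasts the recent and chronicle contexts for a sequence event E1;E2, where E1 initiates and E2 terminates the composite event. It is a simplified illustration, not Snoop's actual detection algorithm.

```python
# Simplified illustration (not Snoop's actual implementation) of the recent and
# chronicle consumption contexts for a sequence event E1;E2.
def detect_sequence(events, context="recent"):
    """Return (initiator, terminator) pairs for the composite event E1;E2."""
    detected, initiators = [], []
    for e in events:
        if e["type"] == "E1":
            initiators.append(e)
        elif e["type"] == "E2" and initiators:
            if context == "recent":
                detected.append((initiators[-1], e))     # most recent initiator
                initiators = []                          # consumed (simplification)
            elif context == "chronicle":
                detected.append((initiators.pop(0), e))  # earliest unused initiator
    return detected

stream = [{"type": "E1", "id": 1}, {"type": "E1", "id": 2}, {"type": "E2", "id": 3}]
print(detect_sequence(stream, "recent"))     # pairs E1(id=2) with E2(id=3)
print(detect_sequence(stream, "chronicle"))  # pairs E1(id=1) with E2(id=3)
```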
Cross-references
▶ Active Database Coupling Modes ▶ Active Database Rulebase ▶ Composite Event ▶ Database Trigger ▶ ECA Rules
Recommended Reading
1. Chakravarthy S., Krishnaprasad V., Anwar E., and Kim S.K. Composite Events for Active Databases: Semantics, Contexts and Detection. In Proc. 20th Int. Conf. on Very Large Data Bases, 1994, pp. 606–617.
2. Dayal U., Blaustein B., Buchmann A., Chakravarthy S., et al. HiPAC: A Research Project in Active, Time-Constrained Database Management. Technical Report CCA-88-02, Xerox Advanced Information Technology, Cambridge, 1988.
3. Gehani N., Jagadish H.V., and Shmueli O. Event specification in an active object-oriented database. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1992, pp. 81–90.
4. Paton N.W. and Diaz O. Active Database Systems. ACM Comput. Surv., 31(1):63–103, 1999.
5. Widom J. and Finkelstein S. Set-Oriented Production Rules in Relational Database Systems. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1990, pp. 259–270.
Active Database Knowledge Model
Mikael Berndtsson, Jonas Mellin, University of Skövde, Skövde, Sweden
Definition The knowledge model of an active database describes what can be said about the ECA rules, that is, what types of events, conditions, and actions are supported.
Key Points The knowledge model describes what types of events, conditions, and actions are supported in an active database. Another way to look at the knowledge model is to imagine what types of features are available in an ECA rule definition language. A framework of dimensions for the knowledge model is presented in [3]. Briefly, each part of an
ECA rule is associated with dimensions that describe supported features. Thus, an event can be described in terms of whether it is a primitive event or a composite event, how it was generated (source), whether the event is generated for all instances in a given set or only for a subset (event granularity), and, if it is a composite event, what types of operators and event consumption modes are used in its detection. Conditions are evaluated against a database state. There are three different database states that a rule condition can be associated with [3]: (i) the database state at the start of the transaction, (ii) the database state when the event was detected, and (iii) the database state when the condition is evaluated. There are four different database states that a rule action can be associated with [3]: (i) the database state at the start of the transaction, (ii) the database state when the event was detected, (iii) the database state when the condition is evaluated, and (iv) the database state just before action execution. The types of rule actions range from internal database updates (e.g., update a table) to external programs (e.g., send email). Within the context of the knowledge model it is also useful to consider how ECA rules are represented, for example inside classes, as data members, or as first-class objects. Representing ECA rules as first-class objects [1,2] is a popular choice, since rules can be treated as any other object in the database and traditional database operations can be used to manipulate the ECA rules. Thus, representing ECA rules as first-class objects implies that ECA rules are not dependent upon the existence of other objects. The knowledge model of an active database should also describe whether the active database supports passing of parameters between the ECA rule parts, for example from the event part to the condition part. Related to the knowledge model is the execution model, which describes how ECA rules behave at run time.
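The following sketch illustrates, with hypothetical structures, the difference between two of the database states listed above: the state when the event was detected versus the state when the condition is evaluated.

```python
# Illustrative sketch (hypothetical structures) of associating a rule condition
# with different database states: at event detection versus at condition
# evaluation.
import copy

class MiniDB:
    def __init__(self):
        self.tables = {"stock": {"widget": 12}}

    def snapshot(self):
        return copy.deepcopy(self.tables)

db = MiniDB()

# Event detected: remember the state at event-detection time.
state_at_event = db.snapshot()

# The transaction keeps running and changes the data before rule evaluation.
db.tables["stock"]["widget"] = 3
state_at_evaluation = db.snapshot()

condition = lambda state: state["stock"]["widget"] < 5
print(condition(state_at_event))       # False: quantity was 12 when the event occurred
print(condition(state_at_evaluation))  # True: quantity is 3 when the condition is evaluated
```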
Cross-references
▶ Active Database Execution Model ▶ ECA Rules
Recommended Reading
1. Dayal U., Blaustein B., Buchmann A., et al. HiPAC: A Research Project in Active, Time-Constrained Database Management. Tech. Rep. CCA-88-02, Xerox Advanced Information Technology, Cambridge, 1988.
2. Dayal U., Buchmann A., and McCarthy D. Rules are objects too: a knowledge model for an active, object-oriented database system. In Proc. 2nd Int. Workshop on Object-Oriented Database Systems, 1988, pp. 129–143.
3. Paton N.W. and Diaz O. Active database systems. ACM Comput. Surv., 31(1):63–103, 1999.
Active Document ▶ Active XML
Active Database Rulebase
Ann Marie Ericsson, Mikael Berndtsson, Jonas Mellin, University of Skövde, Skövde, Sweden
Definition An active database rulebase is a set of ECA rules that can be manipulated by an active database.
Key Points An active database rulebase is a set of ECA rules that can be manipulated by an active database. Thus, an ADB rulebase is not static, but it evolves over time. Typically, ECA rules can be added, deleted, modified, enabled, and disabled. Each update of the ADB rulebase can potentially lead to different behaviors of the ECA rules at run time, in particular with respect to termination and confluence. Termination concerns whether a set of rules is guaranteed to terminate. A set of rules may have a non-terminating behavior if rules are triggering each other in a circular order, for example, if the execution of rule R1 triggers rule R2 and the execution of rule R2 triggers rule R1. A set of rules is confluent if the outcome of simultaneously triggered rules is unique and independent of execution order.
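As an illustration, the following sketch performs a conservative termination check over a hypothetical triggering graph (which rules may trigger which rules); it flags the R1/R2 cycle mentioned above. The representation and graphs are illustrative only; real termination analyses are more refined.

```python
# Illustrative sketch: a conservative termination check over a rulebase using a
# triggering graph (rule -> rules it may trigger). The graphs below are
# hypothetical examples.
def may_not_terminate(triggering_graph):
    """Return True if the triggering graph contains a cycle (possible non-termination)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {rule: WHITE for rule in triggering_graph}

    def visit(rule):
        color[rule] = GRAY
        for triggered in triggering_graph.get(rule, []):
            if color.get(triggered, WHITE) == GRAY:
                return True                      # back edge: cycle found
            if color.get(triggered, WHITE) == WHITE and visit(triggered):
                return True
        color[rule] = BLACK
        return False

    return any(color[rule] == WHITE and visit(rule) for rule in triggering_graph)

# R1 triggers R2 and R2 triggers R1: potential non-termination.
print(may_not_terminate({"R1": ["R2"], "R2": ["R1"]}))  # True
print(may_not_terminate({"R1": ["R2"], "R2": []}))      # False
```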
Cross-references ▶ ECA Rules
Active Databases ▶ Event Driven Architecture
Active Disks ▶ Active Storage
Active Storage
Kazuo Goda, The University of Tokyo, Tokyo, Japan
Synonyms Active Disks; Intelligent Disks
Definition Active Storage is a computer system architecture which utilizes processing power in disk drives to execute application code. Active Storage was introduced in separate academic papers [1–3] in 1998. The term Active Storage is sometimes identified merely with the computer systems proposed in these papers. Two synonyms, Active Disk and Intelligent Disk, are also used to refer to Active Storage. The basic idea behind Active Storage is to offload computation and data traffic from host computers to the disk drives themselves such that the system can achieve significant performance improvements for data intensive applications such as decision support systems and multimedia applications.
Key Points A research group at Carnegie Mellon University proposed, in [3], a storage device called Active Disk, which has the capability of downloading application-level code and running it on a processor embedded on the device. Active Disk has a performance advantage for I/O bound scans, since processor-per-disk processing can potentially reduce data traffic on interconnects to host computers and yield great parallelism of scans. E. Riedel et al. carefully studied the potential benefits of using Active Disks for four types of data intensive applications, and introduced analytical performance models for comparing traditional server systems and Active Disks. They also prototyped ten Active Disks, each having a DEC Alpha processor and two Seagate disk drives, and demonstrated almost linear scalability in the experiments. A research group at University of California at Berkeley discussed a vision of Intelligent Disks (IDISKs) in [2]. The approach of Intelligent Disk is similar to that
of Active Disk. K. Keeton et al. carefully studied the weaknesses of shared-nothing clusters of workstations and then explored the possibility of replacing the cluster nodes with Intelligent Disks for large-scale decision support applications. Intelligent Disks assumed higher complexity of applications and hardware resources in comparison with CMU's Active Disks. Another Active Disk was presented by a research group at the University of California at Santa Barbara and the University of Maryland in [1]. A. Acharya et al. carefully studied programming models to exploit disk-embedded processors efficiently and safely, and proposed algorithms for typical data-intensive operations such as selection and external sorting, which were validated by simulation experiments. These three works are often recognized as opening the gate for new research on Intelligent Storage Systems in the post-"database machines" era.
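The following toy sketch illustrates the processor-per-disk idea: each simulated disk applies a selection locally and ships only matching records to the host, so interconnect traffic shrinks and the scan parallelizes across disks. It uses no real device API and is purely illustrative.

```python
# Toy illustration of the processor-per-disk idea behind Active Storage: each
# (simulated) disk evaluates a selection predicate locally and ships only the
# matching records to the host. Purely illustrative; no real device API is used.
from concurrent.futures import ThreadPoolExecutor

def on_disk_scan(records, predicate):
    """Code 'downloaded' to the disk: filter locally, return only matches."""
    return [r for r in records if predicate(r)]

def host_query(disks, predicate):
    with ThreadPoolExecutor(max_workers=len(disks)) as pool:
        partial_results = list(pool.map(lambda d: on_disk_scan(d, predicate), disks))
    return [r for part in partial_results for r in part]   # host merges small results

# Hypothetical data spread over three "disks".
disks = [[{"id": i, "value": i * 7 % 100} for i in range(n, n + 1000)]
         for n in (0, 1000, 2000)]
hot = host_query(disks, lambda r: r["value"] > 95)
print(len(hot), "matching records shipped to the host out of", sum(map(len, disks)))
```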
Cross-references
▶ Database Machine ▶ Intelligent Storage Systems
Recommended Reading
1. Acharya A., Mustafa U., and Saltz J.H. Active disks: programming model, algorithms and evaluation. In Proc. 8th Int. Conf. Architectural Support for Programming Lang. and Operating Syst., 1998, pp. 81–91.
2. Keeton K., Patterson D.A., and Hellerstein J.M. A case for intelligent disks (IDISKs). SIGMOD Rec., 27(3):42–52, 1998.
3. Riedel E., Gibson G.A., and Faloutsos C. Active storage for large-scale data mining and multimedia. In Proc. 24th Int. Conf. on Very Large Data Bases, 1998, pp. 62–73.
Active XML
Serge Abiteboul (INRIA Saclay – Île-de-France, Orsay Cedex, France), Omar Benjelloun (Google Inc., Mountain View, CA, USA), Tova Milo (Tel Aviv University, Tel Aviv, Israel)
Synonyms Active document; AXML
Definition Active XML documents (AXML documents, for short) are XML documents [12] that may include embedded
calls to Web services [13]. Hence, AXML documents are a combination of regular ‘‘extensional’’ XML data with data that is defined ‘‘intensionally,’’ i.e., as a description that enables obtaining data dynamically (by calling the corresponding service). AXML documents evolve in time when calls to their embedded services are triggered. The calls may bring data once (when invoked) or continually (e.g., if the called service is a continuous one, such as a subscription to an RSS feed). They may even update existing parts of the document (e.g., by refreshing previously fetched data).
Historical Background The AXML language was originally proposed at INRIA around 2002. Work on AXML has been going on there in the following years. A survey of the research on AXML is given in [13]. The software, primarily in the form of an AXML system, is available as open source software. Resources on Active XML may be found on the project's Web site [11]. The notion of embedding function calls into data is old. Embedded functions are already present in relational systems as stored procedures. Of course, method calls form a key component of object databases. For the Web, scripting languages such as PHP or JSP have made popular the integration of processing inside HTML or XML documents. Combined with standard database interfaces such as JDBC and ODBC, functions are used to integrate results of (SQL) queries. This idea can also be found in commercial software products; for instance, in Microsoft Office XP, SmartTags inside Office documents can be linked to Microsoft's .NET platform for Web services. The originality of the AXML approach is that it proposed to exchange such documents, building on the fact that Web services may be invoked from anywhere. In that sense, this is truly a language for distributed data management. Another particularity is that the logic (the AXML language) is a subset of the AXML algebra. Looking at the services in AXML as queries, the approach can be viewed as closely related to recent work based on XQuery [14], where the query language is used to describe query plans. For instance, the DXQ project [7] developed at AT&T and UCSD emphasizes the distributed evaluation of XQuery queries. Since one can describe documents in an XQuery syntax, such approaches encompass in
some sense AXML documents where the service calls are XQuery queries. The connection with deductive databases is used in [1] to study diagnosis problems in distributed networks. A similar approach is followed in [8] for declarative network routing. It should be observed that the AXML approach touches upon most database areas. In particular, the presence of intensional data leads to views, deductive databases, and data integration. The activation of calls contained in a document essentially leads to active databases. AXML services may be activated by external servers, which relates to subscription queries and stream databases. Finally, the evolution of AXML documents and their inherent changing nature lead to workflows and service choreography in the style of business artifacts [10]. The management of AXML documents raises a number of issues. For instance, the evaluation of queries over active documents is studied in [2]. The "casting" of a document to a desired type is studied in
Active XML. Figure 1. An AXML document.
[9]. The distribution of documents between several peers and their replication is the topic of [4].
Foundations An AXML document is a (syntactically valid) XML document, where service calls are denoted by special XML elements labeled call. An example AXML document is given in Fig. 1. The figure shows first the XML serialized syntax, then a more abstract view of the same document as a labeled tree. The document in the figure describes a (simplified) newspaper homepage consisting of (i) some extensional information (the name of the newspaper, the current date, and a news story), and (ii) some intensional information (service calls for the weather forecast, and for the current exhibits). When the services are called, the tree evolves. For example, the tree at the bottom is what results from a call to the service f at weather.com to obtain the temperature in Paris. AXML documents fit nicely in a peer-to-peer architecture, where each peer is a persistent store of AXML
documents, and may act both as a client, by invoking the service calls embedded in its AXML documents, and as a server, by providing Web services over these documents. Two fundamental issues arise when dealing with AXML documents. The first one is related to the exchange of AXML documents between peers, and the second one is related to query evaluation over such data. Document exchange: When exchanged between two applications/peers, AXML documents have a crucial property: since Web services can be called from anywhere on the Web, data can either be materialized before sending, or sent in its intensional form and left to the receiver to materialize if and when needed. Just like XML Schemas do for standard XML, AXML schemas let the user specify the desired format of the exchanged data, including which parts should remain intensional and which should be materialized. Novel algorithms allow the sender to determine (statically or dynamically) which service invocations are required to "cast" the document to the required data exchange format [9]. Query evaluation: Answering a query on an AXML document may require triggering some of the service calls it contains. These services may, in turn, query other AXML documents and trigger some other services, and so on. This recursion, based on the management of intensional data, leads to a framework in the style of deductive databases. Query evaluation on AXML data can therefore benefit from techniques developed in deductive databases such as Magic Sets [6]. Indeed, corresponding AXML query optimization techniques were proposed in [1,2]. Efficient query processing is, in general, a critical issue for Web data management. AXML, when properly extended, becomes an algebraic language that enables query processors installed on different peers to collaborate by exchanging streams of (A)XML data [14]. The crux of the approach is (i) the introduction of generic services (i.e., services that can be provided by several peers, such as query processing) and (ii) some explicit control of distribution (e.g., to allow delegating part of some work to another peer).
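The following sketch illustrates, under simplifying assumptions, how a peer might materialize the intensional parts of an AXML document on demand. The element and attribute names and the call_service helper are hypothetical and are not part of the AXML system's actual API; this is only a rough illustration of the materialize-before-sending versus send-intensional choice.

```python
# Hypothetical sketch of lazy materialization of an AXML document.
# Element tags, attribute names, and call_service() are illustrative only.
import xml.etree.ElementTree as ET

def call_service(endpoint, params):
    # Placeholder for a real Web service invocation (e.g., a SOAP/REST call).
    return ET.Element("temp", {"city": params.get("city", ""), "value": "15C"})

def materialize(doc, lazy=True, needed=lambda call: True):
    """Replace embedded <call> elements by the data returned by the service.

    With lazy=True, only calls selected by `needed` are triggered; the rest
    stay intensional and can be shipped to the receiver as-is.
    """
    for parent in doc.iter():
        for child in list(parent):
            if child.tag == "call" and (not lazy or needed(child)):
                result = call_service(child.get("service"), dict(child.attrib))
                parent.remove(child)      # drop the intensional node
                parent.append(result)     # splice in the fetched data
    return doc

axml = ET.fromstring(
    '<newspaper><title>The Sun</title>'
    '<call service="weather.com/f" city="Paris"/></newspaper>')
materialize(axml)   # the weather call is replaced by extensional data
```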
Key Applications AXML and the AXML algebra target all distributed applications that involve the management of distributed data. AXML is particularly suited for data integration (from databases and other data resources
exported as Web services) and for managing (active) views on top of data sources. In particular, AXML can serve as a formal foundation for mash-up systems. Also, the language is useful for (business) applications based on evolving documents in the style of business artifacts, and on the exchange of such information. The fact that the exchange is based on flows of XML messages makes it also well-adapted to the management of distributed streams of information.
Cross-references
▶ Active Document ▶ BPEL ▶ Web Services ▶ W3C XML Query Language ▶ XML ▶ XML Types
Recommended Reading
1. Abiteboul S., Abrams Z., and Milo T. Diagnosis of Asynchronous Discrete Event Systems – Datalog to the Rescue! In Proc. 24th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2005, pp. 358–367.
2. Abiteboul S., Benjelloun O., Cautis B., Manolescu I., Milo T., and Preda N. Lazy Query Evaluation for Active XML. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2004, pp. 227–238.
3. Abiteboul S., Benjelloun O., and Milo T. The Active XML project: an overview. VLDB J., 17(5):1019–1040, 2008.
4. Abiteboul S., Bonifati A., Cobena G., Manolescu I., and Milo T. Dynamic XML Documents with Distribution and Replication. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003, pp. 527–538.
5. Abiteboul S., Manolescu I., and Taropa E. A Framework for Distributed XML Data Management. In Advances in Database Technology, Proc. 10th Int. Conf. on Extending Database Technology, 2006.
6. Bancilhon F., Maier D., Sagiv Y., and Ullman J.D. Magic Sets and Other Strange Ways to Implement Logic Programs. In Proc. 5th ACM SIGACT-SIGMOD Symp. on Principles of Database Systems, 1986, pp. 1–15.
7. DXQ: Managing Distributed System Resources with Distributed XQuery. http://db.ucsd.edu/dxq/.
8. Loo B.T., Condie T., Garofalakis M., Gay D.E., Hellerstein J.M., Maniatis P., Ramakrishnan R., Roscoe T., and Stoica I. Declarative networking: language, execution and optimization. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2006, pp. 97–108.
9. Milo T., Abiteboul S., Amann B., Benjelloun O., and Ngoc F.D. Exchanging Intensional XML Data. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003, pp. 289–300.
10. Nigam A. and Caswell N.S. Business artifacts: an approach to operational specification. IBM Syst. J., 42(3):428–445, 2003.
11. The Active XML homepage. http://www.activexml.net/.
12. The Extensible Markup Language (XML) 1.0 (2nd edn). http://www.w3.org/TR/REC-xml.
13. The W3C Web Services Activity. http://www.w3.org/2002/ws.
14. The XQuery language. http://www.w3.org/TR/xquery.
Activity
Nathaniel Palmer
Workflow Management Coalition, Hingham, MA, USA
Synonyms Step; Node; Task; Work element
Definition A description of a piece of work that forms one logical step within a process. An activity may be a manual activity, which does not support computer automation, or a workflow (automated) activity. A workflow activity requires human and/or machine resources to support process execution; where human resource is required an activity is allocated to a workflow participant.
Key Points A process definition generally consists of many process activities which are logically related in terms of their contribution to the overall realization of the business process. An activity is typically the smallest unit of work which is scheduled by a workflow engine during process enactment (e.g., using transition and pre/postconditions), although one activity may result in several work items being assigned (to a workflow participant). Wholly manual activities may form part of a business process and be included within its associated process definition, but do not form part of the automated workflow resulting from the computer supported execution of the process. An activity may therefore be categorized as ‘‘manual,’’ or ‘‘automated.’’ Within this document, which is written principally in the context of workflow management, the term is normally used to refer to an automated activity.
Cross-references
▶ Activity Diagrams ▶ Actors/Agents/Roles ▶ Workflow Model
Activity Diagrams
Luciano Baresi
Politecnico di Milano University, Milan, Italy
Synonyms Control flow diagrams; Object flow diagrams; Flowcharts; Data flow diagrams
Definition Activity diagrams, also known as control flow and object flow diagrams, are one of the UML (unified modeling language [11]) behavioral diagrams. They provide a graphical notation to define the sequential, conditional, and parallel composition of lower-level behaviors. These diagrams are suitable for business process modeling and can easily be used to capture the logic of a single use case, the usage of a scenario, or the detailed logic of a business rule. They model the workflow behavior of an entity (system) in a way similar to state diagrams where the different activities are seen as the states of doing something. Although they could also model the internal logic of a complex operation, this is not their primary use since tangled operations should always be decomposed into simpler ones [1,2]. An activity [3] represents a behavior that is composed of individual elements called actions. Actions have incoming and outgoing edges that specify control and data flow from and to other nodes. Activities may form invocation hierarchies invoking other activities, ultimately resolving to individual actions. The execution of an activity implies that each contained action be executed zero, one, or more times depending on the execution conditions and the structure of the activity. The execution of an action is initiated by the termination of other actions, the availability of particular objects and data, or the occurrence of external events. The execution is based on token flow (like Petri Nets). A token contains an object, datum, or locus of control, and is present in the activity diagram at a particular node. When an action begins execution, tokens are accepted from some or all of its input edges and a token is placed on the node. When an action completes execution, a token is removed from the node and tokens are moved to some or all of its output edges.
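As a rough illustration of the token-flow execution described above (not the UML specification's formal semantics), the sketch below fires an action only when tokens are offered on all of its incoming edges, consumes them, and places tokens on its outgoing edges. The node and edge names are invented for the example.

```python
# Minimal token-flow sketch for an activity graph (illustrative only).
from collections import defaultdict

class Action:
    def __init__(self, name, inputs, outputs):
        self.name, self.inputs, self.outputs = name, inputs, outputs

    def ready(self, tokens):
        # An action may start when every incoming edge offers a token.
        return all(tokens[e] > 0 for e in self.inputs)

    def fire(self, tokens):
        for e in self.inputs:
            tokens[e] -= 1            # consume tokens from incoming edges
        for e in self.outputs:
            tokens[e] += 1            # offer tokens on outgoing edges
        print("executed", self.name)

def run(actions, tokens):
    progress = True
    while progress:
        progress = False
        for a in actions:
            if a.ready(tokens):
                a.fire(tokens)
                progress = True

tokens = defaultdict(int, {"start": 1})
run([Action("Receive Order", ["start"], ["e1"]),
     Action("Fill Order", ["e1"], ["e2"]),
     Action("Ship Order", ["e2"], ["end"])], tokens)
```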
Historical Background OMG (Object Management Group, [10]) proposed and standardized activity diagrams by borrowing concepts from flow-based notations and some formal methods. As for the first group, these diagrams mimic flowcharts [6] in their idea of step-by-step representation of algorithms and processes, but they also resemble data and control flow diagrams [4]. The former provide a hierarchical and graphical representation of the "flow" of data through a system inspired by the idea of data flow graph. They show the flow of data from external entities into the system, how these data are moved from one computation to another, and how they are logically stored. Similarly, object flow diagrams show the relationships among input objects, methods, and output objects in object-based models. Control flow diagrams represent the paths that can be traversed while executing a program. Each node in the graph represents a basic block, be it a single line or an entire function, and edges render how the execution jumps among them. Moving to the second group, activity diagrams are similar to state diagrams [8], where the evolution of a system is rendered by the identification of the states, which characterize the element's life cycle, and of the transitions between them. A state transition can be constrained by the occurrence of an event and by an additional condition; its firing can cause the execution of an associated action. Several variations exist: Mealy machines associate actions only with transitions, Moore machines associate them only with states, and Harel's statecharts [7] merge the two approaches with actions on both states and transitions, and enhance the flat model with nested and concurrent states. The dynamic semantics of activity diagrams is clearly inspired by Petri Nets [9], which are a simple graphical formalism to specify the behavior of concurrent and parallel systems. The nodes are partitioned into places and transitions, with arcs that can only connect nodes of different type. Places may contain any number of tokens, and a distribution of tokens over the places of a net is called a marking. A transition can only fire when there is at least a token in all its input places (i.e., those places connected to the transition by means of incoming edges), and its firing removes a token from all these places and produces a new one in each output place (i.e., a place connected to the transition through an outgoing edge). P/T nets only consider tokens as placeholders, while colored nets
augment them with typed data and thus with firing conditions that become more articulated and can depend on the tokens' values in the input places. Activity diagrams also borrow from SDL (Specification and Description Language, [5]) for event handling. This is a specification language for the unambiguous description of the behavior of reactive and distributed systems. Originally, the notation was conceived for the specification of telecommunication systems, but currently its application is wider and includes process control and real-time applications in general. A system is specified as a set of interconnected abstract machines, which are extensions of finite state machines. SDL offers both a graphical and a textual representation, and its latest version (known as SDL-2000) is completely object-oriented.
Foundations Figure 1 addresses the well-known problem of order management and proposes a first activity diagram whose aim is twofold: it presents a possible formalization of the process, and it also introduces many of the concepts supplied by these diagrams. Each atomic step is called an action, with an initial node and activity final nodes to delimit their ordering as sequences, parallel threads, or conditional flows. A fork splits a single execution thread into a set of parallel ones, while a join, along with an optional join specification to constrain the unification, is used to re-synchronize the different threads into a single execution. Similarly, a decision creates alternative paths, and a merge re-unifies them. To avoid misunderstandings, each path must be decorated with the condition, in brackets, that must be verified to make the execution take that path. The diagram of Fig. 1 also exemplifies the use of connectors to render flows/edges that might tangle the representation. This is just one possibility, but it is a convenient way to avoid drawing flows that cross other elements or wander around the diagram. Another key feature is the use of a rake to indicate that action Fill Order is actually an activity invocation, and hides a hierarchical decomposition of actions into activities. Besides the control flow, activity diagrams can also show the data/object flow among the actions. The use of object nodes allows users to state the artifacts exchanged between two actions, even if they are not directly connected by an edge. In many cases
Activity Diagrams. Figure 1. Example activity diagram.
control and object flows coincide, but this is not mandatory. Activities can also comprise input and output parameters to render the idea that the activity’s execution initiates when the inputs are available, and produces some outputs. For example, activity Fill Order of Fig. 2, which can be seen as a refinement of the invocation in Fig. 1, requires that at least one Request be present, but then it considers the parameter as a stream, and produces Shipment Information and Rejected Items. While the first outcome is the ‘‘normal’’ one, the second object is produced only in case of exceptions (rendered with a small triangle on both the object and the flow that produces it). In a stream, the flow is annotated from action Compose Requests to the join with its weight to mean that the subsequent processing must consider all the requests received when the composition starts. The execution can also consider signals as enablers or outcomes of special-purpose actions. For example, Fig. 2 shows the use of an accept signal, to force that the composition of orders (Compose Orders) must be initiated by an external event, a time signal, to make the execution wait for a given timeframe (be it absolute or relative), and a send signal, to produce a notification to the customer as soon as the action starts. Basic diagrams can also be enriched with swimlanes to partition the different actions with respect to their
responsibilities. Figure 3 shows a simple example: The primitive actions are the same as those of Fig. 1, but now they are associated with the three players in charge of activating the behaviors in the activity. The standard also supports hierarchical and multi-dimensional partitioning, that is, hierarchies of responsible actors or matrix-based partitions. The Warehouse can also receive Cancel Order notifications to asynchronously interrupt the execution as soon as the external event arrives. This is obtained by declaring an interruptible region, which contains the accept signal node and generates the interrupt that stops the computation in that region and moves the execution directly to action Cancel Order by means of an interrupting edge. More generally, this is a way to enrich diagrams with specialized exception handlers similarly to many modern programming and workflow languages. The figure also introduces pins as a compact way to render the objects exchanged between actions: empty boxes correspond to discrete elements, while filled ones refer to streams. The discussion thus far considers the case in which the outcome of an action triggers a single execution of another action, but in some cases conditions may exist in which the "token" is structured and a single result triggers multiple executions of the same action. For example, if the example of Fig. 1 were slightly modified and after receiving an order, the user wants to check the items in it, a single execution of action
Activity Diagrams. Figure 2. Activity diagrams with signals.
Activity Diagrams. Figure 3. Example swimlanes.
Receive Order would trigger multiple executions of Validate Item. This situation is depicted in the
left-hand side of Fig. 4, where the star * conveys the information described so far. The same problem can be addressed in a more complete way (right-hand side of the figure) with an expansion region. The two arrays are supposed to
store the input and output elements. In some cases, the number of input and output tokens is the same, but it might also be the case that the behavior in the region filters the incoming elements. In the left-hand side of Fig. 4, it is assumed that some items are accepted and fill the output array, while others are rejected and thus their execution flow ends
Activity Diagrams. Figure 4. Expansion region.
there. This situation requires that a flow final be used to state that only the flow is ended and not the whole activity. Flow final nodes are a means to interrupt particular flows in this kind of region, but also in loops or other similar cases. The execution leaves an expansion region as soon as all the output tokens are available, that is, as soon as all the executions of the behavior embedded in the region are over. Notice that these executions can be carried out either concurrently (by annotating the rectangle with stereotype concurrent) or iteratively (with stereotype iterative). The next action considers the whole set of tokens as a single entity. Further details about exceptions and other advanced elements, like pre- and post-conditions associated with single actions or whole activities, central buffers, and data stores, are not discussed here, but the reader is referred to [11] for a thorough presentation.
Key Applications Activity diagrams are usually employed to describe complex behaviors. This means that they are useful to model tangled processes, describe the actions that need to take place and when they should occur in use cases, render complicated algorithms, and model applications with parallel and alternative flows. Nowadays, these needs belong to ICT specialists, like software engineers, requirements experts, and information systems architects, but also to experts in other fields (e.g., business analysts or production engineers) that need this kind of graphical notation to describe their solutions. Activity diagrams can be used in isolation, when the user needs a pure control (data) flow notation, but they can also be adopted in conjunction with other modeling techniques such as interaction diagrams, state diagrams, or other UML diagrams. However,
activity diagrams should not take the place of other diagrams. For example, even if the border between activity and state diagrams is sometimes blurred, activity diagrams provide a procedural decomposition of the problem under analysis, while state diagrams mostly concentrate on how studied elements behave. Moreover, activity diagrams do not give details about how objects behave or how they collaborate.
Cross-references
▶ Unified Modeling Language ▶ Web Services ▶ Workflow modeling
Recommended Reading
1. Arlow J. and Neustadt I. UML 2 and the Unified Process: Practical Object-Oriented Analysis and Design, 3rd edn. Addison-Wesley, Reading, MA, 2005.
2. Booch G., Rumbaugh J., and Jacobson I. The Unified Modeling Language User Guide, 2nd edn. Addison-Wesley, Reading, MA, 2005.
3. Fowler M. UML Distilled: A Brief Guide to the Standard Object Modeling Language, 3rd edn. Addison-Wesley, Reading, MA, 2003.
4. Gane C. and Sarson T. Structured System Analysis. Prentice-Hall, Englewood Cliffs, NJ, 1979.
5. Gaudin E., Najm E., and Reed R. In Proceedings of SDL 2007: Design for Dependable Systems, 13th International SDL Forum, LNCS, vol. 4745, Springer, 2007.
6. Goldstine H. The Computer from Pascal to Von Neumann. Princeton University Press, Princeton, NJ, 1972, pp. 266–267.
7. Harel D. and Naamad A. The STATEMATE Semantics of Statecharts. ACM Trans. Softw. Eng. Methodol., 5(4):293–333, 1996.
8. Hopcroft J. and Ullman J. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, MA, 2002.
9. Murata T. Petri Nets: Properties, Analysis, and Applications. Proc. IEEE, 77(4):541–580, 1989.
10. Object Management Group, http://www.omg.org/
11. OMG, Unified Modeling Language, http://www.uml.org/
Actors/Agents/Roles
Nathaniel Palmer
Workflow Management Coalition, Hingham, MA, USA
Synonyms Workflow participant; Player; End user; Work performer
Definition A resource that performs the work represented by a workflow activity instance. This work is normally manifested as one or more work items assigned to the workflow participant via the worklist.
Key Points These terms are normally applied to a human resource but it could conceptually include machine-based resources such as an intelligent agent. Where an activity requires no human resource and is handled automatically by a computer application, the normal terminology for the machine-based resource is Invoked Application. An Actor, Agent or Role may be identified directly within the business process definition, or (more normally) is identified by reference within the process definition to a role, which can then be filled by one or more of the resources available to the workflow system to operate in that role during process enactment.
Cross-references ▶ Activity ▶ Workflow Model
Ad hoc Retrieval models ▶ Information Retrieval Models
Adaptation ▶ Mediation
Adaptive Database Replication ▶ Autonomous Replication
Adaptive Interfaces
Maristella Matera
Politecnico di Milano University, Milan, Italy
Synonyms Context-aware interfaces; Personalized interfaces
Definition A specific class of user interfaces that are able to change in some way in response to different characteristics of the user, of the usage environment or of the task the user is supposed to accomplish. The aim is to improve the user’s experience, by providing both interaction mechanisms and contents that best suit the specific situation of use.
Key Points There are a number of ways in which interface adaptivity can be exploited to support user interaction. The interaction dimensions that are adapted vary among functionality (e.g., error correction or active help), presentation (user presentation of input to the system, system presentation of information to the user), and user tasks (e.g., task simplification based on the user’s capabilities). Adaptivity along such dimensions is achieved by capturing and representing into some models a number of characteristics: the user’s characteristics (preferences, experience, etc.); the tasks that the user accomplishes through the system; the characteristics of the information with which the user must be provided. Due to current advances in communication and network technologies, adaptivity is now gaining momentum. Different types of mobile devices indeed offer support to access – at any time, from anywhere, and with any media – services and contents customized to the users’ preferences and usage environments. In this new context, content personalization, based on user profile, has demonstrated its benefits for both users and content providers and has been
commonly recognized as a fundamental factor for augmenting the overall effectiveness of applications. Going one step further, the new challenge in adaptive interfaces is now context-awareness. It can be interpreted as a natural evolution of personalization, addressing not only the user's identity and preferences, but also the environment that hosts users, applications, and their interaction, i.e., the context. Context-awareness, hence, aims at enhancing an application's usefulness by taking into account a wide range of properties of the context of use.
Cross-references ▶ Visual Interaction
Adaptive Message-Oriented Middleware ▶ Adaptive Middleware for Message Queuing Systems
Adaptive Metric Techniques ▶ Learning Distance Measures
Adaptive Middleware for Message Queuing Systems
Christophe Taton 1, Noel De Palma 1, Sara Bouchenak 2
1 INPG - INRIA, Grenoble, France
2 University of Grenoble I - INRIA, Grenoble, France
Historical Background The use of message-oriented middleware (MOM) in the context of the Internet has evidenced a need for highly scalable and highly available MOMs. A very promising approach to the above issue is to implement performance management as autonomic software. The main advantages of this approach are: (i) Providing high-level support for deploying and configuring applications reduces errors and administrators' effort. (ii) Autonomic management allows the required reconfigurations to be performed without human intervention, thus improving system reactivity and saving administrators' time. (iii) Autonomic management is a means to save hardware resources, as resources can be allocated only when required (dynamically upon failure or load peak) instead of pre-allocated. Several parameters may impact the performance of MOMs. Self-optimization makes use of these parameters to improve the performance of the MOM. The proposed self-optimization approach is based on a queue clustering solution: a clustered queue is a set of queues, each running on a different server and sharing clients. Self-optimization takes place in two parts: (i) the optimization of the clustered queue load balancing and (ii) the dynamic provisioning of queues in the clustered queue. The first part allows the overall improvement of the clustered queue performance, while the second part optimizes the resource usage inside the clustered queue. Thus the idea is to create an autonomic system that fairly distributes client connections among the queues belonging to the clustered queue and dynamically adds and removes queues in the clustered queue depending on the load. This allows the system to use an adequate number of queues at any time.
Synonyms Autonomous message queuing systems; Adaptive message-oriented middleware; Autonomous message-oriented middleware
Definition Distributed database systems are usually built on top of middleware solutions, such as message queuing systems. Adaptive message queuing systems are able to improve the performance of such a middleware through load balancing and queue provisioning.
Foundations
Clustered Queues
A queue is a staging area that contains messages which have been sent by message producers and are waiting to be read by message consumers. A message is removed from the queue once it has been read. For scalability purpose, a queue can be replicated forming a clustered queue. The clustered queue feature provides a load balancing mechanism. A clustered queue is a cluster of queues (a given number of queue destinations knowing each other) that are able to exchange
messages depending on their load. Each queue of a cluster periodically reevaluates its load factor and sends the result to the other queues of the cluster. When a queue hosts more messages than it is authorized to do, and according to the load factors of the cluster, it distributes the extra messages to the other queues. When a queue is requested to deliver messages but is empty, it requests messages from the other queues of the cluster. This mechanism guarantees that no queue is hyper-active while some others are lazy, and tends to distribute the work load among the servers involved in the cluster.
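A minimal sketch of the load-balancing exchange just described, assuming each queue can see the others' current load factors; the capacities, thresholds, and message counts are invented, and this is not the actual middleware code.

```python
# Illustrative sketch of clustered-queue load exchange (not real MOM code).
class Queue:
    def __init__(self, name, capacity):
        self.name, self.capacity, self.messages = name, capacity, []

    def load_factor(self):
        return len(self.messages) / self.capacity

def rebalance(cluster):
    # Flooded queues hand their extra messages to the least loaded queue.
    for q in cluster:
        while len(q.messages) > q.capacity:
            target = min(cluster, key=Queue.load_factor)
            if target is q:
                break
            target.messages.append(q.messages.pop())

q1, q2 = Queue("Q1", capacity=3), Queue("Q2", capacity=3)
q1.messages = ["m%d" % i for i in range(5)]   # Q1 is over its threshold
rebalance([q1, q2])
print(q1.load_factor(), q2.load_factor())     # load is now spread out
```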
Clustered Queue Performance
Clustered queues are standard queues that share a common pool of message producers and consumers, and that can exchange messages to balance the load. All the queues of a clustered queue are supposed to be directly connected to each other. This allows message exchanges between the queues of a cluster in order to empty flooded queues and to fill draining queues. The clustered queue Qc is connected to Nc message producers and to Mc message consumers. Qc is composed of standard queues Qi (i ∈ [1..k]). Each queue Qi is in charge of a subset of Ni message producers and of a subset of Mi message consumers:
Nc = Σi Ni and Mc = Σi Mi
The distribution of the clients between the queues Qi is described as follows: xi (resp. yi) is the fraction of message producers (resp. consumers) that are directed to Qi:
Ni = xi Nc and Mi = yi Mc, with Σi xi = 1 and Σi yi = 1
The standard queue Qi to which a consumer or producer is directed cannot be changed after the client connection to the clustered queue. This way, the only action that may affect the client distribution among the queues is the selection of an adequate queue when the client connection is opened. The clustered queue Qc is characterized by its aggregate message production rate pc and its aggregate message consumption rate cc. The clustered queue Qc also has a virtual clustered queue length lc that aggregates the length of all contained standard queues:
lc = Σi li and Δlc = pc − cc, where pc = Σi pi and cc = Σi ci
The clustered queue length lc obeys the same law as a standard queue:
1. Qc is globally stable when Δlc = 0. This configuration ensures that the clustered queue is globally stable. However, Qc may observe local instabilities if one of its queues is draining or is flooded.
2. If Δlc > 0, the clustered queue will grow and eventually saturate; then message producers will have to wait.
3. If Δlc < 0, the clustered queue will shrink until it is empty; then message consumers will also have to wait.
Now, considering that the clustered queue is globally stable, several scenarios that illustrate the impact of client distribution on performance are given below. Optimal client distribution of the clustered queue Qc is achieved when clients are fairly distributed among the k queues Qi. Assuming that all queues and hosts have equivalent processing capabilities and that all producers (resp. consumers) have equivalent message production (resp. consumption) rates (and that all produced messages are equivalent: message cost is uniformly distributed), this means that:
Ni = Nc / k, Mi = Mc / k, xi = 1/k, yi = 1/k
In these conditions, all queues Qi are stable and the queue cluster is balanced. As a consequence, there are no internal queue-to-queue message exchanges, and performance is optimal. Queue clustering then provides a quasi-linear speedup. The worst client distribution appears when one queue only has message producers or only has message consumers. In the example depicted in Fig. 1, this is realized when:
x1 = 1, y1 = 0 (N1 = Nc, M1 = 0) and x2 = 0, y2 = 1 (N2 = 0, M2 = Mc)
Indeed, this configuration implies that the whole message production is directed to queue Q1. Q1 then forwards all messages to Q2, which in turn delivers messages to the message consumers. Local instability is observed when some queues Qi of Qc are unbalanced. This is characterized by a
mismatch between the fraction of producers and the fraction of consumers directed to Qi: xi ≠ yi. In the example shown in Fig. 1, Qc is composed of two standard queues Q1 and Q2. A scenario of local instability can be envisioned with the following client distribution:
x1 = 2/3, y1 = 1/3 and x2 = 1/3, y2 = 2/3
This distribution implies that Q1 is flooding and will have to enqueue messages, while Q2 is draining and will see its consumer clients wait. However, the queue cluster Qc ensures the global stability of the system thanks to internal message exchanges from Q1 to Q2. A stable and unfair distribution can be observed when the clustered queue is globally and locally stable, but the load is unfairly balanced within the queues. This happens when the client distribution is nonuniform. In the example presented in Fig. 1, this can be realized by directing more clients to Q1 than Q2:
x1 = 2/3, y1 = 2/3 and x2 = 1/3, y2 = 1/3
In this scenario, queue Q1 processes two thirds of the load, while queue Q2 only processes one third. Such a situation can lead to bad performance since Q1 may saturate while Q2 is lazy.
Adaptive Middleware for Message Queuing Systems. Figure 1. Clustered queue Qc.
It is worthwhile to indicate that these scenarios may all happen since clients join and leave the system in an uncontrolled way. Indeed, the global stability of a (clustered) queue is under the responsibility of the application developer. For instance, the queue can be flooded for a period; it is assumed that it will get inverted and draining afterwards, thus providing global stability over time.
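The sketch below restates the global and local stability conditions in code, using the example fractions from above; the function names and tolerance are illustrative only.

```python
# Illustrative check of global and local stability for a clustered queue.
def globally_stable(production_rates, consumption_rates, tol=1e-9):
    # Delta l_c = p_c - c_c must be (close to) zero.
    return abs(sum(production_rates) - sum(consumption_rates)) <= tol

def locally_stable(x, y, tol=1e-9):
    # Every queue must see matching producer and consumer fractions.
    return all(abs(xi - yi) <= tol for xi, yi in zip(x, y))

# Local-instability example from the text: x = (2/3, 1/3), y = (1/3, 2/3).
print(locally_stable([2/3, 1/3], [1/3, 2/3]))   # False: Q1 floods, Q2 drains
# Stable but unfair example: x = y = (2/3, 1/3) -- stable, but load is skewed.
print(locally_stable([2/3, 1/3], [2/3, 1/3]))   # True
```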
Provisioning
The previous scenario of stable and non-optimal distribution raises the question of the capacity of a queue. The capacity Ci of standard queue Qi is expressed as an optimal number of clients. The queue load Li is then expressed as the ratio between its current number of clients and its capacity:
Li = (Ni + Mi) / Ci
1. Li < 1: queue Qi is underloaded and thus lazy; the message throughput delivered by the queue can be improved and resources are wasted.
2. Li > 1: queue Qi is overloaded and may saturate; this induces a decreased message throughput and eventually leads to thrashing.
3. Li = 1: queue Qi is fairly loaded and delivers its optimal message throughput.
These parameters and indicators are transposed to queue clusters. The clustered queue Qc is characterized by its aggregated capacity Cc and its global load Lc:
Cc = Σi Ci and Lc = (Nc + Mc) / Cc = (Σi Li Ci) / (Σi Ci)
The load of a clustered queue obeys the same law as the load of a standard queue. However, a clustered queue allows one to control k, the number of standard queues inside it, and thus to control its aggregated capacity Cc = Σ(i=1..k) Ci. This control is operated through a re-evaluation of the clustered queue provisioning.
1. When Lc < 1, the clustered queue is underloaded: if the client distribution is optimal, then all the standard queues inside the cluster will be underloaded. However, as the client distribution may be non-optimal, some of the single queues may be overloaded, even if the cluster is globally lazy. If the load is too low, then some queues may be removed from the cluster.
2. When Lc > 1, the clustered queue is overloaded: even if the distribution of clients over the queues is optimal, there will exist at least one standard queue that will be overloaded. One way to handle this case is to re-provision the clustered queue by inserting one or more queues into the cluster.
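A sketch of the provisioning check described above; the capacity values and the decision thresholds are assumptions for illustration, not the system's actual tuning.

```python
# Illustrative provisioning logic for a clustered queue (simplified).
def queue_load(n_producers, n_consumers, capacity):
    return (n_producers + n_consumers) / capacity          # L_i

def cluster_load(loads, capacities):
    # L_c = sum_i L_i C_i / sum_i C_i = (N_c + M_c) / C_c
    return sum(l * c for l, c in zip(loads, capacities)) / sum(capacities)

def provision(loads, capacities, grow=1.0, shrink=0.5):
    lc = cluster_load(loads, capacities)
    if lc > grow:
        return "add a queue to the cluster"        # cf. rule (R5) below
    if lc < shrink and len(capacities) > 1:
        return "remove a queue from the cluster"   # cf. rule (R6) below
    return "keep the current size"

loads = [queue_load(40, 50, 60), queue_load(10, 15, 60)]
print(provision(loads, [60, 60]))
```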
Control Rules for a Self-Optimizing Clustered Queue
The global client distribution D of the clustered queue Qc is captured by the fractions of message producers xi and consumers yi. The optimal client distribution Dopt is realized when all queues are stable (∀i: xi = yi) and when the load is fairly balanced over all queues (∀i,j: xi = xj and yi = yj). This implies that the optimal distribution is reached when xi = yi = 1/k:
D = [x1 y1; ...; xk yk],  Dopt = [1/k 1/k; ...; 1/k 1/k]
Local instabilities are characterized by a mismatch between the fraction of message producers xi and consumers yi on a standard queue. The purpose of these rules is the stability of all standard queues, so as to minimize internal queue-to-queue message transfer.
1. (R1) xi > yi: Qi is flooding with more message production than consumption and should then seek more consumers and/or fewer producers.
2. (R2) xi < yi: Qi is draining with more message consumption than production and should then seek more producers and/or fewer consumers.
Load balancing rules control the load applied to a single standard queue. The goal is then to enforce a fair load balancing over all queues.
1. (R3) Li > 1: Qi is overloaded and should avoid accepting new clients as it may degrade its performance.
2. (R4) Li < 1: Qi is underloaded and should request more clients so as to optimize resource usage.
Global provisioning rules control the load applied to the whole clustered queue. These rules target the optimal size of the clustered queue while the load applied to the system evolves.
1. (R5) Lc > 1: the queue cluster is overloaded and requires an increased capacity to handle all its clients in an optimal way.
2. (R6) Lc < 1: the queue cluster is underloaded and could accept a decrease in capacity.
Key Applications
Adaptive middleware for message queuing systems helps build autonomous distributed systems that improve their performance while minimizing their resource usage, such as distributed Internet services and distributed information systems.
Cross-references
▶ Distributed Database Systems ▶ Distributed DBMS ▶ Message Queuing Systems
Recommended Reading
1. Aron M., Druschel P., and Zwaenepoel W. Cluster reserves: a mechanism for resource management in cluster-based network servers. In Proc. 2000 ACM SIGMETRICS Int. Conf. on Measurement and Modeling of Comp. Syst., 2000, pp. 90–101.
2. Menth M. and Henjes R. Analysis of the message waiting time for the FioranoMQ JMS server. In Proc. 23rd Int. Conf. on Distributed Computing Systems, 2006.
3. Shen K., Tang H., Yang T., and Chu L. Integrated resource management for cluster-based internet services. In Proc. 5th USENIX Symp. on Operating System Design and Implementation, 2002.
4. Urgaonkar B. and Shenoy P. Sharc: managing CPU and network bandwidth in shared clusters. IEEE Trans. Parall. Distrib. Syst., 15(1):2–17, 2004.
5. Zhu H., Tang H., and Yang Y. Demand-driven service differentiation in cluster-based network servers. In Proc. 20th Annual Joint Conf. of the IEEE Computer and Communications Societies, vol. 2, 2001, pp. 679–688.
Adaptive Query Optimization ▶ Adaptive Query Processing
Adaptive Query Processing
Evaggelia Pitoura
University of Ioannina, Ioannina, Greece
Synonyms Adaptive query optimization; Eddies; Autonomic query processing
Definition While in traditional query processing, a query is first optimized and then executed, adaptive query processing techniques use runtime feedback to modify query processing in a way that provides better response time, more efficient CPU utilization or more useful incremental results. Adaptive query processing makes query processing more robust to optimizer mistakes, unknown statistics, and dynamically changing data, runtime and workload characteristics. The spectrum of adaptive query processing techniques is quite broad: they may span the executions of multiple queries or adapt within the execution of a single query; they may affect the query plan being executed or just the scheduling of operations within the plan.
Key Points Conventional query processing follows an optimize-then-execute strategy: after generating alternative query plans, the query optimizer selects the most cost-efficient among them and passes it to the execution engine that directly executes it, typically with little or no runtime decision-making. As queries become more complex, this strategy faces many limitations such as missing statistics, unexpected correlations, and dynamically changing data, runtime, and workload characteristics. These problems are aggravated in the case of long-running queries over data streams as well as in the case of queries over multiple potentially heterogeneous data sources across wide-area networks. Adaptive query processing tries to address these shortcomings by using feedback during query execution to tune query processing. The goal is to increase throughput, improve response time, or provide more useful incremental results. To implement adaptivity, regular query execution is supplemented with a control system for monitoring and analyzing at run-time various parameters that affect query execution. Based on this analysis, certain decisions are made about how the system behavior should be changed. Clearly, this may introduce considerable overheads. The complete space of adaptive query processing techniques is quite broad and varied. Adaptability may be applied to query execution of multiple queries or just a single one. It may also affect the whole query plan being executed or just the scheduling of operations within the plan. Adaptability techniques also differ on how much they interleave plan generation and
A
execution. Some techniques interleave planning and execution just a few times, by just having the plan re-optimized at specific points, whereas other techniques interleave planning and execution to the point where they are not even clearly distinguishable. A number of fundamental adaptability techniques include:
Horizontal partitioning, where different plans are used on different portions of the data. Partitioning may be explicit or implicit in the functioning of the operator.
Query execution by tuple routing, where query execution is treated as the process of routing tuples through operators and adaptability is achieved by changing the order in which tuples are routed.
Plan partitioning, where execution progresses in stages, by interleaving optimization and execution steps at a number of well-defined points during query execution.
Runtime binding decisions, where certain plan choices are deferred until runtime, allowing the execution engine to select among several alternative plans by potentially re-invoking the optimizer.
In-operator adaptive logic, where scheduling and other decisions are made part of the individual query operators, rather than the optimizer.
Many adaptability techniques rely on a symmetric hash join operator that offers a non-blocking variant of join by building hash tables on both the input relations. When an input tuple is read, it is stored in the appropriate hash table and probed against the opposite table, thus producing incremental output. The symmetric hash join operator can process data from either input, depending on availability. It also enables additional adaptivity, since it has frequent moments of symmetry, that is, points at which the join order can be changed without compromising correctness or losing work. The eddy operator provides an example of fine-grained run-time control by tuple routing through operators. An eddy is used as a tuple router; it monitors execution, and makes routing decisions for the tuples. Eddies achieve adaptability by simply changing the order in which the tuples are routed through the operators. The degree of adaptability achieved depends on the type of the operators. Pipelined operators, such as the symmetric hash join, offer the most freedom, whereas blocking operators, such as the sort-merge
join, are less suitable since they do not produce output before consuming the input relations in their entirety.
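The paragraphs above describe the symmetric (pipelined) hash join; the sketch below is a minimal single-threaded rendition of that idea, with invented tuple and key names, and is not any particular system's implementation.

```python
# Minimal symmetric hash join: both inputs are hashed, output is incremental.
from collections import defaultdict

class SymmetricHashJoin:
    def __init__(self, key_left, key_right):
        self.key_left, self.key_right = key_left, key_right
        self.left_table = defaultdict(list)    # hash table on the left input
        self.right_table = defaultdict(list)   # hash table on the right input

    def insert_left(self, tuple_):
        k = tuple_[self.key_left]
        self.left_table[k].append(tuple_)      # build
        for match in self.right_table[k]:      # probe the opposite table
            yield {**tuple_, **match}

    def insert_right(self, tuple_):
        k = tuple_[self.key_right]
        self.right_table[k].append(tuple_)
        for match in self.left_table[k]:
            yield {**match, **tuple_}

join = SymmetricHashJoin(key_left="id", key_right="id")
list(join.insert_left({"id": 1, "name": "a"}))          # no output yet
print(list(join.insert_right({"id": 1, "price": 10})))  # joined tuple emitted
```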
Cross-references
▶ Adaptive Stream Processing ▶ Cost Estimation ▶ Multi-query Optimization ▶ Query Optimization ▶ Query Processing
Recommended Reading
1. Avnur R. and Hellerstein J.M. Eddies: continuously adaptive query processing. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2000, pp. 261–272.
2. Babu S. and Bizarro P. Adaptive query processing in the looking glass. In Proc. 2nd Biennial Conf. on Innovative Data Systems Research, 2005, pp. 238–249.
3. Deshpande A., Ives Z.G., and Raman V. Adaptive query processing. Found. Trends Databases, 1(1):1–140, 2007.
Adaptive Stream Processing
Zachary Ives
University of Pennsylvania, Philadelphia, PA, USA
Synonyms Adaptive query processing
Definition When querying long-lived data streams, the characteristics of the data may change over time or data may arrive in bursts – hence, the traditional model of optimizing a query prior to executing it is insufficient. As a result, most data stream management systems employ feedback-driven adaptive stream processing, which continuously re-optimizes the query execution plan based on data and stream properties, in order to meet certain performance or resource consumption goals. Adaptive stream processing is a special case of the more general problem of adaptive query processing, with the special property that intermediate results are bounded in size (by stream windows), but where query processing may have quality-of-service constraints.
Historical Background The field of adaptive stream processing emerged in the early 2000s, as two separate developments converged. Adaptive techniques for database query processing had
become an area of increasing interest as Web and integration applications exceeded the capabilities of conventional static query processing [10]. Simultaneously, a number of data stream management systems [1,6,8,12] were emerging, and each of these needed capabilities for query optimization. This led to a common approach of developing feedback-based re-optimization strategies for stream query computation. In contrast to Web-based adaptive query processing techniques, the focus in adaptive stream processing has especially been on maintaining quality of service under overload conditions.
Foundations Data stream management systems (DSMSs) typically face two challenges in query processing. First, the data to be processed comes from remote feeds that may be subject to significant variations in distribution or arrival rates over the lifetime of the query, meaning that no single query evaluation strategy may be appropriate over the entirety of execution. Second, DSMSs may be underprovisioned in terms of their ability to handle bursty input at its maximum rate, and yet may still need to meet certain quality-of-service or resource constraints (e.g., they may need to ensure data is processed within some latency bound). These two challenges have led to two classes of adaptive stream processing techniques: those that attempt to minimize the cost of computing query results from the input data (the problem traditionally faced by query optimization), and those that attempt to manage query processing, possibly at reduced accuracy, in the presence of limited resources. This article provides an overview of significant work in each area. Minimizing Computation Cost
The problem of adaptive query processing to minimize computation cost has been well-studied in a variety of settings [10]. What makes the adaptive stream processing setting unique (and unusually tractable) is the fact that joins are performed over sliding windows with size bounds: As the data stream exceeds the window size, old data values are expired. This means intermediate state within a query plan operator has constant maximum size; as opposed to being bounded by the size of the input data. Thus a windowed join operator can be modeled as a pair of filter operators, each of which joins its input with the bounded intermediate state produced from the other input. Optimization of joins
in data stream management systems becomes a minor variation on the problem of optimizing selection or filtering operators; hence certain theoretical optimality guarantees can actually be made. Eddies
Eddies [2,11,14] are composite dataflow operators that model select-project-join expressions. An eddy consists of a tuple router, plus a set of primitive query operators that run concurrently and each have input queues. Eddies come in several variations; the one proposed for distributed stream management uses state modules (SteMs) [14,11]. Figure 1 shows an example of such an eddy for a simplified stream SQL query, which joins three streams and applies a selection predicate over them. Eddy creation. The eddy is created prior to execution by an optimizer: every selection operation (sP in the example) is converted to a corresponding operator; additionally, each base relation to be joined is given a state module, keyed on the join attribute, to hold the intermediate state for each base relation [14] (⋈R, ⋈S, ⋈T). If a base relation appears with multiple different join attributes, then it may require multiple SteMs. In general, the state module can be thought of as one of the hash tables within a symmetric or pipelined hash join. The optimizer also determines whether the semantics of the query force certain operators to execute before others. Such constraints are expressed in an internal routing table, illustrated on the right side of the figure. As a tuple is processed, it is annotated with a tuple signature specifying what input streams’ data it contains and what operator may have last modified it. The routing table is a map from the tuple signature to a set of valid routing destinations, those operators that can successfully process a tuple with that particular signature.
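A toy rendition of the ticket-based lottery routing policy described below; the operator names, selectivities, and router loop are all invented for illustration, and this is not the actual eddy implementation.

```python
# Toy lottery-based eddy routing: operators that filter more tuples keep more
# tickets and therefore win the routing lottery more often. Illustrative only.
import random

class Operator:
    def __init__(self, name, selectivity):
        self.name, self.selectivity, self.tickets = name, selectivity, 1

    def process(self, tuple_):
        self.tickets += 1                      # ticket on receiving a tuple
        if random.random() < self.selectivity:
            self.tickets -= 1                  # ticket returned on output
            return tuple_                      # tuple survives the operator
        return None                            # tuple filtered out

def route(tuple_, pending):
    # Choose among the operators the tuple still has to visit.
    weights = [op.tickets for op in pending]
    return random.choices(pending, weights=weights, k=1)[0]

ops = [Operator("sigma_P", 0.2), Operator("join_R", 0.9)]
for i in range(1000):
    t, pending = {"val": i}, list(ops)
    while t is not None and pending:
        op = route(t, pending)
        pending.remove(op)
        t = op.process(t)
print({op.name: op.tickets for op in ops})   # the selective operator wins more
```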
A
Query execution/tuple routing. Initially, a tuple from an input data stream (R, S, or T) flows into the eddy router. The eddy (i) adds the data to the associated SteM or SteMs, and (ii) consults the routing table to determine the set of possible destination operators. It then chooses a destination (using a policy to be described later) and sends the tuple to the operator. The operator then either filters the tuple, or produces one or more output tuples, as a result of applying selection conditions or joining with the data within a SteM. Output tuples are marked as having been processed by the operator that produced them. If they have been processed by all operators, they will be sent to the query output, and if not, they will be sent back to the eddy’s router and to one of the remaining operators. Routing policies. The problem of choosing among alternate routing destinations has been addressed with a variety of strategies. Tickets and lottery scheduling [2]. In this scheme, each operator receives a ticket for each tuple it receives from the router, and it returns the ticket each time it outputs a tuple to the router. Over time, each operator is expected to have a number of tickets proportional to (1 p) where p is the operator’s selectivity. The router holds a lottery among valid routing destinations, where each operator’s chance of winning is proportional to its number of tickets. Additionally, as a flow control mechanism, each operator has an input queue, and if this queue fills, then the operator may not participate in the lottery. Deterministic with batching [9]. A later scheme was developed to reduce the per-tuple overhead of eddies by choosing destinations for batches of tuples. Here, each operator’s selectivity is explicitly monitored and each predicate is assumed to be independent. Periodically, a rank ordering algorithm is used to choose a
Adaptive Stream Processing. Figure 1. Illustration of eddy with SteMs.
destination for a batch of tuples: the rank ordering algorithm sorts predicates in decreasing order of ci / (1 − pi), where ci is the cost of applying predicate σi and pi is its selectivity. Content-based routing (CBR) [7] attempts to learn correlations between attribute values and selectivities. Using sampling, the system determines for each operator the attribute most strongly correlated with its selectivity – this is termed the classifier attribute. CBR then builds a table characterizing all operators' selectivities for different values of each classifier attribute. Under this policy, when the eddy needs to route a tuple, it first looks up the tuple's classifier attribute values in the table and determines the destination operators' selectivities. It routes the tuple probabilistically, choosing a next operator with probability inversely proportional to its selectivity. Other optimization strategies. An alternative strategy that does not use the eddies framework is the adaptive greedy [5] (A-greedy) algorithm. A-greedy continuously monitors the selectivities of query predicates using a sliding window profile, a table with one Boolean attribute for each predicate in the query, and sampling. As a tuple is processed by the query, it may be chosen for sampling into the sliding window profile – if so, it is tested against every query predicate. The vector of Boolean results is added as a row to the sliding window profile. The sliding window profile is then used to create a matrix view V[i, j] containing, for each predicate σi, the number of tuples in the profile that satisfy σ1...σi−1 but not σj. From this matrix view, the re-optimizer seeks to maintain the constraint that the ith operation over an input tuple must have the lowest cost/selectivity ratio ci / (1 − P(Si | S1,...,Si−1)). The overall strategy has one of the few performance guarantees in the adaptive query processing space: if data properties were to converge, then performance would be within a factor of 4 of optimal [5].
Managing Resource Consumption
A common challenge in data stream management systems is limiting the use of resources – or accommodating limited resources while maintaining quality of service, in the case of bursty data. We discuss three different problems that have been studied: load shedding to ensure input data is processed by the CPU as fast as it arrives, minimizing buffering and
memory consumption during data bursts, and minimizing network communication with remote streaming sites. Load Shedding. Allows the system to selectively drop data items to ensure it can process data as it arrives. Both the Aurora and STREAM DSMSs focused heavily on adaptive load shedding. Aurora. In the Aurora DSMS [15], load shedding for a variety of query types is supported: the main requirement is that the user has a utility function describing the value of output data relative to how much of it has been dropped. The system seeks to place load shedding operators in the query plan in a way that maximizes the user's utility function while the system achieves sufficient throughput. Aurora precomputes conditional load shedding plans, in the form of a load shedding road map (LRSM) containing a sequence of plans that shed progressively more load; this enables the runtime system to rapidly move to strategies that shed more or less load. LRSMs are created using the following heuristics: first, load shedding points are only inserted at data input points or at points in which data is split to two or more operators. Second, for each load shedding point, a loss/gain ratio is computed: this is the reduction in output utility divided by the gain in cycles, R(pL − D), where R is the input rate into the drop point, p is the ratio of tuples to be dropped, L is the amount of system load flowing from the drop point, and D is the cost of the drop operator. Drop operators are injected at load shedding points in decreasing order of loss/gain ratio. Two different types of drops are considered using the same framework: random drop, in which an operator is placed in the query plan to randomly drop some fraction p of tuples; and semantic drop, which drops the p tuples of lowest utility. Aurora assumes for the latter case that there exists a utility function describing the relative worth of different attribute values. Stanford STREAM. The Stanford STREAM system [4] focuses on aggregate (particularly SUM) queries. Again the goal is to process data at the rate it arrives, while minimizing the inaccuracy in query answers: specifically, the goal is to minimize the maximum relative error across all queries, where the relative error of a query is the difference between actual and approximate value, divided by the actual value.
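A sketch of the loss/gain heuristic just described; the candidate drop points, utility losses, and rates are invented, the ranking direction is a simplified reading of the heuristic (most favorable utility-loss-per-cycle-gained first), and this is not Aurora's actual implementation.

```python
# Illustrative ranking of candidate drop points by an Aurora-style loss/gain ratio.
def gain_in_cycles(R, p, L, D):
    # R: input rate, p: fraction dropped, L: downstream load, D: drop cost.
    return R * (p * L - D)

def loss_gain_ratio(utility_loss, R, p, L, D):
    return utility_loss / gain_in_cycles(R, p, L, D)

candidates = [
    {"name": "input1", "loss": 0.10, "R": 500, "p": 0.2, "L": 4.0, "D": 0.1},
    {"name": "split2", "loss": 0.05, "R": 200, "p": 0.2, "L": 2.0, "D": 0.1},
]
# Insert drops at the points that lose the least utility per cycle gained.
plan = sorted(candidates,
              key=lambda c: loss_gain_ratio(c["loss"], c["R"], c["p"],
                                            c["L"], c["D"]))
for c in plan:
    print("insert drop at", c["name"])
```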
Adaptive Stream Processing
A Statistics Manager monitors computation and provides estimates of each operator’s selectivity and its running time, as well as the mean value and standard deviation of each query qi’s aggregate operator. For each qi, STREAM computes an error threshold Ci, based on the mean, standard deviation, and number of values. (The results are highly technical so the reader is referred to [4] for more details.) A sampling rate Pi is chosen for query qi that satisfies Pi Ci ∕ 2i, where 2i is the allowable relative error for the query. As in Aurora’s load shedding scheme, STREAM only inserts load shedding operators at the inputs or at the start of shared segments. Moreover, if a node has a set of children who all need to shed load, then a portion of the load shedding can be ‘‘pulled up’’ to the parent node, and all other nodes can be set to shed some amount of additional load relative to this. Based on this observation, STREAM creates a query dataflow graph in which each path from source to sink initially traverses through a load shedding operator whose sampling rate is determined by the desired error rate, followed by additional load shedding operators whose sampling rate is expressed relative to that first operator. STREAM iterates over each path, determines a sampling rate for the initial load shedding operator to satisfy the load constraint, and then computes the maximum relative error for any query. From this, it can set the load shedding rates for individual operators. Memory Minimization. STREAM also addresses the problem of minimizing the amount of space required to buffer data in the presence of burstiness [3]. The Chain algorithm begins by defining a progress chart for each operator in the query plan: this chart plots the relative size of the operator output versus the time it takes to compute. A point is plotted at time 0 with the full size of the input, representing the start of the query; then each operator is given a point according to its cost and relative output size. Now a lower envelope is plotted on the progress chart: starting with the initial point at time 0, the steepest line is plotted to any operator to the right of this point; from the point at the end of the first line, the next steepest line is plotted to a successor operator; etc. Each line segment (and the operators whose points are plotted beside it) represents a chain, and operators within a chain are scheduled together. During query processing, at each time ‘‘tick,’’
the scheduler considers all tuples that have been output by any chain. The tuple that lies on the segment with steepest slope is the one that is scheduled next; as a tiebreaker, the earliest such tuple is scheduled. This Chain algorithm is proven to be near-optimal (differing by at most one unit of memory per operator path for queries where selectivity is at most one).
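The lower-envelope construction and the steepest-slope scheduling rule just described can be sketched as follows. The progress-chart encoding (a list of (time, size) points per operator path) and the tuple objects with an arrival field are assumptions made for illustration; this sketch is not the Chain implementation from [3].

def build_chains(points):
    # Greedy lower-envelope construction over a progress chart.
    # points: list of (cumulative_time, relative_output_size) for one operator path,
    # starting with (0, 1.0) for the unprocessed input and one point per operator,
    # in operator order. Returns a list of (slope, points_in_chain) pairs.
    chains = []
    i = 0
    while i < len(points) - 1:
        t0, s0 = points[i]
        best_j, best_slope = None, None
        for j in range(i + 1, len(points)):
            t1, s1 = points[j]
            slope = (s0 - s1) / (t1 - t0)   # memory released per unit of processing time
            if best_slope is None or slope > best_slope:
                best_j, best_slope = j, slope
        chains.append((best_slope, points[i:best_j + 1]))
        i = best_j
    return chains

def pick_next(ready_tuples, chain_slope_of):
    # Chain scheduling rule: run the tuple sitting on the steepest chain segment;
    # break ties by arrival time (earliest first). chain_slope_of maps a tuple to the
    # slope of the chain its current operator belongs to (an assumed helper function).
    return max(ready_tuples, key=lambda t: (chain_slope_of(t), -t.arrival))

# Example: an input of size 1.0, an operator that takes 1 time unit and keeps 60%
# of its input, and a second operator that takes 2 more units and keeps 10%.
print(build_chains([(0, 1.0), (1, 0.6), (3, 0.1)]))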
Minimizing Communication. In some cases, the constrained resource is the network rather than CPU or memory. Olston et al. [13] develop a scheme for reducing network I/O for AVERAGE queries, by using accuracy bounds. Each remote object O is given a bound width wO: the remote site will only notify the central query processor if O's value V falls outside this bound. Meanwhile, the central site maintains a bound cache with the last value and the bound width for every object. Given a precision constraint δj for each query Qj, if the query processor is to provide query answers within δj, the sum of the bound widths for the data objects of Qj must not exceed δj times the number of objects. The challenge lies in the selection of widths for the objects. Periodically, the system tries to tighten all bounds, in case values have become more stable; objects whose values fall outside the new bounds get reported back to the central site. Now some of those objects' bounds must be loosened in a way that maintains the precision constraints over all queries. Each object O is given a burden score equal to cO ∕ (pO wO), where cO is the cost of sending the object, wO is its bound width, and pO is the frequency of updates since the previous width adjustment. Using an approximation method based on an iterative linear equation solver, Olston et al. compute a burden target for each query, i.e., the lowest overall burden score required to always meet the query's precision constraint. Next, each object is assigned a deviation, which is the maximum difference between the object's burden score and any query's burden target. Finally, queried objects' bounds are adjusted in decreasing order of deviation, and each object's bound is increased by the largest amount that still conforms to the precision constraint for every query.
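A small sketch of the bookkeeping behind these adaptive filters is given below. The object and query structures, the crude stand-in for the burden targets, and the fixed-step loosening are assumptions made for illustration; the actual algorithm of Olston et al. [13] derives burden targets with an iterative solver and grows each bound by the largest admissible amount.

from dataclasses import dataclass, field

@dataclass
class CachedObject:
    name: str
    width: float         # wO: current bound width
    send_cost: float     # cO: cost of shipping an updated value
    update_freq: float   # pO: updates observed since the last width adjustment

    def burden(self) -> float:
        # Burden score cO / (pO * wO): cheap-to-send, rarely updated, wide-bound
        # objects impose little burden on the system.
        return self.send_cost / (self.update_freq * self.width)

@dataclass
class Query:
    precision: float                     # δj: allowed average bound width per object
    objects: list = field(default_factory=list)

    def within_budget(self) -> bool:
        # Precision constraint: sum of bound widths <= δj times the number of objects.
        return sum(o.width for o in self.objects) <= self.precision * len(self.objects)

def loosen(objects, queries, step=1.1):
    # Grow bounds greedily, largest deviation first, while every query stays in budget.
    targets = [min(o.burden() for o in q.objects) for q in queries]  # stand-in for burden targets
    deviation = {o.name: max(o.burden() - t for t in targets) for o in objects}
    for o in sorted(objects, key=lambda ob: deviation[ob.name], reverse=True):
        old = o.width
        o.width = old * step
        if not all(q.within_budget() for q in queries if o in q.objects):
            o.width = old   # revert if any containing query would violate its precision constraint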
Key Applications Data stream management systems have seen significant adoption in areas such as sensor monitoring and processing of financial information. When there are
associated quality-of-service constraints that might require load shedding, or when the properties of the data are subject to significant change, adaptive stream processing becomes vitally important.
Future Directions One of the most promising directions of future study is how to best use a combination of offline modeling, selective probing (in parallel with normal query execution), and feedback from query execution to find optimal strategies quickly. Algorithms with certain optimality guarantees are being explored in the online learning and theory communities (e.g., the k-armed bandit problem), and such work may lead to new improvements in adaptive stream processing.
Cross-references
▶ Distributed Stream ▶ Query Processor ▶ Stream Processing

Recommended Reading
1. Abadi D.J., Carney D., Cetintemel U., Cherniack M., Convey C., Lee S., Stonebraker M., Tatbul N., and Zdonik S. Aurora: a new model and architecture for data stream management. VLDB J., 12(2):120–139, 2003.
2. Avnur R. and Hellerstein J.M. Eddies: continuously adaptive query processing. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2000, pp. 261–272.
3. Babcock B., Babu S., Datar M., and Motwani R. Chain: operator scheduling for memory minimization in data stream systems. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003, pp. 253–264.
4. Babcock B., Datar M., and Motwani R. Load shedding for aggregation queries over data streams. In Proc. 20th Int. Conf. on Data Engineering, 2004, p. 350.
5. Babu S., Motwani R., Munagala K., Nishizawa I., and Widom J. Adaptive ordering of pipelined stream filters. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2004, pp. 407–418.
6. Balazinska M., Balakrishnan H., and Stonebraker M. Demonstration: load management and high availability in the Medusa distributed stream processing system. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2004, pp. 929–930.
7. Bizarro P., Babu S., DeWitt D.J., and Widom J. Content-based routing: different plans for different data. In Proc. 31st Int. Conf. on Very Large Data Bases, 2005, pp. 757–768.
8. Chandrasekaran S., Cooper O., Deshpande A., Franklin M.J., Hellerstein J.M., Hong W., Krishnamurthy S., Madden S., Raman V., Reiss F., and Shah M.A. TelegraphCQ: continuous dataflow processing for an uncertain world. In Proc. 1st Biennial Conf. on Innovative Data Systems Research, 2003.
9. Deshpande A. An initial study of overheads of eddies. ACM SIGMOD Rec., 33(1):44–49, 2004.
10. Deshpande A., Ives Z., and Raman V. Adaptive query processing. Found. Trends Databases, 1(1):1–140, 2007.
11. Madden S., Shah M.A., Hellerstein J.M., and Raman V. Continuously adaptive continuous queries over streams. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2002, pp. 49–60.
12. Motwani R., Widom J., Arasu A., Babcock B., Babu S., Datar M., Manku G., Olston C., Rosenstein J., and Varma R. Query processing, resource management, and approximation in a data stream management system. In Proc. 1st Biennial Conf. on Innovative Data Systems Research, 2003.
13. Olston C., Jiang J., and Widom J. Adaptive filters for continuous queries over distributed data streams. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003, pp. 563–574.
14. Raman V., Deshpande A., and Hellerstein J.M. Using state modules for adaptive query processing. In Proc. 19th Int. Conf. on Data Engineering, 2003, pp. 353–366.
15. Tatbul N., Cetintemel U., Zdonik S.B., Cherniack M., and Stonebraker M. Load shedding in a data stream manager. In Proc. 29th Int. Conf. on Very Large Data Bases, 2003, pp. 309–320.
Adaptive Workflow/Process Management ▶ Workflow Evolution
ADBMS ▶ Active Database Management System Architecture
ADBMS Framework ▶ Active Database Management System Architecture
ADBMS Infrastructure ▶ Active Database Management System Architecture
Adding Noise ▶ Matrix Masking
Additive Noise ▶ Noise Addition
Administration Model for RBAC
Yue Zhang, James B. D. Joshi
University of Pittsburgh, Pittsburgh, PA, USA
Synonyms ARBAC97; SARBAC
Definition The central idea of the administration model for RBAC is to use the role itself to manage roles. There are two well-known families of RBAC administration models. Administrative RBAC
The Administrative RBAC family of models known as ARBAC97 [3] introduces administrative roles that are used to manage the regular roles. These roles can form a role hierarchy and may have constraints. ARBAC97 consists of three administrative models, the user-role assignment (URA97) model, the permission-role assignment (PRA97) model, and the role-role administration (RRA97) model. URA97 defines which administrative roles can assign which users to which regular roles by means of the relation: can_assign. Similarly, PRA97 defines which administrative roles can assign which permissions to which regular roles by means of the relation: can_assignp. Each of these relations also has a counterpart for revoking the assignment (e.g., can_revoke). RRA97 defines which administrative roles can change the structure (add roles, delete roles, add edges, etc.) of which range of the regular roles using the notion of encapsulated range and the relation: can_modify. Scoped Administrative RBAC
The SARBAC model uses the notion of administrative scope to ensure that any operations executed by a role r will not affect other roles due to the hierarchical relations among them [1]. There are no special administrative roles in SARBAC, and each regular role has a scope of other regular roles called administrative scope that can be managed by it. Each role can only be managed
by its administrators. For example, a senior-most role should be able to manage all its junior roles.
Key Points The ARBAC97 model is the first known role-based administration model and uses the notion of range and encapsulated range. A role range is essentially a set of regular roles. To avoid undesirable side effects, RRA97 requires that all role ranges in the can_modify relation be encapsulated, which means the range should have exactly one senior-most role and one junior-most role. Sandhu et al. later extended the ARBAC97 model into the ARBAC99 model, where the notion of mobile and immobile users/permissions was introduced [4]. Oh et al. later extended ARBAC99 to ARBAC02 by adding the notion of organizational structure to redefine the user-role assignment and the role-permission assignment [2]. Recently, Zhang et al. have proposed an ARBAC07 model that extends the family of ARBAC models to deal with an RBAC model that allows hybrid hierarchies to co-exist [6]. SARBAC
The most important notion in SARBAC is that of the administrative scope, which is similar to the notion of encapsulated range in ARBAC97. A role r is said to be within the administrative scope of another role a if every path upwards from r goes through a, and a is said to be the administrator of r. SARBAC also consists of three models: SARBAC-RHA, SARBAC-URA, and SARBAC-PRA. In SARBAC-RHA, each role can only administer the roles that are within its own administrative scope. The operations include adding roles, deleting roles, adding permissions, and deleting permissions. The semantics of SARBAC-URA and SARBAC-PRA are similar to those of URA97 and PRA97. The administrative scope can change dynamically. Zhang et al. have extended SARBAC to also deal with hybrid hierarchy [5].
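The administrative-scope test described above (every upward path from role r passes through role a) can be sketched over an explicit role hierarchy. The dictionary encoding of the hierarchy and the example role names are hypothetical and used only for illustration.

def in_admin_scope(hierarchy, r, a):
    # True if role r lies within the administrative scope of role a, i.e., every
    # upward path from r in the (acyclic) role hierarchy goes through a.
    # hierarchy maps each role to the set of its immediate senior roles.
    if r == a:
        return True
    seniors = hierarchy.get(r, set())
    if not seniors:
        return False   # reached a senior-most role without passing through a
    return all(in_admin_scope(hierarchy, s, a) for s in seniors)

# Hypothetical hierarchy: engineer < project_lead < dept_manager.
roles = {
    "engineer": {"project_lead"},
    "project_lead": {"dept_manager"},
    "dept_manager": set(),
}
print(in_admin_scope(roles, "engineer", "dept_manager"))      # True
print(in_admin_scope(roles, "dept_manager", "project_lead"))  # False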
Cross-references
▶ Role Based Access Control
Recommended Reading
1. Crampton J. and Loizou G. Administrative scope: a foundation for role-based administrative models. ACM Trans. Inf. Syst. Secur., 6(2):201–231, 2003.
2. Oh S. and Sandhu R. A model for role administration using organization structure. In Proc. 7th ACM Symp. on Access Control Models and Technologies, 2002, pp. 155–162.
3. Sandhu R., Bhamidipati V., and Munawer Q. The ARBAC97 model for role-based administration of roles. ACM Trans. Inf. Syst. Secur., 2(1):105–135, 1999.
4. Sandhu R. and Munawer Q. The ARBAC99 model for administration of roles. In Proc. 15th Computer Security Applications Conf., 1999, p. 229.
5. Zhang Y. and Joshi J.B.D. SARBAC07: scoped administration model for RBAC with hybrid hierarchy. In Proc. 3rd Int. Symp. on Information Assurance and Security, 2007, pp. 149–154.
6. Zhang Y. and Joshi J.B.D. ARBAC07: a role based administration model for RBAC with hybrid hierarchy. In Proc. IEEE Int. Conf. Information Reuse and Integration, 2007, pp. 196–202.
Administration Wizards
Philippe Bonnet¹, Dennis Shasha²
¹University of Copenhagen, Copenhagen, Denmark; ²New York University, New York, NY, USA
Definition Modern database systems provide a collection of utilities and programs to assist a database administrator with tasks such as database installation and configuration, import/export, indexing (index wizards are covered in the self-management entry), and backup/restore.
Historical Background
Database administrators have been skeptical of any form of automation as long as they could control the performance and security of a relatively straightforward installation. The advent of enterprise data management towards the end of the 1990s, where a few administrators became responsible for many, possibly diverse, database servers, has led to the use of graphical automation tools. In the mid-1990s, third-party vendors introduced such tools. With SQL Server 6.5, Microsoft was the first database vendor to provide an administration wizard.
Foundations
Installation and Configuration
Database servers are configured using hundreds of parameters that control everything: buffer sizes, file layout, concurrency control options, and so on. They are either set statically in a configuration file before the server is started, or dynamically while the server is running. Out-of-the-box database servers are equipped with a limited set of typical configurations. The installation/configuration wizard is a graphical user interface that guides the administrator through the initial server configuration. The interface provides high-level choices (e.g., OLTP vs. OLAP workload) or simple questions (e.g., number of concurrent users) that are mapped onto database configuration values (log buffer size and thread pool size, respectively).
Data Import/Export
Import/export wizards are graphical tools that help database administrators map a database schema to an external data format (e.g., XML, CSV, PDF), or generate scripts that automate the transfer of data between a database and an external data source (possibly another database server).
Back-up/Restore
Back-up/restore wizards automate the back-up procedure given a few input arguments: complete/incremental backup, scope of the back-up/restore operations (file, tablespace, database), and target directory.
Key Applications
Automation of the central database administration tasks.
Cross-references
▶ Self-Management
Recommended Reading
1. Bersinic D. and Gile S. Portable DBA: SQL Server. McGraw Hill, New York, 2004.
2. Schumacher R. DBA Tools Today. DBMS Magazine, January 1997.

Advanced Transaction Models
▶ Extended Transaction Models and the ACTA Framework
▶ Generalization of ACID Properties
▶ Open Nested Transaction Models
Adversarial Information Retrieval ▶ Web Spam Detection
Affix Removal ▶ Stemming
AFI ▶ Approximation of Frequent Itemsets
Aggregate Queries in P2P Systems ▶ Approximate Queries in Peer-to-Peer Systems
Aggregation ▶ Abstraction
Aggregation Algorithms for Middleware Systems ▶ Top-k Selection Queries on Multimedia Datasets
Aggregation and Threshold Algorithms for XML ▶ Ranked XML Processing
Aggregation: Expressiveness and Containment
Sara Cohen
The Hebrew University of Jerusalem, Jerusalem, Israel
Definition An aggregate function is a function that receives as input a multiset of values, and returns a single value. For example, the aggregate function count returns the number of input values. An aggregate query is simply a query that mentions an aggregate function, usually as part of its output. Aggregate queries are commonly
used to retrieve concise information from a database, since they can cover many data items, while returning few. Aggregation is allowed in SQL, and the addition of aggregation to other query languages, such as relational algebra and datalog, has been studied. The problem of determining query expressiveness is to characterize the types of queries that can be expressed in a given query language. The study of query expressiveness for languages with aggregation is often focused on determining how aggregation increases the ability to formulate queries. It has been shown that relational algebra with aggregation (which models SQL) has a locality property. Query containment is the problem of determining, for any two given queries q and q 0 , whether q(D)
⊆ q′(D), for all databases D, where q(D) is the result of applying q to D. Similarly, the query equivalence problem is to determine whether q(D) = q′(D) for all databases D. For aggregate queries, it seems that characterizing query equivalence may be easier than characterizing query containment. In particular, almost all known results on query containment for aggregate queries are derived by a reduction from query equivalence.
Historical Background The SQL standard defines five aggregate functions, namely, count , sum , min , max and avg (average). Over time, it has become apparent that users would like to aggregate data in additional ways. Therefore, major database systems have added new built-in aggregate functions to meet this need. In addition, many database systems now allow the user to extend the set of available aggregate functions by defining his own aggregate functions. Aggregate queries are typically used to summarize detailed information. For example, consider a database with the relations Dept(deptId, deptName) and Emp(empId, deptId, salary). The following SQL query returns the number of employees, and the total department expenditure on salaries, for each department which has an average salary above $10,000. (Q1) SELECT deptID, count(empID), sum(salary) FROM Dept, Emp WHERE Dept.deptID = Emp.deptID GROUP BY Dept.deptID HAVING avg(salary) > 10000
Typically, aggregate queries have three special components. First, the GROUP BY clause is used to state how intermediate tuples should be grouped before applying aggregation. In this example, tuples are grouped by their value of deptID, i.e., all tuples with the same value for this attribute form a single group. Second, a HAVING clause can be used to determine which groups are of interest, e.g., those with average salary above $10,000. Finally, the outputted aggregate functions are specified in the SELECT clause, e.g., the number of employees and the sum of salaries. The inclusion of aggregation in SQL has motivated the study of aggregation in relational algebra, as an abstract modeling of SQL. One of the earliest studies of aggregation was by Klug [11], who extended relational algebra and relational calculus to allow aggregate functions and showed the equivalence of these two languages. Aggregation has also been added to Datalog. This has proved challenging since it is not obvious what semantics should be adopted in the presence of recursion [15].
Foundations Expressiveness
The study of query expressiveness deals with determining what can be expressed in a given query language. The expressiveness of query languages with aggregation has been studied both for the language of relational algebra, as well as for datalog, which may have recursion. Various papers have studied the expressive power of nonrecursive languages, extended with aggregation, e.g., [7,9,13]. The focus here will be on [12], which has the cleanest, general proofs for the expressive power of languages modeling SQL. In [12], the expressiveness of variants of relational algebra, extended with aggregation, was studied. First, [12] observes that the addition of aggregation to relational algebra strictly increases its expressiveness. This is witnessed by the query Q2: (Q2) SELECT 1 FROM R1 WHERE(SELECT COUNT(*)FROM R)> (SELECT COUNT(*)FROM S)
Observe that Q2 returns 1 if R contains more tuples than S, and otherwise an empty answer. It is known that first-order logic cannot compare cardinalities, and hence neither can relational algebra. Therefore, SQL with aggregation is strictly more expressive than SQL without aggregation. The language ALG_aggr is presented in [12]. Basically, ALG_aggr is relational algebra, extended by arbitrary aggregation and arithmetic functions. In ALG_aggr, non-numerical selection predicates are restricted to using only the equality relation (and not order comparisons). A purely relational query is one which is applied only to non-numerical data. It is shown that all purely relational queries in ALG_aggr are local. Intuitively, the answers to local queries are determined by looking at small portions of the input. The formal definition of local queries follows. Let D be a database. The Gaifman graph G(D) of D is the undirected graph on the values appearing in D, with (a,b) ∈ G(D) if a and b belong to the same tuple of some relation in D. Let ā = (a1,...,ak) be a tuple of values, each of which appears in D. Let r be an integer, and let S_r^D(ā) be the set of values b such that dist(ai, b) ≤ r in G(D), for some i. The r-neighborhood N_r^D(ā) of ā is a new database in which the relations of D are restricted to contain only the values in S_r^D(ā). Then, ā and b̄ are (D,r)-equivalent if there is an isomorphism h : N_r^D(ā) → N_r^D(b̄) such that h(ā) = b̄. Finally, a query q is local if there exists a number r such that for all D, if ā and b̄ are (D,r)-equivalent, then ā ∈ q(D) if and only if b̄ ∈ q(D). There are natural queries that are not local. For example, transitive closure (also called reachability) is not local. Since all queries in ALG_aggr are local, this implies that transitive closure cannot be expressed in ALG_aggr. In addition to ALG_aggr, [12] introduces the languages ALG_aggr^{<,N} and ALG_aggr^{<,Q}. ALG_aggr^{<,N} and ALG_aggr^{<,Q} are the extensions of ALG_aggr which allow order comparisons in the selection predicates, and allow natural numbers and rational numbers, respectively, in the database. It is not known whether transitive closure can be expressed in ALG_aggr^{<,N}. More precisely, [12] shows that if transitive closure is not expressible in ALG_aggr^{<,N}, then the complexity class Uniform TC0 is properly contained in the complexity class NLOGSPACE. Since the latter problem (i.e., determining strict containment of TC0 in NLOGSPACE) is believed
to be very difficult to prove, so is the former. Moreover, this result holds even if the arithmetic functions are restricted to {+, ×}.

q4(sum(Y )) ← b(Y ), b(Z), Y > 0, Z > 0
q4′(sum(Y )) ← b(Y ), b(Z), Y ≥ 0, Z > 0
q5(avg(Y )) ← b(Y )
q5′(avg(Y )) ← b(Y ), b(Z)
q6(max(Y )) ← b(Y ), b(Z1), b(Z2), Z1 < Z2
q6′(max(Y )) ← b(Y ), b(Z), Z < Y
Characterizations for equivalence are known for queries of the above types. Specifically, characterizations have been presented for equivalence of conjunctive queries with the aggregate functions count, sum, max and count-distinct [4] and these were extended in [5] to queries with disjunctive bodies. Equivalence of conjunctive queries with avg and with percent were characterized in [8]. It is sometimes possible to define classes of aggregate functions and then present general characterizations for equivalence of queries with any aggregate function within the class of functions. Such characterizations are often quite intricate since they must deal with many different aggregate functions. A characterization of this type was given in [6] to decide equivalence of aggregate queries with decomposable aggregate functions, even if the queries contain negation. Intuitively, an aggregate function is decomposable if partially computed values can easily be combined together to return the result of aggregating an entire multiset of values, e.g., as is the case for count, sum and max. Interestingly, when dealing with aggregate queries it seems that the containment problem is more elusive than the equivalence problem. In fact, for aggregate queries, containment is decided by reducing to the equivalence problem. A reduction of containment to equivalence is presented for queries with expandable aggregate functions in [3]. Intuitively, for expandable aggregate functions, changing the number of occurrences of values in bags B and B′ does not affect the correctness of the formula α(B) = α(B′), as long as the proportion of each value in each bag remains the same, e.g., as is the case for count, sum, max, count-distinct and avg.
The study of aggregate queries using the count function is closely related to the study of nonaggregate queries evaluated under bag-set semantics. Most past research on query containment and equivalence for nonaggregate queries assumed that queries are evaluated under set semantics. In set semantics, the output of a query does not contain duplicated tuples. (This corresponds to SQL queries with the DISTINCT operator.) Under bag-set semantics the result of a query is a multiset of values, i.e., the same value may appear many times. A related semantics is bag semantics, in which both the database and the query results may contain duplication. To demonstrate the different semantics, recall the database D1 defined above. Consider evaluating, over D1, the following variation of q1:

q1″(X) ← a(X, Y )

Under set semantics q1″(D1) = {(c),(d)}, and under bag-set semantics q1″(D1) = {{(c),(c),(d)}}. Note the correspondence between bag-set semantics and using the count function, as in q1, where count returns exactly the number of duplicates of each value. Due to this correspondence, solutions for the query containment problem for queries with the count function immediately give rise to solutions for the query containment problem for nonaggregate queries evaluated under bag-set semantics, and vice-versa. The first paper to directly study containment and equivalence for nonaggregate queries under bag-set semantics was [1], which characterized equivalence for conjunctive queries. This was extended in [4] to queries with comparisons, in [5] to queries with disjunctions, and in [6] to queries with negation.
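The difference between set and bag-set evaluation can be made concrete with a short sketch. The relation instance standing in for D1 is hypothetical, chosen only to be consistent with the outputs stated above (c appearing with two different Y-values and d with one).

from collections import Counter

# Hypothetical instance of the relation a(X, Y), consistent with the stated results:
a_rel = [("c", 1), ("c", 2), ("d", 3)]

def q1pp_set(rel):
    # q1''(X) <- a(X, Y) under set semantics: duplicate answers are eliminated.
    return {(x,) for (x, _y) in rel}

def q1pp_bagset(rel):
    # The same query under bag-set semantics: one output tuple per matching fact.
    return Counter((x,) for (x, _y) in rel)

print(q1pp_set(a_rel))      # {('c',), ('d',)}
print(q1pp_bagset(a_rel))   # Counter({('c',): 2, ('d',): 1})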
Key Applications Query Optimization
The ability to decide query containment and equivalence is believed to be a key component in query optimization. When optimizing a query, the database can use equivalence characterizations to remove redundant portions of the query, or to find an equivalent, yet cheaper, alternative query. Query Rewriting
Given a user query q, and previously computed queries v1,...,vn, the query rewriting problem is to find a query r that (i) is equivalent to q, and (ii) uses the queries
v1,...,vn instead of accessing the base relations. (Other variants of the query rewriting problem have also been studied.) Due to Condition (i), equivalence characterizations are needed to solve the query rewriting problem. Query rewriting is useful as an optimization technique, since it can be cheaper to use past results, instead of evaluating a query from scratch. Integrating information sources is another problem that can be reduced to the query rewriting problem.
Future Directions Previous work on query containment does not consider queries with HAVING clauses. Another open problem is containment for queries evaluated under bag-set semantics. In this problem, one wishes to determine whether the bag returned by q is always a sub-bag of that returned by q′. (Note that this is different from the corresponding problem of determining containment of queries with count, which has been solved.) It has been shown [10] that bag-set containment is undecidable for conjunctive queries containing inequalities. However, for conjunctive queries without any order comparisons, determining bag-set containment is still an open problem.
Cross-references
▶ Answering Queries using Views ▶ Bag Semantics ▶ Data Aggregation in Sensor Networks ▶ Expressive Power of Query Languages ▶ Locality ▶ Query Containment ▶ Query Optimization (in Relational Databases) ▶ Query Rewriting using Views
Recommended Reading 1. Chaudhuri S. and Vardi M.Y. Optimization of real conjunctive queries. In Proc. 12th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 1993, pp. 59–70. 2. Cohen S. Containment of aggregate queries. ACM SIGMOD Rec., 34(1):77–85, 2005. 3. Cohen S., Nutt W., and Sagiv Y. Containment of aggregate queries. In Proc. 9th Int. Conf. on Database Theory, 2003, pp. 111–125. 4. Cohen S., Nutt W., and Sagiv Y. Deciding equivalences among conjunctive aggregate queries. J. ACM, 54(2), 2007. 5. Cohen S., Nutt W., and Serebrenik A. Rewriting aggregate queries using views. In Proc. 18th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 1999, pp. 155–166.
6. Cohen S., Sagiv Y., and Nutt W. Equivalences among aggregate queries with negation. ACM Trans. Comput. Log., 6(2):328–360, 2005. 7. Consens M.P. and Mendelzon A.O. Low complexity aggregation in graphlog and datalog. Theor. Comput. Sci., 116(1 and 2): 95–116, 1993. 8. Grumbach S., Rafanelli M., and Tininini L. On the equivalence and rewriting of aggregate queries. Acta Inf., 40(8):529–584, 2004. 9. Hella L., Libkin L., Nurmonen J., and Wong L. Logics with aggregate operators. J. ACM, 48(4):880–907, 2001. 10. Jayram T.S., Kolaitis P.G., and Vee E. The containment problem for real conjunctive queries with inequalities. In Proc. 25th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2006, pp. 80–89. 11. Klug A.C. Equivalence of relational algebra and relational calculus query languages having aggregate functions. J. ACM, 29(3):699–717, 1982. 12. Libkin L. Expressive power of SQL. Theor. Comput. Sci., 3(296):379–404, 2003. 13. Libkin L. and Wong L. Query languages for bags and aggregate functions. J. Comput. Syst. Sci., 55(2):241–272, 1997. 14. Mumick I.S. and Shmueli O. How expressive is statified aggregation? Ann. Math. Artif. Intell., 15(3–4):407–434, 1995. 15. Ross K.A. and Sagiv Y. Monotonic aggregation in deductive database. J. Comput. Syst. Sci., 54(1):79–97, 1997.
Aggregation-Based Structured Text Retrieval
Theodora Tsikrika
Center for Mathematics and Computer Science, Amsterdam, The Netherlands
Definition Text retrieval is concerned with the retrieval of documents in response to user queries. This is achieved by (i) representing documents and queries with indexing features that provide a characterisation of their information content, and (ii) defining a function that uses these representations to perform retrieval. Structured text retrieval introduces a finer-grained retrieval paradigm that supports the representation and subsequent retrieval of the individual document components defined by the document’s logical structure. Aggregation-based structured text retrieval defines (i) the representation of each document component as the aggregation of the representation of its own information content and the representations of information content of its structurally related components, and
(ii) retrieval of document components based on these (aggregated) representations. The aim of aggregation-based approaches is to improve retrieval effectiveness by capturing and exploiting the interrelations among the components of structured text documents. The representation of each component’s own information content is generated at indexing time. The recursive aggregation of these representations, which takes place at the level of their indexing features, leads to the generation, either at indexing or at query time, of the representations of those components that are structurally related with other components. Aggregation can be defined in numerous ways; it is typically defined so that it enables retrieval to focus on those document components more specific to the query or to each document’s best entry points, i.e., document components that contain relevant information and from which users can browse to further relevant components.
Historical Background A well-established Information Retrieval (IR) technique for improving the effectiveness of text retrieval (i.e., retrieval at the document level) has been the generation and subsequent combination of multiple representations for each document [3]. To apply this useful technique to the text retrieval of structured text documents, the typical approach has been to exploit their logical structure and consider that the individual representations of their components can act as the different representations to be combined [11]. This definition of the representation of a structured text document as the combination of the representations of its components was also based on the intuitive idea that the information content of each document consists of the information content of its sub-parts [2,6]. As the above description suggests, these combination-based approaches, despite restricting retrieval only at the document level, assign representations not only to documents, but also to individual document components. To generate these representations, structured text documents can simply be viewed as series of non-overlapping components (Figure 1a), such as title, author, abstract, body, etc. [13]. The proliferation of SGML and XML documents, however, has led to the consideration of hierarchical components (Figure 1b), and their interrelated representations [1]. For these
(disjoint or nested) document components, the combination of their representations can take place (i) directly at the level of their indexing features, which typically correspond to terms and their statistics (e.g., [13]), or (ii) at the level of retrieval scores computed independently for each component (e.g., [15]). Overall, these combination-based approaches have proven effective for the text retrieval of structured text documents [11,13,15]. Following the recent shift towards the structured text retrieval paradigm [2], which supports the retrieval of document components (including whole documents), it was only natural to try to adapt these combination-based approaches to this new requirement for retrieval at the sub-document level. Here, the focus is on each document component: its representation corresponds to the combination of its own representation with the representations of its structurally related components, and its retrieval is based on this combined representation. Similarly to the case of combination-based approaches for text retrieval, two strands of research can be identified: (i) approaches that operate at the level of the components’ indexing features (e.g., [12]), referred to as aggregation-based structured text retrieval (described in this entry), and (ii) approaches that operate at the level of retrieval scores computed independently for each component (e.g., [14]), referred to as propagation-based structured text retrieval. Figure 2b illustrates the premise of aggregationand propagated-based approaches for the simple structured text document depicted in Figure 2a. Since these approaches share some of their underlying motivations and assumptions, there has been a crossfertilisation of ideas between the two. This also implies that this entry is closely related to the entry on propagation-based structured text retrieval.
Foundations Structured text retrieval supports, in principle, the representation and subsequent retrieval of document components of any granularity; in practice, however, it is desirable to take into account only document components that users would find informative in response to their queries [1,2,4,6]. Such document components are referred to as indexing units and are usually chosen (manually or automatically) with respect to the requirements of each application. Once the indexing
Aggregation-Based Structured Text Retrieval. Figure 1. Two views on the logical structure of a structured text document.
units have been determined, each can be assigned a representation of its information content, and, hence, become individually retrievable. Aggregation-based structured text retrieval approaches distinguish two types of indexing units: atomic and composite. Atomic components correspond to indexing units that cannot be further decomposed, i.e., the leaf components in Figure 1b. The representation of an atomic component is generated by
considering only its own information content. Composite components, on the other hand, i.e., the non-leaf nodes in Figure 1b, correspond to indexing units which are related to other components, e.g., consist of sub-components. In addition to its own information content, a composite component is also dependent on the information content of its structurally related components. Therefore, its representation can be derived via the aggregation of the representation of its
Aggregation-Based Structured Text Retrieval. Figure 2. Simple example illustrating the differences between aggregation- and propagation-based approaches.
own information content with the representations of the information content of its structurally related components; this aggregation takes place at the level of their indexing features. Given the representations of atomic components and of composite components’ own information content, aggregation-based approaches recursively generate the aggregated representations of composite components and, based on them, perform retrieval of document components of varying granularity. In summary, each aggregation-based approach needs to define the following: (i) the representation of each component’s own information content, (ii) the aggregated representations of composite components, and (iii) the retrieval function that uses these representations. Although these three steps are clearly interdependent, the major issues addressed in each step need to be outlined first, before proceeding with the description of the key aggregation-based approaches in the field of structured text retrieval. 1. Representing each component’s own information content: In the field of text retrieval, the issue of
representing documents with indexing features that provide a characterisation of their information content has been extensively studied in the context of several IR retrieval models (e.g., Boolean, vector space, probabilistic, language models, etc.). For text documents, these indexing features typically correspond to term statistics. Retrieval functions produce a ranking in response to a user’s query, by taking into account the statistics of query terms together with each document’s length. The term statistics most commonly used correspond to the term frequency tf (t, d) of term t in document d and to the document frequency df (t, C) of term t in the document collection C, leading to standard tf idf weighting schemes. Structured text retrieval approaches need to generate representations for all components corresponding to indexing units. Since these components are nested, it is not straightforward to adapt these term statistics (particularly document frequency) at the component level [10]. Aggregation-based approaches, on the other hand, directly generate representations only for components that have their own information content,
Aggregation-Based Structured Text Retrieval. Figure 3. Representing the components that contain their own information.
while the representations of the remaining components are obtained via the aggregation process. Therefore, the first step is to generate the representations of atomic components and of the composite components’ own information content, i.e., the content not contained in any of their structurally related components. This simplifies the process, since only disjoint units need to be represented [6], as illustrated in Figure 3 where the dashed boxes enclose the components to be represented (cf. [5]). Text retrieval approaches usually consider that the information content of a document corresponds only to its textual content, and possibly its metadata (also referred to as attributes). In addition to that, structured text retrieval approaches also aim at representing the information encoded in the logical structure of documents. Representing this structural information, i.e., the interrelations among the documents and their components, enables retrieval in response to both content-only queries and content-and-structure queries. Aggregation-based approaches that only represent the textual content typically adapt standard representation formalisms widely employed in text retrieval
approaches to their requirements for representation at the component level (e.g., [9,11]). Those that consider richer representations of information content apply more expressive formalisms (e.g., various logics [2,4]). 2. Aggregating the representations: The concept underlying aggregation-based approaches is that of augmentation [4]: the information content of a document component can be augmented with that of its structurally related components. Given the already generated representations (i.e., the representations of atomic components and of composite components’ own information content), the augmentation of composite components is performed by the aggregation process. The first step in the aggregation process is the identification of the structurally related components of each composite component. Three basic types of structural relationships (Figure 4) can be distinguished: hierarchical (h), sequential (s), and links (l). Hierarchical connections express the composition relationship among components, and induce the tree representing the logical structure of a structured
Aggregation-Based Structured Text Retrieval. Figure 4. Different types of structural relationships between the components of a structured text document.
document. Sequential connections capture the order imposed by the document’s author(s), whereas links to components of the same or different documents reference (internal or external) sources that offer similar information. In principle, all these types of structural relationships between components can be taken into account by the aggregation process (and some aggregation-based approaches are generic enough to accommodate them, e.g., [7]). In practice, however, the hierarchical structural relations are the only ones usually considered. This leads to the aggregated representations of composite components being recursively generated in an ascending manner. The next step is to define the aggregation operator (or aggregation function). Since the aggregation of the textual content of related components is defined at the level of the indexing features of their representations, the aggregation function is highly dependent on the model (formalism) chosen to represent each component’s own content. This aggregation results in an (aggregated) representation modeled in the same formalism, and can be seen as being performed at two stages (although these are usually combined into one step): the aggregation of index expressions [2] (e.g., terms, conjunctions of terms, etc.), and of the uncertainty assigned to them (derived mainly by their statistics). An aggregation function could also take into account: (i) augmentation factors [6], which capture the fact that the textual content of the structurally related components of a composite component is not included in that components own content and has to be ‘‘propagated’’ in order to become part of it,
(ii) accessibility factors [4], which specify how the representation of a component is influenced by its connected components (a measure of the contribution of, say, a section to its embedding chapter [2]), and (iii) the overall importance of a component in a document’s structure [7] (e.g., it can be assumed that a title contains more informative content than a small subsection [13]). Finally, the issue of the possible aggregation of the attributes assigned to related components needs to be addressed [2]. The above aggregation process can take place either at indexing time (global aggregation) or at query time (local aggregation). Global aggregation is performed for all composite indexing units and considers all indexing features involved. Since this strategy does not scale well and can quickly become highly inefficient, local aggregation strategies are primarily used. These restrict the aggregation only to indexing features present in the query (i.e., query terms), and, starting from components retrieved in terms of their own information content, perform the aggregation only for these components’ ancestors. 3. Retrieval: The retrieval function operates both on the representations of atomic components and on the aggregated representations of composite components. Its definition is highly dependent on the formalism employed in modeling these representations. In conjunction with the definition of the aggregation function, the retrieval function operationalizes the notion of relevance for a structured text retrieval system. It can, therefore, determine whether retrieval focuses on those document components more specific to the query [2], or whether the aim is to support the users’
browsing activities by identifying each document's best entry points [7] (i.e., document components that contain relevant information which users can browse to further relevant components). Aggregation-based Approaches
One of the most influential aggregation-based approaches has been developed by Chiaramella et al. [2] in the context of the FERMI project (http://www.dcs.gla.ac.uk/fermi/). Aiming at supporting the integration of IR, hypermedia, and database systems, the FERMI model introduced some of the founding principles of structured text retrieval (including the notion of retrieval focussed on the most specific components). It follows the logical view on IR, i.e., it models the retrieval process as inference, and it employs predicate logic as its underlying formalism. The model defines a generic representation of content, attributes, and structural information associated with the indexing units. This allows for rich querying capabilities, including support for both content-only queries and content-and-structured queries. The indexing features of structured text documents can be defined in various ways, e.g., as sets of terms or as logical expressions of terms, while the semantics of the aggregation function depend on this definition. Retrieval can then be performed by a function of the specificity of each component with respect to the query. The major limitation of the FERMI model is that it does not incorporate the uncertainty inherent to the representations of content and structure. To address this issue, Lalmas [8] adapted the FERMI model by using propositional logic as its basis, and extended it by modeling the uncertain representation of the textual content of components (estimated by a tf idf weighting scheme) using Dempster-Shafer's theory of evidence. The structural information is not explicitly captured by the formalism; therefore, the model does not provide support for content-and-structured queries. The aggregation is performed by Dempster's combination rule, while retrieval is based on the belief values of the query terms. Fuhr, Gövert, and Rölleke [4] also extended the FERMI model using a combination of (a restricted form of) predicate logic with probabilistic inference. Their model captures the uncertainty in the representations of content, structure, and attributes. Aggregation of index expressions is based on a four-valued
logic, allowing for the handling of incomplete information and of inconsistencies arising by the aggregation (e.g., when two components containing contradictory information are aggregated). Aggregation of term weights is performed according to the rules of probability theory, typically by adopting term independence assumptions. This approach introduced the notion of accessibility factor being taken into account. Document components are retrieved based on the computed probabilities of query terms occurring in their (aggregated) representations. Following its initial development in [4], Fuhr and his colleagues investigated further this logic-based probabilistic aggregation model in [5,6]. They experimented with modeling aggregation by different Boolean operators; for instance, they noted that, given terms propagating in the document tree in a bottom-up fashion, a probabilistic-OR function would always result in higher weights for components further up the hierarchy. As this would lead (in contrast to the objectives of specificity-oriented retrieval) to the more general components being always retrieved, they introduced the notion of augmentation factors. These could be used to ''downweight'' the weights of terms (estimated by a tf idf scheme) that are aggregated in an ascending manner. The effectiveness of their approach has been assessed in the context of the Initiative for the Evaluation of XML retrieval (INEX) [6]. Myaeng et al. [11] also developed an aggregation-based approach based on probabilistic inference. They employ Bayesian networks as the underlying formalism for explicitly modeling the (hierarchical) structural relations between components. The document components are represented as nodes in the network and their relations as (directed) edges. They also capture the uncertainty associated with both textual content (again estimated by tf idf term statistics) and structure. Aggregation is performed by probabilistic inference, and retrieval is based on the computed beliefs. Although this model allows for document component scoring, in its original publication [11] it is evaluated in the context of text retrieval at the document level. Following the recent widespread application of statistical language models in the field of text retrieval, Ogilvie and Callan [12] adapted them to the requirements of structured text retrieval. To this end, each document component is modeled by a language model; a unigram language model estimates the probability of a term given
some text. For atomic components, the language model is estimated from their own text by employing a maximum likelihood estimate (MLE). For instance, the probability of term t given the language model θT of text T in a component can be estimated by: P(t|θT) = (1 − ω)·PMLE(t|θT) + ω·PMLE(t|θcollection), where ω is a parameter controlling the amount of smoothing with the background collection model. For composite components compi, the aggregation of language models is modeled as a linear interpolation: P(t|θ′compi) = λcompi·P(t|θcompi) + Σj∈children(compi) λj·P(t|θ′j), where λcompi + Σj∈children(compi) λj = 1. These λs model the contribution of each language model (i.e., document component) in the aggregation, while their estimation is a non-trivial issue. Ranking is typically produced by estimating the probability that each component generated the query string (assuming an underlying multinomial model). The major advantage of the language modeling approach is that it provides guidance in performing the aggregation and in estimating the term weights. A more recent research study has attempted to apply BM25 (one of the most successful text retrieval term weighting schemes) to structured text retrieval. Robertson et al. [13] initially adapted BM25 to structured text documents with non-hierarchical components (see Figure 1a), while investigating the effectiveness of retrieval at the document level. Next, they [9] adapted BM25 to deal with nested components (see Figure 1b), and evaluated it in the context of the INitiative for the Evaluation of XML retrieval (INEX). A final note on these aggregation-based approaches is that most aim at focusing retrieval on those document components more specific to the query. However, there are approaches that aim at modeling the criteria determining what constitutes a best entry point. For instance, Kazai et al. [7] model aggregation as a fuzzy formalisation of linguistic quantifiers. This means that an indexing feature (term) is considered in an aggregated representation of a composite component if it represents LQ of its structurally related components, where LQ is a linguistic quantifier, such as ''at least one,'' ''all,'' ''most,'' etc. By using these aggregated representations, the retrieval function determines that a component is relevant to a query if LQ of its structurally related components are relevant, in essence implementing different criteria of what can be regarded as a best entry point.
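A minimal sketch of this kind of hierarchical language-model aggregation is shown below. The component tree, the uniform interpolation weights, and the query-likelihood scoring follow the general shape of the formulas above; they are illustrative assumptions rather than the implementation of any particular system.

from collections import Counter

class Component:
    def __init__(self, text="", children=None):
        self.tokens = text.lower().split()
        self.children = children or []

def mle(tokens):
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return {t: c / total for t, c in counts.items()}

def aggregated_lm(comp, collection_lm, smoothing=0.2):
    # Smooth the component's own MLE model with the collection model, then
    # interpolate it with the (recursively aggregated) models of its children.
    own = mle(comp.tokens)
    vocab = set(own) | set(collection_lm)
    smoothed = {t: (1 - smoothing) * own.get(t, 0.0) + smoothing * collection_lm.get(t, 0.0)
                for t in vocab}
    if not comp.children:
        return smoothed
    child_models = [aggregated_lm(c, collection_lm, smoothing) for c in comp.children]
    lam = 1.0 / (1 + len(child_models))   # uniform weights summing to one (an assumption)
    vocab |= {t for m in child_models for t in m}
    return {t: lam * smoothed.get(t, 0.0) + sum(lam * m.get(t, 0.0) for m in child_models)
            for t in vocab}

def query_likelihood(query, lm, floor=1e-9):
    score = 1.0
    for term in query.lower().split():
        score *= max(lm.get(term, 0.0), floor)
    return score

# Tiny example: a section aggregating two paragraphs.
doc = Component("xml retrieval", [Component("structured text retrieval"),
                                  Component("language models for retrieval")])
collection = mle("xml text retrieval language models structured".split())
print(round(query_likelihood("structured retrieval", aggregated_lm(doc, collection)), 6))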
Key Applications Aggregation-based approaches can be used in any application requiring retrieval according to the structured text retrieval paradigm. In addition, such approaches are also well suited to the retrieval of multimedia documents. These documents can be viewed as consisting of (disjoint or nested) components each containing one or more media. Aggregation can be performed by considering atomic components to only contain a single medium, leading to retrieval of components of varying granularity. This was recognized early in the field of structured text retrieval and some of the initial aggregation-based approaches, e.g., [2,4], were developed for multimedia environments.
Experimental Results For most of the presented approaches, particularly for research conducted in the context of the INitiative for the Evaluation of XML retrieval (INEX), there is an accompanying experimental evaluation in the corresponding reference.
Data Sets A testbed for the evaluation of structured text retrieval approaches has been developed as part of the efforts of the INitiative for the Evaluation of XML retrieval (INEX) (http://inex.is.informatik.uni-duisburg.de/).
URL to Code The aggregation-based approach developed in [12] has been implemented as part of the open source Lemur toolkit (for language modeling and IR), available at: http://www.lemurproject.org/.
Cross-references
▶ Content-and-Structure Query ▶ Content-Only Query ▶ Indexing Units ▶ Information Retrieval Models ▶ INitiative for the Evaluation of XML Retrieval ▶ Logical Structure ▶ Propagation-based Structured Text Retrieval ▶ Relevance ▶ Specificity ▶ Structured Document Retrieval ▶ Text Indexing and Retrieval
Recommended Reading 1. Chiaramella Y. Information retrieval and structured documents. In Lectures on Information Retrieval, Third European Summer-School, Revised Lectures, LNCS, Vol. 1980. M. Agosti, F. Crestani, and G. Pasi (eds.). Springer, 2001, pp. 286–309. 2. Chiaramella Y., Mulhem P., and Fourel F. A model for multimedia information retrieval. Technical Report FERMI, ESPRIT BRA 8134, University of Glasgow, Scotland, 1996. 3. Croft W.B. Combining approaches to information retrieval. In Advances in Information Retrieval: Recent Research from the Center for Intelligent Information Retrieval, Vol. 7. W.B. Croft (ed.). The Information Retrieval Series, Kluwer Academic, Dordrecht, 2000, pp. 1–36. 4. Fuhr N., Go¨vert N., and Ro¨lleke T. DOLORES: A system for logic-based retrieval of multimedia objects. In Proc. 21st Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1998, pp. 257–265. 5. Fuhr N. and Großjohann K. XIRQL: A query language for information retrieval in XML documents. In Proc. 24th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2001, pp. 172–180. 6. Go¨vert N., Abolhassani M., Fuhr N., and Großjohann K. Content-oriented XML retrieval with HyREX. In Proc. 1st Int. Workshop of the Initiative for the Evaluation of XML Retrieval, 2003, pp. 26–32. 7. Kazai G., Lalmas M., and Ro¨lleke T. A model for the representation and focussed retrieval of structured documents based on fuzzy aggregation. In Proc. 8th Int. Symp. on String Processing and Information Retrieval, 2001, pp. 123–135. 8. Lalmas M. Dempster-Shafer’s theory of evidence applied to structured documents: Modelling uncertainty. In Proc. 20th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1997, pp. 110–118. 9. Lu W., Robertson S.E., and MacFarlane A. Field-weighted XML retrieval based on BM25. In Proc. 4th Int. Workshop of the Initiative for the Evaluation of XML Retrieval, Revised Selected Papers, LNCS, Vol. 3977, Springer, 2006, pp. 161–171. 10. Mass Y. and Mandelbrod M. Retrieving the most relevant XML components. In Proc. 2nd Int. Workshop of the Initiative for the Evaluation of XML Retrieval, 2004, pp. 53–58. 11. Myaeng S.-H., Jang D.-H., Kim M.-S., and Zhoo Z.-C. A flexible model for retrieval of SGML documents. In Proc. 21st Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1998, pp. 138–145. 12. Ogilvie P. and Callan J. Hierarchical language models for retrieval of XML components. In Advances in XML Information Retrieval and Evaluation. In Proc. 3rd Int. Workshop of the Initiative for the Evaluation of XML Retrieval, Revised Selected Papers, LNCS, Vol. 3493, Springer, 2005, pp. 224–237. 13. Robertson S.E., Zaragoza H., and Taylor M. Simple BM25 extension to multiple weighted fields. In Proc. Int. Conf. on Information and Knowledge Management, 2004, pp. 42–49. 14. Sauvagnat K., Boughanem M., and Chrisment C. Searching XML documents using relevance propagation. In Proc. 11th
Int. Symp. on String Processing and Information Retrieval, 2004, pp. 242–254. 15. Wilkinson R. Effective retrieval of structured documents. In Proc. 17th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1994, pp. 311–317.
AGMS Sketch ▶ AMS Sketch
Air Indexes for Spatial Databases
Baihua Zheng
Singapore Management University, Singapore, Singapore
Definition Air indexes refer to indexes employed in wireless broadcast environments to address the scalability issue and to facilitate power saving on mobile devices [4]. To retrieve a data object in wireless broadcast systems, a mobile client has to continuously monitor the broadcast channel until the data arrives. This will consume a lot of energy since the client has to remain active during its waiting time. The basic idea of air indexes is that by including index information about the arrival times of data items on the broadcast channel, mobile clients are able to predict the arrivals of their desired data. Thus, they can stay in power saving mode during the waiting time and switch to active mode only when the data of their interest arrives.
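The selective tuning behavior captured by this definition can be sketched as a simple client loop over a broadcast schedule. The frame layout, the single-frame index, and the slot-based accounting of tuning time and access latency are illustrative assumptions, not a description of a particular air-indexing scheme.

def retrieve(broadcast, index_slot, wanted_key):
    # Simulate a client that probes the channel, reads the index once, dozes until
    # its data item is broadcast, and then downloads it.
    # broadcast: list of frames, either ('index', {key: slot}) or ('data', key, value).
    # Returns (tuning_time, access_latency) measured in broadcast slots.
    tuning, latency = 0, 0
    tuning += 1                 # initial probe: learn when the next index arrives
    latency += index_slot
    kind, directory = broadcast[index_slot]
    assert kind == "index"
    tuning += 1                 # index search: one index frame read
    data_slot = directory[wanted_key]
    latency += data_slot - index_slot   # doze (inactive) until the data frame arrives
    kind, key, value = broadcast[data_slot]
    assert kind == "data" and key == wanted_key
    tuning += 1                 # data retrieval: download the wanted frame
    return tuning, latency

schedule = [("index", {"hotel": 2, "atm": 3}), ("data", "cafe", "..."),
            ("data", "hotel", "Grand"), ("data", "atm", "Bank")]
print(retrieve(schedule, index_slot=0, wanted_key="atm"))   # tuning stays small even as latency grows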
Historical Background In spatial databases, clients are assumed to be interested in data objects having spatial features (e.g., hotels, ATMs, gas stations). "Find me the nearest restaurant" and "locate all the ATMs that are within 100 miles of my current location" are two examples. A central server is allocated to keep all the data, based on which the queries issued by the clients are answered. There are basically two approaches to disseminating spatial data to clients: (i) on-demand access: a mobile client submits a request, which consists of a query and the query's issuing location, to the server. The server returns the result to the mobile client via a dedicated point-to-point channel. (ii) periodic broadcast: data are
periodically broadcast on a wireless channel open to the public. After a mobile client receives a query from its user, it tunes into the broadcast channel to receive the data of interest based on the query and its current location. On-demand access is particularly suitable for lightly loaded systems when contention for wireless channels and server processing is not severe. However, as the number of users increases, the system performance deteriorates rapidly. Compared with on-demand access, broadcast is a more scalable approach since it allows simultaneous access by an arbitrary number of mobile clients. Meanwhile, clients can access spatial data without reporting their current locations to the server, and hence private location information is not disclosed. In the literature, two performance metrics, namely access latency and tuning time, are used to measure access efficiency and energy conservation, respectively [4]. The former means the time elapsed between the moment when a query is issued and the moment when it is satisfied, and the latter represents the time a mobile client stays active to receive the requested data. As energy conservation is very critical due to the limited battery capacity on mobile clients, a mobile device typically supports two operation modes: active mode and doze mode. The device normally operates in active mode; it can switch to doze mode to save energy when the system becomes idle. With data broadcast, clients listen to a broadcast channel to retrieve data based on their queries and hence are responsible for query processing. Without any index information, a client has to download all data objects to process a spatial search, which will consume a lot of energy since the client needs to remain active during a whole broadcast cycle. A broadcast cycle means the minimal duration within which all the data objects are broadcast at least once. A solution to this problem is air indexes [4]. The basic idea is to broadcast an index before data objects (see Fig. 1 for an
example). Thus, query processing can be performed over the index instead of the actual data objects. As the index is much smaller than the data objects and is selectively accessed to perform a query, the client is expected to download less data (hence incurring less tuning time and energy consumption) to find the answers. The disadvantage of air indexing, however, is that the broadcast cycle is lengthened (to broadcast the additional index information). As a result, the access latency is worsened. It is obvious that the larger the index size, the higher the overhead in access latency. An important issue in air indexes is how to multiplex data and index on the sequential-access broadcast channel. Figure 1 shows the well-known (1, m) scheme [4], where the index is broadcast in front of every 1/m fraction of the dataset. To facilitate access to the index, each data page includes an offset to the beginning of the next index. The general access protocol for processing spatial search involves the following three steps: (i) initial probe: the client tunes into the broadcast channel and determines when the next index is broadcast; (ii) index search: the client tunes into the broadcast channel again when the index is broadcast. It selectively accesses a number of index pages to locate the qualified spatial data object and to determine when to download it; and (iii) data retrieval: when the packet containing the qualified object arrives, the client downloads it and retrieves the object. To disseminate spatial data on wireless channels, well-known spatial indexes (e.g., R-trees) are candidates for air indexes. However, unique characteristics of wireless data broadcast make the adoption of existing spatial indexes inefficient (if not impossible). Specifically, traditional spatial indexes are designed to cluster data objects with spatial locality. They usually assume resident storage (such as disk and memory) and adopt search strategies that minimize I/O cost. This is achieved by backtracking index nodes during search. However, the broadcast order (and thus the access order) of
Air Indexes for Spatial Databases. Figure 1. Air indexes in wireless broadcast environments.
Air Indexes for Spatial Databases. Figure 2. Linear access on wireless broadcast channel.
index nodes is extremely important in wireless broadcast systems because data and index are only available to the client when they are broadcast on air. Clients cannot randomly access a specific data object or index node but have to wait until the next time it is broadcast. As a result, each backtracking operation extends the access latency by one more cycle and hence becomes a serious constraint in wireless broadcast scenarios. Figure 2 depicts an example of a spatial query. Assume that an algorithm based on the R-tree first visits the root node, then the node R2, and finally R1, while the server broadcasts nodes in the order of root, R1, and R2. If a client wants to backtrack to node R1 after it retrieves R2, it will have to wait until the next cycle because R1 has already been broadcast. This significantly extends the access latency, and it occurs every time the navigation order differs from the broadcast order. As a result, new air indexes which consider both the constraints of broadcast systems and the features of spatial queries are desired.
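The interplay between the (1, m) interleaving scheme and the three-step access protocol described above can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the cited papers; the bucket-level time granularity, the function names (buildBroadcast, retrieve), and the assumption that every index segment indexes the whole cycle are simplifying assumptions.

```typescript
// Minimal sketch of the (1, m) scheme and the three-step access protocol.
// Time is measured in buckets (one bucket = one page on air).

type Bucket =
  | { kind: "index"; entries: Map<number, number> }          // object id -> slot of its data bucket
  | { kind: "data"; objectId: number; nextIndexOffset: number };

function buildBroadcast(objectIds: number[], m: number): Bucket[] {
  const cycle: Bucket[] = [];
  const chunk = Math.ceil(objectIds.length / m);
  for (let i = 0; i < objectIds.length; i += chunk) {
    cycle.push({ kind: "index", entries: new Map() });       // index in front of every 1/m of the data
    for (const id of objectIds.slice(i, i + chunk)) {
      cycle.push({ kind: "data", objectId: id, nextIndexOffset: 0 });
    }
  }
  const slotOf = new Map<number, number>();
  cycle.forEach((b, slot) => { if (b.kind === "data") slotOf.set(b.objectId, slot); });
  cycle.forEach((b, slot) => {
    if (b.kind === "index") {
      slotOf.forEach((dataSlot, id) => b.entries.set(id, dataSlot));
    } else {
      let off = 1;                                            // offset to the beginning of the next index
      while (cycle[(slot + off) % cycle.length].kind !== "index") off++;
      b.nextIndexOffset = off;
    }
  });
  return cycle;
}

// Returns access latency and tuning time (both in buckets) for one request.
function retrieve(cycle: Bucket[], tuneInSlot: number, wantedId: number) {
  let slot = tuneInSlot % cycle.length;
  let latency = 1, tuning = 1;                                // (i) initial probe: read one bucket
  const probe = cycle[slot];
  if (probe.kind === "data") {                                // doze until the next index segment
    latency += probe.nextIndexOffset;
    slot = (slot + probe.nextIndexOffset) % cycle.length;
    tuning++;                                                 // (ii) index search: wake up, read the index
  }
  const index = cycle[slot];
  if (index.kind !== "index") throw new Error("unexpected bucket");
  const target = index.entries.get(wantedId);
  if (target === undefined) throw new Error("unknown object");
  const wait = (target - slot + cycle.length) % cycle.length;
  latency += wait; tuning++;                                  // (iii) data retrieval: doze, then download
  return { latency, tuning };
}

// Example: 12 objects, index replicated m = 3 times per cycle.
const cycle = buildBroadcast(Array.from({ length: 12 }, (_, i) => i), 3);
console.log(retrieve(cycle, 5, 10));                          // { latency: 9, tuning: 2 } for this layout
```

Running retrieve for different tune-in slots illustrates the trade-off discussed above: a larger m shortens the wait for the next index (initial probe) but lengthens the cycle and hence the worst-case access latency.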
Foundations Several air indexes have been recently proposed to support the broadcast of spatial data. These studies can be classified into two categories, according to the nature of the queries supported. The first category focuses on retrieving data associated with some specified geographical range, such as "Starbucks Coffee in New York City's Times Square" and "Gas stations along Highway 515." A representative is the index structure designed for the DAYS project [1]. It proposes a location hierarchy and associates data with locations. The index structure is designed to support queries on various types of data with different location granularities. The authors intelligently exploit an important property of the locations, i.e., the containment relationship among the objects, to determine the relative
location of an object with respect to its parent that contains the object. The containment relationship limits the search range of available data and thus facilitates efficient processing of the supported queries. In brief, a broadcast cycle consists of several sub-cycles, each containing data belonging to the same type. A major index (one type of index bucket) is placed at the beginning of each sub-cycle. It provides information related to the types of data broadcast, and enables a client to quickly jump to the right sub-cycle, which contains the data of interest. Inside a sub-cycle, minor indexes (another type of index bucket) are interleaved with data buckets. Each minor index contains multiple pointers pointing to the data buckets at different locations. Consequently, a search for a data object involves accessing a major index and several minor indexes. The second category focuses on retrieving data according to a specified distance metric, based on the client's current location. An example is nearest neighbor (NN) search based on Euclidean distance. According to the index structure, indexes of this category can be further clustered into two groups, i.e., central tree-based structures and distributed structures. In the following, we review some of the representative indexes of both groups. The D-tree is a paged binary search tree to index a given solution space in support of planar point queries [6]. It assumes a data type has multiple data instances, and each instance has a certain valid scope within which this instance is the only correct answer. For example, restaurant is a data type, and each individual restaurant represents an instance. Taking NN search as an example, Fig. 3a illustrates four restaurants, namely o1, o2, o3, and o4, and their corresponding valid scopes p1, p2, p3, and p4. Given any query location q in, say, p3, o3 is the restaurant to which q is nearest. The D-tree assumes the valid scopes of different data instances are known and it focuses only on planar point queries which locate the
Air Indexes for Spatial Databases. Figure 3. Index construction using the D-tree.
query point into a valid scope and return to the client the corresponding data instance. The D-tree is a binary tree built based on the divisions between data regions (e.g., valid scopes). A space consisting of a set of data regions is recursively partitioned into two complementary subspaces containing about the same number of regions until each subspace has one region only. The partition between two subspaces is represented by one or more polylines. The overall orientation of the partition can be either x-dimensional or y-dimensional, which is obtained, respectively, by sorting the data regions based on their lowest/uppermost y-coordinates, or leftmost/rightmost x-coordinates. Figure 3b shows the partitions for the running example. The polyline pl(v2, v3, v4, v6) partitions the original space into p5 and p6, and polylines pl(v1,v3) and pl(v4,v5) further partition p5 into p1 and p2, and p6 into p3 and p4, respectively. The first polyline is y-dimensional and the remaining two are x-dimensional. Given a query point q, the search algorithm works as follows. It starts from the root and recursively follows either the left subtree or the right subtree that bounds the query point until a leaf node is reached. The associated data instance is then returned as the final answer. The Grid-partition index is specialized for the NN problem [9]. It is motivated by the observation that an object is the NN only to the query points located inside its Voronoi cell. Let O = {o1, o2, ..., on} be a set of points. V(oi), the Voronoi cell (VC) for oi, is defined as the set of points q in the space such that dist(q,oi) < dist(q,oj), ∀j ≠ i. That is, V(oi) consists of the set of points for which oi is the NN. As illustrated in Fig. 3a, p1, p2, p3, and p4 denote the VCs for four objects, o1, o2, o3, and
o4, respectively. The Grid-partition index tries to reduce the search space for a query at the very beginning by partitioning the space into disjoint grid cells. For each grid cell, all the objects that could be NNs of at least one query point inside the grid cell are indexed, i.e., those objects whose VCs overlap with the grid cell are associated with that grid cell. Figure 4a shows a possible grid partition for the running example, and the index structure is depicted in Fig. 4b. The whole space is divided into four grid cells, i.e., G1, G2, G3, and G4. Grid cell G1 is associated with objects o1 and o2, since their VCs, p1 and p2, overlap with G1; likewise, grid cell G2 is associated with objects o1, o2, o3, and so on. If a given query point is in grid cell G1, the NN can be found among the objects associated with G1 (i.e., o1 and o2), instead of among the whole set of objects. Efficient search algorithms and partition approaches have been proposed to speed up the performance. The conventional spatial index R-tree has also been adapted to support kNN search in broadcast environments [2]. For the R-tree index, the kNN search algorithm would visit index nodes and objects sequentially, as backtracking is not feasible on the broadcast. This certainly results in a considerably long tuning time, especially when the result objects are located in the later part of the broadcast. However, if clients know that there are at least k objects in the later part of the broadcast that are closer to the query point than the currently found ones, they can safely skip the downloading of the intermediate objects currently located. This observation motivates the design of an enhanced kNN search algorithm which caters to the constraints of wireless broadcast. It requires each index node to
Air Indexes for Spatial Databases. Figure 4. Index construction using the grid-partition.
carry a count of the underlying objects (object count) referenced by the current node. Thus, clients do not blindly download intermediate objects. The Hilbert Curve Index (HCI) is designed to support general spatial queries, including window queries, kNN queries, and continuous nearest-neighbor (CNN) queries, in wireless broadcast environments. Motivated by the linear streaming property of the wireless data broadcast channel and the optimal spatial locality of the Hilbert Curve (HC), HCI organizes data according to the Hilbert Curve order [7,8] and adopts the B+-tree as the index structure. Figure 5 depicts an 8 × 8 grid, with solid dots representing data objects. The numbers next to the data points, namely their index values, represent the visiting orders of the points along the Hilbert Curve. For instance, the data point with coordinates (1,1) has index value 2, and it will be visited before the data point with coordinates (2,2) because of its smaller index value. The filtering-and-refining strategy is adopted to answer all the queries. For a window query, the basic idea is to determine a candidate set of points along the Hilbert curve which includes all the points within the query window and later to filter out those outside the window. Suppose the rectangle shown in Fig. 5 is a query window. Among all the points within the search range, the first point is point a and the last is b, sorted according to their visiting orders on the Hilbert curve, and both of them lie on the boundary of the search range. Therefore, all the points inside this query window should lie on the Hilbert curve segment delimited by
Air Indexes for Spatial Databases. Figure 5. Hilbert curve index.
points a and b. In other words, the data points with index values between 18 and 29, but not the others, are the candidates. During the access, the client can derive the coordinates of data points based on their index values and then retrieve those within the query window. For a kNN query, the client first retrieves the k nearest objects to the query point along the Hilbert curve and then derives a range that is guaranteed to bound at least k objects. In the filtering phase, a window query
which bounds the search range is issued to filter out the unqualified points. Later, in the refinement phase, the k nearest objects are identified according to their distances to the query point. Suppose an NN query at point q (i.e., index value 53) is issued. First, the client finds its nearest neighbor (i.e., the point with index value 51) along the curve and derives a circle centered at q with r as the radius (i.e., the green circle depicted in Fig. 5). Since the circle bounds point 51, it is certain to contain the nearest neighbor to point q. Second, a window query is issued to retrieve all the data points inside the circle, i.e., the points with index values 11, 32, and 51. Finally, point 32 is identified as the nearest neighbor. The search algorithm for CNN adopts a similar approach. It approximates a search range which is guaranteed to bound all the answer objects, issues a window query to retrieve all the objects inside the search range, and finally filters out those that do not qualify. All the indexes mentioned above are based on a central tree-based structure, like the R-tree and the B-tree. However, employing a tree-based index on a linear broadcast channel to support spatial queries results in several deficiencies. First, clients can only start the search when they retrieve the root node in the channel. Replicating the index tree in multiple places in the broadcast channel provides multiple search starting points, shortening the initial root-probing time. However, the prolonged broadcast cycle leads to a longer access latency experienced by the clients. Second, wireless broadcast media are not error-free. In the case of losing intermediate nodes during the search process, the
clients are forced to either restart the search upon an upcoming root node or scan the subsequent broadcast for other possible nodes in order to resume the search, thus extending the tuning time. The Distributed spatial index (DSI), a fully distributed spatial index structure, is motivated by these observations [5]. A similar distributed structure was proposed in [3] as well to support access to spatial data on air. DSI is very different from tree-based indexes and is not a hierarchical structure. Index information of spatial objects is fully distributed in DSI, instead of simply replicated in the broadcast. With DSI, clients do not need to wait for a root node to start the search. The search process launches immediately after a client tunes into the broadcast channel, and hence the initial probe time for index information is minimized. Furthermore, in the event of data loss, clients resume the search quickly. Like HCI, DSI also adopts the Hilbert curve to determine the broadcast order of data objects. Data objects, mapped to point locations in a 2-D space, are broadcast in the ascending order of their HC index values. Suppose there are N objects in total; DSI chunks them into nF frames, each having no objects (nF = ⌈N∕no⌉). The space covered by the Hilbert Curve shown in Fig. 5 is used as a running example, with solid dots representing the locations of data objects (i.e., N = 8). Figure 6 demonstrates a DSI structure with no set to 1, i.e., each frame contains only one object. In addition to objects, each frame also has an index table as its header, which maintains information
Air Indexes for Spatial Databases. Figure 6. Distributed spatial index.
regarding the HC values of data objects to be broadcast at specific waiting intervals from the current frame. This waiting interval can be denoted by the delivery time difference or by the number of data frames apart, with respect to the current frame. Every index table keeps ni entries, each of which, tj, is expressed in the form ⟨HC′j, Pj⟩, j ∈ [0, ni). Pj is a pointer to the r^j-th frame after the current frame, where r (> 1) is an exponential base (i.e., a system-wide parameter), and HC′j is the HC value of the first object inside the frame pointed to by Pj. In addition to the tj entries, an index table also keeps the HC values HCk (k ∈ [1, no]) of all the objects objk that are contained in the current frame. This extra information, although occupying little extra bandwidth, provides a more precise image of all the objects inside the current frame. During retrieval, a client can compare the HCk values of the objects against the one of interest, so the retrieval of unnecessary objects, whose sizes are much larger than an HC value, can be avoided. Refer to the example shown in Fig. 5, with the corresponding DSI depicted in Fig. 6. Suppose r = 2, no = 1, nF = 8, and ni = 3. The index tables corresponding to the frames of data objects O6 and O32 are shown in the figure. Take the index table for frame O6 as an example: t0 contains a pointer to the next upcoming (2^0-th) frame, whose first object's HC value is 11; t1 contains a pointer to the second (2^1-th) frame, with HC value 17 for its first (and only) object; and the last entry t2 points to the fourth (2^2-th) frame. The index table also keeps the HC value 6 of the object O6 in the current frame. Search algorithms for window queries and kNN search have been proposed.
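Because both HCI and DSI rely on Hilbert-curve ordering, the two core operations, mapping a grid cell to its curve value and deriving the candidate value range for a window query, can be sketched compactly. The code below is an illustrative sketch rather than the algorithm of either paper: hilbertValue is the standard bit-manipulation conversion (its orientation may differ from the curve drawn in Fig. 5), and windowCandidateRange simply scans the window's cells to find the smallest and largest index values, which play the roles of points a and b in the discussion above.

```typescript
// Sketch of the Hilbert-curve operations used conceptually by HCI and DSI.
// n is the grid size (a power of two); all names are illustrative.

function hilbertValue(n: number, x: number, y: number): number {
  let d = 0;
  for (let s = n >> 1; s > 0; s >>= 1) {
    const rx = (x & s) > 0 ? 1 : 0;
    const ry = (y & s) > 0 ? 1 : 0;
    d += s * s * ((3 * rx) ^ ry);
    // Rotate/flip the quadrant so the next iteration sees the canonical orientation.
    if (ry === 0) {
      if (rx === 1) { x = n - 1 - x; y = n - 1 - y; }
      const t = x; x = y; y = t;
    }
  }
  return d;
}

// Candidates for a window query are all broadcast points whose curve values fall
// between the smallest and largest values of the cells inside the window; an exact
// containment check refines the candidates afterwards.
function windowCandidateRange(n: number, xlo: number, ylo: number,
                              xhi: number, yhi: number): [number, number] {
  let lo = Infinity, hi = -Infinity;
  for (let x = xlo; x <= xhi; x++) {
    for (let y = ylo; y <= yhi; y++) {
      const v = hilbertValue(n, x, y);
      if (v < lo) lo = v;
      if (v > hi) hi = v;
    }
  }
  return [lo, hi];
}

// Example on an 8 x 8 grid, as in Fig. 5.
console.log(hilbertValue(8, 1, 1));               // 2 with this orientation
console.log(windowCandidateRange(8, 2, 2, 4, 4)); // candidate range for a 3 x 3 window
```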
Key Applications Location-based Service
Wireless broadcast systems, because of their scalability, provide an alternative way to disseminate location-based information to a large number of users. Efficient air indexes enable clients to selectively tune into the channel, and hence the power consumption is reduced. Moving Objects Monitoring
Many moving objects monitoring applications are interested in finding out all the objects that currently satisfy certain conditions specified by the users. In many cases, the number of moving objects is much
larger than the number of submitted queries. As a result, wireless broadcast provides an ideal way to deliver subscribed queries to the objects, and those objects that might affect the queries can then report their current locations.
Cross-references
▶ Nearest Neighbor Query ▶ Space-Filling Curves for Query Processing ▶ Spatial Indexing Techniques ▶ Voronoi Diagrams
Recommended Reading
1. Acharya D. and Kumar V. Location based indexing scheme for DAYS. In Proc. 4th ACM Int. Workshop on Data Eng. for Wireless and Mobile Access, 2005, pp. 17–24.
2. Gedik B., Singh A., and Liu L. Energy efficient exact kNN search in wireless broadcast environments. In Proc. 12th ACM Int. Symp. on Geographic Inf. Syst., 2004, pp. 137–146.
3. Im S., Song M., and Hwang C. An error-resilient cell-based distributed index for location-based wireless broadcast services. In Proc. 5th ACM Int. Workshop on Data Eng. for Wireless and Mobile Access, 2006, pp. 59–66.
4. Imielinski T., Viswanathan S., and Badrinath B.R. Data on air – organization and access. IEEE Trans. Knowl. Data Eng., 9(3), 1997.
5. Lee W.-C. and Zheng B. DSI: a fully distributed spatial index for wireless data broadcast. In Proc. 23rd Int. Conf. on Distributed Computing Systems, 2005, pp. 349–358.
6. Xu J., Zheng B., Lee W.-C., and Lee D.L. The D-tree: an index structure for location-dependent data in wireless services. IEEE Trans. Knowl. Data Eng., 16(12):1526–1542, 2002.
7. Zheng B., Lee W.-C., and Lee D.L. Spatial queries in wireless broadcast systems. ACM/Kluwer J. Wireless Networks, 10(6):723–736, 2004.
8. Zheng B., Lee W.-C., and Lee D.L. On searching continuous k nearest neighbors in wireless data broadcast systems. IEEE Trans. Mobile Comput., 6(7):748–761, 2007.
9. Zheng B., Xu J., Lee W.-C., and Lee D.L. Grid-partition index: a hybrid method for nearest-neighbor queries in wireless location-based services. VLDB J., 15(1):21–39, 2006.
AJAX ALEX WUN University of Toronto, Toronto, ON, Canada
Definition AJAX is an acronym for ‘‘Asynchronous JavaScript and XML’’ and refers to a collection of web development
technologies used together to create highly dynamic web applications.
Key Points AJAX does not refer to a specific technology, but instead refers to a collection of technologies used in conjunction to develop dynamic and interactive web applications. The two main technologies comprising AJAX are the JavaScript scripting language and the W3C open standard XMLHttpRequest object API. While the use of XML and DOM is important for standardized data representation, neither XML nor DOM is required for an application to be considered AJAX-enabled, since the XMLHttpRequest API actually supports any text format. Using the XMLHttpRequest API, web applications can fetch data asynchronously while registering a callback function to be invoked once the fetched data is available. More concretely, the XMLHttpRequest object issues a standard HTTP POST or GET request to a web server but returns control to the calling application immediately after issuing the request. The calling application is then free to continue execution while the HTTP request is being handled on the server. When the HTTP response is received, the XMLHttpRequest object calls back into the function that was supplied by the calling application so that the response can be processed. The asynchronous callback model used in AJAX applications is analogous to the Operating System technique of using interrupt handlers to avoid blocking on I/O. As such, development using AJAX necessarily requires an understanding of multi-threaded programming. There are three main benefits to using AJAX in web applications:
1. Performance: Since XMLHttpRequest calls are asynchronous, client-side scripts can continue execution after issuing a request without being blocked by potentially lengthy data transfers. Consequently, web pages can be easily populated with data fetched in small increments in the background.
2. Interactivity: By maintaining long-lived data transfer requests, an application can closely approximate real-time event-driven behavior without resorting to periodic polling, which can only be as responsive as the polling frequency.
3. Data Composition: Web applications can easily pull data from multiple sources for aggregation and processing on the client-side without any dependence on HTML form elements. Data composition is also facilitated by having data adhere to standard XML and DOM formats.
The functionality provided by AJAX allows web applications to appear and behave much more like traditional desktop applications. The main difference is that data consumed by the application resides primarily out on the Internet – one of the concepts behind applications that are labeled as being representative ‘‘Web 2.0’’ applications.
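The asynchronous request/callback pattern described above can be illustrated with a few lines of browser-side code. The snippet below is a minimal sketch rather than a complete AJAX application; the URL is a placeholder and error handling is reduced to the essentials.

```typescript
// Minimal sketch of an asynchronous XMLHttpRequest with a registered callback.
function fetchAsync(url: string, onDone: (responseText: string) => void): void {
  const xhr = new XMLHttpRequest();
  xhr.open("GET", url, true);                 // third argument "true" = asynchronous
  xhr.onreadystatechange = () => {
    // readyState 4 = request finished and response is ready
    if (xhr.readyState === 4 && xhr.status === 200) {
      onDone(xhr.responseText);               // callback invoked once data is available
    }
  };
  xhr.send();                                 // returns immediately; no blocking
  // Execution continues here while the server handles the request.
}

// Usage: register a callback and keep the page responsive in the meantime.
fetchAsync("/data/items.xml", (text) => {
  const doc = new DOMParser().parseFromString(text, "application/xml");
  console.log(doc.documentElement.nodeName);
});
```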
Cross-references ▶ JavaScript ▶ MashUp ▶ Web 2.0/3.0 ▶ XML
Recommended Reading
1. The Document Object Model: W3C Working Draft. Available at: http://www.w3.org/DOM/
2. The XMLHttpRequest Object: W3C Working Draft. Available at: http://www.w3.org/TR/XMLHttpRequest/
Allen's Relations PETER REVESZ1, PAOLO TERENZIANI2 1University of Nebraska-Lincoln, Lincoln, NE, USA 2University of Turin, Turin, Italy
Synonyms Qualitative relations between time intervals; Qualitative temporal constraints between time intervals
Definition A (convex) time interval I is the set of all time points between a starting point (usually denoted by I−) and an ending point (I+). Allen's relations model all possible relative positions between two time intervals [1]. There are 13 different possibilities, depending on the relative positions of the endpoints of the intervals (Table 1). For example, "There will be a guest speaker during the Database System class" can be represented by Allen's relation IGuest During IDatabase (or by I−Guest > I−Database ∧ I+Guest < I+Database, considering the relative positions of the endpoints of the two intervals).
Allen's Relations. Table 1. Translation of Allen's interval relations between two intervals I and J into conjunctions of point relations between I−, I+, J−, J+.

Relation        I−J−   I−J+   I+J−   I+J+
Before           <      <      <      <
After            >      >      >      >
Meets            <      <      =      <
Met_by           >      =      >      >
During           >      <      >      <
Starts           =      <      >      <
Started_by       =      <      >      >
Overlaps         <      <      >      <
Overlapped_by    >      <      >      >
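A direct way to apply Table 1 computationally is to derive the relation holding between two intervals from the four endpoint comparisons. The following sketch covers all 13 relations mentioned in the definition; the function and type names are illustrative, and intervals are assumed to be well-formed (start strictly before end).

```typescript
// Sketch: deriving Allen's relation of I = [iMinus, iPlus] with respect to
// J = [jMinus, jPlus] from endpoint comparisons, as in Table 1.
type AllenRelation =
  | "before" | "after" | "meets" | "met_by" | "overlaps" | "overlapped_by"
  | "starts" | "started_by" | "during" | "contains"
  | "finishes" | "finished_by" | "equals";

function allenRelation(iMinus: number, iPlus: number,
                       jMinus: number, jPlus: number): AllenRelation {
  if (iPlus < jMinus) return "before";
  if (jPlus < iMinus) return "after";
  if (iPlus === jMinus) return "meets";
  if (jPlus === iMinus) return "met_by";
  if (iMinus === jMinus && iPlus === jPlus) return "equals";
  if (iMinus === jMinus) return iPlus < jPlus ? "starts" : "started_by";
  if (iPlus === jPlus) return iMinus > jMinus ? "finishes" : "finished_by";
  if (iMinus > jMinus && iPlus < jPlus) return "during";
  if (iMinus < jMinus && iPlus > jPlus) return "contains";
  return iMinus < jMinus ? "overlaps" : "overlapped_by";
}

// Example in the spirit of the definition: the guest talk occurs during the class.
console.log(allenRelation(10, 11, 9, 12)); // "during"
```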
string as select name(c) from person c where parent(c) = p and hobby(c) = ‘sailing’;
This is turned internally into a typed object comprehension: sailch(p) == [name(c) | c

5.0); if "true," then notify Dr. X." "Event-driven programming," as opposed to "procedural programming," utilizes the same kinds of predicate logic in evaluating state transitions or triggers to state transitions in a modern computer-programming environment. Consequently, Clinical Events drive programming logic in many modern systems. The HL7 Reference Information Model (RIM) describes clinical events; the term "Act" in the RIM identifies objects that are instantiated in XML communications between systems or in records within the electronic healthcare systems themselves. These "Acts" correspond to "clinical events" used for monitoring systems in healthcare. However, in the RIM, "Event" is defined narrowly as an instance of an Act that has been completed or is in the process of being completed. Clinical event monitoring systems may also evaluate HL7 "Orders or Requests" or other kinds of "Act" instances as events of interest (www.hl7.org).
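The event-driven evaluation described above (a predicate over an incoming result triggering a notification) can be sketched generically. The code below only illustrates the pattern; it is not an HL7 interface or the API of any particular product, and all type and field names are assumptions.

```typescript
// Generic sketch of a clinical event monitor: when a new observation event
// arrives, registered predicates are evaluated and matching actions fire.
interface ObservationEvent {
  patientId: string;
  code: string;      // e.g., a lab test identifier
  value: number;
}

interface EventRule {
  predicate: (e: ObservationEvent) => boolean;
  action: (e: ObservationEvent) => void;
}

class ClinicalEventMonitor {
  private rules: EventRule[] = [];
  register(rule: EventRule): void { this.rules.push(rule); }
  // Called whenever a clinical event is posted by an interfaced system.
  onEvent(e: ObservationEvent): void {
    for (const rule of this.rules) {
      if (rule.predicate(e)) rule.action(e);
    }
  }
}

// Example rule in the spirit of the text: flag a high potassium result.
const monitor = new ClinicalEventMonitor();
monitor.register({
  predicate: (e) => e.code === "potassium" && e.value > 5.0,
  action: (e) => console.log(`notify Dr. X: patient ${e.patientId}, K+ = ${e.value}`),
});
monitor.onEvent({ patientId: "123", code: "potassium", value: 5.4 });
```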
Cross-references
▶ Clinical Observation ▶ Clinical Order ▶ Interface Engines in Healthcare ▶ Event Driven Architecture ▶ HL7 Reference Information Model ▶ Predicate Logic ▶ Propositions
Recommended Reading
1. Glaser J., et al. Impact of information events on medical care. HIMSS, 1996.
2. Hripcsak G., et al. Design of a clinical event monitor. Comp. Biomed. Res., 29:194–221, 1996.
3. McDonald C. Action-oriented Decisions in Ambulatory Medicine. Yearbook Medical Publishers, Chicago, IL, 1981.
Clinical Genetics ▶ Implications of Genomics for Clinical Informatics
Clinical Genomics ▶ Implications of Genomics for Clinical Informatics
Clinical Judgment ▶ Clinical Observation
Clinical Knowledge Base ▶ Clinical Knowledge Repository
Clinical Knowledge Directory ▶ Clinical Knowledge Repository
Clinical Knowledge Management Repository ▶ Clinical Knowledge Repository
Clinical Knowledge Repository ROBERTO A. ROCHA Partners Healthcare System, Inc., Boston, MA, USA
Synonyms Clinical knowledge base; Clinical content repository; Clinical content database; Clinical knowledge management repository; Clinical content registry; Clinical knowledge directory
Definition A clinical knowledge repository (CKR) is a multipurpose storehouse for clinical knowledge assets. ‘‘Clinical
knowledge asset’’ is a generic term that describes any type of human or machine-readable electronic content used for computerized clinical decision support. A CKR is normally implemented as an enterprise resource that centralizes a large quantity and wide variety of clinical knowledge assets. A CKR provides integrated support to all asset lifecycle phases such as authoring, review, activation, revision, and eventual inactivation. A CKR routinely provides services to search, retrieve, transform, merge, upload, and download clinical knowledge assets. From a content curation perspective, a CKR has to ensure proper asset provenance, integrity, and versioning, along with effective access and utilization constraints compatible with collaborative development and deployment activities. A CKR can be considered a specialized content management system, designed specifically to support clinical information systems. Within the context of clinical decision support systems, a CKR can be considered a special kind of knowledge base – one specially designed to manage multiple types of human and machine-readable clinical knowledge assets.
Key Points In recent years, multiple initiatives have attempted to better organize, filter, and apply the ever-growing biomedical knowledge. Among these initiatives, one of the most promising is the utilization of computerized clinical decision support systems. Computerized clinical decision support can be defined as computer systems that provide the correct amount of relevant knowledge at the appropriate time and context, contributing to improved clinical care and outcomes. A wide variety of knowledge-driven tools and methods have resulted in multiple modalities of clinical decision support, including information selection and retrieval, information aggregation and presentation, data entry assistance, event monitors, care workflow assistance, and descriptive or predictive modeling. A CKR provides an integrated storage platform that enables the creation and maintenance of multiple types of knowledge assets. A CKR ensures that different modalities of decision support can be combined to properly support the activities of clinical workers. Core requirements guiding the implementation of a CKR include clinical knowledge asset provenance (metadata), versioning, and integrity. Other essential requirements include the proper representation of access and utilization constraints, taking into account
the collaborative nature of asset development processes and deployment environments. Another fundamental requirement is to aptly represent multiple types of knowledge assets, where each type might require specialized storage and handling. The CKR core requirements are generally similar to those specified for other types of repositories used for storage and management of machine-readable assets.
Historical Background Biomedical knowledge has always been in constant expansion, but unprecedented growth is being observed during the last decade. Over 30% of the 16.8 million citations accumulated by MEDLINE until December of 2007 were created in the last 10 years, with an average of over 525,000 new citations per year [5]. The number of articles published each year is commonly used as an indicator of how much new knowledge the scientific community is creating. However, from a clinical perspective, particularly for those involved with direct patient care, the vast amount of new knowledge represents an ever-growing gap between what is known and what is routinely practiced. Multiple initiatives in recent years have attempted to better organize, filter, and apply the knowledge being generated. Among these various initiatives, one of the most promising is the utilization of computerized clinical decision support systems [6]. In fact, some authors avow that clinical care currently mandates a degree of individualization that is inconceivable without computerized decision support [1]. Computerized clinical decision support can be defined as computer systems that provide the correct amount of relevant knowledge at the appropriate time and context, ultimately contributing to improved clinical care and outcomes [3]. Computerized clinical decision support has been an active area of informatics research and development for the last three decades [2]. A wide variety of knowledge-driven tools and methods have resulted in multiple modalities of clinical decision support, including information selection and retrieval (e.g., infobuttons, crawlers), information aggregation and presentation (e.g., summaries, reports, dashboards), data entry assistance (e.g., forcing functions, calculations, evidence-based templates for ordering and documentation), event monitors (e.g., alerts, reminders, alarms), care workflow assistance (e.g., protocols, care pathways, practice guidelines), and descriptive or predictive modeling (e.g., diagnosis, prognosis, treatment planning, treatment outcomes). Each modality requires
specific types of knowledge assets, ranging from production rules to mathematical formulas, and from automated workflows to machine learning models. A CKR provides an integrated storage platform that enables the creation and maintenance of multiple types of assets using knowledge management best practices [4]. The systematic application of knowledge management processes and best practices to the biomedical domain is a relatively recent endeavor [2]. Consequently, a CKR should be seen as a new and evolving concept that is only now being recognized as a fundamental component for the acquisition, storage, and maintenance of clinical knowledge assets. Most clinical decision support systems currently in use still rely on traditional knowledge bases that handle a single type of knowledge asset and do not provide direct support for a complete lifecycle management process. Another relatively recent principle is the recognition that different modalities of decision support have to be combined and subsequently integrated with information systems to properly support the activities of clinical workers. The premise of integrating multiple modalities of clinical decision support reinforces the need for knowledge management processes supported by a CKR.
Foundations Core requirements guiding the implementation of a CKR include clinical knowledge asset provenance (metadata), versioning, and integrity. Requirements associated with proper access and utilization constraints are also essential, particularly considering the collaborative nature of most asset development processes and deployment environments. Another fundamental requirement is to aptly represent multiple types of knowledge assets, where each type might require specialized storage and handling. The CKR core requirements are generally similar to those specified for other types of repositories used for storage and management of machine-readable assets (e.g., "ebXML Registry" (http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=regrep)). Requirements associated with asset provenance can be implemented using a rich set of metadata properties that describe the origin, purpose, evolution, and status of each clinical knowledge asset. The metadata properties should reflect the information that needs to be captured during each phase of the knowledge asset lifecycle process, taking into account multiple iterative
authoring and review cycles, followed by a possibly long period of clinical use that might require multiple periodic revisions (updates). Despite the diversity of asset types, each with a potentially distinct lifecycle process, a portion of the metadata properties should be consistently implemented, enabling basic searching and retrieval services across asset types. Ideally, the shared metadata should be based on metadata standards (e.g., ‘‘Dublin Core Metadata Element Set’’ (http://dublincore.org/documents/dces/)). The adoption of standard metadata properties also simplifies the integration of external collections of clinical knowledge assets in a CKR. In addition to a shared set of properties, a CKR should also accommodate extended sets of properties specific for each clinical knowledge asset type and its respective lifecycle process. Discrete namespaces are commonly used to represent type-specific extended metadata properties. Asset version and status, along with detailed change tracking, are vital requirements for a CKR. Different versioning strategies can be used, but as a general rule there should be only one clinically active version of any given knowledge asset. This general rule is easily observed if the type and purpose of the clinical knowledge asset remains the same throughout its lifecycle. However, a competing goal is created with the very desirable evolution of human-readable assets to become machine-readable. Such evolution invariably requires the creation of new knowledge assets of different types and potentially narrower purposes. In order to support this ‘‘natural’’ evolution, a CKR should implement the concept of asset generations, while preserving the change history that links one generation to the next. Also within a clinical setting, it is not uncommon to have to ensure that knowledge assets comply with, or directly implement, different norms and regulations. As a result, the change history of a clinical knowledge asset should identify the standardization and compliance aspects considered, enabling subsequent auditing and/or eventual certification. Ensuring the integrity of clinical knowledge assets is yet another vital requirement for a CKR. Proper integrity guarantees that each asset is unique within a specific type and purpose, and that all its required properties are accurately defined. Integrity requirements also take into account the definition and preservation of dependencies between clinical knowledge assets. These dependencies can be manifested as simple hyperlinks, or as integral content defined as another
independent asset. Creating clinical knowledge assets from separate components or modules (i.e., modularity) is a very desirable feature in a CKR – one that ultimately contributes to the overall maintainability of the various asset collections. However, modularity introduces important integrity challenges, particularly when a new knowledge asset is being activated for clinical use. Activation for clinical use requires a close examination of all separate components, sometimes triggering unplanned revisions of components already in routine use. Another important integrity requirement is the ability to validate the structure and the content of a clinical knowledge asset against predefined templates (schemas) and dictionaries (ontologies). Asset content validation is essential for optimal integration with clinical information systems. Ideally, within a given healthcare organization all clinical information systems and the CKR should utilize the same standardized ontologies. Contextual characteristics of the care delivery process establish the requirements associated with proper access, utilization, and presentation of the clinical knowledge assets. The care delivery context is a multidimensional constraint that includes characteristics of the patient (e.g., gender, age group, language, clinical condition), the clinical worker (e.g., discipline, specialty, role), the clinical setting (e.g., inpatient, outpatient, ICU, Emergency Department), and the information system being used (e.g., order entry, documentation, monitoring), among others. The care delivery context normally applies to the entire clinical knowledge asset, directly influencing search, retrieval, and presentation services. The care delivery context can also be used to constrain specific portions of a knowledge asset, including links to other embedded assets, making them accessible only if the constraints are satisfied. An important integrity challenge created by the systematic use of the care delivery context is the need for reconciling conflicts caused by incompatible asset constraints, particularly when different teams maintain the assets being combined. In this scenario, competing requirements are frequently present, namely the intention to maximize modularity and reusability versus the need to maximize clinical specificity and ease of use.
worker is concise and appropriate to the care being delivered. However, the appropriateness of the information is largely defined by the constraints imposed by the aforementioned care delivery context. Moreover, the extent of indexing (‘‘retrievability’’) of most collections of unstructured clinical knowledge assets is not sufficient to fully recognize detailed care delivery context expressions. Ultimately, the care delivery context provides an extensible mechanism for defining the appropriateness of a given clinical knowledge asset in response to a wide variety of CKR service requests. The requirements just described are totally or partially implemented as part of general-purpose (enterprise) content management systems. However, content management systems have been traditionally constructed for managing primarily human-readable electronic content. Human-readable content, more properly characterized as unstructured knowledge assets, include narrative text, diagrams, and multimedia objects. When combined, these unstructured assets likely represent the largest portion of the inventory of clinical knowledge assets of any healthcare institution. As a result, in recent years different healthcare organizations have deployed CKRs using enterprise content management systems, despite their inability to manage machine-readable content.
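One way to make the preceding requirements concrete is to sketch the shape of a clinical knowledge asset record that carries provenance metadata, version and status information, component links, and care-delivery-context constraints. The sketch below is purely illustrative: the field names are assumptions loosely inspired by Dublin Core, not the schema of any standard or product.

```typescript
// Illustrative sketch of a clinical knowledge asset record combining
// provenance metadata, versioning/status, modularity, and context constraints.
type AssetStatus = "draft" | "in_review" | "active" | "revised" | "inactive";

interface CareDeliveryContext {
  patientAgeGroup?: string;    // e.g., "adult"
  clinicianRole?: string;      // e.g., "pharmacist"
  clinicalSetting?: string;    // e.g., "ICU"
  informationSystem?: string;  // e.g., "order entry"
}

interface ClinicalKnowledgeAsset {
  logicalId: string;           // identifies the asset across versions
  version: string;             // only one version should be clinically active
  status: AssetStatus;
  type: string;                // e.g., "rule", "order set", "reference document"
  title: string;
  creator: string[];           // provenance, loosely Dublin Core-like
  dateCreated: string;
  supersedes?: string;         // link to the previous generation of the asset
  components: string[];        // logical ids of embedded assets (modularity)
  context: CareDeliveryContext;
}

// Toy integrity check: at most one clinically active version per logical asset.
function hasSingleActiveVersion(assets: ClinicalKnowledgeAsset[]): boolean {
  const active = new Set<string>();
  for (const a of assets) {
    if (a.status === "active") {
      if (active.has(a.logicalId)) return false;
      active.add(a.logicalId);
    }
  }
  return true;
}
```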
Key Applications Computerized Clinical Decision Support, Clinical Knowledge Engineering, Clinical Information Systems.
Cross-references
▶ Biomedical Data/Content Acquisition, Curation ▶ Clinical Data Acquisition, Storage and Management ▶ Clinical Decision Support ▶ Dublin Core ▶ Evidence Based Medicine ▶ Executable Knowledge ▶ Metadata ▶ Reference Knowledge
Recommended Reading
1. Bates D.W. and Gawande A.A. Improving safety with information technology. N. Engl. J. Med., 348(25):2526–2534, 2003.
2. Greenes R.A. (ed.). Clinical Decision Support: The Road Ahead. Academic Press, Boston, 2007, pp. 544.
3. Osheroff J.A., Teich J.M., Middleton B., Steen E.B., Wright A., and Detmer D.E. A roadmap for national action on clinical decision support. J. Am. Med. Inform. Assoc., 14(2):141–145, 2007.
4. Rocha R.A., Bradshaw R.L., Hulse N.C., and Rocha B.H.S.C. The clinical knowledge management infrastructure of Intermountain Healthcare. In Clinical Decision Support: The Road Ahead, R.A. Greenes (ed.). Academic Press, Boston, 2007, pp. 469–502.
5. Statistical Reports on MEDLINE®/PubMed® Baseline Data, National Library of Medicine, Department of Health and Human Services [Online]. Available at: http://www.nlm.nih.gov/bsd/licensee/baselinestats.html. Accessed 8 Feb 2008.
6. Wyatt J.C. Decision support systems. J. R. Soc. Med., 93(12):629–633, 2000.
Clinical Nomenclatures ▶ Clinical Ontologies
Clinical Observation DAN RUSSLER Oracle Health Sciences, Redwood Shores, CA, USA
Synonyms Clinical result; Clinical judgment; Clinical test; Finding of observation
Definition 1. The act of measuring, questioning, evaluating, or otherwise observing a patient or a specimen from a patient in healthcare; the act of making a clinical judgment. 2. The result, answer, judgment, or knowledge gained from the act of observing a patient or a specimen from a patient in healthcare. These two definitions of "observation" have caused confusion in clinical communications, especially when applying the term to the rigor of standardized terminologies. When developing a list of observations, terminologists have differed on whether the list of terms should refer to the "act of observing" or the "result of the observation." Logical Observation Identifiers Names and Codes (LOINC) (www.loinc.org) focuses on observation as the "act of observing." The Systematized Nomenclature of Medicine (SNOMED) (www.ihtsdo.org) asserts that "General finding of observation of patient" is a synonym for "General observation of patient." Of note is the analysis in HL7 that identifies many shared
attributes between descriptions of the act of observing and the result obtained. As a consequence, in the HL7 Reference Information Model (RIM), both the act of observing and the result of the observation are contained in the same Observation Class (www.hl7.org).
Key Points The topic of clinical observation has been central to the study of medicine since medicine began. Early physicians focused on the use of all five senses in order to make judgments about the current condition of the patient, i.e., diagnosis, or to make judgments about the future of patients, i.e., prognosis. Physical exam included sight, touch, listening, and smell. Physicians diagnosed diabetes by tasting the urine for sweetness. As more tests on bodily fluids and tissues were discovered and used, the opportunity for better diagnosis and prognosis increased. Philosophy of science through the centuries often included the study of clinical observation in addition to the study of other observations in nature. During the last century, the study of rigorous testing techniques that improve the reproducibility and interpretation of results has included the development of extensive nomenclatures for naming the acts of observation and observation results, e.g., LOINC and SNOMED. These terminologies were developed in part to support the safe application of expert system rules to information recorded in the electronic health care record. The development of the HL7 Reference Information Model (RIM) was based on analysis of the ‘‘act of observing’’ and the ‘‘result of the act of observing’’ [1]. Today, new Entity attributes proposed for the HL7 RIM are evaluated for inclusion based partly on whether the information is best communicated in a new attribute for an HL7 Entity or best communicated in an HL7 Observation Act. Improved standardization of clinical observation techniques, both in the practice of bedside care and the recording of clinical observations in electronic healthcare systems is thought to be essential to the continuing improvement of healthcare and patient safety.
Cross-references
▶ Clinical Event ▶ Clinical Order ▶ Interface Engines in Healthcare
Recommended Reading
1. Russler D., et al. Influences of the unified service action model on the HL7 reference information model. In JAMIA Symposium Supplement, Proceedings SCAMC, 1999, pp. 930–934.
Clinical Ontologies YVES A. LUSSIER, JAMES L. CHEN University of Chicago, Chicago, IL, USA
Synonyms Clinical terminologies; Clinical nomenclatures; Clinical classifications
Definition An ontology is a formal representation of a set of heterogeneous concepts. However, in the life sciences, the term clinical ontology has also been more broadly defined as also comprising all forms of classified terminologies, including classifications and nomenclatures. Clinical ontologies provide not only a controlled vocabulary but also relationships among concepts allowing computer reasoning such that different parties, like physicians and insurers, can efficiently answer complex queries.
Historical Background As the life sciences integrate increasingly sophisticated systems of patient management, different means of data representation have had to keep pace to support user systems. Simultaneously, the explosion of genetic information from breakthroughs such as the Human Genome Project and gene chip technology has further expedited the need for robust, scalable platforms for handling heterogeneous data. Multiple solutions have been developed by the scientific community to answer these challenges at all the different levels of biology. This growing field of "systems medicine" starts humbly at the question: how can one best capture and represent complex data in a manner that can be understood globally without ambiguity? In other words, does the data captured have the same semantic validity after retrieval as it did before storage? These knowledgebases are in and of themselves organic. They need to be able to expand, shrink, and rearrange themselves based on user or system needs. This entry will touch upon existing clinical ontologies used in a variety of applications.
Foundations The complexity of biological data cannot be understated. Issues generally fall into challenges with (i) definition, (ii) context, (iii) composition, and (iv) scale. One cannot even take for granted that the term "genome" is well understood. Mahner found five different characterizations for the term "genome" [8]. Ontologies then provide a means of ensuring representational consistency through their structure and, equally important, provide the ability to connect these terms together in a semantically informative and computationally elegant manner [9]. This has led to their ubiquity in the life sciences. Formal ontologies are designated using frames or description logics [5]. However, few life science knowledgebases are represented completely in this manner due to difficulties with achieving consensus on definitions regarding the terms and the effort required to give context to the terms. Thus, this article defines well-organized nomenclatures and terminologies as clinical ontologies, regardless of whether their terms adhere to strict formalism. Looking at elevations in gene expression, it matters which organism was studied and under what experimental conditions the experiment was conducted. Clinical context changes the meaning of terms. The term "cortex" can indicate either a part of the kidney or a part of the brain. Generalized or "essential hypertension" can be what is known colloquially as "high blood pressure," or hypertension can be localized to the lungs as "pulmonary hypertension." One can have pulmonary hypertension but not essential hypertension. This leads to the next representational challenge, that of composition. Should hypertension be represented implicitly as "essential hypertension" and as "pulmonary hypertension"? Or should it be stored explicitly as "hypertension" with a location attribute? These representational decisions are driven by the queries that may be asked. The difficulty arises
in anticipating the queries and in post-processing of the query to split the terminological components of the overall concept. Finally, the knowledge model needs to be able to scale upward. The same decision logic that was relevant when the knowledgebase contained 100 concepts needs to still be relevant at 1,000,000 concepts. Properties of Clinical Ontologies
Ontologies vary widely in their degree of formalism and design. With this comes differing computability. In 1998, Cimino proposed desirable properties for purposes of clinical computation [3,4]. Table 1 summarizes the overall properties of the commonly used clinical ontologies.
1. Concept-oriented: a single concept is the preferred unit.
2. Formal semantic definition: well-defined terms.
3. Nonredundancy: each concept needs to be unique.
4. Nonambiguity: different concepts should not overlap or be conflated.
5. Relationships: the structure of connections between concepts differentiates ontologies:
– Monohierarchy (tree): each concept has only one parent concept.
– Polyhierarchy: each concept may multiply inherit from multiple parents.
– Directed Acyclic Graph (DAG): there are no cycles in the graph; in other words, children concepts may not point back to parent terms (see the sketch following this list).
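The structural distinctions in the list above (monohierarchy versus polyhierarchy, and the acyclicity that defines a DAG) can be checked mechanically once an ontology's is-a links are available. The sketch below is illustrative only; the adjacency representation and all names are assumptions rather than the structure of any particular terminology.

```typescript
// Tiny sketch: polyhierarchy detection and a DAG (acyclicity) check over is-a links.
type ConceptId = string;

// parents.get(c) = the parent concepts of c.
type Ontology = Map<ConceptId, ConceptId[]>;

function isPolyhierarchy(o: Ontology): boolean {
  return [...o.values()].some((parents) => parents.length > 1);
}

// Depth-first search for a cycle along parent links.
function isDag(o: Ontology): boolean {
  const state = new Map<ConceptId, "visiting" | "done">();
  const visit = (c: ConceptId): boolean => {
    if (state.get(c) === "done") return true;
    if (state.get(c) === "visiting") return false; // cycle found
    state.set(c, "visiting");
    for (const p of o.get(c) ?? []) if (!visit(p)) return false;
    state.set(c, "done");
    return true;
  };
  return [...o.keys()].every(visit);
}

// Example: "pulmonary hypertension" is-a "hypertension" and is-a "lung disorder".
const demo = new Map<ConceptId, ConceptId[]>([
  ["pulmonary hypertension", ["hypertension", "lung disorder"]],
  ["hypertension", ["disorder"]],
  ["lung disorder", ["disorder"]],
  ["disorder", []],
]);
console.log(isPolyhierarchy(demo), isDag(demo)); // true true
```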
Key Applications This section reviews several widely used life science ontologies used to annotate datasets. First, this discussion summarizes a select number of archetypal clinical
Clinical Ontologies. Table 1. Properties of clinical ontologies. Columns: Concept oriented, Formal semantic definition, Concept permanence, Nonredundancy, Uniqueness, Relationship (architecture); rows: ICD-9, LOINC, CPT, SNOMED, UMLS. Relationship architecture: ICD-9 = M, LOINC = P, CPT = M, SNOMED = DAG, UMLS = CG. M = Monohierarchy/tree, P = Polyhierarchy, DAG = Directed Acyclic Graph, CG = Cyclic Graph.
Clinical Ontologies. Table 2. Coverage of classifications, nomenclatures, and ontologies. Columns: Diseases, Anatomy, Morphology, Labs, Procedures, Drugs, and Number of concepts (order of magnitude, ranging from 10^4 to 10^6); rows: ICD-9, LOINC, CPT, SNOMED, UMLS.
ontologies that comprise one or several types of clinical entities such as diseases, clinical findings, procedures, laboratory measurements, and medications. Table 2 below summarizes the content coverage of each of the archetypal health ontologies.
Prototypical Clinical Ontologies
a. The Systematized Nomenclature of Medicine (SNOMED CT). SNOMED CT is the most extensive publicly available collection of clinical concepts. It is organized as a directed acyclic graph (DAG) and contains class/subclass relationships and partonomy relationships. It is maintained by the College of American Pathologists and is available in the United States through a license from the National Library of Medicine in perpetuity. SNOMED CT is one of the designated data standards for use in U.S. Federal Government systems for the electronic exchange of clinical health information. SNOMED CT is now owned by the International Healthcare Terminology Standards Development Organization [6].
b. International Statistical Classification of Diseases (ICD-9, ICD-10, ICD-CM). ICD-9 and ICD-10 are detailed ontologies of disease and symptomatology used ubiquitously for reimbursement systems (e.g., Medicare/Medicaid) and automated decision support in medicine. ICD-10 is used worldwide for morbidity and mortality statistics. Owned by the World Health Organization (WHO), licenses are generally available free for research. ICD-9-CM is a subtype of ICD-9 with clinical modifiers for billing purposes [11].
c. Medical Subject Headings (MeSH). MeSH grew out of an effort by the NLM for indexing life science journal articles and books [10]. The extensive controlled vocabulary MeSH serves as the backbone of the MEDLINE/PubMed article database. MeSH can be browsed and downloaded free of charge on the Internet [10].
d. International Classification of Primary Care (ICPC-2, ICPC-2-E). ICPC is a primary care encounter classification system [12]. It has a biaxial structure of 17 clinical systems and 7 types of data. It allows for the classification of the patient's reason for encounter (RFE), the problems/diagnoses managed, primary care interventions, and the ordering of the data of the primary care session in an episode of care structure. ICPC-2-E refers to a revised electronic version.
e. Diagnostic and Statistical Manual of Mental Disorders (DSM-IV, DSM-V). The DSM, edited and published by the American Psychiatric Association, provides categories of and diagnostic criteria for mental disorders [2]. It is used extensively by clinicians, policy makers, and insurers. The original version of the DSM was published in 1962. DSM-V is due for publication in May 2012. The diagnosis codes are developed to be compatible with ICD-9.
f. Logical Observation Identifiers Names and Codes (LOINC). LOINC is a database protocol aimed at standardizing laboratory and clinical codes. The Regenstrief Institute, Inc. maintains the LOINC database and supporting documentation. LOINC is endorsed by the American Clinical Laboratory Association and the College of American Pathologists and is one of the standards accepted by the US Federal Government for information exchange [7].
g. Current Procedural Terminology (CPT). The CPT code set is owned and maintained by the American Medical Association through the CPT Editorial Panel [1]. The CPT code set is used extensively to
Clinical Order
communicate medical and diagnostic services that were rendered among physicians and payers. The current version is the CPT 2008.
Cross-references
▶ Anchor text ▶ Annotation ▶ Archiving Experimental Data ▶ Biomedical Data/Content Acquisition, Curation ▶ Classification ▶ Clinical Data Acquisition, Storage and Management ▶ Clinical Data and Information Models ▶ Clinical Decision Support ▶ Data Integration Architectures and Methodology for the Life Sciences ▶ Data Types in Scientific Data Management ▶ Data Warehousing for Clinical Research ▶ Digital Curation ▶ Electronic Health Record ▶ Fully-Automatic Web Data Extraction ▶ Information Integration Techniques for Scientific Data ▶ Integration of Rules and Ontologies ▶ Logical Models of Information Retrieval ▶ Ontologies ▶ Ontologies and Life Science Data Management ▶ Ontology ▶ Ontology Elicitation ▶ Ontology Engineering ▶ Ontology Visual Querying ▶ OWL: Web Ontology Language ▶ Query Processing Techniques for Ontological Information ▶ Semantic Data Integration for Life Science Entities ▶ Semantic Web ▶ Storage Management ▶ Taxonomy: Biomedical Health Informatics ▶ Web Information Extraction
Recommended Reading 1. American Medical Association [cited; Available at: http://www.cptnetwork.com]. 2. American Psychiatric Association [cited; Available at: http://www.psych.org/MainMenu/Research/DSMIV.aspx]. 3. Cimino J.J. Desiderata for controlled medical vocabularies in the twenty-first century. Methods Inf. Med., 37(4–5):394–403, 1998. 4. Cimino J.J. In defense of the Desiderata [comment]. J. Biomed. Inform., 39(3):299–306, 2006.
5. Gruber T.R. Toward principles for the design of ontologies used for knowledge sharing. Int. J. Hum. Comput. Stud., 43(4–5):907–928, 1995. 6. I.H.T.S.D. [cited; Available from: http://www.ihtsdo.org/ourstandards/snomed-ct]. 7. Khan A.N. et al. Standardizing laboratory data by mapping to LOINC. J. Am. Med. Inform. Assoc., 13(3):353–355, 2006. 8. Mahner M. and Kary M. What exactly are genomes, genotypes and phenotypes? And what about phenomes? J. Theor. Biol., 186(1):55–63, 1997. 9. Musen M.A. et al. PROTEGE-II: computer support for development of intelligent systems from libraries of components. Medinfo, 8(Pt 1):766–770, 1995. 10. Nelson S.J., Johnston D., and Humphreys B.L. Relationships in medical subject headings. In Relationships in the Organization of Knowledge, A.B. Carol, G. Rebecca (eds.). Kluwer, Dordrecht, 2001, pp. 171–184. 11. World Health Organization [cited; Available at: http://www.who.int/classifications/icd/en/]. 12. World Organization of National Colleges, Academies, and Academic Associations of General Practitioners/Family Physicians, ICPC. International Classification of Primary Care. Oxford University Press, Oxford, 1987.
Clinical Order DAN RUSSLER Oracle Health Sciences, Redwood Shores, CA, USA
Synonyms Order item; Service order; Service request; Service item; Procedure order; Procedure request
Definition The act of requesting that a service be performed for a patient. Clinical orders in healthcare share many characteristics with purchase orders in other industries. Both clinical orders and purchase orders establish a customer-provider relationship between the person placing the request for a service to be provided and the person or organization filling the request. In both cases, the clinical order and purchase order are followed by either a promise or intent to fill the request, a decline to fill the request, or a counter-proposal to provide an alternate service. In both scenarios, an authorization step such as an insurance company authorization or a credit company authorization may be required. Therefore, the dynamic flow of communications between a placer and filler in a clinical order
management system and a purchase order management system are very similar. Both clinical order and purchase order management systems maintain a catalog of items that may be requested. These items in both kinds of systems may represent physical items from supply or services from a service provider. Each of these items in both kinds of systems is associated with an internally unique identifier, a text description, and often a code. Dates, status codes, delivery locations, and other attributes of a clinical order and purchase order are also similar. Therefore, in addition to similarities in the dynamic flow of order communications, the structure of the content in clinical orders and purchase orders is similar. Logical Observation Identifiers Names and Codes (LOINC) (www.loinc.org) describe many of the requested services in healthcare, especially in laboratory systems. Other procedural terminologies exist for healthcare, either independently in terminologies like LOINC or included in more comprehensive terminologies such as Systematized Nomenclature of Medicine (SNOMED) (www.ihtsdo.org).
Key Points Clinical orders exist in the context of a larger clinical management process. The order management business process of an organization, which includes defining a catalog of services to be provided and then allowing people to select from that catalog, is common in many industries. However, the decision support opportunities for helping providers select the optimum set of services for a patient are often more complex in healthcare than in other industries. The outcomes of this selection process are studied in clinical research, in clinical trials on medications and devices, and in organizational quality improvement initiatives. Finally, the outcomes of the service selection process are used to improve the clinical decision support processes utilized by providers selecting services for patients. This business process in healthcare, as well as in many other industries, describes a circular feedback loop defined by the offering of services, the selection of services, the delivery of services, the outcome of services, and finally, the modification of service selection opportunities and decision support. In the HL7 Reference Information Model (RIM), ''ACT'' classes sub-typed with the moodCode attribute support the healthcare improvement process
(www.hl7.org). These objects with process ''moods'' support the sequence of objects created during the execution of a process defined in Business Process Execution Language (BPEL) in a service-oriented architecture that begins with an ''order'', evolves into an ''appointment'', and is then completed as an ''event''. The reason the term ''mood'' is used is that the values of the moodCode attribute are analogous to the moods of verbs in many languages, e.g., the ''Definition mood'' used to define service catalogs corresponds to the ''infinitive'' verbal mood, i.e., a possible action; the ''Request or Order mood'' corresponds to the ''imperative'' verbal mood; the ''Event mood'' corresponds to the ''indicative'' verbal mood; and the ''Goal mood,'' which describes the desired outcome of the selected service, corresponds to the ''subjunctive'' verbal mood.
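To make the mood-driven lifecycle concrete, the following minimal Python sketch (not part of the original entry) models an act that moves from catalog definition to order, appointment and completed event. The mood abbreviations follow the HL7 v3 ActMood vocabulary (DEF, RQO, APT, EVN, GOL); the class, field names and the sample catalog code are made up for illustration, under stated assumptions rather than as the RIM's actual API.

```python
# Illustrative sketch of Act/moodCode: one act, several "moods".
from dataclasses import dataclass, replace
from enum import Enum

class ActMood(Enum):
    DEF = "definition"   # service-catalog entry ("infinitive": a possible action)
    RQO = "request"      # placed order ("imperative")
    APT = "appointment"  # promised/scheduled filling of the order
    EVN = "event"        # the service as actually performed ("indicative")
    GOL = "goal"         # desired outcome of the service ("subjunctive")

@dataclass(frozen=True)
class Act:
    code: str            # catalog item identifier (placeholder, not a real code)
    mood: ActMood

catalog_item = Act(code="LAB-GLUCOSE", mood=ActMood.DEF)   # item offered in the catalog
order        = replace(catalog_item, mood=ActMood.RQO)     # placer requests the service
appointment  = replace(order, mood=ActMood.APT)            # filler promises a slot
event        = replace(appointment, mood=ActMood.EVN)      # the service is performed
```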
Cross-references
▶ Clinical Event ▶ Clinical Observation ▶ Interface Engines in Healthcare
Clinical Research Chart ▶ Data Warehousing for Clinical Research
Clinical Result ▶ Clinical Observation
Clinical Terminologies ▶ Clinical Ontologies
Clinical Test ▶ Clinical Observation
Clock ▶ Physical Clock ▶ Time-Line Clock
Closed Itemset Mining and Non-redundant Association Rule Mining MOHAMMED J. ZAKI Rensselaer Polytechnic Institute, Troy, NY, USA
Synonyms Frequent concepts; Rule bases
Definition Let I be a set of binary-valued attributes, called items. A set X ⊆ I is called an itemset. A transaction database D is a multiset of itemsets, where each itemset, called a transaction, has a unique identifier, called a tid. The support of an itemset X in a dataset D, denoted sup(X), is the fraction of transactions in D where X appears as a subset. X is said to be a frequent itemset in D if sup(X) ≥ minsup, where minsup is a user-defined minimum support threshold. A (frequent) itemset is called closed if it has no (frequent) superset having the same support. An association rule is an expression A ⇒ B, where A and B are itemsets and A ∩ B = ∅. The support of the rule is the joint probability of a transaction containing both A and B, given as sup(A ⇒ B) = P(A ∧ B) = sup(A ∪ B). The confidence of a rule is the conditional probability that a transaction contains B, given that it contains A, given as conf(A ⇒ B) = P(B|A) = P(A ∧ B)/P(A) = sup(A ∪ B)/sup(A). A rule is frequent if the itemset A ∪ B is frequent. A rule is confident if conf ≥ minconf, where minconf is a user-specified minimum threshold. The aim of non-redundant association rule mining is to generate a rule basis, a small, non-redundant set of rules, from which all other association rules can be derived.
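As a small illustration of these definitions (a sketch, not part of the original entry), the following Python snippet computes support and confidence directly from the example transaction database used later in this entry (Table 1); the printed values match the W ⇒ T rule discussed in the Foundations section.

```python
# Support and confidence over a tiny transaction database.
def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """conf(A => B) = sup(A u B) / sup(A)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

D = [frozenset(t) for t in ("ACTW", "CDW", "ACTW", "ACDW", "ACDTW", "CDT")]
A, B = frozenset("W"), frozenset("T")
print(support(A | B, D))    # 0.5 -> sup(W => T)
print(confidence(A, B, D))  # 0.6 -> conf(W => T)
```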
Historical Background The notion of closed itemsets has its origins in the elegant mathematical framework of Formal Concept Analysis (FCA) [3], where they are called concepts. The task of mining frequent closed itemsets was independently proposed in [7,11]. Approaches for non-redundant association rule mining were also independently proposed in [1,9]. These approaches rely heavily on the seminal work on rule bases in [5,6]. Efficient algorithms for mining frequent closed itemsets include CHARM
[10], CLOSET [8] and several new approaches described in the Frequent Itemset Mining Implementations workshops [4].
Foundations Let I = {i1, i2,...,im} be the set of items, and let T = {t1, t2,...,tn} be the set of tids, the transaction identifiers. Just as a subset of items is called an itemset, a subset of tids is called a tidset. Let t : 2^I → 2^T be a function, defined as follows: t(X) = {t ∈ T | X ⊆ i(t)}. That is, t(X) is the set of transactions that contain all the items in the itemset X. Let i : 2^T → 2^I be a function, defined as follows: i(Y) = {i ∈ I | ∀t ∈ Y, t contains i}. That is, i(Y) is the set of items that are contained in all the tids in the tidset Y. Formally, an itemset X is closed if i ∘ t(X) = X, i.e., if X is a fixed point of the closure operator c = i ∘ t. From the properties of the closure operator, one can derive that X is the maximal itemset that is contained in all the transactions t(X), which gives the simple definition of a closed itemset, namely, a closed itemset is one that has no superset that has the same support. Based on the discussion above, three main families of itemsets can be distinguished. Let F denote the set of all frequent itemsets, given as F = {X | X ⊆ I and sup(X) ≥ minsup}. Let C denote the set of all closed frequent itemsets, given as C = {X | X ∈ F and ∄ Y ⊃ X with sup(X) = sup(Y)}. Finally, let M denote the set of all maximal frequent itemsets, given as M = {X | X ∈ F and ∄ Y ⊃ X such that Y ∈ F}. The following relationship holds between these sets: M ⊆ C ⊆ F, which is illustrated in Fig. 1, based on the example dataset shown in Table 1 and using minimum support minsup = 3. The equivalence classes of itemsets that have the same tidsets have been shown clearly; the largest itemset in each equivalence class is a closed itemset. The figure also shows that the maximal itemsets are a subset of the closed itemsets.
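The following brute-force Python sketch (illustrative only, not an efficient miner) evaluates t, i and the closure operator c = i ∘ t on the Table 1 dataset shown below, and derives the three families F, C and M for minsup = 3 given as an absolute count, matching the setting of Fig. 1.

```python
# Closure operator and the families F (frequent), C (closed), M (maximal).
from itertools import combinations

D = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}
ITEMS = sorted(set("".join(D.values())))

def t(X):                       # tidset of an itemset X
    return frozenset(tid for tid, items in D.items() if set(X) <= set(items))

def i(Y):                       # itemset common to all transactions in tidset Y
    return frozenset.intersection(*(frozenset(D[tid]) for tid in Y)) if Y else frozenset(ITEMS)

def closure(X):                 # c(X) = i(t(X))
    return i(t(X))

minsup = 3                      # absolute support count here
F = [frozenset(X) for r in range(1, len(ITEMS) + 1)
     for X in combinations(ITEMS, r) if len(t(X)) >= minsup]
C = [X for X in F if closure(X) == X]
M = [X for X in C if not any(X < Y for Y in F)]
# Every maximal itemset is closed and every closed itemset is frequent: M <= C <= F.
print(len(F), len(C), len(M))
```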
Closed Itemset Mining and Non-redundant Association Rule Mining. Figure 1. Frequent, closed frequent and maximal frequent itemsets.
Closed Itemset Mining and Non-redundant Association Rule Mining. Table 1. Example transaction dataset
tid  i(t)
1    ACTW
2    CDW
3    ACTW
4    ACDW
5    ACDTW
6    CDT
Mining Closed Frequent Itemsets
CHARM [10] is an efficient algorithm for mining closed itemsets. Define two itemsets X, Y of length k as belonging to the same prefix equivalence class, [P], if they share the (k−1)-length prefix P, i.e., X = Px and Y = Py, where x, y ∈ I. More formally, [P] = {Pxi | xi ∈ I} is the class of all itemsets sharing P as a common prefix. In CHARM there is no distinct candidate generation and support counting phase. Rather, counting is simultaneous with candidate generation. For a given prefix class, one performs intersections of the tidsets of all pairs of itemsets in the class, and checks if the resulting tidsets have cardinality at least minsup. Each resulting frequent itemset generates a new class which will be expanded in the next step. That is, for a given class of itemsets with prefix P, [P] = {Px1, Px2,..., Pxn}, one performs the intersection of Pxi with all Pxj with j > i to obtain a new class [Pxi] = [P′] with elements P′xj, provided the itemset Pxixj is frequent. The computation progresses recursively until no more frequent itemsets are produced. The initial invocation is with the class of frequent single items (the class [∅]). All tidset intersections for pairs of class elements are computed. However, in addition to checking for frequency, CHARM eliminates branches that cannot lead to closed sets, and grows closed itemsets using subset relationships among tidsets. There are four cases: if t(Xi) ⊂ t(Xj) or if t(Xi) = t(Xj), then replace every occurrence of Xi with Xi ∪ Xj, since whenever Xi occurs Xj also occurs, which implies that c(Xi) = c(Xi ∪ Xj). If t(Xi) ⊃ t(Xj), then replace Xj for the same reason. Finally, further recursion is required if t(Xi) ≠ t(Xj). These four properties allow CHARM to efficiently prune the search tree (for additional details see [10]). Figure 2 shows how CHARM works on the example database shown in Table 1. First, CHARM sorts the items in increasing order of support, and initializes the root class as [∅] = {D 2456, T 1356, A 1345, W 12345, C 123456}. The notation D 2456 stands for the itemset D and its tidset t(D) = {2,4,5,6}. CHARM first processes the node D 2456; it will be combined with the sibling elements. DT and DA are not frequent and are thus pruned. Looking at W, since t(D) ≠ t(W), W is inserted in the new equivalence class [D]. For C, since t(D) ⊂ t(C), all occurrences of D are replaced with DC, which means that [D] is also changed to [DC], and the element DW to DWC.
Closed Itemset Mining and Non-redundant Association Rule Mining. Figure 2. CHARM: mining closed frequent itemsets.
A recursive call with class [DC] is then made, and since there is only a single itemset DWC, it is added to the set of closed itemsets C. When the call returns to D (i.e., DC), all elements in the class have been processed, so DC itself is added to C. When processing T, t(T) ≠ t(A), and thus CHARM inserts A in the new class [T]. Next it finds that t(T) ≠ t(W) and updates [T] = {A, W}. When it finds t(T) ⊂ t(C), it updates all occurrences of T with TC. The class [T] becomes [TC] = {A, W}. CHARM then makes a recursive call to process [TC]. When combining TAC with TWC it finds t(TAC) = t(TWC), and thus replaces TAC with TACW, deleting TWC at the same time. Since TACW cannot be extended further, it is inserted in C. Finally, when it is done processing the branch TC, it too is added to C. Since t(A) ⊂ t(W) ⊂ t(C), no new recursion is made; the final set of closed itemsets C consists of the uncrossed itemsets shown in Fig. 2.
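The sketch below illustrates the tidset-intersection search just described. It is not a reimplementation of CHARM: it enumerates prefix classes in the same spirit but, for simplicity, computes the closure of each candidate directly as i(t(X)) (intersecting the transactions in its tidset) rather than applying CHARM's four replacement properties. On the Table 1 data it recovers the same closed sets as Fig. 2.

```python
# Simplified closed-itemset mining via tidset intersections (illustrative).
def mine_closed(transactions, minsup):
    item_tids = {}
    for tid, items in transactions.items():
        for item in items:
            item_tids.setdefault(item, set()).add(tid)

    def closure(tidset):                      # c(X) = i(t(X)) computed directly
        its = iter(tidset)
        common = set(transactions[next(its)])
        for tid in its:
            common &= set(transactions[tid])
        return frozenset(common)

    closed = {}

    def explore(prefix_class):                # list of (itemset, tidset) pairs
        for idx, (X, tX) in enumerate(prefix_class):
            closed.setdefault(closure(tX), len(tX))
            new_class = []
            for Y, tY in prefix_class[idx + 1:]:
                tXY = tX & tY                 # tidset intersection
                if len(tXY) >= minsup:
                    new_class.append((X | Y, tXY))
            if new_class:
                explore(new_class)

    root = [(frozenset([i]), t) for i, t in sorted(item_tids.items(),
            key=lambda kv: len(kv[1])) if len(t) >= minsup]
    explore(root)
    return closed                             # closed itemset -> support count

D = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}
print(mine_closed(D, 3))   # closed sets: C, CD, CW, CT, CDW, ACW, ACTW
```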
Non-redundant Association Rules
Given the set of closed frequent itemsets C, one can generate all non-redundant association rules. There are two main classes of rules: (i) those that have 100% confidence, and (ii) those that have less than 100% confidence [9]. Let X1 and X2 be closed frequent itemsets. The 100% confidence rules are equivalent to those directed from X1 to X2, where X2 ⊆ X1, i.e., from a superset to a subset (not necessarily a proper subset). For example, the rule W ⇒ C is equivalent to the rule between the closed itemsets c(W) ⇒ c(C), i.e., CW ⇒ C. Its support is sup(CW) = 5/6, and its confidence is sup(CW)/sup(W) = 5/5 = 1, i.e., 100%. The less than 100% confidence rules are equivalent to those from X1 to X2, where X1 ⊂ X2, i.e., from a subset to a proper superset. For example, the rule W ⇒ T is equivalent to the rule c(W) ⇒ c(W ∪ T), i.e., CW ⇒ ACTW. Its support is sup(TW) = 3/6 = 0.5, and its confidence is sup(TW)/sup(W) = 3/5 = 0.6, or 60%. More details on how to generate these non-redundant rules appear in [9].
Key Applications Closed itemsets provide a loss-less representation of the set of all frequent itemsets; they allow one to determine not only the frequent sets but also their exact support. At the same time they can be orders of magnitude fewer. Likewise, the non-redundant rules provide a much smaller, and manageable, set of rules, from which all other rules can be derived. There are numerous applications of these methods, such as market basket analysis, web usage mining, gene expression pattern mining, and so on.
Future Directions
Closed itemset mining has inspired a lot of subsequent research in mining compressed representations or summaries of the set of frequent patterns; see [2] for a survey of these approaches. Mining compressed pattern bases remains an active area of study.
Experimental Results A number of algorithms have been proposed to mine frequent closed itemsets, and to extract non-redundant rule bases. The Frequent Itemset Mining Implementations (FIMI) Repository contains links to many of the latest implementations for mining closed itemsets. A report on the comparison of these methods
also appears in [4]. Other implementations can be obtained from individual authors' websites.
Data Sets The FIMI repository has a number of real and synthetic datasets used in various studies on closed itemset mining.
URL to Code The main FIMI website is at http://fimi.cs.helsinki.fi/, which is also mirrored at: http://www.cs.rpi.edu/~zaki/FIMI/
Cross-references
▶ Association Rule Mining on Streams ▶ Data Mining
Recommended Reading 1. Bastide Y., Pasquier N., Taouil R., Stumme G., and Lakhal L. Mining minimal non-redundant association rules using frequent closed itemsets. In Proc. 1st Int. Conf. Computational Logic, 2000, pp. 972–986. 2. Calders T., Rigotti C., and Boulicaut J.-F. A Survey on Condensed Representation for Frequent Sets. In Constraint-based Mining and Inductive Databases, LNCS, Vol. 3848, J.-F. Boulicaut, L. De Raedt, and H. Mannila (eds.). Springer, 2005, pp. 64–80. 3. Ganter B. and Wille R. Formal Concept Analysis: Mathematical Foundations. Springer, Berlin Heidelberg New York, 1999. 4. Goethals B. and Zaki M.J. Advances in frequent itemset mining implementations: report on FIMI'03. SIGKDD Explor., 6(1):109–117, June 2003. 5. Guigues J.L. and Duquenne V. Familles minimales d'implications informatives resultant d'un tableau de donnees binaires. Math. Sci. hum., 24(95):5–18, 1986. 6. Luxenburger M. Implications partielles dans un contexte. Math. Inf. Sci. hum., 29(113):35–55, 1991. 7. Pasquier N., Bastide Y., Taouil R., and Lakhal L. Discovering frequent closed itemsets for association rules. In Proc. 7th Int. Conf. on Database Theory, 1999, pp. 398–416. 8. Pei J., Han J., and Mao R. CLOSET: An efficient algorithm for mining frequent closed itemsets. In Proc. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2000, pp. 21–30. 9. Zaki M.J. Generating non-redundant association rules. In Proc. 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2000, pp. 34–43. 10. Zaki M.J. and Hsiao C.-J. CHARM: An efficient algorithm for closed itemset mining. In Proc. SIAM International Conference on Data Mining, 2002, pp. 457–473. 11. Zaki M.J. and Ogihara M. Theoretical foundations of association rules. In Proc. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 1998.
Closest Pairs ▶ Closest-Pair Query
Closest-Pair Query ANTONIO CORRAL 1, MICHAEL VASSILAKOPOULOS 2 1 University of Almeria, Almeria, Spain 2 University of Central Greece, Lamia, Greece
Synonyms Closest pairs; k-Closest pair query; k-Distance join; Incremental k-distance join; k-Closest pair join
Definition Given two sets P and Q of objects, a closest pair (CP) query discovers the pair of objects (p, q) with a distance that is the smallest among all object pairs in the Cartesian product P × Q. Similarly, a k closest pair query (k-CPQ) retrieves k pairs of objects from P and Q with the minimum distances among all the object pairs. In spatial databases, the distance is usually defined according to the Euclidean metric, and the sets of objects P and Q are disk-resident. Query algorithms aim at minimizing the processing cost and the number of I/O operations, by using several optimization techniques for pruning the search space.
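As a point of reference (a naive sketch, not from the entry), the following Python snippet answers a k-CPQ by scoring every pair in P × Q and keeping the k smallest Euclidean distances; the spatial-database algorithms described below exist precisely to avoid this quadratic scan on disk-resident data.

```python
# Brute-force k closest pairs between two small in-memory point sets.
import heapq, math

def k_closest_pairs(P, Q, k):
    return heapq.nsmallest(
        k,
        ((math.dist(p, q), p, q) for p in P for q in Q),
        key=lambda entry: entry[0],
    )

P = [(1, 1), (4, 5), (9, 2)]
Q = [(2, 2), (8, 8), (5, 5)]
print(k_closest_pairs(P, Q, 2))   # the two closest (distance, p, q) pairs
```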
Historical Background The closest pair query has been widely studied in computational geometry. More recently, this problem has been approached in the context of spatial databases [4,8,12,14]. In spatial databases, existing algorithms assume that P and Q are indexed by a spatial access method (usually an R-tree [1]) and utilize some pruning bounds and heuristics to restrict the search space. [8] was the first to address this issue, and proposed the following distance-based algorithms: incremental distance join, k distance join and k distance semijoin between two R-tree indices. The incremental processing reports one-by-one the desired elements of the result in ascending order of distance (k is unknown in advance and the user can stop when he/she is satisfied by the result). The algorithms follow the Best-First (BF) traversal policy, which keeps a heap with the entries of the nodes visited so far (it maintains a priority queue which contains pairs of index entries
and objects, and pops out the closest pair to process it). BF is near-optimal for CP queries; i.e., with a high probability it only visits the pairs of nodes necessary for obtaining the result. In [12] several modifications to the algorithms of [8] were proposed in order to improve performance. Mainly, a method was proposed for selecting the sweep axis and direction for the plane sweep technique in bidirectional node expansion, which minimizes the computational overhead of [8]. Later, an improved version of BF and several algorithms that follow Depth-First (DF) traversal ordering from the non-incremental point of view (which assumes that k is known in advance and reports the k elements of the result all together at the end of the algorithm) were proposed in [4]. In general, a DF algorithm visits the roots of the two R-trees and recursively follows the pair of entries <EP, EQ>, EP ∈ RP and EQ ∈ RQ, whose MINMINDIST is the minimum distance among all pairs. In contrast to BF, DF is sub-optimal, i.e., it accesses more nodes than necessary. The main disadvantage of BF with respect to DF is that it may suffer from buffer thrashing if the available memory is not enough for the heap (it is space-consuming), when a great quantity of elements of the result is required. In this case, part of the heap must be migrated to disk, incurring frequent I/O accesses. The implementation of DF is by recursion, which is available in most programming languages, and is linear-space consuming with respect to the height of the R-trees. Moreover, BF is not favored by page replacement policies (e.g., LRU), as it does not exhibit locality between I/O accesses. Another interesting contribution to the CP query was proposed by [14], in which a new structure called the b-Rdnn tree was presented, along with a better solution to the k-CP query when there is high overlap between the two datasets. The main idea is to find k objects from each dataset which are the closest to the other dataset. There are a lot of papers related to the k-CP query, like the buffer query [3], iceberg distance join query [13], multi-way distance join query [6], k-nearest neighbor join [2], closest pair query with spatial constraints [11], etc. For example, a buffer query [3] involves two spatial datasets and a distance threshold r; the answer to this query is a set of pairs of spatial objects, one from each input dataset, that are within distance r of each other.
Foundations In spatial databases, existing algorithms assume that sets of spatial objects are indexed by a spatial access
method (usually an R-tree [1]) and utilize some pruning bounds to restrict the search space. An R-tree is a hierarchical, height-balanced multidimensional data structure, designed to be used in secondary storage and based on B-trees. The R-trees are considered excellent choices for indexing various kinds of spatial data (points, rectangles, line-segments, polygons, etc.). They are used for the dynamic organization of a set of spatial objects approximated by their Minimum Bounding Rectangles (MBRs). These MBRs are characterized by the min and max points of rectangles with faces parallel to the coordinate axes. Using the MBR instead of the exact geometrical representation of the object, its representational complexity is reduced to two points, while the most important features of the spatial object (position and extension) are maintained. The R-trees belong to the category of data-driven access methods, since their structure adapts itself to the MBR distribution in the space (i.e., the partitioning adapts to the object distribution in the embedding space). Figure 1a shows two point sets P and Q (and the node extents), where the closest pair is (p8, q8), and Fig. 1b is the R-tree for the point set P = {p1, p2,...,p12} with a capacity of three entries per node (branching factor or fan-out). Assuming that the spatial datasets are indexed by any spatial tree-like structure belonging to the R-tree family, the main objective while answering these types of spatial queries is to reduce the search space. In [5], three MBR-based distance functions to be used in algorithms for CP queries were formally defined, as an extension of the work presented in [4]. These metrics are MINMINDIST, MINMAXDIST and MAXMAXDIST. MINMINDIST(M1, M2) between two MBRs is the minimum possible distance between any point in the first MBR and any point in the second MBR. MAXMAXDIST(M1, M2) between two MBRs is the maximum possible distance between any point in the first MBR and any point in the second MBR. Finally, MINMAXDIST(M1, M2) between two MBRs is the minimum of the maximum distance values of all the pairs of orthogonal faces to each dimension. Formally, they are defined as follows. Given two MBRs M1 = (a, b) and M2 = (c, d) in the d-dimensional Euclidean space, let a = (a1, a2,...,ad) and b = (b1, b2,...,bd) with ai ≤ bi for 1 ≤ i ≤ d, and let c = (c1, c2,...,cd) and d = (d1, d2,...,dd) with ci ≤ di for 1 ≤ i ≤ d.
Closest-Pair Query. Figure 1. Example of an R-tree and a point CP query.
The MBR-based distance functions are then defined as follows:
MINMINDIST(M1, M2) = ( Σ_{i=1}^{d} f_i )^{1/2}, where f_i = (c_i − b_i)² if c_i > b_i, f_i = (a_i − d_i)² if a_i > d_i, and f_i = 0 otherwise.
MAXMAXDIST(M1, M2) = ( Σ_{i=1}^{d} g_i )^{1/2}, where g_i = (d_i − a_i)² if c_i > b_i, g_i = (b_i − c_i)² if a_i > d_i, and g_i = max{(d_i − a_i)², (b_i − c_i)²} otherwise.
MINMAXDIST(M1, M2) = min_{1 ≤ j ≤ d} ( x_j² + Σ_{i=1, i≠j}^{d} y_i² )^{1/2}, where x_j = min{|a_j − c_j|, |a_j − d_j|, |b_j − c_j|, |b_j − d_j|} and y_i = max{|a_i − d_i|, |b_i − c_i|}.
To illustrate the distance functions MINMINDIST, MINMAXDIST and MAXMAXDIST, which are the basis of query algorithms for CPQ, Fig. 2 depicts two MBRs, their MBR-based distance functions, and their relation to the distance (dist) between two points (pi, qj) in 2-dimensional Euclidean space.
Closest-Pair Query. Figure 2. MBR-based distance functions in 2-dimensional Euclidean space.
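The following Python sketch transcribes the three MBR-based distance functions as reconstructed above (a sketch under that assumption, not production code); an MBR is given as (a, b), where a is the lower-left and b the upper-right corner, each a tuple of d coordinates.

```python
# MINMINDIST, MAXMAXDIST and MINMAXDIST between two MBRs.
import math

def minmindist(M1, M2):
    (a, b), (c, d) = M1, M2
    s = 0.0
    for ai, bi, ci, di in zip(a, b, c, d):
        if ci > bi:
            s += (ci - bi) ** 2
        elif ai > di:
            s += (ai - di) ** 2
    return math.sqrt(s)

def maxmaxdist(M1, M2):
    (a, b), (c, d) = M1, M2
    s = 0.0
    for ai, bi, ci, di in zip(a, b, c, d):
        if ci > bi:
            s += (di - ai) ** 2
        elif ai > di:
            s += (bi - ci) ** 2
        else:
            s += max((di - ai) ** 2, (bi - ci) ** 2)
    return math.sqrt(s)

def minmaxdist(M1, M2):
    (a, b), (c, d) = M1, M2
    y = [max(abs(ai - di), abs(bi - ci)) for ai, bi, ci, di in zip(a, b, c, d)]
    total = sum(v * v for v in y)
    best = float("inf")
    for j in range(len(a)):
        xj = min(abs(a[j] - c[j]), abs(a[j] - d[j]), abs(b[j] - c[j]), abs(b[j] - d[j]))
        best = min(best, math.sqrt(xj * xj + total - y[j] * y[j]))
    return best

M1 = ((0.0, 0.0), (2.0, 2.0))
M2 = ((3.0, 4.0), (5.0, 6.0))
print(minmindist(M1, M2), minmaxdist(M1, M2), maxmaxdist(M1, M2))
```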
same sense, MINMAXDIST(M1, M2) serves as an upper bounding function of the Euclidean distance from the closest pair of spatial objects enclosed by the MBRs M1 and M2. As long as the distance functions are consistent, the branch-bound algorithms based on them will work correctly [5]. Moreover, the general pruning mechanism for k-CP queries over R-tree nodes using branch-andbound algorithms is the following: if MINMINDIST (M1, M2) > z, then the pair of MBRs (M1, M2) will be discarded, where z is the distance value of the k-th closest pair that has been found so far (during the processing of the algorithm), or the distance value of the k-th largest MAXMAXDIST found so far (z is also called as the pruning distance). Branch-and-bound algorithms can be designed following DF or BF traversal ordering (Breadth-First traversal order (level-by-level) can also be implemented, but the processing of each level must follow a BF order) to report k closest pairs in non-incremental way (for incremental processing the ordering of traversal must be BF [8]).
As an example, Fig. 3 shows the BF k-CPQ algorithm for two R-trees, for the non-incremental processing version. This algorithm needs to keep a minimum binary heap (H) with the references to pairs of internal nodes (characterized by their MBRs) accessed so far from the two different R-trees and their minimum distance (<MINMINDIST, AddrMPi, AddrMQj>). It visits the pair of MBRs (nodes) with the minimum MINMINDIST in H, until it becomes empty or the MINMINDIST value of the pair of MBRs located in the root of H is larger than the distance value of the k-th closest pair that has been found so far (z). To keep track of z, an additional data structure that stores the k closest pairs discovered during the processing of the algorithm is needed. This data structure is organized as a maximum binary heap (k-heap) and holds pairs of objects according to their minimum distance (the pair with the largest distance resides in the root). In the implementation of k-CPQ algorithm, the following cases must be considered: (i) initially the k-heap is
empty (z is initialized to ∞), (ii) the pairs of objects reached at the leaf level are inserted in the k-heap until it gets full (z keeps the value ∞), (iii) if the distance of a new pair of objects discovered at the leaf level is smaller than the distance of the pair residing in the k-heap root, then the root is extracted and the new pair is inserted in the k-heap, updating this data structure and z (the distance of the pair of objects residing in the k-heap root). Several optimizations have been proposed in order to improve performance, mainly with respect to the CPU cost. For instance, a method for selecting the sweep axis and direction for the plane-sweep technique has been proposed [12]. But the most important optimization is the use of the plane-sweep technique for k-CPQ [5,12], which is a common technique for computing intersections. The basic idea is to move a sweep-line perpendicular to one of the dimensions, the so-called sweeping dimension, from left to right. This technique is applied for restricting all possible combinations of pairs of MBRs from the two R-tree nodes RP and RQ.
Closest-Pair Query. Figure 3. Best-First k-CPQ Algorithm using R–trees.
If this technique is not used, then a set with all possible combinations of pairs of MBRs from the two R-tree nodes must be created. In general, the technique consists of sorting the MBRs of the two current R-tree nodes based on the coordinates of their lower left corners, in increasing order. Each MBR encountered during the plane sweep is selected as a pivot, and it is paired up with the non-processed MBRs of the other R-tree node from left to right. The pairs of MBRs whose MINMINDIST on the sweeping dimension is less than or equal to z (the pruning distance) are selected for processing. After all possible pairs of MBRs that contain the pivot have been found, the pivot is updated with the MBR having the next smallest lower-left-corner value on the sweeping dimension, and the process is repeated. In summary, the application of this technique can be viewed as a sliding window on the sweeping dimension with a width of z starting at the lower end of the pivot MBR, where all possible pairs that can be formed using the pivot MBR and the MBRs from the remaining entries of the other R-tree node that fall into the current sliding window are chosen. For example, in Fig. 4, a set of MBRs from two R-tree nodes ({MP1, MP2, MP3, MP4, MP5, MP6} and {MQ1, MQ2, MQ3, MQ4, MQ5, MQ6, MQ7}) is shown. Without plane-sweep, 6*7 = 42 pairs of MBRs must be generated. If the plane-sweep technique is applied over the X axis (the sweeping dimension), taking into account the distance value z (the pruning distance), this number of possible pairs is reduced considerably (the number of pairs of MBRs selected using the plane-sweep technique is only 29).
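A minimal Python sketch of this sliding-window pairing follows (illustrative only, not the authors' implementation): given the MBR entries of two nodes and the current pruning distance z, it sorts all entries by their lower-left corner on the sweeping axis and pairs each pivot only with entries of the other node whose window on that axis starts within z of the pivot. A real algorithm would then compute the full MINMINDIST of each surviving pair and prune again with z.

```python
# Plane-sweep candidate filtering over the MBRs of two R-tree nodes.
def plane_sweep_pairs(node_p, node_q, z, axis=0):
    """Candidate MBR pairs whose lower corners on the sweeping axis are within z."""
    events = sorted([(m[0][axis], 'P', m) for m in node_p] +
                    [(m[0][axis], 'Q', m) for m in node_q])
    candidates = []
    for idx, (low, side, pivot) in enumerate(events):
        for other_low, other_side, other in events[idx + 1:]:
            if other_low - low > z:          # outside the sliding window: stop
                break
            if other_side != side:           # only cross pairs (one MBR per node)
                candidates.append((pivot, other) if side == 'P' else (other, pivot))
    return candidates

P = [((0, 0), (1, 1)), ((5, 0), (6, 1))]
Q = [((1.5, 0), (2.5, 1)), ((9, 0), (10, 1))]
print(plane_sweep_pairs(P, Q, z=2.0))        # only nearby cross pairs survive
```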
Key Applications Geographical Information Systems
Closest pair is a common distance-based query in the spatial database context, and it has only recently received special attention. Efficient algorithms are important for dealing with the large amount of spatial data in several GIS applications. For example, k-CPQ can discover the k closest pairs of cities and cultural landmarks, reporting them in increasing order of their distances.
Closest pair queries have been considered as a core module of clustering. For example, a proposed clustering algorithm [10] owes its efficiency to the use of the closest pair query, as opposed to previous quadratic-cost approaches.
A number of decision support tasks can be modeled as closest pair queries. For instance, find the top k factory-house pairs ordered by the closeness to one another. This gives a measure of the effect of an individual factory on an individual household, and can give workers a priority as to which factory to address first.
Future Directions The k-closest pair query is a useful type of query in many practical applications involving spatial data, and the traditional technique to handle this spatial query generally assumes that the objects are static. Objects represented as a function of time have been studied in other domains, as in the spatial semijoin [9].
Closest-Pair Query. Figure 4. Using plane-sweep technique over MBRs from two R-tree nodes.
For this reason, closest pair queries in spatio-temporal databases could be an interesting line of research. Another interesting problem to study is the monitoring of k-closest pairs over moving objects, which aims at maintaining closest-pair results while the underlying objects change their positions [15]. For example, return k pairs of taxi stands and taxis that have the smallest distances. Other interesting topics to consider (from the static point of view) are to study k-CPQ between different spatial data structures (Linear Region Quadtrees for raster and R-trees for vector data), and to investigate k-CPQ in non-Euclidean spaces (e.g., road networks).
Experimental Results In general, for every presented method, there is an accompanying experimental evaluation in the corresponding reference. [4,5,8] compare BF and DF traversal orders for conventional k-CPQ (from the incremental and non-incremental points of view). In [7], a cost model for k-CPQ using R-trees was proposed and its accuracy evaluated. Moreover, experimental results on k-closest pair queries supporting the fact that the b-Rdnn tree is a better alternative to R*-trees when there is high overlap between the two datasets were presented in [14].
Data Sets A large collection of real datasets, commonly used for experiments, can be found at: http://www.rtreeportal.org/
URL to Code R-tree portal (see above) contains the code for most common spatial access methods (mainly the R-tree and variations), as well as data generators and several useful links for researchers and practitioners in spatial databases. The C++ sources of k-CPQ are available at: http://www.ual.es/acorral/DescripcionTesis.htm
Recommended Reading 1. Beckmann N., Kriegel H.P., Schneider R., and Seeger B. The R*-tree: an efficient and robust access method for points and rectangles. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1990, pp. 322–331. 2. Bo¨hm C. and Krebs F. The k-nearest neighbour join: Turbo charging the KDD process. Knowl. Inform. Syst., 6(6):728–749, 2004. 3. Chan E.P.F. Buffer queries. IEEE Trans. Knowl. Data Eng., 15(4):895–910, 2003. 4. Corral A., Manolopoulos Y., Theodoridis Y., and Vassilakopoulos M. Closest pair queries in spatial databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2000, pp. 189–200. 5. Corral A., Manolopoulos Y., Theodoridis Y., and Vassilakopoulos M. Algorithms for processing K-closest-pair queries in spatial databases. Data Knowl. Eng., 49(1):67–104, 2004. 6. Corral A., Manolopoulos Y., Theodoridis Y., and Vassilakopoulos M. Multi-way distance join queries in spatial databases. GeoInformatica, 8(4):373–402, 2004. 7. Corral A., Manolopoulos Y., Theodoridis Y., and Vassilakopoulos M. Cost models for distance joins queries using R-trees. Data Knowl. Eng., 57(1):1–36, 2006. 8. Hjaltason G.R. and Samet H. Incremental distance join algorithms for spatial databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1998, pp. 237–248. 9. Iwerks G.S., Samet H., and Smith K. Maintenance of spatial semijoin queries on moving points. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004, pp. 828–839. 10. Nanopoulos A., Theodoridis Y., and Manolopoulos Y. C2P: clustering based on closest pairs. In Proc. 27th Int. Conf. on Very Large Data Bases, 2001, pp. 331–340. 11. Papadopoulos A.N., Nanopoulos A., and Manolopoulos Y. Processing distance join queries with constraints. Comput. J., 49(3):281–296, 2006. 12. Shin H., Moon B., and Lee S. Adaptive multi-stage distance join processing. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2000, pp. 343–354. 13. Shou Y., Mamoulis N., Cao H., Papadias D., and Cheung D.W. Evaluation of iceberg distance joins. In Proc. 8th Int. Symp. Advances in Spatial and Temporal Databases, 2003, pp. 270–288. 14. Yang C. and Lin K. An index structure for improving closest pairs and related join queries in spatial databases. In Proc. Int. Conf. on Database Eng. and Applications, 2002, pp. 140–149. 15. Zhu M., Lee D.L., and Zhang J. k-closest pair query monitoring over moving objects. In Proc. 3rd Int. Conf. on Mobile Data Management, 2002, pp. 14–14.
Cross-references
▶ Multi-Step Query Processing ▶ Nearest Neighbor Query ▶ R-Tree (and family) ▶ Spatial Indexing Techniques ▶ Spatial Join
Cloud Computing ▶ Replication in Multi-Tier Architectures ▶ Storage Grid
Cluster and Distance Measure DIMITRIOS GUNOPULOS 1,2 1 Computer Science and Eng. Dept., Univ. of California Riverside, Riverside, CA 92521, USA 2 Dept. of Informatics and Telecommunications, University of Athens, Athens, Greece
Synonyms Unsupervised learning; Segmentation
Definition Clustering
Clustering is the assignment of objects to groups of similar objects (clusters). The objects are typically described as vectors of features (also called attributes). So if one has n attributes, object x is described as a vector (x1,..,xn). Attributes can be numerical (scalar) or categorical. The assignment can be hard, where each object belongs to one cluster, or fuzzy, where an object can belong to several clusters with a probability. The clusters can be overlapping, though typically they are disjoint. Fundamental in the clustering process is the use of a distance measure. Distance Measure
In the clustering setting, a distance (or equivalently a similarity) measure is a function that quantifies the similarity between two objects.
Key Points The choice of a distance measure depends on the nature of the data, and the expected outcome of the clustering process. The most important consideration is the type of the features of the objects. One first focuses on distance measures when the features are all numerical. This includes features with continuous values (real numbers) or discrete values (integers). In this case, typical choices include: 1. The P
Lp
norm. It is defined as D(x,y)¼ p 1=p ðX Y Þ . Typically p is 2 (the intui1 1 1 i n tive and therefore widely used Euclidean distance), or 1 (the Manhattan or city block distance), or infinity (the Maximum distance). 2. The Mahalanobis distance. It is defined as P Dðx; yÞ ¼ ðx yÞ 1 ðx yÞT which generalizes the Euclidean and allows the assignment of different weights to different features.
3. The angle between two vectors, computed using the inner product of the two vectors x·y.
4. The Hamming distance, which measures the number of disagreements between two binary vectors.
In different settings different distance measures can be used. The edit, or Levenshtein, distance is an extension of the Hamming distance, and is typically used for measuring the distance between two strings of characters. The edit distance is defined as the minimum number of insertions, deletions or substitutions that it takes to transform one string to another. When two time series are compared, the Dynamic Time Warping distance measure is often used to quantify their distance. The length of the Longest Common Subsequence (LCSS) of two time series is also frequently used to provide a similarity measure between the time series. LCSS is a similarity measure because the longest common subsequence becomes longer when two time series are more similar. To create a distance measure, LCSS is typically normalized by dividing by the length of the longest of the two sequences, and then subtracting the ratio from one. Finally, when sets of objects are compared, the Jaccard coefficient is typically used to compute their distance. The Jaccard coefficient of sets A and B is defined as J(A, B) = |A ∩ B| / |A ∪ B|, that is, the fraction of the common elements over the union of the two sets. The majority of the distance measures used in practice, and indeed most of the ones described above, are metrics. Formally, a distance measure D is a metric if it obeys the following properties: for objects A, B, (i) D(A,B) ≥ 0, (ii) D(A,B) = 0 if and only if A = B, (iii) D(A,B) = D(B,A), and (iv) for any objects A, B, C, D(A,B) + D(B,C) ≥ D(A,C) (triangle inequality). Most distance measures can be trivially shown to observe the first three properties, but do not necessarily observe the triangle inequality. For example, the constrained Dynamic Time Warping distance, a typically used measure to compute the similarity between time series which does not allow arbitrary stretching of a time series, is not a metric because it does not satisfy the triangle inequality. Experimental results have shown that the constrained Dynamic Time Warping distance performs at least as well as the unconstrained one and it is also faster to compute, thus justifying its use although it is not a metric. Note however that, if it is so required, any distance measure can be converted into a metric by
taking the shortest path between objects A and B in the complete graph where each object is a node and each edge is weighted by the distance between the two nodes.
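A few of the measures listed above, written out directly in Python as a small illustration (a sketch, not a library): the Lp norm, a Jaccard-based distance, and the edit (Levenshtein) distance computed by dynamic programming.

```python
# Simple implementations of some of the distance measures discussed above.
def lp_distance(x, y, p=2):
    """Lp norm; p=2 gives the Euclidean distance, p=1 the Manhattan distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def jaccard_distance(A, B):
    """1 minus the Jaccard coefficient |A n B| / |A u B|."""
    A, B = set(A), set(B)
    return 1.0 - len(A & B) / len(A | B)

def edit_distance(s, t):
    """Minimum number of insertions, deletions or substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution (or match)
        prev = cur
    return prev[-1]

print(lp_distance((1, 1), (4, 5)))         # 5.0 (Euclidean)
print(jaccard_distance("abc", "bcd"))      # 0.5
print(edit_distance("kitten", "sitting"))  # 3
```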
Cross-references
▶ Clustering Overview and Applications ▶ Data Mining
Recommended Reading 1. Everitt B.S., Landau S., and Leese M. Cluster Analysis. Wiley, 2001. 2. Jain A.K., Murty M.N., and Flynn P.J. Data Clustering: A Review. ACM Comput. Surv., 31(3), 1999. 3. Theodoridis S. and Koutroubas K. Pattern Recognition. Academic Press, 1999.
Cluster Database Replication ▶ Replica Control
Cluster Databases ▶ Process Structure of a DBMS
Cluster Replication ▶ Replication for Scalability ▶ Replication in Multi-Tier Architectures
Cluster Stability ▶ Clustering Validity
Cluster Validation ▶ Clustering Validity
Clustering ▶ Deduplication in Data Cleaning ▶ Physical Database Design for Relational Databases
Clustering for Post Hoc Information Retrieval DIETMAR WOLFRAM University of Wisconsin-Milwaukee, Milwaukee, WI, USA
Synonyms Document clustering
Definition Clustering is a technique that allows similar objects to be grouped together based on common attributes. It has been used in information retrieval for different retrieval process tasks and objects of interest (e.g., documents, authors, index terms). Attributes used for clustering may include assigned terms within documents and their co-occurrences, the documents themselves if the focus is on index terms, or linkages (e.g., hypertext links of Web documents, citations or co-citations within documents, documents accessed). Clustering in IR facilitates browsing and assessment of retrieved documents for relevance and may reveal unexpected relationships among the clustered objects.
Historical Background
A fundamental challenge of information retrieval (IR) that continues today is how to best match user queries with documents in a queried collection. Many mathematical models have been developed over the years to facilitate the matching process. The details of the matching process are usually hidden from the user, who is only presented with an outcome. Once a set of candidate documents has been identified, they are presented to the user for perusal. Traditional approaches have relied on ordered linear lists of documents based on calculated relevance or another sequencing criterion (e.g., date, alphabetical by title or author). The resulting linear list addresses the assessed relationship of documents to queries, but not the relationships of the documents themselves. Clustering techniques can reduce this limitation by creating groups of documents (or other objects of interest) to facilitate more efficient retrieval or perusal and evaluation of retrieved sets. The application of clustering techniques to IR extends back to some of the earliest experimental IR systems including Gerard Salton’s SMART system, which relied on document cluster identification within
a vector space as a means of quickly identifying sets of relevant documents. The rationale for applying clustering was formalized as the ‘‘cluster hypothesis,’’ proposed by Jardine and van Rijsbergen [6]. This hypothesis proposes that documents that are relevant to a query are more similar to each other than to documents that are not relevant to the query. The manifestation of this relationship can be represented in different ways by grouping like documents or, more recently, visualizing the relationships and resulting proximities in a multidimensional space. Early applications of clustering emphasized its use to more efficiently identify groups of related, relevant documents and to improve search techniques. The computational burden associated with real-time cluster identification during searches on increasingly larger data corpora and the resulting lackluster performance improvements have caused clustering to lose favor as a primary mechanism for retrieval. However, clustering methods continue to be studied and used today (see, for example, [7]). Much of the recent research into clustering for information retrieval has focused on other areas that support the retrieval process. For instance, clustering has been used to assist in query expansion, where additional terms for retrieval may be identified. Clustering of similar terms can be used to construct thesauri, which can be used to index documents [3]. Recent research on clustering has highlighted its benefits for post hoc retrieval tasks, in particular for the presentation of search results to better model user and usage behavior. The focus of applications presented here is on these post hoc IR tasks, dealing with effective representation of groups of objects once identified to support exploratory browsing and to provide a greater understanding of users and system usage for future IR system development.
Foundations Methods used to identify clusters are based on cluster analysis, a multivariate exploratory statistical technique. Cluster analysis relies on similarities or differences in object attributes and their values. The granularity of the analysis and the validity of the resulting groups are dependent on the range of attributes and values associated with objects of interest. For IR applications, clusters are based on common occurrences and weights of assigned terms for documents,
the use of query terms, or linkages between objects of interest represented as hypertext linkages or citations/ co-citations. Clustering techniques can be divided into hierarchical and non-hierarchical approaches. Non-hierarchical clustering methods require that a priori assumptions be made about the nature and number of clusters, but can be useful if specific cluster parameters are sought. Hierarchical clustering, which is more commonly used, begins with many small groups of objects that serve as initial clusters. Existing groups are clustered into larger groups until only one cluster remains. Visually, the structure and relationship of clusters may be represented as a dendrogram, with different cluster agglomerations at different levels on the dendrogram representing the strength of relationship between clusters. Other visualization techniques may be applied and are covered elsewhere. In hierarchical methods, the shorter the agglomerative distance, the closer the relationship and the more similar the clusters are. As an exploratory technique, there is no universally accepted algorithm to conduct the analysis, but the general steps for conducting the analysis are similar. First, a similarity measure is applied to the object attributes, which serves as the basis for pairwise comparisons. Standard similarity or distance measures applied in IR research such as the Euclidean distance, cosine measure, Jaccard coefficient, and Dice coefficient can be used. Next, a method for cluster determination is selected. Common methods include: single complete linkage, average linkage, nearest neighbor, furthest neighbor, centroid clustering (representing the average characteristics of objects within a cluster), and Ward’s method. Each method uses a different algorithm to assess cluster membership and may be found to be more appropriate in given circumstances. Outcomes can vary significantly depending on the method used. This flexibility underscores one of the challenges for effectively implementing cluster analysis. With no one correct or accepted way to conduct the analysis, outcomes are open to interpretation, but may be viewed as equally valid. For example, single linkage clustering, which links pairs of objects that most closely resemble one another, is comparatively simple to implement and has been widely used, but can result in lengthy linear chains of clusters. Parameters may need to be specified that dictate the minimum size of clusters to avoid situations where there are large orders of difference in cluster membership. Another challenge inherent in clustering is that different clustering algorithms can produce similar
numbers of clusters, but if some clusters contain few members, this does little to disambiguate the members within large clusters. The number of clusters that partition the object set can be variable in hierarchical clustering. More clusters result in fewer objects per cluster with greater inter-object similarity, but with potentially more groups to assess. It is possible to test for an optimal number of clusters using various measures that calculate how differing numbers of clusters affect cluster cohesiveness. Clustering may be implemented in dynamic environments by referencing routines based on specific clustering algorithms developed by researchers or through specialty clustering packages. Details on clustering algorithms for information retrieval can be found in Rasmussen [8]. Standard statistical and mathematical software packages such as SAS and SPSS also support a range of clustering algorithms. Special algorithms may need to be applied to very large datasets to reduce computational overhead, which can be substantial for some algorithms.
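As an illustration of the general steps just described (a sketch rather than anything from the entry, and assuming SciPy is installed), the snippet below represents documents as term-weight vectors, computes pairwise cosine distances, and applies average-linkage agglomeration; the tiny term-vector matrix is made up for the example.

```python
# Hierarchical document clustering with an off-the-shelf package.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# rows = documents, columns = term weights (e.g., tf-idf values)
docs = np.array([
    [1.0, 0.8, 0.0, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.7],
    [0.0, 0.0, 0.8, 1.0],
])

condensed = pdist(docs, metric="cosine")      # pairwise distance matrix (condensed)
tree = linkage(condensed, method="average")   # average-linkage agglomeration
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)                                 # e.g., [1 1 2 2]: two topical groups
```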
Key Applications In addition to early applications of clustering for improving retrieval efficiency, clustering techniques in IR have included retrieval results presentation, and modeling of IR user and usage characteristics based on transactions logs. Although largely a topic of research interest, some applications have found their way into commercial systems. Clustering of search results has been applied by several Web-based search services since the late 1990s, some of which are no longer available. Most notable of the current generation is Clusty (clusty.com), which organizes retrieval results from several search services around topical themes. The application of clustering to support interactive browsing has been an active area of investigation in recent years. Among the earliest demonstrations for this purpose was the Scatter/Gather method outlined by Cutting et al. [4], in which the authors demonstrated how clustering of retrieved items can facilitate browsing for vaguely defined information needs. This approach was developed to serve as a complement to more focused techniques for retrieval assessment. In application, the method presents users with a set of clusters that serves as the starting point for browsing. The user selects the clusters of greatest interest. The contents of those clusters are then gathered into a single cluster, which now serves as the corpus for a new round of
clustering, into which the new smaller corpus of items is scattered. The process continues until the user’s information need is met or the user abandons the search. To support real time clustering of datasets, the authors developed an efficient clustering algorithm, called buckshot, plus a more accurate algorithm, called fractionation, to permit more detailed clustering in offline environments where a timely response is less critical. Another algorithm, called cluster digest, was used to encapsulate the topicality of a given cluster based on the highest weighted terms within the cluster. Hearst and Pedersen [5] evaluated the efficacy of Scatter/Gather on the top-ranked retrieval outcomes of a large dataset, and tested the validity of the cluster hypothesis. The authors compared the number of known relevant items to those appearing in the generated clusters. A user study was also conducted, which demonstrated that participants were able to effectively navigate and interact with the system incorporating Scatter/Gather. Increasingly, IR systems provide access to heterogeneous collections of documents. The question arises whether the cluster hypothesis, and the benefits of capitalizing on its attributes, extends to the distributed IR environment, where additional challenges include the merger of different representations of documents and identification of multiple occurrences of documents across the federated datasets. Crestani and Wu [2] conducted an experimental study to determine whether the cluster hypothesis holds in a distributed environment. They simulated a distributed environment by using different combinations of retrieval environments and document representation heterogeneity, with the most sophisticated implementation representing three different IR environments with three different collections. Results of the different collections and systems were clustered and compared. The authors concluded that the cluster hypothesis largely holds true in distributed environments, but fails when brief surrogates of full text documents are used. With the growing availability of large IR system transaction logs, clustering methods have been used to identify user and usage patterns. By better understanding patterns in usage behavior, IR systems may be able to identify types of behaviors and accommodate those behaviors through context-sensitive assistance or through integration of system features that accommodate identified behaviors. Chen and Cooper [1] relied on a rich dataset of user sessions collected from the University of California MELVYL online public access
catalog system. Based on 47 variables associated with each user session (e.g., session length in seconds, average number of items retrieved, average number of search modifications), their analysis identified six clusters representing different types of user behaviors during search sessions. These included help-intensive searching, knowledgeable usage, and known-item searching. Similarly, Wen et al. [9] focused on clustering of user queries in an online encyclopedia environment to determine whether queries could be effectively clustered to direct users to appropriate frequently asked questions topics. IR environments that cater to a broad range of users are well-known for short query submissions by users, which make clustering applications based solely on query term co-occurrence unreliable. In addition to the query content, the authors based their analysis on common retrieved documents viewed by users. By combining query content with common document selections, a link was established between queries that might not share search terms. The authors demonstrated how the application of their clustering method, which was reportedly adopted by the encyclopedia studied, could effectively guide users to appropriate frequently asked questions. The previous examples represent only a sample of clustering applications in an IR context. Additional recent research developments and applications using clustering may be found in Wu et al. [10].
Cross-references
▶ Data Mining ▶ Text Mining ▶ Visualization for Information Retrieval
Recommended Reading 1. Chen H.M. and Cooper M.D. Using clustering techniques to detect usage patterns in a web-based information system. J. Am. Soc. Inf. Sci. Technol., 52(11):888–904, 2001. 2. Crestani F. and Wu S. Testing the cluster hypothesis in distributed information retrieval. Inf. Process. Manage., 42(5):1137–1150, 2006. 3. Crouch C.J. A cluster-based approach to thesaurus construction. In Proc. 11th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1988, pp. 309–320. 4. Cutting D.R., Karger D.R., Pedersen J.O., and Tukey J.W. Scatter/ Gather: a cluster-based approach to browsing large document collections. In Proc. 15th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1992, pp. 318–329. 5.Hearst M.A. and Pedersen J.O. Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proc. 19th Annual Int.
ACM SIGIR Conf. on Research and Development in Information Retrieval, 1996, pp. 76–84.
6. Jardine N. and van Rijsbergen C. The use of hierarchic clustering in information retrieval. Inf. Storage Retr., 7(5):217–240, 1971.
7. Liu X. and Croft W.B. Cluster-based retrieval using language models. In Proc. 30th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2004, pp. 186–193.
8. Rasmussen E. Clustering algorithms. In Information Retrieval: Data Structures & Algorithms, W.B. Frakes, R. Baeza-Yates (eds.). Prentice Hall, Englewood Cliffs, NJ, 1992, pp. 419–442.
9. Wen J.R., Nie J.Y., and Zhang H.J. Query clustering using user logs. ACM Trans. Inf. Syst., 20(1):59–81, 2002.
10. Wu W., Xiong H., and Shekhar S. (eds.). Clustering and Information Retrieval. Kluwer, Norwell, MA, 2004.
Clustering Index ▶ Primary Index
Clustering on Streams
SURESH VENKATASUBRAMANIAN
University of Utah, Salt Lake City, UT, USA
Definition An instance of a clustering problem (see clustering) consists of a collection of points in a distance space, a measure of the cost of a clustering, and a measure of the size of a clustering. The goal is to compute a partitioning of the points into clusters such that the cost of this clustering is minimized, while the size is kept under some predefined threshold. Less commonly, a threshold for the cost is specified, while the goal is to minimize the size of the clustering. A data stream (see data streams) is a sequence of data presented to an algorithm one item at a time. A stream algorithm, upon reading an item, must perform some action based on this item and the contents of its working space, which is sublinear in the size of the data sequence. After this action is performed (which might include copying the item to its working space), the item is discarded. Clustering on streams refers to the problem of clustering a data set presented as a data stream.
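As an illustration of this access model only (a leader-style heuristic, not one of the published algorithms discussed below; the class name and the budget and radius parameters are hypothetical), the following Python sketch reads each item exactly once and keeps nothing but a bounded list of weighted centers in its working space:

    import math

    def dist(p, q):
        # Euclidean distance between two equal-length coordinate tuples.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    class StreamClusterer:
        """Sees each item once; working space is O(budget), independent of stream length."""
        def __init__(self, budget, radius):
            self.budget = budget        # maximum number of centers kept in memory
            self.radius = radius        # distance threshold for opening a new cluster
            self.centers = []           # list of (center, count) pairs

        def process(self, x):
            if self.centers:
                j = min(range(len(self.centers)),
                        key=lambda i: dist(x, self.centers[i][0]))
                c, n = self.centers[j]
                if dist(x, c) <= self.radius or len(self.centers) >= self.budget:
                    self.centers[j] = (c, n + 1)    # absorb x into the nearest cluster
                    return
            self.centers.append((x, 1))             # otherwise open a new cluster

    clusterer = StreamClusterer(budget=10, radius=2.0)
    for point in [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]:
        clusterer.process(point)    # the raw stream is never stored
    print(clusterer.centers)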
Historical Background Clustering (see clustering) algorithms typically require access to the entire data to produce an effective clustering. This is a problem for large data sets, where
random access to the data, or repeated access to the entire data set, is a costly operation. For example, the well-known k-means heuristic is an iterative procedure that in each iteration must read the entire data set twice. One set of approaches to performing clustering on large data involves sampling: a small subset of data is extracted from the input and clustered, and then this clustering is extrapolated to the entire data set. The data stream paradigm [14] came about in two ways: first, as a way to model access to large streaming sources (network traffic, satellite imagery) that by virtue of their sheer volume, cannot be archived for offline processing and need to be aggregated, summarized and then discarded in real time. Second, the streaming paradigm has shown itself to be the most effective way of accessing large databases: Google’s Map Reduce [9] computational framework is one example of the efficacy of stream processing. Designing clustering algorithms for stream data requires different algorithmic ideas than those useful for traditional clustering algorithms. The online computational paradigm [4] is a potential solution: in this paradigm, an algorithm is presented with items one by one, and using only information learned up to the current time, must make a prediction or estimate on the new item being presented. Although the online computing paradigm captures the sequential aspect of stream processing, it does not capture the additional constraint that only a small portion of the history may be stored. In fact, an online algorithm is permitted to use the entirety of the history of the stream, and is usually not limited computationally in any way. Thus, new ideas are needed to perform clustering in a stream setting.
Foundations Preliminaries
Let X be a domain and d be a distance function defined between pairs of elements in X. Typically, it is assumed that d is a metric (i.e., it satisfies the triangle inequality d(x, y) + d(y, z) ≥ d(x, z) for all x, y, z ∈ X). One of the more common measures of the cost of a cluster is the so-called median cost: the cost of a cluster C ⊆ X is the function
cost(C) = Σ_{x∈C} d(x, c*)
where c* ∈ X, the cluster center, is the point that minimizes cost(C). The k-median problem is to find a collection of k disjoint clusters, the sum of whose costs is minimized. An equally important cost function is the mean cost: the cost of a cluster C ⊆ X is the function
cost(C) = Σ_{x∈C} d²(x, c*)
where c* is defined as before. The k-means problem is to find a collection of clusters whose total mean cost is minimized. It is useful to note that the median cost is more robust to outliers in the data; however, the mean cost function, especially for points in Euclidean spaces, yields a very simple definition for c*: it is merely the centroid of the set of points in the cluster. Other measures that are often considered are the k-center cost, where the goal is to minimize the maximum radius of a cluster, and the diameter cost, where the goal is to minimize the maximum diameter of a cluster (note that the diameter measure does not require one to define a cluster center). A data stream problem consists of a sequence of items x1, x2,...,xn, and a function f(x1,...,xn) that one wishes to compute. The limitation here is that the algorithm is only permitted to store a sublinear number of items in memory, because n is typically too large for all the items to fit in memory. Further, even random access to the data is prohibitive, and so the algorithm is limited to accessing the data in sequential order. Since most standard clustering problems (including the ones described above) are NP-hard in general, one cannot expect solutions that minimize the cost of a clustering. However, one can often show that an algorithm comes close to being optimal: formally, one can show that the cost achieved by an algorithm is within some multiplicative factor c of the optimal solution. Such an algorithm is said to be a c-approximation algorithm. Many of the methods presented here will provide such guarantees on the quality of their output. As usual, one should keep in mind that these guarantees are worst-case, and thus apply to any possible input the algorithm may encounter. In practice, these algorithms will often perform far better than promised.
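As a direct transcription of the two cost measures above, the sketch below evaluates the median and mean costs of a single cluster of Euclidean points; to keep the example short, the median-cost center c* is restricted to points of the cluster itself (the discrete, medoid-style variant), which is an assumption of this illustration rather than part of the definition:

    import math

    def d(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def median_cost(cluster):
        # cost(C) = sum over x in C of d(x, c*), with c* chosen to minimize the sum.
        return min(sum(d(x, c) for x in cluster) for c in cluster)

    def mean_cost(cluster):
        # cost(C) = sum over x in C of d^2(x, c*); in Euclidean space c* is the centroid.
        dim = len(cluster[0])
        centroid = tuple(sum(x[i] for x in cluster) / len(cluster) for i in range(dim))
        return sum(d(x, centroid) ** 2 for x in cluster)

    cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    print(median_cost(cluster), mean_cost(cluster))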
General Principles
Stream clustering is a relatively new topic within the larger area of stream algorithms and data analysis. However, there are some general techniques that have proven their usefulness both theoretically as well as practically, and are good starting points for the design
and analysis of stream clustering methods. This section reviews these ideas, as well as pointing to examples of how they have been used in various settings.
Incremental Clustering
The simplest way to think about a clustering algorithm on stream data is to imagine the stream data arriving in chunks of elements. Prior to the arrival of the current chunk, the clustering algorithm has computed a set of clusters for all the data seen so far. Upon encountering the new chunk, the algorithm must update the clusters, possibly expanding some and contracting others, merging some clusters and splitting others. It then requests the next chunk, discarding the current one. Thus, a core component of any stream clustering algorithm is a routine to incrementally update a clustering when new data arrives. Such an approach was developed by Charikar et al. [6] for maintaining clusterings of data in a metric space using a diameter cost function. Although their scheme was phrased in terms of incremental clusterings, rather than stream clusterings, their approach generalizes well to streams. They show that their scheme yields a provable approximation to the optimal diameter of a k-clustering.
Representations One of the problems with clustering data streams is choosing a representation for a cluster. At the very least, any stream clustering algorithm stores the location of a cluster center, and possibly the number of items currently associated with this cluster. This representation can be viewed as a weighted point, and can be treated as a single point in further iterations of the clustering process. However, this representation loses information about the geometric size and distribution of a cluster. Thus, another standard representation of a cluster consists of the center and the number of points augmented with the sum of squared distances from the points in the cluster to the center. This last term informally measures the variation of points within a cluster, and when viewed in the context of density estimation via Gaussians, is in fact the sample variance of the cluster. Clusters reduced in this way can be treated as weighted points (or weighted balls), and clustering algorithms should be able to handle such generalized points. One notable example of the use of such a representation is the one-pass clustering algorithm of Bradley et al. [5], which was simplified and improved by Farnstrom et al. [11]. Built around the well known k-means algorithm (that iteratively seeks to minimize
the k-means measure described above), this technique proceeds as follows.
Algorithm 1: Clustering with representations
1. Initialize cluster centers randomly.
2. While a chunk of data remains to be read:
   a. Read a chunk of data (as much as will fit in memory), and cluster it using the k-means algorithm.
   b. For each cluster, divide the points contained within it into the core (points that are very close to the center under various measures) and the periphery.
   c. Replace the set of points in the core by a summary as described above. Discard all remaining points.
   d. Use the current cluster list as the set of centers for the next chunk.
It is important that representations be linear. Specifically, given two chunks of data c, c′ and their representations r, r′, it should be the case that the representation of c ∪ c′ can be formed from a linear combination of r and r′. This relates to the idea of sketching in stream algorithms, and is important because it allows the clustering algorithm to work in the (reduced) space of representations, rather than in the original space of data. Representations like the one described above are linear, and this is a crucial factor in the effectiveness of these algorithms.
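The linearity requirement can be made concrete with the common (count, linear sum, sum of squared norms) summary: the representation of the union of two chunks is just the component-wise sum of the two representations. The class below is a generic sketch of such a summary, not code taken from the systems cited here:

    class ClusterSummary:
        """Weighted-point representation of a cluster: count, per-dimension sum,
        and sum of squared norms of the points."""
        def __init__(self, dim):
            self.n = 0
            self.ls = [0.0] * dim       # linear sum of the points
            self.ss = 0.0               # sum of squared norms of the points

        def add(self, x):
            self.n += 1
            self.ls = [a + v for a, v in zip(self.ls, x)]
            self.ss += sum(v * v for v in x)

        def merge(self, other):
            # Linearity: the summary of c U c' is obtained directly from the summaries.
            self.n += other.n
            self.ls = [a + b for a, b in zip(self.ls, other.ls)]
            self.ss += other.ss

        def center(self):
            return [v / self.n for v in self.ls]

        def variation(self):
            # Average squared distance of the points to the current center.
            c = self.center()
            return self.ss / self.n - sum(v * v for v in c)

Storing the sum of squared norms, rather than squared distances to a (changing) center, is what keeps the summary linear under merges; the within-cluster variation is recovered on demand in variation().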
Hierarchical Clustering
Viewing a cluster as a weighted point in a new clustering problem quickly leads to the idea of hierarchical clustering: by thinking of a point as a single-element cluster, and connecting a cluster and its elements in a parent-child relationship, a clustering algorithm can represent multiple levels of merges as a tree of clusters, with the root node being a single cluster containing all the data, and each leaf being a single item. Such a tree is called a Hierarchical Agglomerative Clustering (HAC), since it can be viewed bottom-up as a series of agglomerations. Building such a hierarchy yields more general information about the relationship between clusters, and the ability to make better judgments about how to merge clusters. The well-known clustering algorithm BIRCH [15] makes use of a hierarchy of cluster representations to cluster a large database in a few passes. In a first pass, a tree called the CF-tree is constructed, where each internal node represents a cluster of clusters, and each leaf represents a cluster of items. This tree is controlled by two parameters: B, the branching factor, and T, a diameter threshold that limits the size of leaf clusters.
In further passes, more analysis is performed on the CF-tree to compress clusters further. The tree is built much in the way a B+-tree is built: new items are inserted in the deepest cluster possible, and if the threshold constraint is violated, the cluster is split, and updates are propagated up the tree. BIRCH is one of the best-known large-data clustering algorithms, and is generally viewed as a benchmark to compare other clustering algorithms against. However, BIRCH does not provide formal guarantees on the quality of the clusterings thus produced. The first algorithm that computes a hierarchical clustering on a stream while providing formal performance guarantees is a method for solving the k-median problem developed by Guha et al. [12,13]. This algorithm is best described by first presenting it in a non-streaming context.
Algorithm 2: Small space
1. Divide the input into ℓ disjoint parts.
2. Cluster each part into O(k) clusters. Assign each point to its nearest cluster center.
3. Cluster the O(ℓk) cluster centers, where each center is weighted by the number of points assigned to it.
Note that the total space required by this algorithm is O(ℓk + n/ℓ). The value of this algorithm is that it propagates good clusterings: specifically, if the intermediate clusterings are computed by algorithms that yield constant-factor approximations to the best clustering (under the k-median cost measure), then the final output will also be a (larger) constant-factor approximation to the best clustering. Also note that the final clustering step may itself be replaced by a recursive call to the algorithm, yielding a hierarchical scheme. Converting this to a stream algorithm is not too difficult. Consider each chunk of data as one of the disjoint parts the input is broken into. Suppose each part is of size m, and there exists a clustering procedure that can cluster these points into 2k centers with reasonable accuracy. The algorithm reads enough data to obtain m centers (m²/(2k) points). These m "points" can be viewed as the input to a second-level streaming process, which performs the same operations. In general, the ith-level stream process takes m²/(2k) points from the (i−1)th-level stream process and clusters them into m points, which are appended to the stream for the next level.
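A rough Python sketch of the divide-and-recluster idea follows. The greedy farthest-point routine stands in for whatever intermediate O(k)-clustering procedure is actually used, and the chunk size and weighting conventions are illustrative rather than those analyzed in [12,13]; the multi-level streaming version applies the same reclustering step recursively to the stream of weighted centers.

    import math

    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def simple_cluster(points, weights, k):
        # Stand-in clustering routine: greedy farthest-point centers, then each
        # center is weighted by the total weight of the points assigned to it.
        centers = [points[0]]
        while len(centers) < min(k, len(points)):
            centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
        totals = [0.0] * len(centers)
        for p, w in zip(points, weights):
            j = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            totals[j] += w
        return [(c, w) for c, w in zip(centers, totals) if w > 0]

    def small_space(stream, chunk_size, k):
        # Cluster each chunk into O(k) weighted centers, then recluster the centers.
        kept, chunk = [], []
        for x in stream:
            chunk.append(x)
            if len(chunk) == chunk_size:
                kept.extend(simple_cluster(chunk, [1.0] * len(chunk), 2 * k))
                chunk = []
        if chunk:
            kept.extend(simple_cluster(chunk, [1.0] * len(chunk), 2 * k))
        pts = [p for p, _ in kept]
        ws = [w for _, w in kept]
        return simple_cluster(pts, ws, k)       # final k weighted centers

    print(small_space([(0, 0), (0, 1), (9, 9), (9, 8), (5, 5), (0, 2)], chunk_size=3, k=2))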
The guarantees provided by the method rely on having accurate clustering algorithms for the intermediate steps. However, the general paradigm itself is useful as a heuristic: the authors show that using the k-means algorithm as the intermediate clustering step yields reasonable clustering results in practice, even though the method comes with no formal guarantees.
On Relaxing the Number of Clusters
If one wishes to obtain guarantees on the quality of a clustering, using at least k clusters is critical; it is easy to design examples where the cost of a (k−1)-clustering is much larger than the cost of a k-clustering. One interesting aspect of the above scheme is how it uses weaker clustering algorithms (that output O(k) rather than k clusters) as intermediate steps on the way to computing a k-clustering. In fact, this idea has been shown to be useful in a formal sense: subsequent work by Charikar et al. [7] showed that if one were to use an extremely weak clustering algorithm (in fact, one that produces O(k log n) clusters), then this output can be fed into a clustering algorithm that produces k clusters, while maintaining overall quality bounds that are better than those described above. This idea is useful especially if one has a fast algorithm that produces a larger number of clusters, and a more expensive algorithm that produces k clusters: the expensive algorithm can be run on the (small) output of the fast algorithm to produce the desired answer.
Clustering Evolving Data
Stream data is often temporal. Typical data analysis questions are therefore often limited to ranges of time (‘‘in the last three days,’’ ‘‘over the past week,’’ ‘‘for the period between Jan 1 and Feb 1,’’ and so on). All of the above methods for clustering streams assume that the goal is to cluster the entire data stream, and the only constraint is the space needed to store the data. Although they are almost always incremental, in that the stream can be stopped at any time and the resulting clustering will be accurate for all data seen upto that point, they cannot correctly output clusterings on windows of data, or allow the influence of past data to gradually wane over time. Even with non-temporal data, it may be important to allow the data analysis to operate on a subset of the data to capture the notion of concept drift [10], a term that is used to describe a scenario when natural data characteristics change as the stream evolves.
Sliding Windows
A popular model of stream analysis is the sliding window model, which introduces a new parameter W. The goal of the stream analysis is to produce summary statistics (a clustering, variance estimates or other statistics) on the most recent W items only, while using space that is sublinear in W. This model can be thought of as represented by a sliding window of length W with one end (the sliding end) anchored to the current element being read. The challenge of dealing with sliding windows is the problem of deletion. Although not as general as a fully dynamic data model where arbitrary elements can be inserted and deleted, the sliding window model introduces the problem of updating a cluster representation under deletions, and requires new ideas. One such idea is the exponential histogram, first introduced by Datar et al. [8] to estimate certain statistical properties of sliding windows on streams, and used by Babcock et al. [3] to compute an approximate k-median clustering in the sliding window model. The idea here is to maintain a set of buckets that together partition all data in the current window. For each bucket, relevant summary statistics are maintained. Intuitively, the smaller the number of items assigned to a bucket, the more accurate the summary statistics (in the limit, the trivial histogram has one bucket for each of the W items in the window). The larger this number, the fewer the number of buckets needed. Balancing these two conflicting requirements yields a scheme where each bucket stores the items between two timestamps, and the bucket sizes increase exponentially as they store items further in the past. It requires more detailed analysis to demonstrate that such a scheme will provide accurate answers to queries over windows, but the use of such exponentially increasing bucket sizes allows the algorithm to use a few buckets, while still maintaining a reasonable approximation to the desired estimate.
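The flavor of the exponential histogram can be conveyed with the simplest statistic, a count of 1s over the last W stream positions. The toy sketch below keeps at most two buckets of each power-of-two size and merges the two oldest buckets of a size when that bound is exceeded; extending each bucket to carry a cluster summary, as in [3], follows the same pattern. The merge policy and error behavior here are simplified assumptions, not the exact scheme of [8].

    class ExpHistogram:
        """Approximate count of 1s among the last W items using O(log W) buckets.
        Each bucket stores (timestamp of its newest item, size); sizes are powers of two."""
        def __init__(self, W, max_per_size=2):
            self.W = W
            self.max_per_size = max_per_size
            self.buckets = []       # newest bucket first
            self.time = 0

        def add(self, bit):
            self.time += 1
            # Drop buckets whose newest item has slid entirely out of the window.
            self.buckets = [b for b in self.buckets if b[0] > self.time - self.W]
            if bit != 1:
                return
            self.buckets.insert(0, (self.time, 1))
            size = 1
            while True:
                idx = [i for i, b in enumerate(self.buckets) if b[1] == size]
                if len(idx) <= self.max_per_size:
                    break
                i, j = idx[-1], idx[-2]                             # two oldest buckets of this size
                self.buckets[j] = (self.buckets[j][0], 2 * size)    # keep the newer timestamp
                del self.buckets[i]
                size *= 2

        def estimate(self):
            # Exact except for the oldest bucket, which may straddle the window boundary.
            if not self.buckets:
                return 0
            return sum(s for _, s in self.buckets) - self.buckets[-1][1] // 2

    eh = ExpHistogram(W=100)
    for _ in range(1000):
        eh.add(1)
    print(eh.estimate())    # close to 100, the number of 1s in the current window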
Hierarchies of Windows
The sliding window model introduces an extra parameter W whose value must be justified by external considerations. One way of getting around this problem is to maintain statistics for multiple values of W (typically an exponentially increasing family). Another approach, used by Aggarwal et al. [1], is to maintain snapshots (summary representations of the clusterings) at time steps at different levels of resolution. For example, a simple two-level snapshot scheme might store the cluster representations
computed after times t, t + 1,..., t + W, as well as t, t + 2, t + 4,..., t + 2W (eliminating duplicate summaries as necessary). Using the linear structure of representations will allow the algorithm to extract summaries for time intervals: they show that such a scheme uses space efficiently while still being able to detect evolution in data streams at different scales.
Decaying Data
For scenarios where such a justification might be elusive, another model of evolving data is the decay model, in which one can think of a data item's influence waning (typically exponentially) with time. In other words, the value of the ith item, instead of being fixed at xi, is a function of time xi(t) = xi(0)·exp(−c(t − i)). This reduces the problem to the standard setting of computing statistics over the entire stream, while using the decay function to decide which items to remove from the limited local storage when computing statistics. The use of exponentially decaying data is quite common in temporal data analysis: one specific example of its application in the clustering of data streams is the work on HPStream by Aggarwal et al. [2].
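A minimal sketch of exponential decay applied to a single running cluster center is shown below; the decay constant c and the convention of decaying the accumulated sums before each arrival are generic illustrative choices, not the specific HPStream machinery of [2].

    import math

    class DecayedCenter:
        """Running center in which the ith item's weight at time t is exp(-c*(t - i))."""
        def __init__(self, dim, c=0.1):
            self.c = c
            self.weight = 0.0
            self.weighted_sum = [0.0] * dim

        def add(self, x, steps=1):
            decay = math.exp(-self.c * steps)       # age the existing contributions
            self.weight = self.weight * decay + 1.0
            self.weighted_sum = [s * decay + v for s, v in zip(self.weighted_sum, x)]

        def center(self):
            return [s / self.weight for s in self.weighted_sum]

    dc = DecayedCenter(dim=2, c=0.5)
    for p in [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0)]:
        dc.add(p)
    print(dc.center())      # dominated by the most recent points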
Key Applications Systems that manage large data sets and perform data analysis will require stream clustering methods. Many modern data cleaning systems require such tools, as well as large scientific databases. Another application of stream clustering is for network traffic analysis: such algorithms might be situated at routers, operating on packet streams.
Experimental Results Most of the papers cited above are accompanied by experimental evaluations and comparisons to prior work. BIRCH, as mentioned before, is a common benchmarking tool.
Cross-references
▶ Data Clustering ▶ Information Retrieval ▶ Visualization
Recommended Reading 1. Aggarwal C.C., Han J., Wang J., and Yu P.S. A framework for clustering evolving data streams. In Proc. 29th Int. Conf. on Very Large Data Bases, 2003, pp. 81–92. 2. Aggarwal C.C., Han J., Wang J., and Yu P.S. A framework for projected clustering of high dimensional data streams. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004, pp. 852–863.
Clustering Overview and Applications 3. Babcock B., Datar M., Motwani R., and O’Callaghan L. Maintaining variance and k-medians over data stream windows. In Proc. 22nd ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2003, pp. 234–243. 4. Borodin A. and El-Yaniv R. Online computation and competitive analysis. Cambridge University Press, New York, NY, USA, 1998. 5. Bradley P.S., Fayyad U.M., and Reina C. Scaling Clustering Algorithms to Large Databases. In Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, 1998, pp. 9–15. 6. Charikar M., Chekuri C., Feder T., and Motwani R. Incremental Clustering and Dynamic Information Retrieval. SIAM J. Comput., 33(6):1417–1440, 2004. 7. Charikar M., O’Callaghan L., and Panigrahy R. Better streaming algorithms for clustering problems. In Proc. 35th Annual ACM Symp. on Theory of Computing, 2003, pp. 30–39. 8. Datar M., Gionis A., Indyk P., and Motwani R. Maintaining stream statistics over sliding windows: (extended abstract). In Proc. 13th Annual ACM -SIAM Symp. on Discrete Algorithms, 2002, pp. 635–644. 9. Dean J. and Ghemaway S. MapReduce: simplified data processing on large clusters. In Proc. 6th USENIX Symp. on Operating System Design and Implementation, 2004, pp. 137–150. 10. Domingos P. and Hulten G. Mining high-speed data streams. In Proc. 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2000, pp. 71–80. 11. Farnstrom F., Lewis J., and Elkan C. Scalability for clustering algorithms revisited. SIGKDD Explor., 2(1):51–57, 2000. 12. Guha S., Meyerson A., Mishra N., Motwani R., and O’Callaghan L. Clustering Data Streams: Theory and practice. IEEE Trans. Knowl. Data Eng., 15(3):515–528, 2003. 13. Guha S., Mishra N., Motwani R., and O’Callaghan L. Clustering data streams. In Proc. 41st Annual Symp. on Foundations of Computer Science, 2000, p. 359. 14. Muthukrishnan S. Data streams: algorithms and applications. Found. Trend Theor. Comput. Sci., 1(2), 2005. 15. Zhang T., Ramakrishnan R., and Livny M. BIRCH: A New Data Clustering Algorithm and Its Applications. Data Min. Knowl. Discov., 1(2):141–182, 1997.
Clustering Overview and Applications
DIMITRIOS GUNOPULOS 1,2
1 Computer Science and Eng. Dept., Univ. of California Riverside, Riverside, CA 92521, USA
2 Dept. of Informatics and Telecommunications, University of Athens, Athens, Greece
Synonyms
Unsupervised learning

Definition
Clustering is the assignment of objects to groups of similar objects (clusters). The objects are typically described as vectors of features (also called attributes). Attributes can be numerical (scalar) or categorical. The assignment can be hard, where each object belongs to one cluster, or fuzzy, where an object can belong to several clusters with a probability. The clusters can be overlapping, though typically they are disjoint. A distance measure is a function that quantifies the similarity of two objects.

Historical Background
Clustering is one of the most useful tasks in data analysis. The goal of clustering is to discover groups of similar objects and to identify interesting patterns in the data. Typically, the clustering problem is about partitioning a given data set into groups (clusters) such that the data points in a cluster are more similar to each other than points in different clusters [4,8]. For example, consider a retail database where each record contains items purchased at the same time by a customer. A clustering procedure could group the customers in such a way that customers with similar buying patterns are in the same cluster. Thus, the main concern in the clustering process is to reveal the organization of patterns into "sensible" groups, which allow one to discover similarities and differences, as well as to derive useful conclusions about them. This idea is applicable to many fields, such as life sciences, medical sciences and engineering. Clustering may be found under different names in different contexts, such as unsupervised learning (in pattern recognition), numerical taxonomy (in biology, ecology), typology (in social sciences) and partition (in graph theory) [13]. The clustering problem comes up in so many domains due to the prevalence of large datasets for which labels are not available. In one or two dimensions, humans can perform clustering very effectively visually; however, in higher dimensions automated procedures are necessary. The lack of training examples makes it very difficult to evaluate the results of the clustering process. In fact, the clustering process may result in different partitionings of a data set, depending on the specific algorithm, criterion, or choice of parameters used for clustering.

Foundations
The Clustering Process
In the clustering process, there are no predefined classes and no examples that would show what kind of
desirable relations should be valid among the data. That is the main difference from the task of classification: Classification is the procedure of assigning an object to a predefined set of categories [FSSU96]. Clustering produces initial categories in which values of a data set are classified during the classification process. For this reason, clustering is described as ‘‘unsupervised learning’’; in contrast to classification, which is considered as ‘‘supervised learning.’’ Typically, the clustering process will include at least the following steps: 1. Feature selection: Typically, the objects or observations to be clustered are described using a set of features. The goal is to appropriately select the features on which clustering is to be performed so as to encode as much information as possible concerning the task of interest. Thus, a pre-processing step may be necessary before using the data. 2. Choice of the clustering algorithm. In this step the user chooses the algorithm that is more appropriate for the data at hand, and therefore is more likely to result to a good clustering scheme. In addition, a similarity (or distance) measure and a clustering criterion are selected in tandem – The distance measure is a function that quantifies how ‘‘similar’’ two objects are. In most of the cases, one has to ensure that all selected features contribute equally to the computation of the proximity measure and there are no features that dominate others. – The clustering criterion is typically a cost function that the clustering algorithm has to optimize. The choice of clustering criterion has to take into account the type of clusters that are expected to occur. 3. Validation and interpretation of the results. The correctness of the results of the clustering algorithm is verified using appropriate criteria and techniques. Since clustering algorithms define clusters that are not known a priori, irrespective of the clustering methods, the final partition of the data typically requires some kind of evaluation. In many cases, the experts in the application area have to integrate the clustering results with other experimental evidence and analysis in order to draw the right conclusion. After the third phase the user may elect to use the clustering results obtained, or may start the process
from the beginning, perhaps using different clustering algorithms or parameters.
Clustering Algorithms Taxonomy
With clustering being a useful tool in diverse research communities, a multitude of clustering methods has been proposed in the literature. Occasionally similar techniques have been proposed and used in different communities. Clustering algorithms can be classified according to: 1. The type of data input to the algorithm (for example, objects described with numerical features or categorical features) and the choice of similarity function between two objects. 2. The clustering criterion optimized by the algorithm. 3. The theory and fundamental concepts on which clustering analysis techniques are based (e.g., fuzzy theory, statistics). A broad classification of clustering algorithms is the following [8,14]: 1. Partitional clustering algorithms: here the algorithm attempts to directly decompose the data set into a set of (typically) disjoint clusters. More specifically, the algorithm attempts to determine an integer number of partitions that optimize a certain criterion function. 2. Hierarchical clustering algorithms: here the algorithm proceeds successively by either merging smaller clusters into larger ones, or by splitting larger clusters. The result of the algorithm is a tree of clusters, called dendrogram, which shows how the clusters are related. By cutting the dendrogram at a desired level, a clustering of the data items into disjoint groups is obtained. 3. Density-based clustering : The key idea of this type of clustering is to group neighbouring objects of a data set into clusters based on density conditions. This includes grid-based algorithms that quantise the space into a finite number of cells and then do operations in the quantised space. For each of above categories there is a wealth of subtypes and different algorithms for finding the clusters. Thus, according to the type of variables allowed in the data set additional categorizations include [14]: (i) Statistical algorithms, which are based on statistical analysis concepts and use similarity measures to partition objects and they are limited to numeric data.
(ii) Conceptual algorithms that are used to cluster categorical data. (iii) Fuzzy clustering algorithms, which use fuzzy techniques to cluster data and allow objects to be classified into more than one clusters. Such algorithms lead to clustering schemes that are compatible with everyday life experience as they handle the uncertainty of real data. (iv) Crisp clustering techniques, that consider non-overlapping partitions so that a data point either belongs to a class or not. Most of the clustering algorithms result in crisp clusters, and thus can be categorized in crisp clustering. (v) Kohonen net clustering, which is based on the concepts of neural networks. In the remaining discussion, partitional clustering algorithms will be described in more detail; other techniques will be dealt with separately. Partitional Algorithms
In general terms, the clustering algorithms are based on a criterion for assessing the quality of a given partitioning. More specifically, they take as input some parameters (e.g., number of clusters, density of clusters) and attempt to define the best partitioning of a data set for the given parameters. Thus, they define a partitioning of a data set based on certain assumptions and not necessarily the "best" one that fits the data set. In this category, K-Means is a commonly used algorithm [10]. The aim of K-Means clustering is the optimisation of an objective function that is described by the equation:
E = Σ_{i=1}^{c} Σ_{x∈Ci} d(x, mi)
In the above equation, mi is the center of cluster Ci, while d(x, mi) is the Euclidean distance between a point x and mi. Thus, the criterion function E attempts to minimize the distance of every point from the center of the cluster to which the point belongs. It should be noted that optimizing E is a combinatorial problem that is NP-Complete and thus any practical algorithm to optimize it cannot guarantee optimality. The K-means algorithm is the first practical and effective heuristic that was suggested to optimize this criterion, and owes its popularity to its good performance in practice. The K-means algorithm begins by initialising a set of c cluster centers. Then, it assigns each object of the dataset to the cluster whose center is the nearest, and re-computes the centers. The process continues until the centers of the clusters stop changing.
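The loop just described can be transcribed compactly as follows; the initialization from the first c points and the exact-equality convergence test are simplifications made only for illustration (Python 3.8+ for math.dist):

    import math

    def kmeans(points, c, max_iter=100):
        # Assign each point to the nearest center, recompute centroids, repeat
        # until the centers stop changing.
        centers = [list(p) for p in points[:c]]             # naive initialization
        for _ in range(max_iter):
            clusters = [[] for _ in range(c)]
            for p in points:
                j = min(range(c), key=lambda i: math.dist(p, centers[i]))
                clusters[j].append(p)
            new_centers = []
            for j, cl in enumerate(clusters):
                if cl:
                    dim = len(cl[0])
                    new_centers.append([sum(p[i] for p in cl) / len(cl) for i in range(dim)])
                else:
                    new_centers.append(centers[j])          # keep old center if a cluster empties
            if new_centers == centers:
                break
            centers = new_centers
        return centers, clusters

    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, clusters = kmeans(pts, c=2)
    print(centers)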
Another algorithm of this category is PAM (Partitioning Around Medoids). The objective of PAM is to determine a representative object (medoid) for each cluster, that is, to find the most centrally located objects within the clusters. The algorithm begins by selecting an object as medoid for each of c clusters. Then, each of the non-selected objects is grouped with the medoid to which it is the most similar. PAM swaps medoids with other non-selected objects until all objects qualify as medoid. It is clear that PAM is an expensive algorithm with respect to finding the medoids, as it compares an object with the entire dataset [12]. CLARA (Clustering Large Applications), is an implementation of PAM in a subset of the dataset. It draws multiple samples of the dataset, applies PAM on samples, and then outputs the best clustering out of these samples [12]. CLARANS (Clustering Large Applications based on Randomized Search), combines the sampling techniques with PAM. The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids. The clustering obtained after replacing a medoid is called the neighbour of the current clustering. CLARANS selects a node and compares it to a user-defined number of their neighbours searching for a local minimum. If a better neighbor is found (i.e., having lower-square error), CLARANS moves to the neighbour’s node and the process starts again; otherwise the current clustering is a local optimum. If the local optimum is found, CLARANS starts with a new randomly selected node in search for a new local optimum. The algorithms described above result in crisp clusters, meaning that a data point either belongs to a cluster or not. The clusters are non-overlapping and this kind of partitioning is further called crisp clustering. The issue of uncertainty support in the clustering task leads to the introduction of algorithms that use fuzzy logic concepts in their procedure. A common fuzzy clustering algorithm is the Fuzzy C-Means (FCM), an extension of classical C-Means algorithm for fuzzy applications [2]. FCM attempts to find the most characteristic point in each cluster, which can be considered as the ‘‘center’’ of the cluster and, then, the grade of membership for each object in the clusters. Another approach proposed in the literature to solve the problems of crisp clustering is based on probabilistic models. The basis of this type of clustering algorithms is the EM algorithm, which provides a quite general approach to learning in presence of
unobservable variables [11]. A common algorithm is the probabilistic variant of K-Means, which is based on the mixture of Gaussian distributions. This approach of K-Means uses probability density rather than distance to associate records with clusters. More specifically, it regards the centers of clusters as means of Gaussian distributions. Then, it estimates the probability that a data point is generated by the jth Gaussian (i.e., belongs to jth cluster). This approach is based on Gaussian model to extract clusters and assigns the data points to clusters assuming that they are generated by normal distribution. Also, this approach is implemented only in the case of algorithms based on the EM (Expectation Maximization) algorithm. Another type of clustering algorithms combine graph partitioning and hierarchical clustering algorithms characteristics. Such algorithms include CHAMELEON [9], which measures the similarity among clusters based on a dynamic model contrary to the clustering algorithms discussed above. Moreover in the clustering process both the inter-connectivity and closeness between two clusters are taken into account to decide how to merge the clusters. The merge process based on the dynamic model facilitates the discovery of natural and homogeneous clusters. Also it is applicable to all types of data as long as a similarity function is specified. Finally, BIRCH [ZRL99] uses a data structure called CFTree for partitioning the incoming data points in an incremental and dynamic way, thus providing an effective way to cluster very large datasets. Partitional algorithms are applicable mainly to numerical data sets. However, there are some variants of K-Means such as K-prototypes, and K-mode [7] that are based on the K-Means algorithm, but they aim at clustering categorical data. K-mode discovers clusters while it adopts new concepts in order to handle categorical data. Thus, the cluster centers are replaced with ‘‘modes,’’ a new dissimilarity measure used to deal with categorical objects. The K-means algorithm and related techniques tend to produce spherical clusters due to the use of a symmetric objective function. They require the user to set only one parameter, the desirable number of clusters K. However, since the objective function gets smaller monotonically as K increases, it is not clear how to define what is the best number of clusters for a given dataset. Although several approaches have been proposed to address this shortcoming [14], this is one of the main disadvantages of partitional algorithms.
Another characteristic of the partitional algorithms is that they are unable to handle noise and outliers and they are not suitable to discover clusters with nonconvex shapes. Another characteristic of K-means is that the algorithm does not display a monotone behavior with respect to K. For example, if a dataset is clustered into M and 2M clusters, it is intuitive to expect that the smaller clusters in the second clustering will be subsets of the larger clusters in the first; however this is typically not the case.
Key Applications Cluster analysis is very useful task in exploratory data analysis and a major tool in a very wide spectrum of applications in many fields of business and science. Clustering applications include: 1. Data reduction. Cluster analysis can contribute to the compression of the information included in the data. In several cases, the amount of the available data is very large and its processing becomes very demanding. Clustering can be used to partition the data set into a number of ‘‘interesting’’ clusters. Then, instead of processing the data set as an entity, the representatives of the defined clusters are adopted in the process. Thus, data compression is achieved. 2. Hypothesis generation. Cluster analysis is used here in order to infer some hypotheses concerning the data. For instance, one may find in a retail database that there are two significant groups of customers based on their age and the time of purchases. Then, one may infer some hypotheses for the data, that it, ‘‘young people go shopping in the evening,’’ ‘‘old people go shopping in the morning.’’ 3. Hypothesis testing. In this case, the cluster analysis is used for the verification of the validity of a specific hypothesis. For example, consider the following hypothesis: ‘‘Young people go shopping in the evening.’’ One way to verify whether this is true is to apply cluster analysis to a representative set of stores. Suppose that each store is represented by its customer’s details (age, job, etc.) and the time of transactions. If, after applying cluster analysis, a cluster that corresponds to ‘‘young people buy in the evening’’ is formed, then the hypothesis is supported by cluster analysis. 4. Prediction based on groups. Cluster analysis is applied to the data set and the resulting clusters are characterized by the features of the patterns that
belong to these clusters. Then, unknown patterns can be classified into specified clusters based on their similarity to the clusters' features. In such cases, useful knowledge related to this data can be extracted. Assume, for example, that the cluster analysis is applied to a data set concerning patients infected by the same disease. The result is a number of clusters of patients, according to their reaction to specific drugs. Then, for a new patient, one identifies the cluster in which he/she can be classified and based on this decision his/her medication can be made.
5. Business Applications and Market Research. In business, clustering may help marketers discover significant groups in their customers' database and characterize them based on purchasing patterns.
6. Biology and Bioinformatics. In biology, it can be used to define taxonomies, categorize genes with similar functionality and gain insights into structures inherent in populations.
7. Spatial data analysis. Due to the huge amounts of spatial data that may be obtained from satellite images, medical equipment, Geographical Information Systems (GIS), image database exploration etc., it is expensive and difficult for the users to examine spatial data in detail. Clustering may help to automate the process of analysing and understanding spatial data. It is used to identify and extract interesting characteristics and patterns that may exist in large spatial databases.
8. Web mining. Clustering is used to discover significant groups of documents on the Web, a huge collection of semi-structured documents. This classification of Web documents assists in information discovery. Another application of clustering is discovering groups in social networks.
In addition, clustering can be used as a pre-processing step for other algorithms, such as classification, which would then operate on the detected clusters.
Cross-references
▶ Cluster and Distance Measure ▶ Clustering for Post Hoc Information Retrieval ▶ Clustering on Streams ▶ Clustering Validity ▶ Clustering with Constraints ▶ Data Mining ▶ Data Reduction ▶ Density-Based Clustering ▶ Dimension Reduction Techniques for Clustering
C
▶ Document Clustering ▶ Feature Selection for Clustering ▶ Hierarchial Clustering ▶ Semi-Supervised Learning ▶ Spectral Clustering ▶ Subspace Clustering Techniques ▶ Text Clustering ▶ Visual Clustering ▶ Visualizing Clustering Results
Recommended Reading 1. Agrawal R., Gehrke J., Gunopulos D., and Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1998, pp. 94–105. 2. Bezdeck J.C., Ehrlich R., and Full W. FCM: Fuzzy C-Means algorithm. Comput. Geosci., 10(2–3):191–203, 1984. 3. Ester M., Kriegel H.-Peter., Sander J., and Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, 1996, pp. 226–231. 4. Everitt B.S., Landau S., and Leese M. Cluster Analysis. Hodder Arnold, London, UK, 2001. 5. Fayyad U.M., Piatesky-Shapiro G., Smuth P., and Uthurusamy R. Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA, 1996. 6. Han J. and Kamber M. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Fransisco, CA, 2001. 7. Huang Z. A fast clustering algorithm to cluster very large categorical data sets in data mining. In Proc. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 1997. 8. Jain A.K., Murty M.N., and Flyn P.J. Data clustering: a review. ACM Comput. Surv., 31(3):264–323, 1999. 9. Karypis G., Han E.-H., and Kumar V. CHAMELEON: a hierarchical clustering algorithm using dynamic modeling. IEEE Computer., 32(8):68–75, 1999. 10. MacQueen J.B. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. on Mathematical Statistics and Probability, vol. 1, 1967, pp. 281–297. 11. Mitchell T. Machine Learning. McGraw-Hill, New York, 1997. 12. Ng R. and Han J. Efficient and effective clustering methods for spatial data mining. In Proc. 20th Int. Conf. on Very Large Data Bases, 1994, pp. 144–155. 13. Theodoridis S. and Koutroubas K. Pattern Recognition. Academic Press, New York, 1999. 14. Vazirgiannis M., Halkidi M., and Gunopulos D. Uncertainty Handling and Quality Assessment in Data Mining. Springer, New York, 2003. 15. Wang W., Yang J., and Muntz R. STING: A statistical information grid approach to spatial data mining. In Proc. 23th Int. Conf. on Very Large Data Bases, 1997, pp. 186–195. 16. Zhang T., Ramakrishnman R., and Linvy M. BIRCH: an efficient method for very large databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1996, pp. 103–114.
Clustering Validity
MICHALIS VAZIRGIANNIS
Athens University of Economics & Business, Athens, Greece
Synonyms Cluster validation; Cluster stability; Quality assessment; Stability-based validation of clustering
Definition A problem one faces in clustering is to decide the optimal partitioning of the data into clusters. In this context visualization of the data set is a crucial verification of the clustering results. In the case of large multidimensional data sets (e.g., more than three dimensions) effective visualization of the data set is cumbersome. Moreover the perception of clusters using available visualization tools is a difficult task for humans that are not accustomed to higher dimensional spaces. The procedure of evaluating the results of a clustering algorithm is known under the term cluster validity. Cluster validity consists of a set of techniques for finding a set of clusters that best fits natural partitions (of given datasets) without any a priori class information. The outcome of the clustering process is validated by a cluster validity index.
Historical Background Clustering is a major task in the data mining process for discovering groups and identifying interesting distributions and patterns in the underlying data. In the literature, a wide variety of algorithms has been proposed for different applications and sizes of data sets. The application of an algorithm to a data set, assuming that the data set offers a clustering tendency, aims at discovering its inherent partitions. However, the clustering process is an unsupervised process, since there are no predefined classes or examples. Then, the various clustering algorithms are based on some assumptions in order to define a partitioning of a data set. As a consequence, they may behave in a different way depending on: (i) the features of the data set (geometry and density distribution of clusters) and (ii) the input parameter values. One of the most important issues in cluster analysis is the evaluation of clustering results to find the partitioning that best fits the underlying data. This is the main subject of cluster validity. If clustering
algorithm parameters are assigned an improper value, the clustering method results in a partitioning scheme that is not optimal for the specific data set leading to wrong decisions. The problems of deciding the number of clusters better fitting a data set as well as the evaluation of the clustering results has been subject of several research efforts. The procedure of evaluating the results of a clustering algorithm is known under the term cluster validity. In general terms, there are three approaches to investigate cluster validity. The first is based on external criteria. This implies that the results of a clustering algorithm are evaluated based on a prespecified structure, which is imposed on a data set and reflects one’s intuition about the clustering structure of the data set. The second approach is based on internal criteria. The results of a clustering algorithm may be evaluated in terms of quantities that involve the vectors of the data set themselves (e.g., proximity matrix). The third approach of clustering validity is based on relative criteria. Here the basic idea is the evaluation of a clustering structure by comparing it to other clustering schemes, resulting by the same algorithm but with different parameter values. There are two criteria proposed for clustering evaluation and selection of an optimal clustering scheme: (i) Compactness, the members of each cluster should be as close to each other as possible. A common measure of compactness is the variance, which should be minimized. (ii) Separation, the clusters themselves should be widely spaced.
Foundations This section discusses methods suitable for the quantitative evaluation of the clustering results, known as cluster validity methods. However, these methods give an indication of the quality of the resulting partitioning and thus they can only be considered as a tool at the disposal of the experts in order to evaluate the clustering results. The cluster validity approaches based on external and internal criteria rely on statistical hypothesis testing. In the following section, an introduction to the fundamental concepts of hypothesis testing in cluster validity is presented. In cluster validity the basic idea is to test whether the points of a data set are randomly structured or not. This analysis is based on the Null Hypothesis, denoted as Ho, expressed as a statement of random structure of a data set X. To test this hypothesis, statistical tests are used, which lead to a computationally complex procedure.
Monte Carlo techniques are used as a solution to this problem.
External Criteria
Based on external criteria, one can work in two different ways. First, one can evaluate the resulting clustering structure C by comparing it to an independent partition of the data P, built according to one's intuition about the clustering structure of the data set. Second, one can compare the proximity matrix P to the partition P.
Comparison of C with Partition P (Non-hierarchical Clustering)
Let C = {C1,...,Cm} be a clustering structure of a data set X and P = {P1,...,Ps} be a defined partition of the data. Refer to a pair of points (xv, xu) from the data set using the following terms:
SS: if both points belong to the same cluster of the clustering structure C and to the same group of partition P.
SD: if the points belong to the same cluster of C and to different groups of P.
DS: if the points belong to different clusters of C and to the same group of P.
DD: if both points belong to different clusters of C and to different groups of P.
Assuming now that a, b, c and d are the numbers of SS, SD, DS and DD pairs respectively, then a + b + c + d = M, which is the maximum number of all pairs in the data set (meaning, M = N(N−1)/2, where N is the total number of points in the data set). Now define the following indices to measure the degree of similarity between C and P:
1. Rand Statistic: R = (a + d)/M
2. Jaccard Coefficient: J = a/(a + b + c)
The above two indices range between 0 and 1, and are maximized when m = s. Another known index is the:
3. Folkes and Mallows index:
FM = a/√(m1·m2) = √( (a/(a + b)) · (a/(a + c)) )   [1]
where m1 = (a + b), m2 = (a + c). For the previous three indices it has been proven that the higher the values of these indices are, the more similar C and P are. Other indices are:
4. Hubert's Γ statistic:
Γ = (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} X(i, j)·Y(i, j)   [2]
High values of this index indicate a strong similarity between the matrices X and Y.
5. Normalized Γ statistic:
Γ̂ = [ (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} (X(i, j) − μX)(Y(i, j) − μY) ] / (σX·σY)   [3]
where X(i, j) and Y(i, j) are the (i, j) elements of the matrices X, Y respectively that one wants to compare. Also μX, μY, σX, σY are the respective means and standard deviations of the X, Y matrices. This index takes values between −1 and 1.
All these statistics have right-tailed probability density functions, under the random hypothesis. In order to use these indices in statistical tests, one must know their respective probability density function under the Null Hypothesis, Ho, which is the hypothesis of random structure of the data set. Thus, if one accepts the Null Hypothesis, the data are randomly distributed. However, the computation of the probability density function of these indices is computationally expensive. A solution to this problem is to use Monte Carlo techniques. After having plotted the approximation of the probability density function of the defined statistic index, its value, denoted by q, is compared to the q(Ci) values, further referred to as qi. The indices R, J, FM, Γ defined previously are used as the q index mentioned in the above procedure.
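Given two labelings of the same N points, the pair-counting indices above reduce to a few lines once the SS, SD, DS and DD counts are available; the function names below are illustrative only.

    from itertools import combinations
    from math import sqrt

    def pair_counts(clustering, partition):
        # a = SS, b = SD, c = DS, d = DD pair counts for two labelings of the same points.
        a = b = c = d = 0
        for i, j in combinations(range(len(clustering)), 2):
            same_c = clustering[i] == clustering[j]
            same_p = partition[i] == partition[j]
            if same_c and same_p:
                a += 1
            elif same_c:
                b += 1
            elif same_p:
                c += 1
            else:
                d += 1
        return a, b, c, d

    def external_indices(clustering, partition):
        a, b, c, d = pair_counts(clustering, partition)
        M = a + b + c + d                       # M = N(N-1)/2
        rand = (a + d) / M
        jaccard = a / (a + b + c)
        fm = a / sqrt((a + b) * (a + c))        # Folkes and Mallows index
        return rand, jaccard, fm

    C = [0, 0, 1, 1, 2, 2]      # clustering produced by an algorithm
    P = [0, 0, 1, 1, 1, 2]      # pre-specified partition
    print(external_indices(C, P))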
Internal Criteria
Using this approach of cluster validity, the goal is to evaluate the clustering result of an algorithm using only quantities and features inherent in the data set. There are two cases in which one applies internal criteria of cluster validity depending on the clustering structure: (i) hierarchy of clustering schemes, and (ii) single clustering scheme.
Validating Hierarchy of Clustering Schemes
A matrix called cophenetic matrix, Pc, can represent the
hierarchy diagram that is produced by a hierarchical algorithm. The element Pc(i, j) of cophenetic matrix represents the proximity level at which the two vectors xi and xj are found in the same cluster for the first time. A statistical index can be defined to measure the degree of similarity between Pc and P (proximity matrix) matrices. This index is called Cophenetic Correlation Coefficient and defined as: CPCC N1 N P P ð1=MÞ dij cij mP mC i¼1 j¼iþ1 ¼ v"ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi #" #ffi ; u N1 N N1 N u P P P P t ð1=MÞ d2ij m2P ð1=MÞ c2ij m2C i¼1 j¼iþ1
i¼1 j¼iþ1
½4 where M = N(N1)/2 and N is the number of points in a data set. Also, mp and mc are the means of matrices P and Pc respectively, and are defined in the (Eq. 5): mP ¼ ð1=MÞ
N1 X N X
Pði; jÞ;
i¼1 j¼iþ1
mC ¼ ð1=MÞ
N1 X N X
½5 Pc ði; jÞ
i¼1 j¼iþ1
Moreover, dij, cij are the (i, j) elements of P and Pc matrices respectively. The CPCC values range in [–1, 1]. A value of the index close to 1 is an indication of a significant similarity between the two matrices. Validating a Single Clustering Scheme The goal here is to find the degree of match between a given clustering scheme C, consisting of nc clusters, and the proximity matrix P. The defined index for this approach is Hubert’s Gstatistic (or normalized Gstatistic). An additional matrix for the computation of the index is used, that is 1; if xi and xj belong to different clusters Y ði; j Þ ¼ 0; otherwise array: where i; j ¼ 1; 1=4; N:
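As an illustration of this test (not part of the original entry; all names below are invented), the normalized Γ statistic of Eq. 3 can be evaluated between the proximity matrix P and the matrix Y just defined by iterating over the M pairs with i < j:

from math import dist, sqrt

def normalized_gamma(points, labels):
    """Normalized Hubert Gamma between the proximity matrix P (Euclidean
    distances) and Y, where Y[i][j] = 1 iff points i and j lie in
    different clusters."""
    n = len(points)
    pairs = [(i, j) for i in range(n - 1) for j in range(i + 1, n)]
    m = len(pairs)                                               # M = N(N-1)/2
    x = [dist(points[i], points[j]) for i, j in pairs]           # P(i, j)
    y = [1.0 if labels[i] != labels[j] else 0.0 for i, j in pairs]  # Y(i, j)
    mu_x, mu_y = sum(x) / m, sum(y) / m
    var_x = sum((v - mu_x) ** 2 for v in x) / m
    var_y = sum((v - mu_y) ** 2 for v in y) / m
    cov = sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y)) / m
    return cov / sqrt(var_x * var_y)                             # value in [-1, 1]

# Two well-separated clusters: within-cluster distances are small and Y is 0
# there, so the statistic is strongly positive.
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(normalized_gamma(pts, [0, 0, 1, 1]))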
The application of Monte Carlo techniques is also a means to test the random hypothesis in a given data set.
Relative Criteria
The major drawback of techniques based on internal or external criteria is their high computational complexity. A different validation approach is discussed in this
section. The fundamental idea of the relative criteria is to choose the best clustering scheme from a set of defined schemes according to a pre-specified criterion. More specifically, the problem can be stated as follows: Let Palg be the set of parameters associated with a specific clustering algorithm (e.g., the number of clusters nc). Among the clustering schemes Ci, i = 1,...,nc, defined by a specific algorithm for different values of the parameters in Palg, choose the one that best fits the data set. Then, consider the following cases of the problem: 1. Palg does not contain the number of clusters, nc, as a parameter. In this case, the choice of the optimal parameter values is described as follows: the algorithm runs for a wide range of its parameters' values, and the largest range for which nc remains constant is selected (usually nc ≪ N, where N is the number of points in the data set).
The values in g are not bound to any range. This means that in terms of g, different levels of the context can be weighted in various ways. For example, weighting may increase or decrease toward the topmost context (root element).
Contextualization
Now, parent and root contextualization are considered with the given generalized notation. For simplicity, binary contextualization weights are used, that is, only cases where the values of g are either 1 or 0. A zero value means that the corresponding element is not taken into account in the contextualization. With relation to a query expression q, the contextualization based on the first-level (parent) context of the element x is calculated using the contextualization vector in which the two last elements have the value 1 and the others the value 0. This function is denoted cp(q, x) and defined as follows:
c_p(q, x) = C(q, x, g) \quad \text{where} \quad g: \begin{cases} g[\mathrm{len}(x)] = 1 \\ g[\mathrm{len}(x)-1] = 1, & \text{when } \mathrm{len}(x) > 1 \\ g[i] = 0 \text{ for } i = 1,\ldots,\mathrm{len}(x)-2, & \text{when } \mathrm{len}(x) > 2 \end{cases}
The contextualization by the topmost context (or by the root element) is denoted by the function symbol cr. In this case the weights for the first and the last element are 1 and the other weights are 0 in the contextualization vector:
c_r(q, x) = C(q, x, g) \quad \text{where} \quad g: \begin{cases} g[\mathrm{len}(x)] = 1 \\ g[1] = 1 \\ g[i] = 0 \text{ for } i = 2,\ldots,\mathrm{len}(x)-1, & \text{when } \mathrm{len}(x) > 2 \end{cases}
There are alternative interpretations and presentations of the idea of contextualization, for example [8,6,10]. The idea of mixing evidence from the element itself and its surrounding elements was presented for the first time by Sigurbjörnsson, Kamps, and de Rijke [8]. They apply language modeling based on a mixture model. The final RSV for the element is combined from the RSV of the element itself and the RSV of the root element as follows:
w_{mix}(q, e) = lp(e) + \alpha \cdot w(q, \mathrm{root}(e)) + (1 - \alpha) \cdot w(q, e)
where the notation is as given above; wmix(q, e) is the combined RSV for the element, lp(e) is the length prior, and α is a tuning parameter. In [6] independent indices are created for different – selected – element types. Some indices have sparse data
compared with others, that is, all the words of the root elements are not contained in lower-level elements. This has effects on inverted element frequencies and on the comparability of the word weights across the indices. To resolve the problem, the final element RSV is tuned by a scaling factor and the RSV of the root element. With the notation explained above this can be represented as follows:
w_p(q, e) = DocPivot \cdot w(q, \mathrm{root}(e)) + (1 - DocPivot) \cdot w(q, e)
where wp(q, e) is the pivoted RSV of the element, and DocPivot is the scaling factor. In [10], the structural context of the element is utilized for scoring elements. Original RSVs are obtained with the Okapi model [7]. For this, word frequencies are calculated for elements rather than documents, and normalized for each element type. The combined RSV for the element is calculated as a sum of the RSV of the context and the original RSV for the element, each score multiplied by a parameter. A machine learning approach is applied for learning the parameters. Let x = {t1, t2,...,td} be a vector of features representing the element e. The features are RSVs of the element e and its different contexts. Then the combined RSV is
f_\omega(x) = \sum_{j=1}^{d} \omega_j t_j
where ω = {ω1, ω2,...,ωd} are the parameters to be learned. The approach was tested with the parent and root as the contexts.
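The schemes above all mix the element's own RSV with RSVs of its ancestor elements. The sketch below is purely illustrative: it assumes the contextualization function is a g-weighted, normalized combination of the RSVs along the root-to-element path, and every name in it is invented.

def contextualize(path_rsvs, g):
    """Weighted combination of the RSVs along the root-to-element path.

    path_rsvs[0] is the root's RSV, path_rsvs[-1] the element's own RSV;
    g is the contextualization vector of the same length.  A normalized
    weighted average is assumed here for illustration.
    """
    assert len(path_rsvs) == len(g) and sum(g) > 0
    return sum(w * s for w, s in zip(g, path_rsvs)) / sum(g)

def parent_contextualization(path_rsvs):
    # Binary weights: the element itself and its parent get weight 1.
    g = [0.0] * len(path_rsvs)
    g[-1] = 1.0
    if len(path_rsvs) > 1:
        g[-2] = 1.0
    return contextualize(path_rsvs, g)

def root_contextualization(path_rsvs):
    # Binary weights: the element itself and the root get weight 1.
    g = [0.0] * len(path_rsvs)
    g[-1] = 1.0
    g[0] = 1.0
    return contextualize(path_rsvs, g)

def pivoted_rsv(elem_rsv, root_rsv, doc_pivot=0.3):
    # Interpolation with the root's RSV, in the spirit of the scaling factor of [6].
    return doc_pivot * root_rsv + (1 - doc_pivot) * elem_rsv

# Path article -> section -> paragraph with RSVs 0.9, 0.4, 0.2:
rsvs = [0.9, 0.4, 0.2]
print(parent_contextualization(rsvs))   # (0.4 + 0.2) / 2 = 0.30
print(root_contextualization(rsvs))     # (0.9 + 0.2) / 2 = 0.55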
Key Applications
Contextualization and similar approaches have been applied in XML retrieval with different retrieval models: tf-idf based ranking and structural indices [1], the vector space model [6], and language modeling [8]. The key application is element ranking. The results obtained with different retrieval systems seem to indicate that root contextualization is the best alternative. In traditional text retrieval a similar approach has been applied to text passages [3,5]. The idea of contextualization is applicable to all structured text documents; yet the type of the documents to be retrieved affects contextualization. Lengthy documents with one or a few subjects are more amenable to the method than short documents or documents with diverse subjects; contextualization might not work in an encyclopedia
with short entries, but can improve effectiveness, say, in the retrieval of the elements of scientific articles.
Experimental Results
The effectiveness of contextualization in XML retrieval has been studied experimentally with the INEX test collection consisting of IEEE articles [1,2]. All three contextualization types mentioned above (parent, root, and tower contextualization) were tested. The results show a clear improvement over a non-contextualized baseline; the best results were obtained with root and tower contextualization. Additionally, the approaches suggested in [8,6,10] were tested with the same INEX collection and reported to be effective compared with non-contextualized baselines, that is, compared with ranking based on elements' basic RSVs.
Cross-references
▶ Indexing Units ▶ Term Statistics for Structured Text Retrieval ▶ XML Retrieval
Recommended Reading
1. Arvola P., Junkkari M., and Kekäläinen J. Generalized contextualization method for XML information retrieval. In Proc. Int. Conf. on Information and Knowledge Management, 2005, pp. 20–27.
2. Arvola P., Junkkari M., and Kekäläinen J. Query evaluation with structural indices. In Proc. 4th Int. Workshop of the Initiative for the Evaluation of XML Retrieval, 2005, pp. 134–145.
3. Callan J.P. Passage-level evidence in document retrieval. In Proc. 17th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1994, pp. 302–310.
4. Extensible Markup Language (XML) 1.0 (Fourth Edition). W3C Recommendation 16 August 2006. Available at: http://www.w3.org/TR/xml/ [retrieved 17.8.2007].
5. Kaszkiel M., Zobel J., and Sacks-Davis R. Efficient passage ranking for document databases. ACM Trans. Inf. Syst., 17(4):406–439, 1999.
6. Mass Y. and Mandelbrod M. Component ranking and automatic query refinement for XML retrieval. In Proc. 4th Int. Workshop of the Initiative for the Evaluation of XML Retrieval, 2005, pp. 73–84.
7. Robertson S.E., Walker S., Jones S., Hancock-Beaulieu M.M., and Gatford M. Okapi at TREC-3. In Proc. 3rd Text Retrieval Conf., 1994.
8. Sigurbjörnsson B., Kamps J., and de Rijke M. An element-based approach to XML retrieval. In Proc. 2nd Int. Workshop of the Initiative for the Evaluation of XML Retrieval, 2003, pp. 19–26. Available at: http://inex.is.informatik.uni-duisburg.de:2003/proceedings.pdf [retrieved 29.8.2007].
9. Singhal A., Buckley C., and Mitra M. Pivoted document length normalization. In Proc. 19th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1996, pp. 21–29.
10. Vittaut J.-N. and Gallinari P. Machine learning ranking for structured information retrieval. In Proc. 28th European Conf. on IR Research, 2006, pp. 338–349.
Continuous Backup ▶ Continuous Data Protection (CDP)
Continuous Data Feed ▶ Data Stream
Continuous Data Protection
KENICHI WADA
Hitachi, Ltd., Tokyo, Japan
Synonyms
Continuous backup; CDP
Definition
CDP is a data protection service that captures data changes to storage, often providing the capability of restoring a copy of the data as of any point in time.
Key Points
CDP differs from conventional backups in that users do not need to choose the point in time until they actually recover data. From the application's point of view, every time it updates data in the original volume, CDP records the update. On recovery, when the user specifies a point in time, CDP creates the point-in-time copy from the original volume and the recorded updates. In several CDP implementations, users can choose the granularity of restorable objects, which helps them specify the point in time easily. For example, restorable objects range from crash-consistent images to logical objects such as files, mailboxes, messages, database files, or logs.
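One minimal way to picture this behavior is an update journal replayed on top of a baseline copy up to the requested point in time. The following sketch is illustrative only; it does not reflect any particular CDP product, and all names are invented. A real implementation would journal block-level changes and consolidate or prune the journal over time.

from copy import deepcopy

class SimpleCDP:
    """Toy continuous-data-protection store for a key/value 'volume'.

    Every update is journaled with its timestamp, so a copy of the volume
    can be reconstructed as of any point in time.
    """
    def __init__(self, baseline):
        self.baseline = deepcopy(baseline)   # state at time 0
        self.journal = []                    # list of (timestamp, key, value)

    def write(self, ts, key, value):
        # Capture every change; no backup point needs to be chosen up front.
        self.journal.append((ts, key, value))

    def restore(self, point_in_time):
        # Replay journaled updates up to (and including) the chosen time.
        volume = deepcopy(self.baseline)
        for ts, key, value in sorted(self.journal):
            if ts > point_in_time:
                break
            volume[key] = value
        return volume

cdp = SimpleCDP({"block0": "A"})
cdp.write(ts=10, key="block0", value="B")
cdp.write(ts=20, key="block0", value="C")
print(cdp.restore(15))   # {'block0': 'B'}  -- state as of time 15
print(cdp.restore(25))   # {'block0': 'C'}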
Cross-references
▶ Backup and Restore
Recommended Reading
1. Laden G., et al. Architectures for Controller Based CDP. In Proc. 5th USENIX Conf. on File and Storage Technologies, 2007, pp. 107–121.
Continuous Monitoring of Spatial Queries
KYRIAKOS MOURATIDIS
Singapore Management University, Singapore, Singapore
Synonyms
Spatio-temporal stream processing
Definition
A continuous spatial query runs over long periods of time and requests constant reporting of its result as the data dynamically change. Typically, the query type is range or nearest neighbor (NN), and the assumed distance metric is the Euclidean one. In general, there are multiple queries being processed simultaneously. The query points and the data objects move frequently and arbitrarily, i.e., their velocity vectors and motion patterns are unknown. They issue location updates to a central server, which processes them and continuously reports the current (i.e., updated) query results. Consider, for example, that the queries correspond to vacant cabs, and that the data objects are pedestrians that ask for a taxi. As cabs and pedestrians move, each free taxi driver wishes to know his/her closest client. This is an instance of continuous NN monitoring. Spatial monitoring systems aim at minimizing the processing time at the server and/or the communication cost incurred by location updates. Due to the time-critical nature of the problem, the data are usually stored in main memory to allow fast processing.
Historical Background
The first algorithms in the spatial database literature process snapshot (i.e., one-time) queries over static objects. They assume disk-resident data and utilize an index (e.g., an R-tree) to restrict the search space and reduce the I/O cost. Subsequent research considered spatial queries in client-server architectures. The general idea is to provide the user with extra information (along with the result at query-time) in order to reduce the number of subsequent queries as he/she moves (see entry Nearest Neighbor Query). These methods assume that the data objects are either static or moving linearly with known velocities. Due to the wide availability of positioning devices and the need for improved location-based services, the research focus has recently shifted to continuous spatial queries. In contrast with earlier assumed contexts, in this setting (i) there are multiple queries being evaluated simultaneously, (ii) the query results are continuously updated, and (iii) both the query points and the data objects move unpredictably.
Foundations
The first spatial monitoring method is called Q-index [13] and processes static range queries. Based on the observation that maintaining an index over frequently moving objects is very costly, Q-index indexes the queries instead of the objects. In particular, the monitored ranges are organized by an R-tree, and moving objects probe this tree to find the queries that they influence. Additionally, Q-index introduces the concept of safe regions to reduce the number of location updates. Specifically, each object p is assigned a circular or rectangular region, such that p needs to issue an update only if it exits this area (because, otherwise, it does not influence the result of any query). Figure 1 shows an example, where the current result of query q1 contains object p1, that of q2 contains p2, and the results of q3, q4, and q5 are empty. The safe regions for p1 and p4 are circular, while for p2 and p3 they are rectangular. Note that no query result can change unless some objects fall outside their assigned safe regions. Kalashnikov et al. [4] show that a grid implementation of Q-index is more efficient (than R-trees) for main memory evaluation. Monitoring Query Management (MQM) [1] and MobiEyes [2] also monitor range queries. They further exploit the computational capabilities of the objects to reduce the number of updates and the processing load of the server. In both systems, the objects store locally the queries in their vicinity and issue updates to the server only when they cross the boundary of any of these queries. To save their limited computational capabilities, the objects store and monitor only the queries they may affect when they move. MQM and MobiEyes employ different strategies to identify these queries. The former applies only to static queries. The latter can also handle moving ones, making however the assumption that they move linearly with fixed velocity.
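The safe-region mechanism can be pictured with the following sketch (illustrative only, not taken from the cited systems; the rectangle sizes and all names are invented): an object keeps its currently assigned rectangle locally and contacts the server only when its new position falls outside it.

from dataclasses import dataclass

@dataclass
class SafeRect:
    """Axis-aligned rectangular safe region assigned to one object."""
    x_lo: float
    y_lo: float
    x_hi: float
    y_hi: float

    def contains(self, x, y):
        return self.x_lo <= x <= self.x_hi and self.y_lo <= y <= self.y_hi

class Server:
    """Stub server: re-evaluates affected queries and hands back a new safe
    region (here simply a fixed-size box around the reported position)."""
    def report(self, oid, x, y):
        print(f"update from {oid}: ({x}, {y}) -> re-evaluate affected queries")
        return SafeRect(x - 1.0, y - 1.0, x + 1.0, y + 1.0)

class MovingObject:
    def __init__(self, oid, x, y, safe_rect, server):
        self.oid, self.x, self.y = oid, x, y
        self.safe_rect = safe_rect
        self.server = server

    def move_to(self, x, y):
        # Local check: an update is sent only if the object leaves its safe
        # region, because inside it no query result can change.
        self.x, self.y = x, y
        if not self.safe_rect.contains(x, y):
            self.safe_rect = self.server.report(self.oid, x, y)

srv = Server()
obj = MovingObject("p1", 0.0, 0.0, SafeRect(-1, -1, 1, 1), srv)
obj.move_to(0.5, 0.5)   # stays inside: no message
obj.move_to(3.0, 0.0)   # exits the safe region: update sent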
Continuous Monitoring of Spatial Queries. Figure 1. Circular and rectangular safe regions.
Mokbel et al. [7] present Scalable INcremental hash-based Algorithm (SINA), a system that monitors both static and moving ranges. In contrast with the aforementioned methods, in SINA the objects do not perform any local processing. Instead, they simply report their locations whenever they move, and the objective is to minimize the processing cost at the server. SINA is based on shared execution and incremental evaluation. Shared execution is achieved by implementing query evaluation as a spatial join between the objects and the queries. Incremental evaluation implies that the server computes only updates (i.e., object inclusions/exclusions) over the previously reported answers, as opposed to re-evaluating the queries from scratch. The above algorithms focus on ranges, and their extension to NN queries is either impossible or nontrivial. The systems described in the following target NN monitoring. Hu et al. [3] extend the safe region technique to NN queries; they describe a method that computes and maintains rectangular safe regions subject to the current query locations and kNN results. Mouratidis et al. [11] propose Threshold-Based algorithm (TB), also aiming at communication cost reduction. To suppress unnecessary location updates, in TB the objects monitor their distance from the queries (instead of safe regions). Consider the example in Fig. 2, and assume that q is a continuous 3-NN query (i.e., k = 3). The initial result contains p1, p2, p3. TB computes three thresholds (t1, t2, t3) which define a range for each object. If every object’s distance from
q lies within its respective range, the result of the query is guaranteed to remain unchanged. Each threshold is set in the middle of the distances of two consecutive objects from the query. The distance range for p1 is [0, t1), for p2 is [t1, t2), for p3 is [t2, t3), and for p4, p5 is [t3, ∞). Every object is aware of its distance range, and when there is a boundary violation, it informs the server about this event. For instance, assume that p1, p3, and p5 move to positions p1′, p3′ and p5′, respectively. Objects p3 and p5 compute their new distances from q, and avoid sending an update since they still lie in their permissible ranges. Object p1, on the other hand, violates its threshold and updates its position to the server. Since the order between the first two NNs may have changed, the server requests the current location of p2, and updates accordingly the result and threshold t1. In general, TB processes all updates issued since the last result maintenance, and (if necessary) it decides which additional object positions to request, updates the k NNs of q, and sends new thresholds to the involved objects. All the following methods aim at minimizing the processing time. Koudas et al. [6] describe aDaptive Indexing on Streams by space-filling Curves (DISC), a technique for ε-approximate kNN queries over streams of multi-dimensional points. The returned (ε-approximate) kth NN lies at most ε distance units farther from q than the actual kth NN of q. DISC partitions the space with a regular grid of granularity such that the maximum distance between any pair of points in a cell is at most ε. To avoid keeping all arriving data in the system, for each cell c it maintains only K points and discards the rest. It is proven that an exact kNN search in the retained points corresponds to a valid εkNN answer over the original dataset provided that k ≤ K. DISC indexes the data points with a B-tree that uses a space filling curve mechanism to facilitate fast updates and query processing. The authors show how to adjust the index to: (i) use the minimum amount of memory in order to guarantee a given error bound ε, or (ii) achieve the best possible accuracy, given a fixed amount of memory. DISC can process both snapshot and continuous εkNN queries. Yu et al. [17] propose a method, hereafter referred to as YPK-CNN, for continuous monitoring of exact kNN queries. Objects are stored in main memory and indexed with a regular grid of cells with size δ×δ. YPK-CNN does not process updates as they arrive, but directly applies them to the grid. Each NN
query installed in the system is re-evaluated every T time units. When a query q is evaluated for the first time, a two-step NN search technique retrieves its result. The first step visits the cells in an iteratively enlarged square R around the cell cq of q until k objects are found. Figure 3a shows an example of a single NN query where the first candidate NN is p1 with distance d from q; p1 is not necessarily the actual NN since there may be objects (e.g., p2) in cells outside R with distance smaller than d. To retrieve such objects, the second step searches in the cells intersecting the square SR centered at cq with side length 2·d + δ, and determines the actual kNN set of q therein. In Fig. 3a, YPK-CNN processes p1
Continuous Monitoring of Spatial Queries. Figure 2. TB example (k = 3).
up to p5 and returns p2 as the actual NN. The accessed cells appear shaded. When re-evaluating an existing query q, YPK-CNN makes use of its previous result in order to restrict the search space. In particular, it computes the maximum distance dmax among the current locations of the previous NNs (i.e., dmax is the distance of the previous neighbor that currently lies furthest from q). The new SR is a square centered at cq with side length 2·dmax + δ. In Fig. 3b, assume that the current NN p2 of q moves to location p2′. Then, the rectangle defined by dmax = dist(p2′, q) is guaranteed to contain at least one object (i.e., p2). YPK-CNN collects all objects (p1 up to p10) in the cells intersecting SR and identifies p1 as the new NN. Finally, when a query q changes location, it is handled as a new one (i.e., its NN set is computed from scratch). Xiong et al. [16] propose the Shared Execution Algorithm for Continuous NN queries (SEA-CNN). SEA-CNN focuses exclusively on monitoring the NN changes, without including a module for the first-time evaluation of an arriving query q (i.e., it assumes that the initial result is available). Objects are stored in secondary memory, indexed with a regular grid. The answer region of a query q is defined as the circle with center q and radius NN_dist (where NN_dist is the distance of the current kth NN). Book-keeping information is stored in the cells that intersect the answer region of q to indicate this fact. When updates arrive at the system, depending on which cells they
Continuous Monitoring of Spatial Queries. Figure 3. YPK-CNN examples.
affect and whether these cells intersect the answer region of the query, SEA-CNN determines a circular search region SR around q, and computes the new kNN set of q therein. To determine the radius r of SR, the algorithm distinguishes the following cases: (i) If some of the current NNs move within the answer region or some outer objects enter the answer region, SEA-CNN sets r = NN_dist and processes all objects falling in the answer region in order to retrieve the new NN set. (ii) If any of the current NNs moves out of the answer region, processing is similar to YPK-CNN; i.e., r = dmax (where dmax is the distance of the previous NN that currently lies furthest from q), and the NN set is computed among the objects inside SR. Assume that in Fig. 4a the current NN p2 issues an update reporting its new location p2′. SEA-CNN sets r = dmax = dist(p2′, q), determines the cells intersecting SR (these cells appear shaded), collects the corresponding objects (p1 up to p7), and retrieves p1 as the new NN. (iii) Finally, if the query q moves to a new location q′, then SEA-CNN sets r = NN_dist + dist(q, q′), and computes the new kNN set of q by processing all the objects that lie in the circle centered at q′ with radius r. For instance, in Fig. 4b the algorithm considers the objects falling in the shaded cells (i.e., objects from p1 up to p10 except for p6 and p9) in order to retrieve the new NN (p4). Mouratidis et al. [9] propose another NN monitoring method, termed Conceptual Partitioning Monitoring (CPM). CPM assumes the same system architecture and uses similar indexing and book-keeping structures
as YPK-CNN and SEA-CNN. When a query q arrives at the system, the server computes its initial result by organizing the cells into conceptual rectangles based on their proximity to q. Each rectangle rect is defined by a direction and a level number. The direction is U, D, L, or R (for up, down, left and right), and the level number indicates how many rectangles are between rect and q. Figure 5a illustrates the conceptual space partitioning around the cell cq of q. If mindist(c,q) is the minimum possible distance between any object in cell c and q, the NN search considers the cells in ascending mindist(c, q) order. In particular, CPM initializes an empty heap H and inserts (i) the cell of q with key equal to 0, and (ii) the level zero rectangles for each direction DIR with key mindist(DIR0, q). Then, it starts de-heaping entries iteratively. If the de-heaped entry is a cell, it examines the objects inside and updates accordingly the NN set (i.e., the list of the k closest objects found so far). If the de-heaped entry is a rectangle DIRlvl, it inserts into H (i) each cell c ∈ DIRlvl with key mindist(c, q) and (ii) the next level rectangle DIRlvl + 1 with key mindist(DIRlvl + 1, q). The algorithm terminates when the next entry in H (corresponding either to a cell or a rectangle) has key greater than the distance NN_dist of the kth NN found. It can be easily verified that the server processes only the cells that intersect the circle with center at q and radius equal to NN_dist. This is the minimal set of cells to visit in order to guarantee correctness. In Fig. 5a, the search processes the shaded cells and returns p2 as the result.
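The best-first traversal that CPM performs can be pictured with a small heap-based sketch. The code below is a simplification for illustration only: it orders grid cells directly by mindist instead of grouping them into conceptual rectangles, and all names are invented.

import heapq
from math import dist

def grid_knn(query, objects_per_cell, cell_size, k):
    """Best-first kNN over a regular grid: visit cells in ascending mindist
    order and stop once the next cell cannot improve the current kth NN
    distance.  objects_per_cell maps (cx, cy) -> [(x, y), ...]."""
    def mindist(cell):
        # Minimum possible distance between the query and any point in the cell.
        qx, qy = query
        lo_x, lo_y = cell[0] * cell_size, cell[1] * cell_size
        dx = max(lo_x - qx, 0, qx - (lo_x + cell_size))
        dy = max(lo_y - qy, 0, qy - (lo_y + cell_size))
        return (dx * dx + dy * dy) ** 0.5

    heap = [(mindist(c), c) for c in objects_per_cell]
    heapq.heapify(heap)
    best = []                                   # list of (distance, point)
    while heap:
        d_cell, cell = heapq.heappop(heap)
        if len(best) == k and d_cell > best[-1][0]:
            break                               # no unvisited cell can improve the result
        for p in objects_per_cell[cell]:
            best.append((dist(query, p), p))
        best.sort()
        best = best[:k]
    return [p for _, p in best]

cells = {(0, 0): [(0.2, 0.3)], (3, 3): [(3.5, 3.5)], (0, 1): [(0.1, 1.2)]}
print(grid_knn((0.0, 0.0), cells, cell_size=1.0, k=2))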
Continuous Monitoring of Spatial Queries. Figure 4. SEA-CNN update handling examples.
Continuous Monitoring of Spatial Queries. Figure 5. CPM examples.
The encountered cells constitute the influence region of q, and only updates therein can affect the current result. When updates arrive for these cells, CPM monitors how many objects enter or leave the circle centered at q with radius NN_dist. If the outgoing objects are more than the incoming ones, the result is computed from scratch. Otherwise, the new NN set of q can be inferred from the previous result and the update information, without accessing the grid at all. Consider the example of Fig. 5b, where p2 and p3 move to positions p2′ and p3′, respectively. Object p3 moves closer to q than the previous NN_dist and, therefore, CPM replaces the outgoing NN p2 with the incoming p3. The experimental evaluation in [11] shows that CPM is significantly faster than YPK-CNN and SEA-CNN.
Key Applications
Location-Based Services
The increasing trend of embedding positioning systems (e.g., GPS) in mobile phones and PDAs has given rise to a growing number of location-based services. Many of these services involve monitoring spatial relationships among mobile objects, facilities, landmarks, etc. Examples include location-aware advertising, enhanced 911 services, and mixed-reality games. Traffic Monitoring
Continuous spatial queries find application in traffic monitoring and control systems, such as on-the-fly
driver navigation, efficient congestion detection and avoidance, as well as dynamic traffic light scheduling and toll fee adjustment. Security Systems
Intrusion detection and other security systems rely on monitoring moving objects (pedestrians, vehicles, etc.) around particular areas of interest or important people.
Future Directions
Future research directions include other types of spatial queries (e.g., reverse nearest neighbor monitoring [15,5]), different settings (e.g., NN monitoring over sliding windows [10]), and alternative distance metrics (e.g., NN monitoring in road networks [12]). Similar techniques and geometric concepts to the ones presented above also apply to problems of a non-spatial nature, such as continuous skyline [14] and top-k queries [8,18].
Experimental Results
The methods described above are experimentally evaluated and compared with alternative algorithms in the corresponding references.
Cross-references
▶ B+-Tree ▶ Nearest Neighbor Query ▶ R-tree (and Family) ▶ Reverse Nearest Neighbor Query
▶ Road Networks ▶ Space-Filling Curves for Query Processing
Recommended Reading 1. Cai Y., Hua K., and Cao G. Processing range-monitoring queries on heterogeneous mobile objects. In Proc. 5th IEEE Int. Conf. on Mobile Data Management, 2004, pp. 27–38. 2. Gedik B. and Liu L. MobiEyes: Distributed processing of continuously moving queries on moving objects in a mobile system. In Advances in Database Technology, Proc. 9th Int. Conf. on Extending Database Technology, 2004, pp. 67–87. 3. Hu H., Xu J., and Lee D. A generic framework for monitoring continuous spatial queries over moving objects. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2005, pp. 479–490. 4. Kalashnikov D., Prabhakar S., and Hambrusch S. Main memory evaluation of monitoring queries over moving objects. Distrib. Parallel Databases, 15(2):117–135, 2004. 5. Kang J., Mokbel M., Shekhar S., Xia T., and Zhang D. Continuous evaluation of monochromatic and bichromatic reverse nearest neighbors. In Proc. 23rd Int. Conf. on Data Engineering, 2007, pp. 806–815. 6. Koudas N., Ooi B., Tan K., and Zhang R. Approximate NN queries on streams with guaranteed error/performance bounds. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004, pp. 804–815. 7. Mokbel M., Xiong X., and Aref W. SINA: Scalable incremental processing of continuous queries in spatio-temporal databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2004, pp. 623–634. 8. Mouratidis K., Bakiras S., Papadias D. Continuous monitoring of top-k queries over sliding windows. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2006, pp. 635–646. 9. Mouratidis K., Hadjieleftheriou M., and Papadias D. Conceptual partitioning: an efficient method for continuous nearest neighbor monitoring. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2005, pp. 634–645. 10. Mouratidis K. and Papadias D. Continuous nearest neighbor queries over sliding windows. IEEE Trans. Knowledge and Data Eng., 19(6):789–803, 2007. 11. Mouratidis K., Papadias D., Bakiras S., and Tao Y. A thresholdbased algorithm for continuous monitoring of k nearest neighbors. IEEE Trans. Knowledge and Data Eng., 17(11):1451–1464, 2005. 12. Mouratidis K., Yiu M., Papadias D., and Mamoulis N. Continuous nearest neighbor monitoring in road networks. In Proc. 32nd Int. Conf. on Very Large Data Bases, 2006, pp. 43–54. 13. Prabhakar S., Xia Y., Kalashnikov D., Aref W., and Hambrusch S. Query indexing and velocity constrained indexing: scalable techniques for continuous queries on moving objects. IEEE Trans. Comput., 51(10):1124–1140, 2002. 14. Tao Y. and Papadias D. Maintaining sliding window skylines on data Streams. IEEE Trans. Knowledge and Data Eng., 18(3): 377–391, 2006. 15. Xia T. and Zhang D. Continuous reverse nearest neighbor monitoring. In Proc. 22nd Int. Conf. on Data Engineering, 2006. 16. Xiong X., Mokbel M., and Aref W. SEA-CNN: Scalable processing of continuous k-nearest neighbor queries in spatio-temporal
databases. In Proc. 21st Int. Conf. on Data Engineering, 2005, pp. 643–654. 17. Yu X., Pu K., and Koudas N. Monitoring k-nearest neighbor queries over moving objects. In Proc. 21st Int. Conf. on Data Engineering, 2005, pp. 631–642. 18. Zhang D., Du Y., and Hu L. On monitoring the top-k unsafe places, In Proc. 24th Int. Conf. on Data Engineering, 2008, pp. 337–345.
Continuous Multimedia Data Retrieval
JEFFREY XU YU
Chinese University of Hong Kong, Hong Kong, China
Definition
Continuous multimedia is widely used in many applications nowadays. Continuous multimedia objects, such as audio and video streams, are stored on disks with different bandwidth requirements and must be retrieved continuously without interruption. The response time is an important measure in supporting continuous multimedia streams. Several strategies have been proposed in order to satisfy the requirements of all users in a multi-user environment, where multiple users try to retrieve different continuous multimedia streams at the same time.
Historical Background
Several multimedia data retrieval techniques have been proposed to support the real-time display of continuous multimedia objects. They fall into three categories [6]. The first category sacrifices the quality of the data in order to guarantee the required bandwidth of multimedia objects. The existing techniques either use lossy compression (such as predictive [15], frequency-oriented [11], and importance-oriented [10] schemes) or use a low-resolution device. The second category uses placement techniques to satisfy the continuity requirement by arranging the data at appropriate disk locations; in other words, it organizes multimedia data across the surface of a disk drive so as to maximize its bandwidth when it is retrieved [4,5,16,22,20]. The third category increases the bandwidth of the storage device by using parallelism. The basic idea is to employ the aggregate bandwidth of several disk drives by placing an object across multiple disks, for example, a Redundant Array of Inexpensive Disks (RAID) [17]. The existing works [9,19] focus on this direction.
Foundations
This section focuses on the second and third categories, and discusses multimedia data retrieval regarding single/multiple stream(s) and single/multiple disk(s).
Retrieval of a Single Stream on a Single Disk
For the retrieval of a single multimedia stream on a single disk, the stream data is first read continuously into a first-in-first-out (FIFO) queue, and then sent to the display devices, possibly via a network, at the appropriate rate. In order to satisfy the real-time requirement of displaying multimedia data continuously, the FIFO must be kept non-empty; in other words, there must be some multimedia data to be displayed in the FIFO for the duration of the playback. As pointed out in [6], pre-fetching all the data into the FIFO before playback is not a feasible solution because the size of the stream can be very large. Suppose that a read request for a large multimedia object is issued. The starting time and the minimum buffer space needed to display the retrieved multimedia data continuously are determined as follows, under the following conditions: (i) the timing of data retrieval is known in advance, (ii) both the transfer rate and the consumption rate are constant, and (iii) the transfer rate of the storage device is at least as great as the consumption rate. Consider Fig. 1. First, the amount of data that needs to be consumed by a display is illustrated as the dotted line marked data read. The vertical line segments show the amount of data that needs to be consumed in order to continuously display,
Continuous Multimedia Data Retrieval. Figure 1. Finding minimum buffer space and start time (Fig. 2 in [6]).
and the horizontal line shows the time periods over which that amount of data is consumed on a display. Second, the solid zigzag line, marked data buffered, shows the data to be accessed in the data buffers. The vertical line segments show the data to be read into the buffers, followed by the line segments that show data being consumed in the buffer during a certain time interval. Here, in the solid zigzag line, there is a minimum point (marked minimum-shift up to zero), which may be a negative value and is denoted as z (< 0). Third, the dotted zigzag line (marked shifted buffer plot) is obtained by shifting the entire solid zigzag line up by the amount |z| where z < 0. Finally, the starting time to display is determined as the point at which the shifted-up dotted zigzag line (shifted buffer plot) and the dotted line (data read) intersect, which is indicated as intersection - start time in Fig. 1. Also, the minimum buffer size is the maximum value of the shifted buffer plot, which is indicated as required buffer space in Fig. 1. Details are discussed in [7].
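Under the stated conditions (retrieval timing known in advance, constant consumption rate), the start time and minimum buffer space can be found from the cumulative amounts read and consumed. The following discrete-time sketch of that computation is illustrative only, with invented names; it assumes playback completes within the given time slots.

from itertools import accumulate

def start_time_and_buffer(reads, consume_rate):
    """Given the amount read in each time slot and a constant per-slot
    consumption rate, find the earliest start slot at which playback can
    begin without the buffer ever under-running, and the buffer space
    that this start requires."""
    cum_read = list(accumulate(reads))
    n = len(reads)
    for start in range(n):
        # Cumulative consumption by the end of slot t, if playback starts at 'start'.
        need = [max(0, t - start + 1) * consume_rate for t in range(n)]
        if all(cum_read[t] >= need[t] for t in range(n)):
            # Data read but not yet consumed must be held in the buffer.
            buffer_space = max(cum_read[t] - need[t] for t in range(n))
            return start, buffer_space
    return None  # the read schedule can never keep up with consumption

# Bursty reads of 4 units every other slot, consumption of 2 units per slot:
print(start_time_and_buffer([4, 0, 4, 0, 4, 0], consume_rate=2))   # (0, 2)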
Retrieval of a Single Stream on Multiple Disks
Multimedia data retrieval using multiple disks is a technique to retrieve a data stream continuously at the required bandwidth. The main idea is to de-cluster the data stream into several fragments [14,2] and distribute these fragments across multiple processors (and disks). By combining the I/O bandwidths of several disks, a system can provide the retrieval rate required to display a continuous multimedia stream in real time. Assume that the required retrieval rate is B and the bandwidth of each disk is BD. The degree of de-clustering can then be calculated as M = ⌈B/BD⌉, which is the number of disks needed to satisfy the required retrieval rate. When the degree of de-clustering is determined, the fragments can be formed using a round-robin partitioning strategy as illustrated in Fig. 2, where an object x is partitioned into M fragments stored on M disks. The round-robin partitioning is conducted as follows. First, the object x is divided into N blocks (disk pages) depending on the disk-page size allowed on the disks. In Fig. 2, the number of blocks is N = M × M. The first block, block0, is assigned to the first fragment, indicated as x1 in Fig. 2, and the second block, block1, is assigned to the second fragment, indicated as x2. The first M blocks, from block0 to blockM−1, are assigned to the M fragments one by one. In the next run, the next set of blocks,
from blockM to block2M−1, will be assigned to the M fragments in a similar fashion. The process repeats until all data blocks are assigned to the fragments in a round-robin fashion.
Continuous Multimedia Data Retrieval. Figure 2. Round-robin partitioning of object x (Fig. 5 in [9]).
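The degree of de-clustering and the round-robin block assignment described above can be sketched as follows (illustrative code only; all names are invented).

from math import ceil

def declustering_degree(required_rate, disk_rate):
    # M = ceil(B / B_D): the number of disks whose combined bandwidth
    # meets the required retrieval rate.
    return ceil(required_rate / disk_rate)

def round_robin_fragments(blocks, m):
    """Assign blocks block0, block1, ... to M fragments in round-robin order:
    block i goes to fragment (i mod M)."""
    fragments = [[] for _ in range(m)]
    for i, block in enumerate(blocks):
        fragments[i % m].append(block)
    return fragments

m = declustering_degree(required_rate=12.0, disk_rate=5.0)   # -> 3 disks
blocks = [f"block{i}" for i in range(9)]
for disk, frag in enumerate(round_robin_fragments(blocks, m)):
    print(f"disk {disk}: {frag}")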
Retrieval of Multiple Streams on a Single Disk
In a multi-user environment, several users may retrieve data streams simultaneously. Therefore, multiple data streams are requested on a single disk. The data streams are retrieved in rounds, and each stream is allowed one disk access or a fixed number of disk accesses at a time. All data retrieval requests need to be served in turn. Existing solutions include the SCAN, round-robin, EDF, and sorting-set algorithms. The round-robin algorithm retrieves data for each data retrieval request, in turn, in a predetermined order. The SCAN algorithm moves the disk head back and forth, and retrieves the requested blocks when the disk head passes over them [18]. The EDF (earliest-deadline-first) algorithm serves the request with the earliest deadline first, where a deadline is given to each data stream [13]. The sorting-set algorithm is designed to exploit the trade-off between the number of rounds between successive reads for a data stream and the length of the round [8,21], by assigning each data stream to a sorting set. Fixed time slots are allocated to a sorting set in a round, during which its requests are possibly processed.
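Of the round-based policies listed above, earliest-deadline-first is the simplest to sketch. The code below is illustrative only; the request representation is invented here.

import heapq

def edf_round(requests, accesses_per_round):
    """Serve up to 'accesses_per_round' pending requests, earliest deadline
    first.  Each request is a (deadline, stream_id) pair; the remaining
    requests are returned for the next round."""
    heapq.heapify(requests)
    served = [heapq.heappop(requests)
              for _ in range(min(accesses_per_round, len(requests)))]
    return served, requests

pending = [(30, "video-2"), (10, "audio-1"), (20, "video-1"), (40, "audio-2")]
served, rest = edf_round(pending, accesses_per_round=2)
print("served this round:", served)   # the two earliest deadlines
print("carried over:", sorted(rest))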
Retrieval of Multiple Streams on Multiple Disks
In order to support retrieval of multiple streams, making use of parallel disks is an effective method, where a data stream is striped across the parallel disks. There are several approaches to retrieving data streams when they are stored on parallel disks (Fig. 3). It is important to note that the main issue here is to increase the number of data streams that can be retrieved simultaneously, not to speed up retrieval of an individual data stream using multiple disks. Consider the striped retrieval shown in Fig. 3a, where a data stream is striped across m parallel disks. Suppose that each disk has bandwidth rc; the m parallel disks can together increase the bandwidth up to m·rc. However, the issue is the system capacity in terms of the number of data streams it can serve, for example, from n data streams to m·n data streams using m parallel disks. Suppose that each data stream is served in turn. When the number of data streams increases from n to m·n, in the striped retrieval the round length (that is, the spacing between consecutive reads for a single data stream) increases proportionally from n to m·n. This implies that, in order to satisfy the required retrieval rate, a larger buffer is needed, which also implies a larger startup delay. An improvement over striped retrieval is split-stripe retrieval, which allows partial stripes to be used (Fig. 3b) in order to reduce the buffer size required by striped retrieval, but its ability to significantly reduce startup delay and buffer space is limited. Observe the data transfer patterns in the striped and split-striped retrieval, which show bursty patterns for data read into the buffer. For instance, consider Fig. 3a: an entire stripe for a single data stream is read in and consumed over a period related to the round length. This requires larger buffer sizes, because the data must be kept until the next read in order to be displayed continuously, in particular when the number of streams increases from n to m·n. Instead, an approach has been proposed to read small portions of data frequently, in order to reduce the required buffer space. This approach is called cyclic retrieval. As shown in Fig. 3c, cyclic retrieval reads multiple streams rather than one stream at a time. Rather than retrieving an entire stripe at once, cyclic retrieval retrieves each striping unit of a stripe consecutively [1,3]. Using this approach, the buffer space is significantly reduced. But the reduction comes at a cost: the buffer space reduction is achieved at the expense of cuing (a stream is said to
Continuous Multimedia Data Retrieval. Figure 3. Retrieval of multiple streams on multiple disks [6].
be cued if it is paused and playback may be initiated instantaneously) and clock-skew tolerance. As an alternative to striped (split-striped) or cyclic retrieval, each disk can be treated independently rather than as part of a parallel disk array. Here, each disk stores a number of titles (data streams). When a multimedia data retrieval request arrives, a disk that contains the requested data stream responds. Data streams that are frequently requested may be kept on multiple disks using replication, and the number of replicas can be determined based on the retrieval frequency of the data streams [12], as shown in Fig. 3d. With replicated retrieval, both the startup delay and the buffer space can be reduced significantly, and the scheme scales easily as the number of data streams increases, at the expense of more disk space. [9] discusses data replication techniques. A comparison among striped, cyclic, and replicated retrieval in supporting n streams is shown in Table 1.
Continuous Multimedia Data Retrieval. Table 1. A comparison of multi-disk retrieval strategies supporting n streams (Table 1 in [6])

                        Striped       Cyclic        Replicated
Instant restart         yes           no            yes
Clock skew tolerance    yes           no            yes
Easy scaling            no            no            yes
Capacity                per-system    per-system    per-title
Startup delay           O(n)          O(n)          O(1)
Buffer space            O(n²)         O(n)          O(n)
Key Applications
Continuous multimedia data retrieval is used to deliver real-time continuous multimedia streams, such as audio and video, over networks. Especially in a multi-user environment, continuous multimedia data retrieval techniques are used to support the simultaneous display of several multimedia objects in real time.
Cross-references
▶ Buffer Management ▶ Buffer Manager ▶ Multimedia Data Buffering ▶ Multimedia Data Storage ▶ Multimedia Resource Scheduling ▶ Scheduling Strategies for Data Stream Processing ▶ Storage Management ▶ Storage Manager
Recommended Reading 1. Berson S., Ghandeharizadeh S., Muntz R., and Ju X. Staggered striping in multimedia information systems. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1994, pp. 79–90. 2. Carey M.J. and Livny M. Parallelism and concurrency control performance in distributed database machines. ACM SIGMOD Rec., 18(2):122–133, 1989. 3. Chen M.S., Kandlur D.D., and Yu P.S. Storage and retrieval methods to support fully interactive playout in a disk-array-based video server. Multimedia Syst., 3(3):126–135, 1995. 4. Christodoulakis S. and Ford D.A. Performance analysis and fundamental performance tradeoffs for CLV optical disks. ACM SIGMOD Rec., 17(3):286–294, 1988. 5. Ford D.A. and Christodoulakis S. Optimizing random retrievals from CLV format optical disks. In Proc. 17th Int. Conf. on Very Large Data Bases, 1991, pp. 413–422. 6. Gemmell D.J. Multimedia information storage and management, chap. 1. Disk Scheduling for Continuous Media. Kluwer, Norwell, MA, USA, 1996. 7. Gemmell J. and Christodoulakis S. Principles of delay-sensitive multimedia data storage retrieval. ACM Trans. Inf. Syst., 10(1):51–90, 1992. 8. Gemmell D.J. and Han J. Multimedia network file servers: multichannel delay-sensitive data retrieval. Multimedia Syst., 1(6):240–252, 1994. 9. Ghandeharizadeh S. and Ramos L. Continuous retrieval of multimedia data using parallelism. IEEE Trans. on Knowl. and Data Eng., 5(4):658–669, 1993. 10. Green J.L. The evolution of DVI system software. Commun. ACM, 35(1):52–67, 1992. 11. Lippman A. and Butera W. Coding image sequences for interactive retrieval. Commun. ACM, 32(7):852–860, 1989. 12. Little T.D.C. and Venkatesh D. Popularity-based assignment of movies to storage devices in a video-on-demand system. Multimedia Syst., 2(6):280–287, 1995. 13. Liu C.L. and Layland J.W. Scheduling algorithms for multiprogramming in a hard real-time environment. In Tutorial: Hard Real-Time Systems. IEEE Computer Society, Los Alamitos, CA, USA, 1989, pp. 174–189. 14. Livny M., Khoshafian S., and Boral H. Multi-disk management algorithms. SIGMETRICS Perform. Eval. Rev., 15(1):69–77, 1987. 15. Luther A.C. Digital video in the PC environment, (2nd edn.). McGraw-Hill, New York, NY, USA, 1991.
16. McKusick M.K., Joy W.N., Leffler S.J., and Fabry R.S. A fast file system for UNIX. Comput. Syst., 2(3):181–197, 1984. 17. Patterson D.A., Gibson G.A., and Katz R.H. A case for redundant arrays of inexpensive disks (RAID). In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1988, pp. 109–116. 18. Teorey T.J. and Pinkerton T.B. A comparative analysis of disk scheduling policies. In Proc. 3rd ACM Symp. on Operating System Principles, 1971, pp. 114. 19. Tsai W.J. and Lee S.Y. Storage design and retrieval of continuous multimedia data using multi-disks. In Proc. 1994 Int. Conf. on Parallel and Distributed Systems, 1994, pp. 148–153. 20. Wong C.K. Minimizing expected head movement in onedimensional and two-dimensional mass storage systems. ACM Comput. Surv., 12(2):167–178, 1980. 21. Yu P.S., Chen M.S., and Kandlur D.D. Grouped sweeping scheduling for DASD-based multimedia storage management. Multimedia Syst., 1(3):99–109, 1993. 22. Yue P.C. and Wong C.K. On the optimality of the probability ranking scheme in storage applications. J. ACM, 20(4):624–633, 1973.
Continuous Queries in Sensor Networks
YONG YAO, JOHANNES GEHRKE
Cornell University, Ithaca, NY, USA
Synonyms
Long running queries
Definition
A powerful programming paradigm for data acquisition and dissemination in sensor networks is a declarative query interface. With a declarative query interface, the sensor network is programmed for long-term monitoring and event detection applications through continuous queries, which specify what data to retrieve at what time or under what conditions. Unlike snapshot queries, which execute only once, continuous queries are evaluated periodically until they expire. Continuous queries are expressed in a high-level language, and are compiled and installed on target sensor nodes, controlling when, where, and what data is sampled, possibly filtering out unqualified data through local predicates. Continuous queries can have a variety of optimization goals, from improving result quality and response time to reducing energy consumption and prolonging network lifetime.
Historical Background
In recent years sensor networks have been deployed successfully for a wide range of applications, from environmental sensing to process monitoring. A database approach to programming sensor networks has gained much importance: clients program the network through queries without knowing how the results are generated, processed, and returned to the client. Sophisticated catalog management, query optimization, and query processing techniques abstract the client from the physical details of contacting the relevant sensors, processing the sensor data, and sending the results to the client. The concept of a sensor network as a database was first introduced in [3]. A number of research projects, including TinyDB [9] and Cougar [14], have implemented continuous queries as part of their database languages for sensor networks. In these systems time is divided into epochs of equal size, and continuous queries are evaluated once per epoch during their lifetime. Figure 1 shows this database view of sensor networks. Two properties are significant to continuous query processing in sensor networks: energy conservation and fault tolerance in the case of sensor failures, both topics that are not of importance in traditional database systems or data stream systems. Advanced query processing techniques have been proposed to enable energy-efficient query processing in the presence of
frequent node and communication failures. For example, a lot of research has been dedicated to in-network query processing [6,9,14] to reduce the amount of data to be transmitted inside the network. Another approach is to permit approximate query processing [4,5], which produces approximate query answers within a predefined accuracy range but consumes much less energy. Sensor data is correlated in time and space. Data compression in sensor networks and probabilistic data models [1,7,8] exploit data correlation and remove redundant data from intermediate results. Next-generation sensor networks may consist of media-rich and mobile sensor nodes, which give rise to new challenges for continuous query processing, such as mobility and high data rates. ICEDB [15] describes a new framework for continuous query processing in sensor networks with intermittent network connectivity and large amounts of data to transfer.
Foundations
Continuous queries are a natural approach to data fusion in sensor networks for long-running applications, as they provide a high-level interface that abstracts the user from the physical details of the network. The design and implementation of continuous queries needs to satisfy several requirements. First, it has to preserve scarce resources such as energy and bandwidth in battery-powered sensor networks. Thus the simple approach of transmitting all relevant data back to a central node for query evaluation is prohibitive for sensor networks of non-trivial size, as communication over the wireless medium consumes a lot of energy. Since sensor nodes have the ability to perform local computation, communication can be traded for computation by moving computation from the clients into the sensor network, aggregating partial results or eliminating irrelevant data. Second, sensor network applications usually have different QoS requirements, ranging from accuracy and energy consumption to delay. Therefore the continuous query model needs to be flexible enough to accommodate various processing techniques in different scenarios.
Sensor Data Model
Continuous Queries in Sensor Networks. Figure 1. Database view of sensor networks.
In the view of a sensor network as a database, each sensor node is modeled as a separate data source that generates records with several fields such as the sensor type, location of the sensor node, a time stamp, and the value of the reading. Records of the same sensor type from different nodes have the same schema, and
these records collectively form a distributed table of sensor readings. Thus the sensor network can be considered a large distributed database system consisting of several tables for different types of sensors. Sensor readings are samples of physical signals whose values change continuously over time. For example, in environmental monitoring applications, sensor readings are generated every few seconds (or even faster). For some sensor types (such as PIR sensors that sense the presence of objects) the readings might change rapidly and thus may be outdated rather quickly, whereas for other sensors the value changes only slowly over time, as for temperature sensors, which usually have a small derivative. Continuous queries recompute query results periodically and keep query results up-to-date. For applications that require only approximate results, the system can cache previous results and lower the query update rate to save energy. Instead of querying raw sensor data, most applications are more interested in composite data, which captures high-level events monitored by sensor networks. Such composite data is produced by complex signal processing algorithms given raw sensor measurements as inputs. Composite data usually has a compact structure and is easier to query.
Continuous Query Models
In TinyDB and Cougar, continuous queries are represented as a variant of SQL with a few extensions. A simple query template in Cougar is shown below. (TinyDB uses a very similar query structure.)
SELECT   {attribute, aggregate}
FROM     {Sensordata S}
WHERE    {predicate}
GROUP BY {attribute}
HAVING   {predicate}
DURATION time interval
EVERY    time span e
The template can be extended to support nested queries, where the basic query block shown below can appear within the WHERE or HAVING clause of another query block. The query template has an obvious semantics: the SELECT clause specifies attributes and aggregates from sensor records, the FROM clause specifies the distributed relation describing the sensor type, the WHERE clause filters sensor records by a predicate, the GROUP BY clause classifies sensor records into different partitions according to some attributes, and the
HAVING clause eliminates groups by a predicate. Join
queries between external tables and sensor readings are constructed by including the external tables and sensor readings in the FROM clause and join predicates in the WHERE clause. Two new clauses introduced for continuous queries are DURATION and EVERY; the DURATION clause specifies the lifetime of the continuous query, and the EVERY clause determines the rate of query answers. TinyDB has two related clauses, LIFETIME and SAMPLE INTERVAL, specifying the lifetime of the query and the sample interval, respectively. The LIFETIME clause is discussed in more detail a few paragraphs below. In event detection applications, sensor data is collected only when particular events happen. The above query template can be extended with a condition clause as a prerequisite that determines when to start or stop the main query. Event-based queries have the following structure in TinyDB:
ON EVENT {event(arguments)}:
  {query body}
Another extension to the basic query template is lifetime-based queries, which have no explicit EVERY or SAMPLE INTERVAL clause; only the query lifetime is specified through a LIFETIME clause [9]. The system automatically adjusts the sensor sampling rate to the highest rate possible with the guarantee that the sensor network can process the query for the specified lifetime. Lifetime-based queries are more intuitive in some mission critical applications where user queries have to run for a given period of time, but it is hard to predict the optimal sampling rate in advance. Since the sampling rate is adjusted continuously according to the available power and the energy consumption rate in the sensor network, lifetime-based queries are more adaptive to unpredictable changes in sensor networks deployed in a harsh environment. Common Types of Continuous Queries in Sensor Networks Select-All Queries
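A back-of-the-envelope version of this rate adjustment is shown below; it is purely illustrative (the simple energy model and all names are invented and do not reflect TinyDB's actual estimator), dividing the remaining energy budget over the remaining lifetime.

def max_sample_rate(remaining_energy_j, remaining_lifetime_s, energy_per_sample_j):
    """Highest sustainable sampling rate (samples/s) such that the node's
    remaining energy lasts for the requested lifetime."""
    budget_per_second = remaining_energy_j / remaining_lifetime_s
    return budget_per_second / energy_per_sample_j

# 5,000 J left, 30 days to go, 0.5 mJ per sample (sensing plus radio share):
rate = max_sample_rate(5000.0, 30 * 24 * 3600, 0.0005)
print(f"{rate:.2f} samples per second")   # ~3.86 samples/s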
Recent sensor network deployments indicate that a very common type of continuous queries is a selectall query, which extracts all relevant data from the sensor network and stores the data in a central place for further processing and analysis. Although select-all queries are simple to express, efficient processing of select-all queries is a big challenge. Without optimization, the size of the transmitted data explodes
quickly, and thus the power of the network would be drained in a short time, especially for those nodes acting as a bridge to the outside world; this significantly decreases the lifetime of the sensor network. One possible approach is to apply model-based data compression at intermediate sensor nodes [7]. For many types of signals, e.g., temperature and light, sensor readings are highly correlated in both time and space. Data compression in sensor networks can significantly reduce the communication overhead and increase the network lifetime. Data compression can also improve the signal quality by removing unwanted noise from the original signal. One possible form of compression is to construct and maintain a model of the sensor data in the network; the model is stored both on the server and on sensor nodes in the network. The model on the server can be used to predict future values within a pre-defined accuracy range. Data communication occurs to synchronize the data model on the server with the real sensor measurements [7].
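The model-based idea can be illustrated with the simplest possible predictor, a cached last value with an error bound. This is a sketch only; systems such as [7] use richer time-series models, and the names below are invented.

class ApproximateReporter:
    """Node-side filter: transmit a reading only when it deviates from the
    value the server would predict by more than a tolerance epsilon."""
    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.last_sent = None      # the value the server currently predicts

    def sample(self, reading):
        if self.last_sent is None or abs(reading - self.last_sent) > self.epsilon:
            self.last_sent = reading
            return reading         # send an update to re-synchronize the model
        return None                # suppressed: the server's prediction is still valid

node = ApproximateReporter(epsilon=0.5)
for temp in [20.0, 20.2, 20.4, 21.1, 21.2]:
    sent = node.sample(temp)
    print(temp, "->", "sent" if sent is not None else "suppressed")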
Aggregate Queries
Aggregate queries return aggregate values for each group of sensor nodes specified by the GROUP BY clause. Below is an example query that computes the average concentration in a region every 10 seconds for the next hour:
SELECT   AVG(R.concentration)
FROM     ChemicalSensor R
WHERE    R.loc IN region
HAVING   AVG(R.concentration) > T
DURATION (now, now+3600)
EVERY    10
Data aggregation in sensor networks is well-studied because it scales to sensor networks with even thousands of nodes. Query processing proceeds along a spanning tree of sensor nodes towards a gateway node. During query processing, partial aggregate results are transmitted from a node to its parent in the spanning tree. Once an intermediate node in the tree has received all data from the nodes below it in a round, the node computes a partial aggregate of all received data and sends that output to the next node. This solution works for aggregate operators that are incrementally computable, such as avg, max, and moments of the data. The only caveat is that this in-network computation requires synchronization between sensor nodes along the communication path, since a node
has to ‘‘wait’’ to receive results to be aggregated. In networks with high loss rates, broken links are hard to differentiate from long delays, making synchronization a non-trivial problem [13].
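For an algebraic aggregate such as AVG, each node forwards only a small partial state (sum, count) rather than raw readings, and parents merge the partial states of their children before forwarding. The sketch below illustrates that merge along a spanning tree; the tree layout and the readings are made up for the example.

def aggregate_avg(node, children, readings):
    """Recursively combine (sum, count) partial states up a spanning tree.

    node     -- id of the current node
    children -- dict: node id -> list of child ids
    readings -- dict: node id -> local sensor reading
    """
    total, count = readings[node], 1
    for child in children.get(node, []):
        c_sum, c_count = aggregate_avg(child, children, readings)
        total += c_sum
        count += c_count
    return total, count   # partial state sent to the parent

# Example spanning tree: gateway 0 with two subtrees.
children = {0: [1, 2], 1: [3, 4], 2: [5]}
readings = {0: 10.0, 1: 12.0, 2: 11.0, 3: 13.0, 4: 9.0, 5: 14.0}
s, c = aggregate_avg(0, children, readings)
print(s / c)   # 11.5, computed without shipping raw readings to the root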
Join Queries
In a wide range of event detection applications, sensor readings are compared to a large number of time- and location-varying predicates to determine whether an event of interest to the user has been detected [1]. The values of these predicates are stored in a table. Continuous queries with a join operator between sensor readings and the predicate table are suitable for such applications. Similar join queries can be used to detect defective sensor nodes whose readings are inaccurate by checking their readings against readings from neighboring sensors (again assuming spatial correlation between sensor readings). Suitable placement of the join operator in a sensor network has also been examined [2].
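Conceptually, each incoming reading is joined against a table of predicates whose thresholds vary with time and location, and a tuple is emitted whenever a predicate is violated. The sketch below is a minimal, centralized illustration of that join; the predicate table and the readings are invented for the example, and in-network systems evaluate the same join inside the network.

# Hypothetical predicate table: (region, start_hour, end_hour, max_temp)
predicates = [
    ("lab",  0, 24, 30.0),
    ("oven", 8, 18, 250.0),
]

def detect_events(readings):
    """Join sensor readings with the predicate table and emit violations."""
    events = []
    for region, hour, temp in readings:            # one tuple per reading
        for p_region, start, end, max_temp in predicates:
            if region == p_region and start <= hour < end and temp > max_temp:
                events.append((region, hour, temp))
    return events

print(detect_events([("lab", 3, 31.5), ("oven", 12, 200.0), ("oven", 20, 300.0)]))
# [('lab', 3, 31.5)] -- the 20:00 oven reading falls outside its predicate's hours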
Key Applications

Habitat Monitoring
In the Great Duck Island experiment, a network of sensors was deployed to monitor the microclimate in and around nesting burrows used by birds, with the goal of developing a habitat monitoring kit that would enable researchers worldwide to engage in non-intrusive and non-disruptive monitoring of sensitive wildlife and habitats [10]. In a more recent experiment, a sensor network was deployed to densely record the complex spatial variations and the temporal dynamics of the microclimate around a 70-meter tall redwood tree [12].

The Intelligent Building
Sensor networks can be deployed in intelligent buildings for the collection and analysis of structural responses to ambient or forced excitation of the building's structure, for control of light and temperature to conserve energy, and for monitoring of the flow of people in critical areas. Continuous queries are used for data collection, for event-based monitoring of sensitive areas, and to enforce security policies.

Industrial Process Control
Industrial manufacturing processes often have strict requirements on temperature, humidity, and other environmental parameters. Sensor networks can be
deployed to monitor the production environment without the need to install expensive wiring. Continuous join queries compare the state of the environment to a range of values specified in advance and send an alert when an exception is detected [1].
Cross-references
▶ Approximate Query Processing ▶ Data Acquisition and Dissemination in Sensor Networks ▶ Data Aggregation in Sensor Networks ▶ Data Compression in Sensor Networks ▶ Data Fusion in Sensor Networks ▶ Database Languages for Sensor Networks ▶ Distributed Database Systems ▶ In-Network Query Processing ▶ Sensor Networks
Recommended Reading 1. Abadi D., Madden S., and Lindner W. REED: robust, efficient filtering and event detection in sensor networks. In Proc. 31st Int. Conf. on Very Large Data Bases, 2005, pp. 768–780. 2. Bonfils B. and Bonnet P. Adaptive and decentralized operator placement for in-network query processing. In Proc. 2nd Int. Workshop Int. Proc. in Sensor Networks, 2003, pp. 47–62. 3. Bonnet P., Gehrke J., and Seshadri P. Towards sensor database systems. In Proc. 2nd Int. Conf. on Mobile Data Management, 2001, pp. 3–14. 4. Chu D., Deshpande A., Hellerstein J., and Hong W. Approximate data collection in sensor networks using probabilistic models. In Proc. 22nd Int. Conf. on Data Engineering, 2006. 5. Considine J., Li F., Kollios G., and Byers J. Approximate aggregation techniques for sensor databases. In Proc. 20th Int. Conf. on Data Engineering, 2004, pp. 449–460. 6. Deligiannakis A., Kotidis Y., and Roussopoulos N. Hierarchical in-network data aggregation with quality guarantees. In Advances in Database Technology, Proc. 9th Int. Conf. on Extending Database Technology, 2004, pp. 658–675. 7. Deshpande A., Guestrin C., Madden S., Hellerstein J., and Hong W. Model-driven data acquisition in sensor networks. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004, pp. 588–599. 8. Kanagal B. and Deshpande A. Online filtering, smoothing and probabilistic modeling of streaming data. In Proc. 24th Int. Conf. on Data Engineering, 2008, pp. 1160–1169. 9. Madden S., Franklin M., Hellerstein J., and Hong W. The design of an acquisitional query processor for sensor networks. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003, pp. 491–502. 10. Mainwaring A., Polastre J., Szewczyk R., Culler D., and Anderson J. Wireless sensor networks for habitat monitoring. In Proc. 1st ACM Int. Workshop on Wireless Sensor Networks and Applications, 2002, pp. 88–97.
11. Stoianov I., Nachman L., Madden S., and Tokmouline T. PIPENET: a wireless sensor network for pipeline monitoring. In Proc. 6th Int. Symp. Inf. Proc. in Sensor Networks, 2007, pp. 264–273. 12. Tolle G., Polastre J., Szewczyk R., Culler D., Turner N., Tu K., Burgess S., Dawson T., Buonadonna P., Gay D., and Hong W. A macroscope in the redwoods. In Proc. 3rd Int. Conf. on Embedded Networked Sensor Systems, 2005. 13. Trigoni N., Yao Y., Demers A.J., Gehrke J., and Rajaraman R. Wave scheduling and routing in sensor networks. ACM Trans. Sensor Netw., 3(1):2, 2007. 14. Yao Y. and Gehrke J. Query processing in sensor networks. In Proc. 1st Biennial Conf. on Innovative Data Systems Research, 2003. 15. Zhang Y., Hull B., Balakrishnan H., and Madden S. ICEDB: intermittently connected continuous query processing. In Proc. 23rd Int. Conf. on Data Engineering, 2007, pp. 166–175.
Continuous Query
SHIVNATH BABU
Duke University, Durham, NC, USA
Synonyms
Standing query
Definition
A continuous query Q is a query that is issued once over a database D, and then logically runs continuously over the data in D until Q is terminated. Q lets users get new results from D without having to issue the same query repeatedly. Continuous queries are best understood in contrast to traditional SQL queries over D that run once to completion over the current data in D.
Key Points
Traditional database systems expect all data to be managed within some form of persistent data sets. For many recent applications, where the data is changing constantly (often exclusively through insertions of new elements), the concept of a continuous data stream is more appropriate than a data set. Several applications generate data streams naturally as opposed to data sets, e.g., financial tickers, performance measurements in network monitoring, and call detail records in telecommunications. Continuous queries are a natural interface for monitoring data streams. In network monitoring, e.g., continuous queries may be used to monitor whether all routers and links are functioning efficiently.
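The contrast between the two query styles can be sketched in a few lines: a one-shot query scans the current data and terminates, while a continuous query is registered once and keeps emitting answers as new elements arrive. The stream contents and the threshold below are invented for illustration only.

def one_shot(data, threshold):
    """Traditional query: evaluated once over the current data set."""
    return [x for x in data if x > threshold]

def continuous(stream, threshold):
    """Continuous query: registered once, produces results as data arrives."""
    for x in stream:            # conceptually never-ending
        if x > threshold:
            yield x             # new result pushed to the user

ticks = iter([10, 42, 7, 55, 3])        # stand-in for a financial ticker
standing_query = continuous(ticks, 40)  # issued once...
print(list(standing_query))             # ...keeps answering: [42, 55]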
The Tapestry system [3] for filtering streams of email and bulletin-board messages was the first to make continuous queries a core component of a database system. Continuous queries in Tapestry were expressed using a subset of SQL. Barbara [2] later formalized continuous queries for a wide spectrum of environments. With the recent emergence of generalpurpose systems for processing data streams, continuous queries have become the main interface that users and applications use to query data streams [1]. Materialized views and triggers in traditional database systems can be viewed as continuous queries. A materialized view V is a query that needs to be reevaluated or incrementally updated whenever the base data over which V is defined changes. Triggers implement event-condition-action rules that enable database systems to take appropriate actions when certain events occur.
Cross-references
▶ Database Trigger ▶ ECA-Rule ▶ Materialized Views ▶ Processing

Recommended Reading
1. Babu S. and Widom J. Continuous queries over data streams. ACM SIGMOD Rec., 30(3):109–120, 2001.
2. Barbara D. The characterization of continuous queries. Int. J. Coop. Inform. Syst., 8(4):295–323, 1999.
3. Terry D., Goldberg D., Nichols D., and Oki B. Continuous queries over append-only databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1992, pp. 321–330.
Continuous Query Languages ▶ Stream-oriented Query Languages and Operators
Continuous Query Processing Applications ▶ Streaming Applications
Continuous Query Scheduling ▶ Scheduling Strategies for Data Stream Processing
ConTract
ANDREAS REUTER 1,2
1 EML Research aGmbH Villa Bosch, Heidelberg, Germany
2 Technical University Kaiserslautern, Kaiserslautern, Germany

Definition
A ConTract is an extended transaction model that employs transactional mechanisms in order to provide a run-time environment for the reliable execution of long-lived, workflow-like computations. The focus is on durable execution and on correctness guarantees with respect to the effects of such computations on shared data.

Key Points
The notion of a ConTract (concatenated transactions) combines the principles of workflow programing with the ideas related to long-lived transactions. The ConTract model is based on a two-tier programing approach. At the top level, each ConTract is a script describing a (long-lived) computation. The script describes the order of execution of so-called steps. A step is a predefined unit of execution (e.g., a service invocation) with no visible internal structure. A step can access shared data in a database, send messages, etc. A ConTract, once it is started, will never be lost by the system, no matter which technical problems (short of a real disaster) will occur during execution. If completion is not possible, all computations performed by a ConTract will be revoked, so in a sense ConTracts have transactional behaviour in that they will either be run to completion, or the impossibility of completion will be reflected in the invocation of appropriate recovery measures. The ConTract model draws on the idea of Sagas, where the notion of compensation is employed as a means for revoking the results of computations beyond the boundaries of ACID transactions. In a ConTract, by default each step is an ACID transaction. But it is possible to group multiple steps (not just linear sequences) into a transaction. Compensation steps must be supplied by the application explicitly. The ideas of the ConTract model have selectively been implemented in some academic prototypes, but a full implementation has never been attempted. It has
influenced many later versions of ‘‘long-lived transaction’’ schemes, and a number of its aspects can be found in commercial systems such as BizTalk.
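The step-and-compensation discipline can be sketched in a few lines: run each step as its own atomic unit and, if a later step fails, invoke the compensation steps of the already completed ones in reverse order. The sketch below is a generic, Saga-style illustration of this idea with invented step functions; it is not an implementation of the ConTract run-time.

def run_contract(steps):
    """steps: list of (do, compensate) pairs; each 'do' is assumed atomic.
    On failure, compensate the completed steps in reverse order."""
    completed = []
    for do, compensate in steps:
        try:
            do()
            completed.append(compensate)
        except Exception:
            for comp in reversed(completed):   # revoke effects of earlier steps
                comp()
            raise

# Invented example steps: reserve a resource, then bill the customer.
log = []

def billing_step():
    raise RuntimeError("billing failed")

steps = [
    (lambda: log.append("reserve"), lambda: log.append("cancel reservation")),
    (billing_step, lambda: log.append("refund")),
]
try:
    run_contract(steps)
except RuntimeError:
    pass
print(log)   # ['reserve', 'cancel reservation']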
Cross-references
▶ Extended Transaction Models ▶ Persistent Execution ▶ Sagas ▶ Workflow
Recommended Reading
1. Reuter A. and Waechter H. The ConTract model. In Readings in Database Systems (2nd edn.), M. Stonebraker and J. Hellerstein (eds.). Morgan Kaufmann, Los Altos, CA, 1992, pp. 219–263.
Key Points
Workflow control data represents the dynamic state of the workflow system and its process instances. Examples of workflow control data include:
- State information about each workflow instance.
- State information about each activity instance (active or inactive).
- Information on recovery and restart points within each process.
The workflow control data may be written to persistent storage periodically to facilitate restart and recovery of the system after failure. It may also be used to derive audit data.
Cross-references
ConTracts ▶ Flex Transactions
▶ Activity ▶ Process Life Cycle ▶ Workflow Management and Workflow Management System ▶ Workflow Model
Contrast Pattern ▶ Emerging Patterns
Control Flow Diagrams ▶ Activity Diagrams
Contrast Pattern Based Classification ▶ Emerging Pattern Based Classification
Controlled Vocabularies ▶ Lightweight Ontologies
Control Data
NATHANIEL PALMER
Workflow Management Coalition, Hingham, MA, USA
Controlling Overlap ▶ Processing Overlaps
Synonyms
Workflow control data; Workflow engine state data; Workflow enactment service state data
Definition
Data that is managed by the Workflow Management System and/or a Workflow Engine. Such data is internal to the workflow management system and is not normally accessible to applications.
Convertible Constraints
CARSON KAI-SANG LEUNG
University of Manitoba, Winnipeg, MB, Canada
Definition
A constraint C is convertible if and only if C is convertible anti-monotone or convertible monotone.
A constraint C is convertible anti-monotone provided there is an order R on items such that when an ordered itemset S satisfies constraint C, so does any prefix of S. A constraint C is convertible monotone provided there is an order R′ on items such that when an ordered itemset S′ violates constraint C, so does any prefix of S′.
Key Points
Although some constraints are neither anti-monotone nor monotone in general, several of them can be converted into anti-monotone or monotone ones by properly ordering the items. These convertible constraints [1–3] possess the following nice properties. By arranging items according to some proper order R, if an itemset S satisfies a convertible anti-monotone constraint C, then all prefixes of S also satisfy C. Similarly, by arranging items according to some proper order R′, if an itemset S′ violates a convertible monotone constraint C′, then any prefix of S′ also violates C′. An example of a convertible constraint is avg(S.Price) ≥ 50, which expresses that the average price of all items in an itemset S is at least $50. By arranging items in non-ascending order R of price, if the average price of the items in an itemset S is at least $50, then the average price of the items in any prefix of S cannot be lower than that of S (i.e., all prefixes of an itemset S satisfying this convertible anti-monotone constraint also satisfy it). Similarly, by arranging items in non-descending order R′ of price, if the average price of the items in an itemset S falls below $50, then the average price of the items in any prefix of S cannot be higher than that of S (i.e., any prefix of an itemset S violating this convertible monotone constraint also violates it). Note that (i) any anti-monotone constraint is also convertible anti-monotone (for any order R) and (ii) any monotone constraint is also convertible monotone (for any order R′).
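The price example can be checked mechanically: arrange the items in non-ascending price order and verify that every prefix of a satisfying itemset still satisfies avg(price) ≥ 50. The sketch below does exactly that; the item names and prices are invented for illustration.

prices = {"a": 90, "b": 60, "c": 40, "d": 10}   # hypothetical item prices

def satisfies(itemset):
    """Constraint C: average price of the itemset is at least 50."""
    return sum(prices[i] for i in itemset) / len(itemset) >= 50

def ordered(itemset):
    """Order R: arrange items by non-ascending price."""
    return sorted(itemset, key=lambda i: prices[i], reverse=True)

S = ordered({"a", "b", "c"})           # avg = (90 + 60 + 40) / 3 = 63.3, satisfies C
assert satisfies(S)
# Convertible anti-monotonicity: every prefix of S under order R satisfies C too.
assert all(satisfies(S[:k]) for k in range(1, len(S) + 1))
print(S)   # ['a', 'b', 'c']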
Cross-references
▶ Frequent Itemset Mining with Constraints
Recommended Reading
1. Pei J. and Han J. Can we push more constraints into frequent pattern mining? In Proc. 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2000, pp. 350–354.
2. Pei J., Han J., and Lakshmanan L.V.S. Mining frequent item sets with convertible constraints. In Proc. 17th Int. Conf. on Data Engineering, 2001, pp. 433–442.
3. Pei J., Han J., and Lakshmanan L.V.S. Pushing convertible constraints in frequent itemset mining. Data Mining Knowl. Discov., 8(3):227–252, 2004.
Cooperative Classification ▶ Visual Classification
Cooperative Content Distribution ▶ Peer-To-Peer Content Distribution
Cooperative Storage Systems ▶ Peer-to-Peer Storage
Coordination
W.M.P. VAN DER AALST
Eindhoven University of Technology, Eindhoven, The Netherlands
Definition
Coordination is about managing dependencies between activities, processes, and components. Unlike the classical computation models, a coordination model puts much more emphasis on communication and cooperation than computation.
Key Points
Turing machines are a nice illustration of the classical ‘‘computation-oriented’’ view of systems. However, this view is too limited for many applications (e.g., web services). Many systems can be viewed as a collection of interacting entities (e.g., communicating Turing machines). For example, in the context of a service-oriented architecture (SOA), coordination is more important than computation. There exist many approaches to model and support coordination. Linda is an example
of a language to model coordination and communication among several parallel processes operating upon objects stored in and retrieved from a shared, virtual, associative memory [1]. Linda attempts to separate coordination from computation by only allowing interaction through tuplespaces. However, one could argue that this is also possible in classical approaches such as Petri nets (e.g., connecting processes through shared places), synchronized transition systems/automata, process algebra, etc. Coordination also plays an important role in agent technology [2]. Some authors emphasize the interdisciplinary nature of coordination [3]. Coordination is not a purely computer science issue; other disciplines such as organizational theory, economics, and psychology are also relevant.
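Linda's coordination primitives (out to deposit a tuple, in to withdraw a matching tuple, rd to read one without withdrawing it) can be mimicked with a few lines of code. The sketch below is a single-process illustration of the associative matching idea only; it ignores blocking semantics and concurrency, which are essential in real Linda systems, and the job tuples are invented for the example.

class TupleSpace:
    """Toy associative memory with Linda-style out/in/rd matching."""

    def __init__(self):
        self.tuples = []

    def out(self, *t):                      # deposit a tuple
        self.tuples.append(t)

    def _match(self, pattern, t):
        # None in the pattern acts as a wildcard field.
        return len(pattern) == len(t) and all(
            p is None or p == v for p, v in zip(pattern, t))

    def rd(self, *pattern):                 # read a matching tuple, leave it
        return next(t for t in self.tuples if self._match(pattern, t))

    def in_(self, *pattern):                # withdraw a matching tuple ('in' is a keyword)
        t = self.rd(*pattern)
        self.tuples.remove(t)
        return t

ts = TupleSpace()
ts.out("job", 1, "render frame 1")
ts.out("job", 2, "render frame 2")
print(ts.in_("job", None, None))   # a worker withdraws ('job', 1, 'render frame 1')
print(ts.rd("job", 2, None))       # another process peeks at job 2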
Cross-references
▶ Business Process Management ▶ Choreography ▶ Web Services ▶ Workflow Management
Recommended Reading
1. Gelernter D. and Carriero N. Coordination languages and their significance. Commun. ACM, 35(2):97–107, 1992.
2. Jennings N.R. Commitments and conventions: the foundation of coordination in multi-agent systems. Knowl. Eng. Rev., 8(3):223–250, 1993.
3. Malone T.W. and Crowston K. The interdisciplinary study of coordination. ACM Comput. Surv., 26(1):87–119, 1994.
||-Coords ▶ Parallel Coordinates
Copy Divergence ▶ Weak Consistency Models for Replicated Data
Copyright Issues in Databases
MICHAEL W. CARROLL
Villanova University School of Law, Villanova, PA, USA
Synonyms
Intellectual property; License
Definition Copyright is a set of exclusive rights granted by law to authors of original works of authorship. It applies automatically as soon as an original work is created and fixed in a tangible medium of expression, such as when it is stored on a hard disk. Originality requires independent creation by the author and a modicum of creativity. Copyright covers only an author’s original expression. Facts and ideas are not copyrightable. Copyright usually applies only partially to databases. Copyrightable expression usually is found in database structures, such as the selection and arrangement of field names, unless these do not reflect any creativity or are standard within an area of research. Copyright will also apply to creative data, such as photographs or expressive and sufficiently long text entries. By and large, the rule on facts and ideas means that most numerical data, scientific results, other factual data, and short text entries are not covered by copyright.
Historical Background Copyright has evolved from a limited right to control the unauthorized distribution of a limited class of works, primarily books, to a more expansive set of rights that attach automatically to any original work of authorship. Copyright law has always been national in scope, but through international treaties most nations now extend copyright to non-resident copyright owners. To comply with these treaties, copyright is now also automatic in the USA, which has abandoned requirements that a copyright owner register the work with the Copyright Office or publish the work with the copyright symbol – ß – in order to retain copyright.
Copy Transparency
▶ Strong Consistency Models for Replicated Data

Foundations
Copyright
Copyright attaches to an original work of authorship that has been embodied in a fixed form. The ‘‘work’’ to
which copyright attaches can be the structure of the database or a relatively small part of a database, including an individual data element, such as a photograph. It is therefore possible for a database to contain multiple overlapping copyrighted works or elements. To the extent that a database owner has a copyright, or multiple copyrights, in elements of a database, the rights apply only to those copyrighted elements. The rights are to reproduce, publicly distribute or communicate, publicly display, publicly perform, and prepare adaptations or derivative works. Standards for Obtaining Copyright Originality
Copyright protects only an author’s ‘‘original’’ expression, which means expression independently created by the author that reflects a minimal spark of creativity. A database owner may have a copyright in the database structure or in the user interface with the database, whether that be a report form or an electronic display of field names associated with data. The key is whether the judgments made by the person(s) selecting and arranging the data require the exercise of sufficient discretion to make the selection or arrangement ‘‘original.’’ In Feist Publications, Inc. v. Rural Telephone Service Company, the US Supreme Court held that a white pages telephone directory could not be copyrighted. The data—the telephone numbers and addresses – were ‘‘facts’’ which were not original because they had no ‘‘author.’’ Also, the selection and arrangement of the facts did not meet the originality requirement because the decision to order the entries alphabetically by name did not reflect the ‘‘minimal spark’’ of creativity needed. As a practical matter, this originality standard prevents copyright from applying to complete databases – i.e., those that list all instances of a particular phenomenon – that are arranged in an unoriginal manner, such as alphabetically or by numeric value. However, courts have held that incomplete databases that reflect original selection and arrangement of data, such as a guide to the ‘‘best’’ restaurants in a city, are copyrightable in their selection and arrangement. Such a copyright would prohibit another from copying and posting such a guide on the Internet without permission. However, because the copyright would be limited to that particular selection and arrangement of restaurants, a user could use such a database as a reference for creating a different selection and arrangement of
restaurants without violating the copyright owner’s copyright. Copyright is also limited by the merger doctrine, which appears in many database disputes. If there are only a small set of practical choices for expressing an idea, the law holds that the idea and expression merge and the result is that there is no legal liability for using the expression. Under these principles, metadata is copyrightable only if it reflects an author’s original expression. For example, a collection of simple bibliographic metadata with fields named ‘‘author,’’ ‘‘title,’’ ‘‘date of publication,’’ would not be sufficiently original to be copyrightable. More complex selections and arrangements may cross the line of originality. Finally, to the extent that software is used in a databases, software is protectable as a ‘‘literary work.’’ A discussion of copyright in executable code is beyond the scope of this entry. Fixation A work must also be ‘‘fixed’’ in any medium permitting the work to be perceived, reproduced, or otherwise communicated for a period of more than a transitory duration. The structure and arrangement of a database may be fixed any time that it is written down or implemented. For works created after January 1, 1978 in the USA, exclusive rights under copyright shower down upon the creator at the moment of fixation. The Duration of Copyright
Under international treaties, copyright must last for at least the life of the author plus 50 years. Some countries, including the USA, have extended the length to the life of the author plus 70 years. Under U.S. law, if a work was made as a ‘‘work made for hire,’’ such as a work created by an employee within the scope of employment, the copyright lasts for 120 years from creation if the work is unpublished or 95 years from the date of publication. Ownership and Transfer of Copyright
Copyright is owned initially by the author of the work. If the work is jointly produced by two or more authors, such as a copyrightable database compiled by two or more scholars, each has a legal interest in the copyright. When a work is produced by an employee, ownership differs by country. In the USA, the employer is treated as the author under the ‘‘work made for hire’’ doctrine and the employee has no rights in the resulting
work. Elsewhere, the employee is treated as the author and retains certain moral rights in the work while the employer receives the economic rights in the work. Copyrights may be licensed or transferred. A nonexclusive license, or permission, may be granted orally or even by implication. A transfer or an exclusive license must be done in writing and signed by the copyright owner. Outside of the USA, some or all of the author’s moral rights cannot be transferred or terminated by agreement. The law on this issue varies by jurisdiction. The Copyright Owner’s Rights
The rights of a copyright owner are similar throughout the world although the terminology differs as do the limitations and exceptions to these rights. Reproduction As the word ‘‘copyright’’ implies, the owner controls the right to reproduce the work in copies. The reproduction right covers both exact duplicates of a work and works that are ‘‘substantially similar’’ to the copyrighted work when it can be shown that the alleged copyist had access to the copyrighted work. In the USA, some courts have extended this right to cover even a temporary copy of a copyrighted work stored in a computer’s random access memory (‘‘RAM’’). Public Distribution, Performance, Display or Communication The USA divides the rights to express the work
to the public into rights to distribute copies, display a copy, or publicly perform the work. In other parts of the world, these are subsumed within a right to communicate the work to the public. Within the USA, courts have given the distribution right a broad reading. Some courts, including the appeals court in the Napster case, have held that a download of a file from a server connected to the internet is both a reproduction by the person requesting the file and a distribution by the owner of the machine that sends the file. The right of public performance applies whenever the copyrighted work can be listened to or watched by members of the public at large or a subset of the public larger than a family unit or circle of friends. Similarly, the display right covers works that can be viewed at home over a computer network as long as the work is accessible to the public at large or a subset of the public. Right of Adaptation, Modification or Right to Prepare Derivative Works A separate copyright arises with
respect to modifications or adaptations of a copyrighted work so long as these modifications or adaptations themselves are original. This separate copyright applies only to these changes. The copyright owner has the right to control such adaptations unless a statutory provision, such as fair use, applies. Theories of Secondary Liability
Those who build or operate databases also have to be aware that copyright law holds liable certain parties that enable or assist others in infringing copyright. In the USA, these theories are known as contributory infringement or vicarious infringement. Contributory Infringement
Contributory copyright infringement requires proof that a third party intended to assist a copyright infringer in that activity. This intent can be shown when one supplies a means of infringement with the intent to induce another to infringe or with knowledge that the recipient will infringe. This principle is limited by the so-called Sony doctrine, by which one who supplies a service or technology that enables infringement, such as a VCR or photocopier, will be deemed not to have knowledge of infringement or intent to induce infringement so long as the service or technology is capable of substantial non-infringing uses. Two examples illustrate the operation of this rule. In A&M Records, Inc. v. Napster, Inc., the court of appeals held that peer-to-peer file sharing is infringing but that Napster’s database system for connecting users for peer-to-peer file transfers was capable of substantial non-infringing uses and so it was entitled to rely on the Sony doctrine. (Napster was held liable on other grounds.) In contrast, in MGM Studios, Inc. v. Grokster, Ltd., the Supreme Court held that Grokster was liable for inducing users to infringe by specifically advertising its database service as a substitute for Napster’s. Vicarious Liability for Copyright Infringement
Vicarious liability in the USA will apply whenever (i) one has control or supervisory power over the direct infringer’s infringing conduct and (ii) one receives a direct financial benefit from the infringing conduct. In the Napster case, the court held that Napster had control over its users because it could refuse them access to the Napster server and, pursuant to the Terms of Service Agreements entered into with users, could terminate access if infringing conduct was discovered. Other courts have
required a greater showing of actual control over the infringing conduct. Similarly, a direct financial benefit is not limited to a share of the infringer’s profits. The Napster court held that Napster received a direct financial benefit from infringing file trading because users’ ability to obtain infringing audio files drew them to use Napster’s database. Additionally, Napster could potentially receive a financial benefit from having attracted a larger user base to the service. Limitations and Exceptions
Copyrights’ limitations and exceptions vary by jurisdiction. In the USA, the broad ‘‘fair use’’ provision is a fact-specific balancing test that permits certain uses of copyrighted works without permission. Fair use is accompanied by some specific statutory limitations that cover, for example, certain uses in the classroom use and certain uses by libraries. The factors to consider for fair use are: (i) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes; (ii) the nature of the copyrighted work; (iii) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (iv) the effect of the use upon the potential market for or value of the copyrighted work. The fact that a work is unpublished shall not itself bar a finding of fair use if such finding is made upon consideration of all the above factors. Countries whose copyright law follows that of the United Kingdom, a more limited ‘‘fair dealing’’ provision enumerates specific exceptions to copyright. In Europe, Japan, and elsewhere, the limitations and exceptions are specified legislatively and cover some private copying and some research or educational uses. Remedies and Penalties
In general, a copyright owner can seek an injunction against one who is either a direct or secondary infringer of copyright. The monetary consequences of infringement differ by jurisdiction. In the USA, the copyright owner may choose between actual or statutory damages. Actual damages cover the copyright owner’s lost profits as well as a right to the infringer’s profits derived from infringement. The range for statutory damages is $750–$30,000 per copyrighted work infringed. If infringement is found to have been willful, the range increases to $150,000. The amount of statutory damages in a specific case is determined by the
jury. There is a safe harbor from statutory damages for non-profit educational institutions if an employee reproduces a copyrighted work with a good faith belief that such reproduction is a fair use. A separate safe harbor scheme applies to online service providers when their database is comprised of information stored at the direction of their users. An example of such a database would be YouTube’s video sharing database. The service provider is immune from monetary liability unless the provider has knowledge of infringement or has control over the infringer and receives a direct financial benefit from infringement. The safe harbor is contingent on a number of requirements, including that the provider have a copyright policy that terminates repeat infringers, that the provider comply with a notice-and-takedown procedure, and that the provider have an agent designated to receive notices of copyright infringement.
Key Applications In cases arising after the Feist decision, the courts have faithfully applied the core holding that facts are in the public domain and free from copyright even when substantial investments are made to gather such facts. There has been more variation in the characterization of some kinds of data as facts and in application of the modicum-of-creativity standard to the selections and arrangements in database structures. On the question of when data is copyrightable, a court of appeals found copyrightable expression in the ‘‘Red Book’’ listing of used car valuations. The defendant had copied these valuations into its database, asserting that it was merely copying unprotected factual information. The court disagreed, likening the valuations to expressive opinions and finding a modicum of originality in these. In addition, the selection and arrangement of the data, which included a division of the market into geographic regions, mileage adjustments in 5,000-mile increments, a selection of optional features for inclusion, entitled the plaintiff to a thin copyright in the database structure. Subsequently, the same court found that the prices for futures contracts traded on the New York Mercantile Exchange (NYMEX) probably were not expressive data even though a committee makes some judgments in the setting of these prices. The court concluded that even if such price data were expressive, the merger doctrine applied because there was no other practicable way of expressing the idea other than through a
numerical value and a rival was free to copy price data from NYMEX’s database without copyright liability. Finally, where data are comprised of arbitrary numbers used as codes, the courts have split. One court of appeals has held that an automobile parts manufacturer owns no copyright in its parts numbers, which are generated by application of a numbering system that the company created. In contrast, another court of appeals has held that the American Dental Association owns a copyright in its codes for dental procedures. On the question of copyright in database structures, a court of appeals found that the structure of a yellow pages directory including listing of Chinese restaurants was entitled to a ‘‘thin’’ copyright, but that copyright was not infringed by a rival database that included 1,500 of the listings because the rival had not copied the plaintiff ’s data structure. Similarly, a different court of appeals acknowledged that although a yellow pages directory was copyrightable as a compilation, a rival did not violate that copyright by copying the name, address, telephone number, business type, and unit of advertisement purchased for each listing in the original publisher’s directory. Finally, a database of real estate tax assessments that arranged the data collected by the assessor into 456 fields grouped into 34 categories was sufficiently original to be copyrightable.
Cross-references
▶ European Law in Databases ▶ Licensing and Contracting Issues in Databases
Recommended Reading
1. American Dental Association v. Delta Dental Plans Ass'n, 126 F.3d 977 (7th Cir. 1997).
2. Assessment Technologies of WI, LLC v. WIREdata, Inc., 350 F.3d 640 (7th Cir. 2003).
3. Bellsouth Advertising & Publishing Corp. v. Donnelly Information Publishing, Inc., 999 F.2d 1436 (11th Cir. 1993) (en banc).
4. CCC Information Services, Inc. v. MacLean Hunter Market Reports, Inc., 44 F.3d 61 (2d Cir. 1994).
5. Feist Publications, Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991).
6. Ginsburg J.C. Copyright, common law, and sui generis protection of databases in the United States and abroad. University of Cincinnati Law Rev., 66:151–176, 1997.
7. Key Publications, Inc. v. Chinatown Today Publishing Enterprises, Inc., 945 F.2d 509 (2d Cir. 1991).
8. New York Mercantile Exchange, Inc. v. IntercontinentalExchange, Inc., 497 F.3d 109 (2d Cir. 2007).
9. Southco, Inc. v. Kanebridge Corp., 390 F.3d 276 (3d Cir. 2004) (en banc).
CORBA
ANIRUDDHA GOKHALE
Vanderbilt University, Nashville, TN, USA
Synonyms
Object request broker; Common object request broker architecture
Definition
The Common Object Request Broker Architecture (CORBA) [2,3] is standardized by the Object Management Group (OMG) for distributed object computing.
Key Points
The CORBA standard specifies a platform-independent and programming language-independent architecture and a set of APIs to simplify distributed application development. The central idea in CORBA is to decouple the interface from the implementation. Applications that provide services declare their interfaces and operations in the Interface Description Language (IDL). IDL compilers read these definitions and synthesize client-side stubs and server-side skeletons, which provide data marshaling and proxy capabilities. CORBA provides both a type-safe RPC-style object communication paradigm called the Static Invocation Interface (SII), and a more dynamic form of communication called the Dynamic Invocation Interface (DII), which allows creation and population of requests dynamically via reflection capabilities. The DII is often used to bridge different object models. CORBA defines a binary format for on-the-wire representation of data called the Common Data Representation (CDR). CDR has been defined to enable programming language neutrality. The CORBA 1.0 specification (October 1991) and subsequent revisions through version 1.2 (December 1993) defined these basic capabilities; however, they lacked any support for interoperability across different CORBA implementations. The CORBA 2.0 specification (August 1996) defined an interoperability protocol called the General Inter-ORB Protocol (GIOP), which defines the packet formats for data exchange between communicating CORBA entities. GIOP is an abstract specification and must be mapped to the underlying transport protocol. The most widely used concrete mapping of GIOP is
called the Internet Inter-ORB Protocol (IIOP) used for data exchange over TCP/IP networks. Despite these improvements, the earlier versions of CORBA focused only on the client-side portability and lacked any support for server-side portability. This limitation was addressed in the CORBA 2.2 specification (August 1996) through the Portable Object Adapter (POA) concept. The POA enables server-side transparency to applications and server-side portability. The POA provides a number of policies that can be used to manage the server-side objects. The CORBA specification defines compliance points for implementations to ensure interoperability. The CORBA specification has also been enhanced with additional capabilities that are available beyond the basic features, such as the Real-time CORBA specification [1]. Implementations of these specifications must provide these additional capabilities. In general, CORBA enhances conventional procedural RPC middleware by supporting object oriented language features (such as encapsulation, interface inheritance, parameterized types, and exception handling) and advanced design patterns for distributed communication. The most recent version of CORBA specification at the time of this writing is 3.3 (January 2008), which also includes support for a component architecture.
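The stub/skeleton split can be illustrated without any ORB at all: the client calls a local proxy that marshals the request, and a server-side skeleton unmarshals it and dispatches to the object implementation. The sketch below is a generic illustration of that proxy-and-marshaling pattern, using JSON in place of CDR and a direct function call in place of IIOP; none of the class or method names below are actual CORBA or IDL-generated APIs.

import json

class Skeleton:
    """Server side: unmarshal a request and dispatch it to the servant."""
    def __init__(self, servant):
        self.servant = servant
    def handle(self, wire_bytes):
        request = json.loads(wire_bytes)                     # unmarshal
        result = getattr(self.servant, request["op"])(*request["args"])
        return json.dumps({"result": result})                # marshal reply

class Stub:
    """Client side: marshal the call and forward it to the skeleton.
    A real ORB would send the bytes over IIOP instead of a local call."""
    def __init__(self, transport):
        self.transport = transport
    def __getattr__(self, op):
        def invoke(*args):
            reply = self.transport(json.dumps({"op": op, "args": list(args)}))
            return json.loads(reply)["result"]
        return invoke

class ThermometerImpl:                        # the object implementation
    def read_celsius(self, sensor_id):
        return {"sensor": sensor_id, "value": 21.5}

skeleton = Skeleton(ThermometerImpl())
remote_thermometer = Stub(skeleton.handle)    # transport is a local call here
print(remote_thermometer.read_celsius(7))     # {'sensor': 7, 'value': 21.5}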
Cross-references
▶ Client-Server Architecture ▶ DCE ▶ DCOM ▶ J2EE ▶ Java RMI ▶ .NET Remoting ▶ Request Broker ▶ SOAP
Recommended Reading 1. 2.
3.
Object Management Group, Real-Time CORBA Specification, Version 1.2, OMG Document No. formal/2005-01-04, January 2005. Object Management Group, Common Object Request Broker Architecture (CORBA), Version 3.1, OMG Document No. formal/2008-01-08, January 2008. Soley R.M. and Stone C.M. Object Management Architecture Guide, 3rd edn., Object Management Group, June 1995.
Corpora ▶ Document Databases
C
501
Corpus ▶ Test Collection
C Correctness Criteria Beyond Serializability M OURAD O UZZANI 1, B RAHIM M EDJAHED 2, A HMED K. E LMAGARMID 1 1 Purdue University, West Lafayette, IN, USA 2 The University of Michigan – Dearborn, Dearborn, MI, USA
Synonyms
Concurrency control; Preserving database consistency
Definition
A transaction is a logical unit of work that includes one or more database access operations such as insertion, deletion, modification, and retrieval [8]. A schedule (or history) S of n transactions T1,...,Tn is an ordering of the transactions that satisfies the following two conditions: (i) the operations of Ti (i = 1,...,n) in S must occur in the same order in which they appear in Ti, and (ii) operations from Tj (j ≠ i) may be interleaved with Ti's operations in S. A schedule S is serial if for every two transactions Ti and Tj that appear in S, either all operations of Ti appear before all operations of Tj, or vice versa. Otherwise, the schedule is called non-serial or concurrent. Non-serial schedules of transactions may lead to concurrency problems such as lost update, dirty read, and unrepeatable read. For instance, the lost update problem occurs whenever two transactions, while attempting to modify a data item, both read the item's old value before either of them writes the item's new value [2]. The simplest way to control concurrency is to allow only serial schedules. However, with no concurrency, database systems may make poor use of their resources and hence be inefficient, resulting, for example, in a lower transaction execution rate. To broaden the class of allowable transaction schedules, serializability has been proposed as the major correctness criterion for concurrency control [7,11]. Serializability ensures that a concurrent schedule of transactions is equivalent to some serial schedule of the same transactions [12]. While serializability has been successfully used in
traditional database applications, e.g., airline reservations and banking, it has been proven to be restrictive and hardly applicable in advanced applications such as Computer-Aided Design (CAD), Computer-Aided Manufacturing (CAM), office automation, and multidatabases. These applications introduced new requirements that either prevent the use of serializability (e.g., violation of local autonomy in multidatabases) or make the use of serializability inefficient (e.g., long-running transactions in CAD/CAM applications). These limitations have motivated the introduction of more flexible correctness criteria that go beyond the traditional serializability.
Historical Background Concurrency control began appearing in database systems in the early to mid 1970s. It emerged as an active database research thrust starting from 1976 as witnessed by the early influential papers published by Eswaren et al. [5] and Gray et al. [7]. A comprehensive coverage of serializability theory has been presented in 1986 by Papadimitriou in [12]. Simply put, serializability theory is a mathematical model for proving whether or not a concurrent execution of transactions is correct. It gives precise definitions and properties that non-serial schedules of transactions must satisfy to be serializable. Equivalence between a concurrent and serial schedule of transactions is at the core of the serializability theory. Two major types of equivalence have then been defined: conflict and view equivalence. If two schedules are conflict equivalent then they are view equivalent. The converse is not generally true. Conflict equivalence has initially been introduced by Gray et al. in 1975 [7]. A concurrent schedule of transactions is conflict equivalent to a serial schedule of the same transactions (and hence conflict serializable) if they order conflicting operations in the same way, i.e., they have the same precedence relations of conflicting operations. Two operations are conflicting if they are from different transactions upon the same data item, and at least one of them is write. If two operations conflict, their execution order matters. For instance, the value returned by a read operation depends on whether or not that operation precedes or follows a particular write operation on the same data item. Conflict serializability is tested by analyzing the acyclicity of the graph derived from the execution of the different transactions in a schedule. This graph, called serializability graph, is a directed graph that
models the precedence of conflicting operations in the transactions. View equivalence has been proposed by Yannakakis in 1984 [15]. A concurrent schedule of transactions is view equivalent to a serial schedule of the same transactions (and hence view serializable) if the respective transactions in the two schedules read and write the same data values. View equivalence is based on the following two observations: (i) if each transaction reads each of its data items from the same writes, then all writes write the same value in both schedules; and (ii) if the final write on each data item is the same in both schedules, then the final value of all data items will be the same in both schedules. View serializability is usually expensive to check. One approach is to check the acyclicity of a special graph called polygraph. A polygraph is a generalization of the precedence graph that takes into account all precedence constraints required by view serializability.
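The conflict serializability test just described can be made concrete: build the serialization graph (one node per transaction, an edge Ti -> Tj for each pair of conflicting operations in which Ti's operation comes first) and check it for cycles. The sketch below is a minimal illustration with a made-up two-transaction schedule; it is not intended as a production concurrency control component.

def conflict_serializable(schedule):
    """schedule: list of (transaction_id, operation, data_item) with
    operation in {'r', 'w'}. Returns True iff the serialization graph
    built from conflicting operation pairs is acyclic."""
    edges = set()
    for i, (ti, op_i, x_i) in enumerate(schedule):
        for tj, op_j, x_j in schedule[i + 1:]:
            if ti != tj and x_i == x_j and 'w' in (op_i, op_j):
                edges.add((ti, tj))            # ti's operation precedes tj's
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    # Cycle detection by depth-first search over the precedence edges.
    visiting, done = set(), set()
    def has_cycle(node):
        visiting.add(node)
        for nxt in graph.get(node, ()):
            if nxt in visiting or (nxt not in done and has_cycle(nxt)):
                return True
        visiting.discard(node)
        done.add(node)
        return False
    return not any(has_cycle(t) for t in graph if t not in done)

# r1(x) w2(x) w1(x): T1 -> T2 (r1 before w2) and T2 -> T1 (w2 before w1), a cycle.
print(conflict_serializable([(1, 'r', 'x'), (2, 'w', 'x'), (1, 'w', 'x')]))  # False
print(conflict_serializable([(1, 'r', 'x'), (1, 'w', 'x'), (2, 'w', 'x')]))  # True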
Foundations The limitations of the traditional serializability concept combined with the requirement of advanced database applications triggered a wave of new correctness criteria that go beyond serializability. These criteria aim at achieving one or several of the following goals: (i) accept non serializable but correct executions by exploiting the semantics of transactions, their structure, and integrity constraints (ii) allow inconsistencies to appear in a controlled manner which may be acceptable for some transactions, (iii) limit conflicts by creating a new version of the data for each update, and (iv) treat transactions accessing more than one database, in the case of multidatabases, differently from those accessing one single database and maintain overall correctness. While a large number of correctness criteria have been presented in the literature, this entry will focus on the major criteria which had a considerable impact on the field. These criteria will be presented as described in their original versions as several of these criteria have been either extended, improved, or applied to specific contexts. Table 1 summarizes the correctness criteria outlined in this section. Multiversion Serializability
Multiversion databases aim at increasing the degree of concurrency and providing a better system recovery. In such databases, whenever a transaction writes a data
Correctness Criteria Beyond Serializability. Table 1. Representative correctness criteria for concurrency control

Correctness criterion | Basic idea | Examples of application domains | Reference
Multiversion serializability | Allows some schedules as serializable if a read is performed on some older version of a data item instead of the newer modified version. | Multiversion database systems | [1]
Semantic consistency | Uses semantic information about transactions to accept some non-serializable but correct schedules. | Applications that can provide some semantic knowledge | [6]
Predicatewise serializability | Focuses on data integrity constraints. | CAD database and office information systems | [9]
Epsilon-serializability | Allows inconsistencies to appear in a controlled manner by attaching a specification of the amount of permitted inconsistency to each transaction. | Applications that tolerate some inconsistencies | [13]
Eventual consistency | Requires that duplicate copies are consistent at certain times but may be inconsistent in the interim intervals. | Distributed databases with replicated or interdependent data | [14]
Quasi serializability | Executes global transactions in a serializable way while taking into account the effect of local transactions. | Multidatabase systems | [4]
Two-level serializability | Ensures consistency by exploiting the nature of integrity constraints and the nature of transactions in multidatabase environments. | Multidatabase systems | [10]
item, it creates a new version of this item instead of overwriting it. The basic idea of multiversion serializability [1] is that some schedules can be still seen as serializable if a read is performed on some older version of a data item instead of the newer modified version. Concurrency is increased by having transactions read older versions while other concurrent transactions are creating newer versions. There is only one type of conflict that is possible; when a transactions reads a version of a data item that was written by another transaction. The two other conflicts (write, write) and (read, write) are not possible since each write produces a new version and a data item cannot be read until it has been produced, respectively. Based on the assumption that users expect their transactions to behave as if there were just one copy of each data item, the notion of one-copy serial schedule is defined. A schedule is one-copy serial if for all i, j, and x, if a transaction Tj reads x from a transaction Ti, then either i = j or Ti is the last transaction preceding tj that writes into any version of x. Hence, a schedule is defined as one-copy serializable (1-SR) if it is equivalent to a 1-serial schedule. 1-SR is shown to maintain correctness by proving that a multiversion schedule behaves like a serial non-multiversion schedule (there is only one version for each data item) if the multiversion schedule is one-serializable. The one-copy
serializability of a schedule can be verified by checking the acyclicity of the multiversion serialization graph of that schedule. Semantic Consistency
Semantic consistency uses semantic information about transactions to accept some non-serializable but correct schedules [6]. To ensure that users see consistent data, the concept of sensitive transactions has been introduced. Sensitive transactions output only consistent data and thus must see a consistent database state. A semantically consistent schedule is one that transforms the database from a consistent state to another consistent state and where all sensitive transactions obtain a consistent view of the database with respect to the data accessed by these transactions, i.e., all data consistency constraints of the accessed data are evaluated to True. Enforcing semantic consistency requires knowledge about the application which must be provided by the user. In particular, users will need to group actions of the transactions into steps and specify which steps of a transaction of a given type can be interleaved with the steps of another type of transactions without violating consistency. Four types of semantic knowledge are defined: (i) transaction semantic types, (ii) compatibility sets associated with each type, (iii) division of transactions into steps,
C
504
C
Correctness Criteria Beyond Serializability
and (iv) counter-steps to (semantically) compensate the effect from some of the steps executed within the transaction. Predicatewise Serializability
Predicatewise serializability (PWSR) has been introduced as a correctness criterion for CAD database and office information systems [9]. PWSR focuses solely on data integrity constraints. In a nutshell, if database consistency constraints can be expressed in a conjunctive normal form, a schedule is said to be PWSR if all projections of that schedule on each group of data items that share a disjunctive clause (of the conjunctive form representing the integrity constraints) are serializable. There are three different types of restrictions that must be enforced on PWSR schedules to preserve database consistency: (i) force the transactions to be of fixed structure, i.e., they are independent of the database state from which they execute, (ii) force the schedules to be delayed read, i.e., a transaction Ti cannot read a data item written by a transaction Tj until after Tj has completed all of its operations, or (iii) the conjuncts of the integrity constraints can be ordered in a way that no transaction reads a data item belonging to a higher numbered conjunct and writes a data item belonging to a lower numbered conjunct.
consisting of both query ETs and update ETs is such that the non serializable conflicts between query ETs and update ETs are less than the permitted limits specified by each ET. An epsilon-serializable schedule is one that is equivalent to an epsilon-serial schedule. If the permitted limits are set to zero, ESR corresponds to the classical notion of serializability. Eventual Consistency
Eventual consistency has been proposed as an alternative correctness criterion for distributed databases with replicated or interdependent data [14]. This criterion is useful is several applications like mobile databases, distributed databases, and large scale distributed systems in general. Eventual consistency requires that duplicate copies are consistent at certain times but may be inconsistent in the interim intervals. The basic idea is that duplicates are allowed to diverge as long as the copies are made consistent periodically. The times where these copies are made consistent can be specified in several ways which could depend on the application, for example, at specified time intervals, when some events occur, or at some specific times. A correctness criterion that ensures eventual consistency is the current copy serializability. Each update occurs on a current copy and is asynchronously propagated to other replicas. Quasi Serializability
Epsilon-Serializability
Epsilon-serializability (ESR) [13] has been introduced as a generalization to serializability where a limited amount of inconsistency is permitted. The goal is to enhance concurrency by allowing some non serializable schedules. ESR introduces the notion of epsilon transactions (ETs) by attaching a specification of the amount of permitted inconsistency to each (standard) transaction. ESR distinguishes between transactions that contain only read operation, called query epsilon transaction or query ET, and transactions with at least one update operation, called update epsilon transaction or update ET. Query ETs may view uncommitted, possibly inconsistent, data being updated by update ETs. Thus, update ETs are seen as exporting some inconsistencies while query ETs are importing these inconsistencies. ESR aims at bounding the amount of imported and exported inconsistency for each ET. An epsilon-serial schedule is defined as a schedule where (i) the update ETs form a serial schedule if considered alone without the query ET and (ii) the entire schedule
Quasi Serializability (QSR) is a correctness criterion that has been introduced for multidatabase systems [4]. A multidatabase system allows users to access data located in multiple autonomous databases. It generally involves two kinds of transactions: (i) Local transactions that access only one database; they are usually outside the control of the multidatabase system, and (ii) global transactions that can access more than one database and are subject to control by both the multidatabase and the local databases. The basic premise is that to preserve global database consistency, global transactions should be executed in a serializable way while taking into account the effect of local transactions. The effect of local transactions appears in the form of indirect conflicts that these local transactions introduce between global transactions which may not necessarily access (conflict) the same data items. A quasi serial schedule is a schedule where global transactions are required to execute serially and local schedules are required to be serializable. This is in contrast to global serializability where all transactions,
Correctness Criteria Beyond Serializability
both local and global, need to execute in a (globally) serializable way. A global schedule is said to be quasi serializable if it is (conflict) equivalent to a quasi serial schedule. Based on this definition, a quasi serializable schedule maintains the consistency of multidatabase systems since (i) a quasi serial schedule preserves the mutual consistency of globally replicated data items, based on the assumptions that these replicated data items are updated only by global transactions, and (ii) a quasi serial schedule preserves the global transaction consistency constraints as local schedules are serializable and global transactions are executed following a schedule that is equivalent to a serial one. Two-Level Serializability
Two-level serializability (2LSR) has been introduced to relax serializability requirements in multidatabases and allow a higher degree of concurrency while ensuring consistency [10]. Consistency is ensured by exploiting the nature of integrity constraints and the nature of transactions in multidatabase environments. A global schedule, consisting of both local and global transactions, is 2LSR if all local schedules are serializable and the projection of that schedule on global transactions is serializable. Local schedules consist of all operations, from global and local transactions, that access the same local database. Ensuring that each local schedule is serializable is already taken care of by the local database. Furthermore, ensuring that the global transactions are executed in a serializable way can be done by the global concurrency controller using any existing technique from centralized databases, like the two-phase locking (2PL) protocol. This is possible since the global transactions are under the full control of the global transaction manager. [10] shows that under different scenarios 2LSR preserves a strong notion of correctness where the multidatabase consistency is preserved and all transactions see consistent data. These scenarios differ depending on (i) which kind of data items, local or global, global and local transactions are reading or writing, (ii) the existence of integrity constraints between local and global data items, and (iii) whether all transactions preserve the consistency of local databases when considered alone.
Key Applications The major database applications behind the need for new correctness criteria include distributed databases, mobile databases, multidatabases, CAD/CAM
applications, office automation, cooperative applications, and software development environments. All of these advanced applications introduced requirements and limitations that either prevent the use of serializability, such as the violation of local autonomy in multidatabases, or make the use of serializability inefficient, such as the blocking of long-running transactions.
Future Directions A recent trend in transaction management focuses on adding transactional properties (e.g., isolation, atomicity) to business processes [3]. A business process (BP) is a set of tasks which are performed collaboratively to realize a business objective. Since BPs contain activities that access shared and persistent data resources, they have to be subject to transactional semantics. However, it is not adequate to treat an entire BP as a single ‘‘traditional’’ transaction, mainly because BPs: (i) are of long duration, and treating an entire process as a transaction would require locking resources for long periods of time, (ii) involve many independent database and application systems, and enforcing transactional properties across the entire process would require expensive coordination among these systems, and (iii) have external effects, and using conventional transactional rollback mechanisms is not feasible. These characteristics open new research issues concerning what a suitable correctness criterion is and how it should be enforced, beyond even the correctness criteria discussed here.
Cross-references
▶ ACID Properties ▶ Concurrency Control ▶ Distributed ▶ Parallel and Networked Databases ▶ System Recovery ▶ Transaction Management ▶ Two-Phase Commit ▶ Two-Phase Locking
Recommended Reading 1. Bernstein P.A. and Goodman N. Multiversion concurrency control – theory and algorithms. ACM Trans. Database Syst., 8(4):465–483, 1983. 2. Bernstein P.A., Hadzilacos V., and Goodman N. Concurrency control and recovery in database systems. Addison-Wesley, Reading, MA, 1987. 3. Dayal U., Hsu M., and Ladin R. Business process coordination: state of the art, trends, and open issues. In Proc. 27th Int. Conf. on Very Large Data Bases, 2001, pp. 3–13.
4. Du W. and Elmagarmid A.K. Quasi serializability: a correctness criterion for global concurrency control in Interbase. In Proc. 15th Int. Conf. on Very Large Data Bases, 1989, pp. 347–355. 5. Eswaran K.P., Gray J., Lorie R.A., and Traiger I.L. The notions of consistency and predicate locks in a database system. Commun. ACM, 19(11):624–633, 1976. 6. Garcia-Molina H. Using semantic knowledge for transaction processing in a distributed database. ACM Trans. Database Syst., 8(2):186–213, 1983. 7. Gray J., Lorie R.A., Putzolu G.R., and Traiger I.L. Granularity of locks in a large shared data base. In Proc. 1st Int. Conf. on Very Large Data Bases, 1975, pp. 428–451. 8. Gray J. and Reuter A. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, Los Altos, CA, 1993. 9. Korth H.F. and Speegle G.D. Formal model of correctness without serializability. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1988, pp. 379–386. 10. Mehrotra S., Rastogi R., Korth H.F., and Silberschatz A. Ensuring consistency in multidatabases by preserving two-level serializability. ACM Trans. Database Syst., 23(2):199–230, 1998. 11. Papadimitriou C.H. The serializability of concurrent database updates. J. ACM, 26(4):631–653, 1979. 12. Papadimitriou C.H. The Theory of Database Concurrency Control. Computer Science Press, Rockville, MD, 1986. 13. Ramamritham K. and Pu C. A formal characterization of epsilon serializability. IEEE Trans. Knowl. Data Eng., 7(6):997–1007, 1995. 14. Sheth A., Leu Y., and Elmagarmid A. Maintaining Consistency of Interdependent Data in Multidatabase Systems. Tech. Rep. CSD-TR-91-016, Purdue University, http://www.cs.toronto.edu/~georgem/ws/ws.ps, 1991. 15. Yannakakis M. Serializability by locking. J. ACM, 31(2):227–244, 1984.
Correctness Criterion for Concurrent Executions ▶ Serializability
Correlated Data Collection ▶ Data Compression in Sensor Networks
Correlation ▶ Similarity and Ranking Operations
Correlation Clustering ▶ Subspace Clustering Techniques
Cost Estimation
Stefan Manegold
CWI, Amsterdam, The Netherlands
Definition Execution costs, or simply costs, is a generic term used to collectively refer to the various goals or objectives of database query optimization. Optimization aims at finding the ‘‘cheapest’’ (‘‘best’’ or at least a ‘‘reasonably good’’) query execution plan (QEP) among semantically equivalent alternative plans for the given query. Cost is used as a metric to compare plans. Depending on the application, different types of costs are considered. Traditional optimization goals include minimizing response time (for the first answer or the complete result), minimizing resource consumption (like CPU time, I/O, network bandwidth, or amount of memory required), or maximizing throughput, i.e., the number of queries that the system can answer per unit of time. Other, less obvious objectives – e.g., in a mobile environment – may be to minimize the power consumption needed to answer the query or the online time spent connected to a remote database server. Obviously, actually executing a QEP just to measure its execution cost does not make sense. Cost estimation refers to the task of predicting the (approximate) costs of a given QEP a priori, i.e., without actually evaluating it. For this purpose, mathematical algorithms or parametric equations, commonly referred to as cost models, provide a simplified, ‘‘idealized’’ abstract description of the system, focusing on the most relevant components. In general, the following three cost components are distinguished.
1. Logical costs consider only the data distributions and the semantics of relational algebra operations to estimate intermediate result sizes of a given (logical) query plan.
2. Algorithmic costs extend logical costs by also taking the computational complexity (expressed in terms of O-classes) of the algorithms into account.
3. Physical costs finally combine algorithmic costs with system/hardware specific parameters to predict the total costs, usually in terms of execution time.
Next to query optimization, cost models can serve another purpose. Especially algorithmic and physical cost models can help database developers to understand
and/or predict the performance of existing algorithms on new hardware systems. Thus, they can improve the algorithms or even design new ones without having to run time- and resource-consuming experiments to evaluate their performance. Since the quality of query optimization strongly depends on the quality of cost estimation, details of cost estimation in commercial database products are usually well-kept secrets of their vendors.
Historical Background Not all aspects of database cost estimation are treated as independent research topics of their own. Mainly selectivity estimation and intermediate result size estimation have received intensive attention, yielding a plethora of techniques proposed in the database literature. Discussion of algorithmic costs usually occurs with the proposal of new or modified database algorithms. Given its tight coupling with query optimization, physical cost estimation has never been an independent research topic of its own. Apart from very few exceptions, new physical cost models and estimation techniques are usually published as ‘‘by-products’’ in publications that mainly deal with novel optimization techniques. The first uses of (implicit) cost estimation were complexity analyses that led to heuristic optimization rules. For instance, a join is always considered cheaper than first calculating the Cartesian product and then applying a selection. Likewise, linear operations that tend to reduce the data stream (selections, projections) should be evaluated as early as data dependencies allow, followed by (potentially) quadratic operations that do not ‘‘blow up’’ the intermediate results (semijoins, foreign-key joins). More complex, and hence expensive, operations (general joins, Cartesian products) should be executed as late as possible. Since a simple complexity metric does not necessarily reflect the same ranking of plans as the actual execution costs, the first explicit cost estimation in database query optimization aimed at estimating intermediate result sizes. Initial works started with simplifications such as assuming uniform data distributions and independence of attribute values. Over time, the techniques have been improved to model non-uniform data distributions. To date, effective treatment of (hidden) correlations is still an open research topic. The next refinement was the introduction of physical costs. With I/O being the dominating cost factor in the early days of database management
systems, the first systems assessed query plans by merely estimating the number of I/O operations required. However, I/O systems exhibit quite different performance for sequential and randomly placed I/O operations. Hence, the models were soon refined to distinguish between sequential and random accesses, weighing them with their respective costs, i.e., the time to execute one operation. With main memory sizes growing, more and more query processing work is done within main memory, minimizing disk accesses. Consequently, CPU and memory access costs can no longer be ignored. Assuming uniform memory access costs, memory access has initially been covered by CPU costs. CPU costs are estimated in terms of CPU cycles. Scoring them with the CPU's clock speed yields time, the common unit used to combine CPU and I/O costs into the overall physical costs. Only recently, with the advent of CPU caches and extended memory hierarchies, has the impact of memory access costs become so significant that it needs to be modeled separately [15,16]. Similarly to I/O costs, memory access costs are estimated in terms of the number of memory accesses (or cache misses) and scored by their penalty to achieve time as the common unit. In parallel and distributed database systems, network communication costs are also considered as contributing factors to the overall execution costs.
Foundations Different query execution plans require different amounts of effort to be evaluated. The objective function for the query optimization problem assigns every execution plan a single non-negative value. This value is commonly referred to as cost in the query optimization business.

Cost Components

Logical Costs/Data Volume
The most important cost component is the amount of data that is to be processed. Per operator, three data volumes are distinguished: input (per operand), output, and temporary data. Data volumes are usually measured as cardinality, i.e., number of tuples. Often, other units such as number of I/O blocks, number of memory pages, or total size in bytes are required. Provided that the respective tuple sizes, page sizes, and block sizes are known, the cardinality can easily be transformed into the other units.
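For instance, converting a cardinality estimate into the other units only requires the tuple and page sizes; a toy calculation in Python (with made-up sizes, purely for illustration) might look as follows:

```python
import math

def data_volume(cardinality, tuple_bytes, page_bytes=8192):
    """Translate a cardinality estimate into total bytes and pages."""
    total_bytes = cardinality * tuple_bytes
    pages = math.ceil(total_bytes / page_bytes)
    return total_bytes, pages

print(data_volume(cardinality=1_000_000, tuple_bytes=120))  # (120000000, 14649)
```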
The amount of input data is given as follows: For the leaf nodes of the query graph, i.e., those operations that directly access base tables stored in the database, the input cardinality is given by the cardinality of the base table(s) accessed. For the remaining (inner) nodes of the query graph, the input cardinality is given by the output cardinality of the predecessor(s) in the query graph. Estimating the output size of database operations – or more generally, their selectivity – is anything but trivial. For this purpose, DBMSs usually maintain statistics about the data stored in the database. Typical statistics are
1. Cardinality of each table,
2. Number of distinct values per column,
3. Highest/lowest value per column (where applicable).
Logical cost functions use these statistics to estimate output sizes (respectively selectivities) of database operations. The simplest approach is to assume that attribute values are uniformly distributed over the attribute's domain. Obviously, this assumption virtually never holds for ‘‘real-life’’ data, and hence, estimations based on this assumption will never be accurate. This is especially severe, as the estimation errors compound exponentially throughout the query plan [9]. This shows that more accurate (but compact) statistics on data distributions (of base tables as well as intermediate results) are required to estimate intermediate result sizes. The importance of statistics management has led to a plethora of approximation techniques, for which [6] has coined the general term ‘‘data synopses’’. Such techniques range from advanced forms of histograms (most notably, V-optimal histograms including multidimensional variants) [7,10] over spline synopses [11,12], sampling [3,8], and parametric curve-fitting techniques [4,20] all the way to highly sophisticated methods based on kernel estimators [1] or Wavelets and other transforms [2,17]. A logical cost model is a prerequisite for the following two cost components.
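As an illustration of how such statistics feed logical cost estimation, the following sketch applies the classical uniformity and independence assumptions (equality selectivity 1/V(A), range selectivity proportional to the covered fraction of the domain, conjunctions multiplied). It is a textbook-style simplification with invented numbers, not the estimator of any particular system.

```python
def eq_selectivity(distinct_values):
    """Selectivity of 'A = const' assuming a uniform value distribution."""
    return 1.0 / distinct_values

def range_selectivity(low, high, min_val, max_val):
    """Selectivity of 'low <= A <= high' assuming a uniform domain."""
    return max(0.0, min(high, max_val) - max(low, min_val)) / (max_val - min_val)

def conjunct_selectivity(*selectivities):
    """Combine predicates under the independence assumption."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result

# Example: SELECT ... FROM orders WHERE status = 'open' AND price BETWEEN 10 AND 20
card_orders = 1_000_000
sel = conjunct_selectivity(eq_selectivity(5), range_selectivity(10, 20, 0, 100))
print(card_orders * sel)   # estimated output cardinality: 20000.0
```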
Algorithmic Costs/Complexity

Logical costs only depend on the data and the query (i.e., the operators' semantics), but they do not consider the algorithms used to implement the operators' functionality. Algorithmic costs extend logical costs by taking the properties of the algorithms into account. A first criterion is the algorithm's complexity in the classical sense of complexity theory. Most unary operators are in O(n), like selections, or O(n log n), like sorting; n being the input cardinality. With proper support by access structures like indices or hash tables, the complexity of selection may drop to O(log n) or O(1), respectively. Binary operators can be in O(n), like a union of sets that does not eliminate duplicates, or, more often, in O(n²), as for instance join operators. More detailed algorithmic cost functions are used to estimate, e.g., the number of I/O operations or the amount of main memory required. Though these functions require some so-called ‘‘physical’’ information like I/O block sizes or memory page sizes, they are still considered algorithmic costs and not physical costs, as this information is system specific, but not hardware specific. The standard database literature provides a large variety of cost formulas for the most frequently used operators and their algorithms. Usually, these formulas calculate the costs in terms of I/O operations, as this still is the most common objective function for query optimization in database systems [5,13].

Physical Costs/Execution Time

Logical and algorithmic costs alone are not sufficient to do query optimization. For example, consider two algorithms for the same operation, where the first algorithm requires slightly more I/O operations than the second, while the second requires significantly more CPU operations than the first one. Looking only at algorithmic costs, the two algorithms are not comparable. Even assuming that I/O operations are more expensive than CPU operations cannot in general answer the question which algorithm is faster. The actual execution time of both algorithms depends on the speed of the underlying hardware. The physical cost model combines the algorithmic cost model with an abstract hardware description to derive the different cost factors in terms of time, and hence the total execution time. A hardware description usually consists of information such as CPU speed, I/O latency, I/O bandwidth, and network bandwidth. The next section discusses physical cost factors in more detail.
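A hedged sketch of how algorithmic I/O and CPU counts can be combined with an abstract hardware description to obtain physical costs in terms of time. The formulas (one seek per random block, a single seek for a sequential run, cycles divided by clock speed) are generic textbook-style approximations, and all parameter values are assumed for illustration rather than measured on any particular system.

```python
# Illustrative hardware profile (assumed values, not measurements).
SEEK_S = 0.004            # average seek + rotational latency per random I/O, in seconds
BANDWIDTH_BPS = 200e6     # sequential transfer rate in bytes per second
BLOCK_BYTES = 8192
CPU_HZ = 3e9

def io_time(num_blocks, sequential=True):
    """Physical I/O time: one seek for a sequential run, one seek per block
    for random access, plus the pure transfer time."""
    transfer = num_blocks * BLOCK_BYTES / BANDWIDTH_BPS
    seeks = 1 if sequential else num_blocks
    return seeks * SEEK_S + transfer

def cpu_time(cycles):
    return cycles / CPU_HZ

# Algorithmic costs of a toy scan-based selection over 10,000 blocks,
# assuming 500 CPU cycles per tuple for predicate evaluation.
blocks, tuples = 10_000, 1_000_000
total = io_time(blocks, sequential=True) + cpu_time(500 * tuples)
print(f"estimated execution time: {total:.3f} s")
```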
Cost Factors
In principle, physical costs are considered to occur in two flavors, temporal and spatial. Temporal costs cover all cost factors that can easily be related to execution time, e.g., by multiplying the number of certain events with their respective cost in terms of some time unit. Spatial costs contain resource consumptions that cannot directly (or not at all) be related to time. The
following briefly describes the most prominent cost factors of both categories. Temporal Cost Factors
Disk-I/O This is the cost of searching for, reading, and writing data blocks that reside on secondary storage, mainly on disk. In addition to accessing the database files themselves, temporary intermediate files that are too large to fit in main memory buffers and hence are stored on disk also need to be accessed. The cost of searching for records in a database file or a temporary file depends on the type of access structures on that file, such as ordering, hashing, and primary or secondary indexes. I/O costs are either simply measured in terms of the number of block-I/O operations, or in terms of the time required to perform these operations. In the latter case, the number of block-I/O operations is multiplied by the time it takes to perform a single block-I/O operation. The time to perform a single block-I/O operation is made up of an initial seek time (I/O latency) and the time to actually transfer the data block (i.e., block size divided by I/O bandwidth). Factors such as whether the file blocks are allocated contiguously on the same disk cylinder or scattered across the disk affect the access cost. In the first case (also called sequential I/O), I/O latency has to be counted only for the first of a sequence of subsequent I/O operations. In the second case (random I/O), seek time has to be counted for each I/O operation, as the disk heads have to be repositioned each time. Main-Memory Access These are the costs for reading data from or writing data to main memory. Such data may be intermediate results or any other temporary data produced/used while performing database operations. Similar to I/O costs, memory access costs can be modeled by estimating the number of memory accesses (i.e., cache misses) and scoring them with their respective penalty (latency) [16]. Network Communication In centralized DBMSs, communication costs cover the costs of shipping the query from the client to the server and the query's result back to the client. In distributed, federated, and parallel DBMSs, communication costs additionally contain all costs for shipping (sub-)queries and/or (intermediate) results between the different hosts that are involved in evaluating the query. Also with communication costs, there is a latency component, i.e., a delay to initiate a network connection and package transfer, and a bandwidth
component, i.e., the amount of data that can be transferred through the network infrastructure per unit of time. CPU Processing This is the cost of performing operations such as computations on attribute values, evaluating predicates, searching and sorting tuples, and merging tuples for joins. CPU costs are measured in either CPU cycles or time. When using CPU cycles, the time may be calculated by simply dividing the number of cycles by the CPU's clock speed. While allowing limited portability between CPUs of the same kind, but with different clock speeds, portability to different types of CPUs is usually not given. The reason is that the same basic operations, like adding two integers, might require different amounts of CPU cycles on different types of CPUs.

Spatial Cost Factors
Usually, there is only one spatial cost factor considered in the database literature: memory size. This cost is the amount of main memory required to store intermediate results or any other temporary data produced/used while performing database operations. Next to not (directly) being related to execution time, there is another difference between temporal and spatial costs that stems from the way they share the respective resources. A simple example shall demonstrate the differences. Consider two operations or processes, each of which consumes 50% of the available resources (i.e., CPU power as well as I/O, memory, and network bandwidth). Further, assume that when run one at a time, both tasks have equal execution time. Running both tasks concurrently on the same system (ideally) results in the same execution time, now consuming all the available resources. In case each individual process consumes 100% of the available resources, the concurrent execution time will be twice the individual execution time. In other words, if the combined resource consumption of concurrent tasks exceeds 100%, the execution time extends to accommodate the excess resource requirements. With spatial cost factors, however, such ‘‘stretching’’ is not possible. In case two tasks together would require more than 100% of the available memory, they simply cannot be executed at the same time, but only one after the other.

Types of (Cost) Models
According to their degree of abstraction, (cost) models can be classified into two classes: analytical models and simulation models. Analytical Models In some cases, the assumptions made about the real system can be translated into
mathematical descriptions of the system under study. Hence, the result is a set of mathematical formulas that is called an analytical model. The advantage of an analytical model is that evaluation is rather easy and hence fast. However, analytical models are usually not very detailed (and hence not very accurate). In order to translate them into a mathematical description, the assumptions made have to be rather general, yielding a rather high degree of abstraction. Simulation Models Simulation models provide a very detailed and hence rather accurate description of the system. They describe the system in terms of (a) simulation experiment(s) (e.g., using event simulation). The high degree of accuracy is paid for at the expense of evaluation performance. It usually takes relatively long to evaluate a simulation-based model, i.e., to actually perform the simulation experiment(s). It is not uncommon that the simulation actually takes longer than the execution in the real system would take. In database query optimization, though the accuracy would be appreciated, simulation models are not feasible, as the evaluation effort is far too high. Query optimization requires that the costs of numerous alternatives are evaluated and compared as fast as possible. Hence, only analytical cost models are applicable in this scenario.

Architecture and Evaluation of Database Cost Models
The architecture and evaluation mechanism of database cost models is tightly coupled to the structure of query execution plans. Due to the strong encapsulation offered by relational algebra operators, the cost of each operator, respectively each algorithm, can be described individually. For this purpose, each algorithm is assigned a set of cost functions that calculate the three cost components as described above. Obviously, the physical cost functions depend on the algorithmic cost functions, which in turn depend on the logical cost functions. Algorithmic cost functions use the data volume estimations of the logical cost functions as input parameters. Physical cost functions are usually specializations of algorithmic cost functions that are parametrized by the hardware characteristics. The cost model also defines how the single operator costs within a query have to be combined to calculate the total costs of the query. In traditional sequential DBMSs, the single operators are assumed to have no performance side-effects on each other. Thus, the
cost of a QEP is the cumulative cost of the operators in the QEP [18]. Since every operator in the QEP is the root of a sub-plan, its cost includes the cost of its input operators. Hence, the cost of a QEP is the cost of the topmost operator in the QEP. Likewise, the cardinality of an operator is derived from the cardinalities of its inputs, and the cardinality of the topmost operator represents the cardinality of the query result. In non-sequential (e.g., distributed or parallel) DBMSs, this subject is much more complicated, as more issues such as scheduling, concurrency, resource contention, and data dependencies have to be considered. For instance, in such environments, more than one operator may be executed at a time, either on disjoint (hardware) resources, or (partly) sharing resources. In the first case, the total cost (in terms of time) is calculated as the maximum of the costs (execution times) of all operators running concurrently. In the second case, the operators compete for the same resources, and hence mutually influence their performance and costs. More sophisticated cost functions and cost models are required here to adequately model this resource contention [14,19].
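The following sketch mirrors this bottom-up evaluation for a traditional sequential DBMS: each operator node carries its own cost, and the cost of the plan is obtained at the topmost operator as the cumulative cost of the plan's operators. The node structure and the numbers are invented purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlanNode:
    name: str
    own_cost: float                      # cost of this operator alone
    children: List["PlanNode"] = field(default_factory=list)

    def total_cost(self) -> float:
        # Sequential execution: cumulative cost of the operator and its inputs.
        return self.own_cost + sum(c.total_cost() for c in self.children)

scan_a = PlanNode("scan(A)", 0.41)
scan_b = PlanNode("scan(B)", 0.12)
join = PlanNode("hash_join", 0.25, [scan_a, scan_b])
print(join.total_cost())   # about 0.78: cost of the topmost operator = cost of the plan
```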
Cross-references
▶ Distributed Query Optimization ▶ Multi-Query Optimization ▶ Optimization and Tuning in Data Warehouses ▶ Parallel Query Optimization ▶ Process Optimization ▶ Query Optimization ▶ Query Optimization (in Relational Databases) ▶ Query Optimization in Sensor Networks ▶ Query Plan ▶ Selectivity Estimation ▶ Spatio-Temporal Selectivity Estimation ▶ XML Selectivity Estimation
Recommended Reading
1. Blohsfeld B., Korus D., and Seeger B. A comparison of selectivity estimators for range queries on metric attributes. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1999, pp. 239–250.
2. Chakrabarti K., Garofalakis M.N., Rastogi R., and Shim K. Approximate query processing using wavelets. In Proc. 26th Int. Conf. on Very Large Data Bases, 2000, pp. 111–122.
3. Chaudhuri S., Motwani R., and Narasayya V.R. On random sampling over joins. In Proc. ACM SIGMOD Int. Conf. on Management of Data, Philadelphia, PA, USA, June 1999, pp. 263–274.
4. Chen C.M. and Roussopoulos N. Adaptive selectivity estimation using query feedback. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1994, pp. 161–172.
5. Garcia-Molina H., Ullman J.D., and Widom J. Database Systems: The Complete Book. Prentice Hall, Englewood Cliffs, NJ, USA, 2002.
6. Gibbons P.B. and Matias Y. Synopsis data structures for massive data sets. In Proc. 10th Annual ACM-SIAM Symp. on Discrete Algorithms, 1999, pp. 909–910.
7. Gibbons P.B., Matias Y., and Poosala V. Fast incremental maintenance of approximate histograms. In Proc. 23rd Int. Conf. on Very Large Data Bases, 1997, pp. 466–475.
8. Haas P.J., Naughton J.F., Seshadri S., and Swami A.N. Selectivity and cost estimation for joins based on random sampling. J. Comput. Syst. Sci., 52(3):550–569, 1996.
9. Ioannidis Y.E. and Christodoulakis S. On the propagation of errors in the size of join results. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1991, pp. 268–277.
10. Ioannidis Y.E. and Poosala V. Histogram-based approximation of set-valued query-answers. In Proc. 25th Int. Conf. on Very Large Data Bases, 1999, pp. 174–185.
11. König A.C. and Weikum G. Combining histograms and parametric curve fitting for feedback-driven query result-size estimation. In Proc. 25th Int. Conf. on Very Large Data Bases, 1999, pp. 423–434.
12. König A.C. and Weikum G. Auto-tuned spline synopses for database statistics management. In Proc. Int. Conf. on Management of Data, 2000.
13. Korth H. and Silberschatz A. Database Systems Concepts. McGraw-Hill, Inc., New York, San Francisco, Washington, DC, USA, 1991.
14. Lu H., Tan K.L., and Shan M.C. Hash-based join algorithms for multiprocessor computers. In Proc. 16th Int. Conf. on Very Large Data Bases, 1990, pp. 198–209.
15. Manegold S. Understanding, Modeling, and Improving Main-Memory Database Performance. PhD Thesis, Universiteit van Amsterdam, Amsterdam, The Netherlands, December 2002.
16. Manegold S., Boncz P.A., and Kersten M.L. Generic database cost models for hierarchical memory systems. In Proc. 28th Int. Conf. on Very Large Data Bases, 2002, pp. 191–202.
17. Matias Y., Vitter J.S., and Wang M. Wavelet-based histograms for selectivity estimation. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1998, pp. 448–459.
18. Selinger P.G., Astrahan M.M., Chamberlin D.D., Lorie R.A., and Price T.G. Access path selection in a relational database management system. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1979, pp. 23–34.
19. Spiliopoulou M. and Freytag J.-C. Modelling resource utilization in pipelined query execution. In Proc. European Conference on Parallel Processing, 1996, pp. 872–880.
20. Sun W., Ling Y., Rishe N., and Deng Y. An instant and accurate size estimation method for joins and selection in a retrieval-intensive environment. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1993, pp. 79–88.
Count-Min Sketch
Graham Cormode
AT&T Labs–Research, Florham Park, NJ, USA
Synonyms CM Sketch
Definition The Count-Min (CM) Sketch is a compact summary data structure capable of representing a high-dimensional vector and answering queries on this vector, in particular point queries and dot product queries, with strong accuracy guarantees. Such queries are at the core of many computations, so the structure can be used in order to answer a variety of other queries, such as frequent items (heavy hitters), quantile finding, join size estimation, and more. Since the data structure can easily process updates in the form of additions or subtractions to dimensions of the vector (which may correspond to insertions or deletions, or other transactions), it is capable of working over streams of updates, at high rates. The data structure maintains the linear projection of the vector with a number of other random vectors. These vectors are defined implicitly by simple hash functions. Increasing the range of the hash functions increases the accuracy of the summary, and increasing the number of hash functions decreases the probability of a bad estimate. These tradeoffs are quantified precisely below. Because of this linearity, CM sketches can be scaled, added and subtracted, to produce summaries of the corresponding scaled and combined vectors.
Historical Background The Count-Min sketch was first proposed in 2003 [5] as an alternative to several other sketch techniques, such as the Count sketch [3] and the AMS sketch [1]. The goal was to provide a simple sketch data structure with a precise characterization of the dependence on the input parameters. The sketch has also been viewed as a realization of a counting Bloom filter or Multistage Filter [8], which requires only limited-independence randomness to show strong, provable guarantees. The simplicity of creating and probing the sketch has led to its wide use in disparate areas since its initial description.
Foundations The CM sketch is simply an array of counters of width w and depth d, CM[1, 1] ... CM[d, w]. Each entry of the array is initially zero. Additionally, d hash functions h_1, ..., h_d : {1, ..., n} → {1, ..., w} are chosen uniformly at random from a pairwise-independent family. Once w and d are chosen, the space required is fixed: the data structure is represented by wd counters and d hash functions (which can each be represented in O(1) machine words [14]).

Update Procedure
Consider a vector a, which is presented in an implicit, incremental fashion (this abstract model captures a wide variety of data stream settings, see entries on Data Stream Models for more details). This vector has dimension n, and its current state at time t is a(t) = [a_1(t), ..., a_i(t), ..., a_n(t)]. Initially, a(0) is the zero vector, so a_i(0) = 0 for all i. Updates to individual entries of the vector are presented as a stream of pairs. The t-th update is (i_t, c_t), meaning that

a_{i_t}(t) = a_{i_t}(t − 1) + c_t, and a_{i′}(t) = a_{i′}(t − 1) for all i′ ≠ i_t.

This procedure is illustrated in Fig. 1. In the remainder of this article, t is dropped, and the current state of the vector is referred to as just a for convenience. It is assumed throughout that although values of a_i increase and decrease with updates, each a_i ≥ 0. The Count-Min sketch also applies to the case where the a_i can be less than zero, with small factor increases in space. Here, details of these extensions are omitted for simplicity of exposition (full details are in [5]).
Count-Min Sketch. Figure 1. Each item i is mapped to one cell in each row of the array of counts: when an update of c_t to item i_t arrives, c_t is added to each of these cells.
counter is determined by h_j. Formally, given (i_t, c_t), the following modifications are performed:

for all 1 ≤ j ≤ d: CM[j, h_j(i_t)] ← CM[j, h_j(i_t)] + c_t.
Because computing each hash function takes O(1) (constant) time, the total time to perform an update is O(d), independent of w. Since d is typically small in practice (often less than 10), updates can be processed at high speed.

Point Queries
A point query is to estimate the value of an entry in the vector, a_i. The point query procedure is similar to updates: given a query point i, an estimate is found as â_i = min_{1 ≤ j ≤ d} CM[j, h_j(i)]. Since the space used by the sketch is typically much smaller than that required to represent the vector exactly, there is necessarily some approximation in the estimate, which is quantified as follows:

Theorem 1 (Theorem 1 from [5]). If w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉, the estimate â_i has the following guarantees: a_i ≤ â_i; and, with probability at least 1 − δ, â_i ≤ a_i + ε ||a||_1.

The proof follows by considering the estimate in each row, and observing that the error of using CM[j, h_j(i)] as an estimate is non-negative and has expected value ||a||_1 / w. By the Markov inequality [14], the probability that this error exceeds ε ||a||_1 is at most 1/e (where e is the base of the natural logarithm, i.e., 2.71828..., a constant chosen to optimize the space for fixed accuracy requirements). Taking the smallest estimate gives the best estimator, and the probability that this estimate has error exceeding ε ||a||_1 is the probability that all estimates exceed this error, i.e., e^(−d) ≤ δ.

This analysis makes no assumption about the distribution of values in a. However, in many applications there are Zipfian, or power law, distributions of item frequencies. Here, the (relative) frequency of the i-th most frequent item is proportional to i^(−z), for some parameter z, where z is typically in the range 1–3 (z = 0 gives a perfectly uniform distribution). In such cases, the skew in the distribution can be used to show a stronger space/accuracy tradeoff:

Theorem 2 (Theorem 5.1 from [7]). For a Zipf distribution with parameter z, the space required to answer point queries with error ε ||a||_1 with probability at least 1 − δ is given by O(ε^(−min{1, 1/z}) ln(1/δ)).

Moreover, the dependency of the space on z is optimal:

Theorem 3 (Theorem 5.2 from [7]). The space required to answer point queries correctly with any constant probability and error at most ε ||a||_1 is Ω(ε^(−1)) over general distributions, and Ω(ε^(−1/z)) for Zipf distributions with parameter z, assuming the dimension of a, n, is Ω(ε^(−min{1, 1/z})).
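To make the update and point-query procedures concrete, the following is a minimal Python sketch of the structure described above. The parameter choices w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉ follow Theorem 1; seeding Python's built-in hash per row is a simplification standing in for the pairwise-independent hash functions assumed by the analysis, so the code illustrates the mechanics rather than reproducing the formal guarantees.

```python
import math
import random

class CountMinSketch:
    """Count-Min sketch supporting additive updates and point queries.

    Simplified illustration: the per-row hash functions are built from
    Python's hash() with random seeds, which only approximates the
    pairwise-independent family required by the formal analysis.
    """

    def __init__(self, epsilon, delta):
        self.w = math.ceil(math.e / epsilon)        # width  w = ceil(e / epsilon)
        self.d = math.ceil(math.log(1.0 / delta))   # depth  d = ceil(ln(1 / delta))
        self.counts = [[0] * self.w for _ in range(self.d)]
        self.seeds = [random.randrange(2**31) for _ in range(self.d)]

    def _bucket(self, j, item):
        # Row-j hash function h_j: item -> {0, ..., w-1}
        return hash((self.seeds[j], item)) % self.w

    def update(self, item, count=1):
        # Add `count` to one counter in each of the d rows.
        for j in range(self.d):
            self.counts[j][self._bucket(j, item)] += count

    def point_query(self, item):
        # Minimum over the d counters touched by `item`; never underestimates
        # (for non-negative updates), overestimates by at most eps*||a||_1 w.h.p.
        return min(self.counts[j][self._bucket(j, item)] for j in range(self.d))

# Example: estimate item frequencies in a small stream.
cms = CountMinSketch(epsilon=0.01, delta=0.01)
for x in ["a", "b", "a", "c", "a", "b"]:
    cms.update(x)
print(cms.point_query("a"))   # >= 3, and equal to 3 with high probability
```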
Range, Heavy Hitter and Quantile Queries

A range query is to estimate ∑_{i=l}^{r} a_i for a range [l...r]. For small ranges, the range sum can be estimated as a sum of point queries; however, as the range grows, the error in this approach also grows linearly. Instead, log n sketches can be kept, each of which summarizes a derived vector a^k, where

a^k[j] = ∑_{i = j·2^k}^{(j+1)·2^k − 1} a_i

for k = 1...log n. A range of the form j·2^k ... (j+1)·2^k − 1 is called a dyadic range, and any arbitrary range [l...r] can be partitioned into at most 2 log n dyadic ranges. With appropriate rescaling of accuracy bounds, it follows that:

Theorem 4 (Theorem 4 from [5]). Count-Min sketches can be used to find an estimate r̂ for a range query on l...r such that

r̂ − ε ||a||_1 ≤ ∑_{i=l}^{r} a_i ≤ r̂.

The right inequality holds with certainty, and the left inequality holds with probability at least 1 − δ. The total space required is O((log² n / ε) log(1/δ)).

Closely related to the range query is the φ-quantile query, which is to find a point j such that

∑_{i=1}^{j} a_i ≤ φ ||a||_1 ≤ ∑_{i=1}^{j+1} a_i.

A natural approach is to use range queries to binary search for a j which satisfies this requirement approximately (i.e., tolerates up to ε ||a||_1 error in the above expression) given φ. In order to give the desired guarantees, the error bounds need to be adjusted to account for the number of queries that will be made.

Theorem 5 (Theorem 5 from [5]). ε-approximate φ-quantiles can be found with probability at least 1 − δ by keeping a data structure with space O((1/ε) log²(n) log(log(n)/δ)). The time for each insert or delete operation is O(log(n) log(log(n)/δ)), and the time to find each quantile on demand is O(log(n) log(log(n)/δ)).

Heavy Hitters are those points i such that a_i ≥ φ ||a||_1 for some specified φ. The range query primitive based on Count-Min sketches can again be used to find heavy hitters, by recursively splitting dyadic ranges into two and querying each half to see if the range is still heavy, until a range of a single, heavy, item is found. Formally,

Theorem 6 (Theorem 6 from [5]). Using space O((1/ε) log(n) log(2 log(n)/(δφ))), and time O(log(n) log(2 log(n)/(δφ))) per update, a set of approximate heavy hitters can be output so that every item with frequency at least (φ + ε) ||a||_1 is output, and with probability at least 1 − δ no item whose frequency is less than φ ||a||_1 is output.

For skewed Zipfian distributions, as described above, with parameter z > 1, it is shown more strongly that the top-k most frequent items can be found with relative error ε using space only Õ(k/ε) [7].

Inner Product Queries

The Count-Min sketch can also be used to estimate the inner product between two vectors; in database terms, this captures the (equi)join size between relations. The inner product a · b can be estimated by treating the Count-Min sketch as a collection of d vectors of length w, and finding the minimum inner product between corresponding rows of sketches of the two vectors. With probability 1 − δ, this estimate is at most an additive quantity ε ||a||_1 ||b||_1 above the true value of a · b. This is to be compared with AMS sketches, which guarantee ε ||a||_2 ||b||_2 additive error, but require space proportional to 1/ε² to make this guarantee.
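Both the dyadic-range decomposition and the inner-product estimate can be expressed compactly on top of the hypothetical CountMinSketch class sketched earlier; the fragment below is illustrative only. It assumes items are integers from {0, ..., n−1} and that one sketch has been kept per dyadic level, i.e., sketches[k] received every update as sketches[k].update(i >> k, c).

```python
def range_query(sketches, l, r, level=0):
    """Estimate sum(a[l..r]) from one CountMinSketch per dyadic level.

    `sketches` must cover ceil(log2(n)) + 1 levels; sketches[0] summarizes
    the items themselves. The range is decomposed greedily into at most
    2*log(n) dyadic blocks, each answered by a point query.
    """
    if l > r:
        return 0
    if l % 2 == 1:   # left endpoint is a right child: take its block alone
        return sketches[level].point_query(l) + range_query(sketches, l + 1, r, level)
    if r % 2 == 0:   # right endpoint is a left child: take its block alone
        return sketches[level].point_query(r) + range_query(sketches, l, r - 1, level)
    # Both endpoints aligned: recurse one dyadic level up with halved indices.
    return range_query(sketches, l // 2, r // 2, level + 1)

def inner_product_estimate(cms_a, cms_b):
    """Estimate a . b from two sketches built with identical w, d and seeds:
    the minimum over rows of the row-wise inner products."""
    return min(
        sum(x * y for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(cms_a.counts, cms_b.counts)
    )
```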
Interpretation as Random Linear Projection
The sketch can also be interpreted as a collection of inner products between a vector representing the input and a collection of random vectors defined by the hash functions. Let a denote the vector representing the input, so that a[i] is the sum of the updates to the i-th location in the input. Let r_{j,k} be the binary vector such
that r_{j,k}[i] = 1 if and only if h_j(i) = k. Then it follows that CM[j, k] = a · r_{j,k}. Because of this linearity, it follows immediately that if sketches of two vectors, a and b, are built then (i) the sketch of a + b (using the same w, d, h_j) is the (componentwise) sum of the sketches, and (ii) the sketch of λa for any scalar λ is λ times the sketch of a. In other words, the sketch of any linear combination of vectors can be found. This property is useful in many applications which use sketches. For example, it allows distributed measurements to be taken, sketched, and combined by only sending sketches instead of the whole data.
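As a small illustration of this linearity, two sketches built with the same width, depth, and hash seeds can be combined cell by cell; the helper below extends the hypothetical CountMinSketch class used in the earlier sketches.

```python
import copy

def merge(cms_a, cms_b, scale_b=1):
    """Return a sketch of (a + scale_b * b); assumes identical w, d and hash seeds."""
    assert cms_a.w == cms_b.w and cms_a.d == cms_b.d and cms_a.seeds == cms_b.seeds
    merged = copy.deepcopy(cms_a)
    merged.counts = [
        [x + scale_b * y for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(cms_a.counts, cms_b.counts)
    ]
    return merged
```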
Conservative Update

If only positive updates arrive, then an alternate update methodology may be applied, known as conservative update (due to Estan and Varghese [8]). For an update (i, c), â_i is computed, and the counts are modified according to

for all 1 ≤ j ≤ d: CM[j, h_j(i)] ← max(CM[j, h_j(i)], â_i + c).

It can be verified that this procedure still ensures for point queries that a_i ≤ â_i, and that the error is no worse than in the normal update procedure; it is remarked that this can improve accuracy ‘‘up to an order of magnitude’’ [8]. Note however that deletions or negative updates can no longer be processed, and the additional processing that must be performed for each update could effectively halve the throughput.
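A possible rendering of conservative update on top of the same hypothetical class (applicable only while all updates are non-negative, as noted above):

```python
def conservative_update(cms, item, count=1):
    """Conservative-update variant: raise each counter only as far as necessary."""
    target = cms.point_query(item) + count    # current estimate plus the new count
    for j in range(cms.d):
        cell = cms._bucket(j, item)
        cms.counts[j][cell] = max(cms.counts[j][cell], target)
```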
Key Applications Since its description and initial analysis, the Count-Min Sketch has been applied in a wide variety of situations. Here is a list of some of the ways in which it has been used or modified. Lee et al. [13] propose using least-squares optimization to produce estimates from Count-Min Sketches for point queries (instead of returning the minimum of the locations to which the item was mapped). It was shown that this approach can give significantly improved estimates, although at the cost of solving a convex optimization problem over n variables (where n is the size of the domain from which items are drawn, typically 2^32 or higher). The ‘‘skipping’’ technique, proposed by Bhattacharrya et al. [2], entails avoiding adding items to the sketch (and saving the cost of the hash function computations) when this will not affect the
accuracy too much, in order to further increase throughput in high-demand settings. Indyk [9] uses the Count-Min Sketch to estimate the residual mass after removing a subset of items. That is, given a (small) set of indices I, to estimate ∑_{i∉I} a_i. This is needed in order to find clusterings of streaming data. The entropy of a data stream is a function of the relative frequencies of each item or character within the stream. Using Count-Min Sketches within a larger data structure based on additional hashing techniques, Lakshminath and Ganguly [12] showed how to estimate this entropy to within relative error. Sarlós et al. [17] gave approximate algorithms for personalized page rank computations which make use of Count-Min Sketches to compactly represent web-size graphs. In describing a system for building selectivity estimates for complex queries, Spiegel and Polyzotis [18] use Count-Min Sketches in order to allow clustering over a high-dimensional space. Rusu and Dobra [16] study a variety of sketches for the problem of inner-product estimation, and conclude that the Count-Min sketch has a tendency to outperform its theoretical worst-case bounds by a considerable margin, and gives better results than some other sketches for this problem. Many applications call for tracking distinct counts: that is, a_i should represent the number of distinct updates to position i. This can be achieved by replacing the counters in the Count-Min sketch with approximate Count-Distinct summaries, such as the Flajolet-Martin sketch. This is described and evaluated in [6,10]. Privacy preserving computations ensure that multiple parties can cooperate to compute a function of their data while only learning the answer and not anything about the inputs of the other participants. Roughan and Zhang demonstrate that the Count-Min Sketch can be used within such computations, by applying standard techniques for computing privacy preserving sums on each counter independently [15].
Related ideas to the Count-Min Sketch have also been combined with group testing to solve problems in the realm of Compressed Sensing, and finding significant changes in dynamic streams.
Future Directions
As is clear from the range and variety of applications described above, the Count-Min sketch is a versatile data structure which is finding applications within Data Stream systems, but also in Sensor Networks, Matrix Algorithms, Computational Geometry and Privacy-Preserving Computations. It is helpful to think of the structure as a basic primitive which can be applied wherever approximate entries from high dimensional vectors or multisets are required, and one-sided error proportional to a small fraction of the total mass can be tolerated (just as a Bloom filter should be considered in order to represent a set wherever a list or set is used and space is at a premium). With this in mind, further applications of this synopsis can be expected to be seen in more settings. As noted below, sample implementations are freely available in a variety of languages, and integration into standard libraries will further widen the availability of the structure. Further, since many of the applications are within high-speed data stream monitoring, it is natural to look to hardware implementations of the sketch. In particular, it will be of interest to understand how modern multi-core architectures can take advantage of the natural parallelism inherent in the Count-Min Sketch (since each of the d rows is essentially independent), and to explore the implementation choices that follow.

URL To Code

Several example implementations of the Count-Min sketch are available. C code is given by the MassDal code bank: http://www.cs.rutgers.edu/~muthu/massdal-code-index.html. C++ code due to Marios Hadjieleftheriou is available from http://research.att.com/~marioh/sketches/index.html.
Experimental Results Experiments performed in [7] analyzed the error for point queries and F2 (self-join size) estimation, in comparison to other sketches. High accuracy was observed for both queries, for sketches ranging from a few kilobytes to a megabyte in size. The typical parameters of the sketch were a depth d of 5, and a width w of a few hundred to thousands. Implementations on desktop machines achieved between two and three million updates per second. Other implementations have incorporated the Count-Min Sketch into high speed streaming systems such as Gigascope [4], and tuned it to process packet streams at multi-gigabit speeds. Lai and Byrd report on an implementation of Count-Min sketches on a low-power stream processor [11], capable of processing 40 byte packets at a throughput rate of up to 13 Gbps. This is equivalent to about 44 million updates per second.
Cross-references
▶ AMS Sketch ▶ Data sketch/synopsis ▶ FM Synopsis ▶ Frequent items on streams ▶ Quantiles on streams
Recommended Reading 1. Alon N., Matias Y., and Szegedy M. The space complexity of approximating the frequency moments. In Proc. 28th Annual ACM Symp. on Theory of Computing, 1996, pp. 20–29. Journal version in J. Comput. Syst. Sci., 58:137–147, 1999. 2. Bhattacharrya S., Madeira A., Muthukrishnan S., and Ye T. How to scalably skip past streams. In Proc. 1st Int. Workshop on Scalable Stream Processing Syst., 2007, pp. 654–663. 3. Charikar M., Chen K., and Farach-Colton M. Finding frequent items in data streams. In Proc. 29th Int. Colloquium on Automata, Languages, and Programming, 2002, pp. 693–703. 4. Cormode G., Korn F., Muthukrishnan S., Johnson T., Spatscheck O., and Srivastava D. Holistic UDAFs at streaming speeds. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2004, pp. 35–46. 5. Cormode G. and Muthukrishnan S. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1):58–75, 2005. 6. Cormode G. and Muthukrishnan S. Space efficient mining of multigraph streams. In Proc. 24th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2005, pp. 271–282. 7. Cormode G. and Muthukrishnan S. Summarizing and mining skewed data streams. In Proc. SIAM International Conference on Data Mining, 2005. 8. Estan C. and Varghese G. New directions in traffic measurement and accounting. In Proc. ACM Int. Conf. on Data Communication, 2002, pp. 323–338. 9. Indyk P. Better algorithms for high-dimensional proximity problems via asymmetric embeddings. In Proc. ACM-SIAM Symposium on Discrete Algorithms, 2003. 10. Kollios G., Byers J., Considine J., Hadjieleftheriou M., and Li F. Robust aggregation in sensor networks. Q. Bull. IEEE TC on Data Engineering, 28(1):26–32, 2005. 11. Lai Y.-K. and Byrd G.T. High-throughput sketch update on a low-power stream processor. In Proc. ACM/IEEE Symp. on Architecture for Networking and Communications Systems, 2006, pp. 123–132.
12. Lakshminath B. and Ganguly S. Estimating entropy over data streams. In Proc. 14th European Symposium on Algorithms, 2006, pp. 148–159. 13. Lee G.M., Liu H., Yoon Y., and Zhang Y. Improving sketch reconstruction accuracy using linear least squares method. In Proc. 5th ACM SIGCOMM Conf. on Internet Measurement, 2005, pp. 273–278. 14. Motwani R. and Raghavan P. Randomized Algorithms. Cambridge University Press, 1995. 15. Roughan M. and Zhang Y. Secure distributed data mining and its application in large-scale network measurements. Computer Communication Review, 36(1):7–14, 2006. 16. Rusu F. and Dobra A. Statistical analysis of sketch estimators. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2007, pp. 187–198. 17. Sarlós T., Benzúr A., Csalogány K., Fogaras D., and Rácz B. To randomize or not to randomize: space optimal summaries for hyperlink analysis. In Proc. 15th Int. World Wide Web Conference, 2006, pp. 297–306. 18. Spiegel J. and Polyzotis N. Graph-based synopses for relational selectivity estimation. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2006, pp. 205–216.
Coupling and De-coupling
Serguei Mankovskii
CA Labs, CA, Inc., Thornhill, ON, Canada

Definition Coupling is a measure of dependence between components of a software system. De-coupling is a design or re-engineering activity aiming to reduce coupling between system elements.

Key Points Coupling of system components refers to a measure of dependency among them. Coupled components might depend on each other in different ways. Some examples of the dependencies are: One component might depend on the syntax, format, or encoding of data produced by another component. One component might depend on the execution time within another component. One component might depend on the state of another component. The notion of coupling is connected to the notion of cohesion. Cohesion is a measure of how related and focused the responsibilities of a software component are. For example, a highly cohesive component might group responsibilities using the same syntax, format, or encoding of data, performed at the same time, or executed in the same state. Highly cohesive components lead to fewer dependencies between components, and vice versa. The notions of coupling and cohesion were studied in structured and object-oriented programming. This research developed software tools to calculate coupling and cohesion metrics. Low coupling is often desirable because it leads to reliability, ease of modification, low maintenance costs, understandability, and reusability. Low coupling can be achieved by deliberately designing a system with low values of a coupling metric. It can also be achieved by re-engineering an existing software system through restructuring of the system into a set of more cohesive components. These activities are called de-coupling.

Cross-references
▶ Cohesion ▶ OODB (Object-Oriented Database) ▶ Structured Programming
Coverage ▶ Specificity
Covering Index
Donghui Zhang
Northeastern University, Boston, MA, USA
Definition Given an SQL query, a covering index is a composite index that includes all of the columns referenced in SELECT, JOIN, and WHERE clauses of this query. Because the index contains all the data needed by the query, to execute the query the actual data in the table does not need to be accessed.
Key Points Covering indexes [1] support index-only execution plans. In general, having everything indexed tends to increase the query performance (in number of I/Os). However, using a covering index with too many columns
can actually degrade performance. Typically, multi-dimensional index structures, e.g., the R-tree, perform worse than a linear scan in high dimensions. Some guidelines for creating a covering index are: (i) Create a covering index for frequently used queries. There are overheads in creating a covering index, which are often more significant than for creating a regular index with fewer columns. Hence, if a query is seldom used, the overhead to create a covering index for it is not justified. This corresponds to Amdahl's law: improve the ‘‘interesting’’ part to receive the maximum overall benefit for the system. (ii) Try to build a covering index by expanding an existing index. For instance, if there already exists an index on ‘‘age’’ and ‘‘salary,’’ and one needs a covering index on ‘‘age,’’ ‘‘salary,’’ and ‘‘income,’’ it is often better to expand the existing index rather than building a new index, which would share two columns with the existing index. The term ‘‘covering index’’ is sometimes used to mean the collection of single-column, non-clustered indexes on all the columns in a table. This is due to the ‘‘index intersection’’ technique incorporated into the Microsoft SQL Server query optimizer [1]. In particular, the query optimizer can build, at run time, a hash-based ‘‘covering index’’ to speed up queries on a frequently used table. This covering index is really a hash table, which is built based on multiple existing indexes. Creating single-column indexes on all columns encourages the query optimizer to perform index intersection, i.e., to build dynamic covering indexes.
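As a concrete illustration (here using SQLite through Python's sqlite3 module; the exact EXPLAIN QUERY PLAN wording differs between systems and versions, and the table and index names are invented), a composite index that contains every column a query references lets the engine answer the query from the index alone:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (id INTEGER, age INTEGER, salary INTEGER, income INTEGER)")
# Composite index covering all columns referenced by the query below.
conn.execute("CREATE INDEX idx_emp_age_salary_income ON emp (age, salary, income)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT salary, income FROM emp WHERE age = 42"
).fetchall()
print(plan)   # SQLite typically reports '... USING COVERING INDEX idx_emp_age_salary_income'
```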
Cross-references ▶ Access Methods ▶ Indexing
Recommended Reading
1. McGehee B. Tips on Optimizing Covering Indexes. http://www.sql-server-performance.com/tips/covering_indexes_p1.aspx, 2007.
Covert Communication ▶ Steganography
CPU Cache ▶ Processor Cache
Crabbing ▶ B-Tree Locking
C Crash Recovery T HEO H A¨ RDER University of Kaiserslautern, Kaiserslautern, Germany
Synonyms Failure handling; System recovery; Media recovery; Online recovery; Restart processing; Backward recovery
Definition In contrast to transaction aborts, a crash is typically a major failure by which the state of the current database is lost or parts of storage media are unrecoverable (destroyed). Based on log data from a stable log, also called the temporary log file, and the inconsistent and/or outdated state of the permanent database, system recovery has to reconstruct the most recent transaction-consistent database state. Because DBMS restart may take too long to be masked from the user, a denial of service can be observed. Recovery from media failures relies on the availability of (several) backup or archive copies of earlier DB states – organized according to the generation principle – and archive logs (often duplexed) covering the processing intervals from the points in time the backup copies were created. Archive recovery usually causes much longer outages than system recovery.
Historical Background Log data delivering the redundancy needed to recover from failures was initially stored in nonvolatile core memory, to be reclaimed at restart by a so-called log salvager [3] in the ‘‘pre-transaction era’’. Advances in VLSI technology enabled the use of cheaper and larger, but volatile semiconductor memory as the computers' main memory. By 1971, this technology change had triggered in industry – driven by database product adjustments – the development of new and refined concepts of logging such as log sequence numbers (LSNs), the write-ahead log protocol (WAL), log duplexing, and more. Typically, these concepts were not published; nevertheless, they paved the way towards the use
of ACID transactions. As late as 1978, Jim Gray documented the design of such a logging system implemented in IMS in a widely referenced publication [5]. Many situations and dependencies related to failures and recovery from them in databases were thoroughly explored by Lawrence Bjork and Charles Davies in their studies concerning DB/DC systems back in 1973, leading to the so-called ‘‘spheres of control’’ [2]. The first published implementation of the transaction concept by a full-fledged DBMS recovery manager was that of System R, started in 1976 [4]. It refined the Do-Undo-Redo protocol and enabled automatic recovery for new recoverable types and operations. In 1981, Andreas Reuter presented in his Ph.D. dissertation further investigations and refinements of concepts related to failure handling in database systems [9]. Delivering a first version of the principles of transaction-oriented database recovery [Härder and Reuter 1979], including the Ten Commandments [6], this classification framework, defining the paradigm of transaction-oriented recovery and coining the acronym ACID for it [7], was finally published in 1983. The most famous and most complete description of recovery methods and their implementation was presented by C. Mohan et al. in the ARIES paper [8] in 1992, while thorough treatment of all questions related to this topic appeared in many textbooks, especially those of Bernstein et al. [1], Gray and Reuter [3], and Weikum and Vossen [11]. All solutions implemented for crash recovery in industrial-strength DBMSs are primarily disk-based. Proposals to use ‘‘safe RAM’’, for example, were not widely accepted.
Foundations The most difficult failure type to recover from is the system failure or system crash (see Logging and Recovery). Due to some (expected, but) unplanned failure event (a bug in the DBMS code, an operating system fault, a power or hardware failure, etc.), the current database – comprising all objects accessible to the DBMS during normal processing – is no longer available. In particular, the in-memory state of the DBMS (lock tables, cursors and scan indicators, status of all active transactions, etc.) and the contents of the database buffer and the log buffer are lost. Furthermore, the lost state may include information about LSNs, ongoing commit processing with participating coordinators and participants, as well as commit requests and votes. Therefore, restart cannot rely on such
information and has to refer to the temporary log file (stable log) and the permanent (materialized) database, that is, the state the DBMS finds after a crash on the non-volatile storage devices (disks), without having applied any log information.
Consistency Concerns
According to the ACID principle, a database is consistent if and only if it contains the results of successful transactions – it is then called transaction-consistent. Because a DBMS application must not lose the changes of committed transactions, and all of them have contributed to the DB state, the goal of crash recovery is to establish the most recent transaction-consistent DB state. For this purpose, redo and undo recovery are needed in general. Results of committed transactions may not yet be reflected in the database, because execution was terminated in an uncontrolled manner and the corresponding pages containing such results were not propagated to the permanent DB at the time of the crash. Therefore, these updates must be repeated if necessary, typically by means of log information. On the other hand, changes of incomplete transactions may have reached the permanent DB state on disk. Hence, undo recovery has to completely roll back such uncommitted changes. Because many interactive users usually rely on DBMS services in their daily business, crash recovery is very time-critical. Therefore, the crash-related interruption of DBMS processing should be masked from them as far as possible. Although DBMS crashes are nowadays rather rare events, occurring perhaps several times a year or a month depending on the stability of both the DBMS and its operational environment, their recovery should take no more than a number of seconds or at most a few minutes (as opposed to archive recovery), even if GByte or TByte databases with thousands of users are involved.
Forward Recovery
Having these constraints and requirements in mind, which kinds of recovery strategies can be applied? Despite the presence of so-called non-stop systems (giving the impression that they can cope with failures by forward recovery), rollforward is very difficult, if not impossible, in any stateful system. To guarantee atomicity in case of a crash, rollforward recovery would have to enable all transactions to resume execution so that they can either complete successfully or be
aborted by the DBMS. Assume the DB state containing the most recent successful DB operations could be made available, that is, all updates prior to the crash had completely reached the permanent DB state. Even then, rollforward would not be possible, because a transaction cannot resume in "forward direction" unless its local state is restored. Moreover, in a DBMS environment, the loss of the in-memory state makes it entirely impossible to resume from the point at which the crash occurred. For these reasons, a rollback strategy for active transactions is the only choice in crash recovery to ensure atomicity (wiping out all traces of such transactions); later on, these transactions are started anew, either by the user or by the DBMS environment. The only opportunities for forward actions are given by redundant structures where it is immaterial for the logical DB content whether modifying operations are undone or completed. A typical example is the splitting operation of a B-tree node.
Logging Methods and Rules
Crash recovery – like any recovery from a failure – needs some kind of redundancy to detect invalid or missing data in the permanent database and to "repair" its state as required, i.e., removing modifications effected by uncommitted transactions from it and supplementing it with updates of committed transactions. For this task, the recovery algorithms typically rely on log data collected during normal processing. Different forms of logging are conceivable. Logical logging is a kind of operator logging; it collects operators and their arguments at a higher level of abstraction (e.g., for internal operations (actions) or operations of the data management language (DML)). While this method of logging may save I/O to, and space in, the log file during normal processing, it requires at restart time a DB state that is level-consistent w.r.t. the level of abstraction used for logging, because the logged operations have to be executed using data of the permanent database. For example, action logging and DML-operation logging require action consistency and consistency at the application programming interface (API consistency), respectively [6]. Hence, the use of this kind of method implies the atomic propagation (see below) of all pages modified by the corresponding operation, which can be implemented by shadow pages or differential files. Physical logging – in its simplest form collecting the before- and after-images of pages – does not expect any form
of consistency at higher DB abstraction levels and, in turn, can be used in any situation, in particular when non-atomic propagation of modified pages (update-in-place) is performed. However, writing the before- and after-images of all modified pages to the log file is very time-consuming (I/O) and not space-economical at all. Therefore, a combination of both kinds leads to so-called physiological logging, which can be roughly characterized as "physical to a page and logical within a page". It enables a compact representation of log data (logging of elementary actions confined to single pages, entry logging) and leads to the most important logging/recovery method in practice; non-atomic propagation of pages to disk is sufficient for the application of the log data. Together with the use of log sequence numbers in the log entries and in the headers of the data pages (combined use of LSNs and PageLSNs, see ARIES Protocol), simple and efficient checks at restart detect whether or not the modifications of elementary actions have reached the permanent database, that is, whether or not undo or redo operations have to be applied. While, in principle, crash recovery methods do not have specific requirements for forcing pages to the permanent DB, sufficient log information must have reached the stable log. The following rules (for forcing the log buffer to disk) have to be observed to guarantee recovery to the most recent transaction-consistent DB state:
1. Redo log information must be written at the latest in phase 1 of commit.
2. WAL (write-ahead logging) has to be applied to enable undo operations before uncommitted (dirty) data is propagated to the permanent database.
3. Log information must not be discarded from the temporary log file unless it is guaranteed that it will no longer be needed for recovery, that is, the corresponding data page has reached the permanent DB. Typically, sufficient log information has been written to the archive log in addition.
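To make the combined use of LSNs and PageLSNs concrete, the following Python sketch shows how a restart routine might decide, for a single log record, whether its update still has to be redone or is already reflected in the permanent database. The record and page structures and all helper names are illustrative assumptions made for this sketch, not the interface of any particular DBMS.

from dataclasses import dataclass
from typing import Callable

@dataclass
class LogRecord:
    lsn: int            # log sequence number of this elementary action
    page_id: int        # page modified by the action
    redo: Callable      # re-applies the action to the page
    undo: Callable      # rolls the action back

@dataclass
class Page:
    page_id: int
    page_lsn: int       # LSN of the last update reflected in this page on disk

def needs_redo(rec: LogRecord, page: Page) -> bool:
    # The update is missing from the permanent DB exactly when the page
    # still carries an older state than the one the log record describes.
    return page.page_lsn < rec.lsn

def apply_redo(rec: LogRecord, page: Page) -> None:
    if needs_redo(rec, page):
        rec.redo(page)
        page.page_lsn = rec.lsn   # the action is now reflected in the page

The same comparison, with the roles reversed, tells an undo pass whether an uncommitted change actually reached the page and therefore has to be rolled back.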
Taxonomy of Crash Recovery Algorithms
Forcing log data as captured by these rules yields the necessary and sufficient condition to successfully cope with system crashes. Specific assumptions concerning page propagation to the permanent database only influence performance issues of the recovery process.
When dirty data can reach the permanent DB (steal property), recovery must be prepared to execute undo steps and, in turn, redo steps when data modified by a transaction is not forced at commit or before (no-force property). In contrast, if propagation of dirty data is prevented (no-steal property), the permanent DB only contains clean (but potentially missing or old) data, thus making undo steps unnecessary. Finally, if all transaction modifications are forced at commit (force property), redo is never needed at restart. Hence, these properties concerning buffer replacement and update propagation are maintained by the buffer manager/transaction manager during normal processing and lead to four cases of crash recovery algorithms, which cover all approaches proposed so far:
1. Undo/Redo: This class contains the steal/no-force algorithms, which have to observe no other requirements than the logging rules. However, potentially both undo and redo steps have to be performed during restart after a crash.
2. Undo/NoRedo: The so-called steal/force algorithms guarantee at any time that all actions of committed transactions are in the permanent DB. However, because of the steal property, dirty updates may be present, which may require undo steps, but never redo steps during restart.
3. NoUndo/Redo: The corresponding class members are known as no-steal/no-force algorithms, which guarantee that dirty data never reaches the permanent DB. Dirty data pages are either never replaced from the DB buffer or, in case buffer space is in short supply, they are displaced to other storage areas outside the permanent DB. Restart after a crash may require redo steps, but never undo steps.
4. NoUndo/NoRedo: This "magic" class of the so-called no-steal/force algorithms always guarantees a
state of the permanent DB that corresponds to the most recent transaction-consistent DB state. It requires that no modified data of a transaction reaches the permanent DB before commit and that all transaction updates are atomically propagated (forced) at commit. Hence, neither undo nor redo steps are ever needed during restart. The discussion of these four cases is summarized in Fig. 1, which represents a taxonomy of crash recovery algorithms.
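The four cases can also be summarized programmatically; the sketch below (illustrative naming, not taken from the entry) maps the two buffer-management properties to the undo/redo requirements at restart.

def recovery_class(steal: bool, force: bool) -> str:
    """Return the crash recovery class implied by the buffer policies.

    steal: dirty (uncommitted) pages may be written to the permanent DB.
    force: all pages modified by a transaction are written at commit.
    """
    undo = "Undo" if steal else "NoUndo"    # undo needed iff dirty data can reach disk
    redo = "NoRedo" if force else "Redo"    # redo needed iff committed data may be missing
    return undo + "/" + redo

# recovery_class(steal=True,  force=False) -> "Undo/Redo"      (today's common choice)
# recovery_class(steal=False, force=True)  -> "NoUndo/NoRedo"  (class 4)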
Implementation Implications
The latter two classes of algorithms (NoUndo) require a mechanism which can propagate a set of pages in an atomic way (with regard to the remaining DBMS processing). Such a mechanism needs to defer updates to the permanent DB until or after these updates become committed and can be implemented by various forms of shadowing concepts or differential file approaches. Algorithms relying on redo steps, i.e., without the need to force committed updates to the permanent DB, have no control over the point in time when committed updates reach the permanent DB. While the buffer manager will propagate most of the modified pages back soon after the related update operations, a few hot-spot pages are modified again and again and, since they are referenced so frequently, are never written from the buffer. These pages may have accumulated the updates of many committed transactions, and redo recovery will therefore have to go back very far on the temporary log. As a consequence, restart becomes expensive and the DBMS's out-of-service time unacceptably long. For this reason, some form of checkpointing is needed to make restart costs independent of the mean time between failures. Generating a checkpoint means collecting
Crash Recovery. Figure 1. Taxonomy of crash recovery algorithms.
information related to the DB state in a safe place, which is used to define and limit the amount of redo work required after a crash. The restart logic can then return to this checkpoint state and attempt to recover the most recent transaction-consistent state. From a conceptual point of view, the algorithms of class 4 seem particularly attractive, because they always preserve a transaction-consistent permanent DB. However, in addition to the substantial cost of providing atomic update propagation, the need to force all updates at commit (necessarily in a synchronous way, which may require a large number of physical I/Os and, in turn, extend the lock duration for all affected objects) makes this approach rather expensive. Furthermore, with the typical disk-based DB architectures, pages are the units of update propagation, which has the consequence that a transaction updating a record in a page cannot share this page with other updaters, because dirty updates must not leave the buffer and updates of complete transactions must be propagated to the permanent DB at commit. Hence, no-steal/force algorithms imply at least page locking as the smallest lock granule. One of these cost factors – either synchronously forced updates at commit or atomic updates for NoUndo – applies to each of the algorithm classes 2 and 3. Therefore, they were not a primary choice for the DBMS vendors competing in today's market. Hence, the laissez-faire solution "steal, no-force" with non-atomic update propagation (update-in-place) is today's favorite solution, although it always leaves the permanent DB in a "chaotic state" containing dirty and outdated data pages and keeping the latest version of frequently used pages only in the DB buffer. With the optimistic expectation that crashes are rather rare events, it minimizes recovery provisions during normal processing. Checkpointing is still necessary, but the application of direct checkpoints, flushing the entire buffer at a time, is not advisable anymore when buffers of several GByte are used. To affect normal processing as little as possible, so-called fuzzy checkpoints are written; only a few pages with metadata concerning the DB buffer state have to be synchronously
propagated, while data pages are "gently" moved to the permanent DB in an asynchronous way.
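As an illustration of how a fuzzy checkpoint limits redo work, the sketch below writes only buffer metadata (a dirty page table recording, per page, the LSN of its oldest unwritten update) and derives the log position from which a later redo pass would have to start. The data structures and the log interface are assumptions made for this sketch, not the layout used by any specific system.

def write_fuzzy_checkpoint(dirty_pages: dict, log) -> int:
    """Append a fuzzy checkpoint record and return its position in the log.

    dirty_pages maps page_id -> rec_lsn, the LSN of the oldest update that has
    not yet reached the permanent DB for that page.  The data pages themselves
    are not flushed here; they trickle out asynchronously.
    """
    record = {"type": "fuzzy_checkpoint", "dirty_pages": dict(dirty_pages)}
    return log.append(record)   # assumed to force the record to the stable log

def redo_start_point(checkpoint: dict, checkpoint_lsn: int) -> int:
    """Redo after a crash must begin at the oldest rec_lsn noted in the
    checkpoint; with no dirty pages it can start at the checkpoint itself."""
    return min(checkpoint["dirty_pages"].values(), default=checkpoint_lsn)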
Archive Recovery
So far, the data of the permanent DB was assumed to be usable or at least recoverable using the redundant data collected in the temporary log. This is illustrated by the upper path in Fig. 2. If any of the participating components is corrupted or lost because of another hardware or software failure, archive recovery – characterized by the lower path – must be tried. Successful recovery also implies independent failure modes of the components involved. The creation of an archive copy, that is, copying the online version of the DB, is a very expensive process; for example, creating a transaction-consistent DB copy would interrupt update operation for a long time, which is unacceptable for most DB applications. Therefore, two base methods – fuzzy dumping and incremental dumping – were developed to reduce the burden on normal DB operation while an archive copy is created. A fuzzy dump copies the DB on the fly, in parallel with normal processing. The other method writes only the changed pages to an incremental dump. Of course, both methods usually deliver inconsistent DB copies, such that log-based post-processing is needed to apply the incremental modifications. In a similar way, either type of dump can be used to create a new, more up-to-date copy from the previous one, using a separate offline process such that DB operation is not affected. Archive copies are "hopefully" never or only very infrequently used. Therefore, they may be susceptible to magnetic decay. For this reason, redundancy is needed again, which is usually provided by keeping several generations of the archive copy. So far, all log information was assumed to be written only to the temporary log file during normal processing. To create the (often duplexed) archive log, usually an independent and asynchronously running process copies the redo data from the temporary log. To guarantee successful recovery, failures when using
Crash Recovery. Figure 2. Two ways of DB crash recovery and the components involved.
the archive copies must be anticipated. Therefore, archive recovery must be prepared to start from the oldest generation, and hence the archive log must span the whole distance back to this point in time.
Key Applications Recovery algorithms, and in particular those for crash recovery, are a core part of each commercial-strength DBMS and require a substantial fraction of the design/implementation effort and of the code base: "A recoverable action is 30% harder and requires 20% more code than a nonrecoverable action" (J. Gray). Because the occurrence of failures cannot be excluded and all data driving the daily business are managed in databases, mission-critical businesses depend on the recoverability of their data. In this sense, provisions for crash recovery are indispensable in such DBMS-based applications. Another important application area of crash recovery techniques is file systems, in particular their metadata about file existence, space allocation, etc.
Future Directions So far, crash recovery provisions are primarily disk-based. With "unlimited" memory available, main-memory DBMSs will provide efficient and robust solutions without the need for non-volatile storage for crash recovery. More and more approaches are expected to exploit specialized storage devices such as battery-backed RAM or to use replication in grid-organized memories. Revolutionary architectural concepts that execute online transaction processing sequentially have already been proposed and may not require transactional facilities at all [10].
Cross-references
▶ ACID Properties ▶ Application Recovery ▶ B-Tree Locking ▶ Buffer Management ▶ Logging and Recovery ▶ Multi-Level Recovery and the ARIES Algorithm
Recommended Reading 1. Bernstein P.A., Hadzilacos V., and Goodman N. Concurrency Control and Recovery in Database Systems. Addison-Wesley, Reading, MA, 1987. 2. Davies C.T. Data processing spheres of control. IBM Syst. J., 17(2):179–198, 1978.
3. Gray J. and Reuter A. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA, 1993. 4. Gray J., McJones P., Blasgen M., Lindsay B., Lorie R., Price T., Putzolu F., and Traiger I.L. The recovery manager of the System R database manager. ACM Comput. Surv., 13(2):223–242, 1981. 5. Gray J. Notes on database operating systems. In Operating Systems: An Advanced Course. Springer, LNCS 60, 1978, pp. 393–481. 6. Härder T. DBMS Architecture – Still an Open Problem. In Proc. German National Database Conference, 2005, pp. 2–28. 7. Härder T. and Reuter A. Principles of transaction-oriented database recovery. ACM Comput. Surv., 15(4):287–317, 1983. 8. Mohan C., Haderle D.J., Lindsay B.G., Pirahesh H., and Schwarz P.M. ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Trans. Database Syst., 17(1):94–162, 1992. 9. Reuter A. Fehlerbehandlung in Datenbanksystemen. Carl Hanser, Munich, 1981, p. 456. 10. Stonebraker M., Madden S., Abadi D.J., Harizopoulos S., Hachem N., and Helland P. The End of an Architectural Era (It's Time for a Complete Rewrite). In Proc. 33rd Int. Conf. on Very Large Data Bases, 2007, pp. 1150–1160. 11. Weikum G. and Vossen G. Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery. Morgan Kaufmann, San Francisco, CA, 2002.
Crawler ▶ Incremental Crawling
Credulous Reasoning ▶ Possible Answers
Cross Product ▶ Cartesian Product
Cross-language Informational Retrieval ▶ Cross-Language Mining and Retrieval
Cross-Language Mining and Retrieval
Wei Gao¹, Cheng Niu²
¹The Chinese University of Hong Kong, Hong Kong, China
²Microsoft Research Asia, Beijing, China
Synonyms Cross-language text mining; Cross-language web mining; Cross-language informational retrieval; Translingual information retrieval
Definition Cross-language mining is a task of text mining dealing with the extraction of entities and their counterparts expressed in different languages. The entities of interest may be of various granularities, ranging from acronyms, synonyms, cognates, and proper names to comparable or parallel corpora. Cross-Language Information Retrieval (CLIR) is a sub-field of information retrieval dealing with the retrieval of documents across language boundaries, i.e., the language of the retrieved documents is not the same as the language of the queries. Cross-language mining usually acts as an effective means to improve the performance of CLIR by complementing the translation resources exploited by CLIR systems.
Historical Background CLIR addresses the growing demand to access large volumes of documents across language barriers. Unlike monolingual information retrieval, CLIR requires query terms in one language to be matched with the indexed keywords in documents of another language. Usually, the cross-language matching can be done by making use of a bilingual dictionary, machine translation software, or a statistical model of bilingual word association. CLIR generally takes into account, but is not limited to, issues like how to translate query terms, how to deal with query terms nonexistent in a translation resource, and how to disambiguate or weight alternative translations (e.g., to decide that "traitement" in a French query means "treatment" and not "salary" in English, or how to order the French terms "aventure," "business," "affaire," and "liaison" as relevant translations of the English query "affair"). The performance of CLIR can be measured by the general evaluation metrics of information retrieval,
such as recall, precision, average precision, and mean reciprocal rank. The first workshop on CLIR was held in Zürich during the SIGIR-96 conference. Workshops have been held yearly since 2000 at the meetings of CLEF (Cross Language Evaluation Forum), following its predecessor workshops of the TREC (Text Retrieval Conference) cross-language track. The NTCIR (NII Test Collection for IR Systems) workshop is also held each year in Japan for the CLIR community, focusing on English and Asian languages. The study of cross-language mining appeared relatively later than CLIR, partly due to the increasing demands on the quality of CLIR and machine translation, as well as the recent advancement of text/Web mining techniques. A typical early work on cross-lingual mining is believed to be PTMiner [14], which mines parallel text from the Web for use in query translation. Besides parallel data mining, people have also tried to mine the translations of Out-of-Vocabulary (OOV) terms from the search results returned by search engines [5,18] or from web anchor texts and link structures [11]. Based on phonetic similarity, transliterations (the phonetic counterpart of a name in another language; e.g., "Schwarzenegger" is pronounced as "shi wa xin ge" in Chinese pinyin) of foreign names can also be extracted from the Web [10]. These methods are proposed to alleviate the OOV problem of CLIR, since appropriate translation resources are usually lacking for new terminologies and proper names, particularly in the scenario of cross-language web search.
Foundations Most approaches to CLIR perform query translation followed by monolingual retrieval, so the retrieval performance is largely determined by the quality of the query translation. Queries are typically translated using a bilingual dictionary [15], machine translation (MT) software [7], a bilingual word association model learned from a parallel corpus [6,14], or, more recently, the query log of a search engine [9]. Regardless of the type of resource being used, OOV translation and translation disambiguation are the two major bottlenecks for CLIR. On the one hand, translation resources can never be comprehensive. Correctly translating queries, especially Web queries, is difficult since they often contain new words (e.g., new movies, brands, celebrities, etc.) that emerge frequently and in a timely fashion, yet
are OOV to the system. On the other hand, many words are polysemous or do not have a unique translation, and sometimes the alternative translations have very different meanings. This is known as translation ambiguity. Selecting the correct translation is not trivial due to the shortage of context provided in a query, and effective techniques for translation disambiguation are necessary. It should be mentioned that document translation with MT in the opposite direction is an alternative approach to CLIR. However, it is less commonly used than query translation in the literature, mainly because MT is computationally expensive and costly to develop, and the document sets in IR are generally very large. For cross-language web search, it is almost impractical to translate all web pages before indexing. Some large-scale attempts to compare query translation and document translation have suggested no clear advantage for either of the approaches to CLIR [12]. They found, however, that compared with extremely high quality human query translations, it is advantageous to incorporate both document and query translation into a CLIR system.
Cross-Language Web Mining
Mining Parallel Data
The approaches to mining parallel text make extensive use of bilingual websites from which parallel web pages corresponding to the specified language pair can be identified and downloaded. The bilingual texts are then automatically aligned in terms of sentences and words by statistical alignment tools such as GIZA++ [21]. Word translation probabilities can be derived from the statistics of word pairs occurring in the alignments, after which one can resort to statistical machine translation models, e.g., IBM Model 1 [4], for translating given queries into the target language. Typical parallel data mining tools include PTMiner [14], STRAND [16], and the DOM-tree-alignment-based system [17].
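As a rough illustration of how mined parallel text feeds query translation, the sketch below estimates word translation probabilities from word-aligned sentence pairs (such as those produced by an alignment tool like GIZA++) and translates a query by picking the most probable target word for each term. The counting is deliberately naive; a real system would use the full IBM Model 1 estimation rather than these simple relative frequencies, and the function names are only placeholders.

from collections import Counter, defaultdict

def estimate_translation_probs(aligned_pairs):
    """aligned_pairs: iterable of (source_word, target_word) links taken from
    word-aligned parallel sentences."""
    counts = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1
    probs = {}
    for src, tgt_counts in counts.items():
        total = sum(tgt_counts.values())
        probs[src] = {tgt: c / total for tgt, c in tgt_counts.items()}
    return probs

def translate_query(query_terms, probs):
    """Pick the most probable translation per term; terms without an entry
    are kept as-is (the OOV case discussed next)."""
    translated = []
    for term in query_terms:
        candidates = probs.get(term)
        translated.append(max(candidates, key=candidates.get) if candidates else term)
    return translated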
Mining OOV Term Translation
Web pages also contain translations of terms, either in the body texts or in the anchor texts of hyperlinks pointing to other pages. For example, for some language pairs, such as Chinese-English or Japanese-English, the Web contains rich body texts in a mixture of multiple languages. Many of them contain bilingual translations of proper nouns, such as company names and person names. The work
of [5,18] exploits this characteristic to automatically extract translations from search results for a large number of unknown query terms. Using the extracted bilingual translations, the performance of CLIR between English and Chinese is effectively improved. Both methods select translations based on some variants of co-occurrence statistics. The anchor texts of web pages' hyperlinks are another source for translational knowledge acquisition. This is based on the observation that the anchor texts of hyperlinks pointing to the same URL may contain similar descriptive texts. Lu et al. [11] use anchor texts of different languages to extract the regional aliases of query terms for constructing a translation lexicon. A probabilistic inference model is exploited to estimate the similarity between a query term and the extracted translation candidates.
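A minimal sketch of the search-result idea: collect mixed-language snippets returned for the unknown source-language term, extract target-language candidate strings, and rank them by how often they accompany the term. The tokenization pattern, the snippet source, and the ranking by raw frequency are simplifying assumptions; the cited systems use considerably more elaborate candidate extraction and co-occurrence statistics.

import re
from collections import Counter

def mine_oov_translations(source_term, snippets, top_k=3):
    """snippets: text snippets (e.g., from a web search for source_term) that
    may mix the source and target languages."""
    candidates = Counter()
    for snippet in snippets:
        if source_term not in snippet:
            continue
        # Assumes the target language uses Latin script; adapt the pattern
        # to the language pair at hand.
        for phrase in re.findall(r"[A-Za-z][A-Za-z ]+", snippet):
            candidates[phrase.strip().lower()] += 1
    return [c for c, _ in candidates.most_common(top_k)]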
Query Translation Disambiguation
Translation disambiguation, or ambiguity resolution, is crucial to query translation accuracy. With a simple dictionary-based translation approach that does not address translation disambiguation, the effectiveness of CLIR can be 60% lower than that of monolingual retrieval [3]. Different disambiguation techniques have been developed using statistics obtained from document collections, all resulting in significant performance improvements. Zhang et al. [19] give a concise review of three main translation disambiguation techniques: using term similarity [1], using word co-occurrence statistics of the target-language documents, and language-modeling-based approaches [20]. This subsection introduces these approaches following the review of Zhang et al. [19].
Disambiguation by Term Similarity
Adriani [1] proposed a disambiguation technique based on the concept of statistical term similarity. The term similarity is measured by the Dice coefficient, which uses term-distribution statistics obtained from the corpus. The similarity between terms x and y, SIM(x, y), is calculated as:
SIM(x, y) = 2 \sum_{i=1}^{n} (w_{xi} \cdot w_{yi}) / (\sum_{i=1}^{n} w_{xi}^{2} + \sum_{i=1}^{n} w_{yi}^{2})
where w_{xi} and w_{yi} are the weights of terms x and y in document i. This method computes the sum of maximum similarity values between each candidate
translation of a term and the translations of all other terms in the query. For each query term, the translation with the highest sum is selected as its translation. The results of Indonesian-English CLIR experiments demonstrated the effectiveness of this approach. There are many variant term association measures, like Jaccard, Cosine, Overlap, etc., that can be applied similarly for calculating term similarity.
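The sketch below implements the Dice-style similarity above on per-document weight vectors, together with the selection rule just described: for each query term, pick the candidate translation whose summed maximum similarity to the other terms' candidates is highest. The data layout (a dictionary of weight vectors) is purely illustrative.

def dice_sim(wx, wy):
    """wx, wy: weight vectors of two terms over the same n documents."""
    num = 2 * sum(a * b for a, b in zip(wx, wy))
    den = sum(a * a for a in wx) + sum(b * b for b in wy)
    return num / den if den else 0.0

def pick_translations(candidates, weights):
    """candidates: {query_term: [candidate translations]} (non-empty lists)
    weights: {candidate: per-document weight vector}"""
    chosen = {}
    for term, cands in candidates.items():
        def score(c):
            # sum of the maximum similarity to each other query term's candidates
            return sum(
                max(dice_sim(weights[c], weights[o]) for o in other_cands)
                for other_term, other_cands in candidates.items()
                if other_term != term
            )
        chosen[term] = max(cands, key=score)
    return chosen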
Disambiguation by Term Co-occurrence
Ballesteros and Croft [3] used co-occurrence statistics obtained from the target corpus for resolving translation ambiguity. They assume that the correct translations of query terms should co-occur in target-language documents and that incorrect translations tend not to co-occur. A similar approach is studied by Gao et al. [8]. They observed that the correlation between two terms is stronger when the distance between them is shorter, and extended the previous co-occurrence model by incorporating a distance factor D(x, y) = e^{-\alpha (Dis(x, y) - 1)}. The mutual information between terms x and y, MI(x, y), is calculated as:
MI(x, y) = \log( f_w(x, y) \cdot D(x, y) / (f_x \cdot f_y) + 1 )
where f_w(x, y) is the co-occurrence frequency of x and y occurring simultaneously within a window of size w in the collection, f_x is the collection frequency of x, and f_y is the collection frequency of y. D(x, y) decreases exponentially as the distance between the two terms increases, where \alpha is the decay rate and Dis(x, y) is the average distance between x and y in the collection. Experiments on the TREC9 Chinese collection showed that the distance factor leads to substantial improvements over the basic co-occurrence model.
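The decaying co-occurrence measure transcribes directly into code; parameter names follow the formulas above, the statistics are assumed to come from the target-language collection, and the decay rate used here is an arbitrary example value.

import math

def decay_factor(avg_distance, alpha=0.8):
    """D(x, y) = e^(-alpha * (Dis(x, y) - 1)), with alpha the decay rate."""
    return math.exp(-alpha * (avg_distance - 1))

def decaying_mi(f_xy, f_x, f_y, avg_distance, alpha=0.8):
    """MI(x, y) = log(f_w(x, y) * D(x, y) / (f_x * f_y) + 1)."""
    return math.log(f_xy * decay_factor(avg_distance, alpha) / (f_x * f_y) + 1)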
Disambiguation by Language Modeling
In the work of [20], a probability model based on a hidden Markov model (HMM) is used to estimate the maximum likelihood of each sequence of possible translations of the original query. The most probable translation set is selected among all the possible translation sets. HMMs are widely used for probabilistic modeling of sequence data. In their work, a smoothing technique based on absolute discounting and interpolation is adopted to deal with the zero-frequency problem during probability estimation. See [20] for details.
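The cited work builds a proper HMM with smoothed probabilities; the simplified sketch below only conveys the underlying idea by scoring every combination of candidate translations with a bigram language model and returning the most likely sequence, which is feasible only for short queries. The probability functions are assumed to be supplied by the caller.

import math
from itertools import product

def best_translation_sequence(candidates, unigram_prob, bigram_prob):
    """candidates: one list of candidate translations per query term, in order.
    unigram_prob(w) and bigram_prob(w1, w2) return smoothed probabilities
    estimated from a target-language corpus."""
    best_seq, best_score = None, float("-inf")
    for seq in product(*candidates):
        score = math.log(unigram_prob(seq[0]))
        for prev, cur in zip(seq, seq[1:]):
            score += math.log(bigram_prob(prev, cur))
        if score > best_score:
            best_seq, best_score = seq, score
    return list(best_seq)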
Pre-/Post-Translation Expansion
Techniques for OOV term translation and translation disambiguation both aim to translate the query correctly. However, it is arguable that precise translation may not be necessary for CLIR. Indeed, in many cases it is helpful to introduce words even if they are not direct translations of any query word, but are closely related to the meaning of the query. This observation has led to the development of cross-lingual query expansion (CLQE) techniques [2,13]. Ballesteros and Croft [2] reported enhancements of CLIR by post-translation expansion. McNamee and Mayfield [13] compared various CLQE techniques, including pre-translation expansion, post-translation expansion, and their combinations. Relevance feedback, the expansion technique commonly used in monolingual retrieval, is also widely adopted in CLQE. The basic idea is to expand the original query with additional terms that are extracted from the relevant retrieval results initially returned. Among the different relevance feedback methods, explicit feedback requires documents whose relevance is explicitly marked by humans; implicit feedback is inferred from user behaviors that imply the relevance of the selected documents, such as which returned documents are viewed or how long they are viewed; blind or "pseudo" relevance feedback is obtained by assuming that the top n documents in the initial result are relevant.
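A hedged sketch of post-translation expansion with blind (pseudo) relevance feedback: run the translated query, assume the top n documents are relevant, and append the most frequent terms found in them. The retrieval function and the term-weighting scheme are placeholders for whatever the underlying engine provides; production systems typically use a weighted scheme such as Rocchio rather than raw term frequency.

from collections import Counter

def post_translation_expansion(translated_query, search, top_n=10, extra_terms=5):
    """search(query) is assumed to return ranked documents, each a list of terms."""
    top_docs = search(translated_query)[:top_n]
    tf = Counter(term for doc in top_docs for term in doc)
    expansion = [t for t, _ in tf.most_common() if t not in translated_query]
    return translated_query + expansion[:extra_terms]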
Cross-Lingual Query Suggestion
Traditional query translation approaches rely on static knowledge and data resources, which cannot effectively reflect the quickly shifting interests of Web users. Moreover, the translated terms can be reasonable translations but not popularly used in the target language. For example, the French query "aliment biologique" is translated into "biologic food," yet the correct formulation nowadays should be "organic food." This mismatch makes the translated query ineffective in the target language. To address this problem, Gao et al. [9] proposed a principled framework called Cross-Lingual Query Suggestion (CLQS), which leverages cross-lingual mining and translation disambiguation techniques to suggest related queries found in the query log of a search engine. CLQS aims to suggest related queries in a language different from that of the original query. CLQS is closely related to CLQE, but is distinct in that it suggests full queries that have been formulated by users, so that query integrity and coherence are preserved in the
suggested queries. It is used as a new means of query "translation" in CLIR tasks. The use of a query log for CLQS stems from the observation that, in the same period of time, many search users share the same or similar interests, which can be expressed in different manners in different languages. As a result, a query written in a source language is likely to have an equivalent in the query log of the target language. Especially if the user intends to perform CLIR, the original query is even more likely to have its correspondent included in the target-language log. Therefore, if a candidate for CLQS appears often in the query log, it is more likely to be an appropriate suggestion. CLQS has been shown to cover more relevant documents for the CLIR task. The key problem in CLQS is how to learn a similarity measure between two queries in different languages. The authors define cross-lingual query similarity based on both translation relations and monolingual similarity. The principle for learning is that, for a pair of queries, their cross-lingual similarity should fit the monolingual similarity between one query and the other query's translation. There are many ways to obtain a monolingual similarity between queries, e.g., co-occurrence-based mutual information and χ²; any of them can be used as the target for the cross-lingual similarity function to fit. In this way, cross-lingual query similarity estimation is formulated as a regression task:
sim_{CL}(q_f, q_e) = \mathbf{w} \cdot \varphi(f(q_f, q_e)) = sim_{ML}(T_{q_f}, q_e)
where, given a source-language query q_f, a target-language query q_e, and a monolingual query similarity sim_{ML}, the cross-lingual query similarity sim_{CL} is calculated as an inner product between a weight vector and the feature vector in the kernel space; \varphi is the mapping from the input feature space onto the kernel space, T_{q_f} denotes the translation of q_f, and \mathbf{w} is the weight vector, which can be learned by support vector regression training. The monolingual similarity is measured by combining both query content-based similarity and click-through commonality in the query log. This discriminative modeling framework can integrate arbitrary information sources to achieve an optimal performance. Multiple feature functions can be incorporated easily into the framework based on different translation resources, such as bilingual dictionaries, parallel data, web data, and query logs. Their work uses co-occurrence-based dictionary translation
disambiguation, the IBM translation model 1 based on a parallel corpus, and Web-based query translation mining as means to discover related candidate queries in the query log. Experiments on the TREC6 French-English CLIR task demonstrate that CLQS-based CLIR is significantly better than the traditional dictionary-based query translation with disambiguation and than machine translation approaches.
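The regression view of cross-lingual query similarity can be sketched with a standard support vector regression implementation. Here each feature vector stands for f(q_f, q_e), built from whatever translation resources are available (dictionary scores, parallel-corpus scores, web-mined scores), and the training targets are the monolingual similarities sim_ML(T_qf, q_e); the feature construction itself is entirely schematic.

import numpy as np
from sklearn.svm import SVR

def train_clqs_model(feature_vectors, monolingual_sims):
    """feature_vectors: one row of translation-based features per (q_f, q_e) pair.
    monolingual_sims: sim_ML between the translated source query and q_e."""
    model = SVR(kernel="rbf")
    model.fit(np.asarray(feature_vectors), np.asarray(monolingual_sims))
    return model

def rank_suggestions(model, query_log_features):
    """Score every target-language query in the log and return indices, best first."""
    scores = model.predict(np.asarray(query_log_features))
    return np.argsort(-scores)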
Latent Semantic Indexing (LSI) for CLIR
Different from most of the alternative approaches discussed above, LSI for CLIR [6] provides a method for matching text segments in one language with segments of similar meaning in another language without having to translate either. Using a parallel corpus, LSI can create a language-independent representation of words. The representation matrix reflects the patterns of term correspondences in the documents of the two languages. The matrix is factorized by Singular Value Decomposition (SVD) to derive a latent semantic space of reduced dimension, in which similar terms are represented by similar vectors. In the latent semantic space, therefore, the monolingual similarity between synonymous terms of one language and the cross-lingual similarity between translation pairs from different languages tend to be higher than the similarity with irrelevant terms. This characteristic allows relevant documents to be retrieved even if they do not share any terms with the query, which makes LSI suitable for CLIR.
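A minimal sketch of cross-language LSI with plain NumPy: build a term-by-document matrix over the parallel training corpus in which each column stacks the source- and target-language term counts of one document pair, take a truncated SVD, and then compare queries and documents of either language in the reduced space. Dimensions and function names are chosen only for illustration.

import numpy as np

def build_latent_space(term_doc_matrix, k=100):
    """term_doc_matrix: rows are terms of BOTH languages, columns are aligned
    document pairs from the parallel training corpus."""
    U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    return U[:, :k], s[:k]                 # term vectors and singular values

def fold_in(term_counts, U_k, s_k):
    """Project a query or document term-count vector into the latent space."""
    return (term_counts @ U_k) / s_k

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))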
Key Applications Cross-language mining and retrieval is the foundation technology for searching web information across multiple languages. It can also provide cross-lingual functionality for the retrieval of structured, semi-structured, and unstructured document databases in specific domains or in large multinational enterprises.
Experimental Results In general, for every presented work there is an accompanying experimental evaluation in the corresponding reference. In particular, the three influential international workshops held annually, i.e., CLEF, NTCIR, and TREC, define many evaluation tasks for CLIR, and a large number of experimental results have been published based on these benchmark specifications.
Data Sets Data sets for benchmarking CLIR are released to the participants of TREC, CLEF and NTCIR workshops annually with license agreements.
Cross-references
▶ Anchor Text ▶ Average Precision ▶ Document Databases ▶ Document Links and Hyperlinks ▶ Evaluation Metrics for Structured Text Retrieval ▶ Information Extraction ▶ Information Retrieval ▶ MAP ▶ MRR ▶ Query Expansion for Information Retrieval ▶ Query Translation ▶ Relevance Feedback ▶ Singular Value Decomposition ▶ Snippet ▶ Stemming ▶ Stoplists ▶ Term Statistics for Structured Text Retrieval ▶ Term Weighting ▶ Text Indexing and Retrieval ▶ Text Mining ▶ Web Information Extraction ▶ Web Search Relevance Feedback
Recommended Reading 1. Adriani M. Using statistical term similarity for sense disambiguation in cross-language information retrieval. Inform. Retr., 2(1):71–82, 2000. 2. Ballesteros L.A. and Croft W.B. Phrasal translation and query expansion techniques for cross-language information retrieval. In Proc. 20th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1997, pp. 84–91. 3. Ballesteros L.A. and Croft W.B. Resolving ambiguity for cross-language information retrieval. In Proc. 21st Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1998, pp. 64–71. 4. Brown P.F., Della Pietra S.A., Della Pietra V.J., and Mercer R.L. The mathematics of machine translation: parameter estimation. Comput. Linguist., 19:263–312, 1992. 5. Cheng P.-J., Teng J.-W., Chen R.-C., Wang J.-H., Lu W.-H., and Chien L.-F. Translating unknown queries with Web corpora for cross-language information retrieval. In Proc. 30th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2004, pp. 146–153.
6. Dumais S.T., Landauer T.K., and Littman M.L. Automatic cross-linguistic information retrieval using latent semantic indexing. ACM SIGIR Workshop on Cross-Linguistic Information Retrieval, 1996, pp. 16–23. 7. Fujii A. and Ishikawa T. Applying machine translation to two-stage cross-language information retrieval. In Proc. 4th Conf. Association for Machine Translation in the Americas, 2000, pp. 13–24. 8. Gao J., Zhou M., Nie J.-Y., He H., and Chen W. Resolving query translation ambiguity using a decaying co-occurrence model and syntactic dependence relations. In Proc. 25th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2002, pp. 183–190. 9. Gao W., Niu C., Nie J.-Y., Zhou M., Hu J., Wong K.-F., and Hon H.-W. Cross-lingual query suggestion using query logs of different languages. In Proc. 33rd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2007, pp. 463–470. 10. Jiang L., Zhou M., Chien L.-F., and Niu C. Named entity translation with Web mining and transliteration. In Proc. 20th Int. Joint Conf. on AI, 2007, pp. 1629–1634. 11. Lu W.-H., Chien L.-F., and Lee H.-J. Translation of Web queries using anchor text mining. ACM Trans. Asian Lang. Information Proc., 1(2):159–172, 2002. 12. McCarley J.S. Should we translate the documents or the queries in cross-language information retrieval? In Proc. 27th Annual Meeting of the Assoc. for Computational Linguistics, 1999, pp. 208–214. 13. McNamee P. and Mayfield J. Comparing cross-language query expansion techniques by degrading translation resources. In Proc. 25th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2002, pp. 159–166. 14. Nie J.-Y., Simard M., Isabelle P., and Durand R. Cross-language information retrieval based on parallel text and automatic mining of parallel text from the Web. In Proc. 22nd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1999, pp. 74–81. 15. Pirkola A., Hedlund T., Keskustalo H., and Järvelin K. Dictionary-based cross-language information retrieval: problems, methods, and research findings. Inform. Retr., 3(3–4):209–230, 2001. 16. Resnik P. and Smith N.A. The Web as a parallel corpus. Comput. Linguist., 29(3):349–380, 2003. 17. Shi L., Niu C., Zhou M., and Gao J. A DOM tree alignment model for mining parallel data from the Web. In Proc. 44th Annual Meeting of the Assoc. for Computational Linguistics, 2006, pp. 489–496. 18. Zhang Y. and Vines P. Using the Web for automated translation extraction in cross-language information retrieval. In Proc. 30th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2004, pp. 162–169. 19. Zhang Y., Vines P., and Zobel J. An empirical comparison of translation disambiguation techniques for Chinese-English cross-language information retrieval. In Proc. 3rd Asia Information Retrieval Symposium, 2006, pp. 666–672. 20. Zhang Y., Vines P., and Zobel J. Chinese OOV translation and post-translation query expansion in Chinese-English
cross-lingual information retrieval. ACM Trans. Asian Lang. Information Proc., 4(2):57–77, 2005. 21. http://www.fjoch.com/GIZA++.html
Cross-language Text Mining ▶ Cross-Language Mining and Retrieval
Cross-language Web Mining ▶ Cross-Language Mining and Retrieval
Cross-lingual Information Retrieval ▶ Cross-Language Mining and Retrieval
Cross-lingual Text Mining ▶ Cross-Language Mining and Retrieval
Cross-media Information Retrieval ▶ Cross-Modal Multimedia Information Retrieval
Cross-Modal Multimedia Information Retrieval
Qing Li, Yu Yang
City University of Hong Kong, Hong Kong, China
Synonyms Multi-modal information retrieval; Cross-media information retrieval
Definition Multimedia information retrieval tries to find the distinctive multimedia documents that satisfy people's needs within a huge dataset. Due to the vagueness in the representation of multimedia data, the user usually has only some clues (e.g., a vague idea, or a rough query object of the same or even a different modality than that of the intended result) rather than concrete and indicative query objects. In such cases, traditional multimedia information retrieval techniques such as Query-By-Example (QBE) fail to retrieve what users really want, since their performance depends on a set of specifically defined features and carefully chosen query objects. The cross-modal multimedia information retrieval (CMIR) framework employs a novel multifaceted knowledge base (embodied by a layered graph model) to discover query results across multiple modalities. Such a cross-modality paradigm leads to better query understanding and returns retrieval results that better meet user needs.
Historical Background Previous works addressing multimedia information retrieval can be classified into two groups: approaches on single-modality, and those on multi-modality integration.
Retrieval Approaches on Single-Modality
The retrieval approaches in this group only deal with a single type of media, so that most content-based retrieval (CBR) approaches [2,3,5,8,9] fall into this group. These approaches differ from each other in either the low-level features extracted from the data or the distance functions used for similarity calculation. Despite the differences, all of them are similar in two fundamental aspects: (i) they all rely on low-level features; (ii) they all use the query-by-example paradigm.
Retrieval Approaches on Multi-Modality Integration
More recently, there are some works that investigate the integration of multi-modality data, usually between text and image, for better retrieval performance. For example, iFind [7] proposes a unified framework under which the semantic feature (text) and low-level features are combined for image retrieval, whereas the 2M2Net [12] system extends this framework to the retrieval of video and audio. WebSEEK [9] extracts keywords from the surrounding text of images and videos, which are used as their indexes in the retrieval process. Although these systems involve more than one medium, the different types of media are not actually integrated but operate on different levels. Usually, text is only used as the annotation (index) of the other media. In this regard, cross-modal multimedia information
retrieval (CMIR) enables an extremely high degree of multi-modality integration, since it allows interaction among objects of any modality in any possible way (via different types of links). MediaNet [1] and the multimedia thesaurus (MMT) [10] seek to provide a multimedia representation of semantic concepts – a concept described by various media objects including text, image, video, etc. – and to establish the relationships among these concepts. MediaNet extends the notion of relationships to include even perceptual relationships among media objects. Both approaches can be regarded as "concept-centric" approaches, since they realize an organization of multi-modality objects around semantic concepts. In contrast, CMIR is "concept-less", since it makes no attempt to explicitly identify the semantics of each object.
Foundations The cross-modality multimedia information retrieval (CMIR) mechanism shapes a novel scenario for multimedia retrieval: the user starts the search by supplying a set of seed objects as hints of his intention, which can be of any modality (even different from that of the intended objects) and are not necessarily eligible results by themselves. From the seeds, the system figures out the user's intention and returns a set of cross-modality objects that potentially satisfy this intention. The user can give further hints by identifying the results approximating his need, based on which the system improves its estimation of the user intention and refines the results towards it. This scenario can also be interpreted as a cooperative process: the user tries to focus the attention of the system on the intended objects by giving hints, while the system tries to return more reasonable results that allow the user to give better hints. A comparison between CMIR and the current CBR approaches is shown in Table 1.
Cross-Modal Multimedia Information Retrieval. Table 1. CBR paradigms, drawbacks, and suggested remedies in CMIR
Interaction – CBR paradigms: highly representative sample object. Drawbacks: vague idea, or clear idea without appropriate samples. Suggested remedies in CMIR: cross-modality seed objects, only as hints.
Data index – CBR paradigms: low-level features. Drawbacks: inadequate to capture semantics. Suggested remedies in CMIR: multifaceted knowledge (user behaviors, structure, content).
Results – CBR paradigms: single-modality, perceptually similar objects. Drawbacks: looks like or sounds like, but not what the user actually needs. Suggested remedies in CMIR: cross-modality, semantically related objects.
To support all the necessary functionalities for such an ideal scenario, a suite of unique models, algorithms, and strategies is developed in CMIR. As shown in Fig. 1, the foundation of the whole mechanism is a multifaceted knowledge base describing the relationships among cross-modality objects. The kernel of the knowledge base is a layered graph model, which characterizes the knowledge on (i) the history of user behaviors, (ii) structural relationships among media objects, and (iii) the content of media objects, at each of its layers. Link structure analysis, an established technique for web-oriented applications, is tailored to the retrieval of cross-modality data based on the layered graph model. A unique relevance feedback technique that gears with the underlying graph model is proposed, which can enrich the knowledge base by updating the links of the graph model according to user behaviors. The loop in Fig. 1 reveals the hill-climbing nature of the CMIR mechanism, i.e., it enhances its performance by learning from previously conducted queries and feedbacks.
Cross-Modal Multimedia Information Retrieval. Figure 1. Overview of the CMIR mechanism.
Layered Graph Model
As the foundation of the retrieval capability, the multifaceted knowledge base accommodates a broad range of knowledge indicative of data semantics, mainly in three aspects: (i) user behaviors in the user-system interaction, (ii) structural relationships among media objects, and (iii) the content of each media object. The kernel of the knowledge base is a three-layer graph model, with each layer describing the knowledge in one aspect, called a knowledge layer. Its formal definition is given as follows. Definition 1 A knowledge layer is an undirected graph
G = (V, E), where V is a finite set of vertices and E is a finite set of edges. Each element in V corresponds to a media object Oi ∈ O, where O is the collection of media
objects in the database. E is a ternary relation defined on V × V × R, where R represents the real numbers. Each edge in E has the form <Oi, Oj, r>, denoting a semantic link between Oi and Oj with r as the weight of the link. The graph corresponds to a |V| × |V| adjacency matrix (the adjacency matrix defined here is slightly different from the conventional definition in mathematics, in which each component is a binary value indicating the existence of the corresponding edge) M = [mij], where
mij = mji always holds. Each element mij = r if there is an edge <Oi, Oj, r>, and mij = 0 if there is no edge between Oi and Oj. The elements on the diagonal are set to zero, i.e., mii = 0. Each semantic link between two media objects may have various interpretations, corresponding to one of three cases: (i) a user has implied the relevance between the two objects during interaction, e.g., by designating them as positive examples in the same query session; (ii) there is a structural relationship between them, e.g., they come from the same or linked web page(s); or (iii) they resemble each other in terms of their content. The multifaceted knowledge base seamlessly integrates all these links into the same model while preserving their mutual independence. Definition 2 The multifaceted knowledge base is a
layered graph model consisting of three superimposed knowledge layers, which from top to bottom are the user layer, the structure layer, and the content layer. The vertices of the three layers correspond to the same set of media objects, but their edges are different either in occurrences or in interpretations. Figure 2 illustrates the layered graph model.
Cross-Modal Multimedia Information Retrieval. Figure 2. The layered graph model as multifaceted knowledge base.
Note that the ordering of the three layers is immutable, which reflects their priorities in terms of knowledge reliability. The user layer is placed uppermost since user judgment is assumed to be most reliable (though not necessarily always reliable). Structure links are strong indicators of relevance, but not as reliable as user links. The lowest layer is the content layer. As a generally accepted fact in
the CBR area, content similarity does not entail any well-defined mapping to semantics. A unique property of the layered graph model is that it stores the knowledge on the links (or relationships) among media objects, rather than on the nodes (media objects), on which most existing retrieval systems store their data index. All the algorithms based on this model can be interpreted as manipulations of links: to serve the user query, relevant knowledge is extracted from this graph model by analyzing the link structure; meanwhile, user behaviors are studied to enrich the knowledge by updating the links. An advantage of such a link-based approach is that the retrieval can be performed in a relatively small locality connected via links instead of over the whole database, and therefore it can afford more sophisticated retrieval algorithms.
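To make the link-based representation concrete, the sketch below keeps one adjacency matrix per knowledge layer and ranks objects by spreading weight from a set of seed objects over the combined links. The layer weights and the one-step spreading are deliberate simplifications of the link-structure analysis used in CMIR, and all names are illustrative.

import numpy as np

class LayeredGraph:
    def __init__(self, n_objects):
        # one symmetric adjacency matrix per knowledge layer
        self.layers = {name: np.zeros((n_objects, n_objects))
                       for name in ("user", "structure", "content")}

    def add_link(self, layer, i, j, weight):
        self.layers[layer][i, j] = self.layers[layer][j, i] = weight

    def combined(self, layer_weights=None):
        # user links are trusted most, content links least (assumed weights)
        layer_weights = layer_weights or {"user": 0.5, "structure": 0.3, "content": 0.2}
        return sum(w * self.layers[name] for name, w in layer_weights.items())

def rank_from_seeds(graph, seeds, top_k=10):
    """Spread activation one step from the seed objects and rank the rest."""
    m = graph.combined()
    activation = np.zeros(m.shape[0])
    activation[list(seeds)] = 1.0
    scores = m @ activation                  # candidates reachable via links
    scores[list(seeds)] = -np.inf            # do not return the seeds themselves
    return list(np.argsort(-scores)[:top_k])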
Link Analysis Based Retrieval
As illustrated in Fig. 3, the retrieval process can be described as a circle: the intended objects are retrieved through the upper semicircle, and the user evaluations are studied and incorporated into the knowledge base through the lower semicircle, which initiates a new circle to refine the previously retrieved results based on the updated knowledge. Consequently, it is a hill-climbing approach in that the performance is enhanced incrementally as the loop is repeated. The retrieval process consists of five steps (as shown in Fig. 3): (i) generate the seed objects as hints of the user's intention; (ii) span the seeds to a collection of candidate objects via the links in the layered graph model; (iii) distill the results by ranking the candidates based on link structure analysis; (iv) update the knowledge base to incorporate the user evaluation of the current results; and (v) refine the results based on user evaluations.
Cross-Modal Multimedia Information Retrieval. Figure 3. Overview of the link analysis based retrieval algorithm.
Key Applications
Multimedia Information Retrieval System
For multimedia data, the modalities supported can be texts (surrounding or tagged), images, videos, and audios. An ongoing prototype [11] utilizes the primitive features and similarity functions for these media shown in Table 2. The experimental results prove the usefulness of the approach for better query understanding.
Cross-Modal Multimedia Information Retrieval. Table 2. Primitive features and similarity functions used in the prototype
Text – Primitive features: keywords, weighted by TF*IDF. Similarity function: cosine distance.
Image – Primitive features: 256-d HSV color histogram, 64-d LAB color coherence, 32-d Tamura directionality. Similarity function: Euclidean distance for each feature, linear combination of the different similarities.
Video – Primitive features: the first frame of each shot as key-frame, indexing the key-frame as an image. Similarity function: key-frame (image) similarity as shot similarity, average pair-wise shot similarity as video similarity.
Future Directions
Due to the generality and extensibility of CMIR, there are many potential directions that can be implemented on it:
Navigation. The graph model provides abundant links through which the user can traverse from an object to its related objects. An intuitive scenario for navigation is that, when the user is looking at a certain object, he is recommended the objects that are linked to it in the graph model, ranked by their link weights and link types, from which he may select one as the next object to navigate to.
Clustering. Clustering cross-modality objects into semantically meaningful groups is also an important and challenging issue, which requires an underlying similarity function among objects, along with a method that produces clusters based on the similarity function. The layered graph model provides knowledgeable and rich links, based on which different similarity functions can be easily formulated. Meanwhile, many existing approaches can be employed as the clustering method, such as the simulated and deterministic annealing algorithms [4]. Moreover, CMIR inherently allows the clustering of cross-modality objects, rather than the single-modality objects that most previous classification approaches can deal with.
Personalized retrieval. The user layer of the graph model characterizes the knowledge obtained from the behaviors of the whole population of users, and allows a query from a single user to benefit from such common knowledge. However, each user may have his/her personal interests, which may not agree with each other. The "multi-leveled user profile" mechanism [6] points to a good direction for future study.
5. Huang T.S., Mehrotra S., and Ramchandran K. Multimedia analysis and retrieval system (MARS) project. In Proc. 33rd Annual Clinic on Library Application of Data Processing-Digital Image Access and Retrieval, 1996. 6. Li Q., Yang J., and Zhuang Y.T. Web-based multimedia retrieval: balancing out between common knowledge and personalized views. In Proc. 2nd Int. Conf. on Web Information Systems Eng., 2001. 7. Lu Y., Hu C.H., Zhu X.Q., Zhang H.J., and Yang Q. A unified framework for semantics and feature based relevance feedback in image retrieval systems. In Proc. 8th ACM Int. Conf. on Multimedia, 2000, pp. 31–38. 8. Smith J.R. and Chang S.F. VisualSEEk: a fully automated content-based image query system. In Proc. 4th ACM Int. Conf. on Multimedia, 1996. 9. Smith J.R. and Chang S.F. Visually searching the web for content. IEEE Multimed. Mag., 4(3):12–20, 1997. 10. Tansley R. The Multimedia Thesaurus: An Aid for Multimedia Information Retrieval and Navigation. Master Thesis, Computer Science, University of Southampton, UK, 1998. 11. Yang J., Li Q., and Zhuang Y. Octopus: Aggressive search of multi-modality data using multifaceted knowledge base. In Proc. 11th Int. World Wide Web Conference, 2002, pp. 54–64. 12. Yang J., Zhuang Y.T., and Li Q. Search for multi-modality data in digital libraries. In Proc. Second IEEE Pacific-Rim Conference on Multimedia, 2001.
Cross-references
▶ Multimedia Data ▶ Multimedia Information Retrieval
Recommended Reading
1. Benitez A.B., Smith J.R., and Chang S.F. MediaNet: a multimedia information network for knowledge representation. In Proc. SPIE Conf. on Internet Multimedia Management Systems, vol. 4210, 2000, pp. 1–12.
2. Chang S.F., Chen W., Meng H.J., Sundaram H., and Zhong D. VideoQ: an automated content based video search system using visual cues. In Proc. 5th ACM Int. Conf. on Multimedia, 1997.
3. Flickner M., Sawhney H., Niblack W., and Ashley J. Query by image and video content: the QBIC system. IEEE Comput., 28(9):23–32, 1995.
4. Hofmann T. and Buhmann J.M. Pairwise data clustering by deterministic annealing. IEEE Trans. Pattern Anal. Mach. Intell., 19(1):1–14, 1997.
5. Huang T.S., Mehrotra S., and Ramchandran K. Multimedia analysis and retrieval system (MARS) project. In Proc. 33rd Annual Clinic on Library Application of Data Processing - Digital Image Access and Retrieval, 1996.
6. Li Q., Yang J., and Zhuang Y.T. Web-based multimedia retrieval: balancing out between common knowledge and personalized views. In Proc. 2nd Int. Conf. on Web Information Systems Eng., 2001.
7. Lu Y., Hu C.H., Zhu X.Q., Zhang H.J., and Yang Q. A unified framework for semantics and feature based relevance feedback in image retrieval systems. In Proc. 8th ACM Int. Conf. on Multimedia, 2000, pp. 31–38.
8. Smith J.R. and Chang S.F. VisualSEEk: a fully automated content-based image query system. In Proc. 4th ACM Int. Conf. on Multimedia, 1996.
9. Smith J.R. and Chang S.F. Visually searching the web for content. IEEE Multimed. Mag., 4(3):12–20, 1997.
10. Tansley R. The Multimedia Thesaurus: An Aid for Multimedia Information Retrieval and Navigation. Master Thesis, Computer Science, University of Southampton, UK, 1998.
11. Yang J., Li Q., and Zhuang Y. Octopus: aggressive search of multi-modality data using multifaceted knowledge base. In Proc. 11th Int. World Wide Web Conference, 2002, pp. 54–64.
12. Yang J., Zhuang Y.T., and Li Q. Search for multi-modality data in digital libraries. In Proc. 2nd IEEE Pacific-Rim Conference on Multimedia, 2001.
Cross-Validation PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University, Tempe, AZ, USA
Synonyms Rotation estimation
Definition

Cross-Validation is a statistical method of evaluating and comparing learning algorithms by dividing data into two segments: one used to learn or train a model and the other used to validate the model. In typical cross-validation, the training and validation sets must cross-over in successive
rounds such that each data point has a chance of being validated against. The basic form of cross-validation is k-fold cross-validation. Other forms of cross-validation are special cases of k-fold cross-validation or involve repeated rounds of k-fold cross-validation. In k-fold cross-validation, the data is first partitioned into k equally (or nearly equally) sized segments or folds. Subsequently k iterations of training and validation are performed such that within each iteration a different fold of the data is held out for validation while the remaining k - 1 folds are used for learning. Fig. 1 demonstrates an example with k = 3. The darker sections of the data are used for training while the lighter sections are used for validation. In data mining and machine learning, 10-fold cross-validation (k = 10) is the most common. Cross-validation is used to evaluate or compare learning algorithms as follows: in each iteration, one or more learning algorithms use k - 1 folds of data to learn one or more models, and subsequently the learned models are asked to make predictions about the data in the validation fold. The performance of each learning algorithm on each fold can be tracked using some predetermined performance metric like accuracy. Upon completion, k samples of the performance metric will be available for each algorithm. Different methodologies such as averaging can be used to obtain an aggregate measure from these samples, or these samples can be used in a statistical hypothesis test to show that one algorithm is superior to another.

Cross-Validation. Figure 1. Procedure of three-fold cross-validation.
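To make the procedure concrete, here is a minimal, illustrative Python sketch of k-fold cross-validation (not part of the original entry); the fold-assignment scheme, the train_fn/predict_fn callables, and the use of accuracy as the metric are assumptions made for the example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split the indices 0..n-1 into k (nearly) equally sized folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]  # fold i takes every k-th shuffled index

def cross_validate(data, labels, train_fn, predict_fn, k=10):
    """Return the k per-fold accuracies of one learning algorithm."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for test_idx in folds:
        # All other folds form the training set of this iteration.
        train_idx = [j for fold in folds if fold is not test_idx for j in fold]
        model = train_fn([data[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        hits = sum(predict_fn(model, data[j]) == labels[j] for j in test_idx)
        scores.append(hits / len(test_idx))
    return scores
```

Averaging the returned list gives the cross-validated accuracy estimate, and the k individual values are the performance samples used by the hypothesis tests discussed below.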
Historical Background

In statistics or data mining, a typical task is to learn a model from available data. Such a model may be a regression model or a classifier. The problem with evaluating such a model is that it may demonstrate adequate prediction capability on the training data, but might fail to predict future unseen data. Cross-validation is a procedure for estimating the generalization performance in this context. The idea for cross-validation originated in the 1930s [6]. In that paper, one sample is used for regression and a second for prediction. Mosteller and Tukey [9], and various other people, further developed the idea. A clear statement of cross-validation, which is similar to the current version of k-fold cross-validation, first appeared in [8]. In the 1970s, both Stone [12] and Geisser [4] employed cross-validation as a means for choosing proper model parameters, as opposed to using cross-validation purely for estimating model performance. Currently, cross-validation is widely accepted in the data mining and machine learning community, and serves as a standard procedure for performance estimation and model selection.

Foundations

There are two possible goals in cross-validation:
To estimate the performance of the learned model from available data using one algorithm; in other words, to gauge the generalizability of an algorithm.
To compare the performance of two or more different algorithms and find out the best algorithm for the available data, or alternatively to compare the performance of two or more variants of a parameterized model.
The above two goals are highly related, since the second goal is automatically achieved if one knows the accurate estimates of performance. Given a sample of N data instances and a learning algorithm A, the average cross-validated accuracy of A on these N instances may be taken as an estimate for the accuracy of A on unseen data when A is trained on all N instances. Alternatively, if the end goal is to compare two learning algorithms, the performance samples obtained through cross-validation can be used to perform two-sample statistical hypothesis tests, comparing a pair of learning algorithms. Concerning these two goals, various procedures have been proposed:
Resubstitution Validation

In resubstitution validation, the model is learned from all the available data and then tested on the same set of data. This validation process uses all the available data but suffers seriously from over-fitting. That is, the algorithm might perform well on the available data yet poorly on future unseen test data.

Hold-Out Validation

To avoid over-fitting, an independent test set is preferred. A natural approach is to split the available data into two non-overlapping parts: one for training and the other for testing. The test data is held out and not looked at during training. Hold-out validation avoids the overlap between training data and test data, yielding a more accurate estimate for the generalization performance of the algorithm. The downside is that this procedure does not use all the available data and the results are highly dependent on the choice of the training/test split. The instances chosen for inclusion in the test set may be too easy or too difficult to classify, and this can skew the results. Furthermore, the data in the test set may be valuable for training, and if it is held out, prediction performance may suffer, again leading to skewed results. These problems can be partially addressed by repeating hold-out validation multiple times and averaging the results, but unless this repetition is performed in a systematic manner, some data may be included in the test set multiple times while others are not included at all, or conversely some data may always fall in the test set and never get a chance to contribute to the learning phase. To deal with these challenges and utilize the available data to the maximum, k-fold cross-validation is used.

K-Fold Cross-Validation

In k-fold cross-validation the data is first partitioned into k equally (or nearly equally) sized segments or folds. Subsequently k iterations of training and validation are performed such that within each iteration a different fold of the data is held out for validation while the remaining k - 1 folds are used for learning. Data is commonly stratified prior to being split into k folds. Stratification is the process of rearranging the data so as to ensure that each fold is a good representative of the whole. For example, in a binary classification problem where each class comprises 50% of the data, it is best to arrange the data such that in every fold, each class comprises around half the instances.

Leave-One-Out Cross-Validation

Leave-one-out cross-validation (LOOCV) is a special case of k-fold cross-validation where k equals the number of instances in the data. In other words, in each iteration nearly all the data except for a single observation are used for training and the model is tested on that single observation. An accuracy estimate obtained using LOOCV is known to be almost unbiased, but it has high variance, leading to unreliable estimates [3]. It is still widely used when the available data are very scarce, especially in bioinformatics where only dozens of data samples are available.

Repeated K-Fold Cross-Validation

To obtain reliable performance estimation or comparison, a large number of estimates is always preferred. In k-fold cross-validation, only k estimates are obtained. A commonly used method to increase the number of estimates is to run k-fold cross-validation multiple times. The data is reshuffled and re-stratified before each round.
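The stratification step described in the k-fold subsection above, and its repetition across rounds, can be sketched as follows; the round-robin assignment per class is one simple way to realize it and is assumed here for illustration only.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign instance indices to k folds so that each fold roughly
    preserves the class proportions of the whole data set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)  # deal each class round-robin over the folds
    return folds

def repeated_stratified_folds(labels, k, rounds):
    """Reshuffle and re-stratify before each round, as described above."""
    return [stratified_folds(labels, k, seed=r) for r in range(rounds)]
```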
Pros and Cons

Kohavi [5] compared several approaches to estimating accuracy: cross-validation (including regular cross-validation, leave-one-out cross-validation, and stratified cross-validation) and bootstrap (sampling with replacement), and recommended stratified 10-fold cross-validation as the best model-selection method, as it tends to provide less biased estimation of the accuracy. Salzberg [11] studies the issue of comparing two or more learning algorithms based on a performance metric, and proposes using k-fold cross-validation followed by an appropriate hypothesis test rather than directly comparing the average accuracies. The paired t-test is one test that takes into consideration the variance of training and test data, and is widely used in machine learning. Dietterich [2] studied the properties of 10-fold cross-validation followed by a paired t-test in detail and found that such a test suffers from higher than expected Type I error. In this study, this high Type I error was attributed to high variance. To correct for this, Dietterich proposed a new test: 5×2-fold cross-validation. In this test, 2-fold cross-validation is run five times, resulting in 10 accuracy values. The data is re-shuffled and re-stratified after each round. All 10 values are used for average accuracy estimation in the t-test, but only values from one of the five 2-fold cross-validation rounds are used to estimate variance. In this study, 5×2-fold cross-validation is shown to have acceptable Type I error but not to be as powerful as 10-fold cross-validation, and it has not been widely accepted in the data mining community. Bouckaert [1] also studies the problem of inflated Type I error with 10-fold cross-validation and argues that, since the samples are dependent (because the training sets overlap), the actual degrees of freedom is much lower than theoretically expected. This study compared a large number of hypothesis schemes, and recommends 10×10-fold cross-validation to obtain 100 samples, followed by a t-test with degrees of freedom equal to 10 (instead of 99). However, this method has not been widely adopted in the data mining field either, and 10-fold cross-validation remains the most widely used validation procedure. A brief summary of the above results is presented in Table 1.
Cross-Validation. Table 1. Pros and cons of different validation methods

Resubstitution validation. Pros: simple. Cons: over-fitting.
Hold-out validation. Pros: independent training and test. Cons: reduced data for training and testing; large variance.
k-fold cross-validation. Pros: accurate performance estimation. Cons: small samples of performance estimation; overlapped training data; elevated Type I error for comparison; underestimated performance variance or overestimated degrees of freedom for comparison.
Leave-one-out cross-validation. Pros: unbiased performance estimation. Cons: very large variance.
Repeated k-fold cross-validation. Pros: large number of performance estimates. Cons: overlapped training and test data between each round; underestimated performance variance or overestimated degrees of freedom for comparison.

Why 10-Fold Cross-Validation: From Ideal to Reality

Whether estimating the performance of a learning algorithm or comparing two or more algorithms in terms of their ability to learn, an ideal or statistically sound experimental design must provide a sufficiently large number of independent measurements of the algorithm(s) performance. To make independent measurements of an algorithm's performance one must ensure that the factors affecting the measurement are independent from one run to the next. These factors are: (i) the training data the algorithm learns from and (ii) the test data one uses to measure the algorithm's performance. If some data is used for testing in more than one round, the obtained results, for example the accuracy measurements from these two rounds, will be dependent, and a statistical comparison may not be valid. In fact, it has been shown that a paired t-test based on taking several random train/test splits tends to have an
extremely high probability of Type I error and should never be used [2]. Not only must the datasets be independently controlled across different runs, there must not be any overlap between the data used for learning and the data used for validation in the same run. Typically, a learning algorithm can make more accurate predictions on data that it has seen during the learning phase than on data it has not. For this reason, an overlap between the training and validation set can lead to an over-estimation of the performance metric and is forbidden. To satisfy the other requirement, namely a sufficiently large sample, most statisticians call for 30+ samples. For a truly sound experimental design, one would have to split the available data into 30 × 2 = 60 partitions to perform 30 truly independent train-test runs. However, this is not practical, because the performance of learning algorithms and their ranking is generally not invariant with respect to the number of
samples available for learning. In other words, an estimate of accuracy in such a case would correspond to the accuracy of the learning algorithm when it learns from just 1/60 of the available data (assuming training and validation sets are of the same size). However, the accuracy of the learning algorithm on unseen data when the algorithm is trained on all the currently available data is likely much higher, since learning algorithms generally improve in accuracy as more data becomes available for learning. Similarly, when comparing two algorithms A and B, even if A is discovered to be the superior algorithm when using 1/60 of the available data, there is no guarantee that it will also be the superior algorithm when using all the available data for learning. Many high performing learning algorithms use complex models with many parameters and they simply will not perform well with a very small amount of data. But they may be exceptional when sufficient data is available to learn from. Recall that two factors affect the performance measure: the training set and the test set. The training set affects the measurement indirectly through the learning algorithm, whereas the composition of the test set has a direct impact on the performance measure. A reasonable experimental compromise may be to allow for overlapping training sets, while keeping the test sets independent. K-fold cross-validation does just that. Now the issue becomes selecting an appropriate value for k. A large k is seemingly desirable, since with a larger k (i) there are more performance estimates, and (ii) the training set size is closer to the full data size, thus increasing the possibility that any conclusion made about the learning algorithm(s) under test will generalize to the case where all the data is used to train the learning model. As k increases, however, the overlap between training sets also increases. For example, with 5-fold cross-validation, each training set shares only 3/4 of its instances with each of the other four training sets, whereas with 10-fold cross-validation, each training set shares 8/9 of its instances with each of the other nine training sets. Furthermore, increasing k shrinks the size of the test set, leading to less precise, less fine-grained measurements of the performance metric. For example, with a test set size of 10 instances, one can only measure accuracy to the nearest 10%, whereas with 20 instances the accuracy can be measured to the nearest 5%. These competing factors have all been considered and the general consensus in the data mining community seems to be
that k = 10 is a good compromise. This value of k is particularly attractive because it makes predictions using 90% of the data, making it more likely to be generalizable to the full data.
Key Applications

Cross-validation can be applied in three contexts: performance estimation, model selection, and tuning learning model parameters.

Performance Estimation
As previously mentioned, cross-validation can be used to estimate the performance of a learning algorithm. One may be interested in obtaining an estimate for any of the many performance indicators such as accuracy, precision, recall, or F-score. Cross-validation allows for all the data to be used in obtaining an estimate. Most commonly, one wishes to estimate the accuracy of a classifier in a supervised-learning environment. In such a setting, a certain amount of labeled data is available and one wishes to predict how well a certain classifier would perform if the available data is used to train the classifier and it is subsequently asked to label unseen data. Using 10-fold cross-validation, one repeatedly uses 90% of the data to build a model and tests its accuracy on the remaining 10%. The resulting average accuracy is likely somewhat of an underestimate for the true accuracy when the model is trained on all data and tested on unseen data, but in most cases this estimate is reliable, particularly if the amount of labeled data is sufficiently large and if the unseen data follows the same distribution as the labeled examples.

Model Selection
Alternatively, cross-validation may be used to compare a pair of learning algorithms. This may be done in the case of newly developed learning algorithms, in which case the designer may wish to compare the performance of the classifier with some existing baseline classifier on some benchmark dataset, or it may be done in a generalized model-selection setting. In generalized model selection one has a large library of learning algorithms or classifiers to choose from and wishes to select the model that will perform best for a particular dataset. In either case the basic unit of work is pair-wise comparison of learning algorithms. For generalized model selection, combining the results of many pair-wise comparisons to obtain a single best algorithm may be difficult, but this is beyond the
scope of this article. Researchers have shown that when comparing a pair of algorithms using cross-validation it is best to employ proper two-sample hypothesis testing instead of directly comparing the average accuracies. Cross-validation yields k pairs of accuracy values for the two algorithms under test. It is possible to make a null hypothesis assumption that the two algorithms perform equally well and set out to gather evidence against this null hypothesis using a two-sample test. The most widely used test is the paired t-test. Alternatively, the non-parametric sign test can be used. A special case of model selection comes into play when dealing with non-classification model selection, for example when trying to pick a feature selection [7] algorithm that will maximize a classifier's performance on a particular dataset. Refaeilzadeh et al. [10] explore this issue in detail and explain that there are in fact two variants of cross-validation in this case: performing feature selection before splitting data into folds (OUT) or performing feature selection k times inside the cross-validation loop (IN). The paper explains that there is potential for bias in both cases: with OUT, the feature selection algorithm has looked at the test set, so the accuracy estimate is likely inflated; on the other hand, with IN the feature selection algorithm is looking at less data than would be available in a real experimental setting, leading to underestimated accuracy. Experimental results confirm these hypotheses and further show that: in cases where the two feature selection algorithms are not statistically differentiable, IN tends to be more truthful; in cases where one algorithm is better than another, IN often favors one algorithm and OUT the other. OUT can in fact be the better choice even if it demonstrates a larger bias than IN in estimating accuracy. In other words, estimation bias is not necessarily an indication of poor pair-wise comparison. These subtleties about the potential for bias and validity of conclusions obtained through cross-validation should always be kept in mind, particularly when the model selection task is a complicated one involving pre-processing as well as learning steps.
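As an illustration of the pair-wise comparison described above, the following sketch computes the paired t statistic from the k paired accuracy values that cross-validation yields for two algorithms on the same folds; the formula is the standard paired one, the accuracy values shown are invented for the example, and the caveats about dependent samples discussed in the Foundations section still apply.

```python
import math

def paired_t_statistic(acc_a, acc_b):
    """Paired t statistic over per-fold accuracies of two algorithms
    evaluated on the same k folds (degrees of freedom = k - 1)."""
    k = len(acc_a)
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)  # sample variance of differences
    if var == 0:
        return 0.0 if mean == 0 else math.copysign(float('inf'), mean)
    return mean / math.sqrt(var / k)

# Illustrative per-fold accuracies for algorithms A and B on the same 10 folds:
acc_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80, 0.85, 0.79]
acc_b = [0.78, 0.77, 0.82, 0.79, 0.80, 0.76, 0.81, 0.78, 0.83, 0.77]
t = paired_t_statistic(acc_a, acc_b)  # compare against a t distribution with 9 df
```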
Tuning

Many classifiers are parameterized and their parameters can be tuned to achieve the best result with a
particular dataset. In most cases it is easy to learn the proper value for a parameter from the available data. Suppose a Naïve Bayes classifier is being trained on a dataset with two classes: {+, –}. One of the parameters for this classifier is the prior probability p(+). The best value for this parameter according to the available data can be obtained by simply counting the number of instances that are labeled positive and dividing this number by the total number of instances. However, in some cases parameters do not have such intrinsic meaning, and there is no good way to pick the best value other than trying out many values and picking the one that yields the highest performance. For example, support vector machines (SVM) use soft margins to deal with noisy data. There is no easy way of learning the best value for the soft-margin parameter for a particular dataset other than trying it out and seeing how it works. In such cases, cross-validation can be performed on the training data so as to measure the performance with each value being tested. Alternatively, a portion of the training set can be reserved for this purpose and not used in the rest of the learning process. But if the amount of labeled data is limited, this can significantly degrade the performance of the learned model and cross-validation may be the best option.
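A sketch of this tuning loop, reusing the cross_validate helper sketched earlier in this entry; the candidate values and the train_svm/predict_svm names in the commented usage are placeholders for whatever learner is being tuned, not an actual API.

```python
def tune_parameter(data, labels, candidates, make_train_fn, predict_fn, k=10):
    """Pick the candidate value whose k-fold cross-validated accuracy is highest."""
    best_value, best_score = None, float('-inf')
    for value in candidates:
        scores = cross_validate(data, labels, make_train_fn(value), predict_fn, k)
        mean_acc = sum(scores) / len(scores)
        if mean_acc > best_score:
            best_value, best_score = value, mean_acc
    return best_value, best_score

# e.g., soft-margin values for an SVM-like learner (train_svm/predict_svm, X, y are placeholders):
# best_c, _ = tune_parameter(X, y, [0.01, 0.1, 1, 10, 100],
#                            lambda c: (lambda Xtr, ytr: train_svm(Xtr, ytr, c)),
#                            predict_svm)
```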
Cross-references
▶ Classification ▶ Evaluation Metrics for Structured Text Retrieval ▶ Feature Selection for Clustering
Recommended Reading 1. Bouckaert R.R. Choosing between two learning algorithms based on calibrated tests. In Proc. 20th Int. Conf. on Machine Learning, 2003, pp. 51–58. 2. Dietterich T.G. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput., 10(7):1895–1923, 1998. 3. Efron B. Estimating the error rate of a prediction rule: improvement on cross-validation. J. Am. Stat. Assoc., 78: 316–331,1983. 4. Geisser S. The predictive sample reuse method with applications. J. Am. Stat. Assoc., 70(350):320–328,1975. 5. Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proc. 14th Int. Joint Conf. on AI, 1995, pp. 1137–1145. 6. Larson S. The shrinkage of the coefficient of multiple correlation. J. Educat. Psychol., 22:45–55, 1931. 7. Liu H. and Yu L. Toward integrating feature selection algorithms for classification and clustering. IEEE Trans. Knowl. Data Eng., 17(4):491–502, 2005.
8. Mosteller F. and Tukey J.W. Data analysis, including statistics. In Handbook of Social Psychology. Addison-Wesley, Reading, MA, 1968. 9. Mosteller F. and Wallace D.L. Inference in an authorship problem. J. Am. Stat. Assoc., 58:275–309, 1963. 10. Refaeilzadeh P., Tang L., and Liu H. On comparison of feature selection algorithms. In Proc. AAAI-07 Workshop on Evaluation Methods in Machine Learing II. 2007, pp. 34–39. 11. Salzberg S. On comparing classifiers: pitfalls to avoid and a recommended approach. Data Min. Knowl. Disc., 1(3):317–328, 1997. 12. Stone M. Cross-validatory choice and assessment of statistical predictions. J. Royal Stat. Soc., 36(2):111–147, 1974.
Cryptographic Hash Functions ▶ Hash Functions
C-Tables ▶ Conditional Tables
Cube TORBEN BACH PEDERSEN Aalborg University, Aalborg, Denmark
Synonyms Hypercube
Definition

A cube is a data structure for storing and analyzing large amounts of multidimensional data, often referred to as On-Line Analytical Processing (OLAP). Data in a cube lives in a space spanned by a number of hierarchical dimensions. A single point in this space is called a cell. A (non-empty) cell contains the values of one or more measures.
Key Points

As an example, a three-dimensional cube for capturing sales may have a Product dimension P, a Time dimension T, and a Store dimension S, capturing the product sold, the time of sale, and the store it was sold in, for each sale, respectively. The cube has two measures: DollarSales and ItemSales, capturing the sales price and the number of items sold, respectively. In a cube, the
combinations of a dimension value from each dimension define a cell of the cube. The measure value(s), e.g., DollarSales and ItemSales, corresponding to the particular combination of dimension values are then stored in the corresponding cells. Data cubes provide true multidimensionality. They generalize spreadsheets to any number of dimensions; indeed, cubes are popularly referred to as "spreadsheets on steroids." In addition, hierarchies in dimensions and formulas are first-class, built-in concepts, meaning that these are supported without duplicating their definitions. A collection of related cubes is commonly referred to as a multidimensional database or a multidimensional data warehouse. In a cube, dimensions are first-class concepts with associated domains, meaning that the addition of new dimension values is easily handled. Although the term "cube" implies three dimensions, a cube can have any number of dimensions. It turns out that most real-world cubes have 4–12 dimensions [3]. Although there is no theoretical limit to the number of dimensions, current tools often experience performance problems when the number of dimensions is more than 10–15. To better suggest the high number of dimensions, the term "hypercube" is often used instead of "cube." Depending on the specific application, a highly varying percentage of the cells in a cube are non-empty, meaning that cubes range from sparse to dense. Cubes tend to become increasingly sparse with increasing dimensionality and with increasingly finer granularities of the dimension values. A non-empty cell is called a fact. The example has a fact for each combination of time, product, and store where at least one sale was made. Generally, only two or three dimensions may be viewed at the same time, although for low-cardinality dimensions, up to four dimensions can be shown by nesting one dimension within another on the axes. Thus, the dimensionality of a cube is reduced at query time by projecting it down to two or three dimensions via aggregation of the measure values across the projected-out dimensions. For example, to view sales by Store and Time, data is aggregated over the entire Product dimension, i.e., for all products, for each combination of Store and Time. OLAP SQL extensions for cubes were pioneered by the proposal of the data cube operators CUBE and ROLLUP [1]. The CUBE operator generalizes GROUP BY, crosstabs, and subtotals using the special "ALL" value that denotes that an aggregation has been performed
over all values for one or more attributes, thus generating a subtotal, or a grand total.
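As an illustration (not part of the entry), the following Python sketch computes what the CUBE operator produces for a toy sales fact table, using SUM as the aggregate and the special 'ALL' value for every projected-out dimension; the table contents are invented for the example.

```python
from itertools import combinations
from collections import defaultdict

def cube(rows, dims, measure):
    """For every subset of dims, aggregate (SUM) the measure; projected-out
    dimensions take the special value 'ALL', as in the CUBE operator."""
    result = {}
    for r in range(len(dims) + 1):
        for grouping in combinations(dims, r):
            agg = defaultdict(float)
            for row in rows:
                key = tuple(row[d] if d in grouping else 'ALL' for d in dims)
                agg[key] += row[measure]
            result[grouping] = dict(agg)
    return result

sales = [
    {'Product': 'p1', 'Time': 't1', 'Store': 's1', 'DollarSales': 10.0},
    {'Product': 'p1', 'Time': 't2', 'Store': 's2', 'DollarSales': 7.5},
    {'Product': 'p2', 'Time': 't1', 'Store': 's1', 'DollarSales': 4.0},
]
c = cube(sales, ('Product', 'Time', 'Store'), 'DollarSales')
# c[()] holds the grand total row ('ALL', 'ALL', 'ALL'); c[('Product',)] the
# per-product subtotals; c[('Product', 'Time', 'Store')] the base group-bys.
```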
Cross-references
▶ Cube Implementations ▶ Dimension ▶ Hierarchy ▶ Measure ▶ Multidimensional Modeling ▶ On-Line Analytical Processing
Recommended Reading
1. Gray J., Chaudhuri S., Bosworth A., Layman A., Venkatrao M., Reichart D., Pellow F., and Pirahesh H. Data cube: a relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining Knowl. Discov., 1(1):29–54, 1997.
2. Pedersen T.B., Jensen C.S., and Dyreson C.E. A foundation for capturing and querying complex multidimensional data. Inf. Syst., 26(5):383–423, 2001.
3. Thomsen E. OLAP Solutions: Building Multidimensional Information Systems. Wiley, New York, 1997.
Cube Implementations KONSTANTINOS MORFONIOS, YANNIS IOANNIDIS University of Athens, Athens, Greece
Synonyms Cube materialization; Cube precomputation
Definition

Cube implementation involves the procedures of computation, storage, and manipulation of a data cube, which is a disk structure that stores the results of the aggregate queries that group the tuples of a fact table on all possible combinations of its dimension attributes. For example, in Fig. 1a, assuming that R is a fact table that consists of three dimensions (A, B, C) and one measure M (see the definitional entry for Measure), the corresponding cube of R appears in Fig. 1b. Each cube node (i.e., each view that belongs to the data cube) stores the results of a particular aggregate query, as shown in Fig. 1b. Clearly, if D denotes the number of dimensions of a fact table, the number of all possible aggregate queries is 2^D; hence, in the worst case, the size of the data cube is exponentially larger with respect to D than the size of the original fact table. In typical applications, this may be in the order of gigabytes or even terabytes, implying that the development of efficient
algorithms for the implementation of cubes is extremely important. Let grouping attributes be the attributes of the fact table that participate in the group-by clause of an aggregate query expressed in SQL. A common representation of the data cube that captures the computational dependencies among all the aggregate queries that are necessary for its materialization is the cube lattice [6]. This is a directed acyclic graph (DAG) where each node represents an aggregate query q on the fact table and is connected via a directed edge with every other node whose corresponding group-by part is missing one of the grouping attributes of q. For example, Fig. 2 shows the cube lattice that corresponds to the fact table R (Fig. 1a). Note that precomputing and materializing parts of the cube is crucial for the improvement of query-response times as well as for accelerating operators that are common in On-Line Analytical Processing (OLAP), such as drill-down, roll-up, pivot, and slice-and-dice, which make an extensive use of aggregation [3]. Materialization of the entire cube seems ideal for efficiently accessing aggregated data; nevertheless, in real-world applications, which typically involve large volumes of data, it may be considerably expensive in terms of storage space, as well as computation and maintenance time. In the existing literature, several efficient methods have been proposed that attempt to balance the aforementioned tradeoff between query-response times and other resource requirements. Their brief presentation is the main topic of this entry.
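The cube lattice just described can be sketched as follows (illustrative only): nodes are the 2^D sets of grouping attributes, and each node points to every node obtained by dropping exactly one of its attributes.

```python
from itertools import combinations

def cube_lattice(dims):
    """Cube lattice as a DAG: each node (a set of grouping attributes) points
    to every node obtained by dropping exactly one of its attributes."""
    nodes = [frozenset(c) for r in range(len(dims) + 1)
             for c in combinations(dims, r)]
    return {node: [node - {attr} for attr in node] for node in nodes}

lattice = cube_lattice(('A', 'B', 'C'))
# len(lattice) == 2 ** 3; the node {A, B, C} points to {A, B}, {A, C}, and
# {B, C}; the empty node (the grand total) has no outgoing edges.
```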
Historical Background

Most data analysis efforts, whether manual by analysts or automatic by specialized algorithms, manipulate the contents of database systems in order to discover trends and correlations. They typically involve complex queries that make an extensive use of aggregation in order to group together tuples that "behave in a similar fashion." The response time of such queries over extremely large data warehouses can be prohibitive. This problem inspired Gray et al. [3] to introduce the data-cube operator and propose its off-line computation and storage for efficiency at query time. The corresponding seminal publication has been the seed for a plethora of papers thereafter, which have dealt with several different aspects of the lifecycle of a data cube, from cube construction and storage to indexing, query answering, and incremental maintenance.
Cube Implementations. Figure 1. Fact table R and the corresponding data cube.
Cube Implementations. Figure 2. Example of a cube lattice.
Taking into account the format used for the computation and storage of a data cube, the cube-implementation algorithms that have appeared in the literature can be partitioned into four main categories: Relational-OLAP (ROLAP) algorithms exploit traditional materialized views in RDBMSes; Multidimensional-OLAP (MOLAP) algorithms take advantage of multidimensional arrays; Graph-Based methods use specialized graph structures; finally,
approximation algorithms use various in-memory representations, e.g., histograms. The literature also deals with the rest of the cube's lifecycle [12]. Providing fast answers to OLAP aggregate queries is the main purpose of implementing data cubes to begin with, and various algorithms have been proposed to handle different types of queries on the formats above. Moreover, as the data stored in the original fact table changes, data cubes must follow suit; otherwise, analysis of obsolete data may result in invalid conclusions. Periodical reconstruction of the entire cube is impractical; hence, incremental-maintenance techniques have been proposed. The ideal implementation of a data cube must address efficiently all aspects of cube functionality in order to be viable. In the following section, each one of these aspects is further examined separately.
Foundations

In the following subsections, the main stages of the cube lifecycle are analyzed in some detail, including subcube selection, computation, query processing, and incremental maintenance. Note that the references given in this section are only indicative, since the number of related publications is actually very
large. A more comprehensive survey may be found elsewhere [11].

Subcube Selection
In real-world applications, materialization of the entire cube is often extremely expensive in terms of computation, storage, and maintenance requirements, mainly because of the typically large fact-table size and the exponential number of cube nodes with respect to the number of dimensions. To overcome this drawback, several existing algorithms select an appropriate subset of the data cube for precomputation and storage [4,5,6]. Such selection algorithms try to balance the tradeoff between response times of queries (sometimes of a particular, expected workload) and resource requirements for cube construction, storage, and maintenance. It has been shown [6] that selection of the optimum subset of a cube is an NP-complete problem. Hence, the existing algorithms use heuristics in order to find near-optimal solutions. Common constraints used during the selection process involve constraints on the time available for cube construction and maintenance, and/or on the space available for cube storage. As for the criteria that are (approximately) optimized during selection, they typically involve some form of the benefit gained from the materialization of a particular cube subset. A particularly beneficial criterion for the selection problem that needs some more attention, since it has been integrated into some of the most efficient cube-implementation algorithms (including Dwarf [17] and CURE [10], which will be briefly presented below), is so-called redundancy reduction. Several groups of researchers have observed that a big part of the cube data is usually redundant [7,8,10,12,17,20]. Formally, a value stored in a cube is redundant if it is repeated multiple times in the same attribute in the cube. For example, in Fig. 1b, tuples ⟨1, 20⟩ of node A, ⟨1, 2, 20⟩ of AB, and ⟨1, 2, 20⟩ of AC are redundant, since they can be produced by properly projecting tuple ⟨1, 2, 2, 20⟩ of node ABC. By appropriately avoiding the storage of such redundant data, several existing cube-implementation algorithms achieve the construction of compressed cubes that can still be considered fully materialized. Typically, the decrease in the final cube size is impressive, a fact that benefits the performance of computation as well, since output costs are considerably reduced and sometimes, because early identification of
redundancy allows pruning of parts of the computation. Furthermore, during query answering, aggregation and decompression are not necessary; instead, some simple operations, e.g., projections, are enough. Finally, for some applications (e.g., for mining multidimensional association rules), accessing the tuples of the entire cube is not necessary, because they only need those group-by tuples with an aggregate value (e.g., count) above some prespecified minimum support threshold (minsup). For such cases, the concept of iceberg cubes has been introduced [2]. Iceberg-cube construction algorithms [2,16] take into consideration only sets of tuples that aggregate together giving a value greater than minsup. Hence, they perform some kind of subcube selection, by storing only the tuples that satisfy the aforementioned condition.
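A minimal sketch of the iceberg condition (illustrative, computed directly from the fact-table rows rather than by any of the specialized algorithms cited above): only group-by tuples whose COUNT reaches the minimum support threshold are kept.

```python
from itertools import combinations
from collections import Counter

def iceberg_cube(rows, dims, minsup):
    """Keep, for every grouping of dims, only the group-by tuples whose
    COUNT reaches the minimum support threshold minsup."""
    kept = {}
    for r in range(len(dims) + 1):
        for grouping in combinations(dims, r):
            counts = Counter(tuple(row[d] for d in grouping) for row in rows)
            kept[grouping] = {key: c for key, c in counts.items() if c >= minsup}
    return kept
```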
Cube Computation

Cube computation includes scanning the data of the fact table, aggregating on all grouping attributes, and generating the contents of the data cube. The main goal of this procedure is to place tuples that aggregate together (i.e., tuples with identical grouping-attribute values) in contiguous positions in main memory, in order to compute the required aggregations with as few data passes as possible. The most widely used algorithms that accomplish such clustering of tuples are sorting and hashing. Moreover, nodes connected in the cube lattice (Fig. 2) exhibit strong computational dependencies, whose exploitation is particularly beneficial for the performance of the corresponding computation algorithms. For instance, assuming that the data in the fact table R (Fig. 1a) is sorted according to the attribute combination ABC, one can infer that it is also sorted according to both AB and A as well. Hence, the overhead of sorting can be shared by the computation of multiple aggregations, since nodes ABC → AB → A → Ø can be computed with the use of pipelining without reclustering the data. Five methods that take advantage of such node computational dependencies have been presented in the existing literature [1] in order to improve the performance of computation algorithms: smallest-parent, cache-results, amortize-scans, share-sorts, and share-partitions. Expectedly, both sort-based and hash-based aggregation methods perform more efficiently when the data they process fits in main memory; otherwise,
they are forced to use external-memory algorithms, which generally increase the I/O overhead by a factor of two or three. In order to overcome such problems, most computation methods initially apply a step that partitions the data into segments that fit in main memory, called partitions [2,10,15]. Partitioning algorithms distribute the tuples of the fact table in accordance with the principle that tuples that aggregate together must be placed in the same partition. Consequently, they can later process each partition independently of the others, since, by construction, tuples that belong to different partitions do not share the same grouping-attribute values. In addition to the above general characteristics of cube-computation algorithms, there are some further details that are specific to each of the four main categories mentioned above (i.e., ROLAP, MOLAP, Graph-Based, and Approximate), which are touched upon below. ROLAP algorithms store a data cube as a set of materialized relational views, most commonly using either a star or a snowflake schema. Among these algorithms, algorithm CURE [10] seems to be the most promising, since it is the only solution with the following features: it is purely compatible with the ROLAP framework, hence its integration into any existing relational engine is rather straightforward. Also, it is suitable not only for "flat" datasets but also for processing datasets whose dimension values are hierarchically organized. Furthermore, it introduces an efficient algorithm for external partitioning that allows the construction of cubes over extremely large volumes of data whose size may far exceed the size of main memory. Finally, it stores cubes in a compressed form, removing all types of redundancy from the final result. MOLAP algorithms store a data cube as a multidimensional array, thereby avoiding storing the dimension values in each array cell, since the position of the cell itself determines these values. The main drawback of this approach comes from the fact that, in practice, cubes have a large number of empty cells (i.e., cubes are sparse), rendering MOLAP algorithms inefficient with respect to their storage-space requirements. To overcome this problem, the so-called chunk-based algorithms have been introduced [21], which avoid the physical storage of most of the empty cells, storing only chunks, which are nonempty subarrays. ArrayCube [21] is the most widely accepted algorithm in this category. It has also served as an inspiration to algorithm MM-Cubing [16], which applies similar
techniques just to the dense areas of the cube, taking into account the distribution of data in a way that avoids chunking. Graph-Based algorithms represent a data cube as some specialized graph structure. They use such structures both in memory, for organizing data in a fashion that accelerates computation of the corresponding cube, and on disk, for compressing the final result and reducing storage-space requirements. Among the algorithms in this category, Dwarf [17] seems to be the strongest overall, since it is the only one that guarantees polynomial time and space complexity with respect to dimensionality [18]. It is based on a highly compressed data structure that eliminates prefix and suffix redundancies efficiently. Prefix redundancy occurs when two or more tuples in the cube share the same prefix, i.e., the same values in the left dimensions; suffix redundancy, which is in some sense complementary to prefix redundancy, occurs when two or more cube tuples share the same suffix, i.e., the same values in the right dimensions and the aggregate measures. An advantage of Dwarf, as well as of the other graph-based methods, is that not only does its data structure store a data cube compactly, but it also serves as an index that can accelerate selective queries. Approximate algorithms assume that data mining and OLAP applications do not require fine-grained or absolutely precise results in order to capture trends and correlations in the data; hence, they store an approximate representation of the cube, trading accuracy for level of compression. Such algorithms exploit various techniques, inspired mainly from statistics, including histograms [14], wavelet transformations [19], and others. Finally, note that some of the most popular industrial cube implementations include Microsoft SQL Server Analysis Services (http://www.microsoft.com/sql/technologies/analysis/default.mspx) and Hyperion Essbase, which was bought by ORACLE in 2007 (http://www.oracle.com/hyperion).

Query Processing
The most important motivation for cube materialization is to provide low response times for OLAP queries. Clearly, construction of a highly-compressed cube is useless if the cube format inhibits good query answering performance. Therefore, efficiency during query processing should be taken into consideration as well when selecting a specific cube-construction algorithm and its
corresponding storage format. Note that the latter determines to a great extent the access methods that can be used for retrieving data stored in the corresponding cube; hence, it strongly affects the performance of query processing algorithms over cube data. Intuitively, it seems that brute-force storage of an entire cube in an uncompressed format behaves best during query processing: in this case, every possible aggregation for every combination of dimensions is precomputed and the only cost required is that of retrieving the data stored in the lattice nodes participating in the query. On the other hand, query processing over compressed cubes seems to induce additional overhead for on-line computation or restoration of (possibly redundant) tuples that have not been materialized in the cube. Nevertheless, the literature has shown that the above arguments are not always valid in practice. This is mostly due to the fact that indexing an uncompressed cube is nontrivial in real-world applications, whereas applying custom indexing techniques for some sophisticated, more compact representations has been found efficient [2]. Furthermore, storing data in specialized formats usually offers great opportunities for unique optimizations that allow a wide variety of query types to run faster over compressed cubes [2]. Finally, recall that several graph-based algorithms, e.g., Dwarf [17], store the cube in a way that is efficient with respect to both storage space and query processing time.

Incremental Maintenance
As mentioned earlier, in general, fact tables are dynamic in nature and change over time, mostly as new records are inserted in them. Aggregated data stored in a cube must follow the modifications in the corresponding fact table; otherwise, query answers over the cube will be inaccurate. According to the most common scenario used in practice, data in a warehouse is periodically updated in a batch fashion. Clearly, the window of time that is required for the update process must be kept as narrow as possible. Hence, reconstruction of the entire cube from scratch is practically not a viable solution; techniques for incremental maintenance must be used instead. Given a fact table, its corresponding cube, and a set of updates to the fact table that have occurred since the last cube update, let delta cube be the cube formed by the data corresponding to these updates. Most
incremental-maintenance algorithms proposed in the literature for the cube follow a common strategy [20]: they separate the update process into the propagation phase, during which they construct the delta cube, and the refresh phase, during which they merge the delta cube and the original cube, in order to generate the new cube. Most of them identify the refresh phase as the most challenging one and use specialized techniques to accelerate it, taking into account the storage format of the underlying cube (some examples can be found in the literature [12,17]). There is at least one general algorithm, however, that tries to optimize the propagation phase [9]. It selects particular nodes of the delta cube for construction and properly uses them in order to update all nodes of the original cube.
Key Applications

Efficient implementation of the data cube is essential for OLAP applications in terms of performance, since they usually make an extensive use of aggregate queries.
Cross-references
▶ Data Warehouse ▶ Dimension ▶ Hierarchy ▶ Measure ▶ OLAP ▶ Snowflake Schema ▶ Star Schema
Recommended Reading 1. Agarwal S., Agrawal R., Deshpande P., Gupta A., Naughton J.F., Ramakrishnan R., and Sarawagi S. On the computation of multidimensional aggregates. In Proc. 22th Int. Conf. on Very Large Data Bases, 1996, pp. 506–521. 2. Beyer K.S. and Ramakrishnan R. Bottom-up computation of sparse and iceberg CUBEs. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1999, pp. 359–370. 3. Gray J., Bosworth A., Layman A., and Pirahesh H. Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-total. In Proc. 12th Int. Conf. on Data Engineering, 1996, pp. 152–159. 4. Gupta H. Selection of views to materialize in a data warehouse. In Proc. 6th Int. Conf. on Database Theory, 1997, pp. 98–112. 5. Gupta H. and Mumick I.S. Selection of views to materialize under a maintenance cost constraint. In Proc. 7th Int. Conf. on Database Theory, 1999, pp. 453–470. 6. Harinarayan V., Rajaraman A., and Ullman J.D. Implementing data cubes efficiently. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1996, pp. 205–216.
7. Kotsis N. and McGregor D.R. Elimination of redundant views in multidimensional aggregates. In Proc. 2nd Int. Conf. Data Warehousing and Knowledge Discovery, 2000, pp. 146–161. 8. Lakshmanan L.V.S., Pei J., and Zhao Y. QC-Trees: an efficient summary structure for semantic OLAP. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003, pp. 64–75. 9. Lee K.Y. and Kim M.H. Efficient incremental maintenance of data cubes. In Proc. 32nd Int. Conf. on Very Large Data Bases, 2006, pp. 823–833. 10. Morfonios K. and Ioannidis Y. CURE for cubes: cubing using a ROLAP engine. In Proc. 32nd Int. Conf. on Very Large Data Bases, 2006, pp. 379–390. 11. Morfonios K., Konakas S., Ioannidis Y., and Kotsis N. ROLAP implementations of the data cube. ACM Comput. Surv., 39(4), 2007. 12. Morfonios K. and Ioannidis Y. Supporting the Data cube Lifecycle: the Power of ROLAP. VLDB J., 17(4):729–764, 2008. 13. Mumick I.S., Quass D., and Mumick B.S. Maintenance of data cubes and summary tables in a warehouse. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1997, pp. 100–111. 14. Poosala V. and Ganti V. Fast approximate answers to aggregate queries on a data cube. In Proc. 11th Int. Conf. on Scientific and Statistical Database Management, 1999, pp. 24–33. 15. Ross K.A. and Srivastava D. Fast computation of sparse datacubes. In Proc. 23th Int. Conf. on Very Large Data Bases, 1997, pp. 116–125. 16. Shao Z., Han J., and Xin D. MM-Cubing: computing iceberg cubes by factorizing the lattice Space. In Proc. 16th Int. Conf. on Scientific and Statistical Database Management, 2004, pp. 213–222. 17. Sismanis Y., Deligiannakis A., Roussopoulos N., and Kotidis Y. Dwarf: shrinking the PetaCube. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2002, pp. 464–475. 18. Sismanis Y. and Roussopoulos N. The complexity of fully materialized coalesced cubes. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004, pp. 540–551. 19. Vitter J.S. and Wang M. Approximate computation of multidimensional aggregates of sparse data using wavelets. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1999, pp. 193–204. 20. Wang W., Feng J., Lu H., and Yu J.X. Condensed cube: an efficient approach to reducing data cube size. In Proc. 18th Int. Conf. on Data Engineering, 2002, pp. 155–165. 21. Zhao Y., Deshpande P., and Naughton J.F. An array-based algorithm for simultaneous multidimensional aggregates. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1997, pp. 159–170.
Cube Materialization ▶ Cube Implementations
Cube Precomputation ▶ Cube Implementations
Curation ▶ Biomedical Scientific Textual Data Types and Processing
Current Date ▶ Now in Temporal Databases
Current Semantics MICHAEL H. BÖHLEN1, CHRISTIAN S. JENSEN2, RICHARD T. SNODGRASS3 1Free University of Bozen-Bolzano, Bolzano, Italy 2Aalborg University, Aalborg, Denmark 3University of Arizona, Tucson, AZ, USA
Synonyms Temporal upward compatibility
Definition

Current semantics constrains the semantics of non-temporal statements applied to temporal databases. Specifically, current semantics requires that non-temporal statements on a temporal database behave as if applied to the non-temporal database that is the result of taking the timeslice of the temporal database as of the current time.
Key Points

Current semantics [3] requires that queries and views on a temporal database consider the current information only and work exactly as if applied to a non-temporal database. For example, a query to determine who manages the high-salaried employees should consider the current database state only. Constraints and assertions also work exactly as before: they are applied to the current state and checked on database modification. Database modifications are subject to the same constraint as queries: they should work exactly as if applied
to a non-temporal database. Database modifications, however, also have to take into consideration that the current time is constantly moving forward. Therefore, the effects of modifications must persist into the future (until overwritten by a subsequent modification). The definition of current semantics assumes a timeslice operator τ[t](Dt) that takes the snapshot of a temporal database Dt at time t. The timeslice operator takes the snapshot of all temporal relations in Dt and returns the set of resulting non-temporal relations. Let now be the current time [2] and let t be a time point that does not exceed now. Let Dt be a temporal database instance at time t. Let M1,...,Mn, n ≥ 0, be a sequence of non-temporal database modifications. Let Q be a non-temporal query. Current semantics requires that for all Q, t, Dt, and M1,...,Mn the following equivalence holds:

Q(Mn(Mn-1(...(M1(Dt))...))) = Q(Mn(Mn-1(...(M1(τ[now](Dt)))...)))

Note that for n = 0 there are no modifications, and the equivalence becomes Q(Dt) = Q(τ[now](Dt)), i.e., a non-temporal query applied to a temporal database must consider the current database state only. An unfortunate ramification of the above equivalence is that temporal query languages that introduce new reserved keywords not used in the non-temporal languages they extend will violate current semantics. The reason is that the user may have previously used such a keyword as an identifier (e.g., a table name) in the database. To avoid being overly restrictive, it is reasonable to consider current semantics satisfied even when reserved words are added, as long as the semantics of all statements that do not use the new reserved words is retained by the temporal query language. Temporal upward compatibility [1] is a synonym that focuses on settings where the original temporal database is the result of rendering a non-temporal database temporal.
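An illustrative Python sketch of the timeslice operator and of the n = 0 case of the equivalence; the relation, the validity periods, and the query are invented for the example.

```python
INF = float('inf')

# A temporal relation: each tuple carries its valid period [start, end).
emp = [
    ('Ann', 52000, 0, 5),    # valid from time 0 up to (not including) time 5
    ('Ann', 60000, 5, INF),  # current tuple, still valid
    ('Bob', 48000, 2, INF),
]

def timeslice(rel, t):
    """tau[t](rel): the non-temporal snapshot of rel as of time t."""
    return [(name, sal) for (name, sal, start, end) in rel if start <= t < end]

def q(snapshot):
    """A non-temporal query: names of employees earning more than 50,000."""
    return sorted(name for (name, sal) in snapshot if sal > 50000)

now = 7
# Under current semantics (the n = 0 case of the equivalence), the non-temporal
# query applied to the temporal relation must behave exactly as if applied to
# the timeslice of the relation at now.
assert q(timeslice(emp, now)) == ['Ann']
```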
Cross-references
▶ Nonsequenced Semantics ▶ Now in Temporal Databases ▶ Sequenced Semantics ▶ Snapshot Equivalence ▶ Temporal Database ▶ Temporal Data Models ▶ Temporal Query Languages ▶ Timeslice Operator

Recommended Reading
1. Bair J., Böhlen M.H., Jensen C.S., and Snodgrass R.T. Notions of upward compatibility of temporal query languages. Wirtschaftsinformatik, 39(1):25–34, February 1997.
2. Clifford J., Dyreson C., Isakowitz T., Jensen C.S., and Snodgrass R.T. On the Semantics of "NOW" in Databases. ACM Trans. Database Syst., 22:171–214, June 1997.
3. Snodgrass R.T. Developing Time-Oriented Database Applications in SQL. Morgan Kaufmann, Los Altos, CA, 1999.

Current Time ▶ Now in Temporal Databases

Current Timestamp ▶ Now in Temporal Databases
Curse of Dimensionality LEI CHEN Hong Kong University of Science and Technology, Hong Kong, China
Synonyms Dimensionality curse
Definition

The curse of dimensionality, first introduced by Bellman [1], indicates that the number of samples needed to estimate an arbitrary function with a given level of accuracy grows exponentially with respect to the number of input variables (i.e., dimensionality) of the function. For similarity search (e.g., nearest neighbor query or range query), the curse of dimensionality means that the number of objects in the data set that need to be accessed grows exponentially with the underlying dimensionality.
Key Points

The curse of dimensionality is an obstacle for solving dynamic optimization problems by backwards induction. Moreover, it complicates machine learning problems in which it is necessary to learn a state of nature from a finite number of data samples in a high-dimensional feature space. Finally, the curse of dimensionality seriously affects the query performance for similarity search over multidimensional indexes because, in high dimensions, the distances from a query to its nearest and to its farthest neighbor are similar. This indicates that data objects tend to be close to the boundaries of the data space with increasing dimensionality. Thus, in order to retrieve even a few answers to a nearest neighbor query, a large part of the data space must be searched, making multidimensional indexes less efficient than a sequential scan of the data set, typically for dimensionality greater than 12 [2]. In order to break the curse of dimensionality, data objects are usually reduced to vectors in a lower-dimensional space via some dimensionality reduction technique before they are indexed.
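The concentration of distances mentioned above can be illustrated with a small numpy experiment (not from the entry): as the dimensionality grows, the nearest and the farthest neighbor of a random query become almost equidistant.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_farthest_ratio(dim, n_points=1000):
    """Ratio of the nearest to the farthest Euclidean distance from a random
    query to points drawn uniformly from the unit hypercube of dimension dim."""
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()

for dim in (2, 10, 100, 1000):
    print(dim, round(float(nearest_farthest_ratio(dim)), 3))  # ratio tends toward 1
```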
Cross-references
▶ Dimensionality Reduction
Recommended Reading
1. Bellman R.E. Adaptive Control Processes. Princeton University Press, Princeton, NJ, 1961.
2. Beyer K.S., Goldstein J., Ramakrishnan R., and Shaft U. When is "nearest neighbor" meaningful? In Proc. 7th Int. Conf. on Database Theory, 1999, pp. 217–235.
Cursor ▶ Iterator
CW Complex ▶ Simplicial Complex
CWM ▶ Common Warehouse Metamodel
Cyclic Redundancy Check (CRC) ▶ Checksum and Cyclic Redundancy Check (CRC) Mechanism
D

DAC ▶ Discretionary Access Control
Daplex TORE RISCH Uppsala University, Uppsala, Sweden
Definition Daplex is a query language based on a functional data model [1] with the same name. The Daplex data model represents data in terms of entities and functions. It is close to the entity-relationship (ER) model, with the difference that relationships between entities in Daplex have a logical direction, whereas ER relationships are directionless. Unlike in the ER model, entities, relationships, and properties are all represented as functions in Daplex. Entity types are defined as functions without arguments returning sets of a built-in type ENTITY, and the entity types are organized in a type/subtype hierarchy. Functions represent properties of entities and relationships among entities; they also represent derived information. Functions may be set-valued and are invertible. The database is represented as tabulated function extents, and database updates change the function extents. The Daplex query language has been very influential for many other query languages, whether relational, functional, or object-oriented. Queries are expressed declaratively in an iterative fashion over sets, similar to the FLWR semantics of the XQuery language. Daplex queries cannot return entities; a value returned from a query must always be a literal. The query language further includes schema (function) definition statements, update statements, constraints, etc.
Key Points Daplex functions are defined using a DECLARE statement, for example:

DECLARE name(Student) = STRING

where "Student" is a user-defined entity type and "STRING" is a built-in type. Set-valued functions are declared using the "=>>" notation, e.g.,

DECLARE course(Student) =>> Course

Entity types are functions returning the built-in type ENTITY, for example:

DECLARE Person() =>> ENTITY

Inheritance among entity types is expressed by defining entity types as functions returning their supertypes, for example:

DECLARE Student() =>> Person

Functions may be overloaded on different entity types. Queries in Daplex are expressed in a FOR EACH – SUCH THAT – PRINT style similar to the FLWR semantics of XQuery. For example:

FOR EACH X IN Employee
SUCH THAT Salary(X) > Salary(Manager(X))
PRINT Name(X)

The PRINT statement is not regarded here as a side effect but rather as defining the result set of the query. Derived functions are defined through the DEFINE statement, e.g.,

DEFINE Course(Student) => Course SUCH THAT
  FOR SOME Enrollment
    Stud#(Student) = Stud#(Enrollment) AND
    Course#(Enrollment) = Course#(Course)

Daplex was first implemented in the Multibase system [2], where it was used as a multi-database query language to query data from several databases. The P/FDM [1] data model and query language is close to Daplex. The query languages OSQL and AmosQL are also based on Daplex; they extend Daplex with object identifiers (OIDs) to represent actual entities, so that queries can return entities as OIDs.
Cross-references
▶ AmosQL ▶ Functional Data Model ▶ OSQL ▶ P/FDM ▶ Query Language ▶ XPath/XQuery
Recommended Reading
1. Gray P.M.D., Kerschberg L., King P.J.H., and Poulovassilis A. (eds.). The Functional Approach to Data Management. Springer, Berlin, 2004.
2. Landers T. and Rosenberg R.L. An overview of Multibase. In Proc. 2nd Int. Symp. on Distributed Databases, 1982, pp. 153–184.
3. Shipman D.W. The functional data model and the data language DAPLEX. ACM Trans. Database Syst., 6(1):140–173, 1981.
DAS ▶ Direct Attached Storage
Data Acquisition ▶ Data Acquisition and Dissemination in Sensor Networks
Data Acquisition and Dissemination in Sensor Networks TURKMEN CANLI, ASHFAQ KHOKHAR University of Illinois at Chicago, Chicago, IL, USA
Synonyms Data gathering; Data collection; Data acquisition
Definition Wireless sensor networks (WSNs) are deployed to monitor and subsequently communicate various aspects of the physical environment, e.g., acoustic, visual, motion, vibration, heat, light, moisture, pressure, radio, magnetic, and biological phenomena. Data acquisition and dissemination protocols for WSNs aim at collecting information from sensor nodes and forwarding it to the subscribing entities such that a maximum data rate is achieved while maximizing the overall network lifetime. The information can be simple raw data, or data processed using basic signal processing techniques such as filtering, aggregation/compression, and event detection.
Historical Background Wireless sensor networks consist of tiny, dispensable, smart sensor nodes with limited battery power and processing/communication capabilities. In addition, these networks also employ more powerful "sink" node(s) that collect information from the sensor nodes and facilitate interfacing with the outside computing and communication infrastructure. WSNs are configured to execute two fundamental tasks: acquisition/collection of information at the sink nodes, and dissemination of information to the nodes across the network. Existing data acquisition and dissemination techniques have been investigated at different levels of application abstraction, including structured data collection [1,5,6,11,12] in a query-database paradigm, and raw data acquisition for field reconstruction and event recognition at the sink nodes [2,3,4,7,9,10,13].
Foundations In WSNs, devices have limited battery life, which is generally considered non-replenishable; the pertinent challenge is therefore to design protocols and algorithms that maximize the network lifetime. In data acquisition and dissemination tasks, the data volume is high, so another optimization criterion is to increase throughput while reducing power consumption. Several data collection protocols aiming at data reduction through data transformation techniques have been suggested [2,10]. In [10], the authors have proposed the use of wavelet compression to reduce data volume in structure-monitoring WSN applications, resulting in low power consumption and reduced communication latency. Similar data transformation and compression techniques have been used to compute summaries of the raw data [2]. In most data collection algorithms, the network nodes are organized into logical structures, and communication among the nodes and with the sink is
realized using such logical structures. For example, in tree-based data acquisition protocols, a collection tree is built that is rooted at the data collection center, such as the sink node [8]. The dissemination of data requests to the participating nodes and the collection of data from the sensor nodes are accomplished using this tree. A cluster-based data acquisition mechanism has been proposed in [3]. As shown in Fig. 1, nodes are organized into a fixed number of clusters, and nodes within each cluster dynamically elect a cluster head. Data acquisition is carried out in two phases: in the first phase, cluster heads collect data from their cluster nodes; in the second phase, cluster heads send the collected data to the nodes that have subscribed to the data. The cluster heads are re-elected to balance energy consumption among the nodes in the cluster. Zhang et al. [13] have proposed an adaptive cluster-based data collection protocol that dynamically adjusts the number of cluster heads to the traffic load in the network; this dynamic traffic model is developed at the sink node. In [7], the network is divided into virtual grids, and sensor nodes in each grid are classified as either gateway nodes or internal nodes. For example, in Fig. 2 nodes B and G are selected as gateway nodes, which are responsible for transmitting data to nodes outside the grid. By doing so, data contention and redundant transmissions of a packet are reduced, which saves energy. The common characteristic of all the aforementioned protocols is the proactively built routing infrastructure. As an alternative, the authors in [4] have proposed the
Data Acquisition and Dissemination in Sensor Networks. Figure 1. Clustering concept as proposed in LEACH [3].
directed diffusion approach, in which the routing infrastructure is constructed on the fly. The sink node disseminates its interest to the network, and gradients are set up from nodes whose data match the sink's interest. There may be more than one path from a sensor node to the sink node, and sink nodes regulate the data rate across all the paths. Data acquisition and dissemination techniques designed with a higher level of application abstraction model the sensor network as a distributed database system. In these techniques, the data center disseminates its queries, and database operations such as "join" or "select" are computed in a distributed fashion by the sensor nodes that hold the requested data. For example, in [12] a new layer, referred to as the query layer, is proposed that is logically situated between the network and application layers of the network protocol stack. This layer processes descriptive queries and determines a power-efficient execution plan that makes use of in-network processing and aggregation operations. In-network aggregation is realized by packet merging or by incrementally applying aggregation operators such as min, max, count, or sum. The Cougar system [11] presents a framework for specifying a query execution plan in WSNs. It allows the specification of routing among sensor nodes and of the execution of aggregate operations over the collected data. As depicted in Fig. 3, each sensor node samples the environment as specified by the query. According to the execution plan, the sampled data is sent to a leader node, or an aggregation operator is applied to it together with the partially aggregated data received from other nodes. The new partially aggregated value is then sent towards the leader node. Partial aggregation is possible only for aggregation operators that can be computed incrementally. The volume of data is decreased by partial or incremental aggregation. The
Data Acquisition and Dissemination in Sensor Networks. Figure 2. Principle of LAF.
Data Acquisition and Dissemination in Sensor Networks. Figure 3. Query plan at a source and leader node [11].
responsibility of the leader node is to combine all the partially aggregated results and to report the result to the gateway node if the value exceeds the set threshold. In the TinyDB framework [5], the WSN is viewed as one big relational table whose columns correspond to the types of phenomena observed, e.g., humidity, temperature, etc. The aim is to reduce the power consumption during data acquisition by in-network processing of raw data. In other words, TinyDB addresses questions such as when data should be sampled for a particular query, which sensors should respond to a query, in which order sensors should sample the data, and how to balance in-network processing of the raw samples against collecting the raw samples at the sink nodes without any processing. Moreover, the structured query language (SQL) is extended specifically for sensor network applications: keywords such as SAMPLE, ON EVENT, and LIFETIME have been added to the SQL language to facilitate the realization of basic sensor network applications. Through the SAMPLE clause, the sampling rate of the sensors can be controlled. LIFETIME allows automatic sampling-rate adjustment for a given lifetime. ON EVENT acts as a trigger, i.e., the query is executed only when the specified event occurs. The query sample shown below [5] illustrates the use of the extended SQL; in this example query, the sampling period is set via the SAMPLE clause. The query planner needs metadata about power consumption and about sensing and communication costs in order to approximate the lifetime and execute the LIFETIME clause. ON EVENT-type queries may trigger multiple instances of the same type of internal query.
SELECT COUNT(*)
FROM sensors AS s, recentLight AS r1
WHERE r1.nodeid = s.nodeid
  AND s.light < r1.light
SAMPLE INTERVAL 10s
Optimization of the data sampling order can also reduce energy consumption significantly. Typically, a sensor node has more than one on-board sensor; for example, a node may carry temperature, humidity, and pressure sensors on a single sensing platform. If a query requests the temperature of the nodes whose humidity value is greater than some threshold, it would be inefficient to sample temperature and humidity simultaneously: the energy spent on sampling temperature values where humidity is below the threshold can be saved by reordering the predicate evaluation [5]. The semantic routing tree (SRT) [5] is a mechanism that allows nodes to find out whether their children have data relevant to an incoming query. Every parent node stores the range of its children's values; therefore, when a query arrives at a node, it is not forwarded to those children that do not have the data. For instance, in Fig. 4, node 1 will not send the query request to node 2; similarly, node 3 will not send the query to node 5. The authors of [1] have used probabilistic models to facilitate efficient query processing and data acquisition in sensor networks. The idea is to build a statistical model of the sensor readings from stored and current readings of the sensors; whenever an SQL query is submitted to the network, the constructed model is used to provide answers. The accuracy of the requested data can be specified by setting the
Data Acquisition and Dissemination in Sensor Networks. Figure 4. A semantic routing tree in use for a query. Gray arrows indicate flow of the query down the tree, gray nodes must produce or forward results in the query [5].
Data Acquisition and Dissemination in Sensor Networks. Figure 5. Model based querying in sensor networks [1].
confidence interval in the SQL statement; Fig. 5 illustrates a typical SQL query of this kind. Depending on the error tolerance, the response to the query may involve collecting information from every sensor node, for example if 100% accuracy is required. On the other hand, if the error tolerance is high, the query can be answered using only the constructed model.
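As an illustration of this model-based style of acquisition, the following Python sketch answers a query from a learned per-sensor mean/variance model when the requested error bound can be met with the desired confidence, and otherwise falls back to sampling the node. It is an illustration only: the Gaussian model, the numbers, and the helper sample_node are assumptions, not the interface of [1].

import math

# Assumed per-sensor model: node id -> (mean, variance) learned from past readings.
model = {1: (21.0, 0.4), 2: (23.5, 2.5), 3: (20.2, 0.1)}

def sample_node(node_id):
    # Placeholder for an expensive in-network acquisition (assumption).
    return model[node_id][0]

def answer_query(node_id, max_error, confidence=0.95):
    # Use the model only if its confidence interval fits within max_error.
    mean, var = model[node_id]
    z = 1.96 if confidence == 0.95 else 2.58   # Gaussian interval half-widths
    if z * math.sqrt(var) <= max_error:
        return mean, "answered from model"
    return sample_node(node_id), "sampled from network"

print(answer_query(1, max_error=1.5))   # model is precise enough
print(answer_query(2, max_error=1.5))   # high variance -> sample the node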
Key Applications Building Health Monitoring, Micro-climate Monitoring, Habitat Monitoring, Hazardous Environment Monitoring.
Cross-references
▶ Ad-hoc Queries in Sensor Networks ▶ Continuous Queries in Sensor Networks ▶ Data Aggregation in Sensor Networks ▶ Data Compression in Sensor Networks ▶ Data Estimation in Sensor Networks ▶ Data Fusion in Sensor Networks ▶ Data Storage and Indexing in Sensor Networks ▶ Database Languages for Sensor Networks ▶ Model-Based Querying in Sensor Networks ▶ Query Optimization in Sensor Networks ▶ Sensor Networks
Recommended Reading 1. Deshpande A., Guestrin C., Madden S.R., Hellerstein J.M., and Hong W. Model-driven data acquisition in sensor networks. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004, pp. 588– 599. 2. Ganesan D., Greenstein B., Perelyubskiy D., Estrin D., and Heidemann J. An evaluation of multi-resolution storage for sensor networks. In Proc. 1st Int. Conf. on Embedded Networked Sensor Systems, 2003, pp. 89–102. 3. Heinzelman W.R., Chandrakasan A., and Balakrishnan H. Energy-efficient communication protocol for wireless microsensor networks. In Proc. 33rd Annual Hawaii Conf. on System Sciences, 2000, pp. 8020. 4. Intanagonwiwat C., Govindan R., Estrin D., Heidemann J., and Silva F. Directed diffusion for wireless sensor networking. IEEE/ ACM Trans. Netw., 11(1):2–16, 2003. 5. Madden S., Franklin M.J., Hellerstein J.M., and Hong W. The design of an acquisitional query processor for sensor networks. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003, pp. 491–502. 6. Madden S.R., Franklin M.J., Hellerstein J.M., and Hong W. TinyDB: an acquisitional query processing system for sensor networks. ACM Trans. Database Syst., 30(1):122–173, 2005. 7. Sabbineni M.H. and Chakrabarty S.M.K. Location-aided flooding: an energy-efficient data dissemination protocol for wireless sensor networks. IEEE Trans. Comput., 54(1):36–46, 2005.
8. Szewczyk R., Osterweil E., Polastre J., Hamilton M., Mainwaring A., and Estrin D. Habitat monitoring with sensor networks. Commun. ACM, 47(6):34–40, 2004. 9. Xi Y., Yang W., Yamauchi N., Miyazaki Y., Baba N., and Ikeda H. Real-time data acquisition and processing in a miniature wireless monitoring system for strawberry during transportation. In Proc. Int. Technical Conf. of IEEE Region 10 (Asia Pacific Region), 2006. 10. Xu N., Rangwala S., Chintalapudi K.K., Ganesan D., Broad A., Govindan R., and Estrin D. A wireless sensor network for structural monitoring. In Proc. 2nd Int. Conf. on Embedded Networked Sensor Systems, 2004, pp. 13–24. 11. Yao Y. and Gehrke J. The cougar approach to in-network query processing in sensor networks. ACM SIGMOD Rec., 31 (3):9–18, 2002. 12. Yao Y. and Gehrke J. Query processing in sensor networks, 2003. 13. Zhan X., Wang H., and Khokhar A. An energy-efficient data collection protocol for mobile sensor networks. Vehicular Technology Conference, 2006, pp. 1–15.
Data Aggregation in Sensor Networks JUN YANG1, KAMESH MUNAGALA1, ADAM SILBERSTEIN2 1Duke University, Durham, NC, USA 2Yahoo! Research Silicon Valley, Santa Clara, CA, USA
Definition Consider a network of N sensor nodes, each responsible for taking a reading v_i (1 ≤ i ≤ N) in a given epoch. The problem is to compute the result of an aggregate function (cf. Aggregation) over the collection of all readings v_1, v_2, ..., v_N taken in the current epoch. The final result needs to be available at the base station of the sensor network. The aggregate function ranges from simple, standard SQL aggregates such as SUM and MAX, to more complex aggregates such as top-k, median, or even a contour map of the sensor field (where each value to be aggregated is a triple ⟨x_i, y_i, z_i⟩, with x_i and y_i denoting the location coordinates of the reading z_i). In battery-powered wireless sensor networks, energy is the most precious resource, and radio communication is often the dominant consumer of energy. Therefore, in this setting, the main optimization objective is to minimize the total amount of communication needed in answering an aggregation query. A secondary objective is to balance the energy consumption across all sensor nodes, because the first
node to run out of battery may render a large portion of the network inaccessible. There are many variants of the above problem definition. For example, the aggregation query may be continuous and produce a new result for each epoch; not all nodes may participate in aggregation; the result may be needed not at the base station but at other nodes in the network. Some of these variants are further discussed below.
Historical Background Early deployments of wireless sensor networks collect and report all data to the base station without summarization or compression. This approach severely limits the scale and longevity of sensor networks, because nodes spend most of their resources forwarding data on behalf of others. However, many applications do not need the most detailed data; instead, they may be interested in obtaining a summary view of the sensor field, monitoring outlier or extreme readings, or detecting events by combining evidence from readings taken at multiple nodes. Data aggregation is a natural and powerful construct for applications to specify such tasks. Data aggregation is supported by directed diffusion [9], a data-centric communication paradigm for sensor networks. In directed diffusion, a node diffuses its interests for data in the network. Then, nodes with data matching the interests return the relevant data along the reverse paths of interest propagation. Intermediate nodes on these paths can be programmed to aggregate relevant data as it converges on these nodes. Systems such as TinyDB and Cougar take a database approach, providing a powerful interface for applications to pose declarative queries, including aggregation, over a sensor network, and hiding the implementation and optimization details from application programmers. The seminal work by Madden et al. [12] on TAG (Tiny AGgregation) is the first systematic study of database-style aggregation in sensor networks. One of the first efforts at supporting more sophisticated aggregation queries beyond SQL aggregates is the work by Hellerstein et al. [8], which shows how to extend TAG to compute contour maps and wavelet summaries and to perform vehicle tracking. Since these early efforts, the research community has made significant progress in sensor data aggregation; some of the developments are highlighted below.
Foundations The key to efficient sensor data aggregation is in-network processing. On a typical sensor node today, the energy cost of transmitting a byte over the wireless radio is orders of magnitude higher than that of executing a CPU instruction. When evaluating an aggregate query, as data converges on an intermediate node, this node can perform aggregate computation to reduce the amount of data to be forwarded, thereby achieving a favorable tradeoff between computation and communication. To illustrate, consider processing a simple SUM aggregate with TAG [12]. TAG uses a routing tree rooted at the base station and spanning all nodes in the network. During aggregation, each node listens to messages from its children in the routing tree, computes the sum of all values in these messages and its own reading, and then transmits this result – which equals the sum of all readings in the subtree rooted at this node – to its parent. To conserve energy, each node only stays awake for a short time interval to listen, compute, and transmit. To this end, TAG coordinates the nodes' communication schedules: the beginning of a parent node's interval must overlap with the end of its children's intervals, so that the parent is awake to receive its children's transmissions. Overall, to compute SUM, each node needs to send only one constant-size message, so the total amount of communication is Θ(N). In comparison, for the naïve approach, which sends all readings to the root, the amount of data that needs to be forwarded by a node increases closer to the root, and the total amount of communication can be up to Θ(Nd), where d is the depth of the routing tree. Clearly, in-network aggregation not only decreases overall energy consumption, but also balances energy consumption across nodes. The above algorithm can be generalized to many other aggregates. Formally, an aggregate function can be implemented using three functions: an initializer f_i converts an input value into a partial aggregate record; a merging function f_m combines two partial aggregate records into one; finally, an evaluator f_e computes the final result from a partial aggregate record. During aggregation, each node applies f_i to its own reading; each non-leaf node invokes f_m to merge the partial aggregate records received from its children with its own; the root uses f_e to compute the aggregate result from the final partial aggregate record. As an example, standard deviation can be computed (in theory, without regard to numerical stability) using the following
functions, where the partial aggregate record is a triple ⟨s, r, n⟩ consisting of a sum (s), a sum of squares (r), and a count (n):

f_i(v) = ⟨v, v^2, 1⟩;
f_m(⟨s_1, r_1, n_1⟩, ⟨s_2, r_2, n_2⟩) = ⟨s_1 + s_2, r_1 + r_2, n_1 + n_2⟩;
f_e(⟨s, r, n⟩) = (1/n) · sqrt(n·r − s^2).
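A minimal Python sketch of this three-function decomposition, given purely for illustration (the tree shape and readings below are made up), computes the standard deviation by merging partial records up a routing tree:

import math

def f_i(v):                      # initializer: reading -> partial record <s, r, n>
    return (v, v * v, 1)

def f_m(a, b):                   # merge two partial records
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def f_e(rec):                    # evaluator: partial record -> standard deviation
    s, r, n = rec
    return math.sqrt(n * r - s * s) / n

def aggregate(node):
    # node = (reading, [child subtrees]); returns the subtree's partial record.
    reading, children = node
    rec = f_i(reading)
    for child in children:
        rec = f_m(rec, aggregate(child))
    return rec

# A toy routing tree: the root reads 10.0 and has two children,
# one of which has a child of its own.
tree = (10.0, [(12.0, []), (9.0, [(11.0, [])])])
print(f_e(aggregate(tree)))      # standard deviation of 10, 12, 9, 11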
In general, one cannot expect the partial aggregate record to be of constant size for arbitrary aggregate functions. Consider the following examples. For a top-k aggregate, which finds the k largest values, a partial aggregate record needs to be of size Θ(k). For exact median computation, the size requirement becomes Θ(N), which is no better than the naïve approach of sending all readings to the root. For contour map construction, the size of the partial aggregate records depends on the complexity of the field being sensed, and can be Θ(N) in the worst case.

Approximate Aggregation
Approximation is a popular and effective technique for bounding the communication cost in evaluating complex aggregation functions. The basic idea is to replace an exact partial aggregate record with an approximate partial aggregate record that consumes less space and is therefore cheaper to send. An early example of applying this idea is the work by Hellerstein et al. [8]. Since then, a diverse collection of approximation methods has been developed for many aggregate functions; a few illustrative examples are given below. To compute order statistics (e.g., quantile queries including top-k and median), Greenwald and Khanna [7] propose a technique based on ε-approximate quantile summaries. An ε-approximate quantile summary for a collection S of sensor readings is an ordered subset {q_i} of S, where each q_i is associated with a lower bound rmin_i and an upper bound rmax_i on q_i's rank within S, and the difference between rmax_{i+1} and rmin_i is no greater than 2ε|S|. Any quantile query over S can be answered on this summary instead, within an additive rank error of ε|S|. Specifically, a query requesting the r-th ranked reading can be answered by returning a q_j from the summary with r − ε|S| ≤ rmin_j and rmax_j ≤ r + ε|S|. Greenwald and Khanna represent a partial aggregate record sent up from a node u by a set of quantile summaries – at most one for each class
numbered 1 through log N – which together disjointly cover all readings in the subtree rooted at u; the summary for class i covers between 2^i and 2^(i+1) − 1 readings using at most (log N/ε + 1) of these readings. Each sensor node starts with an ε/2-approximate summary of all its local readings. Each intermediate node merges the summaries from its children together with its own summary into up to log N merged summaries, prunes each of them down to the maximum size allowed, and then sends them up to its parent. Finally, the root merges all summaries into a single one and prunes it down to (log N/ε + 1) readings. Although pruning introduces additional error, the use of per-class summaries bounds the error in a class-i summary by ε/2 + i/(2 log N/ε), which in turn allows the error in the final summary to be bounded by ε. Overall, the communication cost incurred by each node during aggregation is only O(log^2 N/ε). Silberstein et al. [15] approach the problem of computing top-k aggregates using a very different style of approximation. Instead of having each node always send the top k readings in its subtree, a node sends only the top k′ readings among its local reading and those received from its children, where k′ ≤ k. The appropriate setting of k′ for each node is based on samples of past sensor readings or, more generally, on a model capturing the expected behavior of sensor readings. Intuitively, a subtree that tends to contribute few of the top values will be allotted a smaller k′. Unlike the ε-approximate quantile summaries, which provide hard accuracy guarantees, the accuracy of this approach depends on how well the past samples or the model reflect the current behavior of the readings. Nevertheless, the approach can be augmented by transmitting the additional information needed to establish the correctness of some top-k answers, thereby allowing the approximation quality to be assessed. As a third example, the contour map of a sensor field is a complex spatial aggregate defined over not only the values but also the locations of the sensor readings. In this case, the partial aggregate record produced by a node is a compact, usually lossy, representation of the contour map encompassing all readings in this node's subtree. In Hellerstein et al. [8], each contour in the map is represented by an orthogonal polygon whose edges follow pre-imposed 2-d rectangular grid lines. This polygon is obtained by starting with the minimum bounding rectangle of the contour and then repeatedly subtracting the largest-area rectangle that
does not contain any point in the contour, until a prescribed limit on the number of vertices is reached. Gandhi et al. [5] use general polygons instead of orthogonal ones. During aggregation, each node constructs and sends to its parent an approximate description of its contour map consisting of up to k possibly disconnected line segments. The root then connects and untangles such line segments to obtain the final contour map. It is shown that the approximation error in the k-segment representation produced by distributed aggregation is within a constant factor of the smallest possible error attainable by any k segments, and it is conjectured that the resulting contour map has size O(k).

Duplicate-Insensitive Aggregation
Approximate aggregation methods based on duplicate-insensitive synopses are especially worth noting because of their resiliency against failures, which are common in sensor networks. Tree-based aggregation techniques are vulnerable to message failures: if a message carrying the partial aggregate record from a node fails, all information from that subtree is lost, resulting in significant error. Sending the same message out on multiple paths towards the base station decreases the chance of losing all copies, and is a good solution for aggregation functions such as MAX. However, for other aggregation functions whose results are sensitive to duplicates in their inputs, e.g., SUM and COUNT, having multiple copies of the same partial aggregate record causes a reading to participate multiple times in aggregation, leading to incorrect results. In general, if an aggregation method is order- and duplicate-insensitive (ODI) [14], it can be implemented with more failure-resilient routing structures such as directed acyclic graphs, without worrying about the duplicates they introduce. The challenge, then, is in designing ODI aggregation methods for duplicate-sensitive aggregation functions. Duplicate-insensitive synopses provide the basis for computing many duplicate-sensitive aggregation functions approximately in an ODI fashion. This approach is pioneered by Considine et al. [1] and Nath et al. [14]. To illustrate the idea, consider COUNT, which is duplicate-sensitive. Nodes in the sensor network are organized into rings centered at the base station, where the i-th ring includes all nodes i hops away from the base station. A partial aggregate record is a Flajolet-Martin sketch, a fixed-size bit-vector for
estimating the number of distinct elements in a multi-set. This sketch is duplicate-insensitive by design: conceptually, it is obtained by hashing each element to a bitmap index (using an exponential hash function) and setting that bit to one. During aggregation, each node first produces a sketch for its local sensors. A node in the i-th ring receives sketches from its neighbors (i.e., nodes within direct communication distance) in the (i+1)-th ring, takes the bitwise OR of all these sketches and its own, and broadcasts the resulting sketch to all its neighbors in the (i−1)-th ring. Taking advantage of broadcast communication, each node sends out only one message during aggregation, but the information therein can reach the base station via multiple paths, boosting reliability. The overall COUNT can be estimated accurately with high probability using sketches of size Θ(log N). The failure-resiliency feature comes with two costs in the above approach: the final answer is only approximate, and the size of each message is larger than in the tree-based exact-aggregation approach. Manjhi et al. [13] have developed an adaptive, hybrid strategy that combines the advantages of the two approaches by applying them to different regions of the network and dynamically exploring the tradeoffs between the two approaches.
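A minimal Python sketch of this duplicate-insensitive counting idea, given purely for illustration (the hash choice, sketch width, and the classic 0.77351 correction constant follow the usual Flajolet-Martin description, not necessarily the exact variant used in [1,14]):

import hashlib

WIDTH = 32  # bits in each sketch

def fm_insert(sketch, item):
    # Set the bit chosen by the position of the lowest set bit of the item's hash.
    h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
    r = (h & -h).bit_length() - 1 if h else WIDTH - 1
    return sketch | (1 << min(r, WIDTH - 1))

def fm_merge(a, b):
    # Bitwise OR: order- and duplicate-insensitive.
    return a | b

def fm_estimate(sketch):
    # Estimate the distinct count from the position of the lowest unset bit.
    r = 0
    while sketch & (1 << r):
        r += 1
    return (2 ** r) / 0.77351

# Each node sketches its own id; duplicates introduced by multi-path
# routing do not change the result because OR is idempotent.
s = 0
for node_id in list(range(1, 101)) + list(range(1, 101)):  # every id seen twice
    s = fm_insert(s, node_id)
print(round(fm_estimate(s)))   # rough estimate of 100, despite the duplicates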
Temporal Aspects of Aggregation

The preceding discussion has largely ignored the temporal aspects of aggregation. In practice, aggregation queries in sensor networks are often continuous. In its simplest form, such a query executes continuously over time and produces, for each epoch, an aggregate result computed over all readings acquired in this epoch. A key optimization opportunity is that sensor readings often are temporally correlated and do not change haphazardly over time. Intuitively, rather than re-aggregating from scratch in every epoch, evaluation efforts should focus only on relevant changes since the last epoch. An effective strategy for implementing the above intuition is to install on nodes local constraints that dictate when changes in subtrees need to be reported. These constraints carry memory about past readings and filter out reports that do not affect the current aggregate result, thereby reducing communication. For example, to compute MAX continuously, Silberstein et al. [16] set a threshold at each node, which is always no less than its local reading and its children's
thresholds. A node sends up the current maximum value in its subtree only if that value exceeds the threshold; the threshold is then adjusted higher. If the current global maximum falls, nodes with thresholds higher than the new candidate maximum must be visited to find the new maximum; at the same time, their thresholds are adjusted lower. Thresholds control the tradeoff between reporting and querying costs, and can be set adaptively at runtime. Alternatively, optimum settings can be found by assuming adversarial data behavior or using predictive models of data behavior. As another example, consider continuous SUM. Unlike MAX, even a small change in one individual reading affects the final SUM result. Approximation is thus needed to do better than one message per node per epoch. Deligiannakis et al. [4] employ an interval constraint at each node, which bounds the sum of all current readings within the subtree. In each epoch, the node sends its estimate of this partial sum to its parent only if this value falls outside its interval; the interval then recenters at this value. The length of the interval controls the error allowance for the subtree. Periodically, based on statistics collected, the interval lengths are adjusted by recursively redistributing the total error allowed in the final result to all nodes. Continuous versions of more complex queries, for which approximation is needed to reduce the size of partial aggregate records, have also been studied. For example, Cormode et al. [2] show how to continuously compute e-approximate quantile summaries using a hierarchy of constraints in the network to filter out insignificant changes in subtree summaries. As with other techniques described above, a major technical challenge lies in allocating error tolerance to each constraint; optimum allocation can be computed for predictive models of data behavior. Xue et al. [17] consider the continuous version of the contour map query. Instead of sending up the entire partial aggregate record (in this case, a contour map for the subtree), only its difference from the last transmitted version needs to be sent. In the continuous setting, aggregation can apply not only spatially to the collection of readings acquired in the same epoch, but also temporally over historical data (e.g., recent readings in a sliding window). Cormode et al. [3] consider the problem of continuously computing time-decaying versions of aggregation functions such as SUM, quantiles, and heavy hitters. The
contribution of a reading taken at time t_0 to the aggregate result at the current time t = t_0 + Δ is weighted by a user-defined decay function f(Δ) ≥ 0 that is non-increasing in Δ. The solution is based on duplicate-insensitive sketching techniques, and it approximates a general decay function using a collection of sliding windows of different lengths.
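To make the threshold-based filtering for continuous MAX concrete, here is a small Python sketch; it is an illustration only, and the threshold-adjustment rule is a simplistic stand-in for the adaptive and model-based policies of [16].

class Node:
    def __init__(self, reading, threshold):
        self.reading = reading
        self.threshold = threshold   # node reports only values above this threshold
        self.messages_sent = 0

    def update(self, new_reading, slack=2.0):
        # Return the value to report upward, or None if filtered out locally.
        self.reading = new_reading
        if new_reading > self.threshold:
            # Report, then raise the threshold to suppress near-future reports.
            self.threshold = new_reading + slack
            self.messages_sent += 1
            return new_reading
        return None   # the change stays local; no radio message is sent

# One epoch at three nodes: only node C's reading crosses its threshold.
nodes = {"A": Node(10.0, 15.0), "B": Node(12.0, 15.0), "C": Node(14.0, 15.0)}
readings = {"A": 11.0, "B": 13.0, "C": 16.5}
reports = {k: n.update(readings[k]) for k, n in nodes.items()}
print(reports)                                              # {'A': None, 'B': None, 'C': 16.5}
print(max(v for v in reports.values() if v is not None))   # candidate global MAX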
Other Aspects of Aggregation in Sensor Networks

Besides the above discussion, there are many other aspects of sensor data aggregation that are not covered by this entry; some of them are outlined briefly below. Most techniques presented earlier opportunistically exploit in-network processing, whenever two partial aggregate records meet at the same node following their standard routes to the base station. More generally, routing can be made aggregation-driven [11] by encouraging convergence of data that can be aggregated more effectively (e.g., the merged partial aggregate record uses less space to achieve the required accuracy). Oftentimes, only a sparse subset of nodes contribute inputs to aggregation, and this subset is not known a priori, e.g., when aggregation is defined over the output of a filter operation evaluated locally at each node. The challenge in this scenario is to construct an ad hoc aggregation tree of high quality in a distributed fashion [6]. Finally, for some applications, the final aggregate result is needed at all nodes in the sensor network as opposed to just the base station. Gossiping is an effective technique for this purpose [10], which relies only on local communication and does not assume any particular routing strategy or topology.
Key Applications Aggregation is a fundamental query primitive indispensable to many applications of sensor networks. It is widely used in expressing and implementing common sensor network tasks such as summarization, compression, monitoring, and event detection. Even for applications that are interested in collecting all detailed sensor readings, aggregation can be used to monitor system and data characteristics, which supports maintenance of the sensor network and optimization of its operations.
Cross-references
▶ Ad-Hoc Queries in Sensor Networks ▶ Aggregation ▶ Continuous Queries in Sensor Networks
▶ Data Compression in Sensor Networks ▶ Data Fusion in Sensor Networks
Recommended Reading 1. Considine J., Li F., Kollios G., and Byers J. Approximate aggregation techniques for sensor databases. In Proc. 20th Int. Conf. on Data Engineering, 2004, pp. 449–460. 2. Cormode G., Garofalakis M., Muthukrishnan S., and Rastogi R. Holistic aggregates in a networked world: distributed tracking of approximate quantiles. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2005, pp. 25–36. 3. Cormode G., Tirthapura S., and Xu B. Time-decaying sketches for sensor data aggregation. In Proc. ACM Symposium on Principles of Distributed Computing, 2007, pp. 215–224. 4. Deligiannakis A., Kotidis Y., and Roussopoulos N. Hierarchical in-network data aggregation with quality guarantees. In Advances in Database Technology, In Proc. 9th Int. Conf. on Extending Database Technology, 2004, pp. 658–675. 5. Gandhi S., Hershberger J., and Suri S. Approximate isocontours and spatial summaries for sensor networks. In Proc. 6th Int. Symp. Inf. Proc. in Sensor Networks, 2007, pp. 400–409. 6. Gao J., Guibas L.J., Milosavljevic N., and Hershberger J. Sparse data aggregation in sensor networks. In Proc. 6th Int. Symp. Inf. Proc. in Sensor Networks, 2007, pp. 430–439. 7. Greenwald M. and Khanna S. Power-conserving computation of order-statistics over sensor networks. In Proc. 23rd ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2004, pp. 275–285. 8. Hellerstein J.M., Hong W., Madden S., and Stanek K. Beyond average: toward sophisticated sensing with queries. In Proc. 2nd Int. Workshop Int. Proc. in Sensor Networks, 2003, pp. 63–79. 9. Intanagonwiwat C., Govindan R., and Estrin D. Directed diffusion: a scalable and robust communication paradigm for sensor networks. In Proc. 6th Annual Int. Conf. on Mobile Computing and Networking, 2000, pp. 56–67. 10. Kempe D., Dobra A., and Gehrke J. Gossip-based computation of aggregate information. In Proc. 44th Annual Symp. on Foundations of Computer Science, 2003, pp. 482–491. 11. Luo H., Y. Liu, and S. Das. Routing correlated data with fusion cost in wireless sensor networks. IEEE Transactions on Mobile Computing, 11(5):1620–1632, 2006. 12. Madden S., Franklin M.J., Hellerstein J.M., and Hong W. TAG: a tiny aggregation service for ad-hoc sensor networks. In Proc. 5th USENIX Symp. on Operating System Design and Implementation, 2002. 13. Manjhi A., Nath S., and Gibbons P.B. Tributaries and deltas: efficient and robust aggregation in sensor network streams. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2005, pp. 287–298. 14. Nath S., Gibbons P.B., Seshan S., and Anderson Z.R. Synopsis diffusion for robust aggregation in sensor networks. In Proc. 2nd Int. Conf. on Embedded Networked Sensor Systems, 2004, pp. 250–262. 15. Silberstein A., Braynard R., Ellis C., and Munagala K. A sampling-based approach to optimizing top-k queries in sensor networks. In Proc. 22nd Int. Conf. on Data Engineering, 2006.
16. Silberstein A., Munagala K., and Yang J. Energy-efficient monitoring of extreme values in sensor networks. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2006. 17. Xue W., Luo Q., Chen L., and Liu Y. Contour map matching for event detection in sensor networks. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2006, pp. 145–156.
Data Analysis ▶ Data Mining
Data Anomalies ▶ Data Conflicts
Data Broadcasting, Caching and Replication in Mobile Computing PANOS K. CHRYSANTHIS1, EVAGGELIA PITOURA2 1University of Pittsburgh, Pittsburgh, PA, USA 2University of Ioannina, Ioannina, Greece
Synonyms Data dissemination; Push/pull delivery; Data copy
Definition Mobile computing devices (such as portable computers or cellular phones) have the ability to communicate while moving by being connected to the rest of the network through a wireless link. There are two general underlying infrastructures: single-hop and multi-hop ones. In single-hop infrastructures, each mobile device communicates with a stationary host, which corresponds to its point of attachment to the wired network. In multi-hop infrastructures, an ad-hoc wireless network is formed in which mobile hosts participate in routing messages among each other. In both infrastructures, the hosts between the source (or sources) and the requester of data (or data sink) form a dissemination tree. The hosts (mobile or stationary) that form the dissemination tree may store data and participate in computations towards achieving in network processing. Challenges include [14], (i) intermittent connectivity, which refers to both short and long periods of network
unavailability, (ii) scarcity of resources, including storage and battery life, and (iii) mobility itself. To handle these challenges, data items may be stored locally (cached or replicated) at the requester or at intermediate nodes of the dissemination tree. Caching and replication aim at increasing availability in the case of network disconnections or host failures, as well as at handling intermittent connectivity. Mobility introduces additional challenges in maintaining cache and replica consistency and in replica placement protocols. Wireless data delivery in both infrastructures physically supports broadcasting. This broadcast facility has been used to provide a push mode of data dissemination in which a server broadcasts data to a large client population, often without an explicit request from the clients. Issues addressed by related research include broadcast scheduling and organization (i.e., which items to broadcast and in which order), indexing of broadcast data, and update propagation.
Historical Background Mobile computing can be traced back to file systems and the need for disconnected operations in the late 1980s. With the rapid growth in mobile technologies and the cost effectiveness in deploying wireless networks in the 1990s, the goal of mobile computing was the support of AAA (anytime, anywhere and any-form) access to data by users from their portable computers, mobile phones and other devices with small displays and limited resources. These advances motivated research in data management in the early 1990s.
Foundations

Data Broadcasting
Many forms of wireless network infrastructures rely on broadcast technology to deliver data to large client populations. As opposed to point-to-point data delivery, broadcast delivery is scalable, since a single broadcast response can potentially satisfy many clients simultaneously. There are two basic modes of broadcast data delivery: pull-based and push-based. With push-based data delivery, the server sends data to clients without an explicit request. With pull-based, or on-demand, broadcast delivery, data are delivered only after a specific client request. In general, access to broadcast data is sequential, with clients monitoring the broadcast channel and retrieving any data items
of interest as they arrive. The smallest access unit of broadcast data is commonly called a bucket or page.

Scheduling and Organization
A central issue is determining the content of the broadcast, or broadcast scheduling. Scheduling depends on whether delivery is on-demand, push, or hybrid. In on-demand broadcast, an up-link channel is available to clients for submitting requests, and the item to be broadcast next is chosen among those for which there are pending requests. Common heuristics for on-demand scheduling include First Come First Served and Longest Wait First [7]. The RxW strategy selects the data item with the maximal R×W value, where R is the number of pending requests for an item and W is the amount of time that the oldest pending request for that item has spent waiting to be served [3]. More recent schemes extend RxW to consider the semantics of the requested data and applications, such as subsumption properties in data cubes [15]. Push-based broadcast scheduling assumes a priori knowledge of client access distributions and prepares an off-line schedule; push-based data delivery is often periodic. In hybrid broadcast, the set of items is partitioned so that some items are pushed, i.e., broadcast continuously, and the rest are pulled, i.e., broadcast only after being requested [2]. Commonly, the partition between push and pull data is based on popularity, with the most popular items being pushed periodically and the rest delivered on demand. One problem is that, for push items, there is no way to detect changes in their popularity. One solution is to occasionally stop broadcasting some pushed items; this forces clients to send explicit requests for them, which can be used to estimate their popularity [16]. An alternative that avoids flooding of requests requires a percentage of the clients to submit an explicit request irrespective of whether or not a data item appears on the broadcast [5]. The organization of the broadcast content is often called the broadcast program. In general, broadcast organizations can be classified as either flat, where each item is broadcast exactly once, or skewed, where an item may appear more than once. One can also distinguish between clustered organizations, where data items having the same or similar values of some attribute appear consecutively, and non-clustered ones, where there is no such correlation. In skewed organizations, the broadcast frequency of each item depends on its popularity. For achieving optimal access latency or
response time, it was shown that (i) the relative number of appearances of items should be proportional to the square root of their access probabilities and (ii) successive broadcasts of the same item should be at equal distances [7]. It was also shown that the Mean Aggregate Access (MAD) policy that selects to broadcast next the item whose access probability the interval since its last broadcast is the highest achieves close to optimal response time [17]. Along these lines, a practical skewed push broadcast organization is that of broadcast disks [1]. Items are assigned to virtual disks with different ‘‘speeds’’ based on their popularity with popular items being assigned to fast disks. The spin speed of each disk is simulated by the frequency with which the items assigned to it are broadcast. For example, the fact that a disk D1 is three times faster than a disk D2, means that items assigned to D1 are broadcast three times as often as those assigned to D2. To achieve this, each disk is split into smaller equal-sized units called chunks, where the number of chunks per disk is inversely proportional to the relative frequence of the disk. The broadcast program is generated by broadcasting one chunk from each disk and cycling through all the chunks sequentially over all the disks. Indexing
To reduce energy consumption, a mobile device may switch to doze or sleep mode when inactive. Thus, research in wireless broadcast also considers reducing the tuning time defined as the amount of time a mobile client remains active listening to the broadcast. This is achieved by including index entries in the broadcast so that by reading them, the client can determine when to tune in next to access the actual data of interest. Adding index entries increases the size of the broadcast and thus may increase access time. The objective is to develop methods for allocating index entries together with data entries on the broadcast channel so that both access and tuning time are optimized. In (1, m) indexing [18], an index for all data items is broadcast following every fraction (1 ∕m) of the broadcast data items. Distributed indexing [18] improves over this method by instead of replicating the whole index m times, each index segment describes only the data items that follow it. Following the same principles, different indexing schemes have been proposed that support different query types or offer different trade-offs between access and tuning time. Finally, instead of broadcasting an index, hashing-based techniques have also been applied.
559
Data Caching and Replication
A mobile computing device (such as a portable computer or cellular phone) is connected to the rest of the network through a wireless link. Wireless communication has a double impact on the mobile device since the limited bandwidth of wireless links increases the response times for accessing remote data from a mobile host and transmitting as well as receiving of data are high energy consumption operations. The principal goal of caching and replication is to store appropriate pieces of data locally at the mobile device so that it can operate on its own data, thus reducing the need for communication that consumes both energy and bandwidth. Several cost-based caching policies along the principles of greedy-dual ones have been proposed that consider energy cost. In the case of broadcast push, the broadcast itself can be viewed as a ‘‘cache in the air.’’ Hence, in contrast to traditional policies, performance can be improved by clients caching those items that are accessed frequently by them but are not popular enough among all clients to be broadcast frequently. For instance, a costbased cache replacement policy selects as a victim the page with the lowest p ∕x value, where p is the local access probability of the page and x its broadcast frequency [1]. Prefetching can also be performed with low overhead, since data items are broadcast anyway. A simple prefetch heuristic evaluates the worth of each page on the broadcast to determine whether it is more valuable than some other page in cache and if so, it swaps the cache page with the broadcast one. Replication is also deployed to support disconnected operation that refers to the autonomous operation of a mobile client, when network connectivity becomes either unavailable (for instance, due to physical constraints), or undesirable (for example, for reducing power consumption). Preloading or prefetching data to sustain a forthcoming disconnection is often termed hoarding. Optimistic approaches to consistency control are typically deployed that allow data to be accessed concurrently at multiple sites without a priori synchronization between the sites, potentially resulting in short term inconsistencies. At some point, operations performed at the mobile device must be synchronized with operations performed at other sites. Synchronization depends on the level at which correctness is sought. This can be roughly categorized as replica-level correctness and transaction-level correctness. At the replica level, correctness or coherency requirements are
D
expressed per item in terms of the allowable divergence among the values of the copies of each item. At the transaction level, the strictest form of correctness is achieved through global serializability, which requires the execution of all transactions running at mobile and stationary hosts to be equivalent to some serial execution of the same transactions. With regard to update propagation, with eager replication all copies of an item are synchronized within a single transaction, whereas with lazy replication the transactions that keep replicas coherent execute as separate, independent database transactions after the original transaction commits. Common characteristics of protocols for consistency in mobile computing include the following. The propagation of updates performed at the mobile site generally follows lazy protocols. Reads are allowed on the local data, while updates of local data are tentative in the sense that they need to be further validated before commitment. For integrating operations at the mobile hosts with transactions at other sites, in the case of replica-level consistency, copies of each item are reconciled following some conflict resolution protocol; at the transaction level, local transactions are validated against some application- or system-level criterion. If the criterion is met, the transaction is committed; otherwise, the execution of the transaction is either aborted, reconciled, or compensated. Representative approaches along these lines include isolation-only transactions in Coda, mobile open-nested transactions [6], two-tier replication [8], two-layer transactions [10], and Bayou [9]. When local copies are read-only, a central issue is the design of efficient protocols for disseminating server updates to mobile clients. A server is called stateful if it maintains information about its clients and the content of their caches, and stateless otherwise. A server may use broadcasting to efficiently propagate update reports to all of its clients. Such update reports vary in the type of information they convey to the clients; for instance, they may include just the identifiers of the updated items or the updated values themselves, and they may provide information for individual items or aggregate information for sets of items. Update propagation may be either synchronous or asynchronous. In asynchronous methods, update reports are broadcast as the updates are performed. In synchronous methods, the server broadcasts an
update report periodically. A client must listen for the report first to decide whether its cache is valid or not. This adds some latency to query processing; however, each client needs to tune in only periodically to read the report. The efficiency of update dissemination protocols for clients with different connectivity behavior, such as workaholics (i.e., often-connected clients) and sleepers (i.e., often-disconnected clients), is evaluated in [4]. Finally, in the case of broadcast push data delivery, clients may read items from different broadcast programs. The currency of the set of data items read by each client can be characterized based on the current values of the corresponding items at the server and on the temporal discrepancy among the values of the items in the set [13]. A stricter notion of correctness may be achieved at the transaction level by requiring client read-only transactions to be serializable with the server transactions. Methods for doing so include: (i) an invalidation method [12], where the server broadcasts an invalidation report that includes the data items that have been updated since the broadcast of the previous report, and transactions that have read obsolete items are aborted; (ii) serialization graph testing (SGT) [12], where the server broadcasts control information related to conflicting operations; and (iii) multiversion broadcast [11], where multiple versions of each item are broadcast, so that client transactions always read a consistent database snapshot.
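A minimal Python sketch of the periodic invalidation-report scheme just described follows; it is an illustration only, and the report contents and abort rule are simplified relative to [12].

class MobileClient:
    def __init__(self):
        self.cache = {}        # item -> cached value
        self.read_set = set()  # items read by the currently running read-only transaction

    def read(self, item, fetch):
        if item not in self.cache:
            self.cache[item] = fetch(item)   # pull from the server on a cache miss
        self.read_set.add(item)
        return self.cache[item]

    def on_invalidation_report(self, updated_items):
        # Called when the periodic report arrives on the broadcast channel.
        for item in updated_items:
            self.cache.pop(item, None)       # drop stale copies
        if self.read_set & set(updated_items):
            self.read_set.clear()
            return "abort"                   # the transaction read an obsolete item
        return "continue"

server_data = {"x": 1, "y": 2}
client = MobileClient()
client.read("x", fetch=server_data.get)
server_data["x"] = 5                          # update happens at the server
print(client.on_invalidation_report(["x"]))   # -> "abort"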
Key Applications Data broadcasting, caching and replication techniques are part of the core of any application that requires data sharing and synchronization among mobile devices and data servers. Such applications include vehicle dispatching, object tracking, points of sale (e.g., ambulance and taxi services, Fedex/UPS), and collaborative applications (e.g., homecare, video gaming). They are also part of embedded or light versions of database management systems that extend enterprise applications to mobile devices. These include among others Sybase Inc.’s SQL Anywhere, IBM’s DB2 Everyplace, Microsoft SQL Server Compact, Oracle9i Lite and SQL Anywhere Technologies’ Ultralite.
Cross-references
▶ Concurrency Control ▶ Hash-Based Indexing
▶ MANET Databases ▶ Mobile Database ▶ Replicated Database Concurrency Control ▶ Transaction Management
Recommended Reading 1. Acharya S., Alonso R., Franklin M.J., and Zdonik S.B. Broadcast disks: data management for asymmetric communications environments. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1995, pp. 199–210. 2. Acharya S., Franklin M.J., and Zdonik S.B. Balancing push and pull for data broadcast. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1997, pp. 183–194. 3. Aksoy D. and Franklin M.J. RxW: a scheduling approach for large scale on-demand broadcast. IEEE/ACM Trans. Netw., 7(6):846–860, 1999. 4. Barbara´ D. and Imielinski T. Sleepers and workaholics: caching strategies in mobile environments. VLDB J., 4(4):567–602, 1995. 5. Beaver J., Chrysanthis P.K., and Pruhs K. To broadcast push or not and what? In Proc. 7th Int. Conf. on Mobile Data Management, 2006, pp. 40–45. 6. Chrysanthis P.K. Transaction processing in a mobile computing environment. In Proc. IEEE Workshop on Advances in Parallel and Distributed Systems, 1993, pp. 77–82. 7. Dykeman H.D., Ammar M.H., and Wong J.W. Scheduling algorithms for videotex systems under broadcast delivery. In Proc. IEEE Int. Conf. on Communications, 1986, pp. 1847–1851. 8. Gray J., Helland P., Neil P.O., and Shasha D. The dangers of replication and a solution. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1996, pp. 173–182. 9. Petersen K., Spreitzer M., Terry D.B. Theimer M., and Demers A.J. Flexible update propagation for weakly consistent replication. In Proc. 16th ACM Symp. on Operating System Principles, 1997, pp. 288–301. 10. Pitoura E. and Bhargava B. Data consistency in intermittently connected distributed systems. IEEE Trans. Knowl. Data Eng., 11(6):896–915, 1999. 11. Pitoura E. and Chrysanthis P.K. Exploiting versions for handling updates in broadcast disks. In Proc. 25th Int. Conf. on Very Large Data Bases, 1999, pp. 114–125. 12. Pitoura E. and Chrysanthis P.K. Scalable processing of read-only transactions in broadcast push. In Proc. 19th Int. Conf. on Distributed Computing Systems, 1999, pp. 432–439. 13. Pitoura E., Chrysanthis P.K., and Ramamritham K. Characterizing the temporal and semantic coherency of broadcast-based data dissemination. In Proc. 9th Int. Conf. on Database Theory, 2003, pp. 410–424. 14. Pitoura E. and Samaras G. Data Management for Mobile Computing. Kluwer, Boston, USA, 1998. 15. Sharaf MA. and Chrysanthis P.K. On-demand data broadcasting for mobile decision making. MONET, 9(6):703–714, 2004. 16. Stathatos K., Roussopoulos N., and Baras J.S. Adaptive data broadcast in hybrid networks. In Proc. 23th Int. Conf. on Very Large Data Bases, 1997, pp. 326–335. 17. Su C.J, Tassiulas L., and Tsotras V.J. Broadcast scheduling for information distribution. Wireless Netw., 5(2):137–147, 1999.
18. Imielinski T., Viswanathan S., and Badrinath B.R. Data on air: organization and access. IEEE Trans. Knowl. Data Eng., 9(3):353–372, 1997.
Data Cache ▶ Processor Cache
Data Cleaning Venkatesh Ganti Microsoft Research, Redmond, WA, USA
Definition Owing to differences in conventions between the external sources and the target data warehouse as well as due to a variety of errors, data from external sources may not conform to the standards and requirements at the data warehouse. Therefore, data has to be transformed and cleaned before it is loaded into a data warehouse so that downstream data analysis is reliable and accurate. Data Cleaning is the process of standardizing data representation and eliminating errors in data. The data cleaning process often involves one or more tasks each of which is important on its own. Each of these tasks addresses a part of the overall data cleaning problem. In addition to tasks which focus on transforming and modifying data, the problem of diagnosing quality of data in a database is important. This diagnosis process, often called data profiling, can usually identify data quality issues and whether or not the data cleaning process is meeting its goals.
Historical Background Many business intelligence applications are enabled by data warehouses. If the quality of data in a data warehouse is poor, then conclusions drawn from business data analysis could also be incorrect. Therefore, much emphasis is placed on cleaning and maintaining high quality of data in data warehouses. Consequently, the area of data cleaning received considerable attention in the database community. An early survey of automatic data cleaning techniques can be found in [14]. Several companies also started developing domain-specific data cleaning solutions (especially for the customer address domain). Over time, several generic data cleaning techniques have also been
developed (e.g., [10,5,15,8,9,1]), and domain-neutral commercial data cleaning software also started making its appearance (e.g., [13,11]).
Foundations
Main Data Cleaning Tasks
In this section, the goals of several data cleaning tasks are introduced informally. The set of tasks mentioned below consists of those addressing commonly encountered problems in data cleaning and may not be a comprehensive list. However, note that most of the tasks mentioned below are important whether one wants to clean data at the time of loading a data warehouse or at the time of querying a database [6].
Column Segmentation
Consider a scenario where a customer relation is being imported to add new records to a target customer relation. Suppose the address information in the target relation is split into its constituent attributes [street address, city, state, and zip code] while in the source relation they are all concatenated into one attribute. Before the records from the source relation can be inserted into the target relation, it is essential to segment each address value in the source relation to identify the attribute values at the target. The goal of a column segmentation task is to split an incoming string into segments, each of which may be inserted as an attribute value at the target. A significant challenge to be addressed by this task is to efficiently match sub-strings of an input string with patterns such as regular expressions and with members of potentially large reference tables in order to identify values for target attributes. Note that, in general, data integration may involve more complex schema transformations than those achieved by the column segmentation task.
Record Matching
Consider a scenario where a new batch of customer records is being imported into a sales database. In this scenario, it is important to verify whether or not the same customer is represented in both the existing and the incoming sets, and to retain only one record in the final result. Due to representational differences and errors, records in both batches could be different and may not match exactly on their key attributes (e.g., name and address, or the CustomerId). The goal of a record matching task is to identify record pairs, one in each of two input relations, which correspond to the same real world entity. Challenges to be addressed in this task include (i) identification of criteria under which two records represent the same real world entity, and (ii) efficient computation strategies to determine such pairs over large input relations.
Deduplication
Consider a scenario where one obtains a set of customer records or product records from an external (perhaps low quality) data source. This set may contain multiple records representing the same real world (customer or product) entity. It is important to "merge" records representing the same entity into one record in the final result. The goal of a deduplication task is to partition a relation into disjoint sets of records such that each group consists of records which represent the same real world entity. Deduplication may (internally) rely on a record matching task, but the additional responsibility of further grouping records based on pairwise matches introduces new challenges. The output of record matching may not be transitively closed. For instance, a record matching task comparing record pairs in a relation may output the pairs (r1, r2) and (r2, r3) as matches, but not (r1, r3). Then, the problem of deriving a partitioning that respects the pairwise information returned by record matching is solved by deduplication.
Data Standardization
Consider a scenario where a relation contains several customer records with missing zip code or state values, or improperly formatted street address strings. In such cases, it is important to fill in missing values and to adjust, where possible, the format of the address strings so as to return correct results for analysis queries. For instance, if a business analyst wants to understand the number of customers for a specific product by zip code, it is important for all customer records to have correct zip code values. The task of improving the quality of information within a database is often called data standardization. Similar tasks also occur in various other domains such as product catalog databases. The data standardization task may also improve the effectiveness of the record matching and deduplication tasks.
Data Profiling
The process of cleansing data is often an iterative and continuous process. It is important to ‘‘evaluate’’ quality of data in a database before one initiates data cleansing process, and subsequently assesses its success. The process of evaluating data quality is called data profiling, and typically involves gathering several
aggregate data statistics which constitute the data profile, and ensuring that the values match up with expectations. For example, one may expect the customer name and address columns together to uniquely determine each customer record in a Customer relation. In such a case, the number of unique [name, address] values must be close to that of the total number of records in the Customer relation. Note that a large subset of elements of a data profile may each be obtained using one or more SQL queries. However, because all the elements of a data profile are computed together, there is an opportunity for a more efficient computation strategy. Further, the data profile of a database may also consist of elements which may not easily be computed using SQL queries. Besides the set of data cleaning tasks mentioned above, other data cleaning tasks such as filling in missing values, identifying incorrect attribute values and then automatically correcting them based on known attribute value distributions are also important for applications such as cleaning census data. Data Cleaning Platforms
The above requirements for a variety of data cleaning tasks have led to the development of utilities that support data transformation and cleaning. Such software falls into two broad categories: Vertical Solutions: The first category consists of verticals such as Trillium [15] that provide data cleaning functionality for specific domains, e.g., addresses. Since they understand the domain where the vertical is being applied, they can fine tune their software for the given domain. However, by design, these are not generic and hence cannot be applied to other domains. Horizontal Platforms: The other approach of building data cleaning software is to define and implement basic data cleaning operators. The broad goal here is to define a set of domain neutral operators, which can significantly reduce the load of developing common data cleaning tasks such as those outlined above. An example of such a basic operator is the set similarity join which may be used for identifying pairs of highly similar records across two relations (e.g., [16,4]). The advantage is that custom solutions for a variety of data cleaning tasks may now be developed for specialized domains by composing one or more of these basic operators along with other (standard or custom) operators. These basic operators do the heavy lifting and thus make the job of developing
data cleaning programs easier. Examples of such platforms include AJAX [7,8] and Data Debugger [3]. The above-mentioned data cleaning operators may then be included in database platforms so as to enable programmers to easily develop custom data cleaning solutions. For instance, ETL (extract-transform-load) tools such as Microsoft SQL Server Integration Services (SSIS) [13] and IBM WebSphere Information Integration [11] can be characterized as "horizontal" platforms. These platforms are applicable across a variety of domains, and provide a set of composable operators (e.g., relational operators) enabling users to build programs involving these operators. Further, these platforms also allow users to build their own custom operators, which may then be used in these programs. Hence, such ETL platforms provide a great vehicle to include core data cleaning operators.
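As a rough illustration of the semantics of the set similarity join operator mentioned above, the Python sketch below compares tokenized record strings with Jaccard similarity in a naive nested loop. The tokenization on whitespace and the 0.6 threshold are assumptions made for the example; production operators (e.g., [16,4]) use signature and prefix-filtering techniques to avoid comparing every pair.

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def naive_set_similarity_join(R, S, threshold=0.6):
    """Return pairs of records from R and S whose token sets are similar
    above the threshold; shows only the operator's output, not an
    efficient implementation."""
    matches = []
    for r in R:
        for s in S:
            if jaccard(set(r.lower().split()), set(s.lower().split())) >= threshold:
                matches.append((r, s))
    return matches

print(naive_set_similarity_join(["John A. Smith"], ["Smith John"]))
# [('John A. Smith', 'Smith John')]  -- Jaccard 2/3 on the shared tokens
```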
Key Applications Data cleaning technology is critical for several information technology initiatives (such as data warehousing and business intelligence) which consolidate, organize, and analyze structured data. Accurate data cleaning processes are typically employed during data warehouse construction and maintenance to ensure that subsequent business intelligence applications yield accurate results. A significant amount of recent work has been focusing on extracting structured information from documents to enable structured querying and analysis over document collections [2,12]. Invariably, the extracted data is unclean and many data cleaning tasks discussed above are applicable in this context as well.
Cross-references
▶ Column segmentation ▶ Constraint-driven database repair ▶ Data deduplication ▶ Data profiling ▶ Record matching ▶ Similarity functions for data cleaning
Recommended Reading 1. Borkar V. Deshmukh V. and Sarawagi S. Automatic segmentation of text into structured records. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2001. 2. Cafarella M.J. Re C. Suciu D. Etzioni O. and Banko M. Structured querying of the web text. In Proc. 3rd Biennial Conf. on Innovative Data systems Research, 2007.
3. Chaudhuri S. Ganti V. and Kaushik. R. Data debugger: an operator-centric approach for data quality solutions. IEEE Data Eng. Bull., 2006. 4. Chaudhuri S. Ganti V. and Kaushik. R. A primitive operator for similarity joins in data cleaning. In Proc. 22nd Int. Conf. on Data Engineering, 2006. 5. Cohen. W. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1998. 6. Fuxman A. Fazli E. and Miller. R.J. Conquer: efficient management of inconsistent databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2005. 7. Galhardas H. Florescu D. Shasha D. and Simon. E. An extensible framework for data cleaning. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1999. 8. Galhardas H. Florescu D. Shasha D. Simon E. and Saita. C. Declarative data cleaning: language, model, and algorithms. In Proc. 27th Int. Conf. on Very Large Data Bases, 2001. 9. Gravano L. Ipeirotis P.G. Jagadish H.V. Koudas N. Muthukrishnan S. and Srivastava. D. Approximate string joins in a database (almost) for free. In Proc. 27th Int. Conf. on Very Large Data Bases, 2001. 10. Hernandez. M. and Stolfo. S. The merge/purge problem for large databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1995. 11. IBM Websphere information integration. http://ibm.ascential.com. 12. Ipeirotis P.G. Agichtein E. Jain P. and Gravano. L. To search or to crawl? towards a query optimizer for text-centric tasks. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2006. 13. Microsoft SQL Server 2005 integration services. 14. Rahm E. and Do. H.H. Data cleaning: problems and current approaches. IEEE Data Engineering Bulletin, 2000. 15. Raman V. and Hellerstein. J. An interactive framework for data cleaning. Technical report, University of California, Berkeley, 2000. 16. Sarawagi S. and Kirpal. A. Efficient set joins on similarity predicates. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2004. 17. Trillium Software. www.trilliumsoft.com/trilliumsoft.nsf.
Data Collection ▶ Data Acquisition and Dissemination in Sensor Networks
Data Compression in Sensor Networks Amol Deshpande University of Maryland, College Park, MD, USA
Synonyms Distributed source coding; Correlated data collection; Data suppression
Definition Data compression issues arise in a sensor network when designing protocols for efficiently collecting all data observed by the sensor nodes at an Internet-connected base station. More formally, let Xi denote an attribute being observed by a node in the sensor network – Xi may be an environmental property being sensed by the node (e.g., temperature), or it may be the result of an operation on the sensed values (e.g., in an anomaly-detection application, the sensor node may continuously evaluate a filter such as "temperature > 100" on the observed values). The goal is to design an energy-efficient protocol to periodically collect the observed values of all such attributes (denoted X1,...,Xn) at the base station, at a frequency specified by the user. In many cases, a bounded-error approximation might be acceptable, i.e., the reported values may only be required to be within ε of the observed values, for a given ε. The typical optimization metric is the total energy expended during the data collection process, commonly approximated by the total communication cost. However, metrics such as minimizing the maximum energy consumption across all nodes or maximizing the lifetime of the sensor network may also be appropriate in some settings.
Key Points The key issue in designing data collection protocols is modeling and exploiting the strong spatio-temporal correlations present in most sensor networks. Let Xi^t be a random variable that denotes the value of Xi at time t (assuming time is discrete), and let H(Xi^t) denote the information entropy of Xi^t. In most sensor network deployments, especially in environmental monitoring applications, the data generated by the sensor nodes is typically highly correlated both in time and in space — in other words, H(Xi^(t+1) | Xi^t) ≪ H(Xi^(t+1)), and H(X1^t,...,Xn^t) ≪ H(X1^t) + ... + H(Xn^t). These correlations can usually be captured quite easily by constructing predictive models using either prior domain knowledge or historical data traces. However, because of the distributed nature of data generation in sensor networks, and the resource-constrained nature of sensor nodes, traditional data compression techniques cannot be easily adapted to exploit such correlations. The distributed nature of data generation has been well-studied in the literature under the name of Distributed Source Coding, whose foundations were
laid almost 35 years ago by Slepian and Wolf [6]. Their seminal work proves that it is theoretically possible to encode the correlated information generated by distributed data sources at the rate of their joint entropy even if the data sources do not communicate with each other. However this result is nonconstructive, and constructive techniques are known only for a few specific distributions [4]. More importantly, these techniques require precise and perfect knowledge of the correlations. This may not be acceptable in practical sensor networks, where deviations from the modeled correlations must be captured accurately. Pattem et al. [3] and Chu et al. [2], among others, propose practical data collection protocols that exploit the spatio-temporal correlations while guaranteeing correctness. However, these protocols may exploit only some of the correlations, and further require the sensor nodes to communicate with each other (thus increasing the overall cost). In many cases, it may not be feasible to construct a predictive model over the sensor network attributes, as required by the above approach, because of mobility, high failure rates or inherently unpredictable nature of the monitored phenomena. Suppression-based protocols, that monitor local constraints and report to the base station only when the constraints are violated, may be used instead in such scenarios [5]. Sensor networks, especially wireless sensor networks, exhibit other significant peculiarities that make the data collection problem challenging. First, sensor nodes are typically computationally constrained and have limited memories. As a result, it may not be feasible to run sophisticated data compression algorithms on them. Second, the communication in wireless sensor networks is typically done in a broadcast manner – when a node transmits a message, all nodes within the radio range can receive the message. This enables many optimizations that would not be possible in a one-to-one communication model. Third, sensor networks typically exhibit an extreme asymmetry in the computation and communication capabilities of the sensor nodes compared to the base station. This motivates the design of pull-based data collection techniques where the base station takes an active role in the process. Adler [1] proposes such a technique for a one-hop sensor network. The proposed algorithm achieves the information-theoretical lower bound on the number of bits sent by the sensor nodes,
while at the same time offloading most of the compute-intensive work to the base station. However, the number of bits received by the sensor nodes may be very high. Finally, sensor networks typically exhibit high message loss and sensor failure rates. Designing robust and fault-tolerant protocols with provable guarantees is a challenge in such an environment.
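A minimal Python sketch of the bounded-error, suppression-style reporting idea discussed above is given below. It is not any specific published protocol: the node simply transmits only when its reading drifts more than ε from the last value it reported, so that (assuming no message loss) the base station's copy stays within ε of the truth.

```python
class SuppressingSensor:
    """Report a reading only when it differs from the last reported value
    by more than eps; otherwise suppress the message."""

    def __init__(self, eps):
        self.eps = eps
        self.last_reported = None

    def observe(self, value):
        if self.last_reported is None or abs(value - self.last_reported) > self.eps:
            self.last_reported = value
            return value      # transmit an update to the base station
        return None           # suppress: the base station keeps its old estimate

node = SuppressingSensor(eps=0.5)
for v in [20.0, 20.2, 20.4, 21.1, 21.2]:
    msg = node.observe(v)
    print(v, "->", "sent" if msg is not None else "suppressed")
```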
Cross-references
▶ Continuous Queries in Sensor Networks ▶ Data Aggregation in Sensor Networks ▶ Data Fusion in Sensor Networks ▶ In-Network Query Processing ▶ Model-based Querying in Sensor Networks
Recommended Reading
1. Adler M. Collecting correlated information from a sensor network. In Proc. 16th Annual ACM-SIAM Symp. on Discrete Algorithms, 2005.
2. Chu D., Deshpande A., Hellerstein J., and Hong W. Approximate data collection in sensor networks using probabilistic models. In Proc. 22nd Int. Conf. on Data Engineering, 2006.
3. Pattem S., Krishnamachari B., and Govindan R. The impact of spatial correlation on routing with compression in wireless sensor networks. In Proc. 3rd Int. Symp. Inf. Proc. in Sensor Networks, 2004.
4. Pradhan S. and Ramchandran K. Distributed source coding using syndromes (DISCUS): design and construction. IEEE Trans. Inform. Theory, 49(3), 2003.
5. Silberstein A., Puggioni G., Gelfand A., Munagala K., and Yang J. Making sense of suppressions and failures in sensor data: a Bayesian approach. In Proc. 33rd Int. Conf. on Very Large Data Bases, 2007.
6. Slepian D. and Wolf J. Noiseless coding of correlated information sources. IEEE Trans. Inform. Theory, 19(4), 1973.
Data Confidentiality ▶ Security Services
Data Conflicts Hong-Hai Do SAP AG, Dresden, Germany
Synonyms Data problems; Data quality problems; Data anomalies; Data inconsistencies; Data errors
Definition Data conflicts are deviations between data intended to capture the same state of a real-world entity. Data with conflicts are often called ‘‘dirty’’ data and can mislead analysis performed on it. In case of data conflicts, data cleaning is needed in order to improve the data quality and to avoid wrong analysis results. With an understanding of different kinds of data conflicts and their characteristics, corresponding techniques for data cleaning can be developed.
increasingly from a practical perspective in the context of business applications. This was essentially pushed by the need to integrate data from heterogeneous sources for business decision making and by the emergence of enterprise data warehouses at the beginning of the 1990s. To date, various research approaches and commercial tools have been developed to deal with the different kinds of data conflicts and to improve data quality [1,2,4,7].
Foundations Historical Background Statisticians were probably the first who had to face data conflicts on a large scale. Early applications, which needed intensive resolution of data conflicts, were statistical surveys in the areas of governmental administration, public health, and scientific experiments. In 1946, Halbert L. Dunn already observed the problem of duplicates in data records of a person’s life captured at different places [3]. He introduced the term Record Linkage to denote the process to resolve the problem, i.e., to obtain and link all unique data records to a consistent view on the person. In 1969, Fellegi and Sunter provided a formal mathematical model for the problem and thereby laid down the theoretical foundation for numerous record linkage applications developed later on [5]. Soon it became clear that record linkage is only the tip of the iceberg of the various problems, such as wrong, missing, inaccurate, and contradicting data, which makes it difficult for humans and applications to obtain a consistent view on real-world entities. In the late 1980s, computer scientists began to systematically investigate all problems related to data quality,
Classification of Data Conflicts
As shown in Fig. 1, data conflicts can be classified according to the following criteria: Single-source versus multi-source: Data conflicts can occur among data within a single source or between different sources. Schema-level versus instance-level: Schema-level conflicts are caused by the design of the data schemas. Instance-level conflicts, on the other hand, refer to problems and inconsistencies in the actual data contents, which are not visible at the schema level. Figure 1 also shows typical data conflicts for the various cases. While not shown, the single-source conflicts occur (with increased likelihood) in the multi-source case, too, besides specific multi-source conflicts. Single-Source Data Conflicts The data quality of a source largely depends on the degree to which it is governed by schema and integrity constraints controlling permissible data values. For sources without a schema,
Data Conflicts. Figure 1. Classification of data conflicts in data sources.
such as files, there are few restrictions on what data can be entered and stored, giving rise to a high probability of errors and inconsistencies. Database systems, on the other hand, enforce restrictions of a specific data model (e.g., the relational approach requires simple attribute values, referential integrity, etc.) as well as applicationspecific integrity constraints. Schema-related data quality problems thus occur because of the lack of appropriate model-specific or application-specific integrity constraints, e.g., due to data model limitations or poor schema design, or because only a few integrity constraints were defined to limit the overhead for integrity control. Instance-specific problems are related to errors and inconsistencies that cannot be prevented at the schema level (e.g., misspellings). Both schema- and instance-level conflicts can be further differentiated according to the different problem scopes attribute, record, record type, and source. In particular, a data conflict can occur within an individual attribute value (attribute), between attributes of a record (record), between records of a record type (record type), and between records of different record types (source). Examples of data conflicts in each problem scope are shown and explained in Tables 1 and 2 for the schema and instance level, respectively. Note that uniqueness constraints specified at the schema level do not prevent duplicated instances, e.g., if information on the same real world entity is entered twice with different attribute values (see examples in Table 2). Multi-Source Data Conflicts
The problems present in single sources are aggravated when multiple sources need to be integrated. Each source may contain dirty data and the data in the sources may be represented differently, may overlap, or contradict. This is because
D
the sources are typically developed, deployed and maintained independently to serve specific needs. This results in a large degree of heterogeneity with respect to database management systems, data models, schema designs, and the actual data. At the schema level, data model and schema design differences are to be addressed by the steps of schema translation and schema integration, respectively. The main problems with respect to schema design are naming and structural conflicts. Naming conflicts arise when the same name is used for different objects (homonyms) or different names are used for the same object (synonyms). Structural conflicts occur in many variations and refer to different representations of the same object in different sources, e.g., attribute versus table representation, different component structure, different data types, different integrity constraints, etc. In addition to schema-level conflicts, many conflicts appear only at the instance level. All problems from the single-source case can occur with different representations in different sources (e.g., duplicated records, contradicting records). Furthermore, even when there are the same attribute names and data types, there may be different value representations (e.g., M/F vs. Male/Female for marital status) or different interpretation of the values (e.g., measurement units Dollar vs. Euro) across sources. Moreover, information in the sources may be provided at different aggregation levels (e.g., sales per product vs. sales per product group) or refer to different points in time (e.g., current sales as of yesterday for Source 1 vs. as of last week for Source 2). A main problem for cleaning data from multiple sources is to identify overlapping data, in particular matching records referring to the same real-world entity (e.g., a particular customer). This problem is
also referred to as the record linkage problem, the object identity problem, or the deduplication problem. Frequently, the information is only partially redundant and the sources may complement each other by providing additional information about an entity. Thus, duplicate information should be purged and complementing information should be consolidated and merged in order to achieve a consistent view of real-world entities. The two relational sources, Source 1 and Source 2, in the example of Fig. 2 exhibit several kinds of conflicts. At the schema level, there are name conflicts (synonyms Customer vs. Client, CID vs. Cno, Sex vs. Gender) and structural conflicts (different structures for names, Name vs. {LastName, FirstName}, and for addresses, {Street, City, Zip} vs. Address). At the instance level, one can see that there are different gender representations ("0"/"1" vs. "F"/"M") and presumably a duplicate record (Kristen Smith). The latter observation also reveals that while CID and Cno are both source-specific identifiers, their contents are not comparable between the sources; different numbers ("11", "49") refer to the same person while different persons have the same number ("24").
Data Conflicts. Table 1. Examples for single-source problems at schema level (violated integrity constraints)
Scope | Type of conflict | Dirty data | Reasons/Remarks
Attribute | Illegal values | birthdate = 13/30/1970 | Values outside of domain range
Record | Violated attribute dependencies | city = "Redmond", zip = 77777 | City and zip code should correspond
Record type | Uniqueness violation | emp1 = (name = "John Smith", SSN = "123456"), emp2 = (name = "Peter Miller", SSN = "123456") | Uniqueness for SSN (social security number) violated
Source | Referential integrity violation | emp = (name = "John Smith", deptno = 127) | Referenced department (127) not defined
Data Conflicts. Table 2. Examples for single-source problems at instance level
Scope | Type of conflict | Dirty data | Reasons/Remarks
Attribute | Missing values | phone = 9999-999999 | Unavailable values during data entry (dummy values or null)
Attribute | Misspellings | city = "London" | Usually typos, phonetic errors
Attribute | Cryptic values, Abbreviations | experience = "B"; occupation = "DB Prog." | Use of code lists
Attribute | Embedded values | name = "J. Smith 02/12/70 New York" | Multiple values entered in one attribute (e.g., in a free-form field)
Attribute | Misfielded values | city = "Germany" | City field contains value of country field
Record | Violated attribute dependencies | city = "Redmond", zip = 77777 | City and zip code should correspond
Record type | Word transpositions | name1 = "J. Smith", name2 = "Miller P." | Usually in a free-form field
Record type | Duplicated records | emp1 = (name = "John Smith", ...); emp2 = (name = "J. Smith", ...) | Same employee represented twice due to some data entry errors
Record type | Contradicting records | emp1 = (name = "John Smith", bdate = 02/12/70); emp2 = (name = "John Smith", bdate = 12/12/70) | The same real world entity is described by different values
Source | Wrong references | emp = (name = "John Smith", deptno = 17) | Referenced department (17) is defined but wrong
Dealing with Data Conflicts
Data conflicts can be dealt with in a preventive and/or corrective way. As resolving existing data conflicts is generally an expensive task, preventing dirty data to be entered is promising to ensure a high data quality. This requires appropriate design of the database schema with corresponding integrity constraints and strict enforcement of the constraints in the databases and data entry applications. For most applications, however, a corrective strategy, i.e., data cleaning (a.k.a. cleansing or scrubbing), is needed in order to remove conflicts from given data and make it suitable for analysis. Typically, this process involves thorough analysis of the data to detect conflicts and transformation of the data to resolve the identified conflicts.
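As a small, concrete illustration of such a corrective transformation, the Python sketch below normalizes the gender-representation conflict noted for the Fig. 2 example ("0"/"1" in one source vs. "F"/"M" in the other). The chosen target encoding and the direction of the 0/1 mapping are assumptions made for the example; they are not specified in this entry.

```python
# Assumed mapping: "0" -> "F", "1" -> "M"; target encoding "F"/"M".
GENDER_MAP = {"0": "F", "1": "M", "F": "F", "M": "M"}

def standardize_gender(record):
    value = str(record.get("Gender", "")).strip().upper()
    return {**record, "Gender": GENDER_MAP.get(value, value)}

print(standardize_gender({"Name": "Kristen Smith", "Gender": "0"}))
# {'Name': 'Kristen Smith', 'Gender': 'F'}
```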
Key Applications Data Warehousing
Data warehousing aims at a consolidated and consistent view of enterprise data for business decision making. Transactional and non-transactional data from a variety of sources is aggregated and structured typically in a multidimensional schema to effectively support dynamic
querying and reporting, such as Online Analytical Processing (OLAP). As multiple data sources are considered, the probability that some of the sources contain conflicting data is high. Furthermore, the correctness of the integrated data is vital to avoid wrong conclusions. Due to the wide range of possible data inconsistencies and the sheer data volume, data cleaning is one of the biggest problems for data warehousing. Data conflicts need to be detected and resolved during the so-called ETL process (Extraction, Transformation, and Load), when source data is integrated from corresponding sources and stored into the data warehouse.
Data Conflicts. Figure 2. Examples of multi-source problems at schema and instance level.
Data Mining
Data mining, or knowledge discovery, is the analysis of large data sets to extract new and useful information. Developed algorithms utilize a number of techniques, such as data visualization (charts, graphs), statistics (summarization, regression, clustering), and artificial intelligence techniques (classification, machine learning, and neural networks). As relevant data typically needs to be integrated from different sources, data warehousing represents a promising way to build a suitable data basis for data mining. However, due to performance reasons on large data sets, specialized data mining algorithms often operate directly on structured files. In either case, resolving data conflicts to obtain correct data is crucial for the success of data mining. On the other hand, the powerful data mining algorithms can also be utilized to analyze dirty data and discover data conflicts.
Cross-references
▶ Data Cleaning ▶ Data Quality ▶ Duplicate Detection
Recommended Reading
1. Barateiro J. and Galhardas H. A survey of data quality tools. Datenbank-Spektrum, 14:15–21, 2005.
2. Batini C. and Scannapieco M. Data Quality – Concepts, Methodologies and Techniques. Springer, Berlin, 2006.
3. Dunn H.L. Record linkage. Am. J. Public Health, 36(12):1412–1416, 1946.
4. Elmagarmid A.K., Ipeirotis P.G., and Verykios V.S. Duplicate record detection – a survey. IEEE Trans. Knowl. Data Eng., 19(1):1–16, 2007.
5. Fellegi I.P. and Sunter A.B. A theory for record linkage. J. Am. Stat. Assoc., 64(328):1183–1210, 1969.
6. Kim W., Choi B.-J., Kim S.-K., and Lee D. A taxonomy of dirty data. Data Mining Knowl. Discov., 7(1):81–99, 2003.
7. Rahm E. and Do H.-H. Data cleaning – problems and current approaches. IEEE Data Eng. Bull., 23(4):3–13, 2000.
Data Copy ▶ Data broadcasting, caching and replication
Data Corruption ▶ Storage Security
570
D
Data Deduplication
Data Deduplication ▶ Record Matching
Data Dependency ▶ Database Dependencies
Data Dictionary James Caverlee Texas A&M University, College Station, TX, USA
Synonyms System catalog; Metadata repository
Definition A data dictionary catalogs the definitions of data elements, data types, data flows and other conventions that are used in an information system. Data dictionaries have been widely adopted by both (i) the database community, where a dictionary typically describes database entities, schemas, permissions, etc.; and (ii) the software development community, where a dictionary typically describes flows of information through the system. In essence, a data dictionary is a virtual database of metadata about an information system itself. A data dictionary may also be referred to as a ‘‘system catalog.’’
Key Points Understanding and managing an information system – both from a design and from an implementation point-of-view – requires some documentation of the schema, capabilities, constraints, and other descriptive features of the system. This documentation is typically embodied by a data dictionary – that is, a repository of information for an information system that describes the entities represented as data, including their attributes and the relationships between them [3]. The importance of a systematic way to store and manage the metadata associated with an information system has been well known since the earliest days of database and large-scale systems development. By the time the relational model was garnering attention in the 1970s, metadata management via a system catalog
was a standard feature in database management systems (DBMSs) like System R [1] and INGRES [5]. Around the same time, the structured analysis approach for large-scale systems development also advocated for the use of a data dictionary [2]. The phrase data dictionary has two closely related meanings: (i) as documentation primarily for consumption by human users, administrators, and designers; and (ii) as a mini-database managed by a DBMS and tightly coupled with the software components of the DBMS. In the first meaning, a data dictionary is a document (or collection of documents) that provides a conceptual view of the structure of an information system for those developing, using, and maintaining the system. In this first meaning, a data dictionary serves to document the system design process, to identify the important characteristics of the system (e.g., schemas, constraints, data flows), and to provide the designers, users, and administrators of the system a central metadata repository [6]. A data dictionary can provide the names of tables and fields, types for data attributes, encoding information, and further details of an overall structure and usage. The owner of a database or database administrator (DBA) might provide it as a book or a document with additional descriptions and diagrams, or as generated documentation derived from a database. Database users and application developers then benefit from the data dictionary as an accepted reference, though this hardcopy version is not always provided nor required. In the second meaning, a data dictionary is a minidatabase tightly coupled and managed by an information system (typically a DBMS) for supporting query optimization, transaction processing, and other typical features of a DBMS. When used in this sense, a data dictionary is often referred to as a catalog or as a system catalog. As a software component of a database or a DBMS, a data dictionary makes up all the metadata and additional functions needed for a database manipulation language (DML) to select, insert, and generally operate on data. A database user will do this in conjunction with a high-level programming language or from a textual or graphical user interface (GUI). The data dictionary for a database or DBMS typically has these elements: Descriptions of tables and fields Permissions information, such as usernames and privileges How data is indexed
Referential integrity constraints Definitions for database schemas Storage allocation parameters Usage statistics Stored procedures and database triggers
For example, a developer unfamiliar with what tables are available within a database could query the virtual INFORMATION_SCHEMA database, which serves as the data dictionary for MySQL databases [4]. Besides this low-level version of a data dictionary, some software frameworks add another layer of abstraction to create a high-level data dictionary as well. This layer can reduce development time by providing features not supported at the lower level, such as alternative database schema models. One example is Object-Relational Mapping (ORM), which seeks to map the data types created in the Object-Oriented Programming (OOP) paradigm to a relational database.
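A short Python sketch of such an INFORMATION_SCHEMA lookup is shown below. It assumes the third-party MySQL Connector/Python driver and uses placeholder connection parameters and a made-up schema name ("shop"); adapt these for an actual server.

```python
import mysql.connector  # MySQL Connector/Python driver (third-party package)

# Connection parameters below are placeholders, not real credentials.
conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="shop")
cur = conn.cursor()
cur.execute(
    "SELECT table_name, table_type "
    "FROM information_schema.tables WHERE table_schema = %s", ("shop",))
for name, table_type in cur.fetchall():
    print(name, table_type)      # list the tables known to the data dictionary
cur.close()
conn.close()
```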
Cross-references
▶ Metadata ▶ Metadata Repository
Recommended Reading 1. Astrahan M. et al. (1979) System R: a relational data base management system. IEEE Comput. 12(5):42–48, 1979. 2. Demarco T. Structured Analysis and System Specification. Yourdon, 1978. 3. Elmasri R. and Navathe S. Fundamentals of database systems. Addison-Wesley, Reading, MA, 2000. 4. MySQL MySQL 5.0 Reference Manual, 2008. 5. Stonebraker M., Wong E., Kreps P., and Held G. The design and implementation of INGRES. ACM Trans. Database Syst., 1(3):189–222, 1976. 6. Yourdon E. Modern Structured Analysis. Yourdon, 1989.
Data Dissemination ▶ Data broadcasting, caching and replication
Data Encryption Ninghui Li Purdue University, West Lafayette, IN, USA
Synonyms Encryption; Cipher
Definition Data encryption is the process of transforming data (referred to as plaintext) to make it unreadable except to those possessing some secret knowledge, usually referred to as a key. The result of the process is encrypted data (referred to as ciphertext). Data encryption aims at preserving confidentiality of messages. The reverse process of deriving the plaintext from the ciphertext (using the key) is known as decryption. A cipher is a pair of algorithms which perform encryption and decryption. The study of data encryption is part of cryptography. The study of how to break ciphers, i.e., obtaining the meaning of encrypted information without access to the key, is called cryptanalysis.
Historical Background Encryption has been used to protect communications since ancient times by militaries and governments to facilitate secret communication. The earliest known usages of cryptography include a tool called Scytale, which was used by the Greeks as early as the seventh century BC, and the Caesar cipher, which was used by Julius Caesar in the first century BC. The main classical cipher types are transposition ciphers, which rearrange the order of letters in a message, and substitution ciphers, which systematically replace letters or groups of letters with other letters or groups of letters. Ciphertexts produced by classical ciphers always reveal statistical information about the plaintext. Frequency analysis can be used to break classical ciphers. Early in the twentieth century, several mechanical encryption/decryption devices were invented, including rotor machines – most famously the Enigma machine used by Germany in World War II. Mechanical encryption devices, and successful attacks on them, played a vital role in World War II. Cryptography entered the modern age in the 1970s, marked by two important events: the introduction of the U.S. Data Encryption Standard and the invention of public key cryptography. The development of digital computers made possible much more complex ciphers. At the same time, computers have also assisted cryptanalysis. Nonetheless, good modern ciphers have stayed ahead of cryptanalysis; it is usually the case that use of a quality cipher is very efficient (i.e., fast and requiring few resources), while breaking it requires an effort many orders of magnitude larger, making cryptanalysis so inefficient and impractical as to be effectively impossible.
Today, strong encryption is no longer limited to secretive government agencies. Encryption is now widely used by the financial industry to protect money transfers, by merchants to protect credit-card information in electronic commerce, by corporations to secure sensitive communications of proprietary information, and by citizens to protect their private data and communications.
Foundations Data encryption can be either secret-key based or public-key based. In secret-key encryption (also known as symmetric encryption), a single key is used for both encryption and decryption. In public-key encryption (also known as asymmetric encryption), the encryption key (also called the public key) and the corresponding decryption key (also called the private key) are different. Modern symmetric encryption algorithms are often classified into stream ciphers and block ciphers. Stream Ciphers
In a stream cipher, the key is used to generate a pseudo-random key stream, and the ciphertext is computed by using a simple operation (e.g., bit-by-bit XOR or byte-by-byte modular addition) to combine the plaintext bits and the key stream bits. Mathematically, a stream cipher is a function f : {0,1}^ℓ → {0,1}^m, where ℓ is the key size, and m determines the length of the longest message that can be encrypted under one key; m is typically much larger than ℓ. To encrypt a message x using a key k, one computes c = f(k) ⊕ x, where ⊕ denotes bit-by-bit XOR. To decrypt a ciphertext c using key k, one computes f(k) ⊕ c. Many stream ciphers implemented in hardware are constructed using linear feedback shift registers (LFSRs). The use of LFSRs on their own, however, is insufficient to provide good security. Additional variations and enhancements are needed to increase the security of LFSRs. The most widely used software stream cipher is RC4. It was designed by Ron Rivest of RSA Security in 1987. It is used in popular protocols such as Secure Sockets Layer (SSL) (to protect Internet traffic) and WEP (to secure wireless networks). Stream ciphers typically execute at a higher speed than block ciphers and have lower hardware complexity. However, stream ciphers can be susceptible to serious security problems if used incorrectly; in particular, the same starting state (i.e., the same generated key stream) must never be used twice.
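The following toy Python sketch mirrors only the structure c = f(k) ⊕ x and the keystream-reuse pitfall described above; it is an assumption-laden illustration, not RC4 or any real stream cipher, and Python's random.Random is explicitly not cryptographically secure.

```python
import random

def keystream(key: int, n: int) -> bytes:
    # Stand-in for the key-stream generator f(k); NOT cryptographically secure.
    rng = random.Random(key)
    return bytes(rng.randrange(256) for _ in range(n))

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

key = 2009
plaintext = b"attack at dawn"
ciphertext = xor_bytes(plaintext, keystream(key, len(plaintext)))       # encrypt
assert xor_bytes(ciphertext, keystream(key, len(ciphertext))) == plaintext  # decrypt

# Reusing one key stream for a second message leaks the XOR of the two
# plaintexts -- the reason the same starting state must never be used twice.
c2 = xor_bytes(b"defend at dusk", keystream(key, 14))
leaked = xor_bytes(ciphertext, c2)   # equals plaintext1 XOR plaintext2
```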
Block Ciphers
A block cipher operates on large blocks of bits, often 64 or 128 bits. Mathematically, a block cipher is a pair of functions E : {0,1}^ℓ × {0,1}^n → {0,1}^n and D : {0,1}^ℓ × {0,1}^n → {0,1}^n, where ℓ is the key size and n is the block size. To encrypt a message x using key k, one calculates E(k, x), which is often written as E_k[x]. To decrypt a ciphertext c using key k, one calculates D(k, c), often written as D_k[c]. The pair E and D must satisfy: for all k ∈ {0,1}^ℓ and all x ∈ {0,1}^n, D_k[E_k[x]] = x. The two most widely used block ciphers are the Data Encryption Standard (DES) and the Advanced Encryption Standard (AES). DES is a block cipher selected as Federal Information Processing Standard for the United States in 1976. It has subsequently enjoyed widespread use internationally. The block size of DES is 64 bits, and the key size 56 bits. The main weakness of DES is its short key size, which makes it vulnerable to brute-force attacks that try all possible keys. One way to overcome the short key size of DES is to use Triple DES (3DES), which encrypts a 64-bit block by running DES three times using three DES keys. More specifically, let (E, D) be the pair of encryption and decryption functions for DES; then the encryption function for 3DES is 3DES_k1,k2,k3(x) = E_k1[D_k2[E_k3(x)]]. AES was announced as a U.S. Federal Information Processing Standard on November 26, 2001 after a 5-year selection process that was open to the public. It became effective as a standard on May 26, 2002. The algorithm was invented by Joan Daemen and Vincent Rijmen and was formerly known as Rijndael. AES uses a block size of 128 bits, and supports key sizes of 128 bits, 192 bits, and 256 bits. Because messages to be encrypted may be of arbitrary length, and because encrypting the same plaintext under the same key always produces the same output, several modes of operation have been invented which allow block ciphers to provide confidentiality for messages of arbitrary length. For example, in the electronic codebook (ECB) mode, the message is divided into blocks and each block is encrypted separately. The disadvantage of this method is that identical plaintext blocks are encrypted into identical ciphertext blocks.
It is not recommended for use in cryptographic protocols. In the cipher-block chaining (CBC) mode, each block of plaintext is XORed with the previous ciphertext block before being encrypted. This way, each ciphertext block is dependent on all plaintext blocks processed up to that point. Also, to make each message unique, an initialization vector must be used in the first block and should be chosen randomly. More specifically, to encrypt a message x under key k, let x1, x2,...,xm denote the message blocks; then the ciphertext is c0 || c1 || ... || cm, where || denotes concatenation, c0 = IV (the randomly chosen initial value), and c_i = E_k[x_i ⊕ c_{i-1}] for 1 ≤ i ≤ m. Other well-known modes include Cipher feedback (CFB), Output feedback (OFB), and Counter (CTR).
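The Python sketch below shows only the CBC chaining rule c_i = E_k[x_i ⊕ c_{i-1}]; the keyed byte-substitution used as E_k/D_k is an assumption made so the example is self-contained and is in no way a secure block cipher. Real deployments would use DES, 3DES or AES as described above.

```python
import os, random

BLOCK = 8  # toy block size in bytes

def toy_block_cipher(key: bytes):
    """Derive a keyed byte-substitution table and its inverse; a stand-in
    for E_k/D_k used only to demonstrate the chaining."""
    table = list(range(256))
    random.Random(key).shuffle(table)
    inverse = [0] * 256
    for i, v in enumerate(table):
        inverse[v] = i
    enc = lambda block: bytes(table[b] for b in block)
    dec = lambda block: bytes(inverse[b] for b in block)
    return enc, dec

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_encrypt(enc, blocks):
    prev = os.urandom(BLOCK)          # c0 = IV, chosen randomly per message
    out = [prev]
    for x in blocks:                  # c_i = E_k[x_i XOR c_{i-1}]
        prev = enc(xor(x, prev))
        out.append(prev)
    return out

def cbc_decrypt(dec, blocks):
    return [xor(dec(c), prev) for prev, c in zip(blocks, blocks[1:])]

enc, dec = toy_block_cipher(b"demo key")
msg = [b"12345678", b"abcdefgh"]      # plaintext already split into blocks
assert cbc_decrypt(dec, cbc_encrypt(enc, msg)) == msg
```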
Public Key Encryption Algorithms
When using symmetric encryption for secure communication, the sender and the receiver must agree upon a key, and the key must be kept secret so that no other party knows the key. This means that the key must be distributed using a secure, but non-cryptographic, method; for example, a face-to-face meeting or a trusted courier. This is expensive and even impossible in some situations. Public key encryption was invented to solve the key distribution problem. When public key encryption is used, users can distribute public keys over insecure channels. One of the most widely used public-key encryption algorithms is RSA. RSA was publicly described in 1977 by Ron Rivest, Adi Shamir and Leonard Adleman at MIT; the letters RSA are the initials of their surnames. To generate a pair of RSA public/private keys, one does the following: choose two distinct large prime numbers p, q, calculate N = pq and φ(N) = (p − 1)(q − 1), and choose an integer e such that 1 < e < φ(N) and e and φ(N) share no factors other than 1. The public key is (N, e), and the private key is (N, d), where ed ≡ 1 (mod φ(N)). A message to be encrypted is encoded as a positive integer x where x < N. To encrypt x, compute c = x^e mod N. To decrypt a ciphertext c, compute c^d mod N. Practical RSA implementations typically embed some form of structured, randomized padding into the value x before encrypting it. Without such padding, the ciphertext leaks some information about the plaintext and is generally considered insecure for data encryption. It is generally presumed that RSA is secure if N is sufficiently large. The length of N is typically 1,024–4,096 bits.
A central problem for public-key cryptography is proving that a public key is authentic and has not been tampered with or replaced by a malicious third party. The usual approach to this problem is to use a public-key infrastructure (PKI), in which one or more third parties, known as certificate authorities, certify ownership of key pairs. Asymmetric encryption algorithms are much more computationally intensive than symmetric algorithms. In practice, public key cryptography is used in combination with secret-key methods for efficiency reasons. For encryption, the sender encrypts the message with a secret-key algorithm using a randomly generated key, and that random key is then encrypted with the recipient's public key.
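The following textbook-style Python example walks through the RSA computation above with deliberately tiny primes; the specific numbers are chosen only for illustration, and real keys use primes hundreds of digits long together with randomized padding.

```python
# Textbook RSA with tiny primes; illustration only.
p, q = 61, 53
N = p * q                      # N = 3233
phi = (p - 1) * (q - 1)        # phi(N) = 3120
e = 17                         # shares no factor with phi(N)
d = 2753                       # private exponent: e*d = 46801 = 15*3120 + 1
assert (e * d) % phi == 1      # i.e., e*d = 1 (mod phi(N))

x = 65                         # plaintext, encoded as an integer x < N
c = pow(x, e, N)               # encryption:  c = x^e mod N  (here 2790)
assert pow(c, d, N) == x       # decryption:  c^d mod N recovers x
```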
Attack Models
Attack models or attack types for ciphers specify how much information a cryptanalyst has access to when cracking an encrypted message. Some common attack models are:
Ciphertext-only attack: the attacker has access only to a set of ciphertexts.
Known-plaintext attack: the attacker has samples of both the plaintext and its encrypted version (ciphertext).
Chosen-plaintext attack: the attacker has the capability to choose arbitrary plaintexts to be encrypted and obtain the corresponding ciphertexts.
Chosen-ciphertext attack: the attacker has the capability to choose a number of ciphertexts and obtain the plaintexts.
Key Applications Data encryption is provided by most database management systems. It is also used in many settings in which databases are used, e.g., electronic commerce systems.
Cross-references
▶ Asymmetric Encryption ▶ Symmetric Encryption
Recommended Reading 1. Diffie W. and Hellman M.E. New directions in cryptography. IEEE Trans. Inform. Theory, 22:644–654, 1976. 2. Federal information processing standards publication 46-3: data encryption standard (DES), 1999. 3. Federal information processing standards publication 197: advanced encryption standard, Nov. 2001.
4. Kahn D. The codebreakers: the comprehensive history of secret communication from ancient times to the internet. 1996. 5. Menezes A.J., Oorschot P.C.V., and Vanstone S.A. Handbook of applied cryptography (revised reprint with updates). CRC, West Palm Beach, FL, USA, 1997. 6. Rivest R.L., Shamir A., and Adleman L.M. A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM, 21:120–126, 1978. 7. Singh S. The code book: the science of secrecy from ancient Egypt to quantum cryptography. Anchor, Garden City, NY, USA, 2000.
Data Errors ▶ Data Conflicts
Data Estimation in Sensor Networks Le Gruenwald University of Oklahoma, Norman, OK, USA
Synonyms Data imputation
Definition In wireless sensor networks, sensors typically transmit their data to servers at predefined time intervals. In this environment, data packets are very susceptible to losses, delays or corruption due to various reasons, such as power outage at the sensor’s node, a higher bit error rate of the wireless radio transmissions compared to the wire communication alternative, an inefficient routing algorithm implemented in the network, or random occurrences of local interferences (e.g., mobile radio devices, microwaves or broken line-of-sight path). To process queries that need to access the missing data, if repeated requests are sent to sensors asking them to resend the missing information, this would incur power-costly communications as those sensors must be constantly in the listening mode. In addition, it is not guaranteed that those sensors would resend their missing data or would resend them in a timely manner. Alternatively, one might choose to estimate the missing data based on the underlying structure or patterns of the past reported data. Due to the low power-cost of computation, this approach represents an efficient way of answering queries that need to access the missing
information. This entry discusses a number of existing data estimation approaches that one can use to estimate the value of a missing sensor reading.
Key Points To estimate the value of a missing sensor reading, the quality of service in terms of high estimate accuracy and low estimation time needs to be observed. Data estimation algorithms can be divided into three major groups: (i) traditional statistical approaches; (ii) statistical-based sensor/stream data approaches, and (iii) association rule data mining based approaches. Many traditional statistical data approaches are not appropriate for wireless sensor networks as they require either the entire data set to be available or data to be missed at random, or do not consider relationships among sensors. Some statistical based sensor/stream data estimation algorithms include SPIRIT [3] and TinyDB [2]. SPIRIT is a pattern discovery system that uncovers key trends within data of multiple time series. These trends or correlations are summarized by a number of hidden variables. To estimate current missing values, SPIRIT applies an auto-regression forecasting model on the hidden variables. TinyDB is a sensor query processing system where missing values are estimated by taking the average of all the values reported by the other sensors in the current round. Two association rule based data estimation algorithms are WARM [1] and FARM [1]. WARM identifies sensors that are related to each other in a sliding window containing the latest w rounds using association rule mining. When the reading of one of those sensors is missing, it uses the readings of the other related sensors to estimate the missing reading. FARM is similar to WARM except that it does not use the concept of sliding window and it considers the freshness of data.
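A minimal Python sketch of the averaging idea described above is shown below. The sensor identifiers and readings are invented for the example; the "related" set could be all other sensors reporting in the current round (in the spirit of the TinyDB-style averaging) or the sensors linked to the missing one by mined association rules, as in WARM and FARM.

```python
def estimate_missing(current_round, related):
    """Estimate a missing reading as the average of the values reported in
    the same round by the sensors deemed related to the missing one."""
    values = [current_round[s] for s in related if s in current_round]
    if not values:
        return None          # nothing to base an estimate on
    return sum(values) / len(values)

round_t = {"s1": 21.4, "s2": 21.9, "s4": 22.1}   # s3's report was lost
print(estimate_missing(round_t, related=["s1", "s2", "s4"]))  # about 21.8
```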
Cross-references
▶ Association Rule Mining on Streams ▶ Data Quality ▶ Sensor Networks ▶ Stream Data Analysis
Recommended Reading
1. Gruenwald L., Chok H., and Aboukhamis M. Using data mining to estimate missing sensor data. In Proc. 7th IEEE ICDM Workshop on Optimization-Based Data Mining Techniques with Applications, 2007, pp. 207–212.
2. Madden S., Franklin M., Hellerstein J., and Hong W. TinyDB: an acquisitional query processing system for sensor networks. ACM Trans. Database Syst., 30(1):122–173, 2005.
3. Papadimitriou S., Sun J., and Faloutsos C. Pattern discovery in multiple time-series. In Proc. 31st Int. Conf. on Very Large Data Bases, 2005, pp. 697–708.
Data Exchange
LUCIAN POPA
IBM Almaden Research Center, San Jose, CA, USA
Synonyms Data translation; Data migration; Data transformation
Definition Data exchange is the problem of materializing an instance of a target schema, given an instance of a source schema and a specification of the relationship between the source schema and the target schema. More precisely, a data exchange setting is a quadruple of the form (S, T, Σst, Σt), where S is the source schema, T is the target schema, Σst is a schema mapping that expresses constraints between S and T, and Σt is a set of constraints on T. Such a setting gives rise to the following data exchange problem: given an instance I over the source schema S, find an instance J over the target schema T such that I and J together satisfy the schema mapping Σst, and J satisfies the target constraints Σt. Such an instance J is called a solution for I in the data exchange setting. In general, many different solutions for an instance I may exist. The main focus of data exchange research is to study the space of all possible solutions, to identify the "best" solutions to materialize in a practical application, and to develop algorithms for computing such a best solution.
Historical Background The first systems supporting the restructuring and translation of data were built several decades ago. An early such system was EXPRESS [21], which performed data exchange between hierarchical schemas. The need for systems supporting data exchange has persisted over the years and has become more pronounced with the proliferation of data in various formats ranging from traditional relational database schemas to semi-structured/XML schemas and scientific formats.
An example of a modern data exchange system is Clio [18,20], a schema mapping prototype developed at IBM Almaden Research Center in collaboration with the University of Toronto, which influenced both theoretical and practical aspects of data exchange. The data exchange problem is related to the data integration problem [16] in the sense that both problems are concerned with the management of data stored in heterogeneous formats. The two problems, however, are different for the following reasons. In data exchange, the main focus is on actually materializing a target instance (i.e., a solution) that reflects the source data as accurately as possible. This presents a challenge, due to the inherent under-specification of the relationship between the source and the target, which means that in general there are many different ways to materialize such a target instance. In contrast, a target instance need not be materialized in data integration. There, the main focus is on answering queries posed over the target schema using views that express the relationship between the target and source schemas. Fagin et al. [8] were the first to formalize the data exchange problem and to embark on an in-depth investigation of the foundational and algorithmic issues that surround it. Their framework focused on data exchange settings in which S and T are relational schemas, Σst is a set of tuple-generating dependencies (tgds) between S and T, also called source-to-target tgds, and Σt is a set of tgds and equality-generating dependencies (egds) on T. Fagin et al. isolated a class of solutions for the data exchange problem, called universal solutions, and showed that they have good properties that justify selecting them as the preferred solutions in data exchange. Universal solutions are solutions that can be homomorphically mapped into every other solution; thus, intuitively, universal solutions are the most general solutions. Moreover, in a precise sense, universal solutions represent the entire space of solutions. One of the main results in [8] is that, under fairly general conditions (weak acyclicity of the set of target tgds), a canonical universal solution can be computed (if solutions exist) in polynomial time, by using the classical chase procedure [2]. In general, universal solutions need not be unique. Thus, in a data exchange setting, there may be many universal solutions for a given source instance. Fagin, Kolaitis and Popa [9] addressed the issue of further isolating a "best" universal solution, by using the concept of the core of a graph or a structure [14]. By
definition, the core of a structure is the smallest substructure that is also a homomorphic image of that structure. Since all universal solutions for a source instance I are homomorphically equivalent, it follows that they all have the same core (up to isomorphism). It is then shown in [9] that this core is also a universal solution, and hence the smallest universal solution. The uniqueness of the core of the universal solutions together with its minimality make the core an ideal solution for data exchange. In a series of papers that started with [9] and continued with [12,13], it was shown that the core of the universal solutions can be computed in polynomial time, for data exchange settings where Σst is a set of source-to-target tgds and Σt is the union of a weakly acyclic set of tgds with a set of egds. This is in contrast with the general case of computing the core of an arbitrary structure, for which it is known that, unless P = NP, there is no polynomial-time algorithm. There are quite a few papers on data exchange and the theory of schema mappings that extended or made use of the concepts and results introduced in [8,9]. Some of the more representative ones addressed: extensions to XML data exchange [1], extensions to peer data exchange [11], the study of solutions under the closed-world assumption (CWA) [17], the combined complexity of data exchange [15], schema mapping composition [10,19] and schema mapping inversion [7]. The Clio system, which served as both motivation and implementation playground for data exchange, was the first to use source-to-target dependencies as a language for expressing schema mappings [20]. Mapping constraints, expressed either as embedded dependencies (which comprise tgds and egds) or as equalities between relational or SQL expressions, also play a central role in the model management framework of Bernstein and Melnik [3].
Foundations
Given a source schema S and a target schema T that are assumed to be disjoint, a source-to-target dependency is, in general, a formula of the form $\forall \mathbf{x}\,(\phi_S(\mathbf{x}) \rightarrow \chi_T(\mathbf{x}))$, where φS(x) is a formula, with free variables x, over the source schema S, and χT(x) is a formula, with free variables x, over the target schema T. The notation x signifies a vector of variables x1, ..., xk. A target dependency is, in general, a formula over the target schema T (the formalism used to express a target dependency may be different in general from those used for the source-to-target dependencies). The source schema may also have dependencies that are assumed to be satisfied by every source instance. Source dependencies do not play a direct role in data exchange, because the source instance is given. The focus in [8] and in most of the subsequent papers on data exchange theory is on the case when S and T are relational schemas and when the dependencies are given as tuple-generating dependencies (tgds) and equality-generating dependencies (egds) [2]. More precisely, each source-to-target dependency in Σst is assumed to be a tgd of the form
$$\forall \mathbf{x}\,\big(\phi_S(\mathbf{x}) \rightarrow \exists \mathbf{y}\,\psi_T(\mathbf{x}, \mathbf{y})\big),$$
where φS(x) is a conjunction of atomic formulas over S and ψT(x, y) is a conjunction of atomic formulas over T. All the variables in x are assumed to appear in φS(x). Moreover, each target dependency in Σt is either a tgd (of the form shown below left) or an egd (of the form shown below right):
$$\forall \mathbf{x}\,\big(\phi_T(\mathbf{x}) \rightarrow \exists \mathbf{y}\,\psi_T(\mathbf{x}, \mathbf{y})\big) \qquad\qquad \forall \mathbf{x}\,\big(\phi_T(\mathbf{x}) \rightarrow (x_1 = x_2)\big)$$
In the above, φT(x) and ψT(x, y) are conjunctions of atomic formulas over T, where all the variables in x appear in φT(x), and x1, x2 are among the variables in x. An often used convention is to drop the universal quantifiers in front of a dependency, and implicitly assume such quantification. However, the existential quantifiers are explicitly written down. Source-to-target tgds are a natural and expressive language for expressing the relationship between a source and a target schema. Such dependencies are semi-automatically derived in the Clio system [20] based on correspondences between the source schema and the target schema. In turn, such correspondences can either be supplied by a human expert or discovered via schema matching techniques. Source-to-target tgds are also equivalent to the language of "sound" global-and-local-as-view (GLAV) assertions often used in data integration systems [16]. It is natural to take the target dependencies to be tgds and egds: these two classes together comprise the (embedded) implicational dependencies [6]. However, it is somewhat surprising that tgds, which were originally "designed" for other purposes (as constraints), turn out to be ideally suited for specifying desired data transfer. Example 1. Figure 1b shows a source schema (on the left) and a target schema (on the right) with
Data Exchange. Figure 1. A data exchange example.
correspondences between their attributes. The source schema models two different data sources or databases, src1 and src2, each representing data about students. The first source consists of one relation, src1.students, while the second source consists of two relations, src2.students and src2.courseEvals. The attributes S, N, C, G, F represent, respectively, "student id," "student name," "course," "grade" (only in src1), and "file evaluation" (a written evaluation that a student receives for a course; only in src2). The attribute K in src2 is used to link students with the courses they take; more concretely, K plays the role of a foreign key in src2.students and the role of a key in src2.courseEvals. As seen in the instance in Fig. 1a, information in the two sources may overlap: the same student can appear in both sources, with each source providing some information that the other does not have (e.g., either grade or file evaluation). The two data sources are mapped into a target schema with three relations: students (with general student information), enrolled (listing course entries for each student), and evals (with evaluation entries per student and per course). The attribute E (evaluation id) is used to link enrollment entries with the associated evaluation records (E is a foreign key in enrolled and a key in evals). Similarly, the attribute S (student id) links enrolled with students. The relationship between the individual attributes in the schemas is described by the arrows or correspondences that "go" between the attributes. However, the more precise mapping between the schemas is given by the set Σst = {t1, t2} of source-to-target tgds that is shown in Fig. 1c.
The first source-to-target tgd, t1, specifies that for each tuple in src1.students there must exist three corresponding tuples in the target: one in students, one in enrolled, and one in evals. Moreover, t1 specifies how the four components of the source tuple (i.e., s, n, c, g) must appear in the target tuples. The tgd also specifies the existence of "unknown" values (via the existential variables E and F) for the target attributes that do not have any corresponding attribute in the source. Note that existential variables can occur multiple times; in the example, it is essential that the same variable E is used in both enrolled and evals so that the association between students, courses and their grades is not lost in the target. The second source-to-target tgd, t2, illustrates a case where the source pattern (the premise of the tgd) is not limited to one tuple of a relation but encodes a join between multiple relations. In general, not all variables in the source pattern must occur in the target (e.g., k does not occur in the target). In this example, t2 plays a "complementary" role to t1, since it maps a different source that contains file evaluations rather than grades. The target dependencies in Σt are formulas expressed solely in terms of the target schema that further constrain the space of possible target instances. In this example, the tgds i1 and i2 are inclusion dependencies that encode referential integrity constraints from enrolled to students and evals, respectively. The egds e1, e2 and e3 encode functional dependencies that must be satisfied. In particular, e1 requires that a student and a course must have a unique evaluation id, while e2 and e3 together specify that the evaluation id must be a key for evals.
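Since Figure 1c is not reproduced in this text, the following is a plausible rendering of Σst and Σt that is consistent with the description above; the exact attribute orders and variable names used in the figure may differ.

$$
\begin{aligned}
t_1:\;& \mathit{src1.students}(s,n,c,g) \rightarrow \exists E\,\exists F\;\big(\mathit{students}(s,n) \wedge \mathit{enrolled}(s,c,E) \wedge \mathit{evals}(E,g,F)\big)\\
t_2:\;& \mathit{src2.students}(s,n,k) \wedge \mathit{src2.courseEvals}(k,c,f) \rightarrow \exists E\,\exists G\;\big(\mathit{students}(s,n) \wedge \mathit{enrolled}(s,c,E) \wedge \mathit{evals}(E,G,f)\big)\\
i_1:\;& \mathit{enrolled}(s,c,e) \rightarrow \exists n\; \mathit{students}(s,n)\\
i_2:\;& \mathit{enrolled}(s,c,e) \rightarrow \exists g\,\exists f\; \mathit{evals}(e,g,f)\\
e_1:\;& \mathit{enrolled}(s,c,e) \wedge \mathit{enrolled}(s,c,e') \rightarrow e = e'\\
e_2:\;& \mathit{evals}(e,g,f) \wedge \mathit{evals}(e,g',f') \rightarrow g = g'\\
e_3:\;& \mathit{evals}(e,g,f) \wedge \mathit{evals}(e,g',f') \rightarrow f = f'
\end{aligned}
$$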
Solutions In general, in a data exchange setting (S, T, Σst, Σt), there can be zero or more solutions J for a given source instance I. In other words, there can be zero or more target instances J such that: (1) J satisfies the target dependencies in Σt, and (2) I together with J satisfy the source-to-target dependencies in Σst. The latter condition simply means that the instance ⟨I, J⟩ that is obtained by considering together all the relations in I and J satisfies Σst. Note that ⟨I, J⟩ is an instance over the union of the schemas S and T. Example 2. Figure 2 illustrates three target instances that are plausible for the source instance I shown in Fig. 1a, given the dependencies Σst and Σt in Fig. 1b. Consider the first instance J0, shown in Fig. 2a. It can be seen that ⟨I, J0⟩ satisfies all the source-to-target dependencies in Σst; in particular, for any combination of the source data in I that satisfies the premises of some source-to-target tgd in Σst, the "required" target tuples exist in J0. Note that in J0, the special values E1, ..., E4, F1, F2, G3 and G4 are used to represent "unknown" values, that is, values that do not occur in the source instance. Such values are called labeled nulls or nulls and are to be distinguished from the values occurring in the source instance, which are called constants. (See the later definitions.) It can then be seen that J0 fails to satisfy the set Σt of target dependencies; in particular, the egd e1 is not satisfied (there are two enrolled tuples for student 001 and course CS120 having different evaluation ids, E1 and E3). Thus, J0 is not a solution for I. On the other hand, the two instances J1 and J2 shown in Fig. 2b and c, respectively, are both solutions for I. The main difference between J1 and J2 is that J2 is
Data Exchange. Figure 2. Examples of target instances.
more "compact": the same null E2 is used as an evaluation id for two different pairs of student and course (in contrast, J1 has different nulls, E2 and E4). In this example, J1 and J2 illustrate two possible ways of filling in the target that both satisfy the given specification. In fact, there are infinitely many possible solutions: one could choose other nulls or even constants instead of E1, E2, ..., or one could add "extra" target tuples, and still have all the dependencies satisfied. This raises the question of which solutions to choose in data exchange and whether some solutions are better than others. Universal solutions. A key concept introduced in [8] is that of universal solutions, which are the most general among all the possible solutions. Let Const be the set, possibly infinite, of all the values (also called constants) that can occur in source instances. Moreover, assume an infinite set Var of values, called labeled nulls, such that Var ∩ Const = ∅. The symbols I, I′, I1, I2, ... are reserved for instances over the source schema S and with values in Const. The symbols J, J′, J1, J2, ... are reserved for instances over the target schema T and with values in Const ∪ Var. All the target instances considered, and in particular, the solutions of a data exchange problem, are assumed to have values in Const ∪ Var. If J is a target instance, then Const(J) denotes the set of all constants occurring in J, and Var(J) denotes the set of labeled nulls occurring in J. Let J1 and J2 be two instances over the target schema. A homomorphism h : J1 → J2 is a mapping from Const(J1) ∪ Var(J1) to Const(J2) ∪ Var(J2) such that: (1) h(c) = c, for every c ∈ Const(J1); (2) for every tuple
t in a relation R of J1, the tuple h(t) is in the relation R of J2 (where, if t = (a1, ..., as), then h(t) = (h(a1), ..., h(as))). The instance J1 is homomorphically equivalent to the instance J2 if there are homomorphisms h : J1 → J2 and h′ : J2 → J1. Consider a data exchange setting (S, T, Σst, Σt). If I is a source instance, then a universal solution for I is a solution J for I such that for every solution J′ for I, there exists a homomorphism h : J → J′. Example 3. The solution J2 in Fig. 2c is not universal. In particular, there is no homomorphism from J2 to the solution J1 in Fig. 2b. Specifically, the two enrolled tuples (005, CS500, E2) and (001, CS200, E2) of J2 cannot be mapped into tuples of J1 (since E2 cannot be mapped into both E2 and E4). Thus, J2 has some "extra" information that does not appear in all solutions (namely, the two enrolled tuples for (005, CS500) and (001, CS200) sharing the same evaluation id). In contrast, a universal solution has only information that can be homomorphically mapped into every possible solution. It can be shown that J1 is such a universal solution, since it has a homomorphism to every solution (including J2). As the above example suggests, universal solutions are the preferred solutions in data exchange, because they are at least as general as any other solution (i.e., they do not introduce any "extra" information). Computing universal solutions with the chase. Fagin et al. [8] addressed the question of how to check the existence of a universal solution and how to compute one, if one exists. They showed that, under a weak acyclicity condition on the set Σt of target dependencies, universal solutions exist whenever solutions exist. Moreover, they showed that, under the same condition, there is a polynomial-time algorithm for computing a canonical universal solution, if a solution exists; this algorithm is based on the classical chase procedure. Intuitively, the following procedure is applied to produce a universal solution from a source instance I: start with the instance ⟨I, ∅⟩ that consists of I for the source, and the empty instance for the target; then chase ⟨I, ∅⟩ with the dependencies in Σst and Σt in some arbitrary order and for as long as they are applicable. Each chase step either adds new tuples in the target or attempts to equate two target values (possibly failing, as explained shortly). More concretely, let ⟨I, J⟩ denote an intermediate instance in the chase process (initially J = ∅). Chasing with a source-to-target tgd
φS(x) → ∃y ψT(x, y) amounts to the following: check whether there is a vector a of values that interprets x such that I ⊨ φS(a), but there is no vector b of values that interprets y such that J ⊨ ψT(a, b); if such an a exists, then add new tuples to J, where fresh new nulls Y interpret the existential variables y, such that the resulting target instance satisfies ψT(a, Y). Chasing with a target tgd is defined similarly, except that only the target instance is involved. Chasing with a target egd φT(x) → (x1 = x2) amounts to the following: check whether there is a vector a of values that interprets x such that J ⊨ φT(a) and such that a1 ≠ a2; if this is the case, then the chase step attempts to identify a1 and a2, as follows. If both a1 and a2 are constants then the chase fails; it can be shown that there is no solution (and no universal solution) in this case. If one of a1 and a2 is a null, then it is replaced with the other one (either a null or a constant); this replacement is global, throughout the instance J. If no more chase steps are applicable, then the resulting target instance J is a universal solution for I. Example 4. Recall the earlier data exchange scenario in Fig. 1. Starting from the source instance I, the source-to-target tgds in Σst can be applied first. This process adds all the target tuples that are "required" by the tuples in I and the dependencies in Σst. The resulting target instance after this step is an instance that is identical, modulo renaming of nulls, to the instance J0 in Fig. 2a. Assume, for simplicity, that the result is J0. The chase continues by applying the dependencies in Σt to J0. The tgds (i1) and (i2) are already satisfied. However, the egd (e1) is not satisfied and becomes applicable. In particular, there are two enrolled tuples with the same student id and course but different evaluation ids (E1 and E3). Since E1 and E3 are nulls, the chase with e1 forces the replacement of one with the other. Assume that E3 is replaced by E1. After the replacement, the two enrolled tuples become identical (hence, one is a duplicate and is dropped). Moreover, there are now two evals tuples with the same evaluation id (E1). Hence, the egds e2 and e3 become applicable. As a result, the null G3 is replaced by the constant A and the null F1 is replaced by the constant file01. The resulting target instance is the instance J1 in Fig. 2b, which is a canonical universal solution for I.
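The sketch below illustrates the first phase of this process, chasing with source-to-target tgds whose premise is a single source atom (the shape of t1); it does not handle joins in the premise, target tgds, or egds, so it is only a toy illustration of the idea rather than the general procedure of [8]. The encoding of tgds, the relation layout, and the sample data are all illustrative assumptions.

```python
from itertools import count

_fresh = count(1)

def _new_null():
    return f"N{next(_fresh)}"           # a fresh labeled null

def chase_st(source, tgds):
    """source: dict relation name -> set of tuples (constants only).
    tgds: list of (source_relation, conclusion) pairs, where conclusion is a
    list of (target_relation, slots); each slot is either an integer position
    into the source tuple or a string naming an existential variable.
    Returns a canonical target instance (dict relation name -> set of tuples).
    """
    target = {}
    for src_rel, conclusion in tgds:
        for tup in source.get(src_rel, set()):
            nulls = {}                  # one fresh null per existential variable
            for tgt_rel, slots in conclusion:
                row = []
                for slot in slots:
                    if isinstance(slot, int):
                        row.append(tup[slot])          # copy a source value
                    else:
                        if slot not in nulls:
                            nulls[slot] = _new_null()  # invent an unknown value
                        row.append(nulls[slot])
                target.setdefault(tgt_rel, set()).add(tuple(row))
    return target

# Encoding of t1 and a one-tuple source instance (the data values are made up).
t1 = ("src1.students", [("students", [0, 1]),
                        ("enrolled", [0, 2, "E"]),
                        ("evals",    ["E", 3, "F"])])
I = {"src1.students": {("001", "Ann", "CS120", "A")}}
print(chase_st(I, [t1]))
# e.g. {'students': {('001', 'Ann')}, 'enrolled': {('001', 'CS120', 'N1')},
#       'evals': {('N1', 'A', 'N2')}}
```

In a full implementation, chase steps for target tgds and egds would then be applied repeatedly, equating nulls as in Example 4, until no step is applicable or an egd fails on two distinct constants.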
As a remark on the expressive power of dependencies (and of their associated chase), note how the above process has merged, into the same tuple, information from two different sources (i.e., the grade and, respectively, the file, for student 001 and course CS120). Also note that the chase is, inherently, a recursive procedure. In general, the chase with an arbitrary set of target tgds and egds may not terminate. Hence, it is natural to ask for sufficient conditions for the termination of the chase. An extensively studied condition that guarantees termination is that the target tgds in Σt form a weakly acyclic set of tgds (the latter is also known as a set of constraints with stratified witnesses) [8,5]. Two important classes of dependencies that are widely used in database dependency theory, namely sets of full target tgds and acyclic sets of inclusion dependencies, are special cases of weakly acyclic sets of tgds. The following theorem summarizes the use of the chase in data exchange and represents one of the main results in [8]. Theorem 1. Assume a data exchange setting (S, T, Σst, Σt) where Σst is a set of source-to-target tgds, and Σt is the union of a weakly acyclic set of tgds with a set of egds. Then: (1) The existence of a solution can be checked in polynomial time. (2) A universal solution exists if and only if a solution exists. (3) If a solution exists, then a universal solution can be produced in polynomial time using the chase. The weak acyclicity restriction is essential for the above theorem to hold. In fact, it was shown in [15] that if the weak acyclicity restriction is removed, the problem of checking the existence of solutions becomes undecidable. Multiple universal solutions and core. In general, in a data exchange setting, there can be many universal solutions for a given source instance. Nevertheless, it has been observed in [9] that all these universal solutions share one common part, which is the core of the universal solutions. The core of the universal solutions is arguably the "best" universal solution to materialize, since it is the unique most compact universal solution. It is worth noting that the chase procedure computes a universal solution that may not necessarily be the core. So, additional computation is needed to produce the core. A universal solution that is not a core necessarily contains some "redundant" information which does not appear in the core. Computing the core of the universal solutions can be performed, conceptually, in two steps: first, materialize a canonical universal solution by using the chase, then remove the redundancies by taking the core. The second step is tightly
related to conjunctive query minimization [4], a procedure that is in general intractable. However, by exploiting the fact that in data exchange the goal is to compute the core of a universal solution rather than that of an arbitrary instance, polynomial-time algorithms were shown to exist for certain large classes of data exchange settings. Specifically, for data exchange settings where Σst is a set of arbitrary source-to-target tgds and Σt is a set of egds, two polynomial-time algorithms for computing the core of the universal solutions, the blocks algorithm and the greedy algorithm, were given in [9]. By generalizing the blocks algorithm, this tractability result was further extended in [12] to the case where the target tgds are full, and then in [13] to the more general case where the target tgds form a weakly acyclic set of tgds. Further results on query answering. The semantics of the data exchange problem (i.e., which solution to materialize) is one of the main issues in data exchange. Another main issue is that of answering queries formulated over the target schema. Fagin et al. [8] adopted the notion of the "certain answers" in incomplete databases for the semantics of query answering in data exchange. Furthermore, they studied the issue of when the certain answers can be computed based on the materialized solution alone; in this respect, they showed that in the important case of unions of conjunctive queries, the certain answers can be obtained simply by running the query on an arbitrary universal solution and by eliminating the tuples that contain nulls. This in itself provided another justification for the "goodness" of universal solutions. The follow-up paper [9] further investigated the use of a materialized solution for query answering; it showed that for the larger class of existential queries, evaluating the query on the core of the universal solutions gives the best approximation of the certain answers. In fact, if one redefines the set of certain answers to be those that occur in every universal solution (rather than in every solution), then the core gives the exact answer for existential queries.
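For reference, the certain answers mentioned above admit a compact formulation; the rendering below is a standard one rather than a quotation from [8]. For a query q over the target schema and a source instance I,

$$\mathrm{certain}(q, I) \;=\; \bigcap_{J \text{ a solution for } I} q(J),$$

and, when q is a union of conjunctive queries,

$$\mathrm{certain}(q, I) \;=\; q(J^{*})\!\downarrow \quad \text{for any universal solution } J^{*},$$

where the down-arrow denotes dropping the tuples that contain nulls.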
Key Applications Schema mappings are the fundamental building blocks in information integration. Data exchange gives theoretical foundations for schema mappings, by studying the transformation semantics associated with a schema mapping. In particular, universal solutions are the main concept behind the "correctness" of any
program, query or ETL flow that implements a schema mapping specification. Data exchange concepts are also essential in the study of the operators on schema mappings such as composition of sequential schema mappings and inversion of schema mappings.
Cross-references
▶ Data Integration ▶ Schema Mapping ▶ Schema Mapping Composition
Recommended Reading
1. Arenas M. and Libkin L. XML data exchange: consistency and query answering. In Proc. 24th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2005, pp. 13–24.
2. Beeri C. and Vardi M.Y. A proof procedure for data dependencies. J. ACM, 31(4):718–741, 1984.
3. Bernstein P.A. and Melnik S. Model management 2.0: manipulating richer mappings. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2007, pp. 1–12.
4. Chandra A.K. and Merlin P.M. Optimal implementation of conjunctive queries in relational data bases. In Proc. 9th Annual ACM Symp. on Theory of Computing, 1977, pp. 77–90.
5. Deutsch A. and Tannen V. XML queries and constraints, containment and reformulation. Theor. Comput. Sci., 336(1):57–87, 2005.
6. Fagin R. Horn clauses and database dependencies. J. ACM, 29(4):952–985, 1982.
7. Fagin R. Inverting schema mappings. ACM Trans. Database Syst., 32(4), 2007.
8. Fagin R., Kolaitis P.G., Miller R.J., and Popa L. Data exchange: semantics and query answering. Theor. Comput. Sci., 336(1):89–124, 2005.
9. Fagin R., Kolaitis P.G., and Popa L. Data exchange: getting to the core. ACM Trans. Database Syst., 30(1):174–210, 2005.
10. Fagin R., Kolaitis P.G., Popa L., and Tan W.-C. Composing schema mappings: second-order dependencies to the rescue. ACM Trans. Database Syst., 30(4):994–1055, 2005.
11. Fuxman A., Kolaitis P.G., Miller R.J., and Tan W.-C. Peer data exchange. ACM Trans. Database Syst., 31(4):1454–1498, 2006.
12. Gottlob G. Computing cores for data exchange: new algorithms and practical solutions. In Proc. 24th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2005.
13. Gottlob G. and Nash A. Data exchange: computing cores in polynomial time. In Proc. 25th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2006, pp. 40–49.
14. Hell P. and Nešetřil J. The core of a graph. Discrete Math., 109:117–126, 1992.
15. Kolaitis P.G., Panttaja J., and Tan W.C. The complexity of data exchange. In Proc. 25th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2006, pp. 30–39.
16. Lenzerini M. Data integration: a theoretical perspective. In Proc. 21st ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2002, pp. 233–246.
17. Libkin L. Data exchange and incomplete information. In Proc. 25th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2006, pp. 60–69.
18. Miller R.J., Haas L.M., and Hernández M.A. Schema mapping as query discovery. In Proc. 26th Int. Conf. on Very Large Data Bases, 2000, pp. 77–88.
19. Nash A., Bernstein P.A., and Melnik S. Composition of mappings given by embedded dependencies. ACM Trans. Database Syst., 32(1):4, 2007.
20. Popa L., Velegrakis Y., Miller R.J., Hernández M.A., and Fagin R. Translating Web data. In Proc. 28th Int. Conf. on Very Large Data Bases, 2002, pp. 598–609.
21. Shu N.C., Housel B.C., Taylor R.W., Ghosh S.P., and Lum V.Y. EXPRESS: a data EXtraction, Processing, and REStructuring System. ACM Trans. Database Syst., 2(2):134–174, 1977.
Data Expiration ▶ Temporal Vacuuming
Data Extraction ▶ Screen Scraper
Data Flow Diagrams ▶ Activity Diagrams
Data Fusion ▶ Semantic Data Integration for Life Science Entities
Data Fusion in Sensor Networks
AMAN KANSAL, FENG ZHAO
Microsoft Research, Redmond, WA, USA
Synonyms Distributed sensor fusion
Definition Data fusion in sensor networks is defined as the set of algorithms, processes, and protocols that combine data from multiple sensors. The goal may be to extract information not readily apparent in an individual sensor's data, to improve the quality of information compared to that provided by any individual sensor, or to improve the operation of the network by optimizing the usage of its resources. For instance, the output of a magnetic sensor and an audio sensor may be combined to detect a vehicle (new information), the outputs of multiple vibration sensors may be combined to increase the signal-to-noise ratio (improving quality), or a passive infrared sensor may be combined with a camera in a people detection network to reduce the frame rate of the camera for conserving energy (improving operation). The sensors fused may be of the same or different types. The key features of data fusion in sensor networks that distinguish it from other methods for combining multiple sources of data are that the fusion methods are designed to (i) reduce the amount of communication among and from the sensors, and (ii) increase the lifetime of the sensor network when all or some of the sensors are battery operated. The methods used to combine the data may leverage past observations from the same sensors, previously known models of the sensed phenomenon, and other information in addition to the sensor data.
Historical Background Data fusion in sensor networks is founded on the methods developed for fusion of data in several older systems that used multiple sensors, but not under the same system constraints as are typical of sensor networks. The radar systems used in World War II present one of the first notable examples of such systems. The advantages that these systems demonstrated, namely robustness to the failure of a fraction of the sensors, increased quality of data due to the increased dimensionality of the measurement space, and better discrimination between available hypotheses, have since led to the use of fusion methods in many sensor systems. Most of the algorithms for sensor data processing can be viewed as derivatives of the Kalman filter and related Bayesian methods [7,9]. Decentralized forms of these methods have been developed that take data from multiple sensors as input and produce a single higher quality fused output [15,16]. Latest
advances in these areas are discussed at the IEEE Sensor Array and Multichannel Signal Processing Workshop and the International Conference on Information Fusion among other venues. Other related works on fusion of sensor data are found in computer vision, tracking, and defense applications [2,3,5,6,8]. The achievable advantage in reducing signal distortion through fusion of multiple sensor inputs has been derived using information theoretic methods for general and Gaussian models for the phenomenon and sensor noise [1,12,13].
Foundations The basic data fusion problem may be expressed as follows. The sensor network is deployed to measure a phenomenon represented by a state vector x(t). For example, if the phenomenon is an object being tracked, x(t) may represent a vector of position and velocity of the object at time t. The observation model at sensor i, which relates the observation z to the state x, is assumed to be Gaussian in many of the works, for computational tractability:
$$z_i(t) = H_i\big(x(t)\big) + w_i(t) \qquad (1)$$
Linear models, where H is a matrix, are often used. The belief about the phenomenon at time t is defined to be the a posteriori distribution of x:
$$p(x \mid z_1, \ldots, z_n) \qquad (2)$$
where n is the number of sensors. The belief is sufficient to characterize the phenomenon and compute typical statistics such as the expected value of x and its residual uncertainty after the estimation. The centralized methods to determine this belief from the observations require knowledge of the measurements zi from all the sensors. The decentralized Kalman filters, such as [15,16], typically assume that each of the n sensors in the sensor network is connected to every other sensor and that an O(n²) communication overhead is acceptable. This design may be used for fusion in systems with a small number of sensors or when high data rate communication links are present among sensors, such as networks of defense vehicles. However, such a high overhead is not acceptable in sensor networks based on embedded and wireless platforms. In these systems, both due to the large number of sensors and the low data rates supported by their radios for battery efficiency, the fusion methods must minimize
the communication requirements. Such fusion methods are said to be distributed. One approach [17,18] to realizing a distributed fusion method is to selectively use only a subset of the large number of sensors, namely those that have the most relevant information about the phenomenon being sensed. The fusion method is then also required to provide for appropriate selection of these most informative sensors and to dynamically adapt the set of selected sensors as the phenomenon evolves over time. To this end, a quantitative measure of the usefulness of a sensor for fusion is introduced, referred to as the information content of that sensor. An information utility function is defined:
$$\psi : \mathcal{P}(\mathbb{R}^d) \rightarrow \mathbb{R} \qquad (3)$$
that acts on the class $\mathcal{P}(\mathbb{R}^d)$ of all probability distributions on $\mathbb{R}^d$ and returns a real number, with d being the dimension of x. Specific choices of ψ are derived from information measures known from information theory, such as entropy, the Fisher information matrix, the size of the covariance ellipsoid for Gaussian phenomenon models, sensor geometry based measures and others. An example form of ψ, if information-theoretic entropy is used, is:
$$\psi(p_x) = \int_S p_x(x)\,\log\big(p_x(x)\big)\,dx \qquad (4)$$
where S represents the domain of x and p_x is its probability distribution. Let U ⊆ {1, ..., n} be the set of sensors whose measurements have been incorporated into the belief, i.e., the current belief is:
$$p\big(x \mid \{z_i\}_{i \in U}\big) \qquad (5)$$
If the measurement from sensor j is also selected for inclusion in the computation of the belief, the belief becomes:
$$p\big(x \mid \{z_i\}_{i \in U} \cup \{z_j\}\big) \qquad (6)$$
To select the sensor that has the maximum information content, the sensor j should be selected to maximize the information utility of the belief after including z_j. Noting that j is to be selected from the set A = {1, ..., n} \ U, the best sensor ĵ is:
$$\hat{j} = \arg\max_{j \in A}\; \psi\Big(p\big(x \mid \{z_i\}_{i \in U} \cup \{z_j\}\big)\Big) \qquad (7)$$
However, in practice, the knowledge about z_j is not available before having selected j. The most likely best sensor j can then be selected by computing the expectation of the information utility with respect to z_j:
$$\hat{j} = \arg\max_{j \in A}\; \mathbb{E}_{z_j}\Big[\psi\Big(p\big(x \mid \{z_i\}_{i \in U} \cup \{z_j\}\big)\Big) \,\Big|\, \{z_i\}_{i \in U}\Big] \qquad (8)$$
among other options. Also, the cost of communication from a sensor is explicitly modeled. Suppose the current belief is held at sensor l, referred to as the leader node, and suppose M_c(l, j) denotes the cost of communication to sensor j. Then the sensor selection method chooses the best sensor as follows:
$$\hat{j} = \arg\max_{j \in A}\; \big[\alpha M_u(j) - (1 - \alpha)\, M_c(l, j)\big] \qquad (9)$$
where M_u(j) denotes the expectation of the information utility as expressed in (8), and α ∈ [0, 1] balances the contribution from M_u and M_c. These fundamentals can be used to develop a distributed data fusion algorithm for a sensor network. Suppose the sensor nodes are synchronized in time, and the phenomenon is initially detected at time t = 0. At this time, a known distributed leader election algorithm may be executed to select a node l as the leader, which computes the initial belief using only its own observation. It then selects the next sensor to be included in the belief calculation using (9). The process continues until the belief is known to a satisfactory quality as characterized by known statistical measures. The flowchart of the fusion algorithm followed at all the nodes is shown in Fig. 1. The sensor selection process used in the algorithm above is a greedy one: at each selection step, it only considers a single sensor that optimizes the selection metric. It is possible that selecting multiple sensors at the same time yields a better choice. This can be achieved, at the cost of higher computational complexity, by selecting a set of sensors instead of a single sensor j in (9), using appropriate modifications to the cost and utility metrics. The distributed fusion method described above limits the number of sensors used and hence significantly reduces the communication overhead as compared to the O(n²) overhead of decentralized methods. It is well suited for problems where the sensed phenomenon is localized, such as an object being tracked, since the number of most informative nodes selected can then yield fused results close to those provided by the entire network.
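A minimal sketch of the greedy selection step in (9) is shown below, under the simplifying assumption of linear Gaussian observation models; in that case the posterior covariance does not depend on the actual measurement value, so the expectation in (8) reduces to evaluating the predicted posterior covariance, scored here by an entropy-style utility. The matrices, cost table and parameter names are illustrative and do not come from any specific implementation.

```python
import numpy as np

def predicted_covariance(P, H, R):
    """Posterior covariance after fusing one linear Gaussian observation z = H x + w."""
    S = H @ P @ H.T + R                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    return (np.eye(P.shape[0]) - K @ H) @ P

def select_next_sensor(P, candidates, H, R, comm_cost, leader, alpha=0.7):
    """Greedy step of (9): maximize alpha*utility - (1-alpha)*cost over candidates."""
    best, best_score = None, -np.inf
    for j in candidates:
        P_j = predicted_covariance(P, H[j], R[j])
        utility = -np.log(np.linalg.det(P_j))   # low residual uncertainty = high utility
        score = alpha * utility - (1 - alpha) * comm_cost[leader][j]
        if score > best_score:
            best, best_score = j, score
    return best

# Illustrative use: a 2-dimensional state and two candidate sensors.
P = np.eye(2)                                            # current belief covariance
H = {1: np.array([[1.0, 0.0]]), 2: np.array([[0.0, 1.0]])}
R = {1: np.array([[0.5]]), 2: np.array([[0.1]])}
comm_cost = {0: {1: 1.0, 2: 3.0}}                        # cost of reaching each sensor from leader 0
print(select_next_sensor(P, [1, 2], H, R, comm_cost, leader=0))
```

The parameter alpha plays the role of α in (9), trading information gain against the communication cost of involving the selected sensor.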
Data Fusion in Sensor Networks. Figure 1. Distributed sensor fusion based on selecting only the most informative and cost effective sensors.
In another approach, a distributed Kalman filter is derived. The sensing model is as expressed before in (1). However, instead of using a centralized Kalman filter to estimate the state x, n micro-Kalman filters are used, each executing locally at one of the n sensor nodes and using the measurements from only the local node. In addition, two consensus problems are solved using a method that requires communication only with one-hop wireless neighbors [10,11] (a minimal sketch of such a consensus step appears at the end of this section). This approach is more relevant when the observations from all the nodes in the network are important, such as when measuring a distributed phenomenon. Distributed fusion methods that are tolerant to communication losses and network partitioning have also been developed using message passing on junction trees [14] and techniques from assumed density filtering [4]. Additionally, distributed methods are also available for cases where the sensed phenomenon is not itself required to be reproduced but only some of its properties
are to be obtained. These properties may be global and depend on the measurements of all the sensors in the network, such as the number of targets in the region, the contours of a given phenomenon value in a heat map, or tracking relations among a set of objects [17]. The field of data fusion in sensor networks is rapidly evolving with new advances being presented frequently at forums including the ACM/IEEE IPSN and ACM SenSys.
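As an illustration of the consensus step mentioned above, the sketch below runs a simple average-consensus iteration in which each node exchanges values only with its one-hop neighbors; the graph, step size and variable names are illustrative, and real consensus filters [10,11] embed such updates inside the local Kalman filtering rather than applying them in isolation.

```python
def consensus_step(estimates, neighbors, epsilon=0.1):
    """One synchronous consensus iteration: each node nudges its local estimate
    toward the estimates of its one-hop neighbors."""
    new = {}
    for i, x_i in estimates.items():
        new[i] = x_i + epsilon * sum(estimates[j] - x_i for j in neighbors[i])
    return new

# With a connected graph and a small step size, repeated iterations drive all
# local estimates toward the network-wide average of the initial values.
estimates = {0: 1.0, 1: 4.0, 2: 7.0}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
for _ in range(200):
    estimates = consensus_step(estimates, neighbors)
print(estimates)   # all values close to 4.0
```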
Key Applications Infrastructure monitoring, pervasive health-care, defense, scientific experimentation, environmental sensing, urban monitoring, home automation, supply-chain control, industrial control, business process monitoring, security.
Data Sets Several data sets collected from experimental sensor network deployments are available for researchers to test their data fusion methods:
Center for Embedded Networked Sensing, http://www.sensorbase.org/
Intel Lab Data, http://db.csail.mit.edu/labdata/labdata.html
Cross-references
▶ Data Aggregation in Sensor Networks ▶ Data Estimation in Sensor Networks ▶ In-Network Query Processing
Recommended Reading
1. Berger T., Zhang Z., and Vishwanathan H. The CEO problem. IEEE Trans. Inform. Theory, 42(3):887–902, 1996.
2. Brooks R.R. and Iyengar S.S. Multi-sensor fusion: fundamentals and applications with software. Prentice-Hall, Englewood Cliffs, NJ, 1997.
3. Crowley J.L. and Demazeau Y. Principles and techniques for sensor data fusion. Signal Process., 32(1–2):5–27, 1993.
4. Funiak S., Guestrin C., Paskin M., and Sukthankar R. Distributed inference in dynamical systems. In Advances in Neural Information Processing Systems 19, B. Scholkopf, J. Platt, and T. Hoffman (eds.). MIT Press, Cambridge, MA, 2006, pp. 433–440.
5. Hall D.L. and McMullen S.A.H. Mathematical Techniques in Multisensor Data Fusion. Artech House, 2004.
6. Isard M. and Blake A. Condensation – conditional density propagation for visual tracking. Int. J. Comput. Vision, 29(1):5–28, 1998.
7. Jazwinsky A. Stochastic processes and filtering theory. Academic, New York, 1970.
8. Lodaya M.D. and Bottone R. Moving target tracking using multiple sensors. In Proc. SPIE, Vol. 4048, 2000, pp. 333–344.
9. Maybeck P.S. The Kalman filter: an introduction to concepts. In Autonomous Robot Vehicles, I.J. Cox and G.T. Wilfong (eds.). Springer-Verlag, New York, NY, 1990, pp. 194–204.
10. Olfati-Saber R. Distributed Kalman filtering for sensor networks. In Proc. 46th IEEE Conf. on Decision and Control, 2007.
11. Olfati-Saber R. and Shamma J.S. Consensus filters for sensor networks and distributed sensor fusion. In Proc. 44th IEEE Conf. on Decision and Control, 2005.
12. Oohama Y. The rate distortion function for the quadratic Gaussian CEO problem. IEEE Trans. Inform. Theory, 44(3), 1998.
13. Pandya A., Kansal A., Pottie G.J., and Srivastava M.B. Fidelity and resource sensitive data gathering. In Proc. 42nd Allerton Conference, 2004.
14. Paskin M., Guestrin C., and McFadden J. A robust architecture for distributed inference in sensor networks. In Proc. 4th Int. Symp. on Information Processing in Sensor Networks, 2005.
15. Rao B., Durrant-Whyte H., and Sheen J. A fully decentralized multi-sensor system for tracking and surveillance. Int. J. Robot. Res., 12(1):20–44, 1993.
16. Speyer J.L. Computation and transmission requirements for a decentralized linear-quadratic-Gaussian control problem. IEEE Trans. Automat. Control, 24(2):266–269, 1979.
17. Zhao F., Liu J., Liu J., Guibas L., and Reich J. Collaborative signal and information processing: an information directed approach. Proc. IEEE, 91(8):1199–1209, 2003.
18. Zhao F. and Guibas L. Wireless Sensor Networks: An Information Processing Approach. Morgan Kaufmann, 2004.
Data Gathering ▶ Data Acquisition and Dissemination in Sensor Networks
Data Grids ▶ Storage Grid
Data Imputation ▶ Data Estimation in Sensor Networks
Data Inconsistencies ▶ Data Conflicts
Data Integration ▶ Information Integration
Data Integration Architectures and Methodology for the Life Sciences
ALEXANDRA POULOVASSILIS
University of London, London, UK
Definition Given a set of biological data sources, data integration is the process of creating an integrated resource combining data from the data sources, in order to allow queries and analyses that could not be supported by the individual data sources alone. Biological data
sources are characterized by their high degree of heterogeneity, in terms of their data model, query interfaces and query processing capabilities, data types used, and nomenclature adopted for actual data values. Coupled with the variety, complexity and volumes of biological data that are becoming increasingly available, integrating biological data sources poses many challenges, and a number of methodologies, architectures and systems have been developed to support it.
Historical Background If an application requires data from different data sources to be integrated in order to support users' queries and analyses, one possible solution is for the required data transformation and aggregation functionality to be encoded into the application's programs. However, this may be a complex and lengthy process, and may also affect the robustness and maintainability of the application. These problems have motivated the development of architectures and methodologies which abstract out data transformation and aggregation functionality into generic data integration software. Much work has been done since the early 1990s in developing architectures and methodologies for integrating biological data sources in particular. Many systems have been developed which create and maintain integrated data resources: examples of significant systems are DiscoveryLink [7], K2/Kleisli [3], Tambis [6], SRS [16], Entrez [5], BioMart [4]. The main aim of such systems is to provide users with the ability to formulate queries and undertake analyses on the integrated resource which would be very complex or costly if performed directly on the individual data sources, sometimes prohibitively so. Providing access to a set of biological data sources via one integrated resource poses several challenges, mainly arising from the large volumes, variety and complexity of the data, and the autonomy and heterogeneity of the data sources [2,8,9]. Data sources are developed by different people in differing research environments for differing purposes. Integrating them to meet the needs of new users and applications requires the reconciliation of their different data models, data representation and exchange formats, content, query interfaces, and query processing capabilities. Data sources are in general free to change their data formats and content without considering the impact this may have on integrated resources derived from them. Integrated resources may themselves serve as data sources for higher-level integrations,
resulting in a network of dependencies between biological data resources.
Foundations Three main methodologies and architectures have been adopted for biological data integration, materialized, virtual and link-based: With materialized integration, data from the data sources is imported into a data warehouse and it is transformed and aggregated as necessary in order to conform to the warehouse schema. The warehouse is the integrated resource, typically a relational database. Queries can be formulated with respect to the warehouse schema and their evaluation is undertaken by the database management system (DBMS), without needing to access the original data sources. With virtual integration, a schema is again created for the integrated resource. However, the integrated resource is represented by this schema, and the schema is not populated with actual data. Additional mediator software is used to construct mappings between the data sources and the integrated schema. The mediator software coordinates the evaluation of queries that are formulated with respect to the integrated schema, utilizing the mappings and the query processing capabilities of the database or file management software at the data sources. Data sources are accessed via additional ‘‘wrapper’’ software for each one, which presents a uniform interface to the mediator software. With link-based integration no integrated schema is created. Users submit queries to the integration software, for example via a web-based user interface. Queries are formulated with respect to data sources, as selected by the user, and the integration software provides additional capabilities for facilitating query formulation and speeding up query evaluation. For example, SRS [16] maintains indexes supporting efficient keyword-based search over data sources, and also maintains cross-references between different data sources which are used to augment query results with links to other related data. A link-based integration approach can be adopted if users will wish to query the data sources directly, will only need to pose keyword and navigation-style queries, and the scientific hypotheses that they will be investigating will not require any significant
transformation or aggregation of the data. Otherwise, the adoption of a materialized or virtual integration approach is indicated. The link-based integration approach is discussed further in [1] and in the entry on Pathway Databases. A key characteristic of materialized or virtual integration is that the integrated resource can be queried as though it were itself a single data source rather than an integration of other data sources: users and applications do not need to be aware of the schemas or formats of the original data sources, only the schema/format of the integrated resource. Materialized integration is usually chosen for query performance reasons: distributed access to remote data sources is avoided and sophisticated query optimization techniques can be applied to queries submitted to the data warehouse. Other advantages are that it is easier to clean and annotate the source data than it is by using mappings within a virtual integration approach. However, maintaining a data warehouse can be complex and costly, and virtual integration may be the preferred option if these maintenance costs are too high, or if it is not possible to extract data from the data sources, or if access to the latest versions of the data sources is required. Examples of systems that adopt the materialized integration approach are GUS [3], BioMart [4], Atlas [14], BioMap [12]. With this approach, the standard methodology and architecture for data warehouse creation and maintenance can be applied. This consists of first extracting data from the data sources and transporting it into a "staging" area. Data from the data sources will need to be re-extracted periodically in order to identify changes in the data sources and to keep the warehouse up-to-date. Data extraction from each data source may be either full extraction or incremental extraction. With the former, the entire source data is re-extracted every time, while with the latter only the relevant data that has changed since the previous extraction is extracted. Incremental extraction is likely to be more efficient, but for some data sources it may not be possible to identify the data that has changed since the last extraction, for example due to the limited functionality provided by the data sources, and full extraction may then be the only option. After the source data has been brought into the staging area, the changes from the previous versions of the source data are determined using "difference" algorithms (in the case of full re-extraction) and the changed data is transformed into the format and data types specified
by the warehouse schema. The data is then ‘‘cleaned’’ i.e., errors and inconsistencies are removed, and it is loaded into the warehouse. The warehouse is likely to contain materialized views which transform and aggregate in various ways the detailed data from the data sources. View maintenance capabilities provided by the DBMS can be used to update such materialized views following insertions, updates and deletions of the detailed data. It is also possible to create and maintain additional ‘‘data marts’’ each supporting a set of specialist users via a set of additional views specific to their requirements. The warehouse serves as the single data source for each data mart, and a similar process of extraction, transformation, loading and aggregating occurs to create and maintain the data mart. One particular characteristic of biological data integration, as compared with business data integration for example, is the prevalence of both automated and manual annotation of data, either prior to its integration, or during the integration process, or both. For example, the Distributed Annotation System (DAS) (http://www.biodas.org) allows annotations to be generated and maintained by the owners of data resources, while the GUS data warehouse supports annotations that track the origins of data, information about algorithms or annotation software used to derive new inferred data, and who performed the annotation and when. Being able to find out the provenance of any data item in an integrated resource is likely to be important for users, and this is even more significant in biological data integration where multiple annotation processes may be involved. Another characteristic of biological data integration is the wide variety of nomenclatures adopted by different data sources. This greatly increases the difficulty of aggregating their data and has led to the proposal of many standardized ontologies, taxonomies and controlled vocabularies to help alleviate this problem e.g., from the Gene Ontology (GO) Consortium, Open Biomedical Ontologies (OBO) Consortium, Microarray Gene Expression Data (MGED) Society and Proteomics Standards Initiative (PSI). The role of ontologies in scientific data integration is discussed in the entry on Ontologies in Scientific Data Integration. Another key issue is the need to resolve possible inconsistencies in the ways that biological entities are identified within the data sources. The same biological entity may be identified differently in different data sources or, conversely, the same identifier may be used for different biological
entities in different data sources. There have been a number of initiatives to address the problem of inconsistent identifiers e.g., the Life Sciences Identifiers (LSID) initiative and the International Protein Index (IPI). Despite such initiatives, there is still a legacy of large numbers of non-standardized identifiers in biological datasets and therefore techniques are needed for associating biological entities independently of their identifiers. One technique is described in [12] where a clustering approach is used to identify sets of data source entities that are likely to refer to the same real-world entity. Another area of complexity is that data sources may evolve their schemas over time to meet the needs of new applications or new experimental techniques (henceforth, the term ‘‘data source schema’’ is used to encompass also the data representation and exchange formats of data sources that are not databases). Changes in the data source schemas may require modification of the extraction-transformation-loading (ETL), view materialization and view maintenance procedures. Changes in the warehouse schema may impact on data marts derived from it and on the procedures for maintaining these. Turning now to virtual data integration, architectures that support virtual data integration typically include the following components: A Repository for storing information about data sources, integrated schemas, and the mappings between them. A suite of tools for constructing integrated schemas and mappings, using a variety of automatic and interactive methods. A Query Processor for coordinating the evaluation of queries formulated with respect to an integrated schema; the Query Processor first reformulates such a query, using the mappings in the Repository, into an equivalent query expressed over the data source schemas; it then optimizes the query and evaluates it, submitting as necessary sub-queries to the appropriate data source Wrappers and merging the results returned by them. An extensible set of Wrappers, one for each type of data source being integrated; each Wrapper extracts metadata from its data source for storage in the Repository, translates sub-queries submitted to it by the Query Processor into the data source’s query formalism, issues translated sub-queries to the data source, and translates sub-query results returned by
the data source into the Query Processor's data model for further post-processing by the Query Processor. An integrated schema may be defined in terms of a standard data modeling language, or it may be a source-independent ontology defined in an ontology language and serving as a ''global'' schema for multiple potential data sources beyond the specific ones that are being integrated (as in TAMBIS for example). The two main integration methodologies are top-down and bottom-up. With top-down integration, the integrated schema, IS, is first constructed, or may already exist from previous schema design, integration or standardization efforts. The set of mappings, M, between IS and the data source schemas is then defined. With bottom-up integration, an initial version of IS and M are first constructed – for example, these may be based on just one of the data source schemas. The integrated schema IS and the set of mappings M are then incrementally extended by considering in turn each of the other data source schemas: for each object O in each source schema, M is modified so as to encompass the mapping between O and IS, if it is possible to do so using the current IS; otherwise, IS is extended as necessary in order to encompass the data represented by O, and M is then modified accordingly. A mixed top-down/bottom-up approach is also possible: an initial IS may exist from a previous design or standardization activity, but it may need to be extended in order to encompass additional data arising from the set of data sources being integrated within it. With either top-down, bottom-up or mixed integration, it is possible that IS will not need to encompass all of the data of the data sources, but only a subset of the data which is sufficient for answering key queries and analyses – this avoids the possibly complex process of constructing a complete integrated schema and set of mappings. There are a number of alternatives to defining the set of mappings M above, and different data integration systems typically adopt different approaches: with the global-as-view (GAV) approach, each mapping relates one schema object in IS with a view that is defined over the source schemas; with the local-as-view (LAV) approach, each mapping relates one schema object in one of the source schemas with a view defined over IS; and with the global-local-as-view (GLAV) approach, each mapping relates a view over a
source schema with a view over IS [10,11]. Another approach is the both-as-view [13] approach supported by the AutoMed system. This provides a set of primitive transformations on schemas, each of which adds, deletes or renames a schema object. The semantic relationships between objects in the source schemas and the integrated schema are represented by reversible sequences of such transformations. The ISPIDER project [15] uses AutoMed for virtual integration of several Grid-enabled proteomics data sources. In addition to the approach adopted for specifying mappings between source and integrated schemas, different systems may also make different assumptions about the degree of semantic overlap between the data sources: some systems assume that each data source contributes to a different part of the integrated resource (e.g., K2/Kleisli); some relax this assumption but do not undertake any aggregation of duplicate or overlapping data that may be present in the data sources (e.g., TAMBIS); and some can support aggregation at both the schema and the data levels (e.g., AutoMed). The degree of data source overlap impacts on the degree of schema and data aggregation that will need to be undertaken by the mappings, and hence on their complexity and the design effort involved in specifying them. The complexity of the mappings in turn impacts on the sophistication of the query processing mechanisms that will be needed in order to optimize and evaluate queries posed on the integrated schema.
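To make the contrast between these mapping styles concrete, the following minimal Python sketch expresses a GAV mapping as a view (here, a function) over two hypothetical protein sources and a LAV mapping as a description of one source as a view over the integrated schema; all relation, attribute, and value names are invented for illustration and are not taken from any of the systems mentioned above.

```python
# A minimal sketch of GAV versus LAV mappings over two hypothetical protein
# data sources; relation and attribute names are invented for illustration.

# Source relations, represented as lists of dicts standing in for tables.
src1_protein = [{"acc": "P00001", "name": "Kinase A", "organism": "H. sapiens"}]
src2_entry = [{"id": "Q00002", "label": "Kinase B", "species": "M. musculus"}]

# GAV: an integrated-schema object is defined as a view over the sources.
def integrated_protein():
    view = [{"accession": r["acc"], "protein_name": r["name"], "taxon": r["organism"]}
            for r in src1_protein]
    view += [{"accession": r["id"], "protein_name": r["label"], "taxon": r["species"]}
             for r in src2_entry]
    return view

# LAV: a source object is described as a view over the integrated schema;
# a query processor must rewrite queries on the integrated schema using such
# descriptions rather than evaluating them directly.
def src1_protein_described_as_view(integrated):
    return [{"acc": p["accession"], "name": p["protein_name"], "organism": p["taxon"]}
            for p in integrated if p["taxon"] == "H. sapiens"]

if __name__ == "__main__":
    print(integrated_protein())
    print(src1_protein_described_as_view(integrated_protein()))
```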
Key Applications Integrating, analyzing and annotating genomic data. Predicting the functional role of genes and integrating function-specific information. Integrating organism-specific information. Integrating and analyzing chemical compound data and metabolic pathway data to support drug discovery. Integrating protein family, structure and pathway data with gene expression data, to support functional genomics data analysis. Integrating, analyzing and annotating proteomics data sources recording data from experiments on protein separation and identification. Supporting systems biology research. Integrating phylogenetic data sources for genealogical reconstruction.
Integrating data about genomic variations in order to analyze the impact of genomic variations on health. Integrating genomic and proteomic data with clinical data to support personalized medicine.
Future Directions
Identifying semantic correspondences between different data sources is a necessary prerequisite to integrating them. This is still largely a manual and time-consuming process undertaken with significant input from domain experts. Semi-automatic techniques are being developed to alleviate this problem, for example, name-based or structural comparisons of source schemas, instance-based matching at the data level to determine overlapping schema concepts, and annotation of data sources with terms from ontologies to facilitate automated reasoning over the data sources. The transformation of source data into an integrated resource may result in loss of information, for example due to imprecise knowledge about the semantic correspondences between data sources. This is leading to research into capturing within the integrated resource incomplete and uncertain information, for example using probabilistic or logic-based representations and reasoning. Large amounts of information are potentially available in textual form within published scientific articles. Automated techniques are being developed for extracting information from such sources using grammar and rule-based approaches, and then integrating this information with other structured or semi-structured biological data. Traditional approaches to data integration may not be sufficiently flexible to meet the needs of distributed communities of scientists. Peer-to-peer data integration techniques are being developed in which there is no single administrative authority for the integrated resource and it is maintained instead by a community of peers who exchange, transform and integrate data in a pair-wise fashion and who cooperate in query processing over their data. Finally, increasing numbers of web services are being made available to access biological data and computing resources – see, for example, the entry on Web Services and the Semantic Web for Life Science Data. Similar problems arise in combining such web services into larger-scale workflows as in integrating biological data sources: the necessary services are often created independently by different parties, using
different technologies, formats and data types, and therefore additional code needs to be developed to transform the output of one service into a format that can be consumed by another.
Cross-references
▶ Data Provenance ▶ Ontologies and Life Science Data Management ▶ Pathway Databases ▶ Provenance in Scientific Databases ▶ Semantic Data Integration for Life Science Entities ▶ Web Services and the Semantic Web for Life Science Data
Recommended Reading
1. Cohen-Boulakia S., Davidson S., Froidevaux C., Lacroix Z., and Vidal M.E. Path-based systems to guide scientists in the maze of biological data sources. J. Bioinformatics Comput. Biol., 4(5):1069–1095, 2006.
2. Davidson S., Overton C., and Buneman P. Challenges in integrating biological data sources. J. Comput. Biol., 2(4):557–572, 1995.
3. Davidson S.B., et al. K2/Kleisli and GUS: experiments in integrated access to genomic data sources. IBM Syst. J., 40(2):512–531, 2001.
4. Durnick S., et al. Biomart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics, 21(16):3439–3440, 2005.
5. Entrez – the life sciences search engine. Available at: http://www.ncbi.nlm.nih.gov/Entrez
6. Goble C.A., et al. Transparent access to multiple bioinformatics information sources. IBM Syst. J., 40(2):532–551, 2001.
7. Haas L.M., et al. Discovery Link: a system for integrated access to life sciences data sources. IBM Syst. J., 40(2):489–511, 2001.
8. Hernandez T. and Kambhampati S. Integration of biological sources: current systems and challenges ahead. ACM SIGMOD Rec., 33(3):51–60, 2004.
9. Lacroix Z. and Critchlow T. Bioinformatics: Managing Scientific Data. Morgan Kaufmann, San Francisco, CA, 2004.
10. Lenzerini M. Data integration: a theoretical perspective. In Proc. 21st ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2002, pp. 233–246.
11. Madhavan J. and Halevy A.Y. Composing mappings among data sources. In Proc. 29th Int. Conf. on Very Large Data Bases, 2003, pp. 572–583.
12. Maibaum M., et al. Cluster based integration of heterogeneous biological databases using the AutoMed toolkit. In Proc. 2nd Int. Workshop on Data Int. in the Life Sciences, 2005, pp. 191–207.
13. McBrien P. and Poulovassilis A. Data integration by bidirectional schema transformation rules. In Proc. 19th Int. Conf. on Data Engineering, 2003, pp. 227–238.
14. Shah S.P., et al. Atlas – a data warehouse for integrative bioinformatics. BMC Bioinformatics, 6:34, 2005.
15. Zamboulis L., et al. Data access and integration in the ISPIDER Proteomics Grid. In Proc. 3rd Int. Workshop on Data Integration in the Life Sciences, 2006, pp. 3–18.
16. Zdobnov E.M., Lopez R., Apweiler R., and Etzold T. The EBI SRS Server – recent developments. Bioinformatics, 18(2):368–373, 2002.
Data Integration in Web Data Extraction System
Marcus Herzog 1,2
1 Vienna University of Technology, Vienna, Austria
2 Lixto Software GmbH, Vienna, Austria
Synonyms
Web information integration and schema matching; Web content mining; Personalized Web
Definition Data integration in Web data extraction systems refers to the task of providing a uniform access to multiple Web data sources. The ultimate goal of Web data integration is similar to the objective of data integration in database systems. However, the main difference is that Web data sources (i.e., Websites) do not feature a structured data format which can be accessed and queried by means of a query language. In contrast, Web data extraction systems need to provide an additional layer to transform Web pages into (semi)structured data sources. Typically, this layer provides an extraction mechanism that exploits the inherent document structure of HTML pages (i.e., the document object model), the content of the document (i.e., text), visual cues (i.e., formatting and layout), and the inter document structure (i.e., hyperlinks) to extract data instances from the given Web pages. Due to the nature of the Web, the data instances will most often follow a semi-structured schema. Successful data integration then requires to solve the task of reconciling the syntactic and semantic heterogeneity, which evolves naturally from accessing multiple independent Web sources. Semantic heterogeneity can be typically observed both on the schema level and the data instance level. The output of the Web data integration task is a unified data schema along with consolidated data instances that can be queried in a structured way. From an operational point of view, one can distinguish between on-demand integration of Web data (also
referred to as metasearch) and off-line integration of Web data similar to the ETL process in data warehouses.
Historical Background
The concept of data integration was originally conceived by the database community. Whenever data are not stored in a single database with a single data schema, data integration needs to resolve the structural and semantic heterogeneity found in databases built by different parties. This is a problem that researchers have been addressing for years [8]. In the context of web data extraction systems, this issue is even more pressing due to the fact that web data extraction systems usually deal with schemas of semi-structured data, which are more flexible from both a structural and a semantic perspective. The Information Manifold [12] was one of the systems that not only integrated relational databases but also took Web sources into account. However, these Web sources were structured in nature and were queried by means of a Web form. Answering a query involved a join across the relevant web sites. The main focus of the work was on providing a mechanism to describe declaratively the contents and query capabilities of the available information sources. Some of the first research systems which covered the aspects of data integration in the context of Web data extraction systems were ANDES, InfoPipes, and a framework based on the Florid system. These systems combine languages for web data extraction with mechanisms to integrate the extracted data in a homogeneous data schema. ANDES [15] is based on the Extensible Stylesheet Language Transformations (XSLT) for both data extraction and data integration tasks. The ANDES framework merges crawler technology with XML-based extraction techniques and utilizes templates, (recursive) path expressions, and regular expressions for data extraction, mapping, and aggregation. ANDES is primarily a software framework, requiring application developers to manually build a complete process from components such as Data Retriever, Data Extractor, Data Checker, and Data Exporter. The InfoPipes system [10] features a workbench for visual composition of processing pipelines utilizing XML-based processing components. The components are defined as follows: Source, Integration, Transformation, and Delivery. Each of those components features a configuration dialog to interactively define the configuration of the component. The components can
be arranged on the canvas of the workbench and can be connected to form information processing pipelines, thus the name InfoPipes. The Source component utilized ELOG programs [13] to extract semi-structured data from Websites. All integration tasks are subsequently performed on XML data. The Integration component also features a visual dialog to specify the reconciliation of the syntactic and semantic heterogeneity in the XML documents. These specifications are then translated into appropriate XSLT programs to perform the reconciliation during runtime. In [14] an integrated framework for Web exploration, wrapping, data integration, and querying is described. This framework is based on the Florid [13] system and utilizes a rule-based object-oriented language which is extended by Web accessing capabilities and structured document analysis. The main objective of this framework is to provide a unified framework – i.e., data model and language – in which all tasks (from Web data extraction to data integration and querying) are performed. Thus, these tasks are not necessarily separated, but can be closely intertwined. The framework allows for modeling the Web both on the page level as well as on the parse-tree level. Combined rules for wrapping, mediating, and Web exploration can be expressed in the same language and with the same data model. More recent work can be found in the context of Web content mining. Web content mining focuses on extracting useful knowledge from the Web. In Web content mining, Web data integration is a fundamental aspect, covering both schema matching and data instance matching.
Foundations Semi-structured Data
Web data extraction applications often utilize XML as a data representation formalism. This is due to the fact that the semi-structured data format naturally matches the HTML document structure. In fact, XHTML is an application of XML. XML provides a common syntactic format. However, it does not offer any means for addressing the semantic integration challenge. Query languages such as XQuery [5], XPath [2] or XSLT [11] provide the mechanism to manipulate the structure and the content of XML documents. These languages can be used as a basis for implementing integration systems. The semantic integration aspect has to be dealt with on top of the query language.
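As a small illustration of this extraction layer, the sketch below uses Python's standard xml.etree.ElementTree module and simple path expressions to pull records out of a hypothetical XHTML fragment; the page structure, class names, and values are invented for the example and do not reflect any particular system described above.

```python
# A minimal sketch of wrapper-style extraction from a hypothetical XHTML
# fragment; the element structure and class names are illustrative only.
import xml.etree.ElementTree as ET

xhtml = """
<div>
  <table>
    <tr class="item"><td class="name">CanoScan 3000ex</td><td class="price">89.00</td></tr>
    <tr class="item"><td class="name">CanoScan 8400F</td><td class="price">199.00</td></tr>
  </table>
</div>
"""

root = ET.fromstring(xhtml)
records = []
# Exploit the document structure: select rows and fields by path expressions.
for row in root.findall(".//tr[@class='item']"):
    records.append({
        "name": row.find("td[@class='name']").text,
        "price": float(row.find("td[@class='price']").text),
    })
print(records)
```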
Schema and Instance Matching
The main issue in data integration is finding the semantic mapping between a number of data sources. In the context of Web extraction systems, these sources are web pages or, more generally, websites. There are three distinct approaches to the matching problem: manual, semiautomatic, or automatic matching. In the manual approach, an expert needs to define the mapping by using a toolset. This is, of course, time-consuming. Automatic schema matching, in contrast, is AI-complete [3] and well researched in the database community [16], but typically still lacks reliability. In the semiautomatic approach, automatic matching algorithms suggest certain mappings which are validated by an expert. This approach saves time because the automatic step narrows the candidate mappings down to the most relevant ones. An example of a manual data integration framework is given in [6]. The Harmonise framework [9] deals with business-to-business (B2B) integration on the ''information'' layer by means of an ontology-based mediation. It allows organizations with different data standards to exchange information seamlessly without having to change their proprietary data schemas. Part of the Harmonise framework is a mapping tool that allows for manually generating mapping rules between two XML schema documents. In contrast to the manual mapping approach, automated schema mapping has to rely on clues that can be derived from the schema descriptions: utilizing the similarities between the names of the schema elements or taking the amount of overlap of data values or data types into account. While matching schemas is already a time-consuming task, reconciling the data instances is even more cumbersome. Due to the fact that data instances are extracted from autonomous and heterogeneous websites, no global identifiers can be assumed. The same real-world entity may have different textual representations, e.g., ''CANOSCAN 3000ex 48 Bit, 1200×2400 dpi'' and ''Canon CanoScan 3000ex, 1200×2400 dpi, 48 Bit.'' Moreover, data extracted from the Web is often incomplete and noisy. In such a case, a perfect match will not be possible. Therefore, a similarity metric for text joins has to be defined. Most often the widely used and established cosine similarity metric [17] from the information retrieval field is used to identify string matches. A sample implementation of text joins for Web data integration based on an unmodified RDBMS is given in [7]. Due to the fact that the number of data instances is
much higher than the number of schema elements, data instance reconciliation has to rely on automatic procedures.
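The following minimal sketch illustrates instance matching with cosine similarity over token-frequency vectors, using product strings like those quoted above; the tokenization and the choice of a third, non-matching string are illustrative only, and practical text joins typically add tf.idf weighting.

```python
# A minimal sketch of data instance matching via cosine similarity over
# token-frequency vectors; the strings and tokenization are illustrative.
import math
import re
from collections import Counter

def tokens(s):
    # Split into lowercase alphabetic and numeric tokens, e.g. "48Bit" -> ["48", "bit"].
    return re.findall(r"[a-z]+|[0-9]+", s.lower())

def cosine(a, b):
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

s1 = "CANOSCAN 3000ex 48 Bit, 1200x2400 dpi"
s2 = "Canon CanoScan 3000ex, 1200x2400dpi, 48Bit"
s3 = "HP ScanJet 3970, 2400x2400 dpi, 48 Bit"

# Pairs whose similarity exceeds a chosen threshold are treated as the same
# real-world entity; s1 and s2 score noticeably higher than s1 and s3.
print(round(cosine(s1, s2), 2), round(cosine(s1, s3), 2))
```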
Web Content Mining
Web content mining uses the techniques and principles of data mining to extract specific knowledge from Web pages. An important step in Web mining is the integration of extracted data. Due to the fact that Web mining has to work at Web scale, a fully automated process is required. In the Web mining process, Web data records are extracted from Web pages and serve as input for the subsequent processing steps. Because of its large-scale approach, Web mining calls for novel methods that draw from a wide range of fields spanning data mining, machine learning, natural language processing, statistics, databases, and information retrieval [4].
Key Applications Web data integration is required for all applications that draw data from multiple Web sources and need to interpret the data in a new context. The following main application areas can be identified: Vertical Search
In contrast to web search as provided by major search engines, vertical search targets a specific domain such as travel offers, job offers, or real estate offers. Vertical search applications typically deliver more structured results than conventional web search engines. While the focus of web search is to cover the breadth of all available websites and deliver the most relevant websites for a given query, vertical search typically searches fewer websites, but with the objective of retrieving relevant data objects. The output of a vertical search query is a result set that contains, e.g., the best air fares for a specific route. Vertical search also needs to address the challenge of searching the deep Web, i.e., extracting data by means of automatically utilizing web forms. Data integration in the context of vertical search is important both for interface matching, i.e., merging the source query interfaces and mapping them onto a single query interface, and for result data object matching, where data extracted from the individual websites is matched against a single result data model.
In Web Intelligence applications, the main objective is to gain new insights from the data extracted on the
Web. Typical application fields are market intelligence, competitive intelligence, and price comparison. Price comparison applications are probably the best-known application type in this field. In a nutshell, these applications aggregate data from the Web and integrate different Web data sources according to a single data schema to allow for easy analysis and comparison of the data. Schema matching and data reconciliation are important aspects of this type of application.
Situational Applications
Situational applications are a new type of application where people with domain knowledge can build an application in a short amount of time without the need to set up an IT project. In the context of the Web, Mashups address these needs. With Mashups, ready-made widgets are used to bring together content extracted from multiple websites. Additional value is derived by exploiting the relationship between the different sources, e.g., visualizing the location of offices in a mapping application. In this context, Web data integration is required to reconcile the data extracted from different Web sources and to resolve the references to real-world objects.
Cross-references
▶ Data Integration ▶ Enterprise Application Integration ▶ Enterprise Information Integration ▶ Schema Matching ▶ Web Data Extraction
Recommended Reading
1. Baumgartner R., Flesca S., and Gottlob G. Visual web information extraction with Lixto. In Proc. 27th Int. Conf. on Very Large Data Bases, 2001, pp. 119–128.
2. Berglund A., Boag S., Chamberlin D., Fernandez M.F., Kay M., Robie J., and Simeon J. (eds.). XML Path Language (XPath) 2.0. W3C Recommendation, 2007.
3. Bernstein P.A., Melnik S., Petropoulos M., and Quix C. Industrial-strength schema matching. ACM SIGMOD Rec., 33(4):38–43, 2004.
4. Bing L. and Chen-Chuan-Chang K. Editorial: special issue on web content mining. ACM SIGKDD Explorations Newsletter, 6(2):1–4, 2004.
5. Boag S., Chamberlin D., Fernandez M.F., Florescu D., Robie J., and Simeon J. (eds.). XQuery 1.0. An XML Query Language. W3C Recommendation, 2007.
6. Fodor O. and Werthner E. Harmonise: a step toward an interoperable e-tourism marketplace. Intl. J. Electron. Commerce, 9(2):11–39, 2005.
7. Gravano L., Panagiotis G.I., Koudas N., and Srivastava D. Text joins in an RDBMS for web data integration. In Proc. 12th Int. World Wide Web Conference, 2003, pp. 90–101.
8. Halevy A., Rajaraman A., and Ordille J. Data integration: the teenage years. In Proc. 32nd Int. Conf. on Very Large Data Bases, 2006, pp. 9–18.
9. Harmonise Framework. Available at: http://sourceforge.net/projects/hmafra/.
10. Herzog M. and Gottlob G. InfoPipes: a flexible framework for m-commerce applications. In Proc. 2nd Int. Workshop on Technologies for E-Services, 2001, pp. 175–186.
11. Kay M. (ed.). XSL Transformations. Version 2.0. W3C Recommendation, 2007.
12. Kirk T., Levy A.Y., Sagiv Y., and Srivastava D. The information manifold. In Proc. Working Notes of the AAAI Spring Symp. on Information Gathering from Heterogeneous, Distributed Environments. Stanford University. AAAI Press, 1995, pp. 85–91.
13. Ludäscher B., Himmeröder R., Lausen G., May W., and Schlepphorst C. Managing semistructured data with florid: a deductive object-oriented perspective. Inf. Syst., 23(9):589–613, 1998.
14. May W. and Lausen G. A uniform framework for integration of information from the web. Inf. Syst., 29:59–91, 2004.
15. Myllymaki J. Effective web data extraction with standard XML technologies. Comput. Networks, 39(5):653–644, 2002.
16. Rahm E. and Bernstein P.A. A survey of approaches to automatic schema matching. VLDB J., 10(4):334–350, 2001.
17. Salton G. and McGill M.J. Introduction to Modern Information Retrieval. McGraw-Hill, New York, NY, 1983.
Data Integrity Services ▶ Security Services
Data Lineage ▶ Data Provenance
Data Manipulation Language ▶ Query Language
Data Map ▶ Thematic Map
Data Mart
Il-Yeol Song
Drexel University, Philadelphia, PA, USA
Definition A data mart is a small-sized data warehouse focused on a specific subject. While a data warehouse is meant for an entire enterprise, a data mart is built to address the specific analysis needs of a business unit. Hence, a data mart can be defined as ‘‘a small-sized data warehouse that contains a subset of the enterprise data warehouse or a limited volume of aggregated data for specific analysis needs of a business unit, rather than the needs of the whole enterprise.’’ Thus, an enterprise usually ends up having many data marts.
Key Points
While a data warehouse is for a whole enterprise, a data mart focuses on a specific subject of a specific business unit. Thus, the design and management of a data warehouse must consider the needs of the whole enterprise, while those of a data mart are focused on the analysis needs of a specific business unit such as the sales department or the finance department. Thus, a data mart shares the characteristics of a data warehouse, such as being subject-oriented, integrated, nonvolatile, and a time-variant collection of data [1]. Since the scope and goal of a data mart are different from those of a data warehouse, however, there are some important differences between them: While the goal of a data warehouse is to address the needs of the whole enterprise, the goal of a data mart is to address the needs of a business unit such as a department. While the data of a data warehouse are fed from OLTP (Online Transaction Processing) systems, those of a data mart are fed from the enterprise data warehouse. While the granularity of a data warehouse is raw at its OLTP level, that of a data mart is usually lightly aggregated for optimal analysis by the users of the business unit. While the coverage of a data warehouse is fully historical to address the needs of the whole enterprise, that of a data mart is limited for the specific needs of a business unit.
An enterprise usually ends up having multiple data marts. Since the data for all the data marts are fed from the enterprise data warehouse, it is very important to maintain the consistency between a data mart and the data warehouse as well as among the data marts themselves. A way to maintain the consistency is to use the notion of a conformed dimension. A conformed dimension is a standardized dimension or a master reference dimension that is shared across multiple data marts [2]. The technology for advanced data analysis for business intelligence in the context of a data mart environment is called OLAP (Online Analytic Processing). Two OLAP technologies are ROLAP (Relational OLAP) and MOLAP (Multidimensional OLAP). In ROLAP, data are structured in the form of a star schema or a dimensional model. In MOLAP, data are structured in the form of multidimensional data cubes. For MOLAP, specialized OLAP software is used to support the creation of data cubes and OLAP operations such as drill-down and roll-up.
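As a small, concrete illustration of the ROLAP case, the sketch below (using Python's built-in sqlite3 module) creates a conformed date dimension shared by a hypothetical sales mart and finance mart and rolls the sales facts up to the month level; all table names, columns, and values are invented for the example.

```python
# A minimal ROLAP-style sketch: two hypothetical fact tables share a conformed
# date dimension, and a query rolls sales up to the month level.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE sales_fact (date_key INTEGER REFERENCES dim_date, product TEXT, amount REAL);
CREATE TABLE finance_fact (date_key INTEGER REFERENCES dim_date, account TEXT, balance REAL);

INSERT INTO dim_date VALUES (1, '2009-01-15', '2009-01', 2009), (2, '2009-02-10', '2009-02', 2009);
INSERT INTO sales_fact VALUES (1, 'widget', 120.0), (2, 'widget', 80.0), (2, 'gadget', 45.0);
INSERT INTO finance_fact VALUES (1, 'cash', 1000.0), (2, 'cash', 1150.0);
""")

# Roll-up of the sales mart to the month level via the shared (conformed) dimension.
for month, total in con.execute("""
    SELECT d.month, SUM(f.amount)
    FROM sales_fact f JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.month"""):
    print(month, total)
```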
Cross-references
▶ Active and Real-Time Data Warehousing ▶ Business Intelligence ▶ Data Mining ▶ Data Warehouse ▶ Data Warehouse Life-Cycle and Design ▶ Data Warehouse Maintenance, Evolution and Versioning ▶ Data Warehouse Metadata ▶ Data Warehouse Security ▶ Data Warehousing and Quality Data Management for Clinical Practice ▶ Data Warehousing for Clinical Research ▶ Data Warehousing Systems: Foundations and Architectures ▶ Dimension ▶ Extraction, Transformation, and Loading ▶ Multidimensional Modeling ▶ On-Line Analytical Processing ▶ Transformation
Recommended Reading
1. Inmon W.H. Building the Data Warehouse, 3rd edn. Wiley, New York, 2002.
2. Kimball R. and Ross M. The Data Warehouse Toolkit, 2nd edn. Wiley, New York, 2002.
Data Migration ▶ Data Exchange
Data Mining
Jiawei Han
University of Illinois at Urbana-Champaign, Urbana, IL, USA
Synonyms
Knowledge discovery from data; Data analysis; Pattern discovery
Definition
Data mining is the process of discovering knowledge or patterns from massive amounts of data. As a young research field, data mining represents the confluence of a number of research fields, including database systems, machine learning, statistics, pattern recognition, high-performance computing, and specific application fields, such as WWW, multimedia, and bioinformatics, with broad applications. As an interdisciplinary field, data mining has several major research themes based on its mining tasks, including pattern mining and analysis, classification and predictive modeling, cluster and outlier analysis, and multidimensional (OLAP) analysis. Data mining can also be categorized based on the kinds of data to be analyzed, such as multirelational data mining, text mining, stream mining, web mining, multimedia (or image, video) mining, spatiotemporal data mining, information network analysis, biological data mining, financial data mining, and so on. It can also be classified based on the mining methodology or the issues to be studied, such as privacy-preserving data mining, parallel and distributed data mining, and visual data mining.
Historical Background
Data mining activities can be traced back to the dawn of early human history when data analysis methods (e.g., statistics and mathematical computation) were needed and developed for finding knowledge from data. As a distinct but interdisciplinary field, knowledge discovery and data mining can be viewed as starting at The First International Workshop on Knowledge Discovery from Data (KDD) in 1989. The first International Conference on Knowledge Discovery and Data Mining (KDD) was held in 1995. Since then, there have been a number of international conferences and several scientific journals dedicated to the field of knowledge discovery and data mining. Many conferences on database systems, machine learning, pattern recognition, statistics, and the World-Wide Web have also published influential research results on data mining. There are also many textbooks published on data mining, such as [5,7,8,11,12], or on specific aspects of data mining, such as data cleaning [4] and web mining [2,9]. Recently, there is also a trend to organize dedicated conferences and workshops on mining specific kinds of data or specific issues in data mining, such as the International Conferences on Web Search and Data Mining (WSDM).
Foundations
The overall knowledge discovery process usually consists of a few steps, including (i) data preprocessing, e.g., data cleaning and data integration (and possibly building up a data warehouse), (ii) data selection, data transformation (and possibly creating data cubes by multidimensional aggregation), and feature extraction, (iii) data mining, (iv) pattern or model evaluation and justification, and (v) knowledge update and application. Data mining is an essential step in the knowledge discovery process. As a dynamic research field, many scalable and effective methods have been developed for mining patterns and knowledge from an enormous amount of data, which contributes to theories, methods, implementations, and applications of knowledge discovery and data mining. Several major themes are briefly outlined below.
Mining Interesting Patterns from Massive Amount of Data
Frequent patterns are the patterns (e.g., itemsets, subsequences, or substructures) that occur frequently in data sets. This line of research started with association rule mining [1] and has proceeded to mining sequential patterns, substructure (or subgraph) patterns, and their variants. Many scalable mining algorithms have been developed and most of them explore the Apriori (or downward closure) property of frequent patterns, i.e., any subpattern of a frequent pattern is frequent. However, to make discovered patterns truly useful in many applications, it is important to study how to mine interesting frequent patterns [6], e.g., the patterns that satisfy certain constraints, patterns that reflect
strong correlation relationships, compressed patterns, and the patterns with certain distinct features. The discovered patterns can also be used for classification, clustering, outlier analysis, feature selection (e.g., for index construction), and semantic annotation.
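The following minimal sketch illustrates the Apriori (downward closure) property on a toy transaction set: a candidate (k+1)-itemset is generated only when all of its k-subsets are frequent. The transactions and support threshold are invented for illustration, and the code is not an optimized implementation of any particular algorithm.

```python
# A minimal sketch of frequent itemset mining using the Apriori (downward
# closure) property; transactions and the support threshold are illustrative.
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
min_support = 3

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Frequent 1-itemsets.
levels = [{frozenset([i]) for t in transactions for i in t
           if support(frozenset([i])) >= min_support}]

k = 1
while levels[-1]:
    candidates = set()
    for x in levels[-1]:
        for y in levels[-1]:
            union = x | y
            # Keep a candidate only if every k-subset is frequent (downward closure).
            if len(union) == k + 1 and all(frozenset(s) in levels[-1] for s in combinations(union, k)):
                candidates.add(union)
    levels.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level in levels:
    for itemset in sorted(level, key=sorted):
        print(sorted(itemset), support(itemset))
```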
Scalable Classification and Predictive Modeling
There are many classification methods developed in machine learning [10] and statistics [8], including decision tree induction, rule induction, naive-Bayesian, Bayesian networks, neural networks, support vector machines, regression, and many statistical and pattern analysis methods [5,7,8,11,12]. Recent data mining research has been exploring scalable algorithms for such methods as well as developing new classification methods for handling different kinds of data, such as data streams, text data, web data, multimedia data, and high-dimensional biological data. For example, a pattern-based classification method called DDPMine [3], which first extracts multidimensional features by discriminative frequent pattern analysis and then performs classification using these features, has demonstrated high classification accuracy and efficiency.
Cluster and Outlier Analysis
Data mining research has contributed a great deal to the recent development of scalable and effective cluster analysis methods. New methods have been proposed to make partitioning and hierarchical clustering methods more scalable and effective. For example, the micro-clustering idea in BIRCH [13] first groups objects into tight micro-clusters based on their inherent similarity, and then performs flexible and efficient clustering on top of a relatively small number of micro-clusters. Moreover, new clustering methodologies, such as density-based clustering, link-based clustering, projection-based clustering of high-dimensional space, user-guided clustering, pattern-based clustering, and (spatial) trajectory clustering methods have been developed and various applications have been explored, such as clustering high-dimensional microarray data sets, image data sets, and interrelated multi-relational data sets. Furthermore, outlier analysis methods have been investigated, which go beyond typical statistical distribution-based or regression deviation-based outlier analysis, and move towards distance-based or density-based outlier analysis, local outlier analysis, and trajectory outlier analysis.
Multidimensional (OLAP) Analysis
Each object/event in a dataset usually carries multidimensional information. Mining data in multidimensional space will substantially increase the power and flexibility of data analysis. By integration of data cube and OLAP (online analytical processing) technologies with data mining, the power and flexibility of data analysis can be substantially increased. Data mining research has been moving towards this direction with the proposal of OLAP mining, regression cubes, prediction cubes, and other scalable high-dimensional data analysis methods. Such multidimensional, especially high-dimensional, analysis tools will ensure that data can be analyzed in hierarchical, multidimensional structures efficiently and flexibly at user’s finger tips. OLAP mining will substantially enhance the power and flexibility of data analysis and lead to the construction of easy-to-use tools for the analysis of massive data with hierarchical structures in multidimensional space. Mining Different Kinds of Data
Different data mining methods are often needed for different kinds of data and for various application domains. For example, mining DNA sequences, moving object trajectories, time-series sequences on stock prices, and customer shopping transaction sequences require rather different sequence mining methodology. Therefore, another active research frontier is the development of data- or domain-specific mining methods. This leads to diverse but flourishing research on mining different kinds of data, including multi-relational data, text data, web data, multimedia data, geo-spatial data, temporal data, data streams, information networks, biological data, financial data, and science and engineering data.
Key Applications Data mining claims a very broad spectrum of applications since in almost every domain, there is a need for scalable and effective methods and tools to analyze massive amounts of data. Two applications are illustrated here as examples: Biological Data Mining
The fast progress of biomedical and bioinformatics research has led to the accumulation of an enormous amount of biological and bioinformatics data. However, the analysis of such data poses much greater challenges than traditional data analysis methods. For example, genes and proteins are gigantic in size
(e.g., a DNA sequence could be in billions of base pairs), very sophisticated in function, and the patterns of their interactions are largely unknown. Thus it is a fertile field to develop sophisticated data mining methods for in-depth bioinformatics research. Substantial research is badly needed to produce powerful mining tools for biology and bioinformatics studies, including comparative genomics, evolution and phylogeny, biological data cleaning and integration, biological sequence analysis, biological network analysis, biological image analysis, biological literature analysis (e.g., PubMed), and systems biology. From this point of view, data mining is still very young with respect to biology and bioinformatics applications.
Data Mining for Software Engineering
Software program executions potentially (e.g., when program execution traces are turned on) generate huge amounts of data. However, such data sets are rather different from the datasets generated in nature or collected from video cameras since they represent the executions of program logics coded by human programmers. It is important to mine such data to monitor program execution status, improve system performance, isolate software bugs, detect software plagiarism, analyze programming system faults, and recognize system malfunctions. Data mining for software engineering can be partitioned into static analysis and dynamic/stream analysis, based on whether the system can collect traces beforehand for post-analysis or it must react in real time to handle online data. Different methods have been developed in this domain by integration and extension of the methods developed in machine learning, data mining, pattern recognition, and statistics. For example, statistical analyses such as hypothesis testing can be performed on program execution traces to isolate the locations of bugs which distinguish program success runs from failure runs. Despite its limited success, it is still a rich domain for data miners to research and further develop sophisticated, scalable, and real-time data mining methods.
Future Directions
There are many challenging issues to be researched further, and therefore, there are a great many research frontiers in data mining. Besides the mining of biological data and software engineering data, as well as the above
introduced advanced mining methodologies, a few more research directions are listed here. Mining Information Networks
Information network analysis has become an important research frontier, with broad applications, such as social network analysis, web community discovery, cyberphysical network analysis, and network intrusion detection. However, information network research should go beyond explicitly formed, homogeneous networks (e.g., web page links, computer networks, and terrorist e-connection networks) and delve deeply into implicitly formed, heterogeneous, dynamic, interdependent, and multidimensional information networks, such as gene and protein networks in biology, highway transportation networks in civil engineering, theme-author-publication-citation networks in digital libraries, wireless telecommunication networks among commanders, soldiers and supply lines in a battle field. Invisible Data Mining
It is important to build data mining functions as an invisible process in many systems (e.g., ranking search results based on relevance and some sophisticated, preprocessed evaluation functions) so that users may not even sense that data mining has been performed beforehand or is being performed, and that their browsing and mouse clicking simply use, or further explore, the results of data mining. Google has done excellent invisible data mining work for web search and certain web analysis. It is highly desirable to introduce such functionality to many other systems.
Due to the security and privacy concerns, it is appealing to perform effective data mining without disclosure of private or sensitive information to outsiders. Much research has contributed to this theme and it is expected that more work in this direction will lead to powerful as well as secure data mining methods.
Experimental Results There are many experimental results reported in numerous conference proceedings and journals.
Data Sets There are many, many data sets (mostly accessible on the web) that can be or are being used for data mining.
University of California at Irvine has an online repository of large data sets which encompasses a wide variety of data types, analysis tasks, and application areas. The website of UCI Knowledge Discovery in Databases Archive is http://kdd.ics.uci.edu. Researchers and practitioners should work on real data sets as much as possible to generate data mining tools for real applications.
URL to Code
Weka (http://www.cs.waikato.ac.nz/ml/weka) presents a collection of machine learning and data mining algorithms for solving real-world data mining problems. RapidMiner (http://rapid-i.com), which was previously called YALE (Yet Another Learning Environment), is free open-source software for knowledge discovery, data mining, and machine learning. IlliMine (IlliMine.cs.uiuc.edu) is a collection of data mining software derived from the research of the Computer Science department at the University of Illinois at Urbana-Champaign. For frequent pattern mining, the organizers of the FIMI (Frequent Itemset Mining Implementations) workshops provide a repository for frequent itemset mining implementations at http://fimi.cs.helsinki.fi. There are many other websites providing source or object code on data mining.
Cross-references
▶ Association Rules ▶ Bayesian Classification ▶ Classification ▶ Classification by Association Rule Analysis ▶ Clustering Overview and Applications ▶ Data, Text, and Web Mining in Healthcare ▶ Decision Rule Mining in Rough Set Theory ▶ Decision Tree Classification ▶ Decision Trees ▶ Dimensionality Reduction ▶ Event pattern detection ▶ Event Prediction ▶ Exploratory data analysis ▶ Frequent graph patterns ▶ Frequent itemset mining with constraints ▶ Frequent itemsets and association rules ▶ Machine learning in Computational Biology ▶ Mining of Chemical Data ▶ Opinion mining ▶ Pattern-growth methods
▶ Privacy-preserving data mining ▶ Process mining ▶ Semi-supervised Learning ▶ Sequential patterns ▶ Spatial Data Mining ▶ Spatio-temporal Data Mining ▶ Stream Mining ▶ Temporal Data Mining ▶ Text Mining ▶ Text mining of biological resources ▶ Visual Association Rules ▶ Visual Classification ▶ Visual Clustering ▶ Visual Data Mining
Recommended Reading
1. Agrawal R. and Srikant R. Fast algorithms for mining association rules. In Proc. 20th Int. Conf. on Very Large Data Bases, 1994, pp. 487–499.
2. Chakrabarti S. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002.
3. Cheng H., Yan X., Han J., and Yu P.S. Direct discriminative pattern mining for effective classification. In Proc. 24th Int. Conf. on Data Engineering, 2008, pp. 169–178.
4. Dasu T. and Johnson T. Exploratory Data Mining and Data Cleaning. Wiley, 2003.
5. Duda R.O., Hart P.E., and Stork D.G. Pattern Classification, 2nd edn. Wiley, New York, 2001.
6. Han J., Cheng H., Xin D., and Yan X. Frequent pattern mining: Current status and future directions. Data Min. Knowl. Disc., 15:55–86, 2007.
7. Han J. and Kamber M. Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann, 2006.
8. Hastie T., Tibshirani R., and Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
9. Liu B. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer, 2006.
10. Mitchell T.M. Machine Learning. McGraw-Hill, 1997.
11. Tan P., Steinbach M., and Kumar V. Introduction to Data Mining. Addison Wesley, 2005.
12. Witten I.H. and Frank E. Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann, 2005.
13. Zhang T., Ramakrishnan R., and Livny M. BIRCH: an efficient data clustering method for very large databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1996, pp. 103–114.
Data Mining in Bioinformatics ▶ Machine Learning in Computational Biology
Data Mining in Computational Biology ▶ Machine Learning in Computational Biology
Data Mining in Moving Objects Databases ▶ Spatio-Temporal Data Mining
Data Mining in Systems Biology ▶ Machine Learning in Computational Biology
Data Mining Pipeline ▶ KDD Pipeline
Data Mining Process ▶ KDD Pipeline
Data Model Mapping ▶ Logical Database Design: From Conceptual to Logical Schema
Data Organization ▶ Indexing and Similarity Search
Data Partitioning
Daniel Abadi
Yale University, New Haven, CT, USA
Definition
Data Partitioning is the technique of distributing data across multiple tables, disks, or sites in order to improve query processing performance or increase database manageability. Query processing performance can be improved in one of two ways. First, depending on how the data is partitioned, in some cases it can be determined a priori that a partition does not have to be accessed to process the query. Second, when data is partitioned across multiple disks or sites, I/O parallelism and in some cases query parallelism can be attained as different partitions can be accessed in parallel. Data partitioning improves database manageability by optionally allowing backup or recovery operations to be done on partition subsets rather than on the complete database, and can facilitate loading operations into rolling windows of historical data by allowing individual partitions to be added or dropped in a single operation, leaving other data untouched.
Key Points
There are two dominant approaches to data partitioning. Horizontal partitioning divides a database table tuple-by-tuple, allocating different tuples to different partitions. This is typically done using one of five techniques:
1. Hash partitioning allocates tuples to partitions by applying a hash function to an attribute value (or multiple attribute values) within the tuple. Tuples with equivalent hash function values get allocated to the same partition.
2. Range partitioning allocates tuples to partitions by using ranges of attribute values as the partitioning criteria. For example, tuples from a customer table with last name attribute beginning with ''A''–''C'' are mapped to partition 1, ''D''–''F'' mapped to partition 2, etc.
3. List partitioning allocates tuples to partitions by associating a list of attribute values with each partition. Using range or list partitioning, it can be difficult to ensure that each partition contains approximately the same number of tuples.
4. Round-robin partitioning allocates the ith tuple from a table to the (i mod n)th partition where n is the total number of partitions.
5. Composite partitioning combines several of the above techniques, typically range partitioning followed by hash partitioning.
Vertical partitioning divides a table column-by-column, allocating different columns (or sets of
columns) to different partitions. This approach is less frequently used relative to horizontal partitioning since it is harder to parallelize query processing over multiple vertical partitions, and merging or joining partitions is often necessary at query time. Column-stores are databases that specialize in vertical partitioning, usually taking the approach to the extreme, storing each column separately.
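The following minimal sketch illustrates how hash, range, and round-robin policies assign tuples to partitions; the customer tuples, the partition count, and the range boundaries are chosen purely for illustration.

```python
# A minimal sketch of three horizontal partitioning policies; the tuples,
# partition count, and range boundaries are illustrative only.
NUM_PARTITIONS = 4

def hash_partition(tup, key):
    # Tuples with equal key values land in the same partition.
    return hash(tup[key]) % NUM_PARTITIONS

def range_partition(tup, key, boundaries=("D", "G", "T")):
    # Boundaries split the key domain into NUM_PARTITIONS ranges,
    # e.g. A-C, D-F, G-S, T-Z for last names.
    for i, upper in enumerate(boundaries):
        if tup[key] < upper:
            return i
    return len(boundaries)

def round_robin_partition(i):
    # The i-th tuple goes to partition (i mod n).
    return i % NUM_PARTITIONS

customers = [{"last_name": "Abadi"}, {"last_name": "Codd"},
             {"last_name": "Gray"}, {"last_name": "Stonebraker"}]
for i, c in enumerate(customers):
    print(c["last_name"], hash_partition(c, "last_name"),
          range_partition(c, "last_name"), round_robin_partition(i))
```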
Cross-references
▶ Horizontally Partitioned Data ▶ Parallel Query Processing
Data Pedigree ▶ Data Provenance
Data Perturbation ▶ Matrix Masking
Data Privacy and Patient Consent
David Hansen 1, Christine M. O'Keefe 2
1 The Australian e-Health Research Centre, Brisbane, QLD, Australia
2 CSIRO Preventative Health National Research Flagship, Acton, ACT, Australia
Synonyms Data protection
Definition Data privacy refers to the interest individuals and organisations have in the collection, sharing, use and disclosure of information about those individuals or organizations. Common information types raising data privacy issues include health (especially genetic), criminal justice, financial, and location. The recent rapid growth of electronic data archives and associated data technologies has increased the importance of data privacy issues, and has led to a growing body of legislation and codes of practice.
Patient consent, in relation to data, refers to a patient’s act of approving the collection, sharing, use or disclosure of information about them. It is important because data with appropriate patient consent often falls into exception clauses in privacy legislation. The challenge in data privacy is to balance the need to share and use data with the need to protect personally identifiable information and respect patient consent. For example, a person may have an expectation that their health data is being held securely but is available to clinicians involved in their care. In addition, researchers seek access to health data for medical research and to answer questions of clinical and policy relevance. Part of the challenge is that there are a range of views about, for example, what constitutes legitimate uses of data, what level of consent is required and how fine-grained that consent should be, what constitutes a clinical ‘‘care team,’’ and what privacy legislation should cover. The development of technologies to support data privacy and patient consent is currently attracting much attention internationally.
Historical Background Privacy issues began to attract attention in Europe and North America in the 1960s, and shortly thereafter in Australia. Probably a large factor in the relatively late recognition of privacy as a fundamental right is that most modern invasions of privacy involve new technology. For example, before the invention of computer databases, data were stored on paper in filing cabinets which made it difficult to find and use the information. The traditional ways of addressing the harm caused by invasions of privacy through invoking trespass, assault or eavesdropping were no longer sufficient in dealing with invasions of privacy enacted with modern information technologies. In the health care area, the notion of consent arose in the context of both treatment and clinical trials in research into new treatments. Patients undergoing a procedure or treatment would either be assumed to have given implicit consent by their cooperation, or would be required to sign an explicit statement of consent. Participants in clinical trials would be asked to give a formal consent to a trial, considered necessary because of the risk of harm to the individual due to the unknown effects of the intervention under research. The notion of consent has transferred to the context of the (primary) use of data for treatment and the
Data Privacy and Patient Consent
(secondary) use of data for research. For example, implied or expressed consent would be required for the transfer of medical records to a specialist or may be required for the inclusion of data in a research database or register. Increasingly, and somewhat controversially, health data are being made available for research without patient consent, but under strict legal and ethical provisions. Where this is the case, ethics committees are given the responsibility to decide if the research benefits outweigh to a substantial degree the public interest in protecting privacy. Ethics committees will generally look for de-identified data to be used wherever practical. A recent development related to the privacy of health data is the concept of participation or moral rights (http://www.privireal.org/). This can be viewed as an objection to the use of personal information on moral grounds; that individuals should have the right to know, and veto, how their data is used, even when there is no risk to the release of that personal information. The major data privacy-related legislative provisions internationally are Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data, the Health Insurance Portability and Accountability Act (HIPAA) enacted by the US Congress in 1996, the Data Protection Act 1998 of the UK Parliament and the Australian Privacy Act 1988. There are apparent differences between these provisions in terms of both their scope and philosophy. Recent technological approaches to data privacy include: tight governance and research approvals processes, restricted access through physical and IT security, access control mechanisms, de-identification, statistical disclosure limitation and remote server technology for remote access and remote execution. Patient consent is often built into the approvals process, but can also be a component of the access control mechanism. In all cases the purpose of access to the data is of prime importance.
Foundations The use of sensitive personal data falls broadly into two areas: primary and secondary use. Primary use of data refers to the primary reason the data was captured. Increasing amounts of personal health data are being stored in databases for the
purpose of maintaining a lifelong health record of an individual. While in most cases this will provide for more appropriate care, there will be times when the individual will not want sensitive data revealed, even to a treating clinician. This may particularly be the case when the data is judged not relevant to the current medical condition. In some electronic health record programs, patients will have the option to opt in or opt out of the program. This can be a controversial aspect of health record systems, and often legislation is needed to support this aspect of any collection of health-related data. Once the patient data has been entered into an electronic health record system, there will generally be several layers of security protecting the data. Authentication is often performed using a two-pass system – a password and a certificate are required to enter the system, to ensure that the person accessing the system is correctly authenticated. Once the person is authenticated, access to appropriate data must be managed. Often a Role Based Access Control (RBAC) system [5] is implemented to ensure that access to the data is only granted to people who should have it. There are also now XML-based markup languages, such as XACML, which enable Role Based Access Control rules to be encoded using the language and hence shared by multiple systems. Audit of access to electronically stored health data is also important, to enable review of the access granted to the data. Storage and transmission of data is also an issue for privacy and confidentiality of patient data. Encryption of the data when being transmitted is one way of ensuring the security of the data. There are also new computer systems, now available for the storage and access of electronic data, which support Mandatory Access Control (MAC) and embed the security algorithms in the computer hardware rather than at the software level. Security Assertion Markup Language (SAML) tokens are one technology being used to store privacy and security information with health data as they are transmitted between computers. The USA HIPAA legislation covers the requirements of how to capture, store and transmit demographic and clinical data.
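As a simple illustration of the role-based approach just mentioned, the sketch below checks a requested permission against a user's roles; the roles, permissions, and record categories are hypothetical and are not drawn from XACML or any other standard.

```python
# A minimal sketch of a role-based access control check for health records;
# the roles and permissions are hypothetical.
ROLE_PERMISSIONS = {
    "treating_clinician": {"read_clinical", "write_clinical"},
    "ward_clerk": {"read_demographics"},
    "researcher": {"read_deidentified"},
}

def can_access(user_roles, required_permission):
    # Access is granted if any of the user's roles carries the permission.
    return any(required_permission in ROLE_PERMISSIONS.get(role, set())
               for role in user_roles)

print(can_access({"treating_clinician"}, "read_clinical"))  # True
print(can_access({"ward_clerk"}, "read_clinical"))          # False
```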
collected. This includes purposes like medical research and policy analysis, which are unlikely to have a direct effect on the treatment of the patient whose data is used. There are a range of technological approaches to the problem of enabling the use of data for research and policy analysis while protecting privacy and confidentiality. None of these technologies provides a complete answer, for each must be implemented within an appropriate legislative and policy environment and governance structure, with appropriate management of the community of authorised users and with an appropriate level of IT security including user authentication, access control, system audit and follow-up. In addition, none of the technologies discussed here is the only solution to the problem, since there are many different scenarios for the use of data, each with a different set of requirements. It is clear that different technologies and approaches have different strengths and weaknesses, and so are suitable for different scenarios. A high level discussion of the problem of enabling the use of data while protecting privacy and confidentiality typically discusses two broad approaches. The first is restricted access, where access is only provided to approved individuals and for approved purposes. Further restrictions can be imposed, such as access to data only given at a restricted data centre, restrictions on the types of analyses which can be conducted and restrictions on the types of outputs which can be taken out of a data centre. There can be a cost associated with access to the data. The second is restricted or altered data, where something less than the full data set is published or the data are altered in some way before publication. Restricted data might involve removing attributes, aggregating geographic classifications or aggregating small groups of data. For altered data, some technique is applied to the data so that the released dataset does not reveal private or confidential information. Common examples here include the addition of noise, data swapping or the release of synthetic data. Often these broad approaches are used in combination. Below, three current technological approaches to the problem are reviewed. These fall into the category of restricted or altered data described above, and all are used in combination with restricted access. The first approach is to release de-identified data to researchers under strict controls. De-identification is a
very complex issue surrounded by some lack of clarity and standard terminology. It is also very important as it underpins many health information privacy guidelines and legislation. First, it is often not at all clear what is meant when the term ‘‘de-identified’’ is used to refer to data. Sometimes it appears to mean simply that nominated identifiers such as name, address, date of birth and unique identifying numbers have been removed from the data. At other times its use appears to imply that individuals represented in a data set cannot be identified from the data – though in turn it can be unclear what this means. Of course simply removing nominated identifiers is often insufficient to ensure that individuals represented in a data set cannot be identified – it can be a straightforward matter to match some of the available data fields with the corresponding fields from external data sets, and thereby obtain enough information to determine individuals’ names either uniquely or with a low uncertainty. In addition, sufficiently unusual records in a database without nominated identifiers can sometimes be recognized. This is particularly true of health information or of information which contains times and/or dates of events. The second approach is statistical disclosure control where the techniques aim to provide researchers with useful statistical data at the same time as preserving privacy and confidentiality. It is widely recognized that any release of data or statistical summaries increases the risk of identification of some individual in the relevant population, with the consequent risk of harm to that individual through inference of private information about them. On the other hand, attempts to limit such disclosures can adversely affect the outcomes or usefulness of statistical analyses conducted on the data. Statistical disclosure control theory attempts to find a balance between these opposing objectives. Statistical disclosure control techniques can be organized into categories in several different ways. First, there are different techniques for tabular data (where data are aggregated into cells) versus microdata (individual level data). Second, techniques can be perturbative or non-perturbative. Perturbative methods operate by modifying the data, whereas non-perturbative methods do not modify the data. Perhaps the most well-known perturbative method is the addition of random ‘‘noise’’ to a dataset, and perhaps the most well-known nonperturbative method is cell suppression. In fact, current
non-perturbative methods operate by suppressing or reducing the amount of information released, and there is much ongoing debate on whether a good perturbative method gives more useful information than a non-perturbative method. On the other hand, it has been noted that perturbative techniques which involve adding noise provide weak protection and are vulnerable to repeated queries, essentially because the noise becomes error in models of the data. There is much activity directed at developing perturbative techniques that do not suffer from this problem. Virtually every statistical disclosure control technique can be implemented with differing degrees of intensity, and hence depends on a parameter which is usually pre-specified. The third approach is the technology of remote analysis servers, which are designed to deliver useful results of user-specified statistical analyses with acceptably low risk of a breach of privacy and confidentiality. Such servers do not provide data to users, but rather allow statistical analysis to be carried out via a remote server. A user submits statistical queries by some means, analyses are carried out on the original data in a secure environment, and the user then receives the results of the analyses. In some cases the output is designed so that it does not reveal private information about the individuals in the database. The approach has several advantages. First, no information is lost through confidentialization, and there is no need for special analysis techniques to deal with perturbed data. In many cases it is found to be easier to confidentialize the output of an analysis, in comparison to trying to confidentialize a dataset when it is not known which analyses will be performed. However, analysis servers are not free from the risk of disclosure, especially in the face of multiple, interacting queries. Research in this area describes the risks and proposes quantifiable measures of risk and data utility that can be used to specify which queries can be answered and with what output; such risk-utility frameworks have been illustrated for regression models. Each of the broad technologies is implemented within the context that the analyst is trusted to comply with legal and ethical undertakings made. However, the different approaches have been designed with different risks of disclosure of private information, and so rely more or less heavily on trust. De-identification requires the greatest trust in the researcher, while remote servers
require the least. Statistical disclosure control, whether used alone or in combination with a remote analysis server, is somewhere in between these two extremes. De-identification provides the most detailed information to the researcher, while remote servers provide the least. Again, statistical disclosure control is in between. These mechanisms of maintaining data privacy are made more difficult when it is necessary to link two or more databases together for the purpose of building a linked data set for analysis. This sort of research is increasingly being used as a way of reducing the cost of collecting new data and of making better use of existing data. The difficulties are twofold. First, there is the linkage step, i.e., recognizing that patients are the same across the data sets. Recent work has included blindfolded linkage methodologies [1,2] and encryption techniques [4] as ways of linking datasets while not revealing any patient information. The second difficulty lies in the greater chance of revealing information from a linked data set than from a single data set, especially when combining data from multiple modalities, e.g., health data and geographic data.
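The following sketch (not taken from the entry) illustrates why removing nominated identifiers is often insufficient: it counts how many records remain unique on a combination of quasi-identifiers and could therefore be matched against an external data set. The attribute names and records are invented for the example.

from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    # Count records whose quasi-identifier combination is unique in the data set.
    keys = [tuple(r[a] for a in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique, len(records)

# Names have already been removed, yet the third record is still unique on
# (birth_year, postcode, sex) and hence potentially re-identifiable.
patients = [
    {"birth_year": 1950, "postcode": "2600", "sex": "F", "diagnosis": "asthma"},
    {"birth_year": 1950, "postcode": "2600", "sex": "F", "diagnosis": "diabetes"},
    {"birth_year": 1931, "postcode": "2602", "sex": "M", "diagnosis": "angina"},
]
unique, total = reidentification_risk(patients, ["birth_year", "postcode", "sex"])
print(f"{unique} of {total} records are unique on the quasi-identifiers")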
Key Applications As discussed above, data privacy and patient consent have an impact on large sections of the health care industry and the biomedical research community. There are many applications which will need to consider data privacy and patient consent issues. Below is a discussion of Electronic Health Records, possibly the fastest growing example of data collected for primary use, and medical research, often the largest source of requests for data for secondary use.
Electronic Health Records
Electronic Health Records (EHR) are the fastest growing example of an application which concerns data privacy and patient consent. Increasing amounts of personal health data are being stored in databases for the purpose of maintaining a lifelong health record of an individual. The data is being stored according to a number of different international and local standards under development, both for the format of the data, for example openEHR (http://www.openehr.org/), and for the transmission of the data, such as HL7 (http://www.hl7.org/). Some of this data is stored using codes from clinical terminologies, while some of it will be free-text reports. These data are being stored so that they are available to clinicians for the purpose of treating the individual.
While in most cases this will provide more appropriate care, there will be times when the individual will not want sensitive data revealed, even to a treating clinician. This may particularly be the case when the data is judged not relevant to the current medical condition. With a number of countries introducing Electronic Health Records, there are concerns over who will have access to the data. Generally these are governed by strict privacy policies, as well as allowing patients the opportunity to have some level of control over whether data is added to the EHR or not.
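As a minimal illustration of the role-based access control mentioned earlier, the sketch below checks a requested permission against a role-to-permission table. The roles and permissions are invented for the example; a production EHR system would typically derive such rules from XACML policies and couple them with auditing.

# Hypothetical role-to-permission mapping; real systems would load such rules
# from an access control policy rather than hard-code them.
ROLE_PERMISSIONS = {
    "treating_clinician": {"read_clinical", "write_clinical"},
    "billing_clerk": {"read_demographics"},
    "researcher": {"read_deidentified"},
}

def can_access(role, permission):
    # Return True only if the role grants the requested permission.
    return permission in ROLE_PERMISSIONS.get(role, set())

print(can_access("billing_clerk", "read_clinical"))      # False
print(can_access("treating_clinician", "read_clinical")) # True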
Medical Research
Secondary use of data is primarily concerned with providing health data for clinical or medical research. For most secondary data use, it is possible to use de-identified data, as described above. Increasingly, secondary use of data involves the linkage of data sets to bring different modalities of data together, which raises more concerns over the privacy of the data as described above. The publication of the Human Genome gave rise to new ways of finding relationships between clinical disease and human genetics. The increasing use and storage of genetic information also impacts the use of familial records, since the information about the patient also provides information on the patient's relatives. The issues of data privacy and patient confidentiality and the use of the data for medical research are made more difficult in this post-genomic age.
Cross-references
▶ Access Control ▶ Anonymity ▶ Electronic Health Record ▶ Exploratory Data Analysis ▶ Health Informatics ▶ Privacy Policies and Preferences ▶ Privacy-Enhancing Technologies ▶ Privacy-Preserving Data Mining ▶ Record Linkage
Recommended Reading
1. Agrawal R., Evfimievski A., and Srikant R. Information sharing across private databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003, pp. 86–97.
2. Churches T. and Christen P. Some methods for blindfolded record linkage. BMC Med. Inform. Decis. Making, 4:9, 2004.
3. Domingo-Ferrer J. and Torra V. (eds.). Privacy in Statistical Databases. Lect. Notes Comput. Sci., Vol. 3050. Springer Berlin Heidelberg, 2004.
4. O'Keefe C.M., Yung M., and Baxter R. Privacy-preserving linkage and data extraction protocols. In Workshop on Privacy in the Electronic Society in conjunction with the 11th ACM CCS Conference, 2004.
5. Sandhu R.S., Coyne E.J., Feinstein H.L., and Youman C.E. Role-based access control models. IEEE Comput., 29(2):38–47, 1996.

Data Problems ▶ Data Conflicts
Data Profiling THEODORE JOHNSON AT&T Labs – Research, Florham Park, NJ, USA
Synonyms Database profiling
Definition Data profiling refers to the activity of creating small but informative summaries of a database [5]. These summaries range from simple statistics such as the number of records in a table and the number of distinct values of a field, to more complex statistics such as the distribution of n-grams in the field text, to structural properties such as keys and functional dependencies. Database profiles are useful for database exploration, detection of data quality problems [4], and for schema matching in data integration [5]. Database exploration helps a user identify important database properties, whether it is data of interest or data quality problems. Schema matching addresses the critical question, ‘‘do two fields or sets of fields or tables represent the same information?’’ Answers to these questions are very useful for designing data integration scripts.
Historical Background
Databases which support a complex organization tend to be quite complex also. Quite often, documentation and metadata are incomplete and outdated, no DBA understands the entire system, and the actual data fails to match documented or expected properties [2]. These problems greatly complicate already difficult tasks such as database migration and integration, and in fact database profiling was originally developed for
their support. Newer developments in database profiling support database exploration, and finding and diagnosing data quality problems.
Foundations A database profile is a collection of summaries about the contents of a database. These summaries are usually collected by making a scan of the database (some profiles use sampled data, and some require multiple complex queries). Many of the profile statistics are collected by the DBMS for query optimization. If the optimizer statistics are available, they can be retrieved instead of calculated – though one must ensure that the optimizer’s statistics are current before using them. Profile summaries are typically stored in a database for fast interactive retrieval. Basic Statistics
These statistics include schema information (table and field names, field types, etc.) and various types of counts, such as the number of records in a table, and the number of special values of a field (typically, the number of null values). Distributional Statistics
These statistics summarize the frequency distribution of field values: for each distinct value of a field, how often does it occur. Examples include the number of distinct values of a field, the entropy of the frequency distribution [10], and the most common values and their frequency of occurrence. Another summary is the inverse frequency distribution, which is the distribution of the frequency distribution (e.g., the number of distinct values which occur once, occur twice, and so on). While the inverse frequency distribution is often small, it can become large and in general needs to be summarized also. Textual Summaries
A textual summary represents the nature of the data in a field. These summaries are very useful for exploring the value and pattern distributions and for field content matching in a schema matching task, i.e., to determine whether or not two fields across tables or databases represent similar content. Textual summaries apply to fields with numeric as well as string data types. The issue is that many identifiers are numeric, such as telephone numbers, Social Security numbers, IP addresses, and so on.
These numeric identifiers might be stored as numeric or string fields in a given table – or even combined with other string fields. To ensure that field matches can be made in these cases, numeric fields should be converted to their string representation for textual matching. Patterns (say, regular expressions) which most field values conform to are very useful in identifying anomalies and data quality issues. Minhash Signatures: One type of summary is very useful in this regard, the minhash signature [1]. To compute a minhash signature of a field, one starts with N hash functions from the field domain to the integers. For each hash function, compute the hash value of each field value, and collect the minimum. The collection of minimum hash values for each hash function constitutes the minhash signature. For example, suppose our set consists of X = {3,7,13,15} and our two hash functions are h1(x) = x mod 10, and h2(x) = x mod 5 (these are simple but very poor hash functions). Then, min{h1(x) | x in X} = 3, and min{h2(x) | x in X} = 0. Therefore the minhash signature of X is {3, 0}. A surprising property of the minhash signature is its ability to determine the intersection of two sets. Given two minhash signatures for sets A and B, the number of hash functions with the same minimum value divided by the number of hash functions is an estimator for the resemblance of two sets, which is the size of the intersection of A and B divided by the size of their union (r = |A∩B|/|A∪B|). Given knowledge of |A| and |B| (from the distributional statistics), an estimate of the size of the intersection is |A∩B| = r(|A| + |B|)/(1 + r). If one extends the minhash signature to include the number of times that the minimum value of a hash function occurred, one can summarize the tail of the inverse frequency distribution [3]. Augmenting the minhash signature summary with the counts of the most frequent values (from the distributional statistics), which constitute the head, completes the summary. Substring summaries: Another type of textual summary determines if two fields have textually similar information, i.e., many common substrings. As with approximate string matching [6], these summaries rely on q-grams – all consecutive q-letter sequences in the field values. One type of approximate textual summary collects the distribution of all q-grams within a field's values, and summarizes this distribution using a sketch
such as the minhash signature or the min-count sketch [3]. Two fields are estimated to be textually similar if their q-gram distributions, represented by the sketches, are similar.
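A minimal Python sketch of the minhash computation described above. The toy hash functions repeat the worked example from the text; the random hash family, the modulus and the two integer sets are only illustrative assumptions, and a real profiler would apply a few hundred independent hash functions to string representations of the field values.

import random

def minhash_signature(values, hash_funcs):
    # One minimum per hash function over all field values.
    return [min(h(v) for v in values) for h in hash_funcs]

def estimate_resemblance(sig_a, sig_b):
    # Fraction of hash functions whose minima agree estimates |A ∩ B| / |A ∪ B|.
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

# The worked example from the text: X = {3,7,13,15}, h1(x) = x mod 10, h2(x) = x mod 5.
toy_funcs = [lambda x: x % 10, lambda x: x % 5]
print(minhash_signature({3, 7, 13, 15}, toy_funcs))   # [3, 0]

# A larger random family of hash functions gives usable resemblance estimates.
random.seed(0)
P = 2_147_483_647
hash_funcs = []
for _ in range(200):
    a, b = random.randrange(1, P), random.randrange(P)
    hash_funcs.append(lambda x, a=a, b=b: (a * x + b) % P)

A = set(range(0, 1000))
B = set(range(200, 1200))            # true resemblance is 800/1200, about 0.67
r = estimate_resemblance(minhash_signature(A, hash_funcs),
                         minhash_signature(B, hash_funcs))
est_intersection = r * (len(A) + len(B)) / (1 + r)   # |A ∩ B| estimate from the formula above
print(round(r, 2), round(est_intersection))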
Structural Summaries
Some summaries represent patterns among fields in a table. The two most common examples are keys and functional dependencies (FDs). Since a table might be corrupted by data quality problems, another type of structural summary is approximate keys and approximate FDs [7], which hold for most (e.g., 98%) of the records in a table. Key and FD finding is a very expensive procedure. Verifying whether or not a set X of fields is a key can be performed using a count-distinct query on X, which returns the number of distinct values of X. Now, X is an (approximate) key if the number of distinct values of X is (approximately) equal to the size of the table. And an FD X → Y (approximately) holds if the number of distinct values of X is (approximately) equal to the number of distinct values of (X ∪ Y). An exhaustive search of a d-field table requires 2^d expensive count-distinct operations on the table. There are several ways to reduce this cost. For one, keys and FDs with a small number of fields are more interesting than ones with a large number of fields (large keys and FDs are likely spurious – because of the limited size of the table – and likely do not indicate structural properties of the data). If the maximum number of fields considered is k (e.g., k = 3), then the search space is limited to O(d^k). The search space can be trimmed further by searching for minimal keys and functional dependencies [7]. Finally, one can hash string fields to reduce the cost of computing count-distinct queries over them (if exact keys and FDs are required, a candidate key or FD found using hashing must be verified by a query over the actual table).
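The key and FD tests above reduce to comparing distinct counts. The sketch below applies them to rows held in memory; a real profiler would instead issue count-distinct queries against the DBMS, and the column names, example rows and the 98% threshold are only illustrative assumptions.

def distinct_count(rows, cols):
    # Number of distinct value combinations of the given columns.
    return len({tuple(r[c] for c in cols) for r in rows})

def is_approximate_key(rows, cols, threshold=0.98):
    # cols form an approximate key if almost every row has a distinct combination.
    return distinct_count(rows, cols) >= threshold * len(rows)

def fd_holds(rows, lhs, rhs, threshold=0.98):
    # Approximate FD lhs -> rhs: distinct(lhs) is (almost) equal to distinct(lhs ∪ rhs).
    return distinct_count(rows, lhs) >= threshold * distinct_count(rows, lhs + rhs)

rows = [
    {"cust_id": 1, "zip": "10001", "city": "New York"},
    {"cust_id": 2, "zip": "10001", "city": "New York"},
    {"cust_id": 3, "zip": "94105", "city": "San Francisco"},
]
print(is_approximate_key(rows, ["cust_id"]))   # True
print(fd_holds(rows, ["zip"], ["city"]))       # True: zip -> city
print(fd_holds(rows, ["city"], ["cust_id"]))   # False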
Samples
A random sample of table records also serves as a summary of the contents of a table. A few sampled rows of a table are surprisingly informative, and can be used to estimate a variety of distributions, e.g., identifying the most frequent values or patterns and their frequencies in a field. Sampling can be used to accelerate the expensive computation of profile data. For example, when computing keys and FDs on a large table (e.g., one with many records), one can sample the table and compute
keys and FDs over the sample. If a key or FD is (approximately) valid over the base table then it is also valid on a sample. But it is possible that a key or FD which is valid on the sample is not valid on the base table. Therefore, a random sample can be used to identify candidate keys and FDs. If exact keys and FDs are needed, candidates can be verified by queries over the actual table. A minhash signature can be computed over sampled data. Suppose that F and G are again keys with identical sets of strings and of size S, and that one computes minhash signatures over F′ and G′, which are sampled from F and G (respectively) at rate p. Then, the resemblance of F′ and G′ is
r′ = |F′ ∩ G′| / (|F′| + |G′| − |F′ ∩ G′|) = p²S / (2pS − p²S) = p / (2 − p) ≈ p/2
While the resemblance decreases linearly with p (and experiences a larger decrease than the sample intersection), minhash signatures have the advantage of being small. A profiling system which collects minhash signatures might use a signature size of, e.g., 250 hashes – small enough that an exhaustive search for matching fields can be performed in real time. Very large tables are sampled to accelerate the computation of the minhash signatures, say at p = 0.1. When comparing two fields which both represent identical key values, there will be about 13 matches on average – enough to provide a reliable signal. In contrast, small uniform random samples of similar sizes drawn from the two fields may not provide accurate estimates of the resemblance. For instance, collecting 250 samples from a table with 1,000,000 rows requires a sampling rate of p = 0.00025, meaning that the intersection of the samples of F and G is very likely to be empty. While random sampling is a common data synopsis used to estimate a wide variety of data properties, its use as a database profile is limited. For one, a random sample cannot always provide an accurate estimation of the number of distinct values in a field, or of the frequency distribution of a field. Table samples are also ineffective for computing the size of the intersection of fields. Suppose that fields F and G are keys and contain the same set of strings, and suppose that they are sampled at rate p. Then, the expected size of the intersection of the samples is p²|F| = p²|G|. One can detect that F and G are identical if the size of the intersected sample is p times the size of the samples. However, if p is small
enough for exhaustive matching (e.g., p = 0.001), then p²|F| is likely to be very small – and therefore an unreliable indicator of a field match.
Implementation Considerations
Profiling a very large database can be a time-consuming and computationally intensive procedure. A given DBMS might have features, such as sampling, grouping sets, stored procedures, user-defined aggregate functions, etc., which can accelerate the computation of various summaries. Many profile statistics are computed by the DBMS for the query optimizer, and might be made available to users. However, database profiles are often used to compare data from different databases. Each of these databases likely belongs to its own administrative domain, which will enable or disable features depending on the DBA’s needs and preferences. Different databases often reside in different DBMSs. Therefore, a profiling tool which enables cross-database comparisons must in general make use of generic DBMS facilities, making use of DBMS-specific features as an optimization only. Modes of Use
The types of activities supported by database profiles can be roughly categorized into database exploration and schema matching. Database exploration means to help a user identify important database properties, whether it is data of interest, data quality problems, or properties that can be exploited to optimize database performance. For example, the user might want to know which are the important tables in a database, and how do they relate (how can they be joined). The number of records in a table is a good first indicator of the importance of a table, and a sample of records is a good first indicator of the kind of data in the table. The most frequent values of a field will often indicate the field’s default values (often there is more than one). Other types of information, e.g., keys and field resemblance, help to identify join paths and intra-table relationships. By collecting a sequence of historical snapshots of database profiles, one can extract information about how the database is updated. A comparison of profiles can indicate which tables are updated, which fields tend to change values, and even reveal changes in database maintenance procedures [3]. For example, in the two large production databases studied in [3], only 20–40% of the tables in the database changed at
all from week to week. Furthermore, most of the tables which ever changed experienced only a small change. Only 13 of the 800 + tables were found to be dynamic. A schema matching activity asks the question, ‘‘do these two instances represent the same thing?’’ – fields, sets of fields, tables, etc. For example, textual summaries are designed to help determine if two fields have the same (or nearly the same) contents. However, any single type of information (schema, textual, distributional) can fail or give misleading results in a large number of cases. The best approach is to use all available information [11]. Key Applications
Data profiling techniques and tools have been developed for database exploration, data quality exploration, database migration, and schema matching. Systems and products include Bellman [4], Ascential [8] and Informatica [9].
Cross-references
▶ Count-Min Sketch ▶ Data Sketch/Synopsis ▶ Hash Functions
Recommended Reading 1. Broder A. On the resemblance and containment of documents. In Proc. IEEE Conf. on Compression and Comparison of Sequences, 1997, pp. 21–29. 2. Dasu T. and Johnson T. Exploratory Data Mining and Data Cleaning. Wiley Interscience, New York, 2003. 3. Dasu T., Johnson T., and Marathe A. Database exploration using database dynamics. IEEE Data Eng. Bull. 29(2):43–59, 2006. 4. Dasu T., Johnson T., Muthukrishnan S., and Shkapenyuk V. Mining database structure; or, how to build a data quality browser. In Proc. ACM SIGMOD Int. Conf. on Management of data, 2002, pp. 240–251. 5. Evoke Software. Data Profiling and Mapping, The Essential First Step in Data Migration and Integration Projects. Available at: http://www.evokesoftware.com/pdf/wtpprDPM.pdf 2000. 6. Gravano L., Ipeirotis P.G., Jagadish H.V., Koudas N., Muthukrishnan S., and Srivastava D. Approximate String Joins in a Database (Almost) for Free. In Proc. 27th Int. Conf. on Very Large Data Bases, 2001, pp. 491–500. 7. Huhtala Y., Karkkainen J., Porkka P., and Toivonen H. TANE: an efficient algorithm for discovering functional and approximate dependencies. Comp. J., 42(2):100–111, 1999. 8. IBM Websphere Information Integration. Available at: http:// ibm.ascential.com 9. Informatica Data Explorer. Available at: http://www.informatica. com/products_services/data_explorer 10. Kang J. and Naughton J.F. On schema matching with opaque column names and data values. In Proc. ACM SIGMOD Int.
Conf. on Management of Data, San Diego, CA, 2003, pp. 205– 216. 11. Shen W., DeRose P., Vu L., Doan A.H., and Ramakrishnan R. Source-aware entity matching: a compositional approach. In Proc. 23rd Int. Conf. on Data Engineering, pp. 196–205.
Data Protection
▶ Data Privacy and Patient Consent ▶ Storage Protection

Data Provenance AMARNATH GUPTA University of California San Diego, La Jolla, CA, USA
Synonyms
Provenance metadata; Data lineage; Data tracking; Data pedigree
Definition
The term "data provenance" refers to a record trail that accounts for the origin of a piece of data (in a database, document or repository) together with an explanation of how and why it got to the present place. Example: In an application like Molecular Biology, a lot of data is derived from public databases, which in turn might be derived from papers but after some transformations (only the most significant data were put in the public database), which are derived from experimental observations. A provenance record will keep this history for each piece of data.
Key Points
Databases today do not have a good way of managing provenance data and the subject is an active research area. One category of provenance research focuses on the case where one database derives some of its data by querying another database, and one may try to "invert" the query to determine which input data elements contribute to this data element. A different approach is to explicitly add annotations to data elements to capture the provenance. A related issue is to keep process provenance, especially in business applications, where instrumented business process capture software is used to track the data generation and transformation life cycle. While keeping a trail of provenance data is beneficial for many applications, storing, managing and searching provenance data introduces an overhead.
Cross-references
▶ Annotation ▶ Provenance
Recommended Reading
1. Bose R. and Frew J. Lineage retrieval for scientific data processing: a survey. ACM Comput. Surv., 37(1):1–28, 2005.
2. Buneman P., Khanna S., Tajima K., and Tan W.-C. Archiving scientific data. In Proc. ACM SIGMOD Conf. on Management of Data, 2002, pp. 1–12.
3. Buneman P., Khanna S., and Tan W.C. On propagation of deletions and annotations through views. In Proc. 21st ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2002, pp. 150–158.
4. Simmhan Y.L., Plale B., and Gannon D. A Survey of Data Provenance Techniques. Technical Report TR618, Department of Computer Science, Indiana University, 2005.
5. Widom J. Trio: A System for Integrated Management of Data, Accuracy, and Lineage. In Proc. 2nd Biennial Conference on Innovative Data Systems Research, 2005, pp. 262–276.

Data Quality
▶ Information Quality and Decision Making ▶ Information Quality Policy and Strategy

Data Quality Assessment CARLO BATINI University of Milano – Bicocca, Milan, Italy
Synonyms
Data quality measurement; Data quality benchmarking
Definition The goal of the assessment activity in the area of data quality methodologies is to provide a precise evaluation and diagnosis of the state of databases and data flows of an information system with regard to data quality issues. In the assessment the evaluation is performed measuring the quality of data collections along relevant quality dimensions. The term (data quality) measurement is used to address the issue of measuring the value of a set of data quality dimensions. The term (data quality) assessment
is used when such measurements are analyzed in order to enable a diagnosis of the quality of the data collection. The term (data quality) benchmarking is used when the output of the assessment is compared against reference indices, representing average values or best practices values in similar organizations. The term (data quality) readiness aims at assessing the overall predisposition of the organization in accepting and taking advantages of data quality improvement programs. The assessment activity may concern: (i) the schema of the data base (the intension), (ii) the values of data (the extension), and (iii) the costs of poor data quality to the organization. Therefore, the principal outputs of assessment methodologies are: (i) measurements of the quality of databases and data flows, both schemas and values, (ii) costs to the organization due to the present low data quality, and (iii) a comparison with data quality levels considered acceptable from experience, or else a benchmarking with best practices, together with suggestions for improvements.
Historical Background Ever since computer applications have been used to automate more and more business and administrative activities, it has become clear that available data often result from inaccurate observations, imputation, and elaborations, resulting in data quality problems. More importantly, in the last decades, information systems have been migrating from a hierarchical/monolithic to a network-based structure; therefore, the potential sources that organizations can use for the purpose of their businesses are dramatically increased in size and scope. Data quality problems have been further worsened by this evolution, since the external sources are created and updated at different times and by different organizations or persons and are characterized by various degrees of trustworthiness and accuracy, frequently unknown a priori. As a consequence, the overall quality of the data that flow between information systems may rapidly degrade over time if both processes and their inputs are not themselves subject to quality assessment.
Foundations The typical steps to assess data quality are: 1. Data analysis, which examines data schemas and performs interviews to reach a complete understanding of data and related architecture and management rules.
2. Requirements analysis, which surveys the opinion of data users and administrators to identify quality issues and set new quality targets. 3. Identification of critical areas, which selects the most relevant databases and data flows to be assessed quantitatively. 4. Process modeling, which provides a model of the processes producing or updating data. 5. Measurement of quality, which selects relevant quality dimensions, defines corresponding metrics and performs the actual measurement. The usual process followed in the measurement of quality step has two main activities: qualitative assessment, based on subjective judgments of experts, and objective assessment, based on measures of data quality dimensions. Qualitative assessment is performed through questionnaires and interviews with stakeholders and with internal and external users, with the goal of understanding the consequences and impact of poor data quality on the work of internal users and on products and services provided to information consumers, and the extent the needs of external users and customers are currently satisfied. Quantitative assessment is based on the selection of quality dimensions and their measurement through metrics. Over 50 quality dimensions have been proposed in the literature (see the respective entry and [2] for a thorough description of dimensions and proposed classifications). The most frequently mentioned concern the values of data, and are accuracy, completeness, currency/timeliness, and inconsistency. Examples of methodologies for the choice of dimensions and measures and for the objective vs. subjective evaluation are given in [3,7–9]. With regard to dimension classification, dimensions are classified in [7] into sound, useful, dependable, and usable, according to their positioning in quadrants related to ‘‘product quality/service quality’’ and ‘‘conforms to specifications/meets or exceeds consumer expectations’’ coordinates. The goal of the classification is to provide a context for each individual quality dimension and metric, and for consequent evaluation. In the following the five phases of the methodology proposed in [3] are described in more detail (see Fig. 1). Phase 1, attribute selection, concerns the identification, description, and classification of the main data attributes to be assessed. Then, they are characterized
Data Quality Assessment. Figure 1. The main phases of the assessment methodology described in [2].
according to their meaning and role. The possible characterizations are qualitative/categorical, quantitative/numerical, and date/time. In Phase 2, analysis, data quality dimensions and integrity constraints to be measured are identified. Statistical techniques are used for the inspection of data. Selection and inspection of dimensions is related to process analysis, and has the final goal of discovering the main causes of erroneous data, such as unstructured and uncontrolled data loading and data updating processes. The result of the analysis on selected dimensions leads to a report with the identification of the errors. In Phase 3, objective/quantitative assessment, appropriate indices are defined for the evaluation and quantification of the global data quality level. The number of low quality data items for the different dimensions and the different data attributes is first evaluated with statistical and/or empirical methods, and, subsequently, normalized and summarized. Phase 4 deals with subjective/qualitative assessment. The qualitative assessment is obtained by merging independent evaluations from (i) business experts, who analyze data from a business process point of view, (ii) final users (e.g., for financial data, a trader), and (iii) data quality experts, who have the role of analyzing data and examining its quality. Finally, in the comparison phase objective and subjective assessments are compared. For each attribute and quality dimension, the distance between the percentages of erroneous observations obtained from
quantitative analysis, mapped in a discrete domain, and the quality level defined by the judgment of the evaluations is calculated. Discrepancies are analyzed by the data quality experts, to further detect causes of errors and to find alternative solutions to correct them. The above mentioned methodologies, although not explicitly, refer to the assessment of structured data, namely, data represented in terms of typed files or relational tables and databases. Recently, the attention in data quality assessment has moved towards semi-structured and un-structured data. Assessment methodologies for evaluating specific qualities of web sites are proposed in [1,6]. Atzeni et al. [1] is specifically focused on accessibility, evaluated on the basis of a mixed quantitative/qualitative assessment. The quantitative assessment activity checks the guidelines provided by the World Wide Web Consortium in (W3C. http:// www.w3.org/WAI/). The qualitative assessment is based on experiments performed with disabled users. Fraternali et al. [6] focuses on the usability of the site and proposes an approach based on the adoption of conceptual logs, which are web usage logs enriched with meta-data derived from the application of conceptual specifications expressed by the conceptual schema of the web site. Since data and information are often the most relevant resource consumed in administrative and business processes, several authors consider the evaluation of costs of poor data quality as part of the data quality assessment problem. Figure 2 shows the classification proposed in [4], for which comments follow:
to collect and maintain data in another database, (ii) business rework costs, due to re-performing failed processes, such as resending correspondence, (iii) data verification costs, e.g., when data users do not trust the data, they perform their own quality inspection. Loss and missed opportunity costs correspond to the revenues and products not realized because of poor information quality. For example, due to low accuracy of customer e-mail addresses, a percentage of customers already acquired cannot be reached in periodic advertising campaigns, resulting in lower revenues, roughly proportional to the decrease of accuracy in addresses. Data quality assessment has been investigated also under a managerial perspective. Following the results of the assessment, a managerial activity might be the analysis of the main barriers in the organization to the quality management perspective in terms of resistance to change processes, control establishment, information sharing, and quality certification.
Key Applications Quality assessment is used in a large set of business and administrative activities, such as organization assessment, strategic planning, supply chain, marketing, selling, demographic studies, health experiments, management of health files, census applications, epidemiological analyses. The perception of the importance of quality assessment is increasing in the area of risk management, such as operational risk management related to the Basel II norms. Data Quality Assessment. Figure 2. A comprehensive classification of costs of poor data quality [4].
Future Directions
Process failure costs result when poor quality information causes a process to not perform properly. As an example, inaccurate mailing addresses cause correspondence to be misdelivered. Information scrap and rework costs occur every time data of poor quality requires several types of defect management activities, such as reworking, cleaning, or rejecting. Examples of this category are (i) redundant data handling, if the poor quality of a source makes it useless, time and money has to be spent
Open areas of research in data quality assessment concern quality dimensions and the relationship between data quality assessment and process quality assessment. The first area concerns assessment of a wider set of dimensions, such as performance, availability, security, with concern also to risk management, and investigation on dependencies among dimensions. For example, a dependency among currency and accuracy is the rule ‘‘70% of all outdated data is also inaccurate.’’ Knowledge about dependencies can greatly aid in finding causes of low data quality, and in conceiving improvement activities. The relationship between data quality and process quality is a wide area of investigation, due to the
relevance and diversity of characteristics of business processes in organizations. The different impacts of data quality at the three typical organizational levels, namely operations, the tactical level, and the strategic level, are analyzed in [10] reporting interviews and the outcomes of several proprietary studies. Data quality and its relationship with the quality of services, products, business operations, and consumer behavior is investigated in very general terms in [9,11]. The symmetric problem of investigating how to improve information production processes positively influences data quality is analyzed in [10].
Cross-references
▶ Data Quality Dimensions ▶ Design for Data Quality ▶ Information Quality Assessment ▶ Information Quality Policy and Strategy ▶ Quality of Data Warehouses
Recommended Reading 1. Atzeni P., Merialdo P., and Sindoni G. Web site evaluation: methodology and case study. In Proc. Int. Workshop on Data Semantics in Web Information Systems, 2001. 2. Batini C. and Scannapieco M. Data Quality: Concepts, Methodologies and Techniques. Springer, 2006. 3. De Amicis F. and Batini C. A methodology for data quality assessment on financial data. Stud. Commn. Sci., 4(2):115–136, 2004. 4. English L.P. Improving Data Warehouse and Business Information Quality. Wiley, 1999. 5. English L.P. Process management and information quality: how improving information production processes improves information (product) quality. In Proc. 7th Int. Conf. on Information Quality, 2002, pp. 206–209. 6. Fraternali P., Lanzi P.L., Matera M., and Maurino A. Modeldriven web usage analysis for the evaluation of web application quality. J. Web Eng., 3(2):124–152, 2004. 7. Kahn B., Strong D.M., and Wang R.Y. Information quality benchmarks: product and service performance. Commun. ACM, 45(4):184–192, 2002. 8. Lee Y.W., Strong D.M., Kahn B.K., and Wang R.Y. AIMQ: a methodology for information quality assessment. Inf. Manag., 40(2):133–146, 2001. 9. Pipino L., Lee Y.W., and Wang R.Y. Data quality assessment. Commun. ACM, 45(4):211–218, 2002. 10. Redman T.C. The impact of poor data quality on the typical enterprise. Commun. ACM, 41(2):70–82, 1998. 11. Sheng Y.H. Exploring the mediating and moderating effects of information quality on firms? Endeavor on information systems. In Proc. 8th Int. Conf. on Information Quality, 2003, pp. 344–353.
Data Quality Attributes ▶ Data Quality Dimensions
Data Quality Benchmarking ▶ Data Quality Assessment
Data Quality Criteria ▶ Data Quality Dimensions
Data Quality Dimensions KAI-UWE SATTLER Technical University of Ilmenau, Ilmenau, Germany
Synonyms Data quality criteria; Data quality attributes; Data quality measurement
Definition Data quality (DQ) is usually understood as a multidimensional concept. The dimensions represent the views, criteria, or measurement attributes for data quality problems that can be assessed, interpreted, and possibly improved individually. By assigning scores to these dimensions, the overall data quality can be determined as an aggregated value of individual dimensions relevant in the given application context.
Historical Background Since the mid-1990s data quality issues have been addressed by systematic research studies. In this context, relevant dimensions of data quality have also been investigated. One of the first empirical studies, by Wang and Strong [6], identified 15 relevant dimensions out of 179 gathered criteria. This list was later supplemented by other researchers. Initially, divergent definitions of the same dimensions were proposed, mostly due to different views, e.g., management perspectives versus data-oriented perspectives as well as application-specific views. In addition, several classifications for data quality problems and criteria were proposed.
To date there is still a different understanding of several dimensions, depending on the application scenario and its requirements. However, there exists a set of agreed upon dimensions that are relevant in most domains.
Foundations The selection of dimensions relevant in a given scenario is mostly application-dependent. In addition, many dimensions are not independent and, therefore, should not be used together. However, because quality dimensions characterize potential data quality problems, they can be classified according to some important characteristics. In the following, some representative classifications are introduced, followed by a discussion of the most important dimensions:
Classifications
A first approach for classifying DQ dimensions, proposed by Redman [5], is based on DQ problems or conflicts by considering the different levels where they can occur: The intensional level comprises criteria concerning the content of the conceptual schema: relevance, clarity of definition, the scope, the level of detail (e.g., granularity of attributes, the precision of the attribute domains) as well as consistency and flexibility. The extensional level considers the data values, comprising criteria such as accuracy and correctness of values, timeliness, and completeness of data. The level of data representation addresses problems related to the data format, e.g., interpretability, portability, adequateness. In contrast to this data-oriented approach, the classification introduced by Naumann [4] is more comprehensive. Dimensions are classified into four sets:
1. Content-related dimensions consider the actual data and therefore data-intrinsic properties such as accuracy, completeness, and relevance.
2. Technical dimensions address aspects of the hard- and software used for maintaining the data. Examples are availability, latency, response time, but also price.
3. Intellectual dimensions represent subjective aspects, such as trustworthiness or reputation.
4. Instantiation-related dimensions concern the presentation of data, e.g., the amount of data, understandability, and verifiability.
An alternative way of classifying DQ dimensions is to look at the process of data evolution by analogy of data
with products. In [3] an approach is presented promoting hierarchical views on data quality following the steps of the data life cycle: collection, organization, presentation, and application. Based on an analysis of possible root causes for poor quality relevant dimensions can be identified and assigned to the different DQ views: Collection quality refers to problems during data capturing, such as observation biases or measurement errors. The relevant dimensions are, among others, accuracy, completeness, and trustworthiness of the collector. Organization quality deals with problems of data preparation and manipulation for storing it in a database. It comprises dimensions such as consistency, storage, and retrieval efficiency. Furthermore, collection quality is also a component of organization quality. Presentation quality addresses problems during processing, re-interpretation, and presentation of data. Dimensions are for example interpretability, formality as well as the organization quality component. Application quality concerns technical and social constraints preventing an efficient utilization of data and comprises dimensions like timeliness, privacy, and relevance in addition to the presentation quality component. Among all these dimensions the most important ones in many application scenarios are completeness, accuracy, consistency, and timeliness that are now described in detail. Completeness
Missing or incomplete data is one of the most important data quality problems in many applications. However, there are different meanings of completeness. An obvious and often used definition is the absence of null values or, more exactly, the ratio of non-null values to the total number of values. This measure can be easily assessed. Given a relation R(A1,...,An), let N_Ai denote the set of all non-null values in Ai:
N_Ai = { t ∈ R | NotNull(t.Ai) }
Completeness QC(Ai) can now be defined as:
QC(Ai) = |N_Ai| / |R|
This can also be extended to take tuples into account instead of single values by determining the number of tuples containing no null values:
QC(R) = |N_{A1,...,An}| / |R|
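A small sketch of the two completeness measures just defined, for tuples represented as Python dictionaries in which None plays the role of the null value; the example relation and attribute names are invented.

def completeness_attribute(relation, attr):
    # |N_A| / |R|: fraction of tuples with a non-null value in attr.
    non_null = sum(1 for t in relation if t.get(attr) is not None)
    return non_null / len(relation)

def completeness_tuples(relation, attrs):
    # Fraction of tuples with no null value in any of the given attributes.
    full = sum(1 for t in relation if all(t.get(a) is not None for a in attrs))
    return full / len(relation)

customers = [
    {"id": 1, "email": "a@example.org", "phone": None},
    {"id": 2, "email": None, "phone": "555-0100"},
    {"id": 3, "email": "c@example.org", "phone": "555-0101"},
]
print(completeness_attribute(customers, "email"))                # 0.67
print(completeness_tuples(customers, ["id", "email", "phone"]))  # 0.33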
Note that null can have different meanings which have to be treated in a special way: it could represent a missing value or simply a not-applicable case, e.g., a customer without a special delivery address. Sometimes, not all attributes are of equal importance, e.g., whereas a customer identifier is always required, the customer's email address is optional. In this case, weights can be assigned to the individual attributes, or rules of the form "if A1 is not available (null) then A2 is important, otherwise not" are used. This notion of completeness concerns only the data inside the database. An alternative definition for completeness is the portion of real-world objects stored in the database. It addresses the case that, for instance, not all customers are represented in the database and therefore the data is incomplete. This is also known as coverage. However, assessing this completeness is often much more difficult because it requires either additional metadata (e.g., it is known that the DBLP digital library contains only computer science literature) or a (manual) checking with the real world, possibly supported by sampling. Besides these extensional views, completeness can also be interpreted from an intensional point of view. Here, completeness (or density) is defined as the number of attributes represented in the database compared to the required real-world properties. Again, assessing this kind of completeness requires manual inspection. Improvement of completeness is generally achieved by choosing better or additional data sources. In some cases, null values can be replaced with the help of dictionaries or reference sources (e.g., an address database). Depending on the usage of the data, missing numeric values can sometimes also be imputed based on knowledge about data characteristics, such as value distribution and variance.
Accuracy
A second data quality problem is often caused by measurement errors, observation biases or simply improper representation. Accuracy can be defined as the extent to which data are correct, reliable, and certified
free of error. Note that the meaning of correctness is application-dependent: it can specify the distance to the actual real-world value or just the optimal degree of detail of an attribute value. Assuming a table representing sales volumes for products, a value of $10,000 could be interpreted as inaccurate if the actual value, e.g., obtained in a different way, is $10,500. However, if the user is interested only in some sales categories (low: ≤ 20K, high: > 20K), the value is accurate. In order to assess accuracy for a given value v, the real-world value v* or at least a reference value is needed. Then, the distance can be easily computed for numeric values as |v − v*| or – for textual values – as the syntactic distance using the edit distance measure. However, particularly for textual attributes, sometimes the semantic distance has to be considered, e.g., the strings "Munich" and "München" are syntactically different but represent the same city. Solving this problem typically requires a dictionary or ontology. Based on the distance of single attribute values, the accuracy of tuples or the whole relation can be computed as shown above for completeness, for example by determining the fraction of tuples with only correct values. An improvement of accuracy is often possible only by removing inexact values or preferably by applying data cleaning techniques.
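A sketch of a syntactic accuracy score based on the edit distance, normalizing the distance by the length of the longer string so that 1.0 means an exact match with the reference value; the normalization choice is an assumption and not part of the entry.

def edit_distance(s, t):
    # Classic Levenshtein distance via single-row dynamic programming.
    m, n = len(s), len(t)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1, d[j - 1] + 1, prev + (s[i - 1] != t[j - 1]))
            prev = cur
    return d[n]

def accuracy_score(value, reference):
    # Map the distance to [0, 1]; 1.0 means the value equals the reference.
    if not value and not reference:
        return 1.0
    return 1.0 - edit_distance(value, reference) / max(len(value), len(reference))

print(accuracy_score("Mnich", "Munich"))     # one missing letter, about 0.83
print(accuracy_score("Munich", "München"))   # syntactically different despite same city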
Consistency
Though modern database systems provide advanced support for ensuring integrity and consistency of data, there are many reasons why inconsistency is a further important data quality problem. Thus, consistency as a DQ dimension is defined as the degree to which data managed in a system satisfy specified constraints or business rules. Such rules can be classic database integrity constraints, such as uniqueness of customer identifiers or referential integrity (e.g., "for each order, a customer record must exist"), or more advanced business rules describing relationships between attributes (for instance "age = current-date − date-of-birth" or "a driver license number can only exist if age ≥ 16"). These rules have to be specified by the user or can be derived automatically from training data by applying rule induction. Using a set B of such rules, the set of tuples from a relation R satisfying these rules can be determined:
W_B = { t ∈ R | Satisfies(t, B) }
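The consistency measure can be sketched by representing each business rule as a predicate over a tuple; the two rules, the fixed reference year and the example tuples below are invented for illustration.

CURRENT_YEAR = 2008  # fixed reference year so the example is deterministic

rules = [
    lambda t: t["age"] == CURRENT_YEAR - t["birth_year"],
    lambda t: t["driver_license"] is None or t["age"] >= 16,
]

def consistency(relation, rules):
    # |W_B| / |R|: fraction of tuples satisfying every rule in B.
    satisfied = sum(1 for t in relation if all(rule(t) for rule in rules))
    return satisfied / len(relation)

people = [
    {"age": 30, "birth_year": 1978, "driver_license": "X123"},
    {"age": 14, "birth_year": 1994, "driver_license": "Y456"},  # violates the license rule
    {"age": 40, "birth_year": 1960, "driver_license": None},    # violates the age rule
]
print(consistency(people, rules))  # 0.33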
Then, the consistency measure for relation R can be computed as the fraction of tuples in WB as shown above. As for accuracy, an improvement of consistency can be achieved by removing or replacing inconsistent data. Timeliness
Another reason for poor quality of data is outdated data. This problem is captured by the dimension timeliness, describing the degree to which the provided data is up-to-date. Depending on the application, this is not always the same as the ordinary age (the time between creation of the data and now). For instance, in a stock information system, stock quotes data from 2000 are outdated if the user is interested in the current quotes. But if he asks for stock quotes from the time of the dot-com bubble, they would still be sufficient. Therefore, both the age age(v) of a value v (the time between observation and now) and the update frequency fu(v) (updates per time unit) have to be considered, where fu(v) = 0 means the value is never updated. Using this information, the timeliness QT(v) of v can be computed as:
QT(v) = 1 / (fu(v) · age(v) + 1)
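A direct transcription of the timeliness formula, assuming that age and update frequency are expressed in the same time unit (the entry does not fix a particular unit):

def timeliness(age, update_frequency):
    # QT(v) = 1 / (fu(v) * age(v) + 1)
    return 1.0 / (update_frequency * age + 1.0)

print(timeliness(age=2.0, update_frequency=0.0))  # never updated: 1.0 regardless of age
print(timeliness(age=2.0, update_frequency=0.5))  # 1 / (1 + 1) = 0.5
print(timeliness(age=2.0, update_frequency=5.0))  # frequently updated data ages fast, about 0.09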
This takes into account that an object with a higher update frequency ages faster and that objects that are never updated have the same timeliness. Further Dimensions
Further Dimensions
Finally, the following further dimensions are also important for many applications. Relevance, also known from information retrieval, is the degree to which the provided information satisfies the user's need. The problem of relevance occurs mainly if keyword-based search is used for querying data or documents; in database systems using exact queries, relevance is inherently high. Response time measures the delay between the submission of a request (e.g., a query) and the arrival of the complete response. Though a technical criterion, response time is particularly important for users, because they usually do not want to wait more than a couple of seconds for an answer. Related to response time is latency, defined as the delay until the arrival of the first result data. Often, a small latency compensates for a larger response time in user satisfaction. Believability, trustworthiness, and reputation are dimensions which often depend on each other. Believability and trustworthiness can be understood as the degree to which data is accepted by the user as correct, accurate, or complete. In contrast, reputation describes the degree to which data (or a data source) has a good standing among users. Reputation is based on the memory and summary of behavior from past transactions, whereas believability is more a subjective expectation.
Key Applications
DQ dimensions are primarily used for quality assessment. They define the criteria under which data quality is measured and for which quality scores can be derived. A further application is data quality models for explicitly representing data quality scores, which can be used for annotating the data.
Cross-references
▶ Data Conflicts ▶ Data Quality Assessment ▶ Data Quality Models
Recommended Reading
1. Batini C. and Scannapieco M. Data Quality – Concepts, Methodologies and Techniques. Springer, 2006.
2. Gertz M., Özsu M.T., Saake G., and Sattler K. Report on the Dagstuhl Seminar: data quality on the Web. ACM SIGMOD Rec., 33(1):127–132, 2004.
3. Liu L. and Chi L. Evolutional data quality: a theory-specific view. In Proc. 7th Int. Conf. on Information Quality, 2002, pp. 292–304.
4. Naumann F. Quality-Driven Query Answering for Integrated Information Systems. LNCS 2261, Springer, Berlin, 2002.
5. Redman T. Data Quality for the Information Age. Artech House, Norwood, MA, USA, 1996.
6. Wang R. and Strong D. Beyond Accuracy: What Data Quality Means to Data Consumers. J. Inf. Syst., 12(4):5–34, 1996.
7. Wang R., Ziad M., and Lee Y. Data Quality. Kluwer, Boston, MA, USA, 2001.
Data Quality Measurement ▶ Data Quality Dimensions ▶ Data Quality Assessment
Data Quality Models
MONICA SCANNAPIECO, University of Rome, Rome, Italy
Synonyms Data quality representations
Definition Data quality models extend traditional models for databases for the purpose of representing data quality dimensions and the association of such dimensions to data. Therefore, data quality models allow analysis of a set of data quality requirements and their representation in terms of a conceptual schema, as well as accessing and querying data quality dimensions by means of a logical schema. Data quality models also include process models tailored to analysis and design of quality improvement actions. These models permit tracking data from their source, through various manipulations that data can undergo, to their final usage. In this way, they support the detection of causes of poor data quality and the design of improvement actions.
Historical Background
Among the first data quality models, in 1990 the polygen model [5] was proposed for explicitly tracing the origins of data and the intermediate sources used to arrive at that data. The model is targeted at heterogeneous distributed systems and is a first attempt to represent and analyze the provenance of data, which has recently been investigated in a more general context. In the mid-1990s, there was a first proposal to extend the relational model with quality values associated with each attribute value, resulting in the quality attribute model [6]. An extension of the Entity Relationship model was also proposed in the same years ([4], and [7], Chapter 3), similarly focused on associating quality dimensions, such as accuracy or completeness, with attributes. More recently, models for associating quality values with data-oriented XML documents have been investigated (e.g., [2]). Such models are intended to be used in the context of distributed and cooperative systems, in which the cooperating organizations need to exchange data with each other, and it is therefore critical for them to be aware of the quality of such data. These models are semi-structured, thus allowing each organization to export the quality of its data with a certain degree of flexibility; quality dimensions can be associated with various elements of the data model, ranging from a single data value to the whole data source, in this way differing from the previous attribute-based models. The principal data quality models that are oriented towards process representation are based on the principle that data can be seen as a particular product of a manufacturing activity, so that descriptive models (and methodologies) for data quality can be based on models conceived over the last two centuries for manufacturing traditional products. The Information Product Map (IP-MAP) [3] is a significant example of such models and follows this view, being centered on the concept of an information product. The IP-MAP model has been extended in several directions (see [1], Chap. 3). Indeed, more powerful mechanisms have been included, such as event process chain diagrams representing the business process overview, the interaction model (how company units interact), the organization model (who does what), the component model (what happens), and the data model (what data is needed). A further extension, called IP-UML, consists of a UML profile for data quality based on IP-MAP.
Foundations
Data quality models can be distinguished into data-oriented models, focused on the representation of data quality dimensions, and process-oriented models, focused on the representation of the processes that manipulate data and of their impact on data quality. Data-oriented models include extensions of the Entity Relationship model, of the relational model, and of the XML data model. When extending the Entity Relationship model for representing data quality, one possibility is to introduce two types of entities, explicitly defined to express quality dimensions and their values: a data quality dimension entity and a data quality measure entity. The goal of the data quality dimension entity is to represent possible pairs of dimensions and corresponding ratings resulting from measurements. The data quality dimension entity characterizes the quality of an attribute, and the rating scale may obviously depend on the attribute. In these cases, it is necessary to extend the properties of the data quality dimension entity to include the attribute itself, i.e., to form a triple of dimension, rating, and attribute.
In order to represent metrics for dimensions, and their relationship with entities, attributes, and dimensions, the model introduces the data quality measure entity; its attributes are Rating, whose values depend on the specific dimension modeled, and DescriptionOfRating. The complete data quality schema, shown by means of the example in Fig. 1, is made up of:
– The original data schema, in the example represented by the entity Class with the attribute Attendance.
– The DQ Dimension entity with a pair of attributes (the dimension name and its rating).
– The relationship between the entity Class, the related attribute Attendance, and the DQ Dimension entity, expressed by a many-to-many relationship ClassAttendanceHas; a distinct relationship has to be introduced for each attribute of the entity Class.
Data Quality Models. Figure 1. An extension of the Entity Relationship model.
– The relationship between the previous structure and the DQ Measure entity, expressed by a new representation structure that extends the Entity Relationship model and relates entities and relationships.
An extension of the relational data model is provided by the quality attribute model, explained in the following by means of the example shown in Fig. 2. The figure shows a relational schema Employee, defined on the attributes EmployeeId, Address, DateofBirth, and others, and one of its tuples. Relational schemas are extended by adding an arbitrary number of underlying levels of quality indicators (only one level in the figure) to the attributes of the schema, to which they are linked through a quality key. In the example, the attribute EmployeeId is extended with one quality attribute, namely accuracy; the attribute Address with two quality attributes, namely accuracy and currency; while the attribute DateofBirth is extended with accuracy and completeness.
Data Quality Models. Figure 2. An extension of the relational model.
The values of such quality attributes measure the quality dimensions' values associated with the whole relation instance (top part of the figure). Therefore, a completeness equal to 0.8 for the attribute DateofBirth means that 80% of the tuples have a non-null value for that attribute. Similar structures are used for the instances of quality indicator relations (bottom part of the figure); if there are n attributes in the relational schema, n quality tuples will be associated with each tuple in the instance. The model called Data and Data Quality (D2Q) is among the first models for associating quality values with data-oriented XML documents. D2Q can be used to certify dimensions like accuracy, consistency, completeness, and currency of data. The model is semi-structured, thus allowing each organization to export the quality of its data with a certain degree of flexibility. More specifically, quality dimension values can be associated with various elements of the data model, ranging from a single data value to the whole data source. The main features of the D2Q model are summarized as follows:
– A data class and a data schema are introduced to represent the business data portion of the D2Q model.
– A quality class and a quality schema correspond to the quality portion of the D2Q model.
– A quality association function that relates nodes of the graph corresponding to the data schema to
nodes of the graph corresponding to the quality schema. Quality associations represent biunivocal functions among all nodes of a data schema and all non-leaf nodes of a quality schema. In Fig. 3, an example of a D2Q schema is depicted. On the left-hand side of the figure, a data schema is shown representing enterprises and their owners. On the right-hand side, the associated quality schema is represented. Specifically, two quality classes, Enterprise_Quality and Owner_Quality are associated with the Enterprise and Owner data classes. Accuracy nodes are shown for both data classes and related properties. For instance, Code_accuracy is an accuracy node (of type t-accuracy) associated with the Code property, while Enterprise_accuracy is an accuracy node associated with the data class Enterprise. The arcs connecting the data schema and the quality schema with the quality labels represent the quality association functions. The D2Q model is intended to be easily translated into the XML data model. This is important for meeting the interoperability requirements that are particularly stringent in cooperative systems. Process-oriented models have their principal representative in the Information Product Map (IP-MAP) model. An information product map is a graphical model designed to help people comprehend, evaluate, and describe how an information product, such as an invoice, a customer order, or a prescription, is
assembled in a business process. IP-MAPs are designed to help analysts visualize the information production process, identify ownership of process phases, understand information and organizational boundaries, and estimate time and quality metrics associated with the current production process. There are eight types of construct blocks that can be used to form the IP-MAP: source, customer, data quality, processing, data storage, decision, business boundary, and information systems boundary. An example of an information product map is shown in Fig. 4. Information products (IP in the figure) are produced by means of processing activities and data quality checks on raw data (RD) and semi-processed information called component data (CD).
Data Quality Models. Figure 3. An example of a D2Q schema.
Data Quality Models. Figure 4. An example of an IP-MAP.
In the example, it is assumed that high schools and universities of a district have decided to cooperate in order to improve their course offerings to students, avoiding overlap and being more effective in the education value chain. To this end, they have to share historical data on students and their curricula. Therefore, they perform a record linkage activity that matches students in their education life cycle (“Perform Record Linkage” block). To reach this objective, high schools periodically supply relevant information on students; if it is in paper format, the information has to be converted into electronic format. At this point, invalid data are filtered out and matched against the database of university students. Unmatched students are sent back to high schools for clerical checks, and matched students are analyzed. The results of the analysis of curricula and course topics are sent to the advisory panel of the universities.
Key Applications
Data-oriented and process-oriented data quality models can be used to represent quality dimensions and quality-related activities, thus supporting techniques for data quality improvement. However, such techniques seldom rely on the described model extensions, with the notable exception of the IP-MAP model. Indeed, only a few prototype DBMSs have adopted some of the approaches mentioned. This is mainly due to the complexity of the representational structures proposed in the different approaches. Measuring data quality is not an easy task, hence models that require associating data quality dimension values at the attribute level have proven not very useful in practice. Greater flexibility is more useful in real applications, for instance in e-Government or e-Commerce scenarios. In these scenarios, which involve cooperation between different organizations, a more successful case is provided by XML data exchanged with associated quality profiles, which are based on semi-structured data quality models.
Future Directions The future of research on models appears to be in provenance and semi-structured data quality models. In open information systems and in peer-to-peer ones, knowing the provenance and having a flexible tool to associate quality to data is crucial. Indeed, such systems have to be able to trace the history of data and to certify the level of quality of the retrieved data.
Cross-references
▶ Data Quality Dimensions ▶ Entity Relationship Model ▶ Information Product Management ▶ Provenance ▶ Relational Model ▶ Semi-Structured Data ▶ XML
Recommended Reading
1. Batini C. and Scannapieco M. Data Quality: Concepts, Methodologies, and Techniques. Springer, Berlin, 2006.
2. Scannapieco M., Virgillito A., Marchetti C., Mecella M., and Baldoni R. The DaQuinCIS architecture: a platform for exchanging and improving data quality in cooperative information systems. Inf. Syst., 29(7):551–582, 2004.
3. Shankaranarayan G., Wang R.Y., and Ziad M. Modeling the manufacture of an information product with IP-MAP. In Proc. 5th Int. Conf. on Information Quality, 2000, pp. 1–16.
4. Storey V.C. and Wang R.Y. An analysis of quality requirements in database design. In Proc. 4th Int. Conf. on Information Quality, 1998, pp. 64–87.
5. Wang R.Y. and Madnick S.E. A polygen model for heterogeneous database systems: the source tagging perspective. In Proc. 16th Int. Conf. on Very Large Data Bases, 1990, pp. 519–538.
6. Wang R.Y., Reddy M.P., and Kon H. Toward data quality: an attribute-based approach. Decision Support Syst., 13(3–4):349–372, 1995.
7. Wang R.Y., Ziad M., and Lee Y.W. Data Quality. Kluwer, Boston, MA, USA, 2001.
Data Quality Problems ▶ Data Conflicts
Data Quality Representations ▶ Data Quality Models
Data Rank/Swapping
JOSEP DOMINGO-FERRER, Universitat Rovira i Virgili, Tarragona, Catalonia
Synonyms Data swapping; Rank swapping
Definition
Data swapping was originally designed by Dalenius and Reiss [1] as a masking method for statistical disclosure control of databases containing only categorical attributes. The basic idea behind the method is to transform a database by exchanging values of confidential attributes among individual records. Records are exchanged in such a way that low-order frequency counts or marginals are maintained. Rank swapping is a variant of data swapping [2,3]. First, the values of an attribute X_i are ranked in ascending order; then each ranked value of X_i is swapped with another ranked value randomly chosen within a restricted range (e.g., the ranks of two swapped values cannot differ by more than p% of the total number of records, where p is an input parameter). This algorithm is applied independently to each attribute of the original data set.
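A possible sketch of rank swapping for a single numeric attribute, assuming p is given as a percentage of the number of records; the exact pairing strategy within the restricted range is one of several reasonable choices and is not taken from the cited algorithms. As the entry notes, the procedure would be applied independently to each attribute.

```python
import random

def rank_swap(values, p, seed=None):
    """Rank swapping of one attribute: each value may be swapped with a partner
    whose rank differs by at most p% of the number of records."""
    rng = random.Random(seed)
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])   # rank -> original position
    window = max(1, int(n * p / 100))
    swapped = list(values)
    used = set()
    for r in range(n):
        if r in used:
            continue
        partners = [s for s in range(r + 1, min(n, r + window + 1)) if s not in used]
        if not partners:
            continue
        s = rng.choice(partners)                        # partner within the rank window
        i, j = order[r], order[s]
        swapped[i], swapped[j] = swapped[j], swapped[i]
        used.update({r, s})
    return swapped
```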
Key Points
It is reasonable to expect that multivariate statistics computed from data swapped with this algorithm will be less distorted than those computed after an unconstrained swap. In empirical work on SDC scores, rank swapping with a small swapping range has been identified as a particularly well-performing method in terms of the trade-off between disclosure risk and information loss. Consequently, it is one of the techniques that have been implemented in the μ-Argus package [3].
Cross-references
▶ Data Rank/Swapping ▶ Disclosure Risk ▶ Inference Control in Statistical Databases ▶ Information Loss Measures ▶ K-anonymity ▶ Microaggregation ▶ Microdata ▶ Microdata rounding ▶ Noise Addition ▶ Non-perturbative masking methods ▶ Pram ▶ Record Linkage ▶ Synthetic Microdata ▶ SDC Score ▶ Tabular Data
Recommended Reading
1. Dalenius T. and Reiss S.P. Data-swapping: a technique for disclosure control (extended abstract). In Proc. ASA Section on Survey Research Methods. American Statistical Association, Washington DC, 1978, pp. 191–194.
2. Domingo-Ferrer J. and Torra V. A quantitative comparison of disclosure control methods for microdata. In Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies. P. Doyle, J.I. Lane, J.J.M. Theeuwes, and L. Zayatz (eds.). North-Holland, Amsterdam, 2001, pp. 111–134.
3. Hundepool A., Van de Wetering A., Ramaswamy R., Franconi F., Polettini S., Capobianchi A., De Wolf P.-P., Domingo-Ferrer J., Torra V., Brand R., and Giessing S. μ-Argus User's Manual version 4.1, February 2007.
Data Reconciliation ▶ Constraint-Driven Database Repair
Data Reduction
RUI ZHANG, University of Melbourne, Melbourne, VIC, Australia
Definition Data reduction means the reduction on certain aspects of data, typically the volume of data. The reduction can also be on other aspects such as the dimensionality of data when the data is multidimensional. Reduction on any aspect of data usually implies reduction on the volume of data. Data reduction does not make sense by itself unless it is associated with a certain purpose. The purpose in turn dictates the requirements for the corresponding data reduction techniques. A naive purpose for data reduction is to reduce the storage space. This requires a technique to compress the data into a more compact format and also to restore the original data when the data needs to be examined. Nowadays, storage space may not be the primary concern and the needs for data reduction come frequently from database applications. In this case, the purpose for data reduction is to save computational cost or disk access cost in query processing.
Historical Background
The need for data reduction arises naturally. In the early years (pre-1990s), storage was quite limited and expensive. This fostered the development of a class of techniques, called compression techniques, to reduce the data volume and hence the consumption of resources such as storage space or bandwidth in telecommunication settings. Another requirement for a compression technique is to reproduce the original data (from the compressed data) for reading. Here, “reading” has different meanings depending on the data contents: it means “listening” for audio data, “viewing” for video data, “file reading” for general contents, etc. Therefore, the reproduction of the data should be either exactly the same as the original data or very close to the original data by human perception. For example, MP3 is an audio compression technique which makes a compressed audio file sound almost the same as the original one. Compression is still a very active research topic today; but instead of dealing with data sizes of kilobytes or megabytes as in the early years, today's compression techniques deal with data sizes of gigabytes or even terabytes. With the rapid advances in hardware technology, storage limits are no longer the most critical issue in many cases. Another huge force driving the need for data reduction comes from database applications. Storing the data may not be a problem, but retrieving data from storage (typically hard disk) is still a quite expensive operation due to the slow improvement in disk seek time. Database queries commonly need to retrieve large amounts of data from disk. Therefore, data reduction is compelling for providing high performance in query processing. Different from data compression, data reduction in database applications usually does not need to generate a reproduction that is exactly the same as the original data or that sounds/looks very close to the original data. Instead, an approximation of the intended answer suffices, which gives more flexibility for data reduction. Traditionally, data reduction techniques have been used in database systems to obtain summary statistics, mainly for estimating the costs of query plans in a query optimizer. Here, an approximation of the expected cost suffices as an estimate. At the same time, highly reduced data (summary statistics) is essential to make the evaluation of query plans much cheaper than the evaluation of the query itself. In the last two decades, there has been enormous interest in online analytic processing (OLAP), which is characterized by complex queries involving group-by and aggregation operators on extremely large volumes of data. OLAP is mainly performed in decision support applications, which analyze data and generate summaries from the data. Organizations need these results to support high-level decision making. The data typically comprises data consolidated from many sources of
an organization, forming a repository called a data warehouse. In the face of the high data volume, efficient OLAP calls for data reduction techniques. Due to the analytical and exploratory nature of the queries, approximate answers are usually acceptable, and the error tolerance can sometimes be quite high.
Foundations
Compression techniques and data reduction techniques in databases are discussed separately below due to the differences in their purposes and general characteristics. Compression techniques are more often studied in the information retrieval research community, while data reduction techniques in databases are studied in the database research community. Compression techniques are a subcategory of data reduction techniques, although sometimes the term compression technique is used in a less strict way to refer to data reduction in general. Compression techniques involve the processes of encoding and decoding. Encoding converts the original data to a more compact format based on a mapping from source messages into codewords. Decoding performs the inverse operation to reproduce the original data. If the reproduction is exactly the same as the original data, the compression technique is lossless; otherwise, it is lossy. Lossless compression techniques can be used for any data format without needing to know the contents or semantics of the data. Popular techniques include ZIP, invented by Phil Katz in the late 1980s, and RAR, invented by Eugene Roshal in the early 1990s. If some domain knowledge about the data is available, lossy compression techniques usually yield better compression rates. For example, MP3, JPEG, and MPEG are popular compression techniques for audio, image, and video data, respectively. Lossy compression techniques leave out the less important information and noise to achieve higher compression. More concretely, the MP3 audio encoding format removes the audio details most human beings cannot hear to make the compressed audio sound like a faithful reproduction of the original uncompressed one. Different compression techniques mainly differ in the mapping from source messages into codewords. A survey of compression techniques is given in [6]. Readers interested in recent research results in compression techniques are referred to the proceedings of the Data Compression Conference [1]. Data reduction in databases can make use of various techniques. Popular ones include histograms, clustering, singular value decomposition (SVD), discrete wavelet transform (DWT), etc. These techniques can be divided into two categories, parametric and nonparametric, depending on whether the technique assumes a certain model. Histograms and clustering are nonparametric techniques, while SVD and DWT are parametric techniques. A summary of data reduction techniques for databases can be found in [3].
Histograms
A histogram is a data structure used to approximate the distribution of values. The value domain is divided into subranges called buckets. For each bucket, a count of the data items whose values fall into the bucket is maintained. Therefore, a histogram basically contains the bucket boundaries and the counts. The data distribution can be approximated by the average count per value within each bucket. Commonly, two types of histograms are used, equiwidth and equidepth histograms, distinguished by how the buckets are determined. In an equiwidth histogram, the length of every bucket is the same, while in an equidepth histogram, the count for every bucket is the same (sometimes exactly equal counts cannot be achieved, and then the counts for the buckets are approximately the same). Figure 1 shows an example data distribution over the value domain [0,8] and the equiwidth and equidepth histograms for the data, assuming three buckets.
Data Reduction. Figure 1. Histograms.
A thick vertical line represents the count for a value; a dashed line represents a bucket range and the estimated count for the values in the bucket. The estimated count of a certain value is simply the average count in the bucket. In the equiwidth histogram (Fig. 1a), each bucket has a range of length 3. The estimated counts for most values are quite close to the actual values. Equiwidth histograms are simple and easy to maintain, but the estimate is less accurate for skewed distributions, such as the count of value 3. This problem is alleviated in the equidepth histogram (Fig. 1b): each bucket has a count of about 9, and the estimated count for value 3 is very accurate. The disadvantage of equidepth histograms is that determining the boundaries of the buckets is more difficult. There are other types of histograms, such as compressed histograms and v-optimal histograms. A thorough classification of the various histograms can be found in [7].
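The following sketch builds equiwidth and equidepth histograms for a list of numeric values and estimates the count of a value as the average count of its bucket, mirroring the description above; the handling of bucket boundaries and of integer domains is a simplifying assumption.

```python
def equiwidth_histogram(values, num_buckets):
    """Buckets of equal length over the value domain, with a count per bucket."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1
    counts = [0] * num_buckets
    for v in values:
        b = min(int((v - lo) / width), num_buckets - 1)
        counts[b] += 1
    bounds = [(lo + i * width, lo + (i + 1) * width) for i in range(num_buckets)]
    return list(zip(bounds, counts))

def equidepth_histogram(values, num_buckets):
    """Buckets holding (approximately) the same number of values each."""
    data = sorted(values)
    n = len(data)
    buckets = []
    for i in range(num_buckets):
        chunk = data[i * n // num_buckets:(i + 1) * n // num_buckets]
        if chunk:
            buckets.append(((chunk[0], chunk[-1]), len(chunk)))
    return buckets

def estimate_count(histogram, value):
    """Estimate the count of a single value as the bucket count spread evenly
    over the values its bucket covers (a rough approximation)."""
    for (lo, hi), count in histogram:
        if lo <= value <= hi:
            return count / max(hi - lo, 1)
    return 0
```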
Clustering
Clustering is a technique to partition objects into groups, called clusters, such that the objects within a group are similar to each other. After clustering, operations can be performed on objects collectively, as groups, and the information in the data can be represented at the cluster level and hence greatly reduced. The data to be clustered usually contain multiple attributes, so each data object can be represented by a multidimensional point in space. Similarity is measured by a distance function; typically a metric, such as the Euclidean distance, is used. Given a data set, there is no single universal answer to the clustering problem: the result of clustering depends on the requirements or on the algorithm used to perform the clustering. A classic algorithm is the k-means algorithm, which partitions the data into k clusters. The value of k is given by the user in a rather subjective manner, usually depending on the application needs. Given a data set and a number k, the algorithm first picks k points, randomly or based on some heuristics, to serve as cluster centroids. Second, every point (object) is assigned to its closest centroid. Then the centroid of each cluster is recomputed based on the current assignment of points. If the newly computed centroids differ from the previous ones, all points are assigned to their closest centroids again and the centroids are recomputed. This process is repeated until the centroids do not change. Figure 2 shows an example where k = 3: the black dots are data points, the squares are initial centroids, and the dashed circles show the resulting clusters. Another algorithm, called k-medoids, works in a very similar manner but with a different way of choosing the cluster representatives, called medoids, and with a different stopping condition. More recently, algorithms designed for large data sets have been proposed in the database research community, such as BIRCH [9] and CURE [4].
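A compact sketch of the k-means procedure described above for two-dimensional points; the random initialization and the exact-equality stopping test are simplifications.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means for 2-D points given as (x, y) tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)               # initial centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                            # assign each point to its closest centroid
            i = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                            + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        new_centroids = [                           # recompute centroids from the assignment
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:              # stop when the centroids no longer change
            break
        centroids = new_centroids
    return centroids, clusters
```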
Data Reduction. Figure 2. k-means clustering.
Singular Value Decomposition (SVD)
Any m × n real matrix A can be decomposed as

A = U S V^T      (1)

where U is a column-orthonormal m × r matrix, r is the rank of the matrix A, S is a diagonal r × r matrix, and V is a column-orthonormal n × r matrix (bold symbols are used to represent matrices and vectors). It can be further expressed in the spectral decomposition [5] form

A = λ1 u1 v1^T + λ2 u2 v2^T + ... + λr ur vr^T      (2)

where ui and vi are the column vectors of U and V, respectively, and λi are the diagonal elements of S. A can be viewed as m points in n-dimensional space (each row being a point). Because the vi are orthogonal vectors, they form a new basis of the space. λi represents the importance of the basis vector vi (dimension i), and ui represents the coordinates of the m points in dimension i of this new coordinate system. Assume that the λi are sorted in descending order. Then v1 is the direction (dimension) with the largest dispersion (variance) of the points, v2 is the direction with the second largest dispersion, and so on. If the last few λi values are small and one omits them when calculating A, the resulting error will be very small. Therefore, SVD is widely used in dimensionality reduction and matrix approximation. The following is an example, with A given as

A =
[ 2  −1 ]
[ 2   1 ]
[ 1   1 ]
[ 2   3 ]
[ 4   4 ]
[ 5   2 ]

The SVD of A is A = U S V^T with

U =
[ 0.118  −0.691 ]
[ 0.250  −0.125 ]
[ 0.158   0.079 ]
[ 0.383   0.441 ]
[ 0.633   0.316 ]
[ 0.593  −0.454 ]

S = diag(8.82, 2.87)

V^T =
[  0.811  0.585 ]
[ −0.585  0.811 ]

Here, λ1 = 8.82, λ2 = 2.87, v1 = (0.811, 0.585)^T and v2 = (−0.585, 0.811)^T. Figure 3 shows the data points and the directions of v1 and v2.
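The example can be reproduced with an off-the-shelf SVD routine, here NumPy's; note that the signs of corresponding columns of U and V are not unique, so a library may return them flipped relative to the matrices shown above.

```python
import numpy as np

# The 6 x 2 example matrix from the text.
A = np.array([[2, -1], [2, 1], [1, 1], [2, 3], [4, 4], [5, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 2))                        # [8.82 2.87], the lambda_i above

# Rank-1 approximation: keep only the largest singular value / direction v1.
A1 = s[0] * np.outer(U[:, 0], Vt[0])
print(np.round(np.linalg.norm(A - A1), 2))   # 2.87: the error equals the discarded singular value
```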
Data Reduction. Figure 4. Haar transform.
Data Reduction. Figure 3. SVD.
Discrete Wavelet Transform (DWT)
Wavelet transform is a commonly used signal processing technique, like other transforms such as the Fourier transform or the cosine transform. In databases, the discrete version, called the Discrete Wavelet Transform (DWT), is commonly used. Applying the DWT yields a multi-resolution decomposition of the original signal in the form of wavelet coefficients. The wavelet coefficients are projections of the signal onto a set of orthogonal basis vectors. The choice of the basis vectors determines the type of DWT. The most popular one is the Haar transform, which is easy to implement and fast to compute. Some of the wavelet coefficients obtained may be small; they can be replaced by zeros, and hence the data is reduced. The inverse DWT can be applied to the reduced wavelet coefficients to obtain an approximation of the original signal. This is basically how DWT is used for compression. DWT-based compression provides better lossy compression than the Discrete Fourier Transform and the Discrete Cosine Transform. In the Haar transform, the elements of a signal are processed pairwise: the average and the difference of every two neighboring elements are computed. The averages serve as a lower-resolution approximation of the signal, and the differences (divided by 2) are the coefficients. For example, consider the signal S = {2, 4, 5, 5, 3, 1, 2, 2}. Computing the average of every two neighboring elements results in a lower-resolution signal S1 = {3, 5, 2, 2}. The coefficients are obtained by computing the difference of every two neighboring elements divided by 2, which gives D1 = {−1, 0, 1, 0}. S can be restored exactly by adding (or subtracting) each coefficient to the corresponding element in S1. For example, S(1) = S1(1) + D1(1) = 3 + (−1) = 2 and S(2) = S1(1) − D1(1) = 3 − (−1) = 4. Similarly, an even lower-resolution signal S2 can be obtained by applying the same process to S1. This can be done recursively until the length of the signal becomes 1. The full decomposition of S is shown in Fig. 4. The Haar transform of S is then given as the average over all elements (3) followed by all the coefficients: S′ = {3, 1, −1, 0, −1, 0, 1, 0}.
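The full Haar decomposition described above can be written in a few lines; the function assumes the signal length is a power of two, as in the example.

```python
def haar_transform(signal):
    """One-dimensional Haar DWT of a signal whose length is a power of two."""
    coeffs = []
    s = list(signal)
    while len(s) > 1:
        averages = [(a + b) / 2 for a, b in zip(s[0::2], s[1::2])]
        details = [(a - b) / 2 for a, b in zip(s[0::2], s[1::2])]
        coeffs = details + coeffs        # finer-level coefficients go last
        s = averages
    return s + coeffs                    # overall average followed by all coefficients

print(haar_transform([2, 4, 5, 5, 3, 1, 2, 2]))
# [3.0, 1.0, -1.0, 0.0, -1.0, 0.0, 1.0, 0.0]
```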
Key Applications
Data Storage and Transfer
Compression techniques are essential for data storage and transfer in many applications.
Database Management Systems
Histograms are a popular technique for maintaining summary information in database management systems. They are especially useful for a cost-based query optimizer.
OLAP
Due to the huge volume of data in OLAP applications, data reduction techniques such as sampling are commonly used to obtain quick approximate answers.
Multimedia Data
Multimedia data is characterized by large size. Therefore data reduction techniques are usually applied on multimedia data from storage to data processing. For example, the MP3, JPEG, MPEG formats for audio, image and video data, respectively, all use compression techniques. The new JPEG digital image standard, JPEG-2000, uses DWT for all its codecs [8]. Similarity search on multimedia data usually needs to deal with very high-dimensional point representations. SVD can be used to reduce the dimensionality to achieve better search performance. In a recent paper [2], DWT is used to represent 3D objects to obtain better data retrieval performance.
Taxonomy
Clustering is widely used in almost all taxonomy applications such as taxonomies of animals, plants, diseases and celestial bodies. It can also help visualization through a hierarchically clustered structure.
Cross-references
▶ Clustering ▶ Discrete Wavelet Transform and Wavelet Synopses ▶ Histogram ▶ Nonparametric Data Reduction Techniques ▶ Parametric Data Reduction Techniques ▶ Singular Value Decomposition
Recommended Reading
1. Proceedings of the Data Compression Conference. http://www.cs.brandeis.edu/dcc/index.html.
2. Ali M.E., Zhang R., Tanin E., and Kulik L. A motion-aware approach to continuous retrieval of 3D objects. In Proc. 24th Int. Conf. on Data Engineering, 2008.
3. Barbará D., DuMouchel W., Faloutsos C., Haas P.J., Hellerstein J.M., Ioannidis Y.E., Jagadish H.V., Johnson T., Ng R.T., Poosala V., Ross K.A., and Sevcik K.C. The New Jersey data reduction report. IEEE Data Eng. Bull., 20(4):3–45, 1997.
4. Guha S., Rastogi R., and Shim K. CURE: an efficient clustering algorithm for large databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1998, pp. 73–84.
5. Jolliffe I.T. Principal Component Analysis. Springer, Berlin, 1986.
6. Lelewer D.A. and Hirschberg D.S. Data compression. ACM Comput. Surv., 19(3):261–296, 1987.
7. Poosala V., Ioannidis Y.E., Haas P.J., and Shekita E.J. Improved histograms for selectivity estimation of range predicates. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1996, pp. 294–305.
8. The JPEG 2000 standard. http://www.jpeg.org/jpeg2000/index.html.
9. Zhang T., Ramakrishnan R., and Livny M. BIRCH: an efficient data clustering method for very large databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1996, pp. 103–114.
Depending on the context and the type of replication architecture, the term replica can refer to one of the physical copies of a particular data item, or to an entire site with all its data copies. Data replication can serve different purposes. First, it can be used to increase availability and provide fault-tolerance since the data can, in principle, be accessed as long as one replica is available. Second, it can provide good performance. By storing replicas close to users that want to access the data, replication allows fast local access. Third, access requests can be distributed across the replicas. By adding more replicas to the system a higher incoming workload can be handled, and hence, a higher throughput can be achieved. Thus, replication is a means to achieve scalability. Finally, for some applications, replication is a natural choice, e.g., if mobile users have to access data while disconnected from their corporate data server. The main challenge of replication is that the replicas have to be kept consistent when updates occur. This is the task of replica control. It has to translate the read and write operations that users submit on the logical data items into operations on the physical copies. In the most common approach, read operations are performed on one replica while write operations have to be performed on all replicas (ROWA, or readone-write-all approach). Ideally, all copies of a data item have the same value at all times. In reality, many different consistency models have been developed that reflect the needs for different application domains. Additionally, replica allocation algorithms have the task to decide where and when to install or remove replicas. They have to find a trade-off between the performance benefits for read operations and the overhead of keeping the copies consistent.
Historical Background
Data Replication B ETTINA K EMME McGill University, Montreal, QC, Canada
Synonyms Database replication; Replication
Definition Using data replication, each logical data item of a database has several physical copies, each of them located on a different machine, also referred to as site or node.
Data replication has gained attention in two different communities. The database community typically considers the replication of data items of a database, e.g., records or tables of a relational database. It has to consider that transactions and queries not only access individual data items, but read and write a whole set of related data items. In contrast, the distributed systems community mainly focuses on replication techniques for objects that are typically accessed individually, such as distributed file systems and web-servers. Nevertheless, in all the application types, similar issues arise regarding replica control and allocation, and the
Data Replication
associated coordination and communication overhead. Thus, there has always been an active exchange of research ideas, such as [1,5], and there exist several publication venues where work from both communities appears. Work on database replication had an early peak in the 1980s, where it was first introduced for availability purposes, and most approaches provided strong consistency properties. A good overview is given in [2]. A seminal paper by Gray et al. in 1996 [6] revived research in this area. It provided a first characterization of replica control algorithms and presented an analytical model showing that existing strong consistency solutions come with a large performance penalty. Since then, replication has remained an active research area. Emphasis has been put on reducing the high communication and coordination overhead of the early solutions. One research direction aims at reducing the costs by delaying the sending of updates to remote replicas. However, in this case, replicas might have stale or even inconsistent data. Solutions have been proposed to avoid inconsistencies (e.g., [3]), to provide limits on the staleness of the data (e.g., [8]), and to detect and then reconcile inconsistencies [9]. Another research direction has developed techniques to provide strong consistency guarantees at acceptable costs, for example, by taking advantage of multicast and group maintenance primitives provided by group communication systems [14]. In the distributed systems community, early work focused on replicated file systems (e.g., [10,13]). Later, web-server replication [12] and file replication in peer-2-peer systems, (e.g., [7]) have attained considerable attention. A wide range of consistency models has been defined to accommodate application needs. Also, a large body of literature exists regarding object replication for fault-tolerance purposes [4,11].
Foundations Replica Control
Replica control, which has the task of keeping the copies consistent despite updates, is the main issue to be tackled by any replication solution. Replica control has to decide which data copies read operations should access, and when and how to update individual data copies in case of write operations. Thus, most of the work done in the area of database replication is to some degree associated with replica control. The
D
627
entry Replica Control provides a detailed overview of the main challenges. Replica Control and Concurrency Control
In order to work properly with a database system providing transactional semantics, replica control has to be integrated with concurrency control. Even in a nonreplicated and nondistributed system, as soon as transactions are allowed to execute concurrently, concurrency control mechanisms restrict how the read and write operations of different transactions may interleave in order to provide each transaction a certain level of isolation. If data items are now replicated, the issue becomes more challenging. In particular, different transactions might start their execution on different replicas making it difficult to detect conflicts. For a nonreplicated database system, the most studied isolation level is serializability, which indicates that the concurrent execution of transactions should be equivalent to a serial execution of the same transactions. This is typically achieved via locking, optimistic, or multi-version concurrency control. Thus, one of the first steps in the research of replicated databases was to define a corresponding correctness criterion 1-copyserializability, which requires that the concurrent execution of transactions over all replicas is equivalent to a serial execution over a single logical copy of the database. Many replica control algorithms have been developed to provide this correctness criterion, often extending the concurrency control algorithms of nonreplicated systems to work in a replicated environment. In early solutions, replicas run some form of coordination during transaction execution to guarantee an appropriate serialization. This type of protocols is often called eager or synchronous since all replicas coordinate their operations before transaction commit. Textbooks on distributed systems typically contain a chapter on these replica control algorithms since they serve as a nice example of how to design coordination protocols in a distributed environment. A problem with most of these traditional replication solutions is that they induce a large increase in transaction response times which is often not acceptable from an application point of view. More recent approaches addressed this issue and designed replica control algorithms providing 1-copyserializability or other strong consistency levels that are tuned for performance. Many envision a cluster architecture, where a set of database replicas is connected
D
628
D
Data Replication
via a fast local area network. In such a network, eager replication can be feasible since communication delays are short. Replication is used to provide both scalability and fault-tolerance. The entries Replication based on Group Communication and Replication for Scalability describe recent, efficient replica control algorithms providing strong consistency levels for cluster architectures. Consistency Models and Conflict Resolution
By definition, eager replication incorporates coordination among replicas before transaction commit. Alternatively, lazy replication (also called asynchronous or optimistic) allows a transaction to commit data changes at one replica without coordination with other replicas. For instance, all update transactions could be executed and committed at a specific primary replica which then propagates changes to other replicas sometime after commit. Then, secondary replicas lag behind the current state at the primary. Alternatively, all replicas might accept and execute updates, and eventually propagate the changes to the rest of the replicas. In this case, replicas can become inconsistent. Conflicting updates are only detected after update propagation and inconsistent data has to be reconciled. In this context, a considerable body of research exists defining correctness criteria weaker than 1-copy-serializability. In particular, many formalisms exist that allow the defining of limits to the allowed divergence between the copies of a data item. Availability
Replication can increase the availability of the system since, in principle, a data item is available as long as one replica is accessible. However, in practice, it is not trivial to design a high-available replication solution. As discussed before, most replication algorithms require all updates to be performed at all replicas (ROWA, or read-one-write-all). If this is taken in the strict sense, then, if one replica is update transactions cannot execute, and the availability observed by update transactions is actually lower than in a nonreplicated system. To avoid this, most replica control algorithms actually implement a read-one-write-all-available (ROWAA) strategy that only updates replicas that are actually accessible. When a replica recovers after a crash, it first has to get the current state from the available replicas. This can be a complex process. The possibility of network partitions imposes an additional
challenge. Although a particular replica might be up and running, it might not be accessible because of an interruption in the network connectivity. Many commercial database systems offer specialized high-availability replication solutions. Typically, a primary database is replicated at a backup database system. All transactions are executed at the primary that sends updates to the backend. Only when the primary crashes, the backup takes over. Quorum systems are an alternative high-availability replication approach. In quorum approaches both read and write operations have to access a quorum of data replicas. For example, a quorum could be a majority of replicas. This guarantees that any two operations overlap in at least one replica. Thus, each read operation reads at least one replica that has the most recent update, and any two concurrent write operations are serialized at least at one replica. There exist many ways to define quorums differing in the structure and the sizes of the quorums. Quorums are an elegant solution to network partitions and are attractive for write intensive applications since writes do not need to access all replicas. However, they have worse performance than ROWA for the more common read-intensive applications. Replica Allocation
Using full replication, every site has copies of all existing data items. This simplifies the execution of read operations but has high update costs since all sites have to perform the updates for all data items. In contrast, in partial replication each site has only copies of some of the data items. The advantage is that an update on a data item does not lead to costs at all sites but only at those that contain a replica of the data item. However, read operations become more challenging. First, in a wide-area setting, if no local replica is available, read operations observe higher delays since they have to access a remote replica. Furthermore, if a request has to access several objects within a single query, the query might have to access data items on different sites, leading to distributed queries. Thus, replica allocation algorithms have to decide on the placement of replicas considering issues such as communication and update costs. Related to replica allocation is the task to adjust the replication configuration automatically and dynamically to the needs of the application. This is particularily interesting in a cluster-based configuration where
Data Replication
sites are located in a local area network and replication is used for load distribution and fault-tolerance. In here, configuration does not only relate to the number of replicas needed, but also to the question of how to distribute load across replicas, how to accomodate different applications on the replicated data, and how to optimally use all available resources. An important issue is that advanced information systems do not only have a database system but consist of a multi-tier architecture with web-servers, application servers, and database systems that interact with each other. Materialized views are a special form of data replication, where the data retrieved by the most typical database queries is stored in a pre-formatted form. Typically, queries can be run over materialized views as if they were base tables. Since the data is already in the format needed by the query, query processing time can be considerably reduced. However, updates on the views are usually disallowed but have to be performed directly on the base tables. Special refresh algorithms then guarantee that materialized views are updated when the base data changes. Replication in Various Computing Environments
Replication is a fundamental technique for data management that can be applied in various computing environments. The different purposes replication can generally have in LANs (clusters) and in wide-area environments have already been discussed above. Peer-to-peer (P2P) networks are a special form of wide-area environment. In here, each site is both client and server. For instance, each peer can provide storage space to store documents or data items that can be queried by all peers in the system. Or it provides processing capacity that can be used by other peers to perform complex calculations. In turn, it can request data items or processing capacity from other peers. A large body of research has developed algorithms that decide where to store data items and how to find them in the P2P network. Replication plays an important task for fault-tolerance, fast access, and load distribution. Replication also plays an important role in mobile environments that differ in some fundamental ways from wired networks. Firstly, communication between mobile units and the servers on the standard network is typically slow and unreliable. Secondly, mobile devices are usually less powerful and have considerably less storage space than standard machines leading to an
D
629
asymmetric architecture. Furthermore, mobile devices, such as laptops, are often disconnected from the network and only reconnect periodically. Thus, having replicas locally on the mobile device provides increased availability during disconnection periods.
Key Applications Data replication is widely used in practice. Basically, all database vendors offer a suite of replication solutions. Additionally, replication is often implemented ad-hoc at the application layer or as a middleware layer as the need arises. Replication is used in many application domains. Below some examples are listed. Companies use high-availability replication solutions for their mission critical data that has to be available 24/7. Examples are banking or trading applications. Companies use cluster replication, ideally with autonomic behavior, in order to provide a scalable and fault-tolerant database backend for their business applications. In particular companies that do e-business with a large number of users resort to database replication to be able to obtain the required throughput. This also includes techniques such as materialized views. Globally operating companies often have databases located at various sites. Parts of these databases are replicated at the other locations for fast local access. Examples are companies maintaining warehouses at many locations. Replication of web-sites is a common technique to achieve load-balancing and fast local access. As the information shown on the web becomes more and more dynamic (i.e., it is retrieved from the database in real-time), database replication techniques need to be applied. Companies that have employees working off-site, such as consulting or insurance companies, use mobile replication solutions. Data replicas are downloaded to mobile units such as laptops in order to work while disconnected. Upon reconnection to the network, the replicas are reconciled with the master replica on the database server. Typically, database vendors provide specialized software in order to allow for such types of replication. There exist many P2P-based document sharing systems, e.g., to share music, video files. Entire file systems can be built on P2P networks.
D
630
D
Data Replication Protocols
Data Warehouses can be seen as a special form of replication where the transactional data is copied and reformatted to be easily processed by data analysis tools.
Future Directions Being a fundamental data management technology, data replication solutions will need to be adjusted and revisited whenever new application domains and computing environments are developed. Thus, data replication will likely be a topic that will be further explored as our IT infrastructure and our demands change.
Cross-references
▶ Autonomous Replication ▶ Consistency Models for Replicated Data ▶ Data Broadcasting, Caching and Replication in Mobile Computing ▶ Distributed Database Design ▶ Middleware Support for Database Replication and Caching ▶ One-Copy-Serializability ▶ Optimistic Replication and Resolution ▶ Partial Replication ▶ P2P Database ▶ Quorum Systems ▶ Replication ▶ Replica Control ▶ Replica Freshness ▶ Replication based on Group Communication ▶ Replication for High Availability ▶ Replication for Scalability ▶ Replication in Multi-Tier Architectures ▶ Strong Consistency Models for Replicated Data ▶ Traditional Concurrency Control for Replicated Databases ▶ WAN Data Replication ▶ Weak Consistency Models for Replicated Data
Recommended Reading 1. Alonso G., Charron-Bost B., Pedone F., and Schiper A. (eds.), A 30-year Perspective on Replication, Monte Verita, Switzerland, 2007. 2. Bernstein P.A., Hadzilacos V., and Goodman N. Concurrency Control and Recovery in Database Systems. Addison Wesley, Boston, MA, 1987. 3. Breitbart Y., Komondoor R., Rastogi R., Seshadri S., and Silberschatz A. Update Propagation Protocols For Replicated Databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1999, pp. 97–108.
4. Budhiraja N., Marzullo K., Schneider F.B., and Toueg S. The primary-backup approach. In Distributed Systems (2nd edn.). S. Mullender (ed.). Addison Wesley, New York, NY, pp. 199–216. 5. Cabrera L.F. and Paˆris J.F. (eds.). In Proc. 1st Workshop on the Management of Replicated Data, 1990. 6. Gray J., Helland P., O’Neil P., and Shasha D. The dangers of replication and a solution. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1996, pp. 173–182. 7. Lv Q., Cao P., Cohen E., Li K., and Shenker S. Search and replication in unstructured peer-to-peer networks. In Proc. 16th Annual Int. Conf. on Supercomputing, 2002, pp. 84–95. 8. Ro¨hm U., Bo¨hm K., Schek H.J., and Schuldt H. FAS - a freshness-sensitive coordination middleware for a cluster of OLAP components. In Proc. Int. Conf. on Very Large Data Bases, 2002, pp. 754–765. 9. Saito Y. and Shapiro M. Optimistic replication. ACM Comput. Surv., 37(1):42–81, 2005. 10. Satyanarayanan M., Kistler J.J., Kumar P., Okasaki M.E., Siegel E.H., and Steere D.C. Coda: a highly available file system for a distributed workstation environment. IEEE Trans. Comput., 39 (4):447–459, 1990. 11. Schneider F.B. Replication management using the state-machine approach. In Distributed Systems (2nd edn.), S. Mullender (ed.). Addison Wesley, New York, NY, 1993, pp. 169–198. 12. Sivasubramanian S., Szymaniak M., Pierre G., and van Steen M. Replication for web hosting systems. ACM Comput. Surv., 36 (3):291–334, 2004. 13. Terry D.B., Theimer M., Petersen K., Demers A.J., Spreitzer M., and Hauser C. Managing update conflicts in Bayou, a weakly connected replicated storage system. In Proc. 15th ACM Symp. on Operating System Principles, 1995, pp. 172–183. 14. Wiesmann M. and Schiper A. Comparison of database replication techniques based on total order broadcast. IEEE Trans. Knowl. Data Eng., 17(4):551–566, 2005.
Data Replication Protocols ▶ Replica Control
Data Sampling
QING ZHANG
The Australian e-health Research Center, Brisbane, QLD, Australia
Definition Repeatedly choosing random numbers according to a given distribution is generally referred to as sampling. It is a popular technique for data reduction and
approximate query processing. It allows a large set of data to be summarized as a much smaller data set, the sampling synopsis, which usually provides an estimate of the original data with provable error guarantees. One advantage of the sampling synopsis is that it is easy and efficient to construct: the cost of constructing such a synopsis is only proportional to the synopsis size, which makes the sampling complexity potentially sublinear in the size of the original data. The other advantage is that the sampling synopsis consists of parts of the original data. Thus many query processing and data manipulation techniques that are applicable to the original data can be directly applied to the synopsis.
Historical Background
The notion of representing large data sets through small samples dates back to the end of the nineteenth century and has led to the development of many techniques [5]. Over the past two decades, sampling techniques have been greatly developed in various database areas, especially query optimization and approximate query processing. Query Optimization: The query optimizer identifies an efficient execution plan for evaluating a query. The optimizer generates alternative plans and chooses the cheapest one, using the statistical information stored in the system catalog to estimate the cost of each plan. Sampling synopses play a critical role in the query optimization of an RDBMS. Some commercial products, such as DB2 and Oracle, have already adopted sampling techniques to estimate several catalog statistics. In a heterogeneous DBMS, the global query optimizer also employs sampling techniques to estimate query plans when some local statistical information is unavailable [6]. Approximate Query Processing: Sampling is mainly used to generate approximate numeric answers for aggregate queries over a set of records, such as COUNT, SUM, and MAX. Compared with other approximate query processing techniques, such as histograms and wavelets, sampling is easy to implement and efficient at generating approximate answers with error guarantees. Many prototypes for approximate query processing have adopted sampling approaches [2,3,4].
Foundations A sampling estimation can be roughly divided into two stages. The first stage is to find a suitable sampling
method to construct the sampling synopsis from the original data set. The second stage is to analyze the sampling estimator to find the characteristics (bounds and parameters) of its distribution.
Sampling Method: Existing sampling methods can be classified into two groups: uniform random sampling and biased sampling. Uniform random sampling is the straightforward solution: every tuple of the original data has the same probability of being sampled. Thus, for aggregate queries, the estimate computed from the samples is the expected value of the answer. Due to the usefulness of uniform random sampling, commercial DBMSs already support operators to collect uniform samples. However, there are some queries for which uniform random sampling is less effective. Given a simple group-by query that finds the average value of different groups, a small group is often as important to the user as the larger groups, yet a uniform random sample may not contain enough information about the small group. Biased sampling methods were developed for such cases; stratified sampling, for example, is a typical biased sampling method, which is explained in detail below. The four basic sampling methods, two uniform and two biased, are listed below. Figure 1 shows the corresponding sampling synopses generated by these methods on the sample data.

Data Sampling. Figure 1. Sampling methods (sampling synopsis size = 3).

1. Random sampling with replacement: This method creates a synopsis by randomly drawing n of the N original data records, where the probability of drawing any record is 1/N. The records that have already been drawn are not remembered, so a given record may be drawn repeatedly in several runs.
2. Random sampling without replacement: This is similar to random sampling with replacement except that in each run the drawn record is remembered; that is, the same record will not be chosen in subsequent runs. Although sampling without replacement appears to lead to better approximation results, sampling with replacement is significantly easier to implement and analyze. In practice, the negligible difference between the two methods makes sampling with replacement a desirable alternative.
3. Cluster sampling: The N original data records are grouped into M mutually disjoint clusters. Then a random sampling on the M clusters is obtained to form the cluster sampling synopsis. That is, the clusters are treated as sampling units, so statistical analysis is done on a population of clusters.
4. Stratified sampling: Like cluster sampling, the N records are grouped into M mutually disjoint clusters, called strata. A stratified sampling synopsis is generated by running a random sampling on each cluster. This method is especially useful when the original data has a skewed distribution; the cluster with the smallest number of records is then guaranteed to be represented in the synopsis.
These basic sampling methods are straightforward solutions, although by themselves they usually cannot satisfy error or space requirements. However, they are building blocks of more advanced methods that can either be used in specific situations or guarantee the estimation error with a confidence level.
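To make the four basic methods concrete, the following sketch (a hypothetical illustration added here, not taken from the entry) shows one way each synopsis could be drawn with Python's standard random module; the cluster and stratum assignments are assumed to be given.

```python
import random

def sample_with_replacement(records, n):
    # Each draw picks any record with probability 1/N; duplicates are possible.
    return [random.choice(records) for _ in range(n)]

def sample_without_replacement(records, n):
    # Drawn records are remembered, so no record appears twice.
    return random.sample(records, n)

def cluster_sample(clusters, m):
    # Clusters are the sampling units: pick m whole clusters at random.
    chosen = random.sample(clusters, m)
    return [r for cluster in chosen for r in cluster]

def stratified_sample(strata, n_per_stratum):
    # Draw a random sample inside every stratum, so small strata are represented.
    return [r for stratum in strata for r in random.sample(stratum, n_per_stratum)]
```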
Here is an example of a specially designed sampling method, which extends basic random sampling to the data stream environment. Note that basic unbiased random sampling requires a fixed data set with a pre-defined size, whereas in a data stream environment the size of the original data set is unknown. Thus a dynamic sampling method is required to obtain an unbiased sampling synopsis over the whole data stream. For this purpose, reservoir-based sampling methods were originally proposed in [7]. Suppose an unbiased sampling synopsis is to be constructed over a data stream T. A sample reservoir of n records is maintained from the stream: the first n records of T are added to the reservoir for initialization, and the t-th newly arriving record is added to the reservoir with probability n/t. If a new record is added to the reservoir, one of the existing records of the reservoir, chosen uniformly at random (each with probability 1/n), is discarded. Figure 2 demonstrates the construction steps of the reservoir. Finally, when all data of T has been processed, the n records of the reservoir form an unbiased random sample of all the records of T. Similar reservoir-based sampling methods can also be developed for biased sampling [1].

Data Sampling. Figure 2. Random sampling with a reservoir.
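The reservoir construction described above can be sketched as follows; this is a minimal illustration of the classic reservoir algorithm of [7], assuming the stream is presented as a Python iterable.

```python
import random

def reservoir_sample(stream, n):
    """Maintain an unbiased random sample of size n over a stream of unknown length."""
    reservoir = []
    for t, record in enumerate(stream, start=1):
        if t <= n:
            # The first n records initialize the reservoir.
            reservoir.append(record)
        elif random.random() < n / t:
            # The t-th record enters with probability n/t and evicts a
            # uniformly chosen resident (each is discarded with probability 1/n).
            reservoir[random.randrange(n)] = record
    return reservoir
```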
Sampling Analysis: This stage analyzes the random variable generated by the sampling method; more specifically, it analyzes the distribution of the random variable by discovering its bounds and distribution parameters. Given N records, assume that the function f(N) represents an operation on the N records. Let S represent a sampling synopsis of the N records; f(S) is often closely related to f(N) for most common operations, such as AVERAGE or MAX. Let X = f(S); X is the random variable to be analyzed. If f(N) represents a linear aggregation function, such as AVERAGE, X can be approximated by a normal distribution, according to the Central Limit Theorem. If, however, f(N) represents another function, such as MAX, probabilistic bounds based on key distribution parameters, such as the expectation E[X] and the variance Var[X], need to be found. This is often quite acceptable as an alternative way to characterize the entire distribution of X. There exist a number of inequalities to estimate such probabilistic bounds. These inequalities are collectively known as tail bounds. Given a random variable X, if E[X] is known, Markov's Inequality gives, for all a > 0:

Pr(X ≥ a) ≤ E[X]/a.

The variance of X is defined as Var[X] = E[(X − E[X])²]. A significantly stronger tail bound can be obtained from Chebyshev's Inequality if Var[X] is known; for all a > 0:

Pr(|X − E[X]| ≥ a) ≤ Var[X]/a².

The third inequality is an extremely powerful tool called the Chernoff bound, which gives exponentially decreasing bounds on the tail distribution. These are derived by applying Markov's Inequality to exp(tX) for some well-chosen value t; bounds derived from this approach are generally referred to collectively as Chernoff bounds. The most commonly used version of the Chernoff bound is for the tail distribution of a sum of independent 0–1 random variables. Let X1,...,Xn be independent Poisson trials such that Pr(Xi = 1) = pi. Let X = X1 + ... + Xn and μ = E[X]. For 0 < δ < 1,

Pr(|X − μ| ≥ μδ) ≤ 2·exp(−μδ²/3).

Finally, an example illustrates the different tail-bounding abilities of the three inequalities. Suppose one wants to estimate the size of the synopsis generated by a random sampling with replacement of N data records, where each record is sampled independently with probability 1/2. Let X denote the size of the sampling synopsis; the expected size is E[X] = N/2. The probabilities that the synopsis size exceeds 3N/4, under the different inequalities, are:

Markov's Inequality: Pr(X ≥ 3N/4) ≤ 2/3;
Chebyshev's Inequality: Pr(X ≥ 3N/4) ≤ 4/N;
Chernoff Bound: Pr(X ≥ 3N/4) ≤ 2·exp(−N/24).
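As a quick numeric illustration of how differently the three bounds behave, the short script below (an illustrative sketch, not part of the original entry) evaluates them for the synopsis-size example at a few values of N.

```python
import math

def markov_bound(N):
    # Pr(X >= 3N/4) <= E[X]/a with E[X] = N/2 and a = 3N/4.
    return (N / 2) / (3 * N / 4)

def chebyshev_bound(N):
    # Var[X] = N/4 for N independent coin flips; deviation threshold is N/4.
    return (N / 4) / (N / 4) ** 2

def chernoff_bound(N):
    # 2 * exp(-mu * delta^2 / 3) with mu = N/2 and delta = 1/2.
    return 2 * math.exp(-N / 24)

for N in (100, 1000, 10000):
    print(N, markov_bound(N), chebyshev_bound(N), chernoff_bound(N))
```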
Key Applications Query Optimization
Data sampling is one of the primary techniques used by query optimizers. In some multi-dimensional cases, it becomes the only easy and viable solution. Approximate Query Processing
Data sampling is one of the three major data reduction techniques (the other two are histograms and wavelets) employed by approximate query processors.
Data Streaming
Sampling is a simple yet powerful method for synopsis construction over data streams.
Cross-references
▶ Approximate Query Processing ▶ Data Reduction ▶ Data Sketch/Synopsis ▶ Histogram ▶ Query Optimization
Recommended Reading
1. Aggarwal C.C. On biased reservoir sampling in the presence of stream evolution. In Proc. 32nd Int. Conf. on Very Large Data Bases, 2006.
2. Chaudhuri S. et al. Overcoming limitations of sampling for aggregation queries. In Proc. 17th Int. Conf. on Data Engineering, 2001.
3. Ganti V., Lee M.-L., and Ramakrishnan R. ICICLES: Self-tuning samples for approximate query answering. In Proc. 28th Int. Conf. on Very Large Data Bases, 2000.
4. Gibbons P.B. and Matias Y. New sampling-based summary statistics for improving approximate query answers. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1998.
5. Kish L. Survey Sampling. Wiley, New York, 1965.
6. Speegle G.D. and Donahoo M.J. Using statistical sampling for query optimization in heterogeneous library information systems. In Proc. 20th ACM Annual Conference on Computer Science, 1993.
7. Vitter J.S. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37–57, 1985.
Data Sketch/Synopsis
XUEMIN LIN
University of New South Wales, Sydney, NSW, Australia

Synonyms
Summary

Definition
A synopsis of a dataset D is an abstract of D. A sketch also refers to an abstract of a dataset D, but usually to an abstract obtained by a sampling method.

Key Points
Sketch/synopsis techniques have many applications. They are mainly used for statistics estimation in query processing optimization and for supporting on-line data analysis via approximate query processing. The goal is to develop effective and efficient techniques to build a small-space synopsis while achieving high precision. For instance, a key component in query processing optimization is to estimate the result sizes of queries. Many techniques [1,2] have been developed for this purpose, including histograms, wavelets, and join synopses. In data stream applications, the space requirements of synopses/sketches are critical to keeping them in memory for on-line query processing. Streams are usually massive in size and fast at arrival rates; consequently it may be infeasible to keep a whole data stream in memory. Many techniques [3] have been proposed with the aim of minimizing the space requirement for a given precision guarantee. These [3] include heavy hitters, quantiles, duplicate-insensitive aggregates, joins, data distribution estimation, etc.

Cross-references
▶ Approximate Query Processing ▶ Histograms on Streams ▶ Wavelets on Streams

Recommended Reading
1. Alon N., Gibbons P.B., Matias Y., and Szegedy M. Tracking join and self-join sizes in limited storage. In Proc. 18th ACM SIGMOD-SIGACT-SIGART Symp. on Principles of Database Systems, 1999.
2. Gibbons P.B. and Matias Y. Synopsis data structures for massive data sets. In Proc. ACM-SIAM Symposium on Discrete Algorithms, 1999.
3. Zhang Y., Lin X., Xu J., Korn F., and Wang W. Space-efficient relative error order sketch over data streams. In Proc. 22nd Int. Conf. on Data Engineering, 2006.

Data Skew
LUC BOUGANIM
INRIA-Rocquencourt, Le Chesnay, France

Synonyms
Biased distribution; Non-uniform distribution

Definition
Data skew primarily refers to a non-uniform distribution in a dataset. Skewed distributions can follow common forms (e.g., Zipfian, Gaussian, Poisson), but many studies use the Zipfian [3] distribution to model skewed datasets. Using a real bibliographic database, [1] provides real-world parameters for the Zipf distribution model. The direct impact of data skew on
parallel execution of complex database queries is poor load balancing, which leads to high response times.
Key Points
Walton et al. [2] classify the effects of skewed data distribution on a parallel execution, distinguishing intrinsic skew from partition skew. Intrinsic skew is skew inherent in the dataset (e.g., there are more citizens in Paris than in Waterloo) and is thus called attribute value skew (AVS). Partition skew occurs in parallel implementations when the workload is not evenly distributed between nodes, even when the input data is uniformly distributed. Partition skew can be further classified into four types. Tuple placement skew (TPS) is the skew introduced when the data is initially partitioned (e.g., with range partitioning). Selectivity skew (SS) is introduced when there is variation in the selectivity of select predicates on each node. Redistribution skew (RS) occurs in the redistribution step between two operators and is similar to TPS. Finally, join product skew (JPS) occurs because the join selectivity may vary between nodes.
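To see how attribute value skew can translate into partition skew, the following sketch (an illustrative example added here, not part of the original entry) hash-partitions Zipf-distributed join keys across a few nodes and reports the resulting load imbalance.

```python
import random
from collections import Counter

def zipf_keys(num_tuples, num_keys, s=1.2):
    # Draw join-key values from a Zipfian distribution with exponent s.
    weights = [1 / (rank ** s) for rank in range(1, num_keys + 1)]
    return random.choices(range(num_keys), weights=weights, k=num_tuples)

def partition_load(keys, num_nodes):
    # Hash-partition tuples on the join key and count tuples per node.
    load = Counter(hash(k) % num_nodes for k in keys)
    return [load[node] for node in range(num_nodes)]

loads = partition_load(zipf_keys(100_000, 1000), num_nodes=4)
print("per-node load:", loads, "max/avg:", max(loads) / (sum(loads) / len(loads)))
```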
Cross-references
▶ Query Load Balancing in Parallel Database Systems
Recommended Reading
1. Lynch C. Selectivity estimation and query optimization in large databases with highly skewed distributions of column values. In Proc. 14th Int. Conf. on Very Large Data Bases, 1988, pp. 240–251.
2. Walton C.B., Dale A.G., and Jenevin R.M. A taxonomy and performance model of data skew effects in parallel joins. In Proc. 17th Int. Conf. on Very Large Data Bases, 1991, pp. 537–548.
3. Zipf G.K. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley, Reading, MA, 1949.
Data Sorts
▶ Data Types in Scientific Data Management

Data Standardization
▶ Constraint-Driven Database Repair
Data Storage and Indexing in Sensor Networks
PHILIP B. GIBBONS
Intel Labs Pittsburgh, Pittsburgh, PA, USA
Definition Sensor data can either be stored local to the sensor node that collected the data (local storage), transmitted to one or more collection points outside of the sensor network (external storage), or transmitted and stored at other nodes in the sensor network (in-network storage). There are trade-offs with each of these approaches, as discussed below, depending on the volume of data collected at each sensor node, the query workload, and the resource limitations of each node. Moreover, the local and in-network storage scenarios often require in-network indexes in order to reduce the overheads of answering queries on data stored within the sensor network. Such indexes can be classified as either exact-match indexes or range indexes.
Historical Background External storage is in some sense the default approach for sensor networks, reflecting the common scenario in which the application is interested in all the collected sensor readings. Early work in local storage includes Cougar [11] and TinyDB [8]; both push SQL-style queries out to data stored at the sensor nodes. In TinyDB and in Directed Diffusion [6] (another early work) the query workload dictates which sensors are turned on. This functionality can be used for external, local, or in-network storage, depending on where these collected data get stored. Seminal work on in-network storage includes the work on geographic hash tables (GHTs) [10], which support exact-match indexing. The authors advocate data-centric storage, a class of in-network storage in which data are stored according to named attribute values. Early work on supporting range indexes for in-network storage includes DIFS [5] and DIM [7].
Foundations
External Storage, in which all the sensor readings are transmitted to collection points outside of the sensor network, has several important advantages. First, storage is plentiful outside the sensor network, so that all the data can be archived, as well as disseminated to any
interested party (e.g., posted on the web). Archiving all the data is quite useful for testing out new theories and models on these historical data, and for forensic activities. Second, processing power, memory, and energy are plentiful outside the sensor network, so that queries and complex data analysis can be executed quickly and without exhausting the sensors’ limited energy reserves. Finally, such data processing can be done using standard programming languages and tools (such as MatLab) that are not available on sensor nodes. On the other hand, external storage suffers the disadvantage that it incurs the costs (primarily energy, but also bandwidth and latency) of transmitting all the data to outside the network. Local storage, in which sensor readings are stored local to the node that collected the data, avoids the costs of transmitting all the data. Instead, it incurs the costs of pushing queries into the network and returning the query answers. Queries are often flooded through the sensor network. A collection tree is constructed hop-by-hop from the query source (called the root), as follows [8]. The root broadcasts the query, and each sensor node that hears the broadcast makes the root its parent in the tree. These nodes in turn broadcast the query, and any node that hears one or more broadcasts (and is not yet in the tree) selects one of these nodes as its parent, and so on. This tree is used to collect query answers: Each leaf node sends its piece of the answer to its parent, and each internal node collects these partial answers from its children, combines them with its own piece of the answer, and sends the result to its parent. The process proceeds level-by-level up the tree to the root. Thus, the cost of pushing the query and gathering the answer can be high. Nevertheless, the amount of data transmitted when using local storage can often be far less than when using external storage. First, queries are often long running (continuous queries); for such queries, the costs of query flooding and tree construction are incurred only once at the start of the query (although maintaining the tree under failures can incur some additional costs). Second, indexes can be used (as discussed below) to narrow the scope of the query to a subset of the nodes. Third, queries can be highly selective (e.g., looking for rare events such as an intruder sighting), so that most sensors transmit little or no real data. In camera sensor networks (e.g., IrisNet [4]), local filtering of images can result in query answers that are orders of magnitude smaller than the raw
images. Fourth, many queries are amenable to efficient in-network aggregation, in which partial answers received from children can be combined into a single fixed-sized packet. For example, in a Sum query each internal node can send to its parent a single value equal to the sum of its value and the values received from its children. Finally, in sensor networks supporting queries of live data only (i.e., only the latest data are of interest), the amount of data sensed can far exceed the amount of data queried. A key consideration when using local storage is that the amount of such storage is limited. Thus, at some point old data need to be discarded or summarized to make room for new data [2]. Moreover, the local storage is often flash memory, and hence flash-friendly techniques are needed to minimize the costs for accessing and updating locally stored data [9]. In-network storage, in which sensor readings are transmitted and stored at other nodes in the sensor network, falls in between the extremes of external storage and local storage. Caching data that passes through a sensor node during query processing is a simple form of in-network storage. As cached data become stale over time, care must be taken to ensure that the data meets the query’s freshness requirements [4]. In TinyDB [8], ‘‘storage point’’ queries can be used to collect data satisfying a query (e.g., all temperature readings in the past 8 seconds, updated every second) at nodes within the network. In data-centric storage [10], data are stored according to named attribute values; all data with the same general name (e.g., intruder sightings) are stored at the same sensor node. Because data items are stored according to their names, queries can retrieve all data items associated with a target name from just a single ‘‘home’’ node for that name (as opposed to potentially all the nodes when using local storage). The approach relies on building and maintaining indexes on the names, so that both sensor readings and queries can be routed efficiently to the corresponding home node. In-network indexes, in which the index is maintained inside the sensor network, are useful for both local storage and in-network storage. The goal is to map named data to the sensor node(s) that hold such data, in order to minimize the cost of answering queries. Queries using the index are guided to the sensor nodes holding the desired data. In TinyDB each internal node of a collection tree maintains a lower and upper bound on the attribute values in the subtree rooted at the node;
this index is used to restrict the query processing to only subtrees containing the value(s) of interest. In Directed Diffusion [6], queries expressing interests in some target data are initially flooded out from the query node. Paths leading to nodes that are sources of the target data are reinforced, resulting in an ad hoc routing tree from the sources back to the query node. This can be viewed as an ad hoc query-specific index. A geographic hash table (GHT) [10] is an exact-match in-network indexing scheme for data-centric storage. A GHT hashes a data name (called a key) to a random geographic location within the sensor field. The home node for a key is the sensor node that is geographically closest to the location returned by the hash of the key. (For fault tolerance and to mitigate hot spots, data items are also replicated on nearby nodes.) Geographic routing is used to route sensor readings and queries to the home node. In geographic routing each node knows its geographic coordinates and the coordinates of its neighbors. Upon receiving a packet, a node forwards the packet to the neighbor closest to the home node. In this way, the packet attempts to take a shortest path to the home node. The packet can get stuck, however, in a local minimum node v such that no neighbor of v is closer to the home node than v itself. Different geographic routing schemes provide different approaches for recovering from local minima, so that the home node can always be reached. Follow-on work on GHTs (e.g., [1]) has focused on improving the practicality (efficiency, robustness, etc.) of geographic routing and hence GHTs. In-network storage based on a GHT is less costly than local storage whenever the savings in data transmitted in querying exceed the additional costs of transmitting sensor data to home nodes. In a square sensor field of n sensors, routing to a home node takes O(√n) hops. Consider a workload where there are E event messages to be transmitted and Q queries each requesting all the events for a distinct named event type. With in-network storage based on a GHT, the total hops is O(√n·(Q + E)). With local storage, the total hops is O(Qn), as it is dominated by the cost to flood the query. Thus, roughly, the in-network scheme is less costly when the number of events is at most a factor of √n larger than the number of queries. However, this is in many respects a best case scenario for in-network storage, and in general, local or external storage can often be less costly than in-network storage. For example, with a single continuous query that aggregates data
from all the sensor nodes for t ≥ 1 time periods (i.e., E = tn), in-network storage based on a GHT incurs O(tn^1.5) total hops while local storage with in-network aggregation incurs only O(tn) total hops, as the cost is dominated by the t rounds of hops up the collection tree. Moreover, a GHT is not well-suited to answering range queries. To remedy this, a variety of data-centric storage schemes have been proposed that provide effective range indexes [5,7,3]. DIM [7], for example, presents a technique (inspired by k-d trees) for constructing a locality-preserving geographic hash function. Combined with geographic routing, this extends the favorable scenarios for in-network storage to also include multi-dimensional range queries. In summary, which of the external, local, or in-network storage schemes is preferred depends on the volume of data collected at each sensor node, the query workload, and the resource limitations of each node.
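The following sketch (an illustrative toy example, not code from GHT itself) shows the core idea of a geographic hash table: a key is hashed to a coordinate in the sensor field and stored at the node closest to that coordinate, so readings and queries for the same name meet at the same home node.

```python
import hashlib
import math

def key_to_location(key, field_size=100.0):
    # Hash the data name to a pseudo-random (x, y) point inside the field.
    digest = hashlib.sha1(key.encode()).digest()
    x = int.from_bytes(digest[:4], "big") / 2**32 * field_size
    y = int.from_bytes(digest[4:8], "big") / 2**32 * field_size
    return (x, y)

def home_node(key, node_positions):
    # The home node is the sensor geographically closest to the hashed location.
    target = key_to_location(key)
    return min(node_positions, key=lambda n: math.dist(node_positions[n], target))

# Usage: both readings and queries for "intruder-sighting" route to the same node.
nodes = {1: (10.0, 20.0), 2: (80.0, 75.0), 3: (55.0, 30.0)}
print(home_node("intruder-sighting", nodes))
```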
Key Applications Sensor networks, Applications of sensor network data management.
Cross-references
▶ Ad-Hoc Queries in Sensor Networks ▶ Applications of Sensor Network Data Management ▶ Continuous Queries in Sensor Networks ▶ Data Acquisition and Dissemination in Sensor Networks ▶ Data Aggregation in Sensor Networks ▶ Data Compression in Sensor Networks ▶ Data Fusion in Sensor Networks ▶ In-Network Query Processing ▶ Sensor Networks
Recommended Reading
1. Ee C.T., Ratnasamy S., and Shenker S. Practical data-centric storage. In Proc. 3rd USENIX Symp. on Networked Systems Design & Implementation, 2006, pp. 325–338.
2. Ganesan D., Greenstein B., Estrin D., Heidemann J., and Govindan R. Multiresolution storage and search in sensor networks. ACM Trans. Storage, 1(3):277–315, 2005.
3. Gao J., Guibas L.J., Hershberger J., and Zhang L. Fractionally cascaded information in a sensor network. In Proc. 3rd Int. Symp. Inf. Proc. in Sensor Networks, 2004, pp. 311–319.
4. Gibbons P.B., Karp B., Ke Y., Nath S., and Seshan S. IrisNet: An architecture for a worldwide sensor web. IEEE Pervasive Comput., 2(4):22–33, 2003.
5. Greenstein B., Estrin D., Govindan R., Ratnasamy S., and Shenker S. DIFS: A distributed index for features in sensor networks. In Proc. First IEEE Int. Workshop on Sensor Network Protocols and Applications, 2003, pp. 163–173.
6. Intanagonwiwat C., Govindan R., Estrin D., Heidemann J., and Silva F. Directed diffusion for wireless sensor networking. IEEE/ACM Trans. Network., 11(1):2–16, 2003.
7. Li X., Kim Y.J., Govindan R., and Hong W. Multi-dimensional range queries in sensor networks. In Proc. 1st Int. Conf. on Embedded Networked Sensor Systems, 2003, pp. 63–75.
8. Madden S.R., Franklin M.J., Hellerstein J.M., and Hong W. TinyDB: An acquisitional query processing system for sensor networks. ACM Trans. Database Syst., 30(1):122–173, 2005.
9. Mathur G., Desnoyers P., Ganesan D., and Shenoy P. Capsule: An energy-optimized object storage system for memory-constrained sensor devices. In Proc. 4th Int. Conf. on Embedded Networked Sensor Systems, 2006, pp. 195–208.
10. Ratnasamy S., Karp B., Shenker S., Estrin D., Govindan R., Yin L., and Yu F. Data-centric storage in sensornets with GHT, a geographic hash table. Mobile Networks Appl., 8(4):427–442, 2003.
11. Yao Y. and Gehrke J. Query processing for sensor networks. In Proc. 1st Biennial Conf. on Innovative Data Systems Research, 2003.
Data Stream
LUKASZ GOLAB
AT&T Labs-Research, Florham Park, NJ, USA
Synonyms Continuous data feed
Definition
A data stream S is an ordered collection of data items, s1, s2,..., having the following properties: data items are continuously generated by one or more sources and sent to one or more processing entities, and the arrival order of data items cannot be controlled by the processing entities. For instance, an Internet Service Provider (ISP) may be interested in monitoring the traffic on one or more of its links. In this case, the data stream consists of data packets flowing through the network. The processing entities, e.g., monitoring software, may be located directly on routers inside the ISP's network or on remote nodes. Data streams may be classified into two types: base and derived. A base stream arrives directly from the source. A derived stream is a pre-processed base stream,
e.g., an intermediate result of a query or a sub-query over one or more base streams. In the network monitoring scenario, the base stream corresponds to the actual IP packets, whereas a derived stream could contain aggregate measurements of traffic between each source and destination in a five-minute window.
Key Points
Depending upon the application, a data stream may be composed of raw data packets, relational tuples, events, pieces of text, or pieces of an XML document. Furthermore, each data stream item may be associated with two timestamps: generation time (assigned by the source) and arrival time (assigned by the processing entity). The order in which items arrive may differ from their generation order; therefore these two timestamps may produce different orderings of the data stream. A data stream may arrive at a very high speed (e.g., a router may process hundreds of thousands of packets per second) and its arrival rate may vary over time. Moreover, a data stream may be unbounded in size; in particular, the processing entity may not know if and when the stream "ends."
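As a small illustration of the base/derived distinction and of the two timestamps (the tuple layout below is an assumption for the example, not prescribed by the entry), the sketch turns a base stream of packet records into a derived stream of per source–destination byte counts over five-minute windows of generation time.

```python
from collections import defaultdict

WINDOW = 300  # five minutes, in seconds

def derive_traffic(base_stream):
    """base_stream yields dicts with 'src', 'dst', 'bytes', 'gen_time', 'arr_time'."""
    counts = defaultdict(int)
    for pkt in base_stream:
        # Window on generation time; arrival time may order packets differently.
        window_start = pkt["gen_time"] - pkt["gen_time"] % WINDOW
        counts[(pkt["src"], pkt["dst"], window_start)] += pkt["bytes"]
    return counts  # the derived stream: aggregate traffic per source, destination, window

packets = [
    {"src": "10.0.0.1", "dst": "10.0.0.9", "bytes": 1500, "gen_time": 100, "arr_time": 102},
    {"src": "10.0.0.1", "dst": "10.0.0.9", "bytes": 400, "gen_time": 310, "arr_time": 311},
]
print(derive_traffic(packets))
```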
Cross-references
▶ Stream-Oriented Query Languages and Operators ▶ Stream processing ▶ One-pass algorithm ▶ Stream mining ▶ Synopsis structure
Recommended Reading
1. Babcock B., Babu S., Datar M., Motwani R., and Widom J. Models and issues in data streams. In Proc. 21st ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2002, pp. 1–16.
2. Golab L. and Özsu M.T. Issues in data stream management. ACM SIGMOD Rec., 32(2):5–14, 2003.
3. Muthukrishnan S. Data streams: algorithms and applications. Found. Trends Theor. Comput. Sci., 1(2):1–67, 2005.
Data Stream Algorithm ▶ One-Pass Algorithm
Data Stream Management Architectures and Prototypes
YANIF AHMAD, UĞUR ÇETINTEMEL
Brown University, Providence, RI, USA
Definition Data stream processing architectures perform database-style query processing on a set of continuously arriving input streams. The core query executor in this type of architecture is designed to process continuous queries, rather than ad-hoc queries, by pushing inputs through a series of operators functioning in a pipelined and potentially non-blocking manner. Stream processing applications perform explicit read and write operations to access storage via asynchronous disk I/O operations. Other architectural components that differ significantly from standard database designs include the stream processor’s scheduler, storage manager and queue manager.
Historical Background
Stream processing engines support database-style query processing for long-running applications that operate in high data-volume environments and impose high throughput and low latency requirements on the system. There have been several efforts from both the academic and industrial communities at developing functional prototypes of stream processing engines (SPEs), to demonstrate their usefulness and to better understand the challenges posed by data stream applications. The first general-purpose relational stream processing architectures appeared from the research community in 2001–2002, while the initial industrial offerings began to appear in 2003–2004. As a historical note, non-relational approaches to stream or event processing have existed in many forms prior to the early 2000s, especially in the embedded systems and signal processing communities.
Foundations
An SPE has a radically different architecture than that of a traditional database engine. Conceptually, the architectural differences can be captured by the following key characteristics:
1. Continuous query model
2. Inbound processing model
3. Single-process model
In the continuous query model, the queries execute continuously as new input data becomes available. This contrasts with the prevailing one-time query model where users (or clients) issue queries that process the available input data and produce one-time results. In other words, in the continuous model, the queries are persistent and input data is transient, whereas in the traditional model, queries are transient and input data is persistent, as illustrated in Fig. 1.

Data Stream Management Architectures and Prototypes. Figure 1. Illustration of storage-oriented and streaming-oriented architectures. The former requires outbound processing of data, whereas the latter enables inbound (or straight-through) processing.

An SPE supports inbound processing, where incoming event streams immediately start to flow through the query processing operators as they enter the system. The operators process the events as they move, continuously producing results, all in main memory when possible. Read or write operations to storage are optional and can be executed asynchronously in many cases, when they are present. Inbound processing overcomes a fundamental limitation of the traditional outbound processing model, employed by all conventional database management systems, where data are always inserted into the database (usually as part of a transaction) and indexed as a first step before any processing can take place to produce results. By removing the storage from the critical path of processing, an SPE achieves significant performance gains compared to the traditional outbound processing approach. An SPE often adopts a single-process model, where all time-critical operations (including data processing, storage, and execution of custom application logic) are run in a single process space. This integrated approach eliminates high-overhead process context switches that are present in solutions that use multiple software systems to collectively provide the same set of capabilities, yielding high throughput with low latency. The SPE prototypes developed independently in the academic community share core design principles and architectural components to implement push-based dataflows. At a high level, an SPE's core includes a query executor maintaining plans for users' queries, a queue manager and storage manager to govern memory resources and perform optional disk access and persistence, a stream manager to handle stream I/O with data sources and sinks, and a scheduler to determine an execution strategy. Figure 2 presents a diagrammatic overview of these core components.

Data Stream Management Architectures and Prototypes. Figure 2. Architectural components of an SPE.

SPEs implement continuous queries and inbound processing inside the query executor, by instantiating query plans with non-blocking operators that are capable of producing result tuples from each individual
input tuple, or a window of tuples, depending on the operator type. This is in contrast to traditional operators that wait for relations to be scanned from disk. The query executor represents query plans as operators
connected together with queues that buffer continuously-arriving inputs (as found in the Aurora and Stream prototypes [2,5]), and any pending outputs (for example Fjords in TelegraphCQ [8]), while each
operator performs its processing. A queue manager is responsible for ensuring the availability of memory resources to support buffering inputs and outputs, and interacts with other system components to engage alternative processing techniques when memory availability guarantees cannot be upheld [3]. Operators may choose to access inputs, or share state with other operators, through persistent storage. Disk access is provided through a storage manager that is responsible for maintaining cursors on external tables, and for performing asynchronous read and write operations while continuing to process data streams. Advanced storage manager features include the ability to spill operator queues and state to disk under dwindling memory availability, as well as the ability to maintain approximations of streams, queues and states. SPEs typically include a stream manager component to handle interaction with the network layer as data sources transmit stream inputs, usually over TCP or UDP sockets. The stream manager is additionally responsible for any data format conversions through built-in adaptors, and for indicating the arrival of new inputs to the scheduler as the new inputs are placed on the operators' queues. An operator scheduler [5,7] is responsible for devising an execution order based on various policies to ensure efficient utilization of system resources. These policies typically gather operator cost and selectivity statistics in addition to resource utilization statistics to determine a schedule that improves throughput and latencies. While SPEs execute under a single-process model, various scheduler threading designs have been proposed to provide query processing parallelism. Finally, SPEs also include query optimizers, such as load shedders [15] and adaptive plan optimizers [6], that monitor the state of the running query through statistics and other optimizer-specific monitors to dynamically and adaptively determine advantageous modifications to query plans and operator internals.
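The push-based, queue-connected executor described above can be sketched in a few lines; this is a simplified illustration (the operator names and classes are invented here, not taken from any particular SPE) of non-blocking operators that emit results for each input tuple as it arrives.

```python
from collections import deque

class Filter:
    def __init__(self, predicate, downstream):
        self.predicate, self.downstream = predicate, downstream
    def process(self, tup):
        # Non-blocking: each input tuple is tested and pushed onward immediately.
        if self.predicate(tup):
            self.downstream.process(tup)

class SlidingCount:
    def __init__(self, window_size, emit):
        self.window, self.size, self.emit = deque(), window_size, emit
    def process(self, tup):
        # Maintain a count over the last `window_size` tuples and emit it per input.
        self.window.append(tup)
        if len(self.window) > self.size:
            self.window.popleft()
        self.emit(len(self.window))

# Inbound processing: tuples flow through the pipeline as they arrive.
pipeline = Filter(lambda t: t["temp"] > 30, SlidingCount(5, print))
for reading in [{"temp": 25}, {"temp": 31}, {"temp": 35}]:
    pipeline.process(reading)
```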
The key features in the architectural design of stream processors primarily arose from academic prototypes, before being extended by industrial-grade tools based on the commercialization of the academic efforts. These features are described for a subset of the prototypes below. Aurora/Borealis: The Aurora and Borealis [2,1] projects are first- and second-generation stream processing engines built in a collaboration by Brandeis,
D
Brown and MIT. The Aurora engine was implemented from scratch in C++, and included the basic architectural components described above to produce a single-site design. Aurora was a general-purpose engine that provided a relational operator set to be used to construct queries visually in an editor, as a workflow. This workflow-style programming paradigm (sometimes referred to as ‘‘boxes-and-arrows’’) differed significantly from similar research projects which focused more on providing stream-oriented language extensions to SQL. The Aurora architecture included a multi-threaded scheduler capable of supporting tuple-trains and superbox scheduling. Aurora also supported load shedding, the concept of selectively processing inputs in the presence of excessive load due to high-rate data streams. The Aurora engine also supported embedded tables to enable operators to share state. The embedded tables were implemented as BerkeleyDB stores and the core operator set included operators capable of performing a subset of SQL queries on these tables. The Borealis project extended the Aurora architecture for a multi-site deployment, and implemented components to address the novel challenges exposed by distributed execution. These included a decentralized catalog structure maintaining metadata for the set of deployed queries, and a distributed statistics collector and optimizer. The optimizer targeted distributed query optimizations such as spreading load across multiple machines to achieve both a balanced allocation, and resilience to changes in load, in addition to a distributed load shedding mechanism that factored in the allocation of operators to sites. Stream: The Stream project at Stanford [5] developed a C++ implementation of a stream processing engine with a similar high-level architecture to the design described above. The novel features of the Stream architecture included its use of the Continuous Query Language (CQL) which extended SQL with DDL statements to define data streams and subsequently provided DML clauses for several types of windowing operations on these streams. The core engine included a single-threaded scheduler that continuously executes operators based on a scheduling policy, while the operators implement nonblocking query execution through the use of queues. In addition to this basic query executor, the Stream project studied various resource management, query approximation and adaptive query processing
641
D
642
D
Data Stream Management Architectures and Prototypes
techniques. These included memory management techniques implemented by both a scheduling policy that executes groups of operators to minimize queue sizes, and by exploiting shared synopses (window implementations) and constraints in the arrival patterns of data streams (such as bounding the arrival delay between interacting inputs). In terms of query approximation, Stream provided both a load-shedding algorithm that approximates query results to reduce CPU load, in addition to synopsis compaction techniques that reduced memory requirements at operators. Finally, Stream is capable of adapting the running query through the aid of two components, a profiler that continuously collects statistics during query execution, and a reoptimizer component that maintains both filter and join orderings based on changing selectivities. TelegraphCQ: In contrast to the ground up design of Aurora and Stream, the TelegraphCQ project [8] at UC Berkeley developed a stream processing engine on top of the PostgreSQL open-source database. This approach allowed the reuse of several PostgreSQL components, such as the system catalogs, parser and optimizer. TelegraphCQ is divided into two high-level components, a frontend and a backend. The frontend is responsible for client interaction such as parsing and planning queries, in addition to returning query results. The TelegraphCQ backend is a continually executing process that performs the actual query processing, and adds query plans submitted by the frontend to the set of executable objects. The backend is implemented in a multi-threaded fashion enabling processing parallelism. The query executor in the TelegraphCQ backend is principally designed to support adaptive query processing through the use of Eddies to dynamically route tuples between a set of commutative operators (thus performing run-time reordering). The executor also leans heavily on exploiting opportunities for shared processing, both in terms of the state maintained internally within operators (such as aggregates), and in terms of common expressions used by selections through grouped filters. Finally, as a result of its PostgreSQL base, TelegraphCQ investigated query processing strategies combining the use of streamed data and historical data from a persistent source. Gigascope: The Gigascope data stream engine [10] was developed at AT&T Labs-Research to primarily study network monitoring applications, for example involving complex network and protocol analyses of
BGP updates and IP packets. Gigascope supports textual queries through GSQL, a pure stream query language that is a simplified form of standard SQL. GSQL queries are internally viewed as having a twolevel structure, where queries consist of high-level and low-level operators comprising a graph-structured program, depending on the optimization opportunities determined by a query optimizer. Low-level operators are extremely lightweight computations to perform preliminary filtering and aggregation prior to processing high-level operators, and in some cases these low-level operators may be performed on the network cards of the machines present in the network monitoring application. Gigascope queries are thus compiled into C and C++ modules and linked into a run-time system for highly-efficient execution. Gigascope also investigated the blocking properties of both high- and low-level operators and developed a heartbeat mechanism to effectively alleviate operators’ memory requirements. Nile: The Nile stream processing engine was developed at Purdue on top of the Predator [14] objectrelational DBMS. Nile implements data streams as an enhanced datatype in Predator and performs stream processing with the aid of a stream manager component. This stream manager is responsible for buffering input streams and handing data to the execution engine for query processing. Nile uses a separate thread for its stream manager, and performs round-robin scheduling for processing new inputs on streams. In addition to the basic stream processing engine design, the Nile project investigated various query correctness issues and optimization opportunities arising in the stream processing context. This included studying scheduling strategies to exploit resource sharing amongst queries, for example sharing windowed join operators between multiple queries, and pipelining mechanisms based on strategies to expire tuples in multiple windows. System S: The System S [12] project is a recent endeavor at IBM Research investigating large-scale distributed stream processing systems focusing primarily on analytical streaming applications through the use of data mining techniques. System S processes data streams with a dataflow-oriented operator network consisting of processing elements (PEs) that are distributed across a set of processing nodes (PNs) and communicate through a transport component known as the data fabric. Some of the prominent architectural features of System S include the design and implementation of streaming analytic
Data Suppression
operators, including clustering and decision-tree based algorithms, and appropriate resource management algorithms to support these types of operators, such as a variety of load shedding and diffusion algorithms. System S also leverages data-parallelism through a content-based load partitioning mechanism that spreads the processing of an input or intermediate stream across multiple downstream PEs.
Key Applications
Stream processing architectures have been motivated by, and used in, several domains, including:
Financial services: automated trading, market feed processing (cleaning, smoothing, and translation), smart order routing, real-time risk management and compliance (MiFID, RegNMS).
Government and Military: surveillance, intrusion detection and infrastructure monitoring, battlefield command and control.
Telecommunications: network management, quality of service (QoS)/service level agreement (SLA) management, fraud detection.
Web/E-business: click-stream analysis, real-time customer experience management (CEM).
URL to Code
Borealis: http://www.cs.brown.edu/research/borealis/public/
Stream: http://infolab.stanford.edu/stream/code/
Cross-references
▶ Continuous Query ▶ Data Stream ▶ Stream-oriented Query Languages and Operators ▶ Stream Processing ▶ Streaming Applications ▶ Windows
Recommended Reading
1. Abadi D., Ahmad Y., Balazinska M., Çetintemel U., Cherniack M., Hwang J.-H., Lindner W., Maskey A.S., Rasin A., Ryvkina E., Tatbul N., Xing Y., and Zdonik S. The design of the Borealis stream processing engine. In Proc. 2nd Biennial Conf. on Innovative Data Systems Research, 2005.
2. Abadi D.J., Carney D., Çetintemel U., Cherniack M., Convey C., Lee S., Stonebraker M., Tatbul N., and Zdonik S. Aurora: A new model and architecture for data stream management. VLDB J., 2003.
3. Arasu A., Babcock B., Babu S., McAlister J., and Widom J. Characterizing memory requirements for queries over continuous data streams. ACM Trans. Database Syst., 29(1):162–194, 2004.
4. Babcock B., Babu S., Datar M., Motwani R., and Thomas D. Operator scheduling in data stream systems. VLDB J., 13(4):333–353, 2004.
5. Babcock B., Babu S., Datar M., Motwani R., and Widom J. Models and issues in data stream systems. In Proc. 21st ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2002.
6. Babu S., Motwani R., Munagala K., Nishizawa I., and Widom J. Adaptive ordering of pipelined stream filters. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2004, pp. 407–418.
7. Carney D., Çetintemel U., Rasin A., Zdonik S.B., Cherniack M., and Stonebraker M. Operator scheduling in a data stream manager. In Proc. 29th Int. Conf. on Very Large Data Bases, 2003, pp. 838–849.
8. Chandrasekaran S., Deshpande A., Franklin M., and Hellerstein J. TelegraphCQ: Continuous dataflow processing for an uncertain world. In Proc. 1st Biennial Conf. on Innovative Data Systems Research, 2003.
9. Chen J., DeWitt D.J., Tian F., and Wang Y. NiagaraCQ: A scalable continuous query system for internet databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2000, pp. 379–390.
10. Cranor C.D., Johnson T., Spatscheck O., and Shkapenyuk V. Gigascope: a stream database for network applications. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003, pp. 647–651.
11. Gehrke J. (ed.). Data stream processing. IEEE Data Eng. Bull., 26(1), 2003.
12. Gedik B., Andrade H., Wu K.-L., Yu P.S., and Doo M. SPADE: The System S declarative stream processing engine. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2008.
13. Golab L. and Özsu M.T. Issues in data stream management. ACM SIGMOD Rec., 32(2):5–14, 2003.
14. Hammad M.A., Mokbel M.F., Ali M.H., Aref W.G., Catlin A.C., Elmagarmid A.K., Eltabakh M.Y., Elfeky M.G., Ghanem T.M., Gwadera R., Ilyas I.F., Marzouk M.S., and Xiong X. Nile: a query processing engine for data streams. In Proc. 20th Int. Conf. on Data Engineering, 2004, p. 851.
15. Tatbul N., Çetintemel U., Zdonik S.B., Cherniack M., and Stonebraker M. Load shedding in a data stream manager. In Proc. 29th Int. Conf. on Very Large Data Bases, 2003, pp. 309–320.
Data Stream Processing ▶ Stream Processing
Data Suppression ▶ Data Compression in Sensor Networks
Data Swapping ▶ Data/Rank Swapping
Data Time ▶ Valid Time
Data Tracking ▶ Data Provenance
Data Transformation
▶ Data Exchange

Data Translation
▶ Data Exchange

Data Types for Moving Objects
▶ Spatio-Temporal Data Types

Data Types in Scientific Data Management
AMARNATH GUPTA
University of California San Diego, La Jolla, CA, USA

Synonyms
Data sorts; Many-sorted algebra; Type theory

Definition
In mathematics, logic and computer science, the term "type" has a formal connotation. By assigning a variable to a type in a programming language, one implicitly defines constraints on the domains and operations on the variable. The term "data type" as used in data management derives from the same basic idea. A data type is a specification that concretely defines the "structure" of a data variable of that type, the operations that can be performed on that variable, and any constraints that might apply to them. For example, a "tuple" is a data type defined as a finite sequence (i.e., an ordered list) of objects, each of a specified type; it allows operations like "projection", popularly used in relational algebra. In science, the term "data type" is sometimes used less formally to refer to a kind of scientific data. For example, one would say "gene expression" or "4D surface mesh of a beating heart" is a data type.

Foundations
Commonly Used Data Types in Science Applications
There is a very large variety of data types used in scientific domains. The following data types are commonly used in several different scientific disciplines.
Arrays
Multidimensional arrays are heavily used in many scientific applications; they not only serve as a natural representation for many kinds of scientific data, but they are supported by programming languages, object-relational databases, many computational software libraries, as well as data visualization routines. The most common operation on arrays is index-based access to data values. However, more complex (and useful) operations can be defined on arrays. Marathe and Salem [6,7] defined an algebra on multidimensional arrays where a cell may contain a vector of values. The algebra derives from nested relational algebra, and allows one to perform value-based relational queries on arrays. Arrays are a very general data type and can be specialized with additional semantics. Baumann [1] defined a somewhat different array algebra for modeling spatiotemporal data for a system called RasDaMan. Reiner et al. [10] present a storage model for large-scale arrays.
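As a small illustration of the kind of value-based operations an array algebra supports (a generic sketch using NumPy, not the algebra of any particular system named above), the snippet below selects cells of a multidimensional array by value rather than by index and aggregates along one dimension.

```python
import numpy as np

# A 3D array of measurements, e.g., temperature over an x-y grid at several time steps.
data = np.random.default_rng(0).normal(20.0, 5.0, size=(4, 8, 8))

# Index-based access: the value at time step 2, grid cell (3, 5).
cell = data[2, 3, 5]

# Value-based (relational-style) selection: coordinates of all cells above 25 degrees.
hot = np.argwhere(data > 25.0)

# A simple aggregation along the time dimension, yielding an 8x8 array of means.
mean_field = data.mean(axis=0)
print(cell, hot.shape, mean_field.shape)
```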
Time-Series
Temporal data is a very important class of information for many scientific applications. Time-series data is a class of temporal data where the value of a variable changes at a roughly regular interval. On the other hand, the salary history of an employee is temporal data but not necessarily time-series data, because the change in salary can happen at arbitrary frequencies. Data from sensors of any kind (temperature,
seismic, strain gages, electrocardiograms and so on) come in the form of a stream of explicitly or implicitly time-stamped sequences of numbers (or characters). Time-series data collection and storage is based on granularity, and often different data are collected at different granularities and need to be queried together [8]. All database management systems assume a discrete time line of various granularities. A granularity is a partition of a time line, for instance into years, days, hours, or microseconds. Bettini et al. have formalized the notion of time granularities [2]. An interesting class of operations on time-series data computes the similarity between two time series. This operation can become complicated because one time series can be recorded at a different time resolution than another and may show local variations, but still have overall similarity in the shape of the data envelope. This has prompted the investigation of efficient operators to compute such similarity. Popivanov and Miller [9], for example, have developed a measure of time-series similarity based on wavelet decomposition.
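A minimal sketch of one simple way to compare two series recorded at different resolutions is shown below (resample to a common length, then take Euclidean distance); this is an illustrative baseline only, not the wavelet-based measure of [9].

```python
import numpy as np

def resample(series, length):
    # Linearly interpolate a series onto a common number of points.
    old_x = np.linspace(0.0, 1.0, num=len(series))
    new_x = np.linspace(0.0, 1.0, num=length)
    return np.interp(new_x, old_x, series)

def similarity_distance(a, b, length=128):
    # Smaller distance means more similar overall envelopes.
    ra = resample(np.asarray(a, dtype=float), length)
    rb = resample(np.asarray(b, dtype=float), length)
    return float(np.linalg.norm(ra - rb))

hourly = np.sin(np.linspace(0, 6.28, 24))        # coarse recording
minutely = np.sin(np.linspace(0, 6.28, 1440))    # fine recording of the same signal
print(similarity_distance(hourly, minutely))     # close to zero
```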
Finite Element Meshes
Numerical modeling of a physical system is fundamental to many branches of science. A well-known technique in this domain is finite element analysis, where a continuous domain (e.g., a 3D terrain) is partitioned into a mesh, and variables are recorded over the nodes of this mesh or over regions covering multiple cells of the mesh. Assume that Fig. 1 shows the distribution of electric potential over a continuous surface. Every node of the mesh will have a (positive or negative) charge value, while a variable like "zero charge region" will be defined over regions of the mesh. Figure 1 also illustrates that the mesh is not always regular: a region with a higher variation of data values will be partitioned into a finer mesh than a region with less variation. Finite element meshes are used in many modeling and visualization software packages. In cases where the size of the mesh is very large and complex manipulation of data (like repartitioning based on some conditions) is needed over the elements of the mesh, the existing software does not offer robust and scalable solutions. Recently, the CORIE system [5] has developed a systematic approach to modeling a general data structure, called a gridfield, to handle finite element meshes, along with an algebra for manipulating arbitrary gridded datasets and algebraic optimization techniques to improve the efficiency of operations.
Data Types in Scientific Data Management. Figure 1. A finite element mesh.
Graphs
Like arrays, graphs form a ubiquitous data type used in many scientific applications. Eckman and Brown [4] describe the use of graphs in molecular and cell biology. In their case, graphs denote relationships between biomolecular entities (A and B) that constitute molecular interaction networks, representing information like A is similar to B, A interacts with B, A regulates the expression of B, A inhibits the activity of B, A stimulates the activity of B, A binds to B, and so forth. Operators on the graph data type include those that extract a subgraph from a large graph, compare one graph to another, transform one graph to another, decompose a graph into its nodes and edges, compute the intersection, union, or disjunction of two graphs, and compute structural derivatives such as transitive closure and connected components. In chemistry, data mining techniques are used to find the most frequent subgraphs of a large number of graphs. Graphs also play a dominant role in the social sciences, where social network analysts are interested in the analysis of the connectivity structure of graphs. A class of operations of interest centers around the notion of aggregate properties of the graph structure. One such property is centrality, a quantity that measures, for each node in a graph, how well the node is connected to the rest of the nodes in the graph. This class of measures has been investigated in the context of data mining [11], where the task was to find the most likely subgraph ''lying between'' a
set of query-defined nodes in the graph. While there are many alternative definitions of centrality, it should be clear that computing these aggregate values on the fly requires a different kind of data representation and different operators than in the previous case, where traversal and subgraph operations were dominant.
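For concreteness, the sketch below computes one particular centrality measure, closeness centrality (one of the many alternative definitions, cf. [3]), by breadth-first search over a small adjacency-list graph; the graph and names are illustrative only.

```python
# Sketch: closeness centrality computed with breadth-first search over an
# adjacency-list graph. Illustrative toy data only.
from collections import deque

graph = {                       # undirected interaction-style graph
    'A': ['B', 'C'], 'B': ['A', 'C', 'D'], 'C': ['A', 'B'], 'D': ['B', 'E'], 'E': ['D'],
}

def closeness(graph, source):
    """Closeness = (n - 1) / sum of shortest-path distances from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(d for node, d in dist.items() if node != source)
    return (len(dist) - 1) / total if total else 0.0

# Node B is the best connected node in this small graph.
print(max(graph, key=lambda n: closeness(graph, n)))   # 'B'
```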
Some Basic Issues about Scientific Data Types
While database systems do not offer extensive support for scientific data types, many specialized software libraries do, and hence they are widely used by the scientific community. This leads to a fundamental problem, as observed by the authors of [5]. On the one hand, the performance of SQL queries for manipulating large numeric datasets is not competitive with that of specialized tools. For example, database extensions for processing multidimensional discrete data can only model regular, rectilinear grids (i.e., arrays). On the other hand, specialized software products such as visualization libraries are designed to process arbitrary gridded datasets efficiently. However, no algebra has been developed to simplify their use and afford optimization. In almost all cases, these libraries are data dependent – physical changes to data representation or organization break user programs. This basic observation about type-specific scientific software holds for almost all scientific data types. It calls for future research on developing storage structures and algebraic manipulation for scientific data types, as well as for an effort to incorporate these techniques into scientific data management systems. A second basic problem regarding scientific data types arises from the fact that the same data can be viewed differently for different forms of analysis, and thus needs to support multiple representations, storage structures, and indexes. Consider the data type of character sequences often used in genomic studies. If S is a sequence, it is common to determine whether S is ''similar to'' another sequence T, where T may have additional or missing characters with respect to S. It has been shown that a suffix-tree-like representation of sequences is suitable for operations of this category. However, in biology, scientists are also interested in an arbitrary number of subsequences of the same sequence S, to each of which they assign an arbitrary number of properties (called ''annotations'' in biology). Finding similar subsequences is not a very common operation in this case. The focus is rather on interval operations like finding all subsequences
overlapping a given interval that satisfy some conditions on their properties, and on finding the 1D spatial relationships among subsequences that satisfy some given properties. These operations possibly require a different storage and access structure, such as an interval tree. Since in a scientific application both kinds of operations may be necessary, it becomes important for the data management system to handle the multiplicity of representations and operations so that the right representation can be chosen at run time for efficient access.
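The following sketch illustrates such an interval operation: given annotated subsequences of a sequence, it returns those overlapping a query interval whose properties satisfy a condition. A linear scan is shown for brevity; an interval tree would answer the same query more efficiently on large annotation sets. The annotation data are hypothetical.

```python
# Sketch: interval-style query over annotated subsequences of a genomic
# sequence. Illustrative only.
annotations = [
    # (start, end, properties) -- positions on some sequence S, hypothetical data
    (10, 50,  {'type': 'exon'}),
    (40, 120, {'type': 'binding_site'}),
    (200, 260, {'type': 'exon'}),
]

def overlapping(annotations, q_start, q_end, condition=lambda p: True):
    """Annotated subsequences overlapping [q_start, q_end] whose properties satisfy condition."""
    return [(s, e, p) for (s, e, p) in annotations
            if s <= q_end and e >= q_start and condition(p)]

# Exons overlapping the interval [30, 100]:
print(overlapping(annotations, 30, 100, lambda p: p['type'] == 'exon'))
# -> [(10, 50, {'type': 'exon'})]
```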
Key Applications
Bioinformatics, cheminformatics, engineering databases.
Cross-references
▶ Graph Data Management in Scientific Applications ▶ Mining of Chemical Data ▶ Storage of Large Scale Multidimensional Data
Recommended Reading
1. Baumann P. A database array algebra for spatio-temporal data and beyond. In Proc. Fourth Int. Workshop on Next Generation Information Technologies and Systems, 1999, pp. 76–93.
2. Bettini C., Jajodia S., and Wang S.X. Time Granularities in Database, Data Mining, and Temporal Reasoning. Springer, 2000.
3. Borgatti S.P. and Everett M.G. A graph-theoretic perspective on centrality. Soc. Netw., 28(4):466–484, 2006.
4. Eckman B.A. and Brown P.G. Graph data management for molecular and cell biology. IBM J. Res. Dev., 50(6):545–560, 2006.
5. Howe B. and Maier D. Algebraic manipulation of scientific data sets. VLDB J., 14(4):397–416, 2005.
6. Marathe A.P. and Salem K. A language for manipulating arrays. In Proc. 23rd Int. Conf. on Very Large Data Bases, 1997, pp. 46–55.
7. Marathe A.P. and Salem K. Query processing techniques for arrays. ACM SIGMOD Rec., 28(2):323–334, 1999.
8. Merlo I., Bertino E., Ferrari E., Gadia S., and Guerrini G. Querying multiple temporal granularity data. In Proc. Seventh Int. Conf. on Temporal Representation and Reasoning, 2000, pp. 103–114.
9. Popivanov I. and Miller R.J. Similarity search over time-series data using wavelets. In Proc. 18th Int. Conf. on Data Engineering, 2002, pp. 212–221.
10. Reiner B., Hahn K., Höfling G., and Baumann P. Hierarchical storage support and management for large-scale multidimensional array database management systems. In Proc. 13th Int. Conf. Database and Expert Syst. Appl., 2002, pp. 689–700.
11. Tong H. and Faloutsos C. Center-piece subgraphs: problem definition and fast solutions. In Proc. 12th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2006, pp. 404–413.
Data Types: Image, Video, Pixel, Voxel, Frame ▶ Biomedical Image Data Types and Processing
Data Uncertainty Management in Sensor Networks
SUNIL PRABHAKAR1, REYNOLD CHENG2
1Purdue University, West Lafayette, IN, USA
2The University of Hong Kong, Hong Kong, China
Synonyms
Imprecise data; Probabilistic data; Probabilistic querying
Definition
Data readings collected from sensors are often imprecise. The uncertainty in the data can arise from multiple sources, including measurement errors due to the sensing instrument and discrete sampling of the measurements. For some applications, ignoring the imprecision in the data is acceptable, since the range of the possible values is small enough not to significantly affect the results. However, for others it is necessary for the sensor database to record the imprecision and also to take it into account when processing the sensor data. This is a relatively new area for sensor data management. Handling the uncertainty in the data raises challenges in almost all aspects of data management. This includes modeling, semantics, query operators and types, efficient execution, and user interfaces. Probabilistic models have been proposed for handling the uncertainty. Under these models, data values that would normally be single values are transformed into groups of data values or even intervals of possible values.
Historical Background
The management of uncertain data in database management systems is a relatively new topic of research, especially for attribute-level uncertainty. Earlier work has addressed the case of tuple-level uncertainty and also node-level uncertainty for XML data. The earliest work on attribute-level uncertainty is in the area of moving object databases. In order to reduce the need for very frequent updates from moving objects, the frequency of the updates is reduced at the expense of uncertainty
in the location of the object (in the database). For example, the notion of a dead-reckoning update policy allows an object to not report updates as long as it has not moved by more than a certain threshold from its last update. In most of the earlier works, the use of probability distributions of values inside an uncertainty interval as a tool for quantifying uncertainty was not considered. Further, discussions of queries on uncertain data were often limited to the scope of aggregate functions or range queries. A model for probabilistic uncertainty was proposed for moving objects and later extended to general numeric data for sensors in [2]. A probabilistic data model for data obtained from a sensor network was described in [6]. Past data are used to train the model through machine-learning techniques and obtain information such as data correlations, time-varying functions of probability distributions, and how probability distributions are updated when new sensor values are acquired. Recently, new relational models have been proposed to manage uncertain data. These projects include MauveDB [9], Mystiq [6], Orion [15], and Trio [1]. Each of these projects aims to develop a novel database system for handling uncertain data.
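A minimal sketch of the dead-reckoning idea mentioned above is shown below; the threshold value and positions are illustrative assumptions, not taken from any particular system.

```python
# Sketch of a dead-reckoning update policy: a moving object reports its
# position only when it has drifted more than a threshold from the last
# reported position; between updates, the database location is uncertain
# within a disc of that radius. Illustrative only.
import math

THRESHOLD = 5.0          # maximum tolerated deviation (uncertainty radius)

def needs_update(last_reported, current):
    """True if the object must send a location update to the database."""
    dx = current[0] - last_reported[0]
    dy = current[1] - last_reported[1]
    return math.hypot(dx, dy) > THRESHOLD

last_reported = (0.0, 0.0)
for position in [(1.0, 1.0), (3.0, 2.0), (4.0, 4.0), (6.0, 3.0)]:
    if needs_update(last_reported, position):
        last_reported = position          # send the update, reset the reference point
print(last_reported)   # (4.0, 4.0): only the third position exceeded the threshold
```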
Foundations
Modeling Uncertainty
Uncertainty in sensor data is often the result of either inherent limitations in the accuracy with which the sensed data is acquired or limitations imposed by concerns such as efficiency and battery life. Consider, for example, a moving object application that uses GPS devices to determine the locations of people as they move about. Although GPS accuracy has improved significantly, it is well known that the location reported by a GPS sensor is really an approximation – in fact, the actual location is likely to be distributed with a Gaussian probability distribution around the reported location. This is an example of uncertainty due to the limitation of the measurement instrument. Since most sensors are powered by batteries that can be quickly depleted, most sensor applications take great pains to conserve battery power. A common optimization is to not measure and transmit readings continuously. Instead, the data are sampled at some reasonable rate. In this case the exact values are only known at the time instants when samples are taken.
Between samples, the application can only estimate (based on the earlier samples) the values. For certain sensors, the battery overhead for taking certain types of measurements is much lower than that for others. Furthermore, the cheaper readings are correlated with the more expensive readings. This allows the sensor to estimate the costlier reading by taking a cheaper reading and exploiting the correlation. However, the estimate is not exact, which introduces some uncertainty. Even when sensor readings are precise and frequently sampled, uncertainty can creep in. For example, if a given sensor is suspected of being faulty or compromised, the application may only partially trust the data provided by the sensor. In these cases, the data are not completely ignored but their reliability can be reduced. Alternatively, sensor input may be processed to generate other information – e.g., face detection on video data from a sensor. Post-processing methods may not yield certain matches – the face detection algorithm may have a known degree of error or may give a degree of confidence with which it has detected a face (or a given person). In these cases, the unreliability of the raw or processed sensor data can be captured as uncertain data. Each of these examples shows that sensor readings are not precise. Instead of data having a definite discrete value, data has numerous alternative values, possibly with associated likelihoods (probabilities). The types of uncertainty in sensor data can be divided into two categories:
Discrete uncertainty. Instead of a single value, a data item could take on one out of a set of alternative values. Each value in this set may further be associated with a probability indicating the likelihood of that particular value being the actual value.
Continuous uncertainty. Instead of a single value, a data item can take on any one value within an interval. In addition, there may be an associated probability density function (pdf) indicating the distribution of probabilities over this interval.
In each of these cases, the total probability may or may not sum to 1 for each data item. Several models for handling probabilistic data based upon the relational data model have been proposed in the literature. Most of these models can only handle discrete data, wherein each alternative value for a given data item is stored in the database along with its associated probability. Extra rules are imposed over these records to indicate that only one of the alternative values for a given data item will actually occur. The Orion model is explicitly designed for handling continuous uncertainty. Under this model, uncertain attributes can be expressed as intervals with associated pdfs or as a discrete set. Representing probabilities symbolically as pdfs instead of enumerating every single alternative allows the model to handle continuous distributions.
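As a sketch of how continuous uncertainty might be represented and queried (illustrative only, not the interface of Orion or any other system mentioned above), consider an attribute modeled as an interval with a uniform pdf:

```python
# Sketch: an uncertain sensor reading represented as an interval with a
# (uniform) probability density function, and the probability that the actual
# value lies inside a query range. Illustrative only.
class UniformUncertainValue:
    def __init__(self, low, high):
        self.low, self.high = low, high          # uncertainty interval [low, high]

    def prob_in_range(self, a, b):
        """P(value in [a, b]) under a uniform pdf over [low, high]."""
        overlap = max(0.0, min(b, self.high) - max(a, self.low))
        return overlap / (self.high - self.low)

reading = UniformUncertainValue(20.0, 30.0)      # e.g., temperature known only to lie in [20, 30]
print(reading.prob_in_range(25.0, 40.0))         # 0.5
print(reading.prob_in_range(25.0, 40.0) > 0.4)   # passes a probability threshold of 0.4
```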
Queries
As data becomes imprecise, there is a direct impact on the nature of query results. Figure 1 shows an example of points in two-dimensional space, a range query (Q), and a nearest-neighbor query (q) in two cases: (i) with no uncertainty; and (ii) with different types of uncertainty for different objects. Consider the two-dimensional range query Q shown in Fig. 1a. The result of the query is the set of identities of those points that fall within the range of the query – Points b and d in this example. If the data is imprecise (as in Fig. 1b), the data consist of regions of space (possibly with associated probability distributions). Some of these
Data Uncertainty Management in Sensor Networks. Figure 1. A two-dimensional example: (a) exact points with no uncertainty; (b) points with uncertainty.
regions may clearly lie outside the query region and the corresponding moving objects are thus excluded from the answer (e.g., Point a). Those that lie completely within the query region are included in the answer, as in the case of precise point data (e.g., Point b). However, those objects that partially overlap the query region represent points that may or may not actually be part of the query answer (Points d and e). These points may be reported as a special subset of the answer. In [14], Future Temporal Logic (FTL) was proposed for processing location-based queries over uncertain data with no probability information. Thus an object is known to be located somewhere within a given spatial region. Queries are augmented with either MUST or MAY keywords. With the MUST keyword, objects that have even a small chance of not satisfying a query are not included in the results. On the other hand, with the MAY keyword, all objects that have even a remote chance of satisfying a query are included. FTL therefore provides some qualitative treatment for queries over uncertain data. With the use of probability distributions, it is possible to give a more quantitative treatment to queries over uncertain data. In addition to returning the query answers, the probability of each object satisfying the query can be computed and reported. In order to avoid reporting numerous low-probability results, queries can be augmented with a probability threshold, t. Only those objects that have a probability greater than t of satisfying the query are reported. This notion of probabilistic queries was introduced in [2]. Most work on uncertain data management gives a quantitative treatment to queries. It should be noted that the MUST and MAY semantics can be achieved by choosing the threshold to be 1 or 0, respectively. An important issue with regard to queries over uncertain data is the semantics of the query. What exactly does it mean to execute an arbitrary query over uncertain data? Most researchers have adopted the well-established possible worlds semantics (PWS) [10]. Under PWS, a database with uncertain (probabilistic) data consists of numerous probabilistic events. Depending upon the outcome of each of these events, the actual database is one out of an exponential number of possible worlds. For example, consider a single relation with two attributes: Sensor_id and reading. Assume there is a single tuple in this table, with Sensor_id S1, and an uncertain reading which could be 1 with probability 0.3 and 2 with probability 0.7. This uncertain database consists of a
single event, and there are two possible worlds: in one world (W1), the relation consists of the single tuple <S1, 1>; in world W2, the relation consists of the single tuple <S1, 2>. Furthermore, the probability of W1 is 0.3 and that of W2 is 0.7. In general, with multiple uncertain events, each world corresponds to a given outcome of each event, and the probability of the world is given by the product of the probabilities of each event that appears in the world. It should be noted that there is no uncertainty in a given world. Each world looks like a regular database relation. Under PWS, the semantics of a query are as follows. Executing a query over an uncertain database is conceptually composed of three steps: (i) generate all possible worlds for the given data with associated probabilities; (ii) execute the query over each world (which has no uncertainty); and (iii) collapse the results from all possible worlds to obtain the uncertain result to the original query. While PWS provides very clean semantics for any query over an uncertain database, it introduces challenges for efficient evaluation. First, if there is continuous uncertainty in the data, then there are an infinite number of possible worlds. Even when there are a finite number of possible worlds, the total number is exponential in the number of events. Thus it is impractical to enumerate all worlds and execute the query over each one. Techniques to avoid enumerating all worlds while computing the query correctly were proposed in [6]. They showed that there is a class of safe queries over uncertain data which can be computed using query plans similar to those for certain data.
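The example above can be made concrete with a small sketch that enumerates the possible worlds explicitly and sums the probabilities of those worlds in which a query holds; this is purely conceptual, since, as noted, enumeration does not scale.

```python
# Sketch: the possible-worlds semantics of the example above, enumerated
# explicitly. The query "reading > 1" is evaluated in each world and the
# probabilities of the worlds in which it holds are summed.
# Conceptual illustration only -- enumeration is impractical for real databases.
from itertools import product

# One uncertain tuple: sensor S1 reads 1 with probability 0.3, 2 with probability 0.7.
uncertain_relation = {'S1': [(1, 0.3), (2, 0.7)]}

def possible_worlds(relation):
    sensors = list(relation)
    for choice in product(*(relation[s] for s in sensors)):
        world = {s: value for s, (value, _) in zip(sensors, choice)}
        prob = 1.0
        for _, p in choice:
            prob *= p
        yield world, prob

# Probability that the query "S1's reading is greater than 1" is satisfied.
answer_prob = sum(prob for world, prob in possible_worlds(uncertain_relation)
                  if world['S1'] > 1)
print(answer_prob)      # 0.7
```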
Implementation
With the goal of supporting PWS over uncertain data, systems that support uncertainty need to define probabilistic versions of database operators such as selection, projection, cross products, and comparison operators. Typically, this involves operations over the probability distributions of the data, and tracking dependencies that are generated as a result of processing. Efficient management of dependencies between derived data is among the greatest challenges for uncertain data management. The base data in an uncertain database are assumed to be independent (with the exception of explicit dependencies that are expressed in the base data). However, as these data are used to produce other data, the derived data may no longer be independent of each other [6]. These dependencies
affect the correct evaluation of query operators. To correctly handle dependencies, it is necessary to track them. Thus the model has to be augmented to store not only the data, but also the dependencies among them. In the Trio system this information is called Lineage, the Orion model calls it History, and the MauveDB model handles dependencies using factor tables. As data is processed multiple times, the size and complexity of this dependency information can grow significantly. Efficient handling of this information is currently an active area of research. Query processing algorithms over uncertain data have been developed for range queries [13], nearest-neighbor queries [2,11], and skyline queries [12]. Efficient join algorithms over uncertain data have been proposed in [3]. Deshpande et al. [7] studied the problem of answering probabilistic queries over data streams. They proposed algorithms to return results with minimum resource consumption. In [5], Cormode et al. proposed space- and time-efficient algorithms for approximating complex aggregate queries over probabilistic data streams. For queries that cannot be correctly processed using these modified operators and safe query plans, one alternative is to use approximation techniques based upon sampling. Samples of possible worlds can be drawn using the probabilities of the various events that make up the uncertain database. The query is then executed on these sample worlds and the results are aggregated to obtain an approximation of the true answer.
Indexing
Indexing is a well-known technique for improving query performance. Indexing uncertain data
presents some novel challenges. First, uncertain data do not have a single value as is the case for traditional data. Consequently, indexes such as B+-trees (and also hash indexes, since hashing requires exact matches) are inapplicable. By treating the uncertain intervals (regions) as spatial data, it is possible to use spatial indexes, such as R-trees or interval indexes, over uncertain attributes. These indexes can provide pruning based upon the extent and location of the uncertainty intervals alone. However, these index structures do not consider probability information, and are therefore incapable of exploiting probability for better evaluation. This is especially true in the case of probabilistic threshold queries. There has been some recent work on developing index structures for uncertain data [4,11,13]. These index structures take the probability distribution of the underlying data into account. In particular, the Probability Threshold Index (PTI) is based on a modification of a one-dimensional R-tree. Each entry in this R-tree variant is augmented with multiple Minimum Bounding Rectangles (MBRs) to facilitate pruning. The extra MBRs are called x-bounds. Consider a one-dimensional data set. An MBR can be viewed as a pair of bounds: a left bound that is the right-most line that lies to the left of every object in the given node; and a right bound that is the left-most line that lies to the right of every object in the given node. The notion of x-bounds is similar, except that a left-x-bound is the right-most line that ensures that no object in the given node has a probability greater than x of lying to the left of this bound. The right-x-bound is similarly defined. Figure 2 shows an example of these bounds. Using these
Data Uncertainty Management in Sensor Networks. Figure 2. An example of X-Bounds for PTI.
bounds it is possible to achieve greater pruning, as shown by the range query in the figure. This query has a threshold bound of 0.4. Even though the query intersects with the overall MBR, using the right-0.4-bound it is clear that there is no need to visit this subtree for this query. Since the query does not cross the right-0.4-bound, there can be no objects under this node that have a probability greater than 0.4 of overlapping with the query.
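A simplified sketch of this pruning test is given below for a query interval that lies entirely to the right of a node's right-x-bounds; it illustrates the condition described above, not the actual PTI implementation of [4], and the bound values are hypothetical.

```python
# Sketch of x-bound pruning: if the query interval lies entirely to the right
# of a node's right-x-bound for some x not exceeding the query's probability
# threshold, every object under the node has probability at most x (hence at
# most the threshold) of overlapping the query, so the subtree is skipped.
# Illustrative only; hypothetical bound values.
def can_prune(right_x_bounds, query_start, threshold):
    """right_x_bounds maps an x value to the position of the node's right-x-bound."""
    return any(x <= threshold and query_start >= bound
               for x, bound in right_x_bounds.items())

node_bounds = {0.2: 40.0, 0.4: 55.0}       # right-x-bounds of one index node
print(can_prune(node_bounds, 58.0, 0.4))   # True: the query lies beyond the right-0.4-bound
print(can_prune(node_bounds, 30.0, 0.4))   # False: the query crosses the bounds, so visit the node
```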
Key Applications
Uncertainty in sensor data is found in virtually all applications of sensors. For many applications, however, it may be acceptable to ignore the uncertainty and treat a given value as a reasonable approximation of the sensor reading. For others, such approximations and the resulting errors in query answers are unacceptable. In order to provide correct answers for these applications it is necessary to handle the uncertainty in the data. Examples include location-based services and applications that introduce uncertainty in order to provide some degree of privacy.
Future Directions
Work on the problem of handling uncertain data in sensor databases has only just begun. Much remains to be done. A long-term goal of several current projects is the development of a full-fledged database management system with native support for uncertain data as first-class citizens. Examples of current systems include Orion, MauveDB, Mystiq, and Trio. Immediate steps in building such systems include the development of query optimization techniques. This includes cost estimation methods, query plan enumeration techniques, and approximate query evaluation methods. In addition, an important facet of system development is the user interface. Interesting issues for user interfaces include: How do users make sense of the probabilistic answers? How do they input probabilistic data and pose queries? Are new query language constructs needed? Should the probabilistic nature of the data be hidden from the user or not?
Cross-references
▶ Data Storage and Indexing in Sensor Networks ▶ Location-Based Services ▶ Moving Objects Databases and Tracking ▶ Probabilistic Databases ▶ R-Tree (and family)
Recommended Reading
1. Benjelloun O., Sarma A.D., Halevy A., and Widom J. ULDBs: databases with uncertainty and lineage. In Proc. 32nd Int. Conf. on Very Large Data Bases, 2006, pp. 953–964.
2. Cheng R., Kalashnikov D., and Prabhakar S. Evaluating probabilistic queries over uncertain data. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2003.
3. Cheng R., Singh S., Prabhakar S., Shah R., Vitter J., and Xia Y. Efficient join processing over uncertain data. In Proc. ACM 15th Conf. on Information and Knowledge Management, 2006.
4. Cheng R., Xia Y., Prabhakar S., Shah R., and Vitter J. Efficient indexing methods for probabilistic threshold queries over uncertain data. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004.
5. Cormode G. and Garofalakis M. Sketching probabilistic data streams. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2005, pp. 143–154.
6. Dalvi N. and Suciu D. Efficient query evaluation on probabilistic databases. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004.
7. Deshpande A., Guestrin C., Hong W., and Madden S. Exploiting correlated attributes in acquisitional query processing. In Proc. 21st Int. Conf. on Data Engineering, 2005.
8. Deshpande A., Guestrin C., Madden S., Hellerstein J., and Hong W. Model-driven data acquisition in sensor networks. In Proc. 30th Int. Conf. on Very Large Data Bases, 2004.
9. Deshpande A. and Madden S. MauveDB: supporting model-based user views in database systems. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 2006, pp. 73–84.
10. Halpern J.Y. Reasoning About Uncertainty. MIT Press, Cambridge, MA, 2003.
11. Ljosa V. and Singh A. ALPA: indexing arbitrary probability distributions. In Proc. 23rd Int. Conf. on Data Engineering, 2007.
12. Pei J., Jiang B., Lin X., and Yuan Y. Probabilistic skylines on uncertain data. In Proc. 33rd Int. Conf. on Very Large Data Bases, 2007.
13. Singh S., Mayfield C., Prabhakar S., Shah R., and Hambrusch S. Indexing uncertain categorical data. In Proc. 23rd Int. Conf. on Data Engineering, 2007.
14. Sistla P.A., Wolfson O., Chamberlain S., and Dao S. Querying the uncertain positions of moving objects. In Temporal Databases: Research and Practice, 1998.
15. The Orion Uncertain Database Management System. Available at: http://orion.cs.purdue.edu/
Data Utility Measures ▶ Information Loss Measures
Data Visualizations ▶ Dense Pixel Displays
Data Visualization
HANS HINTERBERGER
ETH Zurich, Zurich, Switzerland
Synonyms
Graphic representation of data; Information visualization
Definition
Data Visualization: (i) Interpreting information in visual terms by forming a mental picture based on data. (ii) Applying suitable methods to put data into visible form. This definition is consistent with the Oxford English Dictionary definitions of 'data': Facts, esp. numerical facts, collected together for reference or information and of 'visualization': (i) The action or fact of visualizing; the power or process of forming a mental picture or vision of something not actually present to the sight; a picture thus formed. (ii) The action or process of rendering visible. The first part of the definition refers to the human cognitive activity of forming a mental picture, independent of how something is represented. If this is the only activity of interest, then the term 'information visualization' is more commonly used. Similarly, 'data visualization' is often reduced to the second part of the definition. Some authors explicitly include the computer and cognition in their definition of visualization: The use of computer supported, interactive, visual representations of data to amplify cognition [1]. Others emphasize how data visualization differs from information visualization: data visualization is for exploration, for uncovering information, as well as for presenting information. It is certainly a goal of data visualization to present any information in the data, but another goal is to display the raw data themselves, revealing the inherent variability and uncertainty [16].
Historical Background
Up to the 15th Century. Over eight-thousand-year-old maps, carved in stone, suggest that the visualization of information is as old as civilization (Fig. 1). The earliest known data visualization, a time series plot depicting the changing values of several planets' positions, is estimated to have appeared in the tenth century (Fig. 2). In the middle of the fourteenth century, Nicole Oresme introduced an early form of coordinate graphing. He marked points in time along a horizontal line and for each of these points he drew a bar whose length represented the object's velocity at that moment.
Data Visualization. Figure 1. Ca. 6200 BC. The oldest known map, from a museum at Konya, Turkey.
Data Visualization. Figure 2. Ca. 950. First known graphic of a time series visualizing data of planetary orbits.
1500–1800. Meteorological maps, showing the prevalence of winds on a geographical map, date back to 1686. In 1782, Marcellin du Carla-Boniface issued the first modern topographical maps (Fig. 3) and August Crome printed the first thematic map, showing economic production data across Europe. Also in this period appeared the first graphics used for descriptive statistics, for example Christiaan Huygens' plot of a function to graphically determine the years of life remaining given the current age, published in 1669. William Playfair, an English political economist, laid the ground for business graphics in 1786 with his Commercial and Political Atlas, in which he documented commercial and political time series using curves, bar charts, and column charts (Fig. 4). Playfair is also arguably the inventor of the pie chart (Statistical Breviary, 1801).
Data Visualization. Figure 3. 1782. Detail of Marcellin du Carla-Boniface's topographical map.
Data Visualization. Figure 4. 1786. William Playfair's chart, depicting prices, wages, and the reigns of British kings and queens.
1800–1949. Scientists and civil servants began to use thematic maps and statistical graphics to support their arguments. Famous examples are the dot map that Dr. John Snow drew by plotting the locations of deaths from cholera in central London during the 1854 epidemic (Fig. 5) and Florence Nightingale's comparison of deaths due to injuries in combat and deaths due to illness in an army hospital, for which she invented her own graphic, the Polar-Area Diagram (Fig. 6). The second half of the nineteenth century saw a 50-year-long debate on the standardization of statistical maps and diagrams, which failed to produce concrete results. Early in the twentieth century there followed a 50-year period of consolidation in which the accomplishments of the previous one hundred years became widely accepted.
Data Visualization. Figure 5. 1854. Detail of the pioneering statistical map drawn by John Snow to illustrate patterns of disease.
Data Visualization. Figure 6. 1858. Polar-Area diagram, invented by Florence Nightingale to convince authorities of the need to improve sanitary conditions in hospitals.
1950–1975. Researchers started to face enormous challenges when analyzing the deluge of data produced by electronic equipment that was put in use after the Second World War. In this environment, John Tukey led the way, established the field of Exploratory Data Analysis [15] (Fig. 8), and sparked a flurry of activities that have produced, and still are producing, many novel graphical methods [5,8], including techniques that marked the beginning of dynamic statistical graphics. In the 1960s, when signals from remote sensing satellites needed to be processed graphically, geographers started to combine spatially referenced data, spatial models, and map-based visualizations in geographic information systems. The French cartographer Jacques Bertin worked on a theory of graphics [3] and introduced, with his reorderable matrix, an elegant technique to graphically process quantitative data [2] (Fig. 7).
From 1975 to the present. Fast, interactive computers connected to high-resolution color graphic displays created almost unlimited possibilities for scientific visualizations of data generated by imaging techniques, computational geometry, or physics-based models [10]. Event-oriented programming made it easy to link different data displays, encouraging new techniques such as brushing [5]. Statisticians started to tackle high-dimensional data by interactively 'touring' low-dimensional projections. Large display areas encouraged graphic methods based on multiple plots [5], space-filling techniques (e.g., mosaic plots), and graphics with high data densities. For an overview the reader is referred to [1,5,16]. In the early 1990s, virtual environments were introduced as methods to immersively investigate scientific data [4] (Fig. 10). Another way to overcome the restrictions of two-dimensional displays was shown by Alfred Inselberg with his concept of parallel coordinates [9], today a ubiquitous method to visualize multidimensional data (Fig. 9).
To further explore the history of data visualization the reader is referred to [7,14].
Foundations
The literature on the scientific fundamentals of data visualization falls into three independent but related fields: (i) computer graphics, (ii) presentation techniques, and (iii) cognition. Computer graphics, primarily the domain of computer scientists and mathematicians, builds on elementary principles in the following broad areas: visualization algorithms and data structures, modeling and (numerical) simulation, (volume) rendering, particle tracing, grid generation, wavelet transforms, multiscale and multiresolution methods, as well as optics and color theory. A more exhaustive treatment of computer graphics fundamentals related to data visualization can be found in [10,12,17]. Most of the literature on presentation techniques can be found in statistics and computer science, although economists and cartographers also made substantial contributions. The publications of John Tukey [15] and Andrew Ehrenberg [6] show how tables can be used as a simple but effective presentation technique to organize data and demonstrate the method's usefulness for statistics and data analysis. In 1967 Jacques Bertin formulated a comprehensive theory for a graphical system [3] and subsequently applied parts of it to graphic information processing [2]. Other classics were published in 1983, when William Cleveland wrote an excellent methodological resource for the design of plots and Edward Tufte published his review on the graphical practice to visualize quantitative data. A reference featuring perhaps the most complete listing of graphs, maps, tables, diagrams, and charts has been compiled by Robert Harris [8]. Parallel coordinates is one of the leading methodologies for multidimensional visualization [9]. Starting from geometric foundations, Al Inselberg explains how n-dimensional lines and planes can be represented in 2D through parallel coordinates. The most recent publications explain mostly dynamic, interactive methods. Antony Unwin concentrates on graphics for large datasets [16], while Robert Spence favors techniques that allow user interaction [13]. Ultimately, to be of any use, data visualization must support human cognition. The challenges this raises are of interest to cognitive scientists, psychologists, and computer scientists specializing in human-computer
Data Visualization. Figure 7. 1967. Bertin’s reorderable matrix, a visualization method embedded in a comprehensive theory of graphics.
interaction. Rudolf Arnheim investigated the role of visual perception as a crucial cognitive activity of reasoning [1]. Also in the domain of ‘visual thinking’ is the work of Colin Ware [17] as he, among other contributions, proposes a foundation for a science of data visualization based on human visual and cognitive processing. Card et al. discuss topics of computer graphics as well as presentation techniques with a focus on how different methods support cognition. A more general approach worth mentioning is taken by Donald Norman when he argues that people deserve information appliances that fit their needs and lives [11]. To summarize, most of the literature on data visualization describes the efforts of computer scientists and cognitive scientists to develop new techniques for people to interact with data, from small statistical datasets to large information environments.
Data Visualization. Figure 8. 1977. Tukey's box-and-whisker plot effectively summarizes key characteristics of the data's distribution.
Key Applications
There are few – if any – application areas that do not benefit from data visualization simply because graphical methods assist the fundamental human activity of cognition and because in an increasingly digital world people are flooded with data. In the following four areas, data visualization plays a key role:
Statistics
Descriptive statistics has traditionally been the strongest customer for data visualization, primarily through its application to support exploratory data analysis. The use of data visualization as part of descriptive statistics has become a matter of fact wherever data are being collected.
Data Visualization. Figure 9. 1999. A continued mathematical development of parallel coordinates led to software for ‘visual data mining’ in high dimensional data sets.
Information Systems
Data visualization has become an important component in the interface to information systems simply because information is stored as data. The process of recovering information from large and complex databases often depends on data mining techniques. Visual data mining – a new and rapidly growing field – supports people in their data exploration activity with graphical methods. Geographic information systems have traditionally been key applications, particularly for map-based visualizations.
Data Visualization. Figure 10. 2006. Immersive Geovisualization at West Virginia University.
Documentation
Ever since thematic maps and statistical graphics became popular with commerce, government agencies, and the sciences, data visualization methods have been used routinely to illustrate those parts of documents that deal with data.
Computational Science
Progress in solving scientific and engineering problems increasingly depends on powerful software for modeling and simulation. Nevertheless, success in the end often only comes with effective scientific visualizations. Computational science as a key application for data visualization is a strong driving force behind the development of graphical methods for huge amounts of high-dimensional data.
Cross-references
▶ Chart ▶ Comparative Visualization ▶ Dynamic Graphics ▶ Exploratory Data Analysis ▶ Graph ▶ Methods ▶ Multivariate Data Visualization ▶ Parallel Coordinates ▶ Result Display ▶ Symbolic Representation
Recommended Reading
1. Arnheim R. Visual Thinking. University of California Press, Berkeley, CA, 1969.
2. Bertin J. Graphics and Graphic Information-Processing. Walter de Gruyter, Berlin/New York, 1981.
3. Bertin J. Semiology of Graphics (translation by W.J. Berg). University of Wisconsin Press, USA, 1983.
4. Card S.K., MacKinlay J.D., and Shneiderman B. Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann, San Francisco, CA, 1999.
5. Cleveland W.S. The Elements of Graphing Data (Revised Edition). Hobart Press, Summit, NJ, 1994.
6. Ehrenberg A.S.C. A Primer in Data Reduction. Wiley, Chichester, UK, 1982.
7. Friendly M. The History of Thematic Cartography, Statistical Graphics, and Data Visualization.
8. Harris R.L. Information Graphics: A Comprehensive Illustrated Reference. Oxford University Press, New York, 1999.
9. Inselberg A. The plane with parallel coordinates. The Visual Comput., 1(2):69–91, 1985.
10. Nielson G.M., Hagen H., and Müller H. Scientific Visualization: Overviews, Methodologies, Techniques. IEEE Computer Society Press, USA, 1997.
11. Norman D.A. The Invisible Computer. The MIT Press, 1998.
12. Post F.H., Nielson G.M., and Bonneau G.-P. (eds.). Data Visualization: The State of the Art. Kluwer Academic, 2002.
13. Spence R. Information Visualization: Design for Interaction (2nd edn.). Pearson Education, 2007.
14. Tufte E.R. The Visual Display of Quantitative Information. Graphics Press, 1983.
15. Tukey J.W. Exploratory Data Analysis. Addison-Wesley, Reading, MA, 1977.
16. Unwin A., Theus M., and Hofmann H. Graphics of Large Datasets: Visualizing a Million. Springer Series in Statistics and Computing, Berlin, 2006.
17. Ware C. Information Visualization: Perception for Design (2nd edn.). Morgan Kaufmann, 2004.
Data Warehouse
IL-YEOL SONG
Drexel University, Philadelphia, PA, USA
Synonyms
Information repository; DW
Definition
A data warehouse (DW) is an integrated repository of data put into a form that can be easily understood, interpreted, and analyzed by the people who need to use it to make decisions. The most widely cited definition of a DW is from Inmon [2], who states that ''a data warehouse is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management's decisions.'' The subject-oriented property means that the data in a DW are organized around major entities of interest of an organization. Examples of subjects are customers, products, sales, and vendors. This property allows users of a DW to analyze each subject in depth for tactical and strategic decision-making. The integrated property means that the data in a DW are integrated not only from all operational database systems but also from meta-data and other related external data. When data are moved from operational databases to a DW, they are extracted, cleansed, transformed, and then loaded. This makes a DW a centralized repository of all the business data with common semantics and formats. The nonvolatile property means that the data in a DW are not usually updated. Once the data are loaded into a DW, they are not deleted. Any change to the data that were already moved to a DW is recorded in the form of a snapshot. This allows a DW to keep track of the history of the data. The time-variant property means that a DW usually contains multiple years of data. It is not uncommon for a DW to contain data for more than ten years. This allows users of a DW to analyze trends, patterns, correlations, rules, and exceptions from a historical perspective.
Key Points
DWs have become popular for addressing the needs of a centralized repository of business data in decision-making. An operational database system, also known as an online transaction processing (OLTP) system,
supports daily business processing. On the other hand, a DW usually supports tactical or strategic business processing for business intelligence. While an OLTP system is optimized for short transactions, a DW system is optimized for complex decision-support queries. Thus, a data warehouse system is usually maintained separately from operational database systems. This distinction makes DW systems different from OLTP systems in many aspects. The data in a DW are usually organized in formats for easy access and analysis in decision-making. The most widely used data model for DWs is called the dimensional model or the star schema [3]. A dimensional model consists of two types of entities – a fact table and many dimensions. A fact table stores transactional or factual data, called measures, that are analyzed. Examples of fact tables are order, sale, return, and claim. A dimension represents an axis along which the fact data are analyzed. Examples of dimensions are time, customer, product, promotion, store, and market. The dimensional model allows users of a data warehouse to analyze the fact data from any combination of dimensions. Thus, a dimensional model simplifies end-user query processing and provides a multidimensional analysis space within a relational database. The different goals and data models of DWs require special access, implementation, maintenance, and analysis methods, different from those of OLTP systems [1]. Therefore, a data warehouse requires an environment that uses a blend of technologies.
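The following toy sketch (illustrative data and names only, independent of any particular DW engine) shows a fact table keyed by dimension identifiers and a roll-up of the measure along one dimension attribute, which is the kind of multidimensional analysis the dimensional model supports.

```python
# Toy sketch of a dimensional model: a sales fact table keyed by dimension
# identifiers, two dimension tables, and a roll-up style aggregation of the
# 'amount' measure by product category. Illustrative data and names only.
from collections import defaultdict

product_dim = {1: {'name': 'Pencil', 'category': 'Stationery'},
               2: {'name': 'Notebook', 'category': 'Stationery'},
               3: {'name': 'Mug', 'category': 'Kitchen'}}
date_dim = {10: {'day': '2009-01-01', 'month': '2009-01'},
            11: {'day': '2009-01-02', 'month': '2009-01'}}

sales_fact = [            # (product_key, date_key, amount)
    (1, 10, 5.0), (2, 10, 12.0), (3, 11, 7.5), (1, 11, 2.5),
]

def sales_by(dimension_attr):
    """Aggregate the amount measure along one attribute of the product dimension."""
    totals = defaultdict(float)
    for product_key, _, amount in sales_fact:
        totals[product_dim[product_key][dimension_attr]] += amount
    return dict(totals)

print(sales_by('category'))   # {'Stationery': 19.5, 'Kitchen': 7.5}
```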
Cross-references
▶ Active and Real-Time Data Warehousing ▶ Business Intelligence ▶ Data Mart ▶ Data Mining ▶ Data Warehouse Life-cycle and Design ▶ Data Warehouse Maintenance, Evolution and Versioning ▶ Data Warehouse Metadata ▶ Data Warehouse Security ▶ Data Warehousing and Quality Data Management for Clinical Practice ▶ Data Warehousing for Clinical Research ▶ Data Warehousing Systems: Foundations and Architectures ▶ Dimension ▶ Multidimensional Modeling ▶ On-Line Analytical Processing
Recommended Reading
1. Chaudhuri S. and Dayal U. An overview of data warehousing and OLAP technology. ACM SIGMOD Rec., 26(1):65–74, 1997.
2. Inmon W.H. Building the Data Warehouse, 3rd edn. Wiley, New York, 2002.
3. Kimball R. and Ross M. The Data Warehouse Toolkit, 2nd edn. Wiley, New York, 2002.
Data Warehouse Back Stage ▶ Extraction, Transformation and Loading
Data Warehouse Design Methodology ▶ Data Warehouse Life-Cycle and Design
Data Warehouse Life-Cycle and Design
MATTEO GOLFARELLI
University of Bologna, Bologna, Italy
Synonyms
Data Warehouse design methodology
Definition
The term data warehouse life-cycle is used to indicate the phases (and their relationships) a data warehouse system goes through between when it is conceived and when it is no longer available for use. Apart from the type of software, life cycles typically include the following phases: requirement analysis, design (including modeling), construction, testing, deployment, operation, maintenance, and retirement. On the other hand, different life cycles differ in the relevance and priority with which the phases are carried out, which can vary according to the implementation constraints (e.g., economic constraints, time constraints) and the software specificities and complexity. In particular, the specificities in the data warehouse life-cycle derive from the presence of the operational database that
feeds the system and from the extent of this kind of system, which must be considered in order to keep the cost and the complexity of the project under control. Although the design phase is only a step within the overall life cycle, the identification of a proper life-cycle model and the adoption of a correct design methodology are strictly related, since each one influences the other.
Historical Background
The data warehouse (DW) is acknowledged as one of the most complex information system modules, and its design and maintenance are characterized by several complexity factors, which determined, in the early stages of this discipline, a high percentage of project failures. A clear classification of the critical factors of Data Warehousing projects was already available in 1997, when three different risk categories were identified [1]:
Socio-technical: DW projects have a deep impact on decisional processes and political equilibriums, thus reducing the power of some stakeholders, who may then be willing to interfere with the project. For example, data ownership is power within an organization. Any attempt to share or take control over somebody else's data is equivalent to a loss of power for this particular stakeholder. Furthermore, no division or department can claim to possess 100% clean, error-free data. The possibility of revealing the quality problems of data within the information system of the department is definitely frustrating for the stakeholders affected.
Technological: DW technologies are continuously evolving and their features are hard to test. As a consequence, problems related to the limited scalability of the architecture, difficulty in sharing meta-data between different components, and the inadequate expertise of the programmers may hamper the project.
Design: designing a DW requires a deep knowledge of the business domain. Some recurrent errors are related to the limited involvement of the user communities in the design as well as the lack of a deep analysis of the quality of the source data. In both these cases, the information extracted from the DW will have a limited value for the stakeholders, since it will turn out to be unreliable and outside the user focus.
The awareness of the critical nature of the problems and the experience accumulated by practitioners
led to the development of different design methodologies and the adoption of proper life cycles that can increase the probability of completing the project and fulfilling the user requirements.
Foundations
The choice of a correct life cycle for the DW must take into account the specificities of this kind of system, which, according to [2], are summarized as follows:
1. DWs rely on operational databases that represent the sources of the data.
2. User requirements are difficult to collect and usually change during the project.
3. DW projects are usually huge projects: the average time for their construction is 12–36 months and their average cost ranges from 0.5 to 10 million dollars.
4. Managers are demanding users that require reliable results in a time compatible with business needs.
While there is no consensus on how to address points 1 and 2, the DW community has agreed on an approach that cuts down cost and time to make a satisfactory solution available to the final users. Instead of approaching the DW development as a whole in a top-down fashion, it is more convenient to build it bottom-up, working on single data marts [3]. A data mart is part of a DW with a restricted scope of content and support for analytical processing, serving a single department, part of an organization, and/or a particular data analysis problem domain. By adopting a bottom-up approach, the DW will turn out to be the union of all the data marts. This iterative approach promises to fulfill requirement 3, since it cuts down the development costs and time needed to get the first results. On the other hand, requirement 4 will be fulfilled if the designer is able to implement first those data marts that are most relevant to the stakeholders. As stated by many authors, adopting a pure bottom-up approach presents many risks originating from the partial vision of the business domain that will be available at each design phase. This risk can be limited by first developing the data mart that plays a central role within the DW, so that the following ones can be easily integrated into the existing backbone; this kind of solution is also called a bus architecture. The basis for designing coherent data marts and for achieving an
integrated DW is the agreement of all the design teams on the classes of analysis that are relevant for the business. This is primarily obtained by the adoption of conformed dimensions of analysis [4]. A dimension is conformed when two copies of the dimension are either exactly the same (including the values of the keys and all the attributes), or else one dimension is a proper subset of the other. Therefore, using the same time dimension in all the data marts implies that the data mart teams agree on a corporate calendar. All the data mart teams must use this calendar and agree on fiscal periods, holidays, and workdays. When choosing the first data mart to be implemented, the designer will probably have to cope with the fact that the most central data mart (from a technical point of view) is not the most relevant to the user. In that case, the designer's choice must be a trade-off between technical and political requirements. Based on these considerations, the main phases of the DW life-cycle can be summarized as follows:
1. DW planning: this phase is aimed at determining the scope and the goals of the DW, and determines the number and the order in which the data marts are to be implemented according to the business priorities and the technical constraints [5]. At this stage the physical architecture of the system must be defined too: the designer carries out the sizing of the system in order to identify appropriate hardware and software platforms and evaluates the need for a reconciled data level aimed at improving data quality. Finally, during the project planning phase the staffing of the project is carried out.
2. Data mart design and implementation: this macro-phase will be repeated for each data mart to be implemented and will be discussed in more detail in the following. At every iteration, a new data mart is designed and deployed. Multidimensional modeling of each data mart must be carried out
considering the available conformed dimensions and the constraints derived from previous implementations.
3. DW maintenance and evolution: DW maintenance mainly concerns performance optimization, which must be periodically carried out due to user requirements that change according to the problems and the opportunities the managers run into. On the other hand, DW evolution concerns keeping the DW schema up-to-date with respect to the business domain and the business requirement changes: a manager requiring a new dimension of analysis for an existing fact schema, or the inclusion of a new level of classification due to a change in a business process, may cause the early obsolescence of the system (Fig. 1).
DW design methodologies proposed in the literature mainly concern phase 2 and thus should be better referred to as data mart design methodologies. Though a lot has been written about how a DW should be designed, there is no consensus on a design method yet. Most methods agree on the opportunity of distinguishing between the following phases:
Requirement analysis: identifies which information is relevant to the decisional process by considering either the user needs or the actual availability of data in the operational sources.
Conceptual design: aims at deriving an implementation-independent and expressive conceptual schema for the DW, according to the conceptual model chosen (see Fig. 2).
Logical design: takes the conceptual schema and creates a corresponding logical schema on the chosen logical model. While nowadays most DW systems are based on the relational logical model (ROLAP), an increasing number of software vendors are also proposing pure or mixed multidimensional solutions (MOLAP/HOLAP). Figure 3 reports the
Data Warehouse Life-Cycle and Design. Figure 1. The main phases for the DW life-cycle.
Data Warehouse Life-Cycle and Design. Figure 2. A conceptual representation for the SALES fact based on the DFM model [6].
Data Warehouse Life-Cycle and Design. Figure 3. A relational implementation of the SALE fact using the well-known star schema.
relational implementation of the SALE fact based on the well-known star schema [4].
ETL process design: designs the mappings and the data transformations necessary to load the data available at the operational data sources into the logical schema of the DW.
Physical design: addresses all the issues specifically related to the suite of tools chosen for implementation – such as indexing and allocation.
described at the beginning of the present section. The lack of settled user requirements and the existence of operational data sources that fix the set of available information make it hard to develop appropriate multidimensional schemata that on the one hand fulfill user requirements and on the other can be fed from the operational data sources. Two different design principles can be identified: supply-driven and demanddriven [5].
Requirement analysis and conceptual design play a crucial role in handling DW peculiarities (i) and (ii)
Supply-driven approaches [3,6] (also called datadriven) start with an analysis of operational data
sources in order to reengineer their schemata and identify all the available data. Here user involvement is limited to selecting which chunks of the available data are relevant for the decision-making process. While supply-driven approaches simplify the design of the ETL, because each piece of data in the DW corresponds to one or more attributes of the sources, they give user requirements a secondary role in determining the information contents for analysis, and give the designer little support in identifying facts, dimensions, and measures. Supply-driven approaches are feasible when all of the following are true: (i) detailed knowledge of the data sources is available a priori or easily achievable; (ii) the source schemata exhibit a good degree of normalization; and (iii) the complexity of the source schemata is not too high. Demand-driven approaches [7,8] start from determining the information requirements of business users. The emphasis is on the requirement analysis process and on the approaches for facilitating user participation. The problem of mapping these requirements onto the available data sources is faced only a posteriori, and may fail, thus causing the users' disappointment as well as a waste of the designer's time. Based on the previous approaches, some mixed modeling solutions have been proposed in the last few years in order to overcome the weaknesses of each pure solution. Conceptual design is widely recognized to be the necessary foundation for building a DW that is well-documented and fully satisfies the user requirements. The goal of this phase is to provide the designer with a high-level description of the data mart, possibly at different levels of detail. In particular, at the DW level it is aimed at locating the data mart within the overall DW picture, basically characterizing the class of information captured, its users, and its data sources. At the data mart level, conceptual design should identify the set of facts to be built and their conformed dimensions. Finally, at the fact level a non-ambiguous and implementation-independent representation of each fact should be provided. If a supply-driven approach has been followed for requirement analysis, the conceptual model at the schema level can be semi-automatically derived from the source schemata by identifying the many-to-one relationships [3,6].
Concerning the formalism to be adopted for representing information at this level, researchers and practitioners agree that, although the E/R model has enough expressivity to represent most of the necessary concepts, in its basic form it is not able to properly emphasize the key aspects of the multidimensional model. As a consequence, many ad-hoc formalisms have been proposed in recent years (e.g., [6,9]), and a comparison of the different models carried out in [10] pointed out that, abstracting from their graphical form, their core expressivity is similar, thus showing that the academic community has reached an informal agreement on the required expressivity. Logical design is the phase that most attracted the interest of researchers in the early stage of data warehousing, since it strongly impacts the system performance. It is aimed at deriving, out of the conceptual schemata, the data structures that will actually implement the data mart, by considering some sets of constraints (e.g., concerning disk space or query answering time) [11]. Logical design is more relevant when a relational DBMS is adopted (ROLAP), while in the presence of a native multidimensional DBMS (MOLAP) the logical model derivation is straightforward. In ROLAP systems, on the other hand, the choices concern, for example, the type of schema to be adopted (i.e., star or snowflake) and the specific solutions for the historicization of data (i.e., slowly changing dimensions) and of the schema. ETL process design is considered to be the most complex design phase and usually takes up to 70% of the overall design time. Complexity arises from the need to integrate and transform heterogeneous and inconsistent data coming from different data sources. This phase also includes the choice of the strategy for handling wrong and incomplete data (e.g., discard, complete). Obviously, the success of this phase impacts the overall quality of DW data. Differently from the other design phases, little effort has been made in the literature to organize and standardize this phase [12,13], and actually none of the formalisms proposed has been widely adopted in real projects, which usually rely on the graphical representation produced by the ETL tool for documentation purposes. Finally, during physical design, the logical structure is optimized based on the means made available by the adopted suite of tools. Specialized DBMSs usually include ad hoc index types (e.g., bitmap index and join index) and can store the meta-knowledge necessary to automatically rewrite a given query on
the appropriate materialized view (a toy illustration of this kind of rewriting is given below). In DW systems, a large part of the available disk space is devoted to optimization purposes, and it is the designer's task to decide how to allocate it among the different optimization data structures in order to maximize the overall performance [14]. Despite the basic role played by a well-structured methodological framework in ensuring that the designed DW fully meets the user expectations, only a few of the cited papers cover all the design phases [6,13]. In addition, an influential book, particularly from the practitioners' viewpoint, is the one by Kimball [4], which discusses the major issues arising in the design and implementation of data warehouses. The book presents a case-based approach to data mart design that is bottom-up oriented and adopts a mixed approach for collecting user requirements. Finally, it should be noted that, though most vendors of DW technology propose their own CASE solutions (very often just wizards that support the designer during the most tedious and repetitive phases of design), the only tools that currently promise to effectively automate some phases of design are research prototypes. In particular, [3,15], embracing the supply-driven philosophy, propose two approaches for automatically deriving the conceptual multidimensional schema from the relational data sources. By contrast, the CASE tool proposed in [12] follows the demand-driven approach and allows the multidimensional conceptual schemata to be drawn from scratch and to be semi-automatically translated into the target commercial tool.
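As a toy illustration of the query rewriting mentioned above (all names and figures below are hypothetical and not tied to any specific DBMS), the meta-knowledge about materialized aggregate views can be used to route a query to the cheapest view able to answer it:

    # Each materialized view is described by the dimension levels it groups by
    # and by an estimated row count (hypothetical figures).
    MATERIALIZED_VIEWS = {
        "mv_by_product_store_date":   ({"product", "store", "date"}, 10_000_000),
        "mv_by_category_store_month": ({"category", "store", "month"}, 400_000),
        "mv_by_category_year":        ({"category", "year"}, 2_000),
    }

    # Roll-up relationships between levels (one step is enough for this example).
    ROLLS_UP_TO = {"product": {"category"}, "date": {"month", "year"}, "month": {"year"}}

    def can_answer(view_levels, query_levels):
        """A view can answer a query if every requested level is stored in the
        view or reachable from one of its levels by rolling up."""
        reachable = set(view_levels)
        for level in view_levels:
            reachable |= ROLLS_UP_TO.get(level, set())
        return query_levels <= reachable

    def pick_view(query_levels):
        """Rewrite the query on the smallest materialized view that can answer it."""
        candidates = [(rows, name)
                      for name, (levels, rows) in MATERIALIZED_VIEWS.items()
                      if can_answer(levels, query_levels)]
        return min(candidates)[1] if candidates else None

    print(pick_view({"category", "year"}))   # -> mv_by_category_year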
Key Applications
The adoption of an appropriate methodological approach during the design phases is crucial to ensure the project success. People involved in the design must be skilled on this topic; in particular:
Designers
Designers should have a deep knowledge of the pros and cons of different methodologies in order to adopt the one that best fits the project characteristics.
Business Users
Users should be aware of the design methodology adopted and their role within it in order to properly support the designer’s work and to provide the correct information at the right time.
Future Directions
Research on this topic should be directed towards generalizing the methodologies discussed so far in order to derive a consensus approach that, depending on the characteristics of the project, will be made up of different phases. More generally, mechanisms should emerge to coordinate all DW design phases, allowing the analysis, control, and traceability of data and metadata along the project life-cycle. An interesting approach in this direction consists in applying the Model Driven Architecture to automate the inter-schema transformations from requirement analysis to implementation [16]. Finally, the emergence of new applications for DWs such as spatial DWs [17], web DWs, real-time DWs [18], and business performance management [19] will have side-effects on the DW life-cycle, and more general design methodologies will inevitably have to be devised in order to allow their correct handling.
Cross-references
▶ Cube Implementations
▶ Data Mart
▶ Data Warehouse Maintenance, Evolution and Versioning
▶ Data Warehousing Systems: Foundations and Architectures
▶ Multidimensional Modeling
▶ Optimization and Tuning in Data Warehouses
▶ Snowflake Schema
▶ Star Schema
Recommended Reading
1. Abelló A., Samos J., and Saltor F. YAM2: a multidimensional conceptual model extending UML. Inf. Syst., 31(6):541–567, 2006.
2. Bimonte S., Tchounikine A., and Miquel M. Towards a Spatial Multidimensional Model. In Proc. ACM 8th Int. Workshop on Data Warehousing and OLAP, 2005.
3. Demarest M. The politics of data warehousing. Retrieved June 2007 from http://www.noumenal.com/marc/dwpoly.html.
4. Giorgini P., Rizzi S., and Garzetti M. GRAnD: A goal-oriented approach to requirement analysis in data warehouses. Decis. Support Syst., 45(1):4–21, 2008.
5. Golfarelli M., Maio D., and Rizzi S. The dimensional fact model: a conceptual model for data warehouses. Int. J. Coop. Inf. Syst., 7(2–3):215–247, 1998.
6. Golfarelli M. and Rizzi S. WAND: A CASE tool for data warehouse design. In Proc. 17th Int. Conf. on Data Engineering, 2001.
7. Golfarelli M., Rizzi S., and Cella I. Beyond data warehousing: What's next in business intelligence? In Proc. ACM 7th Int. Workshop on Data Warehousing and OLAP, 2004.
8. Golfarelli M., Rizzi S., and Saltarelli E. Index selection for data warehousing. In Proc. 4th Int. Workshop on Design and Management of Data Warehouses, 2002.
9. Hüsemann B., Lechtenbörger J., and Vossen G. Conceptual data warehouse design. In Proc. 2nd Int. Workshop on Design and Management of Data Warehouses, 2000.
10. Jarke M., Lenzerini M., Vassiliou Y., and Vassiliadis P. Fundamentals of Data Warehouses. Springer, 2000.
11. Jensen M., Holmgren T., and Pedersen T. Discovering Multidimensional Structure in Relational Data. In Proc. 6th Int. Conf. on Data Warehousing and Knowledge Discovery, 2004.
12. Kimball R., Reeves L., Ross M., and Thornthwaite W. The Data Warehouse Lifecycle Toolkit. Wiley, New York, 1998.
13. Laender A., Freitas G., and Campos M. MD2 – Getting users involved in the development of data warehouse applications. In Proc. 14th Int. Conf. on Advanced Information Systems Eng., 2002.
14. Mazon J., Trujillo J., Serrano M., and Piattini M. Applying MDA to the development of data warehouses. In Proc. ACM 8th Int. Workshop on Data Warehousing and OLAP, 2005.
15. Theodoratos D. and Sellis T. Designing data warehouses. Data & Knowl. Eng., 31(3):279–301, 1999.
16. Tho N. and Tjoa A. Grid-Based Zero-Latency Data Warehousing for continuous data streams processing. In Proc. 6th Int. Conf. on Information Integration and Web Based Applications & Services, 2004.
17. Trujillo J. and Luján-Mora S. A UML Based Approach for Modeling ETL Processes in Data Warehouses. In Proc. 22nd Int. Conf. on Conceptual Modeling, 2003.
18. Trujillo J., Luján-Mora S., and Medina E. The Gold model case tool: An environment for designing OLAP applications. In Proc. ACM 5th Int. Workshop on Data Warehousing and OLAP, 2002.
19. Vassiliadis P., Simitsis A., and Skiadopoulos S. Conceptual modeling for ETL processes. In Proc. ACM 5th Int. Workshop on Data Warehousing and OLAP, 2002.
20. Winter R. and Strauch B. A method for demand-driven information requirements analysis in data warehousing. In Proc. 36th Annual Hawaii Int. Conf. on System Sciences, 2003.
Data Warehouse Maintenance, Evolution and Versioning
JOHANN EDER 1, KARL WIGGISSER 2
1 University of Vienna, Vienna, Austria
2 University of Klagenfurt, Klagenfurt, Austria
Synonyms
Temporal data warehousing
Definition
A multidimensional data warehouse consists of three different levels: the schema level (dimensions, categories), the instance level (dimension members, master data) and the data level (data cells, transaction data).
The process and methodology of performing changes on the schema and instance level to represent changes in the data warehouse’s application domain or requirements is called Data Warehouse Maintenance. Data Warehouse Evolution is a form of data warehouse maintenance where only the newest data warehouse state is available. Data Warehouse Versioning is a form of data warehouse maintenance where all past versions of the data warehouse are kept available. Dealing with changes on the data level, mostly insertion of new data, is not part of data warehouse maintenance, but part of a data warehouse’s normal operation.
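The difference between the two maintenance styles defined above can be pictured with a small sketch; the class names are hypothetical, and the car dealer members anticipate the example of Fig. 1 discussed in the Foundations section below.

    from dataclasses import dataclass, field

    @dataclass
    class EvolvingDimension:
        """Evolution: a change overwrites the old state, which is then lost."""
        members: dict = field(default_factory=dict)   # member -> parent member

        def change_parent(self, member, new_parent):
            self.members[member] = new_parent

    @dataclass
    class VersionedDimension:
        """Versioning: every change produces a new structure version that is kept."""
        versions: list = field(default_factory=list)  # list of (valid_from, members)

        def change_parent(self, member, new_parent, valid_from):
            current = dict(self.versions[-1][1]) if self.versions else {}
            current[member] = new_parent
            self.versions.append((valid_from, current))

    e = EvolvingDimension()
    e.change_parent("Modell G", "Puch")
    e.change_parent("Modell G", "Mercedes")        # the Puch assignment is lost

    v = VersionedDimension()
    v.change_parent("Modell G", "Puch", "2001-01")      # hypothetical timestamps
    v.change_parent("Modell G", "Mercedes", "2002-01")
    assert v.versions[0][1]["Modell G"] == "Puch"       # past version still available
    assert v.versions[1][1]["Modell G"] == "Mercedes"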
Historical Background
Data warehouses are supposed to provide functionality for storing and analyzing data over a long period of time. Since the world is changing, the need for applying changes to data warehouse structures arose. Kimball [8] was probably the first to describe the problem and propose solutions. Several more sophisticated proposals followed (see below).
Foundations
A multidimensional data warehouse consists of three different levels: the schema level, the instance level, and the data level. On the schema level, a data warehouse is defined by a set of dimensions and corresponding dimension categories, which build up a category hierarchy. On the instance level, a data warehouse is defined by a set of dimension members for each dimension. Dimension members build up a member hierarchy which corresponds to the category hierarchy of the respective dimension. Schema and instance level together define the structure of a data warehouse. Different multidimensional models deal with measures in different ways. If no particular measure dimension is defined, measures are modeled as attributes of the fact table and are thus seen as part of the schema. If a measure dimension exists, measures are members of this particular dimension and are therefore seen as instances. On the data level, a data warehouse consists of a set of data cells, which hold the actual values to analyze. A data cell is defined by selecting one dimension member from each dimension. Whereas changes on the data level, mostly data inserts, are part of the daily business in data warehouse systems, modifications of the data warehouse structure need additional effort. Structural
modifications can be caused by changes in the application domain of a data warehouse system or by changes in the requirements.
Levels of the Maintenance Problem
Data warehouse maintenance systems must provide means to keep track of schema modifications as well as of instance modifications. On the schema level one needs operations for the Insertion, Deletion and Change of dimensions and categories. Category changes are, for instance, adding or deleting user-defined attributes. Also the hierarchical relations between categories may be modified. On the instance level, operations for the Insertion, Deletion and Change of dimension members are needed, as well as operations for changing the hierarchical relations between dimension members. Whether changing measures is a schema or an instance change depends on the underlying multidimensional model. Typically, schema changes happen rarely but need much effort to be dealt with, whereas modifications of instances may happen quite often but need less effort. Keeping track of the data warehouse structure is only one aspect of data warehouse maintenance. The structure of the cell data contained in a data warehouse is determined by the data warehouse's structure. Thus, if this structure changes, existing cell data may have to be adjusted to be consistent with the new structure. Such adjustments can range from simple reaggregation to complex data transformations, for instance when the unit of a measure is changed. These data adaptations must not be mistaken for data change operations as mentioned above, for instance loading new data into the data warehouse. Figure 1 shows an example of instance and schema changes. It contains three subsequent versions of one dimension of a car dealer's data warehouse structure, together with the categories for this dimension. On top, the initial version is shown. The dealer sells different car models of different brands. Each model has an attribute which denotes the engine power. For traditional German models this is given in horsepower, for English models it is given in kilowatts. The outline in the middle shows the subsequent version, where two instance changes can be seen: a new model (BMW 1) is introduced, and one model (Phantom V) is discontinued. The bottom outline shows the current structure version. Here one can see a schema change: a new category (Company) is inserted into the category hierarchy. On the instance level there are a number of
changes: one brand (Puch) is removed from the product portfolio. The model (Modell G) attached to this brand is now sold under another brand (Mercedes). Furthermore, a new brand (Chrysler) was added to the product portfolio, together with one model assigned to it. For the newly introduced category two dimension members (BMW&Rolls-Royce and DaimlerChrysler) are added and the brands are connected to the respective company. The attribute denoting the power of a model is unified to kilowatts for all models. All the mentioned structure modifications are due to changes in the application domain. A requirements change leading to structure updates could, for instance, be that besides analyzing the number of car sales, the car dealer also wants to keep track of the resulting profit (insert measure). A data adjustment for this example would be the reaggregation needed to express that Modell G is now sold under the brand of Mercedes. A data transformation could, for instance, result from changing the currency from ATS to EUR, where every money-related value has to be divided by 13.7603.
Data Warehouse Versioning Versus Data Warehouse Evolution
In principle two methods of maintenance can be distinguished: Evolution and Versioning. Both of these techniques rely on the defined operations for structure changes but significantly vary in terms of query flexibility, query costs and data management effort. This distinction between versioning and evolution can be applied for both the schema and the instance level. With Data Warehouse Evolution, every applied operation changes the structure of the data warehouse and the old structure is lost. The respective cell data is transformed to fit the new structure. As the old structure is lost, queries can only be done against the current structure. Queries spanning different structure versions are not possible. As the data follows one single structure, no adaptations have to be done during query runtime, which results in a better query performance compared to the versioning approach. Furthermore, no information about former versions has to be kept, which reduces the effort for data management. With Data Warehouse Versioning every applied operation again leads to a new structure version. But in contrast to the evolutionary approach the old version is also kept available. Existing cell data does not need to be adapted, but can be stored further on following
Data Warehouse Maintenance, Evolution and Versioning. Figure 1. Changes in Data Warehouse Structure.
the respective structure version. This facilitates queries spanning multiple structure versions. When running such multiversion queries, data has to be either adapted at runtime, which reduces query performance, or precalculated and stored, which increases the required space and maintenance effort. Keeping track of the structure version history is mandatory, which results in a considerable effort for data management.
Approaches Addressing the Maintenance Problem
There is a set of approaches addressing the data warehouse maintenance problem. Kimball [8] was one of the first to recognize the need for evolving data warehouses, introducing three methods for dealing with ''slowly changing dimensions''. The first method proposes simply overwriting old instances with their new values. Tracking a change history is not possible. The second method consists in creating a new instance for each change. This creates a change history, but needs additional effort in data management. One has to introduce a surrogate key, because the natural primary keys may not be unique any longer. For relating the various instances for an object to each other, creating a time
stamp for the validity of each version is proposed. The third method proposes creating a new attribute for the instance, such that the original and the current attribute value can be saved. This method can of course only handle two versions of an instance. All three methods are quite straightforward and only allow very basic modifications on the instance level. With FIESTA [2], Blaschka, Sapia and Höfling present a schema design technique supporting schema evolution. Evolution for instances is not supported, but FIESTA provides an automatic mechanism to adapt existing instances after a schema modification. For this adaptation two alternatives are proposed: adaptation on the physical level (i.e., database changes) and adaptation on the logical level (i.e., creating a filter for accessing the instances). The authors define a rich set of schema-changing operations, including the creation and deletion of dimensions, categories and attributes. In [11] Ravat and Teste present their approach for dealing with changing instances. The authors define an object-oriented approach for data warehouse modeling, based on the class concept proposed by the Object Database Management Group. A warehouse object (instance)
is defined by its current state and a set of historical and archived states. The difference between historical and archived states is that historical states can be exactly reestablished, whereas for archived states only aggregations are kept, in order to reduce data size. Mapping functions describe the building process from which the data warehouse classes are generated. The approach of Hurtado, Mendelzon and Vaisman [14] allows data warehouse evolution on the schema and the instance level. Both schema and instances are modeled using a directed acyclic graph where the nodes represent levels and instances, respectively. The edges are labeled with their valid time intervals. Nodes connected to edges are only valid in the time interval where the edge is valid. Operations for inserting and deleting categories and instances are provided. Evolution of instances is not supported. Defining whether a specific instance is part of the current schema happens by timestamping the edge which connects the node to the graph. Additionally, the temporal query language TOLAP is defined to enable queries over a set of temporal dimensions and temporal fact tables. In [4,5] Eder and Koncilia present their COMET Metamodel for temporal data warehousing. Based on the principles of temporal databases, they introduce a system that supports data warehouse versioning on the schema and the instance level. COMET provides a rich set of maintenance operations, which comprise insertion, deletion, and update of schema elements and instances. Also the complex operations split member and merge members are defined. In contrast to other approaches, these operations can also be applied to the time and fact dimensions. COMET furthermore defines so-called transformation functions, which allow the cell data to be transformed between arbitrary versions of the data warehouse. This provides the functionality of queries spanning several structure versions. In [6] Golfarelli et al. present their approach for schema versioning in data warehouses. Based on a graph model of the data warehouse schema, they present an algebra for schema modifications. This approach supports versioning, therefore past versions are not lost. Based on those schema versions, the authors describe a mechanism to execute cross-version queries with the help of so-called augmented schemas. For creating such an augmented schema, an old schema version is enriched with structure elements from a subsequent version, such that the data belonging to the old schema version can be queried as if it followed the new version.
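A transformation function of the kind provided by COMET can be sketched as follows. The member reassignment (Modell G moving from Puch to Mercedes) and the ATS-to-EUR rate come from the example of Fig. 1; the function and variable names, as well as the cell values, are hypothetical.

    # Cell data recorded under the old structure version: (brand, model) -> revenue in ATS.
    cells_v1 = {
        ("Puch", "Modell G"): 1_376_030.0,
        ("Mercedes", "E 200"): 2_752_060.0,
    }

    # Between the two versions, 'Modell G' is reassigned to 'Mercedes' and every
    # money-related value is converted from ATS to EUR.
    MEMBER_MAP = {("Puch", "Modell G"): ("Mercedes", "Modell G")}
    ATS_PER_EUR = 13.7603

    def transform(cells):
        """Map cell coordinates to the new structure version and convert the measure."""
        out = {}
        for coord, value in cells.items():
            new_coord = MEMBER_MAP.get(coord, coord)
            out[new_coord] = out.get(new_coord, 0.0) + value / ATS_PER_EUR
        return out

    cells_v2 = transform(cells_v1)
    print(round(cells_v2[("Mercedes", "Modell G")], 2))   # -> 100000.0

With such a function, cell data recorded under the old version can be compared with data loaded under the new one in a single multiversion query.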
Besides these research proposals there are also two commercial products which introduce basic means for data warehouse maintenance. SAP Inc. describes in a white paper [9] how to produces different types of reports over existing data. This can be a report using the current constellation, a report using an old constellation, a report showing the historical truth, and a report showing comparable results. This approach supports only basic operations on dimension data. The KALIDO Dynamic Information Warehouse [7] also realizes some aspects of data warehouse maintenance. Their support for change is based on the so called generic data modeling. The data warehouse model consists of three categories of data, the transaction data (which describes the activities of the business and the measures associated with them), the business context data (which is the analog to the instances), and the metadata (which comprises among others, parts the schema). With evolving the business context data, instance evolution is supported. There are a set of alternative approaches which have not been mentioned yet. The different techniques addressing the data warehouse maintenance problem can be classified by two features: First, by whether they support structure versioning or structure evolution, and second by the level of modifications they can handle. Table 1 shows this classification for some of the best known approaches in this area. So each of the mentioned approaches provides the features naming the respective row and column. Besides the classical maintenance requirements of keeping track of changes in data warehouse, maintenance methodologies can also be used to facilitate so called what–if-analysis. In [1] Bebel et al. present their approach for the management of multiversion data warehouses. They differentiate between real versions and alternative versions. Real versions are used to historicize data warehouse modifications resulting from real world changes. Alternative versions provide the Data Warehouse Maintenance, Evolution and Versioning. Table 1. Classification of data warehouse maintenance approaches
                                   Versioning    Evolution
Schema and instance maintenance    [4]           [14]
Schema maintenance only            [6]           [2,10]
Instance maintenance only          [9,11,13]     [7,3,8,15]

Alternative versions provide the
functionality to create several versions, each of them representing a possible future situation, and then to apply what-if analysis to them. Additionally, alternative versions can be used to simulate data warehouse changes for optimization purposes. Another instance of data warehouse maintenance is the so-called view maintenance. Whereas the approaches presented above assume a data warehouse structure which is defined somewhat independently of the underlying data sources and is populated with data by ETL processes, a data warehouse can also be seen as a materialized view over a set of data sources. Such a materialized view is of course directly affected by changes in the sources. For instance, in [16] Zhuge et al. present their approach for view maintenance. But as these approaches mostly deal only with data updates, they are out of scope for data warehouse maintenance. Rundensteiner et al. [12] present a view maintenance approach which can also deal with changing structures. Their evolvable view management is realized as middleware between the data sources and the data warehouse. A core feature is the so-called evolvable SQL, which allows preferences for view evolution to be defined. With these preferences it is possible to redefine the view after some source changes, such that the resulting view is possibly no longer equivalent to the original view, but still fulfills the user's needs.
Key Applications
Data warehouses are often used to efficiently support the decision-making process in companies and public authorities. To fulfil this task they have to represent the application domain and the users' requirements. To keep the analysis results accurate and correct over time, data warehouse maintenance is a crucial issue. Application domains which are typically vulnerable to changing structures are, among others, statistical and geographic applications (for instance statistical data in the European Union), health care (for instance switching from the International Classification of Diseases Version 9 to Version 10), or the stock market (for instance stock splits). In each of these domains, traceability and comparability of data over long periods of time are very important, thus effective and efficient means to provide these capabilities have to be defined.
Future Directions
Current commercial systems assume the data warehouse structure to be constant; therefore their support for
modifications is rather limited. On the other hand, in real-world applications the demand for changing structures is rather high, as the data warehouse has to be consistent with the application domain and the requirements. Despite the fact that more effort is being put into integrating maintenance capabilities into commercial data warehouse systems [9,7], current products are still not well prepared for this challenge. Whereas schema and instance maintenance is quite well elaborated in current research papers, the efficient transformation of cell data between different versions is still subject to research. The main problems with data transformation are, first of all, defining semantically correct transformation functions, and second, the oftentimes huge amount of cell data which has to be handled in an efficient way. Related to data transformation is the problem of multiversion queries. The problem with such queries is defining the desired semantics and structure of the outcome, i.e., whether and how elements and cell values that are not valid for all affected versions should be included in the result.
Cross-references
▶ Data Warehousing Systems: Foundations and Architectures
▶ On-line Analytical Processing
▶ Optimization and Tuning in Data Warehouses
▶ Quality of Data Warehouses
▶ Schema Versioning
▶ Temporal Database
▶ What-If Analysis
Recommended Reading
1. Bębel B., Eder J., Koncilia C., Morzy T., and Wrembel R. Creation and management of versions in multiversion data warehouse. In Proc. 2004 ACM Symp. on Applied Computing, 2004, pp. 717–723.
2. Blaschka M., Sapia C., and Höfling G. On schema evolution in multidimensional databases. In Proc. Int. Conf. on Data Warehousing and Knowledge Discovery, 1999, pp. 153–164.
3. Chamoni P. and Stock S. Temporal structures in data warehousing. In Proc. Int. Conf. on Data Warehousing and Knowledge Discovery, 1999, pp. 353–358.
4. Eder J., Koncilia C., and Morzy T. The COMET Metamodel for Temporal Data Warehouses. In Proc. Int. Conf. on Advanced Information Systems Engineering, 2002, pp. 83–99.
5. Eder J., Koncilia C., and Wiggisser K. Maintaining temporal warehouse models. In Proc. Int. Conf. on Research and Practical Issues of Enterprise Information Systems, 2006, pp. 21–30.
6. Golfarelli M., Lechtenbörger J., Rizzi S., and Vossen G. Schema versioning in data warehouses: Enabling cross-version querying via schema augmentation. Data & Knowledge Eng., 59:435–459, 2006.
7. KALIDO Dynamic Information Warehouse: A Technical Overview. Tech. rep., Kalido, 2004.
8. Kimball R. Slowly Changing Dimensions. DBMS Magazine, 9(4):14, 1996.
9. Multi-Dimensional Modeling with BW: ASAP for BW Accelerator. Tech. rep., SAP Inc., 2000.
10. Quix C. Repository Support for Data Warehouse Evolution. In Proc. Int. Workshop on Design and Management of Data Warehouses, 1999.
11. Ravat F. and Teste O. A Temporal Object-Oriented Data Warehouse Model. In Proc. Int. Conf. on Database and Expert Systems Applications, 2000, pp. 583–592.
12. Rundensteiner E.A., Koeller A., and Zhang X. Maintaining data warehouses over changing information sources. Commun. ACM, 43(6):57–62, 2000.
13. Sarda N.L. Temporal Issues in Data Warehouse Systems. In Proc. Int. Symp. on Database Applications in Non-Traditional Environments, 1999.
14. Vaisman A. and Mendelzon A. A Temporal Query Language for OLAP: Implementation and a Case Study. In Proc. Int. Workshop on Database Programming Languages, 2001, pp. 78–96.
15. Yang J. and Widom J. Maintaining temporal views over non-temporal information sources for data warehousing. In Proc. Int. Conf. on Extending Database Technology, 1998, pp. 389–403.
16. Zhuge Y., Garcia-Molina H., Hammer J., and Widom J. View Maintenance in a Warehousing Environment. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1995, pp. 316–327.
Data Warehouse Indexing ▶ Indexing of Data Warehouses
Data Warehouse Integration ▶ Interoperability in Data Warehouses
Data Warehouse Metadata
PANOS VASSILIADIS
University of Ioannina, Ioannina, Greece
Definition
Data warehouse metadata are pieces of information stored in one or more special-purpose metadata
repositories that include (i) information on the contents of the data warehouse, their location and their structure, (ii) information on the processes that take place in the data warehouse back-stage, concerning the refreshment of the warehouse with clean, up-to-date, semantically and structurally reconciled data, (iii) information on the implicit semantics of data (with respect to a common enterprise model), along with any other kind of data that aids the end-user in exploiting the information of the warehouse, (iv) information on the infrastructure and physical characteristics of components and the sources of the data warehouse, and (v) information including security, authentication, and usage statistics that aids the administrator in tuning the operation of the data warehouse as appropriate.
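Purely as an illustration of the five kinds of information listed above (the class, kind, and property names below are hypothetical and are not taken from any metadata standard), a minimal metadata repository could be pictured as a collection of typed entries:

    from dataclasses import dataclass

    # Hypothetical kinds mirroring items (i)-(v) of the definition above.
    KINDS = {"content", "process", "semantics", "infrastructure", "administration"}

    @dataclass
    class MetadataEntry:
        kind: str          # one of KINDS
        subject: str       # the warehouse element the entry describes
        properties: dict   # free-form descriptive properties

        def __post_init__(self):
            if self.kind not in KINDS:
                raise ValueError(f"unknown metadata kind: {self.kind}")

    repository = [
        MetadataEntry("content", "fact_sale",
                      {"location": "dw_server", "attributes": ["date_key", "product_key", "revenue"]}),
        MetadataEntry("process", "load_sales",
                      {"source": "orders.csv", "schedule": "daily", "cleaning": "discard invalid rows"}),
        MetadataEntry("semantics", "revenue",
                      {"business_name": "Net revenue", "enterprise_concept": "Sales"}),
    ]

    # A stakeholder can then ask, for example, for all back-stage process metadata.
    print([entry.subject for entry in repository if entry.kind == "process"])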
Historical Background
Data warehouses are systems with significant complexity in their architecture and operation. Apart from the central data warehouse itself, which typically involves an elaborate hardware architecture, several sources of data, in different operational environments, are involved, along with many clients that access the data warehouse in various ways. The infrastructure complexity is only one part of the problem; the largest part of the problem lies in the management of the data that are involved in the warehouse environment. Source data with different formats, structure, and hidden semantics are integrated in a central warehouse and then, these consolidated data are further propagated to different end-users, each with a completely different perception of the terminology and semantics behind the structure and content of the data offered to them. Thus, the administrators, designers, and application developers that cooperate towards bringing clean, up-to-date, consolidated and unambiguous data from the sources to the end-users need to have a clear understanding of the following issues (see more in the following section):
1. The location of the data
2. The structure of each involved data source
3. The operations that take place towards the propagation, cleaning, transformation and consolidation of the data towards the central warehouse
4. Any audit information concerning who has been using the warehouse and in what ways, so that its performance can be tuned
5. The way the structure (e.g., relational attributes) of each data repository is related to a common model that characterizes each module of information.
Data warehouse metadata repositories store large parts (if not all) of this kind of data warehouse metadata and provide a central point of reference for all the stakeholders that are involved in a data warehouse environment. What happened was that, as in all areas of data warehousing, ad-hoc solutions by industrial vendors and consultants were in place before the academic world provided a principled solution for the problem of the structure and management of data warehouse metadata. Early academic projects related to wrapper-mediator schemes of information integration (Information Manifold, WHIPS, Squirrel, TSIMMIS – see [9] for a detailed discussion of the related literature) did not treat metadata as first-class concepts in their deliberations. At the same time, early standardization efforts from the industrial world (e.g., the MDIS
standard [13]) were also poor in their treatment of the problem. The first focused attempt towards the problem of data warehouse metadata management was made in the context of the European Project ‘‘Foundations of Data Warehouse Quality (DWQ)’’ [7,5]. In Fig. 1, the vertical links represent levels of abstraction: the data warehouse metadata repository, depicted in the middle layer, is an abstraction of the way the warehouse environment is structured in real life (depicted in the lowest layer of Fig. 1). At the same time, coming up with the appropriate formalism for expressing the contents of the repository (depicted in the upper layer of Fig. 1), provided an extra challenge that was tackled by [7] through the usage of the Telos language.
Foundations
Structure of the data warehouse metadata repository. A principled approach towards organizing the structure of the data warehouse metadata repository was
Data Warehouse Metadata. Figure 1. Role and structure of a data warehouse metadata repository [12].
first offered by [7,8]. The ideas of these papers were subsequently refined in [9] and formed the basis of the DWQ methodology for the management of data warehouse metadata. The specifics of the DWQ approach are fundamentally based on the separation of data and processes and their classification in a grid which is organized in three perspectives, specifically the conceptual, the logical and the physical one, and three location levels, specifically the source, warehouse and client levels (thus the 3 × 3 contents of the middle layer of Fig. 1 and also the structure of Fig. 2). The proposal was subsequently extended to incorporate a program versus data classification (Fig. 1) that discriminates static architectural elements of the warehouse environment (i.e., stored data) from process models (i.e., software modules). The location axis is straightforward and classifies elements as source, data warehouse and client elements. The data warehouse elements incorporate both the officially published data, contained in fact and dimension tables, as well as any auxiliary data structures concerning the Operational Data Store and the Data Staging Area. Similarly, any back-stage Extract-Transform-Clean (ETL) processes that populate the warehouse and the data marts with data are also classified according to the server on which they execute. The most interesting part of the DWQ method
has to do with the management of the various models (a.k.a. perspectives in the DWQ terminology) of the system. Typically, in all DBMS’s –and, thus, all deployed data warehouses- the system catalog includes both a logical model of the data structure (i.e., the database schema) as well as a physical schema, indicating the physical properties of the data (tablespaces, internal representation, indexing, statistics, etc) that are useful to the database administrator to perform his everyday maintenance and tuning tasks. The DWQ approach claimed that in a complicated and large environment like a data warehouse it is absolutely necessary to add a conceptual modeling perspective to the system that explains the role of each module of the system (be it a data or a software module). Clearly, due to the vast number of the involved information systems, each of them is accompanied by its own model, which is close enough to the perception of its users. Still, to master the complexity of all these submodels, it is possible to come up with a centralized, reference model of all the collected information (a.k.a., enterprise model) – exploiting, thus, the centralized nature of data warehouses. The interesting part of the method is the idea of expressing every other submodel of the warehouse as a ‘‘view’’ over this enterprise model. Thus, once an interested user understands the enterprise model, he/she can ultimately understand the
Data Warehouse Metadata. Figure 2. The DWQ proposal for the internal structure of the data warehouse metadata repository [4].
particularities of each submodel, independently of whether it concerns a source or client piece of data or software. In [15], the authors discuss a coherent framework for the structuring of data warehouse metadata. The authors discriminate between back-stage technical metadata, concerning the structure and population of the warehouse and semantic metadata, concerning the front-end of the warehouse, which are used for querying purposes. Concerning the technical metadata, the proposed structure is based on (i) entities, comprising attributes as their structural components and (ii) an early form of schema mappings, also called mappings in the paper’s terminology, that try to capture the semantics of the back-stage ETL process by appropriately relating the involved data stores through aggregations, joins etc. Concerning the semantic metadata, the authors treat the enterprise model as a set of business concepts, related to the typical OLAP metadata concerning cubes, dimensions, dimension levels and hierarchies. The overall approach is a coherent, UML-based framework for data warehouse metadata, defined at a high-level of abstraction. Specialized approaches for specific parts (like definitions of OLAP models, or ETL workflows) can easily be employed in a complementary fashion to the framework of [6] (possibly through some kind of specialization) to add more detail to the metadata representation of the warehouse. It is also noteworthy to mention that the fundamental distinction between technical and business metadata has also deeply influenced the popular, industrially related literature [11]. Contents of the data warehouse metadata repository (data warehouse metadata in detail). The variety and complexity of metadata information in a data warehouse environment are so large that giving a detailed list of all
metadata classes that can be recorded is mundane. The reader who is interested in a detailed list is referred to [12] for a broader discussion of all these possibilities, and to [11] for an in-depth discussion with a particular emphasis on ETL aspects (with the note that the ETL process is indeed the main provider of entries in the metadata repository concerning the technical parts of the warehouse). In the sequel, the discussion is classified in terms of data and processes. Data. Figure 3 presents a summarized view of relevant metadata concerning the static parts of the warehouse architecture. The physical-perspective metadata are mostly related to (i) the location and naming of the information wherever data files are used and (ii) DBMS catalog metadata wherever DBMSs are used. Observe the need for efficiently supporting the end-user in his navigation through the various reports, spreadsheets and web pages (i.e., answering the question ''where can I find the information I am looking for?''); also observe the need to support the question ''what information is available to me anyway?'', which is supported at the logical perspective for the client level. The rest of the logical perspective is also straightforward and mostly concerns the schema of data; nevertheless, business rules are also part of any schema, and thus data cleaning requirements and the related business rules can also be recorded at this level. The conceptual perspective involves a clear recording of the involved concepts and their intra-level mappings (source-to-DW, client-to-DW). As expected, academic efforts adopt rigorous approaches at this level [9], whereas the industrial literature suggests informal, but simpler, methods (e.g., see the discussion on ''Business metadata'' in [11]). It is important to stress the need to trace the mappings between the different levels and perspectives in the
Data Warehouse Metadata. Figure 3. Metadata concerning the data of the warehouse.
warehouse. The physical-to-logical mapping is typically performed by the DBMS’s and their administrative facilities; nevertheless, the logical-to-conceptual mapping is not. Two examples are appropriate in this place: (i) the developer who constructs (or worse, maintains) a module that processes a source file of facts, has to translate cryptic code-and-value pairs (e.g., CDS_X1 = 145) to data that will be stored in the warehouse and (ii) an end-user who should see data presented with names that relate to the concepts he is familiar with (e.g., see a description ‘‘Customer name’’ instead of the attribute name CSTR_NAME of a dimension table). In both cases, the logical-to-conceptual mappings are of extreme importance for the appropriate construction and maintenance of code and reports. This is also the place to stress the importance of naming conventions in the schema of databases and the signatures of software modules: the huge numbers of involved attributes and software modules practically enforce the necessity of appropriately naming all data and software modules in order to facilitate the maintenance process (see [11] for detailed instructions). Processes. When the discussion comes to the metadata that concern processes, things are not very complicated again, at the high level (Fig. 4). There is a set of ETL workflows that operate at the warehouse level, and populate the warehouse along with any pre-canned reports or data marts on a regular basis. The structure of the workflow, the semantics of the activities and the regular scheduling of the process form the conceptual and logical parts of the metadata. The physical locations and names of any module, along with the management of failures form the physical part of the metadata, concerning the design level of the software. Still, it is worth noting that the physical metadata can
be enriched with information concerning the execution of the back-stage processes, the failures, the volumes of processed data, clean data, cleansed or impossible-to-clean data, the error codes returned by the DBMS, and the time that the different parts of the process took. This kind of metadata is of statistical importance for the tuning and maintenance of the warehouse back-stage by the administration team. At the same time, the audit information is of considerable value, since data lineage is recorded: every step (i.e., transformation or cleaning) in the path that the data follow from the sources to their final destination can be traced. Standards. The development of standards for data warehouse metadata has been one of the holy grails in the area of data warehousing. The standardization of data warehouse metadata allows the vendors of all kinds of warehouse-related tools to extract and retrieve metadata in a standard format. At the same time, metadata interchange among different sources and platforms – and even migration from one software configuration to another – is served by being able to export metadata from one configuration and load it into another. The first standardization effort came from the MetaData Coalition (MDC), an industrial, non-profit consortium. The standard was named MetaData Interchange Specification (MDIS) [13] and its structure was elementary, comprising descriptions for databases, records, dimensions and their hierarchies, and relationships among them. Some years after MDIS, the Open Information Model (OIM) [14] followed. OIM was also developed in the context of the MetaData Coalition and significantly extends MDIS by capturing core metadata types found in the operational
Data Warehouse Metadata. Figure 4. Metadata concerning the process of the warehouse.
and data warehousing environment of enterprises. The MDC OIM uses UML both as a modeling language and as the basis for its core model. The OIM is divided into sub-models, or packages, which extend UML in order to address different areas of information management, including database schema elements, data transformations, OLAP schema elements and data types. Some years later, in 2001, the Object Management Group (OMG) initiated its own standard, named Common Warehouse Metamodel (CWM) [4]. CWM is built on top of other standard OMG notations (UML, MOF, XMI) also with the aim to facilitate the interchange of metadata between different tools and platforms. As of 2007, CWM appears to be very popular, both due to its OMG origin and as it is quite close to the parts concerning data warehouse structure and operation. Much like OIM, CWM is built around packages, each covering a different part of the data warehouse lifecycle. Specifically, the packages defined by CWM cover metadata concerning (i) static parts of the warehouse architecture like relational, multidimensional and XML data sources, (ii) back-stage operations like data warehouse processes and operations, as well as data transformations and (iii) front-end, user-oriented concepts like business concepts, OLAP hierarchies, data mining and information visualization tasks. A detailed comparison of earlier versions of OIM and CWM can be found in [19].
Key Applications
Data Warehouse Design. Typically, the data warehouse designers both populate the repository with data and benefit from the fact that the internal structure and architecture of the warehouse is documented in the metadata repository in a principled way. [17] implements a generic graphical modeling tool operating on top of a metadata repository management system that uses the IRDS standard. Similar results can be found in [3,18]. Data Warehouse Maintenance. The same reasons as for data warehouse design explain why the data warehouse administrators can effectively use the metadata repository for tuning the operation of the warehouse. In [16], there is a first proposal for the extension of the data warehouse metadata with operators characterizing the evolution of the warehouse's structure over time. A more formal approach to the problem is given by [6]. Data Warehouse Usage. Developers constructing or maintaining applications, as well as the end-users interactively exploring the contents of the warehouse, can
benefit from the documentation facilities that data warehouse metadata offer (refer to [11] for an example where metadata clarify semantic discrepancies for synonyms). Data Warehouse Quality. The body of research on annotating data warehouse metadata with indicators concerning the quality of the collected data (a.k.a. quality indicators) is quite large. The interested reader is referred to [10,9] for detailed discussions. Model Management. Model management was built upon the results of having a principled structure for data warehouse metadata. The early attempts in the area [1,2] were largely based on the idea of mapping source and client schemata to the data warehouse schema and tracing their attribute inter-dependencies. Design of large Information Systems. The mental tools developed for the management of large, intra-organizational environments like data warehouses can possibly benefit other areas – even as a starting point. The most obvious candidate concerns any kind of open agoras of information systems (e.g., digital libraries) that clearly need a common agreement on the hidden semantics of exported information before they can interchange data or services.
Cross-references
▶ CWM
▶ Data Quality
▶ Data Warehouse Life-Cycle and Design
▶ Data Warehouse
▶ MDC
▶ Metadata
▶ Metadata Repository
▶ Model Management
▶ OIM
Recommended Reading
1. Bernstein P., Levy A., and Pottinger R. A Vision for management of complex models. ACM SIGMOD Rec., 29(4):55–63, 2000.
2. Bernstein P.A. and Rahm E. Data warehouse scenarios for model management. In Proc. 19th Int. Conf. on Conceptual Modeling, 2000, pp. 1–15.
3. Carneiro L. and Brayner A. X-META: A methodology for data warehouse design with metadata management. In Proc. 4th Int. Workshop on Design and Management of Data Warehouses, 2002, pp. 13–22.
4. Common Warehouse Metamodel (CWM) Specification, version 1.1. OMG, March 2003.
5. Foundations of Data Warehouse Quality (DWQ) homepage. http://www.dblab.ece.ntua.gr/dwq/.
6. Golfarelli M., Lechtenbörger J., Rizzi S., and Vossen G. Schema versioning in data warehouses: enabling cross-version querying via schema augmentation. Data Knowl. Eng., 59(2):435–459, 2006.
7. Jarke M., Jeusfeld M.A., Quix C., and Vassiliadis P. Architecture and quality in data warehouses. In Proc. 10th Conf. on Advanced Information Systems Engineering, LNCS, vol. 1413, 1998, pp. 93–113.
8. Jarke M., Jeusfeld M.A., Quix C., and Vassiliadis P. Architecture and quality in data warehouses. Inf. Syst., 24(3):229–253, 1999.
9. Jarke M., Lenzerini M., Vassiliou Y., and Vassiliadis P. (eds.). Fundamentals of Data Warehouses (2nd edn.). Springer, 2003, p. 207.
10. Jeusfeld M.A., Quix C., and Jarke M. Design and analysis of quality information for data warehouses. In Proc. 17th Int. Conf. on Conceptual Modeling, 1998, pp. 349–362.
11. Kimball R. and Caserta J. The Data Warehouse ETL Toolkit. Wiley, New York, NY, 2004.
12. Kimball R., Reeves L., Ross M., and Thornthwaite W. The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses. Wiley, 1998.
13. Metadata Coalition: Proposal for version 1.0 metadata interchange specification, 1996.
14. MetaData Coalition. Open Information Model, version 1.0 (1999).
15. Müller R., Stöhr T., and Rahm E. An integrative and uniform model for metadata management in data warehousing environments. In Proc. Int. Workshop on Design and Management of Data Warehouses, 1999.
16. Quix C. Repository support for data warehouse evolution. In Proc. Int. Workshop on Design and Management of Data Warehouses, 1999.
17. Sapia C., Blaschka M., and Höfling G. GraMMi: Using a standard repository management system to build a generic graphical modeling tool. In Proc. 33rd Annual Hawaii Int. Conf. on System Sciences, 2000.
18. Vaduva A., Kietz J.-U., and Zücker R. M4 - A metamodel for data preprocessing. In Proc. ACM 4th Int. Workshop on Data Warehousing and OLAP, 2001.
19. Vetterli T., Vaduva A., and Staudt M. Metadata standards for data warehousing: Open Information Model vs. Common Warehouse Metamodel. ACM SIGMOD Rec., 29(3):68–75, 2000.
Data Warehouse Query Processing ▶ Query Processing in Data Warehouses
Data Warehouse Refreshment ▶ Extraction, Transformation and Loading
Data Warehouse Security
CARLOS BLANCO 1, EDUARDO FERNÁNDEZ-MEDINA 1, JUAN TRUJILLO 2, MARIO PIATTINI 1
1 University of Castilla-La Mancha, Ciudad Real, Spain
2 University of Alicante, Alicante, Spain
Synonyms
Secure data warehouses; Data warehouses confidentiality
Definition
Security, as is stated in the ISO/IEC 9126 International Standard, is one of the components of software quality. Information Security can be defined as the preservation of confidentiality, integrity and availability of information [5], in which confidentiality ensures that information is accessible only to those users with authorization privileges. Integrity safeguards the accuracy and completeness of information and process methods, and availability ensures that authorized users have access to information and associated assets when required. Other modern definitions of Information Security also consider properties such as authenticity, accountability, non-repudiation, and reliability. Therefore, Data Warehouse (DW) Security is defined as the mechanisms which ensure the confidentiality, integrity and availability of the data warehouse and its components. Confidentiality is especially important once the Data Warehouse has been deployed, since the most frequent operations that users perform are SQL and OLAP queries, and therefore the most frequent security attack is against the confidentiality of data stored in the data warehouse.
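As a minimal illustration of the confidentiality aspect of the definition above (the roles, users, and multidimensional elements below are hypothetical and do not correspond to any particular OLAP tool), a role-based access check guarding OLAP queries could be sketched as follows:

    # Hypothetical role-based read permissions on multidimensional elements.
    ROLE_PERMISSIONS = {
        "analyst":    {("cube", "Sales"), ("dimension", "Product")},
        "hr_manager": {("cube", "Sales"), ("cube", "Payroll"), ("dimension", "Employee")},
    }
    USER_ROLES = {"alice": {"analyst"}, "bob": {"hr_manager"}}

    def can_read(user, element):
        """True if any of the user's roles grants read access to the element."""
        return any(element in ROLE_PERMISSIONS.get(role, set())
                   for role in USER_ROLES.get(user, set()))

    # An OLAP query on the Payroll cube is allowed for bob but denied for alice.
    assert can_read("bob", ("cube", "Payroll"))
    assert not can_read("alice", ("cube", "Payroll"))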
Historical Background
Considering that DWs are the basis of companies' decision-making processes, and due to the fact that they frequently contain crucial and sensitive internal information and are usually managed by OLAP tools, most of the initial approaches to data warehouse security were focused on the definition and enforcement of access control policies for OLAP tools [6,10], taking into consideration one of the most traditional access control models (Discretional Access Control) and also managing the concept of role defined as subject. Other approaches dealt with real implementations in specific commercial tools by using multidimensional elements [10]. Indirect access and covert channel problems have
also been detected in Statistical Databases, but an entirely satisfactory solution has not yet been found. Moreover, the data stored in DWs come from heterogeneous data sources that must be integrated, which raises further security problems. However, few works deal with the security defined in the data sources, i.e., with the problem of merging the different security measures established in each source. This problem has, nevertheless, been addressed in the field of Federated Databases, and some authors have used this parallelism to propose an architecture for developing Data Warehouses through the integration of the Multilevel Access Control (MAC) policies defined in the data sources [12]. Furthermore, ETL processes have to load the information extracted and transformed from the data sources into the Data Warehouse; they should therefore reuse the security defined in each source and add new security measures wherever the sources lack them. The proposals in this field, however, focus solely on modeling ETL processes and do not consider security issues. In recent decades, the development of DWs has evolved from a handcrafted activity into a more engineering-based process, and several approaches have been defined for the conceptual modeling of DWs, e.g., [4,8]. Unfortunately, none of these proposals considers security issues. However, one of these approaches has recently been extended to propose a Model Driven Multidimensional approach for
developing secure DWs [1]. This approach permits the inclusion of security requirements (audit and access control) from the first stages of the DWs life cycle, and it is possible to automatically generate code for different target platforms through the use of model transformation. The scientific community demands the integration of security engineering and software engineering in order to ensure the quality and robustness of the final applications [9], and this approach fulfills this demand.
Foundations The DW development process follows the scheme presented in Fig. 1. Therefore, security should be considered in all stages of this process by integrating the existing security measures defined in data sources, considering these measures in ETL processes, defining models that represent security constraints at a high level of abstraction and finally, enforcing these security constraints in the OLAP tools in which the DW is deployed. Security in Data Sources
In the DW architecture, data coming from heterogeneous data sources are extracted, cleaned, corrected, and stored. Once this process is completed, the DW is composed of these stored and integrated data, from which users can discover information for strategic decision-making processes. Data sources are heterogeneous, can use different representation models (relational databases, object-oriented databases, XML files, etc.), and may or may not have associated
Data Warehouse Security. Figure 1. Data warehouse architecture.
security policies. Although the users of the DW differ from those of the data sources, these security policies should be considered and integrated into the DW security design. Data source security can be defined by using various security policies, such as Discretionary Access Control (DAC), which restricts access to objects based on the identity of subjects holding a certain access permission; Mandatory Access Control (MAC), which restricts access to objects based on the sensitivity of the information contained in the objects and the formal authorization of subjects to access information of such sensitivity; or Role-Based Access Control (RBAC), which restricts system access to authorized users by assigning permissions to perform certain operations to specific roles. The integration of these policies presents a problem which has been studied in Federated Databases [12]. Some research efforts have been made to integrate different multilevel policies in a semi-automatic manner by using a schema integration process which obtains the federated ordered set of security levels and the translation functions between it and the ordered set belonging to each component database. In addition, the integration of different role-based policies has been dealt with by representing role configurations as role graphs and by applying graph integration techniques to obtain the final role configuration. Other authors, such as Rosenthal and Sciore [11], have applied inference mechanisms to data sources in order to obtain access control policies and have used them to set up DW security. After considering the parallelism between DWs and Federated Information Systems (FIS), Saltor et al. [12] propose a seven-layer architecture for preserving and integrating the multilevel security established in the data sources. This architecture extends the five-layer architecture developed for FIS with two schemas: ‘‘authorization schemas’’ for each authorization level and ‘‘external schemas’’ with which to represent the multilevel security information of the data sources in a Canonical Data Model (CDM). These ‘‘external schemas’’ with security information are later used to obtain the DW and Data Mart (DM) schemas.
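The integration of ordered security levels sketched above (a federated ordered set plus translation functions from each source's local levels) can be illustrated with a small fragment. The following Python sketch is purely illustrative and is not taken from [12]; the level names, mappings, and access rule are assumptions made for the example.

# Minimal sketch: integrating two sources' multilevel (MAC) orderings into a
# federated ordered set via translation functions, then checking read access.
# All level names and mappings are illustrative assumptions.

FEDERATED_ORDER = ["public", "internal", "confidential", "secret"]  # low -> high

# Translation functions: each source's local levels mapped into the federated set.
TRANSLATE = {
    "source_a": {"unclassified": "public", "restricted": "confidential"},
    "source_b": {"low": "public", "medium": "internal", "high": "secret"},
}

def federated_level(source: str, local_level: str) -> int:
    """Return the rank of a source-specific level in the federated ordered set."""
    return FEDERATED_ORDER.index(TRANSLATE[source][local_level])

def can_read(subject_clearance: str, source: str, object_level: str) -> bool:
    """Simple 'no read up' rule expressed against the federated ordering."""
    return FEDERATED_ORDER.index(subject_clearance) >= federated_level(source, object_level)

# Example: a subject cleared at 'internal' may read source_b 'medium' data,
# but not source_a 'restricted' data.
print(can_read("internal", "source_b", "medium"))      # True
print(can_read("internal", "source_a", "restricted"))  # False

In a real integration the federated ordered set and the translation functions would be derived semi-automatically from the component databases, as described above, rather than written by hand.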
Security in ETL Processes
ETL (Extraction-Transformation-Loading) processes make up the acquisition stage: they extract information from the heterogeneous data sources, cleanse it, integrate it and, finally, load it into the data warehouse following its previously defined design.
It is necessary to define security measures in ETL processes, both to use, adapt and integrate the security measures defined in the data sources and to add new security measures wherever the sources lack them. At present, several interesting proposals for modeling ETL processes exist that could be extended to include security, but none of them deals with security issues. Vassiliadis and Simitsis use their own graphical notation for modeling ETL processes at the conceptual level, propose how to transform these conceptual designs into logical designs [13], and define a framework for designing and maintaining ETL processes (ARKTOS). Trujillo and Luján-Mora [15] model ETL processes by using the UML notation and OCL to establish constraints. Their proposal does not take attributes into consideration, but it simplifies the design and maintenance processes, and the use of UML and OCL greatly eases the extension of this model with security.
Security in Data Warehouses Modeling
Multidimensional modeling is the foundation of DWs, Multidimensional Databases and On-Line Analytical Processing (OLAP) applications, and it differs from traditional database modeling in that it is adapted to the characteristics of these approaches. Although a substantial body of work on security measures and access control models exists for relational databases, it cannot be directly applied because it is not appropriate for DWs: the two kinds of models are based on different concepts. Relational security measures are expressed in terms of database tables, rows and columns, whereas DW security is expressed in the multidimensional terms of facts, dimensions and classification hierarchies. Several modeling proposals specifically created for DWs consider their properties, but none of them uses standard notations or includes security issues, e.g., [4,8]. A model-driven multidimensional modeling approach for developing secure DWs has been proposed by Fernández-Medina et al. [1]. It is based on Query/View/Transformation (QVT) and the Model-Driven Architecture (MDA) (see Fig. 2), aligning MDA with the DW development process by considering multidimensional models as Platform-Independent Models (PIM), logical models (such as ROLAP, MOLAP and HOLAP) as Platform-Specific Models (PSM),
Data Warehouse Security. Figure 2. Model driven architecture.
and the DBMS and OLAP tools as the target platforms. The proposal is made up of a security model (an access control and audit model) for DWs [2], an extension of UML for modeling secure multidimensional models [3] at the PIM level, and an extension of the Common Warehouse Metamodel (CWM) [14] at the PSM level. The proposal is currently being extended at both ends of the MDA architecture: the Computation-Independent Model (CIM) level is being defined through an extension of i* that captures security goals and subgoals, and code generation is being addressed for Oracle, SQL Server Analysis Services, and Pentaho as target platforms of the architecture.
Security in OLAP Tools
OLAP systems are mechanisms for discovering business information through the multidimensional analysis of data in support of strategic decisions. This information is organized according to business parameters, and users may uncover unauthorized data by applying a sequence of OLAP operations to the multidimensional view. It is therefore vital for the organization to protect its data both from unauthorized direct accesses, by including security constraints in OLAP systems that take the OLAP operations (roll-up, drill-down, slice-and-dice and pivoting) into account, and from indirect accesses (inferences) that exploit parallel navigation, tracker queries, etc. The inference problem is an important security problem in OLAP that has yet to be solved and that can be studied by using the
existing parallelism with Statistical Databases. Various solutions to the problem of controlling inference have been applied, such as the perturbation of data or the limitation of queries, but these imply a large amount of computational effort. The establishment of security constraints at the cell level, on the other hand, allows inferences to be controlled without this loss of efficiency. Several works have attempted to include security issues in OLAP tools by implementing security rules previously defined at the conceptual level, but they focus solely on Discretionary Access Control (DAC) and use a simplified role concept implemented as a subject. For instance, Katic et al. [6] proposed a DW security model based on metamodels which provides views for each user group and uses DAC with classification and access rules for security objects and subjects; however, this model does not allow complex confidentiality constraints to be defined. Kirkgöze et al. [7] defined a role-based security concept for OLAP by using a ‘‘constraints list’’ for each role, implemented through a discretionary system in which roles are defined as subjects. Priebe and Pernul later proposed a security design methodology, analyzed security requirements, classifying them into basic and advanced, and dealt with their implementation in commercial tools. First, in [10] they used an adapted UML to define a DAC system with roles defined
as subjects at a conceptual level. They then went on to implement this in Microsoft Analysis Services (SQL Server 2000) by using Multidimensional Expressions (MDX). They created a Multidimensional Security Constraint Language (MDSCL) based on MDX and put forward HIDE statements with which to represent negative authorization constraints on certain multidimensional elements: cube, measure, slice, and level.
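As an illustration of the kind of negative authorizations just described (hiding particular measures or slices from a role), the following Python fragment is a hedged sketch; it does not reproduce MDSCL or MDX syntax, and the role names, rule format, and cell representation are assumptions made for the example.

# Illustrative sketch of negative authorizations ("hide" rules) evaluated per cell.
# A cell is addressed by its dimension coordinates; a rule hides cells whose
# coordinates match all of the rule's conditions. Names are assumptions.

HIDE_RULES = {
    "sales_analyst": [
        {"measure": "salary"},                      # hide a whole measure
        {"dimension": "store", "member": "HQ"},     # hide one slice of Store
    ],
}

def is_hidden(role: str, cell: dict) -> bool:
    """Return True if any of the role's hide rules matches the cell description."""
    for rule in HIDE_RULES.get(role, []):
        if all(cell.get(key) == value for key, value in rule.items()):
            return True
    return False

cells = [
    {"measure": "units_sold", "dimension": "store", "member": "Boston"},
    {"measure": "salary",     "dimension": "store", "member": "Boston"},
    {"measure": "units_sold", "dimension": "store", "member": "HQ"},
]
visible = [c for c in cells if not is_hidden("sales_analyst", c)]
print(visible)  # only the Boston units_sold cell remains visible

An actual OLAP server would typically enforce such constraints inside the query engine, rewriting or restricting queries, rather than filtering result cells after the fact.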
Key Applications DWs security is a highly important quality aspect of a DW, which must be taken into account at all stages of the development process. If security measures are not established, then unauthorized users may obtain the business information used for making strategic decisions which is vital to the survival of the organization. DWs security has to be considered in all the fields involved. These are, principally, the following: the application of techniques through which to integrate different kinds of security policies detected in the data sources; the definition of models, which permit the establishment of security constraints at upper abstraction levels; and the study of the final implementation of the defined security measures in OLAP tools in order to protect information from malicious operations such as navigations or inferences.
Cross-references
▶ Data Warehousing Systems: Foundations and Architectures ▶ Extraction, Transformation and Loading ▶ Multidimensional Modeling ▶ On-Line Analytical Processing
Recommended Reading 1. Fernández-Medina E., Trujillo J., and Piattini M. Model driven multidimensional modeling of secure data warehouses. Eur. J. Inf. Syst., 16:374–389, 2007. 2. Fernández-Medina E., Trujillo J., Villarroel R., and Piattini M. Access control and audit model for the multidimensional modeling of data warehouses. Decis. Support Syst., 42(3):1270–1289, 2006. 3. Fernández-Medina E., Trujillo J., Villarroel R., and Piattini M. Developing secure data warehouses with a UML extension. Inf. Syst., 32(6):826–856, 2007. 4. Golfarelli M., Maio D., and Rizzi S. The dimensional fact model: a conceptual model for data warehouses. Int. J. Coop. Inf. Syst., 7(2–3):215–247, 1998.
5. ISO27001, ISO/IEC 27001 Information technology – Security techniques – Information security management systems – Requirements, 2005. 6. Katic N., Quirchmayr G., Schiefer J., Stolba M., and Tjoa A. A prototype model for DW security based on metadata. In Proc. Ninth Int. Workshop on Database and Expert Systems Applications, 1998, p. 300. 7. Kirkgöze R., Katic N., Stolba M., and Tjoa A. A security concept for OLAP. In Proc. 8th Int. Workshop on Database and Expert System Applications, 1997, p. 619. 8. Luján-Mora S., Trujillo J., and Song I.-Y. A UML profile for multidimensional modeling in data warehouses. Data Knowl. Eng., 59(3):725–769, 2006. 9. Mouratidis H. and Giorgini P. Integrating Security and Software Engineering: Advances and Future Visions. Idea Group, Hershey, PA, 2006. 10. Priebe T. and Pernul G. A pragmatic approach to conceptual modeling of OLAP security. In Proc. 20th Int. Conf. on Conceptual Modeling, 2001, pp. 311–324. 11. Rosenthal A. and Sciore E. View security as the basis for data warehouse security. In Proc. 2nd Int. Workshop on Design and Management of Data Warehouses, 2000, p. 8. 12. Saltor F., Oliva M., Abelló A., and Samos J. Building secure data warehouse schemas from federated information systems. In Heterogeneous Information Exchange and Organizational Hubs, D.T. Bestougeff (ed.). Kluwer Academic, 2002. 13. Simitsis A. and Vassiliadis P. A method for the mapping of conceptual designs to logical blueprints for ETL processes. Decis. Support Syst., 45(1):22–40, 2007. 14. Soler E., Trujillo J., Fernández-Medina E., and Piattini M. SECRDW: an extension of the relational package from CWM for representing secure data warehouses at the logical level. In Proc. 5th Int. Workshop on Security in Information Systems, 2007, pp. 245–256. 15. Trujillo J. and Luján-Mora S. A UML based approach for modeling ETL processes in data warehouses. In Proc. 22nd Int. Conf. on Conceptual Modeling, 2003, pp. 307–320.
Data Warehousing for Clinical Research SHAWN MURPHY Massachusetts General Hospital, Boston, MA, USA
Synonyms Clinical research chart
Definition The clinical data warehouse allows rapid querying and reporting across patients. It is used to support the discovery of new relationships between the causes and effects of diseases, and to find specific patients who qualify for research studies.
Historical Background In healthcare, the term ‘‘data warehouse’’ is generally reserved for those databases optimized for analysis and integrated queries across patient populations. This is as opposed to the transactional database, which is optimized for rapid updating and highly specific kinds of retrieval (like those based upon a specific patient identifier). There appear to be three fundamentally different approaches to organizing the healthcare data warehouse. The first is to extract tables from the transaction systems of the healthcare organization and load them into the database platform of the data warehouse with minimal transformation of the data model. The codes present in the columns are usually transformed to make them compatible with codes from other systems. For example, an ICD9 diagnosis code stored as ‘‘27.60’’ in one system may be transformed to a common format of 02760. However, the tables are left in essentially the same schema as the transaction system [2]. The second approach is more ambitious, where not just the codes from different systems are transformed to look the same, but the data is transformed to look the same as well. The diverse data coming from different systems must be made to fit into new tables. This involves a considerable amount of data transformation, but queries against the warehouse are then much less complex [1]. This is the approach that will be described. The third approach is to keep the data located at its source in a ‘‘distributed’’ data warehouse. Queries are distributed to the local databases across a network. This strategy can be successful when patients have all of their data contained within one of the local systems (such as when systems exist in distant cities). However, if a single patient’s data is distributed across many of these local databases, detailed data would need to travel across the network to be accumulated in the internal processing structures of a central CPU to allow the execution of query plans. This will have a severe negative impact on the performance of these types of systems.
Foundations Database Design for Clinical Research Data Warehouse
The clinical data warehouse allows rapid querying and reporting across patients, which unexpectedly is not available in most clinical transaction systems. Rather, transaction systems are optimized for lookups, inserts,
updates, and deletes to a single patient in the database. Transactions usually occur in small packets during the day, such as when a patient’s lab test is sent to the database. Transaction systems are usually updated by small bits of data at a time, but these bits come in at the rate of thousands per second. Therefore the typical clinical database used for patient care must be optimized to handle these transactions [2]. Because the clinical data warehouse does not need to handle high volumes of transactions all day long, it can be optimized for rapid, cross-patient searching. For optimal searching of a database it is best to have very large tables, which can be indexed such that a single index allows a global search. So when one designs a clinical data warehouse, one adopts a few tables that can hold nearly all the available data. The way to hold many forms of healthcare data in the same table is to use the classical entity-attribute-value schema (or EAV for short) [4,5]. The EAV schema forces one to define the fundamental fact of healthcare [2]: the most detailed rendition possible of any healthcare observation as reported from the data warehouse. This can be defined as an observation on a patient, made at a specific time, by a specific observer, during a specific event. The fact may be accompanied by any number of values or modifiers. Each observation is tagged with a specific concept code, and each observation is entered as a row in a ‘‘fact table.’’ This fact table can grow to billions of rows, each representing an observation on a patient. The fact table is complemented by at least an event table, a patient table, a concept table, and an observer table [4]. The Patient table is straightforward. Each row in the table represents a patient in the database. The table includes common fields such as gender, age, race, etc. Most attributes of the patient dimension table are discrete (e.g., Male/Female, Zip code) or relevant dates. The Event table represents a ‘‘session’’ during which observations were made. This ‘‘session’’ can involve a patient directly, such as a visit to a doctor’s office, or it can involve the patient indirectly, such as running several tests on a tube of the patient’s blood. Several observations can be made during a visit. Visits have a start and end date-time. The visit record also contains specifics about the location of the session, such as in which hospital or clinic the session occurred, and whether the patient was an inpatient or outpatient at the time of the visit.
The Observer table is a list of observers. Generally, each row in the observer dimension represents a provider at an institution but, more abstractly, it may be an observing machine, such as an Intensive Care Unit continuous blood pressure monitor. The Concept table is the key to understanding how to search the fact table. A concept specifies exactly what observation was made on the patient and is represented in a particular row of the fact table. A code is used to represent the concept in the fact table, and the concept table links it to a human-readable description of the code (Fig. 1).
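A minimal relational sketch of this design is given below, using Python and SQLite purely for illustration; the table and column names are assumptions chosen to mirror the description (a fact table complemented by patient, event, concept, and observer tables) and do not reproduce the schema of any particular system.

import sqlite3

# Minimal sketch of the EAV-style star schema described above (names are
# illustrative). One row of observation_fact = one observation on a patient,
# made at a specific time, by a specific observer, during a specific event.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient_dim  (patient_id INTEGER PRIMARY KEY, gender TEXT,
                           birth_date TEXT, race TEXT, zip_code TEXT);
CREATE TABLE event_dim    (event_id INTEGER PRIMARY KEY, patient_id INTEGER,
                           start_date TEXT, end_date TEXT, location TEXT,
                           inpatient_flag TEXT);
CREATE TABLE concept_dim  (concept_cd TEXT PRIMARY KEY, concept_path TEXT,
                           name TEXT);
CREATE TABLE observer_dim (observer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE observation_fact (
    patient_id  INTEGER REFERENCES patient_dim(patient_id),
    event_id    INTEGER REFERENCES event_dim(event_id),
    concept_cd  TEXT    REFERENCES concept_dim(concept_cd),
    observer_id INTEGER REFERENCES observer_dim(observer_id),
    start_date  TEXT,
    value_num   REAL,      -- optional numeric value (e.g., a lab result)
    value_text  TEXT       -- optional textual value or modifier
);
CREATE INDEX obs_concept_idx ON observation_fact (concept_cd, patient_id);
""")
print("schema created")

Because nearly all observations land in one large fact table, it can be indexed to support the cross-patient searching discussed above.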
Metadata Management in Clinical Research Data Warehouse
When looking at the rows of the concept table, one is introduced to metadata. Metadata is everywhere in a data warehouse. It represents data about the data, and it is where medical knowledge is represented in the clinical data warehouse. The primary form of representation is the grouping of terms so that they can be queried as groups of similar concepts. The terms are grouped into hierarchies, each level up usually expressing a more general medical concept. Many diverse concepts about a patient can exist in the fact table; in a clinical data warehouse, typically 100–500 thousand different concepts exist. All sorts of concepts, including ICD-9 codes (International Classification of Diseases, 9th Edition, the most common codes used in hospitals to classify diagnoses), CPT codes (Current
Procedural Terminology, the most common codes used in hospitals to classify procedures), NDC codes (National Drug Codes, the most common codes used in hospitals to classify medications), and LOINC codes (Logical Observation Identifiers Names and Codes, the most common codes used in hospitals to classify laboratory tests), as well as numerous local coding systems, are used to describe the patient. The challenge is maintaining and updating the classification of the concepts. This classification needs to absorb new codes seamlessly and to remain backward-compatible with old coding and classification systems. Organizing the concepts hierarchically allows the user to navigate and use the concepts in a query. Like a file path in the Windows Explorer, the path of the hierarchy indicates to which groups the concept belongs, with the most general group listed on the far left and each group to the right of it growing more and more specific. An interface presenting this concept representation is shown in Fig. 2. The use of this interface has been described in detail [3]; it is essentially a way of building queries using concepts represented in the concept and provider dimension tables.
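The effect of grouping concepts by hierarchical path can be sketched with a simple prefix query. The following self-contained Python/SQLite fragment is illustrative only; the path convention, codes, and table names are assumptions.

import sqlite3

# Self-contained sketch: retrieving a group of related concepts by hierarchy
# path prefix. Table, column, and path conventions are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_dim (concept_cd TEXT PRIMARY KEY, concept_path TEXT, name TEXT);
CREATE TABLE observation_fact (patient_id INTEGER, concept_cd TEXT, start_date TEXT);
INSERT INTO concept_dim VALUES
  ('ICD9:250.00', '\\Diagnoses\\Diabetes\\Type II\\', 'Type II diabetes'),
  ('ICD9:401.9',  '\\Diagnoses\\Hypertension\\',      'Essential hypertension');
INSERT INTO observation_fact VALUES (1, 'ICD9:250.00', '2008-03-01'),
                                    (2, 'ICD9:401.9',  '2008-04-15');
""")

# All patients with any observation anywhere under the (assumed) Diabetes subtree.
path_prefix = "\\Diagnoses\\Diabetes\\"
rows = conn.execute("""
    SELECT DISTINCT f.patient_id
    FROM observation_fact f
    JOIN concept_dim c ON c.concept_cd = f.concept_cd
    WHERE c.concept_path LIKE ? || '%'
""", (path_prefix,)).fetchall()
print(rows)  # [(1,)]

Because every descendant concept shares the path prefix of its ancestors, a query for a general group needs no enumeration of the individual codes it contains.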
Privacy Management in the Clinical Research Data Warehouse
The clinical data warehouse should be built with patient privacy in mind. The most common strategy is to separate the data warehouse into two databases. The clinical
Data Warehousing for Clinical Research. Figure 1. Optimal star schema database design for healthcare data warehouse.
Data Warehousing for Clinical Research. Figure 2. Construction of query using the metadata from a healthcare data warehouse.
data goes into one database, and the identifiers of the patients go into a second database. Access to the second, identified database is strictly controlled; it is accessed only during data loading and the building of the data marts. The patients are given codes in the clinical database, and these codes can only be looked up in the identified database. In this way, customers can use the clinical database without having access to the patient identifiers.
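A minimal sketch of this two-database separation is shown below; the table names, the use of random surrogate codes, and the in-memory databases are assumptions made for the illustration.

import sqlite3
import secrets

# Illustrative sketch of separating identifiers from clinical data.
# Two databases: access to 'identified' would be tightly restricted.
identified = sqlite3.connect(":memory:")   # stands in for the restricted database
clinical   = sqlite3.connect(":memory:")   # stands in for the research database

identified.execute("CREATE TABLE patient_map (patient_code TEXT PRIMARY KEY, "
                   "mrn TEXT, name TEXT)")
clinical.execute("CREATE TABLE observation_fact (patient_code TEXT, "
                 "concept_cd TEXT, start_date TEXT)")

def register_patient(mrn: str, name: str) -> str:
    """Assign a random surrogate code; only the identified DB can resolve it."""
    code = secrets.token_hex(8)
    identified.execute("INSERT INTO patient_map VALUES (?,?,?)", (code, mrn, name))
    return code

code = register_patient("123-45-678", "Jane Doe")
clinical.execute("INSERT INTO observation_fact VALUES (?,?,?)",
                 (code, "ICD9:250.00", "2008-03-01"))

# Researchers query the clinical DB and see only surrogate codes.
print(clinical.execute("SELECT * FROM observation_fact").fetchall())

Only the restricted, identified database can map a surrogate code back to a real patient, which is what allows the clinical database to be used for research queries.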
Data Flow in Clinical Research Data Warehouse
Data generally flow into the data warehouse either by being loaded from the transaction systems or through a duplicate feed of the data that are going into the transaction systems. Data are usually loaded from the transaction systems when they are provided as large ‘‘data dumps,’’ or downloads. Transaction systems may contain many millions of records, but with current technology they can usually be written out in their entirety in just hours. Reloading all these data into the data warehouse similarly takes only a few hours, and the simplicity of this model, as opposed to the complexity of update models, often makes it a much more desirable process. The risk of an
update process is that errors in update flags will cause the data warehouse to become desynchronized with the transaction system. Note also that many transaction systems do not have a way to provide updates, so that a full ‘‘data dump’’ is all that is possible from the transaction system. When the data are loaded from the transaction systems, they are usually first loaded into a ‘‘staging area.’’ As previously discussed, the data structure usually differs considerably between the transaction system and the data warehouse. Loading the transaction data into a staging area allows the data to be studied and quality-assured before introducing the complexity of transforming the data into the format of the data warehouse. Because the teams from the transaction systems are usually very familiar with the data in this form, it is desirable to make each transaction team responsible for its corresponding staging area and to allow them to transfer and load the data into this area. The data warehouse will usually distribute data back to the data consumers as ‘‘data marts,’’ which are subsets of the data from the data warehouse. The advantage of this approach is that the data can be prepared per request in a consumer-friendly format.
Attempting to allow customers to query the clinical data warehouse using Structured Query Language (SQL) is rarely successful. The EAV scheme is notoriously unfriendly to the casual user of data [5]. Furthermore, the metadata exist in tables that are not obviously connected to the patient data, so that tables in the data warehouse often contain no human-readable content. Finally, the data in the data warehouse are often updated once every day, so analyses would have to run against constantly shifting data. The result is that the data are often exported into a user-friendly data mart. This also limits the set of patients that a customer can view, which is important from the patient-privacy point of view.
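The following Python/SQLite fragment is a hedged sketch of such an export: EAV rows are pivoted into a flat, consumer-friendly data-mart table with one column per concept of interest. Table names, codes, and the choice of concepts are assumptions made for the example.

import sqlite3

# Sketch of exporting EAV rows into a flat, consumer-friendly data-mart table.
# One column per concept of interest; names and codes are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE observation_fact (patient_id INTEGER, concept_cd TEXT, value_num REAL);
INSERT INTO observation_fact VALUES
  (1, 'LOINC:2345-7', 110.0),   -- glucose
  (1, 'LOINC:2093-3', 180.0),   -- cholesterol
  (2, 'LOINC:2345-7',  95.0);
CREATE TABLE mart_labs AS
SELECT patient_id,
       MAX(CASE WHEN concept_cd = 'LOINC:2345-7' THEN value_num END) AS glucose,
       MAX(CASE WHEN concept_cd = 'LOINC:2093-3' THEN value_num END) AS cholesterol
FROM observation_fact
GROUP BY patient_id;
""")
print(conn.execute("SELECT * FROM mart_labs ORDER BY patient_id").fetchall())
# [(1, 110.0, 180.0), (2, 95.0, None)]

The flattened table is what most analysis tools expect, which is one reason the data mart, rather than the warehouse itself, is the usual interface for customers.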
Key Applications This clinical research data warehouse allows researchers to quickly obtain information that can be critical for winning corporate and government-sponsored research grants, and to easily gather data on patients identified for research studies. It allows clinical data to be available for research analysis where security and confidentiality are an integral part of the design, bringing clinical information to researchers’ fingertips while controlling and auditing the distribution of patient data within the guidelines of the Institutional Review Boards. It also serves as a ‘‘building block’’ that enables high-throughput use of patient data in some of the following applications: 1. Bayesian inference engines. Bayesian inference can be used to synthesize many diverse observations into fundamental atomic concepts regarding a patient. For example, a code may be assigned to a patient from several sources indicating that the patient has a disease such as diabetes. Some sources may indicate the patient has type I diabetes, while others indicate the patient has type II diabetes. Since these two types of diabetes are mutually exclusive, it is clear that one of the sources is in error. The true diagnosis can be estimated by assigning to each source a prior probability of containing correct information and using these probabilities to calculate the likelihood of each diagnosis (a small numerical sketch follows this list). 2. Clinical trials performed ‘‘in-silico.’’ Performing an observational phase IV clinical trial is an expensive and complex process that can potentially be modeled in a retrospective database using groups of patients available in the large amounts of highly organized medical data. This application would allow a formalized way of
discovering new knowledge from medical databases in a manner that is well accepted by the medical community. For example, a prospective trial examining the potential harm of Vioxx would entail recruiting large numbers of patients and several years of observation. However, an in-silico clinical trial would entail setting up the database to enroll patients into a patient set automatically when they are given a prescription for Vioxx and watching them for adverse events as these events are entered in the course of clinical care. Besides requiring fewer resources, these trials could be set up for thousands of medications at a time and thereby provide a much greater scope of observational trials. 3. Finding correlations within data. When multiple variables are measured for each patient in a data set, there exists an underlying relationship between all pairs of variables, some highly correlated and some not. Correlations between pairs of variables may be discovered with this application, leading to new knowledge or further insight into known relationships. Unsupervised techniques using Relevance Networks and Mutual Information algorithms can generate hypotheses from secondary observed correlations in the data. This is a way to exploit existing electronic databases for unsupervised medical knowledge discovery without a prior model of the information content. Observations collected within labs, physical examinations, medical histories, and gene expressions can be expressed as continuous variables describing human physiology at a point in time. For example, the expression of RNA found within a tumor cell may be found to correlate with the dose of effective chemotherapy for that tumor. This would allow future tumors to have their RNA expression determined and matched to various chemotherapies, and the chemotherapy found to correlate most with that gene expression would be chosen as the agent for that individual.
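The following Python fragment is the small numerical sketch referred to in application 1 above. It combines conflicting type I / type II diabetes assertions from several sources under a naive independence assumption; the source names, reliabilities, and priors are invented for the example.

# Illustrative reconciliation of conflicting diagnosis codes (application 1).
# Two mutually exclusive hypotheses; each source asserts one of them and has an
# assumed probability of being correct. Figures are invented for the example.
def reconcile(assertions, reliability, prior=0.5):
    """Posterior P(type I) given independent source assertions."""
    p_t1, p_t2 = prior, 1.0 - prior
    for source, asserted in assertions:
        r = reliability[source]
        # likelihood of this assertion under each hypothesis
        p_t1 *= r if asserted == "type I" else (1.0 - r)
        p_t2 *= r if asserted == "type II" else (1.0 - r)
    return p_t1 / (p_t1 + p_t2)

reliability = {"billing": 0.7, "clinic_notes": 0.9, "lab_system": 0.8}
assertions = [("billing", "type II"), ("clinic_notes", "type I"), ("lab_system", "type I")]
print(round(reconcile(assertions, reliability), 3))  # ~0.939 in favor of type I

With these invented figures the two agreeing, more reliable sources outweigh the single dissenting one, and type I diabetes comes out as the far more likely diagnosis.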
Cross-references
▶ Health Informatics ▶ Data Integration in Web Data Extraction System ▶ Data Mining ▶ Data Models
Recommended Reading 1. Inmon W.H. Building the Data Warehouse, 2nd edn. Wiley, NY, 1996. 2. Kimball R. The Data Warehousing Toolkit. Wiley, NY, 1997. 3. Murphy S.N., Gainer V.S., and Chueh H. A visual interface designed for novice users to find research patient cohorts in a large biomedical database. In Proc. AMIA Annu. Fall Symp., 2003, pp. 489–493. 4. Murphy S.N., Morgan M.M., Barnett G.O., and Chueh H.C. Optimizing healthcare research data warehouse design through past COSTAR query analysis. In Proc. AMIA Fall Symp., 1999, pp. 892–896. 5. Nadkarni P.M. and Brandt C. Data extraction and ad hoc query of an entity-attribute-value database. J. Am. Med. Inform. Assoc., 5:511–517, 1998.
Data Warehousing Systems: Foundations and Architectures IL-YEOL SONG Drexel University, Philadelphia, PA, USA
Definition A data warehouse (DW) is an integrated repository of data for supporting decision-making applications of an enterprise. The most widely cited definition of a DW is from Inmon [3] who states that ‘‘a data warehouse is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management’s decisions.’’
Historical Background DW systems have evolved from the need for decision-making based on integrated data, rather than on an individual data source. DW systems address the two primary needs of enterprises: data integration and decision support environments. During the 1980s, relational database technologies became popular. Many organizations built their mission-critical database systems using relational database technologies. This trend led to a proliferation of independent relational database systems in an enterprise. For example, different business lines in an enterprise built separate database systems at different geographical locations. These database systems improved the operational aspects of each business line significantly. Organizations, however, faced the need to integrate the data that were distributed over different database systems and even legacy database systems in order to create a central knowledge management repository. In addition, during the 1990s, organizations faced increasingly complex challenges in global environments. Organizations realized the need for decision support systems that can analyze historical data trends, generate sophisticated but easy-to-read reports, and react to changing business conditions in
a rapid fashion. These needs resulted in the development of a new breed of database systems that can process complex decision-making queries against integrated, historical, atomic data. These new database systems are now commonly called data warehousing systems because they store a huge amount of data – much more than operational database systems – and they are kept for long periods of time. A data warehousing system these days provides an architectural framework for the flow of data from operational systems to decision-support environments. With the rapid advancement in recent computing technologies, organizations build data warehousing systems to improve business effectiveness and efficiency. In a modern business environment, a data warehousing system has emerged as a central component of an overall business intelligence solution in an enterprise.
Foundations OLTP vs. Data Warehousing Systems
Data warehousing systems contain many years of integrated historical data and thus end up storing huge amounts of data. Directly storing these voluminous data in an operational database system and processing many complex decision queries against them would degrade the performance of daily transaction processing. Thus, DW systems are maintained separately from operational databases, known as online transaction processing (OLTP) systems. OLTP systems support daily business operations with updatable data. In contrast, data warehousing systems provide users with an environment for the decision-making process with read-only data. Therefore, DW systems need a query-centric view of data structures, access methods, implementation methods, and analysis methods. Table 1 highlights the major differences between OLTP systems and data warehousing systems.
ROLAP and MOLAP
The data in a DW are usually organized in formats designed for easy access and analysis in decision-making. The most widely used data model for DWs is called the dimensional model or the star schema [6]. A dimensional model consists of two types of entities: a fact table and many dimensions. A fact table stores transactional or factual data, called measures, that get analyzed. Examples of fact tables are Order, Sale, Return, and Claim. A dimension represents an axis along which the fact data are analyzed. Examples of
Data Warehousing Systems: Foundations and Architectures. Table 1. A comparison between OLTP and data warehousing systems
Characteristic | OLTP | Data warehouse & OLAP
Purpose | Daily business support, transaction processing | Decision support, analytic processing
User | Data entry clerk, administrator, developer | Decision maker, executives
DB design | Application oriented | Subject-oriented
DB design model | ER model | Star, snowflake, multidimensional model
Data structures | Normalized, complex | Denormalized, simple
Data redundancy | Low | High
Data contents | Current, up-to-date operational data; atomic | Historical; atomic and summarized
Data integration | Isolated or limited integration | Integrated
Usage | Repetitive, routine | Ad-hoc
Queries | Predictable, predefined; simple joins; optimized for small transactions | Unpredictable; complex, long queries; optimized for complex queries
Update | Transactions constantly generate new data | Data are relatively static; often refreshed weekly or daily
Access type | Read/update/delete/insert | Read/append mostly
Number of records per access | Few | Many
Concurrency level | High | Low
Data retention | Usually less than a year | 3–10 years or more
Response time | Subsecond to second | Seconds, minutes, or worse
Systems requirements | Transaction throughput, data consistency | Query throughput, data accuracy
Data Warehousing Systems: Foundations and Architectures. Figure 1. The typical structure of the star schema.
dimensions are Time, Customer, Product, Promotion, Store, and Market. Since a DW contains time-variant data, the Time dimension is always included in dimensional schemas and the data in a fact table are organized by a unit of time. An extensive list of dimensions
commonly found in DWs, including those used in [1,6], is presented in [4]. A typical structure of the dimensional model is illustrated in Fig. 1. Syntactically, all the dimensions are connected with the fact table by one-to-many relationships. Thus, when
a dimension has a many-to-many relationship with the fact table, a special technique such as an intersection table should be used. All the dimensions have a surrogate key, which establishes an identifying relationship with the fact table. In a star schema, all the dimensions are usually denormalized to simplify the query structure in order to minimize the number of joins. When dimensions are normalized into the third normal form, the schema is called a snowflake schema [6]. A dimensional model simplifies end-user query processing by simplifying the database structure with a few well-defined join paths. Conceptually, a dimensional model characterizes a business process with the fact table, the dimensions, and the measures involved in the business process. The dimensional model allows users of a DW to analyze the fact data from any combination of dimensions. The structure provides a multidimensional analysis space within a relational database. Interactive data analysis of the data in a DW environment is called online analytic processing (OLAP). When the data in a dimensional model is stored in a relational database, the analysis is called relational online analytic processing (ROLAP). ROLAP engines extend SQL to support dimensional model schema and advanced OLAP functions. DW data can also be stored in a specialized multidimensional structure called a data cube or a hypercube. Data analysis of the data stored in a data cube is called multidimensional OLAP (MOLAP). Compared with
ROLAP engines, MOLAP engines are usually more limited in the volume of data they can store, but they provide more efficient OLAP processing by taking advantage of the multidimensional data cube structure. A typical structure of a data cube is illustrated in Fig. 2. Hybrid OLAP (HOLAP) servers take advantage of both ROLAP and MOLAP technologies. They usually store large volumes of detailed data in a ROLAP server and aggregated data in a MOLAP server.
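A minimal ROLAP-style sketch follows, using Python and SQLite for illustration: a star schema with a Sale fact table and Time, Product, and Store dimensions (a subset of the example dimensions named above), together with a typical aggregation query that joins the fact table to two dimensions. All names and data are assumptions made for the example.

import sqlite3

# Minimal sketch of a star schema (Sale fact with Time, Product, and Store
# dimensions) and a typical ROLAP-style roll-up query. Illustrative data only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim    (time_key INTEGER PRIMARY KEY, date TEXT, month TEXT, year INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE sale_fact (
    time_key    INTEGER REFERENCES time_dim(time_key),
    product_key INTEGER REFERENCES product_dim(product_key),
    store_key   INTEGER REFERENCES store_dim(store_key),
    units_sold  INTEGER,
    revenue     REAL
);
INSERT INTO time_dim    VALUES (1, '2008-01-05', '2008-01', 2008), (2, '2008-02-09', '2008-02', 2008);
INSERT INTO product_dim VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
INSERT INTO store_dim   VALUES (1, 'Boston', 'East'), (2, 'Seattle', 'West');
INSERT INTO sale_fact   VALUES (1, 1, 1, 10, 100.0), (1, 2, 2, 5, 75.0), (2, 1, 1, 8, 80.0);
""")
# Revenue by month and region: the fact table joined to two dimensions and rolled up.
for row in conn.execute("""
    SELECT t.month, s.region, SUM(f.revenue) AS revenue
    FROM sale_fact f
    JOIN time_dim t  ON t.time_key  = f.time_key
    JOIN store_dim s ON s.store_key = f.store_key
    GROUP BY t.month, s.region
    ORDER BY t.month, s.region
"""):
    print(row)
# ('2008-01', 'East', 100.0), ('2008-01', 'West', 75.0), ('2008-02', 'East', 80.0)

A MOLAP engine would answer the same question from a precomputed multidimensional cube instead of joining and aggregating the relational tables at query time.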
Data Warehousing Architecture
A data warehousing system is an environment that integrates diverse technologies into its infrastructure. As business data and analysis requirements change, data warehousing systems need to go through an evolution process. Thus, DW design and development must take growth and constant change into account to maintain a reliable and consistent architecture. A DW architecture defines an infrastructure by which components of DW environments are organized. Figure 3 depicts the various components of a typical DW architecture that consists of five layers – data source systems, ETL management services, DW storage and metadata repository, data marts and OLAP engines, and front-end tools.
Data Source Systems
The data source system layer represents data sources that feed the data into the DW. An enterprise usually maintains many different databases or information systems to serve different OLTP
Data Warehousing Systems: Foundations and Architectures. Figure 2. A three dimensional data cube having dimensions Time, Item, and Location for MOLAP.
functions. Since a DW integrates all the important data for the analysis requirements of an enterprise, it needs to integrate data from all disparate sources. Data could include structured data, event data, semi-structured data, and unstructured data. The primary source for data is usually operational OLTP databases. A DW may also integrate data from other internal sources such as legacy databases, spreadsheets, archived storages, flat files, and XML files. Frequently, a DW system may also include any relevant data from external sources. Examples of such data are demographic data purchased from an information vendor to support sales and marketing analysis and standard reference data from the industry or the government. In order to analyze trends of data from a historical perspective, some archived data could also be selected. Thus, data warehousing systems usually end up with huge amounts of historical data. These data are regularly fed into the second layer for processing. The interval between each feed could be monthly, weekly, daily, or even real-time,
depending on the frequency of changes in the data and on how up-to-date the data in the DW need to be.
ETL Management Services
The second layer extracts
the data from disparate data sources, transforms the data into a suitable format, and finally loads them into the DW. This process is known as ETL processing. A DW does not need all the data from the data source systems. Instead, only those data that are necessary for data analysis in tactical and strategic decision-making processes are extracted. Since these data come from many different sources, they could come in heterogeneous formats. Because a DW contains integrated data, the data need to be kept in a single standard format by removing syntactic and semantic variations from the different data source systems. Thus, these data are standardized for the data model used in the DW in terms of data type, format, size, unit of data, encoding of values, and semantics. This process ensures that the warehouse provides a ‘‘single version of the truth’’ [3].
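A small, hedged sketch of the kind of standardization performed in this layer is shown below; the source names, field names, code mappings, and unit and date conventions are assumptions invented for the example.

# Illustrative ETL transformation step: records from two source systems are
# standardized to one encoding, unit, and date convention before loading.
GENDER_MAP = {"M": "male", "F": "female", "1": "male", "2": "female"}

def transform(record: dict, source: str) -> dict:
    """Standardize one source record to the (assumed) warehouse conventions."""
    out = dict(record)
    out["gender"] = GENDER_MAP.get(str(out.get("gender")), "unknown")
    # source_b reports monetary amounts in cents; the warehouse stores dollars.
    if source == "source_b" and "amount_cents" in out:
        out["amount"] = out.pop("amount_cents") / 100.0
    # Dates arrive as DD/MM/YYYY from source_b; standardize to ISO format.
    if source == "source_b" and "order_date" in out:
        day, month, year = out["order_date"].split("/")
        out["order_date"] = f"{year}-{month}-{day}"
    return out

print(transform({"customer_id": 17, "gender": "F", "amount": 42.5,
                 "order_date": "2008-03-01"}, "source_a"))
print(transform({"customer_id": 99, "gender": "2", "amount_cents": 4250,
                 "order_date": "01/03/2008"}, "source_b"))
# Both records now share one format and can be loaded into the staging area.

Records that cannot be standardized (unknown codes, missing fields) would normally be flagged in the staging area for data quality review rather than loaded silently.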
Data Warehousing Systems: Foundations and Architectures. Figure 3. An enterprise data warehousing system architecture with ROLAP/MOLAP/Hybrid OLAP.
Only cleaned and conformed data are loaded into the DW. The storage required for ETL processing is called a staging database. The ETL process is usually the most time-consuming phase in developing a data warehousing system [7]. It normally takes 60–80% of the whole development effort. Therefore, it is highly recommended that ETL tools and data cleansing tools be used to automate the ETL process and data loading. Data Warehouse Storage and Metadata Repository
The third layer represents the enterprise DW and metadata repository. The enterprise DW contains all the extracted and standardized historical data at the atomic data level. A DW addresses the needs of crossfunctional information requirements of an enterprise. The data will remain in the warehouse until they reach the limit specified in the retention strategy. After that period, the data are purged or archived. Another component of this layer is the metadata repository. Metadata are data about the data. The repository contains information about the structures, operations, and contents of the warehouse. Metadata allows an organization to track, understand, and manage the population and management of the warehouse. There are three types of metadata – business metadata, technical metadata, and process metadata [7]. Business metadata describe the contents of the DW in business terms for easy access and understanding. They include the meaning of the data, organizational rules, policies, and constraints on the data as well as descriptive names of attributes used in reports. They help users in finding specific information from the warehouse. Technical metadata define the DW objects such as tables, data types, partitions, and other storage structures, as well as ETL information such as the source systems, extraction frequency, and transformation rules. Process metadata describe events during ETL operations and query statistics such as begin time, end time, CPU seconds, disk reads, and rows processed. These data are valuable for monitoring and troubleshooting the warehouse. Metadata management should be carefully planned, managed, and documented. OMG’s Common Warehouse Metamodel [9] provides the metadata standard. Data Mart and OLAP Engines
The fourth layer represents the data marts and OLAP engines. A data mart is a small-sized DW that contains a subset of the enterprise DW or a limited volume of aggregated data for
the specific analysis needs of a business unit, rather than the needs of the whole enterprise. This definition implies three important features that distinguish a data mart from a DW system. First, the data for a data mart are fed from the enterprise DW when a separate enterprise DW exists. Second, a data mart could store lightly aggregated data for optimal analysis. Using aggregated data improves query response time. Third, a data mart contains limited data for the specific needs of a business unit. Conceptually, a data mart covers a business process or a group of related business processes of a business unit. Thus, in a fully developed DW environment, end-users access data marts for daily analysis, rather than the enterprise DW. An enterprise usually ends up having multiple data marts. Since the data for all data marts are fed from the enterprise DW, it is very important to maintain consistency between each data mart and the DW as well as among the data marts themselves. A way to maintain this consistency is to use the notion of a conformed dimension. A conformed dimension is a standardized dimension or a master reference dimension that is shared across multiple data marts [6]. Using conformed dimensions allows an organization to avoid repeating the ‘‘silos of information’’ problem. Data marts are usually implemented in one or more OLAP servers. OLAP engines allow business users to perform data analysis using one of the underlying implementation models – ROLAP, MOLAP, or HOLAP.
Front-end Tools
The fifth layer represents the front-end tools. In this layer, end-users use various tools to explore the contents of the DW through data marts. Typical analyses include standard report generation, ad-hoc queries, desktop OLAP analysis, CRM, operational business intelligence applications such as dashboards, and data mining.
Other DW Architectures
Figure 3 depicts the architecture of a typical data warehousing system with various possible components. The two primary paradigms for DW architectures are enterprise DW design in the top-down manner [3] and data mart design in the bottom-up manner [6]. A variety of architectures based on the two paradigms and other options exists [3,6,8,10,12]. In this section, seven different architectures are outlined. Figures 4–9 illustrate those architectures.
Data Warehousing Systems: Foundations and Architectures. Figure 4. Independent data marts.
Data Warehousing Systems: Foundations and Architectures. Figure 5. Data mart bus architecture with conformed dimensions.
Data Warehousing Systems: Foundations and Architectures. Figure 6. Centralized DW architecture with no data marts.
Data Warehousing Systems: Foundations and Architectures. Figure 7. Hub-and-spoke architecture.
Independent Data Marts Architecture
In this architecture, multiple data marts are created independently of each other. The data marts do not use conformed dimensions and measures. Thus, there is no unified
view of enterprise data in this architecture. As the number of data marts grows, maintaining consistency among them becomes difficult. In the long run, this architecture is likely to produce ‘‘silos of data marts.’’
Data Warehousing Systems: Foundations and Architectures. Figure 8. Distributed DW architecture.
Data Mart Bus Architecture with Conformed Dimensions
In this architecture, instead of creating a single enterprise level DW, multiple dimensional data marts are created that are linked with conformed dimensions and measures to maintain consistency among the data marts [6,7]. Here, an enterprise DW is a union of all the data marts together with their conformed dimensions. The use of the conformed dimensions and measures allows users to query all data marts together. Data marts contain either atomic data or summary data. The strength of the architecture is that data marts can be delivered quickly, and multiple data marts can be delivered incrementally. The potential weaknesses are that it does not create a single physical repository of integrated data and some data may be redundantly stored in multiple data marts.
Hub-and-Spoke Architecture (Corporate Information Factory)
In this architecture, a single enterprise DW, called the hub, is created with a set of dimensional data marts, called spokes, that are dependent on the enterprise DW. The warehouse provides a single version of truth for the enterprise, and each data mart addresses the analytic needs of a business unit. This architecture is also called the corporate information factory or the enterprise DW architecture [3]. The warehouse contains data at the atomic level, and the data marts usually contain either atomic data, lightly summarized data, or both, all fed from the warehouse. The enterprise warehouse in this architecture is usually normalized for flexibility and scalability, while the data marts are structured in star schemas for performance. This top-down development methodology provides a centralized integrated repository of the enterprise data and tends to be robust against business changes. The primary weakness of this architecture is that it requires significant up-front costs and time for developing the warehouse due to its scope and scale.
Centralized Data Warehouse Architecture
In this architecture, a single enterprise level DW is created for the entire organization without any dependent data marts. The warehouse contains detailed data for all the analytic needs of the organization. Users and applications directly access the DW for analysis.
Distributed Data Warehouse Architecture
A distributed DW architecture consists of several local DWs
Data Warehousing Systems: Foundations and Architectures. Figure 9. Federated DW architecture.
and a global DW [3]. Here, the local DWs have mutually exclusive data and are autonomous. Each local warehouse has its own ETL logic and processes its own analysis queries for a business division. The global warehouse may store corporate-wide data at the enterprise level. Thus, corporate-level data analyses and global analyses that require data from several local DWs are done at the global DW. For example, a financial analysis covering all the business divisions will be done at the global DW. Depending on the level of data and query flows, there can be several variations of this architecture [3]. This architecture supports multiple, geographically distributed business divisions. The architecture is especially beneficial when the local DWs run on platforms from multiple vendors.
A federated DW architecture is a variation of a distributed DW architecture, where the global DW serves as a logical DW for all local DWs. The logical DW provides users
with a single centralized DW image of the enterprise. This architecture is a practical solution when an enterprise acquires other companies that have their own DWs, which become local DWs. The primary advantage of this architecture is that the existing environments of the local DWs can be kept as they are without physically restructuring them into the global DW. This architecture may suffer from complexity and performance problems when applications require frequent distributed joins and other distributed operations. The architecture is built on an existing data environment rather than starting with a ‘‘clean slate.’’
In a virtual DW architecture, there is no physical DW or any data mart. In this architecture, a DW structure is defined by a set of materialized views over OLTP systems. End-users directly access the data through the materialized views. The advantages of this approach are that it is easy to build and the additional storage requirement is
minimal. This approach, however, has many disadvantages in that it does not allow any historical data; it does not contain a centralized metadata repository; it does not create cleansed standard data items across source systems; and it could severely affect the performance of the OLTP system.
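The following Python/SQLite fragment is a loose sketch of the idea under stated assumptions: the "warehouse" is only a view defined over normalized OLTP tables, so no warehouse data are stored separately. SQLite offers only ordinary views, so the sketch uses one; the products typically used for this architecture would rely on materialized views, and all table and view names here are invented.

import sqlite3

# Sketch of a virtual DW: no separate warehouse storage, only a view over
# OLTP tables that exposes a dimensional-looking result. Names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders      (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
CREATE TABLE order_lines (order_id INTEGER, product TEXT, quantity INTEGER, price REAL);
INSERT INTO orders VALUES (1, 7, '2008-03-01'), (2, 8, '2008-03-02');
INSERT INTO order_lines VALUES (1, 'Widget', 2, 10.0), (1, 'Gadget', 1, 25.0),
                               (2, 'Widget', 5, 10.0);

-- The "warehouse" is just a view computed on demand from the OLTP tables.
CREATE VIEW sales_summary AS
SELECT o.order_date, l.product, SUM(l.quantity) AS units, SUM(l.quantity * l.price) AS revenue
FROM orders o JOIN order_lines l ON l.order_id = o.order_id
GROUP BY o.order_date, l.product;
""")
print(conn.execute("SELECT * FROM sales_summary ORDER BY order_date, product").fetchall())
# [('2008-03-01', 'Gadget', 1, 25.0), ('2008-03-01', 'Widget', 2, 20.0),
#  ('2008-03-02', 'Widget', 5, 50.0)]

Every query against sales_summary is recomputed from the OLTP tables, which is exactly the source of the performance concerns mentioned above.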
Key Applications Numerous business applications of data warehousing technologies to different domains are found in [1,6]. Design and development of clickstream data marts is covered in [5]. Applications of data warehousing technologies to customer relationship management (CRM) are covered in [2,11]. Extension of data warehousing technologies to spatial and temporal applications is covered in [8].
URL to Code Two major international forums that focus on data warehousing and OLAP research are International Conferences on Data Warehousing and Knowledge Discovery (DaWaK) and ACM International Workshop on Data Warehousing and OLAP (DOLAP). DaWaK has been held since 1999, and DOLAP has been held since 1998. DOLAP papers are found at http://www.cis.drexel.edu/faculty/song/dolap.htm. A collection of articles on industrial DW experience and design tips by Kimball is listed in http://www.ralphkimball.com/, and the one by Inmon is listed in www.inmoncif.com.
Cross-references
▶ Active and Real-time Data Warehousing ▶ Cube ▶ Data Mart ▶ Data Mining ▶ Data Warehouse ▶ Data Warehouse Life-Cycle and Design ▶ Data Warehouse Maintenance, Evolution and Versioning ▶ Data Warehouse Metadata ▶ Data Warehouse Security ▶ Dimension ▶ Extraction, Transformation, and Loading ▶ Materialized Views ▶ Multidimensional Modeling ▶ On-Line Analytical Processing ▶ Optimization and Tuning in Data Warehouses ▶ Transformation ▶ View Maintenance
Recommended Reading 1. Adamson C. and Venerable M. Data Warehouse Design Solutions. Wiley, New York, 1998. 2. Cunningham C., Song I.-Y., and Chen P.P. Data warehouse design for customer relationship management. J. Database Manage., 17(2):62–84, 2006. 3. Inmon W.H. Building the Data Warehouse, 3rd edn., Wiley, New York, 2002. 4. Jones M.E. and Song I.-Y. Dimensional modeling: identification, classification, and evaluation of patterns. Decis. Support Syst., 45(1):59–76, 2008. 5. Kimball R. and Merz R. The Data Webhouse Toolkit: Building the Web-Enabled Data Warehouse. Wiley, New York, 2000. 6. Kimball R. and Ross M. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd edn., Wiley, 2002. 7. Kimball R., Ross M., Thornthwaite W., Mundy J., and Becker B. The Data Warehouse Lifecycle Toolkit, 2nd edn., Wiley, 2008. 8. Malinowski E. and Zimanyi E. Advanced Data Warehouse Design: From Conventional to Spatial and Temporal Applications. Springer, 2008. 9. Poole J., Chang D., Tolbert D., and Mellor D. Common Warehouse Metamodel: An Introduction to the Standard for Data Warehouse Integration. Wiley, 2002. 10. Sen A. and Sinha P. A comparison of data warehousing methodologies. Commun. ACM, 48(3):79–84, 2005. 11. Todman C. Designing a Data Warehouse Supporting Customer Relationship Management. Prentice-Hall, 2000. 12. Watson H.J. and Ariyachandra T. Data Warehouse Architectures: Factors in the Selection, Decision, and the Success of the Architectures. Technical Report, University of Georgia, 2005. Available from http://www.terry.uga.edu/hwatson/DW_Architecture_Report.pdf
Data, Text, and Web Mining in Healthcare ELIZABETH S. CHEN Partners HealthCare System, Boston, MA, USA
Synonyms Data mining; Text data mining; Web mining; Web data mining; Web content mining; Web structure mining; Web usage mining
Definition The healthcare domain presents numerous opportunities for extracting information from heterogeneous sources ranging from structured data (e.g., laboratory results and diagnoses) to unstructured data (e.g.,
clinical documents such as discharge summaries) to usage data (e.g., audit logs that record user activity for clinical applications). To accommodate the unique characteristics of these disparate types of data and support the subsequent use of extracted information, several existing techniques have been adapted and applied including Data Mining, Text Mining, and Web Mining [7]. This entry provides an overview of each of these mining techniques (with a focus on Web usage mining) and example applications in healthcare.
Historical Background Given the exponential growth of data in all domains, there has been an increasing amount of work focused on the development of automated methods and techniques to analyze data for extracting useful information. Data mining is generally concerned with large data sets or databases; several specialized techniques have emerged such as text mining and Web mining that are focused on text data and Web data, respectively. Early applications were in the domains of business and finance; however, the past decade has seen an increasing use of mining techniques in the life sciences, biomedicine, and healthcare. In the healthcare domain, data mining techniques have been used to discover medical knowledge and patterns from clinical databases, text mining techniques have been used to analyze unstructured data in the electronic health record, and Web mining techniques have been used for studying use of healthcare-related Web sites and systems.
Foundations Data Mining
Knowledge Discovery in Databases (KDD) and data mining are aimed at developing methodologies and tools, which can automate the data analysis process and create useful information and knowledge from data to help in decision-making [9,11]. KDD has been defined as ‘‘the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.’’ This process is interactive and iterative and consists of several steps: data selection, preprocessing, transformation, data mining, and interpretation. Data mining is considered one step in the KDD process and is concerned with the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules [9,11]. Two primary goals of data mining are prediction and description.
Text Mining
While data mining focuses on algorithmic and database-oriented methods that search for previously unsuspected structure and patterns in data, text mining is concerned with semi-structured or unstructured data found within text documents [5,12]. A narrower definition of text mining follows that of data mining in that it aims to extract useful information from text data or documents; a broader definition includes general text processing techniques that deal with search, extraction, and categorization [17]. Example applications include document classification, entity extraction, and summarization. Web Mining
Web mining is the application of data mining techniques to automatically discover and extract information from data related to the World Wide Web [9,24,25]. Three categories of Web mining have been defined [18,6]: Web content mining: involves the discovery of useful information from Web content. These techniques involve examining the content of Web pages as well as results of Web searching. Web structure mining: obtains information from the organization of pages on the Web. These techniques seek to discover the model underlying link structures of the Web. Web usage mining: discovers usage patterns from Web data. These techniques involve analyzing data derived from users' interactions with the Web. Web usage mining seeks to understand the behavior of users by automatically discovering access patterns from their Web usage data. These data include Web server access logs, proxy server logs, browser logs, user sessions, and user queries. The typical Web usage mining process has three phases: preprocessing, pattern discovery, and pattern analysis [6,18,25].
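The three phases can be illustrated with a small, self-contained sketch. The code below is not taken from any particular system; the log file name and its tab-separated layout (client, timestamp, requested page) are assumptions made for this example. It performs minimal preprocessing (grouping page requests per client), a simple form of pattern discovery (counting page-to-page transitions), and pattern analysis (filtering by a support threshold).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;

/** Minimal sketch of Web usage mining: preprocessing raw access-log lines into
 *  per-client page sequences, then discovering frequent page-to-page transitions. */
public class WebUsageMiner {

    public static void main(String[] args) throws IOException {
        // Assumed log layout (illustrative): clientId<TAB>timestamp<TAB>requestedPage
        List<String> lines = Files.readAllLines(Paths.get("access.log"));

        // Preprocessing: group requested pages per client, preserving request order.
        Map<String, List<String>> sessions = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length < 3) continue;          // data cleaning: skip malformed rows
            sessions.computeIfAbsent(fields[0], k -> new ArrayList<>()).add(fields[2]);
        }

        // Pattern discovery: count how often page A is immediately followed by page B.
        Map<String, Integer> transitionCounts = new HashMap<>();
        for (List<String> pages : sessions.values()) {
            for (int i = 0; i + 1 < pages.size(); i++) {
                transitionCounts.merge(pages.get(i) + " -> " + pages.get(i + 1), 1, Integer::sum);
            }
        }

        // Pattern analysis: keep only transitions above a support threshold.
        transitionCounts.entrySet().stream()
                .filter(e -> e.getValue() >= 5)
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .forEach(e -> System.out.println(e.getKey() + " : " + e.getValue()));
    }
}
```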
Key Applications Data Mining in Healthcare
Several studies have discussed the use of structured and unstructured data in the electronic health record for understanding and improving health care processes [5]. Applications of data mining techniques for structured clinical data include extracting diagnostic
rules, identifying new medical knowledge, and discovering relationships between different types of clinical data. Using association rule generation, Doddi et al. discovered relationships between procedures performed on a patient and the reported diagnoses; this knowledge could be useful for identifying the effectiveness of a set of procedures for diagnosing a particular disease [8]. To identify factors that contribute to perinatal outcomes, a database of obstetrical patients was mined for the goal of improving the quality and cost effectiveness of perinatal care [23]. Mullins et al. explored a set of data mining tools to search a clinical data repository for novel disease correlations to enhance research capabilities [21]. Text Mining in Healthcare
Natural language processing and text mining techniques have been applied in healthcare for a range of applications including coding and billing, tracking physician performance and resource utilization, improving provider communication, monitoring alternate courses of treatment, and detecting clinical conditions and medical errors [15]. Several studies have focused on the development of text mining approaches for identifying specific types of co-occurring concepts (e.g., concept pairs such as disease-drug or disease-finding) in clinical documents (e.g., discharge summaries) and biomedical documents (e.g., Medline articles). In one study, associations between diseases and findings (extracted from discharge summaries using a natural language processing tool) were identified and used to construct a knowledge base for supporting an automated problem list summarization system [2]. Another study discusses the mining of free-text medical records for the creation of disease profiles based on demographic information, primary diseases, and other clinical variables [14].
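As an illustration of the co-occurrence statistics mentioned above, the following sketch counts how often two concepts appear in the same document, assuming the concepts (e.g., disease and finding terms) have already been extracted by an upstream NLP step; the input lists and concept names are purely illustrative.

```java
import java.util.*;

/** Minimal sketch of co-occurrence counting for concept pairs (e.g., disease-finding)
 *  that were already extracted from clinical documents by an NLP step. */
public class CoOccurrenceCounter {

    /** Each inner list holds the concepts found in one document. */
    public static Map<String, Integer> countPairs(List<List<String>> docs) {
        Map<String, Integer> pairCounts = new HashMap<>();
        for (List<String> concepts : docs) {
            // A sorted set ensures each unordered pair is counted at most once per document.
            List<String> unique = new ArrayList<>(new TreeSet<>(concepts));
            for (int i = 0; i < unique.size(); i++) {
                for (int j = i + 1; j < unique.size(); j++) {
                    pairCounts.merge(unique.get(i) + " | " + unique.get(j), 1, Integer::sum);
                }
            }
        }
        return pairCounts;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("pneumonia", "fever", "cough"),
                List.of("pneumonia", "fever"),
                List.of("asthma", "cough"));
        countPairs(docs).forEach((pair, n) -> System.out.println(pair + " : " + n));
    }
}
```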
Web Usage Mining in Healthcare
Major application areas of Web usage mining include personalization, system improvement, site modification, business intelligence, and usage characterization [25]. Web usage mining is viewed as a valuable source of ideas and methods for the implementation of personalized functionality in Web-based information systems [10,22]. Web personalization aims to make Web-based information systems adaptive for the needs and interests of individual users. The four basic classes of personalization functions are: memorization, guidance, customization, and task performance support. A number of research projects have used Web usage mining techniques to add personalization functionality in Web-based systems [20]. There are several reports of applying advanced techniques such as Web usage mining to study healthcare-related Web sites and systems. Malin has looked at correlating medical status (represented in health insurance claims as ICD-9 codes) with how information is accessed in a health information Web site [19]. The value of log data for public health surveillance has been explored for detecting possible epidemics through usage logs that record accesses to disease-specific on-line health information [16,13]. Zhang et al. used Web usage data to study users’ information-seeking patterns of MyWelch, a Web-based medical library portal system [27]. Rozic-Hristovski et al. have used data warehouse and On-Line Analytical Processing (OLAP) techniques to evaluate use of the Central Medical Library (CMK) Web site. They found that existing Web log analysis tools only provided a set of predefined reports without any support for interactive data exploration, while their data warehouse and OLAP techniques would allow for dynamic generation of different user-defined reports that could be used to
Data, Text, and Web Mining in Healthcare. Figure 1. WebCIS log file records. The WebCIS log files record details for users’ (e.g., clinicians) interactions with patient data. Log file lines provide information on who, what, when, where, and how information was accessed in a patient’s record. Each line has seven fields: timestamp, application name, userID, IP address, Medical Record Number (MRN), data type, and action. Data types may have subtypes (delimited by ‘‘^’’). For example, the subtype ‘‘2002–09–30–12.15.00.000000’’ for the data type ‘‘lab’’ refers to a specific laboratory result (e.g., Basic Metabolic Panel) for the patient.
Data, Text, and Web Mining in Healthcare. Figure 2. Transforming usage patterns to rules to shortcut rules. Each usage pattern (mined from the CIS log files (a) can be converted to a rule (b) and some patterns can be transformed to shortcut rules that exclude viewing of the department listings such as a listing of radiology results (c).
restructure the CMK Web site [26]. Another study explored regression analysis as a Web usage mining technique to analyze navigational routes used to access the gateway pages of the Arizona Health Sciences Library Web site. Bracke concluded that this technique could supplement Web log analysis for improving the design of Web sites [1].
Experimental Results Depending on the clinical task, often only subsets of data are of interest to clinicians. Identifying these data, and the patterns in which they are accessed, can contribute to the design of efficient clinical information systems. At NewYork-Presbyterian Hospital (NYP), a study was performed to learn the patientspecific information needs (need for data in the patient record) of clinicians from the log files of WebCIS (a Web-based clinical information system at NYP) and subsequently apply this knowledge to enhance PalmCIS (a wireless handheld extension to WebCIS) [3,4]. Based on existing mining techniques (i.e., data mining and Web usage mining), ‘‘CIS Usage Mining’’ was developed as an automated approach for identifying patterns of usage for clinical information systems through associated log files (CIS log files). The CIS usage mining process consists of four phases: Data Collection – identify sources of CIS log files and obtain log file data (Fig. 1); Preprocessing – perform various tasks to prepare data for pattern discovery techniques including de-identification, data cleaning, data enrichment, and data transformation; Pattern Discovery – apply techniques for discovering statistics, patterns, and relationships such as descriptive statistical analysis, sequential pattern discovery, classification, and association rule generation; and, Pattern Analysis – filter out uninteresting patterns and determine how the discovered knowledge can be used through visualization techniques or query mechanisms.
The CIS usage mining techniques were applied to the log files of WebCIS to obtain usage statistics and patterns for all WebCIS users as well as particular classes of users (e.g., role-based groups such as physicians or nurses or specialty-based groups like pediatrics and surgery). A subset of the patterns were transformed into rules and stored in a knowledge base for enhancing PalmCIS with context-sensitive ‘‘shortcuts’’, which seek to anticipate what patient data the clinician may be interested in viewing next and provide automated links to those data (Fig. 2). Preliminary evaluation results indicated that shortcuts may have a positive impact and that CIS usage mining techniques may be valuable for detecting clinician information needs in different contexts.
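The following sketch shows, in a much simplified form, how a mined usage pattern could be stored as a shortcut rule and consulted at run time; the rule format, data-type names, and API are illustrative and do not reflect the actual WebCIS/PalmCIS implementation.

```java
import java.util.*;

/** Minimal sketch of a shortcut-rule knowledge base built from mined usage patterns:
 *  given the data type a clinician is currently viewing, suggest the data type most
 *  frequently viewed next. The rule contents below are hypothetical examples. */
public class ShortcutRules {

    private final Map<String, String> nextStep = new HashMap<>();

    /** A mined pattern "lab -> radiology" becomes the rule lab => shortcut to radiology. */
    public void addPattern(String current, String frequentNext) {
        nextStep.put(current, frequentNext);
    }

    public Optional<String> shortcutFor(String currentView) {
        return Optional.ofNullable(nextStep.get(currentView));
    }

    public static void main(String[] args) {
        ShortcutRules rules = new ShortcutRules();
        rules.addPattern("lab", "radiology");          // hypothetical mined pattern
        rules.addPattern("radiology", "medications");  // hypothetical mined pattern
        rules.shortcutFor("lab").ifPresent(next ->
                System.out.println("Offer shortcut to: " + next));
    }
}
```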
Cross-references
▶ Association Rules ▶ Data Mining ▶ Text Mining ▶ Text Mining of Biological Resources ▶ Visual Data Mining
Recommended Reading 1. Bracke P.J. Web usage mining at an academic health sciences library: an exploratory study. J. Med. Libr. Assoc., 92(4): 421–428, 2004. 2. Cao H., Markatou M., Melton G.B., Chiang M.F., and Hripcsak G. Mining a clinical data warehouse to discover disease-finding associations using co-occurrence statistics. In Proc. AMIA Annual Symposium, 2005, pp. 106–110. 3. Chen E.S. and Cimino J.J. Automated discovery of patientspecific clinician information needs using clinical information system log files. In Proc. AMIA Annual Symposium, 2003, pp. 145–149. 4. Chen E.S. and Cimino J.J. Patterns of usage for a web-based clinical information system. In Proc. Medinfo, 2004, pp. 18–22. 5. Chen H., Fuller S., Friedman C., and Hersh W. Knowledge Management and Data Mining in Biomedicine. Springer, 2005. 6. Cooley R., Mobasher B., and Srivastava J. Web mining: information and pattern discovery on the World Wide Web. In Proc.
Ninth IEEE Int. Conf. on Tools with Artificial Intelligence, 1997, pp. 558–567. 7. Data Mining, Web Mining, Text Mining, and Knowledge Discovery. www.kdnuggets.com. 8. Doddi S., Marathe A., Ravi S.S., and Torney D.C. Discovery of association rules in medical data. Med. Inform. Internet Med., 26(1):25–33, 2001. 9. Dunham M. Data Mining Introductory and Advanced Topics. Prentice-Hall, Englewood Cliffs, NJ, 2003. 10. Eirinaki M. and Vazirgiannis M. Web mining for web personalization. ACM Trans. Internet Techn., 3(1):1–27, 2003. 11. Fayyad U., Piatetsky-Shapiro G., Smyth P., and Uthurusamy R. Advances in Knowledge Discovery and Data Mining. AAAI/MIT, 1996. 12. Hearst M. Untangling text data mining. In Proc. 27th Annual Meeting of the Assoc. for Computational Linguistics, 1999. 13. Heino J. and Toivonen H. Automated detection of epidemics from the usage logs of a physicians' reference database. In Principles of Data Mining and Knowledge Discovery, 7th European Conf., 2003, pp. 180–191. 14. Heinze D.T., Morsch M.L., and Holbrook J. Mining free-text medical records. In Proc. AMIA Symposium, 2001, pp. 254–258. 15. Hripcsak G., Bakken S., Stetson P.D., and Patel V.L. Mining complex clinical data for patient safety research: a framework for event discovery. J. Biomed. Inform., 36(1–2):120–130, 2003. 16. Johnson H.A., Wagner M.M., Hogan W.R., Chapman W., Olszewski R.T., and Dowling J. et al. Analysis of web access logs for surveillance of influenza. In Proc. Medinfo, 2004, p. 1202. 17. Konchady M. Text Mining Application Programming. Charles River Media, 2006. 18. Kosala R. and Blockeel H. Web mining research: a survey. SIGKDD Explor., 2(1):1–15, 2000. 19. Malin B.A. Correlating web usage of health information with patient medical data. In Proc. AMIA Symposium, 2002, pp. 484–488. 20. Mobasher B., Cooley R., and Srivastava J. Automatic personalization based on web usage mining. Commun. ACM, 43(8):142–151, 2000. 21. Mullins I.M., Siadaty M.S., Lyman J., Scully K., Garrett C.T., and Greg Miller W. et al. Data mining and clinical data repositories: insights from a 667,000 patient data set. Comput. Biol. Med., 36(12):1351–1377, 2006. 22. Pierrakos D., Paliouras G., Papatheodorou C., and Spyropoulos C. Web usage mining as a tool for personalization: a survey. User Model. User-Adap., 13(4):311–372, 2003. 23. Prather J.C., Lobach D.F., Goodwin L.K., Hales J.W., Hage M.L., and Hammond W.E. Medical data mining: knowledge discovery in a clinical data warehouse. In Proc. AMIA Annual Fall Symposium, 1997, pp. 101–105. 24. Scime A. Web Mining: Applications and Techniques. Idea Group Inc., 2005. 25. Srivastava J., Cooley R., Deshpande M., and Tan P. Web usage mining: discovery and applications of usage patterns from web data. SIGKDD Explor., 1(2):12–23, 2000. 26. Rozic-Hristovski A., Hristovski D., and Todorovski L. Users' information-seeking behavior on a medical library Website. J. Med. Libr. Assoc., 90(2):210–217, 2002.
27. Zhang D., Zambrowicz C., Zhou H., and Roderer N. User information seeking behavior in a medical web portal environment: a preliminary study. J. Am. Soc. Inform. Sci. Tech., 55(8): 670–684, 2004.
Database Adapter and Connector Changqing Li Duke University, Durham, NC, USA
Synonyms Database connectivity
Definition A database connector is software that connects an application to a database. A database adapter is an implementation of a database connector. The connector is more at the conceptual level, while the adapter is at the implementation level, though they refer to the same thing. For simplicity, in the remainder of this entry a database adapter is not explicitly distinguished from a database connector, i.e., the two terms are used with the same meaning. Unlike access paths that go through a fixed schema, stored procedures, or queues, a database adapter lets one access table data directly and transparently. Open Database Connectivity (ODBC) [2] and Java Database Connectivity (JDBC) [4] are the two main database adapters used to execute Structured Query Language (SQL) statements and retrieve results.
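As a concrete illustration of the JDBC side, the following minimal sketch opens a connection through the driver manager, executes a parameterized SQL query, and walks over the result set. The connection URL, credentials, and table and column names are placeholders; any database with a JDBC driver could be substituted.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Minimal JDBC sketch: the driver manager opens a connection, a statement is
 *  executed, and the rows of the result set are read by column name.
 *  URL, credentials, and table/column names are placeholders. */
public class JdbcExample {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:postgresql://localhost:5432/sales";   // any JDBC URL works here
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement stmt = con.prepareStatement(
                     "SELECT emp_name, salary FROM employees WHERE dept = ?")) {
            stmt.setString(1, "RESEARCH");
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("emp_name") + "\t" + rs.getDouble("salary"));
                }
            }
        }
    }
}
```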
Historical Background Before universal database adapters, one had to write code that talked to a particular database using an appropriate language. For example, if a program needed to talk to an Access database and an Oracle database, the program had to be coded with two different database languages. This could be a quite daunting task; therefore, universal database adapters emerged. Here the histories of the two main universal database adapters, i.e., ODBC and JDBC, are introduced. ODBC enables applications to connect to any database for which an ODBC driver is available. ODBC was created in 1992 by Microsoft, in partnership with Simba Technologies, by adapting the Call Level Interface (CLI) from the SQL Access Group (SAG). Later ODBC was aligned with the CLI specification making its way through X/Open (a company name) and International
Organization for Standardization (ISO), and SQL/CLI became part of the international SQL standard in 1995. JDBC is similar to ODBC, but is designed specifically for Java programs. JDBC was first developed by JavaSoft, a subsidiary of Sun Microsystems, and then developed under the Java Community Process. JDBC is part of the Java Standard Edition, and the Java package java.sql contains the JDBC classes.
Foundations A data architecture defines a data source interface to an application through connectors, and also by commands. Thus, a configurable request for data is issued through commands to the adapters of the data sources. This architecture provides the ability to create custom connectivity to disparate backend data sources. Universal connectors enable rapid access to heterogeneous data and allow a broad range of seamless connectivity to file systems, databases, web applications, business applications and industry-standard protocols on numerous platforms. Business connectors allow customers to participate in collaboration, while web database adapters allow direct access to the database from web services. Relational Database (RDB) adapters efficiently provide access to RDB data and systems. Standard SQL statements may be used to access RDB data via connectors including ODBC, OLE DB (Object Linking and Embedding, Database), JDBC, XML, iWay Business Services (Web services), MySQL Connector/ODBC and Connector/NET driver, and others. Due to the longer history, ODBC offers connectivity to a wider variety of data sources than other new dataaccess Application Programming Interfaces (APIs) such as OLE DB, JDBC, and ADO.NET (ADO stands for ActiveX Data Objects). Before the information from a database can be used by an application, an ODBC data source name must be defined, which provides information about how to connect the application server to a database, such as Microsoft SQL Server, Sybase, Oracle, or IBM DB2. The implementations of ODBC can run on different operating systems such as Microsoft Windows, Unix, Linux, OS/2, and Mac OS X. Hundreds of ODBC drivers exist for different database products including Oracle, DB2, Microsoft SQL Server, Sybase, MySQL, PostgreSQL, Pervasive SQL, FileMaker, and Microsoft Access. The first ODBC product was released by Microsoft as a set of Dynamic-Link Libraries (DLLs) for
Microsoft Windows. In 2006, Microsoft ships its own ODBC with every supported version of Windows. Independent Open Database Connectivity (iODBC) offers an open source, platform-independent implementation of both the ODBC and X/Open specifications. iODBC has been bundled into Darwin and Mac OS X, and it has also been ported by programmers to several other operating systems and hardware platforms, including Linux, Solaris, AIX, HP-UX, Digital UNIX, Dynix, FreeBSD, DG-UX, OpenVMS, and others. Universal Database Connectivity (UDBC), laid the foundation for the iODBC open source project, is a cross-platform fusion of ODBC and SQL Access Group CLI, which enables non-Windows-based DBMS-independent (Database Management System independent) application development when shared-library implementations on Unix occurred only sporadically. Headed, maintained and supported by Easysoft Director Nick Gorham, unixODBC has become the most common driver-manager for non-Microsoft Windows platforms and for one Microsoft platform, Interix. In advance of its competitors, unixODBC fully supports ODBC3 and Unicode. Most Linux distributions including Red Hat, Mandriva and Gentoo, now ship unixODBC. unixODBC is also used as the drivers by several commercial database vendors, including IBM (DB2, Informix), Oracle and SAP (Ingres). Many open source projects also make use of unixODBC. unixODBC builds on any platform that supports most of the GNU (a computer operating system composed entirely of free software) autoconf tools, and uses the LGPL (Lesser General Public License) and the GPL (General Public License) for licensing. ODBC provides the standard of ubiquitous connectivity and platform-independence because hundreds of ODBC drivers exist for a large variety of data sources. However, ODBC has certain drawbacks. Writing ODBC code to exploit DBMS-specific features requires more advanced programming. An application needs to use introspection to call ODBC metadata functions that return information about supported features, available types, syntax, limits, isolation levels, driver capabilities and more. Even when adaptive techniques are used, ODBC may not provide some advanced DBMS features. Important issues can also be raised by differences between drivers and driver maturity. Compared with drivers deployed and tested for years which may contain fewer bugs, newer ODBC drivers do not always have the stability.
Developers may use other SQL APIs if ODBC does not support certain features or types but these features are required by the applications. Proprietary APIs can be used if it is not aiming for platform-independence; whereas if it is aiming to produce portable, platform-independent, albeit language specific code, JDBC API is a good choice. Sun’s (a company name) Java (a programming language) 2 Enterprise Edition (J2EE) Connector Architecture (JCA) defines a standard architecture for connecting the Java 2 Platform to heterogeneous Enterprise Information Systems (EISs). The JCA enables an EIS vendor to provide a standard resource adapter (connector). The JDBC Connector is used to connect relational data sources. DataDirect technology is a pioneer in JDBC which provides resource adapters as an installable option for JDBC. The JDBC Developer Center provides the most current, developer-oriented JDBC data connectivity information available in the industry. Multiple implementations of JDBC can exist and be used by the same application. A mechanism is provided by the API to dynamically load the correct Java packages and register them with the JDBC Driver Manager, a connection factory for creating JDBC connections. Creating and executing statements are supported by JDBC connections. These statements may either be update statements such as SQL CREATE, INSERT, UPDATE and DELETE or query statements with SELECT. Update statements e.g., INSERT, UPDATE and DELETE return how many rows are affected in the database, but do not return any other information. Query statements, on the other hand, return a JDBC row result set, which can be walked over. Based on a name or a column number, an individual column in a row can be retrieved. Any number of rows may exist in the result set and the row result set has metadata to describe the names of the columns and their types. To allow for scrollable result sets and cursor support among other things, there is an extension to the basic JDBC API in the javax.sql package. Next the bridging configurations between ODBC and JDBC are discussed: ODBC-JDBC bridges: an ODBC-JDBC bridge consists of an ODBC driver, but this ODBC driver uses the services of a JDBC driver to connect to a database. Based on this driver, ODBC function calls are translated into JDBC method calls. This bridge is usually used when an ODBC driver is lacked for a particular database but access to a JDBC driver is provided.
JDBC-ODBC bridges: a JDBC-ODBC bridge consists of a JDBC driver, but this JDBC driver uses the ODBC driver to connect to the database. Based on this driver, JDBC method calls are translated into ODBC function calls. This bridge is usually used when a particular database lacks a JDBC driver. One such bridge is included in the Java Virtual Machine (JVM) of Sun Microsystems. Sun generally recommends against the use of its bridge. Far outperforming the JVM built-in, independent data-access vendors now deliver JDBCODBC bridges which support current standards. Furthermore, the OLE DB [1], the Oracle Adapter [3], the iWay [6] Intelligent Data Adapters, and MySQL [5] Connector/ODBC and Connector/NET are briefly introduced below: OLE DB (Object Linking and Embedding, Database), maybe written as OLEDB or OLE-DB, is an API designed by Microsoft to replace ODBC for accessing different types of data stored in a uniform manner. While supporting traditional DBMSs, OLE DB also allows applications to share and access a wider variety of non-relational databases including object databases, file systems, spreadsheets, e-mail, and more [1]. The Oracle Adapter for Database and Files are part of the Oracle Business Process Execution Language (BPEL) Process Manager installation and is an implementation of the JCA 1.5 Resource Adapter. The Adapter is based on open standards and employs the Web Service Invocation Framework (WSIF) technology for exposing the underlying JCA Interactions as Web Services [3]. iWay Software’s Data Adapter can be used for ALLBASE Database, XML, JDBC, and ODBC-Based Enterprise Integration. The Intelligent Data Adapters of iWay Work Together; each adapter contains a communication interface, a SQL translator to manage adapter operations in either SQL or iWay’s universal Data Manipulation Language (DML), and a database interface to translate standard SQL into native SQL syntax [6]. MySQL supports the ODBC interface Connector/ ODBC. This allows MySQL to be addressed by all the usual programming languages that run under Microsoft Windows (Delphi, Visual Basic, etc.). The ODBC interface can also be implemented under Unix, though that is seldom necessary [5]. The Microsoft .NET Framework, a software component of Microsoft Windows operating system, provides a programming interface to Windows services and APIs, and manages the execution of programs written for this framework [7].
Key Applications Database adapters and connectors are essential for current and future Web Services and Service Oriented Architectures, Heterogeneous Enterprise Information Systems, Data Integration and Data Interoperability, and any other applications that need to access data transparently.
URL To Code
The catalog and list of ODBC drivers can be found at http://www.sqlsummit.com/ODBCVend.htm and http://www.unixodbc.org/drivers.html. The guide on how to use JDBC can be found at http://java.sun.com/javase/6/docs/technotes/guides/jdbc/.
Cross-references
▶ Data Integration ▶ Interface ▶ Java Database Connectivity ▶ .NET Remoting ▶ Open Database Connectivity ▶ Web 2.0/3.0 ▶ Web Services
Recommended Reading 1. Blakeley J. OLE DB: a component DBMS architecture. In Proc. 12th Int. Conf. on Data Engineering, 1996. 2. Geiger K. Inside ODBC. Microsoft, 1995. 3. Greenwald R., Stackowiak R., and Stern J. Oracle Essentials: Oracle Database 10g. O'Reilly, 2004. 4. Hamilton G., Cattell R., and Fisher M. JDBC Database Access with Java: A Tutorial and Annotated Reference. Addison-Wesley, USA, 1997. 5. Kofler M. The Definitive Guide to MySQL 5. Apress, 2005. 6. Myerson J. The Complete Book of Middleware. CRC, USA, 2002. 7. Thai T. and Lam H. .NET Framework Essentials. O'Reilly, 2003.
Database Clustering Methods Xue Li The University of Queensland, Brisbane, QLD, Australia
Synonyms Similarity-based data partitioning
Definitions Given a database D = {t1, t2,...,tn} of tuples and a user-defined similarity function s, 0 ≤ s(ti, tj) ≤ 1, ti, tj ∈ D, the database clustering problem is defined as a partitioning process, such that D can be partitioned into a number of (such as k) subsets (k can be given), as C1, C2,...,Ck, according to s by assigning each tuple in D to a subset Ci. Ci is called a cluster such that Ci = {ti | s(ti, tr) ≥ s(ti, ts), if ti, tr ∈ Ci and ts ∉ Ci}.
Key Points
Database clustering is a process that groups data objects (referred to as tuples in a database) together based on a user-defined similarity function. Intuitively, a cluster is a collection of data objects that are "similar" to each other when they are in the same cluster and "dissimilar" when they are in different clusters. Similarity can be defined in many different ways, such as Euclidean distance, cosine similarity, or the dot product. For data objects, their membership in a certain cluster can be computed according to the similarity function. For example, Euclidean distance can be used to compute the similarity between data objects with numeric attribute values, where the geometric distance is used as a measure of the similarity. In a Euclidean space, the closer the data objects are to each other, the more similar they are. Another example is to use the Euclidean distance to measure the similarity between a data object and a central point, namely the centroid of the cluster. The closer to the centroid the object is, the more likely it is to belong to the cluster. In this case, the similarity is decided by the radius of the points around their geometric centre. For any given dataset, a challenging question is how many natural clusters can be defined. The answer is generally application-dependent and can be subjective to user intentions. In order to avoid specifying k, the number of clusters, in a clustering process, a hierarchical method can be used. In this case, two different approaches, either agglomerative or divisive, can be applied. The agglomerative approach finds the clusters step by step through a bottom-up stepwise merging process until the whole dataset is grouped as a single cluster. The divisive approach finds the clusters step by step through a top-down stepwise split process until every data object becomes a single cluster. Although hierarchical approaches have been widely used in many applications such as biomedical research and experimental analysis in life science, they suffer from the problem of being unable to undo the intermediate results in order to approach a global
optimum solution. In an agglomerative approach, once two objects are merged, they will be together for all following merges and cannot be reassigned. In a divisive approach, once a cluster is split into two subclusters, they cannot be re-grouped into the same cluster for the further split. In addition to hierarchical approaches, which do not need to specify how many clusters to be discovered, a user may specify an integer k for clustering data objects. In general, the task of finding a global optimal k partitions belongs to the class of NP-hard problem. For this reason, heuristics are used in many algorithms to achieve a balance between the efficiency and effectiveness as much as possible to close to the global optimum. Two well-known algorithms are the k-means and k-medoids. One important feature of database clustering is that a dataset tends to be very large, high-dimensional, and coming at a high speed. By using a balanced tree structure, BIRCH algorithm [3] makes a single scan on the incoming data stream. BIRCH algorithm consists of two phases: (i) a summary of historical data is incrementally maintained in main memory as a clustering tree (CF tree). A node in CF tree gives the cardinality, centre, and radius of the cluster. Based on some heuristics, each new arriving data object is assigned to a subcluster, which leads to the update of its cluster feature in the CF tree. (ii) The clustering process is then applied on the leaf nodes of the CF tree. When the final cluster needs to be generated, the sub-clusters are treated as weighted data points and various traditional clustering algorithms can be applied in phase two computation without involving I/O operations. DBSCAN [1] is a density based approach considering the coherence of the data objects. As a result, the nonconvex shapes clusters can be found based on the density that connect the data objects forming any kind of shapes in a Euclidian space. Spatial data indexes such as R* tree can be used to improve the system performance. STING [2] is another hierarchical approach that uses a grid structure to stores density information of the objects. The key features of database clustering approaches are that (i) they are designed to deal with a large volume of data so a trade-off of accuracy and efficiency often needs to be considered. (ii) They are not able to see the complete dataset before the objects are clustered. So a progressive resolution refinement is used to approach the optimal solutions. (iii) They are designed
to deal with constant data streams, and so incremental maintenance of the clustering results is required.
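To make the partitioning idea concrete, the following sketch implements a bare-bones k-means loop over numeric tuples with Euclidean distance; the data points, the choice of k, and the fixed iteration count are illustrative simplifications (production implementations add convergence tests, better seeding, and handling of empty clusters).

```java
import java.util.*;

/** Minimal k-means sketch over numeric tuples using Euclidean distance: tuples are
 *  repeatedly assigned to the nearest centroid, and each centroid is recomputed as
 *  the mean of its members. */
public class KMeans {

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    static int[] cluster(double[][] data, double[][] centroids, int iterations) {
        int[] assignment = new int[data.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: each tuple joins the cluster of its nearest centroid.
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (distance(data[i], centroids[c]) < distance(data[i], centroids[best])) best = c;
                }
                assignment[i] = best;
            }
            // Update step: each centroid becomes the mean of its assigned tuples.
            for (int c = 0; c < centroids.length; c++) {
                double[] mean = new double[centroids[c].length];
                int count = 0;
                for (int i = 0; i < data.length; i++) {
                    if (assignment[i] == c) {
                        for (int d = 0; d < mean.length; d++) mean[d] += data[i][d];
                        count++;
                    }
                }
                if (count > 0) {
                    for (int d = 0; d < mean.length; d++) mean[d] /= count;
                    centroids[c] = mean;
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] data = {{1, 1}, {1.5, 2}, {8, 8}, {8.5, 9}};
        double[][] centroids = {{1, 1}, {8, 8}};   // k = 2, seeded for the example
        System.out.println(Arrays.toString(cluster(data, centroids, 10)));
    }
}
```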
Cross-references
▶ Data Partitioning ▶ K-Means and K-Medoids ▶ Unsupervised Learning
Recommended Reading 1. Ester M., Kriegel H.P., Sander J., and Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, 1996, pp. 226–231. 2. Han J., Kamber M., and Tung A.K.H. Spatial clustering methods in data mining: a survey. In Geographic Data Mining and Knowledge Discovery, H. Miller, J. Han (eds.). Taylor and Francis, UK, 2001. 3. Zhang T., Ramakrishnan R., and Livny M. BIRCH: an efficient data clustering method for very large databases. In Proc. 1996 ACM SIGMOD Int. Conf. on Management of Data, Quebec, Canada, 1996, pp. 103–114.
Database Clusters Marta Mattoso Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
Synonyms DBC
Definition A database cluster (DBC) is a standard computer cluster (a cluster of PC nodes) running a Database Management System (DBMS) instance at each node. A DBC middleware is a software layer between a database application and the DBC. Such middleware is responsible for providing parallel query processing on top of the DBC. It intercepts queries from applications and coordinates distributed and parallel query execution by taking advantage of the DBC. The DBC term comes from an analogy with the term PC cluster, which is a solution for parallel processing by assembling sequential PCs. In a PC cluster there is no need for special hardware to provide parallelism, as opposed to parallel machines or supercomputers. A DBC takes advantage of off-the-shelf sequential DBMSs to run parallel queries. There is no need for special software
or hardware as opposed to parallel database systems. The idea is to offer a high-performance and cost-effective solution based on a PC cluster, without needing to change the DBMS or the application and its database.
Historical Background Traditionally, high-performance of database query processing has been achieved with parallel database systems [7]. Parallel processing has been successfully used to improve performance of heavy-weight queries, typically by replacing the software and hardware platforms with higher computational capacity components (e.g., tightlycoupled multiprocessors and parallel database systems). Although quite effective, this solution requires the database system to have full control over the data, requiring an efficient database partitioning design. It also requires adapting applications from the sequential to the parallel environment. Migrating applications is complex (sometimes impossible), since it may require modifications to the source code. In addition, often it requires the expansion of the computational environment and the application modification, which can be very costly. A cheaper hardware alternative is to use parallel database systems for PC clusters. However, the costs can still be high because of a new database partitioning design and some solutions require specific software (DBMS) or hardware (e.g., SAN – Storage Area Network). The DBC approach has been initially proposed by the database research group from ETH Zurich through the PowerDB project [10] to offer a less expensive and
Database Clusters. Figure 1. DBC architecture.
cost-effective alternative for high performance query processing. Thus, DBC is based on clusters of PC servers and pre-existing DBMS and applications. However, PowerDB is not open-source nor available for download. Several open-source DBC systems (e.g., RepDB*, C-JDBC, ParGRES, and Sequoia) have been proposed to support database applications by using different kinds of database replication on the DBC to obtain inter- and intra-query parallelism and fault tolerance.
Foundations While many techniques are available for high performance query processing in parallel database systems, the main challenge of a DBC is to provide parallelism from outside the DBMS software. A typical DBC architecture is a set of PC servers interconnected by a dedicated high-speed network, each one having its own processor(s) and hard disk (s), and running an off-the-shelf DBMS all coordinated by the DBC software middleware (Fig. 1). The DBC middleware is responsible for offering a single external view of the whole system, like a virtual DBMS. Applications need not be modified when database servers are replaced by their cluster counterparts. The DBC approach is considered to be non-intrusive since it does not require changes on the current application, its queries, its DBMS and its database. Typically, the application is on the client side while the DBMS and the database is fully replicated at the PC cluster nodes. The DBC software middleware
intercepts the application queries at the moment they are sent to the DBMS through the database driver. The DBC middleware then defines the best strategy to execute this query on the DBC to obtain the best performance from the DBC configuration. The DBC software middleware is typically divided on a global component which orchestrates the parallelism and a local component which tunes the local execution to participate on load balancing. High performance in database applications can be obtained by increasing the system throughput, i.e., improving the number of transactions processed per second, and by speeding-up the query execution time for long running queries. The DBC query execution strategy varies according to the type of transactions being submitted and the DBC load. To improve system throughput, the DBC uses inter-query parallelism. To improve queries with long time execution the DBC implements intra-query parallelism. Inter- and intraquery parallelism can be combined. The query execution strategy is based on available database replicas. Inter-query parallelism consists of executing many queries at the same time, each at a different node. Inter-query parallelism is implemented in DBC by transparently distributing queries to nodes that contain replicas of the required database. When the database is replicated at all nodes of the DBC, read-only interquery parallelism is almost straightforward. Any read query can be sent to any replica node and execute in parallel. However, in the presence of updates the DBC must ensure the ACID transaction properties. Typically, a DBC global middleware has a component that manages a pool of connections to running DBMSs. Each request received by the DBC is submitted to a scheduler component that controls concurrent request executions and makes sure that update requests are executed in the same order by all DBMSs. Such scheduler should be able to be configured to enforce different parallel levels of concurrency. Intra-query parallelism consists of executing the same query in parallel, using sub-queries that scan different parts of the database (i.e., a partition), each at a different node. In a DBC, scanning different partitions without hurting the database autonomy is not simple to implement. In DBC, independent DBMSs are used by the middleware as ‘‘black-box’’ components. It is up to the middleware to implement and coordinate parallel execution. This means that query execution plans generated by such DBMSs are not parallel. Furthermore, as
‘‘black-boxes,’’ they cannot be modified to become aware of the other DBMS and generate cooperative parallel plans. Physically partitioning the database relies on a good distribution design which may not work for several queries. An interesting solution to implement intraquery parallelism in DBC is to keep the database replicated and design partitions using virtual partitioning (VP) as proposed by Akal et al. [1]. VP is based on replication and dynamically designs partitions. The basic principle of VP is to take one query, rewrite it as a set of sub-queries ‘‘forcing’’ the execution of each one over a different subset of the table. Then the final query result is obtained through a composition of the partial results generated by the sub-queries.
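The following sketch illustrates the virtual partitioning idea in a simplified form: a single query is rewritten into range-restricted sub-queries, one per node, whose partial results the middleware would later compose. It is not the ParGRES or PowerDB implementation; the table, partitioning column, and key range are illustrative (borrowed from TPC-H naming).

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of virtual partitioning: one aggregate query is rewritten into
 *  range-restricted sub-queries, each intended for a different replica node; the
 *  partial results are then composed by the middleware. Names and ranges are
 *  illustrative only. */
public class VirtualPartitioning {

    static List<String> rewrite(String baseQuery, String partitioningColumn,
                                long minKey, long maxKey, int nodes) {
        List<String> subQueries = new ArrayList<>();
        long rangeSize = (maxKey - minKey + 1) / nodes;
        for (int i = 0; i < nodes; i++) {
            long lo = minKey + i * rangeSize;
            long hi = (i == nodes - 1) ? maxKey : lo + rangeSize - 1;
            // "Force" each sub-query onto a different virtual partition of the table.
            subQueries.add(baseQuery + " AND " + partitioningColumn
                    + " BETWEEN " + lo + " AND " + hi);
        }
        return subQueries;
    }

    public static void main(String[] args) {
        String q = "SELECT SUM(l_extendedprice) FROM lineitem WHERE l_shipdate < DATE '1998-09-01'";
        rewrite(q, "l_orderkey", 1, 6_000_000, 4).forEach(System.out::println);
        // Each sub-query would run on a different node; the middleware sums the partial results.
    }
}
```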
Key Applications DBC obtained much interest for various database applications like OLTP, OLAP, and e-commerce. Such applications can be easily migrated from sequential environments to the low cost DBC solution and obtain high performance in query processing. Different DBC open source solutions are available to cost-effective parallelism for various database applications. Since the high-performance requirements vary according to the typical queries of the applications, different DBC parallel techniques are provided. C-JDBC [3] and Sequoia [11] are DBC focused on e-commerce and OLTP applications. They use inter-query parallelism and are based on fault tolerance and load balancing in query distribution. RepDB* [8] is a DBC focused on throughput, which offers HPC for OLTP transactions. It uses inter-query parallelism and it is based on replica consistency techniques. ParGRES [6] is the only open-source DBC to provide for intra-query parallel processing [5], thus it is focused on OLAP applications. All these solutions have shown significant speedup through high performance query processing. Experimental results using the TPC series of benchmarks can be found for each one of the specific DBC software middlewares, for example TPC-W with C-JDBC and Sequoia, TPC-C with RepDB* and TPC-H with ParGRES.
Future Directions Grid platforms can be considered a natural extension of PC clusters. They are also an alternative of high performance computing with large volumes of data. Several challenges in grid data management are discussed in [9]. An extension of the DBC approach to
Database Clusters. Figure 2. ParGRES DBC – TPC-H query execution times.
grids is proposed [4]. However, communication and data transfer can become a major issue.
URL to Code url: cvs.forge.objectweb.org/cgi-bin/viewcvs.cgi/pargres/ pargres/
Experimental Results The graphic in Fig. 2 shows query execution time decreasing as more processors are included to process queries from the TPC-H benchmark. Query execution times in the graphic are normalized. These experiments used a 32-node PC cluster from Grid5000 [2]. The graphic also shows the execution time that would be obtained if linear speedup were achieved. The speedup achieved by ParGRES while processing isolated queries with different numbers of nodes (from 1 to 32) is superlinear for most queries. A typical OLAP transaction is composed of a sequence of such queries, where one query depends on the result of the previous query. The user has a time frame to make decisions after running a sequence of queries. Since OLAP queries are time consuming, running eight queries can lead to a four-hour elapsed time, according to these tests using one single node for an 11 GB database. These eight queries can have their execution time reduced from four hours of elapsed time to less than one hour, just by using a small four-node cluster configuration. With 32 nodes these queries are processed in a few minutes.
Data Sets "TPC Benchmark™ H – Revision 2.1.0", url: www.tpc.org.
Cross-references
▶ Data Partitioning ▶ Data Replication ▶ Data Warehouse Applications ▶ Distributed Database Design ▶ Grid File (and family) ▶ JDBC ▶ ODBC ▶ On-line Analytical Processing ▶ Parallel Database ▶ Parallel Query Processing ▶ Storage Area Network
Recommended Reading 1. Akal F., Bo¨hm K., and Schek H.J. OLAP query evaluation in a database cluster: a performance study on intra-query parallelism. In Proc. Sixth East-European Conference on Advances in Databases and Information Systems, 2002, pp. 218–231. 2. Cappello F., Desprez F., and Dayde, M., et al. Grid5000: a large scale and highly reconfigurable grid experimental testbed. In International Workshop on Grid Computing, 2005, pp. 99–106. 3. Cecchet E. C-JDBC: a middleware framework for database clustering. IEEE Data Eng. Bull., 27:19–26, 2004. 4. Kotowski N., Lima A.A., Pacitti E., Valduriez P., and Mattoso M., Parallel Query Processing for OLAP in Grids. Concurrency and Computation: Practice & Experience, 20(17):2039–2048, 2008.
Database Connectivity
5. Lima A.A.B., Mattoso M., and Valduriez P. Adaptive virtual partitioning for OLAP query processing in a database cluster. In Proc. 14th Brazilian Symp. on Database Systems, 2004, pp. 92–105. 6. Mattoso M. et al. ParGRES: a middleware for executing OLAP queries in parallel. COPPE-UFRJ Technical Report, ES-690, 2005. 7. Özsu M.T. and Valduriez P. Principles of Distributed Database Systems (2nd edn.). Prentice Hall, Englewood Cliffs, NJ, 1999. 8. Pacitti E., Coulon C., Valduriez P., and Özsu M.T. Preventive replication in a database cluster. Distribut. Parallel Databases, 18(3):223–251, 2005. 9. Pacitti E., Valduriez P., and Mattoso M. Grid data management: open problems and new issues. J. Grid Comput., 5(3):273–281, 2007. 10. Röhm U., Böhm K., Schek H.-J., and Schuldt H. FAS – a freshness-sensitive coordination middleware for a cluster of OLAP components. In Proc. 28th Int. Conf. on Very Large Data Bases, 2002, pp. 754–768. 11. Sequoia Project, http://sequoia.continuent.org.
Database Connectivity ▶ Database Adapter and Connector
Database Constraints ▶ Database Dependencies
data in the database, in the sense that whenever a certain pattern is present among the data, this pattern can either be extended or certain data values must be equal. Such a relationship is called a database dependency. The vast majority of database dependencies in the literature are of the following form [5]: ∀x1...∀xn φ(x1,...,xn) ⇒ ∃z1...∃zk ψ(y1,...,ym, z1,...,zk). Here, {y1,...,ym} ⊆ {x1,...,xn}, φ is a (possibly empty) conjunction of relation atoms using all the variables x1,...,xn, and ψ is either a single equality atom involving universally quantified variables only (in which case the dependency is called equality-generating), or ψ is a non-empty conjunction of relation atoms involving all the variables y1,...,ym, z1,...,zk (in which case the dependency is called tuple-generating). A tuple-generating dependency is called full if it has no existential quantifiers; in the other case, it is called embedded.
Historical Background The theory of database dependencies started with the introduction of functional dependencies by Codd in his seminal paper [8]. They are a generalization of (super) keys. A relation satisfies a functional dependency X → Y (where X and Y are sets of attributes) if, whenever two tuples agree on X, they also agree on Y. For example, if in an employee relation of a company database with schema Ω = {EMP-NR, EMP-NAME, DEPT, JOB, SALARY},
Database Dependencies Marc Gyssens University of Hasselt & Transnational University of Limburg, Diepenbeek, Belgium
Synonyms Database constraints; Data dependency
Definition For a relational database to be valid, it is not sufficient that the various tables of which it is composed conform to the database schema. In addition, the instance must also conform to the intended meaning of the database [15]. While many aspects of this intended meaning are inherently informal, it will generally induce certain formalizable relationships between the
the functional dependencies {EMP-NR} → {EMP-NAME, DEPT, JOB, SALARY} and {DEPT, JOB} → {SALARY} hold, this means that EMP-NR is a key of this relation, i.e., uniquely determines the values of the other attributes, and that JOB in combination with DEPT uniquely determines SALARY. Codd also noticed that the presence of a functional dependency X → Y also allowed a lossless decomposition of the relation into its projections onto X ∪ Y and X ∪ Ȳ (Ȳ denoting the complement of Y). In the example above, the presence of {DEPT, JOB} → {SALARY} allows for the decomposition of the original relation into its projections onto {DEPT, JOB, SALARY} and {EMP-NR, EMP-NAME, DEPT}.
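The definition of functional dependency satisfaction can be turned directly into a small test. The sketch below represents a relation instance as a list of attribute-value maps and checks whether every pair of tuples that agrees on X also agrees on Y; the attribute names and sample tuples are illustrative only.

```java
import java.util.*;

/** Minimal sketch of testing a functional dependency X -> Y on a relation instance,
 *  where each tuple is an attribute-to-value map: whenever two tuples agree on X,
 *  they must also agree on Y. */
public class FdCheck {

    static List<String> projection(Map<String, String> tuple, List<String> attrs) {
        List<String> values = new ArrayList<>();
        for (String a : attrs) values.add(tuple.get(a));
        return values;
    }

    static boolean satisfies(List<Map<String, String>> relation, List<String> x, List<String> y) {
        Map<List<String>, List<String>> seen = new HashMap<>();
        for (Map<String, String> t : relation) {
            List<String> xVal = projection(t, x);
            List<String> yVal = projection(t, y);
            List<String> previous = seen.putIfAbsent(xVal, yVal);
            if (previous != null && !previous.equals(yVal)) return false;  // agree on X, differ on Y
        }
        return true;
    }

    public static void main(String[] args) {
        List<Map<String, String>> emp = List.of(
                Map.of("DEPT", "SALES", "JOB", "CLERK", "SALARY", "1000"),
                Map.of("DEPT", "SALES", "JOB", "CLERK", "SALARY", "1000"),
                Map.of("DEPT", "SALES", "JOB", "MANAGER", "SALARY", "3000"));
        // {DEPT, JOB} -> {SALARY} holds on this small instance.
        System.out.println(satisfies(emp, List.of("DEPT", "JOB"), List.of("SALARY")));
    }
}
```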
Database Dependencies
Hence, the identification of constraints was not only useful for integrity checking but also for more efficient representation of the data and avoiding update anomalies through redundancy removal. Subsequent researchers (e.g., [18]) noticed independently that the presence of the functional dependency X ! Y is a sufficient condition for decomposability of the relation into its projection onto X [ Y and X [ Y , but not a necessary one. For example, Drinker
Beer
Bar
Jones
Tuborg
Far West
Smith Jones Smith
Tuborg Tuborg Tuborg
Far West Tivoli Tivoli
D
database satisfies the inclusion dependency R[A1,..., An] S[B1,...,Bm] if the projection of the relation R onto the sequence of attributes A1,...,An is contained in the projection of the relation S onto the sequence of attributes B1,...,Bn. The proliferation of dependency types motivated researchers to propose subsequent generalizations, eventually leading to the tuple- and equality-generating dependencies of Beeri and Vardi [5] defined higher. For a complete overview, the reader is referred to [14] or the bibliographic sections in [1]. For the sake of completeness, it should also be mentioned that dependency types have been considered that are not captured by the formalism of Beeri and Vardi. An example is the afunctional dependency of De Bra and Paredaens (see, e.g., Chap. 5 of [15]).
Foundations
↡
↡
can be decomposed losslessly into its projections onto {DRINKER, BEER} and {BEER, BAR}, but neither {BEER} ! {DRINKER} nor {BEER} ! {BAR} holds. This led to the introduction of the multivalued dependency: a relation satisfies the multivalued dependency X Y exactly when this relation can be decomposed losslessly into its projections onto X [ Y and X [ Y . Fagin [10] also introduced embedded multivalued dependencies: A relation satisfies the embedded multivalued dependency X Y jZ if its projection onto X [ Y [ Z can be decomposed losslessly into its projections onto X [ Y and X [ Z. Sometimes, however, a relation be decomposed losslessly into three or more of its projections but not in two. This led Rissanen [17] to introduce a more general notion: a relation satisfies a join dependency X1⋈ ... ⋈Xk if it can be decomposed losslessly into its projections onto X1,...,Xk. Quite different considerations led to the introduction of inclusion dependencies [6], which are based on the concept of referential integrity, already known to the broader database community in the 1970s. As an example, consider a company database in which one relation, MANAGERS, contains information on department managers, in particular, MAN-NAME, and another, EMPLOYEES, contains general information on employees, in particular, EMP-NAME. As each manager is also an employee, every value MANNAME in MANAGERS must also occur as a value of EMP-NAME in EMPLOYEES. This is written as the inclusion dependency MANAGERS[MAN-NAME] EMPLOYEES[EMP-NAME]. More generally, a
The development of database dependency theory has been driven mainly by two concerns. One of them is solving the inference problem, and, when decidable, developing tools for deciding it. The other is, as pointed out in the historical background, the use of database dependencies to achieve decompositions of the database contributing to more efficient data representation, redundancy removal, and avoiding update anomalies. Each of these concerns is discussed in some more detail below. Inference
The inference problem is discussed here in the context of tuple- and equality-generating dependencies. The question that must be answered is the following: given a subtype of the tuple- and equality generating dependencies, given as input a set of constraints C and a single constraint c, both of the given type, is it decidable whether C logically implies c In other words, is it decidable if each database instance satisfying C also satisfies c? Given that database dependencies have been defined as first-order sentences, one might be inclined to think that the inference problem is just an instance of the implication problem in mathematical logic. However, for logical implication, one must consider all models of the given database scheme, also those containing infinite relations, while database relations are by definition finite. (In other words, the study of the inference of database dependencies lies within finite model theory.) To separate both notions of inference, a distinction is made between unrestricted
ðF1Þ ; X ! Y if Y X ðreflexivityÞ ðF2Þ fX ! Yg XZ ! YZ ðaugmentationÞ ↡
ðF3Þ fX ! Y ; Y ! Z g X ! Z ðtransitivityÞ ðM1Þ fX Y g X Y ðcomplementationÞ Y if Y X ðreflexivityÞ
ðM2Þ ; X ðM3Þ fX
Y g XZ
ðM4Þ fX
Y;Y
YZ ðaugmentationÞ
↡
Zg X
Z Y ðpseudo
↡
As will be pointed out later, the finite implication problem for functional dependencies and so-called unary inclusion dependencies (i.e., involving only one attribute in each side) is decidable. An important tool for deciding (unrestricted) implication is the chase. In the chase, a table is created for each relation in the database. For each relation atom in the left-hand side of the dependency c to be inferred, its tuple of variables is inserted in the corresponding table. This set of tables is then chased with the dependencies of C: in the case of a tuplegenerating dependency, new tuples are added in a minimal way until the dependency is satisfied (in each application, new variables are substituted for existential variables); in the case of an equality-generating dependency, variables are equated until the dependency is satisfied. The result, chaseðCÞ, which may be infinite, can be seen as a model for C. It is the case that C c if and only if the right-hand side of c is subsumed by some tuple of chaseðCÞ (in the case of a tuple-generating dependency) or the required equality has been applied during the chase procedure. In the case where only full tuple-generating dependencies and equality-generating dependencies are involved, the chase procedure is bound to end, as no existential variables occur in the dependencies, and
transitivityÞ ðFM1Þ fX ! Y g X
Y ðconversionÞ
↡
1 2 3 4 .. .
↡
0 1 2 3 .. .
↡
B
↡
A
hence no new values are introduced. In particular, the unrestricted implication problems coincides with the finite implication problem, and is therefore decidable. Deciding this inference problem is EXPTIME-complete, however. The inference problem for all tuple- and equalitygenerating dependencies is undecidable, however (hence unrestricted and finite implication do not coincide). In 1992, Herrmann [13] solved a longstanding open problem by showing that the finite implication problem is already undecidable for embedded multivalued dependencies. Another approach towards deciding inference of dependency types is trying to find an axiomatization: a finite set of inference rules that is both sound and complete. The existence of such an axiomatization is also a sufficient condition for the decidability of inference. Historically, Armstrong [2] was the first to propose such an axiomatization for functional dependencies. This system of inference rules was eventually extended to a sound and complete axiomatization for functional and multivalued dependencies together [3]:
↡
implication (denoted C c) and finite implication (denoted Cf c) [5]. Since unrestricted implication is recursively enumerable and finite implication is co-recursively enumerable, their coincidence yields that the finite implication problem is decidable. The opposite, however, is not true, as is shown by the following counterexample. Consider a database consisting of a single relation R with scheme {A, B}. Let C ¼ fB ! A; R½B R½Ag and let c be the inclusion dependency R[A] R[B]. One can show that Cf c, but C j6¼ c, as illustrated by the following, necessarily infinite, counterexample:
↡
D
ðFM2Þ fX
Y ; Y ! Z g X ! Z Y ðinteractionÞ
↡
706
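For functional dependencies taken alone, these inference rules translate into an efficient decision procedure based on computing the closure of an attribute set. The following sketch is a minimal illustration in Python; the encoding of dependencies as pairs of attribute sets is an assumption made purely for the example, not the notation of any particular system.

```python
def closure(attrs, fds):
    """Compute the closure of an attribute set under a set of FDs.

    attrs: an iterable of attribute names, e.g. {'A', 'B'}
    fds:   a list of (lhs, rhs) pairs of attribute sets, each pair
           standing for the functional dependency lhs -> rhs
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left-hand side already lies in the closure, every
            # attribute on the right-hand side follows as well.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def implies(fds, lhs, rhs):
    """Decide whether fds logically implies the FD lhs -> rhs."""
    return set(rhs) <= closure(lhs, fds)


# Example: {A -> B, B -> C} implies A -> C (transitivity).
fds = [({'A'}, {'B'}), ({'B'}, {'C'})]
assert implies(fds, {'A'}, {'C'})
assert not implies(fds, {'C'}, {'A'})
```

Each iteration of the outer loop either adds at least one attribute or terminates, so the procedure runs in low polynomial time in the size of the input.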
Moreover, (F1)–(F3) are sound and complete for the inference of functional dependencies alone, and (M1)–(M4) are sound and complete for the inference of multivalued dependencies alone. The above axiomatization is at the basis of an algorithm to decide inference of functional and multivalued dependencies in low polynomial time. Of course, the inference problem for join dependencies is also decidable, as they are full tuple-generating dependencies. However, there does not exist a sound and complete axiomatization for the inference of join dependencies [16], even though there does exist
an axiomatization for a larger class of database dependencies. There also exists a sound and complete axiomatization for inclusion dependencies [6]:

(I1) ∅ ⊢ R[X] ⊆ R[X] (reflexivity)
(I2) {R[A1, ..., Am] ⊆ S[B1, ..., Bm]} ⊢ R[Ai1, ..., Aik] ⊆ S[Bi1, ..., Bik] if i1, ..., ik is a sequence of integers in {1, ..., m} (projection)
(I3) {R[X] ⊆ S[Y], S[Y] ⊆ T[Z]} ⊢ R[X] ⊆ T[Z] (transitivity)

Above, X, Y, and Z represent sequences rather than sets of attributes. Consequently, the implication problem for inclusion dependencies is decidable, even though inclusion dependencies are embedded tuple-generating dependencies. However, deciding implication of inclusion dependencies is PSPACE-complete. It has already been observed above that the unrestricted and finite implication problems for functional dependencies and unary inclusion dependencies taken together do not coincide. Nevertheless, the finite implication problem for this class of dependencies is decidable. Unfortunately, the finite implication problem for functional dependencies and general inclusion dependencies taken together is undecidable (e.g., [7]).

Decompositions
As researchers realized that the presence of functional dependencies makes it possible to decompose the database, the question arose as to how far this decomposition process ought to be taken. This led Codd, in follow-up papers to [8], to introduce several normal forms, the most ambitious of which is Boyce-Codd Normal Form (BCNF). A database is in BCNF if, whenever one of its relations satisfies a nontrivial functional dependency X → Y (i.e., where Y is not a subset of X), X must be a superkey of the relation (i.e., the functional dependency X → U holds, where U is the set of all attributes of that relation). There exist algorithms that construct a lossless BCNF decomposition for a given relation. Unfortunately, it is not guaranteed that such a decomposition is also dependency-preserving, in the following sense: the set of functional dependencies that hold in the relations of the decomposition and that can
be inferred from the given functional dependencies is in general not equivalent to the set of the given functional dependencies. Even worse, a dependency-preserving BCNF decomposition of a given relation does not always exist. For that reason, Third Normal Form (3NF), historically a precursor to BCNF, is also still considered. A database is in 3NF if, whenever one of its relations satisfies a nontrivial functional dependency X → {A} (A being a single attribute), the relation must have a minimal key containing A. Every database in BCNF is also in 3NF, but not the other way around. However, there exists an algorithm that, given a relation, produces a dependency-preserving lossless decomposition in 3NF. Several other normal forms have also been considered, taking into account multivalued dependencies or join dependencies besides functional dependencies. However, one can argue that, by giving a join dependency, one actually already specifies how one wants to decompose a database. If one stores this decomposed database rather than the original one, the focus shifts from integrity checking to consistency checking: can the various relations of the decomposition be interpreted as the projections of a universal relation? Unfortunately, consistency checking is in general exponential in the number of relations. Therefore, a lot of attention has been given to so-called acyclic join dependencies [4]. There are many equivalent definitions of this notion, one of which is that an acyclic join dependency is equivalent to a set of multivalued dependencies. Also, global consistency of a decomposition is already implied by pairwise consistency if and only if the join dependency defining the decomposition is acyclic, which explains in part the desirability of acyclicity. Gyssens [12] generalized the notion of acyclicity to k-cyclicity, where acyclicity corresponds to the case k = 2. A join dependency is k-cyclic if it is equivalent to a set of join dependencies each of which has at most k components. Also, global consistency of a decomposition is already implied by k-wise consistency if and only if the join dependency defining the decomposition is k-cyclic.
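To make the normalization discussion above concrete, the following sketch repeatedly splits a schema on a BCNF-violating functional dependency, yielding a lossless (though not necessarily dependency-preserving) decomposition. It is a simplified illustration under the same set-based encoding of functional dependencies as in the earlier closure sketch, not an industrial-strength implementation.

```python
from itertools import combinations


def closure(attrs, fds):
    """Closure of an attribute set under a set of FDs (as in the earlier sketch)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def project_fds(sub_attrs, fds):
    """Project fds onto sub_attrs by closing every subset of sub_attrs.

    Brute force and exponential, which is sufficient for a small example;
    projecting dependencies is inherently expensive in general.
    """
    sub = set(sub_attrs)
    projected = []
    for k in range(1, len(sub) + 1):
        for xs in combinations(sorted(sub), k):
            rhs = (closure(set(xs), fds) & sub) - set(xs)
            if rhs:
                projected.append((set(xs), rhs))
    return projected


def bcnf_decompose(attrs, fds):
    """Return a list of attribute sets forming a lossless BCNF decomposition."""
    attrs = set(attrs)
    for lhs, rhs in fds:
        lhs, rhs = set(lhs), set(rhs)
        if rhs <= lhs:
            continue                      # trivial dependency, no violation
        cl = closure(lhs, fds) & attrs
        if attrs <= cl:
            continue                      # lhs is a superkey, no violation
        r1 = cl                           # first component: lhs and everything it determines
        r2 = lhs | (attrs - cl)           # second component: lhs and the remaining attributes
        return (bcnf_decompose(r1, project_fds(r1, fds)) +
                bcnf_decompose(r2, project_fds(r2, fds)))
    return [attrs]


# Example: R(A, B, C) with A -> B; A is not a superkey of R, so R is
# split into (A, B) and (A, C).
print(bcnf_decompose({'A', 'B', 'C'}, [({'A'}, {'B'})]))
```

The decomposition produced here is lossless by construction, but, as noted above, it may fail to preserve all of the original dependencies.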
Key Applications
Despite the explosion of dependency types during the latter half of the 1970s, one must realize that the dependency types most used in practice are still functional dependencies (in particular, key dependencies) and inclusion dependencies. It is therefore unfortunate
that the inference problem for functional and inclusion dependencies combined is undecidable. At a more theoretical level, the success of studying database constraints from a logical point of view, and the awareness that it is important to distinguish between unrestricted and finite implication, certainly contributed to the interest in, and further development of, finite model theory by theoretical computer scientists. Finally, the decompositions defined by join dependencies led to a theory of decompositions for the underlying hypergraphs, which found applications in other areas as well, notably in artificial intelligence (e.g., [9,11]).
Cross-references
▶ Boyce-Codd Normal Form ▶ Chase ▶ Equality-Generating Dependencies ▶ Fourth Normal Form ▶ Functional Dependency ▶ Implication of Constraints ▶ Inconsistent Databases ▶ Join Dependency ▶ Multivalued Dependency ▶ Normal Forms and Normalization ▶ Relational Model ▶ Second Normal Form (2NF) ▶ Third Normal Form ▶ Tuple-Generating Dependencies
Recommended Reading
1. Abiteboul S., Hull R., and Vianu V. Foundations of databases. Addison-Wesley, Reading, Mass., 1995. (Part C).
2. Armstrong W.W. Dependency structures of data base relationships. In Proc. IFIP Congress 74, 1974, pp. 580–583.
3. Beeri C., Fagin R., and Howard J.H. A complete axiomatization for functional and multivalued dependencies. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1978, pp. 47–61.
4. Beeri C., Fagin R., Maier D., and Yannakakis M. On the desirability of acyclic database schemes. J. ACM, 30(3):479–513, 1983.
5. Beeri C. and Vardi M.Y. The implication problem for data dependencies. In Proc. Int. Conf. on Algorithms, Languages, and Programming, 1981. Springer, 1981, pp. 73–85.
6. Casanova M.A., Fagin R., and Papadimitriou C.H. Inclusion dependencies and their interaction with functional dependencies. J. Comput. Syst. Sci., 28(1):29–59, 1984.
7. Chandra A.K. and Vardi M.Y. The implication problem for functional and inclusion dependencies is undecidable. SIAM J. Comput., 14(3):671–677, 1985.
8. Codd E.F. A relational model of data for large shared data banks. Commun. ACM, 13(6):377–387, 1970.
9. Cohen D.A., Jeavons P., and Gyssens M. A unified theory of structural tractability for constraint satisfaction problems. J. Comput. Syst. Sci., 74(5):721–743, 2008.
10. Fagin R. Multivalued dependencies and a new normal form for relational databases. ACM Trans. Database Syst., 2(3):262–278, 1977.
11. Gottlob G., Miklós Z., and Schwentick T. Generalized hypertree decompositions: NP-hardness and tractable variants. In Proc. 26th ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems, 2007, pp. 13–22.
12. Gyssens M. On the complexity of join dependencies. ACM Trans. Database Syst., 11(1):81–108, 1986.
13. Herrmann C. On the undecidability of implications between embedded multivalued dependencies. Inform. Comput., 122(2):221–235, 1995.
14. Kanellakis P.C. Elements of relational database theory. In: Van Leeuwen J. (ed.). Handbook of theoretical computer science, Elsevier, 1991, pp. 1074–1156.
15. Paredaens J., De Bra P., Gyssens M., and Van Gucht D. The structure of the relational database model. In EATCS Monographs on Theoretical Computer Science, Vol. 17. Brauer W., Rozenberg G., and Salomaa A. (eds.). Springer, 1989.
16. Petrov S.V. Finite axiomatization of languages for representation of system properties. Inform. Sci., 47(3):339–372, 1989.
17. Rissanen J. Independent components of relations. ACM Trans. Database Syst., 2(4):317–325, 1977.
18. Zaniolo C. Analysis and design of relational schemata for database systems. Ph.D. thesis, University of California at Los Angeles, 1976. Technical Report UCLA-Eng-7669.
Database Design
JOHN MYLOPOULOS
University of Trento, Trento, Italy
Definition
Database design is a process that produces a series of database schemas for a particular application. The schemas produced usually include a conceptual, a logical, and a physical schema. Each of these is defined using a different data model. A conceptual or semantic data model is used to define the conceptual schema, while a logical data model is used for the logical schema. A physical schema is obtained from a logical schema by deciding what indexes and clustering to use, given the expected workload for the database under design.
Key Points
For every existing database, there is a design team and a design process that produced it. That process can
make or break a database, as it determines what information it will contain and how this information will be structured. The database design process produces a conceptual, a logical and a physical database schema. These schemas describe the contents of a database at different levels of abstraction. The conceptual schema focuses on the entities and relationships about which information is to be contained in the database. The Entity-Relationship Model is the standard model for defining conceptual schemas, though there have been many other proposals. UML class diagrams can also be used for this design phase. The logical schema describes the logical structure of the database. The Relational Model is the standard model for this phase, which views a database as a collection of tables. Alternative data models include the Hierarchical and the Network Data Models, but also object-oriented data models that view a database as a collection of inter-related objects instantiating a collection of classes. The need to create different schemas that describe the contents of a database at different levels of abstraction was noted as far back as 1975 in a report by the American National Standards Institute (ANSI) [1], and has evolved since then. The report proposed a three-level architecture consisting of several external schemas representing alternative user views of a database, a conceptual schema whose information content subsumed that of external schemas, and an internal schema that represented database content in terms of a particular database technology (such as a relational Database Management System). For database design purposes, conceptual schemas have to be built upfront, whereas external schemas can be created dynamically according to user needs. Moreover, the notion of an internal schema has been refined to that of a logical and a physical schema. The database design process often consists of four phases: requirements elicitation, conceptual schema design, logical schema design, and physical schema design. Requirements elicitation gathers information about the contents of the database to be designed from those who have a stake in it (a.k.a. stakeholders). This information is often expressed in natural language and may be ambiguous and/or contradictory. For example, two stakeholders may differ on what information about customers or patients is useful and should be included in the database-to-be. A conceptual schema is extracted from a given set of requirements through a
series of steps that focus on noun phrases to identify entities, verb phrases to identify important relationships among entities, and other grammatical constructions to identify attributes about which information is useful to include in the database. A conceptual schema is then transformed to a logical one through a series of well-defined transformations that map collections of entities and relationships into a relation whose attributes and keys are determined by the source entities and relationships. The logical schema design phase often includes a normalization step where an initial logical schema with associated functional dependencies is transformed into a normalized schema using one of several well-studied normal forms. Physical schema design starts with a logical schema and determines the index to be used for each relation in the logical schema. This decision is based on the expected workload for the database-to-be, defined by the set of most important queries and updates that will be evaluated against the database. In addition, physical design determines the clustering of tuples in physical storage. This clustering plays an important role in the performance of the system as it evaluates queries that return many tuples (for example, queries that include joins). Physical schema design may dictate the revision of the logical schema by splitting/merging relations to improve performance. This step is known as denormalization. As suggested by denormalization, the database design process should not be viewed as a sequential process that begins with requirements elicitation and proceeds to generate a conceptual, logical and physical schema in that order. Rather, the process consists of four linearly ordered phases and is iterative: after completing any one phase, the designer may return to earlier ones to revise the schemas that have been produced so far, and even the requirements that have been gathered.
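As a concrete illustration of the conceptual-to-logical and physical design steps described above, the sketch below maps a toy Entity-Relationship design onto SQL table definitions and a single index, printed as strings from Python. All entity names, attributes, and the index are invented for the example, and the mapping handles only the simplest cases (single-attribute keys and one many-to-one relationship); it is a sketch of the idea, not a complete design procedure.

```python
# A toy conceptual design: two entities and one many-to-one relationship.
# All names are hypothetical and the column types are deliberately uniform.
entities = {
    "Customer": {"attributes": ["cust_id", "name", "city"], "key": "cust_id"},
    "Orders":   {"attributes": ["order_id", "order_date"],  "key": "order_id"},
}
# Each order is placed by exactly one customer (many-to-one), so the
# relationship is folded into the Orders relation as a foreign key.
many_to_one = [("Orders", "Customer")]


def entity_to_table(name, spec):
    """Map one entity of the conceptual schema to a relational table."""
    cols = ", ".join(f"{a} VARCHAR(64)" for a in spec["attributes"])
    return f"CREATE TABLE {name} ({cols}, PRIMARY KEY ({spec['key']}));"


def relationship_to_fk(child, parent):
    """Map a many-to-one relationship to a foreign-key column in the child."""
    key = entities[parent]["key"]
    return (f"ALTER TABLE {child} ADD COLUMN {key} VARCHAR(64);\n"
            f"ALTER TABLE {child} ADD FOREIGN KEY ({key}) REFERENCES {parent}({key});")


for name, spec in entities.items():
    print(entity_to_table(name, spec))
for child, parent in many_to_one:
    print(relationship_to_fk(child, parent))

# Physical schema design would then choose indexes for the expected
# workload, e.g. an index supporting joins of Orders with Customer:
print("CREATE INDEX orders_by_customer ON Orders (cust_id);")
```

A real design would additionally handle relationship attributes, weak entities, many-to-many relationships, and the normalization and workload-driven denormalization steps discussed above.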
Cross-references
▶ Conceptual Data Model ▶ Normalization Theory ▶ Physical Database Design for Relational Databases ▶ Semantic Data Model
Recommended Reading
1. American National Standards Institute. Interim Report: ANSI/X3/SPARC Study Group on Data Base Management Systems. FDT – Bull. ACM SIGMOD, 7(2):1–140, 1975.
2. Atzeni P., Ceri S., Paraboschi S., and Torlone R. Database Systems: Concepts, Languages and Architectures. McGraw Hill, New York, 1999.
Database Design Recovery
▶ Database Reverse Engineering
Database Engine
▶ Query Processor

Database Implementation
▶ Physical Database Design for Relational Databases

Database Interaction
▶ Session

Database Languages for Sensor Networks
SAMUEL MADDEN
Massachusetts Institute of Technology, Cambridge, MA, USA

Synonyms
Acquisitional query languages; TinySQL

Definition
Sensor networks – collections of small, inexpensive, battery-powered, wirelessly networked devices equipped with sensors (microphones, temperature sensors, etc.) – offer the potential to monitor the world with unprecedented fidelity. Deploying software for these networks, however, is difficult, as they are complex, distributed, and failure prone. To address these complexities, several sensor network database systems, including TinyDB [7], Cougar [12], and SwissQM [8], have been proposed. These systems provide a high-level SQL-like query language that allows users to specify what data they would like to capture from the network and how they would like that data processed, without worrying about low-level details such as power management, network formation, and time synchronization. This entry discusses the main features of these languages, and their relationship to SQL and other database languages.

Historical Background
Cougar and TinyDB were the first sensor network databases, with the bulk of their development occurring between 1999 and 2003. They emerged as a result of rising interest in wireless sensor networks and other tiny, embedded, battery-powered computers. TinyDB was co-developed as a part of the TinyOS operating system [2] for Berkeley Mote-based sensor networks. Initial versions of the motes used Atmel 8-bit microprocessors and 40 kbit/s radios; newer generations, developed by companies like Crossbow Technologies (http://www.xbow.com) and Moteiv Technologies (http://www.moteiv.com), use Zigbee (802.15.4) radios running at 250 kbit/s and Atmel or Texas Instruments 8- or 16-bit microprocessors running at 4–8 MHz. Nodes typically are very memory-constrained (with 4–10 Kbytes of RAM and 48–128 Kbytes of nonvolatile flash-based program memory). Most nodes can be interfaced to sensors that can capture a variety of readings, including light, temperature, humidity, vibration, acceleration, sounds, or images. The limited processing power and radio bandwidth of these devices constrains sample rates to at most a few kilosamples/s. Using such tiny devices does allow power consumption to be quite low, especially when sample rates are kept down; for example, networks that sample about once a second from each node can provide lifetimes of a month or longer on coin-cell batteries or a year or more on a pair of AA batteries [4]. The promise of sensor network databases is that they provide a very simple way to accomplish one of the most common goals of sensor networks: data collection. Using a simple, high-level declarative language, users specify what data they want and how fast they want it. The challenge of building a sensor network database lies in capturing the required data in a power-efficient and reliable manner. The choice of programming language for these systems – the main topic of this entry – is essential to meeting that challenge. The language must be expressive enough to allow users to get the data they
want, but also implementable in a way that is power-efficient, so that the network lasts as long as possible. To understand how sensor network querying works, it is important to understand how sensor network databases are used. The typical usage model is as follows: a collection of static sensor nodes is placed in some remote location; each node is pre-programmed with the database software. These nodes report data wirelessly (often over multiple radio hops) to a nearby "basestation" – typically a laptop-class device with an Internet connection, which then relays data to a server where data is stored, visualized, and browsed. Users interact with the system by issuing queries at the basestation, which in turn broadcasts queries out into the network. Queries are typically disseminated via flooding, or perhaps using some more clever gossip-based dissemination scheme (e.g., Trickle [3]). As nodes receive the query, they begin processing it. The basic programming model is data-parallel: each node runs the same query over data that it locally produces or receives from its neighbors. As nodes produce query results, they send them towards the basestation. When a node has some data to transmit, it relays it to the basestation using a so-called tree-based routing protocol. These protocols cause the nodes to arrange themselves into a tree rooted at the basestation. This tree is formed by having the basestation periodically broadcast a beacon message. Nodes that hear this beacon re-broadcast it, indicating that they are one hop from the basestation; nodes that hear those messages in turn re-broadcast them, indicating that they are two hops from the basestation, and so on. This process of (re)broadcasting beacons occurs continuously, such that (as long as the network is connected) all nodes will eventually hear a beacon message. When a node hears a beacon message, it chooses a node from which it heard the message to be its parent, sending messages through that parent when it needs to transmit data to the basestation. (In general, parent selection is quite complicated, as a node may hear beacons from several candidate parents; early papers by Woo and Culler [11] and De Couto et al. [1] provide details.) Note that this ad hoc tree-based network topology is significantly different from the any-to-any routing networks that are used in traditional parallel and distributed database systems. As discussed below, this imposes certain limitations on the types of queries that are feasible to express efficiently in sensor network database systems.
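The tree formation just described can be captured in a few lines. The sketch below is a hypothetical, highly simplified simulation in which the first beacon a node hears determines its parent; real protocols also weigh link quality when selecting parents, as the cited papers discuss.

```python
from collections import deque


def build_routing_tree(links, basestation):
    """Simulate beacon flooding: each node adopts as parent the first
    neighbor from which it hears a beacon, yielding hop counts and a
    spanning tree rooted at the basestation.

    links: dict mapping each node to the list of nodes within radio range
    """
    parent = {basestation: None}
    hops = {basestation: 0}
    queue = deque([basestation])
    while queue:
        node = queue.popleft()
        for neighbor in links[node]:
            if neighbor not in parent:        # first beacon heard wins
                parent[neighbor] = node
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return parent, hops


# A small hypothetical topology: node 0 is the basestation.
links = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 4], 4: [2, 3]}
parent, hops = build_routing_tree(links, 0)
print(parent)   # {0: None, 1: 0, 2: 0, 3: 1, 4: 2}
print(hops)     # {0: 0, 1: 1, 2: 1, 3: 2, 4: 2}
```

In an actual deployment, parents are re-selected continuously as link quality and connectivity change; the point of the sketch is only the shape of the resulting collection tree and the hop counts that determine how many transmissions a result needs to reach the basestation.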
Foundations
Most sensor network database systems provide a SQL-like query interface. TinySQL, the query language used in TinyDB, for example, allows users to specify queries (through a GUI or command-line interface) that appear as follows: SELECT <select list> FROM