It may surprise you, but when meeting other people it helps to show interest in their recent publications and research. I created this enhanced version of the VLDB program listing (authors have links to DBLP) for tracking recent research for my blog. You may find it useful at the VLDB conference.
I will be revising this document to resolve names (where DBLP suggests more than one match) and to correct errors. Comments and suggestions are welcome! patrick@durusau.net
I have generally followed the format of the VLDB program, which means there are duplicate entries for demonstrations and papers, for instance. I left the duplicates in, reasoning that being able to quickly find content in the expected location outweighs other considerations.
Abstract. Due in part to the increasing mobile use of the web and the proliferation of geo-positioning, the web is fast acquiring a significant spatial aspect. Content and users are being augmented with locations that are used increasingly by location-based services. Studies suggest that each week, several billion web queries are issued that have local intent and target spatial web objects. These are points of interest with a web presence, and they thus have locations as well as textual descriptions. This development has given prominence to spatial web data management, an area ripe with new and exciting opportunities and challenges. The research community has embarked on inventing and supporting new query functionality for the spatial web. Different kinds of spatial web queries return objects that are near a location argument and are relevant to a text argument. To support such queries, it is important to be able to rank objects according to their relevance to a query. And it is important to be able to process the queries with low latency. The talk offers an overview of key aspects of the spatial web. Based on recent results obtained by the speaker and his colleagues, the talk explores new query functionality enabled by the setting. Further, the talk offers insight into the data management techniques capable of supporting such functionality.
Abstract. New applications of computing are being enabled by instrumentation of physical entities, aggregation of data, and the analysis of the data. The resulting integration of information and control permits efficient and effective management of complex man-made systems. Examples include transportation systems, buildings, electrical grids, health care systems, governments, and supply chains. Achieving this vision requires extensive data integration and analysis, over diverse, rapidly changing, and often uncertain data. There are many challenges, requiring both new data management techniques as well as new mathematics, forcing new collaborations as the basis of the new "Data Science". Needs and opportunities will be discussed in the context of specific pilots and projects.
Abstract. The rapid spread of social networks, global on-line shopping, post-9/11 security-oriented linking of databases and, foremost, the global adoption of smart phones/devices, among other phenomena, are transforming data clusters into dynamic and almost uncontrollable entities that have their own local intelligence, clients and objectives. The scale and rapidity of change is such that large scale innovations in content storage and management are urgently needed if the diseconomies of scale and complexity are to be mitigated. The field needs to reinvent itself. Istanbul, a city that has reinvented itself many times, is an excellent venue to engage in such a discussion and for me to offer suggestions and proposals that derive from personal experiences spanning academia, start-ups, R&D firms and Bell Labs, as well as my early years spent in Istanbul.
Abstract. Research in data stream algorithms has blossomed since the late 90s. The talk will trace the history of the Approximate Frequency Counts paper, how it was conceptualized and how it influenced data stream research. The talk will also touch upon a recent development: the analysis of personal data streams for improving our quality of life.
Abstract. This tutorial is motivated by the clear need of many organizations, companies, and researchers to deal with big data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. A popular data processing engine for big data is Hadoop MapReduce. Early versions of Hadoop MapReduce suffered from severe performance problems. Today, this is becoming history. There are many techniques that can be used with Hadoop MapReduce jobs to boost performance by orders of magnitude. In this tutorial we teach such techniques. First, we will briefly familiarize the audience with Hadoop MapReduce and motivate its use for big data processing. Then, we will focus on different data management techniques, going from job optimization to physical data organization like data layouts and indexes. Throughout this tutorial, we will highlight the similarities and differences between Hadoop MapReduce and Parallel DBMS. Furthermore, we will point out unresolved research problems and open issues.
Abstract. There is a growing trend of applications that should handle big data. However, analyzing big data is a very challenging problem today. For such applications, the MapReduce framework has recently attracted a lot of attention. Google's MapReduce or its open-source equivalent Hadoop is a powerful tool for building such applications. In this tutorial, we will introduce the MapReduce framework based on Hadoop, discuss how to design efficient MapReduce algorithms and present the state-of-the-art in MapReduce algorithms for data mining, machine learning and similarity joins. The intended audience of this tutorial is professionals who plan to design and develop MapReduce algorithms and researchers who should be aware of the state-of-the-art in MapReduce algorithms available today for big data analysis.
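As a point of reference for readers new to the model, here is a minimal sketch of the MapReduce programming pattern the tutorial assumes, simulated in plain Python rather than run on a Hadoop cluster; the function names and the word-count example are illustrative, not part of the Hadoop API.

```python
# A minimal sketch of the MapReduce programming model, simulated in plain
# Python rather than on a Hadoop cluster. Names are illustrative only.
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs, as a Hadoop mapper would."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle(pairs):
    """Group values by key, as the framework's shuffle/sort step would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(grouped):
    """Sum the counts for each word, as a Hadoop reducer would."""
    for key, values in grouped:
        yield key, sum(values)

if __name__ == "__main__":
    docs = ["big data is big", "map reduce processes big data"]
    print(dict(reduce_phase(shuffle(map_phase(docs)))))
    # {'big': 3, 'data': 2, 'is': 1, 'map': 1, 'reduce': 1, 'processes': 1}
```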
Abstract. This tutorial brings together perspectives on ER from a variety of fields, including databases, machine learning, natural language processing and information retrieval, to provide, in one setting, a survey of a large body of work. We discuss both the practical aspects and theoretical underpinnings of ER. We describe existing solutions, current challenges, and open research problems.
Abstract. The advent of the so-called NoSQL databases has brought about a new model of using storage systems. While traditional relational database systems took advantage of features offered by centrally-managed, enterprise-class storage arrays, the new generation of database systems with weaker data consistency models is content with using and managing locally attached individual storage devices and providing data reliability and availability through high-level software features and protocols. This work aims to review the architecture of several existing NoSQL DBs with an emphasis on how they organize and access data in the shared-nothing locally-attached storage model. It shows how these systems operate under typical workloads (new inserts, point queries, and range queries) and what access characteristics they exhibit to storage systems. Finally, it examines how several recently developed key/value stores, schema-free document storage systems, and extensible column stores organize data on local filesystems on top of directly-attached disks and what system features they must (re)implement in order to provide the expected data reliability.
Yizhou Sun (University of Illinois at Urbana-Champaign, USA)
Jiawei Han (University of Illinois at Urbana-Champaign, USA)
Xifeng Yan (University of California, Santa Barbara, USA)
Philip S. Yu (University of Illinois at Chicago, USA)
Abstract. Most objects and data in the real world are interconnected, forming complex, heterogeneous but often semi-structured information networks. However, most people consider a database merely as a data repository that supports data storage and retrieval rather than one or a set of heterogeneous information networks that contain rich, inter-related, multi-typed data and information. Most network science researchers only study homogeneous networks, without distinguishing the different types of objects and links in the networks. In this tutorial, we view database and other interconnected data as heterogeneous information networks, and study how to leverage the rich semantic meaning of types of objects and links in the networks. We systematically introduce the technologies that can effectively and efficiently mine useful knowledge from such information networks.
B. Aditya Prakash (Carnegie Mellon University, USA)
Christos Faloutsos (Carnegie Mellon University, USA)
Abstract. How do contagions spread in population networks? Which group should we market to, for maximizing product penetration? Will a given YouTube video go viral? Who are the best people to vaccinate? What happens when two products compete? The objective of this tutorial is to provide an intuitive and concise overview of most important theoretical results and algorithms to help us understand and manipulate such propagation-style processes on large networks. The tutorial contains three parts: (a) Theoretical results on the behavior of fundamental models; (b) Scalable Algorithms for changing the behavior of these processes e.g., for immunization, marketing etc.; and (c) Empirical Studies of diffusion on blogs and on-line websites like Twitter. The problems we focus on are central in surprisingly diverse areas: from computer science and engineering, epidemiology and public health, product marketing to information dissemination. Our emphasis is on intuition behind each topic, and guidelines for the practitioner.
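For readers unfamiliar with propagation models, the sketch below simulates one of the fundamental models typically covered in such tutorials, the independent cascade; the graph, seed set, and activation probability are made-up illustrative values, not material from the tutorial itself.

```python
# A toy simulation of the independent cascade propagation model.
import random

def independent_cascade(adj, seeds, p=0.3, rng=None):
    """Return the set of nodes eventually activated from the seed set."""
    rng = rng or random.Random(42)
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj.get(u, []):
                # Each newly active node gets one chance to activate each neighbor.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

if __name__ == "__main__":
    adj = {0: [1, 2], 1: [2, 3], 2: [3, 4], 3: [4], 4: []}
    print(sorted(independent_cascade(adj, seeds={0})))
```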
Asuman Dogac (SRDC Ltd., Turkey)
Abstract. Interoperability in eHealth systems is important for delivering quality healthcare and reducing healthcare costs. Some of the important use cases include coordinating the care of chronic patients by enabling the co-operation of many different eHealth systems such as Electronic Health Record Systems (EHRs), Personal Health Record Systems (PHRs) and wireless medical sensor devices; enabling secondary use of EHRs for clinical research; and sharing lifelong EHRs among different healthcare providers. Although achieving eHealth interoperability is quite a challenge, both because there are competing standards and because clinical information itself is very complex, there have been a number of successful industry initiatives such as Integrating the Healthcare Enterprise (IHE) Profiles, as well as large scale deployments such as the National Health Information System of Turkey and the epSOS initiative for sharing Electronic Health Records and ePrescriptions in Europe. This article briefly describes the subjects discussed in the VLDB 2012 tutorial to provide an overview of the issues in eHealth interoperability, describing the key technologies and standards, identifying important use cases and the associated research challenges, and describing some of the large scale deployments. The aim is to foster further interest in this area.
Divyakant Agrawal (University of California at Santa Barbara, USA)
Amr El Abbadi (University of California at Santa Barbara, USA)
Shiyuan Wang (University of California at Santa Barbara, USA)
Abstract. Cloud computing has become a successful paradigm for data computing and storage. Increasing concerns about data security and privacy in the cloud, however, have emerged. Ensuring security and privacy for data management and query processing in the cloud is critical for better and broader uses of the cloud. This tutorial covers some common cloud security and privacy threats and the relevant research, while focusing on the works that protect data confidentiality and query access privacy for sensitive data being stored and queried in the cloud. We provide a comprehensive study of state-of-the-art schemes and techniques for protecting data confidentiality and access privacy, which make different tradeoffs in the multidimensional space of security, privacy, functionality and performance.
Sudipto Guha (University of Pennsylvania, USA)
Andrew McGregor (University of Massachusetts Amherst, USA)
Abstract. Massive graphs arise in any application where there is data about both basic entities and the relationships between these entities, e.g., web-pages and hyperlinks; neurons and synapses; papers and citations; IP addresses and network flows; people and their friendships. Graphs have also become the de facto standard for representing many types of highly structured data. However, the sheer size of many of these graphs renders classical algorithms inapplicable when it comes to analyzing such graphs. In addition, these existing algorithms are typically ill-suited to processing distributed or stream data. Various platforms have been developed for processing large data sets. At the same time, there is the need to develop new algorithmic ideas and paradigms. In the case of graph processing, a lot of recent work has focused on understanding the important algorithmic issues. A central aspect of this is the question of how to construct and leverage small-space synopses in graph processing. The goal of this tutorial is to survey recent work on this question and highlight interesting directions for future research.
Panels
Moderators:
Alexandros Labrinidis (University of Pittsburgh, USA)
H. V. Jagadish (University of Michigan, USA)
Panelists:
Susan Davidson (University of Pennsylvania)
Johannes Gehrke (Cornell University)
Nick Koudas (University of Toronto)
Raghu Ramakrishnan (Microsoft)
Abstract. The promise of data-driven decision-making is now being recognized broadly, and there is growing enthusiasm for the notion of "Big Data", including the recent announcement from the White House about new funding initiatives across different agencies that target research for Big Data. While the promise of Big Data is real -- for example, it is estimated that Google alone contributed 54 billion dollars to the US economy in 2009 -- there is no clear consensus on what Big Data is. In fact, there have been many controversial statements about Big Data, such as "Size is the only thing that matters." In this panel we will try to explore the controversies and debunk the myths surrounding Big Data.
Moderators:
Amr El Abbadi (University of California, Santa Barbara, USA)
Mohamed F. Mokbel (University of Minnesota, USA)
Panelists:
Gustavo Alonso (ETH Zurich)
Mike Carey (University of California, Irvine)
Mohamed Mokbel (University of Minnesota)
Srinivas Narayanan (Facebook)
Gerhard Weikum (Max-Planck-Institut für Informatik)
Abstract. Social networks, mobility and the cloud represent special and unique opportunities for synergy among several existing and emerging communities that are now often evolving in isolated silos. All three areas hold much promise for the future of computing, and represent significant challenges for large scale data management. As these three areas evolve, their direct influence on significant decisions on each other becomes evident and critical. In particular, the cloud has evolved as a new infrastructure paradigm that is a favorite for most computing applications, especially with its attractive pay-as-you-go model. This makes it especially attractive for large scale, elastic novel applications. Social networks have exploded in the last few years with diverse and novel large scale needs, connecting diverse communities and demanding ever increasing resources, thus exploiting the cloud to the maximum degree. Mobility has been an important and significant aspect of many computing applications for a while now. However, the advent of the cloud and social network applications raises many important challenges when confronted with mobility due to its highly dynamic nature. The potential for cross fertilization among these three areas of important research will drive much of the research in academia, and the products in industry. However, each of these areas also has its own particular demands and needs, which are often at odds with each other. Hence, there is typically much tension between the needs of the applications, i.e., social network and mobility needs, and those of the underlying cloud infrastructure. This often causes controversy when discussing these seemingly diverse topics together. This panel will bring together a set of renowned researchers who will explore and discuss the synergy and tensions among critical and often intertwined research and application issues that arise in the context of social networks and mobility in a cloud infrastructure setting.
Demonstration Program Details
Demonstration Session 1: MapReduce, Big Data Systems, and Crowdsourcing
Demonstration Session 2: Query Pricing, Processing, and Optimization
Demonstration Session 3: Information Retrieval, Web, and Mobility
PhD Workshop Program Details
PhD Workshop Session 1: Data Semantics and Data Mining
» Session Chair: Ioana Manolescu, INRIA
PhD Workshop Session 2: Database Systems
» Session Chair: Murat Kantarcioglu, University of Texas at Dallas
- Invited Talk: Opportunities for Big Data Research
Ling Liu, Georgia Tech
- Parallel Genetic Algorithms for the Optimization of Multi-Way Chain Join Queries of Distributed Databases
Tansel Dokeroglu
- A Toolkit for the Efficient Processing of Big Data on Large Clusters
Vinayak Borkar
- Knowlog: Knowledge + Datalog for Distributed Systems
Matteo Interlandi
- Partial Cube Materialization in a Dynamic Data Warehouse
Usman Ahmed
PhD Workshop Session 3: Query Processing
» Session Chair: Jeffrey Xu Yu, Chinese University of Hong Kong
Abstracts
Research Session 1: Spatial Queries
Session Chair: Chen Li
- On The Spatiotemporal Burstiness of Terms
Theodoros Lappas (University of California, Riverside, USA)
Marcos R. Vieira (University of California, Riverside, USA)
Dimitrios Gunopulos (University of Athens, Greece)
Thousands of documents are made available to the users via the web on a daily basis. One of the most extensively studied problems in the context of such document streams is burst identification. Given a term t, a burst is generally exhibited when an unusually high frequency is observed for t. While spatial and temporal burstiness have been studied individually in the past, our work is the first to simultaneously track and measure spatiotemporal term burstiness. In addition, we use the mined burstiness information toward an efficient document-search engine: given a user’s query of terms, our engine returns a ranked list of documents discussing influential events with a strong spatiotemporal impact. We demonstrate the efficiency of our methods with an extensive experimental evaluation on real and synthetic datasets.
- Spatial Queries with Two kNN Predicates
Ahmed M. Aly (Purdue University, USA)
Walid G. Aref (Purdue University, USA)
Mourad Ouzzani (Qatar Computing Research Institute, Qatar)
The widespread use of location-aware devices has led to countless location-based services in which a user query can be arbitrarily complex, i.e., one that embeds multiple spatial selection and join predicates. Amongst these predicates, the k-Nearest-Neighbor (kNN) predicate stands as one of the most important and widely used predicates. Unlike related research, this paper goes beyond the optimization of queries with single kNN predicates, and shows how queries with two kNN predicates can be optimized. In particular, the paper addresses the optimization of queries with: (i) two kNN-select predicates, (ii) two kNN-join predicates, and (iii) one kNN-join predicate and one kNN-select predicate. For each type of queries, conceptually correct query evaluation plans (QEPs) and new algorithms that optimize the query execution time are presented. Experimental results demonstrate that the proposed algorithms outperform the conceptually correct QEPs by orders of magnitude.
- A Scalable Algorithm for Maximizing Range Sum in Spatial Databases
Dong-Wan Choi (Korea Advanced Institute of Science and Technology, Korea)
Chin-Wan Chung (Korea Advanced Institute of Science and Technology, Korea)
Yufei Tao (Chinese University of Hong Kong, Hong Kong)
This paper investigates the MaxRS problem in spatial databases. Given a set O of weighted points and a rectangular region r of a given size, the goal of the MaxRS problem is to find a location of r such that the sum of the weights of all the points covered by r is maximized. This problem is useful in many location-based applications such as finding the best place for a new franchise store with a limited delivery range and finding the most attractive place for a tourist with a limited reachable range. However, the problem has been studied mainly in theory, particularly, in computational geometry. The existing algorithms from the computational geometry community are in-memory algorithms which do not guarantee the scalability. In this paper, we propose a scalable external-memory algorithm (ExactMaxRS) for the MaxRS problem, which is optimal in terms of the I/O complexity. Furthermore, we propose an approximation algorithm (ApproxMaxCRS) for the MaxCRS problem that is a circle version of the MaxRS problem. We prove the correctness and optimality of the ExactMaxRS algorithm along with the approximation bound of the ApproxMaxCRS algorithm. From extensive experimental results, we show that the ExactMaxRS algorithm is two orders of magnitude faster than methods adapted from existing algorithms, and the approximation bound in practice is much better than the theoretical bound of the ApproxMaxCRS algorithm.
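To make the problem statement concrete, here is a naive in-memory baseline for MaxRS, relying on the standard observation that some optimal placement has its left edge on a point's x-coordinate and its bottom edge on a point's y-coordinate, so quadratically many candidate placements suffice. This is not the paper's external-memory ExactMaxRS algorithm; the points and rectangle size are illustrative.

```python
# Naive in-memory MaxRS baseline: try every candidate placement whose left
# edge and bottom edge are aligned with point coordinates.
def max_rs_naive(points, width, height):
    """points: list of (x, y, weight). Returns (best_weight, (x_left, y_bottom))."""
    best = (0.0, None)
    for (x_left, _, _) in points:
        for (_, y_bottom, _) in points:
            covered = sum(w for (x, y, w) in points
                          if x_left <= x <= x_left + width
                          and y_bottom <= y <= y_bottom + height)
            best = max(best, (covered, (x_left, y_bottom)), key=lambda t: t[0])
    return best

if __name__ == "__main__":
    pts = [(1, 1, 2.0), (2, 1.5, 1.0), (5, 5, 3.0), (5.5, 5.2, 1.0)]
    print(max_rs_naive(pts, width=1.5, height=1.5))   # (4.0, (5, 5))
```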
Research Session 2: MapReduce I
- PerfXplain: Debugging MapReduce Job Performance
Nodira Khoussainova (University of Washington, USA)
Magdalena Balazinska (University of Washington, USA)
Dan Suciu (University of Washington, USA)
While users today have access to many tools that assist in performing large scale data analysis tasks, understanding the performance characteristics of their parallel computations, such as MapReduce jobs, remains difficult. We present PerfXplain, a system that enables users to ask questions about the relative performances (i.e., runtimes) of pairs of MapReduce jobs. PerfXplain provides a new query language for articulating performance queries and an algorithm for generating explanations from a log of past MapReduce job executions. We formally define the notion of an explanation together with three metrics, relevance, precision, and generality, that measure explanation quality. We present the explanation-generation algorithm based on techniques related to decision-tree building. We evaluate the approach on a log of past executions on Amazon EC2, and show that our approach can generate quality explanations, outperforming two naive explanation-generation methods.
- Efficient Multi-way Theta-Join Processing Using MapReduce
Xiaofei Zhang (Hong Kong University of Science and Technology, Hong Kong)
Lei Chen (Hong Kong University of Science and Technology, Hong Kong)
Min Wang (HP Labs, China)
Multi-way Theta-join queries are powerful in describing complex relations and therefore widely employed in practice. However, existing solutions from traditional distributed and parallel databases for multi-way Theta-join queries cannot be easily extended to fit a shared-nothing distributed computing paradigm, which is proven to be able to support OLAP applications over immense data volumes. In this work, we study the problem of efficient processing of multi-way Theta-join queries using MapReduce from a cost-effective perspective. Although there have been some works using the (key,value) pair-based programming model to support join operations, efficient processing of multi-way Theta-join queries has never been fully explored. The substantial challenge lies in, given a number of processing units (that can run Map or Reduce tasks), mapping a multi-way Theta-join query to a number of MapReduce jobs and having them executed in a well scheduled sequence, such that the total processing time span is minimized. Our solution mainly includes two parts: 1) cost metrics for both a single MapReduce job and a number of MapReduce jobs executed in a certain order; 2) the efficient execution of a chain-typed Theta-join with only one MapReduce job. Compared with the query evaluation strategy proposed in [23] and the widely adopted Pig Latin and Hive SQL solutions, our method achieves significant improvement of the join processing efficiency.
- Early Accurate Results for Advanced Analytics on MapReduce
Nikolay Laptev (University of California, Los Angeles, USA)
Kai Zeng (University of California, Los Angeles, USA)
Carlo Zaniolo (University of California, Los Angeles, USA)
Approximate results based on samples often provide the only way in which advanced analytical applications on very massive data sets can satisfy their time and resource constraints. Unfortunately, methods and tools for the computation of accurate early results are currently not supported in MapReduce-oriented systems although these are intended for `big data'. Therefore, we proposed and implemented a non-parametric extension of Hadoop which allows the incremental computation of early results for arbitrary work-flows, along with reliable on-line estimates of the degree of accuracy achieved so far in the computation. These estimates are based on a technique called bootstrapping that has been widely employed in statistics and can be applied to arbitrary functions and data distributions. In this paper, we describe our Early Accurate Result Library (EARL) for Hadoop that was designed to minimize the changes required to the MapReduce framework. Various tests of EARL of Hadoop are presented to characterize the frequent situations where EARL can provide major speed-ups over the current version of Hadoop.
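The bootstrap idea the paper builds on can be illustrated in a few lines: estimate a statistic from an early sample and use resampling to gauge how far off that estimate is likely to be. The sketch below is stdlib-only Python and is not the EARL/Hadoop implementation; the population and sample sizes are illustrative.

```python
# A toy illustration of bootstrap-based accuracy estimation for an early result.
import random
import statistics

def bootstrap_error(sample, statistic=statistics.mean, resamples=1000, rng=None):
    """Return (estimate, standard_error) for the statistic on the sample."""
    rng = rng or random.Random(0)
    estimate = statistic(sample)
    boot = [statistic([rng.choice(sample) for _ in sample]) for _ in range(resamples)]
    return estimate, statistics.stdev(boot)

if __name__ == "__main__":
    gen = random.Random(1)
    population = [gen.gauss(100, 15) for _ in range(100_000)]
    sample = population[:500]                 # an "early" result from a 0.5% sample
    est, err = bootstrap_error(sample)
    print(f"early estimate: {est:.2f} +/- {err:.2f}")
```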
Research Session 3: Caching and Prefetching
- Flash-based Extended Cache for Higher Throughput and Faster Recovery
Woon-Hak Kang (Sungkyunkwan University, Korea)
Sang-Won Lee (Sungkyunkwan University, Korea)
Bongki Moon (University of Arizona, USA)
Considering the current price gap between disk and flash memory drives, for applications dealing with large scale data, it will be economically more sensible to use flash memory drives to supplement disk drives rather than to replace them. This paper presents FaCE, which is a new low-overhead caching strategy that uses flash memory as an extension to the DRAM buffer. FaCE aims at improving the transaction throughput as well as shortening the recovery time from a system failure. To achieve the goals, we propose two novel algorithms for flash cache management, namely, Multi-Version FIFO replacement and Group Second Chance. One striking result from FaCE is that using a small flash memory drive as a caching device could deliver even higher throughput than using a large flash memory drive to store the entire database tables. This was possible due to flash write optimization as well as disk access reduction obtained by the FaCE caching methods. In addition, FaCE takes advantage of the non-volatility of flash memory to fully support database recovery by extending the scope of a persistent database to include the data pages stored in the flash cache. We have implemented FaCE in the PostgreSQL open source database server and demonstrated its effectiveness for TPC-C benchmarks.
- Don't Thrash: How to Cache Your Hash on Flash
Michael A. Bender (Stony Brook University, USA)
Martin Farach-Colton (Rutgers University, USA)
Rob Johnson (Stony Brook University, USA)
Russell Kraner (VCORE Solutions, LLC, USA)
Bradley C. Kuszmaul (Massachusetts Institute of Technology, USA)
Dzejla Medjedovic (Stony Brook University, USA)
Pablo Montes (Stony Brook University, USA)
Pradeep Shetty (Stony Brook University, USA)
Richard P. Spillane (Stony Brook University, USA)
Erez Zadok (Stony Brook University, USA)
This paper presents new alternatives to the well-known Bloom filter data structure. The Bloom filter, a compact data structure supporting set insertion and membership queries, has found wide application in databases, storage systems, and networks. Because the Bloom filter performs frequent random reads and writes, it is used almost exclusively in RAM, limiting the size of the sets it can represent. This paper first describes the quotient filter, which supports the basic operations of the Bloom filter, achieving roughly comparable performance in terms of space and time, but with better data locality. Operations on the quotient filter require only a small number of contiguous accesses. The quotient filter has other advantages over the Bloom filter: it supports deletions, it can be dynamically resized, and two quotient filters can be efficiently merged. The paper then gives two data structures, the buffered quotient filter and the cascade filter, which exploit the quotient filter advantages and thus serve as SSD-optimized alternatives to the Bloom filter. The cascade filter has better asymptotic I/O performance than the buffered quotient filter, but the buffered quotient filter outperforms the cascade filter on small to medium data sets. Both data structures significantly outperform recently-proposed SSD-optimized Bloom filter variants, such as the elevator Bloom filter, buffered Bloom filter, and forest-structured Bloom filter. In experiments, the cascade filter and buffered quotient filter performed insertions 8.6-11 times faster than the fastest Bloom filter variant and performed lookups 0.94-2.56 times faster.
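For orientation, the sketch below is a minimal standard Bloom filter, the baseline structure that the quotient filter and cascade filter are designed to improve on; the quotient filter itself stores fingerprint remainders in a compact table and is more involved. The parameters here are illustrative.

```python
# A minimal standard Bloom filter (the baseline, not the quotient filter).
import hashlib

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return false positives, never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

if __name__ == "__main__":
    bf = BloomFilter()
    for word in ["flash", "cache", "hash"]:
        bf.add(word)
    print("cache" in bf, "thrash" in bf)   # True, almost certainly False
```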
- SCOUT: Prefetching for Latent Feature Following Queries
Farhan Tauheed (Ecole Polytechnique Fédérale de Lausanne, Switzerland)
Thomas Heinis (Ecole Polytechnique Fédérale de Lausanne, Switzerland)
Felix Schürmann (Ecole Polytechnique Fédérale de Lausanne, Switzerland)
Henry Markram (Ecole Polytechnique Fédérale de Lausanne, Switzerland)
Anastasia Ailamaki (Ecole Polytechnique Fédérale de Lausanne, Switzerland)
Today's scientists are quickly moving from in vitro to in silico experimentation: they no longer analyze natural phenomena in a petri dish, but instead they build models and simulate them. Managing and analyzing the massive amounts of data involved in simulations is a major task. Yet, they lack the tools to efficiently work with data of this size. One problem many scientists share is the analysis of the massive spatial models they build. For several types of analysis they need to interactively follow the structures in the spatial model, e.g., the arterial tree, neuron fibers, etc., and issue range queries along the way. Each query takes long to execute, and the total time for executing a sequence of queries significantly delays data analysis. Prefetching the spatial data reduces the response time considerably, but known approaches do not prefetch with high accuracy. We develop SCOUT, a structure-aware method for prefetching data along interactive spatial query sequences. SCOUT uses an approximate graph model of the structures involved in past queries and attempts to identify what particular structure the user follows. Our experiments with neuroscience data show that SCOUT prefetches with an accuracy from 71% to 92%, which translates to a speedup of 4x-15x. SCOUT also improves the prefetching accuracy on datasets from other scientific domains, such as medicine and biology.
Research Session 4: Automation
- A Statistical Approach Towards Robust Progress Estimation
Arnd Christian König (Microsoft Research, USA)
Bolin Ding (University of Illinois at Urbana-Champaign, USA)
Surajit Chaudhuri (Microsoft Research, USA)
Vivek Narasayya (Microsoft Research, USA)
The need for accurate SQL progress estimation in the context of decision support administration has led to a number of techniques proposed for this task. Unfortunately, no single one of these progress estimators behaves robustly across the variety of SQL queries encountered in practice, meaning that each technique performs poorly for a significant fraction of queries. This paper proposes a novel estimator selection framework that uses a statistical model to characterize the sets of conditions under which certain estimators outperform others, leading to a significant increase in estimation robustness. The generality of this framework also enables us to add a number of novel "special purpose" estimators which increase accuracy further. Most importantly, the resulting model generalizes well to queries very different from the ones used to train it. We validate our findings using a large number of industrial real-life and benchmark workloads.
- Semi-Automatic Index Tuning: Keeping DBAs in the Loop
Karl Schnaitter (Teradata Aster, USA)
Neoklis Polyzotis (University of California, Santa Cruz, USA)
To obtain a high level of system performance, a database administrator (DBA) must choose a set of indices that is appropriate for the workload. The system can aid in this challenging task by providing recommendations for the index configuration. We propose a new index recommendation technique, termed semi-automatic tuning, that keeps the DBA "in the loop" by generating recommendations that use feedback about the DBA's preferences. The technique also works online, which avoids the limitations of commercial tools that require the workload to be known in advance. The foundation of our approach is the Work Function Algorithm, which can solve a wide variety of online optimization problems with strong competitive guarantees. We present an experimental analysis that validates the benefits of semi-automatic tuning in a wide variety of conditions.
- Robust Estimation of Resource Consumption for SQL Queries using Statistical Techniques
Jiexing Li (University of Wisconsin - Madison, USA)
Arnd Christian König (Microsoft Research, USA)
Vivek Narasayya (Microsoft Research, USA)
Surajit Chaudhuri (Microsoft Research, USA)
The ability to estimate resource consumption of SQL queries is crucial for a number of tasks in a database system such as admission control, query scheduling and costing during query optimization. Recent work has explored the use of statistical techniques for resource estimation in place of the manually constructed cost models used in query optimization. Such techniques, which require as training data examples of resource usage in queries, offer the promise of superior estimation accuracy since they can account for factors such as hardware characteristics of the system or bias in cardinality estimates. However, the proposed approaches lack robustness in that they do not generalize well to queries that are different from the training examples, resulting in significant estimation errors. Our approach aims to address this problem by combining knowledge of database query processing with statistical models. We model resource-usage at the level of individual operators, with different models and features for each operator type, and explicitly model the asymptotic behavior of each operator. This results in significantly better estimation accuracy and the ability to estimate resource usage of arbitrary plans, even when they are very different from the training instances. We validate our approach using various large scale real-life and benchmark workloads on Microsoft SQL Server.
Research Session 5: Web and IR I
- Answering Table Queries on the Web using Column Keywords
Rakesh Pimplikar (IBM Research, India)
Sunita Sarawagi (Indian Institute of Technology Bombay, India)
We present the design of a structured search engine which returns a multi-column table in response to a query consisting of keywords describing each of its columns. We answer such queries by exploiting the millions of tables on the Web because these are much richer sources of structured knowledge than free-format text. However, a corpus of tables harvested from arbitrary HTML web pages presents huge challenges of diversity and redundancy not seen in centrally edited knowledge bases. We concentrate on one concrete task in this paper. Given a set of Web tables T1, ..., Tn, and a query Q with q sets of keywords Q1, ..., Qq, decide for each Ti if it is relevant to Q and if so, identify the mapping between the columns of Ti and query columns. We represent this task as a graphical model that jointly maps all tables by incorporating diverse sources of clues spanning matches in different parts of the table, corpus-wide co-occurrence statistics, and content overlap across table columns. We define a novel query segmentation model for matching keywords to table columns, and a robust mechanism of exploiting content overlap across table columns. We design efficient inference algorithms based on bipartite matching and constrained graph cuts to solve the joint labeling task. Experiments on a workload of 59 queries over a 25 million web table corpus show a significant boost in accuracy over baseline IR methods.
- Efficient Verification of Web-Content Searching Through Authenticated Web Crawlers
Michael T. Goodrich (University of California, Irvine, USA)
Duy Nguyen (Brown University, USA)
Olga Ohrimenko (Brown University, USA)
Charalampos Papamanthou (University of California, Berkeley, USA)
Roberto Tamassia (Brown University, USA)
Nikos Triandopoulos (RSA Laboratories & Boston University, USA)
Cristina Videira Lopes (University of California, Irvine, USA)
We consider the problem of verifying the correctness and completeness of the result of a keyword search. We introduce the concept of an authenticated web crawler and present its design and prototype implementation. An authenticated web crawler is a trusted program that computes a specially-crafted signature over the web contents it visits. This signature enables (i) the verification of common Internet queries on web pages, such as conjunctive keyword searches - this guarantees that the output of a conjunctive keyword search is correct and complete; (ii) the verification of the content returned by such Internet queries - this guarantees that web data is authentic and has not been maliciously altered since the computation of the signature by the crawler. In our solution, the search engine returns a cryptographic proof of the query result. Both the proof size and the verification time are proportional only to the sizes of the query description and the query result, but do not depend on the number or sizes of the web pages over which the search is performed. As we experimentally demonstrate, the prototype implementation of our system provides a low communication overhead between the search engine and the user, and fast verification of the returned results by the user.
- Querying Schemas With Access Restrictions
Michael Benedikt (Oxford University, UK)
Pierre Bourhis (Oxford University, UK)
Clemens Ley (Ecole Polytechnique Fédérale de Lausanne, Switzerland)
We study verification of systems whose transitions consist of accesses to a Web-based data-source. An access is a lookup on a relation within a relational database, fixing values for a set of positions in the relation. For example, a transition can represent access to a Web form, where the user is restricted to filling in values for a particular set of fields. We look at verifying properties of a schema describing the possible accesses of such a system. We present a language where one can describe the properties of an access path, and also specify additional restrictions on accesses that are enforced by the schema. Our main property language, AccLTL, is based on a first-order extension of linear-time temporal logic, interpreting access paths as sequences of relational structures. We also present a lower-level automaton model, A-automata, which AccLTL specifications can compile into. We show that AccLTL and A-automata can express static analysis problems related to "querying with limited access patterns" that have been studied in the database literature in the past, such as whether an access is relevant to answering a query, and whether two queries are equivalent in the accessible data they can return. We prove decidability and complexity results for several restrictions and variants of AccLTL, and explain which properties of paths can be expressed in each restriction.
- Relation Strength-Aware Clustering of Heterogeneous Information Networks with Incomplete Attributes
Yizhou Sun (University of Illinois at Urbana-Champaign, USA)
Charu C. Aggarwal (IBM T J Watson Research Center, USA)
Jiawei Han (University of Illinois at Urbana-Champaign, USA)
With the rapid development of online social media, online shopping sites and cyber-physical systems, heterogeneous information networks have become increasingly popular and content-rich over time. In many cases, such networks contain multiple types of objects and links, as well as different kinds of attributes. The clustering of these objects can provide useful insights in many applications. However, the clustering of such networks can be challenging since (a) the attribute values of objects are often incomplete, which implies that an object may carry only partial attributes or even no attributes to correctly label itself; and (b) the links of different types may carry different kinds of semantic meanings, and it is a difficult task to determine the nature of their relative importance in helping the clustering for a given purpose. In this paper, we address these challenges by proposing a model-based clustering algorithm. We design a probabilistic model which clusters the objects of different types into a common hidden space, by using a user-specified set of attributes, as well as the links from different relations. The strengths of different types of links are automatically learned, and are determined by the given purpose of clustering. An iterative algorithm is designed for solving the clustering problem, in which the strengths of different types of links and the quality of clustering results mutually enhance each other. Our experimental results on real and synthetic data sets demonstrate the effectiveness and efficiency of the algorithm.
Research Session 6: Dense Graphs Discovery
- Truss Decomposition in Massive Networks
Jia Wang (The Chinese University of Hong Kong, Hong Kong)
James Cheng (Nanyang Technological University, Singapore)
The k-truss is a type of cohesive subgraphs proposed recently for the study of networks. While the problem of computing most cohesive subgraphs is NP-hard, there exists a polynomial time algorithm for computing k-truss. Compared with k-core which is also efficient to compute, k-truss represents the "core" of a k-core that keeps the key information of, while filtering out less important information from, the k-core. However, existing algorithms for computing k-truss are inefficient for handling today's massive networks. We first improve the existing in-memory algorithm for computing k-truss in networks of moderate size. Then, we propose two I/O-efficient algorithms to handle massive networks that cannot fit in main memory. Our experiments on real datasets verify the efficiency of our algorithms and the value of k-truss.
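As a concrete reference for the definition, the sketch below computes a k-truss in memory by repeatedly peeling edges whose support (the number of triangles they participate in) falls below k-2. This is the simple baseline idea only, not the paper's improved in-memory or I/O-efficient algorithms, and the example graph is made up.

```python
# Simple in-memory k-truss by iterative edge peeling on triangle support.
def k_truss(edges, k):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        for u, v in [(u, v) for u in adj for v in adj[u] if u < v]:
            support = len(adj[u] & adj[v])       # common neighbors = triangles
            if support < k - 2:
                adj[u].discard(v)
                adj[v].discard(u)
                changed = True
    return sorted((u, v) for u in adj for v in adj[u] if u < v)

if __name__ == "__main__":
    # A 4-clique {0,1,2,3} plus a dangling edge (3,4).
    edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
    print(k_truss(edges, k=4))   # only the 4-clique edges survive
```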
- Densest Subgraph in Streaming and MapReduce
Bahman Bahmani (Stanford University, USA)
Ravi Kumar (Yahoo! Research, USA)
Sergei Vassilvitskii (Yahoo! Research, USA)
The problem of finding locally dense components of a graph is an important primitive in data analysis, with wide-ranging applications from community mining to spam detection and the discovery of biological network modules. In this paper we present new algorithms for finding the densest subgraph in the streaming model. For any epsilon>0, our algorithms make O((log n)/log (1+epsilon)) passes over the input and find a subgraph whose density is guaranteed to be within a factor 2(1+epsilon) of the optimum. Our algorithms are also easily parallelizable and we illustrate this by realizing them in the MapReduce model. In addition we perform extensive experimental evaluation on massive real-world graphs showing the performance and scalability of our algorithms in practice.
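For context, the classic in-memory greedy peeling algorithm (a 2-approximation for the densest subgraph) is sketched below; it is the baseline that streaming and MapReduce variants approximate with few passes, not the paper's own algorithm, and the example graph is made up.

```python
# Greedy peeling 2-approximation for densest subgraph (density = |E(S)| / |S|).
def densest_subgraph_greedy(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = set(adj)
    m = len(edges)
    best_density, best_nodes = 0.0, set(nodes)
    while nodes:
        density = m / len(nodes)
        if density > best_density:
            best_density, best_nodes = density, set(nodes)
        u = min(nodes, key=lambda x: len(adj[x]))   # peel a minimum-degree vertex
        m -= len(adj[u])
        for v in adj[u]:
            adj[v].discard(u)
        nodes.discard(u)
        del adj[u]
    return best_density, best_nodes

if __name__ == "__main__":
    # A 4-clique {0,1,2,3} with a path 3-4-5 attached.
    edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
    print(densest_subgraph_greedy(edges))   # (1.5, {0, 1, 2, 3})
```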
- Dense Subgraph Maintenance under Streaming Edge Weight Updates for Real-time Story Identification
Albert Angel (University of Toronto, Canada)
Nick Koudas (University of Toronto, Canada)
Nikos Sarkas (University of Toronto, Canada)
Divesh Srivastava (AT&T Labs - Research, USA)
Recent years have witnessed an unprecedented proliferation of social media. People around the globe author, every day, millions of blog posts, social network status updates, etc. This rich stream of information can be used to identify, on an ongoing basis, emerging stories, and events that capture popular attention. Stories can be identified via groups of tightly-coupled real-world entities, namely the people, locations, products, etc., that are involved in the story. The sheer scale, and rapid evolution of the data involved necessitate highly efficient techniques for identifying important stories at every point of time. The main challenge in real-time story identification is the maintenance of dense subgraphs (corresponding to groups of tightly-coupled entities) under streaming edge weight updates (resulting from a stream of user-generated content). This is the first work to study the efficient maintenance of dense subgraphs under such streaming edge weight updates. For a wide range of definitions of density, we derive theoretical results regarding the magnitude of change that a single edge weight update can cause. Based on these, we propose a novel algorithm, DYNDENS, which outperforms adaptations of existing techniques to this setting, and yields meaningful results. Our approach is validated by a thorough experimental evaluation on large-scale real and synthetic datasets.
- Real Time Discovery of Dense Clusters in Highly Dynamic Graphs: Identifying Real World Events in Highly Dynamic Environments
Manoj K Agarwal (IBM Research, India)
Krithi Ramamritham (Indian Institute of Technology Bombay, India)
Manish Bhide (IBM Research, India)
Due to their real time nature, microblog streams are a rich source of dynamic information, for example, about emerging events. Existing techniques for discovering such events from a microblog stream in real time (such as Twitter trending topics), have several lacunae when used for discovering emerging events; extant graph based event detection techniques are not practical in microblog settings due to their complexity; and conventional techniques, which have been developed for blogs, web-pages, etc., involving the use of keyword search, are only useful for finding information about known events. Hence, in this paper, we present techniques to discover events that are unraveling in microblog message streams in real time so that such events can be reported as soon as they occur. We model the problem as discovering dense clusters in highly dynamic graphs. Despite many recent advances in graph analysis, ours is the first technique to identify dense clusters in massive and highly dynamic graphs in real time. Given the characteristics of microblog streams, in order to find clusters without missing any events, we propose and exploit a novel graph property which we call short-cycle property. Our algorithms find these clusters efficiently in spite of rapid changes to the microblog streams. Further we present a novel ranking function to identify the important events. Besides proving the correctness of our algorithms we show their practical utility by evaluating them using real world microblog data. These demonstrate our technique’s ability to discover, with high precision and recall, emerging events in high intensity data streams in real time. Many recent web applications create data which can be represented as massive dynamic graphs. Our technique can be easily extended to discover, in real time, interesting patterns in such graphs.
Research Session 7: Query Processing I
- Fast Updates on Read-Optimized Databases Using Multi-Core CPUs
Jens Krueger (Hasso-Plattner-Institut, Germany)
Changkyu Kim (Intel Labs, USA)
Martin Grund (Hasso-Plattner-Institute, Germany)
Nadathur Satish (Intel Labs, USA)
David Schwalb (Hasso-Plattner-Institute, Germany)
Jatin Chhugani (Intel Labs, USA)
Hasso Plattner (Hasso-Plattner-Institute, Germany)
Pradeep Dubey (Intel Labs, USA)
Alexander Zeier (Hasso-Plattner-Institut, Germany)
Read-optimized columnar databases use differential updates to handle writes by maintaining a separate write-optimized delta partition which is periodically merged with the read-optimized and compressed main partition. This merge process introduces significant overheads and unacceptable downtimes in update intensive systems, aspiring to combine transactional and analytical workloads into one system. In the first part of the paper, we report data analyses of 12 SAP Business Suite customer systems. In the second half, we present an optimized merge process reducing the merge overhead of current systems by a factor of 30. Our linear-time merge algorithm exploits the underlying high compute and bandwidth resources of modern multi-core CPUs with architecture-aware optimizations and efficient parallelization. This enables compressed in-memory column stores to handle the transactional update rate required by enterprise applications, while keeping properties of read-optimized databases for analytic-style queries.
- SharedDB: Killing One Thousand Queries With One Stone
Georgios Giannikis (ETH Zürich, Switzerland)
Gustavo Alonso (ETH Zürich, Switzerland)
Donald Kossmann (ETH Zürich, Switzerland)
Traditional database systems are built around the query-at-a-time model. This approach tries to optimize performance in a best-effort way. Unfortunately, best effort is not good enough for many modern applications. These applications require response time guarantees in high load situations. This paper describes the design of a new database architecture that is based on batching queries and shared computation across possibly hundreds of concurrent queries and updates. Performance experiments with the TPC-W benchmark show that the performance of our implementation, SharedDB, is indeed robust across a wide range of dynamic workloads.
- Compacting Transactional Data in Hybrid OLTP & OLAP Databases
Florian Funke (Technische Universität München, Germany)
Alfons Kemper (Technische Universität München, Germany)
Thomas Neumann (Technische Universität München, Germany)
Growing main memory sizes have facilitated database management systems that keep the entire database in main memory. The drastic performance improvements that came along with these in-memory systems have made it possible to reunite the two areas of online transaction processing (OLTP) and online analytical processing (OLAP): An emerging class of hybrid OLTP and OLAP database systems allows to process analytical queries directly on the transactional data. By offering arbitrarily current snapshots of the transactional data for OLAP, these systems enable real-time business intelligence. Despite memory sizes of several Terabytes in a single commodity server, RAM is still a precious resource: Since free memory can be used for intermediate results in query processing, the amount of memory determines query performance to a large extent. Consequently, we propose the compaction of memory-resident databases. Compaction consists of two tasks: First, separating the mutable working set from the immutable "frozen" data. Second, compressing the immutable data and optimizing it for efficient, memory-consumption-friendly snapshotting. Our approach reorganizes and compresses transactional data online and yet hardly affects the mission-critical OLTP throughput. This is achieved by unburdening the OLTP threads from all additional processing and performing these tasks asynchronously.
- DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views
Yanif Ahmad (Johns Hopkins University, USA)
Oliver Kennedy (Ecole Polytechnique Fédérale de Lausanne, Switzerland)
Christoph Koch (Ecole Polytechnique Fédérale de Lausanne, Switzerland)
Milos Nikolic (Ecole Polytechnique Fédérale de Lausanne, Switzerland)
Applications ranging from algorithmic trading to scientific data analysis require realtime analytics based on views over databases that change at very high rates. Such views have to be kept fresh at low maintenance cost and latencies. At the same time, these views have to support classical SQL, rather than window semantics, to enable applications that combine current with aged or historical data. In this paper, we present viewlet transforms, a recursive finite differencing technique applied to queries. The viewlet transform materializes a query and a set of its higher-order deltas as views. These views support each other's incremental maintenance, leading to a reduced overall view maintenance cost. The viewlet transform of a query admits efficient evaluation, the elimination of certain expensive query operations, and aggressive parallelization. We develop viewlet transforms into a workable query execution technique, present a heuristic and cost-based optimization framework, and report on experiments with a prototype dynamic data management system that combines viewlet transforms with an optimizing compilation technique. The system supports tens of thousands of complete view refreshes a second for a wide range of queries.
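The underlying delta-maintenance idea can be illustrated with a first-order example: maintain SUM(r.a * s.b) over a join incrementally, using per-key partial sums as auxiliary materialized views so each insert is O(1). The sketch below is a toy illustration of that general idea, not DBToaster's viewlet transform; the schema and names are made up.

```python
# Toy delta-based incremental maintenance of an aggregate over a join.
from collections import defaultdict

class JoinSumView:
    def __init__(self):
        self.total = 0.0                       # the materialized view
        self.r_sum = defaultdict(float)        # SUM(a) per key in R
        self.s_sum = defaultdict(float)        # SUM(b) per key in S

    def insert_r(self, key, a):
        # delta of the view = a * SUM(b) for matching S tuples already present
        self.total += a * self.s_sum[key]
        self.r_sum[key] += a

    def insert_s(self, key, b):
        self.total += b * self.r_sum[key]
        self.s_sum[key] += b

if __name__ == "__main__":
    view = JoinSumView()
    view.insert_r("k1", 2.0)
    view.insert_s("k1", 10.0)   # contributes 2*10
    view.insert_r("k1", 3.0)    # contributes 3*10
    view.insert_s("k2", 5.0)    # no matching R tuples yet
    print(view.total)           # 50.0
```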
Research Session 8: Crowd Sourcing
- Human-powered Sorts and Joins
Adam Marcus (Massachusetts Institute of Technology, USA)
Eugene Wu (Massachusetts Institute of Technology, USA)
David Karger (Massachusetts Institute of Technology, USA)
Samuel Madden (Massachusetts Institute of Technology, USA)
Robert Miller (Massachusetts Institute of Technology, USA)
Crowdsourcing markets like Amazon's Mechanical Turk (MTurk) make it possible to task people with small jobs, such as labeling images or looking up phone numbers, via a programmatic interface. MTurk tasks for processing datasets with humans are currently designed with significant reimplementation of common workflows and ad-hoc selection of parameters such as price to pay per task. We describe how we have integrated crowds into a declarative workflow engine called Qurk to reduce the burden on workflow designers. In this paper, we focus on how to use humans to compare items for sorting and joining data, two of the most common operations in DBMSs. We describe our basic query interface and the user interface of the tasks we post to MTurk. We also propose a number of optimizations, including task batching, replacing pairwise comparisons with numerical ratings, and pre-filtering tables before joining them, which dramatically reduce the overall cost of running sorts and joins on the crowd. In an experiment joining two sets of images, we reduce the overall cost from $67 in a naive implementation to about $3, without substantially affecting accuracy or latency. In an end-to-end experiment, we reduced cost by a factor of 14.5.
- CrowdER: Crowdsourcing Entity Resolution
Jiannan Wang (Tsinghua University, China)
Tim Kraska (University of California, Berkeley, USA)
Michael J. Franklin (University of California, Berkeley, USA)
Jianhua Feng (Tsinghua University, China)
Entity resolution is central to data integration and data cleaning. Algorithmic approaches have been improving in quality, but remain far from perfect. Crowdsourcing platforms offer a more accurate but expensive (and slow) way to bring human insight into the process. Previous work has proposed batching verification tasks for presentation to human workers but even with batching, a human-only approach is infeasible for data sets of even moderate size, due to the large numbers of matches to be tested. Instead, we propose a hybrid human-machine approach in which machines are used to do an initial, coarse pass over all the data, and people are used to verify only the most likely matching pairs. We show that for such a hybrid system, generating the minimum number of verification tasks of a given size is NP-Hard, but we develop a novel two-tiered heuristic approach for creating batched tasks. We describe this method, and present the results of extensive experiments on real data sets using a popular crowdsourcing platform. The experiments show that our hybrid approach achieves both good efficiency and high accuracy compared to machine-only or human-only alternatives.
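The hybrid workflow can be pictured roughly as follows: a cheap machine pass prunes the quadratic pair space with a similarity threshold, and only the surviving candidate pairs are grouped into fixed-size tasks for human verification. The sketch below uses simple Jaccard similarity and plain chunking rather than the paper's two-tiered batching heuristic; the records and threshold are illustrative.

```python
# Rough sketch of a hybrid human-machine entity resolution pipeline.
from itertools import combinations

def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def machine_pass(records, threshold=0.4):
    """Keep only pairs similar enough to be worth asking a human about."""
    return [(i, j) for i, j in combinations(range(len(records)), 2)
            if jaccard(records[i], records[j]) >= threshold]

def batch_for_crowd(pairs, task_size=2):
    """Group candidate pairs into fixed-size human verification tasks."""
    return [pairs[i:i + task_size] for i in range(0, len(pairs), task_size)]

if __name__ == "__main__":
    records = ["iPhone 4 16GB white", "Apple iPhone 4 white 16GB",
               "Samsung Galaxy S II", "Galaxy S2 Samsung", "USB cable 2m"]
    candidates = machine_pass(records)
    print(candidates)                    # far fewer than the 10 possible pairs
    print(batch_for_crowd(candidates))
```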
- CDAS: A Crowdsourcing Data Analytics System
Xuan Liu (National University of Singapore, Singapore)
Meiyu Lu (National University of Singapore, Singapore)
Beng Chin Ooi (National University of Singapore, Singapore)
Yanyan Shen (National University of Singapore, Singapore)
Sai Wu (Zhejiang University, China)
Meihui Zhang (National University of Singapore, Singapore)
Some complex problems, such as image tagging and natural language processing, are very challenging for computers, where even state-of-the-art technology is not yet able to provide satisfactory accuracy. Therefore, rather than relying solely on developing new and better algorithms to handle such tasks, we look to the crowdsourcing solution -- employing human participation -- to make good the shortfall in current technology. Crowdsourcing is a good supplement to many computer tasks. A complex job may be divided into computer-oriented tasks and human-oriented tasks, which are then assigned to machines and humans respectively. To leverage the power of crowdsourcing, we design and implement a Crowdsourcing Data Analytics System, CDAS. CDAS is a framework designed to support the deployment of various crowdsourcing applications. The core part of CDAS is a quality-sensitive answering model, which guides the crowdsourcing engine to process and monitor the human tasks. In this paper, we introduce the principles of our quality-sensitive model. To satisfy user required accuracy, the model guides the crowdsourcing query engine for the design and processing of the corresponding crowdsourcing jobs. It provides an estimated accuracy for each generated result based on the human workers' historical performances. When verifying the quality of the result, the model employs an online strategy to reduce waiting time. To show the effectiveness of the model, we implement and deploy two analytics jobs on CDAS, a Twitter sentiment analytics job and an image tagging job. We use real Twitter and Flickr data as our queries respectively. We compare our approaches with state-of-the-art classification and image annotation techniques. The results show that the human-assisted methods can indeed achieve a much higher accuracy. By embedding the quality-sensitive model into the crowdsourcing query engine, we effectively reduce the processing cost while maintaining the required query answer quality.
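In the same spirit as a quality-sensitive answering model, the sketch below aggregates worker votes weighted by the log-odds of each worker's historical accuracy and reports a confidence for the combined answer; this is only a back-of-the-envelope illustration, not CDAS's actual estimator.

```python
# Accuracy-weighted aggregation of crowd answers with a simple confidence.
import math

def weighted_vote(votes):
    """votes: list of (label, worker_accuracy). Returns (label, confidence)."""
    score = 0.0
    for label, acc in votes:
        acc = min(max(acc, 0.01), 0.99)                 # avoid infinite weights
        weight = math.log(acc / (1 - acc))              # log-odds of being right
        score += weight if label == "positive" else -weight
    confidence = 1.0 / (1.0 + math.exp(-abs(score)))    # logistic of |evidence|
    return ("positive" if score >= 0 else "negative"), confidence

if __name__ == "__main__":
    # Two mediocre workers say positive, one strong worker says negative.
    print(weighted_vote([("positive", 0.6), ("positive", 0.6), ("negative", 0.9)]))
```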
- Pushing the Boundaries of Crowd-enabled Databases with Query-driven Schema Expansion
Joachim Selke (Technische Universität Braunschweig, Germany)
Christoph Lofi (Technische Universität Braunschweig, Germany)
Wolf-Tilo Balke (Technische Universität Braunschweig, Germany)
By incorporating human workers into the query execution process, crowd-enabled databases facilitate intelligent, social capabilities like completing missing data at query time or performing cognitive operators. But despite all their flexibility, crowd-enabled databases still maintain rigid schemas. In this paper, we extend crowd-enabled databases with flexible query-driven schema expansion, allowing the addition of new attributes to the database at query time. However, the number of crowdsourced mini-tasks to fill in missing values may often be prohibitively large, and the resulting data quality is doubtful. Instead of simple crowdsourcing to obtain all values individually, we leverage the user-generated data found in the Social Web: by exploiting user ratings we build perceptual spaces, i.e., highly compressed representations of opinions, impressions, and perceptions of large numbers of users. Using few training samples obtained by expert crowdsourcing, we can then extract all missing data automatically from the perceptual space with high quality and at low cost. Extensive experiments show that our approach can boost both performance and quality of crowd-enabled databases, while also providing the flexibility to expand schemas in a query-driven fashion.
Research Session 9: Cloud Databases
- Distributed GraphLab: A Framework for Machine Learning in the Cloud
Yucheng Low (Carnegie Mellon University, USA)
Joseph Gonzalez (Carnegie Mellon University, USA)
Aapo Kyrola (Carnegie Mellon University, USA)
Danny Bickson (Carnegie Mellon University, USA)
Carlos Guestrin (Carnegie Mellon University, USA)
Joseph M. Hellerstein (University of California, Berkeley, USA)
While high-level data parallel frameworks, like MapReduce, simplify the design and implementation of large-scale data processing systems, they do not naturally or efficiently support many important data mining and machine learning algorithms and can lead to inefficient learning systems. To help fill this critical void, we introduced the GraphLab abstraction which naturally expresses asynchronous, dynamic, graph-parallel computation while ensuring data consistency and achieving a high degree of parallel performance in the shared-memory setting. In this paper, we extend the GraphLab framework to the substantially more challenging distributed setting while preserving strong data consistency guarantees. We develop graph based extensions to pipelined locking and data versioning to reduce network congestion and mitigate the effect of network latency. We also introduce fault tolerance to the GraphLab abstraction using the classic Chandy-Lamport snapshot algorithm and demonstrate how it can be easily implemented by exploiting the GraphLab abstraction itself. Finally, we evaluate our distributed implementation of the GraphLab abstraction on a large Amazon EC2 deployment and show 1-2 orders of magnitude performance gains over Hadoop-based implementations.
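To convey the flavor of a graph-parallel vertex program (a sequential, shared-memory toy, not the distributed GraphLab system), the sketch below runs a PageRank-style update with a dynamic scheduler that only revisits the neighbors of vertices whose value changed noticeably; the damping factor and tolerance are illustrative.

    # Toy "vertex program" in the spirit of a graph-parallel abstraction
    # (sequential; the real system is asynchronous, distributed, and consistent).
    from collections import deque

    def pagerank_vertex_program(out_edges, damping=0.85, tol=1e-4):
        # out_edges: dict vertex -> list of out-neighbors; every vertex must be a key
        in_edges = {v: [] for v in out_edges}
        for u, nbrs in out_edges.items():
            for v in nbrs:
                in_edges[v].append(u)
        rank = {v: 1.0 for v in out_edges}
        queue, queued = deque(out_edges), set(out_edges)
        while queue:
            v = queue.popleft(); queued.discard(v)
            new = (1 - damping) + damping * sum(
                rank[u] / len(out_edges[u]) for u in in_edges[v])
            if abs(new - rank[v]) > tol:        # reschedule neighbors only on change
                rank[v] = new
                for w in out_edges[v]:
                    if w not in queued:
                        queue.append(w); queued.add(w)
            else:
                rank[v] = new
        return rank

    print(pagerank_vertex_program({"a": ["b"], "b": ["c"], "c": ["a", "b"]}))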
- Spinning Fast Iterative Data Flows
Stephan Ewen (Technische Universität Berlin, Germany)
Kostas Tzoumas (Technische Universität Berlin, Germany)
Moritz Kaufmann (Technische Universität Berlin, Germany)
Volker Markl (Technische Universität Berlin, Germany)
Parallel dataflow systems are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk iterative algorithms are supported by novel dataflow frameworks, these systems cannot exploit computational dependencies present in many algorithms, such as graph algorithms. As a result, these algorithms are inefficiently executed, which has led to specialized systems based on other paradigms, such as message passing or shared memory. We propose a method to integrate incremental iterations, a form of workset iterations, with parallel dataflows. After showing how to integrate bulk iterations into a dataflow system and its optimizer, we present an extension to the programming model for incremental iterations. The extension compensates for the lack of mutable state in dataflows and allows for exploiting the sparse computational dependencies inherent in many iterative algorithms. The evaluation of a prototypical implementation shows that, when exploited, those aspects lead to up to two orders of magnitude speedup in algorithm runtime. In our experiments, the improved dataflow system is highly competitive with specialized systems while maintaining a transparent and unified dataflow abstraction.
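A rough illustration of the difference between bulk and incremental (workset) iterations, outside any dataflow engine: the bulk variant of connected components revisits every vertex each round, while the workset variant touches only vertices whose label just changed.

    # Bulk vs. workset iteration on connected components (illustrative only; the
    # paper integrates these iteration forms into a parallel dataflow optimizer).
    def bulk_cc(neighbors):
        comp = {v: v for v in neighbors}
        changed = True
        while changed:
            changed = False
            for v in neighbors:                    # full pass every round
                best = min([comp[v]] + [comp[u] for u in neighbors[v]])
                if best < comp[v]:
                    comp[v] = best; changed = True
        return comp

    def workset_cc(neighbors):
        comp = {v: v for v in neighbors}
        workset = set(neighbors)
        while workset:                             # only re-examine changed vertices
            v = workset.pop()
            for u in neighbors[v]:
                if comp[v] < comp[u]:
                    comp[u] = comp[v]; workset.add(u)
                elif comp[u] < comp[v]:
                    comp[v] = comp[u]; workset.add(v)
        return comp

    g = {1: [2], 2: [1, 3], 3: [2], 4: []}
    assert bulk_cc(g) == workset_cc(g)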
- PIQL: Success-Tolerant Query Processing in the Cloud
Michael Armbrust (University of California, Berkeley, USA)
Kristal Curtis (University of California, Berkeley, USA)
Tim Kraska (University of California, Berkeley, USA)
Armando Fox (University of California, Berkeley, USA)
Michael J. Franklin (University of California, Berkeley, USA)
David A. Patterson (University of California, Berkeley, USA)
Newly-released web applications often succumb to a "Success Disaster," where overloaded database machines and resulting high response times destroy a previously good user experience. Unfortunately, the data independence provided by a traditional relational database system, while useful for agile development, only exacerbates the problem by hiding potentially expensive queries under simple declarative expressions. As a result, developers of these applications are increasingly abandoning relational databases in favor of imperative code written against distributed key/value stores, losing the many benefits of data independence in the process. Instead, we propose PIQL, a declarative language that also provides scale independence by calculating an upper bound on the number of key/value store operations that will be performed for any query. Coupled with a service level objective (SLO) compliance prediction model and PIQL's scalable database architecture, these bounds make it easy for developers to write success-tolerant applications that support an arbitrarily large number of users while still providing acceptable performance. In this paper, we present the PIQL query processing system and evaluate its scale independence on hundreds of machines using two benchmarks, TPC-W and SCADr.
- How to Price Shared Optimizations in the Cloud
Prasang Upadhyaya (University of Washington, USA)
Magdalena Balazinska (University of Washington, USA)
Dan Suciu (University of Washington, USA)
Data-management-as-a-service systems are increasingly being used in collaborative settings, where multiple users access common datasets. Cloud providers have the choice to implement various optimizations, such as indexing or materialized views, to accelerate queries over these datasets. Each optimization carries a cost and may benefit multiple users. This creates a major challenge: how to select which optimizations to perform and how to share their cost among users. The problem is especially challenging when users are selfish and will only report their true values for different optimizations if doing so maximizes their utility. In this paper, we present a new approach for selecting and pricing shared optimizations by using Mechanism Design. We first show how to apply the Shapley Value Mechanism to the simple case of selecting and pricing additive optimizations, assuming an offline game where all users access the service for the same time-period. Second, we extend the approach to online scenarios where users come and go. Finally, we consider the case of substitutive optimizations. We show analytically that our mechanisms induce truthfulness and recover the optimization costs. We also show experimentally that our mechanisms yield higher utility than the state-of-the-art approach based on regret accumulation.
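For readers unfamiliar with Shapley-value cost sharing, the sketch below computes exact Shapley shares for a handful of users and a hypothetical cost function for one shared optimization; the paper's online and substitutive-optimization mechanisms go well beyond this.

    # Exact Shapley cost shares for a small number of users (exponential in the
    # number of users, so purely illustrative).
    from itertools import combinations
    from math import factorial

    def shapley_shares(users, cost):
        n = len(users)
        shares = {u: 0.0 for u in users}
        for u in users:
            others = [v for v in users if v != u]
            for k in range(len(others) + 1):
                for subset in combinations(others, k):
                    weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                    shares[u] += weight * (cost(set(subset) | {u}) - cost(set(subset)))
        return shares

    # Hypothetical cost model: building an index costs 90 if anyone wants it.
    index_cost = lambda coalition: 90.0 if coalition else 0.0
    print(shapley_shares(["alice", "bob", "carol"], index_cost))  # 30.0 each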
Research Session 10: Graphs, Statistics and Summaries
- Size-l Object Summaries for Relational Keyword Search
Georgios J. Fakas (Manchester Metropolitan University, UK)
Zhi Cai (Manchester Metropolitan University, UK)
Nikos Mamoulis (The University of Hong Kong, Hong Kong)
A previously proposed keyword search paradigm produces, as a query result, a ranked list of Object Summaries (OSs). An OS is a tree structure of related tuples that summarizes all data held in a relational database about a particular Data Subject (DS). However, some of these OSs are very large in size and therefore unfriendly to users that initially prefer synoptic information before proceeding to more comprehensive information about a particular DS. In this paper, we investigate the effective and efficient retrieval of concise and informative OSs. We argue that a good size-l OS should be a stand-alone and meaningful synopsis of the most important information about the particular DS. More precisely, we define a size-l OS as a partial OS composed of l important tuples. We propose three algorithms for the efficient generation of size-l OSs (in addition to the optimal approach which requires exponential time). Experimental evaluation on DBLP and TPC-H databases verifies the effectiveness and efficiency of our approach.
- Mining Attribute-structure Correlated Patterns in Large Attributed Graphs
Arlei Silva (Universidade Federal de Minas Gerais, Brazil)
Wagner Meira Jr. (Universidade Federal de Minas Gerais, Brazil)
Mohammed J. Zaki (Rensselaer Polytechnic Institute, USA)
In this work, we study the correlation between attribute sets and the occurrence of dense subgraphs in large attributed graphs, a task we call structural correlation pattern mining. A structural correlation pattern is a dense subgraph induced by a particular attribute set. Existing methods are not able to extract relevant knowledge regarding how vertex attributes interact with dense subgraphs. Structural correlation pattern mining combines aspects of frequent itemset and quasi-clique mining problems. We propose statistical significance measures that compare the structural correlation of attribute sets against their expected values using null models. Moreover, we evaluate the interestingness of structural correlation patterns in terms of size and density. An efficient algorithm that combines search and pruning strategies in the identification of the most relevant structural correlation patterns is presented. We apply our method for the analysis of three real-world attributed graphs: a collaboration, a music, and a citation network, verifying that it provides valuable knowledge in a feasible time.
- Injecting Uncertainty in Graphs for Identity Obfuscation
Paolo Boldi (University of Milano, Italy)
Francesco Bonchi (Yahoo! Research, Spain)
Aris Gionis (Yahoo! Research, Spain)
Tamir Tassa (The Open University, Israel)
Data collected nowadays by social-networking applications create fascinating opportunities for building novel services, as well as expanding our understanding about social structures and their dynamics. Unfortunately, publishing social-network graphs is considered an ill-advised practice due to privacy concerns. To alleviate this problem, several anonymization methods have been proposed, aiming at reducing the risk of a privacy breach on the published data, while still allowing the data to be analyzed and relevant conclusions to be drawn. In this paper we introduce a new anonymization approach that is based on injecting uncertainty in social graphs and publishing the resulting uncertain graphs. While existing approaches obfuscate graph data by adding or removing edges entirely, we propose using a finer-grained perturbation that adds or removes edges partially: this way we can achieve the same desired level of obfuscation with smaller changes in the data, thus maintaining higher utility. Our experiments on real-world networks confirm that at the same level of identity obfuscation our method provides higher usefulness than existing randomized methods that publish standard graphs.
- Measuring Two-Event Structural Correlations on Graphs
Ziyu Guan (University of California, Santa Barbara, USA)
Xifeng Yan (University of California, Santa Barbara, USA)
Lance M. Kaplan (U.S. Army Research Laboratory, USA)
Real-life graphs usually have various kinds of events happening on them, e.g., product purchases in online social networks and intrusion alerts in computer networks. The occurrences of events on the same graph could be correlated, exhibiting either attraction or repulsion. Such structural correlations can reveal important relationships between different events. Unfortunately, correlation relationships on graph structures are not well studied and cannot be captured by traditional measures. In this work, we design a novel measure for assessing two-event structural correlations on graphs. Given the occurrences of two events, we choose uniformly a sample of "reference nodes" from the vicinity of all event nodes and employ the Kendall's tau rank correlation measure to compute the average concordance of event density changes. Significance can be efficiently assessed by tau's nice property of being asymptotically normal under the null hypothesis. In order to compute the measure in large scale networks, we develop a scalable framework using different sampling strategies. The complexity of these strategies is analyzed. Experiments on real graph datasets with both synthetic and real events demonstrate that the proposed framework is not only efficacious, but also efficient and scalable.
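A loose simplification of the measure (assuming networkx and scipy are available, and skipping the paper's sampling framework): compute the densities of two events in the neighborhoods of a set of reference nodes and measure their rank concordance with Kendall's tau.

    # Simplified two-event structural correlation: compare the densities of two
    # events around reference nodes and measure concordance with Kendall's tau.
    import networkx as nx
    from scipy.stats import kendalltau

    def neighborhood_density(graph, node, event_nodes, hops=1):
        ball = nx.single_source_shortest_path_length(graph, node, cutoff=hops)
        return sum(1 for v in ball if v in event_nodes) / len(ball)

    def two_event_correlation(graph, event_a, event_b, reference_nodes, hops=1):
        da = [neighborhood_density(graph, r, event_a, hops) for r in reference_nodes]
        db = [neighborhood_density(graph, r, event_b, hops) for r in reference_nodes]
        tau, p_value = kendalltau(da, db)
        return tau, p_value

    G = nx.erdos_renyi_graph(200, 0.05, seed=1)
    a = set(range(0, 40))            # toy event occurrences (e.g., purchases)
    b = set(range(20, 60))           # toy event occurrences (e.g., alerts)
    refs = list(range(0, 200, 10))   # stand-in for sampled reference nodes
    print(two_event_correlation(G, a, b, refs))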
Research Session 11: Concurrency
- On Predictive Modeling for Optimizing Transaction Execution in Parallel OLTP Systems
Andrew Pavlo (Brown University, USA)
Evan P.C. Jones (Massachusetts Institute of Technology, USA)
Stanley Zdonik (Brown University, USA)
A new emerging class of parallel database management systems (DBMS) is designed to take advantage of the partitionable workloads of on-line transaction processing (OLTP) applications. Transactions in these systems are optimized to execute to completion on a single node in a shared-nothing cluster without needing to coordinate with other nodes or use expensive concurrency control measures. But some OLTP applications cannot be partitioned such that all of their transactions execute within a single partition in this manner. These distributed transactions access data not stored within their local partitions and subsequently require more heavy-weight concurrency control protocols. Further difficulties arise when the transaction's execution properties, such as the number of partitions it may need to access or whether it will abort, are not known beforehand. The DBMS could mitigate these performance issues if it is provided with additional information about transactions. Thus, in this paper we present a Markov model-based approach for automatically selecting which optimizations a DBMS could use, namely (1) more efficient concurrency control schemes, (2) intelligent scheduling, (3) reduced undo logging, and (4) speculative execution. To evaluate our techniques, we implemented our models and integrated them into a parallel, main-memory OLTP DBMS to show that we can improve the performance of applications with diverse workloads.
- High-Performance Concurrency Control Mechanisms for Main-Memory Databases
Per-Åke Larson (Microsoft Research, USA)
Spyros Blanas (University of Wisconsin-Madison, USA)
Cristian Diaconu (Microsoft Research, USA)
Craig Freedman (Microsoft Research, USA)
Jignesh M. Patel (University of Wisconsin - Madison, USA)
Mike Zwilling (Microsoft Research, USA)
A database system optimized for in-memory storage can support much higher transaction rates than current systems. However, standard concurrency control methods used today do not scale to the high transaction rates achievable by such systems. In this paper we introduce two efficient concurrency control methods specifically designed for main-memory databases. Both use multiversioning to isolate read-only transactions from updates but differ in how atomicity is ensured: one is optimistic and one is pessimistic. To avoid expensive context switching, transactions never block during normal processing, but they may have to wait before commit to ensure correct serialization ordering. We also implemented a main-memory optimized version of single-version locking. Experimental results show that while single-version locking works well when transactions are short and contention is low, performance degrades under more demanding conditions. The multiversion schemes have higher overhead but are much less sensitive to hotspots and the presence of long-running transactions.
- Concurrency Control for Adaptive Indexing
Goetz Graefe (HP Labs, USA)
Felix Halim (National University of Singapore, Singapore)
Stratos Idreos (Centrum Wiskunde & Informatica, Netherlands)
Harumi Kuno (HP Labs, USA)
Stefan Manegold (Centrum Wiskunde & Informatica, Netherlands)
Adaptive indexing initializes and optimizes indexes incrementally, as a side effect of query processing. The goal is to achieve the benefits of indexes while hiding or minimizing the costs of index creation. However, index-optimizing side effects seem to turn read-only queries into update transactions that might, for example, create lock contention. This paper studies concurrency control in the context of adaptive indexing. We show that the design and implementation of adaptive indexing rigorously separates index structures from index contents; this relaxes the constraints and requirements during adaptive indexing compared to those of traditional index updates. Our design adapts to the fact that an adaptive index is refined continuously, and exploits any concurrency opportunities in a dynamic way. A detailed experimental analysis demonstrates that (a) adaptive indexing maintains its adaptive properties even when running concurrent queries, (b) adaptive indexing can exploit the opportunity for parallelism due to concurrent queries, (c) the number of concurrency conflicts and any concurrency administration overheads follow an adaptive behavior, decreasing as the workload evolves and adapting to the workload needs.
- LogBase: A Scalable Log-structured Database System in the Cloud
Hoang Tam Vo (National University of Singapore, Singapore)
Sheng Wang (National University of Singapore, Singapore)
Divyakant Agrawal (University of California, Santa Barbara, USA)
Gang Chen (Zhejiang University, China)
Beng Chin Ooi (National University of Singapore, Singapore)
Numerous applications such as financial transactions (e.g., stock trading) are write-heavy in nature. The shift from reads to writes in web applications has also been accelerating in recent years. Write-ahead-logging is a common approach for providing recovery capability while improving performance in most storage systems. However, the separation of log and application data incurs write overheads observed in write-heavy environments and hence adversely affects the write throughput and recovery time in the system. In this paper, we introduce LogBase – a scalable log-structured database system that adopts log-only storage for removing the write bottleneck and supporting fast system recovery. LogBase is designed to be dynamically deployed on commodity clusters to take advantage of elastic scaling property of cloud environments. LogBase provides in-memory multiversion indexes for supporting efficient access to data maintained in the log. LogBase also supports transactions that bundle read and write operations spanning across multiple records. We implemented the proposed system and compared it with HBase and a disk-based log-structured record-oriented system modeled after RAMCloud. The experimental results show that LogBase is able to provide sustained write throughput, efficient data access out of the cache, and effective system recovery.
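The core idea of using the log itself as the data repository, with an in-memory index mapping keys to log offsets, can be sketched in a few lines. This is a single-node, single-version toy; LogBase itself is distributed, multiversioned, and transactional.

    # Toy log-only key-value store: writes are appended to a log file and an
    # in-memory index maps each key to its latest log offset.
    import json, os

    class TinyLogStore:
        def __init__(self, path):
            self.path, self.index = path, {}
            if os.path.exists(path):             # recovery = rebuild index from the log
                with open(path, "rb") as f:
                    while True:
                        offset = f.tell()
                        line = f.readline()
                        if not line:
                            break
                        self.index[json.loads(line)["key"]] = offset

        def put(self, key, value):
            record = (json.dumps({"key": key, "value": value}) + "\n").encode()
            with open(self.path, "ab") as f:
                offset = f.tell()                # append position = new record offset
                f.write(record)
            self.index[key] = offset

        def get(self, key):
            with open(self.path, "rb") as f:
                f.seek(self.index[key])
                return json.loads(f.readline())["value"]

    store = TinyLogStore("/tmp/tinylog.jsonl")
    store.put("user:1", {"name": "Ada"})
    print(store.get("user:1"))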
Research Session 12: Spatio-Temporal Queries
- Efficient Reachability Query Evaluation in Large Spatiotemporal Contact Datasets
Houtan Shirani-Mehr (University of Southern California, USA)
Farnoush Banaei Kashani (University of Southern California, USA)
Cyrus Shahabi (University of Southern California, USA)
With the advent of reliable positioning technologies and prevalence of location-based services, it is now feasible to accurately study the propagation of items such as infectious viruses, sensitive information pieces, and malwares through a population of moving objects, e.g., individuals, mobile devices, and vehicles. In such application scenarios, an item passes between two objects when the objects are sufficiently close (i.e., when they are, so-called, in contact), and hence once an item is initiated, it can penetrate the object population through the evolving network of contacts among objects, termed contact network. In this paper, for the first time we define and study reachability queries in large (i.e., disk-resident) contact datasets which record the movement of a (potentially large) set of objects moving in a spatial environment over an extended time period. A reachability query verifies whether two objects are "reachable" through the evolving contact network represented by such contact datasets. We propose two contact-dataset indexes that enable efficient evaluation of such queries despite the potentially humongous size of the contact datasets. With the first index, termed ReachGrid, at the query time only a small necessary portion of the contact network which is required for reachability evaluation is constructed and traversed. With the second approach, termed ReachGraph, we precompute reachability at different scales and leverage these precalculations at the query time for efficient query processing. We optimize the placement of both indexes on disk to enable efficient index traversal during query processing. We study the pros and cons of our proposed approaches by performing extensive experiments with both real and synthetic data. Based on our experimental results, our proposed approaches outperform existing reachability query processing techniques in contact networks by 76% on average.
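A reachability query over a contact dataset amounts to a time-respecting search. The in-memory baseline below conveys the semantics; ReachGrid and ReachGraph exist precisely because real contact datasets are too large for such a scan.

    # Time-respecting reachability over a contact dataset held in memory
    # (baseline only; the paper's indexes avoid scanning the full dataset).
    from collections import defaultdict

    def reachable(contacts, source, target, t_start, t_end):
        """contacts: list of (time, object_a, object_b); can `source` reach `target`
        through a chain of contacts with non-decreasing times in [t_start, t_end]?"""
        earliest = defaultdict(lambda: float("inf"))
        earliest[source] = t_start
        for t, a, b in sorted(contacts):         # process contacts in time order
            if t > t_end:
                break
            if earliest[a] <= t:
                earliest[b] = min(earliest[b], t)
            if earliest[b] <= t:
                earliest[a] = min(earliest[a], t)
        return earliest[target] <= t_end

    contacts = [(1, "alice", "bob"), (3, "bob", "carol"), (2, "carol", "dave")]
    print(reachable(contacts, "alice", "carol", 0, 5))   # True (via bob at t=3)
    print(reachable(contacts, "alice", "dave", 0, 5))    # False (that contact was at t=2)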
- Boosting Moving Object Indexing through Velocity Partitioning
Thi Nguyen (La Trobe University, Australia)
Zhen He (La Trobe University, Australia)
Rui Zhang (The University of Melbourne, Australia)
Phillip Ward (La Trobe University, Australia)
There have been intense research interests in moving object indexing in the past decade. However, existing work did not exploit the important property of skewed velocity distributions. In many real world scenarios, objects travel predominantly along only a few directions. Examples include vehicles on road networks, flights, people walking on the streets, etc. The search space for a query is heavily dependent on the velocity distribution of the objects grouped in the nodes of an index tree. Motivated by this observation, we propose the velocity partitioning (VP) technique, which exploits the skew in velocity distribution to speed up query processing using moving object indexes. The VP technique first identifies the "dominant velocity axes (DVAs)" using a combination of principal components analysis (PCA) and k-means clustering. Then, a moving object index (e.g., a TPR-tree) is created based on each DVA, using the DVA as an axis of the underlying coordinate system. An object is maintained in the index whose DVA is closest to the object's current moving direction. Thus, all the objects in an index are moving in a near 1-dimensional space instead of a 2-dimensional space. As a result, the expansion of the search space with time is greatly reduced, from a quadratic function of the maximum speed (of the objects in the search range) to a near linear function of the maximum speed. The VP technique can be applied to a wide range of moving object index structures. We have implemented the VP technique on two representative ones, the TPR*-tree and the Bx-tree. Extensive experiments validate that the VP technique consistently improves the performance of those index structures.
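A rough sketch of the partitioning step only (PCA to find dominant velocity axes and a nearest-axis assignment; the paper additionally uses k-means and builds a TPR*-tree or Bx-tree per axis, which is omitted here). The traffic distributions below are synthetic.

    # A loose sketch of identifying dominant velocity axes and assigning objects
    # (PCA only; the per-axis moving-object indexes are not built here).
    import numpy as np

    def dominant_velocity_axes(velocities):
        """Principal axes of the 2-d velocity distribution, strongest first."""
        v = np.asarray(velocities, dtype=float)
        _, vectors = np.linalg.eigh(np.cov(v.T))
        return vectors.T[::-1]                   # rows = axes, by decreasing variance

    def assign_to_axes(velocities, axes):
        """Index of the axis most aligned with each object's velocity."""
        v = np.asarray(velocities, dtype=float)
        cosines = np.abs(v @ axes.T) / (np.linalg.norm(v, axis=1, keepdims=True)
                                        * np.linalg.norm(axes, axis=1) + 1e-12)
        return cosines.argmax(axis=1)

    rng = np.random.default_rng(0)
    vels = np.vstack([rng.normal([30, 0], [5, 1], (100, 2)),   # east-west traffic
                      rng.normal([0, 25], [1, 5], (100, 2))])  # north-south traffic
    axes = dominant_velocity_axes(vels)
    print(np.bincount(assign_to_axes(vels, axes)))             # objects per axis/index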
- Multiple Location Profiling for Users and Relationships from Social Network and Content
Rui Li (University of Illinois at Urbana-Champaign, USA)
Shengjie Wang (University of Illinois at Urbana-Champaign, USA)
Kevin Chen-Chuan Chang (University of Illinois at Urbana-Champaign, USA)
Users' locations are important for many applications such as personalized search and localized content delivery. In this paper, we study the problem of profiling Twitter users' locations with their following network and tweets. We propose a multiple location profiling model (MLP), which has three key features: 1) it formally models how likely a user follows another user given their locations and how likely a user tweets a venue given his location, 2) it fundamentally captures that a user has multiple locations and his following relationships and tweeted venues can be related to any of his locations, and some of them are even noisy, and 3) it makes novel use of the home locations of some users as partial supervision. As a result, MLP not only discovers users' locations accurately and completely, but also "explains" each following relationship by revealing users' true locations in the relationship. Experiments on a large-scale data set demonstrate those advantages. Particularly, 1) for predicting users' home locations, MLP successfully places 62% of users and outperforms two state-of-the-art methods by 10% in accuracy, 2) for discovering users' multiple locations, MLP improves the baseline methods by 14% in recall, and 3) for explaining following relationships, MLP achieves 57% accuracy.
Research Session 13: MapReduce II
- Building Wavelet Histograms on Large Data in MapReduce
Jeffrey Jestes (University of Utah, USA)
Ke Yi (Hong Kong University of Science and Technology, Hong Kong)
Feifei Li (University of Utah, USA)
MapReduce is becoming the de facto framework for storing and processing massive data, due to its excellent scalability, reliability, and elasticity. In many MapReduce applications, obtaining a compact accurate summary of data is essential. Among various data summarization tools, histograms have proven to be particularly important and useful for summarizing data, and the wavelet histogram is one of the most widely used histograms. In this paper, we investigate the problem of building wavelet histograms efficiently on large datasets in MapReduce. We measure the efficiency of the algorithms by both end-to-end running time and communication cost. We demonstrate that straightforward adaptations of existing exact and approximate methods for building wavelet histograms to MapReduce clusters are highly inefficient. To that end, we design new algorithms for computing exact and approximate wavelet histograms and discuss their implementation in MapReduce. We illustrate our techniques in Hadoop, and compare to baseline solutions with extensive experiments performed in a heterogeneous Hadoop cluster of 16 nodes, using large real and synthetic datasets, up to hundreds of gigabytes. The results suggest significant (often orders of magnitude) performance improvement achieved by our new algorithms.
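For readers unfamiliar with wavelet histograms, the sketch below computes the Haar wavelet coefficients of a frequency vector centrally and keeps the b largest ones; the paper's contribution is computing such synopses efficiently inside MapReduce, which is not shown.

    # Centralized Haar wavelet synopsis of a frequency vector (illustrative only;
    # the paper builds this over data distributed across a MapReduce cluster).
    def haar_coefficients(freqs):
        """Un-normalized Haar decomposition of a frequency vector (length a power of 2)."""
        coeffs, level = [], list(freqs)
        while len(level) > 1:
            averages = [(level[i] + level[i + 1]) / 2 for i in range(0, len(level), 2)]
            details = [(level[i] - level[i + 1]) / 2 for i in range(0, len(level), 2)]
            coeffs = details + coeffs
            level = averages
        return level + coeffs                   # overall average first, then details

    def top_b_synopsis(freqs, b):
        coeffs = haar_coefficients(freqs)
        keep = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)[:b]
        return {i: coeffs[i] for i in keep}     # retained coefficient positions and values

    data = [3, 5, 10, 8, 2, 2, 2, 2]            # value frequencies over a domain of 8
    print(top_b_synopsis(data, b=3))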
- ReStore: Reusing Results of MapReduce Jobs
Iman Elghandour (University of Waterloo, Canada)
Ashraf Aboulnaga (University of Waterloo, Canada)
Analyzing large scale data has emerged as an important activity for many organizations in the past few years. This large scale data analysis is facilitated by the MapReduce programming and execution model and its implementations, most notably Hadoop. Users of MapReduce often have analysis tasks that are too complex to express as individual MapReduce jobs. Instead, they use high-level query languages such as Pig, Hive, or Jaql to express their complex tasks. The compilers of these languages translate queries into workflows of MapReduce jobs. Each job in these workflows reads its input from the distributed file system used by the MapReduce system and produces output that is stored in this distributed file system and read as input by the next job in the workflow. The current practice is to delete these intermediate results from the distributed file system at the end of executing the workflow. One way to improve the performance of workflows of MapReduce jobs is to keep these intermediate results and reuse them for future workflows submitted to the system. In this paper, we present ReStore, a system that manages the storage and reuse of such intermediate results. ReStore can reuse the output of whole MapReduce jobs that are part of a workflow, and it can also create additional reuse opportunities by materializing and storing the output of query execution operators that are executed within a MapReduce job. We have implemented ReStore as an extension to the Pig dataflow system on top of Hadoop, and we experimentally demonstrate significant speedups on queries from the PigMix benchmark.
- Only Aggressive Elephants are Fast Elephants
Jens Dittrich (Universität des Saarlandes, Germany)
Jorge-Arnulfo Quiané-Ruiz (Universität des Saarlandes, Germany)
Stefan Richter (Universität des Saarlandes, Germany)
Stefan Schuh (Universität des Saarlandes, Germany)
Alekh Jindal (Universität des Saarlandes, Germany)
Jörg Schad (Universität des Saarlandes, Germany)
Yellow elephants are slow. A major reason is that they consume their inputs entirely before responding to an elephant rider's orders. Some clever riders have trained their yellow elephants to only consume parts of the inputs before responding. However, the teaching time to make an elephant do that is high. So high that the teaching lessons often do not pay off. We take a different approach. We make elephants aggressive; only this will make them very fast. We propose HAIL (Hadoop Aggressive Indexing Library), an enhancement of HDFS and Hadoop MapReduce that dramatically improves runtimes of several classes of MapReduce jobs. HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica. An interesting feature of HAIL is that we typically create a win-win situation: we improve both data upload to HDFS and the runtime of the actual Hadoop MapReduce job. In terms of data upload, HAIL improves over HDFS by up to 60% with the default replication factor of three. In terms of query execution, we demonstrate that HAIL runs up to 68x faster than Hadoop. In our experiments, we use six clusters including physical and EC2 clusters of up to 100 nodes. A series of scalability experiments also demonstrates the superiority of HAIL.
Research Session 14: Storage
- Towards Cost-Effective Storage Provisioning for DBMSs
Ning Zhang (University of Wisconsin - Madison, USA)
Junichi Tatemura (NEC Labs America, USA)
Jignesh M. Patel (University of Wisconsin - Madison, USA)
Hakan Hacıgümüş (NEC Labs America, USA)
Data center operators face a bewildering set of choices when considering how to provision resources on machines with complex I/O subsystems. Modern I/O subsystems often have a rich mix of fast, high performing, but expensive SSDs sitting alongside cheaper but relatively slower (for random accesses) traditional hard disk drives. The data center operators need to determine how to provision the I/O resources for specific workloads so as to abide by existing Service Level Agreements (SLAs), while minimizing the total operating cost (TOC) of running the workload, where the TOC includes the amortized hardware costs and the run time energy costs. The focus of this paper is on introducing this new problem of TOC-based storage allocation, cast in a framework that is compatible with traditional DBMS query optimization and query processing architecture. We also present a heuristic-based solution to this problem, called DOT. We have implemented DOT in PostgreSQL, and experiments using TPC-H and TPC-C demonstrate significant TOC reduction by DOT in various settings.
- hStorage-DB: Heterogeneity-aware Data Management to Exploit the Full Capability of Hybrid Storage Systems
Tian Luo (The Ohio State University, USA)
Rubao Lee (The Ohio State University, USA)
Michael Mesnier (Intel Labs, USA)
Feng Chen (Intel Labs, USA)
Xiaodong Zhang (The Ohio State University, USA)
As storage systems become increasingly heterogeneous and complex, the burden on DBAs grows, and performance often remains suboptimal even after substantial human effort. In addition, existing monitoring-based storage management based on access pattern detection has difficulty handling workloads that are highly dynamic and concurrent. To achieve high performance by best utilizing heterogeneous storage devices, we have designed and implemented a heterogeneity-aware software framework for DBMS storage management called hStorage-DB, where semantic information that is critical for storage I/O is identified and passed to the storage manager. According to the collected semantic information, requests are classified into different types. Each type is assigned a proper QoS policy supported by the underlying storage system, so that every request will be served with a suitable storage device. With hStorage-DB, we can well utilize semantic information that cannot be detected through data access monitoring but is particularly important for a hybrid storage system. To show the effectiveness of hStorage-DB, we have implemented a system prototype that consists of an I/O request classification enabled DBMS, and a hybrid storage system that is organized into a two-level caching hierarchy. Our performance evaluation shows that hStorage-DB can automatically make proper decisions for data allocation in different storage devices and make substantial performance improvements in a cost-efficient way.
- Definition, Detection, and Recovery of Single-Page Failures, a Fourth Class of Database Failures
Goetz Graefe (HP Labs, USA)
Harumi Kuno (HP Labs, USA)
The three traditional failure classes are system, media, and transaction failures. Sometimes, however, modern storage exhibits failures that differ from all of those. In order to capture and describe such cases, single-page failures are introduced as a fourth failure class. This class encompasses all failures to read a data page correctly and with plausible contents despite all correction attempts in lower system levels. Efficient recovery seems to require a new data structure called the page recovery index. Its transactional maintenance can be accomplished while writing the same number of log records as today's efficient implementations of logging and recovery. Detection and recovery of a single-page failure can be sufficiently fast that the affected data access is merely delayed, without the need to abort the transaction.
Research Session 15: Privacy I
- An Adaptive Mechanism for Accurate Query Answering under Differential Privacy
Chao Li (University of Massachusetts Amherst, USA)
Gerome Miklau (University of Massachusetts Amherst, USA)
We propose a novel mechanism for answering sets of counting queries under differential privacy. Given a workload of counting queries, the mechanism automatically selects a different set of "strategy" queries to answer privately, using those answers to derive answers to the workload. The main algorithm proposed in this paper approximates the optimal strategy for any workload of linear counting queries. With no cost to the privacy guarantee, the mechanism improves significantly on prior approaches and achieves near-optimal error for many workloads, when applied under (epsilon, delta)-differential privacy. The result is an adaptive mechanism which can help users achieve good utility without requiring that they reason carefully about the best formulation of their task.
- Privacy Preservation by Disassociation
Manolis Terrovitis (IMIS Research Center Athena, Greece)
John Liagouris (National Technical University of Athens, Greece)
Nikos Mamoulis (The University of Hong Kong, Hong Kong)
Spiros Skiadopoulos (University of Peloponnese, Greece)
In this work, we focus on protection against identity disclosure in the publication of sparse multidimensional data. Existing multidimensional anonymization techniques (a) protect the privacy of users either by altering the set of quasi-identifiers of the original data (e.g., by generalization or suppression) or by adding noise (e.g., using differential privacy), and/or (b) assume a clear distinction between sensitive and non-sensitive information and sever the possible linkage. In many real-world applications the above techniques are not applicable. For instance, consider web search query logs. Suppressing or generalizing anonymization methods would remove the most valuable information in the dataset: the original query terms. Additionally, web search query logs contain millions of query terms which cannot be categorized as sensitive or non-sensitive, since a term may be sensitive for one user and non-sensitive for another. Motivated by this observation, we propose an anonymization technique termed disassociation that preserves the original terms but hides the fact that two or more different terms appear in the same record. We protect the users' privacy by disassociating record terms that participate in identifying combinations. This way the adversary cannot associate with high probability a record with a rare combination of terms. To the best of our knowledge, our proposal is the first to employ such a technique to provide protection against identity disclosure. We propose an anonymization algorithm based on our approach and evaluate its performance on real and synthetic datasets, comparing it against other state-of-the-art methods based on generalization and differential privacy.
- Publishing Microdata with a Robust Privacy Guarantee
Jianneng Cao (National University of Singapore, Singapore)
Panagiotis Karras (Rutgers University, USA)
Today, the publication of microdata poses a privacy threat. Vast research has striven to define the privacy condition that microdata should satisfy before it is released, and devise algorithms to anonymize the data so as to achieve this condition. Yet, no method proposed to date explicitly bounds the percentage of information an adversary gains after seeing the published data for each sensitive value therein. This paper introduces beta-likeness, an appropriately robust privacy model for microdata anonymization, along with two anonymization schemes designed therefor, the one based on generalization, and the other based on perturbation. Our model postulates that an adversary's confidence on the likelihood of a certain sensitive-attribute (SA) value should not increase, in relative difference terms, by more than a predefined threshold. Our techniques aim to satisfy a given beta threshold with little information loss. We experimentally demonstrate that (i) our model provides an effective privacy guarantee in a way that predecessor models cannot, (ii) our generalization scheme is more effective and efficient in its task than methods adapting algorithms for the k-anonymity model, and (iii) our perturbation method outperforms a baseline approach. Moreover, we discuss in detail the resistance of our model and methods to attacks proposed in previous research.
Research Session 16: Analytics
- Optimization of Analytic Window Functions
Yu Cao (EMC Labs, China)
Chee-Yong Chan (National University of Singapore, Singapore)
Jie Li (Duke University, USA)
Kian-Lee Tan (National University of Singapore, Singapore)
Analytic functions represent the state-of-the-art way of performing complex data analysis within a single SQL statement. In particular, an important class of analytic functions that has been frequently used in commercial systems to support OLAP and decision support applications is the class of window functions. A window function returns for each input tuple a value derived from applying a function over a window of neighboring tuples. However, existing window function evaluation approaches are based on a naive sorting scheme. In this paper, we study the problem of optimizing the evaluation of window functions. We propose several efficient techniques, and identify optimization opportunities that allow us to optimize the evaluation of a set of window functions. We have integrated our scheme into PostgreSQL. Our comprehensive experimental study on the TPC-DS datasets as well as synthetic datasets and queries demonstrates significant speedups over existing approaches.
- Optimizing I/O for Big Array Analytics
Yi Zhang (Duke University, USA)
Jun Yang (Duke University, USA)
Big array analytics is becoming indispensable in answering important scientific and business questions. Most analysis tasks consist of multiple steps, each making one or multiple passes over the arrays to be analyzed and generating intermediate results. In the big data setting, I/O optimization is a key to efficient analytics. In this paper, we develop a framework and techniques for capturing a broad range of analysis tasks expressible in nested-loop forms, representing them in a declarative way, and optimizing their I/O by identifying sharing opportunities. Experiment results show that our optimizer is capable of finding execution plans that exploit nontrivial I/O sharing opportunities with significant savings.
- Diversifying Top-K Results
Lu Qin (The Chinese University of Hong Kong, Hong Kong)
Jeffrey Xu Yu (The Chinese University of Hong Kong, Hong Kong)
Lijun Chang (The Chinese University of Hong Kong, Hong Kong)
Top-k query processing finds a list of k results that have the largest scores w.r.t. the user-given query, with the assumption that all the k results are independent of each other. In practice, some of the top-k results returned can be very similar to each other. As a result, some of the top-k results returned are redundant. In the literature, diversified top-k search has been studied to return k results that take both score and diversity into consideration. Most existing solutions on diversified top-k search assume that scores of all the search results are given, and some works solve the diversity problem for a specific setting and can hardly be extended to general cases. In this paper, we study the diversified top-k search problem. We define a general diversified top-k search problem that only considers the similarity of the search results themselves. We propose a framework, such that most existing solutions for top-k query processing can be extended easily to handle diversified top-k search, by simply applying three new functions, a sufficient stop condition sufficient(), a necessary stop condition necessary(), and an algorithm for diversified top-k search on the current set of generated results, div-search-current(). We propose three new algorithms, namely, div-astar, div-dp, and div-cut to solve the div-search-current() problem. div-astar is an A*-based algorithm; div-dp decomposes the results into components which are searched using div-astar independently and combined using dynamic programming; div-cut further decomposes the current set of generated results using cut points and combines the results using sophisticated operations. We conducted extensive performance studies using two real datasets, enwiki and reuters. Our div-cut algorithm finds the optimal solution for the diversified top-k search problem in seconds even for k as large as 2,000.
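As a point of reference for what a diversified top-k answer looks like, here is a simple greedy baseline (not the paper's div-astar, div-dp, or div-cut algorithms): pick results in score order, skipping any result too similar to one already chosen. The similarity measure and threshold are placeholders.

    # Greedy diversified top-k baseline (illustrative; the paper's framework plugs
    # more sophisticated algorithms into generic stop conditions).
    def greedy_diversified_topk(results, k, similarity, threshold=0.8):
        """results: list of (item, score); keep high scorers that are mutually dissimilar."""
        chosen = []
        for item, score in sorted(results, key=lambda r: r[1], reverse=True):
            if all(similarity(item, other) < threshold for other, _ in chosen):
                chosen.append((item, score))
            if len(chosen) == k:
                break
        return chosen

    jaccard = lambda a, b: len(set(a) & set(b)) / len(set(a) | set(b))
    docs = [("apple iphone case", 0.9), ("iphone apple case", 0.88),
            ("android phone case", 0.7), ("leather wallet", 0.5)]
    print(greedy_diversified_topk([(d.split(), s) for d, s in docs], k=2,
                                  similarity=jaccard))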
- Cologne: A Declarative Distributed Constraint Optimization Platform
Changbin Liu (University of Pennsylvania, USA)
Lu Ren (University of Pennsylvania, USA)
Boon Thau Loo (University of Pennsylvania, USA)
Yun Mao (AT&T Labs - Research, USA)
Prithwish Basu (Raytheon BBN Technologies, USA)
This paper presents Cologne, a declarative optimization platform that enables constraint optimization problems (COPs) to be declaratively specified and incrementally executed in distributed systems. Cologne integrates a declarative networking engine with an off-the-shelf constraint solver. We have developed the Colog language that combines distributed Datalog used in declarative networking with language constructs for specifying goals and constraints used in COPs. Cologne uses novel query processing strategies for processing Colog programs, by combining the use of bottom-up distributed Datalog evaluation with top-down goal-oriented constraint solving. Using case studies based on cloud and wireless network optimizations, we demonstrate that Cologne (1) can flexibly support a wide range of policy-based optimizations in distributed systems, (2) results in orders of magnitude less code compared to imperative implementations, and (3) is highly efficient with low overhead and fast convergence times.
Research Session 17: Information Networks
- Fast and Exact Top-k Search for Random Walk with Restart
Yasuhiro Fujiwara (NTT Cyber Space Labs, Japan)
Makoto Nakatsuji (NTT Cyber Space Labs, Japan)
Makoto Onizuka (NTT Cyber Space Labs, Japan)
Masaru Kitsuregawa (The University of Tokyo, Japan)
Graphs are fundamental data structures and have been employed for centuries to model real-world systems and phenomena. Random walk with restart (RWR) provides a good proximity score between two nodes in a graph, and it has been successfully used in many applications such as automatic image captioning, recommender systems, and link prediction. The goal of this work is to find nodes that have top-k highest proximities for a given node. Previous approaches to this problem find nodes efficiently at the expense of exactness. The main motivation of this paper is to answer, in the affirmative, the question, "Is it possible to improve the search time without sacrificing the exactness?". Our solution, K-dash, is based on two ideas: (1) It computes the proximity of a selected node efficiently by sparse matrices, and (2) It skips unnecessary proximity computations when searching for the top-k nodes. Theoretical analyses show that K-dash guarantees result exactness. We perform comprehensive experiments to verify the efficiency of K-dash. The results show that K-dash can find top-k nodes significantly faster than the previous approaches while it guarantees exactness.
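To make the proximity measure concrete, the sketch below computes RWR scores from a query node by plain power iteration and returns the k highest-proximity nodes. This is the naive exact baseline that K-dash outperforms, not the K-dash algorithm itself.

    # Random walk with restart by power iteration, then top-k by proximity.
    import numpy as np

    def rwr_topk(adj, source, k, restart=0.15, iters=100):
        adj = np.asarray(adj, dtype=float)
        col_sums = adj.sum(axis=0)
        # column-stochastic transition matrix (columns with no edges stay zero)
        P = np.divide(adj, col_sums, out=np.zeros_like(adj), where=col_sums > 0)
        n = adj.shape[0]
        e = np.zeros(n); e[source] = 1.0         # restart vector
        p = e.copy()
        for _ in range(iters):
            p = (1 - restart) * P @ p + restart * e
        ranked = np.argsort(-p)
        return [(int(v), float(p[v])) for v in ranked if v != source][:k]

    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]])
    print(rwr_topk(A, source=0, k=2))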
- The Filter-Placement Problem and its Application to Minimizing Information Multiplicity
Dóra Erdös (Boston University, USA)
Vatche Ishakian (Boston University, USA)
Andrei Lapets (Boston University, USA)
Evimaria Terzi (Boston University, USA)
Azer Bestavros (Boston University, USA)
In many information networks, data items -- such as updates in social networks, news flowing through interconnected RSS feeds and blogs, measurements in sensor networks, route updates in ad-hoc networks -- propagate in an uncoordinated manner: nodes often relay information they receive to neighbors, independent of whether or not these neighbors received the same information from other sources. This uncoordinated data dissemination may result in significant, yet unnecessary communication and processing overheads, ultimately reducing the utility of information networks. To alleviate the negative impacts of this information multiplicity phenomenon, we propose that a subset of nodes (selected at key positions in the network) carry out additional information filtering functionality. Thus, nodes are responsible for the removal (or significant reduction) of the redundant data items relayed through them. We refer to such nodes as filters. We formally define the Filter Placement problem as a combinatorial optimization problem, and study its computational complexity for different types of graphs. We also present polynomial-time approximation algorithms and scalable heuristics for the problem. Our experimental results, which we obtained through extensive simulations on synthetic and real-world information flow networks, suggest that in many settings a relatively small number of filters are fairly effective in removing a large fraction of redundant information.
- Labeling Workflow Views with Fine-Grained Dependencies
Zhuowei Bao (University of Pennsylvania, USA)
Susan B. Davidson (University of Pennsylvania, USA)
Tova Milo (Tel Aviv University, Israel)
This paper considers the problem of efficiently answering reachability queries over views of provenance graphs, derived from executions of workflows that may include recursion. Such views include composite modules and model fine-grained dependencies between module inputs and outputs. A novel view-adaptive dynamic labeling scheme is developed for efficient query evaluation, in which view specifications are labeled statically (i.e. as they are created) and data items are labeled dynamically as they are produced during a workflow execution. Although the combination of fine-grained dependencies and recursive workflows entail, in general, long (linear-size) data labels, we show that for a large natural class of workflows and views, labels are compact (logarithmic-size) and reachability queries can be evaluated in constant time. Experimental results demonstrate the benefit of this approach over the state-of-the-art technique when applied for labeling multiple views.
- gSketch: On Query Estimation in Graph Streams
Peixiang Zhao (University of Illinois at Urbana-Champaign, USA)
Charu C. Aggarwal (IBM T J Watson Research Center, USA)
Min Wang (HP Labs, China)
Many dynamic applications are built upon large network infrastructures, such as social networks, communication networks, biological networks and the Web. Such applications create data that can be naturally modeled as graph streams, in which edges of the underlying graph are received and updated sequentially in the form of a stream. It is often necessary and important to summarize the behavior of graph streams in order to enable effective query processing. However, the sheer size and dynamic nature of graph streams present an enormous challenge to existing graph management techniques. In this paper, we propose a new graph sketch method, gSketch, which combines well-studied synopses for traditional data streams with a sketch partitioning technique, to estimate and optimize the responses to basic queries on graph streams. We consider two different scenarios for query estimation: (1) A graph stream sample is available; (2) Both a graph stream sample and a query workload sample are available. Algorithms for the different scenarios are designed respectively by partitioning a global sketch into a group of localized sketches in order to optimize the query estimation accuracy. We perform extensive experimental studies on both real and synthetic data sets and demonstrate the power and robustness of gSketch in comparison with the state-of-the-art global sketch method.
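gSketch builds on synopses for conventional data streams; the sketch below shows the kind of base synopsis involved, a Count-Min sketch estimating edge frequencies in a graph stream. The paper's contribution, partitioning one global sketch into localized sketches driven by samples, is not shown.

    # Count-Min sketch estimating edge frequencies in a graph stream (the base
    # synopsis; gSketch's sample-driven partitioning is omitted).
    import random

    class CountMin:
        def __init__(self, width=1024, depth=4, seed=7):
            rnd = random.Random(seed)
            self.width, self.depth = width, depth
            self.salts = [rnd.getrandbits(32) for _ in range(depth)]
            self.table = [[0] * width for _ in range(depth)]

        def _cells(self, edge):
            return [(d, hash((self.salts[d], edge)) % self.width)
                    for d in range(self.depth)]

        def add(self, edge, count=1):
            for d, c in self._cells(edge):
                self.table[d][c] += count

        def estimate(self, edge):
            return min(self.table[d][c] for d, c in self._cells(edge))

    cm = CountMin()
    stream = [("u1", "v2"), ("u1", "v2"), ("u3", "v4")]   # edges arriving over time
    for e in stream:
        cm.add(e)
    print(cm.estimate(("u1", "v2")))   # >= true frequency (2), usually equal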
Research Session 18: Distributed Databases
- Probabilistically Bounded Staleness for Practical Partial Quorums
Peter Bailis (University of California, Berkeley, USA)
Shivaram Venkataraman (University of California, Berkeley, USA)
Michael J. Franklin (University of California, Berkeley, USA)
Joseph M. Hellerstein (University of California, Berkeley, USA)
Ion Stoica (University of California, Berkeley, USA)
Data store replication results in a fundamental trade-off between operation latency and data consistency. In this paper, we examine this trade-off in the context of quorum-replicated data stores. Under partial, or non-strict quorum replication, a data store waits for responses from a subset of replicas before answering a query, without guaranteeing that read and write replica sets intersect. As deployed in practice, these configurations provide only basic eventual consistency guarantees, with no limit to the recency of data returned. However, anecdotally, partial quorums are often "good enough" for practitioners given their latency benefits. In this work, we explain why partial quorums are regularly acceptable in practice, analyzing both the staleness of data they return and the latency benefits they offer. We introduce Probabilistically Bounded Staleness (PBS) consistency, which provides expected bounds on staleness with respect to both versions and wall clock time. We derive a closed-form solution for versioned staleness as well as model real-time staleness for representative Dynamo-style systems under internet-scale production workloads. Using PBS, we measure the latency-consistency trade-off for partial quorum systems. We quantitatively demonstrate how eventually consistent systems frequently return consistent data within tens of milliseconds while offering significant latency benefits.
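The flavor of the PBS analysis can be reproduced with a small Monte Carlo simulation under hypothetical exponential latency distributions (the paper derives closed-form and trace-driven results): estimate the probability that a read using R of N replicas, issued t milliseconds after a write acknowledged by W replicas, returns the new version.

    # Monte Carlo estimate of the probability that a partial-quorum read, issued
    # t ms after a write, observes that write. Simplified model with hypothetical
    # exponential latencies; not the paper's model.
    import random

    def p_consistent_read(n, w, r, t_ms, trials=100_000, mean_delay_ms=5.0):
        rng = random.Random(42)
        fresh = 0
        for _ in range(trials):
            # time at which each replica applies the write (propagation delay)
            applied = [rng.expovariate(1.0 / mean_delay_ms) for _ in range(n)]
            commit = sorted(applied)[w - 1]      # write acks once w replicas applied it
            read_at = commit + t_ms              # client reads t ms after the ack
            # the read uses the r replicas that respond fastest; a response is fresh
            # if that replica had applied the write when the read was issued
            latency = [rng.expovariate(1.0 / mean_delay_ms) for _ in range(n)]
            fastest = sorted(range(n), key=lambda i: latency[i])[:r]
            if any(applied[i] <= read_at for i in fastest):
                fresh += 1
        return fresh / trials

    print(p_consistent_read(n=3, w=1, r=1, t_ms=10.0))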
- Minuet: A Scalable Distributed Multiversion B-Tree
Benjamin Sowell (Cornell University, USA)
Wojciech Golab (HP Labs, USA)
Mehul A. Shah (Nou Data, Palo Alto, USA)
Data management systems have traditionally been designed to support either long-running analytics queries or short-lived transactions, but an increasing number of applications need both. For example, online games, socio-mobile apps, and e-commerce sites need to not only maintain operational state, but also analyze that data quickly to make predictions and recommendations that improve user experience. In this paper, we present Minuet, a distributed, main-memory B-tree that supports both transactions and copy-on-write snapshots for in-situ analytics. Minuet uses main-memory storage to enable low-latency transactional operations as well as analytics queries without compromising transaction performance. In addition to supporting read-only analytics queries on snapshots, Minuet supports writable clones, so that users can create branching versions of the data. This feature can be quite useful, e.g. to support complex "what-if" analysis or to facilitate wide-area replication. Our experiments show that Minuet outperforms a commercial main-memory database in many ways. It scales to hundreds of cores and TBs of memory, and can process hundreds of thousands of B-tree operations per second while executing long-running scans.
- Processing a Trillion Cells per Mouse Click
Alexander Hall (Google Inc., Switzerland)
Olaf Bachmann (Google Inc., Switzerland)
Robert Büssow (Google Inc., Switzerland)
Silviu Gănceanu (Google Inc., Switzerland)
Marc Nunkesser (Google Inc., Switzerland)
Column-oriented database systems have been a real game changer for the industry in recent years. Highly tuned and performant systems have evolved that provide users with the possibility of answering ad hoc queries over large datasets in an interactive manner. In this paper we present the column-oriented datastore developed as one of the central components of PowerDrill. It combines the advantages of columnar data layout with other known techniques (such as using composite range partitions) and extensive algorithmic engineering on key data structures. The main goal of the latter is to reduce the main memory footprint and to increase the efficiency of processing typical user queries. In this combination we achieve large speed-ups. These enable a highly interactive Web UI where it is common that a single mouse click leads to processing a trillion values in the underlying dataset.
- Serializability, not Serial: Concurrency Control and Availability in Multi-Datacenter Datastores
Stacy Patterson (Israel Institute of Technology, Israel)
Aaron J. Elmore (University of California, Santa Barbara, USA)
Faisal Nawab (University of California, Santa Barbara, USA)
Divyakant Agrawal (University of California, Santa Barbara, USA)
Amr El Abbadi (University of California, Santa Barbara, USA)
We present a framework for concurrency control and availability in multi-datacenter datastores. While we consider Google's Megastore as our motivating example, we define general abstractions for key components, making our solution extensible to any system that satisfies the abstraction properties. We first develop and analyze a transaction management and replication protocol based on a straightforward implementation of the Paxos algorithm. Our investigation reveals that this protocol acts as a concurrency prevention mechanism rather than a concurrency control mechanism. We then propose an enhanced protocol called Paxos with Combination and Promotion (Paxos-CP) that provides true transaction concurrency while requiring the same per instance message complexity as the basic Paxos protocol. Finally, we compare the performance of Paxos and Paxos-CP in a multi-datacenter experimental study, and we demonstrate that Paxos-CP results in significantly fewer aborted transactions than basic Paxos.
Research Session 19: Privacy II
- Functional Mechanism: Regression Analysis under Differential Privacy
Jun Zhang (Nanyang Technological University, Singapore)
Zhenjie Zhang (Advanced Digital Sciences Center Illinois at Singapore, Singapore)
Xiaokui Xiao (Nanyang Technological University, Singapore)
Yin Yang (Advanced Digital Sciences Center Illinois at Singapore, Singapore)
Marianne Winslett (University of Illinois at Urbana-Champaign, USA)
epsilon-differential privacy is the state-of-the-art model for releasing sensitive information while protecting privacy. Numerous methods have been proposed to enforce epsilon-differential privacy in various analytical tasks, e.g., regression analysis. Existing solutions for regression analysis, however, are either limited to non-standard types of regression or unable to produce accurate regression results. Motivated by this, we propose the Functional Mechanism, a differentially private method designed for a large class of optimization-based analyses. The main idea is to enforce epsilon-differential privacy by perturbing the objective function of the optimization problem, rather than its results. As case studies, we apply the functional mechanism to the two most widely used regression models, namely, linear regression and logistic regression. Both theoretical analysis and thorough experimental evaluations show that the functional mechanism is highly effective and efficient, and it significantly outperforms existing solutions.
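To make the objective-perturbation idea concrete, here is a minimal sketch for linear regression: the coefficients of the quadratic loss are perturbed with Laplace noise and the noisy objective is then minimized. The sensitivity constant, the rescaling of the data into [-1, 1], and the closed-form solve are simplifying assumptions of this sketch, not details taken from the paper.

```python
# Minimal sketch of objective perturbation for differentially private linear
# regression (functional-mechanism style). Assumptions: rows are rescaled so
# features and response lie in [-1, 1], and the sensitivity constant below is
# a placeholder to be checked against the paper.
import numpy as np

def fm_linear_regression(X, y, epsilon, seed=None):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # The loss sum_i (y_i - x_i^T w)^2 is a quadratic in w with coefficients
    # A = X^T X and b = X^T y (plus a constant that does not affect argmin).
    A = X.T @ X
    b = X.T @ y
    scale = 2.0 * (d + 1) ** 2 / epsilon     # assumed sensitivity / epsilon
    A_noisy = A + rng.laplace(0.0, scale, size=A.shape)
    b_noisy = b + rng.laplace(0.0, scale, size=b.shape)
    # Symmetrize and lightly regularize so the noisy objective stays solvable,
    # then minimize the perturbed objective instead of perturbing the result.
    A_noisy = (A_noisy + A_noisy.T) / 2 + 1e-3 * np.eye(d)
    return np.linalg.solve(A_noisy, b_noisy)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(5000, 3))
    y = np.clip(X @ np.array([0.5, -0.3, 0.2]), -1, 1)
    print(fm_linear_regression(X, y, epsilon=1.0, seed=1))
```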
- Low-Rank Mechanism: Optimizing Batch Queries under Differential Privacy
Ganzhao Yuan (South China University of Technology, China)
Zhenjie Zhang (Advanced Digital Sciences Center Illinois at Singapore, Singapore)
Marianne Winslett (University of Illinois at Urbana-Champaign, USA)
Xiaokui Xiao (Nanyang Technological University, Singapore)
Yin Yang (Advanced Digital Sciences Center Illinois at Singapore, Singapore)
Zhifeng Hao (South China University of Technology, China)
Differential privacy is a promising privacy-preserving paradigm for statistical query processing over sensitive data. It works by injecting random noise into each query result, such that it is provably hard for the adversary to infer the presence or absence of any individual record from the published noisy results. The main objective in differentially private query processing is to maximize the accuracy of the query results, while satisfying the privacy guarantees. Previous work, notably the matrix mechanism, has suggested that processing a batch of correlated queries as a whole can potentially achieve considerable accuracy gains, compared to answering them individually. However, as we point out in this paper, the matrix mechanism is mainly of theoretical interest; in particular, several inherent problems in its design limit its accuracy in practice, which almost never exceeds that of naive methods. In fact, we are not aware of any existing solution that can effectively optimize a query batch under differential privacy. Motivated by this, we propose the Low-Rank Mechanism (LRM), the first practical differentially private technique for answering batch queries with high accuracy, based on a low rank approximation of the workload matrix. We prove that the accuracy provided by LRM is close to the theoretical lower bound for any mechanism to answer a batch of queries under differential privacy. Extensive experiments using real data demonstrate that LRM consistently outperforms state-of-the-art query processing solutions under differential privacy, by large margins.
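As a rough illustration of answering a query batch through a low-rank factorization of the workload matrix, the sketch below uses a plain truncated SVD in place of the optimized factorization the paper derives; the chosen rank and the sensitivity computation are assumptions of the sketch.

```python
# Illustrative sketch of answering a batch of linear queries through a
# low-rank factorization W ~= B @ L: noise is added to the r intermediate
# answers L @ x rather than to all m query answers directly.
import numpy as np

def low_rank_batch_answers(W, x, epsilon, rank, seed=None):
    rng = np.random.default_rng(seed)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    B = U[:, :rank] * s[:rank]          # m x r
    L = Vt[:rank, :]                    # r x n
    # L1 sensitivity of L @ x under a one-record change: max column L1 norm of L.
    sensitivity = np.abs(L).sum(axis=0).max()
    noisy_intermediate = L @ x + rng.laplace(0.0, sensitivity / epsilon, size=rank)
    return B @ noisy_intermediate       # noisy answers to all m queries

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.integers(0, 2, size=(64, 32)).astype(float)   # 64 range-like queries
    x = rng.integers(0, 100, size=32).astype(float)       # histogram counts
    approx = low_rank_batch_answers(W, x, epsilon=1.0, rank=8, seed=1)
    print(np.abs(approx - W @ x).mean())
```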
- PrivBasis: Frequent Itemset Mining with Differential Privacy
Ninghui Li (Purdue University, USA)
Wahbeh Qardaji (Purdue University, USA)
Dong Su (Purdue University, USA)
Jianneng Cao (Purdue University, USA)
The discovery of frequent itemsets can serve valuable economic and research purposes. Releasing discovered frequent itemsets, however, presents privacy challenges. In this paper, we study the problem of how to perform frequent itemset mining on transaction databases while satisfying differential privacy. We propose an approach, called PrivBasis, which leverages a novel notion called basis sets. A theta-basis set has the property that any itemset with frequency higher than theta is a subset of some basis. We introduce algorithms for privately constructing a basis set and then using it to find the most frequent itemsets. Experiments show that our approach greatly outperforms the current state of the art.
- Explanation-Based Auditing
Daniel Fabbri (University of Michigan, USA)
Kristen LeFevre (University of Michigan, USA)
To comply with emerging privacy laws and regulations, it has become common for applications like electronic health records systems (EHRs) to collect access logs, which record each time a user (e.g., a hospital employee) accesses a piece of sensitive data (e.g., a patient record). Using the access log, it is easy to answer simple queries (e.g., Who accessed Alice's medical record?), but this often does not provide enough information. In addition to learning who accessed their medical records, patients will likely want to understand why each access occurred. In this paper, we introduce the problem of generating explanations for individual records in an access log. The problem is motivated by user-centric auditing applications, and it also provides a novel approach to misuse detection. We develop a framework for modeling explanations which is based on a fundamental observation: For certain classes of databases, including EHRs, the reason for most data accesses can be inferred from data stored elsewhere in the database. For example, if Alice has an appointment with Dr. Dave, this information is stored in the database, and it explains why Dr. Dave looked at Alice's record. Large numbers of data accesses can be explained using general forms called explanation templates. Rather than requiring an administrator to manually specify explanation templates, we propose a set of algorithms for automatically discovering frequent templates from the database (i.e., those that explain a large number of accesses). We also propose techniques for inferring collaborative user groups, which can be used to enhance the quality of the discovered explanations. Finally, we have evaluated our proposed techniques using an access log and data from the University of Michigan Health System. Our results demonstrate that in practice we can provide explanations for over 94% of data accesses in the log.
Research Session 20: Modern Hardware
- Massively Parallel Sort-Merge Joins in Main Memory Multi-Core Database Systems
Martina-Cezara Albutiu (Technische Universität München, Germany)
Alfons Kemper (Technische Universität München, Germany)
Thomas Neumann (Technische Universität München, Germany)
Two emerging hardware trends will dominate the database system technology in the near future: increasing main memory capacities of several TB per server and massively parallel multi-core processing. Many algorithmic and control techniques in current database technology were devised for disk-based systems where I/O dominated the performance. In this work we take a new look at the well-known sort-merge join which, so far, has not been in the focus of research in scalable massively parallel multi-core data processing as it was deemed inferior to hash joins. We devise a suite of new massively parallel sort-merge (MPSM) join algorithms that are based on partial partition-based sorting. Contrary to classical sort-merge joins, our MPSM algorithms do not rely on a hard to parallelize final merge step to create one complete sort order. Rather they work on the independently created runs in parallel. This way our MPSM algorithms are NUMA-affine as all the sorting is carried out on local memory partitions. An extensive experimental evaluation on a modern 32-core machine with one TB of main memory proves the competitive performance of MPSM on large main memory databases with billions of objects. It scales (almost) linearly in the number of employed cores and clearly outperforms competing hash join proposals - in particular it outperforms the "cutting-edge" Vectorwise parallel query engine by a factor of four.
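The sketch below shows the overall partition-then-local-sort-merge structure with no global merge phase; it is a simplified range-partitioned variant and does not model the NUMA-affine run handling that distinguishes the MPSM algorithms.

```python
# Simplified parallel sort-merge join: both inputs are range-partitioned on
# the join key, each worker sorts its partitions independently and
# merge-joins them, and no global merge phase is needed.
from concurrent.futures import ProcessPoolExecutor

def _partition(rows, key, bounds):
    parts = [[] for _ in range(len(bounds) + 1)]
    for row in rows:
        idx = sum(key(row) >= b for b in bounds)   # which key range this row falls in
        parts[idx].append(row)
    return parts

def _sort_merge_join(args):
    r_part, s_part = args
    r_part = sorted(r_part, key=lambda t: t[0])
    s_part = sorted(s_part, key=lambda t: t[0])
    out, i = [], 0
    for rk, rv in r_part:
        while i < len(s_part) and s_part[i][0] < rk:
            i += 1
        j = i
        while j < len(s_part) and s_part[j][0] == rk:
            out.append((rk, rv, s_part[j][1]))
            j += 1
    return out

def mpsm_style_join(R, S, bounds, workers=4):
    r_parts = _partition(R, lambda t: t[0], bounds)
    s_parts = _partition(S, lambda t: t[0], bounds)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(_sort_merge_join, zip(r_parts, s_parts))
    return [row for part in results for row in part]

if __name__ == "__main__":
    R = [(k % 100, f"r{k}") for k in range(1000)]
    S = [(k % 100, f"s{k}") for k in range(500)]
    print(len(mpsm_style_join(R, S, bounds=[25, 50, 75])))   # 5000 matches
```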
- OLTP on Hardware Islands
Danica Porobic (Ecole Polytechnique Fédérale de Lausanne, Switzerland)
Ippokratis Pandis (IBM Almaden Research Center, USA)
Miguel Branco (Ecole Polytechnique Fédérale de Lausanne, Switzerland)
Pınar Tözün (Ecole Polytechnique Fédérale de Lausanne, Switzerland)
Anastasia Ailamaki (Ecole Polytechnique Fédérale de Lausanne, Switzerland)
Modern hardware is abundantly parallel and increasingly heterogeneous. The numerous processing cores have non-uniform access latencies to the main memory and to the processor caches, which causes variability in the communication costs. Unfortunately, database systems mostly assume that all processing cores are the same and that microarchitecture differences are not significant enough to appear in critical database execution paths. As we demonstrate in this paper, however, hardware heterogeneity does appear in the critical path and conventional database architectures achieve suboptimal and even worse, unpredictable performance. We perform a detailed performance analysis of OLTP deployments in servers with multiple cores per CPU (multicore) and multiple CPUs per server (multisocket). We compare different database deployment strategies where we vary the number and size of independent database instances running on a single server, from a single shared-everything instance to fine-grained shared-nothing configurations. We quantify the impact of non-uniform hardware on various deployments by (a) examining how efficiently each deployment uses the available hardware resources and (b) measuring the impact of distributed transactions and skewed requests on different workloads. Finally, we argue in favor of shared-nothing deployments that are topology- and workload-aware and take advantage of fast on-chip communication between islands of cores on the same socket.
- Accelerating Pathology Image Data Cross-Comparison on CPU-GPU Hybrid Systems
Kaibo Wang (The Ohio State University, USA)
Yin Huai (The Ohio State University, USA)
Rubao Lee (The Ohio State University, USA)
Fusheng Wang (Emory University, USA)
Xiaodong Zhang (The Ohio State University, USA)
Joel H. Saltz (Emory University, USA)
As an important application of spatial databases in pathology imaging analysis, cross-comparing the spatial boundaries of a huge amount of segmented micro-anatomic objects demands extremely data- and compute-intensive operations, requiring high throughput at an affordable cost. However, the performance of spatial database systems has not been satisfactory since their implementations of spatial operations cannot fully utilize the power of modern parallel hardware. In this paper, we provide a customized software solution that exploits GPUs and multi-core CPUs to accelerate spatial cross-comparison in a cost-effective way. Our solution consists of an efficient GPU algorithm and a pipelined system framework with task migration support. Extensive experiments with real-world data sets demonstrate the effectiveness of our solution, which improves the performance of spatial cross-comparison by over 18 times compared with a parallelized spatial database approach.
- B+-tree Index Optimization by Exploiting Internal Parallelism of Flash-based Solid State Drives
Hongchan Roh (Yonsei University, Korea)
Sanghyun Park (Yonsei University, Korea)
Sungho Kim (Yonsei University, Korea)
Mincheol Shin (Yonsei University, Korea)
Sang-Won Lee (Sungkyunkwan University, Korea)
Previous research has addressed the potential problems of the hard-disk-oriented design of DBMSs for flashSSDs. In this paper, we focus on exploiting the potential benefits of flashSSDs. First, we examine the internal parallelism issues of flashSSDs by conducting benchmarks on various flashSSDs. Then, we suggest algorithm-design principles in order to best benefit from the internal parallelism. We present a new I/O request concept, called psync I/O, which can exploit the internal parallelism of flashSSDs in a single process. Based on these ideas, we introduce B+-tree optimization methods that utilize internal parallelism. By integrating the results of these methods, we present a B+-tree variant, PIO B-tree. We confirmed that each optimization method substantially enhances the index performance. Consequently, PIO B-tree improved the B+-tree's insert performance by a factor of up to 16.3, while improving point-search performance by a factor of 1.2. The range search of PIO B-tree was up to 5 times faster than that of the B+-tree. Moreover, PIO B-tree outperformed other flash-aware indexes in various synthetic workloads. We also confirmed that PIO B-tree outperforms the B+-tree in index traces collected inside the PostgreSQL DBMS with the TPC-C benchmark.
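The gist of exploiting internal parallelism is to keep several page reads outstanding from a single process instead of issuing them one at a time. The Unix-only sketch below imitates that with a thread pool over os.pread; the psync I/O primitive itself and the B+-tree integration are not reproduced here.

```python
# Sketch of keeping several page reads outstanding so an SSD's internal
# channels can work in parallel (Unix-only: uses os.pread).
import os
from concurrent.futures import ThreadPoolExecutor

PAGE_SIZE = 4096

def read_pages_parallel(path, page_numbers, depth=8):
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=depth) as pool:
            futures = [pool.submit(os.pread, fd, PAGE_SIZE, p * PAGE_SIZE)
                       for p in page_numbers]
            return [f.result() for f in futures]
    finally:
        os.close(fd)

if __name__ == "__main__":
    # Create a small dummy file so the example is self-contained; in a real
    # index these offsets would be the leaf pages needed by a batched scan.
    with open("/tmp/index.dat", "wb") as f:
        f.write(os.urandom(PAGE_SIZE * 128))
    pages = read_pages_parallel("/tmp/index.dat", [3, 17, 42, 99])
    print(sum(len(p) for p in pages))
```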
Research Session 21: Shortest Paths and Reachability
- Shortest Path Computation with No Information Leakage
Kyriakos Mouratidis (Singapore Management University, Singapore)
Man Lung Yiu (Hong Kong Polytechnic University, Hong Kong)
Shortest path computation is one of the most common queries in location-based services (LBSs). Although particularly useful, such queries raise serious privacy concerns. Exposing to a (potentially untrusted) LBS the client’s position and her destination may reveal personal information, such as social habits, health condition, shopping preferences, lifestyle choices, etc. The only existing method for privacy-preserving shortest path computation follows the obfuscation paradigm; it prevents the LBS from inferring the source and destination of the query with a probability higher than a threshold. This implies, however, that the LBS still deduces some information (albeit not exact) about the client’s location and her destination. In this paper we aim at strong privacy, where the adversary learns nothing about the shortest path query. We achieve this via established private information retrieval techniques, which we treat as black-box building blocks. Experiments on real, large-scale road networks assess the practicality of our schemes.
- K-Reach: Who is in Your Small World
James Cheng (Nanyang Technological University, Singapore)
Zechao Shang (The Chinese University of Hong Kong, Hong Kong)
Hong Cheng (The Chinese University of Hong Kong, Hong Kong)
Haixun Wang (Microsoft Research Asia, China)
Jeffrey Xu Yu (The Chinese University of Hong Kong, Hong Kong)
We study the problem of answering k-hop reachability queries in a directed graph, i.e., whether there exists a directed path of length k from a source query vertex to a target query vertex in the input graph. The k-hop reachability problem generalizes classic reachability (where k=infinity). Existing indexes for processing classic reachability queries, as well as for processing shortest path queries, are not applicable or not efficient for processing k-hop reachability queries. We propose an index for processing k-hop reachability queries, which is simple in design and efficient to construct. Our experimental results on a wide range of real datasets show that our index is more efficient than the state-of-the-art indexes even for processing classic reachability queries, for which these indexes are primarily designed. We also show that our index is efficient in answering k-hop reachability queries.
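For reference, the index-free baseline the paper improves upon is a breadth-first search cut off at k levels (reading k-hop reachability as a path of at most k edges, which is an assumption of this sketch); the proposed index itself is not reproduced.

```python
# Baseline k-hop reachability check via breadth-first search bounded to k levels.
from collections import deque

def k_hop_reachable(adj, source, target, k):
    """adj: dict mapping a vertex to an iterable of out-neighbours."""
    if source == target:
        return True
    frontier, visited = deque([(source, 0)]), {source}
    while frontier:
        v, depth = frontier.popleft()
        if depth == k:
            continue                      # do not expand beyond k hops
        for w in adj.get(v, ()):
            if w == target:
                return True
            if w not in visited:
                visited.add(w)
                frontier.append((w, depth + 1))
    return False

if __name__ == "__main__":
    g = {0: [1], 1: [2], 2: [3], 3: []}
    print(k_hop_reachable(g, 0, 3, k=2), k_hop_reachable(g, 0, 3, k=3))
```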
- Performance Guarantees for Distributed Reachability Queries
Wenfei Fan (University of Edinburgh, UK)
Xin Wang (Harbin Institute of Technology, China)
Yinghui Wu (University of Edinburgh, UK)
In the real world a graph is often fragmented and distributed across different sites. This highlights the need for evaluating queries on distributed graphs. This paper proposes distributed evaluation algorithms for three classes of queries: reachability for determining whether one node can reach another, bounded reachability for deciding whether there exists a path of a bounded length between a pair of nodes, and regular reachability for checking whether there exists a path connecting two nodes such that the node labels on the path form a string in a given regular expression. We develop these algorithms based on partial evaluation, to explore parallel computation. When evaluating a query Q on a distributed graph G, we show that these algorithms possess the following performance guarantees, no matter how G is fragmented and distributed: (1) each site is visited only once; (2) the total network traffic is determined by the size of Q and the fragmentation of G, independent of the size of G; and (3) the response time is decided by the largest fragment of G rather than the entire G. In addition, we show that these algorithms can be readily implemented in the MapReduce framework. Using synthetic and real-life data, we experimentally verify that these algorithms are scalable on large graphs, regardless of how the graphs are distributed.
- Relational Approach for Shortest Path Discovery over Large Graphs
Jun Gao (Peking University, China)
Ruoming Jin (Kent State University, USA)
Jiashuai Zhou (Peking University, China)
Jeffrey Xu Yu (Chinese University of Hong Kong, Hong Kong)
Xiao Jiang (Peking University, China)
Tengjiao Wang (Peking University, China)
With the rapid growth of large graphs, we cannot assume that graphs can still be fully loaded into memory, so disk-based graph operations are inevitable. In this paper, we take shortest path discovery as an example to investigate the technical issues that arise when leveraging the existing infrastructure of relational databases (RDBs) for graph data management. Based on the observation that a variety of graph search queries can be implemented by iterative operations including selecting frontier nodes from visited nodes, making expansions from the selected frontier nodes, and merging the expanded nodes into the visited ones, we introduce a relational FEM framework with three corresponding operators to implement graph search tasks in the RDB context. We show that new features such as window functions and the MERGE statement introduced by recent SQL standards can not only simplify the expression but also improve the performance of the FEM framework. In addition, we propose two optimization strategies specific to shortest path discovery inside the FEM framework. First, we adopt a bi-directional set Dijkstra's algorithm for path finding. The bi-directional strategy reduces the search space, and the set Dijkstra's algorithm finds the shortest path in a set-at-a-time fashion. Second, we introduce an index named SegTable to preserve local shortest segments, and exploit SegTable to further improve performance. Extensive experimental results illustrate that our relational approach with the optimization strategies achieves high scalability and performance.
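To make the frontier-expand-merge loop concrete, here is a stand-alone sketch of single-source shortest paths written over in-memory dictionaries rather than relational tables; the SQL mapping, the bi-directional strategy, and SegTable are all omitted, and non-negative edge weights are assumed.

```python
# Frontier-expand-merge (FEM) style single-source shortest paths over
# in-memory dictionaries; each iteration selects a frontier, expands it
# through the edge set, and merges improved distances back in.
def fem_shortest_paths(edges, source):
    """edges: iterable of (u, v, weight) triples with non-negative weights."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
    visited = {source: 0.0}          # node -> best known distance
    frontier = {source}
    while frontier:
        # Expand: follow every out-edge of the current frontier.
        expanded = {}
        for u in frontier:
            for v, w in adj.get(u, ()):
                d = visited[u] + w
                if d < expanded.get(v, float("inf")):
                    expanded[v] = d
        # Merge: keep only strictly improved distances; they form the next frontier.
        frontier = set()
        for v, d in expanded.items():
            if d < visited.get(v, float("inf")):
                visited[v] = d
                frontier.add(v)
    return visited

if __name__ == "__main__":
    g = [("a", "b", 1), ("b", "c", 2), ("a", "c", 5), ("c", "d", 1)]
    print(fem_shortest_paths(g, "a"))   # {'a': 0.0, 'b': 1.0, 'c': 3.0, 'd': 4.0}
```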
Research Session 22: Query Processing II
- FDB: A Query Engine for Factorised Relational Databases
Nurzhan Bakibayev (University of Oxford, UK)
Dan Olteanu (University of Oxford, UK)
Jakub Závodný (University of Oxford, UK)
Factorised databases are relational databases that use compact factorised representations at the physical layer to reduce data redundancy and boost query performance. This paper introduces FDB, an in-memory query engine for select-project-join queries on factorised databases. Key components of FDB are novel algorithms for query optimisation and evaluation that exploit the succinctness brought by data factorisation. Experiments show that for data sets with many-to-many relationships FDB can outperform relational engines by orders of magnitude.
- Automatic Partitioning of Database Applications
Alvin Cheung (Massachusetts Institute of Technology, USA)
Owen Arden (Cornell University, USA)
Samuel Madden (Massachusetts Institute of Technology, USA)
Andrew C. Myers (Cornell University, USA)
Database-backed applications are nearly ubiquitous in our daily lives. Applications that make many small accesses to the database create two challenges for developers: increased latency and wasted resources from numerous network round trips. A well-known technique to improve transactional database application performance is to convert part of the application into stored procedures that are executed on the database server. Unfortunately, this conversion is often difficult. In this paper we describe Pyxis, a system that takes database-backed applications and automatically partitions their code into two pieces, one of which is executed on the application server and the other on the database server. Pyxis profiles the application and server loads, statically analyzes the code's dependencies, and produces a partitioning that minimizes the number of control transfers as well as the amount of data sent during each transfer. Our experiments using TPC-C and TPC-W show that Pyxis is able to generate partitions with up to 3x reduction in latency and 1.7x improvement in throughput when compared to a traditional non-partitioned implementation and has comparable performance to that of a custom stored procedure implementation.
- Generating Exact- and Ranked Partially-Matched Answers to Questions in Advertisements
Rani Qumsiyeh (Brigham Young University, USA)
Maria S. Pera (Brigham Young University, USA)
Yiu-Kai Ng (Brigham Young University, USA)
Taking advantage of the Web, many advertisement (ads for short) websites, which aspire to increase clients' transactions and thus profits, offer search tools which allow users to (i) post keyword queries to capture their information needs or (ii) invoke form-based interfaces to create queries by selecting search options, such as a price range, filled-in entries, check boxes, or drop-down menus. These search mechanisms, however, are inadequate, since they cannot be used to specify a natural-language query with rich syntactic and semantic content, which can only be handled by a question answering (QA) system. Furthermore, existing ads websites are incapable of evaluating arbitrary Boolean queries or retrieving partially-matched answers that might be of interest to the user whenever a user's search yields only a few or no results at all. In solving these problems, we present a QA system for ads, called CQAds, which (i) allows users to post a natural-language question Q for retrieving relevant ads, if they exist, (ii) identifies ads as answers that partially match the requested information expressed in Q, if insufficient or no answers to Q can be retrieved, which are ordered using a similarity-ranking approach, and (iii) analyzes incomplete or ambiguous questions to perform the "best guess" in retrieving answers that "best match" the selection criteria specified in Q. CQAds is also equipped with a Boolean model to evaluate Boolean operators that are either explicitly or implicitly specified in Q, i.e., with or without Boolean operators specified by the users, respectively. CQAds is easy to use, scalable to all ads domains, and more powerful than search tools provided by existing ads websites, since its query-processing strategy retrieves relevant ads of higher quality and quantity. We have verified the accuracy of CQAds in retrieving ads on eight ads domains and compared its ranking strategy with other well-known ranking approaches.
- Scalable K-Means++
Bahman Bahmani (Stanford University, USA)
Benjamin Moseley (University of Illinois at Urbana-Champaign, USA)
Andrea Vattani (University of California, San Diego, USA)
Ravi Kumar (Yahoo! Research, USA)
Sergei Vassilvitskii (Yahoo! Research, USA)
Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well-known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of the k-means++ is its inherent sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means that have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
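A compact sketch of the oversampling initialization follows: a few passes each sample many candidate centers, and the candidate set is then reduced to k seeds with a weighted k-means++ step. The fixed round count and oversampling factor are arbitrary choices of the sketch, not values taken from the paper.

```python
# k-means||-style initialization: oversample candidate centers over a few
# passes, weight the candidates, then reduce them to k seeds.
import numpy as np

def _sq_dists_to_centers(X, centers):
    d = ((X[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1)

def kmeans_parallel_init(X, k, l=None, rounds=5, seed=None):
    rng = np.random.default_rng(seed)
    l = l or 2 * k                                   # oversampling factor
    centers = [X[rng.integers(len(X))]]
    for _ in range(rounds):
        d2 = _sq_dists_to_centers(X, centers)
        probs = np.minimum(1.0, l * d2 / d2.sum())   # sample proportionally to cost
        centers.extend(X[rng.random(len(X)) < probs])
    centers = np.asarray(centers)
    # Weight each candidate by how many points it is closest to.
    assign = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    weights = np.bincount(assign, minlength=len(centers)).astype(float)
    # Reduce the candidates to k seeds with a weighted k-means++ pass.
    seeds = [centers[rng.choice(len(centers), p=weights / weights.sum())]]
    for _ in range(k - 1):
        d2 = _sq_dists_to_centers(centers, seeds) * weights
        seeds.append(centers[rng.choice(len(centers), p=d2 / d2.sum())])
    return np.asarray(seeds)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.1, size=(200, 2)) for c in (0, 3, 6)])
    print(kmeans_parallel_init(X, k=3, seed=1))
```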
Research Session 23: Similarity Search and Ranking I
- Answering Top-k Queries Over a Mixture of Attractive and Repulsive Dimensions
Sayan Ranu (University of California, Santa Barbara, USA)
Ambuj K. Singh (University of California, Santa Barbara, USA)
In this paper, we formulate a top-k query that compares objects in a database to a user-provided query object on a novel scoring function. The proposed scoring function combines the idea of attractive and repulsive dimensions into a general framework to overcome the weakness of traditional distance or similarity measures. We study the properties of the proposed class of scoring functions and develop efficient and scalable index structures that index the isolines of the function. We demonstrate various scenarios where the query finds application. Empirical evaluation demonstrates a performance gain of one to two orders of magnitude on querying time over existing state-of-the-art top-k techniques. Further, a qualitative analysis is performed on a real dataset to highlight the potential of the proposed query in discovering hidden data characteristics.
- Bayesian Locality Sensitive Hashing for Fast Similarity Search
Venu Satuluri (The Ohio State University, USA)
Srinivasan Parthasarathy (The Ohio State University, USA)
Given a collection of objects and an associated similarity measure, the all-pairs similarity search problem asks us to find all pairs of objects with similarity greater than a certain user-specified threshold. Locality-sensitive hashing (LSH) based methods have become a very popular approach for this problem. However, most such methods only use LSH for the first phase of similarity search - i.e. efficient indexing for candidate generation. In this paper, we present BayesLSH, a principled Bayesian algorithm for the subsequent phase of similarity search - performing candidate pruning and similarity estimation using LSH. A simpler variant, BayesLSH-Lite, which calculates similarities exactly, is also presented. Our algorithms are able to quickly prune away a large majority of the false positive candidate pairs, leading to significant speedups over baseline approaches. For BayesLSH, we also provide probabilistic guarantees on the quality of the output, both in terms of accuracy and recall. Finally, the quality of BayesLSH's output can be easily tuned and does not require any manual setting of the number of hashes to use for similarity estimation, unlike standard approaches. For two state-of-the-art candidate generation algorithms, AllPairs and LSH, BayesLSH enables significant speedups, typically in the range 2x-20x for a wide variety of datasets.
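The sketch below conveys only the early-stopping structure of candidate pruning, using a frequentist (Hoeffding) bound over a growing prefix of minhash signatures; the actual BayesLSH inference is Bayesian and differs in its guarantees, so treat this as a loose analogue rather than the paper's algorithm.

```python
# Early-stopping candidate pruning: compare minhash signatures a batch at a
# time and discard a pair once even an optimistic similarity estimate falls
# below the threshold.
import math
import random

def minhash_signature(items, num_hashes, seed=0):
    rnd = random.Random(seed)
    salts = [rnd.random() for _ in range(num_hashes)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def passes_threshold(sig_a, sig_b, threshold, batch=32, delta=0.05):
    matches = 0
    for m in range(1, len(sig_a) + 1):
        matches += sig_a[m - 1] == sig_b[m - 1]
        if m % batch == 0:
            # Hoeffding upper bound on the true Jaccard similarity.
            upper = matches / m + math.sqrt(math.log(1 / delta) / (2 * m))
            if upper < threshold:
                return False              # prune early
    return matches / len(sig_a) >= threshold

if __name__ == "__main__":
    a, b = set(range(0, 100)), set(range(50, 150))   # Jaccard ~ 0.33
    sa, sb = minhash_signature(a, 256), minhash_signature(b, 256)
    print(passes_threshold(sa, sb, threshold=0.7))   # very likely pruned early
```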
- Challenging the Long Tail Recommendation
Hongzhi Yin (Peking University, China)
Bin Cui (Peking University, China)
Jing Li (Peking University, China)
Junjie Yao (Peking University, China)
Chen Chen (Peking University, China)
The success of "infinite-inventory" retailers such as Amazon.com and Netflix has been largely attributed to a "long tail" phenomenon. Although the majority of their inventory is not in high demand, these niche products, unavailable at limited-inventory competitors, generate a significant fraction of total revenue in aggregate. In addition, tail product availability can boost head sales by offering consumers the convenience of "one-stop shopping" for both their mainstream and niche tastes. However, most of existing recommender systems, especially collaborative filter based methods, can not recommend tail products due to the data sparsity issue. It has been widely acknowledged that to recommend popular products is easier yet more trivial while to recommend long tail products adds more novelty yet it is also a more challenging task. In this paper, we propose a novel suite of graph-based algorithms for the long tail recommendation. We first represent user-item information with undirected edge-weighted graph and investigate the theoretical foundation of applying Hitting Time algorithm for long tail item recommendation. To improve recommendation diversity and accuracy, we extend Hitting Time and propose efficient Absorbing Time algorithm to help users find their favorite long tail items. Finally, we refine the Absorbing Time algorithm and propose two entropy-biased Absorbing Cost algorithms to distinguish the variation on different user-item rating pairs, which further enhances the effectiveness of long tail recommendation. Empirical experiments on two real life datasets show that our proposed algorithms are effective to recommend long tail items and outperform state-of-the-art recommendation techniques.
- Ranking Large Temporal Data
Jeffrey Jestes (University of Utah, USA)
Jeff M. Phillips (University of Utah, USA)
Feifei Li (University of Utah, USA)
Mingwang Tang (University of Utah, USA)
Ranking temporal data has not been studied until recently, even though ranking is an important operator (being promoted as a first-class citizen) in database systems. However, only instant top-k queries on temporal data have been studied, where objects with the k highest scores at a query time instance t are to be retrieved. The instant top-k definition clearly comes with limitations (sensitive to outliers, difficult to choose a meaningful query time t). A more flexible and general ranking operation is to rank objects based on the aggregation of their scores in a query interval, which we dub the aggregate top-k query on temporal data. For example, return the top-10 weather stations having the highest average temperature from 10/01/2010 to 10/07/2010; find the top-20 stocks having the largest total transaction volumes from 02/05/2011 to 02/07/2011. This work presents a comprehensive study of this problem by designing both exact and approximate methods (with approximation quality guarantees). We also provide theoretical analysis of the construction cost, the index size, and the update and query costs of each approach. Extensive experiments on large real datasets clearly demonstrate the efficiency, the effectiveness, and the scalability of our methods compared to the baseline methods.
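As a point of reference, the naive baseline for the aggregate top-k query simply aggregates each object's scores over the query interval and keeps the k largest aggregates; the paper's index structures exist precisely to avoid this full scan, and none of that machinery appears in the sketch.

```python
# Naive aggregate top-k over temporal data: aggregate each object's scores in
# the query interval, then keep the k largest aggregates.
import heapq

def aggregate_topk(series, t_start, t_end, k, agg=lambda xs: sum(xs) / len(xs)):
    """series: dict object_id -> list of (timestamp, score) pairs."""
    scored = []
    for obj, points in series.items():
        window = [s for t, s in points if t_start <= t <= t_end]
        if window:
            scored.append((agg(window), obj))
    return heapq.nlargest(k, scored)

if __name__ == "__main__":
    data = {
        "station_1": [(1, 20.0), (2, 22.0), (3, 21.0)],
        "station_2": [(1, 25.0), (2, 19.0), (3, 18.0)],
        "station_3": [(2, 23.0), (3, 24.0)],
    }
    print(aggregate_topk(data, t_start=2, t_end=3, k=2))
```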
Research Session 24: String Processing
- ERA: Efficient Serial and Parallel Suffix Tree Construction for Very Long Strings
Essam Mansour (King Abdullah University of Science and Technology, Saudi Arabia)
Amin Allam (King Abdullah University of Science and Technology, Saudi Arabia)
Spiros Skiadopoulos (University of Peloponnese, Greece)
Panos Kalnis (King Abdullah University of Science and Technology, Saudi Arabia)
The suffix tree is a data structure for indexing strings. It is used in a variety of applications such as bioinformatics, time series analysis, clustering, text editing and data compression. However, when the string and the resulting suffix tree are too large to fit into the main memory, most existing construction algorithms become very inefficient. This paper presents a disk-based suffix tree construction method, called Elastic Range (ERa), which works efficiently with very long strings that are much larger than the available memory. ERa partitions the tree construction process horizontally and vertically and minimizes I/Os by dynamically adjusting the horizontal partitions independently for each vertical partition, based on the evolving shape of the tree and the available memory. Where appropriate, ERa also groups vertical partitions together to amortize the I/O cost. We developed a serial version; a parallel version for shared-memory and shared-disk multi-core systems; and a parallel version for shared-nothing architectures. ERa indexes the entire human genome in 19 minutes on an ordinary desktop computer. For comparison, the fastest existing method needs 15 minutes using 1024 CPUs on an IBM BlueGene supercomputer.
- Mining Statistically Significant Substrings using the Chi-Square Statistic
Mayank Sachan (Indian Institute of Technology Kanpur, India)
Arnab Bhattacharya (Indian Institute of Technology Kanpur, India)
The problem of identifying statistically significant patterns in a sequence of data has been applied to many domains such as intrusion detection systems, financial models, web-click records, automated monitoring systems, computational biology, cryptology, and text analysis. An observed pattern of events is deemed to be statistically significant if it is unlikely to have occurred due to randomness or chance alone. We use the chi-square statistic as a quantitative measure of statistical significance. Given a string of characters generated from a memoryless Bernoulli model, the problem is to identify the substring for which the empirical distribution of single letters deviates the most from the distribution expected from the generative Bernoulli model. This deviation is captured using the chi-square measure. The most significant substring (MSS) of a string is thus defined as the substring having the highest chi-square value. To date, to the best of our knowledge, no algorithm is known that finds the MSS in better than O(n^2) time, where n denotes the length of the string. In this paper, we propose an algorithm to find the most significant substring whose running time is O(n^{3/2}) with high probability. We also study some variants of this problem, such as finding the top-t set, finding all substrings having chi-square greater than a fixed threshold, and finding the MSS among substrings longer than a given length. We experimentally demonstrate the asymptotic behavior of the MSS on varying the string size and alphabet size. We also describe some applications of our algorithm in cryptology and on real-world data from finance and sports. Finally, we compare our technique with existing heuristics for finding the MSS.
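For orientation, the brute-force O(n^2 * |alphabet|) baseline enumerates every substring and scores it with the chi-square statistic against the Bernoulli model; the paper's O(n^{3/2}) algorithm is not reproduced here.

```python
# Brute-force search for the most significant substring (MSS) under the
# chi-square statistic against a memoryless Bernoulli model.
def most_significant_substring(s, probs):
    """probs: dict letter -> probability under the Bernoulli model."""
    letters = list(probs)
    best, best_span = -1.0, (0, 0)
    n = len(s)
    for i in range(n):
        counts = {c: 0 for c in letters}
        for j in range(i, n):
            counts[s[j]] += 1
            length = j - i + 1
            chi2 = 0.0
            for c in letters:
                expected = length * probs[c]
                chi2 += (counts[c] - expected) ** 2 / expected
            if chi2 > best:
                best, best_span = chi2, (i, j + 1)
    return s[best_span[0]:best_span[1]], best

if __name__ == "__main__":
    model = {"a": 0.5, "b": 0.5}
    print(most_significant_substring("ababbbbbbaab", model))
```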
- Learning Semantic String Transformations from Examples
Rishabh Singh (Massachusetts Institute of Technology, USA)
Sumit Gulwani (Microsoft Research, USA)
We address the problem of performing semantic transformations on strings, which may represent a variety of data types (or their combination) such as a column in a relational table, time, date, currency, etc. Unlike syntactic transformations, which are based on regular expressions and which interpret a string as a sequence of characters, semantic transformations additionally require exploiting the semantics of the data type represented by the string, which may be encoded as a database of relational tables. Manually performing such transformations on a large collection of strings is error prone and cumbersome, while programmatic solutions are beyond the skill-set of end-users. We present a programming by example technology that allows end-users to automate such repetitive tasks. We describe an expressive transformation language for semantic manipulation that combines table lookup operations and syntactic manipulations. We then present a synthesis algorithm that can learn all transformations in the language that are consistent with the user-provided set of input-output examples. We have implemented this technology as an add-in for the Microsoft Excel Spreadsheet system and have evaluated it successfully over several benchmarks picked from various Excel help-forums.
Research Session 25: Data Integration
- A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration
Bo Zhao (University of Illinois at Urbana-Champaign, USA)
Benjamin I. P. Rubinstein (Microsoft Research, USA)
Jim Gemmell (Microsoft Research, USA)
Jiawei Han (University of Illinois at Urbana-Champaign, USA)
In practical data integration systems, it is common for the data sources being integrated to provide conflicting information about the same entity. Consequently, a major challenge for data integration is to derive the most complete and accurate integrated records from diverse and sometimes conflicting sources. We term this challenge the truth finding problem. We observe that some sources are generally more reliable than others, and therefore a good model of source quality is the key to solving the truth finding problem. In this work, we propose a probabilistic graphical model that can automatically infer true records and source quality without any supervision. In contrast to previous methods, our principled approach leverages a generative process of two types of errors (false positive and false negative) by modeling two different aspects of source quality. In so doing, ours is also the first approach designed to merge multi-valued attribute types. Our method is scalable, due to an efficient sampling-based inference algorithm that needs very few iterations in practice and enjoys linear time complexity, with an even faster incremental variant. Experiments on two real world datasets show that our new method outperforms existing state-of-the-art approaches to the truth finding problem.
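To make the setting concrete, here is a deliberately simple iterative, reliability-weighted voting scheme; it is not the paper's Bayesian generative model (which handles two error types and multi-valued attributes), only a baseline flavor of truth finding.

```python
# Baseline truth finding: alternate between picking the trust-weighted
# majority value per entity and re-estimating each source's reliability.
def truth_finding(claims, iterations=20):
    """claims: list of (source, entity, value) triples."""
    sources = {s for s, _, _ in claims}
    trust = {s: 0.8 for s in sources}          # initial source reliability
    truth = {}
    for _ in range(iterations):
        # Score each candidate value by the trust of its supporters.
        scores = {}
        for s, e, v in claims:
            scores.setdefault(e, {}).setdefault(v, 0.0)
            scores[e][v] += trust[s]
        truth = {e: max(vals, key=vals.get) for e, vals in scores.items()}
        # A source's reliability is the (smoothed) fraction of its claims
        # that agree with the current truth estimate.
        hits, total = {s: 0 for s in sources}, {s: 0 for s in sources}
        for s, e, v in claims:
            total[s] += 1
            hits[s] += v == truth[e]
        trust = {s: (hits[s] + 1) / (total[s] + 2) for s in sources}
    return truth, trust

if __name__ == "__main__":
    data = [("s1", "Alice", "NYC"), ("s2", "Alice", "NYC"),
            ("s3", "Alice", "LA"),  ("s3", "Bob", "SF"),
            ("s1", "Bob", "SF")]
    print(truth_finding(data))
```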
- PARIS: Probabilistic Alignment of Relations, Instances, and Schema
Fabian M. Suchanek (INRIA Saclay, France)
Serge Abiteboul (INRIA Saclay, France)
Pierre Senellart (Télécom ParisTech, France)
One of the main challenges that the Semantic Web faces is the integration of a growing number of independently designed ontologies. In this work, we present PARIS, an approach for the automatic alignment of ontologies. PARIS aligns not only instances, but also relations and classes. Alignments at the instance level cross-fertilize with alignments at the schema level. Thereby, our system provides a truly holistic solution to the problem of ontology alignment. The heart of the approach is probabilistic, i.e., we measure degrees of matchings based on probability estimates. This allows PARIS to run without any parameter tuning. We demonstrate the efficiency of the algorithm and its precision through extensive experiments. In particular, we obtain a precision of around 90% in experiments with some of the world's largest ontologies.
- Learning Expressive Linkage Rules using Genetic Programming
Robert Isele (Freie Universität Berlin, Germany)
Christian Bizer (Freie Universität Berlin, Germany)
A central problem in data integration and data cleansing is to find entities in different data sources that describe the same real-world object. Many existing methods for identifying such entities rely on explicit linkage rules which specify the conditions that entities must fulfill in order to be considered to describe the same real-world object. In this paper, we present the GenLink algorithm for learning expressive linkage rules from a set of existing reference links using genetic programming. The algorithm is capable of generating linkage rules which select discriminative properties for comparison, apply chains of data transformations to normalize property values, choose appropriate distance measures and thresholds, and combine the results of multiple comparisons using non-linear aggregation functions. Our experiments show that the GenLink algorithm outperforms the state-of-the-art genetic programming approach to learning linkage rules recently presented by Carvalho et al. and is capable of learning linkage rules which achieve a similar accuracy as human-written rules for the same problem.
Research Session 26: Fundamentals and Theory
- Fundamentals of Order Dependencies
Jaroslaw Szlichta (York University, Canada)
Parke Godfrey (York University, Canada)
Jarek Gryz (York University, Canada)
Dependencies have played a significant role in database design for many years. They have also been shown to be useful in query optimization. In this paper, we discuss dependencies between lexicographically ordered sets of tuples. We introduce formally the concept of order dependency and present a set of axioms (inference rules) for them. We show how query rewrites based on these axioms can be used for query optimization. We present several interesting theorems that can be derived using the inference rules. We prove that functional dependencies are subsumed by order dependencies and that our set of axioms for order dependencies is sound and complete.
- Queries with Guarded Negation
Vince Barany (Technische Universität Darmstadt, Germany)
Balder ten Cate (University of California, Santa Cruz, USA)
Martin Otto (Technische Universität Darmstadt, Germany)
A well-established and fundamental insight in database theory is that negation (also known as complementation) tends to make queries difficult to process and difficult to reason about. Many basic problems are decidable and admit practical algorithms in the case of unions of conjunctive queries, but become difficult or even undecidable when queries are allowed to contain negation. Inspired by recent results in finite model theory, we consider a restricted form of negation, guarded negation. We introduce a fragment of SQL, called GN-SQL, as well as a fragment of Datalog with stratified negation, called GN-Datalog, that allow only guarded negation, and we show that these query languages are computationally well behaved, in terms of testing query containment, query evaluation, open-world query answering, and boundedness. GN-SQL and GN-Datalog subsume a number of well known query languages and constraint languages, such as unions of conjunctive queries, monadic Datalog, and frontier-guarded tgds. In addition, an analysis of standard benchmark workloads shows that many uses of negation in SQL in practice are guarded.
- Answering Queries using Views over Probabilistic XML: Complexity and Tractability
Bogdan Cautis (Telecom ParisTech, France)
Evgeny Kharlamov (Free University of Bozen-Bolzano, Italy)
We study the complexity of query answering using views in a probabilistic XML setting, identifying large classes of XPath queries -- with child and descendant navigation and predicates -- for which there are efficient (PTime) algorithms. We consider this problem under the two possible semantics for XML query results: with persistent node identifiers and in their absence. Accordingly, we consider rewritings that can exploit a single view, by means of compensation, and rewritings that can use multiple views, by means of intersection. Since in a probabilistic setting queries return answers with probabilities, the problem of rewriting goes beyond the classic one of retrieving XML answers from views. For both semantics of XML queries, we show that, even when XML answers can be retrieved from views, their probabilities may not be computable. For rewritings that use only compensation, we describe a PTime decision procedure, based on easily verifiable criteria that distinguish between the feasible cases -- when probabilistic XML results are computable -- and the unfeasible ones. For rewritings that can use multiple views, with compensation and intersection, we identify the most permissive conditions that make probabilistic rewriting feasible, and we describe an algorithm that is sound in general, and becomes complete under fairly permissive restrictions, running in PTime modulo worst-case exponential time equivalence tests. This is the best we can hope for since intersection makes query equivalence intractable already over deterministic data. Our algorithm runs in PTime whenever deterministic rewritings can be found in PTime.
Research Session 27: Streams
- Verifying Computations with Streaming Interactive Proofs
Graham Cormode (AT&T Labs - Research, USA)
Justin Thaler (Harvard University, USA)
Ke Yi (Hong Kong University of Science and Technology, Hong Kong)
When computation is outsourced, the data owner would like to be assured that the desired computation has been performed correctly by the service provider. In theory, proof systems can give the necessary assurance, but prior work is not sufficiently scalable or practical. In this paper, we develop new proof protocols for verifying computations which are streaming in nature: the verifier (data owner) needs only logarithmic space and a single pass over the input, and after observing the input follows a simple protocol with a prover (service provider) that takes logarithmic communication spread over a logarithmic number of rounds. These ensure that the computation is performed correctly: that the service provider has not made any errors or missed out some data. The guarantee is very strong: even if the service provider deliberately tries to cheat, there is only vanishingly small probability of doing so undetected, while a correct computation is always accepted. We first observe that some theoretical results can be modified to work with streaming verifiers, showing that there are efficient protocols for problems in the complexity classes NP and NC. Our main results then seek to bridge the gap between theory and practice by developing usable protocols for a variety of problems of central importance in streaming and database processing. All these problems require linear space in the traditional streaming model, and therefore our protocols demonstrate that adding a prover can exponentially reduce the effort needed by the verifier. Our experimental results show that our protocols are practical and scalable.
- Summarization and Matching of Density-Based Clusters in Streaming Environments
Di Yang (Oracle, USA)
Elke A. Rundensteiner (Worcester Polytechnic Institute, USA)
Matthew O. Ward (Worcester Polytechnic Institute, USA)
Density-based cluster mining is known to serve a broad range of applications ranging from stock trade analysis to moving object monitoring. Although methods for efficient extraction of density-based clusters have been studied in the literature, the problem of summarizing and matching such clusters with arbitrary shapes and complex cluster structures remains unsolved. Therefore, the goal of our work is to extend the state of the art of density-based cluster mining in streams from cluster extraction only to also support analysis and management of the extracted clusters. Our work solves three major technical challenges. First, we propose a novel multi-resolution cluster summarization method, called Skeletal Grid Summarization (SGS), which captures the key features of density-based clusters, covering both their external shape and internal cluster structures. Second, in order to summarize the extracted clusters in real time, we present an integrated computation strategy, C-SGS, which piggybacks the generation of cluster summarizations onto the online clustering process. Lastly, we design a mechanism to efficiently execute cluster matching queries, which identify, for a given cluster of interest to the analyst, similar clusters extracted earlier in the stream history. Our experimental study using real streaming data shows the clear superiority of our proposed methods over other potential alternatives in both efficiency and effectiveness for cluster summarization and cluster matching queries.
- Sketch-based Querying of Distributed Sliding-Window Data Streams
Odysseas Papapetrou (Technical University of Crete, Greece)
Minos Garofalakis (Technical University of Crete, Greece)
Antonios Deligiannakis (Technical University of Crete, Greece)
While traditional data-management systems focus on evaluating single, ad-hoc queries over static data sets in a centralized setting, several emerging applications require (possibly, continuous) answers to queries on dynamic data that is widely distributed and constantly updated. Furthermore, such query answers often need to discount data that is "stale", and operate solely on a sliding window of recent data arrivals (e.g., data updates occurring over the last 24 hours). Such distributed data streaming applications mandate novel algorithmic solutions that are both time- and space-efficient (to manage high-speed data streams), and also communication-efficient (to deal with physical data distribution). In this paper, we consider the problem of complex query answering over distributed, high-dimensional data streams in the sliding-window model. We introduce a novel sketching technique (termed ECM-sketch) that allows effective summarization of streaming data over both time-based and count-based sliding windows with probabilistic accuracy guarantees. Our sketch structure enables point as well as inner-product queries, and can be employed to address a broad range of problems, such as maintaining frequency statistics, finding heavy hitters, and computing quantiles in the sliding-window model. Focusing on distributed environments, we demonstrate how ECM-sketches of individual, local streams can be composed to generate a (low-error) ECM-sketch summary of the order-preserving aggregation of all streams; furthermore, we show how ECM-sketches can be exploited for continuous monitoring of sliding-window queries over distributed streams. Our extensive experimental study with two real-life data sets validates our theoretical claims and verifies the effectiveness of our techniques. To the best of our knowledge, ours is the first work to address efficient, guaranteed-error complex query answering over distributed data streams in the sliding-window model.
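For context, the Count-Min sketch that ECM-sketches build on is shown below; the exponential-histogram counters that add sliding-window semantics and the composition across distributed sites, which are the paper's contributions, are omitted from this sketch.

```python
# Plain Count-Min sketch: the frequency-summary building block, without the
# sliding-window counters or distributed composition.
import hashlib

class CountMinSketch:
    def __init__(self, width=2048, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Point query: may overestimate (bounded error), never underestimates.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

if __name__ == "__main__":
    cms = CountMinSketch()
    for ip in ["10.0.0.1"] * 500 + ["10.0.0.2"] * 20:
        cms.add(ip)
    print(cms.estimate("10.0.0.1"), cms.estimate("10.0.0.2"))
```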
Research Session 28: Indexing
- Indexing the Earth Mover's Distance Using Normal Distributions
Brian E. Ruttenberg (University of California, Santa Barbara, USA)
Ambuj K. Singh (University of California, Santa Barbara, USA)
Querying uncertain data sets (represented as probability distributions) presents many challenges due to the large amount of data involved and the difficulties of comparing uncertainty between distributions. The Earth Mover's Distance (EMD) has increasingly been employed to compare uncertain data due to its ability to effectively capture the differences between two distributions. Computing the EMD entails finding a solution to the transportation problem, which is computationally intensive. In this paper, we propose a new lower bound to the EMD and an index structure to significantly improve the performance of EMD-based K-nearest neighbor (K-NN) queries on uncertain databases. The lower bound approximates the EMD on a projection vector: each distribution is projected onto a vector and approximated by a normal distribution, together with an accompanying error term. We then represent each normal as a point in a Hough-transformed space and use the concept of stochastic dominance to implement an efficient index structure in the transformed space. We show that our method significantly decreases K-NN query time on uncertain databases. The index structure also scales well with database cardinality. It is well suited for heterogeneous data sets, helping to keep EMD-based queries tractable as uncertain data sets become larger and more complex.
- Stochastic Database Cracking: Towards Robust Adaptive Indexing in Main-Memory Column-Stores
Felix Halim (National University of Singapore, Singapore)
Stratos Idreos (Centrum Wiskunde & Informatica, Netherlands)
Panagiotis Karras (Rutgers University, USA)
Roland H. C. Yap (National University of Singapore, Singapore)
Modern business applications and scientific databases call for inherently dynamic data storage environments. Such environments are characterized by two challenging features: (a) they have little idle system time to devote to physical design; and (b) there is little, if any, a priori workload knowledge, while the query and data workload keeps changing dynamically. In such environments, traditional approaches to index building and maintenance cannot apply. Database cracking has been proposed as a solution that allows on-the-fly physical data reorganization, as a collateral effect of query processing. Cracking aims to continuously and automatically adapt indexes to the workload at hand, without human intervention. Indexes are built incrementally, adaptively, and on demand. Nevertheless, as we show, existing adaptive indexing methods fail to deliver workload-robustness; they perform much better with random workloads than with others. This frailty derives from the inelasticity with which these approaches interpret each query as a hint on how data should be stored. Current cracking schemes blindly reorganize the data within each query's range, even if that results in successive expensive operations with minimal indexing benefit. In this paper, we introduce stochastic cracking, a significantly more resilient approach to adaptive indexing. Stochastic cracking also uses each query as a hint on how to reorganize data, but not blindly so; it gains resilience and avoids performance bottlenecks by deliberately applying certain arbitrary choices in its decision-making. Thereby, we bring adaptive indexing forward to a mature formulation that confers the workload-robustness previous approaches lacked. Our extensive experimental study verifies that stochastic cracking maintains the desired properties of original database cracking while at the same time it performs well with diverse realistic workloads.
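A toy illustration of the cracking idea with a stochastic twist appears below: each range query physically partitions the column around its bounds, and an extra random pivot is cracked as well so that skewed query sequences still refine the layout. The real column-store machinery and the paper's specific stochastic variants are not modeled; this only shows the in-place partitioning idea.

```python
# Toy adaptive cracking of a single column; crack positions double as an
# ever-finer index that every range query refines a little more.
import bisect
import random

class CrackedColumn:
    def __init__(self, values):
        self.data = list(values)
        self.pivots = []      # sorted pivot values already cracked on
        self.positions = []   # positions[i] = first index holding a value >= pivots[i]

    def _crack(self, pivot):
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.positions[i]
        lo = self.positions[i - 1] if i > 0 else 0
        hi = self.positions[i] if i < len(self.positions) else len(self.data)
        piece = self.data[lo:hi]
        smaller = [v for v in piece if v < pivot]
        self.data[lo:hi] = smaller + [v for v in piece if v >= pivot]
        self.pivots.insert(i, pivot)
        self.positions.insert(i, lo + len(smaller))
        return lo + len(smaller)

    def range_query(self, low, high, stochastic=True):
        a, b = self._crack(low), self._crack(high)
        result = self.data[a:b]
        if stochastic and self.data:
            # The stochastic twist: also crack on a random pivot so that a
            # skewed query sequence cannot leave huge unindexed pieces.
            self._crack(random.choice(self.data))
        return result

if __name__ == "__main__":
    random.seed(1)
    col = CrackedColumn(random.sample(range(10000), 10000))
    print(len(col.range_query(100, 200)))            # 100 values in [100, 200)
    print(len(col.pivots), "crack positions so far")
```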
- A Moving-Object Index for Efficient Query Processing with Peer-Wise Location Privacy
Dan Lin (Missouri University of Science and Technology, USA)
Christian S. Jensen (Aarhus University, Denmark)
Rui Zhang (The University of Melbourne, Australia)
Lu Xiao (Missouri University of Science and Technology, USA)
Jiaheng Lu (Renmin University, China)
With the growing use of location-based services, location privacy attracts increasing attention from users, industry, and the research community. While considerable effort has been devoted to inventing techniques that prevent service providers from knowing a user's exact location, relatively little attention has been paid to enabling so-called peer-wise privacy--the protection of a user's location from unauthorized peer users. This paper identifies an important efficiency problem in existing peer-privacy approaches that simply apply a filtering step to identify users that are located in a query range, but that do not want to disclose their location to the querying peer. To solve this problem, we propose a novel, privacy-policy enabled index called the PEB-tree that seamlessly integrates location proximity and policy compatibility. We propose efficient algorithms that use the PEB-tree for processing privacy-aware range and kNN queries. Extensive experiments suggest that the PEB-tree enables efficient query processing.
Research Session 29: Probabilistic Databases
- Aggregation in Probabilistic Databases via Knowledge Compilation
Robert Fink (University of Oxford, UK)
Larisa Han (University of Oxford, UK)
Dan Olteanu (University of Oxford, UK)
This paper presents a query evaluation technique for positive relational algebra queries with aggregates on a representation system for probabilistic data based on the algebraic structures of semiring and semimodule. The core of our evaluation technique is a procedure that compiles semimodule and semiring expressions into so-called decomposition trees, for which the computation of the probability distribution can be done in time linear in the product of the sizes of the probability distributions represented by its nodes. We give syntactic characterisations of tractable queries with aggregates by exploiting the connection between query tractability and polynomial-time decomposition trees. A prototype of the technique is incorporated in the probabilistic database engine SPROUT. We report on performance experiments with custom datasets and TPC-H data.
- Probabilistic Management of OCR Data using an RDBMS
Arun Kumar (University of Wisconsin - Madison, USA)
Christopher Ré (University of Wisconsin - Madison, USA)
The digitization of scanned forms and documents is changing the data sources that enterprises manage. To integrate these new data sources with enterprise data, the current state-of-the-art approach is to convert the images to ASCII text using optical character recognition (OCR) software and then to store the resulting ASCII text in a relational database. The OCR problem is challenging, and so the output of OCR often contains errors. In turn, queries on the output of OCR may fail to retrieve relevant answers. State-of-the-art OCR programs, e.g., the OCR powering Google Books, use a probabilistic model that captures many alternatives during the OCR process. Only when the results of OCR are stored in the database do these approaches discard the uncertainty. In this work, we propose to retain the probabilistic models produced by the OCR process in a relational database management system. A key technical challenge is that the probabilistic data produced by OCR software is very large (a single book blows up to 2GB from 400kB as ASCII). As a result, a baseline solution that integrates these models with an RDBMS is over 1000x slower than standard text processing for single-table select-project queries. However, many applications may have quality-performance needs that are in between these two extremes of ASCII and the complete model output by the OCR software. Thus, we propose a novel approximation scheme called STACCATO that allows a user to trade recall for query performance. Additionally, we provide a formal analysis of our scheme's properties, and describe how we integrate our scheme with standard RDBMS text indexing.
- Probabilistic Databases with MarkoViews
Abhay Jha (University of Washington, USA)
Dan Suciu (University of Washington, USA)
Most of the work on query evaluation in probabilistic databases has focused on the simple tuple-independent data model, where tuples are independent random events. Several efficient query evaluation techniques exist in this setting, such as safe plans, algorithms based on OBDDs, tree decomposition, and a variety of approximation algorithms. However, complex data analytics tasks often require complex correlations, and query evaluation then becomes significantly more expensive or more restrictive. In this paper, we propose MVDB as a framework both for representing complex correlations and for efficient query evaluation. An MVDB specifies correlations by defining views, called MarkoViews, over the probabilistic relations and declaring the weights of the views' outputs. An MVDB is a (very large) Markov Logic Network. We make two sets of contributions. First, we show that query evaluation on an MVDB is equivalent to evaluating a Union of Conjunctive Queries (UCQ) over a tuple-independent database. The translation is exact (thus allowing the techniques developed for tuple-independent databases to be carried over to MVDBs), yet it is novel and quite non-obvious (some resulting probabilities may be negative!). This translation by itself may not lead to much gain, however, since the translated query becomes more complicated as we try to capture more correlations. Our second contribution is to propose a new query evaluation strategy that exploits offline compilation to speed up online query evaluation. Here we utilize and extend our prior work on compilation of UCQ. We experimentally validate our techniques on a large probabilistic database with MarkoViews inferred from the DBLP data.
- Uncertain Centroid based Partitional Clustering of Uncertain Data
Francesco Gullo (University of Calabria, Italy)
Andrea Tagarelli (University of Calabria, Italy)
Clustering uncertain data has emerged as a challenging task in uncertain data management and mining. Thanks to a computational complexity advantage over other clustering paradigms, partitional clustering has been particularly studied and a number of algorithms have been developed. While existing proposals differ mainly in the notions of cluster centroid and clustering objective function, little attention has been given to an analysis of their characteristics and limits. In this work, we theoretically investigate major existing methods of partitional clustering and, as an alternative, propose a well-founded approach to clustering uncertain data based on a novel notion of cluster centroid. A cluster centroid is seen as an uncertain object defined in terms of a random variable whose realizations are derived from all deterministic representations of the objects to be clustered. As demonstrated theoretically and experimentally, this better represents a cluster of uncertain objects, supporting consistently improved clustering performance while maintaining efficiency comparable to existing partitional clustering algorithms.
Research Session 30: Social Networks
- A Data-Based Approach to Social Influence Maximization
Amit Goyal (University of British Columbia, Canada)
Francesco Bonchi (Yahoo! Research, Spain)
Laks V. S. Lakshmanan (University of British Columbia, Canada)
Influence maximization is the problem of finding a set of users in a social network, such that by targeting this set, one maximizes the expected spread of influence in the network. Most of the literature on this topic has focused exclusively on the social graph, overlooking historical data, i.e., traces of past action propagations. In this paper, we study influence maximization from a novel data-based perspective. In particular, we introduce a new model, which we call credit distribution, that directly leverages available propagation traces to learn how influence flows in the network and uses this to estimate expected influence spread. Our approach also learns the different levels of influenceability of users, and it is time-aware in the sense that it takes the temporal nature of influence into account. We show that influence maximization under the credit distribution model is NP-hard and that the function that defines expected spread under our model is submodular. Based on these, we develop an approximation algorithm for solving the influence maximization problem that at once enjoys high accuracy compared to the standard approach, while being several orders of magnitude faster and more scalable.
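Since the expected spread under the credit distribution model is submodular, the natural baseline is greedy seed selection. The sketch below shows that generic greedy loop; the toy, trace-based spread estimate is only a stand-in for the paper's estimator.

```python
# Greedy seed selection for a monotone submodular spread function.
def greedy_seed_selection(candidates, spread, k):
    """Pick k seeds, each time adding the user with the largest marginal gain."""
    seeds = set()
    for _ in range(k):
        best_user, best_gain = None, -1.0
        for u in candidates - seeds:
            gain = spread(seeds | {u}) - spread(seeds)
            if gain > best_gain:
                best_user, best_gain = u, gain
        seeds.add(best_user)
    return seeds

# Toy propagation traces: user -> set of users they influenced in the log.
traces = {"a": {"b", "c"}, "b": {"c", "d"}, "c": set(), "d": {"e"}, "e": set()}

def toy_spread(seeds):
    # Count seeds plus everyone they influenced at least once in the traces.
    influenced = set(seeds)
    for s in seeds:
        influenced |= traces.get(s, set())
    return len(influenced)

print(greedy_seed_selection(set(traces), toy_spread, k=2))  # pick two seeds
```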
- The Complexity of Social Coordination
Konstantinos Mamouras (Cornell University, USA)
Sigal Oren (Cornell University, USA)
Lior Seeman (Cornell University, USA)
Lucja Kot (Cornell University, USA)
Johannes Gehrke (Cornell University, USA)
Coordination is a challenging everyday task; just think of the last time you organized a party or a meeting involving several people. As a growing part of our social and professional life goes online, an opportunity for an improved coordination process arises. Recently, Gupta et al. proposed entangled queries as a declarative abstraction for data-driven coordination, where the difficulty of the coordination task is shifted from the user to the database. Unfortunately, evaluating entangled queries is very hard, and thus previous work considered only a restricted class of queries that satisfy safety (the coordination partners are fixed) and uniqueness (all queries need to be satisfied). In this paper we significantly extend the class of feasible entangled queries beyond uniqueness and safety. First, we show that we can simply drop uniqueness and still efficiently evaluate a set of safe entangled queries. Second, we show that as long as all users coordinate on the same set of attributes, we can give an efficient algorithm for coordination even if the set of queries does not satisfy safety. In an experimental evaluation we show that our algorithms are feasible for a wide spectrum of coordination scenarios.
- Who Tags What? An Analysis Framework
Mahashweta Das (University of Texas at Arlington, USA)
Saravanan Thirumuruganathan (University of Texas at Arlington, USA)
Sihem Amer-Yahia (Qatar Computing Research Institute, Qatar)
Gautam Das (University of Texas at Arlington, USA)
Cong Yu (Google Inc., USA)
The rise of Web 2.0 is signaled by sites such as Flickr, del.icio.us, and YouTube, and social tagging is essential to their success. A typical tagging action involves three components: user, item (e.g., photos in Flickr), and tags (i.e., words or phrases). Analyzing how tags are assigned by certain users to certain items has important implications in helping users search for desired information. In this paper, we explore common analysis tasks and propose a dual mining framework for social tagging behavior mining. This framework is centered around two opposing measures, similarity and diversity, applied to one or more tagging components, and therefore enables a wide range of analysis scenarios, such as characterizing similar users tagging diverse items with similar tags, or diverse users tagging similar items with diverse tags. By adopting different concrete measures for similarity and diversity in the framework, we show that a wide range of concrete analysis problems can be defined and that they are NP-Complete in general. We design efficient algorithms for solving many of those problems and demonstrate, through comprehensive experiments over real data, that our algorithms significantly outperform the exact brute-force approach without compromising analysis result quality.
- Whom to Ask? Jury Selection for Decision Making Tasks on Micro-blog Services
Caleb Chen CAO (Hong Kong University of Science and Technology, Hong Kong)
Jieying She (Hong Kong University of Science and Technology, Hong Kong)
Yongxin Tong (Hong Kong University of Science and Technology, Hong Kong)
Lei Chen (Hong Kong University of Science and Technology, Hong Kong)
It is common to see people obtain knowledge on micro-blog services by asking others decision-making questions. In this paper, we study the Jury Selection Problem (JSP), which uses crowdsourcing for decision-making tasks on micro-blog services. Specifically, the problem is to enroll a subset of the crowd under a limited budget, such that its aggregated wisdom via the Majority Voting scheme has the lowest probability of drawing a wrong answer (the Jury Error Rate, JER). Due to the varying individual error rates of the crowd, calculating JER is non-trivial. First, we explicitly state that JER is the probability that the number of wrong jurors is larger than half the size of the jury. To avoid the exponential cost of computing JER directly, we propose two efficient algorithms and an effective bounding technique. Furthermore, we study the Jury Selection Problem under two crowdsourcing models: one for altruistic users (AltrM) and one for incentive-requiring users (PayM) who require extra payment when enrolled into a task. For the AltrM model, we prove the monotonicity of JER with respect to individual error rates and propose an efficient exact algorithm for JSP. For the PayM model, we prove the NP-hardness of JSP and propose an efficient greedy-based heuristic algorithm. Finally, we conduct a series of experiments to investigate the traits of JSP, and validate the efficiency and effectiveness of our proposed algorithms on both synthetic and real micro-blog data.
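As a concrete illustration of the quantity being minimized, the sketch below computes the Jury Error Rate exactly by dynamic programming over the number of wrong jurors, assuming independent jurors with known individual error rates; the paper's bounding techniques and budget constraint are not reproduced.

```python
# Exact JER by dynamic programming over the number of wrong jurors.
def jury_error_rate(error_rates):
    """P(more than half of the jurors are wrong), jurors assumed independent."""
    n = len(error_rates)
    dist = [1.0]                          # dist[w] = P(exactly w jurors wrong so far)
    for p in error_rates:
        new = [0.0] * (len(dist) + 1)
        for w, prob in enumerate(dist):
            new[w] += prob * (1 - p)      # this juror answers correctly
            new[w + 1] += prob * p        # this juror answers wrongly
        dist = new
    return sum(prob for w, prob in enumerate(dist) if w > n // 2)

# Three jurors with different reliabilities.
print(jury_error_rate([0.1, 0.2, 0.3]))   # majority-vote error probability: 0.098
```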
Research Session 31: Trees, Hierarchies and Taxonomies
- Mining Flipping Correlations from Large Datasets with Taxonomies
Marina Barsky (University of Victoria, Canada)
Sangkyum Kim (University of Illinois at Urbana-Champaign, USA)
Tim Weninger (University of Illinois at Urbana-Champaign, USA)
Jiawei Han (University of Illinois at Urbana-Champaign, USA)
In this paper we introduce a new type of pattern -- a flipping correlation pattern. The flipping patterns are obtained from contrasting the correlations between items at different levels of abstraction. They represent surprising correlations, both positive and negative, which are specific for a given abstraction level, and which "flip" from positive to negative and vice versa when items are generalized to a higher level of abstraction. We design an efficient algorithm for finding flipping correlations, the Flipper algorithm, which outperforms naive pattern mining methods by several orders of magnitude. We apply Flipper to real-life datasets and show that the discovered patterns are non-redundant, surprising and actionable. Flipper finds strong contrasting correlations in itemsets with low-to-medium support, while existing techniques cannot handle the pattern discovery in this frequency range.
- Supercharging Recommender Systems using Taxonomies for Learning User Purchase Behavior
Bhargav Kanagal (Yahoo! Research, USA)
Amr Ahmed (Yahoo! Research, USA)
Sandeep Pandey (Yahoo! Research, USA)
Vanja Josifovski (Yahoo! Research, USA)
Jeff Yuan (Yahoo! Research, USA)
Lluis Garcia-Pueyo (Yahoo! Research, USA)
Recommender systems based on latent factor models have been effectively used for understanding user interests and predicting future actions. Such models work by projecting the users and items into a smaller dimensional space, thereby clustering similar users and items together and subsequently computing similarity between unknown user-item pairs. When user-item interactions are sparse (sparsity problem) or when new items continuously appear (cold start problem), these models perform poorly. In this paper, we exploit the combination of taxonomies and latent factor models to mitigate these issues and improve recommendation accuracy. We observe that taxonomies provide structure similar to that of a latent factor model: namely, a taxonomy imposes human-labeled categories (clusters) over items. This leads to our proposed taxonomy-aware latent factor model (TF), which combines taxonomies and latent factors using additive models. We develop efficient algorithms to train the TF models, which scale to large numbers of users and items, and develop scalable inference/recommendation algorithms by exploiting the structure of the taxonomy. In addition, we extend the TF model to account for the temporal dynamics of user interests using high-order Markov chains. To deal with large-scale data, we develop a parallel multi-core implementation of our TF model. We empirically evaluate the TF model for the task of predicting user purchases using a real-world shopping dataset spanning more than a million users and products. Our experiments demonstrate the benefits of using our TF models over existing approaches, in terms of both prediction accuracy and running time.
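The additive idea can be made concrete with a tiny scoring function in which an item's representation is its own latent vector plus the vector of its taxonomy category, so that sparse or new items inherit category-level signal. The vectors and taxonomy below are invented for illustration, and no training is shown.

```python
# Additive taxonomy-aware scoring: item factor + category factor.
user_factors = {"alice": [0.9, 0.1]}
item_factors = {"camera_x": [0.05, 0.0],   # barely trained item (cold start)
                "camera_y": [0.40, 0.2]}
category_factors = {"electronics": [0.7, 0.3]}
item_category = {"camera_x": "electronics", "camera_y": "electronics"}

def score(user, item):
    u = user_factors[user]
    cat = category_factors[item_category[item]]
    # Additive combination: the item-specific factor plus its category factor.
    v = [iv + cv for iv, cv in zip(item_factors[item], cat)]
    return sum(ui * vi for ui, vi in zip(u, v))

for item in ("camera_x", "camera_y"):
    print(item, round(score("alice", item), 3))
```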
- Efficient Indexing and Querying over Syntactically Annotated Trees
Pirooz Chubak (University of Alberta, Canada)
Davood Rafiei (University of Alberta, Canada)
Natural language text corpora are often available as sets of syntactically parsed trees. A wide range of expressive tree queries are possible over such parsed trees, opening a new avenue for searching over natural language text. Such queries not only allow for querying roles and relationships within sentences, but also improve search effectiveness compared to flat keyword queries. One major drawback of current systems that support querying over parsed text is the performance of evaluating queries over large data. In this paper we propose a novel indexing scheme that uses unique subtrees as index keys. We also propose a novel root-split coding scheme that stores subtree structural information only partially, thus reducing index size and improving querying performance. Our extensive set of experiments shows that root-split coding reduces the index size of any interval coding that stores individual node numbers by 50% to 80%, depending on the sizes of the subtrees indexed. Moreover, we show that our index using root-split coding outperforms previous approaches by at least an order of magnitude in terms of query response time.
- RTED: A Robust Algorithm for the Tree Edit Distance
Mateusz Pawlik (Free University of Bozen-Bolzano, Italy)
Nikolaus Augsten (Free University of Bozen-Bolzano, Italy)
We consider the classical tree edit distance between ordered labeled trees, which is defined as the minimum-cost sequence of node edit operations that transforms one tree into another. The state-of-the-art solutions for the tree edit distance are not satisfactory. The main competitors in the field either have optimal worst-case complexity, but the worst case happens frequently, or they are very efficient for some tree shapes but degenerate for others. This leads to unpredictable and often infeasible runtimes. There is no obvious way to choose between the algorithms. In this paper we present RTED, a robust tree edit distance algorithm. The asymptotic complexity of RTED is smaller than or equal to the complexity of the best competitors for any input instance, i.e., RTED is both efficient and worst-case optimal. We introduce the class of LRH (Left-Right-Heavy) algorithms, which includes RTED and the fastest tree edit distance algorithms presented in the literature. We prove that RTED outperforms all previously proposed LRH algorithms in terms of runtime complexity. In our experiments on synthetic and real-world data we empirically evaluate our solution and compare it to the state of the art.
Research Session 32: Similarity Search and Ranking II
- SEAL: Spatio-Textual Similarity Search
Ju Fan (Tsinghua University, China)
Guoliang Li (Tsinghua University, China)
Lizhu Zhou (Tsinghua University, China)
Shanshan Chen (Tsinghua University, China)
Jun Hu (Tsinghua University, China)
Location-based services (LBS) have become increasingly ubiquitous. Existing methods focus on finding relevant points-of-interest (POIs) based on users' locations and query keywords. Nowadays, modern LBS applications generate a new kind of spatio-textual data, regions-of-interest (ROIs), containing region-based spatial information and textual descriptions, e.g., mobile user profiles with active regions and interest tags. To satisfy search requirements on ROIs, we study a new research problem, called spatio-textual similarity search: given a set of ROIs and a query ROI, find the similar ROIs by considering spatial overlap and textual similarity. Spatio-textual similarity search has many important applications, e.g., social marketing in location-aware social networks. It calls for an efficient search method to support large-scale spatio-textual data in LBS systems. To this end, we introduce a filter-and-verification framework to compute the answers. In the filter step, we generate signatures for the ROIs and the query, and utilize the signatures to generate candidates whose signatures are similar to that of the query. In the verification step, we verify the candidates and identify the final answers. To achieve high performance, we generate effective, high-quality signatures, and devise efficient filtering algorithms as well as pruning techniques. Experimental results on real and synthetic datasets show that our method achieves high performance.
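As a rough illustration of what a spatio-textual similarity between ROIs might combine, the sketch below scores two rectangular regions by intersection-over-union and their tag sets by Jaccard similarity, with an arbitrary weighting. The concrete similarity definition and the signature-based filtering of the paper are not reproduced here.

```python
# Toy spatio-textual similarity: spatial overlap plus tag-set Jaccard.
def rect_overlap(a, b):
    """Intersection-over-union of two axis-aligned rectangles (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def tag_jaccard(t1, t2):
    return len(t1 & t2) / len(t1 | t2) if (t1 | t2) else 0.0

def roi_similarity(roi1, roi2, alpha=0.5):
    (region1, tags1), (region2, tags2) = roi1, roi2
    return alpha * rect_overlap(region1, region2) + (1 - alpha) * tag_jaccard(tags1, tags2)

query = ((0, 0, 2, 2), {"coffee", "wifi"})
candidate = ((1, 1, 3, 3), {"coffee", "music"})
print(round(roi_similarity(query, candidate), 3))
```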
- V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors
Ahmed Metwally (Google Inc., USA)
Christos Faloutsos (Carnegie Mellon University, USA)
This work proposes V-SMART-Join, a scalable MapReduce-based framework for discovering all pairs of similar entities. The V-SMART-Join framework is applicable to sets, multisets, and vectors. V-SMART-Join is motivated by the observed skew in the underlying distributions of Internet traffic, and is a family of two-stage algorithms, where the first stage computes and joins the partial results, and the second stage computes the similarity exactly for all candidate pairs. The V-SMART-Join algorithms are very efficient and scalable in the number of entities, as well as their cardinalities. They were up to 30 times faster than the state-of-the-art algorithm, VCL, when compared on a small real dataset. We also established the scalability of the proposed algorithms by running them on a dataset of realistic size, on which VCL never finished. Experiments were run using real datasets of IPs and cookies, where each IP is represented as a multiset of cookies, and the goal is to discover similar IPs to identify Internet proxies.
- PASS-JOIN: A Partition-based Method for Similarity Joins
Guoliang Li (Tsinghua University, China)
Dong Deng (Tsinghua University, China)
Jiannan Wang (Tsinghua University, China)
Jianhua Feng (Tsinghua University, China)
As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this paper, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings, and there is no algorithm that can efficiently and adaptively support both short strings and long strings. To address this problem, we propose a partition-based method called Pass-Join. Pass-Join partitions a string into a set of segments and creates inverted indices for the segments. Then for each string, Pass-Join selects some of its substrings and uses the selected substrings to find candidate pairs using the inverted indices. We devise efficient techniques to select the substrings and prove that our method can minimize the number of selected substrings. We develop novel pruning techniques to efficiently verify the candidate pairs. Experimental results show that our algorithms are efficient for both short strings and long strings, and outperform state-of-the-art methods on real datasets.
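The pigeonhole principle behind the partitioning can be sketched in a few lines: if a string is split into tau + 1 segments and another string is within edit distance tau, the latter must contain at least one of those segments verbatim. The toy filter below uses only that observation; Pass-Join's substring selection and verification optimizations are omitted.

```python
# Pigeonhole filter for edit-distance joins: split into tau + 1 segments.
def partition(s, tau):
    """Split s into tau + 1 nearly equal segments."""
    n, k = len(s), tau + 1
    base, extra = divmod(n, k)
    segments, start = [], 0
    for i in range(k):
        length = base + (1 if i < extra else 0)
        segments.append(s[start:start + length])
        start += length
    return segments

def candidate_pair(s, t, tau):
    """Cheap filter: keep (s, t) only if some segment of s occurs in t."""
    return any(seg and seg in t for seg in partition(s, tau))

print(candidate_pair("vldb2012", "vldb_2012", tau=1))   # survives the filter
print(candidate_pair("sigmod", "icde", tau=1))          # pruned
```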
- Efficient Processing of k Nearest Neighbor Joins using MapReduce
Wei Lu (National University of Singapore, Singapore)
Yanyan Shen (National University of Singapore, Singapore)
Su Chen (National University of Singapore, Singapore)
Beng Chin Ooi (National University of Singapore, Singapore)
k nearest neighbor join (kNN join), designed to find k nearest neighbors from a dataset S for every object in another dataset R, is a primitive operation widely adopted by many data mining applications. As a combination of the k nearest neighbor query and the join operation, kNN join is an expensive operation. Given the increasing volume of data, it is difficult to perform a kNN join on a centralized machine efficiently. In this paper, we investigate how to perform kNN join using MapReduce which is a well-accepted framework for data-intensive applications over clusters of computers. In brief, the mappers cluster objects into groups; the reducers perform the kNN join on each group of objects separately. We design an effective mapping mechanism that exploits pruning rules for distance filtering, and hence reduces both the shuffling and computational costs. To reduce the shuffling cost, we propose two approximate algorithms to minimize the number of replicas. Extensive experiments on our in-house cluster demonstrate that our proposed methods are efficient, robust and scalable.
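A toy, single-machine version of the group-then-join pattern is sketched below: objects are assigned to groups by their nearest pivot and a brute-force kNN join runs inside each group. A correct distributed kNN join must also replicate boundary objects across groups, which is exactly what the paper's mapping mechanism and pruning rules address and this sketch ignores.

```python
# Group-then-join sketch (each group plays the role of one reducer).
from math import dist  # Euclidean distance, Python 3.8+

def group_by_pivot(points, pivots):
    groups = {i: [] for i in range(len(pivots))}
    for p in points:
        nearest = min(range(len(pivots)), key=lambda i: dist(p, pivots[i]))
        groups[nearest].append(p)
    return groups

def knn_join_within_groups(R, S, pivots, k):
    r_groups, s_groups = group_by_pivot(R, pivots), group_by_pivot(S, pivots)
    result = {}
    for g, r_points in r_groups.items():
        for r in r_points:
            # Brute-force kNN restricted to the same group of S objects.
            result[r] = sorted(s_groups[g], key=lambda s: dist(r, s))[:k]
    return result

R = [(0, 0), (1, 1), (9, 9)]
S = [(0, 1), (2, 2), (8, 8), (10, 10)]
print(knn_join_within_groups(R, S, pivots=[(0, 0), (10, 10)], k=2))
```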
Research Session 33: Web and IR II
- View Selection in Semantic Web Databases
François Goasdoué (INRIA Saclay, France)
Konstantinos Karanasos (INRIA Saclay, France)
Julien Leblay (INRIA Saclay, France)
Ioana Manolescu (INRIA Saclay, France)
We consider the setting of a Semantic Web database, containing both explicit data encoded in RDF triples, and implicit data, implied by the RDF semantics. Based on a query workload, we address the problem of selecting a set of views to be materialized in the database, minimizing a combination of query processing, view storage, and view maintenance costs. Starting from an existing relational view selection method, we devise new algorithms for recommending view sets, and show that they scale significantly beyond the existing relational ones when adapted to the RDF context. To account for implicit triples in query answers, we propose a novel RDF query reformulation algorithm and an innovative way of incorporating it into view selection in order to avoid a combinatorial explosion in the complexity of the selection process. The interest of our techniques is demonstrated through a set of experiments.
- Type-Based Detection of XML Query-Update Independence
Nicole Bidoit-Tollu (Universite Paris Sud & INRIA Saclay, France)
Dario Colazzo (Universite Paris Sud & INRIA Saclay, France)
Federico Ulliana (Universite Paris Sud & INRIA Saclay, France)
This paper presents a novel static analysis technique to detect XML query-update independence in the presence of a schema. Rather than types, our system infers chains of types. Each chain represents a path that can be traversed on a valid document during query/update evaluation. The resulting independence analysis is precise, although it raises a challenging issue: recursive schemas may lead to inferring infinitely many chains. A sound and complete approximation technique ensuring a finite analysis in every case is presented, together with an efficient implementation performing the chain-based analysis in polynomial space and time.
- Adding Logical Operators to Tree Pattern Queries on Graph-Structured Data
Qiang Zeng (Chinese Academy of Sciences, China)
Xiaorui Jiang (Chinese Academy of Sciences, China)
Hai Zhuge (Chinese Academy of Sciences, China)
As data are increasingly modeled as graphs for expressing complex relationships, the tree pattern query on graph-structured data becomes an important type of queries in real-world applications. Most practical query languages, such as XQuery and SPARQL, support logical expressions using logical-AND/OR/NOT operators to define structural constraints of tree patterns. In this paper, (1) we propose generalized tree pattern queries (GTPQs) over graph-structured data, which fully support propositional logic of structural constraints. (2) We make a thorough study of fundamental problems including satisfiability, containment and minimization, and analyze the computational complexity and the decision procedures of these problems. (3) We propose a compact graph representation of intermediate results and a pruning approach to reduce the size of intermediate results and the number of join operations -- two factors that often impair the efficiency of traditional algorithms for evaluating tree pattern queries. (4) We present an efficient algorithm for evaluating GTPQs using 3-hop as the underlying reachability index. (5) Experiments on both real-life and synthetic data sets demonstrate the effectiveness and efficiency of our algorithm, from several times to orders of magnitude faster than state-of-the-art algorithms in terms of evaluation time, even for traditional tree pattern queries with only conjunctive operations.
- Keyword-aware Optimal Route Search
Xin Cao (Nanyang Technological University, Singapore)
Lisi Chen (Nanyang Technological University, Singapore)
Gao Cong (Nanyang Technological University, Singapore)
Xiaokui Xiao (Nanyang Technological University, Singapore)
Identifying a preferable route is an important problem that finds applications in map services. When a user plans a trip within a city, the user may want to find "a most popular route such that it passes by a shopping mall, a restaurant, and a pub, and the travel time to and from his hotel is within 4 hours." However, none of the algorithms in the existing work on route planning can be used to answer such queries. Motivated by this, we define the problem of the keyword-aware optimal route query, denoted by KOR, which is to find an optimal route such that it covers a set of user-specified keywords, a specified budget constraint is satisfied, and an objective score of the route is optimal. The problem of answering KOR queries is NP-hard. We devise an approximation algorithm, OSScaling, with provable approximation bounds. Based on this algorithm, another more efficient approximation algorithm, BucketBound, is proposed. We also design a greedy approximation algorithm. Results of empirical studies show that all the proposed algorithms are capable of answering KOR queries efficiently, with the BucketBound and Greedy algorithms running faster. The empirical studies also offer insight into the accuracy of the proposed algorithms.
Research Session 34: Graph Similarity Search
- SODA: Generating SQL for Business Users
Lukas Blunschi (ETH Zurich, Switzerland)
Claudio Jossen (Credit Suisse AG, Switzerland)
Donald Kossmann (ETH Zurich, Switzerland)
Magdalini Mori (Credit Suisse AG, Switzerland)
Kurt Stockinger (Credit Suisse AG, Switzerland)
The purpose of data warehouses is to enable business analysts to make better decisions. Over the years the technology has matured and data warehouses have become extremely successful. As a consequence, more and more data has been added to the data warehouses and their schemas have become increasingly complex. These systems still work well for generating pre-canned reports. However, with their current complexity, they tend to be a poor match for non-tech-savvy business analysts who need answers to ad-hoc queries that were not anticipated. This paper describes the design, implementation, and experience of the SODA system (Search over DAta Warehouse). SODA bridges the gap between the business needs of analysts and the technical complexity of current data warehouses. SODA enables a Google-like search experience for data warehouses by taking keyword queries from business users and automatically generating executable SQL. The key idea is to use a graph pattern matching algorithm that leverages the metadata model of the data warehouse. Our results with real data from a global player in the financial services industry show that SODA produces queries with high precision and recall, and makes it much easier for business users to interactively explore highly complex data warehouses.
- Efficient Subgraph Similarity Search on Large Probabilistic Graph Databases
Ye Yuan (Northeastern University, China)
Guoren Wang (Northeastern University, China)
Lei Chen (Hong Kong University of Science and Technology, Hong Kong)
Haixun Wang (Microsoft Research Asia, China)
Many studies have been conducted on seeking efficient solutions for subgraph similarity search over certain (deterministic) graphs, due to its wide application in many fields, including bioinformatics, social network analysis, and Resource Description Framework (RDF) data management. All these works assume that the underlying data are certain. However, in reality, graphs are often noisy and uncertain due to various factors, such as errors in data extraction, inconsistencies in data integration, and privacy-preserving purposes. Therefore, in this paper, we study subgraph similarity search on large probabilistic graph databases. Different from previous works that assume edges in an uncertain graph are independent of each other, we study uncertain graphs where edge occurrences are correlated. We formally prove that subgraph similarity search over probabilistic graphs is #P-complete; thus, we employ a filter-and-verify framework to speed up the search. In the filtering phase, we develop tight lower and upper bounds of subgraph similarity probability based on a probabilistic matrix index, PMI. PMI is composed of discriminative subgraph features associated with tight lower and upper bounds of subgraph isomorphism probability. Based on PMI, we can filter out a large number of probabilistic graphs and maximize the pruning capability. During the verification phase, we develop an efficient sampling algorithm to validate the remaining candidates. The efficiency of our proposed solutions has been verified through extensive experiments.
- Capturing Topology in Graph Pattern Matching
Shuai Ma (Beihang University, China)
Yang Cao (Beihang University, China)
Wenfei Fan (University of Edinburgh, UK)
Jinpeng Huai (Beihang University, China)
Tianyu Wo (Beihang University, China)
Graph pattern matching is often defined in terms of subgraph isomorphism, an NP-complete problem. To lower its complexity, various extensions of graph simulation have been considered instead. These extensions allow pattern matching to be conducted in cubic-time. However, they fall short of capturing the topology of data graphs, i.e., graphs may have a structure drastically different from pattern graphs they match, and the matches found are often too large to understand and analyze. To rectify these problems, this paper proposes a notion of strong simulation, a revision of graph simulation, for graph pattern matching. (1) We identify a set of criteria for preserving the topology of graphs matched. We show that strong simulation preserves the topology of data graphs and finds a bounded number of matches. (2) We show that strong simulation retains the same complexity as earlier extensions of simulation, by providing a cubic-time algorithm for computing strong simulation. (3) We present the locality property of strong simulation, which allows us to effectively conduct pattern matching on distributed graphs. (4) We experimentally verify the effectiveness and efficiency of these algorithms, using real-life data and synthetic data.
- Efficient Subgraph Matching on Billion Node Graphs
Zhao Sun (Fudan University, China)
Hongzhi Wang (Harbin Institute of Technology, China)
Haixun Wang (Microsoft Research Asia, China)
Bin Shao (Microsoft Research Asia, China)
Jianzhong Li (Harbin Institute of Technology, China)
The ability to handle large scale graph data is crucial to an increasing number of applications. Much work has been dedicated to supporting basic graph operations such as subgraph matching, reachability, regular expression matching, etc. In many cases, graph indices are employed to speed up query processing. Typically, most indices require either super-linear indexing time or super-linear indexing space. Unfortunately, for very large graphs, super-linear approaches are almost always infeasible. In this paper, we study the problem of subgraph matching on billion-node graphs. We present a novel algorithm that supports efficient subgraph matching for graphs deployed on a distributed memory store. Instead of relying on super-linear indices, we use efficient graph exploration and massive parallel computing for query processing. Our experimental results demonstrate the feasibility of performing subgraph matching on web-scale graph data.
Research Session 35: Web Databases
- Multilingual Schema Matching for Wikipedia Infoboxes
Thanh Nguyen (University of Utah, USA)
Viviane Moreira (Universidade Federal do Rio Grande do Sul, Brazil)
Huong Nguyen (University of Utah, USA)
Hoa Nguyen (University of Utah, USA)
Juliana Freire (Polytechnic Institute of New York University, USA)
Recent research has taken advantage of Wikipedia's multilingualism as a resource for cross-language information retrieval and machine translation, as well as proposed techniques for enriching its cross-language structure. The availability of documents in multiple languages also opens up new opportunities for querying structured Wikipedia content, and in particular, to enable answers that straddle different languages. As a step towards supporting such queries, in this paper, we propose a method for identifying mappings between attributes from infoboxes that come from pages in different languages. Our approach finds mappings in a completely automated fashion. Because it does not require training data, it is scalable: not only can it be used to find mappings between many language pairs, but it is also effective for languages that are under-represented and lack sufficient training samples. Another important benefit of our approach is that it does not depend on syntactic similarity between attribute names, and thus, it can be applied to language pairs that have distinct morphologies. We have performed an extensive experimental evaluation using a corpus consisting of pages in Portuguese, Vietnamese, and English. The results show that not only does our approach obtain high precision and recall, but it also outperforms state-of-the-art techniques. We also present a case study which demonstrates that the multilingual mappings we derive lead to substantial improvements in answer quality and coverage for structured queries over Wikipedia content.
- REX: Explaining Relationships between Entity Pairs
Lujun Fang (University of Michigan, USA)
Anish Das Sarma (Yahoo! Research, USA)
Cong Yu (Google Inc., USA)
Philip Bohannon (Google Inc., USA)
Knowledge bases of entities and relations (either constructed manually or automatically) are behind many real world search engines, including those at Yahoo!, Microsoft, and Google. Those knowledge bases can be viewed as graphs with nodes representing entities and edges representing (primary) relationships, and various studies have been conducted on how to leverage them to answer entity seeking queries. Meanwhile, in a complementary direction, analyses over the query logs have enabled researchers to identify entity pairs that are statistically correlated. Such entity relationships are then presented to search users through the "related searches" feature in modern search engines. However, entity relationships thus discovered can often be "puzzling" to the users because why the entities are connected is often indescribable. In this paper, we propose a novel problem called "entity relationship explanation", which seeks to explain why a pair of entities are connected, and solve this challenging problem by integrating the above two complementary approaches, i.e., we leverage the knowledge base to "explain" the connections discovered between entity pairs. More specifically, we present REX, a system that takes a pair of entities in a given knowledge base as input and efficiently identifies a ranked list of relationship explanations. We formally define relationship explanations and analyze their desirable properties. Furthermore, we design and implement algorithms to efficiently enumerate and rank all relationship explanations based on multiple measures of "interestingness." We perform extensive experiments over real web-scale data gathered from DBpedia and a commercial search engine, demonstrating the efficiency and scalability of REX. We also perform user studies to corroborate the effectiveness of explanations generated by REX.
- Relative Lempel-Ziv Factorization for Efficient Storage and Retrieval of Web Collections
Christopher Hoobin (RMIT University, Australia)
Simon J. Puglisi (King’s College London, UK)
Justin Zobel (University of Melbourne, Australia)
Compression techniques that support fast random access are a core component of any information system. Current state-of-the-art methods group documents into fixed-sized blocks and compress each block with a general-purpose adaptive algorithm such as GZIP. Random access to a specific document then requires decompression of a block. The choice of block size is critical: it trades off compression effectiveness against document retrieval times. In this paper we present a scalable compression method for large document collections that allows fast random access. We build a representative sample of the collection and use it as a dictionary in an LZ77-like encoding of the rest of the collection, relative to the dictionary. We demonstrate on large collections that, using a dictionary as small as 0.1% of the collection size, our algorithm is dramatically faster than previous methods, and in general gives much better compression.
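A minimal sketch of dictionary-relative factorization follows: each position of the text is encoded as the longest match found anywhere in the dictionary, falling back to a literal character. Real relative Lempel-Ziv implementations use a suffix-array-backed dictionary for fast matching; this quadratic-time version only illustrates the encoding and its decoding.

```python
# Dictionary-relative LZ factorization (naive matching, for illustration only).
def rlz_factorize(text, dictionary):
    factors, i = [], 0
    while i < len(text):
        best_pos, best_len = -1, 0
        # Find the longest prefix of text[i:] occurring in the dictionary.
        for j in range(len(dictionary)):
            length = 0
            while (i + length < len(text) and j + length < len(dictionary)
                   and text[i + length] == dictionary[j + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = j, length
        if best_len == 0:
            factors.append(("literal", text[i]))   # no match: emit one character
            i += 1
        else:
            factors.append((best_pos, best_len))   # (position in dictionary, length)
            i += best_len
    return factors

def rlz_decode(factors, dictionary):
    out = []
    for f in factors:
        if f[0] == "literal":
            out.append(f[1])
        else:
            pos, length = f
            out.append(dictionary[pos:pos + length])
    return "".join(out)

dictionary = "the quick brown fox jumps over the lazy dog"
text = "the lazy fox jumps quickly"
factors = rlz_factorize(text, dictionary)
print(factors)
print(rlz_decode(factors, dictionary) == text)   # True: encoding is lossless
```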
- Optimal Algorithms for Crawling a Hidden Database in the Web
Cheng Sheng (Chinese University of Hong Kong, Hong Kong)
Nan Zhang (George Washington University, USA)
Yufei Tao (Chinese University of Hong Kong, Hong Kong)
Xin Jin (George Washington University, USA)
A hidden database refers to a dataset that an organization makes accessible on the web by allowing users to issue queries through a search interface. In other words, data acquisition from such a source is not achieved by following static hyperlinks. Instead, data are obtained by querying the interface and reading the result pages generated dynamically. This, together with other factors, such as the interface answering a query only partially, has prevented hidden databases from being crawled effectively by existing search engines. This paper remedies the problem by giving algorithms to extract all the tuples from a hidden database. Our algorithms are provably efficient, namely, they accomplish the task by performing only a small number of queries, even in the worst case. We also establish theoretical results indicating that these algorithms are asymptotically optimal -- i.e., it is impossible to improve their efficiency by more than a constant factor. The derivation of our upper and lower bound results reveals significant insight into the characteristics of the underlying problem. Extensive experiments confirm that the proposed techniques work very well on all the real datasets examined.
Research Session 36: Data Flow Processing
- Stubby: A Transformation-based Optimizer for MapReduce Workflows
Harold Lim (Duke University, USA)
Herodotos Herodotou (Duke University, USA)
Shivnath Babu (Duke University, USA)
There is a growing trend of performing analysis on large datasets using workflows composed of MapReduce jobs connected through producer-consumer relationships based on data. This trend has spurred the development of a number of interfaces--ranging from program-based to query-based interfaces--for generating MapReduce workflows. Studies have shown that the gap in performance can be quite large between optimized and unoptimized workflows. However, automatic cost-based optimization of MapReduce workflows remains a challenge due to the multitude of interfaces, large size of the execution plan space, and the frequent unavailability of all types of information needed for optimization. We introduce a comprehensive plan space for MapReduce workflows generated by popular workflow generators. We then propose Stubby, a cost-based optimizer that searches selectively through the subspace of the full plan space that can be enumerated correctly and costed based on the information available in any given setting. Stubby enumerates the plan space based on plan-to-plan transformations and an efficient search algorithm. Stubby is designed to be extensible to new interfaces and new types of optimizations, which is a desirable feature given how rapidly MapReduce systems are evolving. Stubby's efficiency and effectiveness have been evaluated using representative workflows from many domains.
- Opening the Black Boxes in Data Flow Optimization
Fabian Hueske (Technische Universität Berlin, Germany)
Mathias Peters (Humboldt-Universität zu Berlin, Germany)
Matthias Sax (Humboldt-Universität zu Berlin, Germany)
Astrid Rheinländer (Humboldt-Universität zu Berlin, Germany)
Rico Bergmann (Humboldt-Universität zu Berlin, Germany)
Aljoscha Krettek (Technische Universität Berlin, Germany)
Kostas Tzoumas (Technische Universität Berlin, Germany)
Many systems for big data analytics employ a data flow abstraction to define parallel data processing tasks. In this setting, custom operations expressed as user-defined functions are very common. We address the problem of performing data flow optimization at this level of abstraction, where the semantics of operators are not known. Traditionally, query optimization is applied to queries with known algebraic semantics. In this work, we find that a handful of properties, rather than a full algebraic specification, suffice to establish reordering conditions for data processing operators. We show that these properties can be accurately estimated for black box operators by statically analyzing the general-purpose code of their user-defined functions. We design and implement an optimizer for parallel data flows that does not assume knowledge of semantics or algebraic properties of operators. Our evaluation confirms that the optimizer can apply common rewritings such as selection reordering, bushy join-order enumeration, and limited forms of aggregation push-down, hence yielding similar rewriting power as modern relational DBMS optimizers. Moreover, it can optimize the operator order of non-relational data flows, a unique feature among today's systems.
- Putting Lipstick on Pig: Enabling Database-style Workflow Provenance
Yael Amsterdamer (Tel Aviv University, Israel)
Susan B. Davidson (University of Pennsylvania, USA)
Daniel Deutch (Ben Gurion University, Israel)
Tova Milo (Tel Aviv University, Israel)
Julia Stoyanovich (University of Pennsylvania, USA)
Val Tannen (University of Pennsylvania, USA)
Workflow provenance typically assumes that each module is a "black-box", so that each output depends on all inputs (coarse-grained dependencies). Furthermore, it does not model the internal state of a module, which can change between repeated executions. In practice, however, an output may depend on only a small subset of the inputs (fine-grained dependencies) as well as on the internal state of the module. We present a novel provenance framework that marries database-style and workflow-style provenance, by using Pig Latin to expose the functionality of modules, thus capturing internal state and fine-grained dependencies. A critical ingredient in our solution is the use of a novel form of provenance graph that models module invocations and yields a compact representation of fine-grained workflow provenance. It also enables a number of novel graph transformation operations, allowing users to choose the desired level of granularity in provenance querying (ZoomIn and ZoomOut), and supporting "what-if" workflow analytic queries. We implemented our approach in the Lipstick system and developed a benchmark in support of a systematic performance evaluation. Our results demonstrate the feasibility of tracking and querying fine-grained workflow provenance.
- REX: Recursive, Delta-Based Data-Centric Computation
Svilen R. Mihaylov (University of Pennsylvania, USA)
Zachary G. Ives (University of Pennsylvania, USA)
Sudipto Guha (University of Pennsylvania, USA)
In today's Web and social network environments, query workloads include ad hoc and OLAP queries, as well as iterative algorithms that analyze data relationships (e.g., link analysis, clustering, learning). Modern DBMSs support ad hoc and OLAP queries, but most are not robust enough to scale to large clusters. Conversely, "cloud" platforms like MapReduce execute chains of batch tasks across clusters in a fault tolerant way, but have too much overhead to support ad hoc queries. Moreover, both classes of platform incur significant overhead in executing iterative data analysis algorithms. Most such iterative algorithms repeatedly refine portions of their answers, until some convergence criterion is reached. However, general cloud platforms typically must reprocess all data in each step. DBMSs that support recursive SQL are more efficient in that they propagate only the changes in each step -- but they still accumulate each iteration's state, even if it is no longer useful. User-defined functions are also typically harder to write for DBMSs than for cloud platforms. We seek to unify the strengths of both styles of platforms, with a focus on supporting iterative computations in which changes, in the form of deltas, are propagated from iteration to iteration, and state is efficiently updated in an extensible way. We present a programming model oriented around deltas, describe how we execute and optimize such programs in our REX runtime system, and validate that our platform also handles failures gracefully. We experimentally validate our techniques, and show speedups over the competing methods ranging from 2.5 to nearly 100 times.
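The delta-oriented style of iteration can be illustrated with a single-machine shortest-path loop in which only the nodes whose values changed in the previous round do work in the next one. This is a generic sketch of the programming pattern, not REX's distributed, fault-tolerant runtime.

```python
# Delta-based iteration: only changed nodes propagate in the next round.
def delta_shortest_paths(graph, source):
    """graph: node -> list of (neighbour, edge_weight)."""
    dist = {source: 0}
    delta = {source}                 # nodes whose distance changed last round
    while delta:
        next_delta = set()
        for u in delta:              # only the delta set does work
            for v, w in graph.get(u, []):
                candidate = dist[u] + w
                if candidate < dist.get(v, float("inf")):
                    dist[v] = candidate
                    next_delta.add(v)
        delta = next_delta           # converged when no deltas remain
    return dist

graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 1), ("d", 5)], "c": [("d", 1)]}
print(delta_shortest_paths(graph, "a"))   # {'a': 0, 'b': 1, 'c': 2, 'd': 3}
```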
Research Session 37: Sequence Processing
- ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases
Xiaochun Yang (Northeastern University, China)
Honglei Liu (Northeastern University, China)
Bin Wang (Northeastern University, China)
We study the problem of local alignment, which is finding pairs of similar subsequences with gaps. The problem arises in biosequence databases. BLAST is a typical heuristic tool for finding local alignments, but it can miss results. Using the Smith-Waterman algorithm, we can find all local alignments in O(mn) time, where m and n are the lengths of a query and a text, respectively. A recent exact approach, BWT-SW, improves the complexity of the Smith-Waterman algorithm under constraints, but is still much slower than BLAST. This paper takes on the challenge of designing an accurate and efficient algorithm for evaluating local-alignment searches, especially for long queries. We propose an efficient tool called ALAE that speeds up BWT-SW using a compressed suffix array. ALAE utilizes a family of filtering techniques to prune meaningless calculations and an algorithm for reusing score calculations. We also give a mathematical analysis and show that the upper bound on the total number of calculated entries using ALAE varies from 4.50mn^0.520 to 9.05mn^0.896 for random DNA sequences and from 8.28mn^0.364 to 7.49mn^0.723 for random protein sequences. We demonstrate the significant performance improvement of ALAE over BWT-SW through a thorough experimental study on real biosequences. ALAE guarantees correctness and accelerates BLAST for most parameter settings.
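For reference, the O(mn) dynamic program that exact local aligners ultimately accelerate is shown below with a simple linear gap penalty; the affine gap model the paper targets needs two additional matrices and is omitted. The scoring parameters are illustrative defaults.

```python
# Smith-Waterman local alignment score with a linear gap penalty.
def smith_waterman(query, text, match=2, mismatch=-1, gap=-2):
    m, n = len(query), len(text)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if query[i - 1] == text[j - 1] else mismatch
            H[i][j] = max(0,                      # local alignment may restart
                          H[i - 1][j - 1] + s,    # match / mismatch
                          H[i - 1][j] + gap,      # gap in the text
                          H[i][j - 1] + gap)      # gap in the query
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACACACTA", "AGCACACA"))   # best local alignment score
```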
- sDTW: Computing DTW Distances using Locally Relevant Constraints based on Salient Feature Alignments
K. Selçuk Candan (Arizona State University, USA)
Rosaria Rossini (Universita degli Studi di Torino, Italy)
Maria Luisa Sapino (Universita degli Studi di Torino, Italy)
Xiaolan Wang (Arizona State University, USA)
Many applications generate and consume temporal data, and retrieval of time series is a key processing step in many application domains. The dynamic time warping (DTW) distance between time series of lengths N and M is computed with a dynamic programming approach that creates and fills an NxM grid to search for an optimal warp path. Since this can be costly, various heuristics have been proposed to cut away the potentially unproductive portions of the DTW grid. In this paper, we argue that time series often carry structural features that can be used to identify locally relevant constraints and eliminate redundant work. Relying on this observation, we propose salient feature based sDTW algorithms which first identify robust salient features in the given time series and then find a consistent alignment of these features to establish the boundaries for the warp path search. More specifically, we propose alternative fixed core & adaptive width, adaptive core & fixed width, and adaptive core & adaptive width strategies which enforce different constraints reflecting the high-level structural characteristics of the series in the data set. Experimental results show that the proposed sDTW algorithms achieve much higher accuracy in DTW computation and time series retrieval than fixed core & fixed width algorithms that do not leverage local features of the given time series.
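The flavor of constraining the warp path can be seen in the fixed-band DTW sketch below, which only fills grid cells within a given band of the diagonal; sDTW instead derives locally relevant constraints from aligned salient features, which this illustration does not attempt.

```python
# DTW restricted to a fixed band around the (rescaled) diagonal.
def banded_dtw(x, y, band):
    n, m = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        centre = int(i * m / n)          # diagonal position for row i
        for j in range(max(1, centre - band), min(m, centre + band) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j - 1],   # match
                                 D[i - 1][j],       # shrink
                                 D[i][j - 1])       # stretch
    return D[n][m]

print(banded_dtw([0, 1, 2, 3, 2, 1], [0, 1, 1, 2, 3, 1], band=2))
```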
- A Generic Framework for Efficient and Effective Subsequence Retrieval
Haohan Zhu (Boston University, USA)
George Kollios (Boston University, USA)
Vassilis Athitsos (University of Texas at Arlington, USA)
This paper proposes a general framework for matching similar subsequences in both time series and string databases. The matching results are pairs of query subsequences and database subsequences. The framework finds all possible pairs of similar subsequences if the distance measure satisfies the "consistency" property, which is a property introduced in this paper. We show that most popular distance functions, such as the Euclidean distance, DTW, ERP, the Frechet distance for time series, and the Hamming distance and Levenshtein distance for strings, are all "consistent". We also propose a generic index structure for metric spaces named "reference net". The reference net occupies O(n) space, where n is the size of the dataset and is optimized to work well with our framework. The experiments demonstrate the ability of our method to improve retrieval performance when combined with diverse distance measures. The experiments also illustrate that the reference net scales well in terms of space overhead and query time.
Experiments and Analysis Session 1: Mining, Cleaning and Matching
- An Analysis of Structured Data on the Web
Nilesh Dalvi (Yahoo! Research, USA)
Ashwin Machanavajjhala (Yahoo! Research, USA)
Bo Pang (Yahoo! Research, USA)
In this paper, we analyze the nature and distribution of structured data on the Web. Web-scale information extraction, or the problem of creating structured tables using extraction from the entire web, is gathering lots of research interest. We perform a study to understand and quantify the value of Web-scale extraction, and how structured information is distributed amongst top aggregator websites and tail sites for various interesting domains. We believe this is the first study of its kind, and gives us new insights for information extraction over the Web.
- Shortest Path and Distance Queries on Road Networks: An Experimental Evaluation
Lingkun Wu (Nanyang Technological University, Singapore)
Xiaokui Xiao (Nanyang Technological University, Singapore)
Dingxiong Deng (Fudan University, China)
Gao Cong (Nanyang Technological University, Singapore)
Andy Diwen Zhu (Nanyang Technological University, Singapore)
Shuigeng Zhou (Fudan University, China)
Computing the shortest path between two given locations in a road network is an important problem that finds applications in various map services and commercial navigation products. The state-of-the-art solutions for the problem can be divided into two categories: spatial-coherence-based methods and vertex-importance-based approaches. The two categories of techniques, however, have not been compared systematically under the same experimental framework, as they were developed from two independent lines of research that do not refer to each other. This renders it difficult for a practitioner to decide which technique should be adopted for a specific application. Furthermore, the experimental evaluation of the existing techniques, as presented in previous work, falls short in several aspects. Some methods were tested only on small road networks with up to one hundred thousand vertices; some approaches were evaluated using distance queries (instead of shortest path queries), namely, queries that ask only for the length of the shortest path; a state-of-the-art technique was examined based on a faulty implementation that led to incorrect query results. To address the above issues, this paper presents a comprehensive comparison of the most advanced spatial-coherence-based and vertex-importance-based approaches. Using a variety of real road networks with up to twenty million vertices, we evaluated each technique in terms of its preprocessing time, space consumption, and query efficiency (for both shortest path and distance queries). Our experimental results reveal the characteristics of different techniques, based on which we provide guidelines on selecting appropriate methods for various scenarios.
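For context, the textbook baseline that these preprocessing-based techniques are measured against is a plain Dijkstra search answering one distance query with no preprocessing, sketched below on a toy graph with illustrative edge weights.

```python
# Dijkstra distance query: the no-preprocessing baseline.
import heapq

def dijkstra_distance(graph, source, target):
    """graph: node -> list of (neighbour, non-negative edge weight)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry, skip
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

road = {"a": [("b", 2.0), ("c", 5.0)], "b": [("c", 1.0), ("d", 4.0)], "c": [("d", 1.0)]}
print(dijkstra_distance(road, "a", "d"))   # 4.0
```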
- Uncertain Time-Series Similarity: Return to the Basics
Michele Dallachiesa (University of Trento, Italy)
Besmira Nushi (University of Trento, Italy)
Katsiaryna Mirylenka (University of Trento, Italy)
Themis Palpanas (University of Trento, Italy)
In recent years there has been a considerable increase in the availability of continuous sensor measurements in a wide range of application domains, such as Location-Based Services (LBS), medical monitoring systems, manufacturing plants and engineering facilities that must ensure efficiency, product quality and safety, hydrologic and geologic observing systems, pollution management, and others. Due to the inherent imprecision of sensor observations, many investigations have recently turned to querying, mining and storing uncertain data. Uncertainty can also be due to data aggregation, privacy-preserving transforms, and error-prone mining algorithms. In this study, we survey the techniques that have been proposed specifically for modeling and processing uncertain time series, an important model for temporal data. We provide an analytical evaluation of the alternatives that have been proposed in the literature, highlighting the advantages and disadvantages of each approach, and further compare these alternatives with two additional techniques that have been carefully studied before. We conduct an extensive experimental evaluation with 17 real datasets, and discuss some surprising results, which suggest that a fruitful research direction is to take into account the temporal correlations in the time series. Based on our evaluations, we also provide guidelines useful for practitioners in the field.
- Towards Energy-Efficient Database Cluster Design
Willis Lang (University of Wisconsin, USA)
Stavros Harizopoulos (Nou Data, USA)
Jignesh M. Patel (University of Wisconsin, USA)
Mehul A. Shah (Nou Data, USA)
Dimitris Tsirogiannis (Microsoft Corporation, USA)
Energy is a growing component of the operational cost for many "big data" deployments, and hence has become increasingly important for practitioners of large-scale data analysis who require scale-out clusters or parallel DBMS appliances. Although a number of recent studies have investigated the energy efficiency of DBMSs, none of these studies have looked at the architectural design space of energy-efficient parallel DBMS clusters. There are many challenges to increasing the energy efficiency of a DBMS cluster, including dealing with the inherent scaling inefficiency of parallel data processing, and choosing the appropriate energy-efficient hardware. In this paper, we experimentally examine and analyze a number of key parameters related to these challenges for designing energy-efficient database clusters. We explore the cluster design space using empirical results and propose a model that considers the key bottlenecks to energy efficiency in a parallel DBMS. This paper represents a key first step in designing energy-efficient database clusters, which is increasingly important given the trend toward parallel database appliances.
Experiments and Analysis Session 2: Large Data Management
- Mining Frequent Itemsets over Uncertain Databases
Yongxin Tong (Hong Kong University of Science and Technology, Hong Kong)
Lei Chen (Hong Kong University of Science and Technology, Hong Kong)
Yurong Cheng (Northeastern University, China)
Philip S. Yu (University of Illinois at Chicago, USA)
In recent years, due to the wide applications of uncertain data, mining frequent itemsets over uncertain databases has attracted much attention. In uncertain databases, the support of an itemset is a random variable instead of a fixed occurrence count of this itemset. Thus, unlike the corresponding problem in deterministic databases, where the frequent itemset has a unique definition, the frequent itemset under uncertain environments has two different definitions so far. The first definition, referred to as the expected support-based frequent itemset, employs the expectation of the support of an itemset to measure whether this itemset is frequent. The second definition, referred to as the probabilistic frequent itemset, uses the probability of the support of an itemset to measure its frequency. Thus, existing work on mining frequent itemsets over uncertain databases is divided into two different groups, and no study has been conducted to comprehensively compare the two different definitions. In addition, since no uniform experimental platform exists, current solutions for the same definition even generate inconsistent results. In this paper, we first aim to clarify the relationship between the two different definitions. Through extensive experiments, we verify that the two definitions have a tight connection and can be unified when the size of the data is large enough. Second, we provide baseline implementations of eight existing representative algorithms and test their performance fairly with uniform measures. Finally, according to the fair tests over many different benchmark data sets, we clarify several existing inconsistent conclusions and discuss some new findings.
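The first definition is easy to make concrete: under the common attribute-uncertainty model with independent existence probabilities, the expected support of an itemset is the sum over transactions of the product of its items' probabilities, as in the sketch below. The probabilistic frequent itemset definition requires the full support distribution and is not computed here; the database is invented.

```python
# Expected support of an itemset over an uncertain transaction database.
from math import prod

uncertain_db = [
    {"bread": 0.9, "milk": 0.8},
    {"bread": 0.6, "milk": 0.5, "beer": 0.7},
    {"milk": 0.4, "beer": 0.9},
]

def expected_support(itemset, db):
    total = 0.0
    for transaction in db:
        if all(item in transaction for item in itemset):
            # Items assumed independent within a transaction.
            total += prod(transaction[item] for item in itemset)
    return total

print(expected_support({"bread", "milk"}, uncertain_db))   # 0.9*0.8 + 0.6*0.5 = 1.02
```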
- Controlling False Positives in Association Rule Mining
Guimei Liu (National University of Singapore, Singapore)
Haojun Zhang (National University of Singapore, Singapore)
Limsoon Wong (National University of Singapore, Singapore)
Association rule mining is an important problem in the data mining area. It enumerates and tests a large number of rules on a dataset and outputs rules that satisfy user-specified constraints. Due to the large number of rules being tested, rules that do not represent a real systematic effect in the data can satisfy the given constraints purely by random chance. Hence association rule mining often suffers from a high risk of false positive errors. There is a lack of comprehensive studies on controlling false positives in association rule mining. In this paper, we adopt three multiple testing correction approaches---the direct adjustment approach, the permutation-based approach and the holdout approach---to control false positives in association rule mining, and conduct extensive experiments to study their performance. Our results show that (1) numerous spurious rules are generated if no correction is made; (2) the three approaches can control false positives effectively. Among the three approaches, the permutation-based approach has the highest power of detecting real association rules, but it is very computationally expensive. We employ several techniques to reduce its cost effectively.
- Statistical Distortion: Consequences of Data Cleaning
Tamraparni Dasu (AT&T Labs - Research, USA)
Ji Meng Loh (AT&T Labs - Research, USA)
We introduce the notion of statistical distortion as an essential metric for measuring the effectiveness of data cleaning strategies. We use this metric to propose a widely applicable yet scalable experimental framework for evaluating data cleaning strategies along three dimensions: glitch improvement, statistical distortion and cost-related criteria. Existing metrics focus on glitch improvement and cost, but not on the statistical impact of data cleaning strategies. We illustrate our framework on real world data, with a comprehensive suite of experiments and analyses.
- Comments on "Stack-based Algorithms for Pattern Matching on DAGs"
Qiang Zeng (Chinese Academy of Sciences, China)
Hai Zhuge (Chinese Academy of Sciences, China)
The paper "Stack-based Algorithms for Pattern Matching on DAGs" generalizes the classical holistic twig join algorithms and proposes PathStackD, TwigStackD and DagStackD to respectively evaluate path, twig and DAG pattern queries on directed acyclic graphs. In this paper, we investigate the major results of that paper, pointing out several discrepancies and proposing solutions to resolving them. We show that the original algorithms do not find particular types of query solutions that are common in practice. We also analyze the effect of an underlying assumption on the correctness of the algorithms and discuss the pre-filtering process that the original work proposes to prune redundant nodes. Our experimental study on both real and synthetic data substantiates our conclusions.
Industrial Session 1: Database Engine
- Serializable Snapshot Isolation in PostgreSQL
Dan R. K. Ports (Massachusetts Institute of Technology, USA)
Kevin Grittner (Wisconsin Supreme Court, USA)
This paper describes our experience implementing PostgreSQL's new serializable isolation level. It is based on the recently-developed Serializable Snapshot Isolation (SSI) technique. This is the first implementation of SSI in a production database release as well as the first in a database that did not previously have a lock-based serializable isolation level. We reflect on our experience and describe how we overcame some of the resulting challenges, including the implementation of a new lock manager, a technique for ensuring memory usage is bounded, and integration with other PostgreSQL features. We also introduce an extension to SSI that improves performance for read-only transactions. We evaluate PostgreSQL's serializable isolation level using several benchmarks and show that it achieves performance only slightly below that of snapshot isolation, and significantly outperforms the traditional two-phase locking approach on read-intensive workloads.
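For readers who want to try the feature, here is a minimal sketch (my own illustration, not code from the paper) of how an application might use PostgreSQL's serializable level from Python with psycopg2. Under SSI a transaction may be aborted to preserve serializability (SQLSTATE 40001) and is simply retried; the accounts table and connection string are hypothetical.

    import psycopg2

    def transfer(conn, src, dst, amount):
        # Retry loop: SSI may abort a transaction to preserve serializability;
        # the application is expected to re-run it.
        while True:
            try:
                with conn:  # commits on success, rolls back on error
                    with conn.cursor() as cur:
                        cur.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
                        cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                                    (amount, src))
                        cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                                    (amount, dst))
                return
            except psycopg2.errors.SerializationFailure:
                continue  # serialization failure (SQLSTATE 40001): safe to retry

    conn = psycopg2.connect("dbname=bank")  # hypothetical database
    transfer(conn, 1, 2, 100)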
- The Vertica Analytic Database: C-Store 7 Years Later
Andrew Lamb (Vertica Systems, USA)
Matt Fuller (Vertica Systems, USA)
Ramakrishna Varadarajan (Vertica Systems, USA)
Nga Tran (Vertica Systems, USA)
Ben Vandier (Vertica Systems, USA)
Lyric Doshi (Vertica Systems, USA)
Chuck Bear (Vertica Systems, USA)
This paper describes the system architecture of the Vertica Analytic Database (Vertica), a commercialization of the design of the C-Store research prototype. Vertica demonstrates a modern commercial RDBMS system that presents a classical relational interface while at the same time achieving the high performance expected from modern "web scale" analytic systems by making appropriate architectural choices. Vertica is also an instructive lesson in how academic systems research can be directly commercialized into a successful product.
- From Cooperative Scans to Predictive Buffer Management
Michał Świtakowski (Actian Corp., Netherlands)
Peter Boncz (Centrum Wiskunde & Informatica, Netherlands)
Marcin Żukowski (Actian Corp., Netherlands)
In analytical applications, database systems often need to sustain workloads with multiple concurrent scans hitting the same table. The Cooperative Scans (CScans) framework, which introduces an Active Buffer Manager (ABM) component into the database architecture, has been the most effective and elaborate response to this problem, and was initially developed in the X100 research prototype. We now report on the experiences of integrating Cooperative Scans into its industrial-strength successor, the Vectorwise database product. During this implementation we invented a simpler optimization of concurrent scan buffer management, called Predictive Buffer Management (PBM). PBM is based on the observation that in a workload with long-running scans, the buffer manager has quite a bit of information on the workload in the immediate future, such that an approximation of the ideal OPT algorithm becomes feasible. In the evaluation on both synthetic benchmarks as well as a TPC-H throughput run we compare the benefits of naive buffer management (LRU) versus CScans, PBM and OPT; showing that PBM achieves benefits close to Cooperative Scans, while incurring much lower architectural impact.
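As a point of reference for the OPT baseline mentioned above, here is a small sketch (my own illustration, not the paper's code) of Belady's OPT eviction rule, which PBM approximates when future scan positions are largely known: evict the buffered page whose next reference lies farthest in the future.

    # Minimal sketch of Belady's OPT policy, assuming the future reference
    # string is known (as is approximately true for long-running scans).
    def opt_evict(buffer, future_refs, now):
        """Return the page in `buffer` to evict at position `now` in `future_refs`."""
        def next_use(page):
            for i in range(now, len(future_refs)):
                if future_refs[i] == page:
                    return i
            return float('inf')  # never used again: ideal eviction victim
        return max(buffer, key=next_use)

    # Example: pages currently buffered and the (known) upcoming accesses.
    print(opt_evict({1, 2, 3}, [2, 1, 2, 1, 3], now=0))  # -> 3 (needed last)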
- Transaction Log Based Application Error Recovery and Point In-Time Query
Tomas Talius (Microsoft Corp., USA)
Robin Dhamankar (Microsoft Corp., USA)
Andrei Dumitrache (Microsoft Corp., USA)
Hanuma Kodavalla (Microsoft Corp., USA)
Database backups have traditionally been used as the primary mechanism to recover from hardware and user errors. High availability solutions maintain redundant copies of data that can be used to recover from most failures except user or application errors. Database backups are neither space nor time efficient for recovering from user errors, which typically occur in the recent past and affect only a small portion of the database. Moreover, periodic full backups impact user workload and increase storage costs. In this paper we present a scheme that can be used for both user and application error recovery, starting from the current state and rewinding the database back in time using the transaction log. While we provide a consistent view of the entire database as of a point in time in the past, the actual prior versions are produced only for data that is accessed. We make the "as of" data accessible to arbitrary point-in-time queries by integrating with the database snapshot feature in Microsoft SQL Server.
Industrial Session 2: Potpourri
- Building User-defined Runtime Adaptation Routines for Stream Processing Applications
Gabriela Jacques-Silva (IBM T J Watson Research Center, USA)
Buğra Gedik (Bilkent University, Turkey)
Rohit Wagle (IBM T J Watson Research Center, USA)
Kun-Lung Wu (IBM T J Watson Research Center, USA)
Vibhore Kumar (IBM T J Watson Research Center, USA)
Stream processing applications are deployed as continuous queries that run from the time of their submission until their cancellation. This deployment mode limits developers who need their applications to perform runtime adaptation, such as algorithmic adjustments, incremental job deployment, and application-specific failure recovery. Currently, developers do runtime adaptation by using external scripts and/or by inserting operators into the stream processing graph that are unrelated to the data processing logic. In this paper, we describe a component called orchestrator that allows users to write routines for automatically adapting the application to runtime conditions. Developers build an orchestrator by registering and handling events as well as specifying actuations. Events can be generated due to changes in the system state (e.g., application component failures), built-in system metrics (e.g., throughput of a connection), or custom application metrics (e.g., quality score). Once the orchestrator receives an event, users can take adaptation actions by using the orchestrator actuation APIs. We demonstrate the use of the orchestrator in IBM's System S in the context of three different applications, illustrating application adaptation to changes in the incoming data distribution, to application failures, and to on-demand dynamic composition.
- A Storage Advisor for Hybrid-Store Databases
Philipp Rösch (SAP AG, SAP Research, Germany)
Lars Dannecker (SAP AG, SAP Research, Germany)
Gregor Hackenbroich (SAP AG, SAP Research, Germany)
Franz Faerber (SAP AG, Germany)
With the SAP HANA database, SAP offers a high-performance in-memory hybrid-store database. Hybrid-store databases---that is, databases supporting row- and column-oriented data management---are getting more and more prominent. While the columnar management offers high-performance capabilities for analyzing large quantities of data, the row-oriented store can handle transactional point queries as well as inserts and updates more efficiently. To effectively take advantage of both stores at the same time, the novel question arises whether to store the given data row- or column-oriented. We tackle this problem with a storage advisor tool that supports database administrators in this decision. Our proposed storage advisor recommends the optimal store based on data and query characteristics; its core is a cost model to estimate and compare query execution times for the different stores. Besides a per-table decision, our tool also considers horizontally and vertically partitioning the data and managing the partitions on different stores. We evaluated the storage advisor for use in the SAP HANA database; we show the recommendation quality as well as the benefit of having the data in the optimal store with respect to increased query performance.
- MOIST: A Scalable and Parallel Moving Object Indexer with School Tracking
Junchen Jiang (Google Inc., China)
Hongji Bao (Google Inc., China)
Edward Y. Chang (Google Inc., China)
Yuqian Li (Google Inc., China)
Location-Based Service (LBS) is rapidly becoming the next ubiquitous technology for a wide range of mobile applications. To support applications that demand nearest-neighbor and history queries, an LBS spatial indexer must be able to efficiently update, query, archive and mine location records, which can be in contention with each other. In this work, we propose MOIST, whose baseline is a recursive spatial partitioning indexer built upon BigTable. To reduce update and query contention, MOIST groups nearby objects of similar trajectory into the same school, and keeps track of only the history of school leaders. This dynamic clustering scheme can eliminate redundant updates and hence reduce update latency. To improve history query processing, MOIST keeps some history data in memory, while it flushes aged data onto parallel disks in a locality-preserving way. Through experimental studies, we show that MOIST can support highly efficient nearest-neighbor and history queries and can scale well with an increasing number of users and update frequency.
- Exploiting Evidence from Unstructured Data to Enhance Master Data Management
Karin Murthy (IBM Research, India)
Prasad M Deshpande (IBM Research, India)
Atreyee Dey (IBM Research, India)
Ramanujam Halasipuram (IBM Research, India)
Mukesh Mohania (IBM Research, India)
Deepak P (IBM Research, India)
Jennifer Reed (IBM Corp., USA)
Scott Schumacher (IBM Corp., USA)
Master data management (MDM) integrates data from multiple structured data sources and builds a consolidated 360-degree view of business entities such as customers and products. Today's MDM systems are not prepared to integrate information from unstructured data sources, such as news reports, emails, call-center transcripts, and chat logs. However, those unstructured data sources may contain valuable information about the same entities known to MDM from the structured data sources. Integrating information from unstructured data into MDM is challenging as textual references to existing MDM entities are often incomplete and imprecise and the additional entity information extracted from text should not impact the trustworthiness of MDM data. In this paper, we present an architecture for making MDM text-aware and showcase its implementation as IBM InfoSphere MDM Extension for Unstructured Text Correlation, an add-on to IBM InfoSphere Master Data Management Standard Edition. We highlight how MDM benefits from additional evidence found in documents when doing entity resolution and relationship discovery. We experimentally demonstrate the feasibility of integrating information from unstructured data sources into MDM.
Industrial Session 3: Big Data I
- Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads
Yanpei Chen (University of California, Berkeley, USA)
Sara Alspaugh (University of California, Berkeley, USA)
Randy Katz (University of California, Berkeley, USA)
Within the past few years, organizations in diverse industries have adopted MapReduce-based systems for large-scale data processing. Along with these new users, important new workloads have emerged which feature many small, short, and increasingly interactive jobs in addition to the large, long-running batch jobs for which MapReduce was originally designed. As interactive, large-scale query processing is a strength of the RDBMS community, it is important that lessons from that field be carried over and applied where possible in this new domain. However, these new workloads have not yet been described in the literature. We fill this gap with an empirical analysis of MapReduce traces from six separate business-critical deployments inside Facebook and at Cloudera customers in e-commerce, telecommunications, media, and retail. Our key contribution is a characterization of new MapReduce workloads which are driven in part by interactive analysis, and which make heavy use of query-like programming frameworks on top of MapReduce. These workloads display diverse behaviors which invalidate prior assumptions about MapReduce such as uniform data access, regular diurnal patterns, and prevalence of large jobs. A secondary contribution is a first step towards creating a TPC-like data processing benchmark for MapReduce.
- Avatara: OLAP for Web-scale Analytics Products
Lili Wu (LinkedIn Corp., USA)
Roshan Sumbaly (LinkedIn Corp., USA)
Chris Riccomini (LinkedIn Corp., USA)
Gordon Koo (LinkedIn Corp., USA)
Hyung Jin Kim (LinkedIn Corp., USA)
Jay Kreps (LinkedIn Corp., USA)
Sam Shah (LinkedIn Corp., USA)
Multidimensional data generated by members on websites has seen massive growth in recent years. OLAP is a well-suited solution for mining and analyzing this data. Providing insights derived from this analysis has become crucial for these websites to give members greater value. For example, LinkedIn, the largest professional social network, provides its professional members rich analytics features like "Who's Viewed My Profile?" and "Who's Viewed This Job?" The data behind these features form cubes that must be efficiently served at scale, and can be neatly sharded to do so. To serve our growing 160 million member base, we built a scalable and fast OLAP serving system called Avatara to solve this many, small cubes problem. At LinkedIn, Avatara has been powering several analytics features on the site for the past two years.
- The MADlib Analytics Library or MAD Skills, the SQL
Joe Hellerstein (University of California, Berkeley, USA)
Christopher Ré (University of Wisconsin - Madison, USA)
Florian Schoppmann (Greenplum, USA)
Daisy Zhe Wang (University of Florida, USA)
Eugene Fratkin (Greenplum, USA)
Aleksander Gorajek (Greenplum, USA)
Kee Siong Ng (Greenplum, USA)
Caleb Welton (Greenplum, USA)
Xixuan Feng (University of Wisconsin - Madison, USA)
Kun Li (University of Florida, USA)
Arun Kumar (University of Wisconsin - Madison, USA)
MADlib is a free, open source library of in-database analytic methods. It provides an evolving suite of SQL-based algorithms for machine learning, data mining and statistics that run at scale within a database engine, with no need for data import/export to other tools. The goal is for MADlib to eventually serve a role for scalable database systems that is similar to the CRAN library for R: a community repository of statistical methods, this time written with scale and parallelism in mind. In this paper we introduce the MADlib project, including the background that led to its beginnings, and the motivation for its open source nature. We provide an overview of the library's architecture and design patterns, and provide a description of various statistical methods in that context. We include performance and speedup results of a core design pattern from one of those methods over the Greenplum parallel DBMS on a modest-sized test cluster. We then report on two initial efforts at incorporating academic research into MADlib, which is one of the project's goals. MADlib is freely available at http://madlib.net, and the project is open for contributions of both new methods, and ports to additional database platforms.
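To illustrate the "SQL-based algorithms inside the engine" idea, the sketch below invokes a MADlib linear-regression routine from Python. The linregr_train() call follows the current MADlib documentation and may differ from the library version described in the paper; the table and column names, and the connection string, are made up.

    import psycopg2

    conn = psycopg2.connect("dbname=analytics")  # hypothetical database with MADlib installed
    with conn, conn.cursor() as cur:
        # Train an in-database linear regression model; results land in a table.
        cur.execute("""
            SELECT madlib.linregr_train(
                'houses',                    -- source table (hypothetical)
                'houses_model',              -- output table for coefficients
                'price',                     -- dependent variable
                'ARRAY[1, size, bedrooms]'   -- independent variables
            )
        """)
        cur.execute("SELECT coef FROM houses_model")
        print(cur.fetchone())  # fitted coefficients, computed without exporting data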
- M3R: Increased performance for in-memory Hadoop jobs
Avraham Shinnar (IBM T J Watson Research Center, USA)
David Cunningham (IBM T J Watson Research Center, USA)
Benjamin Herta (IBM T J Watson Research Center, USA)
Vijay Saraswat (IBM T J Watson Research Center, USA)
Main Memory Map Reduce (M3R) is a new implementation of the Hadoop Map Reduce (HMR) API targeted at online analytics on high mean-time-to-failure clusters. It does not support resilience, and supports only those workloads which can fit into cluster memory. In return, it can run HMR jobs unchanged -- including jobs produced by compilers for higher-level languages such as Pig, Jaql, and SystemML and interactive front-ends like IBM BigSheets -- while providing significantly better performance than the Hadoop engine on several workloads (e.g. 45x on some input sizes for sparse matrix vector multiply). M3R also supports extensions to the HMR API which can enable Map Reduce jobs to run faster on the M3R engine, while not affecting their performance under the Hadoop engine.
Industrial 4 : Big Data II
- The Unified Logging Infrastructure for Data Analytics at Twitter
George Lee (Twitter Inc., USA)
Jimmy Lin (Twitter Inc., USA)
Chuang Liu (Twitter Inc., USA)
Andrew Lorek (Twitter Inc., USA)
Dmitriy Ryaboy (Twitter Inc., USA)
In recent years, there has been a substantial amount of work on large-scale data analytics using Hadoop-based platforms running on large clusters of commodity machines. A less-explored topic is how those data, dominated by application logs, are collected and structured to begin with. In this paper, we present Twitter's production logging infrastructure and its evolution from application-specific logging to a unified "client events" log format, where messages are captured in common, well-formatted, flexible Thrift messages. Since most analytics tasks consider the user session as the basic unit of analysis, we pre-materialize "session sequences", which are compact summaries that can answer a large class of common queries quickly. The development of this infrastructure has streamlined log collection and data analysis, thereby improving our ability to rapidly experiment and iterate on various aspects of the service.
- Solving Big Data Challenges for Enterprise Application Performance Management
Tilmann Rabl (University of Toronto, Canada)
Mohammad Sadoghi (University of Toronto, Canada)
Hans-Arno Jacobsen (University of Toronto, Canada)
Sergio Gómez-Villamor (Universitat Politecnica de Catalunya, Spain)
Victor Muntés-Mulero (CA Labs Europe, Spain)
Serge Mankowskii (CA Labs Europe, Spain)
As the complexity of enterprise systems increases, the need for monitoring and analyzing such systems also grows. A number of companies have built sophisticated monitoring tools that go far beyond simple resource utilization reports. For example, based on instrumentation and specialized APIs, it is now possible to monitor single method invocations and trace individual transactions across geographically distributed systems. This high level of detail enables more precise forms of analysis and prediction but comes at the price of high data rates (i.e., big data). To maximize the benefit of data monitoring, the data has to be stored for an extended period of time for later analysis. This new wave of big data analytics imposes new challenges especially for application performance monitoring systems. The monitoring data has to be stored in a system that can sustain the high data rates and at the same time enable an up-to-date view of the underlying infrastructure. With the advent of modern key-value stores, a variety of data storage systems have emerged that are built with a focus on scalability and high data rates, as is predominant in this monitoring use case. In this work, we present our experience and a comprehensive performance evaluation of six modern (open-source) data stores in the context of application performance monitoring as part of a CA Technologies initiative. We evaluated these systems with data and workloads that can be found in application performance monitoring, as well as in on-line advertisement, power monitoring, and many other use cases. We present our insights not only as performance results but also as lessons learned and our experience relating to the setup and configuration complexity of these data stores in an industry setting.
- Can the Elephants Handle the NoSQL Onslaught?
Avrilia Floratou (University of Wisconsin - Madison, USA)
Nikhil Teletia (Microsoft Jim Gray Systems Laboratory, USA)
David J. DeWitt (Microsoft Jim Gray Systems Laboratory, USA)
Jignesh M. Patel (University of Wisconsin - Madison, USA)
Donghui Zhang (Paradigm4, USA)
In this new era of "big data", traditional DBMSs are under attack from two sides. At one end of the spectrum, the use of document store NoSQL systems (e.g. MongoDB) threatens to move modern Web 2.0 applications away from traditional RDBMSs. At the other end of the spectrum, big data DSS analytics that used to be the domain of parallel RDBMSs is now under attack by another class of NoSQL data analytics systems, such as Hive on Hadoop. So, are the traditional RDBMSs, aka "big elephants", doomed as they are challenged from both ends of this "big data" spectrum? In this paper, we compare one representative NoSQL system from each end of this spectrum with SQL Server, and analyze the performance and scalability aspects of each of these approaches (NoSQL vs. SQL) on two workloads (decision support analysis and interactive data-serving) that represent the two ends of the application spectrum. We present insights from this evaluation and speculate on potential trends for the future.
- Muppet: MapReduce-Style Processing of Fast Data
Wang Lam (@WalmartLabs, USA)
Lu Liu (@WalmartLabs, USA)
STS Prasad (@WalmartLabs, USA)
Anand Rajaraman (@WalmartLabs, USA)
Zoheb Vacheri (@WalmartLabs, USA)
AnHai Doan (University of Wisconsin-Madison, USA)
MapReduce has emerged as a popular method to process big data. In the past few years, however, not just big data, but fast data has also exploded in volume and availability. Examples of such data include sensor data streams, the Twitter Firehose, and Facebook updates. Numerous applications must process fast data. Can we provide a MapReduce-style framework so that developers can quickly write such applications and execute them over a cluster of machines, to achieve low latency and high scalability? In this paper we report on our investigation of this question, as carried out at Kosmix and WalmartLabs. We describe MapUpdate, a framework like MapReduce, but specifically developed for fast data. We describe Muppet, our implementation of MapUpdate. Throughout the description we highlight the key challenges, argue why MapReduce is not well suited to address them, and briefly describe our current solutions. Finally, we describe our experience and lessons learned with Muppet, which has been used extensively at Kosmix and WalmartLabs to power a broad range of applications in social media and e-commerce.
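To make the MapUpdate idea concrete, here is a toy Python sketch (the function names and the slate dictionary are my own illustration, not Muppet's actual API): a map function emits keyed events from the stream, and an update function folds each event into long-lived per-key state, the "slates" the framework maintains.

    # Illustrative MapUpdate-style processing under assumed names.
    slates = {}  # per-key state kept by the update stage

    def map_fn(event):
        # Turn a raw stream record into (key, value) pairs.
        yield (event["user"], 1)

    def update_fn(key, value):
        # Fold the new value into the slate for this key.
        slate = slates.setdefault(key, {"count": 0})
        slate["count"] += value

    for raw in [{"user": "alice"}, {"user": "bob"}, {"user": "alice"}]:
        for key, value in map_fn(raw):
            update_fn(key, value)

    print(slates)  # {'alice': {'count': 2}, 'bob': {'count': 1}}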
Demonstration Session 1: MapReduce, Big Data Systems, and Crowdsourcing
- Dedoop: Efficient Deduplication with Hadoop
Lars Kolb (Universität Leipzig, Germany)
Andreas Thor (Universität Leipzig, Germany)
Erhard Rahm (Universität Leipzig, Germany)
We demonstrate a powerful and easy-to-use tool called Dedoop (Deduplication with Hadoop) for MapReduce-based entity resolution (ER) of large datasets. Dedoop supports a browser-based specification of complex ER workflows including blocking and matching steps as well as the optional use of machine learning for the automatic generation of match classifiers. Specified workflows are automatically translated into MapReduce jobs for parallel execution on different Hadoop clusters. To achieve high performance Dedoop supports several advanced load balancing strategies.
- MapReduce-based Dimensional ETL Made Easy
Xiufeng Liu (Aalborg University, Denmark)
Christian Thomsen (Aalborg University, Denmark)
Torben Bach Pedersen (Aalborg University, Denmark)
This paper demonstrates ETLMR, a novel dimensional Extract-Transform-Load (ETL) programming framework that uses MapReduce to achieve scalability. ETLMR has built-in native support of data warehouse (DW) specific constructs such as star schemas, snowflake schemas, and slowly changing dimensions (SCDs). This makes it possible to build MapReduce-based dimensional ETL flows very easily. The ETL process can be configured with only a few lines of code. We will demonstrate the concrete steps in using ETLMR to load data into a (partly snowflaked) DW schema. This includes configuration of data sources and targets, dimension processing schemes, fact processing, and deployment. In addition, we also demonstrate its scalability on large data sets.
- CloudVista: Interactive and Economical Visual Cluster Analysis for Big Data in the Cloud
Huiqi Xu (Wright State University, USA)
Zhen Li (Wright State University, USA)
Shumin Guo (Wright State University, USA)
Keke Chen (Wright State University, USA)
Analysis of big data has become an important problem for many business and scientific applications, among which clustering and visualizing clusters in big data raise some unique challenges. This demonstration presents the CloudVista prototype system to address the problems with big data caused by using existing data reduction approaches. It promotes a whole-big-data visualization approach that preserves the details of clustering structure. The prototype system has several merits. (1) Its visualization model is naturally parallel, which guarantees the scalability. (2) The visual frame structure minimizes the data transferred between the cloud and the client. (3) The RandGen algorithm is used to achieve a good balance between interactivity and batch processing. (4) This approach is also designed to minimize the financial cost of interactive exploration in the cloud. The demonstration will highlight the problems with existing approaches and show the advantages of the CloudVista approach. The viewers will have the chance to play with the CloudVista prototype system and compare the visualization results generated with different approaches.
- Myriad: Scalable and Expressive Data Generation
Alexander Alexandrov (Technische Universität Berlin, Germany)
Kostas Tzoumas (Technische Universität Berlin, Germany)
Volker Markl (Technische Universität Berlin, Germany)
The current research focus on Big Data systems calls for a rethinking of data generation methods. The traditional sequential data generation approach is not well suited to large-scale systems as generating a terabyte of data may require days or even weeks depending on the number of constraints imposed on the generated model. We demonstrate Myriad, a new data generation toolkit that enables the specification of semantically rich data generator programs that can scale out linearly in a shared-nothing environment. Data generation programs built on top of Myriad implement an efficient parallel execution strategy leveraged by the extensive use of pseudo-random number generators with random access support.
- A Demonstration of DBWipes: Clean as You Query
Eugene Wu (Massachusetts Institute of Technology, USA)
Samuel Madden (Massachusetts Institute of Technology, USA)
Michael Stonebraker (Massachusetts Institute of Technology, USA)
As data analytics becomes mainstream, and the complexity of the underlying data and computation grows, it will be increasingly important to provide tools that help analysts understand the underlying reasons when they encounter errors in the result. While data provenance has been a large step in providing tools to help debug complex workflows, its current form has limited utility when debugging aggregation operators that compute a single output from a large collection of inputs. Traditional provenance will return the entire input collection, which has very low precision. In contrast, users are seeking precise descriptions of the inputs that caused the errors. We propose a Ranked Provenance System, which identifies subsets of inputs that influenced the output error, describes each subset with human readable predicates and orders them by contribution to the error. In this demonstration, we will present DBWipes, a novel data cleaning system that allows users to execute aggregate queries, and interactively detect, understand, and clean errors in the query results. Conference attendees will explore anomalies in campaign donations from the current US presidential election and in readings from a 54-node sensor deployment.
- ASTERIX: An Open Source System for "Big Data" Management and Analysis
Sattam Alsubaiee (University of California, Irvine, USA)
Yasser Altowim (University of California, Irvine, USA)
Hotham Altwaijry (University of California, Irvine, USA)
Alexander Behm (University of California, Irvine, USA)
Vinayak Borkar (University of California, Irvine, USA)
Yingyi Bu (University of California, Irvine, USA)
Michael Carey (University of California, Irvine, USA)
Raman Grover (University of California, Irvine, USA)
Zachary Heilbron (University of California, Irvine, USA)
Young-Seok Kim (University of California, Irvine, USA)
Chen Li (University of California, Irvine, USA)
Nicola Onose (Google Inc., USA)
Pouria Pirzadeh (University of California, Irvine, USA)
Rares Vernica (HP Labs, USA)
Jian Wen (University of California, Riverside, USA)
At UC Irvine, we are building a next generation parallel database system, called ASTERIX, as our approach to addressing today's "Big Data" management challenges. ASTERIX aims to combine time-tested principles from parallel database systems with those of the Web-scale computing community, such as fault tolerance for long running jobs. In this demo, we present a whirlwind tour of ASTERIX, highlighting a few of its key features. We will demonstrate examples of our data definition language to model semi-structured data, and examples of interesting queries using our declarative query language. In particular, we will show the capabilities of ASTERIX for answering geo-spatial queries and fuzzy queries, as well as ASTERIX' data feed construct for continuously ingesting data.
- Blink and It's Done: Interactive Queries on Very Large Data
Sameer Agarwal (University of California, Berkeley, USA)
Aurojit Panda (University of California, Berkeley, USA)
Barzan Mozafari (Massachusetts Institute of Technology, USA)
Anand P. Iyer (University of California, Berkeley, USA)
Samuel Madden (Massachusetts Institute of Technology, USA)
Ion Stoica (University of California, Berkeley, USA)
In this demonstration, we present BlinkDB, a massively parallel, sampling-based approximate query processing framework for running interactive queries on large volumes of data. The key observation in BlinkDB is that one can make reasonable decisions in the absence of perfect answers. BlinkDB extends the Hive/HDFS stack and can handle the same set of SPJA (selection, projection, join and aggregate) queries as supported by these systems. BlinkDB provides real-time answers along with statistical error guarantees, and can scale to petabytes of data and thousands of machines in a fault-tolerant manner. Our experiments using the TPC-H benchmark and on an anonymized real-world video content distribution workload from Conviva Inc. show that BlinkDB can execute a wide range of queries up to 150x faster than Hive on MapReduce and 10-150x faster than Shark (Hive on Spark) over tens of terabytes of data stored across 100 machines, all with an error of 2-10%.
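The statistical machinery behind such systems is ordinary sampling theory. The short Python sketch below (not BlinkDB code, and run over a synthetic data column) shows how an aggregate can be estimated from a uniform sample together with an approximate 95% confidence interval, which is the kind of error bound BlinkDB reports alongside its answers.

    import random, math

    # Synthetic "column" standing in for a large table attribute.
    data = [random.expovariate(1.0) for _ in range(1_000_000)]
    sample = random.sample(data, 10_000)  # uniform sample

    n, N = len(sample), len(data)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)

    est_sum = N * mean                        # scaled-up estimate of SUM(column)
    ci95 = 1.96 * N * math.sqrt(var / n)      # approximate 95% confidence half-width

    print(f"SUM ~= {est_sum:.0f} +/- {ci95:.0f}")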
- Massive Genomic Data Processing and Deep Analysis
Abhishek Roy (University of Massachusetts, USA)
Yanlei Diao (University of Massachusetts, USA)
Evan Mauceli (Harvard Medical School & Children’s Hospital Boston, USA)
Yiping Shen (Harvard Medical School & Children’s Hospital Boston, USA)
Bai-Lin Wu (Harvard Medical School & Children’s Hospital Boston, USA)
Today large sequencing centers are producing genomic data at the rate of 10 terabytes a day and require complicated processing to transform massive amounts of noisy raw data into biological information. To address these needs, we develop a system for end-to-end processing of genomic data, including alignment of short read sequences, variation discovery, and deep analysis. We also employ a range of quality control mechanisms to improve data quality and parallel processing techniques for performance. In the demo, we will use real genomic data to show details of data transformation through the workflow, the usefulness of end results (ready for use as testable hypotheses), the effects of our quality control mechanisms and improved algorithms, and finally performance improvement.
- MonetDB/DataCell: Online Analytics in a Streaming Column-Store
Erietta Liarou (Centrum Wiskunde & Informatica, Netherlands)
Stratos Idreos (Centrum Wiskunde & Informatica, Netherlands)
Stefan Manegold (Centrum Wiskunde & Informatica, Netherlands)
Martin Kersten (Centrum Wiskunde & Informatica, Netherlands)
In DataCell, we design streaming functionalities in a modern relational database kernel which targets big data analytics. This includes exploitation of both its storage/execution engine and its optimizer infrastructure. We investigate the opportunities and challenges that arise with such a direction and we show that it carries significant advantages for modern applications in need of online analytics such as web logs, network monitoring and scientific data management. The major challenge then becomes the efficient support for specialized stream features, e.g., multi-query processing and incremental window-based processing as well as exploiting standard DBMS functionalities in a streaming environment such as indexing. This demo presents DataCell, an extension of the MonetDB open-source column-store for online analytics. The demo gives users the opportunity to experience the features of DataCell such as processing both stream and persistent data and performing window-based processing. The demo provides a visual interface to monitor the critical system components, e.g., how query plans transform from typical DBMS query plans to online query plans, how data flows through the query plans as the streams evolve, how DataCell maintains intermediate results in columnar form to avoid repeated evaluation of the same stream portions, etc. The demo also provides the ability to interactively set the test scenarios and various DataCell knobs.
- SWORS: A System for the Efficient Retrieval of Relevant Spatial Web Objects
Xin Cao (Nanyang Technological University, Singapore)
Gao Cong (Nanyang Technological University, Singapore)
Christian S. Jensen (Aarhus University, Denmark)
Jun Jie Ng (Nanyang Technological University, Singapore)
Beng Chin Ooi (National University of Singapore, Singapore)
Nhan-Tue Phan (Nanyang Technological University, Singapore)
Dingming Wu (Hong Kong Baptist University, Hong Kong)
Spatial web objects that possess both a geographical location and a textual description are gaining in prevalence. This gives prominence to spatial keyword queries that exploit both location and textual arguments. Such queries are used in many web services such as yellow pages and map services. We present SWORS, the Spatial Web Object Retrieval System, which is capable of efficiently retrieving spatial web objects that satisfy spatial keyword queries. Specifically, SWORS supports two types of queries: a) the location-aware top-k text retrieval (LkT) query that retrieves k individual spatial web objects taking into account query location proximity and text relevancy; b) the spatial keyword group (SKG) query that retrieves a group of objects that cover the query keywords and are nearest to the query location and have the shortest inter-object distances. SWORS provides browser-based interfaces for desktop and laptop computers and provides a client application for mobile devices. The interfaces and the client enable users to formulate queries and view the query results on a map. The server side stores the data and processes the queries. We use three real-life data sets to demonstrate the functionality and performance of SWORS.
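A typical location-aware top-k ranking function in this line of work, shown here only to illustrate the trade-off (the exact function used by SWORS may differ), scores an object o for a query q as

    f(o, q) = \alpha \cdot \frac{\mathrm{dist}(q.\lambda,\, o.\lambda)}{\mathrm{maxDist}}
              + (1 - \alpha) \cdot \bigl(1 - \mathrm{trel}(q.\psi,\, o.\psi)\bigr)

where q.\lambda and q.\psi are the query location and keywords, trel is a normalized text-relevance score, and \alpha \in [0, 1] weights spatial proximity against text relevancy; the k objects with the smallest scores are returned.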
- CyLog/Crowd4U: A Declarative Platform for Complex Data-centric Crowdsourcing
Atsuyuki Morishima (University of Tsukuba, Japan)
Norihide Shinagawa (University of Tsukuba, Japan)
Tomomi Mitsuishi (University of Tsukuba, Japan)
Hideto Aoki (University of Tsukuba, Japan)
Shun Fukusumi (University of Tsukuba, Japan)
This demo presents a principled approach to the problems of data-centric human/machine computations with Crowd4U, a crowdsourcing platform equipped with a suite of tools for rapid development of crowdsourcing applications. Using the demo, we show that declarative database abstraction can be used as a powerful tool to design, implement, and analyze data-centric crowdsourcing applications. The power of Crowd4U comes from CyLog, a database abstraction that handles complex data-centric human/machine computations. CyLog is a Datalog-like language that incorporates a principled feedback system for humans at the language level, so that the semantics of computations that are not closed within machines can be defined based on game theory. We believe that the demo clearly shows that database abstraction can be a promising basis for designing complex data-centric applications requiring human/machine computations.
Demonstration Session 2: Query Pricing, Processing, and Optimization
- Exploiting Database Similarity Joins for Metric Spaces
Yasin N. Silva (Arizona State University, USA)
Spencer Pearson (Arizona State University, USA)
Similarity Joins are recognized among the most useful data processing and analysis operations and are extensively used in multiple application domains. They retrieve all data pairs whose distances are smaller than a predefined threshold epsilon. Multiple Similarity Join algorithms and implementation techniques have been proposed. They range from out-of-database approaches for only in-memory and external memory data to techniques that make use of standard database operators to answer similarity joins. Recent work has shown that this operation can be efficiently implemented as a physical database operator. However, the proposed operator only supports 1D numeric data. This paper presents DBSimJoin, a physical Similarity Join database operator for datasets that lie in any metric space. DBSimJoin is a non-blocking operator that prioritizes the early generation of results. We implemented the proposed operator in PostgreSQL, an open source database system. We show how this operator can be used in multiple real-world data analysis scenarios with multiple data types and distance functions. Particularly, we show the use of DBSimJoin to identify similar images represented as feature vectors, and similar publications in a bibliographic database. We also show that DBSimJoin scales very well when important parameters, e.g., epsilon and data size, increase.
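For readers unfamiliar with the operator's semantics, the following naive Python sketch is a reference definition of the result (not DBSimJoin's algorithm): it returns every pair within distance epsilon under an arbitrary metric.

    # Naive metric-space similarity join: all pairs within distance eps.
    def sim_join(r, s, dist, eps):
        return [(x, y) for x in r for y in s if dist(x, y) <= eps]

    # Example with Euclidean distance over 2D feature vectors.
    euclidean = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    R = [(0.0, 0.0), (1.0, 1.0)]
    S = [(0.1, 0.0), (5.0, 5.0)]
    print(sim_join(R, S, euclidean, eps=0.5))  # [((0.0, 0.0), (0.1, 0.0))]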
- Stethoscope: A platform for interactive visual analysis of query execution plans
Mrunal Gawade (Centrum Wiskunde & Informatica, Netherlands)
Martin Kersten (Centrum Wiskunde & Informatica, Netherlands)
Searching for the performance bottleneck in an execution trace is an error prone and time consuming activity. Existing tools offer some comfort by providing a visual representation of the trace for analysis. In this paper we present Stethoscope, an interactive visual tool to inspect and analyze columnar database query performance, both online and offline. Its unique interactive animated interface capitalizes on the large data-flow graph representation of a query execution plan, augmented with query execution trace information. We demonstrate features of Stethoscope for both online and offline analysis of long-running queries. It helps in understanding where time goes, how optimizers perform, and how parallel processing on multi-core systems is exploited.
- Hum-a-song: A Subsequence Matching with Gaps-Range-Tolerances Query-By-Humming System
Alexios Kotsifakos (University of Texas at Arlington, USA)
Panagiotis Papapetrou (Aalto University, Finland)
Jaakko Hollmén (Aalto University, Finland)
Dimitrios Gunopulos (University of Athens, Greece)
Vassilis Athitsos (University of Texas at Arlington, USA)
George Kollios (Boston University, USA)
We present "Hum-a-song", a system built for music retrieval, and particularly for the Query-By-Humming (QBH) application. According to QBH, the user is able to hum a part of a song that she recalls and would like to learn what this song is, or find other songs similar to it in a large music repository. We present a simple yet efficient approach that maps the problem to time series subsequence matching. The query and the database songs are represented as 2-dimensional time series conveying information about the pitch and the duration of the notes. Then, since the query is a short sequence and we want to find its best match that may start and end anywhere in the database, subsequence matching methods are suitable for this task. In this demo, we present a system that employs and exposes to the user a variety of state-of-the-art dynamic programming methods, including a newly proposed efficient method named SMBGT that is robust to noise and considers all intrinsic problems in QBH; it allows variable tolerance levels when matching elements, where tolerances are defined as functions of the compared sequences, gaps in both the query and target sequences, and bounds the matching length and (optionally) the minimum number of matched elements. Our system is intended to become open source, which is to the best of our knowledge the first non-commercial effort trying to solve QBH with a variety of methods, and that also approaches the problem from the time series perspective.
- SkewTune in Action: Mitigating Skew in MapReduce Applications
YongChul Kwon (University of Washington, USA)
Magdalena Balazinska (University of Washington, USA)
Bill Howe (University of Washington, USA)
Jerome Rolia (HP Labs, USA)
We demonstrate SkewTune, a system that automatically mitigates skew in user-defined MapReduce programs and is a drop-in replacement for Hadoop. The demonstration has two parts. First, we demonstrate how SkewTune mitigates skew in real MapReduce applications at runtime by running a real application in a public cloud. Second, through an interactive graphical interface, we demonstrate the details of the skew mitigation process using both real and synthetic workloads that represent various skew configurations.
- Playful Query Specification with DataPlay
Azza Abouzied (Yale University, USA)
Joseph M. Hellerstein (University of California, Berkeley, USA)
Avi Silberschatz (Yale University, USA)
DataPlay is a query tool that encourages a trial-and-error approach to query specification. DataPlay uses a graphical query language to make a particularly challenging query specification task - quantification - easier. It constrains the relational data model to enable the presentation of non-answers, in addition to answers, to aid query interpretation. Two novel features of DataPlay are suggesting semantic variations to a query and correcting queries by example. We introduce DataPlay as a sophisticated query specification tool and demonstrate its unique interaction models.
- NoDB in Action: Adaptive Query Processing on Raw Data
Ioannis Alagiannis (Ecole Polytechnique Fédérale de Lausanne, Switzerland)
Renata Borovica (Ecole Polytechnique Fédérale de Lausanne, Switzerland)
Miguel Branco (Ecole Polytechnique Fédérale de Lausanne, Switzerland)
Stratos Idreos (Centrum Wiskunde & Informatica, Netherlands)
Anastasia Ailamaki (Ecole Polytechnique Fédérale de Lausanne, Switzerland)
As data collections become larger and larger, users are faced with increasing bottlenecks in their data analysis. More data means more time to prepare the data, to load the data into the database and to execute the desired queries. Many applications already avoid using traditional database systems, e.g., scientific data analysis and social networks, due to their complexity and the increased data-to-query time, i.e. the time between getting the data and retrieving its first useful results. For many applications data collections keep growing fast, even on a daily basis, and this data deluge will only increase in the future, where it is expected to have much more data than what we can move or store, let alone analyze. In this demonstration, we will showcase a new philosophy for designing database systems called NoDB. NoDB aims at minimizing the data-to-query time, most prominently by removing the need to load data before launching queries. We will present our prototype implementation, PostgresRaw, built on top of PostgreSQL, which allows for efficient query execution over raw data files with zero initialization overhead. We will visually demonstrate how PostgresRaw incrementally and adaptively touches, parses, caches and indexes raw data files autonomously and exclusively as a side-effect of user queries.
- Complex Preference Queries Supporting Spatial Applications for User Groups
Florian Wenzel (Universität Augsburg, Germany)
Markus Endres (Universität Augsburg, Germany)
Stefan Mandl (Universität Augsburg, Germany)
Werner Kießling (Universität Augsburg, Germany)
Our demo application demonstrates a personalized location-based web application using Preference SQL that allows single users as well as groups of users to find accommodations in Istanbul that satisfy both hard constraints and user preferences. The application assists in defining spatial, numerical, and categorical base preferences and composes complex preference statements in an intuitive fashion. Unlike existing location-based services, the application considers spatial queries as soft instead of hard constraints to determine the best matches which are finally presented on a map. The underlying Preference SQL framework is implemented on top of a database, therefore enabling a seamless application integration with standard SQL back-end systems as well as efficient and extensible preference query processing.
- Demonstration of the FDB Query Engine for Factorised Databases
Nurzhan Bakibayev (University of Oxford, UK)
Dan Olteanu (University of Oxford, UK)
Jakub Závodný (University of Oxford, UK)
FDB is an in-memory query engine for factorised databases, which are relational databases that use compact factorised representations at the physical layer to reduce data redundancy and boost query performance. We demonstrate FDB using real data sets from IMDB, DBLP, and the NELL repository of facts learned from Web pages. The users can inspect factorisations as well as plans used by FDB to compute factorised results of select-project-join queries on factorised databases.
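A standard illustration of why factorisation helps (a textbook-style example, not one taken from the demo itself): if, for a fixed A-value a, a relation over attributes A, B, C contains every combination of m B-values and n C-values, the flat representation stores m * n tuples, while the factorised expression

    \langle a \rangle \times (b_1 \cup \dots \cup b_m) \times (c_1 \cup \dots \cup c_n)

stores only 1 + m + n singletons, and select-project-join queries can often be evaluated directly on such expressions without expanding them back into flat relations.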
- PET: Reducing Database Energy Cost via Query Optimization
Zichen Xu (The Ohio State University, USA)
Yi-Cheng Tu (The University of South Florida, USA)
Xiaorui Wang (The Ohio State University, USA)
Energy conservation is an increasingly important issue in designing modern database management systems (DBMSs). It requires careful thinking about the tradeoffs between energy and performance. Despite the significant amount of effort at the hardware level to make the major components consume less energy, we argue for a revisit of the DBMS query processing mechanism to identify and harvest the potential for energy saving. However, the state-of-the-art architecture of DBMSs does not take energy usage into consideration in its design. A major challenge in developing an energy-aware DBMS is to design and implement a cost-based query optimizer that evaluates query plans by both performance and energy costs. By following such a strategy, our previous work revealed that energy-efficient query plans do not necessarily have the shortest processing time. This demo proposal introduces PET -- an energy-aware query optimization framework that is built as a part of the PostgreSQL kernel. PET, via its power cost estimation module and plan evaluation model, enables the database system to run under a DBA-specified energy/performance tradeoff level. PET contains a power cost estimator that can accurately estimate the power cost of query plans at compile time, and a query evaluation engine through which the DBA can configure key PET parameters towards the desired tradeoff. The software to be demonstrated will also include a workload engine for producing large quantities of queries and data sets. Our demonstration will show how PET functions via a comprehensive set of views from its graphical user interface, named PET Viewer. Through such interfaces, a user can achieve a good understanding of energy-related query optimization and cost-based plan generation. Users are also allowed to interact with PET to experience different energy/performance tradeoffs by changing PET and workload parameters at query runtime.
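One simple way to picture the DBA-specified tradeoff knob (an illustrative weighted form; PET's actual cost model is more detailed) is as a combined plan cost

    C_{\mathrm{plan}} = w \cdot C_{\mathrm{time}} + (1 - w) \cdot C_{\mathrm{power}}, \qquad w \in [0, 1]

where w = 1 reduces to conventional time-only optimization, and smaller w lets the optimizer prefer plans with lower estimated power cost even if they run somewhat longer.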
- SPAM: A SPARQL Analysis and Manipulation Tool
Andrés Letelier (Pontificia Universidad Católica de Chile, Chile)
Jorge Pérez (Universidad de Chile, Chile)
Reinhard Pichler (Technische Universität Wien, Austria)
Sebastian Skritek (Technische Universität Wien, Austria)
SQL developers are used to having elaborate tools which help them in writing queries. In contrast, the creation of tools to assist users in the development of SPARQL queries is still in its infancy. In this system demo, we present the SPARQL Analysis and Manipulation (SPAM) tool, which provides help for the development of SPARQL queries. The main features of the SPAM tool comprise an editor with both text and graphical interface, as well as various functions for the static and dynamic analysis of SPARQL queries.
- QueryMarket Demonstration: Pricing for Online Data Markets
Paraschos Koutris (University of Washington, USA)
Prasang Upadhyaya (University of Washington, USA)
Magdalena Balazinska (University of Washington, USA)
Bill Howe (University of Washington, USA)
Dan Suciu (University of Washington, USA)
Increasingly data is being bought and sold online. To facilitate such transactions, online data marketplaces have emerged to provide a service for sellers to price views on their data, and buyers to buy such views. These marketplaces neither support the sale of ad-hoc queries (that are not one of the specified views), nor do they support queries that join datasets. We present QueryMarket, a prototype data marketplace that automatically extrapolates prices to ad-hoc queries, including those with joins, from the manually priced views. We call this capability "query-based pricing" and describe how it is superior to existing pricing methods, and how it provides more flexible pricing for the sellers. We then show how QueryMarket implements query-based pricing and how it generates explanations for the prices it computes.
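Roughly, query-based pricing derives the price of an ad-hoc query from the seller's manually priced views: the buyer pays for the cheapest bundle of priced views that suffices to answer the query. As a schematic formula (my paraphrase of the idea, not necessarily the exact definition used by QueryMarket):

    p(Q) = \min\Bigl\{ \sum_{V \in S} p(V) \;:\; S \subseteq \mathcal{V},\ S \text{ determines } Q \Bigr\}

where \mathcal{V} is the set of views the seller has priced explicitly.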
Demonstration Session 3: Information Retrieval, Web, and Mobility
- DISKs: A System for Distributed Spatial Group Keyword Search on Road Networks
Siqiang Luo (Fudan University, China)
Yifeng Luo (Fudan University, China)
Shuigeng Zhou (Fudan University, China)
Gao Cong (Nanyang Technological University, Singapore)
Jihong Guan (Tongji University, China)
Queries (e.g., shortest path queries) on road networks have been extensively studied. Although most of the existing query processing approaches are designed for centralized environments, there is a growing need to handle queries on road networks in distributed environments due to the increasing query workload and the challenge of querying large networks. In this demonstration, we showcase a distributed system called DISKs (DIstributed Spatial Keyword search) that is capable of efficiently supporting spatial group keyword search (SGKS) on road networks. Given a group of keywords X and a distance r, an SGKS returns locations on a road network, such that for each returned location p, there exists a set of nodes (on the road network) which are located within a network distance r from p and collectively contain X. We will demonstrate the innovative modules, performance and interactive user interfaces of DISKs.
- WETSUIT: An Efficient Mashup Tool for Searching and Fusing Web Entities
Stefan Endrullis (Universität Leipzig, Germany)
Andreas Thor (Universität Leipzig, Germany)
Erhard Rahm (Universität Leipzig, Germany)
We demonstrate a new powerful mashup tool called WETSUIT (Web EnTity Search and fUsIon Tool) to search and integrate web data from diverse sources and domain-specific entity search engines. WETSUIT supports adaptive search strategies to query sets of relevant entities with a minimum of communication overhead. Mashups can be composed using a set of high-level operators based on the Java-compatible language Scala. The operator implementation supports a high degree of parallel processing, in particular a streaming of entities between all data transformation operations facilitating a fast presentation of intermediate results. WETSUIT has already been applied to solve challenging integration tasks from different domains.
- Model-based Integration of Past & Future in TimeTravel
Mohamed E. Khalefa (Aalborg University, Denmark)
Ulrike Fischer (Technische Universität Dresden, Germany)
Torben Bach Pedersen (Aalborg University, Denmark)
Wolfgang Lehner (Technische Universität Dresden, Germany)
We demonstrate TimeTravel, an efficient DBMS for seamless integrated querying of past and (forecasted) future values of time series, allowing the user to view past and future values as one joint time series. This functionality is important for advanced application domains like energy. The main idea is to compactly represent time series as models. By using models, the TimeTravel system answers queries approximately on past and future data with error guarantees (absolute error and confidence) one order of magnitude faster than when accessing the time series directly. In addition, it efficiently supports exact historical queries by only accessing relevant portions of the time series. This is unlike existing approaches, which access the entire time series to exactly answer the query. To realize this system, we propose a novel hierarchical model index structure. As real-world time series usually exhibit seasonal behavior, models in this index incorporate seasonality. To construct a hierarchical model index, the user specifies the seasonality period, error guarantee levels, and a statistical forecast method. As time proceeds, the system incrementally updates the index and utilizes it to answer approximate and exact queries. TimeTravel is implemented in PostgreSQL, thus achieving complete user transparency at the query level. In the demo, we show the easy building of a hierarchical model index for a real-world time series and the effect of varying the error guarantees on the speed-up of approximate and exact queries.
- DrillBeyond: Enabling Business Analysts to Explore the Web of Open Data
Julian Eberius (Technische Universität Dresden, Germany)
Maik Thiele (Technische Universität Dresden, Germany)
Katrin Braunschweig (Technische Universität Dresden, Germany)
Wolfgang Lehner (Technische Universität Dresden, Germany)
Following the Open Data trend, governments and public agencies have started making their data available on the Web and have established platforms such as data.gov or data.un.org. These Open Data platforms provide a huge amount of data on various topics such as demographics, transport, finance or health, in various data formats. One typical usage scenario for this kind of data is its integration into a database or data warehouse in order to apply data analytics. However, in today's business intelligence tools there is an evident lack of support for so-called situational or ad-hoc data integration. In this demonstration we therefore present DrillBeyond, a novel database and information retrieval engine which allows users to query a local database as well as the Web of Open Data in a seamless and integrated way with standard SQL. The audience will be able to pose queries to our DrillBeyond system, which will be answered partly from local data in the database and partly from datasets that originate from the Web of Data. We will show how such queries are divided into known and unknown parts and how missing attributes are mapped to open datasets. We will demonstrate the integration of the open datasets back into the DBMS in order to apply its analytical features.
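A toy sketch of the known/unknown split described above: attributes available locally are answered from the local table, while a missing attribute is fetched from a stubbed Open Data lookup and joined back on the entity key. The table, the stub, and the function names are hypothetical; DrillBeyond performs this inside a SQL engine with dataset search and ranking.

```python
# Illustrative known/unknown attribute split; names and the stub are assumptions.
LOCAL = {  # local warehouse table: country -> known attributes
    "Germany": {"revenue": 120.0},
    "Turkey": {"revenue": 95.0},
}

def open_data_lookup(attribute):
    """Stand-in for searching the Web of Open Data for a missing attribute."""
    open_dataset = {"population": {"Germany": 83.2, "Turkey": 85.3}}
    return open_dataset.get(attribute, {})

def drill_beyond(requested_attributes):
    known = [a for a in requested_attributes
             if any(a in row for row in LOCAL.values())]
    unknown = [a for a in requested_attributes if a not in known]
    result = {k: dict(v) for k, v in LOCAL.items()}
    for attribute in unknown:               # augment rows from open data
        mapping = open_data_lookup(attribute)
        for key, row in result.items():
            row[attribute] = mapping.get(key)
    return result

# "SELECT country, revenue, population ..." with population missing locally.
print(drill_beyond(["revenue", "population"]))
```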
- Discovering and Exploring Relations on the Web
Ndapandula Nakashole (Max Planck Institute for Informatics, Germany)
Gerhard Weikum (Max Planck Institute for Informatics, Germany)
Fabian Suchanek (Max Planck Institute for Informatics, Germany)
We propose a demonstration of PATTY, a system for learning semantic relationships from the Web. PATTY relations are organized by semantic types and form a hierarchy based on subsumptions of relations. Our demonstration will illustrate the coverage and quality of the PATTY taxonomy. We will show the richness of PATTY by a variety of use cases, including paraphrasing of relations which is relevant for information extraction, and semantic search over subject-predicate-object triples capturing entities, relations, semantic types, noun phrases, and relational phrases.
- MapRat: Meaningful Explanation, Interactive Exploration and Geo-Visualization of Collaborative Ratings
Saravanan Thirumuruganathan (University of Texas At Arlington, USA)
Mahashweta Das (University of Texas At Arlington, USA)
Shrikant Desai (University of Texas At Arlington, USA)
Sihem Amer-Yahia (Qatar Computing Research Institute, Qatar)
Gautam Das (University of Texas At Arlington, USA)
Cong Yu (Google Inc., USA)
Collaborative rating sites such as IMDb and Yelp have become rich resources that users consult to form judgments about and choose from among competing items. Most of these sites either provide a plethora of information for users to interpret all by themselves or only a simple overall aggregate. Such aggregates (e.g., the average rating over all users who have rated an item, aggregates along pre-defined dimensions, etc.) cannot help a user quickly decide the desirability of an item. In this paper, we build a system, MapRat, that allows a user to explore multiple carefully chosen aggregate analytic details over a set of user demographics that meaningfully explain the ratings associated with the item(s) of interest. MapRat allows a user to systematically explore, visualize and understand user rating patterns of input item(s) so as to make an informed decision quickly. In the demo, participants are invited to explore collaborative movie ratings for popular movies.
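The following small sketch illustrates the kind of demographic aggregates such a system can surface: average rating per demographic group, ranked by how far each group deviates from the overall average. The sample data, the "meaningfulness" score, and the cut-off are assumptions for illustration, not MapRat's actual selection algorithm.

```python
# Demographic group-by aggregates ranked by deviation from the overall mean;
# data and ranking criterion are illustrative assumptions.
from collections import defaultdict

ratings = [  # (rating, age_group, gender)
    (5, "18-29", "F"), (4, "18-29", "M"), (2, "30-44", "M"),
    (3, "30-44", "F"), (1, "45+", "M"), (5, "18-29", "F"),
]

overall = sum(r for r, _, _ in ratings) / len(ratings)

def group_aggregates(dimension_index, label):
    groups = defaultdict(list)
    for row in ratings:
        groups[row[dimension_index]].append(row[0])
    return [(label, value, sum(v) / len(v)) for value, v in groups.items()]

aggregates = group_aggregates(1, "age") + group_aggregates(2, "gender")
# Rank groups by deviation from the overall mean: the most "explanatory" first.
aggregates.sort(key=lambda g: abs(g[2] - overall), reverse=True)
for label, value, avg in aggregates[:3]:
    print(f"{label}={value}: avg {avg:.2f} (overall {overall:.2f})")
```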
- Deco: A System for Declarative Crowdsourcing
Hyunjung Park (Stanford University, USA)
Richard Pang (Stanford University, USA)
Aditya Parameswaran (Stanford University, USA)
Hector Garcia-Molina (Stanford University, USA)
Neoklis Polyzotis (University of California, Santa Cruz, USA)
Jennifer Widom (Stanford University, USA)
Deco is a system that enables declarative crowdsourcing: answering SQL queries posed over data gathered from the crowd as well as existing relational data. Deco implements a novel push-pull hybrid execution model in order to support a flexible data model and a precise query semantics, while coping with the combination of latency, monetary cost, and uncertainty of crowdsourcing. We demonstrate Deco using two crowdsourcing platforms: Amazon Mechanical Turk and an in-house platform, to show how Deco provides a convenient means of collecting and querying crowdsourced data.
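A hedged sketch of the pull/push intuition: first try to answer the query from stored tuples (pull); if too few rows qualify, issue crowd tasks (push) to obtain more, then re-evaluate. The crowd stub, the names, and the stopping rule are assumptions, not Deco's data model or execution engine.

```python
# Pull-then-push loop for crowd-backed querying; the crowd stub is hypothetical.
import random

stored = [{"restaurant": "A", "cuisine": "turkish", "rating": 4}]

def crowd_task(question):
    """Stand-in for posting a task on a platform such as Mechanical Turk."""
    random.seed(len(stored))  # deterministic for the example; ignores `question`
    return {"restaurant": f"Crowd-{len(stored)}",
            "cuisine": "turkish",
            "rating": random.randint(1, 5)}

def answer(predicate, min_rows, max_crowd_tasks=5):
    rows = [r for r in stored if predicate(r)]
    tasks = 0
    while len(rows) < min_rows and tasks < max_crowd_tasks:
        stored.append(crowd_task("Name a Turkish restaurant and rate it"))
        tasks += 1
        rows = [r for r in stored if predicate(r)]
    return rows

# "SELECT * FROM restaurants WHERE cuisine = 'turkish'" needing >= 3 answers.
print(answer(lambda r: r["cuisine"] == "turkish", min_rows=3))
```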
- Developing and Analyzing XSDs through BonXai
Wim Martens (Universität Bayreuth, Germany)
Frank Neven (Universiteit Hasselt, Belgium)
Matthias Niewerth (Technische Universität Dortmund, Germany)
Thomas Schwentick (Technische Universität Dortmund, Germany)
BonXai is a versatile schema specification language expressively equivalent to XML Schema. It is not intended as a replacement for XML Schema but it can serve as an additional, user-friendly front-end. It offers a simple way and a lightweight syntax to specify the context of elements based on regular expressions rather than on types. In this demo we show the front-end capabilities of BonXai and exemplify its potential to offer a novel way to view existing XML Schema Definitions. In particular, we present several usage scenarios specifically targeted to showcase the ease of specifying, modifying, and understanding XML Schema Definitions through BonXai.
- InfoPuzzle: Exploring Group Decision Making in Mobile Peer-to-Peer Databases
Aaron J. Elmore (University of California, Santa Barbara, USA)
Sudipto Das (Microsoft Research, USA)
Divyakant Agrawal (University of California, Santa Barbara, USA)
Amr El Abbadi (University of California, Santa Barbara, USA)
As Internet-based services and mobile computing devices, such as smartphones and tablets, become ubiquitous, society's reliance on them to accomplish critical and time-sensitive tasks, such as information dissemination and collaborative decision making, also increases. Dependence on these media magnifies the damage caused by their disruption, whether malicious or natural. For instance, a natural disaster disrupting cellular and Internet infrastructures impedes information spread, which in turn leads to chaos, both among the victims and the aid providers. Decentralized and ad-hoc mechanisms for information dissemination and decision making are paramount to help restore order. We demonstrate InfoPuzzle, a mobile peer-to-peer database that utilizes direct device communication to enable group decision making, or consensus, without reliance on centralized communication services. InfoPuzzle minimizes the system's resource consumption to prolong the lifetime of the power-constrained devices by minimizing communication overhead, computational complexity, and persistent storage size. Due to user mobility and the limited range of point-to-point communication, knowing the exact number of participants is impossible, and therefore traditional consensus or quorum protocols cannot be used. We rely on distinct counting techniques, probabilistic thresholds, and time-bounded approaches to reach agreement. In this demo, we will explore various challenges and heuristics in estimating group participation to aid users in reaching consensus without centralized services.
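One building block mentioned above, distinct counting, can be illustrated with a tiny Flajolet-Martin-style sketch: peers hash the device identifiers they have heard from, keep only the maximum number of trailing zero bits, and merge sketches by taking the maximum, which yields a rough participation estimate without any central registry. The correction constant and the quorum test below are illustrative assumptions, not InfoPuzzle's protocol.

```python
# FM-style distinct-count sketch plus a simple probabilistic quorum test;
# constants and the threshold rule are assumptions for illustration.
import hashlib

def trailing_zeros(n):
    return (n & -n).bit_length() - 1 if n else 32

class DistinctCounter:
    """Tiny sketch; peers can merge sketches by taking the maximum."""
    def __init__(self):
        self.max_zeros = 0

    def add(self, device_id):
        h = int(hashlib.sha1(device_id.encode()).hexdigest(), 16) & 0xFFFFFFFF
        self.max_zeros = max(self.max_zeros, trailing_zeros(h))

    def merge(self, other):
        self.max_zeros = max(self.max_zeros, other.max_zeros)

    def estimate(self):
        return int(2 ** self.max_zeros / 0.77351)  # standard FM correction

def consensus_reached(yes_votes, counter, quorum=0.5):
    """Declare agreement once yes-votes exceed a fraction of the estimate."""
    return yes_votes > quorum * counter.estimate()

counter = DistinctCounter()
for i in range(40):
    counter.add(f"device-{i}")
print(counter.estimate(), consensus_reached(yes_votes=25, counter=counter))
```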
- Manage and Query Generic Moving Objects in SECONDO
Jianqiu Xu (FernUniversität Hagen, Germany)
Ralf Hartmut Güting (FernUniversität Hagen, Germany)
In this demonstration, we introduce a system that is able to manage moving objects in all real-world environments, e.g., road networks, bus networks and indoor spaces. The complete trip of a person, such as Walk - Car - Walk - Indoor, is managed by the system, with the precise locations of both outdoor and indoor movements represented. Trajectories located in several environments are integrated into the same framework. The system supports shortest-path search where the start and end locations lie in different environments, for example, from a room to a bus stop. A comprehensive and scalable set of moving objects is generated to simulate human movement in practice. Optimization methods are developed to efficiently answer novel queries regarding transportation modes and mobile environments. Most of these queries are not supported by existing methods because of the limitations of their data representations.
- Chronos: Facilitating History Discovery by Linking Temporal Records
Pei Li (University of Milano-Bicocca, Italy)
Haidong Wang (Nankai University, China)
Christina Tziviskou (University of Milano-Bicocca, Italy)
Xin Luna Dong (AT&T Labs - Research, USA)
Xiaoguang Liu (Nankai University, China)
Andrea Maurino (University of Milano-Bicocca, Italy)
Divesh Srivastava (AT&T Labs - Research, USA)
Many data sets contain temporal records over a long period of time; each record is associated with a time stamp and describes some aspects of a real-world entity at that particular time. From such data, users often wish to search for entities in a particular period and understand the history of one entity or all entities in the data set. A major challenge for enabling such search and exploration is to identify records that describe the same real-world entity over a long period of time; however, linking temporal records is hard given that the values that describe an entity can evolve over time (e.g., a person can move from one affiliation to another). We demonstrate the CHRONOS system, which offers users a useful tool for finding real-world entities over time and understanding the history of entities in the bibliography domain. The core of CHRONOS is a temporal record-linkage algorithm, which is tolerant to value evolution over time. Our algorithm can obtain an F-measure of over 0.9 in linking author records and fix errors made by DBLP. We show how CHRONOS allows users to explore the history of authors, and how it helps users understand our linkage results by comparing our results with those of existing systems, highlighting differences in the results, explaining our decisions to users, and answering "what-if" questions.
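The intuition behind temporal record linkage can be sketched as follows: agreement on a stable attribute (the name) counts strongly, while disagreement on an evolving attribute (the affiliation) is penalized less as the time gap between records grows. The weights, decay rate, and sample records are illustrative assumptions, not the CHRONOS algorithm.

```python
# Time-decayed penalty for evolving attributes; weights and records are hypothetical.
import math

def link_score(rec1, rec2, decay=0.2):
    score = 0.0
    score += 1.0 if rec1["name"] == rec2["name"] else -2.0
    if rec1["affiliation"] != rec2["affiliation"]:
        gap_years = abs(rec1["year"] - rec2["year"])
        # An affiliation change is increasingly plausible over long gaps.
        score -= math.exp(-decay * gap_years)
    else:
        score += 1.0
    return score

r1 = {"name": "J. Smith", "affiliation": "Univ. A", "year": 2003}
r2 = {"name": "J. Smith", "affiliation": "Univ. B", "year": 2011}
r3 = {"name": "J. Smith", "affiliation": "Univ. B", "year": 2004}

print(link_score(r1, r2))  # 8-year gap: a move is plausible, small penalty
print(link_score(r1, r3))  # 1-year gap: heavier penalty for the disagreement
```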
- TELEIOS: A Database-Powered Virtual Earth Observatory
Manolis Koubarakis (University of Athens, Greece)
Kostis Kyzirakos (University of Athens, Greece)
Manos Karpathiotakis (University of Athens, Greece)
Charalampos Nikolaou (University of Athens, Greece)
Stavros Vassos (University of Athens, Greece)
George Garbis (University of Athens, Greece)
Michael Sioutis (University of Athens, Greece)
Konstantina Bereta (University of Athens, Greece)
Dimitrios Michail (Harokopio University of Athens, Greece)
Charalampos Kontoes (National Observatory of Athens, Greece)
Ioannis Papoutsis (National Observatory of Athens, Greece)
Themos Herekakis (National Observatory of Athens, Greece)
Stefan Manegold (Centrum Wiskunde & Informatica, Netherlands)
Martin Kersten (Centrum Wiskunde & Informatica, Netherlands)
Milena Ivanova (Centrum Wiskunde & Informatica, Netherlands)
Holger Pirk (Centrum Wiskunde & Informatica, Netherlands)
Ying Zhang (Centrum Wiskunde & Informatica, Netherlands)
Mihai Datcu (German Aerospace Center, Germany)
Gottfried Schwarz (German Aerospace Center, Germany)
Corneliu Dumitru (German Aerospace Center, Germany)
Daniela Espinoza Molina (German Aerospace Center, Germany)
Katrin Molch (German Aerospace Center, Germany)
Ugo Di Giammatteo (Advanced Computer Systems, Italy)
Manuela Sagona (Advanced Computer Systems, Italy)
Sergio Perelli (Advanced Computer Systems, Italy)
Thorsten Reitz (Fraunhofer Institute for Computer Graphics Research, Germany)
Eva Klien (Fraunhofer Institute for Computer Graphics Research, Germany)
Robert Gregor (Fraunhofer Institute for Computer Graphics Research, Germany)
TELEIOS is a recent European project that addresses the need for scalable access to petabytes of Earth Observation data and the discovery and exploitation of knowledge that is hidden in them. TELEIOS builds on scientific database technologies (array databases, SciQL, data vaults) and Semantic Web technologies (stRDF and stSPARQL) implemented on top of a state of the art column store database system (MonetDB). We demonstrate a first prototype of the TELEIOS Virtual Earth Observatory (VEO) architecture, using a forest fire monitoring application as example.
PhD Workshop Session 1: Data Semantics and Data Mining
- Invited Talk: Research and academic life
M. Tamer Özsu, University of Waterloo
- A Cognitive Model for Mining Latent Semantics in Unstructured Text
Aditya Ramana Rachakonda (International Institute of Information Technology, India)
A cognitive model for mining latent semantics in unstructured text is described. The approach is to understand the formation of lexical semantics in humans from a cognitive standpoint and abstract this process onto a machine. A 3-layer model of latent semantics is proposed which helps define certain semantic associations as understood by humans. These associations are then systematically reduced to properties of term cooccurrences through a series of hypotheses. The hypotheses are confirmed through user evaluation of the resultant semantics. This work focuses on three latent semantic associations viz., topical anchors, semantic siblings and topical markers to explain the model.
- Personalizing Search Results on User Intent
Giorgos Giannopoulos (National Technical University of Athens, Greece)
Personalized retrieval models aim at capturing user interests to provide personalized results that are tailored to the respective information needs. User interests are, however, widely spread, subject to change, and cannot always be captured well, thus rendering the deployment of personalized models challenging. In this doctoral work, we describe our approach, in which we study ranking models from the aspect of search intent. Our approach is query-centric. That is, it does not rely on separate user search profiles/histories, nor does it personalize ranking based on topical similarity of queries. Instead, it examines the search behaviors/intents induced by queries and groups together queries with similar behaviors, forming search behavior clusters. Specifically, we exploit user feedback in the form of click data to cluster the queries. Each cluster is finally represented by a single ranking model that captures the contained intents expressed by users. Once new queries are issued, they are mapped to the clustering, and the retrieval process diversifies possible intents by combining relevant ranking functions.
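A small sketch of the query-centric idea: represent each query by its click behavior (here, the distribution of clicked result categories) and group queries whose behavior is similar, regardless of topic. The features, the cosine similarity, and the threshold are illustrative assumptions rather than the ranking models proposed in the thesis.

```python
# Grouping queries by click behaviour rather than topic; features are assumptions.
import math

click_profiles = {  # query -> clicks per result category
    "cheap flights":       {"commercial": 9, "informational": 1},
    "buy laptop":          {"commercial": 8, "informational": 2},
    "python tutorial":     {"informational": 9, "commercial": 1},
    "history of istanbul": {"informational": 10},
}

def cosine(p, q):
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0) * q.get(k, 0) for k in keys)
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm

def cluster(profiles, threshold=0.9):
    clusters = []
    for query, profile in profiles.items():
        for c in clusters:
            if cosine(profile, profiles[c[0]]) >= threshold:
                c.append(query)
                break
        else:
            clusters.append([query])
    return clusters

# Queries with similar search intents end up together despite different topics.
print(cluster(click_profiles))
```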
- Efficient Combined Clustering of Graph and Attribute Data
Brigitte Boden (RWTH Aachen University, Germany)
In the knowledge discovery process, clustering is an established technique for grouping similar objects while separating dissimilar ones. While traditional clustering methods mostly work on vector data, various clustering methods for data represented as graphs have also been developed in recent years. In many applications, however, both types of data are available. For example, for a social network representing relationships between users, additional information about each user can be available, represented as vector data. Therefore, we aim at developing combined clustering approaches that consider graph and vector data simultaneously to obtain clustering results that are meaningful in both domains. In this paper we summarize the main challenges for combined clustering and give an overview of our recent solutions. Furthermore, we give an overview of our ongoing and future work on this subject.
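A minimal sketch of the combined view: a similarity measure that mixes graph structure (shared neighborhoods) with attribute closeness, which any clustering algorithm can then consume so that clusters are dense in the graph and homogeneous in the vector space. The mixing weight and the Jaccard/Euclidean choices are assumptions for illustration.

```python
# Combined structural + attribute similarity; weights and measures are assumptions.
import math

edges = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}
attributes = {"a": [1.0, 0.9], "b": [0.9, 1.0], "c": [1.0, 1.0], "d": [5.0, 4.8]}

def neighbours(node):
    return {v for u, v in edges if u == node} | {u for u, v in edges if v == node}

def combined_similarity(u, v, alpha=0.5):
    nu, nv = neighbours(u) | {u}, neighbours(v) | {v}
    structural = len(nu & nv) / len(nu | nv)        # Jaccard on neighbourhoods
    distance = math.dist(attributes[u], attributes[v])
    attribute_sim = 1.0 / (1.0 + distance)          # closeness in vector space
    return alpha * structural + (1 - alpha) * attribute_sim

print(combined_similarity("a", "b"))  # connected and similar attributes -> high
print(combined_similarity("c", "d"))  # connected but dissimilar attributes -> lower
```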
- Semantic Heterogeneity Reconciliation in Data Integration
Yoones Sekhavat (Memorial University, Canada)
Semantic heterogeneity reconciliation is an important challenge of many data interoperability applications such as data exchange and data integration. In spite of a large amount of research in this area, the lack of theoretical foundations behind semantic heterogeneity reconciliation techniques has resulted in many ad-hoc approaches. This thesis addresses this issue by providing ontological foundations for semantic heterogeneity reconciliation in data integration through query processing. Using these ontological foundations, we propose some novel query rewriting rules that allow reconciling complex semantic heterogeneities.
PhD Workshop Session 2: Database Systems
- Invited Talk: Opportunities for Big Data Research
Ling Liu, Georgia Tech.
- Parallel Genetic Algorithms for the Optimization of Multi-Way Chain Join Queries of Distributed Databases
Tansel Dokeroglu (Middle East Technical University, Turkey)
Distributed database query optimization is an intractable problem, as the number of possible query execution plans grows exponentially with the number of relations accessed and the number of sites used. Query optimizers try to arrive at an optimal plan by using different algorithms. Genetic algorithms have frequently been used for large join query optimization, where it is not possible to extract an optimal plan with exhaustive strategies. In this study, we analyze the performance of genetic algorithms when they are parallelized for this important problem. A set of Parallel Genetic Algorithms (PGAs) is developed using different models, and the best-performing one among them is chosen for comparison with a sequential genetic algorithm, sequential dynamic programming, and a parallel exhaustive algorithm. We were able to show that PGAs are very promising tools, mostly producing near-optimal solutions within small deviations. To the best of our knowledge, this is the first time that the distributed database query optimization problem has been analyzed with PGAs to this extent.
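To make the setup concrete, here is a compact sketch of an island-model parallel genetic algorithm for join ordering: each chromosome is a permutation of relations interpreted as a left-deep plan, fitness is an estimated intermediate-result cost, and several independent populations evolve in parallel processes. The cardinalities, selectivities, cost model, and GA parameters are illustrative assumptions, not the PGA variants evaluated in the paper.

```python
# Island-model parallel GA for join ordering; cost model and parameters are assumptions.
import random
from multiprocessing import Pool

CARD = {"R0": 1000, "R1": 500, "R2": 2000, "R3": 100, "R4": 800}
# Join selectivity between chain neighbours R_i and R_{i+1}; other pairs are cross products.
SEL = {frozenset(p): 0.001 for p in [("R0", "R1"), ("R1", "R2"), ("R2", "R3"), ("R3", "R4")]}

def plan_cost(order):
    """Sum of estimated intermediate-result sizes of the left-deep plan `order`."""
    size, joined, cost = CARD[order[0]], {order[0]}, 0.0
    for rel in order[1:]:
        sel = min((SEL.get(frozenset((rel, j)), 1.0) for j in joined), default=1.0)
        size *= CARD[rel] * sel
        cost += size
        joined.add(rel)
    return cost

def evolve(seed, generations=200, population_size=30):
    rng = random.Random(seed)
    population = [rng.sample(list(CARD), len(CARD)) for _ in range(population_size)]
    for _ in range(generations):
        parents = sorted(population, key=plan_cost)[:population_size // 2]
        children = []
        for p in parents:
            child = p[:]
            i, j = rng.sample(range(len(child)), 2)   # mutation: swap two relations
            child[i], child[j] = child[j], child[i]
            children.append(child)
        population = parents + children
    best = min(population, key=plan_cost)
    return plan_cost(best), best

if __name__ == "__main__":
    with Pool(4) as pool:                 # four GA "islands" evolved in parallel
        results = pool.map(evolve, range(4))
    cost, order = min(results)
    print(order, cost)
```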
- A Toolkit for the Efficient Processing of Big Data on Large Clusters
Vinayak Borkar (University of California, Irvine, USA)
The amount of data being generated has been growing astronomically since the popularization of the Internet. Being able to analyze this growing volume of data has become important and several systems have been proposed lately to help programmers express computations that are scaled out to run on clusters of shared-nothing computers. This thesis investigates the tradeoffs involved in designing a reusable toolkit of components to be used for processing large amounts of data in a scalable manner. We describe 1) Hyracks, a runtime platform that accepts jobs from users or compilers of high-level languages and executes them in parallel on a cluster, and 2) Algebricks, an algebraic layer that provides operators (logical and physical) that compilers can target in compiling high-level declarative languages for data-intensive applications.
- Knowlog: Knowledge + Datalog for Distributed Systems
Matteo Interlandi (University of Modena and Reggio Emilia, Italy)
One of the new emerging trends that is gaining a lot of attention in the database community concerns distributed programming in large datacenters. Many discussions are arising around the CAP theorem [8] and how to achieve correct and efficient programs while performing as few coordinated actions as possible. In order to address these issues, monotonic logic programming [4] has been employed to formally specify eventually consistent distributed programs. We conjecture that a missing piece in the current state of the art is the capability to express statements about the knowledge state of distributed nodes, i.e., statements about what a node "knows" given the current system configuration. In fact, reasoning about the state of remote nodes is fundamental in distributed contexts in order to design and analyze protocol behavior or perform coordinated actions [10]. To reach this goal, we extend Datalog with an epistemic modal operator, allowing the programmer to directly express nodes' state of knowledge instead of low-level communication details. As a result, reasoning about high-level properties of distributed programs can be performed. To support the effectiveness of our proposal, we introduce, as an example, the declarative implementation of the well-known protocol employed to execute distributed transactions: the two-phase commit protocol.
- Partial Cube Materialization in a Dynamic Data Warehouse
Usman Ahmed (Université de Lyon, France)
Many modern applications, such as sensor-based monitoring or certain business applications, raise the need for real-time analysis, and thus require OLAP multidimensional models to be adapted to these evolving environments. Indeed, the scheduled offline batch update strategy and the prerequisite of having totally ordered dimensions in traditional data warehouses are no longer suitable for these time-critical applications. In this paper, we present a multidimensional model for real-time data warehousing in a hierarchical multidimensional data space. We also propose a dynamic partial cube materialization and a tree storage structure that groups the multidimensional data into non-ordered data partitions called minimum bounding spaces. We use the Star Schema Benchmark to compare the performance of our solution with an existing dynamic indexing technique. The experimental study shows performance improvements in both insertion time and queries over the data space.
PhD Workshop Session 3: Query Processing
- Efficient Cross-Device Query Processing
Holger Pirk (Centrum Wiskunde & Informatica, Netherlands)
The increasing diversity of hardware within a single system promises large performance gains but also poses a challenge for data management systems. Strategies for the efficient use of hardware with large performance differences are still lacking. For example, existing research on GPU-supported data management largely handles the GPU in isolation from the system's CPU: the GPU is considered the central processor, and the CPU is used only to mitigate the GPU's weaknesses where necessary. To make efficient use of all available devices, we developed a processing strategy that lets unequal devices like GPU and CPU combine their strengths rather than work in isolation. To this end, we decompose relational data into individual bits and place the resulting partitions on the appropriate devices. Operations are processed in phases, each phase executed on one device. This way, we achieve significant performance gains and good load distribution among the available devices in a limited real-life use case. To grow this idea into a generic system, we identify challenges as well as potential hardware configurations and applications that can benefit from this approach.
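The bit-wise decomposition can be illustrated with a two-phase selection: the column is split into a high-bit partition (imagine it resident on one device) and a low-bit partition (on the other); the high bits decide most rows, and only the undecided rows are refined using the low bits. The split point and the device assignment are assumptions for illustration, not the thesis' actual placement strategy.

```python
# Two-phase selection over a bit-wise decomposed column; split point is an assumption.
LOW_BITS = 8
values = [3, 250, 257, 300, 1000, 4095, 70000]

high = [v >> LOW_BITS for v in values]              # partition placed on device A
low = [v & ((1 << LOW_BITS) - 1) for v in values]   # partition placed on device B

def select_less_than(threshold):
    t_high, t_low = threshold >> LOW_BITS, threshold & ((1 << LOW_BITS) - 1)
    # Phase 1 (device A): the high bits alone decide most rows.
    maybe, hits = [], []
    for i, h in enumerate(high):
        if h < t_high:
            hits.append(i)
        elif h == t_high:
            maybe.append(i)            # undecided, needs the low bits
    # Phase 2 (device B): refine only the undecided rows.
    hits.extend(i for i in maybe if low[i] < t_low)
    return sorted(hits)

selected = select_less_than(300)
print([values[i] for i in selected])   # rows with value < 300
```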
- An In-GPU-Memory Column-Oriented Database for Processing Analytical Workloads
Pedram Ghodsnia (University of Waterloo, Canada)
Due to the ever-increasing demand for fast processing of large analytical workloads, main-memory column-oriented databases have attracted a lot of attention in recent years. In-memory databases eliminate the disk I/O barrier by storing the data in memory. In addition, they utilize a column-oriented data layout to offer a multi-core-friendly and memory-bandwidth-efficient processing scheme. On the other hand, graphics processing units (GPUs) have recently emerged as powerful tools for general high-performance computing. GPUs are affordable and energy-efficient devices that deliver massive computational power by utilizing a large number of cores and a high memory bandwidth. GPUs can be used as co-processors for query acceleration in in-memory databases. One of the main bottlenecks in GPU acceleration of in-memory databases is the need for data to be transferred back and forth between GPU memory and RAM over a low-bandwidth PCIe bus. To address this problem, in this study a new generation of in-memory databases is proposed that stores the data in GPU device memory instead of main memory.
- Translatable View Updates and Determinacy Under Constraints
Paolo Guagliardo (Free University of Bozen-Bolzano, Italy)
We present a general framework for view updating based on the notion of determinacy under constraints and revisit the abstract functional framework by Bancilhon and Spyratos in a setting where views and updates are given by functions that are expressible in first-order logic. We provide a general method for checking whether a view update can be propagated to the underlying database in a unique way. We study a practical application setting in which views are defined by projections and we extend known results to more expressive database constraints, any number of views and more general view updates.
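For orientation, determinacy is commonly formalized as follows: a set of views V determines a query Q under constraints Σ if any two databases satisfying Σ that agree on all the views also agree on Q. This is the standard formulation of the notion; the constraint-aware variant studied in this thesis may differ in its details.

```latex
% Standard determinacy-under-constraints formulation, stated for orientation only.
\[
  \Sigma \models (V \twoheadrightarrow Q)
  \;\Longleftrightarrow\;
  \forall D_1, D_2:\;
  \big( D_1, D_2 \models \Sigma \;\wedge\; V(D_1) = V(D_2) \big)
  \;\Rightarrow\; Q(D_1) = Q(D_2)
\]
```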
- Interactive Visualization of High-Velocity Event Streams
Uwe Jugel (SAP Research, Germany)
Volker Markl (Technische Universität Berlin, Germany)
Today, complex event processing systems enable real-time analysis of high-velocity event streams. Considering their efficiency for high-speed data analytics, they provide a promising basis for real-time visualization. However, a CEP system has to deal with several streaming-specific problems when being used for modern, web-based visualizations. Such visualizations do not only consume streaming data in real-time, but should also provide advanced, interactive exploration options, run on mobile devices, and scale efficiently for mass user applications. In this paper, I define three core challenges for CEP systems regarding interactive, real-time visualization for future web applications. Within my PhD work, I want to meet those challenges by investigating (1) Interactivity Operators, solving problems with long running queries, (2) backend-powered Visualization Operators, relieving mobile devices of rendering duties, and (3) Multi-User Visualization Pipelines that avoid redundant data processing when serving visualizations to thousands of event stream consumers.
SAP Lunch Talk: An in-memory columnar processing based database platform for Enterprise Applications
Vishal Sikka (Member of Executive Board, Head of Technology & Innovation, SAP, Germany)
http://www.sap.com/corporate-en/our-company/sap-boards/executive-board/Vishal-Sikka.epx
Abstract. Specialized data management systems have recently emerged because commercially available systems have increasingly been making compromises in order to comprehensively address the diverse data needs of a business. Compounded with the evolution of how data is created and used, the specialization of database management systems created a critical need for a comprehensive, real-time data processing platform for the modern enterprise. SAP HANA, the core database engine of SAP's in-memory platform, seeks to enable both transactional and analytical workloads on the same data representation. It also supports both structured and unstructured data analysis, and optimizes the execution of application-specific function logic, all within a highly scalable in-memory execution environment. It succeeds in this mission because it leverages the advantages of a fully in-memory columnar store with highly optimized internal data structures and parallel processing algorithms. In this talk, Dr. Sikka will outline the key principles of this next-generation data platform, and some ways in which enterprise application architecture can be rethought in light of this work.