DATA 2016 Abstracts


Area 1 - Business Analytics

Full Papers
Paper Nr: 24
Title:

Comparison of Distributed Computing Approaches to Complexity of n-gram Extraction

Authors:

Sanzhar Aubakirov, Paulo Trigo and Darhan Ahmed-Zaki

Abstract: In this paper we compare different technologies that support distributed computing as a means to address complex tasks. We address the task of n-gram text extraction, which is computationally demanding given a large amount of textual data to process. In order to deal with such complexity we have to adopt and implement parallelization patterns. Nowadays there are several patterns, platforms and even languages that can be used for the parallelization task. We implemented this task on three platforms: (1) MPJ Express, (2) Apache Hadoop, and (3) Apache Spark. The experiments used two kinds of datasets, composed of: (A) a large number of small files, and (B) a small number of large files. Each experiment uses both datasets and is repeated for a set of different file sizes. We compared performance and efficiency among MPJ Express, Apache Hadoop and Apache Spark. As a final result we are able to provide guidelines for choosing the platform best suited for each kind of dataset with respect to the overall size and granularity of the input data.
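The shape of the task the abstract describes can be sketched in a few lines of Python. This is a simplification for illustration (the paper's implementations use MPJ Express, Hadoop and Spark, not Python's multiprocessing): each worker counts n-grams in one file (the "map" step), and the partial counters are then merged (the "reduce" step).

```python
from collections import Counter
from multiprocessing import Pool

def ngrams(tokens, n):
    # slide a window of length n over the token sequence
    return zip(*(tokens[i:] for i in range(n)))

def count_file(args):
    # "map" step: build a partial n-gram count for one file
    path, n = args
    with open(path, encoding="utf-8") as f:
        tokens = tuple(f.read().split())
    return Counter(ngrams(tokens, n))

def extract(paths, n=2, workers=4):
    # distribute files across workers, then "reduce" by merging counters
    with Pool(workers) as pool:
        partials = pool.map(count_file, [(p, n) for p in paths])
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

The per-file granularity mirrors dataset (A) above; for dataset (B), a real implementation would instead split each large file into chunks before the map step.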

Paper Nr: 38
Title:

Management of Data Quality Related Problems - Exploiting Operational Knowledge

Authors:

Mortaza S. Bargh, Jan van Dijk and Sunil Choenni

Abstract: Dealing with data quality related problems is an important issue that all organizations face in realizing and sustaining data intensive advanced applications. Upon detecting these problems in datasets, data analysts often register them in issue tracking systems in order to address them later on categorically and collectively. As there is no standard format for registering these problems, data analysts often describe them in natural language and subsequently rely on ad-hoc, non-systematic, and expensive solutions to categorize and resolve the registered problems. In this contribution we present a formal description of an innovative data quality resolving architecture to semantically and dynamically map the descriptions of data quality related problems to data quality attributes. Through this mapping, we reduce complexity – as the dimensionality of data quality attributes is far smaller than that of the natural language space – and enable data analysts to directly use the methods and tools proposed in the literature. Furthermore, through managing data quality related problems, our proposed architecture offers data quality management in a dynamic way based on user generated inputs. The paper reports on a proof-of-concept tool and its evaluation.

Short Papers
Paper Nr: 30
Title:

An Ontology for Describing ETL Patterns Behavior

Authors:

Bruno Oliveira and Orlando Belo

Abstract: The use of software patterns is a common practice in software design, providing reusable solutions for recurring problems. Patterns represent a general skeleton used to solve common problems, providing a way to share regular practices and reduce the resources needed for implementing software systems. Data warehouse populating processes are a very particular type of software used to migrate data from one or more data sources to a specific data schema used to support decision support activities. The quality of such processes should be guaranteed; otherwise, the final system will suffer from data inconsistencies and errors, compromising its suitability to support strategic business decisions. To minimize such problems, we propose a pattern-oriented approach to support the ETL lifecycle, from conceptual representation to its execution primitives using a specific commercial tool. An ontology-based metamodel was designed and used to describe the patterns' internal specification and to provide the means to support and enable their configuration and instantiation using a domain-specific language.

Paper Nr: 36
Title:

Management of Scientific Documents and Visualization of Citation Relationships using Weighted Key Scientific Terms

Authors:

Hui Wei, Youbing Zhao, Shaopeng Wu, Zhikun Deng, Farzad Parvinzamir, Feng Dong, Enjie Liu and Gordon Clapworthy

Abstract: Effective management and visualization of scientific and research documents can greatly assist researchers by improving their understanding of relationships (e.g. citations) between the documents. This paper presents work on the management and visualization of large corpora of scientific papers in order to help researchers explore their citation relationships. Term selection and weighting are used for mining citation relationships by identifying the most relevant terms. To this end, we present a variation of the TF-IDF scheme which uses external domain resources as references to calculate term weights in a particular domain; document weighting is taken into account when calculating term weights from a group of citations. A simple hierarchical word weighting method is also presented. The work is supported by an underlying architecture for document management using NoSQL databases and employs a simple visualization interface.
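As a rough illustration of the kind of term weighting the abstract mentions, the sketch below computes a TF-IDF-style weight in which the document frequencies come from an external domain corpus rather than from the collection itself. The function name, the field layout and the smoothed-IDF formula are assumptions for illustration, not the paper's actual scheme.

```python
import math
from collections import Counter

def domain_tf_idf(doc_tokens, domain_doc_freq, n_domain_docs):
    """Weight the terms of one document against an external domain corpus.

    domain_doc_freq maps each term to the number of domain documents
    containing it; this stands in for the external domain resources
    used as references in the abstract above."""
    tf = Counter(doc_tokens)
    weights = {}
    for term, freq in tf.items():
        df = domain_doc_freq.get(term, 0)
        # smoothed inverse document frequency over the domain corpus
        idf = math.log((1 + n_domain_docs) / (1 + df)) + 1
        weights[term] = (freq / len(doc_tokens)) * idf
    return weights
```

A term that is rare in the domain corpus ("sqlite" below) ends up weighted higher than a term common across it ("the"), which is the behavior domain-referenced weighting is meant to produce.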

Paper Nr: 39
Title:

Map-reduce Implementation of Belief Combination Rules

Authors:

Frédéric Dambreville

Abstract: This paper presents a generic and versatile approach for implementing combination rules on preprocessed belief functions originating from a large population of information sources. We address two issues: the intrinsic complexity of processing the rules, and the possibly large number of requested combinations. We present a fully distributed approach, based on a map-reduce (Spark) implementation.
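Although the paper targets a Spark cluster, the algebraic property a map-reduce formulation relies on can be shown with a small single-machine sketch: combination rules such as Dempster's rule are associative and commutative, so partial combinations computed by different workers can be merged in any order. Representing a mass function as a dict from frozensets of hypotheses to masses is an assumption for illustration; the paper does not specify this encoding.

```python
from functools import reduce

def dempster(m1, m2):
    # combine two mass functions (dict: frozenset of hypotheses -> mass)
    joint, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                joint[inter] = joint.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb          # mass falling on the empty set
    norm = 1.0 - conflict                    # Dempster normalization
    return {s: v / norm for s, v in joint.items()}

def combine_all(masses):
    # associativity/commutativity is what lets a map-reduce job merge
    # partial results from different workers in any order
    return reduce(dempster, masses)
```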

Paper Nr: 52
Title:

Urban Gamification as a Source of Information for Spatial Data Analysis and Predictive Participatory Modelling of a City’s Development

Authors:

Robert Olszewski, Agnieszka Turek and Marcin Łączyński

Abstract: The basic problem in predictive participatory urban planning is activating the residents of a city, e.g. through the application of individual and/or team gamification techniques. The authors of the article developed (and tested in Płock) a methodology and prototype of an urban game called “Urban Shaper”. This made it possible to obtain a vast collection of participants' opinions on the directions of potential development of the city. The opinions, however, are expressed in an indirect manner. Therefore, analysing them and modelling participatory urban development requires the application of advanced spatial statistics algorithms. The collected source data are successively processed by means of spatial data mining techniques, permitting the extraction of condensed spatial knowledge from high-volume “raw” source data (big data).

Posters
Paper Nr: 17
Title:

Database Buffer Cache Simulator to Study and Predict Cache Behavior for Query Execution

Authors:

Chetan Phalak, Rekha Singhal and Tanmay Jhunjhunwala

Abstract: Usage of electronic media is increasing day by day, and consequently so is the usage of applications. This has resulted in rapid growth of application data, which may lead to violation of the service level agreement (SLA) given to users. To keep applications SLA-compliant, it is necessary to predict query response times before deployment. The query response time comprises two elements: computation time and IO access time. The latter includes time spent getting data from the disk subsystem and from the database/operating system (OS) cache. Correct prediction of query performance requires a model of cache behavior for growing data sizes. The complex nature of data storage and of the data access patterns of queries makes it difficult to rely on a mathematical model alone for cache behavior prediction. In this paper, a Database Buffer Cache Simulator is proposed which mimics the behavior of the database buffer cache and can be used to predict cache misses for different types of data access by a query. The simulator has been validated using Oracle 11g and the TPC-H benchmark. The simulator is able to predict cache misses with an average error of 2%.
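A minimal version of such a simulator is a block-access trace replayed against an eviction policy while counting hits and misses. The sketch below assumes plain LRU for simplicity; a real database buffer cache (Oracle's touch-count scheme, for instance) is considerably more elaborate, and the class name is invented for this example.

```python
from collections import OrderedDict

class BufferCacheSim:
    """Replay a block-access trace against an LRU cache and count misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # insertion order doubles as recency order
        self.hits = 0
        self.misses = 0

    def access(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)    # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
            self.cache[block_id] = True
```

Feeding the simulator the block trace of a candidate query at the projected data size yields a miss count, which can then be converted into an estimated IO component of the response time.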

Paper Nr: 21
Title:

Evaluating Open Source Data Mining Tools for Business

Authors:

Pedro Almeida, Le Gruenwald and Jorge Bernardino

Abstract: Businesses are struggling to stay ahead of the competition in a globalized economy with more, and stronger, competitors. Managers are constantly looking for advantages that can generate benefits at low cost. One way to gain such an advantage is to use data about customers (demographic data, purchase history, customer behavior and preferences) to make better business decisions. Data Mining addresses the challenges of extracting the value inside data and putting that value to use in virtually any area of our lives, including business. In this paper, we address the interest of Data Mining for business and analyze three popular Open Source Data Mining Tools – KNIME, Orange and RapidMiner – considered a good starting point for enterprises to begin exploring the power of Data Mining and its benefits.

Paper Nr: 53
Title:

Dimensioning and Optimizing TaaS for Improved Performance in Telecom System with Text Analytics Features

Authors:

M. Saravanan and Harivadhana Ravi

Abstract: To cope with the speed of ICT developments, industries are venturing into different testing schemes enabled by advances in Mobile Cloud Computing, which help them meet aggressive delivery schedules and the high demand to reduce costs. Test automation can help reduce time to market by achieving the best quality standards in the end product within a limited schedule and budget. In existing Testing as a Service (TaaS), the on-demand execution of well-defined test suites is done on an outsourced basis. These installations would be more effective and adaptable if the TaaS were customized on demand and integrated into the real-time environment. In this paper we propose a new service framework which automates the test suite by converting the existing regression test suite procedure into an intelligible TaaS operation, augmented with text categorization, dimensioning and optimizing techniques. The proposed DOTaaS framework induces robust regression testing with the aim of reducing cost and providing reliable quality assurance. Test suites are categorized by understanding the semantics of the text and marshalling it under its corresponding classifier label. Dimensioning is done through text analytics with the application of graphical models that help to reduce the suite by merging and revamping existing test cases. The proposed service framework can also be easily realized in any new environment, with the advantage of providing stability. The pragmatic uses of dimensioning and optimizing techniques in an automated testing environment were evaluated to prove their effectiveness.

Area 2 - Data Management and Quality

Full Papers
Paper Nr: 18
Title:

ISE: A High Performance System for Processing Data Streams

Authors:

Paolo Cappellari, Soon Ae Chun and Mark Roantree

Abstract: Many organizations require the ability to manage high-volume, high-speed streaming data to perform analysis and other tasks in real time. In this work, we present the Information Streaming Engine, a high-performance data stream processing system capable of scaling to high data volumes while maintaining very low latency. The Information Streaming Engine adopts a declarative approach which enables processing and manipulation of data streams in a simple manner. Our evaluation demonstrates the high levels of performance achieved when compared to existing systems.

Paper Nr: 50
Title:

A Novel Method for Unsupervised and Supervised Conversational Message Thread Detection

Authors:

Giacomo Domeniconi, Konstantinos Semertzidis, Vanessa Lopez, Elizabeth M. Daly, Spyros Kotoulas and Gianluca Moro

Abstract: Efficiently detecting conversation threads from a pool of messages, such as social network chats, emails, comments on posts, news etc., is relevant for various applications, including Web Marketing, Information Retrieval and Digital Forensics. Existing approaches focus on text similarity using keywords as features that are strongly dependent on the dataset. Therefore, dealing with new corpora requires further costly analyses conducted by experts to find new relevant features. This paper introduces a novel method to detect threads from any type of conversational text, overcoming the issue of having to determine specific features for each dataset in advance. To automatically determine the relevant features of messages, we map each message into a three-dimensional representation based on its semantic content, the social interactions in terms of sender/recipients, and its timestamp; clustering is then used to detect conversation threads. In addition, we propose a supervised approach to detect conversation threads that builds a classification model which combines the above extracted features to predict whether a pair of messages belongs to the same thread or not. Our model harnesses the distance of a message to a cluster representing a thread to capture the probability that the message is part of that thread. We present our experimental results on seven datasets, pertaining to different types of messages, and demonstrate the effectiveness of our method in the detection of conversation threads, clearly outperforming the state of the art and yielding an improvement of up to 19%.
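One plausible concrete form of the three-dimensional representation is a combined distance over semantic, social and temporal features, which a clustering algorithm could then consume. The sketch below is an illustrative guess, not the paper's actual function: the field names, the equal default weights and the one-day time scale are all assumptions.

```python
import math

def message_distance(m1, m2, w=(1.0, 1.0, 1.0)):
    """Distance between two messages over three feature dimensions.

    Each message is a dict with hypothetical fields:
      'vec'   semantic content vector (e.g. term weights)
      'users' set containing sender and recipients
      'time'  timestamp in seconds
    """
    # 1) semantic dimension: cosine distance between content vectors
    dot = sum(a * b for a, b in zip(m1["vec"], m2["vec"]))
    na = math.sqrt(sum(a * a for a in m1["vec"]))
    nb = math.sqrt(sum(b * b for b in m2["vec"]))
    sem = 1.0 - dot / (na * nb) if na and nb else 1.0
    # 2) social dimension: Jaccard distance on participant sets
    soc = 1.0 - len(m1["users"] & m2["users"]) / len(m1["users"] | m2["users"])
    # 3) temporal dimension: gap normalized to one day, capped at 1
    tmp = min(abs(m1["time"] - m2["time"]) / 86400.0, 1.0)
    return (w[0] * sem + w[1] * soc + w[2] * tmp) / sum(w)
```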

Short Papers
Paper Nr: 14
Title:

Performance Evaluation of Phonetic Matching Algorithms on English Words and Street Names - Comparison and Correlation

Authors:

Keerthi Koneru, Venkata Sai Venkatesh Pulla and Cihan Varol

Abstract: Researchers confront major problems while searching for various kinds of data in large, imprecise databases, as entries are not spelled correctly or in the way they were expected to be spelled. As a result, they cannot find the words they seek. Over years of struggle, pronunciation of words came to be considered one of the practices to solve this problem effectively. The technique used to retrieve words based on sound is known as “Phonetic Matching”. Soundex was the first algorithm proposed, and other algorithms like Metaphone, Caverphone, DMetaphone, Phonex etc., are also used for information retrieval in different environments. This paper deals with the analysis and evaluation of different phonetic matching algorithms on several datasets comprising street names of North Carolina and English dictionary words. The analysis clearly shows that there is no single best technique for generic word lists: Metaphone performs best for English dictionary words, while NYSIIS performs better for datasets of street names. Though Soundex has high accuracy in correcting the exact words compared to other algorithms, it has lower precision due to more noise in the considered arena. The experimental results paved the way for suggestions that would help make databases more concrete and achieve higher data quality.
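For reference, the Soundex algorithm named above is compact enough to show in full. This sketch follows the standard American Soundex rules: the first letter is kept, consonants map to six digit classes, H and W are transparent (they do not separate duplicate codes), vowels do separate them, and the result is truncated or zero-padded to four characters.

```python
def soundex(word):
    """Standard American Soundex: first letter plus up to three digits."""
    word = word.upper()
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    result = [word[0]]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue                 # H and W do not separate duplicate codes
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result.append(digit)
        prev = digit                 # vowels reset the previous code
    return "".join(result)[:4].ljust(4, "0")
```

The coarseness visible here ("Robert" and "Rupert" both map to R163) is exactly the noise the abstract attributes to Soundex's lower precision.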

Paper Nr: 55
Title:

Architecture of a Multi-domain Processing and Storage Engine

Authors:

Johannes Luong, Dirk Habich, Thomas Kissinger and Wolfgang Lehner

Abstract: In today’s data-driven world, economy and research depend on the analysis of empirical datasets to guide decision making. These applications often encompass a rich variety of data types and special-purpose processing models. We believe the database system of the future will integrate flexible processing and storage of a variety of data types in a scalable, integrated end-to-end solution. In this paper, we propose a database system architecture that is designed from the core to support these goals. In the discussion we especially focus on the multi-domain programming concept of the proposed architecture, which exploits domain-specific knowledge to guide compiler-based optimization.

Posters
Paper Nr: 29
Title:

Test-Data Quality as a Success Factor for End-to-End Testing - An Approach to Formalisation and Evaluation

Authors:

Yury Chernov

Abstract: Test-data quality can be decisive for the success of end-to-end testing of complicated multi-component and distributed systems. A proper metric allows different data sets to be compared and their quality evaluated. The typical data quality factors should, on the one hand, be enriched with dimensions specific to testing; on the other hand, many data quality requirements of the production system become irrelevant. The proposed formal model is based to a great extent on common sense and practical experience, and is not over-formalised. The implementation requires considerable effort, but once established it can be very useful.

Paper Nr: 43
Title:

Simultaneous Approximation of Several Non-uniformly Distributed Values within Histogram Buckets

Authors:

Wissem Labbadi and Jalel Akaichi

Abstract: The problem of estimating the median and other quantiles without storing observations was first posed in the field of simulation modeling, to improve the study of the performance of modeled systems rather than relying on the mean and standard deviation alone. This problem was then extended to histogram plotting, which raises the problem of estimating many quantiles of the same variable. However, calculating several values simultaneously is a computationally complex task since it requires several passes through the data. We propose in this paper an extension of the state-of-the-art Values Approximation algorithm (Labbadi and Akaichi, 2014) to estimate several values within a histogram bucket simultaneously. The extended algorithm significantly reduces the computational cost of the estimation, since the original algorithm estimates only a single value. The experimental results show that the extended algorithm provides good estimates, especially when data have non-equal spreads.

Paper Nr: 45
Title:

Social Evaluation of Learning Material

Authors:

Paolo Avogadro, Silvia Calegari and Matteo Dominoni

Abstract: In academic environments, the success of a course is determined by the interaction among students, teachers and learning material. This paper focuses on the definition of a model to establish the quality of learning material within a Social Learning Management System (Social LMS). This is done by analyzing how teachers and students interact via: (1) objective evaluations (e.g., grades), and (2) subjective evaluations (e.g., social data from the Social LMS). As a reference, we use the Kirkpatrick-Phillips model to characterize learning material with novel key performance indicators. As an example, we propose a social environment where students and teachers interact with the help of a wall modified for the evaluation of learning material.

Area 3 - Ontologies and the Semantic Web

Posters
Paper Nr: 15
Title:

Incorporating Collaborative Bookmark Recommendation with Social Network Analysis

Authors:

C.-L. Huang and J.-H. Ho

Abstract: Users collect internet resources and follow other users on social bookmark sharing websites. Based on this interest-based social network, a hybrid recommender system incorporating collaborative and content-based filtering was built in this study to suggest bookmark resources. In the collaborative filtering stage, the recommender system discovers collaborative users whose interests are similar to those of the target user; in the content-based filtering stage, the recommender system suggests bookmark items to the target user. An experiment was conducted to demonstrate the performance of the proposed system on a dataset collected from a popular social bookmark sharing website.

Paper Nr: 40
Title:

RESTful Encapsulation of OWL API

Authors:

Ramya Dirsumilli and Till Mossakowski

Abstract: The OWL API is a high-level API for working with ontologies. Despite its functionality and numerous advantages, its use is restricted to a subset of users due to its platform dependency: being built as a Java API, the OWL API can only be used from Java or related platforms. The major goal of this paper is to design a RESTful web interface to the OWL API methods, such that ontology developers and researchers can work with the OWL API independently of platform. This RESTful OWL API tool is designed to expose all the functionality of the OWL API that does not deal with rendering the input ontology; it does not behave as an ontology editor but instead supports web ontology developers and open ontology repositories such as Ontohub.

Area 4 - Databases and Data Security

Short Papers
Paper Nr: 20
Title:

Scalable Versioning for Key-Value Stores

Authors:

Martin Haeusler

Abstract: Versioning of database content is rapidly gaining importance in modern applications, due to the need for reliable auditing and data history analysis, or because temporal information is inherent to the problem domain. Data volume and complexity also increase, demanding a high level of scalability. However, implementations are rarely found in practice. Existing solutions treat versioning as an add-on instead of a first-class citizen, and therefore fail to take full advantage of its benefits. Often, there is also a trade-off between performance and the age of an entry, with newer entries being considerably faster to retrieve. This paper provides three core contributions. First, we provide a formal model that captures and formalizes the properties of the temporal indexing problem in an intuitive way. Second, we provide an in-depth discussion of the unique benefits in transaction control which can be achieved by treating versioning as a first-class citizen in a data store, as opposed to treating it as an add-on feature to a non-versioned system. We also introduce an index model that offers equally fast access to all entries, regardless of their age. The third contribution is an open-source implementation of the presented formalism in the form of a versioned key-value store, which serves as a proof-of-concept prototype. An evaluation of this prototype demonstrates the scalability of our approach.
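The property of equally fast access to all entries, regardless of their age, can be illustrated with a toy versioned key-value store: each key holds a version-sorted history, and point-in-time reads use binary search, so lookup cost grows only logarithmically with history length and does not depend on how old the requested version is. This is an illustrative sketch, not the paper's index model.

```python
import bisect

class VersionedKV:
    """Toy versioned key-value store: every put creates a new version,
    and gets can target any past version at the same (logarithmic) cost."""

    def __init__(self):
        self._data = {}   # key -> (sorted version list, parallel value list)
        self._clock = 0

    def put(self, key, value):
        self._clock += 1
        versions, values = self._data.setdefault(key, ([], []))
        versions.append(self._clock)  # versions stay sorted: clock only grows
        values.append(value)
        return self._clock

    def get(self, key, version=None):
        if key not in self._data:
            return None
        versions, values = self._data[key]
        if version is None:
            return values[-1]         # head read: latest value
        # binary search for the newest version <= the requested one;
        # cost is independent of how old that version is
        i = bisect.bisect_right(versions, version)
        return values[i - 1] if i else None
```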

Paper Nr: 31
Title:

Adopting Workflow Patterns for Modelling the Allocation of Data in Multi-Organizational Collaborations

Authors:

Ayalew Kassahun and Bedir Tekinerdogan

Abstract: Currently, many organizations need to collaborate to achieve their common objectives. An important aspect is that the data required for making decisions at any one organization may be distributed over the different organizations involved in the collaboration. The data objects, and the activities in which they are generated or used, are typically represented using business process models. Unfortunately, existing business process modeling approaches are limited in representing the complex data allocation dimensions that occur in the context of multi-organization collaboration. In this paper we provide a systematic approach that adopts workflow data patterns to explicitly model data allocation in multi-organization collaboration. To this end we propose a design viewpoint that integrates business process models with well-known data allocation problem-solution pairs defined as workflow data patterns. We illustrate the viewpoint using a case study of a collaborative research project.

Paper Nr: 33
Title:

Fault Tolerance Logging-based Model for Deterministic Systems

Authors:

Óscar Mortágua Pereira, David Simões and Rui L. Aguiar

Abstract: Fault tolerance allows a system to remain operational to some degree when some of its components fail. One of the most common fault tolerance mechanisms consists of logging the system state periodically and recovering the system to a consistent state in the event of a failure. This paper describes a general fault tolerance logging-based mechanism which can be layered over deterministic systems. Our proposal describes how a logging mechanism can recover the underlying system to a consistent state, even if an action or set of actions was interrupted mid-way due to a server crash. We also propose different methods of storing the logging information, and describe how to deploy a fault-tolerant master-slave cluster for information replication. We adapt our model to a previously proposed framework which provides common relational features, such as transactions with atomic, consistent, isolated and durable properties, to NoSQL database management systems.
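The logging-and-replay idea for deterministic systems can be sketched very simply: persist each action durably before applying it, and after a crash rebuild the state by replaying the log through the deterministic apply function. This is a minimal illustration, not the paper's mechanism; the JSON line format and the handling of a torn final entry are assumptions.

```python
import json
import os

class ReplayLog:
    """Minimal write-ahead log for a deterministic system: persist each
    action durably before applying it, and rebuild state by replay."""

    def __init__(self, path):
        self.path = path

    def append(self, action):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(action) + "\n")
            f.flush()
            os.fsync(f.fileno())   # entry must hit disk before we apply it

    def replay(self, apply, state):
        # determinism guarantees that replaying the same actions in the
        # same order reproduces the pre-crash state
        if not os.path.exists(self.path):
            return state
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                try:
                    state = apply(state, json.loads(line))
                except json.JSONDecodeError:
                    break   # torn final entry from a crash mid-write
        return state
```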

Paper Nr: 34
Title:

Cassandra’s Performance and Scalability Evaluation

Authors:

Melyssa Barata and Jorge Bernardino

Abstract: In the past, relational databases were the most commonly used technology for storing and retrieving data, allowing easy management and retrieval of any stored information organized as a set of tables. Today, however, databases are larger, and query execution times can become very long, requiring servers with greater capacity. The purpose of this paper is to describe and analyze the Cassandra NoSQL database using the Yahoo! Cloud Serving Benchmark in order to better understand its execution capabilities for various types of applications in environments with different amounts of stored data. The experiments with Cassandra show good scalability and performance results and how database size and the number of nodes affect them.

Paper Nr: 46
Title:

Boosting an Embedded Relational Database Management System with Graphics Processing Units

Authors:

Samuel Cremer, Michel Bagein, Saïd Mahmoudi and Pierre Manneback

Abstract: Concurrently with the rise of Big Data systems, relational database management systems (RDBMS) are still widely used in servers, client devices, and even embedded inside end-user applications. In this paper, we suggest improving the performance of SQLite, the most widely deployed embedded RDBMS. The proposed solution, named CuDB, is an “In-Memory” Database System (IMDB) which attempts to exploit the specificities of modern CPU/GPU architectures. In this study, massively parallel processing was combined with strategic data placement, closer to the computing units. Depending on the content and selectivity of queries, the measurements reveal an acceleration range of 5 to 120 times, with peaks up to 411, for one GTX 770 GPU compared to the standard SQLite implementation on a Core i7 CPU.