DATA 2015 Abstracts


Area 1 - Business Analytics

Full Papers
Paper Nr: 16
Title:

Finding Maximal Quasi-cliques Containing a Target Vertex in a Graph

Authors:

Yuan Heng Chou, En Tzu Wang and Arbee L. P. Chen

Abstract: Many real-world phenomena such as social networks and biological networks can be modeled as graphs. Discovering dense sub-graphs from these graphs may reveal interesting facts about the phenomena. Quasi-cliques are a type of dense graph that is close to a complete graph. In this paper, we aim to find all maximal quasi-cliques containing a target vertex in a graph, which is useful in several applications. A quasi-clique is defined as a maximal quasi-clique if it is not contained in any other quasi-clique. We propose an algorithm to solve this problem and use several pruning techniques to improve its performance. Moreover, we propose another algorithm to solve a special case of this problem, i.e. finding the maximal cliques. The experimental results reveal that our method outperforms the previous work on both real and synthetic datasets in most cases.
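The abstract does not spell out which quasi-clique definition the paper uses; a common degree-based one (a gamma-quasi-clique of size k requires every vertex to have at least gamma·(k−1) neighbours inside the subgraph) can be sketched as follows. All names here are ours, not the paper's:

```python
def is_quasi_clique(vertices, edges, gamma=0.8):
    """Check whether the subgraph induced by `vertices` is a
    gamma-quasi-clique under the common degree-based definition:
    every vertex has at least gamma * (k - 1) neighbours inside,
    where k is the number of vertices."""
    vs = set(vertices)
    deg = {v: 0 for v in vs}
    for u, w in edges:
        if u in vs and w in vs:  # count only edges inside the subgraph
            deg[u] += 1
            deg[w] += 1
    k = len(vs)
    return all(d >= gamma * (k - 1) for d in deg.values())
```

With gamma = 1 this reduces to the clique test, matching the special case the abstract mentions.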

Paper Nr: 25
Title:

From Static to Agile - Interactive Particle Physics Analysis in the SAP HANA DB

Authors:

David Kernert, Norman May, Michael Hladik, Klaus Werner and Wolfgang Lehner

Abstract: In order to confirm their theoretical assumptions, physicists employ Monte-Carlo generators to produce millions of simulated particle collision events and compare them with the results of the detector experiments. The traditional, static analysis workflow of physicists involves creating and compiling a C++ program for each study, and loading large data files for every run of their program. To make this process more interactive and agile, we created an application that loads the data into the relational in-memory column store DBMS SAP HANA, exposes raw particle data as database views and offers an interactive web interface to explore this data. We expressed common particle physics analysis algorithms using SQL queries to benefit from the inherent scalability and parallelization of the DBMS. In this paper we compare the two approaches, i.e. manual analysis with C++ programs and interactive analysis with SAP HANA. We demonstrate the tuning of the physical database schema and the SQL queries used for the application. Moreover, we show the web-based interface that allows for interactive analysis of the simulation data generated by the EPOS Monte-Carlo generator, which is developed in conjunction with the ALICE experiment at the Large Hadron Collider (LHC), CERN.

Paper Nr: 30
Title:

A Study on Term Weighting for Text Categorization: A Novel Supervised Variant of tf.idf

Authors:

Giacomo Domeniconi, Gianluca Moro, Roberto Pasolini and Claudio Sartori

Abstract: Within text categorization and other data mining tasks, the use of suitable methods for term weighting can bring a substantial boost in effectiveness. Several term weighting methods have been presented in the literature, based on assumptions commonly derived from observing the distribution of words in documents. For example, the idf assumption states that words appearing in many documents are usually not as important as less frequent ones. In contrast to tf.idf and other weighting methods derived from information retrieval, schemes proposed more recently are supervised, i.e. based on knowledge of the membership of training documents in categories. We propose here a supervised variant of the tf.idf scheme, based on computing the usual idf factor without considering documents of the category to be recognized, so that the importance of terms frequently appearing only within it is not underestimated. A further proposed variant is additionally based on relevance frequency, considering occurrences of words within the category itself. In extensive experiments on two recurring text collections with several unsupervised and supervised weighting schemes, we show that the proposed schemes generally perform better than or comparably to the others in terms of accuracy, using two different learning methods.
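The core idea of the variant - computing the idf factor while leaving out the documents of the target category - can be sketched roughly as follows. The smoothing and all names are ours; the paper's exact formula may differ:

```python
import math

def supervised_idf(term, docs, labels, category):
    """idf of `term` computed only over documents OUTSIDE `category`,
    so a term frequent only inside the category keeps a high weight.
    `docs` are sets of terms, `labels` the category of each document.
    Uses the common add-one smoothing; the paper's variant may differ."""
    outside = [d for d, lab in zip(docs, labels) if lab != category]
    df = sum(1 for d in outside if term in d)
    return math.log((1 + len(outside)) / (1 + df)) + 1

# Toy example: "c" occurs only inside the target category's complement
# rarely, so it scores higher than the widespread term "a".
docs = [{"a", "b"}, {"a"}, {"b"}, {"a", "c"}]
labels = ["pos", "neg", "neg", "neg"]
```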

Paper Nr: 35
Title:

A First Framework for Mutually Enhancing Chorem and Spatial OLAP Systems

Authors:

Sandro Bimonte, Francois Johany and Sylvie Lardon

Abstract: Spatial OLAP (SOLAP) systems aim to interactively analyze huge volumes of geo-referenced data. They allow decision-makers to explore and visualize warehoused spatial data on-line using pivot tables, graphical displays and interactive maps. On the other hand, it has recently been shown that chorem maps are an excellent geovisualization technique for summarizing spatial phenomena. Therefore, in this paper we introduce a framework capable of merging the interactive analysis capability of SOLAP systems with the potential of a chorem-based visual notation as a visual summary. We also propose a general architecture based on standards to automatically extract and visualize chorems from SDWs according to our framework.

Paper Nr: 36
Title:

A Unifying Polynomial Model for Efficient Discovery of Frequent Itemsets

Authors:

Slimane Oulad-Naoui, Hadda Cherroun and Djelloul Ziadi

Abstract: It is well known that developing a unifying theory is one of the most important issues in Data Mining research. In the last two decades, a great deal of work has been devoted to the algorithmic aspects of the Frequent Itemset (FI) Mining problem. We are motivated by the need for formal modeling in the field. Thus, we introduce and analyze, in this theoretical study, a new model for the FI mining task. Indeed, we encode the itemsets as words over an ordered alphabet, and state this problem as a formal series over the counting semiring (N,+,x,0,1), whose range constitutes the itemsets and whose coefficients are their supports. This formalism offers many advantages in both fundamental and practical aspects: the introduction of a clear and unified theoretical framework through which we can express the main FI approaches, the possibility of their generalization to mine other, more complex objects, and their incrementalization and/or parallelization. In practice, we explain how this problem can be seen as one of word recognition by an automaton, allowing an efficient implementation in O(|Q|) space and O(|FL||Q|) time, where Q is the set of states of the automaton used for representing the data, and FL the set of prefix-maximal FIs.
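In our own notation (the paper's symbols may differ), the encoding described above can be written as a formal series over the counting semiring whose coefficients are the supports:

```latex
S \;=\; \sum_{X \in \mathrm{FI}} \operatorname{supp}(X)\, w_X ,
\qquad \operatorname{supp}(X) \in \mathbb{N},
```

where $w_X$ is the word obtained by listing the items of the itemset $X$ in alphabet order, so that recognizing the words of the series by an automaton solves the mining problem.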

Short Papers
Paper Nr: 28
Title:

Exploiting Linked Data Towards the Production of Added-Value Business Analytics and Vice-versa

Authors:

Eleni Fotopoulou, Panagiotis Hasapis, Anastasios Zafeiropoulos, Dimitris Papaspyros, Spiros Mouzakitis and Norma Zanetti

Abstract: The majority of enterprises are in the process of recognizing that business data analytics have the potential to transform their daily operations and make them extremely effective at addressing business challenges, identifying new market trends and embracing new ways to engage customers. Such analytics are in most cases related to the processing of data coming from various data sources that include structured and unstructured data. In order to gain insight through the analysis results, appropriate input has to be provided, which in many cases has to combine data from cross-sectorial and heterogeneous public or private data sources. Thus, there is an inherent need to apply novel techniques in order to harvest complex and heterogeneous datasets, turn them into insights and make decisions. In this paper, we present an approach for the production of added-value business analytics through the consumption of interlinked versions of data and the exploitation of linked data principles. Such interlinked data constitute valuable input for the initiation of an analytics extraction process and can lead to the realization of analyses that were not envisaged in the past. In addition to the production of analytics based on the consumption of linked data, the proposed approach supports the interlinking of the produced results with the associated input data, increasing in this way the value of the produced data and making them discoverable for further use in the future. The designed business analytics and data mining component is described in detail, along with an indicative application scenario combining data from the governmental, societal and health sectors.

Paper Nr: 39
Title:

Visual-CBIR: Platform for Storage and Effective Manipulation of a Database Images

Authors:

Kaouther Zekri, Amel Grissa Touzi and Noureddine Ellouze

Abstract: Today, image retrieval systems have become a vital necessity for computing users. Different search systems are increasingly invading the software market, such as QBIC, Photobook and BlobWord. The negative points these systems have in common are their lack of semantics in query processing, low interactivity and the irrelevance of search results. To overcome these limitations, we propose a more efficient alternative system for image retrieval. This new system provides intelligent search by content, by keyword, by index, etc. To validate our approach, we have defined a combination with the Oracle DBMS that leads to 1) advanced modeling of the image type using a signature that describes the physical and semantic content of images, 2) the modeling of different types of search by creating stored procedures in the PL/SQL language and 3) simple storage and handling of images in the database through an intuitive interface. We show that this system can be used in a distributed environment.

Paper Nr: 51
Title:

Preserving Prediction Accuracy on Incomplete Data Streams

Authors:

Olivier Parisot, Yoanne Didry, Thomas Tamisier and Benoît Otjacques

Abstract: The model tree is a useful and convenient method for predictive analytics in data streams, combining the interpretability of decision trees with the efficiency of multiple linear regressions. However, missing values within data streams are a crucial issue in many real-world applications. Often, this issue is solved by pre-processing techniques applied prior to the training phase of the model. In this article we propose a new method that estimates and adjusts missing values before the model tree is created. A prototype has been developed, and experimental results on several benchmarks show that the method improves the accuracy of the resulting model tree.
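A minimal sketch of the general idea - estimating a missing value from the stream seen so far, before instances reach the model tree learner - might look like this. Running-mean imputation is our stand-in; the paper's actual estimation and adjustment method may differ:

```python
def impute_stream(stream):
    """Replace missing values (None) in a stream of numeric instances
    with the running mean of that attribute seen so far. This is a
    simple stand-in for the estimation step the abstract describes."""
    sums, counts = {}, {}
    for instance in stream:
        out = []
        for i, v in enumerate(instance):
            if v is None:
                # estimate from history; defaults avoid division by zero
                v = sums.get(i, 0.0) / counts.get(i, 1)
            sums[i] = sums.get(i, 0.0) + v
            counts[i] = counts.get(i, 0) + 1
            out.append(v)
        yield out
```

Because the statistics are updated incrementally, the generator processes the stream in a single pass, as a streaming setting requires.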

Posters
Paper Nr: 18
Title:

Structuring Documents from Short Texts - The Enel SpA Case Study

Authors:

Silvia Calegari and Matteo Dominoni

Abstract: Nowadays, structured documents are marked up using XML, the W3C standard that allows the meaning of a document's stored content to be expressed through the definition of its logical structure. A logical structure can be exploited to provide focused access to structured documents. For instance, in XML Information Retrieval, the logical structure is used to retrieve the most relevant fragments within documents as answers to queries, instead of whole documents. The problem arises when it is not possible to automatically derive the logical structure of a document using the methodologies presented in the literature. This position paper considers this situation and provides a possible solution adopted at the Enel SpA energy company.

Area 2 - Data Management and Quality

Full Papers
Paper Nr: 17
Title:

Determining Top-K Candidates by Reverse Constrained Skyline Queries

Authors:

Ruei Sian Jheng, En Tzu Wang and Arbee L. P. Chen

Abstract: Given a set of criteria, an object o is defined to dominate another object o' if o is no worse than o' in each criterion and has a better outcome in at least one criterion. A skyline query returns each object that is not dominated by any other object. Consider the following scenario. Given three types of datasets - residents in a city, existing restaurants in the city, and candidate places for opening new restaurants in the city - where each restaurant and candidate place has its respective rank on a set of criteria, e.g., convenience of parking, we want to find the top-k candidate places that have the most potential customers. The potential customers of a candidate place are defined as the number of residents whose distance to the candidate is no larger than a given distance r and who also regard the candidate as one of their skyline restaurants. In this paper, we propose an efficient method based on the quad-tree index and use four pruning strategies to solve this problem. A series of experiments are performed to compare the proposed method with a straightforward method using the R-tree index. The experimental results demonstrate that the proposed method is very efficient and the pruning strategies very powerful.
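The dominance relation defined at the start of the abstract translates directly into code; a naive skyline computation over rank vectors (assuming lower ranks are better - the paper's orientation may be the opposite, and all names are ours) is:

```python
def dominates(a, b):
    """a dominates b: no worse in every criterion (lower is better here)
    and strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def skyline(points):
    """Naive O(n^2) skyline: keep each point not dominated by any other.
    Index-based methods like the paper's quad-tree approach avoid
    most of these pairwise comparisons."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```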

Paper Nr: 64
Title:

Extended Techniques for Flexible Modeling and Execution of Data Mashups

Authors:

Pascal Hirmer, Peter Reimann, Matthias Wieland and Bernhard Mitschang

Abstract: Today, a multitude of highly-connected applications and information systems hold, consume and produce huge amounts of heterogeneous data. The overall amount of data is even expected to dramatically increase in the future. In order to conduct, e.g., data analysis, visualizations or other value-adding scenarios, it is necessary to integrate specific, relevant parts of data into a common source. Due to oftentimes changing environments and dynamic requests, this integration has to support ad-hoc and flexible data processing capabilities. Furthermore, an iterative and explorative trial-and-error integration based on different data sources has to be possible. To cope with these requirements, several data mashup platforms have been developed in the past. However, existing solutions are mostly non-extensible, monolithic systems or applications with many limitations regarding the mentioned requirements. In this paper, we introduce an approach that copes with these issues (i) by the introduction of patterns to enable decoupling from implementation details, (ii) by a cloud-ready approach to enable availability and scalability, and (iii) by a high degree of flexibility and extensibility that enables the integration of heterogeneous data as well as dynamic (un-)tethering of data sources. We evaluate our approach using runtime measurements of our prototypical implementation.

Short Papers
Paper Nr: 27
Title:

Facts Collection and Verification Efforts

Authors:

Rizwan Mehmood and Hermann Maurer

Abstract: Geographic web portals and geospatial databases have recently been emerging on the web, offering information about countries and places in the world. Digital content is increasing at a staggering rate due to community collaboration and the integration of information from webcams and sensors. As in the case of Wikipedia, some geospatial databases allow everyone to edit the content. We cannot ignore the role of wikis and geospatial databases, particularly Wikipedia, Wikicommons, GeoNames, etc., as they have replaced traditional encyclopedias and empower information seekers by providing information at their doorstep. However, there is no guarantee of the validity and authenticity of the information they provide. The reason is that very little attention has been given to verifying information before publishing it on the Web. Also, to find particular information about countries, web users, mainly teachers, students and tourists, rely on search engines such as Google, which often point to Wikipedia. We identify some inconsistencies in online facts such as area, city and mountain rankings using multiple data sources. Our investigations reveal the need for a reliable geographic web portal which can be used for learning and other purposes. We explain how we devised a mechanism for collecting and verifying different facts. Our attempt to provide a reliable geographic web portal has resulted in a comprehensive collection covering a wide range of information aspects, such as culture, geography and economy, that are associated with a country. We also describe our approach to measuring the reliability of geographic facts such as area, city and mountain rankings for all countries.

Paper Nr: 31
Title:

Database Architectures: Current State and Development

Authors:

Jaroslav Pokorny and Karel Richta

Abstract: The paper briefly presents the history and development of database management tools in the last decade. The movement towards higher database performance and scalability is discussed in the context of practical requirements. These include Big Data and Big Analytics as driving forces that, together with progress in hardware development, have led to new DBMS architectures. We describe and evaluate these architectures mainly in terms of their scalability. We also focus on the usability of these architectures, which depends strongly on the application environment. Finally, we mention more complex software stacks containing tools for the management of real-time analysis and intelligent processes.

Paper Nr: 52
Title:

Service for Data Retrieval via Persistent Identifiers

Authors:

Vasily Bunakov

Abstract: Persistent identifiers for research data citation have become commonplace, yet current practices for minting them need evaluation to see how the cited data can actually be discovered, contextualized and processed in scalable eInfrastructures that serve both human users and machine agents. The existing means of dereferencing data identifiers can be used for basic data contextualization, but more sophisticated contextualization services are required to make data readily available for automated retrieval and processing. This work takes a look at data identifier minting practices and discusses a possible design of a service for the machine-assisted or fully automated retrieval of formally citable data.

Paper Nr: 58
Title:

Towards a Context-aware Framework for Assessing and Optimizing Data Quality Projects

Authors:

Meryam Belhiah, Mohammed Salim Benqatla, Bouchaïb Bounabat and Saïd Achchab

Abstract: This paper presents an approach to clearly identify the opportunities for increased monetary and non-monetary benefits from improved Data Quality, within an Enterprise Architecture context. The aim is to measure, in a quantitative manner, how key business processes help to execute an organization’s strategy, and then to qualify the benefits as well as the complexity of improving the data that these processes consume and produce. These findings allow data quality improvement projects to be clearly identified, based on their benefits to the organization and their costs of implementation. To facilitate the understanding of this approach, a Java EE web application has been developed and is presented here.

Paper Nr: 60
Title:

An Ontology for Representing and Extracting Knowledge Starting from Open Data of Public Administrations

Authors:

Patrizia Agnello and Silvia Ansaldi

Abstract: As proposed by the European Commission through the institution of Europe’s Digital Agenda, the Italian Digital Agenda has promoted the publication and use of Open Data (OD) owned by the Public Administration (PA), providing appropriate guidelines for describing and managing Open Data under criteria of transparency, usability and accessibility. Inail (the Italian Institute for Insurance against Accidents at Work) has recently started to periodically publish OD related to occupational accidents and diseases. The Inail OD, as currently defined, are mainly suitable for statistical studies, but together with the vocabulary and the typological tables that explicitly describe some codes in the dataset, it has been possible to develop a conceptual model which may be linked to schemas provided by other PAs to achieve interoperability among those data through their semantics. Starting from the OD on occupational accidents, an ontology has been developed to capture the meaning (semantics) of the information, both contained in the OD available from Inail and defined in the vocabulary. The formalism of the ontology adopted for representing the knowledge, in terms of concepts and their relations, ensures a shared common language, avoiding misinterpretation and misunderstanding.

Paper Nr: 67
Title:

Task Clustering on ETL Systems - A Pattern-Oriented Approach

Authors:

Bruno Oliveira and Orlando Belo

Abstract: Usually, data warehouse populating processes are data-oriented workflows composed of dozens of granular tasks that are responsible for the integration of data coming from different data sources. Specific subsets of these tasks can be grouped into a collection together with their relationships in order to form higher-level constructs. Increasing task granularity allows for the generalization of processes, simplifying their views and providing methods to carry expertise over to new applications. Well-proven practices can be used to describe general solutions that use basic skeletons configured and instantiated according to a set of specific integration requirements. Patterns can be applied to ETL processes aiming not only to simplify a possible conceptual representation but also to reduce the gap that often exists between the two design perspectives. In this paper, we demonstrate the feasibility and effectiveness of an ETL pattern-based approach using task clustering, analyzing a real-world ETL scenario through the definition of two commonly used clusters of tasks: a data lookup cluster and a data conciliation and integration cluster.

Posters
Paper Nr: 11
Title:

Database Evolution for Software Product Lines

Authors:

Kai Herrmann, Jan Reimann, Hannes Voigt, Birgit Demuth, Stefan Fromm, Robert Stelzmann and Wolfgang Lehner

Abstract: Software product lines (SPLs) allow creating a multitude of individual but similar products based on one common software model. Software components can be developed independently and new products can be generated easily. Inevitably, software evolves, a new version has to be deployed, and the data already existing in the database has to be transformed accordingly. As independently developed components are compiled into an individual SPL product, the local evolution script of every involved component has to be weaved into a single global database evolution script for the product. In this paper, we report on the database evolution toolkit DAVE in the context of an industry project. DAVE solves the weaving problem and provides a feasible solution for database evolution in SPLs.

Paper Nr: 14
Title:

A Visual Technique to Assess the Quality of Datasets - Understanding the Structure and Detecting Errors and Missing Values in Open Data CSV Files

Authors:

Paulo Da Silva Carvalho, Patrik Hitzelberger, Fatma Bouali and Gilles Venturini

Abstract: Nowadays, more and more information is flowing in and being provided on the Web. Large datasets are made available covering many fields and sectors. Open Data (OD) plays an important role in this field. Thanks to the volume and variety of the released datasets, OD brings high societal and business potential. In order to realize this potential, the reuse of the datasets (e.g. in internal business processes) becomes primordial. However, if the aim is to reuse OD, it is also necessary to be able to assess its quality. This paper demonstrates how Information Visualization can help with this task and presents the Stacktab chart - a new chart to analyse and assess CSV files in order to understand their structure, identify the location of relevant information and detect possible problems in the datasets.

Paper Nr: 38
Title:

Data Quality Assessment of Company’s Maintenance Reporting: A Case Study

Authors:

Manik Madhikermi, Sylvain Kubler, Andrea Buda and Kary Främling

Abstract: Businesses are increasingly using their enterprise data for strategic decision-making activities. In fact, information, derived from data, has become one of the most important tools for businesses to gain competitive edge. Data quality assessment has become a hot topic in numerous sectors and considerable research has been carried out in this respect, although most of the existing frameworks often need to be adapted with respect to the use case needs and features. Within this context, this paper develops a methodology for assessing the quality of enterprises' daily maintenance reporting, relying both on an existing data quality framework and on a Multi-Criteria Decision Making (MCDM) technique. Our methodology is applied in cooperation with a Finnish multinational company in order to evaluate and rank different company sites/office branches (carrying out maintenance activities) according to the quality of their data reporting. Based on this evaluation, the industrial partner wants to establish new action plans for enhanced reporting practices.

Paper Nr: 49
Title:

Integrated Smart Home Services and Smart Wearable Technology for the Disabled and Elderly

Authors:

Ayse Tuna, Resul Daş and Gurkan Tuna

Abstract: The smart home is a broad concept which includes the techniques and systems applied to living spaces. While its main goal is to reduce the consumption of energy, it provides many benefits, including comfort, security and increased flexibility. Smart homes are achieved through networking, control and automation technologies. Since smart homes offer more comfort and security and provide novel innovative services, people with disabilities and the elderly can take advantage of them and improve their quality of life. However, such novel services require an analytical infrastructure which can manage the overall data flow provided by various sensors, understand anomalous behaviour, and make the necessary decisions. In this study, for efficient data handling and visualisation, an integrated smart service approach based on the use of a smart vest is proposed. The smart vest plays a key role in the proposed system since it provides the main health parameters of the monitored person to the smart home service and enables tracking of the monitored person’s location. The proposed system offers many benefits to both people with disabilities and the elderly and their families in terms of increased efficiency of the health care service and comfort for the monitored person. It can also reduce the cost of health care services by reducing the number of periodical visits.

Paper Nr: 54
Title:

Linking Library Data for Quality Improvement and Data Enrichment

Authors:

Uldis Bojārs, Artūrs Žogla and Elita Eglīte

Abstract: Dataset interlinking holds the potential for data quality improvement and data enrichment as demonstrated by the Linked Open Data project. This paper explores the library domain characterized by carefully curated datasets that require high quality standards. It presents the results of an experiment in dataset quality improvement and data enrichment conducted by linking library datasets and analysing the results. The experiment was performed using subject authority files from the National Library of Latvia and the Library of Congress. The paper concludes by discussing how Linked Data can be used for data enrichment.

Paper Nr: 66
Title:

Open Science - Practical Issues in Open Research Data

Authors:

Robert Viseur

Abstract: The term “open data” refers to information that has been made technically and legally available for reuse. In our research, we focus on the particular case of open research data. We conducted a literature review in order to determine the motivations for releasing open research data and the issues related to its development. Our research allowed us to identify seven motivations for researchers to open research data and to discuss seven issues. The paper highlights the lack of data infrastructure dedicated to open research data and the need to develop researchers’ ability to animate online communities.

Area 3 - Ontologies and the Semantic Web

Short Papers
Paper Nr: 32
Title:

A Model for Digital Content Management

Authors:

Filippo Eros Pani, Simone Porru and Simona Ibba

Abstract: Digital libraries work in complex and heterogeneous scenarios. The quantity and diversity of resources, together with the plurality of agents involved in this context and the continuous evolution of user-generated content, require knowledge to be organized formally and flexibly. In our work, we propose a library management system - which specifically addresses the Italian context - based on the creation of a metadata taxonomy that analyses the existing management standards, connects them, and associates them with multimedia content, through a comparison with popular metadata standards employed for User-Generated Content. The approach is based on the conviction that cultural heritage should be managed in the most flexible way, through the use of open data and open standards that promote knowledge interoperability and exchange. Our management model for the proposed metadata aims to be a useful instrument for the greater sharing of knowledge in a logic of reuse.

Paper Nr: 53
Title:

Automatic Generation of Concept Maps based on Collection of Teaching Materials

Authors:

Aliya Nugumanova, Madina Mansurova, Ermek Alimzhanov, Dmitry Zyryanov and Kurmash Apayev

Abstract: The aim of this work is to demonstrate the usefulness and efficiency of statistical methods of text processing for the automatic construction of concept maps of a pre-determined domain. The statistical methods considered in this paper are based on the analysis of the co-occurrence of terms in the domain documents. To perform such an analysis, in the first step we construct a term-document frequency matrix, on the basis of which we can estimate the correlation between terms and the designed domain. In the second step we go from the term-document matrix to the term-term matrix, which allows us to estimate the correlation between pairs of terms. This approach allows us to define the links between concepts as the pairs with the highest correlation values. In the third step, we summarize the obtained information, identifying concepts as nodes and links as edges, and construct the concept map as the resulting graph.
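The two matrix steps described above can be sketched as follows, with toy values (the actual weighting and correlation measure used in the paper may differ):

```python
import numpy as np

# Step 1: a toy term-document frequency matrix,
# rows = terms, columns = documents (illustrative values only).
td = np.array([[2, 0, 1],
               [1, 1, 0],
               [0, 3, 1]])

# Step 2: derive the term-term matrix; entry (i, j) measures
# how often terms i and j co-occur across the documents.
tt = td @ td.T

# Step 3: pairs with the highest off-diagonal values become the
# links (edges) between concept nodes in the resulting concept map.
pairs = [(i, j, int(tt[i, j]))
         for i in range(tt.shape[0]) for j in range(i + 1, tt.shape[1])]
links = sorted(pairs, key=lambda p: -p[2])
```

Raw co-occurrence counts are used here for brevity; a normalized correlation (e.g. cosine similarity between term rows) would avoid favouring frequent terms.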

Paper Nr: 63
Title:

Efficient Exploration of Linked Data Cloud

Authors:

Jakub Klímek, Martin Nečaský, Bogdan Kostov, Miroslav Blaško and Petr Křemen

Abstract: As the size of semantic data available as Linked Open Data (LOD) increases, the demand for methods for the automated exploration of data sets grows as well. A data consumer needs to search for data sets matching their interests and look into them using suitable visualization techniques to check whether the data sets are useful or not. In recent years, particular advances have been made in the field, e.g., automated ontology matching techniques and LOD visualization platforms. However, an integrated approach to LOD exploration is still missing. On the scale of the whole web, current approaches allow a user to discover data sets using keywords or manually through large data catalogs. Existing visualization techniques presume that a data set is of an expected type and structure. The aim of this position paper is to show the need for time- and space-efficient techniques for the discovery of previously unknown LOD data sets on the basis of a consumer’s interest, and for their automated visualization, which we address in our ongoing work.

Posters
Paper Nr: 10
Title:

The Use of Extensible Markup Language (XML) to Analyse Medical Full Text Repositories – An Example from Homeopathy

Authors:

Thomas Ostermann, Marc Malik and Christa Raak

Abstract: Extensible Markup Language (XML) is one of the most popular web languages in the life sciences, used for semantic data analysis in various fields of clinical research. One of these fields is the processing of medical full texts. Extracting meaningful information from natural texts is one of the challenges when dealing with huge text repositories. We present an application of XML together with linguistic algorithms for processing texts from a homeopathic materia medica. Our approach not only enables the user to search within the symptom descriptions but also offers special features such as sequential search within the results or the comparison of homeopathic remedies. However, the user demands of day-to-day practice and the constraints of information technology both have to be taken carefully into account to further develop this prototype.

Paper Nr: 19
Title:

TORCIA - A Decision-support Collaborative Platform for Emergency Management

Authors:

Chiara Francalanci and Paolo Giacomazzi

Abstract: The TORCIA platform has been developed as part of a project funded by the Lombardy Region. The main goal of the project is the development of a tool that leverages social media in emergency management processes. With a continuous and real-time collection of information from social media, TORCIA can detect situations of potential emergency and identify their geographical position. The TORCIA platform supports emergency operators from different organizations and institutions with a decision-support dashboard and favors the creation of a collaborative process combining the contributions of citizens and institutions by means of a mobile app that is also integrated within the platform.

Paper Nr: 20
Title:

RDF Resource Search and Exploration with LinkZoo

Authors:

Marios Meimaris, George Alexiou, Katerina Gkirtzou, George Papastefanatos and Theodore Dalamagas

Abstract: The Linked Data paradigm is the most common practice for publishing, sharing and managing information in the Data Web. LinkZoo is an IT infrastructure for collaborative publishing, annotating and sharing of Data Web resources, and their publication as Linked Data. In this paper, we overview LinkZoo and its main components, and we focus on the search facilities provided to retrieve and explore RDF resources. Two search services are presented: (1) an interactive, two-step keyword search service, where live natural language query suggestions are given to the user based on the input keywords and the resource types they match within LinkZoo, and (2) a keyword search service for exploring remote SPARQL endpoints that automatically generates a set of candidate SPARQL queries, i.e., SPARQL queries that try to capture the user's information needs as expressed by the keywords used. Finally, we demonstrate the search functionalities through a use case drawn from the life sciences domain.

Area 4 - Databases and Data Security

Full Papers
Paper Nr: 12
Title:

Hypergraph-based Access Control Using Formal Language Expressions - HGAC

Authors:

Alexander Lawall

Abstract: In all organizations, access assignments are essential in order to ensure data privacy, permission levels and the correct assignment of tasks. Traditionally, such assignments are based on total enumeration, with the consequence that constant effort has to be put into maintaining the assignments. This problem still persists when using abstraction layers such as group and role concepts, e.g. the Access Control Matrix and Role-Based Access Control: role and group memberships are statically defined, and members have to be added and removed constantly. This paper describes a novel approach - Hypergraph-Based Access Control (HGAC) - to assign human and automatic subjects to access rights in a declarative manner. The approach is based on an organizational (meta-) model and a declarative language. The language is used to express queries and formulate predicates. Queries define sets of subjects based on their properties and their position in the organizational model. They also contain additional information that causes organizational relations to be active or inactive depending on predicates. In HGAC, the subjects that have a specific permission are determined by such a query. The query itself is not defined statically but created by traversing a hypergraph path. This allows a structured aggregation of permissions on resources. Consequently, multiple resources can share parts of their queries.
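The contrast between enumeration and declarative assignment can be illustrated in a few lines. The organizational model, attribute names and query style below are our own simplification, not HGAC's actual language:

```python
# Illustrative sketch only: subjects are selected by a declarative predicate
# over their properties, so no membership list has to be maintained.
subjects = [
    {"name": "alice", "role": "clerk",   "dept": "finance"},
    {"name": "bob",   "role": "manager", "dept": "finance"},
    {"name": "carol", "role": "clerk",   "dept": "sales"},
]

def query(predicate):
    """Evaluate a declarative query: the subject set is computed from
    properties at evaluation time instead of being enumerated."""
    return {s["name"] for s in subjects if predicate(s)}

# A permission on a resource is defined by a query, not by enumeration:
can_approve = lambda s: s["dept"] == "finance" and s["role"] == "manager"
print(query(can_approve))  # {'bob'}

# A new hire matching the predicate gains the permission automatically,
# with no assignment to maintain.
subjects.append({"name": "dave", "role": "manager", "dept": "finance"})
print(query(can_approve))  # {'bob', 'dave'}
```

In HGAC itself the query is additionally assembled by traversing a hypergraph path, which is what lets multiple resources share query fragments.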

Paper Nr: 21
Title:

NoSQL Storage for Configurable Information System - The Conceptual Model

Authors:

Sergei Kucherov, Yuri Rogozov and Alexander Sviridov

Abstract: The basis for creating configurable information systems and their data warehouses is the use of a single form of representation - the base abstraction for action synthesis (BAfAS). BAfAS describes an information system as the user's activity rather than as individual components. A configurable information system's data storage should therefore focus on storing not isolated facts, but types of user actions; executing these action types in turn produces the facts. The purpose of this paper is to propose a model of configurable information system storage that is adequate to the model of the system, together with a conceptual model of the storage described with NoSQL technologies.

Paper Nr: 22
Title:

A Benchmark for Online Non-blocking Schema Transformations

Authors:

Lesley Wevers, Matthijs Hofstra, Menno Tammens, Marieke Huisman and Maurice van Keulen

Abstract: This paper presents a benchmark for measuring the blocking behavior of schema transformations in relational database systems. As a basis for our benchmark, we have developed criteria for the functionality and performance of schema transformation mechanisms based on the characteristics of state of the art approaches. To address limitations of existing approaches, we assert that schema transformations must be composable while satisfying the ACID guarantees like regular database transactions. Additionally, we have identified important classes of basic and complex relational schema transformations that a schema transformation mechanism should be able to perform. Based on these transformations and our criteria, we have developed a benchmark that extends the standard TPC-C benchmark with schema transformations, which can be used to analyze the blocking behavior of schema transformations in database systems. The goal of the benchmark is not only to evaluate existing solutions for non-blocking schema transformations, but also to challenge the database community to find solutions that allow more complex transactional schema transformations.

Short Papers
Paper Nr: 41
Title:

Blockchain-based Model for Social Transactions Processing

Authors:

Idrissa Sarr, Hubert Naacke and Ibrahima Gueye

Abstract: The goal of this work in progress is to handle transactions of social applications by using their access classes. Typically, social users simultaneously access a small piece of data owned by one user or a few users. For instance, a new post of a Facebook user can trigger reactions from most of his/her friends, and each of these reactions relates to the same data. Thus, grouping or chaining transactions that require the same access classes may significantly reduce response time, since several transactions are executed in one shot, while ensuring consistency and minimizing the number of accesses to the persistent data storage. With this insight, we propose a middleware-based transaction scheduler that uses various strategies to chain transactions based on their access classes. The key novelties lie in (1) our distributed transaction scheduling devised on top of a ring to ensure communication when chaining transactions and (2) our ability to deal with multi-partition transactions. The scheduling phase is based on the Blockchain principle, which in our context means recording all transactions requiring the same access class in a master list in order to ensure consistency and to plan their processing efficiently. We designed and simulated our approach using SimJava, and preliminary experiments show promising results.
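The chaining idea can be sketched as follows. The data model and names are ours, not the paper's, and the sketch ignores the ring-based distribution and multi-partition handling; it only shows why a chain needs a single storage round-trip:

```python
from collections import defaultdict

def chain_by_access_class(transactions):
    """Group pending transactions by the access class they require."""
    chains = defaultdict(list)
    for txn in transactions:
        chains[txn["access_class"]].append(txn)
    return chains

def execute_chain(store, access_class, chain):
    """Execute a whole chain against one in-memory copy of the data item,
    then persist once: a single storage round-trip for many transactions."""
    state = store[access_class]          # one read from persistent storage
    for txn in chain:
        state = txn["op"](state)         # apply each transaction in order
    store[access_class] = state          # one write back

store = {"wall:alice": []}
txns = [
    {"access_class": "wall:alice", "op": lambda w: w + ["post by alice"]},
    {"access_class": "wall:alice", "op": lambda w: w + ["comment by bob"]},
]
for ac, chain in chain_by_access_class(txns).items():
    execute_chain(store, ac, chain)
print(store["wall:alice"])  # ['post by alice', 'comment by bob']
```

Because every transaction in a chain touches the same access class, executing the chain sequentially against one copy preserves consistency without per-transaction locking.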

Paper Nr: 48
Title:

Non-consent Data Retrieval While using Web or Email Services

Authors:

Jose Martinez Rivera, Luis Medina Ramos and Amir H. Chinaei

Abstract: Users' private data might be secretly retrieved for or against them every time they browse the web or use email services. This study addresses how these services can secretly retrieve such data. While users may appreciate these techniques as a means of protection from hacking, fraud, etc., they may also have privacy concerns about them. We have implemented two approaches to demonstrate how such data retrieval helps web and email services to properly identify a user. We have also conducted extensive experiments to measure the success rates of our approaches. Results show 86.217% successful identification for web services and 81.1% successful identification of emails that were attempted to be sent anonymously.

Paper Nr: 59
Title:

Resiliency-aware Data Compression for In-memory Database Systems

Authors:

Till Kolditz, Dirk Habich, Patrick Damme, Wolfgang Lehner, Dmitrii Kuvaiskii, Oleksii Oleksenko and Christof Fetzer

Abstract: Nowadays, database systems pursue a main memory-centric architecture, where the entire business-related data is stored and processed in compressed form in main memory. In this case, the performance gain is massive because database operations can benefit from main memory's higher bandwidth and lower latency. However, current main memory-centric database systems utilize general-purpose error detection and correction solutions to address the emerging problem of increasing dynamic error rates in main memory. The costs of these general-purpose methods increase dramatically with increasing error rates. To reduce these costs, we have to exploit context knowledge of database systems for resiliency. Therefore, in this paper we introduce our vision of resiliency-aware data compression, in which we want to exploit the benefits of both fields in an integrated approach with low performance and memory overhead. In detail, we present and evaluate a first approach using AN encoding and two different compression schemes to show the potential and challenges of our vision.
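The AN-encoding building block mentioned in the abstract works by multiplying each value by a constant A, so that valid code words are exactly the multiples of A and most bit flips are detectable. A minimal illustration (the constant is our choice; the paper's contribution is combining such codes with compression):

```python
A = 61  # encoding constant; valid code words are exactly the multiples of A

def an_encode(value):
    """Encode a value as value * A."""
    return value * A

def an_check(code):
    """A code word corrupted by a bit flip is almost never a multiple of A."""
    return code % A == 0

def an_decode(code):
    assert an_check(code), "memory error detected"
    return code // A

code = an_encode(42)
print(an_check(code))            # True
corrupted = code ^ (1 << 3)      # simulate a single-bit memory error
print(an_check(corrupted))       # False: the error is detected
```

The price is that encoded values occupy more bits than the originals, which is exactly the overhead the paper's integration with compression aims to keep low.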

Posters
Paper Nr: 23
Title:

Tree Inclusion Checking Revisited

Authors:

Yangjun Chen and Yibin Chen

Abstract: In this paper, we discuss an efficient algorithm for the ordered tree inclusion problem, which checks whether a pattern tree (forest) P can be embedded in a target tree (forest) T. The time complexity of this algorithm is bounded by O(|T| log DP), where DP is the depth of P, and its space overhead is bounded by O(|T| + |P|). This computational complexity is better than that of any existing algorithm for this problem.
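For reference, the problem itself can be stated as a short naive recursion (exponential in the worst case). This is not the paper's O(|T| log DP) algorithm, only a baseline that makes the inclusion semantics concrete:

```python
# Naive ordered tree inclusion: can the pattern forest be embedded in the
# target forest, preserving labels, ancestorship and the left-to-right
# order of siblings?
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def forest_included(pattern, target):
    if not pattern:
        return True
    if not target:
        return False
    p, t = pattern[0], target[0]
    # Option 1: map the root of p onto the root of t; then p's subtree must
    # embed below t, and the remaining pattern trees to the right of t.
    if (p.label == t.label
            and forest_included(p.children, t.children)
            and forest_included(pattern[1:], target[1:])):
        return True
    # Option 2: skip t's root and descend into its children.
    return forest_included(pattern, t.children + target[1:])

T = Node("a", [Node("b"), Node("c", [Node("b"), Node("d")])])
P = Node("a", [Node("b"), Node("d")])
print(forest_included([P], [T]))   # True: a(b, d) embeds in a(b, c(b, d))

P2 = Node("a", [Node("d"), Node("b")])
print(forest_included([P2], [T]))  # False: sibling order is violated
```

The paper's algorithm reaches the same answers while visiting the target essentially once, with O(|T| + |P|) space.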